Bayesian DL & UQ Part 4: Calibration and Conformal Prediction
Bayesian Deep Learning and Uncertainty Quantification — A Journey Toward Trustworthy AI
This is Part 4 of an 8-part series.
- Part 0: Beyond Predictions — Why Uncertainty Matters
- Part 1: The Language of Bayesian Inference
- Part 2: The Art of Approximation — From Variational to Ensemble
- Part 3: The Anatomy of Uncertainty — Aleatoric vs. Epistemic
- Part 4: Calibration and Conformal Prediction (this post)
- Part 5: Single-Pass UQ — Evidential Deep Learning and Distance-Aware Methods
- Part 6: UQ in Science — Molecules, Proteins, and Materials
- Part 7: The Future of UQ — Uncertainty in the Age of Foundation Models
Hook: What Does “90% Confident” Actually Mean?
Your model claims to be “90% confident” about a particular molecule’s activity. How should you interpret that number?
Intuitively, it should mean that when the model makes 100 predictions at 90% confidence, roughly 90 of them should be correct. This is the essence of calibration. Yet real-world deep learning models systematically violate this expectation.
- ResNet-110 (CIFAR-100): Among predictions where the model shows confidence above 90%, the actual accuracy is only about 72% (Guo et al., 2017)
- In drug discovery, compounds recommended with “high confidence” sometimes have a real hit rate no better than random selection
- In medical imaging, a misdiagnosis made with 99% confidence can have devastating consequences
In Parts 2 and 3, we learned how to approximate the posterior and decompose uncertainty. But now it is time to confront the most uncomfortable truth of this series: having an uncertainty estimate is an entirely different matter from having a trustworthy one. Poorly calibrated uncertainty can be more dangerous than having no uncertainty at all — because it creates a false sense of security.
This post addresses two core topics:
- Calibration: How to measure and correct whether a model’s confidence aligns with its actual accuracy
- Conformal Prediction: A groundbreaking framework that constructs statistically valid prediction sets without any assumptions about the model or data distribution
1. What Is Calibration?
1.1 Definition: Agreement Between Confidence and Accuracy
The mathematical definition of calibration is remarkably simple:
\[P(\hat{Y} = Y \mid \hat{P} = p) = p, \quad \forall\, p \in [0, 1]\]

- $\hat{Y}$: the model’s predicted class
- $Y$: the true class
- $\hat{P}$: the confidence assigned by the model (maximum softmax output)
In plain terms:
- Collect all predictions where the model is 70% confident — among those, 70% should be correct
- Collect all predictions where the model is 95% confident — among those, 95% should be correct
- When this holds at every confidence level, the model is perfectly calibrated
This property is defined identically for regression. If the model assigns 95% probability to the prediction interval $[\hat{y} - 2\sigma, \hat{y} + 2\sigma]$, then the true value should fall within that interval approximately 95% of the time.
1.2 How to Read a Reliability Diagram
The most widely used visual diagnostic for calibration is the reliability diagram:
Accuracy
1.0 | o (perfect)
| o
| o
0.8 | . o
| o
| . o
0.6 | o
| o
| . o
0.4 | o
| o
|o
0.2 |
|
|
0.0 +----+----+----+----+----+----+----+----+----+----+
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Confidence
Legend: o = perfect calibration (diagonal)
. = typical modern deep network (below diagonal = overconfident)
How to interpret it:
- On the diagonal (o): perfectly calibrated model. Confidence = Accuracy
- Below the diagonal (.): overconfident model. Actual accuracy is lower than the model’s confidence
- Above the diagonal: underconfident model. The model is more accurate than it claims to be
Modern deep learning models almost universally fall below the diagonal — that is, they are systematically overconfident.
1.3 Why Modern Deep Learning Is Overconfident
The key finding of Guo et al. (2017) is this: Model accuracy has improved dramatically over recent years, but calibration has actually gotten worse. The causes of this paradox include:
- Model depth: As networks grow deeper, logit magnitudes tend to increase. Since softmax responds exponentially to differences in logits, even a modest increase in logit values can push confidence to extremes
- Batch Normalization: Stabilizes training by normalizing internal activations, but systematically alters logit scales, with hard-to-predict effects on calibration
- Weight decay (L2 regularization): Prevents overfitting and improves accuracy, but does not necessarily improve calibration in terms of NLL. The way weight decay reshapes the loss landscape has counter-intuitive effects on calibration
- Side effects of NLL training: Training with cross-entropy loss minimizes NLL, but this can induce overconfidence: late in training, training NLL keeps decreasing even as validation calibration deteriorates
Key insight: Accuracy and calibration are independent properties. High accuracy does not guarantee calibration, and techniques that boost accuracy can actually degrade calibration. This is the fundamental reason why calibration must be treated as a separate problem.
2. Calibration Metrics
2.1 Expected Calibration Error (ECE)
The most widely used calibration metric. It partitions all predictions into $M$ bins by confidence, then computes the weighted average of the gap between accuracy and confidence in each bin:
\[\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \text{acc}(B_m) - \text{conf}(B_m) \right|\]

- $B_m$: the set of predictions falling in the $m$-th bin
- $|B_m|$: the number of samples in that bin
- $N$: total number of samples
- $\text{acc}(B_m)$: accuracy within the bin
- $\text{conf}(B_m)$: average confidence within the bin
Strengths:
- Intuitive interpretation: “On average, how much do confidence and accuracy disagree?”
- Simple to compute and directly corresponds to the reliability diagram
- Summarizes calibration quality in a single number
Pitfalls:
- Sensitive to bin count: ECE values can differ substantially between $M=10$ and $M=100$. The choice between equal-width bins and equal-mass bins also affects the result
- Cancellation effect: If the model is overconfident in some bins and underconfident in others, these errors can cancel out, yielding a low ECE despite poor calibration overall
A related metric, Maximum Calibration Error (MCE), focuses on the worst-case bin instead of the weighted average. In safety-critical domains, MCE may be more appropriate than ECE.
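A minimal NumPy sketch of the binned computation above, returning both ECE and MCE (this assumes `confidences` holds max-softmax values and `correct` the 0/1 correctness of each prediction, with equal-width bins):

```python
import numpy as np

def ece_mce(confidences, correct, n_bins=10):
    """Expected and Maximum Calibration Error with equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)
    # assign each prediction to one of n_bins equal-width bins on [0, 1]
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue                           # skip empty bins
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / n) * gap          # |B_m|/N weighted gap
        mce = max(mce, gap)                    # worst-case bin
    return ece, mce
```

Recomputing with, say, `n_bins=100` is a quick way to check the bin-count sensitivity mentioned above.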
2.2 Negative Log-Likelihood (NLL)
\[\text{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{p}(y_i \mid \mathbf{x}_i)\]

- Proper scoring rule: Minimizing NLL theoretically induces calibration automatically
- If the model learns the true conditional distribution exactly, NLL is minimized and calibration is achieved
However, practical limitations arise:
- Extreme sensitivity to outliers: If the model assigns $\hat{p} = 0.001$ to a sample that turns out to be correct, $-\log(0.001) \approx 6.9$ has a disproportionate impact on NLL. A single outlier can dominate the entire metric
- Does not measure calibration alone: NLL captures calibration + sharpness simultaneously. A sharp and calibrated model achieves the lowest NLL, but disentangling the two properties is difficult
2.3 Brier Score
\[\text{BS} = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} (\hat{p}_{i,k} - y_{i,k})^2\]

where $y_{i,k}$ is the one-hot encoded true label.
The Brier Score’s key strength — decomposability (Murphy, 1973):
\[\text{BS} = \underbrace{\text{Reliability}}_{\text{calibration error}} - \underbrace{\text{Resolution}}_{\text{discriminative ability}} + \underbrace{\text{Uncertainty}}_{\text{inherent data uncertainty}}\]

- Reliability: The discrepancy between confidence and accuracy at each confidence level — lower is better (= well calibrated)
- Resolution: The degree to which the model assigns different confidence levels to different samples — higher is better (= useful discrimination)
- Uncertainty: The intrinsic uncertainty of the data — beyond the model’s control
This decomposition is powerful because it allows us to diagnose calibration and usefulness separately.
2.4 An Important Caveat: ECE = 0 Does Not Mean a Good Model
Consider the following counterexample:
- Dataset: 90% class A, 10% class B (extreme class imbalance)
- Model strategy: Always output “class A, confidence 90%” for every input
- Result: Every prediction has 90% confidence, and the actual accuracy is also 90% → ECE = 0
This model is perfectly calibrated yet completely useless. It cannot distinguish class B at all. Through the Brier Score decomposition, Reliability = 0 (calibrated), but Resolution is also 0 (no discriminative ability whatsoever).
Lesson: Calibration is a necessary condition, not a sufficient one. Evaluating a model by calibration metrics alone is inadequate — accuracy, resolution, and calibration must be assessed together. The Brier Score decomposition makes this possible.
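The counterexample can be checked numerically with a binned, Murphy-style decomposition. This is a sketch for binary outcomes; the identity BS = Reliability - Resolution + Uncertainty is exact only when all forecasts within a bin coincide:

```python
import numpy as np

def brier_decomposition(probs, outcomes, n_bins=10):
    """Binned Murphy (1973) decomposition for a binary forecaster."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(probs)
    base_rate = outcomes.mean()
    uncertainty = base_rate * (1.0 - base_rate)   # intrinsic data uncertainty
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        w = mask.sum() / n
        p_bar = probs[mask].mean()       # mean forecast in bin
        o_bar = outcomes[mask].mean()    # observed frequency in bin
        reliability += w * (p_bar - o_bar) ** 2      # lower is better
        resolution += w * (o_bar - base_rate) ** 2   # higher is better
    return reliability, resolution, uncertainty
```

For the always-“class A, 90%” model above, both Reliability and Resolution come out near zero: calibrated, but with no discriminative ability.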
3. Post-hoc Calibration Methods
These methods retroactively correct the outputs of an already-trained model without modifying the training process itself. They are surprisingly simple yet effective.
3.1 Temperature Scaling — The Power of a Single Parameter
Core idea: Divide the logits by a temperature $T$ before feeding them into the softmax:
\[\hat{q}_i = \text{softmax}(z_i / T)\]

- $T > 1$: softens the softmax output (lowers confidence) → corrects overconfident models
- $T < 1$: sharpens the softmax output (raises confidence) → corrects underconfident models
- $T = 1$: equivalent to the original model
Training procedure: Find the $T$ that minimizes NLL on the validation set. Since this is a one-dimensional optimization problem, a single call to scipy’s minimize suffices.
Why does this work so well (key finding of Guo et al., 2017):
- A single parameter $T$ is enough to effectively calibrate most modern networks
- More complex methods (bin-wise scaling, matrix scaling) show little additional benefit
- The reason: miscalibration in modern networks primarily stems from the overall scale of the logits. The relative ordering among logits is generally correct, but their absolute magnitudes are too large
An important property: Temperature scaling does not change the predicted class. Since $\arg\max\, \text{softmax}(z/T) = \arg\max\, \text{softmax}(z)$, accuracy is preserved while only calibration improves.
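A minimal sketch of fitting $T$, assuming `val_logits` (shape N × K) and `val_labels` come from a held-out validation split:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def fit_temperature(val_logits, val_labels):
    """Fit the single scalar T by minimizing validation NLL."""
    val_logits = np.asarray(val_logits, dtype=float)
    val_labels = np.asarray(val_labels, dtype=int)

    def nll(T):
        logp = log_softmax(val_logits / T)
        return -logp[np.arange(len(val_labels)), val_labels].mean()

    # one-dimensional optimization over a positive bracket
    res = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return res.x
```

At test time, predict with softmax($z/T$); the argmax, and hence accuracy, is unchanged.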
3.2 Platt Scaling
Originally proposed for producing probabilistic outputs from SVMs (Platt, 1999), adapted here for neural networks:
\[\hat{q} = \sigma(az + b)\]

- $a$, $b$: two parameters learned on the validation set
- Primarily used for binary classification
- Multi-class extension: class-specific $a_k, b_k$ → $2K$ parameters
Temperature scaling can be viewed as a special case of Platt scaling ($a = 1/T$, $b = 0$, with the same $a$ for all classes). Guo et al. (2017) showed that the additional degrees of freedom tend to cause overfitting in most cases, actually degrading performance.
3.3 Isotonic Regression — A Nonparametric Approach
Instead of assuming a parametric form, isotonic regression learns a monotonically non-decreasing function $f$ to recalibrate confidence:
- Efficiently solved by the Pool Adjacent Violators (PAV) algorithm
- Strength: makes no functional form assumption
- Weakness: prone to overfitting when the validation set is small. Not consistently better than temperature scaling
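A sketch using scikit-learn’s `IsotonicRegression`, which implements PAV. Here `cal_confidences` are raw confidences and `cal_correct` the 0/1 correctness labels from a calibration split:

```python
from sklearn.isotonic import IsotonicRegression

def fit_isotonic(cal_confidences, cal_correct):
    """Fit a monotone non-decreasing recalibration map via PAV."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(cal_confidences, cal_correct)
    return iso

# Recalibrated confidences at test time: iso.predict(test_confidences)
```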
3.4 Limitations of Post-hoc Calibration
Before Temperature Scaling After Temperature Scaling
Accuracy Accuracy
1.0 | . 1.0 | o
| . | o
| . | o
0.8 | . 0.8 | o
| . | o
| . | o
0.6 | . 0.6 | o
|. |o
| |
0.4 | 0.4 |
+----+----+----+----+----+ +----+----+----+----+----+
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Confidence Confidence
(overconfident: curve below (calibrated: curve matches
the diagonal) the diagonal)
The diagram above illustrates the before-and-after of temperature scaling. A single parameter can produce this dramatic improvement, but fundamental limitations remain:
- Dependence on validation set representativeness: If the calibration set’s distribution differs from the test set, the learned $T$ becomes meaningless. This is fatal under distribution shift
- Global rather than instance-level correction: Temperature scaling applies the same $T$ to every input. It cannot handle cases where high-confidence and low-confidence predictions require different corrections
- Non-trivial extension to regression: Softmax calibration for classification does not directly translate to prediction interval calibration for regression
- Does not address the model’s own limitations: If the model has learned the wrong features, post-hoc calibration merely adjusts surface-level numbers without solving the underlying problem
These limitations naturally lead to the question: Is there a method that provides statistical guarantees without depending on the model’s form or the data distribution?
4. Conformal Prediction — Distribution-Free Guarantees
4.1 A Revolutionary Idea
Conformal prediction is a framework systematized by Vovk, Gammerman & Shafer (2005) that has seen a surge of interest in recent years. Its core idea is as follows:
- Instead of a single point prediction, construct a prediction set $C(\mathbf{x})$
- Provide a finite-sample guarantee on the probability that this prediction set contains the true label
- This guarantee holds for any model, under any data distribution
- This is not an asymptotic guarantee but a finite-sample one
- It does not matter whether the model is a random forest or a deep network
- It does not matter whether the data is Gaussian or follows any other distribution
The only assumption is exchangeability: the calibration data and the test data must be exchangeable. This is automatically satisfied under i.i.d. sampling.
4.2 Split Conformal Algorithm
The most practical and widely used form is split conformal prediction:
Split Conformal Prediction Pipeline
====================================
[Training Data] ──────> [Train Model f]
│
▼
[Calibration Data] [Compute nonconformity scores]
(X_cal, Y_cal) s_i = score(X_i, Y_i, f)
│ │
│ ▼
│ [Sort scores: s_(1) <= ... <= s_(n)]
│ │
│ ▼
│ [Find quantile q_hat]
│ q_hat = s_( ceil((n+1)(1-alpha)) )
│ │
└───────────────────────>│
▼
[New test input X_new] ──> [Prediction set C(X_new)]
C(X_new) = { y : score(X_new, y, f) <= q_hat }
Step-by-step explanation:
- Data splitting: Separate the data into a training set and a calibration set. The model is trained only on the training set
- Nonconformity score computation: For each $(X_i, Y_i)$ in the calibration set, compute a score $s_i$ that measures “how unusual this (input, output) pair is”
- Quantile computation: Sort all scores and find the $\lceil (n+1)(1-\alpha) \rceil / n$ quantile ($\hat{q}$)
- Prediction set construction: For a new input $X_{\text{new}}$, include in the prediction set every $y$ whose score is at most $\hat{q}$
Intuition behind the coverage guarantee:
- Under exchangeability, the score of a new test point is uniformly distributed among all possible ranks relative to the calibration scores (every rank is equally likely)
- Therefore, the probability that the test score falls at or below the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration score is at least $1-\alpha$
- This argument depends neither on the form of the model $f$ nor on the data distribution $P_{X,Y}$
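The quantile step of the pipeline can be sketched in a few lines, assuming the nonconformity scores on the calibration set have already been computed:

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """q_hat = the ceil((n+1)(1-alpha))-th smallest calibration score."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))   # 1-indexed rank
    if k > n:
        return np.inf                          # calibration set too small
    return scores[k - 1]

# Prediction set: { y : score(X_new, y, f) <= conformal_quantile(scores, alpha) }
```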
4.3 Regression: Conformalized Quantile Regression (CQR)
For regression, the most natural nonconformity score is the absolute residual:
\[s_i = |Y_i - \hat{Y}_i|\]

In this case, the prediction interval is:

\[C(X_{\text{new}}) = [\hat{Y}_{\text{new}} - \hat{q},\; \hat{Y}_{\text{new}} + \hat{q}]\]

The limitation of this approach is that it produces intervals of uniform width for all inputs. Some inputs are inherently easy to predict (a narrow interval would suffice) while others are difficult (a wide interval is needed), but this method cannot distinguish between them.
Conformalized Quantile Regression (CQR) (Romano, Patterson & Candes, 2019) solves this problem elegantly:
- Step 1: Train a quantile regression model as the base model. Estimate the conditional quantile functions:
\(\hat{q}_{\alpha/2}(x) \quad \text{and} \quad \hat{q}_{1-\alpha/2}(x)\)
- Step 2: Define the nonconformity score as how far $Y_i$ falls outside the estimated quantile band:
\(s_i = \max\left(\hat{q}_{\alpha/2}(X_i) - Y_i,\; Y_i - \hat{q}_{1-\alpha/2}(X_i)\right)\)
- Step 3: Compute $\hat{q}$ (the score quantile) on the calibration set and construct the prediction interval:
\(C(X_{\text{new}}) = \left[\hat{q}_{\alpha/2}(X_{\text{new}}) - \hat{q},\; \hat{q}_{1-\alpha/2}(X_{\text{new}}) + \hat{q}\right]\)
Strengths of CQR:
- When the base model’s quantile estimates are good, intervals have adaptive widths that vary with the input
- Even when the base model’s quantile estimates are wrong, the conformal correction ($\hat{q}$) still guarantees coverage
- Best case: narrow and accurate intervals. Worst case: wide but still valid intervals
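A sketch of the CQR correction, assuming the base model’s quantile predictions are already available (e.g., from quantile gradient boosting):

```python
import numpy as np

def cqr_interval(q_lo_cal, q_hi_cal, y_cal, q_lo_new, q_hi_new, alpha=0.1):
    """Conformalize base quantile predictions (Romano et al., 2019 style)."""
    q_lo_cal, q_hi_cal, y_cal = map(np.asarray, (q_lo_cal, q_hi_cal, y_cal))
    # CQR nonconformity score: how far y falls outside [q_lo, q_hi]
    scores = np.maximum(q_lo_cal - y_cal, y_cal - q_hi_cal)
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q_hat = np.inf if k > n else np.sort(scores)[k - 1]
    # widen (or tighten, when q_hat < 0) the base quantile interval
    return q_lo_new - q_hat, q_hi_new + q_hat
```

Note that $\hat{q}$ can be negative, in which case the base interval is tightened rather than widened.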
4.4 Classification: Adaptive Prediction Sets (APS)
In classification, conformal prediction outputs a prediction set — a set of possible classes.
Naive approach: Include every class whose softmax output exceeds a threshold
\[C_{\text{naive}}(X) = \{k : \hat{p}_k(X) \geq \tau\}\]

However, this requires the model’s softmax to be well calibrated, and the threshold choice lacks solid theoretical grounding.
Adaptive Prediction Sets (APS) (Romano, Sesia & Candes, 2020):
- Sort the softmax probabilities in descending order
- Use as the score the cumulative probability accumulated, in that order, up to and including the true class:
\(s(X, Y) = \hat{p}_{(1)}(X) + \hat{p}_{(2)}(X) + \cdots + \hat{p}_{(r(Y))}(X)\), where $r(Y)$ is the rank of the true class in the sorted order
- For confident (easy) predictions, the prediction set is small (often a single class)
- For uncertain (hard) predictions, the prediction set grows larger
Intuition behind APS: Saying “this image could be a cat or a tiger” honestly is often more useful for decision-making than saying “this is a cat (confidence 51%).”
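A sketch of the APS score and the resulting prediction set. Here `probs` is a single softmax vector and `q_hat` the conformal quantile of the calibration scores; this is a simplified, non-randomized variant of the published procedure:

```python
import numpy as np

def aps_score(probs, label):
    """Cumulative probability mass needed to include the true class."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs)            # classes, most probable first
    cum = np.cumsum(probs[order])
    rank = int(np.where(order == label)[0][0])
    return cum[rank]

def aps_set(probs, q_hat):
    """Include classes in descending order until mass first reaches q_hat."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, q_hat)) + 1
    return set(order[:min(k, len(probs))].tolist())
```

Easy inputs yield small sets (often a single class); ambiguous inputs yield larger ones, exactly as described above.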
5. Limitations and Extensions of Conformal Prediction
5.1 Marginal Coverage != Conditional Coverage
The coverage guarantee of conformal prediction is marginal:
\[P(Y \in C(X)) \geq 1 - \alpha\]

This means that on average, a fraction $1-\alpha$ of test points will be included in the prediction set. However, coverage is not guaranteed for specific subgroups or specific regions of the input space.
A concrete problem scenario:
- In molecular property prediction, even if overall 90% coverage is achieved:
- Common scaffolds (abundant in training data): 99% coverage (overly conservative → wide intervals)
- Rare scaffolds (scarce in training data): 60% coverage (insufficient coverage → decision errors)
- These two effects cancel out so that marginal coverage reaches 90%, but at the places that actually matter (rare scaffolds) the guarantee breaks down
What we ideally want is conditional coverage:
\[P(Y \in C(X) \mid X = x) \geq 1 - \alpha, \quad \forall x\]

However, distribution-free conditional coverage is theoretically impossible to achieve (Vovk, 2012; Lei & Wasserman, 2014). This is a fundamental limitation of conformal prediction.
5.2 Violation of the Exchangeability Assumption
Practical scenarios where conformal prediction’s sole assumption of exchangeability is violated:
- Time-series data: Stock prices, weather, reaction kinetics — past and future are not exchangeable
- Active learning: Selecting the next data point based on the model’s predictions introduces selection bias that violates exchangeability
- Distribution shift: Data distributions that change over time (concept drift)
- Drug discovery pipelines: During hit-to-lead optimization, the chemical space progressively narrows → early calibration molecules and later test molecules are not exchangeable
Such violations invalidate the coverage guarantee. A promised 90% coverage may in practice deliver only 70%.
5.3 Extensions: Pushing Beyond the Limits
Group-conditional conformal prediction (Barber et al., 2023):
- Partition the data into meaningful subgroups and perform conformal prediction separately within each group
- If exchangeability holds within each group, group-level coverage is guaranteed
- Limitation: if groups are too fine-grained, each group’s calibration set becomes small, leading to wide intervals
Weighted conformal prediction (Tibshirani et al., 2019):
- Addresses covariate shift by weighting calibration points with the likelihood ratio:
\(w(x) = p_{\text{test}}(x)/p_{\text{cal}}(x)\)
- Constructs prediction sets using weighted quantiles
- Limitation: estimating the likelihood ratio itself is challenging
Adaptive Conformal Inference (ACI) (Gibbs & Candes, 2021):
- For settings where exchangeability is violated, such as time series, dynamically adjusts $\alpha$
- Feeds back past coverage errors to update $\alpha$ — an online learning approach
- Converges to the desired coverage rate in the long run
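The ACI update can be sketched as a single online step (a simplified version of the published rule; `gamma` is an assumed step-size parameter):

```python
def aci_update(alpha_t, covered, alpha_target=0.1, gamma=0.01):
    """One online ACI step: raise alpha after coverage, lower it after a miss."""
    err = 0.0 if covered else 1.0              # 1 if the set missed the truth
    # gradient-style feedback: drives long-run miss rate toward alpha_target
    return alpha_t + gamma * (alpha_target - err)
```

A larger $\alpha_t$ produces smaller sets, so covering lets the procedure tighten, while a miss forces it to widen at the next step.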
5.4 Conformal Prediction in Cheminformatics
Conformal prediction finds a particularly natural application in cheminformatics:
- Prediction intervals for QSAR models (Svensson et al., ACS Omega, 2024): Applying conformal prediction to drug activity prediction provides, for each molecule, “the 90% prediction interval for this prediction is [a, b].” This enables decision support in drug development that was impossible with point predictions alone
- Defining the applicability domain: Chemists have traditionally defined “the chemical space where this model is reliable” in qualitative terms. Conformal prediction quantifies this through prediction set size. Molecules with excessively wide prediction intervals are deemed outside the model’s applicability domain
- Controlling false positives in virtual screening (ScienceDirect, 2025): Filtering out molecules whose conformal prediction sets include both “active” and “inactive” labels reduces wasted resources on synthesis experiments
Author’s perspective: In our 2019 Bayesian GCN paper, we took the approach of filtering unreliable predictions using epistemic uncertainty (Ryu, Kwon & Kim, 2019). Conformal prediction is complementary to this — while Bayesian methods excel at decomposing the sources of uncertainty, conformal prediction excels at providing statistical guarantees without distributional assumptions. Combining the two yields a more powerful framework.
6. Practical Guide: When to Use What
6.1 Decision Tree
What is your situation?
=======================
Q1: Is it sufficient to just recalibrate
the confidence of an already-trained model?
|
YES -+--- NO
| |
v v
[Temperature Q2: Do you need distribution-free
Scaling] coverage guarantees?
(simple, |
effective, YES -+--- NO
just 1 param) | |
v v
Q3: Do you need [Use Bayesian methods
adaptive intervals for uncertainty
per input? estimation +
| reliability diagrams
YES -+--- NO for diagnostics]
| |
v v
[Conformalized [Split Conformal
Quantile Prediction]
Regression] (simplest conformal
(adaptive width, method)
coverage guarantee)
6.2 Situation-Specific Recommendations
Situation 1: You have a well-trained classification model and only need to recalibrate its confidence
- Recommendation: Temperature Scaling
- Rationale: Extremely simple to implement (a single line of scipy.optimize), does not alter accuracy, and is sufficient in most cases
- Caveat: The validation set must be similar in distribution to the test set
Situation 2: You need to present a “90% confidence interval for this prediction” to regulators or decision-makers
- Recommendation: Conformal Prediction (CQR for regression, APS for classification)
- Rationale: The distribution-free coverage guarantee minimizes assumptions, making it easy to explain to external stakeholders
- Caveat: Coverage is marginal, so guarantees for specific subgroups require separate verification
Situation 3: You need uncertainty-based filtering in large-scale virtual screening
- Recommendation: Combine conformal prediction with a well-calibrated base model
- Specific strategy:
- Estimate uncertainty with Deep Ensembles or MC Dropout as the base model
- Calibrate the base model with temperature scaling
- Apply conformal prediction to construct a prediction set for each molecule
- Prioritize molecules whose prediction set contains a single class (high confidence)
- Route molecules with multi-class prediction sets to further experimental testing
Situation 4: Time series, active learning, or other settings where exchangeability is violated
- Recommendation: Adaptive Conformal Inference (ACI)
- Rationale: Online coverage control that adapts to distribution shift
- Caveat: The guarantee weakens to a long-run average. Coverage at any individual time step is not guaranteed
6.3 Principles of Combination
The most powerful approach is to combine calibration and conformal prediction. These two methods are not competitors but complements:
- Calibration (Temperature Scaling, etc.): Corrects the base model’s softmax outputs, making the nonconformity scores used in conformal prediction more meaningful
- Conformal Prediction: Acts as a safety net that guarantees coverage even when calibration is imperfect
Applying conformal prediction to a well-calibrated model yields:
- Smaller prediction sets (at the same coverage level): Well-calibrated scores are more informative
- Closer approximation to conditional coverage: Perfect conditional coverage is impossible, but the better the base model, the smaller the gap between marginal and conditional coverage
Closing Remarks: Toward Honest Uncertainty
Synthesizing everything covered in this post, a complete pipeline for uncertainty quantification comes into view:
- Uncertainty estimation (Part 2): Obtain predictive uncertainty through posterior approximation
- Uncertainty decomposition (Part 3): Separate into aleatoric and epistemic components to diagnose the source
- Calibration verification (this post): Check whether the uncertainty matches reality using reliability diagrams, ECE, and the Brier Score
- Post-hoc correction: If discrepancies exist, apply temperature scaling or similar methods
- Statistical guarantees: Use conformal prediction to provide distribution-free coverage guarantees
Within this pipeline, the role of this post is verification and assurance. If Parts 2 and 3 were about “producing” uncertainty, Part 4 is about confirming that the uncertainty is “honest” and “attaching guarantees” to it.
Let me close by emphasizing three key takeaways:
Calibration is a mandatory checkpoint, but it is not sufficient on its own. Remember that a perfectly useless model with ECE = 0 can exist. A perspective that examines calibration, resolution, and uncertainty together — as in the Brier Score’s three-component decomposition — is essential.
Recognize both the power and the limits of temperature scaling. A single parameter can achieve remarkable calibration improvements, but this depends on the representativeness of the validation set and is vulnerable to distribution shift. When exploring novel chemical spaces in scientific research, there is no guarantee that a $T$ learned from existing data will remain valid.
Conformal prediction is an indispensable member of the modern UQ toolkit. This framework, which guarantees coverage without distributional assumptions, does not compete with Bayesian methods — it complements them. If Bayesian methods are the scalpel that dissects the sources of uncertainty, conformal prediction is the seal that stamps a statistical certificate onto that uncertainty.
In the next installment, Part 5, we address another practical question: Can uncertainty be quantified in a single forward pass, without Deep Ensembles or MC Dropout? We will examine the theory and practice of single-pass UQ methods such as Evidential Deep Learning, SNGP, and DUQ.
References
Calibration
- Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. (2017). “On Calibration of Modern Neural Networks.” ICML.
- Platt, J. (1999). “Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods.” Advances in Large Margin Classifiers.
- Naeini, M. P., Cooper, G. & Hauskrecht, M. (2015). “Obtaining Well Calibrated Probabilities Using Bayesian Binning into Quantiles.” AAAI.
- Nixon, J. et al. (2019). “Measuring Calibration in Deep Learning.” CVPR Workshops.
- Murphy, A. H. (1973). “A New Vector Partition of the Probability Score.” Journal of Applied Meteorology, 12(4), 595-600.
Conformal Prediction — Foundations
- Vovk, V., Gammerman, A. & Shafer, G. (2005). Algorithmic Learning in a Random World. Springer.
- Angelopoulos, A. N. & Bates, S. (2023). “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.” Foundations and Trends in Machine Learning.
- Romano, Y., Patterson, E. & Candes, E. (2019). “Conformalized Quantile Regression.” NeurIPS.
- Romano, Y., Sesia, M. & Candes, E. (2020). “Classification with Valid and Adaptive Coverage.” NeurIPS.
Conformal Prediction — Extensions
- Barber, R. F., Candes, E. J., Ramdas, A. & Tibshirani, R. J. (2023). “Conformal Prediction Beyond Exchangeability.” Annals of Statistics.
- Tibshirani, R. J., Barber, R. F., Candes, E. J. & Ramdas, A. (2019). “Conformal Prediction Under Covariate Shift.” NeurIPS.
- Gibbs, I. & Candes, E. (2021). “Adaptive Conformal Inference Under Distribution Shift.” NeurIPS.
Conformal Prediction in Chemistry
- Svensson, F. et al. (2024). “Conformal Prediction for QSAR.” ACS Omega.
- Conformal prediction for virtual screening and molecular property prediction (2025). ScienceDirect.
Scientific Applications
- Ryu, S., Kwon, Y. & Kim, W. Y. (2019). “A Bayesian Graph Convolutional Network for Reliable Prediction of Molecular Properties with Uncertainty Quantification.” Chemical Science, 10, 8438-8446.
Background
- Lei, J. & Wasserman, L. (2014). “Distribution-Free Prediction Bands for Non-Parametric Regression.” JRSSB.
- Vovk, V. (2012). “Conditional Validity of Inductive Conformal Predictors.” AISTATS.