❓ Phase 23.1 Q&A — UncertaintyQuantifier: confidence calibration, ensemble methods & decomposition #536
Common questions about the `UncertaintyQuantifier` component (Phase 23.1). See issue #530 for the full specification and discussion #535 for the architecture overview.

Q1: Why decompose uncertainty into epistemic and aleatoric?
Because they demand fundamentally different responses. Epistemic (model) uncertainty is reducible: gather more data, enlarge the ensemble, or retrain. Aleatoric (data) uncertainty is irreducible noise in the inputs themselves: it can only be managed, never trained away. A single "confidence" number conflates the two, making it impossible for downstream modules to choose the right response. The `RiskAssessor` (23.2) uses the decomposition directly.
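As a rough illustration of why the split matters downstream, here is how a consumer might branch on the two components. This is a sketch, not the actual `RiskAssessor` API; the function name, handler strings, and thresholds are hypothetical.

```python
def route(estimate, ood_threshold: float = 0.3, noise_threshold: float = 0.5) -> str:
    """Hypothetical routing on a decomposed uncertainty estimate.

    estimate.epistemic -- reducible (model) uncertainty
    estimate.aleatoric -- irreducible (data) uncertainty
    """
    if estimate.epistemic > ood_threshold:
        # The model family has not seen enough data like this input:
        # escalate, collect more data, or defer to a human.
        return "escalate_for_more_data"
    if estimate.aleatoric > noise_threshold:
        # The input is inherently noisy: retraining will not help,
        # so hedge the action and widen decision margins instead.
        return "act_conservatively"
    return "act_normally"
```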
Q2: Why ensemble disagreement over single-model confidence?

Single-model softmax outputs are notoriously overconfident. A neural network can output 99.9% confidence on an input it has never seen before; this is a known pathology of maximum-likelihood training.
Ensemble disagreement solves this naturally: independently trained members agree where the training data pins the function down and diverge where it does not, so their spread is a direct, data-driven signal of epistemic uncertainty.
This is grounded in Bayesian model averaging — the ensemble approximates the posterior predictive distribution. Unlike MC-Dropout (which is approximate and architecture-dependent), deep ensembles are model-agnostic and empirically well-calibrated (Lakshminarayanan et al., 2017).
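A minimal sketch of the disagreement computation, assuming softmax outputs from each ensemble member and the standard entropy (mutual-information) decomposition. The function name is illustrative; the component's exact estimator may differ.

```python
import numpy as np

def decompose_uncertainty(member_probs: np.ndarray, eps: float = 1e-12):
    """Entropy-based decomposition of ensemble predictions.

    member_probs: shape (n_members, n_classes), one softmax row per member.
    Returns (epistemic, aleatoric, total) with epistemic + aleatoric == total.
    """
    mean_p = member_probs.mean(axis=0)
    # Total predictive uncertainty: entropy of the averaged prediction.
    total = float(-np.sum(mean_p * np.log(mean_p + eps)))
    # Aleatoric: average entropy of each member's own prediction.
    aleatoric = float(-np.mean(np.sum(member_probs * np.log(member_probs + eps), axis=1)))
    # Epistemic: what remains, i.e. disagreement between members.
    epistemic = total - aleatoric
    return epistemic, aleatoric, total

# Members that agree -> epistemic near 0; members that disagree -> epistemic large.
agree = np.array([[0.90, 0.10], [0.88, 0.12], [0.92, 0.08]])
disagree = np.array([[0.95, 0.05], [0.10, 0.90], [0.50, 0.50]])
print(decompose_uncertainty(agree))     # small epistemic
print(decompose_uncertainty(disagree))  # large epistemic
```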
Trade-off: N models means N× compute. The `UncertaintyQuantifier` supports a configurable `ensemble_size` (default: 5) and lazy evaluation (disagreement is only computed when uncertainty is requested).

Q3: How does Platt scaling work?
Platt scaling fits a logistic regression on a held-out validation set to map raw model outputs to calibrated probabilities.
Given a raw model output `z` (logit or score), the calibrated probability is `sigmoid(A * z + B)`. Parameters `A` and `B` are fit by minimising NLL on the validation set. This is equivalent to learning a temperature (slope `A`) and bias (`B`) correction.

Workflow in `UncertaintyQuantifier`:

1. Collect `(prediction, actual_outcome)` pairs on holdout data
2. Fit `A, B` via L-BFGS optimisation
3. Apply `calibrated_conf = sigmoid(A * raw_conf + B)`
4. Re-fit when `ece_score > recalibration_threshold`

Limitation: Assumes the miscalibration curve is sigmoid-shaped. For more complex patterns, use `CalibrationMethod.ISOTONIC` or `CalibrationMethod.BETA`.
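A compact, self-contained sketch of the fit in plain NumPy/SciPy (not the component's internal code; `fit_platt` and `calibrate` are illustrative names):

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt(raw_conf: np.ndarray, outcomes: np.ndarray):
    """Fit A, B so that sigmoid(A * raw_conf + B) minimises NLL on holdout pairs."""
    def nll(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(-(a * raw_conf + b)))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        return -np.mean(outcomes * np.log(p) + (1.0 - outcomes) * np.log(1.0 - p))

    # L-BFGS-B is SciPy's L-BFGS variant; with only two parameters it converges quickly.
    result = minimize(nll, x0=np.array([1.0, 0.0]), method="L-BFGS-B")
    return float(result.x[0]), float(result.x[1])

def calibrate(raw_conf: np.ndarray, a: float, b: float) -> np.ndarray:
    """Map raw confidences through the fitted sigmoid correction."""
    return 1.0 / (1.0 + np.exp(-(a * raw_conf + b)))
```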
Q4: What happens with out-of-distribution (OOD) inputs?

Epistemic uncertainty spikes, by design. This is one of the primary reasons we use ensemble-based UQ.
When an OOD input arrives, the ensemble members disagree, so the `UncertaintyEstimate.epistemic` field will be large. The `DecisionOrchestrator` (23.5) uses the epistemic signal to trigger active learning or human-in-the-loop escalation when `epistemic > ood_threshold`.

Note: Pure aleatoric estimators (e.g., heteroscedastic heads) do NOT detect OOD; only epistemic uncertainty does. This is why the decomposition matters.
Q5: How does the recalibration loop prevent drift?
The calibration model can become stale if the prediction-outcome distribution shifts over time (concept drift). The `UncertaintyQuantifier` prevents this with a periodic recalibration loop:

1. Buffer recent `(prediction, outcome)` pairs (default: last 10,000)
2. Compute the `ece_score` on the buffer
3. If `ece_score > recalibration_threshold` (default: 0.05), initiate recalibration
4. Export the result to the `uq_calibration_ece` Prometheus gauge

The `recalibration_interval_s` (default: 300s) and `buffer_size` are configurable via `UncertaintyConfig`.
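A skeleton of such a loop, illustrative only: it runs synchronously and assumes the quantifier exposes `compute_ece()` and `recalibrate()` methods, which are not confirmed API. The gauge name matches the one quoted above.

```python
import time
from collections import deque

from prometheus_client import Gauge

ECE_GAUGE = Gauge("uq_calibration_ece", "Rolling ECE on recent prediction-outcome pairs")

class RecalibrationLoop:
    def __init__(self, quantifier, buffer_size: int = 10_000,
                 recalibration_threshold: float = 0.05,
                 recalibration_interval_s: float = 300.0):
        self.quantifier = quantifier
        self.buffer = deque(maxlen=buffer_size)  # rolling (prediction, outcome) pairs
        self.threshold = recalibration_threshold
        self.interval = recalibration_interval_s

    def record(self, prediction: float, outcome: float) -> None:
        self.buffer.append((prediction, outcome))

    def run_forever(self) -> None:
        while True:
            time.sleep(self.interval)
            if not self.buffer:
                continue
            pairs = list(self.buffer)
            ece = self.quantifier.compute_ece(pairs)  # assumed method
            ECE_GAUGE.set(ece)                        # export for dashboards/alerting
            if ece > self.threshold:
                self.quantifier.recalibrate(pairs)    # assumed method
```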
Q6: How does UncertaintyQuantifier integrate with RiskAssessor (23.2)?

The `UncertaintyEstimate` is the primary input to `RiskAssessor.assess()`. The integration contract:

1. `UncertaintyQuantifier` → `UncertaintyEstimate` (epistemic, aleatoric, total, confidence)
2. `calibrate()` → `CalibratedEstimate` (calibrated_conf, ece_score, reliability_curve)
3. `RiskAssessor.assess()` accepts `CalibratedEstimate` and produces a `RiskScore` with confidence intervals

Key insight: Risk = Uncertainty × Consequence. Without calibrated uncertainty, the risk score is meaningless.
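Illustrative shapes for the contract types. The field names follow the list above; the dataclass form, the Python types, and the `assess` stub are assumptions for the sketch, not the component's actual definitions.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class UncertaintyEstimate:
    epistemic: float      # reducible (model) uncertainty
    aleatoric: float      # irreducible (data) uncertainty
    total: float          # combined uncertainty
    confidence: float     # raw, pre-calibration confidence

@dataclass
class CalibratedEstimate:
    calibrated_conf: float
    ece_score: float
    reliability_curve: List[Tuple[float, float]]  # (mean confidence, accuracy) per bin

@dataclass
class RiskScore:
    risk: float                              # roughly uncertainty x consequence
    confidence_interval: Tuple[float, float]

def assess(estimate: CalibratedEstimate, consequence: float) -> RiskScore:
    """Hypothetical stand-in for RiskAssessor.assess(): Risk = Uncertainty x Consequence."""
    risk = (1.0 - estimate.calibrated_conf) * consequence
    half_width = estimate.ece_score * consequence  # widen the interval by calibration error
    return RiskScore(risk, (risk - half_width, risk + half_width))
```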
Q7: How do you test calibration quality?
Expected Calibration Error (ECE) on synthetic data with known ground-truth distributions, plus reliability diagrams.
Testing strategy:
- Synthetic perfectly-calibrated data: Generate predictions where `P(correct | conf=p) = p` exactly. Verify ECE ≈ 0.0.
- Synthetic overconfident data: Generate predictions where true accuracy < stated confidence. Verify ECE > 0 and that calibration reduces ECE.
- Reliability diagram validation: Bin predictions by confidence, compute actual accuracy per bin, verify the calibrated curve is closer to the diagonal.
- Decomposition sanity checks: Verify `epistemic + aleatoric ≈ total` (Pythagorean decomposition), and that epistemic → 0 as ensemble size → ∞ on in-distribution data.
- Property-based tests: Use Hypothesis to generate random prediction vectors and verify invariants (non-negative uncertainties, calibrated conf ∈ [0,1], monotonicity of the calibration map).
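A self-contained sketch of the first check: an ECE implementation (equal-width bins, an assumption; the component may bin differently) run on synthetic perfectly-calibrated data.

```python
import numpy as np

def expected_calibration_error(conf: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """ECE: sample-weighted gap between stated confidence and observed accuracy per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - conf[mask].mean())
        ece += (mask.sum() / len(conf)) * gap
    return ece

# Synthetic perfectly-calibrated data: P(correct | conf=p) = p, so ECE should be ~0.
rng = np.random.default_rng(0)
conf = rng.uniform(0.05, 0.95, size=100_000)
correct = (rng.uniform(size=conf.size) < conf).astype(float)
assert expected_calibration_error(conf, correct) < 0.01
```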
Phase 23.1 of the ASI-Build cognitive architecture. See issue #530 for the full specification.