❓ Phase 23.1 Q&A — UncertaintyQuantifier: confidence calibration, ensemble methods & decomposition #536
Common questions about the `UncertaintyQuantifier` component (Phase 23.1). See issue #530 for the full specification and discussion #535 for the architecture overview.

Q1: Why decompose uncertainty into epistemic and aleatoric?
Because they demand fundamentally different responses. Epistemic (model) uncertainty is reducible: gather more data, enlarge the ensemble, or retrain. Aleatoric (data) uncertainty is irreducible noise in the inputs themselves: it can only be managed, never trained away. A single "confidence" number conflates the two, making it impossible for downstream modules to choose the right response. The `RiskAssessor` (23.2) uses the decomposition directly.
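As a rough illustration of why the split matters downstream, here is how a consumer might branch on the two components. This is a sketch, not the actual `RiskAssessor` API; the function name, handler strings, and thresholds are hypothetical.

```python
def route(estimate, ood_threshold: float = 0.3, noise_threshold: float = 0.5) -> str:
    """Hypothetical routing on a decomposed uncertainty estimate.

    estimate.epistemic -- reducible (model) uncertainty
    estimate.aleatoric -- irreducible (data) uncertainty
    """
    if estimate.epistemic > ood_threshold:
        # The model family has not seen enough data like this input:
        # escalate, collect more data, or defer to a human.
        return "escalate_for_more_data"
    if estimate.aleatoric > noise_threshold:
        # The input is inherently noisy: retraining will not help,
        # so hedge the action and widen decision margins instead.
        return "act_conservatively"
    return "act_normally"
```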
Q2: Why ensemble disagreement over single-model confidence?

Single-model softmax outputs are notoriously overconfident. A neural network can output 99.9% confidence on an input it has never seen before; this is a known pathology of maximum-likelihood training.
Ensemble disagreement solves this naturally: independently trained members agree where the training data pins the function down and diverge where it does not, so their spread is a direct, data-driven signal of epistemic uncertainty.
This is grounded in Bayesian model averaging — the ensemble approximates the posterior predictive distribution. Unlike MC-Dropout (which is approximate and architecture-dependent), deep ensembles are model-agnostic and empirically well-calibrated (Lakshminarayanan et al., 2017).
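A minimal sketch of the disagreement computation, assuming softmax outputs from each ensemble member and the standard entropy (mutual-information) decomposition. The function name is illustrative; the component's exact estimator may differ.

```python
import numpy as np

def decompose_uncertainty(member_probs: np.ndarray, eps: float = 1e-12):
    """Entropy-based decomposition of ensemble predictions.

    member_probs: shape (n_members, n_classes), one softmax row per member.
    Returns (epistemic, aleatoric, total) with epistemic + aleatoric == total.
    """
    mean_p = member_probs.mean(axis=0)
    # Total predictive uncertainty: entropy of the averaged prediction.
    total = float(-np.sum(mean_p * np.log(mean_p + eps)))
    # Aleatoric: average entropy of each member's own prediction.
    aleatoric = float(-np.mean(np.sum(member_probs * np.log(member_probs + eps), axis=1)))
    # Epistemic: what remains, i.e. disagreement between members.
    epistemic = total - aleatoric
    return epistemic, aleatoric, total

# Members that agree -> epistemic near 0; members that disagree -> epistemic large.
agree = np.array([[0.90, 0.10], [0.88, 0.12], [0.92, 0.08]])
disagree = np.array([[0.95, 0.05], [0.10, 0.90], [0.50, 0.50]])
print(decompose_uncertainty(agree))     # small epistemic
print(decompose_uncertainty(disagree))  # large epistemic
```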
Trade-off: N models means N× compute. The `UncertaintyQuantifier` supports a configurable `ensemble_size` (default: 5) and lazy evaluation (disagreement is only computed when uncertainty is requested).

Q3: How does Platt scaling work?
Platt scaling fits a logistic regression on a held-out validation set to map raw model outputs to calibrated probabilities.
Given a raw model output `z` (logit or score), the calibrated probability is `sigmoid(A * z + B)`. Parameters `A` and `B` are fit by minimising NLL on the validation set. This is equivalent to learning a temperature (slope `A`) and bias (`B`) correction.

Workflow in `UncertaintyQuantifier`:

1. Collect `(prediction, actual_outcome)` pairs on holdout data
2. Fit `A, B` via L-BFGS optimisation
3. Apply `calibrated_conf = sigmoid(A * raw_conf + B)`
4. Re-fit when `ece_score > recalibration_threshold`

Limitation: Assumes the miscalibration curve is sigmoid-shaped. For more complex patterns, use `CalibrationMethod.ISOTONIC` or `CalibrationMethod.BETA`.
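A compact, self-contained sketch of the fit in plain NumPy/SciPy (not the component's internal code; `fit_platt` and `calibrate` are illustrative names):

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt(raw_conf: np.ndarray, outcomes: np.ndarray):
    """Fit A, B so that sigmoid(A * raw_conf + B) minimises NLL on holdout pairs."""
    def nll(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(-(a * raw_conf + b)))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        return -np.mean(outcomes * np.log(p) + (1.0 - outcomes) * np.log(1.0 - p))

    # L-BFGS-B is SciPy's L-BFGS variant; with only two parameters it converges quickly.
    result = minimize(nll, x0=np.array([1.0, 0.0]), method="L-BFGS-B")
    return float(result.x[0]), float(result.x[1])

def calibrate(raw_conf: np.ndarray, a: float, b: float) -> np.ndarray:
    """Map raw confidences through the fitted sigmoid correction."""
    return 1.0 / (1.0 + np.exp(-(a * raw_conf + b)))
```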
Q4: What happens with out-of-distribution (OOD) inputs?

Epistemic uncertainty spikes, by design. This is one of the primary reasons we use ensemble-based UQ.
When an OOD input arrives, the ensemble members disagree, so the `UncertaintyEstimate.epistemic` field will be large. The `DecisionOrchestrator` (23.5) uses the epistemic signal to trigger active learning or human-in-the-loop escalation when `epistemic > ood_threshold`.

Note: Pure aleatoric estimators (e.g., heteroscedastic heads) do NOT detect OOD; only epistemic uncertainty does. This is why the decomposition matters.
Q5: How does the recalibration loop prevent drift?
The calibration model can become stale if the prediction-outcome distribution shifts over time (concept drift). The `UncertaintyQuantifier` prevents this with a periodic recalibration loop:

1. Buffer recent `(prediction, outcome)` pairs (default: last 10,000)
2. Compute the `ece_score` on the buffer
3. If `ece_score > recalibration_threshold` (default: 0.05), initiate recalibration
4. Export the result to the `uq_calibration_ece` Prometheus gauge

The `recalibration_interval_s` (default: 300s) and `buffer_size` are configurable via `UncertaintyConfig`.
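A skeleton of such a loop, illustrative only: it runs synchronously and assumes the quantifier exposes `compute_ece()` and `recalibrate()` methods, which are not confirmed API. The gauge name matches the one quoted above.

```python
import time
from collections import deque

from prometheus_client import Gauge

ECE_GAUGE = Gauge("uq_calibration_ece", "Rolling ECE on recent prediction-outcome pairs")

class RecalibrationLoop:
    def __init__(self, quantifier, buffer_size: int = 10_000,
                 recalibration_threshold: float = 0.05,
                 recalibration_interval_s: float = 300.0):
        self.quantifier = quantifier
        self.buffer = deque(maxlen=buffer_size)  # rolling (prediction, outcome) pairs
        self.threshold = recalibration_threshold
        self.interval = recalibration_interval_s

    def record(self, prediction: float, outcome: float) -> None:
        self.buffer.append((prediction, outcome))

    def run_forever(self) -> None:
        while True:
            time.sleep(self.interval)
            if not self.buffer:
                continue
            pairs = list(self.buffer)
            ece = self.quantifier.compute_ece(pairs)  # assumed method
            ECE_GAUGE.set(ece)                        # export for dashboards/alerting
            if ece > self.threshold:
                self.quantifier.recalibrate(pairs)    # assumed method
```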
Q6: How does UncertaintyQuantifier integrate with RiskAssessor (23.2)?

The `UncertaintyEstimate` is the primary input to `RiskAssessor.assess()`. The integration contract:

1. `UncertaintyQuantifier` → `UncertaintyEstimate` (epistemic, aleatoric, total, confidence)
2. `calibrate()` → `CalibratedEstimate` (calibrated_conf, ece_score, reliability_curve)
3. `RiskAssessor.assess()` accepts `CalibratedEstimate` and produces a `RiskScore` with confidence intervals

Key insight: Risk = Uncertainty × Consequence. Without calibrated uncertainty, the risk score is meaningless.
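Illustrative shapes for the contract types. The field names follow the list above; the dataclass form, the Python types, and the `assess` stub are assumptions for the sketch, not the component's actual definitions.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class UncertaintyEstimate:
    epistemic: float      # reducible (model) uncertainty
    aleatoric: float      # irreducible (data) uncertainty
    total: float          # combined uncertainty
    confidence: float     # raw, pre-calibration confidence

@dataclass
class CalibratedEstimate:
    calibrated_conf: float
    ece_score: float
    reliability_curve: List[Tuple[float, float]]  # (mean confidence, accuracy) per bin

@dataclass
class RiskScore:
    risk: float                              # roughly uncertainty x consequence
    confidence_interval: Tuple[float, float]

def assess(estimate: CalibratedEstimate, consequence: float) -> RiskScore:
    """Hypothetical stand-in for RiskAssessor.assess(): Risk = Uncertainty x Consequence."""
    risk = (1.0 - estimate.calibrated_conf) * consequence
    half_width = estimate.ece_score * consequence  # widen the interval by calibration error
    return RiskScore(risk, (risk - half_width, risk + half_width))
```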
Q7: How do you test calibration quality?
Expected Calibration Error (ECE) on synthetic data with known ground-truth distributions, plus reliability diagrams.
Testing strategy:
- Synthetic perfectly-calibrated data: Generate predictions where `P(correct | conf=p) = p` exactly. Verify ECE ≈ 0.0.
- Synthetic overconfident data: Generate predictions where true accuracy < stated confidence. Verify ECE > 0 and that calibration reduces ECE.
- Reliability diagram validation: Bin predictions by confidence, compute actual accuracy per bin, verify the calibrated curve is closer to the diagonal.
- Decomposition sanity checks: Verify `epistemic + aleatoric ≈ total` (Pythagorean decomposition), and that epistemic → 0 as ensemble size → ∞ on in-distribution data.
- Property-based tests: Use Hypothesis to generate random prediction vectors and verify invariants (non-negative uncertainties, calibrated conf ∈ [0,1], monotonicity of the calibration map).
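A self-contained sketch of the first check: an ECE implementation (equal-width bins, an assumption; the component may bin differently) run on synthetic perfectly-calibrated data.

```python
import numpy as np

def expected_calibration_error(conf: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """ECE: sample-weighted gap between stated confidence and observed accuracy per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - conf[mask].mean())
        ece += (mask.sum() / len(conf)) * gap
    return ece

# Synthetic perfectly-calibrated data: P(correct | conf=p) = p, so ECE should be ~0.
rng = np.random.default_rng(0)
conf = rng.uniform(0.05, 0.95, size=100_000)
correct = (rng.uniform(size=conf.size) < conf).astype(float)
assert expected_calibration_error(conf, correct) < 0.01
```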
Phase 23.1 of the ASI-Build cognitive architecture. See issue #530 for the full specification.