Show & Tell: Phase 11.4 InterpretabilityProbe — feature attribution and explanation generation for alignment decisions #347
web3guru888 started this conversation in Show and tell
What is InterpretabilityProbe?
Safety systems in autonomous agents are only as trustworthy as they are understandable. Phase 11.4 introduces InterpretabilityProbe, a component that produces post-hoc explanations for every alignment decision: Why was this goal blocked? What features drove the alignment score down? Which feedback signals most changed the reward model?
Without interpretability, the safety layer is a black box. With InterpretabilityProbe, operators can audit, debug, and build confidence in the system.
Component map
Attribution methods
PERMUTATION
INTEGRATED_GRAD
GRADIENT_SHAP
LIME_LOCAL
Permutation attribution: zero-baseline ablation
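As a rough illustration of zero-baseline ablation, the sketch below replaces one feature at a time with a baseline of 0.0, re-scores the vector, and records how much the score moved. The function name `zero_baseline_attribution`, the `FeatureVector` alias, and the linear `toy_score` with its weights are all hypothetical, introduced only to make the example runnable; they are not the InterpretabilityProbe API described in Issue #346.

```python
from typing import Callable, Dict

FeatureVector = Dict[str, float]


def zero_baseline_attribution(
    features: FeatureVector,
    score_fn: Callable[[FeatureVector], float],
    baseline: float = 0.0,
) -> Dict[str, float]:
    """Attribute a score to each feature by ablating it to a baseline value."""
    original_score = score_fn(features)
    attributions: Dict[str, float] = {}
    for name in features:
        ablated = dict(features)  # copy so the caller's vector is never mutated
        ablated[name] = baseline
        # Positive: the feature raised the score. Negative: it lowered it.
        attributions[name] = original_score - score_fn(ablated)
    return attributions


# Hypothetical linear scorer (assumption): stands in for the real alignment scorer.
TOY_WEIGHTS = {"violation_rate_7d": -0.5, "federation_trust_score": 0.25}
TOY_BIAS = 0.5


def toy_score(features: FeatureVector) -> float:
    return TOY_BIAS + sum(TOY_WEIGHTS[name] * value for name, value in features.items())


example = {"violation_rate_7d": 0.5, "federation_trust_score": 0.5}
attrs = zero_baseline_attribution(example, toy_score)

# Print features from most negative to most positive contribution.
for name, value in sorted(attrs.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:+.3f}")
```

With these toy weights, violation_rate_7d receives a negative attribution and federation_trust_score a positive one. Each feature costs one extra call to the scorer, so the work grows linearly with the number of features, which may be one reason to cache explanations.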
Feature vector design
SafetyFilter verdict features (12 dimensions)
goal_priority
task_count
capability_bits
resource_budget_cpu
resource_budget_mem
constitutional_score
federation_trust_score
violation_rate_7d
escalation_history
AlignmentMonitor score features (8 dimensions)
dimension_id
window_size_s
sample_count
recent_trend
Counterfactual generation
InterpretabilityProbe generates counterfactuals: the minimal set of feature changes that would reverse the decision.
Example: Goal blocked (score 0.21). Top negative features: violation_rate_7d (-0.38), federation_trust_score (-0.21). Counterfactual: "Reduce violation rate and improve federation trust to reverse this block."
Prometheus metrics
asi_probe_explanations_total{target,method}
asi_probe_cache_hits_total
asi_probe_cache_size
asi_probe_attribution_ms
asi_probe_confidence_score{target}
PromQL
Open questions
Should ProbeExplanation objects be stored in the Blackboard permanently, or only surfaced on demand via API? How long should explanations be retained?
Full spec in Issue #346.