Safety: formal_verification auto-proves all constraints due to ungrounded conclusion symbols #7

@web3guru888

Summary

src/asi_build/safety/formal_verification.py contains a critical logic flaw. The _verify_single_constraint() method constructs conclusions of the form "satisfies_nonmaleficence". These are opaque SymPy symbols with no logical relationship to the hypotheses, so the resolution prover could only prove such a conclusion if it were entailed by the axioms and hypotheses, which it never is. With consistent hypotheses, And(combined_premises, Not(conclusion)) is therefore always satisfiable (the conclusion symbol can simply be assigned False), and the constraint is marked not proved regardless of what the proposal says.

The not satisfiable(...) pattern also fails in the other direction: if the hypotheses are themselves unsatisfiable, And(combined_premises, Not(conclusion)) is unsatisfiable too, so the prover reports the theorem as proved for any conclusion whatsoever.

Root Cause

In _prove_by_resolution() (lines 205–250):

# Check satisfiability
is_valid = not satisfiable(And(combined_premises, Not(conclusion_formula)))

And in _extract_facts_from_proposal(), facts are built from keys like "has_human_oversight", "is_reversible", etc. The conclusion "satisfies_nonmaleficence" is a separate symbol not entailed by any axiom or hypothesis.

The _initialize_ethical_axioms() method adds axioms like "beneficence", but none of these axioms ever produce satisfies_<principle> as a derivable conclusion.

Net effect: whether a constraint is marked valid depends only on whether SymPy's satisfiable() finds the combined premises consistent, not on any meaningful semantic check. Consistent premises never prove the conclusion; inconsistent premises always do.

Verification (reproduce with ~10 lines)

from sympy import And, Not, Symbol
from sympy.logic.inference import satisfiable

# Case 1: consistent hypotheses, unrelated conclusion
hyp = And(Symbol("has_human_oversight"), Symbol("is_reversible"))
conclusion = Symbol("satisfies_nonmaleficence")  # has no relation to hyp

# satisfiable() returns a model (a non-empty dict) because the conclusion
# can simply be assigned False, so is_valid = False (never proved)
print(not satisfiable(And(hyp, Not(conclusion))))  # False

# Case 2: inconsistent hypotheses
bad_hyp = And(Symbol("x"), Not(Symbol("x")))
# satisfiable() returns False, so is_valid = True --> auto-proves!
print(not satisfiable(And(bad_hyp, Not(conclusion))))  # True

Impact

The safety formal verification layer cannot be relied upon. A malicious or buggy proposal that creates contradictory hypotheses will pass verification. More broadly, the prover provides no meaningful assurance.

Proposed Fix

Three approaches:

Option A (Honest stubs): Mark all prove_theorem() calls as validity=False with reason="Formal verification not implemented — manual review required" until a correct implementation exists. This is safer than silently returning bogus results.
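
A minimal sketch of what such a fail-closed stub could look like. The ProofResult type, its fields, and the prove_theorem() signature are assumptions made for illustration; the module's actual return type and arguments may differ.

```python
from dataclasses import dataclass

# Reason text adapted from the proposal above; the exact wording is an assumption.
NOT_IMPLEMENTED_REASON = "Formal verification not implemented; manual review required"


@dataclass(frozen=True)
class ProofResult:
    """Hypothetical result type, not the module's real one."""
    validity: bool
    reason: str


def prove_theorem(hypotheses, conclusion):
    """Fail-closed stub: never reports a theorem as proved."""
    return ProofResult(validity=False, reason=NOT_IMPLEMENTED_REASON)


result = prove_theorem(
    ["has_human_oversight", "is_reversible"], "satisfies_nonmaleficence"
)
print(result.validity)  # False
print(result.reason)
```

Returning an explicit reason keeps the failure visible to callers, so a gate built on this result rejects by default instead of trusting an unproved claim.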

Option B (Semantic grounding): Add explicit semantic axiom rules that derive satisfies_<principle> from grounded fact combinations:

# Example: nonmaleficence is satisfied if no_direct_harm AND is_reversible
Implies(And(no_direct_harm, is_reversible), satisfies_nonmaleficence)

These rules must be carefully designed per principle (not auto-generated).
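
The following sketch shows how a grounding rule like the one above might be combined with a premise-consistency check, so that contradictory hypotheses no longer prove everything. The entails() helper and the GROUNDING_AXIOMS list are hypothetical names, and the single rule is only the illustrative example from this issue, not a vetted definition of nonmaleficence.

```python
from sympy import And, Implies, Not, Symbol
from sympy.logic.inference import satisfiable

no_direct_harm = Symbol("no_direct_harm")
is_reversible = Symbol("is_reversible")
satisfies_nonmaleficence = Symbol("satisfies_nonmaleficence")

# Hand-written grounding rule (illustrative only).
GROUNDING_AXIOMS = [
    Implies(And(no_direct_harm, is_reversible), satisfies_nonmaleficence),
]


def entails(facts, conclusion, axioms=GROUNDING_AXIOMS):
    """Return True only if consistent premises entail the conclusion."""
    premises = And(*axioms, *facts)
    # Reject inconsistent premises instead of proving everything from them.
    if not satisfiable(premises):
        return False
    # Entailment holds when premises plus the negated conclusion have no model.
    return not satisfiable(And(premises, Not(conclusion)))


print(entails([no_direct_harm, is_reversible], satisfies_nonmaleficence))  # True
print(entails([is_reversible], satisfies_nonmaleficence))                  # False

x = Symbol("x")
print(entails([x, Not(x)], satisfies_nonmaleficence))                      # False
```

With the rule in place, the conclusion symbol is tied to grounded facts, and the up-front satisfiable(premises) check closes the contradictory-hypotheses loophole described in the Summary.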

Option C (SMT-based): Replace the SymPy propositional prover with a proper Z3-based or Dafny-based constraint checker that operates on quantified predicates.

Security Note

This is listed as a research/safety issue, not a critical production bug (the module is marked research-alpha). However, any downstream use of verify_proposal_ethics() as a gate should be disabled until this is resolved.

Labels: bug, safety, help wanted, research
