Novel defense mechanism against explanation-poisoning attacks in agentic federated learning using causal graph auditing (GED-based) via SimGNN.
- Overview
- Why Bayesian Networks?
- System Architecture
- Deliberative Agent Design
- PoR Defense Mechanics
- Adversarial Attack Model
- Baseline Comparison (FedAvg)
- Streamlit Dashboard
- Quick Start
- Docker Setup
- Configuration Reference
- Project Structure
- Test Suite
- Threat Models
This project implements Causal Proof of Reasoning (PoR) — a novel server-side defense for Federated Learning that audits the causal reasoning structure submitted by each client alongside their model weights.
Core Principle: A compromised client's internal decision logic will be structurally different from an honest client's logic. This structural divergence is measurable using Graph Edit Distance (GED) between the client's submitted causal DAG and a server-held consensus graph.
Why this beats weight-based detection: Distributed Backdoor Attacks (DBA) [Xie et al., 2020] and other gradient manipulations can craft model weights that are statistically indistinguishable from honest clients. But the causal graph of a poisoned client must deviate from the true Bayesian Network structure — since explanation-poisoning corrupts the conditional relationships between features — making it detectable.
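To make this concrete, here is a toy exact-GED computation on two tiny DAGs using networkx (the project approximates GED with SimGNN because exact GED is NP-hard to compute):

```python
import networkx as nx

honest = nx.DiGraph([("smoke", "lung"), ("lung", "either")])
poisoned = nx.DiGraph([("lung", "either")])  # poisoning erased smoke → lung
poisoned.add_node("smoke")                   # same nodes, different structure

# One edge deletion separates the two structures, so the exact GED is 1
print(nx.graph_edit_distance(honest, poisoned))  # 1.0
```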
- NOTEARS-based causal discovery [Zheng et al., 2018] embedded in every FL client (PyTorch implementation)
- SimGNN Logic Validator [Bai et al., 2019] — Siamese GNN pre-trained to approximate GED on causal graphs
- FedNEAT Integration [Stanley & Miikkulainen, 2002] — robust neuroevolutionary strategy replacing traditional gradient-based updates for advanced multi-agent scenarios
- Momentum-blended consensus update — global graph evolves conservatively across rounds
- On-the-fly SimGNN fine-tuning — validator re-anchors after each consensus update
- Baseline FedAvg comparison [McMahan et al., 2017] — weight-divergence detection (cosine similarity) for benchmarking
The project uses two classical Bayesian Networks — ASIA [Lauritzen & Spiegelhalter, 1988] and ALARM [Beinlich et al., 1989] — as ground-truth causal structures, sampled via bnlearn:
| Property | Value |
|---|---|
| Nodes | 8 (asia, tub, smoke, lung, bronc, either, xray, dysp) |
| Arcs | 8 (known ground-truth structure) |
| Variable Type | Binary (0/1) |
| Classification Target | lung (Lung Cancer) |
| Samples | Configurable (default: 10,000) |
The Collider Test: ASIA encodes the v-structure tub → either ← lung — a fundamental causal pattern that tests whether NOTEARS correctly orients edges around colliders vs. forks.
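A recovered DAG can be checked for this v-structure in a couple of lines (illustrative helper, not project code):

```python
import networkx as nx

def recovers_asia_collider(dag: nx.DiGraph) -> bool:
    """True iff both edges are oriented into `either` (collider), not out of it."""
    return dag.has_edge("tub", "either") and dag.has_edge("lung", "either")
```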
| Property | Value |
|---|---|
| Nodes | 37 (ICU monitoring variables) |
| Arcs | 46 |
| Variable Type | Categorical (multi-state ordinal) |
| Classification Target | bp (Blood Pressure) |
| Parameters | 509 |
The ALARM (A Logical Alarm Reduction Mechanism) network models anesthesia complications — dense enough to stress-test PoR while remaining tractable on CPU.
Why not ISIC 2019 (images)? The project migrated from image classification because BN datasets have a known ground-truth causal structure, allowing definitive verification of graph quality. Named nodes (smoke, lung) also make the PoR graphs interpretable vs. abstract Feature_3 latents.
┌────────────────────────────────────────────────────────────┐
│ SERVER │
│ │
│ ┌─────────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Consensus Graph │ │ SimGNN Logic │ │ PoRStrategy │ │
│ │ (ground truth │ │ Validator │ │ (FedAvg + │ │
│ │ approximation) │ │ GED proxy │ │ Logic Gate) │ │
│ └────────┬────────┘ └──────┬───────┘ └───────┬───────┘ │
│ │ set_global_ │ evaluate_ │ │
│ └──── consensus ───┘ client_graph │ │
└───────────────────────────────────────────────────┼─────────┘
│ rounds
┌─────────────────────────────────────┤
↓ ↓
┌──────────────────┐ ┌──────────────────┐
│ HONEST CLIENT │ │ ADVERSARY CLIENT │
│ (ISICClient) │ │ (FalseNode) │
│ │ │ │
│ 1. Receive global│ │ 1. Receive global│
│ weights │ │ weights │
│ 2. Train MLP on │ │ 2. POISON batch: │
│ local data │ │ feat_0 = 0.0 │
│ 3. NOTEARS on │ │ label flipped │
│ raw features │ │ 3. Train on bad │
│ 4. Submit: │ │ data │
│ (weights, DAG) │ │ 4. NOTEARS gets │
│ │ │ crippled graph│
│ GED ≈ low ✅ │ │ GED > τ → ❌ │
└──────────────────┘ └──────────────────┘
Each honest client implements a three-module Deliberative Agent:
| Module | Class | Role |
|---|---|---|
| Perception | `DataLoader` | Consumes local data partition |
| Cognitive | `CognitiveModule` (NOTEARS) | Extracts causal DAG from latent features |
| Action | `ISICClient.fit()` | Packages weights + DAG, sends to server |
Custom in-house implementation (not a library wrapper) solving:

min_W (1/(2n))·‖X − X·W‖²_F + λ‖W‖₁   s.t.   h(W) = tr(e^{W∘W}) − d = 0

- Augmented Lagrangian outer loop updates ρ and α until h(W) < 1e-8
- Adam inner loop minimises the penalised objective
- Nodes are labelled with actual BN column names (e.g., `smoke`, `lung`)
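For reference, the acyclicity term h(W) can be evaluated directly with `torch.matrix_exp`; a minimal sketch, not the project's exact `CognitiveModule` code:

```python
import torch

def notears_acyclicity(W: torch.Tensor) -> torch.Tensor:
    """h(W) = tr(e^{W∘W}) − d; equals zero iff the weighted adjacency W is a DAG."""
    d = W.shape[0]
    return torch.trace(torch.matrix_exp(W * W)) - d  # W * W is the Hadamard square
```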
Instead of traditional gradient-based weight averaging, this project uses Federated NeuroEvolution of Augmenting Topologies (FedNEAT) to aggregate models safely without shape-mismatch errors.
Stage 1 — PoR Logic Gate:

```python
for client in submitted_clients:
    ged_score = simgnn(client.causal_graph, consensus_graph)
    if ged_score > tau:          # τ = core_logic.validator_threshold
        reject(client)           # corrupted topology → drop genome entirely
    else:
        accept(client)           # honest graph → candidate for crossover
```

Stage 2 — Topological Crossover (FedNEAT): Accepted clients do not average their multi-dimensional tensors directly. Instead, their neural network architectures are encapsulated as genomes:
- Connections are matched globally using Innovation Hashes.
- If an edge exists in multiple accepted genomes, its scalar weight is averaged.
- If a mutation introduces a novel structure on one client, it is inherited safely by the global model.
This ensures the surviving global model inherits only honest structural mutations, completely bypassing the parameter corruption caused by explanation poisoning.
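A minimal sketch of that matching rule, assuming each genome is a mapping from innovation hash to connection weight (the real logic lives in `server/fed_neat_strategy.py`):

```python
from collections import defaultdict

def topological_crossover(genomes: list[dict]) -> dict:
    """Average weights of connections matched by innovation hash;
    novel connections are inherited from whichever genome carries them."""
    buckets = defaultdict(list)
    for genome in genomes:                      # only PoR-accepted clients
        for innovation_hash, weight in genome.items():
            buckets[innovation_hash].append(weight)
    return {h: sum(ws) / len(ws) for h, ws in buckets.items()}
```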
After each round, the consensus graph is updated using a momentum-blended dual-threshold rule:
| Operation | Threshold | At momentum=0.85 (10 clients) |
|---|---|---|
| Keep existing edge | votes ≥ (1−m)·0.5·N | ≥ 0.75 votes → very sticky |
| Add new edge | votes ≥ (0.5+0.5·m)·N | ≥ 9.25 votes → near-unanimous |
`consensus_momentum` (0–1, configurable via GUI slider):
- High (0.9): Graph barely changes each round — stable, conservative
- Low (0.0): Pure 50% majority vote — aggressive updates
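In code, the two vote thresholds from the table work out as follows (a sketch of the rule, not the exact `aggregator.py` implementation):

```python
def edge_vote_thresholds(m: float, n_clients: int) -> tuple[float, float]:
    """Dual-threshold consensus rule: (keep-edge votes, add-edge votes)."""
    keep = (1 - m) * 0.5 * n_clients      # existing edges are sticky at high m
    add  = (0.5 + 0.5 * m) * n_clients    # new edges need near-unanimity
    return keep, add

print(edge_vote_thresholds(0.85, 10))  # (0.75, 9.25) — matches the table above
```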
Input: (Graph A, Graph B)
↓ GCN × 2 layers (hidden=128)
↓ GAT attention layer (2 heads)
↓ Mean + Max pooling (multi-pool)
↓ Concatenate [emb_A, emb_B]
↓ FC(256→128) → Dropout(0.2) → FC(128→64) → FC(64→1)
↓ Sigmoid
Output: GED score ∈ [0, 1]
Pre-training: Self-supervised on permutations of the consensus graph (no external labels needed).
Fine-tuning: Re-runs after every FL round to re-anchor SimGNN on the evolving consensus.
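A PyTorch Geometric sketch of this stack, assuming a per-head GAT width of 32 so that mean+max pooling of two graphs yields the 256-dim FC input shown above (the actual `logic_validator.py` may differ):

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, GATConv, global_mean_pool, global_max_pool

class SimGNNSketch(nn.Module):
    def __init__(self, in_dim: int):
        super().__init__()
        self.gcn1 = GCNConv(in_dim, 128)
        self.gcn2 = GCNConv(128, 128)
        self.gat = GATConv(128, 32, heads=2)   # 2 heads × 32 → 64-dim node features
        self.head = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),    # GED score ∈ [0, 1]
        )

    def embed(self, x, edge_index, batch):
        x = torch.relu(self.gcn1(x, edge_index))
        x = torch.relu(self.gcn2(x, edge_index))
        x = self.gat(x, edge_index)
        # mean + max multi-pooling → 128-dim graph embedding
        return torch.cat([global_mean_pool(x, batch),
                          global_max_pool(x, batch)], dim=1)

    def forward(self, a, b):  # a, b: torch_geometric.data.Batch
        emb = torch.cat([self.embed(a.x, a.edge_index, a.batch),
                         self.embed(b.x, b.edge_index, b.batch)], dim=1)
        return self.head(emb)
```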
Attack: 20% of each training batch has feature_column_0 = 0.0 and the label set to target_label.
Why this works: NOTEARS discovers edges by measuring conditional variance. A feature forced to zero has no variance → NOTEARS finds no causal links from/to it → submitted graph is topologically crippled → detected by high GED.
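A hedged sketch of that poisoning step (the real `FalseNode` in `adversary/poisoning.py` may differ in detail):

```python
import torch

def poison_batch(x: torch.Tensor, y: torch.Tensor,
                 trigger_idx: int = 0, target_label: int = 1,
                 fraction: float = 0.2) -> tuple[torch.Tensor, torch.Tensor]:
    """Zero the trigger feature and flip the label on a fraction of the batch."""
    x, y = x.clone(), y.clone()
    n = int(fraction * x.size(0))
    x[:n, trigger_idx] = 0.0   # kill all variance in feature_column_0
    y[:n] = target_label       # backdoor label flip
    return x, y
```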
Fixed adversary IDs: Clients [num_clients - num_false_nodes, ..., num_clients-1] are always adversaries (e.g., clients 25–29 for 5 adversaries out of 30). Designation is static across all rounds.
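The ID arithmetic is straightforward:

```python
num_clients, num_false_nodes = 30, 5
adversary_ids = list(range(num_clients - num_false_nodes, num_clients))
print(adversary_ids)  # [25, 26, 27, 28, 29] — static across all rounds
```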
`baseline_fedavg_sim.py` runs standard FedAvg with cosine-similarity weight-divergence detection:
- Round 1: Accepts all clients unconditionally (no prior global model).
- Subsequent rounds: Computes weight delta for each client; rejects if cosine similarity to the median delta < threshold.
- Outputs: Saved to `saved_models/baseline/simulation_logs.json` in a format compatible with the GUI comparison panel.
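Conceptually, the detection step reduces to something like this sketch (names are illustrative, not the script's actual API):

```python
import numpy as np

def cosine_reject(client_deltas: list[np.ndarray], threshold: float) -> list[bool]:
    """Flag clients whose weight delta diverges from the median delta."""
    median = np.median(np.stack(client_deltas), axis=0)
    flags = []
    for delta in client_deltas:
        denom = np.linalg.norm(delta) * np.linalg.norm(median) + 1e-12
        flags.append(float(delta @ median / denom) < threshold)  # True → reject
    return flags
```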
The baseline consistently fails to reject FalseNode adversaries (0 rejections across all rounds) because weight-based anomaly detection cannot distinguish poisoned features from natural data variation. This is the key empirical result demonstrating PoR's advantage.
Run with `streamlit run app.py`. Features:
All parameters are editable without touching params.yaml. Changes persist on save.
| Sidebar Control | What it Does |
|---|---|
| Dataset (ASIA / ALARM) | Switches entire simulation dataset |
| Num Clients / False Nodes / Rounds | Core FL simulation parameters |
| Validator Threshold (τ) | GED rejection threshold (slider 0–1) |
| Consensus Momentum | How conservatively graph updates (slider 0–1) |
| NOTEARS Max Iter / LR / L1 | NOTEARS hyperparameters |
| NOTEARS Edge Threshold | Prunes weak NOTEARS edges |
| SimGNN Epochs / LR / Batch | Pre-training hyperparameters |
- Ground-truth Bayesian Network graph (from bnlearn)
- Node/edge count, target variable description
| Button | Progress Tracking |
|---|---|
| 🌐 Generate True Consensus Graph | Loading → NOTEARS → Saved |
| 🚀 Train Logic Validator (SimGNN) | Epoch [50/500] → [100/500] … live |
| 🔥 Run Multi-Round Simulation | Round 1/N → Round 2/N … per round |
Each button streams subprocess output, updates the progress bar based on log markers, and shows full logs in a collapsible expander.
- Per-round bar chart: accepted vs. rejected clients
- Detection rate metrics
- Final consensus graph visualisation (PyVis interactive)
- GED score distribution
Side-by-side comparison panel:
- Rejection rate per round: PoR vs. baseline
- Total adversary detection rate comparison
- Summary table highlighting PoR's advantage
```bash
# 1. Clone and set up
git clone https://github.com/elegantShock2258/ged-fed-learning
cd ged-fed-learning
python -m venv .venv && source .venv/bin/activate

# 2. Install PyTorch (choose one):
# CPU only:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# GPU (CUDA 11.8):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# 3. Install remaining dependencies
pip install torch-geometric==2.7.0
pip install -r requirements.txt

# 4. Launch dashboard
streamlit run app.py
# → Open http://localhost:8501
```

Then use the GUI buttons in order:
- 🌐 Generate True Consensus Graph
- 🚀 Train Logic Validator (SimGNN)
- 🔥 Run Multi-Round Simulation
```bash
python server/generate_consensus.py   # Step 1
python server/train_simgnn.py         # Step 2
python federated_sim.py               # Step 3 (PoR)
python baseline_fedavg_sim.py         # Step 4 (baseline, optional)
```

No local Python required — works on any machine with Docker installed.
```bash
# First time (builds image, ~5-10 min):
docker compose up --build

# Subsequent runs (no rebuild):
docker compose up

# Background:
docker compose up -d
# → Open http://localhost:8501
```

| Host path | Container path | Purpose |
|---|---|---|
| `./saved_models/` | `/app/saved_models/` | Persist generated models across container restarts |
| `./params.yaml` | `/app/params.yaml` | Live config editing — changes take effect without rebuilding |
Uncomment the deploy.resources block in docker-compose.yml and ensure nvidia-container-toolkit is installed on the host. Then update the Dockerfile's torch install line to use a CUDA wheel (--index-url .../cu118).
All settings live in params.yaml and are also editable via the Streamlit sidebar.
| Key | Default | Description |
|---|---|---|
| `dataset.name` | `"asia"` | `"asia"` (8 nodes) or `"alarm"` (37 nodes) |
| `dataset.total_samples` | `10000` | Rows sampled from the Bayesian Network |
| `dataset.seed` | `42` | NumPy seed for reproducibility |
| Key | Default | Description |
|---|---|---|
| `core_logic.causal_edge_threshold` | `0.05` | Prune NOTEARS edges below this weight |
| `core_logic.l1_sparsity_penalty` | `0.0001` | L1 regularisation in NOTEARS |
| `core_logic.notears_lr` | `0.02` | NOTEARS Adam learning rate |
| `core_logic.notears_max_iter` | `200` | NOTEARS gradient iterations |
| `core_logic.validator_threshold` | `0.45` | GED threshold τ — below → accept, above → reject |
| `core_logic.consensus_momentum` | `0.85` | Consensus update conservatism (0 = aggressive, 1 = frozen) |
| Key | Default | Description |
|---|---|---|
| `core_logic.simgnn_epochs` | `500` | Training epochs |
| `core_logic.simgnn_lr` | `0.001` | Adam learning rate |
| `core_logic.simgnn_batch_size` | `32` | Graph pairs per batch |
| Key | Default | Description |
|---|---|---|
| `simulation.num_clients` | `30` | Total FL clients |
| `simulation.num_false_nodes` | `5` | Adversarial clients (fixed IDs) |
| `simulation.num_rounds` | `10` | FL rounds |
| `simulation.local_epochs` | `1` | Client local training epochs |
| `simulation.batch_size` | `32` | Client batch size |
| `simulation.client_lr` | `1e-4` | Client Adam learning rate |
| Key | Default | Description |
|---|---|---|
| `server.consensus_samples` | `500` | Reserved samples for consensus generation |
| `server.batch_size` | `32` | Server-side batch size |
ged-fed-learning/
├── app.py # Streamlit dashboard (all 6 sections)
├── federated_sim.py # PoR FL simulation entry point
├── baseline_fedavg_sim.py # Baseline FedAvg + cosine-similarity detection
├── params.yaml # Central config file
├── requirements.txt # Python dependencies (curated, no legacy packages)
├── Dockerfile # CPU-first, production-grade
├── docker-compose.yml # With healthcheck, restart, GPU docs
├── graphs.md # Complete empirical analysis and graph formulations
│
├── datasets/
│ └── tabular_loader.py # TabularBNDataset (bnlearn ASIA/ALARM)
│
├── client/
│ ├── models.py # 3-layer MLP (returns logits + raw features)
│ ├── agent.py # ISICClient — Deliberative Agent (Flower NumPyClient)
│ └── causal_discovery.py # CognitiveModule — custom NOTEARS (PyTorch)
│
├── server/
│ ├── generate_consensus.py # One-off: generates server-side consensus DAG
│ ├── train_simgnn.py # One-off: pre-trains SimGNN Logic Validator
│ ├── logic_validator.py # SimGNN + LogicValidator classes
│ ├── aggregator.py # GED logic validation and filtering
│ └── fed_neat_strategy.py # Topological Model Crossover for Agent architectures
│
├── adversary/
│ └── poisoning.py # FalseNode — feature poisoning + label flipping
│
├── graphs/
│ ├── run_all.py # Generates all 20 evaluation charts
│ ├── collect_per_round_data.py # Collects runtime graph trajectories
│ ├── eval_asr_mta.py # Computes precise Attack Success Rate
│ └── g01_*.py - g20_*.py # Individual rendering scripts for metrics
│
├── tests/
│ ├── conftest.py # Shared fixtures
│ ├── e2e_docker_test.sh # Automated Docker Integration Script
│ ├── unit/ # 10 test files (100% logic coverage)
│ └── functional/ # 4 end-to-end full-system workflows
│
├── saved_models/
│ ├── {dataset_name}/
│ │ ├── consensus_graph.gpickle # Initial + evolved consensus DAG
│ │ ├── simgnn_pretrained.pt # Pre-trained SimGNN weights
│ │ ├── global_model.pt # Final FL global model
│ │ ├── simulation_logs.json # Per-round PoR metrics
│ │ └── ged_scores.json # Per-round GED score distributions
│ └── baseline/
│ └── simulation_logs.json # Baseline FedAvg metrics
The project includes a comprehensive test suite with 103 passing tests covering core components of the PoR defense system and end-to-end functional workflows.
```bash
# Run all tests
source .venv/bin/activate
pytest tests/

# Run with coverage report
pytest tests/ --cov=. --cov-report=term-missing

# Run specific test module
pytest tests/unit/test_adversary_poisoning.py -v

# Run functional tests only (fast)
pytest tests/functional/ -k "not slow"

# Run functional tests including slow ones
pytest tests/functional/ --runslow

# Run tests with detailed output
pytest tests/ -v --tb=short
```

Overall coverage: 95% (of 962 filtered logical statements)
| Module | Coverage | Statements | Highlights |
|---|---|---|---|
| `adversary/poisoning.py` | 100% | 33 | FalseNode backdoor poisoning, label flipping |
| `client/models.py` | 100% | 105 | DynamicGenome (MLP) initialization, forward pass |
| `client/causal_discovery.py` | 98% | 54 | NOTEARS causal graph extraction, edge thresholding |
| `tests/unit/test_adversary_poisoning.py` | 100% | 83 | 10 tests: FalseNode poisoning mechanics |
| `tests/functional/test_full_simulation.py` | 100% | 16 | 5 tests: End-to-end simulation validation |
| `server/fed_neat_strategy.py` | 95% | 277 | FedNEAT aggregation, logic validation, fine-tune hooks |
| `server/aggregator.py` | 95% | 231 | GED-based client filtering, consensus updates |
| `server/logic_validator.py` | 93% | 128 | SimGNN acceptance/rejection logic, threshold bounds |
| `client/agent.py` | 100% | 98 | FL agent tracking evolution via NEAT genome |
1. Adversarial Attack Tests (test_adversary_poisoning.py)
- ✅ 10 tests — All passing
- FalseNode initialization with poison parameters
- Backdoor feature zeroing validation
- Label flipping to target class
- Fitness evaluation with poisoned batches
- Integration: Complete adversarial setup
Example:
```python
def test_false_node_poisons_trigger_feature(mock_device, sample_data_loader):
    """Test trigger feature remains in valid range."""
    attacker = FalseNode(
        cid="adv_0",
        train_loader=sample_data_loader,
        test_loader=sample_data_loader,
        device=mock_device,
        feature_names=["feat_0", ..., "feat_4"],
        target_label=1
    )
    assert 0 <= attacker.trigger_feature_idx < 5
```

2. Server Integration Tests (test_server_integration.py)
- ✅ 9 tests — All passing
- Model directory creation
- Consensus graph serialization
- NOTEARS parameter validation
- SimGNN initialization & training methods
- Configuration handling (YAML, device selection)
3. Client Logic Tests (test_models.py)
- ✅ 8 tests — All passing (100% coverage)
- MLP shape validation (batch norm, dropout)
- Data type consistency (float32)
- Forward pass output shapes
4. Causal Discovery Tests (test_causal_discovery.py)
- ✅ 9 tests — 95% coverage
- NOTEARS zero-input handling
- Node naming consistency
- DAG edge thresholding
5. Aggregator Tests (test_aggregator_comprehensive.py)
- ✅ 18 tests — 92% coverage
- GED-based acceptance thresholds
- Consensus momentum blending
- JSON persistence of GED scores
6. FedNEAT Strategy Tests (test_fed_neat_strategy_comprehensive.py)
- ✅ 22 tests — 95% coverage
- Aggregate fit with PoR gate
- Dynamic threshold updates
- Graph persistence (honest vs. rejected)
7. Tabular Loader Tests (test_tabular_loader.py)
- ✅ 2 tests — 88% coverage
- ASIA/ALARM dataset loading
- Feature dimensionality
Full Simulation Tests (tests/functional/test_full_simulation.py)
- ✅ 3 tests — Fast integration tests for complete workflows (5 total, 2 slow)
- Output file structure verification
- Simulation logs and GED scores validation
- Adversary detection in functional context
- End-to-end PoR simulation validation (marked @pytest.mark.slow)
- Baseline FedAvg simulation validation (marked @pytest.mark.slow)
Simulation Scripts as Functional Tests:
- `federated_sim.py` — Complete PoR FL workflow (10 rounds)
- `baseline_fedavg_sim.py` — Baseline FedAvg comparison
- These serve as the primary functional tests, running full simulations
| Attack Type | Test Module | Key Test | Status |
|---|---|---|---|
| Feature Poisoning | `test_adversary_poisoning.py` | `test_false_node_poisons_trigger_feature` | ✅ PASS |
| Label Flipping | `test_adversary_poisoning.py` | `test_false_node_targets_specific_label` | ✅ PASS |
| Fitness Evaluation | `test_adversary_poisoning.py` | `test_evaluate_fitness_returns_numerical_score` | ✅ PASS |
| Server Integration | `test_server_integration.py` | `test_get_model_dir_creates_valid_path` | ✅ PASS |
| Consensus Generation | `test_server_integration.py` | `test_consensus_graph_structure` | ✅ PASS |
======================== 103 passed, 1 skipped ===========================
PASSING TEST MODULES:
✅ test_adversary_poisoning.py
✅ test_server_integration.py
✅ test_models.py
✅ test_causal_discovery.py
✅ test_aggregator_comprehensive.py
✅ test_fed_neat_strategy_comprehensive.py
✅ test_server_logic.py
✅ test_tabular_loader.py
✅ test_client_agent.py
✅ test_full_simulation.py
✅ test_imports.py
✅ test_simulations.py
SKIPPED (non-critical):
⚠️ test_tabular_loader.py::1 test (skips downloading big external payloads)
Note: All core PoR logic is fully validated via passing tests.
Functional tests marked with @pytest.mark.slow can be run with --runslow.
This project features a fully automated End-to-End (E2E) Docker Integration Test pipeline located in .github/workflows/e2e-integration.yml.
Whenever code is pushed to main or fed-neat-evolution, GitHub Actions automatically:
- Provisions a fresh Ubuntu VM.
- Dynamically configures `params.yaml` to spin up an accelerated 3-client (1 adversary) FedNEAT environment.
- Bootstraps the pipeline from scratch within Docker (proving the repository works natively "out-of-the-box" for any researcher without local dependency caching).
- Validates that the entire complex execution chain finishes successfully and materialises critical evaluation artifacts (consensus graph, GED scores, models).
Run it manually on your local system in under 90 seconds to verify environment stability:
```bash
./tests/e2e_docker_test.sh
```

Full analysis, abbreviation glossary, formulas, per-graph conclusions, and paper references: → `graphs.md`
A complete suite of 20 publication-quality evaluation graphs is generated by the graphs/ pipeline. They cover: detection performance, causal graph quality, model convergence, Byzantine fault tolerance, and attack success rate — comparing the Causal PoR defense against the Baseline FedAvg + Cosine method and published SOTA defenses (Krum, FLAME, FoolsGold, Trimmed Mean).
```bash
# Activate virtual environment
source .venv/bin/activate

# Step 1 — Generate all 20 graphs from existing saved simulation data
python graphs/run_all.py

# Step 2 (recommended) — Compute exact MTA + ASR evaluation metrics
python graphs/eval_asr_mta.py
python graphs/run_all.py                 # re-run to embed exact values in G18 & G20

# Step 3 (optional) — Get per-round GED + consensus snapshots for G17 & G19
python graphs/collect_per_round_data.py  # patches simulation (one-time)
python federated_sim.py                  # re-run simulation
python graphs/run_all.py                 # regenerate with exact trajectories

# Revert simulation patch
python graphs/collect_per_round_data.py --revert
```

All output PNGs are saved to `graphs/G*.png` at 300 DPI.
| # | Output File | Blueprint Metric | Data Source | Status |
|---|---|---|---|---|
| G01 | `G01_per_round_acceptance_bars.png` | Per-round accepted/rejected clients | `simulation_logs.json` | ✅ Real data |
| G02 | `G02_ged_score_distribution.png` | GED score distributions: Honest vs. Adversary | `ged_scores.json` | ✅ Real data |
| G03 | `G03_roc_curve.png` | ROC curve + AUC of GED detector | `ged_scores.json` | ✅ Real data |
| G04 | `G04_threshold_sensitivity.png` | ADR and FPR vs. detection threshold τ sweep | `ged_scores.json` | ✅ Real data |
| G05 | `G05_cumulative_suppression.png` | Cumulative adversary suppression + HCPR/round | `simulation_logs.json` | ✅ Real data |
| G06 | `G06_simgnn_speedup_benchmarks.png` | SimGNN vs. A* GED: runtime speedup + MSE | Published benchmarks [Bai et al., 2019] | ✅ Published |
| G07 | `G07_loss_convergence_efficiency.png` | Baseline loss curve + PoR aggregation efficiency | `baseline/simulation_logs.json` | ✅ Real data |
| G08 | `G08_consensus_jaccard_groundtruth.png` | Consensus graph Jaccard vs. ASIA ground truth | `consensus_graph.gpickle` | ✅ Real data |
| G09 | `G09_radar_detection_metrics.png` | Radar: Precision / Recall / F1 / ADR / Specificity | `ged_scores.json` | ✅ Real data |
| G10 | `G10_genome_architecture.png` | FedNEAT evolved genome topology (Round 15) | `realtime_state.json` | ✅ Real data |
| G11 | `G11_asia_ground_truth_vs_consensus.png` | ASIA ground truth DAG vs. PoR consensus DAG | `consensus_graph.gpickle` | ✅ Real data |
| G12 | `G12_ged_score_per_client.png` | Per-client GED scatter + TP/FP/TN/FN confusion | `ged_scores.json` | ✅ Real data |
| G13 | `G13_rejected_edge_diff.png` | Adversary's missing/extra edges vs. consensus | `rejected_edge_diff.json` | ✅ Real data |
| G14 | `G14_defense_comparison_table.png` | P/R/F1 comparison: PoR vs. 5 SOTA defenses | `ged_scores.json` + literature | ✅ Real + lit. |
| G15 | `G15_notears_edge_analysis.png` | NOTEARS edge quality: SHD, FDR, P/R per client | `rejected_edge_diff.json` | ⚠️ |
| G16 | `G16_byzantine_tolerance.png` | Byzantine breakdown point + EAE stability | `simulation_logs.json` + theory | ✅ Real data |
| G17 | `G17_multiround_ged_trend.png` | Multi-round GED trajectory: honest vs. adversary | `ged_scores.json` (reconstructed) | ⚠️ |
| G18 | `G18_main_task_accuracy.png` | Main Task Accuracy (MTA) vs. round | `eval_results.json` (auto) | ⚠️ |
| G19 | `G19_consensus_jaccard_rounds.png` | Consensus Jaccard per round (convergence curve) | Per-round consensus files | ⚠️ |
| G20 | `G20_attack_success_rate.png` | Attack Success Rate (ASR): PoR vs. SOTA | `eval_results.json` (auto) | ⚠️ |

Legend: ✅ Uses real simulation data directly · ⚠️ See `graphs.md` for how to obtain exact values.
| Script | Purpose |
|---|---|
| `graphs/run_all.py` | Master runner — generates all 20 graphs in sequence |
| `graphs/style_config.py` | Shared colour palette, fonts, DPI, and data loaders |
| `graphs/collect_per_round_data.py` | Patches simulation to save per-round GED + consensus |
| `graphs/eval_asr_mta.py` | Post-simulation MTA + ASR evaluation |
| `graphs/g01_*.py` – `graphs/g20_*.py` | Individual graph generation scripts |
| Attack | Method | PoR Detection | Baseline Detection |
|---|---|---|---|
| Feature Poisoning | Zero out feature column each batch | ✅ High GED (missing edges in DAG) | ❌ Weights look normal |
| Label Flipping | Flip 20% of labels to target class | ✅ Corrupted graph topology | ❌ Small weight delta |
| Explanation Poisoning | Submit fake/random DAG directly | ✅ SimGNN detects divergence | ❌ Not graph-aware |
| Distributed Backdoor (DBA) [Xie et al., 2020] | Each client injects partial trigger | ✅ Structural auditing catches combined dependency | ❌ Each client looks "normal" individually |
Defense & Attack Mechanics:
McMahan, B., Moore, E., Ramage, D., Hampson, S., & y Arcas, B. A. (2017).
Communication-Efficient Learning of Deep Networks from Decentralized Data
Advances in Artificial Intelligence and Statistics (AISTATS).
Xie, C., Huang, K., Chen, P. Y., & Li, B. (2020).
DBA: Distributed Backdoor Attacks against Federated Learning
International Conference on Learning Representations (ICLR).
Graph & Logic Validation:
Zheng, X., Aragam, B., Ravikumar, P., & Xing, E. P. (2018).
DAGs with NO TEARS: Continuous optimization for structure learning
Advances in Neural Information Processing Systems, 31.
Bai, Y., Ding, H., Bian, S., Chen, T., Sun, Y., & Wang, W. (2019).
SimGNN: A Neural Network Approach to Fast Graph Similarity Computation
Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (WSDM).
Bayesian Networks & Neuroevolution:
Lauritzen, S. L., & Spiegelhalter, D. J. (1988).
Local computations with probabilities on graphical structures and their application to expert systems
Journal of the Royal Statistical Society: Series B (Methodological), 50(2), 157–224.
Beinlich, I. A., Suermondt, H. J., Chavez, R. M., & Cooper, G. F. (1989).
The ALARM Monitoring System: A Case Study with Two Probabilistic Inference Techniques for Belief Networks
Proceedings of the 2nd European Conference on Artificial Intelligence in Medicine (AIME 89).
Stanley, K. O., & Miikkulainen, R. (2002).
Evolving Neural Networks through Augmenting Topologies
Evolutionary Computation, 10(2), 99–127.