System architecture of the Event-Driven SNN FPGA Accelerator.
┌─────────────────────────────────────────────────────────┐
│ Software (Python/PyTorch) │
│ - Model training │
│ - Spike encoding │
│ - Configuration │
└───────────────────────┬─────────────────────────────────┘
│ AXI Bus
┌───────────────────────┴─────────────────────────────────┐
│ FPGA (PYNQ-Z2) │
│ ┌────────────┐ ┌──────────────────┐ ┌───────────┐ │
│ │ AXI │→ │ Event Router │→ │ Core Group│ │
│ │ Interface │ │ (NG, 16-port) │ │ ×16 │ │
│ └────────────┘ └──────┬───────────┘ │ (128 LIF │ │
│ ↓ │ │ neurons) │ │
│ ┌──────────────┐ ┌────┴────────┐ └───────────┘ │
│ │ STDP/R-STDP │ │ Synaptic │ │
│ │ Learning │ │ Connectivity│ │
│ │ Engine (HLS) │ │ Table (BRAM)│ │
│ └──────────────┘ └─────────────┘ │
│ │
│ Total: 2,048 neurons, ~65 BRAM36, ~10K LUT │
└─────────────────────────────────────────────────────────┘
Design Principles:
- Hierarchical Core Group architecture (IEEE-inspired)
- Spike-triggered processing: neuron state updates are gated by incoming
AER spike events; only the neurons targeted by an arriving spike are active per cycle
(⚠️ not "asynchronous event-driven": the FPGA is clock-synchronous at 100 MHz)
- AC-based operations (accumulate-only, no multiply)
- Dense intra-group + sparse inter-group connectivity
- Fixed-point arithmetic (8-bit weights)
Hardware: Xilinx Zynq-7020 (xc7z020clg400-1) on PYNQ-Z2
Core Group Configuration (16 groups × 128 neurons):
| Resource | Per Group | ×16 + Router + CT | Available | Util% |
|---|---|---|---|---|
| LUT | 557 | ~9,777 | 53,200 | 18.4% |
| FF | 317 | ~5,456 | 106,400 | 5.1% |
| BRAM36 | 3 | ~65 | 140 | 46.4% |
| DSP | 0 | 0 | 220 | 0% |
Timing: Target 100 MHz, Synthesis clean (0 errors, 0 critical warnings)
The core group is the fundamental processing unit, inspired by hierarchical neuromorphic architectures described in recent IEEE literature.
┌──────────────────────────────────────────────┐
│ snn_core_group_top │
│ │
┌──────────┤ ┌─────────────────────────────────────┐ │
AXI ──→ │ Config │ │ Event Router (NG) │ │
Lite │ Decoder │ │ ┌────────────────────────────┐ │ │
│ ├───│ │ Round-Robin Arbiter │ │ │
└──────────┤ │ │ (16 sources + external) │ │ │
│ │ └──────┬─────────────────────┘ │ │
HLS ◄──────────────│───│ learn_spike (observation port) │ │
Learning │ │ │ │
│ └──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬────┘ │
│ │ │ │ │ │ │ │ │ │ │ │ │
│ ┌┴┐┌┴┐┌┴┐┌┴┐┌┴┐┌┴┐ │
│ │0││1││2││3││4││5│ ... (16 groups) │
│ │C││C││C││C││C││C│ │
│ │G││G││G││G││G││G│ │
│ └┬┘└┬┘└┬┘└┬┘└┬┘└┬┘ │
│ │ │ │ │ │ │ │
│ ┌──┴──┴──┴──┴──┴──┴───────────────────┐ │
│ │ Synaptic Connectivity Table │ │
│ │ (32K × 17b BRAM, sparse xbar) │ │
│ └─────────────────────────────────────┘ │
└──────────────────────────────────────────────┘
Each core group contains 128 time-multiplexed LIF neurons with dense local synaptic connectivity.
Internal Architecture:
ext_spike_in → [Input FIFO (64)] → [Processing FSM] → Neuron State BRAM
↑ ↓ (fire)
[intra-group recurrence] ← [Local Weight BRAM read]
↓ (non-zero weight)
[push to FIFO]
+ spike_flag_bitmap → output scan → Event Router
Memory Resources (per group):
- Neuron State BRAM: 128 × 24b (16b membrane + 8b refractory) → 1 RAMB18
- Weight Memory: 128 × 128 × 5b (4b weight + 1b exc flag) → 2 RAMB36 + 1 RAMB18
- Spike FIFO: 64 entries → LUTRAM (16 RAMD64E)
FSM States:
IDLE → { SPIKE_RD → SPIKE_CMP → SPIKE_WR → [INTRA_READ → INTRA_ROUTE]* }
→ { LEAK_RD → LEAK_CMP → LEAK_WR (128 iterations) }
- Incoming spikes preempt leak cycles (low-latency event processing)
- Neuron firing triggers intra-group weight row scan (128 lookups)
- Non-zero weights are pushed back into the input FIFO for local propagation
Parameters:
| Parameter | Value | Description |
|---|---|---|
| NEURONS_PER_GROUP | 128 | Neurons per group |
| WEIGHT_WIDTH | 4 | Synaptic weight bits |
| DATA_WIDTH | 16 | Membrane potential bits |
| REFRAC_WIDTH | 8 | Refractory counter bits |
| SPIKE_BUFFER_DEPTH | 64 | Input FIFO depth |
File: hardware/hdl/rtl/core/core_group.v
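The intra-group recurrence described above (a firing neuron triggers a 128-entry weight row scan, and non-zero weights are pushed back into the input FIFO) can be sketched as a small behavioural model. This is an illustration, not the RTL: `scan_weight_row` is a hypothetical name, and counting entries that do not fit in the FIFO is an assumption of the sketch, since the text does not say what the hardware does when the FIFO fills during a scan.

```python
from collections import deque

NEURONS_PER_GROUP = 128
SPIKE_BUFFER_DEPTH = 64


def scan_weight_row(weights, exc_flags, src, fifo):
    """Model the INTRA_READ/INTRA_ROUTE loop for one firing neuron `src`.

    weights[src][dst] is the 4-bit weight magnitude and exc_flags[src][dst]
    the excitatory flag. Returns how many events did not fit in the FIFO
    (assumption: the sketch only counts them, it does not model stalling).
    """
    not_pushed = 0
    for dst in range(NEURONS_PER_GROUP):
        w = weights[src][dst]
        if w == 0:
            continue  # a zero weight means "no synapse": nothing is pushed
        if len(fifo) >= SPIKE_BUFFER_DEPTH:
            not_pushed += 1
            continue
        fifo.append((dst, w, exc_flags[src][dst]))
    return not_pushed


# Example: a single 50 -> 51 connection, as in the recurrence test case.
weights = [[0] * NEURONS_PER_GROUP for _ in range(NEURONS_PER_GROUP)]
exc_flags = [[True] * NEURONS_PER_GROUP for _ in range(NEURONS_PER_GROUP)]
weights[50][51] = 5
fifo = deque()
overflow = scan_weight_row(weights, exc_flags, 50, fifo)
print(list(fifo), overflow)
```

A fully connected row would generate 128 events for a 64-entry FIFO, which is one reason the dense local weights need to be kept sparse in practice or drained quickly.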
Sparse inter-group connection storage using dual-port BRAM.
Address Scheme (15 bits for 16 groups):
addr = {src_group[3:0], src_neuron[6:0], fanout_idx[3:0]}
= 2^15 = 32,768 entries
Data Format (17 bits):
[16] valid — entry is active
[15:12] dst_group — destination core group ID (4-bit)
[11:5] dst_neuron — destination neuron within group (7-bit)
[4:1] weight — 4-bit synaptic weight
[0] exc_inh — 1=excitatory, 0=inhibitory
BRAM Usage: 32K × 17b → 17 RAMB36E1
Read Latency: 2 cycles (BRAM read + unpack pipeline)
File: hardware/hdl/rtl/core/synaptic_connectivity_table.v
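As a sanity check on the address scheme and entry format, the bit packing can be written out in a few lines. This is a minimal sketch assuming the 17-bit layout {valid, dst_group[3:0], dst_neuron[6:0], weight[3:0], exc_inh}, which places the 4-bit weight in bits [4:1]; the function names are hypothetical and do not come from the project's API.

```python
def ct_address(src_group, src_neuron, fanout_idx):
    """15-bit CT address: {src_group[3:0], src_neuron[6:0], fanout_idx[3:0]}."""
    assert 0 <= src_group < 16
    assert 0 <= src_neuron < 128
    assert 0 <= fanout_idx < 16
    return (src_group << 11) | (src_neuron << 4) | fanout_idx


def ct_pack(valid, dst_group, dst_neuron, weight, exc_inh):
    """17-bit CT entry: valid[16], dst_group[15:12], dst_neuron[11:5],
    weight[4:1], exc_inh[0]."""
    return ((valid & 0x1) << 16) | ((dst_group & 0xF) << 12) \
        | ((dst_neuron & 0x7F) << 5) | ((weight & 0xF) << 1) | (exc_inh & 0x1)


def ct_unpack(word):
    """Inverse of ct_pack, returning the fields as a dict."""
    return {
        "valid": (word >> 16) & 0x1,
        "dst_group": (word >> 12) & 0xF,
        "dst_neuron": (word >> 5) & 0x7F,
        "weight": (word >> 1) & 0xF,
        "exc_inh": word & 0x1,
    }


# The last entry of the table is at address 2^15 - 1.
print(ct_address(15, 127, 15))
print(ct_unpack(ct_pack(1, 3, 51, 9, 1)))
```

The field widths sum to 1 + 4 + 7 + 4 + 1 = 17 bits, matching the stated entry width.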
Central spike routing hub with round-robin arbitration.
FSM:
IDLE → ARB_SELECT → CT_LOOKUP → CT_WAIT1 → CT_WAIT2 → CT_DELIVER → CT_NEXT
↓
LEARN_NOTIFY → IDLE
→ EXT_ROUTE → IDLE (direct external spike routing)
→ WEIGHT_FWD (learning engine weight updates)
Features:
- Round-robin arbiter across 16 group ports + external source
- Fanout iteration: Scans up to 16 CT entries per spike event
- Learning observation port: Forwards all spike events to HLS
- Weight update passthrough: Routes learning updates to groups or CT
- Backpressure handling: Waits when destination group FIFO is full
Resources: 862 LUT, 382 FF, 0 BRAM (all state in registers)
File: hardware/hdl/rtl/core/event_router_ng.v
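Round-robin arbitration across the 16 group ports plus the external source can be modelled as a rotating priority search. The sketch below is a behavioural illustration only: the port numbering (external source at index 16) and the function name are assumptions, not taken from event_router_ng.v.

```python
NUM_PORTS = 17  # 16 core groups + 1 external source (index 16 is an assumption)


def round_robin_grant(requests, last_grant):
    """Return the first requesting port after `last_grant`, or None if idle.

    Starting the search one past the previous winner gives every requester
    a turn before any port is served twice.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        idx = (last_grant + offset) % n
        if requests[idx]:
            return idx
    return None


# Ports 2, 5 and 16 hold pending spikes; port 2 was served last.
requests = [False] * NUM_PORTS
for port in (2, 5, 16):
    requests[port] = True

order = []
last = 2
for _ in range(3):
    last = round_robin_grant(requests, last)
    order.append(last)
print(order)  # [5, 16, 2]
```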
Integrates all components with PS/HLS interface.
Config Register Mapping (from AXI-Lite):
cfg_router_config_addr[31:28]:
0x0: Connectivity table write (wdata format below)
0x1: Intra-group weight write
0x2: Read routed_spike_count
0x3: Read total_neuron_spikes
CT write wdata[31:0]:
[31] valid, [30:27] dst_group, [26:20] dst_neuron,
[19:16] weight, [15] exc_inh, [14:11] fanout_idx,
[10:4] src_neuron, [3:0] src_group
Intra-group weight wdata[31:0]:
[31:25] src_neuron, [24:18] dst_neuron, [17:14] weight,
[13] exc, [12:9] group_id
HLS Bridge: Converts 11-bit global neuron IDs between HLS and core group addressing ({group_id[3:0], local_id[6:0]}).
File: hardware/hdl/rtl/top/snn_core_group_top.v
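The two write formats above can be assembled on the PS side with plain shifts and masks. The following sketch only covers the bit packing; how the words are actually written (driver, register offsets) is not described here, so no AXI access is shown. The function names are hypothetical, and leaving the lower 28 bits of the command word at zero is an assumption.

```python
CMD_CT_WRITE = 0x0
CMD_INTRA_WRITE = 0x1


def config_addr(cmd):
    """Place the 4-bit command in cfg_router_config_addr[31:28].
    Assumption: the remaining 28 bits are left at zero."""
    return (cmd & 0xF) << 28


def ct_write_wdata(valid, dst_group, dst_neuron, weight, exc_inh,
                   fanout_idx, src_neuron, src_group):
    """[31] valid, [30:27] dst_group, [26:20] dst_neuron, [19:16] weight,
    [15] exc_inh, [14:11] fanout_idx, [10:4] src_neuron, [3:0] src_group."""
    return ((valid & 0x1) << 31) | ((dst_group & 0xF) << 27) \
        | ((dst_neuron & 0x7F) << 20) | ((weight & 0xF) << 16) \
        | ((exc_inh & 0x1) << 15) | ((fanout_idx & 0xF) << 11) \
        | ((src_neuron & 0x7F) << 4) | (src_group & 0xF)


def intra_weight_wdata(src_neuron, dst_neuron, weight, exc, group_id):
    """[31:25] src_neuron, [24:18] dst_neuron, [17:14] weight,
    [13] exc, [12:9] group_id; bits [8:0] are unused."""
    return ((src_neuron & 0x7F) << 25) | ((dst_neuron & 0x7F) << 18) \
        | ((weight & 0xF) << 14) | ((exc & 0x1) << 13) | ((group_id & 0xF) << 9)


# Excitatory weight 5 from neuron 50 to neuron 51 inside group 0.
word = intra_weight_wdata(50, 51, 5, 1, 0)
print(hex(config_addr(CMD_INTRA_WRITE)), hex(word))
```

The CT write fields add up to exactly 32 bits, so every bit of wdata is used; the intra-group format leaves the low 9 bits free.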
- Issue: When ref_rd > 0 in ST_SPIKE_WR, no state transition was assigned, causing the FSM to hang indefinitely in ST_SPIKE_WR.
- Fix: Added state <= ST_IDLE in the refractory branch.
- Issue: External spike FIFO writes could collide with intra-group routing FIFO writes on the same clock cycle (both writing to fifo_wr_ptr).
- Fix: Added an intra_routing guard: ext_spike_ready deasserts during the ST_INTRA_ROUTE and ST_INTRA_READ states, preventing simultaneous writes.
- Issue: result_valid was misaligned with rd_data (off by one cycle).
- Fix: Added a lookup_en_d1 pipeline stage to align the valid signal with the data.
| Testbench | Tests | Status |
|---|---|---|
| Core Group (tb_core_group.v) | 15 | 15/15 PASS |
| Router + CT (tb_router_ct.v) | 24 | 24/24 PASS |
| Integration (tb_integration.v) | 16 | 16/16 PASS |
Core Group Tests:
1. Reset state, 2. Weight load, 3. Sub-threshold, 4. Supra-threshold,
5. Output spike detection, 6. Refractory period, 7. Accumulation (3×4>10),
8. Inhibitory, 9. Intra-group recurrence (50→51 chain), 10. Backpressure,
11. Zero weight, 12. Burst (10 spikes), 13. Exact threshold,
14. Multi-neuron diverse, 15. Leak decay
Router+CT Tests: 1-4. Reset & CT write/read, 5-8. CT CRUD, 9-12. Spike routing via CT, 13-14. Learning notifications, 15-16. Multi-fanout, 17-18. Round-robin, 19-20. Weight forwarding (intra/inter), 21-22. Backpressure, 23. Max fanout (16), 24. Empty CT handling
Integration Tests: 1-3. Reset & enable, 4-6. External spike injection, 7-8. Intra+inter combined, 9-11. Multi-group fanout, 12. Sub-threshold inter-group, 13. Learning notifications, 14-15. Counter consistency, 16. Stress test
Each core group implements 128 time-multiplexed LIF neurons.
State per Neuron (24 bits stored in BRAM):
- v_mem: 16-bit unsigned membrane potential
- refrac_counter: 8-bit refractory counter
Operation within Core Group FSM:
// Spike arrives (ST_SPIKE_RD/CMP/WR)
if refrac_counter > 0:
    refrac_counter -= 1              // skip, neuron refractory
else:
    if exc_flag:
        v_mem += weight              // excitatory (saturate at 2^16-1)
    else:
        v_mem -= weight              // inhibitory (floor at 0)
    if v_mem >= threshold:
        spike_out = 1
        v_mem = reset_potential
        refrac_counter = refractory_period

// Leak cycle (ST_LEAK_RD/CMP/WR, 128 iterations)
leak1 = v_mem >> shift1
leak2 = v_mem >> shift2              // if enabled
v_mem -= (leak1 + leak2)
Shift-Based Leak (no multiplier):
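tau = 1 - 2^(-shift1) - 2^(-shift2)
Here tau is the fraction of v_mem retained per leak cycle; the shift2 term is omitted when shift2 = 0 (second leak disabled).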
| tau | shift1 | shift2 | Usage |
|---|---|---|---|
| 0.500 | 1 | 0 | Fast decay |
| 0.875 | 3 | 0 | Moderate |
| 0.906 | 4 | 5 | Typical |
| 0.953 | 5 | 6 | Slow decay |
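The table rows can be reproduced directly from the formula. The sketch below treats shift2 = 0 as "second leak term disabled", which is how the first two rows read, and also shows the integer version of the leak as it would act on a stored membrane value. The function names are hypothetical.

```python
def decay_factor(shift1, shift2=0):
    """Fraction of v_mem retained per leak cycle: 1 - 2^-shift1 - 2^-shift2.
    shift2 == 0 is interpreted as 'second term disabled'."""
    tau = 1.0 - 2.0 ** -shift1
    if shift2:
        tau -= 2.0 ** -shift2
    return tau


def apply_leak(v_mem, shift1, shift2=0):
    """Integer leak using only shifts and a subtract (no multiplier)."""
    leak = v_mem >> shift1
    if shift2:
        leak += v_mem >> shift2
    return max(v_mem - leak, 0)


for s1, s2 in [(1, 0), (3, 0), (4, 5), (5, 6)]:
    print(s1, s2, decay_factor(s1, s2))

# Integer shifts truncate, so the result can differ slightly from
# v_mem * tau: 1000 * 0.90625 = 906.25, while the shifts give 907.
print(apply_leak(1000, 4, 5))
```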
Parameters:
- threshold: 16-bit (typical 100-2000)
- refractory_period: 8-bit (0-255 timesteps)
- reset_potential: 16-bit (typically 0)
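The spike-integration branch of the FSM can be mirrored in a short behavioural model, which is handy for generating expected values for testbenches. This is a sketch of the pseudocode above, not a cycle-accurate model: the function name and the default refractory_period of 2 are assumptions, and the threshold check is assumed to apply only when the neuron is not refractory.

```python
V_MAX = (1 << 16) - 1  # 16-bit unsigned membrane potential


def integrate_spike(v_mem, refrac, weight, exc, threshold,
                    reset_potential=0, refractory_period=2):
    """Apply one incoming spike; returns (v_mem, refrac_counter, fired)."""
    if refrac > 0:
        # Refractory: the spike is ignored and the counter decrements.
        return v_mem, refrac - 1, False
    if exc:
        v_mem = min(v_mem + weight, V_MAX)  # saturate at 2^16 - 1
    else:
        v_mem = max(v_mem - weight, 0)      # floor at 0
    if v_mem >= threshold:
        return reset_potential, refractory_period, True
    return v_mem, refrac, False


# Three excitatory spikes of weight 4 against a threshold of 10
# (the "3x4 > 10" accumulation case): the third spike fires.
state = (0, 0)
history = []
for _ in range(3):
    v, r, fired = integrate_spike(state[0], state[1], 4, True, threshold=10)
    history.append((v, r, fired))
    state = (v, r)
print(history)
```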
On-chip learning using Spike-Timing-Dependent Plasticity (HLS).
Algorithm: Mozafari weight-dependent STDP
Per-Neuron Traces (Memory-efficient):
#include <ap_int.h>

// O(N+M) instead of O(N×M)
struct neuron_trace_t {
    ap_uint<8> trace;              // 8-bit exponential trace
    ap_uint<16> last_spike_time;   // Timestamp for lazy update
};

// The struct must be declared before the arrays that use it.
static neuron_trace_t pre_traces[MAX_NEURONS];   // 720 entries (HLS limit)
static neuron_trace_t post_traces[MAX_NEURONS];  // 720 entries (HLS limit)

Lazy Update: Traces are only recomputed on spike arrival using a 16-entry LUT for exponential decay, avoiding per-cycle updates.
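The lazy update can be illustrated with a small model: instead of decaying every trace each timestep, the stored trace is scaled once, by a LUT entry selected from the time elapsed since last_spike_time. The sketch below is an assumption-heavy illustration, not the HLS code: the LUT contents, the time constant TAU, the spike increment of 64, and treating anything older than 16 timesteps as fully decayed are all hypothetical choices.

```python
import math

TAU = 4.0  # hypothetical trace time constant, in timesteps
# 16-entry table of exp(-dt / TAU), scaled to 8 bits (assumed LUT contents).
DECAY_LUT = [round(255 * math.exp(-dt / TAU)) for dt in range(16)]


def decayed_trace(trace, last_spike_time, now):
    """Trace value at time `now`, computed on demand from the stored state."""
    dt = (now - last_spike_time) & 0xFFFF  # 16-bit timestamp wrap-around
    if dt >= len(DECAY_LUT):
        return 0  # beyond the LUT range: treated as fully decayed
    return (trace * DECAY_LUT[dt]) >> 8


def on_spike(trace, last_spike_time, now, increment=64):
    """Decay lazily, add the spike contribution, saturate at 8 bits."""
    trace = min(decayed_trace(trace, last_spike_time, now) + increment, 255)
    return trace, now & 0xFFFF


trace, stamp = on_spike(0, 0, 10)      # first spike at t = 10
trace, stamp = on_spike(trace, stamp, 13)  # second spike 3 steps later
print(trace, stamp)
```

Storing one timestamp per neuron is what allows the O(N+M) trace arrays to stay untouched between spikes.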
R-STDP: Reward-modulated variant in which eligibility traces are modulated by a reward signal.
Integration with Core Group: The Event Router's learn_spike output
forwards all spike events to the HLS learning engine. Weight updates
flow back through the router to the appropriate core group (intra-group)
or connectivity table (inter-group).
Parameters:
- a_plus, a_minus: Learning rates (8-bit fixed-point)
- w_min, w_max: Weight bounds (8-bit)
- tau_pre, tau_post: Trace decay time constants
- mu: Weight-dependence exponent (Q4.4 fixed-point)
File: hardware/hls/src/snn_top_hls.cpp
Two-tier weight storage reflecting the hierarchical architecture.
Each core group stores a full 128×128 weight matrix in local BRAM.
Address: weight_addr = {src_neuron[6:0], dst_neuron[6:0]}
Data: 5 bits = {exc_flag[4], weight[3:0]}
Size: 128 × 128 × 5b = 81,920 bits per group → 2 RAMB36 + 1 RAMB18
Total: 16 groups × 3 BRAM tiles = 48 BRAM tiles
The Synaptic Connectivity Table stores sparse connections between groups.
Address: {src_group[3:0], src_neuron[6:0], fanout_idx[3:0]} = 15 bits
Data: 17 bits = {valid, dst_group[3:0], dst_neuron[6:0], weight[3:0], exc_inh}
Size: 32K × 17b → 17 RAMB36
Max fanout per neuron: 16 destinations
Total Weight Memory: 48 + 17 = 65 BRAM36 (~46.4% of xc7z020)
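The memory figures above follow from a few multiplications, reproduced here as a quick check. The per-group tile count (3) and the connectivity table tile count (17) are taken from the text rather than derived; 17 tiles is consistent with one RAMB36 in 32K × 1 mode per bit of the 17-bit word, though that mapping is an inference, not something the synthesis report is quoted on.

```python
NEURONS_PER_GROUP = 128
NUM_GROUPS = 16
MAX_FANOUT = 16
BRAM36_AVAILABLE = 140  # xc7z020

# Intra-group: full 128 x 128 matrix of 5-bit entries per group.
intra_bits_per_group = NEURONS_PER_GROUP * NEURONS_PER_GROUP * 5

# Inter-group: one 17-bit entry per (group, neuron, fanout slot).
ct_entries = NUM_GROUPS * NEURONS_PER_GROUP * MAX_FANOUT
ct_bits = ct_entries * 17

# Tile counts as stated in the text.
total_tiles = NUM_GROUPS * 3 + 17
utilisation_pct = 100.0 * total_tiles / BRAM36_AVAILABLE

print(intra_bits_per_group, ct_entries, ct_bits)
print(total_tiles, round(utilisation_pct, 1))
```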
32-bit register access for configuration.
Config Commands (via cfg_router_config_addr):
| cmd [31:28] | Function | Data Format |
|---|---|---|
| 0x0 | CT entry write | {valid, dst_grp, dst_nrn, wt, exc, fanout, src_nrn, src_grp} |
| 0x1 | Intra weight write | {src_nrn, dst_nrn, weight, exc, group_id} |
| 0x2 | Read routed_spike_count | — |
| 0x3 | Read total_neuron_spikes | — |
Spike streaming between PS and PL.
Global Neuron ID (11-bit, supports 2048 neurons):
global_id[10:7] = group_id (0-15)
global_id[6:0] = local_neuron_id (0-127)
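Converting between the 11-bit global ID and the (group, local neuron) pair is a shift and a mask. A minimal sketch, with hypothetical helper names:

```python
def split_global_id(global_id):
    """Return (group_id, local_neuron_id) for an 11-bit global neuron ID."""
    assert 0 <= global_id < 2048
    return global_id >> 7, global_id & 0x7F


def make_global_id(group_id, local_id):
    """Inverse mapping: {group_id[3:0], local_id[6:0]}."""
    assert 0 <= group_id < 16 and 0 <= local_id < 128
    return (group_id << 7) | local_id


print(make_global_id(3, 51))   # 3 * 128 + 51 = 435
print(split_global_id(2047))   # (15, 127)
```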
- PS sends input spikes via AXI Stream with 11-bit global neuron IDs
- Event Router routes spikes to destination core groups
- Core group FSM integrates weight into target neuron membrane
- If neuron fires → spike bitmap set, intra-group weights scanned
- Non-zero intra-group connections pushed to local FIFO
- Output spikes forwarded to Event Router for inter-group propagation
- Router queries CT for sparse inter-group connections (up to 16 per source)
- Output spikes sent back to PS
- Event Router forwards all spikes to HLS learning engine (learn_spike)
- HLS updates pre/post traces using lazy exponential decay
- LTP/LTD weight deltas computed per Mozafari STDP rule
- Weight updates routed back through Event Router:
- Intra-group: forwarded to target core group's weight BRAM
- Inter-group: forwarded to connectivity table BRAM
- (R-STDP) Eligibility traces modulated by reward signal
| Loop | Before | After | Improvement |
|---|---|---|---|
| LTD_LOOP | II=2, UNROLL=2 | II=1, UNROLL=4 | 2× throughput |
| LTP_LOOP | II=2 | II=1 | 2× throughput |
| RSTDP_INNER | No unroll | UNROLL=4 | 4× throughput |
| DECAY_PRE/POST | UNROLL=2 | UNROLL=4 | 2× throughput |
| WEIGHT_SUM | II=2 | II=1 | 2× throughput |
- MAC operation: ~4.6 pJ (multiply + add)
- AC operation: ~0.9 pJ (add only)
- Savings: ~5× per synaptic operation
Estimated PL Breakdown (Vivado report_power, pre-route estimate):
- HLS IP: ~108 mW (65%)
- Verilog RTL (16 groups): ~35 mW (21%)
- PS interface: ~15 mW (9%)
- Clocking: ~8 mW (5%)
- PL total (Vivado estimate): ~166 mW
Measured Board Power (PYNQ-Z2, XADC, 2026-02-21):
| XADC Rail | V_meas | I_typ | P_est | Function |
|---|---|---|---|---|
| vccint | 1.017 V | 500 mA | 508.5 mW | PS + PL fabric |
| vccaux | 1.808 V | 60 mA | 108.5 mW | I/O banks |
| vccbram | 1.018 V | 20 mA | 20.4 mW | Block RAM |
| vccpint | 1.017 V | 150 mA | 152.5 mW | PS (ARM) core |
| vccpaux | 1.809 V | 30 mA | 54.3 mW | PS I/O |
| Total | | | 844 mW | ±20% (XADC method) |
Note: XADC estimates P = V_measured × I_typical (fixed datasheet values). Rail voltages are regulated and nearly load-independent, so XADC cannot detect dynamic switching increments. Idle and inference-active power read identically (Δ = +0.14 mW, within 0.5 mW noise floor). For accurate PL-only dynamic power, use an external INA226 on the 5 V input rail or Vivado power analysis with a switching-activity (.saif) file.
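The per-rail estimates in the table are simply P = V_measured × I_typical, which the following snippet recomputes from the listed voltages and typical currents. It uses only the numbers given above; nothing here reads the XADC.

```python
# (measured voltage in V, typical datasheet current in A) per rail
RAILS = {
    "vccint":  (1.017, 0.500),
    "vccaux":  (1.808, 0.060),
    "vccbram": (1.018, 0.020),
    "vccpint": (1.017, 0.150),
    "vccpaux": (1.809, 0.030),
}


def rail_power_mw(volts, amps):
    """Estimated rail power in milliwatts."""
    return volts * amps * 1000.0


per_rail = {name: rail_power_mw(v, i) for name, (v, i) in RAILS.items()}
total_mw = sum(per_rail.values())

for name, p in per_rail.items():
    print(f"{name:8s} {p:7.1f} mW")
print(f"{'total':8s} {total_mw:7.1f} mW")
```

Because the currents are fixed constants, the total moves only with rail voltage, which is why idle and active readings come out nearly identical.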
- Shift-based leak (no multiplier)
- Spike-triggered processing (state updates gated by AER spike events)
- Per-neuron traces (reduce memory access)
- Lazy trace update (compute on-demand)
- 8-bit weights (reduced memory bandwidth)
- Sparse inter-group connectivity (reduced BRAM)
Target Device: xc7z020clg400-1
Clock: 100 MHz
Neuron Count: 2,048 (16 groups × 128)
Build Command:
cd hardware/scripts
./build_integrated.sh
Synthesis Verification:
# Run RTL synthesis check (16-group configuration)
cd hardware/scripts
vivado -mode batch -source synth_core_group.tcl
Output Files: outputs/snn_integrated.bit, outputs/snn_integrated.hwh
See developer_guide.md for detailed build instructions.
- User Guide - Usage examples
- Developer Guide - Development workflow
- API Reference - Python API