# Reinforcement Learning for Trading with Insider Signals Under Partial Observability
This project investigates whether reinforcement learning agents can effectively exploit insider trading signals of varying quality. We train and evaluate three RL algorithms—DQN, DRQN, and PPO—on a simulated trading environment with synthetic insider signals at controlled accuracy levels (50%–100%).
## Research Questions

- Can RL agents learn to exploit imperfect insider information?
- How does performance degrade as signal quality decreases?
- Which architectural components (LSTM memory, dueling networks) matter most?
## Key Findings

- Signal quality is critical: performance degrades gracefully from 100% down to roughly 60% accuracy, then collapses to near-zero Sharpe at 50% (a random signal).
- DRQN outperforms DQN: LSTM memory helps filter noisy signals (11.8 vs. 10.5 Sharpe at 80% accuracy).
- The insider signal is the primary alpha source: in the ablation, removing it collapses returns from +8210% to +69%.
## Project Structure

```
Insiders-Edge/
├── insiders-edge/                  # Main codebase
│   ├── environment.py              # Trading environment
│   ├── dqn_agent.py                # DQN agent
│   ├── drqn_simple.py              # DRQN agent (LSTM-based)
│   ├── ppo_agent.py                # PPO agent
│   ├── run_comparison.py           # Train all agents
│   ├── evaluate_models.py          # Test set evaluation
│   ├── drqn_ablation.py            # Ablation study
│   ├── evaluate_ablation.py        # Ablation evaluation
│   ├── volatility_analysis.py      # Volatility regime analysis
│   │
│   ├── results_200ep/              # Trained checkpoints
│   ├── ablation_results/           # Ablation checkpoints
│   ├── ablation_eval/              # Ablation figures & tables
│   └── visualization/              # Generated figures
│
├── data/
│   ├── synthetic_signals/          # Generated signals by accuracy
│   │   ├── accuracy_100/           # Perfect signal
│   │   ├── accuracy_90/
│   │   ├── accuracy_80/
│   │   ├── accuracy_70/
│   │   ├── accuracy_60/
│   │   └── accuracy_50/            # Random (baseline)
│   ├── sec_dj30/                   # Real SEC Form 4 filings
│   └── real_insider_signals/       # Processed SEC signals
│
├── scripts/                        # Data generation utilities
├── notebooks/                      # Exploratory analysis
└── playground/                     # Experimental code
```
## Setup

```bash
conda create -n insiders-edge python=3.10
conda activate insiders-edge
pip install -r requirements.txt
```

## Training

```bash
cd insiders-edge

# Train all agents across all accuracy levels (200 episodes)
python run_comparison.py --accuracies 100,90,80,70,60,50 --episodes 200 --save-dir results_200ep

# Train single accuracy
python run_comparison.py --accuracies 100 --episodes 200
```

## Evaluation

```bash
python evaluate_models.py \
--checkpoint-dir results_200ep \
--accuracies 100,90,80,70,60,50 \
    --split test
```

## Ablation Study

```bash
# Train ablation variants
python drqn_ablation.py --accuracy 100 --episodes 200
# Evaluate ablations
python evaluate_ablation.py --checkpoint-dir ablation_results --accuracy 100
```

## Signal Generation

Synthetic insider signals are generated with controlled accuracy:
At accuracy α:
- With probability α: signal = sign(future_5d_return) [CORRECT]
- With probability 1-α: signal = -sign(future_5d_return) [WRONG]
Final training data combines synthetic signals (70%) and real SEC Form 4 filings (30%):

combined_signal = 0.7 × synthetic_signal + 0.3 × SEC_signal
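As a concrete illustration, here is a minimal sketch of the corruption rule and the 70/30 blend. It assumes NumPy arrays of 5-day forward returns and processed SEC signals; the function names are illustrative and do not correspond to the actual `scripts/` utilities.

```python
import numpy as np

def make_synthetic_signal(future_5d_return: np.ndarray,
                          accuracy: float,
                          seed: int = 0) -> np.ndarray:
    """Emit sign(future_5d_return) with probability `accuracy`,
    and the flipped sign otherwise."""
    rng = np.random.default_rng(seed)
    true_sign = np.sign(future_5d_return)
    corrupt = rng.random(len(future_5d_return)) >= accuracy  # True -> flip
    return np.where(corrupt, -true_sign, true_sign)

def combine_signals(synthetic: np.ndarray, sec: np.ndarray) -> np.ndarray:
    """70/30 blend of the synthetic and SEC Form 4 signals."""
    return 0.7 * synthetic + 0.3 * sec
```

At `accuracy=0.5` the emitted signal is independent of the return's sign, which is why the 50% level serves as the random baseline.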
## Data Splits

| Split | Period | Days | Purpose |
|---|---|---|---|
| Train | 2020-01-01 to 2024-06-30 | ~1,130 | Model training |
| Validation | 2024-07-01 to 2024-12-31 | ~125 | Early stopping |
| Test | 2025-01-01 to 2025-10-31 | ~209 | Final evaluation |
## Agents

### DQN

- Feedforward network with dueling architecture (see the sketch after this list)
- Experience replay (10,000-transition buffer)
- Target network with soft updates
- ε-greedy exploration: 1.0 → 0.01
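The dueling decomposition splits the Q-function into a state value and per-action advantages. Below is a minimal PyTorch sketch assuming the [128, 64] hidden dims from the hyperparameter table; the class name and layer details are illustrative, not the exact `dqn_agent.py` implementation.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        self.value = nn.Linear(64, 1)              # V(s)
        self.advantage = nn.Linear(64, n_actions)  # A(s, a)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.encoder(obs)
        a = self.advantage(h)
        # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')
        return self.value(h) + a - a.mean(dim=-1, keepdim=True)
```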
### DRQN

- LSTM encoder for temporal dependencies (see the sketch after this list)
- Sequence length: 8 timesteps
- Insider advantage modifier (λ = 0.1)
- Dueling architecture
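A sketch of how an LSTM encoder can feed the same dueling head, assuming 64 hidden units (per the hyperparameter table) and 8-step observation sequences. This is illustrative rather than `drqn_simple.py` verbatim, and it omits the insider advantage modifier (λ).

```python
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    """Encodes the last 8 observations with an LSTM before a dueling head."""
    def __init__(self, obs_dim: int, n_actions: int = 3, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.value = nn.Linear(hidden, 1)
        self.advantage = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, seq_len=8, obs_dim)
        out, _ = self.lstm(obs_seq)
        h = out[:, -1]  # hidden state after the full 8-step history
        a = self.advantage(h)
        return self.value(h) + a - a.mean(dim=-1, keepdim=True)
```

Conditioning on an 8-step history is what allows the agent to average out occasional wrong signals, consistent with the finding above that LSTM memory helps filter noisy signals.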
### PPO

- Actor-critic with shared encoder (see the sketch after this list)
- Clipped surrogate objective (ε = 0.2)
- GAE advantage estimation (λ = 0.95)
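Both objectives named above are standard; here is a generic sketch with ε = 0.2 and λ = 0.95, not the `ppo_agent.py` implementation.

```python
import torch

def ppo_clipped_loss(log_prob_new: torch.Tensor,
                     log_prob_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective, negated for gradient descent."""
    ratio = torch.exp(log_prob_new - log_prob_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation, computed backward over a rollout."""
    advantages = [0.0] * len(rewards)
    next_value, running = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        if dones[t]:  # do not bootstrap across episode boundaries
            next_value, running = 0.0, 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```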
## Hyperparameters

| Parameter | DQN | DRQN | PPO |
|---|---|---|---|
| Learning rate | 1e-3 | 1e-3 | 3e-4 |
| Discount γ | 0.99 | 0.99 | 0.99 |
| Batch size | 32 | 8 sequences | 32 |
| Hidden dims | [128, 64] | [64, 64] | [128, 64] |
| Target update | 1,000 | 1,000 | N/A |
| Buffer size | 10,000 | 1,000 episodes | N/A |
| Transaction cost | 0.1% | 0.1% | 0.1% |
## Results

### Sharpe Ratio by Signal Accuracy

| Accuracy | DQN Sharpe | DRQN Sharpe | PPO Sharpe |
|---|---|---|---|
| 100% | 15.4 | 15.9 | 15.1 |
| 90% | 13.2 | 14.1 | 12.8 |
| 80% | 10.5 | 11.8 | 10.2 |
| 70% | 6.8 | 7.9 | 6.1 |
| 60% | 2.1 | 2.8 | 1.9 |
| 50% | -0.3 | 0.1 | -0.5 |
### DRQN Ablation (100% Signal Accuracy)

| Configuration | Return | Sharpe | Max Drawdown |
|---|---|---|---|
| DRQN (Full) | +8210% | 15.92 | -3.2% |
| w/o LSTM | +8095% | 15.83 | -4.7% |
| w/o Signal | +69% | 1.55 | -37.0% |
| w/o Dueling | +7665% | 15.46 | -1.5% |
| Seq Len = 1 | +4179% | 12.26 | -7.7% |
| λ = 0 | +3850% | 11.91 | -19.6% |
## Trading Environment

### State

- Price features: returns, volatility, momentum, RSI
- Position: current holdings (-1 to 1)
- Insider signal: processed signal value
- Technical indicators: moving averages, volume
### Actions

- 0: Sell (go short)
- 1: Hold
- 2: Buy (go long)
### Reward

reward = position × daily_return - transaction_cost × |Δposition|
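To make the mechanics concrete, here is a minimal sketch of how the action mapping and reward interact, assuming a daily-returns array and the 0.1% transaction cost from the hyperparameter table. The class and method names are illustrative, not the `environment.py` interface, and it assumes the new position is applied to the same day's return and that "Hold" keeps the current position.

```python
import numpy as np

class TradingEnvSketch:
    """Position in {-1, 0, +1}; a cost is charged on any change in position."""

    def __init__(self, daily_returns: np.ndarray, transaction_cost: float = 0.001):
        self.returns = daily_returns
        self.cost = transaction_cost
        self.t = 0
        self.position = 0

    def step(self, action: int):
        if action == 0:
            new_position = -1              # sell: go short
        elif action == 2:
            new_position = +1              # buy: go long
        else:
            new_position = self.position   # hold: keep the current position
        # reward = position × daily_return - transaction_cost × |Δposition|
        reward = (new_position * self.returns[self.t]
                  - self.cost * abs(new_position - self.position))
        self.position = new_position
        self.t += 1
        done = self.t >= len(self.returns)
        return reward, done
```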
## License

MIT License