
Insider's Edge

Reinforcement Learning for Trading with Insider Signals Under Partial Observability

Python 3.10 · PyTorch


Overview

This project investigates whether reinforcement learning agents can effectively exploit insider trading signals of varying quality. We train and evaluate three RL algorithms—DQN, DRQN, and PPO—on a simulated trading environment with synthetic insider signals at controlled accuracy levels (50%–100%).

Research Questions

  1. Can RL agents learn to exploit imperfect insider information?
  2. How does performance degrade as signal quality decreases?
  3. Which architectural components (LSTM memory, dueling networks) matter most?

Key Findings

  • Signal quality is critical: Performance degrades gracefully from 100% to ~60% accuracy, then collapses
  • DRQN outperforms DQN: LSTM memory helps filter noisy signals (+15% Sharpe at 80% accuracy)
  • Insider signal is the primary alpha source: Ablation shows 92% of returns come from the signal

Project Structure

Insiders-Edge/
├── insiders-edge/             # Main codebase
│   ├── environment.py         # Trading environment
│   ├── dqn_agent.py           # DQN agent
│   ├── drqn_simple.py         # DRQN agent (LSTM-based)
│   ├── ppo_agent.py           # PPO agent
│   ├── run_comparison.py      # Train all agents
│   ├── evaluate_models.py     # Test set evaluation
│   ├── drqn_ablation.py       # Ablation study
│   ├── evaluate_ablation.py   # Ablation evaluation
│   ├── volatility_analysis.py # Volatility regime analysis
│   │
│   ├── results_200ep/         # Trained checkpoints
│   ├── ablation_results/      # Ablation checkpoints
│   ├── ablation_eval/         # Ablation figures & tables
│   └── visualization/         # Generated figures
│
├── data/
│   ├── synthetic_signals/     # Generated signals by accuracy
│   │   ├── accuracy_100/      # Perfect signal
│   │   ├── accuracy_90/
│   │   ├── accuracy_80/
│   │   ├── accuracy_70/
│   │   ├── accuracy_60/
│   │   └── accuracy_50/       # Random (baseline)
│   ├── sec_dj30/              # Real SEC Form 4 filings
│   └── real_insider_signals/  # Processed SEC signals
│
├── scripts/                   # Data generation utilities
├── notebooks/                 # Exploratory analysis
└── playground/                # Experimental code

Quick Start

1. Setup Environment

conda create -n insiders-edge python=3.10
conda activate insiders-edge
pip install -r requirements.txt

2. Train Agents

cd insiders-edge

# Train all agents across all accuracy levels (200 episodes)
python run_comparison.py --accuracies 100,90,80,70,60,50 --episodes 200 --save-dir results_200ep

# Train single accuracy
python run_comparison.py --accuracies 100 --episodes 200

3. Evaluate on Test Set

python evaluate_models.py \
    --checkpoint-dir results_200ep \
    --accuracies 100,90,80,70,60,50 \
    --split test

4. Run Ablation Study

# Train ablation variants
python drqn_ablation.py --accuracy 100 --episodes 200

# Evaluate ablations
python evaluate_ablation.py --checkpoint-dir ablation_results --accuracy 100

Data Pipeline

Signal Generation

Synthetic insider signals are generated with controlled accuracy:

At accuracy α:
  - With probability α: signal = sign(future_5d_return)  [CORRECT]
  - With probability 1-α: signal = -sign(future_5d_return) [WRONG]
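
A minimal sketch of this rule (function and variable names are illustrative; the actual generator lives under scripts/):

import numpy as np

def generate_signals(future_5d_returns: np.ndarray, accuracy: float,
                     seed: int = 0) -> np.ndarray:
    """Emit +1/-1 signals that agree with sign(future_5d_return)
    with probability `accuracy` and are flipped otherwise."""
    rng = np.random.default_rng(seed)
    true_direction = np.sign(future_5d_returns)
    correct = rng.random(len(future_5d_returns)) < accuracy  # per-day coin flip
    return np.where(correct, true_direction, -true_direction)

# Example: an 80%-accuracy signal over simulated 5-day-ahead returns
signals = generate_signals(np.random.default_rng(1).normal(0, 0.02, 1000), accuracy=0.8)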

Data Blending

The final training signal blends the synthetic signal (70%) with a signal derived from real SEC Form 4 filings (30%):

combined_signal = 0.7 × synthetic_signal + 0.3 × SEC_signal
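
A one-line sketch of the blend, assuming both signals are already aligned on the same trading days (the function name is illustrative):

def blend_signals(synthetic_signal, sec_signal, w_synthetic=0.7):
    """Weighted 70/30 blend of synthetic and SEC-derived insider signals."""
    return w_synthetic * synthetic_signal + (1.0 - w_synthetic) * sec_signal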

Data Splits

Split        Period                      Trading days   Purpose
Train        2020-01-01 to 2024-06-30    ~1,130         Model training
Validation   2024-07-01 to 2024-12-31    ~125           Early stopping
Test         2025-01-01 to 2025-10-31    ~209           Final evaluation

Agents

DQN (Deep Q-Network)

  • Feedforward network with dueling architecture
  • Experience replay (10,000 buffer)
  • Target network with soft updates
  • ε-greedy exploration: 1.0 → 0.01
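
A minimal PyTorch sketch of the dueling Q-head described above, assuming the [128, 64] hidden dims from the hyperparameter table and the 14-dimensional state from Environment Details (the full agent in dqn_agent.py also owns the replay buffer, target network, and ε-greedy schedule):

import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Feedforward dueling head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, state_dim: int = 14, n_actions: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        self.value = nn.Linear(64, 1)              # state value V(s)
        self.advantage = nn.Linear(64, n_actions)  # per-action advantage A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.encoder(state)
        a = self.advantage(h)
        return self.value(h) + a - a.mean(dim=-1, keepdim=True)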

DRQN (Deep Recurrent Q-Network)

  • LSTM encoder for temporal dependencies
  • Sequence length: 8 timesteps
  • Insider advantage modifier (λ = 0.1)
  • Dueling architecture
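
A sketch of the recurrent Q-network, assuming a single-layer LSTM over 8-step observation windows with the [64, 64] dims from the hyperparameter table (the insider advantage modifier and training loop are in drqn_simple.py and are not shown here):

import torch
import torch.nn as nn

class DRQNetwork(nn.Module):
    """LSTM encoder over short observation sequences, followed by a dueling head."""
    def __init__(self, state_dim: int = 14, n_actions: int = 3, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.advantage = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq: torch.Tensor, hidden_state=None):
        # obs_seq: (batch, seq_len=8, state_dim)
        out, hidden_state = self.lstm(obs_seq, hidden_state)
        h = self.fc(out[:, -1])                    # encoding of the last timestep
        a = self.advantage(h)
        q = self.value(h) + a - a.mean(dim=-1, keepdim=True)
        return q, hidden_state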

PPO (Proximal Policy Optimization)

  • Actor-critic with shared encoder
  • Clipped surrogate objective (ε = 0.2)
  • GAE advantage estimation (λ = 0.95)
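
The clipped surrogate objective with ε = 0.2 can be sketched as follows (advantages are assumed to come from GAE with λ = 0.95; see ppo_agent.py for the full actor-critic update):

import torch

def ppo_clip_loss(new_log_probs: torch.Tensor, old_log_probs: torch.Tensor,
                  advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate objective, averaged over the batch."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()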

Hyperparameters

Parameter          DQN         DRQN             PPO
Learning rate      1e-3        1e-3             3e-4
Discount γ         0.99        0.99             0.99
Batch size         32          8 sequences      32
Hidden dims        [128, 64]   [64, 64]         [128, 64]
Target update      1,000       1,000            N/A
Buffer size        10,000      1,000 episodes   N/A
Transaction cost   0.1%        0.1%             0.1%

Results

Performance by Signal Accuracy (Test Set)

Accuracy   DQN Sharpe   DRQN Sharpe   PPO Sharpe
100%       15.4         15.9          15.1
90%        13.2         14.1          12.8
80%        10.5         11.8          10.2
70%        6.8          7.9           6.1
60%        2.1          2.8           1.9
50%        -0.3         0.1           -0.5

Ablation Study (100% Accuracy)

Configuration   Return    Sharpe   Max DD
DRQN (Full)     +8210%    15.92    -3.2%
w/o LSTM        +8095%    15.83    -4.7%
w/o Signal      +69%      1.55     -37.0%
w/o Dueling     +7665%    15.46    -1.5%
Seq Len = 1     +4179%    12.26    -7.7%
λ = 0           +3850%    11.91    -19.6%

Environment Details

State Space (14 dimensions)

  • Price features: returns, volatility, momentum, RSI
  • Position: current holdings (-1 to 1)
  • Insider signal: processed signal value
  • Technical indicators: moving averages, volume

Action Space

  • 0: Sell (go short)
  • 1: Hold
  • 2: Buy (go long)

Reward

reward = position × daily_return - transaction_cost × |Δposition|
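
A minimal sketch of how a step could apply this reward, assuming actions map to target positions (0 → -1, 1 → keep current, 2 → +1) and the 0.1% transaction cost from the hyperparameter table; the authoritative logic lives in environment.py:

TRANSACTION_COST = 0.001  # 0.1% per unit of position change

def step_reward(prev_position: float, action: int, daily_return: float) -> tuple[float, float]:
    """Return (reward, new_position) for one trading day."""
    target = {0: -1.0, 1: prev_position, 2: 1.0}[action]
    turnover = abs(target - prev_position)
    reward = target * daily_return - TRANSACTION_COST * turnover
    return reward, target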

License

MIT License

About

A recurrent dueling-DQN and PPO pipeline for partially observable stock trading with controllable signal-quality experiments.
