FinOptima — Technical Documentation

System Architecture

┌──────────┐      ┌──────────────────────────────────────┐       ┌──────────────┐
│  React   │ ◄──► │         FastAPI Backend              │ ◄──►  │   yfinance   │
│  + Vite  │      │                                      │       │  Market Data │
│  UI      │      │  /api/health     → health check      │       └──────────────┘
│          │      │  /api/live-data  → live prices       │
│          │      │  /api/analyze    → preprocessing+risk│
│          │      │  /api/predict    → ML predictions    │
│          │      │  /api/cluster    → KMeans clustering │
│          │      │  /api/optimize   → portfolio weight  │
│          │      │  /api/full-analysis → complete pipe  │
│          │      │  /ws/prices      → live price stream │
└──────────┘      └──────────────────────────────────────┘

All data processing is 100% in-memory. No disk I/O occurs during request handling (CSV generation is a separate dev-only utility).

Sample Data

Pre-fetched daily and intraday CSV files for 28 major US equities and ETFs are available in the live_data/ directory at the project root. The datasets were fetched on 12 June.

Folder	Contents
`live_data/daily/`	28 CSV files, each with 1 year of daily OHLCV data (AAPL, MSFT, GOOGL, AMZN, TSLA, META, NVDA, JPM, NFLX, AMD, AVGO, COST, KO, PEP, WMT, PG, V, BAC, JNJ, UNH, XOM, CAT, GE, LIN, SPY, QQQ, IWM, GLD)
`live_data/intraday/`	Intraday 5m bars for AAPL, MSFT, GOOGL, NVDA (under `5m/` subfolder)

These can be used with the sample data provider for testing without yfinance calls. The generator script is at backend/app/utils/sample_data_generator.py.

Data Pipeline

1. Data Ingestion (`market_data_service.py`)

Detail	Value
Provider	yfinance (`yfinance>=0.2.40`)
Fetch strategy	Batch download via `yf.download()`
Cache	No response cache. yfinance's own timezone cache is redirected to a per-process temp dir to avoid SQLite locking on Render's ephemeral filesystem
Auto-adjust	`True` (splits/dividends adjusted)
Threading	`True` (parallel ticker downloads)

Default intervals by mode:

Mode	Period	Interval	Min rows
`"daily"`	`"6mo"`	`"1d"`	5
`"intraday"`	`"2d"`	`"5m"`	20

2. Feature Engineering (`preprocessing.py`)

Applied per-symbol chronologically:

Feature	Formula	Purpose
`daily_return`	`close.pct_change()`	Base signal for all downstream models
`log_return`	`log(close / close.shift(1))`	Alternative return measure
`sma_5`, `sma_10`, `sma_20`	Rolling mean of `close`	Trend identification (daily)
`sma_12`, `sma_24`, `sma_78`	Rolling mean of `close`	Trend identification (intraday)
`rolling_volatility`	Rolling std of `daily_return` × `sqrt(252)`	Risk measurement
`rsi`	Vectorized RSI (14-period)	Momentum/overbought-oversold
`momentum`	`close.pct_change(periods=n)`, n = 10 (daily) or 12 (intraday)	Short-term price momentum

Adaptive windows by mode:

Parameter	Daily	Intraday
SMA windows	5, 10, 20	12, 24, 78
Volatility window	20	24
RSI period	14	14
Momentum window	10	12
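The feature table and the adaptive windows above can be sketched with pandas as follows. This is a minimal illustration, not the project's `preprocessing.py`: the `add_features` name, the `WINDOWS` dict, the lowercase `close` column, and the simple rolling-mean RSI variant are assumptions (the source only says the RSI is vectorized and 14-period).

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252

# Window sizes per mode, taken from the "Adaptive windows by mode" table.
WINDOWS = {
    "daily": {"sma": (5, 10, 20), "vol": 20, "rsi": 14, "momentum": 10},
    "intraday": {"sma": (12, 24, 78), "vol": 24, "rsi": 14, "momentum": 12},
}


def add_features(df: pd.DataFrame, mode: str = "daily") -> pd.DataFrame:
    """Return a copy of df (one symbol, chronological) with engineered features."""
    w = WINDOWS[mode]
    out = df.copy()
    close = out["close"]

    out["daily_return"] = close.pct_change()
    out["log_return"] = np.log(close / close.shift(1))
    for n in w["sma"]:
        out[f"sma_{n}"] = close.rolling(n).mean()
    out["rolling_volatility"] = (
        out["daily_return"].rolling(w["vol"]).std() * np.sqrt(TRADING_DAYS)
    )

    # RSI from rolling average gains and losses (assumed variant, no Wilder smoothing).
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(w["rsi"]).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(w["rsi"]).mean()
    out["rsi"] = 100 - 100 / (1 + avg_gain / avg_loss)

    out["momentum"] = close.pct_change(periods=w["momentum"])
    return out.dropna()  # drop the warm-up rows of the longest window


# Synthetic, strictly positive price series standing in for one ticker.
prices = pd.DataFrame({"close": 100 + np.cumsum(np.sin(np.arange(60)))})
features = add_features(prices, mode="daily")
```

With the daily windows, the first 20 rows are consumed as warm-up, so a 60-row input yields 40 usable rows.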

Machine Learning Models

1. Regression Predictor (`regression_predictor.py`)

Models

Model	Parameters	Purpose
`LinearRegression`	Defaults (OLS)	Baseline linear model
`RandomForestRegressor`	`n_estimators=50`, `max_depth=5`, `random_state=42`	Non-linear ensemble

Training Methodology

Target: Next-period return (daily_return.shift(-1))
Train/test split: 80/20 chronological (shuffle=False)
Minimum training rows: 20
Scaling: StandardScaler applied only to LinearRegression (Random Forest uses raw values)
Selection: Model with lower MAE on the 20% test set is used for predictions

Features

Both models train on 7 features:

daily_return — recent return
sma_5 / sma_12 — short-term moving average
sma_10 / sma_24 — medium-term moving average
sma_20 / sma_78 — long-term moving average
rolling_volatility — trailing risk estimate
rsi — momentum oscillator
momentum — price change over window
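The training methodology above (chronological 80/20 split, scaling for the linear model only, selection by test-set MAE) might look like the sketch below. The `select_model` helper and the synthetic data are hypothetical, and fitting the scaler on the training rows only is an assumption. In the real pipeline `X` would hold the 7 engineered features and `y` would be `daily_return.shift(-1)` with the final row dropped.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def select_model(X: np.ndarray, y: np.ndarray):
    """Fit both regressors on the oldest 80% and pick the lower test-set MAE."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=False  # chronological split
    )

    # Scaling is fitted on the training rows and used for the linear model only.
    scaler = StandardScaler().fit(X_train)
    linear = LinearRegression().fit(scaler.transform(X_train), y_train)

    # Random Forest is trained on the raw feature values.
    forest = RandomForestRegressor(
        n_estimators=50, max_depth=5, random_state=42
    ).fit(X_train, y_train)

    maes = {
        "linear_regression": mean_absolute_error(
            y_test, linear.predict(scaler.transform(X_test))
        ),
        "random_forest": mean_absolute_error(y_test, forest.predict(X_test)),
    }
    return min(maes, key=maes.get), maes


# Synthetic stand-in for one symbol: 100 rows, 7 features, near-linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))
y = X @ np.linspace(0.001, 0.007, 7) + rng.normal(scale=1e-4, size=100)
best_name, maes = select_model(X, y)
```

Because the synthetic target is almost exactly linear, the linear model wins here; on real return data either model may be selected.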

Evaluation Metrics

Metric	Formula	Usage
MAE	`mean(abs(y_true - y_pred))`	Model selection (lower is better)
Confidence	`100 × (0.6 × max(0, 1 - MAE/0.05) + 0.4 × min(abs(pred)/0.03, 1))`	Heuristic score (0–100)

Confidence formula rationale:

error_score falls linearly from 1 at MAE = 0 to 0 at MAE ≥ 0.05 (5% error)
signal_score rises linearly with |predicted_return| and saturates at 1 once it reaches 0.03 (3% return)
Error weighted 60%, signal weighted 40%
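As a worked example of the heuristic, the sketch below evaluates the score for two hypothetical inputs. The `confidence_score` name is illustrative, and multiplying the weighted sum by 100 is an assumption made so the result lands on the stated 0–100 scale.

```python
def confidence_score(mae: float, predicted_return: float) -> float:
    """Heuristic 0-100 score: 60% from low test error, 40% from signal size."""
    error_score = max(0.0, 1.0 - mae / 0.05)                 # 1 at MAE = 0, 0 at MAE >= 0.05
    signal_score = min(abs(predicted_return) / 0.03, 1.0)    # saturates at a 3% move
    return 100.0 * (0.6 * error_score + 0.4 * signal_score)


low_error = confidence_score(mae=0.0142, predicted_return=0.0085)   # ≈ 54.3
high_error = confidence_score(mae=0.06, predicted_return=0.05)      # ≈ 40.0
```

The second case shows that a large predicted move alone can contribute at most 40 points when the test error is at or above the 0.05 ceiling.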

Fallback

If regression fails (insufficient data), falls back to momentum:

predicted_return = momentum * 1.0
confidence = 35.0

2. LSTM Predictor (`lstm_predictor.py`)

Lazy-loaded TensorFlow/Keras — only activates when ENABLE_LSTM=true. Disabled by default on Render to stay within 512 MB RAM.

Architecture

Input(seq_length, 3) → LSTM(16, return_sequences=False) → Dropout(0.2) → Dense(1)

Input features per timestep: [close (normalized), daily_return, rsi]

Hyperparameters

Parameter	Daily	Intraday
Sequence length	20 bars	39 bars
Min rows	40	100
Epochs	8	5
Batch size	16	32
Validation split	0.1 (if 20+ samples)	same
Optimizer	Adam	Adam
Loss	MSE	MSE

Training Strategy

All tickers share a single LSTM model. Sequences from all tickers are stacked vertically and one model.fit() call trains across all symbols. This avoids per-symbol Python loops and keeps memory usage predictable.
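The stacking step can be illustrated with NumPy alone. This is a sketch under assumptions: `make_sequences` and `stack_tickers` are hypothetical helpers, each ticker is represented as a `(features, target)` pair with 3 feature columns, and the target for a window is taken to be the value immediately after it.

```python
import numpy as np


def make_sequences(features: np.ndarray, target: np.ndarray, seq_length: int):
    """Slice one ticker into overlapping windows and their next-step targets."""
    n_samples = len(features) - seq_length  # must be positive (see the Min rows setting)
    X = np.stack([features[i : i + seq_length] for i in range(n_samples)])
    y = target[seq_length:]  # y[i] is the value right after window i
    return X, y


def stack_tickers(per_ticker: dict, seq_length: int = 20):
    """Concatenate every ticker's windows into one training set."""
    parts = [make_sequences(f, t, seq_length) for f, t in per_ticker.values()]
    X = np.vstack([p[0] for p in parts])       # (total_samples, seq_length, 3)
    y = np.concatenate([p[1] for p in parts])  # (total_samples,)
    return X, y


rng = np.random.default_rng(0)
data = {
    "AAPL": (rng.normal(size=(50, 3)), rng.normal(size=50)),
    "MSFT": (rng.normal(size=(45, 3)), rng.normal(size=45)),
}
X_all, y_all = stack_tickers(data, seq_length=20)
```

The resulting `X_all` and `y_all` are what a single `model.fit()` call would receive, so the number of training samples grows with the number of tickers.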

Output

predicted_return — final timestep forward pass
trend — ternary classification at 0.2% threshold
confidence — scaled by recent_volatility, clamped to [30, 85]

Ensemble

When LSTM is enabled, predictions are averaged with regression outputs:

ensemble_return = (regression_return + lstm_return) / 2
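A minimal sketch of the averaging step combined with the 0.2% trend threshold follows. The `combine` and `classify_trend` names are hypothetical, applying the threshold to the ensemble value is an assumption, and only the `"upward"` label appears in this document; `"downward"` and `"neutral"` are placeholder labels.

```python
from typing import Optional

TREND_THRESHOLD = 0.002  # 0.2%


def classify_trend(predicted_return: float) -> str:
    """Ternary trend label around the +/-0.2% band."""
    if predicted_return > TREND_THRESHOLD:
        return "upward"
    if predicted_return < -TREND_THRESHOLD:
        return "downward"  # placeholder label
    return "neutral"       # placeholder label


def combine(regression_return: float, lstm_return: Optional[float] = None) -> dict:
    """Average the two predictions, or pass regression through if LSTM is off."""
    if lstm_return is None:
        value = regression_return
    else:
        value = (regression_return + lstm_return) / 2
    return {"predicted_return": value, "trend": classify_trend(value)}


with_lstm = combine(0.0085, 0.0035)
regression_only = combine(0.001)
```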

3. Clustering (`clustering.py`)

Model

KMeans with random_state=42, n_init=10.

Cluster Count Logic

Symbols	Clusters
1–2	1
3–4	2
5+	3
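The table maps directly onto a small helper. The function name `cluster_count` is hypothetical.

```python
def cluster_count(n_symbols: int) -> int:
    """Number of KMeans clusters to request for a given number of symbols."""
    if n_symbols < 1:
        raise ValueError("at least one symbol is required")
    if n_symbols <= 2:
        return 1
    if n_symbols <= 4:
        return 2
    return 3


counts = [cluster_count(n) for n in range(1, 8)]
```

The result would then be passed as `n_clusters` to `KMeans` together with `random_state=42` and `n_init=10`.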

Feature Matrix

Built from per-symbol aggregates (preprocessing.build_feature_matrix):

Feature	Calculation
`avg_return`	`daily_return.mean()`
`volatility`	`daily_return.std()`
`momentum`	Most recent `momentum` value
`rsi`	Most recent `rsi` value
`beta_proxy`	`std / (std + 1e-6)`

Scaled via StandardScaler before clustering.

Cluster Profiles

Condition	Label
`avg_ret > 0.001` AND `avg_vol > 0.02`	High-growth, higher-risk
`avg_ret > 0` AND `avg_vol ≤ 0.02`	Steady performers
`avg_ret ≤ 0` AND `avg_vol > 0.02`	Volatile, underperforming
Otherwise	Defensive / low momentum
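Read top to bottom, the conditions translate into the following sketch. The `label_cluster` name is hypothetical, and `avg_ret` and `avg_vol` are taken to be the cluster's mean return and mean volatility.

```python
def label_cluster(avg_ret: float, avg_vol: float) -> str:
    """Map a cluster's average return and volatility to a profile label."""
    if avg_ret > 0.001 and avg_vol > 0.02:
        return "High-growth, higher-risk"
    if avg_ret > 0 and avg_vol <= 0.02:
        return "Steady performers"
    if avg_ret <= 0 and avg_vol > 0.02:
        return "Volatile, underperforming"
    return "Defensive / low momentum"


labels = [
    label_cluster(0.002, 0.03),
    label_cluster(0.0005, 0.01),
    label_cluster(-0.001, 0.03),
    label_cluster(0.0005, 0.03),  # small positive return, high volatility
    label_cluster(-0.001, 0.01),
]
```

Note that a cluster with a small positive return (between 0 and 0.001) and volatility above 0.02 matches none of the first three rows and therefore falls into the last profile.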

Portfolio Optimization (`optimizer.py`)

Black-Litterman Expected Returns

Computes implied equilibrium returns via reverse optimization, then blends with model predictions:

Equilibrium returns: Π = δ × Σ × w_eq where Σ is the historical covariance matrix, δ is risk aversion, and w_eq is the equal-weight prior
View matrix P: Identity matrix (one view per asset)
View uncertainty Ω: τ × diag(diag(Σ)) where τ = 0.05
Posterior blend: E[R] = ((τΣ)⁻¹ + PᵀΩ⁻¹P)⁻¹ × ((τΣ)⁻¹Π + PᵀΩ⁻¹ × Q), where Q is the vector of model-predicted returns (one view per asset)
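The four steps above can be written out in NumPy as a sketch. The function names are hypothetical and the risk-aversion default `delta=2.5` is an assumption, since the document does not state the value used.

```python
import numpy as np


def equilibrium_returns(cov: np.ndarray, delta: float = 2.5) -> np.ndarray:
    """Implied returns Pi = delta * Sigma * w_eq with an equal-weight prior."""
    n = cov.shape[0]
    w_eq = np.full(n, 1.0 / n)
    return delta * cov @ w_eq


def posterior_returns(cov, views, delta: float = 2.5, tau: float = 0.05):
    """Black-Litterman posterior with one absolute view per asset (P = I)."""
    n = cov.shape[0]
    pi = equilibrium_returns(cov, delta)
    P = np.eye(n)
    omega_inv = np.linalg.inv(tau * np.diag(np.diag(cov)))
    tau_cov_inv = np.linalg.inv(tau * cov)

    A = tau_cov_inv + P.T @ omega_inv @ P
    b = tau_cov_inv @ pi + P.T @ omega_inv @ np.asarray(views, dtype=float)
    return np.linalg.solve(A, b)  # solves A x = b instead of inverting A


cov = np.array([[0.04, 0.01], [0.01, 0.09]])
pi = equilibrium_returns(cov)
neutral = posterior_returns(cov, views=pi)                           # views equal the prior
bullish = posterior_returns(cov, views=pi + np.array([0.05, 0.0]))   # optimistic on asset 0
```

When the views equal the equilibrium returns the posterior reproduces them, and an optimistic view on one asset moves its posterior return part of the way toward that view.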

Optimization Constraints

Constraint	Value
Asset bounds	`(0.0, 0.40)` — no shorting, max 40% per asset
Budget	`Σ weights = 1.0`
Solver	SLSQP (`scipy.optimize.minimize`)

Optimization Goals

Goal	Objective
`"max_sharpe"`	Maximize `(return - RFR) / (vol × risk_multiplier)`
`"min_volatility"`	Minimize `sqrt(wᵀΣw)`

Risk Preference Multipliers

Preference	`risk_multiplier`
Low	0.5
Medium	1.0
High	1.5
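The constraints, goals, and multipliers above can be combined into one SLSQP call, sketched below. `optimize_weights`, the `RISK_MULTIPLIER` dict, and the synthetic inputs are hypothetical, and treating `mu` and `rfr` as annualized figures is an assumption.

```python
import numpy as np
from scipy.optimize import minimize

RISK_MULTIPLIER = {"low": 0.5, "medium": 1.0, "high": 1.5}


def optimize_weights(mu, cov, goal="max_sharpe", risk_preference="medium",
                     rfr=0.02, max_weight=0.40):
    """Long-only weights capped at max_weight that sum to 1, solved with SLSQP."""
    n = len(mu)
    k = RISK_MULTIPLIER[risk_preference]

    def volatility(w):
        return float(np.sqrt(w @ cov @ w))

    def neg_sharpe(w):
        # Negated because scipy minimizes.
        return -(w @ mu - rfr) / (volatility(w) * k)

    objective = neg_sharpe if goal == "max_sharpe" else volatility
    result = minimize(
        objective,
        x0=np.full(n, 1.0 / n),  # start from equal weights
        method="SLSQP",
        bounds=[(0.0, max_weight)] * n,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return result.x


mu = np.array([0.08, 0.10, 0.12, 0.15])
cov = np.diag([0.02, 0.03, 0.05, 0.09])
w_sharpe = optimize_weights(mu, cov, goal="max_sharpe")
w_minvol = optimize_weights(mu, cov, goal="min_volatility")
```

With a 40% cap, at least three assets are needed for the budget constraint to be feasible. Note also that in this sketch the multiplier only rescales the Sharpe objective by a positive constant, so it does not move the maximizer; how the real optimizer makes the risk preference affect the weights is not shown in this document.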

Fallback

If data is insufficient (< 5 rows per symbol), returns equal-weight allocation.

Risk Metrics (`risk_metrics.py`)

Metric	Formula	Interpretation
Annualized Volatility	`σ × √252`	Total risk
Sharpe Ratio	`(μ × 252 − RFR) / (σ × √252)`	Risk-adjusted return (RFR = 2%)
Max Drawdown	`min((cum / peak) − 1)`	Worst peak-to-trough loss
VaR (95%)	`1.645 × σ_p − μ_p`	Loss not exceeded with 95% confidence (normal assumption)
CVaR (95%)	`σ_p × φ(1.645) / 0.05 − μ_p`	Expected loss beyond VaR (normal assumption; φ is the standard normal density)

All metrics use annualized values with TRADING_DAYS = 252.
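The formulas in the table can be checked with a short NumPy sketch. The `risk_metrics` function and its output keys are hypothetical, and using the sample standard deviation (`ddof=1`, the pandas default) is an assumption.

```python
import math

import numpy as np

TRADING_DAYS = 252
RISK_FREE_RATE = 0.02
Z_95 = 1.645


def normal_pdf(z: float) -> float:
    """Standard normal density, the phi in the CVaR formula."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def risk_metrics(daily_returns) -> dict:
    r = np.asarray(daily_returns, dtype=float)
    mu_ann = r.mean() * TRADING_DAYS
    sigma_ann = r.std(ddof=1) * math.sqrt(TRADING_DAYS)

    cum = np.cumprod(1 + r)             # growth of 1 unit invested
    peak = np.maximum.accumulate(cum)   # running high-water mark

    return {
        "annualized_volatility": sigma_ann,
        "sharpe_ratio": (mu_ann - RISK_FREE_RATE) / sigma_ann,
        "max_drawdown": float(np.min(cum / peak - 1)),
        "var_95": Z_95 * sigma_ann - mu_ann,
        "cvar_95": sigma_ann * normal_pdf(Z_95) / 0.05 - mu_ann,
    }


metrics = risk_metrics([0.01, -0.02, 0.015, -0.005, 0.003])
```

Under the normal assumption, `cvar_95` exceeds `var_95` for any positive volatility because φ(1.645) / 0.05 ≈ 2.06 is larger than 1.645.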

Performance Benchmarks

Local Execution (8-core CPU, 16 GB RAM)

Operation	5 symbols	10 symbols
Data fetch (yfinance)	~2 s	~4 s
Preprocessing	~0.1 s	~0.2 s
Regression prediction	~0.5 s	~1.0 s
LSTM training (8 epochs)	~8 s	~12 s
Clustering	~0.05 s	~0.05 s
Optimization (SLSQP)	~0.1 s	~0.1 s
Full pipeline (no LSTM)	~3 s	~5 s
Full pipeline (with LSTM)	~11 s	~17 s

Render Free Tier (512 MB RAM, shared CPU)

Operation	5 symbols
Full pipeline (no LSTM)	~20–30 s
Full pipeline (with LSTM)	~60–90 s (often OOM)

LSTM is disabled by default on Render (ENABLE_LSTM=false).

Prediction Accuracy Estimates

Model	Typical MAE (daily)	Typical MAE (intraday)
LinearRegression	0.015–0.025	0.008–0.015
RandomForest	0.012–0.020	0.007–0.012
Momentum fallback	0.020–0.035	0.010–0.020
LSTM	0.010–0.018	N/A (limited intraday testing)

Note: These are observed ranges on major US equities (AAPL, MSFT, GOOGL, NVDA, TSLA). Actual performance varies by market regime and ticker.

Sample Expected Outputs

Request

POST /api/full-analysis
{
  "symbols": ["AAPL", "MSFT", "GOOGL", "NVDA", "TSLA"],
  "mode": "daily",
  "budget": 10000,
  "risk_preference": "medium",
  "optimization_goal": "max_sharpe"
}

Response (abbreviated)

{
  "live_prices": [
    {"symbol": "AAPL", "price": 178.50, "change_pct": 1.25},
    {"symbol": "MSFT", "price": 420.30, "change_pct": 0.85},
    {"symbol": "GOOGL", "price": 175.20, "change_pct": -0.32},
    {"symbol": "NVDA", "price": 880.15, "change_pct": 3.10},
    {"symbol": "TSLA", "price": 245.60, "change_pct": -1.45}
  ],
  "predictions": [
    {
      "symbol": "AAPL",
      "latest_price": 178.50,
      "predicted_return": 0.0085,
      "trend": "upward",
      "confidence": 72.3,
      "model_used": "random_forest"
    }
  ],
  "portfolio": {
    "weights": {
      "AAPL": 0.25,
      "MSFT": 0.20,
      "GOOGL": 0.15,
      "NVDA": 0.30,
      "TSLA": 0.10
    },
    "expected_return": 0.1245,
    "expected_volatility": 0.1820,
    "sharpe_ratio": 0.574,
    "max_drawdown": -0.2830,
    "portfolio_var_95": 0.0245,
    "portfolio_cvar_95": 0.0308,
    "budget_allocation": {
      "AAPL": 2500.0,
      "MSFT": 2000.0,
      "GOOGL": 1500.0,
      "NVDA": 3000.0,
      "TSLA": 1000.0
    }
  },
  "risk_analysis": {
    "volatility": {"AAPL": 0.22, "MSFT": 0.18, "GOOGL": 0.24, "NVDA": 0.45, "TSLA": 0.55},
    "sharpe_ratio": {"AAPL": 0.45, "MSFT": 0.62, "GOOGL": 0.38, "NVDA": 0.71, "TSLA": 0.22},
    "max_drawdown": {"AAPL": -0.32, "MSFT": -0.28, "GOOGL": -0.35, "NVDA": -0.38, "TSLA": -0.52},
    "correlation_matrix": {
      "AAPL": {"AAPL": 1.0, "MSFT": 0.65, "GOOGL": 0.58, "NVDA": 0.42, "TSLA": 0.35},
      "MSFT": {"AAPL": 0.65, "MSFT": 1.0, "GOOGL": 0.62, "NVDA": 0.48, "TSLA": 0.38}
    }
  },
  "cluster_summary": {
    "cluster_labels": {"AAPL": 1, "MSFT": 1, "GOOGL": 1, "NVDA": 0, "TSLA": 2},
    "profiles": {
      "0": "High-growth, higher-risk",
      "1": "Steady performers",
      "2": "Volatile, underperforming"
    }
  }
}

Limitations

yfinance rate limits: Yahoo Finance can reject rapid requests, which yfinance surfaces as YFRateLimitError. The system skips the affected symbols and continues with the rest, so results may be incomplete.
No real-time streaming: yfinance is poll-based. Intraday data refreshes with yfinance's update cadence (typically every 1–5 minutes during market hours).
Prediction accuracy: ML models are trained on limited historical data. Returns are educational estimates, not financial advice.
Render free tier constraints: 512 MB RAM ceiling means the LSTM is disabled by default and full analyses take 20–30 seconds.
No persistence: All data is ephemeral. Refreshing the page clears all state.
Single-user: No authentication, portfolios, or saved sessions.

Model Training Methodology — Deep Dive

Data Flow (Single Request)

User tickers (e.g. AAPL, MSFT)
        │
        ▼
┌─────────────────────────────────────────────────┐
│ 1. yfinance.download(tickers, period, interval) │  ← Batch fetches all symbols
│    Returns: MultiIndex DataFrame (Date, Ticker) │
└───────────────────────┬─────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────┐
│ 2. Preprocessing (per-symbol)                   │
│    • Split MultiIndex → per-ticker DataFrames    │
│    • Engineer 7 features from OHLCV              │
│    • Drop NaN rows (initial window)              │
│    Output: {symbol: DataFrame} dict              │
└───────────────────────┬─────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────┐
│ 3. Regression Predictor                          │
│    • For each symbol:                            │
│      - Build X (7 features), y (shift(-1) return)│
│      - 80/20 chronological split                 │
│      - Train LinearRegression + RandomForest     │
│      - Compare MAE on test set → pick best       │
│      - Predict next-period return                │
│    Output: [{symbol, pred_return, confidence}]    │
└───────────────────────┬─────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────┐
│ 4. [Optional] LSTM Predictor                     │
│    • Stack all tickers into 3D sequences          │
│    • Single model.fit() across all symbols        │
│    • Forward pass for each ticker                 │
│    • Ensemble average with regression outputs     │
│    Output: [{symbol, pred_return, confidence}]    │
└───────────────────────┬─────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────┐
│ 5. K-Means Clustering                            │
│    • Build 5-feature matrix per symbol            │
│    • StandardScaler → KMeans (k = 1-3)           │
│    • PCA projection for 2D visualization          │
│    Output: {symbol: cluster_label, profiles}      │
└───────────────────────┬─────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────┐
│ 6. Portfolio Optimization                        │
│    • Compute implied equilibrium returns (Π)     │
│    • Blend with ML predictions (Black-Litterman) │
│    • SLSQP minimization (max Sharpe or min Vol)  │
│    • Capital Defense Guards (max 40% per asset)  │
│    Output: {weights, risk_metrics, allocation}   │
└─────────────────────────────────────────────────┘

Why Chronological Split (shuffle=False)?

Financial time series have temporal dependence, so a model must never be trained on data that postdates the period it is evaluated on. Using shuffle=False in train_test_split ensures:

Training set: first 80% of trading days (oldest data)
Test set: last 20% of trading days (most recent data)

This mimics real-world deployment where the model is trained on history and evaluated on unseen recent data. A random shuffle would leak future information into training and overestimate accuracy.

Why Compare Two Models Per Symbol?

Each symbol has different statistical properties:

LinearRegression works well when returns have stable linear relationships with features (mature, low-volatility stocks)
RandomForest captures non-linear interactions and feature hierarchies (volatile, regime-switching stocks)

By training both and selecting the one with lower test-set MAE per symbol, the system automatically adapts to each stock's behavior pattern.

Why Shared LSTM (Not Per-Symbol)?

Training a separate LSTM per symbol would:

Multiply memory usage by symbol count (OOM on Render)
Require per-symbol Python loops (slow)
Give each stock less training data

Instead, all tickers are stacked vertically into one training matrix. The LSTM learns generalizable sequence patterns across all symbols, and a single model.fit() call covers the entire universe. This cuts the number of model builds and fit() calls from n to 1. The cost of each epoch still grows with the total number of stacked sequences, which is why the local benchmark rises from ~8 s at 5 symbols to ~12 s at 10.

Ensemble Strategy

When LSTM is enabled (ENABLE_LSTM=true):

ensemble_return = (regression_return + lstm_return) / 2
ensemble_confidence = (regression_confidence + lstm_confidence) / 2

Why equal weight (not weighted by validation performance)?

Regression and LSTM operate on fundamentally different representations (tabular features vs raw sequences)
Their error distributions are often uncorrelated — averaging diversifies model risk
Validation performance on 20% holdout is noisy with small samples
Equal weighting is a robust default when model skill varies by market regime

Evaluation Metrics — Interpretation Guide

MAE (Mean Absolute Error)

MAE Range	Implication	Typical Cause
< 0.01	Very good fit	Stable trend, strong feature-signal relationship
0.01–0.02	Moderate fit	Normal for daily equity returns
0.02–0.05	Weak fit	High volatility, regime change, low signal/noise
> 0.05	Poor fit	Insufficient data, erratic price action

The MAE is the average absolute error in predicting the next period's return. Since daily returns for major equities typically range from -5% to +5%, an MAE of 0.02 means the average prediction is off by 2 percentage points.

Confidence Score (0–100)

error_score     = max(0, 1 - MAE / 0.05)        # 60% weight
signal_score    = min(|pred_return| / 0.03, 1)   # 40% weight
confidence      = 100 × (0.6 × error_score + 0.4 × signal_score)

Confidence	Meaning
70–100	High confidence: low error and strong signal
50–70	Moderate confidence: reasonable error or moderate signal
30–50	Low confidence: high error or weak signal (the momentum fallback reports a fixed 35.0)
< 30	Very low confidence: high error and weak signal

Sharpe Ratio Interpretation

Sharpe	Risk-Reward
> 1.0	Excellent — significantly more return than risk
0.5–1.0	Good — acceptable risk-adjusted return
0.0–0.5	Mediocre — barely compensated for risk
< 0.0	Poor — negative risk-adjusted return

Computed with RISK_FREE_RATE = 0.02 (2% annual risk-free rate).

VaR and CVaR

VaR(95%) = 1.645σ − μ : "with 95% confidence, losses won't exceed this amount"
CVaR(95%) = φ(1.645)σ/0.05 − μ : "in the worst 5% of outcomes, losses average this amount"

CVaR is always larger (worse) than VaR because it represents the expected shortfall in the tail beyond VaR. Here φ is the standard normal density, and both are reported as positive loss figures computed from annualized μ and σ, matching the Risk Metrics table.

Performance Comparison — Optimized vs Equal-Weight

Methodology

To evaluate whether the optimization adds value, compare against a naive equal-weight (1/n) baseline:

Scenario	Equal-Weight (1/5)	Optimized (Max Sharpe)	Improvement
AAPL, MSFT, GOOGL, NVDA, TSLA (daily, 6mo)	Sharpe ~0.35	Sharpe ~0.57	+63%
AAPL, MSFT, GOOGL (daily, 6mo)	Sharpe ~0.42	Sharpe ~0.61	+45%
AAPL, MSFT, JPM, KO (daily, 6mo, min_vol)	Vol ~0.18	Vol ~0.14	-22% risk

Note: These are representative values from test runs. Actual results depend on the specific time period, market conditions, and selected symbols.
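The Improvement column is a relative change against the equal-weight baseline. The sketch below shows how such a comparison could be computed; `portfolio_sharpe` and `relative_improvement` are hypothetical helpers and the inputs are illustrative, not project data.

```python
import numpy as np


def portfolio_sharpe(weights, mu, cov, rfr: float = 0.02) -> float:
    """Sharpe ratio of a weight vector given annualized returns and covariance."""
    w = np.asarray(weights, dtype=float)
    return float((w @ mu - rfr) / np.sqrt(w @ cov @ w))


def relative_improvement(baseline: float, optimized: float) -> float:
    """Fractional change of the optimized figure relative to the baseline."""
    return (optimized - baseline) / abs(baseline)


# Reproduce the Improvement column from the table's own figures.
sharpe_5 = relative_improvement(0.35, 0.57)   # about +63%
sharpe_3 = relative_improvement(0.42, 0.61)   # about +45%
vol_4 = relative_improvement(0.18, 0.14)      # about -22%

# Single-asset check: (0.12 - 0.02) / sqrt(0.04) = 0.5
single = portfolio_sharpe([1.0, 0.0], np.array([0.12, 0.05]), np.diag([0.04, 0.01]))
```

To run a real comparison, evaluate `portfolio_sharpe` once with `np.full(n, 1 / n)` and once with the optimizer's weights over the same `mu` and `cov`.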

When Optimization Helps Most

Diverse symbols (tech + consumer + energy) — correlation benefits fully exploited
High-conviction predictions — Black-Litterman blends ML views effectively
Low-volatility environment — Sharpe differences are more pronounced

When Equal-Weight Matches or Beats Optimization

Highly correlated symbols (all tech) — covariance structure offers little diversification
Random price movement: predictions near zero, so the Black-Litterman posterior collapses toward the equal-weight prior
Very small symbol sets (2–3 tickers) — constraints dominate, little room for differentiation

How to Run Custom Comparisons

CLI Test Script

# Local: test with 5 stocks, daily mode, max sharpe
curl -X POST http://localhost:8000/api/full-analysis \
  -H "Content-Type: application/json" \
  -d '{"symbols":["AAPL","MSFT","GOOGL","NVDA","TSLA"],"mode":"daily","budget":10000,"risk_preference":"medium","optimization_goal":"max_sharpe"}'
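For scripted comparisons, the same request can be built from Python with the standard library. This is a sketch: `build_full_analysis_request` and `run_full_analysis` are hypothetical helpers, and the endpoint path and field names are copied from the curl example above.

```python
import json
import urllib.request


def build_full_analysis_request(symbols, base_url="http://localhost:8000", **options):
    """Build (but do not send) a POST request for /api/full-analysis."""
    payload = {"symbols": list(symbols), **options}
    return urllib.request.Request(
        url=f"{base_url}/api/full-analysis",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_full_analysis(request, timeout: float = 120.0) -> dict:
    """Send the request to a running backend and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_full_analysis_request(
    ["AAPL", "MSFT", "GOOGL", "NVDA", "TSLA"],
    mode="daily",
    budget=10000,
    risk_preference="medium",
    optimization_goal="max_sharpe",
)
```

Calling `run_full_analysis(request)` requires the backend to be running locally; looping over several `optimization_goal` or `risk_preference` values gives a simple batch comparison.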

Comparing Models Side-by-Side

The response includes model_comparison per symbol (from regression predictor):

{
  "symbol": "AAPL",
  "predicted_return": 0.0085,
  "model_used": "random_forest",
  "model_comparison": {
    "linear_regression": {
      "predicted_return": 0.0062,
      "mae": 0.0185,
      "confidence": 65.3
    },
    "random_forest": {
      "predicted_return": 0.0085,
      "mae": 0.0142,
      "confidence": 72.3
    }
  }
}

This lets you see which model was selected and how both performed on the test set.

A/B Test: With vs Without LSTM

# Without LSTM (default on Render)
curl -X POST http://localhost:8000/api/full-analysis \
  -H "Content-Type: application/json" \
  -d '{"symbols":["AAPL","MSFT"],"mode":"daily","model":"regression"}'

# With LSTM (local only, requires TensorFlow and ENABLE_LSTM=true)
curl -X POST http://localhost:8000/api/full-analysis \
  -H "Content-Type: application/json" \
  -d '{"symbols":["AAPL","MSFT"],"mode":"daily","model":"ensemble"}'

Compare the confidence scores and predicted returns. The LSTM typically produces slightly different predictions because it sees the raw sequence rather than engineered features.

Known Issues & Mitigations

Issue	Status	Mitigation
`database is locked` (yfinance)	Fixed	Per-process temp dir for tz cache
Empty DataFrame `.iloc` crashes	Fixed	Guards added in regression, risk_metrics, preprocessing, optimizer
WebSocket connects to wrong origin	Fixed	`VITE_API_URL` env var for WS target
CORS block on deploy	Fixed	`https://jmr825.github.io` added to allowed origins
TensorFlow OOM on Render	Workaround	LSTM disabled by default (`ENABLE_LSTM=false`)
yfinance rate limit	Workaround	Skips affected symbol, continues with remaining data
