┌──────────┐ ┌──────────────────────────────────────┐ ┌──────────────┐
│ React │ ◄──► │ FastAPI Backend │ ◄──► │ yfinance │
│ + Vite │ │ │ │ Market Data │
│ UI │ │ /api/health → health check │ └──────────────┘
│ │ │ /api/live-data → live prices │
│ │ │ /api/analyze → preprocessing+risk│
│ │ │ /api/predict → ML predictions │
│ │ │ /api/cluster → KMeans clustering │
│ │ │ /api/optimize → portfolio weight │
│ │ │ /api/full-analysis → complete pipe │
│ │ │ /ws/prices → live price stream │
└──────────┘ └──────────────────────────────────────┘
All data processing is 100% in-memory. No disk I/O occurs during request handling (CSV generation is a separate dev-only utility).
Pre-fetched daily and intraday CSV files for 28 major US equities and ETFs are available in the live_data/ directory at the project root:
The datasets were fetched on 12 June.
| Folder | Contents |
|---|---|
| `live_data/daily/` | 28 CSV files, each with 1 year of daily OHLCV data (AAPL, MSFT, GOOGL, AMZN, TSLA, META, NVDA, JPM, NFLX, AMD, AVGO, COST, KO, PEP, WMT, PG, V, BAC, JNJ, UNH, XOM, CAT, GE, LIN, SPY, QQQ, IWM, GLD) |
| `live_data/intraday/` | Intraday 5m bars for AAPL, MSFT, GOOGL, NVDA (under the `5m/` subfolder) |
These can be used with the sample data provider for testing without yfinance calls. The generator script is at backend/app/utils/sample_data_generator.py.
| Detail | Value |
|---|---|
| Provider | yfinance (yfinance>=0.2.40) |
| Fetch strategy | Batch download via yf.download() |
| Cache | None — timezone cache redirected to per-process temp dir to avoid SQLite locking on Render's ephemeral filesystem |
| Auto-adjust | True (splits/dividends adjusted) |
| Threading | True (parallel ticker downloads) |
Default intervals by mode:
| Mode | Period | Interval | Min rows |
|---|---|---|---|
| `"daily"` | `"6mo"` | `"1d"` | 5 |
| `"intraday"` | `"2d"` | `"5m"` | 20 |
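The provider settings and per-mode defaults above could be wired together roughly as follows. This is a minimal sketch: `FETCH_DEFAULTS`, `download_kwargs`, and `fetch` are hypothetical names, not the project's own, and `progress=False` is an added assumption. Only `yf.download()` and its `period`, `interval`, `auto_adjust`, and `threads` arguments come from the tables.

```python
# Defaults copied from the "Default intervals by mode" table.
FETCH_DEFAULTS = {
    "daily": {"period": "6mo", "interval": "1d", "min_rows": 5},
    "intraday": {"period": "2d", "interval": "5m", "min_rows": 20},
}


def download_kwargs(mode: str) -> dict:
    """Keyword arguments for yf.download() in the given mode (hypothetical helper)."""
    cfg = FETCH_DEFAULTS[mode]
    return {
        "period": cfg["period"],
        "interval": cfg["interval"],
        "auto_adjust": True,  # splits/dividends adjusted
        "threads": True,      # parallel ticker downloads
        "progress": False,    # assumption: silence the progress bar on a server
    }


def fetch(symbols: list[str], mode: str = "daily"):
    """Batch-download all symbols in one call (needs network access)."""
    import yfinance as yf  # imported lazily so the config works without yfinance installed

    return yf.download(symbols, **download_kwargs(mode))


intraday_args = download_kwargs("intraday")
```

A caller would then compare the row count of each symbol's frame against `FETCH_DEFAULTS[mode]["min_rows"]` before passing it on to preprocessing.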
Applied per-symbol chronologically:
| Feature | Formula | Purpose |
|---|---|---|
| `daily_return` | `close.pct_change()` | Base signal for all downstream models |
| `log_return` | `log(close / close.shift(1))` | Alternative return measure |
| `sma_5`, `sma_10`, `sma_20` | Rolling mean of `close` | Trend identification (daily) |
| `sma_12`, `sma_24`, `sma_78` | Rolling mean of `close` | Trend identification (intraday) |
| `rolling_volatility` | Rolling std of `daily_return` × sqrt(252) | Risk measurement |
| `rsi` | Vectorized RSI (14-period) | Momentum/overbought-oversold |
| `momentum` | `close.pct_change(periods=10)` (daily) or `close.pct_change(periods=12)` (intraday) | Short-term price momentum |
Adaptive windows by mode:
| Parameter | Daily | Intraday |
|---|---|---|
| SMA windows | 5, 10, 20 | 12, 24, 78 |
| Volatility window | 20 | 24 |
| RSI period | 14 | 14 |
| Momentum window | 10 | 12 |
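The feature table and the adaptive windows can be combined into a single per-symbol function. The sketch below is illustrative only: `WINDOWS` and `engineer_features` are hypothetical names, and the RSI uses simple rolling means of gains and losses, which is an assumption (the project's vectorized RSI may use Wilder smoothing instead).

```python
import numpy as np
import pandas as pd

# Window sizes copied from the "Adaptive windows by mode" table.
WINDOWS = {
    "daily": {"sma": (5, 10, 20), "vol": 20, "rsi": 14, "momentum": 10},
    "intraday": {"sma": (12, 24, 78), "vol": 24, "rsi": 14, "momentum": 12},
}


def engineer_features(df: pd.DataFrame, mode: str = "daily") -> pd.DataFrame:
    """Add the documented features to one symbol's frame (needs a 'close' column)."""
    w = WINDOWS[mode]
    out = df.copy()
    close = out["close"]
    out["daily_return"] = close.pct_change()
    out["log_return"] = np.log(close / close.shift(1))
    for n in w["sma"]:
        out[f"sma_{n}"] = close.rolling(n).mean()
    out["rolling_volatility"] = out["daily_return"].rolling(w["vol"]).std() * np.sqrt(252)
    # RSI from rolling average gain and loss (assumed variant, see lead-in).
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(w["rsi"]).mean()
    loss = (-delta.clip(upper=0)).rolling(w["rsi"]).mean()
    out["rsi"] = 100 - 100 / (1 + gain / loss)
    out["momentum"] = close.pct_change(periods=w["momentum"])
    return out.dropna()  # drops the warm-up rows of the longest window


# Synthetic prices, used only to exercise the function.
rng = np.random.default_rng(0)
prices = pd.DataFrame({"close": 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 60)))})
features = engineer_features(prices, mode="daily")
```

With 60 input rows in daily mode, the 20-bar volatility window is the last to fill, so the first 20 rows are dropped.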
| Model | Parameters | Purpose |
|---|---|---|
| `LinearRegression` | Defaults (OLS) | Baseline linear model |
| `RandomForestRegressor` | `n_estimators=50, max_depth=5, random_state=42` | Non-linear ensemble |
- Target: Next-period return (`daily_return.shift(-1)`)
- Train/test split: 80/20 chronological (`shuffle=False`)
- Minimum training rows: 20
- Scaling: `StandardScaler` applied only to LinearRegression (Random Forest uses raw values)
- Selection: Model with lower MAE on the 20% test set is used for predictions
Both models train on 7 features:
- `daily_return`: recent return
- `sma_5` / `sma_12`: short-term moving average
- `sma_10` / `sma_24`: medium-term moving average
- `sma_20` / `sma_78`: long-term moving average
- `rolling_volatility`: trailing risk estimate
- `rsi`: momentum oscillator
- `momentum`: price change over window
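The training and selection procedure described above might look like the following sketch. `select_model` is a hypothetical name and the input data is synthetic. The model parameters, the chronological 80/20 split, the scaler applied only to the linear model, and the MAE-based selection follow the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def select_model(X: np.ndarray, y: np.ndarray):
    """Train both models on a chronological 80/20 split and keep the lower-MAE one."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
    if len(X_train) < 20:  # minimum training rows
        raise ValueError("not enough training rows")

    scaler = StandardScaler().fit(X_train)  # fitted on training rows only
    linear = LinearRegression().fit(scaler.transform(X_train), y_train)
    forest = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=42)
    forest.fit(X_train, y_train)  # Random Forest sees raw (unscaled) values

    mae = {
        "linear_regression": mean_absolute_error(y_test, linear.predict(scaler.transform(X_test))),
        "random_forest": mean_absolute_error(y_test, forest.predict(X_test)),
    }
    return min(mae, key=mae.get), mae


# Synthetic stand-in for the 7 features and the next-period return.
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 7))
y = 0.01 * X[:, 0] + rng.normal(scale=0.005, size=120)
best, mae = select_model(X, y)
```

Fitting the scaler on the training rows only keeps test-set statistics out of training, which matches the intent of the chronological split.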
| Metric | Formula | Usage |
|---|---|---|
| MAE | `mean(abs(y_true - y_pred))` | Model selection (lower is better) |
| Confidence | `0.6 × max(0, 1 - MAE/0.05) + 0.4 × min(abs(pred)/0.03, 1)` | Heuristic score (0–100) |
Confidence formula rationale:
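- MAE well below 0.05 (5% error) → error_score close to 1; MAE ≥ 0.05 → error_score = 0
- |predicted_return| ≥ 0.03 (3% return) → signal_score saturates at 1
- Error weighted 60%, signal weighted 40%
- The weighted sum lies in [0, 1]; because the score is reported on a 0–100 scale, it is presumably multiplied by 100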
If regression fails (insufficient data), falls back to momentum:
predicted_return = momentum * 1.0
confidence = 35.0
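A worked version of the confidence heuristic and the momentum fallback, as a minimal sketch. The function names are hypothetical, and the multiplication by 100 is an assumption made to match the stated 0–100 range.

```python
def confidence_score(mae: float, predicted_return: float) -> float:
    """Heuristic confidence: 60% from test error, 40% from signal strength."""
    error_score = max(0.0, 1.0 - mae / 0.05)                # 0 once MAE reaches 5%
    signal_score = min(abs(predicted_return) / 0.03, 1.0)   # saturates at a 3% predicted move
    return 100.0 * (0.6 * error_score + 0.4 * signal_score)  # assumption: scaled to 0-100


def momentum_fallback(momentum: float) -> tuple[float, float]:
    """Used when regression fails: momentum becomes the prediction, confidence is fixed."""
    return momentum * 1.0, 35.0


# MAE of 1% and a predicted move of 1.5%:
# error_score = 0.8, signal_score = 0.5, so 0.6 * 0.8 + 0.4 * 0.5 = 0.68
score = confidence_score(mae=0.01, predicted_return=0.015)
fallback_return, fallback_confidence = momentum_fallback(0.012)
```

Because the signal term rewards large predicted moves regardless of direction or accuracy, the score is better read as a ranking aid than as a probability.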
Lazy-loaded TensorFlow/Keras — only activates when ENABLE_LSTM=true. Disabled by default on Render to stay within 512 MB RAM.
Input(seq_length, 3) → LSTM(16, return_sequences=False) → Dropout(0.2) → Dense(1)
Input features per timestep: [close (normalized), daily_return, rsi]
| Parameter | Daily | Intraday |
|---|---|---|
| Sequence length | 20 bars | 39 bars |
| Min rows | 40 | 100 |
| Epochs | 8 | 5 |
| Batch size | 16 | 32 |
| Validation split | 0.1 (if 20+ samples) | same |
| Optimizer | Adam | Adam |
| Loss | MSE | MSE |
All tickers share a single LSTM model. Sequences from all tickers are stacked vertically and one model.fit() call trains across all symbols. This avoids per-symbol Python loops and keeps memory usage predictable.
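The stacking step can be illustrated with NumPy alone. In this sketch `make_sequences` is a hypothetical helper, the data is synthetic, and aligning each window with the return of the bar that follows it is an assumption about how targets are built.

```python
import numpy as np


def make_sequences(features: np.ndarray, targets: np.ndarray, seq_length: int):
    """Slice one ticker's (rows, 3) feature array into overlapping windows."""
    n_windows = len(features) - seq_length
    X = np.stack([features[i : i + seq_length] for i in range(n_windows)])
    y = targets[seq_length:]  # value that follows each window (assumed alignment)
    return X, y


# Synthetic (rows, 3) arrays standing in for [close (normalized), daily_return, rsi].
rng = np.random.default_rng(0)
per_ticker = {sym: rng.normal(size=(60, 3)) for sym in ("AAPL", "MSFT")}

parts = [make_sequences(f, f[:, 1], seq_length=20) for f in per_ticker.values()]
X_all = np.concatenate([p[0] for p in parts])  # (80, 20, 3): all tickers stacked
y_all = np.concatenate([p[1] for p in parts])  # (80,)
```

`X_all` and `y_all` are what a single `model.fit()` call would receive; the Keras model itself is omitted so the example runs without TensorFlow.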
- `predicted_return`: final timestep forward pass
- `trend`: ternary classification at 0.2% threshold
- `confidence`: scaled by `recent_volatility`, clamped to [30, 85]
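The ternary trend rule and the confidence clamp can be sketched as below. Only the `"upward"` label appears in the sample response; `"downward"` and `"neutral"` are hypothetical label names, and the volatility scaling itself is not reproduced because its formula is not given here.

```python
def classify_trend(predicted_return: float, threshold: float = 0.002) -> str:
    """Three-way trend label around a 0.2% dead band."""
    if predicted_return > threshold:
        return "upward"
    if predicted_return < -threshold:
        return "downward"  # hypothetical label name
    return "neutral"       # hypothetical label name


def clamp_confidence(raw: float, low: float = 30.0, high: float = 85.0) -> float:
    """Keep an LSTM confidence value inside the documented [30, 85] range."""
    return max(low, min(high, raw))


trend = classify_trend(0.0085)
```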
When LSTM is enabled, predictions are averaged with regression outputs:
ensemble_return = (regression_return + lstm_return) / 2
KMeans with random_state=42, n_init=10.
| Symbols | Clusters |
|---|---|
| 1–2 | 1 |
| 3–4 | 2 |
| 5+ | 3 |
Built from per-symbol aggregates (preprocessing.build_feature_matrix):
| Feature | Calculation |
|---|---|
| `avg_return` | `daily_return.mean()` |
| `volatility` | `daily_return.std()` |
| `momentum` | Most recent `momentum` value |
| `rsi` | Most recent `rsi` value |
| `beta_proxy` | `std / (std + 1e-6)` |
Scaled via StandardScaler before clustering.
| Condition | Label |
|---|---|
| `avg_ret > 0.001 AND avg_vol > 0.02` | High-growth, higher-risk |
| `avg_ret > 0 AND avg_vol ≤ 0.02` | Steady performers |
| `avg_ret ≤ 0 AND avg_vol > 0.02` | Volatile, underperforming |
| Otherwise | Defensive / low momentum |
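The cluster-count rule and the profile labels translate directly into two small functions. This is a sketch with hypothetical function names; the thresholds and labels are taken from the tables above.

```python
def cluster_count(n_symbols: int) -> int:
    """Number of KMeans clusters for a given symbol count."""
    if n_symbols <= 2:
        return 1
    if n_symbols <= 4:
        return 2
    return 3


def profile_label(avg_ret: float, avg_vol: float) -> str:
    """Map a cluster's average return and volatility to its profile label."""
    if avg_ret > 0.001 and avg_vol > 0.02:
        return "High-growth, higher-risk"
    if avg_ret > 0 and avg_vol <= 0.02:
        return "Steady performers"
    if avg_ret <= 0 and avg_vol > 0.02:
        return "Volatile, underperforming"
    return "Defensive / low momentum"


label = profile_label(avg_ret=0.0015, avg_vol=0.03)
```

Note that a cluster with a small positive return (between 0 and 0.001) and volatility above 0.02 matches none of the first three rows and therefore lands in the "Defensive / low momentum" bucket.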
Computes implied equilibrium returns via reverse optimization, then blends with model predictions:
- Equilibrium returns: `Π = δ × Σ × w_eq`, where Σ is the historical covariance matrix, δ is risk aversion, and w_eq is the equal-weight prior
- View matrix P: Identity matrix (one view per asset)
- View uncertainty Ω: `τ × diag(diag(Σ))`, where τ = 0.05
- Posterior blend: `E[R] = ((τΣ)⁻¹ + PᵀΩ⁻¹P)⁻¹ × ((τΣ)⁻¹Π + PᵀΩ⁻¹ × Q)`, where Q is the vector of model-predicted returns (the views)
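The posterior blend can be written out in a few lines of NumPy. This is a sketch under stated assumptions: `bl_posterior` is a hypothetical name, and `delta=2.5` is a placeholder risk-aversion value, since the actual δ is not given here. τ = 0.05, the identity view matrix, and the diagonal Ω follow the text.

```python
import numpy as np


def bl_posterior(cov: np.ndarray, views: np.ndarray, delta: float = 2.5, tau: float = 0.05) -> np.ndarray:
    """Black-Litterman posterior returns with one absolute view per asset."""
    n = cov.shape[0]
    w_eq = np.full(n, 1.0 / n)               # equal-weight prior
    pi = delta * cov @ w_eq                  # implied equilibrium returns
    P = np.eye(n)                            # identity view matrix
    omega = tau * np.diag(np.diag(cov))      # view uncertainty
    tau_cov_inv = np.linalg.inv(tau * cov)
    omega_inv = np.linalg.inv(omega)
    A = tau_cov_inv + P.T @ omega_inv @ P
    b = tau_cov_inv @ pi + P.T @ omega_inv @ views
    return np.linalg.solve(A, b)


cov = np.diag([0.04, 0.09])      # synthetic, uncorrelated assets
views = np.array([0.10, 0.02])   # stand-in for model-predicted returns (Q)
posterior = bl_posterior(cov, views)
```

For uncorrelated assets Ω equals τΣ, so the posterior is the simple average of the equilibrium return and the view for each asset; correlations shift that balance.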
| Constraint | Value |
|---|---|
| Asset bounds | (0.0, 0.40) — no shorting, max 40% per asset |
| Budget | Σ weights = 1.0 |
| Solver | SLSQP (scipy.optimize.minimize) |
| Goal | Objective |
|---|---|
| `"max_sharpe"` | Maximize (return - RFR) / (vol × risk_multiplier) |
| `"min_volatility"` | Minimize sqrt(wᵀΣw) |
| Preference | risk_multiplier |
|---|---|
| Low | 0.5 |
| Medium | 1.0 |
| High | 1.5 |
If data is insufficient (< 5 rows per symbol), returns equal-weight allocation.
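The constraint set maps onto `scipy.optimize.minimize` as shown in the sketch below, which covers only the `"min_volatility"` goal. `min_volatility_weights` is a hypothetical name, and returning the equal-weight starting point when the solver reports failure is an assumption rather than documented behavior.

```python
import numpy as np
from scipy.optimize import minimize


def min_volatility_weights(cov: np.ndarray, max_weight: float = 0.40) -> np.ndarray:
    """Minimize sqrt(w' Σ w) with no shorting, a per-asset cap, and full investment."""
    n = cov.shape[0]
    x0 = np.full(n, 1.0 / n)  # equal-weight starting point
    result = minimize(
        lambda w: np.sqrt(w @ cov @ w),
        x0,
        method="SLSQP",
        bounds=[(0.0, max_weight)] * n,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return result.x if result.success else x0  # assumed fallback on solver failure


cov = np.diag([0.04, 0.09, 0.16])  # synthetic covariance, lowest variance first
weights = min_volatility_weights(cov)
```

With a 40% cap, at least three assets are needed before the budget constraint can be met, so smaller symbol sets cannot be optimized under these bounds as written.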
| Metric | Formula | Interpretation |
|---|---|---|
| Annualized Volatility | σ × √252 | Total risk |
| Sharpe Ratio | (μ × 252 − RFR) / (σ × √252) | Risk-adjusted return (RFR = 2%) |
| Max Drawdown | min((cum / peak) − 1) | Worst peak-to-trough loss |
| VaR (95%) | 1.645 × σ_p − μ_p | Maximum loss at 95% confidence (normal) |
| CVaR (95%) | σ_p × φ(1.645) / 0.05 − μ_p | Expected loss beyond VaR (normal) |
All metrics use annualized values with TRADING_DAYS = 252.
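The parametric VaR/CVaR formulas and the drawdown definition can be checked with the standard library alone. The function names are hypothetical and the inputs are synthetic; the formulas are the ones in the table.

```python
from statistics import NormalDist


def normal_var_cvar(mu_p: float, sigma_p: float, z: float = 1.645, alpha: float = 0.05):
    """Parametric (normal) VaR and CVaR, both expressed as positive losses."""
    pdf_at_z = NormalDist().pdf(z)            # φ(1.645), standard normal density
    var = z * sigma_p - mu_p
    cvar = sigma_p * pdf_at_z / alpha - mu_p
    return var, cvar


def max_drawdown(returns) -> float:
    """Worst peak-to-trough loss of the cumulative return path: min(cum / peak - 1)."""
    cum, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        cum *= 1.0 + r
        peak = max(peak, cum)
        worst = min(worst, cum / peak - 1.0)
    return worst


var_95, cvar_95 = normal_var_cvar(mu_p=0.0, sigma_p=0.01)
drawdown = max_drawdown([0.10, -0.20, 0.05])  # trough is 20% below the peak
```

Since φ(1.645) / 0.05 is roughly 2.06, the CVaR multiplier on σ_p always exceeds the 1.645 used for VaR.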
| Operation | 5 symbols | 10 symbols |
|---|---|---|
| Data fetch (yfinance) | ~2 s | ~4 s |
| Preprocessing | ~0.1 s | ~0.2 s |
| Regression prediction | ~0.5 s | ~1.0 s |
| LSTM training (8 epochs) | ~8 s | ~12 s |
| Clustering | ~0.05 s | ~0.05 s |
| Optimization (SLSQP) | ~0.1 s | ~0.1 s |
| Full pipeline (no LSTM) | ~3 s | ~5 s |
| Full pipeline (with LSTM) | ~11 s | ~17 s |
| Operation | 5 symbols |
|---|---|
| Full pipeline (no LSTM) | ~20–30 s |
| Full pipeline (with LSTM) | ~60–90 s (often OOM) |
LSTM is disabled by default on Render (ENABLE_LSTM=false).
| Model | Typical MAE (daily) | Typical MAE (intraday) |
|---|---|---|
| LinearRegression | 0.015–0.025 | 0.008–0.015 |
| RandomForest | 0.012–0.020 | 0.007–0.012 |
| Momentum fallback | 0.020–0.035 | 0.010–0.020 |
| LSTM | 0.010–0.018 | N/A (limited intraday testing) |
Note: These are observed ranges on major US equities (AAPL, MSFT, GOOGL, NVDA, TSLA). Actual performance varies by market regime and ticker.
POST /api/full-analysis
{
"symbols": ["AAPL", "MSFT", "GOOGL", "NVDA", "TSLA"],
"mode": "daily",
"budget": 10000,
"risk_preference": "medium",
"optimization_goal": "max_sharpe"
}

{
"live_prices": [
{"symbol": "AAPL", "price": 178.50, "change_pct": 1.25},
{"symbol": "MSFT", "price": 420.30, "change_pct": 0.85},
{"symbol": "GOOGL", "price": 175.20, "change_pct": -0.32},
{"symbol": "NVDA", "price": 880.15, "change_pct": 3.10},
{"symbol": "TSLA", "price": 245.60, "change_pct": -1.45}
],
"predictions": [
{
"symbol": "AAPL",
"latest_price": 178.50,
"predicted_return": 0.0085,
"trend": "upward",
"confidence": 72.3,
"model_used": "random_forest"
}
],
"portfolio": {
"weights": {
"AAPL": 0.25,
"MSFT": 0.20,
"GOOGL": 0.15,
"NVDA": 0.30,
"TSLA": 0.10
},
"expected_return": 0.1245,
"expected_volatility": 0.1820,
"sharpe_ratio": 0.574,
"max_drawdown": -0.2830,
"portfolio_var_95": 0.0245,
"portfolio_cvar_95": 0.0308,
"budget_allocation": {
"AAPL": 2500.0,
"MSFT": 2000.0,
"GOOGL": 1500.0,
"NVDA": 3000.0,
"TSLA": 1000.0
}
},
"risk_analysis": {
"volatility": {"AAPL": 0.22, "MSFT": 0.18, "GOOGL": 0.24, "NVDA": 0.45, "TSLA": 0.55},
"sharpe_ratio": {"AAPL": 0.45, "MSFT": 0.62, "GOOGL": 0.38, "NVDA": 0.71, "TSLA": 0.22},
"max_drawdown": {"AAPL": -0.32, "MSFT": -0.28, "GOOGL": -0.35, "NVDA": -0.38, "TSLA": -0.52},
"correlation_matrix": {
"AAPL": {"AAPL": 1.0, "MSFT": 0.65, "GOOGL": 0.58, "NVDA": 0.42, "TSLA": 0.35},
"MSFT": {"AAPL": 0.65, "MSFT": 1.0, "GOOGL": 0.62, "NVDA": 0.48, "TSLA": 0.38}
}
},
"cluster_summary": {
"cluster_labels": {"AAPL": 1, "MSFT": 1, "GOOGL": 1, "NVDA": 0, "TSLA": 2},
"profiles": {
"0": "High-growth, higher-risk",
"1": "Steady performers",
"2": "Volatile, underperforming"
}
}
}

- yfinance rate limits: Free tier can return `YFRateLimitError` under rapid requests. The system handles this gracefully by skipping affected symbols, but data may be stale.
- No real-time streaming: yfinance is poll-based. Intraday data refreshes with yfinance's update cadence (typically every 1–5 minutes during market hours).
- Prediction accuracy: ML models are trained on limited historical data. Returns are educational estimates, not financial advice.
- Render free tier constraints: 512 MB RAM ceiling means the LSTM is disabled by default and full analyses take 20–30 seconds.
- No persistence: All data is ephemeral. Refreshing the page clears all state.
- Single-user: No authentication, portfolios, or saved sessions.
User tickers (e.g. AAPL, MSFT)
│
▼
┌─────────────────────────────────────────────────┐
│ 1. yfinance.download(tickers, period, interval) │ ← Batch fetches all symbols
│ Returns: MultiIndex DataFrame (Date, Ticker) │
└───────────────────────┬─────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ 2. Preprocessing (per-symbol) │
│ • Split MultiIndex → per-ticker DataFrames │
│ • Engineer 7 features from OHLCV │
│ • Drop NaN rows (initial window) │
│ Output: {symbol: DataFrame} dict │
└───────────────────────┬─────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ 3. Regression Predictor │
│ • For each symbol: │
│ - Build X (7 features), y (shift(-1) return)│
│ - 80/20 chronological split │
│ - Train LinearRegression + RandomForest │
│ - Compare MAE on test set → pick best │
│ - Predict next-period return │
│ Output: [{symbol, pred_return, confidence}] │
└───────────────────────┬─────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ 4. [Optional] LSTM Predictor │
│ • Stack all tickers into 3D sequences │
│ • Single model.fit() across all symbols │
│ • Forward pass for each ticker │
│ • Ensemble average with regression outputs │
│ Output: [{symbol, pred_return, confidence}] │
└───────────────────────┬─────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ 5. K-Means Clustering │
│ • Build 5-feature matrix per symbol │
│ • StandardScaler → KMeans (k = 1-3) │
│ • PCA projection for 2D visualization │
│ Output: {symbol: cluster_label, profiles} │
└───────────────────────┬─────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ 6. Portfolio Optimization │
│ • Compute implied equilibrium returns (Π) │
│ • Blend with ML predictions (Black-Litterman) │
│ • SLSQP minimization (max Sharpe or min Vol) │
│ • Capital Defense Guards (max 40% per asset) │
│ Output: {weights, risk_metrics, allocation} │
└─────────────────────────────────────────────────┘
Financial time series have temporal dependence, so a model must never be trained on data from after the period it is evaluated on. Using shuffle=False in train_test_split ensures:
- Training set: first 80% of trading days (oldest data)
- Test set: last 20% of trading days (most recent data)
This mimics real-world deployment where the model is trained on history and evaluated on unseen recent data. A random shuffle would leak future information into training and overestimate accuracy.
Each symbol has different statistical properties:
- LinearRegression works well when returns have stable linear relationships with features (mature, low-volatility stocks)
- RandomForest captures non-linear interactions and feature hierarchies (volatile, regime-switching stocks)
By training both and selecting the one with lower test-set MAE per symbol, the system automatically adapts to each stock's behavior pattern.
Training a separate LSTM per symbol would:
- Multiply memory usage by symbol count (OOM on Render)
- Require per-symbol Python loops (slow)
- Give each stock less training data
Instead, all tickers are stacked vertically into one training matrix. The LSTM learns generalizable sequence patterns across all symbols, and a single model.fit() call covers the entire universe. This replaces n separate training runs with one; the cost of each epoch still grows with the number of stacked sequences, so training time is not fully independent of symbol count.
When LSTM is enabled (ENABLE_LSTM=true):
ensemble_return = (regression_return + lstm_return) / 2
ensemble_confidence = (regression_confidence + lstm_confidence) / 2
Why equal weight (not weighted by validation performance)?
- Regression and LSTM operate on fundamentally different representations (tabular features vs raw sequences)
- Their error distributions are often uncorrelated — averaging diversifies model risk
- Validation performance on 20% holdout is noisy with small samples
- Equal weighting is the most robust strategy when model skill varies by market regime
| MAE Range | Implication | Typical Cause |
|---|---|---|
| < 0.01 | Very good fit | Stable trend, strong feature-signal relationship |
| 0.01–0.02 | Moderate fit | Normal for daily equity returns |
| 0.02–0.05 | Weak fit | High volatility, regime change, low signal/noise |
| > 0.05 | Poor fit | Insufficient data, erratic price action |
The MAE is the average absolute error in predicting the next period's return. Since daily returns for major equities typically range from -5% to +5%, an MAE of 0.02 means the average prediction is off by 2 percentage points.
error_score = max(0, 1 - MAE / 0.05) # 60% weight
signal_score = min(|pred_return| / 0.03, 1) # 40% weight
confidence = 0.6 × error_score + 0.4 × signal_score
| Confidence | Meaning |
|---|---|
| 70–100 | High confidence: low error + strong signal |
| 50–70 | Moderate confidence: reasonable error or moderate signal |
| 30–50 | Low confidence: high error or weak signal (the momentum fallback reports a fixed 35.0) |
| < 30 | Very low confidence: high error and weak signal |
| Sharpe | Risk-Reward |
|---|---|
| > 1.0 | Excellent — significantly more return than risk |
| 0.5–1.0 | Good — acceptable risk-adjusted return |
| 0.0–0.5 | Mediocre — barely compensated for risk |
| < 0.0 | Poor — negative risk-adjusted return |
Computed with RISK_FREE_RATE = 0.02 (2% annual risk-free rate).
- VaR(95%) = 1.645σ − μ : "On 95% of days, losses should not exceed this amount"
- CVaR(95%) = σ × φ(1.645) / 0.05 − μ : "On the worst 5% of days, losses average this amount"
Both formulas express loss as a positive number. CVaR is always larger (worse) than VaR because it is the expected shortfall in the tail beyond VaR: under the normal assumption, φ(1.645) / 0.05 ≈ 2.06, compared with the 1.645 multiplier used for VaR. Both are annualized.
To evaluate whether the optimization adds value, compare against a naive equal-weight (1/n) baseline:
| Scenario | Equal-Weight (1/5) | Optimized (Max Sharpe) | Improvement |
|---|---|---|---|
| AAPL, MSFT, GOOGL, NVDA, TSLA (daily, 6mo) | Sharpe ~0.35 | Sharpe ~0.57 | +63% |
| AAPL, MSFT, GOOGL (daily, 6mo) | Sharpe ~0.42 | Sharpe ~0.61 | +45% |
| AAPL, MSFT, JPM, KO (daily, 6mo, min_vol) | Vol ~0.18 | Vol ~0.14 | -22% risk |
Note: These are representative ranges from test runs. Actual results depend on the specific time period, market conditions, and selected symbols.
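A comparison of this kind can be reproduced for any weight vector with a few lines of NumPy. In the sketch below `portfolio_sharpe` is a hypothetical helper, and the return and covariance inputs are synthetic, not taken from the test runs above.

```python
import numpy as np


def portfolio_sharpe(weights, mean_daily, cov_daily, rfr=0.02, trading_days=252):
    """Annualized Sharpe ratio of a portfolio from daily mean returns and covariance."""
    ann_return = float(weights @ mean_daily) * trading_days
    ann_vol = float(np.sqrt(weights @ cov_daily @ weights * trading_days))
    return (ann_return - rfr) / ann_vol


# Synthetic daily statistics for three uncorrelated assets.
mean_daily = np.array([0.0008, 0.0004, 0.0005])
cov_daily = np.diag([0.0004, 0.0001, 0.0002])

equal = np.full(3, 1.0 / 3.0)           # naive 1/n baseline
candidate = np.array([0.3, 0.4, 0.3])   # stand-in for optimizer output (respects the 40% cap)

baseline = portfolio_sharpe(equal, mean_daily, cov_daily)
optimized = portfolio_sharpe(candidate, mean_daily, cov_daily)
improvement_pct = 100.0 * (optimized - baseline) / baseline
```

The "Improvement" column is this relative change; for example, moving from a Sharpe of 0.35 to 0.57 is (0.57 − 0.35) / 0.35 ≈ 63%.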
- Diverse symbols (tech + consumer + energy) — correlation benefits fully exploited
- High-conviction predictions — Black-Litterman blends ML views effectively
- Low-volatility environment — Sharpe differences are more pronounced
- Highly correlated symbols (all tech) — covariance structure offers little diversification
- Random price movement — predictions near zero, BL collapses to equal prior
- Very small symbol sets (2–3 tickers) — constraints dominate, little room for differentiation
# Local: test with 5 stocks, daily mode, max sharpe
curl -X POST http://localhost:8000/api/full-analysis \
-H "Content-Type: application/json" \
  -d '{"symbols":["AAPL","MSFT","GOOGL","NVDA","TSLA"],"mode":"daily","budget":10000,"risk_preference":"medium","optimization_goal":"max_sharpe"}'

The response includes model_comparison per symbol (from the regression predictor):
{
"symbol": "AAPL",
"predicted_return": 0.0085,
"model_used": "random_forest",
"model_comparison": {
"linear_regression": {
"predicted_return": 0.0062,
"mae": 0.0185,
"confidence": 65.3
},
"random_forest": {
"predicted_return": 0.0085,
"mae": 0.0142,
"confidence": 72.3
}
}
}

This lets you see which model was selected and how both performed on the test set.
# Without LSTM (default on Render)
curl -X POST http://localhost:8000/api/full-analysis \
  -H "Content-Type: application/json" \
  -d '{"symbols":["AAPL","MSFT"],"mode":"daily","model":"regression"}'
# With LSTM (local only, requires TensorFlow)
curl -X POST http://localhost:8000/api/full-analysis \
  -H "Content-Type: application/json" \
  -d '{"symbols":["AAPL","MSFT"],"mode":"daily","model":"ensemble"}'

Compare the confidence scores and predicted returns. The LSTM typically produces slightly different predictions because it sees the raw sequence rather than engineered features.
| Issue | Status | Mitigation |
|---|---|---|
| `database is locked` (yfinance) | Fixed | Per-process temp dir for tz cache |
| Empty DataFrame `.iloc` crashes | Fixed | Guards added in regression, risk_metrics, preprocessing, optimizer |
| WebSocket connects to wrong origin | Fixed | VITE_API_URL env var for WS target |
| CORS block on deploy | Fixed | https://jmr825.github.io added to allowed origins |
| TensorFlow OOM on Render | Workaround | LSTM disabled by default (ENABLE_LSTM=false) |
| yfinance rate limit | Workaround | Skips affected symbol, continues with remaining data |
