This project implements a comprehensive stock market prediction system using various machine learning and deep learning techniques. It provides tools for data collection, preprocessing, model training, and evaluation, allowing you to predict stock prices and compare different prediction methodologies.
**COMPLETED: April 30, 2025**
This project has been completed and the final report has been prepared using the ACM SIG conference LaTeX template. The report includes:
- Analysis of Linear Regression and LSTM models for stock price prediction
- Performance comparison across multiple stocks (AAPL, MSFT, AMD)
- Literature review of 5 key papers in financial forecasting
- Ablation studies on model architecture and learning rates
- Computational complexity analysis
- Visualizations of prediction performance
stock-prediction-model/
├── acm_template/  # LaTeX templates for the final report
│   └── acmart-primary/  # ACM SIG conference template
├── data/
│   ├── raw/  # Raw stock price data from Yahoo Finance
│   └── processed/  # Processed data with engineered features
├── models/  # Trained model files and training histories
│   ├── figures/
│   ├── linear/  # Linear regression models (.pkl)
│   └── lstm/  # LSTM model files (.pt, _config.json, _scalers.pkl)
├── notebooks/  # Jupyter notebooks for analysis
├── reports/  # Final report documents
│   └── figures/  # Report visualizations
├── results/  # Model evaluation results and predictions
│   ├── figures/  # Generated prediction plots
│   └── metrics/  # Performance metrics CSVs
├── src/  # Python source code
│   ├── fetch_data.py  # Script to download stock data
│   ├── preprocess.py  # Data cleaning and feature engineering
│   ├── train.py  # Model training functionality
│   ├── evaluate.py  # Model evaluation and comparison
│   ├── generate_figures.py  # Create figures for the report
│   ├── rescale_predictions.py  # Fix and rescale predictions
│   ├── diagnose_predictions.py  # Tools to diagnose prediction issues
│   └── models/
│       ├── baseline.py  # Linear Regression models
│       └── advanced.py  # LSTM models
└── requirements.txt  # Project dependencies
- Clone this repository:
git clone https://github.com/yourusername/stock-prediction-model.git
cd stock-prediction-model
- Create a virtual environment (optional but recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install required dependencies:
pip install -r requirements.txt

This section provides a step-by-step guide to running the entire pipeline, from data collection to model evaluation.
Download historical stock data from Yahoo Finance:
# On Windows
python src/fetch_data.py --symbols AAPL MSFT GOOGL AMZN NVDA INTC META CSCO TSLA --start 2010-01-01 --end 2023-01-01 --out data/raw
Options:
- --symbol or --symbols: One or more stock symbols (e.g., AAPL, MSFT, GOOG)
- --start: Start date in YYYY-MM-DD format
- --end: End date in YYYY-MM-DD format
- --out: Output directory for the downloaded data
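As a rough illustration of how the options above fit together, the sketch below parses the same flags with `argparse`. It is not the code in `src/fetch_data.py`; the `build_parser` helper, the alias handling, and the `data/raw` default are assumptions made for the example.

```python
import argparse
from datetime import date


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented fetch options."""
    parser = argparse.ArgumentParser(description="Download historical stock data")
    # Assumption: --symbol and --symbols are aliases for one list-valued argument.
    parser.add_argument("--symbol", "--symbols", dest="symbols", nargs="+", required=True)
    # date.fromisoformat rejects anything that is not YYYY-MM-DD.
    parser.add_argument("--start", type=date.fromisoformat, required=True)
    parser.add_argument("--end", type=date.fromisoformat, required=True)
    # Assumption: the output directory defaults to data/raw.
    parser.add_argument("--out", default="data/raw")
    return parser


args = build_parser().parse_args(
    ["--symbols", "AAPL", "MSFT", "--start", "2010-01-01", "--end", "2023-01-01"]
)
print(args.symbols, args.start, args.end, args.out)
```

Parsing the dates up front means a malformed `--start` or `--end` fails before any download is attempted.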
Process the raw data and generate features:
python src/preprocess.py --in_folder data/raw --out_folder data/processed
Options:
- --in_folder: Directory containing raw CSV files
- --out_folder: Output directory for processed files
- --window_size: Lookback window size in days (default: 20)
- --min_completeness: Minimum data completeness requirement (default: 0.8)
This step creates:
- Processed and scaled data in NPZ files for each stock
- A processing_summary.csv with data statistics
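To make the lookback window concrete, here is a minimal sketch of how a price series could be cut into fixed-length windows, each paired with the next value as its target. The `make_windows` helper and the made-up prices are illustrative assumptions; the actual logic in `src/preprocess.py` may differ.

```python
from typing import List, Tuple


def make_windows(
    values: List[float], window_size: int = 20
) -> Tuple[List[List[float]], List[float]]:
    """Pair each run of `window_size` values with the value that follows it."""
    inputs: List[List[float]] = []
    targets: List[float] = []
    for end in range(window_size, len(values)):
        inputs.append(values[end - window_size:end])  # the lookback window
        targets.append(values[end])                   # the next day's value
    return inputs, targets


# 25 made-up closing prices: 100.0, 101.0, ..., 124.0
prices = [float(p) for p in range(100, 125)]
X, y = make_windows(prices, window_size=20)
print(len(X), len(X[0]), y[0])
```

A series of length n yields n - window_size samples, so stocks with short histories produce few training examples.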
Train a prediction model using one of the available algorithms:
# Train an LSTM model for AAPL stock
python src/train.py --model lstm --symbol AAPL --data_dir data/processed --output_dir models
# Train a linear model for AAPL stock
python src/train.py --model linear --symbol AAPL --data_dir data/processed --output_dir models
Models available:
- linear: Linear Regression model
- lstm: Long Short-Term Memory neural network
Important options:
- --model: Type of model to train
- --symbol: Stock symbol to train on
- --data_dir: Directory with processed data (containing .npz files)
- --output_dir: Output directory for trained models
- --epochs: Number of training epochs for neural models (default: 100)
- --batch_size: Batch size for training (default: 32)
- --window_size: Window size for time series data (default: 20)
- --time_based: Whether to use time-based train/test split (default: True)
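The `--time_based` option refers to splitting by date rather than at random. A minimal sketch of that idea follows; the `time_based_split` helper and the 20% test fraction are assumptions for illustration, not values taken from `src/train.py`.

```python
from typing import List, Sequence, Tuple


def time_based_split(rows: Sequence, test_fraction: float = 0.2) -> Tuple[List, List]:
    """Keep chronological order: the earliest rows train, the latest rows test."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = max(1, round(len(rows) * test_fraction))
    cut = len(rows) - n_test
    return list(rows[:cut]), list(rows[cut:])


days = list(range(10))  # stand-in for 10 chronologically ordered samples
train, test = time_based_split(days, test_fraction=0.2)
print(train, test)
```

Because every test sample comes after every training sample, the model is never evaluated on days that precede data it was trained on.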
When training completes, the model files will be saved in the output directory with the following naming pattern:
- LSTM models (.pt): {symbol}_fold{n}_lstm_{timestamp}.pt
- Linear models (.pkl): {symbol}_fold{n}_linear_{timestamp}.pkl
- Model configuration: {symbol}_fold{n}_lstm_{timestamp}_config.json
- Data scalers: {symbol}_fold{n}_lstm_{timestamp}_scalers.pkl
- Training history: {symbol}_fold{n}_lstm_history_{timestamp}.csv
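When picking a model file for evaluation, it can help to parse these names programmatically. The sketch below handles only the two model-file patterns (.pt and .pkl). The `parse_model_filename` helper is hypothetical, and the `YYYYMMDD_HHMMSS` timestamp layout is an assumption inferred from example names such as `AAPL_fold5_lstm_20250429_161416.pt`.

```python
import re
from datetime import datetime

# Assumption: symbols are upper-case letters and timestamps look like 20250429_161416.
MODEL_FILE = re.compile(
    r"^(?P<symbol>[A-Z]+)_fold(?P<fold>\d+)_(?P<model>lstm|linear)_"
    r"(?P<timestamp>\d{8}_\d{6})\.(?P<ext>pt|pkl)$"
)


def parse_model_filename(name: str) -> dict:
    """Split a model file name into symbol, fold, model type, and timestamp."""
    match = MODEL_FILE.match(name)
    if match is None:
        raise ValueError(f"unrecognized model file name: {name}")
    info = match.groupdict()
    info["fold"] = int(info["fold"])
    info["timestamp"] = datetime.strptime(info["timestamp"], "%Y%m%d_%H%M%S")
    return info


info = parse_model_filename("AAPL_fold5_lstm_20250429_161416.pt")
print(info["symbol"], info["fold"], info["model"], info["timestamp"])
```

Sorting parsed entries by `timestamp` is one way to find the most recent model for a given symbol and fold.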
Evaluate trained models and generate performance metrics and visualizations:
# Evaluate a single LSTM model
python src/evaluate.py --models lstm --model_paths models/lstm/AAPL_fold5_lstm_20250429_161416.pt --symbol AAPL --data_dir data/processed --output_dir results
# Compare LSTM and linear models
python src/evaluate.py --models lstm linear --model_paths models/lstm/AAPL_fold5_lstm_20250429_161416.pt models/linear/AAPL_fold5_linear_20250429_161448.pkl --symbol AAPL --data_dir data/processed --output_dir results
Important options:
- --models: List of model types being evaluated (lstm, linear)
- --model_paths: Paths to the trained model files
- --symbol: Stock symbol to evaluate on
- --data_dir: Directory with processed data
- --output_dir: Output directory for results and figures
The evaluation will output:
- Performance metrics (MSE, RMSE, MAE, R², MAPE)
- Inference time measurement
- Prediction plots saved to the output directory
Generate and evaluate price predictions in their original scale:
python src/rescale_predictions.py --model_path models/lstm/AAPL_fold5_lstm_20250429_161416.pt --symbol AAPL --data_dir data/processed --output_dir results
This will produce:
- CSV file with actual and predicted prices
- Plot comparing actual vs. predicted prices
- Metrics for both original and rescaled predictions
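Rescaling means mapping model outputs from the scaled space back to prices. The sketch below assumes min-max scaling to [0, 1], which is an assumption: the scalers stored in the `_scalers.pkl` files may use a different scheme, and the `inverse_min_max` helper is hypothetical.

```python
from typing import List


def inverse_min_max(scaled: List[float], data_min: float, data_max: float) -> List[float]:
    """Map values scaled to [0, 1] back to the original price range."""
    span = data_max - data_min
    return [value * span + data_min for value in scaled]


# Made-up scaled predictions and a made-up training price range of 100 to 200 USD.
scaled_predictions = [0.0, 0.5, 1.0]
prices = inverse_min_max(scaled_predictions, data_min=100.0, data_max=200.0)
print(prices)
```

Under this scheme, absolute errors such as RMSE and MAE grow by the factor `data_max - data_min` after rescaling, which is one reason metrics in the two scales are reported separately.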
If you need to diagnose problems with predictions:
python src/diagnose_predictions.py --model_path models/lstm/AAPL_fold5_lstm_20250429_161416.pt --symbol AAPL --data_dir data/processed --output_dir results
This creates detailed diagnostic plots to help identify issues with the predictions.
Generate all necessary figures for the final report:
python src/generate_figures.py
This script creates:
- Training loss curves for AAPL and MSFT
- RMSE and R² comparison charts across models and stocks
- Actual vs. predicted price plots for multiple stocks
- All figures are saved to the reports/figures/ directory
Compile the LaTeX report for submission:
cd acm_template/acmart-primary
pdflatex stock_prediction_report.tex
bibtex stock_prediction_report
pdflatex stock_prediction_report.tex
pdflatex stock_prediction_report.tex
This generates the final PDF report, stock_prediction_report.pdf, in ACM format.
Our research revealed several important insights:
- Linear vs. LSTM Performance: Surprisingly, Linear Regression models often outperformed more complex LSTM models for this task
- Best Performance: Best RMSE of 9.75 USD achieved by Linear Regression on MSFT stock data
- Computational Efficiency: Linear models showed 50-60x faster training and inference times compared to LSTMs
- Feature Importance: Recent price history (1-day lag) and short-term moving averages were the most predictive features
- Underfitting: Both model types showed signs of underfitting rather than overfitting, suggesting more feature engineering might be needed
- AMD Data Processing: The AMD dataset required extra cleaning due to extreme outliers
- LSTM Prediction Range Collapse: LSTM predictions tend to fall within a narrower range than actual prices
- Data Leakage Issue (Fixed): Earlier versions had data leakage in the training/evaluation pipeline, fixed in the final implementation
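The prediction range collapse noted above can be quantified with a simple ratio of the predicted price range to the actual price range. This is an illustrative sketch on made-up numbers; the `range_ratio` helper is hypothetical and is not part of `src/diagnose_predictions.py`.

```python
from typing import Sequence


def range_ratio(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Return predicted range divided by actual range (1.0 means equal spread)."""
    actual_span = max(actual) - min(actual)
    if actual_span == 0:
        raise ValueError("actual values are constant; the ratio is undefined")
    return (max(predicted) - min(predicted)) / actual_span


# Made-up example: actual prices span 60 USD, predictions span only 10 USD.
actual = [100.0, 120.0, 140.0, 160.0]
predicted = [125.0, 128.0, 132.0, 135.0]
ratio = range_ratio(actual, predicted)
print(f"range ratio: {ratio:.2f}")
```

A ratio well below 1 indicates predictions clustered near the middle of the price range, which is the pattern described for the LSTM models.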
The evaluation produces several metrics:
- MSE/RMSE: Lower values indicate better fit
- MAE: Average absolute error in price predictions
- R²: Value close to 1 indicates good fit; negative values indicate poor performance
- MAPE: Error as percentage of actual values; lower is better
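For reference, the metrics above can be computed from their standard definitions as sketched below. The `regression_metrics` helper and the sample values are made up for illustration; `src/evaluate.py` may compute them differently, for example through a library.

```python
import math
from typing import Dict, Sequence


def regression_metrics(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Compute MSE, RMSE, MAE, R², and MAPE (MAPE needs non-zero actual values)."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
        "mape": 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
    }


actual = [100.0, 110.0, 120.0]
good = regression_metrics(actual, [101.0, 109.0, 121.0])  # close predictions
bad = regression_metrics(actual, [130.0, 100.0, 90.0])    # worse than predicting the mean
print(good["r2"], bad["r2"])
```

The second case shows why R² can go negative: when the squared error exceeds that of always predicting the mean price, `ss_res / ss_tot` is greater than 1.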
Prediction plots show:
- Actual prices (solid line)
- Predicted prices (dashed line)
- Training loss curves are saved during the LSTM training process
This project fulfills all requirements outlined in assignmentRubric.txt:
- Integrity: All required sections are included in the final report with proper formatting
- Clarity: Clear problem description and model definitions
- Literature Review: Summary of 5 relevant papers in financial forecasting
- Results: Comprehensive evaluation including:
  - Advanced ML model implementation (LSTM)
  - Analysis of model performance metrics
  - Ablation studies on model architecture and hyperparameters
  - Computational complexity analysis
  - Comparison between linear and deep learning approaches
- Rami Abdelrazzaq (ramiabdelrazzaq@gmail.com)
- Taha Amir (tahashah61@gmail.com)
- Akshnoor Singh (akshnoorsingh987@gmail.com)
Contributions are welcome! Please feel free to submit a Pull Request.