CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Kaggle time series forecasting competition: predict daily unit sales for 1,782 store-family combinations (54 stores × 33 product families) for Corporación Favorita, an Ecuadorian grocery chain. The evaluation metric is RMSLE (Root Mean Squared Logarithmic Error).

Commands

# Activate main environment
source .venv/bin/activate

# Activate CatBoost environment (isolated due to version conflicts)
source .venv-catboost/bin/activate

# Run notebooks
jupyter lab

# Run Streamlit dashboard
streamlit run app/streamlit_dashboard.py

# Execute a notebook from CLI
.venv/bin/python -m jupyter nbconvert --to notebook --execute notebooks/XX_name.ipynb

# Download competition data (requires ~/.kaggle/kaggle.json)
kaggle competitions download -c store-sales-time-series-forecasting -p data/raw

Architecture

Data Pipeline

Raw CSVs (data/raw/) → Feature engineering (NB02) → Parquet files (data/processed/) → Model training → Saved models (models/) → Predictions → Submission CSVs (submissions/)

Notebook Sequence (must run in order)

Notebooks are numbered 01-13 and each builds on prior outputs:

  • 01-06: Core pipeline (EDA → features → baselines → statistical → ML → evaluation)
  • 07-09: Model optimization (XGBoost, ensemble stacking, CatBoost)
  • 10: External data enrichment (weather, economic indicators)
  • 11: Deep learning (MLP, LSTM/GRU, TFT-Lite via PyTorch)
  • 12-13: Kaggle submission generation (single-step and recursive)
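Because the notebooks must run in order, a small helper can sort them by numeric prefix and build the nbconvert command from the Commands section for each one. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it assumes every notebook follows the two-digit XX_name.ipynb naming pattern.

```python
import re
from pathlib import Path


def ordered_notebooks(notebook_dir):
    """Return notebooks sorted by their two-digit numeric prefix (01, 02, ...).

    Assumes file names look like XX_name.ipynb; anything else is ignored.
    """
    pattern = re.compile(r"^(\d{2})_.*\.ipynb$")
    numbered = []
    for path in Path(notebook_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            numbered.append((int(match.group(1)), path))
    # Sort by number first, then by name so ties are deterministic.
    numbered.sort(key=lambda item: (item[0], item[1].name))
    return [path for _, path in numbered]


def nbconvert_command(notebook, python=".venv/bin/python"):
    """Build the CLI command that executes one notebook with nbconvert."""
    return [
        python, "-m", "jupyter", "nbconvert",
        "--to", "notebook", "--execute", str(notebook),
    ]
```

Each returned command can be passed to subprocess.run(cmd, check=True) so the sequence stops at the first failing notebook. The CatBoost notebook would need the interpreter from the isolated environment, for example python=".venv-catboost/bin/python" (an assumed path based on the standard venv layout).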

Key Data Splits

  • Training: up to 2017-07-31
  • Validation: 2017-08-01 to 2017-08-15
  • Test (Kaggle): 2017-08-16 to 2017-08-31
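The split boundaries above can be applied with a plain date filter. This is a minimal sketch rather than the notebooks' actual code: the temporal_split helper and the "date" column name are assumptions made for illustration.

```python
import pandas as pd

# Boundaries taken from the Key Data Splits list.
TRAIN_END = pd.Timestamp("2017-07-31")
VAL_START = pd.Timestamp("2017-08-01")
VAL_END = pd.Timestamp("2017-08-15")


def temporal_split(df, date_col="date"):
    """Split a frame into train and validation parts by date only.

    No shuffling happens, so validation rows are always later than
    every training row. The column name is an assumption.
    """
    dates = pd.to_datetime(df[date_col])
    train = df[dates <= TRAIN_END]
    val = df[(dates >= VAL_START) & (dates <= VAL_END)]
    return train, val


# Small synthetic frame spanning the boundary, for illustration only.
demo = pd.DataFrame({"date": pd.date_range("2017-07-25", "2017-08-20", freq="D")})
demo["sales"] = range(len(demo))
train, val = temporal_split(demo)
```

Rows dated after 2017-08-15 fall into neither part, which matches the Kaggle test window being held out.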

Model Artifacts

  • Tree-based models: .joblib (XGBoost, LightGBM, random forest) and .cbm (CatBoost)
  • Deep learning: .pt (PyTorch)
  • Feature metadata: feature_cols.joblib, label_encoders.joblib, dl_scaler.joblib

Streamlit Dashboard (app/streamlit_dashboard.py)

Three-tab dashboard: EDA visualizations, time series explorer, and forecast comparison. Loads data from the Parquet files in data/processed/ and the CSVs in submissions/.

Key Technical Conventions

  • Target transform: train on log1p(sales) so that minimizing MSE on the transformed target minimizes RMSLE; invert predictions with expm1
  • Leakage prevention: all rolling/lag features use shift(1) before aggregation; only temporal train/val splits (no random splits)
  • Two virtual environments: .venv for most work, .venv-catboost for the CatBoost notebook (NB09)
  • Data formats: raw data as CSV, processed features as Parquet
  • Ensemble weighting: current best is 0.4 XGBoost + 0.4 CatBoost + 0.2 LightGBM
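The target transform, leakage rule, and ensemble weights above can be sketched together as follows. This is an illustrative sketch under stated assumptions, not the notebooks' implementation: the function names, the column names (store_nbr, family, date, sales), the 7-day window, blending in log1p space, and clipping at zero are all assumptions.

```python
import numpy as np
import pandas as pd

# Weights from the conventions list; the dictionary keys are hypothetical.
ENSEMBLE_WEIGHTS = {"xgboost": 0.4, "catboost": 0.4, "lightgbm": 0.2}


def add_leakage_safe_features(df, group_cols=("store_nbr", "family"),
                              target="sales", window=7):
    """Add lag-1 and rolling-mean features that only see earlier rows.

    shift(1) is applied before the rolling aggregation, so the value
    for a given day never includes that day's own sales.
    """
    out = df.sort_values([*group_cols, "date"]).copy()
    grouped = out.groupby(list(group_cols))[target]
    out["lag_1"] = grouped.shift(1)
    out[f"roll_mean_{window}"] = grouped.transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    return out


def rmsle(y_true, y_pred):
    """RMSLE on the original scale, i.e. RMSE between log1p values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


def blend_log_predictions(log_preds, weights=ENSEMBLE_WEIGHTS):
    """Weighted average of log1p-space predictions, inverted with expm1.

    Blending in log space and clipping negatives to zero are assumptions.
    """
    blended = sum(
        weights[name] * np.asarray(log_preds[name], dtype=float)
        for name in weights
    )
    return np.clip(np.expm1(blended), 0.0, None)
```

A model trained with MSE on log1p(sales) produces predictions in log space, so blend_log_predictions expects one array per model keyed by the names in ENSEMBLE_WEIGHTS and returns unit sales on the original scale.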