This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Kaggle time series forecasting competition: predict daily unit sales for 1,782 store-family combinations (54 stores × 33 product families) for Corporacion Favorita, an Ecuadorian grocery chain. Evaluation metric is RMSLE.
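For reference, RMSLE (root mean squared logarithmic error) can be computed as in the minimal, dependency-free sketch below. The function name `rmsle` is illustrative and is not taken from this repository; the notebooks may compute the metric differently (for example, through a library helper).

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    # log1p(x) = log(1 + x), so zero-sales days are handled without log(0).
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))
```

Because the error is taken on the log1p scale, the metric penalizes relative rather than absolute differences, so low-volume and high-volume series contribute on a comparable scale.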
# Activate main environment
source .venv/bin/activate
# Activate CatBoost environment (isolated due to version conflicts)
source .venv-catboost/bin/activate
# Run notebooks
jupyter lab
# Run Streamlit dashboard
streamlit run app/streamlit_dashboard.py
# Execute a notebook from CLI
.venv/bin/python -m jupyter nbconvert --to notebook --execute notebooks/XX_name.ipynb
# Download competition data (requires ~/.kaggle/kaggle.json)
kaggle competitions download -c store-sales-time-series-forecasting -p data/raw

Raw CSVs (data/raw/) → Feature engineering (NB02) → Parquet files (data/processed/) → Model training → Saved models (models/) → Predictions → Submission CSVs (submissions/)
Notebooks are numbered 01-13 and each builds on prior outputs:
- 01-06: Core pipeline (EDA → features → baselines → statistical → ML → evaluation)
- 07-09: Model optimization (XGBoost, ensemble stacking, CatBoost)
- 10: External data enrichment (weather, economic indicators)
- 11: Deep learning (MLP, LSTM/GRU, TFT-Lite via PyTorch)
- 12-13: Kaggle submission generation (single-step and recursive)
- Training: up to 2017-07-31
- Validation: 2017-08-01 to 2017-08-15
- Test (Kaggle): 2017-08-16 to 2017-08-31
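The date boundaries above can be expressed as a small lookup, sketched here with only the standard library. The constant names and the `assign_split` helper are hypothetical; the notebooks may filter on the date column directly instead.

```python
from datetime import date

# Boundaries copied from the split list above (all inclusive).
TRAIN_END = date(2017, 7, 31)
VAL_START, VAL_END = date(2017, 8, 1), date(2017, 8, 15)
TEST_START, TEST_END = date(2017, 8, 16), date(2017, 8, 31)


def assign_split(day):
    """Return 'train', 'val', or 'test' for a calendar date."""
    if day <= TRAIN_END:
        return "train"
    if VAL_START <= day <= VAL_END:
        return "val"
    if TEST_START <= day <= TEST_END:
        return "test"
    raise ValueError(f"{day} falls outside the competition date range")
```

Keeping the boundaries in one place makes it harder for a notebook to drift into overlapping or random splits.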
- Gradient boosting: .joblib (XGBoost, LightGBM, RF) and .cbm (CatBoost)
- Deep learning: .pt (PyTorch)
- Feature metadata: feature_cols.joblib, label_encoders.joblib, dl_scaler.joblib
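One way to load these artifacts is to dispatch on file extension, as in the sketch below. `load_artifact` is a hypothetical helper, not something defined in this repository. It assumes the CatBoost model was saved from a `CatBoostRegressor`, and it makes no assumption about whether the .pt files hold a full module or a state dict; `torch.load` returns whatever was saved. Imports are deferred so the function can be defined in either virtual environment even when one of the libraries is not installed there.

```python
from pathlib import Path


def load_artifact(path):
    """Load a saved model or metadata object based on its file extension."""
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".joblib":
        # XGBoost, LightGBM, RF models and feature metadata.
        import joblib
        return joblib.load(path)

    if suffix == ".cbm":
        # Assumption: the model was trained as a regressor.
        from catboost import CatBoostRegressor
        model = CatBoostRegressor()
        model.load_model(str(path))
        return model

    if suffix == ".pt":
        # Returns whatever object the notebook saved (module or state dict).
        import torch
        return torch.load(path, map_location="cpu")

    raise ValueError(f"Unsupported artifact extension: {suffix!r}")
```

Recent PyTorch releases default `torch.load` to weights-only loading, so a file that stores a whole pickled module may need to be re-saved as a state dict or loaded with that default changed.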
Three-tab dashboard: EDA visualizations, time series explorer, and forecasting comparison. Loads data from data/processed/ parquets and submissions/ CSVs.
- Target transform: log1p(sales) so MSE loss approximates RMSLE; invert with expm1
- Leakage prevention: all rolling/lag features use shift(1) before aggregation; only temporal train/val splits (no random splits)
- Two virtual environments: .venv for most work, .venv-catboost for CatBoost notebooks (NB09)
- Data formats: raw data as CSV, processed features as Parquet
- Ensemble weighting: current best is 0.4 XGBoost + 0.4 CatBoost + 0.2 LightGBM
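The target transform, the shift-before-aggregate rule, and the ensemble weights can be illustrated together on a toy series. This is a sketch under stated assumptions, not the notebooks' actual code: the column names, the 3-day window, and the `blend_predictions` helper are hypothetical, and blending on the log1p scale before inverting is an assumption (the notebooks may blend on the raw sales scale instead).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a single store-family series with six days of sales.
df = pd.DataFrame({
    "store_nbr": [1] * 6,
    "family": ["GROCERY I"] * 6,
    "date": pd.date_range("2017-07-01", periods=6, freq="D"),
    "sales": [10.0, 12.0, 0.0, 8.0, 15.0, 9.0],
}).sort_values(["store_nbr", "family", "date"])

# Target transform: train on log1p(sales), invert predictions with expm1.
df["y"] = np.log1p(df["sales"])

# Leakage prevention: shift(1) within each series BEFORE rolling, so the
# feature for a given day is built only from strictly earlier days.
grouped = df.groupby(["store_nbr", "family"])["y"]
df["y_lag_1"] = grouped.shift(1)
df["y_roll_mean_3"] = grouped.transform(lambda s: s.shift(1).rolling(3).mean())

# Ensemble weights from the list above.
WEIGHTS = {"xgboost": 0.4, "catboost": 0.4, "lightgbm": 0.2}


def blend_predictions(preds_log, weights=WEIGHTS):
    """Weighted average of log1p-scale predictions, returned on the sales scale."""
    blended_log = sum(
        weights[name] * np.asarray(preds_log[name], dtype=float)
        for name in weights
    )
    # Sales cannot be negative, so clip after inverting the transform.
    return np.clip(np.expm1(blended_log), 0.0, None)
```

Without the `shift(1)`, the rolling window for a given day would include that day's own target, which is exactly the leakage the rule is meant to prevent.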