A free, local-first F1 data engineering, ML, and web app project that replicates TeoMeWhy/f1-lake. It uses DuckDB instead of Databricks, local Parquet instead of S3, and compares multiple ML models via MLflow.
FastF1 → Parquet (raw/) → DuckDB → Parquet (bronze/silver/gold/) → ML (scikit-learn/XGBoost/CatBoost) → Streamlit App
Medallion layers:
- `data/raw/`: One Parquet file per FastF1 session (`{year}_{round}_{mode}.parquet`); includes weather data (2018+)
- `data/bronze/`: Cleaned, consolidated `results.parquet` with weather columns
- `data/silver/`: Feature store: `fs_driver_life.parquet`, `fs_driver_last10.parquet`, `fs_driver_last20.parquet`, `fs_driver_last40.parquet`, `fs_driver_all.parquet`
- `data/gold/`: ABTs (end-of-year and in-season): `abt_champions.parquet`, `abt_teams.parquet`, `abt_departures.parquet`, `abt_champions_inseason.parquet`, `abt_teams_inseason.parquet`, `abt_departures_inseason.parquet`
# Install dependencies
pip install -r requirements.txt
# Run full ETL pipeline (collect + bronze + silver + gold)
python -m etl.run_pipeline --years 2020 2021 2022 2023 2024 2025
# Re-collect with weather data (FastF1 weather available from 2018+)
python -m etl.run_pipeline --years 2018 2019 2020 2021 2022 2023 2024 2025 --force
# Run individual ETL steps
python -m etl.collect --years 2024 2025 --modes R Q S # collect raw data
python -m etl.collect --years 2018 2019 2020 --force # re-collect with --force to overwrite
python -m etl.bronze # raw → bronze
python -m etl.silver # bronze → silver (feature store)
python -m etl.gold # silver → gold (ABTs)
# Train ML models (logs to mlruns/ and mlflow.db)
python -m ml.champion_model
python -m ml.team_model
python -m ml.departure_model
# Evaluate TimesFM zero-shot forecasts (uses separate venv, logs to same MLflow experiments)
.venv-timesfm/bin/python -m ml.evaluate_timesfm # all 3 targets
.venv-timesfm/bin/python -m ml.evaluate_timesfm champion # champion | constructor | departure
# Run Streamlit app
streamlit run app/main.py
# MLflow UI
mlflow ui --backend-store-uri sqlite:///mlflow.db
# Docker
docker-compose up

- SQL engine: DuckDB for all transformations. SQL files in `etl/sql/` use `read_parquet()` to access data.
- Storage: Parquet files only. No database server. Bronze uses `union_by_name=true` to handle mixed schemas (pre/post weather).
- Feature store: Point-in-time correct. `fs_driver.sql` uses `r.event_date < d.dt_ref`, so features include current-season data up to (but not including) each race date. Features evolve race by race within a season. Includes qualifying features (average position, poles, Q3 reach rate) from collected Q sessions.
- Weather features: Collected from FastF1 (air/track temperature, humidity, pressure, wind speed/direction, rainfall). Available from 2018 onward, NULL for earlier years. Aggregated per session in collect, per window in the feature store.
- ML tracking: MLflow with SQLite backend (`mlflow.db`) and local artifact store (`mlruns/`). Each prediction task (champion, team, departure) is a separate experiment with multiple model runs.
- ML models: Batch models (LogisticRegression, LightGBM, BalancedRandomForest, XGBoost). Hyperparameter tuning via Optuna (TPE sampler, median pruner). Training scripts accept `--nologreg` to skip LogisticRegression. The departure model uses `scoring="roc_auc"`; the team model uses `scoring="combined"` (average rank of `pr_auc_oot` and `top1_acc_test`) to select models that produce meaningful per-event predictions rather than just high PR-AUC.
- Curated feature sets: Champion and team models use explicit feature lists (`CHAMPION_FEATURES`, `TEAM_FEATURES`) to exclude features that leak the target (`season_fraction`, `season_race_number`, `season_total_races`) and zero-importance features.
- Top-1 champion accuracy: Per-event metric that checks whether the model's highest-probability pick matches the actual champion. Works for both driver (`data/champions.csv`) and constructor (`data/constructors_champions.csv`) models.
- ABTs: Two variants per target: end-of-year (one row per driver-year) and in-season (one row per driver-race, for time-series predictions). Departure ABTs include departure-specific features (performance trends, teammate comparison, team tenure, driver age, career teams, seasons since last win/podium) computed in SQL. The team in-season ABT includes momentum features (standings/gap momentum, points acceleration), clinch proximity, and interaction features mirroring the champion model.
- Web app: Streamlit with 4 tabs (Predictions, Model Comparison, EDA, DuckDB Console). DuckDB Console supports Ctrl+Enter to run queries. Models loaded inline via `@st.cache_resource`.
- Charts: Plotly for interactive visualizations.
- Python venv: Always use `.venv/bin/python` to run Python commands (e.g. `.venv/bin/python -m ml.champion_model`). The system Python does not have project dependencies installed.
- TimesFM venv: TimesFM uses a separate venv: `.venv-timesfm/bin/python`.
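The point-in-time rule from the feature store note above (`event_date < dt_ref`) can be illustrated in plain Python. This is only a sketch of the idea, not the contents of `fs_driver.sql`: the `Result` record, both helper functions, and the toy results are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Result:
    driver: str
    event_date: date
    points: float


def points_before(results: list, driver: str, dt_ref: date) -> float:
    """Sum points from races strictly before dt_ref (event_date < dt_ref)."""
    return sum(
        r.points for r in results
        if r.driver == driver and r.event_date < dt_ref
    )


def avg_points_last_n(results: list, driver: str, dt_ref: date, n: int) -> Optional[float]:
    """Average points over the driver's last n races before dt_ref, or None."""
    prior = sorted(
        (r for r in results if r.driver == driver and r.event_date < dt_ref),
        key=lambda r: r.event_date,
    )
    window = prior[-n:]
    if not window:
        return None  # no history yet, comparable to a NULL feature
    return sum(r.points for r in window) / len(window)


# Toy data: three races for one made-up driver.
RESULTS = [
    Result("AAA", date(2024, 3, 2), 25.0),
    Result("AAA", date(2024, 3, 9), 18.0),
    Result("AAA", date(2024, 3, 24), 15.0),
]

# Features as of the third race only see the first two races.
print(points_before(RESULTS, "AAA", date(2024, 3, 24)))
print(avg_points_last_n(RESULTS, "AAA", date(2024, 3, 24), n=2))
```

The strict inequality is what keeps the race being predicted out of its own features; using `<=` would leak that race's result into the row.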
All free: FastF1, DuckDB, pandas, scikit-learn, XGBoost, CatBoost, imbalanced-learn, Optuna, MLflow, Streamlit, Plotly, Docker.
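The top-1 champion accuracy metric described in the design notes above could be computed along the following lines. This is a minimal sketch under stated assumptions: the function name, the `(event, candidate, probability)` input shape, and the toy values are hypothetical, and loading the actual champions from `data/champions.csv` is left out.

```python
def top1_accuracy(predictions, champions) -> float:
    """Share of events where the highest-probability pick is the actual champion.

    predictions: iterable of (event, candidate, probability) tuples.
    champions:   dict mapping event -> actual champion.
    Events without a known champion are ignored.
    """
    best = {}  # event -> (candidate, probability) with the highest probability
    for event, candidate, prob in predictions:
        if event not in best or prob > best[event][1]:
            best[event] = (candidate, prob)

    scored = [event for event in best if event in champions]
    if not scored:
        raise ValueError("no events with a known champion to score")

    hits = sum(1 for event in scored if best[event][0] == champions[event])
    return hits / len(scored)


# Toy example with made-up events and candidates.
PREDICTIONS = [
    ("E1", "A", 0.7), ("E1", "B", 0.3),  # top pick A
    ("E2", "A", 0.4), ("E2", "B", 0.6),  # top pick B
]
CHAMPIONS = {"E1": "A", "E2": "A"}

print(top1_accuracy(PREDICTIONS, CHAMPIONS))
```

Because the metric only looks at the single highest-probability candidate per event, it rewards models whose ranking within an event is right, which a dataset-wide score such as PR-AUC does not measure directly.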