Skip to content

Latest commit

 

History

History
78 lines (57 loc) · 5.34 KB

File metadata and controls

78 lines (57 loc) · 5.34 KB

F1 Analytics Platform

Overview

Free, local-first F1 data engineering + ML + web app replicating TeoMeWhy/f1-lake. Uses DuckDB instead of Databricks, local Parquet instead of S3, and multiple ML models compared via MLFlow.

Architecture

FastF1 → Parquet (raw/) → DuckDB → Parquet (bronze/silver/gold/) → ML (scikit-learn/XGBoost/CatBoost) → Streamlit App

Medallion layers:

  • data/raw/ — One Parquet per FastF1 session ({year}_{round}_{mode}.parquet), includes weather data (2018+)
  • data/bronze/ — Cleaned, consolidated results.parquet with weather columns
  • data/silver/ — Feature store: fs_driver_life.parquet, fs_driver_last10.parquet, fs_driver_last20.parquet, fs_driver_last40.parquet, fs_driver_all.parquet
  • data/gold/ — ABTs (end-of-year and in-season): abt_champions.parquet, abt_teams.parquet, abt_departures.parquet, abt_champions_inseason.parquet, abt_teams_inseason.parquet, abt_departures_inseason.parquet

Commands

# Install dependencies
pip install -r requirements.txt

# Run full ETL pipeline (collect + bronze + silver + gold)
python -m etl.run_pipeline --years 2020 2021 2022 2023 2024 2025

# Re-collect with weather data (FastF1 weather available from 2018+)
python -m etl.run_pipeline --years 2018 2019 2020 2021 2022 2023 2024 2025 --force

# Run individual ETL steps
python -m etl.collect --years 2024 2025 --modes R Q S     # collect raw data
python -m etl.collect --years 2018 2019 2020 --force      # re-collect with --force to overwrite
python -m etl.bronze                                       # raw → bronze
python -m etl.silver                                       # bronze → silver (feature store)
python -m etl.gold                                         # silver → gold (ABTs)

# Train ML models (logs to mlruns/ and mlflow.db)
python -m ml.champion_model
python -m ml.team_model
python -m ml.departure_model

# Evaluate TimesFM zero-shot forecasts (uses separate venv, logs to same MLflow experiments)
.venv-timesfm/bin/python -m ml.evaluate_timesfm                 # all 3 targets
.venv-timesfm/bin/python -m ml.evaluate_timesfm champion        # champion | constructor | departure

# Run Streamlit app
streamlit run app/main.py

# MLFlow UI
mlflow ui --backend-store-uri sqlite:///mlflow.db

# Docker
docker-compose up

Conventions

  • SQL engine: DuckDB for all transformations. SQL files in etl/sql/ use read_parquet() to access data.
  • Storage: Parquet files only. No database server. Bronze uses union_by_name=true to handle mixed schemas (pre/post weather).
  • Feature store: Point-in-time correct — fs_driver.sql uses r.event_date < d.dt_ref so features include current-season data up to (but not including) each race date. Features evolve race-by-race within a season. Includes qualifying features (avg position, poles, Q3 reach rate) from collected Q sessions.
  • Weather features: Collected from FastF1 (air/track temp, humidity, pressure, wind speed/direction, rainfall). Available from 2018+, NULL for earlier years. Aggregated per session in collect, per window in feature store.
  • ML tracking: MLFlow with SQLite backend (mlflow.db) and local artifact store (mlruns/). Each prediction task (champion, team, departure) is a separate experiment with multiple model runs.
  • ML models: Batch models (LogisticRegression, LightGBM, BalancedRandomForest, XGBoost). Hyperparameter tuning via Optuna (TPE sampler, median pruner). Training scripts accept --nologreg to skip LogisticRegression. Departure model uses scoring="roc_auc", team model uses scoring="combined" (average rank of pr_auc_oot + top1_acc_test) to select models that produce meaningful per-event predictions rather than just high PR-AUC.
  • Curated feature sets: Champion and team models use explicit feature lists (CHAMPION_FEATURES, TEAM_FEATURES) to exclude data leakage (season_fraction, season_race_number, season_total_races) and zero-importance features.
  • Top-1 champion accuracy: Per-event metric checking if the model's highest-probability pick matches the actual champion. Works for both driver (data/champions.csv) and constructor (data/constructors_champions.csv) models.
  • ABTs: Two variants per target — end-of-year (one row per driver-year) and in-season (one row per driver-race, for time-series predictions). Departure ABTs include departure-specific features (performance trends, teammate comparison, team tenure, driver age, career teams, seasons since last win/podium) computed in SQL. Team in-season ABT includes momentum features (standings/gap momentum, points acceleration), clinch proximity, and interaction features mirroring the champion model.
  • Web app: Streamlit with 4 tabs (Predictions, Model Comparison, EDA, DuckDB Console). DuckDB Console supports Ctrl+Enter to run queries. Models loaded inline via @st.cache_resource.
  • Charts: Plotly for interactive visualizations.

Development Environment

  • Python venv: Always use .venv/bin/python to run Python commands (e.g. .venv/bin/python -m ml.champion_model). The system Python does not have project dependencies installed.
  • TimesFM venv: TimesFM uses a separate venv: .venv-timesfm/bin/python.

Tech Stack

All free: FastF1, DuckDB, pandas, scikit-learn, XGBoost, CatBoost, imbalanced-learn, Optuna, MLFlow, Streamlit, Plotly, Docker.