F1 Analytics Platform

A free, local-first Formula 1 platform that combines data engineering, machine learning, and a web application. Inspired by TeoMeWhy/f1-lake, it is rebuilt with DuckDB instead of Databricks and local Parquet instead of S3, and it compares multiple ML models via MLFlow.

Architecture

FastF1 --> Parquet (raw/) --> DuckDB --> Parquet (bronze/silver/gold/) --> ML Models --> Streamlit App

The project follows a medallion architecture:

Layer	Path	Description
Raw	`data/raw/`	One Parquet file per FastF1 session (`{year}_{round}_{mode}.parquet`), includes weather data (2018+)
Bronze	`data/bronze/`	Cleaned and consolidated `results.parquet` with weather columns
Silver	`data/silver/`	Feature store with temporal windows (`fs_driver_life.parquet`, `fs_driver_last10.parquet`, `fs_driver_last20.parquet`, `fs_driver_last40.parquet`, `fs_driver_all.parquet`)
Gold	`data/gold/`	Analytical base tables: end-of-year (`abt_champions.parquet`, `abt_teams.parquet`, `abt_departures.parquet`) and in-season (`abt_champions_inseason.parquet`, `abt_teams_inseason.parquet`, `abt_departures_inseason.parquet`). In-season ABTs include clinch detection, momentum features, and clinch proximity
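The in-season ABTs are described as including clinch detection. As a rough illustration of the arithmetic involved, the sketch below assumes a driver has clinched when their lead over the runner-up exceeds the points still available. The function name, the `max_points_per_race` default, and the omission of sprint points are all assumptions for illustration; the actual logic lives in the in-season SQL files and may differ.

```python
from typing import Dict, Optional


def clinched_leader(
    standings: Dict[str, float],
    races_remaining: int,
    max_points_per_race: float = 26.0,  # assumption: 25 for a win + 1 bonus point
) -> Optional[str]:
    """Return the leader's name if no rival can still catch them, else None."""
    if not standings:
        return None
    ranked = sorted(standings.items(), key=lambda kv: kv[1], reverse=True)
    leader, leader_pts = ranked[0]
    runner_up_pts = ranked[1][1] if len(ranked) > 1 else 0.0
    points_available = races_remaining * max_points_per_race
    # Strict inequality: a possible tie on points is treated as "not yet clinched".
    if leader_pts - runner_up_pts > points_available:
        return leader
    return None
```

With two races left (52 points available under this assumption), a 60-point lead counts as clinched while a 40-point lead does not.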

Project Structure

f1-analytics/
├── app/                    # Streamlit web application
│   ├── main.py             # Entry point with tab layout
│   ├── tab_predictions.py  # ML predictions tab
│   ├── tab_model_comparison.py  # Side-by-side model comparison (ROC/PR curves, confusion matrices)
│   ├── tab_eda.py          # Exploratory data analysis tab
│   ├── tab_duckdb.py       # Interactive DuckDB SQL console (Ctrl+Enter to run)
│   └── helpers.py          # Shared UI utilities
├── etl/                    # ETL pipeline modules
│   ├── collect.py          # FastF1 data collection (results + weather)
│   ├── bronze.py           # Raw to bronze transformation
│   ├── silver.py           # Bronze to silver (feature store)
│   ├── gold.py             # Silver to gold (ABTs)
│   ├── run_pipeline.py     # Full pipeline orchestrator
│   └── sql/                # DuckDB SQL queries
│       ├── fs_driver.sql           # Feature store query (point-in-time correct)
│       ├── fs_all.sql              # Join all temporal windows
│       ├── abt_champions.sql       # End-of-year champion ABT
│       ├── abt_teams.sql           # End-of-year constructor ABT
│       ├── abt_departures.sql      # End-of-year departure ABT
│       ├── abt_champions_inseason.sql  # In-season champion ABT
│       ├── abt_teams_inseason.sql      # In-season constructor ABT
│       └── abt_departures_inseason.sql # In-season departure ABT
├── ml/                     # Machine learning models
│   ├── champion_model.py   # Champion prediction training
│   ├── team_model.py       # Best team prediction training
│   ├── departure_model.py  # Driver departure prediction training
│   ├── model_selection.py  # Candidate model definitions
│   ├── predict.py          # Inference utilities
│   ├── utils.py            # Training, splits, metrics, MLFlow setup
│   ├── evaluate_timesfm.py # TimesFM zero-shot forecast evaluation
│   └── timefm_predictor.py # TimesFM predictor wrapper
├── notebooks/              # Jupyter notebooks for exploration
├── data/                   # Parquet data files (raw/bronze/silver/gold)
├── mlruns/                 # MLFlow artifact storage
├── mlflow.db               # MLFlow metadata (SQLite backend)
├── Dockerfile
├── docker-compose.yaml
└── requirements.txt
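
The tree describes `fs_driver.sql` as point-in-time correct, and the silver layer stores several trailing windows (last 10, 20, and 40). The project implements this in DuckDB SQL; the plain-Python sketch below only illustrates the idea, using a hypothetical helper name: the feature for a race is computed from earlier races only, so it never sees the outcome it is meant to predict.

```python
from typing import List, Optional, Sequence


def trailing_mean_before(points: Sequence[float], window: int) -> List[Optional[float]]:
    """For each race i, average the points of up to `window` races strictly before i.

    The current race is excluded, so the feature never includes its own outcome.
    The first race has no history and gets None.
    """
    features: List[Optional[float]] = []
    for i in range(len(points)):
        history = points[max(0, i - window):i]
        features.append(sum(history) / len(history) if history else None)
    return features


race_points = [25, 18, 0, 15, 25]
print(trailing_mean_before(race_points, window=2))
# [None, 25.0, 21.5, 9.0, 7.5]
```

In SQL the same effect is usually obtained with a window frame that ends at the preceding row rather than the current one.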

Tech Stack

All tools are free and open source:

Data collection: FastF1 (results + weather data)
SQL engine: DuckDB
Data processing: pandas
ML models: scikit-learn, XGBoost, LightGBM
Class balancing: imbalanced-learn
Hyperparameter tuning: Optuna (TPE sampler, median pruner)
Zero-shot forecasting: TimesFM (separate venv)
Experiment tracking: MLFlow (SQLite backend)
Web app: Streamlit
Visualizations: Plotly, Matplotlib
Containerization: Docker

Getting Started

Prerequisites

Python 3.12+
pip

Installation

git clone <repo-url>
cd f1-analytics
pip install -r requirements.txt

Running the Full ETL Pipeline

Collects data from FastF1 (including weather) and builds all medallion layers (raw -> bronze -> silver -> gold):

python -m etl.run_pipeline --years 2020 2021 2022 2023 2024 2025

To re-collect with weather data for years that were previously collected without it (FastF1 weather is available from 2018+):

python -m etl.run_pipeline --years 2018 2019 2020 2021 2022 2023 2024 2025 --force

The pipeline runs four steps sequentially:

Collect — Downloads session results and weather data from FastF1 API
Bronze — Cleans and consolidates raw data (handles mixed schemas via union_by_name)
Silver — Builds the feature store with multiple temporal windows (includes weather features)
Gold — Constructs analytical base tables for ML (end-of-year and in-season variants)
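
The Bronze step mentions `union_by_name` for mixed schemas. A minimal sketch of how that consolidation could look is shown below: DuckDB's `read_parquet` aligns files by column name and fills columns missing from older files (for example, weather) with NULL. The function names and default paths are assumptions for illustration, and the real `etl/bronze.py` also performs cleaning that is not shown here.

```python
def bronze_consolidation_sql(raw_glob: str, out_path: str) -> str:
    """Build a COPY statement that unions all raw files by column name."""
    return (
        "COPY (\n"
        f"    SELECT * FROM read_parquet('{raw_glob}', union_by_name = true)\n"
        f") TO '{out_path}' (FORMAT PARQUET)"
    )


def build_bronze(
    con,
    raw_glob: str = "data/raw/*.parquet",
    out_path: str = "data/bronze/results.parquet",
) -> None:
    """Run the statement on an open DuckDB connection (e.g. duckdb.connect())."""
    con.execute(bronze_consolidation_sql(raw_glob, out_path))


print(bronze_consolidation_sql("data/raw/*.parquet", "data/bronze/results.parquet"))
```

Keeping the SQL in a small builder makes it easy to inspect the statement before running it against real data.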

Running Individual ETL Steps

# Collect raw data (R = Race, S = Sprint)
python -m etl.collect --years 2024 2025 --modes R S

# Re-collect existing files (e.g., to add weather data)
python -m etl.collect --years 2018 2019 2020 --force

# Build bronze layer
python -m etl.bronze

# Build silver layer (feature store)
python -m etl.silver

# Build gold layer (ABTs)
python -m etl.gold
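
The collect flags above (`--years`, `--modes`, `--force`) and the raw file pattern `{year}_{round}_{mode}.parquet` suggest a simple skip-unless-forced loop. The sketch below shows one way such a command line could be wired up; the helper names, the default for `--modes`, and the lack of zero padding in the round number are assumptions, not taken from `etl/collect.py`.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface mirroring the documented flags."""
    parser = argparse.ArgumentParser(description="Collect FastF1 sessions (sketch)")
    parser.add_argument("--years", type=int, nargs="+", required=True)
    # Assumption: both Race and Sprint are collected when --modes is omitted.
    parser.add_argument("--modes", nargs="+", choices=["R", "S"], default=["R", "S"])
    parser.add_argument("--force", action="store_true", help="re-collect existing files")
    return parser


def raw_path(year: int, round_number: int, mode: str, root: Path = Path("data/raw")) -> Path:
    """Follow the documented {year}_{round}_{mode}.parquet naming pattern."""
    return root / f"{year}_{round_number}_{mode}.parquet"


def needs_collection(path: Path, force: bool) -> bool:
    """Download when forced, or when the file has not been collected yet."""
    return force or not path.exists()


args = build_parser().parse_args(["--years", "2024", "2025", "--modes", "R"])
print(args.years, args.modes, args.force)
print(raw_path(2024, 5, "R").name)
```

With this shape, `--force` only changes the skip decision, so re-collecting to add weather data reuses the same code path as a first run.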

Training ML Models

Each prediction task trains and compares multiple batch models (LogisticRegression, LightGBM, BalancedRandomForest, XGBoost) with Optuna hyperparameter tuning. All runs are logged to MLFlow. Use --nologreg to skip LogisticRegression.

The champion and team models both use curated feature sets that exclude leakage-prone features (season_fraction, season_race_number) and zero-importance features. The team model is selected with a combined scoring metric (PR-AUC plus top-1 champion accuracy), so that the chosen model produces meaningful per-event predictions. Train each task with:

# Champion prediction
python -m ml.champion_model

# Best team prediction
python -m ml.team_model

# Driver departure prediction
python -m ml.departure_model
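
The combined scoring metric mentioned for the team model (PR-AUC plus top-1 champion accuracy) could be sketched as follows. This is an illustration only: the equal weighting, the grouping key, and all function names are assumptions, and average precision is used here as one common estimate of PR-AUC. The project's own implementation in `ml/utils.py` may differ.

```python
from typing import Dict, Sequence, Tuple

Row = Tuple[int, int, float]  # (group such as season or event, label, predicted score)


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision values at the rank of each positive example."""
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            ap += hits / rank
    return ap / total_pos


def top1_accuracy(rows: Sequence[Row]) -> float:
    """Share of groups whose highest-scored candidate is the true positive."""
    best: Dict[int, Tuple[float, int]] = {}
    for group, label, score in rows:
        if group not in best or score > best[group][0]:
            best[group] = (score, label)
    if not best:
        return 0.0
    return sum(label for _, label in best.values()) / len(best)


def combined_score(rows: Sequence[Row], weight: float = 0.5) -> float:
    """Weighted blend of the two metrics; weight=0.5 is an assumed default."""
    labels = [label for _, label, _ in rows]
    scores = [score for _, _, score in rows]
    return weight * average_precision(labels, scores) + (1 - weight) * top1_accuracy(rows)
```

Blending the two terms rewards models that both rank positives well overall and put the correct candidate first within each group, which a ranking metric alone does not guarantee.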

TimesFM Zero-Shot Forecasts

Evaluate Google's TimesFM foundation model as a zero-shot forecaster on the same prediction targets. The evaluation runs in a separate virtual environment (.venv-timesfm) and logs results to the same MLFlow experiments for direct comparison:

# Evaluate all 3 targets
.venv-timesfm/bin/python -m ml.evaluate_timesfm

# Evaluate a single target
.venv-timesfm/bin/python -m ml.evaluate_timesfm champion
.venv-timesfm/bin/python -m ml.evaluate_timesfm constructor
.venv-timesfm/bin/python -m ml.evaluate_timesfm departure

Running the Web App

streamlit run app/main.py

The app runs at http://localhost:8501 and has four tabs:

Predictions — ML model predictions with time-series charts
Model Comparison — Side-by-side metrics table, ROC/PR curves, and confusion matrices for all trained models
EDA — Interactive exploratory data analysis with Plotly charts
DuckDB Console — Run SQL queries directly against the Parquet data (Ctrl+Enter to execute, 13 example queries including weather analysis)

MLFlow UI

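View experiment runs, compare metrics, and inspect model artifacts. Start the MLFlow UI against the SQLite backend (for example, `mlflow ui --backend-store-uri sqlite:///mlflow.db` from the project root), then open http://localhost:5000 in your browser.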

Docker

Run both the Streamlit app and MLFlow UI with Docker Compose:

docker-compose up

Service	URL
Streamlit App	`http://localhost:8501`
MLFlow UI	`http://localhost:5000`

The data/ and mlruns/ directories are mounted as volumes, so data persists across container restarts.

Acknowledgments

TeoMeWhy/f1-lake — Original project and inspiration
FastF1 — F1 telemetry and session data
