This Jupyter Notebook runs a complete pipeline for clustering movies by genre and user rating (on a scale from 0.5 to 5). It reads a CSV of movie ratings, enriches the data with genre information fetched from TMDb, explores genre‑based rating trends, and applies K‑Means clustering with a PCA‑based 2D visualization.
- Data Loading
  - Reads `ratings.csv` containing columns: `Name`, `Year`, `Rating` (0.5–5).
  - Note: `ratings.csv` is generated by exporting your data from Letterboxd.
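The loading step could look roughly like the sketch below. The helper name `load_ratings`, the inline sample rows, and the extra `Date` column are assumptions made for illustration; only the `Name`, `Year`, and `Rating` columns and the 0.5–5 scale come from the description above.

```python
import io

import pandas as pd

# Made-up sample standing in for ratings.csv; a real export may carry
# additional columns, which usecols simply ignores.
SAMPLE_CSV = """Date,Name,Year,Rating
2023-01-02,Alien,1979,4.5
2023-01-05,Heat,1995,4.0
2023-02-11,Up,2009,3.5
"""


def load_ratings(source):
    """Read the three needed columns and check the rating scale."""
    df = pd.read_csv(source, usecols=["Name", "Year", "Rating"])
    # Rows without a title or a rating cannot be used downstream.
    df = df.dropna(subset=["Name", "Rating"])
    out_of_range = ~df["Rating"].between(0.5, 5.0)
    if out_of_range.any():
        raise ValueError(f"{out_of_range.sum()} ratings fall outside 0.5-5")
    return df.reset_index(drop=True)


ratings = load_ratings(io.StringIO(SAMPLE_CSV))
print(ratings)
```

Passing a path such as `"ratings.csv"` instead of the in-memory sample reads the real export.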
- Genre Enrichment
  - Uses TMDb API to search each movie by title and year.
  - Falls back to title‑only search if needed.
  - Retrieves genre list for each found movie.
  - Logs (with emojis) any movies not found or with missing genre data.
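A minimal sketch of the search-then-fallback lookup described above, not the notebook's actual code. It assumes the TMDb v3 endpoints `/search/movie` and `/movie/{id}`; the names `fetch_genres` and `http_get_json` are hypothetical. The default fetcher uses the standard library's `urllib` so the sketch has no extra dependency, and it is injectable so the logic can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request

TMDB_BASE = "https://api.themoviedb.org/3"


def http_get_json(path, params):
    """Default fetcher: GET a TMDb v3 endpoint and decode the JSON body."""
    url = f"{TMDB_BASE}{path}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def fetch_genres(title, year, api_key, get_json=http_get_json):
    """Return a list of genre names, or None if the movie is not found."""
    base = {"api_key": api_key, "query": title}
    # First attempt: title plus release year.
    results = get_json("/search/movie", {**base, "year": year}).get("results", [])
    if not results:
        # Fallback: title only.
        results = get_json("/search/movie", base).get("results", [])
    if not results:
        print(f"❌ Not found: {title} ({year})")
        return None
    # The movie details endpoint carries the genre names.
    details = get_json(f"/movie/{results[0]['id']}", {"api_key": api_key})
    genres = [g["name"] for g in details.get("genres", [])]
    if not genres:
        print(f"⚠️ No genre data: {title} ({year})")
    return genres
```

When looping over a whole ratings file, a `time.sleep(0.25)` between calls would reproduce the 0.25 s delay per request mentioned in the usage notes.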
- Exploratory Analysis
  - Explodes genre lists and computes average rating per genre.
  - Displays a horizontal bar chart of average genre ratings.
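The explode-and-average step can be illustrated as follows. The sample rows, the `Genres` column name, and the helper `average_rating_per_genre` are assumptions; the notebook may name things differently.

```python
import pandas as pd

# Made-up rows: each movie carries a list of genre names.
movies = pd.DataFrame({
    "Name": ["Alien", "Heat", "Up"],
    "Rating": [4.5, 4.0, 3.5],
    "Genres": [
        ["Horror", "Science Fiction"],
        ["Crime", "Drama"],
        ["Animation", "Drama"],
    ],
})


def average_rating_per_genre(df):
    """One row per (movie, genre) pair, then the mean rating per genre."""
    exploded = df.explode("Genres")
    return exploded.groupby("Genres")["Rating"].mean().sort_values()


genre_avg = average_rating_per_genre(movies)
print(genre_avg)
```

With matplotlib installed, `genre_avg.plot.barh()` would draw the horizontal bar chart from the sorted averages.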
- Data Preparation
  - One‑hot encodes genres.
  - Combines one‑hot genre columns with the `Rating`.
  - Standardizes all features to equalize scale.
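One way to build the feature matrix described above is sketched here. Using scikit-learn's `MultiLabelBinarizer` for the one-hot step is an assumption (the notebook may use a different encoder), as are the sample rows and the helper name `build_features`.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler

# Made-up rows: each movie carries a list of genre names.
movies = pd.DataFrame({
    "Name": ["Alien", "Heat", "Up"],
    "Rating": [4.5, 4.0, 3.5],
    "Genres": [
        ["Horror", "Science Fiction"],
        ["Crime", "Drama"],
        ["Animation", "Drama"],
    ],
})


def build_features(df):
    """One-hot encode genres, append Rating, and standardize every column."""
    mlb = MultiLabelBinarizer()
    onehot = pd.DataFrame(
        mlb.fit_transform(df["Genres"]), columns=mlb.classes_, index=df.index
    )
    features = pd.concat([onehot, df[["Rating"]]], axis=1)
    # Zero mean and unit variance per column, so Rating (0.5-5) and the
    # 0/1 genre flags contribute on a comparable scale.
    scaled = StandardScaler().fit_transform(features)
    return pd.DataFrame(scaled, columns=features.columns, index=df.index)


X = build_features(movies)
print(X.round(2))
```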
- Elbow Method for K Selection
  - Computes K‑Means inertia for K = 1 to 10.
  - Plots the “elbow” chart to help choose the optimal number of clusters.
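The inertia sweep can be sketched like this. The synthetic blobs stand in for the standardized feature matrix, and the `n_init` and `random_state` values are illustrative choices rather than the notebook's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled features: three loose blobs in 4 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 4)) for c in (-3, 0, 3)])


def inertia_curve(X, k_values=range(1, 11)):
    """Fit K-Means for each K and collect the within-cluster sum of squares."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        inertias.append(model.inertia_)
    return inertias


inertias = inertia_curve(X)
for k, value in zip(range(1, 11), inertias):
    print(f"K={k:2d}  inertia={value:.1f}")
```

Plotting `inertias` against K (for example with `plt.plot(range(1, 11), inertias, marker="o")`) gives the elbow chart; the K after which the curve flattens is a reasonable candidate.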
- Clustering & Visualization
  - Applies K‑Means with the chosen K.
  - Reduces feature space to two principal components (PCA).
  - Displays an interactive Plotly scatter plot, where each point is a movie and color denotes its cluster. Hover tooltips show movie name, year, and rating.
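A rough sketch of the clustering and projection steps, under the assumption that the features are already standardized. The helper names `cluster_and_project` and `make_figure` are hypothetical, and the synthetic data only stands in for the real feature matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def cluster_and_project(X, k):
    """Assign K-Means labels and compute 2D PCA coordinates for plotting."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    coords = PCA(n_components=2).fit_transform(X)
    return labels, coords


def make_figure(meta, labels, coords):
    """Build the scatter plot; meta is a DataFrame with Name, Year, Rating."""
    import plotly.express as px  # imported lazily: only needed for plotting

    plot_df = meta.assign(
        pc1=coords[:, 0],
        pc2=coords[:, 1],
        cluster=labels.astype(str),  # strings give discrete colors
    )
    return px.scatter(
        plot_df, x="pc1", y="pc2", color="cluster",
        hover_name="Name", hover_data=["Year", "Rating"],
    )


# Synthetic stand-in for the scaled features: three loose blobs in 4 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 4)) for c in (-3, 0, 3)])
labels, coords = cluster_and_project(X, k=3)
print(coords[:3], labels[:3])
```

Note that clustering runs on the full feature space; PCA is applied only to obtain coordinates for display. In a notebook, `make_figure(movies_df, labels, coords).show()` would render the interactive plot, where `movies_df` is a placeholder for the enriched ratings table.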
- Cluster Interpretation
  - Prints summary for each cluster:
    - Number of movies
    - Average rating
    - Top 5 genres by percentage
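The per-cluster summary might be computed as in the sketch below. It assumes "percentage" means the share of a cluster's movies that carry a given genre; the column names, sample rows, and the helper `summarize_clusters` are hypothetical.

```python
import pandas as pd

# Made-up rows with cluster labels already assigned.
movies = pd.DataFrame({
    "Name": ["Heat", "Up", "Alien", "Arrival"],
    "Rating": [4.0, 3.5, 4.5, 4.0],
    "Genres": [
        ["Crime", "Drama"],
        ["Animation", "Drama"],
        ["Horror", "Science Fiction"],
        ["Drama", "Science Fiction"],
    ],
    "Cluster": [0, 0, 1, 1],
})


def summarize_clusters(df, top_n=5):
    """Return size, mean rating, and top genres (percent of movies) per cluster."""
    summaries = {}
    for cluster_id, group in df.groupby("Cluster"):
        # Share of this cluster's movies tagged with each genre, in percent.
        genre_pct = group["Genres"].explode().value_counts() / len(group) * 100
        summaries[int(cluster_id)] = {
            "n_movies": len(group),
            "avg_rating": group["Rating"].mean(),
            "top_genres": genre_pct.head(top_n).round(1).to_dict(),
        }
    return summaries


summaries = summarize_clusters(movies)
for cluster_id, info in summaries.items():
    print(f"Cluster {cluster_id}: {info['n_movies']} movies, "
          f"avg rating {info['avg_rating']:.2f}")
    for genre, pct in info["top_genres"].items():
        print(f"  {genre}: {pct}%")
```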
- Python 3.7+
- Install dependencies: `pip install requests pandas numpy matplotlib scikit-learn plotly tqdm`
- A valid TMDb API key. Set it in the notebook cell: `API_KEY = 'YOUR_TMDB_API_KEY'`

- Export your movie ratings from Letterboxd to `ratings.csv`, then place it in the same folder as this notebook.
- Open the notebook in JupyterLab, Jupyter Notebook, or any compatible environment.
- Install the required libraries if you haven’t already.
- Enter your TMDb API key in the designated cell.
- Run the cells in order.
  - The enrichment step will take a few minutes depending on dataset size (with a 0.25 s delay per request).
  - Inspect the printed logs to see any missing or problematic entries.
- Tune the number of clusters (K) after viewing the elbow plot.
- Enjoy the interactive cluster visualization and review the printed cluster summaries.

Created with ❤️ for data‑driven movie analysis!