
based-on-what/blockcluster


Movie Genre Clustering Notebook

This Jupyter Notebook runs a complete pipeline for clustering movies by genre and user rating (on a scale from 0.5 to 5). It reads a CSV of movie ratings, enriches the data with genre information fetched from TMDb, explores genre-based rating trends, and applies K‑Means clustering with a PCA‑based 2D visualization.

Features

  1. Data Loading

    • Reads ratings.csv containing columns: Name, Year, Rating (0.5–5).
    • Note: ratings.csv is generated by exporting your data from Letterboxd.
  2. Genre Enrichment

    • Uses TMDb API to search each movie by title and year.
    • Falls back to title‑only search if needed.
    • Retrieves genre list for each found movie.
    • Logs (with emojis) any movies not found or with missing genre data.
  3. Exploratory Analysis

    • Explodes genre lists and computes average rating per genre.
    • Displays a horizontal bar chart of average genre ratings.
  4. Data Preparation

    • One‑hot encodes genres.
    • Combines one‑hot genre columns with the Rating.
    • Standardizes all features to equalize scale.
  5. Elbow Method for K Selection

    • Computes K‑Means inertia for K = 1 to 10.
    • Plots the “elbow” chart to help choose the optimal number of clusters.
  6. Clustering & Visualization

    • Applies K‑Means with the chosen K.
    • Reduces feature space to two principal components (PCA).
    • Displays an interactive Plotly scatter plot, where each point is a movie and color denotes its cluster. Hover tooltips show movie name, year, and rating.
  7. Cluster Interpretation

    • Prints summary for each cluster:

      • Number of movies
      • Average rating
      • Top 5 genres by percentage
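
To make steps 3 to 7 concrete, here is a minimal sketch on a tiny hand-made table. It is not the notebook's actual code: the movie names, ratings, genre lists, the `Genres` and `Cluster` column names, and the choice of `K = 3` are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical enriched data; in the notebook the genre lists come from TMDb.
movies = pd.DataFrame({
    "Name": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E", "Movie F"],
    "Year": [1979, 1995, 2009, 2017, 1995, 2001],
    "Rating": [4.5, 4.0, 3.5, 4.0, 4.5, 3.0],
    "Genres": [
        ["Horror", "Science Fiction"],
        ["Crime", "Thriller"],
        ["Animation", "Family"],
        ["Animation", "Family"],
        ["Crime", "Thriller"],
        ["Animation", "Family"],
    ],
})

# Step 3: one row per (movie, genre) pair, then the mean rating per genre.
genre_means = (
    movies.explode("Genres").groupby("Genres")["Rating"].mean().sort_values()
)

# Step 4: one-hot encode genres, append Rating, and standardize every column.
one_hot = pd.get_dummies(movies["Genres"].explode()).groupby(level=0).max()
features = one_hot.astype(float).assign(Rating=movies["Rating"])
X = StandardScaler().fit_transform(features)

# Step 6: K-Means on the standardized features, then PCA down to 2D for plotting.
K = 3  # assumed here; in practice chosen after looking at the elbow plot
movies["Cluster"] = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
coords = PCA(n_components=2).fit_transform(X)

# Step 7: size, average rating, and top genres (as % of the cluster's movies).
for cluster_id, group in movies.groupby("Cluster"):
    genre_share = group["Genres"].explode().value_counts().head(5) / len(group) * 100
    print(f"Cluster {cluster_id}: {len(group)} movies, "
          f"average rating {group['Rating'].mean():.2f}")
    print(genre_share.round(1).to_string())
```

The `coords` array holds the two principal components per movie, which is what a scatter plot would use for its x and y axes.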

Requirements

  • Python 3.7+

  • Install dependencies:

    pip install requests pandas numpy matplotlib scikit-learn plotly tqdm
    
  • A valid TMDb API key. Set it in the notebook cell:

    API_KEY = 'YOUR_TMDB_API_KEY'
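
With the key set, the genre lookup described under Features (search by title and year, then fall back to title only) could look roughly like the sketch below. It uses the TMDb v3 `/search/movie` and `/movie/{id}` endpoints; the helper names `get_json` and `fetch_genres`, the `REQUEST_DELAY` constant, and the choice to take the first search result are assumptions, not necessarily what the notebook does.

```python
import time

import requests

API_KEY = "YOUR_TMDB_API_KEY"
BASE_URL = "https://api.themoviedb.org/3"
REQUEST_DELAY = 0.25  # seconds to wait after each request


def get_json(path, params):
    """GET a TMDb v3 endpoint and return the decoded JSON body."""
    response = requests.get(
        f"{BASE_URL}{path}", params={"api_key": API_KEY, **params}, timeout=10
    )
    response.raise_for_status()
    time.sleep(REQUEST_DELAY)
    return response.json()


def fetch_genres(title, year, fetch=get_json):
    """Return a list of genre names, or None if the movie is not found.

    `fetch` is injectable so the lookup logic can be tried without network access.
    """
    # First attempt: title plus release year.
    results = fetch("/search/movie", {"query": title, "year": year}).get("results", [])
    if not results:
        # Fallback: title only.
        results = fetch("/search/movie", {"query": title}).get("results", [])
    if not results:
        return None
    # Assumption: the first search hit is the intended movie.
    details = fetch(f"/movie/{results[0]['id']}", {})
    return [genre["name"] for genre in details.get("genres", [])]
```

A call such as `fetch_genres("Some Title", 1999)` returns `None` when neither search finds anything and an empty list when the movie exists but has no genres, which are the two cases the notebook logs.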

How to Use

  1. Export your movie ratings from Letterboxd to ratings.csv, then place it in the same folder as this notebook.

  2. Open the notebook in JupyterLab, Jupyter Notebook, or any compatible environment.

  3. Install the required libraries if you haven’t already.

  4. Enter your TMDb API key in the designated cell.

  5. Run the cells in order.

    • The enrichment step can take several minutes, depending on dataset size: the 0.25 s delay per request alone adds about 4 minutes per 1,000 requests.
    • Inspect the printed logs to see any missing or problematic entries.
  6. Tune the number of clusters (K) after viewing the elbow plot.

  7. Enjoy the interactive cluster visualization and review the printed cluster summaries.
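
For step 6, the following sketch shows how the inertia values behind an elbow plot can be computed for K = 1 to 10. The data here is synthetic (three well-separated blobs standing in for the standardized feature matrix), so the numbers it prints say nothing about real ratings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the standardized features: three blobs of 30 points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit K-Means for each K and record the inertia (within-cluster sum of squares).
ks = list(range(1, 11))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(f"K={k:2d}  inertia={inertia:.1f}")
```

Plotting `inertias` against `ks` (for example with `plt.plot(ks, inertias, marker="o")`) gives the elbow chart. A reasonable K is where the curve stops dropping sharply; with this synthetic data that happens at K = 3.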


Created with ❤️ for data‑driven movie analysis!
