CinemaCLIP is a MobileCLIP-S1 fine-tune specialized for understanding the visual language of cinema at a frame level. It is a hybrid CLIP model with 23 classifier heads that represent a comprehensive taxonomy built with domain experts. For more info, see our launch blog post.
This repository ships three serialized forms of the same model:
- Torch (`model.safetensors`) — load via the `cinemaclip` Python package.
- CoreML (`ImageEncoder.mlmodel`, `ImageEncoder.mlpackage`, and `TextEncoder.mlpackage`) — for on-device Apple Neural Engine inference.
- ONNX (`ImageEncoder.onnx`, `TextEncoder.onnx`, plus `_fp16` variants) — for cross-platform inference.
```bash
pip install cinemaclip             # core
pip install "cinemaclip[coreml]"   # CoreML export/inference
pip install "cinemaclip[onnx]"     # ONNX export/inference
```

```python
from PIL import Image

from cinemaclip import CinemaCLIP

model = CinemaCLIP.from_pretrained("OZU-Technology/CinemaCLIP").eval()

# End-to-end classification on a PIL image
image = Image.open("still.jpg").convert("RGB")
predictions = model.predict_image(image)
predictions["classifier_preds"]      # classifier predictions
predictions["clip_image_embedding"]  # CLIP image embedding

# Just the image embedding
x = model.preprocess(image).unsqueeze(0)
image_embedding = model.encode_image(x, normalize=True)  # [1, 512]

# Just the text embedding
tokens = model.tokenizer(["a medium closeup of "])
text_embedding = model.encode_text(tokens, normalize=True)  # [1, 512]
```

The `CinemaCLIP.predict_image` method demonstrates how to get post-processed classifier outputs from the model. It is not optimized for efficiency or production use and should be treated as a reference implementation above all else.
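Continuing from the snippet above: because both embeddings were produced with `normalize=True`, cosine similarity reduces to a dot product. A minimal sketch:

```python
# Both embeddings are unit length, so the dot product is the cosine similarity.
similarity = (image_embedding @ text_embedding.T).item()  # scalar in [-1, 1]
print(f"image-text similarity: {similarity:.3f}")
```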
```python
import coremltools as ct
from PIL import Image

img_encoder = ct.models.MLModel("ImageEncoder.mlpackage")

# Input must be 256x256 RGB, resized with BICUBIC for parity with the released torch outputs.
img = Image.open("still.jpg").convert("RGB").resize((256, 256), Image.Resampling.BICUBIC)
out = img_encoder.predict({"Image": img})
embedding = out["clip_image_embedding"]  # [512]
probabilities = out["probabilities"]     # [101] — concat of 23 per-category outputs

# TODO
text_encoder = ct.models.MLModel("TextEncoder.mlpackage")
```

```python
from PIL import Image
from onnxruntime import InferenceSession
from torchvision import transforms as T

img = Image.open("still.jpg").convert("RGB")
preprocess = T.Compose([
    T.Resize((256, 256), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),  # yields float tensor in [0, 1] — no mean/std normalization
])
x = preprocess(img).unsqueeze(0).numpy()

session = InferenceSession("ImageEncoder.onnx", providers=["CPUExecutionProvider"])
emb, probs = session.run(None, {"Image": x})
```

`probabilities` is a flat `[101]` vector — the concatenation of all 23 classifier heads' post-activation outputs. Label names and positions are in the shipped `CinemaNetSchema.json`:
```python
import json

schema = json.load(open("CinemaNetSchema.json"))
label_names = schema["probabilities_labels"]  # len == 101
```

The classifier heads are a mix of 3 types of classifiers (see the decoding sketch after this list):
- Single-label (softmax activation)
- Multi-label (sigmoid activation)
- Binary (sigmoid activation)
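As an illustration, here is a minimal sketch that decodes the flat vector into per-head predictions. The `schema["categories"]` layout below (a list of heads with `name`, `labels`, and `type` keys) is a hypothetical structure for illustration only; consult the shipped `CinemaNetSchema.json` for the actual field names. `probs` is the ONNX output from above.

```python
import numpy as np

# Hypothetical schema layout (illustration only): one entry per head,
# carrying its name, its label names, and its classifier type.
offset = 0
for head in schema["categories"]:  # hypothetical field name
    n = len(head["labels"])
    head_probs = probs[0][offset:offset + n]  # this head's slice of the flat [101] vector
    if head["type"] == "single_label":
        # Softmax head: take the argmax.
        print(head["name"], "->", head["labels"][int(np.argmax(head_probs))])
    else:
        # Sigmoid head (multi-label or binary): threshold each label independently.
        print(head["name"], "->", [l for l, p in zip(head["labels"], head_probs) if p > 0.5])
    offset += n
```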
CinemaCLIP outperforms not only the largest existing CLIP models (up to 28x larger), but also leading vision-language models (VLMs) on cinematic understanding tasks; we benchmarked against the leading 4B-parameter VLMs.
Two inference modes are reported for CinemaCLIP:
- Classifier — the shipped supervised heads on the CinemaCLIP image embedding.
- 0-shot — zero-shot text/image similarity using CinemaCLIP's own text encoder (sketched below).
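For context, a minimal sketch of what the 0-shot mode does, using the `cinemaclip` quickstart API from above. The prompts here are illustrative, not the actual benchmark prompt templates:

```python
import torch
from PIL import Image

from cinemaclip import CinemaCLIP

model = CinemaCLIP.from_pretrained("OZU-Technology/CinemaCLIP").eval()

# Illustrative prompts for one category; the benchmark templates are not reproduced here.
prompts = ["a wide shot", "a medium shot", "a closeup"]

with torch.no_grad():
    x = model.preprocess(Image.open("still.jpg").convert("RGB")).unsqueeze(0)
    image_emb = model.encode_image(x, normalize=True)                       # [1, 512]
    text_emb = model.encode_text(model.tokenizer(prompts), normalize=True)  # [3, 512]

# Zero-shot prediction: the prompt whose embedding is most similar to the image's.
scores = image_emb @ text_emb.T  # [1, 3]
print(prompts[scores.argmax().item()])
```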
| Category | CinemaCLIP 0-shot | CinemaCLIP Classifier | Qwen3.5-4B | Gemma4-4B | InternVL3.5-4B | Molmo2-4B | DFN ViT-H-14 | MetaCLIP PE-bigG | OpenAI ViT-L-14 | MobileCLIP-S1 | DFN ViT-L-14 | SigLIP2 SO400M | SigLIP2 ViT-gopt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mean | 82.9 | 87.6 | 57.6 | 56.7 | 55.3 | 55.3 | 45.9 | 45.2 | 44.8 | 44.2 | 39.0 | 38.7 | 36.5 |
| Color Contrast | 89.6 | 86.8 | 33.7 | 35.3 | 33.7 | 35.3 | 34.0 | 33.1 | 49.4 | 38.7 | 37.1 | 57.7 | 25.2 |
| Color Key | 84.9 | 92.9 | 78.1 | 78.1 | 80.3 | 64.3 | 58.2 | 50.2 | 53.2 | 59.4 | 48.3 | 22.8 | 52.6 |
| Color Saturation | 82.6 | 82.6 | 66.5 | 65.4 | 72.1 | 45.9 | 55.1 | 61.8 | 58.1 | 35.8 | 46.8 | 33.3 | 31.8 |
| Color Theory | 71.3 | 72.7 | 54.0 | 51.7 | 50.7 | 48.7 | 54.7 | 51.7 | 50.7 | 47.3 | 47.7 | 31.3 | 31.7 |
| Color Tones | 86.0 | 86.5 | 50.2 | 62.6 | 70.6 | 62.1 | 58.5 | 50.2 | 52.0 | 55.7 | 47.2 | 24.0 | 17.7 |
| Lighting Cast | 85.9 | 90.4 | 38.3 | 53.3 | 39.8 | 35.7 | 25.4 | 29.3 | 28.8 | 35.7 | 22.8 | 37.8 | 18.2 |
| Lighting Contrast | 93.9 | 95.3 | 29.8 | 39.1 | 38.7 | 46.1 | 35.3 | 35.5 | 32.6 | 39.0 | 39.4 | 48.4 | 37.6 |
| Lighting Edge | 87.6 | 90.4 | 22.8 | 38.8 | 31.2 | 40.4 | 22.4 | 31.6 | 41.6 | 34.0 | 21.2 | 26.0 | 25.6 |
| Lighting Silhouette | 88.4 | 93.1 | 80.9 | 63.0 | 48.9 | 48.8 | 66.6 | 67.1 | 67.4 | 58.4 | 43.5 | 46.2 | 78.9 |
| Shot Angle | 73.4 | 82.3 | 41.9 | 49.2 | 33.2 | 49.9 | 28.0 | 13.7 | 19.0 | 19.6 | 25.9 | 21.3 | 17.2 |
| Shot Composition | 95.5 | 96.0 | 46.0 | 54.5 | 55.7 | 60.5 | 27.8 | 24.3 | 21.3 | 22.0 | 25.2 | 31.4 | 11.4 |
| Shot Dutch Angle | 61.9 | 78.5 | 62.2 | 65.1 | 46.7 | 49.3 | 27.3 | 44.5 | 38.4 | 56.6 | 25.9 | 47.6 | 68.7 |
| Shot Focus | 71.3 | 71.2 | 19.9 | 26.6 | 26.3 | 25.1 | 32.9 | 31.2 | 24.4 | 31.3 | 37.3 | 48.2 | 12.6 |
| Shot Framing | 79.2 | 83.8 | 38.0 | 29.6 | 40.1 | 34.6 | 33.6 | 24.9 | 23.5 | 23.9 | 33.0 | 7.3 | 9.8 |
| Shot Height | 90.5 | 91.8 | 38.1 | 37.4 | 41.2 | 53.0 | 37.6 | 33.7 | 28.9 | 24.0 | 33.6 | 29.6 | 23.9 |
| Shot Lens Size | 67.9 | 70.6 | 49.6 | 28.0 | 43.6 | 46.6 | 32.1 | 28.0 | 34.5 | 30.1 | 25.7 | 30.1 | 17.6 |
| Shot Location | 90.9 | 93.9 | 81.0 | 82.2 | 81.5 | 79.2 | 73.0 | 68.4 | 68.0 | 75.6 | 66.1 | 65.0 | 46.7 |
| Shot Symmetry | 88.3 | 92.9 | 90.2 | 86.7 | 76.0 | 80.2 | 76.6 | 78.0 | 54.0 | 39.3 | 24.9 | 46.0 | 82.4 |
| Shot Time of Day | 69.2 | 89.0 | 75.1 | 66.1 | 70.7 | 70.7 | 68.1 | 69.6 | 60.3 | 73.7 | 71.2 | 48.5 | 42.7 |
| Shot Type | 81.8 | 90.5 | 81.3 | 61.2 | 57.0 | 57.4 | 52.8 | 40.4 | 36.5 | 35.7 | 56.7 | 46.5 | 29.7 |
| Shot Type - Crowd | 91.5 | 99.6 | 97.2 | 88.2 | 94.3 | 94.8 | 55.9 | 69.1 | 68.6 | 77.2 | 37.3 | 52.4 | 69.3 |
| Shot Type - OTS | 92.0 | 95.5 | 92.5 | 85.0 | 83.9 | 87.6 | 53.2 | 57.0 | 73.9 | 60.3 | 42.1 | 50.5 | 51.2 |
The `shot.lighting.direction` head ships with the classifier heads but is excluded from the table above because it is a multi-label classifier.
```bibtex
@misc{cinemaclip2026,
  title = {CinemaCLIP: A hybrid CLIP model and taxonomy for the visual language of cinema},
  author = {Somani, Rahul and Marini, Anton and Stewart, Damian},
  year = {2026},
  publisher = {Hugging Face},
  doi = {10.57967/hf/8539},
  howpublished = {\url{https://huggingface.co/OZU-Technology/CinemaCLIP}},
  note = {Model weights and taxonomy}
}
```