Evaluation: whisper-1 on Indian-context audio — Hinglish, names, addresses, call-center phrases #2761
weekendpm started this conversation in Show and tell
What this is
A structured benchmark of whisper-1 across six categories of Indian-context audio: pure Hindi, pure English (Indian vocabulary), Hinglish (code-mixed), Indian proper nouns, addresses with PIN codes, and call-center phrases.
Methodology transparency upfront:

- Audio is synthetic, generated with gTTS (`hi` for Hindi/Hinglish, `en` for English)
- Transcription via `whisper-1` through the OpenAI API

The synthetic audio means results are directional, not definitive. Real human recordings — especially for Hinglish — would shift numbers. Treating this as a baseline and an invitation for the community to build on it.
Dataset (clips + ground truth + results CSV): https://huggingface.co/datasets/Primepluto/hinglish-whisper-benchmark
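The post doesn't state how accuracy was scored against the ground truth, so for anyone reproducing numbers from the results CSV, here is a plain word error rate (WER) over whitespace tokens — the usual baseline metric, but an assumption on my part, not necessarily what was used here:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edits needed to turn the first i ref words into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)
```

Note that plain WER punishes script mismatches maximally: a Devanagari hypothesis against a romanized reference scores near 100% even when the words are "right," which matters for Finding 1 below.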
Results
Finding 1: No output path for romanized Hinglish
All five Hinglish clips were spoken as code-mixed Hindi-English (romanized, e.g. "Mera order abhi tak deliver nahi hua, please check karo"). whisper-1 consistently transcribed them in Devanagari:

```
REF: Mera order abhi tak deliver nahi hua, please check karo.
HYP: mera order abhi tak deliver nahi hua, please check karo. (Devanagari in actual output)

REF: Support team se baat karni hai, hold pe mat rakho.
HYP: (fully Devanagari output)
```
This is expected given `language=hi` was set. But the broader issue is structural: there is no way to request romanized Hinglish output from whisper-1. For developers building Indian support-call or WhatsApp transcription tools where downstream systems expect Latin-script text, this is a blocker. No flag, no workaround via the API today.

Finding 2: Hindi loanword spelling inconsistency
Whisper transcribes English loanwords in Hindi with non-standard Devanagari spellings. Examples from actual output:
- `order` (loanword) transcribed with a non-standard Devanagari variant
- `password` transcribed phonetically but incorrectly

These aren't random errors — they reflect real ambiguity in how English loanwords are written in Devanagari. But for downstream NLP tasks like search or entity extraction, inconsistency across runs is more damaging than a fixed variant.
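One mitigation, if you standardize on a single canonical spelling per loanword, is a post-processing normalization table. The Devanagari variants below are illustrative examples I've picked, not the exact strings whisper-1 produced:

```python
# Hypothetical normalization table: maps spelling variants to one
# canonical Devanagari form. Variants shown are illustrative.
LOANWORD_VARIANTS = {
    "आर्डर": "ऑर्डर",      # "order"
    "ओर्डर": "ऑर्डर",
    "पास्वर्ड": "पासवर्ड",  # "password"
}

def normalize_loanwords(text: str) -> str:
    """Replace known loanword spelling variants with a canonical form."""
    for variant, canonical in LOANWORD_VARIANTS.items():
        text = text.replace(variant, canonical)
    return text
```

A fixed table obviously doesn't scale to the long tail of loanwords, but it makes search and entity extraction deterministic for the terms you care about, which is the actual pain point here.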
Finding 3: PIN code digit hallucination
One address clip had a 6-digit PIN (400050) transcribed with an extra digit (4000050) — making the PIN invalid. Small sample, needs more testing, but worth flagging for any address or logistics use case.
Finding 4: Indian proper noun accuracy
In English-language context, proper nouns are slightly off but recognizable:

- `Koramangala` transcribed as `Kormangala`
- `Abhijit` transcribed as `Abhijeet`

In Hindi-language context, names get fully transliterated to Devanagari, losing the Latin spelling entirely — which breaks any downstream system expecting the original name string.
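If your downstream system expects Latin-script strings, it's worth classifying each transcript's script at ingestion so fully transliterated output fails loudly instead of silently. A minimal stdlib check (my own sketch, not part of the benchmark):

```python
def script_of(text: str) -> str:
    """Classify a transcript as 'devanagari', 'latin', or 'mixed'
    by counting letters in each script. Digits/punctuation are ignored."""
    dev = sum(1 for ch in text if "\u0900" <= ch <= "\u097F")  # Devanagari block
    lat = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if dev and lat:
        return "mixed"
    return "devanagari" if dev else "latin"
```

For genuinely code-mixed Hinglish you'd expect `mixed`; a `devanagari` result on a clip whose reference is romanized is exactly the Finding 1 failure mode.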
What I'd want to see next
- Real human recordings — especially Hinglish from actual speakers. gTTS Hinglish is too clean and phonetically Hindi-dominant. Real code-mixed speech would stress the model differently.
- Auto language-detection behavior — I set language explicitly. What does whisper-1 do with Hinglish audio when language is unset? Does it detect `hi` or `en`? That changes the output script entirely.
- large-v3 comparison — this benchmark used whisper-1 (API). Would large-v3 handle loanword spelling more consistently?
The dataset is small and synthetic, but the categories and ground truth are reusable. If you have real Hinglish recordings — WhatsApp voice notes, support call clips, anything CC-licensed — happy to extend this.