Minimal, fast, and friendly web app to transcribe audio and visualize its mel spectrogram.
The backend is built with FastAPI and OpenAI Whisper (tiny model by default). The frontend is plain HTML/CSS/JS for zero-friction usage.
- Upload audio from your device.
- Record in browser (Start/Stop) and transcribe with one click.
- One-click transcription using Whisper.
- Mel spectrogram generated with librosa + matplotlib, rendered inline as an image.
- Minimal JS — most logic lives in FastAPI.
- Backend: FastAPI, Uvicorn, OpenAI Whisper, Librosa, NumPy, Matplotlib
- Frontend: Vanilla HTML/CSS/JS
```
.
├─ main.py            # FastAPI app with / and /transcribe endpoints
├─ requirements.txt   # All Python dependencies (FastAPI, Whisper, Librosa, etc.)
├─ static/            # Frontend assets
│  ├─ index.html      # UI (Start/Stop recording, upload)
│  ├─ style.css       # Styles
│  └─ app.js          # Minimal client-side logic
└─ README.md
```
Key endpoints/functions:
- `GET /` serves `static/index.html`.
- `POST /transcribe` is handled by `transcribe_audio` in `main.py`.
- Python 3.9+ recommended.
- ffmpeg installed and available on PATH (required by Whisper/librosa).
- Windows: `choco install ffmpeg` (Chocolatey) or download from ffmpeg.org.
- macOS: `brew install ffmpeg`.
- Linux (Debian/Ubuntu): `sudo apt-get install ffmpeg`.
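Before starting the server, you can confirm that ffmpeg is actually reachable with a small stdlib check (a sketch; the helper name is illustrative and not part of the project):

```python
import shutil
import subprocess

def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None

if __name__ == "__main__":
    if ffmpeg_available():
        # Print the first line of `ffmpeg -version` as a sanity check.
        out = subprocess.run(["ffmpeg", "-version"],
                             capture_output=True, text=True)
        print(out.stdout.splitlines()[0])
    else:
        print("ffmpeg not found on PATH -- install it before running the app")
```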
- Create and activate a virtual environment:

```
python -m venv .venv
.venv\Scripts\activate       # Windows
# source .venv/bin/activate  # macOS/Linux
```

- Install dependencies:

```
pip install -r requirements.txt

# PyTorch is required by Whisper. If not auto-installed, install one of:
# CPU-only example (Windows/Linux/macOS):
pip install torch --index-url https://download.pytorch.org/whl/cpu
# For CUDA builds, follow: https://pytorch.org/get-started
```

- Run the server:

```
uvicorn main:app --reload
# Open http://127.0.0.1:8000
```

- Open the app in your browser.
Upload a file:
- Choose a file (WAV/MP3/etc.) and click "Transcribe".
Record in browser:
- Click "Start Recording", then "Stop Recording".
- Click "Transcribe" to upload the captured audio.
Wait for status "Done"; see transcript and mel spectrogram below.
Notes:
- The UI includes an audio player for playback of recorded audio.
- The backend decodes uploads (including WebM/Opus from the browser) using ffmpeg via `whisper.load_audio`.
- You can switch to a larger Whisper model in `main.py` by changing `whisper.load_model("tiny")` to `"base"`, `"small"`, etc. Larger models are slower but more accurate.
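To make the decoding step concrete: `whisper.load_audio` shells out to ffmpeg to turn the upload into 16 kHz mono PCM. The sketch below builds an approximately equivalent command; the helper name and exact flag list are our illustration, not Whisper's verbatim internals:

```python
def decode_command(path: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg argv that decodes `path` to raw 16-bit mono PCM
    on stdout, roughly what whisper.load_audio does internally."""
    return [
        "ffmpeg",
        "-nostdin",
        "-i", path,               # input file (WAV, MP3, WebM/Opus, ...)
        "-f", "s16le",            # raw signed 16-bit little-endian samples
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample to 16 kHz for Whisper
        "-",                      # write to stdout
    ]

print(decode_command("recording.webm"))
```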
Endpoint: POST /transcribe
Form-Data:
- `file`: the audio file (the key must be `file`).
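Outside the browser, any HTTP client can call the endpoint as long as the upload is sent as multipart form data under the key `file`. A stdlib-only sketch of building such a body (the helper is our illustration; `requests` or `curl` work just as well):

```python
import io
import uuid

def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "application/octet-stream"):
    """Return (body, content_type_header) for a single-file multipart POST."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    buf.write(f"--{boundary}\r\n".encode())
    buf.write(
        f'Content-Disposition: form-data; name="{field}"; '
        f'filename="{filename}"\r\n'.encode()
    )
    buf.write(f"Content-Type: {content_type}\r\n\r\n".encode())
    buf.write(data)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"

body, ctype = build_multipart("file", "clip.wav", b"RIFF...")
# POST it with, e.g.:
# urllib.request.Request("http://127.0.0.1:8000/transcribe",
#                        data=body, headers={"Content-Type": ctype})
```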
Response (200):

```json
{
  "transcription": "...",
  "mel_spectrogram": "data:image/png;base64,..."
}
```

Response (error):

```json
{
  "error": "An error occurred during transcription: ..."
}
```

Edit `main.py`:
- Model size: `model = whisper.load_model("tiny")` → `"base"`, `"small"`, `"medium"`, or `"large"`.
- Device/precision: `fp16=False` forces CPU-friendly precision; enable fp16 on GPU.
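The `mel_spectrogram` value in a successful response is a base64 data URI, so a client can recover the PNG with the stdlib alone. A minimal sketch (`decode_spectrogram` is an illustrative name, not part of the project):

```python
import base64

def decode_spectrogram(data_uri: str) -> bytes:
    """Strip the data-URI prefix and return the raw PNG bytes."""
    prefix = "data:image/png;base64,"
    if not data_uri.startswith(prefix):
        raise ValueError("unexpected data URI format")
    return base64.b64decode(data_uri[len(prefix):])

# Round-trip demo using the same encoding scheme as the server response.
uri = "data:image/png;base64," + base64.b64encode(b"\x89PNG-demo").decode()
png = decode_spectrogram(uri)
with open("mel.png", "wb") as f:  # save the image next to the script
    f.write(png)
```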
- Whisper model fails to load: Ensure PyTorch is installed correctly; try the CPU wheel above or install via pytorch.org.
- ffmpeg not found: Install ffmpeg and confirm `ffmpeg -version` works in your terminal.
- librosa/audioread errors: Usually ffmpeg-related; also verify the audio file isn't corrupted.
- Browser recording uploads but fails to transcribe: Confirm ffmpeg is on PATH and try again; WebM/Opus requires ffmpeg.
- Slow inference: Use the `tiny`/`base` models or enable GPU (CUDA build of PyTorch + fp16).
- Language selection / translation.
- Word-level timestamps and subtitle export (.srt).
- Save spectrograms and transcripts to disk.
- Model/device selection UI, timestamps toggle.
- OpenAI Whisper: https://github.com/openai/whisper
- FastAPI: https://fastapi.tiangolo.com/
- librosa: https://librosa.org/
Made with 💖 by Rishabh Dhawad.