Is this Suitable for real time websocket streaming? #2741

Sriharan-VJ · 2026-03-09T07:50:04Z

Are Whisper models suitable for real-time websocket streaming? If so, which model gives the best results? I have one more question: is there any way to find the last word of a transcription using VAD?

sueun-dev · 2026-03-11T20:17:04Z

I would treat Whisper as suitable for near real-time, not true streaming.

The current code processes audio in sliding 30 second windows, so a websocket setup usually means buffering small chunks, running transcription repeatedly, and stitching partial results together. That works, but it is not a native streaming ASR pipeline.
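A minimal sketch of that buffering loop, assuming the websocket client sends 16 kHz mono 16-bit PCM in whole-sample chunks. `RollingAudioBuffer`, `make_whisper_transcriber`, and the one-second step are hypothetical names and values chosen for illustration; only `whisper.load_model` and `model.transcribe` come from the package itself.

```python
from typing import Callable, Optional

SAMPLE_RATE = 16_000      # Whisper works on 16 kHz mono audio
BYTES_PER_SAMPLE = 2      # assumption: the client sends 16-bit PCM
WINDOW_SECONDS = 30.0     # Whisper's context window
STEP_SECONDS = 1.0        # hypothetical: re-transcribe after each new second


class RollingAudioBuffer:
    """Keeps the most recent window of PCM audio and signals when to re-run."""

    def __init__(self, window_seconds: float = WINDOW_SECONDS,
                 step_seconds: float = STEP_SECONDS) -> None:
        self.max_bytes = int(window_seconds * SAMPLE_RATE) * BYTES_PER_SAMPLE
        self.step_bytes = int(step_seconds * SAMPLE_RATE) * BYTES_PER_SAMPLE
        self._buf = bytearray()
        self._pending = 0  # bytes received since the last transcription

    def push(self, chunk: bytes) -> Optional[bytes]:
        """Add a websocket chunk; return the current window when a step is due."""
        self._buf.extend(chunk)
        self._pending += len(chunk)
        # Drop the oldest audio once the window is full.
        overflow = len(self._buf) - self.max_bytes
        if overflow > 0:
            del self._buf[:overflow]
        if self._pending >= self.step_bytes:
            self._pending = 0
            return bytes(self._buf)
        return None


def make_whisper_transcriber(model_name: str = "base.en") -> Callable[[bytes], str]:
    """Build a PCM-bytes -> text function (needs openai-whisper and numpy)."""
    import numpy as np
    import whisper

    model = whisper.load_model(model_name)

    def transcribe(pcm: bytes) -> str:
        # Convert 16-bit PCM to the float32 range [-1, 1) that Whisper expects.
        audio = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0
        return model.transcribe(audio)["text"]

    return transcribe
```

In a websocket handler you would call `push()` on every incoming message and, whenever it returns a window, run the transcriber and send the text back as a partial result. Each run re-transcribes overlapping audio, so the stitching step (deciding which words are final) is still left to you.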

For model choice, if you want the best latency, I would start with tiny.en or base.en for English. If you have enough GPU and want a better speed/quality tradeoff, turbo is a good option to try.
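If you want to check the latency trade-off on your own hardware, one rough way is to measure the real-time factor (processing time divided by audio duration) for each candidate model. This is only a sketch: `real_time_factor` and `compare_models` are hypothetical helpers, and the audio path and duration are values you would supply yourself.

```python
import time
from typing import Callable, Dict, Iterable


def real_time_factor(transcribe: Callable[[str], object],
                     audio_path: str, audio_seconds: float) -> float:
    """Return processing time / audio duration; below 1.0 is faster than real time."""
    start = time.perf_counter()
    transcribe(audio_path)
    return (time.perf_counter() - start) / audio_seconds


def compare_models(names: Iterable[str], audio_path: str,
                   audio_seconds: float) -> Dict[str, float]:
    """Measure each model on the same clip (needs openai-whisper installed)."""
    import whisper

    results = {}
    for name in names:
        model = whisper.load_model(name)  # e.g. "tiny.en", "base.en", "turbo"
        results[name] = real_time_factor(model.transcribe, audio_path, audio_seconds)
    return results
```

For chunked websocket use, a real-time factor comfortably below 1.0 matters, since every step re-transcribes the whole current window.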

For the last-word question, VAD alone will only tell you speech boundaries, not the actual last word. Whisper does support word_timestamps, so the usual approach is:

- use VAD to decide when to flush a chunk
- use Whisper word timestamps to read the end time of the last recognized word
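A rough sketch of those two steps. The energy threshold below is only a stand-in for a real VAD, and `EndOfSpeechDetector`, `last_word`, and the threshold values are hypothetical. The result layout (`segments` containing `words` with `word`, `start`, `end`) is what `model.transcribe(audio, word_timestamps=True)` returns.

```python
import array
import math
from typing import Optional, Tuple


def rms(frame: bytes) -> float:
    """Root-mean-square level of a 16-bit PCM frame (native byte order)."""
    samples = array.array("h")
    samples.frombytes(frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class EndOfSpeechDetector:
    """Stand-in for a VAD: signals a flush after speech followed by silence."""

    def __init__(self, threshold: float = 500.0,
                 silent_frames_to_flush: int = 10) -> None:
        self.threshold = threshold
        self.silent_frames_to_flush = silent_frames_to_flush
        self._silent = 0
        self._heard_speech = False

    def update(self, frame: bytes) -> bool:
        """Feed one frame; return True when the buffered chunk should be flushed."""
        if rms(frame) >= self.threshold:
            self._heard_speech = True
            self._silent = 0
            return False
        if not self._heard_speech:
            return False  # leading silence, nothing to flush yet
        self._silent += 1
        if self._silent >= self.silent_frames_to_flush:
            self._heard_speech = False
            self._silent = 0
            return True
        return False


def last_word(result: dict) -> Optional[Tuple[str, float]]:
    """Return (word, end_time_in_seconds) for the final recognized word, if any."""
    for segment in reversed(result.get("segments", [])):
        words = segment.get("words") or []
        if words:
            final = words[-1]
            return final["word"].strip(), final["end"]
    return None
```

When `update()` returns True you would transcribe the flushed chunk with `word_timestamps=True` and pass the result to `last_word()`. The end time is relative to the start of the audio you passed in, so add the chunk's offset if you need a position in the full stream.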

So yes for pseudo real-time, but not as a true low-latency streaming model out of the box.

1 reply

ismaeeelxd Mar 12, 2026

Do you recommend anything other than Whisper for this purpose?
