Doubao / Volcengine streaming ASR 2.0 (Seed-ASR) to OpenAI-compatible protocol: a local proxy service.
Converts the Volcengine Doubao ASR 2.0 (Seed-ASR) WebSocket binary protocol to an OpenAI-compatible `/v1/audio/transcriptions` REST API.
Local OpenAI-compatible transcription proxy for Spokenly and any OpenAI-compatible client.
In a recent update, Spokenly added support for Doubao 1.0, which works reliably. Switching to Doubao 2.0, however, makes Spokenly's built-in path fail with direct errors in practice, so Doubao 2.0 should be treated as unsupported there at this stage. To bridge this gap, Spokenly also offers an OpenAI-compatible integration path where users supply their own API keys/endpoints and connect different models. This project was created around that path:
Community discussion reference (many users reporting similar behavior):
-
- accept OpenAI-compatible `POST /v1/audio/transcriptions` requests
- translate them to the Doubao ASR 2.0 WebSocket binary protocol
- return text in the format expected by OpenAI-compatible apps
This project implements a practical subset for Spokenly-style transcription workflows, not the full OpenAI audio API surface.
- supported routes: `POST /v1/audio/transcriptions`, `POST /doubao/v1/audio/transcriptions`
- supported output formats: `json` (default, returns `{ "text": "..." }`), `text` (plain text)
- not implemented: `response_format=verbose_json`, `response_format=srt`, `response_format=vtt`
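The output-format mapping can be sketched with a small helper that mirrors what the proxy returns (`formatTranscription` is a hypothetical name for illustration, not the actual function in `src/server.js`):

```javascript
// Map a recognized transcript to the response body for each supported
// response_format. Only "json" and "text" are implemented by this proxy;
// verbose_json / srt / vtt are rejected.
function formatTranscription(text, responseFormat = 'json') {
  switch (responseFormat) {
    case 'json':
      // OpenAI-compatible shape: { "text": "..." }
      return { contentType: 'application/json', body: JSON.stringify({ text }) };
    case 'text':
      return { contentType: 'text/plain; charset=utf-8', body: text };
    default:
      throw new Error(`unsupported response_format: ${responseFormat}`);
  }
}

console.log(formatTranscription('hello world').body);         // {"text":"hello world"}
console.log(formatTranscription('hello world', 'text').body); // hello world
```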
Official reference:
What the official layer provides:
- A stateful, event-driven WebSocket Realtime API.
- Compatibility with OpenAI Realtime-style events for ASR sessions.
How it differs from this project:
- Official Realtime API is a WebSocket event protocol.
- This project is an HTTP bridge for OpenAI-compatible `POST /v1/audio/transcriptions` clients.
- Official integration expects clients to implement Realtime event handling; this proxy keeps existing REST transcription clients unchanged.
Why this project is still useful:
- Spokenly and many existing tools are built around transcription REST flows rather than Realtime events.
- This proxy avoids client-side protocol rewrites and keeps deployment simple for local workflows.
- It also provides local control over transcoding, logging, and error mapping.
When this proxy may be optional:
- If your client natively supports the official Realtime API and your workflow is fully event-driven, direct integration can be a better fit.
This repository is intended to be discoverable and reusable by coding agents. If an agent needs similar speech-to-text behavior, it can search this repository and reuse the existing patterns instead of re-implementing from scratch.
Suggested references for agents:
- `src/server.js`: OpenAI route compatibility, multipart parsing, ffmpeg transcoding, Doubao frame encode/decode, timeout handling, graceful shutdown
- `src/preflight.js`: upstream credential/connectivity validation and protocol-level smoke test
- `README.md` + `.env.example`: operational modes (`bigmodel_async` vs `bigmodel_nostream`) and recommended runtime parameters
Implementation guideline for agents:
- keep changes minimal and focused
- preserve OpenAI-compatible behavior for clients like Spokenly
- prioritize clean protocol boundaries and readable logging
- prefer configurable behavior via env vars over hardcoding
This project is fixed to:
- model: `bigmodel`
- resource id: `volc.seedasr.sauc.duration`
- websocket endpoint: configurable via `VOLC_WS_URL`
  - recommended default for Spokenly: `wss://openspeech.bytedance.com/api/v3/sauc/bigmodel_nostream`
  - optional realtime-optimized mode: `wss://openspeech.bytedance.com/api/v3/sauc/bigmodel_async`
Official Doubao ASR documentation:
- Volcengine ASR 2.0 Official Docs
- Local official ASR markdown copy
If the web docs page requires JavaScript in your environment, use `doubao_asr.md` as the local reference.
cd /path/to/repo
cp .env.example .env
# edit .env and fill VOLC_APP_KEY / VOLC_ACCESS_KEY
npm install
npm run preflight
npm run start

Use `bigmodel_nostream` as the default mode for Spokenly push-to-talk:
VOLC_RESOURCE_ID=volc.seedasr.sauc.duration
VOLC_WS_URL=wss://openspeech.bytedance.com/api/v3/sauc/bigmodel_nostream
VOLC_MODEL_NAME=bigmodel
SEGMENT_DURATION_MS=200
SEND_INTERVAL_MS=0
SHOW_UTTERANCES=false
RESULT_TYPE=full
BODY_READ_TIMEOUT_MS=30000
REQUEST_TIMEOUT_MS=90000
SHUTDOWN_TIMEOUT_MS=8000

Run command:
cd /path/to/repo
npm run start

If you still see occasional upstream packet timeout (45000081) on very long audio, try `SEND_INTERVAL_MS=100` or `SEND_INTERVAL_MS=120` for more conservative pacing.
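The interaction between `SEGMENT_DURATION_MS` and `SEND_INTERVAL_MS` can be sketched as follows (assuming 16 kHz, 16-bit mono PCM, i.e. 32 bytes per millisecond; function names are illustrative, not the actual code in `src/server.js`):

```javascript
// Split a PCM buffer into fixed-duration segments, the unit sent upstream.
// Assumes 16 kHz, 16-bit, mono PCM => 32 bytes of audio per millisecond.
const BYTES_PER_MS = 32;

function segmentPcm(pcm, segmentDurationMs = 200) {
  const segmentBytes = segmentDurationMs * BYTES_PER_MS;
  const segments = [];
  for (let off = 0; off < pcm.length; off += segmentBytes) {
    segments.push(pcm.subarray(off, Math.min(off + segmentBytes, pcm.length)));
  }
  return segments;
}

// Optional pacing: with SEND_INTERVAL_MS > 0 each segment is delayed,
// trading post-stop latency for fewer upstream packet timeouts (45000081).
async function sendPaced(segments, sendFn, sendIntervalMs = 0) {
  for (const seg of segments) {
    await sendFn(seg);
    if (sendIntervalMs > 0) {
      await new Promise((resolve) => setTimeout(resolve, sendIntervalMs));
    }
  }
}
```

With `SEND_INTERVAL_MS=0` all segments are flushed back-to-back; raising it to 100~120 spreads long uploads out in time.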
Use OpenAI-compatible provider:
- Base URL: `http://127.0.0.1:8787`
- Route: `/v1/audio/transcriptions`
- API Key: any string (or exactly `PROXY_API_KEY` if you set it)

Alternative base URL also works: `http://127.0.0.1:8787/doubao`
curl -sS -X POST "http://127.0.0.1:8787/v1/audio/transcriptions" \
-H "Authorization: Bearer test" \
-F "model=whisper-1" \
  -F "file=@/absolute/path/to/test.wav"

If `PROXY_API_KEY` is empty, `Authorization` is optional.
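The API-key behavior described above can be expressed as a small guard (hypothetical helper for illustration, not the actual code in `src/server.js`):

```javascript
// When PROXY_API_KEY is empty, any (or no) Authorization header is accepted.
// When set, the request must send exactly `Authorization: Bearer <that key>`.
function isAuthorized(authHeader, proxyApiKey) {
  if (!proxyApiKey) return true; // open mode: Authorization optional
  if (!authHeader) return false;
  const match = /^Bearer\s+(.+)$/.exec(authHeader);
  return match !== null && match[1] === proxyApiKey;
}
```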
cd /path/to/repo
pm2 delete doubao-asr2-openai-proxy || true
pm2 start ecosystem.config.cjs
pm2 logs volcengine-doubao-asr2-openai-proxy
pm2 save

`ecosystem.config.cjs` now uses `node_args: '--env-file=.env'` so PM2 loads your local env values.
If you used the old process name before the rename, keep the `pm2 delete doubao-asr2-openai-proxy` step to avoid duplicate processes or port conflicts.
Background:
- We initially tested the bidirectional streaming endpoints (`bigmodel` / `bigmodel_async`) to try realtime interaction.
Problem found in Spokenly (OpenAI-compatible workflow):
- Spokenly push-to-talk uploads audio after key release, instead of continuously streaming microphone packets to this proxy.
- In this workflow, bidirectional streaming does not provide practical realtime benefit, and often increases post-stop latency.
Experiment steps and observations:
- Start with the bidirectional endpoint (`VOLC_WS_URL=wss://openspeech.bytedance.com/api/v3/sauc/bigmodel_async`) and run normal Spokenly dictation tests.
- Observe that text is usually returned after recording ends, not truly character-by-character in target input fields.
- In long-audio tests, we observed high tail latency and occasional packet-timeout risks during tuning.
- Switch to the `bigmodel_nostream` endpoint (`VOLC_WS_URL=wss://openspeech.bytedance.com/api/v3/sauc/bigmodel_nostream`) and re-test with the same usage pattern.
- With timing logs enabled, one ~30s sample showed local stages were small (`readBodyMs=5`, `parseMs=1`, `transcodeMs=98`) while upstream ASR dominated (`asrMs=18199`, `totalMs=18303`), confirming the main latency is in ASR processing rather than local parsing/transcoding.
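The timing fields in that log line can be produced by a minimal stage timer along these lines (an illustrative sketch, not the logging code in `src/server.js`):

```javascript
// Record per-stage durations (e.g. readBodyMs, parseMs, transcodeMs, asrMs)
// relative to the previous mark, plus a running totalMs.
function createStageTimer() {
  const start = Date.now();
  let last = start;
  const stages = {};
  return {
    mark(name) {
      const now = Date.now();
      stages[name] = now - last; // duration since the previous stage ended
      last = now;
    },
    finish() {
      return { ...stages, totalMs: Date.now() - start };
    },
  };
}
```

A handler would call `mark('readBodyMs')` after reading the body, `mark('asrMs')` after the upstream result, then log `finish()`.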
Final decision:
- For Spokenly push-to-talk usage, this project should run in `bigmodel_nostream` mode by default.
- This gives better stability and clearer latency behavior for "record-then-return-final-text" workflows.
- Connection failed before transcription:
  - check `VOLC_APP_KEY` / `VOLC_ACCESS_KEY`
  - verify `VOLC_RESOURCE_ID=volc.seedasr.sauc.duration`
  - verify `VOLC_WS_URL` matches your mode (`bigmodel_async` or `bigmodel_nostream`)
  - ensure the account has access to the ASR 2.0 seedasr duration package
- Language parameter:
  - `language` is only forwarded when `VOLC_WS_URL` uses `bigmodel_nostream` (per official doc)
- ffmpeg error:
  - install ffmpeg and make sure `ffmpeg` is in PATH
- Slow result after recording ends:
  - recommended for Spokenly push-to-talk: `VOLC_WS_URL=wss://openspeech.bytedance.com/api/v3/sauc/bigmodel_nostream`, `SEGMENT_DURATION_MS=200`, `SEND_INTERVAL_MS=0`, `SHOW_UTTERANCES=false`
  - if very long audio causes upstream timeout, increase pacing to `SEND_INTERVAL_MS=100`~`120`
- Body upload timeout:
  - tune `BODY_READ_TIMEOUT_MS` (default `30000`) if very large uploads are expected
- For support tickets, keep logs with:
  - `connectId`
  - `logid` (`X-Tt-Logid`)