Gemini Live API - Transcription finished flag never updates in Javascript SDK #1429
Description
Environment details
- Programming language: TypeScript / JavaScript
- OS: macOS 15.x (Sequoia)
- Language runtime version: Node.js 22.16.0
- Package version: `@google/genai` 1.44.0
Steps to reproduce
Model: gemini-2.5-flash-native-audio-preview-12-2025
API: Live API (WebSocket, v1alpha), ephemeral token auth
- Connect to the Live API with both `inputAudioTranscription: {}` and `outputAudioTranscription: {}` enabled in the session config:
```typescript
const session = await ai.live.connect({
  model: "gemini-2.5-flash-native-audio-preview-12-2025",
  config: {
    responseModalities: [Modality.AUDIO],
    inputAudioTranscription: {},
    outputAudioTranscription: {},
  },
  callbacks: {
    onmessage: (message: LiveServerMessage) => {
      console.log(message.serverContent?.inputTranscription);
      console.log(message.serverContent?.outputTranscription);
    },
  },
});
```
- Speak several complete sentences, then wait for the model to respond with a complete turn.
- Observe the `inputTranscription` and `outputTranscription` objects logged from `serverContent`.
Expected behavior
Per the SDK's TypeScript types, and the documented pattern for streaming transcription, the final transcription message for each speaker turn should include `finished: true`, signaling that the turn is complete and the accumulated text can be finalized.
This is the natural mechanism for knowing when to flush a transcript buffer.
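As a sketch of that intended pattern (the `Transcription` interface below is a minimal local stand-in mirroring the `text` and `finished` fields from the SDK's typings, not an import from `@google/genai`):

```typescript
// Minimal local stand-in for the SDK's transcription payload
// (assumption: mirrors the `text` / `finished` fields in the typings).
interface Transcription {
  text?: string;
  finished?: boolean;
}

let buffer = "";

// Accumulate fragments; flush when the documented `finished` flag arrives.
function onTranscription(t: Transcription): string | null {
  if (t.text) buffer += t.text;
  if (t.finished) {
    const turn = buffer;
    buffer = "";
    return turn; // finalized transcript for this speaker turn
  }
  return null; // turn still in progress
}
```

With a working `finished` flag, this is all a consumer would need to flush a transcript buffer at turn boundaries.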
Actual behavior
The `finished` field is never present on either `inputTranscription` or `outputTranscription` messages across an entire conversation.
Each transcription message contains only a `text` field with a fragment. There is no per-transcription signal indicating that a turn has ended.
The only reliable turn-boundary signal is `message.serverContent?.turnComplete === true`, which arrives in a separate message with no transcription payload. It also fires only at the end of the model's turn, not the user's.
Workaround in use
- Accumulate `inputTranscription.text` fragments into a buffer.
- Flush the user buffer when the first `outputTranscription.text` fragment arrives, treating that as an implicit signal that the user has finished speaking.
- Flush the model buffer when `turnComplete: true` fires.
This works, but it requires developers to reverse-engineer the signal sequence instead of relying on the documented `finished` flag.
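For reference, the workaround above can be sketched as a small state machine. The `ServerContent` interface is an illustrative local stand-in for the relevant fields of the SDK's `LiveServerMessage.serverContent`, not the real type:

```typescript
// Minimal stand-in for the fields of serverContent used by the workaround
// (assumption: mirrors the SDK's LiveServerContent shape).
interface ServerContent {
  inputTranscription?: { text?: string };
  outputTranscription?: { text?: string };
  turnComplete?: boolean;
}

let userBuf = "";
let modelBuf = "";
const transcripts: { role: "user" | "model"; text: string }[] = [];

function handleMessage(sc: ServerContent): void {
  if (sc.inputTranscription?.text) {
    userBuf += sc.inputTranscription.text;
  }
  if (sc.outputTranscription?.text) {
    // First model fragment implies the user turn ended: flush the user buffer.
    if (userBuf) {
      transcripts.push({ role: "user", text: userBuf });
      userBuf = "";
    }
    modelBuf += sc.outputTranscription.text;
  }
  if (sc.turnComplete && modelBuf) {
    // turnComplete arrives on a separate message with no transcription payload.
    transcripts.push({ role: "model", text: modelBuf });
    modelBuf = "";
  }
}
```

This recovers turn boundaries from the observed signal sequence, but it is inherently heuristic: if the model never responds (e.g. the session ends after the user speaks), the user buffer is never flushed.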