fix: remove hallucinations from silent audio #1588
ex3ndr wants to merge 2 commits into ggml-org:master from
Conversation
Can we get some examples where this change makes a difference in the output?
It does reduce the incidence rate, but it doesn't fix it yet.
openai/whisper skips the current segment if the probability of the no-speech token is high: https://github.com/openai/whisper/blob/e58f28804528831904c3b6f2c0e473f346223433/whisper/transcribe.py#L243-L255. We likely need this too?
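For reference, a minimal sketch of that skip heuristic (simplified from the linked `transcribe.py` lines; the threshold values mirror upstream's defaults `no_speech_threshold=0.6` and `logprob_threshold=-1.0`, and the function name here is made up for illustration):

```python
def should_skip_segment(no_speech_prob, avg_logprob,
                        no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Decide whether a decoded segment should be discarded as silence."""
    if no_speech_threshold is None:
        return False
    # high no-speech probability => treat the segment as silence
    should_skip = no_speech_prob > no_speech_threshold
    # ...unless decoding was confident enough, which overrides the guess
    if logprob_threshold is not None and avg_logprob > logprob_threshold:
        should_skip = False
    return should_skip
```

Note that the no-speech probability alone is not trusted: a confident decode (average log-probability above the threshold) keeps the segment.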
I have re-tested it, but the improvement feels marginal. I will try to implement ignoring segments with a high probability of silence.
BTW, I am testing on these short segments that I have recorded. I can't get most of them to be reliably detected as silence with most models.
Whisper models? |
Yes. |
I have tried to log
At least this is heading in the right direction. I'm developing something similar to OpenAI's approach, utilizing:

```python
def _main_loop(self, audio_features: Tensor, tokens: Tensor):
    n_batch = tokens.shape[0]
    sum_logprobs: Tensor = torch.zeros(n_batch, device=audio_features.device)
    no_speech_probs = [np.nan] * n_batch

    try:
        for i in range(self.sample_len):
            logits = self.inference.logits(tokens, audio_features)

            if (
                i == 0 and self.tokenizer.no_speech is not None
            ):  # save no_speech_probs
                probs_at_sot = logits[:, self.sot_index].float().softmax(dim=-1)
                no_speech_probs = probs_at_sot[:, self.tokenizer.no_speech].tolist()

            # now we need to consider the logits at the last token only
            logits = logits[:, -1]

            # apply the logit filters, e.g. for suppressing or applying penalty to
            for logit_filter in self.logit_filters:
                logit_filter.apply(logits, tokens)

            # expand the tokens tensor with the selected next tokens
            tokens, completed = self.decoder.update(tokens, logits, sum_logprobs)

            if completed or tokens.shape[-1] > self.n_ctx:
                break
    finally:
        self.inference.cleanup_caching()

    return tokens, sum_logprobs, no_speech_probs
```
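To make the `no_speech_probs` step above concrete: it is just a softmax over the logits at the SOT position, from which the probability mass of the no-speech token is read off. A tiny self-contained illustration (the logit values and token index are made up):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# pretend these are the logits at the SOT position,
# and that index 2 is the no-speech token
logits_at_sot = [2.0, 0.5, 4.0]
no_speech_index = 2
no_speech_prob = softmax(logits_at_sot)[no_speech_index]
```

With these numbers the no-speech token dominates, so `no_speech_prob` comes out high, which is the signal the skip heuristic would act on.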
Sure |
```diff
 // suppress sot and nosp tokens
 logits[vocab.token_sot] = -INFINITY;
-logits[vocab.token_nosp] = -INFINITY; // TODO: ignore this token for now
+// logits[vocab.token_nosp] = -INFINITY; // Uncommenting this would produce hallucinations on silent audio
```
Although token_nosp is said to be the direction for solving the hallucination, removing the suppression of token_nosp is definitely problematic. First, we only want the model's output to contain meaningful, visible tokens (apart from timestamps). Removing the suppression means token_nosp may appear in the model's output, which is something we do not want to see. Second, the key to solving hallucination is finding a way to skip silence: token_nosp tells you how likely it is that a segment is silent, so that the silence can be skipped. Merely removing the suppression of token_nosp, without any other action, cannot solve the hallucination.
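In other words, the two goals are compatible: record the no-speech probability *before* suppressing the token, so sampling can never emit nosp while the silence signal is still available for segment skipping. A sketch of that ordering, written as Python pseudocode for the C++ decoder (function and parameter names are hypothetical):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def decode_step(logits, nosp_id, sot_id):
    # 1) record the silence signal BEFORE any suppression
    no_speech_prob = softmax(logits)[nosp_id]
    # 2) then suppress sot/nosp so neither can ever be sampled
    out = list(logits)
    out[nosp_id] = float("-inf")
    out[sot_id] = float("-inf")
    return out, no_speech_prob
```

The returned `no_speech_prob` can then feed a segment-skip threshold, while the suppressed logits go on to sampling unchanged.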

This line is very important: without it, the model hallucinates heavily on silent audio.