attempt to fix the repetition/hallucination issue identified in #1046 #1052
Hi @jongwook, not sure if you saw the comment below, but it includes a reproduction case which might be useful. The repetition persists with this PR.
@ryanheise thanks! Will look into it...
The problem triggered by the test data from @ryanheise is model-sensitive. I see the problem with ryan-test-sub.mp4
I can confirm this fixed my example, thanks! 👍
@glangford FYI the subtitles didn't show in your video.
@ryanheise Inline (on Mac at least), you may need to click on the >> on the right to turn on subtitles. Or download it, open it in VLC, QuickTime, or another player, and enable subtitles in the viewer.
Ah, I see, Firefox doesn't show any options, but downloading it and opening in VLC works. You can also do hard subs this way #435 (reply in thread)
Here is the 69-whiskey-clip.mp4
Btw have you guys tried with longer audio, e.g. 5 mins long? I am still getting a lot of repetition even with this fix.
I was hoping to update the word segmentation results for Whisper-only word timestamps in our paper, https://arxiv.org/abs/2303.00747. But currently I am getting better results with our implementation, which is similar to https://github.com/linto-ai/whisper-timestamped
I am testing a longer audio now (running on CPU, larger model, transcript+transcribe so it is taking a while). For clarity,
It seems like there are different possible sources of error, in all the different discussions
yes
yes
@jongwook Note, from @m-bain's example above, that the repetition occurs with verbose print. The repetitions in this example are all "instantaneous", i.e. they have the same start and end time;
they are printed but then immediately cleared by this code, which looks like a bug unique to Line 345 in aac47c9
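To make the "instantaneous" case concrete, here is a minimal sketch of filtering out segments whose start and end times coincide. It is only an illustration, not the code at the referenced line: the helper name `drop_instantaneous_segments` is hypothetical, and it assumes segments are plain dicts with `start`, `end`, and `text` keys.

```python
from typing import Dict, List


def drop_instantaneous_segments(segments: List[Dict]) -> List[Dict]:
    """Keep only segments whose end time is strictly after their start time.

    Hypothetical helper for illustration; assumes each segment is a dict
    with numeric "start" and "end" keys (in seconds) and a "text" key.
    """
    return [seg for seg in segments if seg["end"] > seg["start"]]


# Made-up example data: the middle entry repeats the previous text
# with zero duration, mimicking an "instantaneous" repetition.
segments = [
    {"start": 0.0, "end": 2.5, "text": "Hello there."},
    {"start": 2.5, "end": 2.5, "text": "Hello there."},
    {"start": 2.5, "end": 4.0, "text": "How are you?"},
]

kept = drop_instantaneous_segments(segments)
for seg in kept:
    print(f"[{seg['start']:.2f} --> {seg['end']:.2f}] {seg['text']}")
```

A filter like this would only address zero-duration repeats; repetitions with distinct start and end times would pass through unchanged.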
@m-bain Given this, could you maybe rerun and see whether the formal output formats are messed up or not, using
This is not a verbose error, and the start and end times of the repetitions are not always instantaneous; see the output for the .srt file without verbose: 271 272 273 274 275 276 277 278
So there are at least two problems then
Given how close the start/end times are, it feels like something related to Line 337 in aac47c9. @m-bain Do the same repetitions happen with
Update: I realise there is some specific underline formatting in the word_timestamps; I was able to get it working in the end. See here for a comparison of word-level timestamp accuracy. @jongwook, could you share the evaluation for long-form transcription WER? I am unable to reproduce the Whisper results; right now I report in the vanilla setting, i.e. greedy/beam5 decoding without the heuristic tricks.
…i#1046 (openai#1052) * attempt to fix the repetition/hallucination issue identified in openai#1046 * zero-pad the audio instead of spectrogram * formatting fix * delete debug print
