attempt to fix the repetition/hallucination issue identified in #1046 by jongwook · Pull Request #1052 · openai/whisper

jongwook · 2023-03-07T21:48:40Z

No description provided.

ryanheise · 2023-03-08T01:02:42Z

Hi @jongwook Not sure if you saw the comment below, but it includes a reproduction case which might be useful:

#869 (comment)

The repetition persists with this PR.

jongwook · 2023-03-08T01:12:27Z

@ryanheise thanks! will look into it...

glangford · 2023-03-08T01:49:36Z

The problem triggered by the test data from @ryanheise is model-sensitive. I see it with small, but small.en and medium.en both look OK, although the timing of the last few words is off. Below is the mp3 fragment converted to video to show the English subtitles.

ryan-test-sub.mp4

jongwook · 2023-03-08T01:52:02Z

Thanks all! The incorrect zero-padding of Mel spectrograms as identified in #730 and #838 was contributing to this error. The fix in 477f0be appears to fix the repetition issue.
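To illustrate why padding the spectrogram with zeros differs from padding the audio with silence, here is a minimal sketch. It uses a toy log-energy front end (the hypothetical `log_energy` helper below), not Whisper's actual log-Mel code, and the frame and hop sizes are assumptions chosen only for the example.

```python
import numpy as np

def log_energy(audio, frame=400, hop=160, floor=1e-10):
    """Toy stand-in for a log-Mel front end: log10 of mean energy per frame."""
    n_frames = 1 + (len(audio) - frame) // hop
    frames = np.stack([audio[i * hop : i * hop + frame] for i in range(n_frames)])
    energy = (frames ** 2).mean(axis=1)
    return np.log10(np.maximum(energy, floor))

rng = np.random.default_rng(0)
speech = 0.1 * rng.standard_normal(16000)  # 1 s of noise standing in for speech
pad = 16000                                # 1 s of padding

# (a) Pad the waveform with silence, then compute features.
feat_audio_padded = log_energy(np.pad(speech, (0, pad)))

# (b) Compute features first, then pad the feature array with zeros.
feat = log_energy(speech)
feat_spec_padded = np.pad(feat, (0, len(feat_audio_padded) - len(feat)))

print("speech frames, max log-energy:", feat.max())
print("last frame, audio padded:     ", feat_audio_padded[-1])
print("last frame, spectrogram padded:", feat_spec_padded[-1])
```

In the log domain, silence maps to the floor value, while a literal 0.0 corresponds to a frame louder than the speech itself, so the two kinds of padding present very different inputs to a model.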

ryanheise · 2023-03-08T02:22:39Z

> The fix in 477f0be appears to fix the repetition issue.

I can confirm this fixed my example, thanks! 👍

> Below is the mp3 fragment converted to video to show the English subtitles.

@glangford FYI the subtitles didn't show in your video.

glangford · 2023-03-08T02:33:14Z

@ryanheise Inline, (on Mac at least) you may need to click on the >> on the right to turn on subtitles. Or download and view with VLC, Quicktime, or whatever and enable subtitles in the viewer.

ryanheise · 2023-03-08T11:27:20Z

Ah, I see. Firefox doesn't show any options, but downloading it and opening it in VLC works. You can also do hard subs this way: #435 (reply in thread)

> using either small.en or medium.en looks ok although the timing of the last few words is off.

Here is the base model for comparison, which appears more accurate on the last few words:

69-whiskey-clip.mp4

m-bain · 2023-03-08T13:29:36Z

Btw have you guys tried with longer audio, e.g. 5 mins long? I am still getting a lot of repetition even with this fix.
E.g. on the TEDLIUM test set "AimeeMullins_2009P.wav"

[02:10.440 --> 02:14.720] and needless to say, thank God, I wasn't using a thesaurus back then.
[02:14.720 --> 02:14.720] and needless to say, thank God, I wasn't using a thesaurus back then.
[02:15.460 --> 02:18.580] I mean from this entry, it would seem that
[02:18.580 --> 02:22.800] I was born into a world that perceived someone like me
[02:22.800 --> 02:23.340] I was born into a world that perceived someone like me
[02:23.340 --> 02:27.540] to have nothing positive, whatsoever, going for them
[02:27.540 --> 02:27.540] to have nothing positive, whatsoever, going for them
[02:27.540 --> 02:35.340] When in fact today, I'm celebrated for the opportunities and adventures my life has procured
[02:35.340 --> 02:35.960] When in fact today, I'm celebrated for the opportunities and adventures my life has procured
[02:35.960 --> 02:42.140] So I immediately went to look up the 2009 online edition
[02:42.140 --> 02:42.160] So I immediately went to look up the 2009 online edition
[02:42.160 --> 02:42.160] So I immediately went to look up the 2009 online edition

I was hoping to update word segmentation results for whisper-only word timestamps in our paper https://arxiv.org/abs/2303.00747

But currently I am getting better results with our implementation, which is similar to https://github.com/linto-ai/whisper-timestamped

glangford · 2023-03-08T14:38:22Z

> Btw have you guys tried with longer audio, e.g. 5 mins long? I am still getting a lot of repetition even with this fix.

I am testing a longer audio now (running on CPU, larger model, transcript+transcribe so it is taking a while). For clarity,

- are you running the 20230307 release version? with, or without --word_timestamps?
- the repetitions from "AimeeMullins_2009P.wav" above, are they from verbose print to the console?

It seems like there are different possible sources of error in all the different discussions:

- model hallucination
- new repetition introduced or magnified by --word_timestamps True
- (hand waving) segmentation issues

m-bain · 2023-03-08T14:47:29Z

> are you running the 20230307 release version? with, or without --word_timestamps?

yes

> the repetitions from "AimeeMullins_2009P.wav" above, are they from verbose print to the console?

yes

glangford · 2023-03-08T15:06:15Z

@jongwook Note that in @m-bain's example above, the repetition occurs with verbose printing. The repetitions in this example are all "instantaneous", i.e. they have the same start and end time:

[02:14.720 --> 02:14.720] and needless to say, thank God, I wasn't using a thesaurus back then.

They are printed but then immediately cleared by this code, which looks like a bug unique to --verbose True:

whisper/whisper/transcribe.py

Line 345 in aac47c9

# if a segment is instantaneous or does not contain text, clear it
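As a rough illustration of the behaviour described here, the sketch below prints segments before a cleanup pass that blanks instantaneous or empty ones. The `clear_empty_segments` helper and the segment dictionaries are hypothetical and only mirror the comment quoted above; this is not the actual code in transcribe.py.

```python
def clear_empty_segments(segments):
    """Blank out segments that are instantaneous or carry no text (hypothetical helper)."""
    for seg in segments:
        if seg["start"] == seg["end"] or seg["text"].strip() == "":
            seg["text"] = ""
    return segments

text = " and needless to say, thank God, I wasn't using a thesaurus back then."
segments = [
    {"start": 130.44, "end": 134.72, "text": text},
    {"start": 134.72, "end": 134.72, "text": text},  # instantaneous duplicate
]

# A verbose print placed before the cleanup still shows the duplicate.
printed_before = [seg["text"] for seg in segments]
for line in printed_before:
    print("verbose:", line)

# After the cleanup, only the segment with a real duration keeps its text.
clear_empty_segments(segments)
kept = [seg for seg in segments if seg["text"]]
print("kept after cleanup:", len(kept))
```

If the print happens before the cleanup, the console shows two lines even though only one segment survives into the final output.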

glangford · 2023-03-08T15:10:03Z

@m-bain Given this could you maybe rerun and see if the formal output formats are messed up or not, using --verbose False?

m-bain · 2023-03-08T15:13:44Z

This is not a verbose error, and the start times and end times of repetition are not always instantaneous, see output for the .srt file without verbose:

271
00:02:14,440 --> 00:02:14,720
and needless to say, thank God, I wasn't using a thesaurus back then.

272
00:02:14,720 --> 00:02:14,720

273
00:02:15,460 --> 00:02:16,180
I mean from this entry, it would seem that

274
00:02:16,180 --> 00:02:16,360
I mean from this entry, it would seem that

275
00:02:16,360 --> 00:02:16,960
I mean from this entry, it would seem that

276
00:02:16,960 --> 00:02:17,220
I mean from this entry, it would seem that

277
00:02:17,220 --> 00:02:17,620
I mean from this entry, it would seem that

278
00:02:17,620 --> 00:02:17,800
I mean from this entry, it would seem that
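One way to check an output file for this kind of repetition is to flag cues whose text matches the previous cue. The sketch below assumes the cues have already been parsed into (start, end, text) tuples in seconds; `find_repeats` is a hypothetical helper, and the data is copied from cues 271 to 275 above.

```python
def find_repeats(cues):
    """Return indices of non-empty cues whose text equals the previous cue's text."""
    repeats = []
    for i in range(1, len(cues)):
        current = cues[i][2].strip()
        previous = cues[i - 1][2].strip()
        if current and current == previous:
            repeats.append(i)
    return repeats

cues = [
    (134.44, 134.72, "and needless to say, thank God, I wasn't using a thesaurus back then."),
    (134.72, 134.72, ""),
    (135.46, 136.18, "I mean from this entry, it would seem that"),
    (136.18, 136.36, "I mean from this entry, it would seem that"),
    (136.36, 136.96, "I mean from this entry, it would seem that"),
]

repeats = find_repeats(cues)
for i in repeats:
    start, end, text = cues[i]
    print(f"cue index {i}: {end - start:.2f}s repeat of {text!r}")
```

On this data the flagged cues all have a non-zero duration, which matches the point that these repetitions are not instantaneous.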

glangford · 2023-03-08T15:25:14Z

So there are at least two problems then:

- verbose mode can print cleared segments
- something else triggered by word_timestamps

Given how close the start/end times are it feels like something related to seek_shift is still off

whisper/whisper/transcribe.py

Line 337 in aac47c9

seek = previous_seek + seek_shift

@m-bain Do the same repetitions happen with word_timestamps False or no?

m-bain · 2023-03-09T11:36:10Z

Update: I realised there is some specific underline formatting in the word_timestamps output, and I was able to get it working in the end. See here for a comparison of word-level timestamp accuracy.

@jongwook could you share the evaluation for long-form transcription WER? I am unable to reproduce the Whisper results; right now I report the vanilla setting, i.e. greedy/beam5 decoding without the heuristic tricks.

attempt to fix the repetition/hallucination issue identified in #1046

f9cfde9

zero-pad the audio instead of spectrogram

477f0be

jongwook and others added 3 commits March 7, 2023 17:53

formatting fix

41410b7

Merge branch 'main' into fix-decoding-repetition-degradation

1c6a3b4

delete debug print

ea5ef50

jongwook merged commit 919a713 into main Mar 8, 2023

jongwook deleted the fix-decoding-repetition-degradation branch March 14, 2023 19:35

bilo1967 mentioned this pull request Mar 21, 2023

text output looping/repeating (until end) Const-me/Whisper#26

Closed
