Adding word timestamps repeats earlier text #1046

glangford · 2023-03-07T13:15:13Z

glangford
Mar 7, 2023

I am comparing the results using v20230306 transcribing the same audio twice, first with --word_timestamps False then with True (keeping other settings the same).

The timing refinement with word_timestamps is excellent, but the second-to-last segment of speech appears to be copied at the end, where no speech occurs. I am attaching the two .srt files here (renamed to .txt to make GitHub happy). The repeated subtitles first occur at 101-106 and incorrectly appear again at 120-125. I can post the 1m video as well if needed.

Since only --word_timestamps True appends the extra text, it is possible this is a bug in managing segments when timing is being refined and not a model hallucination.

5Mar-energia.srt.txt

5Mar-energia-no-word-timestamps.srt.txt
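
For anyone wanting to repeat this A/B check on their own files, the comparison can be partly automated. The following is a minimal sketch that assumes well-formed SRT input (index line, timing line, then text); the helper names `parse_srt` and `find_repeated_cues` are made up for illustration and are not part of Whisper.

```python
import re
from collections import defaultdict


def parse_srt(srt_text):
    """Return a list of (index, timing_line, text) tuples from SRT content."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        # Expect: numeric index, "start --> end" timing line, then the text.
        if len(lines) < 3 or not lines[0].isdigit() or "-->" not in lines[1]:
            continue  # skip malformed or empty cues
        cues.append((int(lines[0]), lines[1], " ".join(lines[2:])))
    return cues


def find_repeated_cues(cues):
    """Map each subtitle text that occurs more than once to its cue indices."""
    seen = defaultdict(list)
    for index, _timing, text in cues:
        seen[text.lower()].append(index)
    return {text: indices for text, indices in seen.items() if len(indices) > 1}
```

Running `find_repeated_cues(parse_srt(open(path, encoding="utf-8").read()))` on both .srt files and comparing the two results shows which repeats appear only in the word_timestamps run. Exact matching will also flag legitimately repeated phrases, so the output still needs a manual look.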

Answered by jongwook

Mar 7, 2023

Thanks for reporting this; I wanted to merge the implementation before the branch gets too divergent but I'm sure it still has many rough edges. Trying to reproduce the repetition issue myself to push a fix ..

glangford · 2023-03-07T13:16:50Z

glangford
Mar 7, 2023
Author

Is anyone else seeing this if you do an A/B comparison with this new version?

2 replies

ryanheise Mar 7, 2023

I reported something similar here with an example:

#869 (comment)

x86Gr Mar 7, 2023

Yes, same behavior on 5 sample files

x86Gr · 2023-03-07T19:28:54Z

x86Gr
Mar 7, 2023

I can confirm the repetitions occur, and that they're not model hallucinations as they happen even with very small pauses. Putting the repetitions aside, the quality of the transcription looks even better.

0 replies

jongwook · 2023-03-07T19:31:02Z

jongwook
Mar 7, 2023
Maintainer

Thanks for reporting this; I wanted to merge the implementation before the branch gets too divergent but I'm sure it still has many rough edges. Trying to reproduce the repetition issue myself to push a fix ..

1 reply

x86Gr Mar 7, 2023

Slicing off the pauses with a VAD doesn't help, and neither does clearing the cache.

'verbose=None, word_timestamps=False' in python does not fix it
'verbose=False, word_timestamps=False' in python does not fix it
'verbose=True, word_timestamps=False' in python does fix it
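
A test matrix like the one above can be scripted so that every combination is run the same way. This is only a sketch: `run_option_grid` and `count_repeats` are hypothetical helper names, and the code assumes that the callable passed in behaves like `model.transcribe`, returning a dict with a "segments" list whose items have a "text" key.

```python
from itertools import product


def count_repeats(segments):
    """Number of segments whose stripped text already appeared earlier."""
    texts = [seg["text"].strip() for seg in segments]
    return len(texts) - len(set(texts))


def run_option_grid(transcribe, audio_path,
                    verbose_values=(None, False, True),
                    word_timestamps_values=(False, True)):
    """Run `transcribe` for every option combination and count repeats."""
    report = {}
    for verbose, word_ts in product(verbose_values, word_timestamps_values):
        result = transcribe(audio_path, verbose=verbose, word_timestamps=word_ts)
        report[(verbose, word_ts)] = count_repeats(result["segments"])
    return report


# Hypothetical usage with a loaded model (requires openai-whisper):
#   import whisper
#   model = whisper.load_model("base")
#   print(run_option_grid(model.transcribe, "input.mp3"))
```

Passing the transcription function in as an argument keeps the helper independent of the model size and device, and makes it easy to rerun the same grid against a different branch.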

x86Gr · 2023-03-07T21:10:21Z

x86Gr
Mar 7, 2023

Additional tests: the duplicates are not always identical, some look like two different transcriptions of the same chunk
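
Because the duplicates are not always identical, an exact string comparison will miss some of them. One way to catch near-duplicates is a similarity ratio from the standard library's difflib. This is a sketch under assumptions: the function name `find_near_duplicates` and the 0.8 threshold are arbitrary choices and would need tuning on real transcripts.

```python
from difflib import SequenceMatcher


def find_near_duplicates(texts, threshold=0.8):
    """Return (i, j, ratio) for pairs of segment texts that are nearly the same."""
    # Lowercase and collapse whitespace so trivial differences are ignored.
    normalized = [" ".join(t.lower().split()) for t in texts]
    pairs = []
    for i in range(len(normalized)):
        for j in range(i + 1, len(normalized)):
            ratio = SequenceMatcher(None, normalized[i], normalized[j]).ratio()
            if ratio >= threshold:
                pairs.append((i, j, round(ratio, 2)))
    return pairs
```

The pairwise loop is quadratic in the number of segments, which is fine for files of a few minutes but would need a windowed comparison for long recordings.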

0 replies

glangford · 2023-03-07T21:23:43Z

glangford
Mar 7, 2023
Author

@jongwook I'm not sure exactly how the shift code below should function. Is it a problem that seek can be overwritten here, so that previous changes to it in the same iteration are lost? This only runs if word_timestamps is True.

if len(consecutive) > 0 and len(word_end_timestamps) > 0:
    seek_shift = round(
        (word_end_timestamps[-1] - time_offset) * FRAMES_PER_SECOND
    )
    if seek_shift > 0:
        seek = previous_seek + seek_shift  # <-- seek is overwritten here
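
To make the question concrete, here is a small sketch of the arithmetic, written as a standalone function rather than the actual loop in transcribe.py. The function name `refined_seek`, its parameters, and the example numbers are hypothetical, and FRAMES_PER_SECOND = 100 assumes 16 kHz audio with a hop length of 160 samples. It shows only that an earlier advance of `seek` within the same iteration is discarded when the shift applies; it does not establish that this is the cause of the repetitions.

```python
FRAMES_PER_SECOND = 100  # assumption: 16 kHz audio, hop length of 160 samples


def refined_seek(previous_seek, seek, consecutive, word_end_timestamps, time_offset):
    """Return the seek position after the word-timestamp adjustment.

    previous_seek: seek value at the start of the iteration (in frames)
    seek: value after any earlier advance in the same iteration (in frames)
    time_offset: start time of the current window (in seconds)
    """
    if len(consecutive) > 0 and len(word_end_timestamps) > 0:
        seek_shift = round(
            (word_end_timestamps[-1] - time_offset) * FRAMES_PER_SECOND
        )
        if seek_shift > 0:
            # Rebuilt from previous_seek, so the earlier advance is dropped.
            return previous_seek + seek_shift
    return seek


# Window starts at 30.0 s (frame 3000); seek was already advanced to frame 6000,
# but the last aligned word ends at 41.52 s.
print(refined_seek(3000, 6000, [1], [31.2, 41.52], 30.0))
```

With these made-up numbers the result is frame 4152 instead of 6000, so the audio after 41.52 s would be decoded again in the next window.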

8 replies

x86Gr Mar 7, 2023

With @fix-decoding-repetition-degradation, verbose=True fixes it, at least in the timestamped part. The full text at the end still has the repetitions.

jongwook Mar 7, 2023
Maintainer

@x86Gr thanks for checking! Would it be possible to attach your input.mp3 here?

x86Gr Mar 7, 2023

Sending via WeTransfer, as it is test audio for internal use. I've also sent the transcriptions.

glangford Mar 7, 2023
Author

For the original file that I posted, the problem goes away using @fix-decoding-repetition-degradation - for --task transcribe. A step forward!

However I also tried --task translate, and there is a repetition problem in that case.

The triggering of repetition is sensitive to the arguments provided. In one run I get a spurious "Thanks for watching!" text at the end during a silent period (only for --task translate), as discussed in #928. In another run I get a repeated sentence. The file is attached (I think it is OK to copy, as it is a YT video).

5Mar-energia.mp4

jongwook Mar 8, 2023
Maintainer

Thanks all! I'm seeing that the incorrect zero-padding identified in #730 and #838 was contributing to this error. I'll add a fix in #1052 so that your audio examples don't result in repetitions.

AILucifer99 · 2023-03-08T12:08:17Z

AILucifer99
Mar 8, 2023

The hallucination problem with the Whisper model can be solved by using the example code provided in the official Whisper GitHub repository. Before running that code, remember to split the audio into 30-second segments, and the problem will be solved.

The audio splitting code is provided below:

import glob
import os

from moviepy.editor import AudioFileClip  # moviepy 1.x import path

# Take the first .wav file found in ./wav_files
audio_path = glob.glob("./wav_files/*.wav")[0]
audio = AudioFileClip(audio_path)
segment_duration = int(input("Enter the audio threshold duration (in seconds) : "))

print("\nSplitting the audio file into segments of {} seconds each.".format(segment_duration))

# Split the audio into fixed-length segments; the last one is clamped to the clip end
segments = [
    audio.subclip(start, min(start + segment_duration, audio.duration))
    for start in range(0, int(audio.duration), segment_duration)
]

# Create the output directory if it doesn't already exist
new_directory = "segments-{}".format(segment_duration)
os.makedirs(new_directory, exist_ok=True)

# Write each segment to disk with a unique name in the new directory
for i, segment in enumerate(segments):
    segment.write_audiofile(f"{new_directory}/segment_{i}.wav")

# Reached only if every segment was written; errors are no longer swallowed
print("\nSplitting of the audio files completed...")

After splitting the audio files with the snippet above, call the Whisper code below:

import whisper
model = whisper.load_model("base")
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)

This was worked out primarily with the help of my teammate Sudipa Dutta (GitHub: sudipa23).

Hope this will solve the problem.

0 replies

eschmidbauer · 2023-03-08T19:10:06Z

eschmidbauer
Mar 8, 2023

Hi,
I'm getting a crash when trying to use word-level timestamps.
Here is my code:

Python 3.10.9 (main, Mar  8 2023, 10:47:38) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import whisper
>>> model = whisper.load_model(
...     name="large-v2", download_root="./models", device="cuda:0")
>>>
>>> result = model.transcribe("test.wav", word_timestamps=True)
python3.10/site-packages/whisper/timing.py:41: UserWarning: Failed to launch Triton kernels, likely due to missing CUDA toolkit; falling back to a slower median kernel implementation...
  warnings.warn(
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "python3.10/site-packages/whisper/transcribe.py", line 320, in transcribe
    add_word_timestamps(
  File "python3.10/site-packages/whisper/timing.py", line 294, in add_word_timestamps
    alignment = find_alignment(model, tokenizer, text_tokens, mel, num_frames, **kwargs)
  File "python3.10/site-packages/whisper/timing.py", line 210, in find_alignment
    text_indices, time_indices = dtw(-matrix)
  File "python3.10/site-packages/whisper/timing.py", line 143, in dtw
    return dtw_cuda(x)
  File "python3.10/site-packages/whisper/timing.py", line 137, in dtw_cuda
    return backtrace(trace.cpu().numpy())
  File "python3.10/site-packages/whisper/timing.py", line 75, in backtrace
    raise ValueError("Unexpected trace[i, j]")
ValueError: Unexpected trace[i, j]
>>>

I've set up my environment like this:

conda create -n whisper python=3.10
conda activate whisper
conda install -c "nvidia/label/cuda-11.7.1" \
	cuda-nvcc=11.7 \
	cuda-toolkit=11.7.1 \
	cuda-libraries=11.7.1 \
	cuda-libraries-dev=11.7.1 \
	cudnn

pip install \
	torch==1.13.1+cu117 \
	torchaudio==0.13.1+cu117 \
	-f https://download.pytorch.org/whl/torch_stable.html

pip install openai-whisper

Does anyone have any ideas or suggestions as to why this is happening?
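
For readers trying to interpret the traceback above: the error is raised while walking back through a dynamic time warping (DTW) trace matrix. The following is a minimal pure-Python sketch of that idea, not the code in whisper/timing.py. It assumes each trace cell stores one of three step codes (0 = diagonal, 1 = up, 2 = left); the helper `dtw_trace` and the list-of-lists representation are illustrative choices. Under that assumption, any other value in the matrix, for example from a kernel that did not fill it in, would produce this kind of ValueError.

```python
import math


def dtw_trace(cost):
    """Fill a DTW trace matrix for a 2-D cost table given as a list of lists."""
    n, m = len(cost), len(cost[0])
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    trace = [[-1] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Candidate predecessors: diagonal (0), up (1), left (2).
            candidates = (acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            step = candidates.index(min(candidates))
            acc[i][j] = cost[i - 1][j - 1] + candidates[step]
            trace[i][j] = step
    return trace


def backtrace(trace):
    """Walk from the bottom-right cell to the origin, returning (row, col) pairs."""
    trace = [row[:] for row in trace]  # work on a copy
    trace[0] = [2] * len(trace[0])     # along the top edge, only "left" is possible
    for row in trace:
        row[0] = 1                     # along the left edge, only "up" is possible
    i, j = len(trace) - 1, len(trace[0]) - 1
    path = []
    while i > 0 or j > 0:
        path.append((i - 1, j - 1))
        step = trace[i][j]
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        elif step == 2:
            j -= 1
        else:
            # A cell that was never filled with a valid step code ends up here.
            raise ValueError("Unexpected trace[i, j]")
    return path[::-1]
```

In this sketch, `backtrace(dtw_trace(cost))` returns the alignment path for a valid cost table, while a trace matrix containing any value outside 0, 1, 2 away from the edges triggers the same error message as in the report.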

0 replies
