Commit c26953f

jongwook authored and Majdoddin committed
Use tiktoken (openai#1044)
* use tiktoken==0.3.0
* formatting
* tuple should be safer
* Update whisper/tokenizer.py

Co-authored-by: Ruhollah Majdoddin <r.majdodin@gmail.com>

* use tiktoken 0.3.1
* reflecting suggestions
* cleanup
* bypassing load_tiktoken_bpe to avoid blobfile dep

---------

Co-authored-by: Ruhollah Majdoddin <r.majdodin@gmail.com>
1 parent 6d09ca5 commit c26953f
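
The last bullet of the commit message mentions bypassing load_tiktoken_bpe to avoid the blobfile dependency. As a rough sketch of what reading such a vocabulary by hand can look like, the snippet below assumes the tiktoken rank-file layout of one "<base64 token> <rank>" pair per line. The function name parse_bpe_ranks and the sample data are hypothetical and not taken from this commit.

```python
import base64


def parse_bpe_ranks(text: str) -> dict[bytes, int]:
    """Parse lines of '<base64 token> <rank>' into a bytes -> rank mapping.

    Assumption: one token per line, base64-encoded, followed by an
    integer merge rank, separated by whitespace.
    """
    ranks: dict[bytes, int] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


# Tiny in-memory stand-in for a vocabulary file (hypothetical contents).
sample = "\n".join(
    f"{base64.b64encode(tok).decode('ascii')} {i}"
    for i, tok in enumerate([b"a", b"b", b" the"])
)
ranks = parse_bpe_ranks(sample)
print(ranks)
```

Reading the file with plain Python I/O and the standard library keeps the dependency list down to tiktoken itself, which appears to be the motivation stated in the commit message.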

15 files changed, +100601 -100096 lines changed

MANIFEST.in

Lines changed: 0 additions & 2 deletions
@@ -2,6 +2,4 @@ include requirements.txt
 include README.md
 include LICENSE
 include whisper/assets/*
-include whisper/assets/gpt2/*
-include whisper/assets/multilingual/*
 include whisper/normalizers/english.json

requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -3,5 +3,5 @@ numpy
 torch
 tqdm
 more-itertools
-transformers>=4.19.0
+tiktoken==0.3.1
 ffmpeg-python==0.2.0

tests/test_transcribe.py

Lines changed: 6 additions & 1 deletion
@@ -6,6 +6,7 @@
 
 import whisper
 from whisper.utils import write_vtt
+from whisper.tokenizer import get_tokenizer
 
 
 def test_transcribe():
def test_transcribe():
@@ -46,14 +47,18 @@ def test_transcribe_callback():
     assert "my fellow americans" in transcription
     assert "your country" in transcription
     assert "do for you" in transcription
+    tokenizer = get_tokenizer(model.is_multilingual)
+    all_tokens = [t for s in result["segments"] for t in s["tokens"]]
+    assert tokenizer.decode(all_tokens) == result["text"]
+    assert tokenizer.decode_with_timestamps(all_tokens).startswith("<|0.00|>")
+
     timing_checked = False
     for segment in result["segments"]:
         for timing in segment["words"]:
             assert timing["start"] < timing["end"]
             if timing["word"].strip(" ,") == "Americans":
                 assert timing["start"] <= 1.8
                 assert timing["end"] >= 1.8
-                print(timing)
                 timing_checked = True
 
     assert timing_checked
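
The added assertions flatten the tokens of every segment and check two things: decoding them reproduces the full transcript text, and decoding with timestamps starts at "<|0.00|>". The sketch below illustrates that round trip with a toy tokenizer. ToyTokenizer, its vocabulary, the timestamp_begin value of 100, and the 0.02 s step are all assumptions made for illustration; this is not whisper's Tokenizer class.

```python
from dataclasses import dataclass


@dataclass
class ToyTokenizer:
    """Hypothetical stand-in used only to illustrate the test's round trip."""

    vocab: dict[int, str]
    timestamp_begin: int     # assumed: ids >= this value are timestamp tokens
    precision: float = 0.02  # assumed: seconds per timestamp step

    def decode(self, tokens: list[int]) -> str:
        # Timestamp tokens are skipped, so only the text remains.
        return "".join(self.vocab[t] for t in tokens if t < self.timestamp_begin)

    def decode_with_timestamps(self, tokens: list[int]) -> str:
        parts = []
        for t in tokens:
            if t >= self.timestamp_begin:
                seconds = (t - self.timestamp_begin) * self.precision
                parts.append(f"<|{seconds:.2f}|>")
            else:
                parts.append(self.vocab[t])
        return "".join(parts)


tokenizer = ToyTokenizer(
    vocab={0: " my", 1: " fellow", 2: " Americans"}, timestamp_begin=100
)

# Made-up transcription result with the same shape the test relies on.
result = {
    "text": " my fellow Americans",
    "segments": [
        {"tokens": [100, 0, 1, 150]},
        {"tokens": [150, 2, 200]},
    ],
}

# Same flattening expression as in the diff above.
all_tokens = [t for s in result["segments"] for t in s["tokens"]]
assert tokenizer.decode(all_tokens) == result["text"]
assert tokenizer.decode_with_timestamps(all_tokens).startswith("<|0.00|>")
```

In this toy setup the first check only passes because decode drops timestamp ids, which is presumably the behaviour the real test exercises as well.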
