This is the changelog for the open source version of tiktoken.
- Build wheels for Python 3.14
- Build musllinux aarch64 wheels
- Support for free-threaded Python
- Update version of `pyo3` and `rustc-hash`
- Avoid use of `blobfile` for reading local files
- Recognise `gpt-5` model identifier
- Minor performance improvement for file reading
- Support for GPT-5
- Update version of `pyo3`
- Use new Rust edition
- Fix special token handling in `encode_to_numpy`
- Better error handling
- Improvements to private APIs
- Support for newer models
- Improvements to private APIs
- Support for `o1` and `o3` models
- Better error messages when loading invalid vocabulary files
- Support for encoding to numpy arrays
- Delayed imports when not strictly necessary
- Support for `o1-` and `chatgpt-4o-` models
- Build wheels for Python 3.13
- Add possessive quantifiers to limit backtracking in regular expressions, thanks to @l0rinc!
- Provide a better error message and type for invalid token decode
- Permit tuples in type hints
- Better error message for passing invalid input to `get_encoding`
- Better error messages during plugin loading
- Add a `__version__` attribute
- Update versions of `pyo3`, `regex`, `fancy-regex`
- Drop support for Python 3.8
- Support for `gpt-4o`
- Performance improvements
- Optimise regular expressions for a 20% performance improvement, thanks to @paplorinc!
- Add `text-embedding-3-*` models to `encoding_for_model`
- Check content hash for downloaded files
- Allow pickling `Encoding` objects; a registered `Encoding` will be pickled by reference
- Work around a PyO3 bug in frozenset conversion
Thank you to @paplorinc, @mdwelsh, @Praneet460!
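The content-hash check mentioned above can be sketched in a few lines of Python; `verify_download` and `expected_sha256` are illustrative names for this sketch, not tiktoken's actual API:

```python
import hashlib

def verify_download(data: bytes, expected_sha256: str) -> bytes:
    # Reject a downloaded vocabulary file whose content does not
    # match the published hash (guards against corrupt or tampered caches).
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"hash mismatch: expected {expected_sha256}, got {actual}")
    return data
```

Verifying the hash at load time means a truncated or stale cache entry fails loudly instead of producing a silently wrong vocabulary.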
- Build wheels for Python 3.12
- Update version of PyO3 to allow multiple imports
- Avoid permission errors when using default cache logic
- Add `encoding_name_for_model`, undo some renames to variables that are implementation details
- Add `tiktoken._educational` submodule to better document how byte pair encoding works
- Ensure `encoding_for_model` knows about several new models
- Add `decode_with_offsets`
- Better error for failures with the plugin mechanism
- Make more tests public
- Update versions of dependencies
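The byte pair encoding procedure that `tiktoken._educational` walks through can be condensed into a sketch like the following (a toy merge-ranks table, not the submodule's actual code):

```python
def bpe_encode(mergeable_ranks: dict[bytes, int], word: bytes) -> list[int]:
    # Start with one piece per byte, then repeatedly merge the adjacent
    # pair whose concatenation has the lowest rank in the vocabulary.
    parts = [bytes([b]) for b in word]
    while True:
        best_rank = None
        best_i = None
        for i in range(len(parts) - 1):
            rank = mergeable_ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:
            break  # no adjacent pair is in the vocabulary; we are done
        parts = parts[:best_i] + [parts[best_i] + parts[best_i + 1]] + parts[best_i + 2:]
    return [mergeable_ranks[p] for p in parts]
```

With a toy vocabulary `{b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4}`, encoding `b"abc"` merges `a`+`b` first, then `ab`+`c`, yielding a single token.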
- Add `decode_batch` and `decode_bytes_batch`
- Improve error messages and handling
`tiktoken` will now make a best-effort attempt to replace surrogate pairs with the corresponding Unicode character and will replace lone surrogates with the Unicode replacement character.
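The best-effort behaviour described above can be illustrated with a standalone sketch — an approximation of the idea in pure Python, not tiktoken's Rust implementation:

```python
def fix_surrogates(text: str) -> str:
    # Combine high/low surrogate pairs into the real code point;
    # replace any lone surrogate with U+FFFD (the replacement character).
    out = []
    i = 0
    while i < len(text):
        c = text[i]
        if "\ud800" <= c <= "\udbff":  # high surrogate
            if i + 1 < len(text) and "\udc00" <= text[i + 1] <= "\udfff":
                combined = 0x10000 + ((ord(c) - 0xD800) << 10) + (ord(text[i + 1]) - 0xDC00)
                out.append(chr(combined))
                i += 2
                continue
            out.append("\ufffd")  # lone high surrogate
        elif "\udc00" <= c <= "\udfff":
            out.append("\ufffd")  # lone low surrogate
        else:
            out.append(c)
        i += 1
    return "".join(out)
```

For example, the surrogate pair U+D83D U+DE00 combines into U+1F600, while an unpaired U+D800 becomes U+FFFD.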
- Add encoding for GPT-4
- Build aarch64 wheels
- Make `blobfile` an optional dependency
Thank you to @messense for the environment variable that makes cargo not OOM under emulation!
- Improve performance by 5-20%; thank you to @nistath!
- Add `gpt-3.5-turbo` models to `encoding_for_model`
- Add prefix matching to `encoding_for_model` to better support future model versions
- Fix a bug in the README instructions on extending tiktoken
- Update the set of available encodings
- Add packaging metadata
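Prefix matching in `encoding_for_model` works roughly like the sketch below; the tables here are small illustrative stand-ins, not tiktoken's full model tables:

```python
# Illustrative tables only — not tiktoken's complete mappings.
MODEL_TO_ENCODING = {"gpt-4": "cl100k_base"}
MODEL_PREFIX_TO_ENCODING = {"gpt-3.5-turbo-": "cl100k_base"}

def encoding_name_for_model(model_name: str) -> str:
    # Exact matches win; otherwise prefix matching lets dated snapshots
    # (e.g. "gpt-3.5-turbo-0301") resolve without a table update.
    if model_name in MODEL_TO_ENCODING:
        return MODEL_TO_ENCODING[model_name]
    for prefix, encoding in MODEL_PREFIX_TO_ENCODING.items():
        if model_name.startswith(prefix):
            return encoding
    raise KeyError(f"Could not find encoding for {model_name}")
```

The prefix table is what lets future model versions that follow an existing naming scheme work without a library release.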
- Add `tiktoken.encoding_for_model` to get the encoding for a specific model
- Improve portability of caching logic
Thank you to @fritzo, @arvid220u, @khanhvu207, @henriktorget for various small corrections
- Avoid use of `blobfile` for public files
- Add support for Python 3.8
- Add `py.typed`
- Improve the public tests
- Initial release