Is your feature request related to a problem? Please describe.
Because `create_completion` may yield text chunks that span multiple tokens per yield (e.g. when a multi-byte Unicode character is split across tokens), counting the number of yields does not necessarily equal the number of tokens the model actually generated. To get accurate usage statistics for a streamed completion, one currently has to run the final text through the tokenizer again, even though `create_completion` already tracks the number of tokens generated by the model.
Describe the solution you'd like
When `stream=True` is passed to `create_completion`, the final chunk yielded should include the usage statistics under the `'usage'` key.
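A minimal sketch of what consuming this could look like. The chunk shapes and field names below are hypothetical (modeled on the OpenAI-style dicts `create_completion` already yields), with a fake generator standing in for the real stream:

```python
# Hypothetical sketch of the proposed behavior: intermediate chunks carry text
# only, and the FINAL chunk also carries a 'usage' dict with the token counts
# the model already tracks, so the caller never has to re-tokenize.

def fake_stream():
    # Stand-in for create_completion(..., stream=True); shapes are illustrative.
    yield {"choices": [{"text": "Hel"}]}
    yield {"choices": [{"text": "lo"}]}
    yield {
        "choices": [{"text": "", "finish_reason": "stop"}],
        "usage": {"prompt_tokens": 4, "completion_tokens": 2, "total_tokens": 6},
    }

text, usage = "", None
for chunk in fake_stream():
    text += chunk["choices"][0]["text"]
    usage = chunk.get("usage") or usage  # only the last chunk carries usage

print(text)                         # Hello
print(usage["completion_tokens"])   # 2
```

The caller gets both the streamed text and the exact token counts in a single pass, with no second tokenizer run.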
Describe alternatives you've considered
- Saving the full generated text and running it through the tokenizer again (seems wasteful)
- Counting the number of yields and hoping no chunk spans multiple tokens, e.g. from multi-byte characters (hacky and fragile)
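For illustration, the first workaround looks roughly like the sketch below. The `tokenize` function here is a stand-in for the model's real tokenizer (e.g. the model object's own tokenize method); it is not the library's API:

```python
# Sketch of the current workaround: accumulate the streamed text, then pay for
# a second tokenizer pass just to recover a count the model already computed.

def tokenize(text: str) -> list[int]:
    # Stand-in tokenizer: pretends each whitespace-separated word is one token.
    # A real tokenizer is model-specific and may split words and bytes further.
    return [hash(w) for w in text.split()]

chunks = ["Hello", " streaming", " world"]  # as yielded during streaming
full_text = "".join(chunks)

# len(chunks) happens to be 3 here, but chunk counting is unreliable in general,
# so the text is tokenized a second time:
completion_tokens = len(tokenize(full_text))
print(completion_tokens)  # 3
```

This duplicates work the model already did during generation, which is exactly what a `'usage'` key on the final chunk would avoid.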
Additional context
The OpenAI API recently added similar support to its streaming API via the stream_options parameter: https://platform.openai.com/docs/api-reference/chat/create#chat-create-stream_options