Add AdEMAMix optimizer #1360
Merged
matthewdouglas merged 5 commits into main · Sep 20, 2024
Conversation
matthewdouglas commented · Sep 16, 2024
Comment on lines +60 to +62
# For parity with bnb implementation we combine both fast
# and slow EMA stats into one stacked tensor.
state["m1_m2"] = p.new_zeros((2, *p.size()))
This is done for ease of compatibility with the existing test suite; most other implementations keep two separate buffers here.
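For illustration, a minimal sketch of how this stacked layout behaves (the tensor names mirror the snippet above; the surrounding optimizer code is omitted):

```python
import torch

p = torch.zeros(4, 3)  # stand-in for a model parameter
state = {}

# One stacked buffer holds both EMA statistics:
# index 0 is the fast EMA (m1), index 1 is the slow EMA (m2).
state["m1_m2"] = p.new_zeros((2, *p.size()))

# Indexing yields views into the stacked buffer, not copies,
# so updates through m1/m2 modify state["m1_m2"] in place.
m1, m2 = state["m1_m2"][0], state["m1_m2"][1]
grad = torch.ones_like(p)
m1.mul_(0.9).add_(grad, alpha=0.1)         # fast EMA step
m2.mul_(0.9999).add_(grad, alpha=0.0001)   # slow EMA step
```

Keeping both EMAs in one tensor means a single `state1` buffer covers both statistics, which is what lets the existing test suite treat it like any other optimizer state.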
matthewdouglas commented · Sep 16, 2024
// AdEMAMix has an additional state buffer, which we packed
// into state1. We need thread-local storage here for these.
// TODO: Mark with [[maybe_unused]] after upgrade to min compiler.
float s3_vals[NUM_PER_THREAD];
There are a few extra memory allocations like this to support AdEMAMix. I haven't confirmed whether the compiler optimizes these out for instantiations with OPTIMIZER=ADAM, but even if it doesn't, the overhead is small.
TimDettmers (Collaborator) previously approved these changes · Sep 20, 2024
This looks all good to me.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
matthewdouglas added a commit to matthewdouglas/bitsandbytes that referenced this pull request · Oct 28, 2024
* Add AdEMAMix optimizer
* Add PagedAdEMAMix32bit, AdEMAMix32bit
* Add PagedAdEMAMix32bit, AdEMAMix32bit
* AdEMAMix: add support for alpha/beta3 scheduling
* Update paged AdEMAMix
Adds support for the AdEMAMix optimizer described here: https://arxiv.org/abs/2409.03137
Includes blockwise 8bit and 32bit versions, each supporting paged operation.
AdEMAMix is a modification of Adam that introduces an additional, slower-decaying EMA of the gradients. The paper observes that AdEMAMix forgets training data at a slower pace and can reach loss comparable to AdamW with significantly less training data.
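For intuition, here is a minimal single-tensor sketch of the update rule as described in the paper (hyperparameter names follow the paper's notation; this is an illustrative sketch, not the bitsandbytes kernel):

```python
import torch

def ademamix_step(p, grad, state, lr=1e-3, betas=(0.9, 0.999, 0.9999),
                  alpha=5.0, eps=1e-8, weight_decay=0.0):
    """One AdEMAMix update on a single tensor (illustrative sketch)."""
    beta1, beta2, beta3 = betas
    state["step"] += 1
    t = state["step"]

    m1, m2, nu = state["m1"], state["m2"], state["nu"]
    m1.mul_(beta1).add_(grad, alpha=1 - beta1)  # fast EMA, as in Adam
    m2.mul_(beta3).add_(grad, alpha=1 - beta3)  # slow EMA, the AdEMAMix addition
    nu.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Bias correction applies to the Adam-style statistics only;
    # the slow EMA m2 enters the numerator uncorrected, scaled by alpha.
    m1_hat = m1 / (1 - beta1 ** t)
    nu_hat = nu / (1 - beta2 ** t)

    update = (m1_hat + alpha * m2) / (nu_hat.sqrt() + eps)
    if weight_decay:
        update = update + weight_decay * p
    p.add_(update, alpha=-lr)

# Usage on a toy tensor:
p = torch.zeros(3)
state = {"step": 0, "m1": torch.zeros_like(p),
         "m2": torch.zeros_like(p), "nu": torch.zeros_like(p)}
ademamix_step(p, torch.ones(3), state)
```

With alpha = 0 and beta3 ignored this reduces to plain Adam, which is why the implementation can share so much state layout with the existing Adam kernels.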
TODO: Implement scheduler for alpha/beta3
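A sketch of what such a scheduler could look like, warming alpha linearly and interpolating beta3 so its EMA half-life grows roughly linearly, as suggested in the paper (function names and signatures here are hypothetical, not the bitsandbytes API):

```python
import math

def alpha_scheduler(step, alpha, t_warmup):
    """Linearly warm alpha from 0 to its final value over t_warmup steps."""
    return alpha * min(1.0, step / t_warmup)

def beta3_scheduler(step, beta_start, beta_end, t_warmup):
    """Interpolate beta3 from beta_start to beta_end.

    Linear interpolation in 1/log(beta) space, so the effective
    half-life of the slow EMA grows roughly linearly with step.
    Returns beta_start at step 0 and beta_end once step >= t_warmup.
    """
    if step >= t_warmup:
        return beta_end
    frac = step / t_warmup
    log_beta = (math.log(beta_start) * math.log(beta_end)) / (
        (1 - frac) * math.log(beta_end) + frac * math.log(beta_start)
    )
    return math.exp(log_beta)
```

Warming both quantities avoids the instability the paper reports when the slow EMA is turned on at full strength early in training.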