-
Notifications
You must be signed in to change notification settings - Fork 325
feat: add adr-001 for node key refactoring #608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 6 commits
4f9a004
8646a9a
6e5b081
4ee7804
9468ee2
0c2d610
1459a18
d8d0bf9
88885f1
5272de4
006d76c
7cc7280
4e6044f
0bf7486
a0dcc0e
ee11dff
d9f7d2e
10184ac
5f94844
85a90e7
8fb87b6
6bbf7f9
204d881
f743311
bf85d92
b064ede
6095bc4
7873267
fb0f182
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,96 @@ | ||
| # ADR ADR-001: Node Key Refactoring | ||
|
|
||
| ## Changelog | ||
|
|
||
| - 2022-10-31: First draft | ||
|
|
||
| ## Status | ||
|
|
||
| Proposed | ||
|
|
||
| ## Context | ||
|
|
||
| The original node key of IAVL is a hash of the node and it does not take advantage of data locality on disk. The nodes are stored in a random location of the disk due to the random hash value, so it needs to do a random search of the disk to find the node. | ||
|
|
||
| The `orphans` are used to manage the removed nodes in the current version and allow to deletion of the removed nodes for the specific version from the disk through the `DeleteVersion`. It needs to track every time when updating the tree and also requires extra storage to store `orphans`, but there are not many use cases of `DeleteVersion`. There are two use cases, the first one is the rollback of the tree and the second one is to remove the unnecessary old nodes. | ||
|
cool-develope marked this conversation as resolved.
Outdated
|
||
|
|
||
| ## Decision | ||
|
|
||
| - Use the sequenced integer ID as a node key like `bigendian(nodeKey)` format. | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How is nodeKey computed, I remember it's sth like "version+path"?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. no, it is a just sequenced integer. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It seems better to use "version/seq", where seq is only unique inside the version.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. here, the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "version/seq" is globally unique as well, seq itself is locally unique within the version. For each version, the nodes are written in a batch anyway, we only need to maintain the seq in memory.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Why do we need to update the left and right node keys?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. here,
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Meaning that every time we have a new version we update all referenced node keys to refer to that version rather than the original version where they were created? That seems really complex. I feel like I'm maybe not understanding something really basic here
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @yihuang , I agree with you. If we keep the local nonce as There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I think the size difference should be minimal, assuming version is
|
||
| - Remove the `leftHash` and `rightHash` fields, and instead store `hash` field. | ||
| - Remove the `version` field from the node structure. | ||
| - Remove the `orphans` from the tree. | ||
|
|
||
| New node structure | ||
|
tac0turtle marked this conversation as resolved.
|
||
|
|
||
| ```go | ||
| type Node struct { | ||
| key []byte | ||
| value []byte | ||
| hash []byte // keep this field in the storage | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I guess one consequence is proof generation is slower, previously we can get child hash directly, now we need to load the child node?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We are using the ics23 proof, which requires size, version, and height so I believe it doesn't lead to any delays. |
||
| leftHash []byte // will remove | ||
| rightHash []byte // will remove | ||
|
cool-develope marked this conversation as resolved.
Outdated
|
||
| nodeKey int64 // new field, use as a node key | ||
| leftNodeKey int64 // new field, need to store in the storage | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we are using nonce as I thought that the whole idea with the path encoded as a key was that writes would be sequential, contributing to lower frequency of compactions. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The nodes are all immutable, the
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I know the tree path describes the data locality well. The problem is how to keep the path when rotating the tree. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Aren't we need to split the
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. right, I updated the node struct |
||
| rightNodeKey int64 // new field, need to store in the storage | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do we need these fields, if we already have
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
| version int64 // will remove | ||
| size int64 | ||
| leftNode *Node | ||
| rightNode *Node | ||
| subtreeHeight int8 | ||
| persisted bool | ||
| } | ||
| ``` | ||
|
|
||
| New tree structure | ||
|
|
||
| ```go | ||
| type MutableTree struct { | ||
| *ImmutableTree | ||
| lastSaved *ImmutableTree | ||
| nonce int64 // new field to track the current ID | ||
| orphans map[int64]int64 // will remove | ||
| versions map[int64]bool | ||
| allRootLoaded bool | ||
| unsavedFastNodeAdditions map[string]*fastnode.Node | ||
| unsavedFastNodeRemovals map[string]interface{} | ||
| ndb *nodeDB | ||
| skipFastStorageUpgrade bool | ||
|
|
||
| mtx sync.Mutex | ||
| } | ||
| ``` | ||
|
|
||
| ## Consequences | ||
|
cool-develope marked this conversation as resolved.
cool-develope marked this conversation as resolved.
Outdated
|
||
|
|
||
| ### Migration | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. An alternative way to migrate is using the state streamer, resync the chain, and recreate the new iavl db with the state changes, then switch.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a breaking change, so validators need to update at once (at least 67% of them). So we need a migration process anyway. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it's not consensus breaking, since the root hashes don't change, validators don't need to update at the same time?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Probably depends on the migration mechanism. I think the one described below won't guarantee the same IAVL tree.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So, today IAVL key is a hash, so this update will change the keys stored in the IAVL, hence it will change an order in the tree, and the tree Merkle Hash. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No, it can migrate IAVL nodes one to one, change references between them, but the hashes of each node is not changed. |
||
|
|
||
| We can migrate nodes one by one by iterating the version. | ||
|
|
||
| - Iterate the version in order, and get the root node for the specific version. | ||
|
cool-develope marked this conversation as resolved.
Outdated
|
||
| - Iterate the tree and assign the `nodeKey` to nodes which the node version equals. | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Migrate a archive node and pruned node will end up with different node keys, but it's probably ok. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Inconsistent node nonces could make the state-sync snapshot inconsistent as well?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the global nonce they would be different but version | local nonce should be same, right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, in migration, it depends on the traversal order, this is easy to specify, like preorder+ascending. but when handling the insertion, we should also specify the precise logic of nonce assignments, to allow alternative implementations be compatible with each other.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the inconsistency of nodeKey is not a big problem, the only thing we need is to match the proof.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can test this on a live network with state sync and a simple migration to see if it works to be sure There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
it depends on how we export the state sync snapshot I think, if we re-map the nodekey, then the stored one don't matter. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I think the nonce assignment should happens in the
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you are right |
||
|
|
||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
| We will implement the `Import` functionality for the original version. | ||
|
|
||
| ### Positive | ||
|
|
||
| Using the sequenced integer ID, we take advantage of data locality in the bTree and it leads to performance improvements. Also, it can reduce the node size in the storage. | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nowadays, all the "mainstream" db backends are lsm tree, but it should help there too.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm guessng by bTree we refer to IAVL tree? However, I agree that "bTree" seems confusing in this context because the alternative to LSM is the first thing that comes to mind
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. One of the major benefits might also be reduced compactions since we always commit semi-sorted data (by nonce) to the underlying LSM. Since we commit versioned data, there wouldn't be a need for as many merge operations as today.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK, I will update this part and add more description of LSM |
||
|
|
||
| Removing orphans also provides performance improvements including memory and storage saving. Also, it makes it easy to rollback the tree. Because we will keep the sequenced segment IDs for the specific version, and we can remove all nodes for which the `nodeKey` is greater than the specified integer value. | ||
|
|
||
| ### Negative | ||
|
|
||
| It requires extra storage to store the node because it should keep `leftNodeKey` and `rightNodeKey` to iterate the tree. Instead, we can delete the `version`, `leftHash`, and `rightHash` fields in the node and reduce the key size. | ||
|
|
||
| It can't delete the old nodes for the specific version due to removing orphans. We introduce a new way to prune old versions. | ||
|
|
||
| For example, when a user wants to prune the previous 500 versions every 1000 blocks | ||
|
cool-develope marked this conversation as resolved.
Outdated
|
||
| - We assume the pruning is completed for `n`th version and the last nonce of `n`th version is `x`. | ||
| - We iterate the tree from the `n+501`th root node and pick only nodes which the nodeKey is in `[(n+1)th version first nonce, (n+500)th version the last nonce]`. | ||
| - For those nodes, we re-assign the nodeKey from `x+1` in order. | ||
|
|
||
| ## References | ||
|
|
||
| - https://github.com/cosmos/iavl/issues/548 | ||
| - https://github.com/cosmos/iavl/issues/137 | ||
| - https://github.com/cosmos/iavl/issues/571 | ||
Uh oh!
There was an error while loading. Please reload this page.