Abstract
We bring the select_chain stage back within the complexity budget for a single stage by factoring out parts of its behaviour that are not intrinsically required for selecting chains.
Why?
Currently, select_chain does four things:
- it tracks upstream peer progress and asserts correct peer behaviour
- it selects a best chain candidate whenever new information becomes available
- it uses the ledger to validate blocks (EDIT: this is incorrect; validation happens before chain selection, which is arguably worse)
- it extends the chain for downstream peers upon successful validation
This is the most complex stage in the current setup and it needs to be simplified before we can make substantial improvements. The complexity needs to be factored into multiple orthogonal parts that can then be improved independently.
How?
The upstream peer tracking will be moved into the beginning of the consensus pipeline, in effect replacing the pull stage and enlarging its scope to also validate headers (because that is necessary to properly track the upstream peer’s state). This means that any header that is stored is already valid, avoiding duplication of work. Incorrect peer behaviour is recognised closer to the network stack and will trigger disconnections.
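A minimal sketch of what such an upstream-tracking stage could look like, assuming a heavily simplified header. The names `Header`, `Outcome`, `UpstreamTracker` and `roll_forward` are hypothetical and not taken from the codebase, and the two structural checks stand in for real header validation:

```rust
use std::collections::HashMap;

/// Hypothetical, heavily simplified header; real headers carry far more data.
#[derive(Clone, Debug, PartialEq)]
pub struct Header {
    pub hash: u64,
    pub parent: u64,
    pub slot: u64,
}

/// What the stage reports for each header received from an upstream peer.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// The header passed the checks and extends the peer's announced chain.
    Accepted,
    /// The peer misbehaved; the caller is expected to disconnect it.
    Disconnect(&'static str),
}

/// Remembers the latest header each upstream peer has announced.
#[derive(Default)]
pub struct UpstreamTracker {
    tips: HashMap<String, Header>,
}

impl UpstreamTracker {
    /// Checks a newly announced header against the peer's tracked tip.
    /// Assumption: parent linkage and increasing slots stand in for full
    /// header validation, which this sketch does not attempt.
    pub fn roll_forward(&mut self, peer: &str, header: Header) -> Outcome {
        let violation = match self.tips.get(peer) {
            Some(tip) if header.parent != tip.hash => {
                Some("header does not extend announced tip")
            }
            Some(tip) if header.slot <= tip.slot => Some("slot does not increase"),
            _ => None,
        };
        if let Some(reason) = violation {
            // Forget the peer so that no state from it leaks downstream.
            self.tips.remove(peer);
            return Outcome::Disconnect(reason);
        }
        self.tips.insert(peer.to_string(), header);
        Outcome::Accepted
    }
}
```

Because only `Accepted` headers would be stored, later stages could rely on stored headers being valid, and a `Disconnect` outcome is produced right next to the network stack.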
Block validation will be moved into a new part of the pipeline, which will also fetch blocks with a strategy that allows batching to reduce the impact of network link latency. Our current setup requires one RTT plus bandwidth delay for each block, and it does this in 1:1 correspondence with header processing, thereby slowing the whole pipeline down. The main difficulty will be to correctly feed block validation errors back into select_chain to trigger the switch to a different fork (the approach itself is clear; this will merely be the most subtle part of the logic).
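One possible shape for the feedback path is sketched below. All names (`Fork`, `SelectChain`, `block_invalid`) are hypothetical, and "longest candidate wins" is only a stand-in for the real selection rule. The sketch assumes that the blocks before an invalid block remain usable, so candidates are truncated rather than dropped:

```rust
/// Hypothetical candidate fork: block hashes ordered from oldest to newest.
#[derive(Clone, Debug, PartialEq)]
pub struct Fork {
    pub peer: String,
    pub blocks: Vec<u64>,
}

/// Holds the candidate forks that chain selection currently knows about.
#[derive(Default)]
pub struct SelectChain {
    candidates: Vec<Fork>,
}

impl SelectChain {
    pub fn add_candidate(&mut self, fork: Fork) {
        self.candidates.push(fork);
    }

    /// Stand-in selection rule: the longest candidate wins. Iterating in
    /// reverse makes `max_by_key` return the earliest-added fork on ties.
    pub fn best(&self) -> Option<&Fork> {
        self.candidates.iter().rev().max_by_key(|f| f.blocks.len())
    }

    /// Feedback from block validation: cut every candidate just before the
    /// invalid block, drop candidates that become empty, and re-select.
    pub fn block_invalid(&mut self, hash: u64) -> Option<&Fork> {
        for fork in &mut self.candidates {
            if let Some(pos) = fork.blocks.iter().position(|h| *h == hash) {
                fork.blocks.truncate(pos);
            }
        }
        self.candidates.retain(|f| !f.blocks.is_empty());
        self.best()
    }
}
```

With candidates `[1, 2, 3, 4]` and `[1, 2, 5]`, reporting block `3` as invalid shortens the first candidate to `[1, 2]`, so selection switches to the second fork.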
Testing Strategy / Acceptance Criteria
Currently, e2e testing on CI has been reduced to run only up to epoch 176. With these changes it should easily run until epoch 182 again; it is reasonable to expect that it can run even further within 15 min.
Tests will be added for the disconnection of misbehaving upstream peers.
Discussion points
This will likely be implemented via multiple PRs, starting with factoring out upstream peer tracking.
Dependencies & Related Tasks
No response
Checklist