spokes-receive-pack: narrow the refs used for the connectivity check#124
Open
newren wants to merge 2 commits into
Open
spokes-receive-pack: narrow the refs used for the connectivity check#124newren wants to merge 2 commits into
newren wants to merge 2 commits into
Conversation
Author
|
|
There was a problem hiding this comment.
Pull request overview
This PR optimizes the spokes-receive-pack connectivity check by narrowing the set of refs used as “stopper” revisions for git rev-list, reducing the cost of the reachability walk in large repositories while keeping the safety properties of the existing check.
Changes:
- Refactors the connectivity check invocation to build
rev-listarguments programmatically. - Adds
connectivityStopperArgs()to switch (behind a sockstat flag) from--exclude-hidden=receive --all --alternate-refsto--toplevel-branches HEAD. - Adds
headResolves()to avoid passingHEADwhen it doesn’t resolve.
Show a summary per file
| File | Description |
|---|---|
| internal/spokes/spokes.go | Introduces a flag-controlled “narrow stopper” argument set for git rev-list and supporting helper logic. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 1/1 changed files
- Comments generated: 1
The 'go test' workflow clones git/git, checks out 'next', and builds it. Recent changes in upstream git flipped Rust integration from opt-in (WITH_RUST) to opt-out (NO_RUST), so 'make all' now invokes 'cargo build' to produce target/release/libgitcore.a. Our ubuntu runner has no cargo, so the build fails with: CARGO target/release/libgitcore.a /bin/sh: 1: cargo: not found make: *** [Makefile:3020: target/release/libgitcore.a] Error 127 Add cargo to the apt-get install line. This matches what github/git itself does in ci/install-dependencies.sh, and is forward-compatible: Git 3.0 will make Rust mandatory. git's Cargo.toml declares rust-version = "1.49.0", so the cargo package shipped with Ubuntu noble is more than new enough.
495d524 to
b43c8ea
Compare
b43c8ea to
ce68a8c
Compare
When spokes-receive-pack verifies that a pushed pack is fully
connected, it invokes
git rev-list --objects --no-object-names --stdin --not \
--exclude-hidden=receive --all --alternate-refs
with the new tips piped in on stdin. The purpose of this check is to
ensure that ref updates sent with the receive-pack are well connected,
i.e. that they don't depend upon objects in the repository that
reference non-existent objects. The portion following "--not" is just a
performance optimization that relies on the assumption that existing
refs in the repository are well connected. (Thus, we assume that refs
are well connected, but not necessarily all objects in the repository
are. The lack of connectedness of other objects could be due to
garbage collection (which itself could be for either sensitive data
removal purposes or purging unreachable objects), replica repair, or
direct writes of objects via Rugged.)
Note that even though the "--not" portion of the command is just an
optimization, it has been found to be suboptimal. In upstream git.git:
* bcec6780b2ec (receive-pack: only use visible refs for connectivity
check, 2022-11-17) found the sheer number of refs to be problematic
and added --exclude-hidden=receive to trim them.
* 68cb0b5253a0 (builtin/receive-pack: add option to skip connectivity
check, 2025-05-20) was added because the connectivity check was
still too expensive.
* a5f0290a7493 (receive-pack: use a smaller set of refs for the
connectivity check, 2026-06-02) was proposed to shrink the number
of refs further than bcec6780b2ec did.
Since we probably still need a connectivity check, the optimal choice
would probably be just the O(1) set of "primary integration branches"
that exist for the repository (think `main`, `next`, `seen`, `maint` in
git.git). Tags tend to sit on those branches or a commit or two beyond
them; pull requests are a few commits on top of those; human feature
branches are another few commits on top of that. Unfortunately,
discovering that list of branch names automatically for any given
repository is difficult. However, my timings suggest that using "HEAD"
is preferable to using "--all" and is likely near optimal in most cases.
Since there are a few repositories out there that leave HEAD pointing to
a non-existent branch or which never update HEAD and update other
branches instead, I propose a middle ground:
--exclude=*/* --branches HEAD
In practice, this only rules out slightly fewer objects than `--all`
while being a much smaller set of refs than all the ones that exist in
the repository.
Some stats across 86 pushes in a monorepo:
TIMES (seconds)
Rule min mean median max
---------------------------------------------------------------------------------------
1: --not --exclude-hidden=receive --all --alternate-refs 0.30 2.12 2.51 8.53
2: --not --exclude-hidden=receive --all 0.29 2.12 2.47 8.90
3: --not --branches 0.20 2.03 2.41 8.94
4: --not $(for-each-ref refs/heads/[^/]*) 0.05 0.87 0.77 2.34
5: --not --exclude=*/* --branches HEAD 0.05 0.86 0.80 2.08
6: --not HEAD 0.00 0.83 0.72 3.34
COUNTS (objects in pack not reachable from the excludes)
Rule min mean median max
--------------------------------------------------------------------------------------------
1: --not --exclude-hidden=receive --all --alternate-refs 0 335797 521505 773282
2: --not --exclude-hidden=receive --all 0 335797 521505 773282
3: --not --branches 0 335797 521505 773282
4: --not $(for-each-ref refs/heads/[^/]*) 0 53997 94 773282
5: --not --exclude=*/* --branches HEAD 0 53997 94 773282
6: --not HEAD 0 41700 94 773282
(For reference, this repo has about 6 times as many refs as branches,
and about 13 times as many total branches as toplevel branches.)
Introduce a new helper `connectivityStopperArgs` that, when the sockstat
flag `spokes_receive_pack_narrow_connectivity_stopper` is set, switches
the stopper from `--exclude-hidden=receive --all --alternate-refs` to
`--exclude=*/* --branches HEAD`. Note that while the timings show
`--alternate-refs` adds essentially nothing when paired with `--all`
(because the alternate's refs largely duplicate the fork's own), pairing
it with `--exclude=*/* --branches HEAD` would re-introduce the alternate's
full ref set as stoppers, so we drop it from the narrow path.
The flag is opt-in (default off) in case there is an unexpected
regression.
The fallback `performCheckConnectivityOnObject` -- only invoked in
the error path to identify which specific object the primary check
choked on -- intentionally keeps `--not --all`. Performance is not
a concern there and broader coverage is helpful for diagnostics.
An alternative we considered and rejected was dropping the rev-list walk
altogether and relying on a `git cat-file --batch-check` existence probe
of each pushed tip, on the grounds that `git index-pack --strict`
already verifies in-pack referential closure. That would be cheaper
still, but we are concerned about unreachable objects that are not well
connected already being in the repository due to garbage collection,
replica repair, or direct writes of objects via Rugged. Thus, our
narrower stopper preserves the walk -- and therefore the safety
guarantee -- while bounding the work by genuinely divergent history
rather than by total ref count.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ce68a8c to
9393c39
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When spokes-receive-pack verifies that a pushed pack is fully connected, it invokes
with the new tips piped in on stdin. The purpose of this check is to ensure that ref updates sent with the receive-pack are well connected, i.e. that they don't depend upon objects in the repository that reference non-existent objects. The portion following "--not" is just a performance optimization that relies on the assumption that existing refs in the repository are well connected. (Thus, we assume that refs are well connected, but not necessarily all objects in the repository are. The lack of connectedness of other objects could be due to garbage collection (which itself could be for either sensitive data removal purposes or purging unreachable objects), replica repair, or direct writes of objects via Rugged.)
Note that even though the "--not" portion of the command is just an optimization, it has been found to be suboptimal. In upstream git.git:
bcec6780b2ec (receive-pack: only use visible refs for connectivity check, 2022-11-17) found the sheer number of refs to be problematic and added --exclude-hidden=receive to trim them.
68cb0b5253a0 (builtin/receive-pack: add option to skip connectivity check, 2025-05-20) was added because the connectivity check was still too expensive.
a5f0290a7493 (receive-pack: use a smaller set of refs for the connectivity check, 2026-06-02) was proposed to shrink the number of refs further than bcec6780b2ec did.
Since we probably still need a connectivity check, the optimal choice would probably be just the O(1) set of "primary integration branches" that exist for the repository (think
main,next,seen,maintin git.git). Tags tend to sit on those branches or a commit or two beyond them; pull requests are a few commits on top of those; human feature branches are another few commits on top of that. Unfortunately, discovering that list of branch names automatically for any given repository is difficult. However, my timings suggest that using "HEAD" is preferable to using "--all" and is likely near optimal in most cases. Since there are a few repositories out there that leave HEAD pointing to a non-existent branch or which never update HEAD and update other branches instead, I propose a middle ground:--exclude=*/* --branches HEAD. In practice, this only rules out slightly fewer objects than--allwhile being a much smaller set of refs than all the ones that exist in the repository.Some stats across 86 pushes in a monorepo:
(For reference, this repo has about 6 times as many refs as branches, and about 13 times as many total branches as toplevel branches.)
Introduce a new helper
connectivityStopperArgsthat, when the sockstat flagspokes_receive_pack_narrow_connectivity_stopperis set, switches the stopper from--exclude-hidden=receive --all --alternate-refsto--exclude=*/* --branches HEAD. Note that while the timings show--alternate-refsadds essentially nothing when paired with--all(because the alternate's refs largely duplicate the fork's own), pairing it with--exclude=*/* --branches HEADwould re-introduce the alternate's full ref set as stoppers, so we drop it from the narrow path.The flag is opt-in (default off) in case there is an unexpected regression.
The fallback
performCheckConnectivityOnObject-- only invoked in the error path to identify which specific object the primary check choked on -- intentionally keeps--not --all. Performance is not a concern there and broader coverage is helpful for diagnostics.An alternative we considered and rejected was dropping the rev-list walk altogether and relying on a
git cat-file --batch-checkexistence probeof each pushed tip, on the grounds that
git index-pack --strictalready verifies in-pack referential closure. That would be cheaper still, but we are concerned about unreachable objects that are not well connected already being in the repository due to garbage collection, replica repair, or direct writes of objects via Rugged. Thus, our narrower stopper preserves the walk -- and therefore the safety guarantee -- while bounding the work by genuinely divergent history rather than by total ref count.