Skip to content

spokes-receive-pack: narrow the refs used for the connectivity check#124

Open
newren wants to merge 2 commits into
mainfrom
newren/cheaper-rev-list-connectivity-check
Open

spokes-receive-pack: narrow the refs used for the connectivity check#124
newren wants to merge 2 commits into
mainfrom
newren/cheaper-rev-list-connectivity-check

Conversation

@newren

@newren newren commented Jun 3, 2026

Copy link
Copy Markdown

When spokes-receive-pack verifies that a pushed pack is fully connected, it invokes

git rev-list --objects --no-object-names --stdin --not \
    --exclude-hidden=receive --all --alternate-refs

with the new tips piped in on stdin. The purpose of this check is to ensure that ref updates sent with the receive-pack are well connected, i.e. that they don't depend upon objects in the repository that reference non-existent objects. The portion following "--not" is just a performance optimization that relies on the assumption that existing refs in the repository are well connected. (Thus, we assume that refs are well connected, but not necessarily all objects in the repository are. The lack of connectedness of other objects could be due to garbage collection (which itself could be for either sensitive data removal purposes or purging unreachable objects), replica repair, or direct writes of objects via Rugged.)

Note that even though the "--not" portion of the command is just an optimization, it has been found to be suboptimal. In upstream git.git:

  • bcec6780b2ec (receive-pack: only use visible refs for connectivity check, 2022-11-17) found the sheer number of refs to be problematic and added --exclude-hidden=receive to trim them.

  • 68cb0b5253a0 (builtin/receive-pack: add option to skip connectivity check, 2025-05-20) was added because the connectivity check was still too expensive.

  • a5f0290a7493 (receive-pack: use a smaller set of refs for the connectivity check, 2026-06-02) was proposed to shrink the number of refs further than bcec6780b2ec did.

Since we probably still need a connectivity check, the optimal choice would probably be just the O(1) set of "primary integration branches" that exist for the repository (think main, next, seen, maint in git.git). Tags tend to sit on those branches or a commit or two beyond them; pull requests are a few commits on top of those; human feature branches are another few commits on top of that. Unfortunately, discovering that list of branch names automatically for any given repository is difficult. However, my timings suggest that using "HEAD" is preferable to using "--all" and is likely near optimal in most cases. Since there are a few repositories out there that leave HEAD pointing to a non-existent branch or which never update HEAD and update other branches instead, I propose a middle ground: --exclude=*/* --branches HEAD. In practice, this only rules out slightly fewer objects than --all while being a much smaller set of refs than all the ones that exist in the repository.

Some stats across 86 pushes in a monorepo:

  TIMES (seconds)

  Rule                                                        min    mean  median     max
  ---------------------------------------------------------------------------------------
  1: --not --exclude-hidden=receive --all --alternate-refs   0.30    2.12    2.51    8.53
  2: --not --exclude-hidden=receive --all                    0.29    2.12    2.47    8.90
  3: --not --branches                                        0.20    2.03    2.41    8.94
  4: --not $(for-each-ref refs/heads/[^/]*)                  0.05    0.87    0.77    2.34
  5: --not --exclude=*/* --branches HEAD                     0.05    0.86    0.80    2.08
  6: --not HEAD                                              0.00    0.83    0.72    3.34

  COUNTS (objects in pack not reachable from the excludes)

  Rule                                                        min       mean   median      max
  --------------------------------------------------------------------------------------------
  1: --not --exclude-hidden=receive --all --alternate-refs      0     335797   521505   773282
  2: --not --exclude-hidden=receive --all                       0     335797   521505   773282
  3: --not --branches                                           0     335797   521505   773282
  4: --not $(for-each-ref refs/heads/[^/]*)                     0      53997       94   773282
  5: --not --exclude=*/* --branches HEAD                        0      53997       94   773282
  6: --not HEAD                                                 0      41700       94   773282

(For reference, this repo has about 6 times as many refs as branches, and about 13 times as many total branches as toplevel branches.)

Introduce a new helper connectivityStopperArgs that, when the sockstat flag spokes_receive_pack_narrow_connectivity_stopper is set, switches the stopper from --exclude-hidden=receive --all --alternate-refs to --exclude=*/* --branches HEAD. Note that while the timings show --alternate-refs adds essentially nothing when paired with --all (because the alternate's refs largely duplicate the fork's own), pairing it with --exclude=*/* --branches HEAD would re-introduce the alternate's full ref set as stoppers, so we drop it from the narrow path.

The flag is opt-in (default off) in case there is an unexpected regression.

The fallback performCheckConnectivityOnObject -- only invoked in the error path to identify which specific object the primary check choked on -- intentionally keeps --not --all. Performance is not a concern there and broader coverage is helpful for diagnostics.

An alternative we considered and rejected was dropping the rev-list walk altogether and relying on a git cat-file --batch-check existence probe
of each pushed tip, on the grounds that git index-pack --strict already verifies in-pack referential closure. That would be cheaper still, but we are concerned about unreachable objects that are not well connected already being in the repository due to garbage collection, replica repair, or direct writes of objects via Rugged. Thus, our narrower stopper preserves the walk -- and therefore the safety guarantee -- while bounding the work by genuinely divergent history rather than by total ref count.

Copilot AI review requested due to automatic review settings June 3, 2026 21:09
@newren

newren commented Jun 3, 2026

Copy link
Copy Markdown
Author

Note: this depends upon https://github.com/github/git/pull/1814 (not true anymore).

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes the spokes-receive-pack connectivity check by narrowing the set of refs used as “stopper” revisions for git rev-list, reducing the cost of the reachability walk in large repositories while keeping the safety properties of the existing check.

Changes:

  • Refactors the connectivity check invocation to build rev-list arguments programmatically.
  • Adds connectivityStopperArgs() to switch (behind a sockstat flag) from --exclude-hidden=receive --all --alternate-refs to --toplevel-branches HEAD.
  • Adds headResolves() to avoid passing HEAD when it doesn’t resolve.
Show a summary per file
File Description
internal/spokes/spokes.go Introduces a flag-controlled “narrow stopper” argument set for git rev-list and supporting helper logic.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 1/1 changed files
  • Comments generated: 1

Comment thread internal/spokes/spokes.go
The 'go test' workflow clones git/git, checks out 'next', and builds
it.  Recent changes in upstream git flipped Rust integration from
opt-in (WITH_RUST) to opt-out (NO_RUST), so 'make all' now invokes
'cargo build' to produce target/release/libgitcore.a.  Our ubuntu
runner has no cargo, so the build fails with:

  CARGO target/release/libgitcore.a
  /bin/sh: 1: cargo: not found
  make: *** [Makefile:3020: target/release/libgitcore.a] Error 127

Add cargo to the apt-get install line.  This matches what github/git
itself does in ci/install-dependencies.sh, and is forward-compatible:
Git 3.0 will make Rust mandatory.  git's Cargo.toml declares
rust-version = "1.49.0", so the cargo package shipped with Ubuntu
noble is more than new enough.
@newren newren force-pushed the newren/cheaper-rev-list-connectivity-check branch from 495d524 to b43c8ea Compare June 5, 2026 01:34
@newren newren requested a review from Copilot June 5, 2026 01:38

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

  • Files reviewed: 2/2 changed files
  • Comments generated: 1

Comment thread internal/spokes/spokes.go
@newren newren force-pushed the newren/cheaper-rev-list-connectivity-check branch from b43c8ea to ce68a8c Compare June 5, 2026 04:11
When spokes-receive-pack verifies that a pushed pack is fully
connected, it invokes

    git rev-list --objects --no-object-names --stdin --not \
        --exclude-hidden=receive --all --alternate-refs

with the new tips piped in on stdin.  The purpose of this check is to
ensure that ref updates sent with the receive-pack are well connected,
i.e. that they don't depend upon objects in the repository that
reference non-existent objects.  The portion following "--not" is just a
performance optimization that relies on the assumption that existing
refs in the repository are well connected.  (Thus, we assume that refs
are well connected, but not necessarily all objects in the repository
are.  The lack of connectedness of other objects could be due to
garbage collection (which itself could be for either sensitive data
removal purposes or purging unreachable objects), replica repair, or
direct writes of objects via Rugged.)

Note that even though the "--not" portion of the command is just an
optimization, it has been found to be suboptimal.  In upstream git.git:

  * bcec6780b2ec (receive-pack: only use visible refs for connectivity
    check, 2022-11-17) found the sheer number of refs to be problematic
    and added --exclude-hidden=receive to trim them.

  * 68cb0b5253a0 (builtin/receive-pack: add option to skip connectivity
    check, 2025-05-20) was added because the connectivity check was
    still too expensive.

  * a5f0290a7493 (receive-pack: use a smaller set of refs for the
    connectivity check, 2026-06-02) was proposed to shrink the number
    of refs further than bcec6780b2ec did.

Since we probably still need a connectivity check, the optimal choice
would probably be just the O(1) set of "primary integration branches"
that exist for the repository (think `main`, `next`, `seen`, `maint` in
git.git).  Tags tend to sit on those branches or a commit or two beyond
them; pull requests are a few commits on top of those; human feature
branches are another few commits on top of that.  Unfortunately,
discovering that list of branch names automatically for any given
repository is difficult.  However, my timings suggest that using "HEAD"
is preferable to using "--all" and is likely near optimal in most cases.
Since there are a few repositories out there that leave HEAD pointing to
a non-existent branch or which never update HEAD and update other
branches instead, I propose a middle ground:
    --exclude=*/* --branches HEAD
In practice, this only rules out slightly fewer objects than `--all`
while being a much smaller set of refs than all the ones that exist in
the repository.

Some stats across 86 pushes in a monorepo:

  TIMES (seconds)

  Rule                                                        min    mean  median     max
  ---------------------------------------------------------------------------------------
  1: --not --exclude-hidden=receive --all --alternate-refs   0.30    2.12    2.51    8.53
  2: --not --exclude-hidden=receive --all                    0.29    2.12    2.47    8.90
  3: --not --branches                                        0.20    2.03    2.41    8.94
  4: --not $(for-each-ref refs/heads/[^/]*)                  0.05    0.87    0.77    2.34
  5: --not --exclude=*/* --branches HEAD                     0.05    0.86    0.80    2.08
  6: --not HEAD                                              0.00    0.83    0.72    3.34

  COUNTS (objects in pack not reachable from the excludes)

  Rule                                                        min       mean   median      max
  --------------------------------------------------------------------------------------------
  1: --not --exclude-hidden=receive --all --alternate-refs      0     335797   521505   773282
  2: --not --exclude-hidden=receive --all                       0     335797   521505   773282
  3: --not --branches                                           0     335797   521505   773282
  4: --not $(for-each-ref refs/heads/[^/]*)                     0      53997       94   773282
  5: --not --exclude=*/* --branches HEAD                        0      53997       94   773282
  6: --not HEAD                                                 0      41700       94   773282

(For reference, this repo has about 6 times as many refs as branches,
and about 13 times as many total branches as toplevel branches.)

Introduce a new helper `connectivityStopperArgs` that, when the sockstat
flag `spokes_receive_pack_narrow_connectivity_stopper` is set, switches
the stopper from `--exclude-hidden=receive --all --alternate-refs` to
`--exclude=*/* --branches HEAD`.  Note that while the timings show
`--alternate-refs` adds essentially nothing when paired with `--all`
(because the alternate's refs largely duplicate the fork's own), pairing
it with `--exclude=*/* --branches HEAD` would re-introduce the alternate's
full ref set as stoppers, so we drop it from the narrow path.

The flag is opt-in (default off) in case there is an unexpected
regression.

The fallback `performCheckConnectivityOnObject` -- only invoked in
the error path to identify which specific object the primary check
choked on -- intentionally keeps `--not --all`.  Performance is not
a concern there and broader coverage is helpful for diagnostics.

An alternative we considered and rejected was dropping the rev-list walk
altogether and relying on a `git cat-file --batch-check` existence probe
of each pushed tip, on the grounds that `git index-pack --strict`
already verifies in-pack referential closure.  That would be cheaper
still, but we are concerned about unreachable objects that are not well
connected already being in the repository due to garbage collection,
replica repair, or direct writes of objects via Rugged.  Thus, our
narrower stopper preserves the walk -- and therefore the safety
guarantee -- while bounding the work by genuinely divergent history
rather than by total ref count.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@newren newren force-pushed the newren/cheaper-rev-list-connectivity-check branch from ce68a8c to 9393c39 Compare June 10, 2026 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants