Anycast support for the Pelican cache#3510

Draft
bbockelm wants to merge 3 commits into PelicanPlatform:main from bbockelm:bgp-advertise

bbockelm commented Jun 7, 2026

Collaborator

This branch adds support for TCP anycast for the cache.

If enabled, the cache will contact its local router via BGP and advertise its availability to serve an anycast route.

If a client contacts the cache directly (whether or not anycast is enabled), the cache responds with information about the token required to access the namespace. This lets the client work with the cache directly, without contacting the director.

Anycast is a big change, so this is submitted as a draft to get early CI feedback.

bbockelm added the enhancement and cache labels Jun 15, 2026
bbockelm linked an issue Jun 15, 2026 that may be closed by this pull request
bbockelm and others added 3 commits June 14, 2026 19:52
Allow a V2 (persistent) cache to participate in TCP anycast: it peers
with a BGP router via an embedded pure-Go GoBGP speaker and advertises
the configured anycast net blocks, but only while the cache is healthy
and is serving a host certificate with the expected anycast hostname as
a SAN (verified by a TLS probe to the cache's own external URL, not the
anycast name, to avoid probing a different cache).
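
The advertise-only-when-safe rule above can be sketched as a small gate. This is an illustration only, not the code in this branch: `shouldAdvertise` and `sanMatches` are hypothetical names, and the SAN list is assumed to come from the certificate returned by the TLS probe. The hostname matching itself is delegated to the standard library's `x509.Certificate.VerifyHostname`.

```go
package main

import (
	"crypto/x509"
	"fmt"
)

// sanMatches reports whether host is covered by one of the DNS SANs,
// using the standard library's hostname matching rules (wildcards included).
func sanMatches(dnsSANs []string, host string) bool {
	cert := &x509.Certificate{DNSNames: dnsSANs}
	return cert.VerifyHostname(host) == nil
}

// shouldAdvertise is a hypothetical gate: routes are announced only while the
// cache is healthy AND its served certificate names the anycast hostname.
func shouldAdvertise(healthy bool, dnsSANs []string, anycastHost string) bool {
	return healthy && sanMatches(dnsSANs, anycastHost)
}

func main() {
	// Example hostnames are placeholders, not real federation names.
	sans := []string{"cache1.example.org", "anycast.example.org"}
	fmt.Println(shouldAdvertise(true, sans, "anycast.example.org"))
	fmt.Println(shouldAdvertise(true, sans[:1], "anycast.example.org"))
	fmt.Println(shouldAdvertise(false, sans, "anycast.example.org"))
}
```

In a real probe the SAN list would be read from the peer certificate of a TLS connection to the cache's own external URL; that part is omitted here.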

Service selection is director-preferred by default: clients use the
director's geo/load/health-aware choice and only fall back to the
anycast endpoint when the director is unreachable. Client.PreferAnycast
opts a well-covered site into contacting the anycast endpoint first
(downloads) and routing write-through uploads to it.
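
As a rough sketch of the selection order described above (not the client code in this branch), the function below returns the endpoints a client might try in order. `endpointOrder` and its parameters are hypothetical, and falling back to the director after a failed anycast attempt is an assumption.

```go
package main

import "fmt"

// endpointOrder returns the endpoints to try, first to last.
// By default the director comes first and the anycast endpoint is only a
// fallback; preferAnycast (modelled on Client.PreferAnycast) flips the order.
func endpointOrder(directorURL, anycastURL string, preferAnycast bool) []string {
	if anycastURL == "" {
		// No anycast endpoint published: the director is the only choice.
		return []string{directorURL}
	}
	if preferAnycast {
		return []string{anycastURL, directorURL}
	}
	return []string{directorURL, anycastURL}
}

func main() {
	// Placeholder URLs for illustration.
	fmt.Println(endpointOrder("https://director.example.org", "https://anycast.example.org", false))
	fmt.Println(endpointOrder("https://director.example.org", "https://anycast.example.org", true))
}
```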

Because the anycast endpoint is itself a cache, a 403 from it now
returns the same X-Pelican-* token-hint headers the director would, so
the client can acquire the right token and retry. The header builders
and namespace longest-prefix match are lifted into server_structs and
shared by the director and cache; the cache advertises itself as the
collections-url so listings flow through it. These header improvements
are not gated on anycast and also help a forced-cache transfer during a
director outage.
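
For readers unfamiliar with namespace longest-prefix matching, here is a minimal sketch of the idea. It is not the helper lifted into server_structs; `longestPrefix` is a hypothetical name, and matching on whole path components is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// longestPrefix returns the longest namespace prefix that contains objectPath.
// Matching is done on whole path components, so "/foo" does not match
// "/foobar/x". The boolean is false when no prefix matches.
func longestPrefix(prefixes []string, objectPath string) (string, bool) {
	best, bestLen, found := "", -1, false
	for _, p := range prefixes {
		trimmed := strings.TrimSuffix(p, "/")
		if objectPath != trimmed && !strings.HasPrefix(objectPath, trimmed+"/") {
			continue
		}
		if len(trimmed) > bestLen {
			best, bestLen, found = p, len(trimmed), true
		}
	}
	return best, found
}

func main() {
	namespaces := []string{"/foo", "/foo/bar", "/foobar"}
	fmt.Println(longestPrefix(namespaces, "/foo/bar/data.txt"))
	fmt.Println(longestPrefix(namespaces, "/other/data.txt"))
}
```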

The cache continues to advertise its unique URL to the director; the
shared anycast address is published federation-wide in the discovery
doc via Director.AnycastUrl and announced via BGP.

Includes unit tests plus a two-instance GoBGP integration test (with
bind-with-retry to avoid a free-port race) and a client write-through
test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

TCP anycast only works if the kernel accepts packets destined to the
anycast IP, which requires the address to be present on a local network
device.  Add an option for the cache to add/remove the anycast service
address(es) (IPv4 and/or IPv6) on the relevant interface via netlink
(pure Go, github.com/vishvananda/netlink; Linux only).

New Cache.Anycast parameters:
  - Addresses: the anycast service IP(s) to bind locally (bare IP implies
    a /32 or /128 host route), distinct from Routes (the BGP-advertised
    net blocks).
  - Device: the interface to manage; auto-detected by asking the kernel
    which device routes to the director when left empty.
  - AddressManagement: on/off/auto (default auto). "auto" adds an address
    only if absent at startup and removes (on shutdown) only what Pelican
    added; "on" always adds and always removes; "off" never touches them.
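
The parameter semantics above can be sketched as pure logic. The names below (`parseMode`, `shouldAdd`, `shouldRemove`, `normalizeAddress`) are hypothetical and are not taken from this branch; treating an empty value as "auto" is an assumption based on the stated default.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

type mgmtMode int

const (
	modeOff mgmtMode = iota
	modeOn
	modeAuto
)

// parseMode maps the AddressManagement string to a mode; empty means auto.
func parseMode(s string) (mgmtMode, error) {
	switch strings.ToLower(strings.TrimSpace(s)) {
	case "", "auto":
		return modeAuto, nil
	case "on":
		return modeOn, nil
	case "off":
		return modeOff, nil
	}
	return modeOff, fmt.Errorf("unknown AddressManagement value %q", s)
}

// shouldAdd: "on" always adds, "auto" adds only if absent at startup.
func shouldAdd(m mgmtMode, presentAtStartup bool) bool {
	return m == modeOn || (m == modeAuto && !presentAtStartup)
}

// shouldRemove: "on" always removes, "auto" removes only what Pelican added.
func shouldRemove(m mgmtMode, addedByPelican bool) bool {
	return m == modeOn || (m == modeAuto && addedByPelican)
}

// normalizeAddress turns a bare IP into a host prefix (/32 or /128) and
// leaves an explicit CIDR prefix length as given.
func normalizeAddress(s string) (netip.Prefix, error) {
	if strings.Contains(s, "/") {
		return netip.ParsePrefix(s)
	}
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return netip.Prefix{}, err
	}
	return netip.PrefixFrom(addr, addr.BitLen()), nil
}

func main() {
	mode, _ := parseMode("auto")
	fmt.Println(shouldAdd(mode, true), shouldRemove(mode, false))
	p, _ := normalizeAddress("2001:db8::1")
	fmt.Println(p)
}
```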

Addresses are bound before BGP starts (so the kernel accepts traffic
before routes draw it) and removed on shutdown after routes are
withdrawn.  Non-Linux builds get a stub that no-ops when management is
off and errors if it is actually requested, keeping cross-platform
builds working.
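
The ordering constraint above (bind before advertising, withdraw before unbinding) can be sketched with two small interfaces. This is an illustration under assumed names: `addressManager`, `routeAdvertiser`, `startAnycast`, and `stopAnycast` are hypothetical, and the `recorder` type only exists to make the order visible.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

type addressManager interface {
	Bind() error
	Unbind() error
}

type routeAdvertiser interface {
	Advertise() error
	Withdraw() error
}

// startAnycast binds addresses first so the kernel accepts traffic before
// any advertised route starts drawing it.
func startAnycast(a addressManager, r routeAdvertiser) error {
	if err := a.Bind(); err != nil {
		return fmt.Errorf("bind anycast addresses: %w", err)
	}
	if err := r.Advertise(); err != nil {
		return fmt.Errorf("advertise anycast routes: %w", err)
	}
	return nil
}

// stopAnycast withdraws routes first, then removes the addresses.
// Both steps are attempted even if the first one fails.
func stopAnycast(a addressManager, r routeAdvertiser) error {
	withdrawErr := r.Withdraw()
	unbindErr := a.Unbind()
	return errors.Join(withdrawErr, unbindErr)
}

// recorder is a fake that logs the order in which steps are called.
type recorder struct{ events []string }

func (r *recorder) Bind() error      { r.events = append(r.events, "bind"); return nil }
func (r *recorder) Unbind() error    { r.events = append(r.events, "unbind"); return nil }
func (r *recorder) Advertise() error { r.events = append(r.events, "advertise"); return nil }
func (r *recorder) Withdraw() error  { r.events = append(r.events, "withdraw"); return nil }
func (r *recorder) String() string   { return strings.Join(r.events, ",") }

func main() {
	rec := &recorder{}
	if err := startAnycast(rec, rec); err != nil {
		fmt.Println("start failed:", err)
		return
	}
	if err := stopAnycast(rec, rec); err != nil {
		fmt.Println("stop failed:", err)
		return
	}
	fmt.Println(rec) // prints "bind,advertise,withdraw,unbind"
}
```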

Tests: cross-platform pure-logic (mode parsing, add/remove decision
matrix, address normalization) plus Linux netlink tests exercising real
add/remove on the loopback device, which skip when the process lacks
CAP_NET_ADMIN.
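
One way a test could decide whether to skip is to inspect the effective capability mask in /proc/self/status. The sketch below is an assumption about how such a check might look, not the check used by the tests in this branch; `capEffHas` and `hasNetAdmin` are hypothetical helpers. CAP_NET_ADMIN is capability number 12 in linux/capability.h.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// capNetAdmin is the bit index of CAP_NET_ADMIN in the capability mask.
const capNetAdmin = 12

// capEffHas parses a hexadecimal CapEff mask and tests one capability bit.
func capEffHas(capEffHex string, bit uint) (bool, error) {
	mask, err := strconv.ParseUint(strings.TrimSpace(capEffHex), 16, 64)
	if err != nil {
		return false, err
	}
	return mask&(1<<bit) != 0, nil
}

// hasNetAdmin reads the process's effective capabilities on Linux.
// It returns false on any error, including on non-Linux systems where
// /proc/self/status does not exist.
func hasNetAdmin() bool {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return false
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if rest, ok := strings.CutPrefix(sc.Text(), "CapEff:"); ok {
			has, err := capEffHas(rest, capNetAdmin)
			return err == nil && has
		}
	}
	return false
}

func main() {
	fmt.Println("CAP_NET_ADMIN available:", hasNetAdmin())
}
```

A test could call `t.Skip` when `hasNetAdmin()` returns false, so that unprivileged CI runs pass without exercising real netlink operations.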

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Add anycast support for the Pelican cache
