Multi-IPFS-Gateway support in worker-service/getFile() #6092

@danielnorkin

Summary

Sharing a design to turn the single IPFS gateway URL in worker-service/src/api/ipfs-client-class.ts into an ordered, configurable gateway list. Filing this as a design issue first so we can align on env var naming, default gateway list, and timeout strategy before any implementation. No reference branch yet — designing this in the open so the implementation can land directly in develop.

Why

The motivating force here isn't a single bug; it's that the correct security posture for a Guardian deployer running their own Kubo node creates an availability dependency on a single public gateway. Walking through the cause-and-effect:

  1. Hardening the local Kubo is the right default. Any deployer running their own Kubo gateway should set Gateway.NoFetch: true. Without it, the gateway acts as an open proxy for arbitrary IPFS content — DoS amplification risk, inadvertently caching/serving sensitive or illegal material, etc. NoFetch: true is the IPFS equivalent of disabling an open SMTP relay.

  2. Hardening makes the local Kubo retrieval-only-for-pinned-content. Once NoFetch: true is on, the local Kubo can only serve CIDs the deployer pinned themselves. Any content published by someone else (Storacha, Pinata, another Guardian deployer's Kubo, etc.) is unreachable through the local node.

  3. getFile() therefore depends entirely on IPFS_PUBLIC_GATEWAY for any CID the deployer didn't pin. That includes every cross-deployer policy import, every Hedera-timestamp-driven document fetch where the producer used a different pinning service, every methodology library lookup against external authorities.

  4. IPFS_PUBLIC_GATEWAY is a single URL. When the configured gateway doesn't have the CID, getFile() throws and the consuming code path crashes ("Cannot read properties of null (reading 'type')" being the most common downstream symptom). No retry on a different gateway, no graceful degradation.

So a deployer who follows the security best practice loses the ability to reliably consume cross-deployer content. The fix is to make getFile() walk an ordered list of gateways rather than depend on a single URL. This makes the security-hardened deployment mode viable in practice, which is the wider win for the project.

Design

Configuration. A new env var IPFS_PUBLIC_GATEWAYS accepts a comma-separated, ordered list of gateway URLs. The existing IPFS_PUBLIC_GATEWAY continues to work as a single-entry fallback for backwards compatibility: if IPFS_PUBLIC_GATEWAYS is unset, the single URL is treated as a one-entry gateway list.
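A minimal sketch of how that fallback could be resolved. The function name `parseGatewayList` and the choice to pass the environment in as an argument are assumptions for illustration, not existing worker-service code. Treating an empty IPFS_PUBLIC_GATEWAYS the same as an unset one is also an assumption.

```typescript
// Hypothetical helper: resolves the ordered gateway list from env vars.
type Env = Record<string, string | undefined>;

function parseGatewayList(env: Env): string[] {
    // Prefer the new plural variable: comma-separated, order preserved.
    const list = (env.IPFS_PUBLIC_GATEWAYS ?? '')
        .split(',')
        .map((entry) => entry.trim())
        .filter((entry) => entry.length > 0);
    if (list.length > 0) {
        return list;
    }
    // Fall back to the existing single-URL variable as a one-entry list.
    const single = (env.IPFS_PUBLIC_GATEWAY ?? '').trim();
    return single.length > 0 ? [single] : [];
}
```

In the service this would presumably be called once at startup with `process.env`.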

Read semantics. getFile(cid) walks the gateway list in order:

  • For each gateway, issue an HTTP GET with a per-gateway timeout (default ~10s, configurable).
  • On 2xx with a response body, return immediately.
  • On timeout, network error, 4xx, 5xx, or a 2xx with an empty body, advance to the next gateway in the list.
  • If all gateways fail, throw the same error shape getFile() throws today (so callers don't need to change).
  • If IPFS_TIMEOUT is set, it caps the total time spent across all gateways; otherwise the worst case is list length × per-gateway timeout.
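The read semantics above can be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: `getFileFromGateways`, `FetchOnce`, and `GatewayResponse` are hypothetical names, the HTTP call is injected so the walk stays independent of the HTTP client, and the final error is a placeholder for whatever error shape getFile() throws today.

```typescript
// Hypothetical shapes; the real getFile() signature may differ.
interface GatewayResponse {
    status: number;
    body: Uint8Array | null;
}

// Injected HTTP call. It is expected to reject on timeout or network error.
type FetchOnce = (url: string, timeoutMs: number) => Promise<GatewayResponse>;

async function getFileFromGateways(
    cid: string,
    gateways: string[],
    fetchOnce: FetchOnce,
    perGatewayTimeoutMs = 10_000,
    totalBudgetMs?: number,
): Promise<Uint8Array> {
    const deadline = totalBudgetMs === undefined ? Infinity : Date.now() + totalBudgetMs;
    for (const template of gateways) {
        const remaining = deadline - Date.now();
        if (remaining <= 0) {
            break; // total budget exhausted
        }
        // Never let one attempt outlive the remaining total budget.
        const timeoutMs = Math.min(perGatewayTimeoutMs, remaining);
        const url = template.split('${cid}').join(cid);
        try {
            const res = await fetchOnce(url, timeoutMs);
            const ok = res.status >= 200 && res.status < 300;
            if (ok && res.body !== null && res.body.length > 0) {
                return res.body; // first usable response wins
            }
        } catch {
            // Timeout or network error: fall through to the next gateway.
        }
    }
    // Placeholder error; the proposal is to reuse today's error shape here.
    throw new Error(`All IPFS gateways failed for CID ${cid}`);
}
```

Injecting `fetchOnce` also makes the walk easy to unit test with a fake that fails for some gateways and succeeds for others.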

Internal Kubo placement. If IPFS_NODE_GATEWAY is set, prepend it to the list. Reads for CIDs the deployer pinned themselves are then served by the local node without ever touching a public gateway, which is both faster and free.
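As a sketch of that placement rule, assuming IPFS_NODE_GATEWAY takes the same form as the other list entries. The helper name `withLocalNodeFirst` is hypothetical, and dropping a duplicate of the local gateway from the public list is an extra assumption, not part of the proposal.

```typescript
// Hypothetical helper: puts the local Kubo gateway ahead of the public list.
function withLocalNodeFirst(publicGateways: string[], nodeGateway?: string): string[] {
    const local = nodeGateway?.trim();
    if (!local) {
        return [...publicGateways]; // no local node configured
    }
    // Assumption: skip a repeated entry so the local node is tried only once.
    return [local, ...publicGateways.filter((gateway) => gateway !== local)];
}
```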

Observability. Each gateway attempt emits a structured log line {cid, gateway, status, ms, attempt_index} so deployers can see which gateways are healthy in production and which can be removed from the list.
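One possible shape for that log line, using the field names from the paragraph above. The `GatewayAttemptLog` type, the `formatGatewayAttempt` helper, and the non-numeric status values for timeouts and network errors are assumptions; how the line is handed to the service's logger is left out.

```typescript
// Field names follow the proposal; the status union is an assumption.
interface GatewayAttemptLog {
    cid: string;
    gateway: string;
    status: number | 'timeout' | 'network_error';
    ms: number;
    attempt_index: number;
}

// Serializes one attempt with a stable key order so lines are easy to grep.
function formatGatewayAttempt(entry: GatewayAttemptLog): string {
    return JSON.stringify({
        cid: entry.cid,
        gateway: entry.gateway,
        status: entry.status,
        ms: entry.ms,
        attempt_index: entry.attempt_index,
    });
}
```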

Optional in-process cache. Small LRU keyed by CID storing the gateway URL that last served it. On the next read for the same CID, try the cached gateway first before walking the rest of the list. This is a "skip the walk on repeated reads" optimization, not a content cache — bounded by entry count, no payload caching. Could be a follow-up if not desired in the first PR.
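A minimal sketch of such a hint cache, relying on the insertion order of a JavaScript `Map` for LRU eviction. The class name `GatewayHintCache`, the helper `orderWithHint`, and the default capacity of 1000 entries are all hypothetical.

```typescript
// Hypothetical LRU of CID -> gateway that last served it. No payloads stored.
class GatewayHintCache {
    private readonly hints = new Map<string, string>();
    private readonly maxEntries: number;

    constructor(maxEntries = 1000) {
        this.maxEntries = maxEntries;
    }

    get(cid: string): string | undefined {
        const gateway = this.hints.get(cid);
        if (gateway !== undefined) {
            // Re-insert so this CID becomes the most recently used entry.
            this.hints.delete(cid);
            this.hints.set(cid, gateway);
        }
        return gateway;
    }

    set(cid: string, gateway: string): void {
        this.hints.delete(cid);
        this.hints.set(cid, gateway);
        if (this.hints.size > this.maxEntries) {
            // The first key in a Map is the least recently used one.
            const oldest = this.hints.keys().next().value;
            if (oldest !== undefined) {
                this.hints.delete(oldest);
            }
        }
    }
}

// Moves the hinted gateway to the front; the rest keep their configured order.
function orderWithHint(gateways: string[], hint?: string): string[] {
    if (!hint || gateways.indexOf(hint) === -1) {
        return gateways;
    }
    return [hint, ...gateways.filter((gateway) => gateway !== hint)];
}
```

Ignoring a hint that is no longer in the configured list keeps a stale cache entry from reintroducing a gateway the deployer has removed.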

Write path is unchanged. addFile, deleteCid, the provider abstraction (IpfsProvider.WEB3STORAGE, IpfsProvider.FILEBASE, IpfsProvider.LOCAL) all stay as-is. Pinning still goes through the configured provider; only the read path picks up the list.

Default gateway list. Sensible community defaults so deployers get the resilience benefit without needing to research gateway options. Proposed defaults, in order:

  1. IPFS_NODE_GATEWAY (if set)
  2. https://${cid}.ipfs.w3s.link/
  3. https://ipfs.io/ipfs/${cid}
  4. https://dweb.link/ipfs/${cid}
  5. https://cloudflare-ipfs.com/ipfs/${cid}
  6. https://gateway.pinata.cloud/ipfs/${cid}

Where the defaults live in the repo (env template, docker-compose example, docs) is open for discussion.
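To illustrate how the `${cid}` placeholder in these entries could be expanded, here is a small sketch. The constant `PROPOSED_DEFAULT_GATEWAYS` simply restates the public entries proposed above, and `expandGatewayUrl` is a hypothetical helper. Rejecting a template without a placeholder is an assumption about the desired validation.

```typescript
// Public entries from the proposed defaults (IPFS_NODE_GATEWAY is prepended separately).
const PROPOSED_DEFAULT_GATEWAYS: string[] = [
    'https://${cid}.ipfs.w3s.link/',
    'https://ipfs.io/ipfs/${cid}',
    'https://dweb.link/ipfs/${cid}',
    'https://cloudflare-ipfs.com/ipfs/${cid}',
    'https://gateway.pinata.cloud/ipfs/${cid}',
];

// Hypothetical helper: substitutes the CID into a gateway URL template.
function expandGatewayUrl(template: string, cid: string): string {
    if (template.indexOf('${cid}') === -1) {
        throw new Error(`Gateway template has no \${cid} placeholder: ${template}`);
    }
    // split/join replaces every occurrence and avoids regex escaping.
    return template.split('${cid}').join(cid);
}
```

One caveat worth keeping in mind: the subdomain-style entry places the CID in a DNS label, which is case-insensitive, so it would likely need a CIDv1 in a case-insensitive encoding such as base32 rather than a base58 CIDv0.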

Out of scope for this change

  • Write-path changes. Pinning, deletion, the provider abstraction all stay as-is.
  • Multi-gateway publishing / re-pinning. A separate question worth exploring later, but not in this issue.
  • Content caching. No payload caching, no disk cache. Pure routing concern.
  • Gateway health monitoring / circuit breakers. Could be a useful follow-up, but the simple list-walk is enough to solve the core problem.

Backwards compatibility

  • Deployers with a single IPFS_PUBLIC_GATEWAY today (whether or not they've enabled NoFetch: true) keep working unchanged. The gateway list is opt-in via the new env var.
  • getFile() error shape is preserved for the "all gateways failed" case.
  • No behavior change for deployers using the LOCAL provider against an unhardened (NoFetch: false) Kubo — that configuration already had network reach.

Open questions

  1. Env var naming. Proposed IPFS_PUBLIC_GATEWAYS (plural). Alternatives worth considering: IPFS_GATEWAY_LIST, IPFS_READ_GATEWAYS. Preference?
  2. Default gateway list. Proposed defaults above. Worth tuning based on which gateways have the best uptime / pin coverage for Guardian workloads. Community input welcome. Also: where the defaults live (env template, docker-compose example, docs) is a small choice worth deciding together.
  3. Timeout strategy. Per-gateway timeout (proposed ~10s) vs total request budget vs both. Want to avoid the pathological case where a list of 6 gateways each timing out at 10s blocks a single read for a full minute.
  4. In-process gateway cache. Worth the complexity for the first cut, or save for a follow-up?
  5. Implementation ownership. Climission can drive end-to-end if useful, or we can split design + implementation if any of the maintainer team wants to own parts. Flag preference.
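For open question 3, the worst-case arithmetic can be made concrete with a small sketch. The function `worstCaseReadMs` is hypothetical, and it assumes each attempt is clamped to whatever remains of the total budget.

```typescript
// Hypothetical helper: upper bound on how long one read can block, in ms.
function worstCaseReadMs(
    gatewayCount: number,
    perGatewayTimeoutMs: number,
    totalBudgetMs?: number,
): number {
    // Without a total budget, every gateway may run to its own timeout.
    const unbounded = gatewayCount * perGatewayTimeoutMs;
    return totalBudgetMs === undefined ? unbounded : Math.min(unbounded, totalBudgetMs);
}

// Per-gateway timeout only: 6 gateways at 10s each can block for 60s.
const perGatewayOnly = worstCaseReadMs(6, 10_000);
// Both limits: the same list with a 25s total budget blocks for at most 25s.
const withTotalBudget = worstCaseReadMs(6, 10_000, 25_000);
```

The 25s budget is only an example value; the point is that combining both limits bounds the read regardless of list length.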

Once we have alignment on these, opening the PR is straightforward.
