Multi-IPFS-Gateway support in worker-service/getFile() #6092

## Summary

Sharing a design to turn the single IPFS gateway URL in `worker-service/src/api/ipfs-client-class.ts` into an ordered, configurable gateway list. Filing this as a design issue first so we can align on env var naming, the default gateway list, and timeout strategy before any implementation. There is no reference branch yet; the design is happening in the open so the implementation can land directly in `develop`.

## Why

The motivating force here isn't a single bug; it's that the *correct* security posture for a Guardian deployer running their own Kubo node creates an availability dependency on a single public gateway. Walking through the cause-and-effect:

1. **Hardening the local Kubo is the right default.** Any deployer running their own Kubo gateway should set `Gateway.NoFetch: true`. Without it, the gateway acts as an open proxy for arbitrary IPFS content — DoS amplification risk, inadvertently caching/serving sensitive or illegal material, etc. `NoFetch: true` is the IPFS equivalent of disabling an open SMTP relay.

2. **Hardening restricts the local Kubo to content it already holds.** Once `NoFetch: true` is on, the gateway serves only blocks already in the local repo, which in practice means CIDs the deployer pinned themselves. Any content published by someone else (Storacha, Pinata, another Guardian deployer's Kubo, etc.) is unreachable through the local node.

3. **`getFile()` therefore depends entirely on `IPFS_PUBLIC_GATEWAY`** for any CID the deployer didn't pin. That includes every cross-deployer policy import, every Hedera-timestamp-driven document fetch where the producer used a different pinning service, every methodology library lookup against external authorities.

4. **`IPFS_PUBLIC_GATEWAY` is a single URL.** When the configured gateway doesn't have the CID, `getFile()` throws and the consuming code path crashes (the most common downstream symptom is `Cannot read properties of null (reading 'type')`). No retry on a different gateway, no graceful degradation.

So a deployer who follows the security best practice loses the ability to reliably consume cross-deployer content. The fix is to make `getFile()` walk an ordered list of gateways rather than depend on a single URL. This makes the security-hardened deployment mode viable in practice, which is the wider win for the project.

## Design

**Configuration.** A new env var `IPFS_PUBLIC_GATEWAYS` accepts a comma-separated, ordered list of gateway URLs. The existing `IPFS_PUBLIC_GATEWAY` continues to work as a single-entry option for backwards compatibility — if `IPFS_PUBLIC_GATEWAYS` is unset, the single URL is the gateway list.
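A minimal sketch of how the two variables could resolve into one ordered list. The helper name `parseGatewayList` and the choice to treat an empty `IPFS_PUBLIC_GATEWAYS` as unset are assumptions for illustration, not existing worker-service code.

```typescript
// Hypothetical helper; the name and signature are illustrative only.
type Env = Record<string, string | undefined>;

function parseGatewayList(env: Env): string[] {
  // Prefer the new plural variable. An empty value is treated as unset
  // (an assumption), so the legacy single URL still applies.
  const plural = env.IPFS_PUBLIC_GATEWAYS?.trim();
  const raw = plural ? plural : env.IPFS_PUBLIC_GATEWAY ?? "";
  return raw
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0); // tolerate trailing commas
}
```

In the service this would be called once at startup with `process.env`, so a malformed list surfaces early rather than on the first read.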

**Read semantics.** `getFile(cid)` walks the gateway list in order:

- For each gateway, issue an HTTP GET with a per-gateway timeout (default ~10s, configurable).
- On 2xx with a response body, return immediately.
- On timeout, connection error, 4xx, or 5xx, advance to the next gateway in the list.
- If all gateways fail, throw the same error shape `getFile()` throws today (so callers don't need to change).
- Total budget across all gateways respects `IPFS_TIMEOUT` if set, otherwise scales naturally with list length × per-gateway timeout.
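The walk described in the list above could look roughly like the following. This is a sketch under stated assumptions: `getFileFromGateways` and `FetchLike` are hypothetical names, the fetch function is injected so the logic can be exercised without network access, the final error message is a placeholder for whatever `getFile()` throws today, and the total `IPFS_TIMEOUT` budget is left out for brevity.

```typescript
// Hypothetical names throughout; a sketch of the walk, not worker-service code.
type FetchLike = (
  url: string,
  init: { signal: AbortSignal },
) => Promise<{ ok: boolean; status: number; arrayBuffer(): Promise<ArrayBuffer> }>;

async function getFileFromGateways(
  cid: string,
  gateways: string[], // URL templates containing the literal "${cid}"
  fetchImpl: FetchLike,
  perGatewayTimeoutMs = 10_000,
): Promise<ArrayBuffer> {
  for (const template of gateways) {
    const url = template.replace("${cid}", cid);
    const controller = new AbortController();
    // Abort this attempt once the per-gateway timeout elapses.
    const timer = setTimeout(() => controller.abort(), perGatewayTimeoutMs);
    try {
      const res = await fetchImpl(url, { signal: controller.signal });
      if (res.ok) {
        const body = await res.arrayBuffer();
        if (body.byteLength > 0) return body; // 2xx with a body: done
      }
      // Non-2xx or empty body: fall through to the next gateway.
    } catch {
      // Timeout (abort) or network error: fall through as well.
    } finally {
      clearTimeout(timer);
    }
  }
  // Placeholder: the real change would rethrow today's getFile() error shape.
  throw new Error(`Unable to fetch ${cid} from any configured gateway`);
}
```

On Node 18 or later the global `fetch` should satisfy `FetchLike`, so a caller could pass it directly; tests can pass a stub that returns canned responses per URL.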

**Internal Kubo placement.** If `IPFS_NODE_GATEWAY` is set, place it at the front of the list. Reads for CIDs the deployer pinned themselves are then served by the local node without ever touching a public gateway, which is both faster and free.
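One way to express the placement rule, as a sketch: `withInternalGateway` is a hypothetical helper, and dropping duplicate entries is an added assumption so a node gateway that also appears in the public list is not tried twice.

```typescript
// Hypothetical helper; puts the deployer's own gateway first when configured.
function withInternalGateway(publicGateways: string[], nodeGateway?: string): string[] {
  const internal = nodeGateway?.trim();
  const ordered = internal ? [internal, ...publicGateways] : [...publicGateways];
  // Keep only the first occurrence of each entry, preserving order.
  return ordered.filter((gateway, index) => ordered.indexOf(gateway) === index);
}
```

The result is a new array, so the parsed public list can be reused elsewhere without being mutated.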

**Observability.** Each gateway attempt emits a structured log line `{cid, gateway, status, ms, attempt_index}` so deployers can see which gateways are healthy in production and which can be removed from the list.
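As an illustration of the log shape, a sketch with the five fields named above. The `GatewayAttemptLog` type, the `formatAttemptLog` helper, and the string values used for non-HTTP outcomes (`"timeout"`, `"network_error"`) are assumptions; the real change would emit through whatever logger worker-service already uses.

```typescript
// Hypothetical log record; field names follow the proposal, the rest is assumed.
interface GatewayAttemptLog {
  cid: string;
  gateway: string;
  status: number | "timeout" | "network_error"; // HTTP status, or why no status exists
  ms: number; // wall-clock duration of this attempt
  attempt_index: number; // 0-based position in the walk
}

function formatAttemptLog(entry: GatewayAttemptLog): string {
  // One JSON object per line keeps the output friendly to log aggregators.
  return JSON.stringify(entry);
}
```

For example, `formatAttemptLog({ cid: "bafy", gateway: "https://ipfs.io", status: 200, ms: 42, attempt_index: 0 })` returns `{"cid":"bafy","gateway":"https://ipfs.io","status":200,"ms":42,"attempt_index":0}`.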

**Optional in-process cache.** Small LRU keyed by CID storing the gateway URL that last served it. On the next read for the same CID, try the cached gateway first before walking the rest of the list. This is a "skip the walk on repeated reads" optimization, not a content cache — bounded by entry count, no payload caching. Could be a follow-up if not desired in the first PR.
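A sketch of what such a hint cache could look like, relying on the insertion-order guarantee of `Map` for LRU eviction. `GatewayHintCache`, `orderWithHint`, and the default capacity of 1000 are hypothetical; only gateway URLs are stored, never payloads.

```typescript
// Hypothetical CID -> gateway hint cache; bounded by entry count, no payloads.
class GatewayHintCache {
  private readonly hints = new Map<string, string>();
  private readonly maxEntries: number;

  constructor(maxEntries = 1000) {
    this.maxEntries = maxEntries;
  }

  get(cid: string): string | undefined {
    const gateway = this.hints.get(cid);
    if (gateway !== undefined) {
      // Re-insert so this CID becomes the most recently used entry.
      this.hints.delete(cid);
      this.hints.set(cid, gateway);
    }
    return gateway;
  }

  set(cid: string, gateway: string): void {
    this.hints.delete(cid);
    this.hints.set(cid, gateway);
    if (this.hints.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.hints.keys().next().value;
      if (oldest !== undefined) this.hints.delete(oldest);
    }
  }
}

// Move the hinted gateway to the front; ignore hints no longer in the list.
function orderWithHint(gateways: string[], hint?: string): string[] {
  if (!hint || gateways.indexOf(hint) === -1) return gateways;
  return [hint, ...gateways.filter((gateway) => gateway !== hint)];
}
```

Ignoring hints that are no longer in the configured list means a gateway removed from the env var is never contacted because of a stale cache entry.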

**Write path is unchanged.** `addFile`, `deleteCid`, the provider abstraction (`IpfsProvider.WEB3STORAGE`, `IpfsProvider.FILEBASE`, `IpfsProvider.LOCAL`) all stay as-is. Pinning still goes through the configured provider; only the read path picks up the list.

**Default gateway list.** Sensible community defaults so deployers get the resilience benefit without needing to research gateway options. Proposed defaults, in order:

- `IPFS_NODE_GATEWAY` (if set)
- `https://${cid}.ipfs.w3s.link/`
- `https://ipfs.io/ipfs/${cid}`
- `https://dweb.link/ipfs/${cid}`
- `https://cloudflare-ipfs.com/ipfs/${cid}` (needs verification before it ships: Cloudflare announced in 2024 that it was winding down this public gateway, so the entry may have to be dropped)
- `https://gateway.pinata.cloud/ipfs/${cid}`

Where the defaults live in the repo (env template, docker-compose example, docs) is open for discussion.
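The defaults mix subdomain-style and path-style templates, so the substitution step has to handle both. A minimal sketch, assuming every entry carries the literal `${cid}` placeholder; `buildGatewayUrl` is a hypothetical helper, and rejecting entries without the placeholder is an assumption rather than a decided behavior.

```typescript
// Hypothetical helper: expands a gateway template for one CID.
const CID_PLACEHOLDER = "${cid}"; // literal text in the env var, not a JS template

function buildGatewayUrl(template: string, cid: string): string {
  if (template.indexOf(CID_PLACEHOLDER) === -1) {
    // Assumed policy: fail loudly instead of guessing where the CID goes.
    throw new Error("Gateway template is missing " + CID_PLACEHOLDER + ": " + template);
  }
  // split/join replaces every occurrence and avoids regex escaping.
  return template.split(CID_PLACEHOLDER).join(cid);
}
```

One caveat worth keeping in mind when choosing defaults: subdomain-style gateways put the CID in a DNS label, which requires a case-insensitive CIDv1 encoding, so a CIDv0 (`Qm...`) would need converting before it can be used with a template such as `https://${cid}.ipfs.w3s.link/`.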

## Out of scope for this change

- **Write-path changes.** Pinning, deletion, the provider abstraction all stay as-is.
- **Multi-gateway publishing / re-pinning.** A separate question worth exploring later, but not in this issue.
- **Content caching.** No payload caching, no disk cache. Pure routing concern.
- **Gateway health monitoring / circuit breakers.** Could be a useful follow-up, but the simple list-walk is enough to solve the core problem.

## Backwards compatibility

- Deployers with a single `IPFS_PUBLIC_GATEWAY` today (whether or not they've enabled `NoFetch: true`) keep working unchanged. The gateway list is opt-in via the new env var.
- `getFile()` error shape is preserved for the "all gateways failed" case.
- No behavior change for deployers using the `LOCAL` provider against an unhardened (`NoFetch: false`) Kubo — that configuration already had network reach.

## Open questions

1. **Env var naming.** Proposed `IPFS_PUBLIC_GATEWAYS` (plural). Alternatives worth considering: `IPFS_GATEWAY_LIST`, `IPFS_READ_GATEWAYS`. Preference?
2. **Default gateway list.** Proposed defaults above. Worth tuning based on which gateways have the best uptime / pin coverage for Guardian workloads. Community input welcome. Also: where the defaults live (env template, docker-compose example, docs) is a small choice worth deciding together.
3. **Timeout strategy.** Per-gateway timeout (proposed ~10s) vs total request budget vs both. Want to avoid the pathological case where a list of 6 gateways each timing out at 10s blocks a single read for a full minute.
4. **In-process gateway cache.** Worth the complexity for the first cut, or save for a follow-up?
5. **Implementation ownership.** Climission can drive end-to-end if useful, or we can split design + implementation if any of the maintainer team wants to own parts. Flag preference.
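To make the trade-off in question 3 concrete, here is the arithmetic as a small sketch. Both function names are hypothetical. The first computes the worst-case blocking time for a read; the second shows the "both" option, where each attempt gets the per-gateway timeout clamped to whatever remains of the total budget.

```typescript
// Worst-case time a single read can block, in milliseconds.
function worstCaseMs(gatewayCount: number, perGatewayMs: number, totalBudgetMs?: number): number {
  const unbounded = gatewayCount * perGatewayMs;
  return totalBudgetMs === undefined ? unbounded : Math.min(unbounded, totalBudgetMs);
}

// Timeout for the next attempt when both limits apply; 0 means stop walking.
function nextAttemptTimeoutMs(perGatewayMs: number, elapsedMs: number, totalBudgetMs: number): number {
  return Math.max(0, Math.min(perGatewayMs, totalBudgetMs - elapsedMs));
}
```

With 6 gateways at 10s each and no total budget, `worstCaseMs(6, 10_000)` is 60000 ms, the full minute mentioned above. With a 25s budget it drops to 25000 ms, and after 20s have elapsed the third attempt would get `nextAttemptTimeoutMs(10_000, 20_000, 25_000)`, which is 5000 ms.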

Once we have alignment on these, opening the PR is straightforward.

