Version: 1.0 (Draft) Author: Solutions Architecture Date: February 18, 2026 Status: Pending Review
This project demonstrates Apollo GraphQL Subscriptions operating at scale using the HTTP Callback protocol. The system showcases a federated architecture with multiple Apollo Router pods, multiple subgraph pod instances, event-driven messaging via Kafka/RabbitMQ, and Redis as the subscription state store. A React Web Application serves as the front-end, demonstrating real-time subscription functionality with robust error handling and retry logic.
The solution is designed to be environment-portable — deployable via Docker Compose for local development and Kubernetes (Helm charts) for staging/production.
┌─────────────────────────────────────────────────────────────────────────────────┐
│ Clients (React Web App) │
│ ┌────────────────┐ │
│ │ HTTP Multipart │──── Heartbeats (empty JSON) ────────────────┐ │
│ │ Subscription │◄── Payload / Error sent back ───────────────┤ │
│ └────────────────┘ │ │
└──────────────────────────────────────────────────────────────────┼──────────────┘
│
┌──────────────────────────────────────────────────────────────────┼──────────────┐
│ API Gateway Layer │ │
│ ┌────────────┐ ┌─────────────────┐ ┌──────────┐ │ │
│ │ Load │ │ API Gateway │ │ Rate │ │ │
│ │ Balancer │ │ (Kong/AWS GW) │ │ Limiter │ │ │
│ └────────────┘ └─────────────────┘ └──────────┘ │ │
└──────────────────────────────────────────────────────────────────┼──────────────┘
│
┌──────────────────────────────────────────────────────────────────┼──────────────┐
│ Apollo Router Pods (StatefulSet + Headless Service) │ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ Router │ │ Router │ │ Router │ ◄─ Callback URL │ │
│ │ Pod 1 │ │ Pod 2 │ │ Pod 3 │ must be pod- │ │
│ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ specific │ │
│ │ │ │ │ │
│ └──────────────┴──────────────┘ │ │
│ │ │ │
└───────────────────────┼───────────────────────────────────────────┘ │
│ POST /graphql (init subscription with callback details) │
┌───────────────────────┼─────────────────────────────────────────────────────────┐│
│ Subgraph Pods ││
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ ││
│ │ Subgraph │ │ Subgraph │ │ Subgraph │ ││
│ │ Pod 1 │ │ Pod 2 │ │ Pod 3 │ ││
│ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ ││
│ │ │ │ ││
│ └──────────────┼──────────────┘ ││
│ │ ││
│ ┌────────────┴────────────┐ ││
│ │ Redis (State) │ ← Stores: callback URL, TTL, ││
│ │ │ subscription ID, metadata ││
│ └────────────┬────────────┘ ││
│ │ ││
│ ┌────────────┴────────────┐ ││
│ │ Kafka / RabbitMQ │ ← Domain events published by ││
│ │ (Message Queue) │ microservices, consumed by subgraphs ││
│ └─────────────────────────┘ ││
└──────────────────────────────────────────────────────────────────────────────────┘│
│
┌──────────────────────────────────────────────────────────────────────────────────┐│
│ Observability ││
│ Prometheus / Grafana / ELK Stack / OTEL Collector ││
└──────────────────────────────────────────────────────────────────────────────────┘│
- Client initiates a subscription via HTTP multipart request to the Router.
- Router receives the subscription, generates a pod-specific Callback URL, and POSTs the subscription operation + callback details to the Subgraph.
- Subgraph persists the callback metadata (URL, TTL, subscription ID) into Redis with a default TTL of 60s.
- Subgraph (or a Watcher service) heartbeats the Router every ~30s; on `200 OK`, it resets the Redis TTL.
- Kafka/RabbitMQ emits a domain event → Subgraph consumer picks it up.
- Subgraph checks Redis: "Is this subscription still active (TTL > 0)?"
- If active, Subgraph POSTs the payload to the Router's callback URL.
- Router pushes the multipart chunk to the Client.
Per the architecture diagram, the system implements lossless delivery:
- On Router 5xx/timeout → subgraph sends a channel NACK back to the message queue for redelivery.
- On subscription dead/expired → subgraph sends ACK + cleans up Redis metadata.
- On `204 No Content` → message delivered successfully, channel ACK.
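The delivery rules above can be sketched as a pure policy function. This is a hypothetical helper (the function name and `QueueAction` type are ours, not from any library); the actual status codes the Router returns should be verified against the callback protocol spec:

```typescript
// Maps the Router's callback response to a message-queue action:
//   204/200        -> ack (delivered)
//   404/410        -> subscription gone: ack the message, clean up Redis
//   5xx or timeout -> nack so the broker redelivers
type QueueAction = "ack" | "ack-and-cleanup" | "nack";

export function classifyCallbackResponse(status: number | null): QueueAction {
  if (status === null) return "nack"; // network timeout / no response
  if (status === 204 || status === 200) return "ack";
  if (status === 404 || status === 410) return "ack-and-cleanup";
  if (status >= 500) return "nack";
  return "ack-and-cleanup"; // other non-retryable statuses: treat subscription as dead
}
```

Keeping this decision in one pure function makes the ACK/NACK behavior unit-testable without a broker.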
For handling disconnections gracefully:
- Client holds an `after` cursor locally.
- On reconnect, client subscribes with `variables: { after: 'cursor_123' }`.
- Router POSTs to subgraph with callback URL + variables.
- Subgraph queries the Event Store DB: `SELECT * FROM events WHERE id > 'cursor_123'`.
- Subgraph backfills missed events (Event A, Event B) via POST to the callback URL.
- Once caught up, subgraph registers live subscription and starts consuming from RabbitMQ/Kafka.
- New live events (Event C, etc.) flow through the normal callback path.
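The catch-up window can overlap with the first live events, so the backfill and live streams may both contain the same event. A small, self-contained sketch of the de-duplication step (names are illustrative):

```typescript
interface DomainEvent {
  id: string;
  payload: unknown;
}

// Merge backfilled events with live events that arrived during catch-up,
// de-duplicating by event id and preserving order (backfill first, then live).
export function mergeBackfillWithLive(
  backfill: DomainEvent[],
  live: DomainEvent[]
): DomainEvent[] {
  const seen = new Set(backfill.map((e) => e.id));
  return [...backfill, ...live.filter((e) => !seen.has(e.id))];
}
```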
- Initialize monorepo (e.g., Nx or Turborepo) with the following packages:
  - `packages/router-config` — Apollo Router YAML configuration
  - `packages/subgraph-orders` — Orders subgraph (Node.js / Apollo Server)
  - `packages/subgraph-notifications` — Notifications subgraph (Node.js / Apollo Server)
  - `packages/web-app` — React application (Vite + Apollo Client v4)
  - `packages/event-producer` — Utility to simulate domain events
  - `infra/docker` — Docker Compose for local development
  - `infra/k8s` — Kubernetes manifests / Helm charts
- Define `.env.example` files with all configurable parameters.
- Set up CI/CD pipeline skeleton (GitHub Actions).
- Redis (single node, port 6379)
- Kafka (or RabbitMQ) with management UI
- Apollo Router ×2 instances
- Subgraph (Orders) ×2 instances
- Subgraph (Notifications) ×2 instances
- Event Producer (CLI utility / cron job for injecting events)
- Prometheus + Grafana for local observability
- Network configuration ensuring Router callback URLs are resolvable between containers.
- Apollo Router — `StatefulSet` with `Headless Service` to provide stable DNS names per pod (e.g., `router-0.router-headless.default.svc.cluster.local`). This is critical for the callback URL to be instance-specific.
- Subgraphs — Standard `Deployment` (2+ replicas each) with `Service`.
- Redis — `StatefulSet` or managed service (e.g., ElastiCache / Memorystore).
- Kafka/RabbitMQ — `StatefulSet` or managed service (e.g., MSK / CloudAMQP).
- ConfigMaps for Router YAML, environment-specific overrides.
- Secrets for GraphOS API key, Redis auth, Kafka credentials.
- Include `HorizontalPodAutoscaler` definitions for subgraphs.
# Subgraph: Orders
scalar DateTime
type Order @key(fields: "id") {
id: ID!
status: OrderStatus!
items: [OrderItem!]!
total: Float!
createdAt: DateTime!
updatedAt: DateTime!
}
type OrderItem {
productId: ID!
name: String!
quantity: Int!
price: Float!
}
enum OrderStatus {
PLACED
CONFIRMED
PREPARING
SHIPPED
DELIVERED
CANCELLED
}
type Subscription {
orderStatusChanged(orderId: ID!): OrderStatusUpdate!
newOrders: Order!
}
type OrderStatusUpdate {
orderId: ID!
previousStatus: OrderStatus!
newStatus: OrderStatus!
timestamp: DateTime!
}
type Query {
order(id: ID!): Order
orders(limit: Int, after: String): OrderConnection!
}
type Mutation {
placeOrder(input: PlaceOrderInput!): Order!
updateOrderStatus(orderId: ID!, status: OrderStatus!): Order!
}

# Referenced above but not previously defined; shapes are illustrative defaults.
input PlaceOrderInput {
  items: [OrderItemInput!]!
}

input OrderItemInput {
  productId: ID!
  quantity: Int!
}

type OrderConnection {
  edges: [Order!]!
  endCursor: String
  hasNextPage: Boolean!
}

# Subgraph: Notifications
scalar DateTime
type Notification @key(fields: "id") {
id: ID!
type: NotificationType!
message: String!
read: Boolean!
createdAt: DateTime!
}
enum NotificationType {
ORDER_UPDATE
SYSTEM_ALERT
PROMOTION
}
type Subscription {
notificationReceived(userId: ID!): Notification!
}
type Query {
notifications(userId: ID!, limit: Int): [Notification!]!
}

Each subgraph must implement the following:
On receiving a subscription initiation POST from Router:
- Parse the callback URL, subscription ID, and verifier from the Router's request.
- Store in Redis: key `sub:{subscriptionId}`, value `{ callbackUrl, verifier, createdAt, variables }`, TTL 60s.
- Respond with `200 OK` to acknowledge.
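The registration steps above can be sketched as follows. The `KV` interface stands in for a real Redis client such as ioredis; it is kept synchronous for brevity, whereas real clients return promises. The key layout matches the `sub:{subscriptionId}` schema in the Redis data model:

```typescript
// Minimal stand-in for a Redis client so the sketch stays self-contained.
interface KV {
  hset(key: string, value: Record<string, string>): void;
  expire(key: string, seconds: number): void;
}

const SUBSCRIPTION_TTL_SECONDS = 60; // default TTL; renewed on each heartbeat

export function registerSubscription(
  kv: KV,
  subscriptionId: string,
  callbackUrl: string,
  verifier: string,
  variables: Record<string, unknown>
): string {
  const key = `sub:${subscriptionId}`;
  kv.hset(key, {
    callbackUrl,
    verifier,
    variables: JSON.stringify(variables),
    createdAt: new Date().toISOString(),
  });
  kv.expire(key, SUBSCRIPTION_TTL_SECONDS);
  return key;
}
```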
Heartbeat / TTL Renewal:
- Subgraph (or a dedicated Watcher service) pings the Router's callback URL with a `check` message every ~30s.
- On `200 OK`, reset the Redis TTL to 60s.
- On failure, mark the subscription for cleanup.
Event Consumption (Kafka/RabbitMQ):
- Consumer receives an event (e.g., `order.status.changed`).
- Queries Redis for active subscriptions matching the event filter.
- For each active subscription: POST the payload to the callback URL.
- Handle the response: `204` → ACK, 5xx/timeout → NACK for retry, gone → clean up Redis.
# Active Subscription State
Key: sub:{subscriptionId}
Type: Hash
Fields: callbackUrl, verifier, variables (JSON), createdAt
TTL: 60s (renewed on heartbeat)
# Subscription Index (for event routing)
Key: sub:index:{eventType}:{filterValue}
Type: Set
Members: subscriptionId_1, subscriptionId_2, ...
# Event Cursor (for resumable subscriptions)
Key: cursor:{subscriptionId}
Type: String
Value: last_event_id
supergraph:
listen: 0.0.0.0:4000
subscription:
enabled: true
mode:
callback:
public_url: ${ROUTER_PUBLIC_URL} # Pod-specific, e.g., http://router-0.router-headless:4000
listen: 0.0.0.0:4000
path: /callback
heartbeat_interval: 30s
subgraphs:
- orders
- notifications
# Enable multipart subscriptions for clients
batching:
enabled: false
telemetry:
exporters:
metrics:
prometheus:
enabled: true
listen: 0.0.0.0:9090
path: /metrics
tracing:
otlp:
enabled: true
endpoint: ${OTEL_COLLECTOR_ENDPOINT}
instrumentation:
events:
router:
request:
level: info
response:
level: info
error:
level: error
supergraph:
request:
level: info
response:
level: info
cors:
origins:
- http://localhost:3000
- ${CLIENT_ORIGIN}
allow_headers:
- Content-Type
- Apollo-Require-Preflight
headers:
all:
request:
- propagate:
named: Authorization
- propagate:
named: X-Request-ID
include_subgraph_errors:
  all: true

- Pod-specific callback URL: each Router pod must advertise its own stable hostname. Use `StatefulSet` + `Headless Service` so `router-0`, `router-1`, etc., are DNS-resolvable.
- Environment variable injection: use a Kubernetes `env` entry with `fieldRef` to inject the pod name; construct the callback URL dynamically.
- Health probes: configure liveness (`/health`) and readiness probes on the Router.
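A sketch of the Downward API wiring for the Router container spec, assuming the headless Service is named `router-headless` (adjust names to your chart):

```yaml
# Goes under the Router container in the StatefulSet pod template.
# $(VAR) substitution works because POD_NAME/POD_NAMESPACE are defined first.
env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: ROUTER_PUBLIC_URL
    value: "http://$(POD_NAME).router-headless.$(POD_NAMESPACE).svc.cluster.local:4000"
```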
web-app/
├── src/
│ ├── apollo/
│ │ ├── client.ts # Apollo Client setup with link chain
│ │ ├── retryLink.ts # RetryLink configuration
│ │ ├── errorLink.ts # ErrorLink for global error handling
│ │ └── subscriptionLink.ts # HTTP multipart subscription setup
│ ├── pages/
│ │ ├── OrderDashboard.tsx # Live order tracking page
│ │ └── NotificationCenter.tsx # Real-time notifications page
│ ├── components/
│ │ ├── OrderStatusTracker.tsx
│ │ ├── OrderList.tsx
│ │ ├── NotificationFeed.tsx
│ │ ├── ConnectionStatus.tsx # Shows subscription health
│ │ └── ErrorBoundary.tsx
│ ├── hooks/
│ │ ├── useOrderSubscription.ts
│ │ ├── useNotificationSubscription.ts
│ │ └── useResumableSubscription.ts # Custom hook with cursor tracking
│ ├── graphql/
│ │ ├── queries.ts
│ │ ├── mutations.ts
│ │ └── subscriptions.ts
│ └── App.tsx
// client.ts
import { ApolloClient, InMemoryCache, from } from '@apollo/client';
import { RetryLink } from '@apollo/client/link/retry';
import { onError } from '@apollo/client/link/error';
import { HttpLink } from '@apollo/client/link/http';
// 1. RetryLink — handles transient network errors
const retryLink = new RetryLink({
delay: {
initial: 300,
max: 10000, // cap at 10s
jitter: true, // randomize to avoid thundering herd
},
attempts: {
max: 5,
retryIf: (error, _operation) => {
// Retry on network errors and 5xx status codes
return !!error && (
error.statusCode >= 500 ||
error.message === 'Failed to fetch' ||
error.message?.includes('NetworkError')
);
},
},
});
// 2. ErrorLink — global error observer
const errorLink = onError(({ graphQLErrors, networkError, protocolErrors, operation }) => {
if (graphQLErrors) {
graphQLErrors.forEach(({ message, locations, path, extensions }) => {
console.error(
`[GraphQL Error]: Message: ${message}, Path: ${path}, Code: ${extensions?.code}`
);
});
}
if (protocolErrors) {
// CombinedProtocolErrors — handle multipart/subscription protocol issues
protocolErrors.forEach(({ message, extensions }) => {
console.error(`[Protocol Error]: ${message}`);
// Could trigger re-subscription logic here
});
}
if (networkError) {
console.error(`[Network Error]: ${networkError.message}`);
// Surface to UI via reactive variable or context
}
});
// 3. HttpLink — standard HTTP transport (supports multipart subscriptions)
const httpLink = new HttpLink({
uri: import.meta.env.VITE_GRAPHQL_ENDPOINT || 'http://localhost:4000/graphql',
headers: {
'Apollo-Require-Preflight': 'true',
},
});
// Compose the link chain: Retry → Error → HTTP
const link = from([retryLink, errorLink, httpLink]);
export const client = new ApolloClient({
link,
cache: new InMemoryCache({
typePolicies: {
Query: {
fields: {
orders: {
keyArgs: false,
merge(existing, incoming) {
return incoming; // or implement cursor-based merge
},
},
},
},
},
}),
defaultOptions: {
watchQuery: {
errorPolicy: 'all', // Return both data and errors
},
},
});

// hooks/useOrderSubscription.ts
import { useSubscription } from '@apollo/client';
import { ORDER_STATUS_CHANGED } from '../graphql/subscriptions';
export function useOrderSubscription(orderId: string) {
const { data, loading, error, restart } = useSubscription(
ORDER_STATUS_CHANGED,
{
variables: { orderId },
onData: ({ data: subscriptionData }) => {
// Handle incoming subscription event
console.log('Order update received:', subscriptionData);
},
onError: (error) => {
console.error('Subscription error:', error);
// Optionally trigger reconnect with backoff
},
onComplete: () => {
console.log('Subscription completed');
},
}
);
return { data, loading, error, restart };
}

// hooks/useResumableSubscription.ts
import { useSubscription } from '@apollo/client';
import type { DocumentNode } from 'graphql';
import { useRef, useCallback } from 'react';

// Loosely typed sketch; tighten `options` in production code.
export function useResumableSubscription(subscription: DocumentNode, options: any = {}) {
const cursorRef = useRef<string | null>(
localStorage.getItem(`sub_cursor_${options.key}`) || null
);
const { data, loading, error, restart } = useSubscription(subscription, {
...options,
variables: {
...options.variables,
after: cursorRef.current,
},
onData: ({ data: subData }) => {
// Update cursor from the latest event
const eventId = subData?.data?.__cursor || subData?.data?.id;
if (eventId) {
cursorRef.current = eventId;
localStorage.setItem(`sub_cursor_${options.key}`, eventId);
}
options.onData?.({ data: subData });
},
});
const resetCursor = useCallback(() => {
cursorRef.current = null;
localStorage.removeItem(`sub_cursor_${options.key}`);
restart();
}, [options.key, restart]);
return { data, loading, error, restart, resetCursor };
}

Page 1 — Order Dashboard (OrderDashboard.tsx)
- Displays a list of recent orders (query + polling fallback).
- Each order card subscribes to `orderStatusChanged` and updates in real time.
- Status transitions animate visually (e.g., PLACED → CONFIRMED → SHIPPED).
- "Place New Order" button triggers a mutation, demonstrating the full cycle.
- Connection status indicator shows subscription health.
Page 2 — Notification Center (NotificationCenter.tsx)
- Subscribes to `notificationReceived` for the current user.
- Displays a live feed of notifications with read/unread state.
- Demonstrates the resumable subscription pattern — on page reload, missed notifications are backfilled.
- Badge count updates in the nav bar via `useSubscription`.
| Error Type | Source | Handling Approach |
|---|---|---|
| Network errors (timeout, DNS, 5xx) | RetryLink | Automatic retry with exponential backoff + jitter (max 5 attempts). Surface to UI after exhaustion. |
| GraphQL errors | ErrorLink, `useSubscription` error | Display inline error messages. Log with operation context. Do not retry (business logic errors). |
| Protocol errors (`CombinedProtocolErrors`) | Multipart stream | Log, attempt `restart()` on the subscription hook. Show reconnection indicator. |
| Subscription termination | Server-side close | Detect via `onComplete`, auto-restart with cursor for resumable subscriptions. |
# Kafka Topics (or RabbitMQ Exchanges/Queues)
orders.status.changed → consumed by Orders subgraph
orders.new → consumed by Orders subgraph + Notifications subgraph
notifications.system → consumed by Notifications subgraph
# Dead Letter Topic
subscriptions.dlq → failed callback deliveries after max retries
- CLI tool or cron job that publishes sample events to Kafka/RabbitMQ.
- Configurable: event type, frequency, payload size, burst mode.
- Used for demo and load testing.
- Each subgraph pod runs a consumer with a consumer group (Kafka) or competing consumers (RabbitMQ) pattern.
- Manual ACK mode: only acknowledge after successful callback POST.
- NACK with requeue on transient failures (Router 5xx, timeout).
- Move to DLQ after N retries (configurable, default: 3).
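The requeue-vs-DLQ decision is worth pinning down as a pure, testable function. A hypothetical policy helper, with the attempt count sourced from broker metadata (e.g., a Kafka header or RabbitMQ's `x-death` count):

```typescript
type FailureAction = "requeue" | "dead-letter";

// Decide what to do with a message after a failed callback POST.
// `attempts` is the number of deliveries already made for this message;
// maxRetries defaults to 3, matching the configurable default above.
export function onDeliveryFailure(attempts: number, maxRetries = 3): FailureAction {
  return attempts >= maxRetries ? "dead-letter" : "requeue";
}
```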
- Enable Router subscription debugging events per Apollo docs:
  telemetry:
    instrumentation:
      events:
        supergraph:
          # Log subscription lifecycle events
          response:
            level: info
- Export metrics to Prometheus, traces to Jaeger/OTEL Collector.
- Subscription Metrics: Active subscriptions count, callback success/failure rate, heartbeat latency.
- Message Queue Metrics: Consumer lag, throughput, DLQ depth.
- Router Metrics: Request rate, latency percentiles, error rates.
- Redis Metrics: Memory usage, key count, TTL expiration rate.
- Subscription callback failure rate > 5% over 5 minutes.
- Kafka consumer lag > 1000 messages.
- Redis memory usage > 80%.
- Router pod restarts > 2 in 10 minutes.
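The first two thresholds can be expressed as Prometheus alerting rules. A sketch, with illustrative metric names that must match what the services actually export:

```yaml
# Sketch: alerting rules for callback failure rate and consumer lag.
# Metric names (subscription_callback_*, kafka_consumergroup_lag) are assumed.
groups:
  - name: subscriptions
    rules:
      - alert: CallbackFailureRateHigh
        expr: |
          sum(rate(subscription_callback_failures_total[5m]))
            / sum(rate(subscription_callback_attempts_total[5m])) > 0.05
        for: 5m
        labels: { severity: page }
      - alert: KafkaConsumerLagHigh
        expr: kafka_consumergroup_lag > 1000
        for: 5m
        labels: { severity: warn }
```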
- Subscription lifecycle: initiate → receive events → heartbeat → terminate.
- Callback delivery under Router pod restart (verify resumable pattern).
- Message queue consumer failover (kill one subgraph pod, verify rebalancing).
- Tool: k6 or Artillery with GraphQL subscription support.
- Scenarios:
- 100 concurrent subscriptions, steady event stream (1 event/sec).
- 500 concurrent subscriptions, burst event stream (50 events/sec for 10s).
- Router pod failure during active subscriptions (chaos testing).
- Measure: event delivery latency (p50, p95, p99), message loss rate, recovery time.
- Kill a Router pod — verify subscriptions reconnect via client `restart()` + resumable cursor.
- Kill a Subgraph pod — verify Kafka/RabbitMQ rebalances, no message loss.
- Redis connection interruption — verify graceful degradation.
- `README.md` — Quick start guide (Docker Compose + K8s).
- `docs/ARCHITECTURE.md` — System design document with diagrams.
- `docs/RUNBOOK.md` — Operational runbook (common issues, scaling procedures).
- `docs/ENV_VARS.md` — Complete environment variable reference.
- Helm `values.yaml` with commented defaults for environment customization.
- All infrastructure is defined as code (Docker Compose + Helm).
- No hardcoded URLs — all endpoints configurable via env vars.
- Redis, Kafka/RabbitMQ can be swapped for managed services via config.
- Router callback URL construction is dynamic (pod name + namespace).
- CI/CD pipeline builds and pushes container images to configurable registry.
- Helm chart supports overrides for different cloud providers (AWS/GCP/Azure).
| Layer | Technology | Purpose |
|---|---|---|
| Client | React 18+ (Vite), Apollo Client v4 | UI + GraphQL subscription consumer |
| API Gateway | Kong / AWS API Gateway (optional) | Auth, rate limiting, TLS termination |
| Router | Apollo Router v2.x (Enterprise) | Federated gateway, subscription callback orchestration |
| Subgraphs | Node.js + Apollo Server v4 | Domain logic, callback protocol implementation |
| State Store | Redis 7+ | Subscription metadata, TTL, cursor tracking |
| Message Queue | Apache Kafka or RabbitMQ | Event-driven communication between services |
| Event Store | PostgreSQL (optional) | Persisted events for resumable subscriptions |
| Observability | Prometheus, Grafana, OTEL Collector | Metrics, dashboards, distributed tracing |
| Infra | Docker Compose, Kubernetes (Helm) | Environment-portable deployment |
| Week | Phase | Key Deliverables |
|---|---|---|
| 1 | Foundation | Repo scaffolding, Docker Compose stack, K8s manifests (draft) |
| 2 | Infrastructure + Schema | Redis/Kafka running, federated schema designed, subgraph stubs |
| 3 | Subgraph + Router | Callback protocol implemented, Router config finalized, link chain |
| 4 | React App + Observability | Both pages functional, Grafana dashboards, OTEL integration |
| 5 | Testing + Docs | Load tests, chaos tests, documentation, portability validation |
| Risk | Impact | Mitigation |
|---|---|---|
| Callback URL not resolvable between pods | Subscription delivery fails | Use StatefulSet + Headless Service; validate DNS resolution in init container |
| Redis TTL expiration race condition | Subscription dropped prematurely | Set heartbeat interval < TTL/2; implement TTL renewal with atomic Redis commands |
| Kafka consumer lag under burst load | Delayed event delivery | Tune partition count, consumer threads; add HPA for subgraph pods |
| Client reconnection thundering herd | Router overwhelmed after outage | Implement jittered backoff in RetryLink; stagger client reconnection |
| Router pod restart loses active subscriptions | Clients must re-subscribe | Resumable subscription pattern with cursor; client-side restart() on error |