Philosophy: Build bottom-up. Each phase is independently verifiable before moving to the next. Every phase ends with a "Checkpoint" — a concrete test that proves the layer works.
```
apollo-subscriptions-at-scale/
├── docker-compose.yml              # Phase 1–5 local orchestration
├── docker-compose.infra.yml        # Redis + Kafka standalone
├── subgraphs/
│   ├── notifications/              # Subgraph A — Plugin approach (ApolloServerPluginSubscriptionCallback)
│   │   ├── src/
│   │   │   ├── index.ts            # Registers the subscription callback plugin
│   │   │   ├── schema.graphql
│   │   │   ├── resolvers.ts
│   │   │   ├── pubsub.ts           # Kafka consumer → EventEmitter → AsyncIterator bridge
│   │   │   └── kafka.ts            # Kafka producer/consumer setup
│   │   ├── Dockerfile
│   │   └── package.json
│   ├── orders/                     # Subgraph B — Redis-backed approach (manual callback protocol)
│   │   ├── src/
│   │   │   ├── index.ts            # Express middleware intercepts callbackSpec=1.0 requests
│   │   │   ├── schema.graphql      # orderStatusChanged(orderId: ID!) subscription
│   │   │   ├── resolvers.ts        # Query/Mutation resolvers only (no subscription resolvers)
│   │   │   ├── redis.ts            # Subscription state: substate:{id} hash, subindex:{orderId} set
│   │   │   ├── heartbeat.ts        # Background loop: check → renew TTL / cleanup on 404
│   │   │   └── kafka.ts            # Kafka producer/consumer for order-status-changed topic
│   │   ├── Dockerfile
│   │   └── package.json
│   └── users/                      # Subgraph C — Query/Mutation only
│       ├── src/
│       │   ├── index.ts
│       │   ├── schema.graphql
│       │   └── resolvers.ts
│       ├── Dockerfile
│       └── package.json
├── router/
│   ├── router.yaml                 # Router config (subscriptions, Redis, etc.)
│   ├── supergraph.graphql          # Composed schema (rover output)
│   └── Dockerfile
├── web-app/
│   ├── src/
│   │   ├── App.tsx
│   │   ├── apollo/
│   │   │   ├── client.ts           # ApolloClient with RetryLink + ErrorLink
│   │   │   └── links.ts
│   │   ├── pages/
│   │   │   ├── Dashboard.tsx       # Live subscription feed
│   │   │   └── Notifications.tsx   # Subscription + Query hybrid
│   │   ├── hooks/
│   │   │   └── useResilientSubscription.ts
│   │   └── components/
│   │       ├── LiveFeed.tsx
│   │       ├── ConnectionStatus.tsx
│   │       └── ErrorBoundary.tsx
│   ├── Dockerfile
│   └── package.json
├── event-producer/                 # Simulates external events → Kafka
│   ├── produce.ts
│   └── Dockerfile
├── helm/
│   ├── Chart.yaml
│   ├── values.yaml
│   └── templates/
│       ├── redis-deployment.yaml
│       ├── kafka-statefulset.yaml
│       ├── router-statefulset.yaml # StatefulSet + headless Service
│       ├── notifications-deployment.yaml
│       ├── users-deployment.yaml
│       ├── web-app-deployment.yaml
│       └── _helpers.tpl
├── scripts/
│   ├── compose-supergraph.sh
│   ├── test-subscription.sh
│   └── load-test.sh                # k6 or Artillery script
└── README.md
```
## Phase 1: Redis + Kafka (infrastructure)

Goal: Stand up Redis and Kafka locally, prove they're reachable and functional.
- Create `docker-compose.infra.yml`:
  - Redis 7.x (single node, port 6379)
  - Kafka (KRaft mode — no ZooKeeper) via `bitnami/kafka:latest` on port 9092
  - Optional: Kafka UI (`provectuslabs/kafka-ui`) on port 8080 for visual inspection
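A minimal `docker-compose.infra.yml` sketch covering the list above. The KRaft environment values are typical Bitnami settings and an assumption here; adjust them to your image version:

```yaml
services:
  redis:
    image: redis:7
    container_name: redis
    ports: ["6379:6379"]

  kafka:
    image: bitnami/kafka:latest
    container_name: kafka
    ports: ["9092:9092"]
    environment:
      # Single-node KRaft: this broker is also the controller
      - KAFKA_CFG_NODE_ID=0
      - KAFKA_CFG_PROCESS_ROLES=controller,broker
      - KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=0@kafka:9093
      - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093
      - KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092
      - KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER

  kafka-ui:
    image: provectuslabs/kafka-ui:latest
    ports: ["8080:8080"]
    environment:
      - KAFKA_CLUSTERS_0_NAME=local
      - KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS=kafka:9092
```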
- Verify Redis:

```shell
docker compose -f docker-compose.infra.yml up -d
docker exec -it redis redis-cli PING          # → PONG
docker exec -it redis redis-cli SET test "hello"
docker exec -it redis redis-cli GET test
```
- Verify Kafka:

```shell
# Create a test topic
docker exec -it kafka kafka-topics.sh --create --topic test-events --bootstrap-server localhost:9092

# Produce a message
echo "hello-kafka" | docker exec -i kafka kafka-console-producer.sh --topic test-events --bootstrap-server localhost:9092

# Consume it
docker exec -it kafka kafka-console-consumer.sh --topic test-events --from-beginning --bootstrap-server localhost:9092
```
Checkpoint:

- Redis responds to `PING`
- Kafka topic created, message produced and consumed end-to-end
- Kafka UI shows the topic and message (optional but helpful)
## Phase 2: Notifications Subgraph (subscription producer)

Goal: Build a working Apollo Server subgraph that exposes Subscription fields and implements the HTTP callback protocol. Kafka events trigger subscription updates.
```graphql
# subgraphs/notifications/src/schema.graphql
extend schema @link(url: "https://specs.apollo.dev/federation/v2.5", import: ["@key", "@shareable"])

type Notification @key(fields: "id") {
  id: ID!
  type: String!
  message: String!
  severity: String! # INFO | WARN | CRITICAL
  timestamp: String!
  userId: String!
}

type Query {
  notifications(userId: String!): [Notification!]!
}

type Mutation {
  triggerNotification(userId: String!, type: String!, message: String!, severity: String!): Notification!
}

type Subscription {
  notificationCreated(userId: String): Notification!
  systemAlert: Notification!
}
```

- Apollo Server 4.x with `@apollo/server` and `@apollo/subgraph`
- Enable `ApolloServerPluginSubscriptionCallback` from `@apollo/server/plugin/subscriptionCallback`:

```typescript
import { ApolloServerPluginSubscriptionCallback } from '@apollo/server/plugin/subscriptionCallback';

const server = new ApolloServer({
  schema: buildSubgraphSchema({ typeDefs, resolvers }),
  plugins: [ApolloServerPluginSubscriptionCallback()],
});
```

- Kafka → AsyncIterator bridge (`pubsub.ts`):
  - Use `kafkajs` to consume from the `notification-events` topic
  - Convert Kafka consumer messages into an `AsyncIterator` that Apollo Server's subscription resolver returns
  - Each Kafka message = one subscription event pushed to the Router via callback
- Mutation `triggerNotification`: produces a Kafka message to the `notification-events` topic (simulates an external system)
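The EventEmitter → AsyncIterator half of the bridge can be sketched with Node's stdlib alone; in the real `pubsub.ts`, the kafkajs `eachMessage` handler would call `publish` once per Kafka message. The `publish`/`subscribe` names are illustrative:

```typescript
import { EventEmitter, on } from 'node:events';

// In-process event bus. kafkajs's eachMessage handler calls publish() with
// the parsed Kafka message — one publish per Kafka message.
const emitter = new EventEmitter();

export function publish(topic: string, payload: unknown): void {
  emitter.emit(topic, payload);
}

// events.on() turns an EventEmitter into an AsyncIterator — this is what the
// subscription resolver returns, e.g.:
//   Subscription: {
//     notificationCreated: { subscribe: () => subscribe('NOTIFICATION_CREATED') },
//   }
export async function* subscribe<T>(topic: string): AsyncGenerator<T> {
  for await (const [payload] of on(emitter, topic)) {
    yield payload as T;
  }
}
```

Calling the generator's `next()` attaches the listener synchronously, so events published afterwards are delivered to every pending subscriber.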
- Scaffold the subgraph with `npm init` + install deps (`@apollo/server`, `@apollo/subgraph`, `kafkajs`, `graphql-tag`)
- Implement schema + resolvers with in-memory PubSub first (no Kafka yet)
- Test standalone — run the subgraph, execute a subscription via GraphQL Playground/Sandbox (WebSocket mode)
- Swap PubSub for the Kafka-backed AsyncIterator
- Dockerize the subgraph
Checkpoint:

- Subgraph starts and exposes a `/graphql` endpoint
- `query { notifications(userId: "u1") }` returns data
- `mutation { triggerNotification(...) }` produces a Kafka message
- Subscription resolves events from Kafka when consumed locally (via WebSocket in Sandbox, before the Router is involved)
## Phase 3: Users Subgraph (federation companion)

Goal: Build a simple companion subgraph to demonstrate federated subscriptions where Router stitches data from multiple subgraphs.
```graphql
# subgraphs/users/src/schema.graphql
extend schema @link(url: "https://specs.apollo.dev/federation/v2.5", import: ["@key", "@shareable", "@external"])

type User @key(fields: "id") {
  id: ID!
  name: String!
  email: String!
  role: String!
}

type Query {
  user(id: ID!): User
  users: [User!]!
}
```

Extend Notification to resolve user info:

```graphql
# In the notifications subgraph OR the users subgraph (reference resolver)
extend type Notification @key(fields: "id") {
  id: ID! @external
  userId: String! @external
  user: User
}
```

- Build and Dockerize the Users subgraph
- Seed with mock user data
- Test standalone queries

Checkpoint:

- `query { user(id: "u1") { name email } }` works against the Users subgraph directly
- Both subgraph Docker images build and run
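A `resolvers.ts` sketch for the Users subgraph, including the federation reference resolver described above. The mock data is illustrative:

```typescript
// subgraphs/users/src/resolvers.ts — sketch with in-memory mock data.
const users = [
  { id: 'u1', name: 'Ada', email: 'ada@example.com', role: 'admin' },
  { id: 'u2', name: 'Grace', email: 'grace@example.com', role: 'viewer' },
];

export const resolvers = {
  Query: {
    user: (_: unknown, { id }: { id: string }) =>
      users.find((u) => u.id === id) ?? null,
    users: () => users,
  },
  User: {
    // Federation reference resolver: the Router passes { __typename, id }
    // when another subgraph references a User entity.
    __resolveReference: (ref: { id: string }) =>
      users.find((u) => u.id === ref.id) ?? null,
  },
  Notification: {
    // Resolves Notification.user from the @external userId field.
    user: (notification: { userId: string }) =>
      users.find((u) => u.id === notification.userId) ?? null,
  },
};
```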
## Phase 4: Router + Supergraph (composition + callback protocol)

Goal: Compose the supergraph, configure Router with HTTP callback protocol for subscriptions and Redis for subscription state, and prove end-to-end subscription flow without the React app.
```shell
# scripts/compose-supergraph.sh
rover supergraph compose --config ./supergraph-config.yaml > router/supergraph.graphql
```

```yaml
# supergraph-config.yaml
federation_version: =2.5.0
subgraphs:
  notifications:
    routing_url: http://notifications:4001/graphql
    schema:
      file: ./subgraphs/notifications/src/schema.graphql
  users:
    routing_url: http://users:4002/graphql
    schema:
      file: ./subgraphs/users/src/schema.graphql
```

```yaml
# router/router.yaml
supergraph:
  listen: 0.0.0.0:4000

subscription:
  enabled: true
  mode:
    callback:
      public_url: http://router:4000/callback # Internal Docker DNS
      listen: 0.0.0.0:4000
      path: /callback
      heartbeat_interval: 5s
      subgraphs:
        - notifications
  # Redis for subscription state (callback URLs, TTLs, verifiers)
  queue_capacity: 32768

# Redis configuration for distributed subscription state
apq:
  router:
    cache:
      redis:
        urls: ["redis://redis:6379"]

# If using Redis for subscription deduplication or state
# (check your Router version for exact config keys)
preview_entity_cache:
  enabled: false

# Subgraph routing overrides (if needed)
override_subgraph_url:
  notifications: http://notifications:4001/graphql
  users: http://users:4002/graphql

# Helpful for debugging
telemetry:
  exporters:
    logging:
      stdout:
        enabled: true
        format: json
```

Add the Router, both subgraphs, Redis, and Kafka to `docker-compose.yml` with proper `depends_on` and networking.
```shell
# 1. Query (federated) — proves composition works
curl -X POST http://localhost:4000/ \
  -H "Content-Type: application/json" \
  -d '{"query": "{ notifications(userId: \"u1\") { id message user { name } } }"}'

# 2. Subscription via multipart HTTP (what Apollo Client uses)
curl -N -X POST http://localhost:4000/ \
  -H "Content-Type: application/json" \
  -H 'Accept: multipart/mixed;boundary="graphql";subscriptionSpec=1.0' \
  -d '{"query": "subscription { notificationCreated(userId: \"u1\") { id message severity } }"}'

# 3. In another terminal, trigger an event
curl -X POST http://localhost:4000/ \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { triggerNotification(userId: \"u1\", type: \"ALERT\", message: \"Server CPU > 90%\", severity: \"CRITICAL\") { id } }"}'

# → The subscription curl should receive a multipart chunk with the new notification
```

Checkpoint:

- `rover supergraph compose` succeeds
- Router starts with no errors and connects to Redis
- Federated query resolves data across both subgraphs
- Subscription via cURL receives real-time multipart events when the mutation fires
- Router logs show the callback protocol handshake (init → check → next)
- Redis contains subscription state keys (inspect with `redis-cli KEYS *`)
## Phase 5: React Web App (client-side subscriptions + resilience)

Goal: Build a React app that subscribes via the Router's multipart HTTP protocol, with robust error handling and retry logic.
```typescript
// web-app/src/apollo/client.ts
import { ApolloClient, InMemoryCache, HttpLink, ApolloLink } from '@apollo/client';
import { RetryLink } from '@apollo/client/link/retry';
import { onError } from '@apollo/client/link/error';
// Note: for Router multipart subscriptions, HttpLink handles the protocol
// natively in Apollo Client 4.x — no GraphQLWsLink/split needed.

// Error Link — handles GraphQL errors, protocol errors, network errors
const errorLink = onError(({ graphQLErrors, protocolErrors, networkError }) => {
  if (graphQLErrors) {
    graphQLErrors.forEach(({ message, path }) =>
      console.error(`[GraphQL error]: Message: ${message}, Path: ${path}`)
    );
  }
  // CombinedProtocolErrors — new in Apollo Client 4.x
  if (protocolErrors) {
    protocolErrors.forEach(({ message, extensions }) =>
      console.error(`[Protocol error]: ${message}`, extensions)
    );
  }
  if (networkError) {
    console.error(`[Network error]: ${networkError.message}`);
    // Could dispatch to a global state for a UI indicator
  }
});

// Retry Link — exponential backoff for transient network failures
const retryLink = new RetryLink({
  delay: {
    initial: 300, // ms
    max: 10000,   // cap at 10s
    jitter: true, // randomize to avoid thundering herd
  },
  attempts: {
    max: 5,
    retryIf: (error, _operation) => {
      // Retry only on network errors, not GraphQL errors
      return !!error && !error.statusCode; // no status = network failure
    },
  },
});

const httpLink = new HttpLink({
  uri: 'http://localhost:4000/',
  // Apollo Client 4.x automatically negotiates multipart for subscriptions
});

const link = ApolloLink.from([retryLink, errorLink, httpLink]);

export const client = new ApolloClient({
  link,
  cache: new InMemoryCache(),
  defaultOptions: {
    watchQuery: { errorPolicy: 'all' },
    query: { errorPolicy: 'all' },
  },
});
```

`useResilientSubscription.ts` wraps `useSubscription` with connection status tracking, an auto-reconnect indicator, and error categorization:

```typescript
import { useSubscription, type DocumentNode, type TypedDocumentNode } from '@apollo/client';
import { useState, useCallback } from 'react';

export function useResilientSubscription<TData, TVars extends Record<string, any> = Record<string, any>>(
  subscription: DocumentNode | TypedDocumentNode<TData, TVars>,
  options?: { variables?: TVars; onData?: (data: TData) => void }
) {
  const [connectionStatus, setConnectionStatus] =
    useState<'connecting' | 'connected' | 'error' | 'reconnecting'>('connecting');

  const { data, loading, error, restart } = useSubscription(subscription, {
    variables: options?.variables,
    onData: ({ data: { data: subData } }) => {
      setConnectionStatus('connected');
      if (subData && options?.onData) options.onData(subData as TData);
    },
    onError: (err) => {
      setConnectionStatus('error');
      console.error('[Subscription error]', err);
    },
    onComplete: () => {
      setConnectionStatus('connecting'); // will re-init
    },
  });

  const reconnect = useCallback(() => {
    setConnectionStatus('reconnecting');
    restart(); // Apollo Client 4.x useSubscription supports restart()
  }, [restart]);

  return { data, loading, error, connectionStatus, reconnect };
}
```

- Dashboard (`Dashboard.tsx`)
  - Uses `useResilientSubscription` for the `systemAlert` subscription
  - Displays a live feed of system-wide alerts with severity badges
  - Shows a `<ConnectionStatus />` indicator (green dot / yellow / red)
  - "Reconnect" button that calls `reconnect()`
- Notifications (`Notifications.tsx`)
  - Uses `useQuery` to fetch historical notifications
  - Uses `useResilientSubscription` for `notificationCreated(userId)` to append new ones in real time
  - Demonstrates the hybrid query + subscription pattern
  - An error boundary wraps the page to catch unhandled GraphQL errors
- Scaffold the React app with Vite + TypeScript
- Install `@apollo/client` (v4.x)
- Build the Apollo Client with the RetryLink + ErrorLink chain
- Build the Dashboard page — verify the subscription receives multipart events
- Build the Notifications page — verify the query + subscription hybrid
- Add the `<ConnectionStatus />` component
- Dockerize the React app (nginx serving the build)
Checkpoint:

- React app loads at `http://localhost:3000`
- Dashboard shows the live subscription feed
- Firing the `triggerNotification` mutation (via cURL or the Notifications page) appears in real time on the Dashboard
- Killing and restarting a subgraph shows error → reconnect behavior in the UI
- Browser DevTools Network tab shows multipart HTTP chunks arriving (not a WebSocket)
## Phase 6: Scale Testing (multi-pod Docker Compose)

Goal: Run 2 Router pods and 2 subgraph pods locally to prove the callback protocol routes correctly at scale. The orders subgraph's Redis-backed architecture makes this straightforward — any pod can service any event.
| | notifications (plugin) | orders (Redis-backed) |
|---|---|---|
| Subscription state | In-process EventEmitter | Redis `substate:{id}` hash |
| Which pod can deliver? | Only the pod that got the init | Any pod (queries Redis for the callbackUrl) |
| Add a second pod | Events on the wrong partition are dropped | Kafka partitions auto-rebalance; both pods deliver |
| Pod crash | Active subscriptions on that pod are lost | Redis state survives; surviving pods take over |
```shell
docker compose up --scale orders=2
```

Both pods join the same Kafka consumer group (`orders-subgraph-group`), and Kafka distributes partitions between them. When a Kafka event arrives on Pod B for an `orderId` whose subscription was initiated through Router Pod A:

1. Pod B calls `getSubscriptionsByOrderId(orderId)` → Redis `SMEMBERS subindex:{orderId}`
2. Gets back the `subscriptionId` → calls `getSubscription(id)` → Redis `HGETALL substate:{id}`
3. POSTs the `next` payload to the `callbackUrl` stored in Redis
4. Router Pod A (whose URL is in Redis) receives the callback and pushes it to the client
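The Redis lookups behind those steps can be sketched as a thin store. The `KV` interface below mirrors the handful of Redis commands used (`HSET`, `HGETALL`, `SADD`, `SMEMBERS`, `EXPIRE`) so it can be backed by ioredis in production or an in-memory fake in tests; key names follow the `substate:{id}` / `subindex:{orderId}` layout from the tree, but the class and method names are illustrative, not the repo's actual `redis.ts`:

```typescript
// Minimal command surface — implement with ioredis in production.
export interface KV {
  hset(key: string, value: Record<string, string>): Promise<unknown>;
  hgetall(key: string): Promise<Record<string, string>>;
  sadd(key: string, member: string): Promise<unknown>;
  smembers(key: string): Promise<string[]>;
  expire(key: string, seconds: number): Promise<unknown>;
}

export interface SubState extends Record<string, string> {
  callbackUrl: string;
  verifier: string;
  orderId: string;
}

export class SubscriptionStore {
  constructor(private redis: KV) {}

  // On subscription init: persist callback state and index it by orderId.
  async save(id: string, state: SubState, ttlSec = 60): Promise<void> {
    await this.redis.hset(`substate:${id}`, state);
    await this.redis.expire(`substate:${id}`, ttlSec);
    await this.redis.sadd(`subindex:${state.orderId}`, id);
  }

  // On a Kafka event (any pod): find every subscription for the order.
  async byOrderId(orderId: string): Promise<SubState[]> {
    const ids = await this.redis.smembers(`subindex:${orderId}`);
    const states = await Promise.all(
      ids.map((id) => this.redis.hgetall(`substate:${id}`))
    );
    // Drop entries whose hash already expired.
    return states.filter((s) => Object.keys(s).length > 0) as SubState[];
  }
}
```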
No coordination needed between subgraph pods — Redis is the shared source of truth.
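The `heartbeat.ts` background loop (check → renew TTL / cleanup on 404, per the tree above) reduces to a small dependency-injected routine. The `HeartbeatDeps` names are illustrative; the actual check POST body must follow the callback protocol spec:

```typescript
export interface HeartbeatDeps {
  listIds(): Promise<string[]>;        // all active subscription ids in Redis
  check(id: string): Promise<number>;  // HTTP status of the `check` POST to the callbackUrl
  renewTtl(id: string): Promise<void>; // e.g. EXPIRE substate:{id}
  cleanup(id: string): Promise<void>;  // e.g. DEL substate:{id} + SREM from subindex
}

// One pass of the loop; run on an interval shorter than the Router's
// heartbeat_interval expectations.
export async function heartbeatOnce(deps: HeartbeatDeps): Promise<void> {
  for (const id of await deps.listIds()) {
    const status = await deps.check(id);
    if (status === 404) {
      // Router no longer knows this subscription — drop our state.
      await deps.cleanup(id);
    } else if (status >= 200 && status < 300) {
      // Subscription still live — keep its Redis state from expiring.
      await deps.renewTtl(id);
    }
  }
}
```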
```shell
# Docker Compose with named instances (StatefulSet pattern):
docker compose up --scale router=2
```

Add an nginx reverse proxy in front of the Router instances to distribute client connections. Each Router pod must advertise its own stable hostname as `CALLBACK_PUBLIC_URL` so subgraphs send callbacks to the right pod. In Kubernetes this uses a StatefulSet + headless Service (see Phase 7).
Checkpoint:

- 2 Router pods + 2 `orders` subgraph pods running
- A subscription initiated through Router Pod A receives events regardless of which subgraph pod consumed the Kafka message
- `redis-cli KEYS 'substate:*'` shows active subscription state (callbackUrl, verifier, orderId)
- Kafka UI shows 2 members in the consumer group `orders-subgraph-group`
- Killing one subgraph pod — Kafka rebalances and the surviving pod continues delivering events
## Phase 7: Helm Charts (portable Kubernetes deployment)

Goal: Package everything into Helm charts that can deploy to any Kubernetes cluster (local or cloud).
| Service | K8s Resource | Why |
|---|---|---|
| Router | StatefulSet + Headless Service | Each pod needs a stable DNS name (router-0.router-headless.ns.svc) for callback URLs |
| Notifications Subgraph | Deployment (2 replicas) | Stateless; Kafka consumer group handles distribution |
| Users Subgraph | Deployment (2 replicas) | Stateless |
| Redis | StatefulSet (1 replica) or Bitnami Helm chart | Persistent storage for subscription state |
| Kafka | StatefulSet (Bitnami Helm chart) | Message ordering guarantees |
| Web App | Deployment + Ingress | Static files served by nginx |
```yaml
# helm/templates/router-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: {{ include "app.fullname" . }}-router
spec:
  serviceName: router-headless
  replicas: {{ .Values.router.replicas }} # default: 2
  selector:
    matchLabels:
      app: router
  template:
    metadata:
      labels:
        app: router
    spec:
      containers:
        - name: router
          image: "{{ .Values.router.image.repository }}:{{ .Values.router.image.tag }}"
          ports:
            - containerPort: 4000
          env:
            # Each pod computes its own callback URL using the pod's stable hostname
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: APOLLO_ROUTER_CONFIG_PATH
              value: /config/router.yaml
            - name: CALLBACK_PUBLIC_URL
              value: "http://$(POD_NAME).router-headless.{{ .Release.Namespace }}.svc.cluster.local:4000/callback"
          volumeMounts:
            - name: router-config
              mountPath: /config
      volumes:
        - name: router-config
          configMap:
            name: router-config
---
apiVersion: v1
kind: Service
metadata:
  name: router-headless
spec:
  clusterIP: None # Headless — enables per-pod DNS
  selector:
    app: router
  ports:
    - port: 4000
      targetPort: 4000
---
apiVersion: v1
kind: Service
metadata:
  name: router # Client-facing Service (load-balanced)
spec:
  type: ClusterIP
  selector:
    app: router
  ports:
    - port: 4000
      targetPort: 4000
```

```yaml
# helm/values.yaml
router:
  replicas: 2
  image:
    repository: ghcr.io/apollographql/router
    tag: v2.1.1
  redis:
    url: redis://redis:6379
notifications:
  replicas: 2
  kafka:
    brokers: kafka-0.kafka-headless:9092
users:
  replicas: 2
redis:
  enabled: true # Or use the bitnami/redis subchart
kafka:
  enabled: true # Or use the bitnami/kafka subchart
```

- Create the Helm chart structure
- Template all K8s resources with a configurable `values.yaml`
- Deploy to local K8s (Minikube / kind / Docker Desktop Kubernetes): `helm install apollo-subs-demo ./helm`
- Port-forward the Router service: `kubectl port-forward svc/router 4000:4000`
- Verify the same flow as Phases 5–6, now on Kubernetes
Checkpoint:

- `helm install` succeeds, all pods are Running
- `kubectl get pods` shows: 2 router pods, 2 notifications pods, 2 users pods, 1 redis, 1 kafka
- Subscriptions work via `kubectl port-forward`
- Kill a Router pod → Kubernetes recreates it → subscriptions on that pod re-establish
- Kill a Notifications subgraph pod → Kafka rebalances → no event loss
## Phase 8: Chaos Testing (resilience proof)

Goal: Prove the system handles failures gracefully and the React app recovers.
| Test | Action | Expected Outcome |
|---|---|---|
| Subgraph crash | `kubectl delete pod notifications-0` | Kafka rebalances, no duplicate events. React UI shows a brief error, then reconnects |
| Router pod crash | `kubectl delete pod router-0` | Subscriptions on that pod fail. Client retries (RetryLink). New subscription goes through router-1 |
| Redis restart | `kubectl delete pod redis-0` | Router logs warn about lost subscription state. Active subscriptions may need to re-initialize. Client sees a reconnect cycle |
| Kafka broker restart | Restart the Kafka pod | Consumers reconnect after `session.timeout.ms`. Brief pause in events, then resume |
| Network partition | `kubectl exec router-0 -- iptables -A OUTPUT -d notifications -j DROP` | Heartbeat fails → Router terminates the subscription → Client retries. Error Link fires |
| High load | Run `scripts/load-test.sh` (100 concurrent subscriptions + 50 events/sec) | All events delivered, no OOM, Redis memory stable |
Checkpoint:

- `<ConnectionStatus />` transitions: connected → error → reconnecting → connected
- No stale data displayed after reconnection
- The error boundary catches and displays unrecoverable errors
- RetryLink logs show exponential backoff in the browser console
- All chaos scenarios documented with results
- React app recovers from every transient failure within the RetryLink budget (5 attempts)
- No data loss for events published during a subgraph pod restart (Kafka guarantees)
```
Phase 1: Redis + Kafka (infrastructure)
        ↓
Phase 2: Notifications Subgraph (subscription producer)
        ↓
Phase 3: Users Subgraph (federation companion)
        ↓
Phase 4: Router + Supergraph (composition + callback protocol)
        ↓
Phase 5: React Web App (client-side subscriptions + resilience)
        ↓
Phase 6: Scale Testing (multi-pod Docker Compose)
        ↓
Phase 7: Helm Charts (portable Kubernetes deployment)
        ↓
Phase 8: Chaos Testing (resilience proof)
```
Each phase is a PR-able unit — you can review, merge, and demo each independently.