Apollo Subscriptions at Scale — Phased Implementation Roadmap

Philosophy: Build bottom-up. Each phase is independently verifiable before moving to the next. Every phase ends with a "Checkpoint" — a concrete test that proves the layer works.


Repository Structure (Monorepo)

apollo-subscriptions-at-scale/
├── docker-compose.yml                  # Phase 1–5 local orchestration
├── docker-compose.infra.yml            # Redis + Kafka standalone
├── subgraphs/
│   ├── notifications/                  # Subgraph A — Plugin approach (ApolloServerPluginSubscriptionCallback)
│   │   ├── src/
│   │   │   ├── index.ts               # Registers the subscription callback plugin
│   │   │   ├── schema.graphql
│   │   │   ├── resolvers.ts
│   │   │   ├── pubsub.ts              # Kafka consumer → EventEmitter → AsyncIterator bridge
│   │   │   └── kafka.ts               # Kafka producer/consumer setup
│   │   ├── Dockerfile
│   │   └── package.json
│   ├── orders/                         # Subgraph B — Redis-backed approach (manual callback protocol)
│   │   ├── src/
│   │   │   ├── index.ts               # Express middleware intercepts callbackSpec=1.0 requests
│   │   │   ├── schema.graphql         # orderStatusChanged(orderId: ID!) subscription
│   │   │   ├── resolvers.ts           # Query/Mutation resolvers only (no subscription resolvers)
│   │   │   ├── redis.ts               # Subscription state: substate:{id} hash, subindex:{orderId} set
│   │   │   ├── heartbeat.ts           # Background loop: check → renew TTL / cleanup on 404
│   │   │   └── kafka.ts               # Kafka producer/consumer for order-status-changed topic
│   │   ├── Dockerfile
│   │   └── package.json
│   └── users/                          # Subgraph C — Query/Mutation only
│       ├── src/
│       │   ├── index.ts
│       │   ├── schema.graphql
│       │   └── resolvers.ts
│       ├── Dockerfile
│       └── package.json
├── router/
│   ├── router.yaml                     # Router config (subscriptions, Redis, etc.)
│   ├── supergraph.graphql              # Composed schema (rover output)
│   └── Dockerfile
├── web-app/
│   ├── src/
│   │   ├── App.tsx
│   │   ├── apollo/
│   │   │   ├── client.ts              # ApolloClient with RetryLink + ErrorLink
│   │   │   └── links.ts
│   │   ├── pages/
│   │   │   ├── Dashboard.tsx           # Live subscription feed
│   │   │   └── Notifications.tsx       # Subscription + Query hybrid
│   │   ├── hooks/
│   │   │   └── useResilientSubscription.ts
│   │   └── components/
│   │       ├── LiveFeed.tsx
│   │       ├── ConnectionStatus.tsx
│   │       └── ErrorBoundary.tsx
│   ├── Dockerfile
│   └── package.json
├── event-producer/                     # Simulates external events → Kafka
│   ├── produce.ts
│   └── Dockerfile
├── helm/
│   ├── Chart.yaml
│   ├── values.yaml
│   └── templates/
│       ├── redis-deployment.yaml
│       ├── kafka-statefulset.yaml
│       ├── router-statefulset.yaml     # StatefulSet + headless Service
│       ├── notifications-deployment.yaml
│       ├── users-deployment.yaml
│       ├── web-app-deployment.yaml
│       └── _helpers.tpl
├── scripts/
│   ├── compose-supergraph.sh
│   ├── test-subscription.sh
│   └── load-test.sh                   # k6 or Artillery script
└── README.md

Phase 1: Infrastructure Foundation (Redis + Kafka)

Goal: Stand up Redis and Kafka locally, prove they're reachable and functional.

Steps

  1. Create docker-compose.infra.yml

    • Redis 7.x (single node, port 6379)
    • Kafka (KRaft mode — no ZooKeeper) via bitnami/kafka:latest on port 9092
    • Optional: Kafka UI (provectuslabs/kafka-ui) on port 8080 for visual inspection
  2. Verify Redis

    docker compose -f docker-compose.infra.yml up -d
    docker exec -it redis redis-cli PING    # → PONG
    docker exec -it redis redis-cli SET test "hello" && docker exec -it redis redis-cli GET test
  3. Verify Kafka

    # Create a test topic
    docker exec -it kafka kafka-topics.sh --create --topic test-events --bootstrap-server localhost:9092
    # Produce a message
    echo "hello-kafka" | docker exec -i kafka kafka-console-producer.sh --topic test-events --bootstrap-server localhost:9092
    # Consume it
    docker exec -it kafka kafka-console-consumer.sh --topic test-events --from-beginning --bootstrap-server localhost:9092

✅ Checkpoint 1

  • Redis responds to PING
  • Kafka topic created, message produced and consumed end-to-end
  • Kafka UI shows the topic and message (optional but helpful)

Phase 2: Notifications Subgraph with Callback Protocol

Goal: Build a working Apollo Server subgraph that exposes Subscription fields and implements the HTTP callback protocol. Kafka events trigger subscription updates.

Schema Design

# subgraphs/notifications/src/schema.graphql
extend schema @link(url: "https://specs.apollo.dev/federation/v2.5", import: ["@key", "@shareable"])

type Notification @key(fields: "id") {
  id: ID!
  type: String!
  message: String!
  severity: String!          # INFO | WARN | CRITICAL
  timestamp: String!
  userId: String!
}

type Query {
  notifications(userId: String!): [Notification!]!
}

type Mutation {
  triggerNotification(userId: String!, type: String!, message: String!, severity: String!): Notification!
}

type Subscription {
  notificationCreated(userId: String): Notification!
  systemAlert: Notification!
}

Key Implementation Details

  • Apollo Server 4.x with @apollo/server and @apollo/subgraph
  • Enable ApolloServerPluginSubscriptionCallback from @apollo/server/plugin/subscriptionCallback
    import { ApolloServer } from '@apollo/server';
    import { buildSubgraphSchema } from '@apollo/subgraph';
    import { ApolloServerPluginSubscriptionCallback } from '@apollo/server/plugin/subscriptionCallback';
    
    const server = new ApolloServer({
      schema: buildSubgraphSchema({ typeDefs, resolvers }),
      plugins: [
        ApolloServerPluginSubscriptionCallback(),
      ],
    });
  • Kafka → AsyncIterator bridge (pubsub.ts):
    • Use kafkajs to consume from notification-events topic
    • Convert Kafka consumer messages into an AsyncIterator that Apollo Server's subscription resolver returns
    • Each Kafka message = one subscription event pushed to Router via callback
  • Mutation triggerNotification: Produces a Kafka message to notification-events topic (simulates an external system)
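The EventEmitter → AsyncIterator bridge at the heart of pubsub.ts can be sketched as follows. This is a minimal in-memory version (the class and method names are illustrative, not from the repo); in the real subgraph, a kafkajs eachMessage handler would call publish():

```typescript
import { EventEmitter } from 'node:events';

// Minimal EventEmitter → AsyncIterator bridge: publish() pushes events,
// asyncIterator() yields them to the subscription resolver.
class SimplePubSub {
  private emitter = new EventEmitter();

  publish(topic: string, payload: unknown): void {
    this.emitter.emit(topic, payload);
  }

  asyncIterator<T>(topic: string): AsyncIterableIterator<T> {
    const queue: T[] = [];
    const pending: Array<(r: IteratorResult<T>) => void> = [];
    const handler = (payload: T) => {
      const resolve = pending.shift();
      if (resolve) resolve({ value: payload, done: false });
      else queue.push(payload); // no consumer waiting yet; buffer the event
    };
    this.emitter.on(topic, handler);
    return {
      next: async (): Promise<IteratorResult<T>> => {
        if (queue.length > 0) return { value: queue.shift() as T, done: false };
        return new Promise((resolve) => pending.push(resolve));
      },
      return: async () => {
        this.emitter.off(topic, handler); // unsubscribe on iterator teardown
        return { value: undefined, done: true as const };
      },
      throw: async (err) => {
        this.emitter.off(topic, handler);
        throw err;
      },
      [Symbol.asyncIterator]() { return this; },
    };
  }
}

// Demo: consume two events, as a subscription resolver would
async function demo(): Promise<string[]> {
  const pubsub = new SimplePubSub();
  const iter = pubsub.asyncIterator<{ message: string }>('NOTIFICATION_CREATED');
  const received: string[] = [];
  const consumer = (async () => {
    for await (const event of iter) {
      received.push(event.message);
      if (received.length === 2) break;
    }
  })();
  pubsub.publish('NOTIFICATION_CREATED', { message: 'first' });
  pubsub.publish('NOTIFICATION_CREATED', { message: 'second' });
  await consumer;
  return received;
}

demo().then((r) => console.log(r.join(','))); // prints "first,second"
```

Swapping Kafka in later (step 4 below) only changes who calls publish(): the kafkajs consumer's message handler instead of the mutation resolver.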

Steps

  1. Scaffold the subgraph with npm init + install deps (@apollo/server, @apollo/subgraph, kafkajs, graphql-tag)
  2. Implement schema + resolvers with in-memory PubSub first (no Kafka yet)
  3. Test standalone — Run subgraph, execute subscription via GraphQL Playground/Sandbox (WebSocket mode)
  4. Swap PubSub for Kafka-backed AsyncIterator
  5. Dockerize the subgraph

✅ Checkpoint 2

  • Subgraph starts and exposes /graphql endpoint
  • query { notifications(userId: "u1") } returns data
  • mutation { triggerNotification(...) } produces Kafka message
  • Subscription resolves events from Kafka when consumed locally (via WebSocket in Sandbox before Router is involved)

Phase 3: Users Subgraph (Query/Mutation Only)

Goal: Build a simple companion subgraph to demonstrate federated subscriptions where Router stitches data from multiple subgraphs.

Schema

# subgraphs/users/src/schema.graphql
extend schema @link(url: "https://specs.apollo.dev/federation/v2.5", import: ["@key", "@shareable", "@external", "@requires"])

type User @key(fields: "id") {
  id: ID!
  name: String!
  email: String!
  role: String!
}

type Query {
  user(id: ID!): User
  users: [User!]!
}

Extend Notification to resolve user info:

# In the users subgraph (the side that resolves user); userId is owned by the
# notifications subgraph, so it is @external here and pulled in via @requires
extend type Notification @key(fields: "id") {
  id: ID! @external
  userId: String! @external
  user: User @requires(fields: "userId")
}
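A sketch of the matching resolvers on the users side (the mock data and exact shapes are assumptions, not repo code; @apollo/subgraph invokes __resolveReference when another subgraph references a User entity):

```typescript
// Illustrative resolver map for the users subgraph. The Router supplies the
// entity representation ({ id, userId }) for Notification, and we resolve the
// user field from the userId it carries.
type User = { id: string; name: string; email: string; role: string };

// Mock user store (Phase 3, step 2: seed with mock data)
const usersById: Record<string, User> = {
  u1: { id: 'u1', name: 'Ada', email: 'ada@example.com', role: 'ADMIN' },
};

const resolvers = {
  User: {
    // Entity reference resolver: called when another subgraph refers to a User by key
    __resolveReference(ref: { id: string }): User | undefined {
      return usersById[ref.id];
    },
  },
  Notification: {
    // userId arrives because of @requires(fields: "userId") in the schema above
    user(notification: { id: string; userId: string }): User | undefined {
      return usersById[notification.userId];
    },
  },
};

console.log(resolvers.Notification.user({ id: 'n1', userId: 'u1' })?.name); // prints "Ada"
```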

Steps

  1. Build and Dockerize the Users subgraph
  2. Seed with mock user data
  3. Test standalone queries

✅ Checkpoint 3

  • query { user(id: "u1") { name email } } works against Users subgraph directly
  • Both subgraph Docker images build and run

Phase 4: Apollo Router + Supergraph Composition

Goal: Compose the supergraph, configure Router with HTTP callback protocol for subscriptions and Redis for subscription state, and prove end-to-end subscription flow without the React app.

Supergraph Composition

# scripts/compose-supergraph.sh
rover supergraph compose --config ./supergraph-config.yaml > router/supergraph.graphql

# supergraph-config.yaml
federation_version: =2.5.0
subgraphs:
  notifications:
    routing_url: http://notifications:4001/graphql
    schema:
      file: ./subgraphs/notifications/src/schema.graphql
  users:
    routing_url: http://users:4002/graphql
    schema:
      file: ./subgraphs/users/src/schema.graphql

Router Configuration (router/router.yaml)

supergraph:
  listen: 0.0.0.0:4000

subscription:
  enabled: true
  mode:
    callback:
      public_url: http://router:4000/callback     # Internal Docker DNS
      listen: 0.0.0.0:4000
      path: /callback
      heartbeat_interval: 5s
      subgraphs:
        - notifications

  # In-memory queue of subscription events awaiting delivery to clients
  queue_capacity: 32768

# Redis-backed APQ cache (also verifies Router-to-Redis connectivity)
apq:
  router:
    cache:
      redis:
        urls: ["redis://redis:6379"]

# If using Redis for subscription deduplication or state
# (check your Router version for exact config keys)
preview_entity_cache:
  enabled: false

# Subgraph routing overrides (if needed)
override_subgraph_url:
  notifications: http://notifications:4001/graphql
  users: http://users:4002/graphql

# Helpful for debugging
telemetry:
  exporters:
    logging:
      stdout:
        enabled: true
        format: json

Docker Compose (Full Stack)

Add Router, both subgraphs, Redis, Kafka to docker-compose.yml with proper depends_on and networking.
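A minimal sketch of those additions, assuming the service names and ports used elsewhere in this roadmap (image tags, environment variable names, and healthcheck omissions are illustrative, not taken from the actual repo):

```yaml
# docker-compose.yml — sketch only
services:
  redis:
    image: redis:7
    ports: ["6379:6379"]
  kafka:
    image: bitnami/kafka:latest
    ports: ["9092:9092"]
  notifications:
    build: ./subgraphs/notifications
    environment:
      KAFKA_BROKERS: kafka:9092
    depends_on: [kafka]
  users:
    build: ./subgraphs/users
  router:
    build: ./router
    ports: ["4000:4000"]
    depends_on: [notifications, users, redis]
```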

Verification via cURL

# 1. Query (federated) — proves composition works
curl -X POST http://localhost:4000/ \
  -H "Content-Type: application/json" \
  -d '{"query": "{ notifications(userId: \"u1\") { id message user { name } } }"}'

# 2. Subscription via multipart HTTP (what Apollo Client uses)
curl -N -X POST http://localhost:4000/ \
  -H "Content-Type: application/json" \
  -H "Accept: multipart/mixed;boundary=\"graphql\";subscriptionSpec=1.0" \
  -d '{"query": "subscription { notificationCreated(userId: \"u1\") { id message severity } }"}'

# 3. In another terminal, trigger an event
curl -X POST http://localhost:4000/ \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { triggerNotification(userId: \"u1\", type: \"ALERT\", message: \"Server CPU > 90%\", severity: \"CRITICAL\") { id } }"}'

# → The subscription curl should receive the multipart chunk with the new notification

✅ Checkpoint 4

  • rover supergraph compose succeeds
  • Router starts with no errors, connects to Redis
  • Federated query resolves data across both subgraphs
  • Subscription via cURL receives real-time multipart events when mutation fires
  • Router logs show callback protocol handshake (init → check → next)
  • Redis contains subscription state keys (inspect with redis-cli KEYS *)

Phase 5: React Web App with Resilient Apollo Client

Goal: Build a React app that subscribes via the Router's multipart HTTP protocol, with robust error handling and retry logic.

Apollo Client Setup (web-app/src/apollo/client.ts)

import { ApolloClient, InMemoryCache, HttpLink, ApolloLink } from '@apollo/client';
import { RetryLink } from '@apollo/client/link/retry';
import { onError } from '@apollo/client/link/error';
// Note: no WebSocket link is needed; for Router multipart subscriptions,
// HttpLink handles the protocol natively in Apollo Client 4.x

// Error Link — handles GraphQL errors, protocol errors, network errors
const errorLink = onError(({ graphQLErrors, protocolErrors, networkError, operation }) => {
  if (graphQLErrors) {
    graphQLErrors.forEach(({ message, locations, path }) =>
      console.error(`[GraphQL error]: Message: ${message}, Path: ${path}`)
    );
  }
  // CombinedProtocolErrors — new in Apollo Client 4.x
  if (protocolErrors) {
    protocolErrors.forEach(({ message, extensions }) =>
      console.error(`[Protocol error]: ${message}`, extensions)
    );
  }
  if (networkError) {
    console.error(`[Network error]: ${networkError.message}`);
    // Could dispatch to a global state for UI indicator
  }
});

// Retry Link — exponential backoff for transient network failures
const retryLink = new RetryLink({
  delay: {
    initial: 300,       // ms
    max: 10000,         // cap at 10s
    jitter: true,       // randomize to avoid thundering herd
  },
  attempts: {
    max: 5,
    retryIf: (error, _operation) => {
      // Retry only on network errors, not GraphQL errors
      return !!error && !error.statusCode;  // no status = network failure
    },
  },
});

const httpLink = new HttpLink({
  uri: 'http://localhost:4000/',
  // Apollo Client 4.x automatically negotiates multipart for subscriptions
});

const link = ApolloLink.from([retryLink, errorLink, httpLink]);

export const client = new ApolloClient({
  link,
  cache: new InMemoryCache(),
  defaultOptions: {
    watchQuery: { errorPolicy: 'all' },
    query: { errorPolicy: 'all' },
  },
});

Custom Hook: useResilientSubscription

// Wraps useSubscription with connection status tracking, auto-reconnect indicator, error categorization
import { useSubscription, type DocumentNode, type OperationVariables, type TypedDocumentNode } from '@apollo/client';
import { useState, useCallback } from 'react';

export function useResilientSubscription<TData, TVars extends OperationVariables = OperationVariables>(
  subscription: DocumentNode | TypedDocumentNode<TData, TVars>,
  options?: { variables?: TVars; onData?: (data: TData) => void }
) {
  const [connectionStatus, setConnectionStatus] = useState<'connecting' | 'connected' | 'error' | 'reconnecting'>('connecting');

  const { data, loading, error, restart } = useSubscription(subscription, {
    variables: options?.variables,
    onData: ({ data: { data: subData } }) => {
      setConnectionStatus('connected');
      if (subData && options?.onData) options.onData(subData as TData);
    },
    onError: (err) => {
      setConnectionStatus('error');
      console.error('[Subscription error]', err);
    },
    onComplete: () => {
      setConnectionStatus('connecting'); // will re-init
    },
  });

  const reconnect = useCallback(() => {
    setConnectionStatus('reconnecting');
    restart();  // Apollo Client 4.x useSubscription supports restart()
  }, [restart]);

  return { data, loading, error, connectionStatus, reconnect };
}

Pages

  1. Dashboard (Dashboard.tsx)

    • Uses useResilientSubscription for systemAlert subscription
    • Displays live feed of system-wide alerts with severity badges
    • Shows <ConnectionStatus /> indicator (green dot / yellow / red)
    • "Reconnect" button that calls reconnect()
  2. Notifications (Notifications.tsx)

    • Uses useQuery to fetch historical notifications
    • Uses useResilientSubscription for notificationCreated(userId) to append new ones in real-time
    • Demonstrates hybrid query + subscription pattern
    • Error boundary wraps the page to catch unhandled GraphQL errors
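The append logic of the hybrid pattern can be sketched as a pure function (a hypothetical helper, not repo code); in the page it would run inside the subscription's onData callback:

```typescript
// Merges a live subscription event into the query-fetched history.
// Subscription events can race the initial query, so de-duplicate by id.
type Notification = { id: string; message: string; timestamp: string };

function mergeNotification(existing: Notification[], incoming: Notification): Notification[] {
  if (existing.some((n) => n.id === incoming.id)) return existing; // already known
  return [incoming, ...existing]; // newest first
}

const history: Notification[] = [{ id: 'n1', message: 'older', timestamp: 't0' }];
const afterNew = mergeNotification(history, { id: 'n2', message: 'newer', timestamp: 't1' });
const afterDup = mergeNotification(afterNew, { id: 'n1', message: 'older', timestamp: 't0' });
console.log(afterNew.length, afterDup.length); // prints "2 2"
```

Keeping this a pure function makes the race-condition handling trivially unit-testable, independent of Apollo Client.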

Steps

  1. Scaffold React app with Vite + TypeScript
  2. Install @apollo/client (v4.x)
  3. Build Apollo Client with RetryLink + ErrorLink chain
  4. Build Dashboard page — verify subscription receives multipart events
  5. Build Notifications page — verify query + subscription hybrid
  6. Add <ConnectionStatus /> component
  7. Dockerize the React app (nginx serving the build)

✅ Checkpoint 5

  • React app loads at http://localhost:3000
  • Dashboard shows live subscription feed
  • Firing triggerNotification mutation (via cURL or Notifications page) appears in real-time on Dashboard
  • Killing and restarting a subgraph shows error → reconnect behavior in the UI
  • Browser DevTools Network tab shows multipart HTTP chunks arriving (not WebSocket)

Phase 6: Scale Testing — Multiple Router & Subgraph Pods

Goal: Run 2 Router pods and 2 subgraph pods locally to prove the callback protocol routes correctly at scale. The orders subgraph (Subgraph B in the repository structure), with its Redis-backed manual callback implementation, makes this straightforward — any pod can service any event.

Why the two subgraphs scale differently

|  | notifications (plugin) | orders (Redis-backed) |
|---|---|---|
| Subscription state | In-process EventEmitter | Redis substate:{id} hash |
| Which pod can deliver? | Only the pod that got the init | Any pod (queries Redis for callbackUrl) |
| Add a second pod | Events on wrong partition are dropped | Kafka partitions auto-rebalance, both pods deliver |
| Pod crash | Active subscriptions on that pod are lost | Redis state survives; surviving pods take over |

Scaling the orders subgraph

docker compose up --scale orders=2

Both pods join the same Kafka consumer group (orders-subgraph-group). Kafka distributes partitions between them. When a Kafka event arrives on Pod B for an orderId whose subscription was initiated through Router Pod A:

  1. Pod B calls getSubscriptionsByOrderId(orderId) → Redis SMEMBERS subindex:{orderId}
  2. Gets back the subscriptionId → calls getSubscription(id) → Redis HGETALL substate:{id}
  3. POSTs next payload to the callbackUrl stored in Redis
  4. Router Pod A (whose URL is in Redis) receives the callback and pushes to the client

No coordination needed between subgraph pods — Redis is the shared source of truth.
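Steps 1–4 can be sketched with the Redis calls stubbed by in-memory Maps (the function names and callback payload fields shown here are illustrative; consult the callback protocol spec for the exact wire format):

```typescript
// Delivery path of the orders subgraph: Kafka event → Redis lookup → POST
// to the Router callback URL. Key shapes (substate:{id}, subindex:{orderId})
// follow the repo layout above.
type SubState = { callbackUrl: string; verifier: string; orderId: string };

// Stand-ins for Redis SMEMBERS subindex:{orderId} and HGETALL substate:{id}
const subindex = new Map<string, Set<string>>([['o1', new Set(['sub-123'])]]);
const substate = new Map<string, SubState>([
  ['sub-123', {
    callbackUrl: 'http://router-0:4000/callback/sub-123',
    verifier: 'v1',
    orderId: 'o1',
  }],
]);

const sent: Array<{ url: string; body: unknown }> = [];

// In production this would be an HTTP POST (e.g. fetch) to the Router
async function postCallback(url: string, body: unknown): Promise<void> {
  sent.push({ url, body });
}

// Called from the Kafka eachMessage handler when an order-status event arrives
async function deliverOrderEvent(orderId: string, payload: unknown): Promise<number> {
  const ids = subindex.get(orderId) ?? new Set<string>();
  for (const id of ids) {
    const state = substate.get(id);
    if (!state) continue; // stale index entry; the heartbeat loop cleans these up
    await postCallback(state.callbackUrl, { id, verifier: state.verifier, payload });
  }
  return sent.length;
}

deliverOrderEvent('o1', { status: 'SHIPPED' }).then((n) => console.log(n)); // prints 1
```

Because every lookup goes through Redis, the code is identical whether the event lands on Pod A or Pod B.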

Scaling the Router (multi-pod)

# Docker Compose with named instances (StatefulSet pattern):
docker compose up --scale router=2

Add an nginx reverse-proxy in front of the Router instances to distribute client connections. Each Router pod must advertise its own stable hostname as CALLBACK_PUBLIC_URL so subgraphs send callbacks to the right pod. In Kubernetes this uses StatefulSet + headless Service (see Phase 7).

✅ Checkpoint 6

  • 2 Router pods + 2 orders subgraph pods running
  • Subscription initiated through Router Pod A receives events regardless of which subgraph pod consumed the Kafka message
  • redis-cli KEYS 'substate:*' shows active subscription state (callbackUrl, verifier, orderId)
  • Kafka UI shows 2 members in consumer group orders-subgraph-group
  • Killing one subgraph pod — Kafka rebalances, surviving pod continues delivering events

Phase 7: Helm Charts for Kubernetes

Goal: Package everything into Helm charts that can deploy to any Kubernetes cluster (local or cloud).

Key Design Decisions

| Service | K8s Resource | Why |
|---|---|---|
| Router | StatefulSet + Headless Service | Each pod needs a stable DNS name (router-0.router-headless.ns.svc) for callback URLs |
| Notifications Subgraph | Deployment (2 replicas) | Stateless; Kafka consumer group handles distribution |
| Users Subgraph | Deployment (2 replicas) | Stateless |
| Redis | StatefulSet (1 replica) or Bitnami Helm chart | Persistent storage for subscription state |
| Kafka | StatefulSet (Bitnami Helm chart) | Message ordering guarantees |
| Web App | Deployment + Ingress | Static files served by nginx |

Router StatefulSet — Critical Config

# helm/templates/router-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: {{ include "app.fullname" . }}-router
spec:
  serviceName: router-headless
  replicas: {{ .Values.router.replicas }}      # default: 2
  selector:
    matchLabels:
      app: router
  template:
    metadata:
      labels:
        app: router
    spec:
      containers:
        - name: router
          image: "{{ .Values.router.image.repository }}:{{ .Values.router.image.tag }}"
          ports:
            - containerPort: 4000
          env:
            # Each pod computes its own callback URL using the pod's stable hostname
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: APOLLO_ROUTER_CONFIG_PATH
              value: /config/router.yaml
            - name: CALLBACK_PUBLIC_URL
              value: "http://$(POD_NAME).router-headless.{{ .Release.Namespace }}.svc.cluster.local:4000/callback"
          volumeMounts:
            - name: router-config
              mountPath: /config
      volumes:
        - name: router-config
          configMap:
            name: router-config
---
apiVersion: v1
kind: Service
metadata:
  name: router-headless
spec:
  clusterIP: None              # Headless — enables per-pod DNS
  selector:
    app: router
  ports:
    - port: 4000
      targetPort: 4000
---
apiVersion: v1
kind: Service
metadata:
  name: router                 # Client-facing Service (load-balanced)
spec:
  type: ClusterIP
  selector:
    app: router
  ports:
    - port: 4000
      targetPort: 4000

values.yaml Defaults

router:
  replicas: 2
  image:
    repository: ghcr.io/apollographql/router
    tag: v2.1.1
  redis:
    url: redis://redis:6379

notifications:
  replicas: 2
  kafka:
    brokers: kafka-0.kafka-headless:9092

users:
  replicas: 2

redis:
  enabled: true
  # Or use bitnami/redis subchart

kafka:
  enabled: true
  # Or use bitnami/kafka subchart

Steps

  1. Create Helm chart structure
  2. Template all K8s resources with configurable values.yaml
  3. Deploy to local K8s (Minikube / kind / Docker Desktop Kubernetes)
  4. helm install apollo-subs-demo ./helm
  5. Port-forward Router service: kubectl port-forward svc/router 4000:4000
  6. Verify same flow as Phase 5-6 but now on Kubernetes

✅ Checkpoint 7

  • helm install succeeds, all pods are Running
  • kubectl get pods shows: 2 router pods, 2 notifications pods, 2 users pods, 1 redis, 1 kafka
  • Subscriptions work via kubectl port-forward
  • Kill a Router pod → Kubernetes recreates it → subscriptions on that pod re-establish
  • Kill a Notification subgraph pod → Kafka rebalances → no event loss

Phase 8: Chaos & Resilience Testing

Goal: Prove the system handles failures gracefully and the React app recovers.

Test Scenarios

| Test | Action | Expected Outcome |
|---|---|---|
| Subgraph crash | kubectl delete pod notifications-0 | Kafka rebalances, no duplicate events. React UI shows brief error then reconnects |
| Router pod crash | kubectl delete pod router-0 | Subscriptions on that pod fail. Client retries (RetryLink). New subscription goes through router-1 |
| Redis restart | kubectl delete pod redis-0 | Router logs warn about lost subscription state. Active subscriptions may need to re-initialize. Client sees reconnect cycle |
| Kafka broker restart | Restart Kafka pod | Consumers reconnect after session.timeout.ms. Brief pause in events, then resume |
| Network partition | kubectl exec router-0 -- iptables -A OUTPUT -d notifications -j DROP | Heartbeat fails → Router terminates subscription → Client retries. Error Link fires |
| High load | Run scripts/load-test.sh (100 concurrent subscriptions + 50 events/sec) | All events delivered, no OOM, Redis memory stable |

React-side Verification

  • <ConnectionStatus /> transitions: connected → error → reconnecting → connected
  • No stale data displayed after reconnection
  • Error boundary catches and displays unrecoverable errors
  • RetryLink logs show exponential backoff in browser console
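For reference, the retry budget implied by the Phase 5 RetryLink config (initial 300 ms, max 10 s, 5 attempts) can be estimated with a simple doubling schedule. The real RetryLink also applies jitter, randomizing each delay, so observed timings will vary:

```typescript
// Back-of-envelope estimate of total client-side retry budget,
// assuming plain doubling without jitter.
function backoffSchedule(initialMs: number, maxMs: number, attempts: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(initialMs * 2 ** i, maxMs)); // cap each delay at maxMs
  }
  return delays;
}

const schedule = backoffSchedule(300, 10_000, 5);
console.log(schedule.join(','));                   // prints "300,600,1200,2400,4800"
console.log(schedule.reduce((a, b) => a + b, 0));  // prints 9300 (total wait in ms)
```

So a transient failure must clear within roughly 9–10 seconds for the client to recover without user intervention.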

✅ Checkpoint 8

  • All chaos scenarios documented with results
  • React app recovers from every transient failure within the RetryLink budget (5 attempts)
  • No data loss for events published during subgraph pod restart (Kafka guarantees)

Summary: Build Order & Dependencies

Phase 1: Redis + Kafka (infrastructure)
   ↓
Phase 2: Notifications Subgraph (subscription producer)
   ↓
Phase 3: Users Subgraph (federation companion)
   ↓
Phase 4: Router + Supergraph (composition + callback protocol)
   ↓
Phase 5: React Web App (client-side subscriptions + resilience)
   ↓
Phase 6: Scale Testing (multi-pod Docker Compose)
   ↓
Phase 7: Helm Charts (portable Kubernetes deployment)
   ↓
Phase 8: Chaos Testing (resilience proof)

Each phase is a PR-able unit — you can review, merge, and demo each independently.