Apollo Subscriptions at Scale — Project Plan

Version: 1.0 (Draft) Author: Solutions Architecture Date: February 18, 2026 Status: Pending Review

1. Executive Summary

This project demonstrates Apollo GraphQL Subscriptions operating at scale using the HTTP Callback protocol. The system showcases a federated architecture with multiple Apollo Router pods, multiple subgraph pod instances, event-driven messaging via Kafka/RabbitMQ, and Redis as the subscription state store. A React Web Application serves as the front-end, demonstrating real-time subscription functionality with robust error handling and retry logic.

The solution is designed to be environment-portable — deployable via Docker Compose for local development and Kubernetes (Helm charts) for staging/production.

2. Architecture Overview

2.1 High-Level Component Map

┌─────────────────────────────────────────────────────────────────────────────────┐
│  Clients (React Web App)                                                        │
│  ┌────────────────┐                                                             │
│  │ HTTP Multipart  │──── Heartbeats (empty JSON) ────────────────┐              │
│  │ Subscription    │◄── Payload / Error sent back ───────────────┤              │
│  └────────────────┘                                              │              │
└──────────────────────────────────────────────────────────────────┼──────────────┘
                                                                   │
┌──────────────────────────────────────────────────────────────────┼──────────────┐
│  API Gateway Layer                                               │              │
│  ┌────────────┐  ┌─────────────────┐  ┌──────────┐              │              │
│  │ Load       │  │ API Gateway     │  │ Rate     │              │              │
│  │ Balancer   │  │ (Kong/AWS GW)   │  │ Limiter  │              │              │
│  └────────────┘  └─────────────────┘  └──────────┘              │              │
└──────────────────────────────────────────────────────────────────┼──────────────┘
                                                                   │
┌──────────────────────────────────────────────────────────────────┼──────────────┐
│  Apollo Router Pods (StatefulSet + Headless Service)             │              │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐                    │              │
│  │ Router    │  │ Router    │  │ Router    │ ◄─ Callback URL     │              │
│  │ Pod 1     │  │ Pod 2     │  │ Pod 3     │    must be pod-     │              │
│  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘    specific        │              │
│        │              │              │                            │              │
│        └──────────────┴──────────────┘                            │              │
│                       │                                           │              │
└───────────────────────┼───────────────────────────────────────────┘              │
                        │ POST /graphql (init subscription with callback details)  │
┌───────────────────────┼─────────────────────────────────────────────────────────┐│
│  Subgraph Pods                                                                  ││
│  ┌───────────┐  ┌───────────┐  ┌───────────┐                                   ││
│  │ Subgraph  │  │ Subgraph  │  │ Subgraph  │                                   ││
│  │ Pod 1     │  │ Pod 2     │  │ Pod 3     │                                   ││
│  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘                                   ││
│        │              │              │                                           ││
│        └──────────────┼──────────────┘                                           ││
│                       │                                                          ││
│          ┌────────────┴────────────┐                                             ││
│          │     Redis (State)       │  ← Stores: callback URL, TTL,              ││
│          │                         │    subscription ID, metadata                ││
│          └────────────┬────────────┘                                             ││
│                       │                                                          ││
│          ┌────────────┴────────────┐                                             ││
│          │  Kafka / RabbitMQ       │  ← Domain events published by              ││
│          │  (Message Queue)        │    microservices, consumed by subgraphs     ││
│          └─────────────────────────┘                                             ││
└──────────────────────────────────────────────────────────────────────────────────┘│
                                                                                    │
┌──────────────────────────────────────────────────────────────────────────────────┐│
│  Observability                                                                   ││
│  Prometheus / Grafana / ELK Stack / OTEL Collector                               ││
└──────────────────────────────────────────────────────────────────────────────────┘│

2.2 Subscription Flow (HTTP Callback Protocol)

Client initiates a subscription via HTTP multipart request to the Router.
Router receives the subscription, generates a pod-specific Callback URL, and POSTs the subscription operation + callback details to the Subgraph.
Subgraph persists the callback metadata (URL, TTL, subscription ID) into Redis with a default TTL of 60s.
Subgraph (or a Watcher service) heartbeats the Router every ~30s; on 200 OK, resets the Redis TTL.
Kafka/RabbitMQ emits a domain event → Subgraph consumer picks it up.
Subgraph checks Redis: "Is this subscription still active (TTL > 0)?"
If active, Subgraph POSTs the payload to the Router's callback URL.
Router pushes the multipart chunk to the Client.

2.3 Lossless Message Delivery

Per the architecture diagram, the system implements lossless delivery:

On Router 5xx/timeout → subgraph sends channel nack back to the message queue for redelivery.
On subscription dead/expired → subgraph sends ACK + cleans up Redis metadata.
On 204 No Content → message delivered successfully, channel ACK.

2.4 Resumable Subscription Pattern

For handling disconnections gracefully:

Client holds an after cursor locally.
On reconnect, client subscribes with variables: { after: 'cursor_123' }.
Router POSTs to subgraph with callback URL + variables.
Subgraph queries the Event Store DB: SELECT * FROM events WHERE id > 'cursor_123'.
Subgraph backfills missed events (Event A, Event B) via POST to callback URL.
Once caught up, subgraph registers live subscription and starts consuming from RabbitMQ/Kafka.
New live events (Event C, etc.) flow through the normal callback path.

3. Workstream Breakdown

Phase 1: Foundation & Infrastructure (Weeks 1–2)

1.1 Repository & Project Scaffolding

Initialize monorepo (e.g., Nx or Turborepo) with the following packages:
- packages/router-config — Apollo Router YAML configuration
- packages/subgraph-orders — Orders subgraph (Node.js / Apollo Server)
- packages/subgraph-notifications — Notifications subgraph (Node.js / Apollo Server)
- packages/web-app — React application (Vite + Apollo Client v4)
- packages/event-producer — Utility to simulate domain events
- infra/docker — Docker Compose for local development
- infra/k8s — Kubernetes manifests / Helm charts
Define .env.example files with all configurable parameters.
Set up CI/CD pipeline skeleton (GitHub Actions).

1.2 Docker Compose — Local Development Stack

Redis (single node, port 6379)
Kafka (or RabbitMQ) with management UI
Apollo Router ×2 instances
Subgraph (Orders) ×2 instances
Subgraph (Notifications) ×2 instances
Event Producer (CLI utility / cron job for injecting events)
Prometheus + Grafana for local observability
Network configuration ensuring Router callback URLs are resolvable between containers.

1.3 Kubernetes Manifests / Helm Charts

Apollo Router — StatefulSet with Headless Service to provide stable DNS names per pod (e.g., router-0.router-headless.default.svc.cluster.local). This is critical for the callback URL to be instance-specific.
Subgraphs — Standard Deployment (2+ replicas each) with Service.
Redis — StatefulSet or managed service (e.g., ElastiCache / Memorystore).
Kafka/RabbitMQ — StatefulSet or managed service (e.g., MSK / CloudAMQP).
ConfigMaps for Router YAML, environment-specific overrides.
Secrets for GraphOS API key, Redis auth, Kafka credentials.
Include HorizontalPodAutoscaler definitions for subgraphs.

Phase 2: Schema Design & Subgraph Implementation (Weeks 2–3)

2.1 Federated Schema Design

# Subgraph: Orders
type Order @key(fields: "id") {
  id: ID!
  status: OrderStatus!
  items: [OrderItem!]!
  total: Float!
  createdAt: DateTime!
  updatedAt: DateTime!
}

type OrderItem {
  productId: ID!
  name: String!
  quantity: Int!
  price: Float!
}

enum OrderStatus {
  PLACED
  CONFIRMED
  PREPARING
  SHIPPED
  DELIVERED
  CANCELLED
}

type Subscription {
  orderStatusChanged(orderId: ID!): OrderStatusUpdate!
  newOrders: Order!
}

type OrderStatusUpdate {
  orderId: ID!
  previousStatus: OrderStatus!
  newStatus: OrderStatus!
  timestamp: DateTime!
}

type Query {
  order(id: ID!): Order
  orders(limit: Int, after: String): OrderConnection!
}

type Mutation {
  placeOrder(input: PlaceOrderInput!): Order!
  updateOrderStatus(orderId: ID!, status: OrderStatus!): Order!
}

# Subgraph: Notifications
type Notification @key(fields: "id") {
  id: ID!
  type: NotificationType!
  message: String!
  read: Boolean!
  createdAt: DateTime!
}

enum NotificationType {
  ORDER_UPDATE
  SYSTEM_ALERT
  PROMOTION
}

type Subscription {
  notificationReceived(userId: ID!): Notification!
}

type Query {
  notifications(userId: ID!, limit: Int): [Notification!]!
}

2.2 Subgraph Implementation — Callback Protocol Support

Each subgraph must implement the following:

On receiving a subscription initiation POST from Router:

Parse the callback URL, subscription ID, and verifier from the Router's request.

Store in Redis:

Key:   sub:{subscriptionId}
Value: { callbackUrl, verifier, createdAt, variables }
TTL:   60s

Respond with 200 OK to acknowledge.

Heartbeat / TTL Renewal:

Subgraph (or a dedicated Watcher service) pings the Router's callback URL with a check message every ~30s.
On 200 OK, reset Redis TTL to 60s.
On failure, mark subscription for cleanup.

Event Consumption (Kafka/RabbitMQ):

Consumer receives event (e.g., order.status.changed).
Queries Redis for active subscriptions matching the event filter.
For each active subscription: POST payload to the callback URL.
Handle response: 204 → ACK, 5xx/timeout → NACK for retry, gone → cleanup Redis.

2.3 Redis Data Model

# Active Subscription State
Key:    sub:{subscriptionId}
Type:   Hash
Fields: callbackUrl, verifier, variables (JSON), createdAt
TTL:    60s (renewed on heartbeat)

# Subscription Index (for event routing)
Key:    sub:index:{eventType}:{filterValue}
Type:   Set
Members: subscriptionId_1, subscriptionId_2, ...

# Event Cursor (for resumable subscriptions)
Key:    cursor:{subscriptionId}
Type:   String
Value:  last_event_id

Phase 3: Apollo Router Configuration (Week 3)

3.1 Router YAML Configuration

supergraph:
  listen: 0.0.0.0:4000

subscription:
  enabled: true
  mode:
    callback:
      public_url: ${ROUTER_PUBLIC_URL}  # Pod-specific, e.g., http://router-0.router-headless:4000
      listen: 0.0.0.0:4000
      path: /callback
      heartbeat_interval: 30s
      subgraphs:
        - orders
        - notifications

# Enable multipart subscriptions for clients
batching:
  enabled: false

telemetry:
  exporters:
    metrics:
      prometheus:
        enabled: true
        listen: 0.0.0.0:9090
        path: /metrics
    tracing:
      otlp:
        enabled: true
        endpoint: ${OTEL_COLLECTOR_ENDPOINT}
  instrumentation:
    events:
      router:
        request:
          level: info
        response:
          level: info
        error:
          level: error
      supergraph:
        request:
          level: info
        response:
          level: info

cors:
  origins:
    - http://localhost:3000
    - ${CLIENT_ORIGIN}
  allow_headers:
    - Content-Type
    - Apollo-Require-Preflight

headers:
  all:
    request:
      - propagate:
          named: Authorization
      - propagate:
          named: X-Request-ID

include_subgraph_errors:
  all: true

3.2 Key Considerations for Scaled Router Deployment

Pod-specific callback URL: Each Router pod must advertise its own stable hostname. Use StatefulSet + Headless Service so router-0, router-1, etc., are DNS-resolvable.
Environment variable injection: Use Kubernetes env from fieldRef to inject pod name; construct callback URL dynamically.
Health probes: Configure liveness (/health) and readiness probes on the Router.

Phase 4: React Web Application (Weeks 3–4)

4.1 Application Structure

web-app/
├── src/
│   ├── apollo/
│   │   ├── client.ts          # Apollo Client setup with link chain
│   │   ├── retryLink.ts       # RetryLink configuration
│   │   ├── errorLink.ts       # ErrorLink for global error handling
│   │   └── subscriptionLink.ts # HTTP multipart subscription setup
│   ├── pages/
│   │   ├── OrderDashboard.tsx  # Live order tracking page
│   │   └── NotificationCenter.tsx # Real-time notifications page
│   ├── components/
│   │   ├── OrderStatusTracker.tsx
│   │   ├── OrderList.tsx
│   │   ├── NotificationFeed.tsx
│   │   ├── ConnectionStatus.tsx  # Shows subscription health
│   │   └── ErrorBoundary.tsx
│   ├── hooks/
│   │   ├── useOrderSubscription.ts
│   │   ├── useNotificationSubscription.ts
│   │   └── useResumableSubscription.ts  # Custom hook with cursor tracking
│   ├── graphql/
│   │   ├── queries.ts
│   │   ├── mutations.ts
│   │   └── subscriptions.ts
│   └── App.tsx

4.2 Apollo Client Setup — Link Chain

// client.ts
import { ApolloClient, InMemoryCache, ApolloLink, from } from '@apollo/client';
import { RetryLink } from '@apollo/client/link/retry';
import { onError } from '@apollo/client/link/error';
import { HttpLink } from '@apollo/client/link/http';

// 1. RetryLink — handles transient network errors
const retryLink = new RetryLink({
  delay: {
    initial: 300,
    max: 10000,       // cap at 10s
    jitter: true,     // randomize to avoid thundering herd
  },
  attempts: {
    max: 5,
    retryIf: (error, _operation) => {
      // Retry on network errors and 5xx status codes
      return !!error && (
        error.statusCode >= 500 ||
        error.message === 'Failed to fetch' ||
        error.message?.includes('NetworkError')
      );
    },
  },
});

// 2. ErrorLink — global error observer
const errorLink = onError(({ graphQLErrors, networkError, protocolErrors, operation }) => {
  if (graphQLErrors) {
    graphQLErrors.forEach(({ message, locations, path, extensions }) => {
      console.error(
        `[GraphQL Error]: Message: ${message}, Path: ${path}, Code: ${extensions?.code}`
      );
    });
  }

  if (protocolErrors) {
    // CombinedProtocolErrors — handle multipart/subscription protocol issues
    protocolErrors.forEach(({ message, extensions }) => {
      console.error(`[Protocol Error]: ${message}`);
      // Could trigger re-subscription logic here
    });
  }

  if (networkError) {
    console.error(`[Network Error]: ${networkError.message}`);
    // Surface to UI via reactive variable or context
  }
});

// 3. HttpLink — standard HTTP transport (supports multipart subscriptions)
const httpLink = new HttpLink({
  uri: import.meta.env.VITE_GRAPHQL_ENDPOINT || 'http://localhost:4000/graphql',
  headers: {
    'Apollo-Require-Preflight': 'true',
  },
});

// Compose the link chain: Retry → Error → HTTP
const link = from([retryLink, errorLink, httpLink]);

export const client = new ApolloClient({
  link,
  cache: new InMemoryCache({
    typePolicies: {
      Query: {
        fields: {
          orders: {
            keyArgs: false,
            merge(existing, incoming) {
              return incoming; // or implement cursor-based merge
            },
          },
        },
      },
    },
  }),
  defaultOptions: {
    watchQuery: {
      errorPolicy: 'all',  // Return both data and errors
    },
  },
});

4.3 Subscription Hook Usage

// hooks/useOrderSubscription.ts
import { useSubscription } from '@apollo/client';
import { ORDER_STATUS_CHANGED } from '../graphql/subscriptions';

export function useOrderSubscription(orderId: string) {
  const { data, loading, error, restart } = useSubscription(
    ORDER_STATUS_CHANGED,
    {
      variables: { orderId },
      onData: ({ data: subscriptionData }) => {
        // Handle incoming subscription event
        console.log('Order update received:', subscriptionData);
      },
      onError: (error) => {
        console.error('Subscription error:', error);
        // Optionally trigger reconnect with backoff
      },
      onComplete: () => {
        console.log('Subscription completed');
      },
    }
  );

  return { data, loading, error, restart };
}

// hooks/useResumableSubscription.ts
import { useSubscription } from '@apollo/client';
import { useRef, useCallback } from 'react';

export function useResumableSubscription(subscription, options = {}) {
  const cursorRef = useRef<string | null>(
    localStorage.getItem(`sub_cursor_${options.key}`) || null
  );

  const { data, loading, error, restart } = useSubscription(subscription, {
    ...options,
    variables: {
      ...options.variables,
      after: cursorRef.current,
    },
    onData: ({ data: subData }) => {
      // Update cursor from the latest event
      const eventId = subData?.data?.__cursor || subData?.data?.id;
      if (eventId) {
        cursorRef.current = eventId;
        localStorage.setItem(`sub_cursor_${options.key}`, eventId);
      }
      options.onData?.({ data: subData });
    },
  });

  const resetCursor = useCallback(() => {
    cursorRef.current = null;
    localStorage.removeItem(`sub_cursor_${options.key}`);
    restart();
  }, [options.key, restart]);

  return { data, loading, error, restart, resetCursor };
}

4.4 Pages

Page 1 — Order Dashboard (OrderDashboard.tsx)

Displays a list of recent orders (query + polling fallback).
Each order card subscribes to orderStatusChanged and updates in real-time.
Status transitions animate visually (e.g., PLACED → CONFIRMED → SHIPPED).
"Place New Order" button triggers a mutation, demonstrating the full cycle.
Connection status indicator shows subscription health.

Page 2 — Notification Center (NotificationCenter.tsx)

Subscribes to notificationReceived for the current user.
Displays a live feed of notifications with read/unread state.
Demonstrates the resumable subscription pattern — on page reload, missed notifications are backfilled.
Badge count updates in the nav bar via useSubscription.

4.5 Error Handling Strategy

Error Type	Source	Handling Approach
Network errors (timeout, DNS, 5xx)	`RetryLink`	Automatic retry with exponential backoff + jitter (max 5 attempts). Surface to UI after exhaustion.
GraphQL errors	`ErrorLink`, `useSubscription` `error`	Display inline error messages. Log with operation context. Do not retry (business logic errors).
Protocol errors (`CombinedProtocolErrors`)	Multipart stream	Log, attempt `restart()` on the subscription hook. Show reconnection indicator.
Subscription termination	Server-side close	Detect via `onComplete`, auto-restart with cursor for resumable subscriptions.

Phase 5: Event System & Message Queue (Weeks 2–3, parallel)

5.1 Kafka/RabbitMQ Topic Design

# Kafka Topics (or RabbitMQ Exchanges/Queues)
orders.status.changed    → consumed by Orders subgraph
orders.new               → consumed by Orders subgraph + Notifications subgraph
notifications.system     → consumed by Notifications subgraph

# Dead Letter Topic
subscriptions.dlq        → failed callback deliveries after max retries

5.2 Event Producer (Simulation Utility)

CLI tool or cron job that publishes sample events to Kafka/RabbitMQ.
Configurable: event type, frequency, payload size, burst mode.
Used for demo and load testing.

5.3 Consumer Configuration

Each subgraph pod runs a consumer with a consumer group (Kafka) or competing consumers (RabbitMQ) pattern.
Manual ACK mode: only acknowledge after successful callback POST.
NACK with requeue on transient failures (Router 5xx, timeout).
Move to DLQ after N retries (configurable, default: 3).

Phase 6: Observability & Telemetry (Week 4)

6.1 Router Telemetry (OTEL)

Enable Router subscription debugging events per Apollo docs:

telemetry:
  instrumentation:
    events:
      supergraph:
        # Log subscription lifecycle events
        response:
          level: info

Export metrics to Prometheus, traces to Jaeger/OTEL Collector.

6.2 Grafana Dashboards

Subscription Metrics: Active subscriptions count, callback success/failure rate, heartbeat latency.
Message Queue Metrics: Consumer lag, throughput, DLQ depth.
Router Metrics: Request rate, latency percentiles, error rates.
Redis Metrics: Memory usage, key count, TTL expiration rate.

6.3 Alerting Rules

Subscription callback failure rate > 5% over 5 minutes.
Kafka consumer lag > 1000 messages.
Redis memory usage > 80%.
Router pod restarts > 2 in 10 minutes.

Phase 7: Testing & Validation (Week 5)

7.1 Integration Tests

Subscription lifecycle: initiate → receive events → heartbeat → terminate.
Callback delivery under Router pod restart (verify resumable pattern).
Message queue consumer failover (kill one subgraph pod, verify rebalancing).

7.2 Load Testing

Tool: k6 or Artillery with GraphQL subscription support.
Scenarios:
- 100 concurrent subscriptions, steady event stream (1 event/sec).
- 500 concurrent subscriptions, burst event stream (50 events/sec for 10s).
- Router pod failure during active subscriptions (chaos testing).
Measure: event delivery latency (p50, p95, p99), message loss rate, recovery time.

7.3 Chaos Engineering

Kill a Router pod — verify subscriptions reconnect via client restart() + resumable cursor.
Kill a Subgraph pod — verify Kafka/RabbitMQ rebalances, no message loss.
Redis connection interruption — verify graceful degradation.

Phase 8: Documentation & Portability (Week 5)

8.1 Deliverables

README.md — Quick start guide (Docker Compose + K8s).
docs/ARCHITECTURE.md — System design document with diagrams.
docs/RUNBOOK.md — Operational runbook (common issues, scaling procedures).
docs/ENV_VARS.md — Complete environment variable reference.
Helm values.yaml with commented defaults for environment customization.

8.2 Portability Checklist

All infrastructure is defined as code (Docker Compose + Helm).
No hardcoded URLs — all endpoints configurable via env vars.
Redis, Kafka/RabbitMQ can be swapped for managed services via config.
Router callback URL construction is dynamic (pod name + namespace).
CI/CD pipeline builds and pushes container images to configurable registry.
Helm chart supports overrides for different cloud providers (AWS/GCP/Azure).

4. Technology Stack Summary

Layer	Technology	Purpose
Client	React 18+ (Vite), Apollo Client v4	UI + GraphQL subscription consumer
API Gateway	Kong / AWS API Gateway (optional)	Auth, rate limiting, TLS termination
Router	Apollo Router v2.x (Enterprise)	Federated gateway, subscription callback orchestration
Subgraphs	Node.js + Apollo Server v4	Domain logic, callback protocol implementation
State Store	Redis 7+	Subscription metadata, TTL, cursor tracking
Message Queue	Apache Kafka or RabbitMQ	Event-driven communication between services
Event Store	PostgreSQL (optional)	Persisted events for resumable subscriptions
Observability	Prometheus, Grafana, OTEL Collector	Metrics, dashboards, distributed tracing
Infra	Docker Compose, Kubernetes (Helm)	Environment-portable deployment

5. Timeline Summary

Week	Phase	Key Deliverables
1	Foundation	Repo scaffolding, Docker Compose stack, K8s manifests (draft)
2	Infrastructure + Schema	Redis/Kafka running, federated schema designed, subgraph stubs
3	Subgraph + Router	Callback protocol implemented, Router config finalized, link chain
4	React App + Observability	Both pages functional, Grafana dashboards, OTEL integration
5	Testing + Docs	Load tests, chaos tests, documentation, portability validation

6. Risks & Mitigations

Risk	Impact	Mitigation
Callback URL not resolvable between pods	Subscription delivery fails	Use StatefulSet + Headless Service; validate DNS resolution in init container
Redis TTL expiration race condition	Subscription dropped prematurely	Set heartbeat interval < TTL/2; implement TTL renewal with atomic Redis commands
Kafka consumer lag under burst load	Delayed event delivery	Tune partition count, consumer threads; add HPA for subgraph pods
Client reconnection thundering herd	Router overwhelmed after outage	Implement jittered backoff in RetryLink; stagger client reconnection
Router pod restart loses active subscriptions	Clients must re-subscribe	Resumable subscription pattern with cursor; client-side `restart()` on error

FilesExpand file tree

subscription-at-scale-plan.md

Latest commit

History