Apollo Subscriptions at Scale — Project Plan

Version: 1.0 (Draft)
Author: Solutions Architecture
Date: February 18, 2026
Status: Pending Review


1. Executive Summary

This project demonstrates Apollo GraphQL Subscriptions operating at scale using the HTTP Callback protocol. The system showcases a federated architecture with multiple Apollo Router pods, multiple subgraph pod instances, event-driven messaging via Kafka/RabbitMQ, and Redis as the subscription state store. A React Web Application serves as the front-end, demonstrating real-time subscription functionality with robust error handling and retry logic.

The solution is designed to be environment-portable — deployable via Docker Compose for local development and Kubernetes (Helm charts) for staging/production.


2. Architecture Overview

2.1 High-Level Component Map

┌─────────────────────────────────────────────────────────────────────────────────┐
│  Clients (React Web App)                                                        │
│  ┌────────────────┐                                                             │
│  │ HTTP Multipart  │──── Heartbeats (empty JSON) ────────────────┐              │
│  │ Subscription    │◄── Payload / Error sent back ───────────────┤              │
│  └────────────────┘                                              │              │
└──────────────────────────────────────────────────────────────────┼──────────────┘
                                                                   │
┌──────────────────────────────────────────────────────────────────┼──────────────┐
│  API Gateway Layer                                               │              │
│  ┌────────────┐  ┌─────────────────┐  ┌──────────┐              │              │
│  │ Load       │  │ API Gateway     │  │ Rate     │              │              │
│  │ Balancer   │  │ (Kong/AWS GW)   │  │ Limiter  │              │              │
│  └────────────┘  └─────────────────┘  └──────────┘              │              │
└──────────────────────────────────────────────────────────────────┼──────────────┘
                                                                   │
┌──────────────────────────────────────────────────────────────────┼──────────────┐
│  Apollo Router Pods (StatefulSet + Headless Service)             │              │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐                    │              │
│  │ Router    │  │ Router    │  │ Router    │ ◄─ Callback URL     │              │
│  │ Pod 1     │  │ Pod 2     │  │ Pod 3     │    must be pod-     │              │
│  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘    specific        │              │
│        │              │              │                            │              │
│        └──────────────┴──────────────┘                            │              │
│                       │                                           │              │
└───────────────────────┼───────────────────────────────────────────┘              │
                        │ POST /graphql (init subscription with callback details)  │
┌───────────────────────┼─────────────────────────────────────────────────────────┐│
│  Subgraph Pods                                                                  ││
│  ┌───────────┐  ┌───────────┐  ┌───────────┐                                   ││
│  │ Subgraph  │  │ Subgraph  │  │ Subgraph  │                                   ││
│  │ Pod 1     │  │ Pod 2     │  │ Pod 3     │                                   ││
│  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘                                   ││
│        │              │              │                                           ││
│        └──────────────┼──────────────┘                                           ││
│                       │                                                          ││
│          ┌────────────┴────────────┐                                             ││
│          │     Redis (State)       │  ← Stores: callback URL, TTL,              ││
│          │                         │    subscription ID, metadata                ││
│          └────────────┬────────────┘                                             ││
│                       │                                                          ││
│          ┌────────────┴────────────┐                                             ││
│          │  Kafka / RabbitMQ       │  ← Domain events published by              ││
│          │  (Message Queue)        │    microservices, consumed by subgraphs     ││
│          └─────────────────────────┘                                             ││
└──────────────────────────────────────────────────────────────────────────────────┘│
                                                                                    │
┌──────────────────────────────────────────────────────────────────────────────────┐│
│  Observability                                                                   ││
│  Prometheus / Grafana / ELK Stack / OTEL Collector                               ││
└──────────────────────────────────────────────────────────────────────────────────┘│

2.2 Subscription Flow (HTTP Callback Protocol)

  1. Client initiates a subscription via HTTP multipart request to the Router.
  2. Router receives the subscription, generates a pod-specific Callback URL, and POSTs the subscription operation + callback details to the Subgraph.
  3. Subgraph persists the callback metadata (URL, TTL, subscription ID) into Redis with a default TTL of 60s.
  4. Subgraph (or a Watcher service) heartbeats the Router every ~30s; on a success response (204 No Content in the callback protocol), it resets the Redis TTL.
  5. Kafka/RabbitMQ emits a domain event → Subgraph consumer picks it up.
  6. Subgraph checks Redis: "Is this subscription still active (TTL > 0)?"
  7. If active, Subgraph POSTs the payload to the Router's callback URL.
  8. Router pushes the multipart chunk to the Client.
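
The messages behind steps 4 and 7 can be sketched as plain JSON builders. This is a minimal sketch: the field names (kind, action, id, verifier, payload) and the CallbackDetails shape are assumed from Apollo's HTTP callback protocol documentation and should be checked against the Router version in use. The function and type names are hypothetical.

```typescript
// Callback details the Router supplies when it opens a subscription
// (assumed to arrive under extensions.subscription in the initial POST).
interface CallbackDetails {
  callbackUrl: string;
  subscriptionId: string;
  verifier: string;
}

interface CheckMessage {
  kind: "subscription";
  action: "check";
  id: string;
  verifier: string;
}

interface NextMessage {
  kind: "subscription";
  action: "next";
  id: string;
  verifier: string;
  payload: { data: unknown };
}

// Step 4: liveness probe the subgraph POSTs to the callback URL.
function checkMessage(d: CallbackDetails): CheckMessage {
  return { kind: "subscription", action: "check", id: d.subscriptionId, verifier: d.verifier };
}

// Step 7: one event for the Router to forward to the client.
function nextMessage(d: CallbackDetails, data: unknown): NextMessage {
  return {
    kind: "subscription",
    action: "next",
    id: d.subscriptionId,
    verifier: d.verifier,
    payload: { data },
  };
}
```

Either message would be sent as the JSON body of a POST to callbackUrl; the HTTP client is left out so the sketch stays dependency-free.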

2.3 Lossless Message Delivery

Per the architecture diagram, the system targets lossless (at-least-once) delivery: the subgraph settles each queue message only once the outcome of the callback POST is known.

  • On 204 No Content → message delivered successfully; subgraph ACKs the message.
  • On Router 5xx/timeout → subgraph NACKs the message so the queue redelivers it.
  • On subscription dead/expired → subgraph ACKs the message and cleans up the Redis metadata.
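
The mapping from callback outcome to queue action can be sketched as a small pure function. Two points are assumptions rather than part of the plan: a 404 is taken as the Router's signal that the subscription is dead, and any status the plan does not mention is retried. All names are hypothetical.

```typescript
// Result of POSTing a payload to the Router's callback URL.
type CallbackOutcome = { status: number } | { timeout: true };

type QueueAction = "ack" | "nack" | "ack-and-cleanup";

function queueActionFor(outcome: CallbackOutcome): QueueAction {
  if ("timeout" in outcome) return "nack";          // no answer: let the queue redeliver
  if (outcome.status === 204) return "ack";         // delivered
  if (outcome.status === 404) return "ack-and-cleanup"; // assumed "subscription gone" signal
  if (outcome.status >= 500) return "nack";         // transient Router failure
  return "nack"; // not covered by the plan; retried here and bounded by the DLQ limit
}
```

Keeping the decision separate from the HTTP and queue clients makes it easy to unit test and to adjust once the exact status codes are confirmed.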

2.4 Resumable Subscription Pattern

For handling disconnections gracefully:

  1. Client holds an after cursor locally.
  2. On reconnect, client subscribes with variables: { after: 'cursor_123' }.
  3. Router POSTs to subgraph with callback URL + variables.
  4. Subgraph queries the Event Store DB: SELECT * FROM events WHERE id > 'cursor_123'.
  5. Subgraph backfills missed events (Event A, Event B) via POST to callback URL.
  6. Once caught up, subgraph registers live subscription and starts consuming from RabbitMQ/Kafka.
  7. New live events (Event C, etc.) flow through the normal callback path.
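
The backfill-then-live handover in steps 4 to 7 can be sketched with an in-memory event list. This sketch assumes event IDs are monotonically increasing numbers (the string cursor in the example above would need an equivalent ordering) and that live events arriving during the backfill are buffered and de-duplicated. The names are hypothetical.

```typescript
interface StoredEvent {
  id: number;      // assumed monotonically increasing sequence number
  payload: string;
}

// Steps 4-5: events the client missed, oldest first.
function eventsAfter(store: StoredEvent[], cursor: number): StoredEvent[] {
  return store.filter((e) => e.id > cursor).sort((a, b) => a.id - b.id);
}

// Steps 5-7: backfill first, then append live events that arrived while the
// backfill was running, skipping any the backfill already covered.
function resumeFrom(
  store: StoredEvent[],
  cursor: number,
  bufferedLive: StoredEvent[]
): StoredEvent[] {
  const out = eventsAfter(store, cursor);
  let last = cursor;
  out.forEach((e) => {
    last = e.id;
  });
  bufferedLive
    .slice()
    .sort((a, b) => a.id - b.id)
    .forEach((e) => {
      if (e.id > last) {
        out.push(e);
        last = e.id;
      }
    });
  return out;
}
```

Buffering live events before the backfill query runs closes the window in which an event could be written to the store after the query but before the live consumer starts.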

3. Workstream Breakdown

Phase 1: Foundation & Infrastructure (Weeks 1–2)

1.1 Repository & Project Scaffolding

  • Initialize monorepo (e.g., Nx or Turborepo) with the following packages:
    • packages/router-config — Apollo Router YAML configuration
    • packages/subgraph-orders — Orders subgraph (Node.js / Apollo Server)
    • packages/subgraph-notifications — Notifications subgraph (Node.js / Apollo Server)
    • packages/web-app — React application (Vite + Apollo Client v4)
    • packages/event-producer — Utility to simulate domain events
    • infra/docker — Docker Compose for local development
    • infra/k8s — Kubernetes manifests / Helm charts
  • Define .env.example files with all configurable parameters.
  • Set up CI/CD pipeline skeleton (GitHub Actions).

1.2 Docker Compose — Local Development Stack

  • Redis (single node, port 6379)
  • Kafka (or RabbitMQ) with management UI
  • Apollo Router ×2 instances
  • Subgraph (Orders) ×2 instances
  • Subgraph (Notifications) ×2 instances
  • Event Producer (CLI utility / cron job for injecting events)
  • Prometheus + Grafana for local observability
  • Network configuration ensuring Router callback URLs are resolvable between containers.

1.3 Kubernetes Manifests / Helm Charts

  • Apollo Router: StatefulSet with Headless Service to provide stable DNS names per pod (e.g., router-0.router-headless.default.svc.cluster.local). This is critical for the callback URL to be instance-specific.
  • Subgraphs: Standard Deployment (2+ replicas each) with Service.
  • Redis: StatefulSet or managed service (e.g., ElastiCache / Memorystore).
  • Kafka/RabbitMQ: StatefulSet or managed service (e.g., MSK / CloudAMQP).
  • ConfigMaps for Router YAML, environment-specific overrides.
  • Secrets for GraphOS API key, Redis auth, Kafka credentials.
  • Include HorizontalPodAutoscaler definitions for subgraphs.

Phase 2: Schema Design & Subgraph Implementation (Weeks 2–3)

2.1 Federated Schema Design

# Subgraph: Orders
type Order @key(fields: "id") {
  id: ID!
  status: OrderStatus!
  items: [OrderItem!]!
  total: Float!
  createdAt: DateTime!
  updatedAt: DateTime!
}

type OrderItem {
  productId: ID!
  name: String!
  quantity: Int!
  price: Float!
}

enum OrderStatus {
  PLACED
  CONFIRMED
  PREPARING
  SHIPPED
  DELIVERED
  CANCELLED
}

type Subscription {
  orderStatusChanged(orderId: ID!): OrderStatusUpdate!
  newOrders: Order!
}

type OrderStatusUpdate {
  orderId: ID!
  previousStatus: OrderStatus!
  newStatus: OrderStatus!
  timestamp: DateTime!
}

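# OrderConnection, PlaceOrderInput, and the DateTime scalar are referenced
# but not yet defined in this draft; the subgraph schema needs all three.
type Query {
  order(id: ID!): Order
  orders(limit: Int, after: String): OrderConnection!
}

type Mutation {
  placeOrder(input: PlaceOrderInput!): Order!
  updateOrderStatus(orderId: ID!, status: OrderStatus!): Order!
}

# Subgraph: Notifications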
type Notification @key(fields: "id") {
  id: ID!
  type: NotificationType!
  message: String!
  read: Boolean!
  createdAt: DateTime!
}

enum NotificationType {
  ORDER_UPDATE
  SYSTEM_ALERT
  PROMOTION
}

type Subscription {
  notificationReceived(userId: ID!): Notification!
}

type Query {
  notifications(userId: ID!, limit: Int): [Notification!]!
}

2.2 Subgraph Implementation — Callback Protocol Support

Each subgraph must implement the following:

On receiving a subscription initiation POST from Router:

  1. Parse the callback URL, subscription ID, and verifier from the Router's request.
  2. Store in Redis:
    Key:   sub:{subscriptionId}
    Value: { callbackUrl, verifier, createdAt, variables }
    TTL:   60s
    
  3. Respond with 200 OK to acknowledge.
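
Steps 1 and 2 can be sketched as a function that turns the Router's initial request body into the Redis record described above. It assumes the callback details arrive under extensions.subscription with the field names callbackUrl, subscriptionId, and verifier; the function and interface names are hypothetical, and the Redis write itself is left out.

```typescript
// Assumed shape of the Router's subscription-initiation request body.
interface InitRequestBody {
  variables?: Record<string, unknown>;
  extensions?: {
    subscription?: { callbackUrl?: string; subscriptionId?: string; verifier?: string };
  };
}

interface SubscriptionRecord {
  key: string;        // sub:{subscriptionId}
  ttlSeconds: number; // 60s by default, renewed on heartbeat
  value: { callbackUrl: string; verifier: string; createdAt: string; variables: string };
}

function toSubscriptionRecord(
  body: InitRequestBody,
  now: Date,
  ttlSeconds: number = 60
): SubscriptionRecord {
  const sub = body.extensions?.subscription;
  const callbackUrl = sub?.callbackUrl;
  const subscriptionId = sub?.subscriptionId;
  const verifier = sub?.verifier;
  if (!callbackUrl || !subscriptionId || !verifier) {
    throw new Error("missing callback details in extensions.subscription");
  }
  return {
    key: `sub:${subscriptionId}`,
    ttlSeconds,
    value: {
      callbackUrl,
      verifier,
      createdAt: now.toISOString(),
      variables: JSON.stringify(body.variables ?? {}), // stored as JSON text
    },
  };
}
```

Rejecting a request that lacks any of the three fields keeps half-initialised subscriptions out of Redis.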

Heartbeat / TTL Renewal:

  • Subgraph (or a dedicated Watcher service) pings the Router's callback URL with a check message every ~30s.
  • On a success response (204 No Content in the callback protocol), reset Redis TTL to 60s.
  • On failure, mark subscription for cleanup.
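
The renewal rule can be sketched with an in-memory stand-in for Redis key expiry. The class and function names are hypothetical, time is passed in explicitly to keep the sketch deterministic, and a failed heartbeat removes the key immediately, which is one possible reading of "mark for cleanup".

```typescript
// In-memory stand-in for Redis keys with a TTL.
class SubscriptionTtlStore {
  private readonly expiresAt: Record<string, number> = {}; // key -> epoch ms

  put(key: string, ttlMs: number, nowMs: number): void {
    this.expiresAt[key] = nowMs + ttlMs;
  }

  // True while the key exists and its TTL has not elapsed.
  isActive(key: string, nowMs: number): boolean {
    const deadline = this.expiresAt[key];
    return deadline !== undefined && deadline > nowMs;
  }

  remove(key: string): void {
    delete this.expiresAt[key];
  }
}

// Apply the result of one heartbeat: renew on success, drop on failure.
function onHeartbeatResult(
  store: SubscriptionTtlStore,
  key: string,
  routerAccepted: boolean,
  nowMs: number,
  ttlMs: number = 60000
): "renewed" | "removed" {
  if (routerAccepted && store.isActive(key, nowMs)) {
    store.put(key, ttlMs, nowMs);
    return "renewed";
  }
  store.remove(key);
  return "removed";
}
```

With a 60s TTL and a 30s heartbeat, a single missed heartbeat leaves no slack, so the interval may need to be shorter in practice.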

Event Consumption (Kafka/RabbitMQ):

  • Consumer receives event (e.g., order.status.changed).
  • Queries Redis for active subscriptions matching the event filter.
  • For each active subscription: POST payload to the callback URL.
  • Handle the response: 204 → ACK; 5xx or timeout → NACK for retry; subscription gone (404) → ACK and clean up Redis.
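
When one event fans out to several subscriptions, the per-subscription results have to be folded into a single decision for the queue message. The sketch below shows one possible policy, not a prescribed one: NACK if any delivery needs a retry, and collect the IDs of dead subscriptions for cleanup. The names are hypothetical.

```typescript
type Delivery = "delivered" | "gone" | "retry";

interface Settlement {
  queue: "ack" | "nack";
  cleanup: string[]; // subscription IDs whose Redis state should be removed
}

// Fold per-subscription callback results into one queue decision.
function settle(results: Record<string, Delivery>): Settlement {
  const cleanup: string[] = [];
  let needsRetry = false;
  Object.keys(results).forEach((subscriptionId) => {
    const result = results[subscriptionId];
    if (result === "gone") cleanup.push(subscriptionId);
    if (result === "retry") needsRetry = true;
  });
  return { queue: needsRetry ? "nack" : "ack", cleanup };
}
```

Under this policy a redelivered message reaches subscriptions that already received it, so clients (or the subgraph) would need to de-duplicate by event ID.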

2.3 Redis Data Model

# Active Subscription State
Key:    sub:{subscriptionId}
Type:   Hash
Fields: callbackUrl, verifier, variables (JSON), createdAt
TTL:    60s (renewed on heartbeat)

# Subscription Index (for event routing)
Key:    sub:index:{eventType}:{filterValue}
Type:   Set
Members: subscriptionId_1, subscriptionId_2, ...

# Event Cursor (for resumable subscriptions)
Key:    cursor:{subscriptionId}
Type:   String
Value:  last_event_id
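
The key layout above can be captured in a few helpers, sketched here with hypothetical names. The sketch also illustrates one consequence of the model: the index Set carries no TTL, so a member can outlive its sub:{subscriptionId} hash and should be checked (and pruned) before delivery.

```typescript
const subKey = (subscriptionId: string): string => `sub:${subscriptionId}`;

const indexKey = (eventType: string, filterValue: string): string =>
  `sub:index:${eventType}:${filterValue}`;

const cursorKey = (subscriptionId: string): string => `cursor:${subscriptionId}`;

// Split index members into those whose state key is still live and those
// that expired; the caller would remove the stale ones from the Set.
function splitIndexMembers(
  members: string[],
  isActive: (subscriptionId: string) => boolean
): { active: string[]; stale: string[] } {
  const active: string[] = [];
  const stale: string[] = [];
  members.forEach((id) => {
    if (isActive(id)) active.push(id);
    else stale.push(id);
  });
  return { active, stale };
}
```

For example, an order.status.changed event for order 42 would look up indexKey("order.status.changed", "42") and deliver only to the active members.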

Phase 3: Apollo Router Configuration (Week 3)

3.1 Router YAML Configuration

supergraph:
  listen: 0.0.0.0:4000

subscription:
  enabled: true
  mode:
    callback:
      public_url: ${ROUTER_PUBLIC_URL}  # Pod-specific and including the callback path, e.g., http://router-0.router-headless:4000/callback
      listen: 0.0.0.0:4000
      path: /callback
      heartbeat_interval: 30s
      subgraphs:
        - orders
        - notifications

# Query batching is unrelated to subscriptions and stays disabled; multipart delivery to clients is covered by subscription.enabled above
batching:
  enabled: false

telemetry:
  exporters:
    metrics:
      prometheus:
        enabled: true
        listen: 0.0.0.0:9090
        path: /metrics
    tracing:
      otlp:
        enabled: true
        endpoint: ${OTEL_COLLECTOR_ENDPOINT}
  instrumentation:
    events:
      router:
        request:
          level: info
        response:
          level: info
        error:
          level: error
      supergraph:
        request:
          level: info
        response:
          level: info

cors:
  origins:
    - http://localhost:3000
    - ${CLIENT_ORIGIN}
  allow_headers:
    - Content-Type
    - Apollo-Require-Preflight

headers:
  all:
    request:
      - propagate:
          named: Authorization
      - propagate:
          named: X-Request-ID

include_subgraph_errors:
  all: true

3.2 Key Considerations for Scaled Router Deployment

  • Pod-specific callback URL: Each Router pod must advertise its own stable hostname. Use StatefulSet + Headless Service so router-0, router-1, etc., are DNS-resolvable.
  • Environment variable injection: Use the Kubernetes Downward API (an env entry with valueFrom.fieldRef on metadata.name) to inject the pod name, then construct the callback URL from it.
  • Health probes: Configure liveness (/health) and readiness probes on the Router.
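
Constructing the pod-specific URL from the injected pod name can be sketched as below. The helper and its parameter names are hypothetical, and appending the callback path to the public URL is an assumption to verify against the Router's configuration reference.

```typescript
interface RouterPodAddress {
  podName: string;         // e.g. injected from metadata.name
  headlessService: string; // name of the Headless Service
  namespace: string;
  port: number;
  callbackPath: string;    // should match the Router's callback path setting
}

// Build the stable per-pod URL that subgraphs use to reach this Router pod.
function routerPublicUrl(a: RouterPodAddress): string {
  const path = a.callbackPath.charAt(0) === "/" ? a.callbackPath : `/${a.callbackPath}`;
  return `http://${a.podName}.${a.headlessService}.${a.namespace}.svc.cluster.local:${a.port}${path}`;
}
```

In a deployment this value would typically be computed in the container entrypoint and exported as ROUTER_PUBLIC_URL before the Router starts.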

Phase 4: React Web Application (Weeks 3–4)

4.1 Application Structure

web-app/
├── src/
│   ├── apollo/
│   │   ├── client.ts          # Apollo Client setup with link chain
│   │   ├── retryLink.ts       # RetryLink configuration
│   │   ├── errorLink.ts       # ErrorLink for global error handling
│   │   └── subscriptionLink.ts # HTTP multipart subscription setup
│   ├── pages/
│   │   ├── OrderDashboard.tsx  # Live order tracking page
│   │   └── NotificationCenter.tsx # Real-time notifications page
│   ├── components/
│   │   ├── OrderStatusTracker.tsx
│   │   ├── OrderList.tsx
│   │   ├── NotificationFeed.tsx
│   │   ├── ConnectionStatus.tsx  # Shows subscription health
│   │   └── ErrorBoundary.tsx
│   ├── hooks/
│   │   ├── useOrderSubscription.ts
│   │   ├── useNotificationSubscription.ts
│   │   └── useResumableSubscription.ts  # Custom hook with cursor tracking
│   ├── graphql/
│   │   ├── queries.ts
│   │   ├── mutations.ts
│   │   └── subscriptions.ts
│   └── App.tsx

4.2 Apollo Client Setup — Link Chain

// client.ts
import { ApolloClient, ApolloLink, InMemoryCache } from '@apollo/client';
import {
  CombinedGraphQLErrors,
  CombinedProtocolErrors,
  ServerError,
} from '@apollo/client/errors';
import { RetryLink } from '@apollo/client/link/retry';
import { ErrorLink } from '@apollo/client/link/error';
import { HttpLink } from '@apollo/client/link/http';

// 1. RetryLink: handles transient network errors
const retryLink = new RetryLink({
  delay: {
    initial: 300,
    max: 10000,       // cap at 10s
    jitter: true,     // randomize to avoid thundering herd
  },
  attempts: {
    max: 5,
    retryIf: (error, _operation) => {
      // Retry on 5xx responses and on fetch-level network failures
      if (ServerError.is(error)) return error.statusCode >= 500;
      const message = error?.message ?? '';
      return message === 'Failed to fetch' || message.includes('NetworkError');
    },
  },
});

// 2. ErrorLink: global error observer.
// Apollo Client v4 passes a single `error`; narrow it with the static is() helpers.
const errorLink = new ErrorLink(({ error, operation }) => {
  if (CombinedGraphQLErrors.is(error)) {
    error.errors.forEach(({ message, path, extensions }) => {
      console.error(
        `[GraphQL Error]: Operation: ${operation.operationName}, Message: ${message}, Path: ${path}, Code: ${extensions?.code}`
      );
    });
  } else if (CombinedProtocolErrors.is(error)) {
    // Multipart/subscription protocol issues
    error.errors.forEach(({ message }) => {
      console.error(`[Protocol Error]: ${message}`);
      // Could trigger re-subscription logic here
    });
  } else {
    console.error(`[Network Error]: ${error.message}`);
    // Surface to UI via reactive variable or context
  }
});

// 3. HttpLink: standard HTTP transport (supports multipart subscriptions)
const httpLink = new HttpLink({
  uri: import.meta.env.VITE_GRAPHQL_ENDPOINT || 'http://localhost:4000/graphql',
  headers: {
    'Apollo-Require-Preflight': 'true',
  },
});

// Compose the link chain: Retry → Error → HTTP
const link = ApolloLink.from([retryLink, errorLink, httpLink]);

export const client = new ApolloClient({
  link,
  cache: new InMemoryCache({
    typePolicies: {
      Query: {
        fields: {
          orders: {
            keyArgs: false,
            merge(existing, incoming) {
              return incoming; // or implement cursor-based merge
            },
          },
        },
      },
    },
  }),
  defaultOptions: {
    watchQuery: {
      errorPolicy: 'all',  // Return both data and errors
    },
  },
});

4.3 Subscription Hook Usage

// hooks/useOrderSubscription.ts
import { useSubscription } from '@apollo/client/react';
import { ORDER_STATUS_CHANGED } from '../graphql/subscriptions';

export function useOrderSubscription(orderId: string) {
  const { data, loading, error, restart } = useSubscription(
    ORDER_STATUS_CHANGED,
    {
      variables: { orderId },
      onData: ({ data: subscriptionData }) => {
        // Handle incoming subscription event
        console.log('Order update received:', subscriptionData);
      },
      onError: (error) => {
        console.error('Subscription error:', error);
        // Optionally trigger reconnect with backoff
      },
      onComplete: () => {
        console.log('Subscription completed');
      },
    }
  );

  return { data, loading, error, restart };
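}

// hooks/useResumableSubscription.ts
import { useSubscription } from '@apollo/client/react';
import type { DocumentNode } from 'graphql';
import { useCallback, useRef, useState } from 'react';

interface ResumableOptions {
  key: string; // namespaces the cursor in localStorage
  variables?: Record<string, unknown>;
  onData?: (result: { data: unknown }) => void;
  onError?: (error: unknown) => void;
  onComplete?: () => void;
}

type CursorCarrier = { __cursor?: string; id?: string } | null | undefined;

export function useResumableSubscription(
  subscription: DocumentNode,
  options: ResumableOptions
) {
  const storageKey = `sub_cursor_${options.key}`;
  // Cursor sent as `after`. It lives in state and changes only on an explicit
  // (re)start: updating variables on every event would resubscribe each time.
  const [after, setAfter] = useState<string | null>(() =>
    localStorage.getItem(storageKey)
  );
  // Latest cursor seen; a ref so that updating it does not re-render.
  const cursorRef = useRef<string | null>(after);

  const { data, loading, error, restart } = useSubscription(subscription, {
    variables: { ...options.variables, after },
    onError: options.onError,
    onComplete: options.onComplete,
    onData: ({ data: subData }) => {
      // Subscription data is keyed by root field, so read the cursor from
      // the first (only) field's payload.
      const fields = (subData?.data ?? {}) as Record<string, CursorCarrier>;
      const payload = Object.values(fields)[0];
      const eventId = payload?.__cursor ?? payload?.id;
      if (eventId) {
        cursorRef.current = eventId;
        localStorage.setItem(storageKey, eventId);
      }
      options.onData?.({ data: subData });
    },
  });

  // Resubscribe starting after `cursor` (null means live events only).
  const resubscribeFrom = useCallback(
    (cursor: string | null) => {
      if (cursor === after) restart(); // same variables: force a new subscription
      else setAfter(cursor);           // new variables: the hook resubscribes
    },
    [after, restart]
  );

  // Resume after the last event seen, e.g. from onError or onComplete.
  const resume = useCallback(
    () => resubscribeFrom(cursorRef.current),
    [resubscribeFrom]
  );

  const resetCursor = useCallback(() => {
    cursorRef.current = null;
    localStorage.removeItem(storageKey);
    resubscribeFrom(null);
  }, [storageKey, resubscribeFrom]);

  return { data, loading, error, restart: resume, resetCursor };
}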

4.4 Pages

Page 1 — Order Dashboard (OrderDashboard.tsx)

  • Displays a list of recent orders (query + polling fallback).
  • Each order card subscribes to orderStatusChanged and updates in real-time.
  • Status transitions animate visually (e.g., PLACED → CONFIRMED → SHIPPED).
  • "Place New Order" button triggers a mutation, demonstrating the full cycle.
  • Connection status indicator shows subscription health.

Page 2 — Notification Center (NotificationCenter.tsx)

  • Subscribes to notificationReceived for the current user.
  • Displays a live feed of notifications with read/unread state.
  • Demonstrates the resumable subscription pattern — on page reload, missed notifications are backfilled.
  • Badge count updates in the nav bar via useSubscription.

4.5 Error Handling Strategy

| Error Type | Source | Handling Approach |
| --- | --- | --- |
| Network errors (timeout, DNS, 5xx) | RetryLink | Automatic retry with exponential backoff + jitter (max 5 attempts). Surface to UI after exhaustion. |
| GraphQL errors | ErrorLink, useSubscription error | Display inline error messages. Log with operation context. Do not retry (business logic errors). |
| Protocol errors (CombinedProtocolErrors) | Multipart stream | Log, attempt restart() on the subscription hook. Show reconnection indicator. |
| Subscription termination | Server-side close | Detect via onComplete, auto-restart with cursor for resumable subscriptions. |
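
The "exponential backoff + jitter" behaviour in the first row can be illustrated with a generic full-jitter calculation. This is a sketch of the general technique, not RetryLink's internal algorithm; the function name is hypothetical, and the defaults mirror the initial and max values used in the link configuration.

```typescript
// Delay before retry number `attempt` (1-based): exponential growth capped
// at `max`, then scaled by a random factor so clients do not retry in lockstep.
function backoffDelay(
  attempt: number,
  initial: number = 300,
  max: number = 10000,
  random: () => number = Math.random
): number {
  const capped = Math.min(max, initial * Math.pow(2, attempt - 1));
  return Math.floor(random() * capped);
}
```

Passing the random source as a parameter keeps the function deterministic under test; production code would use the default.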

Phase 5: Event System & Message Queue (Weeks 2–3, parallel)

5.1 Kafka/RabbitMQ Topic Design

# Kafka Topics (or RabbitMQ Exchanges/Queues)
orders.status.changed    → consumed by Orders subgraph
orders.new               → consumed by Orders subgraph + Notifications subgraph
notifications.system     → consumed by Notifications subgraph

# Dead Letter Topic
subscriptions.dlq        → failed callback deliveries after max retries

5.2 Event Producer (Simulation Utility)

  • CLI tool or cron job that publishes sample events to Kafka/RabbitMQ.
  • Configurable: event type, frequency, payload size, burst mode.
  • Used for demo and load testing.

5.3 Consumer Configuration

  • Each subgraph pod runs a consumer with a consumer group (Kafka) or competing consumers (RabbitMQ) pattern.
  • Manual ACK mode: only acknowledge after successful callback POST.
  • NACK with requeue on transient failures (Router 5xx, timeout).
  • Move to DLQ after N retries (configurable, default: 3).
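
The retry limit can be sketched as a single routing decision. The function name is hypothetical, and the sketch assumes the consumer can read how many redeliveries a message has already had (for example from a message header it maintains itself).

```typescript
type FailureRoute = "requeue" | "dead-letter";

// Decide what to do with a message whose callback delivery just failed.
// `retriesSoFar` counts redeliveries already attempted for this message.
function routeFailure(retriesSoFar: number, maxRetries: number = 3): FailureRoute {
  return retriesSoFar < maxRetries ? "requeue" : "dead-letter";
}
```

With the default of 3, a message is delivered at most four times (the first attempt plus three retries) before it moves to subscriptions.dlq. Kafka has no broker-side NACK, so there "requeue" would mean not committing the offset or republishing to a retry topic.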

Phase 6: Observability & Telemetry (Week 4)

6.1 Router Telemetry (OTEL)

  • Enable Router subscription debugging events per Apollo docs:
    telemetry:
      instrumentation:
        events:
          supergraph:
            # Log subscription lifecycle events
            response:
              level: info
  • Export metrics to Prometheus, traces to Jaeger/OTEL Collector.

6.2 Grafana Dashboards

  • Subscription Metrics: Active subscriptions count, callback success/failure rate, heartbeat latency.
  • Message Queue Metrics: Consumer lag, throughput, DLQ depth.
  • Router Metrics: Request rate, latency percentiles, error rates.
  • Redis Metrics: Memory usage, key count, TTL expiration rate.

6.3 Alerting Rules

  • Subscription callback failure rate > 5% over 5 minutes.
  • Kafka consumer lag > 1000 messages.
  • Redis memory usage > 80%.
  • Router pod restarts > 2 in 10 minutes.

Phase 7: Testing & Validation (Week 5)

7.1 Integration Tests

  • Subscription lifecycle: initiate → receive events → heartbeat → terminate.
  • Callback delivery under Router pod restart (verify resumable pattern).
  • Message queue consumer failover (kill one subgraph pod, verify rebalancing).

7.2 Load Testing

  • Tool: k6 or Artillery with GraphQL subscription support.
  • Scenarios:
    • 100 concurrent subscriptions, steady event stream (1 event/sec).
    • 500 concurrent subscriptions, burst event stream (50 events/sec for 10s).
    • Router pod failure during active subscriptions (chaos testing).
  • Measure: event delivery latency (p50, p95, p99), message loss rate, recovery time.
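
Latency percentiles from a load-test run can be computed with the nearest-rank method, sketched below. The function name is hypothetical, and tools such as k6 report percentiles themselves, so this is mainly useful for post-processing raw delivery timestamps.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1] as number;
}
```

Calling percentile(latenciesMs, 50), percentile(latenciesMs, 95), and percentile(latenciesMs, 99) yields the p50, p95, and p99 figures listed above.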

7.3 Chaos Engineering

  • Kill a Router pod — verify subscriptions reconnect via client restart() + resumable cursor.
  • Kill a Subgraph pod — verify Kafka/RabbitMQ rebalances, no message loss.
  • Redis connection interruption — verify graceful degradation.

Phase 8: Documentation & Portability (Week 5)

8.1 Deliverables

  • README.md — Quick start guide (Docker Compose + K8s).
  • docs/ARCHITECTURE.md — System design document with diagrams.
  • docs/RUNBOOK.md — Operational runbook (common issues, scaling procedures).
  • docs/ENV_VARS.md — Complete environment variable reference.
  • Helm values.yaml with commented defaults for environment customization.

8.2 Portability Checklist

  • All infrastructure is defined as code (Docker Compose + Helm).
  • No hardcoded URLs — all endpoints configurable via env vars.
  • Redis, Kafka/RabbitMQ can be swapped for managed services via config.
  • Router callback URL construction is dynamic (pod name + namespace).
  • CI/CD pipeline builds and pushes container images to configurable registry.
  • Helm chart supports overrides for different cloud providers (AWS/GCP/Azure).

4. Technology Stack Summary

| Layer | Technology | Purpose |
| --- | --- | --- |
| Client | React 18+ (Vite), Apollo Client v4 | UI + GraphQL subscription consumer |
| API Gateway | Kong / AWS API Gateway (optional) | Auth, rate limiting, TLS termination |
| Router | Apollo Router v2.x (Enterprise) | Federated gateway, subscription callback orchestration |
| Subgraphs | Node.js + Apollo Server v4 | Domain logic, callback protocol implementation |
| State Store | Redis 7+ | Subscription metadata, TTL, cursor tracking |
| Message Queue | Apache Kafka or RabbitMQ | Event-driven communication between services |
| Event Store | PostgreSQL (optional) | Persisted events for resumable subscriptions |
| Observability | Prometheus, Grafana, OTEL Collector | Metrics, dashboards, distributed tracing |
| Infra | Docker Compose, Kubernetes (Helm) | Environment-portable deployment |

5. Timeline Summary

| Week | Phase | Key Deliverables |
| --- | --- | --- |
| 1 | Foundation | Repo scaffolding, Docker Compose stack, K8s manifests (draft) |
| 2 | Infrastructure + Schema | Redis/Kafka running, federated schema designed, subgraph stubs |
| 3 | Subgraph + Router | Callback protocol implemented, Router config finalized, link chain |
| 4 | React App + Observability | Both pages functional, Grafana dashboards, OTEL integration |
| 5 | Testing + Docs | Load tests, chaos tests, documentation, portability validation |

6. Risks & Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Callback URL not resolvable between pods | Subscription delivery fails | Use StatefulSet + Headless Service; validate DNS resolution in init container |
| Redis TTL expiration race condition | Subscription dropped prematurely | Keep heartbeat interval ≤ TTL/2 (30s against the 60s TTL); implement TTL renewal with atomic Redis commands |
| Kafka consumer lag under burst load | Delayed event delivery | Tune partition count, consumer threads; add HPA for subgraph pods |
| Client reconnection thundering herd | Router overwhelmed after outage | Implement jittered backoff in RetryLink; stagger client reconnection |
| Router pod restart loses active subscriptions | Clients must re-subscribe | Resumable subscription pattern with cursor; client-side restart() on error |

7. Key Reference Documentation