Simulations #2

Closed
sandboxws wants to merge 24 commits into main from sim

@sandboxws
Owner

No description provided.

- Bump flink-reactor to 0.1.8-rc.1 with schema introspection fixes
- Add pnpm store prune to refresh-dsl.sh to clear stale integrity checksums
Flink's timeSinceLastHeartbeat field is an epoch timestamp, not a
duration. The dashboard was computing Date.now() - timestamp, which
produced near-zero values. Use the value directly as new Date(value).
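The dashboard itself is TypeScript; the same distinction can be sketched in Go for the server side. This is a minimal illustration, assuming the field is expressed in epoch milliseconds (the unit `new Date(value)` expects). The function name `heartbeatTime` is hypothetical and not taken from the codebase.

```go
package main

import (
	"fmt"
	"time"
)

// heartbeatTime treats the raw field as an absolute point in time.
// Assumption: the value is epoch milliseconds, not a duration.
func heartbeatTime(epochMillis int64) time.Time {
	return time.UnixMilli(epochMillis).UTC()
}

func main() {
	// A fixed "now" keeps the example deterministic.
	now := time.Date(2026, 3, 21, 12, 0, 0, 0, time.UTC)
	raw := now.Add(-4 * time.Second).UnixMilli()

	// Correct reading: the raw value already is the heartbeat timestamp.
	fmt.Println(heartbeatTime(raw).Format(time.RFC3339))
}
```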
…penSpec skills

- EXPLAIN statement support: GraphQL mutation, resolver, dashboard
  sandbox integration with explain tab in synthesis output
- Plan analyzer: parser (JSON + text), 7 analyzers (bottleneck,
  changelog, join, skew, state, watermark, window), DAG visualization
  components, 31 test fixtures, Zustand store
- Sandbox UI: streamlined editor toolbar, removed redundant output
  header, added explain button
- Sandbox bug fix report: structured analysis of 14 DSL codegen bugs
- OpenSpec skills and prompts for Claude Code
- IDE project files
Replace flat column list in catalog browser with a proper table featuring
search, sortable columns, and pagination. Style JM config tag filter pills
with their corresponding tag background colors.
Update all references after the DSL package rename:
- dashboard dependency and dynamic imports
- completions generator node_modules path
- refresh-dsl.sh cache paths (using pnpm @scope+name encoding)
- release.yml: remove `if: false`, drop DSL-repo packages (create-app,
  ts-plugin), keep only UI + dashboard builds
…visibility

Server:
- Add tap_manifests table (migration 007) with pipeline_name PK and JSONB manifest
- Add TapManifestStore with Upsert/GetByPipeline/List/Delete operations
- Replace filesystem tap.Loader with DB-backed tap.Store
- Add POST /api/tap-manifests and DELETE /api/tap-manifests/:pipeline endpoints
- Wire TapStore in main.go when storage is enabled
- Fix detail_snapshot not captured for jobs first seen in terminal state

Dashboard:
- Fix tap manifest fetch using backend base URL (was hitting Vite dev server)
- Show Tap tab only for running jobs with an existing manifest
- Shorten tap job prefix from flink-reactor-tap- to fr-tap-
- Update @flink-reactor/dsl to 0.1.8-rc.3
- Add IsNotFound() helper and FlexFloat64.MarshalJSON() for safe JSON roundtrip
- Add GetJobByID() and UpsertJobSnapshot() for DB-backed job detail fallback
- Add job_db_fallback.go with two-tier recovery (JSONB snapshot → normalized tables)
- Add 006_job_detail_snapshot migration for detail_snapshot JSONB column
- Extract mapJobDetailAggregate() shared mapper for live and DB paths
- Add connector/ package with detector, vertex name patterns, and manifest parsing
- Add sources_sinks.graphqls schema extending JobDetail with sourcesAndSinks field
- Add Sources & Sinks tab to job detail with connector type cards and I/O metrics
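The two-tier recovery mentioned above (JSONB snapshot first, normalized tables second) could look roughly like the sketch below. It uses an in-memory map in place of the database, and every name in it (`jobRow`, `JobDetail`, `loadJobDetail`, `ErrNotFound`) is a hypothetical stand-in rather than the actual contents of job_db_fallback.go.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical sentinel for a job missing from storage.
var ErrNotFound = errors.New("job not found")

// JobDetail is a trimmed, hypothetical stand-in for the real detail model.
type JobDetail struct {
	ID    string `json:"id"`
	Name  string `json:"name"`
	State string `json:"state"`
}

// jobRow mimics a stored job: normalized columns plus an optional snapshot.
type jobRow struct {
	ID, Name, State string
	DetailSnapshot  []byte // JSONB payload; empty when never captured
}

// loadJobDetail returns the detail and the tier that produced it.
// Tier 1 decodes the snapshot; tier 2 rebuilds from normalized columns
// when the snapshot is absent or cannot be decoded.
func loadJobDetail(rows map[string]jobRow, id string) (JobDetail, string, error) {
	row, ok := rows[id]
	if !ok {
		return JobDetail{}, "", ErrNotFound
	}
	if len(row.DetailSnapshot) > 0 {
		var d JobDetail
		if err := json.Unmarshal(row.DetailSnapshot, &d); err == nil {
			return d, "snapshot", nil
		}
	}
	return JobDetail{ID: row.ID, Name: row.Name, State: row.State}, "normalized", nil
}

func main() {
	rows := map[string]jobRow{
		"a": {ID: "a", Name: "orders", State: "FINISHED",
			DetailSnapshot: []byte(`{"id":"a","name":"orders","state":"FINISHED"}`)},
		"b": {ID: "b", Name: "clicks", State: "FAILED"},
	}
	for _, id := range []string{"a", "b", "missing"} {
		d, tier, err := loadJobDetail(rows, id)
		fmt.Println(id, d.Name, tier, err)
	}
}
```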
sim-infra-01: K8s manifests for minikube simulation stack
- SeaweedFS (S3-compatible checkpoint storage), Kafka KRaft,
  PostgreSQL, SQL Gateway, reactor-server with ConfigMap
- Custom Flink image Dockerfile with S3 plugin
- README with quick start guide

sim-console-01: Job lifecycle mutations (savepoint, stop, rescale)
- Go: TriggerSavepoint, StopWithSavepoint, RescaleJob service methods
- GraphQL: triggerSavepoint, stopJobWithSavepoint, rescaleJob mutations
- Dashboard: wire savepoint button to real API, add Stop button,
  add Stop All Jobs to cluster overview page
sim-console-02: Chaos engineering simulation system
- New simulation package: engine orchestrator, 11 scenario presets
  (resource stress, checkpoint, load, failure), PostgreSQL store
- DB migration 008: simulation_runs + simulation_observations tables
- GraphQL schema: SimulationRun, SimulationPreset, SimulationObservation
  types with queries (list/get runs, presets) and mutations (run/stop)
- Resolvers map between domain and GraphQL model types
- Engine wired into main.go (conditional on storage enabled)
- One simulation at a time (mutex-guarded), background goroutine execution
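The "one simulation at a time" rule with background execution can be sketched as follows. This is only an illustration of a mutex-guarded flag plus a goroutine; `Engine`, `Start`, and `ErrAlreadyRunning` are hypothetical names, and the real engine also handles scenarios and storage.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrAlreadyRunning is a hypothetical error for a rejected second run.
var ErrAlreadyRunning = errors.New("a simulation is already running")

// Engine allows at most one active run; the mutex guards the running flag.
type Engine struct {
	mu      sync.Mutex
	running bool
}

// Start launches run in a background goroutine, or returns
// ErrAlreadyRunning if another run is still active. The returned
// channel is closed after the run finishes and the flag is cleared.
func (e *Engine) Start(run func()) (<-chan struct{}, error) {
	e.mu.Lock()
	if e.running {
		e.mu.Unlock()
		return nil, ErrAlreadyRunning
	}
	e.running = true
	e.mu.Unlock()

	done := make(chan struct{})
	go func() {
		defer close(done) // runs last: signals completion
		defer func() {    // runs first: frees the slot
			e.mu.Lock()
			e.running = false
			e.mu.Unlock()
		}()
		run()
	}()
	return done, nil
}

func main() {
	e := &Engine{}
	release := make(chan struct{})

	done, _ := e.Start(func() { <-release }) // first run blocks until released
	_, err := e.Start(func() {})             // second run is rejected
	fmt.Println("second start:", err)

	close(release)
	<-done
	_, err = e.Start(func() {}) // slot is free again
	fmt.Println("third start:", err)
}
```

The mutex is held only while the flag is read or written, not for the duration of the run, so callers that are rejected return immediately instead of blocking.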
sim-console-03: Simulation dashboard UI
- Admin sidebar group with Simulations and Benchmarks items
- Simulation store (Zustand) with presets, runs, active polling
- Preset grid organized by category (resource/checkpoint/load/failure)
- Inline parameter configuration per preset card
- Active simulation panel with live observation polling (3s)
- History table with status badges and run detail links
- Run detail page with observation timeline

sim-console-04: Benchmark collection page
- Run selector table with multi-select checkboxes (max 5)
- Comparison cards showing metric averages per run
- Empty state with link to Simulations page
- GraphQL client functions for simulation queries/mutations
- Guard against undefined observations in timeline component
- Include observations in runSimulation mutation response
- Add error banner to simulations page (shows API errors like
  missing storage or infrastructure)
Clicking Run now opens a modal that checks:
- Flink cluster reachable (required)
- PostgreSQL storage connected (required)
- Running jobs available (optional, with deploy instructions)
- No other simulation active (required)

Each check shows pass/fail/warn status with fix instructions.
Launch button only enabled when all required checks pass.
Re-check button to retry. Applied to both grid and list views.
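The gating rule for the Launch button (all required checks must pass, optional checks may warn) can be expressed as a small predicate. The modal itself lives in the dashboard; this Go sketch only illustrates the rule, and the types `Check` and `Status` and the function `canLaunch` are hypothetical.

```go
package main

import "fmt"

// Status is the outcome of a single preflight check.
type Status string

const (
	Pass Status = "pass"
	Fail Status = "fail"
	Warn Status = "warn"
)

// Check is a hypothetical shape for one preflight result.
type Check struct {
	Name     string
	Required bool
	Status   Status
	Fix      string // fix instructions shown next to a failing check
}

// canLaunch reports whether every required check passed.
// Optional checks never block the launch, whatever their status.
func canLaunch(checks []Check) bool {
	for _, c := range checks {
		if c.Required && c.Status != Pass {
			return false
		}
	}
	return true
}

func main() {
	checks := []Check{
		{Name: "Flink cluster reachable", Required: true, Status: Pass},
		{Name: "PostgreSQL storage connected", Required: true, Status: Pass},
		{Name: "Running jobs available", Required: false, Status: Warn, Fix: "deploy a job first"},
		{Name: "No other simulation active", Required: true, Status: Pass},
	}
	fmt.Println("launch enabled:", canLaunch(checks))
}
```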
- Each run row has a View link → navigates to full run detail
- Checkboxes for multi-select, Compare button appears when 2+ selected
- Compare fetches full observation data and shows side-by-side metric table
- Metrics show avg, min–max range, and sample count per run
- Clear button to dismiss comparison
- Help text explaining View vs Compare workflow
Preflight now validates the full infrastructure chain:
- Kubernetes cluster reachable (kubectl cluster-info)
- flink-demo namespace exists
- Flink Operator running in flink-system
- Kafka, PostgreSQL, SeaweedFS pods running in flink-demo
- PostgreSQL storage connected (server config)
- Flink cluster reachable (REST API)
- FlinkDeployments exist (optional)
- Kafka instrument healthy (optional)
- No active simulation running

Checks run server-side via new simulationPreflight GraphQL query.
Docker-only clusters correctly show failures for K8s checks.
…arallel

- Check kubectl existence upfront via exec.LookPath — if not found,
  all K8s checks instantly return "fail" with install instructions
- Run all 11 checks in parallel via goroutines (was sequential)
- Reduce kubectl timeout from 5s to 3s per check
- Eliminates 35s+ hang when kubectl is missing or can't connect
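The combination described above (an upfront `exec.LookPath` test, then parallel probes with a 3s timeout each) might be structured like this. It is a sketch under assumptions: `Check`, `Result`, `preflight`, `kubectlCheck`, and `static` are hypothetical names, and the lookup function is passed in so the missing-kubectl path can be exercised without touching the real PATH.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"sync"
	"time"
)

const kubectlTimeout = 3 * time.Second // per-check timeout from the commit message

var errUnreachable = errors.New("unreachable") // sample failure for demos

// CheckFunc is a hypothetical signature for one probe.
type CheckFunc func(ctx context.Context) error

type Check struct {
	Name string
	Run  CheckFunc
}

type Result struct {
	Name, Status, Hint string
}

// kubectlCheck builds a probe that shells out to kubectl with the given args.
func kubectlCheck(args ...string) CheckFunc {
	return func(ctx context.Context) error {
		return exec.CommandContext(ctx, "kubectl", args...).Run()
	}
}

// static builds a probe with a fixed outcome (handy for demos and tests).
func static(err error) CheckFunc {
	return func(context.Context) error { return err }
}

// preflight fails every check at once when kubectl is not installed;
// otherwise it runs all checks concurrently and keeps the input order.
func preflight(lookPath func(string) (string, error), checks []Check) []Result {
	results := make([]Result, len(checks))
	if _, err := lookPath("kubectl"); err != nil {
		for i, c := range checks {
			results[i] = Result{c.Name, "fail", "kubectl not found on PATH; install kubectl"}
		}
		return results
	}
	var wg sync.WaitGroup
	for i := range checks {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), kubectlTimeout)
			defer cancel()
			res := Result{Name: checks[i].Name, Status: "pass"}
			if err := checks[i].Run(ctx); err != nil {
				res.Status, res.Hint = "fail", err.Error()
			}
			results[i] = res // each goroutine writes only its own slot
		}(i)
	}
	wg.Wait()
	return results
}

func main() {
	checks := []Check{
		{"Kubernetes cluster reachable", kubectlCheck("cluster-info")},
		{"flink-demo namespace exists", kubectlCheck("get", "namespace", "flink-demo")},
	}
	for _, r := range preflight(exec.LookPath, checks) {
		fmt.Println(r.Name, r.Status, r.Hint)
	}
}
```

Because the checks run concurrently, the worst-case wall time is roughly one timeout rather than the sum of all of them.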
Checks now short-circuit — if a required check fails, downstream
checks are not shown:

  kubectl → K8s cluster → namespace → [pods] → FlinkDeployments

Pod checks verify actual K8s pods by label selector and phase,
tied to specific services (Kafka:9092, PostgreSQL:5432,
SeaweedFS:8333, SQL Gateway:8083, reactor-server:8080, Operator).

Removes false-positive green checks for local Docker services
that aren't the minikube infrastructure.
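The short-circuit behaviour along the chain shown above can be sketched as an ordered walk that stops at the first failed required step. `Step`, `Outcome`, and `runChain` are hypothetical names; treating a failed optional step as a warning is an assumption based on the optional checks described elsewhere in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

var errDown = errors.New("unreachable") // sample failure for the demo

// Step is one link in the dependency chain.
type Step struct {
	Name     string
	Required bool
	Run      func() error
}

type Outcome struct {
	Name, Status string
}

// runChain executes steps in order. A failed required step is reported
// and ends the walk, so downstream steps never appear in the output.
// A failed optional step is reported as a warning and the walk continues.
func runChain(steps []Step) []Outcome {
	var out []Outcome
	for _, s := range steps {
		err := s.Run()
		switch {
		case err == nil:
			out = append(out, Outcome{s.Name, "pass"})
		case s.Required:
			return append(out, Outcome{s.Name, "fail"})
		default:
			out = append(out, Outcome{s.Name, "warn"})
		}
	}
	return out
}

func main() {
	ok := func() error { return nil }
	bad := func() error { return errDown }

	steps := []Step{
		{"kubectl", true, ok},
		{"K8s cluster", true, bad},
		{"namespace", true, ok}, // not reached: cluster check failed
		{"FlinkDeployments", false, ok},
	}
	for _, o := range runChain(steps) {
		fmt.Println(o.Name, o.Status)
	}
}
```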
Iceberg REST deployment manifest for minikube and an optional
preflight check that warns (not fails) when the catalog is missing.
Infrastructure manifests belong with the CLI that manages them,
not the console. Preflight fix hints now reference `flink-reactor sim up`
instead of raw manifest paths.
@sandboxws sandboxws self-assigned this Mar 21, 2026
@sandboxws sandboxws closed this Mar 21, 2026
@github-actions github-actions bot locked and limited conversation to collaborators Mar 21, 2026