fix(docker): healthcheck never transitions to healthy by brunobuddy · Pull Request #1538 · mnfst/manifest

brunobuddy · 2026-04-13T01:28:50Z

Summary

Closes #1532.

Root cause

The published image has `HEALTHCHECK CMD wget -qO- http://localhost:3001/api/v1/health`. Alpine's `/etc/hosts` lists `::1 localhost` after `127.0.0.1 localhost`, so BusyBox `wget` tries IPv6 first and hits `[::1]:3001`, where Node.js is not listening (binding to `0.0.0.0` is IPv4-only by default). Result: "Connection refused" on every probe, status stuck at `starting` forever even though the server is fully up and serving 200s via `127.0.0.1`.
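To see which addresses `localhost` can resolve to inside the container, a small helper like the one below may be useful. It is only a diagnostic sketch: `hosts_addrs` is a hypothetical name, not something in the repository, and it reads a hosts file rather than querying the resolver, so it shows the candidate addresses but not the order in which `wget` tries them.

```shell
# List every address that a hosts file maps to a given name.
# Usage: hosts_addrs <name> [hosts_file]   (hosts_file defaults to /etc/hosts)
hosts_addrs() {
  awk -v name="$1" '
    { sub(/#.*/, "") }   # drop comments; awk re-splits the fields afterwards
    { for (i = 2; i <= NF; i++) if ($i == name) { print $1; break } }
  ' "${2:-/etc/hosts}"
}
```

For example, `docker exec manifest-install-test-manifest-1 cat /etc/hosts > hosts.txt` followed by `hosts_addrs localhost hosts.txt` should list both `127.0.0.1` and `::1` if the container's hosts file matches the description above.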

The Dockerfile in main already has the correct `127.0.0.1` from a previous commit, but the image on Docker Hub hasn't been rebuilt since — so every compose pull today still inherits the broken `localhost` healthcheck from the image.

On top of that, `--start-period=10s` is too tight for a cold boot. With `SEED_DATA=true` the server takes ~25-30 seconds to finish booting (migrations, OpenRouter pricing cache, models.dev cache, demo seed). First probe runs at t=10s against a server that isn't ready, eating the retry budget.

Fixes

- Dockerfile `--start-period=10s` → `45s`. Next image rebuild will have room to finish booting before retries start counting.
- Add an explicit `healthcheck:` block to `docker/docker-compose.yml` that overrides the (broken) image-level HEALTHCHECK. Uses `127.0.0.1` and the bumped 45s start period. This gives users a working healthcheck immediately from the next compose pull, without waiting for an image republish.
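For anyone who wants the same effect without editing the downloaded compose file, the override can be sketched as a separate file written from the shell. This is an illustration, not the exact block from this PR: the service name `manifest` is inferred from the container name in the reproduction, the `wget -qO-` flags are assumed to mirror the image-level check, and the interval, timeout, start period and retries are the values quoted in the cubic summary.

```shell
# Write a Compose override that replaces the image-level HEALTHCHECK.
# Assumptions: the service is called "manifest" and the wget flags match the image's check.
OVERRIDE_FILE="${OVERRIDE_FILE:-docker-compose.override.yml}"

cat > "$OVERRIDE_FILE" <<'EOF'
services:
  manifest:
    healthcheck:
      # 127.0.0.1 instead of localhost, so the probe never resolves to ::1
      test: ["CMD", "wget", "-qO-", "http://127.0.0.1:3001/api/v1/health"]
      interval: 30s
      timeout: 5s
      start_period: 45s
      retries: 3
EOF

echo "wrote $OVERRIDE_FILE"
```

Compose merges a `docker-compose.override.yml` that sits next to `docker-compose.yml` when no `-f` flag is given, so a plain `docker compose up -d` should pick it up.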

Reproduction (before this PR)

```bash
curl -O https://raw.githubusercontent.com/mnfst/manifest/main/docker/docker-compose.yml
docker compose up -d

# Server is up
docker exec manifest-install-test-manifest-1 wget -qO- http://127.0.0.1:3001/api/v1/health
# → {"status":"healthy","uptime_seconds":60,"mode":"cloud","devMode":true}

# But docker thinks it's still starting
docker inspect --format='{{.State.Health.Status}}' manifest-install-test-manifest-1
# → starting
```

Health log shows repeated `wget: can't connect to remote host: Connection refused` even though the same command succeeds when run via `docker exec`.
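The health log mentioned here can be read with `docker inspect`. A minimal sketch, assuming a container name like the one in the reproduction; `health_log` is a hypothetical helper name:

```shell
# Print the exit code and captured output of each recorded health probe.
# Usage: health_log <container>
health_log() {
  docker inspect \
    --format '{{range .State.Health.Log}}{{.ExitCode}}: {{.Output}}{{end}}' \
    "$1"
}

# Example (needs a running container):
#   health_log manifest-install-test-manifest-1
```

Docker keeps only the most recent probe results in this log, so it reflects the latest failures rather than the full history.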

Verification (with this PR)

Applied the patched compose file locally, booted clean:

```bash
docker compose up -d
# postgres healthy, manifest starting
# → healthcheck fires on first probe against 127.0.0.1, exit 0
# → status flips to healthy
```

`docker inspect --format='{{.State.Health.Status}}'` returns `healthy`.
`docker inspect --format='{{json .Config.Healthcheck}}'` shows the overridden block with `start_period: 45s` and the `127.0.0.1` target.
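The manual check above can be turned into a small polling loop, which may be handy for a cold-pull smoke test. This is a sketch under assumptions: `wait_healthy` and its default timeout and poll interval are illustrative choices, not part of this PR.

```shell
# Poll a container's health status until it is "healthy", "unhealthy", or a timeout elapses.
# Usage: wait_healthy <container> [timeout_seconds] [poll_interval_seconds]
wait_healthy() {
  local container="$1" timeout="${2:-90}" interval="${3:-2}" waited=0 status=""
  while [ "$waited" -le "$timeout" ]; do
    status="$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)"
    case "$status" in
      healthy)   echo "healthy after ~${waited}s"; return 0 ;;
      unhealthy) echo "unhealthy after ~${waited}s" >&2; return 1 ;;
    esac
    # Still "starting" (or the container is not inspectable yet): wait and retry.
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "timed out after ${timeout}s (last status: ${status:-unknown})" >&2
  return 1
}

# Example (needs a running container):
#   wait_healthy manifest-install-test-manifest-1 120
```

The reported time only counts the sleeps between polls, so it is approximate and should be read as a lower bound.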

Test plan

- Compose override boots to `healthy` on a cold pull
- Dockerfile still builds cleanly (compose pull uses inherited image but the override runs)
- CI passes
- `Docker / Build (validate)` CI job rebuilds the image with the new start_period

Summary by cubic

Fixes Docker healthcheck stuck at “starting” by forcing IPv4 and extending the start period, so containers reliably reach “healthy” after cold boots.

Bug Fixes
- Dockerfile: increase HEALTHCHECK --start-period from 10s to 45s to cover slow startups (e.g., SEED_DATA=true).
- Compose: add a healthcheck override in docker/docker-compose.yml using wget against http://127.0.0.1:3001/api/v1/health with interval 30s, timeout 5s, start_period 45s, retries 3 (avoids IPv6 ::1 where Node isn’t listening).

Written for commit ef7dff7. Summary will update on new commits.

Two related fixes that together close #1532.

## Root cause

The published image has HEALTHCHECK running `wget -qO- http://localhost:3001/api/v1/health`. Alpine's /etc/hosts lists `::1 localhost` after `127.0.0.1 localhost`, so BusyBox wget tries IPv6 first and hits `[::1]:3001`, where Node.js is not listening (binding to 0.0.0.0 is IPv4-only by default). Result: "Connection refused" on every probe, status stuck at `starting` forever even though the server is fully up and serving 200s on /api/v1/health via 127.0.0.1.

The Dockerfile in main already has the correct `127.0.0.1` from a previous commit, but the image on Docker Hub has not been rebuilt since, so every compose-file pull today gets a broken healthcheck via image inheritance.

## Fixes

1. Bump `--start-period=10s` to `45s` in the Dockerfile HEALTHCHECK. On a cold pull with `SEED_DATA=true`, the server takes 25-30 seconds to finish booting (migrations, OpenRouter and models.dev pricing caches, demo seed). The 10s start period leaves no room for a slow boot to succeed before the retry budget is consumed.
2. Add an explicit `healthcheck:` block to docker-compose.yml that overrides the (broken) image-level HEALTHCHECK with one using `127.0.0.1` and the bumped 45s start period. This gives users a working healthcheck immediately from the next compose pull, without waiting for an image republish.

## Verification

With the patched compose file:

    docker compose up -d
    # postgres healthy, manifest starting
    # server binds at t+5s, healthcheck runs at t+5.9s → exit 0
    # status flips to healthy on the first probe

Confirmed locally via `docker inspect --format '{{.State.Health.Status}}'`.

codecov · 2026-04-13T01:29:55Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.43%. Comparing base (4043fae) to head (ef7dff7).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1538   +/-   ##
=======================================
  Coverage   98.43%   98.43%           
=======================================
  Files         118      118           
  Lines        8653     8653           
  Branches     3278     3278           
=======================================
  Hits         8518     8518           
  Misses        134      134           
  Partials        1        1

| Flag | Coverage Δ |
|---|---|
| frontend | `98.43% <ø> (ø)` |
| shared | `100.00% <ø> (ø)` |


cubic-dev-ai

No issues found across 2 files

Resolves conflicts after main merged parallel cleanup PRs (#1528, #1533, #1534, #1536, #1537, #1538) that removed both openclaw-plugins packages.

Takes main's direction on:
- openclaw-plugins/manifest-model-router removal (main deleted both)
- docker-compose.yml local-testing framing (main's choice)
- Dockerfile healthcheck start-period=45s (main's fix)

Keeps this branch's unique contributions:
- Backend: delete all local-mode source files (LocalAuthGuard, local-mode.constants, local-bootstrap.service, limit-check-local, version-check.service), simplify sql-dialect.ts to Postgres-only, drop sql.js dep, remove backend-sqljs CI job
- Frontend: delete services/local-mode.ts, VersionIndicator component, remove isLocalMode/isDevMode branches across ~14 components
- AUTO_MIGRATE env var in database.module.ts + seeder
- Unified EMAIL_* env var scheme in send-email.ts + app.config.ts + notification-email.service.ts (covers both Better Auth transactional and threshold alerts via existing fallback)
- Docker compose: add AUTO_MIGRATE, EMAIL_*, OAuth env var passthroughs; drop MANIFEST_TRUST_LAN (dead after LocalAuthGuard removal)
- backend/.env.example: new EMAIL_* block, deprecate legacy MAILGUN_*
- Rewrite E2E test helpers to use Postgres (was sql.js)

cubic-dev-ai bot reviewed Apr 13, 2026

brunobuddy merged commit 857f416 into main Apr 13, 2026
15 checks passed

brunobuddy deleted the fix/docker-healthcheck-start-period branch April 13, 2026 02:46

brunobuddy merged 1 commit into main from fix/docker-healthcheck-start-period
