fix(docker): healthcheck never transitions to healthy#1538

Merged
brunobuddy merged 1 commit intomainfrom
fix/docker-healthcheck-start-period
Apr 13, 2026
@brunobuddy brunobuddy commented Apr 13, 2026

Summary

Closes #1532.

Root cause

The published image's `HEALTHCHECK` runs `wget -qO- http://localhost:3001/api/v1/health`. Alpine's `/etc/hosts` lists `::1 localhost` after `127.0.0.1 localhost`, so BusyBox `wget` tries IPv6 first and hits `[::1]:3001`, where Node.js is not listening (binding to `0.0.0.0` is IPv4-only by default). Result: "Connection refused" on every probe, and the status stays stuck at `starting` forever even though the server is fully up and serving 200s via `127.0.0.1`.

The Dockerfile in `main` already has the correct `127.0.0.1` from a previous commit, but the image on Docker Hub hasn't been rebuilt since, so every compose pull today still inherits the broken `localhost` healthcheck from the image.

On top of that, `--start-period=10s` is too tight for a cold boot. With `SEED_DATA=true` the server takes ~25-30 seconds to finish booting (migrations, OpenRouter pricing cache, models.dev cache, demo seed). First probe runs at t=10s against a server that isn't ready, eating the retry budget.
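To make the timing argument concrete, here is a minimal sketch of the comparison. It encodes only the documented Docker rule that probe failures inside the start period are not counted toward `--retries`. The helper name `boot_fits_start_period` is hypothetical, and 30 seconds is simply the upper end of the boot time quoted above.

```shell
#!/bin/sh
# Hypothetical helper: does a boot of BOOT seconds finish inside a
# HEALTHCHECK start period of START seconds?
# Failed probes during the start period are not counted toward --retries,
# so a boot that fits inside it cannot consume the retry budget.
boot_fits_start_period() {
  boot="$1"
  start="$2"
  if [ "$boot" -le "$start" ]; then
    echo "fits"
  else
    echo "too tight"
  fi
}

boot_fits_start_period 30 10   # old --start-period
boot_fits_start_period 30 45   # new --start-period
```

With the old value the slow boot overruns the start period; with 45s it has roughly 15 seconds of headroom.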

Fixes

  1. Dockerfile `--start-period=10s` → `45s`. Next image rebuild will have room to finish booting before retries start counting.
  2. Add an explicit `healthcheck:` block to `docker/docker-compose.yml` that overrides the (broken) image-level HEALTHCHECK. Uses `127.0.0.1` and the bumped 45s start period. This gives users a working healthcheck immediately from the next compose pull, without waiting for an image republish.
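For anyone who wants the same effect before pulling the updated compose file, the override can be sketched as a local `docker-compose.override.yml`, which Compose merges automatically with `docker-compose.yml`. This is an illustration, not the exact block from this PR: the service name `manifest`, the exec-form `test` array, and the helper name `write_healthcheck_override` are assumptions. The timing values are the ones stated in this PR.

```shell
#!/bin/sh
# Hypothetical helper: write a compose override that replaces the
# image-level HEALTHCHECK with one that targets 127.0.0.1.
# Assumed: service name "manifest" and the exec-form test below.
write_healthcheck_override() {
  cat > "$1" <<'EOF'
services:
  manifest:
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://127.0.0.1:3001/api/v1/health"]
      interval: 30s
      timeout: 5s
      start_period: 45s
      retries: 3
EOF
}

# Usage (run next to docker-compose.yml):
#   write_healthcheck_override docker-compose.override.yml
#   docker compose up -d
```

A service-level `healthcheck:` takes precedence over the `HEALTHCHECK` baked into the image, which is why this works without a republish.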

Reproduction (before this PR)

```bash
curl -O https://raw.githubusercontent.com/mnfst/manifest/main/docker/docker-compose.yml
docker compose up -d

# Server is up
docker exec manifest-install-test-manifest-1 wget -qO- http://127.0.0.1:3001/api/v1/health
# → {"status":"healthy","uptime_seconds":60,"mode":"cloud","devMode":true}

# But docker thinks it's still starting
docker inspect --format='{{.State.Health.Status}}' manifest-install-test-manifest-1
# → starting
```

Health log shows repeated `wget: can't connect to remote host: Connection refused` even though the same command succeeds when run via `docker exec`.
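The health log mentioned above can be read straight from `docker inspect`. A small sketch follows; the wrapper name `health_log` is hypothetical, while `.State.Health.Log` is the list of recent probe results (exit code and captured output) that Docker keeps for a container.

```shell
#!/bin/sh
# Hypothetical wrapper: dump the recent healthcheck probe results
# (start/end time, exit code, captured output) as JSON.
health_log() {
  docker inspect --format '{{json .State.Health.Log}}' "$1"
}

# Usage, with the container name from the reproduction:
#   health_log manifest-install-test-manifest-1
```

In the broken state, each entry's output should carry the `Connection refused` message from `wget` along with a non-zero exit code.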

Verification (with this PR)

Applied the patched compose file locally, booted clean:

```bash
docker compose up -d
# postgres healthy, manifest starting
# → healthcheck fires on first probe against 127.0.0.1, exit 0
# → status flips to healthy
```

`docker inspect --format='{{.State.Health.Status}}'` returns `healthy`.
`docker inspect --format='{{json .Config.Healthcheck}}'` shows the overridden block with `start_period: 45s` and the `127.0.0.1` target.
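To repeat this check without watching the status by hand, a polling loop along these lines may help. It is a sketch only: the function name `wait_healthy` and its default attempt count and delay are assumptions, not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper: poll the health status until it is "healthy",
# or give up after a fixed number of attempts.
# Usage: wait_healthy <container> [max_attempts] [sleep_seconds]
wait_healthy() {
  container="$1"
  attempts="${2:-30}"
  delay="${3:-2}"
  status=""
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status="$(docker inspect --format '{{.State.Health.Status}}' "$container")"
    if [ "$status" = "healthy" ]; then
      echo "healthy"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "gave up, last status: $status"
  return 1
}

# Usage:
#   wait_healthy manifest-install-test-manifest-1 30 2
```

The non-zero return on timeout makes it usable as a gate in a script or CI step.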

Test plan

  • Compose override boots to `healthy` on a cold pull
  • Dockerfile still builds cleanly (compose pull uses inherited image but the override runs)
  • CI passes
  • `Docker / Build (validate)` CI job rebuilds the image with the new start_period

Summary by cubic

Fixes Docker healthcheck stuck at “starting” by forcing IPv4 and extending the start period, so containers reliably reach “healthy” after cold boots.

  • Bug Fixes
    • Dockerfile: increase HEALTHCHECK --start-period from 10s to 45s to cover slow startups (e.g., SEED_DATA=true).
    • Compose: add a healthcheck override in docker/docker-compose.yml using wget against http://127.0.0.1:3001/api/v1/health with interval 30s, timeout 5s, start_period 45s, retries 3 (avoids IPv6 ::1 where Node isn’t listening).

Written for commit ef7dff7. Summary will update on new commits.

Two related fixes that together close #1532.

## Root cause

The published image has HEALTHCHECK running
`wget -qO- http://localhost:3001/api/v1/health`. Alpine's /etc/hosts
lists `::1 localhost` after `127.0.0.1 localhost`, so BusyBox wget
tries IPv6 first and hits `[::1]:3001`, where Node.js is not
listening (binding to 0.0.0.0 is IPv4-only by default). Result:
"Connection refused" on every probe, status stuck at `starting`
forever even though the server is fully up and serving 200s on
/api/v1/health via 127.0.0.1.

The Dockerfile in main already has the correct `127.0.0.1` from
a previous commit, but the image on Docker Hub has not been
rebuilt since, so every compose-file pull today gets a broken
healthcheck via image inheritance.

## Fixes

1. Bump `--start-period=10s` to `45s` in the Dockerfile HEALTHCHECK.
   On a cold pull with `SEED_DATA=true`, the server takes 25-30
   seconds to finish booting (migrations, OpenRouter and models.dev
   pricing caches, demo seed). The 10s start period leaves no room
   for a slow boot to succeed before the retry budget is consumed.

2. Add an explicit `healthcheck:` block to docker-compose.yml that
   overrides the (broken) image-level HEALTHCHECK with one using
   `127.0.0.1` and the bumped 45s start period. This gives users
   a working healthcheck immediately from the next compose pull,
   without waiting for an image republish.

## Verification

With the patched compose file:

    docker compose up -d
    # postgres healthy, manifest starting
    # server binds at t+5s, healthcheck runs at t+5.9s → exit 0
    # status flips to healthy on the first probe

Confirmed locally via `docker inspect --format '{{.State.Health.Status}}'`.
@codecov

codecov bot commented Apr 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.43%. Comparing base (4043fae) to head (ef7dff7).
⚠️ Report is 2 commits behind head on main.


@@           Coverage Diff           @@
##             main    #1538   +/-   ##
=======================================
  Coverage   98.43%   98.43%           
=======================================
  Files         118      118           
  Lines        8653     8653           
  Branches     3278     3278           
=======================================
  Hits         8518     8518           
  Misses        134      134           
  Partials        1        1           
| Flag | Coverage Δ |
|---|---|
| frontend | 98.43% <ø> (ø) |
| shared | 100.00% <ø> (ø) |



@cubic-dev-ai cubic-dev-ai bot left a comment


No issues found across 2 files

@brunobuddy brunobuddy merged commit 857f416 into main Apr 13, 2026
15 checks passed
@brunobuddy brunobuddy deleted the fix/docker-healthcheck-start-period branch April 13, 2026 02:46
brunobuddy added a commit that referenced this pull request Apr 13, 2026
Resolves conflicts after main merged parallel cleanup PRs (#1528, #1533,
#1534, #1536, #1537, #1538) that removed both openclaw-plugins packages.
Takes main's direction on:
- openclaw-plugins/manifest-model-router removal (main deleted both)
- docker-compose.yml local-testing framing (main's choice)
- Dockerfile healthcheck start-period=45s (main's fix)

Keeps this branch's unique contributions:
- Backend: delete all local-mode source files (LocalAuthGuard,
  local-mode.constants, local-bootstrap.service, limit-check-local,
  version-check.service), simplify sql-dialect.ts to Postgres-only,
  drop sql.js dep, remove backend-sqljs CI job
- Frontend: delete services/local-mode.ts, VersionIndicator component,
  remove isLocalMode/isDevMode branches across ~14 components
- AUTO_MIGRATE env var in database.module.ts + seeder
- Unified EMAIL_* env var scheme in send-email.ts + app.config.ts +
  notification-email.service.ts (covers both Better Auth transactional
  and threshold alerts via existing fallback)
- Docker compose: add AUTO_MIGRATE, EMAIL_*, OAuth env var passthroughs;
  drop MANIFEST_TRUST_LAN (dead after LocalAuthGuard removal)
- backend/.env.example: new EMAIL_* block, deprecate legacy MAILGUN_*
- Rewrite E2E test helpers to use Postgres (was sql.js)

Development

Successfully merging this pull request may close these issues.

Manifest container healthcheck never transitions to healthy
