fix: fast-fail on_load_session when thread ID is missing or wrong #8543

Open
morgmart wants to merge 1 commit into main from fix/acp-load-session-fast-fail

Overview

Category: fix
User Impact: Sessions that previously hung for 30 seconds and showed a blank screen now fail immediately with a clear error message.

Problem: When clicking on certain sessions in the Goose2 app, the conversation view hangs for 30 seconds, then shows a blank screen. This happens because the UI sometimes sends a legacy session ID (like 20260415_1) instead of a thread UUID to the load_session RPC. The goose binary's on_load_session handler tries to look up a thread by that ID and fails. The request then either hangs or its error is silently swallowed, so the user never sees what went wrong.

Solution: Replace the thread lookup's silent failure with a diagnostic check. When get_thread fails, we now check whether the requested ID is a known session in the sessions table, and return an immediate error with a specific message explaining the mismatch. This turns a 30-second hang into an instant, actionable error.
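The diagnostic check described above can be sketched roughly as follows. This is a minimal illustration, not the actual code in server.rs: `Store`, `SessionRow`, and `resolve_thread` are hypothetical stand-ins for the real thread and session storage, and only the three error messages are taken from this PR's commit message.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a row in the sessions table.
#[derive(Debug, Clone)]
pub struct SessionRow {
    /// `None` models a NULL thread_id column.
    pub thread_id: Option<String>,
}

/// Hypothetical in-memory stand-in for the thread and session stores.
#[derive(Default)]
pub struct Store {
    /// Thread UUID -> thread title.
    pub threads: HashMap<String, String>,
    /// Session ID -> session row.
    pub sessions: HashMap<String, SessionRow>,
}

impl Store {
    /// Looks up a thread by its UUID (assumed shape of the real get_thread).
    fn get_thread(&self, id: &str) -> Result<&String, String> {
        self.threads
            .get(id)
            .ok_or_else(|| format!("thread {id} not found"))
    }
}

/// Resolves the requested ID to a thread, or fails immediately with a
/// message that says which kind of mismatch occurred.
pub fn resolve_thread<'a>(store: &'a Store, requested_id: &str) -> Result<&'a String, String> {
    match store.get_thread(requested_id) {
        Ok(thread) => Ok(thread),
        // The thread lookup failed: check whether the ID is a known session.
        Err(_) => match store.sessions.get(requested_id) {
            Some(SessionRow { thread_id: Some(thread_id) }) => Err(format!(
                "Session {requested_id} has linked thread {thread_id}, but was sent as the session_id instead of the thread UUID"
            )),
            Some(SessionRow { thread_id: None }) => Err(format!(
                "Session {requested_id} exists but has no linked thread"
            )),
            None => Err("Session not found".to_string()),
        },
    }
}
```

Every failure branch returns an `Err` right away, so in this sketch the caller never has to wait for a timeout to learn that the ID was wrong.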

Root Cause Investigation

This fix addresses the symptom. The investigation uncovered three deeper issues that this PR does not fix but documents here for the ACP engineer:

1. UI sends date-based session IDs instead of thread UUIDs

The loadSessions() function in chatSessionStore.ts has a fallback path (catch block) that loads sessions from localStorage overlays when acpListSessions() fails. These overlays use date-based session IDs (20260415_1) as their key instead of thread UUIDs. Once loaded this way, subsequent session clicks send the wrong ID to the backend.

Evidence: Timing logs showed session=20260412_8 goose=20260412_8 — the date-based ID being sent for sessions that have valid thread UUIDs in the database.
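One way to catch this class of mismatch early is to classify the incoming ID by shape before any lookup. The sketch below is illustrative only: `IdKind` and `classify_id` are hypothetical names, and the legacy format (eight digits, an underscore, then a numeric counter) is an assumption inferred from the IDs quoted in this PR, not a documented specification.

```rust
/// The two ID shapes discussed in this PR, plus a catch-all.
#[derive(Debug, PartialEq, Eq)]
pub enum IdKind {
    ThreadUuid,
    LegacySessionId,
    Unknown,
}

/// Returns true for the canonical 8-4-4-4-12 hexadecimal UUID layout.
fn looks_like_uuid(id: &str) -> bool {
    let groups: Vec<&str> = id.split('-').collect();
    let expected_lengths = [8, 4, 4, 4, 12];
    groups.len() == expected_lengths.len()
        && groups
            .iter()
            .zip(expected_lengths)
            .all(|(group, len)| group.len() == len && group.chars().all(|c| c.is_ascii_hexdigit()))
}

/// Returns true for date-based IDs such as `20260415_1`
/// (assumed format: eight digits, `_`, then a numeric counter).
fn looks_like_legacy_id(id: &str) -> bool {
    match id.split_once('_') {
        Some((date, counter)) => {
            date.len() == 8
                && date.chars().all(|c| c.is_ascii_digit())
                && !counter.is_empty()
                && counter.chars().all(|c| c.is_ascii_digit())
        }
        None => false,
    }
}

/// Classifies an ID purely by its shape; it does not touch the database.
pub fn classify_id(id: &str) -> IdKind {
    if looks_like_uuid(id) {
        IdKind::ThreadUuid
    } else if looks_like_legacy_id(id) {
        IdKind::LegacySessionId
    } else {
        IdKind::Unknown
    }
}
```

A shape check like this cannot tell whether the thread actually exists, so it would complement the database lookup rather than replace it.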

2. Some ACP sessions have no linked thread

Session 20260415_1 ("Removing Left Nav Divider Lines") was created with session_type=acp and provider_name=claude-acp, but its thread_id column is NULL. Out of 89 ACP sessions in the database, 14 have no thread ID. The thread creation either failed silently or was skipped during session creation.

Database evidence:

20260415_1|Removing Left Nav Divider Lines|acp|NO THREAD|claude-acp
20260414_2|ACP Session|acp|821f1582-...|claude-acp  (OK)
20260412_11|Current Agent Harness|acp|af32e07c-...|codex-acp  (OK)

3. Session loading RPC takes 18-38 seconds (even for valid sessions)

Even sessions with valid threads take 18-38 seconds to load via the RPC. The message replay itself is < 2ms — the time is spent elsewhere in the goose binary (likely provider initialization). The installed binary may differ from the current source (which defers agent setup via spawn_agent_setup).

Timing evidence:

Session | RPC Time | Events
Poetry Request | 19.5s | 4 events
Claude Code Agent Info | 34.3s | 10 events
Databricks session | 38.1s (FAILED) | missing DATABRICKS_HOST

File changes

crates/goose-acp/src/server.rs
Replaced the thread lookup in on_load_session with a match block that provides diagnostic fallback. When get_thread fails, it checks the sessions table to determine if the ID is a legacy session ID, and returns a specific error message explaining whether the session has a linked thread (wrong ID was sent) or has no thread at all (incomplete ACP creation).

ui/goose2/src-tauri/Cargo.toml
Relaxed tauri-plugin-dialog version constraint from ">=2,<2.7" to "2" to fix a recurring version mismatch warning with @tauri-apps/plugin-dialog v2.7.

Reproduction Steps

  1. Open the Goose2 app
  2. Click on a session in the sidebar — particularly older sessions or sessions that were created when the app was having connectivity issues
  3. Before this fix: The conversation view hangs for 30 seconds, then shows a blank screen or a timeout error
  4. After this fix: The session immediately shows an error with a diagnostic message explaining the mismatch

Next Steps for the ACP Engineer

  1. Fix the UI to always send thread UUIDs — ensure acpSessionId (the thread UUID) is preserved through the localStorage fallback path in chatSessionStore.ts
  2. Investigate why 14 ACP sessions have no linked thread — the create_internal_session flow should be audited for silent failures during thread creation
  3. Profile the installed binary's on_load_session — the 18-38s RPC time suggests the installed binary may not have the async spawn_agent_setup optimization that exists in the current source
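For the profiling step, a small per-phase timer may be enough to show where the 18-38 seconds go. The following is a sketch under assumptions: `PhaseTimer` is a hypothetical helper built only on `std::time`, and phase names such as "provider_init" or "message_replay" are placeholders for whatever steps on_load_session actually performs.

```rust
use std::time::{Duration, Instant};

/// Collects named phase durations for a single request.
#[derive(Default)]
pub struct PhaseTimer {
    phases: Vec<(String, Duration)>,
}

impl PhaseTimer {
    /// Runs `work`, records how long it took under `name`, and returns its result.
    pub fn measure<T>(&mut self, name: &str, work: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let result = work();
        self.phases.push((name.to_string(), start.elapsed()));
        result
    }

    /// Returns the slowest recorded phase, if any phase was recorded.
    pub fn slowest(&self) -> Option<&(String, Duration)> {
        self.phases.iter().max_by_key(|(_, elapsed)| *elapsed)
    }

    /// Formats one "name: duration" line per recorded phase, in call order.
    pub fn report(&self) -> String {
        self.phases
            .iter()
            .map(|(name, elapsed)| format!("{name}: {elapsed:?}"))
            .collect::<Vec<_>>()
            .join("\n")
    }
}
```

Wrapping each suspected step in `timer.measure("provider_init", || ...)` and logging `timer.report()` at the end of the handler would confirm or rule out provider initialization as the slow phase. This version only times synchronous closures; async steps would need the elapsed time taken around the `.await` instead.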

When the UI sends a legacy session ID (e.g. "20260415_1") instead of a
thread UUID, on_load_session previously never produced a usable response
because get_thread returned an error that was silently swallowed or never
reached the client. The frontend then hit its 30-second timeout and
showed a blank conversation view.

Now the handler checks whether the requested ID corresponds to a known
session in the sessions table, and returns an immediate, diagnostic
error explaining what went wrong:
- "Session X has linked thread Y, but was sent as the session_id
  instead of the thread UUID" — tells the engineer the UI needs to
  send the thread UUID instead.
- "Session X exists but has no linked thread" — the session was not
  fully created via ACP (thread creation failed or was skipped).
- "Session not found" — the ID doesn't match anything.

Also relaxes the tauri-plugin-dialog version constraint from
">=2,<2.7" to "2" to fix a version mismatch warning with
@tauri-apps/plugin-dialog v2.7.

Signed-off-by: morgmart <98432065+morgmart@users.noreply.github.com>