Commit 809417f

rosa and claude committed
Fix crash when recording a failed execution for a job that already has one
The previous code used `create_or_find_by!` in `Job#failed_with`, passing an `exception` object as an attribute. When a FailedExecution already existed for the job, the insert violated the unique constraint and the `find_by!` fallback tried to use the exception object as a SQL bind parameter, causing `TypeError: can't cast ProcessMissingError`.

This state (a job having both a ClaimedExecution and a FailedExecution) shouldn't be possible given the transactional guarantees in `ClaimedExecution#failed_with`, but has been observed in practice by multiple users. When it happens, `fail_orphaned_executions` crashes on startup, preventing Solid Queue from starting at all.

Replace `create_or_find_by!` with `create!`, rescuing RecordNotUnique to find and update the existing FailedExecution with the new error details. If the record disappears between the failed create and the find (race with a concurrent retry), retry the create.

Also change the `expand_error_details_from_exception` callback from `before_create` to `before_save` (guarded by `if: :exception`) so that updating an existing FailedExecution with a new exception properly serializes the error details.

Fixes #699

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
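The failure mode and the fix described in the message can be sketched without Rails. This is a toy model under stated assumptions, not Active Record or Solid Queue code: `ToyFailedExecutions`, `cast`, `record_failure`, and `message_for` are hypothetical names, the hash stands in for a table with a unique `job_id`, and `cast` stands in for the database bind cast that rejects arbitrary Ruby objects.

```ruby
# Toy model only: every class and method here is hypothetical.
class RecordNotUnique < StandardError; end

class ToyFailedExecutions
  def initialize
    @rows = {} # job_id => error message; job_id acts as the unique key
  end

  # Stand-in for a SQL bind cast: only simple values can reach the database.
  def cast(value)
    return value if value.is_a?(Integer) || value.is_a?(String)
    raise TypeError, "can't cast #{value.class}"
  end

  def create!(job_id:, exception:)
    raise RecordNotUnique if @rows.key?(job_id)
    @rows[job_id] = exception.message
  end

  # Lookup that casts every attribute it receives, as a WHERE clause would.
  def find_by(**attributes)
    attributes.each_value { |value| cast(value) }
    job_id = attributes.fetch(:job_id)
    @rows.key?(job_id) ? job_id : nil
  end

  def update!(job_id, exception:)
    @rows[job_id] = exception.message
  end

  def message_for(job_id)
    @rows[job_id]
  end

  # Old approach: reuse the creation attributes for the fallback lookup.
  def create_or_find_by!(**attributes)
    create!(**attributes)
  rescue RecordNotUnique
    find_by(**attributes) # exception: is not a column, so the cast fails
  end

  # New approach: look up by job_id only, then overwrite the error details.
  def record_failure(job_id, exception)
    create!(job_id: job_id, exception: exception)
  rescue RecordNotUnique
    if find_by(job_id: job_id)
      update!(job_id, exception: exception)
    else
      retry # the row vanished between create! and find_by; try again
    end
  end
end

table = ToyFailedExecutions.new
table.create!(job_id: 1, exception: RuntimeError.new("old error"))

begin
  table.create_or_find_by!(job_id: 1, exception: RuntimeError.new("new error"))
rescue TypeError => e
  puts e.message # can't cast RuntimeError
end

table.record_failure(1, RuntimeError.new("new error"))
puts table.message_for(1) # new error
```

The point of the sketch is that the fallback lookup must be keyed on real columns only; the virtual `exception` attribute is applied afterwards on the found record.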
1 parent 5518a4d commit 809417f

3 files changed

Lines changed: 26 additions & 2 deletions

app/models/solid_queue/failed_execution.rb

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ class FailedExecution < Execution
 
     serialize :error, coder: JSON
 
-    before_create :expand_error_details_from_exception
+    before_save :expand_error_details_from_exception, if: :exception
 
     attr_accessor :exception
 
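Why the callback moves from create-time to save-time can be illustrated with a hand-rolled stand-in. This is a sketch, not Active Record: `ToyFailedExecution`, its `save!(hook:)` switch, and the two hook names are hypothetical, and the error hash is simplified to two keys.

```ruby
# Toy record with a hand-rolled hook switch; not Active Record's callback API.
class ToyFailedExecution
  attr_accessor :exception
  attr_reader :error

  def initialize(exception: nil)
    @exception = exception
    @error = nil
    @persisted = false
  end

  # :before_create mimics a hook that only runs for new records.
  # :before_save_if_exception mimics a hook that runs on every save,
  # but only when an exception has been assigned.
  def save!(hook:)
    case hook
    when :before_create
      expand_error_details if !@persisted && exception
    when :before_save_if_exception
      expand_error_details if exception
    else
      raise ArgumentError, "unknown hook: #{hook}"
    end
    @persisted = true
    self
  end

  private

  def expand_error_details
    @error = { "exception_class" => exception.class.name, "message" => exception.message }
  end
end

record = ToyFailedExecution.new(exception: RuntimeError.new("old error"))
record.save!(hook: :before_create)

# Updating an already persisted record: a create-only hook never fires.
record.exception = RuntimeError.new("new error")
record.save!(hook: :before_create)
puts record.error["message"] # old error

# A save-time hook picks up the new exception.
record.save!(hook: :before_save_if_exception)
puts record.error["message"] # new error
```

The guard matters for records saved without an assigned exception: in the toy, saving with `exception` set to `nil` leaves the stored error untouched.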

app/models/solid_queue/job/retryable.rb

Lines changed: 8 additions & 1 deletion
@@ -16,7 +16,14 @@ def retry
       end
 
       def failed_with(exception)
-        FailedExecution.create_or_find_by!(job_id: id, exception: exception)
+        FailedExecution.create!(job_id: id, exception: exception)
+      rescue ActiveRecord::RecordNotUnique
+        if (failed_execution = FailedExecution.find_by(job_id: id))
+          failed_execution.exception = exception
+          failed_execution.save!
+        else
+          retry
+        end
       end
 
       def reset_execution_counters
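The bare `retry` in the `else` branch is Ruby's `retry` keyword, which re-runs the method body from its first line when used inside a method-level `rescue`. It is not a call to the `retry` method that the hunk header shows this module also defines; a method with that name can only be reached through an explicit receiver. A minimal plain-Ruby sketch of both behaviours, where the `Flaky` class is hypothetical:

```ruby
# Hypothetical class: fails a fixed number of times, then succeeds.
class Flaky
  attr_reader :attempts

  def initialize(failures)
    @failures = failures
    @attempts = 0
  end

  def call
    @attempts += 1
    raise IOError, "attempt #{@attempts} failed" if @attempts <= @failures
    :ok
  rescue IOError
    retry # keyword: restarts the body of `call` from the top
  end

  # A method may be named `retry`, but a bare `retry` never dispatches to it.
  def retry
    :retry_method
  end
end

flaky = Flaky.new(2)
puts flaky.call     # ok
puts flaky.attempts # 3
puts flaky.retry    # retry_method
```

Note that the sketch, like `failed_with` in the diff, places no cap on the number of retries; it terminates only because the failing condition eventually clears.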

test/models/solid_queue/claimed_execution_test.rb

Lines changed: 17 additions & 0 deletions
@@ -73,6 +73,23 @@ class SolidQueue::ClaimedExecutionTest < ActiveSupport::TestCase
     assert job.reload.failed?
   end
 
+  test "fail with error when a failed execution already exists updates the existing one" do
+    claimed_execution = prepare_and_claim_job AddToBufferJob.perform_later(42)
+    job = claimed_execution.job
+
+    # Simulate corrupted state: a failed execution already exists for this job
+    SolidQueue::FailedExecution.create!(job_id: job.id, exception: RuntimeError.new("old error"))
+
+    assert_no_difference -> { SolidQueue::FailedExecution.count } do
+      assert_difference -> { SolidQueue::ClaimedExecution.count }, -1 do
+        claimed_execution.failed_with(RuntimeError.new("new error"))
+      end
+    end
+
+    assert job.reload.failed?
+    assert_equal "new error", job.failed_execution.message
+  end
+
   test "provider_job_id is available within job execution" do
     job = ProviderJobIdJob.perform_later
     claimed_execution = prepare_and_claim_job job
