Describe the bug
After upgrading from RabbitMQ 3.13.7 to 4.1.8 we are seeing repeated errors of the form:
crasher:
initial call: ra_server_proc:init/1
pid: <PID>
registered_name: <SERVICE_NAME>
exception error: {bad_return_from_state_function,
{error,
"delete file <RABBITMQ_DATA_DIR>/mnesia/rabbit@<NODE_NAME>/quorum/rabbit@<NODE_NAME>/<QUORUM_QUEUE_ID>/snapshots/<SNAPSHOT_FILE>: not owner
"}}
in function gen_statem:loop_state_callback_result/11 (gen_statem.erl, line 3889)
ancestors: [<PID>,ra_server_sup_sup,<PID>,
ra_systems_sup,ra_sup,<PID>]
message_queue_len: 2
messages: [{ra_log_event,{written,1,{<IDX>,<IDX>}}},
{'$gen_call',
{<PID>,
[alias|
#Ref<REF>]},
{leader_call,
{command,normal,
{'$usr',
{register_enqueuer,<PID>},
await_consensus}}}}]
links: [<PID>]
dictionary: [{rand_seed,{#{type => exsss,next => #Fun<rand.X>,
bits => 58,uniform => #Fun<rand.X>,
uniform_n => #Fun<rand.X>,
jump => #Fun<rand.X>},
[<RAND>|<RAND>]}}]
trap_exit: true
status: running
heap_size: <SIZE>
stack_size: <SIZE>
reductions: <COUNT>
neighbours:
This causes ra_server_proc to crash. The error only appears on nodes running 4.1.8 — nodes still on 3.13.7 are unaffected.
Reproduction steps
This started happening after we upgraded to RabbitMQ 4.1.8.
Expected behavior
This does not seem to cause any lasting harm: when we check later, the file has eventually been deleted. However, the error messages are confusing.
Additional context
Looking at ra_lib:recursive_delete/1, is_dir/1 returns false for any error from prim_file:read_file_info/1—not just when the path is a regular file. When that happens, delete(Dir, regular) is called, which invokes unlink() on what may still be a directory. Both ext4 and NFSv4 return EPERM in this case.
is_dir/1 — https://github.com/rabbitmq/ra/blob/main/src/ra_lib.erl#L485
recursive_delete/1 — https://github.com/rabbitmq/ra/blob/main/src/ra_lib.erl#L197
We are not sure if this is the actual cause of what we are observing, but we wanted to highlight it in case it is useful.
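To make the suspected failure mode concrete, here is a minimal sketch in Python. It is not the ra code, only an analogue under stated assumptions: the function names (is_dir_lenient, is_dir_strict, failing_stat, demo) are hypothetical, and the stat failure is simulated with EIO rather than reproduced on a real filesystem. It shows that a directory check which treats every stat error as "not a directory" sends the caller down the unlink path, and that unlink on a directory then fails.

```python
import errno
import os
import stat
import tempfile


def is_dir_lenient(path, stat_fn=os.stat):
    """Analogue of the behaviour described above: any stat error means False."""
    try:
        return stat.S_ISDIR(stat_fn(path).st_mode)
    except OSError:
        return False


def is_dir_strict(path, stat_fn=os.stat):
    """Hypothetical alternative: only a missing path means False."""
    try:
        return stat.S_ISDIR(stat_fn(path).st_mode)
    except FileNotFoundError:
        return False
    # Any other OSError propagates to the caller instead of being hidden.


def failing_stat(path):
    # Assumption: stands in for a transient stat failure on the real system.
    raise OSError(errno.EIO, os.strerror(errno.EIO), path)


def demo():
    with tempfile.TemporaryDirectory() as base:
        target = os.path.join(base, "snapshot_dir")
        os.mkdir(target)

        # The lenient check misreports the directory as a non-directory.
        lenient = is_dir_lenient(target, stat_fn=failing_stat)

        # The caller would now unlink it as a regular file, which fails.
        try:
            os.unlink(target)
            unlink_errno = None
        except OSError as exc:
            unlink_errno = exc.errno  # exact value is platform-dependent

        # The strict check surfaces the original stat error instead.
        try:
            is_dir_strict(target, stat_fn=failing_stat)
            strict_raised = False
        except OSError:
            strict_raised = True

        return lenient, unlink_errno, strict_raised


if __name__ == "__main__":
    print(demo())
```

If this is what happens in ra, distinguishing "path does not exist" from other stat errors would at least make the logged error point at the real failure rather than at the delete.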
Please feel free to close this if it is not an issue. I am reporting it only because it looked like a bug to me.