Title: ra_server_proc crash with "not owner" error on snapshot deletion after upgrade to 4.1.8 #625

@rahulsadanandan

Describe the bug

After upgrading from RabbitMQ 3.13.7 to 4.1.8 we are seeing repeated errors of the form:

crasher:
initial call: ra_server_proc:init/1
pid: <PID>
registered_name: <SERVICE_NAME>
exception error: {bad_return_from_state_function,
{error,
"delete file <RABBITMQ_DATA_DIR>/mnesia/rabbit@<NODE_NAME>/quorum/rabbit@<NODE_NAME>/<QUORUM_QUEUE_ID>/snapshots/<SNAPSHOT_FILE>: not owner
"}}
in function gen_statem:loop_state_callback_result/11 (gen_statem.erl, line 3889)
ancestors: [<PID>,ra_server_sup_sup,<PID>,
ra_systems_sup,ra_sup,<PID>]
message_queue_len: 2
messages: [{ra_log_event,{written,1,{<IDX>,<IDX>}}},
{'$gen_call',
{<PID>,
[alias|
#Ref<REF>]},
{leader_call,
{command,normal,
{'$usr',
{register_enqueuer,<PID>},
await_consensus}}}}]
links: [<PID>]
dictionary: [{rand_seed,{#{type => exsss,next => #Fun<rand.X>,
bits => 58,uniform => #Fun<rand.X>,
uniform_n => #Fun<rand.X>,
jump => #Fun<rand.X>},
[<RAND>|<RAND>]}}]
trap_exit: true
status: running
heap_size: <SIZE>
stack_size: <SIZE>
reductions: <COUNT>
neighbours:

This causes ra_server_proc to crash. The error only appears on nodes running 4.1.8 — nodes still on 3.13.7 are unaffected.

Reproduction steps

This started after we upgraded RabbitMQ to version 4.1.8.

Expected behavior

The error does not appear to cause any harm: when we check later, the file has been deleted. The error messages are confusing, however.

Additional context

Looking at ra_lib:recursive_delete/1, is_dir/1 returns false for any error from prim_file:read_file_info/1, not only when the path is a regular file. When that happens, recursive_delete/1 calls delete(Dir, regular), which invokes unlink() on what may still be a directory. Both ext4 and NFSv4 return EPERM in this case, which would match the "not owner" text in the log above, since that is Erlang's message for eperm.

is_dir/1 — https://github.com/rabbitmq/ra/blob/main/src/ra_lib.erl#L485
recursive_delete/1 — https://github.com/rabbitmq/ra/blob/main/src/ra_lib.erl#L197
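
To make the suspected failure mode concrete, here is a minimal sketch of the same pattern in Python. It is not ra's Erlang code: `is_dir_lossy`, `is_dir_strict`, `flaky_stat`, and this `recursive_delete` are hypothetical names used only for illustration, and the simulated stat failure is an assumption about what might trigger the problem. Python also surfaces the raw errno from unlink(), which is EISDIR on Linux and EPERM on some other systems.

```python
import errno
import os
import stat


def is_dir_lossy(path, stat_fn=os.stat):
    """Treat ANY stat failure as 'not a directory' (the pattern described above)."""
    try:
        return stat.S_ISDIR(stat_fn(path).st_mode)
    except OSError:
        return False


def is_dir_strict(path, stat_fn=os.stat):
    """Answer only when stat succeeds; let unexpected failures propagate."""
    try:
        return stat.S_ISDIR(stat_fn(path).st_mode)
    except FileNotFoundError:
        return False  # a missing path is genuinely not a directory


def flaky_stat(path):
    """Hypothetical stand-in for a transient stat failure (here: an I/O error)."""
    raise OSError(errno.EIO, "simulated stat failure", path)


def recursive_delete(path, is_dir=is_dir_lossy):
    """Delete a tree, choosing rmdir or unlink based on the is_dir answer."""
    if is_dir(path):
        for name in os.listdir(path):
            recursive_delete(os.path.join(path, name), is_dir)
        os.rmdir(path)
    else:
        # If is_dir wrongly said "not a directory", this unlinks a directory
        # and the OS rejects it.
        os.unlink(path)
```

With the lossy check and a failing stat, deleting a directory ends in a rejected unlink() whose error says nothing about the stat failure that caused it. With the strict check, the same simulated failure surfaces as the original stat error, which would be easier to diagnose.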

We are not sure if this is the actual cause of what we are observing, but we wanted to highlight it in case it is useful.

Please feel free to close this if it is not an issue. I am reporting it only because it looked like a bug to me.
