Handling of ra machine failures #293

@mbj4668

I have noticed that the system ends up in a bad state (?) if the apply callback crashes (erlang:error or erlang:exit).

If apply crashes, the ra_server_proc process exits and is restarted by its supervisor (ra_server_sup). It then fails again, the supervisor reaches its maximum restart intensity and terminates, and its own supervisor (ra_server_sup_sup) detects this. However, the ra_server_sup child has restart type temporary, so ra_server_sup_sup simply ignores the error. Here's an attempt to illustrate the supervision tree:

The system is called store.

ra_sup [one_for_one, max 1 restarts in 5 secs]
  +-- PERM ra_systems_sup [one_for_one, max 1 restarts in 5 secs]
  |          +-- PERM <0.195.0>/ra_system_sup [one_for_all, max 1 restarts in 5 secs]
  |                     +-- PERM ra_store_server_sup_sup/ra_server_sup_sup [simple_one_for_one, max 1 restarts in 5 secs]
  |                     |          +-- TEMP <0.241.0>/ra_server_sup [one_for_one, max 2 restarts in 5 secs]
  |                     |                     +-- TRAN store_ra/ra_server_proc
  |                     +-- PERM ra_store_log_sup/ra_log_sup [one_for_all, max 5 restarts in 5 secs]
  |                     |          +-- PERM <0.205.0>/ra_log_wal_sup [one_for_one, max 1 restarts in 5 secs]
  |                     |          |          +-- PERM ra_store_log_wal/ra_log_wal
  |                     |          +-- PERM ra_store_segment_writer/ra_log_segment_writer
  |                     |          +-- PERM ra_store_log_meta/ra_log_meta
  |                     |          +-- PERM <0.200.0>/ra_log_pre_init
  |                     +-- PERM ra_store_log_ets/ra_log_ets
  +-- PERM ra_file_handle
  +-- PERM ra_metrics_ets
  +-- PERM ra_machine_ets
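
The same pattern can be reproduced outside ra with plain OTP supervisors. The following is only a minimal sketch, not ra's code: the module temp_sup_demo and all of its function names are hypothetical, and the restart settings merely mirror the ones shown in the tree above (a simple_one_for_one parent with a temporary child supervisor, which in turn runs a transient worker that always crashes).

```erlang
-module(temp_sup_demo).
-behaviour(supervisor).

-export([run/0, start_inner/0, start_worker/0, init/1]).

%% Start the outer supervisor, add one temporary inner supervisor,
%% and wait until the inner supervisor gives up on its crashing worker.
run() ->
    {ok, Outer} = supervisor:start_link(?MODULE, outer),
    {ok, Inner} = supervisor:start_child(Outer, []),
    Ref = erlang:monitor(process, Inner),
    receive
        {'DOWN', Ref, process, Inner, Reason} ->
            %% The outer supervisor is still alive: the exit of a
            %% temporary child is not escalated.
            {inner_down, Reason, is_process_alive(Outer)}
    after 5000 ->
        timeout
    end.

start_inner() ->
    supervisor:start_link(?MODULE, inner).

%% A worker that starts successfully and then crashes shortly after.
start_worker() ->
    {ok, spawn_link(fun() -> timer:sleep(50), exit(boom) end)}.

%% Stand-in for ra_server_sup_sup: children are temporary.
init(outer) ->
    SupFlags = #{strategy => simple_one_for_one, intensity => 1, period => 5},
    Child = #{id => inner,
              start => {?MODULE, start_inner, []},
              restart => temporary,
              type => supervisor},
    {ok, {SupFlags, [Child]}};
%% Stand-in for ra_server_sup: max 2 restarts in 5 seconds.
init(inner) ->
    SupFlags = #{strategy => one_for_one, intensity => 2, period => 5},
    Child = #{id => worker,
              start => {?MODULE, start_worker, []},
              restart => transient,
              type => worker},
    {ok, {SupFlags, [Child]}}.
```

Calling temp_sup_demo:run() from a shell should show the inner supervisor terminating once its restart limit is exceeded, while the outer supervisor keeps running and nothing propagates further up.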

I expected the error to propagate and eventually terminate the application. What is the intended way to handle these kinds of errors? My current workaround is to find the temporary supervisor (<0.241.0> above) and monitor it from another process, but this requires peeking into the internal state of ra, which doesn't seem quite right.
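
For illustration, the workaround could look roughly like the sketch below. It rests on assumptions: the module ra_sup_watcher and its functions are hypothetical, and it assumes that ra_store_server_sup_sup (taken from the tree above) is a locally registered name, which is exactly the kind of internal detail of ra this relies on. Only supervisor:which_children/1 and erlang:monitor/2 from OTP are used.

```erlang
-module(ra_sup_watcher).

-export([start/2, watch_pids/2]).

%% Monitor every current child of the supervisor SupName and call
%% OnDown(Pid, Reason) when one of them terminates.
%% Children started after this call are not monitored.
start(SupName, OnDown) when is_function(OnDown, 2) ->
    Pids = [Pid || {_Id, Pid, _Type, _Mods} <- supervisor:which_children(SupName),
                   is_pid(Pid)],
    watch_pids(Pids, OnDown).

%% Spawn a watcher process that monitors the given pids.
watch_pids(Pids, OnDown) when is_list(Pids), is_function(OnDown, 2) ->
    spawn(fun() ->
        Refs = maps:from_list([{erlang:monitor(process, P), P} || P <- Pids]),
        loop(Refs, OnDown)
    end).

%% Stop once every monitored process has gone down.
loop(Refs, _OnDown) when map_size(Refs) =:= 0 ->
    ok;
loop(Refs, OnDown) ->
    receive
        {'DOWN', Ref, process, Pid, Reason} ->
            case maps:is_key(Ref, Refs) of
                true ->
                    OnDown(Pid, Reason),
                    loop(maps:remove(Ref, Refs), OnDown);
                false ->
                    loop(Refs, OnDown)
            end
    end.
```

A possible use, under the same assumptions, would be ra_sup_watcher:start(ra_store_server_sup_sup, fun(_Pid, _Reason) -> init:stop() end), which shuts down the node when a temporary ra_server_sup goes away.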
