🐛 Engine: daemon restart fails with circus.exc.ConflictError #6041

@mbercx

Describe the bug

From time to time, I find that my processes are no longer updating. When I then try to restart the daemon, the stop step fails with a timeout:

verdi daemon restart --reset
Profile: dev
Stopping the daemon... FAILED
Critical: Connection to the daemon timed out.

Restarting afterwards is not a problem:

verdi daemon start
Starting the daemon with 1 workers... OK

I find no errors in the daemon logs, but the circus logs contain the following (trimmed for brevity; the full error log is below):

2023-05-29 21:48:37 circus[2237] [INFO] Arbiter exiting
2023-05-29 21:48:37 circus[2246] [INFO] Stats streamer stopped
2023-05-29 21:48:37 tornado.general[2246] [WARNING] Got events for stream <zmq.eventloop.zmqstream.ZMQStream object at 0x104370580> attached to closed socket: Socket operation on non-socket
2023-05-29 21:48:37 circus[2246] [INFO] Stats streamer stopped
2023-05-29 21:48:37 circus[2246] [INFO] Stats streamer stopped
2023-05-29 21:48:37 circus[2237] [INFO] circusd-stats stopped
2023-05-29 21:48:38 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:39 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
[...]
2023-05-29 21:48:53 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:53 circus[2237] [INFO] aiida-dev stopped
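
One possible reading of these tracebacks, sketched as a toy model: the periodic `Arbiter.manage_watchers` callback keeps firing (roughly once per second between 21:48:38 and 21:48:53) while the `arbiter_stop` command is still in progress, and a guard in a wrapper rejects it each time. The `exclusive` decorator and `ToyArbiter` class below are hypothetical and are not circus's implementation; only the error message text and the names `manage_watchers` and `arbiter_stop` are taken from the log.

```python
class ConflictError(Exception):
    """Stand-in for circus.exc.ConflictError (toy model only)."""


def exclusive(command_name):
    """Hypothetical guard: allow only one guarded command at a time."""
    def decorator(func):
        def wrapper(self, *args, **kwargs):
            if self._running is not None:
                # Same message format as seen in the traceback above.
                raise ConflictError(
                    "arbiter is already running %s command" % self._running
                )
            self._running = command_name
            try:
                return func(self, *args, **kwargs)
            finally:
                self._running = None
        return wrapper
    return decorator


class ToyArbiter:
    """Assumed, simplified arbiter used only to illustrate the conflict."""

    def __init__(self):
        self._running = None
        self.errors = []

    @exclusive("manage_watchers")
    def manage_watchers(self):
        return "managed"

    @exclusive("arbiter_stop")
    def stop(self, ticks=3):
        # While the stop is in progress, simulate the periodic callback
        # firing `ticks` times; each attempt is rejected by the guard.
        for _ in range(ticks):
            try:
                self.manage_watchers()
            except ConflictError as exc:
                self.errors.append(str(exc))
        return "stopped"


arbiter = ToyArbiter()
arbiter.stop(ticks=3)
for message in arbiter.errors:
    print(message)
```

In this model the errors are noise produced during a slow shutdown rather than its cause, which would be consistent with `aiida-dev stopped` eventually appearing in the log. Whether that is what actually happens inside circus is an open question.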
Full Error log
2023-05-29 21:48:37 circus[2237] [INFO] Arbiter exiting
2023-05-29 21:48:37 circus[2246] [INFO] Stats streamer stopped
2023-05-29 21:48:37 tornado.general[2246] [WARNING] Got events for stream <zmq.eventloop.zmqstream.ZMQStream object at 0x104370580> attached to closed socket: Socket operation on non-socket
2023-05-29 21:48:37 circus[2246] [INFO] Stats streamer stopped
2023-05-29 21:48:37 circus[2246] [INFO] Stats streamer stopped
2023-05-29 21:48:37 circus[2237] [INFO] circusd-stats stopped
2023-05-29 21:48:38 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:39 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:40 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:41 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:42 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:43 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:44 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:45 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:46 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:47 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:48 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:49 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:50 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:51 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:52 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:53 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:53 circus[2237] [INFO] aiida-dev stopped
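
Since the full log repeats the same traceback many times, a small script may help when comparing occurrences of this problem. The following is a minimal sketch that assumes the line format shown above (`timestamp logger[pid] [LEVEL] message`); the function name `summarize_circus_log` is made up for illustration, and the log text has to be read from wherever the circus log is stored for your profile.

```python
import re
from collections import Counter

# Matches lines such as:
# 2023-05-29 21:48:37 circus[2237] [INFO] Arbiter exiting
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<logger>[\w.\-]+)\[(?P<pid>\d+)\] \[(?P<level>[A-Z]+)\] (?P<msg>.*)$"
)


def summarize_circus_log(log_text):
    """Count log levels and ConflictError tracebacks in a circus log."""
    levels = Counter()
    conflicts = 0
    first = last = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            levels[match["level"]] += 1
            if first is None:
                first = match["ts"]
            last = match["ts"]
        elif line.startswith("circus.exc.ConflictError"):
            # Final line of each traceback in the log above.
            conflicts += 1
    return {
        "levels": dict(levels),
        "conflict_errors": conflicts,
        "first": first,
        "last": last,
    }


if __name__ == "__main__":
    import sys

    # Usage (the file name is a placeholder): python summarize.py < circus.log
    print(summarize_circus_log(sys.stdin.read()))
```

The `first` and `last` timestamps give a rough idea of how long the arbiter took to shut down, which can be compared against the timeout of the stop command.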

Steps to reproduce

I haven't found a way to consistently reproduce the problem yet, but it seems to occur more often when I am running processes that involve large data transfers. For the event above, a dozen or so calculation jobs had paused due to connection issues, and I had then restarted them with verdi process play -a.

Your environment

  • Operating system [e.g. Linux]: macOS Monterey v12.5
  • Python version [e.g. 3.7.1]: Python 3.9.16
  • aiida-core version [e.g. 1.2.1]: sph/fix/6013/verdi-computer-test branch, it seems. Commit 47cd515, but I've also had it happen when running on main.
  • circus: 0.18.0

Additional context

From an offline discussion I know @unkcpz has also run into this issue.
