Unhandled exception calling finish_job #7997

@cheeseindustries

Greetings,

I am with the GenomeTrakr project at the FDA/CFSAN. We seem to be hitting a Galaxy bug that caused an OOM condition on our production Galaxy instance for GalaxyTrakr, which led to an outage. We are seeing the following errors in our job handler logs. It seems there is a bug in the finish_job method that is causing the job handlers to crash, leading to Galaxy spawning a large number of uWSGI processes. This eventually led to system processes on our host invoking the OOM killer on uWSGI, which seems to have caused our upstream web proxy to be unable to reach Galaxy. If there is any other relevant info or log data that you need, I can provide it.

EDIT: Sorry, forgot to include the version. We are currently on 19.01.

Thanks!

From job handler logs:

galaxy.jobs.runners ERROR 2019-05-15 21:44:20,442 [p:10617,w:0,m:1] [DRMAARunner.work_thread-3] (167903/5913) Job wrapper finish method failed
galaxy.jobs.runners ERROR 2019-05-15 21:44:21,143 [p:10617,w:0,m:1] [DRMAARunner.work_thread-3] (167903) Unhandled exception calling finish_job
galaxy.jobs.runners ERROR 2019-05-15 21:47:20,674 [p:10617,w:0,m:1] [DRMAARunner.work_thread-2] (167905/5915) Job wrapper finish method failed
galaxy.jobs.runners ERROR 2019-05-15 21:47:22,242 [p:10617,w:0,m:1] [DRMAARunner.work_thread-2] (167905) Unhandled exception calling finish_job
galaxy.jobs.runners ERROR 2019-05-15 21:55:09,751 [p:10617,w:0,m:1] [DRMAARunner.work_thread-3] (167907/5917) Job wrapper finish method failed
galaxy.jobs.runners ERROR 2019-05-15 21:56:00,636 [p:10617,w:0,m:1] [DRMAARunner.work_thread-3] (167907) Unhandled exception calling finish_job
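
For anyone triaging similar handler logs, here is a minimal sketch that groups the finish-related errors by job. The regular expression is inferred only from the lines quoted above, and the reading of the parenthesised pair as Galaxy job id / external (DRM) job id is an assumption; `summarize_finish_failures` is a hypothetical helper, not part of Galaxy.

```python
import re
from collections import Counter

# Pattern inferred from the handler log lines quoted in this issue.
# Other Galaxy versions or logging configurations may format lines differently.
LINE_RE = re.compile(
    r"^(?P<logger>\S+) ERROR (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ "
    r"\[p:(?P<pid>\d+),w:\d+,m:\d+\] \[(?P<thread>[^\]]+)\] "
    r"\((?P<job_id>\d+)(?:/(?P<ext_id>\d+))?\) (?P<msg>.*)$"
)


def summarize_finish_failures(lines):
    """Return {job_id: Counter of finish-related error messages}.

    job_id is the first number in the parenthesised prefix (assumed to be
    the Galaxy job id); lines that do not match the pattern are skipped.
    """
    failures = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        msg = match.group("msg")
        if "finish" not in msg:
            continue
        failures.setdefault(match.group("job_id"), Counter())[msg] += 1
    return failures


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py < handler.log
    for job_id, messages in summarize_finish_failures(sys.stdin).items():
        for msg, count in messages.items():
            print(job_id, count, msg)
```

If every affected job shows both messages, as in the excerpt above, that would suggest the wrapper's finish step fails first and the runner's handling of that failure then raises as well.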

Lots of these messages in /var/log/messages:

May 15 21:47:19 ip-X-X-X-X kernel: Out of memory: Kill process 10606 (uwsgi) score 14 or sacrifice child
May 15 21:47:19 ip-X-X-X-X kernel: Killed process 10606 (uwsgi) total-vm:1761168kB, anon-rss:464936kB, file-rss:0kB, shmem-rss:124kB
May 15 21:47:19 ip-X-X-X-X kernel: sge_qmaster invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
May 15 21:47:19 ip-X-X-X-X kernel: sge_qmaster cpuset=/ mems_allowed=0
May 15 21:47:19 ip-X-X-X-X kernel: CPU: 4 PID: 13570 Comm: sge_qmaster Kdump: loaded Not tainted 3.10.0-957.10.1.el7.x86_64 #1
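
To gauge how often the OOM killer fired and which processes it targeted, a rough sketch like the following could tally the "Killed process" lines. The pattern is based only on the kernel messages quoted above and assumes each message sits on a single line; `summarize_oom_kills` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches kernel "Killed process" lines as they appear in the excerpt above.
# Kernel versions differ in the exact fields, so treat this as an assumption.
KILL_RE = re.compile(
    r"kernel: Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)"
    r".*?anon-rss:(?P<anon_rss>\d+)kB"
)


def summarize_oom_kills(lines):
    """Return (kills per command name, summed anon-rss in kB per command name)."""
    kills = Counter()
    anon_rss_kb = Counter()
    for line in lines:
        match = KILL_RE.search(line)
        if match is None:
            continue
        comm = match.group("comm")
        kills[comm] += 1
        anon_rss_kb[comm] += int(match.group("anon_rss"))
    return kills, anon_rss_kb


if __name__ == "__main__":
    import sys

    # Usage: python oom_summary.py < /var/log/messages
    kills, anon_rss_kb = summarize_oom_kills(sys.stdin)
    for comm, count in kills.most_common():
        print(comm, count, anon_rss_kb[comm], "kB anon-rss total")
```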
