job-status-consumer: improve logging of "not alive" workflows#443
tiborsimko merged 1 commit into reanahub:master
Conversation
Review comment on the diff:

```python
job.status = job_status
...
def _update_job_cache(msg):
```
Hmm, whilst it is true that caching has been disabled for a long time, we may want to resurrect and retest it one day perhaps... Hence let me muse a bit on this.
Does this code actually run on the server side? We have cache=off by default on the client side, so I think this code should not even be triggered on the server side. If it is not, then we don't seem to have a problem here? If it is, then we should fix it so that it would not be (or inactivate it via a server-side configuration setting, without removing the code).
That said, philosophically speaking, I like the suckless approach arguing that "the more code lines you have removed, the more progress you have made" 😉 Removing old inactive code is a good thing, especially if we investigate something like Armada or Kueue soon... Then it might be a perfect time for cleaning up the job cache everywhere. Until then, perhaps "inactivation" may still have some advantages.
Honestly, there is a lack of documentation for the cache internals, so I have no idea how it works or where exactly it is configured. I need to dig into the code.
It doesn't look like we have a cache config in the Helm chart for this. Also, the user docs mention only a few things, like the `CACHE=off` parameter for r-client. There are some Python tests for the cache.
It looks like only the serial engine supports the cache; the other engines do not check with the job-controller whether a job is cached.
I heard that the cache was not working, so, in my opinion, we should remove it. Otherwise, this is just dead code. But I agree: maybe I should create a separate issue to remove the cache, and we can address it later (especially if we plan to work on improving the consumers and internals).
tiborsimko
left a comment
Logging works nicely; the core issue is to be addressed later.
addresses #437
How to test:
1. Run a workflow (`reana-client run -w test`; it will stay in `pending` state) and delete it (`reana-client delete -w test`).
2. Run `reana-client list`; it will show you that the `test` workflow is deleted.
3. Run `kubectl get pods`; you will find the batch pod in `NotReady` state.
4. Run `kubectl logs deployment/reana-workflow-controller job-status-consumer`; you will see that the workflow was not in an alive state in the DB, and the logs indicate that.

This doesn't solve the core issue but, at least, we can have better logs.
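The improved behaviour described above can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the actual reana-workflow-controller code: the function name `process_job_status_message`, the `workflow` dict shape, and the `ALIVE_STATUSES` set are all assumptions (the real status enum lives in `reana-db` and may differ).

```python
import logging

# Hypothetical set of "alive" workflow statuses; the real REANA status
# enum may differ from this illustration.
ALIVE_STATUSES = {"created", "queued", "pending", "running"}

def process_job_status_message(workflow, msg):
    """Sketch of the consumer guard: if the workflow is no longer alive
    (e.g. it was deleted while the batch pod was still running), discard
    the job-status event and log why, instead of failing silently."""
    if workflow["status"] not in ALIVE_STATUSES:
        logging.warning(
            "Event for workflow %s discarded: workflow status is '%s', "
            "which is not an alive state in the DB",
            workflow["name"],
            workflow["status"],
        )
        return False
    # ... normal processing would update the job/workflow status here ...
    return True
```

With a workflow in `deleted` state, the message is dropped and a warning is logged; with an alive workflow, processing continues as before.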