I have been struggling for several days with a problem in my agents. Randomly, they will stall and use 100% of the CPU. strace reveals the agents are just context switching and doing nothing:
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 40001616
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 0
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 0
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 1
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 1
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 0
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 101
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 0
I have tried everything: modified agents to use epoll rather than select, tried ruby enterprise edition and ruby1.9 (they remove the syscalls in strace, but agents still lock). I cannot discern a pattern or reason the agents lock specifically, meaning the job they lock on isn't consistent ASIDE from happening during a job that utilizes net/http to pull down some images and stitch them together.
I thought it might be an issue with calling sleep() inside the agents, but that didn't solve anything. I really have no idea where to go from here.
Pastie to my agent code: http://pastie.org/702881
Pastie to image fetch/stitch code: http://pastie.org/702895
On the plus side, I'll be able to give you a quick modification to nanite that causes it to use epoll, which dropped my CPU utilization a hair while performing a large amount of jobs! Any ideas on where to start even looking from here would be appreciated, otherwise I am going to just start commenting out code until something changes (the worst way to debug!).
I have been struggling for several days with a problem in my agents. Randomly, they will stall and use 100% of the CPU. strace reveals the agents are just context switching and doing nothing:
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 40001616
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 0
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 0
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 1
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 1
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 0
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 101
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 0
I have tried everything: modified agents to use epoll rather than select, tried ruby enterprise edition and ruby1.9 (they remove the syscalls in strace, but agents still lock). I cannot discern a pattern or reason the agents lock specifically, meaning the job they lock on isn't consistent ASIDE from happening during a job that utilizes net/http to pull down some images and stitch them together.
I thought it might be an issue with calling sleep() inside the agents, but that didn't solve anything. I really have no idea where to go from here.
Pastie to my agent code: http://pastie.org/702881
Pastie to image fetch/stitch code: http://pastie.org/702895
On the plus side, I'll be able to give you a quick modification to nanite that causes it to use epoll, which dropped my CPU utilization a hair while performing a large amount of jobs! Any ideas on where to start even looking from here would be appreciated, otherwise I am going to just start commenting out code until something changes (the worst way to debug!).