asksol: We're running into an async I/O issue with the Kombu 3+ code..
asksol: are you around to possibly chat with one of our engineers who's been digging into the issue?
(these are rabbitmq issues btw..)
asksol
Diranged: sure
Diranged
k gimme one sec
trying to get his attention..
In short, we're seeing celery workers hang and leave dead connections open to RMQ. The hang seems to be related to the epoll code waiting for new data from RMQ. We did not see this before we switched to Celery 3.1+ and Kombu 3+.
(figures, he just stepped out of the office.. calling him real quick)
asksol
Diranged: that may not be so unusual, it will not receive any data if it has exceeded the prefetch limit
Diranged
I was wondering about the prefetch limit ..
asksol
maybe you should try running with the -Ofair argument to celery worker
Diranged
If we have a prefetch of 5, and our host pulls down 5 tasks that all have an ETA of now()+5mins, will celery keep pulling down other tasks or will it sit idle for 5 minutes?
asksol
eta will increase the prefetch count
Diranged
Ok.. so it will keep pulling tasks down to work on other ones…
asksol
but the prefetch count cannot exceed 65535, and there have been reports of problems if that is exceeded
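To make the behavior asksol describes concrete, here's a small illustrative sketch (not Kombu's actual internals — the function name and accounting are hypothetical): the worker effectively raises the channel's prefetch count by one per held ETA task, and AMQP's `basic.qos` prefetch_count is a 16-bit unsigned short, so it tops out at 65535.

```python
# Hypothetical illustration, NOT Kombu source: model of how a worker could
# bump its prefetch count for each ETA task it is holding, capped at the
# AMQP 16-bit limit mentioned above.

AMQP_PREFETCH_MAX = 0xFFFF  # basic.qos prefetch_count is an unsigned short

def bump_prefetch(base_prefetch, eta_tasks_held):
    """Return a new prefetch count: base plus one slot per held ETA task."""
    return min(base_prefetch + eta_tasks_held, AMQP_PREFETCH_MAX)

print(bump_prefetch(5, 3))       # base prefetch of 5, three ETA tasks held -> 8
print(bump_prefetch(5, 100000))  # capped at 65535
```

This is why a flood of ETA tasks can push the count toward the cap, the regime where problems have been reported.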
Diranged
heh.. nice
Here's a quick snippet from our bug report, while I try to reach the eng..
"We are seeing celery workers hang in production. The symptom is that the worker prefetches messages from RabbitMQ but never starts working on any of them.
We reproduced this in staging fairly consistently yesterday and I think we only saw this on worker startup. I don't think we ever saw a worker that was finishing tasks ever hang.
We attached gdb to one of the processes and could see that it was sitting in epoll_wait() but that on its own is not very informative and without a debug build we couldn't see the rest of the stack trace."
(engineer is jumping online..)
asksol
you should try to avoid having lots of eta tasks, and if that is required find an alternate solution
aaronwebber joined the channel
Diranged
asksol: we have thousands a day currently.. but i agree that we're looking for alternatives. for now we've actually disabled their ETA and let them run right away.
asksol: the model though is that we kick off a task the second a user "posts" something.. but give them up to 5 minutes to delete their post and prevent the "new post emails" from being generated and sent out.
asksol
the tasks may have been written to a process that is executing a long-running task
Diranged
asksol: the eta-model in the task has been a really simple way of implementing that.. because we just mark a flag in the DB indicating the post is deleted, the task eventually runs, checks that flag, and just exits quietly without doing any real work..
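The "mark a flag, let the task exit quietly" pattern Diranged describes can be sketched like this (all names here are hypothetical stand-ins for their code, and a set replaces the real DB flag):

```python
# Sketch of the delete-flag pattern: the email task is scheduled ~5 minutes
# after a post; if the user deleted the post in the meantime, the task
# checks the flag and exits without doing any real work.

deleted_posts = set()  # stand-in for the "post deleted" flag in the DB

def is_post_deleted(post_id):
    return post_id in deleted_posts

def send_new_post_emails(post_id):
    pass  # placeholder for the real email fan-out

def notify_new_post(post_id):
    """Task body: runs at its ETA, checks the flag, exits quietly if set."""
    if is_post_deleted(post_id):
        return "skipped"
    send_new_post_emails(post_id)
    return "sent"
```

For example, `deleted_posts.add(42)` followed by `notify_new_post(42)` returns `"skipped"` without sending anything.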
asksol: aaronwebber is the engineer who's been digging into this connection issue.. he can provide details.
asksol
-Ofair will make sure that doesn't happen
Diranged
aaronwebber: I've only given Ask the brief snippet summary from the issue IDENTITY-424… but could you provide more details about the issue?
asksol
which has been almost consistently the problem when the worker hangs after upgrading from 3.0 to 3.1
aaronwebber
asksol: i also have an update on the hanging worker task issue i mentioned yesterday... it occurs when there are > prefetch_count messages in the queue when the worker starts up. it then gets into the async i/o loop, but epoll never returns any events
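The symptom aaronwebber describes (blocked in `epoll_wait()` with no events ever arriving) can be reproduced in miniature with the stdlib, using a local socketpair instead of a RabbitMQ connection (Linux-only, since `select.epoll` is Linux-specific):

```python
# Standalone demo of what a worker hung in epoll_wait() looks like: poll
# returns no events until the peer actually sends data on the socket.
import select
import socket

a, b = socket.socketpair()
ep = select.epoll()
ep.register(b.fileno(), select.EPOLLIN)

# No data has been written yet, so poll() times out with an empty event
# list -- from the outside this is indistinguishable from a hung worker.
print(ep.poll(timeout=0.1))  # []

a.send(b"task payload")
print(ep.poll(timeout=0.1))  # one (fd, event-mask) pair, EPOLLIN set

ep.close(); a.close(); b.close()
```

If the broker never sends anything (e.g. because the worker's prefetch is exhausted), the loop sits in `epoll_wait()` forever, matching the gdb observation in the bug report above.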