runciter: hia - sorry, i haven't been able to get more details on it :/ after that fire was extinguished (by running more nodes) we moved onto the next fire. i'll try to keep it on the radar though so we can try to understand what was going on.
runciter
Matthew[m]: np, just thought i'd check in
Matthew[m]: i'm hoping to have some extra capacity in the next few weeks so if i can help with synapse stuff let me know :)
Matthew[m]
np, thanks for keeping it in mind
and that could be awesome - we're currently in a "gah, we've run out of CPU headroom on matrix.org" crisis again
runciter
oof
Matthew[m]
which we're mitigating by splitting up the master process into more worker processes (basically sharding the send path as well as the receive path)
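the send-path sharding idea can be illustrated with a toy sketch (the function and shard counts here are hypothetical, not Synapse's actual scheme): deterministically hash each federation destination to one of N sender workers, so every process agrees on which worker owns which remote server's outbound queue

```python
import hashlib

def shard_for(destination: str, num_shards: int) -> int:
    # Hash the destination so the mapping is stable across processes and
    # restarts, then reduce the digest to a worker index.
    digest = hashlib.sha256(destination.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Every worker computes the same assignment, so exactly one sender
# process handles each remote server's outbound traffic.
```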
runciter
that doesn't sound like a bad idea at all
Matthew[m]
but there's a distinct feeling that we're optimising based on gut feeling
rather than actually having profiled as a whole
runciter
i'm also inclined to recommend pypy
but i don't know how feasible that is
Matthew[m]
which boils down to the eternal challenge of getting meaningful profiles out of twisted
we've tried pypy and it seems not to help much at all
runciter
so _that_ i might be able to help with
(the profiling issue)
Matthew[m]
it uses a bit more RAM and perhaps speeds up by a few %, but nothing massive
that would be awesome.
runciter
interesting
Matthew[m]
so the problem with profiling is having any way to piece a meaningful stacktrace together in face of all the deferreds
runciter
are synapse's scaling problems so bad that i might be able to replicate them locally with only a moderately popular homeserver?
Matthew[m]
we do a bunch of pyflame which is better than nothing
but doesn't give the macro picture at all
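the pyflame-style statistical approach can be sketched in-process with nothing but the stdlib (a toy illustration, not a replacement for pyflame): periodically snapshot every thread's stack and count which call chains show up most often

```python
import collections
import sys
import time
import traceback

def sample_stacks(interval=0.01, duration=0.3):
    # Toy statistical profiler: snapshot all threads' stacks at a fixed
    # interval and count how often each call chain is observed. Hot code
    # paths appear in proportionally more samples.
    counts = collections.Counter()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for frame in sys._current_frames().values():
            chain = tuple(f.name for f in traceback.extract_stack(frame))
            counts[chain] += 1
        time.sleep(interval)
    return counts
```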
and yes, it should be very easy to replicate locally
runciter
ok, cool!
i've been meaning to set up a local homeserver anyway :)
Matthew[m]
fire up a local server with a postgres backend; join #matrix:matrix.org; wait for it to warm up (i.e. learn which servers are online and which aren't); and then watch it plod.
runciter
oh, dang
Matthew[m]
the precise config we run on matrix.org is more subtle as we've split it apart into 10+ worker processes
runciter
it's that easy?
Matthew[m]
well, it won't give the "takes 30 minutes to trickle a response to a client" issue that we were seeing the other day - as that was specific to an overloaded sync worker
but in terms of the general "WHERE IS THE CPU GOING?!" question
it'd be very profileable
runciter
but it will expose a menagerie of slowdowns?
fantastic
Matthew[m]
it should do. and if nothing else it'll do a good impression of a heavily loaded server
runciter
excellent!!
mk-fg joined the channel
it's likely i won't get around to this til monday at least but we'll see how badly this ends up distracting me ;)
Matthew[m]
either way, help in profiling to get any kind of meaningful hierarchy of which code paths are the slowest would be amazing, as we've spent 3 years on this so far without having a good holistic view, rather than the finer-grained pyflame style metrics
Matthew[m] -> sleep; and thanks!
runciter
cheers!
itamar has quit
moshez joined the channel
meejah
Matthew[m]: are you sure the "slow responses to clients" are actually CPU usage and not some kind of bottleneck getting data to the right place? Have you tried twisted-theseus?
sounds similar to the issue we had a few days ago; a static string that takes a few ms to process but takes 12s to make it to the client somehow (or to process the inbound req)
terrycojones_ joined the channel
exploiteer joined the channel
exploiteer
Hello, does anyone have an example about how to use twisted.internet.asyncioreactor?
I have some code using asyncio, which I want to integrate with twisted, but I can't find an example.
Matthew[m]: "rpret this better than me, but it looks to me as if we're wedging the whole reactor for ~3s whilst loading state from the DB" <-- are you using sync db requests in the main thread?
LordVan joined the channel
oberstet joined the channel
LordVan has quit
kolko joined the channel
Matthew[m]
no - i got that wrong
as erik pointed out a few comments later
the db is async postgres on its own thread
but i was confused by the pyflame seemingly showing it stuck in the same callframe for 3s
with three separate branches within that which looked sequential
and so made me assume that it wasn’t managing to otherwise yield
otherwise other traces would have shown up in the flamegraph
ie it looked like it was stuck doing the same thing for 3s.
kolko has quit
kolko joined the channel
meejah
still, that *could* be an example of "data not getting to the right place" in a timely fashion (not CPU load)... although I've only "played" with twisted-theseus, it does sound like the right thing here: you should be able to see which deferreds are bogging down requests
have you measured the raw syscalls/second rates on the process?
Matthew[m]: ^
oh, how about timers? We got a ton more throughput in crossbar.io by batching up timers
(That's because Autobahn libraries support either asyncio OR twisted use; up to client-code to decide). crossbar.io itself is written using "just" twisted though
er, sorry, we got more throughput on the Autobahn "just websocket" use-cases in Twisted using batched timers
Maybe this is something that might make sense to put in Twisted proper? Basically what we're doing here is: putting *one* timer in Twisted and filling up that "bucket" with other notifications so that if e.g. we have 2000 timeouts pending, there might only be 2 or 3 "real" timers in Twisted; these are for "handshake timeouts" and the like so the absolute accuracy isn't that critical, and the vast majority
of them *don't* time out
before this, we'd have something like 100000 timers in twisted for ~25000 websockets (or thereabout, I think there were about 4 timers per connection for various things)
kolko has quit
kolko joined the channel
rajesh joined the channel
Matthew[m]
so i don’t think we have that many timers in synapse, tbh
other than possibly timeouts for presence (currently disabled) and general network timeouts i’m struggling to think of any
we were wondering whether theseus is going to slow everything right down due to being on the debug api?
meejah
Matthew[m]: don't know the answer to the theseus thing
Matthew[m]
We’ll try it and see :)
there may also be a statistical profiling equivalent perhaps
meejah
Matthew[m]: I've only "played" with theseus, so i'm not sure -- maybe it even has such an option! :) One very nice thing: it uses valgrind output, so you can use any valgrind tool on the data