#twisted


      • kenaan
        Tickets pending review: https://tm.tl/#9287, #9299, #9305, #9333, #9337, #9338, #9340, #9350 (ktdreyer), #9210 (markrwilliams), #9358, #9353, #9362, #9365 (axiaoxin), #9374, #9375, #9377, #9379, #9343 (rodrigc), #8966 (the0id), #9382, #9100, #9118 (the0id), #4964 (jameshilliard), #9138, #9176
      • Mike2509 has quit
      • rajesh has quit
      • n8thegr81 has quit
      • mk-fg joined the channel
      • n8thegr8 joined the channel
      • Matthew[m]
        runciter: hi - sorry, i haven't been able to get more details on it :/ after that fire was extinguished (by running more nodes) we moved on to the next fire. i'll try to keep it on the radar, though, so we can try to understand what was going on.
      • runciter
        Matthew[m]: np, just thought i'd check in
      • Matthew[m]: i'm hoping to have some extra capacity in the next few weeks so if i can help with synapse stuff let me know :)
      • Matthew[m]
        np, thanks for keeping it in mind
      • and that could be awesome - we're currently in a "gah, we've run out of CPU headroom on matrix.org" crisis again
      • runciter
        oof
      • Matthew[m]
        which we're mitigating by splitting up the master process into more worker processes (basically sharding the send path as well as the receive path)
      • runciter
        that doesn't sound like a bad idea at all
      • Matthew[m]
        but there's a distinct feeling that we're optimising based on gut feeling
      • rather than actually having profiled as a whole
      • runciter
        i'm also inclined to recommend pypy
      • but i don't know how feasible that is
      • Matthew[m]
        which boils down to the eternal challenge of getting meaningful profiles out of twisted
      • we've tried pypy and it seems not to help much at all
      • runciter
        so _that_ i might be able to help with
      • (the profiling issue)
      • Matthew[m]
        it uses a bit more RAM and perhaps speeds up by a few %, but nothing massive
      • that would be awesome.
      • runciter
        interesting
      • Matthew[m]
        so the problem with profiling is having any way to piece a meaningful stacktrace together in face of all the deferreds
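(The problem Matthew describes isn't unique to Deferreds: any callback-scheduling event loop discards the scheduling context before the callback runs. A stdlib-only sketch, using asyncio rather than Twisted purely so it is self-contained - the names `scheduler` and `handler` are made up for illustration:)

```python
import asyncio
import traceback

captured = []

def handler():
    # Inside an event-loop-scheduled callback, the stack shows only the
    # loop's own machinery above us - not the code that scheduled us.
    captured.extend(frame.name for frame in traceback.extract_stack())
    loop.stop()

def scheduler():
    # This frame is long gone by the time handler() runs, so a profiler
    # or traceback can't connect the callback back to its cause.
    loop.call_soon(handler)

loop = asyncio.new_event_loop()
loop.call_soon(scheduler)
loop.run_forever()
loop.close()

# 'handler' appears in the captured stack, but 'scheduler' does not -
# which is exactly why flat stack traces of Deferred-heavy code mislead.
```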
      • runciter
        are synapse's scaling problems so bad that i might be able to replicate them locally with only a moderately popular homeserver?
      • Matthew[m]
        we do a bunch of pyflame which is better than nothing
      • but doesn't give the macro picture at all
      • and yes, it should be very easy to replicate locally
      • runciter
        ok, cool!
      • i've been meaning to set up a local homeserver anyway :)
      • Matthew[m]
        fire up a local server with a postgres backend; join #matrix:matrix.org; wait for it to warm up (i.e. learn which servers are online and which aren't); and then watch it plod.
      • runciter
        oh, dang
      • Matthew[m]
        the precise config we run on matrix.org is more subtle as we've split it apart into 10+ worker processes
      • runciter
        it's that easy?
      • Matthew[m]
        well, it won't give the "takes 30 minutes to trickle a response to a client" issue that we were seeing the other day - as that was specific to an overloaded sync worker
      • but in terms of the general "WHERE IS THE CPU GOING?!" question
      • it'd be very profileable
      • runciter
        but it will expose a menagerie of slowdowns?
      • fantastic
      • Matthew[m]
        it should do. and if nothing else will do a good impression of a heavily loaded server
      • runciter
        excellent!!
      • mk-fg joined the channel
      • it's likely i won't get around to this til monday at least but we'll see how badly this ends up distracting me ;)
      • Matthew[m]
        eitherway, help in profiling to get any kind of meaningful hierarchy of which code paths are the slowest would be amazing, as we've spent 3 years on this so far without having a good holistic view, rather than the finer-grained pyflame style metrics
      • Matthew[m] -> sleep; and thanks!
      • runciter
        cheers!
      • mk-fg joined the channel
      • itamar has quit
      • moshez joined the channel
      • meejah
        Matthew[m]: are you sure the "slow responses to clients" are actually CPU usage and not some kind of bottleneck getting data to the right place? Have you tried twisted-theseus?
      • runciter
        meejah: oh twisted-theseus!
      • brilliant call
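(For context: twisted-theseus's documented usage is tiny - a `Tracer` is installed process-wide, and its data is later dumped in callgrind format, which kcachegrind and other valgrind-family tools can read. A minimal sketch, guarded so it degrades gracefully if the package isn't installed; the `outpath` name is illustrative:)

```python
import os
import tempfile

try:
    from theseus import Tracer  # pip install twisted-theseus

    tracer = Tracer()
    tracer.install()  # hooks Deferred execution process-wide

    # ... run the workload you want profiled here ...

    outpath = os.path.join(tempfile.mkdtemp(), 'callgrind.theseus')
    with open(outpath, 'wb') as outfile:
        # callgrind-format output, viewable with kcachegrind
        tracer.write_data(outfile)
    status = 'traced'
except ImportError:
    status = 'theseus not installed'
```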
      • oberstet joined the channel
      • altendky has quit
      • n8thegr8 has quit
      • n8thegr8 joined the channel
      • efphe joined the channel
      • Rominux joined the channel
      • rockstar has quit
      • infobob has quit
      • rockstar joined the channel
      • __main__ joined the channel
      • forrestv joined the channel
      • cdunklau joined the channel
      • Mixologic joined the channel
      • CcxWrk joined the channel
      • infobob joined the channel
      • n-st joined the channel
      • energizer joined the channel
      • forrestv has quit
      • forrestv joined the channel
      • Matthew[m]
        meejah (IRC): pretty sure it’s not a bottleneck generating the req based on logging
      • and mmm... i had looked at twisted theseus but unsure if anyone has actually used it. will check
      • runciter
        it's certainly seen real use
      • Matthew[m]
        i mean used it on matrix... it could be a massive oversight :s
      • have asked
      • meanwhile this may be unrelated
      • runciter
        ah ha
      • Matthew[m]
        sounds similar to the issue we had a few days ago; a static string that takes a few ms to process but takes 12s to make it to the client somehow (or to process the inbound req)
      • terrycojones_ joined the channel
      • exploiteer joined the channel
      • exploiteer
        Hello, does anyone have an example about how to use twisted.internet.asyncioreactor?
      • I have some code using asyncio, which I want to integrate with twisted, but I can't find an example.
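(exploiteer's question goes unanswered in the log; the usual pattern - sketched here under the assumption of Twisted 17.5+, and guarded in case it isn't installed - is to install the asyncio reactor *before* anything imports the default reactor, then bridge with `Deferred.fromFuture`:)

```python
import asyncio

try:
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)

    # install() must run before any `from twisted.internet import reactor`
    from twisted.internet import asyncioreactor
    asyncioreactor.install(loop)

    from twisted.internet import reactor
    from twisted.internet.defer import Deferred

    async def coro():
        # ordinary asyncio code, running on the shared loop
        await asyncio.sleep(0)
        return "from asyncio"

    results = []
    # Deferred.fromFuture wraps an asyncio Future so Twisted code can use it
    d = Deferred.fromFuture(asyncio.ensure_future(coro(), loop=loop))
    d.addCallback(results.append)

    reactor.callLater(0.1, reactor.stop)
    reactor.run()
    status = results[0] if results else "no result"
except ImportError:
    status = "twisted not installed"
```

(The reverse direction exists too: `Deferred.asFuture(loop)` hands a Deferred to asyncio code.)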
      • LordVan joined the channel
      • Matthew[m]
        (interestingly https://github.com/matrix-org/synapse/issues/2910 really does look like the same issue)
      • jamesaxl joined the channel
      • rajesh joined the channel
      • terrycojones joined the channel
      • itamar joined the channel
      • exploiteer has quit
      • altendky joined the channel
      • nopf joined the channel
      • kenaan
        Tickets pending review: https://tm.tl/#9287, #9299, #9305, #9333, #9337, #9338, #9340, #9350 (ktdreyer), #9210 (markrwilliams), #9358, #9353, #9362, #9365 (axiaoxin), #9374, #9375, #9377, #9379, #9343 (rodrigc), #8966 (the0id), #9382, #9100, #9118 (the0id), #4964 (jameshilliard), #9138, #9176
      • meejah
        Matthew[m]: "rpret this better than me, but it looks to me as if we're wedging the whole reactor for ~3s whilst loading state from the DB" <-- are you using sync db requests in the main thread?
      • LordVan joined the channel
      • oberstet joined the channel
      • LordVan has quit
      • kolko joined the channel
      • Matthew[m]
        no - i got that wrong
      • as erik pointed out a few comments later
      • the db is async postgres on its own thread
      • but i was confused by the pyflame seemingly showing it stuck in the same callframe for 3s
      • with three separate branches within that which looked sequential
      • and so made me assume that it wasn’t managing to otherwise yield
      • otherwise other traces would have shown up in the flamegraph
      • ie it looked like it was stuck doing the same thing for 3s.
      • kolko has quit
      • kolko joined the channel
      • meejah
        still, that *could* be an example of "data not getting to the right place" in a timely fashion (not CPU load)... although I've only "played" with twisted-theseus, it does sound like the right thing here: you should be able to see which deferreds are bogging down requests
      • have you measured the raw syscalls/second rates on the process?
      • Matthew[m]: ^
      • oh, how about timers? We got a ton more throughput in crossbar.io by batching up timers
      • We factored out much of our "cross asyncio/twisted" code and so use this: https://github.com/crossbario/txaio/blob/master...
      • (That's because Autobahn libraries support either asyncio OR twisted use; up to client-code to decide). crossbar.io itself is written using "just" twisted though
      • er, sorry, we got more throughput on the Autobahn "just websocket" use-cases in Twisted using batched timers
      • Matthew[m]: if you like graphs, there are some on this bug: https://github.com/crossbario/autobahn-python/i...
      • Maybe this is something that might make sense to put in Twisted proper? Basically what we're doing here is: putting *one* timer in Twisted and filling up that "bucket" with other notifications so that if e.g. we have 2000 timeouts pending, there might only be 2 or 3 "real" timers in Twisted; these are for "handshake timeouts" and the like so the absolute accuracy isn't that critical, and the vast majority
      • of them *don't* time out
      • before this, we'd have something like 100000 timers in twisted for ~25000 websockets (or thereabout, I think there were about 4 timers per connection for various things)
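(The batched-timer idea meejah describes can be sketched in pure Python - the `TimerBucket` name and API are made up for illustration, not crossbar's actual code. Deadlines go into a heap; one recurring "real" timer services everything that has expired, so N pending timeouts cost one reactor timer instead of N:)

```python
import heapq
import itertools

class TimerBucket:
    """One 'real' timer services many logical timeouts (illustrative sketch).

    Accuracy is bounded by the tick granularity, which is fine for
    handshake-style timeouts that rarely fire anyway.
    """

    def __init__(self):
        self._heap = []                 # (deadline, id, callback)
        self._counter = itertools.count()
        self._cancelled = set()

    def call_at(self, deadline, callback):
        tid = next(self._counter)
        heapq.heappush(self._heap, (deadline, tid, callback))
        return tid

    def cancel(self, tid):
        self._cancelled.add(tid)        # lazy deletion: dropped at pop time

    def tick(self, now):
        # The single recurring reactor timer calls this periodically.
        while self._heap and self._heap[0][0] <= now:
            _, tid, callback = heapq.heappop(self._heap)
            if tid not in self._cancelled:
                callback()
            self._cancelled.discard(tid)

fired = []
bucket = TimerBucket()
bucket.call_at(1.0, lambda: fired.append('a'))
tid = bucket.call_at(2.0, lambda: fired.append('b'))
bucket.call_at(3.0, lambda: fired.append('c'))
bucket.cancel(tid)      # most timeouts never fire, so cancel must be cheap
bucket.tick(2.5)        # one tick services every expired entry
# fired == ['a']: 'b' was cancelled, 'c' hasn't expired yet
```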
      • kolko has quit
      • kolko joined the channel
      • rajesh joined the channel
      • Matthew[m]
        so i don’t think we have that many timers in synapse, tbh
      • other than possibly timeouts for presence (currently disabled) and general network timeouts i’m struggling to think of any
      • we were wondering whether theseus is going to slow everything right down due to being on the debug api?
      • meejah
        Matthew[m]: don't know the answer to the theseus thing
      • Matthew[m]
        We’ll try it and see :)
      • there may also be a statistical profiling equivalent, perhaps
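(A statistical equivalent can be sketched with the stdlib alone: a background thread periodically snapshots the target thread's frame via `sys._current_frames()` and tallies where it is - roughly what sampling profilers like pyflame do externally. The `sample`/`busy` names are made up for illustration:)

```python
import collections
import sys
import threading
import time

def sample(target_ident, counts, stop, interval=0.001):
    # Periodically snapshot the target thread's current frame; hot
    # functions accumulate samples in proportion to time spent in them.
    while not stop.is_set():
        frame = sys._current_frames().get(target_ident)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)

def busy():
    # Spin for ~0.3s so the sampler has something to catch.
    deadline = time.time() + 0.3
    total = 0
    while time.time() < deadline:
        total += 1
    return total

counts = collections.Counter()
stop = threading.Event()
sampler = threading.Thread(
    target=sample, args=(threading.get_ident(), counts, stop))
sampler.start()
busy()
stop.set()
sampler.join()
# counts.most_common() should be dominated by 'busy'
```

(Unlike a deterministic tracer, sampling adds almost no overhead, but like pyflame it only sees the innermost flat stack - it still can't stitch Deferred chains together.)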
      • kolko has quit
      • energizer joined the channel
      • Hasimir joined the channel
      • jamesaxl joined the channel
      • meejah
        Matthew[m]: I've only "played" with theseus, so i'm not sure -- maybe it even has such an option! :) One very nice thing: it uses valgrind output, so you can use any valgrind tool on the data