#grpc

      • lungaro
        what's the difference between a tap and an interceptor?
      • hoping for an RTFM link..
      • hhclam
        ejona would it be possible that dns is causing this issue?
      • the hostname is backed by 2 ip addresses, each one is a k8s service
      • also the dns can sometimes reorder the ip addresses
      • ejona we did a pcap and confirmed that the stuck process never tried to contact the remote server when in this state
      • ejona
        hhclam: I wouldn't imagine DNS could cause the problem, especially in 1.2. I think it was 1.3 where I added an optimization to avoid reconnecting when unnecessary that would maybe allow some edge case. But in 1.2 it's pretty straightforward.
      • hhclam
        i talked to our engineers who run k8s and i think we have identified the event leading up to the uptick of errors
      • the service endpoints are hosted by a set of gateway nodes that run kube-proxy, which configures iptables for load-balancing
      • they have changed --conntrack-tcp-timeout-close-wait from its default of 1h0m0s to 60s
      • and --conntrack-tcp-timeout-established from its default of 24h0m0s to 15m
      • since this change we saw a lot more "connection reset" exceptions, and this might have made the "stuck" issue happen more often
      • we're going to revert the change to kube-proxy to see if this solves the problem
      • what's the default idle timeout of grpc?
      • ejona
        There isn't one. You can configure one on the client and server independently though (so here, on the server would be good).
      • You can also configure keepalive, which would probably keep Kube from killing the TCP connection.
      • (Again, available on both client and server.)
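An editor's aside: the keepalive knobs ejona mentions are exposed as gRPC core channel arguments that exist across language implementations. As a minimal sketch (the argument names come from gRPC core; the stub class and values are illustrative assumptions, not from this conversation), a Ruby client could pass them like this:

```ruby
# Sketch: keepalive settings expressed as gRPC core channel arguments.
# Values are illustrative; tune the ping interval to sit well under any
# middlebox timeout (e.g. kube-proxy's conntrack settings).
keepalive_args = {
  'grpc.keepalive_time_ms'              => 60_000, # ping after 60s of inactivity
  'grpc.keepalive_timeout_ms'           => 20_000, # wait 20s for the ping ack
  'grpc.keepalive_permit_without_calls' => 1       # ping even with no active RPCs
}

# With the grpc gem installed, the args are passed when building a stub,
# e.g. (hypothetical generated service):
#   stub = Helloworld::Greeter::Stub.new('localhost:50051',
#            :this_channel_is_insecure,
#            channel_args: keepalive_args)
```

The general idea is the same in grpc-java: keep the keepalive interval comfortably below whatever timeout would otherwise reap the idle connection.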
      • hhclam
        is keepalive a feature in 1.2? or later versions?
      • ejona
        Ummm... (looking)
      • hhclam
        thanks
      • ejona
        All the server-side stuff looks to be in 1.3.
      • And keepalives on client-side with Netty was broken in 1.2
      • So...
      • hhclam
        okay i guess i'll need to upgrade then :)
      • ejona
        That said, that's just avoiding the trigger, which just works around the bug. gRPC should never have a Channel that gets hung like that. That's not acceptable. We'll be digging in more to figure out the problem, and can backport as needed.
      • hhclam
        thanks :)
      • is there anything i can do to help?
      • unfortunately i can't upload the heap dump..
      • ejona
        Yeah, I sort of assumed that wasn't a possibility.
      • I think the main thing is to stay on IRC so we can ask follow-up questions.
      • hhclam
        if you want screen cap of the heap dump i can do that :)
      • sure
      • ejona
        Honestly, today not much will probably get done. It's a meeting-full day.
      • hhclam
        yup no problem
      • we have one thing to try anyway
      • i'll let you know if reverting the change to k8s alleviates the issue
      • ejona
        Yeah. If you can confirm that the Kube change prevents the problem, we'll dig down more heavily from that angle.
      • hhclam
        thanks
      • btw i work for two sigma and this is my github page: https://github.com/hhclam
      • ejona
        I assumed that was your github page based on the repositories :)
      • hhclam
        good guess
      • ejona
        At some point we'll probably create an issue. I'll make sure to CC you. You'd be free to create one as well if it'd help you track things.
      • hhclam
        i'd keep things in the irc at the moment, until we have a more concrete lead
      • and thanks for the quick response!
      • Mrgoose2 joined the channel
      • Mrgoose2
        i'm using gRPC in ruby and getting a return object of type Google::Protobuf::RepeatedField. How do i turn that into JSON?
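A hedged aside on Mrgoose2's question: Google::Protobuf::RepeatedField duck-types as an Array (it responds to #to_a), so one route to JSON is to materialize it as a plain Array first. The sketch below uses a plain Array as a stand-in for the RepeatedField so it runs without the protobuf gem:

```ruby
require 'json'

# Stand-in for a Google::Protobuf::RepeatedField: the real object
# responds to #to_a, just as this Array does.
repeated_field = ['alpha', 'beta']

# Materialize as a plain Array, then serialize with the stdlib JSON module.
puts repeated_field.to_a.to_json  # => ["alpha","beta"]
```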
      • ejona
        hhclam: FYI, this looked similar https://github.com/grpc/grpc-java/issues/3266
      • Doh. Waiting long enough fixed the problem for him. Probably because we still had an active connection that we didn't realize was dead until TCP timeouts triggered (which are long). So I guess not related :(