#grpc


      • hhclam
        but activeTransport is null isn't it?
      • ejona
        But it should at least kick off connecting via startNewTransport which will set pendingTransport
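The invariant ejona describes can be sketched roughly like this (a hypothetical simplification, not the actual grpc-java code — the real logic lives in io.grpc.internal.InternalSubchannel, and `Transport` here is a stand-in type):

```java
// Hypothetical sketch of the invariant under discussion; not grpc-java code.
class SubchannelSketch {
    static final class Transport {}  // stand-in for a client transport

    private Transport activeTransport;   // ready connection, if any
    private Transport pendingTransport;  // in-flight connection attempt, if any
    private boolean shutdown;

    // Asking an idle, non-shutdown subchannel for a transport should
    // always kick off a connection attempt, setting pendingTransport.
    synchronized Transport obtainActiveTransport() {
        if (activeTransport != null) {
            return activeTransport;
        }
        if (!shutdown && pendingTransport == null) {
            pendingTransport = startNewTransport();
        }
        return null; // caller buffers the RPC until a transport is ready
    }

    private Transport startNewTransport() {
        // Placeholder for starting an asynchronous connect.
        return new Transport();
    }

    // Per the invariant, this should only be true before the first RPC
    // or after shutdown -- never while RPCs are being attempted.
    synchronized boolean bothTransportsNull() {
        return activeTransport == null && pendingTransport == null;
    }
}
```

Under that sketch, seeing both fields null in a live, non-shutdown subchannel is exactly the "impossible" state the heap dump shows.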
      • hhclam
        I can try..
      • i'll kick another rpc and then do a heap dump
      • ejona
        (I'm believing you; it's just... how does this happen...)
      • hhclam
        same thing, 2 instances of InternalSubchannel
      • and activeTransport and pendingTransport are null in them
      • one thing that sticks out is one InternalSubchannel is in state shutdown
      • a few hosts that i inspected all have "connection reset" before this issue
      • IOException like this: https://paste.ee/p/ruAcS
      • and then our retrying interceptor creates a new call
      • ejona
        That exception can happen normally. It would just end up causing a reconnect.
      • hhclam
        a coworker pointed me to this comment: https://github.com/grpc/grpc-java/blob/v1.2.x/c...
      • maybe some race condition to cause the InternalSubchannel to be in a bad state
      • ejona
        That race would just cause RPCs to fail that shouldn't fail.
      • Kun is suggesting changing the log level. How easy would that be to do?
      • hhclam
        i can't do that :(
      • ejona
        Fair enough.
      • hhclam
        somehow our code took out the log4j mbeans :(
      • there's still the java.util.logging mbeans
      • ejona
        (InternalSubchannel has some FINE level logging)
      • We use java.util.logging
      • hhclam
        let me try...
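For reference, raising the level through java.util.logging can be done programmatically when file-based config isn't an option; a minimal sketch (class and method names here are illustrative, not from grpc-java):

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class GrpcFineLogging {
    // Hold a strong reference: java.util.logging keeps loggers weakly,
    // so a configured logger can be garbage-collected and its level lost.
    private static Logger grpcLogger;

    public static void enableFineLogging() {
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);        // let FINE records through the handler
        grpcLogger = Logger.getLogger("io.grpc");
        grpcLogger.setLevel(Level.FINE);     // applies to child loggers like
        grpcLogger.addHandler(handler);      // io.grpc.internal.InternalSubchannel
    }
}
```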
      • ejona
        Kun suggested: does thread dump show anything suspiciously blocked? Specifically if ChannelExecutor shows up. If that's blocked accidentally, something may go weird.
      • hhclam
        nothing suspiciously blocked, other than my client call thread
      • i have looked a couple times :(
      • also no luck on changing the log level
      • ejona
        Okay.
      • We aren't coming up with a way this is happening. It seems impossible. I'm inclined to believe you don't exist. :-P
      • hhclam
        lol
      • that would be nice
      • maybe i should just read through ManagedChannelImpl to see why the internal channel is stuck at idle
      • ejona
        hhclam: Did anything correlate with when this problem started?
      • hhclam
        other than the "connection reset" then no
      • all the hosts i inspected have this "connection reset" exception 3 minutes before the first deadline_exceeded
      • ejona
        Like, when you started seeing it show up in production.
      • hhclam
        that's why i thought it's related
      • ejona
        Like, a new build roll-out.
      • hhclam
        hmmm this is the first time we've rolled out to all production machines
      • nothing in the network infrastructure that i can think of
      • ejona
        That's fair.
      • hhclam
        i have to take off now (east coast time)
      • thanks for all the help!
      • ejona
        Shot in the dark: Are you doing direct connections between servers, or are there any HTTP proxies or similar?
      • hhclam
        i'll check back later in case i find something new
      • ejona
        I need to as well.
      • Sorry nothing as yet. We'll think about it some more.
      • hhclam
        no proxy, but this is to a k8s service
      • so k8s does load balancing to the actual pod
      • ejona
        So that may connect through a TCP proxy.
      • Have a good evening.
      • hhclam
        yeah i think k8s is using iptables
      • thanks, you too :)
      • hhhclam joined the channel