maybe some race condition is causing the InternalSubchannel to be in a bad state
ejona
That race would just cause RPCs to fail that shouldn't fail.
Kun is suggesting changing the log level. How easy would that be to do?
hhclam
i can't do that :(
ejona
Fair enough.
hhclam
somehow our code took out the log4j mbeans :(
there's still the java.util.logging mbeans
ejona
(InternalSubchannel has some FINE level logging)
We use java.util.logging
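For reference, a minimal sketch of turning on FINE output for that logger at runtime through the java.util.logging MXBean. It assumes the logger is named after the class, `io.grpc.internal.InternalSubchannel`; that name and the `EnableSubchannelLogging` class are assumptions for illustration. Raising the logger level alone is usually not enough, because the default ConsoleHandler only publishes INFO and above, so the sketch also attaches a handler at FINE.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.PlatformLoggingMXBean;
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class EnableSubchannelLogging {
    // Assumed logger name (logger named after the class); adjust if it differs.
    static final String LOGGER_NAME = "io.grpc.internal.InternalSubchannel";

    // Hold a strong reference: java.util.logging keeps loggers weakly, and
    // the MXBean rejects names of loggers that do not exist.
    static final Logger LOGGER = Logger.getLogger(LOGGER_NAME);

    static void enableFine() {
        // Same operation a JMX console performs on the java.util.logging MBean.
        PlatformLoggingMXBean mx =
            ManagementFactory.getPlatformMXBean(PlatformLoggingMXBean.class);
        mx.setLoggerLevel(LOGGER_NAME, "FINE");

        // The default ConsoleHandler drops records below INFO, so add one
        // that lets FINE through, and stop forwarding to the parent handlers
        // to avoid printing INFO+ records twice.
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);
        LOGGER.addHandler(handler);
        LOGGER.setUseParentHandlers(false);
    }

    public static void main(String[] args) {
        enableFine();
        System.out.println(LOGGER_NAME + " level: " + LOGGER.getLevel());
    }
}
```

If the level is changed from a JMX console instead, the handler level still has to be lowered somewhere (for example in logging.properties), otherwise the FINE records are created but never printed.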
hhclam
let me try...
ejona
Kun suggested: does thread dump show anything suspiciously blocked? Specifically if ChannelExecutor shows up. If that's blocked accidentally, something may go weird.
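A rough sketch of doing that check from inside the JVM rather than reading a jstack dump by hand. It lists every thread that has a frame whose class name contains a given fragment, together with the thread's state, so a BLOCKED or WAITING thread sitting in ChannelExecutor would stand out. The `BlockedThreadScan` class and `findSuspects` helper are hypothetical names for illustration, not part of gRPC.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class BlockedThreadScan {
    // Returns one line per thread that has at least one stack frame whose
    // class name contains classFragment, e.g. "ChannelExecutor".
    static List<String> findSuspects(Map<Thread, StackTraceElement[]> dump,
                                     String classFragment) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : dump.entrySet()) {
            Thread thread = entry.getKey();
            for (StackTraceElement frame : entry.getValue()) {
                if (frame.getClassName().contains(classFragment)) {
                    // Report the first matching frame and the thread state.
                    hits.add(thread.getName() + " [" + thread.getState()
                        + "] at " + frame);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Snapshot of all live threads in this JVM.
        Map<Thread, StackTraceElement[]> dump = Thread.getAllStackTraces();
        for (String line : findSuspects(dump, "ChannelExecutor")) {
            System.out.println(line);
        }
    }
}
```

An empty result only says no thread was inside a matching class at the moment of the snapshot, so it is worth sampling a few times.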
hhclam
nothing suspiciously blocked, other than my client call thread
i have looked a couple times :(
also no luck on changing the log level
ejona
Okay.
We aren't coming up with a way this is happening. It seems impossible. I'm inclined to believe you don't exist. :-P
hhclam
lol
that would be nice
maybe i should just read through ManagedChannelImpl to see why the internal channel is stuck at idle
ejona
hhclam: Did anything correlate to when this problem started?
hhclam
other than the "connection reset" then no
all the hosts i inspected have this "connection reset" exception 3 minutes before the first deadline_exceeded
ejona
Like, when you started seeing it show up in production.
hhclam
that's why i thought it's related
ejona
Like, a new build roll-out.
hhclam
hmmm this is the first time we've rolled out to all production machines
nothing in the network infrastructure that i can think of
ejona
That's fair.
hhclam
i have to take off now (east coast time)
thanks for all the help!
ejona
Shot in the dark: Are you doing direct connections between servers, or are there any HTTP proxies or similar?
hhclam
i'll check back later in case i find something new
ejona
I need to as well.
Sorry nothing as yet. We'll think about it some more.