what's the difference between a tap and an interceptor?
hoping for an RTFM link..
hhclam
ejona: would it be possible that dns is causing this issue?
the hostname is backed by 2 ip addresses, each one is a k8s service
also the dns can sometimes reorder the ip addresses
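A side note on the reordering point: a resolver that compares address lists positionally will see a pure reorder as a change and may churn connections. A minimal sketch of an order-insensitive comparison (hypothetical helper, not gRPC's actual resolver logic):

```python
def addresses_changed(old, new):
    """Return True only if the set of resolved endpoints actually differs.

    A DNS round-robin reorder of the same two IPs (as described above)
    compares as unchanged, so it should not trigger a reconnect.
    """
    return set(old) != set(new)

# The two k8s service IPs coming back in a different order is not a change:
assert not addresses_changed(["10.0.0.1", "10.0.0.2"],
                             ["10.0.0.2", "10.0.0.1"])
# A genuinely different endpoint set is:
assert addresses_changed(["10.0.0.1", "10.0.0.2"],
                         ["10.0.0.1", "10.0.0.3"])
```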
ejona: we did a pcap and confirmed that the stuck process never tried to contact the remote server when in this state
ejona
hhclam: I wouldn't imagine DNS could cause the problem, especially in 1.2. I think it was 1.3 where I added an optimization to avoid reconnecting when unnecessary that would maybe allow some edge case. But in 1.2 it's pretty straightforward.
hhclam
i talked to our engineers who run k8s and i think we have identified the event leading up to the uptick of errors
the service endpoints are hosted by a set of gateway nodes that run kube-proxy, which configures iptables for load-balancing
they have changed --conntrack-tcp-timeout-close-wait duration from default 1h0m0s to 60s
and --conntrack-tcp-timeout-established duration from default 24h0m0s to 15m
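For reference, the change described corresponds to kube-proxy flags along these lines (values taken from the messages above; the surrounding invocation is illustrative):

```shell
# Shortened conntrack timeouts as described above
# (defaults: close-wait 1h0m0s, established 24h0m0s).
kube-proxy \
  --conntrack-tcp-timeout-close-wait=60s \
  --conntrack-tcp-timeout-established=15m
```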
since this change we saw a lot more "connection reset" exceptions, and this might have made the "stuck" issue happen more often
we're going to revert the change to kube-proxy to see if this solves the problem
what's the default idle timeout of grpc?
ejona
There isn't one. You can configure one on the client and server independently though (so here, on the server would be good).
You can also configure keepalive, which would probably avoid Kube from killing the TCP connection.
(Again, available on both client and server.)
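Tying the two threads together: whatever keepalive time gets configured needs to fire comfortably before kube-proxy's conntrack timeout drops the idle connection's entry (15m established, per the earlier messages). A rough back-of-the-envelope helper (hypothetical sizing sketch, not a gRPC API):

```python
# Hypothetical sizing helper: pick a keepalive period that refreshes an idle
# connection well before the conntrack entry expires. Not a gRPC API.

CONNTRACK_ESTABLISHED_S = 15 * 60  # --conntrack-tcp-timeout-established=15m

def keepalive_period_s(conntrack_timeout_s, safety_factor=0.5):
    """Ping at a fraction of the conntrack timeout, so even a keepalive
    lost to transient packet loss leaves time for the next one."""
    return int(conntrack_timeout_s * safety_factor)

print(keepalive_period_s(CONNTRACK_ESTABLISHED_S))  # 450 seconds (7.5 min)
```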
hhclam
is keepalive a feature in 1.2? or later versions?
ejona
Ummm... (looking)
hhclam
thanks
ejona
All the server-side stuff looks to be in 1.3.
And keepalives on client-side with Netty was broken in 1.2
So...
hhclam
okay i guess i'll need to upgrade then :)
ejona
That said, that's just avoiding the trigger, which just works around the bug. gRPC should never have a Channel that gets hung like that. That's not acceptable. We'll be digging in more to figure out the problem, and can backport as needed.
hhclam
thanks :)
is there anything i can do to help?
unfortunately i can't upload the heap dump..
ejona
Yeah, I sort of assumed that wasn't a possibility.
I think the main thing is to stay on IRC so we can ask follow-up questions.
hhclam
if you want screen cap of the heap dump i can do that :)
sure
ejona
Honestly, today not much will probably get done. It's a meeting-full day.
hhclam
yup no problem
we have one thing to try anyway
i'll let you know if reverting the change to k8s alleviates the issue
ejona
Yeah. If you can confirm that the Kube change prevents the problem, we'll dig down more heavily from that angle.
Doh. Waiting long enough fixed the problem for him. Probably because we still had an active connection that we didn't realize was dead until TCP timeouts triggered (which are long). So I guess not related :(