Is it mainly because of Digital Ocean that anytime a single host in a cluster restarts, it messes up etcd? If the host happens to be the leader, it takes down all hosts
NOTICE: [etcd] yichengq created release-2.0 at cea3448 (+0 new commits): http://git.io/xcvd
rpad has quit
Seventoes
epipho: ehh yeah it probably is. the "red herring when debugging real issues" comment is for me it seems ;)
so now i gotta figure out why my second machine stopped heartbeating long enough to be considered lost :-/
shiv has quit
mikedanese_ joined the channel
Pindar joined the channel
thumpba joined the channel
mprins joined the channel
NOTICE: [etcd] AdoHe comment on issue #2391: @grimnight maybe you should find another solution, etcd is not very suitable for such case. http://git.io/xcJD
Pindar has quit
mprins has quit
rbowlby joined the channel
chance
where/how is this deployed
erkules joined the channel
Seventoes
digitalocean, two machines connected via private networking
chance
probably just some networking blip
you're sure the node didn't go down, right?
ie: checked logs
Seventoes
yup. uptime is fine, nothing in machine log
logs say heartbeat just failed. nothing on the second machine's side except the apparently non-messages :P
chance
bad tcp, bad!
Seventoes
is there a way to make sure my services will come back up after they're unscheduled? it all seems to require a manual stop then start to get running again
chance
it could just be that the ttl for the scheduled unit in etcd expired
basically when a unit gets scheduled to a fleetd
that machine heartbeats etcd periodically, refreshing that unit's ttl
to verify the host is actually running it
and alive
Seventoes: that's how it should work
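(For illustration only, not fleet's actual code: a minimal Go sketch of that TTL-heartbeat pattern against etcd's v2 HTTP API. The endpoint and key path here are made up; the point is that the key expires if the refresh loop stops, and the rest of the cluster then treats the host as lost.)

package main

import (
	"log"
	"net/http"
	"net/url"
	"strconv"
	"strings"
	"time"
)

// refreshHeartbeat re-sets a key with a TTL via etcd's v2 HTTP API,
// which is how a periodic heartbeat keeps the key from expiring.
func refreshHeartbeat(endpoint, key string, ttl time.Duration) error {
	form := url.Values{}
	form.Set("value", "alive")
	form.Set("ttl", strconv.Itoa(int(ttl.Seconds())))
	req, err := http.NewRequest("PUT", endpoint+"/v2/keys"+key, strings.NewReader(form.Encode()))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	endpoint := "http://127.0.0.1:4001"   // assumed local etcd client address
	key := "/example/heartbeat/machine-a" // hypothetical key, not fleet's real layout
	ttl := 30 * time.Second

	for {
		if err := refreshHeartbeat(endpoint, key, ttl); err != nil {
			// a few consecutive failures here and the key expires,
			// at which point the cluster considers the host lost
			log.Printf("heartbeat refresh failed: %v", err)
		}
		time.Sleep(ttl / 3) // refresh well before the TTL runs out
	}
}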
do you have metadata preventing it from being scheduled on another host?
Seventoes
yeah, it's restricted to that one machine for now
chance
then it should get rescheduled after the leader reconciles and that node is in the cluster
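(For reference, a hypothetical fleet unit fragment showing the kind of [X-Fleet] restriction being discussed; the unit name and metadata key/value are made up. MachineMetadata limits scheduling to hosts advertising that metadata, while MachineID would pin the unit to exactly one machine.)

[Unit]
Description=Example service restricted to one class of machine

[Service]
ExecStart=/usr/bin/example-app

[X-Fleet]
# only schedule on machines started with fleet metadata role=worker
MachineMetadata=role=worker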
robszumski has quit
Seventoes
the logs say it got rescheduled, but the service never came back up
chance
did you look at the logs on that host for why it might not have come up?
ie: did it ever even attempt to start the service?
when it came back
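(A few commands commonly used to answer that question; "myapp.service" is a placeholder unit name.)

fleetctl list-units            # confirms which machine the unit landed on and its state
fleetctl status myapp.service  # runs systemctl status for the unit on its host
fleetctl journal myapp.service # fetches the unit's journal from that host
journalctl -u myapp.service    # or read the journal directly on the machine itself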
rossk joined the channel
NOTICE: [fleet] epipho comment on issue #1077: Global units now obey dynamic metadata, see 5941e1f. This felt like the least intrusive way to ensure that agents had the full machine state during reconciliation.... http://git.io/xck2
rossk has quit
t3rm1n4l joined the channel
donspaulding has quit
NOTICE: [etcd] yichengq opened pull request #2402: rafthttp: backoff error printout and log stream start (master...332) http://git.io/xcId