#coreos


      • doniphon joined the channel
      • KurtKraut is now known as kurtkraut
      • rodlogic joined the channel
      • mazzy joined the channel
      • boombatower joined the channel
      • fayablazer joined the channel
      • rossk joined the channel
      • bkruse
        Is it mainly because Digital Ocean that anytime a single host in a cluster restarts, it messes up etcd? If the host happens to be the leader, it takes down all hosts
      • ocdmw joined the channel
      • carlosdp joined the channel
      • chance
        it shouldn't
      • how big is your cluster?
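Cluster size matters here because etcd stays available only while a strict majority quorum of members is reachable; if a restarting host costs the cluster its quorum, everything stalls until it returns. A minimal sketch of the quorum arithmetic (illustrative helper names, not etcd code):

```python
# Quorum arithmetic for an etcd cluster: a write (or a leader election)
# needs a strict majority of the member set.
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1

def tolerated_failures(members: int) -> int:
    """How many members can be down while the cluster stays available."""
    return members - quorum(members)

for n in (1, 2, 3, 5):
    print(n, quorum(n), tolerated_failures(n))
```

Note that a two-member cluster tolerates zero failures: restarting either host (leader or not) temporarily costs the cluster its quorum.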
      • ghostpl_ joined the channel
      • fayablazer has left the channel
      • ocdmw has quit
      • curiousdude99 has quit
      • Seventoes
        alright how can i go about figuring out why this is happening? https://gist.github.com/stith/8ad378ea8cbbe42a298d
      • z0mbix joined the channel
      • berto- has quit
      • it's happening pretty often. 915 in the last ~19 hours
      • etcd seems to be working perfectly fine though. never had a problem running a command manually.
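The reported rate is worth putting in perspective: 915 events over roughly 19 hours works out to one every 75 seconds or so, a steady background cadence rather than isolated incidents. The arithmetic:

```python
# Rate of the reported messages: 915 events in ~19 hours.
events = 915
hours = 19

per_hour = events / hours                 # events per hour
seconds_between = hours * 3600 / events   # average gap between events

print(round(per_hour, 1))        # ~48.2 per hour
print(round(seconds_between, 1)) # ~74.8 seconds apart
```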
      • donspaulding joined the channel
      • thumpba joined the channel
      • thumpba has quit
      • rsFF has quit
      • rsFF joined the channel
      • mkaesz joined the channel
      • ocdmw joined the channel
      • forbiddenera
        meow
      • mazzy joined the channel
      • chance
        Seventoes: i don't know if that's a big deal
      • could be just that the http conn itself got lost over a long poll
      • Seventoes
        chance: i ignored it too, but my second machine stopped responding to heartbeats last night and killed my redis/db service
      • so now i'm worried :P
      • it only dropped for like 2-5 seconds or something though
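A dropped long poll is routine for etcd v2 watches: the client holds an HTTP request open (`?wait=true`) until a key changes, so any network blip mid-poll kills the connection, and the right response is simply to re-issue the watch. A sketch of that retry pattern, using a stubbed watch call rather than a live etcd endpoint (the function names here are illustrative, not fleet's actual code):

```python
import socket

def watch_with_retry(do_watch, max_attempts=5):
    """Call a long-polling watch, retrying when the connection drops.

    `do_watch` stands in for an HTTP long poll against an etcd v2
    /v2/keys/...?wait=true endpoint; a lost connection surfaces as a
    socket/connection error and is safe to retry.
    """
    for _attempt in range(max_attempts):
        try:
            return do_watch()
        except (socket.timeout, ConnectionError, OSError):
            continue  # connection dropped mid-poll: just re-issue the watch
    raise RuntimeError("watch failed after %d attempts" % max_attempts)

# Stub: the first two polls "drop", the third delivers an event.
calls = {"n": 0}
def flaky_watch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("long poll dropped")
    return {"action": "set", "key": "/services/web"}

result = watch_with_retry(flaky_watch)
print(result)  # retries twice, then succeeds
```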
      • jkyle joined the channel
      • flower_ joined the channel
      • chance
        you can increase your heartbeat
      • but anything you need to stay persistent should probably be handled specially
      • especially since CoreOS can just reboot for updates
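Raising the heartbeat window is a fleet configuration setting. A sketch of what that might look like in `/etc/fleet/fleet.conf`; the `agent_ttl` option existed in fleet at the time, but treat the exact name and default as something to verify against your fleet version:

```ini
# /etc/fleet/fleet.conf (sketch; verify option names for your fleet version)
# How long the machine's presence in the cluster persists without a
# heartbeat before it is considered lost and its units rescheduled.
agent_ttl="90s"
```

A longer TTL rides out short network blips like the 2-5 second drop described above, at the cost of slower detection of genuinely dead hosts.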
      • ccding has quit
      • z0mbix has quit
      • mprins has quit
      • mikedanese_ has quit
      • donspaulding has quit
      • berto- joined the channel
      • rossk joined the channel
      • rpad joined the channel
      • jcru joined the channel
      • jcru has quit
      • calavera has quit
      • juztin joined the channel
      • mattapperson joined the channel
      • dylanmei joined the channel
      • donspaulding joined the channel
      • NOTICE: [etcd] yichengq opened pull request #2401: rafthttp: add test and bench (master...331) http://git.io/xce1
      • donspaulding has quit
      • mkaesz joined the channel
      • donspaulding joined the channel
      • achanda joined the channel
      • epipho
        Seventoes: is it actually breaking anything? it may be https://github.com/coreos/fleet/issues/888
      • NOTICE: [etcd] yichengq created release-2.0 at cea3448 (+0 new commits): http://git.io/xcvd
      • rpad has quit
      • Seventoes
        epipho: ehh yeah it probably is. the "red herring when debugging real issues" comment is for me it seems ;)
      • so now i gotta figure out why my second machine stopped heartbeats enough to be considered lost :-/
      • shiv has quit
      • mikedanese_ joined the channel
      • Pindar joined the channel
      • thumpba joined the channel
      • mprins joined the channel
      • NOTICE: [etcd] AdoHe comment on issue #2391: @grimnight maybe you should find another solution, etcd is not very suitable for such case. http://git.io/xcJD
      • Pindar has quit
      • mprins has quit
      • rbowlby joined the channel
      • chance
        where/how is this deployed
      • erkules joined the channel
      • Seventoes
        digitalocean, two machines connected via private networking
      • chance
        probably just some networking blip
      • you're sure the node didn't go down, right?
      • i.e., did you check the logs?
      • Seventoes
        yup. uptime is fine, nothing in machine log
      • logs say heartbeat just failed. nothing on the second machine's side except the apparently non-messages :P
      • chance
        bad tcp, bad!
      • Seventoes
        is there a way to make sure my services will come back up after they're unscheduled? it all seems to require a manual stop then start to get running again
      • chance
        it could just be that the TTL for the scheduled unit in etcd expired
      • basically when a unit gets scheduled to a fleetd
      • that machine heartbeats etcd periodically, refreshing that unit's TTL
      • to verify the host is actually running it
      • and alive
      • Seventoes: that's how it should work
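The mechanism chance describes can be sketched as a toy simulation: each agent refreshes a TTL key for the units it runs, and a unit whose TTL lapses is treated as lost and eligible for rescheduling. This is an illustrative model of the heartbeat/TTL idea, not fleet's actual code:

```python
class TTLStore:
    """Toy etcd-like store: keys expire unless refreshed within their TTL."""
    def __init__(self):
        self._expiry = {}

    def set(self, key, ttl, now):
        self._expiry[key] = now + ttl   # heartbeat: (re)set the deadline

    def alive(self, key, now):
        return self._expiry.get(key, 0) > now

store = TTLStore()
ttl = 30  # seconds

# t=0: the agent on machine-2 heartbeats for its scheduled unit.
store.set("/units/redis@1/machine-2", ttl, now=0)

# t=20: within the TTL, the unit still counts as running.
print(store.alive("/units/redis@1/machine-2", now=20))  # True

# t=35: no heartbeat arrived, the TTL lapsed, so the leader treats
# the host as lost and the unit as eligible for rescheduling.
print(store.alive("/units/redis@1/machine-2", now=35))  # False
```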
      • do you have metadata preventing it from being scheduled on another host?
      • Seventoes
        yeah, it's restricted to that one machine for now
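Pinning a unit to specific machines is done in the unit file's `[X-Fleet]` section; `MachineMetadata` and `MachineID` are fleet's scheduling constraints (the service body and metadata values below are just examples):

```ini
# myservice.service (fleet unit; metadata values are illustrative)
[Unit]
Description=Example pinned service

[Service]
ExecStart=/usr/bin/docker run --rm redis

[X-Fleet]
# Only schedule on machines whose fleet metadata includes role=db.
MachineMetadata=role=db
```

The trade-off being discussed: with a constraint like this on a single eligible machine, fleet has nowhere else to reschedule the unit if that machine is considered lost.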
      • chance
        then it should get rescheduled after the leader reconciles and that node is in the cluster
      • robszumski has quit
      • Seventoes
        the logs say it got rescheduled, but the service never came back up
      • chance
        did you look at the logs on that host for why it might not have come up?
      • i.e. did the service ever attempt to start when the node came back?
      • rossk joined the channel
      • NOTICE: [fleet] epipho comment on issue #1077: Global units now obey dynamic metadata, see 5941e1f. This felt like the least intrusive way to ensure that agents had the full machine state during reconciliation.... http://git.io/xck2
      • rossk has quit
      • t3rm1n4l joined the channel
      • donspaulding has quit
      • NOTICE: [etcd] yichengq opened pull request #2402: rafthttp: backoff error printout and log stream start (master...332) http://git.io/xcId
      • NOTICE: [etcd] yichengq closed pull request #2367: rafthttp: improve error printout (210...320) http://git.io/A1Wg
      • shiv joined the channel
      • juztin joined the channel
      • politician joined the channel
      • juztin has quit
      • lynchc joined the channel
      • nolan_d joined the channel
      • curiousdude99 joined the channel
      • t3rm1n4l has quit
      • shiv has quit
      • shiv joined the channel
      • mkaesz joined the channel
      • achanda has quit
      • t3rm1n4l joined the channel
      • politician has quit
      • achanda joined the channel
      • mkaesz has quit
      • intransit joined the channel
      • koolhead17 joined the channel
      • nolan_d has quit
      • jsprogrammer has quit
      • if_e1se joined the channel
      • acabrera has quit
      • acabrera joined the channel
      • digcloud has left the channel
      • rossk joined the channel
      • kurtkraut is now known as KurtKraut
      • darren0 joined the channel
      • juztin joined the channel
      • cuong_ joined the channel
      • donspaulding joined the channel
      • markbook has quit
      • geomyidae_ joined the channel
      • geomyidae_ has quit
      • geomyidae_ joined the channel
      • thumpba joined the channel
      • mkaesz joined the channel
      • jcru joined the channel
      • shiv has quit
      • donspaulding has quit
      • mkaesz has quit
      • NOTICE: [etcd] mikael84 comment on issue #2351: Oh, I edited the wrong box. Sorry. http://git.io/xcch