Displaying #mesos/2014-07-02.log:

Wed Jul 2 02:53:51 2014  chengwei_:Hi, I am using chronos and found an issue that the web UI hangs forever after running for half a month, anyone know about that?
Wed Jul 2 03:10:03 2014  adam-mesos:chengwei_: Haven't heard of this. Maybe you could file a JIRA/github-issue with more details?
Wed Jul 2 03:11:57 2014  chengwei_:adam-mesos: thanks, I already found that chronos on github is not active anymore. :-(
Wed Jul 2 03:12:17 2014  chengwei_:I have some patches for it that have been pending for months
Wed Jul 2 03:12:43 2014  chengwei_:anyway, thanks for your suggestion, I'll file an issue on github
Wed Jul 2 03:25:00 2014  chengwei_:adam-mesos: filed an issue here. https://github.com/airbnb/chronos/issues/232
Wed Jul 2 04:13:41 2014  adam-mesos:chengwei_: Thanks. I'll poke the Chronos ppl I know to see if issues like yours can get some attention.
Wed Jul 2 08:52:39 2014  g-hennux:hi!
Wed Jul 2 08:53:21 2014  g-hennux:i set up a small cluster with four slaves, which all should have pretty much identical configuration. however, one of them continuously disconnects and reconnects to the master, each time getting a new id
Wed Jul 2 08:53:27 2014  g-hennux:any idea what could cause this?
Wed Jul 2 09:05:06 2014  g-hennux:the same problem that rclough has in http://wilderness.apache.org/channels/?f=mesos/2014-04-04 ,
Wed Jul 2 09:13:07 2014  g-hennux:ah, the respective host had its own IP wrong in /etc/hosts
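(For reference: the failure mode above, a slave reconnecting with a fresh ID, can happen when the slave's own hostname resolves to the wrong address. A sketch of checking and fixing this, with hypothetical hostname/IP values:)

```shell
# Check what the slave's own hostname resolves to:
getent hosts "$(hostname)"

# If it's wrong, correct the entry in /etc/hosts (hypothetical values):
#   192.168.1.11   slave1.example.com   slave1
# then restart the slave so it re-registers with the right address.
```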
Wed Jul 2 13:18:41 2014  lyda:i'm setting up the first 3 nodes of a cluster. on it i'm running a small external-to-hadoop-cluster (for frameworks that want HDFS)...
Wed Jul 2 13:19:32 2014  lyda:...a zookeeper cluster (used by mesos)... and mesos masters and slaves.
Wed Jul 2 13:19:51 2014  lyda:the masters start up but are unable to agree on a master.
Wed Jul 2 13:22:12 2014  lyda:this is with mesos 0.19 (mesosphere debian package) on ubuntu 14.04.
Wed Jul 2 13:22:24 2014  lyda:i have quorum set to 2.
Wed Jul 2 13:24:10 2014  lyda:the mesos-master processes keep dying with this in the FATAL log:
Wed Jul 2 13:24:11 2014  lyda:Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
Wed Jul 2 14:42:16 2014  lyda:is this the best place to ask these things? would it be better to file a ticket in jira?
Wed Jul 2 17:31:54 2014  lyda:morning?
Wed Jul 2 18:00:44 2014  adam_mesos:lyda: G'morning
Wed Jul 2 18:07:35 2014  lyda:i suddenly recalled the mailing list which might be a better place to debug things.
Wed Jul 2 18:08:15 2014  lyda:i'm wondering if having a quorum of 2 was a good idea (it will be nice when that's gone).
Wed Jul 2 18:09:33 2014  lyda:i also note that i started mesos-master up on those servers a while back when they had different names. i cleaned out zookeeper (it was a fresh install actually, but I also removed /mesos)...
Wed Jul 2 18:10:03 2014  lyda:but i'm wondering if mesos-masters keep any state on disk. is there anything else i should wipe out?
Wed Jul 2 18:11:32 2014  adam_mesos:lyda: You can try wiping /tmp/mesos
Wed Jul 2 18:12:06 2014  adam_mesos:Typically a quorum must be an odd number (3 or 5 is ideal), to be able to break ties
Wed Jul 2 18:12:53 2014  lyda:really? i thought 3 masters would need a quorum of 2?
Wed Jul 2 18:13:05 2014  lyda:or am i misunderstanding?
Wed Jul 2 18:13:55 2014  lyda:and does that imply it's best to have 5 or 7 masters?
Wed Jul 2 18:13:56 2014  adam_mesos:Ah, right you are. (quorum * 2 - 1) == # masters
Wed Jul 2 18:14:13 2014  adam_mesos:Quorum of 2, for 3 masters.
Wed Jul 2 18:14:20 2014  lyda:ah. whew. so 3 masters is cool.
Wed Jul 2 18:14:26 2014  adam_mesos:Totally.
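(The relationship adam_mesos lands on above, masters == quorum * 2 - 1, i.e. the quorum is the smallest strict majority of the master count, can be sketched as a tiny shell helper:)

```shell
# Smallest strict majority of n masters: floor(n / 2) + 1.
# So 3 masters -> quorum 2, 5 -> 3, 7 -> 4.
quorum_for() {
  echo $(( $1 / 2 + 1 ))
}

quorum_for 3   # -> 2
quorum_for 5   # -> 3
```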
Wed Jul 2 18:15:54 2014  lyda:i'm starting slaves on the masters for now for early testing, but i'm guessing that with the machines running zookeeper, mesos-master and hadoop (as an hdfs store for various frameworks) i should stop doing that when i want to run real jobs, yes?
Wed Jul 2 18:17:04 2014  lyda:the machines are beefy - 16 cpus, 48g ram, but the goal is to have about 50 slaves.
Wed Jul 2 18:17:13 2014  adam_mesos:Mesos master is pretty minimal on memory/cpu footprint for small clusters, but can become a network bottleneck as you scale up your cluster size.
Wed Jul 2 18:18:14 2014  adam_mesos:For 50 slaves, I'd suggest making the master nodes separate, but you could run ZK on the master nodes as well. Up to you how you want to spread out your NameNode(s)
Wed Jul 2 18:18:51 2014  lyda:well, i eventually would rather run riak with its s3 support, but i'll crawl first.
Wed Jul 2 18:19:20 2014  lyda:it's just easier to make riak spof-free than it is to play stupid-namenode tricks.
Wed Jul 2 18:20:19 2014  adam_mesos:I hear ya
Wed Jul 2 18:22:45 2014  lyda:so would stale data in /tmp/mesos/meta plus hostname changes lead to a neverending master election?
Wed Jul 2 18:24:25 2014  lyda:is it ok to nuke /var/lib/mesos/replicated_log as well?
Wed Jul 2 18:25:34 2014  lyda:hm. not part of the package by default. nuking.
Wed Jul 2 18:27:36 2014  adam_mesos:lyda: Yeah, based on your dev@ thread, the replicated log (part of the registry/registrar) is probably what you need to nuke
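(A sketch of the full state wipe being attempted here, assuming the default paths mentioned in this log; destructive, so only appropriate on a fresh cluster:)

```shell
# Stop the master before touching its state:
#   service mesos-master stop

# Remove on-disk master state (paths as discussed above):
rm -rf /tmp/mesos                        # default work-dir state
rm -rf /var/lib/mesos/replicated_log     # registrar's replicated log

# Also clear the ZooKeeper znode the masters coordinate under
# (path assumed to be /mesos, as in this log):
#   zkCli.sh -server localhost:2181 rmr /mesos

#   service mesos-master start
```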
Wed Jul 2 18:28:24 2014  lyda:nope. same error.
Wed Jul 2 18:28:40 2014  lyda:n1's mesos-master just bounced.
Wed Jul 2 18:29:07 2014  lyda:but now i'm wondering if i remembered to clean zookeeper.
Wed Jul 2 18:29:16 2014  adam_mesos:Well, you could always just try mesos-master --registry=in_memory to disable the registrar for now
Wed Jul 2 18:33:17 2014  lyda:dear java world, user hostile c has had libreadline and tecla and others for several decades now, is it that hard to add to things like /usr/lib/zookeeper/bin/cli_mt ? love, grumpy people.
Wed Jul 2 18:37:40 2014  lyda:adam_mesos: wow. i just figured out how to do that with the mesosphere init script (echo in_memory > /etc/mesos-master/'?registry'). so... what functionality will be lost if i do that?
Wed Jul 2 18:39:51 2014  adam_mesos:You lose registrar persistence, which means that on master-failover, the new master will have to wait for slaves/frameworks to re-register before it knows what is running on the cluster. In 0.19, the registrar is mostly write-only, and not much depends on the feature, but if you'd like to have a seamless transition to registrar-enabled 0.20, you wouldn't want in_memory
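(Putting the two pieces above together, a sketch assuming the mesosphere packaging, whose init script reads one file per flag from /etc/mesos-master/, using the file name lyda found above:)

```shell
# Direct flag form:
#   mesos-master --registry=in_memory ...

# Mesosphere init-script form: drop the value into a per-flag file so
# the script passes --registry=in_memory on startup:
echo in_memory > '/etc/mesos-master/?registry'

#   service mesos-master restart
```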
Wed Jul 2 18:40:39 2014  lyda:ok. cleaned zookeeper and the fs, no dice. but i notice something:
Wed Jul 2 18:40:41 2014  lyda:/var/log/mesos/mesos-master.n1.invalid-user.log.FATAL.20140702-193945.12511
Wed Jul 2 18:40:49 2014  lyda:"invalid-user"
Wed Jul 2 18:41:19 2014  lyda:is this a "don't run mesos as root you muppet" subtle hint?
Wed Jul 2 18:45:18 2014  adam_mesos:Could be, but I think we just don't translate uid(0) correctly
Wed Jul 2 18:45:32 2014  adam_mesos:Not an issue, so you can ignore it for now
Wed Jul 2 18:57:50 2014  lyda:ok, the FATAL log is short, the INFO log is long. i shrunk it down (losing the order) but wondering if any of these might hint at the issue: http://pastebin.com/0dVHwrbt
Wed Jul 2 18:58:10 2014  lyda:the one oddity is this line: Detected a new leader: (id='4')
Wed Jul 2 18:58:19 2014  lyda:there shouldn't be a 4.
Wed Jul 2 19:02:57 2014  adam_mesos:Try increasing your --registry_fetch_timeout=VALUE
Wed Jul 2 19:03:47 2014  adam_mesos:Looks like it's actually doing work for the registry, but times out.
Wed Jul 2 19:03:53 2014  adam_mesos:Maybe also try increasing --zk_session_timeout=VALUE
Wed Jul 2 19:07:03 2014  lyda:what about --registry_store_timeout ?
Wed Jul 2 19:10:30 2014  dhamon:lyda: do you have the INFO log pasted in order anywhere?
Wed Jul 2 19:14:40 2014  lyda:i can. it's rather large.
Wed Jul 2 19:15:06 2014  lyda:i'm trying registry_fetch_timeout at 2mins
Wed Jul 2 19:23:15 2014  lyda:dhamon: http://pastebin.com/NFJU5Xx7
Wed Jul 2 19:23:48 2014  lyda:that's the first 200 lines. it just repeats replica/received lines after that.
Wed Jul 2 19:25:59 2014  lyda:increasing zk_session_timeout now. increasing registry_fetch_timeout didn't help.
Wed Jul 2 19:27:07 2014  lyda:(though it did take as the FATAL log now says, "...Failed to perform fetch within 2mins"
Wed Jul 2 19:27:10 2014  lyda:)
Wed Jul 2 19:29:49 2014  lyda:ran this on n[123]: echo 50secs > /etc/mesos-master/zk_session_timeout
Wed Jul 2 19:56:41 2014  lyda:nope. also fails.
Wed Jul 2 21:17:01 2014  kiteless:Hi. I’m trying to run ‘make check’ on the 0.19.0 version of mesos and I’m running into the issue detailed in MESOS-1077
Wed Jul 2 21:18:22 2014  kiteless:Specifically the “ZOO_ERROR@handle_socket_error_msg@1697” stuff
Wed Jul 2 21:19:13 2014  kiteless:I see in the ticket that this has been marked “fixed”. Is there a more up-to-date stable tag I should be using?
Wed Jul 2 21:19:44 2014  kiteless:Or is there a simple fix for 0.19.0 (the tarball featured on the site)
Wed Jul 2 21:25:50 2014  dhamon:are you seeing the CHECK failed message, or just the ZOO_ERROR?
Wed Jul 2 21:26:08 2014  dhamon:the fix is for the CHECK failed. The ZOO_ERROR is a different issue that is a symptom but unrelated.
Wed Jul 2 21:28:09 2014  kiteless:I’m just getting a never ending stream of those ZOO_ERROR messages when I’m running ‘make check’
Wed Jul 2 21:31:07 2014  kiteless:I’ve followed all the instructions provided at the mesos “Getting Started” page step-by-step. I can’t believe that this is an uncommon problem. Just not sure what I’m doing wrong.
Wed Jul 2 22:06:00 2014  adam_mesos:kiteless: Try checking your /etc/hosts to make sure that's set up correctly
Wed Jul 2 22:06:58 2014  kiteless:And what would it be looking for in there that would cause such an error?
Wed Jul 2 22:08:12 2014  kiteless:It’s trying to connect on various high ports
Wed Jul 2 22:17:48 2014  kiteless:Nothing?
Wed Jul 2 22:28:08 2014  adam_mesos:Hmm.. Should be fine. You could try connecting to ZK yourself to make sure that ZK is actually running and you can connect to it.
Wed Jul 2 22:29:29 2014  adam_mesos:kiteless: I've seen sporadic ZK errors when running the tests too, but usually ignore them as the test suite passes for me anyway. What test(s) was it running when you started seeing the never-ending ZOO_ERRORs? You can use GTEST_FILTER="*Testname*" to run only a specific test.
Wed Jul 2 22:30:14 2014  adam_mesos:And the obvious: run 'service zookeeper status' to make sure ZK is actually up and running
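(Quick ZK liveness checks along the lines suggested above, assuming a local ZooKeeper on the default port 2181:)

```shell
# Is the process up (Ubuntu/mesosphere packaging)?
#   service zookeeper status

# ZooKeeper's "ruok" four-letter command; a healthy server answers "imok":
#   echo ruok | nc localhost 2181

# Can a client actually connect and read the root znode?
#   zkCli.sh -server localhost:2181 ls /
```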
Wed Jul 2 22:32:26 2014  kiteless:ZK is running
Wed Jul 2 22:33:17 2014  kiteless:I’m not sure which tests were running. Any log I can look at?
Wed Jul 2 22:34:03 2014  kiteless:It looked as if stuff from std_err was running all over std_out
Wed Jul 2 22:38:05 2014  kiteless:The problem is that it takes a damned long time to compile and then it ends in an endless, buffer overflowing string of errors.
Wed Jul 2 22:42:06 2014  adam_mesos:You can use build/bin/mesos-tests to run the tests again once everything is already compiled. Look for lines like "[ RUN ] RegistrarTest.recover" in the output to see what test is running.
Wed Jul 2 22:44:14 2014  kiteless:but it’s failing to compile the tests
Wed Jul 2 22:44:17 2014  kiteless:That’s the problem
Wed Jul 2 22:44:42 2014  dhamon:make check GTEST_FILTER=''
Wed Jul 2 22:44:47 2014  dhamon:that'll compile without running
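(dhamon's trick sketched end-to-end: an empty gtest filter matches no tests, so `make check` only compiles; once built, a single suspect test can be run from the test binary, as adam_mesos and dhamon describe above. The test name here is an example:)

```shell
# Build the test binary without executing any tests:
#   make check GTEST_FILTER=''

# Then run just one test and watch for the last
# "[ RUN      ] ..." line to identify a hanging test:
#   GTEST_FILTER='RegistrarTest.recover' build/bin/mesos-tests
```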
Wed Jul 2 22:53:02 2014  kiteless:ok, thanks
Wed Jul 2 22:55:26 2014  dhamon:we did have some flaky tests that have been fixed recently, and the symptom was that error displaying repeatedly. But they're unusual.
Wed Jul 2 22:56:17 2014  kiteless:The errors, the flaky tests, or reports of either?
Wed Jul 2 22:57:51 2014  dhamon:the failures are unusual
Wed Jul 2 22:57:58 2014  dhamon:(hence 'flaky')
Wed Jul 2 22:58:26 2014  dhamon:usually it is some subtle timing issue during the shutdown phase of a test. If you can figure out which test is the last one running it would help diagnose if it's a known issue.
Wed Jul 2 23:05:33 2014  kiteless:Wish I could. There’s a lot of log output
Wed Jul 2 23:06:44 2014  dhamon:it'll be the last line in the log starting with '[ RUN'
Wed Jul 2 23:06:50 2014  dhamon:so it should be fairly obvious
Wed Jul 2 23:07:44 2014  kiteless:which log?
Wed Jul 2 23:08:21 2014  dhamon:stdout
Wed Jul 2 23:08:58 2014  dhamon:maybe actually.. INFO