Chronos Intermittent Issue: Jobs get stuck #897

Open
harjinder-flipkart opened this issue Feb 4, 2020 · 9 comments

Intermittent Chronos Issue:
On our Chronos cluster, we have been encountering an intermittent issue where Chronos jobs stop getting executed on Mesos. The sequence of observed events is as follows:

  • Chronos jobs are not executed by Mesos.
  • The status of jobs on the Chronos dashboard is ‘Queued’.
  • Mesos master logs show that
    -- the master has not been sending resource offers to the framework, i.e. Chronos.
    -- the master keeps getting status updates from slaves for old tasks.
    -- it keeps trying to forward these updates to Chronos.
    -- ZooKeeper and the slaves are not down; they are working fine.
  • After restarting Chronos and ZooKeeper, the system starts working again and Chronos jobs start getting executed.

Whys:

  • Why do Chronos jobs stop getting executed?
    Chronos, as a Mesos application (framework), waits for resource offers from the Mesos master.
    The master generally sends resource offers at a very high frequency, i.e. every 100 ms to a few seconds. However, in this case the master stopped sending resource offers, and without them Chronos is stuck.
  • Why did the Mesos master stop sending resource offers?
    The Mesos slaves were occupied with FINISHED tasks. The slaves kept telling the master that the tasks were FINISHED, and the master kept trying to tell the Chronos leader the same and waiting for an ACK. Chronos was not sending the ACK.
  • Why did Chronos not send the ACK?
    The "JobScheduler::handleFinishedTask" thread in the Chronos leader was waiting on a ReentrantLock which was held by the "JobScheduler::mainLoop" thread.
  • Why did the "JobScheduler::mainLoop" thread not release the lock?
    The mainLoop thread was trying to reload jobs from ZK and was blocked on the ZK read (see the sketch after this list).
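
To make the interaction concrete, here is a minimal sketch of the locking pattern described above. This is not the actual Chronos source; zkFetchAll and ackStatusUpdate are hypothetical stand-ins for the ZooKeeper-backed state store and the Mesos driver acknowledgement.

```scala
import java.util.concurrent.locks.ReentrantLock

object DeadlockSketch {
  private val lock = new ReentrantLock()

  // Corresponds to JobScheduler::mainLoop: the lock is held across a blocking
  // ZooKeeper read, so it is never released while ZK is slow or unresponsive.
  def mainLoop(): Unit = {
    lock.lock()
    try {
      val jobs = zkFetchAll() // blocks indefinitely if ZK never answers
      // ... reschedule jobs ...
    } finally {
      lock.unlock()
    }
  }

  // Corresponds to JobScheduler::handleFinishedTask, called from the Mesos
  // statusUpdate callback: it needs the same lock before the update can be
  // acknowledged, so the master never sees the ACK and stops sending offers.
  def handleFinishedTask(taskId: String): Unit = {
    lock.lock() // BLOCKED while mainLoop holds the lock
    try {
      // ... update job state ...
      ackStatusUpdate(taskId)
    } finally {
      lock.unlock()
    }
  }

  // Hypothetical stand-ins, only here to make the sketch self-contained.
  private def zkFetchAll(): Seq[String] = Seq.empty
  private def ackStatusUpdate(taskId: String): Unit = ()
}
```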

Software Versions:

  • Chronos 3.0.3
  • Mesos 1.4.0
  • ZooKeeper 3.4.5
@harjinder-flipkart (Author)

Based on a recent investigation, I have updated the problem description above.

Chronos team, can you please help us resolve this issue?

@harjinder-flipkart (Author)

I have kept the Chronos thread dump here.

Relevant threads look like this:
...
"Thread-264485" #264523 prio=5 os_prio=0 tid=0x00007fd9d4006800 nid=0x5fb9 waiting for monitor entry [0x00007fda1c9da000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.replaceJob(JobScheduler.scala:152) - waiting to lock <0x00000007042d73d0> (a java.util.concurrent.locks.ReentrantLock) at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.handleFinishedTask(JobScheduler.scala:244) at org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework.statusUpdate(MesosJobFramework.scala:210) at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at com.google.inject.internal.DelegatingInvocationHandler.invoke(DelegatingInvocationHandler.java:37) at com.sun.proxy.$Proxy30.statusUpdate(Unknown Source)
...

"pool-4-thread-1" #48 prio=5 os_prio=0 tid=0x00007fd9ac006000 nid=0x6140 runnable [0x00007fd97fffe000] java.lang.Thread.State: RUNNABLE at org.apache.mesos.state.AbstractState$FetchFuture.get(Native Method) at org.apache.mesos.state.AbstractState$FetchFuture.get(AbstractState.java:226) at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106) at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.loadJobs(JobUtils.scala:68) at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.liftedTree1$1(JobScheduler.scala:542) at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.mainLoop(JobScheduler.scala:540) - locked <0x00000007042d73d0> (a java.util.concurrent.locks.ReentrantLock) at org.apache.mesos.chronos.scheduler.jobs.JobScheduler$$anon$1.run(JobScheduler.scala:516) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)

@harjinder-flipkart (Author)

@brndnmtthws, can you please look into this issue?

@brndnmtthws (Member)

@harjinder-flipkart I haven't been involved with this project in years, so I'm not really in a position to help. Good luck with your debugging.

@janisz commented Feb 20, 2020

Can you send the Mesos state JSON?

@harjinder-flipkart (Author)

State JSON for mesos master is here: https://gist.github.com/harjinder-flipkart/58f1dfc8e077ee9a80f1b544cf87ff4c

@janisz commented Feb 20, 2020

I suspect Chronos is stuck with a single offer. Have you tried restarting it? It might be helpful to set offer_timeout on the Mesos master.
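
For example, the master flag would look something like this (a sketch; the 5-minute value is just an illustration, pick whatever fits your offer churn):

```
# mesos-master flag: rescind offers that are not used within the timeout
--offer_timeout=5mins
```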

@harjinder-flipkart (Author)

Thanks @janisz for your reply!

Yes, restarting Chronos and ZK brings the cluster back to a working state. Restarting Chronos/ZK is a workaround for the time being, but we are looking for a permanent solution and need your help :)

Also, I am not sure that Chronos was stuck with a single offer. The thread dump shows that the Chronos thread was trying to load jobs and was waiting on ZK:

...
"pool-4-thread-1" #48 prio=5 os_prio=0 tid=0x00007fd9ac006000 nid=0x6140 runnable [0x00007fd97fffe000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.mesos.state.AbstractState$FetchFuture.get(Native Method)
	at org.apache.mesos.state.AbstractState$FetchFuture.get(AbstractState.java:226)
	at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
	at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.loadJobs(JobUtils.scala:68)
...

@harjinder-flipkart (Author)

@janisz, any pointers for this?
