fix resource check in scanAll process #8

innomentats · 2017-10-30T07:35:52Z

fix the endless 'kill and restart' issue

CLAassistant · 2017-10-30T07:35:58Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

haohui

Looks good to me overall.

Can you please (1) add checks that enforce the minimum number of cores and the minimum amount of memory in the REST API, along with comments that describe the fields in the YAML file, and
(2) add a unit test validating the behavior?

innomentats · 2017-10-31T08:55:23Z

Sure, will do

innomentats · 2017-11-01T02:41:21Z

I ran into a problem: how do we define the minimum amount of memory? It seems that neither AthenaX nor Flink currently has such a property...

haohui · 2017-11-03T23:20:42Z

athenax-backend/src/main/java/com/uber/athenax/backend/server/jobs/JobWatcherUtil.java

@@ -150,7 +150,8 @@ static StateView computeState(Map<UUID, JobDefinition> jobs, Map<UUID, InstanceI

  static JobDefinitionDesiredstate computeActualState(InstanceInfo info) {
    JobDefinitionResource r = new JobDefinitionResource()
-        .memory(info.status().getAllocatedMB())
+        .memory(info.status().getAllocatedMB() / (info.status().getAllocatedVCores() != 0


Taking a closer look, I don't quite understand this. The memory should be the total amount of memory for the whole job, not the memory used by each task manager.

innomentats · 2017-11-04

https://github.com/uber/AthenaX/blob/9de54305f6dc198cbe176f64c8d6dddc6ab2e6ac/athenax-backend/src/main/java/com/uber/athenax/backend/server/yarn/JobConf.java#L48-L55

According to the definition of JobConf, the amount of memory specified by the Web API parameters appears to be the memory used by each task manager, not the memory consumption of the whole job. When I start a job with 2 cores and 2 GB of memory, AthenaX starts a job with two TaskManager containers, each with 1 core and 2 GB of memory, so the total amount of memory retrieved from YARN is 4 GB, which leads to the same continuous-restart issue.

Maybe this is not the right place to fix this issue?
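To make the arithmetic in this thread concrete, here is a minimal sketch of the mismatch, assuming one vcore per TaskManager container as in the 2-core, 2 GB example. The class and method names are hypothetical and are not part of AthenaX. The fallback used when YARN reports zero vcores is also an assumption, since the diff hunk above is truncated at that point.

```java
// Hypothetical illustration only; none of these names exist in AthenaX.
final class ResourceCheckSketch {

  // Total memory YARN would report if each TaskManager container
  // receives the full requested amount (the behavior described above).
  static long totalAllocatedMb(long requestedMbPerTaskManager, int taskManagers) {
    return requestedMbPerTaskManager * taskManagers;
  }

  // Per-container memory derived from YARN totals, assuming one vcore per
  // container. The zero-vcore fallback (return the total) is an assumption.
  static long perContainerMb(long allocatedMb, long allocatedVCores) {
    return allocatedVCores != 0 ? allocatedMb / allocatedVCores : allocatedMb;
  }

  // The watcher treats a job as healthy only when desired and actual match.
  static boolean matches(long desiredMb, long actualMb) {
    return desiredMb == actualMb;
  }

  public static void main(String[] args) {
    long desiredMb = 2048;   // memory requested through the Web API
    int taskManagers = 2;    // 2 cores requested, 1 core per container

    long allocatedMb = totalAllocatedMb(desiredMb, taskManagers);

    // Comparing against the YARN total never matches, so the job is
    // killed and restarted repeatedly.
    System.out.println(matches(desiredMb, allocatedMb));  // prints "false"

    // Dividing by the allocated vcores recovers the per-container figure.
    long perContainer = perContainerMb(allocatedMb, taskManagers);
    System.out.println(matches(desiredMb, perContainer)); // prints "true"
  }
}
```

Whether the division belongs in the state comparison or the memory field should instead be defined as a whole-job total is exactly the open question here; the sketch only shows why the two interpretations disagree.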

fix resource check in scanAll process

4e6e17e

haohui reviewed Oct 31, 2017


fix resource check for multi-vCore applications

68de14c

haohui reviewed Nov 3, 2017


innomentats Nov 4, 2017