Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

failed processing entk session #108

Open
lee212 opened this issue Jan 12, 2020 · 25 comments
Open

failed processing entk session #108

lee212 opened this issue Jan 12, 2020 · 25 comments

Comments

@lee212
Copy link
Contributor

lee212 commented Jan 12, 2020

This might or might not be related to py3 transition with entk but the session data to plot analytics is getting errors.

(analytics) [hrlee@login1.summit radical.analytics]$ bin/radical-analytics-inspect re.session.login1.hrlee.018273.0000
re.session.login1.hrlee.018273.0000 Traceback (most recent call last):
  File "bin/rp_inspect/plot_state.py", line 77, in <module>
    in thing.description.get('name', '').split()]
  File "bin/rp_inspect/plot_state.py", line 76, in <listcomp>
    pipe[thing.uid] = [int(x) for x
ValueError: invalid literal for int() with base 10: 'task.0000,None,stage.0000,MD,pipeline.0000,MD_ML'
...Traceback (most recent call last):
  File "bin/rp_inspect/plot_util.py", line 118, in <module>
    provided, consumed, stats_abs, stats_rel, info = session.utilization(metrics)
  File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/analytics/session.py", line 961, in utilization
    provided  = rp.utils.get_provided_resources(self)
  File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/pilot/utils/prof_utils.py", line 479, in get_provided_resources
    data = _get_pilot_provision(session, p)
  File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/pilot/utils/prof_utils.py", line 437, in _get_pilot_provision
    cpn   = pilot.cfg['resource_details']['rm_info']['cores_per_node']
TypeError: 'NoneType' object is not subscriptable
 done
@lee212 lee212 self-assigned this Jan 12, 2020
@andre-merzky
Copy link
Member

The utilization plots were only tested with plain RP sessions. In the error above, the code seems to stumble over non-RP entities (such as RE tasks), whose UID cannot be correctly parsed. I think that RA's `bin/rp_inspect/plot_util.py should, after creation of the session, filter for all RP entities (and thus filter out all other entities (around line 107)

@lee212
Copy link
Contributor Author

lee212 commented Jan 12, 2020

Thanks Andre for the comment. I find <session>_cur.png, <session>_rate.png and <session>_conc.png were generated as an empty though.

I interpret the filtering is like this (with radical.pilot parameter in the plot_util.py):

(analytics) [hrlee@login2.summit radical.analytics]$ bin/rp_inspect/plot_util.py re.session.login1.hrlee.018273.0000 radical.pilot
Traceback (most recent call last):
  File "bin/rp_inspect/plot_util.py", line 118, in <module>
    provided, consumed, stats_abs, stats_rel, info = session.utilization(metrics)
  File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/analytics/session.py", line 961, in utilization
    provided  = rp.utils.get_provided_resources(self)
  File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/pilot/utils/prof_utils.py", line 479, in get_provided_resources
    data = _get_pilot_provision(session, p)
  File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/pilot/utils/prof_utils.py", line 437, in _get_pilot_provision
    cpn   = pilot.cfg['resource_details']['rm_info']['cores_per_node']
TypeError: 'NoneType' object is not subscriptable

I will try to replicate the same steps with rp only for plotting purpose but I have a question. Does gpu resource affect analytics? I believe it wouldn't but core_per_node error makes me think gpus_per_node might create another issue.

@andre-merzky
Copy link
Member

andre-merzky commented Jan 12, 2020

I see. Hyungro, can you provide me the session so that I can try to reproduce?

@lee212
Copy link
Contributor Author

lee212 commented Jan 12, 2020

@lee212
Copy link
Contributor Author

lee212 commented Jan 12, 2020

re.session.login1.hrlee.018273.0000.tar.gz
pilot profiles added

@lee212
Copy link
Contributor Author

lee212 commented Jan 12, 2020

@andre-merzky
Copy link
Member

This problem should be gone when using radical-cybertools/radical.pilot/pull/2032. @lee212 implied on slack that this is not the case: could you please send the stack you are using? This stumbles still over the resource_details query for you, right?

@lee212
Copy link
Contributor Author

lee212 commented Jan 22, 2020

ValueError in the original post can be easily fixed by int() cast condition but regarding to the failing entk session, the error is still there:

$ bin/radical-analytics-inspect re.session.login5.hrlee.018282.0005
re.session.login5.hrlee.018282.0005 ....Traceback (most recent call last):
  File "/autofs/nccs-svm1_home1/hrlee/git/radical.analytics/bin/rp_inspect/plot_util.py", line 116, in <module>
    prov, cons, stats_abs, stats_rel, info = session.utilization(metrics)
  File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/analytics/session.py", line 962, in utilization
    consumed  = rp.utils.get_consumed_resources(self)
  File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/pilot/utils/prof_utils.py", line 563, in get_consumed_resources
    p_max = pt(event=PILOT_DURATIONS['consume']['ignore'][1])[-1]
IndexError: list index out of range

My test runs show that entk is missing for cmd, for example:

>>> pilot_rp[0].events
[[15.303919553756714, 'state', 'pmgr.0000', 'MainThread', 'pilot.0000', 'NEW', '', 'pilot'], 
[15.421573162078857, 'state', 'pmgr.0000', 'MainThread', 'pilot.0000', 'PMGR_LAUNCHING_PENDING', '', 'pilot'], [15.422030925750732, 'put', 'pmgr.0000', 'MainThread', 'pilot.0000', 
'PMGR_LAUNCHING_PENDING', 'pmgr_launching_queue.put.0000', 'pilot'], 
[15.423201084136963, 'get', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', 'PMGR_LAUNCHING_PENDING', '', 'pilot'], 
[15.486279726028442, 'state', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', 'PMGR_LAUNCHING', '', 'pilot'], 
[16.787227392196655, 'staging_in_start', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'], 
[21.319350242614746, 'staging_in_stop', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'], 
[21.53650426864624, 'submission_start', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'], 
[24.297909021377563, 'submission_stop', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'], 
[24.297931909561157, 'state', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'], 
[25.630955934524536, 'bootstrap_0_start', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'], 
[25.654655933380127, 've_setup_start', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'], 
[27.444655895233154, 've_activate_start', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'], 
[28.221256017684937, 've_activate_stop', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'], 
[28.229156017303467, 'rp_install_start', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'], 
[43.8096559047699, 'rp_install_stop', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'], 
[43.81755590438843, 've_setup_stop', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'], 
[43.825356006622314, 've_activate_start', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'], 
[49.64625597000122, 'sync_rel', 'agent.0', 'MainThread', 'pilot.0000', '', 'agent.0', 'pilot'], 
[49.64625597000122, 'sync_rel', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', 'agent.0', 'pilot'], [49.64628076553345, 'hostname', 'agent.0', 'MainThread', 'pilot.0000', '', 'login5.summit.olcf.ornl.gov', 'pilot'],
[72.51676154136658, 'cmd', 'agent.0', 'Thread-7', 'pilot.0000', '', "pilot_heartbeat : {'pmgr': 'pmgr.0000'}", 'pilot'], 
[72.51686191558838, 'cmd', 'agent.0', 'Thread-7', 'pilot.0000', '', "pilot_heartbeat : {'pmgr': 'pmgr.0000'}", 'pilot'], 
[72.51692461967468, 'cmd', 'agent.0', 'Thread-7', 'pilot.0000', '', "pilot_heartbeat : {'pmgr': 'pmgr.0000'}", 'pilot'], 
[72.51698350906372, 'cmd', 'agent.0', 'Thread-7', 'pilot.0000', '', "pilot_heartbeat : {'pmgr': 'pmgr.0000'}", 'pilot'], 
[72.64947366714478, 'state', 'agent.0', 'Thread-5', 'pilot.0000', 'PMGR_ACTIVE', '', 'pilot'], [73.33910584449768, 'get', 'agent.0', 'Thread-8', 'pilot.0000', '', 'bulk: 128', 'pilot'], 
[78.87498164176941, 'cmd', 'agent.0', 'Thread-7', 'pilot.0000', '', "pilot_heartbeat : {'pmgr': 'pmgr.0000'}", 'pilot'], 
[83.1139395236969, 'cmd', 'agent.0', 'Thread-7', 'pilot.0000', '', 'cancel_pilot : []', 'pilot'], [263.38332962989807, 'state', 'pmgr_launching.0000', 'Thread-4', 'pilot.0000', 'CANCELED', '', 'pilot']]

>>> pilot_entk[0].events
[[18.046828508377075, 'state', 'pmgr.0000', 'MainThread', 'pilot.0000', 'NEW', '', 'pilot'], 
[18.164909601211548, 'state', 'pmgr.0000', 'MainThread', 'pilot.0000', 'PMGR_LAUNCHING_PENDING', '', 'pilot'], 
[18.165395736694336, 'put', 'pmgr.0000', 'MainThread', 'pilot.0000', 'PMGR_LAUNCHING_PENDING', 'pmgr_launching_queue.put.0000', 'pilot'], 
[18.175760507583618, 'get', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', 'PMGR_LAUNCHING_PENDING', '', 'pilot'],
 [18.175876140594482, 'state', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', 'PMGR_LAUNCHING', '', 'pilot'], 
[19.4575514793396, 'staging_in_start', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'], 
[24.4504554271698, 'staging_in_stop', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'], 
[24.450796842575073, 'submission_start', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'], 
[27.079242706298828, 'submission_stop', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'], 
[27.07926607131958, 'state', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'],
 [66.70450067520142, 'state', 'pmgr.0000', 'Thread-9', 'pilot.0000', 'PMGR_ACTIVE', '', 'pilot'], 
[267.71360754966736, 'state', 'pmgr_launching.0000', 'Thread-3', 'pilot.0000', 'CANCELED', '', 'pilot']]

@lee212
Copy link
Contributor Author

lee212 commented Jan 22, 2020

For the trace purpose, the pt() tried to collect PILOT_DURATIONS['consume']['ignore'][1] which is cmd (see code here):
https://github.com/radical-cybertools/radical.pilot/blob/fix/ra_resource_details/src/radical/pilot/utils/prof_utils.py#L563

but the values look like:

>>> pt_rp(event=PILOT_DURATIONS['consume']['ignore'][1])
[72.51676154136658, 72.51686191558838, 72.51692461967468, 72.51698350906372, 78.87498164176941, 83.1139395236969]
>>> pt_rp(event=PILOT_DURATIONS['consume']['ignore'][1])[-1]
83.1139395236969

this looks correct, which is radical.pilot, but when I see radical.entk:

>>> pt_entk(event=PILOT_DURATIONS['consume']['ignore'][1])
[]

@lee212
Copy link
Contributor Author

lee212 commented Jan 24, 2020

Why I don't see any of the events in the entk profiling? For example, rp_inspect/plot_state.py filters 7 events e.g. AGENT_STAGING_INPUT but these 'AGENT' events are not in the entk event list. What it has

states: ['NEW', 'UMGR_SCHEDULING_PENDING', 'UMGR_SCHEDULING', 
'UMGR_STAGING_INPUT_PENDING', 'UMGR_STAGING_INPUT', 'AGENT_STAGING_INPUT_PENDING', 
'UMGR_STAGING_OUTPUT', 'DONE']

whereas rp shows:

states: ['NEW', 'UMGR_SCHEDULING_PENDING', 'UMGR_SCHEDULING', 
'UMGR_STAGING_INPUT_PENDING', 'UMGR_STAGING_INPUT', 'AGENT_STAGING_INPUT_PENDING', 
'AGENT_STAGING_INPUT', 'AGENT_SCHEDULING_PENDING', 'AGENT_SCHEDULING', 
'AGENT_EXECUTING_PENDING', 'AGENT_EXECUTING', 'AGENT_STAGING_OUTPUT_PENDING', 
'AGENT_STAGING_OUTPUT', 'UMGR_STAGING_OUTPUT_PENDING', 'UMGR_STAGING_OUTPUT', 
'DONE']

As long as RADICAL_PILOT_PROFILE=TRUE is set, session.close(download=True) will be called. And the events by profiling are the same on entk like rp, and additional profiling for wfprocessor, wmgr, and tmgr are included in the json.

Maybe this is related to _pilot.cancel()? if profiling data is recorded before termination?

@mturilli
Copy link
Contributor

I am not sure I understand well the issue, apologies. AGENT* events are not in EnTK as it does not have an agent and, more in general, I do not think EnTK has an event model. The states you listed for EnTK seem correct to me and I do not think you can get any other state from EnTK because those are all the states it has. When you set profiling to true for both RP and EnTK, you should get EnTK states and RP states/events. Is this not happening?

@lee212
Copy link
Contributor Author

lee212 commented Jan 27, 2020

analytics try to filter these states and events for plots, at least for rp. If entk does not have, analytics need to use different logics or look for other states/events that entk provides. I assume that (from your comment "you should get EnTK states ..."), my session provides possible EnTK states according to this: https://github.com/radical-cybertools/radical.entk/blob/devel/src/radical/entk/states.py.
However, I don't see them although RADICAL_ENTK_PROFILE=True is set. Let me dig into this with that.

@andre-merzky
Copy link
Member

Why I don't see any of the events in the entk profiling? For example, rp_inspect/plot_state.py filters 7 events e.g. AGENT_STAGING_INPUT but these 'AGENT' events are not in the entk event list. What it has [...]

@lee212 : can you please send me a small reproducer so that I can see how you produced the event lists?

@andre-merzky andre-merzky self-assigned this Jan 27, 2020
@lee212
Copy link
Contributor Author

lee212 commented Jan 27, 2020

import radical.analytics as ra
session = ra.Session.create("re.session.login1.hrlee.018285.0000", "radical.entk")
pilot_entk = session.get(etype="pilot")
pilot_entk[0].events

@andre-merzky
Copy link
Member

That is not one of the sessions attached here I think? Do you mind adding it to the ticket? Thanks!

@andre-merzky
Copy link
Member

states like AGENT_EXECUTING are no pilot states, so I am not sure where that comes from. I don't see that in other sessions from this ticket. For example, when I run this code:

#!/usr/bin/env python3

import radical.analytics as ra

sid     = 're.session.login1.hrlee.018273.0000'
session = ra.Session.create(sid, 'radical.entk')
pilots  = session.get(etype='pilot')

for e in pilots[0].events:
    print('event: ', e[1])

for e in pilots[0].events:
    print('state: ', e[5])

with the session re.session.login1.hrlee.018273.0000 you attached above (including the agent profiles from the separate tarball), I get this output:

$ ./t.py  | sort | uniq -c
      1 event:  bootstrap_0_start
    484 event:  cmd
      5 event:  get
      1 event:  hostname
      1 event:  put
      1 event:  rp_install_start
      1 event:  rp_install_stop
      1 event:  staging_in_start
      1 event:  staging_in_stop
      6 event:  state
      1 event:  submission_start
      1 event:  submission_stop
      2 event:  sync_rel
      2 event:  ve_activate_start
      1 event:  ve_activate_stop
      1 event:  ve_setup_start
      1 event:  ve_setup_stop
    494 state:
      1 state:  CANCELED
      1 state:  NEW
      1 state:  PMGR_ACTIVE
     10 state:  PMGR_ACTIVE_PENDING
      1 state:  PMGR_LAUNCHING
      3 state:  PMGR_LAUNCHING_PENDING

which seems like the expected set of states and events for a pilot. I guess I would need to look at your specific session to understand what's happening...

@lee212
Copy link
Contributor Author

lee212 commented Jan 28, 2020

This is interesting, same script I ran and the result looks like:

(analytics) [hrlee@login4.summit radical.analytics]$ python t.py|sort|uniq -c
      1 event:  get
      1 event:  put
      1 event:  staging_in_start
      1 event:  staging_in_stop
      6 event:  state
      1 event:  submission_start
      1 event:  submission_stop
      4 state:
      1 state:  CANCELED
      1 state:  NEW
      1 state:  PMGR_ACTIVE
      1 state:  PMGR_ACTIVE_PENDING
      1 state:  PMGR_LAUNCHING
      3 state:  PMGR_LAUNCHING_PENDING

@lee212
Copy link
Contributor Author

lee212 commented Jan 28, 2020

Any of the plots seems valid by radical-pilot-inspect command but I was looking at utilization first which runs bin/rp_inspect/plot_util.py. Specifically, one error is generated when pilot.cfg['resource_details']['rm_info']['cores_per_node'] is None and the other is when pt(event=PILOT_DURATIONS['consume']['ignore'][1])[-1] is None, if you look at the early comments in this ticket. These don't explain much but both errors happened in the ra.session.utilization method which starts with stype == radical.pilot assert. It might be okay if entk contains same states and events to calculate durations of cmds and others, otherwise, this pilot only has to be removed to call a utilization method from entk. If that is the case, the profiling methods of entk need some work (looking at radical.entk/utils/prof_utils.py) because necessary methods do not exist on entk, such as get_provided_resources, get_consumed_resources or can it be duplicate from RP?
RP seeks duration of tasks between PMGR_ACTIVE and termination but does entk have same state and event ?
e.g. {ru.STATE: rps.PMGR_ACTIVE }, {ru.EVENT: 'cmd' }]

@andre-merzky
Copy link
Member

Are you really sure that the session dir contains the agent profiles under 'pilot.0000/'? If so, can you please attach an tarball of the dir again? Thanks.

@lee212
Copy link
Contributor Author

lee212 commented Jan 28, 2020

No, I don't think I had it, so I guess *.prof and *.json are not all for profiling. Let me try again.

@lee212
Copy link
Contributor Author

lee212 commented Feb 17, 2020

Two updates:

  • empty plots (conc, rate, state):
    radical-pilot-fetch-db <session uid> downloads profiles in the session dir but the pilot.0000/ was nested under the uid again e.g. re.session.login4.hrlee.018309.0001/re.session.login4.hrlee.018309.0001/pilot.0000. After manually moving 1 level up, analytics were able to read *.prof. This change resolves empty plots except utilization.

  • failed to generate resource utilization due to missing resource_details
    *.json indeed does not contain resource_details, it shows None although it would be like:

                     {'lm_detail': None, 
                      'lm_info': {'FORK': {'version': '0.42', 'version_detail': 'There is no spoon'}},
                      'rm_info': {'agent_nodes': {}, 'cores_per_node': 8, 'gpus_per_node': 1, 
                                        'lfs_per_node': {'path': '/tmp', 'size': 1024}, 
                      'lm_info': {'version_info': {'FORK': {'version': '0.42', 'version_detail': 'There is no spoon'}}}, 
                      'mem_per_node': 0, 'name': 'Fork',
                      'node_list': [['localhost', 'localhost_0']]}}

I keep searching where this difference is created.

@lee212
Copy link
Contributor Author

lee212 commented Feb 17, 2020

Can this be the one removing resource_details? https://github.com/radical-cybertools/radical.pilot/blob/690b5486126850740e21a801da6b4844f95bb8ee/src/radical/pilot/pmgr/launching/default.py#L498

Because I find resource_details exist at agent.0.log

@mturilli
Copy link
Contributor

@lee212 , @andre-merzky can this be closed now? Did we find a solution for resource_details?

@andre-merzky
Copy link
Member

resource_details may still be missing if the pilot terminates early. I am not sure if we still have an issue otherwise. Either way, since parts of that code have changed in RP, we should at least try to reconfirm that this still fails. There is no doubt that this should be working out of the box (for complete sessions).

@lee212
Copy link
Contributor Author

lee212 commented Jun 24, 2020

The TypeError message was seen in the recent sessions with the latest stacks, and yes the resource_details is missing with the early termination. There is another case that wasn't terminated early but broken i.e., re.session.login5.hrlee.018425.0020 and so I can't tell whether this is fixed or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants