PHYSLITE schema and inconsistent amounts of data being read for the same task #1073

Open
alexander-held opened this issue Apr 10, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@alexander-held
Contributor

alexander-held commented Apr 10, 2024

Describe the bug

I am trying to track how much data exactly is getting read when reading PHYSLITE files. I am observing that this differs with the schema being used. In particular, I am observing for a test file (full reproducer below):

  • 4.4 MB read through coffea with PHYSLITESchema
  • 1.9 MB read with simple uproot.open (no schema)
  • 1.9 MB read with simple uproot.dask (no schema)
  • 3.1 MB read through coffea with BaseSchema

In all cases I request the same branch. Why does the report change with the schemas? Which extra information is being read, and why is that information needed?

I am also looking at the results of dak.report_necessary_columns, which only shows the specific branch I want to read and nothing else that might explain the discrepancy.

I am happy to test more things but am somewhat stuck trying to understand the behavior. I also tested a similar setup on a CMS Open Data NanoAOD file with the corresponding NanoAOD schema and do not observe a similar kind of discrepancy there.

cc @nikoladze as expert on the schema

To Reproduce
full reproducer is at https://gist.github.com/alexander-held/8af116d93e936c5930648f1dea4fb02b (includes optional download for ~200 MB input file)

Expected behavior
I expect the same amount of data to be read in all configurations.

Output
see gist

Desktop (please complete the following information):

awkward: 2.6.2
dask-awkward: 2024.3.0
uproot: 5.3.2
coffea: 2024.3.0

Additional context
n/a

@nikoladze
Contributor

What I suspect: in PHYSLITESchema, AnalysisJetsAuxDyn.pt might be read three times: once for the offsets, once for the content, and once to produce the _eventindex field that is attached to every collection so that global indices for ElementLinks can be calculated dynamically (the global index is the event index + local index).

The BaseSchema might still read the branch twice, one time for offsets, one time for content.
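To make the suspected reads concrete, here is a minimal pure-Python sketch of how offsets, content, and a per-element event index can each be derived from one jagged branch. This is not coffea's implementation; the toy `jets_pt` values and all variable names are assumptions made for illustration.

```python
from itertools import accumulate

# Hypothetical jagged branch: jet pt values for three events (toy data).
jets_pt = [[50.0, 30.0], [], [80.0, 45.0, 20.0]]

# Number of jets in each event.
counts = [len(event) for event in jets_pt]

# 1) Offsets: start/stop positions of each event in the flattened content.
offsets = [0, *accumulate(counts)]

# 2) Content: all jet pt values flattened into a single list.
content = [pt for event in jets_pt for pt in event]

# 3) Event index: for every jet, the index of the event it belongs to.
eventindex = [i for i, n in enumerate(counts) for _ in range(n)]

print(offsets)     # [0, 2, 2, 5]
print(content)     # [50.0, 30.0, 80.0, 45.0, 20.0]
print(eventindex)  # [0, 0, 2, 2, 2]
```

If each of the three arrays triggers its own read of the underlying branch, the same bytes would be fetched three times for a PHYSLITE-style layout, and twice when only offsets and content are needed.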

I'm not quite sure how everything is wired up in dask mode now. I think in the past the various caches prevented a branch from being read multiple times, but I'm not sure whether that still holds.

To check this suspicion, I ran your code through the debugger and inspected the base_form and the rearranged form for PHYSLITE (output) in these lines of code:

def __init__(self, base_form, *args, **kwargs):
    super().__init__(base_form)
    form_dict = {
        key: form for key, form in zip(self._form["fields"], self._form["contents"])
    }
    output = self._build_collections(form_dict)

They look like the following:

ipdb>  base_form
{'class': 'RecordArray', 'contents': [{'class': 'ListOffsetArray', 'offsets': 'i64', 'content': {'class': 'NumpyArray', 'primitive': 'float32', 'inner_shape': [], 'parameters': {'__doc__': 'AnalysisJetsAuxDyn.pt'}, 'form_key': 'AnalysisJetsAuxDyn.pt%2C%21load%2C%21content'}, 'parameters': {'__doc__': 'AnalysisJetsAuxDyn.pt'}, 'form_key': 'AnalysisJetsAuxDyn.pt%2C%21load'}], 'fields': ['AnalysisJetsAuxDyn.pt'], 'parameters': {'__doc__': 'CollectionTree', 'metadata': {'dataset': 'ttbar'}}, 'form_key': None}

ipdb>  output
{'Jets': {'class': 'ListOffsetArray', 'offsets': 'i64', 'content': {'class': 'RecordArray', 'fields': ['pt', '_eventindex'], 'contents': [{'class': 'NumpyArray', 'primitive': 'float32', 'inner_shape': [], 'parameters': {'__doc__': 'AnalysisJetsAuxDyn.pt'}, 'form_key': 'AnalysisJetsAuxDyn.pt%2C%21load%2C%21content'}, {'class': 'NumpyArray', 'parameters': {}, 'form_key': 'AnalysisJetsAuxDyn.pt%2C%21load%2C%21eventindex%2C%21content', 'itemsize': 8, 'primitive': 'int64'}], 'form_key': '%21invalid%2CJets', 'parameters': {'__record__': 'Particle', 'collection_name': 'Jets'}}, 'form_key': 'AnalysisJetsAuxDyn.pt%2C%21load'}}

where one can see the column AnalysisJetsAuxDyn.pt occurring twice in the form_key entries of base_form:

  • AnalysisJetsAuxDyn.pt%2C%21load%2C%21content for the content
  • AnalysisJetsAuxDyn.pt%2C%21load for the offsets (not quite sure anymore why there is no !offsets here)

The same two also occur in the transformed form (output - this is the actual PHYSLITE schema) and additionally there is

  • AnalysisJetsAuxDyn.pt%2C%21load%2C%21eventindex%2C%21content for creating the eventindex
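The form keys listed above can also be collected programmatically rather than by eye. The following is a minimal sketch that walks a trimmed copy of the `output` form from the debugger session; the helper `collect_form_keys` is a hypothetical name, not part of coffea or awkward.

```python
BRANCH = "AnalysisJetsAuxDyn.pt"

# Trimmed copy of the `output` form shown in the debugger session
# (only the keys needed for the traversal are kept).
output = {
    "Jets": {
        "class": "ListOffsetArray",
        "content": {
            "class": "RecordArray",
            "fields": ["pt", "_eventindex"],
            "contents": [
                {
                    "class": "NumpyArray",
                    "form_key": "AnalysisJetsAuxDyn.pt%2C%21load%2C%21content",
                },
                {
                    "class": "NumpyArray",
                    "form_key": "AnalysisJetsAuxDyn.pt%2C%21load%2C%21eventindex%2C%21content",
                },
            ],
            "form_key": "%21invalid%2CJets",
        },
        "form_key": "AnalysisJetsAuxDyn.pt%2C%21load",
    }
}


def collect_form_keys(form):
    """Recursively gather every non-None 'form_key' in a nested form (hypothetical helper)."""
    keys = []
    if isinstance(form, dict):
        key = form.get("form_key")
        if key is not None:
            keys.append(key)
        for value in form.values():
            keys.extend(collect_form_keys(value))
    elif isinstance(form, list):
        for item in form:
            keys.extend(collect_form_keys(item))
    return keys


all_keys = collect_form_keys(output)
# Keep only the keys that refer to the branch under study.
branch_keys = [key for key in all_keys if key.startswith(BRANCH)]
print(len(all_keys), len(branch_keys))
```

Counting form keys this way only shows how often the branch is referenced in the form. It does not by itself prove how many times the bytes are fetched from the file.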

These additional instructions (!load, !content, !eventindex, with ! URL-encoded as %21) are a coffea-specific mini-language, with transforms defined in src/coffea/nanoevents/transforms.py.
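For readability, the encoded form keys can be turned back into their comma-separated instructions with the standard library. This sketch only decodes the strings; how coffea itself parses and executes them is not shown in this thread, and `decode_form_key` is a hypothetical helper name.

```python
from urllib.parse import unquote

# The three form keys that reference the branch, copied from the forms above.
form_keys = [
    "AnalysisJetsAuxDyn.pt%2C%21load",
    "AnalysisJetsAuxDyn.pt%2C%21load%2C%21content",
    "AnalysisJetsAuxDyn.pt%2C%21load%2C%21eventindex%2C%21content",
]


def decode_form_key(form_key):
    """Split a URL-encoded form key into (branch name, list of instructions)."""
    # %2C decodes to "," and %21 decodes to "!".
    parts = unquote(form_key).split(",")
    return parts[0], parts[1:]


for form_key in form_keys:
    branch, instructions = decode_form_key(form_key)
    print(branch, instructions)
```

All three keys decode to the same branch name, AnalysisJetsAuxDyn.pt, followed by a different chain of instructions.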

@gordonwatts
Contributor

@nikoladze - ok, if I understand this, this is really the same root cause as #1074. Is that right? I'm asking because that one seems tricky to solve - so it won't show up for a while.
