
feat(wip): edm4hep development #822

Open
wants to merge 74 commits into base: master

Conversation

jbrewster7 commented May 23, 2023

A schema for edm4hep files in nanoevents.

It's very similar to Delphes / Treemaker FWIW.

  • cover all four-vector types in edm4hep (done for PFOs, genparticles, what else?)
  • association maps between gen particles and reconstruction objects (clusters, PFOs)
  • association maps between PFOs and clusters / tracks

@jbrewster7 we should keep a list here of the things we need to do until we think we've found a good feature set!

@lgray lgray changed the title edm4hep development feat: edm4hep development May 23, 2023
@lgray lgray mentioned this pull request May 23, 2023
@lgray lgray changed the title feat: edm4hep development feat(wip): edm4hep development May 23, 2023
lgray (Collaborator) commented May 23, 2023

@jbrewster7 don't worry about the failures on arm, they're intermittent (probably depending on the host machine the CI job lands on)

Comment on lines 33 to 36
- Extended quantities of physics objects are stored in the format
<Object>_<variable>, such as "Jets_jecFactor". Such variables will be
merged into the collection <Object>, so the branch "Jets_jecFactor" will be
accessed in the array format as "Jets.jecFactor". An exception to the

tmadlener commented

In "podio speak" these are VectorMembers and they are used in several of the datatypes of edm4hep.

Note that there are also OneToOneRelations and OneToManyRelations which are currently still stored using the convention <collection-name>#<index>. These are the branches you would need to use to resolve e.g. mother/daughter relations for MCParticles. The <index> is currently somewhat "magic" but this is going to change with AIDASoft/podio#405, where the new schema of the branch names will be _<collection-name>_<name> for both VectorMembers and the relations. Let me know if you want to wait for that or whether you want some information about the magic.

I am not sure how general you want to make this in the end, but we could potentially help with some code generation on the podio side if you think that would be in any way useful.

lgray (Collaborator) commented May 24, 2023


@tmadlener Thanks for the info. Sorry, the documentation is for another data format type - I'll clean it up next week. In any case I can always find the constituents of a given physics object and wrap them together. It was pretty clear what to do in edm4hep!

As far as code generation goes - once we work it out here it could be nice to move it properly to podio (though that's not a python package but rather a generator for one)... We could envision some CI to put something on PyPI, though that may not be well advised, since what's being generated here is the schema as opposed to the implementation (which in our case is handled by more general functions + an embedded domain-specific language that does know how to handle cross references if you give it the source/mapping/target). It may be best that it stays here; it only needs to be done once anyway since it gets distributed with coffea. We can work it out in time.

I think in the meantime, if you could help us with the "magic" cross references so we know what we're supposed to be referencing, that would be really useful! We can add in the better-named xref branches when they're there and files with them are available.

tmadlener commented

I agree on the code generation. Let's see how this works out and how frequent changes would need to be.

So the "magic" is effectively that the indices get generated sequentially in the order the relations appear in the yaml definition file, starting with OneToManyRelations and then continuing with OneToOneRelations. So as an example the edm4hep::ReconstructedParticle has

    OneToOneRelations:
      - edm4hep::Vertex          startVertex    //start vertex associated to this particle
      - edm4hep::ParticleID      particleIDUsed //particle Id used for the kinematics of this particle
    OneToManyRelations:
      - edm4hep::Cluster               clusters     //clusters that have been used for this particle.
      - edm4hep::Track                 tracks       //tracks that have been used for this particle.
      - edm4hep::ReconstructedParticle particles    //reconstructed particles that have been combined to this particle.
      - edm4hep::ParticleID            particleIDs  //particle Ids (not sorted by their likelihood)

and on file the branches will be (assuming the collection name is "reco")

  • reco#0 for the related Clusters
  • reco#1 for the related Tracks
  • ...
  • reco#4 for the (single) startVertex
  • reco#5 for the (single) particleIDUsed
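
A tiny sketch of that rule (the relation names for edm4hep::ReconstructedParticle are written out by hand here rather than parsed from the yaml; "reco" is just the example collection name from above):

# relations in yaml order: OneToManyRelations first, then OneToOneRelations
one_to_many = ["clusters", "tracks", "particles", "particleIDs"]
one_to_one = ["startVertex", "particleIDUsed"]

collection = "reco"
# reco#0 ... reco#5 map onto the relations in this order
index_to_relation = {
    f"{collection}#{i}": name
    for i, name in enumerate(one_to_many + one_to_one)
}
# {'reco#0': 'clusters', ..., 'reco#4': 'startVertex', 'reco#5': 'particleIDUsed'}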

lgray (Collaborator) commented

Indeed, that's a little bit hacky to keep working with, since there's no clear way to keep track of the schema over time.

The next version sounds much more tractable in the long run!

lgray (Collaborator) commented May 30, 2023

@jbrewster7 how are things going w.r.t. cross references?

jbrewster7 (Author) commented
@lgray Hi! The cross references seem to all be working well, and I'm trying to set them up in a way that makes them easy enough to change when they go from magic numbers to names.

tmadlener commented
Hi @jbrewster7, if you want to make this transparent, it should be enough to use some information that is available from the podio_metadata Tree in the XXX___idTable branch(es). This contains a podio::CollectionIDTable for which we generate a dictionary. Internally it is effectively two vectors, one containing the (numerical) IDs and the other the (string) names. I think it should be possible to get to the m_collectionIDs and m_names leaves in the branches directly. With those

name_to_id = {n: i for (n,i) in zip(names, collectionIDs)}

should effectively give you a map of names to IDs. This will also hold after we have switched to something more robust than we currently have.

The XXX above is most likely always events for coffea, but podio can in principle also write other categories. I am not entirely sure if it makes sense at the moment to make this more general than it needs to be.
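
For reference, a minimal uproot-based sketch of that suggestion (the file name is a placeholder and the events category is assumed):

import uproot

with uproot.open("example.edm4hep.root") as f:
    table = f["podio_metadata/events___idTable"].array()[0]

# map the (string) collection names to the (numerical) collection IDs
name_to_id = {
    n: i
    for n, i in zip(table["m_names"].tolist(), table["m_collectionIDs"].tolist())
}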

lgray (Collaborator) commented May 30, 2023

@tmadlener I'll poke around with that table - I remember uproot complaining about it. If it won't read, I'll make a PR to uproot to get it working.

lgray (Collaborator) commented May 30, 2023

@jbrewster7 Hooray! That's awesome!

lgray (Collaborator) commented May 30, 2023

@tmadlener ah, that was something else, the idTable is easily accessible:

>>> x["podio_metadata"].keys()
['events___idTable', 'events___idTable/m_collectionIDs', 'events___idTable/m_names', 'events___CollectionTypeInfo', 'events___CollectionTypeInfo/events___CollectionTypeInfo._3', 'events___CollectionTypeInfo/events___CollectionTypeInfo._2', 'events___CollectionTypeInfo/events___CollectionTypeInfo._1', 'events___CollectionTypeInfo/events___CollectionTypeInfo._0', 'PodioBuildVersion', 'PodioBuildVersion/major', 'PodioBuildVersion/minor', 'PodioBuildVersion/patch', 'EDMDefinitions', 'EDMDefinitions/EDMDefinitions._1', 'EDMDefinitions/EDMDefinitions._0']
>>> x["podio_metadata"]["events___idTable"].array()[0].show()
{m_collectionIDs: [1, 2, 3, 4, 5, 6, 7, 8, ..., 40, 41, 42, 43, 44, 45, 46, 47],
 m_names: ['AllCaloHitContributionsCombined', ..., 'RecoMCTruthLink']}

A bit annoying we can't just do this with the awkward form alone though. I'll need to add in a hook to nanoevents to let the schema get organizational data that's available in the file!

@nsmith- thoughts?

lgray (Collaborator) commented May 30, 2023

@jeyserma FYI

@@ -746,6 +748,251 @@ def nearest(
return out


@awkward.mixin_class(behavior)
class LorentzVectorM(ThreeVector):
nsmith- (Member) commented

You should double-check, but I think all that's needed is to subclass LorentzVector and implement two overrides:

@property
def t(self):
    return numpy.sqrt(self["mass"]*self["mass"] + self.rho2)

@property
def mass(self):
    return self["mass"]

Basically the magic is that __getitem__ always gets what's actually stored in the awkward array, while __getattr__ is for derived or almost-derived quantities. That way all the other methods should pick up the right override.
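
For concreteness, a sketch of what that could look like on top of coffea's vector behaviors - just an illustration of the override suggestion, not the code in this PR, and it assumes the stored fields are x, y, z and mass:

import awkward
import numpy
from coffea.nanoevents.methods import vector

behavior = dict(vector.behavior)


@awkward.mixin_class(behavior)
class LorentzVectorM(vector.LorentzVector):
    """Lorentz vector stored as x, y, z and mass instead of x, y, z, t."""

    @property
    def t(self):
        # energy derived from the stored mass and the 3-momentum magnitude
        return numpy.sqrt(self["mass"] * self["mass"] + self.rho2)

    @property
    def mass(self):
        # mass is stored directly, so no need to derive it from t and rho2
        return self["mass"]


# usage sketch
vecs = awkward.zip(
    {"x": [1.0], "y": [2.0], "z": [3.0], "mass": [0.105]},
    with_name="LorentzVectorM",
    behavior=behavior,
)
print(vecs.t, vecs.mass)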

behavior = {}
behavior.update(base.behavior)
behavior.update(vector.behavior)
elif schemaclass is EDM4HEPSchema:
nsmith- (Member) commented

This is following the existing pattern, so it's a question for @lgray: why not use the class property EDM4HEPSchema.behavior here?

lgray (Collaborator) commented May 30, 2023


This needs cleanup in a separate PR to harmonize the interface with the eager one. You're right, but I'm marking it for later.

nsmith- (Member) commented May 30, 2023

On the issue of needing additional metadata to properly reconstruct the NanoEvents schema (i.e. awkward form) from the list of branches, the only issue I see is that, in the case of ROOT files, there is no standard way of storing it. So each file type will need its own. Perhaps what can be done is to add a new hook in the schema class interface: a classmethod that takes in the TFile and does what it needs to find the rest of the metadata. But this of course is only for ROOT, since parquet can mostly capture awkward data types and there is a standard spot (iirc) to put the mixin behavior names. So NanoEventsFactory.from_root would call this method. It breaks a bit the separation between the source mapping and the events object, but I think we don't have as strong a reason to keep that around long-term.
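
A rough sketch of what such a hook could look like (the method name and return shape are hypothetical, not the eventual coffea interface):

import uproot


class EDM4HEPSchema:
    ...

    @classmethod
    def extract_metadata(cls, tfile):
        """Hypothetical hook: pull schema-level metadata (here the podio
        collection ID table) out of an already-open ROOT file."""
        table = tfile["podio_metadata/events___idTable"].array()[0]
        return {
            "collection_ids": dict(
                zip(table["m_names"].tolist(), table["m_collectionIDs"].tolist())
            )
        }


# NanoEventsFactory.from_root would then do something along the lines of
# metadata = schemaclass.extract_metadata(uproot.open(filename))
# before building the awkward form.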

lgray (Collaborator) commented May 31, 2023

@jpivarski given @nsmith-'s comment (which is where I had more or less landed as well) - is there a reasonable way to get additional serialized metadata that's within the (first) root file (even if it is globbed) as a tree or class, when calling uproot.dask?

I suppose I have the file string, but it might be globbed, which I can't use in uproot.open reliably. I guess I could take the same string, uproot.dask it with the known location of the metadata and take the first partition only for a compute?

That should be sufficiently quick.

lgray (Collaborator) commented May 31, 2023

Indeed it's pretty easy:

import uproot

x = uproot.dask({"~/Downloads/rv02-02.sv02-02.mILD_l5_o1_v02.E250-SetA.I402004.Pe2e2h.eR.pL.n000.d_dstm_15090_*.slcio.edm4hep.root": "podio_metadata/events___idTable"})
$ python -i edm4hep_idTable.py
>>> x.partitions[0].compute()
<Array [{m_collectionIDs: [...], ...}] type='1 * {m_collectionIDs: var * in...'>

jpivarski (Contributor) commented

uproot.dask does open the first file in order to know the names and types of TBranches, so the metadata from the first file should be available. It might even be public (no underscore) because Uproot exposes a lot of its inner workings.

lgray (Collaborator) commented Jun 2, 2023

Hey, @jbrewster7, just to check: how are things coming along with the cross references? I think whatever you have now is fine, and then I'd be happy to take that and mutate it to use this stuff I was talking about with @nsmith- and @jpivarski.

However, I need to see what you've got first to act on that! There's no need to hone it to perfection; good enough is fine, and going further is really more of a collective process.

lgray (Collaborator) commented Jun 2, 2023

So yeah @nsmith- it looks like I can instead just take the filename and object path that are passed into .from_root and ask for the appropriate ID table using uproot.dask on the first partition.

Should be enough, and there's only one additional file open on the user side for a small amount of data to get the necessary bits of metadata.
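
A sketch of that flow (the helper name is illustrative, and it assumes dask-awkward is available, as in the example above):

import uproot


def podio_collection_ids(file_spec):
    # file_spec is whatever would be passed to uproot.dask, e.g. a dict mapping
    # a (possibly globbed) filename to "podio_metadata/events___idTable"
    table = uproot.dask(file_spec).partitions[0].compute()[0]
    return dict(zip(table["m_names"].tolist(), table["m_collectionIDs"].tolist()))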

lgray and others added 4 commits June 3, 2023 20:52
…o the procedure. Added the class LorentzVectorM to vector.py so that vectors can be defined with mass instead of energy.
lgray and others added 6 commits June 4, 2023 12:47
…particles and MC particles. This involved the creation of new classes in nanoevents/methods/edm4hep.py. This linking still isn't running error-free. There is a TypeError when the matched_gen function is called from the edm4hep methods.
Pushing changes made to add particle linking.
…ting still needs to be updated for proper construction of 4-vectors.
lgray (Collaborator) commented Aug 30, 2023

@tmadlener could you produce some new ZH files in the new EDM4HEP format (with the more human-readable index branch names)? In addition, could you make one that's ~40 events so that we can have a test file in our repo? Thanks!

lgray (Collaborator) commented Aug 30, 2023

@jbrewster7 could you fix up the flake8 errors? (Click the red x next to pre-commit.ci - pr.) It looks like some improperly formatted strings.

jbrewster7 (Author) commented
@lgray Yes, committing the changes now!

tmadlener commented
@lgray apologies for the slight delay. New files (and one with 40 events) are on cernbox: https://cernbox.cern.ch/s/eYmuXTRimfgmdZg

The 40-event one is produced from one of the full files using

source /cvmfs/sw-nightlies.hsf.org/key4hep/setup.sh

and the following lines of Python:

from podio.root_io import Reader, Writer

# read the full file and copy its first 40 event frames to a new one
reader = Reader("rv02-02.sv02-02.mILD_l5_o1_v02.E250-SetA.I402004.Pe2e2h.eR.pL.n000.d_dstm_15090_0.edm4hep.root")
events = reader.get("events")

writer = Writer("output.edm4hep.root")
for i in range(40):
    writer.write_frame(events[i], "events")

(There will be a few warnings about some missing schema evolution which can be safely ignored)

lgray (Collaborator) commented Sep 1, 2023

Awesome, thanks!

lgray (Collaborator) commented Sep 1, 2023

@tmadlener FYI https://pypi.org/project/podio/ is not taken. It would be nice if it were pip-installable.

lgray (Collaborator) commented Dec 7, 2023

TODO: update to the interface changes in dask-awkward and build tests.
