
feat(wip): edm4hep development #822

Open
wants to merge 74 commits into base: master

Conversation

jbrewster7 commented May 23, 2023

A schema for edm4hep files in nanoevents.

It's very similar to Delphes / Treemaker FWIW.

  • cover all four-vector types in edm4hep (done for PFOs, genparticles, what else?)
  • association maps between gen particles and reconstruction objects (clusters, PFOs)
  • association maps between PFOs and clusters / tracks

@jbrewster7 we should keep a list here of the things we need to do until we think we've found a good feature set!

@lgray lgray changed the title edm4hep development feat: edm4hep development May 23, 2023
@lgray lgray mentioned this pull request May 23, 2023
@lgray lgray changed the title feat: edm4hep development feat(wip): edm4hep development May 23, 2023
lgray (Collaborator) commented May 23, 2023

@jbrewster7 don't worry about the failures on arm, they're intermittent (probably depending on the host machine the CI job lands on)

Comment on lines 33 to 36
- Extended quantities of physics objects are stored in the format
<Object>_<variable>, such as "Jets_jecFactor". Such variables will be
merged into the collection <Object>, so the branch "Jets_jecFactor" will be
accessed in the array format as "Jets.jecFactor". An exception to the

tmadlener commented

In "podio speak" these are VectorMembers and they are used in several of the datatypes of edm4hep.

Note that there are also OneToOneRelations and OneToManyRelations which are currently still stored using the convention <collection-name>#<index>. These are the branches you would need to use to resolve e.g. mother/daughter relations for MCParticles. The <index> is currently somewhat "magic" but this is going to change with AIDASoft/podio#405, where the new schema of the branch names will be _<collection-name>_<name> for both VectorMembers and the relations. Let me know if you want to wait for that or whether you want some information about the magic.

I am not sure how general you want to make this in the end, but we could potentially help with some code generation on the podio side if you think that would be in any way useful.

lgray (Collaborator) commented May 24, 2023


@tmadlener Thanks for the info. Sorry, the documentation is for another data format type - I'll clean it up next week. In any case I can always find the constituents of a given physics object and wrap them together. It was pretty clear what to do in edm4hep!

As far as code generation goes - once we work it out here it could be nice to move it properly to podio (though that's not a python package but rather a generator for one)... We could envision some CI to put something on PyPI, though that may not be well advised, since what's being generated here is the schema as opposed to the implementation (which in our case is handled by more general functions + an embedded domain-specific language that does know how to handle cross references if you give it the source/mapping/target). It may be best that it stays here; it only needs to be done once anyway since it gets distributed with coffea. We can work it out in time.

I think in the meantime, if you could help us with the "magic" cross references so we know what we're supposed to be referencing, that would be really useful! We can add in the better-named xref branches when they're there and files with them are available.

tmadlener commented

I agree on the code generation. Let's see how this works out and how frequent changes would need to be.

So the "magic" is effectively that the indices get generated sequentially in the order the relations appear in the yaml definition file, starting with OneToManyRelations and then continuing with OneToOneRelations. So as an example the edm4hep::ReconstructedParticle has

    OneToOneRelations:
      - edm4hep::Vertex          startVertex    //start vertex associated to this particle
      - edm4hep::ParticleID      particleIDUsed //particle Id used for the kinematics of this particle
    OneToManyRelations:
      - edm4hep::Cluster               clusters     //clusters that have been used for this particle.
      - edm4hep::Track                 tracks       //tracks that have been used for this particle.
      - edm4hep::ReconstructedParticle particles    //reconstructed particles that have been combined to this particle.
      - edm4hep::ParticleID            particleIDs  //particle Ids (not sorted by their likelihood)

and on file the branches will be (assuming the collection name is "reco")

  • reco#0 for the related Clusters
  • reco#1 for the related Tracks
  • ...
  • reco#4 for the (single) startVertex
  • reco#5 for the (single) particleIDUsed
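
A tiny sketch of that rule (the relation names for edm4hep::ReconstructedParticle are written out by hand here rather than parsed from the yaml; "reco" is just the example collection name from above):

# relations in yaml order: OneToManyRelations first, then OneToOneRelations
one_to_many = ["clusters", "tracks", "particles", "particleIDs"]
one_to_one = ["startVertex", "particleIDUsed"]

collection = "reco"
# reco#0 ... reco#5 map onto the relations in this order
index_to_relation = {
    f"{collection}#{i}": name
    for i, name in enumerate(one_to_many + one_to_one)
}
# {'reco#0': 'clusters', ..., 'reco#4': 'startVertex', 'reco#5': 'particleIDUsed'}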

lgray (Collaborator) commented

Indeed, that's a little bit hacky to keep working with, since there's no clear way to keep track of the schema over time.

The next version sounds much more tractable in the long run!

lgray (Collaborator) commented May 30, 2023

@jbrewster7 how are things going w.r.t. cross references?

jbrewster7 (Author) commented
@lgray Hi! The cross references seem to all be working well, and I'm trying to set them up in a way that makes them easy enough to change when they go from magic numbers to names.

tmadlener commented
Hi @jbrewster7, if you want to make this transparent, it should be enough to use some information that is available from the podio_metadata Tree in the XXX___idTable branch(es). This contains a podio::CollectionIDTable for which we generate a dictionary. Internally it is effectively two vectors, one containing the (numerical) IDs and the other the (string) names. I think it should be possible to get to the m_collectionIDs and m_names leaves in the branches directly. With those

name_to_id = {n: i for (n,i) in zip(names, collectionIDs)}

should effectively give you a map of names to IDs. This will also hold after we have switched to something more robust than we currently have.

The XXX above is most likely always events for coffea, but podio can in principle also write other categories. I am not entirely sure if it makes sense at the moment to make this more general than it needs to be.
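
For reference, a minimal uproot-based sketch of that suggestion (the file name is a placeholder and the events category is assumed):

import uproot

with uproot.open("example.edm4hep.root") as f:
    table = f["podio_metadata/events___idTable"].array()[0]

# map the (string) collection names to the (numerical) collection IDs
name_to_id = {
    n: i
    for n, i in zip(table["m_names"].tolist(), table["m_collectionIDs"].tolist())
}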

lgray (Collaborator) commented May 30, 2023

@tmadlener I'll poke around with that table - I remember uproot complaining about it. If it won't read, I'll make a PR to uproot to get it working.

lgray (Collaborator) commented May 30, 2023

@jbrewster7 Hooray! That's awesome!

lgray (Collaborator) commented May 30, 2023

@tmadlener ah, that was something else, the idTable is easily accessible:

>>> x["podio_metadata"].keys()
['events___idTable', 'events___idTable/m_collectionIDs', 'events___idTable/m_names', 'events___CollectionTypeInfo', 'events___CollectionTypeInfo/events___CollectionTypeInfo._3', 'events___CollectionTypeInfo/events___CollectionTypeInfo._2', 'events___CollectionTypeInfo/events___CollectionTypeInfo._1', 'events___CollectionTypeInfo/events___CollectionTypeInfo._0', 'PodioBuildVersion', 'PodioBuildVersion/major', 'PodioBuildVersion/minor', 'PodioBuildVersion/patch', 'EDMDefinitions', 'EDMDefinitions/EDMDefinitions._1', 'EDMDefinitions/EDMDefinitions._0']
>>> x["podio_metadata"]["events___idTable"].array()[0].show()
{m_collectionIDs: [1, 2, 3, 4, 5, 6, 7, 8, ..., 40, 41, 42, 43, 44, 45, 46, 47],
 m_names: ['AllCaloHitContributionsCombined', ..., 'RecoMCTruthLink']}

A bit annoying we can't just do this with the awkward form alone though. I'll need to add in a hook to nanoevents to let the schema get organizational data that's available in the file!

@nsmith- thoughts?

lgray (Collaborator) commented May 30, 2023

@jeyserma FYI

@@ -746,6 +748,251 @@ def nearest(
return out


@awkward.mixin_class(behavior)
class LorentzVectorM(ThreeVector):
nsmith- (Member) commented

You should double-check, but I think all that's needed is to subclass LorentzVector and implement two overrides:

@property
def t(self):
    return numpy.sqrt(self["mass"]*self["mass"] + self.rho2)

@property
def mass(self):
    return self["mass"]

Basically the magic is that __getitem__ always gets what's actually stored in the awkward array, while __getattr__ is for derived or almost-derived quantities. That way all the other methods should pick up the right override.
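
For concreteness, a sketch of what that could look like on top of coffea's vector behaviors - just an illustration of the override suggestion, not the code in this PR, and it assumes the stored fields are x, y, z and mass:

import awkward
import numpy
from coffea.nanoevents.methods import vector

behavior = dict(vector.behavior)


@awkward.mixin_class(behavior)
class LorentzVectorM(vector.LorentzVector):
    """Lorentz vector stored as x, y, z and mass instead of x, y, z, t."""

    @property
    def t(self):
        # energy derived from the stored mass and the 3-momentum magnitude
        return numpy.sqrt(self["mass"] * self["mass"] + self.rho2)

    @property
    def mass(self):
        # mass is stored directly, so no need to derive it from t and rho2
        return self["mass"]


# usage sketch
vecs = awkward.zip(
    {"x": [1.0], "y": [2.0], "z": [3.0], "mass": [0.105]},
    with_name="LorentzVectorM",
    behavior=behavior,
)
print(vecs.t, vecs.mass)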

behavior = {}
behavior.update(base.behavior)
behavior.update(vector.behavior)
elif schemaclass is EDM4HEPSchema:
nsmith- (Member) commented

This is following the existing pattern, so it's a question for @lgray: why not use the class property EDM4HEPSchema.behavior here?

lgray (Collaborator) commented May 30, 2023


This needs cleanup in a separate PR to harmonize the interface with the eager one. You're right, but I'm marking it for later.

nsmith- (Member) commented May 30, 2023

On the issue of needing additional metadata to properly reconstruct the NanoEvents schema (i.e. awkward form) from the list of branches, the only issue I see is that, in the case of ROOT files, there is no standard way of storing it. So each file type will need its own. Perhaps what can be done is to add a new hook in the schema class interface: a classmethod that takes in the TFile and does what it needs to find the rest of the metadata. But this of course is only for ROOT, since parquet can mostly capture awkward data types and there is a standard spot (iirc) to put the mixin behavior names. So NanoEventsFactory.from_root would call this method. It breaks a bit the separation between the source mapping and the events object, but I think we don't have as strong a reason to keep that around long-term.
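
A rough sketch of what such a hook could look like (the method name and return shape are hypothetical, not the eventual coffea interface):

import uproot


class EDM4HEPSchema:
    ...

    @classmethod
    def extract_metadata(cls, tfile):
        """Hypothetical hook: pull schema-level metadata (here the podio
        collection ID table) out of an already-open ROOT file."""
        table = tfile["podio_metadata/events___idTable"].array()[0]
        return {
            "collection_ids": dict(
                zip(table["m_names"].tolist(), table["m_collectionIDs"].tolist())
            )
        }


# NanoEventsFactory.from_root would then do something along the lines of
# metadata = schemaclass.extract_metadata(uproot.open(filename))
# before building the awkward form.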

lgray (Collaborator) commented May 31, 2023

@jpivarski given @nsmith-'s comment (which is where I had more or less landed as well) - is there a reasonable way to get additional serialized metadata that's within the (first) root file (even if it is globbed) as a tree or class, when calling uproot.dask?

I suppose I have the file string, but it might be globbed, which I can't use in uproot.open reliably. I guess I could take the same string, uproot.dask it with the known location of the metadata and take the first partition only for a compute?

That should be sufficiently quick.

lgray (Collaborator) commented May 31, 2023

Indeed it's pretty easy:

import uproot

x = uproot.dask({"~/Downloads/rv02-02.sv02-02.mILD_l5_o1_v02.E250-SetA.I402004.Pe2e2h.eR.pL.n000.d_dstm_15090_*.slcio.edm4hep.root": "podio_metadata/events___idTable"})
$ python -i edm4hep_idTable.py
>>> x.partitions[0].compute()
<Array [{m_collectionIDs: [...], ...}] type='1 * {m_collectionIDs: var * in...'>

jpivarski (Contributor) commented

uproot.dask does open the first file in order to know the names and types of TBranches, so the metadata from the first file should be available. It might even be public (no underscore) because Uproot exposes a lot of its inner workings.

lgray (Collaborator) commented Jun 2, 2023

Hey, @jbrewster7, just to check: how are things coming along with the cross references? I think whatever you have now is fine, and then I'd be happy to take that and mutate it to use this stuff I was talking about with @nsmith- and @jpivarski.

However, I need to see what you've got first to act on that! There's no need to hone it to perfection; good enough is fine, and going further is really more of a collective process.

lgray (Collaborator) commented Jun 2, 2023

So yeah @nsmith- it looks like I can instead just take the filename and object path that are passed into .from_root and ask for the appropriate ID table using uproot.dask on the first partition.

Should be enough, and there's only one additional file open on the user side for a small amount of data to get the necessary bits of metadata.
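
A sketch of that flow (the helper name is illustrative, and it assumes dask-awkward is available, as in the example above):

import uproot


def podio_collection_ids(file_spec):
    # file_spec is whatever would be passed to uproot.dask, e.g. a dict mapping
    # a (possibly globbed) filename to "podio_metadata/events___idTable"
    table = uproot.dask(file_spec).partitions[0].compute()[0]
    return dict(zip(table["m_names"].tolist(), table["m_collectionIDs"].tolist()))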

lgray and others added 4 commits June 3, 2023 20:52
…o the procedure. Added the class LorentzVectorM to vector.py so that vectors can be defined with mass instead of energy.
lgray and others added 6 commits June 4, 2023 12:47
…particles and MC particles. This involved the creation of new classes in nanoevents/methods/edm4hep.py. This linking still isn't running error-free. There is a TypeError when the matched_gen function is called from the edm4hep methods.
Pushing changes made to add particle linking.
…ting still needs to be updated for proper construction of 4-vectors.
lgray (Collaborator) commented Aug 30, 2023

@tmadlener could you produce some new ZH files in the new EDM4HEP format (with the more human-readable index branch names)? In addition, could you make one that's ~40 events so that we can have a test file in our repo? Thanks!

lgray (Collaborator) commented Aug 30, 2023

@jbrewster7 could you fix up the flake8 errors? (Click the red x next to pre-commit.ci - pr.) It looks like some improperly formatted strings.

jbrewster7 (Author) commented
@lgray Yes, committing the changes now!

tmadlener commented
@lgray apologies for the slight delay. New files (and one with 40 events) are on cernbox: https://cernbox.cern.ch/s/eYmuXTRimfgmdZg

The 40-event one is produced from one of the full files using

source /cvmfs/sw-nightlies.hsf.org/key4hep/setup.sh

and the following lines of Python:

from podio.root_io import Reader, Writer

# read the full file and copy its first 40 event frames to a new one
reader = Reader("rv02-02.sv02-02.mILD_l5_o1_v02.E250-SetA.I402004.Pe2e2h.eR.pL.n000.d_dstm_15090_0.edm4hep.root")
events = reader.get("events")

writer = Writer("output.edm4hep.root")
for i in range(40):
    writer.write_frame(events[i], "events")

(There will be a few warnings about some missing schema evolution which can be safely ignored)

lgray (Collaborator) commented Sep 1, 2023

Awesome, thanks!

lgray (Collaborator) commented Sep 1, 2023

@tmadlener FYI https://pypi.org/project/podio/ is not taken. It would be nice if it were pip-installable.

lgray (Collaborator) commented Dec 7, 2023

TODO: update to the interface changes in dask-awkward and build tests.
