Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fBits is not always 4 bytes #878

Open
hajifkd opened this issue Apr 14, 2023 · 8 comments
Open

fBits is not always 4 bytes #878

hajifkd opened this issue Apr 14, 2023 · 8 comments
Labels
big-project This will take some time, perhaps as a Fellowship or GSoC project bug (unverified) The problem described would be a bug, but needs to be triaged

Comments

@hajifkd
Copy link

hajifkd commented Apr 14, 2023

In the issue, #438, it is reported that in root files Delphes created, fBits fields may be 6 bytes and desalinization processes may fail.
As an ad-hoc workaround, fBits field was decoded as a 1 byte field, but after the issue, #569, the change has reverted, and kBits type is decoded as the 4 bytes fields with a guess that Delphes does something wrong.

However, this is not consistent with how the original root framework works.
In the original framework, in https://github.com/root-project/root/blob/61ea5a002198cd756bdce67e8499feded9dedc0b/io/io/src/TStreamerInfoReadBuffer.cxx#L1021, fields with type kBits is decoded as 4 bytes fields only if kIsReferenced is not set. If it is set, 2 additional bytes are read and used to get TProcessID. Then, fUniqueID on the decoded object is rewritten so that fUniqueID is really unique among all processes, I guess. In other words, kBits type is encoded to 4 or 6 bytes, depending on kIsReferenced.
Although I could not find any document about this behavior, I now believe the original issue, #438, is NOT the bug of Delphes, but the bug of uproot and is to be fixed.

As long as fUniqueID is not used across the processes, just skipping 2 extra bytes when kIsReferenced should be fine, I guess.
For instance, the official JS implementation does in that way:
https://github.com/root-project/root/blob/901c7cc0a986931bd771c0a873247f5708e54111/js/modules/io.mjs#L61
I understand most people including me have never used fBits and this simpler way is rather time-consuming, but in principle, uproot should implement this. The best way is probably to adopt some form of lazy evaluation like an iterator, but it would break backward compatibility.

@hajifkd hajifkd added the bug (unverified) The problem described would be a bug, but needs to be triaged label Apr 14, 2023
@jpivarski
Copy link
Member

We do want to make it as correct as it can be, but I've been bewildered by all of the variation in files that I've come across.

It sounds like you have a firm handle on all of the "if/then" logic. Do you see how the code in this part of identify.py,

https://github.com/scikit-hep/uproot5/pull/570/files

can be modified to get it right? A backward-incompatible change that converts a raises-an-exception case into a working-properly case is not a backward-incompatibility that we're concerned about.


Oh, when you say that it depends on kIsReferenced, you mean that it can't be predicted before reading data, right? In that case, any data containing an fBits can't be AsDtype; it would have to be AsObjects. At the point referenced above in identify.py,

raise NotNumerical

will get out of the part of the code that is trying to make it AsDtype and fall back to AsObjects. Then the Model_TObject can't have an AsStridedObjects interpretation because the byte-width of fBits can't be variable. This code:

@classmethod
def strided_interpretation(
cls, file, header=False, tobject_header=True, breadcrumbs=(), original=None
):
members = [(None, None)]
if tobject_header:
members.append(("@instance_version", numpy.dtype(">u2")))
members.append(("@num_bytes", numpy.dtype(">u4")))
members.append(("@fUniqueID", numpy.dtype(">u4")))
members.append(("@fBits", numpy.dtype(">u4")))
members.append(("@pidf", numpy.dtype(">u2")))
return uproot.interpretation.objects.AsStridedObjects(
cls, members, original=original
)

would have to

raise CannotBeStrided

to get out of the attempt to interpret it as a fixed-bytesize (strided) object.

And then the final fallback of stepping a different number of bytes depending on whether kIsReferenced is set seems to already be implemented in the full object-interpretation:

self._members["@fBits"] = (
numpy.uint32(self._members["@fBits"]) | uproot.const.kIsOnHeap
)
if self._members["@fBits"] & uproot.const.kIsReferenced:
cursor.skip(2)
self._members["@fBits"] = int(self._members["@fBits"])

So maybe as a first question, before digging too deeply into this: if you construct an AsObjects interpretation for some data that has fBits, does it get the number of bytes right? If so, then the project becomes one of making sure that it always falls back to the full object interpretation, as I've outlined above. Now that we have AwkwardForth, this is less of a performance degradation than it once was.


A lot of classes descend from TObject, which has this potentially variable-sized fBits. Even TLorentzVector, which was the original motivation for AsStridedObjects, would be changed by this. Such a modification would affect a lot of data types, but presumably, some of those are failing now because of variable-width fBits...

@hajifkd
Copy link
Author

hajifkd commented Apr 14, 2023

Oh, when you say that it depends on kIsReferenced, you mean that it can't be predicted before reading data, right?

Yes, that is exactly what I meant.

if you construct an AsObjects interpretation for some data that has fBits, does it get the number of bytes right?

Actually, I have never used AsObjects (to be honest, I am a novice user of uproot and didn't even know it) and rather use arrays or iterate, but I agree with this and also your sketch of changes to be done.

Now that we have AwkwardForth, this is less of a performance degradation than it once was.

Sounds great. To be honest, I think that effectively it doesn't matter even in pure python. The complexity changes from O(1) to O(N) though.

A lot of classes descend from TObject, which has this potentially variable-sized fBits. Even TLorentzVector, which was the original motivation for AsStridedObjects, would be changed by this. Such a modification would affect a lot of data types, but presumably, some of those are failing now because of variable-width fBits...

I completely agree with this, unfortunately.

@jpivarski jpivarski added the big-project This will take some time, perhaps as a Fellowship or GSoC project label Jan 30, 2024
@vischia
Copy link

vischia commented Feb 28, 2024

Hi, is there a quick fix or workaround that I could apply to my code (either to tell Delphes to use always the 4bits fBits size expected by uproot, or to tell uproot on the fly the correct number of bits to read for the branch) to make it work, without having to go through the full implementation in the uproot code iself?

Thank you very much in advance!

Cheers,
Pietro

@jpivarski
Copy link
Member

I don't know how to change what Delphes does, but re-reading what I wrote above, it looks like the AsObjects Interpretation does the correct check of fBits, while AsStridedObjects does not (can't, actually).

For the branch that you're interested in, does its branch.interpretation contain AsStridedObjects? If so, you could try manually computing an Interpretation with simplify=False to leave it AsObjects, rather than simplifying it to AsStridedObjects. For example, with this file,

>>> import uproot, skhep_testdata
>>> tree = uproot.open(skhep_testdata.data_path("uproot-HZZ-objects.root"))["events"]

the default interpretations of the tree["jetp4"] and tree["MET"] branches contain AsStridedObjects:

>>> tree["jetp4"].interpretation
AsJagged(AsStridedObjects(Model_TLorentzVector_v4), header_bytes=10)
>>> tree["MET"].interpretation
AsStridedObjects(Model_TVector2_v3)

We can make AsObjects versions of them by calling uproot.interpretation.identify.interpretation_of with the non-default simplify=False:

>>> uproot.interpretation.identify.interpretation_of(tree["jetp4"], {}, simplify=False)
AsObjects(AsVector(True, Model_TLorentzVector))
>>> uproot.interpretation.identify.interpretation_of(tree["MET"], {}, simplify=False)
AsObjects(Model_TVector2)

and then pass them to uproot.TBranch.array:

>>> branch = tree["jetp4"]
>>> interp = uproot.interpretation.identify.interpretation_of(branch, {}, simplify=False)
>>> branch.array(interp)
<Array [[], [{fP: {...}, ...}], ..., [...], []] type='2421 * var * TLorentz...'>
>>> 
>>> branch = tree["MET"]
>>> interp = uproot.interpretation.identify.interpretation_of(branch, {}, simplify=False)
>>> branch.array(interp)
<Array [{fX: 5.91, fY: 2.56}, ..., {...}] type='2421 * TVector2[fX: float64...'>

Maybe also set

>>> interp._forth = False

before using it.

The distinction is that AsObjects makes a class that performs a Python for loop over all of your data. Some data structures are complex enough that we have to suffer that factor-of-hundreds slower time. AwkwardForth mitigates that by replacing the Python for loop with a Forth loop, which is faster because it is a more streamlined virtual machine than Python's. AsStridedObjects can be applied in situations in which the data have fixed sizes, so we don't do any loop over the data—Python, Forth, or anything—we just pass it to NumPy's structured array with all of the fixed-size fields encoded in a header. That doesn't work if the size of a field like @fBits might be 4 bytes or 1 byte, depending on the value of its first byte.

The reason that this issue has stalled is because I can't think of a good way to check the @fBits for rare cases in which they're not 4 bytes without slowing down all of the other cases by factors of hundreds. This fBits is a part of TObject, which is the superclass for a lot of ROOT objects.

"If TObject, then simplify=True is not allowed?" Before doing something like that, we'd want to be sure that AwkwardForth is able to deal with the 4-byte, 1-byte roll-over. Actually, this looks like the right thing to do:

skip_length = cursor._index - start_index
forth_stash.add_to_pre(f"{skip_length} stream skip \n")

so maybe it is ready.

Since you have files that need the 4-byte, 1-byte roll-over, can you check that they work with both interp._forth = False and interp._forth = True? Furthermore, if they're public and smaller than a few MB, could we put them in https://github.com/scikit-hep/scikit-hep-testdata and make them permanent tests as part of finally solving this issue?

@vischia
Copy link

vischia commented Mar 4, 2024

Hi Jim,

thank you very much for the detailed explanation and suggestions!

I tried to apply it, with the following outcome:

  • if I call tree.arrays(), the code crashes as described in my previous message;
  • if I call branch.array(), the code runs, but the interpretation is AsDtype('>i4'). There is none of the output that you print in your example (no AsObject, no <Array..., etc)

It does not mention AsObject or AsSharedObject.

Concrete steps:

input_file_path="tag_1_delphes_events.root"
tree = uproot.open(f"{input_file_path}:Delphes")
branch = tree["Particle"]
print("Interpretation:", branch.interpretation)

this prints Interpretation: AsDtype('>i4')

interp=uproot.interpretation.identify.interpretation_of(tree["Particle"], {}, simplify=False)
print("setting forth to false")
print("setting forth to false")
interp._forth = False 
b = branch.array(interp)
print(np.shape(b))

this prints [10000] (the file has a tree with 10000 entries, for context)

`
print("setting forth to true")
interp._forth = True
b = branch.array()
print(np.shape(b))


this prints again `[10000]`. If I add `print(b)`, I get an array of integers: `[1380, 1729, 935, 2222, ...]`

tree=tree.arrays(library="ak")


and finally, calling `tree.arrays` fails as described in my previous message with 

`ValueError: basket 5 in tree/branch /Delphes;1:Particle/Particle.fBits has the wrong number of bytes (58938) for interpretation AsDtype('>u4')`

I will create a public version of the file, but for the moment can I share it privately with you by email?

Thank you very much in advance!

Cheers,
Pietro

@jpivarski
Copy link
Member

  • if I call tree.arrays(), the code crashes as described in my previous message;

tree.arrays doesn't give you the opportunity to provide an interpretation, so that's a non-starter.

  • if I call branch.array(), the code runs, but the interpretation is AsDtype('>i4'). There is none of the output that you print in your example (no AsObject, no <Array..., etc)

If the interpretation is AsDtype, then it's not a subclass of TObject (it's not a C++ object at all, just an integer) and fBits doesn't enter into this. (Nor does AwkwardForth.)

ValueError: basket 5 in tree/branch /Delphes;1:Particle/Particle.fBits has the wrong number of bytes (58938) for interpretation AsDtype('>u4')

Oh, fBits is just the name of the TBranch; it's not a part of the data that's being deserialized. The problem that you're seeing is not related to this issue, although I can see why it looks that way.

Somehow, you have a TBasket with 58938 bytes in it that Uproot is trying to interpret as either an unsigned 4-byte integer (>u4) or a signed 4-byte integer (>i4). Either way, 58938 is not divisible by 4 and that's why Uproot gets stuck at this point: it's either the wrong interpretation (should it be a 2-byte integer, perhaps?) or the TBasket is corrupted (only 58938 bytes got written when it should be more? or less?); something bad happened upstream of this point.

You could provisionally set

interp = uproot.AsDtype(">u2")   # or ">i2"

to see what the numbers look like with 2 bytes. If you know something about what the values are supposed to be, just applying an interpretation and checking it against your expectations is a fast way to narrow in on what the actual problem is.

Also, since you know this is happening in TBasket number 5, you can look at its number of entries and uncompressed bytes with

branch.basket(5).num_entries
branch.basket(5).uncompressed_bytes

(uproot.TBranch.basket, uproot.models.TBasket.Model_TBasket.num_entries, and uproot.models.TBasket.Model_TBasket.uncompressed_bytes). For fixed-width data types, like the integer types, you can determine the number of bytes per entry by dividing the number of bytes by the number of entries, and so that provides a cross-check. (If you're ever dealing with an array that has a jagged or an object data type, keep in mind that this doesn't apply!)

Hopefully, these will let you narrow in on whether this is a wrong interpretation (i.e. should be 2 or maybe even 6 bytes) or a wrong TBasket (doesn't fit the declared number of entries).

Also, since this is a TBranch consisting of nothing but fBits, why do you need to read it at all? If this is a split TObject, then the "bit field status word" are ROOT internals (a private member of the TObject class). If it's something else that just happens to be named fBits, then okay.

@vischia
Copy link

vischia commented Mar 4, 2024

  • if I call branch.array(), the code runs, but the interpretation is AsDtype('>i4'). There is none of the output that you print in your example (no AsObject, no <Array..., etc)

If the interpretation is AsDtype, then it's not a subclass of TObject (it's not a C++ object at all, just an integer) and fBits doesn't enter into this. (Nor does AwkwardForth.)

Well, it's the usual structure, where the overall branch name is "a int", but it is actually a composite object with various subbranches: see screenshot of when doing `Delphes->Print()" from plain ROOT:

image

I assume (assumed?) uproot would load the whole object as tree["Particle"], isn't that the case?

Then maybe we fall back on the last part of your answer:

Also, since this is a TBranch consisting of nothing but fBits, why do you need to read it at all? If this is a split TObject, then the "bit field status word" are ROOT internals (a private member of the TObject class). If it's something else that just happens to be named fBits, then okay.

OK so if uproot doesn't load the whole object as tree["Particle"], then can I just load the branches like Particle.PT and the like and completely ignore the internals like fBits? In doing so, can I just use tree.arrays( list_of_branches_to_keep) or do I need to go branch by branch?

Cheers,
Pietro

@jpivarski
Copy link
Member

"Subbranches" are arranged as a nested tree in the metadata, but in the data (what you're having trouble reading), every branch is independent and can be independently read.

In uproot.TTree.arrays, you can use expressions to list them all independently or filter_name, filter_typename, and/or filter_branch to select them more fluently. Each of these filters can be a glob pattern, a regular expression (if prepended and appended with "/"), or a function. If you want everything but fBits, you can say,

tree.arrays(filter_name=lambda x: x != "fBits")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
big-project This will take some time, perhaps as a Fellowship or GSoC project bug (unverified) The problem described would be a bug, but needs to be triaged
Projects
None yet
Development

No branches or pull requests

3 participants