This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Is uproot.update not yet supported? #381

Open
bixel opened this issue Oct 17, 2019 · 15 comments

Comments

@bixel

bixel commented Oct 17, 2019

When trying to uproot.update a root file, uproot raises a TypeError:

>>> import uproot
>>> f = uproot.update('root-file.root', uproot.LZMA(4))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() takes 2 positional arguments but 3 were given
>>>

From what I understand it looks like the update constructor takes too few arguments, compared to the called _openfile function.

Is this feature meant to work, or is it still unimplemented, judging from the NotImplementedError that is still present in the constructor?
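For illustration, this kind of TypeError arises whenever a call site passes more positional arguments than a constructor accepts. The sketch below uses hypothetical stand-ins (`UpdateStub` and `open_file` are invented names, not uproot's actual source) and only shows the general mechanism:

```python
class UpdateStub:
    """Hypothetical stand-in for a placeholder class; not uproot's code."""

    def __init__(self, path):
        # Accepts only `path` plus the implicit `self`: 2 positional arguments.
        raise NotImplementedError("update mode is a placeholder")


def open_file(path, compression):
    # Hypothetical helper that forwards one argument more than the stub accepts.
    return UpdateStub(path, compression)


def describe_failure():
    try:
        open_file("root-file.root", "LZMA(4)")
    except TypeError as err:
        # Python rejects the call before __init__ runs, so the
        # NotImplementedError inside the constructor is never reached.
        return str(err)
    return None


print(describe_failure())
```

Under this assumption the caller sees a TypeError about the argument count rather than the NotImplementedError that the stub was meant to raise.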

PS: Thank you for this awesome python package, this is really bringing back some fun to the daily root file massage :-)

@reikdas
Collaborator

reikdas commented Oct 17, 2019

The update method, which is meant to take an existing ROOT file and write objects to it, is not on our current development roadmap.
Perhaps sometime in the future.

@reikdas reikdas closed this as completed Oct 17, 2019
@bixel
Author

bixel commented Oct 17, 2019

Ok I see, thanks. Then I would suggest updating the README, since it reads as though create, recreate, and update are all supported.

@jpivarski
Member

Actually, this shouldn't be closed because the existence of uproot.update is a placeholder. It's not out of scope, though we haven't been talking about it much. If we look into it and it looks too difficult, then maybe we should formally descope it, but that hasn't happened yet. (Sorry if I gave the wrong impression.)

@andrzejnovak
Member

+1 for would use

@BrutalPosedon

I definitely would use this as well.
Sorry for writing on an old issue, but I think this would be nice to have.

@jpivarski
Member

On the scale of easy features vs hard features, this is a very hard one, unfortunately. We'd have to be able to pick up any ROOT file, regardless of how its internal structure is configured, and work with it in our scheme. A poor man's solution would be for uproot.update to be an alias to "copy the ROOT file into the organizational structure we expect, then work with it as though it were made with uproot.recreate." That technically works and has the right semantics, but probably isn't what you want because the reason you want to update is to avoid copies, right?
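A minimal sketch of the copy-based approach, using only the standard library. The function name `update_via_copy` and the `modify` callback are hypothetical; in the scheme described above, `modify` would be the step that rewrites the copy into the expected structure and adds the new objects:

```python
import os
import shutil
import tempfile


def update_via_copy(path, modify):
    """Emulate 'update' by rewriting a copy and swapping it into place.

    `modify` is a hypothetical caller-supplied callable that receives the
    path of the temporary copy and rewrites it however it likes.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        # This is the full copy that a true in-place update would avoid.
        shutil.copyfile(path, tmp_path)
        modify(tmp_path)
        # Swap the finished copy over the original in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, discard the copy and leave the original untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The original file is only replaced after `modify` succeeds, so the semantics match an update, but the cost scales with the size of the whole file rather than with the size of the new objects.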

The stub exists because it's a very natural thing to want. In general, the desirability of features has no correlation with their ease of implementation, such that some extremely trivial implementations have enormous benefits and some insurmountably difficult implementations have only marginal value. It sounds like this feature is both desirable and hard, so it's worth considering, but we'd need an interested developer to spend (what might turn out to be) a few months on it.

@chrisburr
Member

> We'd have to be able to pick up any ROOT file, regardless of how its internal structure is configured, and work with it in our scheme. A poor man's solution would be for uproot.update to be an alias to "copy the ROOT file into the organizational structure we expect, then work with it as though it were made with uproot.recreate."

Would it simplify the problem if you only accept files that were made by uproot and raise an exception in other cases? I suspect this would cover most use cases and it leaves the option of future extensions without risking backwards incompatibility.

@jpivarski
Member

That would simplify the problem. It hadn't occurred to me that this is the desirable use-case: opening and re-opening with uproot, as opposed to opening with ROOT and then re-opening with uproot. It could be a check for structure, rather than explicitly for who created it, so that a file made by ROOT might fit.

But then, that could be a confusing error for the user: one ROOT file can be updated in uproot while another can't, and there doesn't seem to be any significant difference between the ROOT files. (We're talking about invisible-to-the-user differences in where blocks of data are allocated, which can turn on minor details like how many times it's been opened, how many or what types of objects have been written, whether anything has ever been replaced in-place or deleted at any point in the file's history, etc.)

@chrisburr
Member

I think the simplest solution, raise Exception("Only files created by uproot are supported for opening in update mode"), avoids the confusion, provided that is a commitment that can be honoured, even if it means rejecting some files that would technically work by chance.
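A minimal sketch of such a guard. The only fact relied on is that ROOT files begin with the four magic bytes `root`; `check_updatable`, `UnsupportedLayoutError`, and the `has_known_layout` predicate are hypothetical names, and how to actually recognise an uproot-made layout is left open here:

```python
ROOT_MAGIC = b"root"  # ROOT files start with these four bytes


class UnsupportedLayoutError(Exception):
    """Raised when a file cannot be opened in update mode."""


def check_updatable(header, has_known_layout):
    """Refuse update mode unless the file has a layout we understand.

    `has_known_layout` is a hypothetical predicate supplied by the caller;
    it stands in for whatever structural check would really be used.
    """
    if header[:4] != ROOT_MAGIC:
        raise UnsupportedLayoutError("Not a ROOT file")
    if not has_known_layout(header):
        raise UnsupportedLayoutError(
            "Only files created by uproot are supported for opening in update mode"
        )
    return True
```

Keeping the check in one place means the accepted set of files could be widened later without changing how callers handle the error.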

@andrzejnovak
Member

I think that sounds reasonable: copy whatever objects don't have uproot equivalents, and allow updating those that do.

I don't know if what I was trying to do was necessarily smart, but my use case was needing histograms in a certain dir structure in the file to make stuff compatible with existing code. I couldn't figure out how to create directories in uproot, so I figured I could open a file that works and just update the histograms, but I got stuck because "recreate" didn't work and I couldn't update the TH1 names.

@jpivarski
Member

> I couldn't figure out how to create directories in uproot

The reason is that you can't create directories with uproot (#138).

It's one thing to read a ROOT file, jumping to where the pointers take you, skipping the parts that aren't relevant for what you're trying to do, and it's another thing entirely to write a structure, byte for byte, that ROOT will accept. You have to understand all the structures, at least well enough to make one valid state. To update a ROOT file, you have to pick up any (or a large set of) valid states and continue where they left off. The reason we punted on adding directories is that it multiplied the number of things we had to think about. ("What if the user adds directories first, then a histogram?" "What if they write a histogram, then add directories, then another histogram?" "What if some baskets of a TTree are interleaved with both of them?" It was a combinatorial problem.)

> Copying whatever objects don't have uproot equivalents, and allowing to update those that do.

It's not (just) an issue of types of objects we don't recognize. What I'm thinking about is the fact that a ROOT file is a filesystem; different regions of bytes correspond to different objects. If you add and remove objects, this space gets fragmented: "sliding everything to the left" when you delete an object would be prohibitively expensive, and anyway it would require a lot of pointers to be updated, so like a good filesystem, ROOT doesn't do this. Instead, it has a serialized linked list of the objects that do exist with a table of free space (TFree), so that new objects can be allocated in the otherwise unused gaps left by removed objects. Those that don't fit in the gaps have to go on the end, which moves the TFree object, which has to be updated in the TFile header, so that at any time the file is in a valid state. And finally, we're not designing a system that does all of these things, we have to do them exactly the same way that ROOT does—reproduce its logic exactly—or we'd end up putting the file into a corrupted state.
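The gap-reuse idea in the paragraph above can be sketched with a toy model. This is not ROOT's actual TFree logic, which would have to be reproduced exactly; `ToyFreeList` and its first-fit policy are assumptions made purely for illustration:

```python
class ToyFreeList:
    """Toy model of gap reuse in a file; not ROOT's actual TFree logic."""

    def __init__(self, end):
        self.end = end   # first byte past the last object in the file
        self.gaps = []   # sorted (start, stop) byte ranges left by removed objects

    def free(self, start, stop):
        # Removing an object leaves a gap; nothing else in the file moves.
        # (Adjacent gaps are not merged in this toy model.)
        self.gaps.append((start, stop))
        self.gaps.sort()

    def allocate(self, size):
        # First fit: reuse the earliest gap that is large enough.
        for i, (start, stop) in enumerate(self.gaps):
            if stop - start >= size:
                if stop - start == size:
                    del self.gaps[i]
                else:
                    self.gaps[i] = (start + size, stop)
                return start
        # Nothing fits, so the object goes on the end and the file grows.
        # In a real file this is the point where the end-of-file bookkeeping
        # and the header that points to it would need rewriting.
        start = self.end
        self.end += size
        return start
```

For example, after `free(100, 200)` on a file ending at byte 1000, a 50-byte object lands at offset 100, while an 80-byte object no longer fits in the remaining gap and is placed at offset 1000.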

The simplification that we're talking about is not dealing with all of the possible states that this filesystem can get into, but only the subset that we've already understood. I think in practice this would mean that 99% of ROOT files produced by ROOT would not be updatable in uproot. That would definitely include the use-case you're talking about: using ROOT to create the directories and uproot to add additional objects. We definitely want to be on the same page about what counts as "done" before launching into a project like this.

(Personal rant at the bottom of this comment: the "filesystem in a file" technology existed prior to ROOT. Zip files have been ubiquitous since 1989 and are the basis for many types of files that have to store a lot of objects, such as Java JARs and Python wheels. Zip doesn't have the "update" feature, to delete objects and recover their space, but all the dbm implementations do. In fact, dbm is a protocol established in 1979 so that you can swap out different implementations. sdbm is public domain, still used by Perl and Ruby, and BerkeleyDB, written in 1991, is a high-quality variant with users like sendmail, RPM, Bitcoin, and Oracle NoSQL. Maybe there was a technical reason one of these standards couldn't be adopted, but I really wish one had. It's not that our objects, like TBaskets, are particularly large binary blobs—megabytes at the most—we just have a lot of them, like most key-value stores.)

@reikdas
Collaborator

reikdas commented Mar 28, 2020

> Instead, it has a serialized linked list of the objects that do exist with a table of free space (TFree), so that new objects can be allocated in the otherwise unused gaps left by removed objects. Those that don't fit in the gaps have to go on the end, which moves the TFree object, which has to be updated in the TFile header, so that at any time the file is in a valid state.

For any readers, it might be important to know that uproot cannot do this yet and assigns new objects at the end of the file (#135).

@VukanJ

VukanJ commented Apr 24, 2020

> Instead, it has a serialized linked list of the objects that do exist with a table of free space (TFree), so that new objects can be allocated in the otherwise unused gaps left by removed objects. Those that don't fit in the gaps have to go on the end, which moves the TFree object, which has to be updated in the TFile header, so that at any time the file is in a valid state.
>
> For any readers, it might be important to know that uproot cannot do this yet and assigns new objects at the end of the file (#135).

I am confused. Does this mean that uproot can actually update a file with new objects as long as they are appended at the end of the file and this could "simply" be enabled for the update method? If that is true, couldn't the problem of managing data blocks in the best possible way be solved at a later stage, so that people could already use this feature?

@jpivarski
Member

Even if we don't plan to delete objects from a ROOT file or take advantage of empty spaces and keep it defragmented, appending does require changing pointers in a variety of places. (Here, "pointers" means byte positions in the file or in a nested range within the file.) To do that without corrupting people's files, we'd have to understand the set of possible states ROOT files can be in better than we do now. The list of free spaces is some sort of linked list (in file byte positions), but there's also a footer after it that would have to be copied if we're going to expand that list.
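To make the pointer bookkeeping concrete, here is a toy container format, unrelated to ROOT's real layout, in which a header stores the byte positions of the end of the file and of a trailing index. Every name and the layout itself are invented for this sketch. Even a pure append has to relocate the trailing index and rewrite the header:

```python
import io
import struct

HEADER = struct.Struct(">4sII")  # magic, end-of-file offset, index offset


def create(buf):
    index_pos = HEADER.size
    buf.seek(0)
    buf.write(HEADER.pack(b"toy!", index_pos + 4, index_pos))
    buf.write(struct.pack(">I", 0))  # empty index: zero records


def read_index(buf):
    buf.seek(0)
    _magic, _end, index_pos = HEADER.unpack(buf.read(HEADER.size))
    buf.seek(index_pos)
    (count,) = struct.unpack(">I", buf.read(4))
    entries = [struct.unpack(">II", buf.read(8)) for _ in range(count)]
    return index_pos, entries


def append(buf, payload):
    index_pos, entries = read_index(buf)
    # The new record is written where the old index used to start ...
    buf.seek(index_pos)
    buf.write(payload)
    entries.append((index_pos, len(payload)))
    # ... so the index has to be rewritten further along in the file ...
    new_index_pos = index_pos + len(payload)
    buf.seek(new_index_pos)
    buf.write(struct.pack(">I", len(entries)))
    for offset, length in entries:
        buf.write(struct.pack(">II", offset, length))
    end = buf.tell()
    # ... and both byte positions in the header must be updated to match.
    buf.seek(0)
    buf.write(HEADER.pack(b"toy!", end, new_index_pos))


def read_all(buf):
    _, entries = read_index(buf)
    records = []
    for offset, length in entries:
        buf.seek(offset)
        records.append(buf.read(length))
    return records
```

Usage on an in-memory buffer: `buf = io.BytesIO(); create(buf); append(buf, b"hello")`. In this toy, an interruption between writing the record and rewriting the header would leave the header pointing at overwritten bytes, which is the kind of corruption risk described above.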

We could try adding an append-only update feature, testing it on all the files we have, and then adding a warning that it's experimental and shouldn't be used on valuable files because what we don't know can corrupt files, then collect feedback from users who do encounter exceptional cases. However, that's a greater level of engagement than I'm able to support right now.

Append-only is a good idea to simplify the problem, but it's still going to require a lot of work. (And then after that, surely someone will ask why they can't delete objects or why their output files are so large. However, doing it in two stages like this does help to break down the problem.)

@VukanJ

VukanJ commented Apr 24, 2020

Ok, very interesting. Thanks a lot for the clarification!
