Is uproot.update not yet supported? #381

bixel · 2019-10-17T12:47:23Z

When trying to uproot.update a root file, uproot raises a TypeError:

>>> f = uproot.update('root-file.root', uproot.LZMA(4))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() takes 2 positional arguments but 3 were given
>>>

From what I understand it looks like the update constructor takes too few arguments, compared to the called _openfile function.
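
The mismatch described above can be reproduced without uproot. The following is a minimal sketch with hypothetical names (`UpdateStub`, `open_file`); it is not uproot's actual source, only an illustration of how a helper that forwards two arguments to a constructor accepting one would produce this kind of TypeError.

```python
class UpdateStub:
    """Hypothetical stand-in for a placeholder writer class."""

    def __init__(self, path):
        # Accepts only `path`, so a forwarded compression argument is one too many.
        raise NotImplementedError("update mode is a placeholder")


def open_file(cls, path, compression):
    """Hypothetical helper that forwards both arguments to the constructor."""
    return cls(path, compression)


try:
    open_file(UpdateStub, "root-file.root", "LZMA(4)")
except TypeError as err:
    # The argument-count check fails before the constructor body runs,
    # so the NotImplementedError inside __init__ is never reached.
    error_type = type(err).__name__
    print(error_type, err)
```

In this sketch the caller sees a TypeError rather than the more informative NotImplementedError, which matches the behaviour reported above.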

Is this feature meant to work already, or is it still unimplemented, as the NotImplementedError that is still present in the constructor suggests?

PS: Thank you for this awesome python package, this is really bringing back some fun to the daily root file massage :-)

reikdas · 2019-10-17T12:51:57Z

The update method, which is meant to take an existing ROOT file and write objects to it, is not on our current development roadmap.
Perhaps sometime in the future.

bixel · 2019-10-17T13:04:51Z

Ok I see, thanks. Then I would suggest updating the README, since it reads as though create, recreate, and update are all supported.

jpivarski · 2019-10-17T15:38:13Z

Actually, this shouldn't be closed because the existence of uproot.update is a placeholder. It's not out of scope, though we haven't been talking about it much. If we look into it and it looks too difficult, then maybe we should formally descope it, but that hasn't happened yet. (Sorry if I gave the wrong impression.)

andrzejnovak · 2020-03-26T21:00:25Z

+1 for would use

BrutalPosedon · 2020-03-27T14:23:28Z

I definitely would use this as well.
Sorry for writing on an old issue, but I think this would be nice to have.

jpivarski · 2020-03-27T15:48:32Z

On the scale of easy features vs hard features, this is a very hard one, unfortunately. We'd have to be able to pick up any ROOT file, regardless of how its internal structure is configured, and work with it in our scheme. A poor man's solution would be for uproot.update to be an alias to "copy the ROOT file into the organizational structure we expect, then work with it as though it were made with uproot.recreate." That technically works and has the right semantics, but probably isn't what you want because the reason you want to update is to avoid copies, right?

The stub exists because it's a very natural thing to want. In general, the desirability of features has no correlation with their ease of implementation, such that some extremely trivial implementations have enormous benefits and some insurmountably difficult implementations have only marginal value. It sounds like this feature is both desirable and hard, so it's worth considering, but we'd need an interested developer to spend (what might turn out to be) a few months on it.

chrisburr · 2020-03-27T16:11:13Z

> We'd have to be able to pick up any ROOT file, regardless of how its internal structure is configured, and work with it in our scheme. A poor man's solution would be for uproot.update to be an alias to "copy the ROOT file into the organizational structure we expect, then work with it as though it were made with uproot.recreate."

Would it simplify the problem if you only accept files that were made by uproot and raise an exception in other cases? I suspect this would cover most use cases and it leaves the option of future extensions without risking backwards incompatibility.

jpivarski · 2020-03-27T16:24:50Z

That would simplify the problem. It hadn't occurred to me that this is the desirable use-case: opening and re-opening with uproot, as opposed to opening with ROOT, re-opening with uproot. It could be a check for structure, rather than explicitly for who created it, so that a file made by ROOT might fit.

But then, that could be a confusing error for the user: one ROOT file can be updated in uproot while another can't, and there doesn't seem to be any significant difference between the ROOT files. (We're talking about invisible-to-the-user differences in where blocks of data are allocated, which can turn on minor details like how many times it's been opened, how many or what types of objects have been written, whether anything has ever been replaced in-place or deleted at any point in the file's history, etc.)

chrisburr · 2020-03-27T16:45:42Z

I think the simplest solution, raise Exception("Only files created by uproot are supported for opening in update mode"), avoids the confusion, provided that is a commitment that can be honoured, even if it means rejecting some files that would technically work by chance.

andrzejnovak · 2020-03-27T23:45:28Z

I think that sounds reasonable. Copying whatever objects don't have uproot equivalents, and allowing to update those that do.

I don't know if what I was trying to do was necessarily smart, but my use case was needing histograms in a certain dir structure in the file to make stuff compatible with existing code. I couldn't figure out how to create directories in uproot, so I figured I could open a file that works and just update the histograms, but I got stuck because "recreate" didn't work and I couldn't update the TH1 names.

jpivarski · 2020-03-28T01:30:12Z

> I couldn't figure out how to create directories in uproot

The reason is that you can't create directories with uproot (#138).

It's one thing to read a ROOT file, jumping to where the pointers take you, skipping the parts that aren't relevant for what you're trying to do, and it's another thing entirely to write a structure, byte for byte, that ROOT will accept. You have to understand all the structures, at least well enough to make one valid state. To update a ROOT file, you have to pick up any (or a large set of) valid states and continue where they left off. The reason we punted on adding directories is that it multiplied the number of things we had to think about. ("What if the user adds directories first, then a histogram?" "What if they write a histogram, then add directories, then another histogram?" "What if some baskets of a TTree are interleaved with both of them?" It was a combinatoric problem.)

> Copying whatever objects don't have uproot equivalents, and allowing to update those that do.

It's not (just) an issue of types of objects we don't recognize. What I'm thinking about is the fact that a ROOT file is a filesystem; different regions of bytes correspond to different objects. If you add and remove objects, this space gets fragmented: "sliding everything to the left" when you delete an object would be prohibitively expensive, and anyway it would require a lot of pointers to be updated, so like a good filesystem, ROOT doesn't do this. Instead, it has a serialized linked list of the objects that do exist with a table of free space (TFree), so that new objects can be allocated in the otherwise unused gaps left by removed objects. Those that don't fit in the gaps have to go on the end, which moves the TFree object, which has to be updated in the TFile header, so that at any time the file is in a valid state. And finally, we're not designing a system that does all of these things, we have to do them exactly the same way that ROOT does—reproduce its logic exactly—or we'd end up putting the file into a corrupted state.
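
The gap-reuse idea described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyFile` and its first-fit policy are hypothetical and do not reproduce ROOT's actual allocation logic or its TFree serialization, and merging of adjacent gaps is omitted for brevity.

```python
class ToyFile:
    """Toy model of a file whose byte ranges are tracked with a free list."""

    def __init__(self):
        self.end = 0        # first byte past the last allocated region
        self.free = []      # sorted list of (start, stop) gaps
        self.objects = {}   # name -> (start, stop)

    def write(self, name, size):
        # First fit: reuse the first gap that is large enough.
        for i, (start, stop) in enumerate(self.free):
            if stop - start >= size:
                self.objects[name] = (start, start + size)
                if start + size == stop:
                    del self.free[i]
                else:
                    self.free[i] = (start + size, stop)
                return self.objects[name]
        # No gap fits, so the object goes at the end and the file grows.
        self.objects[name] = (self.end, self.end + size)
        self.end += size
        return self.objects[name]

    def delete(self, name):
        # Deleting leaves a gap behind; no other object is moved.
        self.free.append(self.objects.pop(name))
        self.free.sort()


f = ToyFile()
f.write("a", 100)
f.write("b", 50)
f.write("c", 100)
f.delete("b")               # leaves a 50-byte gap between "a" and "c"
small = f.write("d", 30)    # fits into the gap
large = f.write("e", 80)    # too big for what is left of the gap
print(small, large, f.free, f.end)
```

Even in this toy, every write or delete changes bookkeeping (`free`, `end`) that a real file would have to serialize consistently, which is where the difficulty described in this comment comes from.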

The simplification that we're talking about is not dealing with all of the possible states that this filesystem can get into, but only the subset that we've already understood. I think in practice this would mean that 99% of ROOT files produced by ROOT would not be updatable in uproot. That would definitely include the use-case you're talking about: using ROOT to create the directories and uproot to add additional objects. We definitely want to be on the same page about what counts as "done" before launching into a project like this.

(Personal rant at the bottom of this comment: the "filesystem in a file" technology existed prior to ROOT. Zip files have been ubiquitous since 1989 and are the basis for many types of files that have to store a lot of objects, such as Java JARs and Python wheels. Zip doesn't have the "update" feature, to delete objects and recover their space, but all the dbm implementations do. In fact, dbm is a protocol established in 1979 so that you can swap out different implementations. sdbm is public domain, still used by Perl and Ruby, and BerkeleyDB, written in 1991, is a high-quality variant with users like sendmail, RPM, Bitcoin, and Oracle NoSQL. Maybe there was a technical reason one of these standards couldn't be adopted, but I really wish one had. It's not that our objects, like TBaskets, are particularly large binary blobs—megabytes at the most—we just have a lot of them, like most key-value stores.)

reikdas · 2020-03-28T06:42:16Z

> Instead, it has a serialized linked list of the objects that do exist with a table of free space (TFree), so that new objects can be allocated in the otherwise unused gaps left by removed objects. Those that don't fit in the gaps have to go on the end, which moves the TFree object, which has to be updated in the TFile header, so that at any time the file is in a valid state.

For any readers, it might be important to know that uproot cannot do this yet and assigns new objects at the end of the file (#135).

VukanJ · 2020-04-24T09:12:15Z

> > Instead, it has a serialized linked list of the objects that do exist with a table of free space (TFree), so that new objects can be allocated in the otherwise unused gaps left by removed objects. Those that don't fit in the gaps have to go on the end, which moves the TFree object, which has to be updated in the TFile header, so that at any time the file is in a valid state.
>
> For any readers, it might be important to know that uproot cannot do this yet and assigns new objects at the end of the file (#135).

I am confused. Does this mean that uproot can actually update a file with new objects as long as they are appended at the end of the file and this could "simply" be enabled for the update method? If that is true, couldn't the problem of managing data blocks in the best possible way be solved at a later stage, so that people could already use this feature?

jpivarski · 2020-04-24T12:23:49Z

Even if we don't plan to delete objects from a ROOT file or take advantage of empty spaces and keep it defragmented, appending does require changing pointers in a variety of places. (Here, "pointers" means byte positions in the file or in a nested range within the file.) To do that without corrupting people's files, we'd have to understand the set of possible states ROOT files can be in better than we do now. The list of free spaces is some sort of linked list (in file byte positions), but there's also a footer after it that would have to be copied if we're going to expand that list.
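
To make "pointers as byte positions" concrete, here is a minimal sketch of an append in a made-up format. The layout (a `TOY1` magic followed by a single end-of-data pointer) is hypothetical and far simpler than a ROOT file, which has many such positions to keep consistent.

```python
import io
import struct

# Hypothetical header: 4 magic bytes, then the byte position where data ends.
HEADER = struct.Struct(">4sI")


def new_file():
    buf = io.BytesIO()
    buf.write(HEADER.pack(b"TOY1", HEADER.size))
    return buf


def append_record(buf, payload):
    # Read the current end-of-data pointer from the header.
    buf.seek(0)
    magic, end = HEADER.unpack(buf.read(HEADER.size))
    # Write the new record (4-byte length prefix plus payload) at that position.
    buf.seek(end)
    buf.write(struct.pack(">I", len(payload)) + payload)
    new_end = buf.tell()
    # Update the header last, so it points past the new record.
    buf.seek(0)
    buf.write(HEADER.pack(magic, new_end))
    return end


buf = new_file()
first = append_record(buf, b"hello")
second = append_record(buf, b"world!")
buf.seek(0)
_, end_pointer = HEADER.unpack(buf.read(HEADER.size))
print(first, second, end_pointer)
```

With one pointer the update is easy to get right; the concern raised here is that a real file has several interdependent positions, and getting any of them wrong leaves the file unreadable.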

We could try adding an append-only update feature, testing it on all the files we have, and then adding a warning that it's experimental and shouldn't be used on valuable files because what we don't know can corrupt files, then collect feedback from users who do encounter exceptional cases. However, that's a greater level of engagement than I'm able to support right now.

Append-only is a good idea to simplify the problem, but it's still going to require a lot of work. (And then after that, surely someone will ask why they can't delete objects or why their output files are so large. However, doing it in two stages like this does help to break down the problem.)

VukanJ · 2020-04-24T12:41:18Z

Ok, very interesting. Thanks a lot for the clarification!

reikdas closed this as completed Oct 17, 2019

jpivarski reopened this Oct 17, 2019

jpivarski added the writing-improvements label Oct 17, 2019

jpivarski mentioned this issue Nov 30, 2020

uproot.update #530

Open
