This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

uproot.update #530

Open
atasattari opened this issue Nov 30, 2020 · 4 comments

@atasattari

atasattari commented Nov 30, 2020

Hi,
I followed the uproot documentation to update a ROOT file by adding more trees, but I'm getting a "compression" error. Here is my script:

```python
import random

import numpy as np
import uproot

# uproot.update is the call that triggers the errors reported below
with uproot.update("Test_files/%s.root" % "test") as f:
    # making 2 trees
    for i in range(2):
        seriesnumber = random.randint(10**10, 10**11 - 1)
        dic = {}
        # making 5 branches
        for j in range(5):
            dic["zip" + str(j)] = "bool"
        events = (np.random.rand(10) * 10**5).astype(int)

        f[str(seriesnumber)] = uproot.newtree(dic)

        p = np.random.rand()
        # 50 entries in each branch
        for branch_name in list(dic.keys()):
            cGoodEv = np.random.choice(a=[True, False], size=(50), p=[p, 1 - p])
            f[str(seriesnumber)][branch_name].newbasket(cGoodEv)
```

The error:
TypeError: _openfile() missing 1 required positional argument: 'compression'
When I pass a compression method, I get the error below instead:
__init__() got an unexpected keyword argument 'compression'

Also, I noticed that uproot might raise a "NotImplementedError":

~/anaconda3/lib/python3.7/site-packages/uproot/write/TFile.py in __init__(self, path)
     27 class TFileUpdate(object):
     28     def __init__(self, path):
---> 29         self._openfile(path)
     30         raise NotImplementedError

So I'm a bit confused. I'm wondering whether this feature is implemented and, if so, why my script is not working.

Thanks,
Ata @bloer @mdiamon @pibion (CDMS collaboration)

@jpivarski
Member

uproot3.update is not implemented, and an attempt to use it is supposed to raise NotImplementedError. This happened before: #460. I'll check into why it's failing to raise NotImplementedError again.

As discussed in #381, updating existing ROOT files is unlikely to ever be implemented. When this is ported to Uproot 4, there will be a clearer placeholder in update to explain all of this.

As it is, you should be able to write new ROOT files with Uproot 3, but not change them in place. The short story is that it's much easier to maintain the consistency of the internal structures within a ROOT file when writing them fresh. An "update" feature would have to be able to accept any valid ROOT file and change it into another valid ROOT file, but we don't know the full inclusive set of what counts as "valid." Knowing an exclusive subset of what counts as "valid" is all you need to do "recreate," and that's why we have "recreate" but not "update."

@jpivarski
Member

#460 was fixed: you just need to update.

Last week, this version of Uproot changed its name on PyPI to uproot3, so

pip install uproot3

and use

import uproot3 as uproot

in your scripts. Hopefully today, but maybe tomorrow (at this rate), the PyPI package named uproot will become Uproot 4, which does not yet have the ability to write files (even "recreate"). Since the PyPI package names are different, you can use both in the same script. But for now, the writing parts need to use uproot3.

@atasattari
Author

Thanks. I noticed there is a suggestion in #381 to implement uproot.update for the files that were made by uproot. In our case, we are using uproot to produce the initial root files, so being able to update such files would be very useful. As you suggested, I can use "recreate" to write the information of the root files along with the updates to a new file. The main problem is that we were planning to use uproot to create and update many root files with many entries. So recreating them is not the most efficient way.

@jpivarski
Member

If you're accumulating entries in batches, the best thing might be to create little files and concatenate them afterward with "hadd" (regardless of whether the files come from ROOT or Uproot). There is a fast-clone mode as well as basket-combining and basket-sorting options, which trade speed of "hadd" for speed of access later. If you're going to read the resultant files many times, you probably want to at least combine baskets, and maybe sort them in a way that benefits your reading pattern.
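As a rough illustration of the "little files, then hadd" pattern, the merge step could be driven from Python along these lines. This is only a sketch: the helper name `build_hadd_command`, the `batches/batch_*.root` layout, and the `merged.root` target are hypothetical, and it assumes ROOT's `hadd` executable is on the PATH (the merge is skipped otherwise). Only the `-f` flag (overwrite the target) is used here; the basket-related options are best looked up in hadd's own help text.

```python
import glob
import shutil
import subprocess


def build_hadd_command(target, sources, overwrite=True):
    """Return the argument list for: hadd [-f] target source1 source2 ..."""
    if not sources:
        raise ValueError("need at least one source file to merge")
    command = ["hadd"]
    if overwrite:
        command.append("-f")  # overwrite the target file if it already exists
    command.append(target)
    command.extend(sorted(sources))  # sorted so the merge order is deterministic
    return command


if __name__ == "__main__":
    # Hypothetical layout: each batch of entries was written to its own small file.
    batch_files = glob.glob("batches/batch_*.root")
    if batch_files and shutil.which("hadd"):
        subprocess.run(build_hadd_command("merged.root", batch_files), check=True)
    else:
        print("nothing merged: no batch files found or hadd is not on PATH")
```

Each batch file would be written once with "recreate" and never reopened, so no in-place update is needed.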

If you're reopening the files to add just one or two entries (not a "batch"), then it's not efficient in any sense. There's a lot of overhead to opening a file and rearranging the objects in it, which would make the "open, write one entry, close, reopen" pattern horribly slow (ROOT or Uproot).

If the latter is your access pattern, you might want to consider a different file format, even if only for the intermediate files that you need to write in small bits.

  • This might sound ridiculous, but CSV is a very "appendable" format: opening a CSV file, seeking to the end, writing one line of text, and closing it will likely be much better than opening a ROOT file, appending one entry, and closing it. CSV is not a fast format for reading, but a one-time conversion to ROOT after the intermittent writing wouldn't be too bad.
  • A file format that is both binary and good for incremental appending is HDF5 (see h5py), but it has a lot of performance knobs to tune.
  • NumPy's own NPY format would be great for appending, but its built-in np.save function doesn't provide good append access. There's a project called npy-append-array to rectify that, though I haven't tried it myself. (NPY is such a simple file format that I've implemented appending myself, but having a library for this would be nice.) Note that a NumPy array can have many fields (structured array) like the TBranches of a TTree. This is the most "bare metal" solution, if efficiency really matters.
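To make the CSV suggestion concrete, here is a minimal sketch of the "open, append one line, close" pattern using only the standard library. The column names in `FIELDS` and the helper names `append_row` and `read_rows` are hypothetical, chosen to resemble the branches in the original script.

```python
import csv
import os

FIELDS = ["event", "zip0", "zip1"]  # hypothetical column names


def append_row(path, row):
    """Append one row, writing the header first if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


def read_rows(path):
    """Read all rows back as dictionaries; every value comes back as a string."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))
```

Calling `append_row("events.csv", {"event": 1, "zip0": True, "zip1": False})` repeatedly only ever writes at the end of the file. The values are read back as text, so the one-time conversion step would have to parse them into the intended types.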

The only thing lacking from all three of these suggestions (CSV, HDF5, NPY) is support for "jagged" arrays. They only accept flat tables, but I think that's the kind of data you have.
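For the NumPy route, one simple approach with stock NumPy is sketched below. Note that this is not how npy-append-array works (that library is not used here); instead, each batch is written as a separate NPY record appended to the same file, and the records are concatenated when reading. The record layout `EVENT_DTYPE` and the helper names `append_batch` and `read_all` are hypothetical. The resulting file is a sequence of NPY records, so a plain `np.load(path)` would return only the first batch.

```python
import os

import numpy as np

# Hypothetical record layout: one field per would-be TBranch.
EVENT_DTYPE = np.dtype([("event", np.int64), ("zip0", np.bool_), ("zip1", np.bool_)])


def append_batch(path, batch):
    """Append one structured array to the file as an additional NPY record."""
    with open(path, "ab") as handle:
        np.save(handle, batch, allow_pickle=False)


def read_all(path):
    """Read every NPY record in the file and concatenate them into one array."""
    size = os.path.getsize(path)
    batches = []
    with open(path, "rb") as handle:
        while handle.tell() < size:
            batches.append(np.load(handle, allow_pickle=False))
    return np.concatenate(batches)
```

A batch can be built with, for example, `np.array([(1, True, False)], dtype=EVENT_DTYPE)`, and individual fields are then accessed by name, as in `read_all(path)["zip0"]`.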
