Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading structs into higher-dimensional Arrays #9

Open
briederer opened this issue Aug 21, 2020 · 9 comments
Open

Reading structs into higher-dimensional Arrays #9

briederer opened this issue Aug 21, 2020 · 9 comments
Labels
enhancement New feature or request

Comments

@briederer
Copy link

ROOT offers different ways to store arrays or even structs in TBranches. I added a small ROOT-File (test_array.root.zip) which contains a TTree called 'arrays' which contains a scalar, a 6d-vector and a 2x3-matrix and another TTree called 'structs' which contains again the same scalar as 'arrays' and a struct which resembles again a 2x3-matrix but stored as a 6d-vec and is defined in the following way
struct m23 {double el[6];};

So the file was created in the following way (may be useful when trying to retrieve data in Julia):

struct m23 {double el[6];};

m23 mat_23 = {1.0,2.0,3.0,4.0,5.0,6.0};
int n = 1;
double vec[6] = {1.0,2.0,3.0,4.0,5.0,6.0};
double mat[2][3] = {{1.0,2.0,3.0},{4.0,5.0,6.0}};

TFile *file = new TFile("test_array.root","Recreate");
TTree *tree_arr = new TTree("arrays","tree filled with arrays")
TTree *tree_str = new TTree("structs","tree filled with struct");

tree_arr -> Branch("nInt",&n);
tree_arr -> Branch("6dVec",&vec,"[6]/D");
tree_arr -> Branch("2x3Mat",&mat,"[2][3]/D");
tree_arr -> Fill();

tree_str -> Branch("nInt",&n);
tree_str -> Branch("2x3mat",&mat,"el[6]/D");
tree_str -> Fill();

file -> Write("",TObject::kOverwrite);

In the ROOT command-line the TTrees look like this in the end

root [1] arrays -> Scan();
***********************************************************
*    Row   * Instance *      nInt * 6dVec.6dV * 2x3Mat.2x *
***********************************************************
*        0 *        0 *         1 *         1 *         1 *
*        0 *        1 *         1 *         2 *         2 *
*        0 *        2 *         1 *         3 *         3 *
*        0 *        3 *         1 *         4 *         4 *
*        0 *        4 *         1 *         5 *         5 *
*        0 *        5 *         1 *         6 *         6 *
***********************************************************
root [2] structs -> Scan();
***********************************************
*    Row   * Instance *      nInt * 2x3mat.2x *
***********************************************
*        0 *        0 *         1 *         1 *
*        0 *        1 *         1 *         2 *
*        0 *        2 *         1 *         3 *
*        0 *        3 *         1 *         4 *
*        0 *        4 *         1 *         5 *
*        0 *        5 *         1 *         6 *
***********************************************

When I try to read this in Julia using the UnROOT package I get the following

julia> file=ROOTFile("/hosts/nashome/riederer_bernd/test_array.root")
ROOTFile("/hosts/nashome/riederer_bernd/test_array.root") with 2 entries and 17 streamers.

julia> keys(file)
2-element Array{String,1}:
 "arrays" 
 "structs"

julia> array(file,"arrays")
ERROR: Branches with multiple leaves are not supported yet.
Stacktrace:
 [1] #array#120(::Bool, ::typeof(array), ::ROOTFile, ::String) at ~/.juliapro/JuliaPro_v1.2.0-1/packages/UnROOT/KN5MU/src/root.jl:153
 [2] array(::ROOTFile, ::String) at ~/.juliapro/JuliaPro_v1.2.0-1/packages/UnROOT/KN5MU/src/root.jl:139
 [3] top-level scope at none:0

julia> array(file,"structs")
ERROR: Branches with multiple leaves are not supported yet.
Stacktrace:
 [1] #array#120(::Bool, ::typeof(array), ::ROOTFile, ::String) at ~/.juliapro/JuliaPro_v1.2.0-1/packages/UnROOT/KN5MU/src/root.jl:153
 [2] array(::ROOTFile, ::String) at ~/.juliapro/JuliaPro_v1.2.0-1/packages/UnROOT/KN5MU/src/root.jl:139
 [3] top-level scope at none:0

julia> DataFrame(file,"arrays")
1×3 DataFrame
│ Row │ nInt  │ 6dVec   │ 2x3Mat  │
│     │ Int32 │ Float64 │ Float64 │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ 1.0     │ 1.0     │

julia> DataFrame(file,"structs")
1×2 DataFrame
│ Row │ nInt  │ 2x3mat  │
│     │ Int32 │ Float64 │
├─────┼───────┼─────────┤
│ 1   │ 1     │ 1.0     │

It seems that, while array is refusing to handle multidimensional objects in a TBranch, the function DataFrame is able to get the scalar and the first entry of each multidimensional object.
I tested this with another file and DataFrame returns correctly all Rows of scalar-entries and next to it in each case the first element of the higher-dimensional object. So it looks like this:

julia> DataFrame(file,"structs")
1×5 DataFrame
│ Row │ nInt  │  2x3mat   │
│     │ Int32 │  Float64  │
├─────┼───────┼───────────┤
│ 1   │ 1     │ mat.el[0] │
│ 2   │ 2     │ mat.el[0] │
│ 3   │ 3     │ mat.el[0] │
│ 4   │ 4     │ mat.el[0] │

I would like to request to extend the support for multidimensional objects at least for the array function.

@tamasgal
Copy link
Member

Indeed, the problem here is that currently there is no support for so-called jagged arrays from multiple leaves in array().

The structure information of the leaves is encoded in the fLeaves of the TTree and there you can see the multi-dimensionality is basically given by the fTitle, which is [6] for 6dVec (meaning it's a TLeafD with 6 values per entry) and [2][3] for 2x3Mat, the 2x3 matrix, obviously. Here is the output where you can inspect the meta information of the three leaves:

julia> f["arrays"].fLeaves
UnROOT.TObjArray("", 0, Any[UnROOT.TLeafI
  fName: String "nInt"
  fTitle: String "nInt"
  fLen: Int32 1
  fLenType: Int32 4
  fOffset: Int32 0
  fIsRange: Bool false
  fIsUnsigned: Bool false
  fLeafCount: UInt32 0x00000000
  fMinimum: Int32 0
  fMaximum: Int32 0
, UnROOT.TLeafD
  fName: String ""
  fTitle: String "[6]"
  fLen: Int32 6
  fLenType: Int32 8
  fOffset: Int32 0
  fIsRange: Bool false
  fIsUnsigned: Bool false
  fLeafCount: UInt32 0x00000000
  fMinimum: Float64 0.0
  fMaximum: Float64 0.0
, UnROOT.TLeafD
  fName: String ""
  fTitle: String "[2][3]"
  fLen: Int32 6
  fLenType: Int32 8
  fOffset: Int32 0
  fIsRange: Bool false
  fIsUnsigned: Bool false
  fLeafCount: UInt32 0x00000000
  fMinimum: Float64 0.0
  fMaximum: Float64 0.0
])

I think that parsing this string is the only way to assemble the correct structure, so the first thing is to dynamically create the correct type (Julia struct) for this.

Here is a very basic (and ugly) hack to get the first entry:

julia> @UnROOT.io struct A6DVec
       a::Float64
       b::Float64
       c::Float64
       d::Float64
       e::Float64
       f::Float64
       end

julia> UnROOT.unpack(IOBuffer(array(f, "arrays/6dVec"; raw=true)[1]), A6DVec)
A6DVec(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)

The @UnROOT.io macro can be used to create a struct and a corresponding unpack method which will use big-endian parsing for all the fields.

I also looked at the file with uproot and uproot4 (the Python package I usually use as a reference and to debug) but apparently it has problems with it. It would be nice if we also create an issue in https://github.com/scikit-hep/uproot4 or just ping @jpivarski (sorry Jim 🙈)

If you want, I can do it for you (create an issue there) and discuss the details, since I guess I first need to understand all the underlying stuff before I can think of an approach to tackle this in the Julia implementation.

Here is a quick example how it fails in uproot4:

In [1]: import uproot4

In [2]: f = uproot4.open("/Users/tamasgal/Downloads/test_array.root")

In [3]: f["arrays"]["6dVec"].array()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-0ed0e403bdf9> in <module>
----> 1 f["arrays"]["6dVec"].array()

~/.virtualenvs/km3net/lib/python3.7/site-packages/uproot4/behaviors/TBranch.py in array(self, interpretation, entry_start, entry_stop, decompression_executor, interpretation_executor, array_cache, library)
   1620             interpretation_executor,
   1621             library,
-> 1622             arrays,
   1623         )
   1624

~/.virtualenvs/km3net/lib/python3.7/site-packages/uproot4/behaviors/TBranch.py in _ranges_or_baskets_to_arrays(hasbranches, ranges_or_baskets, branchid_interpretation, entry_start, entry_stop, decompression_executor, interpretation_executor, library, arrays)
    516
    517         elif isinstance(obj, tuple) and len(obj) == 3:
--> 518             uproot4.source.futures.delayed_raise(*obj)
    519
    520         else:

~/.virtualenvs/km3net/lib/python3.7/site-packages/uproot4/source/futures.py in delayed_raise(exception_class, exception_value, traceback)
     35         exec("raise exception_class, exception_value, traceback")
     36     else:
---> 37         raise exception_value.with_traceback(traceback)
     38
     39

~/.virtualenvs/km3net/lib/python3.7/site-packages/uproot4/behaviors/TBranch.py in basket_to_array(basket)
    482                         len(basket_arrays[basket.basket_num]),
    483                         interpretation,
--> 484                         branch.file.file_path,
    485                     )
    486                 )

ValueError: basket 0 in tree/branch /arrays;1:6dVec has the wrong number of entries (expected 1, obtained 6) when interpreted as AsDtype('>f8')
    in file /Users/tamasgal/Downloads/test_array.root

In [4]: f["arrays"]["6dVec"].show()
name                 | typename             | interpretation
---------------------+----------------------+-----------------------------------
6dVec                | double               | AsDtype('>f8')

@tamasgal
Copy link
Member

tamasgal commented Aug 22, 2020

Btw. you can see that both uproot and uproot4 are able to nicely read the structs:

In [5]: f["arrays"].show()
name                 | typename             | interpretation
---------------------+----------------------+-----------------------------------
nInt                 | int32_t              | AsDtype('>i4')
6dVec                | double               | AsDtype('>f8')
2x3Mat               | double               | AsDtype('>f8')

In [6]: f["structs"].show()
name                 | typename             | interpretation
---------------------+----------------------+-----------------------------------
nInt                 | int32_t              | AsDtype('>i4')
2x3mat               | double[6]            | AsDtype("('>f8', (6,))")

In [7]: f["structs/2x3mat"].array()
Out[7]: <Array [[1, 2, 3, 4, 5, 6]] type='1 * 6 * float64'>
In [9]: import uproot

In [10]: f = uproot.open("/Users/tamasgal/Downloads/test_array.root")

In [11]: f["structs/nInt"].array()
Out[11]: array([1], dtype=int32)

In [12]: f["structs/2x3mat"].array()
Out[12]: array([[1., 2., 3., 4., 5., 6.]])

@jpivarski
Copy link
Member

That's weird—I'm surprised this file is failing in not just one Uproot, but both. It is a type that I've addressed and have tests for. Normally, I'd say that Uproot would be a good model to follow for this case, but apparently not.

What it ought to be doing is reading the fixed-size dimensions from the TLeaf fTitle and instructing NumPy to expect dtype("f8", (2, 3)) (a dtype with shape information baked into it, for all dimensions but the first of the array (1 entry in this case)). Then it should expect an entry's itemsize to be 6*8 bytes, rather than 8 bytes.

I'm glad it got there leaf-list as a NumPy structured array (which was then converted into Awkward). That's a similar thing: the fields are baked into a dtype. It's a different way of reading than data split among branches.

For better debugging, you may want to pass library="np" to the array method, so that these are not also converted into NumPy. For jagged arrays, library="np" is more conversation, not less, but it's the opposite for these rectilinear arrays.

@jpivarski
Copy link
Member

I just caught up to your last example—that "structs/2x3mat" is not a leaf-list/structured array (fields baked into dtype), it's a fixed-size multidimensional array (shape baked into dtype). Is it just the case with two fixed-size indexes that isn't working? I don't think I have tests for that, and it could be failing. In that case, Uproot is a good model to follow—there's code to read that handles that, but it's apparently broken in some cases.

@tamasgal
Copy link
Member

Oh dear, you are fast 😅

Alright, thanks! I will dig deeper... I just created an issue in uproot4, to follow the standard procedures.

@tamasgal
Copy link
Member

I just caught up to your last example—that "structs/2x3mat" is not a leaf-list/structured array (fields baked into dtype), it's a fixed-size multidimensional array (shape baked into dtype). Is it just the case with two fixed-size indexes that isn't working? I don't think I have tests for that, and it could be failing. In that case, Uproot is a good model to follow—there's code to read that handles that, but it's apparently broken in some cases.

Yes, so the same data is saved in structs and in arrays, former is a real struct, latter is just a simple array. We can continue in scikit-hep/uproot5#74

@tamasgal tamasgal added the enhancement New feature or request label Aug 23, 2020
@tamasgal tamasgal changed the title [FeatureRequest] Reading structs into higher-dimensioal Arrays Reading structs into higher-dimensional Arrays Aug 23, 2020
@briederer
Copy link
Author

I am not at a PC currently but I guess this is fixed too by #22.
So if someone has time to check this could be closed too.

PS: @Moelf thanks a lot for the effort in the recent days/weeks. I can finally switch back to UnROOT now and will let you know if I run into new issues.

@Moelf
Copy link
Member

Moelf commented Jul 7, 2021

I just looked at a bit, sadly no. Your branch is extra jagged by the fact that each element is higher-dimensional and they are certainly some kind of nested vector in C++? I've never seen matrix branch in my (LHC experiment) life sorry, but I will be happy to learn from Jim later on!

@briederer
Copy link
Author

Okay, seems reasonable for the arrays tree, since this really contains a basic two-dimensional C-Array. This probably complicates things a bit.

However the struct array is basically a wrapped one-dimensional C-array and this should not be different than what was used in #22. So this is interesting, that it still not works. I try to have a look at that tomorrow.

PS: In experiment there is probably no reason to use higher-dimensional arrays. But for me as a theoretician I am storing a lot of matrix-valued observables from Lattice-simulations and reading these results back into Julia would be really nice 😊
My current workaround uses the Julia read- and pipeline-functions to get the values from the ROOT-CLI, which is of course pretty slow and messy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants