uproot.dask is turning TBranches of fixed-size C arrays into Dask arrays with shape (num_entries,), rather than (num_entries, fixed_size) #1173

Open
jpivarski opened this issue Mar 18, 2024 · 0 comments
Labels
bug The problem described is something that must be fixed

Comments

@jpivarski
Member

The issue raised in #1116 is that @Jailbone's test case creates a TTree of double[fixed_size] (one fixed-size array per entry). This should be read as a 2D NumPy array of shape (num_entries, fixed_size), but uproot.dask presents it to Dask as having shape (num_entries,). Then, of course, Dask does the wrong thing with it.

Reproducer:

import uproot
import numpy as np

with uproot.recreate("test.root") as file:
    file["test_tree"] = {"test_branch": np.random.random((100, 10))}

>>> uproot.open("test.root:test_tree").show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
test_branch          | double[10]               | AsDtype("('>f8', (10,))")
>>> uproot.open("test.root:test_tree/test_branch").array(library="np").shape
(100, 10)

(fixed_size is 10.)

But the lazy array reports a 1D shape until it is computed:

>>> lazy = uproot.dask("test.root:test_tree", library="np")["test_branch"]
>>> lazy.shape
(100,)
>>> lazy.compute().shape
(100, 10)

There's only one place where Uproot creates a dask.array; it's here:

return da.core.Array(hlg, name, chunks, dtype=dtype)

Should we set the Dask array's shape through chunks, or is it set some other way? If we know that the TBranch's Interpretation is AsDtype (the only type that can have more than one dimension), we can get the part of the shape beyond the number of entries with inner_shape:

>>> uproot.open("test.root:test_tree/test_branch").interpretation
AsDtype("('>f8', (10,))")
>>> uproot.open("test.root:test_tree/test_branch").interpretation.inner_shape
(10,)
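
On the chunks question: as far as I know, a Dask array's shape is derived from chunks (one tuple of block sizes per dimension, summed along each axis), and the graph keys carry one block index per dimension. Below is a minimal pure-Python sketch of what the chunks would have to look like once inner_shape is appended. The helper names (chunks_for_branch, shape_from_chunks, block_keys) and the 60/40 partitioning are hypothetical, made up for illustration; this is not Uproot's code, and whether the graph keys in Uproot need the same change is an assumption to be checked.

```python
from itertools import product


def chunks_for_branch(entries_per_partition, inner_shape=()):
    """Build Dask-style chunks: one tuple of block sizes per dimension.

    entries_per_partition: number of entries in each partition (hypothetical input).
    inner_shape: e.g. interpretation.inner_shape, (10,) for double[10].
    """
    entry_axis = tuple(entries_per_partition)
    # Each inner dimension is a single block spanning its full size.
    inner_axes = tuple((size,) for size in inner_shape)
    return (entry_axis,) + inner_axes


def shape_from_chunks(chunks):
    """The array shape implied by chunks: sum of block sizes along each axis."""
    return tuple(sum(axis) for axis in chunks)


def block_keys(name, chunks):
    """Graph keys implied by chunks: (name, i, j, ...), one index per dimension."""
    index_ranges = (range(len(axis)) for axis in chunks)
    return [(name,) + idx for idx in product(*index_ranges)]


# 100 entries split into two hypothetical partitions of 60 and 40 entries.
chunks_1d = chunks_for_branch([60, 40])                     # what is passed today
chunks_2d = chunks_for_branch([60, 40], inner_shape=(10,))  # with inner_shape appended

print(shape_from_chunks(chunks_1d), shape_from_chunks(chunks_2d))
print(block_keys("test_branch", chunks_2d))
```

With chunks built only from the entry axis, the implied shape is 1D, which matches the lazy.shape seen above; appending one single-block axis per inner dimension gives the 2D shape that compute() actually returns.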
@jpivarski jpivarski added the bug The problem described is something that must be fixed label Mar 18, 2024
@jpivarski jpivarski added this to Important in Finalization Mar 21, 2024
@jpivarski jpivarski moved this from Deserialization to Dask and high-level behavior in Finalization Mar 21, 2024