Want to support library export as Polars. #1096

Open
Esword618 opened this issue Jan 24, 2024 · 5 comments
Labels
feature New feature or request

Comments

@Esword618

It would be nice if the library supported exporting data as Polars DataFrames.

@Esword618 Esword618 added the feature New feature or request label Jan 24, 2024
@jpivarski
Member

I put a "thumbs up" on this. In the context of https://github.com/intake/awkward-pandas, we've been thinking about Polars, too. (Extending awkward-pandas would be a necessary step to get ragged arrays in Polars, though in principle it could be added to Uproot for flat arrays now.)

Incidentally, all of the Pandas conversion happens in one file:

def _is_pandas_rangeindex(pandas, index):
    if hasattr(pandas, "RangeIndex") and isinstance(index, pandas.RangeIndex):
        return True
    if hasattr(index, "is_integer") and index.is_integer():
        return True
    if uproot._util.parse_version(pandas.__version__) < uproot._util.parse_version(
        "1.4.0"
    ) and isinstance(index, pandas.Int64Index):
        return True
    return False


def _strided_to_pandas(path, interpretation, data, arrays, columns):
    for name, member in interpretation.members:
        if not name.startswith("@"):
            p = (*path, name)
            if isinstance(member, uproot.interpretation.objects.AsStridedObjects):
                _strided_to_pandas(p, member, data, arrays, columns)
            else:
                arrays.append(data["/".join(p)])
                columns.append(p)


def _pandas_basic_index(pandas, entry_start, entry_stop):
    if hasattr(pandas, "RangeIndex"):
        return pandas.RangeIndex(entry_start, entry_stop)
    else:
        return pandas.Int64Index(range(entry_start, entry_stop))


def _pandas_only_series(pandas, original_arrays, expression_context):
    arrays = {}
    names = []
    for name, context in expression_context:
        arrays[_rename(name, context)] = original_arrays[name]
        names.append(_rename(name, context))
    return arrays, names


class Pandas(Library):
    """
    A :doc:`uproot.interpretation.library.Library` that presents ``TBranch``
    data as Pandas Series and DataFrames. The standard name for this library is
    ``"pd"``.

    The single-``TBranch`` (with a single ``TLeaf``) form for this library is
    ``pandas.Series``, and the "group" form is ``pandas.DataFrame``.

    The "group" behavior for this library is:

    * ``how=None`` or a string: passed to ``pandas.merge`` as its ``how``
      parameter, which would be relevant if jagged arrays with different
      multiplicity are requested.
    * ``how=dict``: a dict of str → array, mapping the names to
      ``pandas.Series``.
    * ``how=tuple``: a tuple of ``pandas.Series``, in the order requested.
      (Names are assigned to the ``pandas.Series``.)
    * ``how=list``: a list of ``pandas.Series``, in the order requested.
      (Names are assigned to the ``pandas.Series``.)

    Pandas Series and DataFrames are indexed, so ``global_index`` adjusts them.
    """

    name = "pd"

    @property
    def imported(self):
        return uproot.extras.pandas()

    def finalize(self, array, branch, interpretation, entry_start, entry_stop, options):
        pandas = self.imported
        index = _pandas_basic_index(pandas, entry_start, entry_stop)

        if (
            isinstance(array, numpy.ndarray)
            and array.dtype.names is None
            and len(array.shape) == 1
            and array.dtype != numpy.dtype(object)
        ):
            return pandas.Series(array, index=index)

        try:
            interpretation.awkward_form(None)
        except uproot.interpretation.objects.CannotBeAwkward:
            pass
        else:
            array = _libraries[Awkward.name].finalize(
                array, branch, interpretation, entry_start, entry_stop, options
            )
            if isinstance(
                array.type.content, uproot.extras.awkward().types.NumpyType
            ) and array.layout.minmax_depth == (1, 1):
                array = array.to_numpy()
            else:
                array = uproot.extras.awkward_pandas().AwkwardExtensionArray(array)

        return pandas.Series(array, index=index)

    def group(self, arrays, expression_context, how):
        pandas = self.imported

        if how is tuple:
            return tuple(arrays[name] for name, _ in expression_context)
        elif how is list:
            return [arrays[name] for name, _ in expression_context]
        elif how is dict:
            return {_rename(name, c): arrays[name] for name, c in expression_context}
        elif isinstance(how, str) or how is None:
            arrays, names = _pandas_only_series(pandas, arrays, expression_context)
            if len(arrays) == 0:
                return pandas.DataFrame()
            else:
                arrays = {
                    k: v
                    if isinstance(v, (pandas.Series, pandas.DataFrame))
                    else pandas.Series(v, name=k)
                    for k, v in arrays.items()
                }
                out = pandas.concat(arrays, axis=1, ignore_index=True)
                out.columns = names
                return out
        else:
            raise TypeError(
                f"for library {self.name}, how must be tuple, list, dict, str (for "
                "pandas.merge's 'how' parameter), or None (for one or more "
                "DataFrames without merging)"
            )

    def global_index(self, arrays, global_offset):
        if isinstance(arrays, tuple):
            return tuple(self.global_index(x, global_offset) for x in arrays)
        elif isinstance(arrays, list):
            return [self.global_index(x, global_offset) for x in arrays]

        if type(arrays.index).__name__ == "RangeIndex":
            index_start = arrays.index.start
            index_stop = arrays.index.stop
            arrays.index = type(arrays.index)(
                index_start + global_offset, index_stop + global_offset
            )
        else:
            index = arrays.index.arrays
            numpy.add(index, global_offset, out=index)

        return arrays

    def concatenate(self, all_arrays):
        pandas = self.imported

        if len(all_arrays) == 0:
            return all_arrays

        if isinstance(all_arrays[0], (tuple, list)):
            keys = range(len(all_arrays[0]))
        elif isinstance(all_arrays[0], dict):
            keys = list(all_arrays[0])
        else:
            return pandas.concat(all_arrays)

        to_concatenate = {k: [] for k in keys}
        for arrays in all_arrays:
            for k in keys:
                to_concatenate[k].append(arrays[k])

        concatenated = {k: pandas.concat(to_concatenate[k]) for k in keys}

        if isinstance(all_arrays[0], tuple):
            return tuple(concatenated[k] for k in keys)
        elif isinstance(all_arrays[0], list):
            return [concatenated[k] for k in keys]
        elif isinstance(all_arrays[0], dict):
            return concatenated

Some of the preparation steps are not Pandas-specific and can be reused in a new library (fourth after library="np", library="ak", and library="pd"). Do you know the Polars data constructors well enough to do that?

Actually, now that I think of it, Polars columns are in Apache Arrow format. Maybe we could use ak.to_arrow or ak.to_arrow_table instead of expanding awkward-pandas. @Esword618, do you know enough about getting data into Polars to know if there's an easy way to do it with a pyarrow array or Table?

@Esword618
Author

I'm also new to Polars and not very familiar with it, but I'm willing to learn about it and try to add this feature to uproot.

@jpivarski
Member

Okay, thanks! The first question that could make short work of this is to see if Polars has any constructor that turns a pyarrow array or a pyarrow Table into a DataFrame. If this is true, then there would be almost no work on our side.

Here's a way to make a pyarrow array or Table (other than using pyarrow's own constructors; I think Awkward Arrays are easier):

>>> import awkward as ak
>>> ak_array = ak.Array([
...     {"col1": 1.1, "col2": [1]},
...     {"col1": 2.2, "col2": [1, 2]},
...     {"col1": 3.3, "col2": [1, 2, 3]},
... ])
>>> ak.to_arrow(ak_array)
<awkward._connect.pyarrow.AwkwardArrowArray object at 0x738fb207b880>
-- is_valid: all not null
-- child 0 type: extension<awkward<AwkwardArrowType>>
  [
    1.1,
    2.2,
    3.3
  ]
-- child 1 type: extension<awkward<AwkwardArrowType>>
  [
    [
      1
    ],
    [
      1,
      2
    ],
    [
      1,
      2,
      3
    ]
  ]
>>> ak.to_arrow_table(ak_array)
pyarrow.Table
col1: extension<awkward<AwkwardArrowType>> not null
col2: extension<awkward<AwkwardArrowType>> not null
----
col1: [[1.1,2.2,3.3]]
col2: [[[1],[1,2],[1,2,3]]]

I'd expect a pyarrow array to be something like a Series and a pyarrow Table to be something like a DataFrame. Arrow makes a distinction between records with named fields in an array and the top-level fields of a Table. You might try different ak_arrays, including simpler ones like

>>> ak_array = ak.Array([1.1, 2.2, 3.3])

and more complex ones like

>>> ak_array = ak.Array([1.1, 2.2, 3.3, [1, 2, 3, None]])

@Esword618
Author

Dear @jpivarski:
I did try it: I read the ROOT file, read the TTree data inside it, converted it to NumPy format, and then converted that to Polars. It works. What should I do next? Other suggestions are welcome as well, of course; I want to contribute something to Uproot.
[screenshot: reading the TTree with Uproot and building a Polars DataFrame from the NumPy arrays]

@jpivarski
Member

That's great! There's an inefficiency in that pathway, though: those NumPy arrays have dtype=object, so arrays nested inside arrays are separate Python objects, with all of the memory bloat and extra CPU cycles that implies. (If it were just a few percent, I wouldn't bother mentioning it, but it's usually an order of magnitude effect.)

Is it possible to do this?

import awkward as ak
import polars as pl

dict_of_awkward_arrays = mu_tree.arrays(..., library="ak", how=dict)
dict_of_arrow_arrays = {k: ak.to_arrow(v, extensionarray=False) for k, v in dict_of_awkward_arrays.items()}
list_of_polars_series = [pl.Series(k, v) for k, v in dict_of_arrow_arrays.items()]
polars_df = pl.DataFrame(list_of_polars_series)
polars_df

Or this?

import awkward as ak
import pyarrow as pa
import polars as pl

awkward_array = mu_tree.arrays(..., library="ak")
arrow_table = ak.to_arrow_table(awkward_array, extensionarray=False)
polars_df = pl.DataFrame(arrow_table)
polars_df

(Replace ... with desired branches, like ["evtID", "MuMult", "PDG"].)

Looking at the Polars documentation, pl.Series allows pyarrow arrays (pa.array) as one of its ArrayLike types, and pl.DataFrame allows pyarrow Tables (pa.Table) as one of its FrameInitTypes, so I think both of the above should work. The route that goes through Series is more explicit and perhaps better for that reason.

(The extensionarray=False argument, described in the ak.to_arrow and ak.to_arrow_table documentation, prevents Awkward Array from adding metadata to the pyarrow arrays or Tables that would be necessary to convert back to an Awkward Array without losing any information, but not all Arrow-compliant tools know what to do with ExtensionArrays, so for safety, I've left it off. Does Polars know what to do with it? If it doesn't mind consuming an array with extensionarray=True, then I wonder what would happen if the output of pl.DataFrame.to_arrow or pl.Series.to_arrow is passed through ak.from_arrow: would it return an Awkward Array without loss?)

@jpivarski jpivarski added this to Feb 15‒22 in Finalization Jan 30, 2024
@jpivarski jpivarski moved this from Feb 15‒22 to Before in Finalization Jan 30, 2024
@jpivarski jpivarski moved this from Before to Mar 7‒14 in Finalization Jan 30, 2024
@ioanaif ioanaif moved this from Mar 7‒14 to Mar 21‒28 in Finalization Mar 21, 2024