Want to support library export as Polars. #1096

Open
Esword618 opened this issue Jan 24, 2024 · 5 comments
Labels
feature New feature or request

Comments

@Esword618

It would be nice if the library supported exporting data as Polars DataFrames.

@Esword618 Esword618 added the feature New feature or request label Jan 24, 2024
@jpivarski
Member

I put a "thumbs up" on this. In the context of https://github.com/intake/awkward-pandas, we've been thinking about Polars, too. (Extending awkward-pandas would be a necessary step to get ragged arrays in Polars, though in principle it could be added to Uproot for flat arrays now.)

Incidentally, all of the Pandas conversion happens in one file:

def _is_pandas_rangeindex(pandas, index):
    if hasattr(pandas, "RangeIndex") and isinstance(index, pandas.RangeIndex):
        return True
    if hasattr(index, "is_integer") and index.is_integer():
        return True
    if uproot._util.parse_version(pandas.__version__) < uproot._util.parse_version(
        "1.4.0"
    ) and isinstance(index, pandas.Int64Index):
        return True
    return False


def _strided_to_pandas(path, interpretation, data, arrays, columns):
    for name, member in interpretation.members:
        if not name.startswith("@"):
            p = (*path, name)
            if isinstance(member, uproot.interpretation.objects.AsStridedObjects):
                _strided_to_pandas(p, member, data, arrays, columns)
            else:
                arrays.append(data["/".join(p)])
                columns.append(p)


def _pandas_basic_index(pandas, entry_start, entry_stop):
    if hasattr(pandas, "RangeIndex"):
        return pandas.RangeIndex(entry_start, entry_stop)
    else:
        return pandas.Int64Index(range(entry_start, entry_stop))


def _pandas_only_series(pandas, original_arrays, expression_context):
    arrays = {}
    names = []
    for name, context in expression_context:
        arrays[_rename(name, context)] = original_arrays[name]
        names.append(_rename(name, context))
    return arrays, names


class Pandas(Library):
    """
    A :doc:`uproot.interpretation.library.Library` that presents ``TBranch``
    data as Pandas Series and DataFrames. The standard name for this library is
    ``"pd"``.

    The single-``TBranch`` (with a single ``TLeaf``) form for this library is
    ``pandas.Series``, and the "group" form is ``pandas.DataFrame``.

    The "group" behavior for this library is:

    * ``how=None`` or a string: passed to ``pandas.merge`` as its ``how``
      parameter, which would be relevant if jagged arrays with different
      multiplicity are requested.
    * ``how=dict``: a dict of str → array, mapping the names to
      ``pandas.Series``.
    * ``how=tuple``: a tuple of ``pandas.Series``, in the order requested.
      (Names are assigned to the ``pandas.Series``.)
    * ``how=list``: a list of ``pandas.Series``, in the order requested.
      (Names are assigned to the ``pandas.Series``.)

    Pandas Series and DataFrames are indexed, so ``global_index`` adjusts them.
    """

    name = "pd"

    @property
    def imported(self):
        return uproot.extras.pandas()

    def finalize(self, array, branch, interpretation, entry_start, entry_stop, options):
        pandas = self.imported
        index = _pandas_basic_index(pandas, entry_start, entry_stop)

        if (
            isinstance(array, numpy.ndarray)
            and array.dtype.names is None
            and len(array.shape) == 1
            and array.dtype != numpy.dtype(object)
        ):
            return pandas.Series(array, index=index)

        try:
            interpretation.awkward_form(None)
        except uproot.interpretation.objects.CannotBeAwkward:
            pass
        else:
            array = _libraries[Awkward.name].finalize(
                array, branch, interpretation, entry_start, entry_stop, options
            )
            if isinstance(
                array.type.content, uproot.extras.awkward().types.NumpyType
            ) and array.layout.minmax_depth == (1, 1):
                array = array.to_numpy()
            else:
                array = uproot.extras.awkward_pandas().AwkwardExtensionArray(array)

        return pandas.Series(array, index=index)

    def group(self, arrays, expression_context, how):
        pandas = self.imported

        if how is tuple:
            return tuple(arrays[name] for name, _ in expression_context)
        elif how is list:
            return [arrays[name] for name, _ in expression_context]
        elif how is dict:
            return {_rename(name, c): arrays[name] for name, c in expression_context}
        elif isinstance(how, str) or how is None:
            arrays, names = _pandas_only_series(pandas, arrays, expression_context)
            if len(arrays) == 0:
                return pandas.DataFrame()
            else:
                arrays = {
                    k: v
                    if isinstance(v, (pandas.Series, pandas.DataFrame))
                    else pandas.Series(v, name=k)
                    for k, v in arrays.items()
                }
                out = pandas.concat(arrays, axis=1, ignore_index=True)
                out.columns = names
                return out
        else:
            raise TypeError(
                f"for library {self.name}, how must be tuple, list, dict, str (for "
                "pandas.merge's 'how' parameter), or None (for one or more "
                "DataFrames without merging)"
            )

    def global_index(self, arrays, global_offset):
        if isinstance(arrays, tuple):
            return tuple(self.global_index(x, global_offset) for x in arrays)
        elif isinstance(arrays, list):
            return [self.global_index(x, global_offset) for x in arrays]

        if type(arrays.index).__name__ == "RangeIndex":
            index_start = arrays.index.start
            index_stop = arrays.index.stop
            arrays.index = type(arrays.index)(
                index_start + global_offset, index_stop + global_offset
            )
        else:
            index = arrays.index.arrays
            numpy.add(index, global_offset, out=index)

        return arrays

    def concatenate(self, all_arrays):
        pandas = self.imported

        if len(all_arrays) == 0:
            return all_arrays

        if isinstance(all_arrays[0], (tuple, list)):
            keys = range(len(all_arrays[0]))
        elif isinstance(all_arrays[0], dict):
            keys = list(all_arrays[0])
        else:
            return pandas.concat(all_arrays)

        to_concatenate = {k: [] for k in keys}
        for arrays in all_arrays:
            for k in keys:
                to_concatenate[k].append(arrays[k])

        concatenated = {k: pandas.concat(to_concatenate[k]) for k in keys}

        if isinstance(all_arrays[0], tuple):
            return tuple(concatenated[k] for k in keys)
        elif isinstance(all_arrays[0], list):
            return [concatenated[k] for k in keys]
        elif isinstance(all_arrays[0], dict):
            return concatenated

Some of the preparation steps are not Pandas-specific and can be reused in a new library (fourth after library="np", library="ak", and library="pd"). Do you know the Polars data constructors well enough to do that?

Actually, now that I think of it, Polars columns are in Apache Arrow format. Maybe we could use ak.to_arrow or ak.to_arrow_table instead of expanding awkward-pandas. @Esword618, do you know enough about getting data into Polars to know if there's an easy way to do it with a pyarrow array or Table?

@Esword618
Author

I'm also new to Polars and not very familiar with it, but I'm willing to learn about it and try to add this feature to uproot.

@jpivarski
Member

Okay, thanks! The first question that could make short work of this is to see if Polars has any constructor that turns a pyarrow array or a pyarrow Table into a DataFrame. If this is true, then there would be almost no work on our side.

Here's a way to make a pyarrow array or Table (other than using pyarrow's own constructors; I think Awkward Arrays are easier):

>>> import awkward as ak
>>> ak_array = ak.Array([
...     {"col1": 1.1, "col2": [1]},
...     {"col1": 2.2, "col2": [1, 2]},
...     {"col1": 3.3, "col2": [1, 2, 3]},
... ])
>>> ak.to_arrow(ak_array)
<awkward._connect.pyarrow.AwkwardArrowArray object at 0x738fb207b880>
-- is_valid: all not null
-- child 0 type: extension<awkward<AwkwardArrowType>>
  [
    1.1,
    2.2,
    3.3
  ]
-- child 1 type: extension<awkward<AwkwardArrowType>>
  [
    [
      1
    ],
    [
      1,
      2
    ],
    [
      1,
      2,
      3
    ]
  ]
>>> ak.to_arrow_table(ak_array)
pyarrow.Table
col1: extension<awkward<AwkwardArrowType>> not null
col2: extension<awkward<AwkwardArrowType>> not null
----
col1: [[1.1,2.2,3.3]]
col2: [[[1],[1,2],[1,2,3]]]

I'd expect a pyarrow array to be something like a Series and a pyarrow Table to be something like a DataFrame. Arrow makes a distinction between records with named fields in an array and the top-level fields of a Table. You might try different ak_arrays, including simpler ones like

>>> ak_array = ak.Array([1.1, 2.2, 3.3])

and more complex ones like

>>> ak_array = ak.Array([1.1, 2.2, 3.3, [1, 2, 3, None]])

@Esword618
Author

Dear @jpivarski:
I did try it: I read the ROOT file, read the TTree data inside it, converted it to NumPy format, and then converted that to Polars. It works. What should I do next? Other suggestions are welcome as well, of course; I want to contribute something to Uproot.
[screenshot: reading the TTree with Uproot and building a Polars DataFrame from the NumPy arrays]

@jpivarski
Member

That's great! There's an inefficiency in that pathway, though: those NumPy arrays have dtype=object, so arrays nested inside arrays are separate Python objects, with all of the memory bloat and extra CPU cycles that implies. (If it were just a few percent, I wouldn't bother mentioning it, but it's usually an order of magnitude effect.)

Is it possible to do this?

import awkward as ak
import polars as pl

dict_of_awkward_arrays = mu_tree.arrays(..., library="ak", how=dict)
dict_of_arrow_arrays = {k: ak.to_arrow(v, extensionarray=False) for k, v in dict_of_awkward_arrays.items()}
list_of_polars_series = [pl.Series(k, v) for k, v in dict_of_arrow_arrays.items()]
polars_df = pl.DataFrame(list_of_polars_series)
polars_df

Or this?

import awkward as ak
import pyarrow as pa
import polars as pl

awkward_array = mu_tree.arrays(..., library="ak")
arrow_table = ak.to_arrow_table(awkward_array, extensionarray=False)
polars_df = pl.DataFrame(arrow_table)
polars_df

(Replace ... with desired branches, like ["evtID", "MuMult", "PDG"].)

Looking at the Polars documentation, pl.Series allows pyarrow arrays (pa.array) as one of its ArrayLike types, and pl.DataFrame allows pyarrow Tables (pa.Table) as one of its FrameInitTypes, so I think both of the above should work. The route that goes through Series is more explicit and perhaps better for that reason.

(The extensionarray=False argument, described in the ak.to_arrow and ak.to_arrow_table documentation, prevents Awkward Array from adding metadata to the pyarrow arrays or Tables that would be necessary to convert back to an Awkward Array without losing any information, but not all Arrow-compliant tools know what to do with ExtensionArrays, so for safety, I've left it off. Does Polars know what to do with it? If it doesn't mind consuming an array with extensionarray=True, then I wonder what would happen if the output of pl.DataFrame.to_arrow or pl.Series.to_arrow is passed through ak.from_arrow: would it return an Awkward Array without loss?)

@jpivarski jpivarski added this to Feb 15‒22 in Finalization Jan 30, 2024
@jpivarski jpivarski moved this from Feb 15‒22 to Before in Finalization Jan 30, 2024
@jpivarski jpivarski moved this from Before to Mar 7‒14 in Finalization Jan 30, 2024
@ioanaif ioanaif moved this from Mar 7‒14 to Mar 21‒28 in Finalization Mar 21, 2024