Erroneous 0 row behaviour with upstream modin and pandas versions #8

honno opened this issue Jul 4, 2022 · 1 comment
honno commented Jul 4, 2022

I wrote this initially for the modin issue tracker, but realised this is probably not appropriate (yet) for modin. This could actually be an upstream pandas issue too. Dumping this issue here to track for now.


Describe the problem

When a (modin) dataframe contains 0 rows, it cannot be interchanged to a library supporting the interchange protocol. This includes interchanging modin dataframes to modin itself.

Source code / logs

Note you have to monkey patch pandas.errors.DataError to pandas.core.base.DataError for modin to work with upstream pandas.
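
A minimal sketch of what such a monkey patch could look like. It assumes the intent is to make the name `DataError` reachable on `pandas.core.base` by pointing it at `pandas.errors.DataError`. The helper `alias_attribute` and the stand-in modules `new_home` and `old_home` are hypothetical, used only so the sketch runs without pandas installed.

```python
import types


def alias_attribute(source, target, name):
    """Expose `source.<name>` as `target.<name>` unless target already defines it."""
    if not hasattr(target, name):
        setattr(target, name, getattr(source, name))
    return getattr(target, name)


# Stand-in modules (hypothetical) that mimic a name living in one module
# while downstream code still looks it up in another.
new_home = types.ModuleType("new_home")
old_home = types.ModuleType("old_home")


class DataError(Exception):
    """Placeholder exception standing in for the real one."""


new_home.DataError = DataError

# After this call, lookups of old_home.DataError resolve to the same class.
alias_attribute(new_home, old_home, "DataError")

# With the real libraries the call would presumably be (assumption, run
# before importing modin):
#   import pandas.core.base, pandas.errors
#   alias_attribute(pandas.errors, pandas.core.base, "DataError")
```

The `hasattr` guard keeps the patch a no-op on pandas versions where the name already exists in the target module.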

>>> from modin import pandas as mpd
>>> df = mpd.DataFrame({"foo_col": mpd.Series([], dtype="int64")})  # dtype can be anything
>>> from modin.pandas.utils import from_dataframe as modin_from_dataframe
>>> modin_from_dataframe(df)
NotImplementedError: Non-string object dtypes are not supported yet
Full traceback
>>> modin_from_dataframe(df)
.../modin/modin/pandas/utils.py:123, in from_dataframe(df)
    120 from modin.core.execution.dispatching.factories.dispatcher import FactoryDispatcher
    121 from .dataframe import DataFrame
--> 123 return DataFrame(query_compiler=FactoryDispatcher.from_dataframe(df))

.../modin/modin/core/execution/dispatching/factories/dispatcher.py:175, in FactoryDispatcher.from_dataframe(cls, *args, **kwargs)
    172 @classmethod
    173 @_inherit_docstrings(factories.BaseFactory._from_dataframe)
    174 def from_dataframe(cls, *args, **kwargs):
--> 175     return cls.__factory._from_dataframe(*args, **kwargs)

.../modin/modin/core/execution/dispatching/factories/factories.py:197, in BaseFactory._from_dataframe(cls, *args, **kwargs)
    189 @classmethod
    190 @doc(
    191     _doc_io_method_template,
   (...)
    195 )
    196 def _from_dataframe(cls, *args, **kwargs):
--> 197     return cls.io_cls.from_dataframe(*args, **kwargs)

.../modin/modin/core/io/io.py:120, in BaseIO.from_dataframe(cls, df)
    105 @classmethod
    106 def from_dataframe(cls, df):
    107     """
    108     Create a Modin QueryCompiler from a DataFrame supporting the DataFrame exchange protocol `__dataframe__()`.
    109 
   (...)
    118         QueryCompiler containing data from the DataFrame.
    119     """
--> 120     return cls.query_compiler_cls.from_dataframe(df, cls.frame_cls)

.../modin/modin/core/storage_formats/pandas/query_compiler.py:279, in PandasQueryCompiler.from_dataframe(cls, df, data_cls)
    277 @classmethod
    278 def from_dataframe(cls, df, data_cls):
--> 279     return cls(data_cls.from_dataframe(df))

.../modin/modin/core/dataframe/pandas/dataframe/dataframe.py:2968, in PandasDataframe.from_dataframe(cls, df)
   2963 from modin.core.dataframe.pandas.exchange.dataframe_protocol.from_dataframe import (
   2964     from_dataframe_to_pandas,
   2965 )
   2967 ErrorMessage.default_to_pandas(message="`from_dataframe`")
-> 2968 pandas_df = from_dataframe_to_pandas(df)
   2969 return cls.from_pandas(pandas_df)

.../modin/modin/core/dataframe/pandas/exchange/dataframe_protocol/from_dataframe.py:68, in from_dataframe_to_pandas(df, n_chunks)
     66 pandas_dfs = []
     67 for chunk in df.get_chunks(n_chunks):
---> 68     pandas_df = protocol_df_chunk_to_pandas(chunk)
     69     pandas_dfs.append(pandas_df)
     71 pandas_df = pandas.concat(pandas_dfs, axis=0, ignore_index=True)

.../modin/modin/core/dataframe/pandas/exchange/dataframe_protocol/from_dataframe.py:102, in protocol_df_chunk_to_pandas(df)
    100     raise ValueError(f"Column {name} is not unique")
    101 col = df.get_column_by_name(name)
--> 102 dtype = col.dtype[0]
    103 if dtype in (
    104     DTypeKind.INT,
    105     DTypeKind.UINT,
    106     DTypeKind.FLOAT,
    107     DTypeKind.BOOL,
    108 ):
    109     columns[name], buf = primitive_column_to_ndarray(col)

.../pandas/pandas/_libs/properties.pyx:36, in pandas._libs.properties.CachedProperty.__get__()

.../pandas/pandas/core/exchange/column.py:125, in PandasColumn.dtype(self)
    118     if infer_dtype(self._col) == "string":
    119         return (
    120             DtypeKind.STRING,
    121             8,
    122             dtype_to_arrow_c_fmt(dtype),
    123             Endianness.NATIVE,
    124         )
--> 125     raise NotImplementedError("Non-string object dtypes are not supported yet")
    126 else:
    127     return self._dtype_from_pandasdtype(dtype)

NotImplementedError: Non-string object dtypes are not supported yet

The same error occurs when interchanging a modin dataframe to pandas,

>>> from pandas.api.exchange import from_dataframe as pandas_from_dataframe
>>> pandas_from_dataframe(df)
NotImplementedError: Non-string object dtypes are not supported yet

but interchanging a pandas dataframe to modin works just fine.

>>> import pandas as pd
>>> df2 = pd.DataFrame({"foo": pd.Series([], dtype="int64")})
>>> modin_from_dataframe(df2)
Empty DataFrame
Columns: [foo]
Index: []

System information

  • OS Platform and Distribution: Linux Ubuntu 20.04.4
  • Modin version: 0.15.0+6.g7df6cb33.dirty
  • pandas version: 1.5.0.dev0+953.g696e9bd04f.dirty
  • Python version: 3.8.10

honno commented Aug 8, 2022

With modin-project/modin#4652 I also get an inscrutable error when interchanging modin dataframes that contain categoricals to modin itself, which I don't get using normal pandas. Traceback is too large for GitHub lol, will document properly if it persists.
