Pandas DataFrame.to_msgpack and read_msgpack are deprecated #77

Open
arnevdk opened this issue Aug 29, 2023 · 1 comment
Labels
enhancement New feature or request good first issue Good for newcomers


arnevdk commented Aug 29, 2023

Steps to reproduce

When creating a timeflux.nodes.zmq.Pub node with the msgpack serializer like

module: timeflux.nodes.zmq
class: Pub
params:
    topic: eeg
    serializer: msgpack

timeflux errors with

'DataFrame' object has no attribute 'to_msgpack'
Traceback (most recent call last):
  File "<...>/timeflux/core/worker.py", line 57, in _run
    scheduler.run()
  File "<...>//timeflux/core/scheduler.py", line 22, in run
    self.next()
  File "<...>//timeflux/core/scheduler.py", line 59, in next
    self._nodes[step["node"]].update()
  File "<...>//timeflux/nodes/zmq.py", line 177, in update
    self._socket.send_serialized(
  File "<...>//site-packages/zmq/sugar/socket.py", line 854, in send_serialized
    frames = serialize(msg)
  File "<...>//timeflux/core/message.py", line 31, in msgpack_serialize
    return [topic, data.to_msgpack()]
  File "/home/arne/.virtualenvs/timeflux-workshop/lib64/python3.9/site-packages/pandas/core/generic.py", line 5989, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'to_msgpack'

Issue

When initialized with serializer='msgpack' or deserializer='msgpack', timeflux.nodes.zmq.Pub and timeflux.nodes.zmq.Sub respectively retrieve the timeflux.core.message.msgpack_serialize and timeflux.core.message.msgpack_deserialize functions, which in turn call pandas.DataFrame.to_msgpack and pandas.read_msgpack. Both have been deprecated since pandas 0.25 and were removed in pandas 1.0:
https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.DataFrame.to_msgpack.html
https://pandas.pydata.org/pandas-docs/version/0.25.3/reference/api/pandas.read_msgpack.html
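
Until the serializer is replaced, a guard could turn the bare AttributeError into a more actionable message. The following is only a minimal sketch: `to_msgpack_or_raise` is a hypothetical helper, not part of Timeflux, and it assumes nothing beyond the presence or absence of a `to_msgpack` method on the object it receives.

```python
def to_msgpack_or_raise(data):
    """Call data.to_msgpack() if available, else raise a descriptive error.

    Hypothetical helper for illustration; not part of Timeflux or pandas.
    """
    # getattr with a default returns None when the attribute is missing,
    # which is the case for DataFrame objects in recent pandas versions.
    method = getattr(data, "to_msgpack", None)
    if not callable(method):
        raise RuntimeError(
            "The msgpack serializer relies on DataFrame.to_msgpack, which is "
            "not provided by the installed pandas version; "
            "choose another serializer."
        )
    return method()
```

With such a check, the failure would point at the serializer choice rather than at pandas internals.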

Proposed fixes

Member

mesca commented Aug 30, 2023

Hi Arne,

Thanks for the report. I believe msgpack is seldom used by Timeflux users, so I agree we should deprecate it.

Actually, as you probably noticed, the code for pyarrow (de)serialization already exists, so a quick implementation would only require a conditional import of the module and an optional dependency on pyarrow.
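
As a rough illustration of what a conditional import could look like, here is a minimal sketch. The helper names `has_module` and `require_pyarrow` are hypothetical and do not exist in Timeflux; only the standard library's `importlib.util.find_spec` is relied upon.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` can be found."""
    return importlib.util.find_spec(name) is not None


def require_pyarrow():
    """Import pyarrow lazily, failing with an actionable message if absent.

    Hypothetical helper: it would be called only when the pyarrow
    (de)serializer is actually selected, so pyarrow stays optional.
    """
    if not has_module("pyarrow"):
        raise ImportError(
            "The pyarrow serializer requires the optional 'pyarrow' package."
        )
    import pyarrow  # imported here so the module loads without pyarrow

    return pyarrow
```

Deferring the import this way keeps the default installation free of the extra dependency while still giving a clear error to users who select the serializer without installing it.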

Outputs from Timeflux nodes are objects consisting of two components:

  • A DataFrame structure with a Pandas-compatible API;
  • An arbitrary Python dictionary containing metadata such as sample rate, context, etc.

So we still need a way to serialize the metadata. We could use Pickle, but that would limit interoperability to Python applications. Metadata can contain complex datatypes that could be more efficiently handled by libraries such as orjson, but we would introduce another dependency. Would a simple JSON serialization using the standard library work for you? If so, we could just log a warning if metadata (de)serialization fails.
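
To make the standard-library option concrete, here is a minimal sketch of JSON metadata (de)serialization that logs a warning and falls back to empty metadata on failure. The function names `serialize_meta` and `deserialize_meta` and the fallback behaviour are assumptions for illustration, not existing Timeflux code.

```python
import json
import logging

logger = logging.getLogger(__name__)


def serialize_meta(meta):
    """Encode a metadata dict as UTF-8 JSON bytes.

    On failure (e.g. a value that json cannot encode), log a warning
    and send empty metadata instead of raising.
    """
    try:
        return json.dumps(meta or {}).encode("utf-8")
    except (TypeError, ValueError) as error:
        logger.warning("Could not serialize metadata: %s", error)
        return b"{}"


def deserialize_meta(payload):
    """Decode UTF-8 JSON bytes into a metadata dict.

    On failure, log a warning and return an empty dict.
    """
    try:
        meta = json.loads(payload.decode("utf-8"))
    except ValueError as error:  # covers JSON and Unicode decode errors
        logger.warning("Could not deserialize metadata: %s", error)
        return {}
    return meta if isinstance(meta, dict) else {}
```

Plain JSON keeps the payload readable from non-Python subscribers, at the cost of silently dropping metadata that contains types such as NumPy arrays or datetimes.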

If that sounds like an acceptable solution for you, feel free to open a PR!

@mesca mesca added enhancement New feature or request good first issue Good for newcomers labels Aug 30, 2023