Using pd.DataFrame(tensor) is abnormally slow, you can make the following modifications #44616

YeahNew · 2021-11-25T12:39:01Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import time

import numpy as np
import pandas as pd
import torch

row = 700000
col = 64
val_numpy = np.random.rand(row, col)
val_tensor = torch.randn(row, col)

numpy_pd_start_time = time.time()
val_numpy_pd = pd.DataFrame(val_numpy)
numpy_pd_end_time = time.time()
print("numpy to pd time:{:.4f}s".
      format(numpy_pd_end_time - numpy_pd_start_time))

tensor_numpy_pd_start_time = time.time()
val_tensor_pd1 = pd.DataFrame(val_tensor.numpy())
tensor_numpy_pd_end_time = time.time()
print("tensor to numpy to pd time:{:.4f} s".
      format(tensor_numpy_pd_end_time - tensor_numpy_pd_start_time))

tensor_pd_start_time = time.time()
val_tensor_pd2 = pd.DataFrame(val_tensor)
tensor_pd_end_time = time.time()
print("tensor to pd time:{:.4f} s".
      format(tensor_pd_end_time - tensor_pd_start_time))

Issue Description

Converting a torch.Tensor to a pandas DataFrame directly with pd.DataFrame() is very slow, whereas converting the tensor to a NumPy array first and then to a DataFrame is very fast. The test code is shown in the Reproducible Example above.
It prints the following:

numpy to pd time: 0.0013s
tensor to numpy to pd time:0.0005s
tensor to pd time:220.5251s

I then read the source code and found that when pd.DataFrame() receives a tensor, the tensor is processed as list-like (line 682 in https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py).
Most of the time is spent in the following three stages:

data = list(data): 2.5952s
nested_data_to_arrays: 214.7532s
arrays_to_mgr: 2.5987s

The nested_data_to_arrays stage involves a large number of data type conversions: the list of rows is converted to a list of columns, and the data is read row by row. This takes a long time.
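To illustrate why the row-to-column step is expensive, here is a minimal pure-Python sketch of a row-wise transpose. It is not the pandas implementation; rows_to_columns is a hypothetical helper that only shows the access pattern, in which every element is visited individually. For the 700000 x 64 input above, that is 44,800,000 element visits.

```python
from typing import List, Sequence


def rows_to_columns(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Transpose a list of rows into a list of columns, one element at a time.

    Hypothetical helper for illustration only; not pandas code.
    """
    if not rows:
        return []
    n_cols = len(rows[0])
    columns: List[List[float]] = [[] for _ in range(n_cols)]
    for row in rows:                      # one pass per row
        for j, value in enumerate(row):   # one visit per element
            columns[j].append(value)
    return columns


rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
columns = rows_to_columns(rows)

# Number of element visits for the shape used in the report.
element_visits = 700000 * 64
```

With a tensor, each of those visits presumably also goes through Python-level indexing and type conversion, which would be consistent with the timings reported above.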

Admittedly, this usage may not be appropriate, but torch.Tensor is now widely used, and it is inevitable that people will pass it directly like this, with poor performance as a result. Could you add a comment at line 467 in frame.py, something like: "If data is a torch.Tensor, convert it to NumPy first (tensor.numpy())."
Alternatively, can I submit a PR? When the input is detected to be a tensor, it would perform the conversion and then fall through to the "elif isinstance(data, (np.ndarray, Series, Index))" check.

Looking forward to your reply ~

Installed Versions

pandas.__version__ == 1.3.4


twoertwein · 2021-11-25T15:30:00Z

I'm not sure how happy people would be adding pytorch as a dependency to pandas. (Could use if hasattr(data, "numpy")?)

Based on the documentation, one might expect that a pytorch tensor will be treated as an iterable.

It might be worth extending the documentation. Something along the lines of "For best performance, iterable objects, such as a Pytorch Tensor, that can efficiently be converted to a Numpy Array, should be converted before passing it to pd.DataFrame."
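The hasattr(data, "numpy") idea above could look roughly like the following sketch. It avoids importing pytorch by duck-typing on a numpy() method. FakeTensor and maybe_to_numpy are hypothetical names used only for illustration, and this is not what pandas does.

```python
import numpy as np


class FakeTensor:
    """Hypothetical stand-in for a tensor-like object exposing .numpy()."""

    def __init__(self, values):
        self._values = np.asarray(values, dtype=np.float32)

    def numpy(self):
        return self._values


def maybe_to_numpy(data):
    """Return data.numpy() if the object offers it, otherwise data unchanged."""
    numpy_method = getattr(data, "numpy", None)
    if callable(numpy_method):
        return numpy_method()
    return data


converted = maybe_to_numpy(FakeTensor([[1, 2], [3, 4]]))
untouched = maybe_to_numpy([[1, 2], [3, 4]])
```

One drawback of this approach is that any unrelated object that happens to define a numpy attribute would be caught by the check as well.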

jorisvandenbossche · 2021-11-25T19:25:34Z

I am not super familiar with pytorch, but I suppose they support the array interface? If we don't do that yet, I think we can certainly ensure to treat all objects like that as arrays instead of list-likes.

YeahNew · 2021-11-26T02:19:04Z

I am not super familiar with pytorch, but I suppose they support the array interface? If we don't do that yet, I think we can certainly ensure to treat all objects like that as arrays instead of list-likes.

Yes, I think they support the array interface, and it is easy to convert between a Tensor and a NumPy array. If the two data types are the same, the memory is shared after conversion. Judging from the test results above, treating a tensor as list-like is indeed not suitable.

YeahNew · 2021-11-26T02:21:18Z

I'm not sure how happy people would be adding pytorch as a dependency to pandas. (Could use if hasattr(data, "numpy")?)

Based on the documentation: one might expect that a pytorch tensor will be treated as an iterable.

It might be worth extending the documentation. Something along the lines of "For best performance, iterable objects, such as a Pytorch Tensor, that can efficiently be converted to a Numpy Array, should be converted before passing it to pd.DataFrame."

Yes, I think adding such a note is appropriate, because it is likely that someone will call pd.DataFrame(tensor) directly to create a DataFrame. That does not raise an error, but performance is very poor.
Or can I submit a PR to change it?

attack68 · 2021-11-26T08:17:28Z

PR is welcome.

jbrockmendel · 2021-11-26T18:55:20Z

IIRC from similar issues checking for an __array__ method in sanitize_array was a best-guess for a place to start

YeahNew · 2021-12-02T06:14:27Z

IIRC from similar issues checking for an __array__ method in sanitize_array was a best-guess for a place to start

Sorry, I don't understand what you mean. Could you describe it in more detail?

YeahNew · 2021-12-02T07:29:14Z

PR is welcome.
@attack68
Hi, I submitted a PR (#44719) that just adds a note. Do you think it is appropriate?
Thanks to @twoertwein.

jbrockmendel · 2021-12-03T02:37:26Z

IIRC from similar issues checking for an __array__ method in sanitize_array was a best-guess for a place to start

Sorry, I don't understand what you mean. Could you describe it in more detail?

Never mind, that advice was wrong. Better advice: in frame.py L707-708 we check if not isinstance(data, (abc.Sequence, ExtensionArray)): data = list(data). That list conversion is what you want to avoid. Two options come to mind: in pytorch make a fix so that isinstance(val_tensor, abc.Sequence) is True, or add a check somewhere before 708 for an __array__ attribute
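As a rough sketch of the second option, a check for an __array__ attribute placed before the list conversion might look like this. coerce_input and ArrayLike are hypothetical names, and the real code in frame.py is more involved; the sketch only shows the order of the checks.

```python
from collections import abc

import numpy as np


class ArrayLike:
    """Hypothetical object that implements the NumPy __array__ protocol."""

    def __init__(self, values):
        self._values = np.asarray(values)

    def __array__(self, dtype=None, copy=None):
        # Return the underlying buffer, cast only if a dtype is requested.
        if dtype is None:
            return self._values
        return self._values.astype(dtype)

    def __iter__(self):
        return iter(self._values)


def coerce_input(data):
    """Prefer a single np.asarray call over materializing a list of rows."""
    if hasattr(data, "__array__"):
        return np.asarray(data)        # one bulk conversion
    if not isinstance(data, abc.Sequence):
        return list(data)              # fallback for plain iterables
    return data


from_array_like = coerce_input(ArrayLike([[1, 2], [3, 4]]))
from_generator = coerce_input(x for x in (1, 2, 3))
from_list = coerce_input([1, 2, 3])
```

Here the array-like object never reaches list(data), which is the conversion the comment above wants to avoid.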

NeilGirdhar · 2021-12-05T18:35:07Z

make a fix so that isinstance(val_tensor, abc.Sequence)

This won't work because tensors are not sequences. (See numpy/numpy#2776 (comment))

or add a check somewhere before 708 for an __array__ attribute

This sounds reasonable!
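The point about sequences can be checked directly with NumPy, whose ndarray is likewise not registered as a collections.abc.Sequence. The snippet below is only an illustration using NumPy arrays; it assumes pytorch tensors behave the same way, as the comment above states.

```python
from collections import abc

import numpy as np

arr = np.zeros((2, 3))

# ndarray supports len() and indexing but is not an abc.Sequence.
array_is_sequence = isinstance(arr, abc.Sequence)

# A plain list is a Sequence.
list_is_sequence = isinstance([1, 2, 3], abc.Sequence)

# The array does expose the __array__ protocol, which is what the
# proposed check would look for instead.
array_has_protocol = hasattr(arr, "__array__")
```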

jorisvandenbossche · 2021-12-06T10:48:02Z

I think on this line:

pandas/pandas/core/frame.py

Line 672 in ca81e6c

elif isinstance(data, (np.ndarray, Series, Index)):

we would need to also catch "array-likes", so those are passed through to ndarray_to_mgr, which can then coerce any non-numpy array like into a numpy array (np.asarray)?

jbrockmendel · 2021-12-06T20:17:43Z

we would need to also catch "array-likes", so those are passed through to ndarray_to_mgr, which can then coerce any non-numpy array like into a numpy array (np.asarray)?

IIRC trying to add EAs to go through that path broke some stuff, but I'd be very happy to be wrong about this.

jbrockmendel · 2021-12-07T02:38:05Z

IIRC trying to add EAs to go through that path broke some stuff, but I'd be very happy to be wrong about this.

Found it in my notes. According to past-me, having EAs go through that branch on L672 broke 5 test_apply_series_on_date_time_index_aware_series tests bc PandasArray[object] going through treat_as_nested paths. This motivated #43986

YeahNew added the Bug and Needs Triage labels Nov 25, 2021

jbrockmendel added the Constructors and Performance labels Nov 26, 2021

lithomas1 added the Compat label and removed the Needs Triage label Nov 30, 2021

This was referenced Dec 1, 2021

ENH: Please consider adding a very simple abstract base class that promises the array interface numpy/numpy#20459

Closed

Please consider checking ndarray types using the array interface #44617

Closed

YeahNew mentioned this issue Dec 2, 2021

Added the note of class DataFrame #44719

Closed

jbrockmendel mentioned this issue Dec 21, 2021

PERF: DataFrame(pytorch_tensor) #45007

Merged

4 tasks

jreback added this to the 1.4 milestone Dec 23, 2021

jreback closed this as completed in #45007 Dec 23, 2021

HarryCollins2 mentioned this issue Jan 16, 2024

BUG: <PERF: Using np.reshape(tensor) is slow> numpy/numpy#25591

Closed
