Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using pd.DataFrame(tensor) is abnormally slow, you can make the following modifications #44616

Closed
3 tasks done
YeahNew opened this issue Nov 25, 2021 · 13 comments · Fixed by #45007
Closed
3 tasks done
Labels
Bug Compat pandas objects compatability with Numpy or Python functions Constructors Series/DataFrame/Index/pd.array Constructors Performance Memory or execution speed performance
Milestone

Comments

@YeahNew
Copy link

YeahNew commented Nov 25, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd
import torch

row = 700000
col = 64
val_numpy = np.random.rand(row, col)
val_tensor = torch.randn(row, col)

numpy_pd_start_time = time.time()
va_numpy_pd = pd.DataFrame(val_numpy)
numpy_pd_end_time = time.time()
print("numpy to pd time:{:.4f}s".
      format(numpy_pd_end_time - numpy_pd_start_time))

tensor_numpy_pd_start_time = time.time()
val_tensor_pd1 = pd.DataFrame(val_tensor.numpy())
tensor_numpy_pd_end_time = time.time()
print("tensor to numpy to pd time:{:.4f} s".
      format(tensor_numpy_pd_end_time - tensor_numpy_pd_start_time))

tensor_pd_start_time = time.time()
val_tensor_pd2 = pd.DataFrame(val_tensor)
tensor_pd_end_time = time.time()
print("tensor to pd time:{:.4f} s".
      format(tensor_pd_end_time - tensor_pd_start_time))

Issue Description

Recently, using pd.DataFrame() to convert data of type torch.tensor to pandas DataFrame is very slow, while converting tensor to numpy and then to pandas DataFrame is very fast. The test code is shown in the Reproducible Example.
The code prints as follows:

numpy to pd time: 0.0013s
tensor to numpy to pd time:0.0005s
tensor to pd time:220.5251s

Then I read the source code and found that if the data accepted by pd.DataFrame() is tensor, tensor will be processed as list_like (line 682 in https://github.com/pandas-dev/pandas/blob/master/pandas/core/ frame.py) .
Mainly time-consuming in the following three stages:

data = list(data):2.5952s
nested_data_to_arrays: 214.7532s
arrays_to_mgr:2.5987s

In the nested_data_to_arrays stage, a large number of data type conversion operations are involved, the row-list is converted to col-list, and the operation is read by row.This will take a long time.

Sure,This method of use may not be appropriate, but now torch.tensor is widely used, and it is inevitable that it will be used directly in this way, resulting in low efficiency. So can you add a comment at line 467 in frame.py, like this: If data is a torch.tensor, you can transform it to numpy first(tensor.numpy()).
Or can I submit a PR? When it is judged that the input parameter is tensor, execute the conversion, and then execute the ''elif isinstance(data, (np.ndarray, Series, Index))'' judgment.

Looking forward to your reply ~

Installed Versions

pandas.version == 1.3.4

@YeahNew YeahNew added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 25, 2021
@twoertwein
Copy link
Member

I'm not sure how happy people would be adding pytorch as a dependency to pandas. (Could use if hasattr(data, "numpy")?)

Based on the documentation:
image
one might expect that a pytorch tensor will be treated as an iterable.

It might be worth extending the documentation. Something along the lines of "For best performance, iterable objects, such as a Pytorch Tensor, that can efficiently be converted to a Numpy Array, should be converted before passing it to pd.DataFrame."

@jorisvandenbossche
Copy link
Member

I am not super familiar with pytorch, but I suppose they support the array interface? If we don't do that yet, I think we can certainly ensure to treat all objects like that as arrays instead of list-likes.

@YeahNew
Copy link
Author

YeahNew commented Nov 26, 2021

I am not super familiar with pytorch, but I suppose they support the array interface? If we don't do that yet, I think we can certainly ensure to treat all objects like that as arrays instead of list-likes.

Yes, I think they support the array interface, and it is easy to convert between Tensor and Numpy. If the two data types are the same, the memory will be shared after conversion. Judging from the above test results, it is indeed not suitable to convert tensor as list-likes.

@YeahNew
Copy link
Author

YeahNew commented Nov 26, 2021

I'm not sure how happy people would be adding pytorch as a dependency to pandas. (Could use if hasattr(data, "numpy")?)

Based on the documentation: image one might expect that a pytorch tensor will be treated as an iterable.

It might be worth extending the documentation. Something along the lines of "For best performance, iterable objects, such as a Pytorch Tensor, that can efficiently be converted to a Numpy Array, should be converted before passing it to pd.DataFrame."

Yes, I think it is appropriate to add such a comment, because it is likely that someone will directly use pd.DataFrame(tensor) to create a DataFrame, which will not report an error, but the performance is very low.
Or Can I submit a PR to modify it?

@attack68
Copy link
Contributor

PR is welcome.

@jbrockmendel
Copy link
Member

IIRC from similar issues checking for an __array__ method in sanitize_array was a best-guess for a place to start

@jbrockmendel jbrockmendel added Constructors Series/DataFrame/Index/pd.array Constructors Performance Memory or execution speed performance labels Nov 26, 2021
@lithomas1 lithomas1 added Compat pandas objects compatability with Numpy or Python functions and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 30, 2021
@YeahNew
Copy link
Author

YeahNew commented Dec 2, 2021

IIRC from similar issues checking for an __array__ method in sanitize_array was a best-guess for a place to start

Sorry, I don't understand what you mean? Would you like to describe it in detail?

@YeahNew
Copy link
Author

YeahNew commented Dec 2, 2021

PR is welcome.
@attack68
hi, sumbit a PR(#44719), Just add a note, do you think it is appropriate?
thanks for @twoertwein

@jbrockmendel
Copy link
Member

IIRC from similar issues checking for an array method in sanitize_array was a best-guess for a place to start

Sorry, I don't understand what you mean? Would you like to describe it in detail?

Never mind, that advice was wrong. Better advice: in frame.py L707-708 we check if not isinstance(data, (abc.Sequence, ExtensionArray)): data = list(data). That list conversion is what you want to avoid. Two options come to mind: in pytorch make a fix so that isinstance(val_tensor, abc.Sequence) is True, or add a check somewhere before 708 for an __array__ attribute

@NeilGirdhar
Copy link

NeilGirdhar commented Dec 5, 2021

make a fix so that isinstance(val_tensor, abc.Sequence)

This won't work because tensors are not sequences. (See numpy/numpy#2776 (comment))

or add a check somewhere before 708 for an array

This sounds reasonable!

@jorisvandenbossche
Copy link
Member

I think on this line:

elif isinstance(data, (np.ndarray, Series, Index)):

we would need to also catch "array-likes", so those are passed through to ndarray_to_mgr, which can then coerce any non-numpy array like into a numpy array (np.asarray)?

@jbrockmendel
Copy link
Member

we would need to also catch "array-likes", so those are passed through to ndarray_to_mgr, which can then coerce any non-numpy array like into a numpy array (np.asarray)?

IIRC trying to add EAs to go through that path broke some stuff, but I'd be very happy to be wrong about this.

@jbrockmendel
Copy link
Member

IIRC trying to add EAs to go through that path broke some stuff, but I'd be very happy to be wrong about this.

Found it in my notes. According to past-me, having EAs go through that branch on L672 broke 5 test_apply_series_on_date_time_index_aware_series tests bc PandasArray[object] going through treat_as_nested paths. This motivated #43986

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Compat pandas objects compatability with Numpy or Python functions Constructors Series/DataFrame/Index/pd.array Constructors Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants