FEEDBACK: PyArrow as a required dependency and PyArrow backed strings #54466
Comments
Something that hasn't received enough attention/discussion, at least in my mind, is this piece of the Drawbacks section of the PDEP (bolding added by me):
I honestly don't understand how mandating a 170% increase in the effective size of a pandas installation (70MB to 190MB, from the numbers in the quoted text) can be considered okay. For that kind of increase, I would expect/want the tradeoff to be major improvements across the board. Instead, this change comes with limited benefit but massive bloat for anyone who doesn't need the features PyArrow enables, e.g. for those who don't have issues with the current functionality of pandas. |
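For reference, the size figure above can be checked with a quick calculation. This is only a sketch using the 70 MB and 190 MB numbers from the quoted PDEP text; `percent_increase` is a helper name made up for illustration.

```python
def percent_increase(before_mb: float, after_mb: float) -> float:
    """Relative growth from before_mb to after_mb, as a percentage."""
    return (after_mb - before_mb) / before_mb * 100.0


# Figures taken from the PDEP text quoted in the comment above.
before_mb, after_mb = 70, 190

added_mb = after_mb - before_mb
growth = percent_increase(before_mb, after_mb)
print(f"+{added_mb} MB, about {growth:.0f}% larger")  # +120 MB, about 171% larger
```

So "170%" is a fair rounding of the relative increase, and the absolute increase is 120 MB per installation.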
Debian/Ubuntu have system packages for pandas but not for pyarrow, so shipping pandas as a system package would no longer be possible. (System packages are not allowed to depend on non-system packages.) I don't know whether creating a system package of pyarrow is possible with reasonable effort, or whether this would make the system pandas packages impossible to update (and eventually require their removal when old pandas was no longer compatible with current Python/numpy). |
Yeah unfortunately this is where the subjective tradeoff comes into effect. pytz and dateutil as required dependencies have a similar issue for users who do not need timezone or date parsing support respectively. The hope with pyarrow is that the tradeoff improves the current functionality for common "object" types in pandas such as text, binary, decimal, and nested data.
AFAIK most pydata projects don't actually publish/manage Linux system packages for their respective libraries. Do you know how these are packaged today? |
The pytz and dateutil wheels are only ~500kb. Drawing a comparison between them and PyArrow seems like a stretch, to put it lightly. |
By whoever offers to do it, currently me for pandas. Of the pydata projects, Debian currently has pydata-sphinx-theme, sparse, patsy, xarray and numexpr. An old discussion thread (anyone can post there, but be warned that doing so will expose your non-spam-protected email address) suggests that there is existing work on a pyarrow Debian package, but I don't yet know whether it ever got far enough to work. |
Hi, Thanks for welcoming feedback from the community. While I respect your decision, I am afraid that making
Packages size
Have you considered those two observations as drawbacks before taking the decision? |
This is discussed a bit in https://github.com/pandas-dev/pandas/pull/52711/files#diff-3fc3ce7b7d119c90be473d5d03d08d221571c67b4f3a9473c2363342328535b2R179-R193 While currently the build size for pyarrow is pretty large, it doesn't "have" to be that big. I think by pandas 3.0 (cc @jorisvandenbossche for more info on this) I'm not an Arrow dev myself, but if it is something that just needs someone to look at, I'm happy to put some time in to help give Arrow a nudge in the right direction. Finally, for clarity purposes, is the reason for concern also AWS lambda/pyodide/Alpine, or something else? (IMO, outside of stuff like lambda funcs, pyarrow isn't too egregious in terms of package size compared to pytorch/tensorflow but it's definitely something that can be improved) |
If Edit: See conda-forge/arrow-cpp-feedstock#1035 |
Hi, Thanks for welcoming feedback from the community. With |
There is another way - use virtual environments in user space instead of system python. The Python Software Foundation recommends users create virtual environments; and Debian/Ubuntu want users to leave the system python untouched to avoid breaking system python. Perhaps Pandas could add some warnings or error messages on install to steer people to virtualenv. This approach might avoid or at least defer the work of adding pyarrow to APT as well as the risks of users breaking system python. Also, when I'm building projects I might want a much later version of pandas/pyarrow than would ever ship on Debian given the release strategy/timing delay. On the other hand, the arrow backend has significant advantages and with the rise of other important packages in the data space that also use pyarrow (polars, dask, modin), perhaps there is sufficient reason to add pyarrow to APT sources. A good summary that might be worth checking out is Externally managed environments. The original PEP 668 is found here. |
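To illustrate the virtual-environment route, here is a minimal sketch using only the standard-library `venv` module. The helper name `create_user_env` and the directory name `pandas-env` are made up for this example, and the environment is created in a temporary directory purely so the snippet cleans up after itself.

```python
import tempfile
import venv
from pathlib import Path


def create_user_env(env_dir: Path) -> Path:
    """Create a virtual environment at env_dir and return its config file path."""
    # with_pip=False keeps this sketch fast and offline; pass with_pip=True
    # to bootstrap pip so that pandas/pyarrow can then be installed into it.
    builder = venv.EnvBuilder(with_pip=False)
    builder.create(env_dir)
    return env_dir / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as tmp:
    cfg = create_user_env(Path(tmp) / "pandas-env")
    env_created = cfg.is_file()
    print("environment created:", env_created)
```

In day-to-day use the equivalent is `python -m venv <dir>` followed by installing pandas (and, if wanted, pyarrow) with that environment's pip, which leaves the system Python untouched.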
I think it's the right path for performance in WASM. |
This is a good idea!
|
Regarding concat: This should already be zero copy:
This creates a new dataframe that has 2 pyarrow chunks. Can you open a separate issue if this is not what you are looking for? |
@phofl
|
If this happens, would We’re currently thinking about coercing strings in our library, but hesitating because of the unclear future here. |
Arrow is a beast to build, and even harder to fit into a wheel properly (so you get fewer features, and things like using the slimmed-down libarrow will be harder to pull off). Conda-forge builds for py312 have been available for a month already though, and are ready in principle to ship pyarrow with a minimal libarrow. That still needs some usability improvements, but it's getting there. |
Without weighing in on whether this is a good idea or a bad one, Fedora Linux already has a I’m not saying that Pandas is easy to keep packaged, up to date, and coordinated with its dependencies and reverse dependencies! Just that a hard dependency on PyArrow wouldn’t necessarily make the situation worse for us. |
@h-vetinari Almost there? :-) |
There is still a lot of work to be done on the wheels side but for conda after the work we did to divide the CPP library, I created this PR which is currently under discussion in order to provide both a |
Thanks for requesting feedback. I'm not well versed in the technicalities, but I strongly prefer to not require pyarrow as a dependency. It's better imo to let users choose to use PyArrow if they desire. I prefer to use the default NumPy object type or pandas' StringDtype without the added complexity of PyArrow. |
Thank you, it's working. |
I too was hoping to use pandas in an embedded AWS lambda function. If the size explodes, this will be a huge overhead. I am currently using about 0.004% of the pandas library. From the looks of this discussion, my usage will not change nor will I ever need pyarrow but I will now be using 0.0015% of the pandas library, and paying dearly for it, probably by abandoning this bloated software. I have found and verified that the deprecation warning can be suppressed with this : #54466 (comment) Does anyone have a procedure for installing pyarrow in cygwin? Note: straightforward installation does not work.
|
What is the minimum version of PyArrow that will work with pandas? |
This is the exact error message I am receiving in my terminal when running the code:
"/home/project/banks_project.py:3: DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at pandas-dev/pandas#54466
  import pandas as pd
Traceback (most recent call last):
  File "/home/project/banks_project.py", line 55, in <module>
    df = extract(url, table_attribs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/project/banks_project.py", line 30, in extract
    bank_name = col[1].find_all('a')[1]['title']
                ~~~^^^
IndexError: list index out of range"
@dwgillies I don't use Cygwin so I can only help a little with your installation issue. Pyarrow doesn't provide a wheel for your OS and architecture. So, pip is trying to build a wheel from source. In order to build from source, pyarrow requires that you have libarrow installed. If you install libarrow, then try to pip install again, it might work. |
Importing model |
I don't consider this a good decision; it will bring a huge increase in installation size :( |
That's a great question - many companies rely on Python + Pandas running in cygwin, mingw (through git-bash) and Msys on their Windows work PCs. It is often the best way to have a useful Python dev env in a corporate environment. Will Pandas+PyArrow be supported in these environments? If not, there is a high risk of lots of outdated installations because these environments are rather sticky once deployed, and there is no easy way to upgrade to Linux or WSL. |
Our issues kinda match buddy. I use pandas in my android app which ships a cross compiled copy of python and of pandas compiled using crossenv. PyArrow's installation doesn't work there either... And triggers some weird errors |
**warning:** Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at pandas-dev/pandas#54466 ```py import pandas as pd ``` Signed-off-by: Avelino <31996+avelino@users.noreply.github.com>
My general concern with the mandatory PyArrow dependency is chasing competing standards and dependency issues like bugs. Kindly recall PDEP 10 lists three key benefits of pyarrow: (1) better pyarrow string memory/speed; (2) nested datatypes; and (3) interoperability.
PDEP Point 1 - Pandas 2.2.0 Performance
1brc
INPUT - 1 billion rows
OUTPUT - Temp mean/min/max by city
Memory
Turns out the city column 'object' format hogs 🐷 90% of the 'deep' memory usage ⵜ. This is indeed an issue! The last 10% of memory is temperatures. Downcasting to 'float32' halves memory for the temperature column.
ⵜ Memory Footnote:
Speed
PDEP Point 2 - Nesting
The existing alternative is use
PDEP Point 3 - Interoperability
TAKEAWAYS
The standout issue to me is the |
BTW, reading in a CSV file or parquet file is still faster by a factor of 5 for me when I do the reading with |
@hagenw Would you kindly explain the below result? Looks like parquet uses a lot more peak RAM. Windows users: In general, what is the large discrepancy between DataFrame memory shown by |
I measured peak memory consumption with So it seems to be more equal. The code I used to measure memory consumption is available at https://github.com/audeering/audb/blob/44de33f0fea1f4d003882d674dc696a8f0cfe95d/benchmarks/benchmark-dependencies-save-and-load.py. That uses |
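For anyone wanting to reproduce a peak-memory comparison without extra packages, here is a rough standard-library sketch. It is not the tool used in the benchmark linked above; it swaps in `resource.getrusage`, which is Unix-only, so it will not help on Windows. The helper name `peak_rss_mib` and the 50 MiB `bytearray` (a stand-in for loading a file) are assumptions for illustration.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of this process so far, in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kibibytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mib()
payload = bytearray(50 * 1024 * 1024)  # stand-in for loading a file
after = peak_rss_mib()
print(f"peak RSS before: {before:.1f} MiB, after: {after:.1f} MiB")
```

Because this reports the high-water mark for the whole process, each reader (CSV, parquet, pyarrow) should be measured in a fresh interpreter; otherwise an earlier, larger peak hides the later ones.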
`pyarrow` will be a future dependency for pandas: pandas-dev/pandas#54466
There have been / are some efforts to reduce the size of pandas (#30741); these efforts should not be wasted by a dependency which could perhaps remain optional (although I have no idea whether this is feasible). +120MB multiplied by the number of installs/environments/images/CI runs is not so small. It takes more time to download and install, more network usage, more storage... It's neither green, nor inclusive for situations/people/institutes/countries where resources are not as easily available as where these decisions are taken. |
This is an issue to collect feedback on the decision to make PyArrow a required dependency and to infer strings as PyArrow backed strings by default.
The background for this decision can be found here: https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html
If you would like to filter this warning without installing pyarrow at this time, please view this comment: #54466 (comment)
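As a rough illustration of that kind of filter, the sketch below uses only the standard-library `warnings` module. It assumes the pandas message begins with a newline followed by "Pyarrow"; that prefix is an assumption here, so check it against the warning you actually see. `emit_notice` and `count_notices` are hypothetical stand-ins so the snippet runs without pandas installed.

```python
import warnings

# Hypothetical stand-in for the notice emitted on `import pandas`;
# the leading newline is an assumption about the real message.
PYARROW_NOTICE = "\nPyarrow will become a required dependency of pandas ..."


def emit_notice() -> None:
    """Simulate the import-time DeprecationWarning."""
    warnings.warn(PYARROW_NOTICE, DeprecationWarning, stacklevel=2)


def count_notices(install_filter: bool) -> int:
    """Return how many warnings get through, with or without the filter."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        if install_filter:
            # The message argument is a regex matched against the start of
            # the warning text, so it has to include the leading newline.
            warnings.filterwarnings(
                "ignore", message=r"\nPyarrow", category=DeprecationWarning
            )
        emit_notice()
    return len(caught)


print(count_notices(install_filter=False), count_notices(install_filter=True))
```

In a real script the `warnings.filterwarnings(...)` call would go before `import pandas as pd`, since the warning is raised at import time.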