-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use arrow by default? #278
Comments
I think the GDAL Arrow integration is groundbreaking in its potential but is very underused right now, so I'd love to see it become the default in the future. I do understand, however, that for compatibility reasons some might be hesitant to make it the default.
If I'm reading it correctly, the cases not supported by the Arrow integration are these: Lines 1182 to 1191 in 94c2bb4
I suppose these aren't supported by GDAL? For One other downside of using Arrow by default is that it makes |
I'm open to the idea of making Arrow automatic if We'd have to check about GDAL support for the parts we're using to implement |
I looked at the impact to make the switch in geofileops and even though my tests there aren't meant to be a full coverage at all to all possible incompatibilities between fiona/pyogrio/pyogrio+arrow, only two (easy to fix) issues/differences turned up: #263 and #262. Maybe a beta version could be packaged with arrow as default to make it easy to plug it in in the CI tests of some projects we are involved in to try to get a better coverage of potential issues? |
Unfortunately, it looks like Arrow support is disabled for GDAL in VCPK (we use VCPKG for building wheels). I'm not familiar enough with VCPKG to know if it is possible for us to get Arrow installed ourselves and then override the GDAL VCPKG port to build against Arrow. |
GDAL ignores setting the feature index to read (via If Using a |
Another option to deal with this is via sql using e.g. following sql statement: f'SELECT * FROM "{layername}" OFFSET {skip_features} LIMIT {max_features}' To be tested what performance is for non-sqlite based file formats... |
The We could consider doing something similar: support consuming the ArrowStream without actually having a hard dependency on pyarrow. I don't how important this use case of pyogrio is, but it would address the possible concern about adding pyarrow as required dependency:
One option for this could even be to just re-use the code of the GDAL python bindings, where they have an implementation of |
And to be explicit: I agree very much with the gist of Martin's top post: ideally we give users just the best default (for geopandas we now ask users to specify |
Something else: it would be a good start that we ourselves start to test the |
I think it would be practical to be able to overrule the default values for those kind of parameters in an environment variable. Par example for This makes it easy to change/test this for anyone without making any code changes. If there are issues, you can just as easily roll it back or if there are only issues in some specific places, you can force one or the other specifically there. Also in tests it is practical to deal with it like that: make a fixture that sets the environment variable first to True then to False. But, obviously it is also not a problem to use a pytest mark/fixture and add In |
Thorough testing is a good idea for sure. I agree that having pyarrow as hard dependency is a no-go but I'd suggest setting default to None and use arrow by default if pyarrow is available and standard reading of it is not. That feels like an optimal balance between defaulting to the fastest path and depending on arrow. It is a bit similar as what a lot of packages do with numba. It works without it but it is much better if numba is installed. |
I wasn't following the implementation of
use_arrow
in detail, so I'm wondering - would it make sense to default to True for cases where all of the options are compatible with it? Maybe changing the actual default toNone
and having a smart heuristic deciding whether it could be used to provide a performance boost or not due to compatibility issues.I would like to avoid the same situation we have now in geopandas, where we have a faster way or reading and writing files than the default but you have to opt-in (
engine="pyogrio"
). If arrow can be used at least for a subset of use cases automatically, I think it can be beneficial.Is there anything major blocking that?
The text was updated successfully, but these errors were encountered: