
feat: add hdfs support in python #1359

Draft
wants to merge 4 commits into main

Conversation

Kimahriman

Description


Adds support for HDFS in the Python package by enabling the hdfs feature from the Rust library. Additionally adds a simple integration test and docker-compose support to run it locally.

Had to update some of the workflows to accommodate this. The trickiest one was the Python 3.7 stage, which runs on an older OS whose clang is apparently too old to compile the HDFS object store.

This might mean Java is now required to build and develop anything Python-related? Not totally sure what side effects this might have; this is my first time playing around with this library (I'm an avid Spark user).

Related Issue(s)

Closes #1358

Documentation

@github-actions github-actions bot added the binding/python Issues for the Python package label May 12, 2023
@github-actions

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@Kimahriman Kimahriman changed the title Add HDFS support in Python feat: add hdfs support in python May 12, 2023
@roeap
Collaborator

roeap commented May 13, 2023

Thanks for doing this @Kimahriman. Could we add some documentation for users on what the prerequisites are for using this?

However, one thing you mentioned made me a little concerned, so let's make sure I got this right: would this mean we always require Java to be available when building the wheels and/or installing the library?

In that case I think we would have to find another way to make HDFS available on the Python side. For me, not requiring Java is one of the main drivers to even get involved in the project 😆, and having it as a requirement would break a lot of use cases. One potential approach could be something like #900.

@Kimahriman
Author

Yeah, it's kind of unfortunate. The pyarrow approach would be interesting, since that's how all the data files would be read at that point, right? I'm curious whether building pyarrow requires Java for however they do the libhdfs bindings.

Definitely not my area of expertise; I'm just a pyspark power user who thought this would be interesting to figure out.

Collaborator

@wjones127 wjones127 left a comment


Could we make hdfs an optional (and non-default) feature for the Python package?

Seems like it adds Java as a build dependency. That's okay if it's not the default. But does it also become a runtime dependency for users who aren't using HDFS? If not, I'm fine including it in the wheels by default. Otherwise, we might be better off finishing the fsspec wrapper.

```diff
@@ -39,7 +39,7 @@ features = ["extension-module", "abi3", "abi3-py37"]
 [dependencies.deltalake]
 path = "../rust"
 version = "0"
-features = ["azure", "gcs", "python", "datafusion", "unity-experimental"]
+features = ["azure", "gcs", "python", "datafusion", "unity-experimental", "hdfs"]
```
Collaborator


Could we make this an optional feature of the Python library, since it has that extra java dependency?

You can add below:

```toml
hdfs = ["deltalake/hdfs"]
```

And then in the Python release workflow, add it to the feature flags:

```yaml
FEATURES_FLAG: --no-default-features --features native-tls
```
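For context, declaring hdfs as a non-default Cargo feature of the Python crate might look roughly like this (a sketch only; the exact layout of the real python/Cargo.toml may differ):

```toml
# python/Cargo.toml (hypothetical sketch)
[features]
default = []                # hdfs stays opt-in, so plain builds skip Java
hdfs = ["deltalake/hdfs"]   # forwards to the Rust crate's hdfs feature
```

With this, `cargo build` (or a maturin build without extra flags) would not pull in the HDFS object store at all, and only builds that explicitly enable the feature would need the Java toolchain.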

@Kimahriman
Author

Kimahriman commented May 17, 2023

Yeah, I think that all would work! I don't think Java would be required at runtime unless you try to use HDFS, but I guess I should try to verify that. I made it an optional dependency that gets installed for the integration tests, and didn't touch the release workflows.

@Kimahriman
Author

Well, never mind; I think I was wrong. I used a fresh Docker container, installed Java, built with HDFS support, then uninstalled Java, and the unit tests fail with:

```
ImportError while loading conftest '/data/python/tests/conftest.py'.
tests/conftest.py:11: in <module>
    from deltalake import DeltaTable, write_deltalake
deltalake/__init__.py:1: in <module>
    from ._internal import PyDeltaTableError, RawDeltaTable, __version__, rust_core_version
E   ImportError: libjvm.so: cannot open shared object file: No such file or directory
```

so this might just not be viable; we might only be able to do the pyarrow-based approach.

@wjones127
Collaborator

That's unfortunate. Either datafusion-objectstore-hdfs needs to change to load the library when called (using libloading), or we'll need to support fsspec instead.
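The deferred-binding idea behind libloading (the Rust crate mentioned above) can be sketched in Python with the standard-library ctypes module: the shared library is resolved and dlopen'ed only when the feature is first used, not at import time. Here libm stands in for libjvm.so, and `load_native_lib` is a hypothetical helper, not an actual delta-rs or datafusion-objectstore-hdfs API:

```python
import ctypes
import ctypes.util

def load_native_lib(name: str) -> ctypes.CDLL:
    """Resolve and dlopen a shared library on first use rather than at
    import time, the same idea libloading enables on the Rust side."""
    path = ctypes.util.find_library(name)
    if path is None:
        raise ImportError(
            f"lib{name} not found; install it to use this optional feature"
        )
    return ctypes.CDLL(path)

# libm stands in for libjvm here: nothing is dlopen'ed until the feature
# is actually used, so importing the package works without Java installed.
libm = load_native_lib("m")
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]
print(libm.sqrt(9.0))  # → 3.0
```

With this pattern, a user without the library only hits the ImportError when they actually call into the HDFS code path, instead of at `import deltalake` as in the traceback above.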

@rtyler
Member

rtyler commented Jun 2, 2023

Given the state of this work, I am going to convert it to a draft since I do not believe it would be suitable to merge at this point.

@rtyler rtyler marked this pull request as draft June 2, 2023 04:39
@rtyler rtyler self-assigned this Jan 7, 2024