Add support for reading parquet file thanks to arrow-dataset #576 #577

fb64 · 2024-01-28T22:46:57Z

Fixes #576

Use arrow-dataset to read Parquet files
Add a test Parquet file generated from a fork of arrow_example that can write Parquet files
Add the arrow-dataset dependency
Update the Arrow version (from 11 to 15)

koperagen · 2024-01-30T18:27:12Z

Hi, and thanks for the PR. I have nothing to add to the code, but I get this exception when trying to run the test on Linux with both JDK 11 and 17. The issue seems to be on the Arrow side. Do you know of any requirements for it to work?

java.lang.UnsatisfiedLinkError: /tmp/jnilib-15573607865820834233.tmp: /tmp/jnilib-15573607865820834233.tmp: undefined symbol: _ZTIN6google8protobuf7MessageE
	at java.base/jdk.internal.loader.NativeLibraries.load(Native Method)
	at java.base/jdk.internal.loader.NativeLibraries$NativeLibraryImpl.open(NativeLibraries.java:388)
	at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:232)
	at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:174)
	at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2389)
	at java.base/java.lang.Runtime.load0(Runtime.java:755)
	at java.base/java.lang.System.load(System.java:1953)
	at org.apache.arrow.dataset.jni.JniLoader.load(JniLoader.java:92)
	at org.apache.arrow.dataset.jni.JniLoader.loadRemaining(JniLoader.java:75)
	at org.apache.arrow.dataset.jni.JniLoader.ensureLoaded(JniLoader.java:61)
	at org.apache.arrow.dataset.jni.NativeMemoryPool.createListenable(NativeMemoryPool.java:44)
	at org.jetbrains.kotlinx.dataframe.io.ArrowReadingImplKt.readArrowDataset(arrowReadingImpl.kt:327)
	at org.jetbrains.kotlinx.dataframe.io.ArrowReadingKt.readParquet(arrowReading.kt:197)
	at org.jetbrains.kotlinx.dataframe.io.ArrowReadingKt.readParquet$default(arrowReading.kt:194)
	at org.jetbrains.kotlinx.dataframe.io.ArrowKtTest.testReadParquet(ArrowKtTest.kt:590)

fb64 · 2024-01-30T19:57:02Z

Hi @koperagen, it seems to be a JNI issue. I just checked, and it works well both on my MacBook Pro (M1) and on a PC with Windows 10 (Intel Core i7). What is the processor architecture of your computer? Normally the arrow-dataset dependency provides the required native library, but maybe it does not fit your hardware 🤔
Did you launch the tests with Gradle? Only the JVM arg --add-opens java.base/java.nio=ALL-UNNAMED is needed (as it is configured in Gradle's tasks.test ...)
I'll try to reproduce on Linux with Docker ....
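
For readers following along, the Gradle setup mentioned above might look roughly like this in a Kotlin DSL build script. This is only a sketch, assuming a standard JVM plugin is applied so that `tasks.test` is available; the actual build file in the PR may be organized differently.

```kotlin
// build.gradle.kts (illustrative fragment, not copied from the PR)
tasks.test {
    // Opens java.nio to code on the classpath (the unnamed module),
    // which is the JVM argument named in the comment above.
    jvmArgs("--add-opens=java.base/java.nio=ALL-UNNAMED")
}
```

The same argument would have to be passed by hand when the tests are launched outside Gradle.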

koperagen · 2024-01-31T11:50:01Z

Yes, I do run them with Gradle. The processor is an Intel Core i7. I tried to run the test on TeamCity, but there it fails on Linux as well :(
Upon inspecting the content of that .so file, I found that this protobuf symbol is indeed undefined, which means it's expected to be loaded from another library

objdump /tmp/jnilib-11657767653473718381.tmp -x

0000000000000000         *UND*	0000000000000000              _ZTIN6google8protobuf7MessageE

But the library doesn't have a dependency on any protobuf library, so I assume it could be a linkage error on the Arrow side... maybe? Either that, or the project needs a dependency on native protobuf somehow

ldd /tmp/jnilib-11657767653473718381.tmp
	linux-vdso.so.1 (0x00007ffd3f689000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fe7a31f0000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fe7a31eb000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fe7a31e6000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fe79f800000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fe7a30ff000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fe7a30dd000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe79f400000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fe7a3219000)

fb64 · 2024-01-31T12:53:04Z

Indeed, I also reproduced the issue with Docker. Downgrading the Arrow dependency to version 14.0.2 seems to fix the error. I'll update the PR and try to dig into it on the Arrow side

koperagen · 2024-01-31T16:20:13Z

Can confirm, 14.0.2 works. I tried it and have some requests.

Can you clarify what the expected URL values are?
Because the following code throws an exception:
val df = DataFrame.readParquet(URL("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-07.parquet"))

Exception in thread "main" java.lang.RuntimeException: Unrecognized filesystem type in URI: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-07.parquet
	at org.apache.arrow.dataset.file.JniWrapper.makeFileSystemDatasetFactory(Native Method)
	at org.apache.arrow.dataset.file.FileSystemDatasetFactory.createNative(FileSystemDatasetFactory.java:40)
	at org.apache.arrow.dataset.file.FileSystemDatasetFactory.<init>(FileSystemDatasetFactory.java:31)
	at org.jetbrains.kotlinx.dataframe.io.ArrowReadingImplKt.readArrowDataset(arrowReadingImpl.kt:328)
	at org.jetbrains.kotlinx.dataframe.io.ArrowReadingKt.readParquet(arrowReading.kt:197)
	at org.jetbrains.kotlinx.dataframe.io.ArrowReadingKt.readParquet$default(arrowReading.kt:194)
	at org.jetbrains.kotlinx.dataframe.io.ArrowReadingImplKt.main(arrowReadingImpl.kt:348)
	at org.jetbrains.kotlinx.dataframe.io.ArrowReadingImplKt.main(arrowReadingImpl.kt)

Looks like only URLs that point to files are valid? Can we make this parameter a File then?
DataFrame.readParquet(URL("file:/home/nikita/Downloads/yellow_tripdata_2023-07.parquet"))

I can't read https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-07.parquet and get an exception. Does it work for you?

/arrow/java/dataset/src/main/cpp/jni_util.cc:79: Failed to update reservation while freeing bytes: JNIEnv was not attached to current thread
/tmp/jnilib-17033017462678975899.tmp(+0x1623b38)[0x7fe9e1e23b38]
/tmp/jnilib-17033017462678975899.tmp(_ZN5arrow4util8ArrowLogD1Ev+0xed)[0x7fe9e1e23eed]
... etc

At the same time, it reads the sample file from the tests just fine

fb64 · 2024-01-31T20:15:18Z

Actually, URI parsing is done natively by Arrow, and it supports only a few file systems. Unfortunately, http(s) is not supported yet:

Local file: file:/filename.parquet
AWS S3: s3:/filename.parquet
Google Cloud Storage: gs:/filename.parquet or gcs:/filename.parquet
Hadoop: hdfs:/filename.parquet or viewfs:/filename.parquet

See the Arrow source code: https://github.com/apache/arrow/blob/2a87693134135a8af2ae2b6df41980176431b1c0/cpp/src/arrow/filesystem/filesystem.cc#L679
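
As an illustration of the list above, a caller could check the scheme up front and fail with a clearer message than the native "Unrecognized filesystem type" error. The sketch below is hypothetical and not part of the PR: the names `arrowDatasetSchemes` and `hasArrowDatasetScheme` are made up for this example, and the scheme set simply mirrors the list in this comment rather than querying Arrow.

```kotlin
import java.net.URI

// URI schemes listed in this thread as recognized by Arrow's native filesystem lookup.
val arrowDatasetSchemes = setOf("file", "s3", "gs", "gcs", "hdfs", "viewfs")

// Hypothetical helper (not part of the PR): returns true when the URI's scheme
// is one of the schemes above, false for anything else, including http(s).
fun hasArrowDatasetScheme(uri: URI): Boolean {
    val scheme = uri.scheme?.lowercase() ?: return false
    return scheme in arrowDatasetSchemes
}

fun main() {
    println(hasArrowDatasetScheme(URI("file:/filename.parquet")))            // true
    println(hasArrowDatasetScheme(URI("https://example.com/data.parquet"))) // false
}
```

A hard-coded set like this can drift from what a given Arrow build actually supports, so it is best treated as a convenience check rather than a source of truth.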

koperagen · 2024-02-01T10:48:16Z

> unfortunately http(s) is not supported yet

I actually tried to read a local copy of that file, and it failed with JNIEnv was not attached to current thread. I want to see whether it's a platform-specific bug or something else

Thanks for the clarification about the URI. Let's change that parameter type to java.net.URI and add a note about filesystems then?
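
If the parameter does become a java.net.URI, callers holding a local path would need to convert it first. A minimal sketch using only the JDK is shown below; the helper name `localFileUri` is hypothetical, and the call into the reader is left out because its final signature was still under discussion at this point in the thread.

```kotlin
import java.io.File
import java.net.URI

// Hypothetical helper: turns a local path into a file: URI,
// the form used for local files earlier in this thread.
fun localFileUri(path: String): URI = File(path).absoluteFile.toURI()

fun main() {
    val uri = localFileUri("data/sample.parquet")
    println(uri.scheme) // file
    println(uri)        // absolute file: URI; the exact value depends on the working directory
}
```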

fb64 · 2024-02-02T08:35:20Z

I hit the same issue, another problem with JNI (and threads)...
Replacing the creation of NativeMemoryPool with NativeMemoryPool.getDefault() here seems to fix the error.
By the way, using it in production is not recommended: https://arrow.apache.org/docs/java/dataset.html#native-memory-management
I will also update the URI part ...

zaleslaw · 2024-02-05T12:50:10Z

Hi, thanks for the PR. Sorry, I could not understand: will it cover any Parquet files, or only Parquet files that store something in the Arrow format? I will collect a few Parquet files and get back to you

fb64 · 2024-02-06T07:53:19Z

> Hi, thanks for the PR. Sorry, I could not understand: will it cover any Parquet files, or only Parquet files that store something in the Arrow format? I will collect a few Parquet files and get back to you

I confirm that it should cover all Parquet files. We are facing a JNI error with some Parquet files (not all). I created an issue on the Arrow repository: apache/arrow#20379

zaleslaw · 2024-02-08T11:53:16Z

@fb64 we have decided not to merge it right away, until three things have happened:

This PR is tested on our own Parquet files
We compare this approach with alternatives (without Apache Arrow, for example)
We decide whether to keep it in the Arrow module or create a separate module for it (which could change our 2-level structure of modules and dependencies in the project).

Thanks again for your help and collaboration!

fb64 · 2024-02-08T20:23:12Z

> @fb64 we have decided not to merge it right away, until three things have happened:
>
> This PR is tested on our own Parquet files
> We compare this approach with alternatives (without Apache Arrow, for example)
> We decide whether to keep it in the Arrow module or create a separate module for it (which could change our 2-level structure of modules and dependencies in the project).
>
> Thanks again for your help and collaboration!

No problem!
From my experience, the other alternative is to use the Java Parquet library, which relies on Hadoop and can be difficult to run on Windows because of certain native libraries (though this may have improved since). On the other hand, arrow-dataset still seems to be under development and not fully operational, but it looks promising and could easily bring reading/writing for both the Parquet and ORC formats.
Let's keep in touch

fb64 · 2024-02-11T17:34:00Z

Related Arrow issue for JNIEnv was not attached to current thread error : apache/arrow#20379

fb64 · 2024-04-24T21:45:17Z

For information, I just updated this PR to Arrow 16.0.0, which includes fixes for the 2 issues discovered previously:

java.lang.UnsatisfiedLinkError --> [Java][Dataset] JNI Error when reading parquet file apache/arrow#39919
JNIEnv was not attached to current thread --> [Java] Dataset Failed to update reservation while freeing bytes: JNIEnv was not attached to current thread apache/arrow#20379

fb64 mentioned this pull request Jan 29, 2024

Add support for reading Parquet files #576

Open

fb64 force-pushed the arrow-read-parquet branch from 158ed95 to 3c6e600 on January 29, 2024 08:23

fb64 force-pushed the arrow-read-parquet branch from 3c6e600 to 0dd7498 on January 31, 2024 12:58

koperagen self-requested a review February 2, 2024 13:50

koperagen added this to the 0.13.0 milestone Feb 2, 2024

fb64 mentioned this pull request Feb 7, 2024

[Java][Dataset] JNI Error when reading parquet file apache/arrow#39919

Closed

fb64 mentioned this pull request Feb 11, 2024

[Java] Dataset Failed to update reservation while freeing bytes: JNIEnv was not attached to current thread apache/arrow#20379

Closed

Jolanrensen modified the milestones: 0.13.0, Backlog Mar 7, 2024

Jolanrensen added the enhancement New feature or request label Mar 7, 2024

fb64 force-pushed the arrow-read-parquet branch from 0dd7498 to 71b06f5 on April 24, 2024 10:40

Add support for reading parquet file thanks to arrow-dataset Kotlin#576

8b8f706

fb64 force-pushed the arrow-read-parquet branch from 71b06f5 to 8b8f706 on April 24, 2024 11:49
