Can I trouble you for a code review to integrate a newer version of parquet4s into delta DSR/DSW? #256

Closed
MironAtHome opened this issue Mar 23, 2022 · 19 comments

Comments

@MironAtHome

MironAtHome commented Mar 23, 2022

I am not sure how much effort it would take; I am just asking whether you would be willing and available to look over a PR.
The dependency is currently stuck at 1.2.1, which is just too far back. It causes trouble with deps, and I really don't like shading if I have to use it as a workaround.

@mjakubowski84
Owner

@MironAtHome Sure thing!

@MironAtHome
Author

Most of the changes wrap file paths expressed as strings into the Path type imported from the latest version of parquet4s.
However, two kinds of errors turned out to be a bit harder to mend:
1. ParquetReader.read in CloseableParquetDataIterator.scala:154
2. ValueCodec, with the root of the trouble in RowParquetRecordImpl.scala:308 and a few other lines related to ValueCodec. I am not certain what to replace those with directly. If a substitute is readily available, please suggest it; otherwise I will put together a few lines to wrap the new classes and expose the same interface.

@mjakubowski84
Owner

I think that this can be helpful: https://mjakubowski84.github.io/parquet4s/docs/migration/

@mjakubowski84
Owner

Regarding RowParquetRecordImpl:

  • ValueCodec's implementation is now split into ValueEncoder + ValueDecoder, so here you probably want to use ValueDecoder.intDecoder.
  • private def customSeqCodec[T](elementCodec: ValueCodec[T])(implicit seems redundant to me. The comment above the function does not sound true to me.
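
The encoder/decoder split described in the first bullet can be sketched in plain Scala. This is only an illustration of the pattern: Cell, IntCell, NullCell, Codec, Encoder, Decoder and read are hypothetical stand-ins invented for this sketch, not the parquet4s types or signatures.

```scala
object CodecSplitSketch {
  // Hypothetical stand-ins for illustration only; NOT the parquet4s types.
  sealed trait Cell
  final case class IntCell(value: Int) extends Cell
  case object NullCell extends Cell

  // Before the split: one type class carries both directions.
  trait Codec[T] {
    def encode(data: T): Cell
    def decode(cell: Cell): T
  }

  // After the split: two independent type classes.
  trait Encoder[T] { def encode(data: T): Cell }
  trait Decoder[T] { def decode(cell: Cell): T }

  // A decoder for Int that refuses to silently turn NULL into 0.
  implicit val intDecoder: Decoder[Int] = new Decoder[Int] {
    def decode(cell: Cell): Int = cell match {
      case IntCell(v) => v
      case NullCell   => throw new IllegalArgumentException("NULL cannot be decoded as Int")
    }
  }

  // Read-side code asks only for a Decoder and never needs an Encoder.
  def read[T](cell: Cell)(implicit decoder: Decoder[T]): T = decoder.decode(cell)
}
```

With such a split, read-only code such as a row record implementation would depend only on the decoding half, which is presumably why a decoder is the suggested replacement here.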

@MironAtHome
Author

MironAtHome commented Mar 24, 2022

Great, thank you for your guidance and help.
I started with this PR to ensure the build works:
#257
The replacement of ValueCodec worked. I am down to the last 5 errors, all related to the type passed as a generic parameter when reading. If I don't get enough time to finish today, I will finish over Sat/Sun.
And thank you for tending to the PR above.

@MironAtHome
Author

Hey Marcin, the PR is ready:
delta-io/connectors#303
Please provide feedback.

@mjakubowski84
Owner

Only two comments from me. One is a minor code change, but the other worries me. I see that Delta relies on some ancient version of parquet-hadoop. I wonder whether that can be upgraded without any issues.

@MironAtHome
Author

MironAtHome commented Apr 18, 2022

Hey Marcin, sorry for the long turnaround; I had to get over a case of COVID.
I have run into an issue with boolean decoding in the connector with version 2.3.0 (the latest available on Maven) of parquet4s.
Could you please take a look?
I have created a private repo for this unit test:
https://github.com/MironAtHome/connectors-private.git
branch
miron/integrate-parquet4s-23
Here are the lines where the assertion fails:

      val b: Boolean = (i % 2 == 0)
      val rowB: Boolean = row.getBoolean("as_boolean")
      println(s"Evaluating row value ${rowB} to ${b} for (${i} % 2 == 0)} as ${(i % 2 == 0)}")
      assert(row.getBoolean("as_boolean") == (i % 2 == 0))

Here is the assertion text, with a few trace rows printed before it:
---
[info] DeltaDataReaderSuite:
Evaluating row value true to true for (4 % 2 == 0)} as true
Evaluating row value false to false for (5 % 2 == 0)} as false
Evaluating row value true to true for (6 % 2 == 0)} as true
Evaluating row value false to false for (7 % 2 == 0)} as false
Evaluating row value true to true for (8 % 2 == 0)} as true
Evaluating row value false to false for (9 % 2 == 0)} as false
Evaluating row value false to true for (0 % 2 == 0)} as true
[info] - read - primitives *** FAILED ***
[info] false did not equal true (DeltaDataReaderSuite.scala:87)
---
Here are the rows from the test table, as read by Spark:
scala> spark.read.format("delta").load("./golden-tables/src/test/resources/golden/data-reader-primitives").show()
+------+-------+-------+--------+----------+--------+---------+---------+---------+--------------+
|as_int|as_long|as_byte|as_short|as_boolean|as_float|as_double|as_string|as_binary|as_big_decimal|
+------+-------+-------+--------+----------+--------+---------+---------+---------+--------------+
|     4|      4|      4|       4|      true|     4.0|      4.0|        4|  [04 04]|             4|
|     5|      5|      5|       5|     false|     5.0|      5.0|        5|  [05 05]|             5|
|     6|      6|      6|       6|      true|     6.0|      6.0|        6|  [06 06]|             6|
|     7|      7|      7|       7|     false|     7.0|      7.0|        7|  [07 07]|             7|
|     8|      8|      8|       8|      true|     8.0|      8.0|        8|  [08 08]|             8|
|     9|      9|      9|       9|     false|     9.0|      9.0|        9|  [09 09]|             9|
|  null|   null|   null|    null|      null|    null|     null|     null|     null|          null|
|     0|      0|      0|       0|      true|     0.0|      0.0|        0|  [00 00]|             0|
|     1|      1|      1|       1|     false|     1.0|      1.0|        1|  [01 01]|             1|
|     2|      2|      2|       2|      true|     2.0|      2.0|        2|  [02 02]|             2|
|     3|      3|      3|       3|     false|     3.0|      3.0|        3|  [03 03]|             3|
+------+-------+-------+--------+----------+--------+---------+---------+---------+--------------+
As shown above, Spark 3.2.1 with Hadoop 3.3.1 reads the values correctly. However, the unit test fails its assertion because the variable rowB contains "false" when the expected value is "true", for the row where the field "as_int" equals 0.

This is affecting DeltaDataReaderSuite.scala

I would very much appreciate your help in verifying this issue.

@mjakubowski84
Owner

Hi @MironAtHome.

I cannot see the code because your repo is private.
I recommend having a look at the boolean decoder and debugging what happens there. I do not recall whether the logic there has changed since version 1.0; probably not.
And BTW, the latest version of Parquet4s is now 1.4.1 :)

@MironAtHome
Author

Marcin, it is so nice to have your comments.
I apologize. I knew my repo's security setup wasn't right for your access, but I thought that inviting you through the issue link might open it up for you.
Give me until EOD (it's 9:47 AM in Seattle right now) to get to it or to ask you for further assistance. I will open access right now.

@mjakubowski84
Owner

It took me some time to debug those tests. Thousands of dependencies and buggy resource loading of golden tables.

So, the issue seems to be with the test data or with the test itself.

The tests fail on NullValue. as_boolean is set to be nullable. In effect, here we cast null to Boolean: https://github.com/MironAtHome/connectors-private/blob/miron/integrate-parquet4s-23/standalone/src/main/scala-2.12/io/delta/standalone/internal/data/RowParquetRecordImpl.scala#L167. And null is resolved as false. So if the test expects true, then it must fail.
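
The null-to-primitive behaviour described above can be reproduced in plain Scala without any Parquet or Delta code. In the minimal sketch below, the Map-based row and the getter names are hypothetical stand-ins rather than the connector's API. If the test derives i from as_int in the same unchecked way (an assumption, though it would match the "(0 % 2 == 0)" trace line printed right after the row for 9), the all-null row yields i = 0 and as_boolean = false, so the test expects true and reads false.

```scala
object NullCastDemo {
  // Hypothetical stand-in for the all-null row: column name -> boxed value, null for NULL.
  val nullRow: Map[String, Any] = Map("as_int" -> null, "as_boolean" -> null)

  // Unchecked getters: casting null to a primitive type yields that type's default value.
  def getInt(row: Map[String, Any], name: String): Int =
    row(name).asInstanceOf[Int] // null becomes 0

  def getBoolean(row: Map[String, Any], name: String): Boolean =
    row(name).asInstanceOf[Boolean] // null becomes false

  // Null-aware alternative: NULL is surfaced as None instead of a default value.
  def getBooleanOpt(row: Map[String, Any], name: String): Option[Boolean] =
    Option(row(name)).map(_.asInstanceOf[Boolean])

  def main(args: Array[String]): Unit = {
    val i = getInt(nullRow, "as_int")
    val b = getBoolean(nullRow, "as_boolean")
    println(s"i = $i, as_boolean = $b, expected = ${i % 2 == 0}")
    println(s"null-aware read: ${getBooleanOpt(nullRow, "as_boolean")}")
  }
}
```

A test over a nullable column would therefore need to check for NULL before comparing primitive values, or skip the all-null row.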

@mjakubowski84
Owner

Closing due to inactivity.

@MironAtHome
Author

OK. Let me revisit the tests and code.
Much time has passed, but it's worth revisiting.
I just tried running the _gym project. Between then and now I had to rebuild my machine (computer), and I was a bit surprised to find the debugger failing because it could not find the HADOOP_HOME environment variable.
After all this time I guess it's a bit late to ask, but does parquet4s depend on Hadoop being present and configured on the machine?

@mjakubowski84
Owner

Hi!
Great to hear that.
No, there's no need to have Hadoop on your machine (I don't). And I don't have HADOOP_HOME set, either.

@mjakubowski84 mjakubowski84 reopened this Nov 10, 2022
@MironAtHome
Author

Here is my stack dump, starting with a screenshot of the debugger stepped into the method that caused the exception:
image

@MironAtHome
Author

Stack trace:
"C:\Program Files\Microsoft\jdk-11.0.12.7-hotspot\bin\java.exe" -agentlib:jdwp=transport=dt_socket,address=127.0.0.1:65487,suspend=y,server=n -javaagent:C:\Users\user\AppData\Local\JetBrains\IntelliJIdea2022.2\captureAgent\debugger-agent.jar -Dfile.encoding=UTF-8 -classpath "F:\HL\dev\git\delta-standlone\parquet4s-gym\target\scala-2.13\classes;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\aopalliance\aopalliance\1.0\aopalliance-1.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\ch\qos\logback\logback-classic\1.3.0-alpha14\logback-classic-1.3.0-alpha14.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\ch\qos\logback\logback-core\1.3.0-alpha14\logback-core-1.3.0-alpha14.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\beachape\enumeratum-macros_2.13\1.6.1\enumeratum-macros_2.13-1.6.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\beachape\enumeratum_2.13\1.7.0\enumeratum_2.13-1.7.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\chuusai\shapeless_2.13\2.3.7\shapeless_2.13-2.3.7.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\core\jackson-annotations\2.13.0\jackson-annotations-2.13.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\core\jackson-core\2.13.0\jackson-core-2.13.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\core\jackson-databind\2.13.0\jackson-databind-2.13.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\jaxrs\jackson-jaxrs-base\2.13.0\jackson-jaxrs-base-2.13.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\jaxrs\jackson-jaxrs-json-provider\2.13.0\jackson-jaxrs-json-provider-2.13.0.jar;C:\Users\user\
AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\module\jackson-module-jaxb-annotations\2.13.0\jackson-module-jaxb-annotations-2.13.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\woodstox\woodstox-core\5.3.0\woodstox-core-5.3.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\github\luben\zstd-jni\1.4.9-1\zstd-jni-1.4.9-1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\github\mjakubowski84\parquet4s-akka_2.13\2.2.0\parquet4s-akka_2.13-2.2.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\github\mjakubowski84\parquet4s-core_2.13\2.2.0\parquet4s-core_2.13-2.2.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\github\stephenc\jcip\jcip-annotations\1.0-1\jcip-annotations-1.0-1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\code\findbugs\jsr305\3.0.2\jsr305-3.0.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\code\gson\gson\2.8.9\gson-2.8.9.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\errorprone\error_prone_annotations\2.2.0\error_prone_annotations-2.2.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\guava\failureaccess\1.0\failureaccess-1.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\guava\guava\27.0-jre\guava-27.0-jre.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\guava\listenablefuture\9999.0-empty-to-avoid-conflict-with-guava\listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\inject\guice\4.0\guice-4.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\j2objc
\j2objc-annotations\1.1\j2objc-annotations-1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\protobuf\protobuf-java\2.5.0\protobuf-java-2.5.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\re2j\re2j\1.1\re2j-1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\nimbusds\nimbus-jose-jwt\9.8.1\nimbus-jose-jwt-9.8.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\squareup\okhttp\okhttp\2.7.5\okhttp-2.7.5.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\squareup\okio\okio\1.6.0\okio-1.6.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\jersey\contribs\jersey-guice\1.19\jersey-guice-1.19.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\jersey\jersey-client\1.19\jersey-client-1.19.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\jersey\jersey-core\1.19\jersey-core-1.19.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\jersey\jersey-json\1.19\jersey-json-1.19.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\jersey\jersey-server\1.19\jersey-server-1.19.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\jersey\jersey-servlet\1.19\jersey-servlet-1.19.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\xml\bind\jaxb-impl\2.2.3-1\jaxb-impl-2.2.3-1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\thoughtworks\paranamer\paranamer\2.3\paranamer-2.3.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\typesafe\akka\akka-actor_2.13\2.6.18\akka-actor_2.13-2.6.18.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\typesafe\akka\akka-protobuf-v3_2.13\
2.6.18\akka-protobuf-v3_2.13-2.6.18.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\typesafe\akka\akka-stream_2.13\2.6.18\akka-stream_2.13-2.6.18.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\typesafe\scala-logging\scala-logging_2.13\3.9.4\scala-logging_2.13-3.9.4.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\typesafe\config\1.4.0\config-1.4.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\typesafe\ssl-config-core_2.13\0.4.2\ssl-config-core_2.13-0.4.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-beanutils\commons-beanutils\1.9.4\commons-beanutils-1.9.4.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-cli\commons-cli\1.2\commons-cli-1.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-codec\commons-codec\1.11\commons-codec-1.11.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-collections\commons-collections\3.2.2\commons-collections-3.2.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-io\commons-io\2.8.0\commons-io-2.8.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-logging\commons-logging\1.2\commons-logging-1.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-net\commons-net\3.6\commons-net-3.6.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-pool\commons-pool\1.6\commons-pool-1.6.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\dnsjava\dnsjava\2.1.7\dnsjava-2.1.7.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\jakarta\activation\jakarta.activation-api\1.2.2\jakarta.activation-api-1.2.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.or
g\maven2\jakarta\xml\bind\jakarta.xml.bind-api\2.3.3\jakarta.xml.bind-api-2.3.3.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\annotation\javax.annotation-api\1.3.2\javax.annotation-api-1.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\inject\javax.inject\1\javax.inject-1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\servlet\jsp\jsp-api\2.1\jsp-api-2.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\servlet\javax.servlet-api\3.1.0\javax.servlet-api-3.1.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\ws\rs\javax.ws.rs-api\2.1.1\javax.ws.rs-api-2.1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\ws\rs\jsr311-api\1.1.1\jsr311-api-1.1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\xml\bind\jaxb-api\2.2.11\jaxb-api-2.2.11.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\log4j\log4j\1.2.17\log4j-1.2.17.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\net\minidev\accessors-smart\2.4.7\accessors-smart-2.4.7.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\net\minidev\json-smart\2.4.7\json-smart-2.4.7.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\avro\avro\1.7.7\avro-1.7.7.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\commons\commons-compress\1.21\commons-compress-1.21.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\commons\commons-configuration2\2.1.1\commons-configuration2-2.1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\commons\commons-lang3\3.12.0\commons-lang3-3.12.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\ma
ven2\org\apache\commons\commons-math3\3.1.1\commons-math3-3.1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\commons\commons-text\1.4\commons-text-1.4.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\curator\curator-client\4.2.0\curator-client-4.2.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\curator\curator-framework\4.2.0\curator-framework-4.2.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\curator\curator-recipes\4.2.0\curator-recipes-4.2.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\thirdparty\hadoop-shaded-guava\1.1.1\hadoop-shaded-guava-1.1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\thirdparty\hadoop-shaded-protobuf_3_7\1.1.1\hadoop-shaded-protobuf_3_7-1.1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-annotations\3.3.2\hadoop-annotations-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-auth\3.3.2\hadoop-auth-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-client\3.3.2\hadoop-client-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-common\3.3.2\hadoop-common-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-hdfs-client\3.3.2\hadoop-hdfs-client-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-mapreduce-client-common\3.3.2\hadoop-mapreduce-client-common-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-mapreduce-client-core\3.3.2\hadoop-mapreduce-client-core-3.3.2.jar
;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-mapreduce-client-jobclient\3.3.2\hadoop-mapreduce-client-jobclient-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-yarn-api\3.3.2\hadoop-yarn-api-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-yarn-client\3.3.2\hadoop-yarn-client-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-yarn-common\3.3.2\hadoop-yarn-common-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\httpcomponents\httpclient\4.5.13\httpclient-4.5.13.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\httpcomponents\httpcore\4.4.13\httpcore-4.4.13.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-admin\1.0.1\kerb-admin-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-client\1.0.1\kerb-client-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-common\1.0.1\kerb-common-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-core\1.0.1\kerb-core-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-crypto\1.0.1\kerb-crypto-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-identity\1.0.1\kerb-identity-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-server\1.0.1\kerb-server-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-simplekdc\1.0.1\kerb-simplekdc-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\htt
ps\repo1.maven.org\maven2\org\apache\kerby\kerb-util\1.0.1\kerb-util-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerby-asn1\1.0.1\kerby-asn1-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerby-config\1.0.1\kerby-config-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerby-pkix\1.0.1\kerby-pkix-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerby-util\1.0.1\kerby-util-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerby-xdr\1.0.1\kerby-xdr-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\token-provider\1.0.1\token-provider-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\parquet\parquet-column\1.12.2\parquet-column-1.12.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\parquet\parquet-common\1.12.2\parquet-common-1.12.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\parquet\parquet-encoding\1.12.2\parquet-encoding-1.12.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\parquet\parquet-format-structures\1.12.2\parquet-format-structures-1.12.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\parquet\parquet-hadoop\1.12.2\parquet-hadoop-1.12.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\parquet\parquet-jackson\1.12.2\parquet-jackson-1.12.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\yetus\audience-annotations\0.12.0\audience-annotations-0.12.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\zookeeper\zoo
keeper-jute\3.5.6\zookeeper-jute-3.5.6.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\zookeeper\zookeeper\3.5.6\zookeeper-3.5.6.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\checkerframework\checker-qual\2.5.2\checker-qual-2.5.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\jackson\jackson-core-asl\1.9.13\jackson-core-asl-1.9.13.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\jackson\jackson-jaxrs\1.9.2\jackson-jaxrs-1.9.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\jackson\jackson-mapper-asl\1.9.13\jackson-mapper-asl-1.9.13.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\jackson\jackson-xc\1.9.2\jackson-xc-1.9.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\jettison\jettison\1.1\jettison-1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\mojo\animal-sniffer-annotations\1.17\animal-sniffer-annotations-1.17.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\woodstox\stax2-api\4.2.1\stax2-api-4.2.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\websocket\websocket-api\9.4.43.v20210629\websocket-api-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\websocket\websocket-client\9.4.43.v20210629\websocket-client-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\websocket\websocket-common\9.4.43.v20210629\websocket-common-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-client\9.4.43.v20210629\jetty-client-9.4.43.v20210629.jar;C:\Users\user\Ap
pData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-http\9.4.43.v20210629\jetty-http-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-io\9.4.43.v20210629\jetty-io-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-security\9.4.43.v20210629\jetty-security-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-servlet\9.4.43.v20210629\jetty-servlet-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-util-ajax\9.4.43.v20210629\jetty-util-ajax-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-util\9.4.43.v20210629\jetty-util-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-webapp\9.4.43.v20210629\jetty-webapp-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-xml\9.4.43.v20210629\jetty-xml-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\jline\jline\3.9.0\jline-3.9.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\ow2\asm\asm\9.1\asm-9.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\reactivestreams\reactive-streams\1.0.3\reactive-streams-1.0.3.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\scala-lang\modules\scala-collection-compat_2.13\2.6.0\scala-collection-compat_2.13-2.6.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\scala-lang\modules\scala-java8-compat_2.13\1.0.0\scala-java8-compat_2.13-1.0.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\mav
en2\org\scala-lang\modules\scala-parser-combinators_2.13\1.1.2\scala-parser-combinators_2.13-1.1.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\scala-lang\scala-library\2.13.8\scala-library-2.13.8.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\scala-lang\scala-reflect\2.13.8\scala-reflect-2.13.8.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\slf4j\slf4j-api\2.0.0-alpha5\slf4j-api-2.0.0-alpha5.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\xerial\snappy\snappy-java\1.1.8.2\snappy-java-1.1.8.2.jar;C:\Program Files\JetBrains\IntelliJ IDEA 2022.2.1\lib\idea_rt.jar" Main
Connected to the target VM, address: '127.0.0.1:65487', transport: 'socket'
06:06:08.412 [main] DEBUG [Main$ Main.scala:89] - Writing... demo.0.parquet
06:19:49.240 [main] WARN [o.a.h.u.Shell Shell.java:692] - Did not find winutils.exe: {}
java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:547)
at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:568)
at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:591)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:688)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3741)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3736)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3520)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:288)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:524)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.parquet.hadoop.util.HadoopOutputFile.fromPath(HadoopOutputFile.java:58)
at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:655)
at com.github.mjakubowski84.parquet4s.ParquetWriter$.internalWriter(ParquetWriter.scala:129)
at com.github.mjakubowski84.parquet4s.SingleFileParquetSink$.com$github$mjakubowski84$parquet4s$SingleFileParquetSink$$apply(SingleFileParquetSink.scala:67)
at com.github.mjakubowski84.parquet4s.SingleFileParquetSink$BuilderImpl.write(SingleFileParquetSink.scala:57)
at Main$.write(Main.scala:95)
at Main$.main(Main.scala:67)
at Main.main(Main.scala)
Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:467)
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:438)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:515)
... 16 common frames omitted
Exception in thread "main" java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:735)
at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:270)
at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:286)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:978)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:324)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:294)
at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:439)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:428)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:459)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:433)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:521)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:500)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1195)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1175)
at org.apache.parquet.hadoop.util.HadoopOutputFile.createOrOverwrite(HadoopOutputFile.java:81)
at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:327)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:292)
at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:658)
at com.github.mjakubowski84.parquet4s.ParquetWriter$.internalWriter(ParquetWriter.scala:129)
at com.github.mjakubowski84.parquet4s.SingleFileParquetSink$.com$github$mjakubowski84$parquet4s$SingleFileParquetSink$$apply(SingleFileParquetSink.scala:67)
at com.github.mjakubowski84.parquet4s.SingleFileParquetSink$BuilderImpl.write(SingleFileParquetSink.scala:57)
at Main$.write(Main.scala:95)
at Main$.main(Main.scala:67)
at Main.main(Main.scala)
Caused by: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:547)
at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:568)
at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:591)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:688)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3741)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3736)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3520)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:288)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:524)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.parquet.hadoop.util.HadoopOutputFile.fromPath(HadoopOutputFile.java:58)
at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:655)
... 6 more
Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:467)
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:438)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:515)
... 16 more

@MironAtHome
Author

MironAtHome commented Nov 11, 2022

This line, LOL, makes a lot of sense. Still, it would be nice to trace and fix, agreed?
https://wiki.apache.org/hadoop/WindowsProblems
Unless, of course, this is impossible.
Admittedly, this is likely a Hadoop issue (the client, that is). So, a big thank you, Marcin, for providing this great tool to troubleshoot and ferret out kinks like this one.

@MironAtHome
Author

MironAtHome commented Nov 15, 2022

Well, a quick look at \parquet4s\core\src\main\scala\com\github\mjakubowski84\parquet4s\ParquetWriter.scala yields this finding:
image
In the end, we do need to have Hadoop installed locally, which is OK.
Unless I am missing something really glaring.
Let's see if we can find anything to change this. If my findings are correct, I find this to be an advantage.
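
Since the exception message names both HADOOP_HOME and hadoop.home.dir, and the warning mentions winutils.exe, a small pre-flight check before writing on Windows may make the failure easier to diagnose. The sketch below is plain Scala and does not call Hadoop. The object and method names are hypothetical, and both the lookup order (system property first, then environment variable) and the bin\winutils.exe location are assumptions about how the Hadoop client resolves its home directory.

```scala
import java.nio.file.{Files, Paths}

object HadoopHomeCheck {
  // Resolve the Hadoop home from the two names in the exception message.
  // Assumed order: the hadoop.home.dir system property, then the HADOOP_HOME env variable.
  def hadoopHome(props: Map[String, String], env: Map[String, String]): Option[String] =
    props.get("hadoop.home.dir").orElse(env.get("HADOOP_HOME")).filter(_.nonEmpty)

  // Assumed location of the helper binary that the warning complains about.
  def winutilsPresent(home: String): Boolean =
    Files.isRegularFile(Paths.get(home, "bin", "winutils.exe"))

  def isWindows(osName: String): Boolean =
    osName.toLowerCase.startsWith("windows")

  def main(args: Array[String]): Unit = {
    val os = sys.props.getOrElse("os.name", "")
    if (!isWindows(os)) println("Not on Windows; this check does not apply")
    else hadoopHome(sys.props.toMap, sys.env) match {
      case Some(home) if winutilsPresent(home) =>
        println(s"Hadoop home looks usable: $home")
      case Some(home) =>
        println(s"winutils.exe not found under ${Paths.get(home, "bin")}")
      case None =>
        println("HADOOP_HOME and hadoop.home.dir are unset")
    }
  }
}
```

Running the check first would turn the long stack trace above into a one-line diagnosis on a freshly rebuilt machine.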

@mjakubowski84
Owner

mjakubowski84 commented Nov 16, 2022

TBH, I haven't used Windows for many, many years, so this is the first time I have seen such an error :)
For sure, you do not need local Hadoop when using a Hadoop client on Mac and Linux.

Thanks for spotting it!
