
Feat(core): make path filter configurable #329

Open · i10416 wants to merge 4 commits into master
Conversation

@i10416 (Contributor) commented Dec 27, 2023

This change adds a pathFilter option to the ParquetReader builder interface
because there are situations where users need to configure path-filter
predicates (e.g. when they use a `_` prefix for partition columns).

Currently, there seems to be no option to change the default path filter (org.apache.parquet.hadoop.util.HiddenFileFilter).

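The motivating behaviour can be sketched in plain Scala. The predicate below mirrors HiddenFileFilter's rule (skip names starting with `.` or `_`); the object and value names are illustrative, not parquet4s or parquet-hadoop API:

```scala
// Sketch of the default behaviour this PR makes configurable.
// org.apache.parquet.hadoop.util.HiddenFileFilter skips any path whose
// name starts with '.' or '_', so '_'-prefixed partition directories
// are silently ignored during partition discovery.
object HiddenFilterSketch {
  // Stand-in predicate equivalent to HiddenFileFilter (not the real class)
  val hiddenFileFilter: String => Boolean =
    name => !name.startsWith(".") && !name.startsWith("_")

  def main(args: Array[String]): Unit = {
    val entries = List("part-00000.parquet", "_date=2023-12-27", ".part-00000.crc")
    // Only the plain data file survives the default filter.
    println(entries.filter(hiddenFileFilter)) // prints List(part-00000.parquet)
  }
}
```

With the default filter there is no way to make the reader descend into `_date=2023-12-27`, which is what the configurable pathFilter addresses.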
.gitignore Outdated
@@ -11,3 +11,4 @@ target
spark-warehouse
project/.plugins.sbt.swp
project/project
project/metals.sbt
@i10416 (Author) commented Dec 27, 2023


No need to track this file in VCS.

@i10416 (Author)

reverted 2276a90

Comment on lines +5 to +10
+import org.scalatest.BeforeAndAfter
+import org.scalatest.EitherValues
 import org.scalatest.flatspec.AnyFlatSpec
 import org.scalatest.matchers.should.Matchers
-import org.scalatest.{BeforeAndAfter, EitherValues}
-import org.slf4j.{Logger, LoggerFactory}
+import org.slf4j.Logger
+import org.slf4j.LoggerFactory
@i10416 (Author) commented Dec 27, 2023

I just ran organize imports with scalafmt.

@@ -42,6 +46,12 @@ object ParquetReader extends IOOps {
*/
def filter(filter: Filter): Builder[T]

/** @param pathFilter
* optional path filter; ParquetReader traverses paths that match this predicate to resolve partitions. It uses
* org.apache.parquet.hadoop.util.HiddenFileFilter by default.
@i10416 (Author)

Mentioning org.apache.parquet.hadoop.util.HiddenFileFilter here feels like leaking an implementation detail.

The current HiddenFileFilter definition is as simple as `!_.getName().startsWith(Set('.', '_'))`. Should we define it in the com.github.mjakubowski84.parquet4s package instead of using org.apache.parquet.hadoop.util.HiddenFileFilter?

@mjakubowski84 (Owner)

Parquet4s relies so heavily on parquet-hadoop that I am not too concerned about leaking this detail.
However, PathFilter seems to be quite an esoteric (but static and reusable) option, so I think it can go into the ParquetReader.Options case class alongside the Hadoop Configuration. That way, it will also not be confused with the existing filter. What do you think?
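A minimal sketch of that Options-based suggestion. The `ReaderOptions` name, its fields, and the `String => Boolean` stand-in for org.apache.hadoop.fs.PathFilter are all assumptions for illustration; parquet4s' real ParquetReader.Options carries more fields:

```scala
// Hypothetical shape of the suggestion above: carry the path filter in the
// Options case class rather than as a builder method.
final case class ReaderOptions(
    // Stand-in for org.apache.hadoop.fs.PathFilter; defaults to the
    // HiddenFileFilter-style predicate.
    pathFilter: String => Boolean = ReaderOptions.hiddenFileFilter
)

object ReaderOptions {
  // Default mirrors HiddenFileFilter: skip dot- and underscore-prefixed names.
  val hiddenFileFilter: String => Boolean =
    name => !name.startsWith(".") && !name.startsWith("_")
}

object OptionsUsage {
  def main(args: Array[String]): Unit = {
    // Override the default to accept '_'-prefixed partition directories.
    val opts = ReaderOptions(pathFilter = name => !name.startsWith("."))
    println(opts.pathFilter("_year=2023")) // prints true
  }
}
```

Because the filter lives in the options rather than next to `filter(filter: Filter)`, it cannot be confused with record-level filtering, which is the point of the suggestion.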

@mjakubowski84 (Owner) left a comment

Would you mind also adding this feature to the Akka/Pekko and FS2 modules?


 protected def findPartitionedPaths(
     path: Path,
-    configuration: Configuration
+    configuration: Configuration,
+    pathFilter: PathFilter = HiddenFileFilter.INSTANCE
@mjakubowski84 (Owner)
👍
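How a defaulted pathFilter parameter like the one in the diff above threads through partition discovery can be sketched with a toy directory tree. The tree model and all names below are illustrative, not parquet4s' actual implementation:

```scala
// Toy model of findPartitionedPaths with a configurable pathFilter that
// defaults to the HiddenFileFilter-style predicate, as in the diff above.
sealed trait Node
final case class Dir(name: String, children: List[Node]) extends Node
final case class Leaf(name: String) extends Node

object FindPathsSketch {
  // Default predicate mirroring HiddenFileFilter
  val hiddenFileFilter: String => Boolean =
    name => !name.startsWith(".") && !name.startsWith("_")

  private def nameOf(n: Node): String = n match {
    case Dir(x, _) => x
    case Leaf(x)   => x
  }

  // Collect file paths under `node`, pruning entries rejected by pathFilter.
  def findPartitionedPaths(
      node: Node,
      pathFilter: String => Boolean = hiddenFileFilter
  ): List[String] = node match {
    case Leaf(n) => List(n)
    case Dir(n, children) =>
      children
        .filter(c => pathFilter(nameOf(c)))
        .flatMap(findPartitionedPaths(_, pathFilter))
        .map(p => s"$n/$p")
  }
}
```

With the default filter, a `_date=...` partition directory is pruned and the traversal yields no paths; passing a filter that only skips dot-files makes the same tree visible, which is exactly the configurability this PR adds.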
