
Feat(core): make path filter configurable #329

Open · i10416 wants to merge 4 commits into master
Conversation

@i10416 (Contributor) commented Dec 27, 2023

This change adds a pathFilter option to the ParquetReader builder interface
because there are situations where users need to configure path-filter
predicates (e.g. when they use a `_` prefix for partition columns).

Currently, there seems to be no option to change the default path filter (org.apache.parquet.hadoop.util.HiddenFileFilter).

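The motivating behaviour can be sketched in plain Scala. The predicate below mirrors HiddenFileFilter's rule (skip names starting with `.` or `_`); the object and value names are illustrative, not parquet4s or parquet-hadoop API:

```scala
// Sketch of the default behaviour this PR makes configurable.
// org.apache.parquet.hadoop.util.HiddenFileFilter skips any path whose
// name starts with '.' or '_', so '_'-prefixed partition directories
// are silently ignored during partition discovery.
object HiddenFilterSketch {
  // Stand-in predicate equivalent to HiddenFileFilter (not the real class)
  val hiddenFileFilter: String => Boolean =
    name => !name.startsWith(".") && !name.startsWith("_")

  def main(args: Array[String]): Unit = {
    val entries = List("part-00000.parquet", "_date=2023-12-27", ".part-00000.crc")
    // Only the plain data file survives the default filter.
    println(entries.filter(hiddenFileFilter)) // prints List(part-00000.parquet)
  }
}
```

With the default filter there is no way to make the reader descend into `_date=2023-12-27`, which is what the configurable pathFilter addresses.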
.gitignore Outdated
@@ -11,3 +11,4 @@ target
spark-warehouse
project/.plugins.sbt.swp
project/project
project/metals.sbt
@i10416 (Author) commented Dec 27, 2023


No need to track this file in VCS.

@i10416 (Author)

reverted 2276a90

Comment on lines +5 to +10
+import org.scalatest.BeforeAndAfter
+import org.scalatest.EitherValues
 import org.scalatest.flatspec.AnyFlatSpec
 import org.scalatest.matchers.should.Matchers
-import org.scalatest.{BeforeAndAfter, EitherValues}
-import org.slf4j.{Logger, LoggerFactory}
+import org.slf4j.Logger
+import org.slf4j.LoggerFactory
@i10416 (Author) commented Dec 27, 2023

I just ran organize imports with scalafmt.

@@ -42,6 +46,12 @@ object ParquetReader extends IOOps {
*/
def filter(filter: Filter): Builder[T]

/** @param pathFilter
* optional path filter; ParquetReader traverses paths that match this predicate to resolve partitions. It uses
* org.apache.parquet.hadoop.util.HiddenFileFilter by default.
@i10416 (Author)

Mentioning org.apache.parquet.hadoop.util.HiddenFileFilter here feels like leaking an implementation detail.

The current HiddenFileFilter definition is as simple as `!_.getName().startsWith(Set('.', '_'))`. Should we define it in the com.github.mjakubowski84.parquet4s package instead of using org.apache.parquet.hadoop.util.HiddenFileFilter?

@mjakubowski84 (Owner)

Parquet4s relies so heavily on parquet-hadoop that I am not too concerned about leaking this detail.
However, PathFilter seems to be quite an esoteric (but static and reusable) option, so I think it can go into the ParquetReader.Options case class alongside the Hadoop Configuration. That way, it will also not be confused with the existing filter. What do you think?
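A minimal sketch of that Options-based suggestion. The `ReaderOptions` name, its fields, and the `String => Boolean` stand-in for org.apache.hadoop.fs.PathFilter are all assumptions for illustration; parquet4s' real ParquetReader.Options carries more fields:

```scala
// Hypothetical shape of the suggestion above: carry the path filter in the
// Options case class rather than as a builder method.
final case class ReaderOptions(
    // Stand-in for org.apache.hadoop.fs.PathFilter; defaults to the
    // HiddenFileFilter-style predicate.
    pathFilter: String => Boolean = ReaderOptions.hiddenFileFilter
)

object ReaderOptions {
  // Default mirrors HiddenFileFilter: skip dot- and underscore-prefixed names.
  val hiddenFileFilter: String => Boolean =
    name => !name.startsWith(".") && !name.startsWith("_")
}

object OptionsUsage {
  def main(args: Array[String]): Unit = {
    // Override the default to accept '_'-prefixed partition directories.
    val opts = ReaderOptions(pathFilter = name => !name.startsWith("."))
    println(opts.pathFilter("_year=2023")) // prints true
  }
}
```

Because the filter lives in the options rather than next to `filter(filter: Filter)`, it cannot be confused with record-level filtering, which is the point of the suggestion.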

@mjakubowski84 (Owner) left a comment

Would you mind also adding this feature to the Akka/Pekko and FS2 modules?


 protected def findPartitionedPaths(
     path: Path,
-    configuration: Configuration
+    configuration: Configuration,
+    pathFilter: PathFilter = HiddenFileFilter.INSTANCE
@mjakubowski84 (Owner)
👍
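How a defaulted pathFilter parameter like the one in the diff above threads through partition discovery can be sketched with a toy directory tree. The tree model and all names below are illustrative, not parquet4s' actual implementation:

```scala
// Toy model of findPartitionedPaths with a configurable pathFilter that
// defaults to the HiddenFileFilter-style predicate, as in the diff above.
sealed trait Node
final case class Dir(name: String, children: List[Node]) extends Node
final case class Leaf(name: String) extends Node

object FindPathsSketch {
  // Default predicate mirroring HiddenFileFilter
  val hiddenFileFilter: String => Boolean =
    name => !name.startsWith(".") && !name.startsWith("_")

  private def nameOf(n: Node): String = n match {
    case Dir(x, _) => x
    case Leaf(x)   => x
  }

  // Collect file paths under `node`, pruning entries rejected by pathFilter.
  def findPartitionedPaths(
      node: Node,
      pathFilter: String => Boolean = hiddenFileFilter
  ): List[String] = node match {
    case Leaf(n) => List(n)
    case Dir(n, children) =>
      children
        .filter(c => pathFilter(nameOf(c)))
        .flatMap(findPartitionedPaths(_, pathFilter))
        .map(p => s"$n/$p")
  }
}
```

With the default filter, a `_date=...` partition directory is pruned and the traversal yields no paths; passing a filter that only skips dot-files makes the same tree visible, which is exactly the configurability this PR adds.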
