v2.17.0

@mjakubowski84 mjakubowski84 released this 25 Feb 08:58

Improved reading of partitioned directories

Do you read data from a huge data lake partitioned into lots of directories? You have probably noticed that listing all those directories and the files within them takes a lot of time. Even when you are interested in just a single partition, you can still wait minutes before any file is actually read. Indeed, reading a file can be much faster than locating it in storage. That's why Parquet4s introduces an improvement in listing partitioned directories. When you provide a filter, it is eagerly evaluated against partitions, and partitions that do not match the filter are skipped early. Thanks to that, Parquet4s avoids loading the whole structure of the directory tree into memory: it lists only those directories that match the filter. You can expect a huge improvement in the speed of filtering huge data lakes!
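
To illustrate the idea of skipping partitions early, here is a minimal, self-contained sketch. It is not the Parquet4s implementation and uses no Parquet4s API. It assumes Hive-style partition directories named `column=value` on a local file system. The `PartitionPruning` object, the `listMatching` helper, and the `date`/`country` layout are all hypothetical names chosen for this example.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object PartitionPruning {

  // Hypothetical helper: returns the .parquet files under `root`, descending
  // only into partition directories (named `column=value`) that `matches` accepts.
  // A rejected directory is never listed, so its subtree costs nothing.
  def listMatching(root: Path, matches: (String, String) => Boolean): List[Path] = {
    val stream   = Files.list(root)
    val children = try stream.iterator().asScala.toList finally stream.close()
    children.sortBy(_.toString).flatMap { child =>
      val name = child.getFileName.toString
      if (Files.isDirectory(child)) {
        name.split("=", 2) match {
          // Partition directory rejected by the filter: prune the whole subtree.
          case Array(column, value) if !matches(column, value) => Nil
          // Matching partition, or a plain directory: keep descending.
          case _ => listMatching(child, matches)
        }
      } else if (name.endsWith(".parquet")) List(child)
      else Nil
    }
  }

  def main(args: Array[String]): Unit = {
    // Build a tiny fake data lake: date=<day>/country=<code>/part-0.parquet
    val root = Files.createTempDirectory("lake")
    for {
      date    <- List("2022-02-24", "2022-02-25")
      country <- List("PL", "DE")
    } {
      val dir = Files.createDirectories(
        root.resolve(s"date=$date").resolve(s"country=$country")
      )
      Files.createFile(dir.resolve("part-0.parquet"))
    }

    // The filter constrains only the `date` column; other columns always pass.
    val onlyOneDay: (String, String) => Boolean =
      (column, value) => column != "date" || value == "2022-02-25"

    listMatching(root, onlyOneDay).foreach(p => println(root.relativize(p)))
  }
}
```

Running `main` prints only the two files under `date=2022-02-25`; the `date=2022-02-24` directory is rejected by its name alone and its contents are never listed. On remote object storage, where each listing is a network call, this kind of pruning is where the time savings would come from.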

Record filter

Parquet4s introduces an experimental RecordFilter. It allows skipping records based on their index in the file. The RecordFilter can be used for the development of custom low-level solutions.

Other notable changes:

  • Fixed a bug in FS2: postWriteHandler now always receives the proper counts in the state of the partition
  • Various fixes and improvements in examples
  • Updated docs