
Releases: mjakubowski84/parquet4s

v2.18.0

19 May 18:35

This release introduces two significant changes:

  1. Improved internals responsible for reading content and statistics of Parquet files. The difference is especially noticeable in the case of Stats: it is faster and now you can also query for min and max of partition fields.

  2. An upgrade of Parquet to 1.14.0. The biggest improvement is support for Hadoop's vectored IO, which you can optionally enable in ParquetReader.Options. It can significantly improve the performance of reading huge files.

v2.17.0

25 Feb 08:58

Improved reading of partitioned directories

Do you read data from a huge data lake partitioned into lots of directories? You have probably noticed that listing all those directories and the files within them takes a lot of time. Even when you are interested in just a single partition, you still wait minutes before any file is actually read. Indeed, reading a file can be much faster than locating it in storage. That's why Parquet4s introduces an improvement in listing partitioned directories. When you provide a filter, it is eagerly evaluated against partitions, and partitions that do not match the filter are skipped early. Thanks to that, Parquet4s avoids loading the whole structure of the directory tree into memory: it lists only those directories which match the filter. You can expect a huge improvement in the speed of filtering huge data lakes!
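
The idea of evaluating a filter while listing, rather than after listing, can be illustrated with a small self-contained sketch. This is not Parquet4s code: `Dir`, `Listing`, `listMatching` and the predicate over partition values are hypothetical names made up for the illustration, and the directory tree lives in memory instead of a file system.

```scala
// Hypothetical in-memory model of a partitioned directory tree.
// A directory name such as "year=2024" encodes a partition column and its value.
final case class Dir(name: String, subDirs: List[Dir] = Nil, files: List[String] = Nil)

// Files found plus the number of directories that had to be listed.
final case class Listing(files: List[String], listedDirs: Int)

object PartitionPruning {
  // Walks the tree depth-first. `partitions` holds the partition values known so far.
  // `mayMatch` is consulted before descending into a subdirectory; when it returns
  // false, the whole subtree is skipped without being listed.
  def listMatching(
      dir: Dir,
      mayMatch: Map[String, String] => Boolean,
      partitions: Map[String, String] = Map.empty,
      prefix: String = ""
  ): Listing = {
    val path     = if (prefix.isEmpty) dir.name else s"$prefix/${dir.name}"
    val ownFiles = dir.files.map(file => s"$path/$file")
    dir.subDirs.foldLeft(Listing(ownFiles, listedDirs = 1)) { (acc, sub) =>
      val known = sub.name.split("=", 2) match {
        case Array(column, value) => partitions + (column -> value)
        case _                    => partitions // not a partition directory
      }
      if (mayMatch(known)) {
        val below = listMatching(sub, mayMatch, known, path)
        Listing(acc.files ++ below.files, acc.listedDirs + below.listedDirs)
      } else acc
    }
  }
}

object PartitionPruningDemo {
  val lake: Dir = Dir("lake", subDirs = List(
    Dir("year=2023", subDirs = List(
      Dir("month=11", files = List("a.parquet")),
      Dir("month=12", files = List("b.parquet"))
    )),
    Dir("year=2024", subDirs = List(
      Dir("month=01", files = List("c.parquet")),
      Dir("month=02", files = List("d.parquet"))
    ))
  ))

  // Keep only year=2024. A column that is not known yet cannot rule a path out.
  val only2024: Map[String, String] => Boolean =
    known => known.get("year").forall(_ == "2024")

  // Lists 4 of the 7 directories: the year=2023 subtree is never entered.
  val result: Listing = PartitionPruning.listMatching(lake, only2024)
}
```

In this sketch the saving grows with the size of the pruned subtrees, since a rejected partition costs one predicate call regardless of how many directories sit beneath it.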

Record filter

Parquet4s introduces an experimental RecordFilter. It allows skipping records based on their index in the file. The RecordFilter can be used for the development of custom low-level solutions.

Other notable changes:

  • Fixed a bug in FS2: postWriteHandler now always receives proper counts in the state of the partition
  • Various fixes and improvements in examples
  • Updated docs

v2.16.1

11 Feb 19:38

This small release optimizes the calculation of partition paths in the viaParquet function in the Akka, Pekko and FS2 modules. Resource consumption is lower and performance is significantly improved, especially in applications that use multiple nested partitions.

Big thanks to @sndnv for the contribution.

v2.16.0

07 Feb 21:45

This release introduces a feature that enables a significant improvement in the performance of reading Parquet files. Parquet storage, such as a data lake, usually consists of a huge number of files. How can we speed up the reading of such storage? Simply by reading multiple files in parallel!
By default, Parquet4s reads file by file, in sequence. Now, by using Akka, Pekko or FS2, you can choose a parallelism level and read multiple files at the same time, while still controlling the utilization of resources. Simply use the option parallelism(n = ???) when defining your reader.
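
The general technique, reading several files concurrently while capping how many are in flight, can be sketched with plain Scala futures. This is only an illustration and not how Parquet4s implements the option: `ParallelReadSketch`, `readFile` and `readAll` are hypothetical names, `readFile` is a stand-in that fabricates records instead of touching Parquet, and a fixed thread pool plays the role of the parallelism level.

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object ParallelReadSketch {
  // Hypothetical stand-in for reading one file; returns two fake "records".
  def readFile(path: String): List[String] =
    List(s"$path#0", s"$path#1")

  // Reads all files, with at most `parallelism` reads running at the same time.
  // The pool size is what bounds resource usage.
  def readAll(paths: List[String], parallelism: Int): List[String] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val reads = paths.map(path => Future(readFile(path)))
      // Future.sequence keeps the results in the order of `paths`.
      Await.result(Future.sequence(reads), 1.minute).flatten
    } finally pool.shutdown()
  }
}
```

For example, `ParallelReadSketch.readAll(List("a", "b", "c"), parallelism = 2)` never runs more than two reads at once. The streaming modules named above have their own machinery for this, so treat the sketch purely as a mental model.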

Besides that, there were numerous minor and bugfix dependency updates, e.g. in Pekko, Cats Effect, FS2 and Slf4j.

Big thanks to @calvinlfer for his contribution.

v2.15.1

05 Feb 18:45

This release fixes a bug that occurred when a decimal value is encoded in Parquet in the form of a long number. Parquet4s was reading such a value as a simple long. Now it also applies a scale and a precision.
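
For context, the Parquet format allows a decimal to be stored as an integer holding the unscaled value, with the scale kept in the schema; the real value is the unscaled value multiplied by 10 to the power of minus the scale. A minimal sketch of that conversion follows. `DecimalFromLong` and `decode` are hypothetical names and this is not the Parquet4s decoder; precision is left out for brevity.

```scala
object DecimalFromLong {
  // Interprets `unscaled` as a decimal with `scale` digits after the decimal point,
  // i.e. unscaled * 10^(-scale).
  def decode(unscaled: Long, scale: Int): BigDecimal =
    BigDecimal(BigInt(unscaled), scale)
}
```

With a scale of 2, the stored long 1234567 stands for 12345.67. Reading it as a plain long, which is the bug described above, would yield 1234567 instead.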

v2.15.0

20 Jan 13:59

Two contributions were made in this release:

  1. @flipp5b added codecs for java.time.Instant. A bug in encoding timestamps as nanos was also fixed.
  2. @i10416 turned Path into a value class.

Big thanks to both of them!

v2.14.2

01 Dec 14:54

Versions 2.14.0 and 2.14.1 mistakenly released the parquet4s-scalapb module as parquet4s-scalapb-akka and parquet4s-scalapb-pekko. Version 2.14.2 brings back a single parquet4s-scalapb module.

v2.14.1

12 Nov 13:28

This release fixes generic projection over a group using the group's multiple fields.

v2.14.0

09 Nov 19:19

Version 2.14.0 brings a revolution to Parquet4s led mostly by @utkuaydn and @j-madden:

  • Parquet4s now supports both Akka and Pekko 🥳
  • Upgrade to Scala 2.13.12 and 3.3.1
  • Upgrade of SBT to 1.9.x and building the project using sbt-projectmatrix
  • Support for legacy pyarrow lists when reading files

Big thanks to the contributors!

v2.13.0

23 Sep 13:24

Here it is! Proper support for Protobuf! The Scala way!
Thanks to @huajiang-tubi, Parquet4s has a new module that allows reading and writing Parquet to and from Protobuf. It leverages ScalaPB so that you can use Scala case classes for the model in your Scala projects. And it is very easy to use! Please refer to the documentation for more details.

Other notable changes:

  • Each module now has a custom function in the API for reading and writing Parquet using your custom internals
  • InMemoryOutputFile becomes reusable
  • FS2 updated to 3.9.2
  • SLF4J updated to 2.0.9

Big thanks to @huajiang-tubi for his contributions!