Release v0.6.0 · Qbeast-io/qbeast-spark

WARNING: This release includes breaking changes to the Format. If you have tables written before version 0.6.0, you can convert them by following the documentation.

What's Changed?

1. New Qbeast Metadata to solve small files problem

Fixes the small file problem in incremental appends by adding support for multiple-block files. This change reduces the number of files loaded when executing a query, improving overall read performance.

Before 0.6.0, each file contained information about a single cube only. This spread the data across many small files, adding overhead when reading from a specific area.

New `AddFile` `tags` schema (>=v0.6.0)

"tags": {
  "revision": "1",
  "blocks": [
    {
      "cube": "w",
      "minWeight": 2,
      "maxWeight": 3,
      "replicated": false,
      "elementCount": 4
    },
    {
      "cube": "wg",
      "minWeight": 5,
      "maxWeight": 6,
      "replicated": false,
      "elementCount": 7
    }
  ]
}

The MultiBlock file approach allows each file to contain multiple Blocks from different Cubes. This means that the Metadata in each AddFile is modified, and this change can break compatibility with old tables.

Make sure to follow the guides to transform an old table (<0.6.0) to the new format.
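
As a rough illustration of the multi-block layout, the sketch below parses a `tags` payload shaped like the example above and reports which cubes a single file covers and how many elements it holds. The helper name `summarize_blocks` is hypothetical, and the handling of `blocks` arriving as a JSON-encoded string is an assumption about how a reader might receive the tags, not something stated in these notes.

```python
import json

# Sample `tags` payload following the schema example, written as strict JSON.
SAMPLE_TAGS = """
{
  "revision": "1",
  "blocks": [
    {"cube": "w", "minWeight": 2, "maxWeight": 3, "replicated": false, "elementCount": 4},
    {"cube": "wg", "minWeight": 5, "maxWeight": 6, "replicated": false, "elementCount": 7}
  ]
}
"""


def summarize_blocks(tags):
    """Return (cube ids, total element count) for one file's tags."""
    blocks = tags["blocks"]
    # Assumption: some readers may hand back `blocks` as a JSON-encoded string.
    if isinstance(blocks, str):
        blocks = json.loads(blocks)
    cubes = [block["cube"] for block in blocks]
    total = sum(block["elementCount"] for block in blocks)
    return cubes, total


tags = json.loads(SAMPLE_TAGS)
cubes, total = summarize_blocks(tags)
print(cubes, total)  # ['w', 'wg'] 11
```

Here one file carries blocks from two cubes, which is what the pre-0.6.0 format would have needed two files to express.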

2. Balanced file layout with Domain-Driven Appends

Another upgrade in this release is the Cube Domains Strategy for appending data incrementally. It uses the existing index during partition-level domain estimation, which helps reduce the share of cubes with outdated max weights from 45% to 0.16% and produces a more stable and balanced file layout.

Fixes #226. Full details in #227.

3. AutoIndexing Feature

Say goodbye to `.option("columnsToIndex", "a,b")`. The new AutoIndexing feature automatically chooses the best columns to organize the data.

It is NOT enabled by default. To use it, add the following configuration:

spark.qbeast.index.columnsToIndex.auto=true
spark.qbeast.index.columnsToIndex.auto.max=10
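
One possible way to apply these settings from PySpark is sketched below. The helper names `auto_indexing_conf` and `build_session` are hypothetical, the reading of `auto.max` as a cap on the number of selected columns is an assumption based on its name, and the session is assumed to be otherwise configured for Qbeast (jar on the classpath, extensions registered).

```python
def auto_indexing_conf(max_columns=10):
    """Collect the two AutoIndexing settings named in these release notes."""
    return {
        "spark.qbeast.index.columnsToIndex.auto": "true",
        # Assumption: this value caps how many columns are chosen automatically.
        "spark.qbeast.index.columnsToIndex.auto.max": str(max_columns),
    }


def build_session(app_name="qbeast-auto-indexing"):
    """Create a SparkSession with AutoIndexing enabled (requires pyspark)."""
    # Imported lazily so auto_indexing_conf() works without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in auto_indexing_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage sketch, assuming a DataFrame `df` and a writable path:
# spark = build_session()
# df.write.format("qbeast").save("/tmp/qbeast_table")  # no columnsToIndex option
```

The same key/value pairs can also be passed with `--conf` on the command line, which keeps application code free of indexing settings.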

4. Support for Spark 3.5.x and Delta 3.1.x

Upgrades the dependencies to their latest versions. New libraries include Spark 3.5.x and Delta 3.1.x.

For details, see the Apache Spark page and the Delta Lake release.

Other Features

Adds #288: Includes more log messages in critical parts of the code, making it easier to debug and understand what is happening.
Adds #261: Block filtering during Sampling. Fewer files to read, faster results.
Adds #253: File Skipping with Delta. Initial results show a 10x improvement from applying Delta's file skipping to the Delta Log's entries.
Adds #243: txnVersion and txnAppId are included in QbeastOptions to write streaming data.
Adds #236: Update SBT / scalastyle frameworks.
Fixed #312: dataChange on Optimization is set to false.
Fixed #315: solve roll-up cube count.
Fixed #317: no overhead during optimization.

Bug Fixes

Fix #246: Creating an External Table with Location loads the existing configuration instead of throwing errors.
Fix #281: Schema Merge and Schema Overwrite mimic Delta Lake's behavior.
Fix #228: Correct implementation of CubeId hash equals.

Contributors

@Jiaweihu08 @fpj @cdelfosse @alexeiakimov @osopardo1

Full Changelog: v0.5.0...v0.6.0
