
Replication not enabled on Multiblock Format #282

Open
osopardo1 opened this issue Mar 13, 2024 · 2 comments
Labels
bug Something isn't working

Comments


osopardo1 commented Mar 13, 2024

WARNING: Replication will be removed in version 0.6.0

Multiblock Format

The upcoming release of Qbeast Spark has new protocol updates.

In this modification, we change the layout from a single cube spread across multiple files to a single file containing multiple cubes (divided into blocks). This allows the roll-up operation to pack several small cubes into a bigger file, helping queries filter data more effectively.

Original protocol metadata:

"tags": {
  "state": "FLOODED",
  "cube": "w",
  "revision": "1",
  "minWeight": "2",
  "maxWeight": "3",
  "elementCount": "4" 
}

NEW protocol metadata:

"tags": {
  "revision": "1",
  "blocks": [
    {
      "cube": "w",
      "minWeight": 2,
      "maxWeight": 3,
      "replicated": false,
      "elementCount": 4
    },
    {
      "cube": "wg",
      "minWeight": 5,
      "maxWeight": 6,
      "replicated": false,
      "elementCount": 7
    }
  ]
}
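To make the benefit of per-block metadata concrete, here is a minimal Python sketch (the class and function names are ours, not the qbeast-spark API; weights are illustrative integers) of how a reader could use the per-block weight ranges to skip blocks during a sampled scan:

```python
# Hypothetical model of the per-block metadata above (NOT the real qbeast-spark API).
from dataclasses import dataclass

@dataclass
class Block:
    cube: str
    min_weight: int
    max_weight: int
    replicated: bool
    element_count: int

def blocks_to_read(blocks, threshold):
    """For a sampling query, the sample fraction maps to a weight threshold;
    any block whose minimum weight exceeds the threshold can be skipped
    entirely, even though it lives in the same file as blocks we do read."""
    return [b for b in blocks if b.min_weight <= threshold]

blocks = [
    Block("w", min_weight=2, max_weight=3, replicated=False, element_count=4),
    Block("wg", min_weight=5, max_weight=6, replicated=False, element_count=7),
]
```

With a threshold of 4, only the `"w"` block would be read; with a threshold of 6, both blocks qualify.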

But changes come with downsides.

What is Replication?

Replication is the operation that optimizes the index for Sampling and Min-Max distribution.

In summary, it reads data from overflowed cubes (cubes containing many more records than their intended capacity) and spreads that information to their children.

[Figure: an overflowed cube A replicating records to its children AA and AB]
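As a rough sketch of the intent (hypothetical names and toy data, not the real qbeast-spark implementation; per the discussion in this thread, we assume the parent cube keeps its own records while the overflowed rows are copied to its children):

```python
# Hypothetical sketch of Replication (NOT the real qbeast-spark code).
# The parent cube keeps its records; the overflowed rows are COPIED into each
# child, so reading any child alone still yields a valid sample.

def children(cube_id, suffixes):
    # Child cube ids are illustrative: one child per branch suffix.
    return [cube_id + s for s in suffixes]

def replicate(cubes, cube_id, overflow, suffixes):
    """Return a new cube map where the `overflow` rows of `cube_id` have been
    appended to each of its children, leaving the parent untouched."""
    result = dict(cubes)
    for child in children(cube_id, suffixes):
        result[child] = result.get(child, []) + list(overflow)
    return result

index = {"A": [1, 2, 3, 4, 5]}
index = replicate(index, "A", overflow=[4, 5], suffixes=["A", "B"])
```

After the call, cube `A` still holds all five rows, while `AA` and `AB` each hold a copy of the overflowed rows.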

The problem

When Roll-Up and Replication are combined, the replicated (copied) data of a cube might end up in the same Parquet file, making the file unreadable from other underlying sources (Delta and plain Parquet) and even from the current qbeast implementation.

The situation

We are removing Replication from the new version of qbeast-spark.

It is a very specific feature, and we have to redesign it in a way that doesn't affect compatibility with other formats. Right now, the effort of maintaining the operation exceeds our development capacity.

This issue is for discussing ways of writing and interacting with replicated data.

Proposed solutions

  • One solution might be to write replicated data in a separate folder inside the table. This needs to be elaborated and proposed in a design document; it is only a high-level idea.
@fpj
Contributor

fpj commented Apr 24, 2024

@osopardo1 a question about the replication described above. In the example, there are overflowed records in cube A (I think we call these offsets, yes?). When we say that we are "replicating" to children AA and AB, does it mean that A keeps the overflowed elements and create copies in AA and AB? The figure indicates that the elements are moved to AA and AB, and if so, it is not replicating, it is moving. I'm confused about the concept of replication here.,

@osopardo1
Member Author

The idea of replication is that the cube's offset would be removed as well, and A would be rewritten with the right number of elements.

It is true that this is not the current behavior of the operation. I will redo the figure to show that A keeps the same records, only replicating the information rather than cutting its content.
Thanks for noticing.
