[ESDB-106-5] Refactoring to support chunk data transformation #4217

Open
wants to merge 1 commit into
base: db-719-replicate-chunk-header


shaan1337
Member

@shaan1337 shaan1337 commented Apr 3, 2024

Changed: Refactoring to support chunk data transformation

This PR introduces a data transformation layer to the database. At the moment, only chunk data is transformed, but the interfaces have been designed to allow for future expansion (e.g., we might also want index data to be transformed later on).

This allows the database to support data transformation operations such as encryption and compression. The immediate goal of this PR is to enable the development of an encryption-at-rest plugin.
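
To make the idea concrete, here is a minimal Python sketch of what a pluggable transform layer could look like. It is not the PR's actual interface: the names `ChunkTransform`, `IdentityTransform`, `ZlibTransform`, and `round_trip` are hypothetical, and zlib compression merely stands in for any transform that changes the size of the data.

```python
import zlib
from abc import ABC, abstractmethod


class ChunkTransform(ABC):
    """Hypothetical interface: maps chunk data to its stored form and back."""

    @abstractmethod
    def transform(self, data: bytes) -> bytes:
        ...

    @abstractmethod
    def untransform(self, stored: bytes) -> bytes:
        ...


class IdentityTransform(ChunkTransform):
    """Leaves the data unchanged in both directions."""

    def transform(self, data: bytes) -> bytes:
        return data

    def untransform(self, stored: bytes) -> bytes:
        return stored


class ZlibTransform(ChunkTransform):
    """Compression stand-in: the stored size differs from the original size."""

    def transform(self, data: bytes) -> bytes:
        return zlib.compress(data)

    def untransform(self, stored: bytes) -> bytes:
        return zlib.decompress(stored)


def round_trip(t: ChunkTransform, data: bytes) -> bytes:
    """Apply a transform and then undo it; the result should equal the input."""
    return t.untransform(t.transform(data))
```

In this sketch, callers depend only on `ChunkTransform`, so an encryption transform could be supplied later without touching the code that reads and writes chunks.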

  • chunk headers & footers are not transformed
  • only the chunk's data is transformed (scavenge posmaps are considered part of the data)
  • the transformed data's size does not need to match the original data's size; it can be larger or smaller
  • the default transform is the identity transform, which leaves the data unchanged
  • one additional byte in the chunk header stores the transform type, which means at most 256 transforms can be supported
  • a new header type, the transform header, is written just after the chunk's header and before the data starts
    • the transform header is specific to the transform and can be of any size
    • the identity transform has an empty transform header
    • the transform header is considered part of the data, but it is designed to be replicated separately from the data: the follower needs to know the transform header in advance to create the transform, which is then applied to the data being replicated
    • at the moment, the transform header is not replicated to the follower; this will be done in a separate PR
  • replication:
    • data chunks are replicated to followers in their untransformed state.
    • raw (scavenged) chunks are replicated in their transformed state.
  • caching:
    • the active chunk is cached in its untransformed state in memory (as we would otherwise incur the cost of transformation twice - once when writing to the file stream and once when writing to the memory stream)
    • read-only chunks are cached in their transformed state in memory (as we would otherwise incur the cost of untransforming the whole chunk when loading it)
  • checksum:
    • checksum computation is done on transformed data
    • checksum verification is done simply by hashing the whole chunk file except the last 16 bytes (the data doesn't need to be untransformed)
  • truncation:
    • the position at which a chunk needs to be truncated is transformed before truncating the file
  • scavenging:
    • scavenged chunks are transformed just like non-scavenged chunks; the posmaps are also transformed, as they are considered part of the data
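
The layout and checksum rules above can be illustrated with a small Python sketch. It is a toy model, not the real on-disk format: the header size, the position of the transform-type byte, the footer being reduced to just the checksum, the single-byte XOR "transform", and the use of MD5 (chosen only because it yields a 16-byte digest) are all assumptions made for illustration.

```python
import hashlib

HEADER_SIZE = 8        # assumed toy size, not the real chunk header size
CHECKSUM_SIZE = 16     # the "last 16 bytes" excluded from hashing
IDENTITY, XOR = 0, 1   # hypothetical transform type ids (one byte: at most 256)


def xor_bytes(data: bytes, key: int) -> bytes:
    """Toy stand-in for a real transform such as encryption; it is self-inverse."""
    return bytes(b ^ key for b in data)


def write_chunk(data: bytes, transform_type: int, key: int = 0) -> bytes:
    # The chunk header is stored untransformed; one byte records the transform type.
    header = bytes([transform_type]) + bytes(HEADER_SIZE - 1)
    # The transform header is empty for identity and transform-specific otherwise.
    transform_header = b"" if transform_type == IDENTITY else bytes([key])
    body = data if transform_type == IDENTITY else xor_bytes(data, key)
    content = header + transform_header + body
    # The checksum is computed over the bytes as stored, i.e. the transformed data.
    return content + hashlib.md5(content).digest()


def verify_chunk(chunk: bytes) -> bool:
    # Hash everything except the trailing checksum; nothing is untransformed.
    content, stored = chunk[:-CHECKSUM_SIZE], chunk[-CHECKSUM_SIZE:]
    return hashlib.md5(content).digest() == stored


def read_chunk(chunk: bytes) -> bytes:
    # The transform type is read from the header before touching the data.
    transform_type = chunk[0]
    rest = chunk[HEADER_SIZE:-CHECKSUM_SIZE]
    if transform_type == IDENTITY:
        return rest
    key, body = rest[0], rest[1:]
    return xor_bytes(body, key)
```

For example, `verify_chunk(write_chunk(b"hello", XOR, key=0x5A))` succeeds without the key ever being applied, while `read_chunk` needs the transform header to recover the original bytes.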


linear bot commented Apr 3, 2024

@shaan1337 shaan1337 changed the base branch from master to db-719-replicate-chunk-header April 3, 2024 07:20
@shaan1337 shaan1337 force-pushed the db-742-chunk-transform branch 8 times, most recently from 27fc058 to fcb1b10 Compare April 8, 2024 06:04
@shaan1337 shaan1337 force-pushed the db-742-chunk-transform branch 2 times, most recently from 4e5d406 to 49dc41e Compare April 22, 2024 12:21
@hayley-jean hayley-jean changed the title Refactoring to support chunk data transformation [DB-742] [ESDB-106-5] Refactoring to support chunk data transformation Apr 22, 2024
@shaan1337 shaan1337 force-pushed the db-742-chunk-transform branch 8 times, most recently from 529d273 to b1f08f8 Compare April 24, 2024 12:09
@shaan1337 shaan1337 marked this pull request as ready for review April 24, 2024 12:09
@shaan1337 shaan1337 force-pushed the db-719-replicate-chunk-header branch from 56ec6f1 to c1893f9 Compare May 15, 2024 10:31
@shaan1337 shaan1337 force-pushed the db-719-replicate-chunk-header branch from c1893f9 to 8637e02 Compare May 27, 2024 06:15
@shaan1337 shaan1337 force-pushed the db-719-replicate-chunk-header branch from 8637e02 to c55c0ed Compare May 27, 2024 10:49