Dataset versioning #235

Open
sjperkins opened this issue Jul 20, 2022 · 20 comments

Comments

@sjperkins
Member

Support Dataset Versioning. Briefly:

  1. Datasets should provide a unified view over multiple datasets on disk.
  2. This will allow users to iteratively reduce their data.
@sjperkins
Member Author

Three approaches thus far

1. Store Link to parent dataset in metadata

Each dataset can store a link to the parent dataset in metadata

2. Store links to all previous datasets in metadata

Slight modification to 1. I prefer this as it provides a less brittle way to specify dataset concatenation.

3. Store all datasets in a single directory.

Each dataset is stored in a sub-directory of a parent directory. The advantage of this approach is that there's less conceptual fragmentation compared to the first two approaches where url links may or may not exist.
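
A minimal sketch of approach 2, assuming each version records the URLs of all earlier versions in its metadata. `DatasetVersion`, `resolve_column` and the `store` mapping are hypothetical names for illustration, not dask-ms API; plain dictionaries stand in for on-disk datasets.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetVersion:
    """Hypothetical stand-in for one on-disk dataset."""
    url: str
    columns: dict                                  # column name -> data held by this version
    previous: list = field(default_factory=list)   # URLs of all earlier versions, oldest first


def resolve_column(name, version, store):
    """Return (url, data) for the newest version that holds `name`.

    `store` maps url -> DatasetVersion and stands in for opening a dataset.
    """
    if name in version.columns:
        return version.url, version.columns[name]
    # Walk the stored links from newest to oldest.
    for url in reversed(version.previous):
        older = store[url]
        if name in older.columns:
            return url, older.columns[name]
    raise KeyError(name)


archive = DatasetVersion("archive.zarr", {"TIME": [1.0, 2.0], "DATA": [10, 20]})
cal = DatasetVersion("1gc.zarr", {"CORRECTED_DATA": [11, 21]}, previous=["archive.zarr"])
img = DatasetVersion("imaging.zarr", {"FLAG": [0, 1]},
                     previous=["archive.zarr", "1gc.zarr"])
store = {v.url: v for v in (archive, cal, img)}
```

Because every version lists the whole chain, a missing intermediate dataset only hides the columns it held, rather than breaking the link to everything older.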

@sjperkins
Member Author

Something to watch out for: if we're creating multiple Measurement Sets, we may end up creating empty required columns (TIME, ANTENNA1, ANTENNA2, UVW) when we only want to store a new version of, e.g., CORRECTED_DATA and FLAG.

/cc @o-smirnov @JSKenyon @landmanbester

@o-smirnov
Contributor

I vote #2!

> Something to watch out for: if we're creating multiple Measurement Sets, we may end up creating empty required columns (TIME, ANTENNA1, ANTENNA2, UVW) when we only want to store a new version of, e.g., CORRECTED_DATA and FLAG.

Well that's exactly the idea, isn't it... new columns live in the "new" version, and all other columns live in the "old" version. So the "new" version is not self-contained without the old dataset also available, but that's by design.

There's another use case for this that feels very related, and that is SSD caching. We've garnered new appreciation recently for how slow MS reads are (see https://github.com/ratt-ru/systems/issues/82), and I'll add SSD scratch filesystems to a few nodes to address this. The problem then is managing the scratch space, since it is necessarily much smaller. I think dask-ms-based tools can be made to be much more cache-friendly by implementing the following logic:

  • A "fast new" version of the MS is created on SSD space, with all columns empty, linking back to the "old slow" MS on HDD. Some columns (e.g. DATA, CORRECTED_DATA) are marked as "caching". There should be a command-line tool for this.

  • When dask-ms reads the fast MS, non-caching columns are pulled straight from the slow version, while caching columns are first copied from the slow version to the fast version.

  • Likewise for writes -- caching columns are written to the fast version only, non-caching columns go straight back to the slow version.

  • A command-line tool can look for "caching columns older than X days" or "caching columns not accessed for Y days", write them back to the slow version, and delete them from the fast version. This can be put in a cron job.

I think this can provide a very smooth user experience for dask-ms-based tools. If you have access to fast disk, you create your "fast MS" linking back to your original slow MS, and as long as you're actively using it, it stays on fast disk -- and if you stop molesting the data for a while, it makes its way back to slow storage eventually without you or the sysadmin worrying too much about it.
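
The caching logic in the list above could be sketched roughly as follows. This is a toy model under stated assumptions: dictionaries stand in for the slow (HDD) and fast (SSD) versions, and `CachedColumns` and its methods are hypothetical names, not part of dask-ms.

```python
import time


class CachedColumns:
    """Toy model: dicts stand in for the slow (HDD) and fast (SSD) versions."""

    def __init__(self, slow, caching):
        self.slow = slow              # column name -> data on slow storage
        self.fast = {}                # column name -> data on fast storage
        self.caching = set(caching)   # columns marked as "caching"
        self.last_access = {}         # column name -> last access timestamp

    def read(self, name, now=None):
        if name not in self.caching:
            return self.slow[name]                    # straight from the slow version
        if name not in self.fast:
            self.fast[name] = list(self.slow[name])   # copied over on first read
        self.last_access[name] = time.time() if now is None else now
        return self.fast[name]

    def write(self, name, data, now=None):
        if name in self.caching:
            self.fast[name] = list(data)              # fast version only
            self.last_access[name] = time.time() if now is None else now
        else:
            self.slow[name] = list(data)              # straight back to the slow version

    def evict(self, max_idle, now=None):
        """Write back and drop caching columns idle for more than `max_idle` seconds."""
        now = time.time() if now is None else now
        evicted = []
        for name in list(self.fast):
            if now - self.last_access[name] > max_idle:
                self.slow[name] = self.fast.pop(name)
                evicted.append(name)
        return evicted
```

In this sketch `evict` plays the role of the cron-driven command-line tool: it only touches columns that have not been accessed recently, so actively used data stays on fast storage.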

(This will also be very useful in the cloud context -- I think things like S3 storage come in hierarchies of access speed...)

Thoughts?

@sjperkins
Member Author

> I vote #2!

Yep that's my vote too

> Something to watch out for: if we're creating multiple Measurement Sets, we may end up creating empty required columns (TIME, ANTENNA1, ANTENNA2, UVW) when we only want to store a new version of, e.g., CORRECTED_DATA and FLAG.

> Well that's exactly the idea, isn't it... new columns live in the "new" version, and all other columns live in the "old" version. So the "new" version is not self-contained without the old dataset also available, but that's by design.

I'm more concerned by the nature of the CASA Measurement Set, in the sense that if you create an MS, say with pyrap.tables.default_ms, all the other required columns are also created, with zeros if the columns have a Tiled Storage Manager, or empty otherwise. Thus, creating a secondary MS holding new CORRECTED_DATA and FLAG columns will also create empty/zeroed TIME, ANTENNA1, ANTENNA2, UVW columns which will, annoyingly, override the correct values in the previous MS.

To work around this, it should be possible to just create the new versions as plain CASA tables (which won't have required columns). Or perhaps modify the MS descriptor to just require the new columns... This is all workable but I also don't want things to get too complex.

@sjperkins
Member Author

The zarr and parquet formats don't have these issues because there is no distinction between a plain dataset and an MS dataset.

@sjperkins
Member Author

> (This will also be very useful in the cloud context -- I think things like S3 storage come in hierarchies of access speed...)

Yep, these are all exciting suggestions. In fact, one thought that comes to mind is datasets composed of CASA tables for the original and zarr/parquet tables for the deltas (fast versions).

> There's another use case for this that feels very related, and that is SSD caching. We've garnered new appreciation recently for how slow MS reads are (see ratt-ru/systems#82), and I'll add SSD scratch filesystems to a few nodes to address this. The problem then is managing the scratch space, since it is necessarily much smaller. I think dask-ms-based tools can be made to be much more cache-friendly by implementing the following logic:
>
>   • A "fast new" version of the MS is created on SSD space, with all columns empty, linking back to the "old slow" MS on HDD. Some columns (e.g. DATA, CORRECTED_DATA) are marked as "caching". There should be a command-line tool for this.
>   • When dask-ms reads the fast MS, non-caching columns are pulled straight from the slow version, while caching columns are first copied from the slow version to the fast version.
>   • Likewise for writes -- caching columns are written to the fast version only, non-caching columns go straight back to the slow version.
>   • A command-line tool can look for "caching columns older than X days" or "caching columns not accessed for Y days", write them back to the slow version, and delete them from the fast version. This can be put in a cron job.
>
> I think this can provide a very smooth user experience for dask-ms-based tools. If you have access to fast disk, you create your "fast MS" linking back to your original slow MS, and as long as you're actively using it, it stays on fast disk -- and if you stop molesting the data for a while, it makes its way back to slow storage eventually without you or the sysadmin worrying too much about it.

Yep, this is all sensible stuff. I would point out that this is the kind of functionality that a data lake implements -- I'm somewhat wary of reinventing the wheel here, even though it may be fun to roll our own stuff.

@sjperkins
Member Author

One strategy that occurred to me was to introduce a new MS indexing column (PROVENANCE_ID or repurposing an existing column) storing an integer for each row. Each entry would index a list of previous on-disk datasets, stored in a separate PROVENANCE subtable or in table/column metadata.

/cc @JSKenyon @landmanbester @bennahugo

@sjperkins
Member Author

sjperkins commented Oct 31, 2022

I also think that averaging should result in discarding the previous provenance information, because a one-to-one mapping no longer exists.

@sjperkins
Member Author

Posting @o-smirnov's chat discussion here:

From a user's PoV, the simplest thing for flagversions would be:

  1. QuartiCal writes to the FLAG column (of the current delta-set)
  2. A command-line tool is used to say "I would like to save this version of FLAG content under the name 'foo'"
  3. The next time QuartiCal writes to FLAG, the current content of FLAG is tucked away under the label "foo" somehow, and new content is written in. So copy-on-write semantics

Then the restore operation is as simple as saying "restore FLAG:foo"
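
The copy-on-write behaviour in the three steps above might look like the following sketch. `VersionedColumn` is a hypothetical in-memory model, not a QuartiCal or dask-ms interface; a real implementation would tuck the old content away on disk.

```python
class VersionedColumn:
    """Toy copy-on-write column with named saved versions."""

    def __init__(self, data):
        self.current = list(data)
        self.saved = {}             # label -> preserved content
        self.pending_label = None   # label requested but not yet materialised

    def save_as(self, label):
        # Step 2: only record the request; nothing is copied yet.
        self.pending_label = label

    def write(self, data):
        # Step 3: tuck the current content away just before overwriting it.
        if self.pending_label is not None:
            self.saved[self.pending_label] = self.current
            self.pending_label = None
        self.current = list(data)

    def restore(self, label):
        if label == self.pending_label:
            return  # the labelled content was never overwritten
        self.write(self.saved[label])


flag = VersionedColumn([0, 0, 1])
flag.save_as("foo")        # no copy happens here
flag.write([1, 1, 1])      # old content is preserved under "foo"
flag.restore("foo")        # the "restore FLAG:foo" operation
```

Deferring the copy until the next write means that labelling a version costs nothing if the column is never modified again.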

@sjperkins
Member Author

Here are my current thoughts for a row-granularity data provenance mechanism:

Provenance Sub-table

| id | name | url |
|----|------|-----|
| 0 | archive | s3://ratt-public-data/ESO137/archive.zarr |
| 1 | 1gc | s3://ratt-public-data/ESO137/1gc.zarr |
| 2 | 3gc | s3://ratt-public-data/ESO137/1gc.zarr |
| 3 | imaging | s3://ratt-public-data/ESO137/imaging.zarr |

PROVENANCE_ID column

  • Through this column, partial derivation of a dataset from multiple datasets can be supported at row granularity.
  • Add to MAIN table
  • Added to all sub-tables. This could be potentially messy.
  • One of my concerns with this approach is that creating dask arrays from many partial sources
    can end up being messy and interfere with chunking mechanisms.
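
To make the row-granularity idea concrete, here is a small sketch of assembling one column from several sources via a per-row PROVENANCE_ID. It assumes, purely for illustration, that every source dataset is row-aligned with the MAIN table; the `provenance` dictionary and `gather` function are hypothetical.

```python
# Hypothetical provenance sub-table: id -> (name, columns held by that dataset).
provenance = {
    0: ("archive", {"FLAG": [0, 0, 0, 0]}),
    1: ("1gc", {"FLAG": [1, 1, 1, 1]}),
}

# One PROVENANCE_ID entry per row of the MAIN table.
provenance_id = [0, 1, 1, 0]


def gather(column, provenance_id, provenance):
    """Assemble a column by taking each row from the dataset its id points to."""
    return [provenance[pid][1][column][row] for row, pid in enumerate(provenance_id)]


flag = gather("FLAG", provenance_id, provenance)
```

The per-row lookup is what makes this approach generic, and also what makes it awkward for chunked arrays: adjacent rows may come from different datasets.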

Add provenance keyword to xds_from_* and xds_to_* methods

  • If provenance is not specified, the latest provenance is the default in the xds_from_* and xds_to_* methods.
  • Otherwise, the provenance name must be supplied by the application.
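
The defaulting rule for the provenance keyword could be sketched as below. `select_provenance` is a hypothetical helper, not an existing dask-ms function, and it assumes the provenance names are kept in order from oldest to newest.

```python
def select_provenance(names, provenance=None):
    """Pick a provenance entry; `names` is ordered oldest to newest."""
    if provenance is None:
        return names[-1]   # default: latest provenance
    if provenance not in names:
        raise ValueError(f"Unknown provenance {provenance!r}; expected one of {names}")
    return provenance


names = ["archive", "1gc", "3gc", "imaging"]
latest = select_provenance(names)
chosen = select_provenance(names, provenance="1gc")
```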

@sjperkins
Member Author

I'm leaning towards ditching the PROVENANCE_ID column and making it a requirement that provenance chains must have a one-to-one mapping in terms of data sizes. Would this be too restrictive for data processing purposes?

@sjperkins
Member Author

> I'm leaning towards ditching the PROVENANCE_ID column and making it a requirement that provenance chains must have a one-to-one mapping in terms of data sizes. Would this be too restrictive for data processing purposes?

If a new data size is produced (by averaging, for example), a completely new provenance chain, and hence sub-table, would be created.

@o-smirnov
Contributor

> I'm leaning towards ditching the PROVENANCE_ID column and making it a requirement that provenance chains must have a one-to-one mapping in terms of data sizes. Would this be too restrictive for data processing purposes?

Yes. A very common procedure is to split out calibrators and targets (different FIELD_IDs). It would be very nice if chains supported this, making this split-out effectively a no-op (in terms of storage used).

> If a new data size is produced (by averaging, for example), a completely new provenance chain, and hence sub-table, would be created.

I think averaging makes a completely new dataset anyway, no, so there's no chain to speak of?

@o-smirnov
Contributor

>   • One of my concerns with this approach is that creating dask arrays from many partial sources
>     can end up being messy and interfere with chunking mechanisms.

Yes I see that concern. My current feeling is that a per-row PROVENANCE_ID is certainly the most generic way, but is also overkill for most applications.

I think what would be sufficient would be for a delta-column to represent a fixed (row-based) subset of the parent column. Can we do it without full-on row-level granularity somehow? Let's keep on thinking...
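
One way to picture a delta-column that represents a fixed, row-based subset of its parent is the sketch below. `split_by_field` and `RowSubsetView` are hypothetical names, and lists stand in for on-disk columns. The split by FIELD_ID stores only a row selection, so untouched columns cost no extra storage.

```python
def split_by_field(field_id, wanted):
    """Return the parent row indices whose FIELD_ID equals `wanted`."""
    return [row for row, fid in enumerate(field_id) if fid == wanted]


class RowSubsetView:
    """Delta that stores only a row selection plus any overridden columns."""

    def __init__(self, parent, rows):
        self.parent = parent   # column name -> full parent data
        self.rows = rows       # fixed subset of parent rows
        self.delta = {}        # column name -> data for the subset only

    def read(self, name):
        if name in self.delta:
            return self.delta[name]
        # Fall through to the parent, selecting only the subset rows.
        return [self.parent[name][r] for r in self.rows]

    def write(self, name, data):
        if len(data) != len(self.rows):
            raise ValueError("delta column must match the subset length")
        self.delta[name] = list(data)


parent = {"FIELD_ID": [0, 1, 0, 1], "DATA": [10, 11, 12, 13]}
target = RowSubsetView(parent, split_by_field(parent["FIELD_ID"], 1))
target.write("CORRECTED_DATA", [21, 23])
```

Here the row selection is fixed once per delta rather than recorded per row, which is the middle ground between a one-to-one mapping and a full PROVENANCE_ID column.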

@sjperkins
Member Author

> Yes. A very common procedure is to split out calibrators and targets (different FIELD_IDs). It would be very nice if chains supported this, making this split-out effectively a no-op (in terms of storage used).

Yes, we'd need the data subsets you describe below to support the above case, because the split data would be derived from completely different rows.

> I think what would be sufficient would be for a delta-column to represent a fixed (row-based) subset of the parent column. Can we do it without full-on row-level granularity somehow? Let's keep on thinking...

Will do. In fact, I'm doing some reading on the Wikipedia article on Data Lineage, which seems to fall squarely into what we're planning.

@sjperkins
Member Author

sjperkins commented Oct 31, 2022

  • @JSKenyon previously mentioned https://docs.xarray.dev/en/stable/generated/xarray.merge.html as a mechanism for reconciling a series of Datasets.
  • We'd probably need to reify coordinates in order to do so.
  • Disparate Dataset chunking might still be an issue
    • in the most pathological case, say we need one row from a dask array in a previous Dataset that represents a load of 10,000 rows.
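
The pathological case can be quantified with a back-of-the-envelope sketch. It assumes uniform row chunks and that a chunk must be loaded whole to read any row in it; `rows_loaded` is a hypothetical helper.

```python
def rows_loaded(rows, chunk_size):
    """Rows that must be read to fetch `rows`, given uniform whole-chunk loads."""
    chunks = {r // chunk_size for r in rows}   # distinct chunks touched
    return len(chunks) * chunk_size


single = rows_loaded([12345], chunk_size=10_000)
scattered = rows_loaded([5, 10_005, 20_005], chunk_size=10_000)
```

Under these assumptions, one needed row costs a full 10,000-row load, and rows scattered across chunks multiply that cost.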

@o-smirnov
Copy link
Contributor

If performance in pathological edge cases is the only issue, then I wouldn't worry about it too much...

@sjperkins
Member Author

TileDB supports dataset versioning and time travelling natively.

https://docs.tiledb.com/main/background/key-concepts-and-data-format

@sjperkins
Member Author

> If performance in pathological edge cases is the only issue, then I wouldn't worry about it too much...

Agreed that this is unlikely
