[FR] (suggestion) Make delta releases easily usable #4447

Open
HarikalarKutusu opened this issue Apr 20, 2024 · 0 comments

Is your feature request related to a problem? Please describe.
Delta releases are great, and they have contained correct data since the v13.0 delta release. However, incorporating them into one's workflow is not straightforward, as it involves merging both clips AND metadata, followed by re-splitting.

The idea behind delta releases is like this in general:

version [N] dataset (which you already have & extracted) + version [N+1] DELTA (that you just downloaded) 
      => version [N+1] dataset

To get there, you currently have to do the following (manually or using a script):

  1. You have v[N] extracted
  2. Download v[N+1] DELTA and extract it somewhere else
  3. Create a v[N+1] directory and copy the contents of the clips directories from v[N] & v[N+1] DELTA into it
  4. Merge v[N] validated with v[N+1] DELTA validated and write the result to v[N+1] validated. (The same works for invalidated, but NOT for other.tsv: some clips from the previous version may have moved to other buckets (validated/invalidated). So you cannot actually reconstruct the whole new dataset, unless perhaps (!) you write some code to check whether they moved.)
  5. Run the CorporaCreator repo on v[N+1] validated (after renaming it to clips.tsv), which then creates the train/dev/test splits.
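
For illustration, steps 3 and 4 above could be scripted roughly as below. This is only a minimal sketch under assumptions that are mine, not taken from the release tooling: each directory passed in is one locale's extracted directory containing `clips/`, `validated.tsv` and `invalidated.tsv`; both versions use identical TSV headers; clips are `.mp3` files; and the `path` column identifies a clip. The function names (`merge_tsv`, `copy_clips`, `build_next_version`) are hypothetical. It does not handle other.tsv or step 5.

```python
import shutil
from pathlib import Path


def merge_tsv(prev_tsv, delta_tsv, out_tsv, key="path"):
    """Concatenate two TSVs with identical headers, keeping the first row per key."""
    header, key_idx = None, None
    seen, rows = set(), []
    for tsv in (prev_tsv, delta_tsv):
        with open(tsv, encoding="utf-8") as fh:
            file_header = fh.readline().rstrip("\n")
            if header is None:
                header = file_header
                key_idx = header.split("\t").index(key)
            elif file_header != header:
                # Assumption: columns did not change between the two versions.
                raise ValueError(f"header mismatch in {tsv}")
            for line in fh:
                line = line.rstrip("\n")
                if not line:
                    continue
                k = line.split("\t")[key_idx]
                if k not in seen:
                    seen.add(k)
                    rows.append(line)
    Path(out_tsv).parent.mkdir(parents=True, exist_ok=True)
    with open(out_tsv, "w", encoding="utf-8", newline="\n") as fh:
        fh.write(header + "\n")
        fh.writelines(r + "\n" for r in rows)
    return len(rows)


def copy_clips(src_dirs, dst_dir):
    """Step 3: copy every .mp3 from the source clips directories into dst_dir."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in src_dirs:
        for clip in Path(src).glob("*.mp3"):
            shutil.copy2(clip, dst_dir / clip.name)


def build_next_version(prev_dir, delta_dir, out_dir,
                       buckets=("validated.tsv", "invalidated.tsv")):
    """Steps 3 and 4: merge clips, then merge the buckets that can be merged safely."""
    prev_dir, delta_dir, out_dir = Path(prev_dir), Path(delta_dir), Path(out_dir)
    copy_clips([prev_dir / "clips", delta_dir / "clips"], out_dir / "clips")
    return {
        name: merge_tsv(prev_dir / name, delta_dir / name, out_dir / name)
        for name in buckets
    }
```

Calling `build_next_version("vN/tr", "vN1-delta/tr", "vN1/tr")` (hypothetical paths) would return the merged row count per bucket. Even with such a script, other.tsv and the splits remain unsolved, which is the gap this request is about.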

Very few people do that because of steps 4 & 5, so the whole point of having delta releases gets lost.

So I propose this:

Describe the solution you'd like
Include in the v[N+1] DELTA the same metadata that ships with the full version of v[N+1].
This way one can just copy the files and be done.

Describe alternatives you've considered
I can think of two different workflows here:

  1. Using the newly created dataset, create a model from scratch
  2. You have a model based on v[N] and you have a nice amount of new recordings in the v[N+1] DELTA, so you choose to fine-tune (this may be preferred for the largest datasets).

The solution I suggested works for 1, and a programmer can easily get what they need from the clips file list, so the second can also be solved.
Alternatively, both full and delta metadata can be put into the distribution, if desired.
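
As a rough illustration of how the second workflow could still be served if the delta shipped the full v[N+1] metadata: the rows belonging to the new recordings can be picked out by matching against the files present in the delta's clips directory. This is a sketch only; the function name `select_delta_rows` is hypothetical, and it assumes the TSV has a `path` column holding the clip file name.

```python
from pathlib import Path


def select_delta_rows(full_tsv, delta_clips_dir, key="path"):
    """Return (header, rows) for rows whose clip file exists in delta_clips_dir."""
    # The clips file list of the delta is the only extra input needed.
    delta_names = {p.name for p in Path(delta_clips_dir).iterdir() if p.is_file()}
    rows = []
    with open(full_tsv, encoding="utf-8") as fh:
        header = fh.readline().rstrip("\n")
        key_idx = header.split("\t").index(key)
        for line in fh:
            line = line.rstrip("\n")
            if line and line.split("\t")[key_idx] in delta_names:
                rows.append(line)
    return header, rows
```

Applied to the full validated.tsv, this would yield only the rows for newly added clips, which could then be used as fine-tuning data.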

Additional context

I don't know how many people use the second workflow, as I don't have the numbers, but I suspect very few download the deltas.

I find delta releases important. They are much smaller than the full releases, so using them will save lots of bandwidth and reduce the carbon footprint of the whole system, which I consider of the utmost importance.
