Local data sync into clearml-data #1246

nikiniki1 · 2024-04-15T15:27:58Z

Hi!
I'm going to use clearml data like this:

I have a dataset of roughly 700 GB. When I want to solve a problem, I select a subsample of it and use that as train/test data. I then feed in only a txt file with the paths (data_path) of the subsample.
So, when I use ClearML, I have to initialize a dataset (dataset = Dataset()) and then call dataset.sync_folder(). But if I use it this way, ClearML will chunk my data and upload it to the file storage, so I end up with duplicates of the data.
I don't want ClearML to duplicate the data. I just want it to monitor the shared folder that holds all the data and show only the paths of the selected files.
How can I solve this problem?

ainoam · 2024-04-15T15:49:09Z

@nikiniki1 Dataset.sync_folder is intended to do exactly that: synchronize data between two locations.
If your use case uses a single location, I think Dataset.add_external_files is what you need.
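
A minimal sketch of that approach, assuming the txt file lists one path per line and that the shared folder is reachable through `file://` links. The helper names `read_subsample_urls` and `register_subsample`, and the dataset and project names in the usage note, are hypothetical and only for illustration:

```python
from pathlib import Path


def read_subsample_urls(list_file):
    """Turn a txt file of paths (one per line) into file:// URLs."""
    urls = []
    for line in Path(list_file).read_text().splitlines():
        line = line.strip()
        if line:  # skip blank lines
            urls.append(Path(line).expanduser().resolve().as_uri())
    return urls


def register_subsample(list_file, name, project):
    """Create a dataset version that links to the files instead of copying them."""
    # Imported lazily; assumes clearml is installed and configured.
    from clearml import Dataset

    dataset = Dataset.create(dataset_name=name, dataset_project=project)
    # Registers links only; the files themselves stay in the shared folder.
    dataset.add_external_files(source_url=read_subsample_urls(list_file))
    # Assumption: with only external links there is no file content to upload,
    # so this step records the dataset state rather than copying data.
    dataset.upload()
    dataset.finalize()
    return dataset.id
```

It could be called as `register_subsample("data_path.txt", "subsample-v1", "my_project")`. Note that `file://` links will only resolve on machines where the shared folder is mounted at the same path.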

Does this help?
