Feature Request: output filepaths in lists without moving or copying files #36

Open
flugenheimer opened this issue Apr 21, 2022 · 6 comments

Comments

@flugenheimer

Hi,

First off, I really like this function. It would, however, be nice to have a feature that just splits the data and outputs the file paths for train, val, and test without actually moving or copying any files.

@jfilter
Owner

jfilter commented Apr 21, 2022

Hey, so you are interested in the pairs of source and destination. Something like (x.jpg, test/x.jpg)? What is your use case for the paths? When do you need the file paths instead of moving/copying the files?

@flugenheimer
Author

> Hey, so you are interested in the pairs of source and destination. Something like (x.jpg, test/x.jpg)? What is your use case for the paths? When do you need the file paths instead of moving/copying the files?

Exactly!
There are two reasons:

  1. I often have a very large number of files, sometimes more than 500 GB, so copying takes up too much space.
  2. I want to keep the original set of annotations in its original structure so that I can track my data version and what happens to the data after it was added (via Weights and Biases).

I therefore often just need a list of the split file pairs and can add the files by filename.
From time to time I still want to physically split or copy files and folders, so I thought it could make sense to be able to get the lists of filenames in the different splits as outputs.

@flugenheimer
Author

Maybe what I actually need is just the list of source files that would end up in each split. In my current scenario I am working on semantic segmentation, so the folder structure is:

  • images
  • masks

It would then be nice to be able to get all the source paths for images and masks in the different splits: train, val, and test.
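
To make the request concrete, here is a minimal sketch of the kind of output described above. It is not part of this package: the function name `split_pairs`, its parameters, and the assumption that an image and its mask share the same file stem are all hypothetical. It only returns paths and never moves or copies anything.

```python
import random
from pathlib import Path


def split_pairs(images_dir, masks_dir, ratios=(0.8, 0.1, 0.1), seed=1337):
    """Return {"train": [...], "val": [...], "test": [...]} of (image, mask) paths.

    Hypothetical helper: assumes an image and its mask share the same stem,
    e.g. images/0001.jpg and masks/0001.png. No files are moved or copied.
    """
    images = {p.stem: p for p in Path(images_dir).iterdir() if p.is_file()}
    masks = {p.stem: p for p in Path(masks_dir).iterdir() if p.is_file()}

    # Keep only samples that have both an image and a mask, in a stable order,
    # so that the same seed always produces the same split.
    stems = sorted(images.keys() & masks.keys())
    random.Random(seed).shuffle(stems)

    n_train = int(len(stems) * ratios[0])
    n_val = int(len(stems) * ratios[1])
    parts = {
        "train": stems[:n_train],
        "val": stems[n_train:n_train + n_val],
        "test": stems[n_train + n_val:],  # whatever remains
    }
    return {name: [(images[s], masks[s]) for s in part] for name, part in parts.items()}
```

The returned lists could then be logged or written to a text file per split, while the original folder structure stays untouched.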

@jfilter
Owner

jfilter commented Apr 22, 2022

Thanks for the explanations. I will look into the issue.

@jfilter
Owner

jfilter commented Apr 22, 2022

I'm not sure this package is right for you. It does not support this kind of folder structure. I think scikit-learn has you covered: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
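
For readers following the scikit-learn suggestion, a three-way split of a list of paths could look roughly like the sketch below. The helper name `three_way_split` and its defaults are assumptions for illustration; `train_test_split` itself only produces two parts, so it is called twice with absolute sample counts.

```python
from sklearn.model_selection import train_test_split


def three_way_split(paths, val_size=0.1, test_size=0.1, seed=1337):
    """Split a list of paths into train/val/test lists without touching the files.

    Hypothetical helper built on two calls to train_test_split.
    """
    # Convert the fractions to absolute counts once, relative to the full
    # dataset, so the second call does not need a rescaled fraction.
    n_test = round(len(paths) * test_size)
    n_val = round(len(paths) * val_size)

    # First carve off the test set, then take the validation set
    # out of what is left.
    train_val, test = train_test_split(paths, test_size=n_test, random_state=seed)
    train, val = train_test_split(train_val, test_size=n_val, random_state=seed)
    return train, val, test
```

`train_test_split` also accepts several equally long sequences at once and splits them with the same indices, which is one way to keep an image list and a mask list aligned.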

@cs-mshah

cs-mshah commented Mar 8, 2023

I also find that several repositories require you to organise your dataset in a specific data/ directory under their main codebase, with train, val, and test splits, and different codebases may expect different structures. When working with multiple codebases at once, it's much easier and more space-efficient to create symlinks (ln -s) than to copy or move files into different directories. See issue #31. I have created pull request #48 for this and tested it.
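
As a rough illustration of the symlink idea (not the code from the pull request), the sketch below builds a train/val/test directory tree that contains only links to the original files. The function name `link_split` and the shape of its input are assumptions, and creating symlinks may require extra privileges on Windows.

```python
from pathlib import Path


def link_split(split_paths, output_dir):
    """Create output_dir/<split>/<filename> symlinks pointing at the originals.

    Hypothetical helper: split_paths maps a split name ("train", "val", ...)
    to an iterable of source file paths. The source files are left in place.
    """
    output_dir = Path(output_dir)
    for split_name, paths in split_paths.items():
        target_dir = output_dir / split_name
        target_dir.mkdir(parents=True, exist_ok=True)
        for src in paths:
            src = Path(src).resolve()  # absolute target so the link works from anywhere
            link = target_dir / src.name
            # Skip names that are already taken, including dangling links.
            if link.is_symlink() or link.exists():
                continue
            link.symlink_to(src)
```

Because each link only stores a path, the extra disk usage is negligible compared with copying, and several codebases can each get their own directory layout over the same data.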
