Consider spreading the data into multiple directories #9

Open
cesarsouza opened this issue Oct 12, 2017 · 5 comments

Comments

@cesarsouza
Contributor

Right now the entire dataset is contained in a single directory (https://github.com/Jakobovski/free-spoken-digit-dataset/tree/master/recordings). This will not scale once the dataset becomes larger. Depending on the file system, even listing the directory contents with ls can become burdensome after around 10,000 files.

Another reason to split the directory is that the current layout may eventually prevent the files from being queried through GitHub's developer API. I am building an interface to the dataset that can automatically download, query, and organize it into training and testing sets without first cloning the repository with git. However, the API limits the number of files that can be retrieved from a single directory; once the dataset grows past that limit, the only option would be to clone the repository and retrieve the files manually.
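
For illustration, here is a minimal sketch of how such an interface might list one directory through GitHub's REST "contents" endpoint and split the result into training and testing sets. It is not the actual interface: the helper names, the default `recordings` path, and the split rule (the lowest variation numbers go to the test set) are all assumptions, and the file names are assumed to follow the `<digit>_<speaker>_<variation>.wav` pattern.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def contents_url(owner, repo, path, ref="master"):
    """Build the URL of the REST 'contents' endpoint for one directory."""
    return f"{API_ROOT}/repos/{owner}/{repo}/contents/{path}?ref={ref}"


def wav_names(entries):
    """Keep only the names of .wav files from a decoded contents listing."""
    return sorted(
        e["name"] for e in entries
        if e.get("type") == "file" and e["name"].endswith(".wav")
    )


def split_by_variation(names, n_test=5):
    """Hypothetical split rule: variations 0..n_test-1 go to the test set."""
    train, test = [], []
    for name in names:
        stem = name[:-len(".wav")]
        variation = int(stem.rsplit("_", 1)[-1])
        (test if variation < n_test else train).append(name)
    return train, test


def list_recordings(owner, repo, path="recordings"):
    """Fetch one directory listing (performs a network request)."""
    with urllib.request.urlopen(contents_url(owner, repo, path)) as resp:
        return wav_names(json.load(resp))
```

Something like `split_by_variation(list_recordings("Jakobovski", "free-spoken-digit-dataset"))` would then return the two lists. As far as I can tell, GitHub's documentation caps this endpoint at 1,000 files per directory, which is why one listing per `<digit>` or `<speaker>` subdirectory would be easier to keep under the limit than a single flat directory.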

Regards,
Cesar

@Jakobovski
Owner

How would you like to organize the recordings?

@cesarsouza
Contributor Author

I would say that the simplest way would be to organize them hierarchically as recordings/<digit>/<speaker>/<digit>_<speaker>_<variation>.wav.
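
A rough sketch of what that migration could look like, assuming every file in the flat directory is already named `<digit>_<speaker>_<variation>.wav` and that speaker names contain no underscores. The function names and the `dry_run` flag are made up for this example.

```python
import shutil
from pathlib import Path


def nested_path(filename):
    """Map '<digit>_<speaker>_<variation>.wav' to '<digit>/<speaker>/<filename>'."""
    # Assumes exactly three underscore-separated fields; raises ValueError otherwise.
    digit, speaker, _variation = Path(filename).stem.split("_")
    return Path(digit) / speaker / filename


def reorganize(recordings_dir, dry_run=True):
    """Plan (and optionally perform) the move of flat .wav files into subdirectories."""
    root = Path(recordings_dir)
    moves = []
    for src in sorted(root.glob("*.wav")):
        dst = root / nested_path(src.name)
        moves.append((src, dst))
        if not dry_run:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))
    return moves
```

With `dry_run=True` (the default here) nothing is touched and the planned moves are only returned for inspection. Inside the repository itself, `git mv` would probably be the better way to perform the moves so that history follows the files.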

@Mistobaan

+1. Are the files also added using Git LFS?

@dansuh17

dansuh17 commented Oct 2, 2019

+1. I suggest a more general structure, commonly used in computer vision datasets such as ImageNet: recordings/<digit>/<speaker>_<variation>.wav, which follows the pattern <data_root>/<class_label>/<id>.<ext>.
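
To make the idea concrete, here is a small sketch of how a loader could read labels straight from such a `<data_root>/<class_label>/<id>.<ext>` layout. The function names are hypothetical, and the first helper assumes the existing files are named `<digit>_<speaker>_<variation>.wav`.

```python
from pathlib import Path


def flat_class_path(filename):
    """Map '<digit>_<speaker>_<variation>.wav' to '<digit>/<speaker>_<variation>.wav'."""
    digit, rest = filename.split("_", 1)
    return Path(digit) / rest


def index_by_class(data_root, ext=".wav"):
    """Return sorted (relative path, class label) pairs, one per file.

    The class label is simply the name of the file's parent directory.
    """
    root = Path(data_root)
    samples = []
    for path in sorted(root.glob(f"*/*{ext}")):
        samples.append((path.relative_to(root), path.parent.name))
    return samples
```

The appeal of this layout is that the label never has to be parsed out of the file name: any generic "one folder per class" loader could consume the dataset as-is.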

@Jakobovski
Copy link
Owner

@dansuh17 Feel free to contribute, and I will accept the MR.
