Dealing with more data #16

Open
petebachant opened this issue Feb 5, 2016 · 1 comment

Very cool and helpful project! I have created similar ones that I don't make installable since they contain more data, which I set up to be downloaded from figshare as needed within the code; most analyses only require the smaller processed data CSVs. I'm still not totally happy with this, however, since users can't really use the package without being in the project root directory.

Do you think maybe the data should be put in each user's home directory (assuming the data doesn't change) under a folder like $HOME/.shablona/data? This would help save space if users are using the package in conda or virtual envs, right?
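A minimal sketch of the home-directory idea, for illustration only. The function name `get_data_dir` and the `SHABLONA_DATA` environment variable are hypothetical and not part of shablona; the default path follows the `$HOME/.shablona/data` suggestion above.

```python
import os
from pathlib import Path


def get_data_dir(env_var="SHABLONA_DATA", default="~/.shablona/data"):
    """Return the per-user data directory, creating it if it is missing.

    `env_var` is a hypothetical override so users can point the package
    at a different disk; otherwise the folder lives under the home directory.
    """
    root = os.environ.get(env_var, default)
    path = Path(root).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Because the directory sits outside any conda or virtual environment, every environment that installs the package would resolve to the same folder and share one copy of the data.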

I was also considering having users install such that the code is used in place, i.e., python setup.py develop or pip install -e shablona. This way, the data directory would always be known relative to the package directory (I see you've already implemented something similar), and the Python directory won't become bloated with data.
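For the in-place install case, the data folder can be located relative to the module that asks for it. This is a sketch under the assumption that the data lives in a `data` folder next to the package's modules; `package_data_dir` is a made-up helper name.

```python
from pathlib import Path


def package_data_dir(module_file, subdir="data"):
    """Return the `subdir` folder that sits next to the given module file."""
    return Path(module_file).resolve().parent / subdir
```

Inside a package module this would be called as `package_data_dir(__file__)`, which resolves to the source tree when the package is installed with `python setup.py develop` or `pip install -e`.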

Any thoughts on how to effectively work with more data?

arokem commented Feb 5, 2016

Thanks for taking a look and for the question. Yes, putting the data under the user's home directory is a good idea. On other projects, we've developed systems for fetching large(ish) data from URLs, validating the hash, and storing it in the user's home directory. That does seem to work well. For details, see: https://github.com/nipy/dipy/blob/master/dipy/data/fetcher.py. It would actually be a good idea to refactor the data part here to do that, with the data hosted in our library repository or, even better, on Figshare. Maybe I will leave this issue open until we get around to doing that.
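The fetch, validate, and store pattern described above can be sketched roughly as follows. This is not the dipy fetcher; the function names, the choice of SHA-256, and the `.part` temporary suffix are all assumptions made for illustration, using only the Python standard library.

```python
import hashlib
import shutil
from pathlib import Path
from urllib.request import urlopen


def file_sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(url, expected_sha256, data_dir, filename):
    """Download `url` into `data_dir/filename` unless a valid copy exists.

    Raises ValueError if the downloaded file does not match the hash.
    """
    data_dir = Path(data_dir).expanduser()
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / filename

    # Reuse a previously downloaded file if its hash still matches.
    if target.exists() and file_sha256(target) == expected_sha256:
        return target

    # Download to a temporary name so a failed download never
    # leaves a corrupt file under the final name.
    tmp = target.with_name(target.name + ".part")
    with urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)

    actual = file_sha256(tmp)
    if actual != expected_sha256:
        tmp.unlink()
        raise ValueError(
            f"Hash mismatch for {url}: expected {expected_sha256}, got {actual}"
        )
    tmp.replace(target)
    return target
```

A caller would pass the download URL, the known hash of the file, and a per-user folder such as `~/.shablona/data`; repeated calls then return the cached file without touching the network.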
