Consider adding a script that creates scaffolding for a new dataset #98
Comments
Er, also, the script I made has a lot of limitations: it assumes you're creating a dataset based on a single CSV file, so it won't help as much (and could even be detrimental) for more complicated multi-file/multi-table datasets like ACRIS or zipped datasets like PLUTO. But it's designed to handle what seems to be the most common use case: a single CSV that maps to a single table in the database.
Perhaps we can put a link to your script in the docs somewhere?
I remembered you mentioning that you made this script, found it here, and used it to add my first dataset. I found it really helpful as a complement to the guide, as a way of walking you through the steps: you can just run it, then look at all the places where it made edits and follow along to make the necessary tweaks in each spot. Since it's most helpful to new contributors, it's probably fine that it only handles the simple single-CSV case, because that's what someone is most likely to pick for their first dataset.
Revisiting this after the hackathon, and I think this could be really helpful to get people started. @aepyornis do you still prefer not to include it in nycdb and just link the script in the docs? I think it might be helpful to have it fully included, to save new contributors the extra steps when they're getting started. I'd be happy to make the PR to set this up. Curious what others from the adding-datasets hackathon group think. Did anyone end up trying this out? @kfinn @elaby-bxd I was just using it to add a few more datasets and found it really helpful. I also made a couple of small additions to Atul's original script: the SQL file is now linked in the YAML, there's a simple attempt to guess the data types for the YAML, and there are extra steps for the integration test. https://gist.github.com/austensen/9d1b4eda9ca82ec220f07c7f0245e6a8
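For anyone curious what "a simple attempt to guess the data types" might involve, here is a rough sketch. It is not the code from either gist. The function names `guess_type` and `guess_schema` are made up for illustration, and the type names (`integer`, `numeric`, `text`) are an assumption about what the dataset YAML accepts. It samples rows from a single CSV and picks the narrowest type that every non-empty value parses as.

```python
import csv
import io


def guess_type(values):
    """Return a rough type name for a column's string values (hypothetical helper)."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        # Nothing to go on, so fall back to the safest type.
        return "text"

    def all_parse(parser):
        for v in non_empty:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "numeric"
    return "text"


def guess_schema(csv_text, max_rows=1000):
    """Map each CSV header to a guessed type, sampling at most max_rows rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames or []}
    for i, row in enumerate(reader):
        if i >= max_rows:
            break
        for name in columns:
            # Short rows yield None for missing fields; treat those as empty.
            columns[name].append(row.get(name) or "")
    return {name: guess_type(vals) for name, vals in columns.items()}


if __name__ == "__main__":
    sample = "bbl,units,score,name\n1000010001,5,1.5,a\n1000010002,,2,b\n"
    for column, type_name in guess_schema(sample).items():
        print(f"{column}: {type_name}")
```

A guess like this is only a starting point. Identifier columns with leading zeros (zip codes, for example) would be guessed as `integer` and should probably be switched to a text type by hand, and dates are not detected at all.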
It would be nice to have a script that does most of the grunt work involved in creating a new dataset.
I made a simple version of one and used it to create #96:
https://gist.github.com/toolness/ff8d00f36234442d650c63311178f8bd
I'm not sure if it should be added to the nycdb repository or kept separate from it, though. It has a rather unorthodox integration test that adds a sample dataset to nycdb and then runs pytest to execute the test it created for the new dataset, and I'm not sure if pytest is OK with being invoked while it's already running.
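One way to sidestep the pytest-inside-pytest question would be to launch the inner run as a separate process, so the two runs share no interpreter state. A minimal sketch follows, assuming pytest is installed in the same environment. The helper names `pytest_command` and `run_in_subprocess` and the test path are hypothetical and not taken from the gist.

```python
import subprocess
import sys


def pytest_command(test_path, extra_args=()):
    """Build a command that runs pytest in a fresh interpreter (hypothetical helper)."""
    # Using sys.executable keeps the inner run in the same virtualenv as the outer one.
    return [sys.executable, "-m", "pytest", test_path, *extra_args]


def run_in_subprocess(cmd):
    """Run cmd as a child process and return (exit_code, combined_output)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr


if __name__ == "__main__":
    # The path below is a placeholder for the test the scaffolding script generated.
    code, output = run_in_subprocess(pytest_command("tests/test_my_new_dataset.py", ["-q"]))
    print(output)
    # pytest exits with 0 when every collected test passed.
    print("passed" if code == 0 else f"failed with exit code {code}")
```

The outer test can then simply assert that the exit code is 0, at the cost of a slower test and less readable failure output.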
The code is definitely going to break as we evolve nycdb's architecture (see e.g. #94), and I'm not sure whether keeping the script up to date in the repo would be too much of a maintenance burden.
Anyways, I figured I'd plop the idea here to see what folks think.