Consider adding a script that creates scaffolding for a new dataset #98

Open
toolness opened this issue Apr 10, 2019 · 4 comments · May be fixed by #252

Comments

@toolness
Contributor

It would be nice to have a script that does most of the grunt work involved in creating a new dataset.

I made a simple version of one and used it to create #96:

https://gist.github.com/toolness/ff8d00f36234442d650c63311178f8bd

I'm not sure if it should be added to the nycdb repository or kept separate from it, though. It has a rather unorthodox integration test: it adds a sample dataset to nycdb and then runs pytest on the test it generated for that dataset. I'm not sure if pytest is ok with running itself while it's already running.
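One way to sidestep the pytest-inside-pytest question is to launch the inner run in a separate Python process, so the two runs share no in-process state. The following is only a sketch of that idea, not what the gist does; the function names and the test path in the usage note are hypothetical.

```python
import subprocess
import sys


def pytest_command(test_path, keyword=None):
    """Build a command line that runs pytest in a fresh interpreter."""
    # sys.executable keeps the child in the same Python environment.
    cmd = [sys.executable, "-m", "pytest", test_path]
    if keyword:
        # -k limits the run to tests whose names match the expression.
        cmd += ["-k", keyword]
    return cmd


def run_generated_test(test_path, keyword=None):
    """Run the generated test in a child process and return its exit code."""
    result = subprocess.run(pytest_command(test_path, keyword))
    return result.returncode
```

An outer test could then call something like `run_generated_test("tests/test_example.py", "my_dataset")` (a made-up path and dataset name) and assert that the return value is `0`.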

The code is definitely going to break as we evolve nycdb's architecture (see e.g. #94). I'm not sure whether keeping the script up to date as nycdb evolves would be too much of a burden to justify having it in the repo.

Anyways, I figured I'd plop the idea here to see what folks think.

@toolness
Contributor Author

Er, also, the script I made has a lot of limitations: it assumes you're creating a dataset based on a single CSV file, so it won't help as much (and could even be detrimental) for more complicated multi-file/multi-table datasets like ACRIS or zipped datasets like PLUTO. But it's designed to handle the most common use case, I guess, which does seem to be a single CSV that maps to a single table in the database.
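To make the single-CSV, single-table assumption concrete, here is a rough sketch of the kind of skeleton such a script could derive from a file's header row. It is not the code from the gist: the helper names, the `table_name`/`fields` keys, and the sample columns are all made up for illustration and are not claimed to match nycdb's actual dataset format.

```python
import csv
import io
import re


def to_identifier(name):
    """Turn a CSV header such as 'Building ID' into 'building_id'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()
    return cleaned or "column"


def skeleton_from_csv(csv_file, table_name):
    """Read only the header row and return a one-table skeleton.

    Every column starts out as 'text'; a person is expected to
    adjust the types afterwards.
    """
    header = next(csv.reader(csv_file))
    return {
        "table_name": table_name,
        "fields": {to_identifier(col): "text" for col in header},
    }


# Hypothetical sample data standing in for a downloaded CSV.
sample = io.StringIO("Building ID,Borough,Units Total\n1001,BROOKLYN,12\n")
skeleton = skeleton_from_csv(sample, "my_dataset")
```

Because it only ever looks at one header row, a sketch like this has the same limitation described above: it has nothing useful to say about multi-file or zipped sources.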

@aepyornis
Collaborator

Perhaps we can put a link to your script in the docs somewhere?

@toolness changed the title from "Consider adding a script that creates scaffoldoing for a new dataset" to "Consider adding a script that creates scaffolding for a new dataset" on Jul 2, 2019

@austensen
Member

I remembered you mentioning that you made this script, found it here, and used it to add my first dataset. I found it really helpful as a complement to the guide, as a way of walking you through the steps: you can just run it, then look at all the places where it made edits and follow along to make the necessary tweaks in each spot. Since it's most helpful to new contributors, it's probably fine that it only works with the simple case of a single dataset, since that's what someone is most likely to do as their first dataset.

@austensen
Member

austensen commented Dec 17, 2022

Revisiting this after the hackathon, and I think this could be really helpful to get people started. @aepyornis do you still prefer to not include it in nycdb, but just link the script in the docs? I think it might be helpful to have it fully included to save the extra steps when new contributors are getting started. I'd be happy to make the PR to set this up.

Curious what others from the adding-datasets hackathon group think - did anyone end up trying this out? @kfinn @elaby-bxd

I was just using it to add a few more datasets, and found it really helpful. I also made a couple of small additions to Atul's original script: it now links the SQL file in the YAML, makes a simple attempt to guess the data types for the YAML, and adds extra steps for the integration test.

https://gist.github.com/austensen/9d1b4eda9ca82ec220f07c7f0245e6a8
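For readers wondering what "a simple attempt to guess the data types" might involve, a minimal sketch follows. It is not taken from the gist; the function name, the single `YYYY-MM-DD` date format, and the type names (`integer`, `numeric`, `date`, `text`) are assumptions chosen for illustration.

```python
import re
from datetime import datetime

# Integers with leading zeros (e.g. zip codes) are deliberately not matched,
# so they fall through to 'text' and keep their zeros.
INT_RE = re.compile(r"[+-]?(0|[1-9]\d*)")
DEC_RE = re.compile(r"[+-]?(\d+\.\d*|\.\d+)")


def _is_date(value):
    """True if the value parses as YYYY-MM-DD (the only format tried here)."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def guess_type(values):
    """Guess a column type from sample string values; fall back to 'text'."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "text"
    if all(INT_RE.fullmatch(v) for v in non_empty):
        return "integer"
    if all(INT_RE.fullmatch(v) or DEC_RE.fullmatch(v) for v in non_empty):
        return "numeric"
    if all(_is_date(v) for v in non_empty):
        return "date"
    return "text"
```

A guess like this is only as good as the rows it samples, so the resulting types would still need a manual check before being committed.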

@steve52 linked a pull request on Feb 23, 2023 that will close this issue