Consider adding a script that creates scaffolding for a new dataset #98

toolness · 2019-04-10T12:22:41Z

It would be nice to have a script that does most of the grunt work involved in creating a new dataset.

I made a simple version of one and used it to create #96:

https://gist.github.com/toolness/ff8d00f36234442d650c63311178f8bd

I'm not sure if it should be added to the nycdb repository or kept separate from it, though. It has a rather unorthodox integration test that adds a sample dataset to nycdb and then runs pytest to execute the test it created for the new dataset, and I'm not sure if pytest is ok with running itself while already running itself.

The code is definitely going to break as we evolve nycdb's architecture (see e.g. #94). I'm not sure whether maintaining the script as nycdb evolves would be too much of a burden to justify keeping it in the repo.

Anyways, I figured I'd plop the idea here to see what folks think.

toolness · 2019-04-10T12:24:50Z

Er, also, the script I made has a lot of limitations: it assumes you're creating a dataset based on a single CSV file, so it won't help as much (and could even be detrimental) for more complicated multi-file/multi-table datasets like ACRIS or zipped datasets like PLUTO. But it's designed to handle the most common use case, I guess, which does seem to be a single CSV that maps to a single table in the database.

aepyornis · 2019-04-15T18:32:00Z

Perhaps we can put a link to your script in the docs somewhere?

austensen · 2019-09-25T23:01:33Z

I remembered you mentioning that you made this script, found it here, and used it for adding my first dataset. I found it really helpful as a complement to the guide, as a way of walking you through the steps: you can just run it, then look at all the places where it made edits and follow along to make the necessary tweaks in each spot. Since it's most helpful to new contributors, it's probably fine that it only works with the simple case of a single dataset, since that's what someone is most likely to do as their first dataset.

austensen · 2022-12-17T22:34:29Z

Revisiting this after the hackathon, and I think this could be really helpful to get people started. @aepyornis do you still prefer to not include it in nycdb, but just link the script in the docs? I think it might be helpful to have it fully included to save the extra steps when new contributors are getting started. I'd be happy to make the PR to set this up.

Curious what others from the adding-datasets hackathon group think - did anyone end up trying this out? @kfinn @elaby-bxd

I was just using it to add a few more datasets, and found it really helpful. I also made a few small additions to Atul's original script: the SQL file is now linked in the YAML, there's a simple attempt to guess the data types for the YAML, and there are extra steps for the integration test.

https://gist.github.com/austensen/9d1b4eda9ca82ec220f07c7f0245e6a8
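
To illustrate the "guess the data types" idea mentioned above, here is a minimal sketch of one way it might work. It is not taken from either gist. The function names (`guess_type`, `guess_schema`), the accepted date formats, and the type names (PostgreSQL-style `integer`, `numeric`, `date`, `text`) are all assumptions made for this example, and the result would still need a manual review, e.g. identifier columns such as BBLs are often better stored as text even though they look numeric.

```python
import csv
import io
import re
from datetime import datetime

# Hypothetical list of date formats to try; real datasets may use others.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def _is_int(value):
    return re.fullmatch(r"-?\d+", value) is not None


def _is_number(value):
    # Accepts integers as well as decimals such as "1500.50" or ".5".
    return re.fullmatch(r"-?(\d+\.?\d*|\.\d+)", value) is not None


def _is_date(value):
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def guess_type(values):
    """Guess a column type from its string values, ignoring blanks."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return "text"  # nothing to go on, so fall back to the safest type
    if all(_is_int(v) for v in non_empty):
        return "integer"
    if all(_is_number(v) for v in non_empty):
        return "numeric"
    if all(_is_date(v) for v in non_empty):
        return "date"
    return "text"


def guess_schema(csv_text):
    """Map each CSV header to a guessed type, reading every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {
        name: guess_type([row.get(name) for row in rows])
        for name in (reader.fieldnames or [])
    }


if __name__ == "__main__":
    sample = (
        "bbl,units,assessed,saledate,owner\n"
        "1000010001,12,1500.50,2019-04-10,ACME LLC\n"
        "1000010002,,200,2019-04-11,\n"
    )
    for column, column_type in guess_schema(sample).items():
        print(f"{column}: {column_type}")
```

A column only gets a narrower type when every non-blank value fits it, so a single stray value pushes the guess back to `text`, which seems like the safer default for scaffolding that a contributor will edit by hand anyway.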

toolness changed the title ~~Consider adding a script that creates scaffoldoing for a new dataset~~ Consider adding a script that creates scaffolding for a new dataset Jul 2, 2019

steve52 linked a pull request Feb 23, 2023 that will close this issue

Create dataset script #252

Open
