add DOF Property Charges Balance #290

Open
3 tasks
austensen opened this issue Dec 10, 2023 · 4 comments
Labels
documentation, example query, new dataset

@austensen
Member

Dept. of Finance dataset showing how much properties owe to the city. Can be helpful in identifying buildings under financial distress.

Dataset: https://data.cityofnewyork.us/City-Government/DOF-Property-Charges-Balance/scjx-j6np

dataset/table name: dof_property_charges

  • add dataset to NYCDB
  • add documentation to wiki
  • add example query to wiki

Each task can be completed by a different person - comment below to claim a part of it

@austensen added the new dataset, documentation, and example query labels Dec 10, 2023
@austensen austensen added this to the HDC Hackathon milestone Dec 10, 2023
@wstlabs
Collaborator

wstlabs commented Jan 20, 2024

Getting started

@wstlabs
Collaborator

wstlabs commented Jan 20, 2024

I've pushed some (very) rough and ready code to the following branch:

https://github.com/nycdb/nycdb/tree/dev-291-dataset-dof-property-charge

That has been tested (successfully) on a partial load of 600k or so records (1 percent of the total of 62M, still downloading).

However, it's not ready for others to test, due to some apparent underlying weirdness in the existing codebase, which I'd like to run by @austensen (or someone else) before going into much detail here just yet.

As to the weirdness: it has to do with the forced CamelCase munging of column names (which apparently has unintended side effects). It should be easy enough to resolve (sometime in the coming days, after the hackathon).

@kfinn
Collaborator

kfinn commented Jan 20, 2024

Hi @wstlabs! Do you mind clarifying exactly what you mean about the forced CamelCase munging of column names?

Alternatively, there's some documentation on the column name munging we do: https://github.com/nycdb/nycdb/blob/main/src/ADDING_NEW_DATASETS.md#-note- (see the bulleted list "Some examples of how column names are transformed:"). I wonder if this would add enough context to answer your questions.

@wstlabs
Collaborator

wstlabs commented Jan 20, 2024

Basically, the CamelCase munging seems to conflict with the explicit field declarations in the dataset config file (src/nycdb/datasets/dof_property_charge.yml in the new branch), which I would expect to take precedence.

At least my assumption was that the config provides the explicit schema. In presenting an explicit mapping of column names to types -- that definitely would seem to be its purpose. But no, it seems that's not the "real" schema that ends up being used -- or perhaps it is, in terms of column types, but not column names. Which are still automunged internally, per the above description.

Here's how it plays out in this case:

(1) The raw file contains some field names with underscores, e.g. dt_pd_begin, which (as per the writeup) nycdb is apparently trying to munge to DtPdBegin.

(2) Which apparently overrides the settings in the config file (src/nycdb/datasets/dof_property_charge.yml), contrary to expectations.

(3) So you'd think "Fine, I'll bring the config file in line with the automunged name then, to make everyone happy". But unfortunately, no: it also apparently wants the field names in the CSV header to match the automunged names as well (meaning I had to edit the CSV, and change underscored names to CamelCase throughout) in order to get the file to load.

Which is not the way things are meant to be done, I'm assuming.

But at least the file (or a 1 percent sample of it) does load, with close to correct column types -- which is a good sign, in that it seems it should be pretty easy to get this dataset integrated (once the above weirdness is resolved).
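
For illustration, the manual workaround from (3) could be sketched roughly as below. This is not nycdb's own munging code: `to_camel_case` and `rewrite_header` are hypothetical helpers, and the only rule assumed is the one observed above (dt_pd_begin becoming DtPdBegin). The second column name in the usage lines is made up.

```python
import csv
import io


def to_camel_case(name: str) -> str:
    """Hypothetical stand-in for the munging: dt_pd_begin -> DtPdBegin."""
    # Split on underscores and upper-case the first letter of each part,
    # leaving the rest of each part untouched.
    return "".join(part[:1].upper() + part[1:] for part in name.split("_") if part)


def rewrite_header(src, dst) -> list[str]:
    """Copy a CSV from src to dst, renaming only the header row."""
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = [to_camel_case(col) for col in next(reader)]
    writer.writerow(header)
    writer.writerows(reader)  # data rows are streamed through unchanged
    return header


# Usage on an in-memory sample (other_field is an invented column name).
sample = io.StringIO("dt_pd_begin,other_field\n2023-01-01,10\n")
out = io.StringIO()
new_header = rewrite_header(sample, out)
```

Since the rows are streamed rather than loaded at once, the same approach should also work on the full file, though it remains a workaround rather than a fix for the config/header mismatch.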
