Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What would be the best way to query and return all the variables available on a table? #62

Open
palewire opened this issue Nov 28, 2018 · 10 comments

Comments

@palewire
Copy link
Contributor

palewire commented Nov 28, 2018

Let's say I'm interested in exploring a topic I've never worked with before. Say, for example, Census data tracking languages spoken at home.

In this use case, even if I am a very skilled Python programmer (lol, remember this is a fantasy) and even if I have some experience with Census data, I am unlikely to know either the table id (in this case, B16001) or the fields on the table (e.g. that B16001_024E is the total number of people who speak Greek at home).

In a perfect world, I think this library would help the user find the table they're interested in, and then enumerate the fields within it.

I took a baby step in this direction with PR #61, which allows you get to a raw JSON list of all tables, but it's a long way from a solution. The good news is that all of the data we need to solve the problem is available in the Census API. https://api.census.gov/data/2016/acs/acs5/groups/B16001.json

If the crowd here agrees this is a worthy problem statement and use case, I'd like to start a discussion about how to solve it.

@mr-fuller
Copy link
Contributor

I'm a fan of how some packages in R return census metadata in a data frame, so I have played around with returning a pandas data frame with the information. An example of a returned data frame might look something like:
image

However, I'm not sure if printing a data frame to the console works as well in a typical python development environment.

Also, if this thread gains traction, issue #55 would become redundant.

@palewire
Copy link
Contributor Author

palewire commented Nov 28, 2018

Thanks for the input. I don't know if a dataframe output is within the scope of where this package wants to end up, but I do wonder if we could achieve some of the same things by packaging the JSON metadata from the API in a vanilla Python object to help grab extra data from related URLs, offer annotations and labels with values, etc.

@JoeGermuska
Copy link

JoeGermuska commented Nov 28, 2018

As noted in #61 cenpy has some of this functionality, although it depends upon dataframes -- see the demo notebook to see it in action.

Taking the variable names as they come from the API, they're pretty unwieldy. The median string length of variable labels is 65, in part because they're concatenations of all of the parent labels. If you split the concatenation, the median variable has four parts (three parents).

What the heck, here's the full describe on those two factors:

image

Definitely interested in the problem; wonder about the relative value of reproducing things that cenpy already does -- I've only recently been playing around with any of the python census libraries.

@JoeGermuska
Copy link

this is a crude way to format labels, leveraging cenpy, but which could be adapted to raw API output. (v is as defined in the previous comment)

image

It sort of breaks down for tables which aren't in a tree structure, like B01002 Median Age:

image

I guess one could infer here that the labels are median age, but I'd have preferred output more like:

Median Age
  Total
  Male
  Female

There are a few detailed tables I can't recall off hand which are kind of mixed this way -- the labels aren't a strict tree. And I don't think this would adapt very well to the data profiles or subject tables.

@fgregg
Copy link
Member

fgregg commented Nov 28, 2018

Could the census be open to having all tables be tree structured? @loganpowell

@palewire
Copy link
Contributor Author

palewire commented Nov 28, 2018

IMHO, having the library so closely integrated with pandas is a bug, not a feature. I'd like to see a lower-level mapping of Python to data, which can then be rolled up into dataframes further on down the line.

@JoeGermuska
Copy link

JoeGermuska commented Nov 28, 2018

The challenge is that there are no variables corresponding to those "missing" tree nodes.

In the Excel spreadsheet which has served as the best structured metadata we have for Census Reporter, variables have "line numbers", and the non-variable labels have n.5 numbers...

select * from acs2017_1yr.census_column_metadata where table_id = 'B01002';
 table_id | line_number |  column_id  | column_title  | indent | parent_column_id 
----------+-------------+-------------+---------------+--------+------------------
 B01002   |         0.5 | B01002000.5 | Median age -- |      0 | 
 B01002   |         1.0 | B01002001   | Total:        |      1 | B01002000.5
 B01002   |         2.0 | B01002002   | Male          |      1 | B01002000.5
 B01002   |         3.0 | B01002003   | Female        |      1 | B01002000.5
(4 rows)

@knaaptime
Copy link

the labels are also structured quite differently for different surveys (ACS vs, e.g. 2000 or 1990 decennial)

@loganpowell
Copy link

loganpowell commented Nov 29, 2018

Hi folks! Sorry for the delay. I believe @palewire with https://api.census.gov/data/2016/acs/acs5/groups/B16001.json is on the right track. Unfortunately, @fgregg the Census API team is not taking feature requests (tree structure) as their top priority for the foreseeable future is getting the rest of the house of Census data into the API... You all have noted that the /variables.json list (especially for ACS) is huge (@40k variables).

@ljwolf
Copy link

ljwolf commented Mar 15, 2019

Would be happy to collaborate/merge where possible. I've just merged and released a better default variable explorer using dataframes in cenpy. More than happy to recommend census as a good frontend to packages of datasets built on top of cenpy, or to consider where to merge things together...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants