Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Easier way to apply to DataFrames #42

Open
MaxGhenis opened this issue Apr 28, 2019 · 5 comments
Open

Easier way to apply to DataFrames #42

MaxGhenis opened this issue Apr 28, 2019 · 5 comments

Comments

@MaxGhenis
Copy link

In mapping a pandas DataFrame's numeric fips index to state name, I currently have to do this:

state['state'] = pd.Series(state.index).apply(
    lambda x: us.states.lookup(str(x).zfill(2)).name).tolist()

There are a couple things going on here that could be other issues*, but might also be nice to have the vectorization built-in for pandas. e.g. I'd like to be able to just do:
state['state'] = us.states.lookup(state.index)

* This one was verbose because things like us.states.lookup(1) and us.states.lookup('1') fail.

@mileswwatkins
Copy link
Member

Thanks for the input, Max! Can you elaborate a bit on what the content of what your state.index column looks like? State FIPS codes must be strings and must be zero-padded to a length of two (federal government examples: eg, eg), similar to ZIP codes, which is why you're getting those lookup failures.

@MaxGhenis
Copy link
Author

state.index was an int, but abstracting out the zero-padded string piece, passing that also requires using a map or apply statement, e.g. if fips_char is a zero-padded string pandas Series, getting a new Series with the state would require:

state.fips_char.map(us.states.lookup)

It'd be cool if it could instead be:

us.states.lookup(state.fips_char)

That is, if a Series (or numpy array) is passed, do the map/apply as needed and return an object of the same input size/shape.

Separately, since CSVs from sources like IPUMS provide fips codes as ints, saving that step could be useful, but I could see it also making it more error-prone, especially for ZIPs, and it sounds like you've thought about that and made a design decision to require preliminary padding.

@rchurt
Copy link

rchurt commented Jul 12, 2020

This would be great to have—any updates on the status?

@wolfc86
Copy link

wolfc86 commented Feb 7, 2021

Hey there! Great library, I've been using it in several projects at work. This issue caught my eye since we use Pandas a lot in my projects too, so I thought I'd weigh in.

Since this library is pretty lightweight (the only dependency I see is on jellyfish), I think it would be better to have guidance on how to quickly create a DataFrame from it:

import pandas as pd
from us.states import STATES
us_df = pd.DataFrame([vars(st) for st in STATES])

From there, I assume you wouldn't be doing a 'lookup' unless you had another dataset you were trying to merge with, but if that's the case then you would use something like df.merge now that you have this library's records as a DataFrame.

If it so happens to be that your dataset has numeric FIPS codes (this would be 'dirty' data though, since as @mileswwatkins stated, FIPS codes are officially zero-padded strings), you can either:

  1. convert the padded-string FIPS codes to numeric prior to merging:
us_df['fips'] = pd.to_numeric(us_df['fips'])

or 2) convert your data to padded-string FIPS codes prior to merging:

my_df['fips'] = my_df['fips'].str.pad(width=2, fillchar='0')

Alternatively, if you know you just need to convert from one representation to another and don't need all of the information from this library as a DataFrame, rather than doing a lookup one at a time, you can use this library's mapping function to quickly convert from one representation to another:

from us.states import mapping
my_df["abbr"] = my_df["fips"].map(mapping('fips', 'abbr'))

In my opinion: since the zero-padded version is the official designation, that's what this library should maintain, and it shouldn't make assumptions about what libraries downstream consumers are using (consider if someone else were using PySpark or Dask instead), but documentation/examples are always good to have (perhaps in a 'Recipes' guide)

Hope this helps!

@fedderw
Copy link

fedderw commented Apr 30, 2021

I've also had this problem and converted from int to a string with leading zeros as a workaround.

The issue with this, however, is that I caught the fact that some of the lookups didn't return any values only later on. There was no error, passing
person["statefip"].astype(str).apply(lambda x: us.states.lookup(x))
in this case returned a series with 44 unique states values, instead of 51, with no warning/error about the failed lookups

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants