County name cleanup module #77

Open
jstray opened this issue May 10, 2018 · 9 comments

Comments

@jstray
Collaborator

jstray commented May 10, 2018

It would be very useful, for local data journalism, to have a module that cleans US county names and looks up their FIPS codes. Attached is a mockup of what this might look like.
[Image: mockup of the proposed county name cleanup module]

@achyutjoshi

I have worked on something similar before. A human-in-the-loop system can solve this problem. Think of it as a repository of all the different misspellings that can occur for a county name. The model (with human intervention) matches these faulty spellings based on context (state name, area code, etc.). Once people start using it, the faulty spellings can be cached in the system to make the process faster.
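
A minimal sketch of the cached, human-confirmed lookup described above, for illustration only. It assumes a tiny hand-picked county table, uses the standard library's `difflib` as a stand-in for whatever matcher the real module uses, and the `CountyResolver` class and its method names are hypothetical:

```python
from difflib import get_close_matches

# Hypothetical sample: state -> {canonical county name: FIPS code}.
# A real module would load the full Census county list instead.
COUNTIES = {
    "MD": {"Baltimore County": "24005", "Howard County": "24027"},
    "GA": {"Brooks County": "13027", "Jackson County": "13157"},
}


class CountyResolver:
    def __init__(self, counties):
        self.counties = counties
        # (state, lowercased raw spelling) -> canonical name, confirmed by a human
        self.cache = {}

    def suggest(self, state, raw, cutoff=0.6):
        """Return a canonical county name for `raw`, or None if nothing is close."""
        key = (state, raw.strip().lower())
        if key in self.cache:  # a previously confirmed misspelling wins
            return self.cache[key]
        lowered = {name.lower(): name for name in self.counties.get(state, {})}
        hits = get_close_matches(key[1], list(lowered), n=1, cutoff=cutoff)
        return lowered[hits[0]] if hits else None

    def confirm(self, state, raw, canonical):
        """Record a human decision so the same misspelling resolves instantly later."""
        self.cache[(state, raw.strip().lower())] = canonical

    def fips(self, state, raw):
        name = self.suggest(state, raw)
        return self.counties[state][name] if name else None


resolver = CountyResolver(COUNTIES)
print(resolver.fips("MD", "baltimor cnty"))  # close enough for the fuzzy matcher
print(resolver.fips("MD", "Balto"))          # too far off: needs a human
resolver.confirm("MD", "Balto", "Baltimore County")
print(resolver.fips("MD", "Balto"))          # now served from the cache
```

Restricting candidates to the given state is one way to use context: "jackson" resolves differently in Georgia than in Florida.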

@jstray
Collaborator Author

jstray commented May 29, 2019

Sounds right. Do you want to build it?

@achyutjoshi

I can surely try it out. Do you have some documentation around the issue which can help me get started?

@jstray
Collaborator Author

jstray commented May 31, 2019

You could start with https://github.com/CJWorkbench/cjworkbench/wiki/Creating-A-Module

@achyutjoshi

@jstray Added a very first version of this. You can check it out here - https://github.com/achyutjoshi/hello-workbench and https://github.com/achyutjoshi/cjworkbench

You can try the workflow using this dummy dataset -

import pandas as pd

# Misspellings are intentional: they exercise the fuzzy matching.
dummy_df = pd.DataFrame({
    'state': ['maaryland state', 'Georgia', 'California', 'colarado', 'florida'],
    'county': ['baltimore', 'brooks', 'achyut', 'jackson', 'jackson'],
})

Things to note -

  1. I am in the process of adding tests and documentation
  2. There is a known bug that occurs when tolerance > 80. I will fix that soon too.

Would love to know your feedback.

@jstray
Collaborator Author

jstray commented Jun 19, 2019

Hi! Thanks so much for this, it's a great start!
I tested it in Workbench. Some notes:

  • I notice it requires fuzzywuzzy. Makes sense. But the Workbench Docker container doesn't normally install that module, so we'd have to add it before we could deploy this on our servers.
  • State "ca" and county "foo" resolves to Modoc County, California. This is more than edit distance 2 away, so I'm not sure why it matches. I'd expect the result to be null if there is no match close enough.
  • Documentation would definitely be useful. You can set the help link in the YAML; it could just go to the GitHub readme for now.

Finally, please join us on Gitter for faster response https://gitter.im/workbenchdata/Lobby

@achyutjoshi

Thanks!

  1. Yes it does require fuzzywuzzy. Once we are done with the improvements, we can add the dependency to the docker container?
  2. State "ca" and county "foo" - What is the tolerance level you used? If I use 79 - it does work as expected.
  3. I will complete the documentation and add it to the GitHub readme.

@jstray
Collaborator Author

jstray commented Jun 23, 2019 via email

@achyutjoshi

Yes. 100 = perfect matches.

And yes, I will change the name so it is more intuitive and maybe default to something higher.
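
For reference, a rough sketch of the threshold behaviour discussed in this thread: scores run from 0 to 100, 100 means a perfect match, and anything below the tolerance yields no match (null) instead of the nearest county. This is not the module's actual code. It uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer, and the `similarity` and `best_match` names and the default of 80 are assumptions:

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity score from 0 to 100 (100 = identical)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_match(query, candidates, tolerance=80):
    """Return the best-scoring candidate, or None if it scores below `tolerance`."""
    scored = [(similarity(query, c), c) for c in candidates]
    if not scored:
        return None
    score, name = max(scored)
    return name if score >= tolerance else None


# Hypothetical subset of California counties, for illustration only.
ca_counties = ["Modoc County", "Mono County", "Los Angeles County"]

print(best_match("Modok County", ca_counties))       # one letter off: matches
print(best_match("foo", ca_counties))                # nothing close: None
print(best_match("foo", ca_counties, tolerance=0))   # tolerance 0 accepts anything
```

With a very low tolerance, a string like "foo" still resolves to some county, which is the kind of surprising result reported above; a higher default makes the null result the normal outcome for bad input.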
