County name cleanup module #77

jstray · 2018-05-10T21:54:26Z

It would be very useful, for local data journalism, to have a module that cleans US county names and looks up their FIPS codes. Attached is a mockup of what this might look like.

achyutjoshi · 2019-05-29T06:58:19Z

I have worked on something similar before. A human-in-the-loop system can solve this problem. Think of it as a repository of all the misspellings that can occur for a county name. The model (along with human intervention) matches each misspelling to the correct county based on context (state name, area code, etc.). Once people start using it, resolved misspellings can be cached in the system to make the process faster.
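A minimal sketch of the caching idea, assuming a plain in-memory dictionary. All names here (`FIPS`, `known_misspellings`, `review_queue`, `lookup_fips`, `record_correction`) are hypothetical and not part of any existing module, and the table holds only two sample counties:

```python
from typing import Optional, Tuple

Key = Tuple[str, str]

# Hypothetical canonical table: (state, county) -> FIPS code (tiny sample only).
FIPS = {
    ("maryland", "baltimore county"): "24005",
    ("florida", "jackson county"): "12063",
}

# Cache of misspellings that were already resolved, by a model or by a person.
known_misspellings = {
    ("maaryland state", "baltimore"): ("maryland", "baltimore county"),
}

# Inputs that could not be resolved and are waiting for a human decision.
review_queue = []


def _key(state: str, county: str) -> Key:
    """Normalize whitespace and case so lookups are consistent."""
    return (state.strip().lower(), county.strip().lower())


def lookup_fips(state: str, county: str) -> Optional[str]:
    """Return the FIPS code, or None after queueing the input for review."""
    raw = _key(state, county)
    canonical = known_misspellings.get(raw, raw)
    code = FIPS.get(canonical)
    if code is None and raw not in review_queue:
        review_queue.append(raw)
    return code


def record_correction(state: str, county: str,
                      canonical_state: str, canonical_county: str) -> None:
    """Store a human decision so the same misspelling resolves next time."""
    raw = _key(state, county)
    known_misspellings[raw] = _key(canonical_state, canonical_county)
    if raw in review_queue:
        review_queue.remove(raw)
```

In this sketch, an unknown input such as `("florida", "jackson")` returns `None` and lands in `review_queue`; once someone calls `record_correction("florida", "jackson", "Florida", "Jackson County")`, later lookups of the same input resolve directly from the cache.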

jstray · 2019-05-29T15:26:47Z

Sounds right. Do you want to build it?

achyutjoshi · 2019-05-31T05:59:56Z

I can surely try it out. Do you have some documentation around the issue which can help me get started?

jstray · 2019-05-31T17:56:09Z

You could start with https://github.com/CJWorkbench/cjworkbench/wiki/Creating-A-Module

achyutjoshi · 2019-06-13T06:02:17Z

@jstray Added a very first version of this. You can check it out here - https://github.com/achyutjoshi/hello-workbench and https://github.com/achyutjoshi/cjworkbench

You can try the workflow using this dummy dataset -

import pandas as pd

# Misspellings are intentional test input
dummy_df = pd.DataFrame({
    'state': ['maaryland state', 'Georgia', 'California', 'colarado', 'florida'],
    'county': ['baltimore', 'brooks', 'achyut', 'jackson', 'jackson'],
})

Things to note -

I am in the process of adding tests and documentation.
I know of an existing bug that occurs when tolerance > 80. I will fix that soon too.

Would love to know your feedback.

jstray · 2019-06-19T18:23:24Z

Hi! Thanks so much for this, it's a great start!
I tested it in Workbench. Some notes:

1. I notice it requires fuzzywuzzy. Makes sense. But the Workbench Docker container does not normally install that module, so we'd have to add it before we could deploy this on our servers.
2. State "ca" and county "foo" resolves to Modoc County, California. This is more than edit distance 2 away, so I'm not sure why it matches. I'd expect the result to be null if there is no match close enough.
3. Documentation would definitely be useful. You can set the help link in the YAML; it could just go to the GitHub readme for now.
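To illustrate the edit-distance expectation in the second note, here is a minimal sketch that returns null (`None`) when nothing is within the allowed distance. It uses a hand-written Levenshtein distance rather than fuzzywuzzy, and `SAMPLE_COUNTIES`, `levenshtein`, and `closest_county` are hypothetical names introduced only for this example:

```python
from typing import Iterable, Optional

# Hypothetical subset of California county names, for illustration only.
SAMPLE_COUNTIES = ["Modoc", "Marin", "Los Angeles"]


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def closest_county(name: str, candidates: Iterable[str],
                   max_distance: int = 2) -> Optional[str]:
    """Return the nearest candidate within max_distance edits, else None."""
    name = name.strip().lower()
    best, best_dist = None, max_distance + 1
    for candidate in candidates:
        dist = levenshtein(name, candidate.lower())
        if dist < best_dist:
            best, best_dist = candidate, dist
    return best
```

Under these assumptions, "foo" is three edits away from "modoc", so `closest_county("foo", SAMPLE_COUNTIES)` yields `None`, while a one-letter typo such as "modok" still resolves to "Modoc".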

Finally, please join us on Gitter for faster response https://gitter.im/workbenchdata/Lobby

achyutjoshi · 2019-06-23T18:01:52Z

Thanks!

1. Yes, it does require fuzzywuzzy. Once we are done with the improvements, we can add the dependency to the Docker container?
2. State "ca" and county "foo" - What is the tolerance level you used? If I use 79 - it does work as expected.
3. I will complete the documentation and add it to the GitHub readme.

jstray · 2019-06-23T19:12:07Z

Ah I guess I am misunderstanding tolerance -- is it 0-100? I thought it was edit distance. So 100=perfect matches only? Perhaps it should default to something much higher than 2.0. Or maybe it should work in reverse, default to zero, and be called "Match percentage error" or something with "percentage" in the name so users understand the range.


achyutjoshi · 2019-06-23T19:14:36Z

Yes. 100 = perfect matches.

And yes, I will change the name so it is more intuitive and maybe default to something higher.
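A rough sketch of the 0-100 scale discussed above, with a parameter name that carries "percentage" in it. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for fuzzywuzzy, so the exact scores may differ from the module's; `match_score`, `best_match`, `min_match_percentage`, and `SAMPLE_COUNTIES` are hypothetical names:

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple

# Hypothetical subset of California county names, for illustration only.
SAMPLE_COUNTIES = ["Modoc", "Marin", "Los Angeles"]


def match_score(a: str, b: str) -> int:
    """Similarity from 0 (nothing in common) to 100 (identical), ignoring case."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_match(name: str, candidates: Iterable[str],
               min_match_percentage: int = 80) -> Optional[Tuple[str, int]]:
    """Return (candidate, score) for the best match, or None if below the cutoff."""
    scored = [(c, match_score(name, c)) for c in candidates]
    if not scored:
        return None
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= min_match_percentage else None
```

In this sketch a cutoff of 2 accepts almost anything, so "foo" is matched to "Modoc", which is consistent with the behavior reported earlier in the thread. A cutoff of 79 rejects it, and a cutoff of 100 accepts exact matches only.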

jstray added enhancement help wanted labels May 10, 2018
