County name cleanup module #77
Comments
I have worked on something similar before. A human-in-the-loop system can solve this problem. Think of it as a repository of all the different spelling errors that can occur for a county name. The model (with human intervention where needed) matches these faulty spellings to the correct name based on context (state name, area code, etc.). Once people start using it, the confirmed spellings can be cached in the system to make the process faster. |
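For illustration, the idea in the comment above could be sketched roughly as follows. This is only a minimal sketch under assumptions, not the module that was later built: `CountyCleaner`, `COUNTIES`, and the 0.8 threshold are hypothetical, the county list is a tiny sample, and the standard-library `difflib` stands in for whatever matcher is actually used.

```python
from difflib import SequenceMatcher

# Hypothetical sample reference data: canonical county names keyed by state.
COUNTIES = {
    "CA": ["Los Angeles", "Alameda", "San Diego"],
    "IL": ["Cook", "Lake"],
}


class CountyCleaner:
    """Fuzzy matcher with a cache of known misspellings (illustrative only)."""

    def __init__(self, counties, threshold=0.8):
        self.counties = counties
        self.threshold = threshold  # assumed cutoff on difflib's 0..1 ratio
        self.cache = {}  # (state, lowercased spelling) -> canonical name

    def _best_match(self, state, raw):
        # The state is the context: only counties in that state are candidates.
        best, best_score = None, 0.0
        for name in self.counties.get(state, []):
            score = SequenceMatcher(None, raw, name.lower()).ratio()
            if score > best_score:
                best, best_score = name, score
        return best, best_score

    def clean(self, state, raw):
        """Return the canonical name, or None if a human needs to review it."""
        key = (state, raw.strip().lower())
        if key in self.cache:
            return self.cache[key]
        best, score = self._best_match(state, key[1])
        if best is not None and score >= self.threshold:
            self.cache[key] = best  # remember the automatic match
            return best
        return None

    def confirm(self, state, raw, canonical):
        """Record a human decision so the same spelling is resolved next time."""
        self.cache[(state, raw.strip().lower())] = canonical
```

A spelling that returns `None` from `clean` would be shown to a person, and their answer passed to `confirm`, so the cache grows as the tool is used.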
Sounds right. Do you want to build it? |
I can surely try it out. Do you have some documentation around the issue which can help me get started? |
You could start with https://github.com/CJWorkbench/cjworkbench/wiki/Creating-A-Module |
@jstray Added a very first version of this. You can check it out here - https://github.com/achyutjoshi/hello-workbench and https://github.com/achyutjoshi/cjworkbench You can try the workflow using this dummy dataset -
Things to note -
Would love to know your feedback. |
Hi! Thanks so much for this, it's a great start!
Finally, please join us on Gitter for faster response https://gitter.im/workbenchdata/Lobby |
Thanks!
|
Ah, I guess I am misunderstanding tolerance -- is it 0-100? I thought it was edit distance.
So 100=perfect matches only? Perhaps it should default to something much higher than 2.0. Or maybe it should work in reverse, default to zero, and be called "Match percentage error" or something with "percentage" in the name so users understand the range.
…On Sun, Jun 23, 2019 at 2:01 PM Achyut Joshi ***@***.***> wrote:
Thanks!
1. Yes, it does require fuzzywuzzy. Once we are done with the improvements, we can add the dependency to the Docker container?
2. State "ca" and county "foo" - What is the tolerance level you used? If I use 79 - it does work as expected.
3. I will complete the documentation and add it to the GitHub readme.
|
Yes. 100 = perfect matches. And yes, I will change the name so it is more intuitive and maybe default to something higher. |
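To make the 0-100 scale discussed above concrete, here is a small sketch of a similarity threshold. It is an assumption-laden illustration rather than the module's code: `similarity`, `match_county`, and `min_score` are hypothetical names, and the standard-library `difflib` is used as a stand-in for fuzzywuzzy, whose scores may differ.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity on a 0..100 scale; 100 means identical after normalizing."""
    ratio = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return round(100 * ratio)


def match_county(raw, candidates, min_score=90):
    """Return the best candidate scoring at least min_score, else None.

    min_score is a hypothetical parameter: higher is stricter, 100 accepts
    only exact (case-insensitive) matches.
    """
    scored = [(similarity(raw, name), name) for name in candidates]
    if not scored:
        return None
    score, best = max(scored)
    return best if score >= min_score else None
```

With a very low threshold such as 2, nearly any input is accepted as a match for some county, which is consistent with the "foo" behaviour reported earlier in the thread; a default closer to 100 rejects such inputs.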
It would be very useful, for local data journalism, to have a module that cleans US county names and looks up their FIPS codes. Attached is a mockup of what this might look like.