This repository has been archived by the owner on May 14, 2018. It is now read-only.

Identify mangled strings #14

Open
davharris opened this issue Apr 1, 2014 · 1 comment

@davharris
Contributor

OpenRefine can identify lots of annoying cases where the same string is spelled in different ways. I'm sure other people have thought hard about this, but I'd be willing to take a naive shot at it.

Here are a few ideas. I hope you all can make some suggestions as well.

  • Detect inconsistent capitalization
  • Detect abbreviated species names (e.g. H. sapiens)
  • Detect strings that only appear a few times
  • Detect pairs of elements with a low string distance (e.g. edit distance)
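
A rough sketch of the four ideas above, in Python with only the standard library. Everything here is an assumption for illustration: the function names, the `max_count` and `min_ratio` thresholds, and the sample data are made up and are not part of any existing package, and `difflib.SequenceMatcher` is just a stand-in for whatever string distance ends up being used.

```python
import re
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical pattern for abbreviated binomials such as "H. sapiens":
# one capital letter, a period, optional whitespace, a lowercase epithet.
ABBREVIATED_SPECIES = re.compile(r"^[A-Z]\.\s*[a-z]+$")


def inconsistent_capitalization(values):
    """Return groups of distinct spellings that differ only by letter case."""
    groups = defaultdict(set)
    for value in values:
        groups[value.casefold()].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def abbreviated_species(values):
    """Return the distinct values that look like an abbreviated species name."""
    return sorted({v for v in values if ABBREVIATED_SPECIES.match(v.strip())})


def rare_values(values, max_count=1):
    """Return values that appear at most `max_count` times (threshold is a guess)."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n <= max_count)


def similar_pairs(values, min_ratio=0.85):
    """Return pairs of distinct values whose similarity ratio is at least `min_ratio`.

    Compares every pair, so this is quadratic in the number of distinct values.
    """
    unique = sorted(set(values))
    return [
        (a, b)
        for a, b in combinations(unique, 2)
        if SequenceMatcher(None, a, b).ratio() >= min_ratio
    ]


# Made-up sample column, not one of the test datasets.
sample = [
    "Homo sapiens", "Homo sapiens", "homo sapiens", "H. sapiens",
    "Homo sapeins", "Pan troglodytes", "Pan troglodytes",
]

print(inconsistent_capitalization(sample))
print(abbreviated_species(sample))
print(rare_values(sample))
print(similar_pairs(sample))
```

Each check only flags candidates for a human to look at. None of them decides which spelling is the right one.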
@karthik
Owner

karthik commented Apr 1, 2014

This sounds great. We've got an installable package now, and there are some test datasets in the local folder to test against. I can find some crappier datasets to put things through.
