Skip to content

Text Classification: Fast, custom string similarity functions in Python mapping lambda functions to NumPy arrays, assigning issuers to one of 10,000 classes.

Notifications You must be signed in to change notification settings

katlass/String-Similarity-Custom-Functions

Repository files navigation

String-Similarity-Custom-Functions

Text Classification: Fast, custom string similarity functions in Python mapping lambda functions to NumPy arrays, assigning issuers to one of 10,000 classes. This automates 8 hours of manual work monthly in public data production.

“One of our monthly production processes was deemed so complex and time consuming that we attempted outsourcing it for assistance reworking it. When this yearlong effort proved fruitless, I identified ways to streamline it. In 1,500 lines of code, I was able to save 8 hours of manual RA work every month. I completely rewrote the code in Python for one major portion of the process currently accomplished by outdated scripts written in Perl. I wrote very fast custom word similarity functions by mapping lambda functions to Numpy arrays. For this project, I had to determine string similarity, namely matching about 200-400 never seen before raw names each month to the correct corresponding class (of 10,000 classes). My algorithm employs 5 different similarity techniques and executes in less than 5 minutes. This code includes a manual classification user interface where the user is shown the algorithm’s best guesses, they can easily select their choice, and then it pushes this selection to the existing SQL database at the very end. The user can also modify/manipulate/delete past string matches as well as add/delete/modify new classes all within this interface. The code is also designed very flexibly to allow for user error, like exiting out of the program prematurely without losing progress. Likewise, the user can make changes to the database after the fact, many times, and the SQL push code is set up to only execute portions it recognizes as having changed since the last time it was executed. The Python code also must push about 18,000 additional string matches to the database based on most common past matches for strings it’s seen exactly before. Clever coding avoids the need for a for-loop when matching in the database not only on the string itself, but also on that string’s unique identifier. To do this a vectorized solution is used that executes in about 2 minutes. The entire code executes in about 10 minutes of time.”

About

Text Classification: Fast, custom string similarity functions in Python mapping lambda functions to NumPy arrays, assigning issuers to one of 10,000 classes.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages