String-Similarity-Custom-Functions

Text Classification: Fast, custom string similarity functions in Python mapping lambda functions to NumPy arrays, assigning issuers to one of 10,000 classes. This automates 8 hours of manual work monthly in public data production.

“One of our monthly production processes was deemed so complex and time consuming that we attempted outsourcing it for assistance reworking it. When this yearlong effort proved fruitless, I identified ways to streamline it. In 1,500 lines of code, I was able to save 8 hours of manual RA work every month. I completely rewrote the code in Python for one major portion of the process currently accomplished by outdated scripts written in Perl. I wrote very fast custom word similarity functions by mapping lambda functions to Numpy arrays. For this project, I had to determine string similarity, namely matching about 200-400 never seen before raw names each month to the correct corresponding class (of 10,000 classes). My algorithm employs 5 different similarity techniques and executes in less than 5 minutes. This code includes a manual classification user interface where the user is shown the algorithm’s best guesses, they can easily select their choice, and then it pushes this selection to the existing SQL database at the very end. The user can also modify/manipulate/delete past string matches as well as add/delete/modify new classes all within this interface. The code is also designed very flexibly to allow for user error, like exiting out of the program prematurely without losing progress. Likewise, the user can make changes to the database after the fact, many times, and the SQL push code is set up to only execute portions it recognizes as having changed since the last time it was executed. The Python code also must push about 18,000 additional string matches to the database based on most common past matches for strings it’s seen exactly before. Clever coding avoids the need for a for-loop when matching in the database not only on the string itself, but also on that string’s unique identifier. To do this a vectorized solution is used that executes in about 2 minutes. The entire code executes in about 10 minutes of time.”

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
Custom Search Techniques		Custom Search Techniques
README.md		README.md
classifyer_alpha		classifyer_alpha
manual_issuer_classification		manual_issuer_classification
old_issuers		old_issuers
same_name_diff_issuer		same_name_diff_issuer
sql_push		sql_push

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Custom Search Techniques

Custom Search Techniques

README.md

README.md

classifyer_alpha

classifyer_alpha

manual_issuer_classification

manual_issuer_classification

old_issuers

old_issuers

same_name_diff_issuer

same_name_diff_issuer

sql_push

sql_push

Repository files navigation

String-Similarity-Custom-Functions

About

Releases

Packages

Languages

katlass/String-Similarity-Custom-Functions

Folders and files

Latest commit

History

Repository files navigation

String-Similarity-Custom-Functions

About

Topics

Resources

Stars

Watchers

Forks

Languages