
Improve the spelling checker #70

Open
ali1234 opened this issue Apr 5, 2022 · 1 comment

Comments

ali1234 (Owner) commented Apr 5, 2022

The spellcheck tool automatically corrects errors using a dictionary-based lookup, similar to the one you would find in a word processor. This doesn't work very well, because the errors produced by the deconvolution algorithm are different from those a human would make.

Deconvolution errors always result in exactly one character being substituted for another. For example, "e" is often confused with "j". This type of error is measured using Hamming distance, the number of such substitutions required to transform one string into another.

Humans make this type of error, but they also make two other kinds: using the correct letters in the wrong order, and omitting or adding letters. For example, typing "teh", "te", or "thee" instead of "the". These types of errors are measured using Levenshtein distance: the number of insertions, deletions, or substitutions required to transform one string into another.
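For concreteness, Levenshtein distance can be computed with the classic Wagner-Fischer dynamic programme. This is a generic sketch, not code from this repository:

```python
def levenshtein(s1, s2):
    # Wagner-Fischer: prev[j] holds the edit distance between the
    # first i-1 characters of s1 and the first j characters of s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (c1 != c2)))   # substitution
        prev = cur
    return prev[-1]
```

Note that a pure transposition like "teh" costs 2 under this metric (two substitutions), while "te" and "thee" each cost 1; the Damerau-Levenshtein variant counts a transposition as a single edit.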

The spell check engine currently used measures error by Levenshtein distance, which means it misestimates the difference between the input string and the suggestions, so the suggestions are suboptimal. For example, given the input string "jdge", it is more likely to suggest "judge" than "edge". If the text had been typed by a human, "judge" would be the better suggestion. But given the way deconvolution works, "edge" is the best suggestion, and "judge" is impossible because it would require inserting a new character.
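To make the example concrete: under plain Levenshtein distance, "judge" and "edge" are both one edit away from "jdge", so something else in the engine (presumably word-frequency weighting) breaks the tie in favour of "judge". A substitution-only metric that treats length changes as unreachable rules "judge" out entirely. A hypothetical sketch (function names are mine):

```python
def levenshtein(s1, s2):
    # standard dynamic-programming edit distance
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1]

def deconvolution_distance(s1, s2):
    # deconvolution only ever substitutes characters, so a word of a
    # different length can never be the right correction
    if len(s1) != len(s2):
        return float("inf")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

for candidate in ("judge", "edge"):
    print(candidate,
          levenshtein("jdge", candidate),
          deconvolution_distance("jdge", candidate))
```

Both candidates score 1 on Levenshtein, but only "edge" is reachable under the substitution-only metric.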

The current code contains workarounds for cases like these: the suggestions are filtered to remove any word of a different length, and any word that can't be reached via a list of common substitutions (e/j being one such pair). This filtering has to happen after the spell check engine has returned its list of suggestions, which is limited to the best N it found. The correct answer can be ranked too low by the Levenshtein distance test to make that list, in which case the filtering returns no suggestions at all.
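The post-filtering described above might look roughly like this (the function name and the confusion list are illustrative guesses, not the actual code):

```python
# Illustrative confusion pairs produced by deconvolution; the real
# list in the repository is longer.
CONFUSIONS = {("e", "j"), ("j", "e")}

def filter_suggestions(word, suggestions):
    kept = []
    for s in suggestions:
        if len(s) != len(word):
            continue  # deconvolution never changes the length
        if all(a == b or (a, b) in CONFUSIONS
               for a, b in zip(word, s)):
            kept.append(s)
    return kept
```

With the input "jdge" and engine suggestions ["judge", "edge"], only "edge" survives; if the engine's top-N list never contained "edge" in the first place, the result is empty, which is exactly the failure mode described above.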

Calculating Levenshtein distance is also slower than calculating Hamming distance, so a more specific algorithm that uses Hamming distance would be both faster and more accurate. Unfortunately there doesn't seem to be any drop-in replacement spell checking engine that can be tuned this way.
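A purpose-built suggester along those lines could simply scan the dictionary for same-length words and rank them by substitution count. A naive sketch (names and API are mine, not a proposal for the actual interface):

```python
def hamming(s1, s2):
    # assumes equal-length inputs; enforced by the caller below
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def suggest(word, dictionary, max_dist=2):
    # only same-length words are reachable by pure substitutions
    same_length = (w for w in dictionary if len(w) == len(word))
    scored = sorted((hamming(word, w), w) for w in same_length)
    return [w for dist, w in scored if dist <= max_dist]
```

This is linear in the dictionary size per lookup; a real implementation would want an index (e.g. bucketing by length, or a BK-tree over Hamming distance) rather than a full scan.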

MatthewTromp commented
What about simply counting the number of positions at which the two strings differ? For instance:

def hamming_distance(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("strings must be of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))
