Detection of an omitted space #26

Open
fingoldo opened this issue Mar 28, 2021 · 1 comment

@fingoldo

Thanks for this wonderful lib!

Could you add functionality to detect accidentally merged words, i.e. cases where the space separating two words was omitted?

from autocorrect import Speller
spellEn = Speller('en')
[spellEn.get_candidates(lemma) for lemma in ['test','project','testproject']]

>>>[[(495684, 'test')], [(1628175, 'project')], [(0, 'testproject')]]

It would be cool if 'testproject' could produce the correct candidates: 'test' and 'project'.
How hard would it be to add such a feature?

@filyp
Owner

filyp commented Mar 28, 2021

Hi!
It would complicate the logic a bit, but it's possible.
It would require adding a function that generates these splits in https://github.com/fsondej/autocorrect/blob/master/autocorrect/typos.py
and assigning scores to those splits in https://github.com/fsondej/autocorrect/blob/master/autocorrect/__init__.py, for example as min(score_word1, score_word2).
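
To make the idea concrete, here is a rough standalone sketch of split generation and min-based scoring. It does not touch the library's internals: `generate_splits`, `split_candidates`, and the plain `word_freq` dict are hypothetical names for illustration, and the two frequencies are the ones reported in the issue above.

```python
def generate_splits(word):
    """Return every way to cut `word` into two non-empty parts."""
    return [(word[:i], word[i:]) for i in range(1, len(word))]


def split_candidates(word, word_freq):
    """Score each split whose halves are both known words.

    The score of a split is min(score_word1, score_word2), so a split
    is only as plausible as its rarer half.
    `word_freq` is a hypothetical {word: frequency} mapping.
    """
    candidates = []
    for left, right in generate_splits(word):
        if left in word_freq and right in word_freq:
            score = min(word_freq[left], word_freq[right])
            candidates.append((score, f"{left} {right}"))
    # Highest-scoring split first.
    return sorted(candidates, reverse=True)


# Frequencies taken from the get_candidates output in the issue.
word_freq = {"test": 495684, "project": 1628175}
print(split_candidates("testproject", word_freq))
# -> [(495684, 'test project')]
```

In a real PR these candidates would still have to be merged with the regular single-word candidates and compared on the same scale.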

Also, I fear that this splitting would happen too often, for example
ashe -> as he instead of ashes
anso -> an so instead of also
This would require some calibration, for example downscoring short words, which complicates things further. It might also be necessary to switch off double-typo correction when using these splits.
I don't have time to add this feature, but I would happily merge a PR with it, if the score in tests increases.
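
As a rough illustration of the calibration mentioned above, one option would be to multiply a split's score by a penalty for every character by which its shorter half falls below some minimum length. Everything here is an assumption made for the sake of the sketch: the function name `penalized_split_score`, the `min_len` and `penalty` defaults, and the frequencies for 'as', 'he' and 'ashes' are made-up placeholders, not values from the library.

```python
def penalized_split_score(left, right, word_freq, min_len=4, penalty=0.001):
    """Score a split as min(freq) and downscore it when a half is short.

    `min_len` and `penalty` are untuned, hypothetical parameters; they
    would need calibrating against the test suite.
    """
    score = min(word_freq[left], word_freq[right])
    shortest = min(len(left), len(right))
    missing = max(0, min_len - shortest)  # characters below the threshold
    return score * penalty ** missing


# Made-up frequencies, for illustration only.
word_freq = {"as": 5_000_000, "he": 4_000_000, "ashes": 1_000,
             "test": 495684, "project": 1628175}

# 'ashe': the split 'as he' is heavily downscored, so 'ashes' can win.
print(penalized_split_score("as", "he", word_freq) < word_freq["ashes"])

# 'testproject': both halves are long enough, so the score is unchanged.
print(penalized_split_score("test", "project", word_freq) == 495684)
```

Whether a multiplicative penalty like this is the right shape is an open question; the point is only that short halves need some handicap before splits compete with ordinary corrections.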
