wolfgarbe/SymSpellCompound

SymSpellCompound

SymSpellCompound has been integrated into SymSpell. Please visit the SymSpell repository!


Compound aware automatic spelling correction

SymSpellCompound supports compound aware automatic spelling correction of multi-word input strings.
It is built on top of SymSpell's 1 million times faster spelling correction algorithm.

1. Compound splitting & decompounding

SymSpell originally treated every input string as a single term. SymSpellCompound supports compound splitting / decompounding and handles three cases:

  1. a mistakenly inserted space within a correct word, producing two incorrect terms
  2. a mistakenly omitted space between two correct words, producing one incorrect combined term
  3. multiple input terms with or without spelling errors

Splitting errors, concatenation errors, substitution errors, transposition errors, deletion errors and insertion errors can be mixed within the same word.
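
As an illustration of the first two cases only, here is a minimal Python sketch. It is not SymSpellCompound's actual implementation: the names `VOCAB`, `merge_candidate`, `split_candidates` and `correct_compound` are hypothetical, the vocabulary is a toy set, and the greedy strategy ignores word frequencies and edit-distance errors.

```python
# Toy vocabulary (assumption); a real system would load a frequency dictionary.
VOCAB = {"where", "is", "the", "love", "he", "had", "and", "inspired", "him"}


def merge_candidate(left, right, vocab):
    """Case 1: a space was inserted inside a correct word ("ins pired")."""
    joined = left + right
    return joined if joined in vocab else None


def split_candidates(term, vocab):
    """Case 2: a space was omitted between two correct words ("hehad")."""
    return [
        (term[:i], term[i:])
        for i in range(1, len(term))
        if term[:i] in vocab and term[i:] in vocab
    ]


def correct_compound(text, vocab):
    """Greedy left-to-right repair of split and concatenation errors."""
    tokens = text.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in vocab:
            out.append(tok)
            i += 1
            continue
        # Try joining the unknown token with its right neighbour.
        if i + 1 < len(tokens):
            joined = merge_candidate(tok, tokens[i + 1], vocab)
            if joined is not None:
                out.append(joined)
                i += 2
                continue
        # Otherwise try splitting it into two known words.
        splits = split_candidates(tok, vocab)
        if splits:
            out.extend(splits[0])
        else:
            out.append(tok)  # leave unknown terms untouched
        i += 1
    return " ".join(out)


print(correct_compound("whereis the love hehad and ins pired him", VOCAB))
# prints: where is the love he had and inspired him
```

A full solution also has to cope with terms that are both wrongly split or joined and misspelled, which this sketch does not attempt.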

2. Automatic spelling correction

  • Large document collections make manual correction infeasible and require unsupervised, fully-automatic spelling correction.
  • In conventional spelling correction of a single token, the user is presented with spelling correction suggestions.
    For automatic spelling correction of long multi-word text, the algorithm itself has to make an educated choice.
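
One way to picture such an "educated choice" is to rank candidate corrections by edit distance first and word frequency second. The Python sketch below is an assumption-laden illustration, not the library's code: `FREQ`, `osa_distance` and `best_correction` are hypothetical names, the frequencies are invented for the example, and the linear scan over the dictionary stands in for SymSpell's much faster lookup.

```python
# Toy frequency dictionary (assumption); the counts are made up for illustration.
FREQ = {"the": 1000, "ten": 60, "tea": 30, "third": 50, "quarter": 40}


def osa_distance(a, b):
    """Edit distance counting insertions, deletions, substitutions and
    adjacent transpositions (optimal string alignment variant)."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            val = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            # Adjacent transposition, e.g. "si" vs "is".
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                val = min(val, prev2[j - 2] + 1)
            cur.append(val)
        prev2, prev = prev, cur
    return prev[len(b)]


def best_correction(term, freq, max_distance=2):
    """Pick the closest dictionary word; break ties by higher frequency."""
    if term in freq:
        return term
    candidates = []
    for word, count in freq.items():
        d = osa_distance(term, word)
        if d <= max_distance:
            candidates.append((d, -count, word))
    # Fall back to the input when nothing is close enough.
    return min(candidates)[2] if candidates else term


print(" ".join(best_correction(t, FREQ) for t in "te dhird qarter".split()))
# prints: the third quarter
```

Here "te" is one edit away from "the", "ten" and "tea", so the most frequent word wins; no suggestion list is ever shown to a user.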

Examples:

- whereis th elove hehad dated forImuch of thepast who couqdn'tread in sixthgrade and ins pired him
+ where is the love he had dated for much of the past who couldn't read in sixth grade and inspired him  (9 edits)

- in te dhird qarter oflast jear he hadlearned ofca sekretplan y iran
+ in the third quarter of last year he had learned of a secret plan by iran  (10 edits)

- the bigjest playrs in te strogsommer film slatew ith plety of funn
+ the biggest players in the strong summer film slate with plenty of fun  (9 edits)

- Can yu readthis messa ge despite thehorible sppelingmsitakes
+ can you read this message despite the horrible spelling mistakes  (9 edits)

Performance

0.2 milliseconds / word
5000 words / second (single core on a 2012 MacBook Pro)

Applications

  • Query correction (10–15% of queries contain misspelled terms),
  • Chatbots,
  • OCR post-processing,
  • Automated proofreading.