Removing Punctuation #80

Open
islama-lh opened this issue May 2, 2020 · 2 comments

Comments

@islama-lh

I can see that this pull request resolves the issue with numbers.
It fixes the "service available 24/7" case, but punctuation is still being removed from sentences like the second one below:
Prev: servic available 24/7.
After: service available 24/7.
Prev: If the extracted string less less than 50 characters long, and is not sentence-terminated, then we assume that it is a header.
After: if the extracted string less less than 50 characters long and is not sentence terminated then we assume that it is a header

Is it possible to keep the punctuation?

@duhaime

duhaime commented May 13, 2020

I needed to keep punctuation, so I essentially just re-added it to each word after correcting its spelling:

import pkg_resources, string
from symspellpy import SymSpell, Verbosity

spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename('symspellpy', 'frequency_dictionary_en_82_765.txt')
spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

def correct(w):
  # look up the closest dictionary term for the raw token
  o = spell.lookup(w,
    Verbosity.CLOSEST,
    max_edit_distance=2,
    transfer_casing=True)
  # no suggestion within the edit distance: leave the token untouched
  if not o: return w
  word = o[0].term
  if w[0].isupper():
    word = word[0].upper() + word[1:]
  # find start punctuation
  start_idx = 0
  start_punct = ''
  while start_idx < len(w) and w[start_idx] in string.punctuation:
    start_punct += w[start_idx]
    start_idx += 1
  # find end punctuation: scan backwards so that every trailing
  # punctuation character is kept, in its original order
  end_idx = len(w)
  while end_idx > start_idx and w[end_idx - 1] in string.punctuation:
    end_idx -= 1
  end_punct = w[end_idx:]
  return start_punct + word + end_punct

s = '''Now that we have carried our geographical analogy quite far, we return to the uestion of isomorphisms between brains. You might well wonder why this whole uestion of brain isomorphisms has been stressed so much. What does it matter if two rains are isomorphic, or quasi-isomorphic, or not isomorphic at all? The answer is that e have an intuitive sense that, although other people differ from us in important ways, hey are still "the same" as we are in some deep and important ways. It would be nstructive to be able to pinpoint what this invariant core of human intelligence is, and hen to be able to describe the kinds of "embellishments" which can be added to it, aking each one of us a unique embodiment of this abstract and mysterious quality alled "intelligence".'''
cleaned = ' '.join([correct(w) for w in s.split()])
print(cleaned)

That prints:

Now that we have carried our geographical analogy quite far, we return to the question of isomorphisms between brains. You might well wonder why this whole question of brain isomorphisms has been stressed so much. What does it matter if two rains are isomorphic, or quasi-isomorphic, or not isomorphic at all? The answer is that a have an intuitive sense that, although other people differ from us in important ways, hey are still "the same" as we are in some deep and important ways. It would be instructive to be able to pinpoint what this invariant core of human intelligence is, and hen to be able to describe the kinds of "embellishments" which can be added to it, making each one of us a unique embodiment of this abstract and mysterious quality called "intelligence".
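
A possible variation on the approach above, as a sketch only: split each token into leading punctuation, core, and trailing punctuation first, and pass just the core to the spell checker, so the punctuation never counts toward the edit distance. The names `split_punct`, `correct_token`, and `fix_core` are hypothetical; `fix_core` stands in for whatever returns the corrected word, for example the first suggestion from `spell.lookup`.

```python
import re
import string

# Character class covering every character in string.punctuation.
_PUNCT = re.escape(string.punctuation)
# Leading punctuation, core text (lazy), trailing punctuation.
_TOKEN_RE = re.compile(rf'^([{_PUNCT}]*)(.*?)([{_PUNCT}]*)$', re.DOTALL)

def split_punct(token):
    """Return (leading punctuation, core, trailing punctuation)."""
    return _TOKEN_RE.match(token).groups()

def correct_token(token, fix_core):
    """Correct only the core of a token and re-attach its punctuation.

    fix_core is a hypothetical callable (str -> str) wrapping the
    actual spell checker.
    """
    lead, core, trail = split_punct(token)
    if not core:
        # The token is punctuation only: nothing to correct.
        return token
    return lead + fix_core(core) + trail
```

Punctuation inside a token, such as the hyphen in "quasi-isomorphic" or the slash in "24/7", stays part of the core, so the corrector still has to cope with it.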

@amyoungil

amyoungil commented Aug 16, 2021

I find that the example above seems to interfere with SymSpell's joining of incorrectly split pieces of a word. For example,
"Had our forefathers fai led on that day of trial whichwe now cele brate ;" is corrected to
"Had our forefathers fax led on that day of trial which we now cell rate;"

Is that because the code above processes the string word by word?

Calling sym_spell.lookup_compound(input_term, max_edit_distance=2, transfer_casing=True) instead
yields the correct spelling, but the punctuation is lost:
"Had our forefathers failed on that day of trial which we now celebrate had their votes"

Do you get the same results?
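
One way to combine the two behaviours, as an untested sketch rather than anything SymSpell provides: split the text on punctuation, run a compound correction on each stretch of words in between, and join the pieces back with the original punctuation. The names `correct_keeping_punct` and `fix_segment` are hypothetical; `fix_segment` is assumed to wrap something like `sym_spell.lookup_compound(segment, max_edit_distance=2)[0].term`.

```python
import re

# Runs of sentence punctuation plus surrounding whitespace. The capture
# group makes re.split keep the delimiters in the result list.
_DELIM_RE = re.compile(r'(\s*[.,;:!?]+\s*)')

def correct_keeping_punct(text, fix_segment):
    """Correct each punctuation-free segment and keep the delimiters.

    fix_segment is a hypothetical callable (str -> str) wrapping a
    compound-aware corrector.
    """
    parts = _DELIM_RE.split(text)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            # Odd indices hold the captured punctuation: keep verbatim.
            out.append(part)
        elif part.strip():
            out.append(fix_segment(part))
        else:
            # Empty or whitespace-only segment: nothing to correct.
            out.append(part)
    return ''.join(out)
```

Because each segment reaches the corrector whole, split words such as "cele brate" can still be joined. This sketch treats only `. , ; : ! ?` as delimiters, so quotes and other symbols are still passed to the corrector, and a decimal such as "3.5" would be split at the period.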
