Usage of ignore_token parameter to word_segmentation not documented enough, does not work #87

Open
sbhaktha opened this issue Jan 19, 2021 · 0 comments

sbhaktha commented Jan 19, 2021

I have phrases containing named entities that I want the word_segmentation API to leave untouched. I tried replacing the named entities with placeholders (SPECIAL_TOKEN_1, SPECIAL_TOKEN_2, etc.) in the phrase itself and then passing SPECIAL_TOKEN_1 and SPECIAL_TOKEN_2 as ignore_token in the call to word_segmentation, but I cannot get this to work. Here is the baseline call without ignore_token:

phrase = "Hello SPECIAL_TOKEN_1, I am happyto meet you tomorrowmorning. Thanks, SPECIAL_TOKEN_2"
phrase_suggestions = sym_spell.word_segmentation(phrase)

phrase_suggestions looks like this:

Composition(segmented_string='Hello **SPECIAL _TOKEN_ 1,** I am happy to meet you tomorrow morning. Thanks, **SPECIAL_ TOKEN_2**', corrected_string='Hello Special token of I am happy to meet you tomorrow morning Thanks Special Token', distance_sum=14, log_prob_sum=-55.6460931972679)

Notice how SPECIAL_TOKEN_1 and SPECIAL_TOKEN_2 are split apart in segmented_string and rewritten in corrected_string.

I then tried using the ignore_token argument, but I cannot get it to work:

phrase = "Hello SPECIAL_TOKEN_1, I am happyto meet you tomorrowmorning. Thanks, SPECIAL_TOKEN_2"
phrase_suggestions = sym_spell.word_segmentation(phrase, ignore_token='SPECIAL_TOKEN_1')

I get back the same phrase_suggestions as before. I am also not sure how to pass multiple tokens to ignore.

I also tried passing a regular expression:

phrase_suggestions = sym_spell.word_segmentation(phrase, ignore_token=r"SPECIAL_TOKEN_\d")

and I get the following returned as phrase_suggestions:

Composition(segmented_string='Hello **SPECIAL _TOKEN_ 1**, I am happy to meet you tomorrow morning. Thanks, **SPECIAL_ TOKEN_2**', corrected_string='Hello Special token of I am happy to meet you tomorrow morning Thanks Special Token', distance_sum=14, log_prob_sum=-55.6460931972679)
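
As a sanity check, the pattern can be tested on its own with Python's re module, outside of symspellpy. The sketch below assumes ignore_token is meant to be a regular expression; build_ignore_pattern is a hypothetical helper (not part of symspellpy) that combines several literal tokens into one pattern using alternation.

```python
import re


def build_ignore_pattern(tokens):
    """Combine literal tokens into a single regex (hypothetical helper)."""
    # Longest tokens first, so a token that is a prefix of another
    # cannot shadow the longer one in the alternation.
    ordered = sorted(tokens, key=len, reverse=True)
    # re.escape guards against tokens that contain regex metacharacters.
    return "|".join(re.escape(token) for token in ordered)


phrase = (
    "Hello SPECIAL_TOKEN_1, I am happyto meet you tomorrowmorning. "
    "Thanks, SPECIAL_TOKEN_2"
)

# Explicit list of tokens joined into one pattern.
combined = build_ignore_pattern(["SPECIAL_TOKEN_1", "SPECIAL_TOKEN_2"])
combined_matches = re.findall(combined, phrase)

# The hand-written pattern from the call above.
digit_matches = re.findall(r"SPECIAL_TOKEN_\d", phrase)

print(combined_matches)
print(digit_matches)
```

Both forms find the two placeholders in the phrase, so the pattern itself does not appear to be the problem; how word_segmentation applies it is the open question.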

Could you please help and also add more documentation on using this parameter?

What's the recommended way to deal with named entities?
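
Until the intended usage is clarified, one possible workaround (not a documented or recommended symspellpy approach) is to split the phrase on the placeholders and segment only the text between them. In this sketch, segment_preserving_tokens is a hypothetical helper, and segment is any callable that maps a string to its segmented form. With symspellpy it could be something like lambda s: sym_spell.word_segmentation(s).segmented_string, which is assumed here rather than tested; the demo uses a stand-in segmenter so it runs without a dictionary.

```python
import re
from typing import Callable


def segment_preserving_tokens(
    phrase: str, token_pattern: str, segment: Callable[[str], str]
) -> str:
    """Segment the text between placeholders and keep placeholders verbatim.

    Hypothetical helper. token_pattern must not contain capturing groups,
    otherwise the odd/even bookkeeping below no longer holds.
    """
    # Wrapping the pattern in one capturing group makes re.split return
    # the matched placeholders at the odd indices of the result.
    parts = re.split(f"({token_pattern})", phrase)
    out = []
    for index, part in enumerate(parts):
        if index % 2 == 1 or not part.strip():
            # Placeholder, or a whitespace-only gap: keep unchanged.
            out.append(part)
            continue
        # Keep surrounding whitespace so the pieces rejoin cleanly.
        core = part.strip()
        lead = part[: len(part) - len(part.lstrip())]
        trail = part[len(part.rstrip()):]
        out.append(lead + segment(core) + trail)
    return "".join(out)


def fake_segment(text: str) -> str:
    # Stand-in for a real segmenter, used only for this demonstration.
    return text.replace("happyto", "happy to").replace(
        "tomorrowmorning", "tomorrow morning"
    )


phrase = (
    "Hello SPECIAL_TOKEN_1, I am happyto meet you tomorrowmorning. "
    "Thanks, SPECIAL_TOKEN_2"
)
result = segment_preserving_tokens(phrase, r"SPECIAL_TOKEN_\d", fake_segment)
print(result)
```

The placeholders never reach the segmenter, so they cannot be split or corrected. The trade-off is that each piece is segmented without the context of its neighbours, and a real segmenter may still treat punctuation at the piece boundaries differently.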
