Problem ranking text containing abbreviation, such as U.S.A #14

Open
xD0135 opened this issue Jul 30, 2021 · 1 comment

Comments

xD0135 commented Jul 30, 2021

Hi, first of all, thanks for this library. You are awesome 🚀

I'm having an issue ranking text that contains abbreviations such as U.S.A (short for United States of America) or No. 7 (short for Number 7), because the "." character is currently used here https://github.com/DavidBelicza/TextRank/blob/master/parse/rule.go#L21 to set the bounds of words.

Do you currently have a way to get around this problem? Or should I simply create a new rule implementing the Rule interface that checks for known abbreviations?

DavidBelicza (Owner) commented

Hi @xD0135, I ran into this issue a while ago too. I left it as it is because any solution would be domain-specific. As you mentioned, implementing the Rule interface can be the solution.

If you create a whitelist of tokens that skip this check and are kept intact as tokens, that could work. However, I think this would be too domain-specific for this repo.
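
A minimal sketch of the whitelist idea, kept independent of the package so it makes no claims about the actual Rule interface. The names `splitSentences`, `protectAbbreviations`, and `mask` are hypothetical and not part of TextRank. It assumes the periods of known abbreviations can be masked before splitting on ".", "!" and "?", then restored afterwards:

```go
package main

import (
	"fmt"
	"strings"
)

// mask is a hypothetical placeholder; it is assumed never to occur in the input.
const mask = "\x00"

// protectAbbreviations replaces the periods inside every whitelisted
// abbreviation with mask so that a later split on "." leaves them alone.
// Matching is plain substring matching, so the whitelist has to be chosen
// with the domain in mind.
func protectAbbreviations(text string, whitelist []string) string {
	pairs := make([]string, 0, len(whitelist)*2)
	for _, abbr := range whitelist {
		pairs = append(pairs, abbr, strings.ReplaceAll(abbr, ".", mask))
	}
	return strings.NewReplacer(pairs...).Replace(text)
}

// splitSentences splits on '.', '!' and '?' after masking the whitelist,
// then restores the masked periods. The separators themselves are dropped.
func splitSentences(text string, whitelist []string) []string {
	protected := protectAbbreviations(text, whitelist)
	parts := strings.FieldsFunc(protected, func(r rune) bool {
		return r == '.' || r == '!' || r == '?'
	})
	sentences := make([]string, 0, len(parts))
	for _, p := range parts {
		s := strings.TrimSpace(strings.ReplaceAll(p, mask, "."))
		if s != "" {
			sentences = append(sentences, s)
		}
	}
	return sentences
}

func main() {
	whitelist := []string{"U.S.A", "No."}
	text := "I live in the U.S.A now. Take bus No. 7 home."
	for _, s := range splitSentences(text, whitelist) {
		fmt.Println(s)
	}
}
```

With this whitelist the sample text yields two sentences, and "U.S.A" and "No. 7" stay intact. Without the whitelist, "No. 7" would be cut in two. A custom Rule could apply the same lookup, but how that maps onto the interface depends on its method signatures.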

Alternatively, the sentence separator list in the Rule could contain ". " or ".\n" instead of ".". But in that case, not all texts would be parsed well. I would need to know how this package is generally used. If the text usually originates from emails, forums, or chats, then changing the sentence separator could work. But if the text comes from parsed books, it could break the tokenization.
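
To make the trade-off concrete, here is a rough standalone sketch of the ". " / ".\n" variant. `splitOnSpacedPeriod` is a hypothetical helper, not a TextRank function. As assumptions of this sketch, it treats any whitespace after the period as a boundary, also accepts a period at the very end of the text, and always splits on "!" and "?":

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitOnSpacedPeriod treats '.' as a sentence boundary only when it is
// followed by whitespace or is the last character of the text.
// '!' and '?' always end a sentence. The separators themselves are dropped.
func splitOnSpacedPeriod(text string) []string {
	runes := []rune(text)
	var sentences []string
	start := 0
	for i, r := range runes {
		last := i == len(runes)-1
		boundary := r == '!' || r == '?' ||
			(r == '.' && (last || unicode.IsSpace(runes[i+1])))
		if !boundary {
			continue
		}
		if s := strings.TrimSpace(string(runes[start:i])); s != "" {
			sentences = append(sentences, s)
		}
		start = i + 1
	}
	// Keep any trailing text that has no closing separator.
	if s := strings.TrimSpace(string(runes[start:])); s != "" {
		sentences = append(sentences, s)
	}
	return sentences
}

func main() {
	for _, s := range splitOnSpacedPeriod("I live in the U.S.A now. Take bus No. 7 home.") {
		fmt.Println(s)
	}
}
```

On the sample text, "U.S.A" survives because its periods are not followed by whitespace, but "No. 7" is still split after "No". Text whose sentences are not separated by whitespace, such as "First sentence.Second sentence.", comes back as a single sentence, which is the kind of breakage to expect with some parsed sources.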
