Problem ranking text containing abbreviation, such as U.S.A #14

xD0135 · 2021-07-30T06:30:05Z

Hi, first of all thanks for this library, you are awesome 🚀

I'm having an issue ranking text that contains abbreviations such as U.S.A (short for United States of America) or No. 7 (short for Number 7), because the "." is currently used here https://github.com/DavidBelicza/TextRank/blob/master/parse/rule.go#L21 to set the bounds of words.

Do you currently have a way to get around this problem? Or should I simply create a new rule implementing the Rule interface that checks for known abbreviations?


DavidBelicza · 2021-07-30T07:44:50Z

Hi @xD0135, I faced this issue too a while ago. I left it as it is because the solution would be domain-specific. As you mentioned, implementing the Rule interface can be the solution.

If you create a whitelist of tokens that skip this check and are kept as whole tokens, that could work. However, I think this would be too domain-specific for this repo.
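A minimal sketch of the whitelist idea, done as a preprocessing step rather than inside the library. Everything here is an assumption for illustration: the `knownAbbreviations` list, the `placeholder` character, and the `protectAbbreviations` / `restoreAbbreviations` helpers are hypothetical names, not part of TextRank, and whether the tokenizer keeps the placeholder inside a word would need to be checked against the actual rule.

```go
package main

import (
	"fmt"
	"strings"
)

// knownAbbreviations is a hypothetical, domain-specific whitelist.
// List longer abbreviations before shorter ones that share a prefix.
var knownAbbreviations = []string{"U.S.A", "No."}

// placeholder replaces "." inside whitelisted abbreviations so that a
// "."-based separator no longer sees it. U+2024 (ONE DOT LEADER) is an
// arbitrary choice for this sketch.
const placeholder = "\u2024"

// protectAbbreviations hides the periods of every whitelisted
// abbreviation found in text. Matching is case-sensitive and purely
// textual, so "No." at the end of a sentence is also protected.
func protectAbbreviations(text string) string {
	for _, abbr := range knownAbbreviations {
		masked := strings.ReplaceAll(abbr, ".", placeholder)
		text = strings.ReplaceAll(text, abbr, masked)
	}
	return text
}

// restoreAbbreviations turns the placeholder back into "." so that
// results can be shown in their original spelling.
func restoreAbbreviations(text string) string {
	return strings.ReplaceAll(text, placeholder, ".")
}

func main() {
	raw := "Made in the U.S.A. See No. 7."
	masked := protectAbbreviations(raw)
	fmt.Printf("%q\n", masked)
	fmt.Printf("%q\n", restoreAbbreviations(masked))
}
```

The masked text would be passed to the ranking step, and `restoreAbbreviations` applied to whatever words or phrases come back. The trailing period after "U.S.A" is left alone, so it still ends the sentence.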

Alternatively, the sentence separator list in the Rule could contain ". " or ".\n" instead of ".". But in that case, not all texts would be parsed well. I would need to know how this package is generally used. If the text usually originates from emails, forums, or chats, then changing the sentence separator could work. But if the text comes from parsed books, it could break the tokenization.
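To make the trade-off concrete, here is a rough standalone sketch of splitting on "." only when it is followed by whitespace or the end of the text. It does not use the library's Rule interface; `splitSentences` is a hypothetical helper written only for this illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitSentences ends a sentence at '.' only when the period is
// followed by whitespace or is the last character of the input.
// A period followed directly by a letter (as in "U.S.A") is kept.
func splitSentences(text string) []string {
	var sentences []string
	runes := []rune(text)
	start := 0
	for i, r := range runes {
		if r != '.' {
			continue
		}
		atEnd := i == len(runes)-1
		if atEnd || unicode.IsSpace(runes[i+1]) {
			s := strings.TrimSpace(string(runes[start : i+1]))
			if s != "" {
				sentences = append(sentences, s)
			}
			start = i + 1
		}
	}
	// Keep any trailing text that does not end with a period.
	if rest := strings.TrimSpace(string(runes[start:])); rest != "" {
		sentences = append(sentences, rest)
	}
	return sentences
}

func main() {
	fmt.Printf("%q\n", splitSentences("I live in the U.S.A now. It is big."))
	fmt.Printf("%q\n", splitSentences("See No. 7 today."))
}
```

With this rule "U.S.A" stays in one piece, but "No. 7" is still split after "No." because that period is followed by a space, so the separator change alone does not cover every abbreviation.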
