Use stopword lists to reduce index size and improve search results #250

Open
lioman opened this issue Mar 7, 2022 · 3 comments
Labels
improvement-request Request for new or enhanced behavior.

@lioman

lioman commented Mar 7, 2022

To keep the search index small and clean and to improve search results, it would be good to be able to remove common words from the index. Examples would be this, that, a, and for English, or ein, eine, weil, dass, der, die, das for German indices.

For most sites this would decrease the size of the index and improve search results. For the search term "and" we would not return "and he goes...", "and Peter..." but something like "Android", "Andreas".
This applies not only to the common words of a language but, if the list is well chosen, to other words too.
E.g. on a site about coffee, the word "coffee" will appear on nearly every page, so it could make sense to remove it from the index, because otherwise the search results for it would just list the entire site.
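
To illustrate the idea, here is a minimal Python sketch of stopword filtering at index time. It is not how Stork builds its index; `STOPWORDS`, `tokenize`, `build_index`, and `prefix_search` are hypothetical names made up for this example, and the two sample pages are invented.

```python
import re
from collections import defaultdict

# Hypothetical stopword list for illustration only; not shipped by Stork.
STOPWORDS = frozenset({"this", "that", "a", "and"})


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def build_index(documents, stopwords=frozenset()):
    """Map each token that is not a stopword to the ids of the pages containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            if token not in stopwords:
                index[token].add(doc_id)
    return dict(index)


def prefix_search(index, prefix):
    """Return all indexed tokens that start with the given prefix, sorted."""
    prefix = prefix.lower()
    return sorted(token for token in index if token.startswith(prefix))


# Invented sample pages.
documents = {
    "page-1": "And he goes to the Android meetup",
    "page-2": "Andreas and Peter like that phone",
}

full_index = build_index(documents)
filtered_index = build_index(documents, STOPWORDS)

print(len(full_index), len(filtered_index))
print(prefix_search(full_index, "and"))
print(prefix_search(filtered_index, "and"))
```

With the filter applied, the index holds fewer entries and a search for "and" only matches the longer words that start with it.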

@jameslittle230
Owner

This is a good idea - I'll incorporate it into a future release. Thanks!
-James

@jmooring
Contributor

Resource: NLTK stopwords by language (25 languages)

On a per-language basis, my high-level "want" is:

  1. Provide a default list so I don't have to think about it. Result --> foo bar
  2. Provide a mechanism to disable one or more items from the default list. Result --> foo
  3. Provide a mechanism to add one or more items to the list. Result --> foo baz
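
The three steps above could be sketched roughly as follows in Python. This is only an illustration of the requested behavior under my own assumptions, not an existing Stork API: `DEFAULT_STOPWORDS`, `resolve_stopwords`, and the `disable`/`add` parameters are hypothetical, and `foo`, `bar`, `baz` are the placeholder words from the list.

```python
# Hypothetical per-language defaults, using the placeholder words from the list above.
DEFAULT_STOPWORDS = {
    "en": frozenset({"foo", "bar"}),
}


def resolve_stopwords(language, disable=(), add=()):
    """Start from the language default, drop disabled words, then add extra ones."""
    defaults = DEFAULT_STOPWORDS.get(language, frozenset())
    return (set(defaults) - set(disable)) | set(add)


# 1. Default list, nothing to think about.
step_1 = resolve_stopwords("en")
# 2. Disable one item from the default list.
step_2 = resolve_stopwords("en", disable=["bar"])
# 3. Additionally add one item to the list.
step_3 = resolve_stopwords("en", disable=["bar"], add=["baz"])

print(sorted(step_1), sorted(step_2), sorted(step_3))
```

Applying `disable` before `add` is a design choice in this sketch: a word that appears in both ends up in the final list.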

I just started working with Stork this morning. Very, very nice.

@lioman
Author

lioman commented Mar 14, 2022

To simplify things, I think the first step should be just to allow adding a custom list. Including the NLTK list would be nice, but being able to supply the whole list myself is the feature you always need in the end. So I would start with that and perhaps add ease-of-use features later.
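
As a rough sketch of what "just adding a list" could look like, the Python snippet below reads a user-supplied list with one word per line. The file format (one word per line, `#` for comments) and the name `parse_stopword_list` are assumptions for illustration; nothing like this is defined by Stork in this thread.

```python
def parse_stopword_list(text):
    """Return the set of stopwords in `text`, one word per line.

    Blank lines and lines starting with '#' are skipped (an assumed convention).
    """
    words = set()
    for line in text.splitlines():
        word = line.strip().lower()
        if word and not word.startswith("#"):
            words.add(word)
    return words


# Example user-supplied list, using the German words from the first comment.
user_list = """
# German stopwords supplied by the site owner
ein
eine
weil
dass
der
die
das
"""

stopwords = parse_stopword_list(user_list)
print(sorted(stopwords))
```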

jameslittle230 added the improvement-request label on Mar 17, 2023
jameslittle230 removed their assignment on Mar 17, 2023

3 participants