v0.4 (#43)
* Use paraphrase-MiniLM-L6-v2 as the default embedding model
* Highlight a document's keywords
* Added FAQ
MaartenGr committed Jun 30, 2021
1 parent eb6d086 commit 25dab3a
Showing 18 changed files with 242 additions and 83 deletions.
21 changes: 14 additions & 7 deletions README.md
@@ -90,8 +90,8 @@ from keybert import KeyBERT

doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs.[1] It infers a
function from labeled training data consisting of a set of training examples.[2]
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
@@ -100,7 +100,7 @@ doc = """
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
kw_model = KeyBERT('distilbert-base-nli-mean-tokens')
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
```

@@ -127,10 +127,17 @@ of words you would like in the resulting keyphrases:
('learning function', 0.5850)]
```

We can highlight the keywords in the document by simply setting `highlight=True`:

```python
keywords = kw_model.extract_keywords(doc, highlight=True)
```
<img src="images/highlight.png" width="75%" height="75%" />


**NOTE**: For a full overview of all possible transformer models see [sentence-transformer](https://www.sbert.net/docs/pretrained_models.html).
I would advise either `'distilbert-base-nli-mean-tokens'` or `'xlm-r-distilroberta-base-paraphrase-v1'` as they
have shown great performance in semantic similarity and paraphrase identification respectively.
I would advise either `"paraphrase-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
for multilingual documents or documents in any other language.
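
For example, a multilingual model can be selected in exactly the same way. A minimal sketch, reusing the `doc` defined above:

```python
from keybert import KeyBERT

# Multilingual embedding model for non-English or mixed-language documents
kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")
keywords = kw_model.extract_keywords(doc)
```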

<a name="maxsum"/></a>
### 2.3. Max Sum Similarity
@@ -198,7 +205,7 @@ and pass it through KeyBERT with `model`:

```python
from keybert import KeyBERT
kw_model = KeyBERT(model='distilbert-base-nli-mean-tokens')
kw_model = KeyBERT(model='paraphrase-MiniLM-L6-v2')
```

Or select a SentenceTransformer model with your own parameters:
@@ -207,7 +214,7 @@ Or select a SentenceTransformer model with your own parameters:
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
sentence_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)
```

14 changes: 14 additions & 0 deletions docs/changelog.md
@@ -1,3 +1,17 @@
## **Version 0.4.0**
*Release date: 23 June, 2021*

**Highlights**:

* Highlight a document's keywords with:
* ```keywords = kw_model.extract_keywords(doc, highlight=True)```
* Use `paraphrase-MiniLM-L6-v2` as the default embedder which gives great results!

**Miscellaneous**:

* Update Flair dependencies
* Added FAQ

## **Version 0.3.0**
*Release date: 10 May, 2021*

20 changes: 20 additions & 0 deletions docs/faq.md
@@ -0,0 +1,20 @@
## **Which embedding model works best for which language?**
Unfortunately, there is no definitive list of the best models for each language; this highly depends
on your data, the model, and your specific use case. However, the default model in KeyBERT
(`"paraphrase-MiniLM-L6-v2"`) works great for **English** documents. In contrast, for **multilingual**
documents or any other language, `"paraphrase-multilingual-MiniLM-L12-v2"` has shown great performance.

If you want a model that provides higher quality but takes more compute time, then I would advise using `paraphrase-mpnet-base-v2` or `paraphrase-multilingual-mpnet-base-v2` instead.
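
As a minimal sketch, swapping in one of these higher-quality models only changes the `model` argument:

```python
from keybert import KeyBERT

doc = "Supervised learning is the machine learning task of learning a function."

# Higher-quality but slower English model; swap in the multilingual variant if needed
kw_model = KeyBERT(model="paraphrase-mpnet-base-v2")
keywords = kw_model.extract_keywords(doc)
```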


## **Should I preprocess the data?**
No. When using document embeddings there is typically no need to preprocess the data, as all parts of a document
are important for understanding its general topic. Although this holds true in 99% of cases, if your data
contains a lot of noise, for example HTML tags, then it is best to remove them. HTML tags
typically do not contribute to the meaning of a document and should therefore be removed. However, if you apply
topic modeling to HTML code to extract topics of code, then they do become important.
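
A minimal sketch of removing HTML tags before extraction; a simple regex is assumed here for illustration, and a dedicated HTML parser would be more robust:

```python
import re

from keybert import KeyBERT

html_doc = "<p>Supervised learning is the <b>machine learning</b> task of learning a function.</p>"

# Strip HTML tags and collapse the remaining whitespace
clean_doc = re.sub(r"<[^>]+>", " ", html_doc)
clean_doc = re.sub(r"\s+", " ", clean_doc).strip()

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(clean_doc)
```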


## **Can I use the GPU to speed up the model?**
Yes! Since KeyBERT uses embeddings as its backend, a GPU is actually preferred when using this package.
Although it is possible to use it without a dedicated GPU, the inference speed will be significantly slower.
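
A minimal sketch of running the embedding backend on a GPU, assuming a CUDA-capable device is available, by loading the sentence-transformers model yourself and passing it to KeyBERT:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Load the embedding model on the GPU and pass it to KeyBERT
sentence_model = SentenceTransformer("paraphrase-MiniLM-L6-v2", device="cuda")
kw_model = KeyBERT(model=sentence_model)
```
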
8 changes: 4 additions & 4 deletions docs/guides/embeddings.md
@@ -8,15 +8,15 @@ and pass it through KeyBERT with `model`:

```python
from keybert import KeyBERT
kw_model = KeyBERT(model="xlm-r-bert-base-nli-stsb-mean-tokens")
kw_model = KeyBERT(model="paraphrase-MiniLM-L6-v2")
```

Or select a SentenceTransformer model with your own parameters:

```python
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cuda")
sentence_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)
```

@@ -60,7 +60,7 @@ import spacy

nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

kw_model = KeyBERT(model=document_glove_embeddings)nlp
kw_model = KeyBERT(model=nlp)
```

Using spacy-transformer models:
@@ -129,7 +129,7 @@ class CustomEmbedder(BaseEmbedder):
return embeddings

# Create custom backend
distilbert = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
distilbert = SentenceTransformer("paraphrase-MiniLM-L6-v2")
custom_embedder = CustomEmbedder(embedding_model=distilbert)

# Pass custom backend to keybert
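# (sketch) the custom backend is presumably passed through the same `model` argument as the other backends
kw_model = KeyBERT(model=custom_embedder)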
12 changes: 9 additions & 3 deletions docs/guides/quickstart.md
@@ -38,7 +38,7 @@ doc = """
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
kw_model = KeyBERT('distilbert-base-nli-mean-tokens')
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
```

@@ -65,9 +65,15 @@ of words you would like in the resulting keyphrases:
('learning function', 0.5850)]
```

We can highlight the keywords in the document by simply setting `highlight=True`:

```python
keywords = kw_model.extract_keywords(doc, highlight=True)
```

**NOTE**: For a full overview of all possible transformer models see [sentence-transformer](https://www.sbert.net/docs/pretrained_models.html).
I would advise either `'distilbert-base-nli-mean-tokens'` or `'xlm-r-distilroberta-base-paraphrase-v1'` as they
have shown great performance in semantic similarity and paraphrase identification respectively.
I would advise either `"paraphrase-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
for multilingual documents or documents in any other language.

### Max Sum Similarity

35 changes: 21 additions & 14 deletions docs/index.md
@@ -7,7 +7,7 @@ create keywords and keyphrases that are most similar to a document.

## About the Project

Although that are already many methods available for keyword generation
Although there are already many methods available for keyword generation
(e.g.,
[Rake](https://github.com/aneesha/RAKE),
[YAKE!](https://github.com/LIAAD/yake), TF-IDF, etc.)
@@ -30,11 +30,6 @@ papers and solutions out there that use BERT-embeddings
), I could not find a BERT-based solution that did not have to be trained from scratch and
could be used by beginners (**correct me if I'm wrong!**).
Thus, the goal was a `pip install keybert` and at most 3 lines of code in usage.

**NOTE**: If you use MMR to select the candidates instead of simple cosine similarity,
this repo is essentially a simplified implementation of
[EmbedRank](https://github.com/swisscom/ai-research-keyphrase-extraction)
with BERT-embeddings.

## Installation
Installation can be done using [pypi](https://pypi.org/project/keybert/):
@@ -43,22 +38,33 @@ Installation can be done using [pypi](https://pypi.org/project/keybert/):
pip install keybert
```

To use Flair embeddings, install KeyBERT as follows:
You may want to install additional dependencies depending on the transformer and language backends that you will be using. The possible installations are:

```
pip install keybert[flair]
pip install keybert[gensim]
pip install keybert[spacy]
pip install keybert[use]
```

To install all backends:

```
pip install keybert[all]
```


## Usage


A minimal example of keyword extraction can be seen below:
```python
from keybert import KeyBERT

doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs.[1] It infers a
function from labeled training data consisting of a set of training examples.[2]
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
@@ -67,13 +73,14 @@ doc = """
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
model = KeyBERT('distilbert-base-nli-mean-tokens')
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
```

You can set `keyphrase_length` to set the length of the resulting keyphras:
You can set `keyphrase_ngram_range` to set the length of the resulting keywords/keyphrases:

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(1, 1))
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
('algorithm', 0.4556),
('training', 0.4487),
@@ -85,10 +92,10 @@ To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher depending on the number
of words you would like in the resulting keyphrases:

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(1, 2))
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
('machine learning', 0.6305),
('supervised learning', 0.5985),
('algorithm analyzes', 0.5860),
('learning function', 0.5850)]
```
Binary file added images/highlight.png
4 changes: 2 additions & 2 deletions keybert/__init__.py
@@ -1,3 +1,3 @@
from keybert.model import KeyBERT
from keybert._model import KeyBERT

__version__ = "0.3.0"
__version__ = "0.4.0"
96 changes: 96 additions & 0 deletions keybert/_highlight.py
@@ -0,0 +1,96 @@
import re
from rich.console import Console
from rich.highlighter import RegexHighlighter
from typing import Tuple, List


class NullHighlighter(RegexHighlighter):
"""Apply style to anything that looks like an email."""

base_style = ""
highlights = [r""]


def highlight_document(doc: str,
keywords: List[Tuple[str, float]]):
""" Highlight keywords in a document
Arguments:
doc: The document for which to extract keywords/keyphrases
keywords: the top n keywords for a document with their respective distances
to the input document
Returns:
highlighted_text: The document with additional tags to highlight keywords
according to the rich package
"""
keywords_only = [keyword for keyword, _ in keywords]
max_len = max([len(token.split(" ")) for token in keywords_only])

if max_len == 1:
highlighted_text = _highlight_one_gram(doc, keywords_only)
else:
highlighted_text = _highlight_n_gram(doc, keywords_only)

console = Console(highlighter=NullHighlighter())
console.print(highlighted_text)


def _highlight_one_gram(doc: str,
keywords: List[str]) -> str:
""" Highlight 1-gram keywords in a document
Arguments:
doc: The document for which to extract keywords/keyphrases
keywords: the top n keywords for a document
Returns:
highlighted_text: The document with additional tags to highlight keywords
according to the rich package
"""
tokens = re.sub(r' +', ' ', doc.replace("\n", " ")).split(" ")

highlighted_text = " ".join([f"[black on #FFFF00]{token}[/]"
if token.lower() in keywords
else f"{token}"
for token in tokens]).strip()
return highlighted_text


def _highlight_n_gram(doc: str,
keywords: List[str]) -> str:
""" Highlight n-gram keywords in a document
Arguments:
doc: The document for which to extract keywords/keyphrases
keywords: the top n keywords for a document
Returns:
highlighted_text: The document with additional tags to highlight keywords
according to the rich package
"""
max_len = max([len(token.split(" ")) for token in keywords])
tokens = re.sub(r' +', ' ', doc.replace("\n", " ")).strip().split(" ")
n_gram_tokens = [[" ".join(tokens[i: i + max_len][0: j + 1]) for j in range(max_len)] for i, _ in enumerate(tokens)]
highlighted_text = []
skip = False

for n_grams in n_gram_tokens:
candidate = False

if not skip:
for index, n_gram in enumerate(n_grams):

if n_gram.lower() in keywords:
candidate = f"[black on #FFFF00]{n_gram}[/]" + n_grams[-1].split(n_gram)[-1]
skip = index + 1

if not candidate:
candidate = n_grams[0]

highlighted_text.append(candidate)

else:
skip = skip - 1
highlighted_text = " ".join(highlighted_text)
return highlighted_text
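
# A minimal usage sketch (an illustration, not part of the original module):
# highlight_document is presumably invoked internally when
# `extract_keywords(..., highlight=True)` is used, but it can also be called
# directly with previously extracted keywords.
if __name__ == "__main__":
    from keybert import KeyBERT

    example_doc = "Supervised learning is the machine learning task of learning a function."
    kw_model = KeyBERT()
    example_keywords = kw_model.extract_keywords(example_doc, keyphrase_ngram_range=(1, 2))
    highlight_document(example_doc, example_keywords)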
File renamed without changes.
File renamed without changes.
