Commit

v0.7 (#135)
* Added option to extract and pass word/document embeddings for faster iteration
* Focused on making the documentation a bit nicer (visualizations, etc.)
* Fixed #71
* Fixed #122, #136
MaartenGr committed Nov 3, 2022
1 parent c512c21 commit 7b763ae
Showing 13 changed files with 379 additions and 67 deletions.
35 changes: 35 additions & 0 deletions docs/changelog.md
@@ -1,3 +1,38 @@
---
hide:
- navigation
---


## **Version 0.7.0**
*Release date: 3 November, 2022*

**Highlights**:

* Cleaned up documentation and added several visual representations of the algorithm (excluding MMR / MaxSum)
* Added a function to extract and pass word and document embeddings, which should make fine-tuning much faster

```python
from keybert import KeyBERT

kw_model = KeyBERT()

# Prepare embeddings
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs)

# Extract keywords without needing to re-calculate embeddings
keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
```

Do note that the parameters passed to `.extract_embeddings` for creating the vectorizer should be exactly the same as those in `.extract_keywords`.

**Fixes**:

* Redundant documentation was removed by [@mabhay3420](https://github.com/priyanshul-govil) in [#123](https://github.com/MaartenGr/KeyBERT/pull/123)
* Fixed Gensim backend not working after v4 migration ([#71](https://github.com/MaartenGr/KeyBERT/issues/71))
* Fixed `candidates` not working ([#122](https://github.com/MaartenGr/KeyBERT/issues/122))


## **Version 0.6.0**
*Release date: 25 July, 2022*

5 changes: 5 additions & 0 deletions docs/faq.md
@@ -1,3 +1,8 @@
---
hide:
- navigation
---

## **Which embedding model works best for which language?**
Unfortunately, there is no definitive list of the best models for each language; this highly depends
on your data, the model, and your specific use case. However, the default model in KeyBERT
115 changes: 81 additions & 34 deletions docs/guides/quickstart.md
@@ -14,7 +14,13 @@ pip install keybert[spacy]
pip install keybert[use]
```

## **Usage**

<div class="excalidraw">
--8<-- "docs/images/pipeline.svg"
</div>


## **Basic usage**

The most minimal example can be seen below for the extraction of keywords:
```python
@@ -70,6 +76,12 @@ keywords = kw_model.extract_keywords(doc, highlight=True)
I would advise either `"all-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
for multilingual documents in any other language.

## **Fine-tuning**

By default, KeyBERT simply compares the documents and candidate keywords/keyphrases based on their cosine similarity. However, this can lead
to very similar words ending up in the list of most relevant keywords/keyphrases. To make the results a bit more diverse, there are two
approaches we can take to fine-tune our output: **Max Sum Distance** and **Maximal Marginal Relevance**.

### **Max Sum Distance**

To diversify the results, we take the 2 x top_n most similar words/phrases to the document.
@@ -93,8 +105,8 @@ keywords / keyphrases which is also based on cosine similarity. The results
with **high diversity**:

```python
kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                          use_mmr=True, diversity=0.7)
[('algorithm generalize training', 0.7727),
('labels unseen instances', 0.1649),
('new examples optimal', 0.4185),
@@ -114,58 +126,93 @@ The results with **low diversity**:
('learning algorithm generalize', 0.7514)]
```

## **Candidate Keywords/Keyphrases**
In some cases, one might want to use candidate keywords generated by other keyword algorithms or retrieved from a select list of possible keywords/keyphrases. In KeyBERT, you can easily use those candidate keywords to perform keyword extraction:

```python
import yake
from keybert import KeyBERT

doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs.[1] It infers a
function from labeled training data consisting of a set of training examples.[2]
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""

# Create candidates
kw_extractor = yake.KeywordExtractor(top=50)
candidates = kw_extractor.extract_keywords(doc)
candidates = [candidate[0] for candidate in candidates]

# Pass candidates to KeyBERT
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, candidates=candidates)
```

## **Guided KeyBERT**

Guided KeyBERT is similar to Guided Topic Modeling in that it tries to steer the training towards a set of seeded terms. When applying KeyBERT, it automatically extracts the keywords most related to a specific document. However, there are times when stakeholders and users are looking for specific types of keywords. For example, when publishing an article on your website through Contentful, you typically already know the global keywords related to the article. However, there might be a specific topic in the article that you would like to be extracted through the keywords. To achieve this, we simply give KeyBERT a set of related seeded keywords (it can also be a single one!) and search for keywords that are similar to both the document and the seeded keywords.

<div class="excalidraw">
--8<-- "docs/images/guided.svg"
</div>

Using this feature is as simple as defining a list of seeded keywords and passing them to KeyBERT:


```python
doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs.[1] It infers a
function from labeled training data consisting of a set of training examples.[2]
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""

from keybert import KeyBERT
kw_model = KeyBERT()

# Define our seeded term
seed_keywords = ["information"]
keywords = kw_model.extract_keywords(doc, seed_keywords=seed_keywords)
```

## **Prepare embeddings**

When you have a large dataset and want to fine-tune parameters such as `diversity`, it can take quite a while to re-calculate the document and
word embeddings each time you change a parameter. Instead, we can pre-calculate these embeddings and pass them to `.extract_keywords` so that
we only have to calculate them once:


```python
from keybert import KeyBERT

kw_model = KeyBERT()
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs)
```

You can then pass these embeddings to `.extract_keywords` to speed up tuning the model:

```python
keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
```

There are several parameters in `.extract_embeddings` that define how the list of candidate keywords/keyphrases is generated:

* `candidates`
* `keyphrase_ngram_range`
* `stop_words`
* `min_df`
* `vectorizer`

The values of these parameters need to be exactly the same in `.extract_embeddings` as they are in `.extract_keywords`.

In other words, the following will work as they use the same parameter subset:

```python
from keybert import KeyBERT

kw_model = KeyBERT()
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, min_df=1, stop_words="english")
keywords = kw_model.extract_keywords(docs, min_df=1, stop_words="english",
doc_embeddings=doc_embeddings,
word_embeddings=word_embeddings)
```

The following, however, will throw an error since we did not use the same values for `min_df` and `stop_words`:

```python
from keybert import KeyBERT

kw_model = KeyBERT()
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, min_df=3, stop_words="dutch")
keywords = kw_model.extract_keywords(docs, min_df=1, stop_words="english",
doc_embeddings=doc_embeddings,
word_embeddings=word_embeddings)
```
16 changes: 16 additions & 0 deletions docs/images/guided.svg
16 changes: 16 additions & 0 deletions docs/images/pipeline.svg
5 changes: 5 additions & 0 deletions docs/index.md
@@ -1,3 +1,8 @@
---
hide:
- navigation
---

<img src="https://raw.githubusercontent.com/MaartenGr/KeyBERT/master/images/logo.png" width="35%" height="35%" align="right" />

# **KeyBERT**
Expand Down
12 changes: 12 additions & 0 deletions docs/stylesheets/extra.css
Expand Up @@ -5,3 +5,15 @@
:root>* {
--md-typeset-a-color: #0277BD;
}

body[data-md-color-primary="black"] .excalidraw svg {
filter: invert(100%) hue-rotate(180deg);
}

body[data-md-color-primary="black"] .excalidraw svg rect {
fill: transparent;
}

.excalidraw {
text-align: center;
}
2 changes: 1 addition & 1 deletion keybert/__init__.py
@@ -1,3 +1,3 @@
from keybert._model import KeyBERT

__version__ = "0.7.0"
