v0.3 (#32)
* Use candidate words instead of extracting them from the documents
* Add Spacy, Gensim, USE, and custom backends
* Improve imports
* Fix encoding error when locally installing KeyBERT (#30)
* Improve documentation (README & MkDocs)
* Add the main tutorial as a shield
* Fix typos (#31, #35)
MaartenGr committed May 10, 2021
1 parent 2a982bd commit eb6d086
Showing 16 changed files with 747 additions and 191 deletions.
57 changes: 37 additions & 20 deletions README.md
[![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/keybert/blob/master/LICENSE)
[![PyPI - PyPi](https://img.shields.io/pypi/v/keyBERT)](https://pypi.org/project/keybert/)
[![Build](https://img.shields.io/github/workflow/status/MaartenGr/keyBERT/Code%20Checks/master)](https://pypi.org/project/keybert/)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1OxpgwKqSzODtO3vS7Xe1nEmZMCAIMckX?usp=sharing)

<img src="images/logo.png" width="35%" height="35%" align="right" />

Installation can be done using [pypi](https://pypi.org/project/keybert/):

```
pip install keybert
```

You may want to install additional extras depending on the transformer and language backends that you will be using. The possible installations are:

```
pip install keybert[flair]
pip install keybert[gensim]
pip install keybert[spacy]
pip install keybert[use]
```

To install all backends:

```
pip install keybert[all]
```

<a name="usage"/></a>
## 2. Usage

The most minimal example can be seen below for the extraction of keywords:

```python
from keybert import KeyBERT

doc = """
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
kw_model = KeyBERT('distilbert-base-nli-mean-tokens')
keywords = kw_model.extract_keywords(doc)
```

You can set `keyphrase_ngram_range` to control the length of the resulting keywords/keyphrases:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
('algorithm', 0.4556),
('training', 0.4487),
 ...]
```

To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher depending on the number
of words you would like in the resulting keyphrases:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
('machine learning', 0.6305),
('supervised learning', 0.5985),
 ...]
```

I would advise either `'distilbert-base-nli-mean-tokens'` or `'xlm-r-distilroberta-base-paraphrase-v1'` as they
have shown great performance in semantic similarity and paraphrase identification respectively.

<a name="maxsum"/></a>
### 2.3. Max Sum Similarity

To diversify the results, we take the 2 x top_n most similar words/phrases to the document.
Then, we take all top_n combinations from the 2 x top_n words and extract the combination
whose words are the least similar to each other by cosine similarity.

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_maxsum=True, nr_candidates=20, top_n=5)
[('set training examples', 0.7504),
 ('generalize training data', 0.7727),
 ('requires learning algorithm', 0.5050),
 ...]
```
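
For intuition, the selection step can be sketched as follows. This is a minimal illustration, not KeyBERT's internal code; it assumes `doc_sim` (candidate-to-document cosine similarities) and `cand_sim` (candidate-to-candidate cosine similarities) are precomputed NumPy arrays:

```python
import itertools

import numpy as np

def max_sum_selection(doc_sim, cand_sim, top_n=5, nr_candidates=20):
    # Take the nr_candidates words most similar to the document ...
    candidates = np.argsort(doc_sim)[-nr_candidates:]
    best_combo, lowest_sim = None, np.inf
    # ... and keep the top_n combination whose members are the least
    # similar to one another.
    for combo in itertools.combinations(candidates, top_n):
        sim = sum(cand_sim[i, j] for i, j in itertools.combinations(combo, 2))
        if sim < lowest_sim:
            best_combo, lowest_sim = combo, sim
    return best_combo
```

Note that the loop is combinatorial in `nr_candidates`, which is why it should stay small.
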
### 2.4. Maximal Marginal Relevance

To diversify the results, we can use Maximal Marginal Relevance (MMR) to create
keywords/keyphrases, which is also based on cosine similarity. The results
with **high diversity**:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_mmr=True, diversity=0.7)
[('algorithm generalize training', 0.7727),
('labels unseen instances', 0.1649),
('new examples optimal', 0.4185),
 ...]
```

The results with **low diversity**:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_mmr=True, diversity=0.2)
[('algorithm generalize training', 0.7727),
('supervised learning algorithm', 0.7502),
('learning machine learning', 0.7577),
 ...]
```
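
The MMR loop itself is compact; a minimal sketch (again not KeyBERT's internal code, with `doc_sim` and `cand_sim` as precomputed NumPy cosine-similarity arrays) looks like:

```python
import numpy as np

def mmr(doc_sim, cand_sim, top_n=5, diversity=0.5):
    # Start from the candidate most similar to the document.
    selected = [int(np.argmax(doc_sim))]
    remaining = [i for i in range(len(doc_sim)) if i not in selected]
    while remaining and len(selected) < top_n:
        # Balance relevance to the document against redundancy with
        # the keywords selected so far.
        redundancy = cand_sim[np.ix_(remaining, selected)].max(axis=1)
        scores = (1 - diversity) * doc_sim[remaining] - diversity * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

A `diversity` of 0 reduces to plain cosine-similarity ranking, while a value near 1 ignores relevance almost entirely.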

<a name="embeddings"/></a>
### 2.5. Embedding Models
KeyBERT supports many embedding models that can be used to embed the documents and words:

* Sentence-Transformers
* Flair
* Spacy
* Gensim
* USE

Click [here](https://maartengr.github.io/KeyBERT/guides/embeddings.html) for a full overview of all supported embedding models.

**Sentence-Transformers**
You can select any model from `sentence-transformers` [here](https://www.sbert.net/docs/pretrained_models.html)
and pass it through KeyBERT with `model`:

```python
from keybert import KeyBERT
kw_model = KeyBERT(model='distilbert-base-nli-mean-tokens')
```

Or select a SentenceTransformer model with your own parameters:
```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
kw_model = KeyBERT(model=sentence_model)
```

**Flair**
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that is
publicly available. Flair can be used as follows:

```python
from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
kw_model = KeyBERT(model=roberta)
```

You can select any 🤗 transformers model [here](https://huggingface.co/models).

## Citation
To cite KeyBERT in your work, please use the following bibtex reference:

```bibtex
@misc{grootendorst2020keybert,
  author       = {Maarten Grootendorst},
  title        = {KeyBERT: Minimal keyword extraction with BERT.},
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.3.0},
  doi          = {10.5281/zenodo.4461265},
  url          = {https://doi.org/10.5281/zenodo.4461265}
}
```

but most importantly, these are amazing resources for creating impressive keyword extraction models:
* https://github.com/swisscom/ai-research-keyphrase-extraction

**MMR**:
The selection of keywords/keyphrases was modeled after:
* https://github.com/swisscom/ai-research-keyphrase-extraction

**NOTE**: If you find a paper or github repo that has an easy-to-use implementation
of BERT-embeddings for keyword/keyphrase extraction, let me know! I'll make sure to
add a reference to this repo.

44 changes: 44 additions & 0 deletions docs/changelog.md
## **Version 0.3.0**
*Release date: 10 May, 2021*

The two main features are **candidate keywords**
and several **backends** to use instead of Flair and SentenceTransformers!

**Highlights**:

* Use candidate words instead of extracting them from the documents ([#25](https://github.com/MaartenGr/KeyBERT/issues/25)); see the sketch after this list
* ```KeyBERT().extract_keywords(doc, candidates)```
* Spacy, Gensim, USE, and Custom Backends were added (see documentation [here](https://maartengr.github.io/KeyBERT/guides/embeddings.html))
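
A minimal sketch of the candidates workflow (the document and the candidate list below are illustrative, not part of the release notes):

```python
from keybert import KeyBERT

doc = "Supervised learning is the machine learning task of learning a function that maps an input to an output."

# Hypothetical candidate list; KeyBERT scores only these phrases
# instead of mining candidates from the document itself.
candidates = ["supervised learning", "machine learning", "labeled data"]

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, candidates=candidates, keyphrase_ngram_range=(1, 2))
```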

**Fixes**:

* Improved imports
* Fix encoding error when locally installing KeyBERT ([#30](https://github.com/MaartenGr/KeyBERT/issues/30))

**Miscellaneous**:

* Improved documentation (README & MkDocs)
* Add the main tutorial as a shield
* Typos ([#31](https://github.com/MaartenGr/KeyBERT/pull/31), [#35](https://github.com/MaartenGr/KeyBERT/pull/35))


## **Version 0.2.0**
*Release date: 9 Feb, 2021*

**Highlights**:

* Add similarity scores to the output
* Add Flair as a possible back-end
* Update documentation + improved testing

## **Version 0.1.2**
*Release date: 28 Oct, 2020*

Added Max Sum Similarity as an option to diversify your results.


## **Version 0.1.0**
*Release date: 27 Oct, 2020*

This first release includes keyword/keyphrase extraction using BERT and simple cosine similarity.
There is also an option to use Maximal Marginal Relevance to select the candidate keywords/keyphrases.
125 changes: 113 additions & 12 deletions docs/guides/embeddings.md
# Embedding Models
In this tutorial we will be going through the embedding models that can be used in KeyBERT.
Having the option to choose embedding models allows you to leverage pre-trained embeddings that suit your use case.

### **Sentence Transformers**
You can select any model from sentence-transformers [here](https://www.sbert.net/docs/pretrained_models.html)
and pass it through KeyBERT with `model`:

```python
from keybert import KeyBERT
kw_model = KeyBERT(model="xlm-r-bert-base-nli-stsb-mean-tokens")
```

Or select a SentenceTransformer model with your own parameters:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
model = KeyBERT(model=sentence_model)
sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cuda")
kw_model = KeyBERT(model=sentence_model)
```

### **Flair**
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
is publicly available. Flair can be used as follows:

```python
from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
kw_model = KeyBERT(model=roberta)
```

You can select any 🤗 transformers model [here](https://huggingface.co/models).

Moreover, you can also use Flair to load word embeddings and pool them to create document embeddings.
Under the hood, Flair simply averages all word embeddings in a document. Then, we can easily
pass it to KeyBERT in order to use those word embeddings as document embeddings:

```python
from keybert import KeyBERT
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings

glove_embedding = WordEmbeddings('crawl')
document_glove_embeddings = DocumentPoolEmbeddings([glove_embedding])

kw_model = KeyBERT(model=document_glove_embeddings)
```

### **Spacy**
[Spacy](https://github.com/explosion/spaCy) is an amazing framework for processing text. There are
many models available across many languages for modeling text.


To use Spacy's non-transformer models in KeyBERT:

```python
import spacy
from keybert import KeyBERT

nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

kw_model = KeyBERT(model=nlp)
```

Using spacy-transformer models:

```python
import spacy
from keybert import KeyBERT

spacy.prefer_gpu()
nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

kw_model = KeyBERT(model=nlp)
```

If you run into memory issues with spacy-transformer models, try:

```python
import spacy
from keybert import KeyBERT
from thinc.api import set_gpu_allocator, require_gpu

# Direct GPU memory allocations via PyTorch so the transformer and
# the other components do not compete for separate memory pools.
set_gpu_allocator("pytorch")
require_gpu(0)

nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

kw_model = KeyBERT(model=nlp)
```

### **Universal Sentence Encoder (USE)**
The Universal Sentence Encoder encodes text into high dimensional vectors that are used here
for embedding the documents. The model is trained and optimized for greater-than-word length text,
such as sentences, phrases or short paragraphs.

Using USE in KeyBERT is rather straightforward:

```python
import tensorflow_hub
from keybert import KeyBERT
embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
kw_model = KeyBERT(model=embedding_model)
```

### **Gensim**
For Gensim, KeyBERT supports its `gensim.downloader` module. Here, we can download any word embedding model
to be used in KeyBERT. Note that Gensim is primarily used for word embedding models, which typically work
best for short documents since the word embeddings are pooled.

```python
import gensim.downloader as api
from keybert import KeyBERT
ft = api.load('fasttext-wiki-news-subwords-300')
kw_model = KeyBERT(model=ft)
```
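
Once loaded, usage is the same as with any other backend; a quick sketch with an illustrative document:

```python
doc = "Supervised learning infers a function from labeled training data."
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1))
```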

### **Custom Backend**
If your backend or model cannot be found in the ones currently available, you can use the `keybert.backend.BaseEmbedder` class to
create your own backend. Below, you will find an example of creating a SentenceTransformer backend for KeyBERT:

```python
from keybert.backend import BaseEmbedder
from sentence_transformers import SentenceTransformer

class CustomEmbedder(BaseEmbedder):
    def __init__(self, embedding_model):
        super().__init__()
        self.embedding_model = embedding_model

    def embed(self, documents, verbose=False):
        embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
        return embeddings

# Create custom backend
distilbert = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
custom_embedder = CustomEmbedder(embedding_model=distilbert)

# Pass custom backend to keybert
kw_model = KeyBERT(model=custom_embedder)
```
