v0.2 (#23)
* Add similarity scores to the output
* Add Flair as a possible back-end
* Update documentation + improved testing
MaartenGr committed Feb 9, 2021
1 parent e66fc12 commit 2a982bd
Showing 11 changed files with 411 additions and 142 deletions.
109 changes: 77 additions & 32 deletions README.md
@@ -20,7 +20,8 @@ Corresponding medium post can be found [here](https://towardsdatascience.com/key
2.1. [Installation](#installation)
2.2. [Basic Usage](#usage)
2.3. [Max Sum Similarity](#maxsum)
2.4. [Maximal Marginal Relevance](#maximal)
2.5. [Embedding Models](#embeddings)
<!--te-->


@@ -58,15 +59,18 @@ Thus, the goal was a `pip install keybert` and at most 3 lines of code in usage.

<a name="installation"/></a>
### 2.1. Installation
**[PyTorch 1.2.0](https://pytorch.org/get-started/locally/)** or higher is recommended. If the installation below gives an
error, please first install PyTorch [here](https://pytorch.org/get-started/locally/).

Installation can be done using [pypi](https://pypi.org/project/keybert/):

```
pip install keybert
```

To use Flair embeddings, install KeyBERT as follows:

```
pip install keybert[flair]
```

<a name="usage"/></a>
### 2.2. Usage

@@ -94,23 +98,23 @@ You can use `keyphrase_ngram_range` to set the length of the resulting keywords/

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
('algorithm', 0.4556),
('training', 0.4487),
('class', 0.4086),
('mapping', 0.3700)]
```

To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher depending on the number
of words you would like in the resulting keyphrases:

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
('machine learning', 0.6305),
('supervised learning', 0.5985),
('algorithm analyzes', 0.5860),
('learning function', 0.5850)]
```


@@ -128,11 +132,11 @@ whose members are the least similar to each other by cosine similarity.
```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_maxsum=True, nr_candidates=20, top_n=5)
[('set training examples', 0.7504),
('generalize training data', 0.7727),
('requires learning algorithm', 0.5050),
('supervised learning algorithm', 0.3779),
('learning machine learning', 0.2891)]
```
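
For intuition, the Max Sum selection can be sketched as follows. This is a minimal illustration, assuming `doc_embedding` (shape `1 x d`) and `word_embeddings` (shape `n x d`) are numpy arrays, rather than KeyBERT's exact implementation:

```python
import itertools
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def max_sum_sim(doc_embedding, word_embeddings, words, top_n, nr_candidates):
    """Pick top_n keywords out of the nr_candidates most relevant ones
    such that their summed pairwise similarity is minimal."""
    # Similarity of each candidate word/phrase to the document
    distances = cosine_similarity(doc_embedding, word_embeddings)

    # Keep only the nr_candidates words closest to the document
    words_idx = list(distances.argsort()[0][-nr_candidates:])
    words_vals = [words[idx] for idx in words_idx]
    similarities = cosine_similarity(word_embeddings[words_idx],
                                     word_embeddings[words_idx])

    # Of all top_n-sized combinations, keep the one whose members
    # are least similar to one another
    min_sim = np.inf
    best_combination = None
    for combination in itertools.combinations(range(len(words_idx)), top_n):
        sim = sum(similarities[i][j]
                  for i in combination for j in combination if i != j)
        if sim < min_sim:
            best_combination = combination
            min_sim = sim

    return [words_vals[idx] for idx in best_combination]
```

Note that the number of combinations grows quickly with `nr_candidates`, so keeping it well below the total number of candidate words is advisable.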


@@ -144,26 +148,67 @@ keywords/keyphrases; MMR is also based on cosine similarity. The results
with **high diversity**:

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_mmr=True, diversity=0.7)
[('algorithm generalize training', 0.7727),
('labels unseen instances', 0.1649),
('new examples optimal', 0.4185),
('determine class labels', 0.4774),
('supervised learning algorithm', 0.7502)]
```

The results with **low diversity**:

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_mmr=True, diversity=0.2)
[('algorithm generalize training', 0.7727),
('supervised learning algorithm', 0.7502),
('learning machine learning', 0.7577),
('learning algorithm analyzes', 0.7587),
('learning algorithm generalize', 0.7514)]
```
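
Under the hood, MMR greedily trades off relevance to the document against similarity to the keywords already picked, with `diversity` controlling the trade-off. A minimal sketch, again assuming numpy arrays for the embeddings rather than KeyBERT's exact code:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def mmr(doc_embedding, word_embeddings, words, top_n=5, diversity=0.5):
    """Greedily pick keywords that are relevant to the document yet
    dissimilar to the keywords already chosen."""
    # Relevance of each candidate to the document, and candidate-candidate similarity
    word_doc_sim = cosine_similarity(word_embeddings, doc_embedding)
    word_sim = cosine_similarity(word_embeddings)

    # Start with the single most relevant candidate
    keywords_idx = [int(np.argmax(word_doc_sim))]
    candidates_idx = [i for i in range(len(words)) if i != keywords_idx[0]]

    for _ in range(top_n - 1):
        candidate_sims = word_doc_sim[candidates_idx, :].reshape(-1)
        target_sims = np.max(word_sim[candidates_idx][:, keywords_idx], axis=1)

        # Trade off relevance against redundancy with what was already picked
        mmr_scores = (1 - diversity) * candidate_sims - diversity * target_sims
        best = candidates_idx[int(np.argmax(mmr_scores))]

        keywords_idx.append(best)
        candidates_idx.remove(best)

    return [words[idx] for idx in keywords_idx]
```

With `diversity=0` this reduces to ranking purely by similarity to the document, while `diversity=1` picks maximally dissimilar keywords.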


<a name="embeddings"/></a>
### 2.5. Embedding Models
The `model` parameter accepts a string pointing to a sentence-transformers model,
a SentenceTransformer object, or a Flair DocumentEmbeddings model.

**Sentence-Transformers**

You can select any model from `sentence-transformers` [here](https://www.sbert.net/docs/pretrained_models.html)
and pass it to KeyBERT through the `model` argument:

```python
from keybert import KeyBERT
model = KeyBERT(model='distilbert-base-nli-mean-tokens')
```

Or select a SentenceTransformer model with your own parameters:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
model = KeyBERT(model=sentence_model)
```

**Flair**

[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
is publicly available. Flair can be used as follows:

```python
from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
model = KeyBERT(model=roberta)
```

You can select any 🤗 transformers model [here](https://huggingface.co/models).
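
Whichever back-end you choose, the resulting model is used in exactly the same way; for example (assuming `doc` is the document from the usage example above):

```python
keywords = model.extract_keywords(doc, keyphrase_ngram_range=(1, 2))
```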


## Citation
To cite KeyBERT in your work, please use the following bibtex reference:

36 changes: 36 additions & 0 deletions docs/guides/embeddings.md
@@ -0,0 +1,36 @@
## **Embedding Models**
The `model` parameter accepts a string pointing to a sentence-transformers model,
a SentenceTransformer object, or a Flair DocumentEmbeddings model.

### **Sentence-Transformers**
You can select any model from `sentence-transformers` [here](https://www.sbert.net/docs/pretrained_models.html)
and pass it to KeyBERT through the `model` argument:

```python
from keybert import KeyBERT
model = KeyBERT(model='distilbert-base-nli-mean-tokens')
```

Or select a SentenceTransformer model with your own parameters:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
model = KeyBERT(model=sentence_model)
```

### **Flair**
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
is publicly available. Flair can be used as follows:

```python
from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
model = KeyBERT(model=roberta)
```

You can select any 🤗 transformers model [here](https://huggingface.co/models).
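
Flair also lets you pool static word embeddings into a document embedding. A sketch, under the assumption that KeyBERT accepts any Flair `DocumentEmbeddings` instance here, just as with the transformer model above:

```python
from keybert import KeyBERT
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings

# Average fastText word vectors into a single document vector
crawl_embedding = WordEmbeddings('crawl')
document_pool = DocumentPoolEmbeddings([crawl_embedding])

model = KeyBERT(model=document_pool)
```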
112 changes: 112 additions & 0 deletions docs/guides/quickstart.md
@@ -0,0 +1,112 @@
## **Installation**
Installation can be done using [pypi](https://pypi.org/project/keybert/):

```
pip install keybert
```

To use Flair embeddings, install KeyBERT as follows:

```
pip install keybert[flair]
```

Or to install all additional dependencies:

```
pip install keybert[all]
```

## **Usage**

The most minimal example can be seen below for the extraction of keywords:
```python
from keybert import KeyBERT

doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs.[1] It infers a
function from labeled training data consisting of a set of training examples.[2]
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
model = KeyBERT('distilbert-base-nli-mean-tokens')
keywords = model.extract_keywords(doc)
```
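
The returned `keywords` is a list of `(keyword, score)` tuples, where the score is the cosine similarity between the keyword and the document. For example, to print them:

```python
for keyword, score in keywords:
    print(f"{keyword}: {score:.4f}")
```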

You can use `keyphrase_ngram_range` to set the length of the resulting keywords/keyphrases:

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
('algorithm', 0.4556),
('training', 0.4487),
('class', 0.4086),
('mapping', 0.3700)]
```

To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher depending on the number
of words you would like in the resulting keyphrases:

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
('machine learning', 0.6305),
('supervised learning', 0.5985),
('algorithm analyzes', 0.5860),
('learning function', 0.5850)]
```

**NOTE**: For a full overview of all possible transformer models see [sentence-transformers](https://www.sbert.net/docs/pretrained_models.html).
I would advise either `'distilbert-base-nli-mean-tokens'` or `'xlm-r-distilroberta-base-paraphrase-v1'` as they
have shown great performance in semantic similarity and paraphrase identification respectively.

### Max Sum Similarity

To diversify the results, we take the 2 x top_n most similar words/phrases to the document.
Then, we take all top_n combinations from the 2 x top_n words and extract the combination
whose members are the least similar to each other by cosine similarity.

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_maxsum=True, nr_candidates=20, top_n=5)
[('set training examples', 0.7504),
('generalize training data', 0.7727),
('requires learning algorithm', 0.5050),
('supervised learning algorithm', 0.3779),
('learning machine learning', 0.2891)]
```

### Maximal Marginal Relevance

To diversify the results, we can use Maximal Marginal Relevance (MMR) to create
keywords/keyphrases; MMR is also based on cosine similarity. The results
with **high diversity**:

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_mmr=True, diversity=0.7)
[('algorithm generalize training', 0.7727),
('labels unseen instances', 0.1649),
('new examples optimal', 0.4185),
('determine class labels', 0.4774),
('supervised learning algorithm', 0.7502)]
```

The results with **low diversity**:

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_mmr=True, diversity=0.2)
[('algorithm generalize training', 0.7727),
('supervised learning algorithm', 0.7502),
('learning machine learning', 0.7577),
('learning algorithm analyzes', 0.7587),
('learning algorithm generalize', 0.7514)]
```
