v0.2 (#23)
* Add similarity scores to the output
* Add Flair as a possible back-end
* Update documentation + improved testing
MaartenGr committed Feb 9, 2021
1 parent e66fc12 commit 2a982bd
Showing 11 changed files with 411 additions and 142 deletions.
109 changes: 77 additions & 32 deletions README.md
@@ -20,7 +20,8 @@ Corresponding medium post can be found [here](https://towardsdatascience.com/key
2.1. [Installation](#installation)
2.2. [Basic Usage](#usage)
2.3. [Max Sum Similarity](#maxsum)
2.4. [Maximal Marginal Relevance](#maximal)
2.5. [Embedding Models](#embeddings)
<!--te-->


@@ -58,15 +59,18 @@ Thus, the goal was a `pip install keybert` and at most 3 lines of code in usage.

<a name="installation"/></a>
### 2.1. Installation
**[PyTorch 1.2.0](https://pytorch.org/get-started/locally/)** or higher is recommended. If the installation below gives an
error, please first install PyTorch [here](https://pytorch.org/get-started/locally/).

Installation can be done using [pypi](https://pypi.org/project/keybert/):

```
pip install keybert
```

To use Flair embeddings, install KeyBERT as follows:

```
pip install keybert[flair]
```

<a name="usage"/></a>
### 2.2. Usage

@@ -94,23 +98,23 @@ You can use `keyphrase_ngram_range` to set the length of the resulting keywords/

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
('algorithm', 0.4556),
('training', 0.4487),
('class', 0.4086),
('mapping', 0.3700)]
```

To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher depending on the number
of words you would like in the resulting keyphrases:

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
('machine learning', 0.6305),
('supervised learning', 0.5985),
('algorithm analyzes', 0.5860),
('learning function', 0.5850)]
```


@@ -128,11 +132,11 @@ whose members are the least similar to each other by cosine similarity.
```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_maxsum=True, nr_candidates=20, top_n=5)
[('set training examples', 0.7504),
('generalize training data', 0.7727),
('requires learning algorithm', 0.5050),
('supervised learning algorithm', 0.3779),
('learning machine learning', 0.2891)]
```
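
For intuition, the Max Sum selection can be sketched as follows. This is a minimal illustration, assuming `doc_embedding` (shape `1 x d`) and `word_embeddings` (shape `n x d`) are numpy arrays, rather than KeyBERT's exact implementation:

```python
import itertools
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def max_sum_sim(doc_embedding, word_embeddings, words, top_n, nr_candidates):
    """Pick top_n keywords out of the nr_candidates most relevant ones
    such that their summed pairwise similarity is minimal."""
    # Similarity of each candidate word/phrase to the document
    distances = cosine_similarity(doc_embedding, word_embeddings)

    # Keep only the nr_candidates words closest to the document
    words_idx = list(distances.argsort()[0][-nr_candidates:])
    words_vals = [words[idx] for idx in words_idx]
    similarities = cosine_similarity(word_embeddings[words_idx],
                                     word_embeddings[words_idx])

    # Of all top_n-sized combinations, keep the one whose members
    # are least similar to one another
    min_sim = np.inf
    best_combination = None
    for combination in itertools.combinations(range(len(words_idx)), top_n):
        sim = sum(similarities[i][j]
                  for i in combination for j in combination if i != j)
        if sim < min_sim:
            best_combination = combination
            min_sim = sim

    return [words_vals[idx] for idx in best_combination]
```

Note that the number of combinations grows quickly with `nr_candidates`, so keeping it well below the total number of candidate words is advisable.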


@@ -144,26 +148,67 @@ keywords/keyphrases; MMR is also based on cosine similarity. The results
with **high diversity**:

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_mmr=True, diversity=0.7)
[('algorithm generalize training', 0.7727),
('labels unseen instances', 0.1649),
('new examples optimal', 0.4185),
('determine class labels', 0.4774),
('supervised learning algorithm', 0.7502)]
```

The results with **low diversity**:

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_mmr=True, diversity=0.2)
[('algorithm generalize training', 0.7727),
('supervised learning algorithm', 0.7502),
('learning machine learning', 0.7577),
('learning algorithm analyzes', 0.7587),
('learning algorithm generalize', 0.7514)]
```
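
Under the hood, MMR greedily trades off relevance to the document against similarity to the keywords already picked, with `diversity` controlling the trade-off. A minimal sketch, again assuming numpy arrays for the embeddings rather than KeyBERT's exact code:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def mmr(doc_embedding, word_embeddings, words, top_n=5, diversity=0.5):
    """Greedily pick keywords that are relevant to the document yet
    dissimilar to the keywords already chosen."""
    # Relevance of each candidate to the document, and candidate-candidate similarity
    word_doc_sim = cosine_similarity(word_embeddings, doc_embedding)
    word_sim = cosine_similarity(word_embeddings)

    # Start with the single most relevant candidate
    keywords_idx = [int(np.argmax(word_doc_sim))]
    candidates_idx = [i for i in range(len(words)) if i != keywords_idx[0]]

    for _ in range(top_n - 1):
        candidate_sims = word_doc_sim[candidates_idx, :].reshape(-1)
        target_sims = np.max(word_sim[candidates_idx][:, keywords_idx], axis=1)

        # Trade off relevance against redundancy with what was already picked
        mmr_scores = (1 - diversity) * candidate_sims - diversity * target_sims
        best = candidates_idx[int(np.argmax(mmr_scores))]

        keywords_idx.append(best)
        candidates_idx.remove(best)

    return [words[idx] for idx in keywords_idx]
```

With `diversity=0` this reduces to ranking purely by similarity to the document, while `diversity=1` picks maximally dissimilar keywords.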


<a name="embeddings"/></a>
### 2.5. Embedding Models
The `model` parameter accepts a string pointing to a sentence-transformers model,
a SentenceTransformer object, or a Flair DocumentEmbeddings model.

**Sentence-Transformers**

You can select any model from `sentence-transformers` [here](https://www.sbert.net/docs/pretrained_models.html)
and pass it to KeyBERT through the `model` argument:

```python
from keybert import KeyBERT
model = KeyBERT(model='distilbert-base-nli-mean-tokens')
```

Or select a SentenceTransformer model with your own parameters:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
model = KeyBERT(model=sentence_model)
```

**Flair**

[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
is publicly available. Flair can be used as follows:

```python
from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
model = KeyBERT(model=roberta)
```

You can select any 🤗 transformers model [here](https://huggingface.co/models).
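
Whichever back-end you choose, the resulting model is used in exactly the same way; for example (assuming `doc` is the document from the usage example above):

```python
keywords = model.extract_keywords(doc, keyphrase_ngram_range=(1, 2))
```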


## Citation
To cite KeyBERT in your work, please use the following bibtex reference:

36 changes: 36 additions & 0 deletions docs/guides/embeddings.md
@@ -0,0 +1,36 @@
## **Embedding Models**
The `model` parameter accepts a string pointing to a sentence-transformers model,
a SentenceTransformer object, or a Flair DocumentEmbeddings model.

### **Sentence-Transformers**
You can select any model from `sentence-transformers` [here](https://www.sbert.net/docs/pretrained_models.html)
and pass it to KeyBERT through the `model` argument:

```python
from keybert import KeyBERT
model = KeyBERT(model='distilbert-base-nli-mean-tokens')
```

Or select a SentenceTransformer model with your own parameters:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
model = KeyBERT(model=sentence_model)
```

### **Flair**
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
is publicly available. Flair can be used as follows:

```python
from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
model = KeyBERT(model=roberta)
```

You can select any 🤗 transformers model [here](https://huggingface.co/models).
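
Flair also lets you pool static word embeddings into a document embedding. A sketch, under the assumption that KeyBERT accepts any Flair `DocumentEmbeddings` instance here, just as with the transformer model above:

```python
from keybert import KeyBERT
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings

# Average fastText word vectors into a single document vector
crawl_embedding = WordEmbeddings('crawl')
document_pool = DocumentPoolEmbeddings([crawl_embedding])

model = KeyBERT(model=document_pool)
```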
112 changes: 112 additions & 0 deletions docs/guides/quickstart.md
@@ -0,0 +1,112 @@
## **Installation**
Installation can be done using [pypi](https://pypi.org/project/keybert/):

```
pip install keybert
```

To use Flair embeddings, install KeyBERT as follows:

```
pip install keybert[flair]
```

Or to install all additional dependencies:

```
pip install keybert[all]
```

## **Usage**

The most minimal example can be seen below for the extraction of keywords:
```python
from keybert import KeyBERT

doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs.[1] It infers a
function from labeled training data consisting of a set of training examples.[2]
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
model = KeyBERT('distilbert-base-nli-mean-tokens')
keywords = model.extract_keywords(doc)
```
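
The returned `keywords` is a list of `(keyword, score)` tuples, where the score is the cosine similarity between the keyword and the document. For example, to print them:

```python
for keyword, score in keywords:
    print(f"{keyword}: {score:.4f}")
```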

You can use `keyphrase_ngram_range` to set the length of the resulting keywords/keyphrases:

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
('algorithm', 0.4556),
('training', 0.4487),
('class', 0.4086),
('mapping', 0.3700)]
```

To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher depending on the number
of words you would like in the resulting keyphrases:

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
('machine learning', 0.6305),
('supervised learning', 0.5985),
('algorithm analyzes', 0.5860),
('learning function', 0.5850)]
```

**NOTE**: For a full overview of all possible transformer models see [sentence-transformers](https://www.sbert.net/docs/pretrained_models.html).
I would advise either `'distilbert-base-nli-mean-tokens'` or `'xlm-r-distilroberta-base-paraphrase-v1'` as they
have shown great performance in semantic similarity and paraphrase identification respectively.

### Max Sum Similarity

To diversify the results, we take the 2 x top_n most similar words/phrases to the document.
Then, we take all top_n combinations from the 2 x top_n words and extract the combination
whose members are the least similar to each other by cosine similarity.

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_maxsum=True, nr_candidates=20, top_n=5)
[('set training examples', 0.7504),
('generalize training data', 0.7727),
('requires learning algorithm', 0.5050),
('supervised learning algorithm', 0.3779),
('learning machine learning', 0.2891)]
```

### Maximal Marginal Relevance

To diversify the results, we can use Maximal Marginal Relevance (MMR) to create
keywords/keyphrases; MMR is also based on cosine similarity. The results
with **high diversity**:

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_mmr=True, diversity=0.7)
[('algorithm generalize training', 0.7727),
('labels unseen instances', 0.1649),
('new examples optimal', 0.4185),
('determine class labels', 0.4774),
('supervised learning algorithm', 0.7502)]
```

The results with **low diversity**:

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_mmr=True, diversity=0.2)
[('algorithm generalize training', 0.7727),
('supervised learning algorithm', 0.7502),
('learning machine learning', 0.7577),
('learning algorithm analyzes', 0.7587),
('learning algorithm generalize', 0.7514)]
```
