v0.3 (#32)
* Use candidate words instead of extracting them from the documents
* Add Spacy, Gensim, USE, and custom backends
* Improve imports
* Fix encoding error when locally installing KeyBERT (#30)
* Improve documentation (README & MkDocs)
* Add the main tutorial as a shield
* Fix typos (#31, #35)
MaartenGr committed May 10, 2021
1 parent 2a982bd commit eb6d086
Showing 16 changed files with 747 additions and 191 deletions.
57 changes: 37 additions & 20 deletions README.md
[![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/keybert/blob/master/LICENSE)
[![PyPI - PyPi](https://img.shields.io/pypi/v/keyBERT)](https://pypi.org/project/keybert/)
[![Build](https://img.shields.io/github/workflow/status/MaartenGr/keyBERT/Code%20Checks/master)](https://pypi.org/project/keybert/)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1OxpgwKqSzODtO3vS7Xe1nEmZMCAIMckX?usp=sharing)

<img src="images/logo.png" width="35%" height="35%" align="right" />

Installation can be done using [pypi](https://pypi.org/project/keybert/):

```
pip install keybert
```

You may want to install additional extras depending on the transformer and language backends that you will be using. The possible installations are:

```
pip install keybert[flair]
pip install keybert[gensim]
pip install keybert[spacy]
pip install keybert[use]
```

To install all backends:

```
pip install keybert[all]
```

<a name="usage"/></a>
## 2. Usage

The most minimal example can be seen below for the extraction of keywords:

```python
from keybert import KeyBERT

doc = """
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
kw_model = KeyBERT('distilbert-base-nli-mean-tokens')
keywords = kw_model.extract_keywords(doc)
```

You can set `keyphrase_ngram_range` to control the length of the resulting keywords/keyphrases:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
('algorithm', 0.4556),
('training', 0.4487),
 ...]
```

To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher depending on the number
of words you would like in the resulting keyphrases:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
('machine learning', 0.6305),
('supervised learning', 0.5985),
 ...]
```

I would advise either `'distilbert-base-nli-mean-tokens'` or `'xlm-r-distilroberta-base-paraphrase-v1'` as they
have shown great performance in semantic similarity and paraphrase identification respectively.

<a name="maxsum"/></a>
### 2.3. Max Sum Similarity

To diversify the results, we take the 2 x top_n most similar words/phrases to the document.
Then, we take all top_n combinations from the 2 x top_n words and extract the combination
whose words are the least similar to each other by cosine similarity.

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_maxsum=True, nr_candidates=20, top_n=5)
[('set training examples', 0.7504),
 ('generalize training data', 0.7727),
 ('requires learning algorithm', 0.5050),
 ...]
```
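
For intuition, the selection step can be sketched as follows. This is a minimal illustration, not KeyBERT's internal code; it assumes `doc_sim` (candidate-to-document cosine similarities) and `cand_sim` (candidate-to-candidate cosine similarities) are precomputed NumPy arrays:

```python
import itertools

import numpy as np

def max_sum_selection(doc_sim, cand_sim, top_n=5, nr_candidates=20):
    # Take the nr_candidates words most similar to the document ...
    candidates = np.argsort(doc_sim)[-nr_candidates:]
    best_combo, lowest_sim = None, np.inf
    # ... and keep the top_n combination whose members are the least
    # similar to one another.
    for combo in itertools.combinations(candidates, top_n):
        sim = sum(cand_sim[i, j] for i, j in itertools.combinations(combo, 2))
        if sim < lowest_sim:
            best_combo, lowest_sim = combo, sim
    return best_combo
```

Note that the loop is combinatorial in `nr_candidates`, which is why it should stay small.
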
### 2.4. Maximal Marginal Relevance

To diversify the results, we can use Maximal Marginal Relevance (MMR) to create
keywords/keyphrases, which is also based on cosine similarity. The results
with **high diversity**:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_mmr=True, diversity=0.7)
[('algorithm generalize training', 0.7727),
('labels unseen instances', 0.1649),
('new examples optimal', 0.4185),
 ...]
```

The results with **low diversity**:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_mmr=True, diversity=0.2)
[('algorithm generalize training', 0.7727),
('supervised learning algorithm', 0.7502),
('learning machine learning', 0.7577),
 ...]
```
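
The MMR loop itself is compact; a minimal sketch (again not KeyBERT's internal code, with `doc_sim` and `cand_sim` as precomputed NumPy cosine-similarity arrays) looks like:

```python
import numpy as np

def mmr(doc_sim, cand_sim, top_n=5, diversity=0.5):
    # Start from the candidate most similar to the document.
    selected = [int(np.argmax(doc_sim))]
    remaining = [i for i in range(len(doc_sim)) if i not in selected]
    while remaining and len(selected) < top_n:
        # Balance relevance to the document against redundancy with
        # the keywords selected so far.
        redundancy = cand_sim[np.ix_(remaining, selected)].max(axis=1)
        scores = (1 - diversity) * doc_sim[remaining] - diversity * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

A `diversity` of 0 reduces to plain cosine-similarity ranking, while a value near 1 ignores relevance almost entirely.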

<a name="embeddings"/></a>
### 2.5. Embedding Models
KeyBERT supports many embedding models that can be used to embed the documents and words:

* Sentence-Transformers
* Flair
* Spacy
* Gensim
* USE

Click [here](https://maartengr.github.io/KeyBERT/guides/embeddings.html) for a full overview of all supported embedding models.

**Sentence-Transformers**
You can select any model from `sentence-transformers` [here](https://www.sbert.net/docs/pretrained_models.html)
and pass it through KeyBERT with `model`:

```python
from keybert import KeyBERT
kw_model = KeyBERT(model='distilbert-base-nli-mean-tokens')
```

Or select a SentenceTransformer model with your own parameters:
```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
kw_model = KeyBERT(model=sentence_model)
```

**Flair**
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that is
publicly available. Flair can be used as follows:

```python
from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
kw_model = KeyBERT(model=roberta)
```

You can select any 🤗 transformers model [here](https://huggingface.co/models).

## Citation
To cite KeyBERT in your work, please use the following bibtex reference:

```bibtex
@misc{grootendorst2020keybert,
  author       = {Maarten Grootendorst},
  title        = {KeyBERT: Minimal keyword extraction with BERT.},
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.3.0},
  doi          = {10.5281/zenodo.4461265},
  url          = {https://doi.org/10.5281/zenodo.4461265}
}
```

but most importantly, these are amazing resources for creating impressive keyword extraction models:
* https://github.com/swisscom/ai-research-keyphrase-extraction

**MMR**:
The selection of keywords/keyphrases was modeled after:
* https://github.com/swisscom/ai-research-keyphrase-extraction

**NOTE**: If you find a paper or github repo that has an easy-to-use implementation
of BERT-embeddings for keyword/keyphrase extraction, let me know! I'll make sure to
add a reference to this repo.

44 changes: 44 additions & 0 deletions docs/changelog.md
## **Version 0.3.0**
*Release date: 10 May, 2021*

The two main features are **candidate keywords**
and several **backends** to use instead of Flair and SentenceTransformers!

**Highlights**:

* Use candidate words instead of extracting them from the documents ([#25](https://github.com/MaartenGr/KeyBERT/issues/25)); see the sketch after this list
* ```KeyBERT().extract_keywords(doc, candidates)```
* Spacy, Gensim, USE, and Custom Backends were added (see documentation [here](https://maartengr.github.io/KeyBERT/guides/embeddings.html))
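
A minimal sketch of the candidates workflow (the document and the candidate list below are illustrative, not part of the release notes):

```python
from keybert import KeyBERT

doc = "Supervised learning is the machine learning task of learning a function that maps an input to an output."

# Hypothetical candidate list; KeyBERT scores only these phrases
# instead of mining candidates from the document itself.
candidates = ["supervised learning", "machine learning", "labeled data"]

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, candidates=candidates, keyphrase_ngram_range=(1, 2))
```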

**Fixes**:

* Improved imports
* Fix encoding error when locally installing KeyBERT ([#30](https://github.com/MaartenGr/KeyBERT/issues/30))

**Miscellaneous**:

* Improved documentation (README & MkDocs)
* Add the main tutorial as a shield
* Typos ([#31](https://github.com/MaartenGr/KeyBERT/pull/31), [#35](https://github.com/MaartenGr/KeyBERT/pull/35))


## **Version 0.2.0**
*Release date: 9 Feb, 2021*

**Highlights**:

* Add similarity scores to the output
* Add Flair as a possible back-end
* Update documentation + improved testing

## **Version 0.1.2**
*Release date: 28 Oct, 2020*

Added Max Sum Similarity as an option to diversify your results.


## **Version 0.1.0**
*Release date: 27 Oct, 2020*

This first release includes keyword/keyphrase extraction using BERT and simple cosine similarity.
There is also an option to use Maximal Marginal Relevance to select the candidate keywords/keyphrases.
125 changes: 113 additions & 12 deletions docs/guides/embeddings.md
# Embedding Models
In this tutorial we will be going through the embedding models that can be used in KeyBERT.
Having the option to choose embedding models allows you to leverage pre-trained embeddings that suit your use case.

### **Sentence Transformers**
You can select any model from sentence-transformers [here](https://www.sbert.net/docs/pretrained_models.html)
and pass it through KeyBERT with `model`:

```python
from keybert import KeyBERT
kw_model = KeyBERT(model="xlm-r-bert-base-nli-stsb-mean-tokens")
```

Or select a SentenceTransformer model with your own parameters:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
model = KeyBERT(model=sentence_model)
sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cuda")
kw_model = KeyBERT(model=sentence_model)
```

### **Flair**
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
is publicly available. Flair can be used as follows:

```python
from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
kw_model = KeyBERT(model=roberta)
```

You can select any 🤗 transformers model [here](https://huggingface.co/models).

Moreover, you can also use Flair to load word embeddings and pool them to create document embeddings.
Under the hood, Flair simply averages all word embeddings in a document. Then, we can easily
pass it to KeyBERT in order to use those word embeddings as document embeddings:

```python
from keybert import KeyBERT
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings

glove_embedding = WordEmbeddings('crawl')
document_glove_embeddings = DocumentPoolEmbeddings([glove_embedding])

kw_model = KeyBERT(model=document_glove_embeddings)
```

### **Spacy**
[Spacy](https://github.com/explosion/spaCy) is an amazing framework for processing text. There are
many models available across many languages for modeling text.


To use Spacy's non-transformer models in KeyBERT:

```python
import spacy
from keybert import KeyBERT

nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

kw_model = KeyBERT(model=nlp)
```

Using spacy-transformer models:

```python
import spacy
from keybert import KeyBERT

spacy.prefer_gpu()
nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

kw_model = KeyBERT(model=nlp)
```

If you run into memory issues with spacy-transformer models, try:

```python
import spacy
from keybert import KeyBERT
from thinc.api import set_gpu_allocator, require_gpu

# Direct GPU memory allocations via PyTorch so the transformer and
# the other components do not compete for separate memory pools.
set_gpu_allocator("pytorch")
require_gpu(0)

nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

kw_model = KeyBERT(model=nlp)
```

### **Universal Sentence Encoder (USE)**
The Universal Sentence Encoder encodes text into high dimensional vectors that are used here
for embedding the documents. The model is trained and optimized for greater-than-word length text,
such as sentences, phrases or short paragraphs.

Using USE in KeyBERT is rather straightforward:

```python
import tensorflow_hub
from keybert import KeyBERT
embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
kw_model = KeyBERT(model=embedding_model)
```

### **Gensim**
For Gensim, KeyBERT supports its `gensim.downloader` module. Here, we can download any word embedding model
to be used in KeyBERT. Note that Gensim is primarily used for word embedding models, which typically work
best for short documents since the word embeddings are pooled.

```python
import gensim.downloader as api
from keybert import KeyBERT
ft = api.load('fasttext-wiki-news-subwords-300')
kw_model = KeyBERT(model=ft)
```
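
Once loaded, usage is the same as with any other backend; a quick sketch with an illustrative document:

```python
doc = "Supervised learning infers a function from labeled training data."
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1))
```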

### **Custom Backend**
If your backend or model cannot be found in the ones currently available, you can use the `keybert.backend.BaseEmbedder` class to
create your own backend. Below, you will find an example of creating a SentenceTransformer backend for KeyBERT:

```python
from keybert.backend import BaseEmbedder
from sentence_transformers import SentenceTransformer

class CustomEmbedder(BaseEmbedder):
    def __init__(self, embedding_model):
        super().__init__()
        self.embedding_model = embedding_model

    def embed(self, documents, verbose=False):
        embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
        return embeddings

# Create custom backend
distilbert = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
custom_embedder = CustomEmbedder(embedding_model=distilbert)

# Pass custom backend to keybert
kw_model = KeyBERT(model=custom_embedder)
```
