v0.4 (#43)
* Use paraphrase-MiniLM-L6-v2 as the default embedding model
* Highlight a document's keywords
* Added FAQ
MaartenGr committed Jun 30, 2021
1 parent eb6d086 commit 25dab3a
Showing 18 changed files with 242 additions and 83 deletions.
21 changes: 14 additions & 7 deletions README.md
@@ -90,8 +90,8 @@ from keybert import KeyBERT

doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs.[1] It infers a
function from labeled training data consisting of a set of training examples.[2]
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
@@ -100,7 +100,7 @@ doc = """
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
kw_model = KeyBERT('distilbert-base-nli-mean-tokens')
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
```

@@ -127,10 +127,17 @@ of words you would like in the resulting keyphrases:
('learning function', 0.5850)]
```

We can highlight the keywords in the document by simply setting `highlight=True`:

```python
keywords = kw_model.extract_keywords(doc, highlight=True)
```
<img src="images/highlight.png" width="75%" height="75%" />


**NOTE**: For a full overview of all possible transformer models see [sentence-transformer](https://www.sbert.net/docs/pretrained_models.html).
I would advise either `'distilbert-base-nli-mean-tokens'` or `'xlm-r-distilroberta-base-paraphrase-v1'` as they
have shown great performance in semantic similarity and paraphrase identification respectively.
I would advise either `"paraphrase-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
for multilingual documents or documents in any other language.
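
For example, a multilingual model can be selected in exactly the same way. A minimal sketch, reusing the `doc` defined above:

```python
from keybert import KeyBERT

# Multilingual embedding model for non-English or mixed-language documents
kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")
keywords = kw_model.extract_keywords(doc)
```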

<a name="maxsum"/></a>
### 2.3. Max Sum Similarity
@@ -198,7 +205,7 @@ and pass it through KeyBERT with `model`:

```python
from keybert import KeyBERT
kw_model = KeyBERT(model='distilbert-base-nli-mean-tokens')
kw_model = KeyBERT(model='paraphrase-MiniLM-L6-v2')
```

Or select a SentenceTransformer model with your own parameters:
@@ -207,7 +214,7 @@ Or select a SentenceTransformer model with your own parameters:
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
sentence_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)
```

14 changes: 14 additions & 0 deletions docs/changelog.md
@@ -1,3 +1,17 @@
## **Version 0.4.0**
*Release date: 23 June, 2021*

**Highlights**:

* Highlight a document's keywords with:
* ```keywords = kw_model.extract_keywords(doc, highlight=True)```
* Use `paraphrase-MiniLM-L6-v2` as the default embedder which gives great results!

**Miscellaneous**:

* Update Flair dependencies
* Added FAQ

## **Version 0.3.0**
*Release date: 10 May, 2021*

20 changes: 20 additions & 0 deletions docs/faq.md
@@ -0,0 +1,20 @@
## **Which embedding model works best for which language?**
Unfortunately, there is no definitive list of the best models for each language; this highly depends
on your data, the model, and your specific use case. However, the default model in KeyBERT
(`"paraphrase-MiniLM-L6-v2"`) works great for **English** documents. In contrast, for **multilingual**
documents or any other language, `"paraphrase-multilingual-MiniLM-L12-v2"` has shown great performance.

If you want a model that provides higher quality but takes more compute time, then I would advise using `paraphrase-mpnet-base-v2` or `paraphrase-multilingual-mpnet-base-v2` instead.
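
As a minimal sketch, swapping in one of these higher-quality models only changes the `model` argument:

```python
from keybert import KeyBERT

doc = "Supervised learning is the machine learning task of learning a function."

# Higher-quality but slower English model; swap in the multilingual variant if needed
kw_model = KeyBERT(model="paraphrase-mpnet-base-v2")
keywords = kw_model.extract_keywords(doc)
```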


## **Should I preprocess the data?**
No. When using document embeddings there is typically no need to preprocess the data, as all parts of a document
are important for understanding its general topic. Although this holds true in 99% of cases, if your data
contains a lot of noise, for example HTML tags, then it is best to remove them. HTML tags
typically do not contribute to the meaning of a document and should therefore be removed. However, if you apply
topic modeling to HTML code to extract topics of code, then they do become important.
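
A minimal sketch of removing HTML tags before extraction; a simple regex is assumed here for illustration, and a dedicated HTML parser would be more robust:

```python
import re

from keybert import KeyBERT

html_doc = "<p>Supervised learning is the <b>machine learning</b> task of learning a function.</p>"

# Strip HTML tags and collapse the remaining whitespace
clean_doc = re.sub(r"<[^>]+>", " ", html_doc)
clean_doc = re.sub(r"\s+", " ", clean_doc).strip()

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(clean_doc)
```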


## **Can I use the GPU to speed up the model?**
Yes! Since KeyBERT uses embeddings as its backend, a GPU is actually preferred when using this package.
Although it is possible to use it without a dedicated GPU, the inference speed will be significantly slower.
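
A minimal sketch of running the embedding backend on a GPU, assuming a CUDA-capable device is available, by loading the sentence-transformers model yourself and passing it to KeyBERT:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Load the embedding model on the GPU and pass it to KeyBERT
sentence_model = SentenceTransformer("paraphrase-MiniLM-L6-v2", device="cuda")
kw_model = KeyBERT(model=sentence_model)
```
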
8 changes: 4 additions & 4 deletions docs/guides/embeddings.md
@@ -8,15 +8,15 @@ and pass it through KeyBERT with `model`:

```python
from keybert import KeyBERT
kw_model = KeyBERT(model="xlm-r-bert-base-nli-stsb-mean-tokens")
kw_model = KeyBERT(model="paraphrase-MiniLM-L6-v2")
```

Or select a SentenceTransformer model with your own parameters:

```python
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cuda")
sentence_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)
```

@@ -60,7 +60,7 @@ import spacy

nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

kw_model = KeyBERT(model=document_glove_embeddings)nlp
kw_model = KeyBERT(model=nlp)
```

Using spacy-transformer models:
@@ -129,7 +129,7 @@ class CustomEmbedder(BaseEmbedder):
return embeddings

# Create custom backend
distilbert = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
distilbert = SentenceTransformer("paraphrase-MiniLM-L6-v2")
custom_embedder = CustomEmbedder(embedding_model=distilbert)

# Pass custom backend to keybert
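# (sketch) the custom backend is presumably passed through the same `model` argument as the other backends
kw_model = KeyBERT(model=custom_embedder)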
12 changes: 9 additions & 3 deletions docs/guides/quickstart.md
@@ -38,7 +38,7 @@ doc = """
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
kw_model = KeyBERT('distilbert-base-nli-mean-tokens')
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
```

@@ -65,9 +65,15 @@ of words you would like in the resulting keyphrases:
('learning function', 0.5850)]
```

We can highlight the keywords in the document by simply setting `highlight=True`:

```python
keywords = kw_model.extract_keywords(doc, highlight=True)
```

**NOTE**: For a full overview of all possible transformer models see [sentence-transformer](https://www.sbert.net/docs/pretrained_models.html).
I would advise either `'distilbert-base-nli-mean-tokens'` or `'xlm-r-distilroberta-base-paraphrase-v1'` as they
have shown great performance in semantic similarity and paraphrase identification respectively.
I would advise either `"paraphrase-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
for multilingual documents or documents in any other language.

### Max Sum Similarity

35 changes: 21 additions & 14 deletions docs/index.md
@@ -7,7 +7,7 @@ create keywords and keyphrases that are most similar to a document.

## About the Project

Although that are already many methods available for keyword generation
Although there are already many methods available for keyword generation
(e.g.,
[Rake](https://github.com/aneesha/RAKE),
[YAKE!](https://github.com/LIAAD/yake), TF-IDF, etc.)
@@ -30,11 +30,6 @@ papers and solutions out there that use BERT-embeddings
), I could not find a BERT-based solution that did not have to be trained from scratch and
could be used by beginners (**correct me if I'm wrong!**).
Thus, the goal was a `pip install keybert` and at most 3 lines of code in usage.

**NOTE**: If you use MMR to select the candidates instead of simple cosine similarity,
this repo is essentially a simplified implementation of
[EmbedRank](https://github.com/swisscom/ai-research-keyphrase-extraction)
with BERT-embeddings.

## Installation
Installation can be done using [pypi](https://pypi.org/project/keybert/):
@@ -43,22 +38,33 @@ Installation can be done using [pypi](https://pypi.org/project/keybert/):
pip install keybert
```

To use Flair embeddings, install KeyBERT as follows:
You may want to install additional dependencies depending on the transformer and language backends that you will be using. The possible installations are:

```
pip install keybert[flair]
pip install keybert[gensim]
pip install keybert[spacy]
pip install keybert[use]
```

To install all backends:

```
pip install keybert[all]
```


## Usage


A minimal example of keyword extraction can be seen below:
```python
from keybert import KeyBERT

doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs.[1] It infers a
function from labeled training data consisting of a set of training examples.[2]
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
@@ -67,13 +73,14 @@ doc = """
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
model = KeyBERT('distilbert-base-nli-mean-tokens')
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
```

You can set `keyphrase_length` to set the length of the resulting keyphras:
You can set `keyphrase_ngram_range` to set the length of the resulting keywords/keyphrases:

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(1, 1))
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
('algorithm', 0.4556),
('training', 0.4487),
@@ -85,10 +92,10 @@ To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher depending on the number
of words you would like in the resulting keyphrases:

```python
>>> model.extract_keywords(doc, keyphrase_ngram_range=(1, 2))
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
('machine learning', 0.6305),
('supervised learning', 0.5985),
('algorithm analyzes', 0.5860),
('learning function', 0.5850)]
```
Binary file added images/highlight.png
4 changes: 2 additions & 2 deletions keybert/__init__.py
@@ -1,3 +1,3 @@
from keybert.model import KeyBERT
from keybert._model import KeyBERT

__version__ = "0.3.0"
__version__ = "0.4.0"
96 changes: 96 additions & 0 deletions keybert/_highlight.py
@@ -0,0 +1,96 @@
import re
from rich.console import Console
from rich.highlighter import RegexHighlighter
from typing import Tuple, List


class NullHighlighter(RegexHighlighter):
"""Apply style to anything that looks like an email."""

base_style = ""
highlights = [r""]


def highlight_document(doc: str,
keywords: List[Tuple[str, float]]):
""" Highlight keywords in a document
Arguments:
doc: The document for which to extract keywords/keyphrases
keywords: the top n keywords for a document with their respective distances
to the input document
Returns:
highlighted_text: The document with additional tags to highlight keywords
according to the rich package
"""
keywords_only = [keyword for keyword, _ in keywords]
max_len = max([len(token.split(" ")) for token in keywords_only])

if max_len == 1:
highlighted_text = _highlight_one_gram(doc, keywords_only)
else:
highlighted_text = _highlight_n_gram(doc, keywords_only)

console = Console(highlighter=NullHighlighter())
console.print(highlighted_text)


def _highlight_one_gram(doc: str,
keywords: List[str]) -> str:
""" Highlight 1-gram keywords in a document
Arguments:
doc: The document for which to extract keywords/keyphrases
keywords: the top n keywords for a document
Returns:
highlighted_text: The document with additional tags to highlight keywords
according to the rich package
"""
tokens = re.sub(r' +', ' ', doc.replace("\n", " ")).split(" ")

highlighted_text = " ".join([f"[black on #FFFF00]{token}[/]"
if token.lower() in keywords
else f"{token}"
for token in tokens]).strip()
return highlighted_text


def _highlight_n_gram(doc: str,
keywords: List[str]) -> str:
""" Highlight n-gram keywords in a document
Arguments:
doc: The document for which to extract keywords/keyphrases
keywords: the top n keywords for a document
Returns:
highlighted_text: The document with additional tags to highlight keywords
according to the rich package
"""
max_len = max([len(token.split(" ")) for token in keywords])
tokens = re.sub(r' +', ' ', doc.replace("\n", " ")).strip().split(" ")
n_gram_tokens = [[" ".join(tokens[i: i + max_len][0: j + 1]) for j in range(max_len)] for i, _ in enumerate(tokens)]
highlighted_text = []
skip = False

for n_grams in n_gram_tokens:
candidate = False

if not skip:
for index, n_gram in enumerate(n_grams):

if n_gram.lower() in keywords:
candidate = f"[black on #FFFF00]{n_gram}[/]" + n_grams[-1].split(n_gram)[-1]
skip = index + 1

if not candidate:
candidate = n_grams[0]

highlighted_text.append(candidate)

else:
skip = skip - 1
highlighted_text = " ".join(highlighted_text)
return highlighted_text
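
# A minimal usage sketch (an illustration, not part of the original module):
# highlight_document is presumably invoked internally when
# `extract_keywords(..., highlight=True)` is used, but it can also be called
# directly with previously extracted keywords.
if __name__ == "__main__":
    from keybert import KeyBERT

    example_doc = "Supervised learning is the machine learning task of learning a function."
    kw_model = KeyBERT()
    example_keywords = kw_model.extract_keywords(example_doc, keyphrase_ngram_range=(1, 2))
    highlight_document(example_doc, example_keywords)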
File renamed without changes.
File renamed without changes.
