Commit

v0.7 (#135)
* Added option to extract and pass word/document embeddings for faster iteration
* Focused on making the documentation a bit nicer (visualizations, etc.)
* Fixed #71
* Fixed #122, #136
MaartenGr committed Nov 3, 2022
1 parent c512c21 commit 7b763ae
Showing 13 changed files with 379 additions and 67 deletions.
35 changes: 35 additions & 0 deletions docs/changelog.md
@@ -1,3 +1,38 @@
---
hide:
- navigation
---


## **Version 0.7.0**
*Release date: 3 November, 2022*

**Highlights**:

* Cleaned up documentation and added several visual representations of the algorithm (excluding MMR / MaxSum)
* Added a function to extract and pass word and document embeddings, which should make fine-tuning much faster

```python
from keybert import KeyBERT

kw_model = KeyBERT()

# Prepare embeddings
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs)

# Extract keywords without needing to re-calculate embeddings
keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
```

Do note that the parameters passed to `.extract_embeddings` for creating the vectorizer should be exactly the same as those in `.extract_keywords`.

**Fixes**:

* Redundant documentation was removed by [@mabhay3420](https://github.com/priyanshul-govil) in [#123](https://github.com/MaartenGr/KeyBERT/pull/123)
* Fixed Gensim backend not working after v4 migration ([#71](https://github.com/MaartenGr/KeyBERT/issues/71))
* Fixed `candidates` not working ([#122](https://github.com/MaartenGr/KeyBERT/issues/122))


## **Version 0.6.0**
*Release date: 25 July, 2022*

5 changes: 5 additions & 0 deletions docs/faq.md
@@ -1,3 +1,8 @@
---
hide:
- navigation
---

## **Which embedding model works best for which language?**
Unfortunately, there is no definitive list of the best models for each language; this highly depends
on your data, the model, and your specific use case. However, the default model in KeyBERT
115 changes: 81 additions & 34 deletions docs/guides/quickstart.md
@@ -14,7 +14,13 @@ pip install keybert[spacy]
pip install keybert[use]
```

## **Usage**

<div class="excalidraw">
--8<-- "docs/images/pipeline.svg"
</div>


## **Basic usage**

The most minimal example can be seen below for the extraction of keywords:
```python
@@ -70,6 +76,12 @@ keywords = kw_model.extract_keywords(doc, highlight=True)
I would advise either `"all-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
for multilingual documents in any other language.

## **Fine-tuning**

By default, KeyBERT simply compares the documents and candidate keywords/keyphrases based on their cosine similarity. However, this can lead
to very similar words ending up in the list of most relevant keywords/keyphrases. To make the results a bit more diverse, there are two
approaches we can take to fine-tune our output: **Max Sum Distance** and **Maximal Marginal Relevance**.

### **Max Sum Distance**

To diversify the results, we take the 2 x top_n most similar words/phrases to the document.
@@ -93,8 +105,8 @@ keywords / keyphrases which is also based on cosine similarity. The results
with **high diversity**:

```python
kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                          use_mmr=True, diversity=0.7)
[('algorithm generalize training', 0.7727),
('labels unseen instances', 0.1649),
('new examples optimal', 0.4185),
@@ -114,58 +126,93 @@ The results with **low diversity**:
('learning algorithm generalize', 0.7514)]
```

## **Candidate Keywords/Keyphrases**
In some cases, one might want to use candidate keywords generated by other keyword algorithms or retrieved from a select list of possible keywords/keyphrases. In KeyBERT, you can easily use those candidate keywords to perform keyword extraction:

```python
import yake
from keybert import KeyBERT

doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs.[1] It infers a
function from labeled training data consisting of a set of training examples.[2]
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""

# Create candidates
kw_extractor = yake.KeywordExtractor(top=50)
candidates = kw_extractor.extract_keywords(doc)
candidates = [candidate[0] for candidate in candidates]

# Pass candidates to KeyBERT
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, candidates=candidates)
```

## **Guided KeyBERT**

Guided KeyBERT is similar to Guided Topic Modeling in that it tries to steer the training towards a set of seeded terms. When applying KeyBERT, it automatically extracts the keywords most related to a specific document. However, there are times when stakeholders and users are looking for specific types of keywords. For example, when publishing an article on your website through Contentful, you typically already know the global keywords related to the article. However, there might be a specific topic in the article that you would like to be extracted through the keywords. To achieve this, we simply give KeyBERT a set of related seeded keywords (it can also be a single one!) and search for keywords that are similar to both the document and the seeded keywords.

<div class="excalidraw">
--8<-- "docs/images/guided.svg"
</div>

Using this feature is as simple as defining a list of seeded keywords and passing them to KeyBERT:


```python
doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs.[1] It infers a
function from labeled training data consisting of a set of training examples.[2]
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""

from keybert import KeyBERT
kw_model = KeyBERT()

# Define our seeded term
seed_keywords = ["information"]
keywords = kw_model.extract_keywords(doc, seed_keywords=seed_keywords)
```

## **Prepare embeddings**

When you have a large dataset and want to fine-tune parameters such as `diversity`, it can take quite a while to re-calculate the document and
word embeddings each time you change a parameter. Instead, we can pre-calculate these embeddings and pass them to `.extract_keywords` so that
we only have to calculate them once:


```python
from keybert import KeyBERT

kw_model = KeyBERT()
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs)
```

You can then pass these embeddings to `.extract_keywords` to speed up tuning the model:

```python
keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
```

There are several parameters in `.extract_embeddings` that define how the list of candidate keywords/keyphrases is generated:

* `candidates`
* `keyphrase_ngram_range`
* `stop_words`
* `min_df`
* `vectorizer`

The values of these parameters need to be exactly the same in `.extract_embeddings` as they are in `.extract_keywords`.

In other words, the following will work as they use the same parameter subset:

```python
from keybert import KeyBERT

kw_model = KeyBERT()
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, min_df=1, stop_words="english")
keywords = kw_model.extract_keywords(docs, min_df=1, stop_words="english",
doc_embeddings=doc_embeddings,
word_embeddings=word_embeddings)
```

The following, however, will throw an error since we did not use the same values for `min_df` and `stop_words`:

```python
from keybert import KeyBERT

kw_model = KeyBERT()
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, min_df=3, stop_words="dutch")
keywords = kw_model.extract_keywords(docs, min_df=1, stop_words="english",
doc_embeddings=doc_embeddings,
word_embeddings=word_embeddings)
```
16 changes: 16 additions & 0 deletions docs/images/guided.svg
16 changes: 16 additions & 0 deletions docs/images/pipeline.svg
5 changes: 5 additions & 0 deletions docs/index.md
@@ -1,3 +1,8 @@
---
hide:
- navigation
---

<img src="https://raw.githubusercontent.com/MaartenGr/KeyBERT/master/images/logo.png" width="35%" height="35%" align="right" />

# **KeyBERT**
Expand Down
12 changes: 12 additions & 0 deletions docs/stylesheets/extra.css
Expand Up @@ -5,3 +5,15 @@
:root>* {
--md-typeset-a-color: #0277BD;
}

body[data-md-color-primary="black"] .excalidraw svg {
filter: invert(100%) hue-rotate(180deg);
}

body[data-md-color-primary="black"] .excalidraw svg rect {
fill: transparent;
}

.excalidraw {
text-align: center;
}
2 changes: 1 addition & 1 deletion keybert/__init__.py
@@ -1,3 +1,3 @@
from keybert._model import KeyBERT

__version__ = "0.7.0"
