v0.5 (#58)
* Guided KeyBERT
* Update default SBERT model
MaartenGr committed Sep 28, 2021
1 parent c8c6993 commit 6ab9af1
Showing 12 changed files with 113 additions and 29 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/testing.yml
@@ -26,6 +26,6 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -e ".[test]"
pip install -e ".[dev]"
- name: Run Checking Mechanisms
run: make check
12 changes: 3 additions & 9 deletions README.md
@@ -75,12 +75,6 @@ pip install keybert[spacy]
pip install keybert[use]
```

To install all backends:

```
pip install keybert[all]
```

<a name="usage"/></a>
### 2.2. Usage

@@ -136,7 +130,7 @@ keywords = kw_model.extract_keywords(doc, highlight=True)


**NOTE**: For a full overview of all possible transformer models, see [sentence-transformers](https://www.sbert.net/docs/pretrained_models.html).
I would advise either `"paraphrase-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
I would advise either `"all-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
for multi-lingual documents or any other language.

<a name="maxsum"/></a>
@@ -205,7 +199,7 @@ and pass it through KeyBERT with `model`:

```python
from keybert import KeyBERT
kw_model = KeyBERT(model='paraphrase-MiniLM-L6-v2')
kw_model = KeyBERT(model='all-MiniLM-L6-v2')
```

Or select a SentenceTransformer model with your own parameters:
@@ -214,7 +208,7 @@
```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)
```

14 changes: 14 additions & 0 deletions docs/changelog.md
@@ -1,3 +1,17 @@
## **Version 0.5.0**
*Release date: 28 September, 2021*

**Highlights**:

* Added Guided KeyBERT
* `kw_model.extract_keywords(doc, seed_keywords=seed_keywords)`
* Thanks to [@zolekode](https://github.com/zolekode) for the inspiration!
* Use the newest `all-*` models from SBERT

**Miscellaneous**:

* Added instructions in the FAQ to extract keywords from Chinese documents

## **Version 0.4.0**
*Release date: 23 June, 2021*

27 changes: 25 additions & 2 deletions docs/faq.md
@@ -1,7 +1,7 @@
## **Which embedding model works best for which language?**
Unfortunately, there is no definitive list of the best models for each language; this depends heavily
on your data, the model, and your specific use case. However, the default model in KeyBERT
(`"paraphrase-MiniLM-L6-v2"`) works great for **English** documents. In contrast, for **multi-lingual**
(`"all-MiniLM-L6-v2"`) works great for **English** documents. In contrast, for **multi-lingual**
documents or any other language, `"paraphrase-multilingual-MiniLM-L12-v2"` has shown great performance.

If you want a model that provides higher quality at the cost of more compute time, then I would advise using `paraphrase-mpnet-base-v2` or `paraphrase-multilingual-mpnet-base-v2` instead.
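
For instance, a minimal sketch of that trade-off, using the higher-quality English model named above:

```python
from keybert import KeyBERT

# Higher-quality but slower English model, as recommended above
kw_model = KeyBERT(model="paraphrase-mpnet-base-v2")
```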
@@ -17,4 +17,27 @@ topic modeling to HTML-code to extract topics of code, then it becomes important

## **Can I use the GPU to speed up the model?**
Yes! Since KeyBERT uses embeddings as its backend, a GPU is actually preferred when using this package.
Although it is possible to use it without a dedicated GPU, the inference speed will be significantly slower.

## **How can I use KeyBERT with Chinese documents?**
You need to make sure that the tokenizer you use in KeyBERT supports Chinese. I suggest installing [`jieba`](https://github.com/fxsjy/jieba) for this:

```python
from sklearn.feature_extraction.text import CountVectorizer
import jieba

def tokenize_zh(text):
words = jieba.lcut(text)
return words

vectorizer = CountVectorizer(tokenizer=tokenize_zh)
```

Then, simply pass the vectorizer to your KeyBERT instance:

```python
from keybert import KeyBERT

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer)
```
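
Putting both snippets together, a minimal end-to-end sketch; the example document is made up, and the multilingual model follows the recommendation in the first answer above:

```python
from keybert import KeyBERT
from sklearn.feature_extraction.text import CountVectorizer
import jieba

def tokenize_zh(text):
    return jieba.lcut(text)

# Chinese-aware tokenizer plus a multilingual embedding model,
# since the default model is geared towards English
vectorizer = CountVectorizer(tokenizer=tokenize_zh)
kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")

doc = "机器学习是人工智能的一个分支。"  # hypothetical example document
keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer)
```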
4 changes: 2 additions & 2 deletions docs/guides/embeddings.md
@@ -8,15 +8,15 @@ and pass it through KeyBERT with `model`:

```python
from keybert import KeyBERT
kw_model = KeyBERT(model="paraphrase-MiniLM-L6-v2")
kw_model = KeyBERT(model="all-MiniLM-L6-v2")
```

Or select a SentenceTransformer model with your own parameters:

```python
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)
```
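
Note that the example above does not actually pass any custom parameters yet; a minimal sketch that does, assuming a CUDA-capable machine (`device` is a standard `SentenceTransformer` argument):

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Run the encoder on the GPU for faster inference
sentence_model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
kw_model = KeyBERT(model=sentence_model)
```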

28 changes: 27 additions & 1 deletion docs/guides/quickstart.md
@@ -72,7 +72,7 @@ keywords = kw_model.extract_keywords(doc, highlight=True)
```

**NOTE**: For a full overview of all possible transformer models, see [sentence-transformers](https://www.sbert.net/docs/pretrained_models.html).
I would advise either `"paraphrase-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
I would advise either `"all-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
for multi-lingual documents or any other language.

### Max Sum Similarity
@@ -147,4 +147,30 @@ candidates = [candidate[0] for candidate in candidates]
```python
# KeyBERT init
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, candidates)
```

### Guided KeyBERT

Guided KeyBERT is similar to guided topic modeling in that it tries to steer the extraction towards a set of seed terms. By default, KeyBERT extracts the keywords most related to a specific document. However, there are times when stakeholders or users are looking for specific types of keywords. For example, when publishing an article on your website through Contentful, you typically already know the global keywords related to the article, but there may be a specific topic in the article that you would like to see reflected in the extracted keywords. To achieve this, we simply give KeyBERT a set of related seed keywords (it can also be a single one!) and search for keywords that are similar to both the document and the seed keywords.

Using this feature is as simple as defining a list of seed keywords and passing them to KeyBERT:


```python
doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs.[1] It infers a
function from labeled training data consisting of a set of training examples.[2]
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""

kw_model = KeyBERT()
seed_keywords = ["information"]
keywords = kw_model.extract_keywords(doc, use_mmr=True, diversity=0.1, seed_keywords=seed_keywords)
```
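
Under the hood (see the change to `keybert/_model.py` in this commit), the seed keywords are embedded as one joined string and averaged with the document embedding at a fixed 3:1 weight in favour of the document, so the seeds nudge rather than dominate the ranking. A minimal sketch of that weighting, with made-up 3-dimensional embeddings standing in for the real ones:

```python
import numpy as np

# Toy stand-ins for the real sentence embeddings (assumed values)
doc_embedding = np.array([[0.8, 0.1, 0.1]])
seed_embeddings = np.array([[0.2, 0.7, 0.1]])

# Same call as in KeyBERT: 3 parts document, 1 part seed keywords
guided = np.average([doc_embedding, seed_embeddings], axis=0, weights=[3, 1])
print(guided)  # [[0.65 0.25 0.1 ]]
```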
2 changes: 1 addition & 1 deletion keybert/__init__.py
@@ -1,3 +1,3 @@
from keybert._model import KeyBERT

__version__ = "0.4.0"
__version__ = "0.5.0"
22 changes: 17 additions & 5 deletions keybert/_model.py
@@ -31,7 +31,7 @@ class KeyBERT:
"""
def __init__(self,
model="paraphrase-MiniLM-L6-v2"):
model="all-MiniLM-L6-v2"):
""" KeyBERT initialization
Arguments:
@@ -60,8 +60,9 @@ def extract_keywords(self,
diversity: float = 0.5,
nr_candidates: int = 20,
vectorizer: CountVectorizer = None,
highlight: bool = False) -> Union[List[Tuple[str, float]],
List[List[Tuple[str, float]]]]:
highlight: bool = False,
seed_keywords: List[str] = None) -> Union[List[Tuple[str, float]],
List[List[Tuple[str, float]]]]:
""" Extract keywords/keyphrases
NOTE:
@@ -99,6 +100,8 @@ def extract_keywords(self,
highlight: Whether to print the document and highlight
its keywords/keyphrases. NOTE: This does not work if
multiple documents are passed.
seed_keywords: Seed keywords that may guide the extraction of keywords by
steering the similarities towards the seeded keywords
Returns:
keywords: the top n keywords for a document with their respective distances
@@ -116,7 +119,8 @@
use_mmr=use_mmr,
diversity=diversity,
nr_candidates=nr_candidates,
vectorizer=vectorizer)
vectorizer=vectorizer,
seed_keywords=seed_keywords)
if highlight:
highlight_document(docs, keywords)

@@ -143,7 +147,8 @@ def _extract_keywords_single_doc(self,
use_mmr: bool = False,
diversity: float = 0.5,
nr_candidates: int = 20,
vectorizer: CountVectorizer = None) -> List[Tuple[str, float]]:
vectorizer: CountVectorizer = None,
seed_keywords: List[str] = None) -> List[Tuple[str, float]]:
""" Extract keywords/keyphrases for a single document
Arguments:
@@ -157,6 +162,8 @@ def _extract_keywords_single_doc(self,
diversity: The diversity of results between 0 and 1 if use_mmr is True
nr_candidates: The number of candidates to consider if use_maxsum is set to True
vectorizer: Pass in your own CountVectorizer from scikit-learn
seed_keywords: Seed keywords that may guide the extraction of keywords by
steering the similarities towards the seeded keywords
Returns:
keywords: the top n keywords for a document with their respective distances
@@ -175,6 +182,11 @@
doc_embedding = self.model.embed([doc])
candidate_embeddings = self.model.embed(candidates)

# Guided KeyBERT with seed keywords
if seed_keywords is not None:
seed_embeddings = self.model.embed([" ".join(seed_keywords)])
doc_embedding = np.average([doc_embedding, seed_embeddings], axis=0, weights=[3, 1])
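# Note (added for clarity): the 3:1 weights keep the document dominant,
# i.e. the guided embedding is 0.75 * doc_embedding + 0.25 * seed_embeddings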

# Calculate distances and extract keywords
if use_mmr:
keywords = mmr(doc_embedding, candidate_embeddings, candidates, top_n, diversity)
6 changes: 3 additions & 3 deletions keybert/backend/_sentencetransformers.py
@@ -16,13 +16,13 @@ class SentenceTransformerBackend(BaseEmbedder):
sentence-transformers model:
```python
from keybert.backend import SentenceTransformerBackend
sentence_model = SentenceTransformerBackend("paraphrase-MiniLM-L6-v2")
sentence_model = SentenceTransformerBackend("all-MiniLM-L6-v2")
```
or you can instantiate a model yourself:
```python
from keybert.backend import SentenceTransformerBackend
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
sentence_model = SentenceTransformerBackend(embedding_model)
```
"""
Expand All @@ -36,7 +36,7 @@ def __init__(self, embedding_model: Union[str, SentenceTransformer]):
else:
raise ValueError("Please select a correct SentenceTransformers model: \n"
"`from sentence_transformers import SentenceTransformer` \n"
"`model = SentenceTransformer('paraphrase-MiniLM-L6-v2')`")
"`model = SentenceTransformer('all-MiniLM-L6-v2')`")

def embed(self,
documents: List[str],
2 changes: 1 addition & 1 deletion keybert/backend/_utils.py
@@ -4,7 +4,7 @@

def select_backend(embedding_model) -> BaseEmbedder:
""" Select an embedding model based on language or a specific sentence transformer models.
When selecting a language, we choose `paraphrase-MiniLM-L6-v2` for English and
When selecting a language, we choose `all-MiniLM-L6-v2` for English and
`paraphrase-multilingual-MiniLM-L12-v2` for all other languages as it supports 100+ languages.
Returns:
5 changes: 2 additions & 3 deletions setup.py
@@ -48,7 +48,7 @@
setup(
name="keybert",
packages=find_packages(exclude=["notebooks", "docs"]),
version="0.4.0",
version="0.5.0",
author="Maarten Grootendorst",
author_email="maartengrootendorst@gmail.com",
description="KeyBERT performs keyword extraction with state-of-the-art transformer models.",
@@ -76,8 +76,7 @@
"test": test_packages,
"docs": docs_packages,
"dev": dev_packages,
"flair": flair_packages,
"all": extra_packages
"flair": flair_packages
},
python_requires='>=3.6',
)
18 changes: 17 additions & 1 deletion tests/test_model.py
@@ -4,7 +4,7 @@
from keybert import KeyBERT

doc_one, doc_two = get_test_data()
model = KeyBERT(model='paraphrase-MiniLM-L6-v2')
model = KeyBERT(model='all-MiniLM-L6-v2')


@pytest.mark.parametrize("keyphrase_length", [(1, i+1) for i in range(5)])
@@ -68,6 +68,22 @@ def test_extract_keywords_multiple_docs(keyphrase_length):
assert len(keyword[0].split(" ")) <= keyphrase_length[1]


def test_guided():
""" Test whether the keywords are correctly extracted """
top_n = 5
seed_keywords = ["time", "night", "day", "moment"]
keywords = model.extract_keywords(doc_one,
min_df=1,
top_n=top_n,
seed_keywords=seed_keywords)

assert isinstance(keywords, list)
assert isinstance(keywords[0], tuple)
assert isinstance(keywords[0][0], str)
assert isinstance(keywords[0][1], float)
assert len(keywords) == top_n


def test_error():
""" Empty doc should raise a ValueError """
with pytest.raises(AttributeError):
