YAKE favours single terms #386

Open
OmniaZayed opened this issue Apr 30, 2024 · 0 comments
OmniaZayed commented Apr 30, 2024

Hi,

I used YAKE to extract some keywords from text as follows:

import spacy
import textacy
import textacy.extract.keyterms as ke

en_spacy_model = spacy.load("en_core_web_lg")  # large English pipeline

text = "In the text mining tasks, textual representation should be not only efficient but also interpretable, " \
       "as this enables an understanding of the operational logic underlying the data mining models. Traditional text " \
       "vectorization methods such as TF-IDF and bag-of-words are effective and characterized by intuitive " \
       "interpretability, but suffer from the «curse of dimensionality», and they are unable to capture the meanings " \
       "of words. On the other hand, modern distributed methods effectively capture the hidden semantics, " \
       "but they are computationally intensive, time-consuming, and uninterpretable. This article proposes a new text " \
       "vectorization method called Bag of weighted Concepts BoWC that presents a document according to the concepts’ " \
       "information it contains. The proposed method creates concepts by clustering word vectors (i.e. word " \
       "embedding) then uses the frequencies of these concept clusters to represent document vectors. To enrich the " \
       "resulted document representation, a new modified weighting function is proposed for weighting concepts based " \
       "on statistics extracted from word embedding information. The generated vectors are characterized by " \
       "interpretability, low dimensionality, high accuracy, and low computational costs when used in data mining " \
       "tasks. The proposed method has been tested on five different benchmark datasets in two data mining tasks; " \
       "document clustering and classification, and compared with several baselines, including Bag-of-words, TF-IDF, " \
       "Averaged GloVe, Bag-of-Concepts, and VLAC. The results indicate that BoWC outperforms most baselines and " \
       "gives 7% better accuracy on average "

doc = textacy.make_spacy_doc(text.lower(), en_spacy_model)


yake_kw = ke.yake(doc, ngrams=(1,2,3,4), normalize=None, include_pos=("NOUN", "PROPN", "ADJ"), window_size=4, topn=20)

# Print the keywords extracted by the YAKE algorithm, as implemented in textacy.
print("Yake output: ")
for term, score in yake_kw:
    # Results are in ascending order: a lower score means higher importance.
    print(term, "\t", score)


The output is as follows:


Yake output: 
concepts 	 0.34785604529774267
mining 	 0.3573874481263331
bag 	 0.36656842484621355
document 	 0.3840296285062857
words 	 0.4131660220894072
text 	 0.4151558602661391
tasks 	 0.42079776707616295
data 	 0.42684394140294357
method 	 0.43918219586137885
vectors 	 0.4438910610888909
idf 	 0.5368745934076988
information 	 0.5389345352012269
vectorization 	 0.5405218742053667
dimensionality 	 0.5408938012298964
tf 	 0.5419066125993368
interpretability 	 0.5432608453679435
representation 	 0.5509415208495251
clustering 	 0.5571034040321834
accuracy 	 0.5585834907670935
new 	 0.5590334612780952

Although n-grams of up to four words are requested (ngrams=(1,2,3,4)), all of the top 20 keywords are single words, so the algorithm seems to favour unigrams. Is this behaviour expected?
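One way to check whether multi-word candidates are being scored at all, rather than just ranked below the top 20, might be to request a much larger topn and group the results by term length. The following is only a minimal sketch: group_by_length is a hypothetical helper (not part of textacy), and it assumes the (term, score) tuples that ke.yake returns, with words in a term separated by whitespace.

```python
from collections import defaultdict


def group_by_length(keyterms):
    """Group (term, score) pairs by the number of whitespace-separated words.

    Hypothetical helper for inspecting keyterm output; not a textacy function.
    Returns a dict mapping word count -> list of (term, score), each list
    sorted by ascending score (lower score = more important in YAKE).
    """
    groups = defaultdict(list)
    for term, score in keyterms:
        groups[len(term.split())].append((term, score))
    return {
        n: sorted(pairs, key=lambda pair: pair[1])
        for n, pairs in sorted(groups.items())
    }


if __name__ == "__main__":
    # The first two rows are taken from the output above; the last row is a
    # made-up placeholder, included only to show how a bigram would be grouped.
    sample = [
        ("concepts", 0.34785604529774267),
        ("mining", 0.3573874481263331),
        ("placeholder bigram", 0.9),  # hypothetical value, not real output
    ]
    for n_words, pairs in group_by_length(sample).items():
        print(n_words, pairs)
```

With the real document, the input would be something like ke.yake(doc, ngrams=(1,2,3,4), normalize=None, topn=200). If the groups for two, three and four words come back empty, the multi-word candidates are not being produced at all; if they are present but have higher scores, they are simply being outranked by the single words.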

@OmniaZayed OmniaZayed added the bug label Apr 30, 2024