In vectorizer.fit_transform() function, when tf_type="log" we get UFuncTypeError: Cannot cast ufunc 'add' output from dtype('float64') to dtype('int32') with casting rule 'same_kind' #288

rohetoric · 2020-01-31T12:20:38Z

steps to reproduce

Read a text file.
Set the value of the following parameters one by one
tf_type=["linear", "sqrt", "log", "binary"]
idf_type = ["standard", "smooth", "bm25"]
dl_type= ["linear", "sqrt", "log"]
norm =["l1", "l2"]
models= ["lsa","lda","nmf"]
Iterate with a nested loop over the values of all 5 parameters and compute doc_term_matrix, i.e.

    for t in tf_type:
        for i in idf_type:
            for d in dl_type:
                for n in norm:
                    for mo in models:
                        vectorizer = textacy.vsm.Vectorizer(
                            tf_type=t, apply_idf=True, idf_type=i, dl_type=d,
                            norm=n, min_df=2, max_df=0.95)
                        doc_term_matrix = vectorizer.fit_transform(
                            (doc._.to_terms_list(ngrams=3, entities=True, as_strings=True)
                             for doc in spacy_gram))

When tf_type="log", we receive the error quoted in the title.
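
The parameter sweep described in these steps can also be enumerated without five nested loops. The following is only a sketch: it uses the parameter lists from this report, does not call textacy at all, and the names `combos` and `log_combos` are made up for illustration. It just shows which combinations exist and which ones would take the tf_type="log" path.

```python
from itertools import product

# Parameter values copied from the steps above.
tf_type = ["linear", "sqrt", "log", "binary"]
idf_type = ["standard", "smooth", "bm25"]
dl_type = ["linear", "sqrt", "log"]
norm = ["l1", "l2"]
models = ["lsa", "lda", "nmf"]

# Every (tf, idf, dl, norm, model) combination, in the same order
# that the nested loops would visit them.
combos = list(product(tf_type, idf_type, dl_type, norm, models))

# Hypothetical helper list: only the combinations that use tf_type="log",
# i.e. the ones reported to fail.
log_combos = [c for c in combos if c[0] == "log"]

print(len(combos), len(log_combos))
```

Each tuple can then be unpacked and passed to the Vectorizer call, so a failing combination is easy to log or skip.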

expected vs. actual behavior

possible solution?

I saw that vectorizer.fit_transform calls a function _reweight_values(self, doc_term_matrix). When tf_type="log", it runs np.log(doc_term_matrix.data, doc_term_matrix.data, casting="unsafe"). Even though the casting is declared as "unsafe" in that call, the error is raised on the next line, i.e. doc_term_matrix.data += 1.0. I think that line should be written as doc_term_matrix.data = doc_term_matrix.data + 1.0, according to https://stackoverflow.com/questions/38673531/multiply-numpy-int-and-float-arrays-cannot-cast-ufunc-multiply-output-from-dtyp
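
To illustrate the point outside of textacy, here is a minimal NumPy-only sketch. It assumes that doc_term_matrix.data is an int32 array of term counts (as the error message suggests); the array `counts` and the other variable names are hypothetical stand-ins, not textacy code.

```python
import numpy as np

# Hypothetical stand-in for doc_term_matrix.data: integer term counts.
counts = np.array([1, 2, 4, 8], dtype=np.int32)

# Writing the log back into the int32 array only works with casting="unsafe",
# and the fractional part of each result is dropped when it is stored.
truncated = counts.copy()
np.log(truncated, truncated, casting="unsafe")

# The in-place add of a float is then rejected, because float64 -> int32
# is not allowed under the default 'same_kind' casting rule.
data = counts.copy()
try:
    data += 1.0
    raised = False
except TypeError:  # UFuncTypeError is a subclass of TypeError
    raised = True

# One possible workaround (an assumption, not the library's fix):
# convert to float first, so nothing has to be cast back to int.
weights = np.log(counts.astype(np.float64)) + 1.0
```

Rebinding with data = data + 1.0 would also avoid the exception, but if the log values were already stored in an integer array, they have lost their fractional part by then, so converting to float before taking the log seems safer.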

context

I am trying to cluster my dataset by similar intent, and for that I need the document-term matrix. I am using a brute-force approach: tweaking the parameters of the vectorizer in a loop to find the combination that gives the best silhouette score for the clusters.

environment

Receiving a TypeError here in print_markdown(items), i.e. TypeError: s must be (<class 'str'>, <class 'bytes'>), not <class 'list'>, inside the to_unicode(s, encoding, errors) function.

operating system: Ubuntu 18.04
python version: Python 3.7.4
spacy version: 2.2.3
installed spacy models: en_core_web_sm, en_core_web_md
textacy version: 0.9.1

rohetoric added the bug label Jan 31, 2020
