
Multiple Errors Adapting GloVe Example to Project - Quanteda Related #334

Open

sellociompi opened this issue May 11, 2022 · 3 comments

sellociompi commented May 11, 2022

Hello there,

I am having what I believe are multiple issues adapting the GloVe word embeddings tutorial to my project. I am starting with a tokens object created in Quanteda (TOK.Debates.2020.Full.Clean) to create the iterator. However, when I run that first line, I am greeted with this error:

Tokenizer_Debates_2020 = space_tokenizer(TOK.Debates.2020.Full.Clean)

_Warning message: In stringi::stri_split_fixed(strings, pattern = sep, ...) :
  argument is not an atomic vector; coercing_
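For context, space_tokenizer apparently hands its input to stringi::stri_split_fixed (that's the function named in the warning), which expects an atomic character vector; a quanteda tokens object is list-like, hence the coercion. A minimal illustration of the difference (toy strings, not my actual data):

```r
library(stringi)

# Atomic character vector: one document per element, splits cleanly
stri_split_fixed("we the people", pattern = " ")

# List input: coerced to character first, which produces the warning
# "argument is not an atomic vector; coercing"
stri_split_fixed(list(c("we", "the", "people")), pattern = " ")
```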

The tokenizer is created and looks like this:

[screenshot of the Tokenizer_Debates_2020 output]

I continue the example with no errors:

Iterator_Debates_2020 = itoken(Tokenizer_Debates_2020)
Vocab_Debates_2020 = create_vocabulary(Iterator_Debates_2020)
Vocab_Debates_2020 = prune_vocabulary(Vocab_Debates_2020, term_count_min = 10L)
Vectorizer_Debates_2020 = vocab_vectorizer(Vocab_Debates_2020)
TCM_Debates_2020 = create_tcm(Iterator_Debates_2020, Vectorizer_Debates_2020, skip_grams_window = 5L)

I check the dimensions of the TCM and see that I have rows and columns:
dim(TCM_Debates_2020)

_[1] 9277 9277_

I start to fit the model, creating the glove environment with no issue, but when I try to do the actual fitting I obtain the following error:

glove = GlobalVectors$new(rank = 50, x_max = 10)
WV_Debates_2020 = glove$fit_transform(TCM_Debates_2020, n_iter = 10, convergence_tol = 0.01, n_threads = 8)

_Error in if (cost/n_nnz > 1) stop("Cost is too big, probably something goes wrong... try smaller learning rate") : 
  missing value where TRUE/FALSE needed_

In order to troubleshoot, I have tried the following:

  • Lowered the learning rate in the glove environment to 0.001, but I still receive the same cost error
  • Wrote the initial tokens object out to a text file to match the example more closely, but I still receive the same coercion warning
  • Swapped a Quanteda FCM in for the TCM, but I receive the following error:

WV_Debates_2020 = glove$fit_transform(Debates2020.FCM, n_iter = 10, convergence_tol = 0.01, n_threads = 8)

_Error in glove$fit_transform(Debates2020.FCM, n_iter = 10, convergence_tol = 0.01,  : 
  all(x@x > 0) is not TRUE_
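In case it helps with diagnosis: the failing assertion reads x@x, the slot of explicitly stored values in a Matrix-package sparse matrix, so it fails whenever the FCM carries a stored zero or negative entry. A sketch of how one might check (assuming the fcm exposes this slot directly, which I have not verified):

```r
library(Matrix)

# Any explicitly stored value <= 0 in the @x slot would make
# text2vec's all(x@x > 0) check fail
any(Debates2020.FCM@x <= 0)

# If TRUE because of stored zeros, drop0() removes them:
# Debates2020.FCM = drop0(Debates2020.FCM)
```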

I have been unable to proceed further. One or more of these errors must be the culprit, but I have not been able to find documentation on them elsewhere, including in past issues catalogued here.

Thank you in advance for any help in taking out this gremlin.
-Sello

@jwijffels

Your Tokenizer_Debates_2020 looks like a list of individual words instead of a list of sequences of words (one vector of tokens per document).
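Roughly the difference, with toy values:

```r
# what itoken / create_vocabulary expect: one character vector of
# tokens per document
list(c("good", "evening", "everyone"),
     c("thank", "you", "chris"))

# what you appear to have: each element is a single word, so every
# "document" is one token and the skip-gram windows are meaningless
list("good", "evening", "everyone", "thank", "you", "chris")
```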


sellociompi commented May 19, 2022

@jwijffels thank you for pointing that out. I've been trying to understand what the difference is, but I'm coming up short, unfortunately.

Would I avoid this problem if I tokenized the original corpus instead of a cleaned tokens item?

@jwijffels

Did you try Iterator_Debates_2020 = itoken(TOK.Debates.2020.Full.Clean, tokenizer = space_tokenizer)?
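i.e. hand the tokens object to itoken directly instead of pre-running space_tokenizer on the whole object. Untested sketch, keeping your variable names:

```r
library(text2vec)

# itoken applies the tokenizer per document, so each element stays a
# sequence of words rather than collapsing to single tokens
Iterator_Debates_2020 = itoken(TOK.Debates.2020.Full.Clean,
                               tokenizer = space_tokenizer)
Vocab_Debates_2020 = create_vocabulary(Iterator_Debates_2020)
```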
