tfidf fitting much slower than expected #335

Open
bogedy opened this issue Nov 18, 2022 · 3 comments

Comments

@bogedy

bogedy commented Nov 18, 2022

Hi! I came across this package because I have a dataset of ~2 million text sequences (each under 500 characters long) and I wanted faster performance than sklearn's TF-IDF vectorizer while I experiment with different configurations. Sklearn's vectorizer is single-threaded and written in Python.

It takes about 5 minutes to fit and transform with sklearn in Python:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(sublinear_tf=True,
                        norm='l2',
                        encoding='latin-1', ngram_range=(1, 2),
                        stop_words=None)

%time X = tfidf.fit_transform(dataset.text)
CPU times: user 4min 27s, sys: 14.1 s, total: 4min 41s
Wall time: 4min 49s

I can see in top that this uses only a single thread.
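For context on the configuration above: ngram_range=(1, 2) asks the vectorizer to count both single tokens and adjacent token pairs, so the feature set is considerably larger than with unigrams alone. The following is a minimal pure-Python sketch of that expansion, assuming plain whitespace tokenization (scikit-learn's default tokenizer is regex-based, so real output can differ). The function name `ngrams` is hypothetical and not part of either library.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams of length n_min..n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


tokens = "the quick brown fox".split()
features = ngrams(tokens)
print(features)       # unigrams first, then bigrams
print(len(features))  # 4 unigrams + 3 bigrams
```

A document with k tokens yields k unigrams and k - 1 bigrams, so the number of counted terms per document roughly doubles, and the number of distinct terms across millions of documents typically grows by much more than that.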

With text2vec (I hope I'm using it right! I tried to follow the example at http://text2vec.org/vectorization.html#tf-idf):

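library(data.table)  # fread(), setkey()
library(text2vec)    # itoken_parallel(), create_vocabulary(), TfIdf, ...

dt = fread('dataset.csv.tar.gz')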

setkey(dt, id)

prep_fun = tolower
tok_fun = word_tokenizer

my_iterator = itoken_parallel(dt$text,
                  preprocessor = prep_fun,
                  tokenizer = tok_fun,
                  ids = dt$id,
                  progressbar = TRUE)

t10 = Sys.time()
vocab = create_vocabulary(my_iterator, ngram=c(1L, 2L))
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(my_iterator, vectorizer)

# define tfidf model
tfidf = TfIdf$new(norm = 'l2', sublinear_tf = TRUE)
# fit model to train data and transform train data with fitted model
dtm_train_tfidf = fit_transform(dtm_train, tfidf)
# tfidf modified by fit_transform() call!

paste('Time to build tfidf:', difftime(Sys.time(), t10, units = 'sec'))

I've left it running on an AWS instance. I can see in top that 4 threads are going, but they have been going much longer than 5 minutes, and I eventually had to kill the process. If I work on a smaller subset of a few thousand articles it works fine.

Am I missing something? Or do I just lack patience? Thanks for your help.

@dselivanov
Owner

Hi. The code looks fine. Can you try a single process (itoken() instead of itoken_parallel())?

@bogedy
Author

bogedy commented Nov 18, 2022

edit: the parallel one has been going for 2 hours now. Seems broken.

Just ran it. It took about 11 minutes on a single thread. I'm running the parallel version again: more than 20 minutes so far and still going.

I forgot to add: when I run the parallel tokenizer I get the following warnings every few seconds while it's running:

Warning message in selectChildren(jobs, timeout):
“cannot wait for child 30598 as it does not exist”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 30731 as it does not exist”
Warning message in parallel::mccollect(jobs = jobs_in_progress, wait = FALSE):
“1 parallel job did not deliver a result”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 31722 as it does not exist”
Warning message in parallel::mccollect(jobs = jobs_in_progress, wait = FALSE):
“1 parallel job did not deliver a result”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 31721 as it does not exist”
Warning message in parallel::mccollect(jobs = jobs_in_progress, wait = FALSE):
“1 parallel job did not deliver a result”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 31732 as it does not exist”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 31736 as it does not exist”
Warning message in parallel::mccollect(jobs = jobs_in_progress, wait = FALSE):
“1 parallel job did not deliver a result”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 32259 as it does not exist”
Warning message in parallel::mccollect(jobs = jobs_in_progress, wait = FALSE):
“1 parallel job did not deliver a result”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 32368 as it does not exist”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 32441 as it does not exist”

And earlier, when I interrupted R:

Warning message in selectChildren(jobs, timeout):
“cannot wait for child 17021 as it does not exist”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 17049 as it does not exist”
as(<dgTMatrix>, "dgCMatrix") is deprecated since Matrix 1.5-0; do as(., "CsparseMatrix") instead

@dselivanov
Owner

This means the workers (the processes that handle chunks of the input data) are dying for some reason and not delivering the results of their jobs. You will need to investigate why this happens.
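One way to start such an investigation is to look at how each worker ended. The sketch below is a Python analogy using only the standard library, not text2vec or R code: it forks two workers, lets one kill itself with SIGKILL before reporting, and shows that the parent sees a missing result together with a negative exit code. SIGKILL is used because it is the signal the Linux out-of-memory killer sends; whether memory pressure is what happened in this thread is an assumption that the thread does not confirm. The names `worker` and `run_chunks` are hypothetical.

```python
import multiprocessing as mp
import os
import signal


def worker(chunk_id, conn, die):
    """Process one chunk and send the result; optionally die first."""
    if die:
        # Simulate an abrupt kill: the process ends before it can report.
        os.kill(os.getpid(), signal.SIGKILL)
    conn.send(f"chunk {chunk_id} done")
    conn.close()


def run_chunks(plan):
    """Fork one worker per (chunk_id, die) pair.

    Returns {chunk_id: (result_or_None, exit_code)}.
    """
    ctx = mp.get_context("fork")  # fork-based workers; Unix only
    jobs = []
    for chunk_id, die in plan:
        recv_end, send_end = ctx.Pipe(duplex=False)
        proc = ctx.Process(target=worker, args=(chunk_id, send_end, die))
        proc.start()
        send_end.close()  # parent drops its copy so a dead child yields EOF
        jobs.append((chunk_id, proc, recv_end))

    outcome = {}
    for chunk_id, proc, recv_end in jobs:
        try:
            result = recv_end.recv()
        except EOFError:
            result = None  # the worker exited without delivering anything
        proc.join()
        outcome[chunk_id] = (result, proc.exitcode)
    return outcome


outcome = run_chunks([(0, False), (1, True)])
for chunk_id, (result, code) in outcome.items():
    print(chunk_id, result, code)
```

In Python's `multiprocessing`, an exit code of -N means the child was terminated by signal N, so the killed worker reports -9 while the healthy one reports 0. If workers are being killed by a signal in the same way here, then on Linux the kernel log is one place to check for out-of-memory kills, and running with fewer workers or a smaller subset is one way to test whether memory is the limiting factor.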
