
Upgrade get_dataset.tokenize() to multiprocessing #24

Open · wants to merge 2 commits into master
Conversation

@DrStoop commented Aug 20, 2019

get_dataset.tokenize() on a single CPU is very slow, so this pull request upgrades it to multiprocessing by implementing the multiprocessing target function worker_tokenize(args_list). Additionally, a multiprocessing debug logger mp_logger was added, together with logger.debug() and mp_logger.debug() messages, to track progress in the Python console.

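For reference, a minimal, self-contained sketch of the approach the description outlines. worker_tokenize and mp_logger match the names in the PR, but their exact signatures and configuration here are assumptions, and str.split stands in for the project's real tokenizer:

```python
import logging
import multiprocessing as mp

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.DEBUG)

# Multiprocessing debug logger, as described in the PR; wiring it to
# stderr like this is an assumption for illustration.
mp_logger = mp.log_to_stderr()
mp_logger.setLevel(logging.DEBUG)


def worker_tokenize(args_list):
    """Tokenize one chunk of strings in a worker process.

    str.split stands in for the real tokenizer here.
    """
    mp_logger.debug("tokenizing chunk of %d items", len(args_list))
    return [s.split() for s in args_list]


def tokenize_parallel(texts, n_procs=None):
    """Split texts into per-process chunks and tokenize them in a Pool."""
    n_procs = n_procs or mp.cpu_count()
    size = max(1, -(-len(texts) // n_procs))  # ceiling division
    chunks = [texts[i:i + size] for i in range(0, len(texts), size)]
    logger.debug("spawning %d workers for %d chunks", n_procs, len(chunks))
    with mp.Pool(n_procs) as pool:
        results = pool.map(worker_tokenize, chunks)  # order is preserved
    return [tokens for chunk in results for tokens in chunk]
```

Since Pool.map preserves input order, the flattened result lines up with the input texts, which keeps the change transparent to the rest of get_dataset().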
utils.py: two review threads, resolved
@thomwolf (Member) commented:

Looks nice, thanks!

@DrStoop (Author) left a comment:

Thanks for reviewing, very nice project, happy you published it :) If there's anything else, let me know...

dataset = tokenize(dataset)
# dataset = tokenize(dataset)
@DrStoop (Author) replied:

absolutely!

Suggested change:
- # dataset = tokenize(dataset)

personachat = tokenize(personachat)
torch.save(personachat, dataset_cache)
# torch.save(personachat, dataset_cache)
@DrStoop (Author) replied:

of course!

Suggested change:
- # torch.save(personachat, dataset_cache)
+ torch.save(personachat, dataset_cache)

@DrStoop commented Aug 20, 2019

One open question: should the multiprocessing module be added to requirements.txt?

@martinritchie commented:

@thomwolf , please could we get this merged? Thank you.

@DrStoop commented Sep 18, 2019

@thomwolf, before merging: I did some work on parallelizing the complete preprocessing chain, which affects quite some code in 'train.py' and 'utils.py'. I could clean up the code and create a new pull request with, e.g., two new files, 'utils_multiprocessing.py' and 'train_multiprocessing.py'. That would make merging very easy, and backward compatibility would be guaranteed for everybody. Just let me know if you have interest in merging such a speedup ⏩ 💨

Labels: none yet
Projects: none yet
3 participants