
Upgrade get_dataset.tokenize() to multiprocessing #24

Open · wants to merge 2 commits into master
Conversation

@DrStoop commented Aug 20, 2019

get_dataset.tokenize() on a single CPU is very slow, so this pull request upgrades it to multiprocessing by implementing the multiprocessing target function worker_tokenize(args_list). Additionally, a multiprocessing debug logger mp_logger was added, together with logger.debug() and mp_logger.debug() messages, to track progress in the Python console.

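For reference, a minimal, self-contained sketch of the approach the description outlines. worker_tokenize and mp_logger match the names in the PR, but their exact signatures and configuration here are assumptions, and str.split stands in for the project's real tokenizer:

```python
import logging
import multiprocessing as mp

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.DEBUG)

# Multiprocessing debug logger, as described in the PR; wiring it to
# stderr like this is an assumption for illustration.
mp_logger = mp.log_to_stderr()
mp_logger.setLevel(logging.DEBUG)


def worker_tokenize(args_list):
    """Tokenize one chunk of strings in a worker process.

    str.split stands in for the real tokenizer here.
    """
    mp_logger.debug("tokenizing chunk of %d items", len(args_list))
    return [s.split() for s in args_list]


def tokenize_parallel(texts, n_procs=None):
    """Split texts into per-process chunks and tokenize them in a Pool."""
    n_procs = n_procs or mp.cpu_count()
    size = max(1, -(-len(texts) // n_procs))  # ceiling division
    chunks = [texts[i:i + size] for i in range(0, len(texts), size)]
    logger.debug("spawning %d workers for %d chunks", n_procs, len(chunks))
    with mp.Pool(n_procs) as pool:
        results = pool.map(worker_tokenize, chunks)  # order is preserved
    return [tokens for chunk in results for tokens in chunk]
```

Since Pool.map preserves input order, the flattened result lines up with the input texts, which keeps the change transparent to the rest of get_dataset().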
utils.py: two review threads, resolved
@thomwolf (Member) commented:

Looks nice, thanks!

@DrStoop (Author) left a comment:

Thanks for reviewing, very nice project, happy you published it :) If there's anything else, let me know...

dataset = tokenize(dataset)
# dataset = tokenize(dataset)
@DrStoop (Author) replied:

absolutely!

Suggested change:
- # dataset = tokenize(dataset)

personachat = tokenize(personachat)
torch.save(personachat, dataset_cache)
# torch.save(personachat, dataset_cache)
@DrStoop (Author) replied:

of course!

Suggested change:
- # torch.save(personachat, dataset_cache)
+ torch.save(personachat, dataset_cache)

@DrStoop commented Aug 20, 2019

One open question: should the multiprocessing module be added to requirements.txt?

@martinritchie commented:

@thomwolf , please could we get this merged? Thank you.

@DrStoop commented Sep 18, 2019

@thomwolf, before merging: I did some work on parallelizing the complete preprocessing chain, which affects quite some code in 'train.py' and 'utils.py'. I could clean up the code and create a new pull request with, e.g., two new files, 'utils_multiprocessing.py' and 'train_multiprocessing.py'. That would make merging very easy, and backward compatibility would be guaranteed for everybody. Just let me know if you have interest in merging such a speedup ⏩ 💨

Labels: none yet
Projects: none yet
3 participants