Apache Beam Pipeline cannot maximize the number of workers to increase parallelism for preprocessing in Google Cloud
I placed the Cloud Storage bucket and the Compute Engine instance in the same region.
I use "gsutil perfdiag -n 10 -s 100M -c 1 gs://my_storage" to test the throughput of Google Cloud Storage, 876 Mbit/s for writing, 1.56 Gbit/s for reading.
When I try to generate the vocabulary with `python criteo_preprocess.py --input_path "${STORAGE_BUCKET}/criteo_sharded/training/*" --output_path "${STORAGE_BUCKET}/criteo_out/" --temp_dir "${STORAGE_BUCKET}/criteo_vocab/" --vocab_gen_mode --runner DataflowRunner --max_vocab_size 5000 --project ${PROJECT} --region ${REGION}`, the pipeline runs very slowly: about 30 minutes for an 11 GB input dataset.
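For reference, the usual way to raise the worker ceiling on Dataflow is through the standard Beam/Dataflow pipeline options such as `--num_workers`, `--max_num_workers`, and `--autoscaling_algorithm`. The sketch below only illustrates how those options are normally attached to a pipeline; whether criteo_preprocess.py forwards such extra flags unchanged is an assumption on my part, and the project, region, bucket, and worker counts are placeholders.

```python
# Minimal sketch (not the actual criteo_preprocess.py code): how Dataflow
# worker-count options are typically set on a Beam pipeline. The option
# names are standard Dataflow pipeline options; the concrete values are
# placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                  # placeholder
    region="us-central1",                  # placeholder
    temp_location="gs://my_storage/tmp",   # placeholder
    num_workers=8,                          # initial worker count
    max_num_workers=64,                     # autoscaling upper bound
    autoscaling_algorithm="THROUGHPUT_BASED",
)

with beam.Pipeline(options=options) as p:
    _ = (
        p
        | "Read" >> beam.io.ReadFromText("gs://my_storage/criteo_sharded/training/*")
        | "Count" >> beam.combiners.Count.Globally()
    )
```

If options like these are accepted, the Dataflow job page should show the worker pool scaling up toward `max_num_workers` once the job has enough parallel work.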
I used htop and found three processes for this Python command; the utilization of all cores is nearly zero and only one thread is actively running.
I also used shard_rebalancer.py to repartition the input dataset into 64 or 1,024 shards, but there was no improvement.
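One possible explanation, sketched below under the assumption that the read and the expensive transforms are being fused into a single stage: re-sharding the input files alone may not help if Dataflow fuses everything after the read, in which case inserting a `beam.Reshuffle()` is the usual way to break fusion so the work can spread across workers. This is a generic Beam pattern, not code taken from criteo_preprocess.py.

```python
# Illustrative only: a Reshuffle after the read breaks fusion so downstream
# work can be redistributed across Dataflow workers. This is a generic Beam
# pattern, not an excerpt from criteo_preprocess.py.
import apache_beam as beam

def build(p, input_pattern):
    lines = p | "Read" >> beam.io.ReadFromText(input_pattern)
    # Reshuffle materializes and redistributes elements, decoupling the read
    # from the expensive downstream transforms.
    lines = lines | "BreakFusion" >> beam.Reshuffle()
    return lines | "Split" >> beam.Map(lambda line: line.split("\t"))
```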
Prerequisites
Please answer the following questions for yourself before submitting an issue.
1. The entire URL of the file you are using
https://github.com/tensorflow/models/blob/master/official/recommendation/ranking/preprocessing/criteo_preprocess.py
2. Describe the bug
3. Steps to reproduce
4. Expected behavior
5. System information