Apache Beam Pipeline cannot maximize the number of workers for criteo_preprocess.py in Google Cloud #11166

Arith2 opened this issue Feb 21, 2024 · 0 comments

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • I checked to make sure that this issue has not been filed already.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/official/recommendation/ranking/preprocessing/criteo_preprocess.py

2. Describe the bug

  1. The Apache Beam pipeline cannot maximize the number of workers to increase parallelism for preprocessing in Google Cloud (see the sketch after this list).
  2. The Cloud Storage bucket and the Compute Engine instance are in the same region.
  3. I use "gsutil perfdiag -n 10 -s 100M -c 1 gs://my_storage" to test the throughput of Google Cloud Storage: 876 Mbit/s for writing and 1.56 Gbit/s for reading.
  4. When I try to generate the vocabulary by running "python criteo_preprocess.py --input_path "${STORAGE_BUCKET}/criteo_sharded/training/*" --output_path "${STORAGE_BUCKET}/criteo_out/" --temp_dir "${STORAGE_BUCKET}/criteo_vocab/" --vocab_gen_mode --runner DataflowRunner --max_vocab_size 5000 --project ${PROJECT} --region ${REGION}", it turns out to be very slow: it takes about 30 minutes for an 11 GB input dataset.
  5. Using htop, I find three processes for this Python command; the utilization of all cores is nearly 0 and only one thread is actively running.
  6. I also use shard_rebalancer.py to repartition the input dataset into 64 or 1024 shards; there is no improvement.

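For reference, below is a minimal sketch of how Dataflow worker scaling is usually controlled through Beam's own pipeline options (num_workers, max_num_workers, autoscaling_algorithm). The project, bucket, and values shown are hypothetical, and I have not verified whether criteo_preprocess.py exposes or forwards these options; this only illustrates the Beam/Dataflow knobs involved, not the script's actual interface.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical values -- substitute your own project, region, and bucket.
# These are standard Beam options (StandardOptions, GoogleCloudOptions,
# WorkerOptions); whether criteo_preprocess.py passes them through is unverified.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",               # hypothetical project id
    region="europe-west1",
    temp_location="gs://my_storage/tmp",     # hypothetical bucket path
    num_workers=4,                           # initial worker count
    max_num_workers=32,                      # upper bound for autoscaling
    autoscaling_algorithm="THROUGHPUT_BASED",
)
# A pipeline launched with these options may autoscale up to max_num_workers
# instead of staying on a single worker.
```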
3. Steps to reproduce

  1. Input dataset: training text of the Criteo Kaggle dataset, about 11 GB. I uploaded it to a Google Cloud Storage bucket in europe-west1. https://www.kaggle.com/datasets/mrkmakr/criteo-dataset?resource=download
  2. Compute Engine c2d-highcpu-32 in europe-west1-b
  3. Specify STORAGE_BUCKET, PROJECT, and REGION.
  4. Run the Python command above.

4. Expected behavior

  • The Apache Beam pipeline scales up the number of running workers to the maximum allowed.

6. System information

  • OS Platform and Distribution : Linux 6.1.0-18-cloud-amd64 x86_64
  • TensorFlow installed from (source or binary): setup.py
  • TensorFlow version: 2.15.0
  • Python version: 3.9.2