Apache Beam Pipeline cannot maximize the number of workers for criteo_preprocess.py in Google Cloud #11166

Arith2 opened this issue Feb 21, 2024 · 0 comments

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • I checked to make sure that this issue has not been filed already.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/official/recommendation/ranking/preprocessing/criteo_preprocess.py

2. Describe the bug

  1. The Apache Beam pipeline cannot maximize the number of workers to increase parallelism for preprocessing in Google Cloud (see the sketch after this list).
  2. The Cloud Storage bucket and the Compute Engine instance are in the same region.
  3. I use "gsutil perfdiag -n 10 -s 100M -c 1 gs://my_storage" to test the throughput of Google Cloud Storage: 876 Mbit/s for writing and 1.56 Gbit/s for reading.
  4. When I try to generate the vocabulary by running "python criteo_preprocess.py --input_path "${STORAGE_BUCKET}/criteo_sharded/training/*" --output_path "${STORAGE_BUCKET}/criteo_out/" --temp_dir "${STORAGE_BUCKET}/criteo_vocab/" --vocab_gen_mode --runner DataflowRunner --max_vocab_size 5000 --project ${PROJECT} --region ${REGION}", it turns out to be very slow: it takes about 30 minutes for an 11 GB input dataset.
  5. Using htop, I find three processes for this Python command; the utilization of all cores is nearly 0 and only one thread is actively running.
  6. I also use shard_rebalancer.py to repartition the input dataset into 64 or 1024 shards; there is no improvement.

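For reference, below is a minimal sketch of how Dataflow worker scaling is usually controlled through Beam's own pipeline options (num_workers, max_num_workers, autoscaling_algorithm). The project, bucket, and values shown are hypothetical, and I have not verified whether criteo_preprocess.py exposes or forwards these options; this only illustrates the Beam/Dataflow knobs involved, not the script's actual interface.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical values -- substitute your own project, region, and bucket.
# These are standard Beam options (StandardOptions, GoogleCloudOptions,
# WorkerOptions); whether criteo_preprocess.py passes them through is unverified.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",               # hypothetical project id
    region="europe-west1",
    temp_location="gs://my_storage/tmp",     # hypothetical bucket path
    num_workers=4,                           # initial worker count
    max_num_workers=32,                      # upper bound for autoscaling
    autoscaling_algorithm="THROUGHPUT_BASED",
)
# A pipeline launched with these options may autoscale up to max_num_workers
# instead of staying on a single worker.
```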
3. Steps to reproduce

  1. Input dataset: training text of the Criteo Kaggle dataset, about 11 GB. I uploaded it to a Google Cloud Storage bucket in europe-west1. https://www.kaggle.com/datasets/mrkmakr/criteo-dataset?resource=download
  2. Compute Engine c2d-highcpu-32 in europe-west1-b
  3. Specify STORAGE_BUCKET, PROJECT, and REGION.
  4. Run the Python command above.

4. Expected behavior

  • The Apache Beam pipeline scales up the number of running workers to the maximum allowed.

6. System information

  • OS Platform and Distribution : Linux 6.1.0-18-cloud-amd64 x86_64
  • TensorFlow installed from (source or binary): setup.py
  • TensorFlow version: 2.15.0
  • Python version: 3.9.2