
Extremely low throughput of running on IBM POWER9 processor #407

Open
jw447 opened this issue Jul 30, 2019 · 10 comments


@jw447

jw447 commented Jul 30, 2019

Hi,

I'm running the TensorFlow benchmark on an IBM machine (POWER9 processor + V100 GPUs). I know it is not the optimal way to go, but I'm just trying out the performance of POWER9 without using the GPUs. It turns out the throughput is very low (~0.5 to 4 images/sec) regardless of how I tune the thread count (from 16 to 160). I'm not sure if anyone has been playing with a similar setup, but I cannot seem to find any reported performance numbers. I'm doubtful of these numbers because POWER9 seems to have a very high CPU frequency, even without MKL.

So can anyone give me any suggestions? I'm attaching the script here:

```shell
python ~/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
  --device=cpu --data_format=NHWC --model=resnet50 \
  --batch_size=128 --num_batches=50 --display_every=10 \
  --optimizer=sgd --variable_update=replicated --gradient_repacking=2 \
  --use_fp16=False --nodistortions --datasets_use_prefetch=True \
  --loss_type_to_report=base_loss --compute_lr_on_cpu=True \
  --single_l2_loss_op=True --local_parameter_device=cpu \
  --num_intra_threads=128 --num_inter_threads=1
```
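One way to probe the thread-count sensitivity systematically is a simple sweep (a sketch, not part of the original report; the path and flags mirror the command above, and the chosen thread counts are arbitrary):

```shell
# Hypothetical sweep over intra-op thread counts to find where
# throughput peaks; each run prints its own images/sec.
for t in 16 32 40 64 80 128 160; do
  echo "=== num_intra_threads=$t ==="
  python ~/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --device=cpu --data_format=NHWC --model=resnet50 \
    --batch_size=128 --num_batches=50 --display_every=10 \
    --num_intra_threads="$t" --num_inter_threads=1
done
```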

@tfboyd
Member

tfboyd commented Jul 30, 2019

I worked with the POWER9 team before and our numbers were often very close. I have no idea how the CPU alone should perform, so that might not be that far off, but your thread settings are likely not good.

As of TF 1.14 (I think), and certainly in the nightlies, we have built in some of the MKL-DNN open-source features. I have no idea how that impacts POWER9. It worked fine on AMD, as I suspect it just looks for the supported instruction sets, but I realize POWER9 is a very different architecture.
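As a rough starting point for the thread settings (my own heuristic, not an official TF recommendation): set intra-op parallelism near the physical core count, not the SMT-inflated logical CPU count, and keep inter-op small.

```python
def suggest_tf_threads(logical_cpus, smt):
    """Heuristic starting point: intra-op parallelism ~= physical cores,
    inter-op kept small so independent ops don't oversubscribe
    the same cores. Names and numbers here are illustrative."""
    physical = max(1, logical_cpus // smt)
    return {"num_intra_threads": physical, "num_inter_threads": 2}

# A 40-core POWER9 at SMT4 exposes 160 logical CPUs.
print(suggest_tf_threads(160, 4))  # {'num_intra_threads': 40, 'num_inter_threads': 2}
```

These map directly onto the benchmark's `--num_intra_threads` / `--num_inter_threads` flags; sweeping around the physical-core count is usually more informative than sweeping the full logical range.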

You do not need most of those flags. Let me give you some data; I am not sure of your end objective, but I hope this helps a little.

  • tf_cnn_benchmarks is deprecated going into TF 2.0. It is still a great tool pre-TF 2.0. The new "perf tools" are different and focus on end-user performance (plus flags for some "magic"), whereas tf_cnn_benchmarks was 100% focused on testing any kind of hardware we could find and avoided many of the high-level APIs.
  • Here is some data from WAY back; scroll to the bottom area. It is not formatted well other than the very last entry, but you can find it: Inception was 4.3 images/sec with no optimizations and 20.3 with MKL training.
  • For TF 2.0 I am adding some CPU tests slowly, and we mostly run with real data, so you will need to pull down CIFAR-10 for this test. ResNet56 on 32 vCPUs (16 physical Xeon cores @ 2.0 GHz) is 201 to 249 images/sec, and with TF 1.0 I saw 260 images/sec. Flags: `--data_dir=/data/cifar-10-batches-bin --data_format=channels_last --distribution_strategy=off --enable_eager --log_steps=10 --model_dir=/workspace/benchmarks/perfzero/workspace/output/2019-07-30-13-02-52-696886/benchmark_cpu_no_dist_strat --num_gpus=0 --skip_eval --train_steps=110`
  • For TF 2.0 I am also running a basic LSTM example. On the same CPU as above I get 83 examples/sec (derived from step time and batch size, just to have a number). On 1x V100 this is 1,800 to 2,200.

I realize some of this info is a bit sloppy. I do not know exactly what you want, so I went with sharing a mix of things. Feel free to ping/mention me, or whatever they call it on GitHub. :-) I would like to have you testing with the official models, and most of it should work on TF 1.x (1.14) or nightly. TF 2.0 would be better, but it is still in alpha and I run with nightly versions, so it's a bit bleeding edge.

@jw447
Author

jw447 commented Jul 30, 2019

Hi @tfboyd, thank you very much! I'm still playing with TF 1.12 for now, and the best number I see is around 3.6 images/sec. I may need to double-check the vector instruction extensions. BTW, can I build the MKL-DNN feature you mentioned on the IBM machine? I understand it is optimized for the Intel architecture.

Thank you very much!
Jinzhen

@jayfurmanek

jayfurmanek commented Aug 15, 2019

MKL is extremely specific to Intel processors, so you'll have to use the Eigen builds.
Which TensorFlow build are you using? (Did you build it yourself?)

In any case, training on CPUs is very slow. Use those V100s.

*And I agree: thread settings and SMT mode will likely make a difference. TensorFlow will create a lot of threads if you let it, and that doesn't always help.
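On Power systems the SMT mode can be inspected and changed with the `ppc64_cpu` utility from powerpc-utils (a sketch; requires root to change the mode, and the numbers shown are examples, not outputs from this machine):

```shell
# Show the current SMT mode (e.g. "SMT=4") and the thread/core layout.
ppc64_cpu --smt
lscpu | grep -E 'Thread|Core|Socket'

# Try a lower SMT mode and re-run the benchmark (needs root).
ppc64_cpu --smt=2
```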

@jw447
Author

jw447 commented Aug 15, 2019

Thank you @jayfurmanek. Yes, I learned that MKL is for Intel and AMD processors. Actually, I think the vector instruction sets supported by TensorFlow (MKL, SSE, etc.) are all for the Intel architecture.

I'm still in the trial-and-error stage for threading and SMT mode. So far I haven't observed any clear trend.

I'm using my self-built TF 1.12.

@edelsohn

The latest versions of Eigen have specific support for POWER9, if TF is built with a compiler that supports POWER9 (which depends on the Linux distro).

Also, you mention images/sec, so I presume this is some image-processing benchmark. TF uses NumPy and other libraries for image pre-processing. You should check whether you have OpenBLAS installed (the most recent release of OpenBLAS has some POWER9 enhancements, and further enhancements are committed but not yet available in an official release). Depending on the format of the images (JPEG, etc.), one also needs to ensure that the best libraries are installed, e.g., libjpeg-turbo.

You don't mention your configuration, your Linux distro, or the source of the components, but just as one installs MKL-DNN, etc. for Intel/AMD, one needs to install the appropriate optimized libraries for POWER9 to make a meaningful comparison.
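A quick way to see which BLAS/LAPACK your NumPy stack is actually linked against (just a diagnostic sketch; on POWER9 you would hope to see an OpenBLAS built with Power kernels rather than a generic reference BLAS):

```python
import numpy as np

# Print the BLAS/LAPACK configuration NumPy was built against,
# plus the NumPy version itself.
np.__config__.show()
print(np.__version__)
```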

@jw447
Author

jw447 commented Aug 15, 2019

Thank you @edelsohn for introducing me to OpenBLAS. My system is Red Hat 7.6 with OpenBLAS 0.3.5, but it seems that POWER9 support landed in 0.3.7. Is that correct?

Also, do I need to build my own ops to use OpenBLAS? Based on my understanding, the ops in TensorFlow are mainly based on Eigen and MKL.

@edelsohn

edelsohn commented Aug 15, 2019

0.3.7 contains the double-precision optimizations. The single-precision optimizations will be in the next release; they are in the GitHub master repo, so you can download it and build it yourself.

OpenBLAS doesn't affect the TF ops; the ops use Eigen. But not everything in TF is the DL tensor ops. You mention images/sec, so something needs to handle the image ingestion and preprocessing. Even with a GPU, the TF ingestion and pre-processing are handled by the CPU. Especially if you are testing inference, you shouldn't assume that the TF ops dominate the time. The preliminary ingestion and pre-processing are provided by NumPy, Python, and other libraries (OpenBLAS, libjpeg-turbo, libpng, FFmpeg, etc.).
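Building OpenBLAS from master with the POWER9 kernels might look like this (a sketch; `TARGET=POWER9` is the OpenBLAS target name, and the install prefix is an arbitrary choice):

```shell
# Build OpenBLAS master with POWER9 kernels and OpenMP threading,
# then install it under a user-writable prefix.
git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make TARGET=POWER9 USE_OPENMP=1
make install PREFIX="$HOME/openblas"
```

NumPy would then need to be rebuilt (or have its BLAS path pointed) against that install to pick it up.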

@wdirons

wdirons commented Aug 16, 2019

If building TensorFlow yourself, you'll want to ensure you have the following in your .tf_configure.bazelrc file:

```
build:opt --copt=-mcpu=power9
build:opt --copt=-mtune=power9
```

(or at least power8)

and -c opt on the bazel build command line.

@JonTriebenbach

@tfboyd What are the new "perf tools" that will be added for Tensorflow 2.0?

> tf_cnn_benchmarks is deprecated going into TF 2.0. It is still a great tool pre-TF 2.0. The new "perf tools" are different and focus on end-user performance (plus flags for some "magic"), whereas tf_cnn_benchmarks was 100% focused on testing any kind of hardware we could find and avoided many of the high-level APIs.

Is this the tfprof tool? Is there a GitHub location where this activity is occurring? Thanks for any additional information.

@jw447
Author

jw447 commented Aug 16, 2019

@edelsohn Thank you for your help. However, I'm using synthetic ImageNet data for testing right now, so I believe there isn't any image ingestion or preprocessing going on. I will build my own OpenBLAS 0.3.7 from master and test with a real dataset to see if it helps.
