
Optimizing TensorFlow code

This repo summarizes some techniques for optimizing TensorFlow code. The official documentation describing a collection of best practices can be found here; it is very helpful to read it before you start.

A Dockerfile containing all of the libraries/packages introduced here is provided; it shows how to install everything listed below.

First of all, it is important to determine whether the CPU is bottlenecking the GPU, or vice versa (a quick check is to run nvidia-smi and watch GPU utilization). If the GPU is the bottleneck, it is relatively easy to optimize; if the CPU is the bottleneck, things are more complicated.

Overall, I got a 1.5~2.0x performance gain by applying all of the techniques below.

If GPUs are fully utilized

  1. Use the NCHW data format for 4D tensors.
  • The native data format of the cuDNN library is NCHW, and the performance gain grows with the number of layers.
  • If you use this format, using _fused_batch_norm is mandatory. Otherwise your code will be almost 10x slower, since nn.moments cannot handle this format efficiently.
  • Several preprocessing ops only support the HWC layout, so tensors have to be transposed somewhere. If your input pipeline is the bottleneck, it is better to do the transpose on the GPU.
  2. Use fused batch norm.
  • Whatever your data format is, fused batch norm is the better choice. A sketch covering both points follows this list.
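
Here is a minimal sketch of both points, assuming the TF 1.x graph-mode API; the shapes and variable names are made up for illustration. The input is transposed from NHWC to NCHW once, the convolution runs in NCHW, and tf.nn.fused_batch_norm normalizes the result natively in that layout.

```python
import tensorflow as tf

# Hypothetical input: a batch of 32 RGB images, 224x224, in NHWC
# as it typically comes out of the input pipeline.
images = tf.placeholder(tf.float32, [32, 224, 224, 3])

# Transpose once (ideally on the GPU), then stay in NCHW for all conv layers.
x = tf.transpose(images, [0, 3, 1, 2])  # NHWC -> NCHW

filters = tf.get_variable('w', [3, 3, 3, 64])
# Note: strides are ordered [N, C, H, W] when data_format='NCHW'.
x = tf.nn.conv2d(x, filters, strides=[1, 1, 1, 1], padding='SAME',
                 data_format='NCHW')

# Fused batch norm understands NCHW directly (unlike nn.moments).
scale = tf.get_variable('gamma', [64], initializer=tf.ones_initializer())
offset = tf.get_variable('beta', [64], initializer=tf.zeros_initializer())
x, batch_mean, batch_var = tf.nn.fused_batch_norm(
    x, scale, offset, data_format='NCHW', is_training=True)
```

In a real model you would also keep moving averages of batch_mean and batch_var for use at inference time.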

If CPUs are your bottleneck

  1. Utilize queues for the input pipeline.
  • First, use queues for reading and prefetching input data. Please refer to the Reading Data guide and the batch_inputs function in the Inception code; a sketch is given at the end of this section.
  • CAREFULLY allocate threads between reading and preprocessing. The right split depends entirely on your machine: how many threads can you spare, can you read from an SSD, etc.
  2. Use TCMalloc.
  • TCMalloc is faster for multi-threaded programs.
  • It is therefore especially effective when the input pipeline uses multiple threads.
  • Relevant issues and comments: here, here.
  3. Use advanced instructions (SSE, AVX, FMA) on Intel CPUs.
  • With TensorFlow v1.0.0, you may see the following warnings when you run your code:
```
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
```
  • To use these instructions, you have to build TensorFlow from source. The simplest way is to build this Dockerfile.
  • Relevant issues and comments: here, here, here.
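
As promised above, here is a minimal sketch of a queue-based input pipeline in the TF 1.x API. The file pattern, feature names, image size, and thread counts are placeholders you would tune for your own data; the point is that reader and preprocessing threads run in parallel with training, and num_threads in tf.train.shuffle_batch controls the preprocessing parallelism.

```python
import tensorflow as tf

# Hypothetical TFRecord files and feature names; adjust to your data.
filenames = tf.train.match_filenames_once('/data/train-*.tfrecord')
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)

# Reader threads dequeue filenames and emit serialized examples.
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(serialized, features={
    'image_raw': tf.FixedLenFeature([], tf.string),
    'label': tf.FixedLenFeature([], tf.int64),
})
image = tf.image.decode_jpeg(features['image_raw'], channels=3)
image = tf.image.resize_images(image, [224, 224])

# num_threads preprocessing threads fill the batch queue; tune this
# against your reader threads and CPU core count.
images, labels = tf.train.shuffle_batch(
    [image, features['label']], batch_size=32, num_threads=4,
    capacity=2000, min_after_dequeue=1000)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(),
              tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    # ... training loop consuming images/labels ...
    coord.request_stop()
    coord.join(threads)
```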