This repo summarizes some techniques for optimizing TensorFlow code. The official document describing a collection of best practices can be found here. Before starting, it is very helpful to read it.
A Dockerfile containing all of the libraries/packages introduced here is provided. It shows how to install the libraries/packages listed below.
First of all, it is important to find out whether the CPU bottlenecks the GPU, or vice versa (a simple check is to watch GPU utilization with `nvidia-smi`). If the GPU is the bottleneck, optimization is relatively easy. On the other hand, it is more complicated if the CPU is your bottleneck.
Overall, I got a 1.5–2.0x performance gain by applying everything below.
- Use the `NCHW` data format for 4D tensors.
  - The native data format of the cuDNN library is `NCHW`. The performance gain grows as you add more layers.
  - If you use this format, using `fused_batch_norm` is mandatory. Otherwise, your code will be almost 10x slower, since `nn.moments` cannot deal with this format efficiently.
  - Several preprocessing ops support only the `HWC` format, so tensors have to be transposed somewhere. If your input pipeline is a bottleneck, it is better to transpose them on the GPU.
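The layout conversion itself is just an axis permutation. The sketch below uses NumPy as a stand-in for TensorFlow tensors; in TF the equivalent op would be a transpose with permutation `[0, 3, 1, 2]`:

```python
import numpy as np

# NHWC -> NCHW conversion, illustrated with NumPy.
# In TensorFlow this corresponds to tf.transpose(x, [0, 3, 1, 2]).
batch = np.zeros((8, 224, 224, 3))        # NHWC: batch, height, width, channels
nchw = np.transpose(batch, (0, 3, 1, 2))  # NCHW: batch, channels, height, width
print(nchw.shape)                         # (8, 3, 224, 224)
```

Doing this transpose once, right after preprocessing, keeps the rest of the network in cuDNN's native layout.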
- Use fused batch norm.
- Whatever your data format is, it is better to use fused batch norm.
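To see what "fused" buys you, here is the batch-norm computation sketched in NumPy for an `NCHW` tensor. A fused kernel performs the moment computation and the normalization in a single op, whereas composing `nn.moments` with separate normalization ops launches several kernels (and `nn.moments` handles `NCHW` poorly):

```python
import numpy as np

# Batch normalization over an NCHW tensor, sketched in NumPy.
# A fused implementation does all of this in one kernel.
x = np.random.randn(8, 16, 32, 32)            # NCHW: batch, channels, height, width
mean = x.mean(axis=(0, 2, 3), keepdims=True)  # per-channel mean
var = x.var(axis=(0, 2, 3), keepdims=True)    # per-channel variance
y = (x - mean) / np.sqrt(var + 1e-3)          # normalized output
```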
- Utilize queues for the input pipeline.
  - First, use queues for reading and fetching input data. Please refer to the Reading Data guide and the `batch_inputs` function in the Inception codes.
  - CAREFULLY allocate threads for reading and for preprocessing. The right numbers depend entirely on your machine: how many threads can you use? Can you read from an SSD? etc.
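The pattern behind TF's queue runners is an ordinary multi-threaded producer/consumer pipeline. Here is a minimal stdlib sketch of it; `read_example` and `preprocess` are hypothetical placeholders for your I/O and augmentation steps:

```python
import queue
import threading

# Minimal producer/consumer input pipeline, sketched with the standard library.
# read_example and preprocess are hypothetical stand-ins for real I/O and
# preprocessing; tune the thread count to your machine.

def read_example(i):
    return i          # e.g. read one record from disk

def preprocess(x):
    return x * 2      # e.g. decode and augment an image

q = queue.Queue()     # holds preprocessed examples ready for the trainer

def producer(ids):
    for i in ids:
        q.put(preprocess(read_example(i)))

# Four reader/preprocessor threads, each handling a strided shard of the data.
threads = [threading.Thread(target=producer, args=(range(t, 100, 4),))
           for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

batch = sorted(q.get() for _ in range(q.qsize()))
```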
- Use TCMalloc.
- TCMalloc is faster for multi-threaded programs.
- It is especially effective if you use multiple threads for the input pipeline.
- Relevant issues or comments: here, here.
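One way to switch allocators without rebuilding anything is to preload TCMalloc. The library path below is typical for Ubuntu's `libtcmalloc` package but varies by distro, and `train.py` is a hypothetical training script:

```shell
# Preload TCMalloc so malloc/free calls are served by it instead of glibc.
# The .so path varies by distro/version; adjust it for your system.
LD_PRELOAD=/usr/lib/libtcmalloc.so.4 python train.py
```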
- Use advanced instructions (SSE, AVX, FMA) on Intel CPUs.
- For TensorFlow v1.0.0, you may see the following warnings when you execute your code:
```
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
```
- To use these instructions, you have to build TensorFlow from source. The simplest way is to build this Dockerfile.
- Relevant issues or comments: here, here, here.
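If you build from source directly instead of via the Dockerfile, the usual recipe for the TF 1.x era looks like the sketch below. `-march=native` tells the compiler to emit whatever SSE/AVX/FMA instructions the build machine supports; the exact flags and paths may differ for your setup:

```shell
# Build a pip package targeting the local CPU's instruction set.
./configure
bazel build -c opt --copt=-march=native //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-*.whl
```

Note that a wheel built with `-march=native` may not run on older CPUs lacking those instructions.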