
Tensorflow binary seems compiled to use SIMD instructions like AVX2 and FMA, but actually not? #13500

Closed
martozzz opened this issue Oct 5, 2017 · 10 comments


martozzz commented Oct 5, 2017

I found similar issues, e.g. #8037 and #7778, but the problem does not seem to be solved: the warnings did disappear after building with the necessary optimization options, but they appeared again when I followed this tutorial (https://www.tensorflow.org/performance/xla/tfcompile) through to the last step. So, is the TensorFlow binary compiled to use the SIMD instructions or not?

System information

  • Have I written custom code: No
  • OS Platform and Distribution: Linux Ubuntu 16.04
  • TensorFlow installed from: source
  • TensorFlow version: v1.3.0-rc1-3000-g840dcae
  • Python version: Python3
  • Bazel version: 0.6.0
  • CPU: Intel Core i7-4770, Haswell architecture, supporting AVX2 and FMA
  • GPU: No
  • Compiler: gcc 5.4.0

Reproducing the issue:

  1. Build TensorFlow from source:
  • Configure: only jemalloc and XLA JIT support were enabled. The default optimization flag is -march=native, so it was not specified explicitly;

  • Build pip package:

bazel build --config=opt --copt=-mavx2 --copt=-mfma --config=mkl --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow/tools/pip_package:build_pip_package
  • Install pip package:
sudo -H python3 -m pip install /tmp/tensorflow_pkg/tensorflow-1.3.0-cp35-cp35m-linux_x86_64.whl
  • The installation was validated using the "Hello, TensorFlow!" example, and no warnings were generated.
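(A rough sketch of that validation step, assuming the standard snippet from the install guide was used:)

import tensorflow as tf

hello = tf.constant('Hello, TensorFlow!')  # simple constant op
sess = tf.Session()                        # runtime init here triggers the cpu_feature_guard check
print(sess.run(hello))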
  2. Generate the tfcompile binary:
bazel build --config=opt --copt=-mavx2 --copt=-mfma --config=mkl --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow/compiler/aot:tfcompile
  3. Follow the tutorial here https://www.tensorflow.org/performance/xla/tfcompile, in the directory //tensorflow/compiler/aot/tests:
  • Step 1: The config file already exists as test_graph_tfmatmul.config.pbtxt;

  • Step 2.1: Generate the graph file test_graph_tfmatmul.pb:

python3 ./make_test_graphs.py --out_dir=./
  • Step 2.2: Compile the graph using tfcompile:
~/tensorFlow_src/tensorflow/bazel-bin/tensorflow/compiler/aot/tfcompile --graph="./test_graph_tfmatmul.pb" --config="./test_graph_tfmatmul.config.pbtxt" --entry_point="test_graph_tfmatmul" --cpp_class="foo::bar::MatMulComp" --out_object="test_graph_tfmatmul.o" --out_header="test_graph_tfmatmul.h" --target_features="+avx2"
  • Step 3: Create a file named my_code.cc:
#define EIGEN_USE_THREADS
#define EIGEN_USE_CUSTOM_THREAD_POOL

#include <iostream>
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/compiler/aot/tests/test_graph_tfmatmul.h" // generated

int main(int argc, char** argv) {
    Eigen::ThreadPool tp(2);  // Size the thread pool as appropriate.
    Eigen::ThreadPoolDevice device(&tp, tp.NumThreads());

    foo::bar::MatMulComp matmul;
    matmul.set_thread_pool(&device);

    // Set up args and run the computation.
    const float args[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    std::copy(args + 0, args + 6, matmul.arg0_data());
    std::copy(args + 6, args + 12, matmul.arg1_data());
    matmul.Run();

    // Check result
    if (matmul.result0(0, 0) == 58) {
        std::cout << "Success" << std::endl;
    } else {
        std::cout << "Failed. Expected value 58 at 0,0. Got:"
                    << matmul.result0(0, 0) << std::endl;
    }

    return 0;
}
  • Step 4.1: Create the BUILD file:
# Example of linking your binary
# Also see //third_party/tensorflow/compiler/aot/tests/BUILD
load("//tensorflow/compiler/aot:tfcompile.bzl", "tf_library")

# The same tf_library call from step 2 above.
tf_library(
    name = "test_graph_tfmatmul",
    cpp_class = "foo::bar::MatMulComp",
    graph = "test_graph_tfmatmul.pb",
    config = "test_graph_tfmatmul.config.pbtxt",
)

# The executable code generated by tf_library can then be linked into your code.
cc_binary(
    name = "my_binary",
    srcs = [
        "my_code.cc",  # include test_graph_tfmatmul.h to access the generated header
    ],
    deps = [
        ":test_graph_tfmatmul",  # link in the generated object file
        "//third_party/eigen3",
    ],
    linkopts = [
        "-lpthread",
    ]
)
  • Step 4.2: Create the final binary:
bazel build --config=opt --copt=-mavx2 --copt=-mfma --config=mkl --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow/compiler/aot/tests:my_binary

Finally, the build prints:
INFO: From Executing genrule //tensorflow/compiler/aot/tests:gen_test_graph_tfmatmul: 2017-10-05 15:15:29.233159: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
(An error also occurs, but that is another issue: #13482.)

So, is the TensorFlow binary compiled to use the SIMD instructions (SSE4.1, SSE4.2, AVX, AVX2, FMA) or not? May I have your advice?

@tatatodd
Contributor

@MartinZZZ This is slightly confusing.

Note that the info log Your CPU supports... is actually coming from the invocation of the tfcompile tool, which generates the header and object file, and is a side-effect of how we perform the compilation re-using underlying TensorFlow infrastructure. It's not actually trying to warn you that the generated code is missing support for SIMD.

The reason I know this is because the first part of the info log says:

From Executing genrule //tensorflow/compiler/aot/tests:gen_test_graph_tfmatmul

The genrule above corresponds to this bazel genrule, which is created by the tf_library build macro:

name=("gen_" + name),

I suspect that you saw this info log in #13482 because you were running tfcompile manually; if you instead let bazel run it via the tf_library macro, these info logs won't be visible.

That said, note that by default tfcompile doesn't assume any target-specific features (SIMD, etc). If there are specific features you'd like to enable, you need to set them via the following tfcompile flags:

	--target_cpu=""                  	string	Target cpu, similar to the clang -mcpu flag.  http://clang.llvm.org/docs/CrossCompilation.html#cpu-fpu-abi
	--target_features=""             	string	Target features, e.g. +avx2, +neon, etc.

These may be specified in your tf_library build rule using the tfcompile_flags argument:
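For example, a sketch against the test_graph_tfmatmul target from the tutorial (the exact flag values here are illustrative; adjust them to your CPU):

tf_library(
    name = "test_graph_tfmatmul",
    cpp_class = "foo::bar::MatMulComp",
    graph = "test_graph_tfmatmul.pb",
    config = "test_graph_tfmatmul.config.pbtxt",
    # Pass target features through to tfcompile as a single flag string.
    tfcompile_flags = "--target_cpu=x86-64 --target_features=+avx2 --target_features=+fma",
)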

@martozzz
Author

@tatatodd Thanks for the information. I specified those flags in tf_library, but the log message still appears:

tfcompile_flags is specified as follows (in //tensorflow/tensorflow/compiler/aot/tfcompile.bzl):

# -*- Python -*-

load("//tensorflow:tensorflow.bzl", "if_android", "tf_copts")

def tf_library(name, graph, config,
               freeze_checkpoint=None, freeze_saver=None,
               cpp_class=None, gen_test=True, gen_benchmark=True,
               visibility=None, testonly=None,
               tfcompile_flags=str('--target_cpu="x86-64" --target_features="+sse4.1" --target_features="+sse4.2" --target_features="+avx" --target_features="+avx2" --target_features="+fma"'),
               #tfcompile_flags=None,
               tfcompile_tool="//tensorflow/compiler/aot:tfcompile",
               include_standard_runtime_deps=True, deps=None, tags=None):
 
  if not cpp_class:
    fail("cpp_class must be specified")

  tfcompile_graph = graph
  if freeze_checkpoint or freeze_saver:
    if not freeze_checkpoint:
      fail("freeze_checkpoint must be specified when freeze_saver is specified")

    freeze_name = "freeze_" + name
    freeze_file = freeze_name + ".pb"

    # First run tfcompile to generate the list of out_nodes.
    out_nodes_file = "out_nodes_" + freeze_name
    native.genrule(
        name=("gen_" + out_nodes_file),
        srcs=[config],
        outs=[out_nodes_file],
        cmd=("$(location " + tfcompile_tool + ")" +
             " --config=$(location " + config + ")" +
             " --dump_fetch_nodes > $@"),
        tools=[tfcompile_tool],
        # Run tfcompile on the build host, rather than forge, since it's
        # typically way faster on the local machine.
        local=1,
        tags=tags,
    )

    # Now run freeze_graph to convert variables into constants.
    freeze_args = (" --input_graph=$(location " + graph + ")" +
                   " --input_binary=" + str(not graph.endswith(".pbtxt")) +
                   " --input_checkpoint=$(location " + freeze_checkpoint + ")" +
                   " --output_graph=$(location " + freeze_file + ")" +
                   " --output_node_names=$$(<$(location " + out_nodes_file +
                   "))")
    freeze_saver_srcs = []
    if freeze_saver:
      freeze_args += " --input_saver=$(location " + freeze_saver + ")"
      freeze_saver_srcs += [freeze_saver]
    native.genrule(
        name=freeze_name,
        srcs=[
            graph,
            freeze_checkpoint,
            out_nodes_file,
        ] + freeze_saver_srcs,
        outs=[freeze_file],
        cmd=("$(location //tensorflow/python/tools:freeze_graph)" +
             freeze_args),
        tools=["//tensorflow/python/tools:freeze_graph"],
        tags=tags,
    )
    tfcompile_graph = freeze_file

  # Rule that runs tfcompile to produce the header and object file.
  header_file = name + ".h"
  object_file = name + ".o"
  ep = ("__" + PACKAGE_NAME + "__" + name).replace("/", "_")
  native.genrule(
      name=("gen_" + name),
      srcs=[
          tfcompile_graph,
          config,
      ],
      outs=[
          header_file,
          object_file,
      ],
      cmd=("$(location " + tfcompile_tool + ")" +
           " --graph=$(location " + tfcompile_graph + ")" +
           " --config=$(location " + config + ")" +
           " --entry_point=" + ep +
           " --cpp_class=" + cpp_class +
           " --target_triple=" + target_llvm_triple() +
           " --out_header=$(@D)/" + header_file +
           " --out_object=$(@D)/" + object_file +
           " " + (tfcompile_flags or "")),
      tools=[tfcompile_tool],
      visibility=visibility,
      testonly=testonly,
      # Run tfcompile on the build host since it's typically faster on the local
      # machine.
      #
      # Note that setting the local=1 attribute on a *test target* causes the
      # test infrastructure to skip that test.  However this is a genrule, not a
      # test target, and runs with --genrule_strategy=forced_forge, meaning the
      # local=1 attribute is ignored, and the genrule is still run.
      #
      # https://www.bazel.io/versions/master/docs/be/general.html#genrule
      local=1,
      tags=tags,
  )

  # The cc_library rule packaging up the header and object file, and needed
  # kernel implementations.
  need_xla_data_proto = (tfcompile_flags and
                         tfcompile_flags.find("--gen_program_shape") != -1)
  native.cc_library(
      name=name,
      srcs=[object_file],
      hdrs=[header_file],
      visibility=visibility,
      testonly=testonly,
      deps = [
          # These deps are required by all tf_library targets even if
          # include_standard_runtime_deps is False.  Without them, the
          # generated code will fail to compile.
          "//tensorflow/compiler/tf2xla:xla_compiled_cpu_function",
          "//tensorflow/core:framework_lite",
      ] + (need_xla_data_proto and [
          # If we're generating the program shape, we must depend on the proto.
          "//tensorflow/compiler/xla:xla_data_proto",
      ] or []) + (include_standard_runtime_deps and [
          # TODO(cwhipkey): only depend on kernel code that the model actually needed.
          "//tensorflow/compiler/tf2xla/kernels:gather_op_kernel_float_int32",
          "//tensorflow/compiler/tf2xla/kernels:gather_op_kernel_float_int64",
          "//tensorflow/compiler/tf2xla/kernels:index_ops_kernel_argmax_float_1d",
          "//tensorflow/compiler/tf2xla/kernels:index_ops_kernel_argmax_float_2d",
          "//tensorflow/compiler/xla/service/cpu:cpu_runtime_avx",
          "//tensorflow/compiler/xla/service/cpu:cpu_runtime_neon",
          "//tensorflow/compiler/xla/service/cpu:cpu_runtime_sse4_1",
          "//tensorflow/compiler/xla/service/cpu:runtime_conv2d",
          "//tensorflow/compiler/xla/service/cpu:runtime_matmul",
          "//tensorflow/compiler/xla/service/cpu:runtime_single_threaded_conv2d",
          "//tensorflow/compiler/xla/service/cpu:runtime_single_threaded_matmul",
          "//third_party/eigen3",
      ] or []) + (deps or []),
      tags=tags,
  )

  # Variables used for gen_test and gen_benchmark.
  no_ns_name = ""
  cpp_class_split = cpp_class.rsplit("::", maxsplit=2)
  if len(cpp_class_split) == 1:
    no_ns_name = cpp_class_split[0]
  else:
    no_ns_name = cpp_class_split[1]
  sed_replace = (
      "-e \"s|{{TFCOMPILE_HEADER}}|$(location " + header_file + ")|g\" " +
      "-e \"s|{{TFCOMPILE_CPP_CLASS}}|" + cpp_class + "|g\" " +
      "-e \"s|{{TFCOMPILE_NAME}}|" + no_ns_name + "|g\" ")

  if gen_test:
    test_name = name + "_test"
    test_file = test_name + ".cc"
    # Rule to rewrite test.cc to produce the test_file.
    native.genrule(
        name=("gen_" + test_name),
        testonly=1,
        srcs=[
            "//tensorflow/compiler/aot:test.cc",
            header_file,
        ],
        outs=[test_file],
        cmd=("sed " + sed_replace +
             " $(location //tensorflow/compiler/aot:test.cc) " +
             "> $(OUTS)"),
        tags=tags,
    )

    # The cc_test rule for the generated code.
    native.cc_test(
        name=test_name,
        srcs=[test_file],
        deps=[
            ":" + name,
            "//tensorflow/compiler/tf2xla:xla_local_runtime_context",
            "//tensorflow/compiler/aot:runtime",
            "//tensorflow/compiler/aot:tf_library_test_main",
            "//tensorflow/compiler/xla:executable_run_options",
            "//third_party/eigen3",
            "//tensorflow/core:lib",
            "//tensorflow/core:test",
            ],
        tags=tags,
    )

  if gen_benchmark:
    benchmark_name = name + "_benchmark"
    benchmark_file = benchmark_name + ".cc"
    benchmark_main = ("//tensorflow/compiler/aot:" +
                      "benchmark_main.template")

    # Rule to rewrite benchmark.cc to produce the benchmark_file.
    native.genrule(
        name=("gen_" + benchmark_name),
        srcs=[
            benchmark_main,
            header_file,
        ],
        testonly = testonly,
        outs=[benchmark_file],
        cmd=("sed " + sed_replace +
             " $(location " + benchmark_main + ") " +
             "> $(OUTS)"),
        tags=tags,
    )

    # The cc_benchmark rule for the generated code.
    #
    # Note: to get smaller size on android for comparison, compile with:
    #    --copt=-fvisibility=hidden
    #    --copt=-D_LIBCPP_TYPE_VIS=_LIBCPP_HIDDEN
    #    --copt=-D_LIBCPP_EXCEPTION_ABI=_LIBCPP_HIDDEN
    native.cc_binary(
        name=benchmark_name,
        srcs=[benchmark_file],
        testonly = testonly,
        copts = tf_copts(),
        linkopts = if_android(["-pie", "-s"]),
        deps=[
            ":" + name,
            "//tensorflow/compiler/tf2xla:xla_local_runtime_context",
            "//tensorflow/compiler/aot:benchmark",
            "//tensorflow/compiler/aot:runtime",
            "//tensorflow/compiler/xla:executable_run_options",
            "//third_party/eigen3",
        ] + if_android([
            "//tensorflow/compiler/aot:benchmark_extra_android",
        ]),
        tags=tags,
    )


def target_llvm_triple():
  """Returns the target LLVM triple to be used for compiling the target."""
  # TODO(toddw): Add target_triple for other targets.  For details see:
  # http://llvm.org/docs/doxygen/html/Triple_8h_source.html
  return select({
      "//tensorflow:android_armeabi": "armv5-none-android",
      "//tensorflow:android_arm": "armv7-none-android",
      "//tensorflow:android_arm64": "aarch64-none-android",
      "//tensorflow:android_x86": "i686-none-android",
      "//tensorflow:linux_ppc64le": "ppc64le-ibm-linux-gnu",
      "//tensorflow:darwin": "x86_64-none-darwin",
      "//conditions:default": "x86_64-pc-linux",
  })

Then, instead of running tfcompile manually, I built the cc_library via the tf_library macro (i.e., skipping Step 2.2), and the final binary can now be created successfully with:

bazel build //tensorflow/compiler/aot/tests:my_binary

but the info log still appears:

INFO: Analysed target //tensorflow/compiler/aot/tests:my_binary (2 packages loaded).
INFO: Found 1 target...
INFO: From Executing genrule //tensorflow/compiler/aot/tests:gen_test_graph_tfmatmul:
2017-10-12 12:56:13.846822: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Target //tensorflow/compiler/aot/tests:my_binary up-to-date:
  bazel-bin/tensorflow/compiler/aot/tests/my_binary
INFO: Elapsed time: 0.331s, Critical Path: 0.05s
INFO: Build completed successfully, 2 total actions

May I have your advice?

@tatatodd
Contributor

@MartinZZZ I see. I had misinterpreted your previous message; I thought you meant that you no longer saw the log info when using tf_library.

As I mentioned, the log message is harmless, and is a side-effect of the way we perform the compilation. We're changing the implementation (for a different reason), at which point the info log will go away. In the meantime you should just ignore it.

@martozzz
Author

@tatatodd Thanks for the explanation. I think I understand now; may I check with you on the following two questions:

(1) Does this mean that the created binary is actually able to use the SIMD instructions specified by tfcompile_flags, and the log message is just a false warning?

(2) Whether or not tfcompile_flags is successfully specified, will it affect the performance of the matmul operation in this example? I thought it would not, since TensorFlow uses Eigen's implementation for matmul, which already appears to use SIMD intrinsics to exploit the SIMD hardware.

@tatatodd
Contributor

@MartinZZZ Answers to your questions:

  1. Yes, if you set tfcompile_flags as in your example above, the code generated by XLA will use SIMD instructions. The log message is a false warning, regardless of your tfcompile_flags settings.

  2. You're correct. tfcompile uses XLA for compilation, and XLA currently calls out to Eigen for matmul. This may change in the future; we do have a pure-XLA implementation of matmul, but it's quite slow.

@martozzz
Author

@tatatodd Thanks for your answers:)

And please correct me if I am mistaken: I wonder whether XLA-AOT mainly targets binary size on mobile devices, while XLA-JIT mainly targets performance. If so, may I ask your advice on the following short questions:

(1) Will XLA-JIT generate code after fusing operations, without calling out to Eigen?

(2) If so, is the code generation done by XLA itself, or does it rely entirely on LLVM? And is that code released as well? Thanks.

@lijiansong

Same here; I have the same question.

@carlthome
Contributor

@lijiansong, could you clarify your question?

@thefiddler

@carlthome MartinZZZ and lijiansong are asking which one has better runtime performance: XLA-JIT or XLA-AOT?

The documentation seems to imply that XLA-AOT is meant for space-constrained situations (e.g. mobile) but does not mention anything regarding runtime performance of XLA-AOT vs XLA-JIT. Any clarification on that point would be welcome.

@carlthome
Contributor

carlthome commented Jan 9, 2018

AOT and JIT provide the same performance benefits (e.g. op fusion, constant folding, common subexpression elimination, and other HLO-level optimizations).

The downside to AOT is that you have to specify static tensor shapes and know what hardware you're targeting, while JIT would do that for you.

The downside to JIT is that compilation happens at runtime (which takes extra time) and that you have to bundle the compiler with your program.
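(For context, a minimal sketch of turning on the JIT path per-session in TF 1.x, assuming an XLA-enabled build; the shapes and values mirror the matmul example above:)

import tensorflow as tf

config = tf.ConfigProto()
# Ask TensorFlow to JIT-compile eligible subgraphs with XLA.
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

x = tf.placeholder(tf.float32, shape=[2, 3])
y = tf.placeholder(tf.float32, shape=[3, 2])
z = tf.matmul(x, y)  # candidate for XLA compilation

with tf.Session(config=config) as sess:
    # Expected result: [[58, 64], [139, 154]], matching the "58" check in my_code.cc.
    print(sess.run(z, feed_dict={x: [[1, 2, 3], [4, 5, 6]],
                                 y: [[7, 8], [9, 10], [11, 12]]}))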
