
Bad performance of stock model on Windows compared to Linux #62387

Open
ghost opened this issue Jul 29, 2021 · 24 comments
Assignees
Labels
module: cpu CPU specific problem (e.g., perf, algorithm) module: performance Issues related to performance, either of kernel code or framework glue module: windows Windows support for PyTorch triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@ghost

ghost commented Jul 29, 2021

🐛 Bug

We have two identical two-socket servers with Intel Xeon E5-2650v4 CPUs and 256 GB RAM, one running Ubuntu and the other running Windows Server 2012. There is a severe degradation of model performance on Windows compared to Linux (roughly 2-4x slower).

To Reproduce

Steps to reproduce the behavior:

  1. Install torch and torchvision using pip.
  2. Run the code below (taken verbatim from the profiler recipe page https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html)
import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity
import time


model = models.resnet18()
device = torch.device("cpu")
model.to(device)
inputs = torch.randn(5, 3, 224, 224).to(device)

start = time.time()
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        for _ in range(100):
            model(inputs)
end = time.time()

print(prof.key_averages().table(sort_by="cpu_time_total"))
print("Execution time:", end - start)

Output on the Linux server

---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  model_inference         6.09%     285.810ms       100.00%        4.694s        4.694s             1
                     aten::conv2d         0.38%      18.065ms        59.37%        2.787s       1.393ms          2000
                aten::convolution         0.37%      17.277ms        58.98%        2.768s       1.384ms          2000
               aten::_convolution         0.57%      26.876ms        58.61%        2.751s       1.376ms          2000
         aten::mkldnn_convolution        57.60%        2.704s        58.04%        2.724s       1.362ms          2000
                 aten::batch_norm         0.34%      15.975ms        26.86%        1.261s     630.401us          2000
     aten::_batch_norm_impl_index         0.58%      27.008ms        26.52%        1.245s     622.413us          2000
          aten::native_batch_norm        14.05%     659.276ms        25.81%        1.212s     605.830us          2000
                     aten::select         4.30%     201.836ms         5.82%     273.392ms      13.019us         21000
                       aten::mean         0.59%      27.624ms         5.52%     259.013ms     123.340us          2100
                 aten::max_pool2d         0.03%       1.224ms         4.37%     205.196ms       2.052ms           100
    aten::max_pool2d_with_indices         4.35%     203.972ms         4.35%     203.972ms       2.040ms           100
                        aten::sum         2.90%     136.256ms         3.20%     150.110ms      71.481us          2100
                 aten::as_strided         1.95%      91.718ms         1.95%      91.718ms       2.940us         31200
                       aten::div_         0.76%      35.760ms         1.59%      74.447ms      35.451us          2100
                      aten::relu_         0.40%      18.946ms         1.58%      74.004ms      43.532us          1700
                 aten::clamp_min_         0.23%      10.926ms         1.17%      55.058ms      32.387us          1700
                      aten::empty         0.96%      44.975ms         0.96%      44.975ms       4.452us         10102
                  aten::clamp_min         0.94%      44.132ms         0.94%      44.132ms      25.960us          1700
                         aten::to         0.43%      20.318ms         0.82%      38.687ms      18.422us          2100
                       aten::add_         0.71%      33.418ms         0.71%      33.418ms      41.773us           800
                        aten::add         0.40%      18.805ms         0.40%      18.805ms       9.402us          2000
                     aten::linear         0.03%       1.240ms         0.34%      16.008ms     160.080us           100
                      aten::addmm         0.22%      10.476ms         0.27%      12.472ms     124.720us           100
                      aten::copy_         0.25%      11.573ms         0.25%      11.573ms       5.260us          2200
        aten::adaptive_avg_pool2d         0.01%     648.000us         0.23%      11.015ms     110.150us           100
                      aten::fill_         0.17%       7.921ms         0.17%       7.921ms       3.772us          2100
              aten::empty_strided         0.17%       7.894ms         0.17%       7.894ms       3.759us          2100
                aten::as_strided_         0.11%       5.316ms         0.11%       5.316ms       2.658us          2000
                          aten::t         0.03%       1.322ms         0.05%       2.296ms      22.960us           100
                    aten::flatten         0.02%     898.000us         0.04%       1.985ms      19.850us           100
                       aten::view         0.02%       1.087ms         0.02%       1.087ms      10.870us           100
                  aten::transpose         0.01%     687.000us         0.02%     974.000us       9.740us           100
                     aten::expand         0.01%     674.000us         0.02%     898.000us       8.980us           100
                      aten::zeros         0.00%      42.000us         0.00%      80.000us      80.000us             1
                      aten::zero_         0.00%      25.000us         0.00%      25.000us      25.000us             1
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 4.694s

Execution time: 10.080621480941772

Output on the Windows server

---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  model_inference        15.80%        2.797s       100.00%       17.705s       17.705s             1
                     aten::conv2d         0.19%      33.769ms        61.01%       10.801s       5.401ms          2000
                aten::convolution         0.17%      30.678ms        60.81%       10.767s       5.384ms          2000
               aten::_convolution         0.27%      47.605ms        60.64%       10.737s       5.368ms          2000
         aten::mkldnn_convolution        60.17%       10.654s        60.37%       10.689s       5.345ms          2000
                 aten::batch_norm         0.16%      28.550ms        17.74%        3.142s       1.571ms          2000
     aten::_batch_norm_impl_index         0.27%      47.746ms        17.58%        3.113s       1.557ms          2000
          aten::native_batch_norm        12.59%        2.229s        17.26%        3.057s       1.528ms          2000
                 aten::max_pool2d         0.01%       2.447ms         3.43%     607.411ms       6.074ms           100
    aten::max_pool2d_with_indices         3.42%     604.964ms         3.42%     604.964ms       6.050ms           100
                       aten::mean         0.49%      87.554ms         2.93%     518.661ms     246.981us          2100
                        aten::sum         1.41%     250.225ms         1.54%     272.660ms     129.838us          2100
                     aten::select         1.13%     199.602ms         1.53%     271.359ms      12.922us         21000
                      aten::relu_         0.16%      27.804ms         1.17%     207.175ms     121.868us          1700
                 aten::clamp_min_         0.09%      15.782ms         1.01%     179.371ms     105.512us          1700
                  aten::clamp_min         0.92%     163.589ms         0.92%     163.589ms      96.229us          1700
                       aten::div_         0.36%      63.607ms         0.84%     148.273ms      70.606us          2100
                 aten::as_strided         0.59%     104.998ms         0.59%     104.998ms       3.365us         31200
                         aten::to         0.32%      57.305ms         0.48%      84.666ms      40.317us          2100
                       aten::add_         0.47%      82.884ms         0.47%      82.884ms     103.605us           800
                      aten::empty         0.43%      76.867ms         0.43%      76.867ms       7.609us         10102
                        aten::add         0.15%      26.244ms         0.15%      26.244ms      13.122us          2000
                     aten::linear         0.01%       1.721ms         0.12%      21.130ms     211.300us           100
        aten::adaptive_avg_pool2d         0.01%     992.000us         0.10%      18.139ms     181.390us           100
                      aten::addmm         0.07%      13.263ms         0.09%      16.251ms     162.510us           100
                      aten::copy_         0.09%      15.675ms         0.09%      15.675ms       7.125us          2200
                      aten::fill_         0.08%      13.534ms         0.08%      13.534ms       6.445us          2100
              aten::empty_strided         0.07%      13.128ms         0.07%      13.128ms       6.251us          2100
                aten::as_strided_         0.05%       8.930ms         0.05%       8.930ms       4.465us          2000
                          aten::t         0.01%       1.605ms         0.02%       3.158ms      31.580us           100
                    aten::flatten         0.01%       1.168ms         0.01%       2.519ms      25.190us           100
                  aten::transpose         0.01%       1.085ms         0.01%       1.553ms      15.530us           100
                     aten::expand         0.01%       1.185ms         0.01%       1.546ms      15.460us           100
                       aten::view         0.01%       1.351ms         0.01%       1.351ms      13.510us           100
                      aten::zeros         0.00%      97.000us         0.00%     112.000us     112.000us             1
                      aten::zero_         0.00%       9.000us         0.00%       9.000us       9.000us             1
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 17.705s

Execution time:  24.105701684951782

Expected behavior

Execution times should roughly match, within some tolerance.

Environment

Output of collect_env.py on the Linux server

Collecting environment information...
PyTorch version: 1.9.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.14.7
Libc version: glibc-2.27

Python version: 3.8.0 (default, Oct 28 2019, 16:14:01)  [GCC 8.3.0] (64-bit runtime)
Python platform: Linux-4.15.0-112-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti

Nvidia driver version: 450.51.06
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.3
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.3
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.3
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.3
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.3
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.3
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.3
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.1
[pip3] torch==1.9.0
[pip3] torchvision==0.10.0
[conda] Could not collect

Output of collect_env.py on the Windows server

Collecting environment information...
PyTorch version: 1.9.0+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A

OS: Microsoft Windows Server 2012 R2 Standard
GCC version: (x86_64-posix-sjlj, built by strawberryperl.com project) 4.9.2
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-2012ServerR2-6.3.9600-SP0
Is CUDA available: False
CUDA runtime version: 10.1.105
GPU models and configuration: GPU 0: Tesla P100-PCIE-16GB
Nvidia driver version: 418.96
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.1
[pip3] torch==1.9.0
[pip3] torchvision==0.10.0
[conda] Could not collect

Additional context

Both servers are equipped with two Intel Xeon E5-2650v4 (which gives 48 threads total) with 256 GB RAM and are idle.

cc @VitalyFedyunin @ngimel @heitorschueroff @fmassa @vfdev-5 @pmeier @peterjc123 @mszhanyi @skyline75489 @nbcsm

@skyline75489
Contributor

It seems that on your Windows server you don't have CUDA installed correctly?

Is CUDA available: False

@skyline75489 skyline75489 added the module: windows Windows support for PyTorch label Jul 29, 2021
@ghost
Author

ghost commented Jul 29, 2021

It seems that on your Windows server you don't have CUDA installed correctly?

Correct. The NVIDIA driver on Windows seems to be missing or too old. I'm not sure that could influence CPU-only performance testing, though.

@skyline75489
Contributor

Can you try:

model = models.resnet18()
device = torch.device("cpu")
model.to(device)

This should force the model to use CPU only.
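To confirm nothing silently lands on a GPU before timing, a small sanity check along these lines can help (the `all_on_cpu` helper is hypothetical, and a tiny `nn.Sequential` stands in for `resnet18`):

```python
# Sketch: verify that every parameter and buffer actually lives on the CPU
# before benchmarking, so a stray CUDA placement cannot skew the comparison.
import torch
import torch.nn as nn

def all_on_cpu(module: nn.Module) -> bool:
    """Return True if all parameters and buffers of `module` are CPU tensors."""
    tensors = list(module.parameters()) + list(module.buffers())
    return all(t.device.type == "cpu" for t in tensors)

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))  # stand-in for resnet18()
model.to(torch.device("cpu"))
print(all_on_cpu(model))
```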

@ghost
Author

ghost commented Jul 29, 2021

I have rerun the updated test snippet on both machines and obtained the very same results.

@heitorschueroff heitorschueroff added module: performance Issues related to performance, either of kernel code or framework glue module: vision triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module labels Jul 29, 2021
@malfet
Contributor

malfet commented Jul 30, 2021

I suspect the performance difference stems from the fact that many AVX2-accelerated code paths are not enabled for the Visual C++ compiler; for example, see

#if defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
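To see which ISA paths a local build was compiled with, `torch.__config__.show()` prints the build configuration; on recent builds (roughly 2.0+, an assumption here) `torch.backends.cpu.get_cpu_capability()` also reports the vector dispatch level chosen at runtime:

```python
# Sketch: inspect which vector-instruction paths the local PyTorch build uses.
import torch

cfg = torch.__config__.show()
print(cfg)  # compile-time configuration, including compiler and ISA flags

# Guarded, since this accessor only exists on newer PyTorch builds:
if hasattr(torch.backends, "cpu") and hasattr(torch.backends.cpu, "get_cpu_capability"):
    print("runtime CPU capability:", torch.backends.cpu.get_cpu_capability())
```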

@malfet malfet added the module: cpu CPU specific problem (e.g., perf, algorithm) label Jul 30, 2021
@seemethere
Member

It seems that on your Windows server you don't have CUDA installed correctly?

Correct. The NVIDIA driver seems not to be installed on Windows or too old. Not sure if that could influence CPU-only performance testing though.

For Windows, you're also testing with 1.9.0+cpu, which isn't compiled with CUDA support.

@ghost
Author

ghost commented Jul 30, 2021

I suspect the performance difference stems from the fact that many AVX2-accelerated code paths are not enabled for the Visual C++ compiler

Is there a reason for that?

@malfet
Contributor

malfet commented Jul 30, 2021

Is that for a reason?

I believe at some point the intrinsics used in that codebase were not supported by VC++; testing whether this still holds true in #62491

@xsacha
Contributor

xsacha commented Jan 31, 2022

So this is still an issue.
For the OP: have you been able to convert your model to use MKLDNN instead? The code paths in that backend should work on Windows as well.
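A minimal sketch of that suggestion, using the `torch.utils.mkldnn` helper (a tiny `Conv2d` stands in for `resnet18().eval()`, and this assumes a build with MKL-DNN/oneDNN enabled, which stock pip wheels are):

```python
# Sketch: convert an eval-mode float32 module to the MKL-DNN layout.
# Inputs must be converted to the opaque mkldnn layout too, and outputs
# converted back to a regular (dense) tensor.
import torch
import torch.nn as nn
from torch.utils import mkldnn as mkldnn_utils

model = nn.Conv2d(3, 8, kernel_size=3).eval()  # stand-in for resnet18().eval()
x = torch.randn(1, 3, 32, 32)

with torch.no_grad():
    mkldnn_model = mkldnn_utils.to_mkldnn(model)
    y = mkldnn_model(x.to_mkldnn()).to_dense()  # back to a regular tensor

print(y.shape)
```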

@xuhancn
Collaborator

xuhancn commented Mar 21, 2023

I have tried to reproduce this issue and found that it only occurs on NUMA machines; it looks like OpenMP does not handle NUMA machines well on Windows.

My hardware:
Processor_0: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, 2394 Mhz, 24 Core(s), 48 Logical Processor(s)
Processor_1: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, 2394 Mhz, 24 Core(s), 48 Logical Processor(s)
Installed Physical Memory (RAM) 192 GB

Windows: Microsoft Windows Server 2022 Datacenter
Linux: Linux mlt-clx131 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

(Windows CPU-usage snapshot omitted.) The Windows snapshot shows it only uses the CPUs of one processor group (socket).

(Linux CPU-usage snapshot omitted.) The Linux snapshot shows it uses all CPUs across NUMA nodes.


On my Core i7-9750H notebook (32 GB RAM), which has only one socket,
this test script doesn't show such a huge performance gap.
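One quick way to see whether PyTorch is confined to a single processor group is to compare the logical CPU count the OS reports with the intra-op thread count PyTorch actually uses:

```python
# Sketch: on a healthy setup torch's intra-op thread pool typically covers
# the physical cores; on the NUMA Windows box above, only one processor
# group (one socket) was effectively being used.
import os
import torch

logical = os.cpu_count()              # logical CPUs the OS reports
intra = torch.get_num_threads()       # threads torch uses for intra-op work
print(f"logical CPUs: {logical}, torch intra-op threads: {intra}")

# torch.set_num_threads(n) can be used to force a specific thread count.
```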

@xuhancn
Collaborator

xuhancn commented Apr 10, 2023

To double-check the impact of NUMA on the performance gap, I disabled one socket of the server.
After disabling one socket, the hardware configuration is:
Processor: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, 2394 Mhz, 24 Core(s), 48 Logical Processor(s)
Installed Physical Memory (RAM) 96.0 GB

OS:
Windows Server 2022 Datacenter, 10.0.20348 Build 20348
Linux version 3.10.0-862.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) ) #1 SMP Fri Apr 20 16:44:24 UTC 2018

Linux:


                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  

              model_inference         8.25%     188.253ms       100.00%        2.282s        2.282s             1  
                 aten::conv2d         1.26%      28.718ms        61.08%        1.394s     696.976us          2000  
            aten::convolution         1.05%      23.990ms        60.70%        1.385s     692.659us          2000  
           aten::_convolution         0.66%      15.047ms        59.65%        1.361s     680.663us          2000  
     aten::mkldnn_convolution        58.22%        1.329s        58.99%        1.346s     673.140us          2000  
             aten::batch_norm         0.25%       5.663ms        17.02%     388.481ms     194.240us          2000  
 aten::_batch_norm_impl_index         0.61%      13.911ms        16.77%     382.819ms     191.410us          2000  
      aten::native_batch_norm        15.40%     351.399ms        16.19%     369.523ms     184.762us          2000  
             aten::max_pool2d         0.02%     482.000us         7.00%     159.723ms       1.597ms           100  
aten::max_pool2d_with_indices         6.98%     159.241ms         6.98%     159.241ms       1.592ms           100  
                  aten::relu_         0.78%      17.774ms         2.57%      58.722ms      34.542us          1700  
                   aten::add_         2.08%      47.463ms         2.08%      47.463ms      16.951us          2800  
             aten::clamp_min_         1.79%      40.948ms         1.79%      40.948ms      24.087us          1700  
                  aten::empty         1.17%      26.709ms         1.17%      26.709ms       1.335us         20000  
                 aten::linear         0.04%     829.000us         0.48%      11.025ms     110.250us           100  
    aten::adaptive_avg_pool2d         0.03%     717.000us         0.43%       9.759ms      97.590us           100  
                   aten::mean         0.08%       1.837ms         0.41%       9.319ms      93.190us           100  
                  aten::addmm         0.35%       7.924ms         0.39%       8.970ms      89.700us           100  
             aten::empty_like         0.22%       4.931ms         0.39%       8.963ms       4.481us          2000  
            aten::as_strided_         0.27%       6.197ms         0.27%       6.197ms       3.099us          2000  
                    aten::sum         0.15%       3.453ms         0.17%       3.773ms      37.730us           100  
                   aten::div_         0.07%       1.553ms         0.16%       3.709ms      36.363us           102  
                     aten::to         0.02%     385.000us         0.09%       2.100ms      21.000us           100  
               aten::_to_copy         0.04%     902.000us         0.08%       1.750ms      17.500us           100  
                aten::flatten         0.04%     821.000us         0.07%       1.590ms      15.900us           100  
                      aten::t         0.03%     720.000us         0.06%       1.398ms      13.980us           100  
                  aten::copy_         0.05%       1.141ms         0.05%       1.141ms       5.705us           200  
         aten::_reshape_alias         0.03%     769.000us         0.03%     769.000us       7.690us           100  
              aten::transpose         0.02%     443.000us         0.03%     654.000us       6.540us           100  
                 aten::expand         0.02%     408.000us         0.02%     421.000us       4.210us           100  
          aten::empty_strided         0.02%     346.000us         0.02%     346.000us       3.460us           100  
                  aten::fill_         0.01%     320.000us         0.01%     320.000us       3.200us           100  
             aten::as_strided         0.01%     253.000us         0.01%     253.000us       1.265us           200  
           aten::resolve_conj         0.00%       2.000us         0.00%       2.000us       0.010us           200  

Self CPU time total: 2.282s

Execution time: 4.746490001678467

Windows:


                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  

              model_inference        10.70%        1.189s       100.00%       11.114s       11.114s             1  
                 aten::conv2d         0.34%      37.947ms        68.18%        7.577s       3.789ms          2000  
            aten::convolution         0.24%      26.528ms        68.10%        7.569s       3.784ms          2000  
           aten::_convolution         0.15%      16.587ms        67.86%        7.542s       3.771ms          2000  
     aten::mkldnn_convolution        67.51%        7.503s        67.71%        7.526s       3.763ms          2000  
             aten::batch_norm         0.05%       5.361ms        13.62%        1.514s     757.132us          2000  
 aten::_batch_norm_impl_index         0.14%      15.450ms        13.57%        1.508s     754.168us          2000  
      aten::native_batch_norm        13.24%        1.471s        13.43%        1.493s     746.429us          2000  
             aten::max_pool2d         0.00%     532.000us         5.33%     592.270ms       5.923ms           100  
aten::max_pool2d_with_indices         5.32%     591.738ms         5.32%     591.738ms       5.917ms           100  
                  aten::relu_         0.17%      18.509ms         1.19%     131.798ms      77.528us          1700  
             aten::clamp_min_         1.02%     113.289ms         1.02%     113.289ms      66.641us          1700  
                   aten::add_         0.50%      55.526ms         0.50%      55.526ms      19.831us          2800  
                  aten::empty         0.33%      36.966ms         0.33%      36.966ms       1.848us         20000  
             aten::empty_like         0.04%       4.495ms         0.14%      15.724ms       7.862us          2000  
                 aten::linear         0.01%     891.000us         0.11%      11.715ms     117.150us           100  
                  aten::addmm         0.08%       8.419ms         0.09%       9.660ms      96.600us           100  
    aten::adaptive_avg_pool2d         0.00%     393.000us         0.08%       9.258ms      92.580us           100  
                   aten::mean         0.01%       1.662ms         0.08%       8.865ms      88.650us           100  
            aten::as_strided_         0.05%       5.024ms         0.05%       5.024ms       2.512us          2000  
                    aten::sum         0.03%       3.537ms         0.04%       3.919ms      39.190us           100  
                   aten::div_         0.01%       1.432ms         0.03%       3.284ms      30.981us           106  
                     aten::to         0.00%     360.000us         0.02%       1.768ms      17.680us           100  
                aten::flatten         0.01%     800.000us         0.02%       1.726ms      17.260us           100  
                      aten::t         0.01%     936.000us         0.01%       1.522ms      15.220us           100  
               aten::_to_copy         0.01%     801.000us         0.01%       1.482ms      14.820us           100  
                  aten::copy_         0.01%       1.318ms         0.01%       1.318ms       6.590us           200  
         aten::_reshape_alias         0.01%     955.000us         0.01%     955.000us       9.550us           100  
              aten::transpose         0.00%     374.000us         0.01%     576.000us       5.760us           100  
                  aten::fill_         0.00%     382.000us         0.00%     382.000us       3.820us           100  
                 aten::expand         0.00%     271.000us         0.00%     346.000us       3.460us           100  
             aten::as_strided         0.00%     312.000us         0.00%     312.000us       1.560us           200  
          aten::empty_strided         0.00%     239.000us         0.00%     239.000us       2.390us           100  
           aten::resolve_conj         0.00%       4.000us         0.00%       4.000us       0.020us           200  

Self CPU time total: 11.114s

Execution time: 15.6474769115448

Correcting my last post: NUMA is not the major reason. aten::mkldnn_convolution is still much slower on Windows than on Linux.

@xuhancn
Collaborator

xuhancn commented Apr 10, 2023

Focusing on aten::mkldnn_convolution, I collected the oneDNN verbose log following these steps: https://github.com/oneapi-src/oneDNN/blob/master/doc/performance_considerations/verbose.md#enable-onednn_verbose
and generated the breakdown logs following the verbose_converter steps: https://github.com/oneapi-src/oneDNN/tree/master/scripts/verbose_converter#breakdown-generator

(Breakdown charts for Linux and Windows omitted.)

The breakdown logs show that every prim_kind and shape is slower on Windows than on Linux.
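Enabling the verbose log can also be done from Python by setting the environment variable before oneDNN is first used (variable name per the oneDNN docs linked above; the tiny conv here is just to trigger a primitive):

```python
# Sketch: turn on oneDNN verbose output, then run one convolution.
# Each primitive execution is logged to stdout with its timing, which is
# the raw data the breakdown generator consumes.
import os
os.environ["DNNL_VERBOSE"] = "1"  # must be set before oneDNN is first used

import torch  # imported after the env var on purpose

x = torch.randn(1, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
y = torch.nn.functional.conv2d(x, w)  # emits "dnnl_verbose,exec,..." lines
print(y.shape)
```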

@xuhancn
Collaborator

xuhancn commented Apr 17, 2023

Continued.

From the breakdown logs, we can see that PyTorch does everything more slowly on Windows than on Linux. My guess was that Windows's memory performance may be worse than Linux's.

To prove my guess, I wrote a simple benchmark to measure it: bench_malloc.

(Benchmark chart omitted.) The benchmark results show that Windows malloc and memory-access performance is more than ten times slower than Linux's.

bench_malloc is open source and can be built via CMake on both Windows and Linux, so you can clone it and measure on your own machine.
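bench_malloc itself is C/C++, but the pattern it times can be sketched in pure Python (a rough analogue only, under the assumption that the cost being measured is allocate-then-touch):

```python
# Rough analogue of a malloc benchmark: repeatedly allocate a buffer and
# touch every page so the OS must actually back the allocation, then time
# the loop. This only illustrates the allocate-then-access pattern.
import time

def bench_alloc(size_bytes: int = 1 << 20, iters: int = 50) -> float:
    """Time `iters` allocate-and-touch cycles of a `size_bytes` buffer."""
    start = time.perf_counter()
    for _ in range(iters):
        buf = bytearray(size_bytes)          # allocation
        for i in range(0, size_bytes, 4096): # touch each 4 KiB page
            buf[i] = 1
    return time.perf_counter() - start

elapsed = bench_alloc()
print(f"50 x 1 MiB allocate+touch: {elapsed:.4f}s")
```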


Additional bench_malloc data from a single-socket Xeon server:
Processor: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, 2394 Mhz, 24 Core(s), 48 Logical Processor(s)
Installed Physical Memory (RAM): 96.0 GB
(Charts and original data omitted.)

@xuhancn
Collaborator

xuhancn commented Apr 18, 2023

Continued.

The bench_malloc results show that the Windows system malloc performs poorly, while tc_malloc performs much better than the system malloc. Can we replace the system malloc with tc_malloc to improve performance?

I made a POC that replaces PyTorch's C10 alloc_cpu/free_cpu with tc_malloc. Example PR: https://github.com/xuhancn/pytorch/pull/2/files

After the malloc replacement, the performance data (chart omitted) shows that, focusing again on aten::mkldnn_convolution, its time dropped from 7.27s to 1.41s.
So the major performance gap appears to be caused by the poor performance of the Windows heap manager.

To embed tc_malloc, I forked and modified the official tc_malloc code. My fork: https://github.com/xuhancn/gperftools

  1. Renamed tc_malloc's library from logging to tm_logging; the original name was duplicated and conflicted with XNNPACK.
  2. Added a switch (TC_MALLOC_NO_HOOK) to disable tc_malloc's binary patching.
     tc_malloc triggers binary patching via static class initialization, patching all loaded modules and hooking the malloc/free/new/delete functions. This causes problems with Python modules.

@peterjc123
Collaborator

@xuhancn Wow. Thanks for the interesting experiment results. Have you happened to try mimalloc which claims to have better interop with MSVC?

@xsacha
Contributor

xsacha commented Apr 18, 2023

Is this somewhat mitigated by reusing a memory cache? I assume the effect would be less pronounced then.

@xuhancn
Collaborator

xuhancn commented Apr 18, 2023

@xuhancn Wow. Thanks for the interesting experiment results. Have you happened to try mimalloc which claims to have better interop with MSVC?

Not yet. I will try it soon.

@xuhancn
Collaborator

xuhancn commented Apr 23, 2023

I tried to integrate mimalloc into PyTorch and found that mimalloc's performance can't compete with tc_malloc's (chart omitted). @peterjc123
PR is here: xuhancn#3

@peterjc123
Collaborator

peterjc123 commented Apr 23, 2023

I tried to integrate mimalloc into PyTorch and found that mimalloc's performance can't compete with tc_malloc's. PR is here: xuhancn#3

@xuhancn Does setting those environment variables help?
https://github.com/microsoft/mimalloc#environment-options
microsoft/mimalloc#293 (comment)

See also microsoft/mimalloc#633 (comment)

@xuhancn
Collaborator

xuhancn commented Apr 27, 2023

Additionally, I tried to replace the Windows system malloc with jemalloc.

  1. jemalloc does not provide CMake support.
  2. I found MSVC solution files, but the solution doesn't build successfully.

(Build error screenshot omitted.)

It seems that jemalloc has lacked Windows maintenance for some time.

@xuhancn
Collaborator

xuhancn commented May 20, 2023


I made a summary of malloc libraries:

  1. tc_malloc has the best performance.
  2. tc_malloc has two repos, but only gperftools supports CMake builds.
  3. I will submit PRs to gperftools and try to integrate tc_malloc into PyTorch. Link: PRs for pytorch gperftools/gperftools#1396

@xuhancn
Collaborator

xuhancn commented May 20, 2023

I tried to integrate mimalloc into PyTorch and found that mimalloc's performance can't compete with tc_malloc's. @peterjc123 PR is here: xuhancn#3

@xuhancn Does setting those environment variables help? https://github.com/microsoft/mimalloc#environment-options microsoft/mimalloc#293 (comment)

See also microsoft/mimalloc#633 (comment)

I think environment-variable tuning is not friendly to end users. Let's consider tc_malloc first.

@xuhancn
Collaborator

xuhancn commented Jul 20, 2023

mimalloc was enabled in #102534 (comment).

@xuhancn
Collaborator

xuhancn commented Feb 5, 2024

I guess performance difference stems from a fact that a lot of AVX2 accelerated codepaths are not enabled for Visual C++ compiler, for example see

#if defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)

SIMD was enabled by PR #118980.

@xuhancn xuhancn self-assigned this Mar 20, 2024