
[Bug] Performance issue: the first torch.Tensor.item() call in _get_valid_value is excessively slow #1519

Open
BenjaminPang opened this issue Mar 20, 2024 · 0 comments
Labels: bug Something isn't working

Environment

sys.platform: win32
Python: 3.10.13 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:24:38) [MSC v.1916 64 bit (AMD64)]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 3070
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6
NVCC: Cuda compilation tools, release 11.6, V11.6.55
MSVC: Microsoft (R) C/C++ Optimizing Compiler Version 19.38.33130 for x64
GCC: n/a
PyTorch: 1.13.1+cu116
PyTorch compiling details: PyTorch built with:
  - C++ Version: 199711
  - MSVC 192829337
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 2019
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.6
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.3.2 (built against CUDA 11.5)
  - Magma 2.5.4
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=C:/actions-runner/_work/pytorch/pytorch/builder/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/actions-runner/_work/pytorch/pytorch/builder/windows/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, USE_ROCM=OFF
TorchVision: 0.14.1+cu116
OpenCV: 4.9.0
MMEngine: 0.10.3

Reproduces the problem - code sample

# MessageHub._get_valid_value from mmengine/logging/message_hub.py,
# with timing instrumentation added for this report.
def _get_valid_value(
    self,
    value: Union['torch.Tensor', np.ndarray, np.number, int, float],
) -> Union[int, float]:
    """Convert value to python built-in type.

    Args:
        value (torch.Tensor or np.ndarray or np.number or int or float):
            value of log.

    Returns:
        float or int: python built-in type value.
    """
    import time
    s = time.time()
    if isinstance(value, (np.ndarray, np.number)):
        assert value.size == 1
        value = value.item()
    elif isinstance(value, (int, float)):
        value = value  # already a Python built-in type
    else:
        # check whether value is torch.Tensor but don't want
        # to import torch in this file
        assert hasattr(value, 'numel') and value.numel() == 1
        value = value.item()
    print(f"get_valid_value use {time.time() - s}")
    return value  # type: ignore

In mmengine's logging/message_hub.py, the _get_valid_value function has a performance problem when it converts values with torch.Tensor.item() while being called from the runtime info hook's after_train_iter method. My tests show that the first call to the function takes a noticeable amount of time while subsequent calls take essentially zero; because after_train_iter runs on every training iteration, this cost adds up and significantly inflates the overall training time.
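
The timing pattern is consistent with how torch.Tensor.item() behaves on CUDA tensors: kernels are launched asynchronously, so the first .item() after a training step blocks until the GPU drains its queue, while later calls in the same iteration return almost immediately. A minimal sketch to isolate this, assuming a CUDA device is available (the matrix size is arbitrary):

import time

import torch

x = torch.randn(4096, 4096, device='cuda')
y = x @ x  # the matmul kernel is launched asynchronously; Python returns at once

start = time.time()
y.mean().item()  # blocks until the pending matmul has finished on the GPU
print(f'first .item(): {time.time() - start:.4f}s')

start = time.time()
y.mean().item()  # the GPU queue is already drained, so this is nearly free
print(f'second .item(): {time.time() - start:.4f}s')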

Steps to reproduce

  1. Run a training loop that triggers after_train_iter.
  2. Observe the time spent in the torch.Tensor.item() call inside _get_valid_value (a minimal sketch follows this list).
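
A minimal reproduction could look like the sketch below. MessageHub.get_instance and update_scalar are mmengine's public logging API; the instance name, keys, and workload are made up for illustration, and a CUDA device is assumed:

import time

import torch
from mmengine.logging import MessageHub

message_hub = MessageHub.get_instance('repro')

x = torch.randn(4096, 4096, device='cuda')
loss = (x @ x).mean()  # leaves pending GPU work, as after a real training step

for key in ('loss', 'loss_cls', 'loss_bbox'):
    start = time.time()
    # update_scalar ends up in _get_valid_value, whose .item() call must
    # first wait for the pending matmul; only the first update pays for it.
    message_hub.update_scalar(f'train/{key}', loss)
    print(f'update_scalar train/{key}: {time.time() - start:.4f}s')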

Here is the timing output:

get_valid_value use 0.02899909019470215
get_valid_value use 0.0
get_valid_value use 0.0

Times are in seconds.

Expected behavior

I would not expect the first torch.Tensor.item() call to introduce such a significant delay.

Reproduces the problem - command or script

No comment

Reproduces the problem - error message

No comment

Additional information

To address this performance issue, I suggest modifying the parse_losses function in base_model so that it performs the type conversion up front, turning loss values, accuracies, and the like into Python scalars; this avoids the expensive torch.Tensor.item() calls inside _get_valid_value. A possible solution sketch:

# modified parse_losses
def parse_losses(
    self, losses: Dict[str, torch.Tensor]
) -> Tuple[torch.Tensor, Dict[str, float]]:
    # ... original code unchanged ...
    log_vars = [[key, value.mean().item()] for key, value in log_vars]
    # ... original code unchanged ...
    return loss, log_vars  # type: ignore
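
For concreteness, here is a self-contained version of that sketch. The list/OrderedDict structure is my assumption about the shape of the original function, not the actual mmengine source:

from collections import OrderedDict
from typing import Dict, Tuple

import torch


def parse_losses(
        losses: Dict[str, torch.Tensor]) -> Tuple[torch.Tensor, Dict[str, float]]:
    # Average each loss term, as the original function does.
    log_vars = [[key, value.mean()] for key, value in losses.items()]
    # The total loss stays a tensor so it can still be backpropagated.
    loss = sum(value for key, value in log_vars if 'loss' in key)
    log_vars.insert(0, ['loss', loss])
    # Convert every logged value to a builtin float once, up front, so that
    # _get_valid_value never needs to call torch.Tensor.item() afterwards.
    log_vars = OrderedDict((key, value.item()) for key, value in log_vars)
    return loss, log_vars

The conversion still pays one GPU synchronization per iteration, but it happens in a single, predictable place, and downstream logging code only ever sees Python floats.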