Memory leak when iterating! My environment problem or a bug? #2095

Open
YmdTb opened this issue Dec 3, 2023 · 0 comments

YmdTb commented Dec 3, 2023

Bug / performance issue / build issue

I don't know whether this is a problem with my environment or a bug.
I found that opt.minimize increases memory usage on every iteration of a loop. Details are below: in the results, fit_matern32_kernel is called 30 times and memory usage keeps growing. I only run 30 iterations here to show the effect; in my real code this growth eventually causes an out-of-memory error.

Hint: I have seen an old issue, #857. Resetting tf.Session may be a solution for TF 1.x, but how can the same be achieved with TF 2.x?
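As a possible TF 2.x counterpart I considered clearing the global Keras/TensorFlow state and forcing garbage collection after every fit. This is only a sketch of the idea (the wrapper name fit_and_release is mine, and I have not confirmed that it actually frees the memory held by opt.minimize):

import gc

import tensorflow as tf

def fit_and_release(ts_data_window):
    # fit_matern32_kernel is the function from the reproduction code below
    fit_matern32_kernel(ts_data_window)
    tf.keras.backend.clear_session()  # drop TF/Keras global graph state
    gc.collect()                      # release lingering Python references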

To reproduce (the memory_profiler package is used to show the memory usage; you can remove it if you don't want the dependency):

Minimal, reproducible example

import datetime as dt
from typing import Dict, List, Optional, Tuple, Union

import gpflow
import numpy as np
import pandas as pd
import tensorflow as tf
from gpflow.kernels import ChangePoints, Matern32
from sklearn.preprocessing import StandardScaler
from tensorflow_probability import bijectors as tfb

Kernel = gpflow.kernels.base.Kernel

def fit_matern32_kernel(
    ts_data: pd.DataFrame,
    variance: float = 1.0,
    lengthscale: float = 1.0,
    llh_var: float = 1.0,
) -> float:
    m = gpflow.models.GPR(
        data=(
            ts_data.loc[:, ["X"]].to_numpy(),
            ts_data.loc[:, ["Y"]].to_numpy(),
        ),
        kernel=Matern32(variance=variance, lengthscales=lengthscale),
        noise_variance=llh_var,
    )
    
    opt = gpflow.optimizers.Scipy()
    
    nlml = opt.minimize(
        m.training_loss, m.trainable_variables, options=dict(maxiter=100)
    ).fun
    return nlml

from memory_profiler import profile
@profile
def run():
    # ts_data: pd.DataFrame {"date":[dt.datetime("1990-01-01 00:00:00"),dt.datetime("1990-01-02 00:00:00"),....],
    #                         "daily_returns":[0.01,-0.025.....]}

    dates = pd.date_range("1999-12-31","2002-01-01")
    ts_data = pd.DataFrame({"date":dates,
                            "daily_returns":np.random.random(len(dates))/100})
    ts_data["date"] = ts_data.index
    wd = 21
    for window_end in range(wd + 1, len(ts_data))[:30]:
        ts_data_window = ts_data.iloc[window_end - (wd + 1) : window_end][["date", "daily_returns"]].copy()
        ts_data_window["X"] = ts_data_window.index.astype(float)
        ts_data_window = ts_data_window.rename(columns={"daily_returns": "Y"})
        fit_matern32_kernel(ts_data_window)
if __name__=='__main__':
    run()
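
A heavier workaround I could imagine (again just a sketch, not verified, and it pays the TensorFlow start-up cost on every call) is to run each fit in a short-lived worker process, so the operating system reclaims all memory when the worker exits:

from multiprocessing import get_context

def fit_in_subprocess(ts_data_window):
    # "spawn" starts a fresh interpreter, so no TensorFlow state is shared or retained
    ctx = get_context("spawn")
    with ctx.Pool(1) as pool:
        return pool.apply(fit_matern32_kernel, (ts_data_window,))

This bounds the memory growth at the price of speed; it is obviously not a fix for the underlying leak.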

Results

Note the result for line 421: an increment of 657.5 MiB of memory usage over the 30 calls.

2023-12-03 16:21:45.704483: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-03 16:21:45.726512: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-03 16:21:45.726559: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-03 16:21:45.727129: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-03 16:21:45.730785: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-03 16:21:48.165010: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-03 16:21:48.168628: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-03 16:21:48.168681: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-03 16:21:48.171375: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-03 16:21:48.171418: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-03 16:21:48.171442: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-03 16:21:48.261821: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-03 16:21:48.261867: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-03 16:21:48.261876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2022] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2023-12-03 16:21:48.261896: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-03 16:21:48.261924: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9516 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4070 Ti, pci bus id: 0000:01:00.0, compute capability: 8.9
2023-12-03 16:21:48.486685: I external/local_tsl/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-12-03 16:21:55.498487: I tensorflow/core/util/cuda_solvers.cc:179] Creating GpuSolver handles for stream 0xa113960
Filename: /pycharm_project/mom_trans/changepoint_detection.py
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   407    746.6 MiB    746.6 MiB           1   @profile
   408                                         def run():
   409                                             # ts_data: pd.DataFrame {"date":[dt.datetime("1990-01-01 00:00:00"),dt.datetime("1990-01-02 00:00:00"),....],
   410                                             #                         "daily_returns":[0.01,-0.025.....]}
   411                                         
   412    746.6 MiB      0.0 MiB           1       dates = pd.date_range("1999-12-31","2002-01-01")
   413    746.6 MiB      0.0 MiB           2       ts_data = pd.DataFrame({"date":dates,
   414    746.6 MiB      0.0 MiB           1                               "daily_returns":np.random.random(len(dates))/100})
   415    746.6 MiB      0.0 MiB           1       ts_data["date"] = ts_data.index
   416    746.6 MiB      0.0 MiB           1       wd = 21
   417   1406.8 MiB      0.0 MiB          31       for window_end in range(wd + 1, len(ts_data))[:30]:
   418   1404.3 MiB      0.0 MiB          30           ts_data_window = ts_data.iloc[window_end - (wd + 1) : window_end][["date", "daily_returns"]].copy()
   419   1404.3 MiB      0.0 MiB          30           ts_data_window["X"] = ts_data_window.index.astype(float)
   420   1404.3 MiB      2.6 MiB          30           ts_data_window = ts_data_window.rename(columns={"daily_returns": "Y"})
   421   1406.8 MiB    657.5 MiB          30           fit_matern_kernel(ts_data_window,)
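
The same growth can also be checked without memory_profiler by sampling the resident set size of the process each iteration; a minimal sketch using psutil (an extra dependency I am assuming here, not part of the setup above):

import psutil

def rss_mib() -> float:
    # resident set size of the current process, in MiB
    return psutil.Process().memory_info().rss / 2**20

# e.g. inside the loop in run():
#   before = rss_mib()
#   fit_matern32_kernel(ts_data_window)
#   print(f"RSS grew by {rss_mib() - before:.1f} MiB in this iteration")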

System information

GPflow version: 2.9.0
GPflow installed from: pip install gpflow
TensorFlow version: 2.15.0
Python version: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
Operating system (built with Docker):

    Distributor ID: Ubuntu
    Description:    Ubuntu 22.04.3 LTS
    Release:        22.04
    Codename:       jammy

GPU : device: 0, name: NVIDIA GeForce RTX 4070 Ti, pci bus id: 0000:01:00.0, compute capability: 8.9
nvidia-smi: NVIDIA-SMI 545.23.05, Driver Version: 545.84, CUDA Version: 12.3

YmdTb added the bug label on Dec 3, 2023