
DBSCAN do not free memory #18

Open
ShJacub opened this issue Dec 6, 2023 · 12 comments

ShJacub commented Dec 6, 2023

Thanks for your fast DBSCAN implementation. I have a problem: calling dbscan.DBSCAN(x) consumes additional memory, and if I call dbscan.DBSCAN(x) n times it consumes n*V memory, where V is the memory used by a single dbscan.DBSCAN(x) call.

@ShJacub changed the title from "Do not free memory" to "DBSCAN do not free memory" on Dec 6, 2023
anivegesana (Contributor) commented Dec 7, 2023

It is possible that I didn't handle the reference counts in the Python wrapper correctly. Can you share a small example that I can use to reproduce the issue?


This seems to reproduce it:

import sys
import numpy as np
import dbscan

f = np.random.rand(5000000, 5)
for i in range(100):
    q = dbscan.DBSCAN(f)

sys.getrefcount(q)
sys.getrefcount(q[0])

"If the object passed in is a NULL pointer, it is assumed that this was caused because the call producing the argument found an error and set an exception."
(from https://docs.python.org/3/c-api/arg.html#c.Py_BuildValue)
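
(For context, a hedged sketch of the reference-count rules around Py_BuildValue; illustrative code, not the wrapper's actual source. "O" adds a new reference to the object, while "N" steals the one the caller already owns, so using "O" on an object you own and never releasing it leaks one reference per call:)

#include <Python.h>

static PyObject *
build_leaky(void)
{
    PyObject *labels = PyList_New(0);   /* we own one reference */
    if (labels == NULL)
        return NULL;
    /* "O" INCREFs labels: the refcount becomes 2, but only 1 is
     * released when the returned tuple dies -- the list leaks. */
    return Py_BuildValue("(O)", labels);
}

static PyObject *
build_fixed(void)
{
    PyObject *labels = PyList_New(0);   /* we own one reference */
    if (labels == NULL)
        return NULL;
    /* "N" steals our reference: the tuple becomes the sole owner. */
    return Py_BuildValue("(N)", labels);
}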

Does the following build fix the issue for you?

pip install git+https://github.com/anivegesana/dbscan-python@memleak

ShJacub (Author) commented Dec 8, 2023

Good day. I'm sorry for the late response. I ran your code and I see the same memory leak. The outputs of sys.getrefcount(q) and sys.getrefcount(q[0]) are 2 and 3. I also tried pip install git+https://github.com/anivegesana/dbscan-python@memleak and got the same memory leak.

anivegesana (Contributor) commented

Hey @ShJacub,

I think I need some help reproducing the memory leak on my new branch. This is the code snippet I am currently running, and its output:

>>> import numpy as np
>>> import dbscan
>>> dbscan.__version__
'0.0.12.dev1+gc993316.d20230420'
>>> x = np.random.rand(5000000, 5)
>>> r = dbscan.DBSCAN(x)
>>> sys.getrefcount(r)
2
>>> sys.getrefcount(r[0])
2
>>> sys.getrefcount(r[1])
2
>>> import weakref
>>> del r
>>> g = weakref.ref(r[0])
>>> g()
array([0, 0, 0, ..., 0, 0, 0], dtype=int32)
>>> gc.collect()
480
>>> g()

Thank you for pointing out the memory leak. I need to be a little more careful when reading the Python C API documentation, since information about borrowed and owned references isn't always in an obvious place. 😅
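
(A small illustration of the trap, independent of the dbscan wrapper: whether a call returns a borrowed or a new reference is documented per function, and the cleanup duty differs accordingly.)

#include <Python.h>

static void
refcount_example(PyObject *tup)
{
    /* PyTuple_GetItem returns a BORROWED reference: do NOT DECREF it. */
    PyObject *borrowed = PyTuple_GetItem(tup, 0);
    (void)borrowed;

    /* PySequence_GetItem returns a NEW reference: the caller owns it
     * and must release it. */
    PyObject *owned = PySequence_GetItem(tup, 0);
    Py_XDECREF(owned);
}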

ShJacub (Author) commented Dec 11, 2023

Is it possible to solve this problem?

anivegesana (Contributor) commented

Yes, it is. Just need some more information. Can you run the code that I shared in the previous comment and share the output? Also, can you share the OS, Python version, and NumPy version that you are using?

ShJacub (Author) commented Dec 12, 2023

Ubuntu 20.04.6 LTS
Python 3.8.10
NumPy 1.24.4

I ran the code placed above. This is the output:

>>> import numpy as np
>>> import dbscan
>>> dbscan.__version__
'0.0.12'
>>> x = np.random.rand(5000000, 5)
>>> r = dbscan.DBSCAN(x)
>>> sys.getrefcount(r)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sys' is not defined
>>> import sys
>>> sys.getrefcount(r)
2
>>> sys.getrefcount(r[0])
3
>>> sys.getrefcount(r[1])
3
>>> import weakref
>>> del r
>>> g = weakref.ref(r[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'r' is not defined
>>> g()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'g' is not defined
>>> gc.collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'gc' is not defined
>>> g()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'g' is not defined

anivegesana (Contributor) commented Dec 14, 2023

Sorry about the delay. I was a bit busy at work for a couple of days and didn't have a chance to look at this. It seems like the versions of the dbscan-python library don't match up: you have the production version (0.0.12) and I have the version with the fix (0.0.12.dev1+gc993316.d20230420). Perhaps it failed to compile on your machine? I will build you a wheel tonight for you to try out.

For some reason, the version name on the wheel is messed up, but it should mostly work fine.

curl 'https://drive.google.com/uc?export=download&id=1Wrglr9Xo9dyDiD9ngPLcPjI9sFCiYLIJ' -o "dbscan-0.1.dev90+g6a8f3e3-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
pip install "dbscan-0.1.dev90+g6a8f3e3-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl" --force-reinstall --no-deps

alwansm commented Jan 22, 2024

Hello,

I have encountered a memory problem as well, but in C++. When I call the DBSCAN function inside a for loop, memory usage keeps increasing until the program crashes:

#include <iostream>

#include "dbscan/capi.h"
#include "dbscan/point.h"
#include "dbscan/geometryIO.h"
#include "dbscan/pbbs/parallel.h"
#include "dbscan/pbbs/parseCommandLine.h"

int main(int argc, char *argv[])
{

  commandLine P(argc, argv, "[-o <outFile>] [-eps <p_epsilon>] [-minpts <p_minpts>] <inFile>");
  char *iFile = P.getArgument(0);
  char *oFile = P.getOptionValue("-o");
  size_t rounds = P.getOptionIntValue("-r", 1);
  double p_epsilon = P.getOptionDoubleValue("-eps", 1);
  size_t p_minpts = P.getOptionIntValue("-minpts", 1);
  double p_rho = P.getOptionDoubleValue("-rho", -1);

  for (int i = 0; i < 10000; i++)
  {
    parlay::internal::start_scheduler();
    int dim = readHeader(iFile);
    _seq<double> PIn = readDoubleFromFile(iFile, dim);

    bool *coreFlag = new bool[PIn.n / dim];
    int *cluster = new int[PIn.n / dim];
    double *data = PIn.A;

    if (DBSCAN(dim, PIn.n / dim, data, p_epsilon, p_minpts, coreFlag, cluster))
      std::cout << "Error: dimension >20 is not supported." << std::endl;

    // if (oFile != NULL)
    // {
    //   writeArrayToFile("cluster-id", cluster, PIn.n / dim, oFile);
    // }

    PIn.del();
    delete[] coreFlag;
    delete[] cluster;
    parlay::internal::stop_scheduler();
  }
  return 0;
}

william-silversmith commented

I ran into a similar problem in some code I wrote a while back (not in DBSCAN) and here was its solution: seung-lab/mapbuffer@3746bd8

I suspect the issue has something to do with not releasing Xobj or X. You may want to call Py_DECREF(X); to decrement the reference count on the Python object before returning.

anivegesana (Contributor) commented

Oh yes, you are absolutely right! PyArg_ParseTupleAndKeywords creates a strong reference to objects. The C API has a lot of sharp corners. I will try it out. I don't think this solves @alwansm's problem, but Valgrind should help find that.
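
For reference, a minimal sketch of the input-side pattern being discussed (hypothetical names, not the actual dbscan wrapper source). A plain "O" format unit yields a borrowed reference that must not be released, while a converter-based "O&" unit (e.g. NumPy's PyArray_Converter) returns a new reference that must be; the Py_INCREF below stands in for such a converter:

#include <Python.h>

static PyObject *
dbscan_wrapper(PyObject *self, PyObject *args, PyObject *kwargs)
{
    static char *kwlist[] = {"X", NULL};
    PyObject *X = NULL;

    if (!PyArg_ParseTupleAndKeywords(args, kwargs, "O", kwlist, &X))
        return NULL;               /* X would be BORROWED here       */
    Py_INCREF(X);                  /* simulate a converter's new ref */

    /* ... run the clustering on X, build the result object ... */
    PyObject *result = PyLong_FromLong(0);   /* placeholder result   */

    Py_DECREF(X);                  /* the fix: release the input ref */
    return result;
}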

GunjanKholapure commented May 9, 2024

Any updates on this? It takes up increasingly more memory in Python when run continuously.

GunjanKholapure commented May 17, 2024

Update: this is what worked for me.

Since I just needed the labels and not the core mask, I changed the return statement, and I haven't seen any memory leak since for my use case.

Instead of returning:

return PyTuple_Pack(2, labels, core_samples);

I am returning:

Py_DECREF(X);
Py_DECREF(core_samples);
return (PyObject *) labels;

After making these changes I did pip install -e . in the repo. I had to comment out these lines for it to work:

py_limited_api=True
('Py_LIMITED_API', '0x03020000')
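
For anyone who still needs both return values, a sketch of the equivalent fix that keeps the tuple (same assumptions as above: labels, core_samples, and X are references the wrapper owns; this is not the actual wrapper source):

/* PyTuple_Pack adds its own references to labels and core_samples,
 * so the wrapper must drop the ones it owns, plus the extra
 * reference to the input X: */
PyObject *result = PyTuple_Pack(2, labels, core_samples);
Py_DECREF(labels);
Py_DECREF(core_samples);
Py_DECREF(X);
return result;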
