CUDA out of memory during saving to phy #670

Open
ananmoran opened this issue Apr 21, 2024 · 8 comments

@ananmoran

Describe the issue:

Hi,
Thanks for the great work. I've been using KS to sort long recordings (days) of NP1 data. I managed to edit KS3 so that some of the memory-heavy computations are done in CPU memory, at the cost of slower running time. Moving to KS4, running on the GPU was possible again, but it failed unexpectedly while saving to phy. I would be happy if you could help me resolve this issue.
Thanks
Anan

The following is the output of KS4:

Interpreting binary file as default dtype='int16'. If data was saved in a different format, specify data_dtype.
Using GPU for PyTorch computations. Specify device to change this.
sorting G:\NDR21\NDR21_hab3ToExt_g0\NDR21_hab3ToExt_g0_imec0\NDR21_hab3ToExt_g0_t0.imec0.ap.bin
using probe neuropixPhase3B1_kilosortChanMap.mat
Preprocessing filters computed in 227.77s; total 227.85s

computing drift
Re-computing universal templates from data.
H:\envs\kilosort4_1\lib\site-packages\threadpoolctl.py:1223: RuntimeWarning:
Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

warnings.warn(msg, RuntimeWarning)
100%|██████████████████████████████████████████████████████████████████████████| 21600/21600 [5:29:56<00:00, 1.09it/s]
drift computed in 24592.70s; total 24820.55s

Extracting spikes using templates
Re-computing universal templates from data.
100%|██████████████████████████████████████████████████████████████████████████| 21600/21600 [5:34:03<00:00, 1.08it/s]
101617684 spikes extracted in 20305.22s; total 45127.25s

First clustering
100%|██████████████████████████████████████████████████████████████████████████████| 96/96 [15:20:35<00:00, 575.37s/it]
742 clusters found, in 55302.18s; total 100429.43s

Extracting spikes using cluster waveforms
100%|██████████████████████████████████████████████████████████████████████████| 21600/21600 [3:41:37<00:00, 1.62it/s]
119437390 spikes extracted in 13482.27s; total 113911.70s

Final clustering
100%|██████████████████████████████████████████████████████████████████████████████| 96/96 [22:33:55<00:00, 846.20s/it]
492 clusters found, in 81236.76s; total 195148.93s

Merging clusters
471 units found, in 362.79s; total 195511.72s

Saving to phy and computing refractory periods
Traceback (most recent call last):
File "H:\envs\kilosort4_1\lib\runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "H:\envs\kilosort4_1\lib\runpy.py", line 110, in _get_module_details
__import__(pkg_name)
File "F:\PycharmProjects\kilosort4\ks4_main.py", line 23, in <module>
run_kilosort(settings=settings, probe_name='neuropixPhase3B1_kilosortChanMap.mat')
File "H:\envs\kilosort4_1\lib\site-packages\kilosort\run_kilosort.py", line 146, in run_kilosort
save_sorting(ops, results_dir, st, clu, tF, Wall, bfile.imin, tic0,
File "H:\envs\kilosort4_1\lib\site-packages\kilosort\run_kilosort.py", line 472, in save_sorting
results_dir, similar_templates, is_ref, est_contam_rate = io.save_to_phy(
File "H:\envs\kilosort4_1\lib\site-packages\kilosort\io.py", line 172, in save_to_phy
xs, ys = compute_spike_positions(st, tF, ops)
File "H:\envs\kilosort4_1\lib\site-packages\kilosort\postprocessing.py", line 39, in compute_spike_positions
chs = ops['iCC'][:, ops['iU'][st[:,1]]].cpu()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.90 GiB. GPU 0 has a total capacity of 11.00 GiB of which 6.65 GiB is free. Of the allocated memory 922.21 MiB is allocated by PyTorch, and 543.79 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

@jacobpennington
Collaborator

Thanks for catching this; we're deciding how to fix it. If you want to be able to sort in the meantime, since you mentioned you're comfortable modifying code, you can install from source and comment out a few lines to skip this step for now. It's only used for plotting the spike positions on the probe (which is nice, but not necessary).

The relevant lines are

214 and 215 in kilosort.gui.run_box.py (last two lines here):

  elif plot_type == 'probe':
      plot_window = self.plots['probe']
      ops = self.current_worker.ops
      st = self.current_worker.st
      clu = self.current_worker.clu
      tF = self.current_worker.tF
      is_refractory = self.current_worker.is_refractory
      device = self.parent.device
      # plot_spike_positions(plot_window, ops, st, clu, tF, is_refractory,
      #                      device)

172, 173, 181, and 185 in kilosort.io.py (all the ones that mention "spike_positions"):

    # spike properties
    spike_times = st[:,0].astype('int64') + imin  # shift by minimum sample index
    spike_templates = st[:,1].astype('int32')
    spike_clusters = clu
    # xs, ys = compute_spike_positions(st, tF, ops)             <---- 
    # spike_positions = np.vstack([xs, ys]).T                    <----
    amplitudes = ((tF**2).sum(axis=(-2,-1))**0.5).cpu().numpy()
    # remove duplicate (artifact) spikes
    spike_times, spike_clusters, kept_spikes = remove_duplicates(
        spike_times, spike_clusters, dt=ops['settings']['duplicate_spike_bins']
    )
    amp = amplitudes[kept_spikes]
    spike_templates = spike_templates[kept_spikes]
    # spike_positions = spike_positions[kept_spikes]            <----
    np.save((results_dir / 'spike_times.npy'), spike_times)
    np.save((results_dir / 'spike_templates.npy'), spike_clusters)
    np.save((results_dir / 'spike_clusters.npy'), spike_clusters)
    # np.save((results_dir / 'spike_positions.npy'), spike_positions)        <-----
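
For what it's worth, one hypothetical alternative to skipping the step entirely would be to do the same gather in chunks on the CPU, so the large (n_nearest_channels, n_spikes) intermediate never has to fit on the GPU. This is only a sketch under assumed shapes and names (iCC, iU, and spike_templates stand in for ops['iCC'], ops['iU'], and st[:, 1] from the traceback), not necessarily how we'll fix it:

    import torch

    def spike_channels_chunked(iCC, iU, spike_templates, chunk=1_000_000):
        # Move the lookup tables to the CPU once, then gather per-spike
        # channel indices in chunks instead of all at once on the GPU.
        iCC_cpu, iU_cpu = iCC.cpu(), iU.cpu()
        spike_templates = torch.as_tensor(spike_templates)
        out = []
        for i in range(0, spike_templates.shape[0], chunk):
            idx = iU_cpu[spike_templates[i:i + chunk].long()].long()
            out.append(iCC_cpu[:, idx])
        # The result is the same size in host RAM (~9 GiB for ~100M spikes),
        # but the big allocation no longer lands on the GPU.
        return torch.cat(out, dim=1)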

@ananmoran
Author

Thanks. I am still struggling with the KS implementation of CUDA. I have found that unless the cache is explicitly released using torch.cuda.empty_cache(), it is not emptied and "CUDA out of memory" causes the program to exit. By calling torch.cuda.empty_cache() before and after every phase of the sorting, I successfully managed to sort an NP1 recording of 5 hours. I think you should add empty_cache() to your code, or add a flag to do it when desired (a rough sketch of the pattern I used is shown after the log below). Unfortunately, KS4 still crashed with "CUDA out of memory" when I tried sorting a longer recording of 49 hours. There was a warning about a scalar overflow, and then it tried to allocate about 2.6 TB of GPU memory. See the trace log below:

Interpreting binary file as default dtype='int16'. If data was saved in a different format, specify data_dtype.
Using GPU for PyTorch computations. Specify device to change this.
sorting G:\NDR21\NDR21_hab3ToExt_g0\NDR21_hab3ToExt_g0_imec0\NDR21_hab3ToExt_g0_t0.imec0.ap.bin
using probe neuropixPhase3B1_kilosortChanMap.mat
Preprocessing filters computed in 1003.20s; total 1003.29s

computing drift
Re-computing universal templates from data.
h:\envs\kilosort4_1\lib\site-packages\threadpoolctl.py:1223: RuntimeWarning:
Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

warnings.warn(msg, RuntimeWarning)
41%|████████████████████████████▊ | 35791/88200 [8:54:26<13:46:29, 1.06it/s]h:\envs\kilosort4_1\lib\site-packages\kilosort\io.py:511: RuntimeWarning: overflow encountered in scalar add
bend = min(self.imax, bstart + self.NT + 2*self.nt)
41%|████████████████████████████▊ | 35791/88200 [8:54:27<13:02:37, 1.12it/s]
Traceback (most recent call last):
File "h:\envs\kilosort4_1\lib\runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "h:\envs\kilosort4_1\lib\runpy.py", line 110, in _get_module_details
__import__(pkg_name)
File "F:\PycharmProjects\kilosort4\ks4_main.py", line 24, in <module>
run_kilosort(settings=settings, probe_name='neuropixPhase3B1_kilosortChanMap.mat')
File "h:\envs\kilosort4_1\lib\site-packages\kilosort\run_kilosort.py", line 131, in run_kilosort
ops, bfile, st0 = compute_drift_correction(
File "h:\envs\kilosort4_1\lib\site-packages\kilosort\run_kilosort.py", line 345, in compute_drift_correction
ops, st = datashift.run(ops, bfile, device=device, progress_bar=progress_bar)
File "h:\envs\kilosort4_1\lib\site-packages\kilosort\datashift.py", line 192, in run
st, _, ops = spikedetect.run(ops, bfile, device=device, progress_bar=progress_bar)
File "h:\envs\kilosort4_1\lib\site-packages\kilosort\spikedetect.py", line 246, in run
X = bfile.padded_batch_to_torch(ibatch, ops)
File "h:\envs\kilosort4_1\lib\site-packages\kilosort\io.py", line 709, in padded_batch_to_torch
X = super().padded_batch_to_torch(ibatch)
File "h:\envs\kilosort4_1\lib\site-packages\kilosort\io.py", line 537, in padded_batch_to_torch
X[:] = torch.from_numpy(data).to(self.device).float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2594.29 GiB. GPU 0 has a total capacity of 11.00 GiB of which 7.85 GiB is free. Of the allocated memory 199.10 MiB is allocated by PyTorch, and 30.90 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
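
For reference, this is roughly the empty_cache() pattern I used. The phase functions below are just placeholders for the KS4 stages (drift correction, spike detection, clustering, saving) that I wrapped by editing run_kilosort.py directly:

    import torch

    def clear_gpu_cache():
        # Return cached (reserved but unallocated) GPU blocks to the driver.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    # Placeholder phases; in practice these are the calls inside run_kilosort().
    def drift_correction(): pass
    def spike_detection(): pass
    def clustering(): pass
    def save_to_phy(): pass

    for phase in (drift_correction, spike_detection, clustering, save_to_phy):
        clear_gpu_cache()   # before each phase
        phase()
        clear_gpu_cache()   # and again after it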

@jacobpennington
Collaborator

Okay thanks, looking into it. Just to clarify, are you using the default settings to sort this? I.e. no changes to batch size, detection thresholds, etc.

@ananmoran
Author

I did not change any parameters. Thanks for looking into it.
Anan

@ananmoran
Author

Hi. Any news regarding this issue?
Thanks
Anan

@jacobpennington
Collaborator

Not yet.

@jacobpennington
Collaborator

Re: the last error you described, I have a fix working. I'll push it after I test a few more things (probably today). The problem was that the very large number of samples triggered an integer overflow, which made the program try to load many batches' worth of data at once.
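
Roughly what happens, with hypothetical numbers just to illustrate the wraparound (the real batch bookkeeping lives in kilosort/io.py):

    import numpy as np

    NT, nt = 60000, 61                      # batch size and padding (typical defaults)
    n_samples = 49 * 3600 * 30000           # ~5.3e9 samples for a 49 h NP1 recording
    bstart = np.int32(2_147_440_000)        # a late batch start, assumed stored as int32

    # Warns "overflow encountered in scalar add" and wraps to a large negative value
    bend = min(n_samples, bstart + NT + 2 * nt)
    print(bend)                             # negative, instead of bstart + 60122

    # Used as a slice end, a negative bend wraps around to n_samples + bend,
    # so the "batch" suddenly spans ~1e9 samples instead of ~60k:
    mangled = (n_samples + int(bend)) - int(bstart)
    print(f"samples in the mangled batch: {mangled:,}")
    # ~1e9 samples x 385 channels x 4 bytes (float32) is on the order of
    # terabytes, consistent with the 2594 GiB allocation in the traceback.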

As for the other memory issues you brought up, those will take longer to work on, but they're on the to-do list. It sounds like using empty_cache() is working for you for whatever reason, but I don't want to add that to the code since it doesn't actually free up any memory that PyTorch doesn't already have reserved. There are optimizations in the underlying sorting steps that we need to try instead, to reduce the amount of memory allocated in the first place; they just haven't been a priority yet since most users' recordings are much shorter than this.
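
To make the allocated vs. reserved distinction concrete, here is a standalone snippet (not Kilosort code):

    import torch

    def report(tag):
        alloc = torch.cuda.memory_allocated() / 2**20
        resvd = torch.cuda.memory_reserved() / 2**20
        print(f"{tag}: allocated {alloc:.0f} MiB, reserved {resvd:.0f} MiB")

    if torch.cuda.is_available():
        x = torch.zeros(256, 1024, 1024, device="cuda")   # ~1 GiB of float32
        report("after allocation")
        del x
        report("after del")            # allocated drops; reserved stays cached
        torch.cuda.empty_cache()
        report("after empty_cache()")  # cached blocks are returned to the driver
    # empty_cache() only releases blocks that are already free inside the
    # allocator; it cannot shrink memory that live tensors still occupy.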

@ananmoran
Author

Thanks for the overflow fix. I will wait for your push and test it on my data.

WRT the GPU memory usage, I understand that it is not critical for most users, but I hope the KS team will find time to optimize this, as recording times will surely grow quickly in the near future.

Thanks again for putting in the effort to solve these problems.

Much obliged
Anan
