CUDA out of memory during saving to phy #670

Open
ananmoran opened this issue Apr 21, 2024 · 8 comments

@ananmoran

Describe the issue:

Hi,
Thanks for the great work. I've been using KS to sort long recordings (days) of NP1 data. I managed to edit KS3 so that some of the memory-heavy computations are done in CPU memory, at the cost of slower running time. Moving to KS4, running on the GPU was possible again, but it failed unexpectedly while saving to phy. I would be happy if you could help me resolve this issue.
Thanks
Anan

The following is the output of KS4:

Interpreting binary file as default dtype='int16'. If data was saved in a different format, specify data_dtype.
Using GPU for PyTorch computations. Specify device to change this.
sorting G:\NDR21\NDR21_hab3ToExt_g0\NDR21_hab3ToExt_g0_imec0\NDR21_hab3ToExt_g0_t0.imec0.ap.bin
using probe neuropixPhase3B1_kilosortChanMap.mat
Preprocessing filters computed in 227.77s; total 227.85s

computing drift
Re-computing universal templates from data.
H:\envs\kilosort4_1\lib\site-packages\threadpoolctl.py:1223: RuntimeWarning:
Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

warnings.warn(msg, RuntimeWarning)
100%|██████████████████████████████████████████████████████████████████████████| 21600/21600 [5:29:56<00:00, 1.09it/s]
drift computed in 24592.70s; total 24820.55s

Extracting spikes using templates
Re-computing universal templates from data.
100%|██████████████████████████████████████████████████████████████████████████| 21600/21600 [5:34:03<00:00, 1.08it/s]
101617684 spikes extracted in 20305.22s; total 45127.25s

First clustering
100%|██████████████████████████████████████████████████████████████████████████████| 96/96 [15:20:35<00:00, 575.37s/it]
742 clusters found, in 55302.18s; total 100429.43s

Extracting spikes using cluster waveforms
100%|██████████████████████████████████████████████████████████████████████████| 21600/21600 [3:41:37<00:00, 1.62it/s]
119437390 spikes extracted in 13482.27s; total 113911.70s

Final clustering
100%|██████████████████████████████████████████████████████████████████████████████| 96/96 [22:33:55<00:00, 846.20s/it]
492 clusters found, in 81236.76s; total 195148.93s

Merging clusters
471 units found, in 362.79s; total 195511.72s

Saving to phy and computing refractory periods
Traceback (most recent call last):
File "H:\envs\kilosort4_1\lib\runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "H:\envs\kilosort4_1\lib\runpy.py", line 110, in _get_module_details
__import__(pkg_name)
File "F:\PycharmProjects\kilosort4\ks4_main.py", line 23, in <module>
run_kilosort(settings=settings, probe_name='neuropixPhase3B1_kilosortChanMap.mat')
File "H:\envs\kilosort4_1\lib\site-packages\kilosort\run_kilosort.py", line 146, in run_kilosort
save_sorting(ops, results_dir, st, clu, tF, Wall, bfile.imin, tic0,
File "H:\envs\kilosort4_1\lib\site-packages\kilosort\run_kilosort.py", line 472, in save_sorting
results_dir, similar_templates, is_ref, est_contam_rate = io.save_to_phy(
File "H:\envs\kilosort4_1\lib\site-packages\kilosort\io.py", line 172, in save_to_phy
xs, ys = compute_spike_positions(st, tF, ops)
File "H:\envs\kilosort4_1\lib\site-packages\kilosort\postprocessing.py", line 39, in compute_spike_positions
chs = ops['iCC'][:, ops['iU'][st[:,1]]].cpu()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.90 GiB. GPU 0 has a total capacity of 11.00 GiB of which 6.65 GiB is free. Of the allocated memory 922.21 MiB is allocated by PyTorch, and 543.79 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

@jacobpennington
Collaborator

Thanks for catching this; we're deciding how to fix it. If you want to be able to sort in the meantime, since you mentioned you're comfortable modifying code, you can install from source and comment out a few lines to skip this step for now. It's only used for plotting the spike positions on the probe (which is nice, but not necessary).

The relevant lines are

214 and 215 in kilosort.gui.run_box.py (last two lines here):

  elif plot_type == 'probe':
      plot_window = self.plots['probe']
      ops = self.current_worker.ops
      st = self.current_worker.st
      clu = self.current_worker.clu
      tF = self.current_worker.tF
      is_refractory = self.current_worker.is_refractory
      device = self.parent.device
      # plot_spike_positions(plot_window, ops, st, clu, tF, is_refractory,
      #                      device)

172, 173, 181, and 185 in kilosort.io.py (all the ones that mention "spike_positions"):

    # spike properties
    spike_times = st[:,0].astype('int64') + imin  # shift by minimum sample index
    spike_templates = st[:,1].astype('int32')
    spike_clusters = clu
    # xs, ys = compute_spike_positions(st, tF, ops)             <---- 
    # spike_positions = np.vstack([xs, ys]).T                    <----
    amplitudes = ((tF**2).sum(axis=(-2,-1))**0.5).cpu().numpy()
    # remove duplicate (artifact) spikes
    spike_times, spike_clusters, kept_spikes = remove_duplicates(
        spike_times, spike_clusters, dt=ops['settings']['duplicate_spike_bins']
    )
    amp = amplitudes[kept_spikes]
    spike_templates = spike_templates[kept_spikes]
    # spike_positions = spike_positions[kept_spikes]            <----
    np.save((results_dir / 'spike_times.npy'), spike_times)
    np.save((results_dir / 'spike_templates.npy'), spike_clusters)
    np.save((results_dir / 'spike_clusters.npy'), spike_clusters)
    # np.save((results_dir / 'spike_positions.npy'), spike_positions)        <-----
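
For what it's worth, one hypothetical alternative to skipping the step entirely would be to do the same gather in chunks on the CPU, so the large (n_nearest_channels, n_spikes) intermediate never has to fit on the GPU. This is only a sketch under assumed shapes and names (iCC, iU, and spike_templates stand in for ops['iCC'], ops['iU'], and st[:, 1] from the traceback), not necessarily how we'll fix it:

    import torch

    def spike_channels_chunked(iCC, iU, spike_templates, chunk=1_000_000):
        # Move the lookup tables to the CPU once, then gather per-spike
        # channel indices in chunks instead of all at once on the GPU.
        iCC_cpu, iU_cpu = iCC.cpu(), iU.cpu()
        spike_templates = torch.as_tensor(spike_templates)
        out = []
        for i in range(0, spike_templates.shape[0], chunk):
            idx = iU_cpu[spike_templates[i:i + chunk].long()].long()
            out.append(iCC_cpu[:, idx])
        # The result is the same size in host RAM (~9 GiB for ~100M spikes),
        # but the big allocation no longer lands on the GPU.
        return torch.cat(out, dim=1)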

@ananmoran
Author

Thanks. I am still struggling with the KS implementation of CUDA. I have found that unless the cache is explicitly released using torch.cuda.empty_cache(), it is not emptied and "CUDA out of memory" causes the program to exit. By calling torch.cuda.empty_cache() before and after every phase of the sorting, I successfully managed to sort an NP1 recording of 5 hours. I think you should add empty_cache() to your code, or add a flag to do it when desired (a rough sketch of the pattern I used is shown after the log below). Unfortunately, KS4 still crashed with "CUDA out of memory" when I tried sorting a longer recording of 49 hours. There was a warning about a scalar overflow, and then it tried to allocate about 2.6 TB of GPU memory. See the trace log below:

Interpreting binary file as default dtype='int16'. If data was saved in a different format, specify data_dtype.
Using GPU for PyTorch computations. Specify device to change this.
sorting G:\NDR21\NDR21_hab3ToExt_g0\NDR21_hab3ToExt_g0_imec0\NDR21_hab3ToExt_g0_t0.imec0.ap.bin
using probe neuropixPhase3B1_kilosortChanMap.mat
Preprocessing filters computed in 1003.20s; total 1003.29s

computing drift
Re-computing universal templates from data.
h:\envs\kilosort4_1\lib\site-packages\threadpoolctl.py:1223: RuntimeWarning:
Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

warnings.warn(msg, RuntimeWarning)
41%|████████████████████████████▊ | 35791/88200 [8:54:26<13:46:29, 1.06it/s]h:\envs\kilosort4_1\lib\site-packages\kilosort\io.py:511: RuntimeWarning: overflow encountered in scalar add
bend = min(self.imax, bstart + self.NT + 2*self.nt)
41%|████████████████████████████▊ | 35791/88200 [8:54:27<13:02:37, 1.12it/s]
Traceback (most recent call last):
File "h:\envs\kilosort4_1\lib\runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "h:\envs\kilosort4_1\lib\runpy.py", line 110, in _get_module_details
__import__(pkg_name)
File "F:\PycharmProjects\kilosort4\ks4_main.py", line 24, in <module>
run_kilosort(settings=settings, probe_name='neuropixPhase3B1_kilosortChanMap.mat')
File "h:\envs\kilosort4_1\lib\site-packages\kilosort\run_kilosort.py", line 131, in run_kilosort
ops, bfile, st0 = compute_drift_correction(
File "h:\envs\kilosort4_1\lib\site-packages\kilosort\run_kilosort.py", line 345, in compute_drift_correction
ops, st = datashift.run(ops, bfile, device=device, progress_bar=progress_bar)
File "h:\envs\kilosort4_1\lib\site-packages\kilosort\datashift.py", line 192, in run
st, _, ops = spikedetect.run(ops, bfile, device=device, progress_bar=progress_bar)
File "h:\envs\kilosort4_1\lib\site-packages\kilosort\spikedetect.py", line 246, in run
X = bfile.padded_batch_to_torch(ibatch, ops)
File "h:\envs\kilosort4_1\lib\site-packages\kilosort\io.py", line 709, in padded_batch_to_torch
X = super().padded_batch_to_torch(ibatch)
File "h:\envs\kilosort4_1\lib\site-packages\kilosort\io.py", line 537, in padded_batch_to_torch
X[:] = torch.from_numpy(data).to(self.device).float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2594.29 GiB. GPU 0 has a total capacity of 11.00 GiB of which 7.85 GiB is free. Of the allocated memory 199.10 MiB is allocated by PyTorch, and 30.90 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
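
For reference, this is roughly the empty_cache() pattern I used. The phase functions below are just placeholders for the KS4 stages (drift correction, spike detection, clustering, saving) that I wrapped by editing run_kilosort.py directly:

    import torch

    def clear_gpu_cache():
        # Return cached (reserved but unallocated) GPU blocks to the driver.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    # Placeholder phases; in practice these are the calls inside run_kilosort().
    def drift_correction(): pass
    def spike_detection(): pass
    def clustering(): pass
    def save_to_phy(): pass

    for phase in (drift_correction, spike_detection, clustering, save_to_phy):
        clear_gpu_cache()   # before each phase
        phase()
        clear_gpu_cache()   # and again after it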

@jacobpennington
Collaborator

Okay thanks, looking into it. Just to clarify, are you using the default settings to sort this? I.e. no changes to batch size, detection thresholds, etc.

@ananmoran
Author

I did not change any parameters. Thanks for looking into it.
Anan

@ananmoran
Author

Hi. Any news regarding this issue?
Thanks
Anan

@jacobpennington
Collaborator

Not yet.

@jacobpennington
Collaborator

Re: the last error you described, I have a fix working. I'll push it after I test a few more things (probably today). The problem was that the very large number of samples triggered an integer overflow, which made the program try to load many batches' worth of data at once.
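
Roughly what happens, with hypothetical numbers just to illustrate the wraparound (the real batch bookkeeping lives in kilosort/io.py):

    import numpy as np

    NT, nt = 60000, 61                      # batch size and padding (typical defaults)
    n_samples = 49 * 3600 * 30000           # ~5.3e9 samples for a 49 h NP1 recording
    bstart = np.int32(2_147_440_000)        # a late batch start, assumed stored as int32

    # Warns "overflow encountered in scalar add" and wraps to a large negative value
    bend = min(n_samples, bstart + NT + 2 * nt)
    print(bend)                             # negative, instead of bstart + 60122

    # Used as a slice end, a negative bend wraps around to n_samples + bend,
    # so the "batch" suddenly spans ~1e9 samples instead of ~60k:
    mangled = (n_samples + int(bend)) - int(bstart)
    print(f"samples in the mangled batch: {mangled:,}")
    # ~1e9 samples x 385 channels x 4 bytes (float32) is on the order of
    # terabytes, consistent with the 2594 GiB allocation in the traceback.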

As for the other memory issues you brought up, those will take longer to work on, but they're on the to-do list. It sounds like using empty_cache() is working for you for whatever reason, but I don't want to add that to the code since it doesn't actually free up any memory that PyTorch doesn't already have reserved. There are optimizations in the underlying sorting steps that we need to try instead, to reduce the amount of memory allocated in the first place; they just haven't been a priority yet since most users' recordings are much shorter than this.
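
To make the allocated vs. reserved distinction concrete, here is a standalone snippet (not Kilosort code):

    import torch

    def report(tag):
        alloc = torch.cuda.memory_allocated() / 2**20
        resvd = torch.cuda.memory_reserved() / 2**20
        print(f"{tag}: allocated {alloc:.0f} MiB, reserved {resvd:.0f} MiB")

    if torch.cuda.is_available():
        x = torch.zeros(256, 1024, 1024, device="cuda")   # ~1 GiB of float32
        report("after allocation")
        del x
        report("after del")            # allocated drops; reserved stays cached
        torch.cuda.empty_cache()
        report("after empty_cache()")  # cached blocks are returned to the driver
    # empty_cache() only releases blocks that are already free inside the
    # allocator; it cannot shrink memory that live tensors still occupy.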

@ananmoran
Author

Thanks for the overflow fix. I will wait for your push and test it on my data.

WRT the GPU memory usage, I understand that it is not critical for most users, but I hope the KS team will find time to optimize this, as recording times will surely grow quickly in the near future.

Thanks again for putting in the effort to solve these problems.

Much obliged
Anan
