
Memory Allocation Failed - Large Datasets #120

Open
TonyX26 opened this issue Aug 12, 2021 · 7 comments
Comments

@TonyX26

TonyX26 commented Aug 12, 2021

Hi all,

I've been trying to run FIt-SNE on an FCS file with 20 million events. Unfortunately, despite allocating 1.5TB of memory, an error still arises (below). This does not occur when the same file is downsampled to 2 or 5 million cells. I have been running only 20 iterations, just to isolate the problem, but it never gets that far...

Has anyone encountered this error before? I've attached the error file, the output, and my script.

Thanks!

=============== t-SNE v1.2.1 ===============
fast_tsne data_path: <path> 
fast_tsne result_path: <path>
fast_tsne nthreads: 96
Read the following parameters:
	 n 19113296 by d 17 dataset, theta 0.500000,
	 perplexity 50.000000, no_dims 2, max_iter 20,
	 stop_lying_iter 250, mom_switch_iter 250,
	 momentum 0.500000, final_momentum 0.800000,
	 learning_rate 1592774.666667, max_step_norm 5.000000,
	 K -1, sigma -30.000000, nbody_algo 2,
	 knn_algo 1, early_exag_coeff 12.000000,
	 no_momentum_during_exag 0, n_trees 50, search_k 7500,
	 start_late_exag_iter -1, late_exag_coeff 1.000000
	 nterms 3, interval_per_integer 1.000000, min_num_intervals 50, t-dist df 1.000000
Read the 19113296 x 17 data matrix successfully. X[0,0] = 71838.656250
Read the initialization successfully.
Will use momentum during exaggeration phase
Computing input similarities...
Using perplexity, so normalizing input data (to prevent numerical problems)
Using perplexity, not the manually set kernel width.  K (number of nearest neighbors) and sigma (bandwidth) parameters are going to be ignored.
Using ANNOY for knn search, with parameters: n_trees 50 and search_k 7500
Going to allocate memory. N: 19113296, K: 150, N*K = -1427972896
Memory allocation failed!

Resource Usage on 2021-08-05 16:59:31:
   Job Id:             job_ID
   Project:            ##
   Exit Status:        1
   Service Units:      6.20
   NCPUs Requested:    48                     NCPUs Used: 48
                                           CPU Time Used: 00:02:26
   Memory Requested:   1.46TB                Memory Used: 37.03GB
   Walltime requested: 20:00:00            Walltime Used: 00:02:35
   JobFS requested:    30.0GB                 JobFS used: 15.18KB

Error file:

FIt-SNE R wrapper loading.
FIt-SNE root directory was set to <directory>
Using rsvd() to compute the top PCs for initialization.
Error in fftRtsne(dsobject_s[, -c(1, 19:24)], perplexity = 50, max_iter = 20) : 
  tsne call failed
Execution halted

Script:

library(flowCore)

## Sourcing FITSNE 
fast_tsne_path  <- "<path>/fast_tsne" 
source(paste0(fast_tsne_path,".R"))

## Loading in File
object <- exprs(read.FCS("<file>.fcs"))

## Running FIt-SNE 
tsne_object <- fftRtsne(object[, -c(1, 19:24)], perplexity = 50, max_iter = 20, fast_tsne_path = fast_tsne_path)
export_obj <- cbind(object, tSNEX = tsne_object[, 1], tSNEY = tsne_object[, 2])

## Saving Object
saveRDS(export_obj, "fitSNE_alltube_simple20.rds")

@TonyX26 TonyX26 changed the title Memory Allocation Failed - 3TB Memory Memory Allocation Failed - 1.5TB Memory Aug 12, 2021
@linqiaozhi
Member

Thanks for posting the issue, @TonyX26. I think the problem is integer overflow. See how N*K gives a negative number here?

Going to allocate memory. N: 19113296, K: 150, N*K = -1427972896

This is because N*K ≈ 2.87E9, which is larger than the maximum 32-bit signed integer, about 2.14E9.

I think you need to change the definition of the function computeGaussianPerplexity() so that N and D are not declared as integer, but long integer instead. This would also need to be changed in the header file, of course.

Then, N*D should not overflow, and calloc() will not be trying to allocate a "negative" amount of memory. If you have trouble making that change, I can also make it for you.

If that fixes the problem, please make a pull request so we can update the repo.

@TonyX26
Author

TonyX26 commented Aug 13, 2021

Thanks for the reply!

That has fixed the negative memory problem... however it is still not working, sadly.
I'm not sure why, but it would normally quit after 2 or 3 minutes. This run went for much longer, so I think that is hopeful!

Thanks
Tony

Error output:

FIt-SNE R wrapper loading.
FIt-SNE root directory was set to <directory>
Using rsvd() to compute the top PCs for initialization.
Error in fftRtsne(data, perplexity = 80, max_iter = 3000,  : 
  tsne call failed
Execution halted

Full Output

=============== t-SNE v1.2.1 ===============
fast_tsne data_path: <path>.dat
fast_tsne result_path:  <path>.dat
fast_tsne nthreads: 64
Read the following parameters:
	 n 19113296 by d 17 dataset, theta 0.500000,
	 perplexity 80.000000, no_dims 2, max_iter 3000,
	 stop_lying_iter 250, mom_switch_iter 250,
	 momentum 0.500000, final_momentum 0.800000,
	 learning_rate 1592774.666667, max_step_norm 5.000000,
	 K -1, sigma -30.000000, nbody_algo 2,
	 knn_algo 1, early_exag_coeff 12.000000,
	 no_momentum_during_exag 0, n_trees 50, search_k 12000,
	 start_late_exag_iter -1, late_exag_coeff 1.000000
	 nterms 3, interval_per_integer 1.000000, min_num_intervals 50, t-dist df 0.800000
Read the 19113296 x 17 data matrix successfully. X[0,0] = 71838.656250
Read the initialization successfully.
Will use momentum during exaggeration phase
Computing input similarities...
Using perplexity, so normalizing input data (to prevent numerical problems)
Using perplexity, not the manually set kernel width.  K (number of nearest neighbors) and sigma (bandwidth) parameters are going to be ignored.
Using ANNOY for knn search, with parameters: n_trees 50 and search_k 12000
Going to allocate memory. N: 19113296, K: 240, N*K = 292223744
Building Annoy tree...
Done building tree. Beginning nearest neighbor search... 
parallel (64 threads):
[>                                                           ] 0% 0.036s
======================================================================================
                  Resource Usage on 2021-08-13 11:09:37:
   Job Id:             <job>
   Project:            <job>
   Exit Status:        1
   Service Units:      45.22
   NCPUs Requested:    32                     NCPUs Used: 32              
                                           CPU Time Used: 01:07:32                                   
   Memory Requested:   2.93TB                Memory Used: 54.25GB         
   Walltime requested: 20:00:00            Walltime Used: 01:07:50        
   JobFS requested:    30.0GB                 JobFS used: 2.71GB          
======================================================================================

Script:

library(flowCore)
fast_tsne_path <- "<path>/fast_tsne"
source(paste0(fast_tsne_path,".R"))
object <- exprs( read.FCS("ConcatAnnaB_AllTube_1.fcs"))
tsne_object <-fftRtsne(object[,-c(1, 19:24)],perplexity = 80,max_iter = 3000, df = 0.8, fast_tsne_path = fast_tsne_path)
export_obj <- cbind(object, tSNEX = tsne_object[, 1], tSNEY = tsne_object[, 2])
saveRDS(export_obj, "fitSNE_alltube_full.rds")

@linqiaozhi
Member

Same story. It is crashing here, and you can see the indices are integers, so they are overflowing. Try changing those to long int, particularly n.

Although the algorithm should work at this scale, we did not test the actual code with a dataset this large. There are likely other places where we used int that should be long int instead. It would be a big help if you could go through, make those changes, and open a pull request. If you have difficulty or it's too much work, let me know and I can do it.

@TonyX26
Author

TonyX26 commented Aug 15, 2021

Thanks for all the help! I've had a shot at doing it myself, but sadly haven't managed to get it to work still. If it's possible to get some help, that'd be much appreciated. I'll pull the request though, seeing that it is solved! Thank you

@TonyX26 TonyX26 closed this as completed Aug 15, 2021
@dkobak
Collaborator

dkobak commented Aug 17, 2021

Hey Toni, "pull request" does not mean closing the issue :-) I am reopening it, as it's clearly a bug.

@dkobak dkobak reopened this Aug 17, 2021
@TonyX26
Author

TonyX26 commented Aug 19, 2021

I'm so sorry. First time as you may have realised :D I've put in a pull request now.

@TonyX26 TonyX26 changed the title Memory Allocation Failed - 1.5TB Memory Memory Allocation Failed - Large Datasets Sep 13, 2021
@TonyX26
Author

TonyX26 commented Sep 13, 2021

Hi All,

I've implemented the above changes and have been trying to find additional ways around it. The output below is the furthest I've managed to get, sadly. I note the reported allocation size is still negative, but I'm unsure what else to change to avoid this. Any help would be very much appreciated!

fast_tsne data_path: <path> RtmpfICJOc/fftRtsne_data_1e6bc64af50344.dat
fast_tsne result_path: <path> RtmpfICJOc/fftRtsne_result_1e6bc69d15fd9.dat
fast_tsne nthreads: 96
Read the following parameters:
	 n 19113296 by d 17 dataset, theta 0.500000,
	 perplexity 50.000000, no_dims 2, max_iter 20,
	 stop_lying_iter 250, mom_switch_iter 250,
	 momentum 0.500000, final_momentum 0.800000,
	 learning_rate 1592774.666667, max_step_norm 5.000000,
	 K -1, sigma -30.000000, nbody_algo 2,
	 knn_algo 1, early_exag_coeff 12.000000,
	 no_momentum_during_exag 0, n_trees 50, search_k 7500,
	 start_late_exag_iter -1, late_exag_coeff 1.000000
	 nterms 3, interval_per_integer 1.000000, min_num_intervals 50, t-dist df 1.000000
Read the 19113296 x 17 data matrix successfully. X[0,0] = 71838.656250
Read the initialization successfully.
Will use momentum during exaggeration phase
Computing input similarities...
Using perplexity, so normalizing input data (to prevent numerical problems)
Using perplexity, not the manually set kernel width.  K (number of nearest neighbors) and sigma (bandwidth) parameters are going to be ignored.
Using ANNOY for knn search, with parameters: n_trees 50 and search_k 7500
Going to allocate memory. N: 19113296, K: 150, N*K = -1427972896
Building Annoy tree...
Done building tree. Beginning nearest neighbor search... 
parallel (96 threads):
[>                                                           ] 0% 0.005s
[>                                                           ] 0% 0.564s
...
[>                                                           ] 1% 12.803s

This continues until:

[===========================================================>] 98% 1002.15s
[===========================================================>] 99% 1018.77s

Where the process then stops and ends.
The error output is:

FIt-SNE R wrapper loading.
FIt-SNE root directory was set to /scratch/nd12/tx2668
Using rsvd() to compute the top PCs for initialization.
Error in fftRtsne(dsobject_s[, -c(1, 19:24)], perplexity = 50, max_iter = 20) : 
  tsne call failed
Execution halted
