
t2smap OOM #3125
Open
bpinsard opened this issue Nov 1, 2023 · 4 comments

bpinsard (Collaborator) commented Nov 1, 2023

What happened?

While processing a session with 6 multi-echo runs, the jobs get killed by SLURM, despite requesting more memory from SLURM (with a buffer) than the memory limit given to fMRIPrep.
It is tedana's t2smap that crashes, so the nipype-set requirements do not seem to properly estimate the memory needs of those nodes.

I know this problem has been reported before, but it seems to still be present in 23.1.4.
Each echo .nii.gz file is approximately 0.4 GB.
The current heuristic is mem_gb = 2.5 * mem_gb * len(echo_times).
So it estimates the memory requirement at ~3 GB, but the core dumps produced when the OOM kill occurs are 8 GB, which is also what a basic top shows.

I will try to run memory profiling of t2smap alone on our data to figure out a better heuristic.
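
For reference, a back-of-the-envelope check of that heuristic (an illustrative sketch only: it assumes 3 echoes and plugs in the ~0.4 GB per-echo size quoted above, whereas the real estimate is derived from the uncompressed image size, so the exact numbers may differ):

per_echo_gb = 0.4   # approximate size of one echo file, as reported above
n_echoes = 3        # assumed echo count for this illustration

estimated_gb = 2.5 * per_echo_gb * n_echoes
print(f"heuristic estimate: {estimated_gb:.1f} GB")   # ~3 GB
# Observed peak from core dumps / top: ~8 GB, i.e. roughly 2.5x the estimate.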

What command did you use?

containers-run -m 'fMRIPrep_sub-01/ses-001' -n bids-fmriprep --input sourcedata/templateflow/tpl-MNI152NLin2009cAsym/ --input sourcedata/templateflow/tpl-OASIS30ANTs/ --input sourcedata/templateflow/tpl-fsLR/ --input sourcedata/templateflow/tpl-fsaverage/ --input sourcedata/templateflow/tpl-MNI152NLin6Asym/ --output . --input 'sourcedata/cneuromod.emotion-videos/sub-01/ses-001/fmap/' --input 'sourcedata/cneuromod.emotion-videos/sub-01/ses-001/func/' --input 'sourcedata/cneuromod.anat.smriprep.longitudinal/sub-01/anat/' --input sourcedata/cneuromod.anat.smriprep.longitudinal/sourcedata/cneuromod.anat.freesurfer_longitudinal/sub-01/ -- -w ./workdir --participant-label 01 --anat-derivatives sourcedata/cneuromod.anat.smriprep.longitudinal --fs-subjects-dir sourcedata/cneuromod.anat.smriprep.longitudinal/sourcedata/cneuromod.anat.freesurfer_longitudinal --bids-filter-file code/fmriprep_study-cneuromod.emotion-videos_sub-01_ses-001_bids_filters.json --output-layout bids --ignore slicetiming --use-syn-sdc --output-spaces MNI152NLin2009cAsym T1w:res-iso2mm --cifti-output 91k --notrack --write-graph --skip_bids_validation --omp-nthreads 8 --nprocs 8 --mem_mb 45000 --fs-license-file code/freesurfer.license \--me-output-echos --resource-monitor sourcedata/cneuromod.emotion-videos ./ participant

What version of fMRIPrep are you running?

23.1.4

How are you running fMRIPrep?

Singularity

Is your data BIDS valid?

Yes

Are you reusing any previously computed results?

Anatomical derivatives

Please copy and paste any relevant log output.

No response

Additional information / screenshots

No response

bpinsard added the bug label Nov 1, 2023
bpinsard self-assigned this Nov 1, 2023
effigies (Member) commented Nov 1, 2023

Just linking relevant issues/PRs:

bpinsard (Collaborator, Author) commented Nov 3, 2023

I realized that I get this warning in the logs:

231103-02:09:47,698 nipype.workflow WARNING:
         Some nodes exceed the total amount of memory available (45.00GB).

I cannot imagine which operation would require that amount of memory for 5-minute runs of 2 mm isotropic fMRI.
I am looking for a way to get the node mem_gb values; apparently they are not included in the graph exported with --write-graph.

Looking at the code, the likely cause is nodes set with mem_gb = mem_gb * 3 * omp_nthreads, possibly on top of the resampled mem_gb estimate; those could reach that much requested memory because I used omp_nthreads=8 (see the rough numbers below).

However, t2smap is likely not the node with the largest mem_gb requirement set in that workflow.
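
To illustrate the scale of that expression with omp_nthreads=8 (the mem_gb input here is an assumed, hypothetical value, not one taken from the actual logs):

mem_gb_input = 1.5    # hypothetical per-node mem_gb (filesize or resampled estimate)
omp_nthreads = 8

node_req_gb = mem_gb_input * 3 * omp_nthreads   # the 3 * omp_nthreads factor alone is x24
print(f"node memory request: {node_req_gb:.1f} GB")   # 36 GB with these assumed numbers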

effigies (Member) commented Nov 3, 2023

Here's what we calculate:

config.loggers.workflow.debug(
    "Creating bold processing workflow for <%s> (%.2f GB / %d TRs). "
    "Memory resampled/largemem=%.2f/%.2f GB.",
    ref_file,
    mem_gb["filesize"],
    bold_tlen,
    mem_gb["resampled"],
    mem_gb["largemem"],
)

def _create_mem_gb(bold_fname):
    img = nb.load(bold_fname)
    nvox = int(np.prod(img.shape, dtype='u8'))
    # Assume tools will coerce to 8-byte floats to be safe
    bold_size_gb = 8 * nvox / (1024**3)
    bold_tlen = img.shape[-1]
    mem_gb = {
        "filesize": bold_size_gb,
        "resampled": bold_size_gb * 4,
        "largemem": bold_size_gb * (max(bold_tlen / 100, 1.0) + 4),
    }
    return bold_tlen, mem_gb

I don't really remember the logic for the largemem one. But you should be able to see the estimates in your logs.
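
For anyone wanting to check those estimates directly, a hypothetical usage sketch (the file path is a placeholder; it assumes the _create_mem_gb function quoted above has been pasted into a session where nibabel and numpy are available):

import nibabel as nb
import numpy as np

# _create_mem_gb as quoted above, then:
bold_tlen, mem_gb = _create_mem_gb("sub-01_ses-001_task-rest_echo-1_bold.nii.gz")  # placeholder path
print(bold_tlen, mem_gb)
# -> number of TRs and {'filesize': ..., 'resampled': ..., 'largemem': ...} in GB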

effigies (Member) commented Nov 7, 2023

Let's go ahead and link ME-ICA/tedana#856.
