Make the pipeline run with different file sizes #108

Open
szymonwieloch opened this issue Feb 4, 2021 · 1 comment
@szymonwieloch

szymonwieloch commented Feb 4, 2021

Hi! I have a problem running this pipeline. It seems to choose memory requirements for input files incorrectly, which is especially problematic with very big files. The biggest file in my tests was 16 GB, but in the future we may have much bigger ones. A file of that size requires 256 GB of memory for the run_optitype process.

My issue is that, by default, the hlatyping pipeline does not let you handle such big files. The only workaround I found was to create an additional configuration file, extra.config, and pass it to Nextflow with the -c parameter to override the default configuration. My expectation is that the pipeline should let you process your data using command line parameters alone. This didn't work because:

1. Problems with setting maxRetries

For some strange reason, when I tried to increase the number of retries with -process.maxRetries 5, it didn't work and the default value of 1 was used. When I instead set maxRetries = 5 in the extra.config file, for some strange reason I saw only 2 retries. All failing processes were finishing with exit code 137 and should have been retried 5 times with increasing memory. I am not sure whether this is a problem with this pipeline or with Nextflow, but I haven't experienced such problems with other pipelines.
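
For reference, what I mean by the override is roughly the following sketch (the errorStrategy closure is only there to illustrate retrying on exit code 137; the pipeline's actual default strategy may differ):

// extra.config, passed with: nextflow run nf-core/hlatyping ... -c extra.config
process {
    // retry out-of-memory failures (exit code 137) instead of failing the run
    errorStrategy = { task.exitStatus == 137 ? 'retry' : 'finish' }
    maxRetries    = 5
}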

2. Slow memory adaptation mechanism

The current memory adaptation mechanism is extremely slow:

memory = { check_max( 7.GB * task.attempt, 'memory' ) }

Reaching the required 256 GB of RAM for my samples would take 37 retries; processing a 50 GB sample file would take around 116. There are two good approaches to fixing that:

A. Change the algorithm to exponential adaptation:

memory = { 8.GB * (2 ** (task.attempt - 1)) }

This would require only 6 retries for a 16 GB file and 8 retries for a 50 GB file, and it wouldn't cause a huge resource overhead.
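
For example, keeping the pipeline's existing check_max guard (assuming that helper is in scope where the directive is defined), option A could look like the sketch below. Note that the power operator in Groovy is **, since ^ is bitwise XOR:

// sketch: doubles the memory request on every retry, capped by check_max
memory        = { check_max( 8.GB * (2 ** (task.attempt - 1)), 'memory' ) }
errorStrategy = 'retry'   // illustrative; the pipeline's conditional strategy could stay
maxRetries    = 6         // 8, 16, 32, 64, 128, 256 GB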

B. Calculate memory requirement from the input file size.

The task object should give you access to the input files. This lets you check the sample size and calculate the amount of memory required. I suspect there is a linear relation between the input file size and the actual memory requirement, so a simple linear equation should give you a precise amount of memory for a given sample. This approach requires more work, obtaining real memory usage for several samples and checking the actual relationship, but eventually no retries would be needed.
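
A rough sketch of what B could look like, assuming the input file is visible to the memory directive's closure (reads stands in for the process's actual input file variable). The 16x factor is derived only from the single 16 GB -> 256 GB observation above and would have to be fitted against real measurements:

process run_optitype {
    // hypothetical linear model: ~16 GB of RAM per GB of input, minimum 16 GB
    memory {
        def inputGb = Math.ceil( reads.size() / (1024 * 1024 * 1024) ) as long
        16.GB * Math.max( 1L, inputGb )
    }

    input:
    path reads

    script:
    """
    # actual OptiType call unchanged
    """
}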

@christopher-mohr
Collaborator

Hi @szymonwieloch, thanks for reporting and providing detailed information on this. We will check this and get back to you.

@christopher-mohr christopher-mohr added the enhancement New feature or request label Feb 4, 2021
@christopher-mohr christopher-mohr added this to the 1.1.3 milestone Feb 4, 2021
@apeltzer apeltzer modified the milestones: 1.1.3, 2.0 Mar 17, 2021
@christopher-mohr christopher-mohr modified the milestones: 2.0, 2.1 Oct 17, 2022