Input coverage for `sequana_coverage` #555

vladsavelyev · 2019-02-03T01:20:00Z

Hi,

Thanks for your work on Sequana. I really appreciate that you are making parts of the pipeline usable standalone, like sequana_coverage. I have a couple of requests regarding it.

First, you call the input file "BED", however technically it's not. You request the 3rd column to be the coverage:

    - a BED file that is a tabulated file at least 3 columns.
      The first column being the reference, the second is the position
      and the third column contains the coverage itself.

However, by the standard, the 3rd column must be the end coordinate of a region, with the 2nd column being the start of this region, 0-based:

The first three required BED fields are:
chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
chromEnd - The ending position of the feature in the chromosome or scaffold.

However, I guess it's usually fine to put values like coverage into the optional 4th column, but the first 3 should really stay coordinates. So I am wondering if you could disable this check by any chance?

This leads to the next request: mosdepth provides a method to generate per-base coverage that is twice as fast as samtools depth. It also generates a genuine BED file, compressing consecutive bases of the same coverage into regions, e.g. at the beginning of a chromosome it would typically have

21      0       9411191 0
21      9411191 9411192 1
...

Instead of repeated

21    0   0
21    1   0
21    2   0
...

This saves a lot of disk space (the samtools depth output for a whole genome took 45G in my test run).
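To illustrate how the two layouts relate, here is a minimal Python sketch (not Sequana or mosdepth code) that expands run-length BED intervals back into one row per base. The function name `expand_bed_coverage` and the `one_based` switch are hypothetical and introduced only for this example; it assumes four whitespace-separated columns (chrom, start, end, coverage) with a 0-based start and an exclusive end.

```python
from typing import Iterable, Iterator, Tuple


def expand_bed_coverage(
    lines: Iterable[str], one_based: bool = True
) -> Iterator[Tuple[str, int, int]]:
    """Yield (chrom, position, coverage) for every base of each BED interval.

    Assumption: each line holds chrom, start, end, coverage, where start is
    0-based and end is exclusive (the BED convention quoted above).
    """
    # Hypothetical switch: report positions as 1-based or 0-based.
    offset = 1 if one_based else 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom, start, end, cov = line.split()[:4]
        for pos in range(int(start), int(end)):
            yield chrom, pos + offset, int(cov)


rows = list(expand_bed_coverage(["21\t0\t3\t0", "21\t3\t4\t1"]))
# rows == [("21", 1, 0), ("21", 2, 0), ("21", 3, 0), ("21", 4, 1)]
```

Because it is a generator, the expansion can be streamed row by row without materialising the per-base table on disk.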

Also, mosdepth can generate a window-based coverage, which can be used directly for sequana_coverage visualizations, saving much more computation and disk space. Wondering if you could consider using input from mosdepth instead and even running it internally for BAM inputs?
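As a rough illustration of what window-based coverage means, the sketch below averages run-length intervals over fixed-size windows in plain Python. It is not how mosdepth or Sequana implement it; the function name `windowed_mean_coverage` and the tuple layout `(chrom, start, end, coverage)` are assumptions made for this example, with 0-based, end-exclusive coordinates.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Interval = Tuple[str, int, int, float]  # (chrom, start, end, coverage)


def windowed_mean_coverage(
    intervals: Iterable[Interval], window: int
) -> Dict[Tuple[str, int], float]:
    """Return mean coverage keyed by (chrom, window_start).

    Assumption: intervals are 0-based, end-exclusive and do not overlap.
    The mean is taken over the bases actually described by the intervals,
    so a shorter final window is averaged over its real length.
    """
    totals: Dict[Tuple[str, int], float] = defaultdict(float)
    bases: Dict[Tuple[str, int], int] = defaultdict(int)
    for chrom, start, end, cov in intervals:
        pos = start
        while pos < end:
            win_start = (pos // window) * window
            # Clip the interval at the boundary of the current window.
            stop = min(end, win_start + window)
            n = stop - pos
            totals[(chrom, win_start)] += cov * n
            bases[(chrom, win_start)] += n
            pos = stop
    return {key: totals[key] / bases[key] for key in totals}


means = windowed_mean_coverage([("21", 0, 6, 0), ("21", 6, 10, 4)], window=5)
# Window starting at 0 has mean 0.0; window starting at 5 has mean 16 / 5.
```

The result holds one value per window instead of one per base, which is where the savings in computation and disk space come from.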

Vlad


cokelaer · 2019-02-06T10:26:17Z

Hi Vlad, thanks, this is very helpful. This won't be done immediately, but it looks very promising indeed. I will implement this feature (mosdepth). As for the BED file, thanks for the clarification. We were indeed a bit lazy in calling the input file a BED file. I will leave this issue aside for now but will come back to it in Feb/March if possible.

cokelaer added the Todo label Feb 6, 2019

cokelaer self-assigned this Feb 6, 2019

cokelaer mentioned this issue Jun 10, 2019

clarify bed versus bedgraph format bioconvert/bioconvert#236

Closed
