Input coverage for sequana_coverage #555

Open
vladsavelyev opened this issue Feb 3, 2019 · 1 comment

@vladsavelyev

vladsavelyev commented Feb 3, 2019

Hi,

Thanks for your work on Sequana. I really appreciate that you are making parts of the pipeline usable standalone, like sequana_coverage. I have a couple of requests regarding it.

First, you call the input file "BED", but technically it isn't one. You require the 3rd column to be the coverage:

    - a BED file that is a tabulated file at least 3 columns.
      The first column being the reference, the second is the position
      and the third column contains the coverage itself.

However, by the standard, the 3rd column must be the end coordinate of a region, with the 2nd column being the start of this region, 0-based:

The first three required BED fields are:
chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
chromEnd - The ending position of the feature in the chromosome or scaffold.

However, I guess it's usually fine to put values like coverage into the optional 4th column, but the first 3 should really stay coordinates. So I am wondering if you could disable this check by any chance?
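To make the difference between the two layouts concrete, here is a minimal sketch of a parser that accepts both the 3-column "reference, position, coverage" form and a BED-style "chrom, start, end, coverage" form. The names `CoverageRecord` and `parse_coverage_line` are hypothetical and are not part of Sequana; whether 3-column positions are 0-based or 1-based is left as a parameter because it depends on the tool that produced the file (samtools depth reports 1-based positions).

```python
from typing import NamedTuple


class CoverageRecord(NamedTuple):
    """One coverage value over a half-open, 0-based interval [start, end)."""
    chrom: str
    start: int
    end: int
    coverage: float


def parse_coverage_line(line: str, one_based: bool = False) -> CoverageRecord:
    """Parse a whitespace-separated coverage line (hypothetical helper).

    3 columns: reference, position, coverage (the layout described in the
               sequana_coverage help text quoted above).
    4+ columns: chrom, start, end, coverage (BED-style coordinates with the
               coverage in the 4th column).
    """
    fields = line.split()
    if len(fields) == 3:
        chrom, pos, cov = fields
        # Assumption: positions are used as-is unless the caller says the
        # file is 1-based, in which case they are shifted to 0-based.
        start = int(pos) - 1 if one_based else int(pos)
        return CoverageRecord(chrom, start, start + 1, float(cov))
    if len(fields) >= 4:
        chrom, start, end, cov = fields[:4]
        return CoverageRecord(chrom, int(start), int(end), float(cov))
    raise ValueError(f"expected at least 3 columns, got {len(fields)}: {line!r}")
```

With this normalisation, a single-position row and a multi-base BED region end up in the same representation, so downstream code does not need to know which layout the input used.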

This leads to the next request: mosdepth provides a method to generate per-base coverage that is twice as fast as samtools depth. It also generates a genuine BED file, compressing consecutive bases with the same coverage into regions, e.g. at the beginning of a chromosome it would typically have

21      0       9411191 0
21      9411191 9411192 1
...

Instead of repeated

21    0   0
21    1   0
21    2   0
...

This saves a lot of disk space (the samtools depth output for a whole genome took 45G in my test run).
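The compression described above is essentially run-length encoding of the per-position rows. As an illustration only (this is not mosdepth's implementation, and `compress_per_base` is a hypothetical name), a sketch assuming rows arrive sorted by position within each chromosome and positions are 0-based:

```python
from typing import Iterable, Iterator, List, Optional, Tuple

PerBase = Tuple[str, int, int]        # (chrom, 0-based position, coverage)
Interval = Tuple[str, int, int, int]  # (chrom, start, end, coverage), end exclusive


def compress_per_base(rows: Iterable[PerBase]) -> Iterator[Interval]:
    """Merge consecutive positions with identical coverage into BED regions."""
    current: Optional[List] = None  # open region as [chrom, start, end, coverage]
    for chrom, pos, cov in rows:
        extends = (
            current is not None
            and current[0] == chrom
            and current[2] == pos   # directly adjacent to the open region
            and current[3] == cov   # same coverage value
        )
        if extends:
            current[2] = pos + 1
        else:
            if current is not None:
                yield tuple(current)
            current = [chrom, pos, pos + 1, cov]
    if current is not None:
        yield tuple(current)
```

For example, the rows `("21", 0, 0)`, `("21", 1, 0)`, `("21", 2, 0)` collapse into the single region `("21", 0, 3, 0)`, which is where the disk space saving comes from on long stretches of constant coverage.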

Also, mosdepth can generate window-based coverage, which can be used directly for sequana_coverage visualizations, saving even more computation and disk space. I wonder if you could consider accepting input from mosdepth instead, and even running it internally for BAM inputs?
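To show what window-based coverage means in terms of the interval records above, here is a rough sketch that averages coverage over fixed-size windows. It is an illustration under stated assumptions, not mosdepth's or Sequana's code: `windowed_mean_coverage` is a hypothetical name, intervals are 0-based and half-open, and they are assumed to tile each chromosome without gaps, so the last window of a chromosome may be shorter than the window size.

```python
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int, float]  # (chrom, start, end, coverage), end exclusive


def windowed_mean_coverage(
    intervals: Iterable[Interval], window: int
) -> List[Tuple[str, int, int, float]]:
    """Return (chrom, window_start, window_end, mean coverage) per window."""
    if window <= 0:
        raise ValueError("window must be a positive integer")
    totals: Dict[Tuple[str, int], float] = {}  # summed per-base coverage
    lengths: Dict[Tuple[str, int], int] = {}   # number of bases seen
    for chrom, start, end, cov in intervals:
        pos = start
        while pos < end:
            idx = pos // window
            # Split the interval at window boundaries.
            chunk_end = min(end, (idx + 1) * window)
            key = (chrom, idx)
            totals[key] = totals.get(key, 0.0) + cov * (chunk_end - pos)
            lengths[key] = lengths.get(key, 0) + (chunk_end - pos)
            pos = chunk_end
    # Dicts keep insertion order, so windows come out in input order.
    return [
        (chrom, idx * window, idx * window + lengths[(chrom, idx)],
         totals[(chrom, idx)] / lengths[(chrom, idx)])
        for (chrom, idx) in totals
    ]
```

Working from compressed intervals keeps the cost proportional to the number of regions and windows rather than to the number of bases, which is the computational saving mentioned above.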

Vlad

@cokelaer
Collaborator

cokelaer commented Feb 6, 2019

Hi Vlad, thanks, this is very helpful. This won't be done immediately, but it looks very promising indeed. I will implement this feature (mosdepth). As for the BED file, thanks for the clarification. We were indeed a bit lazy in calling the input file a BED file. I will set this issue aside for now but will come back to it in Feb/March if possible.
