Reimplement samtools stats #1997

jkbonfield · 2024-02-22T09:52:37Z

This is long overdue. Its memory usage is excessive on long-read technologies, and the output simply isn't that useful either.

It was first written in an era of small (e.g. 75bp) fixed-size alignments. With reads potentially 1MB long it becomes totally unwieldy and consumes many GBs of RAM doing things which are, frankly, not remotely useful to the end user. An example is FFQ, which reports the frequency of quality values as a table indexed by quality value and by base position. Per-base-position data isn't useful when you have MB-long reads! A consumer of the data would need to do some aggregating and smoothing to get useful results, so the program should do that itself, perhaps with a parameter to control it, or maybe using a log scale so bins grow the further out you get.
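As a rough illustration of the log-scale idea, here is a minimal Python sketch. It is not how samtools stats works today, and the names `log2_bin`, `bin_range` and `binned_quality_counts` are hypothetical. It assumes power-of-two bin widths and counts only (bin, quality) pairs that actually occur, so the table grows with log2 of the longest read rather than with its length.

```python
from collections import Counter

def log2_bin(pos):
    """Map a 0-based read position to a bin whose width doubles each step.

    Bin 0 holds position 0; bin b >= 1 holds positions 2**(b-1) .. 2**b - 1.
    (Assumed binning scheme, chosen for illustration only.)
    """
    return pos.bit_length()

def bin_range(b):
    """Return the inclusive (first, last) positions covered by bin b."""
    if b == 0:
        return (0, 0)
    return (1 << (b - 1), (1 << b) - 1)

def binned_quality_counts(quals_per_read):
    """Count quality values per log-scale position bin.

    quals_per_read is an iterable of per-read lists of integer qualities.
    Only combinations that occur are stored, so binned-quality instruments
    do not produce rows full of zeros.
    """
    counts = Counter()
    for quals in quals_per_read:
        for pos, q in enumerate(quals):
            counts[(log2_bin(pos), q)] += 1
    return counts

# Example: one 5-base read with two distinct quality values.
example = binned_quality_counts([[30, 30, 20, 20, 20]])
for (b, q), n in sorted(example.items()):
    first, last = bin_range(b)
    print(f"positions {first}-{last}\tqual {q}\tcount {n}")
```

With this scheme a read one million bases long touches only 21 position bins, instead of one million table rows.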

Additionally, reporting per quality value produces excessive tables on instruments that bin their qualities (most of the table is full of zeros on modern Illumina or Revio).

However, this would change the output formats. Hence I think a stats2 is the better solution, but that is a longer-term wish-list item.
In the shorter term, we may just wish to have command line options that disable some features, so we can get basic stats without the worst excesses.

markjschreiber · 2024-04-01T17:57:19Z

I was also recently wondering what the best way would be to add flags to samtools stats to control which statistics are calculated. In my case I usually only need the number of reads (potentially filtered by the flag bits) and sometimes the total base count of the reads. To avoid excess compute I have made an app to do this, based on one of the htslib demo apps, but I think it would be nice to have this option in the official stats app.
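For illustration only, the kind of minimal counting described above could look something like the following Python sketch. It is not the app mentioned in this comment and not part of samtools. The function name `count_reads_and_bases` and its parameters are hypothetical, and it assumes SAM text input (for example the output of `samtools view`) rather than reading BAM directly.

```python
def count_reads_and_bases(sam_lines, require_flags=0, exclude_flags=0):
    """Count alignment records and their total sequence length.

    sam_lines: iterable of SAM text lines (header lines are skipped).
    require_flags: a record is kept only if all of these FLAG bits are set.
    exclude_flags: a record is dropped if any of these FLAG bits are set.
    """
    n_reads = 0
    n_bases = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if (flag & require_flags) != require_flags:
            continue
        if flag & exclude_flags:
            continue
        n_reads += 1
        seq = fields[9]
        if seq != "*":  # "*" means the sequence is not stored
            n_bases += len(seq)
    return n_reads, n_bases

# Example: skip secondary (0x100) and supplementary (0x800) records so
# each read is counted once.
records = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r1\t256\tchr2\t9\t0\t3M\t*\t0\t0\tACG\tIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
reads, bases = count_reads_and_bases(records, exclude_flags=0x900)
print(f"reads={reads} bases={bases}")
```

Passing `sys.stdin` as `sam_lines` would let the same function consume piped `samtools view` output. Parsing text this way is much slower than using htslib directly, so treat it as a description of the desired behaviour rather than a replacement.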

jkbonfield · 2024-04-04T11:11:42Z

Thanks for the suggestion. Please do keep them coming, although right now this is rather a wish-list item and we haven't yet decided what priority the many competing ideas have, so I don't have any timescales for rewrites.

jkbonfield mentioned this issue Apr 4, 2024

plot-bamstats: Issue with accuracy of quality score plot #2017

Closed
