Reimplement samtools stats #1997

Open
jkbonfield opened this issue Feb 22, 2024 · 2 comments

Comments

@jkbonfield
Contributor

This is long overdue. Its memory usage is excessive on long-read technologies, and the output isn't that useful either.

It was first written in an era of small (e.g. 75 bp), fixed-size alignments. With reads potentially 1 Mbp long it becomes unwieldy and consumes many GB of RAM computing things that are, frankly, not remotely useful to the end user. An example is FFQ, which reports the frequency of quality values as a table indexed by quality value and by base position. Per-base-position data isn't useful when you have Mbp-long reads! A consumer of the data would need to aggregate and smooth it to get useful results, so the program should do that itself, perhaps controlled by a parameter, or by using a log scale so that bins grow wider the further along the read you get.
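As a rough illustration of the log-scale idea, here is a minimal Python sketch in which bin widths double along the read, so a 1 Mbp read needs about 20 position bins rather than a million rows. The function names (`log2_bin`, `bin_range`, `binned_quality_counts`) and the doubling scheme are assumptions made for this example; nothing here reflects how samtools stats is implemented or what a stats2 would do.

```python
from collections import Counter, defaultdict


def log2_bin(pos):
    """Map a 0-based read position to a bin whose width doubles each step.

    Bin k covers positions 2**k - 1 through 2**(k + 1) - 2, so bin 0 is
    position 0, bin 1 is positions 1-2, bin 2 is positions 3-6, and so on.
    """
    return (pos + 1).bit_length() - 1


def bin_range(k):
    """Return the inclusive (first, last) 0-based positions covered by bin k."""
    return (1 << k) - 1, (1 << (k + 1)) - 2


def binned_quality_counts(quals_per_read):
    """Count quality values per log-scale position bin.

    quals_per_read is an iterable of per-read lists of integer qualities.
    Returns {bin_index: Counter({quality: count})}.
    """
    counts = defaultdict(Counter)
    for quals in quals_per_read:
        for pos, qual in enumerate(quals):
            counts[log2_bin(pos)][qual] += 1
    return counts
```

Memory then scales with the logarithm of the longest read instead of its length. A real tool might prefer linear bins for the first few hundred bases and only switch to growing bins after that; the pure doubling scheme is just the simplest variant to show.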

Additionally, the per-quality breakdown produces excessive tables for instruments that bin quality values (most of the table is zeros on modern Illumina or Revio data).
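To make the point about binned qualities concrete, here is a small Python sketch contrasting a dense table with one row per possible quality value against a sparse table that only stores values actually observed. The function names and the example quality values are hypothetical; the only fixed fact used is that SAM's printable QUAL encoding allows Phred values 0 to 93.

```python
from collections import Counter

MAX_QUAL = 93  # highest Phred value representable in SAM's QUAL string


def dense_quality_table(quals_per_read):
    """One slot per possible quality value, whether or not it ever occurs."""
    table = [0] * (MAX_QUAL + 1)
    for quals in quals_per_read:
        for qual in quals:
            table[qual] += 1
    return table


def sparse_quality_table(quals_per_read):
    """Only the quality values that were observed, as sorted (qual, count)."""
    counts = Counter()
    for quals in quals_per_read:
        counts.update(quals)
    return sorted(counts.items())


# Hypothetical reads from an instrument that emits only four quality values.
reads = [[2, 12, 23, 37], [37, 37, 2]]
dense = dense_quality_table(reads)
sparse = sparse_quality_table(reads)
print(len(dense), "dense rows,", len(sparse), "sparse rows")
```

With per-position tables the saving multiplies: each position (or position bin) carries a handful of entries instead of a full row of mostly zeros.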

However, this would change the output formats. Hence I think a separate stats2 is the better solution, but that is a longer-term wish-list item.
Shorter term, we may just want command-line options that disable some features, so we can get basic stats without the worst excesses.

@markjschreiber

I was also recently wondering what the best way would be to add flags to samtools stats to control which statistics are calculated. In my case I usually only need the number of reads (potentially filtered by the flag bits) and sometimes the total base count of the reads. To avoid excess compute I have written an app to do this, based on one of the htslib demo apps, but I think it would be nice to have this option in the official stats app.
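For readers who want the same two numbers without a custom build, the counting described above can be sketched in a few lines of Python over SAM text (for example the output of `samtools view`). This is not the app mentioned in the comment and not a proposed samtools interface: the function name and the `require`/`exclude` parameters are hypothetical, modelled loosely on the `-f`/`-F` flag filters of `samtools view`.

```python
def count_reads_and_bases(sam_lines, require=0, exclude=0):
    """Count records and total sequence bases in SAM-formatted text lines.

    A record is kept only if every bit in `require` is set in its FLAG and
    no bit in `exclude` is set. Header lines (starting with '@') are skipped.
    Records whose SEQ is '*' (sequence not stored) add 0 bases.
    """
    n_reads = 0
    n_bases = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if (flag & require) != require or (flag & exclude):
            continue
        n_reads += 1
        seq = fields[9]
        if seq != "*":
            n_bases += len(seq)
    return n_reads, n_bases
```

Passing `sys.stdin` as `sam_lines` lets it consume piped `samtools view` output. Because each record is inspected once and nothing is stored per position, memory use stays constant regardless of read length.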

@jkbonfield
Contributor Author

Thanks for the suggestion. Please do keep them coming, although right now this is rather a wish-list item. We haven't yet decided what priority the many competing ideas have, so I don't have any timescales for a rewrite.
