maximum number of gather hashes on report graph and finding numbers of hashes that don't match anything #266

Open
jessicalumian opened this issue Feb 7, 2023 · 2 comments

Comments

@jessicalumian

Hello I have some questions!

  1. How can I modify the gather hashes vs mapped bp graph to show more than 60 genomes?
  2. How can I see the number of hashes that don't match anything in GTDB?
  3. Can I get the answers to 1 and 2 if I am using the GTDB database and providing another database in the same run?

Bonus question:

Is there a way to easily find out the amount of genome covered of a specific genome for different runs of genome-grist? Say I am looking for microbe X in five different microbiome samples and I want to know how many hashes match microbe X and what percentage of genome is covered in those samples. I imagine I could look at the report graphs but wondering if there's another way.

@ctb
Member

ctb commented Feb 7, 2023

> Hello I have some questions!

and I have answers!

> 1. How can I modify the gather hashes vs mapped bp graph to show more than 60 genomes?

The reports are generated from template notebooks in `genome_grist/notebooks` that are filled in and executed. The filled-in notebooks are available in `outputs.*/reports/*.ipynb`, and you can open them there, modify them, and re-run them directly.

In this case you want `report-mapping-{sample}.ipynb`. You should be able to change the number 60 near the top of it; see `NUM=60`.
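If you'd rather script that edit than open the notebook by hand, here is a minimal sketch. It assumes only what is stated above (a line `NUM=60` in a code cell near the top of the notebook) plus the standard `.ipynb` JSON layout; the helper names `set_num` and `rewrite_notebook` are made up for illustration and are not part of genome-grist.

```python
import json
import re


def set_num(notebook: dict, new_value: int) -> int:
    """Rewrite `NUM=<int>` assignments in code cells; return how many changed."""
    # Matches a line that starts with NUM=<digits> (optional spaces around '=').
    pattern = re.compile(r"^(\s*NUM\s*=\s*)\d+", flags=re.MULTILINE)
    changed = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # nbformat stores cell source either as one string or a list of lines.
        as_list = isinstance(source, list)
        text = "".join(source) if as_list else source
        new_text, n = pattern.subn(rf"\g<1>{new_value}", text)
        if n:
            changed += n
            cell["source"] = new_text.splitlines(keepends=True) if as_list else new_text
    return changed


def rewrite_notebook(path: str, new_value: int) -> int:
    """Load a notebook from `path`, update NUM, and write it back in place."""
    with open(path, encoding="utf-8") as fh:
        nb = json.load(fh)
    changed = set_num(nb, new_value)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(nb, fh, indent=1)
    return changed
```

Changing the value only edits the notebook file; the graph updates once the notebook is re-executed, for example by opening it in Jupyter and running all cells.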

If there are things we can do to make this notebook easier to edit let me know :). Haven't paid much attention to it in a while...

> 2. How can I see the number of hashes that don't match anything in GTDB?

See `outputs.*/{sample}.yaml`. The `unknown_hashes` value is what you want. See also `total_hashes` and `known_hashes`.
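A small sketch for turning those numbers into a fraction. It assumes that `unknown_hashes` and `total_hashes` are top-level integer keys in the YAML file, which I haven't verified against a real output, and that PyYAML is installed for the loading step. `unknown_fraction` and `load_info` are hypothetical helper names.

```python
def unknown_fraction(info: dict) -> float:
    """Fraction of hashes with no database match (unknown_hashes / total_hashes)."""
    total = info["total_hashes"]
    if not total:
        return 0.0  # avoid dividing by zero on an empty sample
    return info["unknown_hashes"] / total


def load_info(path: str) -> dict:
    """Load the per-sample YAML file; requires the third-party PyYAML package."""
    import yaml  # imported here so unknown_fraction works without PyYAML

    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh)
```

Typical use would be `unknown_fraction(load_info(path))`, with `path` pointing at the `{sample}.yaml` file for your run.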

> 3. Can I get the answers to 1 and 2 if I am using the GTDB database and providing another database in the same run?

The numbers will be calculated with respect to the combined databases.

> Bonus question:
>
> Is there a way to easily find out the amount of genome covered of a specific genome for different runs of genome-grist? Say I am looking for microbe X in five different microbiome samples and I want to know how many hashes match microbe X and what percentage of genome is covered in those samples. I imagine I could look at the report graphs but wondering if there's another way.

hmm. ...yes... if I understand your question correctly...

`outputs.*/gather/{sample}.gather.csv` will contain the sourmash/hash information. You're looking for one of the columns `f_orig_query`, `f_match`, `f_unique_to_query`, `f_unique_weighted`, `average_abund`, `median_abund`, `std_abund`, `f_match_orig`, `unique_intersect_bp` in the row where `name` matches your desired microbe.

For the mapping coverage, look at `outputs.*/mapping/{sample}.summary.csv`. You're looking for `f_covered_bp`.
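To pull those values for one microbe across several samples, something like the following could work. It is a sketch using only the standard library. The gather column names come from the list above; the column that identifies the genome in the mapping summary CSV is not stated here, so check that file's header and pass the right name yourself. `rows_matching` and `collect` are hypothetical helpers.

```python
import csv
import io


def rows_matching(csv_file, column: str, needle: str) -> list:
    """Return rows whose `column` contains `needle` (case-insensitive)."""
    reader = csv.DictReader(csv_file)
    return [
        row for row in reader
        if needle.lower() in (row.get(column) or "").lower()
    ]


def collect(samples: dict, match_column: str, needle: str, fields: list) -> list:
    """samples maps a sample name to an open CSV file (or any iterable of lines)."""
    out = []
    for sample, handle in samples.items():
        for row in rows_matching(handle, match_column, needle):
            # Keep only the requested columns; values stay as strings.
            out.append({"sample": sample, **{f: row.get(f) for f in fields}})
    return out


# Usage with the gather output (paths are placeholders for your own runs):
#   with open(gather_csv_path, newline="") as fh:
#       collect({"sample1": fh}, "name", "microbe X", ["f_match", "unique_intersect_bp"])
# For the mapping summary, request "f_covered_bp" and set match_column to
# whichever column names the genome in that file (an assumption to verify).
```

Substring matching on `name` is a convenience choice; if two genomes share part of a name you will get both rows back, so matching on an accession is safer.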

There are some details - like whether you want the stats for the metagenome x genome, or leftover metagenome x genome - but first I'd suggest that you go get confused by what's there and then come back and ask questions ;)

@ctb
Member

ctb commented Feb 7, 2023

p.s. great questions!
