
Community feedback needed: (pre)-validation at the Sarek level #798

Open
maxulysse opened this issue Oct 13, 2022 · 33 comments
Labels: community-feedback, enhancement (New feature or request)

@maxulysse (Member)
Description of feature

It could be good to do whatever we can at release time to help prepare for validation.
I'm thinking benchmarking, GiAB, checking results...
It could be interesting to talk with multiple collaborators to figure out what can be done and how we can help the community.

maxulysse added the enhancement (New feature or request) label Oct 13, 2022
@FriederikeHanssen (Contributor)

Tagging @adamrtalbot @c-mertes @ggabernet @asp8200 @jemten as we all are trying to do similar things and could join forces 🚀

@c-mertes

Thank you @FriederikeHanssen for tagging me. Yes, indeed, one of GHGA's goals is to automate benchmarking in a CI/CD fashion. We currently do this together with the NGS-Competence-Network. We have GiaB samples for WES and WGS available and are setting up a benchmarking pipeline right now. This pipeline could later be used to test SAREK on releases into master, and also to track performance over time.

In the case of SAREK we would need to run SAREK with all combinations of alignment tools and variant calling tools to benchmark every combination.

This is actually what we are doing right now to find the best configuration for GHGA. @nickhsmith is doing this.

Happy to join forces here.

@nicorap commented Oct 13, 2022

So at DNGC we have quite a thorough analysis plan for germline testing of pipelines. It's even been implemented in DSL2. I think we can definitely share details and the git repo here. For somatic benchmarking we have little, but at least some directions.

@maxulysse (Member, Author)

@apeltzer might you be interested as well?

@apeltzer (Member)

Yes, count me in 👍😬

@adamrtalbot (Contributor)

Yep, count me in. Lots of people are trying to solve this problem so might as well share it.

@jemten commented Oct 14, 2022

Definitely interested! We validate our old germline pipeline by running GIAB samples, as well as a selection of our clinical samples with known pathogenic variants. We can also think about whether we should/could present the validation results in an automated fashion in the README or on the release page.

@adamrtalbot (Contributor)

What sort of somatic mutations are you looking for?

@FriederikeHanssen (Contributor)

What tool are you all using for the evaluation of the VCFs? hap.py/som.py?

@adamrtalbot (Contributor)

hap.py using the RTG vcfeval engine. Use the GIAB stratifications for easy or hard regions: https://github.com/genome-in-a-bottle/genome-stratifications

@c-mertes

We also use hap.py + vcfeval on a GIAB sample with some extra tweaks. Currently, the pipeline is written in Snakemake.

@FriederikeHanssen (Contributor) commented Oct 14, 2022

🙈 Forgot @lassefolkersen. He is doing an amazing job benchmarking Sarek 3.0 at the moment. (Sorry Lasse, not on purpose, just my oversight after a long day yesterday!)

@nicorap commented Oct 14, 2022

Hehe, good to see that @lassefolkersen is on it. This is work we started at DNGC. Just to confirm: we use hap.py on a variety of regions. We also have some thoughts (and code) for benchmarking CNVs. On top of that, we also collect some information on BAMs and VCFs. It would be good to have a final module that produces a report where Sarek (or another pipeline) is compared to DRAGEN and Sentieon.

@adamrtalbot (Contributor)

Definitely - run comparisons as part of CI/CD and you can track performance over time.

@maxulysse (Member, Author)

@drpatelh and @robsyme, you should participate in this.

@FriederikeHanssen (Contributor) commented Oct 17, 2022

Not sure how easy it would be to compare multiple different workflows. In the long run it would also be great to compare all previous versions and any tool combination. However, I think we would then need a second workflow that lives outside of Sarek and is triggered on results generation, plus some DB that would keep all these results around.

As an MVP, I have started working on a subworkflow that compares the output a specific Sarek run generates with a truth data set when a benchmarking flag is set. This can then be enabled on the AWS full-size tests and would also empower anyone to quickly run their own benchmark.
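A release-time check like the one described above could, outside of any workflow engine, boil down to a threshold test on hap.py's summary CSV. This is a minimal sketch only: the CSV contents below are fabricated example data, and the 0.99 recall threshold is a made-up placeholder, not an agreed value.

```shell
# Fabricated example of (a subset of) a hap.py summary.csv, for illustration.
cat > summary.csv <<'EOF'
Type,Filter,METRIC.Recall,METRIC.Precision
INDEL,PASS,0.9581,0.9612
SNP,PASS,0.9932,0.9987
EOF

# Fail the release if PASS SNP recall drops below the agreed threshold.
awk -F, '$1=="SNP" && $2=="PASS" { exit !($3 >= 0.99) }' summary.csv \
  && echo "benchmark OK" || echo "benchmark FAILED"
# prints: benchmark OK
```

A real run would of course read the summary.csv produced by hap.py rather than a heredoc, and would probably check several Type/Filter rows.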

@johanneskoester

As @c-mertes mentioned, we have an ongoing benchmark project within the NGS-CN SIG4, which is meant to continuously benchmark any kind of variant calling pipeline. Sarek is already included there by @ggabernet.

The workflow is based on a combination of Zenodo (for callsets from pipelines), Snakemake, hap.py and custom code for the reporting. I would be happy to discuss this in detail. You could either join our SIG4 meeting next Monday, or we can make a separate call. Just send me an email in any case.

@ggabernet (Member)

Thanks @johanneskoester! Yes, exactly: in that benchmark I included Sarek 2.7 with the variant callers Strelka2, HaplotypeCaller and FreeBayes, as only germline variant calls are benchmarked. The purpose of that benchmark is to compare the VC results to other variant calling pipelines used at Bioinformatics cores across Germany (the NGS Competence Network, NGS-CN).

However, I see that this is not incompatible with adding a benchmarking step within Sarek that would run hap.py to benchmark the variant callers (and all the other parameters of Sarek) at the Sarek level only, and that runs automatically on each pipeline release. That would also allow us to automatically check that the benchmarking results stay consistent across releases.

Additionally, we could also automatically submit the results for every Sarek release to the NGS-CN pipeline benchmark. So far this process still requires some manual intervention, i.e. uploading the results to Zenodo and creating a PR with the details. Maybe we could also work on automating that? Would that be within the scope of that benchmark, @johanneskoester? (I'll be at the meeting next Monday so we can also discuss this there...)

@nicorap commented Oct 20, 2022

It looks like there's a lot going on here. Maybe we should all have a call soon to align. For example, when we run validation of a pipeline, we run no less than 400 WGS samples through at various coverages etc. This costs resources, time, etc. We also have a test each month with one GIAB sample to track drift in the wet-lab setup. That's maybe overkill :).

@johanneskoester commented Oct 20, 2022

> @ggabernet: Additionally, we could submit the results also automatically for every Sarek release to the NGS-CN pipeline benchmark. So far this process still requires some manual intervention [...] Maybe we could also work on automating that? Would that be within the scope of that benchmark @johanneskoester?

Absolutely, automating this would be awesome. Do you think you can run the entire Sarek pipeline on one of the GIAB datasets within GitHub Actions CI (i.e. within the 6-hour limit on a single thread), or do you have a different CI for that?

@nicorap commented Oct 20, 2022

> @johanneskoester: Do you think you can run the entire Sarek pipeline on one of the GIAB datasets within GitHub Actions CI (i.e. within the 6-hour limit on a single thread), or do you have a different CI for that?

This is not gonna work. I like the idea, though. :)

@FriederikeHanssen (Contributor)

Hi all! We use AWS for running full-size tests on release. To keep costs somewhat reasonable, we can run one GiAB sample there, plus HCC1395 for the somatic track, on each release. We are already doing this, but are not further processing the output. The idea behind this particular benchmarking effort is to check the generated variant calling results against the ground truth on each release, giving us and users reassurance that everything is working as expected. In addition, it would give users a quick way to run their own benchmarking datasets through, without necessarily needing to set up additional benchmarking pipelines.
So it would very much be self-contained within the pipeline, and not so much an overarching pipeline that runs all existing variant calling pipelines.
My current impression is that we have two benchmarking approaches that could very well live next to each other. How are you making the benchmarking results available to the general public? Can they be easily added to the nf-core website to make them easily findable for users?

GitHub Actions has a time limit of 6 hours per job, with around 7 GB memory and 2 CPUs IIRC, so while it is very suitable for tests with small data, complete datasets cannot be used there.

A meeting sounds like a good idea. I am on vacation (supposedly :D) but will be around again the week after next.

@ggabernet (Member)

Also, to clarify: this is also done through a GitHub Actions workflow (https://github.com/nf-core/sarek/blob/master/.github/workflows/awsfulltest.yml), but that action just submits the Sarek run to AWS Batch via Nextflow Tower.

We've already added the NA12878 GIAB sample to AWS S3, but we don't use the full WGS; rather, a 30x subsample of WXS to reduce costs on AWS. The config for the germline full test can be found here: https://github.com/nf-core/sarek/blob/master/conf/test_full_germline.config

@lassefolkersen (Contributor)

@ggabernet how do you handle limiting the cost of the GitHub Actions part when you run full genomes that way? Doesn't it become unmanageable when a full genome (=30x, I mean) is run on every commit? Or do you just limit the number of commits?

@adamrtalbot (Contributor) commented Oct 24, 2022

We wire our full-sized test into our release pipeline. So we hit the 'release' button, it runs the full-sized test, and only if it passes will it push a draft release to GitHub. Added bonus: you could put the results in the release notes. I haven't done this, but it's a good idea! We don't use GitHub Actions, so not much help right now.

@lassefolkersen (Contributor)

@adamrtalbot so it's configured to only run (that) GitHub action when you do actual releases? That sounds very smart 👍 Do you have any code links to look at for that?

@adamrtalbot (Contributor)

It's wired the other way. In our CI/CD platform there is the concept of CI/CD tests and a release pipeline. You hit the release button; it runs the minimal tests to confirm the code works; if these pass, it runs the bigger tests to check the pipeline is valid. If these pass, it runs nf-core bump and creates a draft release.

The point is to only run the big tests at snapshots rather than on every single commit.

@ggabernet (Member)

It's part of the GitHub Actions YAML: you can specify that any GitHub action is triggered only on certain events, e.g. releases. This one also has a trigger button, so that you can test it once before releasing (but it shouldn't be abused).
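For reference, the trigger section of such a workflow might look like the following. This is a sketch of the general pattern, not the actual contents of Sarek's awsfulltest.yml:

```yaml
# Run the full-size test only when a release is published,
# plus a manual "trigger button" (workflow_dispatch) for one-off runs.
on:
  release:
    types: [published]
  workflow_dispatch:
```

The `workflow_dispatch` event is what provides the manual run button in the GitHub Actions UI.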

@FriederikeHanssen (Contributor)

> @adamrtalbot: We wire in our full sized test to our release pipeline. So we hit the 'release' button, it runs the full sized test and only if it passes will it push a draft release to Github. [...] you could put the results in the release notes

Uh, that is cool.

@FriederikeHanssen (Contributor)

> @lassefolkersen: how do you handle limiting the cost of the GitHub actions part when you run full genomes that way? Doesn't it become un-manageable when a full genome (=30x I mean) is run on every commit? Or you just limit the number of commits?

We currently only run on release and not on every commit (that would be way too expensive, and I also don't think we would gain any insights from that).

@FriederikeHanssen (Contributor)

Thank you all for the input! 🥳 I will send around a Doodle next week and we can meet up to discuss more details.

@lassefolkersen (Contributor)

So we talked a little more about this today (@FriederikeHanssen @asp8200 @ggabernet), and it seems the most feasible solution right now, with resources as they are, is to build on the awsfulltest files here. They are (smartly) from GiAB, and there already is a hap.py module. So the only other things needed are the GiAB gold-standard file (HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz in the example command below), which is on their FTP server, and a BED file to restrict checking to the high-confidence regions. And that's it.

Then we are thinking to implement that as a very short subworkflow, based on something like the command below, to be called only on release, just saving the hap.py output txt file. @ggabernet maybe you can fill in the auto-trigger things I missed here? 😄 And of course, open discussion welcome; this is why I post it here, in case anyone can see an even smarter solution (I've no doubt people can see more extensive solutions, but small steps first 😃).


# hap.py: truth VCF first, then query VCF; confine the comparison to the
# GIAB high-confidence regions via -f
docker run -it -v \
  /home/ubuntu:/data pkrusche/hap.py /opt/hap.py/bin/hap.py \
  /data/data2/giab_HG002_benchmark/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
  /data/data2/giab_HG002_benchmark/HG002_deep_one_file_a.deepvariant.vcf.gz \
  -o /data/data2/giab_HG002_benchmark/HG002_deep_one_file_a.deepvariant \
  -r /data/reference/Homo_sapiens_assembly38.fasta \
  -f /data/data2/giab_HG002_benchmark/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
  --pass-only

@adamrtalbot (Contributor)

Sounds like a good plan!

Here are the GIAB stratifications: https://github.com/genome-in-a-bottle/genome-stratifications and how to use them: https://github.com/Illumina/hap.py/blob/master/doc/happy.md#stratification-via-bed-regions

and here's how to use RTG-eval as the engine: https://github.com/Illumina/hap.py/blob/master/doc/happy.md#using-rtg-tools-vcfeval-as-the-comparison-engine

This will give you more data than you know what to do with!
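Combining those two suggestions with Lasse's command above might look like this. The stratification TSV name and all file paths here are placeholders, not real locations; the sketch just assembles and prints the hap.py invocation rather than running it:

```shell
# Assemble a hap.py invocation that adds GIAB stratifications and the RTG
# vcfeval comparison engine to the basic truth-vs-query comparison.
# All filenames below are placeholders.
TRUTH=HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
QUERY=sample.deepvariant.vcf.gz
STRAT=GRCh38-all-stratifications.tsv   # TSV from the genome-stratifications repo

CMD="hap.py $TRUTH $QUERY \
  -r Homo_sapiens_assembly38.fasta \
  -f HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
  --stratification $STRAT \
  --engine vcfeval \
  --pass-only -o HG002_vs_truth"

echo "$CMD"
```

With `--stratification`, hap.py reports the metrics per region set (easy, hard, homopolymer, etc.) in the extended CSV output, which is where the "more data than you know what to do with" comes from.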

maxulysse added this to the 3.3 milestone Feb 21, 2023
maxulysse changed the title from "Have (pre) validation at a Sarek level" to "Community feedback needed: (pre)-validation at a Sarek level" Mar 4, 2023
maxulysse changed the title to "Community feedback needed: (pre)-validation at the Sarek level" Mar 4, 2023
maxulysse modified the milestones: 3.3, 3.4, 3.5 Feb 8, 2024
maxulysse pinned this issue Feb 15, 2024
10 participants