snakemake pipeline viral genome pipe #749

Open
jb013b opened this issue Jan 3, 2018 · 2 comments


jb013b commented Jan 3, 2018

What is the best way to edit the .yaml file to use the viral genome analysis sections (assembly/intrahost variation)?

Would input files need to be depleted already?
Which directory within /data would the input files be placed in?
What would the appropriate file format be?

For editing the .yaml, does one remove the depletion section or leave it blank ("")?

If this has already been discussed, please point me in the right direction.
Thank you for all the help.
James


dpark01 commented Jan 3, 2018

Hi @jb013b,

While the documentation on this topic isn't great, there are a few higher-level resources you can look to.

For further info on the Snakemake pipelines, there's some overview here. Again, it's not perfect, but it can be handy to refer to. One thing to keep in mind is that Snakemake is all about specifying your desired end result; it will figure out what it needs to do to get there.
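
As a concrete illustration of that mental model, a Snakemake dry run lists every job it would execute to reach a requested target without actually running anything; the flags below are standard Snakemake options, the working directory is assumed to be the analysis directory containing your Snakefile and config.yaml, and all_assemble (one common target) is described in the next paragraph:

```bash
# From the analysis directory (assumed to contain the Snakefile and config.yaml):
# -n / --dry-run   prints the planned jobs without executing them
# -p               prints the shell command each rule would run
snakemake -n -p all_assemble
```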

One of the more common end points is the all_assemble Snakemake target, which tries to assemble a genome for every sample listed in samples-assembly.txt. Because these assemblies often succeed or fail for reasons that have nothing to do with computational correctness (i.e., assembly often fails because of your data), this tends to be an iterative process: run snakemake all_assemble, identify which samples did not have sufficient reads to produce a genome, manually remove them from samples-assembly.txt (usually moving them into samples-assembly-failures.txt), and try again until Snakemake succeeds on the all_assemble target (see the sketch below). You can start to think about inter- and intra-host variants after that, but there is more complexity there that we can discuss later.
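
Concretely, one pass through that loop might look like the shell sketch below. The all_assemble target and the samples-assembly*.txt filenames come from the description above, while the data/02_assembly/${sample}.fasta output location is only an assumption about where finished genomes land, so check your config.yaml for the actual path:

```bash
# Run (or re-run) assembly; --keep-going continues past per-sample failures.
snakemake --keep-going all_assemble

# List samples with no assembled genome.
# NOTE: data/02_assembly/${sample}.fasta is an assumed output path.
while read -r sample; do
    [ -s "data/02_assembly/${sample}.fasta" ] || echo "$sample"
done < samples-assembly.txt > failed_samples.txt

# Park the failures in samples-assembly-failures.txt, drop them from
# samples-assembly.txt, then re-run snakemake all_assemble.
cat failed_samples.txt >> samples-assembly-failures.txt
grep -v -x -F -f failed_samples.txt samples-assembly.txt > samples-assembly.txt.tmp
mv samples-assembly.txt.tmp samples-assembly.txt
```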

There are a few ways to provide input (see the sketch after this list):

- If you provide depleted uBAMs, you can place them at data/01_cleaned/samplename.cleaned.bam, but really, the Snakemake pipeline is meant to do that depletion for you.
- Undepleted (raw) uBAMs can go in data/00_raw/samplename.bam.
- The whole pipeline can actually start from an Illumina BCL directory, which lets you re-demultiplex, redefine samplesheets, and so on.
- Or you can put paired fastq files at data/00_raw/samplename_L001_R1_001.fastq.gz (and R2_001.fastq.gz), as long as that directory also contains an Illumina-style SampleSheet.csv and RunInfo.xml; the Snakemake rules will automatically detect these and do the fastq-to-uBAM conversion.

There are also a couple of tabular inputs you can specify to merge data for the same sample across multiple sequencing runs; the pipeline does nice things like running all the depletions separately (in parallel) and merging everything prior to assembly. This step is actually required if you want to do any intrahost variant calling, since we require multiple independent sequencing libraries per sample in order to call iSNVs. Let me know if you want to delve into any of those paths, but I suggest starting with the simpler things before getting into intrahost variant calling.
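
To make the file placement concrete, here is a sketch of the simpler file-based starting points from the list above (the BCL route aside); the destination paths come straight from that list, while the sample name mysample and the use of cp are just placeholders for however you stage your data:

```bash
# Option 1: raw (undepleted) uBAM; the pipeline handles depletion for you.
cp mysample.bam data/00_raw/mysample.bam

# Option 2: an already-depleted uBAM, skipping the depletion step.
cp mysample.cleaned.bam data/01_cleaned/mysample.cleaned.bam

# Option 3: paired fastq files with Illumina-style names, placed alongside the
# run's SampleSheet.csv and RunInfo.xml so the fastq-to-uBAM rules detect them.
cp mysample_L001_R1_001.fastq.gz mysample_L001_R2_001.fastq.gz \
   SampleSheet.csv RunInfo.xml data/00_raw/
```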

If, at the end of the day, you find the Snakemake pipelines too difficult to use, you can always try the Cromwell or DNAnexus pipelines, which are entirely separate but call the same Python scripts underneath. There is no inter- or intrahost variant calling yet, but most of the basic workflows (depletion, assembly, metagenomics, demultiplexing) are all there. Although we've used Cromwell successfully (either locally or on Google Cloud), we have no documentation on it yet, since it's quite new. The DNAnexus implementation is our most popular one, however: it has been widely used both within our lab and by our partners, and it is the primary platform we train on and that our ACEGID partner sites use regularly.


jb013b commented Jan 3, 2018

Hi Daniel,
Thank you, that information is very helpful, and the .yaml file was also very helpful. I am going to keep working on the Snakemake pipelines. The one sample I am starting with is ultracentrifuged virus from Vero cells (NHP). Prior to working on the viral-ngs pipeline I had depleted the NHP reads, so I have both non-depleted and depleted files. The original input is ~87,000,000 total reads (2x150). After depletion of NHP reads there were 86,000,000 total reads left. No guarantees that these are all CHIKV reads, but the majority should be. Any possibility it is failing due to the number of virus-specific reads, or just the total read number?
