documentation on running metaGEM using user-generated contig assemblies #56

Closed
zoey-rw opened this issue Jun 16, 2021 · 3 comments
Labels
documentation Additional documentation required


@zoey-rw
Contributor

zoey-rw commented Jun 16, 2021

I am trying to run metaGEM using a dataset that has already been quality filtered and assembled into contigs. I'm trying to format the data the way metaGEM wants it, but I can't get it right (maybe because I'm not "touch"ing the files in the right order for Snakemake?).

Is there any documentation on how users should input files when starting at the crossMap/binning step of the pipeline?

Thank you!

@franciscozorrilla
Owner

Hi Zoey,

Thanks for raising this issue! I now realize that documentation is lacking for this usage of metaGEM, so I will create a new page in the Wiki to address this.

In short, metaGEM creates a number of folders where it stores and expects to find sample-specific subdirectories for input/output files. The most important folder to configure is the dataset folder, which is used to extract the sample IDs used for wildcard expansion in the Snakefile:

IDs = get_ids_from_path_pattern('dataset/*')

There is more information about this here. metaGEM is optimized for users to run an entire analysis from raw reads, so if you don't have raw data you can simply create empty sample-specific subfolders within the dataset folder. Alternatively, you could modify the line quoted above, replacing the dataset folder with qfiltered.
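As a minimal sketch of the first option, the empty sample folders could be created as shown below. The helper name `make_sample_dirs`, the example sample IDs, and the literal folder name `dataset` are assumptions for illustration; the folder name should match whatever your metaGEM config uses.

```python
from pathlib import Path


def make_sample_dirs(root, sample_ids, folder="dataset"):
    """Create one empty subfolder per sample ID under <root>/<folder>.

    Hypothetical helper, not part of metaGEM. The folder name is assumed
    to match the dataset folder that the Snakefile globs for sample IDs.
    """
    created = []
    for sample_id in sample_ids:
        sample_dir = Path(root) / folder / sample_id
        # exist_ok=True makes the call safe to repeat on an existing layout
        sample_dir.mkdir(parents=True, exist_ok=True)
        created.append(sample_dir)
    return created


if __name__ == "__main__":
    # Example sample IDs; replace with your own
    for path in make_sample_dirs(".", ["SRR12557734", "SRR12557735"]):
        print(path)
```

The folder names, not their contents, are what matter here, since the sample IDs are extracted from the paths matched by 'dataset/*'.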

Ok, now that we have properly configured wildcards to expand sample IDs, let's look at the crossMapSeries rule:

metaGEM/Snakefile

Lines 390 to 397 in d81186a

rule crossMapSeries:
    input:
        contigs = rules.megahit.output,
        reads = f'{config["path"]["root"]}/{config["folder"]["qfiltered"]}'
    output:
        concoct = directory(f'{config["path"]["root"]}/{config["folder"]["concoct"]}/{{IDs}}/cov'),
        metabat = directory(f'{config["path"]["root"]}/{config["folder"]["metabat"]}/{{IDs}}/cov'),
        maxbin = directory(f'{config["path"]["root"]}/{config["folder"]["maxbin"]}/{{IDs}}/cov')

As you can see, the rule takes the entire qfiltered folder as its second input, since it cycles through this folder to map each set of reads to an assembly. This folder should contain sample-specific subdirectories holding paired-end read files ending in fastq.gz, e.g. SRR12557734_R1.fastq.gz and SRR12557734_R2.fastq.gz.
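A quick way to verify this layout before running the rule is sketched below. It assumes the read files are named `<sample ID>_R1.fastq.gz` and `<sample ID>_R2.fastq.gz`, as in the example above, and that the folder is literally called `qfiltered`; the function name `samples_missing_reads` is hypothetical and not part of metaGEM.

```python
from pathlib import Path


def samples_missing_reads(root, folder="qfiltered"):
    """Return the sample IDs whose R1/R2 read pair is incomplete.

    Assumes <root>/<folder>/<ID>/<ID>_R1.fastq.gz and <ID>_R2.fastq.gz,
    a naming scheme inferred from the example file names in this thread.
    """
    missing = []
    sample_dirs = sorted(p for p in (Path(root) / folder).iterdir() if p.is_dir())
    for sample_dir in sample_dirs:
        sample_id = sample_dir.name
        expected = [sample_dir / f"{sample_id}_R{mate}.fastq.gz" for mate in (1, 2)]
        if not all(f.is_file() for f in expected):
            missing.append(sample_id)
    return missing


if __name__ == "__main__":
    print(samples_missing_reads("."))
```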

Additionally, the output of the megahit assembly rule is taken in as an input via the shorthand rules.megahit.output. We can see what this file is called and where it lives by looking at the megahit rule itself:

metaGEM/Snakefile

Lines 273 to 278 in d81186a

rule megahit:
    input:
        R1 = rules.qfilter.output.R1,
        R2 = rules.qfilter.output.R2
    output:
        f'{config["path"]["root"]}/{config["folder"]["assemblies"]}/{{IDs}}/contigs.fasta.gz'

As you can see, the contigs should be named contigs.fasta.gz and stored in the assemblies folder within sample-specific subdirectories. I should also note that, within the assembly rule, I use sed to replace all spaces with hyphens in the contig headers.

metaGEM/Snakefile

Lines 313 to 323 in d81186a

# Rename assembly
echo "Renaming assembly ... "
mv tmp/final.contigs.fa contigs.fasta
# Remove spaces from contig headers and replace with hyphens
echo "Fixing contig header names: replacing spaces with hyphens ... "
sed -i 's/ /-/g' contigs.fasta
# Zip and move assembly to output folder
echo "Zipping and moving assembly ... "
gzip contigs.fasta
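For contigs assembled outside metaGEM, the same renaming and compression could be reproduced along the following lines. This is a sketch rather than metaGEM code: the function name `install_contigs` and the literal folder name `assemblies` are assumptions, and the input is assumed to be a plain or gzipped FASTA file. Unlike the sed command above, it only touches header lines, which should be equivalent for a FASTA file whose sequence lines contain no spaces.

```python
import gzip
from pathlib import Path


def install_contigs(src_fasta, root, sample_id, folder="assemblies"):
    """Write a user assembly to <root>/<folder>/<sample_id>/contigs.fasta.gz.

    Hypothetical helper. Spaces in header lines are replaced with hyphens,
    mirroring the sed step in the megahit rule.
    """
    src = Path(src_fasta)
    out_dir = Path(root) / folder / sample_id
    out_dir.mkdir(parents=True, exist_ok=True)
    dest = out_dir / "contigs.fasta.gz"

    # Accept either a gzipped or an uncompressed input file
    opener = gzip.open if src.suffix == ".gz" else open
    with opener(src, "rt") as fin, gzip.open(dest, "wt") as fout:
        for line in fin:
            if line.startswith(">"):
                line = line.rstrip("\n").replace(" ", "-") + "\n"
            fout.write(line)
    return dest
```

For example, `install_contigs("my_assembly.fa", ".", "SRR12557734")` would write `./assemblies/SRR12557734/contigs.fasta.gz`, where `my_assembly.fa` stands in for your own file.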

In summary, if you have the dataset, assemblies, and qfiltered folders configured as described here then you should be in good shape for cross-mapping and downstream analysis. Hope this helps and let me know if you have any issues with this!
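The three folders can be cross-checked with a small script like the one below before launching Snakemake. It is only a sketch: `check_layout` is a hypothetical name, and the folder names `dataset`, `assemblies`, and `qfiltered` are assumed to match your config.

```python
from pathlib import Path


def check_layout(root, dataset="dataset", assemblies="assemblies", qfiltered="qfiltered"):
    """Map each sample ID found in the dataset folder to a list of problems.

    An empty list means the sample has both an assembly and a reads folder.
    """
    root = Path(root)
    report = {}
    sample_dirs = sorted(p for p in (root / dataset).iterdir() if p.is_dir())
    for sample_dir in sample_dirs:
        sample_id = sample_dir.name
        problems = []
        if not (root / assemblies / sample_id / "contigs.fasta.gz").is_file():
            problems.append("missing assembly")
        if not (root / qfiltered / sample_id).is_dir():
            problems.append("missing qfiltered folder")
        report[sample_id] = problems
    return report


if __name__ == "__main__":
    for sample_id, problems in check_layout(".").items():
        print(sample_id, "OK" if not problems else ", ".join(problems))
```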

Best wishes,
Francisco

@franciscozorrilla franciscozorrilla self-assigned this Jun 17, 2021
@franciscozorrilla franciscozorrilla added the documentation Additional documentation required label Jun 17, 2021
@franciscozorrilla franciscozorrilla pinned this issue Jun 28, 2021
@zoey-rw
Contributor Author

zoey-rw commented Feb 16, 2022

A late follow-up on this: I also have co-assemblies that I would like to use as input for cross-mapping. I noticed that you tested this approach in earlier metaGEM development. Do you happen to know which code snippets I could pull from? Thanks!

@franciscozorrilla
Owner

Hi Zoey, apologies for the late response. Unfortunately, I abandoned this approach very early in the development of metaGEM, so I do not have any code to share. If you are still looking, this pipeline may have some co-assembly code for you to pull from:

https://github.com/Finn-Lab/MAG_Snakemake_wf

Best,
Francisco
