documentation on running metaGEM using user-generated contig assemblies #56

zoey-rw · 2021-06-16T17:14:22Z

I am trying to run metaGEM using a dataset that has already been quality filtered and assembled into contigs. I'm trying to format the data the way metaGEM wants it, but I can't get it right (maybe because I'm not "touch"ing the files in the right order for Snakemake?).

Is there any documentation on how users should input files when starting at the crossMap/binning step of the pipeline?

Thank you!

franciscozorrilla · 2021-06-17T11:49:53Z

Hi Zoey,

Thanks for raising this issue! I now realize that documentation is lacking for this use of metaGEM, so I will create a new page in the Wiki to address it.

In short, metaGEM creates a number of folders where it stores, and expects to find, sample-specific subdirectories for input/output files. The most important folder to configure is the dataset folder, which is used to extract the sample IDs that drive wildcard expansion in the Snakefile:

metaGEM/Snakefile

Line 14 in d81186a

IDs = get_ids_from_path_pattern('dataset/*')

There is more information about this here. metaGEM is optimized for running an entire analysis from raw reads, so if you don't have raw data you can simply create empty sample-specific subfolders within the dataset folder. Alternatively, you could modify the line quoted above, replacing the dataset folder with qfiltered.
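As a rough illustration of the empty-subfolder approach, here is a minimal Python sketch. The helper names `create_sample_dirs` and `list_sample_ids` are hypothetical and not part of metaGEM, and `list_sample_ids` is only a stand-in for the idea behind `get_ids_from_path_pattern('dataset/*')`, not its actual implementation. The folder name `dataset` is assumed to match your config.

```python
from pathlib import Path


def create_sample_dirs(root, sample_ids, folder="dataset"):
    """Create one empty subfolder per sample ID under root/folder."""
    base = Path(root) / folder
    for sample_id in sample_ids:
        # exist_ok=True makes the call safe to repeat
        (base / sample_id).mkdir(parents=True, exist_ok=True)
    return base


def list_sample_ids(root, folder="dataset"):
    """Return the sorted subfolder names of root/folder.

    Hypothetical stand-in: it only mimics the idea of deriving
    sample IDs from the subfolders of the dataset folder.
    """
    base = Path(root) / folder
    return sorted(p.name for p in base.iterdir() if p.is_dir())
```

Calling `create_sample_dirs(".", ["SRR12557734"])` from the metaGEM root would create `dataset/SRR12557734`, and `list_sample_ids(".")` lets you confirm which IDs would be picked up.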

Ok, now that we have properly configured wildcards to expand sample IDs, let's look at the crossMapSeries rule:

metaGEM/Snakefile

Lines 390 to 397 in d81186a

    
    rule crossMapSeries:
        input:
            contigs = rules.megahit.output,
            reads = f'{config["path"]["root"]}/{config["folder"]["qfiltered"]}'
        output:
            concoct = directory(f'{config["path"]["root"]}/{config["folder"]["concoct"]}/{{IDs}}/cov'),
            metabat = directory(f'{config["path"]["root"]}/{config["folder"]["metabat"]}/{{IDs}}/cov'),
            maxbin = directory(f'{config["path"]["root"]}/{config["folder"]["maxbin"]}/{{IDs}}/cov')

As you can see, the rule takes the entire qfiltered folder as its second input, because it cycles through this folder to map each set of reads to an assembly. This folder should contain sample-specific subdirectories, each holding paired-end read files ending in fastq.gz, e.g. SRR12557734_R1.fastq.gz and SRR12557734_R2.fastq.gz.
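To check that layout before launching the rule, something like the sketch below could help. It assumes the reads are named `<ID>_R1.fastq.gz` and `<ID>_R2.fastq.gz` inside `qfiltered/<ID>/`, following the example file names above. The function name `find_unpaired_samples` is hypothetical and not part of metaGEM.

```python
from pathlib import Path


def find_unpaired_samples(root, folder="qfiltered"):
    """Return IDs of sample subfolders that lack an R1 or R2 read file.

    Assumes reads are named <ID>_R1.fastq.gz and <ID>_R2.fastq.gz.
    """
    problems = []
    for sample_dir in sorted(Path(root, folder).iterdir()):
        if not sample_dir.is_dir():
            continue  # ignore stray files at the top level
        sample_id = sample_dir.name
        r1 = sample_dir / f"{sample_id}_R1.fastq.gz"
        r2 = sample_dir / f"{sample_id}_R2.fastq.gz"
        if not (r1.is_file() and r2.is_file()):
            problems.append(sample_id)
    return problems
```

An empty list means every sample subfolder has both read files under the assumed naming scheme.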

Additionally, the output of the megahit assembly rule is taken in as an input via the shorthand rules.megahit.output. We can see what this file is called and where it lives by looking at the megahit rule itself:

metaGEM/Snakefile

Lines 273 to 278 in d81186a

    
    rule megahit:
        input:
            R1 = rules.qfilter.output.R1,
            R2 = rules.qfilter.output.R2
        output:
            f'{config["path"]["root"]}/{config["folder"]["assemblies"]}/{{IDs}}/contigs.fasta.gz'

As you can see, the contigs should be named contigs.fasta.gz and stored in the assemblies folder within sample-specific subdirectories. I should also note that, within the assembly rule, I use sed to replace all spaces with hyphens in the contig headers:

metaGEM/Snakefile

Lines 313 to 323 in d81186a

    
    # Rename assembly
    echo "Renaming assembly ... "
    mv tmp/final.contigs.fa contigs.fasta

    # Remove spaces from contig headers and replace with hyphens
    echo "Fixing contig header names: replacing spaces with hyphens ... "
    sed -i 's/ /-/g' contigs.fasta

    # Zip and move assembly to output folder
    echo "Zipping and moving assembly ... "
    gzip contigs.fasta
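
For contigs assembled outside metaGEM, the same renaming, header fix, and compression could be reproduced with a sketch like the one below. It assumes an uncompressed FASTA as input and the default folder name `assemblies`. The function name `stage_contigs` is hypothetical. Unlike the sed command, which acts on every line, this version only rewrites header lines; the result should be the same for a well-formed FASTA, since sequence lines contain no spaces.

```python
import gzip
from pathlib import Path


def stage_contigs(fasta_path, root, sample_id, folder="assemblies"):
    """Write fasta_path to root/folder/sample_id/contigs.fasta.gz,
    replacing spaces with hyphens in the header lines."""
    out_dir = Path(root) / folder / sample_id
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "contigs.fasta.gz"
    with open(fasta_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line in src:
            if line.startswith(">"):
                # Header line: swap spaces for hyphens, keep the newline
                line = line.rstrip("\n").replace(" ", "-") + "\n"
            dst.write(line)
    return out_path
```

For example, `stage_contigs("my_assembly.fa", ".", "SRR12557734")` would produce `assemblies/SRR12557734/contigs.fasta.gz`, where `my_assembly.fa` is a placeholder for your own file.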

In summary, if you have the dataset, assemblies, and qfiltered folders configured as described here then you should be in good shape for cross-mapping and downstream analysis. Hope this helps and let me know if you have any issues with this!
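
A quick pre-flight check across the three folders might look like the following sketch. The folder names are assumed to be the defaults mentioned above and should be adjusted to your config, and `missing_inputs` is a hypothetical helper, not part of metaGEM.

```python
from pathlib import Path


def missing_inputs(root, dataset="dataset", assemblies="assemblies",
                   qfiltered="qfiltered"):
    """Map each sample ID found in the dataset folder to the expected
    paths (relative to root) that do not exist yet."""
    root = Path(root)
    report = {}
    for sample_dir in sorted((root / dataset).iterdir()):
        if not sample_dir.is_dir():
            continue
        sample_id = sample_dir.name
        expected = [
            root / assemblies / sample_id / "contigs.fasta.gz",
            root / qfiltered / sample_id,
        ]
        missing = [p.relative_to(root).as_posix()
                   for p in expected if not p.exists()]
        if missing:
            report[sample_id] = missing
    return report
```

An empty dictionary means that every sample ID has both an assembly and a qfiltered subfolder in place.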

Best wishes,
Francisco

zoey-rw · 2022-02-16T17:11:02Z

A late follow-up on this: I also have co-assemblies that I would like to use as input for cross-mapping. I noticed in earlier metaGEM development you tested this approach, do you happen to know which code snippets I could pull from? Thanks!

franciscozorrilla · 2022-05-03T21:53:55Z

Hi Zoey, apologies for the late response. Unfortunately, I abandoned this approach very early in the development of metaGEM, so I do not have any code to share. If you are still looking, this pipeline may have some co-assembly code for you to pull from:

https://github.com/Finn-Lab/MAG_Snakemake_wf

Best,
Francisco

franciscozorrilla self-assigned this Jun 17, 2021

franciscozorrilla added the documentation Additional documentation required label Jun 17, 2021

franciscozorrilla added this to To do in metaGEM v1.1.0 Jun 17, 2021

franciscozorrilla pinned this issue Jun 28, 2021

franciscozorrilla closed this as completed Nov 24, 2022

franciscozorrilla mentioned this issue Nov 30, 2022

Run snakemakefile-no results #116

Closed
