Gustavo A. Ramírez edited this page Feb 24, 2022 · 51 revisions

Welcome back to the GCBS 5086 Wikipage

Business for today Feb 24th, 2022

  1. File/directory organization
  2. Generating seq file summaries using python scripts
  3. KEGG decoder “from the top”-Genomes
  4. KEGG decoder: Environmental Metagenomes
  5. KEGG decoder: Genetic potential vs. activity in animal gut
  6. LET'S TALK ABOUT POTENTIAL PROJECTS.

All questions regarding this work can be directed to:

Gustavo A. Ramírez
email: ramirezg@westernu.edu
Twitter: @zombiephylotype
https://orcid.org/0000-0001-8122-4898

As before, we will get everyone on board with the proper software and data for today via binder!

Binder

Just click the launch/binder icon and enter a fully-functional UNIX working environment! When ready, click on the $_Terminal icon to start a terminal window. This is your work environment, as before 💻

1. Getting organized!

Consolidating files according to data type in new directories
Once in your home directory, list its contents, change directories into GCBS_data, and list the files there:

ls
cd GCBS_data/
ls

Here you should have the following items: KEGGdecoData, seq_length.py, summary_stats.sh.
One is a directory and two are "scripts" or files containing executable code.
The code in seq_length.py and summary_stats.sh is written in the Python and shell languages, respectively.

For now let's ignore the executables (programs) and change directories into the KEGGdecoData directory and list the contents as follows:

cd KEGGdecoData/
ls

Here, we see a total of 15 files. Side note: you can explicitly count the items in a directory using the command below!
This command uses a "pipe" (|), i.e., it combines two commands: ls lists the directory contents, and "|" passes that list to wc -l, which counts the lines.

ls | wc -l

Anyhow, here we see the following: 15 files that either end with .txt, .faa, or .fasta

FYI: .faa, .fa, and .fasta are common extensions for the FASTA file type (we've seen this before!)

FASTA means that each record starts with a header line, ">" followed by a "name", and the line (or lines) below it contains the sequence of nucleotides or amino acids!

.fasta [also often written with an .fa extension (as opposed to .faa)] files contain nucleic acid sequences.

.faa files contain amino acid sequences.

Examples:
.fasta or .fa file:

> Seq1
ATGCCCCAAT
> Seq2
GGGCCTAAA
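
For the curious, here is a minimal Python sketch of how a program might read this two-part layout (header line, then sequence). It is only an illustration: the function name parse_fasta is made up for this example and it is not the code in seq_length.py.

```python
from typing import Iterator, Tuple

def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name = line[1:].strip()  # drop ">" and any surrounding spaces
            chunks = []
        else:
            chunks.append(line)  # a sequence may span several lines
    if name is not None:
        yield name, "".join(chunks)

# The same two records shown in the example above.
example = """> Seq1
ATGCCCCAAT
> Seq2
GGGCCTAAA
"""

records = list(parse_fasta(example))
for name, seq in records:
    print(name, len(seq), sep="\t")
```

Each record comes back as a (name, sequence) pair, so the length of a sequence is just len(seq).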

Have a look at one file of each fasta extension type:

head Alpha_MAG_56.fasta
head ArabianSea_subMG.faa

Bonus: Look at the top lines of all .faa files using a wild card!:

head *.faa

Don't forget to clear your screen as needed:

clear

Also, if you decide to view with "less", pressing the "q" key will get you out of viewer mode!

Making new directories

Make the following new directories (mkdir command):

mkdir MAGs
mkdir Ocean_MGs
mkdir Gut_MGs
mkdir Sponge_MG_MT

Next we will
i) move Alpha_MAG_56.fasta (a metagenome assembled genome!) into the MAGs directory
ii) move all ocean files and metagenomes to the Ocean_MGs directory
iii) move all Gut files and metagenomes to the Gut_MGs directory
and finally,
iv) move all Sponge files, metagenome and metatranscriptome, to the Sponge_MG_MT directory
as follows:

mv Alpha_MAG_56.fasta MAGs
mv *Sea* Ocean_MGs
mv Pacific* Ocean_MGs
mv SpongeM* Sponge_MG_MT
mv *GutM* Gut_MGs
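
If the wildcards (*) feel like magic, the sketch below mimics the matching in Python with the standard fnmatch module. Only Alpha_MAG_56.fasta and ArabianSea_subMG.faa are file names taken from this page; the other names are hypothetical stand-ins, so treat this as an illustration of the pattern logic rather than a listing of your directory.

```python
from fnmatch import fnmatchcase

# Only the first two names appear in this tutorial; the rest are hypothetical.
files = [
    "Alpha_MAG_56.fasta",
    "ArabianSea_subMG.faa",
    "Pacific_example.koala.txt",  # hypothetical
    "SpongeMG_example.faa",       # hypothetical
    "HumanGutMG_example.faa",     # hypothetical
]

# (pattern, destination) pairs, in the same order as the mv commands above.
rules = [
    ("Alpha_MAG_56.fasta", "MAGs"),
    ("*Sea*", "Ocean_MGs"),
    ("Pacific*", "Ocean_MGs"),
    ("SpongeM*", "Sponge_MG_MT"),
    ("*GutM*", "Gut_MGs"),
]

def destination(filename):
    """Return the folder of the first pattern that matches, else None."""
    for pattern, folder in rules:
        if fnmatchcase(filename, pattern):  # case-sensitive, like the shell
            return folder
    return None  # a "lost file" that no pattern catches

plan = {name: destination(name) for name in files}
for name, folder in plan.items():
    print(f"{name} -> {folder}")
```

A file that comes back as None here is exactly the kind of "lost file" to look for in the next step.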

Now check that there are no "lost files" in the directory... they should all be in a folder:

ls

Bonus BASH points: list all files within all directories simultaneously:

ls ./*

Important points here!:

  1. Every "metagenome file" here is actually predicted proteins from the metagenome rather than genes.

The sole exception is the MAG file: Alpha_MAG_56.fasta. (Why, you ask? I didn't know how much data I could pre-load to binder; adding both the nucleotide fasta (the largest files) and the amino acid fasta eats up storage space quickly!) Just know that for every AA file there is also a nucleic acid file from which the proteins were predicted!

  2. Predicted protein files (.faa) are the input for KEGG annotation using a KOALA tool such as GhostKOALA (examples coming up!).

Now, for the MAG nucleotide file (Alpha_MAG_56.fasta), we will generate the AA file (.faa) and get annotations using KEGG Ghost Koala from scratch!

BIG PICTURE HERE: The annotation files for each metagenome (.txt files) are what we need to run KEGGdecoder!!!!

Takehome points:

To annotate a genome/MG/MT and eventually predict metabolic pathways with KEGG decoder you will need to:
i) Take the DNA file (.fasta) and predict proteins using prodigal/prokka
ii) Take the predicted AA file (.faa) and submit it to KEGG Koala
iii) Get the Koala annotations (.txt) and use as input for KEGG decoder
iv) Get a cool KEGG decoder pathway completion heat map and enjoy! 🤓

2. Generating sequence file stats

Scripts (executable code) that will help you summarize the content of your files

To generate some basic stats on your fasta files do the following:

Let's copy (cp) both summary scripts found in the GCBS_data directory (/home/jovyan/GCBS_data) to the MAGs folder, where the metagenome assembled genome (Alpha_MAG_56.fasta) that we are analyzing next resides!:

First let's move to the MAGs directory (just in case you aren't there already):

cd /home/jovyan/GCBS_data/KEGGdecoData/MAGs

Then, cp each file, by providing the path, to your current directory ( .) as follows:

cp /home/jovyan/GCBS_data/seq_length.py .
cp /home/jovyan/GCBS_data/summary_stats.sh .

List the contents of the directory and both files (.sh and .py) should be here now!

ls

If you are curious, check out the contents of the script e.g.:

less seq_length.py

"q" to exit view mode!

Briefly: again, each sequence in a fasta file starts with a header line, a ">" character followed by a name, and the line(s) below it hold the actual sequence. Thus, if we count the ">" header lines in a fasta file, we have counted the number of sequences!

Here we use "grep" (Global Regular Expression Print) for that, certainly one of the most classic linux commands ever! 😎 💻

Run it as follows:

Translation: [grep ">"] = look for this regular expression ">", here: [Alpha_MAG_56.fasta], [-c] just tally the count.

grep -c ">" Alpha_MAG_56.fasta 

Now we know how many actual sequences our file contains!
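
The same counting idea can be sketched in Python. This is a rough equivalent, not a replacement for grep: it counts lines that start with ">", whereas grep -c ">" counts lines containing ">" anywhere (the same number for a well-formed fasta file). The contig names below are hypothetical.

```python
def count_sequences(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))

fasta_lines = [
    ">contig_1",   # hypothetical contig names
    "ATGCCCCAAT",
    "GGGCCTAAA",   # a sequence may wrap over several lines
    ">contig_2",
    "TTGACA",
]

n_sequences = count_sequences(fasta_lines)
print(n_sequences)
```

To try it on a real file, pass an open file handle instead of the list, e.g. count_sequences(open("Alpha_MAG_56.fasta")).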
Note: MG contigs can be quite long- how long you ask?:

This command will print the length of each of the 102 sequences in Alpha_MAG_56.fasta :

python3 seq_length.py Alpha_MAG_56.fasta

Note that the output is not stored (written to a file); it is just regurgitated by python to the screen, and we only have human eyes 😞 !
Fortunately, we can use stats 😄 !

Run the following code:

python3 seq_length.py Alpha_MAG_56.fasta | cut -f 2 | summary_stats.sh

Our regurgitated sequence lengths have been separated from their names into a single numerical column [cut -f 2] and summarized by summary_stats.sh!

*March17_22:00: there is a bug with the awk version (mawk?)... will troubleshoot tomorrow.

Students: in the meantime, get your own stats (sequence count (n); length median, min, max, etc.) using this file: Length_of_each_seq.txt, generated as follows:

python3 seq_length.py Alpha_MAG_56.fasta | cut -f 2 > Length_of_each_seq.txt
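
One possible way to get those stats is sketched below in Python, using only the standard library. It assumes Length_of_each_seq.txt holds one integer length per line (which is what the cut -f 2 step should produce); the function names and the example lengths are made up for illustration.

```python
import statistics

def summarize(lengths):
    """Return basic summary statistics for a list of sequence lengths."""
    ordered = sorted(lengths)
    return {
        "n": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "median": statistics.median(ordered),
        "mean": statistics.mean(ordered),
        "total": sum(ordered),
    }

def read_lengths(path):
    """Read one integer per line, skipping blank lines (assumed file layout)."""
    with open(path) as handle:
        return [int(line) for line in handle if line.strip()]

# Hypothetical lengths, standing in for the contents of Length_of_each_seq.txt.
lengths = [1200, 5400, 830, 15000, 2750]
stats = summarize(lengths)
for key, value in stats.items():
    print(key, value, sep="\t")
```

With the real file in your current directory, swap the hypothetical list for read_lengths("Length_of_each_seq.txt").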

Takehome points:

i) You are now fully capable of providing a description of your initial sequence data (fasta file!)!

3. From nucleotide fasta to pathway prediction: KEGG decoder!

DNA--> Protein ---> Annotation ---> KEGGdecoder summary, let's do this!


First let's move to the MAGs directory (just in case you aren't there already):

cd /home/jovyan/GCBS_data/KEGGdecoData/MAGs

Run the following command and take a break... It will take ~15-20mins!

prokka Alpha_MAG_56.fasta

Let's talk about prokka:
This program runs "prodigal" and predicts proteins by finding ORFs (open reading frames) in your sequence data.
The ORFs are then aligned (BLAST) against a library of genes with known functions and your ORFs get "annotated" accordingly...

Eventually, prokka generates many cool files, all stored within a newly generated folder named PROKKA_ followed by the date of your run (PROKKA_03182021 in the example below; yours will differ), but, importantly for us, it generates a .faa file.

cd PROKKA_03182021
ls

Download the .faa file (PROKKA_03182021.faa in this example) to an easily accessible place on your computer: click on the folder icon on the top left of your binder screen, navigate to the PROKKA folder inside the MAGs directory, and download the file (hover, right click, for download option!).

Next:
i) Open a browser window with the following link: https://www.kegg.jp/ghostkoala/
ii) Click on the Choose File upload option.
iii) Select the .faa file downloaded to your computer.
iv) Enter an email address
v) Click request confirmation
vi) Important: go to that email, click on the "Submit Link", and ensure that you get a "Request submitted confirmation"!
vii) Wait ~5-10 mins... You will be emailed a link to your results!
viii) Click the results link - DATA! yeeeeaaaahhhh!
ix) Download the annotation data (save as: "Alpha_MAG_56.Koala.txt" or something like that!) somewhere you can find it!
x) On your instance, navigate on the side panel, with the folder icon, to the MAGs directory: /.../KEGGdecoData/MAGs/
xi) Click the "up arrow" to Upload... Select Alpha_MAG_56.Koala.txt from your computer
xii) DONE WITH THIS...!

To finally run KEGG decoder, let's go to the MAGs folder on the command line as follows and ensure that the newly uploaded "Alpha_MAG_56.Koala.txt" file is actually where we think it is:

cd /home/jovyan/GCBS_data/KEGGdecoData/MAGs
ls
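
Optional sanity check: the Python sketch below tallies how many genes in an annotation file received a KEGG Orthology (KO) assignment. It assumes the layout commonly seen in GhostKOALA downloads, one gene per line with the gene ID and, when annotated, a tab followed by a K number; check your own file with head before trusting that. The gene IDs and K numbers in the example are illustrative only.

```python
def annotation_summary(lines):
    """Return (total genes, genes with a KO) for tab-separated annotation lines."""
    total = 0
    annotated = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip empty lines
        total += 1
        # Assumed layout: column 1 = gene ID, column 2 = K number (if any).
        if len(fields) > 1 and fields[1].startswith("K"):
            annotated += 1
    return total, annotated

# Illustrative lines only; real gene IDs and K numbers will differ.
example_lines = [
    "gene_00001\tK00001\n",
    "gene_00002\n",          # a gene with no KO assignment
    "gene_00003\tK02274\n",
]

total, annotated = annotation_summary(example_lines)
print(f"{annotated} of {total} genes have a KO assignment")
```

For the real file, pass an open handle, e.g. annotation_summary(open("Alpha_MAG_56.Koala.txt")).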

Fantastic!

Now, on to KEGGdecoder!

as always:


KEGG-decoder --help

will provide general information and additional execution options!

For our purposes we will run the following code:


KEGG-decoder -i Alpha_MAG_56.Koala.txt -o Alpha_MAG_56.Koala.KEGGdecoder -v static

After a few seconds...
We should have a .svg file in the directory!

ls

Again, from the top left folder icon, hover over the Alpha_MAG_56.Koala.svg file, right click for download to your computer then open it for inspection....

Great job!

You have now annotated all genes in a metagenome assembled genome and, by organizing the data at the level of metabolism in KEGG, you can now assess the genetic potential for various metabolic activities by this microbe!!! 👩🏽‍🔬 🧪 🧬 💻 - That's Science!

Takehome points (summary):

Metabolic pathways from scratch:
i) use prokka to generate AA from nucleotide seqs
ii) upload .faa file to GhostKoala (binder --> your computer --> GhostKoala --> email --> Results (rename.txt) --> binder).
iii) use the .txt file to run KEGG decoder- That's it! 😉

4. KEGG decoder: Environmental Metagenomes

Let's explore metabolisms across ocean basins!

Let's first move to the correct directory:

cd /home/jovyan/GCBS_data/KEGGdecoData/Ocean_MGs
ls

Short hand notation now... (Let's see how this goes):

Next: Combine the three Ocean GhostKoala output files

cat *koala.txt > OceanCombo.koala.txt
ls -lht

Size seems right!
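
What does "size seems right" mean? The combined file should be exactly as large as its inputs added together. Here is a small Python sketch of that check on throwaway files in a temporary directory (the file names and contents are hypothetical). It also skips the output file when collecting inputs, so re-running it does not fold an old combined file back into itself, something to watch for if you repeat the cat command above.

```python
import tempfile
from pathlib import Path

def combine(directory, pattern, output_name):
    """Concatenate files matching pattern into output_name.

    Returns (size of combined file, summed size of the inputs) in bytes.
    """
    directory = Path(directory)
    output = directory / output_name
    # Exclude the output itself in case it already exists and matches the pattern.
    inputs = sorted(p for p in directory.glob(pattern) if p != output)
    with open(output, "wb") as out:
        for path in inputs:
            out.write(path.read_bytes())
    expected = sum(p.stat().st_size for p in inputs)
    return output.stat().st_size, expected

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical annotation files, just for the demonstration.
    Path(tmp, "SiteA.koala.txt").write_bytes(b"A_1\tK00001\nA_2\n")
    Path(tmp, "SiteB.koala.txt").write_bytes(b"B_1\tK02274\n")
    combined_size, expected_size = combine(tmp, "*koala.txt", "Combo.koala.txt")
    # Running it again gives the same size because the output is excluded.
    rerun_size, _ = combine(tmp, "*koala.txt", "Combo.koala.txt")

print(combined_size, expected_size, rerun_size)
```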

Run KEGGdecoder!

KEGG-decoder -i OceanCombo.koala.txt -o OceanComboKEGGdecoder -v static

Download the .svg file as before and inspect! (Heads up: check the directory listed in the top left if you don't see your file. The directory of the terminal screen and the downloads panel work independently!)

Takehome points (summary):

This is getting easy, right?

5. KEGG decoder: Genetic potential vs. activity in the most ancient animal gut

Let's explore what can happen (metagenome) vs. what happens (metatranscriptome) in marine sponges 🌊 🧽 !

Let's first move to the correct directory:

cd /home/jovyan/GCBS_data/KEGGdecoData/Sponge_MG_MT
ls

Short hand notation now... (Let's see how this goes):

Next: Combine the sponge microbiome metagenome and metatranscriptome Ghost Koala output files:

cat *koala.txt > Sponge_MG_MT_combo.koala.txt
ls -lht

Size seems right!

Run KEGGdecoder!

KEGG-decoder -i Sponge_MG_MT_combo.koala.txt -o Sponge_MG_MT_KEGGdecoder -v static

Download the .svg file as before and inspect! (Heads up: check the directory listed in the top left if you don't see your file. The directory of the terminal screen and the downloads panel work independently!)
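
To make "what can happen vs. what happens" concrete, here is a rough Python sketch that compares the sets of K numbers found in a metagenome and a metatranscriptome annotation. It assumes the same gene ID, tab, K number layout as the Koala files, and the gene IDs and K numbers below are made up. Keep in mind that a KO missing from the metatranscriptome may simply reflect sequencing depth rather than a truly silent gene.

```python
def ko_set(lines):
    """Collect the K numbers found in the second column of annotation lines."""
    kos = set()
    for line in lines:
        fields = line.split()
        if len(fields) > 1 and fields[1].startswith("K"):
            kos.add(fields[1])
    return kos

# Made-up annotation lines; real files will be far larger.
metagenome = ["MG_1\tK00001", "MG_2\tK02274", "MG_3", "MG_4\tK00370"]
metatranscriptome = ["MT_1\tK02274", "MT_2\tK00370", "MT_3"]

potential = ko_set(metagenome)          # what can happen
expressed = ko_set(metatranscriptome)   # what was detected as happening
encoded_only = potential - expressed    # encoded, but not detected as expressed

print("encoded and expressed:", sorted(potential & expressed))
print("encoded only:", sorted(encoded_only))
```

The KEGG decoder heat map tells the same story at the level of whole pathways rather than individual K numbers.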

Takehome points (summary):

Wow, this is easy!
Now, let's think a bit about the science here and start talking about projects!

Back to my ppt presentation...