Skip to content
Gustavo A. RamΓ­rez edited this page Mar 14, 2023 · 91 revisions

California State University, Los Angeles

Department of Biological Sciences
Applied and Environmental Microbiology MICR3900-02,03
Spring 2023, Professors RamΓ­rez and Xu

Welcome to the MICR3900 Bioinformatics Wikipage!

Business for today March 13th and 15th, 2023

Sub-modules 1 and 2 below:

  1. Very brief introduction to Unix
  2. Important file structures
  3. Basic Amplicon Analysis Concepts
  4. Command line BLAST
  5. Phylogenomics
  6. Comparative genomics

All questions regarding this work can be directed to:

Prof. Gustavo A. RamΓ­rez
email: gramir157@calstatela.edu
Twitter: @zombiephylotype
https://orcid.org/0000-0001-8122-4898

Brief intro:
We will go over an introduction to 16S rRNA gene "amplicon" sequence analyses.

Binder

Binder
To enter a fully-functional unix environment, just click on the launch binder icon above!
-fyi: it may take a few mins to load.

Briefly, this is a "remote environment" where you access time and computing resources on a real machine running unix that you access through your browser binder window. This environment contains all the necessary program and files for this MIC3900 lab tutorial. Welcome to high performance computing at your fingertips πŸ¦ πŸ§¬πŸ‘¨β€πŸ’»!

1. Very brief introduction to Unix

Welcome to Unix
Unix environment basics

When starting a unix window in a computer, you get "dumped" into a folder (folders are also called "directories" in computer science) that is called your "home directory". You can think of your home directory as your "permanent address"; every time you enter the computer, this will be where you start.
Any file you want to find in a unix environment also has an "path" or "full address". Let's explore: To print your "home directory" or starting address in the computer simply try the Print Working Directory (pwd)command as follows:

pwd

Next we probably want to look at other contents of our home directory (starting folder)

To list our directory contents we type:

ls

The next order of business will be to navigate into a directory in the environment
Lets move into the For3900Binder directory with the change directory (cd) command as follows:

cd For3900Binder

You may see your address change here
List the contents of this directory as before:

ls

To move back up to the directory we came from (our home directory) execute the following:

cd ..

Now you are back where we started :)

Next we will attempt to read a file containing called Alpha_bin65.fasta representing a genome of an Alphaproteobacteria assembled from a metagenome of marine sponges. This file is in a folder called "Known_Genomes" which is itself in a folder called "Phylogenomics"

listing the contents of our current location (home directory) we can see the directory called "Phylogenomics":

ls

Moving into that directory is done as follows:

cd Phylogenomics

Listing the contents of this directory we see the other "Known_Genomes" directory. Let's move into it and list the contents, keeping an eye out for a file called "Alpha_bin65.fasta" as follows:

cd Known_Genomes
ls

and there it is!

To have a sneak peak into the contents of the file use the following command, which list, by default, the first 10 lines of the file:

head Alpha_bin65.fasta

Say, you want to see the first 100 lines of the file? Fine, use the same command with the following modifier:

head -100 Alpha_bin65.fasta

If the screen gets messy, which is should, clear the contents of your unix terminal screen with:

clear

Take a min to explore the contents of the other genomes using the head command.

Now, return to your home directory. We can do this directly:

cd

or indirectly (step wise):

cd ..
cd ..

check that you are home:

pwd
ls

Finally, we will make a new directory and populate it: This command makes a new folder called "myFolder" or whatever you want to call it:

mkdir MyFolder

Move into the new folder MyFolder and list the contents:

cd MyFolder
ls

Unsurprisingly, it is empty! Let's make a new file called "NewFile" as follows:

touch NewFile

By listing the contents we can confirm the file has been made:

ls

Now, let's delete this new file using the remove (rm) as follows"

rm NewFile

and move out of this directory:

cd ..

Finally, we will delete the directory we just created as follows:

and move out of this directory (note that the argument "-r" in between the command and file to be canned:

rm -r MyFolder

NOW YOU CAN AT THE VERY LEAST NAVIGATE AROUND A UNIX ENVIRONMENT, NOT A BAD START πŸ‘πŸ‘πŸ‘ :)
Please check out this link for additional beginner UNIX practice/resources:
https://astrobiomike.github.io/unix/

Takehome points:

i) ENTERING A UNIX ENVIRONMENT STARTS YOU OFF IN YOUR HOME DIRECTORY OR FOLDER: use pwd to print your current directory.
ii) MOVIGN AROUND DIRECTORIES CAN BE DONE WITH THE cd COMMAND.
iii) THE COMMAND head CAN LET YOU READ FILES.
iv) DIRECTORIES AND FILES CAN BE MADE OR DELETED AS YOU PLEASE.

2. Important file structures

The fastq/fasta file structure
After confirming our 16S rRNA gene amplicons they will be sequenced in an Illumina MiSeq machine: https://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf

The sequencing center will give us the data in files called fastq files.

-What are fastq files? A standard file structure containing 3 key pieces of information over 4 line entrees:

Field 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Field 2 is the raw sequence letters.
Field 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
Field 4 encodes the quality values for the sequence in Field 2, and must contain the same number of symbols as letters in the sequence.

From your home directory navigate into the "For3900Binder" directory and list the contents to identify two fastq files (each ending with .fastq)

cd For3900Binder
ls

These files represent "paired-end" reads of DNA molecules (read bi-directionally). Most importantly, note the name of the sequence, the raw sequence letters, and the "quality scores".
Explore these files using head as follows for the forward file:

head forward_sub.fastq

And the reverse file:

head reverse_sub.fastq

Don't forget to clear your screen with "clear" if needed.

clear
  • Note that in paired end sequencing, each physical DNA molecule is read from each end. Here, the quality score is use to keep or cull each sequence. If the forward read of a sequence is culled, the reverse, wether its high or low quality, is also culled. Ultimately, the pairs that remain are overlapped to generate contigs.

Contigs are then further scrutinized to generate "clouds" of similar sequences (usually 97% similarity) called operational taxonomic units (OTUs) or, using a state-of-the-art program called DADA2 exact amplicon sequence variants (ASVs) of each sequence, rather than groups of very similar sequences, can be identified. Both ASVs and OTUs are functional definitions of species present in the dataset, although an argument exists that this statement is most "real" for ASVs...

This results in 3 classic analysis files: 1) a fasta file containing the ASV sequences, 2) a count table tabulating the number of times each sequence was observed in the dataset, and 3) a taxonomy table, containing the taxonomy assigned to each sequence.
Let's explore an example of each one of these file types:

Have a look at the first 50 lines of an ASV.fasta file:

head -50 subsamp_ASVs.fa

Note that the fasta file format differs from that of the fastq file.
Here the name line begins with a carrot ">" and there are no more quality scores retained!

Next we will explore a count table with the a new command called less. Since less prints out more information than the standard head output, it is said that "less is more" lol. Try the following:

less -S subSamp_ASVs_counts.txt

Note that you can scroll through as many lines of the file as you please. The -S flag allows us to organize the data structure by column, where each column header is a sample, and each row headers is an ASV. The frequency of each ASVWhen you are satisfy, you must exit the less window by pressing the "q" key. I think of it as "q" = quit.

Finally, we can have a look at a taxonomy table with the following command:

less -S subSamp_ASVs_taxonomy.txt

Here, each ASV (species) is given its closest taxonomy based on how closely the ASV sequence matches other 16S sequences with known provenance (from identified microbes). NOTE: You must again press "q" to exit this taxonomy window.

The information in these three files is all that will be needed to run an R program called phyloSeq where we will analyze these data to provide very fancy statistics (alpha and beta diversity, multivariate analysis, differential enrichments) and their visualization in a few weeks!

Takehome points:

i) fastq format differs from fasta format by the identifier before the sequence name ("@" vs. ">") and based on an additional line for read quality scores.
ii) High quality reads can undergo further analysis by being clustered into species equivalents (OUTs or ASVs).
iii) The sequence, frequency in the dataset, and taxonomy of ASVs is summarized in the ASV.fasta, count table, and taxonomy files.
iv) The three files are used as input to down stream statistical programs to generate ecological statistics and visualization of the dat summaries.

This will wrap up today!

Thanks

3. Amplicon Analysis Concepts 🚧

Amplicon analysis file types
Explore the contents of the BLAST directory.

to enter the directory, type the following command.

ls
cd BLAST
ls

In this directory you will see two other directories titled "Genome" and "Metagenome" and a single document containing a recA nucleic acid sequence.

To have a look at the recA.fasta file try this:

less recA.fasta

Takehome points:

i) Blast is easy to use in a command line format!
ii) Just make a db and execute the command.
iii) Today we only BLASTed a single gene but this can also be done with a library of genes. This is particularly useful when the query sequences are gene markers for known functions/metabolism (eg: methanogenesis, photosynthesis, antimicrobial resistance, etc.)!
iv) Your custom db can be anything you want.
v) Gold mine for very specific hypothesis generation and testing, as we'll further discuss briefly.

4. BLAST: The most universal tool in bioinformatics 🚧

Command line BLAST!
Explore the contents of the BLAST directory.

to enter the directory, type the following command.

ls
cd BLAST
ls

In this directory you will see two other directories titled "Genome" and "Metagenome" and a single document containing a recA nucleic acid sequence.

To have a look at the recA.fasta file try this:

less recA.fasta

Note: press the lowercase letter "q" on your keyboard to exit the "less" view...

We will look for this gene in the genome and metagenome stored in the other folders!
So, lets place a copy of this file (recA.fasta) into each one of those directories as follows:

cp recA.fasta Genome
cp recA.fasta Metagenome

Using BLAST to find a gene within a Genome

Now, navigate into the Genome directory to BLAST the recA gene against a complete genome:

cd Genome
ls

You should see two files here: "bin40.fasta", a genome of a Gamma-proteobacterium that is a marine Sponge symbiont 🌊 🦠 , and "recA.fasta" which we copied over in the last step!

Now, build a BLAST database:

makeblastdb -in bin40.fasta -dbtype nucl -out Gdb -max_file_sz 100000000 -parse_seqids

-That was fast!

Let's have a look at the output:

ls

Note that the "database" is split into different files that all start with "Gdb."

Next, we can finally execute the BLAST command!
(which blast algorithm and why?)...

Also, this is you most complex command thus far. What do you think the options mean?

tblastx -query recA.fasta -db Gdb -evalue 1e-60 -max_target_seqs 50 -outfmt "6 qseqid sseqid qlen length pident evalue bitscore qstart qend sstart send" -out GBlast_RecA.txt

The output is stored here: GBlast_RecA.txt

Let's have a look:

ls
less GBlast_RecA.txt

Note: pressing lowecase "q" will get you out of view mode.

To interpret the output, we have to recall this:

qseqid sseqid qlen length pident evalue bitscore qstart qend sstart send

which matches to this: MG194538.1 gb|MPMQ01000015.1| 618 206 76.214 1.98e-96 346 1 618 24238 2362

Final interpretation: A RecA homologue is present in this genome!

Using BLAST to find a gene within a Metagenome

Let's go back to the home directory with:

cd 

Look around [ie: "list" (ls) the contents]:

ls

Now, navigate into the Metagenome directory to within the BLAST directory as before:

Can be done in one step:

cd BLAST/Metagenome/

or two steps:

cd BLAST
cd Metagenome

List the contents:

ls

You should see two files here: "ChickenGutMG.fasta", sequences for all the DNA collected from chicken ceca, and "recA.fasta" which we copied over earlier! πŸ”

Now, build a BLAST database, as before:

Note: Keep in mind what changes below (input file name & your db's prefix):

makeblastdb -in ChickenGutMG.fasta -dbtype nucl -out MGdb -max_file_sz 100000000 -parse_seqids

-That was fast! (It almost always is, else, why use a computer... lol)

Please note we that, for human-reader purposes, we changed the db's to MGdb for "MetaGenome".

have a look:

ls

Moving on to the alignment(s):

Take a second to understand what's changed this time around...

tblastx -query recA.fasta -db MGdb -evalue 1e-60 -max_target_seqs 50 -outfmt "6 qseqid sseqid qlen length pident evalue bitscore qstart qend sstart send" -out MGBlast_RecA.txt

Cool! - do we see an output file?:

ls

There it is!

Let's have a inside look:

less MGBlast_RecA.txt

How many homologues do we have significant alignments against?

Given that we are working with a metagenome, does this result make sense?

Note: press "q" to exit view mode.

Clear your screen if you haven't already...

clear

And... remember, all of your activity during the session is stored and may be accessed as follows:

history

and cleared again...

clear

Final interpretation: A RecA homologues are present in this metagenome!

Takehome points:

i) Blast is easy to use in a command line format!
ii) Just make a db and execute the command.
iii) Today we only BLASTed a single gene but this can also be done with a library of genes. This is particularly useful when the query sequences are gene markers for known functions/metabolism (eg: methanogenesis, photosynthesis, antimicrobial resistance, etc.)!
iv) Your custom db can be anything you want.
v) Gold mine for very specific hypothesis generation and testing, as we'll further discuss briefly.

5. Phylogenomics: Exploring evolutionary relationships of organisms 🚧

Let's start your first phylogenomic tree!
Please check if your binder instance is still responsive.
-If it is: great!
-If not: start a new one now by scrolling up and clicking on the Binder icon.

Whether you re-starting binder or not, let's got to our home directory with the following command:

cd

List the files here and you will see a folder titled "Phylogenomics".

Let's "change directories" (cd) into as follows and then lists its contents:

cd Phylogenomics/
ls

Here, you'll find two directories titled: "Known_Genomes" and "Unknowns_Genomes"

Note: "Known" or "unknown" refers to the taxonomic affiliation of each genome within each library.

Constructing a phylogenomic tree comprised of genomes with known taxonomic assignments

Let's move into the Known_Genomes directory and list contents:

cd Known_Genomes
ls

Each file with a with a .fasta "extension" (last part of its name) is a metagenome amplified genome (random assortment from Sponge and Chicken Gut microbiomes).

The files ending with .txt extension are just a list of all the names of all the genomes in this directory and all the genomes of this directory + 1 unknown (we will transfer that genome in later....).

Check for yourself!:

less fasta_names.txt

Exit view mode with "q".

less fasta_names2.txt

Exit view mode with "q".

These text files listing the names of the genome files are required by GToTree...

Let's make sure GToTree is running well here by having a look at it's help page:

GToTree -h

Note: GToTree can do way more than what we will make it do here!
Read all GToTree details in its official release publication here: https://academic.oup.com/bioinformatics/article/35/20/4162/5378708

This shows us that, basically, there are "required" and "mandatory" inputs to run GToTree.
i) You must give GToTree the names of files with genome sequences or access numbers to genome sequences in a database.
ii) You must give GToTree a set of targets (in the form of HMM or hidden Markov models): these can be taxonomically specific:
Actinobacteria, Gammaproteobacteria; broad: Bacteria, Archaea, or as broad as possible: Universal.

Let's have a look at these "HMMs" to see what I mean:

gtt-hmms

Thus, for this second mandatory input, we will use "Bacteria.hmm".

Now, there are many optional inputs. We will use a single one "-o " which lets our output (results) be stored in a new directory that we get to name.

Here is the command to finally generate a phylogenomic tree of all the genomes in this directory:

GToTree -f fasta_names.txt -H Bacteria -o PhyloGenTree

Watch it run (~7 mins) πŸ˜ƒ πŸ’» !

Done?

If we list items in the directory, we should see a new directory storing our GToTree results.

ls

Change directory to "PhyloGenTree" and list items there:

cd PhyloGenTree
ls

This is our phylogenomic tree: PhyloGenTree.tre

Next, click on the folder icon in the left panel of the binder screen and nagivate to the Phylogenomics/Known_Genomes/PhyloGenTree directory. There you will find our tree file: PhyloGenTree.tre. Right click on it and select download to an easily accessible place in your computer (eg: Desktop).

Now copy and paste the following link to your browser: https://itol.embl.de/upload.cgi
Under "Tree file:", click "Choose File". Navigate to where the PhyloGenTree.tre file was downloaded in your computer (eg: the Desktop in my case).

Select the PhyloGenTree.tre file, once back on the upload page, select "Upload".

Modify the tree aesthetics to your needs/liking!: Here we may want to color the branches as a function of taxonomic clade
When ready, go to the "Export" tab, select your file format (SGV, pdf, etc), and, FINALLY, click export.

Congrats- you've made a phylogenomic tree from scratch! πŸ‘πŸ½ πŸ’» 🧬 !!!!

Placing an unknown genome into our phylogenomic to infer taxonomy of the unknown

Let's move directly from here to the Unknown_Genomes folder and lists its contents:

cd /home/jovyan/Phylogenomics/Unknowns_Genomes/
ls

-if you got lost here, you can also arrive by going to the home directory first, as follows:

cd
cd Phylogenomics/Unknowns_Genomes/
ls

The following: "Unknown1.fasta Unknown2.fasta Unknown3.fasta Unknown4.fasta Unknown5.fasta" are all genomes with unknown taxonomic assignments.

Next, we will copy (cp) one of these to the Known_Genomes directory from here as follows:

cp Unknown5.fasta ~/Phylogenomics/Known_Genomes/ 

Let's change directories back to "Known_Genomes" and confirm that we have a copy of Unknown5.fasta there!

cd ~/Phylogenomics/Known_Genomes/

There it is! - We are ready to re-run GToTree. This time we give it "fasta_names2.txt", which also includes Unknown5.fasta) as the required list of genomes we want treed. The other required option will remain the same "-H Bacteria". Why?. Lastly, we will have GToTree direct our output (results) to another new folder in this directory that we'll call "PhyloGenTree_Unknown5"

Here it goes!:

GToTree -f fasta_names2.txt -H Bacteria -o PhyloGenTree_Unknown5

Watch it run AGAIN (~7 mins) πŸ˜ƒ πŸ’» ! (doesn't get old to me lol).

Done?

As before, if we list items in the directory, we should see a new directory storing our GToTree results.

ls

Change directory to "PhyloGenTree_Unknown5" and list items there:

cd PhyloGenTree
ls

This is our phylogenomic tree: PhyloGenTree_Unknown5.tre

Now: Follow the same steps as above:
i) download PhyloGenTree_Unknown5.tre from the binder to your computer.
ii) go to this website: https://itol.embl.de/upload.cgi
iii) upload your PhyloGenTree_Unknown5.tre
iv) play with the options on the GUI.
v) export!

What is the phylogenetic relationship of our chosen unknown to the other genomes/clades in the tree? -As you can see our Unknown is a member of the Chloriflexi, yet another microbial marine sponge 🦠 🌊 🧽 MAG.

In sum

Today you learned the basics behind command line BLAST (making a db and BLASTing a query against it) and how to use phylogenomic treeing software (GToTree) to ask a meaningful biological questions.

These are useful tools that allow us to answer simple yet important questions about i) microorganisms and their evolutionary relationships to each other (phylogenomics), ii) gene (... and depending on the gene: function/metabolism) presence or absence in an organism or organisms (genomes) or in a particular environment (metagenomes), and many possible iterations of these simple yet powerful core concepts in biology.

** This will be it for today... For the rest our class time, let's talk about potential project ideas given the concepts/tools you've been exposed to so far! Many more new tools coming!

Sneak peak: Next class: We will use machine learning (AI) approaches to look for viruses in medical and environmental metagenomes: Stay tuned!
Nerd Mode 1000: Activated! πŸ€“ πŸ’» 🧬 πŸ§ͺ 🦠 !!!