Skip to content

jlgardy/tb_demo

Repository files navigation

tb_demo

Files and instructions for an in-class demo of TB genomic epidemiology

1. This demo requires that you have certain software tools installed on your computer. Before attempting the tutorial, ensure that these tools are installed and functioning. If you are running a different version, the syntax may be slightly different from what is shown here.

2. You will also need to download the data files we'll be using, which are provided to you via this repo. You'll need:

3. Make a directory called tb_demo and place your downloaded reference genome and sequencing data there:

mkdir tb_demo creates the directory

mv reference.fa tb_demo moves the reference genome to the tb_demo directory (this command will vary depending on where you downloaded your data to and where you created the tb_demo directory)

mv *.fastq.gz tb_demo moves the data to the tb_demo directory (as above, this may vary depending on where you created things)

gunzip *.gz unzips the compressed fastq files

4. Index the reference genome. This only needs to be done once, anytime you download a new reference genome:

bwa index reference.fa

5. Map the paired-end reads from each patient's isolate against the reference using bwa mem:

bwa mem reference.fa reads1.fastq reads2.fastq > outfiile.sam is the general syntax you'd use

bwa mem reference.fa patient1_1.fastq patient1_2.fastq > patient_1.sam is an example using the data from patient_1's isolate

6. For each SAM file, we must do a few operations to make it efficient to work with. We will use three different samtools utilities - view, sort, and index:

samtools view -b file.sam > file.bam is the general syntax you'd use to convert the text SAM file to the binary BAM format

samtools view -b patient_1.sam > patient_1.bam is an example using the data from patient_1's isolate

samtools sort file.bam -o file.sorted is the general syntax you'd use to sort the BAM

samtools sort patient_1.bam -o patient_1.sorted is an example using the data from patient_1's isolate

samtools index file.sorted is the general syntax you'd use to index the BAM

samtools index patient_1.sorted is an example using the data from patient_1's isolate

7. For each sorted, indexed BAM file, we will generate a pileup file summarizing the mapping at each position relative to our reference genome:

samtools mpileup -q 30 -u -f reference.fa file.sorted > file.bcf -I is the general syntax you'd use

samtools mpileup -q 30 -u -f reference.fa patient_1.sorted > patient_1.bcf -I is an example using the data from patient_1's isolate

8. For each pileup file, we will convert it to a human-readable format and, at the same time, extract only those positions at which our genome is different from the reference genome - the variants:

bcftools call -O v -mv file.bcf > file.vcf is the general syntax you'd use

bcftools call -O v -mv patient_1.bcf > patient_1.vcf is an example using the data from patient_1's isolate

9. We would then filter our VCF to keep only positions that we trust - this is more advanced bioinformatics, and there are a LOT of factors that will influence the types of filtering you might do. For demo purposes, we are going to skip ahead to the results file, showing only those high-confidence variants that are different between our three genomes.

  • variants.fa
  • the custom script I often use for a quick overview of all variant positions in all VCFs in a directory is vcf_to_fasta_het.py. To run it, you must have python installed and you can use the syntax python vcf_to_fasta_het.py -x vcfdirectory outputprefix

About

Files and instructions for an in-class demo of TB genomic epidemiology

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages