-
Notifications
You must be signed in to change notification settings - Fork 0
Home
Time: 13.10, 1.10.2018
- Unzip reference data
- Unzip Illumina data
- Format coltrol: head -20 [filename.format]
-
Each read has 4 lines of information, and then the next read starts on the following line.
-
The first line starts with the @ symbol, and contains identifiers and information about this read.
-
The next line contains the actual sequence, then on line three there is a ‘+’, which may sometimes have the identifier and info repeated.
-
Line 4 contains the quality string, where ASCII characters encode the quality score for each base.
- The quality score ranges from 0 to about 40; the higher the number, the greater the accuracy of the base call.
-
- Format coltrol: head -20 [filename.format]
- Quality control
- we used word count with the lines flag to see how many lines there are in each fastq file. wc -l [filename.fastq] * calculated the number of reads in each file * 1823504 amp_res_1.fastq * 1823504 amp_res_2.fastq
- Inspecting raw sequencing data with fastqc. Filtering the reads.
apt-get install fastqc fastqc -h fastqc -o amp_res_1.fastq amp_res_2.fastq
-
The basic statistics match what we calculated for the number of reads manually.
-
Run Trimmomatic
- Run Trimmomatic in paired end mode, with following parameters:
- select the correct phred scale - you can figure it out from the first part of the FastQC reports “Basic statistics” - “Encoding”. Illumina1.9
- Run Trimmomatic in paired end mode, with following parameters:
S - Sanger Phred+33, raw reads typically (0, 40) X - Solexa Solexa+64, raw reads typically (-5, 40) I - Illumina 1.3+ Phred+64, raw reads typically (0, 40) J - Illumina 1.5+ Phred+64, raw reads typically (3, 40) with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) (Note: See discussion above). L - Illumina 1.8+ Phred+33, raw reads typically (0, 41)
- Cut bases off the start of a read if quality below 20
- Cut bases off the end of a read if quality below 20
- Trim reads using a sliding window approach, with window size 10 and average quality within the window 20.
- Drop the read if it is below length 20.
-threads How many processors do you want Trimmomatic to run with? SE or PE Single End or Paired End reads? -phred33 or -phred64 Which quality score do your reads have? SLIDINGWINDOW Perform sliding window trimming, cutting once the average quality within the window falls below a threshold. LEADING Cut bases off the start of a read, if below a threshold quality. TRAILING Cut bases off the end of a read, if below a threshold quality. CROP Cut the read to a specified length. HEADCROP Cut the specified number of bases from the start of the read. MINLEN Drop an entire read if it is below a specified length. TOPHRED33 Convert quality scores to Phred-33. TOPHRED64 Convert quality scores to Phred-64.
TrimmomaticPE [-threads ] [-phred33] [-trimlog ] [-basein | ] [-baseout | ] ...
java -jar trimmomatic-0.32+dfsg-1.jar PE -phred33 TrimmomaticPE -phred33 amp_res_1.fastq amp_res_2.fastq _1P.fq _1U.fq _2P.fq _2U.fq LEADING:20 TRAILING:20 SLIDINGWINDOW:10:20 MINLEN:20
TrimmomaticPE: Started with arguments: amp_res_1.fastq amp_res_2.fastq _1P.fq _1U.fq _2P.fq _2U.fq LEADING:20 TRAILING:20 SLIDINGWINDOW:10:20 MINLEN:20 Multiple cores found: Using 4 threads Quality encoding detected as phred33 Input Read Pairs: 455876 Both Surviving: 446259 (97,89%) Forward Only Surviving: 9216 (2,02%) Reverse Only Surviving: 273 (0,06%) Dropped: 128 (0,03%) TrimmomaticPE: Completed successfully
java -jar trimmomatic-0.38.jar PE -phred33 amp_res_1.fastq amp_res_2.fastq _1P.fq _1U.fq _2P.fq _2U.fq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:10:20 MINLEN:20
-
☐ Input Read Pairs: 455876 Both Surviving: 446259 (97,89%) Forward Only Surviving: 9216 (2,02%) Reverse Only Surviving: 273 (0,06%) Dropped: 128 (0,03%)
-
how many paired reads were kept?
- 455876 (four times less than lines)
-
we cheched it manually
wc -l
* 1P 1785036 lines
* 2P 1785036 lines
- fastqc analysis on the _1P.fq and _2P.fq files.
- What happens if we increase the quality score at all steps to 30 at trimmomatic?
- TrimmomaticPE
amp_res_2.fastq _1P.fq _1U.fq _2P.fq _2U.fq LEADING:30 TRAILING:30 SLIDINGWINDOW:10:30 MINLEN:20
-
Multiple cores found: Using 4 threads
Quality encoding detected as phred33 Input Read Pairs: 446259 Both Surviving: 354123 (79,35%) Forward Only Surviving: 35710 (8,00%) Reverse Only Surviving: 50171 (11,24%) Dropped: 6255 (1,40%) TrimmomaticPE: Completed successfully
1P ./pasted_image013.png?width=500./pasted_image014.png?width=500 ./pasted_image015.png?width=500./pasted_image016.png?width=500 ./pasted_image017.png?width=500 ./pasted_image018.png?width=500
./pasted_image019.png?width=500./pasted_image020.png?width=500 ./pasted_image021.png?width=500./pasted_image022.png?width=500
2P ./pasted_image023.png?width=500./pasted_image024.png?width=500 ./pasted_image025.png?width=500./pasted_image026.png?width=500 ./pasted_image027.png?width=500./pasted_image028.png?width=500 ./pasted_image029.png?width=500./pasted_image030.png?width=500 ./pasted_image031.png?width=500./pasted_image032.png?width=500
- Alingnment to reference
sudo apt-get install bwa
- index the reference file
bwa index -a is GCA_000005845.2_ASM584v2_genomic.fna
* we observe a bunch of new files in directotry
-
Align your reads
- We aligned trimmed, paired sequences
- bwa mem GCA_000005845.2_ASM584v2_genomic.fna _1P.fq _2P.fq > alignment.sam
- compressed and sorted the sam file samtools view -S -b alignment.sam > alignment.bam
- We aligned trimmed, paired sequences
-
samtools flagstat alignment.bam
892776 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 891655 + 0 mapped (99.87%:-nan%) 892776 + 0 paired in sequencing 446402 + 0 read1 446374 + 0 read2 888879 + 0 properly paired (99.56%:-nan%) 890679 + 0 with itself and mate mapped 976 + 0 singletons (0.11%:-nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)l
- ☐ If we use quality treshold at Trimmomatic stage equals 30
71438 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 69996 + 0 mapped (97.98%:-nan%) 71438 + 0 paired in sequencing 35719 + 0 read1 35719 + 0 read2 856 + 0 properly paired (1.20%:-nan%) 68564 + 0 with itself and mate mapped 1432 + 0 singletons (2.00%:-nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
- Sort bam file by sequence coordinate on reference:
samtools sort alignment.bam alignment_sorted
- Indexing
samtools index alignment_sorted.bam
* Viewing the BAM file in IGV browser
-
Variant calling
-
We created mpileup file
samtools mpileup -f GCA_000005845.2_ASM584v2_genomic.fna alignment_sorted.bam > my.mpileup
-
-
Scanned possible variants
java -jar VarScan.v2.3.9.jar mpileup2snp -h
Only SNPs will be reported Warning: No p-value threshold provided, so p-values will not be calculated Min coverage: 8 Min reads2: 2 Min var freq: 0.2 Min avg qual: 15 P-value thresh: 0.01 USAGE: java -jar VarScan.jar mpileup2cns [pileup file] OPTIONS mpileup file - The SAMtools mpileup file
OPTIONS: --min-coverage Minimum read depth at a position to make a call [8] --min-reads2 Minimum supporting reads at a position to call variants [2] --min-avg-qual Minimum base quality at a position to count a read [15] --min-var-freq Minimum variant allele frequency threshold [0.01] --min-freq-for-hom Minimum frequency to call homozygote [0.75] --p-value Default p-value threshold for calling variants [99e-02] --strand-filter Ignore variants with >90% support on one strand [1] --output-vcf If set to 1, outputs in VCF format --vcf-sample-list For VCF output, a list of sample names in order, one per line --variants Report only variant (SNP/indel) positions [0]
* ☐ If we use at Trimmomatic stage quality threshold equals 30, we obtain similar results
- java -jar VarScan.v2.3.9.jar mpileup2snp my.mpileup OPTIONS --min-var-freq 0.1 --variants --output-vcf 1 > VarScan_results.vcf
- ☒ --min-var-freq 0.1 --- too low value of the parameter
java -jar VarScan.v2.3.9.jar mpileup2snp my.mpileup OPTIONS --min-var-freq 0.1 --variants --output-vcf 1 > VarScan_results.vcf Only SNPs will be reported Warning: No p-value threshold provided, so p-values will not be calculated Min coverage: 8 Min reads2: 2 Min var freq: 0.1 Min avg qual: 15 P-value thresh: 0.01 Reading input from my.mpileup 4641448 bases in pileup file 10 variant positions (7 SNP, 3 indel) 0 were failed by the strand-filter 7 variant positions reported (7 SNP, 0 indel)
- ☒ --min-var-freq 0.01 --- too low value of the parameter
java -jar VarScan.v2.3.9.jar mpileup2snp my.mpileup OPTIONS --min-var-freq 0.01 --variants --output-vcf 1 > VarScan_results_0.01trsh.vcf Only SNPs will be reported Warning: No p-value threshold provided, so p-values will not be calculated Min coverage: 8 Min reads2: 2 Min var freq: 0.01 Min avg qual: 15 P-value thresh: 0.01 Reading input from my.mpileup 4641448 bases in pileup file 10 variant positions (7 SNP, 3 indel) 0 were failed by the strand-filter 7 variant positions reported (7 SNP, 0 indel)
- --min-var-freq 0.5 --- optimal value of the parameter
java -jar VarScan.v2.3.9.jar mpileup2snp my.mpileup OPTIONS --min-var-freq 0.5 --variants --output-vcf 1 > VarScan_results_0.5trsh.vcf Only SNPs will be reported Warning: No p-value threshold provided, so p-values will not be calculated Min coverage: 8 Min reads2: 2 Min var freq: 0.5 Min avg qual: 15 P-value thresh: 0.01 Reading input from my.mpileup 4641448 bases in pileup file 9 variant positions (6 SNP, 3 indel) 0 were failed by the strand-filter 6 variant positions reported (6 SNP, 0 indel)
* --min-var-freq 0.75 --- best value of the parameter according to
java -jar VarScan.v2.3.9.jar mpileup2snp my.mpileup OPTIONS --strand-filter 1 --min-var-freq 0.75 --variants --output-vcf 1 > VarScan_results_075tsh_lt.vcf Only SNPs will be reported Warning: No p-value threshold provided, so p-values will not be calculated Min coverage: 8 Min reads2: 2 Min var freq: 0.75 Min avg qual: 15 P-value thresh: 0.01 Reading input from my.mpileup 4641448 bases in pileup file 9 variant positions (6 SNP, 3 indel) 0 were failed by the strand-filter 6 variant positions reported (6 SNP, 0 indel)
- SNP analysis in IGV browser