Graph_Genome

Introduction

This pipeline detects rare pathogens by aligning raw reads from the Illumina platform to a graph reference genome. The graph-based approach allows multiple reference genomes to be used at the same time. Currently, the pipeline targets the detection of subtypes of Enterovirus species.

Vsearch is used to cluster the mature peptides of all known Enterovirus subtypes. This clustering technique is a protein content-based approach.

Groot Graphs is used to index the database for searching and to build the graph for visualising the variants between all subtypes within Enterovirus species.

Pipeline

Building Clustering Database

A fasta file including all known subtypes is provided.
From this fasta file, obtain the list of accession numbers (accession_no) for each subtype, then download the GenBank files containing the complete genome of each subtype.
Extract the mature peptides of each subtype and prepare the pre-clustered files (in '*.fna' format).
Run vsearch to obtain the clustered database.
Index the database with groot to build the graph.
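As an illustration of the second step, the accession list can be collected from the fasta headers. The following is a minimal sketch, not the repository's own script: it assumes the accession number is the first whitespace-delimited token of each header, and the function names and example accessions (`XX000001.1`, `XX000002.1`) are made up for the example.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def accession_numbers(lines: Iterable[str]) -> List[str]:
    """Return unique accession numbers in order of first appearance.

    Assumption: the accession is the first token of the header line.
    """
    seen = set()
    ordered = []
    for header, _sequence in read_fasta(lines):
        tokens = header.split()
        if not tokens:
            continue
        accession = tokens[0]
        if accession not in seen:
            seen.add(accession)
            ordered.append(accession)
    return ordered


# Hypothetical input; real headers may follow a different layout.
example = """>XX000001.1 hypothetical subtype A
ACGTACGT
ACGT
>XX000002.1 hypothetical subtype B
TTGACC
""".splitlines()

print(accession_numbers(example))
```

The resulting list can then be passed to whichever download step is used to fetch the GenBank files.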

Fastq file processing

Raw reads in the fastq files are quality-checked with FastQC and trimmed at the default threshold Q = 20; only reads with a length of 100 +/- 10 bp are then aligned to the genome graph.
A visualisation of the number of reads before and after the quality check can be provided.
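The length filter and the before/after read counts can be sketched as follows. This is only an illustration under stated assumptions: the bounds are taken as inclusive (90 to 110 bp), the input is well-formed 4-line FASTQ, and the helper names are hypothetical rather than taken from the pipeline scripts. Quality trimming itself is not shown.

```python
from typing import Iterable, Iterator, List, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def read_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield reads from well-formed 4-line FASTQ records."""
    records = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(records) - 3, 4):
        name, sequence, _plus, quality = records[i:i + 4]
        yield name[1:], sequence, quality


def filter_by_length(reads: Iterable[Read], target: int = 100,
                     tolerance: int = 10) -> List[Read]:
    """Keep reads whose length lies in [target - tolerance, target + tolerance]."""
    low, high = target - tolerance, target + tolerance
    return [read for read in reads if low <= len(read[1]) <= high]


def make_record(name: str, length: int) -> List[str]:
    """Build a synthetic FASTQ record of the given length (test data only)."""
    return [f"@{name}", "A" * length, "+", "I" * length]


lines: List[str] = []
for name, length in [("r1", 100), ("r2", 85), ("r3", 110), ("r4", 111), ("r5", 90)]:
    lines.extend(make_record(name, length))

reads = list(read_fastq(lines))
kept = filter_by_length(reads)
print(f"before: {len(reads)} reads, after: {len(kept)} reads")
```

The two counts printed at the end are the values a before/after plot would display.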

Alignment reads

Once the multiple sequence alignment database has been built, reads from the fastq files are aligned by a shell script that runs a Python script. A result table for the read alignment is then extracted, with information including:

Building graph

The multiple-reference graph is visualised with Bandage.

Cluster Analysis

List of metrics

Visualisation

The graph genome is visualised with Bandage to give an overview of the variation and of the hits obtained when aligning the query sequence.

Simulation data

Wgsim tools

Simulated fastq files will be created from fasta files:

simulation for the read length
simulation for the error rate / mutation rate
simulation for the contaminant reads
simulation for ...
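A parameter sweep over read length and error rate could be prepared along these lines. This is a hedged sketch rather than the contents of simulation.py: it only assembles wgsim command lines (options `-N`, `-1`, `-2`, `-e`, `-r`, `-S`), and the file names, default values and the `wgsim_command` helper are assumptions made for the example.

```python
import shlex
from typing import List


def wgsim_command(reference: str, out_prefix: str, read_length: int = 100,
                  n_pairs: int = 1000, error_rate: float = 0.0,
                  mutation_rate: float = 0.0, seed: int = 1) -> List[str]:
    """Assemble a wgsim call producing paired fastq files from a fasta file."""
    return [
        "wgsim",
        "-N", str(n_pairs),        # number of read pairs
        "-1", str(read_length),    # length of the first read
        "-2", str(read_length),    # length of the second read
        "-e", str(error_rate),     # base error rate
        "-r", str(mutation_rate),  # rate of mutations
        "-S", str(seed),           # seed, for reproducible simulations
        reference,
        f"{out_prefix}_1.fq",
        f"{out_prefix}_2.fq",
    ]


# Hypothetical sweep: three read lengths crossed with two error rates.
for length in (90, 100, 110):
    for error_rate in (0.0, 0.01):
        command = wgsim_command(
            "subtype.fasta",  # placeholder reference file name
            f"sim_len{length}_err{error_rate}",
            read_length=length,
            error_rate=error_rate,
        )
        print(shlex.join(command))
```

Each command list can be passed to `subprocess.run(command, check=True)` once wgsim is installed. Contaminant reads are not covered here; they would need reads simulated from a second reference and mixed in.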

Estimating the FDR

One query sample versus graph

Multiple query samples versus graph

Adding mutation rate to query samples

Estimated sensitivity and specificity

Estimating TP and FN

Estimating FP and TN

Calculating sensitivity and specificity
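For reference, the three quantities named in these headings can be computed from confusion counts as below. The sketch uses the standard textbook definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), FDR = FP / (FP + TP)); how the pipeline itself estimates TP, FN, FP and TN is not shown here, and the class name and example counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts of true/false positives and negatives for one experiment."""
    tp: int
    fn: int
    fp: int
    tn: int

    @staticmethod
    def _ratio(numerator: int, denominator: int) -> float:
        # Return NaN instead of raising when a denominator is zero.
        return numerator / denominator if denominator else float("nan")

    def sensitivity(self) -> float:
        """Fraction of true positives among all actual positives."""
        return self._ratio(self.tp, self.tp + self.fn)

    def specificity(self) -> float:
        """Fraction of true negatives among all actual negatives."""
        return self._ratio(self.tn, self.tn + self.fp)

    def fdr(self) -> float:
        """Fraction of false positives among all positive calls."""
        return self._ratio(self.fp, self.fp + self.tp)


# Made-up counts, for illustration only.
counts = ConfusionCounts(tp=90, fn=10, fp=5, tn=95)
print(f"sensitivity = {counts.sensitivity():.3f}")
print(f"specificity = {counts.specificity():.3f}")
print(f"FDR         = {counts.fdr():.3f}")
```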
