GenomeAnalytica

Table of Contents

Package Description
Flow
End to End Demo
User Manual
- Supplementary Commands
- Main Commands
Output
Contributors

Package Description

This is a package that provide tools to extract different genome architecture features, build taxonomy lineage and test features against certain value in taxonomy rank in fully automated pipeline.

The package is very user friendly allowing users to ask question like:

Does Vertebrates have a significant GC content than other species ?
What are the genome architecture features that significant in Eukaryotes ?
What are the features that signifies Betacoronavirus from other Riboviria ?

Output of the genomeAnalytica package is a csv similar to common deferential expression experiments (features with adjusted p-values and averages) and a boxplot figures for significant features.

The package is providing an end to end dedicative pipeline for investigating genomic architecture features which is unique to this package as far as we know.

Flow

Prerequisites

Option 1 - Manual
Download and install the latest version of the following.

BASH
bedtools
entrez-direct
Python
pandas
scipy
statsmodels
matplotlib
seaborn

Option 2 - Automated
Make conda environment and use the requirements.txt to install needed libraries

$ conda create --name <env_name> --file requirements.txt

$ conda activate <env_name>

End to End Demo

Make data directory and under it make raw directory under it make directory for each species and create a txt file containing 2 ftp links 1 for species genome gtf file and another for species genome fasta file like the following tree:

data
├── raw
   ├── Caenorhabditis_elegans
   │   └── WgetMe.txt
   ├── Choloepus_hoffmanni
   │   └── WgetMe.txt
   ├── Chrysolophus_pictus
   │   └── WgetMe.txt
   ├── Ciona_intestinalis
   │   └── WgetMe.txt
   ├── Ciona_savignyi
   │   └── WgetMe.txt
   ├── Drosophila_melanogaster
   │   └── WgetMe.txt
   ├── Homo_sapiens
   │   └── WgetMe.txt
   ├── Monopterus_albus
   │   └── WgetMe.txt
   ├── Mus_musculus
   │   └── WgetMe.txt
   └── Saccharomyces_cerevisiae
       └── WgetMe.txt

For using the same species as above just download from data

Download and unzip the data at each corresponding directory using the following command.

$ ./get_data.sh data/raw WgetMe.txt

Make features for all species under data/raw. Can be done either using a Fast Mode mode or Expert Mode mode

Option 1 - Fast Mode:
e.g: creating statistics for (all, protein coding, non protein coding) for (gene, exon, transcript, three_prime_utr, five_prime_utr)

$ ./make_arch_fet.sh data/raw/ g pg npg e pe npe t pt npt f pf npf h ph nph

Option 2 - Expert Mode:
e.g: creating statistics for protein coding exons

for i in `ls data/raw/`; do echo $i; ./get_len_gc.sh data/raw/$i/ exon -i transcript_biotype=protein_coding ; done

Make taxonomy lineage for all species under data/raw using the following command

$ ./make_tax.sh data/raw/ tax_mapping.csv

Make test for calculated feature under data/output using the taxonomy lineage and specifying the 10th rank (supphilum) and (vertebrata) as the value to compare against using the following command:

$ ./make_arch_test.py --stats data/output/ --tax data/tax_mapping.csv --tax_value vertebrata

User Manual

The package consists of three main commands to build standard features, build mapping and test them supplementary commands are available to batch download ftp links and build custom genome architecture features.

Supplementary Commands

get_data

Description:
This script loop throw the species directories under user specified species containing directory; reading the user specified text file containing the ftp links for gtf and fasta then unzip them
Positional Arguments
There are 2 mandatory arguments
arg1: the directory containing the species directories
arg2: the name of the files that contains the ftp link to download
Example:

$ ./get_data.sh data/raw WgetMe.txt

get_len_gc

Description
This module gets stats (avg gc content, avg length and count ) for a custom user specified feature and filtration in a gtf file. Features extracted can differ from those served out of the box in make_arch_fet.
Positional Arguments
There are 2 mandatory arguments :
arg1: the path to directory containing the species gtf and fasta files (uncompressed) with the following hirarchy: raw/species/ e.g (data/raw/Homo_Sapiens)
arg2: the feature that will be filtered for stats. this feature should exist in column 3 gtf file. e.g (exon, gene, transcript)
There are 2 optional arguments to be used when you want to filter certain tag in column 9 gtf:
arg3: a flag to include or exclude the following (tag=value) argument. to include use -i , to exclude use -e
arg4: the name of the tag you want to filter tigetherr with the filtering value in the form of (tag=value). e.g (-i gene_biotype=protein_coding)
Example:

to calculate the stats of all genes of Homo Sapiens use the following

$ bash get_len_gc.sh dat/raw/Homo_Sapiens gene

to calculate the stats of exons coming from non-protein coding transcript of Homo Sapiens use the following

$ bash get_len_gc.sh dat/raw/Homo_Sapiens exon -e transcript_biotype=proteing_coding

Main Commands

make_arch_fet

Description
This script generalize applying stats like (avg gc content, avg length, count ) for all species under user input containing dir given that each species dir contains gtf and fasta file for species genome
Positional Arguments
There are minimum 2 mandatory arguments :
arg1: the path to directory containing all the species directories containing gtf and fasta files (uncompressed). Dir must be named raw e.g (data/raw/)
arg2,3,4,...: 1 to 3 characters representing the features that will be filtered for stats. with the following naming:
filtration criteria: can be:
p for pretein coding or
np for non proteing coding
feature type: can be:
g for gene
e for exon
t for transcript
f for five_prime_utr
h for three_prime_utr
e.g pg for protein coding gene; npt for non protein coding three_prime_utr
Example:
to calculate the stats of (non protein coding genes, protein coding exons and all five_prime_utr) for all species under data/raw/ use the following

$ ./ make_arch_fet.sh data/raw/ npg pe f

To calculate all features

$ ./make_arch_fet.sh data/raw/ g pg npg e pe npe t pt npt f pf npf

make_tax

Description
This script loop throw the species directories under user specified species contining directory; using the species dir naming the taxonomy lineage record is retrieved from ncbi, appended in csv file and saved in user specified file
Positional Arguments
There are 2 mandatory arguments
arg1: the directory containing the species directories
arg2: the name of the taxonomy mapping csv file to be saved
Example:

$ ./make_tax.sh data/raw/ tax_mapping.csv

make_arch_test

Description
This script takes the path for data created by make_arch_fet, path for species lineage csv created by make_tax, rank value to test againist. Count the nulls (which is filled with 0 and returned as 0_count) in each feature, applies 2 sample 2 tailed ttest and Manwhetney, correct using benferroni and returns
i. parsed data architicture feature in 1 csv
ii. test results containing (feature,0_count,avg_[rank_value],avg_non_[rank_value],MannWhitney_p,ttest_p,MannWhitney_adj_p,ttest_adj_p,MannWhitney_adj_reject,ttest_adj_reject)
iii. filtered parsed data architicture features for only significant ones according to corrected p-value < 0.05
Named Arguments:
-h, --help: show this help message and exit
--stats [STATS]: Directory containning the nested csv stats for all species in form of (species, stats)
--tax [TAX]: File csv contianing the full lineage of all species in stats
--out_test [OUT_TEST]: File to write the test results
--out_parsed [OUT_PARSED]: File to write parsed data
--out_parsed_filtered [OUT_PARSED_FILTERED]: File to write parsed data with filtered significant features
--out_fig [OUT_FIG] png File to write boxplot figure
--tax_value [RANK_VALUE]: value to test for in the tax rank e.g: Vertebrata to test Vert vs In-Vert in Subphylum
Example:

$ ./make_arch_test.py --stats data/output/ --tax data/tax_mapping.csv --tax_value vertebrata

Output

test results

feature	0_count	avg_vertebrata	avg_non_vertebrata	MannWhitney_p	ttest_p	MannWhitney_adj_p	ttest_adj_p	MannWhitney_adj_reject	ttest_adj_reject
gene_-i_gene_biotype=protein_coding_count	0	17616.4	12821.4	0.000145036	4.61E-08	0.005221295	1.66E-06	TRUE	TRUE
exon_-e_transcript_biotype=protein_coding_avg_len	0	247.5974	382.6778	0.000145036	3.91E-05	0.005221295	0.001405841	TRUE	TRUE

siginificant features boxplot

Contributors

Ahmed Tarek
Abdullah Alkhawaja
Ali El-Nisr
Montaser Bellah Yasser

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data/raw

data/raw

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

get_data.sh

get_data.sh

get_len_gc.sh

get_len_gc.sh

make_arch_fet.sh

make_arch_fet.sh

make_arch_test.py

make_arch_test.py

make_tax.sh

make_tax.sh

requirements.txt

requirements.txt

Repository files navigation

GenomeAnalytica

Package Description

Flow

Prerequisites

End to End Demo

User Manual

Supplementary Commands

Main Commands

Output

Contributors

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
data/raw		data/raw
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
get_data.sh		get_data.sh
get_len_gc.sh		get_len_gc.sh
make_arch_fet.sh		make_arch_fet.sh
make_arch_test.py		make_arch_test.py
make_tax.sh		make_tax.sh
requirements.txt		requirements.txt

License

ahmedtariq/GenomeAnalytica

Folders and files

Latest commit

History

Repository files navigation

GenomeAnalytica

Package Description

Flow

Prerequisites

End to End Demo

User Manual

Supplementary Commands

Main Commands

Contributors

About

Topics

Resources

License

Stars

Watchers

Forks

Languages