rnaSQLite

Tool for storing cuffdiff output in a SQLite database and matching it with known gene functions and families. Although the tool was designed to take the PANTHER Sequence Association file, the script or input reference files can be reformatted to accept any annotation or database file. Furthermore, the schema used for the SQLite database was designed to be as source-database agnostic as possible, which will hopefully make this tool flexible for use with other sources, such as KEGG or CPDB.

Prerequisites

The rnaSQLite tool uses the following Perl module:

DBD-SQLite

On RHEL/CentOS systems you can install it as follows:

yum install perl-DBD-SQLite

Getting Started (the basics)

Download the repo and unzip.

unzip master.zip

If your system does not have unzip, you will need to install it. On RHEL/CentOS systems: yum install unzip

The first step after downloading is to set up the database and store the reference codes used throughout the program.

./init_db.pl my_rnaseq_data.db

Change my_rnaseq_data.db to a name that better describes your project or the analysis you are going to do.

Initialize the reference/annotation database. This can be for either mouse or human (more species to come).

Download the Sequence Association file from PANTHER (ftp://ftp.pantherdb.org/pathway/current_release/SequenceAssociationPathway3.5.txt).

Before use, please read their README (ftp://ftp.pantherdb.org/pathway/current_release/README) and LICENSE (ftp://ftp.pantherdb.org/pathway/current_release/LICENSE).

For Mouse

Download the mouse gene list file from JAX (http://www.informatics.jax.org/downloads/reports/MRK_List2.rpt) and run the following program:

./init_mouse_ref.pl my_rnaseq_data.db SequenceAssociationPathway3.5.txt MRK_List2.rpt

For Human

Download the human gene list from HGNC (ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/locus_groups/protein-coding_gene.txt) and run the following program:

./init_human_ref.pl my_rnaseq_data.db SequenceAssociationPathway3.5.txt protein-coding_gene.txt

Take the cuffdiff diff_out file and store it in the SQLite database. Use the appropriate species short name: HUMAN = Homo sapiens, MOUSE = Mus musculus.

It may be worth copying the SQLite database file at this point as a backup. It is easier to come back to this step than to redo the whole initialisation process.
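A minimal sketch of such a backup, assuming the database file is named my_rnaseq_data.db as in the examples here. The backup_db function and the .after_init suffix are illustrative only and are not part of rnaSQLite.

```shell
# backup_db is a hypothetical helper, not part of rnaSQLite.
# It copies <name>.db to <name>.after_init.db next to the original
# and prints the path of the copy.
backup_db() {
    db="$1"
    backup="${db%.db}.after_init.db"
    if [ ! -f "$db" ]; then
        echo "Database $db not found; nothing copied" >&2
        return 1
    fi
    # -p preserves the timestamps and permissions of the original file
    cp -p "$db" "$backup" && echo "$backup"
}

# Usage: backup_db my_rnaseq_data.db
```

To return to this point later, copy the backup over the working database file before re-running the import.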

./cuffdiff2SQLite.pl my_rnaseq_data.db /path/to/cuff/diff/output HUMAN

This step may take a while, depending on how large the diff_out file is.

Generate a report with the pathways/functions and genes. By default the program will use a 0.05 p-value cutoff and 5 FPKM cutoff. You can change this by using the -p and -r flags respectively.

./report_pathways.pl my_rnaseq_data.db HEALTHY TREATMENT /path/to/output.txt

For a p-value of less than 0.01 and FPKM cutoff of 10:

./report_pathways.pl my_rnaseq_data.db HEALTHY TREATMENT /path/to/output.txt -p 0.01 -r 10
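To sanity-check what a given pair of cutoffs keeps, a similar filter can be sketched directly against cuffdiff's output with awk. This is not how report_pathways.pl works internally. The sketch assumes a tab-delimited file with a header row naming the columns p_value, value_1 and value_2 (as in cuffdiff's gene_exp.diff), and it assumes a row passes when either sample reaches the FPKM cutoff. The filter_diff name is hypothetical.

```shell
# filter_diff is a hypothetical helper, not part of rnaSQLite.
# Usage: filter_diff gene_exp.diff [p_cutoff] [fpkm_cutoff]
# Defaults mirror the documented defaults: p-value 0.05, FPKM 5.
filter_diff() {
    awk -F'\t' -v p="${2:-0.05}" -v r="${3:-5}" '
        # Map header names to column numbers, then print the header.
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; print; next }
        # Keep rows below the p-value cutoff where either sample
        # reaches the FPKM cutoff (this either/or rule is an assumption).
        ($(col["p_value"]) + 0 < p + 0) &&
        ($(col["value_1"]) + 0 >= r + 0 || $(col["value_2"]) + 0 >= r + 0)
    ' "$1"
}

# Usage: filter_diff /path/to/gene_exp.diff 0.01 10
```

Looking the columns up by header name rather than by position keeps the sketch usable if the column order differs between files.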

Cytoscape and CPDB Induced Network Modules

In this section we will use the gene list obtained from report_pathways.pl to get a network list from CPDB's induced network modules. CPDB, or ConsensusPathDB, is a network analysis tool that integrates many different pathway databases into one. At the time of writing, CPDB supports human, yeast and mouse pathways (http://cpdb.molgen.mpg.de/). Once a network list is obtained, we will use cpdb2cytoscape.pl to merge it with gene expression data, which we will then import into Cytoscape for visualisation. Download Cytoscape from http://www.cytoscape.org/.

When using CPDB, first select your species at the top of the website (http://cpdb.molgen.mpg.de/). Then access the induced network modules by clicking on "gene set analysis" in the menu on the right. You will see "induced network modules" show up. Click it and you will be presented with a text field in which to paste a list of genes.
Open the output file from report_pathways.pl in a spreadsheet program to copy the list of gene symbols, or extract the first column from the text file using awk or a similar method. Duplicate gene symbols may appear, as a single gene can be part of more than one pathway. You can remove the duplicates if you like, although they will not affect the induced network modules.
Paste the list of genes into the text box on CPDB's induced network modules page and click "Proceed".
Once the page loads you will see a visualisation of the network. To download the network list as a text file, click on "export" at the top of the page.
Use cpdb2cytoscape.pl to merge in the expression data, which can then be visualised as colour changes in Cytoscape.

./cpdb2cytoscape.pl my_rnaseq_data.db HEALTHY TREATMENT /path/to/CPDB_inducedModules /path/to/output.txt

You will then be able to import the resulting text file into a new session in Cytoscape for visualisation.
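The gene symbol extraction mentioned in the steps above can be sketched with awk. This assumes the report is tab-delimited with the gene symbol in the first column. The extract_gene_symbols name is hypothetical and not part of rnaSQLite.

```shell
# extract_gene_symbols is a hypothetical helper, not part of rnaSQLite.
# It prints each distinct, non-empty value of the first tab-delimited
# column once, in order of first appearance.
extract_gene_symbols() {
    awk -F'\t' '$1 != "" && !seen[$1]++ { print $1 }' "$1"
}

# Usage: extract_gene_symbols /path/to/output.txt > gene_symbols.txt
```

If the report starts with a header row, that row will appear in the list and should be removed before pasting into CPDB.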

Database Schema

The following are the table schemas and related details.

diff_table

Column Name	Type	Remarks
id	INTEGER	PRIMARY KEY AUTOINCREMENT NOT NULL
sample_id_1	INTEGER	NOT NULL
sample_id_2	INTEGER	NOT NULL
diff_status	CHAR(8)	NOT NULL
log2FC	REAL	NOT NULL
test_stat	REAL	NOT NULL
p_value	REAL	NOT NULL
q_value	REAL	NOT NULL

Code:

$stmt = qq(CREATE TABLE IF NOT EXISTS diff_table(
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        sample_id_1 INTEGER NOT NULL,
        sample_id_2 INTEGER NOT NULL,
        diff_status CHAR(8) NOT NULL,
        log2FC REAL NOT NULL,
        test_stat REAL NOT NULL,
        p_value REAL NOT NULL,
        q_value REAL NOT NULL););

reference_table

Column Name	Type	Remarks
id	INTEGER	PRIMARY KEY AUTOINCREMENT NOT NULL
accession_id	CHAR(16)	NOT NULL
gene_symbol	CHAR(16)	NOT NULL
gene_name	CHAR(64)	NOT NULL
chromosome	CHAR(2)	NOT NULL
species	INTEGER	NOT NULL
pathway_accession	CHAR(16)	NOT NULL
pathway_name	TEXT	NOT NULL
evidence_id	CHAR(16)	NOT NULL
evidence_type	CHAR(16)	NOT NULL
panther_subfamily_id	CHAR(16)
panther_subfamily_name	TEXT

Code:

$stmt = qq(CREATE TABLE IF NOT EXISTS reference_table(
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        accession_id CHAR(16) NOT NULL,
        gene_symbol CHAR(16) NOT NULL,
        gene_name CHAR(64) NOT NULL,
        chromosome CHAR(2) NOT NULL,
        species INTEGER NOT NULL,
        pathway_accession CHAR(16) NOT NULL,
        pathway_name TEXT NOT NULL,
        evidence_id CHAR(16) NOT NULL,
        evidence_type CHAR(16) NOT NULL,
        panther_subfamily_id CHAR(16),
        panther_subfamily_name TEXT););

sample_table

Column Name	Type	Remarks
id	INTEGER	PRIMARY KEY AUTOINCREMENT NOT NULL
sample_name	TEXT	NOT NULL
species	INTEGER	NOT NULL
gene_symbol	CHAR(16)	NOT NULL
accession_id	CHAR(16)	NOT NULL
reads	INTEGER	NOT NULL

Code:

$stmt = qq(CREATE TABLE IF NOT EXISTS sample_table(
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        sample_name TEXT NOT NULL,
        species INTEGER NOT NULL,
        gene_symbol CHAR(16) NOT NULL,
        accession_id CHAR(16) NOT NULL,
        reads INTEGER NOT NULL););

species_table

Column Name	Type	Remarks
id	INTEGER	PRIMARY KEY AUTOINCREMENT NOT NULL
short_name	CHAR(8)	NOT NULL
organism	TEXT	NOT NULL
common_name	TEXT	NOT NULL

Code:

$stmt = qq(CREATE TABLE IF NOT EXISTS species_table(
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        short_name CHAR(8) NOT NULL,
        organism TEXT NOT NULL,
        common_name TEXT NOT NULL););
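
For a quick look at a populated database, the tables above can be queried with the sqlite3 command-line shell. The sketch below assumes sqlite3 is installed (it is separate from the DBD-SQLite Perl module) and that the database contains the species_table and diff_table described above. The inspect_db name is hypothetical.

```shell
# inspect_db is a hypothetical helper, not part of rnaSQLite.
# Requires the sqlite3 command-line shell.
inspect_db() {
    db="$1"
    # List the tables present in the database file.
    sqlite3 "$db" ".tables"
    # Show the species codes stored in species_table.
    sqlite3 "$db" "SELECT id, short_name, organism, common_name FROM species_table;"
    # Count the diff_table rows with a p-value below 0.05.
    sqlite3 "$db" "SELECT COUNT(*) FROM diff_table WHERE p_value < 0.05;"
}

# Usage: inspect_db my_rnaseq_data.db
```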
