PLAIDOH LncRNA Annotation Software

Manuscript accepted at BMC Genomics. Code made public for reviewer access, but in Beta-testing until manuscript publication. Please feel free to get in touch with requests or comments: sarahpyfrom@gmail.com

Supplementary information and a description of PLAIDOH output files can be found in Pyfrom et al, 2019 (PLAIDOH: A Novel Approach for Functional Prediction of Long Non-Coding RNAs Identifies Cancer-Specific LncRNA Activities). We suggest you take a look at Table 1!

Installation

After cloning or downloading PLAIDOH from github, you should have the following directory structure:

|-- Default_Files
    |-- Biomartquery_hg19.txt.zip
    |-- ChIP_Files
	|-- <lots of tissue filenames>
    |-- EnhancerAtlas_Enhancers.bed
    |-- K562_and_MCF7_POL2RAChIApet.txt.zip
    |-- Nuclear_Fraction_GM12878_hg19.txt
    |-- RBPs
	|-- <lots of RBP filenames>
    |-- SEA_SuperEnhancers.bed
    |-- hESCTADboundariesCombined.bed
|-- Example_Input_File.txt
|-- PLAIDOH.pl
|-- README.md

PLEASE NOTE: ALL default files use Human hg19 genomic coordinates. If you have input expression data aligned to another genome assembly, I suggest using a genomic liftOver tool to lift over your coordinates to hg19

Required Input File

PLAIDOH requires a single user-generated input file, in addition to the publicly available default inputs that come with PALIDOH. The input file must have the following columns in the order shown in the example table below:

#CHR	START	STOP	NAME	TYPE	SAMPLE1	SAMPLEN
chr1	112	256	DHX9	protein_coding	0.675	5.89
chr1	778	4334	AC00896.1	lncRNA	89	4
chr1	334	566	RP9911.3	antisense_rna	8.3	0.33

CHR must be the chromosome for the transcript on that line. The START and STOP coordinates may be any region of the transcripts the user wishes to study, but PLAIDOH will only consider within the boundaries provided by the input file. The NAME category may be either a gene name or an ENSEMBL identifier. The TYPE category may be any identifier the user chooses but PLAIDOH will ONLY provide annotation and predictions for transcripts labeled “protein_coding, “lncRNA” or antisense_rna” all other designations will be filtered into the “misc” output file. The user may include as many samples as they choose and utilize any calculation for expression (e.g., FPKM, CPM, microarray probe count etc.). The input should be sorted by CHR and START coordinate using sort -k1,1 -k2,2n.

Dependencies

PLAIDOH has few dependencies, but does require a few common bioinformatics tools/languages on your PATH. Most people will likely already have them installed:

Quick Start

To test your dependencies and create some necessary default files, PLAIDOH should be run on the included test file:

perl PLAIDOH.pl --cellENH A549 --cellSE A549 --cellCHIP A549 --RBP NONO Example_Input_File.txt

This process is only necessary once. It should create two Default files:

Default_Files/RNABindingProtein_eCLIP_h19.txt
Default_Files/ENCODE_ChIPseq_pValues.txt

And will unzip:

K562_and_MCF7_POL2RAChIApet.txt.zip
Biomartquery_hg19.txt.zip

The test run should also create all of PLAIDOH's standard output files for the Example input file:

PLAIDOH_OUTPUT_Example_Input_File.txt This is the primary PLAIDOH output file described in detail in the table below.

PLAIDOH_RUN_LOG.txt Each time PLAIDOH.pl is run, this file is updated with a new entry, recording the input files and options selected by the user and including "COMPLETED SUCCESSFULLY" with a date and time at which the program finished running if PLAIDOH successfully completed its run and all calculations.

lncs_Example_Input_File.txt contains all entries from the input file with "lncRNA" or "antisense_RNA" in the "Type" column.

protein_coding_Example_Input_File.txt contains all entries from the input file with "protein_coding" in the "Type" column.

misc_Example_Input_File.txt contains all entries from the input file with anything OTHER THAN "protein_coding", "lncRNA" or "antisense_RNA" in the "Type" column.

User_Selected_Enhancers.txt, User_Selected_Super_Enhancers.txt, and User_Selected_RBPs.txt, are the subset of each datatype that the user's options selected from the chosen Enhancer, Super enhancer, and RBP files.

lncRNAs_bound_by_RBPs.txt is output from a bedtools intersect between lnc_Input_Filename.txt and any RBPs selected by the user.

lncs_and_proteins_intersect_Example_Input_File.txt is output from a bedtools intersect between lncs_Example_Input_File.txt and protein_coding_Example_Input_File.txt.

lncRNAs_in_SuperEnhancers.txt is output from a bedtools intersect between lncs_Example_Input_File.txt and any User_Selected_Super_Enhancers.txt.

PLAIDOH also outputs three R-generated plots of your data:

RankedCisRegulatoryScores.jpg is a graph plotting every Cis-Regulatory score for every LncRNA/Coding gene Pair (LCP) identified by PLAIDOH, ranked from lowest to highest. A linear regression line is plotted in red and can be used to determine the best cut-off score for a likely cis-regulatory lncRNA.

RankedEnhancerScores.jpg is a graph plotting every Enhnacer score for every LCP identified by PLAIDOH, ranked from lowest to highest. A linear regression line is plotted in red and can be used to determine the best cut-off score for a likely enhancer-associated lncRNA.

CisRegulatoryScorebyEnhancerScore.jpg is a graph plotting the Cis-regulatory and Enhancer score for each LCP.

PLAIDOH Usage

To run PLAIDOH on your own dataset, you should use the following command with the appropriate options to specify your own datasets:

    USAGE

        perl PLAIDOH.pl [options] <Input Expression File> 
                
    REQUIRED INPUT FILE
        
        Tab-delimited file with columns: #Chr  Start   Stop   Name   Type(lncRNA, antisense_RNA or protein_coding) Sample1_Expression Sample2_Expression etc
        
        The input file should include all lncRNA and protein-coding transcripts of interest.
        
        Please see Example_Input_File.txt for an Example of a properly-formatted Input file. 
         
                Please Note: Input expression file and all user-generated bed
                files should be sorted using the following command if you wish
                to use any default PLAIDOH input files:
                sort -k1,1 -k2,2n in.txt > in_sorted.txt
                
    OPTIONS
        
       -s filename for Super-enhancer file.
            Default file: SEA_SuperEnhancers.bed
        
        --cellSE Optional selection to pick a specific cell line
               from the super enhancer list
           
       -e filename for Enhancer data.
            Default file: EnhancerAtlas_Enhancers.bed
           
        --cellENH Optional selection to pick a specific cell line
               from the Enhancer list
       
       -f finename for Nuclear Fraction file
            Default file: Nuclear_Fraction_GM12878_hg19.txt
       
       -b filename for Biomart Query file (includes GO terms and other annotatin options)
            Default file: Biomartquery_hg19.txt
        
       -c filename for Chip-seq data
            Default file: ENCODE_ChIPseq_pValues.txt
            
        --cellCHIP Optional selection to pick a specific cell line
                from the ChipSeq list
        
       -p filename for CHIA-pet data
             Default file: K562_and_MCF7_POL2RAChIApet.txt   
        
       -r filename for RBP file
            Default file: RNABindingProtein_eCLIP_h19.txt
        
        --RBP Optional selection to pick a specific RBP
                variable from the RBP list

                
        Detailed descriptions for all input files and their sources can be found in the Methods
        secition of Pyfrom, Luo and Payton, 2018 BMC Genomics.

PLAIDOH Output File

PLAIDOH outputs a several supporting files and a single final file with the name "Output_$input_filename". It has 42 tab-delimited columns, which are described in the following table. Each line contains information about a single lncRNA and one coding gene within 400kb of the lncRNA:

Column Number	Column Heading	Description
1	LINE_NUMBER	Unique numerical designation for each lncRNA and protein pair (LCP)
2	CHRlnc	Chromosome of lncRNA
3	STARTlnc	Downstream chromosomal position of lncRNA
4	STOPlnc	Upstreamstream chromosomal position of lncRNA
5	NAMElnc	Name of lncRNA
6	TYPElnc	"lncRNA"
7	NUMBERlnc	Unique numerical designation for each lncRNA; different for each isoform of a given lncRNA included in input
8	CHRcoding	Chromosome of protein
9	STARTcoding	Downstream chromosomal position of protein
10	STOPprot	Upstreamstream chromosomal position of protein
11	NAMEcoding	Name of protein
12	TYPEcoding	"protein_coding"
13	NUMBERcoding	Unique numerical designation for each protein; different for each isoform of a given protein included in input
14	LNC_CODING_OVERLAP	"Overlap" if current lncRNA and protein overlap by 1+ base pairs. "Distal" if there is no overlap.
15	DIST_L_to_C	Shortest distance between lncRNA and coding-gene as determined by START and STOP positions from input.
16	LNC_AVERAGE	Average expression of lncRNA across all input samples.
17	PROTEIN_AVERAGE	Average expression of coding-gene across all input samples.
18	PEARSON	Pearson correlation between lncRNA and protein expression as calculated by R cor.test package.
19	PEARSONp	p-value of above pearson correlation.
20	SPEARMAN	Spearman correlation between lncRNA and protein expression as calculated by R cor.test package.
21	SPEARMANp	p-value of above spearman correlation.
22	CHIA	"0" if no evidence of ChIA-PET interaction between lncRNA and coding-gene. Any >0 value represents the score of interaction (between 100 and 1000 if default input is used).
23	TAD	1 if both lncRNA and protein are both at least partly within the same TAD, 0 if entirety of lncRNA or protein is in a separate TAD.
24	INTERGENIC	1 if lncRNA does not overlap with ANY "protein_coding" gene in the input file. 0 if lncRNA overlaps with ANY "protein_coding" gene in the input file.
25	CHIPSEQ	Full list of ChIP-seq peaks and associated scores that overlap any part of the lncRNA.
26	H3K4ME3	Highest -log10 p-value of any overlapping H3K4ME3 ChIP-seq peak. 0 if no overlap.
27	H3K27AC	Highest -log10 p-value of any overlapping H3K27AC ChIP-seq peak. 0 if no overlap.
28	H3K4ME1	Highest -log10 p-value of any overlapping H3K4ME1 ChIP-seq peak. 0 if no overlap.
29	RNA Binding Proteins	List of all RNA Binding proteins and their binding score (200 or 1000) bound to lncRNA.
30	NEAR_ENH	Shortest distance between lncRNA and closest enhancer as determined by START and STOP positions from input files.
31	NEAR_SE	Shortest distance between lncRNA and closest super enhancer as determined by START and STOP positions from input files.
32	GO	Gene Ontology for Protein. Matched by name of protein. Empty if exact match for protein not found in GO input file.
33	L_strand	"1" if lncRNA on + strand. "-1" if lncRNA on negative strand. Empty if exact match for protein not found in GO input file.
34	C_strand	"1" if protein on + strand. "-1" if protein on negative strand. Empty if exact match for protein not found in GO input file.
35	TSStoTSSdist	Distance from TSS of lncRNA to TSS of protein. Only available if both lncRNA and protein names can be found in GO input file.
36	LNC_NUC_FRACTION	Fraction of lncRNA transcript found in nucleus. If default files are used, this refers to the fraction of reads found in the nucleus vs cytoplasm of GM12878 cell line from ENCODE.
37	CODING_NUC_FRACTION	Fraction of protein transcript found in nucleus. If default files are used, this refers to the fraction of reads found in the nucleus vs cytoplasm of GM12878 cell line from ENCODE.
38	SENSE_CATEGORY	"Distal" = The lncRNA and coding gene are non-overlapping AND (separated by at least 2kb OR they are on the same strand). "Antisense_Proximal" = The lncRNA and coding gene's TSS' are less than 2kb apart on opposing strands, but the transcripts do not overlap. "Antisense_Overlap" = The lncRNA and coding gene's TSS' are less than 2kb apart on opposing strands, and the transcripts do overlap by one bp or greater. " "Overlap" = The lncRNA and coding gene overlap but their TSS' are greater than 2kb apart OR they are on the same strand.
39	ADJUSTED_SPEARMAN	Adjusted Spearman p-value of lncRNA/protein correlation to correct for multiple tests.
40	SCORE	Cis-regulatory score calculated by PLAIDOH
41	ENHANCER_SCORE	Enhancer Score calculated by PLAIDOH
42	FRACTION_SCORE	For each possible lncRNA/RBP combination a “Fraction Score” was calculated. If there is no evidence that the lncRNA and RBP interact based on the ENCODE eCLIP data, the pair is given a score of 0. If an RBP binds a given lncRNA and the sub-cellular localization of both the lncRNA and corresponding RBP are both determined to be Nuclear or both are Cytoplasmic, the pair is given a score of 2 or -2, respectively. If the lncRNA is primarily Nuclear and the RBP does not have a Nuclear localization by Immunofluorescence (or vice versa), the overlapping region is given a score of 1 or -1. If both the RBP and lncRNA are considered to be present in both the nucleus and cytoplasm (as described above) the pair is given a score of 3.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Default_Files

Default_Files

.gitignore

.gitignore

Example_Input_File.txt

Example_Input_File.txt

LICENSE

LICENSE

PLAIDOH.pl

PLAIDOH.pl

README.md

README.md

Repository files navigation

PLAIDOH LncRNA Annotation Software

Installation

Required Input File

Dependencies

Quick Start

PLAIDOH Usage

PLAIDOH Output File

About

Releases

Packages

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
Default_Files		Default_Files
.gitignore		.gitignore
Example_Input_File.txt		Example_Input_File.txt
LICENSE		LICENSE
PLAIDOH.pl		PLAIDOH.pl
README.md		README.md

License

sarahpyfrom/PLAIDOH

Folders and files

Latest commit

History

Repository files navigation

PLAIDOH LncRNA Annotation Software

Installation

Required Input File

Dependencies

Quick Start

PLAIDOH Usage

PLAIDOH Output File

About

Resources

License

Stars

Watchers

Forks

Languages