Manuscript accepted at BMC Genomics. Code made public for reviewer access, but in Beta-testing until manuscript publication. Please feel free to get in touch with requests or comments: sarahpyfrom@gmail.com
Supplementary information and a description of PLAIDOH output files can be found in Pyfrom et al, 2019 (PLAIDOH: A Novel Approach for Functional Prediction of Long Non-Coding RNAs Identifies Cancer-Specific LncRNA Activities). We suggest you take a look at Table 1!
After cloning or downloading PLAIDOH from github, you should have the following directory structure:
|-- Default_Files
|-- Biomartquery_hg19.txt.zip
|-- ChIP_Files
|-- <lots of tissue filenames>
|-- EnhancerAtlas_Enhancers.bed
|-- K562_and_MCF7_POL2RAChIApet.txt.zip
|-- Nuclear_Fraction_GM12878_hg19.txt
|-- RBPs
|-- <lots of RBP filenames>
|-- SEA_SuperEnhancers.bed
|-- hESCTADboundariesCombined.bed
|-- Example_Input_File.txt
|-- PLAIDOH.pl
|-- README.md
PLEASE NOTE: ALL default files use Human hg19 genomic coordinates. If you have input expression data aligned to another genome assembly, I suggest using a genomic liftOver tool to lift over your coordinates to hg19
PLAIDOH requires a single user-generated input file, in addition to the publicly available default inputs that come with PALIDOH. The input file must have the following columns in the order shown in the example table below:
#CHR | START | STOP | NAME | TYPE | SAMPLE1 | SAMPLEN |
---|---|---|---|---|---|---|
chr1 | 112 | 256 | DHX9 | protein_coding | 0.675 | 5.89 |
chr1 | 778 | 4334 | AC00896.1 | lncRNA | 89 | 4 |
chr1 | 334 | 566 | RP9911.3 | antisense_rna | 8.3 | 0.33 |
CHR must be the chromosome for the transcript on that line. The START and STOP coordinates may be any region of the transcripts the user wishes to study, but PLAIDOH will only consider within the boundaries provided by the input file. The NAME category may be either a gene name or an ENSEMBL identifier. The TYPE category may be any identifier the user chooses but PLAIDOH will ONLY provide annotation and predictions for transcripts labeled “protein_coding, “lncRNA” or antisense_rna” all other designations will be filtered into the “misc” output file. The user may include as many samples as they choose and utilize any calculation for expression (e.g., FPKM, CPM, microarray probe count etc.). The input should be sorted by CHR and START coordinate using sort -k1,1 -k2,2n.
PLAIDOH has few dependencies, but does require a few common bioinformatics tools/languages on your PATH. Most people will likely already have them installed:
To test your dependencies and create some necessary default files, PLAIDOH should be run on the included test file:
perl PLAIDOH.pl --cellENH A549 --cellSE A549 --cellCHIP A549 --RBP NONO Example_Input_File.txt
This process is only necessary once. It should create two Default files:
Default_Files/RNABindingProtein_eCLIP_h19.txt
Default_Files/ENCODE_ChIPseq_pValues.txt
And will unzip:
K562_and_MCF7_POL2RAChIApet.txt.zip
Biomartquery_hg19.txt.zip
The test run should also create all of PLAIDOH's standard output files for the Example input file:
PLAIDOH_OUTPUT_Example_Input_File.txt
This is the primary PLAIDOH output file described in detail in the table below.
PLAIDOH_RUN_LOG.txt
Each time PLAIDOH.pl is run, this file is updated with a new entry, recording the input files and options selected by the user and including "COMPLETED SUCCESSFULLY" with a date and time at which the program finished running if PLAIDOH successfully completed its run and all calculations.
lncs_Example_Input_File.txt
contains all entries from the input file with "lncRNA" or "antisense_RNA" in the "Type" column.
protein_coding_Example_Input_File.txt
contains all entries from the input file with "protein_coding" in the "Type" column.
misc_Example_Input_File.txt
contains all entries from the input file with anything OTHER THAN "protein_coding", "lncRNA" or "antisense_RNA" in the "Type" column.
User_Selected_Enhancers.txt
, User_Selected_Super_Enhancers.txt
, and User_Selected_RBPs.txt
, are the subset of each datatype that the user's options selected from the chosen Enhancer, Super enhancer, and RBP files.
lncRNAs_bound_by_RBPs.txt
is output from a bedtools intersect between lnc_Input_Filename.txt and any RBPs selected by the user.
lncs_and_proteins_intersect_Example_Input_File.txt
is output from a bedtools intersect between lncs_Example_Input_File.txt and protein_coding_Example_Input_File.txt.
lncRNAs_in_SuperEnhancers.txt
is output from a bedtools intersect between lncs_Example_Input_File.txt and any User_Selected_Super_Enhancers.txt.
PLAIDOH also outputs three R-generated plots of your data:
RankedCisRegulatoryScores.jpg
is a graph plotting every Cis-Regulatory score for every LncRNA/Coding gene Pair (LCP) identified by PLAIDOH, ranked from lowest to highest. A linear regression line is plotted in red and can be used to determine the best cut-off score for a likely cis-regulatory lncRNA.
RankedEnhancerScores.jpg
is a graph plotting every Enhnacer score for every LCP identified by PLAIDOH, ranked from lowest to highest. A linear regression line is plotted in red and can be used to determine the best cut-off score for a likely enhancer-associated lncRNA.
CisRegulatoryScorebyEnhancerScore.jpg
is a graph plotting the Cis-regulatory and Enhancer score for each LCP.
To run PLAIDOH on your own dataset, you should use the following command with the appropriate options to specify your own datasets:
USAGE
perl PLAIDOH.pl [options] <Input Expression File>
REQUIRED INPUT FILE
Tab-delimited file with columns: #Chr Start Stop Name Type(lncRNA, antisense_RNA or protein_coding) Sample1_Expression Sample2_Expression etc
The input file should include all lncRNA and protein-coding transcripts of interest.
Please see Example_Input_File.txt for an Example of a properly-formatted Input file.
Please Note: Input expression file and all user-generated bed
files should be sorted using the following command if you wish
to use any default PLAIDOH input files:
sort -k1,1 -k2,2n in.txt > in_sorted.txt
OPTIONS
-s filename for Super-enhancer file.
Default file: SEA_SuperEnhancers.bed
--cellSE Optional selection to pick a specific cell line
from the super enhancer list
-e filename for Enhancer data.
Default file: EnhancerAtlas_Enhancers.bed
--cellENH Optional selection to pick a specific cell line
from the Enhancer list
-f finename for Nuclear Fraction file
Default file: Nuclear_Fraction_GM12878_hg19.txt
-b filename for Biomart Query file (includes GO terms and other annotatin options)
Default file: Biomartquery_hg19.txt
-c filename for Chip-seq data
Default file: ENCODE_ChIPseq_pValues.txt
--cellCHIP Optional selection to pick a specific cell line
from the ChipSeq list
-p filename for CHIA-pet data
Default file: K562_and_MCF7_POL2RAChIApet.txt
-r filename for RBP file
Default file: RNABindingProtein_eCLIP_h19.txt
--RBP Optional selection to pick a specific RBP
variable from the RBP list
Detailed descriptions for all input files and their sources can be found in the Methods
secition of Pyfrom, Luo and Payton, 2018 BMC Genomics.
PLAIDOH outputs a several supporting files and a single final file with the name "Output_$input_filename". It has 42 tab-delimited columns, which are described in the following table. Each line contains information about a single lncRNA and one coding gene within 400kb of the lncRNA:
Column Number | Column Heading | Description |
---|---|---|
1 | LINE_NUMBER | Unique numerical designation for each lncRNA and protein pair (LCP) |
2 | CHRlnc | Chromosome of lncRNA |
3 | STARTlnc | Downstream chromosomal position of lncRNA |
4 | STOPlnc | Upstreamstream chromosomal position of lncRNA |
5 | NAMElnc | Name of lncRNA |
6 | TYPElnc | "lncRNA" |
7 | NUMBERlnc | Unique numerical designation for each lncRNA; different for each isoform of a given lncRNA included in input |
8 | CHRcoding | Chromosome of protein |
9 | STARTcoding | Downstream chromosomal position of protein |
10 | STOPprot | Upstreamstream chromosomal position of protein |
11 | NAMEcoding | Name of protein |
12 | TYPEcoding | "protein_coding" |
13 | NUMBERcoding | Unique numerical designation for each protein; different for each isoform of a given protein included in input |
14 | LNC_CODING_OVERLAP | "Overlap" if current lncRNA and protein overlap by 1+ base pairs. "Distal" if there is no overlap. |
15 | DIST_L_to_C | Shortest distance between lncRNA and coding-gene as determined by START and STOP positions from input. |
16 | LNC_AVERAGE | Average expression of lncRNA across all input samples. |
17 | PROTEIN_AVERAGE | Average expression of coding-gene across all input samples. |
18 | PEARSON | Pearson correlation between lncRNA and protein expression as calculated by R cor.test package. |
19 | PEARSONp | p-value of above pearson correlation. |
20 | SPEARMAN | Spearman correlation between lncRNA and protein expression as calculated by R cor.test package. |
21 | SPEARMANp | p-value of above spearman correlation. |
22 | CHIA | "0" if no evidence of ChIA-PET interaction between lncRNA and coding-gene. Any >0 value represents the score of interaction (between 100 and 1000 if default input is used). |
23 | TAD | 1 if both lncRNA and protein are both at least partly within the same TAD, 0 if entirety of lncRNA or protein is in a separate TAD. |
24 | INTERGENIC | 1 if lncRNA does not overlap with ANY "protein_coding" gene in the input file. 0 if lncRNA overlaps with ANY "protein_coding" gene in the input file. |
25 | CHIPSEQ | Full list of ChIP-seq peaks and associated scores that overlap any part of the lncRNA. |
26 | H3K4ME3 | Highest -log10 p-value of any overlapping H3K4ME3 ChIP-seq peak. 0 if no overlap. |
27 | H3K27AC | Highest -log10 p-value of any overlapping H3K27AC ChIP-seq peak. 0 if no overlap. |
28 | H3K4ME1 | Highest -log10 p-value of any overlapping H3K4ME1 ChIP-seq peak. 0 if no overlap. |
29 | RNA Binding Proteins | List of all RNA Binding proteins and their binding score (200 or 1000) bound to lncRNA. |
30 | NEAR_ENH | Shortest distance between lncRNA and closest enhancer as determined by START and STOP positions from input files. |
31 | NEAR_SE | Shortest distance between lncRNA and closest super enhancer as determined by START and STOP positions from input files. |
32 | GO | Gene Ontology for Protein. Matched by name of protein. Empty if exact match for protein not found in GO input file. |
33 | L_strand | "1" if lncRNA on + strand. "-1" if lncRNA on negative strand. Empty if exact match for protein not found in GO input file. |
34 | C_strand | "1" if protein on + strand. "-1" if protein on negative strand. Empty if exact match for protein not found in GO input file. |
35 | TSStoTSSdist | Distance from TSS of lncRNA to TSS of protein. Only available if both lncRNA and protein names can be found in GO input file. |
36 | LNC_NUC_FRACTION | Fraction of lncRNA transcript found in nucleus. If default files are used, this refers to the fraction of reads found in the nucleus vs cytoplasm of GM12878 cell line from ENCODE. |
37 | CODING_NUC_FRACTION | Fraction of protein transcript found in nucleus. If default files are used, this refers to the fraction of reads found in the nucleus vs cytoplasm of GM12878 cell line from ENCODE. |
38 | SENSE_CATEGORY | "Distal" = The lncRNA and coding gene are non-overlapping AND (separated by at least 2kb OR they are on the same strand). "Antisense_Proximal" = The lncRNA and coding gene's TSS' are less than 2kb apart on opposing strands, but the transcripts do not overlap. "Antisense_Overlap" = The lncRNA and coding gene's TSS' are less than 2kb apart on opposing strands, and the transcripts do overlap by one bp or greater. " "Overlap" = The lncRNA and coding gene overlap but their TSS' are greater than 2kb apart OR they are on the same strand. |
39 | ADJUSTED_SPEARMAN | Adjusted Spearman p-value of lncRNA/protein correlation to correct for multiple tests. |
40 | SCORE | Cis-regulatory score calculated by PLAIDOH |
41 | ENHANCER_SCORE | Enhancer Score calculated by PLAIDOH |
42 | FRACTION_SCORE | For each possible lncRNA/RBP combination a “Fraction Score” was calculated. If there is no evidence that the lncRNA and RBP interact based on the ENCODE eCLIP data, the pair is given a score of 0. If an RBP binds a given lncRNA and the sub-cellular localization of both the lncRNA and corresponding RBP are both determined to be Nuclear or both are Cytoplasmic, the pair is given a score of 2 or -2, respectively. If the lncRNA is primarily Nuclear and the RBP does not have a Nuclear localization by Immunofluorescence (or vice versa), the overlapping region is given a score of 1 or -1. If both the RBP and lncRNA are considered to be present in both the nucleus and cytoplasm (as described above) the pair is given a score of 3. |