ylaboratory/splitpea
splitpea: SPLicing InTeractions PErsonAlized

Here we present Splitpea, a method for calculating network rewiring changes due to splicing.

Splitpea takes differential exons in the form of PSI values and combines them with domain-domain interactions (DDI) and protein-protein interactions (PPI) to produce a PPI network with edges that change due to rewiring between two conditions. As a proof of principle, we show how Splitpea can be used to identify novel cancer subtypes and driver genes affected by splicing changes using splicing changes in normal tissue as the background condition.

Citation

Splitpea: quantifying protein interaction network rewiring changes due to alternative splicing in cancer. Dannenfelser R and Yao V. Preprint of an article published in Pacific Symposium on Biocomputing 2024. https://doi.org/10.1101/2023.09.04.556262

Code Organization

All of the code needed to run the base functionality of Splitpea is found in the src directory. Helper scripts for installation and for exporting to other software tools are found in the src/utils directory. Scripts needed to reproduce the analysis and tables in the Splitpea manuscript can be found in src/analysis.

Setup

Conda is required for installing all relevant dependencies needed to run Splitpea. Once conda is installed, create a new environment with the dependencies specified in env.yml.

conda env create -f env.yml

Activate the environment:

conda activate splitpea

Tabix

This project makes heavy use of tabix (v1.17), available through htslib. Though htslib is available via bioconda, we have found that it is easier to install it separately to avoid package conflicts. We include a helper script for installing it. The script downloads htslib-1.17 to your working directory and installs everything to your home directory. The htslib-1.17 directory in your working directory is no longer needed after the install, as the binaries can be called directly. To keep things self-contained, you could also install the binaries into this project directory and modify paths as needed.

bash src/utils/install_tabix.sh
bash src/utils/tabix_feats.sh

R

This project also utilizes R for preprocessing PSI background and input PSI values.

Reference Files

Splitpea requires protein-protein and domain-domain interaction reference files, as well as genomic positions for exons and protein families as annotated in Pfam. While these reference files can be assembled manually, we provide them in the reference directory.

Spliced Exon Data

Splitpea currently uses spliced exon data from the IRIS project. This dataset is too large to be stored on GitHub, so only a small sample set is provided. The full dataset is available, together with the final output files, on Zenodo. If you wish to use this data with Splitpea, first download it from Zenodo and use the downloaded spliced exon files in place of the example data below.

Running Splitpea

To illustrate how to use Splitpea, we show how it can be run to generate both patient-specific rewired networks for individual pancreatic cancer samples and a consensus network of PPI rewiring events across all pancreatic cancer samples.

As the first step, we will create a background-level summary of splicing changes in the normal pancreas. The following script takes alternatively spliced exon data and creates mean summaries over all exon-level coordinates across normal and tumor samples.

python src/combine_spliced_exons.py test-data
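
As a rough illustration of what a per-exon mean summary looks like (this is a sketch, not the logic of combine_spliced_exons.py itself), the snippet below averages PSI values for each exon coordinate across samples. The dictionary layout, the coordinate strings, and the function name mean_psi_per_exon are assumptions made for the example; the real input files have their own format.

```python
from statistics import mean


def mean_psi_per_exon(psi_table):
    """Average PSI per exon across samples.

    psi_table is a hypothetical layout: exon coordinate -> list of
    per-sample PSI values, with None marking a missing measurement.
    Exons with no observed values are left out of the summary.
    """
    summary = {}
    for exon, values in psi_table.items():
        observed = [v for v in values if v is not None]
        if observed:
            summary[exon] = mean(observed)
    return summary


# Toy data: three normal samples, made-up coordinates and values.
normal = {
    "chr1:100-200": [0.90, 0.80, None],
    "chr2:500-650": [0.10, 0.30, 0.20],
    "chr3:10-90": [None, None, None],
}

for exon, value in mean_psi_per_exon(normal).items():
    print(exon, round(value, 3))
```

Dropping exons with no observed values, rather than reporting them as zero, keeps a missing measurement from being read as a fully excluded exon.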

Next, calculate the change in PSI values (delta PSI) for each cancer sample relative to the summarized normal pancreatic data. Here, we also empirically calculate a p-value for each delta PSI. (Note that this script also constructs a summary volcano plot as shown in Figure 4 of the manuscript.)

Rscript src/delta_psi.R -o psis -s test-data/spliced_exons_gtex_pancreas_test_combined_mean.txt -b test-data/spliced_exons_gtex_pancreas_test.txt -t test-data/spliced_exons_tcga_paad_test.txt
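
To make the idea of a delta PSI with an empirical p-value concrete, here is a minimal sketch in Python. It is one plausible formulation and is not taken from delta_psi.R: it assumes delta PSI is the sample PSI minus the mean normal PSI, and that the p-value is the add-one-corrected fraction of normal samples that deviate from the normal mean at least as much as the tumor sample does. The function names and toy values are hypothetical.

```python
from statistics import mean


def delta_psi(sample_psi, normal_psis):
    """Sample PSI minus the mean PSI of the normal (background) samples."""
    return sample_psi - mean(normal_psis)


def empirical_p_value(sample_psi, normal_psis):
    """Assumed two-sided empirical p-value for one exon.

    Counts normal samples whose absolute deviation from the normal mean
    is at least as large as the tumor sample's deviation, then applies
    an add-one correction so the p-value is never exactly zero.
    """
    center = mean(normal_psis)
    observed = abs(sample_psi - center)
    extreme = sum(1 for v in normal_psis if abs(v - center) >= observed)
    return (extreme + 1) / (len(normal_psis) + 1)


# Toy background: PSI for one exon in five normal samples.
normal_psis = [0.80, 0.85, 0.90, 0.75, 0.95]

print(round(delta_psi(0.40, normal_psis), 3))
print(round(empirical_p_value(0.40, normal_psis), 3))
```

With only a handful of background samples the smallest attainable p-value is 1 / (n + 1), so the resolution of an empirical test like this depends directly on the size of the normal cohort.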

Construct the background PPI network needed for later downstream analysis. This network represents what happens when there are no changes in the PPI network due to alternative splicing.

python src/get_background_ppi.py

Splitpea takes the preprocessed delta PSI values to generate a network with rewired edges (as both a .pickle and .dat file). In our case, we have a delta PSI file for each pancreatic cancer sample. Thus, we provide Splitpea with the directory containing these delta PSI files as well as an output directory for the final networks.

Here we use a bash script to parallelize and run Splitpea for each sample's psis. The -p flag is used to choose the number of cores for parallelization.

# run splitpea to take a directory of sample PSI values and construct
# sample level networks of rewiring changes
bash src/run_splitpea_batch.sh -i psis -o output -p 4

So far, we have generated sample level networks. To get one summary network for pancreatic cancer (the consensus network of interactions), we need to run a script to summarize and combine across the patient samples.

# combine
python src/get_consensus_network.py output
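
The idea of a consensus network can be sketched as follows: keep an edge only if it is rewired in the same direction in at least a chosen fraction of the sample networks. This is an illustration under stated assumptions, not the implementation in get_consensus_network.py. The edge representation (a pair of protein names mapped to +1 for a gained interaction or -1 for a lost one), the threshold parameter, and the function name consensus_edges are all hypothetical.

```python
from collections import Counter


def consensus_edges(sample_networks, threshold=0.8):
    """Return (edge, direction) pairs seen in >= threshold of samples.

    Each sample network is assumed to be a dict mapping a pair of
    protein names to +1 (gain) or -1 (loss). Pairs are sorted so that
    (A, B) and (B, A) count as the same undirected edge.
    """
    counts = Counter()
    for network in sample_networks:
        for edge, direction in network.items():
            counts[(tuple(sorted(edge)), direction)] += 1
    needed = threshold * len(sample_networks)
    return {key for key, n in counts.items() if n >= needed}


# Toy data: four patient-level networks over made-up proteins A, B, C.
samples = [
    {("A", "B"): 1, ("B", "C"): -1},
    {("B", "A"): 1, ("B", "C"): -1},
    {("A", "B"): 1, ("B", "C"): 1},
    {("A", "B"): 1},
]

print(sorted(consensus_edges(samples, threshold=0.75)))
print(sorted(consensus_edges(samples, threshold=0.5)))
```

Counting gains and losses separately means an edge that is gained in some patients and lost in others does not reach consensus merely because it changes often.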

Analysis

The scripts needed to reproduce the analysis in the paper are in the src/analysis folder. These scripts assume that you have either run Splitpea or downloaded the patient-specific networks for both breast cancer (BRCA) and pancreatic cancer (PAAD), as well as the consensus networks for each tumor type. The easiest setup is to run src/setup_from_zenodo.sh, which downloads BRCA-patient-rewired-networks.zip, BRCA_consensus_networks.zip, PAAD-patient-rewired-networks.zip, PAAD_consensus_networks.zip, BRCA-psi.zip, and PAAD-psi.zip from Zenodo, extracts all files, and places them in a directory called IRIS in the main Splitpea folder. Once these files are downloaded or created, you can generate the main figures using the following scripts in the analysis folder:

  • clustering_psi.R: code that takes delta psi values and generates clustered heatmaps with color legend bar with TCGA metadata (Figure 2)
  • get_largestcc_sizes.py: separated by directionality of each edge, calculates the size (# nodes, # edges) for each patient-specific network as well as the largest connected component (outputs the data table used to build Figure 3)
  • get_consensus_network_stats.py: calculates consensus networks at different thresholds (% agreement across all networks) and also outputs the number of nodes and edges for the corresponding consensus network
  • analyze_tcga_netstats.R: figure plotting code for proportion of edges across patient-specific networks (Figure 3) and positive/negative consensus network size change at different thresholds (Figure 5A-B)
  • clustering_networks.ipynb: generates network embeddings using FEATHER
  • clustering_networks.R: further analysis of the network embeddings (Figure 6)
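
For readers unfamiliar with the network statistics listed above, the size of the largest connected component can be computed with a breadth-first search. The sketch below is a standalone illustration in plain Python and is not the code in get_largestcc_sizes.py; the edge-list input and the function name largest_component_size are assumptions for the example.

```python
from collections import defaultdict, deque


def largest_component_size(edges):
    """Return (number of nodes, number of edges) of the largest component.

    edges is an iterable of (u, v) pairs for an undirected network.
    Duplicate pairs are counted once.
    """
    adjacency = defaultdict(set)
    edge_set = set()
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
        edge_set.add(frozenset((u, v)))

    seen = set()
    best = (0, 0)
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from start.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        # An edge belongs to this component if any endpoint does.
        n_edges = sum(1 for e in edge_set if next(iter(e)) in component)
        best = max(best, (len(component), n_edges))
    return best


# Toy network: a triangle A-B-C plus a separate edge D-E.
print(largest_component_size([("A", "B"), ("B", "C"), ("C", "A"), ("D", "E")]))
```

To mirror the per-direction breakdown described in the list, the same function could be called once on the gained edges and once on the lost edges of a patient network.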