This repository reproduces the analysis for biosurfer tool

sheynkman-lab/biosurfer_analysis

Biosurfer Analysis

Analysis accompanying the manuscript "Biosurfer for systematic tracking of regulatory mechanisms leading to protein isoform diversity"

This repository contains the steps to run the Biosurfer analysis, which reproduces the results, summary plots, and figures for the Biosurfer manuscript (bioRxiv).

Contents

  1. Download Biosurfer analysis repository
  2. Download and install Biosurfer package
  3. Download input data
  4. Run Biosurfer modules
    1. Load database
    2. Run hybrid alignment
    3. Visualize protein isoforms
  5. Global characterization of altered protein regions in the human annotation (GENCODE)
    1. Altered protein regions across the human proteome
    2. Analysis of alternative splicing events that alter the N-terminus of proteins
    3. Characterization of splicing patterns underlying internal protein region differences
    4. Analyzing splicing patterns for C-terminal alterations

1. Download Biosurfer analysis repository

Clone the latest version of the source code and change into the repository directory:

git clone https://github.com/sheynkman-lab/biosurfer_analysis

cd biosurfer_analysis

2. Download and install Biosurfer package

Create the conda environment for Biosurfer from a terminal:

conda create --name biosurfer-install --channel conda-forge python=3 pip

Activate the conda environment:

conda activate biosurfer-install

Install graph-tool into the environment:

conda install --channel conda-forge graph-tool

Clone the Biosurfer repository:

git clone https://github.com/sheynkman-lab/biosurfer.git

Note: The Biosurfer package is cloned inside the biosurfer_analysis directory.

Run setup

The editable installation of the Biosurfer package looks for setup.py inside the biosurfer directory.

pip install --editable biosurfer

Note: If you get an importlib.metadata.PackageNotFoundError, deactivate the conda environment and activate it again.
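As a quick sanity check after installation, the sketch below tests whether the command-line tools used in this README are on the PATH. The helper name require_tool is hypothetical and not part of Biosurfer; it relies only on the standard command -v shell builtin.

```shell
# Hypothetical helper (not part of Biosurfer): report whether a command is on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Tools used in this README; a missing tool is reported but does not abort the shell.
for tool in conda git biosurfer; do
    require_tool "$tool" || true
done
```

If biosurfer is reported as missing, the conda environment is probably not active, or the editable install has not completed.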


3. Download input data

The input data used for the analysis and the corresponding outputs generated by Biosurfer can be found on Zenodo:

  1. GENCODE toy:
    • Description: Toy dataset generated from GENCODE v38
    • Use: This dataset can be used to test the functionality and modules of Biosurfer
    • Size: 4.2 MB
  2. GENCODE v42:
    • Description: Basic gene annotation on the primary assembly sequence regions
    • Use: Used for the analyses conducted in the manuscript
    • Size: 1.29 GB
  3. WTC11:
    • Description: Long-read RNA-seq data from the WTC11 human induced pluripotent stem cell (iPSC) line (Kreitzer et al. 2013)
    • Use: Used for the analyses conducted in the manuscript
    • Size: 644 MB

Run the download script for each dataset:

for source in gencode_toy gencode_v42 wtc11
do
    bash "./scripts/download_$source.sh"
done

Note: Any GENCODE version can be used with the appropriate GTF, transcript FASTA, and translation FASTA files.

Please also note that in the code, the terms anchor and other correspond, respectively, to the reference and alternative isoforms described in the manuscript.
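Before loading a database, it can help to confirm that the download scripts produced the expected files. The sketch below is one way to do that for the toy dataset, using the file paths that appear in the load_db commands of section 4. The helper name check_inputs is hypothetical and not part of Biosurfer.

```shell
# Hypothetical helper (not part of Biosurfer): print every expected input file
# that is absent or empty, and return non-zero if at least one is.
check_inputs() {
    rc=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f"
            rc=1
        fi
    done
    return "$rc"
}

# Toy dataset paths as used by the load_db commands in this README.
toy_dir=A_gencode_toy/biosurfer_gencode_toy_data
check_inputs \
    "$toy_dir/gencode.v38.toy.gtf" \
    "$toy_dir/gencode.v38.toy.transcripts.fa" \
    "$toy_dir/gencode.v38.toy.translations.fa" \
    || echo "re-run ./scripts/download_gencode_toy.sh"
```

The same function can be called with the GENCODE v42 or WTC11 paths; it prints nothing when every file is present and non-empty.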


4. Run Biosurfer modules

For more information on the modules, refer to the Biosurfer package repository (https://github.com/sheynkman-lab/biosurfer).

i. Load database

Running the load database module creates a SQLite database file under the biosurfer/databases/ directory.

GENCODE toy

biosurfer load_db \
    --source=GENCODE \
    --gtf A_gencode_toy/biosurfer_gencode_toy_data/gencode.v38.toy.gtf \
    --tx_fasta A_gencode_toy/biosurfer_gencode_toy_data/gencode.v38.toy.transcripts.fa \
    --tl_fasta A_gencode_toy/biosurfer_gencode_toy_data/gencode.v38.toy.translations.fa \
    -d gencode_toy

GENCODE v42

biosurfer load_db \
    --source=GENCODE \
    --gtf A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.basic.annotation.gtf \
    --tx_fasta A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.pc_transcripts.fa \
    --tl_fasta A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.pc_translations.fa \
    -d gencode_v42

WTC11

Load the GENCODE v42 GTF annotations first to set the reference isoforms for the WTC11 PacBio data:

biosurfer load_db \
    --source=GENCODE \
    --gtf A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.basic.annotation.gtf \
    --tx_fasta A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.pc_transcripts.fa \
    --tl_fasta A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.pc_translations.fa \
    -d wtc11

Load the WTC11 PacBio data:

biosurfer load_db \
    --source=PacBio \
    --gtf A_wtc11/biosurfer_wtc11_data/wtc11_with_cds.gtf \
    --tx_fasta A_wtc11/biosurfer_wtc11_data/wtc11_corrected.fasta \
    --tl_fasta A_wtc11/biosurfer_wtc11_data/wtc11_orf_refined.fasta \
    --sqanti A_wtc11/biosurfer_wtc11_data/wtc11_classification.txt \
    -d wtc11

ii. Run hybrid alignment

GENCODE toy

mkdir B_hybrid_aln_results_toy
biosurfer hybrid_alignment \
    -d gencode_toy \
    -o B_hybrid_aln_results_toy \
    --gencode

GENCODE v42

mkdir B_hybrid_aln_gencode_v42
biosurfer hybrid_alignment \
    -d gencode_v42 \
    -o B_hybrid_aln_gencode_v42 \
    --gencode

WTC11

mkdir B_hybrid_aln_wtc11
biosurfer hybrid_alignment \
    -d wtc11 \
    -o B_hybrid_aln_wtc11

Note: This step can take some time (~30 minutes), depending on the size of the input data.
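The three runs above differ only in the database name, the output directory, and whether --gencode is passed. A minimal sketch of a wrapper that captures this pattern is shown below. The function name run_hybrid_alignment and the DRY_RUN variable are hypothetical conveniences, not part of Biosurfer; the only Biosurfer options used are the ones shown above (-d, -o, --gencode). It uses mkdir -p so that re-running does not fail when the output directory already exists.

```shell
# Hypothetical wrapper (not part of Biosurfer) around the commands shown above.
# Usage: run_hybrid_alignment DB_NAME OUT_DIR [--gencode]
# Set DRY_RUN=1 to print the command instead of running it.
run_hybrid_alignment() {
    db=$1
    out=$2
    shift 2
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo biosurfer hybrid_alignment -d "$db" -o "$out" "$@"
        return 0
    fi
    mkdir -p "$out"    # unlike plain mkdir, succeeds if the directory exists
    biosurfer hybrid_alignment -d "$db" -o "$out" "$@"
}

# Print the commands for the toy and WTC11 runs without executing them.
DRY_RUN=1 run_hybrid_alignment gencode_toy B_hybrid_aln_results_toy --gencode
DRY_RUN=1 run_hybrid_alignment wtc11 B_hybrid_aln_wtc11
```

Dropping the DRY_RUN=1 prefix runs the alignment for real, which assumes the corresponding database has already been loaded.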


iii. Visualize protein isoforms

The script below invokes the plotting module for the CRYBG2 gene and outputs a PNG file. To view the protein isoforms of a different gene, edit the script accordingly.

bash ./scripts/isoform_plotting.sh

5. Global characterization of altered protein regions in the human annotation (GENCODE)

The following steps reproduce the results for GENCODE v42.

Install the required libraries:

pip install ipykernel xlsxwriter openpyxl plotly

i. Altered protein regions across the human proteome

Genome-wide analysis of protein isoforms in the GENCODE annotation/WTC11

python3 ./scripts/genome_wide_summary.py

ii. Analysis of alternative splicing events that alter the N-terminus of proteins

python3 ./scripts/n_termini_summary.py

iii. Characterization of splicing patterns underlying internal protein region differences

python3 ./scripts/internal_summary.py

iv. Analyzing splicing patterns for C-terminal alterations

python3 ./scripts/c_termini_summary.py

To reproduce the results for WTC11, comment out line 76 and uncomment line 78 in plot_config.py, then rerun the scripts above.
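The four summary scripts in steps i to iv can also be run in one pass. The sketch below is one way to do so, stopping at the first script that fails. The helper name run_summaries is hypothetical; the script paths are the ones listed in this section, and the interpreter command is passed as an argument.

```shell
# Script paths exactly as listed in steps i-iv of this section.
summary_scripts="
./scripts/genome_wide_summary.py
./scripts/n_termini_summary.py
./scripts/internal_summary.py
./scripts/c_termini_summary.py
"

# Hypothetical helper: run each script with the given interpreter command and
# stop at the first non-zero exit status.
run_summaries() {
    for script in $summary_scripts; do
        echo "running $script"
        "$@" "$script" || { echo "failed: $script" >&2; return 1; }
    done
}

# Example: run_summaries python3
```

Calling run_summaries python3 from the biosurfer_analysis directory runs the scripts in the order given above.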
