Understanding the molecular phenotype of astrocytoma progression

Roshan Lodha 08 February, 2022

Exploratory Data Analysis

Explanation of your data set

The brain cancer GSE dataset is derived from the paper “Expression data from human brain tumors and human normal brain” by Griesinger et al. The data was pre-processed with command-line tools to convert the raw expression read data into gene set enrichment (GSE) data for downstream bioinformatics applications. After processing, the data was made available on Kaggle. Each row of the dataset corresponds to a single sample, that is, a single brain tumor specimen and its RNA expression data. Each column corresponds to a specific transcript that maps to a region of the genome. For example, the first transcript, 1007_s_at, maps to the gene Discoidin Domain Receptor Tyrosine Kinase 1 (DDR1). The transcript code is the gene’s Affymetrix probe ID, Affymetrix being the company whose platform was used to measure expression. Overall, the dataset contains 130 specimens, each with expression data for 54,613 transcripts. Every expression value, and hence every cell, is numeric, with the exception of the “type” column, which denotes the type of astrocytoma. Astrocytoma is an aggressive cancer that originates in the astrocytes of the brain and is generally graded on a scale of 0 to 4, with 0 being control (no astrocytoma). Thus each sample is one of 5 types. These assignments were made using histology of the biopsied tissue.

Data Cleaning

Because the data was already pre-processed, extensive data cleaning was not needed. However, the AFFY control and background control probes had to be removed. They were used to determine the quality and depth of the sequencing, so they would be detected equally in each subtype of brain cancer and would artificially skew the data.
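
The probe filtering described above could be sketched as follows. This is a minimal illustration in Python, not the code used for this analysis. It assumes the control probes can be recognised by the conventional `AFFX` prefix in their probe IDs; the helper name `drop_control_probes` and the tiny expression matrix are invented for the example.

```python
import numpy as np


def drop_control_probes(probe_ids, expression, prefix="AFFX"):
    """Drop the columns whose probe ID starts with `prefix`.

    Assumption: control probes are identifiable by this prefix alone.
    """
    keep = np.array([not pid.startswith(prefix) for pid in probe_ids])
    kept_ids = [pid for pid, flag in zip(probe_ids, keep) if flag]
    return kept_ids, expression[:, keep]


# Toy data: 3 samples x 4 probes (values are invented).
probe_ids = ["1007_s_at", "1053_at", "AFFX-BioB-5_at", "AFFX-r2-Ec-bioC-3_at"]
expression = np.array([
    [10.2, 7.1, 9.0, 9.1],
    [11.0, 6.8, 9.0, 9.2],
    [9.7, 7.4, 9.1, 9.0],
])

kept_ids, filtered = drop_control_probes(probe_ids, expression)
print(kept_ids)
print(filtered.shape)
```

If the background control probes in a given export use a different naming scheme, the prefix test would need to be extended accordingly.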

Data Visualizations

Variable Correlations

As the underlying question was whether cancer type can be distinguished from gene expression data, gene expression was first averaged within each of the 5 types, and PCA was then run on the averages to reduce dimensionality. The data was centered but not scaled, to preserve within-sample relative distributions. Interestingly, the 5th principal component was a linear combination of the first 4 (it contributed no variance), which may indicate that this data contains only 3 distinct subtypes of astrocytoma and one normal type. This should be read with caution, however, since centering five averaged profiles leaves at most four components with nonzero variance.
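
The averaging and centered, unscaled PCA described above can be illustrated with a short sketch. It is written in Python with NumPy on randomly generated toy data, so it is not the original analysis code; the function names `class_means` and `pca_centered`, the label names, and the matrix sizes are all assumptions made for the example.

```python
import numpy as np


def class_means(expression, labels):
    """Average the rows of `expression` within each label."""
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    means = np.vstack([expression[labels == c].mean(axis=0) for c in classes])
    return classes, means


def pca_centered(matrix):
    """PCA via SVD on column-centered (not scaled) data."""
    centered = matrix - matrix.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    # scores, loadings (one row per component), explained-variance ratios
    return u * s, vt, explained


rng = np.random.default_rng(0)
labels = np.repeat(["type0", "type1", "type2", "type3", "type4"], 6)  # 30 toy samples
expression = rng.normal(size=(30, 50))  # 50 toy probes, invented values

classes, means = class_means(expression, labels)
scores, loadings, explained = pca_centered(means)
print(np.round(explained, 3))
```

In this toy run the last component carries essentially zero variance even though the data is random, because centering five rows leaves at most four independent directions.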

Following PCA, the top 5 genes contributing to astrocytoma differentiation were identified from the first 5 entries of principal component 1. These 5 Affymetrix probe IDs were then cross-referenced online to find the genes most important to the progression of astrocytoma.
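
One common way to pick the top contributors to a principal component is to rank probes by the absolute value of their loading. The sketch below shows that approach in Python. It is an assumption that this matches how the entries were selected here, and the probe names, loading values, and the helper `top_loading_probes` are hypothetical.

```python
import numpy as np


def top_loading_probes(probe_ids, pc_loadings, n=5):
    """Return the n probes with the largest absolute loading on one component."""
    order = np.argsort(-np.abs(pc_loadings))[:n]
    return [(probe_ids[i], float(pc_loadings[i])) for i in order]


# Hypothetical probe names and PC1 loadings, for illustration only.
probe_ids = ["probe_a", "probe_b", "probe_c", "probe_d", "probe_e", "probe_f"]
pc1 = np.array([0.05, -0.70, 0.10, 0.60, -0.02, 0.37])

for pid, weight in top_loading_probes(probe_ids, pc1, n=3):
    print(pid, weight)
```

Ranking by absolute value keeps probes with strongly negative loadings, which contribute to the component as much as strongly positive ones.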

The first gene, Gamma-Aminobutyric Acid Type A Receptor Subunit Gamma2 (GABRG2), is a receptor for a major neurotransmitter known as GABA, indicating that there are in fact neuronal changes driving astrocytoma progression (or vice versa). Of the other genes, both Myelin Transcription Factor 1 Like (MYT1L) and Neurofilament light polypeptide (NEFL) are specific to the brain, further indicating neuron-specific changes in astrocytoma.

Now, we can try clustering the data using various algorithms to see how separable the data is.

PCA produces poor results, with little to no clustering by cancer subtype. This is common for gene set enrichment datasets, and advanced techniques like t-SNE and UMAP were developed to improve clustering accuracy in such cases.

From the UMAP plot, we can see much better separability, with the normal cells, medulloblastoma tumors, and ependymoma tumors (mostly) forming their own clusters. Additionally, pilocytic astrocytomas appear to be genetically similar to glioblastomas.

Statistical Learning: Modeling & Prediction


