Skip to content

Latest commit

 

History

History
195 lines (152 loc) · 6.58 KB

hw1.md

File metadata and controls

195 lines (152 loc) · 6.58 KB

homework1

Ali
Monday, February 09, 2015

Homework01

Feel free to skip my housekeeping section and go straight to Question 01

Before starting the analysis, do some basic housekeeping:

  • It is good practice to load libraries and define functions at the start of a script
  • Pre-screen the data for missing values (we already know there are none)
  • Have a look at the variables to get familiar with the experimental design, e.g. (str(design.txt))
  • Read relevent background literature, e.g. (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2527015/)
  • look at the range and distribution of the data; does it make sense? Plotting datasets this large takes a lot of comput power; I like using summary(data.frame$variable) instead to check for balance, richness, etc.

Housekeeping

#load libraries
library(car) #for recode
library(lattice) # for stripplot
data <- read.table("data.txt")
des <- read.table("design.txt")
newDat <- data.frame(gExp = as.vector(as.matrix(data)), sampleID = factor(colnames(data)), gene = factor(rownames(data)))
arDat <- suppressWarnings(data.frame(des, newDat))
summary(arDat$Treatment)
## cigarette_smoke         control 
##          250107          272844

There is some discrepancy between the number of observations between Treatments.

table(arDat$Treatment, arDat$time)
##                  
##                     1_h   2_h  24_h   4_h
##   cigarette_smoke 45474 68211 68211 68211
##   control         68211 68211 68211 68211

And the culprit is the 1 hr treatment. This might be due to the removal of missing values, and/or a problem with the microarray or experimental design. Let's expect the number of replicates:

table(arDat$sampleID, arDat$time)
##                     
##                        1_h   2_h  24_h   4_h
##   GSE10718_Biomat_1      0     0 22737     0
##   GSE10718_Biomat_10 22737     0     0     0
##   GSE10718_Biomat_11 22737     0     0     0
##   GSE10718_Biomat_12 22737     0     0     0
##   GSE10718_Biomat_13     0     0     0 22737
##   GSE10718_Biomat_14     0     0     0 22737
##   GSE10718_Biomat_15     0     0     0 22737
##   GSE10718_Biomat_16 22737     0     0     0
##   GSE10718_Biomat_17 22737     0     0     0
##   GSE10718_Biomat_19     0 22737     0     0
##   GSE10718_Biomat_2      0     0 22737     0
##   GSE10718_Biomat_20     0 22737     0     0
##   GSE10718_Biomat_21     0 22737     0     0
##   GSE10718_Biomat_22     0     0     0 22737
##   GSE10718_Biomat_23     0     0     0 22737
##   GSE10718_Biomat_24     0     0     0 22737
##   GSE10718_Biomat_3      0     0 22737     0
##   GSE10718_Biomat_4      0     0 22737     0
##   GSE10718_Biomat_5      0     0 22737     0
##   GSE10718_Biomat_6      0     0 22737     0
##   GSE10718_Biomat_7      0 22737     0     0
##   GSE10718_Biomat_8      0 22737     0     0
##   GSE10718_Biomat_9      0 22737     0     0

It looks like the intention was to have N = 6 for each time point; however, the 1 hr treatment only has N = 5. This will be important to keep in mind when performing statistics.

Question 1a

Q How many probes and how many samples?

A Simply take a look at the data.frame structure:

str(arDat)
## 'data.frame':	522951 obs. of  6 variables:
##  $ ExternalID: Factor w/ 23 levels "GSM270872","GSM270873",..: 11 3 2 1 9 8 7 14 13 16 ...
##  $ Treatment : Factor w/ 2 levels "cigarette_smoke",..: 2 2 2 2 2 2 2 1 1 1 ...
##  $ time      : Factor w/ 4 levels "1_h","2_h","24_h",..: 3 1 1 1 4 4 4 1 1 2 ...
##  $ gExp      : num  7.9 6.13 6.82 6.63 6.95 ...
##  $ sampleID  : Factor w/ 23 levels "GSE10718_Biomat_1",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ gene      : Factor w/ 22737 levels "1294_at","1316_at",..: 1 2 3 4 5 6 7 8 9 10 ...

The number of probes is 22737, and the number of samples is 23.

Question 1b

Q What is the breakdown of samples for treatment and time?

A HAH! This question is addressing the same thing I discovered in my "housekeeping" section above. (Please refer to the above code for more details). There are 4 time points (1h, 2h, 4h, and 24h), with N = 6 each except for 1h, with N = 5. The design is slightly unbalanced.

Question 1c

Q Create a quantitative variable that represents the time at which cells were measured

A Use the 'recode' function in the 'cars' package.

des$time <- recode(des$time, "'1_h' =1; '2_h'=2; '4_h'=4; '24_h'=24", as.factor.result = FALSE)
quantDat <- suppressWarnings(data.frame(des, newDat))
head(quantDat)
##   ExternalID Treatment time     gExp           sampleID    gene
## 1  GSM270883   control   24 7.900022  GSE10718_Biomat_1 1294_at
## 2  GSM270874   control    1 6.126008 GSE10718_Biomat_10 1316_at
## 3  GSM270873   control    1 6.822491 GSE10718_Biomat_11 1320_at
## 4  GSM270872   control    1 6.633596 GSE10718_Biomat_12 1431_at
## 5  GSM270881   control    4 6.948795 GSE10718_Biomat_13 1438_at
## 6  GSM270880   control    4 9.887123 GSE10718_Biomat_14 1487_at

Question 1d

Q Create a plot showing the gene expression data for one probe and the averages for all possible combinations of Treatment and time

A I think the simplest solution would make a colour-coded stripplot. The line connects time categories in order and passes through the average for each treatment.

#decide on a gene
set.seed(1)
(samp <- sample(1:nrow(arDat), size = 1))
## [1] 138849
#which gene did I get? 
head(arDat[samp, ])
##        ExternalID Treatment time     gExp          sampleID         gene
## 138849  GSM270878   control  2_h 4.235724 GSE10718_Biomat_7 1557628_s_at
specialGene = "1557628_s_at"
oneGene <- (subset(quantDat, subset = gene %in% specialGene))
#sanity check
head(oneGene[1:3, ])
##       ExternalID       Treatment time     gExp           sampleID
## 2427   GSM270888 cigarette_smoke    2 2.910596 GSE10718_Biomat_20
## 25164  GSM270874         control    1 5.820400 GSE10718_Biomat_10
## 47901  GSM270892 cigarette_smoke    4 5.373230 GSE10718_Biomat_23
##               gene
## 2427  1557628_s_at
## 25164 1557628_s_at
## 47901 1557628_s_at
#a stripplot function I wrote for another seminar:
makeStrip <- function(df) {
  stripplot(gExp ~ as.factor(time) | gene, df,
          group = Treatment, xlab = "Time after exposure (hrs)", jitter.data = FALSE, cex=1.5, lwd = 2,  
          auto.key = TRUE, type = c('p', 'a'))
}
makeStrip(oneGene)

** TO DO **

  • Report averages for each treatment, time (the line passes through the average, but this is not explicit)
  • format the figure more nicely, maybe in ggplot2 instead of lattice