-
Notifications
You must be signed in to change notification settings - Fork 0
/
hw1.rmd
103 lines (78 loc) · 3.63 KB
/
hw1.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
---
title: "homework1"
author: "Ali"
date: "Monday, February 09, 2015"
output:
html_document:
keep_md: yes
---
**Homework01**
Feel free to skip my housekeeping section and go straight to Question 01
Before starting the analysis, do some basic housekeeping:
- It is good practice to load libraries and define functions at the start of a script
- Pre-screen the data for missing values (we already know there are none)
- Have a look at the variables to get familiar with the experimental design, e.g. (str(design.txt))
- Read relevent background literature, e.g. (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2527015/)
- look at the range and distribution of the data; does it make sense? Plotting datasets this large takes a lot of comput power; I like using summary(data.frame$variable) instead to check for balance, richness, etc.
**Housekeeping**
```{r}
#load libraries
library(car) #for recode
library(lattice) # for stripplot
data <- read.table("data.txt")
des <- read.table("design.txt")
newDat <- data.frame(gExp = as.vector(as.matrix(data)), sampleID = factor(colnames(data)), gene = factor(rownames(data)))
arDat <- suppressWarnings(data.frame(des, newDat))
summary(arDat$Treatment)
```
There is some discrepancy between the number of observations between Treatments.
```{r}
table(arDat$Treatment, arDat$time)
```
And the culprit is the 1 hr treatment. This might be due to the removal of missing values, and/or a problem with the microarray or experimental design. Let's expect the number of replicates:
```{r}
table(arDat$sampleID, arDat$time)
```
It looks like the intention was to have N = 6 for each time point; however, the 1 hr treatment only has N = 5. This will be important to keep in mind when performing statistics.
**Question 1a**
**Q** How many probes and how many samples?
**A** Simply take a look at the data.frame structure:
```{r}
str(arDat)
```
The number of probes is 22737, and the number of samples is 23.
**Question 1b**
**Q** What is the breakdown of samples for treatment and time?
**A** HAH! This question is addressing the same thing I discovered in my "housekeeping" section above. (Please refer to the above code for more details). There are 4 time points (1h, 2h, 4h, and 24h), with N = 6 each except for 1h, with N = 5. The design is slightly unbalanced.
**Question 1c**
**Q** Create a quantitative variable that represents the time at which cells were measured
**A** Use the 'recode' function in the 'cars' package.
```{r}
des$time <- recode(des$time, "'1_h' =1; '2_h'=2; '4_h'=4; '24_h'=24", as.factor.result = FALSE)
quantDat <- suppressWarnings(data.frame(des, newDat))
head(quantDat)
```
**Question 1d**
**Q** Create a plot showing the gene expression data for one probe and the averages for all possible combinations of Treatment and time
**A** I think the simplest solution would make a colour-coded stripplot. The line connects time categories in order and passes through the average for each treatment.
```{r}
#decide on a gene
set.seed(1)
(samp <- sample(1:nrow(arDat), size = 1))
#which gene did I get?
head(arDat[samp, ])
specialGene = "1557628_s_at"
oneGene <- (subset(quantDat, subset = gene %in% specialGene))
#sanity check
head(oneGene[1:3, ])
#a stripplot function I wrote for another seminar:
makeStrip <- function(df) {
stripplot(gExp ~ as.factor(time) | gene, df,
group = Treatment, xlab = "Time after exposure (hrs)", jitter.data = FALSE, cex=1.5, lwd = 2,
auto.key = TRUE, type = c('p', 'a'))
}
makeStrip(oneGene)
```
** TO DO **
- Report averages for each treatment, time (the line passes through the average, but this is not explicit)
- format the figure more nicely, maybe in ggplot2 instead of lattice