---
title: "Machine Learning Project Analysis"
author: Greg Janesch
output: html_document
---
```{r, echo = FALSE, message = FALSE}
library(caret)
```
This document summarizes an attempt to build a simple machine learning model that predicts, from a sample data set, how a barbell lift was performed.
## Preprocessing
The data was first read in using the <TT>read.csv()</TT> function:
```{r}
training <- read.csv("pml-training.csv")
```
The resulting data frame had 160 variables and 19,622 observations. The variables were rendered as one of three classes: numeric, integer, and factor. In addition, multiple columns had a number of entries where the contents were "" or NA. For consistency, all data columns were converted to numeric; at the same time, the first seven variables (index, user, timestamp, and window data) were discarded.
```{r, warning = FALSE}
## Create a vector with all of the columns' classes
columnClasses <- sapply(training, class)
## Convert all data columns to numeric
for (i in 8:159) {
  if (columnClasses[i] != "numeric") {
    ## Going through as.character() converts factors by label rather than
    ## by internal level code; non-numeric entries such as "" become NA
    training[, i] <- as.numeric(as.character(training[, i]))
  }
}
## Remove the non-data columns
training <- training[, -c(1:7)]
```
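The reason for converting factor columns through `as.character()` can be seen in a minimal sketch. The toy factor `f` below is made up for illustration and is not taken from the data set:

```r
# Toy factor standing in for a sensor column that was read in as a factor
# (values are invented for illustration only).
f <- factor(c("10.5", "", "3.2", "10.5"))

# Converting the factor directly returns its internal level codes.
codes <- as.numeric(f)

# Converting through character recovers the recorded values; "" becomes NA.
values <- as.numeric(as.character(f))

print(codes)
print(values)
```

Only the second conversion preserves the measurements, which is why the loop above takes that route.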
The conversion, in turn, left a number of columns containing NAs. In fact, every column that contained NAs consisted almost entirely of NA entries:
```{r}
NAcount <- sapply(training, function(x) sum(is.na(x)))
unique(NAcount)
```
The very small amount of non-missing data in those columns made any attempt to impute or otherwise fill in the values risky at best; as such, all columns with NA values were eliminated.
```{r}
training <- training[, NAcount == 0]
```
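The same check-and-drop step can be sketched on a small made-up data frame. The name `toy` and its values are assumptions for illustration, and `colMeans(is.na(...))` is used here to show the missing fraction rather than the raw count:

```r
# Toy data frame (invented values): column b is almost entirely NA.
toy <- data.frame(a = c(1, 2, 3, 4),
                  b = c(NA, NA, NA, 7),
                  c = c(0.5, 0.1, 0.9, 0.3))

# Fraction of missing entries in each column.
na_fraction <- colMeans(is.na(toy))

# Keep only the columns with no missing entries; drop = FALSE keeps a
# data frame even if a single column survives.
toy_complete <- toy[, na_fraction == 0, drop = FALSE]

print(na_fraction)
print(names(toy_complete))
```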
## Feature Selection
In order to select the relevant features, the variables were inspected graphically. To begin, they were plotted a few at a time as boxplots using caret's <TT>featurePlot()</TT> function, like so:
```{r}
featurePlot(x = training[,8:10], y = training$classe, plot = "box")
```
At this point, the plots were visually inspected. Any variable whose boxes differed noticeably between classes was then inspected further via a stacked histogram, such as this one for the accel_belt_z variable plotted above:
```{r, warning = FALSE}
ggplot(data = training, aes(x = accel_belt_z, fill = classe)) + geom_histogram()
```
Those that were determined to be potentially useful were then noted, and ultimately used in the final model.
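A numeric counterpart to the visual check can be sketched as follows. This is not part of the original analysis: it uses the built-in `iris` data as a stand-in, with `Species` playing the role of `classe`, and compares the per-class medians, which are the centre lines of the boxes in a boxplot:

```r
# Stand-in data: iris, with Species in place of classe (an assumption made
# purely so the sketch is self-contained).
class_medians <- tapply(iris$Sepal.Length, iris$Species, median)

# Spread between the largest and smallest class median; a larger spread
# relative to the variable's overall range suggests the boxes separate.
median_spread <- diff(range(class_medians))
overall_range <- diff(range(iris$Sepal.Length))

print(class_medians)
print(median_spread / overall_range)
```

Such a summary can help shortlist variables, but it does not replace looking at the full distributions.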
## Cross Validation
In order to properly cross-validate the model, the following procedure was repeated 10 times:
First, the training data was randomly split into two subsets, 60% for training and 40% for testing, like so:
```{r}
splitter <- createDataPartition(training$classe, p = 0.6, list = FALSE)
training_train <- training[splitter,]
training_test <- training[-splitter,]
```
The training_train subset was then used to fit a model using the features selected previously. The model was then used to predict the outcomes for the training_test subset. Finally, the accuracy of the predictions was recorded.
Once all 10 iterations were complete, the accuracies were averaged together to get a single estimate of the accuracy, from which an estimate of the out-of-sample error could be determined.
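The repeated split, fit, predict, and average procedure can be sketched in base R. Several things here are assumptions made to keep the sketch dependency-free: the built-in `iris` data stands in for the training set, a hand-written stratified split stands in for `createDataPartition()`, and a nearest-centroid classifier stands in for the random forest. It illustrates the loop structure only, not the model actually used:

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

dat <- iris                      # stand-in data; Species plays the role of classe
n_repeats <- 10
accuracies <- numeric(n_repeats)
feature_cols <- names(dat) != "Species"

for (r in seq_len(n_repeats)) {
  # Stratified 60/40 split: sample 60% of the row indices within each class.
  rows_by_class <- split(seq_len(nrow(dat)), dat$Species)
  train_idx <- unlist(lapply(rows_by_class, function(idx) {
    sample(idx, size = round(0.6 * length(idx)))
  }))
  train_part <- dat[train_idx, ]
  test_part  <- dat[-train_idx, ]

  # Stand-in classifier: assign each test row to the nearest class centroid.
  centroids  <- aggregate(. ~ Species, data = train_part, FUN = mean)
  center_mat <- as.matrix(centroids[, names(centroids) != "Species"])
  features   <- as.matrix(test_part[, feature_cols])
  sq_dists <- sapply(seq_len(nrow(center_mat)), function(k) {
    rowSums(sweep(features, 2, center_mat[k, ])^2)
  })
  nearest   <- max.col(-sq_dists, ties.method = "first")
  predicted <- as.character(centroids$Species)[nearest]

  # Accuracy on the held-out rows for this repetition.
  accuracies[r] <- mean(predicted == as.character(test_part$Species))
}

# Average accuracy and the implied out-of-sample error estimate.
mean_accuracy   <- mean(accuracies)
estimated_error <- 1 - mean_accuracy
print(c(accuracy = mean_accuracy, error = estimated_error))
```

Because each repetition draws a fresh random split, this is repeated holdout validation rather than k-fold cross-validation; the test subsets of different repetitions can overlap.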
## The Final Model
The final version of the model uses a total of seven variables: yaw_belt, accel_belt_z, gyros_arm_x, accel_arm_x, roll_dumbbell, gyros_dumbbell_x, and magnet_arm_x. The method used was the <TT>train()</TT> function's default method, the random forest.
The estimated accuracy of the model on the training_test subsets was 94%, indicating an estimated out-of-sample error of about 6%.
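The model specification described above can be sketched as a formula built from the seven listed variables. The wrapper `fit_final_model()` is a hypothetical helper, not code from the original analysis, and it assumes the caret and randomForest packages are installed when it is called:

```r
# The seven predictors named in the text.
selected_features <- c("yaw_belt", "accel_belt_z", "gyros_arm_x", "accel_arm_x",
                       "roll_dumbbell", "gyros_dumbbell_x", "magnet_arm_x")

# Build the formula classe ~ <seven predictors> from the vector of names.
model_formula <- reformulate(selected_features, response = "classe")
print(model_formula)

# Hypothetical wrapper: fits a random forest through caret::train().
# It is only defined here, not run; caret is needed when it is called.
fit_final_model <- function(data) {
  caret::train(model_formula, data = data, method = "rf")
}
```

A call such as `fit_final_model(training_train)` would then fit the model on one training split.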