---
title: "Supervised machine learning"
author: Ryan Wesslen (modified)
date: July 27, 2016
output: html_document
---
### Supervised machine learning

We'll be working with tweets from the four major US presidential candidates' Twitter accounts.
Let's train a classifier to determine which words are most distinctively used by Donald Trump (realDonaldTrump) relative to Ted Cruz (tedcruz).
```{r}
library(quanteda)
tweets <- read.csv("../datasets/candidate-tweets.csv", stringsAsFactors=F)
tweets <- subset(tweets, screen_name %in% c("realDonaldTrump","tedcruz"))
tweets$trump <- ifelse(tweets$screen_name=="realDonaldTrump", 1, 0) # 1 = Trump, 0 = Cruz
```
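Before moving on, a quick sanity check on the class balance never hurts:

```{r}
# number of Cruz (0) and Trump (1) tweets
table(tweets$trump)
```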
We'll do some cleaning as well -- substituting all handles with a generic @. Why? We want to prevent overfitting: otherwise the classifier could simply learn the specific accounts each candidate mentions or retweets.
```{r}
tweets$text <- gsub('@[0-9_A-Za-z]+', '@', tweets$text)
```
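To see what this does, here is the substitution applied to a made-up example tweet (the handles are hypothetical):

```{r}
# any handle collapses to a bare "@", so specific mentions can't be memorized
gsub('@[0-9_A-Za-z]+', '@', "RT @SomeAccount: make sure to watch @FoxNews tonight!")
```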
Create the dfm and trim it so that only tokens that appear in 10 or more tweets are included.
```{r}
twcorpus <- corpus(tweets$text)
twdfm <- dfm(twcorpus, ignoredFeatures=c(
  stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can"))
twdfm <- trim(twdfm, minDoc = 10)
```
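A quick look at the resulting object, using quanteda's `topfeatures()`, can confirm the trimming worked as intended:

```{r}
# dimensions of the document-feature matrix and its most frequent features
dim(twdfm)
topfeatures(twdfm, 20)
```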
And split the dataset into training and test sets. We'll go with 80% training and 20% test. Note the use of a random seed to make sure our results are replicable.
```{r}
set.seed(123)
training <- sample(1:nrow(tweets), floor(.80 * nrow(tweets)))
test <- setdiff(1:nrow(tweets), training)
```
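And a quick check that the two sets partition the data:

```{r}
# the split should cover all tweets with no overlap
length(training)
length(test)
length(intersect(training, test)) == 0
```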
Our first step is to train the classifier using cross-validation. There are many packages in R to run machine learning models. For regularized regression, glmnet is in my opinion the best: it's much faster than caret or mlr (in my experience, at least), and it has cross-validation built in, so we don't need to code it from scratch.
```{r}
library(glmnet)
library(doMC) # parallel backend so the folds run simultaneously
registerDoMC(cores=3)
# ridge regression (alpha=0) with 5-fold cross-validation
ridge <- cv.glmnet(twdfm[training,], tweets$trump[training],
    family="binomial", alpha=0, nfolds=5, parallel=TRUE,
    type.measure="deviance")
plot(ridge)
```
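The plot shows the cross-validated deviance across values of the regularization parameter; we can also extract the value of lambda that minimizes it:

```{r}
# lambda minimizing the cross-validated deviance
ridge$lambda.min
```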
We can now compute the performance metrics on the test set.
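With Trump tweets (coded 1) as the positive class, and writing $TP$, $FP$, $FN$, $TN$ for the cells of the confusion matrix, the metrics we compute are:

$$\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}, \qquad \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$$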
```{r}
## function to compute accuracy
accuracy <- function(ypred, y){
  tab <- table(ypred, y)
  return(sum(diag(tab))/sum(tab))
}
## function to compute precision
precision <- function(ypred, y){
  tab <- table(ypred, y)
  return((tab[2,2])/(tab[2,1]+tab[2,2]))
}
## function to compute recall
recall <- function(ypred, y){
  tab <- table(ypred, y)
  return(tab[2,2]/(tab[1,2]+tab[2,2]))
}
# computing predicted values: a tweet is classified as Trump's when its
# predicted probability exceeds the Trump base rate in the test set
preds <- predict(ridge, twdfm[test,], type="response") > mean(tweets$trump[test])
# confusion matrix
table(preds, tweets$trump[test])
# performance metrics
accuracy(preds, tweets$trump[test])
precision(preds, tweets$trump[test])
recall(preds, tweets$trump[test])
```
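To put these numbers in context, a useful (if crude) baseline is the accuracy of always predicting the majority class in the test set:

```{r}
# majority-class baseline accuracy on the test set
max(table(tweets$trump[test])) / length(test)
```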
Something that is often very useful is to look at the actual estimated coefficients and see which of these have the highest or lowest values:
```{r}
# from the different values of lambda, let's pick the best one
best.lambda <- which(ridge$lambda==ridge$lambda.min)
beta <- ridge$glmnet.fit$beta[,best.lambda]
head(beta)
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
                 word = names(beta), stringsAsFactors=F)
# most negative coefficients: words most distinctive of Cruz
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
paste(df$word[1:30], collapse=", ")
# most positive coefficients: words most distinctive of Trump
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
paste(df$word[1:30], collapse=", ")
```
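As a side note, a histogram of the coefficients shows what we'd expect from ridge regression: most words are shrunk close to zero, and the discriminating vocabulary sits in the tails.

```{r}
# distribution of ridge coefficients across all features
hist(df$coef, breaks=50, main="Ridge coefficients", xlab="coefficient")
```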
### A special case: wordscores
Let's look at an example of wordscores. Here we have tweets from a random sample of 100 Members of the U.S. Congress, as well as their ideal points based on roll-call votes. Can we replicate the ideal points using only the text of their tweets?
```{r}
cong <- read.csv("../datasets/congress-tweets.csv", stringsAsFactors=F)
# creating the corpus and dfm objects
ccorpus <- corpus(cong$text)
docnames(ccorpus) <- cong$screen_name
cdfm <- dfm(ccorpus, ignoredFeatures=c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can"))
cdfm <- trim(cdfm, minDoc = 2)
# running wordscores
ws <- textmodel(cdfm, cong$idealPoint, model="wordscores", smooth=.5)
ws
# let's look at the most discriminant words
sw <- sort(ws@Sw)
head(sw, n=20)
tail(sw, n=20)
```
Now let's split the data into training and test sets and see what we can learn...
```{r}
set.seed(123)
test <- sample(1:nrow(cong), floor(.20 * nrow(cong)))
# extracting ideal points and replacing them with missing values
refpoints <- cong$idealPoint
refpoints[test] <- NA
# running wordscores
ws <- textmodel(cdfm, refpoints, model="wordscores", smooth=.5)
# predicted values (this will take a while...)
preds <- predict(ws, rescaling="lbg")
scores <- preds@textscores
# and let's compare
plot(scores$textscore_lbg[test], cong$idealPoint[test])
cor(scores$textscore_lbg[test], cong$idealPoint[test])
```
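Besides the correlation, a rough error summary (an addition on top of the original code) is the RMSE of the rescaled predictions against the roll-call ideal points:

```{r}
# root mean squared error of the wordscores predictions on the test set
sqrt(mean((scores$textscore_lbg[test] - cong$idealPoint[test])^2))
```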