<!DOCTYPE html>
<html>
<head>
<title></title>
<meta charset="utf-8">
<meta name="generator" content="knitr" />
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/7.3/styles/github.min.css">
<style type="text/css">
/* Derived from the Docco package by Jeremy Ashkenas: https://github.com/jashkenas/docco/ */
body {
font-family: 'Palatino Linotype', 'Book Antiqua', Palatino, FreeSerif, serif;
height:100%;
font-size: 16px;
line-height: 24px;
color: #30404f;
margin: 0;
padding: 0;
}
h1, h2, h3, h4, h5, h6 {
color: #112233;
line-height: 1em;
font-weight: normal;
margin: 0 0 15px 0;
}
p {
margin: 0 0 15px 0;
font-size:17px;
}
#footer p{
margin:0;
font-size:12px;
text-align: center;
}
a{
color:#0088cc;
text-decoration:none;
}
a:hover,a:focus{
color:#005580;
text-decoration:underline;
}
#container {
position: relative;
margin: 0;
height:100%;
}
body > #container { height: auto; min-height: 100%; }
table{
width:100%;
border: 0;
outline: 0;
}
td.docs{
width: 50%;
text-align: left;
vertical-align: top;
padding: 10px 25px 1px 50px;
}
td.code{
background: #f5f5ff;
padding: 10px 25px 1px 50px;
overflow-x: hidden;
vertical-align: top;
}
code{
font-size:12px;
margin: 0;
padding: 0;
}
td.docs code{
background: #f8f8ff;
border: 1px solid #dedede;
font-size: 80%;
padding: 0 0.2em;
}
td.docs img{
max-width: 100%;
}
pre code{
padding:2px 4px;
background:#f5f5ff;
}
td.code pre code{
line-height: 18px;
}
.pilwrap {
position: relative;
}
.pilcrow {
font: 12px Arial;
text-decoration: none;
color: rgb(69, 69, 69);
position: absolute;
top: 3px;
left: -20px;
padding: 1px 2px;
opacity: 0;
}
td.docs:hover .pilcrow {
opacity: 1;
}
blockquote {
border-left: 4px solid #DDD;
padding: 0 15px;
color: #777;
}
div.handler{
width: 5px;
padding: 0;
cursor: col-resize;
position: absolute;
z-index: 5;
}
</style>
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/7.3/highlight.min.js"></script>
<script type="text/javascript">
hljs.LANGUAGES.r=function(a){var b="([a-zA-Z]|\\.[a-zA-Z.])[a-zA-Z0-9._]*";return{c:[a.HCM,{b:b,l:b,k:{keyword:"function if in break next repeat else for return switch while try tryCatch|10 stop warning require library attach detach source setMethod setGeneric setGroupGeneric setClass ...|10",literal:"NULL NA TRUE FALSE T F Inf NaN NA_integer_|10 NA_real_|10 NA_character_|10 NA_complex_|10"},r:0},{cN:"number",b:"0[xX][0-9a-fA-F]+[Li]?\\b",r:0},{cN:"number",b:"\\d+(?:[eE][+\\-]?\\d*)?L\\b",r:0},{cN:"number",b:"\\d+\\.(?!\\d)(?:i\\b)?",r:0},{cN:"number",b:"\\d+(?:\\.\\d*)?(?:[eE][+\\-]?\\d*)?i?\\b",r:0},{cN:"number",b:"\\.\\d+(?:[eE][+\\-]?\\d*)?i?\\b",r:0},{b:"`",e:"`",r:0},{cN:"string",b:'"',e:'"',c:[a.BE],r:0},{cN:"string",b:"'",e:"'",c:[a.BE],r:0}]}}(hljs);
</script>
<script>hljs.initHighlightingOnLoad();</script>
<script src="http://yihui.name/media/js/center-images.js"></script>
</head>
<body>
<div id="container">
<table><!--table start-->
<tr id="row1"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#row1">¶</a></div><p>Preliminaries: set the project root, load the required packages, and set the knitr chunk options.</p></td><td class="code"><pre><code class="r">root_dir<-"/Users/tom/Documents/Programming/Restaurant kaggle"
require(foreign)
require(caret)
require(caretEnsemble)
#require(h2o)
require(doSNOW)
require(parallel)
require(knitr)
opts_chunk$set(eval=F,echo=T,purl=F)
</code></pre></td></tr><tr id="row2"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#row2">¶</a></div>
<p>Get the requisite training and test data and convert the opening date into epoch time, i.e. seconds since January 1, 1970. It's just a number to the models, so the exact scale doesn't matter.</p></td><td class="code"><pre><code class="r">train<-read.csv("train.csv")
test<-read.csv("test.csv")
train$Open.Date<-as.integer(
as.POSIXct(as.Date(train$Open.Date,format="%m/%d/%Y"),tz = "GMT"))
test$Open.Date<-as.integer(
as.POSIXct(as.Date(test$Open.Date,format="%m/%d/%Y"),tz = "GMT"))
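# Sketch (example_date is a made-up value, not from the data): the same
# conversion applied to a single date string. The result is the number of
# whole days since 1970-01-01 multiplied by 86400 seconds per day.
example_date<-"07/17/1999"
example_epoch<-as.integer(
as.POSIXct(as.Date(example_date,format="%m/%d/%Y"),tz = "GMT"))
example_epoch/86400==as.numeric(as.Date("1999-07-17")) # TRUE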
levels(train$Type)<-c(levels(train$Type),"MB")
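# Sketch with a made-up toy factor (toy_type is hypothetical): appending a
# level this way only widens the set of allowed levels; the existing values
# are unchanged. Presumably "MB" is added because it occurs in test but not
# in train (an assumption, the source does not say).
toy_type<-factor(c("FC","IL","FC"))
levels(toy_type)<-c(levels(toy_type),"MB")
table(toy_type) # "MB" is now a level with a count of 0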
partition<-createDataPartition(train$revenue,2)
</code></pre></td></tr><tr id="row3"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#row3">¶</a></div>
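<p>I tried reducing the P columns to factor scores, but it didn't pan out. Then again, I know very little about factor analysis. I ran this with oblimax rotations too, and still saw little to no improvement. Note that the test-set factors are fitted separately here, so they are not guaranteed to line up with the training factors.</p></td><td class="code"><pre><code class="r">require("psych")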
x<-fa(train[,grepl(names(train),pattern="P")],nfactors = 6,rotate = "varimax")
trainfa<-cbind(
train[grepl(names(train),pattern=".*(rev|Open|Type|City|Group).*")],x$scores)
testfa<-fa(
test[,grepl(names(test),pattern="P")],nfactors = 6,rotate = "varimax")
test<-cbind(test[,],testfa$scores)
train<-cbind(train,x$scores)
</code></pre></td></tr><tr id="row4"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#row4">¶</a></div>
<p>Set up parallel processing.</p></td><td class="code"><pre><code class="r">cl<-makeCluster(8)
registerDoSNOW(cl)
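# Sketch, not what was run: instead of hard-coding 8 workers, the count could
# be derived from the machine. n_workers is a hypothetical name; detectCores()
# (from the parallel package) can return NA, hence na.rm = TRUE.
n_workers<-max(1L,parallel::detectCores()-1L,na.rm = TRUE)
# Calling stopCluster(cl) once training has finished releases the workers.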
</code></pre></td></tr><tr id="row5"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#row5">¶</a></div>
<p>I am abundantly aware that using this many folds will overfit the data, especially when using gbm, but it has yielded a decent score so far.</p></td><td class="code"><pre><code class="r">tc<-trainControl(method = 'cv',number = 70,repeats = 100)
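# Sketch, not part of the original run: as far as I know, caret only uses
# `repeats` when method = 'repeatedcv'; with plain 'cv' it is ignored, so the
# control above amounts to a single 70-fold CV. A hypothetical repeated-CV
# control (tc_repeated and the values 10 and 5 are illustrative assumptions):
tc_repeated<-trainControl(method = 'repeatedcv',number = 10,repeats = 5)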
</code></pre></td></tr><tr id="row6"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#row6">¶</a></div>
<p>I've been looking at this to try to get more diversity in the algorithms I use.
Methods tried so far: ANFIS (high RMSE), bstTree (high RMSE, correlates with gbm), and rpart and rf (both correlate strongly with gbm but are slower).</p></td><td class="code"><pre><code class="r">tl=list(
gbm=caretModelSpec(
method='gbm',
tuneGrid=expand.grid(interaction.depth = c(3:5),
n.trees = seq(500,5000,500), shrinkage = c(0.01,.1,.001) ) )
,dnn=caretModelSpec(method='dnn',
tuneGrid=expand.grid(layer1=c(3),layer2=c(5),layer3=c(2),
hidden_dropout=c(.1,.2,.3),visible_dropout=c(.1,.2,.3)))
,glmboost=caretModelSpec(
method='glmboost',tuneGrid=expand.grid(prune=T,mstop=c(100,200,300)) )
)
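# Sketch: count the gbm candidates in the grid above with base R's
# expand.grid(). gbm_grid is a hypothetical name used only for this check.
gbm_grid<-expand.grid(interaction.depth = c(3:5),
n.trees = seq(500,5000,500), shrinkage = c(0.01,.1,.001))
nrow(gbm_grid) # 3 depths x 10 tree counts x 3 shrinkage values = 90 rows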
model_list <- caretList(
revenue~., data=train[,c(2,4,6:19,24:29,33,34,43)],
tuneList=tl,
trControl = tc
,methodList=c('cubist')
)
modelCor(resamples(model_list))
</code></pre></td></tr><tr id="row7"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#row7">¶</a></div>
<p>Running a greedy ensemble like this produces poor RMSE.
Again, I know this is way overfitting, but it has been good for the score: low RMSE, even with high variance, seems to pay off a bit here.
I've run the stacked model using glmboost, straight glm, and rf. glmboost has performed the best (i.e. lowest RMSE).</p></td><td class="code"><pre><code class="r">greedy_ensemble <- caretEnsemble(model_list)
summary(greedy_ensemble)
glmensemble<-caretStack(
model_list,
method='gbm'
,tuneGrid = expand.grid(
interaction.depth = c(3:5),n.trees=seq(500,5000,500),shrinkage=.001)
,trControl=trainControl(method='cv',number=110,repeats = 10)
)
glmensemble
</code></pre></td></tr><tr id="row8"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#row8">¶</a></div>
<p>Pull in the sample submission and write the predictions to a file called submit.csv.</p></td><td class="code"><pre><code class="r">sample<-read.csv(file.path("sampleSubmission.csv"))
submit<-predict(glmensemble,test)
sample$Prediction<-submit
# write.csv() always writes the header row and ignores col.names (with a warning)
write.csv(sample,file="submit.csv",row.names=FALSE)
</code></pre></td></tr><tr id="row9"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#row9">¶</a></div>
<p>This can be used to find other models that are dissimilar to gbm, and so less likely to correlate with it.</p></td><td class="code"><pre><code class="r">tag <- read.csv("http://topepo.github.io/caret/tag_data.csv", row.names = 1)
tag <- as.matrix(tag)
## Select only models for regression
regModels <- tag[tag[,"Regression"] == 1,]
all <- 1:nrow(regModels)
## Seed the analysis with the gbm model
start <- grep("(gbm)", rownames(regModels), fixed = TRUE)
pool <- all[all != start]
## Select 4 models by maximizing the Jaccard
## dissimilarity between sets of models
nextMods <- maxDissim(regModels[start,,drop = FALSE],
regModels[pool, ],
method = "Jaccard",
n = 4)
rownames(regModels)[c(start, nextMods)]
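## Sketch: the Jaccard dissimilarity requested from maxDissim() above, computed
## by hand for two made-up binary tag vectors using the standard definition
## 1 - |intersection| / |union|. jaccard_dissim, tags_a and tags_b are
## hypothetical names, not part of caret.
jaccard_dissim <- function(a, b) 1 - sum(a & b) / sum(a | b)
tags_a <- c(1, 1, 0, 1, 0)
tags_b <- c(1, 0, 0, 1, 1)
## The vectors share 2 of the 4 tags in their union, so this gives 1 - 2/4 = 0.5
jaccard_dissim(tags_a, tags_b)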
</code></pre></td></tr><tr id="row10"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#row10">¶</a></div>
</td><td class="code"></td></tr>
</table><!--table end-->
</div>
<script src="https://code.jquery.com/jquery-2.1.1.min.js"></script>
<script src="http://yihui.name/knitr/js/docco-resize.js"></script>
</body>
</html>