/
Final_Project_Team2.Rmd
287 lines (203 loc) · 11.2 KB
/
Final_Project_Team2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
---
title: "Beautiful Bananas Final Project"
runtime: shiny
output: html_document
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
##Team 2: Fu Wen, Sanjay Hariharan, Sophie Guo, Xiao Xu
###1.Introduction
All of the members of the Beautiful Bananas group are Masters students, hoping to enter the workforce after graduation. As students of Statistics, organizations all over the country want our skillset, but we are specifically curious what elements of our skillset they are interested in. We decided to web scrape Indeed.com, a website posting jobs of all kinds for students like us. We are specifically curious about what words appear most in job postings for Statisticians, as well as other fields. Ultimately, we want to create a Shiny App that allows the user to dynamically select a field of study, number of words, location, and other specifications, showing the user a word cloud detailing the frequency of words appearing for those specifications on Indeed.com.
###2.Methods
2.1 Posting Scraping
At the beginning, we use a keyword, i.e. “Statistical Analyst”, to get search result pages with 10 entries on each one. Then we go to each job posting page to download the html file. We need to find elements on a particular job posting. Because indeed.com does not provide a format template for employers, it is very hard to extract the desired text chunks since they are discrete and not in order. We use a combination of keywords’ roots and regular expression to extract the chuck in the parse_indeed.R.
```{r}
#Former combination of keywords’ roots and regular expression in parse_indeed.R#
#q1 <- str_extract_all(job_summary[i],"qualifi[a-z]+[: ]+.+? {2}") %>% unlist()
#q3 <- str_extract_all(job_summary[i],"require[a-z]+[: ]+.+? {2}") %>% unlist()
#q4 <- str_extract_all(job_summary[i],"demand[a-z]+[: ]+.+? {2}") %>% unlist()
#q5 <- str_extract_all(job_summary[i],"essentials[a-z]+[: ]+.+? {2}") %>% unlist()
#q6 <- str_extract_all(job_summary[i],"must[a-z]+[: ]+.+? {2}") %>% unlist()
#q7 <- str_extract_all(job_summary[i],"should[a-z]+[: ]+.+? {2}") %>% unlist()
```
2.2. Description Mining
The success rate of posting scraping is around 50%, so we decide to deploy another approach. We abandon job postings and switch back to search result pages. On the search result page, there is a description section for each job entry. So we extract descriptions and perform text mining. Details are as following:
Our first step is cleaning the text of numbers, stop words, stem words and any html node tags resulting from web scraping. Then we perform tf-idf to the document term matrix to weigh the words based on their frequencies. After studying the scraped descriptions, we realize that qualifications appear mostly as nouns while verbs are mostly related to responsibilities. As a result, we use the openNLP package in R to find the part-of-speech of each word and retain only nouns for our analysis. At this point, we have filtered out the irrelevant and the most frequent, non-informative words. We then use word cloud to visualize the highest weighted words.
```{r, include = FALSE}
source("~/Desktop/Indeed-Web-Scraping/get_indeed.R")
#source("parse_indeed.R")
source("~/Desktop/Indeed-Web-Scraping/Scrape Descriptions_tfidf.R")
```
###3.Visualization
For the Shiny App, once a user inputs his/her interested job field and location, our app will display a weighted word graph. If the user is interested in details of job postings, he/she can choose to display a detailed table with the N most frequent words where N is a parameter given by user. The user can choose N to be an integer from 10 to 100. The user is able to enter any job field into the input text box. The app will take the text and perform web scraping task with input text. The location value accepts either city or state. In general, Our analysis is highly customized. From the visualization output, a user could easily identify the most matched skillsets along with duties for his/her ideal job.
Below is our shiny app:
```{r, echo=FALSE}
#Initialize Shiny Package#
require(stringr)
require(rvest)
require(magrittr)
require(koRpus)
require(rJava)
require(openNLP)
require(tm)
library(wordcloud)
library(shiny)
include = c("Yes" = "yes" , "No" = "no")
shinyApp(
ui = fluidPage(
#Set Title Panel#
titlePanel(
"Beautiful Banana Job Search Engine"
),
sidebarPanel(
#Input the job field to search#
h3("Field"),
textInput("field", label = NULL, value = "Statistics", width = NULL),
hr(),
#Input the location to search
h3("Location:"),
textInput("loc", label = NULL, value = "North Carolina", width = NULL),
hr(),
#Input the max number of word to include in the summary
h3("Max Word:"),
sliderInput("num", "Input the max number of words you want to show:",
min = 10, max = 100, value = 30),
#To choose whether to inlcude a detailed table or not
h3("Output Option"),
selectInput("include", "Would you like to see the detailed word summary:", include)
),
mainPanel(
#Additional Results View#
h3("Key Words"),
hr(),
#The panel to show the word cloud plot.
tabsetPanel(
tabPanel("Word Cloud", plotOutput("plot"))
),
#The panel to demonstrate the detailed table if user requires
conditionalPanel(
#If want summary table from Input, include the table#
condition = "input.include == 'yes'",
tabsetPanel(
tabPanel("Detailed Word Summary",
tableOutput("table"))
)
)
)
),
server = function(input, output, session)
{
#reactive part of the data frame
d=reactive(
{
get_indeed <- function(start,dest){
url <- paste0("http://rss.indeed.com/rss?q=",input$field,"&l=",input$loc,"&start=",start)
download.file(url,destfile=dest,method="wget")
}
dir.create("data/indeed/",recursive=TRUE,showWarnings=FALSE)
for(i in seq(10,100,length.out=10)){
dest = paste0("data/indeed/",i/10,".html")
#Call function based on those values, a destination directory, and a specified limit#
get_indeed(i, dest)
}
files <- dir("data/indeed/", pattern = "*.html", full.names = TRUE)
descriptions <-c() #create a vector to store the results
for (i in 1:length(files)){
new <- read_xml(files[i]) %>%
html_nodes("description") %>% html_text()
descriptions <- c(descriptions,new)
}
pos <- function(ele){
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
annotations <- annotate(ele, list(sent_token_annotator, word_token_annotator))
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
pos_tag_annotator
pos <- annotate(ele, pos_tag_annotator, annotations)
return (pos$features[[2]]$POS == "NN")
}
#get rid of the tags#
for (i in 1:length(descriptions)){
if (!is.na(str_extract(descriptions[i],".*\\."))){
descriptions[i] = str_extract(descriptions[i],".*\\.")
}
}
#remove duplicate entries
descriptions <- unique(descriptions)
corp <- Corpus(VectorSource(descriptions))
#dtm = DocumentTermMatrix(corp,control=list(removePunctuation=TRUE,removeNumbers=TRUE,stemming=TRUE))
corp<-tm_map(corp,content_transformer(removeNumbers)) #convert to lower case#
corp<-tm_map(corp,removePunctuation) #remove punctuation#
#corp <- tm_map(corp, stemDocument)
corp <-tm_map(corp, removeWords, stopwords('english'))
dtm <- as.matrix(DocumentTermMatrix(corp))
#TFIDF
norm <- dtm/rowSums(dtm) #normalize the matrix
nonZero <- colSums(norm != 0)
weight <- log(dim(norm)[2]/nonZero)
weight_sorted <- sort(weight,decreasing=TRUE)
names <- names(weight_sorted)
weight_matrix <- matrix(0,dim(norm)[1],dim(norm)[2])
for (i in 1:dim(norm)[2]){
weight_matrix[,i] <- norm[,i]*weight[i]
}
colnames(weight_matrix) <- colnames(norm)
rownames(weight_matrix) <- rownames(norm)
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
annotations <- annotate(colnames(weight_matrix), list(sent_token_annotator, word_token_annotator))
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
pos_tag_annotator
pos <- annotate(colnames(weight_matrix), pos_tag_annotator, annotations)
non <- c()
for (i in 2:length(pos)){
if (pos$features[[i]]$POS != "NN"){
non <- c(non,i-1)
}
}
#non_verb <- c()
#for (i in 2:length(pos)){
#if (pos$features[[i]]$POS != "VBD" ||pos$features[[i]]$POS != "VBG" ){
#non_verb<- c(non_verb,i-1)
#}
#}
irrelevant <- c()
for (i in 1:dim(weight_matrix)[2]){
if (str_detect(colnames(weight_matrix)[i],"job")||str_detect(colnames(weight_matrix)[i],"http")||str_detect(colnames(weight_matrix)[i],"indeed")){
irrelevant <- c(irrelevant,i)
}
}
#weight_matrix <- weight_matrix[,-irrelevant]
weight_matrix <- weight_matrix[,-c(non,irrelevant)]
#weight_matrix <-weight_matrix[,-non_verb]
#foreach(i = 1:dim(weight_matrix)[2]) %do% if (!pos(colnames(weight_matrix)[i])){weight_matrix[,-i]}
#Source: http://www.r-bloggers.com/text-mining/#
v = sort(colSums(weight_matrix), decreasing=TRUE);
myNames = names(v);
d = data.frame(word=myNames, freq=v)
}
)
#Render Plots for word cloud plot#
output$plot = renderPlot(
{
wordcloud(d()$word, colors=brewer.pal(8,"Dark2"),
d()$freq, max.words=input$num, min.freq=min(d()$freq))
}
)
output$table = renderTable(
{
table = d()[1:input$num,]
rownames(table) = 1:input$num
names(table) = c("Word","Frequency")
table
}
)
}
#options = list(height = 500)
)
```
###4.Challenge and Next Step
The biggest challenge is to improve the accuracy of keyword match in the process of web text mining. Since employers write the job posting on their own, the contents of postings vary considerably in format with organizations. Sometimes they are not even segmented with everything in a single paragraph instead. The keywords we previously thought about are also difficult to find. For example, a posting list bullet points for “requirements” while another have sentences starting with “A qualified candidate should”.
To solve the problem, we process job descriptions on the search result page using nouns as qualifications. This involves manual work to read some job descriptions to look for potential patterns. There is no doubt we will miss something with this approach. For the long term process, it is necessary to figure out how to improve the text matching rate continuously.
Another limitation of our app is that when the user is waiting for output, an error might show on the screen even though the output will come out instead in a few seconds.