Mariken van der Velden & Kasper Welbers
Mini-Hackathons are performed and submitted in pairs of two. You must hand in your assignment on Canvas the next week before Tuesday Midnight.
Use this RMarkdown template on the canvas page for this mini-hackathon
to complete your hackathon. When you are finished, knit the file into a
pdf with the knit button in the toolbar (or using Ctrl+Shift+K). For
this you need to have the knitr
and printr
packages installed, and
all your code needs to work (see the R course companion for more
instructions). If you cannot knit the .Rmd
file, there is probably an
error in your R code, therefore add eval=FALSE
to the code chunk: {r, eval = FALSE}
, so you are still able to knit and upload the file.
This mini-hackathon builds upon the tutorial on LDA Models. For this hackathon, you can (and are strongly recommended to) use code from this week’s tutorial as well as provided here. If you aim to conduct additional analyses in R, we of course encourage this. Nevertheless, it is important to not do additional analysis just for the sake of running more code chunks. For that reason, please provide a justification for these additional analyses.
Also, we recommend to use parameters for RMarkdown codeblocks, in
particular the cache = TRUE
parameter for codeblocks that take long to
compute (e.g., downloading data from AmCAT). A brief explanation of some
usefull parameters is given in the the tutorial of two weeks
ago.
Important to take into account is that this week you laptop will be doing the heavy lifting. Accordingly, in the assignment, you should be carefull not to choose a concept that has too many articles.
Choose a data set and run a LDA topic
model
to inquire the topics of the media agenda. You can either work with the
AmCAT data set with newspaper articles on the US presidential candidates
in 2020 or the Guardian data set with newspaper articles in politics,
economy, society and international sections from 2012 to 2016. To get
the data, one can run the code to download the data from
quanteda.corpora
or download the .csv
file with AmCAT data from
Canvas. Take a look at respectively the tutorial of two weeks
ago
or last week
(here
and
here)
to see how to obtain the data.
For later challenges in this hackathon, please already add a year
variable to the corpus by running the following code if you use Guaridan
data:
library(quanteda)
df <- convert(corp, to = "data.frame") %>%
mutate(year = substr(date, 1, 4))
corp <- corpus(df)
If you use AmCAT data, run the following code to add the variable:
d <-d %>%
mutate(month = substr(date, 6, 7))
To conduct the analysis using LDA topic models, we have to create a document-term matrix (DTM), for more information about DTM, see the tutorial on basic text analysis with quanteda. Look at the tutorials of last week (here and here) how you can shape the AmCAT data or the Guardian data into a dtm.
After turning the corpus into a dtm, run LDA topic model from a dfm – have a look at the tutorial to see how to do this.
Don’t forget to filter the words, otherwise the procedure of running an LDA might take forever, or even makes your computer crash.
I will run a LDA topic model with 10 topics, and set the seed to the date of the tutorial.
This can take some time if you do not have a powerful computer, don’t
forget cache=T
therefore.
Validation is very important when applying topic models to measure any
agenda. Inspect the results of your LDA model using terms
, and make a
wordcloud for at least 3 topics. Give an interpretation for each topic
elaborated with an example. Have a look at the
tutorial
for the code.
library(wordcloud)
terms(m_G, 5)
Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 | Topic 9 | Topic 10 |
---|---|---|---|---|---|---|---|---|---|
said | said | people | one | block-time | said | said | party | trump | said |
police | people | said | can | published-time | eu | us | said | said | year |
court | one | health | says | gmt | government | government | labour | clinton | $ |
told | two | can | like | bst | minister | security | government | new | business |
topic <- 5
words_G <- posterior(m_G)$terms[topic, ]
topwords_G <- head(sort(words_G, decreasing = T), n=50)
wordcloud(names(topwords_G), topwords_G)
terms(m_A, 5)
Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 | Topic 9 | Topic 10 |
---|---|---|---|---|---|---|---|---|---|
like | court | mr | said | said | $ | mr | said | de | bst |
one | senate | biden | united | police | said | trump | coronavirus | la | coronavirus |
people | said | trump | china | black | companies | president | health | que | � |
can | republican | said | mr | people | new | said | new | en | cases |
just | election | campaign | states | protests | economic | house | virus | el | updated |
topic <- 5
words_A <- posterior(m_A)$terms[topic, ]
topwords_A <- head(sort(words_A, decreasing = T), n=50)
wordcloud(names(topwords_A), topwords_A)
Finally, see in which years certain topics were more prevelent.
Reflect on the validity of the current analysis: to what extent do the results of the LDA model measure the media agenda?
The method used in the tutorial is pretty crude. Can you think of any problem in particular, and any ways to mend them? Are there problems which cannot be solved, or would be very hard to solve?