Mini-Hackathon2.Rmd

---
layout: tutorial
title: 'Mini-Hackathon 2: LDA Topic Modeling'
author: "Mariken van der Velden & Kasper Welbers"
output:
  github_document:
  toc: yes
editor_options: 
  chunk_output_type: console
---
  
```{r setup, include=FALSE}
## include this at top of your RMarkdown file for pretty output
## make sure to have the printr package installed: install.packages('printr')
knitr::opts_chunk$set(echo = TRUE)
library(printr)
library(rmarkdown)
```

Mini-Hackathons are performed and submitted in pairs of two. 
You must hand in your assignment on Canvas the next week before **Tuesday Midnight**.

Use this RMarkdown template on the canvas page for this mini-hackathon to complete your hackathon.
When you are finished, knit the file into a pdf with the knit button in the toolbar (or using Ctrl+Shift+K).
For this you need to have the `knitr` and `printr` packages installed, and all your code needs to work (see the R course companion for more instructions).
If you cannot knit the `.Rmd` file, there is probably an error in your R code, therefore add `eval=FALSE` to the code chunk: `{r, eval = FALSE}`, so you are still able to knit and upload the file.

# This Mini-Hackathon
This mini-hackathon builds upon the tutorial on [LDA Models](https://github.com/ccs-amsterdam/r-course-material/blob/master/tutorials/r_text_lda.md). 
For this hackathon, you can (and are strongly recommended to) use code from this week's tutorial as well as provided here.
If you aim to conduct additional analyses in R, we of course encourage this.
Nevertheless, it is important to not do additional analysis just for the sake of running more code chunks. 
For that reason, please provide a justification for these additional analyses.

Also, we recommend to use parameters for RMarkdown codeblocks, in particular the `cache = TRUE` parameter for codeblocks that take long to compute (e.g., downloading data from AmCAT). 
A brief explanation of some usefull parameters is given in the [the tutorial of two weeks ago](https://github.com/MarikenvdVelden/Replication-Hackathons/blob/main/Intro-to-rmd-and-data-retrieval.md).

Important to take into account is that this week you laptop will be doing the heavy lifting. 
Accordingly, in the assignment, you should be carefull not to choose a concept that has too many articles. 

### Hackathon Challenges

#### Challenge 1
Choose a data set and run a [LDA topic model](https://github.com/ccs-amsterdam/r-course-material/blob/master/tutorials/r_text_lda.md) to inquire the topics of the media agenda. 
You can either work with the AmCAT data set with newspaper articles on the US presidential candidates in 2020 or the Guardian data set with newspaper articles in politics, economy, society and international sections from 2012 to 2016.
To get the data, one can run the code to download the data from `quanteda.corpora` or download the `.csv` file with AmCAT data from Canvas.
Take a look at respectively [the tutorial of two weeks ago](https://github.com/MarikenvdVelden/Replication-Hackathons/blob/main/Intro-to-rmd-and-data-retrieval.md) or last week ([here](https://github.com/MarikenvdVelden/Replication-Hackathons/blob/main/Mini-Hackathon1.md) and [here](https://github.com/MarikenvdVelden/Replication-Hackathons/blob/main/Mini-Hackathon1-Guardian.md)) to see how to obtain the data.

```{r, warning=FALSE, message=F, include=F, eval=T}
library(tidyverse)
d <- read_csv("../amcat_data.csv")

library(quanteda.corpora)
corp <- download('data_corpus_guardian')
```

For later challenges in this hackathon, please already add a `year` variable to the corpus by running the following code if you use Guaridan data:

```{r, message=F,warning=F}
library(quanteda)

df <- convert(corp, to = "data.frame") %>%
  mutate(year = substr(date, 1, 4))
corp <- corpus(df)
```

If you use AmCAT data, run the following code to add the variable:

```{r, message=F,warning=F}
d <-d %>%
  mutate(month = substr(date, 6, 7))
```

To conduct the analysis using LDA topic models, we have to create a document-term matrix (DTM), for more information about DTM, [see the tutorial on basic text analysis with quanteda](https://github.com/ccs-amsterdam/r-course-material/blob/master/tutorials/R_text_3_quanteda.md).
Look at the tutorials of last week ([here](https://github.com/MarikenvdVelden/Replication-Hackathons/blob/main/Mini-Hackathon1.md) and [here](https://github.com/MarikenvdVelden/Replication-Hackathons/blob/main/Mini-Hackathon1-Guardian.md)) how you can shape the AmCAT data or the Guardian data into a dtm.

```{r, eval=T, include=FALSE, message=F,warning=F, cache=TRUE}
#the data is already in corpus format so we can create a dtm
dtm_G <- corp %>% 
        dfm(remove = stopwords("english"), remove_punct = T) %>% 
        dfm_trim(min_docfreq = 5)
dtm_G

corp_A <- corpus(d, docid_field = 'id', text_field = 'text')
dtm_A <- corp_A %>% 
        dfm(remove = stopwords("english"), remove_punct = T) %>% 
        dfm_trim(min_docfreq = 5)
dtm_A

```

After turning the corpus into a dtm, run LDA topic model from a dfm -- have a look at the [tutorial](https://github.com/ccs-amsterdam/r-course-material/blob/master/tutorials/r_text_lda.md) to see how to do this.

**Don't forget to filter the words, otherwise the procedure of running an LDA might take forever, or even makes your computer crash.**

I will run a LDA topic model with 10 topics, and set the seed to the date of the tutorial.

This can take some time if you do not have a powerful computer, **don't forget `cache=T` therefore**.

```{r, eval=T, message=F, warning=F, include=F, cache=F}
library(topicmodels)
set.seed(11112020)

dfm_G <- convert(dtm_G, to = "topicmodels") 
m_G <- LDA(dfm_G, method = "Gibbs", k = 10,  control = list(alpha = 0.1))

dfm_A <- convert(dtm_A, to = "topicmodels") 
m_A <- LDA(dfm_A, method = "Gibbs", k = 10,  control = list(alpha = 0.1))
```

#### Challenge 2
Validation is very important when applying topic models to measure any agenda.
Inspect the results of your LDA model using `terms`, and make a wordcloud for at least 3 topics.
Give an interpretation for each topic elaborated with an example.
Have a look at the [tutorial](https://github.com/ccs-amsterdam/r-course-material/blob/master/tutorials/r_text_lda.md#inspecting-lda-results) for the code.

```{r, warning=F, message=F}
library(wordcloud)

terms(m_G, 5)
topic <- 5
words_G <- posterior(m_G)$terms[topic, ]
topwords_G <- head(sort(words_G, decreasing = T), n=50)
wordcloud(names(topwords_G), topwords_G)

terms(m_A, 5)
topic <- 5
words_A <- posterior(m_A)$terms[topic, ]
topwords_A <- head(sort(words_A, decreasing = T), n=50)
wordcloud(names(topwords_A), topwords_A)
```

Finally, see in which years certain topics were more prevelent.

```{r, message=F,warning=F, echo=F, eval=T}
docs_G <- docvars(dtm_G)[match(rownames(dtm_G), docnames(dtm_G)),]
docs_G <- docs_G[1:5997,]
tpp_G <- aggregate(posterior(m_G)$topics, by=docs_G["year"], mean)
rownames(tpp_G) = tpp_G$year
heatmap(as.matrix(tpp_G[-1]))

docs_A <- docvars(dtm_A)[match(rownames(dtm_A), docnames(dtm_A)),]
tpp_A <- aggregate(posterior(m_A)$topics, by=docs_A["month"], mean)
rownames(tpp_A) = tpp_A$month
heatmap(as.matrix(tpp_A[-1]))
```

#### Challenge 3
Reflect on the validity of the current analysis: to what extent do the results of the LDA model measure the media agenda?

#### Challenge 4
The method used in the tutorial is pretty crude. Can you think of any problem in particular, and any ways to mend them? Are there problems which cannot be solved, or would be very hard to solve?