tests_topic_mod_v1.Rmd

---
title: "Topic Modelling"
author: "NLP4StatRef"
date: "11/4/2021"
output:
  github_document: default
  html_document: default
  word_document: default
---

```{r setup, include=FALSE}

knitr::opts_chunk$set(fig.path='Figs/')
knitr::opts_chunk$set(eval=FALSE)
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(cache=FALSE)
knitr::opts_chunk$set(comment='>')
knitr::opts_chunk$set(tidy=FALSE)
knitr::opts_chunk$set(fig.width=8)
knitr::opts_chunk$set(fig.height=8)
```

## Topic modelling: tests with the  Latent Dirichlet Allocation (LDA) algorithm.
***

### 1. Initialization of the R environment.
***
The first step is to load the required libraries. The code chunk below automatically installs these libraries if they are missing. Then we set the working folder to the one containing the R Markdown document and the input datasets. The commented-out code: 

_current_working_dir <- dirname(rstudioapi::getActiveDocumentContext()$path)_ 

works only from within RStudio when running the document chunk-by-chunk. If this is not the case (e.g. when knitting the document), the user has to set the working directory manually.

```{r, eval=TRUE, echo=TRUE, message=FALSE, warning=FALSE}

rm(list=ls()) ## clear objects from memory

## install libraries if missing
list.of.packages <- c('tm','ggplot2','topicmodels','tidytext','dplyr')
                      
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

library(tm)
library(ggplot2)
library(topicmodels)
library(tidytext)
library(dplyr)

#current_working_dir <- dirname(rstudioapi::getActiveDocumentContext()$path)
#print(current_working_dir)
#setwd(current_working_dir)

## ADJUST THIS 
## setwd('D://Kimon//Documents//Quantos-new//NLP4StatRef//Deliverable D2.2')

```

### 2. Data input.
***

We read two of the files extracted from the database, with the glossary articles definitions in _ESTAT_dat_concepts_2021_04_08.csv_ and their titles in _ESTAT_dat_link_info_2021_04_08.csv_. The common key is _id_. **At a later stage, the reading of the files will be directly from the KD**. 

We then drop articles with missing titles and/or definitions and also de-duplicate the records of the resulting file based on these two fields.


```{r, eval=TRUE, echo=TRUE, message=TRUE, warning=TRUE}

# db_connect <- odbcConnect(dsn='https://virtuoso.kappasante.com/',uid='kimon',pwd='jjIJMFIZTWhxeEmX8u7K')
#                           
#                           
# 
# {SQL Server};
#     server=s001111;database=XX;trusted_connection=true')
# 
# c = pyodbc.connect('DSN=VirtuosoKapcode;DBA=ESTAT;UID=XXXX;PWD=XXXXXXXXXXXX')\n",

dat1 <- read.csv2('~//Data//ESTAT_dat_concepts_2021_04_08.csv')
dat2 <- read.csv2('~//Data//ESTAT_dat_link_info_2021_04_08.csv')
dat <- merge(dat1,dat2,by=c('id'),all=FALSE)
dat <- dat[,c('title','definition')]

dels <- which(is.na(dat$title))
if(length(dels)>0) dat <- dat[-dels,]

dels <- which(is.na(dat$definition))
if(length(dels)>0) dat <- dat[-dels,]

dels <-which(duplicated(dat$title))
if(length(dels)>0) dat <- dat[-dels,]

dels <- which(duplicated(dat$definition))
if(length(dels)>0) dat <- dat[-dels,]

rm(dat1,dat2)
```

### 3. Data cleaning.
***

In the next step we do some data cleaning: 

* Replace multiple spaces with single ones in definitions.
* Discard spaces at the start of definitions and titles. 
* Replace space-comma-space by comma-space in definitions.

```{r, eval=TRUE, echo=TRUE, message=TRUE, warning=TRUE}

dat$definition <- gsub(' +',' ',dat$definition) ## discard multiple spaces
dat$definition <- gsub('^ +','',dat$definition) ## discard spaces at start
dat$definition <- gsub(' \\, ','\\, ',dat$definition) ## space-comma-space -> comma-space

dat$title <- gsub('^ +','',dat$title) ## discard spaces at start

```

### 4. Creating tm objects.
***

Next we create a corpus _texts_ from the articles. This has initially 1285 text entries. We apply the standard pre-processing steps to the texts:

* Remove punctuation and numbers. 
* Convert all to lower case.
* Strip whitespace and apply an English stemmer.

In the end we obtain 331 terms. 

We then create a document-to-term matrix _dmat_, keeping words with minimum length 5, each one in at least 2% of documents and in at most 30% of the documents. We remove documents without terms and convert the matrix to a 1278 x 331 dataframe for inspection. 

Note that in the construction of the document-to-term matrix, we do not request any weights, such as tf-idf. This is a requirement of the LDA algorithm.



```{r,  eval=TRUE, echo=TRUE, message=TRUE, warning=FALSE}

texts <- Corpus(VectorSource(dat$definition))
ndocs <- nrow(dat)
cat('ndocs = ',ndocs,'\n')

## apply several pre-processing steps (see package tm)
texts <- tm_map(texts, removePunctuation) 
texts <- tm_map(texts, removeNumbers) 
texts <- tm_map(texts, tolower)

texts <- tm_map(texts, removeWords, stopwords(kind='SMART')) 
texts <- tm_map(texts, stripWhitespace) 
texts <- tm_map(texts, stemDocument, language='english')

## create document-to-term matrix (tf-idf)
## min word length: 5, each term in at least 2% of documents 
## and at most in 30% of documents
dtm <- DocumentTermMatrix(texts,
                          control=list(weighting=weightTf, 
                            wordLengths=c(5, Inf),bounds = 
                              list(global = c(0.02*ndocs,
                                              0.3*ndocs))))

dels <- which(apply(dtm,1,sum)==0) #remove all texts without terms 
if(length(dels)>0) {
  dtm   <- dtm[-dels, ]           
  dat <- dat[-dels,]
}

nTerms(dtm)
Terms(dtm)

## convert to dataframe for inspection
dtm.dat <- as.data.frame(as.matrix(dtm))
rownames(dtm.dat)<- dat$title

print(inspect(dtm))
```

### 5. Application of the LDA algorithm.
***

We apply the LDA algorithm with k=20 topics. Function _LDA()_ returns an object which contains, among others, a matrix _beta_ expressing, for each topic and term, the **probability that the term is generated from the specific topic**. For details, see [r package topicmodels](https://cran.r-project.org/web/packages/topicmodels/topicmodels.pdf). 

In the following code, we first group the results by topic and then select the terms with the top _beta_ values in each topic.Then we plot these values and the corresponding terms for each topic.

```{r,  eval=TRUE, echo=TRUE, message=TRUE, warning=TRUE, fig.keep='all'}

lda_model <- LDA(dtm, k = 20, control = list(seed = 1234))
topics <- tidy(lda_model, matrix = "beta")

top_terms <- topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>% 
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

```

The results with the top 10 terms by topic can be interpreted as follows: 

* Topic 1: Social expenditure and contributions.
* Topic 2: Population, regions and geography. 
* Topic 3: Persons and employment.
* Topic 4: Intellectual property rights. 
* Topic 5: Economic sectors.
* Topic 6: Public services.
* Topic 7: International trade.
* Topic 8: Price indices. 
* Topic 9: Surveys.
* Topic 10: Technology, research and innovation.
* Topic 11: Countries, territories and resident population.
* Topic 12: Business activities and enterprises.
* Topic 13: Transport. 
* Topic 14: Primary production and the environment.
* Topic 15: The EU and the member states.
* Topic 16: Energy and water resources.
* Topic 17: Accounting and finance.
* Topic 18: Healthcare.
* Topic 19: Households disposable income and consumption.
* Topic 20: Production, consumption and gross capital.

If these results are useful, the analysis will be extended to take into account the _gamma_ coefficients which express, for each document and topic, the **estimated proportion of terms from the document that are generated from that topic**.