---
title: "Topic Modelling"
author: "NLP4StatRef"
date: "11/4/2021"
output:
  github_document: default
  html_document: default
  word_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.path='Figs/')
knitr::opts_chunk$set(eval=FALSE)
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(cache=FALSE)
knitr::opts_chunk$set(comment='>')
knitr::opts_chunk$set(tidy=FALSE)
knitr::opts_chunk$set(fig.width=8)
knitr::opts_chunk$set(fig.height=8)
```
## Topic modelling: tests with the Latent Dirichlet Allocation (LDA) algorithm.
***
### 1. Initialization of the R environment.
***
The first step is to load the required libraries. The code chunk below automatically installs these libraries if they are missing. Then we set the working folder to the one containing the R Markdown document and the input datasets. The commented-out code:
_current_working_dir <- dirname(rstudioapi::getActiveDocumentContext()$path)_
works only from within RStudio when running the document chunk-by-chunk. If this is not the case (e.g. when knitting the document), the user has to set the working directory manually.
```{r, eval=TRUE, echo=TRUE, message=FALSE, warning=FALSE}
rm(list=ls()) ## clear objects from memory
## install libraries if missing
list.of.packages <- c('tm','ggplot2','topicmodels','tidytext','dplyr')
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
library(tm)
library(ggplot2)
library(topicmodels)
library(tidytext)
library(dplyr)
#current_working_dir <- dirname(rstudioapi::getActiveDocumentContext()$path)
#print(current_working_dir)
#setwd(current_working_dir)
## ADJUST THIS
## setwd('D://Kimon//Documents//Quantos-new//NLP4StatRef//Deliverable D2.2')
```
### 2. Data input.
***
We read two of the files extracted from the database: the glossary article definitions in _ESTAT_dat_concepts_2021_04_08.csv_ and their titles in _ESTAT_dat_link_info_2021_04_08.csv_. The common key is _id_. **At a later stage, the files will be read directly from the KD**.
We then drop articles with missing titles and/or definitions and de-duplicate the remaining records, first by title and then by definition.
```{r, eval=TRUE, echo=TRUE, message=TRUE, warning=TRUE}
## Placeholder for a future direct connection to the KD (not used yet).
## Credentials are masked here and should never be hard-coded in the document.
# db_connect <- odbcConnect(dsn='https://virtuoso.kappasante.com/',uid='XXXX',pwd='XXXXXXXXXXXX')
dat1 <- read.csv2('~//Data//ESTAT_dat_concepts_2021_04_08.csv')
dat2 <- read.csv2('~//Data//ESTAT_dat_link_info_2021_04_08.csv')
dat <- merge(dat1,dat2,by=c('id'),all=FALSE)
dat <- dat[,c('title','definition')]
dels <- which(is.na(dat$title))
if(length(dels)>0) dat <- dat[-dels,]
dels <- which(is.na(dat$definition))
if(length(dels)>0) dat <- dat[-dels,]
dels <-which(duplicated(dat$title))
if(length(dels)>0) dat <- dat[-dels,]
dels <- which(duplicated(dat$definition))
if(length(dels)>0) dat <- dat[-dels,]
rm(dat1,dat2)
```
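The merge and filtering logic above can be checked on a toy example. The sketch below is only an illustration: `toy_concepts` and `toy_titles` are made-up stand-ins for the two CSV files, not real data. It applies the same base R steps, i.e. an inner join on _id_, removal of rows with missing values and de-duplication on each field in turn.

```r
## Toy stand-ins for the two input files (made-up values, illustration only)
toy_concepts <- data.frame(id = 1:5,
                           definition = c("def A", NA, "def C", "def C", "def E"),
                           stringsAsFactors = FALSE)
toy_titles <- data.frame(id = c(1L, 2L, 3L, 4L, 6L),
                         title = c("A", "B", "C", "C2", "F"),
                         stringsAsFactors = FALSE)

## inner join on the common key: ids 5 and 6 have no partner and drop out
toy <- merge(toy_concepts, toy_titles, by = "id", all = FALSE)
toy <- toy[, c("title", "definition")]

## drop rows with a missing title or definition (id 2 here)
toy <- toy[!is.na(toy$title) & !is.na(toy$definition), ]

## keep the first record per title, then per definition (id 4 repeats "def C")
toy <- toy[!duplicated(toy$title), ]
toy <- toy[!duplicated(toy$definition), ]

print(toy)
```

The sketch uses logical indexing instead of the `which()` / negative-index pattern, so it does not need a `length(dels) > 0` guard. Both forms should keep the same rows.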
### 3. Data cleaning.
***
In the next step we do some data cleaning:
* Replace multiple spaces with single ones in definitions.
* Discard spaces at the start of definitions and titles.
* Replace space-comma-space by comma-space in definitions.
```{r, eval=TRUE, echo=TRUE, message=TRUE, warning=TRUE}
dat$definition <- gsub(' +',' ',dat$definition) ## discard multiple spaces
dat$definition <- gsub('^ +','',dat$definition) ## discard spaces at start
dat$definition <- gsub(' , ',', ',dat$definition,fixed=TRUE) ## space-comma-space -> comma-space
dat$title <- gsub('^ +','',dat$title) ## discard spaces at start
```
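As a quick check of the three cleaning rules, the sketch below applies them to a single made-up definition string (`toy_def` is a hypothetical example, not taken from the data). The comma replacement is written as a literal match with `fixed = TRUE`.

```r
## made-up definition with leading spaces, repeated spaces
## and a stray space before a comma
toy_def <- "   Gross  domestic product , at   market prices"

toy_def <- gsub(" +", " ", toy_def)                 # collapse runs of spaces
toy_def <- gsub("^ +", "", toy_def)                 # drop leading spaces
toy_def <- gsub(" , ", ", ", toy_def, fixed = TRUE) # space-comma-space -> comma-space

print(toy_def)
```

The order matters: collapsing repeated spaces first ensures that the comma rule sees exactly one space on each side of the comma.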
### 4. Creating tm objects.
***
Next we create a corpus _texts_ from the article definitions. It initially contains 1285 text entries. We apply the standard pre-processing steps to the texts:
* Remove punctuation and numbers.
* Convert all to lower case.
* Remove English stop words (the SMART list).
* Strip whitespace and apply an English stemmer.
We then create a document-to-term matrix _dtm_, keeping only terms with a minimum length of 5 characters that occur in at least 2% and at most 30% of the documents. This leaves 331 terms. We remove documents without any remaining terms and convert the matrix to a 1278 x 331 dataframe _dtm.dat_ for inspection.
Note that in the construction of the document-to-term matrix we request plain term frequencies (_weightTf_) rather than weights such as tf-idf. This is a requirement of the LDA algorithm, which works on term counts.
```{r, eval=TRUE, echo=TRUE, message=TRUE, warning=FALSE}
texts <- Corpus(VectorSource(dat$definition))
ndocs <- nrow(dat)
cat('ndocs = ',ndocs,'\n')
## apply several pre-processing steps (see package tm)
texts <- tm_map(texts, removePunctuation)
texts <- tm_map(texts, removeNumbers)
texts <- tm_map(texts, content_transformer(tolower))
texts <- tm_map(texts, removeWords, stopwords(kind='SMART'))
texts <- tm_map(texts, stripWhitespace)
texts <- tm_map(texts, stemDocument, language='english')
## create document-to-term matrix with plain term frequencies (no tf-idf)
## min word length: 5, each term in at least 2% of documents
## and at most in 30% of documents
dtm <- DocumentTermMatrix(texts,
                          control = list(weighting = weightTf,
                                         wordLengths = c(5, Inf),
                                         bounds = list(global = c(0.02*ndocs,
                                                                  0.3*ndocs))))
dels <- which(apply(dtm,1,sum)==0) #remove all texts without terms
if(length(dels)>0) {
dtm <- dtm[-dels, ]
dat <- dat[-dels,]
}
nTerms(dtm)
Terms(dtm)
## convert to dataframe for inspection
dtm.dat <- as.data.frame(as.matrix(dtm))
rownames(dtm.dat)<- dat$title
inspect(dtm) ## inspect() already prints a summary and a sample of the matrix
```
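To see how the word-length and document-frequency filters interact, the following sketch builds a document-to-term matrix from four made-up documents. The documents and the names `toy_docs`, `toy_corpus`, `toy_dtm` and `n_toy` are assumptions for illustration only, and the bounds are set to 50% and 75% of the documents so that the effect is visible on such a small corpus.

```r
library(tm)

## four made-up documents (illustration only)
toy_docs <- c("price index energy",
              "price trade",
              "price trade energy",
              "price tax")
toy_corpus <- VCorpus(VectorSource(toy_docs))
n_toy <- length(toy_docs)

## raw term counts, words of 5+ characters,
## terms occurring in 2 to 3 of the 4 documents
toy_dtm <- DocumentTermMatrix(toy_corpus,
             control = list(weighting = weightTf,
                            wordLengths = c(5, Inf),
                            bounds = list(global = c(0.5 * n_toy,
                                                     0.75 * n_toy))))

Terms(toy_dtm)               # "tax" is too short, "index" too rare, "price" too common
rowSums(as.matrix(toy_dtm))  # the fourth document is left with no terms

## drop documents without any remaining term
keep <- which(rowSums(as.matrix(toy_dtm)) > 0)
toy_dtm <- toy_dtm[keep, ]
dim(toy_dtm)
```

Removing the empty document mirrors the step in the chunk above. It is needed because _LDA()_ expects every row of the matrix to contain at least one non-zero entry.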
### 5. Application of the LDA algorithm.
***
We apply the LDA algorithm with k=20 topics. Function _LDA()_ returns an object which contains, among others, a matrix _beta_ expressing, for each topic and term, the **probability that the term is generated from the specific topic**. For details, see the reference manual of the [R package topicmodels](https://cran.r-project.org/web/packages/topicmodels/topicmodels.pdf).
In the following code, we first group the results by topic and then select the 10 terms with the highest _beta_ values in each topic. Then we plot these values and the corresponding terms for each topic.
```{r, eval=TRUE, echo=TRUE, message=TRUE, warning=TRUE, fig.keep='all'}
lda_model <- LDA(dtm, k = 20, control = list(seed = 1234))
topics <- tidy(lda_model, matrix = "beta")
top_terms <- topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()
```
The results with the top 10 terms by topic can be interpreted as follows:
* Topic 1: Social expenditure and contributions.
* Topic 2: Population, regions and geography.
* Topic 3: Persons and employment.
* Topic 4: Intellectual property rights.
* Topic 5: Economic sectors.
* Topic 6: Public services.
* Topic 7: International trade.
* Topic 8: Price indices.
* Topic 9: Surveys.
* Topic 10: Technology, research and innovation.
* Topic 11: Countries, territories and resident population.
* Topic 12: Business activities and enterprises.
* Topic 13: Transport.
* Topic 14: Primary production and the environment.
* Topic 15: The EU and the member states.
* Topic 16: Energy and water resources.
* Topic 17: Accounting and finance.
* Topic 18: Healthcare.
* Topic 19: Household disposable income and consumption.
* Topic 20: Production, consumption and gross capital.
If these results are useful, the analysis will be extended to take into account the _gamma_ coefficients which express, for each document and topic, the **estimated proportion of terms from the document that are generated from that topic**.
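A minimal sketch of how the _gamma_ coefficients could be extracted is given below. It is not part of the analysis above: it fits a two-topic model on six made-up documents (`toy_docs`, `toy_dtm`, `toy_lda`, `toy_gamma` and `toy_top` are hypothetical names used only for illustration) and then picks the topic with the highest _gamma_ for each document.

```r
library(tm)
library(topicmodels)
library(tidytext)
library(dplyr)

## six made-up documents on two loose themes (illustration only)
toy_docs <- c("price index inflation price",
              "inflation price consumer index",
              "consumer price inflation",
              "energy water supply energy",
              "water supply energy resource",
              "resource energy water")
toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(toy_docs)),
                              control = list(weighting = weightTf))

toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

## one row per document-topic pair; gamma sums to 1 within each document
toy_gamma <- tidy(toy_lda, matrix = "gamma")

## topic with the highest gamma for each document
toy_top <- toy_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

print(toy_top)
```

On the real model, the same two steps applied to _lda_model_ would give one dominant topic per glossary article. Whether a single dominant topic is an adequate summary depends on how concentrated the _gamma_ values turn out to be.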