forked from GuillaumeBgc/SensoryDataScience
/
QualiQuali_02.Rmd
282 lines (216 loc) · 13.2 KB
/
QualiQuali_02.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
---
title: "Part2 : Quali-Quali"
author: "Guillaume"
date: "07/06/2022"
output:
html_document:
toc: yes
---
#Representation of sensory attributes and bivariate analysis
## data importation, cleaning and selection
```{r}
library(FactoMineR)
library(SensoMineR)
#import
GMO <- read.csv("data/gmo.csv", sep=';')
GMO<-GMO[-136,]
#transform into factor
for (i in 1:21)
{
GMO[, i] <- as.factor(GMO[, i])
}
#summary
summary(GMO)
#selection variable Political.Party and Position.Al.A
GMO.Al.A <- GMO[,c( "Political.Party","Position.Al.A")]
#or tidyverse version :
library(dplyr)
GMO.Al.A <- GMO %>% select(c( Political.Party,Position.Al.A,))
```
## Bivariate analysis
We want to visualize both variables simultaneously. A barchat is appropriate to represent categories of one variable, but here it can also be useful to represent the cross-categories of the two variables.
```{r}
library(ggplot2)
ggplot(GMO.Al.A, aes(Position.Al.A, ..count..)) + geom_bar(aes(fill = Political.Party), color="black", position = "dodge") + ggtitle("Barplot of cross-categories") +
scale_fill_manual(values=c("Extreme left"="#bb0000", "Left"="#FF8080", "Liberal"="#ffeb00", "Greens"="#3BA019", "Right"="#0066cc"))
## ..count.. : stat transformation of the original data set
```
#Concept of independance
## Sample data
To introduce this concept of independence, we remake the first part, but considering samples data. That means, we take keep a fixed column of our 2 variables dataframe, and we arranged randomly the second variable.
```{r}
set.seed(7)
ind.sample <- sample(1:135)
GMO.Al.A.Sample <- data.frame("Position.Al.A"=GMO.Al.A$Position.Al.A[ind.sample],'Political.Party'=GMO.Al.A$Political.Party)
#visualization
ggplot(GMO.Al.A.Sample, aes(Position.Al.A, ..count..)) + geom_bar(aes(fill = Political.Party), color="black", position = "dodge") + ggtitle("Barplot of cross-categories sampled") +
scale_fill_manual(values=c("Extreme left"="#bb0000", "Left"="#FF8080", "Liberal"="#ffeb00", "Greens"="#3BA019", "Right"="#0066cc"))
```
## probability side
Study a link between two variables comes to position data to a referent situation, the absence of linkage. In term of probability, saying that two events A and B are independent comes to say :
$$\mathcal{P}(A\cap B)=\mathcal{P}(A)\cdot\mathcal{P}(B) $$
Hence, two qualitative variables are independent when we have the relation :
$f_{ij}=f_{i\cdot}\cdot f_{\cdot j}$
The mutual probability only depends of marginal numbers.
## Referent situation
First, we construct the table of reference, __ie : __ a table with the product $$f_{i\cdot}\cdot f_{\cdot j}$$ in each cells. We have $f_{\cdot j}=\dfrac{1}{n}\sum\limits_{j\in\mathcal{J}}n_{ij}$ and $f_{i\cdot}=\dfrac{1}{n}\sum\limits_{i\in\mathcal{I}}n_{ij}$
```{r}
#We start by adding margins to have row and column profiles
GMO_table <- as.data.frame.matrix(table(GMO.Al.A))
fi <- apply(GMO_table, 1, sum)/sum(GMO_table)
fj <- apply(GMO_table, 2, sum)/sum(GMO_table)
GMO_theo <- matrix(nrow=5,ncol=4)
for(i in 1:5){
for(j in 1:4) GMO_theo[i, j] <- (fi[i]*fj[j])
}
rownames(GMO_theo) <- rownames(GMO_table)
colnames(GMO_theo) <- colnames(GMO_table)
#barplot
barplot((as.table(as.matrix(GMO_theo))), beside = TRUE, main="effectifs theoriques", col=c("#bb0000","#3BA019","#FF8080","#ffeb00","#0066cc"))
legend("topleft", inset=.02, title="Political Party", c("Extreme left","Greens", "Left", 'Liberal', 'Right'), horiz=TRUE, cex=0.8, fill=c("#bb0000", "#3BA019", "#FF8080", "#ffeb00", "#0066cc"))
barplot(t(table(GMO.Al.A.Sample))/135, beside = TRUE, main="sample", col=c("#bb0000", "#3BA019", "#FF8080", "#ffeb00", "#0066cc"))
legend("topleft", inset=.02, title="Political Party", c("Extreme left", "Greens", "Left", 'Liberal', 'Right'), horiz=TRUE, cex=0.8, fill=c("#bb0000", "#3BA019", "#FF8080", "#ffeb00", "#0066cc"))
#
barplot((table(GMO.Al.A)/135), beside = TRUE,main="non sample", col=c("#bb0000", "#3BA019","#FF8080","#ffeb00","#0066cc"))
legend("topleft", inset=.02, title="Political Party", c("Extreme left", "Greens", "Left", 'Liberal', 'Right'), horiz=TRUE, cex=0.8, fill=c("#bb0000", "#3BA019","#FF8080", "#ffeb00", "#0066cc"))
```
We have our referent situation on the left, where variables are perfectly independant. The sample data is very closed to the it, but the original one seems to be deviate. We can try to make first hypothesis concerning a test on independence.
ggplot version :
```{r}
library(tidyr)
df_theo <- as.data.frame.matrix(GMO_theo)
library(tibble)
df <- rownames_to_column(df_theo, "Political.Party")
df <- df %>% pivot_longer(!Political.Party, names_to="Position.Al.A", values_to = "frequence")
ggplot(df, aes(y=frequence,x=Position.Al.A, fill=Political.Party))+
geom_col( position = "dodge", color="black")+ggtitle("Barplot of theorical table")+
scale_fill_manual(values=c("Extreme left"="#bb0000", "Left"="#FF8080", "Liberal"="#ffeb00", "Greens"="#3BA019", "Right"="#0066cc"))
```
#Deviation to the independance
Now that we have our referent situation, we need to define a deviation to the independence for each pairs of categories. With the hypothesis of independence, we have the relation $f_{ij}=f_{i\cdot}\cdot f_{\cdot j}$ __ie :__ $\dfrac{f_{ij}}{f_{i\cdot}\cdot f_{\cdot j}}-1=0$
```{r}
fi <- apply(GMO_table, 1, sum)/sum(GMO_table)
fj <- apply(GMO_table, 2, sum)/sum(GMO_table)
GMO_deviation <- matrix(nrow=5, ncol=4)
for(i in 1:5){
for(j in 1:4) GMO_deviation[i, j] <- (GMO_table[i, j]/sum(GMO_table)/((fi[i]*fj[j])))-1
}
rownames(GMO_deviation) <- rownames(GMO_table)
colnames(GMO_deviation) <- colnames(GMO_table)
```
These multivariate profiles (as vectors of $R^{J}$ and $R^{I}$) can be directly interpreted in terms of difference from the independence model.
Naturally, if you had to compare two political party in terms of difference from the independence model, you would calculate a distance based on the differences regarding each type of Position.Al.A categories. As the different types of Position.Al.A have a different weight relative to each other, you would naturally take that information into account, and calculate a distance weighted by the relative importance of each type of music. In other words, we're going to consider the following distance between two occupational statuses i and i':
$$ d^2(i,i') = \sum\limits_{j\in\mathcal{J}}f_{\cdot j}\cdot((\dfrac{f_{ij}}{f_{i\cdot}\cdot f_{\cdot j}})-(\dfrac{f_{i'j}}{f_{i'\cdot}\cdot f_{\cdot j}}))^2$$
Which can also be written
$$ d^2(i,i') = \sum\limits_{j\in\mathcal{J}}\dfrac{1}{f_{\cdot j}}(\dfrac{f_{ij}}{f_{i\cdot}}-\dfrac{f_{i'j}}{f_{i'\cdot}})^2$$
or :
$$ d^2(i,i') = \sum\limits_{j\in\mathcal{J}}\dfrac{n}{n_{\cdot j}}(\dfrac{n_{ij}}{n_{i\cdot}}-\dfrac{n_{i'j}}{n_{i'\cdot}})^2$$
The distance between i and i' in terms of difference from the independence model lead to the comparison of the profile of i' and the profile of i', category by category. The name of the distance between two row profiles is called the Chi-square distance :
$$ d^2(i,i') = d_{\chi^2}^2(i,i') = \sum\limits_{j\in\mathcal{J}}\dfrac{n}{n_{\cdot j}}(\dfrac{n_{ij}}{n_{i\cdot}}-\dfrac{n_{i'j}}{n_{i'\cdot}})^2$$
Calculate the distance matrix between row-profiles _ie_ the political party.
```{r}
ni <- apply(GMO_table,1, sum)#row sum
nj <- apply(GMO_table,2, sum)#col sum
n <- as.numeric(sum(ni))
GMO_chi2_row <- matrix(nrow=5, ncol=5)
for(i in 1:5){
for(j in 1:5){GMO_chi2_row[i, j] <- sum((n/nj)*((GMO_table[i,]/ni[i]) - (GMO_table[j,]/ni[j]))^2)}
}
rownames(GMO_chi2_row) <- rownames(GMO_table)
colnames(GMO_chi2_row) <- rownames(GMO_table)
dim <- ncol(GMO_chi2_row)
image(1:dim, 1:dim,GMO_chi2_row, axes=F, xlab="", ylab="")
axis(1, 1:dim, row.names(GMO_chi2_row), cex.axis=1, las=3)
axis(2, 1:dim, row.names(GMO_chi2_row), cex.axis=1, las=1)
text(expand.grid(1:dim, 1:dim), sprintf("%0.1f", GMO_chi2_row), cex=1)
```
The chi2 statistic is calculated with the `chisq.test` function. It also gives you the results of the test where the null hypothesis is the independence between the variables based on the statistic. If the statistic is too large, we will reject the hypothesis of independence and reciprocally.
```{r}
res.chi2.gmo <- chisq.test(GMO_table)
res.chi2.sample <- chisq.test(table(GMO.Al.A.Sample))
res.chi2.gmo$residuals
```
Here, the 2 variables are not independent at a level of 5%, but the sample data is independent at 5%. This result was interpretable graphically comparing the referent situation barplot to the other two. the p-value is by definition the probability for the statistic to be higher than 36.77, under the null hypothesis for the first test.
#Inertia
With a distance and a system of masses, we can now focus on the very important concept of inertia. By definition, as evoked in our introduction, the inertia of our Political party is obtained by the following calculation :
$$ \mathcal{I}(N_{I}) = \sum\limits_{i\in\mathcal{I}} f_{i\cdot}\cdot d^2(i,\mathcal{O}) $$
where $\mathcal{I}(N_{I})$ is the inertia of the system of rows, $N_{I}$ denotes the scatter plot of the rows in the $R^{J}$ space, and $\mathcal{O}$ denotes the center of gravity of $N_{I}$ .
#CA
```{r}
res.CA <- CA(GMO_table)
```
We can observe a Guttman effect, with a _parabolic_ form.
You can also only represent variables or individuals :
```{r}
plot(res.CA, invisible = "col")
plot(res.CA, invisible = "row")
```
#Categories involvement in independance
`descfreq` describes the rows or groups of rows in a contingency table
```{r}
res.descfreq <- descfreq(GMO_table, proba = 1)
res.descfreq
```
The V-test is used to transform p-values into scores that are more easily interpretable. The main
feature of this indicator is that it can be positive, or negative, depending on the test that has been performed. It is closed to a Z-score.
# Pratic example with barbes
```{r}
library(readxl)
library(dplyr)
library(FactoMineR)
library(vietnameseConverter)
library(stringi)
```
## Import + modifications
```{r}
barbesV2 <- read_excel("data/barbesV2.xlsx")
barbesV2 <- barbesV2 %>% mutate(trad_sans_accents = stri_trans_general(Trad, "Latin-ASCII"))
barbesV2$Juge <- as.factor(barbesV2$Juge)
barbesV2$Barbes <- as.factor(barbesV2$Barbes)
barbesV2.viet.acc <- as.data.frame(barbesV2[, -c(3,5)])
barbesV2.viet.sans.acc <- as.data.frame(barbesV2[, -c(3,4)])
barbesV2.en <- as.data.frame(barbesV2[,-c(4,5)])
```
### Anglais :
```{r}
res.textual.en.V2 <- textual(barbesV2.en,num.text = 3, contingence.by = 2, sep.word = ";")
eff.en.V2 <- as.data.frame(apply(res.textual.en.V2$cont.table, MARGIN = 2, FUN=sum))
words_selection.en.V2 <- res.textual.en.V2$cont.table[,apply(res.textual.en.V2$cont.table, 2, sum)>=1]
res.CA.en.V2 <- CA(words_selection.en.V2, graph=FALSE)
plot.CA(res.CA.en.V2, invisible = "row")
plot.CA(res.CA.en.V2, invisible="col")
```
### Vietnamien avec accents :
```{r}
res.textual.viet.V2 <- textual(barbesV2.viet.acc,num.text = 3, contingence.by = 2, sep.word = ";")
eff.viet.V2 <- as.data.frame(apply(res.textual.viet.V2$cont.table, MARGIN = 2, FUN=sum))
words_selection.viet.V2 <- res.textual.viet.V2$cont.table[,apply(res.textual.viet.V2$cont.table, 2, sum)>=1]
res.CA.viet.V2 <- CA(words_selection.viet.V2, graph=FALSE)
plot.CA(res.CA.viet.V2, invisible="row")
plot.CA(res.CA.viet.V2, invisible="col")
```
### Vietnamien sans accents :
```{r}
res.textual.viet.sans.accent.V2 <- textual(barbesV2.viet.sans.acc,num.text = 3, contingence.by = 2, sep.word = ";")
eff.viet.V2 <- as.data.frame(apply(res.textual.viet.sans.accent.V2$cont.table, MARGIN = 2, FUN=sum))
words_selection.viet.sans.accent.V2 <- res.textual.viet.sans.accent.V2$cont.table[,apply(res.textual.viet.sans.accent.V2$cont.table, 2, sum)>=1]
res.CA.viet.sans.accent.V2 <- CA(words_selection.viet.sans.accent.V2, graph=FALSE)
plot.CA(res.CA.viet.sans.accent.V2, invisible = "row")
plot.CA(res.CA.viet.sans.accent.V2, invisible = "col")
```
### Comparaison
```{r}
coord.en.V2 <- data.frame("words"=rownames(res.CA.en.V2$col$coord))
coord.en.V2 <- coord.en.V2 %>% mutate(Dim1.en=res.CA.en.V2$col$coord[,1], Dim2.en=res.CA.en.V2$col$coord[,2])
coord.en.V2 <- as.data.frame(coord.en.V2[order(coord.en.V2$Dim1.en,coord.en.V2$Dim2.en),])
coord.viet.acc.V2 <- data.frame("words"=rownames(res.CA.viet.V2$col$coord))
coord.viet.acc.V2 <- coord.viet.acc.V2 %>% mutate(Dim1.viet.acc = res.CA.viet.V2$col$coord[,1], Dim2.viet.acc=res.CA.viet.V2$col$coord[,2])
coord.viet.acc.V2 <- as.data.frame(coord.viet.acc.V2[order(coord.viet.acc.V2$Dim1.viet.acc,coord.viet.acc.V2$Dim2.viet.acc),])
coord.viet.sans.acc.V2 <- data.frame("words"=rownames(res.CA.viet.sans.accent.V2$col$coord))
coord.viet.sans.acc.V2 <- coord.viet.sans.acc.V2 %>% mutate(Dim1.viet.sans.acc = res.CA.viet.sans.accent.V2$col$coord[,1], Dim2.viet.sans.acc=res.CA.viet.sans.accent.V2$col$coord[,2])
coord.viet.sans.acc.V2 <- as.data.frame(coord.viet.sans.acc.V2[order(coord.viet.sans.acc.V2$Dim1.viet.sans.acc,coord.viet.sans.acc.V2$Dim2.viet.sans.acc),])
coord <- cbind(coord.en.V2, coord.viet.acc.V2, coord.viet.sans.acc.V2)
```
Conclusion :
- pas de différence entre CA mots anglais versus mots vietnamiens (warnings sur l'encodage lors de la CA mais coordonnées pareilles);
- aucune utilité des mots vietnamiens sans accents