The code in this repository can be used to reproduce the results and simulations in Sanna Passino, F., Heard, N. A., and Rubin-Delanchy, P. (2021) "Spectral clustering on spherical coordinates under the degree-corrected stochastic blockmodel", Technometrics (to appear, link to the arXiv publication).
The main tool for inference on DCSBMs is the class EGMM
contained in the file dcsbm.py
. The class can be initialised using simply the number of communities K
, a positive integer. The parameters for the EM algorithm are initialised at random (initialise_random
) or with k-means (initialise_kmeans
). An estimate of the communities for a -dimensional embedding X
can be obtained with the function fit_predict
applied on the class. The function has the following arguments:
X
: anumpy
array containing the -dimensional embedding;d
: an integer representing the dimension of the latent positions, with ;transformation
: a string, chosen betweennormalised
,theta
, andscore
, representing the required transformation of the embedding (row-normalisation, spherical coordintes or SCORE - if no value is provided fortransformation
, the default isNone
, corresponding clustering on the standard adjacency spectral embedding);random_init
: aTrue/False
boolean value used for initialisation (ifTrue
, the functioninitialise_random
is used, otherwiseinitialise_kmeans
);verbose
(boolean),max_iter
(integer),tolerance
(float): other parameters for the EM algorithm.
For example, the following snippet initialises the model with K
communities and estimates the clusters using the spherical coordinates transformation (theta
) with assumed dimension d
of the latent positions:
M = dcsbm.EGMM(K=K)
z = M.fit_predict(X, d=d, transformation='theta', verbose=False, random_init=False, max_iter=500, tolerance=1e-6)
The results, tables and figures in the paper could be reproduced using the following files:
- Section 1 - Introduction
- Figure 1:
degree_distribution.py
.
- Figure 1:
- Section 2.3 - Asymptotic properties of spectral embedding of DCSBMs
- Figure 2(a):
dcsbm_example.R
; - Figure 2(b):
dcsbm_clt_simulation.py
.
- Figure 2(a):
- Section 5.1 - Gaussian mixture modelling of DCSBM embeddings
- Figure 3:
dcsbm_transformations.R
.
- Figure 3:
- Section 5.2 - Structure of the likelihood
- Figure 4:
validation_undirected.py
(the script also gives results for reproducing Figure 8 in Appendix B.1 - Asymptotic behaviour).
- Figure 4:
- Section 5.3 - Normality of the spherical coordinates
- Calculation of the p-values of the paired sign tests:
validation_mardia.py
.
- Calculation of the p-values of the paired sign tests:
- Section 6.1 - Synthetic networks
- Table 1(a), Figures 5(a), 5(b), 6(a) and 6(b):
sim_undirected.py
, using the calls insim_undirected_calls.sh
; - Table 1(b), Figures 5(c) and 6(c):
sim_bipartite.py
, using the calls insim_bipartite_calls.sh
.
- Table 1(a), Figures 5(a), 5(b), 6(a) and 6(b):
- Section 6.2 - Imperial College network flow data
- Table 3:
icl.py
andicl_louvain.py
, using the calls inicl_calls.sh
; - Mardia tests:
icl_mardia.py
.
- Table 3:
- Appendix B.2 - Changes in the correlation between blocks
- Figure 9:
validation_correlation.py
.
- Figure 9:
Note that, for security reasons, the ICL network data have not been made available, but the code to run the model on such networks is available.
The figures in the paper are all produced with the LaTeX
package pgfplots
, using the results obtained running the scripts listed above. Hence, the figures obtained from the R
and python
scripts listed above are not identical to those in the paper.
Additional scripts are published in the repository:
bic_fit.py
, which is useful to fit a GMM on a dataset, choosing the number of clusters using BIC;mclust.py
, which is useful to run theR
packagemclust
inpython
(using the function inmclust.R
);zhu.py
, which can be used to calculate the elbows of the scree plot using the criterion of Zhu and Ghodsi.