
We unify several latent block models by proposing a flexible exponential family latent block model (ELBM), which we extend to a sparse version (SELBM) that addresses sparse data by revealing a diagonal co-cluster structure. This yields more homogeneous co-clusters and therefore produces useful, ready-to-use, and easy-to-interpret results.

Table of Contents

  1. ELBMcoclust Overview
  2. Datasets
  3. Models
  4. Confusion Matrices
  5. Visualization
  6. Word Cloud of PoissonSELBM for Classic3
  7. Main Contributions
  8. Cite
  9. Highlights
  10. Supplementary Materials
  11. Data Availability
  12. Presentation Video
  13. References

ELBMcoclust and SELBMcoclust

Sparse and Non-Sparse Exponential Family Latent Block Model for Co-clustering

The goal of the statistical approach is to analyze the behavior of the data through its probability distribution. The complete log-likelihood functions for the three versions, LBM, exponential family LBM (ELBM), and sparse exponential family LBM (SELBM), are as follows (a small NumPy sketch of the matrix form follows the equations):

  • LBM
$$L^{\text{LBM}}(\mathbf{r},\mathbf{c},\boldsymbol{\gamma})= \sum\limits_{i,k}r_{ik} \log\pi_{k} +\sum\limits_{j,h} c_{jh}\log\rho_{h} + \sum\limits_{i,j,k,h} r_{ik}\,c_{jh}\log \varphi(x_{ij};\alpha_{kh}).$$
  • ELBM
$$L^{\text{ELBM}}(\mathbf{r},\mathbf{c},\boldsymbol{\gamma}) \propto \sum\limits_{k} r_{.k} \log\pi_{k} + \sum\limits_{h} c_{.h} \log\rho_{h} + \text{Tr}\left( (\mathbf{R}^{\top} (\mathbf{S_{x}}\odot \hat{\boldsymbol{\beta}}) \mathbf{C})^{\top} \mathbf{A}_{\boldsymbol{\alpha}} \right) - \text{Tr}\left( (\mathbf{R}^{\top} (\mathbf{E}_{mn}\odot \hat{\boldsymbol{\beta}}) \mathbf{C})^{\top} \mathbf{F}_{\boldsymbol{\alpha}} \right).$$
  • SELBM
$$\begin{align*} L^{\text{SELBM}}(\mathbf{r},\mathbf{c},\boldsymbol{\gamma}) \propto& \sum\limits_{k} r_{.k} \log\pi_{k} + \sum\limits_{h} c_{.h}\log\rho_{h} + \sum\limits_{k} \left[ \mathbf{R}^{\top}(\mathbf{S_{x}}\odot \hat{\boldsymbol{\beta}})\mathbf{C} \right]_{kk} \left( A(\alpha_{kk}) - A(\alpha) \right)\\ &- \sum\limits_{k} \left[\mathbf{R}^{\top} (\mathbf{E}_{mn} \odot \hat{\boldsymbol{\beta}} )\mathbf{C}\right]_{kk} \left( F(A(\alpha_{kk})) -F(A(\alpha)) \right). \end{align*}$$
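For concreteness, the matrix-form ELBM objective above can be transcribed almost literally into NumPy. This is a minimal sketch, assuming the sufficient-statistic matrix $\mathbf{S_x}$, the weights $\hat{\boldsymbol{\beta}}$, and the $(g \times s)$ matrices $\mathbf{A}_{\boldsymbol{\alpha}}$ and $\mathbf{F}_{\boldsymbol{\alpha}}$ are precomputed for the chosen distribution; the function name and signature are illustrative, not the repository's API.

```python
import numpy as np

def elbm_complete_loglik(S_x, beta_hat, R, C, pi, rho, A_alpha, F_alpha):
    """ELBM complete log-likelihood in matrix form (up to additive constants).

    R (m, g) and C (n, s) are binary partition matrices, pi and rho the
    mixing proportions, and A_alpha, F_alpha hold A(alpha_kh) and
    F(A(alpha_kh)) for the chosen exponential family.
    """
    E_mn = np.ones(S_x.shape)
    block = R.T @ (S_x * beta_hat) @ C    # R^T (S_x ⊙ β̂) C
    norm = R.T @ (E_mn * beta_hat) @ C    # R^T (E_mn ⊙ β̂) C
    return (R.sum(axis=0) @ np.log(pi)
            + C.sum(axis=0) @ np.log(rho)
            + np.trace(block.T @ A_alpha)   # Tr((R^T(S_x ⊙ β̂)C)^T A_α)
            - np.trace(norm.T @ F_alpha))   # Tr((R^T(E ⊙ β̂)C)^T F_α)
```

The SELBM objective keeps only the diagonal entries `block[k, k]` and `norm[k, k]`, contrasting each diagonal parameter $\alpha_{kk}$ with a single shared off-diagonal parameter $\alpha$.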

Datasets

| Dataset | Topics | #Classes | (#Documents, #Words) | Sparsity (%) | Balance |
|---------|--------|----------|-----------------------|--------------|---------|
| Classic3 | Medical, Information retrieval, Aeronautical systems | 3 | (3891, 4303) | 98.95 | 0.71 |
| CSTR | Robotics/Vision, Systems, Natural Language Processing, Theory | 4 | (475, 1000) | 96.60 | 0.399 |
| WebACE | 20 different topics from the WebACE project | 20 | (2340, 1000) | 91.83 | 0.169 |
| Reviews | Food, Music, Movies, Radio, Restaurants | 5 | (4069, 18483) | 98.99 | 0.099 |
| Sports | Baseball, Basketball, Bicycling, Boxing, Football, Golfing, Hockey | 7 | (8580, 14870) | 99.14 | 0.036 |
| TDT2 | 30 different topics | 30 | (9394, 36771) | 99.64 | 0.028 |

  • Balance: (#documents in the smallest class) / (#documents in the largest class)
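The two rightmost statistics of the table are easy to recompute. A minimal sketch, assuming `X` is a dense document-word count matrix and `labels` an integer class vector (both names are illustrative):

```python
import numpy as np

def sparsity(X):
    # Percentage of zero entries in the document-word matrix
    return 100.0 * np.mean(X == 0)

def balance(labels):
    # (#documents in the smallest class) / (#documents in the largest class)
    counts = np.bincount(labels)
    return counts.min() / counts.max()
```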
```python
import numpy as np
from sklearn.metrics import confusion_matrix

from ELBMcoclust.Models.coclust_ELBMcem import CoclustELBMcem
from ELBMcoclust.Models.coclust_SELBMcem import CoclustSELBMcem
from NMTFcoclust.Evaluation.EV import Process_EV

# Fit the Poisson ELBM and SELBM with 4 row and 4 column clusters on CSTR
ELBM = CoclustELBMcem(n_row_clusters=4, n_col_clusters=4, model="Poisson")
ELBM.fit(X_CSTR)

SELBM = CoclustSELBMcem(n_row_clusters=4, n_col_clusters=4, model="Poisson")
SELBM.fit(X_CSTR)

# Evaluation report (accuracy, NMI, ARI, ...) against the true labels
process_ev = Process_EV(true_labels, X_CSTR, ELBM)

# Confusion matrix between the true classes and the (sorted) inferred row clusters
confusion_matrix(true_labels, np.sort(ELBM.row_labels_))
```


```
array([[101,   0,   0,   0],
       [  4,  52,  15,   0],
       [  0,   0, 178,   0],
       [  0,   0,  34,  91]], dtype=int64)
```
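A raw confusion matrix depends on the arbitrary ordering of the cluster indices. A common way to turn it into an accuracy score, sketched below (this is not `Process_EV`'s actual code), is to align clusters to classes with the Hungarian method:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def clustering_accuracy(true_labels, pred_labels):
    # Align cluster indices to class indices (Hungarian method),
    # then take the fraction of documents on the matched diagonal.
    cm = confusion_matrix(true_labels, pred_labels)
    row_ind, col_ind = linear_sum_assignment(-cm)  # maximize matched counts
    return cm[row_ind, col_ind].sum() / cm.sum()
```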

Confusion Matrices


Visualization


Word cloud of PoissonSELBM for Classic3

Word clouds of the top 60 words per co-cluster in the Classic3 dataset obtained by PoissonSELBM.

Bar charts of the top 50 words per co-cluster in the Classic3 dataset obtained by PoissonSELBM.

Main Contributions

The main contributions of this paper are summarized as follows:

  • Exponential family Latent Block Model (ELBM) and its sparse version (SELBM): We propose these models, which unify many leading algorithms suited to various data types.

  • Classification Expectation Maximization (CEM) approach: Our proposed algorithms use this approach within a general matrix-based framework (a schematic row update is sketched after this list).

  • Focus on document-word matrices: Although the matrix formalism is flexible enough to cover different distributions, this work focuses on document-word matrices. We evaluate ELBM and SELBM on six real document-word matrices and three synthetic datasets.
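As a rough illustration of the classification EM idea, here is a schematic of one hard row-assignment step under simplifying assumptions; it is not the repository's exact implementation, and `log_dens` is assumed to be computed from the current column partition and block parameters:

```python
import numpy as np

def cem_row_step(log_pi, log_dens):
    """One CEM update of the row partition.

    log_pi holds the log mixing proportions; log_dens[i, k] is the
    log-density of row i under row-cluster k, aggregated over the
    current column partition.
    """
    scores = log_pi[None, :] + log_dens   # E-step: unnormalized log-posteriors
    labels = scores.argmax(axis=1)        # C-step: hard classification
    R = np.eye(log_pi.size)[labels]       # binary partition matrix R
    return labels, R
```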

Cite

Please cite the following paper if you use ELBMcoclust in your research:

```bibtex
@article{ELBMcoclust,
  title   = {Sparse Exponential Family Latent Block Model for Co-clustering},
  author  = {Hoseinipour, Saeid and Aminghafari, Mina and Mohammadpour, Adel and Nadif, Mohamed},
  journal = {Submitted},
  year    = {2024}
}
```

Highlights

  • The exponential family Latent Block Model (ELBM) and its sparse version (SELBM) were proposed, unifying many models across various data types.
  • The proposed algorithms, based on the classification expectation maximization approach, share a general matrix-based framework.
  • We compared ELBM with SELBM on six real document-word matrices and three synthetic datasets (Bernoulli, Poisson, Gaussian).
  • All datasets and algorithm code are available on GitHub in the ELBMcoclust repository.

Supplementary Materials

  • More details about the Classic3 real-text dataset are available here.
  • For additional visualization, see here.

Data Availability

The algorithm code, all datasets, additional visualizations, and other materials are available in the ELBMcoclust repository. Our experiments were run on a PC (Intel(R) Core(TM) i7-10510U, 2.30 GHz), and all figures were produced in Python with the Seaborn and Matplotlib libraries.

Presentation Video

Presentation video for OPNMTF (see [6]).

References

[1] Govaert, G. and Nadif, M., Clustering with block mixture models, Pattern Recognition (2003).

[2] Govaert, G. and Nadif, M., Block clustering with Bernoulli mixture models: Comparison of different approaches, Computational Statistics and Data Analysis (2008).

[3] Priam, R. et al., Topographic Bernoulli block mixture mapping for binary tables, Pattern Analysis and Applications (2014).

[4] Ailem, M. et al., Sparse Poisson latent block model for document clustering, IEEE Transactions on Knowledge and Data Engineering (2017).

[5] Riverain, P. et al., Semi-supervised Latent Block Model with pairwise constraints, Machine Learning (2022).

[6] Hoseinipour, S. et al., Orthogonal parametric non-negative matrix tri-factorization with $\alpha$-Divergence for co-clustering, Expert Systems with Applications (2023).
