GitHub - azurechen97/Spectral-methods-meet-EM

Topic in Data Science: Spectral Method and Nonconvex optimization

Literature review

Paper: Spectral Methods meet EM: A Provably Optimal Algorithm for Crowdsourcing
Data and MATLAB implementation by the authors
Team members: Aoxue Chen, Song Liang
Other references:
- Tensor decompositions for learning latent variable models

Todo:

Interesting problems that arose during our implementation:

Packages

import numpy as np
from utils import transform_data, get_confusion_matrix, errorRate
from EMfunctions import spectralEM

Data generator

Generator(num_worker=100, num_item=1000, num_category=2, alpha=2, beta=2). To generate a sparse data set, the probability that a worker labels an item follows a $\mathrm{Beta}(\alpha, \beta)$ distribution.

from dataGenerator import Generator  # assumption: Generator is defined in dataGenerator.py
g = Generator(num_worker=10, num_item=10)
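To illustrate how a Beta-distributed labeling probability leads to a sparse worker-by-item matrix, here is a minimal sketch. It is not the repository's Generator: the function name sample_observation_mask is hypothetical, and it assumes each worker draws a single labeling probability from Beta(alpha, beta) and then labels every item independently with that probability. It uses only the standard library so it runs without the repository.

```python
import random


def sample_observation_mask(num_worker, num_item, alpha=2.0, beta=2.0, seed=None):
    """Return (probs, mask), where mask[w][i] is True if worker w labels item i.

    Assumption: one labeling probability per worker, drawn from Beta(alpha, beta).
    """
    rng = random.Random(seed)
    # One labeling probability per worker.
    probs = [rng.betavariate(alpha, beta) for _ in range(num_worker)]
    # Each (worker, item) pair is observed independently with that probability.
    mask = [[rng.random() < p for _ in range(num_item)] for p in probs]
    return probs, mask


probs, mask = sample_observation_mask(num_worker=10, num_item=10, seed=0)
observed = sum(sum(row) for row in mask)
print(f"observed {observed} of {10 * 10} possible labels")
```

Smaller alpha relative to beta shifts the labeling probabilities toward zero, which would make the resulting matrix sparser.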

confusionMatrix(anomaly_prop=0.05, true_prob=np.array([0.3, 0.9]), false_prob=np.array([0.0, 0.5]))

true_prob is the parameter for the diagonal elements of the confusion matrix. false_prob is the parameter for the off-diagonal elements of the confusion matrix.

CM = g.confusionMatrix()

generate_item_label(prob=[0.5,0.5])

prob is the ground-truth label probability vector, which represents the distribution of labels. The length of prob should equal the number of categories.

truth = g.generate_item_label()

generate_worker_label(CM, truth)

CM is short for confusion matrix; truth is the vector of ground-truth labels.

label = g.generate_worker_label(CM, truth)
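The two sampling steps above can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the function names sample_truth and sample_worker_labels are hypothetical, categories are encoded 0..k-1, every worker labels every item, and confusion[w][c][l] is taken to mean the probability that worker w reports label l when the true category is c.

```python
import random


def sample_truth(num_item, prob, rng):
    """Draw one true category per item from the prior vector `prob`."""
    categories = list(range(len(prob)))
    return rng.choices(categories, weights=prob, k=num_item)


def sample_worker_labels(confusion, truth, rng):
    """Return labels[w][i], the category worker w reports for item i.

    confusion[w][c][l] = P(worker w reports l | true category is c),
    so each row confusion[w][c] must sum to 1.
    """
    labels = []
    for cm in confusion:
        k = len(cm)
        # For each item, sample a reported label from the row of the true category.
        labels.append([rng.choices(range(k), weights=cm[c], k=1)[0] for c in truth])
    return labels


rng = random.Random(0)
truth_demo = sample_truth(num_item=5, prob=[0.5, 0.5], rng=rng)
# Two hypothetical workers: one mostly reliable, one guessing at random.
confusion_demo = [
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.5, 0.5], [0.5, 0.5]],
]
labels_demo = sample_worker_labels(confusion_demo, truth_demo, rng)
print(truth_demo)
print(labels_demo)
```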

Save data to txt

np.savetxt('synthetic_data/1_truth.txt', truth, delimiter=' ', fmt='%d')
np.savetxt('synthetic_data/1_crowd.txt', label, delimiter=' ', fmt='%d')

Load and transform data

df = np.loadtxt('data/rte_crowd.txt')
df = transform_data(df)

truth = np.loadtxt('data/rte_truth.txt')

Use spectral method to initialize starting points

get_confusion_matrix(k, labels, groups=None, sym=True, cutoff=1e-7, L=50, N=10, seed=None)

k is the number of categories. labels is the worker-by-item label matrix.

init_mu = get_confusion_matrix(k=2, labels=df)

Expectation-maximization

To instantiate the class, pass init_mu and labels, which are the initialization points from the spectral method and the worker-labeled data, respectively. The spectralEM.run() method supports two stopping strategies: converge and max_iter.

EM_optimizor = spectralEM(init_mu=init_mu, labels=df)
logLik = EM_optimizor.run(strategy='converge', delta=1e-2)
print(logLik)
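For readers who want to see what the EM stage and its two stopping strategies might look like, here is a minimal Dawid-Skene style sketch in pure Python. It is not the spectralEM class: the function run_em, its max_iter parameter, and the data layout are assumptions made for illustration. It assumes a uniform prior over categories, labels encoded 0..k-1 with None for missing entries, and init_mu[w][c][l] as the initial probability that worker w reports l when the true category is c.

```python
import math


def run_em(init_mu, labels, k, strategy="converge", delta=1e-2, max_iter=50):
    """Return (q, mu, log_lik); q[i][c] is the posterior over item i's category.

    strategy="converge": stop once the log-likelihood changes by less than delta
    (max_iter still acts as a safety cap). strategy="max_iter": run max_iter rounds.
    """
    num_item = len(labels[0])
    mu = [[row[:] for row in cm] for cm in init_mu]
    eps = 1e-10  # smoothing so that log(0) never occurs
    prev = -math.inf
    q, log_lik = [], -math.inf
    for _ in range(max_iter):
        # E-step: posterior over each item's category given the current mu.
        q, log_lik = [], 0.0
        for i in range(num_item):
            logp = [0.0] * k
            for w, row in enumerate(labels):
                if row[i] is not None:
                    for c in range(k):
                        logp[c] += math.log(mu[w][c][row[i]] + eps)
            m = max(logp)
            weights = [math.exp(v - m) for v in logp]
            z = sum(weights)
            q.append([v / z for v in weights])
            log_lik += m + math.log(z) - math.log(k)  # uniform prior 1/k
        # M-step: re-estimate each worker's confusion matrix from q.
        for w, row in enumerate(labels):
            for c in range(k):
                counts = [eps] * k
                for i, reported in enumerate(row):
                    if reported is not None:
                        counts[reported] += q[i][c]
                total = sum(counts)
                mu[w][c] = [v / total for v in counts]
        if strategy == "converge" and abs(log_lik - prev) < delta:
            break
        prev = log_lik
    return q, mu, log_lik


# Toy data: 3 workers, 6 items, 2 categories; None marks a skipped item.
labels_demo = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, None, 0, 1, 0, 1],
]
init_demo = [[[0.7, 0.3], [0.3, 0.7]] for _ in labels_demo]
q_demo, mu_demo, ll_demo = run_em(init_demo, labels_demo, k=2)
predicted_demo = [max(range(2), key=row.__getitem__) for row in q_demo]
print(predicted_demo, ll_demo)
```

The diagonally dominant init_demo plays the role that the spectral initialization plays in the pipeline above: EM only finds a local optimum, so the starting point determines which labeling it converges to.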

Prediction and error rate

error = errorRate(EM_optimizor.output_q(), truth)
print(error)
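As a rough illustration of the error-rate step, the sketch below takes a posterior matrix, predicts the most probable category for each item, and reports the fraction of mismatches. The function error_rate is hypothetical and is not the errorRate helper from utils; it assumes q[i][c] is the posterior probability of category c for item i and that truth uses the same 0-based category encoding.

```python
def error_rate(q, truth):
    """Fraction of items whose most probable category differs from the truth."""
    if len(q) != len(truth):
        raise ValueError("q and truth must cover the same items")
    # Predict the category with the largest posterior probability for each item.
    predicted = [max(range(len(row)), key=row.__getitem__) for row in q]
    wrong = sum(p != t for p, t in zip(predicted, truth))
    return wrong / len(truth)


q_demo = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
truth_demo = [0, 1, 1, 1]
print(error_rate(q_demo, truth_demo))  # 0.25: the third item is misclassified
```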
