
EM algorithm in GMM fails for one-dimensional datasets using 0.16.1 (but fine with 0.15.2) #4717

Closed
rebeccaroisin opened this issue May 13, 2015 · 4 comments

@rebeccaroisin

Fitting a one-dimensional Gaussian distribution using GMM.fit() raises a RuntimeError with scikit-learn version 0.16.1, but produces appropriate parameters with 0.15.2.

A short example to demonstrate the problem:

import sklearn
from sklearn import mixture
import numpy as np
from scipy import stats
import sys

# the version info 
print("Python version: %s.%s" %(sys.version_info.major, sys.version_info.minor))
print("scikit-learn version: %s" %(sklearn.__version__))

# some pretend data
np.random.seed(seed=0)
data = stats.norm.rvs(loc=100, scale=1, size=1000)
print("Data mean = %s, Data std dev = %s" %(np.mean(data), np.std(data)))

# Fitting using a GMM with a single component
clf = mixture.GMM(n_components=1)
clf.fit(data)
print(clf.means_, clf.weights_, clf.covars_)

Running this example code with scikit-learn 0.15.2 produces correct output:

Python version: 3.4
scikit-learn version: 0.15.2
Data mean = 99.9547432925, Data std dev = 0.987033158669
[[ 99.95474329]] [ 1.] [[ 0.97523446]]

However, exactly the same code using scikit-learn 0.16.1 gives this traceback:

Python version: 3.4
scikit-learn version: 0.16.1
Data mean = 99.9547432925, Data std dev = 0.987033158669
/home/rebecca/anaconda/envs/new_sklearn/lib/python3.4/site-packages/numpy/lib/function_base.py:1890: RuntimeWarning: Degrees of freedom <= 0 for slice
  warnings.warn("Degrees of freedom <= 0 for slice", RuntimeWarning)
/home/rebecca/anaconda/envs/new_sklearn/lib/python3.4/site-packages/numpy/lib/function_base.py:1901: RuntimeWarning: invalid value encountered in true_divide
  return (dot(X, X.T.conj()) / fact).squeeze()
Traceback (most recent call last):
  File "test_sklearn.py", line 18, in <module>
    clf.fit(data)
  File "/home/rebecca/anaconda/envs/new_sklearn/lib/python3.4/site-packages/sklearn/mixture/gmm.py", line 498, in fit
    "(or increasing n_init) or check for degenerate data.")
RuntimeError: EM algorithm was never able to compute a valid likelihood given initial parameters. Try different init parameters (or increasing n_init) or check for degenerate data.

I've tried various values of the n_init, n_iter and covariance_type parameters, and a range of different datasets. All of these result in this or a similar error with 0.16.1, but there are no issues at all with 0.15.2. The problem seems to be related to the initial parameters used in the expectation maximisation, so it's possible that this is related to this issue: #4429

In case this is useful info, I was using an anaconda virtual environment with a clean install of scikit-learn, set up as follows (for version 0.16.1):

conda create -n new_sklearn python=3.4
source activate new_sklearn
conda install sklearn
@amueller amueller added the Bug label May 13, 2015
@amueller
Member

This is likely an issue with the data shape.
Is your X 1-dimensional or 2-dimensional?
There might be an unintentional behavior change between 0.15 and 0.16, but I think we decided that, going forward, we will not support 1-dimensional input, so your input shape should be X.shape = (n_samples, 1).
You can do

X = X.reshape(-1, 1)

Otherwise it is ambiguous whether you mean one sample with many features or many samples with one feature.
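To make the shape requirement concrete, here is a minimal numpy-only sketch (variable names are illustrative, not from the original report) of the difference between 1-D input and the column shape the estimator expects:

```python
import numpy as np

# 1-D data: shape (n_samples,) -- ambiguous to 2-D estimators
np.random.seed(0)
x = np.random.normal(loc=100, scale=1, size=1000)
print(x.shape)   # (1000,)

# Reshape to a column: shape (n_samples, 1), i.e. one feature per sample
X = x.reshape(-1, 1)
print(X.shape)   # (1000, 1)
```

The `-1` in `reshape(-1, 1)` asks numpy to infer the number of rows from the array's size, so the same call works for any sample count.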

@rebeccaroisin
Author

Yes, once I reshape the input data, it works fine. Thank you!

@ghost

ghost commented Apr 15, 2016

Hi, the above code fixes the error, but I think the behavior is different. I am running this tutorial code, but after I reshape the input data I cannot reproduce the plot.

Is there something I can change (parameters, maybe) so that I can have the same result? Thanks!

@KelseyJustis

@imadie

In the tutorial referenced, change the following lines to:

# Reshape the 1-D data to a column vector before fitting
clf = GMM(4, n_iter=500, random_state=3)
x = x.reshape(-1, 1)
clf = clf.fit(x)

# Evaluate the fitted density on a column-shaped grid as well
xpdf = np.linspace(-10, 20, 1000).reshape(-1, 1)
density = np.exp(clf.score(xpdf))
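For context on what `density` holds: in this version of scikit-learn, `GMM.score` returned the per-sample log-likelihoods (unlike the newer `GaussianMixture.score`, which averages them), so `np.exp` recovers the mixture density at each grid point. A numpy-only sketch of the single-component case, using made-up parameters rather than fitted ones:

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian -- what exp(log-likelihood) gives per component."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

xs = np.linspace(-10, 20, 1000)
density = gaussian_pdf(xs, mean=5.0, var=2.0)  # hypothetical parameters
```

Since this grid comfortably covers the distribution's support, the density should integrate to approximately 1, which is a quick sanity check that the exponentiated scores are on the right scale.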
