
EM algorithm in GMM fails for one-dimensional datasets using 0.16.1 (but fine with 0.15.2) #4717

Closed
rebeccaroisin opened this issue May 13, 2015 · 4 comments

@rebeccaroisin

Fitting a one-dimensional Gaussian distribution using GMM.fit() raises a RuntimeError with scikit-learn version 0.16.1, but produces appropriate parameters with 0.15.2.

A short example to demonstrate the problem:

import sklearn
from sklearn import mixture
import numpy as np
from scipy import stats
import sys

# the version info 
print("Python version: %s.%s" %(sys.version_info.major, sys.version_info.minor))
print("scikit-learn version: %s" %(sklearn.__version__))

# some pretend data
np.random.seed(seed=0)
data = stats.norm.rvs(loc=100, scale=1, size=1000)
print("Data mean = %s, Data std dev = %s" %(np.mean(data), np.std(data)))

# Fitting using a GMM with a single component
clf = mixture.GMM(n_components=1)
clf.fit(data)
print(clf.means_, clf.weights_, clf.covars_)

Running this example code with scikit-learn 0.15.2 produces correct output:

Python version: 3.4
scikit-learn version: 0.15.2
Data mean = 99.9547432925, Data std dev = 0.987033158669
[[ 99.95474329]] [ 1.] [[ 0.97523446]]

However, exactly the same code using scikit-learn 0.16.1 gives this traceback:

Python version: 3.4
scikit-learn version: 0.16.1
Data mean = 99.9547432925, Data std dev = 0.987033158669
/home/rebecca/anaconda/envs/new_sklearn/lib/python3.4/site-packages/numpy/lib/function_base.py:1890: RuntimeWarning: Degrees of freedom <= 0 for slice
  warnings.warn("Degrees of freedom <= 0 for slice", RuntimeWarning)
/home/rebecca/anaconda/envs/new_sklearn/lib/python3.4/site-packages/numpy/lib/function_base.py:1901: RuntimeWarning: invalid value encountered in true_divide
  return (dot(X, X.T.conj()) / fact).squeeze()
Traceback (most recent call last):
  File "test_sklearn.py", line 18, in <module>
    clf.fit(data)
  File "/home/rebecca/anaconda/envs/new_sklearn/lib/python3.4/site-packages/sklearn/mixture/gmm.py", line 498, in fit
    "(or increasing n_init) or check for degenerate data.")
RuntimeError: EM algorithm was never able to compute a valid likelihood given initial parameters. Try different init parameters (or increasing n_init) or check for degenerate data.

I've tried various values of the n_init, n_iter and covariance_type parameters, and a range of different datasets. All of these result in this or a similar error with 0.16.1, but there are no issues at all with 0.15.2. The problem seems to be related to the initial parameters used in the expectation maximisation, so it's possible that this is related to this issue: #4429

In case this is useful info, I was using an anaconda virtual environment with a clean install of scikit-learn, set up as follows (for version 0.16.1):

conda create -n new_sklearn python=3.4
source activate new_sklearn
conda install sklearn
@amueller amueller added the Bug label May 13, 2015
@amueller
Member

This is likely an issue with the data shape.
Is your X 1-dimensional or 2-dimensional?
There might be an unintentional behavior change between 0.15 and 0.16, but I think we decided that, going forward, we will not support 1-dimensional input, so your input shape should be X.shape = (n_samples, 1).
You can do

X = X.reshape(-1, 1)

Otherwise it is ambiguous whether you mean one sample with many features or many samples with one feature.
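To make the shape requirement concrete, here is a minimal numpy-only sketch (variable names are illustrative, not from the original report) of the difference between 1-D input and the column shape the estimator expects:

```python
import numpy as np

# 1-D data: shape (n_samples,) -- ambiguous to 2-D estimators
np.random.seed(0)
x = np.random.normal(loc=100, scale=1, size=1000)
print(x.shape)   # (1000,)

# Reshape to a column: shape (n_samples, 1), i.e. one feature per sample
X = x.reshape(-1, 1)
print(X.shape)   # (1000, 1)
```

The `-1` in `reshape(-1, 1)` asks numpy to infer the number of rows from the array's size, so the same call works for any sample count.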

@rebeccaroisin
Author

Yes, once I reshape the input data, it works fine. Thank you!

@ghost

ghost commented Apr 15, 2016

Hi, the above code fixes the error, but I think the behavior is different. I am running this tutorial code, but after I reshape the input data I cannot reproduce the plot.

Is there something I can change (parameters, maybe) so that I can have the same result? Thanks!

@KelseyJustis

@imadie

In the tutorial referenced, change the following lines to:

# Reshape the 1-D data to a column vector before fitting
clf = GMM(4, n_iter=500, random_state=3)
x = x.reshape(-1, 1)
clf = clf.fit(x)

# Evaluate the fitted density on a column-shaped grid as well
xpdf = np.linspace(-10, 20, 1000).reshape(-1, 1)
density = np.exp(clf.score(xpdf))
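For context on what `density` holds: in this version of scikit-learn, `GMM.score` returned the per-sample log-likelihoods (unlike the newer `GaussianMixture.score`, which averages them), so `np.exp` recovers the mixture density at each grid point. A numpy-only sketch of the single-component case, using made-up parameters rather than fitted ones:

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian -- what exp(log-likelihood) gives per component."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

xs = np.linspace(-10, 20, 1000)
density = gaussian_pdf(xs, mean=5.0, var=2.0)  # hypothetical parameters
```

Since this grid comfortably covers the distribution's support, the density should integrate to approximately 1, which is a quick sanity check that the exponentiated scores are on the right scale.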
