qcut can fail for highly discontinuous data distributions #15069

@wesm

Code Sample, a copy-pastable example if possible

This code raises a ValueError for any K:

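import pandas as pd

K = 100

# K zeros followed by K + 1 ones: the median and the maximum are both 1,
# so the quantile bin edges for q=2 are not unique.
pd.qcut([0] * K + [1] * (K + 1), 2)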

Problem description

With pandas 0.19.2, this raises:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-782385490865> in <module>()
----> 1 pd.qcut([0] * K + [1] * (K + 1), 2)

pandas/tools/tile.py in qcut(x, q, labels, retbins, precision)
    173     bins = algos.quantile(x, quantiles)
    174     return _bins_to_cuts(x, bins, labels=labels, retbins=retbins,
--> 175                          precision=precision, include_lowest=True)
    176 
    177 

pandas/tools/tile.py in _bins_to_cuts(x, bins, right, labels, retbins, precision, name, include_lowest)
    192 
    193     if len(algos.unique(bins)) < len(bins):
--> 194         raise ValueError('Bin edges must be unique: %s' % repr(bins))
    195 
    196     if include_lowest:

ValueError: Bin edges must be unique: array([0, 1, 1])
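The duplicate edge can be reproduced outside of qcut. The following is a minimal sketch, assuming that `np.quantile` with its default linear interpolation is a fair stand-in for the internal `algos.quantile` call shown in the traceback (it yields the same edges here, but it is not the code path pandas uses):

```python
import numpy as np

K = 100
values = np.array([0] * K + [1] * (K + 1))

# Quantile edges for q=2: the minimum, the median, and the maximum.
edges = np.quantile(values, [0.0, 0.5, 1.0])

# With K zeros and K + 1 ones, the median position falls among the ones,
# so the last two edges coincide.
has_duplicates = len(np.unique(edges)) < len(edges)
print(edges, has_duplicates)
```

Because the ones outnumber the zeros by a single element, the 50% quantile and the 100% quantile are both 1, which is what trips the uniqueness check in `_bins_to_cuts`.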

Expected Output

We need some kind of option to decide how to assign values to a quantile bucket in the event that two quantiles have the same value prior to the searchsorted call. In this case, the appropriate behavior may be to assign all 1 values to the 50% quantile bucket.
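For illustration, here are two ways to get a result from this input. Both are sketches rather than the fix proposed in this issue: the first assumes a pandas release that has the `duplicates` argument on `qcut` (added in 0.20.0, so not available in the 0.19.2 used above), and the second is a common rank-based workaround that changes the semantics by breaking ties in order of appearance.

```python
import pandas as pd

K = 100
values = [0] * K + [1] * (K + 1)

# Option 1 (assumes pandas >= 0.20.0): drop the repeated edge.
# Only the edges [0, 1] remain, so every value lands in a single bin.
dropped = pd.qcut(values, 2, duplicates="drop")

# Option 2: rank first so every observation gets a distinct position,
# then cut the ranks. Ties are broken by order of appearance, so some
# of the 1 values end up in the lower bucket.
ranks = pd.Series(values).rank(method="first")
by_rank = pd.qcut(ranks, 2, labels=False)
```

Neither option assigns values exactly as described above: dropping the edge collapses the result to one bucket, while ranking splits identical values across buckets to keep the bucket sizes balanced.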
