Utilities :: Stratified Sampling #465

Aylr · 2018-01-29T23:50:36Z

I wrote this snippet during some client work, and it has been very helpful whenever I need a stratified sample of a dataframe. It wraps scikit-learn's train_test_split() and appears to be solid.

TODO

Tests (should be fairly simple, since train_test_split is covered)
Create a utilities module

Code

import matplotlib.pyplot as plt  # needed for the plots in verbose mode
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_sample(df, stratified_column, test_size=0.1, verbose=False):
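    """Split df into train and test dataframes, stratified on stratified_column.

    Returns (train_df, test_df); test_df holds roughly test_size of the rows.
    """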
    def _glue(y_column, x_column_names, x, y):
        temp_df = pd.DataFrame(x)
        temp_df.columns = x_column_names
        temp_df[y_column] = y
        return temp_df

    x_df = df.drop(stratified_column, axis='columns')
    y_df = df[stratified_column]
    # .values replaces as_matrix(), which was deprecated and later removed from pandas.
    x = x_df.values
    y = y_df.values
    x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size=test_size)
    x_columns = x_df.columns

    train_df = _glue(stratified_column, x_columns, x_train, y_train)
    test_df = _glue(stratified_column, x_columns, x_test, y_test)

    if verbose:
        print('Original:\n', df[stratified_column].value_counts(), '\n')
        print('Sampled down to ({}) records:\n'.format(len(test_df)), test_df[stratified_column].value_counts(), '\n')
        
        # Plot the stratified column itself rather than a hard-coded column name.
        df[stratified_column].value_counts().plot.barh(title='Original Dataset')
        plt.show()
        test_df[stratified_column].value_counts().plot.barh(title='Sampled Dataset')
        plt.show()

    return train_df, test_df
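
For comparison, here is a minimal sketch of a shorter variant. It relies on train_test_split accepting a dataframe directly, so the array round trip and the _glue helper are not needed. The name stratified_split, the random_state parameter and the 'label' column are assumptions made for illustration; they are not part of the snippet above. One difference in behaviour: this version keeps the original index and column dtypes, whereas rebuilding the dataframe from arrays resets the index and can coerce mixed-type columns to object.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(df, stratified_column, test_size=0.1, random_state=None):
    """Hypothetical variant: split the dataframe itself, stratified on one column."""
    # train_test_split can index a dataframe directly, so the rows come back
    # as dataframes with their original index and dtypes intact.
    train_df, test_df = train_test_split(
        df,
        test_size=test_size,
        stratify=df[stratified_column],
        random_state=random_state,
    )
    return train_df, test_df


# Illustrative data: 32 rows of class 'a' and 8 rows of class 'b'.
df = pd.DataFrame({
    'id': list(range(40)),
    'label': ['a'] * 32 + ['b'] * 8,
})
train_df, test_df = stratified_split(df, 'label', test_size=0.25, random_state=0)
print(test_df['label'].value_counts())
```

Passing random_state makes the split reproducible, which may also help with the Tests item in the TODO.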

Usage

df = pd.DataFrame({
    'id': list(range(40)),
    'other': [-x for x in range(40)],
    'foo': [1, 1, 1, 1, 2, 1, 1, 1, 2, 1] * 4
    })
train, holdout = stratified_sample(df, stratified_column='foo', test_size=0.1)
print(len(holdout))
holdout.foo.hist()
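
The usage example also suggests a shape for the Tests item in the TODO. The sketch below is one possible check, not an agreed design: check_stratified_split, its tolerance parameter and the stand_in function are hypothetical names. The helper takes any function with the (df, stratified_column, test_size) calling convention, so stratified_sample itself could be passed in place of the stand-in.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def check_stratified_split(split_fn, tolerance=0.05):
    """Assert that split_fn keeps class proportions close to the original."""
    df = pd.DataFrame({
        'id': list(range(40)),
        'foo': [1, 1, 1, 1, 2, 1, 1, 1, 2, 1] * 4,
    })
    train, holdout = split_fn(df, 'foo', 0.25)

    # Every row should land in exactly one of the two frames.
    assert len(train) + len(holdout) == len(df)
    assert sorted(pd.concat([train, holdout])['id'].tolist()) == list(range(40))

    # Class shares in the holdout should match the original within tolerance.
    original = df['foo'].value_counts(normalize=True)
    sampled = holdout['foo'].value_counts(normalize=True)
    for label, share in original.items():
        assert abs(sampled.get(label, 0.0) - share) <= tolerance


def stand_in(df, stratified_column, test_size):
    # Hypothetical stand-in for the function under test.
    return train_test_split(df, test_size=test_size, stratify=df[stratified_column])


check_stratified_split(stand_in)
```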


Aylr · 2018-01-30T20:35:20Z

Example verbose output:

Aylr added this to the Sprint 40 milestone Jan 29, 2018

Aylr self-assigned this Jan 29, 2018

Aylr added enhancement utilities labels Jan 29, 2018

Aylr removed this from the Sprint 40 milestone Feb 5, 2018

Aylr removed their assignment Mar 6, 2020
