Utilities :: Stratified Sampling #465

Open
2 tasks
Aylr opened this issue Jan 29, 2018 · 1 comment

Aylr commented Jan 29, 2018

I wrote this snippet during some client work, and it has been very helpful whenever I need a stratified sample of a dataframe. It wraps scikit-learn's train_test_split() with the stratify argument and appears to be solid.

TODO

  • Tests (should be fairly simple, since train_test_split is already covered)
  • Create a utilities module
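For the Tests item, one possible pytest-style sketch is below. It assumes a condensed stand-in for the function in this issue (using `to_numpy()` rather than `as_matrix()`), and it adds a `random_state` parameter that the original does not have, purely so the split is repeatable. The `make_df` helper and both test names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_sample(df, stratified_column, test_size=0.1, random_state=None):
    # Condensed stand-in for the function proposed in this issue.
    # random_state is an assumed extra parameter, added only for repeatability.
    x_df = df.drop(stratified_column, axis='columns')
    y = df[stratified_column].to_numpy()
    x_train, x_test, y_train, y_test = train_test_split(
        x_df.to_numpy(), y, stratify=y,
        test_size=test_size, random_state=random_state)

    def _glue(x, y_part):
        # Rebuild a dataframe from the arrays and re-attach the label column
        out = pd.DataFrame(x, columns=x_df.columns)
        out[stratified_column] = y_part
        return out

    return _glue(x_train, y_train), _glue(x_test, y_test)


def make_df():
    # 40 rows: 32 rows with foo == 1 and 8 rows with foo == 2 (an 80/20 mix)
    return pd.DataFrame({
        'id': list(range(40)),
        'other': [-x for x in range(40)],
        'foo': [1, 1, 1, 1, 2, 1, 1, 1, 2, 1] * 4,
    })


def test_split_sizes():
    train, holdout = stratified_sample(make_df(), 'foo', test_size=0.1, random_state=0)
    assert len(holdout) == 4
    assert len(train) == 36
    assert list(holdout.columns) == ['id', 'other', 'foo']


def test_class_proportions_preserved():
    # With test_size=0.25 the 80/20 mix divides evenly: 8 ones and 2 twos
    _, holdout = stratified_sample(make_df(), 'foo', test_size=0.25, random_state=0)
    assert holdout['foo'].value_counts().to_dict() == {1: 8, 2: 2}
```

Checking sizes and class counts rather than specific rows keeps the tests independent of how scikit-learn shuffles internally.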

Code

import matplotlib.pyplot as plt  # used by the verbose plots below
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_sample(df, stratified_column, test_size=0.1, verbose=False):
    """Build a stratified sampled dataframe."""
    def _glue(y_column, x_column_names, x, y):
        temp_df = pd.DataFrame(x)
        temp_df.columns = x_column_names
        temp_df[y_column] = y
        return temp_df

    x_df = df.drop(stratified_column, axis='columns')
    y_df = df[stratified_column]
    # as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
    x = x_df.to_numpy()
    y = y_df.to_numpy()
    x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size=test_size)
    x_columns = x_df.columns

    train_df = _glue(stratified_column, x_columns, x_train, y_train)
    test_df = _glue(stratified_column, x_columns, x_test, y_test)

    if verbose:
        print('Original:\n', df[stratified_column].value_counts(), '\n')
        print('Sampled down to ({}) records:\n'.format(len(test_df)), test_df[stratified_column].value_counts(), '\n')
        
        # Plot the stratified column itself rather than a hard-coded column name
        df[stratified_column].value_counts().plot.barh(title='Original Dataset')
        plt.show()
        test_df[stratified_column].value_counts().plot.barh(title='Sampled Dataset')
        plt.show()

    return train_df, test_df

Usage

df = pd.DataFrame({
    'id': list(range(40)),
    'other': [-x for x in range(40)],
    'foo': [1, 1, 1, 1, 2, 1, 1, 1, 2, 1] * 4
    })
train, holdout = stratified_sample(df, stratified_column='foo', test_size=0.1)
print(len(holdout))
holdout.foo.hist()
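One caveat with the function above: it goes through NumPy arrays, so the returned frames get a fresh 0..n-1 index, the stratified column moves to the end, and a frame with mixed column types comes back with object-dtype columns. A possible alternative, sketched here as an assumption rather than as the approach proposed in this issue, passes the dataframe straight to train_test_split(), which accepts dataframes and returns dataframes. The name `stratified_sample_df`, the `random_state` parameter, and the `label` column are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_sample_df(df, stratified_column, test_size=0.1, random_state=None):
    """Hypothetical variant: stratified split that never leaves pandas."""
    # Passing the dataframe directly keeps the original index, dtypes and column order
    train_df, test_df = train_test_split(
        df,
        stratify=df[stratified_column],
        test_size=test_size,
        random_state=random_state,
    )
    return train_df, test_df


df = pd.DataFrame({
    'id': list(range(40)),
    'label': ['a', 'b'] * 20,  # a string column next to the numeric ones
    'foo': [1, 1, 1, 1, 2, 1, 1, 1, 2, 1] * 4,
})
train, holdout = stratified_sample_df(df, 'foo', test_size=0.1, random_state=0)

print(len(train), len(holdout))
print(holdout.dtypes)           # numeric columns keep their integer dtype
print(holdout.index.tolist())   # row labels from the original frame
```

Keeping the original index makes it possible to trace holdout rows back to the source frame, for example with `df.loc[holdout.index]`.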
@Aylr Aylr added this to the Sprint 40 milestone Jan 29, 2018
@Aylr Aylr self-assigned this Jan 29, 2018

Aylr commented Jan 30, 2018

Example verbose output:

[Screenshot: example verbose output, captured 2018-01-30 at 1:34:37 pm]

@Aylr Aylr removed this from the Sprint 40 milestone Feb 5, 2018
@Aylr Aylr removed their assignment Mar 6, 2020