I created this code snippet while doing some client work, and it has been very helpful whenever I need a stratified sample of a dataframe. It uses scikit-learn's train_test_split() and appears to be solid.
TODO
Tests (should be fairly simple since train_test_split is already covered)
Create a utilities module
Code
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pandas as pd


def stratified_sample(df, stratified_column, test_size=0.1, verbose=False):
    """Build a stratified sampled dataframe."""

    def _glue(y_column, x_column_names, x, y):
        # Rebuild a dataframe from the feature array and the label array.
        temp_df = pd.DataFrame(x)
        temp_df.columns = x_column_names
        temp_df[y_column] = y
        return temp_df

    x_df = df.drop(stratified_column, axis='columns')
    y_df = df[stratified_column]
    # as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
    x = x_df.to_numpy()
    y = y_df.to_numpy()

    # stratify=y keeps the class proportions of the stratified column in both splits.
    x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size=test_size)

    x_columns = x_df.columns
    train_df = _glue(stratified_column, x_columns, x_train, y_train)
    test_df = _glue(stratified_column, x_columns, x_test, y_test)

    if verbose:
        print('Original:\n', df[stratified_column].value_counts(), '\n')
        print('Sampled down to ({}) records:\n'.format(len(test_df)), test_df[stratified_column].value_counts(), '\n')
        # Plot the column that was passed in rather than a hard-coded one.
        df[stratified_column].value_counts().plot.barh(title='Original Dataset')
        plt.show()
        test_df[stratified_column].value_counts().plot.barh(title='Sampled Dataset')
        plt.show()

    return train_df, test_df
Usage
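The issue leaves this section empty, so here is a minimal sketch of how the function might be called. The data is made up for illustration, and only the `final_state` column name comes from the snippet above. To keep the example self-contained it uses a condensed stand-in for `stratified_sample` that passes the dataframe straight to `train_test_split`. The `random_state` parameter is an assumption added here for reproducibility and is not part of the original signature.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_sample(df, stratified_column, test_size=0.1, random_state=None):
    """Condensed stand-in for the function above (no verbose output).

    random_state is a hypothetical extra parameter, not in the original.
    """
    # train_test_split accepts dataframes directly and keeps their columns.
    train_df, test_df = train_test_split(
        df,
        stratify=df[stratified_column],
        test_size=test_size,
        random_state=random_state,
    )
    return train_df, test_df


# Hypothetical data: 100 records, 80 'successful' and 20 'failed'.
df = pd.DataFrame({
    'goal': range(100),
    'final_state': ['successful'] * 80 + ['failed'] * 20,
})

train_df, test_df = stratified_sample(df, 'final_state', test_size=0.1, random_state=0)

# The sampled frame should keep roughly the same class proportions as the original.
print(len(train_df), len(test_df))
print(test_df['final_state'].value_counts(normalize=True))
```

One difference from the original snippet: this stand-in keeps the original row index and column dtypes, whereas rebuilding the frames from arrays resets the index and can turn mixed-type columns into `object` columns.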