Imbalanced classes are not always represented in the training set #437

Aylr · 2017-11-10T20:50:43Z

Background

The train/test split is random, so an underrepresented class can be left out of the training set entirely and therefore never modeled!
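To put a rough number on this, here is a minimal sketch that computes how often a plain random split leaves a rare class out of one side. It assumes an 80/20 split of 1000 rows (the confusion matrices below each sum to 200) and 6 rare rows, as in the repro. The helper name `prob_class_absent` is made up for illustration and is not part of healthcareai.

```python
from math import comb


def prob_class_absent(n_rows, n_rare, n_sampled):
    """Probability that a random sample of n_sampled rows, drawn without
    replacement from n_rows, contains none of the n_rare rows."""
    return comb(n_rows - n_rare, n_sampled) / comb(n_rows, n_sampled)


# Assumed setup: 1000 rows, 6 of them in the rare class, 80/20 train/test split.
n_rows, n_rare = 1000, 6
n_test = 200
n_train = n_rows - n_test

p_absent_from_test = prob_class_absent(n_rows, n_rare, n_test)
p_absent_from_train = prob_class_absent(n_rows, n_rare, n_train)

print(f"P(rare class missing from test set):  {p_absent_from_test:.4f}")
print(f"P(rare class missing from train set): {p_absent_from_train:.6f}")
```

Under these assumptions the rare class is far more likely to be missing from the smaller test partition than from the training partition, but either case leaves the confusion matrix with fewer classes than the data actually contains.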

STR

You may need to run this a few times to see a confusion matrix with only 2 classes represented.

import healthcareai


def main():
    """Reproduce the issue: train classifiers on a target with a rare third class."""
    # Load the included diabetes sample data
    dataframe = healthcareai.load_diabetes()

    # Rename the two original labels, then relabel a handful of rows to
    # create a badly underrepresented third class.
    dataframe['ThirtyDayReadmitFLG'] = dataframe['ThirtyDayReadmitFLG'].replace(
        {'Y': 'SnoCones', 'N': 'Waffles'})
    # .loc slicing is label-based and inclusive, so this relabels 6 rows (0 through 5)
    dataframe.loc[0:5, 'ThirtyDayReadmitFLG'] = "Omelette"
    print(dataframe['ThirtyDayReadmitFLG'].value_counts())

    dataframe.drop(['PatientID'], axis=1, inplace=True)

    classification_trainer = healthcareai.SupervisedModelTrainer(
        dataframe=dataframe,
        predicted_column='ThirtyDayReadmitFLG',
        model_type='classification',
        grain_column='PatientEncounterID',
        impute=True,
        verbose=False)

    print(dataframe.head(5))

    lr = classification_trainer.logistic_regression()
    lr.print_confusion_matrix()

    knn = classification_trainer.knn()
    knn.print_confusion_matrix()


if __name__ == "__main__":
    main()

Bad Output

Waffles     840
SnoCones    154
Omelette      6
Name: ThirtyDayReadmitFLG, dtype: int64

Note: Numeric imputation will always occur when making predictions on new data - otherwise rows would be dropped, which would lead to missing predictions.

Imputed values for numeric columns:
╒═══════════════╤═══════════════════╕
│ Column        │   Percent Imputed │
╞═══════════════╪═══════════════════╡
│ SystolicBPNBR │             0.013 │
├───────────────┼───────────────────┤
│ LDLNBR        │             0.013 │
├───────────────┼───────────────────┤
│ A1CNBR        │             0.013 │
╘═══════════════╧═══════════════════╛


   PatientEncounterID  SystolicBPNBR  LDLNBR  A1CNBR GenderFLG  \
0                   1          167.0   195.0     4.2         M   
1                   2          153.0   214.0     5.0         M   
2                   3          170.0   191.0     4.0         M   
3                   4          187.0   135.0     4.4         M   
4                   5          188.0   125.0     4.3         M   

  ThirtyDayReadmitFLG  
0            Omelette  
1            Omelette  
2            Omelette  
3            Omelette  
4            Omelette  
Training: Logistic Regression , Type: classification
LogisticRegression Training Results:
- Training time:
    LogisticRegression seconds
- Best hyperparameters found were:
    N/A: No hyperparameter search was performed
- LogisticRegression selected performance metrics:
    accuracy: 0.85
    positive_label: Waffles
    roc_auc: 0.38
    pr_auc: 0.78

Confusion Matrix (Counts)
    - Predicted Classes are along the top
    - True Classes are along the left.

            SnoCones    Waffles
--------  ----------  ---------
SnoCones           0         30
Waffles            0        170
Training: Knn , Type: classification
KNN Grid: {'n_neighbors': [5, 8, 11, 14, 17, 20, 23], 'weights': ['uniform', 'distance']}
KNeighborsClassifier Training Results:
- Training time:
    KNeighborsClassifier seconds
- Best hyperparameters found were:
    {'weights': 'distance', 'n_neighbors': 23}
- KNeighborsClassifier selected performance metrics:
    accuracy: 0.85
    positive_label: Waffles
    roc_auc: 0.23
    pr_auc: 0.74

Confusion Matrix (Counts)
    - Predicted Classes are along the top
    - True Classes are along the left.

            SnoCones    Waffles
--------  ----------  ---------
SnoCones           1         29
Waffles            0        170

Process finished with exit code 0

Good Output

Waffles     840
SnoCones    154
Omelette      6
Name: ThirtyDayReadmitFLG, dtype: int64

Note: Numeric imputation will always occur when making predictions on new data - otherwise rows would be dropped, which would lead to missing predictions.

Imputed values for numeric columns:
╒═══════════════╤═══════════════════╕
│ Column        │   Percent Imputed │
╞═══════════════╪═══════════════════╡
│ SystolicBPNBR │             0.013 │
├───────────────┼───────────────────┤
│ LDLNBR        │             0.013 │
├───────────────┼───────────────────┤
│ A1CNBR        │             0.013 │
╘═══════════════╧═══════════════════╛


   PatientEncounterID  SystolicBPNBR  LDLNBR  A1CNBR GenderFLG  \
0                   1          167.0   195.0     4.2         M   
1                   2          153.0   214.0     5.0         M   
2                   3          170.0   191.0     4.0         M   
3                   4          187.0   135.0     4.4         M   
4                   5          188.0   125.0     4.3         M   

  ThirtyDayReadmitFLG  
0            Omelette  
1            Omelette  
2            Omelette  
3            Omelette  
4            Omelette  
Training: Logistic Regression , Type: classification
LogisticRegression Training Results:
- Training time:
    LogisticRegression seconds
- Best hyperparameters found were:
    N/A: No hyperparameter search was performed
- LogisticRegression selected performance metrics:
    accuracy: 0.84

Confusion Matrix (Counts)
    - Predicted Classes are along the top
    - True Classes are along the left.

            Omelette    SnoCones    Waffles
--------  ----------  ----------  ---------
Omelette           0           0          2
SnoCones           0           0         29
Waffles            0           0        169
Training: Knn , Type: classification
KNN Grid: {'n_neighbors': [5, 8, 11, 14, 17, 20, 23], 'weights': ['uniform', 'distance']}
KNeighborsClassifier Training Results:
- Training time:
    KNeighborsClassifier seconds
- Best hyperparameters found were:
    {'weights': 'distance', 'n_neighbors': 17}
- KNeighborsClassifier selected performance metrics:
    accuracy: 0.84

Confusion Matrix (Counts)
    - Predicted Classes are along the top
    - True Classes are along the left.

            Omelette    SnoCones    Waffles
--------  ----------  ----------  ---------
Omelette           0           0          2
SnoCones           0           2         27
Waffles            0           3        166

Process finished with exit code 0


Aylr · 2017-11-14T21:04:51Z

Maybe the MVP fix is to use scikit-learn's stratified train/test split?
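A minimal sketch of what that could look like, using scikit-learn's `train_test_split` with its `stratify` argument. The labels here are synthetic, built to match the class counts from the `value_counts()` output above, and the feature matrix is a placeholder. How this would be wired into healthcareai's own split is not shown and would need to be worked out separately.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the repro (840 / 154 / 6).
y = np.array(["Waffles"] * 840 + ["SnoCones"] * 154 + ["Omelette"] * 6)
# Placeholder feature matrix; the values do not matter for the split itself.
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps class proportions roughly equal in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

train_counts = Counter(y_train.tolist())
test_counts = Counter(y_test.tolist())

print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

One caveat: scikit-learn raises a `ValueError` when stratifying if any class has fewer than two rows, so extremely rare classes would still need separate handling.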

Aylr self-assigned this Nov 10, 2017

Aylr added the bug high label Nov 10, 2017

Aylr modified the milestones: Sprint 37, Sprint 38 Dec 13, 2017

Aylr modified the milestones: Sprint 38, Sprint 39 Jan 5, 2018

Aylr removed this from the Sprint 39 milestone Jan 22, 2018

Aylr added bug med and removed bug high labels Mar 19, 2018
