Medical Cost Prediction

What medical costs should a health insurance company expect from its policyholders?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
sns.set_style("whitegrid")
pd.set_option('display.max_rows', 500)
data = pd.read_csv("insurance.csv")
data.info()
data.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520

1. Exploratory Data Analysis

6 possible predictors and 1 target variable

1.1 Metric Variables

  • charges (float, continuous) = target variable
  • children (int, discrete)
    • number of children / dependents covered by the insurance plan
  • bmi (float, continuous)
    • body mass index = weight (kg) / (height (m))^2 (quick check below)
    • ideal bmi: 18.5 - 24.9
  • age (int, continuous)
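A quick sanity check of the BMI formula with illustrative values (not taken from the data set):

weight_kg, height_m = 70, 1.75
bmi = weight_kg / height_m**2
print("BMI:", round(bmi, 1))  # 22.9 --> inside the ideal range 18.5 - 24.9
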
metric_vars = data[["age", "bmi", "children", "charges"]]
data.describe().T
count mean std min 25% 50% 75% max
age 1338.0 39.207025 14.049960 18.0000 27.00000 39.000 51.000000 64.00000
bmi 1338.0 30.663397 6.098187 15.9600 26.29625 30.400 34.693750 53.13000
children 1338.0 1.094918 1.205493 0.0000 0.00000 1.000 2.000000 5.00000
charges 1338.0 13270.422265 12110.011237 1121.8739 4740.28715 9382.033 16639.912515 63770.42801

Scatter Matrix

  • diagonal: univariate distributions
  • off-diagonal: bivariate distributions
g = sns.PairGrid(data, hue="smoker", palette=["C3", "C2"])
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)
g.add_legend()
<seaborn.axisgrid.PairGrid at 0x2207ca05400>

png

Correlation Matrix

  • absolute Pearson correlation coefficient
corr = metric_vars.corr().round(decimals = 2).abs()
sns.heatmap(corr, cmap="Blues", robust=True, annot = True)
<AxesSubplot:>

png

Charges

  • 50% of people cause medical costs between \$4740.29 and \$16639.91
  • only 10% have costs above \$34831.72
  • strong positive skew (right-skewed, long right tail; quick numerical check below)
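A quick numerical check of the skew (sketch; assumes scipy is installed):

from scipy.stats import skew
# a clearly positive skewness confirms the long right tail of the charges distribution
print("Skewness of charges:", skew(data["charges"]))
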
print("Median:\t\t", np.percentile(data["charges"], 50))
print("90%-quantile:\t", np.percentile(data["charges"], 90))
Median:		 9382.033
90%-quantile:	 34831.7197
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(x=data["charges"], ax=ax[0])
sns.boxplot(y=data["charges"], ax=ax[1])
<AxesSubplot:ylabel='charges'>

png

Children

  • the number of children doesn't seem to tell much about the medical charges
sns.catplot(x="children", y="charges", kind="box", data=data)
<seaborn.axisgrid.FacetGrid at 0x2207e9a15e0>

png

BMI

  • the bmi alone is only weakly correlated with the medical costs: $r=+0.2$
  • For smokers, however, a higher bmi clearly coincides with higher medical costs
sns.regplot(x=data["bmi"][data["smoker"] == "yes"], y=data["charges"][data["smoker"] == "yes"], color="C3", label="smoker")
sns.regplot(x=data["bmi"][data["smoker"] == "no"], y=data["charges"][data["smoker"] == "no"], color="C2", label="non-smoker")
plt.legend()
<matplotlib.legend.Legend at 0x22073a5e9d0>

png

Age

  • only people aged 18 to 64 --> no retirees captured, working-age population only

  • disproportionately many observations of young people (age 18-22)

  • older people clearly tend to produce higher medical costs

  • 3 clusters can be identified:

    • cluster 2: smokers, high medical costs
    • cluster 1: smokers & non-smokers, medium medical costs
    • cluster 0: non-smokers, low medical costs
fig, ax = plt.subplots(1, 2, figsize=(14, 4))

sns.histplot(x=data["age"], ax=ax[0])
sns.scatterplot(x="age", y="charges", hue="smoker",  data=data, palette=["C3", "C2"], ax=ax[1])
<AxesSubplot:xlabel='age', ylabel='charges'>

png

1.2 Categorical Variables

  • Smoker: "Yes" or "No" ----> binary
  • Region: "southwest", "southeast", "northwest" or "northeast"
    • the beneficiary's place of residence in the U.S., divided into 4 geographical regions
  • Sex: "Male" or "Female" ----> binary

Smoker

  • Smoking clearly coincides with much higher medical charges
sns.catplot(x="smoker", y="charges", kind="box", hue="sex",  data=data, palette=["C1", "C0"])
<seaborn.axisgrid.FacetGrid at 0x22073993130>

png

Region

  • there seem to be no significant differences in charges between the four regions
sns.catplot(x="region", y="charges", kind="box", hue="smoker", data=data, palette=["C3", "C2"])
<seaborn.axisgrid.FacetGrid at 0x2207c8727f0>

png

Sex

  • The data set is balanced in terms of gender
  • The share of smokers is about 6 percentage points higher among men (23.5%) than among women (17.4%)
sns.countplot(x="smoker", data=data, hue="sex", palette=["C1", "C0"])
males = data[data.sex == "male"].shape[0]
females = data[data.sex == "female"].shape[0]

male_smokers = data[(data.sex == "male") & (data.smoker == "yes")].shape[0]
female_smokers = data[(data.sex == "female") & (data.smoker == "yes")].shape[0]

print("Male {:.2f}% | Female {:.2f}%".format(100*males/(males+females), 100*females/(males+females)))
print("\n{} of {} ({:.2f}%) men smoke.".format(male_smokers, males, 100*male_smokers/males))
print("{} of {} ({:.2f}%) women smoke.".format(female_smokers, females, 100*female_smokers/females))
Male 50.52% | Female 49.48%

159 of 676 (23.52%) men smoke.
115 of 662 (17.37%) women smoke.

png

  • Men tend to cause slightly higher medical charges (same median, but higher mean)
sns.catplot(x="sex", y="charges", kind="box", data=data)
<seaborn.axisgrid.FacetGrid at 0x22073fc3b80>

png

2. Data Preprocessing

2.1 Clustering the data into three medical cost groups

from sklearn.cluster import KMeans
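# subtract a rough linear age trend (260 per year of age) from charges,
# so that KMeans separates observations by cost level rather than by age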
data['charges/age'] = data['charges'] - data['age']*260

X = np.array(data["charges/age"]).reshape(-1, 1)
kmeans = KMeans(n_clusters=3, init=np.array([8000, 20000, 40000]).reshape(-1, 1), n_init=1, random_state=0).fit(X)

print(kmeans.labels_)
data["age_cluster"] = kmeans.labels_
sns.scatterplot(x=data["age"], y=data["charges"], hue=data["age_cluster"])
plt.ylabel("charges")
[1 0 0 ... 0 0 1]





Text(0, 0.5, 'charges')

png

import scipy
fig, ax = plt.subplots(1, 3, figsize=(18, 4))

x = data["age"][data["age_cluster"] == 0]
y = data["charges"][data["age_cluster"] == 0]

slope, intercept, r, p, stderr = scipy.stats.linregress(x, y)
line = f'Regression line: y={intercept:.2f}+{slope:.2f}x, r={r:.2f}'
#line
ax[0].set_title("CLUSTER 0")
ax[0].plot(x, y, linewidth=0, marker='x')
ax[0].plot(x, intercept + slope * x, label=line)
ax[0].legend(facecolor='white')

x = data["age"][data["age_cluster"] == 1]
y = data["charges"][data["age_cluster"] == 1]
slope, intercept, r, p, stderr = scipy.stats.linregress(x, y)
line = f'Regression line: y={intercept:.2f}+{slope:.2f}x, r={r:.2f}'
#line
ax[1].set_title("CLUSTER 1")
ax[1].plot(x, y, linewidth=0, marker='x')
ax[1].plot(x, intercept + slope * x, label=line)
ax[1].legend(facecolor='white')

x = data["age"][data["age_cluster"] == 2]
y = data["charges"][data["age_cluster"] == 2]
slope, intercept, r, p, stderr = scipy.stats.linregress(x, y)
line = f'Regression line: y={intercept:.2f}+{slope:.2f}x, r={r:.2f}'
#line
ax[2].set_title("CLUSTER 2")
ax[2].plot(x, y, linewidth=0, marker='x')
ax[2].plot(x, intercept + slope * x, label=line)

ax[2].legend(facecolor='white')
plt.show()
# sns.regplot(x=X, y=y)

png

2.2 Dummy Variables

  • introducing dummy variables for sex, smoker and region while avoiding the dummy variable trap (one reference level per variable is dropped; see the short sketch below)
    • sex --> sex_male
    • smoker --> smoker_yes
    • region --> region_northeast, region_southeast, region_southwest
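A minimal sketch of the same idea using pandas' drop_first option (it drops the alphabetically first level of each variable, whereas the cell below drops the chosen reference levels explicitly):

# keep k-1 dummies per categorical variable to avoid perfect multicollinearity (dummy trap)
dummy_sketch = pd.get_dummies(data[["sex", "smoker", "region"]], drop_first=True)
print(dummy_sketch.columns.tolist())
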
# data = data.drop(["sex_female", "smoker_no", "region_northwest"], axis=1)
data[["age_cluster_0", "age_cluster_1", "age_cluster_2"]] = pd.get_dummies(data["age_cluster"], drop_first=False).copy()
# data = data.drop(["age_cluster"], axis=1)

data = pd.get_dummies(data)
data = data.drop(["sex_female", "smoker_no", "region_northwest"], axis=1)
data.head()
age bmi children charges charges/age age_cluster age_cluster_0 age_cluster_1 age_cluster_2 sex_male smoker_yes region_northeast region_southeast region_southwest
0 19 27.900 0 16884.92400 11944.92400 1 0 1 0 0 1 0 0 1
1 18 33.770 1 1725.55230 -2954.44770 0 1 0 0 1 0 0 1 0
2 28 33.000 3 4449.46200 -2830.53800 0 1 0 0 1 0 0 1 0
3 33 22.705 0 21984.47061 13404.47061 1 0 1 0 1 0 0 0 0
4 32 28.880 0 3866.85520 -4453.14480 0 1 0 0 1 0 0 0 0

2.3 Interaction Terms

  • bmi*smoker

  • bmi*cluster0, bmi*cluster1, bmi*cluster2

  • age*cluster0, age*cluster1, age*cluster2

sns.regplot(x=data["bmi"][data["smoker_yes"] == 1], y=data["charges"][data["smoker_yes"] == 1], color="C3", label="smoker")
sns.regplot(x=data["bmi"][data["smoker_yes"] == 0], y=data["charges"][data["smoker_yes"] == 0], color="C2", label="non-smoker")
plt.legend()
<matplotlib.legend.Legend at 0x22073c5b730>

png

# poly = PolynomialFeatures(interaction_only=True,include_bias = True)
data["smoker*bmi"] = data["smoker_yes"]*data["bmi"]
data["bmi*age0"] = data["age_cluster_0"]*data["bmi"]
data["bmi*age1"] = data["age_cluster_1"]*data["bmi"]
data["bmi*age2"] = data["age_cluster_2"]*data["bmi"]
data["age2"] = data["age"]**2

data["age0"] = data["age_cluster_0"]*data["age"]
data["age1"] = data["age_cluster_1"]*data["age"]
data["age2"] = data["age_cluster_2"]*data["age"]
data["age0*smoker_no"] = data["age_cluster_0"]*data["age"]*(1-data["smoker_yes"])
data["age2*smoker_yes"] = data["age_cluster_2"]*data["age"]*data["smoker_yes"]

data.info()
# data.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 23 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               1338 non-null   int64  
 1   bmi               1338 non-null   float64
 2   children          1338 non-null   int64  
 3   charges           1338 non-null   float64
 4   charges/age       1338 non-null   float64
 5   age_cluster       1338 non-null   int32  
 6   age_cluster_0     1338 non-null   uint8  
 7   age_cluster_1     1338 non-null   uint8  
 8   age_cluster_2     1338 non-null   uint8  
 9   sex_male          1338 non-null   uint8  
 10  smoker_yes        1338 non-null   uint8  
 11  region_northeast  1338 non-null   uint8  
 12  region_southeast  1338 non-null   uint8  
 13  region_southwest  1338 non-null   uint8  
 14  smoker*bmi        1338 non-null   float64
 15  bmi*age0          1338 non-null   float64
 16  bmi*age1          1338 non-null   float64
 17  bmi*age2          1338 non-null   float64
 18  age2              1338 non-null   int64  
 19  age0              1338 non-null   int64  
 20  age1              1338 non-null   int64  
 21  age0*smoker_no    1338 non-null   int64  
 22  age2*smoker_yes   1338 non-null   int64  
dtypes: float64(7), int32(1), int64(7), uint8(8)
memory usage: 162.1 KB
# from sklearn.preprocessing import PolynomialFeatures
# poly = PolynomialFeatures(interaction_only=False)
# data = poly.fit_transform(data)

# print(data.shape)

2.4 Split Dataset into Training and Test Set

  • the training set is split again into a (smaller) training set and a validation set during cross-validation, for model validation and hyperparameter selection
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data_train, data_test = train_test_split(data, test_size=0.3, random_state=24, stratify=data["age_cluster"]).copy()
data_train = data_train.sort_index().copy()
data_test = data_test.sort_index().copy()

from collections import Counter
print(Counter(data_train["age_cluster"]))
print(Counter(data_test["age_cluster"]))
Counter({0: 680, 1: 150, 2: 106})
Counter({0: 292, 1: 64, 2: 46})

2.5 Standardizing all Numerical Predictor Variables

scaler = StandardScaler()
data_train = data_train.copy()
data_train[["age", "bmi", "children", "smoker*bmi", "bmi*age0", "bmi*age1", "bmi*age2", "age0", "age1", "age2", "age0*smoker_no", "age2*smoker_yes"]] = scaler.fit_transform(data_train[["age", "bmi", "children", "smoker*bmi", "bmi*age0", "bmi*age1", "bmi*age2", "age0", "age1", "age2", "age0*smoker_no", "age2*smoker_yes"]])
    
# scale the test data with the scaler fitted on the training data --> no information leakage
data_test = data_test.copy()
data_test[["age", "bmi", "children", "smoker*bmi", "bmi*age0", "bmi*age1", "bmi*age2", "age0", "age1", "age2", "age0*smoker_no", "age2*smoker_yes"]] = scaler.transform(data_test[["age", "bmi", "children", "smoker*bmi", "bmi*age0", "bmi*age1", "bmi*age2", "age0", "age1", "age2", "age0*smoker_no", "age2*smoker_yes"]])

data_train.head()
age bmi children charges charges/age age_cluster age_cluster_0 age_cluster_1 age_cluster_2 sex_male ... region_southwest smoker*bmi bmi*age0 bmi*age1 bmi*age2 age2 age0 age1 age0*smoker_no age2*smoker_yes
0 -1.424944 -0.449654 -0.904391 16884.9240 11944.9240 1 0 1 0 0 ... 1 1.683996 -1.525072 2.283565 -0.353945 -0.333614 -1.345294 0.847148 -1.345294 -0.331874
1 -1.496459 0.526350 -0.103530 1725.5523 -2954.4477 0 1 0 0 1 ... 0 -0.501836 0.792094 -0.429204 -0.353945 -0.333614 -0.491924 -0.405924 -0.491924 -0.331874
4 -0.495254 -0.286710 -0.904391 3866.8552 -4453.1448 0 1 0 0 1 ... 0 -0.501836 0.456561 -0.429204 -0.353945 -0.333614 0.171809 -0.405924 0.171809 -0.331874
5 -0.566769 -0.808797 -0.904391 3756.6216 -4303.3784 0 1 0 0 0 ... 0 -0.501836 0.241106 -0.429204 -0.353945 -0.333614 0.124399 -0.405924 0.124399 -0.331874
7 -0.137681 -0.476257 1.498191 7281.5056 -2338.4944 0 1 0 0 0 ... 0 -0.501836 0.378339 -0.429204 -0.353945 -0.333614 0.408856 -0.405924 0.408856 -0.331874

5 rows Ă— 23 columns

2.6 Correlation Matrix

plt.figure(figsize=(12,12))
metric_vars = data_train[["charges", "age", "bmi", "children", "sex_male", "smoker_yes", "region_northeast", "region_southeast", "region_southwest", 
    "smoker*bmi", "age_cluster_0", "age_cluster_2", "bmi*age0", "bmi*age1", "bmi*age2", "age0", "age1", "age2", 
    "age0*smoker_no", "age2*smoker_yes"]]
corr = metric_vars.corr().round(decimals = 2).abs()
sns.heatmap(corr, cmap="Blues", robust=True, annot = True)
<AxesSubplot:>

png

2.7 Define Variables for Classification and Regression

class_X = [
    "age", "bmi", "children", "sex_male", "smoker_yes", "region_northeast", "region_southeast",
    "region_southwest", "smoker*bmi"
]
class_X = [
    "age", "bmi", "children", "sex_male", "smoker_yes"
]

reg_X = [
    "age", "bmi", "children", "sex_male", "smoker_yes", "region_northeast", "region_southeast", "region_southwest", 
    "smoker*bmi", "age_cluster_0", "age_cluster_2", "bmi*age0", "bmi*age1", "bmi*age2", "age0", "age1", "age2", 
    "age0*smoker_no", "age2*smoker_yes"
]

class_y = ["age_cluster"]
reg_y = ["charges"]

# class_X = {}

reg_X = {}
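# reg_X is reused below as a dict mapping each regression model name to its feature list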
training_results = {}
single_training_results = {}

2.8 Baseline Regression Model

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

linReg = LinearRegression()
X_train = data_train[["age", "children", "bmi", "smoker_yes", "region_northeast", "region_southeast", "region_southwest"]]
y_train = data_train[["charges"]]
linReg.fit(X_train, y_train)
print("######### Baseline Model Training #########")
print("R-squared:\t", linReg.score(X_train, y_train))
print("intercept:\t", linReg.intercept_)
print("coefficients:\t", linReg.coef_)

# Baseline Prediction
X_test = data_test[["age", "children", "bmi", "smoker_yes", "region_northeast", "region_southeast", "region_southwest"]]
y_test = data_test[["charges"]]

y_pred = linReg.predict(X_test)

# Calculate regression metrics        
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\n######### Baseline Model Test #########")
print("R-squared:\t", r2)
print("RMSE:\t", rmse)
print("MAE:\t", mae)
######### Baseline Model Training #########
R-squared:	 0.7489852211199308
intercept:	 [8272.18507222]
coefficients:	 [[ 3550.6070567    715.63126578  2052.80736344 23836.17306996
    816.96079161  -510.4643883   -270.90536007]]

######### Baseline Model Test #########
R-squared:	 0.7544318452656915
RMSE:	 5888.9098571164195
MAE:	 4127.799435577642

Model Training

Idea: Use Cluster Information to achieve more precise medical cost predictions

3. Model Training: Classification
4. Model Training: Sequential Regression Models (using Cluster Information)

Then check how models without cluster information perform

5. Model Training: Single Regression Models (no Cluster Information)

3. Model Training: Classification

3.1 Support Vector Classifier
3.2 Random Forest Classifier
3.3 Gradient Boosting Classifier
# specify training data
X_train = data_train[class_X]
y_train = np.array(data_train[class_y]).reshape(-1, 1).flatten()

# Create List to save trained classification models
classification_models = []

3.1 Support Vector Machine

Hyperparameters:

  • C: Regularization parameter where the strength of the regularization is inversely proportional to C. It controls the impact of misclassification on the training process. Intuitively it specifies how many training points are allowed to be misclassified (soft margin vs. hard margin).

  • kernel: Specifies the kernel type to be used in the algorithm (e.g. 'linear', 'poly', 'rbf' or 'sigmoid').

  • gamma: Gamma controls the smoothness and the extent of curvature of the decision boundary. A large value of gamma is prone to overfitting because the class regions become too specific; a lower gamma value often results in a more generalized model because the decision boundary gets smoother (see the kernel formula below).
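
For the RBF kernel used below, the similarity of two points is $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$: a large $\gamma$ lets each training point influence only a small neighbourhood (prone to overfitting), while a small $\gamma$ yields a smoother, more global decision boundary.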

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE

# Set up steps for pipeline
steps = [
    ("SVC", SVC(class_weight="balanced", decision_function_shape='ovo'))
        ]
pipeline = Pipeline(steps)

# Specify possible hyperparameters 
parameters = {'SVC__C': np.logspace(-1, 3, 5),
              'SVC__gamma': np.logspace(-3, 1, 5),
# #               "SVC__kernel": ["linear", "poly", "sigmoid"]
              "SVC__kernel": ["rbf"]
             }
# parameters = {'SVC__C': [1000],
#               'SVC__gamma': [0.01],
#               "SVC__kernel": ["rbf"]
#               }

# Define scoring metrics
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
scoring = {'Accuracy': make_scorer(accuracy_score), "F1": "f1_micro"}

# Instantiate GridSearchCV object
stratKFold = StratifiedKFold(n_splits=5, random_state=99, shuffle = True)
gs_cv = GridSearchCV(pipeline, parameters, refit = "Accuracy", cv=stratKFold, scoring=scoring, verbose=0)

# # Oversampling the training dataset only
# oversample = SMOTE()
# X_train, y_train = oversample.fit_resample(X_train, y_train)

# Fit to training set
gs_cv.fit(X_train, y_train)
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=99, shuffle=True),
             estimator=Pipeline(steps=[('SVC',
                                        SVC(class_weight='balanced',
                                            decision_function_shape='ovo'))]),
             param_grid={'SVC__C': array([1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]),
                         'SVC__gamma': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01]),
                         'SVC__kernel': ['rbf']},
             refit='Accuracy',
             scoring={'Accuracy': make_scorer(accuracy_score),
                      'F1': 'f1_micro'})
# print(gs_cv.cv_results_["params"])

# Accuracy Scores
accuracy_cv = gs_cv.cv_results_["split0_test_Accuracy"]+gs_cv.cv_results_["split1_test_Accuracy"]+gs_cv.cv_results_["split2_test_Accuracy"]+gs_cv.cv_results_["split3_test_Accuracy"]+gs_cv.cv_results_["split4_test_Accuracy"]
accuracy_cv = accuracy_cv/5

# F1-Scores
f1_cv = gs_cv.cv_results_["split0_test_F1"]+gs_cv.cv_results_["split1_test_F1"]+gs_cv.cv_results_["split2_test_F1"]+gs_cv.cv_results_["split3_test_F1"]+gs_cv.cv_results_["split4_test_F1"]
f1_cv = f1_cv/5
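# note: these averages are also available directly as
# gs_cv.cv_results_["mean_test_Accuracy"] and gs_cv.cv_results_["mean_test_F1"]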

svc__gamma = []
svc__C = []
for i in gs_cv.cv_results_["params"]:
    svc__gamma.append(i["SVC__gamma"])
    svc__C.append(i["SVC__C"])
    
# print(svc__C)
# Plot CV_Results
# plt.scatter(gs_cv.cv_results_["params"][np.arange(0, accuracy_cv.shape[0])]["SVC__C"], accuracy_cv)
# plt.scatter(svc__gamma, f1_cv)

# plot hyperparameters vs. accuracy in 3D
from mpl_toolkits import mplot3d

import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.axes(projection="3d")

import math
plt.title("CV-Accuracy vs. Hyperparameter Space")

z_line = accuracy_cv
x_line = np.log(svc__gamma) / np.log(10)
y_line = np.log(svc__C) / np.log(10)
# ax.plot3D(x_line, y_line, z_line, 'gray')

z_points = z_line
x_points = x_line
y_points = y_line
ax.scatter3D(x_points, y_points, z_points, c=z_points, cmap='hsv');
ax.set_xlabel("$log(\gamma)$")
ax.set_ylabel("$log(C)$")
ax.set_zlabel("Accuracy")

plt.show()

png

print("Best CV Score (Accuracy): {}".format(gs_cv.best_score_))
print("Tuned Model Parameters: {}".format(gs_cv.best_params_))

# from sklearn.model_selection import cross_val_score
# all_scores = cross_val_score(estimator=gs_cv, X=X_train, y=y_train, cv=5)
# print(all_scores)
Best CV Score (Accuracy): 0.9273466833541928
Tuned Model Parameters: {'SVC__C': 10.0, 'SVC__gamma': 0.01, 'SVC__kernel': 'rbf'}
# Save SVM with optimized hyperparameters
import pickle
classification_models.append("svm")
pickle.dump(gs_cv, open("models/svm.sav", "wb"))

3.2 Random Forest Classifier

Hyperparameters:

  • max_depth: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

  • max_features: The number of features to consider when looking for the best split:

  • min_samples_leaf: The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

  • min_samples_split: The minimum number of samples required to split an internal node.

  • n_estimators: The number of trees in the forest.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

# Specify possible hyperparameters 
# parameters = {
#     'max_depth': [10, 50, 100],
#     'max_features': ['log2', 'sqrt'],
#     'min_samples_leaf': [1, 2, 4],
#     'min_samples_split': [2, 5, 10],
#     'n_estimators': np.insert(np.arange(100, 1000, 100), 0, 10)
# }
parameters = {
    'max_depth': [10],
    'max_features': ['log2'],
    'min_samples_leaf': [2],
    'min_samples_split': [2],
    'n_estimators': [250]
}

# Define scoring metrics
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
scoring = {'Accuracy': make_scorer(accuracy_score), "F1": "f1_macro"}

# Instantiate GridSearchCV object
gs_cv = GridSearchCV(RandomForestClassifier(), parameters, scoring=scoring, refit="Accuracy", verbose=42, n_jobs=-1)

# Fit to training set
gs_cv.fit(X_train, y_train)
Fitting 5 folds for each of 1 candidates, totalling 5 fits





GridSearchCV(estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'max_depth': [10], 'max_features': ['log2'],
                         'min_samples_leaf': [2], 'min_samples_split': [2],
                         'n_estimators': [250]},
             refit='Accuracy',
             scoring={'Accuracy': make_scorer(accuracy_score),
                      'F1': 'f1_macro'},
             verbose=42)
print("Best CV Score (Accuracy): {}".format(gs_cv.best_score_))
print("Tuned Model Parameters: {}".format(gs_cv.best_params_))
print("\nFEATURE IMPORTANCES:\n")
for c in range(0, X_train.columns.shape[0]):
    print("{}: \t{:.2f}".format(X_train.columns[c], gs_cv.best_estimator_.feature_importances_[c]))

# from sklearn.model_selection import cross_val_score
# all_scores = cross_val_score(estimator=gs_cv, X=X_train, y=y_train, cv=5)
# print(all_scores)
Best CV Score (Accuracy): 0.9294743429286608
Tuned Model Parameters: {'max_depth': 10, 'max_features': 'log2', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 250}

FEATURE IMPORTANCES:

age: 	0.08
bmi: 	0.25
children: 	0.02
sex_male: 	0.01
smoker_yes: 	0.63
# Save model with optimized hyperparameters
import pickle
classification_models.append("rand_forest_class")
pickle.dump(gs_cv, open("models/rand_forest_class.sav", "wb"))

3.3 Gradient Tree Boosting Classifier

Hyperparameters:

  • learning_rate: Learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.

  • subsample: The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.

  • n_estimators: The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.

from sklearn.ensemble import GradientBoostingClassifier

# Specify possible hyperparameters 
parameters = {
#     "loss":["deviance"],
    "learning_rate": [0.0025, 0.005, 0.0075],
#     "min_samples_split": np.linspace(0.1, 0.5, 12),
#     "min_samples_leaf": np.linspace(0.1, 0.5, 12),
#     "max_depth":[3,7,9],
#     "max_features":["log2","sqrt"],
#     "criterion": ["friedman_mse",  "mae"],
    "subsample":[0.65, 0.7, 0.75],
    "n_estimators":[120, 125, 130]
    }
parameters = {
        "learning_rate": [0.005],
        "subsample": [0.75],
        "n_estimators": [120]
     }

# Define scoring metrics
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
scoring = {'Accuracy': make_scorer(accuracy_score), "F1": "f1_macro"}

# Instantiate GridSearchCV object
gs_cv = GridSearchCV(GradientBoostingClassifier(), parameters, scoring=scoring, refit="Accuracy", verbose=42, n_jobs=-1)

# Fit to training set
gs_cv.fit(X_train, y_train)
Fitting 5 folds for each of 1 candidates, totalling 5 fits





GridSearchCV(estimator=GradientBoostingClassifier(), n_jobs=-1,
             param_grid={'learning_rate': [0.005], 'n_estimators': [120],
                         'subsample': [0.75]},
             refit='Accuracy',
             scoring={'Accuracy': make_scorer(accuracy_score),
                      'F1': 'f1_macro'},
             verbose=42)
print("Best CV Score (Accuracy): {}".format(gs_cv.best_score_))
print("Tuned Model Parameters: {}".format(gs_cv.best_params_))

print("\nFEATURE IMPORTANCES:\n")
for c in range(0, X_train.columns.shape[0]):
    print("{}: \t{:.2f}".format(X_train.columns[c], gs_cv.best_estimator_.feature_importances_[c]))


# from sklearn.model_selection import cross_val_score
# all_scores = cross_val_score(estimator=gs_cv, X=X_train, y=y_train, cv=5)
# print(all_scores)
Best CV Score (Accuracy): 0.9284048242120833
Tuned Model Parameters: {'learning_rate': 0.005, 'n_estimators': 120, 'subsample': 0.75}

FEATURE IMPORTANCES:

age: 	0.00
bmi: 	0.34
children: 	0.00
sex_male: 	0.00
smoker_yes: 	0.65
# Save SVM with optimized hyperparameters
import pickle
classification_models.append("grad_boost_class")
pickle.dump(gs_cv, open("models/grad_boost_class.sav", "wb"))

4. Model Training: Sequential Regression Models (using Cluster Information)

4.1 Ridge Regression
4.2 Clustered Ridge Regression
4.3 LASSO Regression
4.4 OLS Regression
4.5 Random Forest Regressor
4.6 Gradient Boosting Regressor
4.7 Neural Network
# Create List to save trained regression models
regression_models = []
single_regression_models = []

Preparing Training Set: Train Reg. Models on Predicted Clusters from Classification

 # specify train data for classification
X_train = data_train[class_X].copy()
y_train = np.array(data_train[class_y]).reshape(-1, 1).flatten()

cm = "svm"

print(cm)
class_model = pickle.load(open("models/"+cm+".sav", "rb"))
print(class_model)

# add classification result to train data
y_pred = class_model.predict(X_train[class_X])

data_train_reg = data_train.copy()
data_train_reg["pred_age_cluster"] = y_pred.copy()
data_train_reg["age_cluster"] = y_pred.copy()

#     print("\ny_pred added")
#     print(data_train_reg.info())
#     print(data_train_reg["age_cluster"])
#     sns.scatterplot(x = data_train_reg["age"], y=data_train_reg["charges"], hue=data_train_reg["pred_age_cluster"])


# replace real age_cluster with predicted age cluster from classification
#     df = pd.DataFrame()
#     df[["age_cluster_0", "age_cluster_1", "age_cluster_2"]] = pd.get_dummies(data_train["pred_age_cluster"].copy(), drop_first=False)
#     print(df.info())
data_train_reg[["age_cluster_0", "age_cluster_1", "age_cluster_2"]] = pd.get_dummies(data_train_reg["pred_age_cluster"].copy(), drop_first=False)
#     print("\nage cluster dummies added")
#     print(data_train.info())

data_train_reg["bmi*age0"] = data_train_reg["age_cluster_0"].copy()*data_train_reg["bmi"].copy()
data_train_reg["bmi*age1"] = data_train_reg.loc[:,"age_cluster_1"]*data_train_reg["bmi"]
data_train_reg["bmi*age2"] = data_train_reg.loc[:,"age_cluster_2"]*data_train_reg["bmi"]

data_train_reg["age0"] = data_train_reg["age_cluster_0"]*data_train_reg["age"]
data_train_reg["age1"] = data_train_reg["age_cluster_1"]*data_train_reg["age"]
data_train_reg["age2"] = data_train_reg["age_cluster_2"]*data_train_reg["age"]
data_train_reg["age0*smoker_no"] = data_train_reg["age_cluster_0"]*data_train_reg["age"]*(1-data_train_reg["smoker_yes"])
data_train_reg["age2*smoker_yes"] = data_train_reg["age_cluster_2"]*data_train_reg["age"]*data_train_reg["smoker_yes"]
svm
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=99, shuffle=True),
             estimator=Pipeline(steps=[('SVC',
                                        SVC(class_weight='balanced',
                                            decision_function_shape='ovo'))]),
             param_grid={'SVC__C': array([1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]),
                         'SVC__gamma': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01]),
                         'SVC__kernel': ['rbf']},
             refit='Accuracy',
             scoring={'Accuracy': make_scorer(accuracy_score),
                      'F1': 'f1_micro'})

4.1 Ridge-Regression with Interaction Terms

Ridge regression adds an L2 penalty to the least-squares objective, $\min_\beta \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2$, which shrinks the coefficients towards zero without setting them exactly to zero.

reg_X["ridge_x_terms"] = ["children", "sex_male", "smoker_yes", "smoker*bmi","bmi*age0", "bmi*age1", "bmi*age2", "age0", "age1", 
    "age0*smoker_no", "age2*smoker_yes"]
reg_X["ridge_x_terms"] = ["age", "bmi", "children", "region_northeast", "region_southeast", "region_southwest", 
    "smoker*bmi", "age_cluster_0", "age_cluster_2", "bmi*age0", "bmi*age1", "bmi*age2", "age0", "age1", "age2", 
    "age0*smoker_no", "age2*smoker_yes"]

X_train = data_train_reg[reg_X["ridge_x_terms"]]
# X_train = X_train[["age", "smoker_yes", "children", "bmi"]]
y_train = np.array(data_train_reg[reg_y]).reshape(-1, 1).flatten()

Ridge Regression: Recursive Feature Elimination (cross-validated)

# Import modules
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge


selector = RFECV(Ridge(), step=1, cv=5, scoring="neg_mean_squared_error")
selector = selector.fit(X_train, y_train)
print("\nWhich variables should be selected? ", selector.support_)
print("\nRanking of the variables: ", selector.ranking_)
print("\nGrid search scores: ", np.sqrt(np.abs(selector.grid_scores_)))
Which variables should be selected?  [ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True]

Ranking of the variables:  [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

Grid search scores:  [[7138.67103019 7739.77368905 6337.92465015 7398.72070492 7042.69153942]
 [5936.81142318 6626.86076858 5266.74730253 6130.49498527 5693.17489792]
 [4150.28219027 5719.44228157 3781.62420373 4803.27007947 4597.38172403]
 [4148.36253166 5722.6880076  3781.47001082 4799.86168777 4604.95971821]
 [4156.43019153 5701.56961518 3782.72792572 4784.05858971 4617.58071579]
 [4144.50580069 5679.75298713 3776.20510701 4753.31605139 4644.70770821]
 [4124.60263887 5676.1042636  3794.78100912 4742.52466822 4640.62079901]
 [4121.95560945 5650.08523587 3791.55999314 4751.39725541 4666.42074818]
 [4128.41951805 5659.07821672 3791.29016312 4757.41959744 4682.59828349]
 [4140.15675865 5656.54015738 3791.57669633 4757.68220131 4683.01562734]
 [4150.98029378 5656.78347693 3739.6946527  4757.24369007 4681.07258915]
 [4146.88265727 5656.88809737 3739.60791478 4757.24922355 4608.92042897]
 [4146.60687995 5656.79070238 3722.51532363 4684.23234727 4611.38406745]
 [4146.81972293 5660.03104114 3727.91347861 4684.04785506 4612.0499913 ]
 [4146.69251666 5661.07710205 3728.15838223 4690.27757142 4612.11687324]
 [4144.35751635 5660.79345423 3728.08188747 4690.14497819 4611.80003467]
 [4145.24983333 5654.90468923 3713.60943913 4690.07686246 4605.39133569]]


c:\Users\adria\anaconda3\lib\site-packages\sklearn\utils\deprecation.py:103: FutureWarning: The `grid_scores_` attribute is deprecated in version 1.0 in favor of `cv_results_` and will be removed in version 1.2.
  warnings.warn(msg, category=FutureWarning)
# Plotting the number of features vs. cross-validation scores (negative MSE)
fig, ax = plt.subplots()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (MSE)")
plt.plot(selector.grid_scores_)
plt.xticks(np.arange(0, 19, 1))
plt.show()

png

reg_X["ridge_x_terms"] = ["age", "children", "region_southwest", "smoker*bmi", "age_cluster_2"]

X_train = data_train_reg[reg_X["ridge_x_terms"]]
y_train = np.array(data_train_reg[reg_y]).reshape(-1, 1).flatten()

Ridge Regression: Hyperparameter Tuning

Hyperparameters:

  • alpha: Regularization strength. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization.
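With $\alpha = 0$ the penalty vanishes and ridge reduces to ordinary least squares, which is why 0 is also prepended to the alpha grid below.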
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Define Pipeline for GridSearchCV
steps = [("Ridge", Ridge())]

pipeline = Pipeline(steps)

# Specify possible hyperparameters 
n_alphas = 100
parameters = {'Ridge__alpha': np.insert(np.logspace(-1, 5, n_alphas), 0, 0)
              }
# parameters = {'Ridge__alpha': [0]
#               }

# Instantiate GridSearchCV object
gs_cv = GridSearchCV(pipeline, parameters, scoring='neg_mean_squared_error', refit = "true", verbose=0)
gs_cv.fit(X_train, y_train)
GridSearchCV(estimator=Pipeline(steps=[('Ridge', Ridge())]),
             param_grid={'Ridge__alpha': array([0.00000000e+00, 1.00000000e-01, 1.14975700e-01, 1.32194115e-01,
       1.51991108e-01, 1.74752840e-01, 2.00923300e-01, 2.31012970e-01,
       2.65608778e-01, 3.05385551e-01, 3.51119173e-01, 4.03701726e-01,
       4.64158883e-01, 5.33669923e-01, 6.13590727e-01, 7.05480231e-01,
       8.11130831e-01, 9....
       6.13590727e+03, 7.05480231e+03, 8.11130831e+03, 9.32603347e+03,
       1.07226722e+04, 1.23284674e+04, 1.41747416e+04, 1.62975083e+04,
       1.87381742e+04, 2.15443469e+04, 2.47707636e+04, 2.84803587e+04,
       3.27454916e+04, 3.76493581e+04, 4.32876128e+04, 4.97702356e+04,
       5.72236766e+04, 6.57933225e+04, 7.56463328e+04, 8.69749003e+04,
       1.00000000e+05])},
             refit='true', scoring='neg_mean_squared_error')
coefs = []
mse = []
mae = []
r2 = []
alphas = np.insert(np.logspace(-1, 5, n_alphas), 0, 0)

for a in alphas:
#     print(a)
    ridge = Ridge(alpha=a, fit_intercept=True)
    ridge.fit(X_train, y_train)
    coefs.append(ridge.coef_)

    y_pred = ridge.predict(X_train)
    r2.append(ridge.score(X_train, y_train))
    mse.append(mean_squared_error(y_train, y_pred))
    mae.append(mean_absolute_error(y_train, y_pred))
#     print(ridge.score(X_test, y_test))
#     print()

# #############################################################################
# Display results
coefs = np.array(coefs)
fig, ax = plt.subplots(1)
# print(coefs)
for (c,l) in zip(coefs.T, X_train.columns):
    sns.lineplot(ax=ax, x=alphas, y=c, label=l)

ax.set_xscale("log")
ax.set_ylabel("ridge estimates")
ax.set_xlabel('alpha')
ax.set_title('Parameter Shrinking in Ridge')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.ylim([-3e3, 15e3])
plt.xlim([0.5,1e5])
plt.show()


fig, ax = plt.subplots(1, 3, figsize=(12, 4))
# plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=0.4)

sns.lineplot(ax=ax[0], x=alphas, y=mse, label="mse")
sns.lineplot(ax=ax[1], x=alphas, y=mae, label="mae")
sns.lineplot(ax=ax[2], x=alphas, y=r2, label="r2")
# print(mae)
ax[0].set_xscale("log")
ax[0].set_xlabel('alpha')
ax[0].set_title('Ridge Training Performance (MSE)')
# print(mae)
ax[1].set_xscale("log")
ax[1].set_xlabel('alpha')
ax[1].set_title('Ridge Training Performance (MAE)')

ax[2].set_xscale("log")
ax[2].set_xlabel('alpha')
ax[2].set_title('Ridge Training Performance (R2)')

plt.show()

png

png

# best estimator
print(gs_cv.best_estimator_)
# print(gs_cv.getsupport())
# print(gs_cv.ranking_)

print("Best CV Score (RMSE): {}".format(np.sqrt(np.abs(gs_cv.best_score_))))
print("Tuned Model Parameters: {}".format(gs_cv.best_params_))
training_results["ridge_x_terms"] = np.sqrt(np.abs(gs_cv.best_score_))

# from sklearn.model_selection import cross_val_score
# all_scores = np.sqrt(np.abs(cross_val_score(estimator=gs_cv, X=X_train, y=y_train, scoring='neg_mean_squared_error', cv=5)))
# print(all_scores)
# # r^2
# all_scores = np.sqrt(np.abs(cross_val_score(estimator=gs_cv, X=X_train, y=y_train, scoring='r2', cv=5)))
# print(all_scores)
# print(np.mean(all_scores))
Pipeline(steps=[('Ridge', Ridge(alpha=0.1519911082952934))])
Best CV Score (RMSE): 4597.851400728133
Tuned Model Parameters: {'Ridge__alpha': 0.1519911082952934}
# save the best trained model
regression_models.append("ridge_x_terms")
pickle.dump(gs_cv, open("models/ridge_x_terms.sav", "wb"))

4.2 Clustered Ridge Regression (1 Ridge Regression per Cluster)

reg_X["clustered_ridge"] = ["age_cluster","age", "smoker*bmi","bmi*age0", "bmi*age1", "bmi*age2", "age0", "age1", 
    "age0*smoker_no", "age2*smoker_yes"]
X_train = data_train_reg[reg_X["clustered_ridge"]]
y_train = np.array(data_train_reg[reg_y]).reshape(-1, 1).flatten()
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression

def ridge(data_train, reg_X, reg_y):
    X_train = data_train[reg_X]
    y_train = data_train[reg_y]

    # Specify possible hyperparameters 
    n_alphas = 10
    parameters = {'alpha': np.insert(np.logspace(-3, 7, n_alphas), 0, 0)
                  }

    # Instantiate GridSearchCV object
    ridge = GridSearchCV(Ridge(), parameters, scoring='neg_mean_squared_error', refit = "true", verbose=0)
    ridge.fit(X_train, y_train)

    # best estimator
#     print(ridge.best_estimator_)
    
    return ridge
class ClusteredRidge:
    def __init__(self):
        self.models = []
        self.y_pred = None
        self.best_params_ = []
        self.best_score_ = None
        self.model_scores = []
        
    def fit(self, data_train_reg, reg_X, reg_y):
        score_sum = 0
        number_total = data_train_reg.shape[0]
        for cluster in range(0,3):
            training_data = data_train_reg[data_train_reg["age_cluster"] == cluster]
            number = training_data.shape[0]
            model = ridge(training_data, reg_X, reg_y)
            # calculate score (neg_mse) as weighted avg
            self.model_scores.append(model.best_score_)
            score_sum = score_sum + model.best_score_ * (number/number_total)
            # save ridge model to list
            self.models.append(model)
            # save best params to list
            self.best_params_.append(model.best_params_)
        self.best_score_ = score_sum/3
            
            
    def predict(self, X_test):
        predictions = pd.DataFrame()
        for cluster in range(0,3):
            test_data = X_test[X_test["age_cluster"] == cluster].copy()
#             print(test_data.info())
#             print(test_data.head())
#             print(test_data.tail())

            # save df index for merging the results 
            index = test_data.index
            # add prediction for every cluster
            predictions = pd.concat([predictions, pd.DataFrame(data=self.models[cluster].predict(test_data), index=index)])
       
        y_pred = np.array(predictions.sort_index()).flatten()
        return y_pred            
clusteredRidge = ClusteredRidge()
clusteredRidge.fit(data_train_reg, reg_X["clustered_ridge"], reg_y)
# clusteredRidge.predict(data_test[reg_X])
# print(clusteredRidge.predict(data_test[reg_X]))
# print(clusteredRidge.predict(data_test[reg_X]))
# print(clusteredRidge.model_scores)
print("Best CV Score (RMSE): {}".format(np.sqrt(np.abs(clusteredRidge.best_score_))))
print("Tuned Model Parameters: {}".format(clusteredRidge.best_params_))
training_results["clustered_ridge"] = np.sqrt(np.abs(clusteredRidge.best_score_))

# from sklearn.model_selection import cross_val_score
# all_scores = np.sqrt(np.abs(cross_val_score(estimator=gs_cv, X=X_train, y=y_train, scoring='neg_mean_squared_error', cv=5)))
# print(all_scores)
# # r^2
# all_scores = np.sqrt(np.abs(cross_val_score(estimator=gs_cv, X=X_train, y=y_train, scoring='r2', cv=5)))
# print(all_scores)
# print(np.mean(all_scores))
Best CV Score (RMSE): 2691.955927908613
Tuned Model Parameters: [{'alpha': 0.0}, {'alpha': 2.1544346900318843}, {'alpha': 0.001}]
# save the best trained model
regression_models.append("clustered_ridge")
pickle.dump(clusteredRidge, open("models/clustered_ridge.sav", "wb"))

4.3 LASSO-Regression

LASSO adds an L1 penalty to the least-squares objective, $\min_\beta \tfrac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1$, which can shrink individual coefficients exactly to zero and therefore also acts as a feature selector.

reg_X["lasso"] = ["age", "bmi", "children", "sex_male", "smoker_yes", "region_northeast", "region_southeast", "region_southwest", 
    "smoker*bmi", "age_cluster_0", "age_cluster_2", "bmi*age0", "bmi*age1", "bmi*age2", "age0", "age1", "age2", 
    "age0*smoker_no", "age2*smoker_yes"]
reg_X["lasso"] = ["age_cluster","age", "smoker*bmi","bmi*age0", "bmi*age1", "bmi*age2", "age0", "age1", 
    "age0*smoker_no"]
X_train = data_train_reg[reg_X["lasso"]]
y_train = np.array(data_train_reg[reg_y]).reshape(-1, 1).flatten()

Hyperparameters:

  • alpha: Regularization strength. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization.
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Specify possible hyperparameters 
n_alphas = 100
parameters = {'alpha': np.logspace(-3, 7, n_alphas),
              "max_iter": [100000]
              }
# Instantiate GridSearchCV object
gs_cv = GridSearchCV(Lasso(), parameters, scoring='neg_mean_squared_error', refit = "true", verbose=0)
gs_cv.fit(X_train, y_train)
GridSearchCV(estimator=Lasso(),
             param_grid={'alpha': array([1.00000000e-03, 1.26185688e-03, 1.59228279e-03, 2.00923300e-03,
       2.53536449e-03, 3.19926714e-03, 4.03701726e-03, 5.09413801e-03,
       6.42807312e-03, 8.11130831e-03, 1.02353102e-02, 1.29154967e-02,
       1.62975083e-02, 2.05651231e-02, 2.59502421e-02, 3.27454916e-02,
       4.13201240e-02, 5.21400829e-02, 6.57933225e-02, 8.30217...
       1.20450354e+05, 1.51991108e+05, 1.91791026e+05, 2.42012826e+05,
       3.05385551e+05, 3.85352859e+05, 4.86260158e+05, 6.13590727e+05,
       7.74263683e+05, 9.77009957e+05, 1.23284674e+06, 1.55567614e+06,
       1.96304065e+06, 2.47707636e+06, 3.12571585e+06, 3.94420606e+06,
       4.97702356e+06, 6.28029144e+06, 7.92482898e+06, 1.00000000e+07]),
                         'max_iter': [100000]},
             refit='true', scoring='neg_mean_squared_error')
# best estimator
print(gs_cv.best_estimator_)

print("Best CV Score (RMSE): {}".format(np.sqrt(np.abs(gs_cv.best_score_))))
print("Tuned Model Parameters: {}".format(gs_cv.best_params_))

training_results["lasso"] = np.sqrt(np.abs(gs_cv.best_score_))

print("\nLASSO FEATURE SELECTION:\n")
for c in range(0, X_train.columns.shape[0]):
    print("{}: \t{:.2f}".format(X_train.columns[c], gs_cv.best_estimator_.coef_[c]))

# from sklearn.model_selection import cross_val_score
# all_scores = np.sqrt(np.abs(cross_val_score(estimator=ridge, X=X_train, y=y_train, scoring='neg_mean_squared_error', cv=5)))
# print(all_scores)
# # r^2
# all_scores = np.sqrt(np.abs(cross_val_score(estimator=ridge, X=X_train, y=y_train, scoring='r2', cv=5)))
# print(all_scores)
# print(np.mean(all_scores))
Lasso(alpha=17.47528400007683, max_iter=100000)
Best CV Score (RMSE): 4659.321710689284
Tuned Model Parameters: {'alpha': 17.47528400007683, 'max_iter': 100000}

LASSO FEATURE SELECTION:

age_cluster: 	14074.41
age: 	3585.50
smoker*bmi: 	461.66
bmi*age0: 	-22.19
bmi*age1: 	1599.81
bmi*age2: 	4127.02
age0: 	113.83
age1: 	-48.98
age0*smoker_no: 	0.00
# save the best trained model
regression_models.append("lasso")
pickle.dump(gs_cv, open("models/lasso.sav", "wb"))

4.4 OLS-Regression (statsmodels)

reg_X["ols_regression"] = ["age", "bmi", "children", "age_cluster_0", "age_cluster_2"]
reg_X["ols_regression"] = ["age", "children", "region_southwest", "smoker*bmi", "age_cluster_2"]

X_train = data_train_reg[reg_X["ols_regression"]]
y_train = np.array(data_train_reg[reg_y]).reshape(-1, 1).flatten()
import statsmodels.api as sm

X_train = sm.add_constant(X_train)
statsmod = sm.OLS(y_train,X_train).fit()

statsmod.summary()
OLS Regression Results
Dep. Variable: y R-squared: 0.860
Model: OLS Adj. R-squared: 0.860
Method: Least Squares F-statistic: 1147.
Date: Wed, 15 Feb 2023 Prob (F-statistic): 0.00
Time: 18:26:51 Log-Likelihood: -9213.5
No. Observations: 936 AIC: 1.844e+04
Df Residuals: 930 BIC: 1.847e+04
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 1.189e+04 192.176 61.887 0.000 1.15e+04 1.23e+04
age 3662.6649 149.575 24.487 0.000 3369.122 3956.208
children 697.2045 149.538 4.662 0.000 403.733 990.677
region_southwest -870.5627 346.879 -2.510 0.012 -1551.319 -189.807
smoker*bmi 6803.3546 253.246 26.865 0.000 6306.354 7300.355
age_cluster_2 1.395e+04 798.376 17.467 0.000 1.24e+04 1.55e+04
Omnibus: 591.879 Durbin-Watson: 2.012
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5174.028
Skew: 2.885 Prob(JB): 0.00
Kurtosis: 12.969 Cond. No. 5.84


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
training_results["ols_regression"] = np.sqrt(mean_squared_error(y_train, statsmod.fittedvalues))
# save the best trained model
regression_models.append("ols_regression")
pickle.dump(statsmod, open("models/ols_regression.sav", "wb"))

4.5 Random Forest Regressor

Hyperparameters:

  • max_depth: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

  • max_features: The number of features to consider when looking for the best split:

  • n_estimators: The number of trees in the forest.

# reg_X["random_forest_reg"] = ["age", "bmi", "children", "sex_male", "smoker_yes", "region_northeast", "region_southeast", "region_southwest", 
#     "smoker*bmi", "age_cluster_0", "age_cluster_2", "bmi*age0", "bmi*age1", "bmi*age2", "age0", "age1", "age2", 
#     "age0*smoker_no", "age2*smoker_yes"]
reg_X["random_forest_reg"] = ["age", "bmi", "smoker_yes", 
    "smoker*bmi", "age_cluster_0", "age_cluster_2", "bmi*age0", "bmi*age1", "bmi*age2", "age0", "age2", 
    "age0*smoker_no", "age2*smoker_yes"]
# reg_X["random_forest_reg"] = ["age", "children", "region_southwest", "smoker*bmi", "age_cluster_2"]

X_train = data_train_reg[reg_X["random_forest_reg"]]

y_train = np.array(data_train_reg[reg_y]).reshape(-1, 1).flatten()
from sklearn.ensemble import RandomForestRegressor

# Specify possible hyperparameters 
# parameters = {
#         'max_depth': [10],
#         'max_features': ['log2'],
#         'n_estimators': np.insert(np.arange(100, 1000, 100), 0, 50)
#               }
# Specify possible hyperparameters 
parameters = {
    'max_depth': [1, 2, 3, 5, 10],
    'max_features': ['log2'],
    'n_estimators': np.insert(np.arange(100, 1000, 100), 0, 10)
}
parameters = {
        'max_depth': [5],
        'max_features': ['log2'],
        'n_estimators': [300]
              }

# Instantiate GridSearchCV object
gs_cv = GridSearchCV(RandomForestRegressor(), parameters, scoring='neg_mean_squared_error', refit = "true", verbose=5, n_jobs=-1)
gs_cv.fit(X_train, y_train)

# best estimator
print(gs_cv.best_estimator_)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
RandomForestRegressor(max_depth=5, max_features='log2', n_estimators=300)
print("Best CV Score (RMSE): {}".format(np.sqrt(np.abs(gs_cv.best_score_))))
print("Tuned Model Parameters: {}".format(gs_cv.best_params_))

training_results["random_forest_reg"] = np.sqrt(np.abs(gs_cv.best_score_))

print("\nFEATURE IMPORTANCES:\n")
for c in range(0, X_train.columns.shape[0]):
    print("{}: \t{:.2f}".format(X_train.columns[c], gs_cv.best_estimator_.feature_importances_[c]))



# from sklearn.model_selection import cross_val_score
# all_scores = np.sqrt(np.abs(cross_val_score(estimator=gs_cv, X=X_train, y=y_train, scoring='neg_mean_squared_error', cv=5)))
# print(all_scores)
# # r^2
# all_scores = np.sqrt(np.abs(cross_val_score(estimator=gs_cv, X=X_train, y=y_train, scoring='r2', cv=5)))
# print(all_scores)
# print(np.mean(all_scores))
Best CV Score (RMSE): 4708.778409890828
Tuned Model Parameters: {'max_depth': 5, 'max_features': 'log2', 'n_estimators': 300}

FEATURE IMPORTANCES:

age: 	0.03
bmi: 	0.02
smoker_yes: 	0.11
smoker*bmi: 	0.23
age_cluster_0: 	0.14
age_cluster_2: 	0.18
bmi*age0: 	0.00
bmi*age1: 	0.02
bmi*age2: 	0.09
age0: 	0.05
age2: 	0.03
age0*smoker_no: 	0.05
age2*smoker_yes: 	0.04
# save the best trained model
regression_models.append("random_forest_reg")
pickle.dump(gs_cv, open("models/random_forest_reg.sav", "wb"))

4.6 Gradient Boosting Regression

Hyperparameters:

  • n_estimators: The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.

  • learning_rate: Learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.

  • subsample: The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.

reg_X["gradient_boosting_reg"] = ["age", "bmi", "children", "sex_male", "smoker_yes", "region_northeast", "region_southeast", "region_southwest", 
    "smoker*bmi", "age_cluster_0", "age_cluster_2", "bmi*age0", "bmi*age1", "bmi*age2", "age0", "age1", "age2", 
    "age0*smoker_no", "age2*smoker_yes"]
reg_X["gradient_boosting_reg"] = [ "age_cluster_2", "bmi*age1", "bmi*age2", "age0", "age2"]
reg_X["gradient_boosting_reg"] = ["age", "children", "region_southwest", "smoker*bmi", "age_cluster_2"]
X_train = data_train_reg[reg_X["gradient_boosting_reg"]]
y_train = np.array(data_train_reg[reg_y]).reshape(-1, 1).flatten()
from sklearn.ensemble import GradientBoostingRegressor

# Specify possible hyperparameters 
parameters = {'n_estimators': [50, 100, 200],
              "learning_rate": [0.03, 0.04, 0.045, 0.05, 0.055, 0.06, 0.07, 0.1],
              "subsample": [0.8, 0.85, 0.875, 0.9, 0.925, 0.95, 0.975, 1]
              }
parameters = {'n_estimators': [100],
              "learning_rate": [0.05],
              "subsample": [1]
              }

# Instantiate GridSearchCV object
gs_cv = GridSearchCV(GradientBoostingRegressor(), parameters, scoring='neg_mean_squared_error', refit = "true", verbose=0)
gs_cv.fit(X_train, y_train)

# best estimator
print(gs_cv.best_estimator_)
GradientBoostingRegressor(learning_rate=0.05, subsample=1)
print("Best CV Score (RMSE): {}".format(np.sqrt(np.abs(gs_cv.best_score_))))
print("Tuned Model Parameters: {}".format(gs_cv.best_params_))

training_results["gradient_boosting_reg"] = np.sqrt(np.abs(gs_cv.best_score_))

print("\nFEATURE IMPORTANCES:\n")
for c in range(0, X_train.columns.shape[0]):
    print("{}: \t{:.2f}".format(X_train.columns[c], gs_cv.best_estimator_.feature_importances_[c]))



# from sklearn.model_selection import cross_val_score
# all_scores = np.sqrt(np.abs(cross_val_score(estimator=gs_cv, X=X_train, y=y_train, scoring='neg_mean_squared_error', cv=5)))
# print(all_scores)
# # r^2
# all_scores = np.sqrt(np.abs(cross_val_score(estimator=gs_cv, X=X_train, y=y_train, scoring='r2', cv=5)))
# print(all_scores)
# print(np.mean(all_scores))
Best CV Score (RMSE): 4593.299016472459
Tuned Model Parameters: {'learning_rate': 0.05, 'n_estimators': 100, 'subsample': 1}

FEATURE IMPORTANCES:

age: 	0.11
children: 	0.01
region_southwest: 	0.00
smoker*bmi: 	0.87
age_cluster_2: 	0.00
# save the best trained model
regression_models.append("gradient_boosting_reg")
pickle.dump(gs_cv, open("models/gradient_boosting_reg.sav", "wb"))

4.7 Neural Network

reg_X["neural_net"] = ["age_cluster","age", "smoker*bmi","bmi*age0", "bmi*age1", "bmi*age2", "age0", "age1", 
    "age0*smoker_no", "age2*smoker_yes"]
X_train = data_train_reg[reg_X["neural_net"]]
y_train = np.array(data_train_reg[reg_y]).reshape(-1, 1).flatten()

from sklearn.preprocessing import MinMaxScaler
nn_y_scaler = MinMaxScaler()
y_train = nn_y_scaler.fit_transform(np.array(data_train_reg[reg_y]))

nn_y_min = np.min(data_train_reg[reg_y])
nn_y_max = np.max(data_train_reg[reg_y])
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# define base model
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(20, input_dim=10, kernel_initializer='normal', activation='sigmoid'))
#     model.add(Dense(20, kernel_initializer='normal', activation='relu'))
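    # a sigmoid output works here because the target was MinMax-scaled to [0, 1] above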
    model.add(Dense(1, kernel_initializer='normal', activation="sigmoid"))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

print(X_train.shape, y_train.shape)
(936, 10) (936, 1)
# evaluate model
neural_net = KerasRegressor(build_fn=baseline_model, epochs=1000, batch_size=100, verbose=0)
kfold = KFold(n_splits=5)
results = cross_val_score(neural_net, X_train, y_train, cv=kfold)

neural_net.fit(X_train, y_train)
<keras.callbacks.History at 0x2207ea888e0>
print("Baseline: %.2f (%.2f) RMSE" % (np.sqrt(np.abs(results.mean())), np.sqrt(np.abs(results.std()))))
print(((nn_y_max - nn_y_min) * np.sqrt(np.abs(results.mean()))).item())
Baseline: 0.07 (0.04) RMSE
4695.918632151114
# save the best trained model
regression_models.append("neural_net")
training_results["neural_net"] = ( (nn_y_max - nn_y_min) * np.sqrt(np.abs(results.mean()))).item()

5. Model Training: Single Regression Models (no Cluster Information)

5.1 Ridge Regression
5.2 Random Forest Regressor
5.3 Gradient Boosting Regressor
5.4 Neural Network

5.1 Ridge Regression (Single Model)

Hyperparameters:

  • alpha: Regularization strength. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization.
reg_X["ridge_single"] = ["age", "bmi", "children", "smoker_yes", "smoker*bmi"]
X_train = data_train[reg_X["ridge_single"]]
y_train = np.array(data_train[reg_y]).reshape(-1, 1).flatten()
ridge_reg = ridge(data_train, reg_X["ridge_single"], reg_y)
single_training_results["ridge_single"] = np.sqrt(np.abs(ridge_reg.best_score_))
print(single_training_results["ridge_single"])
4938.270961588569
# save the best trained model
single_regression_models.append("ridge_single")
pickle.dump(ridge_reg, open("models/ridge_single.sav", "wb"))

5.2 Random Forest Regressor (Single Model)

Hyperparameters:

  • max_depth: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

  • max_features: The number of features to consider when looking for the best split:

  • n_estimators: The number of trees in the forest.
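As a concrete example of the max_features setting: the reduced feature set used below has 8 columns, so max_features='log2' lets each split consider $\lfloor \log_2 8 \rfloor = 3$ randomly chosen candidate features.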

reg_X["rand_forest_reg_single"] = ["age", "bmi", "children", "sex_male", "smoker_yes", "region_northeast", "region_southeast", "region_southwest", 
    "smoker*bmi", "age_cluster_0", "age_cluster_2", "bmi*age0", "bmi*age1", "bmi*age2", "age0", "age1", "age2", 
    "age0*smoker_no", "age2*smoker_yes"]
reg_X["rand_forest_reg_single"] = ["age", "bmi", "children", "sex_male", "smoker_yes", "region_northeast", "region_southeast", 
    "smoker*bmi"]
X_train = data_train[reg_X["rand_forest_reg_single"]]
y_train = np.array(data_train[reg_y]).reshape(-1, 1).flatten()
from sklearn.ensemble import RandomForestRegressor

# Specify possible hyperparameters 
parameters = {
        'max_depth': [10],
        'max_features': ['log2'],
        'n_estimators': np.insert(np.arange(100, 1000, 100), 0, 50)
              }
# final grid used for the run below (best values from the broader search above)
parameters = {
        'max_depth': [10],
        'max_features': ['log2'],
        'n_estimators': [800]
              }
# Instantiate GridSearchCV object
gs_cv = GridSearchCV(RandomForestRegressor(), parameters, scoring='neg_mean_squared_error', refit=True, verbose=0)
gs_cv.fit(X_train, y_train)

# best estimator
print(gs_cv.best_estimator_)
RandomForestRegressor(max_depth=10, max_features='log2', n_estimators=800)
print("Best CV Score (RMSE): {}".format(np.sqrt(np.abs(gs_cv.best_score_))))
print("Tuned Model Parameters: {}".format(gs_cv.best_params_))

single_training_results["rand_forest_reg_single"] = np.sqrt(np.abs(gs_cv.best_score_))

print("\nFEATURE IMPORTANCES:\n")
for c in range(0, X_train.columns.shape[0]):
    print("{}: \t{:.2f}".format(X_train.columns[c], gs_cv.best_estimator_.feature_importances_[c]))



# from sklearn.model_selection import cross_val_score
# all_scores = np.sqrt(np.abs(cross_val_score(estimator=gs_cv, X=X_train, y=y_train, scoring='neg_mean_squared_error', cv=5)))
# print(all_scores)
# # r^2
# all_scores = np.sqrt(np.abs(cross_val_score(estimator=gs_cv, X=X_train, y=y_train, scoring='r2', cv=5)))
# print(all_scores)
# print(np.mean(all_scores))
Best CV Score (RMSE): 4752.358188849131
Tuned Model Parameters: {'max_depth': 10, 'max_features': 'log2', 'n_estimators': 800}

FEATURE IMPORTANCES:

age: 	0.12
bmi: 	0.09
children: 	0.02
sex_male: 	0.01
smoker_yes: 	0.29
region_northeast: 	0.01
region_southeast: 	0.00
smoker*bmi: 	0.46
# save the best trained model
single_regression_models.append("rand_forest_reg_single")
pickle.dump(gs_cv, open("models/rand_forest_reg_single.sav", "wb"))

5.3 Gradient Boosting Regressor (Single Model)

Hyperparameters:

  • n_estimators: The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.

  • learning_rate: Learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.

  • subsample: The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.

reg_X["grad_boost_reg_single"] = ["age", "bmi", "children", "sex_male", "smoker_yes", "region_northeast", "region_southeast", "region_southwest", 
    "smoker*bmi", "age_cluster_0", "age_cluster_2", "bmi*age0", "bmi*age1", "bmi*age2", "age0", "age1", "age2", 
    "age0*smoker_no", "age2*smoker_yes"]
reg_X["grad_boost_reg_single"] = ["age", "bmi", "children", "smoker_yes", "smoker*bmi"]
X_train = data_train[reg_X["grad_boost_reg_single"]]
y_train = np.array(data_train[reg_y]).reshape(-1, 1).flatten()
from sklearn.ensemble import GradientBoostingRegressor

# Specify possible hyperparameters 
parameters = {'n_estimators': [50, 100, 200],
              "learning_rate": [0.001, 0.005, 0.01, 0.05, 0.1, 0.15, 0.25, 0.5],
              "subsample": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
              }
# refined grid around the best region found above
parameters = {'n_estimators': [75, 100, 150],
              "learning_rate": [0.03, 0.035, 0.04, 0.045, 0.05, 0.055, 0.06, 0.07, 0.1],
              "subsample": [0.4, 0.5, 0.6, 0.7, 0.8, 0.825, 0.85, 0.875, 0.9, 0.925, 0.975, 1]
              }
# final grid used for the run below (single best combination; overrides the grids above)
parameters = {'n_estimators': [100],
              "learning_rate": [0.045],
              "subsample": [0.875]
              }
# Instantiate GridSearchCV object
gs_cv = GridSearchCV(GradientBoostingRegressor(random_state=42), parameters, scoring='neg_mean_squared_error', refit=True, verbose=0, n_jobs=-1)
gs_cv.fit(X_train, y_train)

# best estimator
print(gs_cv.best_estimator_)
GradientBoostingRegressor(learning_rate=0.045, random_state=42, subsample=0.875)
print("Best CV Score (RMSE): {}".format(np.sqrt(np.abs(gs_cv.best_score_))))
print("Tuned Model Parameters: {}".format(gs_cv.best_params_))

single_training_results["grad_boost_reg_single"] = np.sqrt(np.abs(gs_cv.best_score_))

print("\nFEATURE IMPORTANCES:\n")
for c in range(0, X_train.columns.shape[0]):
    print("{}: \t{:.2f}".format(X_train.columns[c], gs_cv.best_estimator_.feature_importances_[c]))

# from sklearn.model_selection import cross_val_score
# all_scores = np.sqrt(np.abs(cross_val_score(estimator=gs_cv, X=X_train, y=y_train, scoring='neg_mean_squared_error', cv=5)))
# print(all_scores)
# # r^2
# all_scores = np.sqrt(np.abs(cross_val_score(estimator=gs_cv, X=X_train, y=y_train, scoring='r2', cv=5)))
# print(all_scores)
# print(np.mean(all_scores))
Best CV Score (RMSE): 4603.210125271665
Tuned Model Parameters: {'learning_rate': 0.045, 'n_estimators': 100, 'subsample': 0.875}

FEATURE IMPORTANCES:

age: 	0.12
bmi: 	0.02
children: 	0.01
smoker_yes: 	0.02
smoker*bmi: 	0.83
# save the best trained model
single_regression_models.append("grad_boost_reg_single")
pickle.dump(gs_cv, open("models/grad_boost_reg_single.sav", "wb"))

5.4 Neural Network (Single Model)

reg_X["neural_net_single"] = ["age", "bmi", "children", "sex_male", "smoker_yes", "region_northeast", "region_southeast", "region_southwest", 
    "smoker*bmi"]
X_train = data_train[reg_X["neural_net_single"]]
y_train = np.array(data_train[reg_y]).reshape(-1, 1).flatten()

# scale target value for NN
from sklearn.preprocessing import MinMaxScaler
nn_pure_y_scaler = MinMaxScaler()
y_train = nn_pure_y_scaler.fit_transform(data_train[reg_y])

nn_y_min = np.min(data_train[reg_y])
nn_y_max = np.max(data_train[reg_y])
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# define base model
def baseline_model():
    # create model: one hidden layer; input_dim=9 matches the 9 features in reg_X["neural_net_single"]
    model = Sequential()
    model.add(Dense(18, input_dim=9, kernel_initializer='normal', activation='sigmoid'))
#     model.add(Dense(18, kernel_initializer='normal', activation='relu'))
    # sigmoid output layer because the target was min-max scaled to [0, 1]
    model.add(Dense(1, kernel_initializer='normal', activation="sigmoid"))
    # Compile model with MSE loss and the Adam optimizer
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

print(X_train.shape, y_train.shape)
(936, 9) (936, 1)
# evaluate model
neural_net_pure = KerasRegressor(build_fn=baseline_model, epochs=500, batch_size=100, verbose=0)
kfold = KFold(n_splits=5)
results = cross_val_score(neural_net_pure, X_train, y_train, cv=kfold)

neural_net_pure.fit(X_train, y_train)
<keras.callbacks.History at 0x2207f523f70>
print("Baseline: %.2f (%.2f) RMSE" % (np.sqrt(np.abs(results.mean())), np.sqrt(np.abs(results.std()))))
single_regression_models.append("neural_net_single")
single_training_results["neural_net_single"] = ( (nn_y_max - nn_y_min) * np.sqrt(np.abs(results.mean()))).item()
print("cross-validated RMSE:", single_training_results["neural_net_single"])
Baseline: 0.08 (0.04) RMSE
cross-validated RMSE: 4796.793006385908

6. Prediction on the Test Data Set

6.1 Prediction: Testing the Sequential Models (Classification + Regression)
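The loop below chains the two stages: each saved classifier first predicts the age cluster of the test records, the cluster dummies and interaction terms are rebuilt from those predictions, and every saved regression model is then scored on the resulting features. A compressed outline (load, rebuild_cluster_features and evaluate are hypothetical placeholders for the inline code below):

# for cm in classification_models:
#     clf = load(cm)                                        # stage 1: classifier
#     pred_cluster = clf.predict(data_test[class_X])        # predict the age cluster
#     X_test_reg = rebuild_cluster_features(pred_cluster)   # dummies + interaction terms
#     for rm in regression_models:
#         reg = load(rm)                                    # stage 2: regression model
#         evaluate(reg.predict(X_test_reg[reg_X[rm]]), y_test)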

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay


#Evaluate classification on test data
prediction_results = {}

for cm in classification_models:
    # specify test data for classification
    X_test = data_test[class_X].copy()
    y_test = np.array(data_test[class_y]).reshape(-1, 1).flatten()
    
    print("------------------", cm, "-------------------")
    class_model = pickle.load(open("models/"+cm+".sav", "rb"))
#     print(class_model)
    
    # add classification result to test data
    y_pred = class_model.predict(X_test[class_X])
    print(classification_report(y_test, y_pred))  # note: argument order is (y_true, y_pred)
    print(confusion_matrix(y_test, y_pred, labels=[0,1,2]))
    
    class_scores = {
        "accuracy": accuracy_score(y_test, y_pred)
    }
    
    data_test_reg = data_test.copy()
    data_test_reg["pred_age_cluster"] = y_pred.copy()
    data_test_reg["age_cluster"] = y_pred.copy()
    
#     print("\ny_pred added")
#     print(data_test_reg.info())
#     print(data_test_reg["age_cluster"])
    sns.scatterplot(x = data_test_reg["age"], y=data_test_reg["charges"], hue=data_test_reg["pred_age_cluster"])
    plt.show()

    
    # replace real age_cluster with predicted age cluster from classification
#     df = pd.DataFrame()
#     df[["age_cluster_0", "age_cluster_1", "age_cluster_2"]] = pd.get_dummies(data_test["pred_age_cluster"].copy(), drop_first=False)
#     print(df.info())
    data_test_reg[["age_cluster_0", "age_cluster_1", "age_cluster_2"]] = pd.get_dummies(data_test_reg["pred_age_cluster"].copy(), drop_first=False)
#     print("\nage cluster dummies added")
#     print(data_test.info())

    data_test_reg["bmi*age0"] = data_test_reg["age_cluster_0"].copy()*data_test_reg["bmi"].copy()
    data_test_reg["bmi*age1"] = data_test_reg.loc[:,"age_cluster_1"]*data_test_reg["bmi"]
    data_test_reg["bmi*age2"] = data_test_reg.loc[:,"age_cluster_2"]*data_test_reg["bmi"]

    data_test_reg["age0"] = data_test_reg["age_cluster_0"]*data_test_reg["age"]
    data_test_reg["age1"] = data_test_reg["age_cluster_1"]*data_test_reg["age"]
    data_test_reg["age2"] = data_test_reg["age_cluster_2"]*data_test_reg["age"]
    data_test_reg["age0*smoker_no"] = data_test_reg["age_cluster_0"]*data_test_reg["age"]*(1-data_test_reg["smoker_yes"])
    data_test_reg["age2*smoker_yes"] = data_test_reg["age_cluster_2"]*data_test_reg["age"]*data_test_reg["smoker_yes"]
    
#     print("\nall variables added")
#     print(data_test.info())
#     print(y_pred.shape)
#     print(reg_X)
    
    new_entry = {
            "class_score": class_scores    
            }
    
    for rm in regression_models:
        print("################# ",rm)
        # specify test data for regression
        X_test = data_test_reg[reg_X[rm]]
        print(reg_X[rm])
        y_test = np.array(data_test_reg[reg_y]).reshape(-1, 1).flatten()
#         print(X_test[0:10])
#         print(y_test[0:10])
        if rm == "neural_net":
            reg_model = neural_net
        else:
            reg_model = pickle.load(open("models/"+rm+".sav", "rb"))
        
        # Predict with regression model
#         print(X_test.info())
        if rm == "ols_regression":
            y_pred = reg_model.predict(sm.add_constant(X_test))
        elif rm == "neural_net":
            y_pred = nn_y_scaler.inverse_transform(reg_model.predict(X_test).reshape(-1, 1))
            y_pred = y_pred.reshape(-1, )
        else:
            y_pred = reg_model.predict(X_test)
            
        print(y_pred.shape)
            
        sns.scatterplot(x = data_test_reg["age"], y=y_pred, label="y_pred", color="red")
        sns.scatterplot(x = data_test_reg["age"], y=y_test, label="y_test", color="green")

        plt.show()
        
        # Calculate regression metrics
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        mae = mean_absolute_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        
        scores = {
            "rmse": rmse,
            "mae": mae,
            "r2": r2            
        }
        
        new_entry.update({str(rm): scores})
#         print("--------update", scores)

    # add to dict
    prediction_results.update({str(cm): new_entry})
------------------ svm -------------------
              precision    recall  f1-score   support

           0       1.00      0.90      0.95       324
           1       0.47      0.94      0.62        32
           2       0.93      0.93      0.93        46

    accuracy                           0.91       402
   macro avg       0.80      0.92      0.84       402
weighted avg       0.95      0.91      0.92       402

[[292   0   0]
 [ 31  30   3]
 [  1   2  43]]

png

#################  ridge_x_terms
['age', 'children', 'region_southwest', 'smoker*bmi', 'age_cluster_2']
(402,)

png

#################  clustered_ridge
['age_cluster', 'age', 'smoker*bmi', 'bmi*age0', 'bmi*age1', 'bmi*age2', 'age0', 'age1', 'age0*smoker_no', 'age2*smoker_yes']
(402,)

png

#################  lasso
['age_cluster', 'age', 'smoker*bmi', 'bmi*age0', 'bmi*age1', 'bmi*age2', 'age0', 'age1', 'age0*smoker_no']
(402,)

png

#################  ols_regression
['age', 'children', 'region_southwest', 'smoker*bmi', 'age_cluster_2']
(402,)

png

#################  random_forest_reg
['age', 'bmi', 'smoker_yes', 'smoker*bmi', 'age_cluster_0', 'age_cluster_2', 'bmi*age0', 'bmi*age1', 'bmi*age2', 'age0', 'age2', 'age0*smoker_no', 'age2*smoker_yes']
(402,)

png

#################  gradient_boosting_reg
['age', 'children', 'region_southwest', 'smoker*bmi', 'age_cluster_2']
(402,)

png

#################  neural_net
['age_cluster', 'age', 'smoker*bmi', 'bmi*age0', 'bmi*age1', 'bmi*age2', 'age0', 'age1', 'age0*smoker_no', 'age2*smoker_yes']
(402,)

png

------------------ rand_forest_class -------------------
              precision    recall  f1-score   support

           0       1.00      0.90      0.95       324
           1       0.52      0.94      0.67        35
           2       0.93      1.00      0.97        43

    accuracy                           0.92       402
   macro avg       0.82      0.95      0.86       402
weighted avg       0.95      0.92      0.93       402

[[292   0   0]
 [ 31  33   0]
 [  1   2  43]]

png

#################  ridge_x_terms
['age', 'children', 'region_southwest', 'smoker*bmi', 'age_cluster_2']
(402,)

png

#################  clustered_ridge
['age_cluster', 'age', 'smoker*bmi', 'bmi*age0', 'bmi*age1', 'bmi*age2', 'age0', 'age1', 'age0*smoker_no', 'age2*smoker_yes']
(402,)

png

#################  lasso
['age_cluster', 'age', 'smoker*bmi', 'bmi*age0', 'bmi*age1', 'bmi*age2', 'age0', 'age1', 'age0*smoker_no']
(402,)

png

#################  ols_regression
['age', 'children', 'region_southwest', 'smoker*bmi', 'age_cluster_2']
(402,)

png

#################  random_forest_reg
['age', 'bmi', 'smoker_yes', 'smoker*bmi', 'age_cluster_0', 'age_cluster_2', 'bmi*age0', 'bmi*age1', 'bmi*age2', 'age0', 'age2', 'age0*smoker_no', 'age2*smoker_yes']
(402,)

png

#################  gradient_boosting_reg
['age', 'children', 'region_southwest', 'smoker*bmi', 'age_cluster_2']
(402,)

png

#################  neural_net
['age_cluster', 'age', 'smoker*bmi', 'bmi*age0', 'bmi*age1', 'bmi*age2', 'age0', 'age1', 'age0*smoker_no', 'age2*smoker_yes']
(402,)

png

------------------ grad_boost_class -------------------
              precision    recall  f1-score   support

           0       1.00      0.90      0.95       324
           1       0.52      0.94      0.67        35
           2       0.93      1.00      0.97        43

    accuracy                           0.92       402
   macro avg       0.82      0.95      0.86       402
weighted avg       0.95      0.92      0.93       402

[[292   0   0]
 [ 31  33   0]
 [  1   2  43]]

png

#################  ridge_x_terms
['age', 'children', 'region_southwest', 'smoker*bmi', 'age_cluster_2']
(402,)

png

#################  clustered_ridge
['age_cluster', 'age', 'smoker*bmi', 'bmi*age0', 'bmi*age1', 'bmi*age2', 'age0', 'age1', 'age0*smoker_no', 'age2*smoker_yes']
(402,)

png

#################  lasso
['age_cluster', 'age', 'smoker*bmi', 'bmi*age0', 'bmi*age1', 'bmi*age2', 'age0', 'age1', 'age0*smoker_no']
(402,)

png

#################  ols_regression
['age', 'children', 'region_southwest', 'smoker*bmi', 'age_cluster_2']
(402,)

png

#################  random_forest_reg
['age', 'bmi', 'smoker_yes', 'smoker*bmi', 'age_cluster_0', 'age_cluster_2', 'bmi*age0', 'bmi*age1', 'bmi*age2', 'age0', 'age2', 'age0*smoker_no', 'age2*smoker_yes']
(402,)

png

#################  gradient_boosting_reg
['age', 'children', 'region_southwest', 'smoker*bmi', 'age_cluster_2']
(402,)

png

#################  neural_net
['age_cluster', 'age', 'smoker*bmi', 'bmi*age0', 'bmi*age1', 'bmi*age2', 'age0', 'age1', 'age0*smoker_no', 'age2*smoker_yes']
(402,)

png

x = []
y = []
models = []
for cm in prediction_results:
    for rm in prediction_results[cm]:
        print(cm+"->"+rm+":")
        if rm != "class_score":
            print(prediction_results[cm][rm])
            models.append(cm+"+"+rm)
            x.append(prediction_results[cm][rm]["rmse"])
            y.append(prediction_results[cm][rm]["mae"])

        for s in prediction_results[cm][rm]:
            print("\t", s, prediction_results[cm][rm][s])
    print("\n")
svm->class_score:
	 accuracy 0.9079601990049752
svm->ridge_x_terms:
{'rmse': 4559.966170530843, 'mae': 2411.5121821100297, 'r2': 0.8527601130760974}
	 rmse 4559.966170530843
	 mae 2411.5121821100297
	 r2 0.8527601130760974
svm->clustered_ridge:
{'rmse': 4568.484334057749, 'mae': 2508.2148892272617, 'r2': 0.8522095015833457}
	 rmse 4568.484334057749
	 mae 2508.2148892272617
	 r2 0.8522095015833457
svm->lasso:
{'rmse': 4562.8115936998465, 'mae': 2505.3459228306065, 'r2': 0.8525763000890569}
	 rmse 4562.8115936998465
	 mae 2505.3459228306065
	 r2 0.8525763000890569
svm->ols_regression:
{'rmse': 4560.5509421512925, 'mae': 2411.4006426455785, 'r2': 0.8527223464677905}
	 rmse 4560.5509421512925
	 mae 2411.4006426455785
	 r2 0.8527223464677905
svm->random_forest_reg:
{'rmse': 4515.218977120326, 'mae': 2597.6097515355186, 'r2': 0.8556356803274469}
	 rmse 4515.218977120326
	 mae 2597.6097515355186
	 r2 0.8556356803274469
svm->gradient_boosting_reg:
{'rmse': 4425.730562775533, 'mae': 2419.597293975663, 'r2': 0.8613013687906581}
	 rmse 4425.730562775533
	 mae 2419.597293975663
	 r2 0.8613013687906581
svm->neural_net:
{'rmse': 4509.003190141607, 'mae': 2484.034467409632, 'r2': 0.8560328793167942}
	 rmse 4509.003190141607
	 mae 2484.034467409632
	 r2 0.8560328793167942


rand_forest_class->class_score:
	 accuracy 0.9154228855721394
rand_forest_class->ridge_x_terms:
{'rmse': 4379.625555603834, 'mae': 2307.862224174572, 'r2': 0.8641761000582435}
	 rmse 4379.625555603834
	 mae 2307.862224174572
	 r2 0.8641761000582435
rand_forest_class->clustered_ridge:
{'rmse': 4424.333728044337, 'mae': 2418.757960126792, 'r2': 0.8613889061989333}
	 rmse 4424.333728044337
	 mae 2418.757960126792
	 r2 0.8613889061989333
rand_forest_class->lasso:
{'rmse': 4418.478182278509, 'mae': 2414.7974719208514, 'r2': 0.8617555632160584}
	 rmse 4418.478182278509
	 mae 2414.7974719208514
	 r2 0.8617555632160584
rand_forest_class->ols_regression:
{'rmse': 4379.557830599713, 'mae': 2307.3304071678235, 'r2': 0.8641803006927242}
	 rmse 4379.557830599713
	 mae 2307.3304071678235
	 r2 0.8641803006927242
rand_forest_class->random_forest_reg:
{'rmse': 4516.535208253194, 'mae': 2602.222375235576, 'r2': 0.8555515007964498}
	 rmse 4516.535208253194
	 mae 2602.222375235576
	 r2 0.8555515007964498
rand_forest_class->gradient_boosting_reg:
{'rmse': 4425.730562775533, 'mae': 2419.597293975663, 'r2': 0.8613013687906581}
	 rmse 4425.730562775533
	 mae 2419.597293975663
	 r2 0.8613013687906581
rand_forest_class->neural_net:
{'rmse': 4405.499379910087, 'mae': 2410.087708990788, 'r2': 0.8625665266047825}
	 rmse 4405.499379910087
	 mae 2410.087708990788
	 r2 0.8625665266047825


grad_boost_class->class_score:
	 accuracy 0.9154228855721394
grad_boost_class->ridge_x_terms:
{'rmse': 4379.625555603834, 'mae': 2307.862224174572, 'r2': 0.8641761000582435}
	 rmse 4379.625555603834
	 mae 2307.862224174572
	 r2 0.8641761000582435
grad_boost_class->clustered_ridge:
{'rmse': 4424.333728044337, 'mae': 2418.757960126792, 'r2': 0.8613889061989333}
	 rmse 4424.333728044337
	 mae 2418.757960126792
	 r2 0.8613889061989333
grad_boost_class->lasso:
{'rmse': 4418.478182278509, 'mae': 2414.7974719208514, 'r2': 0.8617555632160584}
	 rmse 4418.478182278509
	 mae 2414.7974719208514
	 r2 0.8617555632160584
grad_boost_class->ols_regression:
{'rmse': 4379.557830599713, 'mae': 2307.3304071678235, 'r2': 0.8641803006927242}
	 rmse 4379.557830599713
	 mae 2307.3304071678235
	 r2 0.8641803006927242
grad_boost_class->random_forest_reg:
{'rmse': 4516.535208253194, 'mae': 2602.222375235576, 'r2': 0.8555515007964498}
	 rmse 4516.535208253194
	 mae 2602.222375235576
	 r2 0.8555515007964498
grad_boost_class->gradient_boosting_reg:
{'rmse': 4425.730562775533, 'mae': 2419.597293975663, 'r2': 0.8613013687906581}
	 rmse 4425.730562775533
	 mae 2419.597293975663
	 r2 0.8613013687906581
grad_boost_class->neural_net:
{'rmse': 4405.499379910087, 'mae': 2410.087708990788, 'r2': 0.8625665266047825}
	 rmse 4405.499379910087
	 mae 2410.087708990788
	 r2 0.8625665266047825
from IPython.display import HTML, display
import tabulate

def display_table(data):
    html = "<h3>Performance of the Sequential Models (Classification + Regression)</h3>"
    html += "<h4>Root Mean Squared Error of Prediction (Training RMSE)</h4><table>"
    html += "<tr><th>class/reg</th><th>class. accuracy</th>"
    for r in regression_models:
        html += "<th><u>"+str(r)+"</u></th>"
    html += "</tr>"    
    
    for cm in data:
        html += "<tr><th><u>"+str(cm)+"</u></th>"
#         print(cm)
        for (rm) in data[cm]:
#             print(rm)
            if rm != "class_score":
                html += "<td>{:.2f} ({:.2f})</td>".format(data[cm][rm]["rmse"], training_results[rm])
            else:
                html += "<td>{:.2f}</td>".format(data[cm][rm]["accuracy"])

        html += "</tr>"
    html += "</table>"
#     print(html)
    display(HTML(html))
    
display_table(prediction_results)
from IPython.display import HTML, display
import tabulate
            
def display_table(data):
    html = "<h4>Mean Absolute Error</h4><table>"
    html += "<tr><th>class/reg</th><th>class. accuracy</th>"
    for r in regression_models:
        html += "<th><u>"+str(r)+"</u></th>"
    html += "</tr>"    
    
    for cm in data:
        html += "<tr><th><u>"+str(cm)+"</u></th>"
#         print(cm)
        for rm in data[cm]:
#             print(rm)
            if rm != "class_score":
                html += "<td>{:.2f}</td>".format(data[cm][rm]["mae"])
            else:
                html += "<td>{:.2f}</td>".format(data[cm][rm]["accuracy"])
        html += "</tr>"
    html += "</table>"
#     print(html)
    display(HTML(html))
    
display_table(prediction_results)

def display_table(data):
    html = "<h4>R-Squared Adjusted (out-of-sample)</h4><table>"
    html += "<tr><th>class/reg</th><th>class. accuracy</th>"
    for r in regression_models:
        html += "<th><u>"+str(r)+"</u></th>"
    html += "</tr>"
        
    for cm in data:
        html += "<tr><th><u>"+str(cm)+"</u></th>"
#         print(cm)
        for rm in data[cm]:
#             print(rm)
            if rm != "class_score":
                html += "<td>{:.4f}</td>".format(data[cm][rm]["r2"])
            else:
                html += "<td>{:.2f}</td>".format(data[cm][rm]["accuracy"])
        html += "</tr>"
    html += "</table>"
#     print(html)
    display(HTML(html))
    
display_table(prediction_results)

sequential_results = pd.DataFrame(columns=["regression_model", "rmse", "mae", "r2"])  # empty frame to collect the test metrics
for r in regression_models:
    rmse = prediction_results["grad_boost_class"][r]["rmse"]
    mae = prediction_results["grad_boost_class"][r]["mae"]
    r2 = prediction_results["grad_boost_class"][r]["r2"]
    sequential_results = sequential_results.append({'regression_model':r, 'rmse':rmse, "mae":mae, "r2":r2}, ignore_index=True)
    
sequential_results = sequential_results.sort_values("rmse")
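# note: DataFrame.append was removed in pandas 2.0; an equivalent way to build the same
# frame (assuming prediction_results and regression_models as above) would be
#   rows = [{"regression_model": r, **prediction_results["grad_boost_class"][r]} for r in regression_models]
#   sequential_results = pd.DataFrame(rows).sort_values("rmse")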
fig, ax = plt.subplots(3, 1, figsize=(12,7))
sns.barplot(x=sequential_results["regression_model"], y=sequential_results["rmse"], ax=ax[0])
ax[0].set(ylim=(4000, 4700))
sns.barplot(x=sequential_results["regression_model"], y=sequential_results["mae"], ax=ax[1])
ax[1].set(ylim=(2000, 2800))
sns.barplot(x=sequential_results["regression_model"], y=sequential_results["r2"], ax=ax[2])
ax[2].set(ylim=(0.8, 0.9))
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=0.4)

Performance of the Sequential Models (Classification + Regression)

Root Mean Squared Error of Prediction (test RMSE; training CV RMSE in parentheses)

class/reg | class. accuracy | ridge_x_terms | clustered_ridge | lasso | ols_regression | random_forest_reg | gradient_boosting_reg | neural_net
svm | 0.91 | 4559.97 (4597.85) | 4568.48 (2691.96) | 4562.81 (4659.32) | 4560.55 (4557.65) | 4515.22 (4708.78) | 4425.73 (4593.30) | 4509.00 (4695.92)
rand_forest_class | 0.92 | 4379.63 (4597.85) | 4424.33 (2691.96) | 4418.48 (4659.32) | 4379.56 (4557.65) | 4516.54 (4708.78) | 4425.73 (4593.30) | 4405.50 (4695.92)
grad_boost_class | 0.92 | 4379.63 (4597.85) | 4424.33 (2691.96) | 4418.48 (4659.32) | 4379.56 (4557.65) | 4516.54 (4708.78) | 4425.73 (4593.30) | 4405.50 (4695.92)

Mean Absolute Error

class/reg | class. accuracy | ridge_x_terms | clustered_ridge | lasso | ols_regression | random_forest_reg | gradient_boosting_reg | neural_net
svm | 0.91 | 2411.51 | 2508.21 | 2505.35 | 2411.40 | 2597.61 | 2419.60 | 2484.03
rand_forest_class | 0.92 | 2307.86 | 2418.76 | 2414.80 | 2307.33 | 2602.22 | 2419.60 | 2410.09
grad_boost_class | 0.92 | 2307.86 | 2418.76 | 2414.80 | 2307.33 | 2602.22 | 2419.60 | 2410.09

R-Squared (out-of-sample)

class/reg | class. accuracy | ridge_x_terms | clustered_ridge | lasso | ols_regression | random_forest_reg | gradient_boosting_reg | neural_net
svm | 0.91 | 0.8528 | 0.8522 | 0.8526 | 0.8527 | 0.8556 | 0.8613 | 0.8560
rand_forest_class | 0.92 | 0.8642 | 0.8614 | 0.8618 | 0.8642 | 0.8556 | 0.8613 | 0.8626
grad_boost_class | 0.92 | 0.8642 | 0.8614 | 0.8618 | 0.8642 | 0.8556 | 0.8613 | 0.8626

png

6.2 Prediction: Testing the Single Models (Only Regression)

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

#Evaluate classification on test data
prediction_results = {}

for rm in single_regression_models:
    print("################# ",rm)
    # specify test data for regression
    X_test = data_test[reg_X[rm]]
    print(reg_X[rm])
    y_test = np.array(data_test[reg_y]).reshape(-1, 1).flatten()
#         print(X_test[0:10])
#         print(y_test[0:10])
    if rm == "neural_net_single":
        reg_model = neural_net_pure
    else:
        reg_model = pickle.load(open("models/"+rm+".sav", "rb"))

    # Predict with regression model
#         print(X_test.info())
    if rm == "neural_net_single":
        y_pred = nn_pure_y_scaler.inverse_transform(reg_model.predict(X_test).reshape(-1, 1))
        y_pred = y_pred.reshape(-1, )
    else:
        y_pred = reg_model.predict(X_test)
        y_pred = y_pred.reshape(-1, )


    print(y_pred.shape)

    sns.scatterplot(x = data_test["age"], y=y_pred, label="y_pred", color="red")
    sns.scatterplot(x = data_test["age"], y=y_test, label="y_test", color="green")

    plt.show()

    # Calculate regression metrics
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    scores = {
        "rmse": rmse,
        "mae": mae,
        "r2": r2            
    }
    
    # add to dict
    prediction_results.update({str(rm): scores})
#################  ridge_single
['age', 'bmi', 'children', 'smoker_yes', 'smoker*bmi']
(402,)

png

#################  rand_forest_reg_single
['age', 'bmi', 'children', 'sex_male', 'smoker_yes', 'region_northeast', 'region_southeast', 'smoker*bmi']
(402,)

png

#################  grad_boost_reg_single
['age', 'bmi', 'children', 'smoker_yes', 'smoker*bmi']
(402,)

png

#################  neural_net_single
['age', 'bmi', 'children', 'sex_male', 'smoker_yes', 'region_northeast', 'region_southeast', 'region_southwest', 'smoker*bmi']
(402,)

png

from IPython.display import HTML, display
import tabulate

def display_table(data):
    html = "<h3>Performance of the Single Models (Only Regression)</h3>"
    html += "<h4>Root Mean Squared Error of Prediction (Training RMSE)</h4><table>"
    for r in single_training_results:
        html += "<th><u>"+str(r)+"</u></th>"
    html += "</tr>"    
    
    html += "<tr>"
    for (rm) in data:
#         print(rm)
        html += "<td>{:.2f} ({:.2f})</td>".format(prediction_results[rm]["rmse"], single_training_results[rm])
    html += "</tr>"
    html += "</table>"
#     print(html)
    display(HTML(html))
    
display_table(prediction_results)

def display_table(data):
    html = "<h4>Mean Absolute Error</h4><table>"
    for r in single_training_results:
        html += "<th><u>"+str(r)+"</u></th>"
    html += "</tr>"    
    
    html += "<tr>"
    for (rm) in data:
#         print(rm)
        html += "<td>{:.2f}</td>".format(prediction_results[rm]["mae"])
    html += "</tr>"
    html += "</table>"
#     print(html)
    display(HTML(html))
    
display_table(prediction_results)

def display_table(data):
    html = "<h4>R-Squared Adjusted (out-of-sample)</h4><table>"
    for r in single_training_results:
        html += "<th><u>"+str(r)+"</u></th>"
    html += "</tr>"    
    
    html += "<tr>"
    for (rm) in data:
#         print(rm)
        html += "<td>{:.2f}</td>".format(prediction_results[rm]["r2"])
    html += "</tr>"
    html += "</table>"
#     print(html)
    display(HTML(html))
    
display_table(prediction_results)

# Barplots
single_results = pd.DataFrame(columns=["regression_model", "rmse", "mae", "r2"])  # empty frame to collect the test metrics
for r in single_regression_models:
    rmse = prediction_results[r]["rmse"]
    mae = prediction_results[r]["mae"]
    r2 = prediction_results[r]["r2"]
    single_results = single_results.append({'regression_model':r, 'rmse':rmse, "mae":mae, "r2":r2}, ignore_index=True)
    
single_results = single_results.sort_values("rmse")
fig, ax = plt.subplots(3, 1, figsize=(12,7))
sns.barplot(x=single_results["regression_model"], y=single_results["rmse"], ax=ax[0])
ax[0].set(ylim=(4000, 4800))
sns.barplot(x=single_results["regression_model"], y=single_results["mae"], ax=ax[1])
ax[1].set(ylim=(2000, 2800))
sns.barplot(x=single_results["regression_model"], y=single_results["r2"], ax=ax[2])
ax[2].set(ylim=(0.8, 0.9))
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=0.4)

Performance of the Single Models (Only Regression)

Root Mean Squared Error of Prediction (test RMSE; training CV RMSE in parentheses)

ridge_single | rand_forest_reg_single | grad_boost_reg_single | neural_net_single
4775.81 (4938.27) | 4611.33 (4752.36) | 4433.55 (4603.21) | 4600.37 (4796.79)

Mean Absolute Error

ridge_single | rand_forest_reg_single | grad_boost_reg_single | neural_net_single
2820.95 | 2547.95 | 2438.08 | 2727.66

R-Squared (out-of-sample)

ridge_single | rand_forest_reg_single | grad_boost_reg_single | neural_net_single
0.84 | 0.85 | 0.86 | 0.85

png
