Updating incremental input importance documentation
drylks committed May 11, 2020
1 parent 4884b39 commit 53ac46c
Showing 9 changed files with 143 additions and 60 deletions.
Binary file modified docs/images/gm_separability_mov.gif
Binary file added docs/images/incremental_input_importance.png
30 changes: 17 additions & 13 deletions docs/latest/introduction/getting_started/index.rst
@@ -89,7 +89,7 @@ Pre-Learning: Input Importance
""""""""""""""""""""""""""""""
Once we know the problem is feasible using inputs at hand, the next question before we jump
into modeling is what are the inputs that are the most useful for solving this problem. Once
-more, this qustion is asked and answered independently from any classification model (hence the expression **pre-learning**),
+more, this question is asked and answered independently from any classification model (hence the expression **pre-learning**),
and reduces time wasted improving models fitted on irrelevant inputs.
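[Editor's note] A minimal sketch of this pre-learning step in code. It assumes, as the doctests on this page suggest, that kxy-aware DataFrames expose these methods directly; the CSV path and the problem='classification' argument are illustrative assumptions, mirroring the problem='regression' call shown further down.

# Minimal pre-learning sketch: rank inputs by usefulness before fitting any model.
# Assumptions: importing kxy makes DataFrames expose input_importance, as in the
# doctests on this page; 'bank_notes.csv' is a hypothetical path.
import pandas as pd
import kxy

df = pd.read_csv('bank_notes.csv')
importance_df = df.input_importance('Is Fake', problem='classification')
print(importance_df)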


@@ -172,11 +172,11 @@ Back to our bank note example, given how high an out-of-sample accuracy we got,
>>> test_df.classification_suboptimality('prediction', 'Is Fake', \
... discrete_input_columns=(), continuous_input_columns=())
2.52
->>> train_df.classification_feasibility('Is Fake')
-0.00
+>>> train_df.classification_feasibility('Is Fake')
+2.52
-As it turns out, a simple logistic regression allows us to extract 98% of the intrinsic value there is in using the 3 inputs above to determmine whether a bank note is fake. Thus, using a nonlinear model might not yield the highest ROI.
+As it turns out, a simple logistic regression allows us to extract nearly all of the intrinsic value there is in using the 3 inputs above to determine whether a bank note is fake. Thus, using a nonlinear model might not yield the highest ROI.

That a nonlinear model would not perform materially better than a linear model is consistent with the visualization below, where it can be seen that a curved boundary would not necessarily do a much better job at separating genuine (green) from fake (red) notes than a straight line.
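[Editor's note] A hedged aside on how suboptimality and feasibility fit together: later in this diff, regression suboptimality is defined as MI(x, y) - MI(yp, y). Assuming the classification variant is analogous, the share of achievable value a trained model captures is 1 - suboptimality / feasibility; the figures below are illustrative, not taken from this page.

# Back-of-the-envelope share of intrinsic value captured by a trained model,
# assuming suboptimality = feasibility - achieved information (both in nats).
def share_captured(suboptimality, feasibility):
    return 1.0 - suboptimality / feasibility

# Illustrative values only: a model 0.05 nats short of a 2.52-nat feasibility captures ~98%.
print('%.0f%%' % (100 * share_captured(0.05, 2.52)))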

@@ -241,18 +241,18 @@ Pre-Learning
>>> # Pre-Learning: How feasible or solvable is this problem? Are inputs any useful?
>>> print('Feasibility: %.4f, Entropy: %.4f' % (\
... df.regression_feasibility(label_column), kxy.scalar_continuous_entropy(df[label_column].values)))
-Feasibility: 1.8866, Entropy: 2.2420
+Feasibility: 2.1038, Entropy: 3.3815
>>> # Pre-Learning: How useful is each input individually?
>>> importance_df = df.input_importance(label_column, problem='regression')
>>> print(importance_df)
-                    input  importance  normalized_importance
-0           Froude Number      1.8194                 0.9975
-1     Length-Displacement      0.0018                 0.0010
-2   Longitudinal Position      0.0010                 0.0005
-3      Beam-Draught Ratio      0.0009                 0.0005
-4  Prismatic Coeefficient      0.0007                 0.0004
-5       Length-Beam Ratio      0.0002                 0.0001
+                    input  individual_importance  normalized_individual_importance
+0           Froude Number                 2.1038                            0.9978
+1     Length-Displacement                 0.0018                            0.0009
+2   Longitudinal Position                 0.0010                            0.0005
+3      Beam-Draught Ratio                 0.0009                            0.0004
+4  Prismatic Coeefficient                 0.0007                            0.0003
+5       Length-Beam Ratio                 0.0002                            0.0001
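[Editor's note] A quick sanity check on the new column names (the only assumption is that normalized_individual_importance is each importance divided by the column total, which the printed values bear out):

# Reproduce normalized_individual_importance from the table above.
importances = {
    'Froude Number': 2.1038,
    'Length-Displacement': 0.0018,
    'Longitudinal Position': 0.0010,
    'Beam-Draught Ratio': 0.0009,
    'Prismatic Coeefficient': 0.0007,  # spelled as printed in the table
    'Length-Beam Ratio': 0.0002,
}
total = sum(importances.values())  # ~2.1084 nats
for name, value in importances.items():
    print('%-22s %.4f' % (name, value / total))  # Froude Number -> 0.9978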
@@ -289,6 +289,10 @@ Post-Learning
>>> # Can we do better with a nonlinear model, without new inputs?
>>> print('Additive Suboptimality: %.4f' % \
... test_df.regression_additive_suboptimality('Prediction', label_column))
-Additive Suboptimality: 0.0279
+Additive Suboptimality: 0.6424
+>>> print('Suboptimality: %.4f' % \
+... test_df.regression_suboptimality('Prediction', label_column))
+Suboptimality: 0.8506
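[Editor's note] Reading the two post-learning diagnostics together (a hedged note: the identity below is the one this commit adds to kxy/data/dataframe.py, and the suboptimality returned there is the max of two estimates, so the gap is only implied if the identity-based estimate dominated):

# Relate the two numbers above via SO = ASO + h(y) - h(e), e = prediction - label.
aso = 0.6424  # additive suboptimality printed above
so = 0.8506   # suboptimality printed above
print('implied h(y) - h(e): %.4f nats' % (so - aso))  # ~0.2082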
143 changes: 105 additions & 38 deletions docs/latest/introduction/memoryless/index.rst

Large diffs are not rendered by default.

9 changes: 5 additions & 4 deletions examples/classification.py
@@ -37,10 +37,11 @@

importance_df = pd.concat([importance_df_1, importance_df_2], axis=1)
importance_df.reset_index(inplace=True)
-importance_df = importance_df.rename(columns={'individual_importance': 'Individual Importance', \
-'incremental_importance': 'Incremental Importance', 'index': 'Input', 'selection_order': 'Selection Order'})
-importance_df = importance_df[['Input', 'Individual Importance', 'Incremental Importance', 'Selection Order']]
-importance_df = importance_df.sort_values(by=['Selection Order'], ascending=True)
+importance_df.rename(columns={'individual_importance': 'Individual Importance', \
+'incremental_importance': 'Incremental Importance', 'index': 'Input', 'selection_order': 'Selection Order', \
+'input': 'Input'}, inplace=True)
+print(importance_df)
+importance_df = importance_df[['Input', 'Individual Importance', 'Incremental Importance', 'Selection Order']].sort_values(by=['Selection Order'], ascending=True)
ax = importance_df[['Input', 'Individual Importance', 'Incremental Importance']].plot.bar(x='Input', rot=0)
ax.set_ylabel('Importance (nats)')
plt.savefig('/Users/yl/Dropbox/KXY Technologies, Inc./GitHubCodeBase/kxy-python/docs/images/bn_importance.png', dpi=500)
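[Editor's note] Two things worth noting in this refactor: the rename now happens in place and maps both 'index' and 'input' to 'Input', so it works whether the concatenated frames carry the input names in the index or in a column; and the full importance table is printed before being trimmed to the columns used in the bar plot.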
2 changes: 1 addition & 1 deletion examples/regression.py
@@ -63,5 +63,5 @@
"""
# How suboptimal is this linear regression model?
print('Additive Suboptimality: %.4f' % test_df.regression_additive_suboptimality('Prediction', label_column))

+print('Suboptimality: %.4f' % test_df.regression_suboptimality('Prediction', label_column))

15 changes: 13 additions & 2 deletions kxy/data/dataframe.py
Expand Up @@ -13,7 +13,8 @@
import numpy as np
import pandas as pd

-from kxy.api.core import least_total_correlation, spearman_corr, least_continuous_conditional_mutual_information
+from kxy.api.core import least_total_correlation, spearman_corr, least_continuous_conditional_mutual_information, \
+scalar_continuous_entropy
from kxy.api import solve_copula_async
from kxy.classification import classification_feasibility, classification_suboptimality, \
classification_input_incremental_importance
@@ -578,15 +579,25 @@ def regression_suboptimality(self, prediction_column, label_column, input_column
:ref:`kxy.regression.post_learning.regression_suboptimality <regression-suboptimality>`
"""
self.adjust_quantized_values()
+assert not self.is_categorical(prediction_column), "The prediction column should not be categorical"
+assert not self.is_categorical(label_column), "The label column should not be categorical"

input_columns = [_ for _ in self.columns if _ not in (prediction_column, label_column) ] \
if input_columns == () else list(input_columns)

-return regression_suboptimality(self[prediction_column].values, self[label_column].values, \
+# Direct estimation of SO
+result_1 = regression_suboptimality(self[prediction_column].values, self[label_column].values, \
self[input_columns].values)

+# SO = ASO + h(y) - h(e)
+result_2 = self.regression_additive_suboptimality(prediction_column, label_column, \
+input_columns=input_columns)
+e = (self[prediction_column]-self[label_column]).values
+result_2 += scalar_continuous_entropy(self[label_column].values) - scalar_continuous_entropy(e)

+return max(result_1, result_2)
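[Editor's note] To make the identity concrete, a standalone sketch (an editor's illustration: it assumes only the scalar_continuous_entropy estimator imported above, applied to 1-D numpy arrays of labels y and predictions yp; suboptimality_from_additive is a hypothetical helper name):

# Recover SO from ASO through SO = ASO + h(y) - h(e), where e = yp - y
# is the additive residual, mirroring the method body above.
from kxy.api.core import scalar_continuous_entropy

def suboptimality_from_additive(aso, y, yp):
    e = yp - y  # matches self[prediction_column] - self[label_column] above
    return aso + scalar_continuous_entropy(y) - scalar_continuous_entropy(e)

Returning max(result_1, result_2) then keeps the larger of the direct estimate and this identity-based one, which is sensible if both are conservative estimates of the true suboptimality.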




2 changes: 1 addition & 1 deletion kxy/regression/post_learning.py
@@ -75,7 +75,7 @@ def regression_suboptimality(yp, y, x):
:ref:`kxy.api.core.mutual_information.least_continuous_mutual_information <least-continuous-mutual-information>`
"""
-return least_continuous_mutual_information(x, y)-least_continuous_mutual_information(yp, y)
+return max(least_continuous_mutual_information(x, y)-least_continuous_mutual_information(yp, y), 0.0)
(The clamp at zero presumably guards against estimation noise: the true suboptimality cannot be negative, but the difference of two estimated mutual informations can dip slightly below zero.)



2 changes: 1 addition & 1 deletion setup.py
@@ -8,7 +8,7 @@
import sys
sys.path.append('.')
from setuptools import setup, find_packages
version = "0.0.9"
version = "0.0.10"

setup(name="kxy",
version=version,
