Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seaborn Plots #62

Merged
merged 31 commits into from Sep 10, 2019
Merged

Seaborn Plots #62

merged 31 commits into from Sep 10, 2019

Conversation

jeff-hernandez
Copy link
Collaborator

@jeff-hernandez jeff-hernandez commented Aug 29, 2019

This PR uses Seaborn for plotting Label Times. Theses are plot examples for categorical and continuous Label Times. Closes #60. Closes #16 . Closes #56 .

@codecov
Copy link

codecov bot commented Aug 29, 2019

Codecov Report

Merging #62 into master will not change coverage.
The diff coverage is 100%.

@@          Coverage Diff          @@
##           master    #62   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files          16     17    +1     
  Lines         532    629   +97     
=====================================
+ Hits          532    629   +97
Impacted Files Coverage Δ
composeml/label_plots.py 100% <100%> (ø)
composeml/tests/utils.py 100% <100%> (ø) ⬆️
composeml/label_maker.py 100% <100%> (ø) ⬆️
composeml/tests/test_label_times.py 100% <100%> (ø) ⬆️
composeml/tests/test_label_maker.py 100% <100%> (ø) ⬆️
composeml/tests/test_label_plots.py 100% <100%> (ø) ⬆️
...seml/tests/test_label_transforms/test_threshold.py 100% <100%> (ø) ⬆️
composeml/conftest.py 100% <100%> (ø) ⬆️
composeml/label_times.py 100% <100%> (ø) ⬆️

@kmax12
Copy link
Contributor

kmax12 commented Aug 30, 2019

does count by time make sense for continuous? I feel like it only makes sense if you were to bucket it in discrete labels or if you take some sort of rolling average by time

@jeff-hernandez
Copy link
Collaborator Author

jeff-hernandez commented Aug 30, 2019

I agree. When count_by_time is called with continuous labels, not sure whether to:

  • make discrete automatically before plotting
  • return none
  • raise an error

Also, do we want to plot: continuous labels vs. cutoff times?

@kmax12
Copy link
Contributor

kmax12 commented Aug 30, 2019

maybe just use overall total number of label by time. don't try to break it up at all.

@kmax12 kmax12 reopened this Aug 30, 2019
@jeff-hernandez
Copy link
Collaborator Author

jeff-hernandez commented Aug 30, 2019

I see. So, we'd just count event per cutoff time without grouping by label.

@jeff-hernandez
Copy link
Collaborator Author

jeff-hernandez commented Aug 30, 2019

Updated plot for count by time here. May need a hard refresh in browser to remove cached files.

labels = labels.groupby(self.name)
distribution = labels['count'].count()
return distribution
def _is_categorical(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a user should be able to provide a label_type parameter to override this. Label types could be "categorical" or "continuous". we can also pass this parameter through when using the label maker. for example, bin would know to pass categorical through.

In terms of infer categorical, I'd also check for any non numeric dtyes like objects, strs, or the pandas categorical dtype

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

additionally, we only want to be calculating this once when the labeltime are initialized, not every time _is_categorical is accessed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about overriding it by using type hints?

>>> def labeling_function(df) -> 'categorical':
...

>>> labeling_function.__annotations__
{'return': 'categorical'}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, casting label times to categorical dtype if categorical. Related to #60.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that the type hints for annotations is very intuitive. might just be better to make it a parameter to the label maker

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, will make parameter for label_type

Copy link
Contributor

@kmax12 kmax12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the new plotting stuff looks good. just some comments on how we do the label type inference

composeml/label_maker.py Show resolved Hide resolved
composeml/label_times.py Show resolved Hide resolved
@@ -1,9 +1,10 @@
import pandas as pd

from composeml.label_plots import LabelPlots


class LabelTimes(pd.DataFrame):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the label times class should store the type of label it is

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added label_type attribute to label times

value = self.groupby('cutoff_time')
value = value[self.name].count()
value = value.cumsum()
return value

def describe(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the describe method should say the label type

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should label type be under settings? I guess it is a parameter of search.

@@ -327,14 +329,23 @@ def search(self,

labels = LabelTimes(data=labels, name=name, target_entity=self.target_entity)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should pass the label type to the LabelTime so it can track it as well. perhaps even move the logic to handle and check is into the init.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved check to label times init


else:
labels = labels.infer_type()
if labels.label_type == 'discrete':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do this casting before we init the label_times? like above line 330

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Labels are records (list of dictionaries) above line 330. Should I pass records to a pandas data frame to make categorical before initializing label times?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. it's fine to leave it here then

error = 'label type must be "continuous" or "discrete"'
assert label_type in ['continuous', 'discrete'], error

if label_type is None and name in self.columns:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would name not be in self.columns?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Came across behavior where pandas would initialize label times without passing value for name. I refactored logic to infer inside is_discrete only if label type is none. link

if is_discrete:
return True

labels = self[self.name].iloc[:100]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for now, let's only look at the dtype to infer type. we can always add this functionality in later

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated function

@@ -320,16 +355,9 @@ def infer_type(self):
"""Infer label type.

Returns:
LabelTimes : Label Times as inferred type.
str : Inferred label type. Can be "continuous" or "discrete".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think intuitively I'd expect the logic in is_discrete to be here in infer_type and then I'd expect is_discrete to just check if label_type == "discrete.

then every we currently check if self.label_type == 'discrete' we'd just replace with is_discrete

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated logic

Copy link
Contributor

@kmax12 kmax12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jeff-hernandez jeff-hernandez merged commit 50e9f90 into master Sep 10, 2019
@jeff-hernandez jeff-hernandez mentioned this pull request Sep 10, 2019
@jeff-hernandez jeff-hernandez deleted the seaborn-plots branch September 13, 2019 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Plotly Backend Cast Categorical Seaborn Plots
2 participants