Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/two way scaling #104

Open
wants to merge 34 commits into
base: develop
Choose a base branch
from
Open

Feature/two way scaling #104

wants to merge 34 commits into from

Conversation

mnarayan
Copy link
Member

@mnarayan mnarayan commented Oct 9, 2017

Implements two-way centering and scaling of an input data matrix using successive normalization. Important for unbiased estimation of second-order dependence i.e. correlation matrices.

  • twoway_standardize implements the successive normalization
  • TwoWayStandardScaler is a class analogous to StandardScaler but stripped down, without support for sparse matrices or a functional inverse_transform. Key functionality is
    • fit which returns both row and column means, variance, standard deviation estimates, and transform which merely invokessuccessive_normalization.
    • The mean/scale estimates precomputed for fit are not identical to the ones needed for iterative centering/scaling. Thus, inefficient in that it does not take advantage of precomputed fit due to the need for iterative estimation.

Reference
"Successive Normalization of Rectangular Arrays" by Olshen and Rajaratnam
Ann. Statist. Volume 38, Number 3 (2010), 1638-1664.
https://projecteuclid.org/euclid.aos/1269452650

  • Add basic alternating estimation
  • Create a class analogous to sklearn's StandardScaler to wrap around the method twoway_standardize
  • Add convergence checks
  • Update docstrings for TwoWayStandardScaler
  • Add twoway-corrcoef alternative to quic_graph_lasso
  • Add example to show importance for BIC or any other sparse model selection along regularization path

@mnarayan mnarayan self-assigned this Oct 14, 2017
Copy link
Member

@jasonlaska jasonlaska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enjoy these comments, I'd like to take another look before merge.

)


def twoway_standardize(X, axis=0, with_mean=True, with_std=True, copy=True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to two_way_standardize

@@ -0,0 +1,323 @@
import numpy as np
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer name of file is two_way_standard_scalar.py. If you want it to be part of a larger set of cleaning tools you could put it in a module with inverse_covariance/clean/__init__.py but I think it's fine to leave it in here where it is (flat) and just rename it since we dont have a lot of these yet).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I addressed by just changing the file names. If we want to subscope it to clean or preprocessing later we can but since this is the only one, keeping it flat for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*addressed when I push changes

raise NotImplemented(
"Algorithm for sparse matrices currently not supported.")

else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for else here since you are raising, can bump indention to the left for this whole block after removing else

Xrow_polish = scale(Xcol_polish.T, axis=1,
with_mean=True, with_std=with_std)
n_iter += 1
err_norm_row = np.linalg.norm(oldXrow-Xrow_polish, 'fro')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add space between variables and operations oldXrow - Xrow_polish

with_mean=True, with_std=with_std)
n_iter += 1
err_norm_row = np.linalg.norm(oldXrow-Xrow_polish, 'fro')
err_norm_col = np.linalg.norm(oldXcol-Xcol_polish, 'fro')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add space between variables and operations

print('Input is sparse')
raise NotImplementedError(
'Algorithm for sparse matrices currently not supported.')
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for else after raise or return

else:
raise NotImplementedError(
'Two Way standardization not reversible with accuracy')
X = np.asarray(X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code will never execute

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure there is a test that tests each of these methods, that would have picked this up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this should be a warning not an exception.

sqcov_cols = np.diag(np.sqrt(var_cols))
return mu + sqcov_rows * X * sqcov_cols


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some tests for two_way_standardize that test that it does what's expected, not just the interface? Like with a fixed input and expected output.

from inverse_covariance.clean import (
twoway_standardize
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some tests for the TwoWayStandardScaler with fixed inputs and expected outputs that run over a number of options and ensure it's right. Also the standard sklearn check_estimator test (see example here: https://github.com/skggm/skggm/blob/develop/inverse_covariance/tests/common_test.py)

"""
check_is_fitted(self, 'row_scale_')

copy = copy if copy is not None else self.copy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think can be shortened to copy = copy or self.copy

@jasonlaska
Copy link
Member

Alright, I've reworked this a bit, I am going to leave some questions for you @mnarayan inline.


return mu + sqcov_rows * X * sqcov_cols


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mnarayan please review all of the tests in the file to see if its behaving as you would expect. You can add more tests to the parameterize dectorator if you know of good specific small examples.


# Q: This doesnt seem to actually get used in the transform, only in
# the inverse transform which it sounds like we should not support.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mnarayan The values computed in the fit/partial_fit routines do not get used in the two_way_standard_scaler routine, only in the "inverse transform" method. Do we actually need them or are they just cribbed from sklearn?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mnarayan did you see this question about partial fit? If these variables are not needed, we also don't need partial_fit, can simplify everything, and have this be a transform-only transformer.

estimator=self,
)
n_rows, n_cols = np.shape(X)
if self.n_cols_ != n_cols:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check was required to pass the sklearn check_estimator suite. The main idea is that we throw a value error if the dimention of the transform data is not the same as that of the fit method. Does this seem right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this sounds good.


# Q: Should ^ be a warning or should we just rais here and delete the
# rest of the code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed your exception to a warning and then the remaining code is the same as you had implemented. Is it desirable to retain this method or should it be removed?

TODO: tests for this method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the use case of inverse_transform really and only added what would be a reasonable one-step approximation to maintain consistency with API https://github.com/scikit-learn/scikit-learn/blob/f0ab589f/sklearn/preprocessing/data.py#L697.

Is it preferable to just create a function that does nothing with a warning that this functionality is not supported?

I think it is better to add approximate solution with some tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I can leave the interface, raise so that it breaks but says why, and delete the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its preferable not to provide an approximate solution if the utility is interface compatibility with a different utility-- it means maintenance for us and footguns for anyone using it.

@mnarayan
Copy link
Member Author

mnarayan commented Sep 10, 2018 via email

@jasonlaska
Copy link
Member

jasonlaska commented Sep 10, 2018

Ok so then as is, it seems good. @mnarayan in order to complete this branch I need the following things from you:

  • added examples using this transformer
  • additional tests on the main transform function that test for correctness
  • final review of whats here

You should be able to just git pull to your local branch. If there are conflicts, maybe just clone a fresh one in a new directory.

Be sure to run pip install black (in python3), black inverse_covariance/ and black examples/ before pushing or the Travis tests will not pass. This will run the autoformatter. After that I'll take a look and merge it down.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants