Feature/two way scaling #104

mnarayan · 2017-10-09T03:53:31Z

Implements two-way centering and scaling of an input data matrix using successive normalization. This is important for unbiased estimation of second-order dependence, i.e. correlation matrices.

twoway_standardize implements the successive normalization.
TwoWayStandardScaler is a class analogous to StandardScaler but stripped down, without support for sparse matrices or a functional inverse_transform. Key functionality:
- fit returns row and column estimates of the mean, variance, and standard deviation; transform merely invokes successive_normalization.
- The mean/scale estimates precomputed by fit are not identical to the ones needed for iterative centering/scaling. transform is therefore inefficient: it cannot reuse the precomputed fit and has to re-estimate iteratively.
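For readers who want the gist of the iteration, here is a minimal sketch of successive normalization. It is not the code in this PR: the function name successive_normalize, the stopping rule, and the defaults max_iter and tol are assumptions made for illustration, and it only handles dense arrays with no constant rows or columns.

```python
import numpy as np


def successive_normalize(X, max_iter=100, tol=1e-6):
    """Alternately standardize rows and columns until the iterates settle.

    Hypothetical helper for illustration; not the PR's twoway_standardize.
    """
    X = np.asarray(X, dtype=float).copy()
    for n_iter in range(1, max_iter + 1):
        X_old = X.copy()
        # Row polish: give every row zero mean and unit standard deviation.
        X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
        # Column polish: do the same for every column.
        X = (X - X.mean(axis=0, keepdims=True)) / X.std(axis=0, keepdims=True)
        # Stop once a full sweep barely changes the matrix (Frobenius norm).
        if np.linalg.norm(X - X_old, "fro") < tol:
            break
    return X, n_iter


rng = np.random.RandomState(0)
X_polished, n_iter = successive_normalize(rng.randn(20, 8))
```

Because the last step of each sweep is the column polish, the columns of X_polished are exactly standardized, while the rows are only approximately standardized, to a degree that depends on how far the iteration has converged.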

Reference
"Successive Normalization of Rectangular Arrays" by Olshen and Rajaratnam
Ann. Statist. Volume 38, Number 3 (2010), 1638-1664.
https://projecteuclid.org/euclid.aos/1269452650

Add basic alternating estimation
Create a class analogous to sklearn's StandardScaler to wrap around the method twoway_standardize
Add convergence checks
Update docstrings for TwoWayStandardScaler
Add twoway-corrcoef alternative to quic_graph_lasso
Add example to show importance for BIC or any other sparse model selection along regularization path

jasonlaska

Enjoy these comments, I'd like to take another look before merge.

jasonlaska · 2017-10-14T16:44:04Z

inverse_covariance/clean.py

+    )
+
+
+def twoway_standardize(X, axis=0, with_mean=True, with_std=True, copy=True,


rename to two_way_standardize

jasonlaska · 2017-10-14T16:45:22Z

inverse_covariance/clean.py

@@ -0,0 +1,323 @@
+import numpy as np


I'd prefer the file be named two_way_standard_scaler.py. If you want it to be part of a larger set of cleaning tools you could put it in a module with inverse_covariance/clean/__init__.py, but I think it's fine to leave it where it is (flat) and just rename it, since we don't have a lot of these yet.

I addressed this by just changing the file names. If we want to subscope it to clean or preprocessing later we can, but since this is the only one, I'm keeping it flat for now.

*addressed when I push changes

jasonlaska · 2017-10-14T16:47:18Z

inverse_covariance/clean.py

+        raise NotImplemented(
+                "Algorithm for sparse matrices currently not supported.")
+
+    else:


No need for else here since you are raising; after removing the else, the whole block can move one indentation level to the left.
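A minimal sketch of the guard-clause shape being asked for. The function name standardize_dense and the is_sparse flag are hypothetical stand-ins for the real sparse check. Note that the quoted diff raises NotImplemented, which is a sentinel value rather than an exception class; the sketch assumes NotImplementedError is what was intended.

```python
def standardize_dense(X, is_sparse=False):
    """Hypothetical stand-in for the dense-only code path."""
    # Guard clause: raise early, so the main path needs no else branch.
    if is_sparse:
        # NotImplementedError is the exception class. NotImplemented is a
        # sentinel for binary operators and is not callable, so
        # `raise NotImplemented(...)` fails with a TypeError instead.
        raise NotImplementedError(
            "Algorithm for sparse matrices currently not supported."
        )

    # Main path continues at the outer indentation level.
    return [[float(v) for v in row] for row in X]
```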

jasonlaska · 2017-10-14T16:47:47Z

inverse_covariance/clean.py

+            Xrow_polish = scale(Xcol_polish.T, axis=1,
+                                with_mean=True, with_std=with_std)
+            n_iter += 1
+            err_norm_row = np.linalg.norm(oldXrow-Xrow_polish, 'fro')


Add spaces around operators: oldXrow - Xrow_polish

jasonlaska · 2017-10-14T16:47:57Z

inverse_covariance/clean.py

+                                with_mean=True, with_std=with_std)
+            n_iter += 1
+            err_norm_row = np.linalg.norm(oldXrow-Xrow_polish, 'fro')
+            err_norm_col = np.linalg.norm(oldXcol-Xcol_polish, 'fro')


add space between variables and operations

jasonlaska · 2017-10-14T17:02:43Z

inverse_covariance/clean.py

+            print('Input is sparse')
+            raise NotImplementedError(
+                'Algorithm for sparse matrices currently not supported.')
+        else:


no need for else after raise or return

jasonlaska · 2017-10-14T17:02:55Z

inverse_covariance/clean.py

+        else:
+            raise NotImplementedError(
+                'Two Way standardization not reversible with accuracy')
+            X = np.asarray(X)


This code will never execute, since it follows the raise.

Make sure there is a test for each of these methods; that would have caught this.

Seems like this should be a warning not an exception.
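A minimal sketch of the warning variant, assuming the method should keep running after notifying the caller. The name inverse_transform_sketch is hypothetical and the body is a placeholder, not the approximate inverse from the PR.

```python
import warnings


def inverse_transform_sketch(X):
    """Hypothetical stand-in for the inverse_transform method."""
    # Warn instead of raising, so the statements below are reachable.
    warnings.warn(
        "Two Way standardization not reversible with accuracy", UserWarning
    )
    # Placeholder for the approximate inverse; here it just copies the rows.
    return [list(row) for row in X]


# A test can assert on the warning as well as on the returned value.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = inverse_transform_sketch([[1.0, 2.0]])
```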

jasonlaska · 2017-10-14T17:04:20Z

inverse_covariance/tests/clean_test.py

+    sqcov_cols = np.diag(np.sqrt(var_cols))
+    return mu + sqcov_rows * X * sqcov_cols
+
+


Can we add some tests for two_way_standardize that test that it does what's expected, not just the interface? Like with a fixed input and expected output.

jasonlaska · 2017-10-14T17:05:53Z

inverse_covariance/tests/clean_test.py

+from inverse_covariance.clean import (
+   twoway_standardize
+)
+


Can we add some tests for the TwoWayStandardScaler with fixed inputs and expected outputs that run over a number of options and ensure it's right. Also the standard sklearn check_estimator test (see example here: https://github.com/skggm/skggm/blob/develop/inverse_covariance/tests/common_test.py)

jasonlaska · 2017-10-14T17:07:30Z

inverse_covariance/clean.py

+        """
+        check_is_fitted(self, 'row_scale_')
+
+        copy = copy if copy is not None else self.copy


I think this can be shortened to copy = copy or self.copy
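One thing worth checking before shortening: the two forms differ when the caller passes copy=False explicitly, because False is falsy and the or form then falls back to the instance default. A small sketch, using a hypothetical Scaler class that holds nothing but the default flag:

```python
class Scaler:
    """Hypothetical stand-in holding only a default ``copy`` flag."""

    def __init__(self, copy=True):
        self.copy = copy

    def resolve_with_none_check(self, copy=None):
        # Only None means "use the instance default".
        return copy if copy is not None else self.copy

    def resolve_with_or(self, copy=None):
        # Any falsy value, including an explicit False, uses the default.
        return copy or self.copy


scaler = Scaler(copy=True)
explicit = scaler.resolve_with_none_check(copy=False)  # caller's False is kept
shortened = scaler.resolve_with_or(copy=False)  # default True wins instead
```

If an explicit copy=False is meant to override the default, the longer form is the one that preserves it.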

jasonlaska · 2018-09-09T22:34:40Z

Alright, I've reworked this a bit, I am going to leave some questions for you @mnarayan inline.

jasonlaska · 2018-09-09T22:35:39Z

inverse_covariance/tests/two_way_standard_scaler_test.py

+
+    return mu + sqcov_rows * X * sqcov_cols
+
+


@mnarayan please review all of the tests in the file to see if it's behaving as you would expect. You can add more tests to the parametrize decorator if you know of good specific small examples.

jasonlaska · 2018-09-09T22:36:54Z

inverse_covariance/two_way_standard_scaler.py

+
+        # Q: This doesnt seem to actually get used in the transform, only in
+        #    the inverse transform which it sounds like we should not support.
+


@mnarayan The values computed in the fit/partial_fit routines do not get used in the two_way_standard_scaler routine, only in the "inverse transform" method. Do we actually need them or are they just cribbed from sklearn?

@mnarayan did you see this question about partial fit? If these variables are not needed, we also don't need partial_fit, can simplify everything, and have this be a transform-only transformer.

jasonlaska · 2018-09-09T22:37:55Z

inverse_covariance/two_way_standard_scaler.py

+            estimator=self,
+        )
+        n_rows, n_cols = np.shape(X)
+        if self.n_cols_ != n_cols:


This check was required to pass the sklearn check_estimator suite. The main idea is that we throw a ValueError if the dimension of the transform data is not the same as that of the data passed to fit. Does this seem right?

Yeah this sounds good.
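For reference, a minimal sketch of the kind of check being discussed. The class name ShapeCheckingTransformer and the error message are assumptions, and plain lists stand in for arrays; only the n_cols_ comparison mirrors the diff.

```python
class ShapeCheckingTransformer:
    """Hypothetical transformer that only remembers the fitted shape."""

    def fit(self, X):
        # Record the shape seen at fit time.
        self.n_rows_ = len(X)
        self.n_cols_ = len(X[0])
        return self

    def transform(self, X):
        n_cols = len(X[0])
        # Refuse data whose column count differs from the fitted data.
        if self.n_cols_ != n_cols:
            raise ValueError(
                "X has %d columns, but the transformer was fit with %d."
                % (n_cols, self.n_cols_)
            )
        return X
```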

jasonlaska · 2018-09-09T22:38:48Z

inverse_covariance/two_way_standard_scaler.py

+
+        # Q: Should ^ be a warning or should we just rais here and delete the
+        #    rest of the code?
+


I've changed your exception to a warning and then the remaining code is the same as you had implemented. Is it desirable to retain this method or should it be removed?

TODO: tests for this method.

I don't understand the use case of inverse_transform really and only added what would be a reasonable one-step approximation to maintain consistency with API https://github.com/scikit-learn/scikit-learn/blob/f0ab589f/sklearn/preprocessing/data.py#L697.

Is it preferable to just create a function that does nothing with a warning that this functionality is not supported?

I think it is better to add an approximate solution with some tests.

Then I can leave the interface, raise so that it breaks but says why, and delete the code.

I think it's preferable not to provide an approximate solution if its only utility is interface compatibility with a different utility: it means maintenance for us and footguns for anyone using it.

mnarayan · 2018-09-10T21:32:23Z

There is a use case for fit/partial_fit independent of transform. For instance, I may want to learn the row/column means from one dataset, such as the training samples, and later use them to transform the test set; or I may want to retrieve the mean/variance parameters only, without applying transform.
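A rough sketch of that use case, under simplifying assumptions: the statistics are computed directly with NumPy rather than through TwoWayStandardScaler, the arrays are made up for illustration, and the test set gets a one-step column standardization, not the iterative two-way procedure.

```python
import numpy as np

# Toy data, invented for this illustration.
X_train = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
X_test = np.array([[2.0, 4.0]])

# Statistics learned from the training set only (what fit would store).
col_mean = X_train.mean(axis=0)
col_scale = X_train.std(axis=0)
row_mean = X_train.mean(axis=1)

# The stored column statistics can be inspected, or reused on new data,
# without ever calling transform on the training data.
X_test_scaled = (X_test - col_mean) / col_scale
```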


jasonlaska · 2018-09-10T22:01:12Z

Ok, so then as is, it seems good. @mnarayan in order to complete this branch I need the following things from you:

- added examples using this transformer
- additional tests on the main transform function that test for correctness
- final review of what's here

You should be able to just git pull to your local branch. If there are conflicts, maybe just clone a fresh one in a new directory.

Be sure to run pip install black (in python3), black inverse_covariance/ and black examples/ before pushing or the Travis tests will not pass. This will run the autoformatter. After that I'll take a look and merge it down.


mnarayan added 2 commits October 9, 2017 03:20

Added basic functions to test two-way center and scaling

9e50115

Added basic twoway standardization algorithm. Relevant to issue #93

2f74e1b

mnarayan added enhancement examples labels Oct 9, 2017

mnarayan added 14 commits October 9, 2017 05:30

Cleaned up TwoWayStandardScaler API. partial_fit not supported

f2a1b20

Reset internal row,col attributes

a30bc8e

Added basic structure for partial_fit

ed6fd34

partial_fit now calculates row, col statistics

47c87cb

Added convergence checks. Algorithm completed

2dcde0a

Transform now calls twoway_standardize

dd92d80

Updated algorithm. Test passes

a754f44

Fixed bug in transform()

4a3c038

Return original dimensions

cc1d8d3

inverse_transform completed, raises not implemented error

16b1116

Delinting

a58383a

More delinting

5904f60

Fixed import error

a17a530

Added clean.py

e5395bd

mnarayan requested a review from jasonlaska October 12, 2017 18:42

mnarayan self-assigned this Oct 14, 2017

jasonlaska requested changes Oct 14, 2017


jasonlaska added 9 commits September 9, 2018 11:22

Fix merge conflicts

87ee167

Rename files from clean to two_way_standard_scaler

a2940df

Add estimator check

238f393

Rename commont_test to sklearn_test as is more descriptive of this test.

34dc936

Address initial comments and some cleanup.

7869659

Black formatting and more simplification and cleanup.

9cbb212

Black formatting and more simplification and cleanup.

a8e980f

Ensure interface can be validated.

e864e72

More simplification.

7f86bb3

jasonlaska added 2 commits September 9, 2018 12:32

Autoformat.

748fe33

Bring back partial_fit capability, add tests, ask questions.

d2800fc

jasonlaska reviewed Sep 9, 2018


Minor cleanup.

7c37030

jasonlaska approved these changes Sep 9, 2018


jasonlaska added 4 commits September 10, 2018 08:00

Raise on inverse transform, remove code.

1757216

Remove unneeded check.

17806e8

Remove redundant raise.

f1f682e

Remove unneeded comments.

eb8c54b

jasonlaska added 2 commits September 10, 2018 16:30

Merge branch 'develop' of github.com:skggm/skggm into feature/two-way-scaling

070c017

Merge branch 'develop' of github.com:skggm/skggm into feature/two-way-scaling

4f8267e
