
Added new classifier ShrinkageLDA to gda.py #565

Open
mmbannert wants to merge 12 commits into PyMVPA:master from Twitterfax:shrinkage-lda

Conversation

mmbannert

Dear PyMVPA users,

Since I couldn't find an implementation of "shrinkage LDA" (e.g., Pereira & Botvinick, NeuroImage, 2011) in PyMVPA, I implemented my own based on Gaussian Discriminant Analysis (gda.py). I figured it could be helpful for other users as well.

It is considerably faster than using the scikit-learn implementation through the SKLLearnerAdapter, which I have been using until now:

SKLLearnerAdapter(LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto'))

It is also a little more flexible because you can choose between the Ledoit-Wolf estimator and the Oracle Approximating Shrinkage.
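A hypothetical usage sketch of that choice (import path per this PR's initial diff, where the class lives in gda.py; during review it is moved to shrinklda.py):

# Illustrative sketch only; names follow this PR's diff.
from mvpa2.clfs.gda import ShrinkageLDA

clf_lw = ShrinkageLDA()                            # Ledoit-Wolf (the default)
clf_oas = ShrinkageLDA(shrinkage_estimator='OAS')  # Oracle Approximating Shrinkage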

May it be useful.

Cheers,
Michael

@coveralls

coveralls commented Jan 24, 2018


Coverage increased (+0.09%) to 80.289% when pulling c766d86 on Twitterfax:shrinkage-lda into 0b0c57a on PyMVPA:master.

@yarikoptic
Member

Are you interested in us guiding you @Twitterfax to polish up the PR (making your code tested, not failing as it is now, etc.), or would you rather we just do it ourselves (might take longer ;) )?

@mmbannert
Author

Please do guide me. It would be much appreciated :)

For example, one thing I wasn't sure about was whether scikit-learn is installed along with PyMVPA by default. If it is not available, I'd have to integrate the shrinkage estimators manually. I'm guessing that this could be a problem.

How can I inspect the Travis CI details?

@yarikoptic
Member

Great:

  • I will leave comments throughout your PR, which will be visible in the "Files changed" tab
  • Travis CI details -- just click on "Details" next to the "continuous-integration/travis-ci/pr" check above,
    then go to the failing (red) build to see that it is indeed about your code not being guarded against sklearn being unavailable. I will leave comments pointing to how it should be done.

I would also recommend installing codecov browser extension which would help to see which lines of your code are covered by the tests and which not, right in your browser while reviewing differences in that "Files changed" tab: https://github.com/codecov/browser-extension#choose-your-browser-below

@yarikoptic left a comment
Member

Basic quick review #1: changes to be done to get testing going etc

__tags__ = GDA.__tags__ + ['linear', 'lda', 'shrinkage']

def __init__(self, shrinkage_estimator='LedoitWolf', block_size=1000, \
**kwargs):

Have a look at some other classifiers, e.g. GDA or others, to see how parameters are ideally defined (via those Parameter definitions at the class level).

Also, no need for the trailing \ here since the statement is within ().
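A hedged sketch of what such class-level Parameter definitions might look like, mirroring GDA's style (the parameter names and defaults are this PR's; the exact doc strings are illustrative):

# Illustrative sketch only -- parameters declared at the class level,
# the way GDA declares its own.
from mvpa2.clfs.gda import GDA
from mvpa2.base.param import Parameter
from mvpa2.base.constraints import EnsureChoice

class ShrinkageLDA(GDA):
    __tags__ = GDA.__tags__ + ['linear', 'lda', 'shrinkage']

    shrinkage_estimator = Parameter(
        'LedoitWolf',
        constraints=EnsureChoice('LedoitWolf', 'OAS'),
        doc="Which covariance shrinkage estimator to use")
    block_size = Parameter(
        1000,
        doc="Block size passed to sklearn's LedoitWolf estimator")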

@@ -35,12 +35,12 @@
from mvpa2.base.constraints import EnsureChoice
from mvpa2.base.state import ConditionalAttribute
#from mvpa2.measures.base import Sensitivity

from sklearn.covariance import LedoitWolf, OAS

Since this code depends on sklearn's presence, it might be better to move it into a separate file (e.g., shrinklda.py or alike), which would help to import it conditionally on sklearn presence within the warehouse.

if shrinkage_estimator is 'LedoitWolf':
self.shrinkage_estimator = LedoitWolf(block_size=block_size)
elif shrinkage_estimator is 'OAS':
self.shrinkage_estimator = OAS()

Better to delay this instantiation until _train, and not even assign that instance to any ShrinkageLDA attribute. That would allow changing that parameter of a classifier instance and retraining, and would keep its repr correct.
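A hedged sketch of that suggestion, shown as a method body with the surrounding class and the rest of training elided:

# Illustrative sketch: build the estimator inside _train() from the current
# parameter values instead of caching an instance in __init__.
from sklearn.covariance import LedoitWolf, OAS

def _train(self, dataset):
    params = self.params
    # note: compare strings with ==, not `is` (identity is not guaranteed)
    if params.shrinkage_estimator == 'LedoitWolf':
        estimator = LedoitWolf(block_size=params.block_size)
    elif params.shrinkage_estimator == 'OAS':
        estimator = OAS()
    else:
        raise ValueError("Unknown shrinkage_estimator: %r"
                         % (params.shrinkage_estimator,))
    # ... continue training using `estimator` ...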

# because it only computes the class-dependent SSCP matrices. The
# shrinkage coefficient, however, needs to be computed from the whole
# time series (with samples centered on their respective class means)
params = self.params

Here we go -- this is where you would find those Parameters after you move them out to be defined at the class level, e.g. params.shrinkage_estimator.

# Store prior probabilities
self.priors = self._get_priors(nlabels, nsamples, nsamples_per_class)

if __debug__ and 'ShrinkageLDA' in debug.active:

You would need to add ShrinkageLDA among the debug targets, alongside GDA etc.:

$> git grep GDA -- mvpa2/base/   
mvpa2/base/__init__.py:    debug.register('GDA', "GDA - Gaussian Discriminant Analyses")
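i.e., a one-line addition along these lines (the key must match the 'ShrinkageLDA' guard in the PR's code; the description text is illustrative):

# In mvpa2/base/__init__.py, next to the GDA line above:
debug.register('ShrinkageLDA', "ShrinkageLDA - LDA with covariance shrinkage")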

@@ -18,7 +18,7 @@
MulticlassClassifier, RegressionAsClassifier
from mvpa2.clfs.smlr import SMLR
from mvpa2.clfs.knn import kNN
from mvpa2.clfs.gda import LDA, QDA
from mvpa2.clfs.gda import LDA, QDA, ShrinkageLDA

After you move that definition into a dedicated file, you could place the import within the if externals.exists('skl') block below, so it would be imported only when sklearn is installed.

Then add an actual instance of the class, similarly to how it is done for LDA, GDA, etc. in this file. This should get its testing going and we will continue from there.
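A hedged sketch of what that warehouse change might look like (assuming the clfswh collection and the descr convention used elsewhere in warehouse.py):

# Illustrative sketch for mvpa2/clfs/warehouse.py: guard the import and only
# register the classifier when sklearn is available.
if externals.exists('skl'):
    from mvpa2.clfs.shrinklda import ShrinkageLDA
    clfswh += ShrinkageLDA(descr='ShrinkageLDA()')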

@mmbannert
Author

Awesome! Thanks for your time and help. I will try to implement the suggested changes.

@codecov-io

codecov-io commented Jan 27, 2018

Codecov Report

Merging #565 into master will increase coverage by 0.08%.
The diff coverage is 92.3%.


@@            Coverage Diff            @@
##           master    #565      +/-   ##
=========================================
+ Coverage   75.71%   75.8%   +0.08%     
=========================================
  Files         364     366       +2     
  Lines       41453   41555     +102     
  Branches     6683    6694      +11     
=========================================
+ Hits        31386   31500     +114     
+ Misses       8213    8194      -19     
- Partials     1854    1861       +7
Impacted Files Coverage Δ
mvpa2/clfs/gda.py 88.13% <ø> (ø) ⬆️
mvpa2/tests/test_shrinklda.py 100% <100%> (ø)
mvpa2/clfs/warehouse.py 70% <100%> (+0.27%) ⬆️
mvpa2/base/__init__.py 82.72% <100%> (+0.07%) ⬆️
mvpa2/suite.py 82.99% <75%> (-0.12%) ⬇️
mvpa2/clfs/shrinklda.py 89.09% <89.09%> (ø)
mvpa2/tests/test_support.py 98.24% <0%> (-1.17%) ⬇️
mvpa2/clfs/stats.py 71.91% <0%> (-0.81%) ⬇️
mvpa2/tests/test_searchlight_hyperalignment.py 96.7% <0%> (-0.42%) ⬇️
mvpa2/tests/test_usecases.py 98.92% <0%> (-0.36%) ⬇️
... and 9 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b0c57a...c766d86.

@mmbannert
Author

Hi Yaroslav. Thanks again for your feedback and patience. Things look ok now... I suppose(?)

@yarikoptic
Member

Good. Now it would be great to add a dedicated unit test which would compare results against the sklearn-wrapped implementation you mentioned -- they should be close to identical, right? I can provide a bit more detail later, or you could check out other tests to see how it could be done.
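For example, a hedged sketch of such a comparison (the dataset parameters and the tolerance are illustrative assumptions, not the PR's actual test):

# Illustrative comparison sketch; dataset size and tolerance are assumptions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from mvpa2.clfs.skl.base import SKLLearnerAdapter
from mvpa2.clfs.shrinklda import ShrinkageLDA
from mvpa2.generators.partition import NFoldPartitioner
from mvpa2.measures.base import CrossValidation
from mvpa2.misc.data_generators import normal_feature_dataset

ds = normal_feature_dataset(perlabel=20, nlabels=2, nfeatures=100, snr=3.0)
errors = []
for clf in (ShrinkageLDA(),
            SKLLearnerAdapter(LinearDiscriminantAnalysis(solver='lsqr',
                                                         shrinkage='auto'))):
    cv = CrossValidation(clf, NFoldPartitioner())
    errors.append(np.mean(cv(ds)))
# expect near-identical mean cv errors (the exact tolerance is a judgment call)
assert abs(errors[0] - errors[1]) < 0.1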

@mmbannert
Author

Ok, I have used test_gnb.py etc. as a template to write the test. Let me know if it needs improvement. How would such a test typically be used?

@yarikoptic
Member

How would such a test typically be used?

Look into .travis.yml (which is a bit of a mouthful) for how we invoke testing using nose. To test locally, just run

python -m nose -s -v mvpa2/tests/test_shrinklda.py

to run your added tests only.

@mmbannert
Author

Travis CI doesn't complain about my coding anymore. I take it as a positive sign ;)

@mmbannert
Author

Hi Yaroslav,
Is there anything else I can do to improve the code? I sped up the test by simulating a smaller (but still p > n) dataset; it now runs 20 times faster on my computer.
Cheers,
Michael

@yarikoptic left a comment
Member

Sorry for the delay. Almost there -- what about the actual cv error? (see my comments)


# First check if scikit-learn is available at all
if not externals.exists('skl') or externals.versions['skl'] < '0.17':
self.skipTest('sklearn (0.17 or higher) unavailable')

Ha -- never used it this way... According to the docs this should indeed raise an exception, so no need for an explicit else: block indentation -- the code could be dedented, since it would not run if an old sklearn is present and this exception is raised.

Anyway, you could just use the line

skip_if_no_external('skl', min_version='0.17')

delta_t = []
for clf in [pymvpa_clf, skl_clf]:
cv = CrossValidation(clf, partitioner)
_ = cv(ds)

Well, the test could compare those results -- your implementation should be not just faster but also provide a similar (better?) result, right? ;-) Otherwise -- what would be the point of having a faster but notably worse implementation?


@mmbannert
Author

Thanks for the feedback!

1.) Yes, this is a much cleaner solution with skip_if_no_external.
2.) I thought about this too, and the solution I came up with was to compare the cv error of the new classifier with that of the old LDA implementation and with the cv error of SVM classification. A comparison between the two shows that the new results are more similar to the old LDA version than to SVM -- but only in about 95% of the cases. For a test that should be run routinely together with a bunch of other tests, I figured that this would be too low. You don't want the whole test suite to fail just because you are in the remaining 5% (roughly) where the comparison fails, right? Do you have any ideas to solve this?

Otherwise I suggest that I could improve the documentation of the new class. The code is relatively simple, but with a little more detail on what the algorithm does, users could easily convince themselves that it indeed does what it is supposed to do: implement LDA with shrinkage on the within-class covariance matrix. In the end, this is the point, right? In this way the "validity" of the new algorithm would follow directly from its implementation rather than from a comparison with other implementations.

Cheers,
Michael

@yarikoptic
Member

Sorry @Twitterfax again for the delay with the feedback, I will try to do better ;)

"to the old LDA version than SVM" -- a bit confused why we would want to compare to SVM here. I thought the result should be similar to the one produced by sklearn, which you compare against in terms of speed.

re unstable test(s): we do allow for tests which only sometimes fail. It is somewhat of a curse of ours, but we do have such tests and do not fix the random seed to guarantee that it is always the same data/result. You could use either of the following (see the sketch after this list):

  1. if cfg.getboolean('tests', 'labile', default='yes'): before the check which might fail at times
  2. @labile(5,1) (from mvpa2.testing.tools) to annotate the entire test so it runs up to 5 times to allow for no more than 1 failure (iirc) -- it would need random data regeneration within, and we use it more rarely
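A hedged sketch of option 2 (the test name and body are illustrative, not the PR's actual test):

# Illustrative sketch of the @labile decorator mentioned above.
from mvpa2.testing.tools import labile
from mvpa2.misc.data_generators import normal_feature_dataset

@labile(5, 1)  # rerun up to 5 times, tolerating no more than 1 failure (iirc)
def test_shrinklda_vs_skl():
    # regenerate random data on every run so that retries actually differ
    ds = normal_feature_dataset(perlabel=20, nlabels=2, nfeatures=100)
    # ... compare cv errors of ShrinkageLDA against the sklearn adapter ...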
