
Added new classifier ShrinkageLDA to gda.py #565

Open
mmbannert wants to merge 12 commits into PyMVPA:master from Twitterfax:shrinkage-lda

Conversation

mmbannert

Dear PyMVPA users,

Since I couldn't find an implementation of "shrinkage LDA" (e.g., Pereira & Botvinick, NeuroImage, 2011) in PyMVPA, I implemented my own based on Gaussian Discriminant Analysis (gda.py). I figured it could be helpful for other users as well.

It is considerably faster than using the scikit-learn implementation through the SKLLearnerAdapter, which I have been using until now:

SKLLearnerAdapter(LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto'))

It is also a little more flexible because you can choose between the Ledoit-Wolf estimator and the Oracle Approximating Shrinkage.
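A hypothetical usage sketch of that choice (import path per this PR's initial diff, where the class lives in gda.py; during review it is moved to shrinklda.py):

# Illustrative sketch only; names follow this PR's diff.
from mvpa2.clfs.gda import ShrinkageLDA

clf_lw = ShrinkageLDA()                            # Ledoit-Wolf (the default)
clf_oas = ShrinkageLDA(shrinkage_estimator='OAS')  # Oracle Approximating Shrinkage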

May it be useful.

Cheers,
Michael

@coveralls

coveralls commented Jan 24, 2018


Coverage increased (+0.09%) to 80.289% when pulling c766d86 on Twitterfax:shrinkage-lda into 0b0c57a on PyMVPA:master.

@yarikoptic
Member

Are you interested in us guiding you @Twitterfax to polish up the PR (making your code tested, not failing as it is now, etc.), or would you rather we just do it ourselves (might take longer ;) )?

@mmbannert
Author

Please do guide me. It would be much appreciated :)

For example, one thing I wasn't sure about was whether scikit-learn is installed along with PyMVPA by default. If it is not available, I'd have to integrate the shrinkage estimators manually. I'm guessing that this could be a problem.

How can I inspect the Travis CI details?

@yarikoptic
Member

Great:

  • I will leave comments throughout your PR, which will be visible in the "Files changed" tab
  • Travis CI details -- just click on "Details" next to the "continuous-integration/travis-ci/pr" check above,
    then go to the failing (red) build to see that it is indeed about your code not being guarded against sklearn being unavailable. I will leave comments pointing to how it should be done.

I would also recommend installing codecov browser extension which would help to see which lines of your code are covered by the tests and which not, right in your browser while reviewing differences in that "Files changed" tab: https://github.com/codecov/browser-extension#choose-your-browser-below

@yarikoptic left a comment
Member

Basic quick review #1: changes to be done to get testing going etc

__tags__ = GDA.__tags__ + ['linear', 'lda', 'shrinkage']

def __init__(self, shrinkage_estimator='LedoitWolf', block_size=1000, \
**kwargs):

Have a look at some other classifiers, e.g. GDA or others, to see how parameters are ideally defined (via those Parameter definitions at the class level).

Also, no need for the trailing \ here since the statement is within ().
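A hedged sketch of what such class-level Parameter definitions might look like, mirroring GDA's style (the parameter names and defaults are this PR's; the exact doc strings are illustrative):

# Illustrative sketch only -- parameters declared at the class level,
# the way GDA declares its own.
from mvpa2.clfs.gda import GDA
from mvpa2.base.param import Parameter
from mvpa2.base.constraints import EnsureChoice

class ShrinkageLDA(GDA):
    __tags__ = GDA.__tags__ + ['linear', 'lda', 'shrinkage']

    shrinkage_estimator = Parameter(
        'LedoitWolf',
        constraints=EnsureChoice('LedoitWolf', 'OAS'),
        doc="Which covariance shrinkage estimator to use")
    block_size = Parameter(
        1000,
        doc="Block size passed to sklearn's LedoitWolf estimator")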

@@ -35,12 +35,12 @@
from mvpa2.base.constraints import EnsureChoice
from mvpa2.base.state import ConditionalAttribute
#from mvpa2.measures.base import Sensitivity

from sklearn.covariance import LedoitWolf, OAS

Since this code depends on sklearn's presence, it might be better to move it into a separate file (e.g., shrinklda.py or alike), which would help to import it conditionally on sklearn presence within the warehouse.

if shrinkage_estimator is 'LedoitWolf':
self.shrinkage_estimator = LedoitWolf(block_size=block_size)
elif shrinkage_estimator is 'OAS':
self.shrinkage_estimator = OAS()

Better to delay this instantiation until _train, and not even assign that instance to any ShrinkageLDA attribute. That would allow changing that parameter of a classifier instance and retraining, and would keep its repr correct.
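A hedged sketch of that suggestion, shown as a method body with the surrounding class and the rest of training elided:

# Illustrative sketch: build the estimator inside _train() from the current
# parameter values instead of caching an instance in __init__.
from sklearn.covariance import LedoitWolf, OAS

def _train(self, dataset):
    params = self.params
    # note: compare strings with ==, not `is` (identity is not guaranteed)
    if params.shrinkage_estimator == 'LedoitWolf':
        estimator = LedoitWolf(block_size=params.block_size)
    elif params.shrinkage_estimator == 'OAS':
        estimator = OAS()
    else:
        raise ValueError("Unknown shrinkage_estimator: %r"
                         % (params.shrinkage_estimator,))
    # ... continue training using `estimator` ...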

# because it only computes the class-dependent SSCP matrices. The
# shrinkage coefficient, however, needs to be computed from the whole
# time series (with samples centered on their respective class means)
params = self.params

Here we go -- this is where you would find those Parameters after you move them out to be defined at the class level, e.g. params.shrinkage_estimator.

# Store prior probabilities
self.priors = self._get_priors(nlabels, nsamples, nsamples_per_class)

if __debug__ and 'ShrinkageLDA' in debug.active:

You would need to add ShrinkageLDA among the debug targets, alongside GDA etc.:

$> git grep GDA -- mvpa2/base/   
mvpa2/base/__init__.py:    debug.register('GDA', "GDA - Gaussian Discriminant Analyses")
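i.e., a one-line addition along these lines (the key must match the 'ShrinkageLDA' guard in the PR's code; the description text is illustrative):

# In mvpa2/base/__init__.py, next to the GDA line above:
debug.register('ShrinkageLDA', "ShrinkageLDA - LDA with covariance shrinkage")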

@@ -18,7 +18,7 @@
MulticlassClassifier, RegressionAsClassifier
from mvpa2.clfs.smlr import SMLR
from mvpa2.clfs.knn import kNN
from mvpa2.clfs.gda import LDA, QDA
from mvpa2.clfs.gda import LDA, QDA, ShrinkageLDA

After you move that definition into a dedicated file, you could place the import within the if externals.exists('skl') block below, so it would be imported only when sklearn is installed.

Then add an actual instance of the class, similarly to how it is done for LDA, GDA, etc. in this file. This should get its testing going and we will continue from there.
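A hedged sketch of what that warehouse change might look like (assuming the clfswh collection and the descr convention used elsewhere in warehouse.py):

# Illustrative sketch for mvpa2/clfs/warehouse.py: guard the import and only
# register the classifier when sklearn is available.
if externals.exists('skl'):
    from mvpa2.clfs.shrinklda import ShrinkageLDA
    clfswh += ShrinkageLDA(descr='ShrinkageLDA()')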

@mmbannert
Author

Awesome! Thanks for your time and help. I will try to implement the suggested changes.

@codecov-io

codecov-io commented Jan 27, 2018

Codecov Report

Merging #565 into master will increase coverage by 0.08%.
The diff coverage is 92.3%.


@@            Coverage Diff            @@
##           master    #565      +/-   ##
=========================================
+ Coverage   75.71%   75.8%   +0.08%     
=========================================
  Files         364     366       +2     
  Lines       41453   41555     +102     
  Branches     6683    6694      +11     
=========================================
+ Hits        31386   31500     +114     
+ Misses       8213    8194      -19     
- Partials     1854    1861       +7
Impacted Files Coverage Δ
mvpa2/clfs/gda.py 88.13% <ø> (ø) ⬆️
mvpa2/tests/test_shrinklda.py 100% <100%> (ø)
mvpa2/clfs/warehouse.py 70% <100%> (+0.27%) ⬆️
mvpa2/base/__init__.py 82.72% <100%> (+0.07%) ⬆️
mvpa2/suite.py 82.99% <75%> (-0.12%) ⬇️
mvpa2/clfs/shrinklda.py 89.09% <89.09%> (ø)
mvpa2/tests/test_support.py 98.24% <0%> (-1.17%) ⬇️
mvpa2/clfs/stats.py 71.91% <0%> (-0.81%) ⬇️
mvpa2/tests/test_searchlight_hyperalignment.py 96.7% <0%> (-0.42%) ⬇️
mvpa2/tests/test_usecases.py 98.92% <0%> (-0.36%) ⬇️
... and 9 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b0c57a...c766d86.

@mmbannert
Author

Hi Yaroslav. Thanks again for your feedback and patience. Things look ok now... I suppose(?)

@yarikoptic
Member

Good. Now it would be great to add a dedicated unit test which would compare results against the sklearn-wrapped implementation you mentioned -- they should be close to identical, right? I can provide a bit more detail later, or you could check out other tests to see how it could be done.
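For example, a hedged sketch of such a comparison (the dataset parameters and the tolerance are illustrative assumptions, not the PR's actual test):

# Illustrative comparison sketch; dataset size and tolerance are assumptions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from mvpa2.clfs.skl.base import SKLLearnerAdapter
from mvpa2.clfs.shrinklda import ShrinkageLDA
from mvpa2.generators.partition import NFoldPartitioner
from mvpa2.measures.base import CrossValidation
from mvpa2.misc.data_generators import normal_feature_dataset

ds = normal_feature_dataset(perlabel=20, nlabels=2, nfeatures=100, snr=3.0)
errors = []
for clf in (ShrinkageLDA(),
            SKLLearnerAdapter(LinearDiscriminantAnalysis(solver='lsqr',
                                                         shrinkage='auto'))):
    cv = CrossValidation(clf, NFoldPartitioner())
    errors.append(np.mean(cv(ds)))
# expect near-identical mean cv errors (the exact tolerance is a judgment call)
assert abs(errors[0] - errors[1]) < 0.1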

@mmbannert
Author

Ok, I have used test_gnb.py etc. as a template to write the test. Let me know if it needs improvement. How would such a test typically be used?

@yarikoptic
Member

How would such a test typically be used?

Look into .travis.yml (which is a bit of a mouthful) for how we invoke testing using nose. To test locally, just run

python -m nose -s -v mvpa2/tests/test_shrinklda.py

to run your added tests only.

@mmbannert
Author

Travis CI doesn't complain about my coding anymore. I take it as a positive sign ;)

@mmbannert
Author

Hi Yaroslav,
Is there anything else I can do to improve the code? I sped up the test by simulating a smaller (but still p > n) dataset; it now runs 20 times faster on my computer.
Cheers,
Michael

@yarikoptic left a comment
Member

Sorry for the delay. Almost there -- what about the actual cv error? (see my comments)


# First check if scikit-learn is available at all
if not externals.exists('skl') or externals.versions['skl'] < '0.17':
self.skipTest('sklearn (0.17 or higher) unavailable')

Ha -- never used it this way... According to the docs this should indeed raise an exception, so no need for an explicit else: block indentation -- the code could be dedented, since it would not run if an old sklearn is present and this exception is raised.

Anyway, you could just use the line

skip_if_no_external('skl', min_version='0.17')

delta_t = []
for clf in [pymvpa_clf, skl_clf]:
cv = CrossValidation(clf, partitioner)
_ = cv(ds)

Well, the test could compare those results -- your implementation should be not just faster but also provide a similar (better?) result, right? ;-) Otherwise -- what would be the point of having a faster but notably worse implementation?


@mmbannert
Author

Thanks for the feedback!

1.) Yes, this is a much cleaner solution with skip_if_no_external.
2.) I thought about this too, and the solution I came up with was to compare the cv error of the new classifier with that of the old LDA implementation and with the cv error of SVM classification. A comparison between the two shows that the new results are more similar to the old LDA version than to SVM -- but only in about 95% of the cases. For a test that should be run routinely together with a bunch of other tests, I figured that this would be too low. You don't want the whole test suite to fail just because you are in the remaining 5% (roughly) where the comparison fails, right? Do you have any ideas to solve this?

Otherwise I suggest that I could improve the documentation of the new class. The code is relatively simple, but with a little more detail on what the algorithm does, users could easily convince themselves that it indeed does what it is supposed to do: implement LDA with shrinkage on the within-class covariance matrix. In the end, this is the point, right? In this way the "validity" of the new algorithm would follow directly from its implementation rather than from a comparison with other implementations.

Cheers,
Michael

@yarikoptic
Member

Sorry @Twitterfax again for the delay with the feedback, I will try to do better ;)

"to the old LDA version than SVM" -- a bit confused why we would want to compare to SVM here. I thought the result should be similar to the one produced by sklearn, which you compare against in terms of speed.

re unstable test(s): we do allow for tests which only sometimes fail. It is somewhat of a curse of ours, but we do have such tests and do not fix the random seed to guarantee that it is always the same data/result. You could use either of the following (see the sketch after this list):

  1. if cfg.getboolean('tests', 'labile', default='yes'): before the check which might fail at times
  2. @labile(5,1) (from mvpa2.testing.tools) to annotate the entire test so it runs up to 5 times to allow for no more than 1 failure (iirc) -- it would need random data regeneration within, and we use it more rarely
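A hedged sketch of option 2 (the test name and body are illustrative, not the PR's actual test):

# Illustrative sketch of the @labile decorator mentioned above.
from mvpa2.testing.tools import labile
from mvpa2.misc.data_generators import normal_feature_dataset

@labile(5, 1)  # rerun up to 5 times, tolerating no more than 1 failure (iirc)
def test_shrinklda_vs_skl():
    # regenerate random data on every run so that retries actually differ
    ds = normal_feature_dataset(perlabel=20, nlabels=2, nfeatures=100)
    # ... compare cv errors of ShrinkageLDA against the sklearn adapter ...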
