ENH(WiP): perspective API improvement for GroupClusterThreshold #395
base: master
Conversation
…pport other metrics, QE, etc
    -        sample.""")
    +        sample. If None, no bootstrapping will be performed, and each
    +        sample will be used as a NULL result sample, thus there should be
    +        a sufficient number of them""")
this will allow for the same machine to be used if we have e.g. across subject classification results and e.g. 1000 permutations of those
Theoretically a plain NullDist machinery could be used BUT most probably it would blow, and it would not have all the clustering handling linked to it
the point is that bootstrapping is just a first step needed only when we have insufficient number of NULL samples. May be we could even manage to separate that out somehow into a generator which would furnish those samples to the actual logic
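The generator idea could be sketched roughly like this (a minimal numpy sketch; the function name and signature are hypothetical, not anything in PyMVPA): original NULL samples are furnished as-is, and bootstrapped (mean-of-resampled) samples are generated only to top up when too few originals exist:

```python
import numpy as np

def null_sample_generator(null_samples, n_required, rng=None):
    """Yield NULL result samples, bootstrapping (resampling with
    replacement) only if fewer than n_required originals exist.

    Hypothetical helper for illustration; not PyMVPA API.
    """
    rng = np.random.default_rng(rng)
    null_samples = np.asarray(null_samples)
    # First furnish the original NULL samples as-is
    for sample in null_samples[:n_required]:
        yield sample
    # Then top up with bootstrapped (mean of resampled) samples
    for _ in range(max(0, n_required - len(null_samples))):
        idx = rng.integers(0, len(null_samples), size=len(null_samples))
        yield null_samples[idx].mean(axis=0)
```

This keeps the "generate originals" and "generate bootstraps" steps separable, as discussed above.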
We just need to keep in mind that it should not become painful to generate the original samples and the bootstrapped ones separately
Would this allow for TFCE and surface-based clustering?
* origin/master: (61 commits)
  - DOC: some changelog entries for recent changes
  - BF: Sample attribute name convenience can fail for non-default settings
  - BF: simplify testing of cv_no_generator tests since we need really a classifier there
  - ENH: make ad-hoc searchlights also being able to just (ab)use custom Splitter
  - BF: adhoc (e.g. gnb) searchlights now can tollerate awkward partitionings
  - ENH: make partitioner optional for CrossValidation
  - BF: Typo + PEP8
  - BF+ENH: explicitly use float for "sum of" estimation in anova so we get proper float division
  - BF: Lists new nose-version of tests in another dinosaur of ours
  - BF: Random attempt to fix the coverage setup
  - ENH: Modernize SMLR test (run more of its code)
  - BF: Adjust code to match reality
  - RM: Removal of Hamster()
  - RF: not X is None -> X is not None
  - BF: Replace all occurrences of mutable default values with a safe default
  - RF: Be more compact when testing for isinstance()
  - RF: Prefer in-place modification for speed
  - RF: Replace pattern for...range(len(x)) with enumerate(x)
  - Reenable test where it is not obvious why we don't run it.
  - BF: Explicitly comment out unreachable code
  - ...
@nno -- added
@yarikoptic, I did not have a particular concern -- except that the current architecture does not seem easily extensible to support TFCE. But this architecture can be changed whenever there are plans to implement TFCE, of course.
… query engines and thus arbitrary geometry
…inor opts in Labeler code
for now all logic for collecting/testing spatial statistics on detected clusters is disabled, since it requires some thought on how to RF it so that it scales for any QE. Some tests of helper functions are yet to be fixed
@PyMVPA/developers I have pushed my tentative RF to provide "clustering" using QEs. The only additional demand is really that the dataset with permutations should carry a FlattenMapper somewhere with a non-degenerate (i.e. non-None) space, so we can identify which 'fa' to use for determining neighbors. But that is only the case if no explicit 'Labeler' (a new construct implementing depth-first search to identify clusters) was provided; if provided, it should be used. Ideally we should get a unittest with data on a surface. For now I have disabled collection of 'spatial' statistics for clusters, since that would require RFing of the returned values anyway -- e.g. how should it look in the case of a surface dataset? What about a volumetric + temporal searchlight? I thought that we should probably collect those using the dataset's .fa and the query engine's arguments, but again I am not sure those would always be consistent across all query engines. IMHO such an operation might better be performed explicitly afterwards, per use case. Also, one test will break ATM since I was not sure of the best way to address RFing of the internals. What do you think?
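The depth-first-search labeling idea can be sketched in a few lines (plain Python, with a `neighbors(i)` callable standing in for a QueryEngine; this is only an illustration of the algorithm, not the PR's actual Labeler class):

```python
def label_clusters(above_thr, neighbors):
    """Assign integer cluster labels to suprathreshold features via
    depth-first search over a neighborhood function.

    above_thr : sequence of bool, one entry per feature
    neighbors : callable returning neighbor indices of a feature
    Returns a label map: 0 for background, 1..N for clusters.
    """
    lmap = [0] * len(above_thr)
    idx = 0
    for seed, on in enumerate(above_thr):
        if not on or lmap[seed]:
            continue  # below threshold or already labeled
        idx += 1
        stack = [seed]  # explicit DFS stack to avoid recursion limits
        while stack:
            feat = stack.pop()
            if lmap[feat]:
                continue
            lmap[feat] = idx
            stack.extend(nb for nb in neighbors(feat)
                         if above_thr[nb] and not lmap[nb])
    return lmap
```

Any geometry (volume, surface, spatio-temporal) reduces to the `neighbors` callable, which is exactly what a QueryEngine provides.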
BTW @Hanke, if you could check it out on some of your real results -- that would be great. It should provide identical results -- if not, there is (or was) a bug ;)
Doing the tests now.... First comment: the documentation needs to mention that an input dataset now needs to have a feature attribute indicating the spatial association of features. Before, a mapper was sufficient -- which is now obsolete. This likely implies a change in existing analysis code, as the output of a searchlight, for example, does not have something like that.
mvpa2/algorithms/group_clusterthr.py
Outdated
    if not len(area):
        return [0]
    else:
        import pdb; pdb.set_trace()
Huh?
@yarikoptic I will not keep a full analysis running to check for identical results. The estimated runtime of the cluster size calculation is 20 hours (up from ~10 min; see #427 for performance estimates for this test analysis). I will massively reduce the number of bootstrap samples and try again.
@yarikoptic Apologies. It seems the new code behaves in non-obvious ways with parallelization and n_blocks>1. With n_proc=1 and n_blocks=1 it is wayyyyyy faster. Still playing...
    continue  # already labeled, so was considered etc
              # or simply was 0 to start with
    # immediately label
    lmap[neighbor] = idx
Anything is a neighbor, even if .samples have different values? Why not support clustering so that two features are in the same cluster if they are connected and have the same value?
that might be a worthwhile feature, but do you see an immediate (I love this word, given that we last corresponded here over 3 years ago) use case?
The idea was to just pass a thresholded map (not the binarized mask) and do clustering.
In the long(er) run it might indeed be nice to add an optional mode taking the value into account, so we could pass already-labeled maps, but I am yet to see the use case
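The optional value-aware mode discussed above could look like this (a sketch, not existing PyMVPA code): neighbors join a cluster only if they carry the same nonzero value, so an already-labeled map can be passed through unchanged:

```python
def label_by_value(values, neighbors):
    """Cluster features so that two neighbors end up in the same
    cluster only if they carry the same nonzero value -- the optional
    mode discussed above. Sketch only; not PyMVPA API.

    values    : sequence of per-feature values (0 = background)
    neighbors : callable returning neighbor indices of a feature
    """
    lmap = [0] * len(values)
    idx = 0
    for seed, v in enumerate(values):
        if not v or lmap[seed]:
            continue  # background or already labeled
        idx += 1
        stack = [seed]
        while stack:
            feat = stack.pop()
            if lmap[feat]:
                continue
            lmap[feat] = idx
            # only same-valued, unlabeled neighbors join the cluster
            stack.extend(nb for nb in neighbors(feat)
                         if values[nb] == v and not lmap[nb])
    return lmap
```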
Ok, I am confused wrt performance. But anyway, here is a direct comparison of the original implementation and this PR, without parallelization and compartmentalization:
so performance looks all good in this case. However, the results differ. How much they differ is kinda hard to tell. The clusters in the new implementation are not sorted by size, so IDs cannot easily be matched, but it seems that for the biggest clusters there is an approximate equivalent in size. Looking at the results in terms of location in the volume, they seem to be structurally the same. There are a few smaller clusters that differ between the two algorithms, but given the not-necessarily-stellar number of bootstraps I consider this good enough. I am in favor of replacing the current implementation with this more flexible one. But we should make sure to keep parallelization in mind. It could be that I screwed up somewhere, though....
Ah, and sorting the cluster IDs again will be much appreciated!
FTR: Before a merge the docstring needs a full pass to adjust to the new implementation. |
mvpa2/algorithms/group_clusterthr.py
Outdated
    if not len(area):
        return [0]
    else:
        return area.astype(int)
return [area.max().astype(int)]
?
D'oh -- thanks! Sure, there should have been a max somewhere ;)
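The point of the suggested `[area.max().astype(int)]` fix is that the NULL distribution of cluster sizes needs only the largest cluster per permutation, not all of them. A hedged numpy sketch of that metric (names are illustrative):

```python
import numpy as np

def max_cluster_size(lmap):
    """Largest cluster size from a label map (0 = background),
    mirroring the `[area.max().astype(int)]` suggestion: only the
    maximum per permutation feeds the NULL distribution. Sketch only."""
    lmap = np.asarray(lmap)
    area = np.bincount(lmap[lmap > 0])  # counts for labels 1..N
    if not len(area):
        return 0  # no suprathreshold features in this permutation
    return int(area.max())
```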
…e plain ndarrays as well
…r would be trained already
…ion by removing counter as an arg + RF a bit more to unify/minimize code between joblib and non-joblib execution
… + BF no explicit is_sorted for mult corr
@Hanke sorting by size should be in place as of 9c7a3ff; note that differences might still happen if clusters have equal numbers of members. Otherwise, as of that commit I have refactored quite a few pieces, hopefully without change of functionality. But now all 3 metrics for cluster level (size, sizenon0, maxsize) should work, although still undertested I would say.
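The by-size sorting requested above amounts to renumbering labels so that ID 1 is the biggest cluster. A possible numpy sketch (illustrative, not the commit's actual code; as noted, ties keep an arbitrary order):

```python
import numpy as np

def relabel_by_size(lmap):
    """Renumber cluster labels so that ID 1 is the largest cluster.
    Ties between equally sized clusters keep an arbitrary (but
    deterministic) order. Sketch only."""
    lmap = np.asarray(lmap)
    sizes = np.bincount(lmap)[1:]            # sizes of labels 1..N
    order = np.argsort(sizes)[::-1] + 1      # old labels, biggest first
    remap = np.zeros(len(sizes) + 1, dtype=int)
    remap[order] = np.arange(1, len(sizes) + 1)
    return remap[lmap]                       # background 0 stays 0
```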
…e anything besides mean_sample
""" | ||
qe = Parameter( | ||
None, | ||
doc="QueryEngine to determine neighbors of any given feature") |
Will the cluster computation depend on the "radius" of the query engine? Or would it be possible to extract a neighborhood structure that takes only the closest neighboring samples?
On Fri, 08 Apr 2016, bpinsard wrote:
In mvpa2/measures/label.py:
    +from ..misc.neighborhood import CachedQueryEngine
    +
    +from .base import Measure
    +
    +if __debug__:
    +    from ..base import debug
    +
    +class ClustersLabeler(Measure):
    +    """Determines contiguous regions of features using a query engine.
    +
    +    "label" in the name chosen to mimic scipy.ndimage.measurements.label
    +    """
    +    qe = Parameter(
    +        None,
    +        doc="QueryEngine to determine neighbors of any given feature")
> Will the cluster computation depend on the "radius" of the query engine?

if the query engine takes a radius, then yes ;) but the point is to allow any neighborhood querying by providing a corresponding query engine. E.g. in the volume it could be the IndexQueryEngine, which would take some HollowSphere in one dimension (e.g. voxel coords) and then some other, e.g. a Sphere, in another (e.g. time), thus even allowing clustering of spatio-temporal beasts of arbitrary kinds ;)

> Or would it be possible to extract a neighborhood structure that takes only the closest neighboring samples?

define "samples" ;) those query engines operate across "features" (in pymvpa terminology), so if you need to operate across samples, you would need to bring them into the same sample (e.g. boxcar'ing etc)
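The spatio-temporal composition described above (a spatial neighborhood over voxel coordinates combined with a temporal one) can be illustrated with a toy stand-in for IndexQueryEngine; everything here is a sketch with hypothetical names, not PyMVPA's actual API:

```python
import numpy as np

def product_neighbors(coords, spatial_radius, temporal_radius):
    """Toy stand-in for composing a spatial Sphere (over voxel
    coordinates) with a temporal Sphere, as in the spatio-temporal
    example above.

    coords : (n_features, 4) array of (x, y, z, t) per feature
    Returns a list of neighbor-index arrays, one per feature.
    """
    coords = np.asarray(coords, dtype=float)
    # pairwise spatial (Euclidean) and temporal (absolute) distances
    sp = np.linalg.norm(coords[:, None, :3] - coords[None, :, :3], axis=-1)
    tm = np.abs(coords[:, None, 3] - coords[None, :, 3])
    # a pair is neighboring only if BOTH criteria hold
    ok = (sp <= spatial_radius) & (tm <= temporal_radius)
    return [np.flatnonzero(row) for row in ok]
```

Feeding such a neighbor structure into the DFS labeler then clusters features jointly in space and time.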
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean using the QE used to create the searchlight map?
It seems that there would be cases where multiple clusters are considered a single cluster just due to an extended connectivity definition. See the attached schema, with the thresholded map in blue and the neighborhood in circles overlay.
In that case it would certainly overestimate the null cluster size distribution by merging noisy patches in permuted maps, but also overestimate the cluster size in the true map by creating superclusters?
Also, when using a surface QE, the neighborhood might vary for each subject at each location due to different distances between vertices of a standard surface spherically warped to the subject space. Which surface would you then use for group analysis, the template one (fsaveragex/icox)?
- the query engine here is pretty much merely a generalization to define the immediate neighborhood. E.g. before, scipy.ndimage.measurements.label internally defined by default "immediate neighbors touching the face" and was restricted to volumes. Here the use of a query engine relaxes that and allows finding an arbitrary immediate (or not, if e.g. a Sphere with a big radius is used) neighborhood, in a volume or on a surface. But in the general/default case you would just use Sphere(1), making it identical to how the previous implementation behaved (using scipy...label)
- interesting point on per-subject QEs! A similar aspect was recently dealt with in the searchlight hyperalignment (slhyper below) implementation (see e.g. https://github.com/PyMVPA/PyMVPA/blob/master/mvpa2/algorithms/searchlight_hyperalignment.py#L232 and where it is used in there). But the situation there was "easier" and somewhat clearer, since slhyper obtains a list of datasets, so they could easily carry varying .fa per subject and were not required to have corresponding features. Here we have the inherent assumption that features already correspond across subjects, since we are generally just computing a mean performance map, and clustering/qe is used later only to figure out clusters on that map. So unless we want to aim for some extended functionality where QEs would also be used to project/mean performances within neighborhoods of each subject, there is probably no need for per-subject QEs; but even then such averaging could simply be done beforehand, and for this computation clustering is done on e.g. a corresponding "template" surface
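The claimed Sphere(1)/scipy.ndimage.label equivalence boils down to labeling with only face-touching neighbors. A self-contained sketch of that default connectivity (plain numpy, dimension-agnostic; an illustration, not the PR's code):

```python
import numpy as np

def label_volume(mask):
    """Label connected components of a boolean array using only
    face-touching neighbors -- the Sphere(1) neighborhood matching
    scipy.ndimage.label's default connectivity. Sketch only."""
    mask = np.asarray(mask, dtype=bool)
    lmap = np.zeros(mask.shape, dtype=int)
    idx = 0
    for seed in zip(*np.nonzero(mask)):
        if lmap[seed]:
            continue  # already part of an earlier cluster
        idx += 1
        stack = [seed]
        while stack:
            pos = stack.pop()
            if lmap[pos]:
                continue
            lmap[pos] = idx
            # visit the 2*ndim face-touching neighbors
            for dim in range(mask.ndim):
                for step in (-1, 1):
                    nb = list(pos)
                    nb[dim] += step
                    nb = tuple(nb)
                    if all(0 <= c < s for c, s in zip(nb, mask.shape)) \
                            and mask[nb] and not lmap[nb]:
                        stack.append(nb)
    return lmap
```

Swapping the face-neighbor loop for an arbitrary query engine is the generalization this PR makes.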
* origin/master: (438 commits)
  - TST: generate vertices rounded to five decimals to avoid rounding issues
  - BLD: use trusty distribution
  - BLD: fix issue with erlang not being functional (anymore?) at travis-ci. Solution is to pretend to be java instead
  - ENH: get_contrasts to return constrast stats (z or p) for an existing nipy GLM model
  - ENH: Dataset.to_df() to export dataset (e.g. for visualization) as Pandas dataframe
  - BF: make GNB robust against cases where some feature/classes have no variances
  - ENH: add reseeding in tests to reproduce failures
  - BF/ENH: make GNB puke when data is clearly degenerate (the same means/variances)
  - BF: guard against under/overflows and mitigate in case of logprob=True (default)
  - ENH(BK): prototype for get_contrasts to obtain contrasts for fit NiPy GLM model
  - RF(TST): mark portion of the test as not labile
  - BF(TST): label pairs should be (pos, neg), so L0, L1 not L1, LO
  - RF: swap pos and neg while assigning in MulticlasClassifier
  - BF: GNB - add has_sensitivity tag in constructor
  - BF/RF: GNB sensitivity - should correspond to the 2nd label to be the "positive" one
  - BF: account for np.histogram returning also rightmost edge for nbins, making it nbins + 1
  - BF(TST): account for the prior fix to obtain proper pairs in tuples
  - BF: make _uniquemerge2literal preserve the tuples (used in sensitivities)
  - RF(DOC): document in code inconsistency in order or pos/neg labels which is "by design"
  - ENH(TST): verify that sensitivites are computed correctly when wrapped into MulticlassClassifier
  - ...
Codecov Report
    @@            Coverage Diff            @@
    ##           master     #395     +/-   ##
    =========================================
    - Coverage   75.96%    75.8%   -0.17%
    =========================================
      Files         367      369       +2
      Lines       42076    42369     +293
      Branches     6772     6813      +41
    =========================================
    + Hits        31965    32118     +153
    - Misses       8225     8322      +97
    - Partials     1886     1929      +43
Continue to review full report at Codecov.
to support other metrics, QE, etc
wanted to see what parameters should come into play etc. And it feels that maybe there should be some class hierarchy, since many parameters are specific to cluster forming (threshold, qe) and require a two-stage process (first estimate thr_map, and only then clusters for the metric). In that two-stage process, maybe we also need a change/parameter to remove a possible bias due to using the same bootstraps/data twice (I don't have a clear/strong feeling on that yet).
If we agree on the basic idea/API -- I could prep the functionality and see how it goes from there (whether it requires classes etc.)
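To make the two-stage idea concrete, here is one hypothetical shape the API could take (all names are illustrative, not the PR's actual classes): stage 1 estimates the per-feature threshold map from NULL samples, stage 2 would build the NULL distribution of the chosen cluster metric, possibly on a separate set of samples to avoid reusing the data twice:

```python
import numpy as np

class ClusterThresholdBase:
    """Hypothetical two-stage API sketch for the discussion above.
    Stage 1: estimate thr_map from NULL samples; stage 2 (not shown)
    would collect the NULL distribution of the cluster metric."""

    def __init__(self, fwe_rate=0.05, metric='size'):
        self.fwe_rate = fwe_rate  # per-feature uncorrected rate
        self.metric = metric      # e.g. 'size', 'sizenon0', 'maxsize'
        self.thr_map = None

    def train(self, null_samples):
        null_samples = np.asarray(null_samples)
        # Stage 1: per-feature threshold from the NULL distribution
        self.thr_map = np.percentile(
            null_samples, 100 * (1 - self.fwe_rate), axis=0)
        # Stage 2 would cluster thresholded NULL maps here, ideally on
        # samples not already used for the threshold estimate.
        return self
```

Splitting the stages this way would let subclasses vary only the cluster-forming part (threshold, qe) while sharing the metric machinery.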