
ENH(WiP): perspective API improvement for GroupClusterThreshold #395

Open. Wants to merge 20 commits into master.
Conversation

yarikoptic (Member):
To support other metrics, QEs, etc.

I wanted to see what parameters should come into play. It feels like there should be some class hierarchy, since many parameters are specific to cluster forming (threshold, qe) and require a two-stage process (first estimate thr_map, and only then derive clusters for the metric). In that two-stage process we may also need a change/parameter to remove a possible bias from using the same bootstraps/data twice (I don't have a clear/strong feeling on that yet).

If we agree on the basic idea/API, I could prepare the functionality and see how it goes from there (whether it requires classes, etc.).

sample.""")
sample. If None, no bootstrapping will be performed, and each
sample will be used as a NULL result sample, thus there should be
a sufficient number of them""")
yarikoptic (Member Author):

This will allow the same machinery to be used if we have e.g. across-subject classification results and e.g. 1000 permutations of those.

Theoretically, plain NullDist machinery could be used, BUT most probably it would blow up, and it would not have all the clustering handling linked to it.

yarikoptic (Member Author):

The point is that bootstrapping is just a first step, needed only when we have an insufficient number of NULL samples. Maybe we could even manage to separate that out into a generator which would furnish those samples to the actual logic.
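The generator idea could look roughly like this (a minimal sketch; `null_samples`, its signature, and the averaging scheme are assumptions for illustration, not PyMVPA API):

```python
# Hedged sketch: furnish NULL samples, bootstrapping only when too few
# original NULL samples exist. Names and strategy are illustrative.
import numpy as np

def null_samples(samples, n_needed, rng=None):
    """Yield n_needed NULL maps from a (n_samples, n_features) array.

    If enough originals exist, each is used as-is; otherwise new NULL
    maps are bootstrapped by averaging resampled subsets of originals.
    """
    rng = np.random.default_rng(rng)
    if len(samples) >= n_needed:
        # enough original NULL samples: no bootstrapping needed
        for s in samples[:n_needed]:
            yield s
    else:
        for _ in range(n_needed):
            # bootstrap: average a with-replacement resample of the originals
            idx = rng.integers(0, len(samples), size=len(samples))
            yield samples[idx].mean(axis=0)
```

The actual logic downstream would then be agnostic to whether samples were bootstrapped, which also addresses the concern below about generating original and bootstrapped samples separately.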

Member:

We just need to keep in mind that it should not become painful to generate the original samples and the bootstrapped ones separately.

nno (Contributor) commented Jan 17, 2016:

Would this allow for TFCE and surface-based clustering?
As I wrote before, in my experience TFCE is much easier to implement if the relevant cluster statistics are stored for each feature rather than for each cluster; fixed-threshold clustering then becomes one special case.
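The per-feature storage nno describes can be sketched as follows (a minimal, illustrative TFCE implementation using scipy.ndimage; the function name and default `dh`, `E`, `H` values are assumptions, not PR code):

```python
# Hedged sketch of per-feature TFCE: each feature accumulates
# e(h)^E * h^H * dh over thresholds h, where e(h) is the extent of the
# supra-threshold cluster containing that feature at height h.
import numpy as np
from scipy import ndimage

def tfce(stat_map, dh=0.1, E=0.5, H=2.0):
    """Return a per-feature TFCE score for an ND statistic map."""
    out = np.zeros_like(stat_map, dtype=float)
    for h in np.arange(dh, stat_map.max() + dh, dh):
        # label contiguous supra-threshold regions at height h
        labels, n = ndimage.label(stat_map >= h)
        if n == 0:
            continue
        # extent of the cluster each feature belongs to
        sizes = np.bincount(labels.ravel())
        extent = sizes[labels]
        # accumulate only where supra-threshold (label > 0)
        out += np.where(labels > 0, extent ** E * h ** H * dh, 0.0)
    return out
```

With per-feature scores like this, a fixed-threshold cluster map is just the degenerate case of a single threshold level.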

* origin/master: (61 commits)
  DOC: some changelog entries for recent changes
  BF: Sample attribute name convenience can fail for non-default settings
  BF: simplify testing of cv_no_generator tests since we need really a classifier there
  ENH: make ad-hoc searchlights also being able to just (ab)use custom Splitter
  BF: adhoc (e.g. gnb) searchlights now can tollerate awkward partitionings
  ENH: make partitioner optional for CrossValidation
  BF: Typo + PEP8
  BF+ENH: explicitly use float for "sum of" estimation in anova so we get proper float division
  BF: Lists new nose-version of tests in another dinosaur of ours
  BF: Random attempt to fix the coverage setup
  ENH: Modernize SMLR test (run more of its code)
  BF: Adjust code to match reality
  RM: Removal of Hamster()
  RF: not X is None -> X is not None
  BF: Replace all occurrences of mutable default values with a safe default
  RF: Be more compact when testing for isinstance()
  RF: Prefer in-place modification for speed
  RF: Replace pattern for...range(len(x)) with enumerate(x)
  Reenable test where it is not obvious why we don't run it.
  BF: Explicitly comment out unreachable code
  ...
yarikoptic (Member Author):

@nno -- added a map_postproc parameter to the beast, so we could feed it any callable which transforms the original or bootstrapped sample in any way. Implementation details are yet to be figured out. First I will implement clustering using a QE (I am yet to comprehend what your concern was).

nno (Contributor) commented Jan 26, 2016:

@yarikoptic, I did not have a particular concern - except that the current architecture does not seem easily extensible to support TFCE. But this architecture can be changed whenever there are plans to implement TFCE of course.

for now all logic for collecting/testing spatial statistics on detected
clusters is disabled since requires a thought on how to RF it so it scales
for any QE.

Some tests of helper functions yet to be fixed
yarikoptic (Member Author):

@PyMVPA/developers I have pushed my tentative RF to provide "clustering" using QEs. The only additional demand is really that the dataset with permutations should carry a FlattenMapper somewhere with a non-degenerate (i.e. non-None) space, so that we could identify which 'fa' to use for determining neighbors. But that is only the case if no explicit 'Labeler' (a new construct implementing depth-first search to identify clusters) was provided; if one is provided, it should be used. Ideally we should get a unittest with data on a surface.

For now I have disabled collection of 'spatial' statistics for clusters, since that would require RFing of the returned values anyway -- e.g. how should it look in the case of a surface dataset? What if volumetric + temporal searchlight? I thought we should probably collect those using the dataset's .fa and the query engine's arguments, but again I am not sure those would always be consistent across all query engines. IMHO such an operation might better be performed explicitly per use case, e.g. collect_spatial_clusters_stats(ds, fas=['voxel_indices']). What do you think?

Also, one test will break ATM, since I was not sure of the best way to address RFing of the internal _get_map_cluster_sizes, which previously took a map and then called .label() assuming a pure Nd array to find neighbors in. I could e.g. RF it so that, if no labels/num are given, it would flatten the given data with our FlattenMapper and then proceed, but IMHO that wouldn't be kosher. I guess ideally I should provide it all those values, so as to test only the single invocation that counts actual clusters ;)

What do you think?
Now I want to switch to enabling other FW estimations, which wouldn't even care about anything at the "cluster" level.
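The 'Labeler' depth-first-search idea mentioned above can be sketched with a generic neighbor-query callable standing in for a PyMVPA QueryEngine (names and signature are illustrative, not the PR's actual code):

```python
# Hedged sketch of QE-based cluster labeling via iterative depth-first search,
# mimicking what scipy.ndimage.measurements.label does for volumes.
import numpy as np

def label_clusters(above_thr, neighbors):
    """Assign cluster ids to supra-threshold features.

    above_thr : boolean array over features (thresholded map)
    neighbors : callable, feature index -> iterable of neighbor indices
                (this is where a QueryEngine would plug in)
    """
    labels = np.zeros(len(above_thr), dtype=int)
    cur = 0
    for seed in np.flatnonzero(above_thr):
        if labels[seed]:
            continue  # already assigned to a cluster
        cur += 1
        stack = [seed]
        while stack:  # DFS over connected supra-threshold features
            f = stack.pop()
            if labels[f] or not above_thr[f]:
                continue
            labels[f] = cur
            stack.extend(neighbors(f))
    return labels
```

Because the neighbor query is a callable, the same labeling works for volumes, surfaces, or any other topology the query engine can express.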

yarikoptic (Member Author):

BTW @Hanke, if you could check it out on some of your real results, that would be great. It should provide identical results -- if not, there is (or was) a bug ;)

mih (Member) commented Mar 4, 2016:

Doing the tests now....

First comment: the documentation needs to mention that an input dataset now needs to have a feature attribute indicating the spatial association of features. Before, a mapper was sufficient -- that is now obsolete. This likely implies a change in existing analysis code, as the output of a searchlight, for example, does not have something like voxel_indices.

if not len(area):
return [0]
else:
import pdb; pdb.set_trace()
Member:

Hu?

mih (Member) commented Mar 4, 2016:

@yarikoptic I will not keep a full analysis running to check for identical results. The estimated runtime of the cluster size calculation is 20 hours (up from ~10min; see #427 for performance estimates for this test analysis).

I will massively reduce the number of bootstrap samples and try again.

mih (Member) commented Mar 4, 2016:

@yarikoptic Apologies. It seems the new code behaves in non-obvious ways with parallelization and n_blocks>1. With n_proc=1 and n_blocks=1 it is wayyyyyy faster. Still playing...

continue # already labeled, so was considered etc
# or simply was 0 to start with
# immediately label
lmap[neighbor] = idx
Contributor:

Anything is a neighbor, even if .samples have different values? Why not support clustering so that two features are in the same cluster if they are connected and have the same value?

yarikoptic (Member Author):

That might be a worthwhile feature, but do you see an immediate (I love this word, given that we last corresponded here over 3 years ago) use case?
The idea was to just pass a thresholded map (not the binarized mask) and do clustering.
In the long(er) run it might indeed be nice to add an optional mode that takes the value into account, so we could pass already-labeled maps, but I am yet to see the use case.
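The value-aware mode discussed here could look roughly like this (a hedged sketch with scipy.ndimage; the function name is hypothetical):

```python
# Hedged sketch of nno's suggestion: two features join a cluster only if they
# are connected AND share the same (nonzero) value, e.g. a pre-assigned label.
import numpy as np
from scipy import ndimage

def label_same_value(vmap):
    """Label connected regions of equal nonzero value in an ND map."""
    out = np.zeros(vmap.shape, dtype=int)
    nxt = 0
    for val in np.unique(vmap[vmap != 0]):
        # cluster each value separately, then offset ids to keep them unique
        lab, n = ndimage.label(vmap == val)
        out[lab > 0] = lab[lab > 0] + nxt
        nxt += n
    return out
```

For a plain thresholded map (all supra-threshold values equal after binarization) this reduces to ordinary connected-component labeling.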

mih (Member) commented Mar 4, 2016:

Ok, I am confused wrt performance. But anyway, here is a direct comparison of the original implementation and this PR, without parallelization and compartmentalization:

# ORIGINAL
# n_proc=1, n_blocks=1, n_bootstrap=5000
683.82s user 17.23s system 100% cpu 11:37.20 total

# THIS PR
# n_proc=1, n_blocks=1, n_bootstrap=5000
589.81s user 17.94s system 100% cpu 10:04.26 total

So performance looks all good in this case. However, the results differ. How much they differ is kinda hard to tell. The clusters in the new implementation are not sorted by size, so IDs cannot easily be matched, but it seems that for the biggest clusters there is an approximate equivalent in size.

Looking at the results in terms of the location in the volume it seems that they are structurally the same. There are a few smaller clusters different between the two algorithms, but given the not-necessarily stellar number of bootstraps I consider this good enough.

I am in favor of replacing the current implementation with this more flexible one. But we should make sure to keep parallelization in mind. It could be that I screwed up somewhere, though....

mih (Member) commented Mar 4, 2016:

Ah, and sorting the cluster IDs again will be much appreciated!

mih (Member) commented Mar 4, 2016:

FTR: Before a merge the docstring needs a full pass to adjust to the new implementation.

if not len(area):
return [0]
else:
return area.astype(int)
Member:

return [area.max().astype(int)]

?

yarikoptic (Member Author):

D'oh -- thanks! There sure should have been a max somewhere ;)

yarikoptic (Member Author):

@Hanke sorting by size should be in place as of 9c7a3ff; note that differences might still happen if clusters have an equal number of members. Otherwise, as of that commit, I have refactored quite a few pieces, hopefully without change of functionality. All 3 metrics for the cluster level (size, sizenon0, maxsize) should now work, although I would say they are still undertested.

"""
qe = Parameter(
None,
doc="QueryEngine to determine neighbors of any given feature")
bpinsard:

Will the cluster computation depend on the "radius" of the query engine? Or would it be possible to extract a neighborhood structure that takes only the closest neighboring samples?

yarikoptic (Member Author):

On Fri, 08 Apr 2016, bpinsard wrote:

In mvpa2/measures/label.py:

+from ..misc.neighborhood import CachedQueryEngine
+
+from .base import Measure
+
+if __debug__:
+    from ..base import debug
+
+class ClustersLabeler(Measure):
+    """Determines contiguous regions of features using a query engine.
+
+    "label" in the name chosen to mimic scipy.ndimage.measurements.label
+    """
+    qe = Parameter(
+        None,
+        doc="QueryEngine to determine neighbors of any given feature")

Will the cluster computation depend on the "radius" of the query engine?

If the query engine takes a radius, then yes ;) But the point is to allow any neighborhood querying by providing a corresponding query engine. E.g. in the volume it could be an IndexQueryEngine which takes a HollowSphere in one dimension (e.g. voxel coords) and some other neighborhood, e.g. a Sphere, in another (e.g. time), thus allowing us to cluster even spatio-temporal beasts of arbitrary kinds ;)

Or would it be possible to extract a neighborhood structure that takes
only the closest neighboring samples?

Define "samples" ;) Those query engines operate across "features" (in PyMVPA terminology), so if you need to operate across samples, you would need to bring them into the same sample (e.g. by boxcar'ing, etc.).
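The spatio-temporal neighborhood idea above can be sketched in plain numpy (an illustrative stand-in for combining per-dimension neighborhoods the way an IndexQueryEngine would; the function and its parameters are assumptions):

```python
# Hedged sketch: a feature is a neighbor if it is close in (x, y, z) space
# AND close along time -- one neighborhood criterion per dimension group.
import numpy as np

def spatio_temporal_neighbors(coords, idx, spatial_radius=1.0, time_radius=1):
    """coords: (n_features, 4) array of (x, y, z, t) indices.

    Returns indices of features within spatial_radius in (x, y, z)
    and within time_radius along t of feature idx.
    """
    d_sp = np.linalg.norm(coords[:, :3] - coords[idx, :3], axis=1)
    d_t = np.abs(coords[:, 3] - coords[idx, 3])
    return np.flatnonzero((d_sp <= spatial_radius) & (d_t <= time_radius))
```

Swapping the spatial criterion for a hollow-sphere test (lower and upper radius bound) would mirror the HollowSphere example in the comment.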

bpinsard:

Do you mean using the QE used to create the searchlight map?
It seems there would be cases where multiple clusters are considered a single cluster just due to the extended connectivity definition. See the attached schema, with the thresholded map in blue and the neighborhoods as circle overlays.
[attached schema: slght]
In that case, wouldn't it overestimate the null cluster-size distribution by merging noisy patches in permuted maps, but also overestimate the cluster size in the true map by creating superclusters?

Also, when using a surface QE, the neighborhood might vary per subject at each location, due to different distances between vertices of a standard surface spherically warped to the subject space. Which surface would you then use for group analysis, the template one (fsaveragex/icox)?

yarikoptic (Member Author):

  1. The query engine here is pretty much merely a generalization for defining the immediate neighborhood. E.g. before, scipy.ndimage.measurements.label internally defined "immediate neighbors touching the face" by default and was restricted to the volume. Using a query engine relaxes that and allows finding an arbitrary immediate (or not, if e.g. a Sphere with a big radius is used) neighborhood, in the volume or on a surface. In the general/default case you would just use Sphere(1), making it identical to how the previous implementation behaved (using scipy...label).
  2. Interesting point on per-subject QEs! A similar aspect was recently dealt with in the searchlight hyperalignment (slhyper below) implementation (see e.g. https://github.com/PyMVPA/PyMVPA/blob/master/mvpa2/algorithms/searchlight_hyperalignment.py#L232 and where it is used there). But the situation there was "easier" and somewhat clearer, because slhyper obtains a list of datasets, so they could easily carry varying .fa per subject and were not required to have corresponding features. Here we have the inherent assumption that features already correspond across subjects, since we generally just compute a mean performance map, and clustering/the QE is used later only to figure out clusters on that map. So unless we want to aim for some extended functionality where QEs would also be used to project/mean performances within neighborhoods of each subject, there is probably no need for per-subject QEs; and even then such averaging could simply be done beforehand, with clustering for this computation done on e.g. a corresponding "template" surface.
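For the default face-touching case in point 1, the previous behavior is exactly scipy's connected-component labeling; a small self-contained check (illustrative data, not PR code):

```python
# With the default structure, scipy.ndimage.label connects only features that
# touch a face (no diagonals) -- the behavior a Sphere(1)-like QE reproduces.
import numpy as np
from scipy import ndimage

thr_map = np.array([[1, 1, 0, 0],
                    [0, 1, 0, 2],
                    [0, 0, 0, 2]])
labels, n_clusters = ndimage.label(thr_map > 0)
# two clusters: the L-shaped patch top-left and the column on the right
```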

* origin/master: (438 commits)
  TST: generate vertices rounded to five decimals to avoid rounding issues
  BLD: use trusty distribution
  BLD: fix issue with erlang not being functional (anymore?) at travis-ci. Solution is to pretend to be java instead
  ENH: get_contrasts to return constrast stats (z or p) for an existing nipy GLM model
  ENH: Dataset.to_df() to export dataset (e.g. for visualization) as Pandas dataframe
  BF: make GNB robust against cases where some feature/classes have no variances
  ENH: add reseeding in tests to reproduce failures
  BF/ENH: make GNB puke when data is clearly degenerate (the same means/variances)
  BF: guard against under/overflows and mitigate in case of logprob=True (default)
  ENH(BK): prototype for get_contrasts to obtain contrasts for fit NiPy GLM  model
  RF(TST): mark portion of the test as not labile
  BF(TST): label pairs should be (pos, neg), so L0, L1 not L1, LO
  RF: swap pos and neg while assigning in MulticlasClassifier
  BF: GNB - add has_sensitivity tag in constructor
  BF/RF: GNB sensitivity - should correspond to the 2nd label to be the "positive" one
  BF: account for np.histogram returning also rightmost edge for nbins, making it nbins + 1
  BF(TST): account for the prior fix to obtain proper pairs in tuples
  BF: make _uniquemerge2literal preserve the tuples (used in sensitivities)
  RF(DOC): document in code inconsistency in order or pos/neg labels which is "by design"
  ENH(TST): verify that sensitivites are computed correctly when wrapped into MulticlassClassifier
  ...
coveralls commented Oct 10, 2019:


Coverage decreased (-4.4%) to 76.056% when pulling 667862a on yarikoptic:enh-clusterthr into 6550809 on PyMVPA:master.

codecov-io commented Oct 11, 2019:

Codecov Report

Merging #395 into master will decrease coverage by 0.16%.
The diff coverage is 93.89%.


@@            Coverage Diff            @@
##           master    #395      +/-   ##
=========================================
- Coverage   75.96%   75.8%   -0.17%     
=========================================
  Files         367     369       +2     
  Lines       42076   42369     +293     
  Branches     6772    6813      +41     
=========================================
+ Hits        31965   32118     +153     
- Misses       8225    8322      +97     
- Partials     1886    1929      +43
Impacted Files Coverage Δ
mvpa2/tests/test_testing.py 95.34% <100%> (+0.95%) ⬆️
mvpa2/tests/test_usecases.py 98.92% <100%> (-0.36%) ⬇️
mvpa2/tests/test_group_clusterthr.py 100% <100%> (ø) ⬆️
mvpa2/tests/test_labeler.py 100% <100%> (ø)
mvpa2/testing/tools.py 79.54% <100%> (-0.07%) ⬇️
mvpa2/misc/neighborhood.py 87.45% <83.33%> (+0.27%) ⬆️
mvpa2/algorithms/group_clusterthr.py 89.09% <86.74%> (-9.23%) ⬇️
mvpa2/measures/label.py 95.83% <95.83%> (ø)
mvpa2/testing/sweep.py 56.25% <0%> (-26.25%) ⬇️
mvpa2/tests/test_multiclf.py 90.97% <0%> (-6.02%) ⬇️
... and 31 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6550809...667862a. Read the comment docs.

6 participants