
Automated adjustment of n_bins when discrete labels are passed #128

Open · stes opened this issue Feb 12, 2024 · Discussed in #106 · 2 comments
Labels: enhancement (New feature or request)

@stes (Member) commented Feb 12, 2024

As discussed in #106 by @FrancescaGuo, passing a discrete index to the consistency calculation with the default n_bins = 100 raises an (expected) error. The current workaround is to set n_bins to the number of distinct label values. This could be improved directly in the code: whenever discrete labels are passed, the binning step required for continuous data could be replaced or adapted.
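A minimal sketch of what such an adjustment could look like, assuming integer dtype is used to detect discrete labels; _is_discrete and _resolve_n_bins are hypothetical helpers for illustration, not part of CEBRA:

import numpy as np

def _is_discrete(labels):
    # Assumption: integer-typed label arrays are treated as discrete.
    return np.issubdtype(np.asarray(labels).dtype, np.integer)

def _resolve_n_bins(labels_list, default_n_bins=100):
    if all(_is_discrete(labels) for labels in labels_list):
        # For discrete labels, count the values shared by all sessions
        # instead of digitizing the value range into `default_n_bins` bins.
        shared = set(np.unique(labels_list[0]).tolist())
        for labels in labels_list[1:]:
            shared &= set(np.unique(labels).tolist())
        return len(shared)
    return default_n_bins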

@stes stes added the enhancement New feature or request label Feb 12, 2024
@stes stes self-assigned this Feb 12, 2024
@introspective-swallow (Contributor)
In what way should discrete labels be treated? If they are to be treated as qualitative variables, a warning should be raised whenever values get merged (e.g. some embedding has no occurrence of a value, so n_bins is set to something like the minimum number of label values that appear in all cases); see the sketch below.
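A hedged sketch of that warning, for illustration only (the helper name and message are assumptions, not existing CEBRA code):

import warnings
import numpy as np

def _warn_on_unshared_values(labels_list):
    # Collect the label values observed in each session.
    per_session = [set(np.unique(labels).tolist()) for labels in labels_list]
    shared = set.intersection(*per_session)
    missing = sorted(set.union(*per_session) - shared)
    if missing:
        warnings.warn(
            f"Label values {missing} do not occur in every session and "
            "would be merged or dropped during alignment.")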

@GarrettBlair
Adding my experience with this issue as well: if labels are not fully sampled between the min and max values (say label values are 0 and 2, but never 1), then _coarse_to_fine() throws an error.

The following example code:

import cebra
import numpy as np

# Two random embeddings with the same number of samples.
embedding1 = np.random.uniform(0, 1, (1000, 5))
embedding2 = np.random.uniform(0, 1, (1000, 8))
labels1 = np.random.uniform(0, 1, (1000,))
labels2 = np.random.uniform(0, 1, (1000,))

# Force the labels to only sample 0 or 2 (1 never occurs).
labels1 = np.round(labels1)
labels1[labels1 > 0] += 1
labels2 = np.round(labels2)
labels2[labels2 > 0] += 1

# Between-runs consistency: works, no labels needed.
scores, pairs, ids_runs = cebra.sklearn.metrics.consistency_score(
    embeddings=[embedding1, embedding2],
    between="runs")

# Between-datasets consistency, aligning on the labels: raises the error.
scores, pairs, ids_datasets = cebra.sklearn.metrics.consistency_score(
    embeddings=[embedding1, embedding2],
    labels=[labels1, labels2],
    dataset_ids=["achilles", "buddy"],
    between="datasets")

yields:

ValueError: Digitalized labels does not have elements close enough to bin index 4. The bin index should be in the range of the labels values.

I'm using an angular position label with avoidance learning, so not fully sampling the label range is common, and using fewer bins to avoid the error merges too many bins and loses too much resolution. A value may also be sampled in one set of labels and not the other.

For clarity, a session with this problem has the following unique labels:
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 30 31 32 33 34 35 36 ]
So the issue comes from the lack of sampling in bins 22-29 (10-degree bins, 0-360).

I think, as you stated @stes, allowing the user to pass discrete labels would alleviate this? I'm not sure how this would be handled when one session has different sampling than another (session one has [0, 1, 2] and session two only has [0, 2]); a possible sketch follows below.
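One possible way to handle mismatched sampling, sketched under the assumption that it is acceptable to drop samples whose label value is missing from the other session (_align_discrete_labels is a hypothetical helper, not a CEBRA API):

import numpy as np

def _align_discrete_labels(labels1, labels2):
    # Keep only label values observed in both sessions, then remap the
    # surviving values to contiguous codes so no bin index is empty.
    shared = np.intersect1d(np.unique(labels1), np.unique(labels2))
    keep1 = np.isin(labels1, shared)
    keep2 = np.isin(labels2, shared)
    # `shared` is sorted, so searchsorted maps each value to 0..len(shared)-1.
    codes1 = np.searchsorted(shared, labels1[keep1])
    codes2 = np.searchsorted(shared, labels2[keep2])
    return codes1, codes2, keep1, keep2

The returned masks would also need to be applied to the corresponding embeddings so that samples stay paired with their labels.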
