fix: Convert Reddit cluster s2s and p2p to fast #729

isaac-chung · 2024-05-15T12:28:45Z

Checklist for adding MMTEB dataset

Resolve #728

mteb/tasks/Clustering/eng/RedditClustering.py

isaac-chung · 2024-05-15T12:55:10Z

When subsampling for the p2p dataset, there are a few labels that have only 1 sample each, so we get the following error:

ValueError: The least populated class in labels column has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

Should we exclude them before subsampling, or force-include them somehow?

KennethEnevoldsen · 2024-05-15T13:53:22Z

When subsampling for the p2p dataset, there are a few labels that have only 1 sample each.

Clusters with only one label?

isaac-chung · 2024-05-15T13:58:29Z

Clusters with only one label?

Yes. You should be able to see this error when you check out the PR and run:

gh pr checkout 729
mteb -t RedditFastClusteringP2P -m sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

KennethEnevoldsen · 2024-05-15T14:09:25Z

Hmm, that seems weird. I assume this was already the case before the conversion as well? It might be that the cluster sets do not have the same labels across the sets.

isaac-chung · 2024-05-15T14:21:40Z

Yeah. But before conversion, we do not use subsampling.

The following labels only have 1 sample in the clusters after combining all rows:

array(['Bariloche', 'CRISPR', 'RomanianWolves', 'TherosDMs',
       'ValorantBrasil', 'auntienetwork', 'candlemagick', 'graylog',
       'imaginarypenpals', 'visualization'], dtype='<U21')
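A minimal sketch of how such single-sample labels could be found, assuming the labels from all rows have already been combined into one flat list. The output above suggests NumPy was used; this sketch sticks to the standard library, and the function name `singleton_labels` and the example values are hypothetical, not taken from the PR.

```python
from collections import Counter


def singleton_labels(labels):
    """Return, in sorted order, the labels that occur exactly once."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n == 1)


# Made-up labels for illustration only; not the real dataset.
example = ["python", "CRISPR", "python", "graylog", "askscience", "askscience"]
print(singleton_labels(example))
```

Any label this returns would leave a stratified subsample with a class of size 1, which is the condition the error message complains about.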

KennethEnevoldsen · 2024-05-15T14:27:34Z

We should probably filter out all labels with fewer than N samples. I assume N=1-5 is reasonable?

isaac-chung · 2024-05-15T20:45:01Z

Went with N=1.
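A rough sketch of what filtering by label count could look like, not the code merged in this PR. The thread does not say whether the threshold is strict, so this assumes that labels occurring `n` or fewer times are dropped, which with `n=1` removes exactly the single-sample labels that triggered the error. The function name `drop_rare_labels` and its parameters are hypothetical.

```python
from collections import Counter


def drop_rare_labels(texts, labels, n=1):
    """Drop every (text, label) pair whose label occurs n or fewer times.

    Assumption: the cutoff is inclusive, so n=1 removes single-sample labels.
    """
    counts = Counter(labels)
    kept = [(text, label) for text, label in zip(texts, labels) if counts[label] > n]
    kept_texts = [text for text, _ in kept]
    kept_labels = [label for _, label in kept]
    return kept_texts, kept_labels


# Made-up data for illustration only.
texts = ["post a", "post b", "post c"]
labels = ["python", "graylog", "python"]
print(drop_rare_labels(texts, labels, n=1))
```

Filtering before subsampling keeps every remaining class at two or more members, so a stratified split no longer hits the single-member case.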

reddit cluster s2s to fast

ebefa8a

isaac-chung marked this pull request as draft May 15, 2024 12:28

got task running and add results

987b158

isaac-chung changed the title ~~Convert reddit cluster s2s to fast~~ Convert reddit cluster s2s and p2p to fast May 15, 2024

KennethEnevoldsen reviewed May 15, 2024


isaac-chung added 2 commits May 15, 2024 12:59

rerun with 16k samples

d0c054f

full reddit cluster s2s

76fa271

isaac-chung added 4 commits May 15, 2024 17:25

filter out labels and result runs

d8b6c4b

Merge branch 'main' into convert-reddit-s2s-to-fast

af09f9c

points

5056e79

make lint

3ae224f

isaac-chung marked this pull request as ready for review May 15, 2024 17:27

isaac-chung requested a review from KennethEnevoldsen May 15, 2024 17:27

isaac-chung changed the title ~~Convert reddit cluster s2s and p2p to fast~~ fix: Convert reddit cluster s2s and p2p to fast May 15, 2024

validation

ae107b0

isaac-chung changed the title ~~fix: Convert reddit cluster s2s and p2p to fast~~ fix: Convert Reddit cluster s2s and p2p to fast May 15, 2024

Merge branch 'main' into convert-reddit-s2s-to-fast

4fda934

isaac-chung assigned KennethEnevoldsen May 16, 2024

isaac-chung added 5 commits May 16, 2024 12:41

Merge branch 'main' into convert-reddit-s2s-to-fast

23a58cd

Merge branch 'main' into convert-reddit-s2s-to-fast

8f55c69

Merge branch 'main' into convert-reddit-s2s-to-fast

aa8c45c

Merge branch 'main' into convert-reddit-s2s-to-fast

0b579a1

Merge branch 'main' into convert-reddit-s2s-to-fast

3781ae1

isaac-chung enabled auto-merge (squash) May 17, 2024 07:27

Merge branch 'main' into convert-reddit-s2s-to-fast

77bfd73

isaac-chung merged commit b02f252 into main May 17, 2024
7 checks passed

isaac-chung deleted the convert-reddit-s2s-to-fast branch May 17, 2024 07:38
