fix: Convert Reddit cluster s2s and p2p to fast #729

Merged: 16 commits merged into main from convert-reddit-s2s-to-fast on May 17, 2024

Conversation

isaac-chung
Collaborator

@isaac-chung isaac-chung commented May 15, 2024

Checklist for adding MMTEB dataset

Resolve #728

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).

@isaac-chung isaac-chung marked this pull request as draft May 15, 2024 12:28
@isaac-chung isaac-chung changed the title Convert reddit cluster s2s to fast Convert reddit cluster s2s and p2p to fast May 15, 2024
@isaac-chung
Collaborator Author

When subsampling the p2p dataset, a few labels have only one sample, so we get the following error:

ValueError: The least populated class in labels column has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

Should we exclude them before subsampling? Or force include them somehow?

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented May 15, 2024

When subsampling the p2p dataset, a few labels have only one sample.

Clusters with only one label?

@isaac-chung
Collaborator Author

Clusters with only one label?

Yes. You should be able to see this error when you check out the PR and run

gh pr checkout 729
mteb -t RedditFastClusteringP2P -m sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

@KennethEnevoldsen
Contributor

Hmm, that seems weird. This should be there before conversion as well, I assume? It might be that the cluster sets do not have the same labels across the sets.

@isaac-chung
Collaborator Author

isaac-chung commented May 15, 2024

Yeah. But before conversion, we do not use subsampling.

The following labels only have 1 sample in the clusters after combining all rows:

array(['Bariloche', 'CRISPR', 'RomanianWolves', 'TherosDMs',
       'ValorantBrasil', 'auntienetwork', 'candlemagick', 'graylog',
       'imaginarypenpals', 'visualization'], dtype='<U21')
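A list like the one above can be derived by counting how often each label occurs and keeping those that occur exactly once. The following is only a minimal sketch under assumptions: `singleton_labels` is a hypothetical helper, the toy labels are made up rather than taken from the Reddit dataset, and it uses the standard library instead of whatever produced the NumPy-style output above.

```python
from collections import Counter


def singleton_labels(labels):
    """Return, in sorted order, the labels that occur exactly once."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n == 1)


# Toy data for illustration only (not from the Reddit clustering dataset).
toy_labels = ["python", "rust", "python", "golang", "rust", "python"]
print(singleton_labels(toy_labels))  # ['golang']
```
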

@KennethEnevoldsen
Contributor

We should probably filter out all labels with fewer than N samples. I assume N=1-5 is reasonable?
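The filtering idea could look roughly like the sketch below. It is an illustration, not the code merged in this PR: `filter_rare_labels` and its `min_count` parameter are hypothetical names, and the data is assumed to be two parallel lists of texts and labels. The default of 2 is chosen only because the error message above requires at least two members per class.

```python
from collections import Counter


def filter_rare_labels(sentences, labels, min_count=2):
    """Drop every example whose label occurs fewer than `min_count` times.

    `sentences` and `labels` are parallel lists; the returned lists keep
    the original order of the surviving examples.
    """
    counts = Counter(labels)
    kept = [(s, l) for s, l in zip(sentences, labels) if counts[l] >= min_count]
    kept_sentences = [s for s, _ in kept]
    kept_labels = [l for _, l in kept]
    return kept_sentences, kept_labels


# Toy data for illustration only.
texts = ["post a", "post b", "post c", "post d"]
subs = ["python", "graylog", "python", "rust"]
print(filter_rare_labels(texts, subs))
# (['post a', 'post c'], ['python', 'python'])
```

Running such a filter before the stratified subsampling step means every remaining label has enough samples to be split.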

@isaac-chung isaac-chung marked this pull request as ready for review May 15, 2024 17:27
@isaac-chung isaac-chung changed the title Convert reddit cluster s2s and p2p to fast fix: Convert reddit cluster s2s and p2p to fast May 15, 2024
@isaac-chung isaac-chung changed the title fix: Convert reddit cluster s2s and p2p to fast fix: Convert Reddit cluster s2s and p2p to fast May 15, 2024
@isaac-chung
Collaborator Author

Went with N=1.

@isaac-chung isaac-chung enabled auto-merge (squash) May 17, 2024 07:27
@isaac-chung isaac-chung merged commit b02f252 into main May 17, 2024
7 checks passed
@isaac-chung isaac-chung deleted the convert-reddit-s2s-to-fast branch May 17, 2024 07:38
Development

Successfully merging this pull request may close these issues.

Convert Reddit to ClusteringFast
2 participants