
fix: Convert Multilingual/Crosslingual to fast-loading format #635

Merged

Conversation

loicmagne
Member

@loicmagne loicmagne commented May 5, 2024

Following #530, #572

The goal of this PR is to convert multilingual/crosslingual datasets to the fast-loading format, i.e. one where each row in the dataset has an additional "lang" feature. I don't know of an automatic way to do this, so for now I'm updating each dataset on a case-by-case basis, which is a bit tedious.

List of datasets converted/to convert:

STS:

  • STS17Crosslingual
  • STS22CrosslingualSTS
  • IndicCrosslingualSTS
  • STSBenchmarkMultilingualSTS

Pair classification:

  • XNLI

Bitext Mining:

  • TatoebaBitextMining
  • BUCCBitextMining
  • FloresBitextMining
  • IN22ConvBitextMining
  • IN22GenBitextMining
  • NTREXBitextMining

🚧 BibleNLPBitextMining

Classification

  • IndicSentimentClassification
  • MultiHateClassification
  • MultilingualSentimentClassification
  • TweetSentimentClassification
  • MasakhaNEWSClassification
  • SIB200Classification
  • MassiveIntentClassification
  • MassiveScenarioClassification

Those are the datasets with >10 subsets, which would benefit the most from fast loading
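As a rough sketch of what this conversion changes (illustrative data only, not mteb's actual loader code), the old layout keeps one subset per language, while the fast-loading layout is a single flat table whose rows carry a `lang` field and can be split locally after one download:

```python
# Old format: one subset (and one remote file) per language, so loading N
# languages means N separate requests.
old_format = {
    "en": [{"sentence": "Hello"}, {"sentence": "Goodbye"}],
    "fr": [{"sentence": "Bonjour"}, {"sentence": "Au revoir"}],
}

# Fast-loading format: a single table where every row has a "lang" feature,
# loadable in one request and regrouped locally.
fast_format = [
    {"sentence": "Hello", "lang": "en"},
    {"sentence": "Goodbye", "lang": "en"},
    {"sentence": "Bonjour", "lang": "fr"},
    {"sentence": "Au revoir", "lang": "fr"},
]

def split_by_lang(rows):
    """Recover the per-language subsets from the flat table."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row["lang"], []).append({"sentence": row["sentence"]})
    return subsets

assert split_by_lang(fast_format) == old_format
```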

@loicmagne loicmagne added the WIP Work In Progress label May 5, 2024
@loicmagne loicmagne self-assigned this May 5, 2024
@loicmagne
Member Author

One of the issues in converting existing datasets is that several of them use custom loading scripts, which makes conversion to the fast format non-trivial.

For example, Flores, NTREX and IN22-Conv don't explicitly define subsets for each language pair; they contain data for each language, and the pairs are created on the fly (so the 'en-fr' subset and the 'en-es' one share the same 'en' sentences). I'm not sure what the correct way to handle this would be. Converting those datasets to the standard "1 file per subset" format would duplicate a lot of data, but having different configurations for each dataset is hard to maintain.
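The on-the-fly pairing described above might look roughly like this (hypothetical data and function names, not the actual loading scripts): one sentence list is stored per language, and a pair subset is just a zip of two lists, so the English sentences are never duplicated across pairs.

```python
# One sentence list per language (illustrative data).
per_language = {
    "en": ["Hello", "Goodbye"],
    "fr": ["Bonjour", "Au revoir"],
    "es": ["Hola", "Adiós"],
}

def build_pair(lang1, lang2):
    """Create the parallel subset for one language pair; rows are aligned by index."""
    return [
        {"sentence1": s1, "sentence2": s2}
        for s1, s2 in zip(per_language[lang1], per_language[lang2])
    ]

# 'en-fr' and 'en-es' share the exact same 'en' sentences, which is what a
# naive "1 file per subset" export would duplicate.
assert [r["sentence1"] for r in build_pair("en", "fr")] == \
       [r["sentence1"] for r in build_pair("en", "es")]
```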

@KennethEnevoldsen
Contributor

@loicmagne, let us do the easy ones in this PR and then discuss how best to handle the rest.

@loicmagne loicmagne removed the WIP Work In Progress label May 7, 2024
@loicmagne
Member Author

I've managed to convert most multilingual datasets from the STS, Pair Classification and Classification categories, at least those with more than 10 subsets.

The remaining datasets are in the Bitext Mining category and are not straightforward to convert. They either have a different format, or have too many files to be loaded in one go (see huggingface/datasets#6877).

I suggest we merge this PR and discuss how to handle the remaining datasets ? @KennethEnevoldsen

@loicmagne loicmagne marked this pull request as ready for review May 7, 2024 22:15
@loicmagne
Member Author

loicmagne commented May 7, 2024

I checked that all the results remain the same within a 1e-4 threshold, although I don't really know why they sometimes vary slightly.

@KennethEnevoldsen
Contributor

I don't really know why they sometimes vary slightly

My guess is that it has to do with a single place where the calculations change per run, influencing the seed (a solution is to use an rng_state which is passed along but not influenced by other operations, as is e.g. done in #481). For a related blog post you might want to check out: https://builtin.com/data-science/numpy-random-seed
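The passed-along rng_state idea can be sketched with the standard library (hypothetical function name, not the #481 implementation): a dedicated `random.Random` instance is threaded through, so unrelated uses of the global RNG between runs can no longer shift the sampling.

```python
import random

def subsample(items, k, rng):
    # A dedicated rng is passed in instead of relying on the module-level
    # random state, which any other code could advance between runs.
    return rng.sample(items, k)

first = subsample(list(range(100)), 5, random.Random(42))
random.random()  # an unrelated draw from the *global* RNG between runs...
second = subsample(list(range(100)), 5, random.Random(42))
assert first == second  # ...does not change the sampled subset
```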

I think that is a separate PR though. I think merging this is a great idea.

Will you add points (bug fixes where you add one point per dataset seems reasonable)? Will you also create an issue for the remaining datasets not addressed here?

@loicmagne
Member Author

Sounds good, I'm writing the issue for the remaining datasets.

@KennethEnevoldsen KennethEnevoldsen changed the title Convert Multilingual/Crosslingual to fast-loading format fix: Convert Multilingual/Crosslingual to fast-loading format May 8, 2024
@loicmagne
Member Author

Issue opened here for the remaining datasets: #651

@loicmagne loicmagne added the WIP Work In Progress label May 14, 2024
@loicmagne
Member Author

loicmagne commented May 14, 2024

@KennethEnevoldsen
Following #651, I converted the remaining datasets to a compact format and changed the BitextMiningEvaluator accordingly to handle multiple languages.

I think this makes those datasets usable now: loading and running evaluation on Flores across the 42k language pairs now takes <10 minutes on small models.

The last remaining dataset, BibleNLPBitextMining, relies on a fix to the datasets lib (huggingface/datasets#6893) which isn't in the latest release yet, so I'll wait for that.
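The compact bitext format described above can be pictured as one flat table whose rows name their language pair, which an evaluator can group once and then score pair by pair (hypothetical row layout and function name, not the actual BitextMiningEvaluator API):

```python
# Flat compact table: each row carries its language pair (illustrative data).
rows = [
    {"sentence1": "Hello", "sentence2": "Bonjour", "lang": "eng-fra"},
    {"sentence1": "Hello", "sentence2": "Hola", "lang": "eng-spa"},
]

def group_by_pair(rows):
    """Bucket aligned sentence pairs by language pair, ready for per-pair scoring."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["lang"], []).append(
            (row["sentence1"], row["sentence2"])
        )
    return grouped

# One pass over the table yields every language-pair subset at once.
assert set(group_by_pair(rows)) == {"eng-fra", "eng-spa"}
```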

@loicmagne loicmagne removed the WIP Work In Progress label May 14, 2024
@KennethEnevoldsen
Contributor

@loicmagne once we have resolved the question related to BUCC I believe we can merge this

@loicmagne
Member Author

@loicmagne once we have resolved the question related to BUCC I believe we can merge this

@KennethEnevoldsen For the BUCC dataset, you can see the new results in the BUCC.json file. Overall there's a ~1 percentage point difference; I can revert the changes if that's a problem, but it simplifies the code greatly.

How did you proceed for other tasks (clustering, I think) where the changes created non-backward-compatible results?

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented May 15, 2024

@loicmagne what we have done is keep the old implementation (which you might do here by moving the logic over to the BUCC task) and then add a superseeded_by = "new_task_name" (see e.g. #694) to the old task. Then we can still run the old task, but it will raise a warning stating that X supersedes it.

I believe this is the best approach here as well.
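The superseding mechanism described above might look roughly like this (hypothetical task classes; the attribute spelling follows the comment above, and the real mteb tasks carry much more metadata):

```python
import warnings

class Task:
    # When set, running the task still works but emits a deprecation-style warning.
    superseeded_by = None

    def evaluate(self):
        if self.superseeded_by is not None:
            warnings.warn(
                f"{type(self).__name__} is superseded by {self.superseeded_by}; "
                "consider running the newer task instead."
            )
        return "scores"

class BUCC(Task):
    superseeded_by = "BUCC.v2"

# The old task remains runnable, but the warning points users at the new one.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = BUCC().evaluate()
assert result == "scores"
assert "BUCC.v2" in str(caught[0].message)
```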

@loicmagne
Member Author

@KennethEnevoldsen alright, I added back the previous BUCC version and named the new one BUCC.v2. There's almost a 20x evaluation-time difference between the two, so I think it's worth having the new version.

I think we can merge then

@loicmagne
Member Author

@KennethEnevoldsen Can we merge this?

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented May 17, 2024

Yes indeed, thanks again @loicmagne! I have enabled automerge.

@KennethEnevoldsen alright I added back the previous BUCC version and named the new one BUCC.v2. There's almost a 20x evaluation time difference between the two so I think it's worth having the new version

Sorry, I was out yesterday so didn't have time to look at this before now. 20x definitely seems like a good reason to introduce v2.

@KennethEnevoldsen KennethEnevoldsen enabled auto-merge (squash) May 17, 2024 10:36
@KennethEnevoldsen KennethEnevoldsen merged commit aa82ada into embeddings-benchmark:main May 17, 2024
7 checks passed