
fix: Convert Multilingual/Crosslingual to fast-loading format #635

Merged

Conversation

loicmagne
Member

@loicmagne loicmagne commented May 5, 2024

Following #530, #572

The goal of this PR is to convert multilingual/crosslingual datasets to the fast-loading format, i.e. one where each row in the dataset has an additional "lang" feature. I don't know of an automatic way to do this, so for now I'm updating each dataset on a case-by-case basis, which is a bit tedious.

List of datasets converted/to convert:

STS:

  • STS17Crosslingual
  • STS22CrosslingualSTS
  • IndicCrosslingualSTS
  • STSBenchmarkMultilingualSTS

Pair classification:

  • XNLI

Bitext Mining:

  • TatoebaBitextMining
  • BUCCBitextMining
  • FloresBitextMining
  • IN22ConvBitextMining
  • IN22GenBitextMining
  • NTREXBitextMining

🚧 BibleNLPBitextMining

Classification

  • IndicSentimentClassification
  • MultiHateClassification
  • MultilingualSentimentClassification
  • TweetSentimentClassification
  • MasakhaNEWSClassification
  • SIB200Classification
  • MassiveIntentClassification
  • MassiveScenarioClassification

Those are the datasets with >10 subsets, which would benefit the most from fast loading
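As a rough sketch of what this conversion changes (illustrative data only, not mteb's actual loader code), the old layout keeps one subset per language, while the fast-loading layout is a single flat table whose rows carry a `lang` field and can be split locally after one download:

```python
# Old format: one subset (and one remote file) per language, so loading N
# languages means N separate requests.
old_format = {
    "en": [{"sentence": "Hello"}, {"sentence": "Goodbye"}],
    "fr": [{"sentence": "Bonjour"}, {"sentence": "Au revoir"}],
}

# Fast-loading format: a single table where every row has a "lang" feature,
# loadable in one request and regrouped locally.
fast_format = [
    {"sentence": "Hello", "lang": "en"},
    {"sentence": "Goodbye", "lang": "en"},
    {"sentence": "Bonjour", "lang": "fr"},
    {"sentence": "Au revoir", "lang": "fr"},
]

def split_by_lang(rows):
    """Recover the per-language subsets from the flat table."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row["lang"], []).append({"sentence": row["sentence"]})
    return subsets

assert split_by_lang(fast_format) == old_format
```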

@loicmagne loicmagne added the WIP Work In Progress label May 5, 2024
@loicmagne loicmagne self-assigned this May 5, 2024
@loicmagne
Member Author

One of the issues in converting existing datasets is that several of them use custom loading scripts, which makes conversion to the fast format non-trivial.

For example, Flores, NTREX and IN22-Conv don't explicitly define subsets for each language pair; they contain data for each language, and the pairs are created on the fly (so the 'en-fr' subset and the 'en-es' one share the same 'en' sentences). I'm not sure what the correct way to handle this would be. Converting those datasets to the standard "1 file per subset" format would duplicate a lot of data, but having different configurations for each dataset is hard to maintain.
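The on-the-fly pairing described above might look roughly like this (hypothetical data and function names, not the actual loading scripts): one sentence list is stored per language, and a pair subset is just a zip of two lists, so the English sentences are never duplicated across pairs.

```python
# One sentence list per language (illustrative data).
per_language = {
    "en": ["Hello", "Goodbye"],
    "fr": ["Bonjour", "Au revoir"],
    "es": ["Hola", "Adiós"],
}

def build_pair(lang1, lang2):
    """Create the parallel subset for one language pair; rows are aligned by index."""
    return [
        {"sentence1": s1, "sentence2": s2}
        for s1, s2 in zip(per_language[lang1], per_language[lang2])
    ]

# 'en-fr' and 'en-es' share the exact same 'en' sentences, which is what a
# naive "1 file per subset" export would duplicate.
assert [r["sentence1"] for r in build_pair("en", "fr")] == \
       [r["sentence1"] for r in build_pair("en", "es")]
```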

@KennethEnevoldsen
Contributor

@loicmagne, let us do the easy ones in this PR and then discuss how best to handle the rest.

@loicmagne loicmagne removed the WIP Work In Progress label May 7, 2024
@loicmagne
Member Author

I've managed to convert most multilingual datasets from the STS, Pair Classification and Classification categories, at least those with more than 10 subsets.

The remaining datasets are in the Bitext Mining category and are not straightforward to convert. They either have a different format, or have too many files to be loaded in one go (see huggingface/datasets#6877).

I suggest we merge this PR and discuss how to handle the remaining datasets ? @KennethEnevoldsen

@loicmagne loicmagne marked this pull request as ready for review May 7, 2024 22:15
@loicmagne
Member Author

loicmagne commented May 7, 2024

I checked that all the results remain the same within a 1e-4 threshold, although I don't really know why they sometimes vary slightly.

@KennethEnevoldsen
Contributor

I don't really know why they sometimes vary slightly

My guess is that it has to do with a single place where the calculations change per run, influencing the seed (a solution is to use an rng_state which is passed along but not influenced by other operations, as is e.g. done in #481). For a related blog post you might want to check out: https://builtin.com/data-science/numpy-random-seed
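The passed-along rng_state idea can be sketched with the standard library (hypothetical function name, not the #481 implementation): a dedicated `random.Random` instance is threaded through, so unrelated uses of the global RNG between runs can no longer shift the sampling.

```python
import random

def subsample(items, k, rng):
    # A dedicated rng is passed in instead of relying on the module-level
    # random state, which any other code could advance between runs.
    return rng.sample(items, k)

first = subsample(list(range(100)), 5, random.Random(42))
random.random()  # an unrelated draw from the *global* RNG between runs...
second = subsample(list(range(100)), 5, random.Random(42))
assert first == second  # ...does not change the sampled subset
```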

I think that is a separate PR though. I think merging this is a great idea.

Will you add points (bug fixes where you add one point per dataset seems reasonable)? Will you also create an issue for the remaining datasets not addressed here?

@loicmagne
Member Author

Sounds good, I'm writing the issue for the remaining datasets.

@KennethEnevoldsen KennethEnevoldsen changed the title Convert Multilingual/Crosslingual to fast-loading format fix: Convert Multilingual/Crosslingual to fast-loading format May 8, 2024
@loicmagne
Member Author

Issue opened here for the remaining datasets: #651

@loicmagne loicmagne added the WIP Work In Progress label May 14, 2024
@loicmagne
Member Author

loicmagne commented May 14, 2024

@KennethEnevoldsen
Following #651, I converted the remaining datasets to a compact format and changed the BitextMiningEvaluator accordingly to handle multiple languages.

I think this makes those datasets usable now: loading and running evaluation on Flores across the 42k language pairs now takes <10 minutes on small models.

The last remaining dataset, BibleNLPBitextMining, relies on a fix to the datasets lib (huggingface/datasets#6893) which isn't in the latest release yet, so I'll wait for that.
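The compact bitext format described above can be pictured as one flat table whose rows name their language pair, which an evaluator can group once and then score pair by pair (hypothetical row layout and function name, not the actual BitextMiningEvaluator API):

```python
# Flat compact table: each row carries its language pair (illustrative data).
rows = [
    {"sentence1": "Hello", "sentence2": "Bonjour", "lang": "eng-fra"},
    {"sentence1": "Hello", "sentence2": "Hola", "lang": "eng-spa"},
]

def group_by_pair(rows):
    """Bucket aligned sentence pairs by language pair, ready for per-pair scoring."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["lang"], []).append(
            (row["sentence1"], row["sentence2"])
        )
    return grouped

# One pass over the table yields every language-pair subset at once.
assert set(group_by_pair(rows)) == {"eng-fra", "eng-spa"}
```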

@loicmagne loicmagne removed the WIP Work In Progress label May 14, 2024
@KennethEnevoldsen
Contributor

@loicmagne once we have resolved the question related to BUCC I believe we can merge this

@loicmagne
Member Author

@loicmagne once we have resolved the question related to BUCC I believe we can merge this

@KennethEnevoldsen For the BUCC dataset, you can see the new results in the BUCC.json file. Overall there's a ~1 percentage point difference; I can revert the changes if that's a problem, but it simplifies the code greatly.

How did you proceed for other tasks (clustering, I think) where the changes created non-backward-compatible results?

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented May 15, 2024

@loicmagne what we have done is keep the old implementation (which you might do here by moving the logic over to the BUCC task) and then add a superseeded_by = "new_task_name" (see e.g. #694) to the old task. Then we can still run the old task, but it will raise a warning stating that X supersedes it.

I believe this is the best approach here as well.
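The superseding mechanism described above might look roughly like this (hypothetical task classes; the attribute spelling follows the comment above, and the real mteb tasks carry much more metadata):

```python
import warnings

class Task:
    # When set, running the task still works but emits a deprecation-style warning.
    superseeded_by = None

    def evaluate(self):
        if self.superseeded_by is not None:
            warnings.warn(
                f"{type(self).__name__} is superseded by {self.superseeded_by}; "
                "consider running the newer task instead."
            )
        return "scores"

class BUCC(Task):
    superseeded_by = "BUCC.v2"

# The old task remains runnable, but the warning points users at the new one.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = BUCC().evaluate()
assert result == "scores"
assert "BUCC.v2" in str(caught[0].message)
```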

@loicmagne
Member Author

@KennethEnevoldsen alright, I added back the previous BUCC version and named the new one BUCC.v2. There's almost a 20x evaluation-time difference between the two, so I think it's worth having the new version.

I think we can merge then

@loicmagne
Member Author

@KennethEnevoldsen Can we merge this?

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented May 17, 2024

Yes indeed, thanks again @loicmagne! I have enabled automerge.

@KennethEnevoldsen alright I added back the previous BUCC version and named the new one BUCC.v2. There's almost a 20x evaluation time difference between the two so I think it's worth having the new version

Sorry, I was out yesterday so didn't have time to look at this before now. 20x definitely seems like a good reason to introduce v2.

@KennethEnevoldsen KennethEnevoldsen enabled auto-merge (squash) May 17, 2024 10:36
@KennethEnevoldsen KennethEnevoldsen merged commit aa82ada into embeddings-benchmark:main May 17, 2024
7 checks passed