Add data generation for RC with Active Learning #332

mxschmdt · 2022-10-06T16:48:45Z

Pull Request

What does this PR do?

Add code and instructions to perform data generation for RC using Active Learning.

Description

The core AL part is currently in primeqa/al (primeqa/qg was also modified to add the data generation model) and the instructions with scripts are in examples/datagen_with_al.
The data generation implements the QAGen2S model from https://arxiv.org/abs/2010.06028.

Versioning

When opening a PR to make changes to PrimeQA (i.e. primeqa/) master, be sure to increment the version following
semantic versioning. The VERSION is stored here
and is incremented using bump2version {patch,minor,major} as described in the (development guide documentation)[development.html].

Have you updated the VERSION?
Or does this PR not change the primeqa package or was not into master?
Will bump VERSION once PR is ready for merging

After pulling in changes from master to an existing PR, ensure the VERSION is updated appropriately.
This may require bumping the version again if it has been previously bumped.

If you're not quite ready yet to post a PR for review, feel free to open a draft PR.

Releases

After Merging

If merging into master and VERSION was updated, after this PR is merged:

Create a release from the master with version equal to VERSION
Not merging to master or VERSION not updated

Checklist

Review the following and mark as completed:

mxschmdt · 2022-10-06T16:58:35Z

Will add documentation and tests soon!

This adds a class for active learning, ActiveLearner, as well as a scoring function for RC based confidence score, namely BALD. An example script shows how it can be used.

mxschmdt · 2022-10-10T19:41:14Z

Will add documentation and tests soon!

Done

Signed-off-by: Maximilian Schmidt <maximilian.s@posteo.de>

jdpsen

Some major comments to make the PR compatible with PrimeQA standards:

Let's have a similar namespace as other run-*.py scripts.

Can we change run.py to run_qg_al.py (assuming if we need a separate run-*.py script)
Similarly arguments like qg_do_train should just be do_train, qg_model_name should be model_name_or_path , I think all arguments with qg_* can remove the qg prefix. Check the run_qg and run_mrc to see the common namespace and we should keep it uniform.
We have to make sure we reuse what can be reused from existing PrimeQA.
(a) The prerequisite step of training the qg model, can't we use run_qg to do that ?
(b) for RTCon filtering step , we can finetune the RC on squad using run_mrc.

we should reuse run_mrc and run-qg as much as possible given that they already exist in PrimeQA core, example .py scripts should build on top of that.

mxschmdt · 2022-11-28T12:58:24Z

Some major comments to make the PR compatible with PrimeQA standards:

Let's have a similar namespace as other run-*.py scripts.
1. Can we change run.py to run_qg_al.py  (assuming if we need a separate run-*.py script)

2. Similarly arguments like qg_do_train should just be do_train, qg_model_name should be model_name_or_path , I think all arguments with qg_* can remove the qg prefix. Check the run_qg and run_mrc to see the common namespace and we should keep it uniform.

3. We have to make sure  we  reuse what can be reused from existing PrimeQA.
   (a)  The prerequisite step of training the qg model, can't we use run_qg to do that ?
   (b) for RTCon filtering step , we can finetune the RC on squad using run_mrc.
we should reuse run_mrc and run-qg as much as possible given that they already exist in PrimeQA core, example .py scripts should build on top of that.

Thanks for the comments!
Regarding 2), I need to differ between arguments for the RC model and the QG model because AL can also handle RC training (using BALD).
Therefore should I get rid of the qg_* prefix and keep the rc_* prefix?
I also added some datasets and pre-processing which do not work with the current run_mrc.py script, therefore should I update the run_mrc.py script for this purpose or would you mind keeping it separate?

vishwajeetkumar93 · 2022-11-29T10:10:48Z