
Add data generation for RC with Active Learning #332

Open
wants to merge 24 commits into main
Conversation

@mxschmdt mxschmdt commented Oct 6, 2022

Pull Request

What does this PR do?

Adds code and instructions for performing data generation for reading comprehension (RC) using Active Learning (AL).

Description

The core AL part is currently in primeqa/al (primeqa/qg was also modified to add the data generation model), and the instructions and scripts are in examples/datagen_with_al.
The data generation implements the QAGen2S model from https://arxiv.org/abs/2010.06028.
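For readers unfamiliar with QAGen2S, the two-step flow can be sketched as follows: given a passage, step 1 generates a question from the context, and step 2 generates an answer conditioned on the context and the generated question. The prompt formats and the injected `generate` callable below are illustrative assumptions for this sketch, not PrimeQA's actual API:

```python
# Sketch of two-step question-answer generation in the QAGen2S style.
# The prompt strings are assumed formats for illustration only.

def make_question_input(context: str) -> str:
    # Step 1: the seq2seq model sees only the context.
    return f"generate question: {context}"

def make_answer_input(context: str, question: str) -> str:
    # Step 2: the model sees the generated question plus the context.
    return f"generate answer: question: {question} context: {context}"

def qagen2s(context: str, generate) -> dict:
    """Run both generation steps with any seq2seq `generate(text) -> text`."""
    question = generate(make_question_input(context))
    answer = generate(make_answer_input(context, question))
    return {"context": context, "question": question, "answer": answer}
```

Any seq2seq model (e.g. a fine-tuned BART or T5 wrapped as `generate`) can be plugged into this skeleton.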

Versioning

When opening a PR that changes the PrimeQA package (i.e. primeqa/) on master, be sure to increment the version following
semantic versioning. The VERSION is stored here
and is incremented using bump2version {patch,minor,major} as described in the [development guide documentation](development.html).

  • Have you updated the VERSION?
  • Or does this PR leave the primeqa package unchanged, or target a branch other than master?
  • Will bump VERSION once the PR is ready for merging.

After pulling in changes from master to an existing PR, ensure the VERSION is updated appropriately.
This may require bumping the version again if it has been previously bumped.

If you're not quite ready yet to post a PR for review, feel free to open a draft PR.

Releases

After Merging

If merging into master and VERSION was updated, after this PR is merged:

Checklist

Review the following and mark as completed:

  • Tag an issue or issues this PR addresses.
  • Added description of changes proposed.
  • Review requested as appropriate.
  • Version bumped as appropriate.
  • New classes, methods, and functions documented.
  • Documentation for modified code is updated.
  • Built documentation to confirm it renders as expected (see here).
  • Code cleaned up and commented out code removed.
  • Tests added so that all functionality is covered at >= 60% unit test coverage (see here).
  • Release created as needed after merging.

@mxschmdt mxschmdt self-assigned this Oct 6, 2022
@mxschmdt mxschmdt changed the title Add data generation with Active Learning Add data generation for RC with Active Learning Oct 6, 2022
mxschmdt commented Oct 6, 2022

Will add documentation and tests soon!

@mxschmdt

> Will add documentation and tests soon!

Done

@mxschmdt mxschmdt requested a review from avisil October 10, 2022 19:42
Signed-off-by: Maximilian Schmidt <maximilian.s@posteo.de>
@jdpsen jdpsen left a comment


Some major comments to make the PR compatible with PrimeQA standards:

Let's use a namespace similar to the other run_*.py scripts.

  1. Can we change run.py to run_qg_al.py (assuming we need a separate run_*.py script)?
  2. Similarly, arguments like qg_do_train should just be do_train, and qg_model_name should be model_name_or_path; all arguments with a qg_* prefix can drop it. Check run_qg and run_mrc to see the common namespace, which we should keep uniform.
  3. We have to make sure we reuse what can be reused from existing PrimeQA:
    (a) For the prerequisite step of training the QG model, can't we use run_qg to do that?
    (b) For the RTCon filtering step, we can fine-tune the RC model on SQuAD using run_mrc.

We should reuse run_mrc and run_qg as much as possible given that they already exist in PrimeQA core; example .py scripts should build on top of them.

mxschmdt commented Nov 28, 2022

> Some major comments to make the PR compatible with PrimeQA standards:
>
> Let's use a namespace similar to the other run_*.py scripts.
>
> 1. Can we change run.py to run_qg_al.py (assuming we need a separate run_*.py script)?
> 2. Similarly, arguments like qg_do_train should just be do_train, and qg_model_name should be model_name_or_path; all arguments with a qg_* prefix can drop it. Check run_qg and run_mrc to see the common namespace, which we should keep uniform.
> 3. We have to make sure we reuse what can be reused from existing PrimeQA:
>    (a) For the prerequisite step of training the QG model, can't we use run_qg to do that?
>    (b) For the RTCon filtering step, we can fine-tune the RC model on SQuAD using run_mrc.
>
> We should reuse run_mrc and run_qg as much as possible given that they already exist in PrimeQA core; example .py scripts should build on top of them.

Thanks for the comments!
Regarding 2), I need to distinguish between arguments for the RC model and the QG model because AL can also handle RC training (using BALD).
Should I therefore get rid of the qg_* prefix and keep the rc_* prefix?
I also added some datasets and pre-processing steps which do not work with the current run_mrc.py script. Should I update run_mrc.py for this purpose, or would you rather keep it separate?
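For context, the BALD acquisition mentioned above (Bayesian Active Learning by Disagreement) can be sketched as follows. This is an illustrative Monte-Carlo-dropout formulation, not necessarily the exact implementation in this PR; the array shapes and function names are assumptions:

```python
# Illustrative BALD scores from MC-dropout predictive distributions.
# BALD = H(mean prediction) - mean H(per-pass prediction): the mutual
# information between the prediction and the model parameters.
import numpy as np

def bald_scores(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """probs: (n_mc_passes, n_samples, n_classes) predictive distributions.

    High scores mean the dropout passes disagree, i.e. the sample is
    informative and a good candidate for annotation.
    """
    mean_probs = probs.mean(axis=0)  # (n_samples, n_classes)
    predictive_entropy = -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)
    expected_entropy = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)
    return predictive_entropy - expected_entropy

def select_most_informative(probs: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k samples with the highest BALD score."""
    return np.argsort(bald_scores(probs))[::-1][:k]
```

A sample on which all dropout passes agree scores near zero; a sample on which the passes disagree scores high and would be selected for labeling.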

If you use CUDA, make sure to also set `CUDA_VISIBLE_DEVICES` to the appropriate device.
E.g. for GPU 2 (the third one) on the current host:
```bash
export CUDA_VISIBLE_DEVICES=2
```
Collaborator

Do we need to make sure that it always runs on GPU 2? If not, let's change it to export CUDA_VISIBLE_DEVICES=<gpu_device_id>. Can we set it automatically in the Python code based on device availability instead of asking the user to set an environment variable?

@mxschmdt mxschmdt Dec 14, 2022

Will update the instructions; how would you automatically choose the GPU?
To be honest, I like setting the environment variable because you can be sure which GPUs are used, and schedulers do it the same way.
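As one possible middle ground, device selection could be automated by picking the GPU with the most free memory and exporting `CUDA_VISIBLE_DEVICES` before any CUDA context is created. A minimal sketch; the `nvidia-smi` query is an assumption for illustration, not something this PR does:

```python
# Pick the GPU with the most free memory and expose only that one to CUDA.
# Must run before torch (or any CUDA library) initializes a device context.
import os
import subprocess

def pick_freest_gpu(free_mem_mib: list[int]) -> int:
    """Return the index of the GPU with the most free memory."""
    return max(range(len(free_mem_mib)), key=lambda i: free_mem_mib[i])

def query_free_memory() -> list[int]:
    # nvidia-smi prints one free-memory value (MiB) per line, one line per GPU.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(line) for line in out.strip().splitlines()]

def auto_set_visible_device() -> None:
    os.environ["CUDA_VISIBLE_DEVICES"] = str(pick_freest_gpu(query_free_memory()))
```

This keeps the scheduler-friendly property of the environment variable (the choice is explicit and logged) while removing the manual step.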

```diff
@@ -0,0 +1,221 @@
+# Training with Active Learning
```
Collaborator

Change "Training with Active Learning" to "Training with Active Learning (AL)", as AL is used as an abbreviation later in the README.

Collaborator Author

Thanks, I will make sure to use abbreviations consistently!

Since the generated data is noisy, we filter it in a subsequent step.
The script `filter_gen_data.py` does this:
```bash
python filter_gen_data.py <paths to generated data> <output dir> --filter {lm,rt}
```
Collaborator

Is filter_gen_data.py directly accessible, or do we need to call it like techqa/filter_gen_data.py?

Collaborator Author

I'm not sure what you mean here; does techqa/ refer to a folder in this case?
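For context, the round-trip (`rt`) filter idea can be sketched as follows: answer each generated question with an RC model and keep only the pairs whose predicted answer matches the generated one. The SQuAD-style normalization and exact-match criterion below are assumptions for this sketch; `filter_gen_data.py` may match differently (e.g. with an F1 threshold):

```python
# Round-trip consistency filter sketch: keep a generated (question, answer)
# pair only if an RC model recovers the same answer from the context.
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def round_trip_filter(samples, answer_fn):
    """samples: dicts with 'context', 'question', 'answer'.
    answer_fn(context, question) -> predicted answer string from an RC model."""
    kept = []
    for s in samples:
        predicted = answer_fn(s["context"], s["question"])
        if normalize(predicted) == normalize(s["answer"]):
            kept.append(s)
    return kept
```

The RC model used for `answer_fn` would typically be the SQuAD-fine-tuned model mentioned in the instructions.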

We suggest fine-tuning a model on SQuAD first followed by fine-tuning on the selected, annotated data:
```bash
# fine-tune RC model on SQuAD
python run.py \
  ...
```
Collaborator

Can we use run_mrc from PrimeQA to fine-tune the model on SQuAD?

Collaborator Author

Yes, that works as well, but the MRC functionality is currently needed for AL anyway (if used with an RC model).

```bash
--rc_num_train_epochs 3 \
--rc_preprocessor primeqa.mrc.processors.preprocessors.squad.SQUADPreprocessor \
--rc_postprocessor primeqa.mrc.processors.postprocessors.squad.SQUADPostProcessor \
--rc_eval_metrics SQUAD \
```
Collaborator

What about evaluation?

Collaborator Author

What do you mean exactly? Evaluation is currently done during training.

```python
return self.value


def filter_samples(
```
Collaborator

Please add docstrings and inline comments. Thanks.

Collaborator Author

Sorry, I forgot them for this script; the next push will include them!



```python
if __name__ == "__main__":
```

Collaborator

Do we need to have a main class here?

Collaborator Author

Since this is not intended to be imported I don't think so.

```python
validation_dataset = get_datasets(data_args.eval_dataset)
validation_dataset = expand_answers(validation_dataset, separate_answers=False)

def preprocess_fn_wrapper(
```
Collaborator

Please add inline comments and docstrings wherever possible.

"the `--rc_output_dir` or add `--rc_overwrite_output_dir` to train from scratch."
)

task_heads = EXTRACTIVE_HEAD
Collaborator

Can we not extend or use the existing run_mrc.py to do the same thing? I see many repetitive code snippets here and there.

Collaborator Author

The Trainer and the Trainer arguments are used in the AL process, so I don't think run_mrc.py can be used here.

jdpsen commented Nov 29, 2022

> Therefore should I get rid of the qg_* prefix and keep the rc_* prefix?

Yes, remove the qg_* prefix to make it compatible with run_qg, and keep the rc_* prefixes only where you need them.

> I also added some datasets and pre-processing which do not work with the current run_mrc.py script, therefore should I update the run_mrc.py script for this purpose or would you mind keeping it separate?

Can it be contributed back to PrimeQA core? Meaning the dataset support you added via pre-processors: can we have it in /primeqa/mrc/preprocessors like the others? That would also enable you to use it in run_mrc for train/eval.

```diff
@@ -38,7 +39,7 @@ class ModelArguments:
     default="table",
     metadata={
         "help": "Whether to generate questions from tables or passages",
-        "choices": ["table", "passage"],
+        "choices": ["table", "passage_qg", "passage_qa2s"],
```
@riyazbhat riyazbhat Dec 6, 2022

PrimeQA QG only supports the modalities 'passage', 'table', and 'hybrid' (see https://github.com/primeqa/primeqa/blob/qg_path-sampler/primeqa/qg/run_qg.py). It also makes no sense to include modality options that do not correspond to what "modality" means.

Collaborator Author

I've introduced a new parameter gen_config in order to separate the generation method from the modality.

@mxschmdt

> Therefore should I get rid of the qg_* prefix and keep the rc_* prefix?
> --> Yes, remove qg_* prefix to make it compatible with run_qg, keep rc_* prefixes only when you need it.
> I also added some datasets and pre-processing which do not work with the current run_mrc.py script, therefore should I update the run_mrc.py script for this purpose or would you mind keeping it separate?
> --> Can it be contributed back to primeqa core? Meaning the dataset support you added via pre-processors, can we have it in /primeqa/mrc/preprocessors like the others? It will also enable you to use it in run_mrc for train/eval.

I haven't added them as preprocessors but as dataset loading scripts (i.e. a .py file containing a Builder which can be used via datasets.load_dataset("file.py")). It could theoretically be implemented as a preprocessor, but splitting a dataset like MRQA into its subsets, for example, would require one preprocessor per subset. Maybe that is an option? Or even a preprocessor with arguments? (I would have to see how to implement this with argparse.)
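The "preprocessor with arguments" idea could look roughly like this: a single parameterized step filters an MRQA-style dataset by its subset field, with the subset chosen via one extra argparse argument. The field name `subset` and the wiring below are hypothetical, not PrimeQA's actual preprocessor interface:

```python
# Sketch: one parameterized preprocessing step instead of one preprocessor
# per MRQA subset. The `subset` field name is an assumption.
import argparse

def split_by_subset(examples, subset: str):
    """Keep only examples belonging to one subset of an MRQA-style dataset."""
    return [ex for ex in examples if ex.get("subset") == subset]

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # A single extra CLI argument selects the subset at run time.
    parser.add_argument("--subset", required=True,
                        help="MRQA subset to keep, e.g. SQuAD or NewsQA")
    return parser
```

With this shape, the same preprocessor class could serve every subset, which would sidestep the one-class-per-subset explosion mentioned above.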

* Add docstrings and comments
* Rename run script
* Introduce `gen_config` parameter for passage-based generation models
* Update instructions
jdpsen commented Jan 10, 2023

> Therefore should I get rid of the qg_* prefix and keep the rc_* prefix?
> --> Yes, remove qg_* prefix to make it compatible with run_qg, keep rc_* prefixes only when you need it.
> I also added some datasets and pre-processing which do not work with the current run_mrc.py script, therefore should I update the run_mrc.py script for this purpose or would you mind keeping it separate?
> --> Can it be contributed back to primeqa core? Meaning the dataset support you added via pre-processors, can we have it in /primeqa/mrc/preprocessors like the others? It will also enable you to use it in run_mrc for train/eval.

@mxschmdt: are you planning to push new commits to handle the PR feedback?

@mxschmdt

> @mxschmdt: are you planning to push new commits to handle the PR feedback?

I've already pushed a new commit: b50fd35
Please also kindly consider my comment above on the dataset loading.
