Add data generation for RC with Active Learning #332
Conversation
Will add documentation and tests soon!
This adds a class for active learning, ActiveLearner, as well as a scoring function for RC based confidence score, namely BALD. An example script shows how it can be used.
Done
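For readers unfamiliar with it, BALD (Bayesian Active Learning by Disagreement) scores an example by the mutual information between the model's prediction and its parameters, usually estimated from Monte-Carlo dropout samples. A minimal sketch of the score under that assumption (illustrative only, not the PR's actual implementation):

```python
import numpy as np

def bald_score(probs: np.ndarray) -> np.ndarray:
    """BALD acquisition scores from MC-dropout probability samples.

    probs: shape (n_mc_samples, n_examples, n_classes), each row a distribution.
    Returns one score per example; higher means the dropout samples disagree
    more, i.e. higher epistemic uncertainty.
    """
    eps = 1e-12  # avoid log(0)
    mean_probs = probs.mean(axis=0)
    # entropy of the mean predictive distribution
    entropy_of_mean = -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)
    # mean entropy of the individual MC samples
    mean_entropy = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)
    return entropy_of_mean - mean_entropy
```

Examples with the highest scores are the ones the samples disagree on most, and would be the ones selected for annotation.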
Signed-off-by: Maximilian Schmidt <maximilian.s@posteo.de>
Some major comments to make the PR compatible with PrimeQA standards:
Let's use the same namespace as the other run_*.py scripts:
- Can we change run.py to run_qg_al.py (assuming we need a separate run_*.py script)?
- Similarly, arguments like `qg_do_train` should just be `do_train`, and `qg_model_name` should be `model_name_or_path`; I think all arguments with a `qg_*` prefix can drop it. Check run_qg and run_mrc to see the common namespace, and we should keep it uniform.
- We have to make sure we reuse what can be reused from the existing PrimeQA code.
(a) For the prerequisite step of training the QG model, can't we use run_qg?
(b) For the RTCon filtering step, we can fine-tune the RC model on SQuAD using run_mrc.
We should reuse run_mrc and run_qg as much as possible given that they already exist in PrimeQA core; example .py scripts should build on top of them.
Thanks for the comments!
examples/datagen_with_al/README.md
Outdated
If you use CUDA, make sure to also set `CUDA_VISIBLE_DEVICES` to the appropriate device.
E.g. for GPU 2 (the third one) on the current host:
```bash
export CUDA_VISIBLE_DEVICES=2
```
Do we need to make sure that it always runs on GPU 2? If not, let's change it to `export CUDA_VISIBLE_DEVICES=<gpu_device_id>`. Can we automatically set it in the Python code itself based on device availability, instead of asking the user to set an env variable?
Will update the instructions; how would you automatically choose the GPU?
To be honest, I like setting the environment variable because you can be sure which GPUs are used, and schedulers do it the same way.
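One way to answer the "how would you automatically choose the GPU" question: pick the device with the most free memory at startup. A hedged sketch using PyTorch (the `pick_free_gpu` helper is hypothetical, not part of the PR; it assumes a PyTorch version providing `torch.cuda.mem_get_info`):

```python
import os

def pick_free_gpu() -> str:
    """Choose the CUDA device with the most free memory; fall back to CPU.

    Hypothetical helper, shown only to illustrate automatic device selection
    instead of a user-set CUDA_VISIBLE_DEVICES variable.
    """
    try:
        import torch
        if not torch.cuda.is_available():
            return "cpu"
        # free memory (bytes) per visible device
        free = [torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())]
        best = max(range(len(free)), key=free.__getitem__)
        # restrict downstream libraries to the chosen device
        os.environ["CUDA_VISIBLE_DEVICES"] = str(best)
        return f"cuda:{best}"
    except ImportError:
        return "cpu"
```

Note this does not address the reviewer-vs-author trade-off above: an env variable set by a scheduler remains the more predictable choice on shared machines.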
examples/datagen_with_al/README.md
Outdated
@@ -0,0 +1,221 @@
# Training with Active Learning
Modify "Training with Active Learning" -> "Training with Active Learning (AL)", since AL is used as an abbreviation later in the README.
Thanks, I will make sure to use abbreviations consistently!
Since generated data is noisy, we will filter it in a next step.
The script `filter_gen_data.py` does this:
```bash
python filter_gen_data.py <paths to generated data> <output dir> --filter {lm,rt}
```
Is filter_gen_data.py directly accessible, or do we need to call it like techqa/filter_gen_data.py?
I'm not sure what you mean here; does techqa/ reference a folder in this case?
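For context, the `rt` (round-trip) filter option mentioned in the README snippet above typically keeps a generated pair only if an RC model reading the context reproduces the generated answer. A minimal sketch under that assumption (`rc_predict` is a hypothetical stand-in for the fine-tuned reader, and exact match is used for simplicity; this is not the PR's code):

```python
def roundtrip_filter(examples, rc_predict):
    """Keep generated QA pairs whose answer the RC model reproduces.

    examples: dicts with "context", "question", and "answer" keys.
    rc_predict: callable (context, question) -> predicted answer string.
    """
    kept = []
    for ex in examples:
        prediction = rc_predict(ex["context"], ex["question"])
        # normalise lightly before the exact-match comparison
        if prediction.strip().lower() == ex["answer"].strip().lower():
            kept.append(ex)
    return kept
```

A real implementation would usually relax exact match to a token-F1 or confidence threshold.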
We suggest fine-tuning a model on SQuAD first, followed by fine-tuning on the selected, annotated data:
```bash
# fine-tune RC model on SQuAD
python run.py \
```
Can we use run_mrc from PrimeQA to fine-tune the model on SQuAD?
Yes, that works as well, but currently the MRC functionality is needed for AL anyway (if used with an RC model).
```bash
    --rc_num_train_epochs 3 \
    --rc_preprocessor primeqa.mrc.processors.preprocessors.squad.SQUADPreprocessor \
    --rc_postprocessor primeqa.mrc.processors.postprocessors.squad.SQUADPostProcessor \
    --rc_eval_metrics SQUAD \
```
What about evaluation?
What do you mean exactly? Evaluation is currently done during training.
return self.value

def filter_samples(
Please add doc string and inline comments. Thanks.
Sorry, I forgot it for this script; the next push will include them!
if __name__ == "__main__":
Do we need to have a main class here?
Since this is not intended to be imported, I don't think so.
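For reference, the convention being discussed is usually a `main()` function behind the import guard, so the script logic runs only on direct execution. A generic sketch (not the PR's code; the print text is illustrative):

```python
def main() -> int:
    """Entry point of the script; returns an exit code."""
    print("filtering samples ...")
    return 0

if __name__ == "__main__":
    # runs only when executed directly (e.g. `python filter_gen_data.py`),
    # not when the module is imported
    main()
```

Even for a script never meant to be imported, the guard keeps it importable for tests.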
examples/datagen_with_al/run.py
Outdated
validation_dataset = get_datasets(data_args.eval_dataset)
validation_dataset = expand_answers(validation_dataset, separate_answers=False)

def preprocess_fn_wrapper(
Please add inline comments and docstrings wherever possible.
examples/datagen_with_al/run.py
Outdated
"the `--rc_output_dir` or add `--rc_overwrite_output_dir` to train from scratch."
)

task_heads = EXTRACTIVE_HEAD
Can we not extend or reuse the existing run_mrc.py to do the same thing? I see many repetitive code snippets here and there.
The Trainer and the Trainer arguments are used in the AL process, so I don't think run_mrc.py can be used here.
primeqa/qg/run_qg.py
Outdated
```diff
@@ -38,7 +39,7 @@ class ModelArguments:
     default="table",
     metadata={
         "help": "Whether to generate questions from tables or passages",
-        "choices": ["table", "passage"],
+        "choices": ["table", "passage_qg", "passage_qa2s"],
```
PrimeQA QG only supports the modalities 'passage', 'table', and 'hybrid' (see https://github.com/primeqa/primeqa/blob/qg_path-sampler/primeqa/qg/run_qg.py). It also makes no sense to include modality options that do not correlate with their meaning.
I've introduced a new parameter, `gen_config`, in order to separate the generation method from the modality.
I haven't added them as preprocessors but as dataset loading scripts (i.e. a .py file containing a Builder which can be used via `datasets.load_dataset`).
* Add docstrings and comments
* Rename run script
* Introduce `gen_config` parameter for passage-based generation models
* Update instructions
@mxschmdt: are you planning to push new commits to handle the PR feedback?
I've already pushed a new commit: b50fd35
Pull Request
What does this PR do?
Add code and instructions to perform data generation for RC using Active Learning.
Description
The core AL part is currently in primeqa/al (primeqa/qg was also modified to add the data generation model), and the instructions with scripts are in examples/datagen_with_al.
The data generation implements the QAGen2S model from https://arxiv.org/abs/2010.06028.
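As I understand the cited paper, QAGen2S generates in two steps: a question is generated from the passage, then an answer is generated conditioned on the passage and that question, and noisy pairs are filtered afterwards. A toy sketch of that control flow (the model callables are hypothetical stand-ins, not PrimeQA APIs, and the step ordering is my reading of the paper):

```python
def qagen2s(passage, question_model, answer_model, keep):
    """Two-step synthetic QA generation with a filtering stage.

    question_model: callable passage -> iterable of candidate questions.
    answer_model: callable (passage, question) -> answer string.
    keep: callable (passage, question, answer) -> bool filter
          (e.g. an LM score or round-trip-consistency check).
    """
    pairs = []
    for question in question_model(passage):        # step 1: questions from the passage
        answer = answer_model(passage, question)    # step 2: answer given passage + question
        if keep(passage, question, answer):         # filter out noisy pairs
            pairs.append((question, answer))
    return pairs
```

In the real pipeline both steps are sequence-to-sequence models and the filter is one of the `{lm,rt}` options discussed above; here they are plain callables to keep the sketch self-contained.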
Versioning
When opening a PR to make changes to PrimeQA (i.e. `primeqa/`) master, be sure to increment the version following semantic versioning. The VERSION is stored here and is incremented using `bump2version {patch,minor,major}` as described in the [development guide documentation](development.html). (This is not required if the PR does not change the `primeqa` package or is not merged into master.)
After pulling in changes from master to an existing PR, ensure the VERSION is updated appropriately.
This may require bumping the version again if it has been previously bumped.
If you're not quite ready yet to post a PR for review, feel free to open a draft PR.
Releases
After Merging
If merging into master and VERSION was updated, after this PR is merged:
Checklist
Review the following and mark as completed: