This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

Add assert for doc_stride, max_seq_length and max_query_length #1587

Open · wants to merge 3 commits into master
Conversation


@bartekkuncer bartekkuncer commented Apr 7, 2022

Description

This change adds an assert on the relation between doc_stride, max_seq_length and max_query_length (args.doc_stride <= args.max_seq_length - args.max_query_length - 3), because setting them incautiously can cause data loss when chunking input features and ultimately significantly lower accuracy.
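A minimal sketch of the proposed check (the argument names mirror the script's CLI arguments; the wrapper function and its message wording are illustrative, not the exact code in this PR):

```python
# Sketch of the proposed validation. The "- 3" accounts for the three
# special tokens ([CLS], [SEP], [SEP]) that frame the query and the
# context in each chunk, so the context window per chunk is
# max_seq_length - max_query_length - 3 tokens.
def validate_chunking_args(doc_stride, max_seq_length, max_query_length):
    usable_window = max_seq_length - max_query_length - 3
    assert doc_stride <= usable_window, (
        'doc_stride ({}) must not exceed max_seq_length - max_query_length - 3 '
        '({}); otherwise some context tokens fall into no chunk and are lost. '
        'Lower doc_stride or raise max_seq_length.'.format(
            doc_stride, usable_window))

# Safe combination: 32 <= 128 - 64 - 3 = 61, so no assertion fires.
validate_chunking_args(doc_stride=32, max_seq_length=128, max_query_length=64)
```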

Example

Without the assert, when one sets max_seq_length to e.g. 128 and keeps the default doc_stride of 128, the following happens for the input feature with qas_id == "572fe53104bcaa1900d76e6b" when running bash ~/gluon-nlp/scripts/question_answering/commands/run_squad2_uncased_bert_base.sh:
[Screenshot: ChunkFeatures in which some context_tokens_ids are missing]

As you can see, some of the context_tokens_ids (in the red rectangle) are lost: they are not included in any of the ChunkFeatures because doc_stride is too high relative to max_seq_length, and the user is not notified, not even with a simple warning. This can lead to a significant accuracy drop, since this kind of data loss happens for every input feature that does not fit entirely into a single chunk.
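The gap can be illustrated with a toy model of the chunking (this is not the library's actual chunking code, just a sketch of the sliding-window arithmetic under the assumed token budget of max_seq_length - max_query_length - 3 context tokens per chunk):

```python
# Toy sliding-window chunker: each chunk holds `window` context tokens
# and consecutive chunks start `doc_stride` tokens apart.
def chunk_spans(context_len, doc_stride, max_seq_length, max_query_length):
    window = max_seq_length - max_query_length - 3  # context tokens per chunk
    spans = []
    start = 0
    while start < context_len:
        spans.append((start, min(start + window, context_len)))
        if start + window >= context_len:
            break
        start += doc_stride
    return spans

# doc_stride=128 with max_seq_length=128 and max_query_length=64 gives a
# window of only 61 tokens, but chunk starts are 128 tokens apart, so the
# tokens between positions 61 and 127 appear in no chunk at all.
spans = chunk_spans(context_len=150, doc_stride=128,
                    max_seq_length=128, max_query_length=64)
print(spans)  # [(0, 61), (128, 150)] -> tokens 61..127 are lost
```

With doc_stride lowered to 32 the same call yields overlapping spans that cover every context token, which is exactly what the proposed inequality guarantees.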

This change introduces an assert that fires whenever data loss is possible, forcing the user to set proper/safe values for doc_stride, max_seq_length and max_query_length.

Error message

[Screenshot: assertion error message]

Chunk from example above with doc_stride reduced to 32

[Screenshot: chunks covering all context_tokens_ids]

As you can see, when the values of doc_stride, max_seq_length and max_query_length satisfy the above inequality, no data is lost during chunking and the accuracy drop is avoided.
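For the values used in this example the inequality can be checked directly (a sketch; max_query_length = 64 is an assumed default, not stated in this PR):

```python
# Check the proposed inequality for the example configuration.
max_seq_length = 128
max_query_length = 64   # assumed default of the SQuAD script
doc_stride = 32         # reduced from the default 128

# 32 <= 128 - 64 - 3 = 61, so the condition holds and no tokens are lost.
assert doc_stride <= max_seq_length - max_query_length - 3
```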

cc @dmlc/gluon-nlp-team

@bartekkuncer bartekkuncer requested a review from a team as a code owner April 7, 2022 17:07
Co-authored-by: bgawrych <bartlomiej.gawrych@intel.com>