Fix DiverseBeamSearch so that no diversity groups will be dropped. #5069
DiverseBeamSearch results were observed to be insufficiently diverse.
a) In DiverseBeamSearch (search.py), we enforce diversity among the different groups; each group contains 2 x group_beam_size candidates (group_beam_size = total beam_size / num_groups).
b) However, during sequence generation (sequence_generator.py), the final top beam_size tokens are selected across all candidates from all groups. This selection is not aware of the groups used in DiverseBeamSearch.
c) We iterate a) and b) at each step. This can eventually cause all surviving candidates to descend from the same group, and it tends to happen because the original first group receives no diversity penalty and therefore scores highest in fluency.
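The collapse described in b) and c) can be illustrated with a minimal, hypothetical sketch (the names and scores below are made up, not fairseq's actual data structures): when the final top-beam_size selection is taken globally across all groups, the unpenalized first group can crowd out the others entirely.

```python
# Hypothetical per-group candidates: (hypothesis, log-probability score).
# Group 0 receives no diversity penalty, so its scores are highest.
candidates = {
    0: [("the cat sat", -0.1), ("the cat lay", -0.2)],
    1: [("a cat sat", -0.9), ("one cat sat", -1.0)],
    2: [("cats were sitting", -1.1), ("a feline sat", -1.2)],
}
beam_size = 3

# Group-unaware selection (the bug): flatten all groups and take the
# global top beam_size by score. Group 2 is dropped entirely.
flat = [c for group in candidates.values() for c in group]
unaware = sorted(flat, key=lambda t: t[1], reverse=True)[:beam_size]

# Group-aware selection (the fix, sketched): keep the best
# beam_size / num_groups candidates from each group, so every group
# survives to the next step.
aware = [max(group, key=lambda t: t[1]) for group in candidates.values()]
```

Here `unaware` keeps two hypotheses from group 0 and none from group 2, while `aware` retains one hypothesis per group.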
This patch includes two changes: making the final top-beam_size selection in sequence_generator.py group-aware so that no diversity group is dropped, and adding a cumulative diversity function that interpolates with Hamming diversity via diversity_discount (see footnote).
The additional bookkeeping needed for cumulative diversity is estimated to incur a ~5% latency overhead, measured on a BART-base model with batch_size=9 and num_groups=12 on a V100.
Footnotes
Diversity function illustration:
A) I like dogs.
B) I like ____.
C) There are ___.
Assuming each word is a token and we are at step=2, trying to fill in the blank:
Current/Hamming diversity:
Penalty for B from A is 1 for "dogs" and 0 for any other like "cats".
Penalty for C from A is 1 for "dogs" and 0 for any other like "cats".
Cumulative diversity:
Penalty for B from A is 3 for "dogs" and 0 for any other like "cats".
Penalty for C from A is 1 for "dogs" and 0 for any other like "cats".
B and C differ because B matches A on "I" and "like" at their respective steps, incurring 2 additional cumulative penalty.
Using diversity_discount to interpolate between these two:
If diversity_discount = 0.5, then
Penalty for B from A is 1.75 (1 + 0.5 + 0.25) for "dogs" and 0 for any other words like "cats".
Penalty for C from A is 1 for "dogs" and 0 for any other words like "cats".
"I" and "like" matched for B and A at step 0 and 1 respectively. Since "I" is two steps away and "like" is one step away, they are discounted by (0.5)^2 and 0.5 respectively.
When diversity_discount = 0, we recover Hamming diversity, and when diversity_discount = 1, we recover cumulative diversity.
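The worked numbers above can be reproduced with a short sketch (this is an illustration of the penalty definition, not fairseq's implementation; the function name is made up). A match s steps before the current step contributes diversity_discount**s, so discount = 0 recovers Hamming diversity (using 0**0 == 1) and discount = 1 recovers cumulative diversity.

```python
def diversity_penalty(prev_tokens, cand_tokens, discount):
    """Penalty a candidate receives from a previously selected hypothesis.

    Each position where the two sequences match contributes
    discount ** (steps_away_from_current_step).
    """
    cur = len(cand_tokens) - 1  # index of the current step
    penalty = 0.0
    for step, (p, c) in enumerate(zip(prev_tokens, cand_tokens)):
        if p == c:
            penalty += discount ** (cur - step)
    return penalty

# The footnote's example at step=2, one word per token:
A = ["I", "like", "dogs"]
B = ["I", "like", "dogs"]        # B fills its blank with "dogs"
C = ["There", "are", "dogs"]     # C fills its blank with "dogs"

print(diversity_penalty(A, B, 0.0))  # 1.0  -> Hamming
print(diversity_penalty(A, B, 1.0))  # 3.0  -> cumulative
print(diversity_penalty(A, B, 0.5))  # 1.75 = 1 + 0.5 + 0.25
print(diversity_penalty(A, C, 0.5))  # 1.0  (only "dogs" matches)
```

With any other word than "dogs" in the blank, both B and C drop the current-step match and the penalties fall by 1, matching the table above.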