Clarification on post-processing generated result #16

Open
SuvodipDey opened this issue Dec 21, 2021 · 2 comments

@SuvodipDey

Hi. Kudos for this nice work. I am trying to reproduce the results on the DailyDialog dataset. It would be very helpful if you could clarify the following details.
In Issue #13, you mentioned using "nltk.word_tokenize() to tokenize the sentence and then concatenate the tokens" to make the format of the generated dialogue the same as that of the reference responses. I have two questions here:

  1. Did you use any post-processing on the reference files?
  2. Did you try only nltk.word_tokenize() or some other tokenizer as well?

It would be very helpful if you could briefly describe your post-processing steps.

@lizekang
Collaborator

Hi, sorry for the late response. We use the multi-reference DailyDialog dataset.

  1. For the reference files, we only lowercase the words.
  2. We checked the reference files and found that nltk.word_tokenize() matches their format.

If you have any questions, please feel free to ask.
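
A minimal sketch of the post-processing described above, assuming one generated response per line; the file names, and lowercasing the hypotheses so they match the lowercased references, are assumptions rather than the authors' exact script:

```python
import nltk

nltk.download("punkt", quiet=True)  # tokenizer models, if not already installed

def normalize_hypothesis(line: str) -> str:
    # Tokenize the generated sentence with nltk.word_tokenize() and re-join the
    # tokens with spaces so the format matches the tokenized reference files.
    # Lowercasing here is an assumption, mirroring the lowercased references.
    return " ".join(nltk.word_tokenize(line.strip().lower()))

def normalize_reference(line: str) -> str:
    # References are already tokenized; only lowercasing is applied.
    return line.strip().lower()

# Placeholder file names for illustration.
with open("generated.txt") as f_in, open("generated.post.txt", "w") as f_out:
    for line in f_in:
        f_out.write(normalize_hypothesis(line) + "\n")
```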

@SuvodipDey
Author

I got the following results on the DailyDialog dataset with the default settings. I fine-tuned the pre-trained DialoFlow models and used a beam size of 5 to generate the outputs, followed by the NLTK tokenization step.

  1. Base model
    bleu : [47.52, 25.18, 14.99, 9.5]
    nist : [3.0337, 3.5533, 3.6926, 3.7344]
    meteor : 15.7954
    entropy : [5.1240, 7.8786, 9.1208, 9.7618]
    div : [0.0353, 0.1907]
    avg_len : 12.0144
    Best model: epoch 23
    Validation loss at epoch 23: 2.2757, 0.0629, 5.1641

  2. Medium model
    bleu : [48.75, 26.6, 16.16, 10.44]
    nist : [3.1213, 3.6897, 3.8463, 3.8946]
    meteor : 16.3041
    entropy : [5.1447, 7.9415, 9.2093, 9.8437]
    div : [0.0381, 0.2019]
    avg_len : 12.0482
    Best model: epoch 9
    Validation loss at epoch 9: 2.1609, 0.0595, 5.1356

  3. Large model
    bleu : [48.21, 26.36, 16.08, 10.47]
    nist : [3.1413, 3.7197, 3.8753, 3.9219]
    meteor : 16.2391
    entropy : [5.2765, 8.1516, 9.4174, 10.0041]
    div : [0.0403, 0.2209]
    avg_len : 11.9880
    Best model: epoch 5
    Validation loss at epoch 5: 2.1056, 0.0741, 5.1261

There is a small gap from the reported results, especially for BLEU, NIST, and METEOR. Could you please help me figure out the source of this discrepancy?
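
As a rough sanity check on the post-processing side (this is not the repository's evaluation script; the file names, the parallel multi-reference layout, and the use of nltk's corpus_bleu are assumptions), one can recompute BLEU on the post-processed outputs and see whether tokenization mismatches account for part of the gap:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def load_lines(path):
    # Read one pre-tokenized, lowercased sentence per line and split on spaces.
    with open(path) as f:
        return [line.strip().lower().split() for line in f]

hyps = load_lines("generated.post.txt")                       # placeholder path
ref_sets = [load_lines(p) for p in ("refs.0.txt", "refs.1.txt")]  # placeholder paths
refs = list(zip(*ref_sets))  # group the parallel references per test example

smooth = SmoothingFunction().method1
for n in (1, 2, 3, 4):
    weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
    score = corpus_bleu(refs, hyps, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.2f}")
```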
