Clarification on post-processing generated result #16

Open
SuvodipDey opened this issue Dec 21, 2021 · 2 comments

@SuvodipDey

Hi. Kudos for this nice work. I am trying to reproduce the results on the DailyDialog dataset. It would be very helpful if you could clarify the following details.
In Issue #13, you mentioned using "nltk.word_tokenize() to tokenize the sentence and then concatenate the tokens" to make the format of the generated dialogue the same as that of the reference responses. I have two questions here:

  1. Did you use any post-processing on the reference files?
  2. Did you try only nltk.word_tokenize() or some other tokenizer as well?

It would be very helpful if you could briefly describe your post-processing steps.

@lizekang
Collaborator

Hi, sorry for the late response. We use the multi-reference DailyDialog dataset.

  1. For the reference files, we only lowercase the words.
  2. We checked the reference files and found that nltk.word_tokenize() matches their format.

If you have any questions, please feel free to ask.
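
A minimal sketch of the post-processing described above, assuming one generated response per line; the file names, and lowercasing the hypotheses so they match the lowercased references, are assumptions rather than the authors' exact script:

```python
import nltk

nltk.download("punkt", quiet=True)  # tokenizer models, if not already installed

def normalize_hypothesis(line: str) -> str:
    # Tokenize the generated sentence with nltk.word_tokenize() and re-join the
    # tokens with spaces so the format matches the tokenized reference files.
    # Lowercasing here is an assumption, mirroring the lowercased references.
    return " ".join(nltk.word_tokenize(line.strip().lower()))

def normalize_reference(line: str) -> str:
    # References are already tokenized; only lowercasing is applied.
    return line.strip().lower()

# Placeholder file names for illustration.
with open("generated.txt") as f_in, open("generated.post.txt", "w") as f_out:
    for line in f_in:
        f_out.write(normalize_hypothesis(line) + "\n")
```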

@SuvodipDey
Author

I got the following results on the DailyDialog dataset with the default settings. I fine-tuned the pre-trained DialoFlow models and used a beam size of 5 to generate the outputs, followed by the NLTK tokenization step.

  1. Base model
    bleu : [47.52, 25.18, 14.99, 9.5]
    nist : [3.0337, 3.5533, 3.6926, 3.7344]
    meteor : 15.7954
    entropy : [5.1240, 7.8786, 9.1208, 9.7618]
    div : [0.0353, 0.1907]
    avg_len : 12.0144
    Best model: epoch 23
    Validation loss at epoch 23: 2.2757, 0.0629, 5.1641

  2. Medium model
    bleu : [48.75, 26.6, 16.16, 10.44]
    nist : [3.1213, 3.6897, 3.8463, 3.8946]
    meteor : 16.3041
    entropy : [5.1447, 7.9415, 9.2093, 9.8437]
    div : [0.0381, 0.2019]
    avg_len : 12.0482
    Best model: epoch 9
    Validation loss at epoch 9: 2.1609, 0.0595, 5.1356

  3. Large model
    bleu : [48.21, 26.36, 16.08, 10.47]
    nist : [3.1413, 3.7197, 3.8753, 3.9219]
    meteor : 16.2391
    entropy : [5.2765, 8.1516, 9.4174, 10.0041]
    div : [0.0403, 0.2209]
    avg_len : 11.9880
    Best model: epoch 5
    Validation loss at epoch 5: 2.1056, 0.0741, 5.1261

There is a small gap from the reported results, especially for BLEU, NIST, and METEOR. Could you please help me figure out the source of this discrepancy?
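
As a rough sanity check on the post-processing side (this is not the repository's evaluation script; the file names, the parallel multi-reference layout, and the use of nltk's corpus_bleu are assumptions), one can recompute BLEU on the post-processed outputs and see whether tokenization mismatches account for part of the gap:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def load_lines(path):
    # Read one pre-tokenized, lowercased sentence per line and split on spaces.
    with open(path) as f:
        return [line.strip().lower().split() for line in f]

hyps = load_lines("generated.post.txt")                       # placeholder path
ref_sets = [load_lines(p) for p in ("refs.0.txt", "refs.1.txt")]  # placeholder paths
refs = list(zip(*ref_sets))  # group the parallel references per test example

smooth = SmoothingFunction().method1
for n in (1, 2, 3, 4):
    weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
    score = corpus_bleu(refs, hyps, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.2f}")
```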
