
NYT10m Experiments -- Manual Evaluation Matters #348

Open
suamin opened this issue Sep 28, 2021 · 2 comments
suamin commented Sep 28, 2021

Hi,

Thank you for the latest contribution, "Manual Evaluation Matters: Reviewing Test Protocols of Distantly Supervised Relation Extraction"; having a manually annotated test set significantly improves our understanding of distantly supervised RE models. I have a few questions regarding the paper's experiments:

Q1: Is it possible to provide the pre-trained checkpoints for BERT+sent/bag+AVG models?

Q2: Regarding evaluation, it is mentioned in the paper:

Bag-level manual evaluation: We take our human-labeled test data for bag-level evaluation. Since annotated data are at the sentence-level, we construct bag-level annotations in the following way: For each bag, if one sentence in the bag has a human-labeled relation, this bag is labeled with this relation; if no sentence in the bag is annotated with any relation, this bag is labeled as N/A.

Can you elaborate on this further? Is this the same as the evaluation part of the current BagRELoader code? Unfortunately, I cannot find 'anno_relation_list' in the manually created test set; does this require additional pre-processing?
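For reference, here is a minimal sketch of how I currently read the paper's bag-labeling rule. The field names ('text', 'relation', 'h', 't') follow the NYT10m JSON-lines format as I understand it; the grouping key and the derived bag-annotation structure are my assumptions, not necessarily what you used:

```python
import json
from collections import defaultdict

def build_bag_annotations(path):
    """Group sentence-level manual annotations into entity-pair bags.

    Assumed rule (from the paper): a bag is labeled with every relation
    that at least one of its sentences is annotated with; if no sentence
    carries a relation, the bag is labeled NA.
    """
    bags = defaultdict(list)
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            # Assumption: bags are keyed by (head id, tail id), i.e. entity pairs.
            key = (item['h']['id'], item['t']['id'])
            bags[key].append(item)

    bag_annotations = {}
    for key, sentences in bags.items():
        relations = {s['relation'] for s in sentences if s['relation'] != 'NA'}
        # Hypothetical output field; I assume this is what 'anno_relation_list' holds.
        bag_annotations[key] = sorted(relations) if relations else ['NA']
    return bag_annotations
```

Is this roughly the pre-processing you applied, or is there an official script for it?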

Q3: At evaluation (validation, test) time, should the bag_size parameter be set to 0 (so that all sentences in the bag are considered, as reported in the paper -- although this is not handled in the current BagRE framework) and entpair_as_bag set to True? See the sketch below for what I mean.
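To make sure I understand the intended semantics, here is a framework-agnostic sketch of the bag-level AVG evaluation I have in mind (score_sentence is a stand-in for the per-sentence model; this is my reading, not necessarily your implementation):

```python
import numpy as np

def score_bag_avg(sentences, score_sentence):
    """Bag-level AVG scoring over *all* sentences in the bag.

    bag_size = 0 is read as "no subsampling / no padding": every sentence
    mentioning the entity pair contributes to the bag score.
    score_sentence(sentence) is assumed to return a relation-probability
    vector of shape (num_relations,).
    """
    scores = np.stack([score_sentence(s) for s in sentences])  # (bag_len, num_relations)
    return scores.mean(axis=0)  # average over the whole bag
```

With entpair_as_bag = True, each bag here would correspond to one (head, tail) entity pair rather than one (head, tail, relation) triple.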

Q4: Can you provide the scores on the NYT10m val set for the models reported in Table 4 of the paper? Do you also plan to release P@k metrics and precision-recall curves for those models?

Q5: Is BERT+sent-level training performed with MultiLabelSentenceRE or plain SentenceRE?

Thank you in advance!

@HenryPaik1

@gaotianyu1350 Thanks for the great work. I have the same questions.
@suamin Did you find answers to your questions? As for NYT10m, I trained BERT with the sentence-level framework and then tested it with the bag-level framework and the multi-label framework separately. The results show that bag-level testing (60.6, 35.32) is better than multi-label (58.39, 31.98). However, I still cannot reproduce the results from the paper.


suamin commented Oct 29, 2021

@HenryPaik1 thanks for your input. I haven't been able to find answers to the questions, and I still struggle to reproduce the paper's numbers. For BERT+sent+AVG with bag-level evaluation, I get AUC=55.45, macro-F1=21.12 on val and AUC=47.49, macro-F1=11.23 on test.
