
Training the transformer #1

Open · JonasxRie opened this issue Aug 12, 2021 · 2 comments

@JonasxRie commented Aug 12, 2021

Hi @Hellisotherpeople,

As you wrote in your paper, you trained several transformer models, including BERT-large and Longformer-base. You also mentioned using the simple-transformers library. Could you share a short code snippet showing how you trained the models for extractive summarization, please?

Thanks in advance!

@Hellisotherpeople (Owner) commented Dec 8, 2021

@JonasxRie

Yes, I promise I will get to this (let's hope before Christmas!).
The main difficulty is converting the dataset from its current form into a token-classification-style format. I have since lost the script that did this in a recent losing battle with my local Manjaro install, which is now formatted, so I will have to rewrite it. That is not difficult, just a bit tedious.

You can currently try word-level extractive summarization by formatting it as a sequence-to-sequence task, but in my experience none of these pre-trained language models deduce that the output sequence must keep the original words in their original order; without that, it stops being "extractive" in the sense that I and competitive debaters are looking for (like an actual highlighter). I've tried putting tags around the labels, but the models are still too stupid to figure it out. I would honestly love help from anyone in the community who has insights on how to fix this.
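For anyone who wants to experiment with that seq2seq formulation, a minimal sketch using the simple-transformers Seq2Seq interface follows; the BART checkpoint, the `<extract>` tags, and the toy data are illustrative assumptions, and note that nothing in this setup constrains the decoder to copy words in order, which is exactly the failure mode described above:

```python
import pandas as pd
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs

# Hypothetical sketch: cast word-level extraction as a seq2seq task.
# Input is the full document; the target wraps the gold extract in tags.
train_df = pd.DataFrame({
    "input_text": ["the quick brown fox jumps over the lazy dog"],
    "target_text": ["<extract> quick fox jumps </extract>"],
})

model_args = Seq2SeqArgs()
model_args.num_train_epochs = 1
model_args.overwrite_output_dir = True

model = Seq2SeqModel(
    encoder_decoder_type="bart",
    encoder_decoder_name="facebook/bart-base",
    args=model_args,
    use_cuda=False,
)
model.train_model(train_df)
print(model.predict(["the quick brown fox jumps over the lazy dog"]))
```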

As a side note:

I warn anyone trying to play the benchmark-chasing game that the original evaluation done in the paper is hilariously bad, because the default settings in pyrouge only look at the first 100 tokens of the summary. This seems to be done for performance reasons: when I realized this (after the paper was published) and re-ran the evaluation with no limit, it pretty much always eventually crashed.

I think that ROUGE is now built in as an evaluator in many frameworks (such as Hugging Face's), which should solve this problem.
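For example, a minimal sketch with the Hugging Face `evaluate` package, which computes ROUGE in pure Python with no 100-token truncation (the strings are placeholders):

```python
import evaluate

# Hypothetical sketch: full-length ROUGE scoring via Hugging Face `evaluate`.
rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["the word-level extract produced by the model"],
    references=["the gold extract from the dataset"],
)
print(scores)  # rouge1, rouge2, rougeL, rougeLsum
```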

As such, I want it on the record that future authors should discount my reported benchmark numbers and link to this GitHub comment to indicate why. I would prefer that people not play the benchmark-chasing game (at least with ROUGE), because summarization is an inherently subjective task, and the space of potentially "good" summarizations explodes when you work at the word level on longish documents, which is what this dataset ultimately consists of.

Proper evaluation would almost certainly report significantly different results from what's found in the paper. Future authors should instead re-evaluate the models I reported scores for properly, and note the error made in the paper.

@Hellisotherpeople (Owner) commented

I will also make an effort to get the trained models/weights posted and hosted on Hugging Face, but this may take some time.
