Edmundson summarizer #21

Open
nick-magnini opened this issue Apr 6, 2015 · 6 comments

@nick-magnini

Hi,

I checked the code for the Edmundson summarizer. As far as I can tell, it doesn't do anything for English. It is supposed to extract cue words, significant words, and title words, and then rank the sentences based on these scores and on sentence location. But when the input is a raw text file, the summarizer works based only on the location of the sentence. Is that right? There is no method to extract the cue words, significant words, or title words from the text, so I think the implementation is wrong. Let me know if I misunderstood your code or am making a mistake. Thanks.
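
To make the concern concrete, here is a minimal sketch of how an Edmundson-style score combines the four signals. It is not the code from this project: the function names, the naive tokenizer, the equal weights, and the "first or last sentence" location rule are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a deliberately naive tokenizer (assumption)."""
    return re.findall(r"\w+", text.lower())


def edmundson_scores(sentences, title, bonus_words, stigma_words, key_words,
                     weights=(1.0, 1.0, 1.0, 1.0)):
    """Score each sentence as a weighted sum of cue, key, title and location."""
    w_cue, w_key, w_title, w_loc = weights
    title_words = set(tokenize(title))
    last = len(sentences) - 1
    scores = []
    for index, sentence in enumerate(sentences):
        words = tokenize(sentence)
        # Cue: bonus words add to the score, stigma words subtract from it.
        cue = (sum(w in bonus_words for w in words)
               - sum(w in stigma_words for w in words))
        # Key: occurrences of significant (topic) words.
        key = sum(w in key_words for w in words)
        # Title: words shared with the document title.
        title_score = sum(w in title_words for w in words)
        # Location: simplified here to "first or last sentence" (assumption).
        location = 1 if index in (0, last) else 0
        scores.append(w_cue * cue + w_key * key
                      + w_title * title_score + w_loc * location)
    return scores


sentences = [
    "Summarization is an important task.",
    "The weather was bad.",
    "Nothing else.",
]
with_words = edmundson_scores(sentences, "Summarization task",
                              {"important"}, {"bad"}, {"summarization"})
without_words = edmundson_scores(sentences, "", set(), set(), set())
```

With empty word lists and no title, only the location term is left in `without_words`, which is the behaviour described above for raw English text.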

@nick-magnini
Author

I realized that even the location method in Edmundson doesn't work when the input is a raw text document in one-sentence-per-line format.

@miso-belica
Owner

Hi, I assume the input is some form of "plain text", but I'm not sure I understand you. Can you give an example of the text? And what does "it doesn't do anything for English" mean? Does it mean that the summarizer works correctly for other languages? What do you suggest? How do you think the summarizer should behave?

miso-belica self-assigned this Apr 10, 2015
@nick-magnini
Author

Hi,

Well, it does produce output, but the output is not based on the Edmundson algorithm. The lists of cue words and significant words are non-English; they are defined in parser/parse.py:

SIGNIFICANT_WORDS = (
    "významný",
    "vynikající",
    "podstatný",
    "význačný",
    "důležitý",
    "slavný",
    "zajímavý",
    "eminentní",
    "vlivný",
    "supr",
    "super",
    "nejlepší",
    "dobrý",
    "kvalitní",
    "optimální",
    "relevantní",
)
STIGMA_WORDS = (
    "nejhorší",
    "zlý",
    "šeredný",
)

These are then passed to the summarizer in the main:

if summarizer_class is EdmundsonSummarizer:
    summarizer.null_words = stop_words
    summarizer.bonus_words = parser.significant_words
    summarizer.stigma_words = parser.stigma_words

So when the Edmundson summarizer is called for English, it will not find any significant/stigma words in the English text. If the document is in one-sentence-per-line format, the location method in edmundson_location.py will not give the correct output either. So the Edmundson method gets totally wrong inputs. Correct me if I'm wrong.

@miso-belica
Owner

Yes, you are absolutely right. I totally forgot about it. I tested the summarizers with Czech texts and left it there. This should be fixed. Thanks a lot for this :)

But as far as I remember, there is no method for gathering stigma/bonus words from the text. They should be provided per language, like stop-words are.
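
A rough sketch of what "provided per language" could look like, assuming a simple in-memory lookup. The dictionary names, the `get_cue_words` helper, and the English entries are hypothetical and only illustrate the idea; the Czech entries are taken from the lists quoted earlier in this thread.

```python
# Hypothetical per-language word lists; the English entries are illustrative only.
BONUS_WORDS = {
    "czech": frozenset({"významný", "důležitý", "nejlepší"}),
    "english": frozenset({"significant", "important", "best"}),
}
STIGMA_WORDS = {
    "czech": frozenset({"nejhorší", "zlý"}),
    "english": frozenset({"worst", "bad"}),
}


def get_cue_words(language):
    """Return (bonus_words, stigma_words) for a language.

    Raises LookupError instead of silently falling back to another
    language's lists.
    """
    key = language.lower()
    try:
        return BONUS_WORDS[key], STIGMA_WORDS[key]
    except KeyError:
        raise LookupError(
            f"No bonus/stigma word lists for language {language!r}"
        ) from None


bonus, stigma = get_cue_words("English")
```

Failing loudly for an unknown language avoids the situation described above, where Czech lists were applied to English text without any warning.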

@nick-magnini
Author

OK, we should think about it then. Stigma/bonus words should be extracted from the text being summarized; a general list will not help. This can be done using various methods such as topic extraction, phrase extraction, ... We can work on it. I'll come back with some modules and points on that soon.
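
As a starting point, one very simple way to pull candidate significant words out of the text itself is plain term frequency with stop-words removed. This is only a sketch of that one idea, not topic or phrase extraction; the function name, the tokenizer, and the `top_n` parameter are assumptions.

```python
import re
from collections import Counter


def extract_significant_words(text, stop_words, top_n=5):
    """Return the top_n most frequent non-stop-word tokens in the text."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return [word for word, _ in counts.most_common(top_n)]


text = ("The parser reads text. The parser splits text into sentences. "
        "The summarizer ranks sentences.")
significant = extract_significant_words(text, {"the", "into"}, top_n=3)
```

The result could then be assigned to the summarizer's bonus words in place of a fixed list, though raw frequency would probably need stemming and better filtering to be useful in practice.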

@nick-magnini
Author

Also, regarding the location, it should be fixed in edmundson_location.py.
