
LexRank Performance when corpus is large #109

Open
MonkandMonkey opened this issue Jun 13, 2018 · 2 comments

@MonkandMonkey

My corpus contains 300 paragraphs, and summarization is slow: more than 30 minutes. Could you describe sumy's performance characteristics, and which stage becomes slow when the corpus is large?
Thanks!

@miso-belica
Owner

Hi, can you share the corpus and let me know the exact command that is slow?

@MonkandMonkey
Author

MonkandMonkey commented Jun 13, 2018

I am sorry, but the corpus can't be shared for privacy reasons. Here is a description of it.

  • lang: Chinese (I replaced the default jieba tokenizer with the LTP tokenizer to segment words.)
  • Text format:
  1. Each line is a doc, and it may contain several sentences.
  2. Doc length varies from 10 to 2000+. (Our corpus contains both 'titles' and 'contents'.)
  3. Some sentences may be very long.
  • Here is the code:
import logging

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

# LTPTokenizer and Config are defined elsewhere in my own project.

def summary_docs(self, src_file, output_file):
    # Register my custom LTP tokenizer for Chinese word segmentation.
    Tokenizer.SPECIAL_WORD_TOKENIZERS[self.lang] = LTPTokenizer()
    parser = PlaintextParser.from_file(src_file, Tokenizer(self.lang))
    stemmer = Stemmer(self.lang)

    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(self.lang)

    with open(output_file, "w", encoding=Config.default_encoding) as fw:
        logging.debug(parser.document.paragraphs)
        for sentence in summarizer(parser.document, self.summary_sentence_cnt):
            fw.write(str(sentence) + "\n")
    logging.info("finish summarization, summary is saved in: {}".format(Config.summary_dir))
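As background for the "which stage is slow" question: LexRank computes a pairwise sentence-similarity matrix and then runs an iterative ranking over it, so runtime grows roughly quadratically with the number of sentences, which is consistent with a 300-paragraph corpus being slow. One way to confirm where the time actually goes is to profile the call with the standard-library cProfile; the sketch below is not from the thread, and the stand-in workload at the bottom would be replaced by a real call to summary_docs(...):

```python
import cProfile
import io
import pstats

def profile_call(fn, *args, **kwargs):
    """Run fn under cProfile and return (result, text report of top hotspots)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = fn(*args, **kwargs)
    profiler.disable()

    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time so the slow stage (parsing, tokenizing,
    # similarity matrix, ranking) shows up at the top of the report.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()

# Stand-in workload; replace with e.g. profile_call(self.summary_docs, src, out).
_, report = profile_call(sorted, range(100_000), key=lambda x: -x)
print(report)
```

Reading the "cumtime" column of the report shows which function dominates the run, which answers the question without needing to share the corpus.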

Thank you very much.

@miso-belica miso-belica self-assigned this Jun 13, 2018
@miso-belica miso-belica changed the title Performance when corpus is large LexRank Performance when corpus is large May 19, 2019