
LexRank Performance when corpus is large #109

Open
MonkandMonkey opened this issue Jun 13, 2018 · 2 comments

@MonkandMonkey

My corpus contains 300 paragraphs, and summarization is slow: more than 30 minutes. Could you describe sumy's performance characteristics, and which stage becomes slow when the corpus is large?
Thanks!

@miso-belica
Owner

Hi, can you share the corpus and let me know the exact command that is slow?

@MonkandMonkey
Author

MonkandMonkey commented Jun 13, 2018

I am sorry, but the corpus can't be shared for privacy reasons. Here is a description of it.

  • lang: Chinese (I replaced the default jieba tokenizer with the LTP tokenizer to segment words.)
  • Text format:
  1. Each line is a doc, and it may contain several sentences.
  2. Doc length varies from 10 to 2000+. (Our corpus contains both 'titles' and 'contents'.)
  3. Some sentences may be very long.
  • Here is the code:
import logging

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

# LTPTokenizer and Config are defined elsewhere in my own project.

def summary_docs(self, src_file, output_file):
    # Register my custom LTP tokenizer for Chinese word segmentation.
    Tokenizer.SPECIAL_WORD_TOKENIZERS[self.lang] = LTPTokenizer()
    parser = PlaintextParser.from_file(src_file, Tokenizer(self.lang))
    stemmer = Stemmer(self.lang)

    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(self.lang)

    with open(output_file, "w", encoding=Config.default_encoding) as fw:
        logging.debug(parser.document.paragraphs)
        for sentence in summarizer(parser.document, self.summary_sentence_cnt):
            fw.write(str(sentence) + "\n")
    logging.info("finish summarization, summary is saved in: {}".format(Config.summary_dir))
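As background for the "which stage is slow" question: LexRank computes a pairwise sentence-similarity matrix and then runs an iterative ranking over it, so runtime grows roughly quadratically with the number of sentences, which is consistent with a 300-paragraph corpus being slow. One way to confirm where the time actually goes is to profile the call with the standard-library cProfile; the sketch below is not from the thread, and the stand-in workload at the bottom would be replaced by a real call to summary_docs(...):

```python
import cProfile
import io
import pstats

def profile_call(fn, *args, **kwargs):
    """Run fn under cProfile and return (result, text report of top hotspots)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = fn(*args, **kwargs)
    profiler.disable()

    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time so the slow stage (parsing, tokenizing,
    # similarity matrix, ranking) shows up at the top of the report.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()

# Stand-in workload; replace with e.g. profile_call(self.summary_docs, src, out).
_, report = profile_call(sorted, range(100_000), key=lambda x: -x)
print(report)
```

Reading the "cumtime" column of the report shows which function dominates the run, which answers the question without needing to share the corpus.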

Thank you very much.

@miso-belica miso-belica self-assigned this Jun 13, 2018
@miso-belica miso-belica changed the title Performance when corpus is large LexRank Performance when corpus is large May 19, 2019