vcccaat/nlp


Models credited to Xin Pan, Peter Liu, and Google Brain

Papers and references: Awesome text summarization

Data Preparation



Pipeline for Financial Dataset

  • Use a pretrained model for financial news (currently trained on the non-financial CNN/DailyMail news dataset)

  • Tokenize the test financial news using Stanford CoreNLP: python test_summary.py

  • Preprocess the tokenized financial news and store it in test.bin

  • Use the pointer-generator network to load the pretrained model and decode (generate the summary)

    python run_summarization.py --mode=decode --data_path=/path/to/data/test.bin --vocab_path=/path/to/data/vocab --log_root=/path/to/directory/containing/pretrained_model --exp_name=pretrained_model --max_enc_steps=400 --max_dec_steps=100 --coverage=1
    

    Adjust the number of encode steps (input passage length) and decode steps (output summary length) as needed

  • Visualize the result


  • Sample abstractive summary for CNN news: Here


Result and Visualization

Visualize the attention network: this

For Python 3, run: python -m http.server


Result with coverage, 100-word output

image-20200214130722462


Result with coverage, 50-word output

The machine copies whole sentences from the paragraph...

image-20200214131136402


Result without coverage, 100-word output

The machine copies whole sentences from the paragraph...

image-20200214133323672



Sumy software for text summarization

Sumy conveniently extracts news from a URL or text document. It offers several algorithms for selecting important sentences from an article: luhn | edmundson | lsa | text-rank | lex-rank | sum-basic | kl

Install

pip install sumy

Example

sumy sum-basic --length=2 --url=https://edition.cnn.com/2020/02/16/asia/coronavirus-wuhan-china-recovery-intl-hnk/index.html

Output:

Like Zhang, Ye has recovered from the deadly novel coronavirus.

"So please go to the hospital for examination as soon as possible when you got it.



Seq2Seq with Attention

Introduction

The encoder takes the input words to be transformed (translated, summarized); each word is a vector passed through the forward and backward activations of a bi-directional RNN. An attention value is then calculated for each word in the encoder, reflecting its importance in the sentence. The decoder generates the output one word at a time, by taking the dot product of the feature vectors and their corresponding attention weights at each timestep.

image-20200210123710672
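As a rough illustration of the idea above, dot-product attention can be sketched in plain Python. This is a toy sketch with made-up vectors and no learned parameters, not the model's actual implementation:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(decoder_state, encoder_states):
    """Score each encoder state against the current decoder state,
    normalize with softmax, and return the attention weights plus the
    attention-weighted context vector."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context

# Three encoder timesteps with 2-d hidden states (made-up numbers).
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec = [1.0, 0.0]
weights, context = dot_product_attention(dec, enc)
print(weights)  # sums to 1; encoder states aligned with dec get more mass
```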


Network architecture

image-20200210123616083

  • Encoder: bi-directional RNN; the feature vector a at timestep t is the concatenation of the forward RNN and backward RNN states

    image-20200210110118887


  • Attention: img: the amount of attention img should pay to img

    • Computed by a small neural network that takes the previous decoder state img and the encoder activation img, produces img, and passes it through a softmax to generate img

    image-20200206104243639 image-20200207180539638

    • additive attention for the neural network:

      image-20200210121315826

    • a simpler choice is dot-product attention:

      image-20200210110149495


  • Decoder: an RNN whose input at each timestep is the dot product of the attention weights and the encoder activations (the context vector)

    image-20200210121232055

    Beam search is used in the decoder to keep up to the k most likely word choices at each step, where k is a user-specified parameter (the beam width).
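The beam-search idea can be sketched as follows. This is a toy version in which a fixed table of per-step token probabilities stands in for the real decoder (which would condition on the partial sequence); all numbers are made up:

```python
import math

def beam_search(step_probs, k=2):
    """Keep the k highest-scoring partial sequences at each timestep.
    step_probs[t][token] is a stand-in for P(token | timestep t)."""
    beams = [([], 0.0)]  # (token sequence, log-probability)
    for probs in step_probs:
        candidates = []
        for seq, logp in beams:
            for token, p in probs.items():
                candidates.append((seq + [token], logp + math.log(p)))
        # Prune: keep only the k most likely partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    return beams

steps = [
    {"the": 0.6, "a": 0.4},
    {"cat": 0.7, "dog": 0.3},
]
best_seq, best_logp = beam_search(steps, k=2)[0]
print(best_seq)  # ['the', 'cat']
```

With k=1 this degenerates to greedy decoding; larger k explores more hypotheses at higher cost.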



Pointer-generator

Introduction

Abstractive text summarization requires sequence-to-sequence models, but these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. The state-of-the-art pointer-generator model, introduced by Google Brain in 2017, addresses both problems. In addition to the attention model, it adds two features: first, it copies words from the source text via pointing, which aids accurate reproduction of information; second, it uses coverage to keep track of what has been summarized, which discourages repetition.

image-20200210123647436


Network architecture

In addition to attention, we add two things:



Copy distribution

  • Copy frequently occurring words from the source text by summing the attention over duplicate occurrences of the same word

    image-20200210121154603

    image-20200207164857450


  • Combine the copy distribution Pcopy with the general attention vocabulary distribution Pvocab (computed in the attention step earlier: img) using a generation probability pgen ∈ [0, 1]; for timestep t, pgen is calculated from the context vector h∗t, the decoder state st, and the decoder input xt:

    image-20200210121140811

    image-20200210121130835


  • Training: use Pfinal to compute the loss, the negative log-likelihood of the target word
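A minimal sketch of the mixing step above, with toy hand-picked distributions (the real model computes pgen, Pvocab, and the attention with learned parameters):

```python
def final_distribution(p_gen, p_vocab, attention, source_words):
    """P_final(w) = p_gen * P_vocab(w) + (1 - p_gen) * (sum of attention on
    source positions where w occurs). Duplicate source words add up, and
    out-of-vocabulary source words become copyable."""
    p_final = {w: p_gen * p for w, p in p_vocab.items()}
    for attn, word in zip(attention, source_words):
        p_final[word] = p_final.get(word, 0.0) + (1 - p_gen) * attn
    return p_final

vocab = {"germany": 0.5, "wins": 0.3, "beat": 0.2}       # made-up P_vocab
source = ["germany", "beat", "argentina", "germany"]     # "argentina" is OOV
attn = [0.4, 0.1, 0.3, 0.2]                              # made-up attention
dist = final_distribution(p_gen=0.7, p_vocab=vocab, attention=attn, source_words=source)
print(dist["argentina"])  # ≈ 0.3 * 0.3 = 0.09 -- the OOV word can be copied
```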



Coverage mechanism

Records how much attention each source position has already received, to discourage the decoder from attending to (and copying) the same sentences many times.

  • Sum the attention over all previous decoder timesteps; the coverage vector c represents the degree of coverage those source words have received from the attention mechanism so far.

    image-20200210121105048

  • the additive attention of the earlier seq2seq attention model is changed to:

    image-20200210121045197

  • add one more term to the loss

    loss = softmax loss + image-20200210121018318
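A toy sketch of the coverage penalty described above, with illustrative attention values only (in the real model this term is added to the training loss at each decoder step):

```python
def coverage_loss(attention_steps):
    """c_t is the sum of attention over all previous timesteps; the
    penalty at each step is sum_i min(a_t[i], c_t[i]), which grows when
    attention revisits already-covered source positions."""
    n = len(attention_steps[0])
    coverage = [0.0] * n
    loss = 0.0
    for attn in attention_steps:
        loss += sum(min(a, c) for a, c in zip(attn, coverage))
        coverage = [c + a for c, a in zip(coverage, attn)]
    return loss

# Repeating the same attention pattern is penalized heavily...
print(coverage_loss([[0.9, 0.1], [0.9, 0.1]]))  # 1.0
# ...while attending to new positions is penalized much less.
print(coverage_loss([[0.9, 0.1], [0.1, 0.9]]))  # 0.2
```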



Implementation

  • Training from scratch: GitHub Code Here

  • Transfer learning

    Use a pre-trained model (a version for TensorFlow 1.2.1), i.e. a saved network that was previously trained by others on a large dataset. This avoids re-training the model from scratch, which takes many hours (around 7 days for this model), and a pre-trained model built from a massive dataset can already serve as an effective generic summarization model.



Model Evaluation

Model Performance

image-20200212102144483

image-20200212102302631

image-20200212102320925

Metrics used:

ROUGE-1: overlap of unigrams between the system-generated summary and the reference summary / number of unigrams in the reference summary

ROUGE-2: overlap of bigrams between the system-generated summary and the reference summary / number of bigrams in the reference summary

ROUGE-L: length of the LCS (longest common subsequence) between the system-generated summary and the reference summary / number of unigrams in the reference summary

image-20200210105922693
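A minimal sketch of ROUGE-N recall as defined above, on toy sentences (real evaluations use the official ROUGE toolkit, which also reports precision and F-scores):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: clipped n-gram overlap between the candidate and
    reference summaries, divided by the reference's n-gram count."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand_ngrams = ngrams(candidate.split())
    ref_ngrams = ngrams(reference.split())
    # Clip each overlap count by how often the n-gram appears in the candidate.
    overlap = sum(min(count, cand_ngrams[g]) for g, count in ref_ngrams.items())
    return overlap / sum(ref_ngrams.values())

ref = "the cat sat on the mat"
cand = "the cat lay on the mat"
print(rouge_n(cand, ref, n=1))  # 5 of 6 reference unigrams matched
print(rouge_n(cand, ref, n=2))  # 3 of 5 reference bigrams matched
```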

Example from Paper:

image-20200210105854973