vcccaat/nlp


Models credited to Xin Pan, Peter Liu, and Google Brain

Papers and references: Awesome text summarization

Data Preparation



Pipeline for Financial Dataset

  • Use a pretrained model for financial news (currently trained on the non-financial CNN/DailyMail news dataset)

  • Tokenize the test financial news using Stanford CoreNLP: python test_summary.py

  • Preprocess the tokenized financial news and store it in test.bin

  • Use the pointer-generator network to load the pretrained model and decode (generate the summary)

    python run_summarization.py --mode=decode --data_path=/path/to/data/test.bin --vocab_path=/path/to/data/vocab --log_root=/path/to/directory/containing/pretrained_model --exp_name=pretrained_model --max_enc_steps=400 --max_dec_steps=100 --coverage=1
    

    Adjust the number of encode steps (input passage length) and decode steps (output summary length) as needed

  • Visualize the result


  • Sample abstractive summary for CNN news: Here


Result and Visualization

Visualize the attention network: this

For Python 3, run: python -m http.server


Result with coverage, 100-word output

image-20200214130722462


Result with coverage, 50-word output

The machine copies whole sentences from the paragraph...

image-20200214131136402


Result without coverage, 100-word output

The machine copies whole sentences from the paragraph...

image-20200214133323672



Sumy software for text summarization

Sumy conveniently extracts news from a URL or text document. It offers several algorithms for selecting important sentences from an article: luhn | edmundson | lsa | text-rank | lex-rank | sum-basic | kl

Install

pip install sumy

Example

sumy sum-basic --length=2 --url=https://edition.cnn.com/2020/02/16/asia/coronavirus-wuhan-china-recovery-intl-hnk/index.html

Output:

Like Zhang, Ye has recovered from the deadly novel coronavirus.

"So please go to the hospital for examination as soon as possible when you got it.



Seq2Seq with Attention

Introduction

The encoder takes the input words to be transformed (translated, summarized); each word is a vector passed through the forward and backward activations of a bi-directional RNN. An attention value is then calculated for each word in the encoder, reflecting its importance in the sentence. The decoder generates the output one word at a time, by taking the dot product of the feature vectors and their corresponding attention weights at each timestep.

image-20200210123710672
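As a rough illustration of the idea above, dot-product attention can be sketched in plain Python. This is a toy sketch with made-up vectors and no learned parameters, not the model's actual implementation:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(decoder_state, encoder_states):
    """Score each encoder state against the current decoder state,
    normalize with softmax, and return the attention weights plus the
    attention-weighted context vector."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context

# Three encoder timesteps with 2-d hidden states (made-up numbers).
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec = [1.0, 0.0]
weights, context = dot_product_attention(dec, enc)
print(weights)  # sums to 1; encoder states aligned with dec get more mass
```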


Network architecture

image-20200210123616083

  • Encoder: bi-directional RNN; the feature vector a at timestep t is the concatenation of the forward RNN and backward RNN states

    image-20200210110118887


  • Attention: img: the amount of attention img should pay to img

    • Computed by a small neural network that takes the previous decoder state img and the encoder activation img, produces img, and passes it through a softmax to generate img

    image-20200206104243639 image-20200207180539638

    • additive attention for the neural network:

      image-20200210121315826

    • a simpler choice is dot-product attention:

      image-20200210110149495


  • Decoder: an RNN whose input at each timestep is the dot product of the attention weights and the encoder activations (the context vector)

    image-20200210121232055

    Beam search is used in the decoder to keep up to the k most likely word choices at each step, where k is a user-specified parameter (the beam width).
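The beam-search idea can be sketched as follows. This is a toy version in which a fixed table of per-step token probabilities stands in for the real decoder (which would condition on the partial sequence); all numbers are made up:

```python
import math

def beam_search(step_probs, k=2):
    """Keep the k highest-scoring partial sequences at each timestep.
    step_probs[t][token] is a stand-in for P(token | timestep t)."""
    beams = [([], 0.0)]  # (token sequence, log-probability)
    for probs in step_probs:
        candidates = []
        for seq, logp in beams:
            for token, p in probs.items():
                candidates.append((seq + [token], logp + math.log(p)))
        # Prune: keep only the k most likely partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    return beams

steps = [
    {"the": 0.6, "a": 0.4},
    {"cat": 0.7, "dog": 0.3},
]
best_seq, best_logp = beam_search(steps, k=2)[0]
print(best_seq)  # ['the', 'cat']
```

With k=1 this degenerates to greedy decoding; larger k explores more hypotheses at higher cost.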



Pointer-generator

Introduction

Abstractive text summarization requires sequence-to-sequence models, but these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. The state-of-the-art pointer-generator model, introduced by Google Brain in 2017, addresses both problems. In addition to the attention model, it adds two features: first, it copies words from the source text via pointing, which aids accurate reproduction of information; second, it uses coverage to keep track of what has been summarized, which discourages repetition.

image-20200210123647436


Network architecture

In addition to attention, we add two things:



Copy distribution

  • Copy frequently occurring words from the source text by summing the attention over duplicate occurrences of the same word

    image-20200210121154603

    image-20200207164857450


  • Combine the copy distribution Pcopy with the general attention vocabulary distribution Pvocab (computed in the attention step earlier: img) using a generation probability pgen ∈ [0, 1]; for timestep t, pgen is calculated from the context vector h∗t, the decoder state st, and the decoder input xt:

    image-20200210121140811

    image-20200210121130835


  • Training: use Pfinal to compute the loss, the negative log-likelihood of the target word
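A minimal sketch of the mixing step above, with toy hand-picked distributions (the real model computes pgen, Pvocab, and the attention with learned parameters):

```python
def final_distribution(p_gen, p_vocab, attention, source_words):
    """P_final(w) = p_gen * P_vocab(w) + (1 - p_gen) * (sum of attention on
    source positions where w occurs). Duplicate source words add up, and
    out-of-vocabulary source words become copyable."""
    p_final = {w: p_gen * p for w, p in p_vocab.items()}
    for attn, word in zip(attention, source_words):
        p_final[word] = p_final.get(word, 0.0) + (1 - p_gen) * attn
    return p_final

vocab = {"germany": 0.5, "wins": 0.3, "beat": 0.2}       # made-up P_vocab
source = ["germany", "beat", "argentina", "germany"]     # "argentina" is OOV
attn = [0.4, 0.1, 0.3, 0.2]                              # made-up attention
dist = final_distribution(p_gen=0.7, p_vocab=vocab, attention=attn, source_words=source)
print(dist["argentina"])  # ≈ 0.3 * 0.3 = 0.09 -- the OOV word can be copied
```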



Coverage mechanism

Records how much attention each source position has already received, to discourage the decoder from attending to (and copying) the same sentences many times.

  • Sum the attention over all previous decoder timesteps; the coverage vector c represents the degree of coverage those source words have received from the attention mechanism so far.

    image-20200210121105048

  • the additive attention of the earlier seq2seq attention model is changed to:

    image-20200210121045197

  • add one more term to the loss

    loss = softmax loss + image-20200210121018318
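A toy sketch of the coverage penalty described above, with illustrative attention values only (in the real model this term is added to the training loss at each decoder step):

```python
def coverage_loss(attention_steps):
    """c_t is the sum of attention over all previous timesteps; the
    penalty at each step is sum_i min(a_t[i], c_t[i]), which grows when
    attention revisits already-covered source positions."""
    n = len(attention_steps[0])
    coverage = [0.0] * n
    loss = 0.0
    for attn in attention_steps:
        loss += sum(min(a, c) for a, c in zip(attn, coverage))
        coverage = [c + a for c, a in zip(coverage, attn)]
    return loss

# Repeating the same attention pattern is penalized heavily...
print(coverage_loss([[0.9, 0.1], [0.9, 0.1]]))  # 1.0
# ...while attending to new positions is penalized much less.
print(coverage_loss([[0.9, 0.1], [0.1, 0.9]]))  # 0.2
```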



Implementation

  • Training from scratch: GitHub Code Here

  • Transfer learning

    Use a pre-trained model (a version for TensorFlow 1.2.1), i.e. a saved network that was previously trained by others on a large dataset. This avoids re-training the model from scratch, which takes many hours (around 7 days for this model), and a pre-trained model built from a massive dataset can already serve as an effective generic summarization model.



Model Evaluation

Model Performance

image-20200212102144483

image-20200212102302631

image-20200212102320925

Metrics used:

ROUGE-1: overlap of unigrams between the system-generated summary and the reference summary / number of unigrams in the reference summary

ROUGE-2: overlap of bigrams between the system-generated summary and the reference summary / number of bigrams in the reference summary

ROUGE-L: length of the LCS (longest common subsequence) between the system-generated summary and the reference summary / number of unigrams in the reference summary

image-20200210105922693
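A minimal sketch of ROUGE-N recall as defined above, on toy sentences (real evaluations use the official ROUGE toolkit, which also reports precision and F-scores):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: clipped n-gram overlap between the candidate and
    reference summaries, divided by the reference's n-gram count."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand_ngrams = ngrams(candidate.split())
    ref_ngrams = ngrams(reference.split())
    # Clip each overlap count by how often the n-gram appears in the candidate.
    overlap = sum(min(count, cand_ngrams[g]) for g, count in ref_ngrams.items())
    return overlap / sum(ref_ngrams.values())

ref = "the cat sat on the mat"
cand = "the cat lay on the mat"
print(rouge_n(cand, ref, n=1))  # 5 of 6 reference unigrams matched
print(rouge_n(cand, ref, n=2))  # 3 of 5 reference bigrams matched
```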

Example from Paper:

image-20200210105854973