A script to process the non-anonymized CNN and DailyMail datasets for summarization.

hpzhao/non-anonymized-CNN-DailyMail

A script to produce the non-anonymized CNN and DailyMail datasets for summarization. Reference: abisee/cnn-dailymail

Environment

  • Python 3.6

Features

  • tokenized by CoreNLP
  • non-anonymized
  • lowercased
  • article information removed
  • multiprocessing support
  • JSON output (more readable)
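
As a rough illustration of the lowercasing and JSON output listed above, here is a minimal sketch. It assumes each story has already been tokenized by CoreNLP and follows the usual layout of the CNN/DailyMail release, where every summary sentence is preceded by an `@highlight` line. The function name `parse_story` and the JSON keys `article` and `abstract` are hypothetical and are not taken from `make_dataset.py`; the removal of article information is not covered.

```python
import json


def parse_story(text):
    """Split a tokenized story into lowercased article and summary lines.

    Hypothetical helper: the key names below are illustrative only.
    """
    article, abstract = [], []
    seen_highlight = False
    for line in text.splitlines():
        line = line.strip().lower()
        if not line:
            continue  # skip blank lines between paragraphs
        if line == "@highlight":
            # Every non-marker line after the first marker is a summary line.
            seen_highlight = True
        elif seen_highlight:
            abstract.append(line)
        else:
            article.append(line)
    return {"article": article, "abstract": abstract}


sample = "The cat sat .\n\nIt purred .\n\n@highlight\n\nCat sits ."
record = parse_story(sample)
print(json.dumps(record, indent=2))
```

Storing one such dictionary per story keeps the article and its summary together in a form that can be inspected directly in a text editor.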

How to use it?

1. Download data

Download the stories directories from here for both CNN and Daily Mail.

2. Download CoreNLP

Download and unzip CoreNLP from here. Add the following line to your bash_profile:

export CLASSPATH=$CLASSPATH:/path/to/stanford-corenlp-full-2018-02-27/stanford-corenlp-3.9.1.jar
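
Before running the script, it can help to confirm that the export line above took effect in the current shell. The following is a small sketch, not part of this repository: the helper name `corenlp_on_classpath` is hypothetical, and it only checks that some `CLASSPATH` entry is named like a CoreNLP jar, not that the file exists or that Java can load it.

```python
import os


def corenlp_on_classpath(classpath=None):
    """Return True if any CLASSPATH entry looks like a CoreNLP jar.

    Hypothetical helper: matches on the file name only.
    """
    if classpath is None:
        classpath = os.environ.get("CLASSPATH", "")
    # CLASSPATH entries are separated by os.pathsep (":" on Linux and macOS).
    entries = [entry for entry in classpath.split(os.pathsep) if entry]
    return any(
        os.path.basename(entry).startswith("stanford-corenlp")
        and entry.endswith(".jar")
        for entry in entries
    )


if __name__ == "__main__":
    print("CoreNLP jar on CLASSPATH:", corenlp_on_classpath())
```

If this prints `False`, open a new terminal or re-source your bash_profile so the updated `CLASSPATH` is picked up.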

3. Make dataset

# for dailymail (the second command does the same for cnn)
# if your machine has multiple CPU cores, you can speed this up by setting -worker_num

python make_dataset.py -stories_dir dailymail/stories -tokenized_stories_dir dailymail/tokenized_stories -train_urls url_lists/dailymail_wayback_training_urls.txt -test_urls url_lists/dailymail_wayback_test_urls.txt -val_urls url_lists/dailymail_wayback_validation_urls.txt -output_dir dailymail 
python make_dataset.py -stories_dir cnn/stories -tokenized_stories_dir cnn/tokenized_stories -train_urls url_lists/cnn_wayback_training_urls.txt -test_urls url_lists/cnn_wayback_test_urls.txt -val_urls url_lists/cnn_wayback_validation_urls.txt -output_dir cnn
