NatCat

This repo provides the NatCat dataset from Natcat: Weakly Supervised Text Classification with Naturally Annotated Datasets.

Data

NatCat

NatCat can be downloaded from here

NatCat are naturally annotated category-text pairs for training text classifiers.

NatCat is constructed from three different data domains.

Wikipedia
Stack Exchange
Reddit

Each directory contains the data from the corresponding domain. The files are named as train.tsv???.data Each data file is tab separated data. The first field is the positive/correct category, the second to the eighth fields are negative/wrong categories. The nineth/last field is the text to categorize.

CatEval

CatEval contains the 11 tasks we use to evaluate NatCat trained text classifiers.

Under each task directory, the file named classes.txt.acl list the category names we used to run the experiments.

The test datasets of AGNews, DBP, Yahoo, Amazon-2, Yelp2 can be downloaded from https://drive.google.com/drive/u/0/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M and are created by Zhang et al., Character-level Convolutional Networks for Text Classification .

The Comment dataset is available under data/cateval/comment and is created from https://dataturks.com/projects/zhiqiyubupt/comment

The NYT dataset is constructed from The New York Times Annotated Corpus . You will need to get a license from "The New York Times Annotated Corpus Agreement" to use their data. The script we used to construct the CatEval test set can be found under data/cateval/nyt/data-prep.

WikiCat

We provide another full version of NatCat constructed from Wikipeda, namely, WikiCat. It can be downloaded from here.

WikiCat is constructed from Wikipedia. Each Wikipedia page is annotated by categories (can be found at the bottom of each Wikipedia page) and their immediate parent categories.

WikiCat can be used to train topical text classification models.

Files

wikipedia-documents: contains all Wikpedia documents. Each file is named by a digital ID and contains a single Wikipedia document.
{train,dev}.tsv are tab separated files containing Wikipedia IDs and their corresponding categories. Each row starts with a Wikipedia ID, and followed by their annotated categories separated by tabs.

Code

To train a text classifier

python code/run_natcat.py \
    --model_type roberta \
    --model_name_or_path roberta-base \
    --task_name natcat \
    --seed 1 \
    --do_train \
    --do_lower_case \
    --data_dir data/sample-data \
    --max_seq_length 128 \
    --per_gpu_train_batch_size=32   \
    --learning_rate 2e-5 \
    --num_train_epochs 1 \
    --save_total_limit 2 \
    --output_dir saved_checkpoints/roberta-base \
    --warmup_steps 7500

To evaluate on a single label text classification task

python code/run_eval.py \
    --model_type roberta \
    --model_name_or_path saved_checkpoints/roberta-base \
    --task_name eval \
    --do_eval \
    --do_lower_case \
    --eval_data_file data/cateval/agnews/test.csv \
    --max_seq_length 128 \
    --class_file_name=data/cateval/agnews/classes.txt.acl \
    --pred_output_file=saved_checkpoints/roberta-base/agnews.preds.txt \
    --output_dir saved_checkpoints/roberta-base \
    --per_gpu_eval_batch_size=64

To calculate the model prediction accuracy of a single label task

python code/compute_acc.py  saved_checkpoints/roberta-base/agnews.preds.txt  data/cateval/agnews/test.csv

To evaluate on a multi label text classification task

python code/run_eval.py \
    --model_type roberta \
    --label_filepath data/cateval/comment/test.class.txt \
    --model_name_or_path saved_checkpoints/roberta-base \
    --eval_data_file data/cateval/comment/test.doc.txt \
    --class_file_name=data/cateval/comment/classes.txt.acl \
    --task_name comment \
    --do_eval \
    --multi_class \
    --do_lower_case \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=64   \
    --output_dir saved_checkpoints/roberta-base \
    --pred_output_file=saved_checkpoints/roberta-base/comment.preds.txt

To calculate the model prediction accuracy of a multi label task

python code/compute_lrap.py saved_checkpoints/roberta-base/comment.preds.txt data/cateval/comment/test.class.txt data/cateval/comment/classes.txt.acl

Dependencies

transformers 3.1.0
torch 1.4.0

Pretrained Mod

pretrained zero shot models (BERT and RoBERTa trained with random seed 1) can be downloaded here.

Citation

@misc{chu2020natcat,
      title={NatCat: Weakly Supervised Text Classification with Naturally Annotated Datasets}, 
      author={Zewei Chu and Karl Stratos and Kevin Gimpel},
      year={2020},
      eprint={2009.14335},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Zewei Chu 9/29/2020

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
code		code
data		data
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

code

code

data

data

LICENSE

LICENSE

README.md

README.md

Repository files navigation

NatCat

Data

NatCat

CatEval

WikiCat

Files

Code

Pretrained Mod

Citation

About

Releases

Packages

Languages

License

ZeweiChu/NatCat

Folders and files

Latest commit

History

Repository files navigation

NatCat

Data

NatCat

CatEval

WikiCat

Files

Code

Pretrained Mod

Citation

About

Resources

License

Stars

Watchers

Forks

Languages