COLDataset

The official repository of the paper: COLD: A Benchmark for Chinese Offensive Language Detection

中文冒犯语言检测数据集

Paper link: https://arxiv.org/abs/2201.06025

Detector: We release the version of roberta-base-cold in Huggingface.

News

Our paper has been accepted by EMNLP 2022!

Info

COLDataset contains 37,480 comments with binary offensive labels and covers diverse topics of race, gender, and region. To gain further insights into the data types and characteristics, we annotate the test set at a fine-grained level with four categories: attacking individuals, attacking groups, anti-bias and other non-offensive.

the labels in train.csv and dev.csv:

label 0: safe,
label 1: offensive

fine-grained-label in test.csv:

0: safe (other-Non-offen)
1: attack individual
2: attack group
3: safe (anti-bias)

Citing

Please kindly cite our paper if this paper and the dataset are helpful.

  @article{deng2022cold,
  title="Cold: A benchmark for chinese offensive language detection",
  author= "Deng, Jiawen and Zhou, Jingyan and Sun, Hao and Mi, Fei and Huang, Minlie",
  booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
  month = dec,
  year = "2022",
  address = "Abu Dhabi, United Arab Emirates",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2022.emnlp-main.796",
  pages = "11580--11599"
}

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
COLDataset		COLDataset
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

COLDataset

COLDataset

LICENSE

LICENSE

README.md

README.md

Repository files navigation

COLDataset

News

Info

Citing

About

Releases

Packages

Contributors 2

License

thu-coai/COLDataset

Folders and files

Latest commit

History

Repository files navigation

COLDataset

News

Info

Citing

About

Resources

License

Stars

Watchers

Forks