JaniceZhao/Douban-Dushu-Dataset

Dataset Description

DouBan DuShu is a Chinese website where users share reviews of all kinds of books. Most users of the website are amateur book reviewers, so the comments are usually written in colloquial Chinese or even Internet slang.

In addition to writing comments, users can rate books from one star to five stars according to their quality. We have collected more than 37 million short comments on about 18 thousand books from about 1 million users. This large user base provides a diversity of language styles, from moderately formal to informal. An example data item is shown in the following table.

| Key | Description | Value Example |
| --- | --- | --- |
| Book Name | The name of the book | 理想国 |
| User Name | Who gives the comment | 399 |
| Tag | The tag the book belongs to | 思想 |
| Comment | Content of the comment | 我是国师的脑残粉 |
| Star | Stars given to the book (from 1 star to 5 stars) | 5 stars |
| Date | When the comment was posted | 2018-08-21 |
| Like | Count of "likes" on the comment | 0 |
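A record like the one above can be read straight from the released CSV files with the standard library. The column headers used here (`book_name`, `star`, etc.) are assumptions based on the field table; check the header row of the actual files before use.

```python
import csv
import io

def load_reviews(fp):
    """Yield one review per row as a dict keyed by the CSV header.

    Assumes the file has a header row; the column names below are
    illustrative, not confirmed against the released files.
    """
    reader = csv.DictReader(fp)
    for row in reader:
        yield row

# Minimal in-memory example mirroring the table above.
sample = io.StringIO(
    "book_name,user_name,tag,comment,star,date,like_count\n"
    "理想国,399,思想,我是国师的脑残粉,5,2018-08-21,0\n"
)
rows = list(load_reviews(sample))
```

For the full corpus, open each of the four CSV files with `encoding="utf-8"` and stream rows rather than loading everything into memory at once.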

Data Preprocessing

  1. Convert full-width symbols to half-width symbols
  2. Remove some special symbols
  3. Convert traditional Chinese to simplified Chinese
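Step 1 above can be sketched with a plain codepoint shift: full-width ASCII variants (U+FF01–U+FF5E) sit at a fixed offset of 0xFEE0 from their half-width counterparts, and the ideographic space (U+3000) maps to an ordinary space. This is an illustrative sketch, not the authors' script; step 3 is typically done with an external tool such as OpenCC, which is not shown here.

```python
def to_halfwidth(text: str) -> str:
    """Map full-width ASCII variants to half-width characters.

    U+FF01..U+FF5E -> U+0021..U+007E (offset 0xFEE0);
    U+3000 (ideographic space) -> U+0020 (space).
    All other characters pass through unchanged.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)

# e.g. to_halfwidth("Ｈｅｌｌｏ，\u3000ｗｏｒｌｄ！") -> "Hello, world!"
```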

Terms of Use:

  1. Respect the privacy of the personal information in the original source
  2. The original copyright of all the data belongs to the writers of the reviews and DouBan
  3. The dataset is for study and research purposes only; without permission, it may not be used for any commercial purpose
  4. Redistribution is NOT allowed
  5. Items must be deleted if the copyright owners so request
  6. If you use the dataset for in-depth study, please cite this paper:
@article{zhao2018lsicc,
  title={LSICC: A Large Scale Informal Chinese Corpus},
  author={Zhao, Jianyu and Ji, Zhuoran},
  journal={arXiv preprint arXiv:1811.10167},
  year={2018}
}

Data Download

You must agree to the terms of use above before downloading and using this corpus.

Due to the large size of the corpus, we split the whole dataset into 4 CSV files.

  1. Google Drive: link
  2. Baidu Cloud: link Extraction Code: vpik

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
