JaniceZhao/Douban-Dushu-Dataset

Dataset Description

DouBan DuShu is a Chinese website where users share reviews of all kinds of books. Most users of the website are amateur book reviewers, so the comments are usually written in colloquial Chinese or even Internet slang.

In addition to writing comments, users can rate books from one star to five stars according to their quality. We have collected more than 37 million short comments on about 18 thousand books from about 1 million users. This large user base provides a diversity of language styles, from moderately formal to informal. An example data item is shown in the following table.

| Key | Description | Value Example |
| --- | --- | --- |
| Book Name | The name of the book | 理想国 |
| User Name | Who gives the comment | 399 |
| Tag | The tag the book belongs to | 思想 |
| Comment | Content of the comment | 我是国师的脑残粉 |
| Star | Stars given to the book (from 1 star to 5 stars) | 5 stars |
| Date | When the comment was posted | 2018-08-21 |
| Like | Count of "likes" on the comment | 0 |
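A record like the one above can be read straight from the released CSV files with the standard library. The column headers used here (`book_name`, `star`, etc.) are assumptions based on the field table; check the header row of the actual files before use.

```python
import csv
import io

def load_reviews(fp):
    """Yield one review per row as a dict keyed by the CSV header.

    Assumes the file has a header row; the column names below are
    illustrative, not confirmed against the released files.
    """
    reader = csv.DictReader(fp)
    for row in reader:
        yield row

# Minimal in-memory example mirroring the table above.
sample = io.StringIO(
    "book_name,user_name,tag,comment,star,date,like_count\n"
    "理想国,399,思想,我是国师的脑残粉,5,2018-08-21,0\n"
)
rows = list(load_reviews(sample))
```

For the full corpus, open each of the four CSV files with `encoding="utf-8"` and stream rows rather than loading everything into memory at once.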

Data Preprocessing

  1. Convert full-width symbols to half-width symbols
  2. Remove some special symbols
  3. Convert traditional Chinese to simplified Chinese
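Step 1 above can be sketched with a plain codepoint shift: full-width ASCII variants (U+FF01–U+FF5E) sit at a fixed offset of 0xFEE0 from their half-width counterparts, and the ideographic space (U+3000) maps to an ordinary space. This is an illustrative sketch, not the authors' script; step 3 is typically done with an external tool such as OpenCC, which is not shown here.

```python
def to_halfwidth(text: str) -> str:
    """Map full-width ASCII variants to half-width characters.

    U+FF01..U+FF5E -> U+0021..U+007E (offset 0xFEE0);
    U+3000 (ideographic space) -> U+0020 (space).
    All other characters pass through unchanged.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)

# e.g. to_halfwidth("Ｈｅｌｌｏ，\u3000ｗｏｒｌｄ！") -> "Hello, world!"
```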

Terms of Use:

  1. Respect the privacy of the personal information in the original source
  2. The original copyright of all the data belongs to the writers of the reviews and DouBan
  3. The dataset is for study and research purposes only; without permission, it may not be used for any commercial purpose
  4. Redistribution is NOT allowed
  5. Items must be deleted if the copyright owners so request
  6. If you use the dataset for in-depth study, please cite this paper:
@article{zhao2018lsicc,
  title={LSICC: A Large Scale Informal Chinese Corpus},
  author={Zhao, Jianyu and Ji, Zhuoran},
  journal={arXiv preprint arXiv:1811.10167},
  year={2018}
}

Data Download

You must agree to the terms of use above before downloading and using this corpus.

Due to the large size of the corpus, we split the whole dataset into 4 CSV files.

  1. Google Drive: link
  2. Baidu Cloud: link Extraction Code: vpik

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
