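Important notice

A higher-quality corpus is now available for training machine learning models, evaluating algorithms, and discussion: the open insurance-industry question-answering dataset for machine learning (机器学习保险行业问答开放数据集).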

Egret Wenda Corpus

Chinese question-answering corpus

A QA corpus based on the Egret BBS.

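Training a question-answering bot usually requires high-quality data. For English there are many large corpora, but for Chinese very little is publicly available. While studying machine learning I came across the Ubuntu Dialogue Corpus, which inspired me to mine a technical community for data and build a corpus from it.

The current version of the corpus is compiled from the records marked with a "best answer" among the 10,000+ questions in the Q&A board of the official Egret Technology (白鹭时代) forum.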

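- Use a crawler to store the target data in a database.
- Generate the raw data from the database.
- Manually review the raw data and give each question one acceptable answer.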
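
The second step, generating raw data from the stored posts, can be sketched roughly as below. This is only an illustration and not the code in processer.js: the post shape (`url`, `question`, `bestAnswer`) and the function name `toRawRecords` are hypothetical, since the database schema is not described here.

```javascript
/**
 * Keep only the posts that have a best answer and map them to raw QA records.
 * The post shape ({ url, question, bestAnswer }) is an assumption made for
 * this sketch; the real schema may differ.
 */
function toRawRecords(posts) {
  return posts
    .filter(
      (post) =>
        typeof post.bestAnswer === "string" && post.bestAnswer.trim() !== ""
    )
    .map((post) => ({
      url: post.url,
      question: post.question.trim(),
      answer: post.bestAnswer.trim(),
    }));
}

// Made-up posts, for illustration only (not taken from the forum).
const posts = [
  { url: "http://example.com/thread/1", question: " Q1 ", bestAnswer: " A1 " },
  { url: "http://example.com/thread/2", question: "Q2", bestAnswer: null },
];

console.log(toRawRecords(posts));
```

The records produced this way would still need the manual review described in the last step.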

The corpus currently contains 2907 question-answer pairs. That is small, but for a single vertical domain it may well be enough.

DESCRIPTION

In all files the field separator is " +++$+++ "

egret_wenda_lines.txt

- contains the actual text of each utterance
- fields:
	- lineID
	- person id (who uttered this phrase)
	- text of the utterance
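
As a minimal sketch of reading this file, the snippet below splits each row on the documented separator. It assumes the three fields appear in the order listed above; the helper names `parseLine` and `indexLines` are made up for this example, and the sample row is invented rather than taken from the corpus.

```javascript
// Field separator documented in this README.
const SEPARATOR = " +++$+++ ";

/**
 * Parse one row of egret_wenda_lines.txt into an object.
 * Assumes the three documented fields in the documented order. The text
 * field is re-joined in case the separator ever appears inside an utterance.
 */
function parseLine(row) {
  const [lineId, personId, ...rest] = row.split(SEPARATOR);
  return { lineId, personId, text: rest.join(SEPARATOR) };
}

/** Build a Map from lineID to parsed line for the whole file content. */
function indexLines(content) {
  const index = new Map();
  for (const row of content.split(/\r?\n/)) {
    if (row.trim() === "") continue; // skip blank lines
    const line = parseLine(row);
    index.set(line.lineId, line);
  }
  return index;
}

// Made-up sample row, for illustration only.
const sample = "L1 +++$+++ u42 +++$+++ Sample question text";
console.log(parseLine(sample));
```

In practice the `content` argument would come from something like `fs.readFileSync("egret_wenda_lines.txt", "utf8")`.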

egret_wenda_conversations.txt

- the structure of the conversations
- fields:
	- conversationId
	- person id of the first participant in the conversation
	- person id of the second participant in the conversation
	- date of the post
	- URL of the source of this conversation
	- list of the utterances that make up the conversation, in chronological
		order: ['Question lineID','Answer lineID'];
		has to be matched with egret_wenda_lines.txt to reconstruct the actual content
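
Reconstructing a question-answer pair from the two files might look like the sketch below. It assumes the six fields appear in the order listed above and that the last field is a single-quoted list such as `['L1','L2']`; the names `parseConversation` and `toQaPair` are hypothetical, and the sample values are invented.

```javascript
// Field separator documented in this README.
const SEPARATOR = " +++$+++ ";

/**
 * Parse one row of egret_wenda_conversations.txt.
 * Assumes the six documented fields in the documented order, with the last
 * field written as a list of single-quoted lineIDs.
 */
function parseConversation(row) {
  const [conversationId, firstPersonId, secondPersonId, date, url, rawIds] =
    row.split(SEPARATOR);
  // Pull every single-quoted token out of the list literal.
  const lineIds = Array.from(
    (rawIds || "").matchAll(/'([^']+)'/g),
    (match) => match[1]
  );
  return { conversationId, firstPersonId, secondPersonId, date, url, lineIds };
}

/**
 * Turn a conversation into a { question, answer } pair.
 * `textById` maps lineID to utterance text (built from egret_wenda_lines.txt).
 * Returns null when a referenced line is missing.
 */
function toQaPair(conversation, textById) {
  const [questionId, answerId] = conversation.lineIds;
  if (!textById.has(questionId) || !textById.has(answerId)) return null;
  return {
    question: textById.get(questionId),
    answer: textById.get(answerId),
  };
}

// Made-up sample data, for illustration only.
const textById = new Map([
  ["L1", "Sample question text"],
  ["L2", "Sample answer text"],
]);
const row =
  "c1 +++$+++ u1 +++$+++ u2 +++$+++ 2016-01-01 +++$+++ " +
  "http://example.com/thread/1 +++$+++ ['L1','L2']";

console.log(toQaPair(parseConversation(row), textById));
```

Returning null for a missing lineID keeps a bad row from silently producing a half-empty pair; callers can filter those out or log them.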

What's more

The files in raw are the raw data from the BBS.

To make the data more suitable for training, I have personally reviewed the raw data and modified some utterances, for example by deleting code from them.

processer.js

Generates the raw data from the data collection, which is built from the Egret Q&A board (Egret问答专区).

Tips

NOTE: If you have results to report on this corpus, please send an email to hain_wang@foxmail.com so I can add you to the list of people using this data.

Thanks!

hailiang-wang/egret-wenda-corpus