Could you provide the corpus category for each archive? #2

Open
ScottishFold007 opened this issue Jan 13, 2023 · 3 comments

Comments

@ScottishFold007

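The dataset is quite large. Could you also provide the corpus category for each archive? That is, does a given archive contain novels, prose essays, student compositions, or some other category?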

@esbatmop
Owner

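This project aims to match ChatGPT's 40T of web corpus data. The first goal is to reach the same order of magnitude in data volume, so for now we are not providing an index or categories.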

@esbatmop
Owner

esbatmop commented Jan 13, 2023

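Because we have no capacity to review the copyright status of the data sources, and in order to keep the service available for as long as possible, this dataset will not provide index or category information for the archives. We also kindly ask everyone not to discuss the index or categories of the archives, and to use the data discreetly.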

@Orion-Zheng

Is there any overlap in data between the archives?
