Pre-processing steps #40

Open
aravindcheruvu opened this issue Apr 23, 2024 · 1 comment

Comments

@aravindcheruvu

Hi,

Thank you for open-sourcing this work. I have a few questions:

  1. What are the pre-processing steps that are applied before releasing the datasets?
  2. Do we need to apply all the pre-processing scripts mentioned in the dataset?
  3. I still see profane words in the dataset. Does that mean we have to apply the profane-word filtering ourselves?

Thanks.

@ehsk
Collaborator

ehsk commented May 15, 2024

Apologies for the late reply. We don't actively monitor this repo.

The pre-processing scripts can be found here. The datasets that we released were already pre-processed, and that includes filtering profane words based on this file. If you find profane words that our list does not cover, then yes, you will need to apply additional filtering to the data yourself.
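
For readers who do need an extra pass, here is a minimal sketch of what list-based filtering could look like. It is not the repository's script: the function names (`load_wordlist`, `build_pattern`, `filter_samples`), the one-term-per-line word-list format, and the assumption that each sample is a plain string are all hypothetical and chosen only for illustration.

```python
import re
from typing import Iterable, List, Set


def load_wordlist(lines: Iterable[str]) -> Set[str]:
    """Build a lowercase set of terms, skipping blank lines.

    Assumes one term per line (hypothetical format).
    """
    return {line.strip().lower() for line in lines if line.strip()}


def build_pattern(words: Set[str]) -> re.Pattern:
    """Compile one case-insensitive regex matching any term as a whole word."""
    if not words:
        # An empty alternation would match everywhere; use a never-matching regex.
        return re.compile(r"(?!)")
    # Longest terms first so multi-word entries win over shorter prefixes.
    escaped = sorted((re.escape(w) for w in words), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def filter_samples(samples: Iterable[str], pattern: re.Pattern) -> List[str]:
    """Keep only the samples that contain none of the listed terms."""
    return [s for s in samples if not pattern.search(s)]


# Placeholder terms and samples, for illustration only.
extra_terms = load_wordlist(["darn", "heck"])
pattern = build_pattern(extra_terms)
kept = filter_samples(["What the heck?", "Nice day."], pattern)
```

Whole-word matching avoids dropping samples where a listed term only appears inside a longer word, but it will also miss deliberate misspellings, so the word list would need to cover those variants explicitly.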
