Add data cleanse component #9879

Merged: AdRiley merged 39 commits into develop from wip/adr/add-data-cleanse on May 20, 2024


@AdRiley (Member) commented May 7, 2024

Pull Request Description

Add new cleanse and text_cleanse components

[Three screenshots attached]

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the Scala, Java, TypeScript, and Rust style guides. In case you are using a language not listed above, follow the Rust style guide.
  • Unit tests have been written where possible.

@AdRiley marked this pull request as ready for review May 16, 2024 08:58

@radeusgd (Member) left a comment

Looks good.

I'd prefer to move the Text type tests to Base_Tests, as testing them as part of the in-memory Table implementation feels weird.

@somebody1234 (Collaborator) commented

would we perhaps want an input.replace "\d" " " cleanse too? (like Duplicate_Whitespace, except always replacing with spaces - for example, 'foo bar\t\t baz')

although it probably wouldn't be too late to add that only when someone actually has a use case for it...

@AdRiley (Member, Author) commented May 16, 2024

> would we perhaps want an input.replace "\d" " " cleanse too? (like Duplicate_Whitespace, except always replacing with spaces - for example, 'foo bar\t\t baz')
>
> although it probably wouldn't be too late to add that only when someone actually has a use case for it...

So operator78170.replace (regex "\d") ' ' exists today. But do you mean the ability to use our named regexes in replace?

That is an interesting idea...

@radeusgd (Member) commented

> > would we perhaps want an input.replace "\d" " " cleanse too? (like Duplicate_Whitespace, except always replacing with spaces - for example, 'foo bar\t\t baz')
> > although it probably wouldn't be too late to add that only when someone actually has a use case for it...
>
> So operator78170.replace (regex "\d") ' ' exists today. But do you mean the ability to use our named regexes in replace?
>
> That is an interesting idea...

I understood it as the ability for the regex to replace the number not with an empty "", but with a single space " ". Not sure if that was what you meant @somebody1234 ?

But I'm writing because this also struck a chord with me. I was thinking that with this method, when cleaning e.g. [a,b,c] of all non-letters, I will get abc. Often that is what I want. But it feels to me that I may also want to get a b c. For example, for language processing tasks, if I want to do some naive cleanup of punctuation before tokenization, I probably want foo:bar, baz... Hmm? to become foo bar baz Hmm, so that I can then split it on " " to get all the tokens. (Of course, whether foo:bar should become foo bar or actually foobar is highly use-case dependent.)

But essentially this leads to the idea that maybe we should be able to control whether the cleansing should "preserve separation between words". I.e. by default we replace everything with "", but we could have an alternative mode where we replace everything with " " and then remove duplicate whitespace as the last step, to normalize all separators to a single space.

But it feels like this is complicating this rather simple tool, so maybe that is not really what we want at this stage for this component. Just throwing ideas around.
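The two modes described in this comment can be sketched as follows. This is only an illustration in Python, not the Enso component added by this PR: the function name `cleanse`, the `preserve_separation` flag, the ASCII-only definition of "letter", and the final trimming of leading and trailing spaces are all assumptions made for the example.

```python
import re

# Hypothetical helper, not part of the Enso standard library.
# "Non-letter" is assumed to mean anything outside A-Z / a-z.
NON_LETTER = re.compile(r"[^A-Za-z]")

def cleanse(text: str, preserve_separation: bool = False) -> str:
    """Remove non-letters, optionally keeping word boundaries."""
    if not preserve_separation:
        # Default mode: every non-letter is replaced with "".
        return NON_LETTER.sub("", text)
    # Alternative mode: every non-letter becomes a space ...
    spaced = NON_LETTER.sub(" ", text)
    # ... then runs of whitespace are normalized to a single space.
    collapsed = re.sub(r"\s+", " ", spaced)
    # Trimming the ends is an extra assumption, so "[a,b,c]" gives "a b c".
    return collapsed.strip()

print(cleanse("[a,b,c]"))
print(cleanse("[a,b,c]", preserve_separation=True))
print(cleanse("foo:bar, baz... Hmm?", preserve_separation=True).split(" "))
```

With the flag set, the result can be split on " " to obtain tokens, which is the tokenization use case mentioned above.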

@somebody1234 (Collaborator) commented

whoops my bad, i meant "\s+" " " 😅
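For reference, the corrected pattern replaces each run of whitespace with a single space. A minimal sketch of that behaviour using Python's `re` module follows; the helper name `collapse_whitespace` is hypothetical, and this is not the Enso `replace` call discussed in the thread.

```python
import re

def collapse_whitespace(text: str) -> str:
    # Replace every run of whitespace (spaces, tabs, newlines) with one space.
    return re.sub(r"\s+", " ", text)

# The example string from the thread: two tabs and a space between bar and baz.
print(collapse_whitespace("foo bar\t\t baz"))
```

Because the replacement is always a space, a run made only of tabs also ends up as a single space character.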

@AdRiley merged commit c7476c1 into develop May 20, 2024
36 checks passed
@AdRiley deleted the wip/adr/add-data-cleanse branch May 20, 2024 08:33
4 participants