Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite some rlike expression to StartsWith/Contains #10715

Merged
merged 17 commits into from
May 15, 2024

Conversation

thirtiseven
Copy link
Collaborator

@thirtiseven thirtiseven commented Apr 16, 2024

Closes #10742

WIP

This PR rewrites RLike in some simple cases that can be replaced by GpuStartsWith / GpuEndsWith / GpuContains / GpuEqualTo.

Replacing RLike with GpuContains gains about 10% e2e speedup in a customer query. Needs further performance testing.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little concerned that you are writing your own Regexp parsing code instead of reusing the existing code

https://github.com/NVIDIA/spark-rapids/blob/branch-24.06/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala

Can we please go off of the existing RegexpParser instead of trying to write something new from scratch.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

I am a little concerned that you are writing your own Regexp parsing code instead of reusing the existing code

https://github.com/NVIDIA/spark-rapids/blob/branch-24.06/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala

Can we please go off of the existing RegexpParser instead of trying to write something new from scratch.

I updated code to use RegexParser, please take another look. It prevents me from writing a regex parser from scratch but makes the matching logic a bit more complicated. But overall I think reusing it is really better than having two parsers.

Will adds more tests, such as a UT, to verify that it is taking the speedup path.

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally it looks good to me

assert_gpu_and_cpu_are_equal_collect(
lambda spark: unary_op_df(spark, gen).selectExpr(
'a',
'regexp_like(a, "(abcd)(.*)")',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about \A and \Z? Is that something that we can support with this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\Z means The end of the input but for the final terminator, if any in java, so it is not the same as endsWith. Will support \A

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -444,6 +444,28 @@ def test_regexp_like():
'regexp_like(a, "a[bc]d")'),
conf=_regexp_conf)

@pytest.mark.skipif(is_before_spark_320(), reason='regexp_like is synonym for RLike starting in Spark 3.2.0')
def test_regexp_rlike_rewrite_optimization():
gen = mk_str_gen('[abcd]{3,6}')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add in some new line characters to the generated strings? ^ and $ in some cases can match just begin and end of line, not begin and end of string.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! The test failed in $ case, didn't know that $ means end of line in java regex.

Sadly it means we could not support endsWith pattern at all because we haven't support \w so it will fallback first. (technically we can by check this case when tagging but I don't think we need to do that now) I will remove the endsWith part.

I'm surprised that ^ passed this test with new line characters because ^ means "The beginning of a line". Will do some investigation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

'a',
'regexp_like(a, "(abcd)(.*)")',
'regexp_like(a, "abcd(.*)")',
'regexp_like(a, "(.*)(abcd)(.*)")',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how likely it is to have abcd show up in the generated data for any of these queries.

If we look at a starts with abcd. We have a 25% chance of picking an a as the first char, and 25% chance of picking a b as the second ... That means if we had an input pattern of abcd{4} then we would only likely have 8 rows in the entire 2048 data set that would match, but we have {3,6}, which makes it likely that we would have no rows in the data set that match.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven changed the title Rewrite some rlike expression to StartsWith/EndsWith/Contains Rewrite some rlike expression to StartsWith/Contains May 8, 2024
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven marked this pull request as ready for review May 8, 2024 10:53
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

build

@thirtiseven
Copy link
Collaborator Author

thirtiseven commented May 9, 2024

Did a simple performance test:

data: 1000000 random strings, each string has 0-2000 characters, 30% strings start with "abcde"*20, 30% strings contain "abcde"*20, 40% random strings.
startsWith query: 100 rlike queries from '^a' to '^abcde*20'
contains query: 100 rlike queries from '^(.*)a' to '^(.*)abcde*20'

startsWith:

  • CPU: 24620 ms
  • 24.06: 2923 ms
  • this pr: 883 ms

contains:

  • CPU: 435027 ms
  • 24.06: 314065 ms
  • this pr: 2071 ms

(Maybe the patterns in contains test is not very general for regex engine, that's why the speedup is very obvious. I can run more tests)

gerashegalov
gerashegalov previously approved these changes May 9, 2024
Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

build

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

build

@thirtiseven thirtiseven self-assigned this May 13, 2024
revans2
revans2 previously approved these changes May 14, 2024
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some nits.

val (transpiledAST, _) =
new CudfRegexTranspiler(RegexFindMode).getTranspiledAST(str.toString, None, None)
originalPattern = str.toString
val (transpiledAST, _) = new CudfRegexTranspiler(RegexFindMode)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could we have a follow on issue to figure out how to parse the regexp once, instead of multiple times?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #10817, will do it in my next regex rewrite pr.

Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending Bobby's comments

gerashegalov
gerashegalov previously approved these changes May 14, 2024
Copy link
Collaborator

@NVnavkumar NVnavkumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments. I think the test fix is required, but I would like others to comment on whether to enable the optimization even when the regexp is disabled on GPU.

case _ => throw new IllegalStateException("Unexpected optimization type")
}
}

override def tagExprForGpu(): Unit = {
GpuRegExpUtils.tagForRegExpEnabled(this)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method can actually disable regexp on the GPU. This means that these optimizations will never kick in when regexp is disabled. I don't know if that is actually desired. You can look at GpuSplit, where we implemented transpileToSplittableString and that codepath is not affected by the regexp enable flag.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'm fine with either way.

I think regex rewrite is more like an internal optimization in regex engine from user's perspective, users are still writing regex in rlike and won't be aware regex rewrite is happening, while in split case user would be aware that they are writing literal string as split delimiter.

Also, if there is something wrong when using the regex, it could also be a bug in the regex rewrite logic, and disabling regex config won't help it fallback correctly, especially when spark.rapids.sql.rLikeRegexRewrite.enabled is now an internal config.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can keep it as is. From an end user standpoint, if I say that I want to disable regex, and we go ahead and rewrite the query to do something different it might be kind of confusing. But I think that is minor. The main thing I am worried about is if there are situations where we could convert a regular expression into a custom kernel, but the transpiler cannot support it. We are now stuck. That appears to be simple enough to do, and we can do it when we see a need for it.

@@ -444,6 +444,28 @@ def test_regexp_like():
'regexp_like(a, "a[bc]d")'),
conf=_regexp_conf)

@pytest.mark.skipif(is_before_spark_320(), reason='regexp_like is synonym for RLike starting in Spark 3.2.0')
def test_regexp_rlike_rewrite_optimization():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we rewrite this test using RLIKE so it runs on all Spark versions?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, updated.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

build

@thirtiseven thirtiseven merged commit 8431c64 into NVIDIA:branch-24.06 May 15, 2024
43 of 44 checks passed
@thirtiseven thirtiseven deleted the regexpr_trick branch May 15, 2024 23:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance A performance related task/issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rewrite some rlike expression to StartsWith/Contains
6 participants