Rewrite some rlike expression to StartsWith/Contains #10715

thirtiseven · 2024-04-16T10:52:37Z

WIP

This PR rewrites RLike in some simple cases that can be replaced by GpuStartsWith / GpuEndsWith / GpuContains / GpuEqualTo.

Replacing RLike with GpuContains yields about a 10% end-to-end speedup in a customer query. Further performance testing is needed.
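To make the idea concrete, here is a minimal Python sketch of the kind of rewrite described above. It is not the plugin's code: `rewrite_rlike` and `evaluate` are hypothetical helpers, and the sketch handles only a plain literal with an optional leading `^`, one optional pair of wrapping parentheses, and optional `.*` / `(.*)` wildcards. Anything else returns `None`, meaning "keep the regex". Python's `re.search` stands in for RLike's find-style matching.

```python
import re

# Characters that make a pattern "not a plain literal" in this sketch.
META = set(".^$*+?()[]{}|\\")
WILDCARDS = ("(.*)", ".*")


def rewrite_rlike(pattern):
    """Return ("startswith" | "contains", literal), or None if no simple rewrite applies."""
    anchored = pattern.startswith("^")
    body = pattern[1:] if anchored else pattern
    # A trailing wildcard can match the empty string, so it never changes a find() result.
    for w in WILDCARDS:
        if body.endswith(w):
            body = body[: -len(w)]
            break
    # A leading wildcard is only safe to drop when the pattern is unanchored:
    # "^.*abc" cannot cross a newline, so it is NOT equivalent to contains.
    if not anchored:
        for w in WILDCARDS:
            if body.startswith(w):
                body = body[len(w):]
                break
    # Unwrap a single capture group around the literal, e.g. "(abcd)".
    if len(body) >= 2 and body[0] == "(" and body[-1] == ")":
        body = body[1:-1]
    if not body or any(ch in META for ch in body):
        return None
    return ("startswith" if anchored else "contains", body)


def evaluate(rewrite, s):
    """Apply the rewritten predicate to a string."""
    kind, literal = rewrite
    return s.startswith(literal) if kind == "startswith" else literal in s


for p in ["^abcd", "abcd(.*)", "(.*)(abcd)(.*)", "a[bc]d"]:
    print(p, "->", rewrite_rlike(p))
```

The sketch deliberately refuses anything it cannot prove equivalent, which mirrors the fallback behaviour one would want from such an optimization: a missed rewrite only costs performance, while a wrong rewrite changes query results.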

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

revans2

I am a little concerned that you are writing your own Regexp parsing code instead of reusing the existing code

https://github.com/NVIDIA/spark-rapids/blob/branch-24.06/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala

Can we please go off of the existing RegexParser instead of trying to write something new from scratch?

thirtiseven · 2024-05-07T10:34:35Z

> I am a little concerned that you are writing your own Regexp parsing code instead of reusing the existing code
>
> https://github.com/NVIDIA/spark-rapids/blob/branch-24.06/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala
>
> Can we please go off of the existing RegexpParser instead of trying to write something new from scratch.

I updated the code to use RegexParser, please take another look. It saves me from writing a regex parser from scratch, but it makes the matching logic a bit more complicated. Overall I think reusing it is better than having two parsers.

I will add more tests, such as a unit test, to verify that the rewrite path is actually taken.

revans2

Generally it looks good to me

revans2 · 2024-05-07T13:22:47Z

integration_tests/src/main/python/regexp_test.py

+    assert_gpu_and_cpu_are_equal_collect(
+            lambda spark: unary_op_df(spark, gen).selectExpr(
+                'a',
+                'regexp_like(a, "(abcd)(.*)")',


What about \A and \Z? Is that something that we can support with this?

In Java, \Z means "the end of the input but for the final terminator, if any", so it is not the same as endsWith. I will support \A.

revans2 · 2024-05-07T13:23:50Z

integration_tests/src/main/python/regexp_test.py

@@ -444,6 +444,28 @@ def test_regexp_like():
                'regexp_like(a, "a[bc]d")'),
        conf=_regexp_conf)

+@pytest.mark.skipif(is_before_spark_320(), reason='regexp_like is synonym for RLike starting in Spark 3.2.0')
+def test_regexp_rlike_rewrite_optimization():
+    gen = mk_str_gen('[abcd]{3,6}')


Can we add in some new line characters to the generated strings? ^ and $ in some cases can match just begin and end of line, not begin and end of string.

Nice catch! The test failed in the $ case; I didn't know that $ means end of line in Java regex.

Sadly, it means we cannot support the endsWith pattern at all, because we don't support \w yet, so it will fall back first. (Technically we could by checking this case when tagging, but I don't think we need to do that now.) I will remove the endsWith part.

I'm surprised that ^ passed this test with new line characters because ^ means "The beginning of a line". Will do some investigation.
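The anchor behaviour discussed here can be reproduced with a small Python sketch. Python's `re` module is used only as a stand-in: without the multiline flag, its `$` also matches just before a trailing newline while its `^` matches only at the very start of the string, which is analogous to Java's default mode (Java recognizes more line terminators than Python does). The helper names are hypothetical and this is not the test from the PR.

```python
import re


def ends_with_via_regex(s, literal):
    """Find-style match of 'literal$' (no MULTILINE flag)."""
    return re.search(re.escape(literal) + "$", s) is not None


def starts_with_via_regex(s, literal):
    """Find-style match of '^literal' (no MULTILINE flag)."""
    return re.search("^" + re.escape(literal), s) is not None


for s in ["abcd", "abcd\n", "x\nabcd", "abcd\nx"]:
    print(
        repr(s),
        "regex $:", ends_with_via_regex(s, "abcd"),
        "endswith:", s.endswith("abcd"),
        "regex ^:", starts_with_via_regex(s, "abcd"),
        "startswith:", s.startswith("abcd"),
    )
```

In this sketch the only disagreement is `"abcd\n"`: the regex with `$` matches, but `endswith` does not. The `^` form agrees with `startswith` on every sample, including the ones with embedded newlines, which is consistent with the startsWith rewrite surviving the newline test while the endsWith rewrite did not.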

revans2 · 2024-05-07T13:33:36Z

integration_tests/src/main/python/regexp_test.py

+                'a',
+                'regexp_like(a, "(abcd)(.*)")',
+                'regexp_like(a, "abcd(.*)")',
+                'regexp_like(a, "(.*)(abcd)(.*)")',


I'm not sure how likely it is to have abcd show up in the generated data for any of these queries.

Take a starts-with match on abcd. We have a 25% chance of picking an a as the first char, a 25% chance of picking a b as the second, and so on. That means that with a generator pattern of [abcd]{4} we would likely have only 8 matching rows in the entire 2048-row data set, but we have {3,6}, which makes it likely that we would have no rows in the data set that match.
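The estimate of 8 rows can be checked with a short calculation. This sketch assumes each character is drawn independently and uniformly from the alphabet and that every string is at least as long as the literal; `expected_prefix_matches` is a hypothetical helper, not part of the test framework.

```python
from fractions import Fraction


def expected_prefix_matches(rows, alphabet_size, literal_len):
    """Expected number of rows whose first literal_len chars equal a fixed literal."""
    # Each position matches with probability 1/alphabet_size, independently.
    return rows * Fraction(1, alphabet_size) ** literal_len


# 2048 rows, alphabet {a, b, c, d}, literal "abcd"
print(expected_prefix_matches(2048, 4, 4))  # 8
```

Strings shorter than the literal can never match, so under the same assumptions a generator that also produces 3-character strings can only lower this expected count.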

thirtiseven · 2024-05-09T09:30:04Z

build

thirtiseven · 2024-05-09T09:39:22Z

Did a simple performance test:

data: 1000000 random strings, each string has 0-2000 characters, 30% strings start with "abcde"*20, 30% strings contain "abcde"*20, 40% random strings.
startsWith query: 100 rlike queries from '^a' to '^abcde*20'
contains query: 100 rlike queries from '^(.*)a' to '^(.*)abcde*20'

startsWith:

CPU: 24620 ms
24.06: 2923 ms
this pr: 883 ms

contains:

CPU: 435027 ms
24.06: 314065 ms
this pr: 2071 ms

(The patterns in the contains test may not be very representative for the regex engine, which could be why the speedup is so large. I can run more tests.)
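For readers who want the ratios rather than the raw timings, the numbers above can be turned into speedup factors with a few lines of Python. The timings are copied from this comment; the `speedups` helper is only illustrative.

```python
# Timings in milliseconds, as reported in the comment above.
timings_ms = {
    "startsWith": {"CPU": 24620, "24.06": 2923, "this pr": 883},
    "contains": {"CPU": 435027, "24.06": 314065, "this pr": 2071},
}


def speedups(results, target="this pr"):
    """Ratio of each baseline's time to the target's time (higher means target is faster)."""
    return {base: ms / results[target] for base, ms in results.items() if base != target}


for query, results in timings_ms.items():
    for base, ratio in speedups(results).items():
        print(f"{query}: {ratio:.1f}x faster than {base}")
```

Relative to 24.06, this works out to roughly 3.3x for the startsWith queries and roughly 150x for the contains queries, subject to the caveat above about how representative the contains patterns are.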

gerashegalov

LGTM

thirtiseven · 2024-05-10T01:51:34Z

build

thirtiseven · 2024-05-10T03:03:40Z

build

revans2

Just some nits.

revans2 · 2024-05-14T14:50:29Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala

-            val (transpiledAST, _) =
-                new CudfRegexTranspiler(RegexFindMode).getTranspiledAST(str.toString, None, None)
+            originalPattern = str.toString
+            val (transpiledAST, _) = new CudfRegexTranspiler(RegexFindMode)


nit: Could we have a follow on issue to figure out how to parse the regexp once, instead of multiple times?

Filed #10817, will do it in my next regex rewrite pr.

gerashegalov

LGTM, pending Bobby's comments

NVnavkumar

Left some comments. I think the test fix is required, but I would like others to comment on whether to enable the optimization even when the regexp is disabled on GPU.

NVnavkumar · 2024-05-14T18:54:32Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala

+        case _ => throw new IllegalStateException("Unexpected optimization type")
+      }
+    }
+
    override def tagExprForGpu(): Unit = {
      GpuRegExpUtils.tagForRegExpEnabled(this)


This method can actually disable regexp on the GPU. This means that these optimizations will never kick in when regexp is disabled. I don't know if that is actually desired. You can look at GpuSplit, where we implemented transpileToSplittableString and that codepath is not affected by the regexp enable flag.

Good point. I'm fine with either way.

From the user's perspective, I think the regex rewrite is more like an internal optimization in the regex engine. Users are still writing a regex in rlike and won't be aware that a rewrite is happening, whereas in the split case the user knows they are writing a literal string as the split delimiter.

Also, if something goes wrong when using the regex, it could be a bug in the regex rewrite logic, and disabling the regex config would not make it fall back correctly, especially now that spark.rapids.sql.rLikeRegexRewrite.enabled is an internal config.

I think we can keep it as is. From an end user standpoint, if I say that I want to disable regex, and we go ahead and rewrite the query to do something different it might be kind of confusing. But I think that is minor. The main thing I am worried about is if there are situations where we could convert a regular expression into a custom kernel, but the transpiler cannot support it. We are now stuck. That appears to be simple enough to do, and we can do it when we see a need for it.

NVnavkumar · 2024-05-14T18:55:41Z

integration_tests/src/main/python/regexp_test.py

@@ -444,6 +444,28 @@ def test_regexp_like():
                'regexp_like(a, "a[bc]d")'),
        conf=_regexp_conf)

+@pytest.mark.skipif(is_before_spark_320(), reason='regexp_like is synonym for RLike starting in Spark 3.2.0')
+def test_regexp_rlike_rewrite_optimization():


Why don't we rewrite this test using RLIKE so it runs on all Spark versions?

Good idea, updated.

thirtiseven · 2024-05-15T05:28:56Z

build

thirtiseven added 6 commits April 7, 2024 19:42

A hacky approach for regexpr rewrite

24988bf

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

Use contains instead for that case

23b8dbf

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

add config to switch

5682864

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

Merge branch 'branch-24.06' into regexpr_trick

482248c

Rewrite some rlike expression to StartsWith/EndsWith/Contains

552cf7e

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

clean up

8b88378

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

revans2 reviewed Apr 16, 2024

sameerz added the performance label Apr 17, 2024

GregoryKimball mentioned this pull request Apr 29, 2024

[FEA] Improve performance of strings matching in libcudf rapidsai/cudf#15611

Open

thirtiseven added 2 commits May 7, 2024 18:21

Draft code to adapt RegexParser in regex rewrite

1f4d1a4

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

clean up

21af975

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

revans2 reviewed May 7, 2024

gerashegalov reviewed May 8, 2024

Apply suggestions from code review

5b33efe

Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>

gerashegalov reviewed May 8, 2024

thirtiseven added 2 commits May 8, 2024 17:01

A checkpoint before removing endsWith rewrite

14617fa

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

Remove equalsTo and endsWith

8404fa6

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven changed the title ~~Rewrite some rlike expression to StartsWith/EndsWith/Contains~~ Rewrite some rlike expression to StartsWith/Contains May 8, 2024

clean up

ffea38c

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven marked this pull request as ready for review May 8, 2024 10:53

gerashegalov reviewed May 8, 2024

address a comment

853c12d

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

HaoYang670 reviewed May 9, 2024

address a comment

d9acb7d

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

gerashegalov previously approved these changes May 9, 2024

gerashegalov reviewed May 10, 2024

address comments

8ef1d42

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven dismissed gerashegalov’s stale review via 8ef1d42 May 10, 2024 01:51

fix 2.13 build

59bca82

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven self-assigned this May 13, 2024

thirtiseven requested review from gerashegalov and revans2 May 14, 2024 08:28

revans2 previously approved these changes May 14, 2024

gerashegalov reviewed May 14, 2024

gerashegalov previously approved these changes May 14, 2024

NVnavkumar reviewed May 14, 2024

Address comments

2b4545f

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven dismissed stale reviews from gerashegalov and revans2 via 2b4545f May 15, 2024 03:45

thirtiseven mentioned this pull request May 15, 2024

[FOLLOW ON] Combining regex parsing in transpiling and regex rewrite in rlike #10817

Closed

thirtiseven mentioned this pull request May 15, 2024

Rewrite regex pattern literal[a-b]{x} to custom kernel in rlike #10822

Merged

NVnavkumar approved these changes May 15, 2024

revans2 approved these changes May 15, 2024

thirtiseven merged commit 8431c64 into NVIDIA:branch-24.06 May 15, 2024
43 of 44 checks passed

thirtiseven deleted the regexpr_trick branch May 15, 2024 23:16

gerashegalov mentioned this pull request May 30, 2024

[BUG] 24.06 test_conditional_with_side_effects_case_when test failed on Scala 2.13 with DATAGEN_SEED=1716656294 #10928

Open
