[preview] String parsing improvements #314

Draft: wants to merge 19 commits into master

Conversation

@jrudolph (Member) commented Jul 29, 2019

String parsing dominates parsing times so far for real-world input (that isn't particularly number-heavy). So, here's a first attempt at improving performance.

The speed-ups mostly come from

  • no explicit bounds checking
  • removing tricks
  • some code reorganization

Before:

GithubIssuesJsonAST.readCirce                github-akka-issues.json 2465.716 ± 191.574  ops/s
GithubIssuesJsonAST.readCirce                  big-string-array.json 3375.916 ± 317.283  ops/s
GithubIssuesJsonAST.readPlayJson             github-akka-issues.json 2002.679 ± 159.931  ops/s
GithubIssuesJsonAST.readPlayJson               big-string-array.json 2790.919 ±  59.378  ops/s

GithubIssuesJsonAST.readSprayJsonFromBytes   github-akka-issues.json 1048.052 ±  12.170  ops/s
GithubIssuesJsonAST.readSprayJsonFromBytes     big-string-array.json 1353.650 ±  27.423  ops/s
GithubIssuesJsonAST.readSprayJsonFromString  github-akka-issues.json 1202.670 ±  28.346  ops/s
GithubIssuesJsonAST.readSprayJsonFromString    big-string-array.json 1507.546 ±  19.228  ops/s

GithubIssuesCaseClassReading.readCircePartial           2074.334 ± 463.204  ops/s
GithubIssuesCaseClassReading.readJacksonPartial         3393.928 ± 166.386  ops/s
GithubIssuesCaseClassReading.readPlayJsonPartial        1883.332 ± 182.801  ops/s

GithubIssuesCaseClassReading.readSprayJsonViaASTPartial 1141.810 ± 3.432  ops/s

After:

GithubIssuesJsonAST.readSprayJsonFromBytes   github-akka-issues.json 2321.173 ±  74.728  ops/s
GithubIssuesJsonAST.readSprayJsonFromBytes     big-string-array.json 4250.720 ± 187.668  ops/s
GithubIssuesJsonAST.readSprayJsonFromString  github-akka-issues.json 2181.839 ±  60.334  ops/s
GithubIssuesJsonAST.readSprayJsonFromString    big-string-array.json 3447.510 ±  27.173  ops/s

GithubIssuesCaseClassReading.readSprayJsonViaASTPartial  2142.250 ± 99.059  ops/s

I.e. about 2x-3x faster for byte input. So, string parsing has been improved by ~3x, which leads to a 2x improvement for the test case (an 88k JSON document from the GitHub API).

As ParserInput needs to be changed, the change won't be binary compatible. I'll probably deprecate the current parser infrastructure and add simpler entry points alongside it that don't leak as many implementation details for common cases.

import io.circe.generic.semiauto._
import io.circe.parser._
implicit def circeUserFormat: Decoder[User] = deriveDecoder
implicit def circeLabelFormat: Decoder[LabelsArrayElement] = deriveDecoder


@jrudolph please consider caching the derived decoders using val for better performance
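For illustration, a minimal sketch of this suggestion, assuming placeholder case classes (the real ones live in the benchmark code): binding the derived decoders to implicit val makes circe build each decoder once and reuse it, instead of re-creating it on every implicit lookup as the implicit def version above does.

```scala
import io.circe.Decoder
import io.circe.generic.semiauto._

object CachedDecoders {
  // Placeholder shapes for illustration; the real case classes live in the benchmark.
  case class User(login: String, id: Long)
  case class LabelsArrayElement(name: String)

  // `implicit val` derives each decoder once and caches the instance.
  implicit val circeUserFormat: Decoder[User] = deriveDecoder
  implicit val circeLabelFormat: Decoder[LabelsArrayElement] = deriveDecoder
}
```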

input.setCursor(cursor) // now at end '"'
advance()
new String(charBuffer, 0, charsRead)
} else if (((b - 32) ^ 60) <= 0) // trick from jsoniter-scala


It is a check for the escape character and for characters that are not allowed to be represented as-is without encoding: b < ' ' || b == '\\'
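For reference, a small sanity check (not part of the PR) spelling out what the branchless comparison covers for ASCII input:

```scala
// For ASCII byte values, ((b - 32) ^ 60) <= 0 holds exactly when b is a control
// character (b < ' ') or the backslash escape character.
object JsoniterTrickCheck extends App {
  (0 to 127).foreach { b =>
    val trick         = ((b - 32) ^ 60) <= 0
    val explicitCheck = b < ' ' || b == '\\'
    assert(trick == explicitCheck, s"mismatch at byte value $b")
  }
  println("branchless check matches the explicit comparison for all ASCII bytes")
}
```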

@jrudolph (Member Author)

Yep, I know, pretty clever :) I printed out the complete mapping between inputs and outputs to see where it was going. It seems it didn't help much here in the end.

} else if (b == '\\') {
val skip = escaped(cursor, charsRead)
parseOneSlow(cursor + skip, charsRead + 1)
} else if (input.needsDecoding && b >= 128) {


missing check for control characters that must be escaped
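A hedged sketch of what such a check could look like; the helper name and the exception are assumptions, not the PR's actual error handling:

```scala
// RFC 8259 forbids unescaped control characters (0x00-0x1F) inside JSON strings,
// so a branch that lets other bytes through as-is should still reject them.
object ControlCharCheck {
  def requireNoControlChar(b: Int, cursor: Int): Unit =
    if (b >= 0 && b < 0x20)
      throw new IllegalArgumentException(
        f"unescaped control character 0x$b%02x in string at offset $cursor")
}
```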

@sirthias (Member)

Awesome stuff, @jrudolph!
These are great improvements!!
The old "tricks" really don't pull their weight, especially on the newer JVMs.

@sirthias (Member)

One thing that might be worthwhile exploring to speed up string parsing even more is unrolling the parseOneFast loop to handle 4 characters in a single loop iteration.
I could imagine that we get better instruction-level parallelism this way.

The beginning of this blog post here: https://chadaustin.me/2017/05/writing-a-really-really-fast-json-parser/ discusses this effect for JSON parsing.

In the end it's all about reducing the average parsing cycles per character, so any parallelization that we can get will help.
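For illustration, a rough sketch (with assumed names, not the PR's code) of what a 4-bytes-per-iteration unrolled fast path could look like:

```scala
// Copy four bytes per iteration so the independent loads/stores can overlap,
// and fall back to the byte-at-a-time loop near the end of the input or as soon
// as a quote, backslash, control character or non-ASCII byte shows up.
object UnrolledScanSketch {
  // Assumes `out` has at least as much room as `input`.
  @annotation.tailrec
  def copyFourAtATime(input: Array[Byte], out: Array[Char], cursor: Int, written: Int): Int = {
    def simple(b: Byte): Boolean = b >= ' ' && b != '"'.toByte && b != '\\'.toByte
    if (cursor + 4 <= input.length &&
        simple(input(cursor)) && simple(input(cursor + 1)) &&
        simple(input(cursor + 2)) && simple(input(cursor + 3))) {
      out(written)     = input(cursor).toChar
      out(written + 1) = input(cursor + 1).toChar
      out(written + 2) = input(cursor + 2).toChar
      out(written + 3) = input(cursor + 3).toChar
      copyFourAtATime(input, out, cursor + 4, written + 4)
    } else cursor // hand control back to the one-byte-at-a-time loop
  }
}
```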

Another thing could be something similar to this line of borer's JSON parser:
https://github.com/sirthias/borer/blob/master/core/src/main/scala/io/bullet/borer/json/JsonParser.scala#L417

Decoupling the character copying from the rest of the loop had an incredible effect on performance in borer's case. I suspect this is because the CPU can then freely interleave those instructions with everything else and make use of otherwise unused silicon capacity, so a lot of it essentially gets moved out of the critical path and costs "nothing" anymore...

@plokhotnyuk

@sirthias I have found that C2 can do unrolling for some simple loops; in my case, only GraalVM required manual unrolling.

@sirthias (Member)

Yeah, the JVM might be able to do it itself.
But this is hard to control, and small changes might break the JVM's ability to do it.
So manual unrolling for this very hot code path appears to be worth the effort.

@plokhotnyuk

Here are throughput scores of benchmarks which parse a string of 128 ASCII chars on the same Intel Core i7 CPU but on different versions of JDK and GraalVM:
[image: throughput chart across JDK and GraalVM versions]

It is much harder to keep them high on different hardware.

@jrudolph (Member Author) commented Jul 30, 2019

One thing that might be worthwhile exploring to speed up string parsing even more is unrolling the parseOneFast loop to handle 4 characters in a single loop iteration.
I could imagine that we get better instruction-level parallelism this way.

Tried that in different variations but no positive results so far.

Decoupling the character copying from the rest of the loop had an incredible effect on performance in borer's case.

Tried that as well in different versions, but I also didn't find any clear wins. What did help was moving the copying of the single character out of the branches.

I also experimented with table lookups, which didn't help either :)

Regarding string parsing benchmarks in particular, I found that the biggest factor that will bias the result is the size of the internal buffer for collecting the characters. If every parser run reallocates that buffer, you only get maximum performance in the benchmark if the buffer is already sized just big enough for the resulting string. You can probably improve a lot in these benchmarks when you bring your own buffer.

There's also the problem that the JIT often doesn't end up in the same state; I've often seen +/- 20% between different runs, with 2 or 3 different steady states across runs.

Here are the benchmark results from jsoniter-scala when parsing 1000 bytes with a 1024 char buffer:

# VM version: JDK 1.8.0_212, OpenJDK 64-Bit Server VM, 25.212-b04
# VM invoker: /usr/lib/jvm/adoptopenjdk-8-jdk-hotspot/jre/bin/java
# VM options: -server -Xms2g -Xmx2g -XX:NewSize=1g -XX:MaxNewSize=1g -XX:InitialCodeCacheSize=512m -XX:ReservedCodeCacheSize=512m -XX:+UseParallelGC -XX:-UseBiasedLocking -XX:+AlwaysPreTouch

Benchmark                                      (size)      Score       Error  Units
StringOfAsciiCharsReading.avSystemGenCodec       1000 420377.975 ±   3314.706  ops/s
StringOfAsciiCharsReading.borerJson              1000 374747.750 ±   5469.807  ops/s
StringOfAsciiCharsReading.circe                  1000 296227.975 ±   1608.166  ops/s
StringOfAsciiCharsReading.dslJsonScala           1000 747348.589 ±  14188.010  ops/s
StringOfAsciiCharsReading.jacksonScala           1000 668431.496 ±  10057.838  ops/s
StringOfAsciiCharsReading.jsoniterJava           1000 663563.884 ±   4021.098  ops/s
StringOfAsciiCharsReading.jsoniterScala          1000 799233.199 ±   3866.261  ops/s
StringOfAsciiCharsReading.playJson               1000 697148.819 ±  13384.538  ops/s
StringOfAsciiCharsReading.scalikeJackson         1000 383223.384 ±   1653.179  ops/s
StringOfAsciiCharsReading.sprayJson              1000 516299.306 ±  12754.457  ops/s
StringOfAsciiCharsReading.uPickle                1000 396939.986 ±   8720.495  ops/s

StringOfEscapedCharsReading.avSystemGenCodec     1000  40351.801 ±  2746.862  ops/s
StringOfEscapedCharsReading.borerJson            1000  56583.767 ±  3943.939  ops/s
StringOfEscapedCharsReading.circe                1000  33025.217 ±  1741.437  ops/s
StringOfEscapedCharsReading.dslJsonScala         1000  71626.399 ±  2489.666  ops/s
StringOfEscapedCharsReading.jacksonScala         1000  60839.264 ±   687.483  ops/s
StringOfEscapedCharsReading.jsoniterJava         1000  99722.853 ± 14966.104  ops/s
StringOfEscapedCharsReading.jsoniterScala        1000 185767.529 ±  4066.498  ops/s
StringOfEscapedCharsReading.playJson             1000  59031.123 ±  6318.695  ops/s
StringOfEscapedCharsReading.scalikeJackson       1000  49066.981 ±   810.482  ops/s
StringOfEscapedCharsReading.sprayJson            1000  81830.164 ±  3713.948  ops/s
StringOfEscapedCharsReading.uPickle              1000  34547.016 ±   730.916  ops/s

StringOfNonAsciiCharsReading.avSystemGenCodec    1000 126835.981 ±  5900.099  ops/s
StringOfNonAsciiCharsReading.borerJson           1000  81477.660 ±  2066.403  ops/s
StringOfNonAsciiCharsReading.circe               1000 103151.358 ±  3512.299  ops/s
StringOfNonAsciiCharsReading.dslJsonScala        1000 189423.431 ±  1927.495  ops/s
StringOfNonAsciiCharsReading.jacksonScala        1000 101410.377 ± 19933.428  ops/s
StringOfNonAsciiCharsReading.jsoniterJava        1000 267360.452 ±  5517.855  ops/s
StringOfNonAsciiCharsReading.jsoniterScala       1000 193001.124 ±  2849.416  ops/s
StringOfNonAsciiCharsReading.playJson            1000 116526.373 ±  4094.327  ops/s
StringOfNonAsciiCharsReading.scalikeJackson      1000 121490.223 ±  6358.179  ops/s
StringOfNonAsciiCharsReading.sprayJson           1000 215865.902 ± 13995.871  ops/s
StringOfNonAsciiCharsReading.uPickle             1000 103967.342 ±  3001.080  ops/s

These are now very decent results for this PR, given that there are no tricks involved.

@jrudolph (Member Author)

It is much harder to keep them high on different hardware.

I think I'll keep spray-json reasonably simple, so I'd rather go with a somewhat simple solution even if that costs a bit of performance. That also has the benefit that it isn't optimized for any particular architecture and should still show reasonable (but maybe not top) performance.

@jrudolph (Member Author)

You can probably improve a lot in these benchmarks when you bring-your-own-buffer.

Caching the buffer in a thread local between parsing runs is another 10-20% win indeed.
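A minimal sketch of that idea, with hypothetical names rather than the PR's actual API: keep the char buffer in a ThreadLocal so consecutive parser runs on the same thread reuse it instead of reallocating it every time.

```scala
object CharBufferCache {
  private val cached = new ThreadLocal[Array[Char]] {
    override def initialValue(): Array[Char] = new Array[Char](1024)
  }

  // Returns a buffer of at least `minSize` chars, growing the cached one if needed.
  def acquire(minSize: Int): Array[Char] = {
    val buf = cached.get()
    if (buf.length >= minSize) buf
    else {
      val grown = new Array[Char](math.max(minSize, buf.length * 2))
      cached.set(grown)
      grown
    }
  }
}
```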

@sirthias (Member)

Great stuff!
Looking forward to seeing this released!

@sirthias (Member)

Caching the buffer in a thread local between parsing runs is another 10-20% win indeed.

Interesting. In my tests with borer, adding a ThreadLocal cache of the char buffer somehow resulted in a slight performance decrease, at least on Graal.
This perf work really is a complete trial-and-error process... unfortunately.

@jrudolph (Member Author)

This perf work really is a complete trial and error process... unfortunately.

It is. I'm also still looking for a more predictable workflow. I didn't test extensively on Graal; from what I've seen, GraalCE was slower than HotSpot, and JDK 11 was a bit faster than JDK 8.

((input.byteAt(cursor + 1) & 0x3f) << 6) |
(input.byteAt(cursor + 2) & 0x3f)

if (codePoint < 0xffff) {


The check is redundant; the 2nd branch is unreachable at runtime.

charBuffer(charsRead) = codePoint.toChar
4
} else {
charBuffer(charsRead) = ((0xd7C0 + (codePoint >> 10)).toChar)


I would add a check for broken surrogate pairs here; emoji are so popular these days 😄
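One possible reading of this suggestion, sketched with a hypothetical helper: validate the decoded code point before emitting the surrogate pair, so malformed input can't produce broken surrogates.

```scala
object SurrogateWrite {
  // After decoding a 4-byte UTF-8 sequence, verify the code point really is in
  // the supplementary range before writing the two UTF-16 chars.
  def writeSupplementary(charBuffer: Array[Char], charsRead: Int, codePoint: Int): Int = {
    if (codePoint < 0x10000 || codePoint > 0x10FFFF)
      throw new IllegalArgumentException(f"invalid supplementary code point 0x$codePoint%x")
    charBuffer(charsRead)     = (0xD7C0 + (codePoint >> 10)).toChar   // high surrogate
    charBuffer(charsRead + 1) = (0xDC00 + (codePoint & 0x3FF)).toChar // low surrogate
    2 // number of chars written
  }
}
```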

def unicodeChar(b: Char, cursor: Int, charsRead: Int): Int = {
if ((b & 0xE0) == 0xC0) { // two byte sequence
charBuffer(charsRead) = (((b & 0x1f) << 6) |
(input.byteAt(cursor + 1) & 0x3f)).toChar


Here and below, all subsequent bytes after the leading one should be checked to be well-formed (i.e. have 10 as the two highest bits).
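A small sketch of that check, with a hypothetical helper standing in for the real parser method:

```scala
object Utf8Checks {
  // Every continuation byte of a multi-byte UTF-8 sequence must match 10xxxxxx.
  def isContinuation(b: Byte): Boolean = (b & 0xC0) == 0x80

  // `byteAt` stands in for the ParserInput accessor used in the diff.
  def checkedTwoByteChar(byteAt: Int => Byte, cursor: Int, b: Byte): Char = {
    val b1 = byteAt(cursor + 1)
    if (!isContinuation(b1))
      throw new IllegalArgumentException(f"malformed UTF-8 continuation byte 0x$b1%02x")
    (((b & 0x1F) << 6) | (b1 & 0x3F)).toChar
  }
}
```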

/* Scan to the end of a simple string as fast as possible, no escapes, no unicode */
// so far unclear results if that is really helpful
/*@tailrec def parseOneFast(cursor: Int, charsRead: Int): String = {
val b = input.byteAt(cursor)
@jrudolph (Member Author)

I had another try and found out that parseOneFast was slower because I wasn't using charAt here directly, as below in parseOneSlow. With that change it now makes sense again to keep it, to improve parsing speed of simple strings (without escapes and unicode).

Comparing to jsoniter-scala, I found another weird thing: surrounding the whole parseOneFast with a useless (always true) if clause (if (i < minLim) in jsoniter, and here e.g. if (charsRead <= cursor)) is another 20-25% improvement.

Those kinds of things seem to be rather random findings that cannot really be explained in terms of the original code, but only by looking at all stages of the compilation/execution pipeline (Scala compiler -> JIT compiler -> µops). I tried Intel's VTune for the first time to get some insights, and while it looks promising on the surface, I didn't find any actionable explanations for the performance cliffs (and in any case those could be quite specific to my particular processor). E.g. for charAt vs. byteAt it showed bad core port usage ratios (>50% of cycles where no port made progress), which (AFAIU) may be caused by too many dependencies between instructions.

(So, kudos again to jsoniter-scala for managing to be fast in various environments, but as I said before that's not my goal here.)

Member

crazy stuff... and a huge time sink, unfortunately

@jrudolph added this to In progress in 1.5.0 on Aug 25, 2019
Labels: none yet
Projects: 1.5.0 (In progress)
Linked issues: none yet
3 participants