[O] Optimize CharSequence.lines() #5278

hykilpikonna · 2024-03-20T04:28:37Z

Splitting lines by doing two string replacements and then splitting by one character is way faster than splitting by three substrings.

https://youtrack.jetbrains.com/issue/KT-66715/Performance-faster-alternative-to-String.lines

Profiling of the two methods:

Detailed time usage:

In-process time output:

fzhinkin · 2024-03-20T13:23:29Z

@hykilpikonna nice catch! CharSequence.lines() performance indeed could be improved.

fzhinkin · 2024-03-20T13:19:21Z

libraries/stdlib/src/kotlin/text/Strings.kt

@@ -1403,14 +1403,14 @@ public inline fun CharSequence.splitToSequence(regex: Regex, limit: Int = 0): Se
 *
 * The lines returned do not include terminating line separators.
 */
-public fun CharSequence.lineSequence(): Sequence<String> = splitToSequence("\r\n", "\n", "\r")
+public fun CharSequence.lineSequence(): Sequence<String> = toString().replace("\r\n", "\n").replace('\r', '\n').splitToSequence('\n')


CharSequence interface neither require implementing toString, nor constraint developers to do that in a particular way:

kotlin/core/builtins/native/kotlin/CharSequence.kt

Line 22 in 0938b46

public interface CharSequence {

In standard library, there's only one class implementing that interface, StringBuidler, and it's toString implementation returns underlying string, but in general, we can't rely on that.

A random example of how CharSequence's implementation may override toString: https://github.com/h2oai/h2o-3/blob/776b53716559e065ccf2af08d1cbb5b8d2c883e7/h2o-algos/src/main/java/hex/grep/Grep.java#L80

there's only one class

Besides String, of course

Although, it seems to be a problem with how CharSequence's contract is defined in Kotlin (there's no explicit common declaration ), and Java's CharSequence specify toString's behavior.

Interesting... the reason why I used toString is because there doesn't seem to exist a function CharSequence.replace(String, String) or even CharSequence.replace(Char, Char)

The only replace function I can find under CharSequence uses Regex replace, but it performs worse than a JVM implementation of replace.

However, if I copy the String.replace() algorithm in the following JVM implementation and set the signature to take in a CharSequence instead of String, the replacement works and performs well as intended.

kotlin/libraries/stdlib/jvm/src/kotlin/text/StringsJVM.kt

Line 86 in 85efbe5

public actual fun String.replace(oldValue: String, newValue: String, ignoreCase: Boolean = false): String {

We might have two options for optimizing this:

We can create a new String.lines() implementation separate from CharSequence.lines().

We can either create a new CharSequence.replace(String, String) interface or create a helper function for replace, and use this replacement function for CharSequence.lines().

On JVM, both approaches should perform equally well, but this might not be the case on other platforms. So I've tested the performance of the second approach on different platforms below.

On Kotlin Native, compiling the JVM CharSequence.replace into native is slightly slower than using the native String.replace (but still better than splitting by three strings):

(test sample size reduced by 10x because Kotlin Native is very slow and memory-intensive in processing strings)

On Kotlin nodejs, compiling the JVM CharSequence.replace function into js is slightly faster than using the js String.replace:

Please let me know what you think is the best approach.

One thing to keep in mind on the lineSequence implementation is that it performs lazy splitting as expected for sequence, whereas using toString().replace(...).splitToSequence(...) would perform lazy splitting, but eager replaces on a whole string.
This may be wasteful when only part of the resulting sequence is used afterward.
Example benchmark:

private val string = (0..1000).joinToString("\n") { "line_$it" } private val hugeString = (0..100000000).joinToString("\n") { "line_$it" } @State(Scope.Benchmark) class StringLinesBenchmark { @Benchmark fun stdLibLineSequenceNormalString() = string.lineSequence().take(100).toList() @Benchmark fun newLineSequenceNormalString() = string.newLineSequence().take(100).toList() @Benchmark fun stdLibLineSequenceHugeString() = hugeString.lineSequence().take(100).toList() @Benchmark fun newLineSequenceHugeString() = hugeString.newLineSequence().take(100).toList() }

benchmarks summary: Benchmark Mode Cnt Score Error Units StringLinesBenchmark.newLineSequenceHugeString avgt 5 866.933 ± 20.053 ms/op StringLinesBenchmark.newLineSequenceNormalString avgt 5 0.005 ± 0.001 ms/op StringLinesBenchmark.stdLibLineSequenceHugeString avgt 5 0.015 ± 0.001 ms/op StringLinesBenchmark.stdLibLineSequenceNormalString avgt 5 0.014 ± 0.001 ms/op

In this case, stdlib implementation performance doesn't really rely on input string lines count as long as only part of lines sequence result is used later.
But it is worth noting that for a string with 1000 lines, where only 10% of them were later used for processing, new implementation was much faster.

@hykilpikonna, as @Shykial mentioned, lineSequence will no longer be lazy after using toString and replace. It's also worth mentioning that multiple string transformations will increase object allocation rate which in some scenarios may not be tolerable.

However, it seems like performance could be improved without hurting the laziness (but it needs to be verified) by switching from rangesDelimitedBy to DelimitedRangesSequence with a custom getNextMatch callback. Currently, we use CharSequence.findAnyOf to find where a line splits. Instead, we can scan the char sequence seeking for either \n, or \r (optionally followed by the \n).

The main caveat here is that there are not so many tests dedicated to line splitting, so before changing anything the test coverage needs to be improved.

To further experiment with performance improvements, I would recommend writing benchmarks like the one in the message above. It would be nice to cover various scenarios (different receiver types, strings using different delimiters and having different structure, different operations on a resulting sequence, like first, last, count, etc) to better understand trade offs. To ease benchmarks configuration, you can clone https://github.com/fzhinkin/benchmarks-template, it's KMP project using kotlinx-benchmark with various benchmarking targets being already set up.

Gamandoor · 2024-03-20T15:06:02Z

https://www.squarespace.org/public-comments/hajizaffar@gamandor.com

fzhinkin · 2024-05-07T14:01:19Z

Hey! @hykilpikonna, are you considering proceeding with this PR? Please let me know if additional any help is needed here.

hykilpikonna added 2 commits March 19, 2024 21:21

[O] Optimize lines()

29c6445

[O] More optimization for CharSequence.lines()

2263dd3

ivakub added the Standard Library label Mar 20, 2024

fzhinkin self-assigned this Mar 20, 2024

fzhinkin reviewed Mar 20, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[O] Optimize CharSequence.lines() #5278

[O] Optimize CharSequence.lines() #5278

hykilpikonna commented Mar 20, 2024

fzhinkin commented Mar 20, 2024

fzhinkin Mar 20, 2024

fzhinkin Mar 20, 2024

fzhinkin Mar 20, 2024

hykilpikonna Mar 20, 2024 •

edited

hykilpikonna Mar 20, 2024

Shykial Mar 21, 2024

fzhinkin Mar 22, 2024

Gamandoor commented Mar 20, 2024

fzhinkin commented May 7, 2024

[O] Optimize CharSequence.lines() #5278

Are you sure you want to change the base?

[O] Optimize CharSequence.lines() #5278

Conversation

hykilpikonna commented Mar 20, 2024

fzhinkin commented Mar 20, 2024

fzhinkin Mar 20, 2024

Choose a reason for hiding this comment

fzhinkin Mar 20, 2024

Choose a reason for hiding this comment

fzhinkin Mar 20, 2024

Choose a reason for hiding this comment

hykilpikonna Mar 20, 2024 • edited

Choose a reason for hiding this comment

hykilpikonna Mar 20, 2024

Choose a reason for hiding this comment

Shykial Mar 21, 2024

Choose a reason for hiding this comment

fzhinkin Mar 22, 2024

Choose a reason for hiding this comment

Gamandoor commented Mar 20, 2024

fzhinkin commented May 7, 2024

hykilpikonna Mar 20, 2024 •

edited