Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[O] Optimize CharSequence.lines() #5278

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

hykilpikonna
Copy link

Splitting lines by doing two string replacements and then splitting by one character is way faster than splitting by three substrings.

https://youtrack.jetbrains.com/issue/KT-66715/Performance-faster-alternative-to-String.lines

Profiling of the two methods:

image

Detailed time usage:

image

In-process time output:

image

@fzhinkin fzhinkin self-assigned this Mar 20, 2024
@fzhinkin
Copy link
Contributor

@hykilpikonna nice catch! CharSequence.lines() performance indeed could be improved.

@@ -1403,14 +1403,14 @@ public inline fun CharSequence.splitToSequence(regex: Regex, limit: Int = 0): Se
*
* The lines returned do not include terminating line separators.
*/
public fun CharSequence.lineSequence(): Sequence<String> = splitToSequence("\r\n", "\n", "\r")
public fun CharSequence.lineSequence(): Sequence<String> = toString().replace("\r\n", "\n").replace('\r', '\n').splitToSequence('\n')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CharSequence interface neither require implementing toString, nor constraint developers to do that in a particular way:

public interface CharSequence {

In standard library, there's only one class implementing that interface, StringBuidler, and it's toString implementation returns underlying string, but in general, we can't rely on that.

A random example of how CharSequence's implementation may override toString: https://github.com/h2oai/h2o-3/blob/776b53716559e065ccf2af08d1cbb5b8d2c883e7/h2o-algos/src/main/java/hex/grep/Grep.java#L80

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's only one class

Besides String, of course

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although, it seems to be a problem with how CharSequence's contract is defined in Kotlin (there's no explicit common declaration ), and Java's CharSequence specify toString's behavior.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting... the reason why I used toString is because there doesn't seem to exist a function CharSequence.replace(String, String) or even CharSequence.replace(Char, Char)

The only replace function I can find under CharSequence uses Regex replace, but it performs worse than a JVM implementation of replace.

image

However, if I copy the String.replace() algorithm in the following JVM implementation and set the signature to take in a CharSequence instead of String, the replacement works and performs well as intended.

public actual fun String.replace(oldValue: String, newValue: String, ignoreCase: Boolean = false): String {

image

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might have two options for optimizing this:

  1. We can create a new String.lines() implementation separate from CharSequence.lines().
  2. We can either create a new CharSequence.replace(String, String) interface or create a helper function for replace, and use this replacement function for CharSequence.lines().

On JVM, both approaches should perform equally well, but this might not be the case on other platforms. So I've tested the performance of the second approach on different platforms below.

On Kotlin Native, compiling the JVM CharSequence.replace into native is slightly slower than using the native String.replace (but still better than splitting by three strings):

(test sample size reduced by 10x because Kotlin Native is very slow and memory-intensive in processing strings)

image

On Kotlin nodejs, compiling the JVM CharSequence.replace function into js is slightly faster than using the js String.replace:

image

Please let me know what you think is the best approach.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to keep in mind on the lineSequence implementation is that it performs lazy splitting as expected for sequence, whereas using toString().replace(...).splitToSequence(...) would perform lazy splitting, but eager replaces on a whole string.
This may be wasteful when only part of the resulting sequence is used afterward.
Example benchmark:

private val string = (0..1000).joinToString("\n") { "line_$it" }
private val hugeString = (0..100000000).joinToString("\n") { "line_$it" }

@State(Scope.Benchmark)
class StringLinesBenchmark {

    @Benchmark
    fun stdLibLineSequenceNormalString() = string.lineSequence().take(100).toList()

    @Benchmark
    fun newLineSequenceNormalString() = string.newLineSequence().take(100).toList()

    @Benchmark
    fun stdLibLineSequenceHugeString() = hugeString.lineSequence().take(100).toList()

    @Benchmark
    fun newLineSequenceHugeString() = hugeString.newLineSequence().take(100).toList()
}
benchmarks summary:
Benchmark                                            Mode  Cnt    Score    Error  Units
StringLinesBenchmark.newLineSequenceHugeString       avgt    5  866.933 ± 20.053  ms/op
StringLinesBenchmark.newLineSequenceNormalString     avgt    5    0.005 ±  0.001  ms/op
StringLinesBenchmark.stdLibLineSequenceHugeString    avgt    5    0.015 ±  0.001  ms/op
StringLinesBenchmark.stdLibLineSequenceNormalString  avgt    5    0.014 ±  0.001  ms/op

In this case, stdlib implementation performance doesn't really rely on input string lines count as long as only part of lines sequence result is used later.
But it is worth noting that for a string with 1000 lines, where only 10% of them were later used for processing, new implementation was much faster.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hykilpikonna, as @Shykial mentioned, lineSequence will no longer be lazy after using toString and replace. It's also worth mentioning that multiple string transformations will increase object allocation rate which in some scenarios may not be tolerable.

However, it seems like performance could be improved without hurting the laziness (but it needs to be verified) by switching from rangesDelimitedBy to DelimitedRangesSequence with a custom getNextMatch callback. Currently, we use CharSequence.findAnyOf to find where a line splits. Instead, we can scan the char sequence seeking for either \n, or \r (optionally followed by the \n).

The main caveat here is that there are not so many tests dedicated to line splitting, so before changing anything the test coverage needs to be improved.

To further experiment with performance improvements, I would recommend writing benchmarks like the one in the message above. It would be nice to cover various scenarios (different receiver types, strings using different delimiters and having different structure, different operations on a resulting sequence, like first, last, count, etc) to better understand trade offs. To ease benchmarks configuration, you can clone https://github.com/fzhinkin/benchmarks-template, it's KMP project using kotlinx-benchmark with various benchmarking targets being already set up.

@Gamandoor
Copy link

@fzhinkin
Copy link
Contributor

fzhinkin commented May 7, 2024

Hey! @hykilpikonna, are you considering proceeding with this PR? Please let me know if additional any help is needed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants