Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lib/fs: Add ASCII fastpath for normalization #9365

Draft
wants to merge 18 commits into
base: main
Choose a base branch
from
Draft

Conversation

bt90
Copy link
Contributor

@bt90 bt90 commented Jan 23, 2024

Purpose

Avoids unicode.ToLower(unicode.ToUpper(r)) and norm.NFC.String(s) for simple ASCII filenames.

lib/fs/folding.go Outdated Show resolved Hide resolved
@bt90
Copy link
Contributor Author

bt90 commented Jan 23, 2024

@greatroar WDYT?

lib/fs/folding.go Outdated Show resolved Hide resolved
@bt90 bt90 requested a review from calmh January 23, 2024 16:50
@calmh
Copy link
Member

calmh commented Jan 23, 2024

Test case and benchmark... how much faster is this than the old stuff, for the various cases, since it's a performance optimisation?

@AudriusButkevicius
Copy link
Member

Yes, benchmarks please.

@bt90
Copy link
Contributor Author

bt90 commented Jan 24, 2024

Added a benchmark. Comparison with the previous implementation:

goos: linux
goarch: amd64
pkg: github.com/syncthing/syncthing/lib/fs
cpu: AMD EPYC 7763 64-Core Processor                
                                     │ /tmp/old.txt  │            /tmp/new.txt             │
                                     │    sec/op     │   sec/op     vs base                │
UnicodeLowercase/lowercase_ASCII-2      48.34n ±  7%   17.25n ± 8%  -64.32% (p=0.000 n=10)
UnicodeLowercase/mixedcase_ASCII-2     191.45n ± 14%   75.91n ± 6%  -60.35% (p=0.000 n=10)
UnicodeLowercase/lowercase_unicode-2    127.2n ±  4%   235.0n ± 7%  +84.75% (p=0.000 n=10)
UnicodeLowercase/mixedcase_unicode-2    347.2n ±  7%   315.6n ± 6%   -9.13% (p=0.001 n=10)
UnicodeLowercase/multibyte_unicode-2    537.2n ±  6%   586.1n ± 3%   +9.10% (p=0.000 n=10)
geomean                                 185.5n         141.6n       -23.67%

                                     │ /tmp/old.txt │            /tmp/new.txt             │
                                     │     B/op     │    B/op     vs base                 │
UnicodeLowercase/lowercase_ASCII-2     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/mixedcase_ASCII-2     24.00 ± 0%     24.00 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/lowercase_unicode-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/mixedcase_unicode-2   32.00 ± 0%     32.00 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/multibyte_unicode-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                           ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                     │ /tmp/old.txt │            /tmp/new.txt             │
                                     │  allocs/op   │ allocs/op   vs base                 │
UnicodeLowercase/lowercase_ASCII-2     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/mixedcase_ASCII-2     1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/lowercase_unicode-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/mixedcase_unicode-2   1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/multibyte_unicode-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                           ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

The ASCII fastpath seems to be working as intended but there's still a severe regression for the lowercase unicode testcase.

for _, r := range s[i:] {
rs.WriteRune(unicode.ToLower(unicode.ToUpper(r)))
func toLower(r rune) rune {
if r <= unicode.MaxASCII {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be optimized another ~50% via this bikeshed: https://stackoverflow.com/a/77689444/1432614

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're dealing with 4 cases:

  • ASCII lower -> no-op
  • ASCII mixed -> replace uppercase ones
  • Unicode lower -> apply NFC
  • Unicode mixed -> replace uppercase ones and apply NFC

The linked logic only offers fast ASCII detection but lacks the upper/lower distinction.

@bt90
Copy link
Contributor Author

bt90 commented Jan 25, 2024

Doing things in single pass:

goos: linux
goarch: amd64
pkg: github.com/syncthing/syncthing/lib/fs
cpu: AMD EPYC 7763 64-Core Processor                
                                     │ /tmp/old.txt │            /tmp/new.txt             │
                                     │    sec/op    │   sec/op     vs base                │
UnicodeLowercase/ASCII_lowercase-2      47.80n ± 5%   17.77n ± 9%  -62.83% (p=0.000 n=10)
UnicodeLowercase/ASCII_mixedcase-2     193.65n ± 9%   53.66n ± 3%  -72.29% (p=0.000 n=10)
UnicodeLowercase/Unicode_lowercase-2    124.3n ± 2%   161.6n ± 7%  +29.96% (p=0.000 n=10)
UnicodeLowercase/Unicode_mixedcase-2    345.9n ± 9%   292.4n ± 7%  -15.48% (p=0.000 n=10)
UnicodeLowercase/Unicode_multibyte-2    538.5n ± 2%   564.7n ± 5%   +4.86% (p=0.000 n=10)
geomean                                 184.6n        120.5n       -34.71%

                                     │ /tmp/old.txt │            /tmp/new.txt             │
                                     │     B/op     │    B/op     vs base                 │
UnicodeLowercase/ASCII_lowercase-2     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/ASCII_mixedcase-2     24.00 ± 0%     24.00 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_lowercase-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_mixedcase-2   32.00 ± 0%     32.00 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_multibyte-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                           ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                     │ /tmp/old.txt │            /tmp/new.txt             │
                                     │  allocs/op   │ allocs/op   vs base                 │
UnicodeLowercase/ASCII_lowercase-2     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/ASCII_mixedcase-2     1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_lowercase-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_mixedcase-2   1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_multibyte-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                           ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

Copy link
Member

@imsodin imsodin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Orthogonal idea from looking at the pprof in https://forum.syncthing.net/t/380k-files-24gb-in-directory-results-in-very-slow-sync-8-days/21488/26:
Quite a bit of time is spent on memory management - re-using bytes slices with a pool might help here.

lib/fs/folding_test.go Outdated Show resolved Hide resolved
for i := 0; i < len(s); i++ {
c := s[i]
if c > unicode.MaxASCII {
return false, isLower
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given unicode_lowercase is the slow bench result with the new implementatin, maybe it would be worth testing keeping that behaviour from the old code. I.e. when encountering non-ascii, keep checking if there's any upper-case chars (and then exit). So you can then take the lower-case-unicode shortcut from before. This will make the unicode-mixed-case slower, but I wouldn't be surprised if not by a lot compared to the gains for unicode-lower.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the point I don't really understand. golangs strings.Map() has that shortcut built into it:

https://cs.opensource.google/go/go/+/refs/tags/go1.21.6:src/strings/strings.go;l=489-517

My previous (flawed?) mixedcase unicode benchmark case should have stopped at the very first byte in the ASCII detection. The rest of the code looks pretty much the same to me? Map the runes and if a case change is detected, assign a string builder with the unchanged part and write mapped runes after that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely glossed over the usage of strings.Map xD

That's still one or two more loops. If you did the unicode lower detection in our initial loop that does ascii detection, and then not use strings.Map, we get rid of part of a loop in both the unicode cases.
Now that also makes me wonder if we could wrap isASCII and toLowerAscii into a mapping function for strings.Map for less code branches/complexity, and equal, maybe even slightly better performance :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we are now mixing two classes of optimisations: Improving things about the unicode path, and adding an ascii shortcut. Maybe worth doing that separately for optimal results. Just an idle, optional suggestion though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ASCII path iterates over single bytes which is fast but doesn't work as soon as any unicode characters are involved. Iterating over runes is a tad slower but necessary in that case.

So you either end up with two loops or one rune based loop which is slower for ASCII inputs.

I also tried to get rid of lower(upper(r)) as far as possible as there are only 23(!) characters in unicode that satisfy lower(r) != lower(upper(r))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found it, but apparently the type isn't exported by the unicode package which renders the SpecialCase unusable for us.

Copy link
Contributor Author

@bt90 bt90 Jan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naive code generation attempt:

import (
	"fmt"
	"unicode"
)

func main() {
	fmt.Println("unicode.SpecialCase{")
	for r := rune(0); r <= unicode.MaxRune; r++ {
		lower := unicode.ToLower(r)
		lowerUpper := unicode.ToLower(unicode.ToUpper(r))
		if lower == lowerUpper {
			continue
		}
		hr := fmt.Sprintf("0x%04x", r)
		lr := fmt.Sprintf("0x%04x", lowerUpper)
		fmt.Printf("\tunicode.CaseRange{%s, %s, d{0, %s - %s, 0}}, // %s\n", hr, hr, lr, hr, string(r))
	}
	fmt.Println("}")
}
unicode.SpecialCase{
	unicode.CaseRange{0x00b5, 0x00b5, d{0, 0x03bc - 0x00b5, 0}}, // µ
	unicode.CaseRange{0x0131, 0x0131, d{0, 0x0069 - 0x0131, 0}}, // ı
	unicode.CaseRange{0x017f, 0x017f, d{0, 0x0073 - 0x017f, 0}}, // ſ
	unicode.CaseRange{0x0345, 0x0345, d{0, 0x03b9 - 0x0345, 0}}, // ͅ
	unicode.CaseRange{0x03c2, 0x03c2, d{0, 0x03c3 - 0x03c2, 0}}, // ς
	unicode.CaseRange{0x03d0, 0x03d0, d{0, 0x03b2 - 0x03d0, 0}}, // ϐ
	unicode.CaseRange{0x03d1, 0x03d1, d{0, 0x03b8 - 0x03d1, 0}}, // ϑ
	unicode.CaseRange{0x03d5, 0x03d5, d{0, 0x03c6 - 0x03d5, 0}}, // ϕ
	unicode.CaseRange{0x03d6, 0x03d6, d{0, 0x03c0 - 0x03d6, 0}}, // ϖ
	unicode.CaseRange{0x03f0, 0x03f0, d{0, 0x03ba - 0x03f0, 0}}, // ϰ
	unicode.CaseRange{0x03f1, 0x03f1, d{0, 0x03c1 - 0x03f1, 0}}, // ϱ
	unicode.CaseRange{0x03f5, 0x03f5, d{0, 0x03b5 - 0x03f5, 0}}, // ϵ
	unicode.CaseRange{0x1c80, 0x1c80, d{0, 0x0432 - 0x1c80, 0}}, // ᲀ
	unicode.CaseRange{0x1c81, 0x1c81, d{0, 0x0434 - 0x1c81, 0}}, // ᲁ
	unicode.CaseRange{0x1c82, 0x1c82, d{0, 0x043e - 0x1c82, 0}}, // ᲂ
	unicode.CaseRange{0x1c83, 0x1c83, d{0, 0x0441 - 0x1c83, 0}}, // ᲃ
	unicode.CaseRange{0x1c84, 0x1c84, d{0, 0x0442 - 0x1c84, 0}}, // ᲄ
	unicode.CaseRange{0x1c85, 0x1c85, d{0, 0x0442 - 0x1c85, 0}}, // ᲅ
	unicode.CaseRange{0x1c86, 0x1c86, d{0, 0x044a - 0x1c86, 0}}, // ᲆ
	unicode.CaseRange{0x1c87, 0x1c87, d{0, 0x0463 - 0x1c87, 0}}, // ᲇ
	unicode.CaseRange{0x1c88, 0x1c88, d{0, 0xa64b - 0x1c88, 0}}, // ᲈ
	unicode.CaseRange{0x1e9b, 0x1e9b, d{0, 0x1e61 - 0x1e9b, 0}}, // ẛ
	unicode.CaseRange{0x1fbe, 0x1fbe, d{0, 0x03b9 - 0x1fbe, 0}}, // ι
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the exceptions are that few it certainly seems valid to pre-generate the table and then use that plus regular ToLower in order to save on the more complex operations.

Copy link
Member

@calmh calmh Jan 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, the unexported d is a bummer. Wonder what the thinking there is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played a bit with unicode.RangeTable but couldn't improve the performance:

var complexFolds = &unicode.RangeTable{
	R16: []unicode.Range16{
		{181, 305, 124},
		{383, 837, 454},
		{962, 976, 14},
		{977, 981, 4},
		{982, 1008, 26},
		{1009, 1013, 4},
		{7296, 7304, 1},
		{7835, 8126, 291},
	},
	R32:         []unicode.Range32{},
	LatinOffset: 0,
}

func toLower(r rune) rune {
	if r <= unicode.MaxASCII {
		if r < 'A' || 'Z' < r {
			return r
		}
		return r + 'a' - 'A'
	}
	if unicode.Is(complexFolds, r) {
		return unicode.To(unicode.LowerCase, unicode.To(unicode.UpperCase, r))
	}
	return unicode.To(unicode.LowerCase, r)
}

@bt90
Copy link
Contributor Author

bt90 commented Jan 28, 2024

Latest numbers:

goos: windows
goarch: amd64
pkg: github.com/syncthing/syncthing/lib/fs
cpu: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
                                     │   old.txt    │               new.txt               │
                                     │    sec/op    │   sec/op     vs base                │
UnicodeLowercase/ASCII_lowercase-8      36.62n ± 1%   12.91n ± 1%  -64.75% (p=0.000 n=10)
UnicodeLowercase/ASCII_mixedcase-8     181.95n ± 1%   63.24n ± 1%  -65.25% (p=0.000 n=10)
UnicodeLowercase/Unicode_lowercase-8    111.2n ± 1%   125.1n ± 1%  +12.51% (p=0.000 n=10)
UnicodeLowercase/Unicode_mixedcase-8    290.2n ± 0%   188.7n ± 1%  -34.96% (p=0.000 n=10)
UnicodeLowercase/Unicode_multibyte-8    479.7n ± 0%   488.5n ± 0%   +1.85% (p=0.000 n=10)
geomean                                 159.5n        98.80n       -38.04%

                                     │   old.txt    │               new.txt               │
                                     │     B/op     │    B/op     vs base                 │
UnicodeLowercase/ASCII_lowercase-8     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/ASCII_mixedcase-8     24.00 ± 0%     24.00 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_lowercase-8   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_mixedcase-8   32.00 ± 0%     32.00 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_multibyte-8   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                           ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                     │   old.txt    │               new.txt               │
                                     │  allocs/op   │ allocs/op   vs base                 │
UnicodeLowercase/ASCII_lowercase-8     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/ASCII_mixedcase-8     1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_lowercase-8   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_mixedcase-8   1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_multibyte-8   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                           ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

@rdslw
Copy link

rdslw commented Mar 24, 2024

@bt90 Would you like me to check some test build on this very machine, let me know. Eager to see this implemented/released.

@bt90
Copy link
Contributor Author

bt90 commented Mar 24, 2024

Unfortunately my time is rather limited these days. We moved last week and I still have a lot of work to do around the house.

But feel free to test and post your results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants