Implement and use vmx_strchr() #440

classilla · 2017-09-29T16:08:22Z

Like it says. Not as dramatic as memchr but should still improve things.

The text was updated successfully, but these errors were encountered:

… for js_strchr_limit Latin-1 strings

classilla · 2017-11-16T15:17:58Z

The network code caused JPEG artifacts on Macintosh Repository, probably due to bad HTTP chunking. Disabling it and flushing the cache removed the artifacts. I suspect the problem is the first vec_* can see the same character in adjacent strings and fails to see the null terminator. We probably need a second check that sees if there is both the search character and the null, and chooses a terminal bytewalker based on that.

NapalmSauce · 2018-11-01T01:28:26Z

Did you receive my mail a couple weeks ago? Anyway, I should've commented here at first. Since assembly is completely innapropriate for plc I went ahead and incorporated the small improvements I made to plvmx.c using intrinsics. It mainly benefits stchr since it loops faster. I was able to add more intrinsics (among others, non-altivec ones) to memchr and haschr also. This slightly decreases setup/exit penalty when vmx is in use and 64-bit GPRs are used when available.
If you're giving it a look, it's here: https://github.com/NapalmSauce/tenfourfox/blob/master/nsprpub/lib/libc/src/plvmx.c
Please advise/ask if there's anything.

classilla · 2018-11-01T05:12:27Z

Do you have a diff? If it seems reasonable, I'll incorporate it into FPR12 (I'm already typing this in what will be the FPR11 beta).

classilla · 2018-11-01T05:15:41Z

Actually, I just finished reading it. Clever. It definitely needs a full beta bake time, but it's pretty ingenious, I must say. Can you turn it into a PR?

NapalmSauce · 2018-11-01T06:01:55Z

Thanks a lot :) Do you mind if I delay the PR a little bit? There's one case where it's detrimental vs the current vmx_strchr. If the character c is the first element of the first vector load, it'll make strchr 4-5% slower than currently and I'm curious if there's a way I can avoid this that's near gratis. I'll PR soon.
There's a couple more things than can be done for the loops, manually unroll them of course; since gcc doesn't unroll loops with intrinsics.

My handwritten assembly routines are still significantly faster than what gcc produces but.. I spent about two weeks finishing the whole thing in asm. That was for fun, at least.

classilla · 2018-11-01T06:03:52Z

Sure, take your time. This won't get integrated for a few weeks until I'm into FPR12. What are you using to do your benchmarks?

NapalmSauce · 2018-11-01T06:07:45Z

It will look silly, but currently I use the unix time; I made a C file that calls the routines 10M times so It can be represented in miliseconds (and in seconds for long strings). I try to only note the percentage it's faster; else that would make it a lot too subjective(though a percentage easily gets subjective too).

NapalmSauce · 2018-11-01T06:28:18Z

To be more precise: the resulting executable accepts one command-line argument: the offset of the byte that it has to look for. Then it prints the address of the array(alignment), the address of the pointer returned by vmx_*chr, and lastly the pointer address returned by regular libc *chr to confirm it returned the expected result. After that it does the benchmark. I have a small makefile with it. I can provide the whole thing.

NapalmSauce · 2018-11-04T03:41:13Z

Okay, I believe I'm done. I have a few more changes to commit. I removed vec_lvebx intrinsics and instead put a (const vector unsigned char){char} cast where appropriate because gcc seems to optimize it better. The only exception to this is haschr which seems to trigger a compiler bug (2x slower than memchr to complete with the latter)? It seems to happen with haschr the way it is in the tree right now, too. I'm uploading the makefile I mentioned earlier and the c sources mostly as a testcase. Can you confirm wether it also happens with gcc-4.8.2 or not when you have time? I can reproduce on 4.8.5, 5.5 and 6.4 right now.

The makefile will produce executables haschr, haschr1, memchr, memchr1, strchr and strchr1, where the -1 suffixed ones use the modified plvmx and others use in-tree plvmx.c So what I'm asking for is essentially that you compare results from calling time ./haschr 1000 and time ./memchr 1000. You can also use it to compare in-tree/modified functions as you like.

Otherwise I made my type usage more consistent.

vmxstring_test.zip

Thanks!

classilla · 2018-11-08T03:58:12Z

I haven't forgotten about this, I've just been on the Talos II all week.
How much faster is the assembly version, for my interest? If it's a big delta, we could convert it to __asm__. Kludgey, but would work.

NapalmSauce · 2018-11-08T18:24:21Z

Well, obviously SIMD is SIMD, so there's no major speedup, but the difference was tempting enough that I relinked, silently dropping the assembly version plvmx.o in the objects and using it once I passed all JS tests. It cuts libnss by about 2KB. I think for what it does it's worth it, though I'm not sure what should be "want" and "not worthwhile".
Once I've read a bit on how mozilla benchmarks their JS (python, again?) I'll comment benchmarks of the JS and of the individual function themselves run on the 3 archs to compare and clarify to what degree it is faster.

NapalmSauce · 2018-11-13T02:28:05Z

Sorry for the delay. I made and attached a quick chart filled with time results from 10 millions calls for each function with 3 machines (one for each altivec-enabled arch) I had at hand to try to give a somewhat overall view. There's a couple things you'll notice. Every C source was compiled with gcc-4.8.5 from macports with the good -mcpu value for each subarch.

Just as a note, exceptionally for the G5 in assembly memchr and haschr I've added nops in strategical places to reduce the density of branch instructions. Otherwise the loop was bottleneck with dispatch groups and the speed improvement was tiny.

vmx_results.txt

NapalmSauce · 2018-12-13T04:52:49Z

@classilla, I know you're busy right now. Given the blog post, I believe you're still interested, and FPRb12 is next on the road map. So I expect no more than a little "ok" before converting the thing to __asm__, and of course this doesn't commit you to take a PR at all if something's not right. I'll rework the readability at the same time, though it's not that ugly as it is ATM.

If it can help reduce hesitation on taking the patch, it's been part of my own build for over 1 month and I have yet to find any problem with it.

classilla · 2018-12-13T05:33:55Z

Yes, I'm still interested, I just don't have a large number of cycles just now (between fixing Talos II issues with Firefox like https://bugzilla.mozilla.org/show_bug.cgi?id=1512162 and the holiday) but the idea is to get it into FPR12 when I get back on the G5. So ... "ok" ;)

classilla · 2019-01-10T17:25:09Z

Okay. We're back to this now that I have most of the other stuff for FPR12 landed and I have a testing G5 build up.

classilla · 2019-01-10T18:18:56Z

I'm looking at your strchr implementation. Most of it looks good, but I don't see an early return if a null is found before the character did as the older implementation had. On a short string this could end up doing more work than needed if the character isn't there, unless your testing demonstrates this is negligible.

NapalmSauce · 2019-01-10T19:07:29Z

You're right, I can observe this. In this case the in-tree version becomes slightly faster than the C version with my modifications. Though similar happens with the asm version, it's already far ahead so it still completes faster.

I'm no longer convinced the changes brought to the C versions are worthwhile. It represents well the algorithm from the asm version, that said.

NapalmSauce · 2019-01-10T23:06:30Z

On the topic of readability. I suspect adding comments on top of an asm function would make it even more a mess. So how about I add the original .S files for each function somewhere appropriate in the tree?

classilla · 2019-01-11T01:11:08Z

Or, just attach them here (no PR, just a naked set of files or a zip) and I'll figure out where to place them. I'll need to teach NSPR about .S files though. I'm hoping to have the beta out early next week, so please send ASAP.

NapalmSauce · 2019-01-11T04:50:52Z

Sorry if it's taking awhile, I was reviewing everything, and now I'm re-passing JS tests on the G5. As soon as it's done I'm uploading.

NapalmSauce · 2019-01-11T06:58:51Z

PMG5Raph:jit-test raph$ ./jit_test.py --ion -f ../../../obj-ff-dbg/dist/bin/js
[9945|   0|   0|   0] 100% ==========================================>|1400.6s
PASSED ALL

PMG5Raph:tests raph$ ./jstests.py --jitflags=ion --run-slow-tests ../../../obj-ff-dbg/dist/bin/js -j2
[12966|    0|    0| 1098] 100% ======================================>|2827.9s
PASS

string functions.zip

classilla · 2019-01-12T02:16:22Z

Very nice! Some really clever stuff in the assembly.

I'm patching NSPR for the build.

NapalmSauce · 2019-01-12T02:58:11Z

Thanks! I put a good amount of thought in this.

I'm repeating myself, but if there's anything I can do to help, just let me know. I don't want this to end up being a burden.

…nges

Add a SSE2 optimized variant of `strchr`.

classilla added a commit that referenced this issue Sep 29, 2017

#440: implement vmx_strchr() for JS; expand jsstr to use vmx_memchr()…

9d4ebbc

… for js_strchr_limit Latin-1 strings

classilla added a commit that referenced this issue Nov 16, 2017

#440: vmx_strchr() in netwerk/, security/

81e2358

classilla added a commit that referenced this issue Nov 16, 2017

#440: xpcom/

9bb17e8

classilla added a commit that referenced this issue Nov 17, 2017

#440: fix end-of-string bug with vmx_strchr()

f6c2519

classilla mentioned this issue Nov 29, 2017

Chemspill for sec-high [M1410106] #451

Closed

classilla added a commit that referenced this issue Jan 12, 2019

#440: Raphael's assembly VMX haschr/memchr/strchr plus NSPR build cha…

fa5198b

…nges

OlgaTPark added a commit to OlgaTPark/tenfourfox that referenced this issue Mar 21, 2020

#8, classilla#440 — Add SSE2 strchr

f6acaf7

Add a SSE2 optimized variant of `strchr`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement and use vmx_strchr() #440

Implement and use vmx_strchr() #440

classilla commented Sep 29, 2017

classilla commented Nov 16, 2017

NapalmSauce commented Nov 1, 2018 •

edited

classilla commented Nov 1, 2018

classilla commented Nov 1, 2018

NapalmSauce commented Nov 1, 2018

classilla commented Nov 1, 2018

NapalmSauce commented Nov 1, 2018 •

edited

NapalmSauce commented Nov 1, 2018 •

edited

NapalmSauce commented Nov 4, 2018 •

edited

classilla commented Nov 8, 2018

NapalmSauce commented Nov 8, 2018 •

edited

NapalmSauce commented Nov 13, 2018 •

edited

NapalmSauce commented Dec 13, 2018 •

edited

classilla commented Dec 13, 2018

classilla commented Jan 10, 2019

classilla commented Jan 10, 2019

NapalmSauce commented Jan 10, 2019

NapalmSauce commented Jan 10, 2019

classilla commented Jan 11, 2019

NapalmSauce commented Jan 11, 2019

NapalmSauce commented Jan 11, 2019

classilla commented Jan 12, 2019

NapalmSauce commented Jan 12, 2019

Implement and use vmx_strchr() #440

Implement and use vmx_strchr() #440

Comments

classilla commented Sep 29, 2017

classilla commented Nov 16, 2017

NapalmSauce commented Nov 1, 2018 • edited

classilla commented Nov 1, 2018

classilla commented Nov 1, 2018

NapalmSauce commented Nov 1, 2018

classilla commented Nov 1, 2018

NapalmSauce commented Nov 1, 2018 • edited

NapalmSauce commented Nov 1, 2018 • edited

NapalmSauce commented Nov 4, 2018 • edited

classilla commented Nov 8, 2018

NapalmSauce commented Nov 8, 2018 • edited

NapalmSauce commented Nov 13, 2018 • edited

NapalmSauce commented Dec 13, 2018 • edited

classilla commented Dec 13, 2018

classilla commented Jan 10, 2019

classilla commented Jan 10, 2019

NapalmSauce commented Jan 10, 2019

NapalmSauce commented Jan 10, 2019

classilla commented Jan 11, 2019

NapalmSauce commented Jan 11, 2019

NapalmSauce commented Jan 11, 2019

classilla commented Jan 12, 2019

NapalmSauce commented Jan 12, 2019

NapalmSauce commented Nov 1, 2018 •

edited

NapalmSauce commented Nov 1, 2018 •

edited

NapalmSauce commented Nov 1, 2018 •

edited

NapalmSauce commented Nov 4, 2018 •

edited

NapalmSauce commented Nov 8, 2018 •

edited

NapalmSauce commented Nov 13, 2018 •

edited

NapalmSauce commented Dec 13, 2018 •

edited