Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement and use vmx_strchr() #440

Open
classilla opened this issue Sep 29, 2017 · 23 comments
Open

Implement and use vmx_strchr() #440

classilla opened this issue Sep 29, 2017 · 23 comments

Comments

@classilla
Copy link
Owner

Like it says. Not as dramatic as memchr but should still improve things.

classilla added a commit that referenced this issue Sep 29, 2017
classilla added a commit that referenced this issue Nov 16, 2017
@classilla
Copy link
Owner Author

The network code caused JPEG artifacts on Macintosh Repository, probably due to bad HTTP chunking. Disabling it and flushing the cache removed the artifacts. I suspect the problem is the first vec_* can see the same character in adjacent strings and fails to see the null terminator. We probably need a second check that sees if there is both the search character and the null, and chooses a terminal bytewalker based on that.

@NapalmSauce
Copy link
Contributor

NapalmSauce commented Nov 1, 2018

Did you receive my mail a couple weeks ago? Anyway, I should've commented here at first. Since assembly is completely innapropriate for plc I went ahead and incorporated the small improvements I made to plvmx.c using intrinsics. It mainly benefits stchr since it loops faster. I was able to add more intrinsics (among others, non-altivec ones) to memchr and haschr also. This slightly decreases setup/exit penalty when vmx is in use and 64-bit GPRs are used when available.
If you're giving it a look, it's here: https://github.com/NapalmSauce/tenfourfox/blob/master/nsprpub/lib/libc/src/plvmx.c
Please advise/ask if there's anything.

@classilla
Copy link
Owner Author

Do you have a diff? If it seems reasonable, I'll incorporate it into FPR12 (I'm already typing this in what will be the FPR11 beta).

@classilla
Copy link
Owner Author

Actually, I just finished reading it. Clever. It definitely needs a full beta bake time, but it's pretty ingenious, I must say. Can you turn it into a PR?

@NapalmSauce
Copy link
Contributor

Thanks a lot :) Do you mind if I delay the PR a little bit? There's one case where it's detrimental vs the current vmx_strchr. If the character c is the first element of the first vector load, it'll make strchr 4-5% slower than currently and I'm curious if there's a way I can avoid this that's near gratis. I'll PR soon.
There's a couple more things than can be done for the loops, manually unroll them of course; since gcc doesn't unroll loops with intrinsics.

My handwritten assembly routines are still significantly faster than what gcc produces but.. I spent about two weeks finishing the whole thing in asm. That was for fun, at least.

@classilla
Copy link
Owner Author

Sure, take your time. This won't get integrated for a few weeks until I'm into FPR12. What are you using to do your benchmarks?

@NapalmSauce
Copy link
Contributor

NapalmSauce commented Nov 1, 2018

It will look silly, but currently I use the unix time; I made a C file that calls the routines 10M times so It can be represented in miliseconds (and in seconds for long strings). I try to only note the percentage it's faster; else that would make it a lot too subjective(though a percentage easily gets subjective too).

@NapalmSauce
Copy link
Contributor

NapalmSauce commented Nov 1, 2018

To be more precise: the resulting executable accepts one command-line argument: the offset of the byte that it has to look for. Then it prints the address of the array(alignment), the address of the pointer returned by vmx_*chr, and lastly the pointer address returned by regular libc *chr to confirm it returned the expected result. After that it does the benchmark. I have a small makefile with it. I can provide the whole thing.

@NapalmSauce
Copy link
Contributor

NapalmSauce commented Nov 4, 2018

Okay, I believe I'm done. I have a few more changes to commit. I removed vec_lvebx intrinsics and instead put a (const vector unsigned char){char} cast where appropriate because gcc seems to optimize it better. The only exception to this is haschr which seems to trigger a compiler bug (2x slower than memchr to complete with the latter)? It seems to happen with haschr the way it is in the tree right now, too. I'm uploading the makefile I mentioned earlier and the c sources mostly as a testcase. Can you confirm wether it also happens with gcc-4.8.2 or not when you have time? I can reproduce on 4.8.5, 5.5 and 6.4 right now.

The makefile will produce executables haschr, haschr1, memchr, memchr1, strchr and strchr1, where the -1 suffixed ones use the modified plvmx and others use in-tree plvmx.c So what I'm asking for is essentially that you compare results from calling time ./haschr 1000 and time ./memchr 1000. You can also use it to compare in-tree/modified functions as you like.

Otherwise I made my type usage more consistent.

vmxstring_test.zip

Thanks!

@classilla
Copy link
Owner Author

I haven't forgotten about this, I've just been on the Talos II all week.
How much faster is the assembly version, for my interest? If it's a big delta, we could convert it to __asm__. Kludgey, but would work.

@NapalmSauce
Copy link
Contributor

NapalmSauce commented Nov 8, 2018

Well, obviously SIMD is SIMD, so there's no major speedup, but the difference was tempting enough that I relinked, silently dropping the assembly version plvmx.o in the objects and using it once I passed all JS tests. It cuts libnss by about 2KB. I think for what it does it's worth it, though I'm not sure what should be "want" and "not worthwhile".
Once I've read a bit on how mozilla benchmarks their JS (python, again?) I'll comment benchmarks of the JS and of the individual function themselves run on the 3 archs to compare and clarify to what degree it is faster.

@NapalmSauce
Copy link
Contributor

NapalmSauce commented Nov 13, 2018

Sorry for the delay. I made and attached a quick chart filled with time results from 10 millions calls for each function with 3 machines (one for each altivec-enabled arch) I had at hand to try to give a somewhat overall view. There's a couple things you'll notice. Every C source was compiled with gcc-4.8.5 from macports with the good -mcpu value for each subarch.

Just as a note, exceptionally for the G5 in assembly memchr and haschr I've added nops in strategical places to reduce the density of branch instructions. Otherwise the loop was bottleneck with dispatch groups and the speed improvement was tiny.

vmx_results.txt

@NapalmSauce
Copy link
Contributor

NapalmSauce commented Dec 13, 2018

@classilla, I know you're busy right now. Given the blog post, I believe you're still interested, and FPRb12 is next on the road map. So I expect no more than a little "ok" before converting the thing to __asm__, and of course this doesn't commit you to take a PR at all if something's not right. I'll rework the readability at the same time, though it's not that ugly as it is ATM.

If it can help reduce hesitation on taking the patch, it's been part of my own build for over 1 month and I have yet to find any problem with it.

@classilla
Copy link
Owner Author

Yes, I'm still interested, I just don't have a large number of cycles just now (between fixing Talos II issues with Firefox like https://bugzilla.mozilla.org/show_bug.cgi?id=1512162 and the holiday) but the idea is to get it into FPR12 when I get back on the G5. So ... "ok" ;)

@classilla
Copy link
Owner Author

Okay. We're back to this now that I have most of the other stuff for FPR12 landed and I have a testing G5 build up.

@classilla
Copy link
Owner Author

I'm looking at your strchr implementation. Most of it looks good, but I don't see an early return if a null is found before the character did as the older implementation had. On a short string this could end up doing more work than needed if the character isn't there, unless your testing demonstrates this is negligible.

@NapalmSauce
Copy link
Contributor

You're right, I can observe this. In this case the in-tree version becomes slightly faster than the C version with my modifications. Though similar happens with the asm version, it's already far ahead so it still completes faster.

I'm no longer convinced the changes brought to the C versions are worthwhile. It represents well the algorithm from the asm version, that said.

@NapalmSauce
Copy link
Contributor

On the topic of readability. I suspect adding comments on top of an asm function would make it even more a mess. So how about I add the original .S files for each function somewhere appropriate in the tree?

@classilla
Copy link
Owner Author

Or, just attach them here (no PR, just a naked set of files or a zip) and I'll figure out where to place them. I'll need to teach NSPR about .S files though. I'm hoping to have the beta out early next week, so please send ASAP.

@NapalmSauce
Copy link
Contributor

Sorry if it's taking awhile, I was reviewing everything, and now I'm re-passing JS tests on the G5. As soon as it's done I'm uploading.

@NapalmSauce
Copy link
Contributor

PMG5Raph:jit-test raph$ ./jit_test.py --ion -f ../../../obj-ff-dbg/dist/bin/js
[9945|   0|   0|   0] 100% ==========================================>|1400.6s
PASSED ALL

PMG5Raph:tests raph$ ./jstests.py --jitflags=ion --run-slow-tests ../../../obj-ff-dbg/dist/bin/js -j2
[12966|    0|    0| 1098] 100% ======================================>|2827.9s
PASS

string functions.zip

@classilla
Copy link
Owner Author

Very nice! Some really clever stuff in the assembly.

I'm patching NSPR for the build.

@NapalmSauce
Copy link
Contributor

Thanks! I put a good amount of thought in this.

I'm repeating myself, but if there's anything I can do to help, just let me know. I don't want this to end up being a burden.

OlgaTPark added a commit to OlgaTPark/tenfourfox that referenced this issue Mar 21, 2020
Add a SSE2 optimized variant of `strchr`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants