Add RISC-V vectors support #620
LOL. Yeah, as if I have the budget for that. This won’t happen without outside funding. |
Yep, the RISC-V Foundation could probably fund it, or somebody from SiFive. |
BTW, I feel I should provide more information about the general strategy behind adopting SIMD extensions in libjpeg-turbo. In general, support for a new SIMD instruction set in libjpeg-turbo is driven by demand from application and operating system developers. A complete set of libjpeg-turbo SIMD extensions can take hundreds of hours, so it's an expensive proposition. Even if someone else does the initial implementation, it can still sometimes take anywhere from 30 to more than 100 hours for me to clean up the implementation, benchmark it, and integrate it, depending on the quality of the code. The willingness of organizations to fund that labor is an imperfect measure of demand for the new instruction set. Irrespective of funding, however, I am not interested in supporting CPU architectures that have little likelihood of adoption in major downstream software projects (such as popular web browsers) or end-user hardware products.

Additional concerns include the design of the SIMD instruction set and the CPU and compiler architecture. My experience with RISC architectures is that, in general, the speedup to be gained from using SIMD instructions is less than that of CISC architectures, because the libjpeg-turbo scalar (non-SIMD) code is already faster with RISC CPUs and compilers. (With PowerPC, for instance, I was able to get a 2x speedup with some difficulty, but certainly not the 6x speedup that was achievable with x86.) My experience is also that not all SIMD instruction sets are created equal, and some of them are poor fits for the libjpeg-turbo algorithms. (With 256-bit AVX2, for instance, some of the DCT algorithms simply couldn't be accelerated relative to 128-bit SSE2, because of the dual-lane nature of AVX2 and the fact that the DCT algorithms operate on 128 bits of data at a time. 64-bit libjpeg-turbo SIMD extensions generally require multiple passes through certain algorithms, so the speedup tends to not be very impressive. 128-bit is kind of the "sweet spot" at the moment.) Other issues include whether compilers have a well-optimized implementation of compiler intrinsics for the instruction set. (This bit us in the butt with Arm Neon.)

The RISC-V vector ISA is interesting in that it supports arbitrary vector lengths, but that raises additional concerns regarding whether a specific RISC-V SIMD implementation in libjpeg-turbo will provide consistent performance gains across all RISC-V CPUs or whether it will be necessary to handle differing capabilities in specific RISC-V CPU implementations. From a performance point of view, hand-coded assembly with a fixed vector length has the least variability among different CPUs. Even well-optimized intrinsics with a fixed vector length have more variability, and poorly-optimized intrinsics introduce more variability. A variable vector length potentially introduces more variability as well. More variability means that it will take more time to get the implementation right.

Even if initial funding is provided to implement or integrate a new set of SIMD extensions, no set of SIMD extensions is zero-maintenance. Once the initial implementation or integration is complete, then it is necessary to use the libjpeg-turbo General Fund to fix bugs in the implementation that are inevitably discovered down the road. Thus, I must ensure that the platform is strategically important enough to adopt as a first-class citizen within the libjpeg-turbo ecosystem. As much as RISC-V appeals to me, given my background in CPU architecture, I am frankly not convinced that it is strategically important enough yet. It's a moot point at the moment, though, because I won't have any availability to work on such a project until well into 2023 at the earliest, even if funding is secured. |
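To make the variable-vector-length concern concrete, here is a minimal sketch of the strip-mining pattern that the RISC-V vector ISA encourages, written as portable scalar C that merely models what `vsetvl` does on real hardware. The `MODEL_VLEN_ELEMS` constant and all function names are hypothetical stand-ins, not libjpeg-turbo code; on real silicon the per-iteration element count depends on the implementation's VLEN, which is exactly the implementation-dependent quantity discussed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit VLEN with 16-bit elements.  On real
 * RVV hardware this value comes from the CPU, not from a compile-time
 * constant, so the loop's trip count varies across implementations. */
#define MODEL_VLEN_ELEMS 8

/* Models what the vsetvl instruction returns: how many elements the
 * hardware will process in the next "vector" operation. */
static size_t model_vsetvl(size_t n)
{
    return n < MODEL_VLEN_ELEMS ? n : MODEL_VLEN_ELEMS;
}

/* Strip-mined elementwise add: each pass handles vl elements, where vl is
 * chosen by the (modeled) hardware rather than fixed by the programmer. */
void add_i16(int16_t *dst, const int16_t *a, const int16_t *b, size_t n)
{
    while (n > 0) {
        size_t vl = model_vsetvl(n);     /* elements handled this pass */
        for (size_t i = 0; i < vl; i++)  /* one "vector" operation */
            dst[i] = (int16_t)(a[i] + b[i]);
        a += vl; b += vl; dst += vl; n -= vl;
    }
}
```

Because `vl` is decided at run time, an algorithm such as the 8x8 DCT, which wants exactly 128 bits per row, cannot simply assume a convenient vector width, and that is one source of the cross-CPU variability described above.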
Thank you for the detailed explanation. I appreciate it. |
Hi @dcommander, I am Hong-Rong from SiFive. We are interested in upstreaming RISC-V support in libjpeg-turbo. May I have your email to discuss funding and a co-working model? |
@rednoah91 @dcommander I hope your collaboration will be a success. |
Actually, I implemented a SIMD version of libjpeg-turbo based on RVV 1.0 as my graduation project. It has already been open-sourced: https://github.com/isrc-cas/libjpeg-turbo/tree/riscv-dev. |
If your implementation is high-quality, then it may reduce the amount of labor I have to do in order to implement the feature, but that labor will still not be zero. All of my labor on this, even if it involves validating and integrating someone else's implementation, must be funded. Funding and access to test hardware are the two primary gating issues from my point of view. |
Sorry for the late update. I am proposing to RISE (RISC-V Software Ecosystem) that they support the funding. It may take some time. I'll keep you posted. |
Thanks for the update, guys. I will spend my spare time maintaining and updating the
Hey guys, I have some updates on RVV support. I added support for the newest RVV intrinsics versions (v0.12 and v0.11) and tested it on a real machine (Kendryte K230). The related document can be seen here. So how is the RISE project going now? @rednoah91 Also, I noticed that most of the tests in the test suite execute very quickly, which makes me uncertain about the performance improvement brought by my implementation. Are there other test codes to prove the performance enhancement? @dcommander |
@BHbean The benchmark procedure for new SIMD extensions is rather tedious. Roughly speaking, what I do is benchmark each algorithm's SIMD extension in isolation, comparing performance with it enabled and with it disabled. As you can imagine, this is extremely time-consuming and requires some intuition regarding the behavior of the various algorithms and how much speedup should be expected from each. This is why I don't generally accept contributions of SIMD extensions unless there is funding for me to peer review the extensions, which involves performing the aforementioned procedure. This procedure is necessary in order to ensure that all proposed new SIMD extensions are justified. (It wouldn't make sense to introduce a SIMD extension for a particular algorithm if that SIMD extension doesn't significantly speed up that algorithm.)

However, if the developer of the SIMD extensions can perform the benchmarking procedure during the development process and provide me with that data, then I will only have to peer review the final results, which takes significantly less time. (This assumes a high degree of confidence that the data are reflective of the expected speedups across a wide array of CPUs within the architecture family, as well as a wide array of compilers and operating systems. I mention that because, in the case of the Arm Neon SIMD extensions, it was necessary to separately benchmark in-order and out-of-order cores. It was also necessary to separately benchmark GCC and Clang, because GCC < 12 didn't have an optimal Neon intrinsics implementation.) This OpenOffice Calc template may be useful.

NOTE: It might be possible to develop a lower-level benchmark that runs each algorithm in isolation repeatedly for a specified period of time, thus eliminating at least some of the aforementioned differential performance analysis tedium. However, such a benchmark would likely be unrealistically cache-resident and thus wouldn't accurately reflect the performance of a particular SIMD extension in a real-world compression/decompression scenario. (Also, you couldn't test the Huffman algorithms using such a low-level benchmark, because the performance of those algorithms strongly depends on the input.)

Another possibility might be to instrument the libjpeg-turbo code so that it independently tracks the average execution time of each algorithm for a particular libjpeg instance and prints a performance table when the instance is destroyed. I'm not sure if that would really accelerate the development process, because it would still be desirable to benchmark each SIMD extension in isolation as it is developed. It would, however, accelerate the peer review process, i.e. the process of benchmarking a set of SIMD extensions that has already been developed. |
@dcommander Thanks for your comprehensive explanation. I think I can continue my testing now. Considering the situation I currently face, all my tests might just serve as a reference rather than strong evidence, for the following reasons:
Besides, I tried to run:

```
$ ./tjbench-static ./testimages/artificial.ppm 95 -rgb -qq -nowrite -warmup 10
```

It outputs as follows (I marked the meanings I guessed for the output):

```
Warmup time = 10.0 seconds

// Gray
32.60  <== megapixels/sec? (Comp. Perf)
24.33  <== percentage? (Comp. Ratio)
35.01  <== megapixels/sec? (Decomp. Perf)

// 4:2:0
20.49  <== megapixels/sec? (Comp. Perf)
18.87  <== percentage? (Comp. Ratio)
21.07  <== megapixels/sec? (Decomp. Perf)
...
```

So did I get it right? |
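For reference, the two quantities being guessed at are simple to compute. Here is a hypothetical illustration (these helpers are not tjbench source code) of how a throughput figure in megapixels/sec and a compression ratio percentage would be derived from a timed run:

```c
/* Throughput: total pixels processed across all iterations, in millions,
 * divided by the elapsed wall-clock time. */
static double mpixels_per_sec(int width, int height, int iterations,
                              double elapsed_sec)
{
    return (double)width * height * iterations / 1e6 / elapsed_sec;
}

/* Compression ratio: compressed size as a percentage of the raw
 * (uncompressed) size, so smaller is better. */
static double compression_ratio_pct(long jpeg_bytes, long raw_bytes)
{
    return 100.0 * jpeg_bytes / raw_bytes;
}
```

Under that reading, a ratio of 24.33 would mean the JPEG is about a quarter the size of the raw RGB data, which is plausible for quality 95.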
It worked! Thanks a lot! |
The procedure I described above is a procedure for validating the performance of new SIMD extensions as you develop them, so it assumes that the decompression algorithms won't be SIMD-accelerated when you're developing and benchmarking the compression SIMD extensions. If you can get the decompression performance to within +/- 1% by disabling the decompression SIMD extensions, then yes, that's what you should do when benchmarking the compression SIMD extensions. It would also be valid to disable the compression SIMD extensions while benchmarking the decompression SIMD extensions, even though my procedure and the spreadsheet template don't do that. It does concern me, however, that the SIMD performance is so variable. |
Got it! Thanks again for your patient reply 👍 ! I will continue my testing! |
@BHbean Just in case, do you know of any new boards with the final vector instructions? I want to buy one, because in QEMU we can mostly validate the code itself but not its performance. Also, it would be good to understand our real targets. |
I'm surprised that there are still no public boards available. If that is the case, then it is too early to consider this technology for inclusion in libjpeg-turbo. SIMD extensions for libjpeg-turbo potentially take hundreds of hours to implement the right way. My concern is that I will finish implementing them (or validating and integrating someone else's implementation) only to find that the implementation performs optimally on the pre-release hardware and not on the final hardware. I speak from experience in that regard, since the differences in various Arm implementations were a source of pain for our project in the past. Even if someone pays me for the initial implementation or integration work, I will ultimately have to maintain the feature using General Fund money, just like I maintain every other set of SIMD extensions in libjpeg-turbo. Thus, I have a strong desire to not develop against a moving target. I may have said something similar in the comments above, and if so, I apologize for repeating myself. I was just under the impression that the technology was farther along than it apparently is. |
We (Andes Technology) implemented the IDCT functions in the source files listed below: |
@rayeh5, for now, it would be better if you could open a PR against @BHbean's branch.
@dcommander, hardware is available now, and more will become available in 2024; the situation is not so bad, but yes, it's still an early stage.
@stalkerg Thanks for the mention! My branch welcomes every commit that brings practical performance improvements. BTW, the Kendryte K230 is back on sale now; maybe you could have a look :)
@rayeh5 I am considering making the
I am in discussions with several key people regarding funding for my labor to complete the implementation, which will involve reviewing any existing code and also introducing a new SIMD profiling framework into libjpeg-turbo to facilitate that review. I do not accept piecemeal SIMD extensions, because it is much more efficient to review and test a complete implementation. Thus, I need to secure enough funding to cover the possibility that I have to implement most of it myself. At this point, I would encourage you to be patient rather than fork libjpeg-turbo for the purpose of introducing an incomplete feature. I have never received a pull request for SIMD extensions that didn't require extensive modifications on my part, so I am understandably skeptical of other people's SIMD code until it has been thoroughly reviewed. I am happy to discuss the implementation in general terms, but funding needs to be secured before I can do much else. |
@dcommander As I understand it, @BHbean isn't forking the project; he just maintains a repo fork to accumulate changes for RVV support. It can already be helpful for some people, and he can get good feedback. To provide you with a finished PR, people have to develop and test it somewhere. As an alternative, you could give him a branch on this repo. |
The Banana Pi with the SpacemiT K1 may satisfy that. |
This year there will be at least 4 devices with RVV 1.0, mostly based on SiFive chip designs. We already have some work in forks; next year it should be mature enough to talk about here. |
From my end, I was prepared to begin work on this in January, and I thought that the wheels were in motion to get a contract in place for me to do so. However, the people I was corresponding with in email (from the Linux Foundation, Google, and SiFive) haven't responded to me since January. (I just re-pinged them last week and still haven't heard back.) I need to be looped in on the design of this feature to make sure that it will be acceptable to our project once completed. There are some basic technical questions that haven't yet been answered (such as whether it is possible to implement the DCT algorithms using variable-length vectors, etc.) Historically, SIMD extensions developed externally to this project have always required costly and time-consuming modifications before they could be integrated. |
The RISC-V vector extension is finally stable, and even dev boards have materialized. I suppose it's a perfect time to start working on RISC-V support.
Spec is here https://github.com/riscv/riscv-v-spec/releases/tag/v1.0
Here you can check examples: https://github.com/riscv/riscv-v-spec/blob/master/vector-examples.adoc
Regards,