
Add RISC-V vectors support #620

Open
stalkerg opened this issue Sep 27, 2022 · 29 comments

@stalkerg

The RISC-V vector extension is finally stable, and dev boards have even been released. I suppose it's a perfect time to start working on RISC-V vector support.

The spec is here: https://github.com/riscv/riscv-v-spec/releases/tag/v1.0
Here you can check examples: https://github.com/riscv/riscv-v-spec/blob/master/vector-examples.adoc

Regards,

@dcommander
Member

LOL. Yeah, as if I have the budget for that. This won’t happen without outside funding.

@stalkerg
Author

Yep, the RISC-V Foundation could probably fund it, or somebody from SiFive.
@kasanovic?

@dcommander
Member

dcommander commented Oct 14, 2022

BTW, I feel I should provide more information about the general strategy behind adopting SIMD extensions in libjpeg-turbo.

In general, support for a new SIMD instruction set in libjpeg-turbo is driven by demand from application and operating system developers. A complete set of libjpeg-turbo SIMD extensions can take hundreds of hours, so it's an expensive proposition. Even if someone else does the initial implementation, it can still sometimes take anywhere from 30 to more than 100 hours for me to clean up the implementation, benchmark it, and integrate it, depending on the quality of the code. The willingness of organizations to fund that labor is an imperfect measure of demand for the new instruction set. Irrespective of funding, however, I am not interested in supporting CPU architectures that have little likelihood of adoption in major downstream software projects (such as popular web browsers) or end-user hardware products.

Additional concerns include the design of the SIMD instruction set and the CPU and compiler architecture. My experience with RISC architectures is that, in general, the speedup to be gained from using SIMD instructions is less than that of CISC architectures, because the libjpeg-turbo scalar (non-SIMD) code is already faster with RISC CPUs and compilers. (With PowerPC, for instance, I was able to get a 2x speedup with some difficulty, but certainly not the 6x speedup that was achievable with x86.) My experience is also that not all SIMD instruction sets are created equal, and some of them are poor fits for the libjpeg-turbo algorithms. (With 256-bit AVX2, for instance, some of the DCT algorithms simply couldn't be accelerated relative to 128-bit SSE2, because of the dual-lane nature of AVX2 and the fact that the DCT algorithms operate on 128 bits of data at a time. 64-bit libjpeg-turbo SIMD extensions generally require multiple passes through certain algorithms, so the speedup tends to not be very impressive. 128-bit is kind of the "sweet spot" at the moment.) Other issues include whether compilers have a well-optimized implementation of compiler intrinsics for the instruction set. (This bit us in the butt with Arm Neon.) The RISC-V vector ISA is interesting in that it supports arbitrary vector lengths, but that raises additional concerns regarding whether a specific RISC-V SIMD implementation in libjpeg-turbo will provide consistent performance gains across all RISC-V CPUs or whether it will be necessary to handle differing capabilities in specific RISC-V CPU implementations. From a performance point of view, hand-coded assembly with a fixed vector length has the least variability among different CPUs. Even well-optimized intrinsics with a fixed vector length have more variability, and poorly-optimized intrinsics introduce more variability. A variable vector length potentially introduces more variability as well. 
More variability means that it will take more time to get the implementation right.

Even if initial funding is provided to implement or integrate a new set of SIMD extensions, no set of SIMD extensions is zero-maintenance. Once the initial implementation or integration is complete, then it is necessary to use the libjpeg-turbo General Fund to fix bugs in the implementation that are inevitably discovered down the road. Thus, I must ensure that the platform is strategically important enough to adopt as a first-class citizen within the libjpeg-turbo ecosystem. As much as RISC-V appeals to me, given my background in CPU architecture, I am frankly not convinced that it is strategically important enough yet. It's a moot point at the moment, though, because I won't have any availability to work on such a project until well into 2023 at the earliest, even if funding is secured.

@stalkerg
Author

Thank you for the detailed explanation. I appreciate it.
Maybe this would even make good text for a blog post?

@rednoah91

Hi @dcommander, I am Hong-Rong from SiFive. We are interested in upstreaming RISC-V support in libjpeg-turbo. May I have your email address to talk about funding and a co-working model?

@stalkerg
Author

@rednoah91 @dcommander I hope your collaboration will be a success.
BTW, the plain C implementation seems to work, but because GCC doesn't yet fully support RVV auto-vectorization, it doesn't perform well.

@dcommander
Member

Hi @dcommander, I am Hong-Rong from SiFive. We are interested in upstreaming RISC-V support in libjpeg-turbo. May I have your email address to talk about funding and a co-working model?

@rednoah91 https://libjpeg-turbo.org/About/Contact

@BHbean

BHbean commented Sep 5, 2023

Actually, I implemented a SIMD version of libjpeg-turbo based on RVV 1.0 as my graduation project. It has already been open-sourced: https://github.com/isrc-cas/libjpeg-turbo/tree/riscv-dev.
This implementation already preserves the correctness of libjpeg-turbo, and the number of instructions executed on QEMU is significantly reduced. But I didn't have real hardware at the time, so I never tested it on a real system and set the implementation aside.
Recently I have been reviewing my previous work and see a possible chance to test it on a real machine. I think my previous work can help a lot, though a lot still needs to be improved. How can I help you guys? @dcommander @rednoah91
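For anyone wanting to reproduce that kind of QEMU instruction-count measurement, a rough sketch follows. The plugin path, CPU flags, binary names, and test image are all assumptions to adjust for your setup, and the helper only prints the command lines (using QEMU's "insn" TCG plugin, which reports total instructions executed) rather than running them:

```shell
# Hypothetical qemu-user instruction-count comparison; this sketch only
# prints each command line so it can be reviewed before a real run.
QEMU=${QEMU:-qemu-riscv64}
PLUGIN=${PLUGIN:-/usr/lib/qemu/plugins/libinsn.so}   # path varies by distro

count_insns_cmd() {   # count_insns_cmd <binary> [args...]
  printf '%s -cpu rv64,v=true,vlen=128 -plugin %s -d plugin %s\n' \
    "$QEMU" "$PLUGIN" "$*"
}

# Compare scalar vs. RVV builds of djpeg on the same input:
count_insns_cmd ./djpeg-scalar -outfile /dev/null testimg.jpg
count_insns_cmd ./djpeg-rvv    -outfile /dev/null testimg.jpg
```

Instruction counts under TCG emulation say nothing about cycle counts on real hardware, of course, but they are a cheap sanity check that the vector paths are actually being taken.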

@dcommander
Member

If your implementation is high-quality, then it may reduce the amount of labor I have to do in order to implement the feature, but that labor will still not be zero. All of my labor on this, even if it involves validating and integrating someone else's implementation, must be funded. Funding and access to test hardware are the two primary gating issues from my point of view.

@rednoah91

Sorry for the late update. I am proposing that RISE (RISC-V Software Ecosystem) support the funding. It may take some time. I'll keep you posted.

@BHbean

BHbean commented Sep 5, 2023

Thanks for the update, guys. I will spend my spare time maintaining and updating the riscv-dev branch in the repo above. Hopefully my code can be of some help to you.

@BHbean

BHbean commented Dec 7, 2023

Hey guys, I have some updates on RVV support. I added support for the newest RVV intrinsics versions (v0.12 and v0.11) and tested it on a real machine (Kendryte K230). The related document can be seen here.

So how is the RISE project going now? @rednoah91

And I notice that most of the tests in the test suite execute very quickly, which makes me uncertain about the performance improvement brought by my implementation. Are there other test codes to demonstrate the performance enhancement? @dcommander

@dcommander
Member

@BHbean The benchmark procedure for new SIMD extensions is rather tedious. Roughly speaking, what I do is run tjbench {image} 95 -rgb -qq -nowrite -accuratedct -warmup 10 on the five test images described in this report. However, I want to get the results of those tests to within 1% accuracy, so I disable any kind of CPU speed throttling and all unnecessary processes on the (Linux) test system. (Basically all that's running on the system are kernel daemons and an SSH server.) I first establish a baseline performance reading with the latest stable release of libjpeg-turbo, which has none of the proposed SIMD extensions. Then I start adding new SIMD extensions and benchmarking each one in isolation, in the order in which the algorithms appear on this page. I start with the compression algorithms, so RGB-to-YCbCr and RGB-to-Grayscale Color Conversion are first. Since those are compression algorithms, I expect the decompression speedup to be within +/- 1% relative to the baseline. If it isn't, then I may need to increase the benchmark time (using the -benchtime argument to tjbench) on both the baseline and RGB-to-YCbCr/RGB-to-Grayscale benchmark runs until I can reliably get the decompression speedup to within the desired margin of error. That gives me confidence that any compression speedup I observe is legitimate and not noise. For each compression algorithm in sequence, I establish a new baseline in that manner, with the expectation being that each algorithm will speed up overall compression by significantly more than 1% but that decompression will remain within +/- 1% of the previous baseline. If the speedup of a particular algorithm isn't what I expect, then I have to analyze the SIMD code carefully and figure out why it isn't performing as well as it should. (It isn't unusual for a particular algorithm to require several development and benchmark passes.) 
Once I have completed that procedure for all of the compression algorithms, then I have a new baseline that takes into account the total compression speedup, and I can use that new baseline to start testing decompression algorithms in sequence. For the decompression algorithms, I expect the compression speedup to be within +/- 1% relative to the baseline, and I expect that each algorithm will speed up overall decompression by significantly more than 1%. If I want to improve a particular SIMD extension after a full benchmark run is completed, then I benchmark that improvement relative to the final results of the benchmark run. (It isn't necessary to redo the whole benchmark run.) If SIMD extensions are added for the "fast" FDCT and IDCT algorithms, then the whole procedure must be repeated, replacing -accuratedct with -fastdct in the tjbench command line. That means obtaining a new overall baseline with the reference (unmodified) release of libjpeg-turbo. However, it is only necessary to benchmark the FDCT and IDCT algorithms in isolation, since the other algorithms were already benchmarked with the accurate FDCT/IDCT. Similarly, if a SIMD extension is added for the progressive Huffman encoding algorithm, then the whole procedure must be repeated, adding -progressive to the tjbench command line, but only the progressive Huffman encoding algorithm must be benchmarked in isolation. If SIMD extensions are added for the H2V1 and H2V2 merged upsampling algorithms, then the whole procedure must be repeated, adding -fastupsample to the tjbench command line, but only the merged upsampling algorithms must be benchmarked in isolation.

As you can imagine, this is extremely time consuming and requires some intuition regarding the behavior of the various algorithms and how much speedup should be expected from each. This is why I don't generally accept contributions of SIMD extensions unless there is funding for me to peer review the extensions, which involves performing the aforementioned procedure. This procedure is necessary in order to ensure that all proposed new SIMD extensions are justified. (It wouldn't make sense to introduce a SIMD extension for a particular algorithm if that SIMD extension doesn't significantly speed up that algorithm.) However, if the developer of the SIMD extensions can perform the benchmarking procedure during the development process and provide me with that data, then I will only have to peer review the final results, which takes significantly less time. (This is assuming that there is a high degree of confidence that the data are reflective of the expected speedups across a wide array of CPUs within the architecture family, as well as a wide array of compilers and operating systems. I mention that because, in the case of the Arm Neon SIMD extensions, it was necessary to separately benchmark in-order and out-of-order cores. It was also necessary to separately benchmark GCC and Clang, because GCC < 12 didn't have an optimal Neon intrinsics implementation.)

This OpenOffice Calc template may be useful.

NOTE: It might be possible to develop a lower-level benchmark that runs each algorithm in isolation repeatedly for a specified period of time, thus eliminating at least some of the aforementioned differential performance analysis tedium. However, such a benchmark would likely be unrealistically cache-resident and thus wouldn't accurately reflect the performance of a particular SIMD extension in a real-world compression/decompression scenario. (Also, you couldn't test the Huffman algorithms using such a low-level benchmark, because the performance of those algorithms strongly depends on the input.) Another possibility might be to instrument the libjpeg-turbo code so that it independently tracks the average execution time of each algorithm for a particular libjpeg instance and prints a performance table when the instance is destroyed. I'm not sure if that would really accelerate the development process, because it would still be desirable to benchmark each SIMD extension in isolation as it is developed. It would, however, accelerate the peer review process, i.e. the process of benchmarking a set of SIMD extensions that has already been developed.

@BHbean

BHbean commented Dec 8, 2023

@dcommander Thanks for your comprehensive explanation. I think I can continue my testing now. Considering my current situation, all my tests might serve only as a reference rather than strong evidence, for the following reasons:

This is assuming that there is a high degree of confidence that the data are reflective of the expected speedups across a wide array of CPUs within the architecture family, as well as a wide array of compilers and operating systems.

  1. Currently there is only one physical device (Kendryte K230) that supports RVV 1.0, so it is impossible to test on multiple devices right now. But I heard that RVV IP cores are coming soon, so tests could be done on those upcoming devices in the future;

It was also necessary to separately benchmark GCC and Clang, because GCC < 12 didn't have an optimal Neon intrinsics implementation.

  2. That is a good reminder for me, as I only used Clang 16.0.6 for building (I am more familiar with LLVM these days). I will try building and testing with GCC 13 as well. However, older versions of these compilers may not be testable, as they may be unable to generate RVV 1.0 vector instructions.

Besides, I tried running tjbench-static earlier today, using the following command:

$ ./tjbench-static ./testimages/artificial.ppm 95 -rgb -qq -nowrite -warmup 10

It outputs the following (I have annotated each value with my guess at its meaning):

Warmup time = 10.0 seconds

// Gray
32.60    <== megapixels/sec? (Comp. Perf)
24.33    <== percentage? (Comp. Ratio)
35.01    <== megapixels/sec? (Decomp. Perf)

// 4:2:0
20.49    <== megapixels/sec? (Comp. Perf)
18.87    <== percentage? (Comp. Ratio)
21.07    <== megapixels/sec? (Decomp. Perf)

...

So did I get it right?
And when testing the compression algorithms, do I just need to get the decompression performance of the SIMD and non-SIMD builds as close to each other as possible, and vice versa? @dcommander

@dcommander
Member

-qq outputs raw data appropriate for inputting into the spreadsheet template I attached earlier. If you want to see what the numbers actually mean, run without -qq.

@BHbean

BHbean commented Dec 8, 2023

It worked! Thanks a lot!
And two more questions (really sorry to bother you so much):

  • The -accuratedct option seems to be unavailable in tjbench-static; can I assume it is the default?
  • When I test the compression process, I cannot keep the decompression speedup within +/- 1% no matter how much I increase the value of -benchtime. Should I disable the SIMD optimizations for decompression while testing compression?

@dcommander
Member

It worked! Thanks a lot! And two more questions (really sorry to bother you so much):

  • The -accuratedct option seems to be unavailable in tjbench-static; can I assume it is the default?
  • When I test the compression process, I cannot keep the decompression speedup within +/- 1% no matter how much I increase the value of -benchtime. Should I disable the SIMD optimizations for decompression while testing compression?

-accuratedct is the default in libjpeg-turbo 3.0. The spreadsheet template was developed for prior versions, in which it was necessary to specify -accuratedct. I thought that the -accuratedct option was still available but hidden, but apparently I removed it altogether. Sorry about that.

The procedure I described above is a procedure for validating the performance of new SIMD extensions as you develop them, so it assumes that the decompression algorithms won't be SIMD-accelerated when you're developing and benchmarking the compression SIMD extensions. If you can get the decompression performance to within +/- 1% by disabling the decompression SIMD extensions, then yes, that's what you should do when benchmarking the compression SIMD extensions. It would also be valid to disable the compression SIMD extensions while benchmarking the decompression SIMD extensions, even though my procedure and the spreadsheet template don't do that.

It does concern me, however, that the SIMD performance is so variable.
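As an aside, on architectures whose SIMD dispatcher supports it, libjpeg-turbo recognizes the JSIMD_FORCENONE environment variable, which forces the scalar code paths at run time. Whether a future RVV port would honor that variable is an assumption, but where it works, it provides a convenient all-scalar baseline without rebuilding. A sketch (the image name is a placeholder, and the helper only prints the command lines):

```shell
# Print paired scalar-baseline and SIMD-enabled tjbench invocations.
run_cmd() {   # run_cmd <1 to force scalar, 0 otherwise> <image>
  printf 'JSIMD_FORCENONE=%s tjbench %s 95 -rgb -qq -nowrite -warmup 10\n' \
    "$1" "$2"
}
run_cmd 1 artificial.ppm   # scalar baseline
run_cmd 0 artificial.ppm   # SIMD-enabled run
```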

@BHbean

BHbean commented Dec 8, 2023

Got it! Thanks again for your patient reply 👍 ! I would continue my testing!

@stalkerg
Author

@BHbean just in case, do you know of any new boards with final vector instructions? I want to buy one, because in QEMU we can mostly validate the code itself but not its performance. Also, it would be good to understand our real targets.

@BHbean

BHbean commented Dec 14, 2023

@stalkerg The Kendryte K230 used to be available on Taobao, a Chinese e-commerce platform, but it has sold out. So there may be no boards with RVV 1.0 available right now. I heard that SiFive might be developing such products, so maybe you could wait a while :)

@dcommander
Member

I'm surprised that there are still no public boards available. If that is the case, then it is too early to consider this technology for inclusion in libjpeg-turbo. SIMD extensions for libjpeg-turbo potentially take hundreds of hours to implement the right way. My concern is that I will finish implementing them (or validating and integrating someone else's implementation) only to find that the implementation performs optimally on the pre-release hardware and not on the final hardware. I speak from experience in that regard, since the differences in various Arm implementations were a source of pain for our project in the past. Even if someone pays me for the initial implementation or integration work, I will ultimately have to maintain the feature using General Fund money, just like I maintain every other set of SIMD extensions in libjpeg-turbo. Thus, I have a strong desire to not develop against a moving target. I may have said something similar in the comments above, and if so, I apologize for repeating myself. I was just under the impression that the technology was farther along than it apparently is.

@rayeh5

rayeh5 commented Jan 24, 2024

We (Andes Technology) implemented the IDCT functions in the source files listed below:
simd/rvv/jidctint-rvv.c, simd/rvv/jidctred-rvv.c, and simd/rvv/jidctfst-rvv.c.
The RISC-V Vector Intrinsic version used is v0.11, with RVV version 1.0.
The verification was done on Andes' FPGA and chip platforms.
We think it can be helpful for the RVV porting effort, and we will contribute it if you need it.

@stalkerg
Author

@rayeh5, for now, it would be better if you could make a PR against @BHbean's branch.

I'm surprised that there are still no public boards available.

@dcommander, boards are available now, and more will become available in 2024; the situation is not so bad, but yes, it's an early stage.

@BHbean

BHbean commented Jan 24, 2024

@rayeh5, for now, it would be better if you could make a PR against @BHbean's branch.

@stalkerg Thanks for the mention! My branch welcomes every commit that brings a practical performance improvement. BTW, the Kendryte K230 is back on sale now; maybe you could have a look :)

We (Andes Technology) implemented the IDCT functions in the source files listed below:
simd/rvv/jidctint-rvv.c, simd/rvv/jidctred-rvv.c, and simd/rvv/jidctfst-rvv.c.
The RISC-V Vector Intrinsic version used is v0.11, with RVV version 1.0.

@rayeh5 I am considering making libjpeg-turbo-rvv an independent repo for now, since contributions for the RISC-V arch do not seem likely to be accepted upstream yet. I have also implemented simd/rvv/jidctint-rvv.c and simd/rvv/jidctfst-rvv.c in my repo, so maybe you can compare the two versions on your FPGA and chip platforms. It would be an honor to see my implementation running on platforms other than the K230, since I am really curious about the performance across different platforms. You can open an issue here and we can discuss more details there.

@dcommander
Member

dcommander commented Jan 24, 2024

I am in discussions with several key people regarding funding for my labor to complete the implementation, which will involve reviewing any existing code and also introducing a new SIMD profiling framework into libjpeg-turbo to facilitate that review. I do not accept piecemeal SIMD extensions, because it is much more efficient to review and test a complete implementation. Thus, I need to secure enough funding to cover the possibility that I have to implement most of it myself.

At this point, I would encourage you to be patient rather than fork libjpeg-turbo for the purpose of introducing an incomplete feature. I have never received a pull request for SIMD extensions that didn't require extensive modifications on my part, so I am understandably skeptical of other people's SIMD code until it has been thoroughly reviewed. I am happy to discuss the implementation in general terms, but funding needs to be secured before I can do much else.

@stalkerg
Author

stalkerg commented Jan 31, 2024

@dcommander as I understand it, @BHbean isn't forking the project; he just maintains a repo fork to accumulate changes for RVV support. It can already be helpful to some people, and he can gather good feedback. To provide you with a finished PR, people have to develop and test it somewhere. As an alternative, you could give him a branch on this repo.

@chLFF

chLFF commented May 7, 2024

@BHbean just in case, do you know of any new boards with final vector instructions? I want to buy one, because in QEMU we can mostly validate the code itself but not its performance. Also, it would be good to understand our real targets.

A Banana Pi board with the SpacemiT K1 may satisfy that:
https://docs.banana-pi.org/en/BPI-F3/SpacemiT_K1_datasheet

@stalkerg
Author

stalkerg commented May 7, 2024

This year, there will be at least 4 devices with RVV 1.0, mostly based on SiFive chip designs. We already have some work in forks; next year things should be mature enough to discuss here.
(I check the RISC-V state every month.)

@dcommander
Member

From my end, I was prepared to begin work on this in January, and I thought that the wheels were in motion to get a contract in place for me to do so. However, the people I was corresponding with in email (from the Linux Foundation, Google, and SiFive) haven't responded to me since January. (I just re-pinged them last week and still haven't heard back.) I need to be looped in on the design of this feature to make sure that it will be acceptable to our project once completed. There are some basic technical questions that haven't yet been answered, such as whether it is possible to implement the DCT algorithms using variable-length vectors. Historically, SIMD extensions developed externally to this project have always required costly and time-consuming modifications before they could be integrated.
