[WEBP] SaveImage is taking more than x100 times slower in AWS Instance than my local machine #2125

Open
4 tasks done
christallire opened this issue May 21, 2022 · 13 comments

Comments

@christallire

christallire commented May 21, 2022

Prerequisites

  • I have written a descriptive issue title
  • I have verified that I am running the latest version of ImageSharp
  • I have verified that the problem exists in both DEBUG and RELEASE mode
  • I have searched open and closed issues to ensure it has not already been reported

ImageSharp version

2.1.1

Other ImageSharp packages and versions

None

Environment (Operating system, version and so on)

ARM64/Ubuntu/k8s

.NET Framework version

.NET 6

Description

Hi guys,

I've observed much longer image processing times in an arm64 VM environment.
It's similar to issue #2104 but slightly different.

The image service runs in k8s with an unlimited CPU budget and is supposed to process a lot of images concurrently.

Pod info

# uname -a
Linux image-service-8566cd55f6-f6hkb 5.4.188-104.359.amzn2.aarch64 #1 SMP Thu Apr 14 20:53:17 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

Saving an image takes 16 ms in my local test, but 1,000 ms to 10,000 ms in the VM (depending on size; about 10 seconds for a ~2 MB JPEG).

Steps to Reproduce

I've run this code in the thread pool and it ran into thread pool starvation almost immediately, lol.

Code

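    using System.IO;
    using System.Threading.Tasks;
    using SixLabors.ImageSharp;
    using SixLabors.ImageSharp.Formats.Webp;

    // Decodes the input (any supported format) and re-encodes it as lossy WebP.
    private async Task<Stream> ConvertImage(Stream stream)
    {
        using var image = await Image.LoadAsync(stream);
        var imageStream = new MemoryStream();

        await image.SaveAsync(imageStream, new WebpEncoder()
        {
            FileFormat = WebpFileFormatType.Lossy,
        });

        // The returned stream is positioned at its end; callers must rewind it before reading.
        return imageStream;
    }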

Note that the stream passed to the method is a MemoryStream.
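
Editorial sketch, not part of the original report: the thread pool starvation described above can be contained by capping how many conversions run at once. This is a minimal sketch assuming the encode is CPU-bound; `EncodeThrottle` and `RunAsync` are hypothetical names, not ImageSharp APIs, and the cap does not make an individual encode any faster.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: limits how many encode operations run concurrently.
public sealed class EncodeThrottle
{
    private readonly SemaphoreSlim _slots;

    public EncodeThrottle(int maxConcurrency) =>
        _slots = new SemaphoreSlim(maxConcurrency, maxConcurrency);

    // Waits asynchronously for a free slot, runs the work, then frees the slot.
    public async Task<T> RunAsync<T>(Func<Task<T>> work)
    {
        await _slots.WaitAsync();
        try
        {
            return await work();
        }
        finally
        {
            _slots.Release();
        }
    }
}
```

A caller could create one shared instance, for example `new EncodeThrottle(Environment.ProcessorCount)`, and wrap each call as `throttle.RunAsync(() => ConvertImage(stream))`.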

Images

No response

@brianpopow
Collaborator

From your pod info, it seems it is running on an ARM device; the architecture is AArch64. For speed, we rely heavily on hardware intrinsics that are only available on x86/x64 (SSE, AVX), and we cannot use those on ARM. That is very likely the cause of the slowdown.

We have a branch for adding some ARM hardware intrinsics support, but porting all webp intrinsics to ARM would be a huge task.
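
Editorial sketch, not part of the original comment: one way to confirm which SIMD paths the runtime exposes on a given machine is to print the `IsSupported` flags. `SimdReport` is a hypothetical name; the properties it reads are standard .NET APIs. On an arm64 pod like the one above, the x86 flags (`Sse2`, `Avx2`) report `False`.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper: summarizes the SIMD instruction sets the runtime can use.
public static class SimdReport
{
    public static string Describe() =>
        $"Arch={RuntimeInformation.ProcessArchitecture}, " +
        $"Vector.IsHardwareAccelerated={Vector.IsHardwareAccelerated}, " +
        $"Sse2={Sse2.IsSupported}, Avx2={Avx2.IsSupported}, " +
        $"AdvSimd={AdvSimd.IsSupported}";
}
```

Logging `SimdReport.Describe()` once at service startup makes it easy to compare the local machine with the VM.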

@JimBobSquarePants
Member

It's not just WebP; none of our custom intrinsics support ARM at the moment. Now that we've dropped a bunch of targets, I hope to greatly simplify a lot of our pipeline processes and add ARM intrinsics along the way.

@christallire
Author

Yup, I expected that. Since the dawn of .NET Core arm64 support, I've been poking around everywhere to get ARM support ASAP 👍🏿

@brianpopow
Collaborator

@christallire you can do more than just poke. This is an open source project; you can make it better by opening pull requests.
The WebP implementation is based on libwebp. libwebp has ARM intrinsics support that could be ported to ImageSharp, but we need help from the community to get this done, because it will be a huge task.

See libwebp: all files with a _neon suffix contain ARM-specific code.
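
Editorial sketch, not part of the original comment: to illustrate the general shape such a port takes in C#, here is a minimal helper with an x86 path, an ARM path, and a scalar fallback. It is not taken from ImageSharp or libwebp; `SaturatingAdd.Add` is a hypothetical name, while `Sse2.AddSaturate` and `AdvSimd.AddSaturate` are existing .NET intrinsics.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

public static class SaturatingAdd
{
    // Hypothetical helper: dst[i] = min(255, a[i] + b[i]).
    public static void Add(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b, Span<byte> dst)
    {
        if (a.Length != b.Length || dst.Length < a.Length)
            throw new ArgumentException("Span lengths do not match.");

        ref byte ra = ref MemoryMarshal.GetReference(a);
        ref byte rb = ref MemoryMarshal.GetReference(b);
        ref byte rd = ref MemoryMarshal.GetReference(dst);
        int i = 0;

        // Vector path: 16 bytes per iteration on SSE2 (x86/x64) or AdvSimd (ARM64).
        if (Sse2.IsSupported || AdvSimd.IsSupported)
        {
            for (; i <= a.Length - Vector128<byte>.Count; i += Vector128<byte>.Count)
            {
                Vector128<byte> va = Unsafe.ReadUnaligned<Vector128<byte>>(ref Unsafe.Add(ref ra, i));
                Vector128<byte> vb = Unsafe.ReadUnaligned<Vector128<byte>>(ref Unsafe.Add(ref rb, i));
                Vector128<byte> sum = Sse2.IsSupported
                    ? Sse2.AddSaturate(va, vb)
                    : AdvSimd.AddSaturate(va, vb);
                Unsafe.WriteUnaligned(ref Unsafe.Add(ref rd, i), sum);
            }
        }

        // Scalar path: handles the tail, or everything when no SIMD is available.
        for (; i < a.Length; i++)
            dst[i] = (byte)Math.Min(255, a[i] + b[i]);
    }
}
```

The JIT treats each `IsSupported` check as a constant, so only the branch for the current CPU survives in the generated code. Real codec kernels are considerably more involved than this single operation.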

@brianpopow
Collaborator

Here are some real-world benchmarks of WebP encoding, run with our benchmark project.

Encode Webp:

BenchmarkDotNet=v0.13.0, OS=ubuntu 20.04
Unknown processor
.NET SDK=6.0.405
  [Host]     : .NET 6.0.13 (6.0.1322.58009), Arm64 RyuJIT
  Job-SOYIPU : .NET 6.0.13 (6.0.1322.58009), Arm64 RyuJIT

Runtime=.NET 6.0  Arguments=/p:DebugType=portable  IterationCount=3
LaunchCount=1  WarmupCount=3

|                     Method |    TestImage |        Mean |      Error |   StdDev | Ratio | RatioSD |     Gen 0 |     Gen 1 |     Gen 2 | Allocated |
|--------------------------- |------------- |------------:|-----------:|---------:|------:|--------:|----------:|----------:|----------:|----------:|
|        'Magick Webp Lossy' | Png/Bike.png |    86.90 ms |   0.587 ms | 0.032 ms |  0.15 |    0.00 |         - |         - |         - |     70 KB |
|    'ImageSharp Webp Lossy' | Png/Bike.png |   714.41 ms |  16.649 ms | 0.913 ms |  1.24 |    0.00 | 7000.0000 |         - |         - | 17,589 KB |
|     'Magick Webp Lossless' | Png/Bike.png |   575.48 ms |   9.339 ms | 0.512 ms |  1.00 |    0.00 |         - |         - |         - |    532 KB |
| 'ImageSharp Webp Lossless' | Png/Bike.png | 1,274.98 ms | 180.146 ms | 9.874 ms |  2.22 |    0.02 | 8000.0000 | 4000.0000 | 2000.0000 | 46,898 KB |

CPU info

Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              6
On-line CPU(s) list: 0-5
Thread(s) per core:  1
Core(s) per socket:  3
Socket(s):           2
Vendor ID:           ARM
Model:               4
Model name:          Cortex-A53
Stepping:            r0p4
CPU max MHz:         1896.0000
CPU min MHz:         100.0000
BogoMIPS:            48.00
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32

@adamsitnik
Contributor

Perhaps @SwapnilGaikwad or somebody else from @ARM-software would be interested in providing ARM optimizations?

@a74nh

a74nh commented Feb 6, 2023

> Perhaps @SwapnilGaikwad or somebody else from @ARM-software would be interested in providing ARM optimizations?

I can check, but we'd have to prioritise this around other things. In the meantime, @SwapnilGaikwad and I would be happy to review any patches in this area.

@brianpopow
Collaborator

> > Perhaps @SwapnilGaikwad or somebody else from @ARM-software would be interested in providing ARM optimizations?
>
> I can check, but we'd have to prioritise this around other things. In the meantime, @SwapnilGaikwad and I would be happy to review any patches in this area.

Thank you @a74nh, reviewing PRs would already be very helpful. I want to add PRs in that area to improve ARM performance, but I am still a beginner when it comes to ARM intrinsics and could use any advice there.

@antonfirsov
Member

From OP:

> Saving an image takes 16 ms in my local test, but 1,000 ms to 10,000 ms in the VM (depending on size; about 10 seconds for a ~2 MB JPEG).

I don't think this issue should be tagged with [WEBP]. I would prioritize Jpeg with ARM optimizations.

@brianpopow
Collaborator

> I don't think this issue should be tagged with [WEBP]. I would prioritize Jpeg with ARM optimizations.

The code snippet the OP provided was encoding a lossy WebP. Maybe he took a JPEG as input?
It's true, though, that the JPEG encoder/decoder offers a lot of potential for ARM intrinsics improvements.

@kunalspathak
Contributor

> but I am still a beginner when it comes to ARM intrinsics and could use any advice there.

A few years ago, I prepared a document outlining the various Arm intrinsics APIs in .NET, explaining what each one does along with an example. I still need to publish the last set of them, but it is probably a good starting point.

@brianpopow
Collaborator

Thanks @kunalspathak, the examples are really helpful to me!

@JimBobSquarePants
Member

For V4 we should be able to achieve a speedup by adopting the new APIs in .NET 7/8; see #2532.
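
Editorial sketch, not part of the original comment: assuming "the new APIs" refers to the cross-platform `Vector128` helpers added in .NET 7 (an assumption; the linked issue is not quoted here), a single code path can serve both x86 and ARM without per-architecture intrinsics. `PortableSimd.Sum` is a hypothetical name, and the code requires .NET 7 or later.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class PortableSimd
{
    // Hypothetical helper: sums ints using the cross-platform Vector128 API (.NET 7+).
    public static int Sum(ReadOnlySpan<int> values)
    {
        ref int start = ref MemoryMarshal.GetReference(values);
        Vector128<int> acc = Vector128<int>.Zero;
        int i = 0;

        // The same code maps to SSE on x86/x64 and to AdvSimd (NEON) on ARM64.
        if (Vector128.IsHardwareAccelerated)
        {
            for (; i <= values.Length - Vector128<int>.Count; i += Vector128<int>.Count)
                acc += Vector128.LoadUnsafe(ref start, (nuint)i);
        }

        // Fold the four lanes together, then add any leftover elements.
        int total = Vector128.Sum(acc);
        for (; i < values.Length; i++)
            total += values[i];
        return total;
    }
}
```

Compared with writing separate `Sse2` and `AdvSimd` branches, this style trades some per-architecture tuning for a single implementation to maintain.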
