Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating a SmallVec for every pixel is slow #172

Open
Shnatsel opened this issue Dec 27, 2022 · 9 comments
Open

Creating a SmallVec for every pixel is slow #172

Shnatsel opened this issue Dec 27, 2022 · 9 comments

Comments

@Shnatsel
Copy link
Contributor

I've been looking into the TODO in the README:

TODO profile if smallvec is really an improvement!

SmallVec is all over the place, so it's rather difficult to remove it and benchmark the before/after. I've used Rust's instrumentation-based coverage to find out how frequently a SmallVec is constructed. I've profiled the 7_crop_alpha_any_image example. Here's the report in HTML (trimmed so that only covered lines are shown): results_gt1.html.gz

I've found that an instance of this type is constructed for every pixel:

pub type FlatSamplesPixel = SmallVec<[Sample; 8]>;

Profiling shows that 55% of the execution time is spent just creating SmallVecs for every pixel in the crop function: https://share.firefox.dev/3WQyWUR

This will be even worse on images with more than 8 channels, where a heap allocation will be performed for every pixel.

It seems that the approach of creating a SmallVec for every pixel is fundamentally slow. A much more performant way to implement this would be returning a slice of the original data, without constructing an owned data structure.

@Shnatsel
Copy link
Contributor Author

The sample_vec_at function is to blame.

@johannesvollmer
Copy link
Owner

thanks for investing the time! great observations :)

@johannesvollmer
Copy link
Owner

the most important question is: is smallvec slower than a standard vec? because usually, slices are used where possible already. most of the time, smallvec is only used where sliced cannot be used unfortunately

@Shnatsel
Copy link
Contributor Author

Shnatsel commented Dec 28, 2022

Instead of creating a SmallVec for every pixel, you can have the caller provide a &mut [u8] to which the pixel will be written. This allows for a single allocation to be reused.

    pub fn output_sample_at(&self, position: Vec2<usize>, output_buffer: &mut [u8]) {
        output_buffer.copy_from_slice(self.samples_at(position))
    }

And the usage would look roughly like this:

// pre-allocate a buffer once; a Vec is faster in this case
let mut buffer = vec![0; bytes_per_pixel];
// iterate over all pixels in an image
// note: iteration order (rows first or columns first) 
// must match the image layout for performance
for x in 0..img_width {
    for y in 0..img_height {
        image.output_pixel_at(x, y, &mut buffer);
        // the pixel was written to the buffer
        process_pixel(&mut buffer); // whatever we want to do with it
    }
}

This API also allows for cheaper transformation of images without copying the data to/from the stack. You could allocate a large Vec for the output image, iterate over it with slice.chunks_exact(bytes_per_pixel) and have the iterator write directly to the output Vec, then make any further transformations in place if needed. This is great for images where only some parts of them have to be transformed, and the transformation preserves their size.

@johannesvollmer
Copy link
Owner

johannesvollmer commented Dec 29, 2022

I'm a little bit hesitant with this approach. I fear it is too low level. The content of the byte slice can not be interpreted without knowing about the channel layout of the particular image. What about passing a &mut SmallVec as an out argument?

Also, there is a function that avoids the allocation of the smallvec, returning an iterator instead. The iterator can be written into a pre allocated vector. Maybe this is something the crop function needs to use instead of sample_vec_at? This function might currently not exist in the general image trait, but maybe we can make that happen?

exrs/src/image/mod.rs

Lines 504 to 511 in 49fece0

pub fn samples_at(&self, position: Vec2<usize>) -> FlatSampleIterator<'_> {
FlatSampleIterator {
layer: self,
channel_index: 0,
position
}
}
}

@Shnatsel
Copy link
Contributor Author

Shnatsel commented Dec 29, 2022

Also, there is a function that avoids the allocation of the smallvec, returning an iterator instead. The iterator can be written into a pre allocated vector. Maybe this is something the crop function needs to use instead of sample_vec_at? This function might currently not exist in the general image trait, but maybe we can make that happen?

Yes, I believe this would resolve the performance pitfall. The function returning a SmallVec is just a convenience wrapper around this one, correct? If so, I suggest exposing this function everywhere and marking the one returning SmallVec as deprecated.


Some thoughts on the &mut [u8] for output are included below, but they're less relevant if the iterator is available (it's only a convenience wrapper around the iterator anyway).

The content of the byte slice can not be interpreted without knowing about the channel layout of the particular image.

I'm not sure I follow. The proposed function will write out the exact same sequence of bytes as the one returning a SmallVec, the difference is that the allocation is controlled by the caller.

The current calls look something like this (ignoring the Vec2 type):

let pixel = img.samples_at(x, y);
process_pixel(&mut pixel);

While a call to the proposed interface would be:

let buffer = vec![0; sample_size];
output_samples_at(x, y, &mut buffer);
process_pixel(&mut buffer);

Except nobody ever looks up just one pixel, so calling output_samples_at in a loop would be dramatically more efficient.

The only difference is that the required length of the buffer needs to be known before calling output_sample_at.

@johannesvollmer
Copy link
Owner

johannesvollmer commented Dec 29, 2022

I would like to implement a prototypal refactoring to show what I mean :) but maybe I cannot get it done today

My plan is to adjust the crop function to re-use one allocation, by calling the iterator based function. Then Profile to find out whether the Iterator is implement in a way that the compiler can optimize as expected. I don't want to remove the convenience wrapper returning a smallvec, but we can rename it and add documentation about the performance implications.

@johannesvollmer
Copy link
Owner

johannesvollmer commented Dec 31, 2022

@Shnatsel would you mind summarizing your approach to profiling? And how you generated those interactive reports? Those are pretty nice. What software did you use? I'm interested in profiling this library more

@Shnatsel
Copy link
Contributor Author

I have an entire article written about this, but it's currently not published. Drop me a line at the email in my Github profile and I'll share the draft with you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants