Faster fblits() for multiple blits of the same Surface #2825

Open
wants to merge 28 commits into
base: main

itzpr3d4t0r
Member

@itzpr3d4t0r itzpr3d4t0r commented Apr 26, 2024

Optimizing Multi Image Blitting

In a typical blit sequence, we often need to draw the same surface at multiple positions. The current approach is:

screen.fblits([ (surf, pos1), (surf, pos2), (surf, pos3), ... ])

This method is straightforward but not optimal. The same checks run again for every entry, even when the surface is identical. Those checks take time and touch other memory, so the destination's pixels may already have been evicted from the CPU cache before the actual blit starts.

For each entry, the current process first works out which kind of blit is needed and only then reads and writes the pixels. By the time the next entry in the sequence is reached, enough time has passed that most of the benefit of temporal cache locality is lost.

I propose a new method for blitting multiple copies of the same Surface onto another Surface. This method allows for faster iteration, reduced checking overhead, and improved temporal coherency, resulting in higher performance.

The proposed approach involves a new format in fblits:

screen.fblits([ (surf, [pos1, pos2, pos3]), ... ])

This format takes a tuple containing a reference to the surface and a list of positions we want to draw to.
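
As a rough illustration of how existing call sites could be adapted, here is a minimal sketch that converts the old `(surf, pos)` list into the proposed `(surf, [pos, ...])` layout. `group_blits` is a hypothetical helper and not part of pygame or this PR; it groups by object identity and only builds the data structure.

```python
def group_blits(pairs):
    """Turn [(surf, pos), ...] into [(surf, [pos, ...]), ...].

    Surfaces are grouped by identity (the same object), and groups keep
    the order in which each surface first appears.
    """
    grouped = {}
    for surf, pos in pairs:
        # One (surface, positions) entry per distinct surface object.
        entry = grouped.setdefault(id(surf), (surf, []))
        entry[1].append(pos)
    return list(grouped.values())


# Hypothetical usage with the format proposed in this PR:
#   screen.fblits(group_blits([(surf_a, (0, 0)), (surf_b, (5, 5)), (surf_a, (10, 10))]))
```

Note that grouping changes the draw order: every copy of one surface is drawn before the next surface starts, which matters when different surfaces overlap.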

Inside the function, we can use memcpy for blitting the same surface more efficiently and run the checks only once, leading to significant performance improvements.
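
To make the idea concrete, here is a pure-Python model of it. It is only a sketch under stated assumptions, not the C code in this PR: `blit_copies` is a hypothetical name, both buffers are assumed to be tightly packed (pitch equals width times bytes per pixel), and clipping is left out. The surface-level checks run once, and each position then costs one slice copy per row, which stands in for `memcpy`.

```python
def blit_copies(dst, dst_w, dst_h, src, src_w, src_h, positions, bpp=4):
    """Copy `src` into `dst` at every (x, y) in `positions`, row by row.

    `dst` must be a writable buffer such as a bytearray. Every position
    must keep the source fully inside the destination (no clipping here).
    """
    # Checks that depend only on the two surfaces run once, not per position.
    if len(dst) != dst_w * dst_h * bpp or len(src) != src_w * src_h * bpp:
        raise ValueError("buffer size does not match the given dimensions")

    row_bytes = src_w * bpp   # bytes in one source row
    dst_pitch = dst_w * bpp   # bytes in one destination row
    src_view = memoryview(src)
    dst_view = memoryview(dst)

    for x, y in positions:
        if not (0 <= x <= dst_w - src_w and 0 <= y <= dst_h - src_h):
            raise ValueError(f"position {(x, y)} is out of bounds")
        start = y * dst_pitch + x * bpp
        for row in range(src_h):
            d = start + row * dst_pitch
            s = row * row_bytes
            # Slice assignment between memoryviews is a plain byte copy.
            dst_view[d:d + row_bytes] = src_view[s:s + row_bytes]
```

The source rows are read repeatedly in a tight loop, which is the access pattern the proposal relies on to keep them in cache.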

Conclusions

How to use:
To use this, pass a tuple in the format (Surface_To_Draw, List_Of_Positions_To_Draw):

import pygame
pygame.init()

screen = pygame.display.set_mode((500, 500))

surf = pygame.Surface((20, 20))
surf.fill("red")

positions = [(0, 0), (100, 100), (30, 30)]

screen.fblits([(surf, positions)], 0, True)  # sets the flag to 0 and cache=True to actually use the caching

TODO:

  • Respect destination clip rect
  • Respect blitting to a subsurface
  • Investigate self blitting (ensure it works)
  • Ability to partially draw a surface
  • Support for Rect/Frect and more sequences as blit positions
  • (FUTURE WORK) Support blend flags, alpha and colorkey.
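
For the first TODO item, the per-position work could look roughly like the sketch below. This is an assumption about one possible approach, not what the PR implements: `clip_blit` is a hypothetical helper that intersects the blit rectangle with the destination clip rect and returns where to start reading in the source, so a row-copy loop only touches the visible part.

```python
def clip_blit(pos, src_size, clip):
    """Clip a blit at `pos` of a `src_size` surface against `clip`.

    `clip` is (x, y, w, h). Returns (dst_x, dst_y, src_x, src_y, w, h)
    for the visible region, or None if nothing is visible.
    """
    x, y = pos
    w, h = src_size
    cx, cy, cw, ch = clip

    # Intersection of the blit rectangle with the clip rectangle.
    left = max(x, cx)
    top = max(y, cy)
    right = min(x + w, cx + cw)
    bottom = min(y + h, cy + ch)

    if right <= left or bottom <= top:
        return None  # fully outside the clip rect

    # The source offset is how far the visible region starts inside the surface.
    return left, top, left - x, top - y, right - left, bottom - top
```

A position fully inside the clip rect comes back unchanged with a zero source offset, so the fast path for unclipped blits stays a full-row copy.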

The results so far are very promising:
(TO BE UPDATED)

Here is a little program I've been experimenting with to get these numbers:

from random import randint, random

import pygame

pygame.init()

screen_size = 1000

screen = pygame.display.set_mode((screen_size, screen_size))

size = 50
flag = 0

s = pygame.Surface((size, size))
pygame.draw.circle(s, (100, 0, 255), (size // 2, size // 2), size // 2)


class Particle:
    def __init__(self, x, y, vx, vy):
        self.x = x
        self.y = y
        self.vx = vx
        self.vy = vy

    def update(self, dt):
        self.x += self.vx * dt
        self.y += self.vy * dt

        if self.x < 0:
            self.x = 0
            self.vx *= -1
        elif self.x > screen_size - size - 1:
            self.x = screen_size - size - 1
            self.vx *= -1

        if self.y < 0:
            self.y = 0
            self.vy *= -1
        elif self.y > screen_size - size - 1:
            self.y = screen_size - size - 1
            self.vy *= -1

        return self.x, self.y


particles = [
    Particle(randint(0, screen_size - size), randint(0, screen_size - size),
             random() * 2 - 1, random() * 2 - 1)
    for _ in range(10000)
]

clock = pygame.time.Clock()

font = pygame.font.SysFont("Arial", 25, True)

modes = ["fblits", "blits", "fblits cached"]
mode_index = 0
mode = modes[mode_index]

while True:
    dt = clock.tick_busy_loop(1000) * 60 / 1000
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            pygame.quit()
            raise SystemExit
        elif event.type == pygame.MOUSEWHEEL:
            mode_index = (mode_index + event.y) % len(modes)
            mode = modes[mode_index]
        elif event.type == pygame.KEYDOWN:
            if event.key == pygame.K_0:
                flag = 0

    screen.fill(0)

    if mode == "fblits":
        screen.fblits([(s, p.update(dt)) for p in particles], flag)
    elif mode == "blits":
        screen.blits([(s, p.update(dt), None, flag) for p in particles])
    elif mode == "fblits cached":
        screen.fblits(
            [
                (s, [p.update(dt) for p in particles])
            ],
            flag
        )

    fps_text = font.render(
        f"FPS: {int(clock.get_fps())} | N: {len(particles)}",
        True, (255, 255, 255))
    pygame.draw.rect(screen, (0, 0, 0),
                     fps_text.get_rect(center=(screen_size / 2, screen_size / 2)))
    screen.blit(fps_text, fps_text.get_rect(center=(screen_size / 2, screen_size / 2)))

    mode_text = font.render(f"{mode}", True, (255, 255, 255))
    pygame.draw.rect(screen, (0, 0, 0),
                     mode_text.get_rect(center=(screen_size / 2, screen_size / 2 + 25)))
    screen.blit(mode_text,
                mode_text.get_rect(center=(screen_size / 2, screen_size / 2 + 25)))

    pygame.display.flip()

@itzpr3d4t0r itzpr3d4t0r added Performance Related to the speed or resource usage of the project SIMD Surface pygame.Surface labels Apr 26, 2024
@itzpr3d4t0r itzpr3d4t0r changed the title Surface.fblits() image caching for performace Surface.fblits() caching for performace Apr 26, 2024
@Starbuck5
Member

Starbuck5 commented Apr 29, 2024

I don't really understand the proposal, but piping in on this:

Create a register cache (of __m256i) and load the Surface’s pixels once.

It sounds like you want to store an arbitrary amount of data in SIMD registers-- that's not possible. There is a fixed number of these registers in each processor core, like 16. http://www.infophysics.net/amd64regs.png

XMM are the SSE registers, YMM are the AVX registers, ZMM (not shown in that diagram) are the AVX512 registers.

The intrinsics look like they're giving us registers, but it doesn't compile as 1-for-1 as it seems. I think some routines have more intrinsic variables than there are physical registers.

@Starbuck5 Starbuck5 changed the title Surface.fblits() caching for performace Surface.fblits() caching for performance Apr 29, 2024
@itzpr3d4t0r
Member Author

> I don't really understand the proposal, but piping in on this:
>
> > Create a register cache (of __m256i) and load the Surface’s pixels once.
>
> It sounds like you want to store an arbitrary amount of data in SIMD registers-- that's not possible. There are a set amount of these registers in each processor core, like 16. http://www.infophysics.net/amd64regs.png
>
> The intrinsics look like they're giving us registers, but it doesn't compile as 1-for-1 as it seems. I think some routines have more intrinsic variables than there are physical registers.

I'm aware of all of this, but it looked like it was working anyway. With further experimentation I found that using memcpy directly yielded similar if not better results while not requiring any heap allocation. This is only true for blitcopy though, as I still need to check the other blend modes.

@itzpr3d4t0r itzpr3d4t0r changed the title Surface.fblits() caching for performance Faster fblits() for multiple blits of the same Surface May 11, 2024
@itzpr3d4t0r itzpr3d4t0r removed the SIMD label May 18, 2024
@itzpr3d4t0r itzpr3d4t0r marked this pull request as ready for review May 18, 2024 14:57
@itzpr3d4t0r itzpr3d4t0r requested a review from a team as a code owner May 18, 2024 14:57
@MyreMylar
Member

Had a little peek at this.

Are you intending to add alpha blending support, or special blend mode support, in this PR or in a future one?

Labels
Performance Related to the speed or resource usage of the project Surface pygame.Surface

3 participants