Consider creating a game math library benchmark for the working group #93

Open · kettle11 opened this issue Aug 25, 2020 · 16 comments

@kettle11 commented Aug 25, 2020

Given that the working group recently took ownership of an ECS benchmark, it seems appropriate to also have a game math library benchmark. Game math libraries are even more heavily benchmarked and debated than ECS frameworks.

A benchmark from the working group provides a common point of reference everyone can contribute to on neutral ground. The goal is to provide useful information to help people make informed choices about the Rust ecosystem.

Benchmarks provided by the working group should aim to help people holistically evaluate libraries. Ideally such a benchmark would also include metrics for compile times and perhaps lines of code (as a rough measure of functionality and complexity).
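
As a sketch of what a compile-time metric could look like (this is an illustration, not an existing mathbench feature, and the crate names are just examples), a small wrapper could time a clean per-crate rebuild:

```rust
// Hypothetical compile-time probe: times a clean rebuild of each math crate.
// Crate names are examples; this is not part of mathbench today.
use std::process::Command;
use std::time::Instant;

fn main() {
    for krate in ["glam", "nalgebra", "ultraviolet"] {
        // Clean only this crate so the timing reflects its rebuild cost.
        Command::new("cargo")
            .args(["clean", "-p", krate])
            .status()
            .expect("failed to run cargo clean");
        let start = Instant::now();
        let status = Command::new("cargo")
            .args(["build", "-p", krate])
            .status()
            .expect("failed to run cargo build");
        assert!(status.success());
        println!("{krate}: rebuilt in {:.2?}", start.elapsed());
    }
}
```

Lines of code could similarly be approximated by running an off-the-shelf counter over each crate's source.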

@bitshifter, @sebcrozet, and @termhn have all created their own benchmarks; perhaps they have thoughts?

@fu5ha commented Aug 25, 2020 via email

@bitshifter

I'd be happy for the working group to take ownership of mathbench. I think it's been useful to the community, but it's usually way down my list of things to work on when I have free time, so it's a bit unloved.

It would be good to get the wide nalgebra/ultraviolet benches into the same repo with the scalar benches intact, as @termhn mentioned (see bitshifter/mathbench-rs#21).

If the working group were to take ownership of the code, I think they would also need to take ownership of publishing the results and updating them periodically when existing libraries are updated or new libraries are added. I publish results to my GitHub site, https://bitshifter.github.io/mathbench/0.3.0/report/index.html; @sebcrozet and @termhn have published their own results on their own blogs/READMEs. I think it would be good to have a central location for keeping these.

The other thing to do when publishing results is to update the summary in the README and document the hardware and OS used to generate them. I also make a tag when publishing, so it's easy to see which lib versions were used. I've consistently used the same hardware, an old laptop of mine; however, that laptop doesn't support AVX-512, so it couldn't run some of the wide benchmarks. It's probably not the end of the world if the hardware changed between publishing runs, but it would be better if it didn't.

The benchmarks take a long time to run, and you can't really use the machine for anything else while they are running, which is another reason I haven't been updating them.

@AlexEne (Member) commented Sep 9, 2020

@bitshifter Can you add this information somewhere in the repo? A sort of CONTRIBUTING.md for maintainers where this is described, so it's not lost in this issue.
I also have some questions about the hardware that should be used for these: do you run them on your machine or on some EC2 instances? (Sorry if this was already discussed in meetings, but I can never make it to a wg meeting.)

On a more meta level, should we wait for @termhn's proposed changes to land before moving it?
Who from the WG has bandwidth to help with this? (Assign this to yourself and ping me for permissions if needed.)

@bitshifter commented Sep 9, 2020

Sure, I can document guidelines for publishing results.

Hardware-wise, I generally run on my own laptop. I think it's useful to use the same hardware each time I update the results. The downside is that the machine is 5 years old and doesn't have recent CPU features that some libraries want to take advantage of. I have not investigated a cloud solution; it sounds fine in theory, provided it can guarantee that nothing else is using resources while mathbench is running and that the hardware is known and consistent.

I don't know if @termhn intended to try to get these changes back into mathbench or to keep them as a fork. It is probably a bit of work to get those changes back into the main repo, just because they were quite extensive.

On that note, I recently updated mathbench to include ultraviolet. I was holding out until I'd added wide support, but I still haven't found the time for that, and since I had a PR to add another library, it seemed like I might as well add ultraviolet at the same time. The ultraviolet support is mostly based on @termhn's fork (without the 0.6-pre changes).

I would still like to add wide tests. I'd like to keep them separate, but have one of the scalar libs run them for comparison. I was possibly going to take a slightly different approach to what @sebcrozet did in his fork: have a bench with, say, 100 elements in it and run it through the different width types, rather than having a bench for each type width, if that makes sense. I was thinking of producing separate scalar and wide summary tables; the current scalar summary table is getting pretty huge on its own.

@sebcrozet's fork also added a lot of benches for types that other libraries don't generally have, which is fine; my original intention for mathbench was a comparison of the lowest common denominator of math library features. In some sense there's no harm in adding "exotic" features, it's just that there won't be much to compare them against, so maybe they're not so useful in the "official" repo?

I think there is some sense in people forking mathbench and adding benches that make sense for their library, or compiler flags that make sense for their library. I see no harm in that.

@fu5ha commented Sep 9, 2020

Oh nice... I'll probably try to "rebase" my work on top of your current mathbench then, @bitshifter.

@fu5ha commented Sep 10, 2020

> I was possibly going to take a slightly different approach to what @sebcrozet did in his fork: have a bench with, say, 100 elements in it and run it through the different width types, rather than having a bench for each type width, if that makes sense.

If I understand what you mean, every type would do the same number of total iterations (and as such, wide types would process more total values, but do the same number of ops)... if so, I'm not sure I really like that approach, as I think it sort of obfuscates the higher throughput and makes it harder to reason about. Of course the current method isn't perfect either: it assumes you are able to start and end in wide types for your algorithm, which isn't always true, but I think it's still a valid case to test (it's how I use ultraviolet in rayn), and it makes it easier to compare throughputs, like I said before.

> I was thinking of producing separate scalar and wide summary tables; the current scalar summary table is getting pretty huge on its own.

Yeah makes sense to me.

@bitshifter

> If I understand what you mean, every type would do the same number of total iterations (and as such, wide types would process more total values, but do the same number of ops)... if so, I'm not sure I really like that approach, as I think it sort of obfuscates the higher throughput and makes it harder to reason about. Of course the current method isn't perfect either: it assumes you are able to start and end in wide types for your algorithm, which isn't always true, but I think it's still a valid case to test (it's how I use ultraviolet in rayn), and it makes it easier to compare throughputs, like I said before.

No, not the total number of iterations. I'm suggesting the same number of inputs is used for each type. Wider types would be doing fewer iterations because they are processing 4, 8 or 16 elements at a time. So with, say, 100 single Vec3 inputs, glam would process 1 at a time, an f32x4 type would process 4 at a time, an f32x8 type 8 at a time, and so on. That should make the throughput advantage of wide types clearer, I think.

What it doesn't show is the timing of a single function call for each wide type (like how long a single Vec3x4::dot takes), which is what most of the scalar benches are trying to measure (i.e. the scalar benches give a good idea of the cost of Vec3::dot, for example).

I feel like using the same input size would give a better picture of the throughput advantage of the wider types, though. I could add both single-call and throughput benches; it's just more to write and takes longer to run. A sketch of what I mean is below.
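
To make the shape of this concrete, here is a minimal sketch of the fixed-input-size idea using criterion, with plain arrays standing in for real math-library types (the names, sizes, and data layout here are illustrative assumptions, not mathbench's actual code). Both benches consume the same 1000 inputs; the scalar version does 1000 dot products one at a time, while the 4-wide version does 250 iterations of 4 lanes each, so the timing difference directly reflects throughput:

```rust
// Sketch of the fixed-input-size approach; illustrative only.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

const SIZE: usize = 1000;

fn dot_benches(c: &mut Criterion) {
    // AoS input for the scalar bench: 1000 (a, b) pairs of 3-component vectors.
    let scalar: Vec<([f32; 3], [f32; 3])> = (0..SIZE)
        .map(|i| ([i as f32; 3], [i as f32 + 1.0; 3]))
        .collect();

    c.bench_function("scalar vec3 dot, 1000 inputs", |bench| {
        bench.iter(|| {
            let mut sum = 0.0f32;
            // 1000 iterations, one dot product each.
            for (a, b) in black_box(&scalar) {
                sum += a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
            }
            sum
        })
    });

    // SoA input for the "wide" bench: each element packs 4 lanes per
    // component, standing in for an f32x4-based Vec3x4. Same 1000 inputs,
    // but only SIZE / 4 = 250 iterations.
    let wide: Vec<([[f32; 4]; 3], [[f32; 4]; 3])> = (0..SIZE / 4)
        .map(|i| ([[i as f32; 4]; 3], [[i as f32 + 1.0; 4]; 3]))
        .collect();

    c.bench_function("4-wide vec3 dot, 1000 inputs", |bench| {
        bench.iter(|| {
            let mut sum = [0.0f32; 4];
            // 250 iterations, four dot products per iteration.
            for (a, b) in black_box(&wide) {
                for lane in 0..4 {
                    sum[lane] += a[0][lane] * b[0][lane]
                        + a[1][lane] * b[1][lane]
                        + a[2][lane] * b[2][lane];
                }
            }
            sum
        })
    });
}

criterion_group!(benches, dot_benches);
criterion_main!(benches);
```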

@fu5ha commented Sep 10, 2020

I don't see how that is different from the way @sebcrozet implemented it (though I could just still not be understanding, of course 😅)

I agree with you though, afaict

@bitshifter

It's probably no different; I'm not super familiar with his fork :) The main thing is I would keep the existing scalar benches, and I wouldn't have all of the scalar types run the wide benches, except maybe one for comparison. Mostly because I think there's limited value in it for the scalar libs, and it adds to the time it takes to run the benches.

@fu5ha commented Sep 10, 2020

One more thing... I currently have a couple of benches implemented with both f64 and f32 versions of the wide types, but at this point I'm not sure it's actually worth it, to be honest. I think I'm gonna rip that out and just keep it consistently f32 across the board.

@Lokathor (Member)

Ralith will cry

@fu5ha commented Sep 10, 2020

Well, there aren't gonna be benches for scalar f64 across the board anyway, so 😅

As far as all the current benchmarks go, any perf trends that are true of f32s are basically true of f64s; f64s are just something like 3x slower across the board.

@bitshifter

I don't have a problem with dropping f64. Someone could potentially add it back at a later date for people who want it. Hopefully the existing macros could be used for the f64 benches, since they should be largely the same as the f32 ones.

@fu5ha commented Sep 10, 2020

Here's the approach I'm taking, which I think I'll basically just copy out to the other benches: https://github.com/termhn/mathbench-rs/blob/wide/benches/eulerbench.rs

@bitshifter commented Sep 10, 2020

Sounds good to me.

I was thinking of passing a stride to the euler_bench macro and having the macro handle the &(size / 8) bit, just to streamline things a bit more. Also, I think @sebcrozet's version used &((*size as f32 / 8.0).ceil()), which makes sure the input isn't truncated if the size isn't a multiple of the stride. That's kind of verbose, which is another reason I think it would be good if it could be handled by the macro; see the sketch below.
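
As a rough sketch (not mathbench's actual macro; the name wide_iters is made up), the rounding could live in a small helper macro that each bench macro calls with the type's stride. Using integer ceiling division also avoids the verbose float cast:

```rust
// Hypothetical helper: computes how many wide iterations are needed to cover
// `size` inputs at `stride` lanes per iteration, rounding up so a size that
// isn't a multiple of the stride still covers every input.
macro_rules! wide_iters {
    ($size:expr, $stride:expr) => {
        // integer ceiling division: (size + stride - 1) / stride
        ($size + $stride - 1) / $stride
    };
}

fn main() {
    assert_eq!(wide_iters!(100usize, 8usize), 13); // 100 inputs, 8 lanes: last iteration is partial
    assert_eq!(wide_iters!(96usize, 8usize), 12); // exact multiple: no rounding needed
}
```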

Note that a lot of the existing benches don't take a size parameter, so you'll need to make a version of them that can handle that. Fairly easy to do, just repetitious.

@fu5ha commented Sep 11, 2020

opened bitshifter/mathbench-rs#24
