
No longer accepting plaintext only frameworks / Limited number of tests mutations #8420

NateBrady23 opened this issue Sep 14, 2023 · 12 comments

@NateBrady23
Member

Hi everyone!

As the number of new frameworks submitted to the benchmarks grows, the amount of time it takes to complete a full run does as well. Because of this, we will be implementing the following rules:

  • New frameworks that only implement plaintext will no longer be accepted. Of course, we'd like all frameworks to implement all tests to get a better idea of performance in various areas of the framework, but we expect at least two different tests to be implemented: ideally plaintext or JSON plus one database test.

  • The number of test mutations will be limited to 10. We do not mind if you open up pull requests between runs to try out various mutations for your framework so long as the total number at any given time does not exceed 10.

After the next round, we will ping framework maintainers to make these changes. We will also look to remove tests that are older and no longer maintained.

Thank you!

NateBrady23 pinned this issue Sep 14, 2023
@fakeshadow
Contributor

Rules like these show how popular the project has become, and I agree with both.
On top of that, I suggest calculating the composite score per mutation, which would offer a quick view of per-mutation detail.

@gi0baro
Contributor

gi0baro commented Sep 15, 2023

@nbrady-techempower on the number of mutations, I proposed #8055 some time ago but then dropped it given the community feedback. It might be worthwhile to re-check it.

@joanhey
Contributor

joanhey commented Sep 17, 2023

I like it a lot, but there has been a problem for a long time: the moment a framework is removed, all of its history disappears from the Rounds.

As I said before, the Rounds need to be immutable.
For example, in PHP we had to change the name because it was php5; after the change, plain PHP no longer appears in the old Rounds.
We have the numbers and the work done, but they don't show in the Rounds.

@otrosien
Contributor

otrosien commented Oct 26, 2023

One framework to remove: Baratine. The domain baratine.io is no longer registered to the project (careful, clickbait!), and the GitHub project was last changed 7 years ago (https://github.com/baratine/baratine).

@joanhey
Contributor

joanhey commented Oct 26, 2023

In reality, Baratine is marked as Stripped.

Why not skip all the stripped frameworks in the runs?

https://github.com/search?q=repo%3ATechEmpower%2FFrameworkBenchmarks+%5C%22Stripped%5C%22+OR+%5C%22stripped%5C%22+path%3A%2F%5Eframeworks%5C%2F%2F&type=code

@fakeshadow
Contributor

> In reality, Baratine is marked as Stripped.
> Why not skip all the stripped frameworks in the runs?

I disagree. In xitca-web, the Stripped bench is used to avoid polluting the default leaderboard while still tracking the performance of low-level system software such as the OS and the language (and/or program) runtime. In fact, Stripped is a fairly arbitrary category, because there are even more unrealistic benches marked as Realistic. Unless there is a unified standard for determining which benches must be Stripped, it's unfair to skip them.

@joanhey
Contributor

joanhey commented Oct 26, 2023

@fakeshadow OK.
I'm happy this information is useful.

As for what needs to be Stripped, I think that's work for all the devs here: help clarify the requirements and also identify the frameworks that bypass them.

@fakeshadow
Contributor

fakeshadow commented Oct 27, 2023

> @fakeshadow OK. I'm happy this information is useful.
>
> As for what needs to be Stripped, I think that's work for all the devs here: help clarify the requirements and also identify the frameworks that bypass them.

Unfortunately, the meaning of "Realistic" is subjective, and from the existing bench code it's clear we have very divided opinions among bench maintainers. Therefore I doubt common ground can be reached easily.
Actually, I'm fine with the current configuration, where the category is up to the maintainers to decide. When people look into the code and figure it out, they'll know which framework and its community share the same opinion.
In other words, as long as a stripped bench can run in non-official benches, I personally find it fine. As for broken (or outdated) benches, I believe we can use a "broken" tag to stop them from hogging resources in runs.

@billywhizz
Contributor

One thing I've been thinking is not quite fair: combining results from different framework mutations into the composite score. Surely the composite score should reflect a single configuration and that configuration's performance across all benches?

For example, if we look at ntex, which was top of the last official round, the different flavours get wildly different scores across the different benchmarks. Is it fair to pick the best mutation in each category and combine those for the composite? Is it even possible to run a single service on ntex that would score highly across all benches? It doesn't seem so, but this is surely what the composite score should be measuring.

Maybe a better system would be to sum up the scores across all benchmarks for a particular mutation and then, for each framework, choose the mutation that got the best composite score?

Maybe this has been raised before; sorry for bringing it up again if so.
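A minimal sketch of the selection rule proposed above, assuming per-benchmark scores are already normalized; the mutation names and numbers are made up for illustration, and the actual composite calculation may weight tests differently:

```rust
use std::collections::HashMap;

/// scores: mutation name -> (benchmark name -> normalized score).
/// Returns the mutation whose own composite (sum across benches) is highest.
fn best_mutation_composite(
    scores: &HashMap<String, HashMap<String, f64>>,
) -> Option<(String, f64)> {
    scores
        .iter()
        .map(|(mutation, per_bench)| (mutation.clone(), per_bench.values().sum::<f64>()))
        .max_by(|a, b| a.1.total_cmp(&b.1))
}

fn main() {
    let scores = HashMap::from([
        (
            "ntex [tokio]".to_string(),
            HashMap::from([("plaintext".to_string(), 95.0), ("fortunes".to_string(), 60.0)]),
        ),
        (
            "ntex [async-std]".to_string(),
            HashMap::from([("plaintext".to_string(), 80.0), ("fortunes".to_string(), 85.0)]),
        ),
    ]);
    // One configuration is scored as a whole, instead of cherry-picking the
    // best mutation per benchmark and combining those.
    if let Some((mutation, composite)) = best_mutation_composite(&scores) {
        println!("best single configuration: {mutation} ({composite})");
    }
}
```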

@fakeshadow
Contributor

fakeshadow commented Dec 8, 2023

> One thing I've been thinking is not quite fair: combining results from different framework mutations into the composite score. Surely the composite score should reflect a single configuration and that configuration's performance across all benches?
>
> For example, if we look at ntex, which was top of the last official round, the different flavours get wildly different scores across the different benchmarks. Is it fair to pick the best mutation in each category and combine those for the composite? Is it even possible to run a single service on ntex that would score highly across all benches? It doesn't seem so, but this is surely what the composite score should be measuring.
>
> Maybe a better system would be to sum up the scores across all benchmarks for a particular mutation and then, for each framework, choose the mutation that got the best composite score?
>
> Maybe this has been raised before; sorry for bringing it up again if so.

I agree with you on the composite score issue. Besides incompatible features, it's common practice in the bench for frameworks to implement low-level json and/or plaintext to boost their composite score, which is questionable to say the least.

Speaking of ntex, from what I see the current bench has to choose one async runtime, which means its tokio and async-std flavor scores can't be achieved at the same time. That said, it's possible to modify the code to combine multiple runtimes and get the best of each; that would be a big refactor, but it can be done.
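A rough sketch of what such a combination could look like, assuming both the tokio and async-std crates as dependencies; the services themselves are left as placeholders, and this is not how any existing benchmark is structured:

```rust
use std::thread;

fn main() {
    // Hypothetical layout: each runtime gets its own OS thread, so the
    // tokio-flavored and async-std-flavored endpoints could be served from
    // a single binary at the same time.
    let tokio_half = thread::spawn(|| {
        tokio::runtime::Builder::new_multi_thread()
            .enable_all()
            .build()
            .expect("failed to build tokio runtime")
            .block_on(async {
                // serve the endpoints that score best on tokio here
            });
    });
    let async_std_half = thread::spawn(|| {
        async_std::task::block_on(async {
            // serve the endpoints that score best on async-std here
        });
    });
    tokio_half.join().unwrap();
    async_std_half.join().unwrap();
}
```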

@MarkReedZ
Contributor

Should we remove frameworks like gnet? It only implements plaintext and isn't actually doing any parsing or routing: it just scans to the \r\n\r\n and sends a canned response, which doesn't meet the test requirements.
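For context, a minimal sketch of the shortcut being described, written in Rust for consistency with the other sketches in this thread (gnet itself is Go, and this is not its actual code): the server never parses the request line, headers, or route; it only scans for the end of the request head and writes a fixed response.

```rust
use std::io::{Read, Write};
use std::net::TcpListener;

// Fixed response sent regardless of what was requested.
const CANNED: &[u8] =
    b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 13\r\n\r\nHello, World!";

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:8080")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        let mut buf = Vec::new();
        let mut chunk = [0u8; 1024];
        // Scan for "\r\n\r\n" (end of the request head); nothing is parsed.
        while !buf.windows(4).any(|w| w == b"\r\n\r\n") {
            let n = stream.read(&mut chunk)?;
            if n == 0 {
                break;
            }
            buf.extend_from_slice(&chunk[..n]);
        }
        // No routing, no header handling; one request per connection for brevity.
        stream.write_all(CANNED)?;
    }
    Ok(())
}
```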

@remittor
Contributor

@MarkReedZ, your project also has bugs:
#9055
