
Scoring Reftests #167

Open
fantasai opened this issue Aug 3, 2023 · 11 comments

Comments


fantasai commented Aug 3, 2023

I was drafting up advice for when to split out reftests into multiple files or gather them into a single one, and one concern that came in was “combining reftests affects scores in WPT”, which... is not great. :/ Can we make this concern go away somehow?

I would like to be able to give advice based on arguments like “significantly improves understandability of the tests by factoring out common code on related test variants”, or “makes it easier to understand test results by keeping Level 1 and Level 2 features separate”, or “reduces test running time by an order of magnitude”, rather than bringing WPT test scores into it.

(Similarly, I don't want this concern to start weighing into the choice of whether to use testharness.js or reftests. Reftests are far more comprehensive for layout, so there are good reasons to use them.)


fantasai commented Aug 3, 2023

Actually splitting out the scoring for reftests seems hard, but, maybe we can weight them differently? "This test is really 12 tests." Or list them out with descriptions, similar to testharness.js descriptions, but without the ability to report individual scores? That would maybe also encourage people to comment on what exactly they're testing...


KyleJu commented Aug 9, 2023

Let me bring it up in our WPT sync meetings and see if there are best practices for this!


KyleJu commented Aug 18, 2023

@fantasai

> combining reftests affects scores in WPT

The short answer to this question is yes. It will affect WPT scores, and the size of the effect depends on how many reftests are changed and which ones. The exact details are worth discussing at the wpt infra meetings. Would you mind filing a similar bug in https://github.com/web-platform-tests/interop/issues so it can be brought up in future meetings?


fantasai commented Aug 18, 2023

>> combining reftests affects scores in WPT
>
> The short answer to this question is yes.

That wasn't a question.

> It will affect WPT scores

That is the premise under which this issue was filed, yes.

> Would you mind filing a similar bug in https://github.com/web-platform-tests/interop/issues so it can be brought up in future meetings?

Would you mind explaining what you think this issue is about or what kind of action by Interop you think would fix it? Because I'm not sure I understand why you want to transfer it to Interop.


KyleJu commented Aug 22, 2023

>>> combining reftests affects scores in WPT
>>
>> The short answer to this question is yes.
>
> That wasn't a question.
>
>> It will affect WPT scores
>
> That is the premise under which this issue was filed, yes.
>
>> Would you mind filing a similar bug in https://github.com/web-platform-tests/interop/issues so it can be brought up in future meetings?
>
> Would you mind explaining what you think this issue is about or what kind of action by Interop you think would fix it? Because I'm not sure I understand why you want to transfer it to Interop.

My bad, I should have explained a bit more. Your question is around understanding the scoring of reftests after proposed actions. In that case, the Interop group will have better expertise for it. They have a weekly meeting about WPT infra (not strictly limited to Interop) that discusses a range of topics. @jcscottiii Just to confirm with you, do you think the interop repo is a good place to transfer this issue?

@jcscottiii

Yes, interop is a better location for this.

@fantasai

> Your question is around understanding the scoring of reftests after proposed actions.

Sorry, I must have been entirely incomprehensible in my original posting. I'm not asking a question in order to clarify my understanding of reftest scoring. I understand how they are scored. I have a problem with how they are scored, and I am asking for that problem to be fixed.

Specifically, I'm asking wpt.fyi to be able to score reftest files that contain multiple tests as having multiple tests. Doing this will require such reftests to include some additional information detailing what or at least how many subtests they contain, and it will require wpt.fyi to be able to process that information.

The reason I'm asking for this is that people writing tests are suggesting that wpt.fyi scoring should be a consideration in how tests are structured, and I don't want it to be one. The choice of whether to use reftests or testharness.js tests, or whether to structure reftests into one file or multiple files, should be based on the understandability of the test and its code, and on whether the feature is more comprehensively tested via reftest or testharness methods, not on whether it will improve or detract from any individual browser's wpt.fyi score.

Is what I'm asking for making sense now?
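To make the proposal concrete, here is a minimal sketch of the kind of scoring it implies. Everything here is hypothetical: the per-file declared test count, the example paths, and the `score_run` helper are invented for illustration and are not part of wpt.fyi.

```python
def score_run(results):
    """Total up a run where each reftest file declares how many
    logical tests it contains (a hypothetical annotation).

    `results` maps test path -> (tests_passed, tests_declared).
    A plain reftest is (0 or 1, 1); a combined reftest file that
    "is really 12 tests" might report e.g. (10, 12).
    """
    passed_total = sum(p for p, _ in results.values())
    declared_total = sum(d for _, d in results.values())
    return passed_total, declared_total

# One combined reftest counted as 12 tests, plus two single reftests
# (hypothetical paths):
runs = {
    "/css/combined-reftest.html": (10, 12),
    "/css/simple-a.html": (1, 1),
    "/css/simple-b.html": (0, 1),
}
print(score_run(runs))  # (11, 14): 11 of 14 logical tests pass
```

Under today's scoring the same run would instead count 2 of 3 files passing, which is the distortion being described.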

gsnedders transferred this issue from web-platform-tests/wpt.fyi Oct 3, 2023

foolip commented Oct 3, 2023

@fantasai Before designing a technical solution, I'd like to try just pushing back against the arguments against test changes you were getting.

I think tests should be authored in the way that makes them easy to understand, debug and maintain, and that no improvement in those dimensions should be rejected because it changes metrics. If we're talking about Interop 2023 metrics specifically, if there's a test change proposal that makes the tests better (without scope change) but makes Chrome's scores worse, the Chrome team should still support that change.

If you're the one who wants to make the changes, and the score in question is Interop 2023, you can file a test change proposal to get the Interop Team's blessing for the change. Hopefully that would be enough to settle the discussion in the specific case.

For a technical solution to the scoring problem, I think it would have to be a generic mechanism for weighting, so that we can make any change neutral in its effect on the score. That includes both testharness.js and reftest refactoring, and conversions between the two types of tests. However, I would prefer if we have a concrete case where we've failed with the test change proposal approach before considering that.

@gsnedders

> I was drafting up advice for when to split out reftests into multiple files or gather them into a single one, and one concern that came in was “combining reftests affects scores in WPT”, which... is not great. :/ Can we make this concern go away somehow?

This seems like a policy question for the Interop team, given it is a question about mitigating unintended incentives from the scoring systems being used.

While https://github.com/web-platform-tests/results-analysis and/or https://github.com/web-platform-tests/wpt.fyi own the technical implementations of the Interop and BSF scoring, questions about what the scoring systems should be are ones for the Interop team.

I will also note that there's not really any such thing as a "wpt.fyi score": wpt.fyi shows at least three different scores (the subtest totals in the default results view, the BSF graph on the homepage, and the Interop scores on the Interop dashboard).

The proposed technical solution:

> I'm asking wpt.fyi to be able to score reftest files that contain multiple tests as having multiple tests.

…implies two changes, I think:

  1. Allow subtests of reftests,
  2. Allow subtests to be scored as whole tests

(1) is a question for the RFC process (and it is on that basis during the WPT Infrastructure meeting that issue got moved), whereas (2) is a question for the Interop team to consider (as a matter of scoring policy).

For (2), it's worth pointing out that subtests are currently scored as fractions of a test for both Interop and BSF (i.e., for n subtests, each subtest counts as 1/n). There's currently no distinction between test types here; it's merely that some test types, like reftests, don't support subtests.


jgraham commented Oct 9, 2023

In terms of actually allowing subtests of reftests, I can imagine two possible ways to do that:

  1. Consider each variant (identified by a unique query string or fragment, and requiring a unique load) to be a subtest rather than a unique top-level test (unlike testharness tests).
  2. Have some mechanism to map different areas of the rendering to different subtest results.

Both of those seem technically challenging. The first would require significantly restructuring the code that schedules tests and collects their results, and the second would require changes to every reftest implementation, including ones that live in vendor code.
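The first option amounts to grouping per-URL results so that variants of one file (same path, different query string or fragment) become subtests of a single parent test. A hypothetical sketch of that grouping, not how wptrunner currently schedules reftests:

```python
from collections import defaultdict
from urllib.parse import urlsplit

def group_variants(results):
    """Turn {full test URL: status} into {parent test: {variant: status}},
    treating the query string / fragment as the subtest name.
    (Hypothetical illustration of option 1 above.)"""
    grouped = defaultdict(dict)
    for url, status in results.items():
        parts = urlsplit(url)
        variant = ""
        if parts.query:
            variant += "?" + parts.query
        if parts.fragment:
            variant += "#" + parts.fragment
        grouped[parts.path][variant or "(default)"] = status
    return dict(grouped)

# Hypothetical reftest variants of one file:
runs = {
    "/css/grid/placement.html?rows": "PASS",
    "/css/grid/placement.html?columns": "FAIL",
    "/css/grid/placement.html#areas": "PASS",
}
print(group_variants(runs))
# one parent test with three subtests, one of which fails
```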

So, like @foolip I'd like to start by examining the premise here.

For interop we have quite intentionally not considered weighting beyond the idea that one test is worth one point, and subtests are worth equal fractions of the parent tests (i.e. a reftest is scored as 0 or 1, whereas a testharness test with N subtests is scored in increments of 1/N from 0 to 1).

No one believes this is fair in all cases. One could have a situation where some very important bug is one subtest amongst many less important ones, or where we overweight reftests because all the testharness tests are in a single file. But the reason for keeping a simple system is that, in general, it would be a huge amount of effort to try to design a "fair" weighting for each focus area, and it's unclear that the process would even converge (people might just disagree about what the most important cases are).

If we're starting to see a feedback effect where people design their tests around their preferred Interop score (burying things that they want to be scored low, or writing a lot of tests in some areas just to get a good score) then we need to have a broader conversation about how to discourage that; it seems unlikely that the solution will be primarily technical in nature, and very unlikely to be as easy as enabling an extra degree of freedom for test authors.

@gsnedders

Ignoring the scoring question for a minute, I think there's a reasonable argument that we should be able to score subtests of reftests.

We have, e.g., https://wpt.fyi/results/css/css-flexbox/abspos/flex-abspos-staticpos-align-self-003.html?run_id=5094504119402496&run_id=5137215690113024&run_id=5145266740527104 where we have a test using checkLayout to implement a layout test where some subtests pass and others fail. If this were a single reftest, we'd lose regression testing coverage of the subtests which currently pass.

And as e.g. web-platform-tests/wpt#38833 shows, there are groups that want such tests to be reftests rather than using checkLayout.

Even looking at things that are currently reftests, https://wpt.fyi/results/css/css-text/text-indent/text-indent-each-line-hanging.html?run_id=5094504119402496&run_id=5137215690113024&run_id=5145266740527104 has "subtests" that pass cross-browser, and others which fail in Chrome & Firefox. Being able to surface that would be useful, and (again) provide regression testing coverage of the "subtests" which currently pass.

But as @jgraham says above, adding subtests is a relatively major change however we do it.
