
Scoring: variants and multi-globals #256

Open
gsnedders opened this issue Dec 20, 2022 · 7 comments
Labels
meta Process and/or repo issues

Comments

@gsnedders
Member

One suggestion for webcodecs was to use all the tests which match the search video: https://wpt.fyi/results/webcodecs?label=master&label=experimental&aligned&view=subtest&q=video

However, due to extensive use of multi-global tests and variants, this ends up not working particularly well. Most obviously, webcodecs/videoDecoder-codec-specific.https.any.js ends up contributing about 20% of the overall score, as the single file responsible for ten of the 48 total tests matching that query.

This isn't the first time we've had problems like this with our scoring, but I think this is a much more extreme case than we've had otherwise.
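To make the inflation concrete, here is a minimal sketch of the arithmetic. Under equal per-test weighting, each (variant, global scope) combination generated from one `.any.js` file counts as its own top-level test; the counts below are taken from the comment above, and the 5×2 breakdown is an illustrative assumption, not the file's actual variant list.

```python
# Sketch of the inflation problem: one multi-global file with many variants
# dominates an equally-weighted average over all generated tests.
total_tests = 48
tests_from_one_file = 10  # assumed here to be 5 variants x 2 global scopes

weight = tests_from_one_file / total_tests
print(f"{weight:.1%}")  # a single file carries roughly a fifth of the score
```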

@foolip
Member

foolip commented Dec 21, 2022

I agree this isn't great, and it also causes a bit of inflation in general for multi-global tests on wpt.fyi. There I've often thought that we should use the manifest and group these tests under the filename somehow, perhaps going so far as to call the filename "the test" and treat all of the variants as subtests. But that'd be a lot of work.

For the problem at hand, we could just not label the worker variants and reduce the size of the problem, but that isn't a very reusable approach...

@gsnedders
Member Author

I guess the challenge is that our current scoring implementation is in JS, versus the rest of the WPT infra (including all the manifest stuff) being in Python… hmm.

must… not… rewrite… this… while… on… holiday…

@gsnedders gsnedders added the meta Process and/or repo issues label Feb 4, 2023
@foolip foolip added the agenda+ label Feb 7, 2023
@foolip
Member

foolip commented Feb 7, 2023

I'm not aware of any hurdles we'll run into fetching and using the manifest from JS. The main issue is that it would be very slow. Storing all manifests in a tree-deduplicating setup more like https://github.com/web-platform-tests/results-analysis-cache would make it faster.

@foolip
Member

foolip commented Feb 10, 2023

This came up again in #281. We have cases (URL) where we want to include some variants but not others, which rules out the "clean" approach of labeling file names, using the manifest to figure out which test names to include, and treating each file as one test scored 0-1.

The more complex solution then is:

Label test names, but use the manifest to figure out which tests are defined in the same file. Treat those as a group and score them 0-1.

@jgraham
Contributor

jgraham commented Feb 17, 2023

So, the previous situation was that each variant is its own top-level test for the purposes of scoring, and the proposal is that we define things based on the file rather than on the test id?

FWIW I don't feel especially strongly either way; I think "the score just doesn't quite match reality" is an inevitable feature of the setup, and it's also possible to have a case where one file containing many subtests exercises a lot of the feature, whereas a few tests that were moved to separate top-level files only cover edge cases but end up dominating the scoring. But if people feel that, de facto, it's a better tradeoff today to treat variants as a single test, I think it's reasonable to change.

@foolip
Member

foolip commented Feb 17, 2023

Summary from the notes:

When we have variants, we can group them using information from the manifest, score them individually, and divide by the number of variants in the group. Similarly for multi-global tests.

So yes, it would be based on the file, but importantly we need to handle the case where we've only labeled some of the variants or multi-global tests.

To be robust we need to use the manifest, so this isn't trivial to implement.

I also think it would be very good if we could do the same grouping on wpt.fyi, otherwise we can't make the interop score view match.
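The grouping described above can be sketched in a few lines. This is a hedged illustration, not the actual implementation: the function and data names are hypothetical, and the mapping from test name to source file is assumed to come from the manifest. Each file contributes one 0-1 score, computed as the mean over whichever of its variants were actually labeled, which handles the partial-labeling case.

```python
from collections import defaultdict

def score_with_grouping(results, test_to_file):
    """results: test name -> pass fraction in [0, 1].
    test_to_file: test name -> source file (assumed to come from the manifest)."""
    groups = defaultdict(list)
    for test, score in results.items():
        groups[test_to_file[test]].append(score)
    # Each file is one group scored 0-1: the mean over the variants and
    # global scopes that were labeled, so partially labeled files still work.
    file_scores = [sum(scores) / len(scores) for scores in groups.values()]
    return sum(file_scores) / len(file_scores)

# Illustrative data: one multi-global file generating two tests, plus a
# plain test; passing only the window variant yields 0.5 for that file.
results = {
    "a.any.html": 1.0,
    "a.any.worker.html": 0.0,
    "b.html": 1.0,
}
test_to_file = {
    "a.any.html": "a.any.js",
    "a.any.worker.html": "a.any.js",
    "b.html": "b.html",
}
print(score_with_grouping(results, test_to_file))  # 0.75
```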

@foolip foolip removed the agenda+ label Mar 15, 2023
@foolip
Member

foolip commented Mar 15, 2023

We've discussed this in a meeting. We have a pretty good idea of what we'd change to address this, but nobody assigned to do the work.
