
Scoring: variants and multi-globals #256

Open
gsnedders opened this issue Dec 20, 2022 · 7 comments
Labels
meta Process and/or repo issues

Comments

@gsnedders
Member

One suggestion for webcodecs was to use all the tests which match the search video: https://wpt.fyi/results/webcodecs?label=master&label=experimental&aligned&view=subtest&q=video

However, due to extensive use of multi-global tests and variants, this ends up not working particularly well. Most obviously, webcodecs/videoDecoder-codec-specific.https.any.js ends up contributing about 20% of the overall score, as the single file responsible for ten of the 48 total tests matching that query.

This isn't the first time we've had problems like this with our scoring, but I think this is a much more extreme case than we've had otherwise.
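To make the inflation concrete, here is a minimal sketch of the arithmetic. Under equal per-test weighting, each (variant, global scope) combination generated from one `.any.js` file counts as its own top-level test; the counts below are taken from the comment above, and the 5×2 breakdown is an illustrative assumption, not the file's actual variant list.

```python
# Sketch of the inflation problem: one multi-global file with many variants
# dominates an equally-weighted average over all generated tests.
total_tests = 48
tests_from_one_file = 10  # assumed here to be 5 variants x 2 global scopes

weight = tests_from_one_file / total_tests
print(f"{weight:.1%}")  # a single file carries roughly a fifth of the score
```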

@foolip
Member

foolip commented Dec 21, 2022

I agree this isn't great, and it also causes a bit of inflation in general for multi-global tests on wpt.fyi. There I've often thought that we should use the manifest and group these tests under the filename somehow, perhaps going so far as to call the filename "the test" and treat all of the variants as subtests. But that'd be a lot of work.

For the problem at hand, we could just not label the worker variants and reduce the size of the problem, but that isn't a very reusable approach...

@gsnedders
Member Author

I guess the challenge is that our current scoring implementation is in JS, versus the rest of the WPT infra (including all the manifest stuff) being in Python… hmm.

must… not… rewrite… this… while… on… holiday…

@gsnedders gsnedders added the meta Process and/or repo issues label Feb 4, 2023
@foolip foolip added the agenda+ label Feb 7, 2023
@foolip
Member

foolip commented Feb 7, 2023

I'm not aware of any hurdles we'll run into fetching and using the manifest from JS. The main issue is that it would be very slow. Storing all manifests in a tree-deduplicating setup more like https://github.com/web-platform-tests/results-analysis-cache would make it faster.

@foolip
Member

foolip commented Feb 10, 2023

This came up again in #281. We have cases (URL) where we want to include some variants but not others, which rules out the "clean" approach of labeling file names, using the manifest to figure out which test names to include, and treating each file as one test scored 0-1.

The more complex solution then is:

Label test names, but use the manifest to figure out which tests are defined in the same file. Treat those as a group and score them 0-1.

@jgraham
Contributor

jgraham commented Feb 17, 2023

So, the previous situation was that each variant is its own top-level test for the purposes of scoring, and the proposal is that we define things based on the file rather than on the test id?

FWIW I don't feel especially strongly either way; I think "the score just doesn't quite match reality" is an inevitable feature of the setup, and it's also possible to have a case where one file containing many subtests exercises a lot of the feature, whereas a few tests that were moved to separate top-level files only cover edge cases but end up dominating the scoring. But if people feel that, de facto, it's a better tradeoff today to treat variants as a single test, I think it's reasonable to change.

@foolip
Member

foolip commented Feb 17, 2023

Summary from the notes:

When we have variants, we can group them using information from the manifest, score them individually, and divide by the number of variants in the group. Similarly for multi-global tests.

So yes, it would be based on the file, but importantly we need to handle the case where we've only labeled some of the variants or multi-global tests.

To be robust we need to use the manifest, so this isn't trivial to implement.

I also think it would be very good if we could do the same grouping on wpt.fyi, otherwise we can't make the interop score view match.
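The grouping described above can be sketched in a few lines. This is a hedged illustration, not the actual implementation: the function and data names are hypothetical, and the mapping from test name to source file is assumed to come from the manifest. Each file contributes one 0-1 score, computed as the mean over whichever of its variants were actually labeled, which handles the partial-labeling case.

```python
from collections import defaultdict

def score_with_grouping(results, test_to_file):
    """results: test name -> pass fraction in [0, 1].
    test_to_file: test name -> source file (assumed to come from the manifest)."""
    groups = defaultdict(list)
    for test, score in results.items():
        groups[test_to_file[test]].append(score)
    # Each file is one group scored 0-1: the mean over the variants and
    # global scopes that were labeled, so partially labeled files still work.
    file_scores = [sum(scores) / len(scores) for scores in groups.values()]
    return sum(file_scores) / len(file_scores)

# Illustrative data: one multi-global file generating two tests, plus a
# plain test; passing only the window variant yields 0.5 for that file.
results = {
    "a.any.html": 1.0,
    "a.any.worker.html": 0.0,
    "b.html": 1.0,
}
test_to_file = {
    "a.any.html": "a.any.js",
    "a.any.worker.html": "a.any.js",
    "b.html": "b.html",
}
print(score_with_grouping(results, test_to_file))  # 0.75
```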

@foolip foolip removed the agenda+ label Mar 15, 2023
@foolip
Member

foolip commented Mar 15, 2023

We've discussed this in a meeting. We have a pretty good idea of what we'd change to address this, but nobody assigned to do the work.
