Re-scoring previous test runs causes confusion #356

Open · jgraham opened this issue Jun 9, 2023 · 5 comments
Comments

jgraham commented Jun 9, 2023

Recent changes to motion-path and URL tests caused a noticeable overall change in the Firefox score. This is fine; those test changes were agreed on and the score change was predictable. However, what caused some problems was that people saw an overall score of X on one day, and then on the next day saw not only that the score had dropped below X, but that the graph suggested the score had never been as high as X in the first place. That caused a lot of confusion.

This happens because we re-score previous runs as if we had the current test set, zero-filling results for tests that weren't in the previous runs (see the sketch after the list below). That's reasonable; it means drops in the graph usually (but not always, e.g. when existing tests are edited to have different pass conditions) correspond to actual browser regressions. But there are a couple of problems:

  • There's a lack of documentation explaining exactly how the system works. To understand what's going on in detail, the best source is the scoring code itself (which is well commented!), but it's unreasonable to expect most people to find that.

  • The fact that old scores are silently changed makes it very difficult to quote a specific score. Consider a press article that says "at the time of writing, browser B had a score of Y". Someone reading the article later tries to verify that and finds a graph showing that B never scored Y. In that situation the reader would likely conclude that the article's author had made an error, rather than digging into the discrepancy.
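
To make the zero-filling concrete, here's a minimal sketch of the idea (illustrative only; the type and function names are made up, not the actual results-analysis code):

```ts
// Illustrative types; the real scoring code is more involved.
type RunResults = Map<string, number>; // test path -> score in [0, 1]

// Re-score an old run against the *current* test list. Tests that weren't in
// the old run are counted as 0, so adding tests to the set retroactively
// lowers the scores shown for historical runs.
function rescoreRun(oldRun: RunResults, currentTests: string[]): number {
  let total = 0;
  for (const test of currentTests) {
    total += oldRun.get(test) ?? 0; // zero-fill missing tests
  }
  return total / currentTests.length;
}
```

The key property, and the source of the confusion, is that the denominator is always the current test list, so a historical run's displayed score can change whenever the test set changes.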

I don't think the rescoring system is necessarily bad, but I do think we need to do more to make it clear what's going on. In particular the following seems like it would help:

  • Clearly document that we re-score previous runs using the current set of tests, backfilling zero where the test doesn't exist, and explain the set of tradeoffs that led to this system.
  • Generate, and publish, the actual measured results for each run, with the test set at the time of the run. Functionally, the backend for this would just be a new CSV file that we append the latest results to every time a new aligned run is processed. On the frontend, having some way to switch the graphs to show either the re-scored results or the historic point-in-time results would make it much more transparent what's going on.
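
As a rough sketch of that backend piece (the file name and column layout here are invented for illustration, not an existing artifact), appending an immutable point-in-time row for each newly processed aligned run could look like:

```ts
import { appendFileSync, existsSync, writeFileSync } from 'node:fs';

// Hypothetical file name and columns; the real pipeline would pick its own.
const HISTORIC_CSV = 'interop-historic-scores.csv';

// Append one row per aligned run, scored against the test set as it existed
// at the time of the run. Unlike the re-scored data, rows written here are
// never recomputed later.
function appendHistoricRow(date: string, scores: Record<string, number>): void {
  const browsers = Object.keys(scores).sort();
  if (!existsSync(HISTORIC_CSV)) {
    writeFileSync(HISTORIC_CSV, `date,${browsers.join(',')}\n`);
  }
  const row = [date, ...browsers.map((b) => scores[b].toFixed(4))].join(',');
  appendFileSync(HISTORIC_CSV, `${row}\n`);
}

// Example (made-up numbers):
// appendHistoricRow('2023-06-09', { chrome: 0.81, firefox: 0.77, safari: 0.75 });
```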

foolip commented Jun 13, 2023

How about documenting this in a README in https://github.com/web-platform-tests/results-analysis/tree/main/interop-scoring and linking it from https://wpt.fyi/interop-2023?

For keeping historical results unchanged, an alternative way to achieve that is to move labels into WPT itself somehow, in a way where it's possible to get the labels either for an arbitrary commit or at least for all tagged commits. That would be a lot more work, however.

cc @DanielRyanSmith


jgraham commented Jun 13, 2023

> For keeping historical results unchanged, an alternative way to achieve that is to move labels into WPT itself somehow, in a way where it's possible to get the labels either for an arbitrary commit or at least for all tagged commits. That would be a lot more work, however.

I think there's a lot to be said for just putting all of the metadata directly into web-platform-tests rather than having a separate repo. For example, it would allow people to update tests and metadata in the same commit. But this would indeed mean revisiting a lot of tooling that's based on the current separation. If we did this we could publish the combined metadata as an artifact and maybe do something similar to the results cache for long-term storage.
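
If the metadata did live in web-platform-tests, reproducing the labels for a past run would reduce to reading the file at that run's commit. A hypothetical sketch (the interop/labels.yml path is invented; any git ref, such as one of wpt's tagged commits, would work):

```ts
import { execFileSync } from 'node:child_process';

// Hypothetical location inside a wpt checkout; the actual path would need to
// be agreed as part of moving the metadata into the main repo.
const LABELS_PATH = 'interop/labels.yml';

// Read the labels file as it existed at a given commit or tag, so a run can
// be scored with the metadata that was in effect at the time.
function labelsAt(ref: string, wptDir: string): string {
  return execFileSync('git', ['show', `${ref}:${LABELS_PATH}`], {
    cwd: wptDir,
    encoding: 'utf8',
  });
}
```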


DanielRyanSmith commented Jun 13, 2023

It is true that there's an inherent risk of "rewriting history" with the current scoring process we have in place. We're at least making progress in freezing the scoring of previous years, which should be live soon.

My fear with keeping each score as it was written, rather than re-aggregating, is that we run the risk of solidifying scoring mistakes from non-finalized metadata or broken test suites. I could see metadata changes causing an increase in questions like "Why has the score jumped drastically from yesterday?" and "What caused this score drop last week?" Those scoring calibration changes would be permanently baked into the historical data; today we barely notice them happening because they're retroactively corrected.

I don't know the full risk of the above scenario, but it seems that metadata and test suite changes are not infrequent, even as the interop year progresses.

(I know this issue is not advocating for removal of the rescoring process - just documenting my thought process here)

> How about documenting this in a README in https://github.com/web-platform-tests/results-analysis/tree/main/interop-scoring and linking it from https://wpt.fyi/interop-2023?

This seems like the easiest way to explain what's happening behind the scenes, although I agree with @jgraham that people who notice scoring discrepancies will likely assume a blog post or the dashboard has made a mistake rather than read deeper into the scoring process (and I am likely one of those people 😅).


jgraham commented Jun 13, 2023

Right, the proposal is not to display the graphs of historic scores by default, but to have an option to display them instead of the current graph, so that in case someone does quote a score in a blog post or press article, and that score is later changed, there's a clear way to confirm that the number they gave was accurate at the time of publication.
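
A minimal sketch of that frontend option (names and file paths are illustrative, not the wpt.fyi code), assuming the two datasets are published as separate CSVs as suggested above:

```ts
type ScoreView = 'rescored' | 'point-in-time';

// Hypothetical dataset locations; see the CSV sketch earlier in the thread.
function scoresUrl(view: ScoreView): string {
  return view === 'rescored'
    ? 'interop-2023-scores.csv'      // recomputed against the current test set
    : 'interop-historic-scores.csv'; // frozen, as measured at the time of each run
}

// The existing graph component would keep rendering the same way; a toggle
// just changes which dataset it loads.
async function loadScores(view: ScoreView): Promise<string> {
  const response = await fetch(scoresUrl(view));
  return response.text();
}
```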

DanielRyanSmith commented

Sorry, I realize I rambled about the current scoring process more than the problem at hand in my comment above.

> so that in case someone does quote a score in a blog post or press article, and that score is later changed, there's a clear way to confirm that the number they gave was accurate at the time of publication.

I agree that it would be useful to have some historical accuracy. I wonder whether exposing a different view of historical scores on the dashboard could confuse a wider audience, since I imagine it's not easy to explain the discrepancies between these scores concisely to a general user.

There's a bit of a conflict in making this historical score easy to find: ideally it isn't exposed to users who don't need it, since it could be confusing, but in the blog post scenario described it would need to be easy enough to find that a user could verify the blog post's score.

I'm not a UX expert, so I'll hold off on suggestions beyond explaining our current process via a link on the dashboard, which seems easy and useful. I have no strong opinion on where the metadata is stored.
