
Mozlandia Automation


Servo and ateam meeting on perf, power, and mobile automation stuff

  • How can we make reliable perf/power tests (so we can compare week to week results?)
  • Where should we upload/store results? Does something already exist?
  • We'd like to automate tests against devices. Any best practices here?

Buildbot / testing situation

  • jack: We review, run tests, and only merge if the tests pass.
  • jgraham: From the point of view of WPT, their infra is OK. Non-trivial to port test harnesses to Servo (no Marionette). No result visualization. They have to check logs, and it takes a lot of time.
  • jack: Shows just the failures at the bottom of the logs.
  • ted: Test reporting is a weak point. Treeherder (https://treeherder.mozilla.org/#/jobs) integration would help. Maybe in a separate tree, for visualization. Structured logging, too. We have a standard for how it should look: one line of JSON per test for output. Can look at failures / test runtimes, etc. Tooling will be built around that.
  • jgraham: Selling thing is not that it's amazing now, but that it's where we're building stuff today.
  • ted: Low-effort for Servo, but good leverage for the ecosystem.
  • jgraham: WPT is simple because it already emits them. If you want it from unit tests, you just need a Rust structured logging library (see the sketch after this list).
  • ted: Just JSON messages.
  • larsberg: We're not behind the VPN. Is that OK?
  • mdoglio: All the APIs are exposed. HTTP endpoints are provided; you just need credentials.
  • larsberg: We don't allow failing...
  • manish: There are some intermittent failures, though. In TBPL, you can annotate failures with a bug link.
  • ted: Starring.
  • manish: Orange test = this bug.
  • mdoglio: Classify as an infra failure, intermittent, etc. and tie to a bug.
  • joel: Disk out of space; network error downloading, etc. It's not related, but it's good to have the history.
  • larsberg: Bugzilla? Or GitHub, too?
  • mdoglio: We just look for an error in the build log, extract the info, keep a cache of bugs from Bugzilla refreshed every hour, and find the match.
  • ted: Arbitrary comments are OK.
  • manish: So just no TBPL bot support.
  • c: Bugs on file for Treeherder to mark them as acknowledged for a week, etc.
  • jgraham: Since the B2G folks want treeherder too, it should get github support...
  • larsberg: They don't use Github issues.
  • c: Taskcluster and jenkins results are coming in...
  • ted: Structured logging and reporting gives you the UI, which helps.
  • d: The program that runs Servo and reports results from running it... is that in Rust?
  • jack: Four kinds of tests. Unit tests in Rust have an annotation, and the compiler creates a binary that runs them and reports results.
  • ted: What runs it?
  • jack: mach. Second, reftests: we have a reftest runner written in Rust using the same framework as Gecko.
  • jgraham: I will remove that runner.
  • manish: Instead of having the content test runner report failures in its own format, we should be able to parse them and switch it to structured logging.
  • jack: Third is content tests that just run in the JS engine.
  • larsberg: These will go away once we can get them in WPT.
  • jack: Invoke Servo for each one, run a JS test, that's it.
  • ted: XPC shell.
  • jack: There's a runner in Rust. It's about 100 lines of code.
  • d: Do you care that they're in Rust?
  • jack: No. And the fourth is WPT tests.
  • ted: So you want to move ref and content tests into WPT?
  • jgraham: Yes.
  • ted: Sounds like a great goal. Great goal for getting things upstreamed to the W3C. Seems like we don't need to run the mochitests since they're FF-specific. The WPT tests, since they're cross-browser, should be easier for you to run. Plus, WPT is already a good citizen with our tooling story.
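
A rough illustration of the structured-logging idea discussed above: one JSON object per line, one line per test result. This is only a dependency-free sketch; the exact actions and field names should come from the mozlog spec rather than from this example.

```rust
// Hypothetical sketch: a Rust test harness emitting one mozlog-style JSON
// line per finished test. The field names (action/time/test/status/expected)
// follow the general shape discussed above but are not authoritative.
use std::time::{SystemTime, UNIX_EPOCH};

fn now_ms() -> u128 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before the epoch")
        .as_millis()
}

/// Write a single structured-log line for a finished test.
fn log_test_end(test: &str, status: &str, expected: &str) {
    // JSON is assembled by hand so the sketch has no dependencies;
    // a real harness would use a JSON library.
    println!(
        "{{\"action\":\"test_end\",\"time\":{},\"test\":\"{}\",\"status\":\"{}\",\"expected\":\"{}\"}}",
        now_ms(),
        test,
        status,
        expected
    );
}

fn main() {
    log_test_end("css/background_basic.html", "PASS", "PASS");
    log_test_end("dom/event_dispatch.html", "FAIL", "PASS");
}
```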

Performance

  • jack: We can do benchmark testing in Rust, if we wanted. They work exactly like unit tests, except they do timing (see the sketch at the end of this section).
  • joel: I manage Talos.
  • jack: We want to make sure that parallel layout on four cores is always faster than one core.
  • joel: It's not just the tests, but also what they're run on; that's an issue. If you run them per-commit, you need standardized machines. But daily might be easier.
  • ted: Pass/fail? Or a value?
  • jack: We don't run them. We just do our perf testing ad hoc. We do reports occasionally where we pull up some numbers and run at a variety of core counts... we just want to generate some curves.
  • joel: In Treeherder, we're looking at one number per revision, which defines it.
  • ted: Tests in a bunch of different configurations. So it's like a bunch of benchmarks with different things.
  • joel: Some of the use cases don't fit into the model of autophone, etc.
  • ted: But if we just track it over time?
  • jgraham: Yes, just showing the ratio shouldn't go down.
  • joel: We collect power consumption for a bunch of the browsers for some sites. It's hard to compare them, though. These are things to consider - if you mock up something and put it in a spreadsheet, and it compares over 2-3 types of graphs, it might work. Otherwise, we might be able to store it, but use the data.
  • ted: Using the Rust stuff?
  • jack: No, totally manual.
  • ted: Not fundamentally different from what we do in FF: you run FF, it does sampling and reporting, which we then parse and log. If they wanted that reporting to produce graphs, what's the fastest way?
  • joel: Timeline matters. Will is the graphing guru. Hrm... Send to GraphServer (graphs.mozilla.org).
  • will: How much do you care?
  • jack: When we first did Android power measurement, we found out we had a huge 1000x regression. We'd like to see those sooner rather than later.
  • will: Morrow showed me Grafana, which can consume streams of JSON and plot them on a graph. You could do that while we sort our story out. Could just create a tool that writes a JSON file of numbers and have a Servo dashboard that displays it.
  • larsberg: We are fine with that - we just don't want to build things that make people ask why we aren't using your stuff.
  • ted: Keep will in the loop.
  • jack: Fine to stand it up ourselves, but would like feedback that we're doing things in the right direction.
  • ted: Yes, that would be great; teams often mess that up.
  • jack: So, for that, we'd just have the builder create the JSON file at the end, stick it in S3, and call it a day (a sketch of such a file appears at the end of this section).
  • ted: In FF, we have MozRunner that handles it all.
  • jack: MozRunner runs Servo, too.
  • jgraham: Really?
  • jack: Yes, we use it too, in our execution of WPT.
  • ted: MozRunner handles B2G, etc. So makes sense that it would just work. Handles all the edge cases -
  • jack: Oops, wptrunner, not mozrunner.
  • jgraham: Aha, wptrunner uses mozprocess.
  • ted: That's underlying MozRunner, so it should still handle most of the cases for running and tracking Servo. People use FF in a lot of contexts, and MozRunner/MozProcess handles it.
  • joel: You don't have preferences, which are key for desktop & Android. We set hundreds of those.
  • jack: We have a couple of commandline arguments now. Layout threads, render on CPU or GPU, etc.
  • ted: It's just a rendering engine, not a browser today.
  • jgraham: From the point of view of preferences, they're hardcoded for every test run. not that different.
  • ted: Most of that is for testing FF.
  • joel: Turning telemetry off, etc.
  • jack: Yeah, we just do it with cmdline arguments now, but might be preferences later.
  • ted: Not super-complicated right now, but soon you'll have 50 arguments and will need preferences later. A small Python script for that should be fine.
  • jack: For us, the easiest is on PR landing.
  • joel: Every PR makes it easier, even if it's another 10 minutes.
  • jack: As for what we'd find useful: even weekly would be useful.
  • ted: Regression ranges is the biggest thing.
  • jack: 50-60 PRs a week. Not huge.
  • ted: Good, much smaller. Also, there's lots of variability in perf testing, so having more data points is helpful.
  • jgraham: If you can, it's better to run the tests too often than really infrequently and be surprised by what you find.
  • jack: Currently, the WPT tests are the biggest part of our cycle.
  • joel: How long?
  • larsberg: Under 15 minutes.
  • ted: Machine situation?
  • larsberg: Linode & MacStadium. Just to avoid EC2 madness.
  • ted: Probably hit it in a while.
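
Since Rust's annotated unit tests and benchmark tests came up above, here is a minimal sketch of what they look like. It assumes a nightly compiler with the `test` feature gate (the exact requirements have changed across Rust releases), and the benchmarked function is a placeholder, not real Servo layout code; build it as a library crate and run `cargo test` / `cargo bench`.

```rust
#![feature(test)] // #[bench] is gated on nightly Rust
extern crate test;

use test::Bencher;

/// Placeholder for the real work; a Servo harness would call into layout here.
fn layout_once(nodes: u64) -> u64 {
    (0..nodes).fold(0, |acc, x| acc ^ x)
}

#[test]
fn layout_terminates() {
    // Unit tests are plain annotated functions; the compiler builds a runner
    // binary out of them.
    assert_eq!(layout_once(0), 0);
}

#[bench]
fn bench_layout(b: &mut Bencher) {
    // Benchmarks look the same, except the harness times this closure and
    // reports ns/iter.
    b.iter(|| test::black_box(layout_once(10_000)));
}
```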
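
And here is a sketch of the "builder writes a JSON file of numbers" idea from the dashboard discussion. The record layout (revision, test, layout threads, time) is invented purely for illustration; whatever schema the eventual Grafana/Servo dashboard expects would take precedence.

```rust
// Hypothetical sketch: after a perf run, write one JSON record per
// measurement so a dashboard can plot the numbers over time. Field names are
// made up for illustration.
use std::fs::File;
use std::io::Write;

struct Measurement<'a> {
    revision: &'a str, // commit the numbers were taken from
    test: &'a str,     // e.g. one of the static page snapshots
    layout_threads: u32,
    time_ms: f64,
}

fn write_json(path: &str, rows: &[Measurement]) -> std::io::Result<()> {
    let mut f = File::create(path)?;
    writeln!(f, "[")?;
    for (i, m) in rows.iter().enumerate() {
        let comma = if i + 1 == rows.len() { "" } else { "," };
        writeln!(
            f,
            "  {{\"revision\":\"{}\",\"test\":\"{}\",\"layout_threads\":{},\"time_ms\":{}}}{}",
            m.revision, m.test, m.layout_threads, m.time_ms, comma
        )?;
    }
    writeln!(f, "]")
}

fn main() -> std::io::Result<()> {
    let rows = [
        Measurement { revision: "abc123", test: "wikipedia.html", layout_threads: 1, time_ms: 212.0 },
        Measurement { revision: "abc123", test: "wikipedia.html", layout_threads: 4, time_ms: 74.5 },
    ];
    // The resulting file could then be uploaded to S3 for the dashboard to fetch.
    write_json("perf-results.json", &rows)
}
```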

Rust

  • simonsapin: They're bigger, and having big issues with the r+ queue. They test before landing, too. Cycle time is large.
  • larsberg: 3 hours.
  • simonsapin: Their queue is really long. Sit there for days.
  • ted: What do they do?
  • larsberg: EC2+macs on desk.
  • ted: How do they fix?
  • manish: Manual rollup.
  • ted: Cool, that's a side issue, but we should make sure you don't get there. You will get there at some point.
  • larsberg: Hopefully by then releng will be ready.
  • ted: taskcluster, etc.
  • joel: Different hardware causes a huge death march.

Devices

  • larsberg: What can we do to make sure we still run on the device?
  • jack: We make sure that we still build for it, but what else?
  • joel: Emulators.
  • jgraham: Is there some company that does hosted on-device testing, too?
  • ted: Some of both. Hardware doesn't scale, even with 800 PandaBoards; you've got to maintain them. We're standing up a lot of emulators, especially on EC2. It's QEMU, and it's just not as fast. Especially if you're doing graphics stuff.
  • manish: Won't catch graphics driver issues.
  • ted: We have llvmpipe for a realish GL.
  • mark: Some do, some don't...
  • jack: Even just one reftest would be a huge improvement.
  • ted: Could chat with Jeff Brown about it. They have things running in emulators.
  • joel: Increases end-to-end time.
  • jack: We run WPT on two machines and the unit tests on others. May be able to add them in. Or just spin up more machines.
  • larsberg: For me, good to know that emulators are what we have to do our automated work on.
  • ted: We have some stuff you can only test on phones.
  • joel: Not realistic to test on devices in your 30 minute thing.
  • ted: Need the devices for your perf testing. Perf on mobile means wanting phones, and doing that separately.
  • jgraham: We must be doing something like this for B2G. There's some company building a cloud of Flames.
  • ted: If it's just perf testing semi-regularly, we have stuff where we can do that. AutoPhone, etc.
  • jack: We have people working on Android stuff so we get some feedback...
  • ted: Eideticker.
  • wlach: On Android, it's a video capture & perf analysis harness. instead of an abstract number, it captures the device doing things, so it can do analysis. It'll say you're done when the page is visually complete. Can also do checkerboarding tests to see how long it takes until it pans. We run it with FF, but it should work with other stuff.
  • ted: On Android / B2G, you can automate via that.

Perf on layout

  • jack: Trying to get perf numbers for the parallel layout engine. We took a bunch of static snapshots of Alexa Top 50 sites. But the thing for Servo isn't the static snapshots; it's dynamic sites like twitter.com and facebook.com. What do you do?
  • joel: The dilemma there is loading a live page on Twitter: not logged in or scrolling, you get no content. Supposedly they have test accounts you can sign up for, but then there's no content either.
  • ted: No repeatability.
  • jack: I'd like to have a test suite that covers the dynamic stuff.
  • jgraham: You're talking about building your own benchmark?
  • jack: Just want to know if one already exists.
  • joel: For power, we do it on live test sites like that. Theory is that if we test it at 1:30pm on FF and 1:32pm on IE, it should be pretty close.
  • ted: MozBench is the closest thing we have.
  • jgraham: A small number try to be dynamic web-application-esque. There's Apple's Speedometer, which is scripted use of a dynamic frameworky thing. Closer than other benchmarks. (http://browserbench.org/Speedometer/)
  • ted: Ideally, you'd load something in the browser, wireshark it, play it back, etc.
  • jack: So, if we need to make something like this, would you want it or have ideas?
  • joel: We'd use it.
  • ted: Pick a web app. Capture the state so that you can replay it, and be done.
  • joel: Networking? Or dynamic content?
  • jack: We use SM, so most of the benchmarks are not interesting to us. We have page load profiling, but nothing on scrolling, etc.
  • c: Talk to ollie. No solution yet, but we need to fix it, too.
  • wlach: Chromium's Web Page Replay project?
  • joel: We looked into it a lot. Theoretically, you can load a live webpage and the replay tool will suck down the session and replay it. bz and I spent two months trying to make it work, and couldn't.
  • jgraham: They're not necessarily deterministic.
  • ted: Ad banners.
  • manish: Not with twitter, but discourse could be run on localhost, maybe.
  • jack: That might work.
  • manish: It uses XHR, etc.
  • jack: Thing to get out of this is: who do we need to talk with?
  • ted: You want bz and smaug there.
  • joel: Patrick McManus has great ideas on networking. But I don't think you care as much about that.
  • jack: We want to make sure we're testing the things that distinguish us (like parallel layout).
  • manish: Could get a fake twitter.
  • ted: Just make sure it's representative.
  • jgraham: Speedometer was the highest-level benchmark I'm aware of people using. It's not the best, but it works today and will spit out a number when you run it.
  • joel: arnie was open to collaborating on benchmarks.
  • ted: Just make sure you talk with bz and smaug. Otherwise, you'll be unhappy.
  • jgraham: Dromaeo is not well thought of.
  • ted: Any general-purpose benchmarks you create we should run in gecko, too.
  • jack: We often compare ourselves against gecko and other browsers and hope that if we claim we're 2x faster you'll call us on it. Right now, we instrument the gecko code where we think the analogous operations end, etc.
  • joel: We have some runner stuff (MozBench) in place to try running them across a bunch of browsers nightly so we could compare against Servo.
  • ted: dminor from the ateam is managing MozBench. Mainly designed for games benchmarks, but should work for ours, too.