
Meeting 2014 08 25


Agenda

  • new intern (jack)
  • work week location (jack)
  • Feedback on the Travis CI experiment? Double down, or investigate other options? (larsberg)
  • Cargo update (jack)
  • libgreen update (jack)
  • security data analysis (jack)
  • latest power measurements (laleh)

Attending

  • laleh, larsberg, mrobinson, mbrubeck, czwarich, ms2ger, abinader, kmc, luqman, simonsapin, jdm

New intern

  • jack: We have a new intern, Clark Gaebel, starting today! He's probably in the IT lab currently and should be on the call next week. We're not quite sure yet what he'll be working on; we'll figure it out over the next couple of days.

Workweek location

  • jack: Let's decide! SimonSapin will be in the Bay Area and would like to have it there. Objections? No? Preferences for SF vs. MTV?
  • zwarich: Just one office, not both. It was frustrating having to commute to one office one day, stay at the other the next, etc.
  • jack: Yeah, there are enough of us that it's a space issue. Daala met at the Foxhole in SF, which was great because it's a huge space and has its own conference rooms. It also has its own kitchen, so you don't have to go far for drinks & stuff.
  • kmc: I prefer SF, but can do either.
  • mbrubeck: Same here, if both are an option.
  • jack: You'd rather be on the Embarcadero than in the middle of nowhere in Mountain View?
  • mbrubeck: I've spent too much time in that concrete wasteland...
  • jack: Azita sent me some notes, so I'll see if we can get the big space.

Travis CI

  • larsberg: The big thing I wanted to get reactions on is, we're at a point where we need to decide whether the experiment so far is a success we want to build on, or a failure we've learned our lesson from. Going forward would mean either ponying up money to get the features we want on Travis, or building an alternative ourselves.
  • zwarich: I think it would be more positive if we didn't get so many random failures without usable explanations.
  • jack: The major reason we did it is that Rust upgrades previously required system upgrades, which were blocked on releng. We don't have that problem anymore, but we have many more random failures.
  • SimonSapin: Are random test failures Travis's fault, or ours?
  • jack: Travis randomly kills things for lack of output, memory usage, etc.
  • zwarich: And it doesn't tell you when or why it happens, as far as I can tell.
  • jack: We tried workarounds like turning on logging to avoid "lack of output" killing. But there doesn't seem to be one large problem, just death by a thousand (undocumented) cuts. And because of the slowness and the time limits, we've had to split up tests and increase overall cycle time.
  • jdm: Should we talk to Taras Glek about whether releng has the things we need yet? He was enthusiastic about that back in Portland.
  • jack: Probably. If they have better self-serve infrastructure it might solve both our problems. It sounds like people aren't super satisfied with where we are at Travis. Going back to our own infrastructure is one option; another is to pay Travis to get these problems fixed. I'm not very optimistic about that option or the price tag. Alternately we could run more things ourselves like the Rust team does; not sure how they currently feel about that.
  • zwarich: Doing something that the Rust team does would be good for getting a setup where Rust is guaranteed to work. Travis would be nicer if all the little problems got ironed out, but they seem to be fundamental problems that won't be solved by throwing them a couple thousand dollars. The web UI and other integration is nice, but I don't know if it's worth the money.
  • jack: We had a lot of that before; the only thing we didn't have is that bors didn't kick off builds without a manual review. And we could change that.
  • Manishearth: Both Travis and Jenkins are open source, so we could use them and get those features in-house if we set it all up.
  • jack: The only caveat is we have to figure out what to do for OS X (which we can't run on AWS). Setting up new AWS servers and billing them to Mozilla is easy, but I don't know what to do about Mac OS X.
  • larsberg: It sounds like maybe I should drill into what Taras's team has available and send around a summary of our options.
  • Manishearth: Lars, have you talked with the team about the timeouts and the kills and possible solutions for them?
  • larsberg: It's not entirely clear to me. The biggest problem I've had interacting with them is that replies take either 5 minutes or 2-3 weeks. I have mail out to them about scaling up hardware and about killed processes, and that's been sitting in their support inbox for over a week.
  • SimonSapin: For OS X, is it an option to have a Mac Mini sitting in an office somewhere? We even have an in-office server room here.
  • jack: For Rust, brson has four Mac Minis sitting at his desk. We could do that at zwarich's desk, since he's never there. :) There are also hosting services we could use; that would be nice, since they offer remote administration (instead of asking zwarich to head to the office). Or maybe brson wants another few Minis for his stack. :)

Cargo update

  • jack: I investigated switching us over and ran into some problems with rebuilding. acrichto fixed them all and we're ready to go on the next round of testing, which I will pick away at this week. We discussed all remaining issues at the Rust workweek last week. No known blockers to this work.
  • larsberg: Will you die from the compiler ICE?
  • jack: No, I should be able to use a freestanding cargo build.

libgreen update

  • jack: The Rust team has decided to abandon libgreen and move to native-only threading. They will no longer proxy I/O calls or mutexes to make them work across both schedulers. libgreen itself will move into its own repo, which acrichto will maintain. This impacts Servo substantially: it's not clear we ever had a full solution from libgreen, but now we have no solution at all for userland scheduling.
  • kmc: We already have a custom workqueue abstraction for parallel layout. So maybe port that to libnative?
  • jack: The work queue is already on libnative. The big issue is that we were hoping libgreen would solve synchronous queries from script to layout (e.g., getBoundingClientRect). Message passing from one native thread to another requires an OS context switch, which libgreen avoided by making it a userland switch; its message passing was optimized for this use case, too. The big problem is that everything is slower than a function call, so Servo will always be slower than other browsers here (see the channel sketch after this list). Also, libgreen has all this I/O stuff we don't need.
  • kmc: Don't we want our own userspace async I/O layer? We'll need it for a good HTTP stack, DNS, etc.
  • jack: Can do that with message passing and native threads...
  • kmc: If you have tons of queries out, you probably want them in an epoll loop, not lots of native threads.
  • jack: Giant stack overhead, though?
  • kmc: Probably have to implement our own userspace threading library.
  • zwarich: The big problem is that "use a threadpool" means that with one threadpool per pipeline, even one native thread per iframe would be way too many threads for a browser. pcwalton said "threads are fast on Linux," but I don't know how true that is on Android/embedded devices; maybe Linux is optimized for large numbers of native threads. But then you also have tons of memory overhead from all the native stacks. It's going to be a huge amount of memory bloat that will be very difficult to get rid of, and I'm not sure we know the right way forward. As for the DOM/script problem, accepting some slowdown there is probably essential to making a concurrent browser, since synchronization is not free; I just don't know how much faster we can make it than libgreen message passing. Maybe, since this libgreen would not serve other clients, we could add special things to handle this case that wouldn't be possible if libgreen were built into the Rust standard library?
  • kmc: I remember the golang team was talking about adding a system call to linux that bypassed the context switch. Not a cross-platform solution, but maybe at least a solution for linux here?
  • jack: I roughed out a small plan for figuring this out. The critical thing is the synchronous script query. First, we need to make script and layout execute concurrently again. The barrier is: as soon as we have a flow tree, we can return to script and run in parallel (because layout doesn't access DOM members after that).
  • zwarich: There's a bug open from pcwalton saying we still do, but maybe it hasn't been closed? [NOTE (zwarich): I didn't find the bug, so I guess it was closed?]
  • jack: Yeah, we'd need to fix that, and then measure the performance of these synchronous script calls. jdm made a microbenchmark for this, so we could see where we stand. Then, pcwalton suggested that while script is running and layout is not, there's no reason to send a message: script could just access the layout structures directly. We could try implementing that suggestion, figure out how to make it "safe", and then measure the cost of those calls (see the direct-access sketch after this list). Then we'll have to figure out some strategy for what to do with these small tasks when there are a hundred iframes in their own pipelines; N*100 native threads is probably not the way to go.
  • larsberg: My worry is once we're managing lots of threads across thread pools we have to worry about fairness, starvation, etc.
  • zwarich: It's also unfortunate because it seems like one of the big abstractions Rust was supposed to provide, and now they're punting it to us.
  • jack: But we don't need anything they had that made it a maintenance burden. So we could delete half the library :-)
  • larsberg: Hope it doesn't turn into "linked task failure 2.0"
  • jack: We might be able to bring that back. If we maintain it ourselves it's up to us whether we want to deal with linked task failure.
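
To make the cost concrete, here is a minimal sketch (in current Rust, not Servo's actual code) of the kind of synchronous script-to-layout query discussed above, using std::sync::mpsc channels between native threads. The LayoutQuery and reply types are hypothetical stand-ins; the point is that every query pays for two channel sends and at least one OS context switch, which is what libgreen's userland switch avoided.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

// Hypothetical query type; Servo's real script/layout interface differs.
enum LayoutQuery {
    // getBoundingClientRect-style query: a node id plus a channel to reply on.
    ContentBox(usize, Sender<(f32, f32, f32, f32)>),
    Exit,
}

fn main() {
    let (layout_tx, layout_rx) = channel::<LayoutQuery>();

    // The layout task runs on its own native thread.
    let layout = thread::spawn(move || {
        for msg in layout_rx {
            match msg {
                LayoutQuery::ContentBox(_node, reply) => {
                    // A real implementation would consult the flow tree here.
                    let _ = reply.send((0.0, 0.0, 100.0, 20.0));
                }
                LayoutQuery::Exit => break,
            }
        }
    });

    // Script side: the query blocks until layout replies, costing two
    // channel sends and an OS context switch instead of a function call.
    let (reply_tx, reply_rx) = channel();
    layout_tx.send(LayoutQuery::ContentBox(42, reply_tx)).unwrap();
    let rect = reply_rx.recv().unwrap();
    println!("content box: {:?}", rect);

    layout_tx.send(LayoutQuery::Exit).unwrap();
    layout.join().unwrap();
}
```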
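
And a rough sketch of pcwalton's direct-access idea, under the same caveats: when script can be sure layout is idle, it reads shared layout data behind a lock instead of paying for a message round trip. FlowTree is a hypothetical stand-in, and making this pattern actually safe is exactly the open question above.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for the data structures layout produces.
struct FlowTree {
    content_boxes: Vec<(f32, f32, f32, f32)>,
}

fn main() {
    let tree = Arc::new(Mutex::new(FlowTree {
        content_boxes: vec![(0.0, 0.0, 100.0, 20.0)],
    }));

    // While layout is *not* running, script takes the lock and reads
    // directly: roughly the cost of a function call plus an uncontended
    // lock, instead of two channel sends and a context switch.
    let script_view = Arc::clone(&tree);
    let rect = {
        let flows = script_view.lock().unwrap();
        flows.content_boxes[0]
    };
    println!("content box: {:?}", rect);
}
```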

Security data analysis

  • jack: I've been talking with people about Servo moving forward, becoming a product, whatever, and answering the usual pushback. One objection is that there are lots of cheaper ways to get a browser than writing one from scratch. The DOM APIs have no security considerations in the APIs themselves, but the C++ implementations have problems. Roc's investigation of WebAudio security bugs concluded that Rust would have prevented most or all of them (see the sketch after this list). When talking about this, people followed up with: are those the kinds of bugs that dominate browser security bugs? Roc said he believed they are, but didn't have time to trawl Bugzilla.
  • zwarich: I thought it was generally considered that at this point the large cause of browser security issues is use-after-free.
  • kmc: Aren't there web platform interaction problems? Fingerprinting, link privacy violations, content sniffing? Or do those not count?
  • zwarich: I mean most of the new things people find. The conceptual ones can be worked around or reengineered away; with use-after-free, there's nothing you can do to solve it forever. When I started working on browsers, these were just bugs; now they're always critical security problems. There are lots of other interesting security problems, but they aren't programming-language problems.
  • jack: If we get to a world where the biggest problems are small privacy leaks, then we're definitely in a better place. Right now, we still have remote code execution bugs in browsers.
  • kmc: Sure, I feel like memory safety should have been solved long ago, but we should be careful about our security claims, especially as the web platform becomes richer. No systematic way to address all interactions between parts of the platform.
  • jack: Next question: does anyone on the team have access to the secret security bugs in Mozilla's Bugzilla? I'd like to do this analysis or find someone who can.
  • jdm: There is nobody on the team who has security team access, which is what's required to view any arbitrary security bug.
  • jack: I'll ping Roc to see if he can help us find a volunteer. Failing that, I'll have azita/dave help us track somebody down internally. ms2ger suggests giving me access... but I may be the NSA plant!
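
As a minimal illustration of the bug class under discussion (not from the meeting): the use-after-free pattern below compiles and silently reads freed memory in C++, while Rust's borrow checker rejects it at compile time.

```rust
fn main() {
    let v = vec![1, 2, 3];
    let first = &v[0];         // borrow into the vector's heap storage
    println!("ok: {}", first); // using the borrow here is fine
    drop(v);                   // frees the storage

    // Using `first` *after* the free would be a use-after-free. In C++
    // the equivalent compiles and reads freed memory; here the borrow
    // checker refuses to compile the program ("cannot move out of `v`
    // because it is borrowed") if the next line is uncommented.
    // println!("{}", first);
}
```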

Latest power measurements

  • laleh: I've been trying to change the work queue from busy-waiting to sleeping. First I changed the default spin value down from 1000 to 100, and we got better performance. I'm not sure what the best number is for these benchmarks; I'm trying to find the optimum. I also tried adding a sleep, but that hurts performance a lot.
  • jack: We talked with alex/brian at the Rust meetup, and those numbers were just made up. We should submit whatever we find upstream.
  • laleh: I'd also like to try some other benchmarks.
  • zwarich: Long-term, we're not going to want a fixed magic value. User-space scheduling like this should be done adaptively rather than with fixed timeouts, and we should look at what production-quality schedulers of this nature do. You can try something a bit better with a backoff: spin a little, then double the wait each time up to a max threshold, then do an OS sleep (see the sketch after this list). This problem has been solved in much more sophisticated ways than a single constant, and we shouldn't reinvent the wheel.
  • jack: One question: is the power-usage difference between Servo and Firefox on Android due to spinning? In single-threaded mode we don't use this work queue, right? Is our single-threaded power usage in the same ballpark? Next: how much spinning are we actually doing, and do we know when it happens? Just during layout, or do those threads spin from the moment layout starts until Servo shuts down?
  • zwarich: Definitely not the latter; my Mac would heat up if four threads were constantly spinning. Playing around with a different value is interesting, but I predict that the optimal value will be benchmark-dependent, unless we find that nearly all web pages have very similar workloads.
  • jack: OK, sounds great! I like that you deleted two lines of code from Rust and improved both performance and power usage. Someone from Daala recommended we...
  • zwarich: There's an ASPLOS '14 paper on that! laleh, what was the slowdown if you removed the spinning entirely and slept immediately?
  • laleh: I didn't try that...
  • zwarich: Maybe you just brought the value really low. Was it like a 50% slowdown?
  • laleh: Not that bad.
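
A minimal sketch (not the actual work-queue code) of the adaptive spin-then-sleep backoff zwarich describes; the constants are placeholders of the kind laleh has been tuning.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Placeholder constants of the kind being tuned (e.g., 1000 -> 100).
const SPIN_LIMIT: u32 = 100;
const MAX_SLEEP_US: u64 = 1_000;

/// Wait for `flag` to become true: spin briefly for low latency, then
/// back off exponentially with OS sleeps instead of burning CPU (and power).
fn wait_adaptive(flag: &AtomicBool) {
    let mut spins = 0;
    let mut sleep_us = 1;
    while !flag.load(Ordering::Acquire) {
        if spins < SPIN_LIMIT {
            spins += 1;
            std::hint::spin_loop(); // cheap busy-wait at first
        } else {
            thread::sleep(Duration::from_micros(sleep_us));
            sleep_us = (sleep_us * 2).min(MAX_SLEEP_US); // exponential backoff
        }
    }
}

fn main() {
    let flag = Arc::new(AtomicBool::new(false));
    let waiter = {
        let flag = Arc::clone(&flag);
        thread::spawn(move || wait_adaptive(&flag))
    };
    thread::sleep(Duration::from_millis(5)); // simulate work arriving late
    flag.store(true, Ordering::Release);
    waiter.join().unwrap();
    println!("woke up without spinning the whole time");
}
```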

Rust hi-pri issues

  • jack: Can we upgrade Rust again?
  • kmc: I think I can work around the ICE for html5ever.
  • larsberg: There's also a bug, https://github.com/rust-lang/rust/issues/16483, that prevents an upgrade.
  • zwarich: Maybe pcwalton's changes to for loops will fix the parser issue...
  • kmc: I'll try rewriting to work around it.
  • jack: It would be nice to unblock another Rust upgrade.
  • kmc: Yes, the HTML parser will be ready to land once we upgrade Servo to a Rust version that has all the fixes I need. I hope to start on that soon.
  • jack: As soon as we get the LLVM bug fixes, we should start an upgrade so we can land the parser.