
Architecture Meetings 2018 06 14

Erik Moeller edited this page Jun 14, 2018 · 3 revisions

Architecture Meeting: June 14, 2018, 10:00 AM PDT (5:00 PM UTC)

Topic: SecureDrop Server Operating System Transition

Participants: Emmanuel, Joshua, Conor, Loic, Mickael, Jen, Mike, Erik

Background: Ubuntu Trusty reaches end-of-life in April 2019. We must begin the transition to an upgraded version or a new SecureDrop base OS well before that date, to bring the number of SecureDrop instances running on an unsupported base OS as close to zero as possible.

Links:

Agenda:

  • Intro (5 minutes)
  • Debrief from Xenial research (15 minutes)
  • Decide which near term choices we want to focus on (e.g., Xenial vs. Atomic) (15 minutes)
  • Document pros/cons of these approaches (30 minutes)
  • Check-in: Do we have enough information to make a decision? If so: let's attempt to build consensus. If not: what other information do we need? (remainder of meeting)

Notes

Mike: It was fairly straightforward to do a fresh install. Some Apache changes - we were depending on a package that doesn't exist in Xenial. Easy workaround. Some iptables changes, some apparmor stuff.

Conor: Any comments on the nature of the Ansible changes we need to make? Would it work well against Trusty for an interim period?

Mike: Pretty minimal changes. The worker that I switched to is marked as experimental in Apache 2.2, only production-ready in 2.4. Tried to switch back to the one we were using before, but it was a really jarring experience. Tried to disable the module, but that broke Apache. Had to do manual file manipulation. Could switch to the older worker type.

Conor: Some of the edits we have to do in that scope - 100% of the Ansible logic in the SD repo is written by people who care about SD. When we run servers (e.g., host apt repo), we try to re-use logic. Should we cull custom Ansible logic?

Mike: Not that many changes. Should move to community roles, but not necessarily a blocker.

Conor: Upgrade attempt went pretty smoothly. I was able to get it to the point where we can install new deb packages with a few changes. Did not run Ansible logic, did it all through deb packages. One major point of concern that required a postinst hook: firewall rules are very restrictive, e.g., outbound connections are restricted. In modern versions of apt, there's a system account that executes apt, which does not work under our firewall rules. Once the new deb packages are installed and the machine is rebooted, everything is cool. In order for an admin to run do-release-upgrade, needed to set the variable that permits the upgrade to be shown. Had to set the prompt to the LTS channel.
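A minimal sketch of the "prompt" change, assuming the variable Conor refers to is the `Prompt` line in `/etc/update-manager/release-upgrades` (the file Ubuntu's release upgrader reads to decide which upgrades to offer). The function name is hypothetical, and the code only transforms the file's text; it does not touch the system.

```python
import re


def set_release_upgrade_prompt(config_text: str, value: str = "lts") -> str:
    """Return config_text with its Prompt setting set to `value`.

    Assumption: config_text has the layout of
    /etc/update-manager/release-upgrades, i.e. a [DEFAULT] section
    containing a line such as "Prompt=never".
    """
    new_line = f"Prompt={value}"
    # Match an existing, uncommented Prompt line (spaces/tabs only, so the
    # pattern never swallows a preceding blank line).
    pattern = re.compile(r"^[ \t]*Prompt[ \t]*=.*$", re.MULTILINE)
    if pattern.search(config_text):
        return pattern.sub(new_line, config_text, count=1)
    # No Prompt line present: append one, keeping the file newline-terminated.
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + new_line + "\n"


if __name__ == "__main__":
    sample = "[DEFAULT]\nPrompt=never\n"
    print(set_release_upgrade_prompt(sample))
```

Working on text rather than editing the file in place keeps the sketch easy to test; a postinst hook or Ansible task would still need to read and write the real file with appropriate permissions.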

Jen: Did you run tests?

Conor: Did not. But see Mike's notes -- Selenium tests are the only thing that failed. Selenium API changes? Was xvfb running?

Jen: Integration tests also failing. Redis worker?

Erik: Would have to be attended upgrade?

Conor: SSH'd in over Tor. SSH sessions create a tmux session. There are prompts we can't really get around. I was able to follow all the defaults. One exception: don't restart services (will break the connection). At the end it prompts you to reboot.

Choices, Choices, Choices

Xenial vs. Atomic? Or other choices?

Conor: People have advocated Debian and CoreOS. If we're seriously considering Atomic, I don't see value in considering CoreOS.

Josh: In the context of working with this on Qubes, Ubuntu is poorly supported. It would make it a lot easier to move to Debian.

Emmanuel: A move to Debian would be less predictable in terms of timelines for support. Schedule for "stable" a little less predictable.

Mike: What are the support issues for Ubuntu under Qubes?

Josh: Qubes only supports Debian and Fedora as first-class members of the VM ecosystem. It makes it difficult/impossible to use other operating systems. You have inter-VM communication, participation in the qrexec system, copy/paste, management via dom0. You can only do this with Ubuntu with great difficulty.

Conor: We'd do well to rely less on Ansible playbooks. ISO customization largely a nonstarter.

Mike: I will say that I really like CoreOS. A lot harder to spin up a custom kernel. Everything is configurable via YAML. If we were to consider dropping grsec, it would be a nice alternative. Nice appliance OS.

Erik: Does transition to Atomic make things easier for SD admins?

Conor: Could make it more of an appliance.

Emmanuel: Some problems admins run into are caused by them making changes to the base OS and not telling us. Immutability would make it simpler. I don't think it happens all that much, but from time to time.

Jen: Pretty rare.

Conor: Design problem - how much are we encouraging custom changes.

Emmanuel: Would call it uncommon, not rare.

Jen: Transitioning to Atomic alone would not necessarily make things easier for SD Admins. But design shift -- admins should not be SSHing in -- would. We could make that design shift now, in order to be in a position to remove admin access.

Mike: Agree with Jen's comments. First install might be easier, but still need application-level changes. It makes our lives easier.

Jen: Move as much config as possible to the admin interface web app. Having a GUI application to do everything that securedrop-admin does.

Mickael: If something goes wrong with install, to debug, an admin can figure things out. If we go Atomic route, would be more difficult.

Jen: HTTPS bug -- was debugged in prod and fixed.

Emmanuel: How difficult would it be to have a button in the admin interface to have logs?

Conor: Absolutely something we could do.

Erik: What does it mean for us to stick with Xenial?

Mike: We should at least be snapshotting repos. No matter what, snapshotting repo and doing QA on that. We know exactly what we're shipping and there's not suddenly a new kernel package.

Anytime we want to make any package change, it's a big PITA. I have to update Ansible logic, and have to update the package so that existing installs get the change. If I want to switch to nginx, that's a PITA to do right now. If we had Atomic, that would be a lot simpler because we can swap out the whole base. Would want to take the Ansible story away completely.

Mike: Is there a possibility to automate Xenial upgrade process?

Conor: I don't really think we should try. If we can pull this off with enough lead time, that would make the support burden more manageable. If we tried to do everything automagically, it would create more problems than it solves. We should advise folks to upgrade - most issues resolvable with single admin action.

Jen: Does anyone disagree with appliance vision? Admins do not SSH in, we have more and more control over system state, all admins are doing is changing some config values, everything else we control.

Conor: Need to consider liability, decentralized aspect. I think appliance is a good direction to go in. Don't think we'll be able to do it anytime soon. Even if we're locking admins out, don't want to gain SSH in the process.

Jen: We don't want to have too much control, agreed. If FPF disappears, the project continues -- problem whether or not we do this.

Conor: In general, pushing to least admin involvement possible is generally beneficial to the org.

Jen: Would we need to release several times a week?

Conor: Sounds correct.

Mike: There are sec updates a couple of times a week, but not necessarily related to the app we're running. If random lib gets update, not sure we have to update right away.

Mickael: Human needs to analyze, or CI/CD system.

Conor: Version-locking package conversation needs to be hammered out.
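The notes do not settle on a mechanism for version locking. One possible shape, assuming it were done with apt pinning, is a stanza in an apt preferences file for each locked package. The sketch below generates such a stanza; the function name, package name, and version are placeholders, not anything decided in the meeting.

```python
def apt_pin_stanza(package: str, version: str, priority: int = 1001) -> str:
    """Build one apt preferences stanza pinning `package` to `version`.

    A priority above 1000 tells apt to keep that exact version even if
    doing so means a downgrade (see apt_preferences(5)).
    """
    return (
        f"Package: {package}\n"
        f"Pin: version {version}\n"
        f"Pin-Priority: {priority}\n"
    )


if __name__ == "__main__":
    # Hypothetical package and version, for illustration only.
    print(apt_pin_stanza("example-package", "1.2.3"))
```

Pinning by itself only freezes what an instance installs; it would still need to be paired with the repo snapshotting and QA that Mike describes above.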

Jen: Whether we move to Xenial and snapshot apt repos or move to Atomic -- either way that should be on the roadmap.

Erik/Conor: There are potential staffing implications of doing more review than we do currently.

Jen: Improving the admin story so that admins don't need to SSH in is a prereq. Could we do both of those things in the less than a year that we have left?

Mike: If we didn't have workstation. My worry is how long do we drag out Xenial migration. Pull the band-aid and cut support for Trusty as soon as possible.

Conor: I think we have some wiggle room there in terms of obligations to community, newsrooms. Would basically have to support Trusty until EOL. Just working with the admins will take months. Don't think it's really an option to withdraw Trusty support sooner.

Mike: I'd almost rather send FPF staff all over the place and get it done rather than send emails. I was thinking it's basically feature freeze for Trusty. That whole split of supporting two OSes is very scary. Doubling QA effort, which is very manual. Need to get two more pieces of hardware - just going to be nightmare.

Conor: Taxing to do well. Testing will get easier with coming improvements.

Emmanuel: Would it be helpful to prepare a couple of example timeline plans that keep in mind Mike's/Conor's concerns?

Erik: Let's hammer out more details of Xenial transition first.

Mike: Would you consider remote functional tests to be prereq?

Conor: Yes - containerized client test for Selenium would have made it smoother process.

Conor: In my version of the appliance, that could or could not include end-to-end, hosting by third parties. But yes, we should get it to the point where we can call it SD-in-a-box.

Jen: E2E is a big deal in its own right. Everything we need to do to be in a good position to deploy something like Atomic, we need to do anyway.

Conor: I would wrap for additional discussion: prereqs for Xenial. Everyone wants to be Selenium tests.

Mike: Ansible sink logic. Ansible logic that's running on the system.

Conor: Given upgrade experience, open question. I want it for a variety of reasons but not sure it's needed for Xenial.

Conor: Consider automated upgrade testing pretty necessary QA.

Collaborative writing: pros and cons of different approaches

What are the upsides of transitioning to a different base OS like Atomic? ( https://github.com/freedomofpress/securedrop/issues/3492 )

  • immutable filesystem (with exceptions for e.g. configs)
  • powerful rollback capability, e.g. in event of problematic kernels
  • potentially no admin SSH access after first run (maybe that's unrealistic) - would users be OK with this? some certainly would
  • ability to provide additional guard rails to prevent admins from doing unsupported tasks
  • predictable upgrade cycle that can be automatic (w/o admin intervention)
  • We will control packages/images for everything other than the base OS; we can potentially serve them via THS to reduce fingerprintability. We could also do this on Xenial if we see a strong benefit. There are advantages; the main issue is the maintenance burden. Also, re: the THS -- would be worried about killing updates when (not if ;-) Tor breaks.
  • Tight controls over entire system
  • No need for migration techniques if we want to switch out tools (for example Apache-->nginx). Developers no longer have to be concerned with legacy backwards compatibility with system components. Quick turn-around for dramatically changing the system layout.
  • Kushal has amazing connections to the Red Hat community and has offered to get us in touch with core developers on both RHEL/Fedora teams
  • We can brand the image as "SecureDrop Server OS" or something like that (might be useful for branding, esp. with SecureDrop WS)
  • More reliable testing/upgrade testing
  • Fewer environments to maintain, potentially, or at least rapid deployment (for testing)

What are the downsides of transitioning to Atomic? What is the maintenance burden?

  • If we don't also support Xenial path, we will likely lose SecureDrop adoption as people don't want to do a full reinstall yet again

  • net new technology for team, we would have to become fluent and design upgrade path before LTS window closes

  • requires reinstall of all instances; significant burden on team since, as with SpookyInstall, we'll likely want to go even to non-priority sites

  • maintaining SecureDrop Atomic and SecureDrop Classic at the same time

  • potentially less community involvement, given widespread familiarity with Debian/Ubuntu, presumably less so with Atomic

  • likely confusing to admins; in support contexts, instance admins are unlikely to be familiar with basic Atomic functionality

  • lack of unattended upgrades: immutable filesystem means we must build new template, test it, and release it for upgrades

  • must provide writable paths in filesystem for instance-specific config info, e.g. keys, onion urls, etc.

  • Heavy operational burden: we will have to do vulnerability management for all packages that are used in the images, which requires rapid response for updates - is this several releases/week? We would need to invest significant engineering time into improving our CI and pre-release automated testing so that we are in a position to do this

  • Support burden will increase due to maintaining effectively two branches of SecureDrop instances as instances move over time (more so than with an upgrade to base OS)

  • Hardware compatibility questions will arise (will require extensive testing on currently recommended hardware, add'l testing on unsupported hardware for clients)

  • will have to rewrite some roles for RHEL family compatibility (package names)

  • will have to switch to SELinux

  • doesn't have wacky codenames for each release

  • If something goes wrong, might be more complex for admins to debug. Current SD requires basic Linux knowledge

What are the upsides of transitioning to Xenial?

  • We do not need to reinstall all instances
  • Comparatively low burden on instance admins: ideally, run do-release-upgrade, and voila.
  • Fewer logic changes for Ansible
  • Robust community support: more accessible at hackathons, potentially to auditors, enables us to delete custom provisioning/maintenance logic in favor of community-maintained logic

If we transition to Xenial first, will we want to shift to a new base OS soon thereafter?

  • Xenial would give us breathing room until 2021 (https://www.ubuntu.com/info/release-end-of-life)
  • We can. Once we get the workstation into production, we might by that time have more team members and more bandwidth for major server-side changes

If we transition to Xenial first, will we have to reinstall all existing SecureDrop instances?

  • No, this is a big bonus
  • Upgrade testing quite positive. Certainly, there are changes that must be made—in particular, the firewall rules must be updated beforehand—but it seems Admins can run "do-release-upgrade" and Xenial will work

Will we commit to maintaining a Trusty branch of SecureDrop? If so, for how long?

  • Yes. We can post new SD packages to xenial channel, and observe who has updated
  • We cannot maintain Trusty support past LTS lapse: we have neither bandwidth nor expertise to patch, test, and maintain security backports

What are the downsides of moving to Xenial and continuing with the current deployment story?

  • In 2021, we will need to do this again - Kicking the can down the road
  • We will need to walk admins through doing the release upgrade, and FPF staff will likely need to do a bunch of on-sites to bring priority instances onto Xenial
  • Abandoned SD instances will not be upgraded to Xenial. We should consider forcing a (well-communicated) unattended do-release-upgrade at EOL time. Necessary for source security.

Inputs into the decision should include

  • Capacity. Note that we're already managing a major potential transition for SecureDrop users (Qubes client) which will involve simultaneously maintaining Tails and Qubes workstations for a significant time (at least a year, maybe much longer).
  • Impact on administrators. We should keep the number of "asks" from SecureDrop administrators to a minimum, but we should also strive to make SecureDrop easier to install and update.
  • Impact on security. We should continually harden SecureDrop client and servers.
  • Impact on long term maintainability, including our ability to make and deploy changes.