
Retrospectives: 2022-01-04 SecureDrop Workstation Client 0.5.0 Release


Context

On 2021-12-02, we released SecureDrop Client 0.5.0, which introduced a bug that required a point release. This meeting follows an internal meeting held on 2021-12-07.

Relevant prep

  • Links to some of our past retrospectives: SecureDrop Workstation alpha retro, SecureDrop 1.7.0/1.7.1 Retro, SecureDrop 1.8.0 Retro, security vulnerability example (retro not captured, unfortunately, but this captures the preparation documentation for one)
  • Handy links to some other retrospectin’ we’ve done recently during sprint planning: 2021-12-15 Sprint retro, 2021-11-10 Sprint retro

Agenda:

  • Meeting goals/housekeeping
  • Quick recap: brief narrative
  • Discussion/observations (note the separate section for technical changes)
    • Problem complexity? Can we address simple problems, or make harder problems simpler?
    • Any common threads in our previous retros?
  • Release mechanics: any specific points to call out?
    • Technical, human/process-oriented
  • Prioritizing goals/next steps
    • Stakeholders
    • Success metrics
  • Wrap-up

Housekeeping

  • Facilitator: Ro
  • Attendees: Allie, Conor, Cory, Erik, Kevin, Kushal
  • Meeting goals (suggested): identify areas of strength/growth based on this client release experience; bring in reflections on our other retros; develop specific priorities or actionable items.
  • Note to attendees: the recap section will be briefer than normal so we can focus on observations/next steps
  • By consensus: we’ll publish the meeting notes on the wiki

Recap of 0.5.0 release

Individual experiences

(Erik) I provided QA for the SecureDrop Client 0.5.0 release. As mentioned in the previous retro, I became aware of an issue with 0.5.0 after the release on Thursday when running the preflight updater for a sanity check.

It turned out that the SecureDrop Client failed to run due to an AppArmor policy violation: the Python binary in use (the system Python, symlinked from within the package) was not allow-listed in the AppArmor rules. This was due to a build environment discrepancy between the artifact I tested and the one that was actually released.
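
As background on this failure mode: an AppArmor profile only permits a confined program to execute binaries that are explicitly allow-listed, so a package whose symlink resolves to an interpreter path missing from the profile will be blocked at launch. The sketch below is illustrative only; the profile path, interpreter path, and rule are assumptions, not the actual SecureDrop Client profile.

```
# Denials appear in the kernel log:
sudo journalctl -k | grep -i apparmor

# A hypothetical allow-list rule inside an AppArmor profile,
# e.g. /etc/apparmor.d/usr.bin.securedrop-client (path and rule illustrative):
#
#   /usr/bin/python3.7 ix,   # inherit-execute: the confined app may run this interpreter
```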

I alerted other team members; Kev prepared a rollback plan and both Allie and Conor ultimately helped to prepare a point release. I submitted a small SD Client PR that would have changed the AppArmor rules, but we ultimately resolved the issue in a point release by addressing the build environment discrepancy that was the root cause.

What’s still not clear to me: Why do we ship a Python binary in the Debian package? Should we instead rely on the system Python?

(allie) I was the developer who worked on a lot of the code that went into the release, so I spent a lot of time focused on regressions and review. I also built the deb and submitted a PR before I left for the “weekend”. The third role I had was coming in at the end of the incident, seeing that the issue was with my build environment, then helping with testing and pushing out a new build.

(conor) I scheduled last-minute PTO for the latter half of the day (Thursday), into a long weekend. Allie had helpfully built the package and passed it over for review. I made a note to myself to discuss the use of DispVMs during the build process, but didn’t reject the PR. Merged, deployed, and then went AFK. I returned in the evening after checking my phone and seeing that we had problems. The issue was already fully understood, and a quick rebuild (in a clean environment) resolved it.

(KOG) I think Erik first flagged the issue; I reproduced it (which took a bit because my sdw setup was pretty old) and worked on a worst-case-scenario plan in which the best option would be to roll back the release and notify users. The fix turned out to be pretty simple, so once Allie and Conor were back online and working on 0.5.1, I pivoted to figuring out how to verify the fix (pushing to apt-qa proved simplest).

(cfm) Saw discussion of 0.5.0 launch bug in #securedrop-workstation. Didn’t have a full Workstation setup for troubleshooting, but started poking around with Diffoscope. Investigation was fast and collaborative, but roles and tasks felt improvised until more folks were online and came up with a plan forward over the course of a couple of synchronous check-ins.

(ro) much like cfm, I saw the discussion of the bug on Slack. I attempted to diagnose, but at the time I did not have an operational prod workstation setup. I was pretty hands-off on the response but remained online as backup.

Observations/Discussion

Individual observations

  • (Kushal) Before merging any package PR, someone should manually test the package (both rpm and deb) on a real system
  • (Conor) Expecting humans (i.e. our developers) to behave as reproducibly as a computer is unfair. If we want a consistent build environment, let’s use a computer for that.
  • (KOG) preflight testing of packages should be mandatory in non-emergency situations imo.
  • (KOG) if the build process is not working for whatever reason (even if we move to fully-automated builds), that should stop the release process until the failure is fully understood and any changes have been reviewed.
  • (KOG) rollback procedures should be available (though the fix was quick in this case; we will not always be that lucky)
  • (KOG) there will probably be some notes about availability of devs; IMO this is a red herring, and we should be allowing for unexpected unavailability
  • (Erik) apt-qa.freedom.press is very useful for testing prod artifacts before final release. We do this for SD Server release artifacts, but haven’t historically done so for the SD Client. IMO it would be good to use the same process across the board: always test prod artifacts (for Debian packages) on apt-qa prior to release. To recap my understanding, the packages on apt-qa are updated by updating the release branch in the prod LFS repo. apt-qa uses the prod signing key, so it can be used (including for the SecureDrop Workstation) as a source of packages in a prod-like configuration. (See the illustrative snippet after this list.)
  • (cfm) There is “documentation debt” here: the slight asymmetry between the RPM and Apt workflows. I started mapping this out before the holidays but need to put it into a usable/publishable form.
  • (Allie) I came in so late that I missed the incident response handling or coordination. It seemed like a lot of folks were trying to help with the incident.
  • (Allie) I realized that I totally forgot to make a PR against the test repo so that we could first QA on apt-test, so I left for the “weekend” with the PR for a prod release. The actual testing of the artifact during QA never happened.
  • (Allie) Another observation is that we’ve become quite comfortable testing nightlies; that, paired with a very slow release cycle, might be a recipe for skipping the manual testing of the artifact.
  • (Allie) I learned that we need to bump the build version along with the package version if we want to do a full release… Still learning how we could have done just a build bump without the package bump and re-signing
  • (ro) I wanted to help but wasn’t really sure how to fit in usefully. Also, it’s moments like these when ‘administrative debt’ (I don’t know what to call it, but the number of workstations/environments that need care/feeding/provisioning) comes back to bite you.
  • (ro) I was a little stressed during the ‘rollback or point release’ discussion that, without Allie or Conor present, we were going to rush to a decision that might have hidden consequences
  • (ro) maybe we need to have some predetermined procedures for when we’re too much of a ‘skeleton crew’ to be performing certain actions, like releases
    • (cfm) +1 “release bus factor”
    • (Erik) Agree, in this case it would have been preferable to reschedule the release. A checklist that includes RM or Deputy RM availability during business hours seems like a good idea.
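
As a concrete illustration of the apt-qa point above: a prod-like test machine can be pointed at apt-qa instead of the regular prod repo, so the exact prod-signed artifact gets installed the same way a user would install it. The snippet below is a sketch; the suite and component names are assumptions, not the actual repo layout.

```
# Hypothetical apt source for QA of prod-signed artifacts
# (e.g. /etc/apt/sources.list.d/securedrop_workstation.list; suite/component illustrative):
deb https://apt-qa.freedom.press buster main

# Then refresh and install the candidate package as a user would:
sudo apt update && sudo apt install securedrop-client
```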

Specific Notes or Comments on Release Process (technical sidebar? Have it here!)

  • (cfm) As a matter of course, should we add a test case or CI check for any failure mode that has bitten us in a real-world production release? [Can you elaborate a little?]
  • Packaging sidebar: separate code and package; rebuild the package and sign it once we have tested it. Bumping the package version is different from changing the code and re-releasing.
  • (conorsch) The “Read-only Fridays” rule means Thursday mid-day prior to PTO was effectively a Friday. I was overly eager to deploy the long-awaited 0.5.0 feature set.

Group Discussion: additional notes

  • Testing the final rpm and deb artifacts has to be part of the release checklist (see the illustrative commands after this list)
  • Discussion of having a yum-qa repo (as well as apt-qa)
  • Reverting the release is an option (thanks GitHub!); it doesn’t remove the package for people who already installed it, just for the pilot stage
  • Decision tree for whether or not to revert/rollback or point release
    • Depends on staff capability; we shouldn’t assume people will leave PTO
  • How to do the technical bits (rollback)? Does everyone know how? Drills/practice?
  • Availability: people take on all the roles unless we have roles assigned
  • apt-qa vs apt-test: do we have a clear understanding of which is for what? Should we discuss this here or later? Let’s define this clearly and tell people where to push to
  • Discussion: dedicated build machine? [Conor: yes, but no shell access; it will be a mutating machine; we should document!]
  • Docs: it’s not just a one-person task; it could be a lot of collaboration
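
A minimal sketch of the final-artifact smoke test discussed in the first item above, under the assumption that the tester downloads the exact artifacts slated for release (package filenames are hypothetical):

```
# On a Debian-based Workstation VM (filename hypothetical):
sudo apt install ./securedrop-client_0.5.1+buster_all.deb

# On the Fedora-based dom0 or an rpm test host (filename hypothetical):
sudo dnf install ./securedrop-workstation-dom0-config-0.5.1-1.fc32.noarch.rpm

# Then launch the client and run the preflight updater to confirm a clean start.
```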

Goals/Next Steps/Todos ("+n" denotes endorsement from multiple people)

  • Documentation need: apt-test vs apt-qa, when to use each +1 (^ cfm)
    • (conorsch) There is a “WIP map” that may belong in the “securedrop” GitHub wiki: https://github.com/freedomofpress/securedrop/wiki/Artifact-Flow
  • Documentation need: how to roll back a release +3
  • Documentation need: decision tree for when to roll back +1
  • Process: define internal minimal crew required for release +2
  • Process: for SD Workstation component releases, use an issue template similar to SD Server release that has clearly defined roles at least for RM and Deputy RM +2
  • Process: packaging vs code separation: we could have used 0.5.0 -> 0.5.0-1 rather than 0.5.0 -> 0.5.1. (Doing so would be a change, but would sidestep the need to create new signed tags in the git repo, for instance.) This was the reason to separate the code release from the package-building process. (See the illustrative changelog sketch after this list.) +1
  • Process: test the final artifacts on prod machines before release by downloading the rpm/deb packages and running dnf/apt on them. +4
  • Process: build environment standardization. We can mandate the use of DispVMs for package builds as a near-term mitigation; Conor volunteers to document. +6 Fully automated builds in the distant future, DispVMs in the shorter term. +1
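
To illustrate the 0.5.0 -> 0.5.0-1 option in the packaging-vs-code item above: in Debian packaging, bumping only the revision after the hyphen ships a rebuilt package without a new upstream code release or a new signed git tag. A hypothetical debian/changelog entry follows (distribution, wording, and maintainer line are all illustrative):

```
securedrop-client (0.5.0-1) unstable; urgency=medium

  * Rebuild 0.5.0 in a clean (DispVM) build environment; no source code changes.

 -- SecureDrop Team <securedrop@freedom.press>  Thu, 02 Dec 2021 20:00:00 +0000
```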