
Stats 337, Applied Readings in Data Science (Spring 2018)

Annotated bibliography


Theme: Blameless postmortems for data science?

Executive summary

Postmortems to evaluate failure events have long been considered an important practice for effective risk management. The idea of blameless postmortems goes further, emphasizing the need to create and facilitate a postmortem process in which participants are incentivized to provide detailed accounts and analyses of what happened without fear of punishment.

My readings for this annotated bibliography were guided by a desire to learn more about the idea of blameless postmortems, with particular attention to how they might be implemented for data analysis errors and data science issues. I’ve outlined a number of potential reasons that blameless postmortems may have not yet been widely adopted within data science practices, as well as some potential suggestions for trying to encourage this practice.

Barriers & Interventions to blameless postmortems for data science

While the idea of blameless postmortems has been adopted by many software engineering and DevOps teams (with many referring to Etsy’s process as a model), it seems that blameless postmortems have not yet infiltrated standard data science practices.

This may be for a number of reasons, including the following:

  • Data science is a relatively new field, and there are not yet established “standard” data science practices.
  • The definition of a data analysis success or failure may be more nebulous, such that errors or failures in data science work are less clearly identifiable – contrast this with some of the obvious software engineering failures (e.g. a cloud service is disrupted).
  • There may be fewer incentives for small data science teams, or for data scientists spread across an organization, to prioritize postmortem processes and learning over efficiency; they may also be in a weaker position to establish a culture of blameless postmortems.
  • There is a lack of examples or case studies of blameless postmortems applied to data analysis errors – and likewise, a lack of templates for conducting these kinds of postmortems.
  • Data scientists may underestimate the degree to which decisions made as part of data science practice are subject to human bias and error.
  • Changing organizational culture is hard work! Managing blameless postmortem processes effectively can be delicate, and may require specific training and practice – skills that likely lie outside what a data scientist typically considers part of the job.

From this list and from the readings, here are some thoughts about potential interventions to accelerate the adoption of blameless postmortems in data science:

  • Have data scientists create a blameless postmortem template for data science failures within their own organization. Doing so would likely catalyze thoughtful, explicit discussion of what data science success and failure look like, and would help establish group norms about what a blameless postmortem process looks like before a crisis forces the issue.
  • Learn about postmortem processes already in place elsewhere in the organization (e.g. within software engineering groups), and use these resources as potential templates for data science postmortems – or, if appropriate, see whether data science postmortems belong within the existing processes.
  • Consider when it would be appropriate and/or beneficial to the community to make a data science postmortem public.
  • Consider conducting systematic, internally reviewed premortems to identify potential risks and human biases before embarking on a data science project; revisit and iterate as necessary as the project unfolds.
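As a starting point for the first suggestion above, a minimal template might look something like the following. This is a hypothetical sketch – the section headings are my own, loosely modeled on common incident-postmortem write-ups, not an established standard for data science:

```markdown
# Data Analysis Postmortem: [short, blame-free incident title]

**Date(s) of incident:** YYYY-MM-DD
**Facilitator:** [name]  **Participants:** [names]

## Summary
One paragraph describing what happened and its impact, without assigning blame.

## Timeline
- YYYY-MM-DD HH:MM — [decision made / signal observed / action taken]

## Intended question and analysis plan
What question was the analysis meant to answer, and what did success look like?

## What actually happened
Focus on the context and information available at each decision point,
not on who "should have known better."

## Contributing factors
e.g. data quality, ambiguous success criteria, tooling, time pressure.

## What we learned

## Remediation items
- [ ] [action] — owner: [name], due: YYYY-MM-DD
```

Even if the headings change, writing the template down forces the group to agree in advance on what counts as a data analysis failure and how it will be discussed.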

Any feedback, thoughts, critiques, additions, welcome!

Top 3 articles

1. Blameless PostMortems and a Just Culture. John Allspaw (from Etsy). Code as Craft (May 2012).

Why you should read this: If there were a canon of readings on blameless postmortems, this article would be in it. The article is relatively short, but lays out the philosophy behind blameless postmortems in a cogent and persuasive manner and at a digestible pace – it’s a great way to quickly get up to speed on both the ideas and the actions that blameless postmortems involve. John not only presents simple explanations of key principles from the literature on risk management and safety (e.g. from Sidney Dekker), but also lays out concrete steps that Etsy takes to implement these ideas. And it seems like everyone writing about blameless postmortems links to this article… so don’t be out of the loop!

Winning Quotations:

  • “So technically, engineers are not at all “off the hook” with a blameless PostMortem process. They are very much on the hook for helping Etsy become safer and more resilient, in the end.”
  • “We enable and encourage people who do make mistakes to be the experts on educating the rest of the organization how not to make them in the future.”

2. What is a Successful Data Analysis? Roger Peng. Simplystats (Apr 2018).

Why you should read this: Maybe this article should come first, because fundamental to the question of postmortems for data science is the question: what does data analysis failure look like? What metrics do we use to identify it when we see it?

This article is a great entry into these questions – you’ll inevitably push your thinking by observing your own reactions and thoughts in response to Roger’s proposed definition, which he suggests might be unsettling (or not!).

In terms of content: Roger presents a framework for thinking about the question of success in data analysis, and contrasts his ideas about “acceptance” and “audience” with other notions, such as using internal and external validity as a measure of successful data analysis. He also brings two critical yet underappreciated points into the conversation: 1) the importance of considering the context in which an analysis is performed when trying to evaluate what analysis is appropriate; and 2) that human nature plays a big role in defining the success of data analysis.

Winning quotations:

  • “Success depends on human beings, unfortunately, and this is something analysts must be prepared to deal with.”
  • “When an audience is upset by a data analysis, and they are being honest, they are usually upset with the chosen narrative, not with the facts per se.”

3. Fearless shared postmortems – CRE life lessons. Adrian Hilton, Gwendolyn Stockman. Google Cloud Platform Blog (Nov 2017).

Why you should read this: This is a bit of an oddball reading suggestion (so maybe that’s reason enough!). While the motivation for Google’s Site Reliability Engineering teams to think about the mechanics of writing an external postmortem may be obvious, it is less obvious why data scientists might want to think about the value of external postmortems. So here are two reasons to read this article: 1) as the importance and role of data science grows, the likelihood that data science decisions and failures will affect customers directly and visibly may also grow (think of Facebook experiments that the public has pushed back on) – and with it, the value of external postmortems; and 2) this article has a nice section at the very bottom called “A side note on the role of luck”, which offers something both wise and unusual among descriptions of postmortem write-ups.

Winning quotations:

  • “We have found that, with a combination of automation and practice, we can produce a shareable version of an internal postmortem with about 10% additional work, plus internal review.”
  • “An internal postmortem assumes the reader has basic knowledge of the technical and operational background; this is unlikely to be true for your customer. We try to write the least detailed explanation that still allows the reader to understand why the incident happened; too much detail here is more likely to be off-putting than helpful.”

Bibliography

Note about citation formats:

  • Most citations follow the convention used in the GitHub syllabus, reverting to a more traditional academic citation format for academic publications.
  • Readings are generally grouped by topic and listed in reverse chronological order, except for the priority readings which are placed first.

General

Company case studies

How to run a postmortem debrief and other postmortem resources

Other


Articles about “a case for data literacy”

I didn’t go with this topic, but in case this is helpful to anyone…!

General:

In higher ed:

For educators: