Review: Ops School – General Operations 101
November 9, 2013
I've always focused on software development and I have barely any experience in operations. This video series seemed like a light introduction to such a deep topic, so I decided to take a look at it.
A short first section enumerates the many career paths that coexist under the generic operations umbrella and describes the responsibilities of each of them. It seems pretty complete and accurate to me, but I would have used a different presentation format: the long enumeration in front of a camera could benefit from diagrams to better visualize how the different careers differ and complement each other.
The second section covers outages in a conceptual, non-technological way. It emphasizes that outages happen in the context of an organization, a system where people are key and where our human nature can be an obstacle not only during an outage but also during its subsequent investigation. Among others, the following human flaws are discussed:
- Some incidents can be traumatic and affect our memory.
- We can be tempted to hide facts or be ashamed of speaking about our errors. If we provide a friendly environment where errors are allowed, we'll be able to detect and study them for future prevention.
- Sometimes we focus too much on a given fact, getting biased towards a simple theory that would explain the incident, but the simplest theory is not always the right one.
- Outcome bias: we judge our decisions once we know what happened next, but decisions should be evaluated on the information available at the time, without considering the results.
- Thematic Vagabonding: not spending enough time with each part of the investigation, jumping back and forth from one to another.
- Heroism: we sometimes react as silent lone wolves, trying to fix the problem on our own without communicating with anyone else.
Chat-based communication helps prevent some of these problems, and its timestamped logs make it possible to track activities, communications and decisions for further study. Public communication channels (status page, Twitter feed) should be monitored as well, not only to track how external stakeholders are being informed, but also how internal teams perceive the incident at any moment. They should be correlated with data logs, changes in the system (e.g. deploys) and outside events (e.g. marketing campaigns driving traffic to the system). The compiled data should be ordered and reviewed, forming a timeline of events that is clear to everyone. At that point, the incident can be discussed in order to learn from it.
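The timeline-building step can be sketched in a few lines of Python. This is only an illustration of the idea, not something shown in the video: the event sources, timestamps and messages below are made up, and each source is assumed to be already ordered by time.

```python
from datetime import datetime
from heapq import merge

# Hypothetical events, one list per source, each already sorted by timestamp.
# Every event is a (timestamp, source, message) tuple.
chat = [
    (datetime(2013, 11, 9, 10, 12), "chat", "alice: checkout looks slow"),
    (datetime(2013, 11, 9, 10, 20), "chat", "bob: rolling back the deploy"),
]
deploys = [
    (datetime(2013, 11, 9, 9, 58), "deploy", "checkout service released"),
]
status_page = [
    (datetime(2013, 11, 9, 10, 15), "status", "Investigating checkout errors"),
]


def build_timeline(*sources):
    """Merge several time-ordered event lists into one ordered timeline."""
    return list(merge(*sources, key=lambda event: event[0]))


timeline = build_timeline(chat, deploys, status_page)
for timestamp, source, message in timeline:
    print(f"{timestamp:%H:%M} [{source}] {message}")
```

Seeing the deploy listed right before the first complaint in the chat is the kind of correlation a shared timeline makes obvious.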
The video also provides some basic terminology for tracking an outage (time to detect/resolve, impact time, degrees of severity) and ideas for quantifying the impact (e.g. number of affected customers, orders lost...).
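As a rough sketch of how these durations relate to each other, assuming an incident is described by three timestamps (the class and field names are my own, and so is the choice to measure time to resolve from detection; some teams measure it from the start of the impact instead):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the impact began
    detected: datetime  # when someone (or something) noticed
    resolved: datetime  # when service was restored

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_resolve(self) -> timedelta:
        # Assumption: measured from detection, not from the start of the impact.
        return self.resolved - self.detected

    @property
    def impact_time(self) -> timedelta:
        return self.resolved - self.started


# Made-up example incident.
incident = Incident(
    started=datetime(2013, 11, 9, 10, 0),
    detected=datetime(2013, 11, 9, 10, 12),
    resolved=datetime(2013, 11, 9, 11, 5),
)
print(incident.time_to_detect, incident.time_to_resolve, incident.impact_time)
```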
This section also explains the Swiss cheese model of incident causation. Incidents in complex systems are often caused by several emergent factors that contribute to the problem, so looking for a single root cause can be misleading. A few references to different system theories and risk management are also provided at the end of the chapter.
The whole section felt a bit unstructured, but I found it way more interesting than I initially expected.
Application-level monitoring with Graphite and StatsD
This section starts with a short introduction to StatsD and the different components of Graphite. Then a few examples demonstrate how data are stored and queried, and how the different components are installed and configured.
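To give an idea of what application-level instrumentation looks like, here is a minimal sketch that sends metrics to StatsD using its plain-text `name:value|type` format over UDP. It is not taken from the video: the metric names are hypothetical, and it assumes a StatsD daemon listening on localhost on the default port 8125. Since UDP is fire-and-forget, nothing breaks if no daemon is running.

```python
import socket


def format_metric(name, value, metric_type):
    """Build a StatsD line: 'c' for counters, 'ms' for timers, 'g' for gauges."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one metric line to StatsD over UDP (no reply is expected)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metrics: count one completed order and record a 320 ms checkout.
send_metric(format_metric("shop.orders.completed", 1, "c"))
send_metric(format_metric("shop.checkout.duration", 320, "ms"))
```

StatsD aggregates these values and periodically flushes them to Graphite, where they can be stored and graphed.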
This section starts with a basic tutorial on SSH (mentioning Mosh, which can be useful if you have to work with unreliable connections) and introductions to screen and tmux (where I missed an on-screen display of the keystrokes being typed). Then a short tutorial introduces .bashrc and .bash_profile with a brief example.
The second part of this section gives some basic notions on ticketing systems using Jira and briefly mentions how useful note-taking tools can be, using Evernote as an example. I found this whole part pretty fluffy, even for beginners. I would have preferred it to focus on one tool, or to provide actionable advice or generic best practices instead.
As its name indicates, this last section introduces the Nagios monitoring tool. It covers installation and setup in a pretty comprehensive way, explaining all the basic concepts and options and mentioning how extensible Nagios is.
The video seems a bit inconsistent in terms of content (two very theoretical sections vs. three hands-on ones), and the skills required from the audience vary from section to section (e.g. those who understand the CLI tools used to generate data in the Graphite examples will already know about .bashrc).
Leaving aside the second part of the "productivity tools" section, I think the content can be interesting for those who want to be exposed to these topics for the first time.
This series of videos is an alternative learning method to the community-driven Ops School Curriculum, an interesting project I wasn't aware of. Oh, and any profits from the videos are donated to non-profit organisations that support learning for low-income people. Nice initiative, O'Reilly & Etsy!