Failure. I need to talk about failure, and not just any failure: my failure. I need to share it with everyone in the production group, everyone in R&D. My team, my peers, my managers. The meeting starts in just a few minutes, and I am under fire to explain what went wrong, how I failed the organization, and how we need to be better.
General George S. Patton Jr. said, “The test of success is not what you do when you are on top. Success is how high you bounce when you hit the bottom.” There is a lot to learn from that saying, and not only for people. Successful systems need to bounce back from a failure, and do it well. IT systems need to be able to endure a catastrophic event and just dust themselves off. This is what we expect of our production systems at Taboola, nothing less.
Back to the meeting: I was the crisis manager during a data center failure. (You can read more about it here – https://www.linkedin.com/pulse/moment-ariel-pisetzky/.) The entire data center went dark, and it was my team that had to bring it all back. It was my job to lead the effort through the crisis management process to a successful conclusion. One of the most important parts of that process comes a few days after the dust settles, in the form of a post mortem review. It’s minutes before the post mortem meeting and I’m already connected to Zoom. How do I embrace the failure and bounce back?
There are about 20 people in the meeting. Everyone here was impacted, from working into the night on restore procedures to validating reports and making sure all the systems and data are in perfect condition. It’s easy to look for people and external reasons to blame: “It’s the provider that didn’t follow through,” or “the priority wasn’t set, so it wasn’t done,” and so on. It’s so very easy to blame the system. That is not what we do. We make sure to go over the facts and use no names.
We never use names in a post mortem. A name is associated with blame. The point of the post mortem is to learn and to improve operational excellence. One could say, to bounce back and be better. There is no way we can talk about the facts if there is fear: fear that names and reputations will be dragged through the mud.
I start the meeting with the facts and timelines. If I did my job right, it’s all documented in Slack, and we should have a good account of what we did and when we did it. The alert system and supporting graphs are also reviewed, as are the logs, to make sure we covered the relevant engineering angles of the incident. Once we have the facts out and the timeline sorted, it’s time to talk about the failure.
We love what we do, and we love the solutions we came up with. It’s hard to look at a system you built and say: this is the failure point, this is the design flaw that caused the problem. But it’s exactly what needs to be done. It’s time to voice the failure points in the system that caused the incident. It doesn’t matter whether they are internal or external, code I wrote or a system someone else manages. The root cause is the raison d’être of the meeting and needs to be debated. Yes, additional factors come into play: Was there monitoring? Was there alerting? Did we fix the problem fast enough? All of these need to be debated and improved. But we need to start with the root cause.
Post Mortem Rules
Any good post mortem will ensure the participants have the confidence to share anything and everything they know about the failure. So here are the ingredients we mix into our culture and post mortem meetings, so you can do the same:
- It’s OK to fail. It’s part of what we do and how the innovation process works. You can and will fail, make sure to learn from it.
- We don’t blame anyone for a failure. To reinforce this, we don’t use names. If it’s a code push or a configuration change, we don’t name the person or team behind it. It’s simply not relevant to the discussion.
- The most important part of the meeting is the focus on improvement: what needs to be done, what monitoring should be put in place, and what alerting is missing.
- No matter how bad the incident was, we make sure to use the crisis as a learning experience. The SLA budget was already spent, the lost revenue is already on the books. You can simply lose all that budget, or you can turn it into a learning experience.
Sharing failure should be easy. It’s not. Yet talking about my failure in a data center downtime event in front of all my colleagues was easy, because we all make sure to take the best out of the incident and to fix everything we can before a data center goes dark again. And it will happen again; data centers do fail from time to time. It’s how we respond to that failure that defines our success.
Take Away from the Post Mortem
This was our first major production incident of the COVID work-from-home era, so we took a few lessons from it, most of them concerning war room management when we are all at home. The most valuable lesson we learned? Running multiple, parallel Zoom meetings with a few people acting as production liaisons. It worked like a charm and helped us immensely in bringing the system back from the DR event.
Each task force working on a specific realm of issues had a designated liaison who stayed attentive to both the task force Zoom and the main war room Zoom. This allowed fast communication and feedback between the task forces working on the recovery process. Compared with our last DR test, we cut our recovery time by 90% in this real event, and much of the saved time was due to improved communications.
So, how was it to talk about my failure to a group of Taboolars? It was easy, it was insightful, and it helped me, my team, and Taboola improve. Being the person on point to lead the process was neither daunting nor scary; it was a chance to improve our SRE skills, our production procedures, and our promise to provide the best platform for our clients.
If you have not read Radical Candor by Kim Scott, it’s a good guide to how feedback should be provided and why it matters. I recommend reading it.