Keep Calm and Fix It
How to run an incident management process, and the lessons I learnt from a decade of failures.
A decade ago, I was part of the TYPO3 Server Admin Team and helped keep *.typo3.org alive. When things fell apart, we could only hope that someone from the team had read the monitoring emails and was available to check.
At my current company, such a best-effort approach doesn’t fit. But this only became apparent after a historic >24h incident years ago. As we are critical to the business of thousands of companies and their millions of IoT devices, we run an incident management process with 24/7 on-call across our team of almost 100 engineers.
In this talk, I give an introduction to the art of incident management, show how you can set it up in your company, and, most importantly, share the lessons I have learnt in open source and while being responsible for the process since 2019.