HBO Max got a lot of attention on social media yesterday after they sent out an email to many of their customers.
The internet was quick to react, and most of the tweets were very supportive. We have all been there: wanting to test something and accidentally running the command on a production system.
Running any command on production scares me. One missed piece of logic, and there are a lot of things that could go wrong. Even with a backup, the process of restoring to the previous state is not an experience I would want anyone to have.
I don’t have an “integration test email” story, but two production incidents I do remember are:
I was responsible for keeping production systems patched with the latest package updates. On a Friday evening, I ran ‘sudo apt-get upgrade’. A few minutes later, the system was updated, and server monitoring told me the system was down. Those were the days of having just one web server and no load balancers in place.
Thanks to the server image created two days ago, I made a new server in no time and deployed the latest code there.
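One precaution for an upgrade like this is to simulate it first and read what would change. Below is a minimal sketch in Python, assuming a Debian-style system where `apt-get -s upgrade` prints one `Inst` line per package it would install or upgrade. The function names `pending_upgrades` and `simulate_upgrade` are hypothetical, made up for this illustration, and are not part of the original incident.

```python
import re
import subprocess

# In simulation mode, apt-get prints a line starting with "Inst" for each
# package it would install or upgrade, followed by the package name.
INST_LINE = re.compile(r"^Inst\s+(\S+)")


def pending_upgrades(simulated_output):
    """Return the package names a simulated upgrade would touch."""
    names = []
    for line in simulated_output.splitlines():
        match = INST_LINE.match(line)
        if match:
            names.append(match.group(1))
    return names


def simulate_upgrade():
    """Run apt-get with -s (simulate), so nothing is actually installed."""
    result = subprocess.run(
        ["apt-get", "-s", "upgrade"],
        capture_output=True,
        text=True,
        check=True,
    )
    return pending_upgrades(result.stdout)
```

Calling `simulate_upgrade()` on the server returns the list of packages that the real upgrade would change, which gives a chance to spot anything risky before a Friday evening.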
UPDATE gone wrong
As part of a new feature, we allowed users to attach files from box.net. We had tested the integration, and it worked well. The code review was approved. In production, someone used the integration in a way that ran an “UPDATE” query without a “WHERE” condition. 🤦‍♂️
It took me five hours to restore the database to its previous state and ensure that the data was synced to as recent a snapshot as possible.
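One way to guard against this kind of query is to refuse it before it reaches the database. Below is a minimal sketch in Python using the standard `sqlite3` module, not the code or database from the incident. The names `guarded_execute`, `UnsafeQueryError`, and `max_rows` are hypothetical, and the check is a simple pattern match rather than a real SQL parser, so a “WHERE” inside a subquery or string literal would fool it.

```python
import re
import sqlite3

# Deliberately simple checks: is this an UPDATE/DELETE, and does it mention WHERE?
WRITE_STATEMENT = re.compile(r"^\s*(UPDATE|DELETE)\b", re.IGNORECASE)
HAS_WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)


class UnsafeQueryError(Exception):
    """Raised when a write looks like it would touch too many rows."""


def guarded_execute(conn, sql, params=(), max_rows=1):
    """Run sql, rejecting UPDATE/DELETE without WHERE or with too wide a reach."""
    is_write = bool(WRITE_STATEMENT.match(sql))
    if is_write and not HAS_WHERE.search(sql):
        raise UnsafeQueryError("UPDATE/DELETE without a WHERE condition")

    cursor = conn.execute(sql, params)
    if is_write and cursor.rowcount > max_rows:
        # Undo the change before anyone else can see it.
        conn.rollback()
        raise UnsafeQueryError(
            f"statement affected {cursor.rowcount} rows, limit is {max_rows}"
        )
    conn.commit()
    return cursor.rowcount
```

The second check matters as much as the first: a query can have a “WHERE” condition and still match every row, so the sketch rolls back whenever more rows change than the caller expected.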
Production incidents are always brutal. Always be learning and ensure that processes and tests are updated to avoid as many incidents as you can.