This morning at work I helped upgrade some software that I had a hand in creating. It is made up of a bunch of small programs that have evolved over time. Each program is quite simple and very easy to understand. Together they form a complex system that runs on over a dozen different servers and processes large amounts of data.
The upgrade seemed to be running smoothly and I was able to do my portion of the deployment without any issues. When we went to start several of the programs, we saw some errors. That led to everyone gathering around a collection of screens, trying to troubleshoot the problem as we were under a tight deadline to get the system back up and running.
As all of my stuff was working correctly, I acted as a messenger to one of our operations engineers who was at our data center in another part of the country. It would have been nice to talk to him on the phone, but that was impossible because of all the noise where he was. We had to resort to texting. The whole team worked together and we quickly diagnosed the problem. It turned out that the engineer doing most of the upgrade had missed a critical step. Once we figured that out, everything came up nicely.
How did this critical step get missed? We had a list of instructions with every individual process outlined for the upgrade. The engineer was meticulously checking off each task as it was completed. After two pages of instructions, he got a little careless and checked off two items after completing only one step. It was an easy mistake to make, but it shows the importance of carefully following directions. Not doing so in a car can get you lost. Not doing so with a bunch of computer servers can really screw things up.