Tuesday, December 13, 2011

Effective Troubleshooting

Yesterday and today I spent several hours hunting down a problem in a production system at work. I did a pretty good job of narrowing the problem down to one piece of software. I changed its configuration file, upgraded to a newer version, and tried a number of other fixes, only to have the problem continue.

My coworker and I were able to get the system limping along so we could sleep on the problem. We both woke up this morning with a similar idea of how to locate the true cause of the issue. It turns out that the real culprit was a database table that was far larger than it should have been. The system was designed for this table to hold a few hundred rows. Instead it had over 5 million, because a clean-up script wasn't running.

The first half of solving this problem meant I had to manually clean out the table. Trying to use an automated system would have brought the database machine crashing down. Once that was done, we had to get the clean-up script rewritten. That took a good part of the day, but we got it working. All told, the solution took a few hours to come up with. I just wish we had done a better job troubleshooting last night.
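
For anyone facing a similarly oversized table, one common way to avoid overloading the database is to delete in small batches and commit between them. What follows is only a minimal sketch, not the clean-up actually used here: the `events` table, its `created_at` column, the `purge_old_rows` function, and the use of SQLite are all assumptions made for illustration.

```python
import sqlite3


def purge_old_rows(conn, cutoff, batch_size=1000):
    """Delete rows older than `cutoff` from the hypothetical `events` table.

    Rows are removed in batches of at most `batch_size`, with a commit after
    each batch so that no single transaction touches millions of rows.
    Returns the total number of rows deleted.
    """
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM events WHERE rowid IN ("
            "SELECT rowid FROM events WHERE created_at < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # keep each transaction small
        if cur.rowcount == 0:
            break  # nothing left to delete
        total += cur.rowcount
    return total
```

Calling `purge_old_rows(conn, cutoff)` on an open `sqlite3` connection repeats the small delete until no matching rows remain. On a production database the batch size, and possibly a short pause between batches, would need tuning to what the machine can tolerate.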
