Proving that anyone can serve as a bad example, we have the people responsible for the DNS foul-up of Sweden’s .se TLD.
Analysis of the failure shows that an update script left a “.” off each record, blowing up the works. You might think that with something as important as a country TLD, people could avoid basic project management mistakes.
The first and most important part of any data center, system, or networking project is test, test, and test again. Because this script produced a fatal error, even a simple test run would have revealed the fault before the script was turned loose on everything. If you turn an untested script loose, be prepared for the consequences.
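As a rough illustration of how small such a test can be, here is a minimal sketch of a pre-deployment check that flags records whose target host name is missing its trailing dot. The function name, the record types checked, and the simplified one-record-per-line layout (no trailing comments, fully qualified targets only) are all assumptions made for this example, not details of the actual .se tooling.

```python
# Hypothetical pre-deployment check; the record layout is an assumption.
# Record types whose last field is a host name that should be fully qualified.
TARGET_TYPES = {"NS", "CNAME", "MX"}


def find_missing_trailing_dots(zone_lines):
    """Return (line_number, line) pairs whose target host lacks a trailing dot."""
    problems = []
    for number, line in enumerate(zone_lines, start=1):
        fields = line.split()
        # Skip blank lines and comment lines.
        if not fields or fields[0].startswith(";"):
            continue
        # Look for the record type anywhere in the line so that
        # optional TTL and class fields are tolerated.
        if not any(field in TARGET_TYPES for field in fields):
            continue
        # In this simplified layout the target host is the last field.
        if not fields[-1].endswith("."):
            problems.append((number, line))
    return problems


if __name__ == "__main__":
    sample = [
        "example.se. 3600 IN NS ns1.example.se.",
        "example.se. 3600 IN NS ns2.example.se",
    ]
    for number, line in find_missing_trailing_dots(sample):
        print(f"line {number}: missing trailing dot: {line}")
```

Run against the script's output on a test copy of the data, a check like this fails loudly before anything reaches production.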
The second part is to be able to detect things going wrong and to stop the failure before it cascades. However well things are tested, it’s always possible that something will go wrong, and the more complex the system, the more likely it is. This requires a test protocol to confirm that things are going correctly and a way to stop the damage from spreading if they aren’t. The 900,000 web sites affected show that this point failed.
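One common way to keep a failure from cascading is to push a change out in small batches and check health after each one. The sketch below assumes that approach; `staged_rollout`, `apply_batch`, and `is_healthy` are hypothetical names, and what counts as a "batch" or a "healthy" result depends entirely on your own system.

```python
def staged_rollout(batches, apply_batch, is_healthy):
    """Apply batches one at a time and stop at the first failed health check.

    Returns (applied, ok): the batches that were applied, including the one
    that failed its check if any, and whether the whole rollout succeeded.
    """
    applied = []
    for batch in batches:
        apply_batch(batch)
        applied.append(batch)
        # Halt immediately so the remaining batches are never touched.
        if not is_healthy(batch):
            return applied, False
    return applied, True
```

With a check like this in the loop, a bad update damages one batch instead of every record.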
The final point is a plan to back out any changes made, as the last line of defense against an unplanned failure. If you can’t foresee the problem and can’t keep it from impacting users, at least make the duration of the problem as short as possible by reversing the changes and going back to the previous version. This is another critical point that didn’t get implemented.
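A back-out plan can be as simple as keeping a copy of the previous version and restoring it when verification fails. The following is a minimal sketch under that assumption, using an in-memory dictionary to stand in for the real data; `apply_with_rollback`, `change`, and `verify` are hypothetical names for this example.

```python
import copy


def apply_with_rollback(state, change, verify):
    """Apply change(state) in place; restore the previous version on failure.

    Returns True if the change was applied and verified, False if it was
    rolled back.
    """
    # Keep the previous version so it can be restored.
    snapshot = copy.deepcopy(state)
    try:
        change(state)
        if verify(state):
            return True
    except Exception:
        # An exception during the change or the check counts as a failure.
        pass
    # Back out: put the previous version in place again.
    state.clear()
    state.update(snapshot)
    return False
```

The important part is that the previous version is saved before the change starts, so reversing it takes seconds, not hours.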
So, follow these 3 simple rules and your project won’t end up the laughing stock of the Internet.
Vern, SwiftWater Telecom
data center, web hosting, Internet engineering