Recently I blogged about our Exchange server getting toasted. That was just a warm-up. Here is the timeline of the ongoing catastrophe.
- Monday, October 8th, 2007 - New network administrator starts.
- Friday, October 12th, 2007 - Old network administrator's last day. I'm in Boston. A problem is reported with files not updating on a DMZ server. He and the new network administrator determine it was an issue with SQL Server replication. They reinitialize the replication snapshot in such a way that it kills the security on the destination database and prevents customers from accessing it.
- Monday, October 15th, 2007 - I walk into the SQL Server replication mess and work on getting that cleaned up. The actual problem was a firewall rule they changed that was blocking FTP uploads to the DMZ server. Meanwhile, the new network administrator is busily installing software which he says will help him document our infrastructure. This, in turn, installed .Net 3.0, which requires a reboot. He doesn't restart because it's the middle of the work day. The cluster services on our SQL Server and Exchange boxes go haywire.
New network administrator doesn't notice the services failing and proceeds to install all outstanding Windows Updates on all servers. He also installs Internet Explorer 7 on all of them, even the DMZ server, which didn't have IE installed at all for security reasons.
- Wednesday, October 17th, 2007 - The cluster failures are affecting user productivity. The PC Tech notices there are amber and red lights flashing on the drive arrays for Exchange and SQL Server. Exchange flambé (linked above) ensues. New network administrator earns the title Retarded Network Monkey (RNM). Restoring Exchange and rebuilding mailboxes takes the better part of a week.
While this is going on there are weird problems with the ISA and IIS servers. It turns out that while the entire environment is horribly unstable, RNM continues patching servers. One patch on the DMZ server takes our customer-facing ASP.Net 1.1 website offline. It takes nearly two days of research before my boss finds and fixes the problem. Meanwhile, I'm fixing SQL Server replication again after another patch kills it.
- Tuesday, October 30th - Guess what? More patches are applied. SQL Server replication breaks again. Our ISA server goes bonkers and won't route outbound traffic. ISA rules are implemented with no rhyme or reason in an attempt to route traffic through a secondary connection.
We finally have our first departmental meeting to discuss the problems we have been having. RNM makes it clear he doesn't think he had anything to do with any of the problems. He mentions he's in the middle of applying patches to the DMZ server and our SQL Server cluster. In the middle of the work day. Again. I bite my tongue until it bleeds. We come up with a change control process and a new edict: NO NEW PATCHES until everything we have is stable.
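The edict we settled on can be boiled down to a gate with two conditions: nothing gets patched until every tracked service is stable, and nothing gets patched without an approved change ticket. Here's a minimal sketch of that logic; the class and field names are made up for illustration, not our actual tooling.

```python
from dataclasses import dataclass, field

@dataclass
class ChangeControl:
    # service name -> "stable" or "degraded"
    service_status: dict = field(default_factory=dict)
    # tickets that went through the change control process
    approved_tickets: set = field(default_factory=set)

    def all_stable(self) -> bool:
        return all(s == "stable" for s in self.service_status.values())

    def may_patch(self, ticket: str) -> bool:
        # The edict: NO NEW PATCHES until everything is stable,
        # and never without an approved change ticket.
        return self.all_stable() and ticket in self.approved_tickets
```

Had something like this been enforced, the mid-workday patching of the DMZ server and SQL Server cluster would have been blocked on both counts.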
- Thursday, November 1st - I finally have SQL Server replication rebuilt and stable. Users report that files they FTP to the DMZ server aren't showing up. It turns out the rules implemented on Tuesday broke FTP, and the rule is also eating the response, so the clients think their files are going through. This is the same rule that was first implemented on October 12th. RNM decided to re-enable it, but added the nice touch of not providing any error messages this time.
- Friday, November 2nd - Exchange is intermittently unreachable. I haven't checked the servers yet, but I'm thinking it's the cluster service failing and it has something to do with more damned patches.
If you've read this far you may be wondering why there are so many outstanding patches. It's two things: First, the previous admins had an "if it's not broke, don't fix it" mindset. Second, and more importantly, they actually read what each patch was for and only applied the necessary ones. They also researched what a patch did and checked for reports of problems before applying it. That's why things like .Net 3.0 were never applied to the DMZ server: the prior admins knew it would cause problems with the ASP.Net 1.1 websites. They probably should have documented this somewhere, or at least flagged those patches so they wouldn't show up. I'll admit we were seriously behind on getting patches applied, but the shotgun approach of "apply everything and fix what breaks" is NOT acceptable to me.
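The vetting the previous admins did by hand amounts to a per-host exclusion list: updates known to break something on a given role get skipped, with the reason recorded. A sketch of that idea, with made-up update names and host roles (the real work, of course, is the research behind the list):

```python
KNOWN_BAD = {
    # host role -> updates that must never be applied there, with the reason
    "dmz-web": {".Net Framework 3.0": "breaks the ASP.Net 1.1 customer site"},
}

def vet_updates(host_role, available):
    """Split available updates into (to_apply, skipped_with_reason)."""
    blocked = KNOWN_BAD.get(host_role, {})
    to_apply = [u for u in available if u not in blocked]
    skipped = {u: blocked[u] for u in available if u in blocked}
    return to_apply, skipped
```

Recording the reason alongside each exclusion is exactly the documentation the old admins never left behind, and it would have stopped the .Net 3.0 install cold.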
You may also be wondering why RNM still has his job. I honestly don't know. If I were in charge he wouldn't be here. Since I'm not, I may not be here much longer myself. Someone who claims to have the experience to know better than to do stuff so stupid it brings down a production network shouldn't need a babysitter.