Wednesday, April 7, 2010

Fire in the sky ... when will they learn?

These days, with the plethora of cheap, easy-to-use BDR solutions (I am hot on this one from TheLCOGroup - http://www.thelcogroup.com/services/disasterservices.html), it amazes me when well-heeled firms won't spend a tiny fraction of their total revenue on insuring themselves against catastrophic data or site loss.


Yesterday, there was a manhole explosion on West 52nd Street in New York - a busy street corner with some Class A buildings, in an area where I have consulted for several clients over the years. The explosion was apparently caused by underground fires. Buildings were evacuated because of carbon monoxide threats and the possibility of additional explosions. Broadway, between 50th and 54th Streets, was completely closed down.

One client in particular stands out - a small, well-managed hedge fund that has been operating since 1998, has about 15 local employees, and probably has in the neighborhood of $500 million under management. They have a modest IT setup - an Exchange server, a file/print server, a utility/BES server, and a SQL server that runs some Quant apps they have developed.

I have, on at least 2 occasions in the past 2 years, recommended that these folks put together a proper DR/BC plan. Their current plan, which was 'we will work from home', was not a plan but a case of 'wishful thinking'. They had no means to get their data (they use tapes, but don't take them offsite), had no way to run their Quant apps, and had no way to get email/BlackBerry (their 'mission critical' app). They had managed to dodge bullets for so long that they thought they had a 'pass go and collect $200' for their system. They had become so blasé, in fact, that they were forgoing regular maintenance, backup checks, and other industry 'best practices'.

Well, lady luck was not with them this time.


I got a frantic call from their 'tech' that their Exchange server was 'having issues.' He had been working on restoring it for a full day, and still wasn't there yet ...

Their server had crashed, hard, during a building power interruption after the blast (apparently the UPS had been making 'beeping noises' on a regular basis, so their 'tech' plugged the servers directly into the wall). It didn't come up properly, showing a dreaded STOP code and a blue screen. 5 hours of Microsoft tech calls later, they were attempting to restore.

Turns out, their Veritas support had expired, and Microsoft was having a hard time getting the tape drive to work properly. They passed the buck over to Veritas, and anyone who has used their tech support in the past knows that hold times can be 2 or 3 hours, or more. Frustrating when you have users yelling at you to get email back up.

As of now, 24 hours later, they are still not up. Their server has been rebuilt, but apparently they haven't had a successful backup in 3 months, and had done 'incrementals' before that, and the tapes were either mislabeled or misplaced. No one ever bothered to check the backup logs, or to do a restore test on anything save a single file test they once ran.

Fact is, this story could have been much worse. I have no doubt they will get their email back (though it might be missing the last 6 months) and they are able to work at their desks. But had they been roped off from the building for several days, or weeks, this firm would have been heading for the big old 'out of business'.

For a couple of hundred dollars a month, all of this could have been prevented.

Won't they ever learn?