38 hours uptime without the benefit of caffeine – the story of my weekend

A description of my weekend.

The project to upgrade the PBX to Windows 2000 seems to be going well. I needed to upgrade the firmware of the RAID card to deal with the new driver for 2k. No big deal. I flash it without issue, and things are going well. Stage 1 of the upgrade goes well. It boots into stage 2, and suddenly I get this weird bluescreen
it says something about there being no boot volume
and all the lights on my raid are red
"hmm, that was weird. Let's powercycle and try this again"
well, I boot the machine up.
the RAID BIOS says something about "RAID configuration initialized"
I snap my fingers and say "damnit, I'll have to rebuild the array and restore from backup."
so I look at the backup server
and I didn't believe what I saw, so I looked again.
yep, in the past 4 years, no one thought the phone system important enough to be backed up.
note this was a normal NT 4 system in every way, with a special set of system services running
you could install quake on this thing and play it while it routes calls if you wanted to.
but no. No one installed the Backup Exec agent
despite having several copies on the network, as well as 2 CDs within 6 feet of the machine
so now I'm in panic mode. This maintenance window was going to end in an hour. Now, that's pretty okay, since no one uses the system over the weekend, but I like to make sure there's no possibility of interference when I do my work. It's best if people never notice that it happens.
But I wanted to go to bed soon.
so I call up my coworker. his cell phone isn't answering
I figure out at this point that the offsite backup service we use for laptops might have been installed and working on this system for whatever reason.
I keep calling his cell phone, since he has access to this service. No answer, no answer.
it's 3:30 am and I start pounding the line at his house, where he is currently staying with his parents.
eventually he wakes up and answers, telling me the information was in his Palm, which he left on his desk by mistake, and that he'd be running in to the office in case I need help with anything
I check, and it's all there. yay!
So back to the RAID array to try to rebuild it.
the array software initiates a rebuild, which includes a lengthy zeroing procedure as part of making an array.
6 hours lengthy
at this point, it's about 10am, and I know it's got at least 4 hours to go, so I realize "oh, that's right, I said I'd help out at the ARPSC booth at the police open house"
so I shoot over there, where my sensitive senses are bombarded by people, lights, and loud sounds.
(And I don't deal well with large crowds)
so by 4pm when it ends, I'm a bit frazzled and I've been awake for well over 24 hours
my nextel's battery has been changed once since I've been awake, and my second's half dead.
I drive back to the office, where my coworker and I find it finally zeroed out. yay!
we reboot, and it says "no drives configured"
go into the software, and it indeed shows that the last 6 hours of work may have claimed to create an array, but no array exists.
I get pissed off, tear out our old secondary exchange server, and retrieve a RAID card that doesn't suck from its innards, replacing the one in the phone system.
it shows one of the drives isn't identifying on the SCSI chain, and the other two drives are reporting they are 0MB disks
so I now have two drives stating "WDC2342342340 0MB offline"
the other is saying like "????????????? ?MB offline"
I find out that the RAID enclosure key is nowhere to be found
we tear apart the office for an hour and eventually find it sitting openly on my coworker's desk
I pull out the drives, and discover one of them had a capacitor blown off the controller board, and the other two appear fine.
put them in a normal SCSI chain with a normal card, and they appear as 0MB disks, even in Linux.
something cooked the firmware on the drives' controllers something fierce
so now I need to find 3 SCSI-2 drives before 6am Monday morning, or I'm dead.
I remember that in the dead pile, we have some perfectly good alphaservers, that use SCSI drives internally. The first machine gives me shit opening the cover, and I am at this point so angry that I tore the screw hole in the lid off the case, leaving the screw in the main part, when it jammed up on me.
the rest of the machines see this raw display of power and anger, and submit willingly to being cannibalized
I tore the cover off with my bare hands, no screwdrivers, no prybars, no tools.
my coworker was like "holy fucking shit"
anyway, so now I've torn the machines apart, and have started to jam drives into the enclosures. they're half the size of the originals, but nonetheless fine for the task.
this is when I realize "oh shit, one of those machines was in the dead pile because its SCSI drive would short the bus during heavy load"
if only I could figure out what drive that was!
fuck it, not enough time. It's now 8pm
the drives eventually come up, and the array begins to build. The zeroing process takes 30 minutes on this controller. whee! we go grab dinner and come back.
it works! we install Windows 2000 on the machine, and it works, whee!
so now I restore the disk from the backup, it's 10pm. I told my coworker that come hell or high water, we'd be outta here by 1am.
the software installs over the backup without a hitch
fucking up the config in the process
I figure that out, and overwrite the backup again. bingo, now it works
except that only our VoIP board and the T1 trunk board are visible
no extensions
an hour of reading documentation later, and I find out the manufacturer says "you can't have a board numbered 0 in Windows 2000, they can start at 0 in NT, but need to start at 1 in 2000"
I yank card 0, and reboot. The machine works
all the extensions come online, and it's midnight and we have dialtone on the phones
set the ID to 12 on the old card 0 and plug it in. The system finds it without error, and I send out a quick email saying "if your phone doesn't work in the morning, follow this 2 step procedure to sign into the system from your phone to make your extension active again"
we clean up, case up everything and go home. I pass out in the bed after writing a 3 line rant in IRC
I get a call from my coworker this morning around 11
he rolls in to check on everything, and the system is beeping funny
I recognize the sound over the phone as a RAID alarm. I try to remote into the machine, and I see "ahh, SCSI ID 1 is acting funky, let's re-add it to the array and start a rebuild"
poof, machine dies
we reboot it, and it comes back.
I say fuck it, we'll wait until after hours.
poof, machine dies after hours. I rush in to fix it. The drive is acting super funny during raid rebuild
I tear apart my 5th alpha in 2 days and cannibalize it again. put it in the enclosure and rebuild, and everything goes okay
I found my SCSI drive that shorts the bus during heavy activity, by the way!
and now, the machine is up and operational. go me! yay!

only a few people noticed.
thus my goal was met.
and best of all, I still volunteered at the police open house
and as of a few minutes ago, there's officially a backup job scheduled for my now redundant-drived phone server

I was so tired out after being up 38 hours
I have NO IDEA how I got that thing working again
I barely remember most of it
I was doing impressions of Towelie at the dinner table
oh, note that also today I stripped this laptop to the bare metal to try and figure out why the screen flakes out all the time, with no conclusive result
in between crashing the phone system and rebuilding its array
busy busy busy 🙂
