Lousy day - and it's not over! (Work rant)
Got a call at 4:30am about a production server that had a hardware failure that affected the SAN storage it was attached to. (I'm the SAN guy here at work.) So I grab my clothes, jump in the car and come into work to verify that nothing has changed in the SAN environment. This after not being able to convince the guy on the other end of the phone that the SAN doesn't simply 'break' by itself - if there are no warning lights and the box hasn't "phoned home" to IBM, there simply is no problem. We're talking about enterprise level storage here - five 9's stuff - RAID 5 with dynamic sparing, dual fibre connected servers -- there's that much redundancy built into the SAN fabric. I mean, if a cooling fan in the storage device (IBM ESS 2105-800) stops, the other cooling fans actually increase their RPMs to keep the box at optimum temps.
Oh, and by the way, I'm supposed to coordinate a disaster recovery test of an important Linux server at an IBM site about an hour from work this morning. Hoping that I can verify the SAN, then jump over to IBM and kick off the DR test without skipping a beat. Ah, the best laid plans of men...
Ok, so anyway, I get my butt into work early - 5:30/6-ish - and verify the system. Sure enough, everything's fine on the SAN side, but the lousy Microsoft servers cannot connect. The operating system on DB9 is completely corrupt and has taken a non-releasable reserve on a LUN (logical disk) that DB10 needs to see. (Note: I hate Microsoft Clustering - it's about the worst server/SAN methodology I have ever worked with.) We come to this conclusion around 10:00am. (I was supposed to kick off the DR test around 8:00am. Four people are anxiously awaiting my arrival - but I gotta work on production first. Two of them decide to come into work instead. Good call.)
Oh, and in the meantime, I'm getting slammed left and right with a million other production related problems. BTW: A little less than a year ago, my colleague was fired. Why? Harassment. (Yeah, she was harassing me - to the point that as soon as she came back from her appointment with HR, where she was told explicitly to leave me alone, she ripped me a new hole. In front of our boss. That was the last straw.) So half the staff was eliminated, but the work has only increased ever since. And they will not fill her position with someone else. Why? The company has decided to freeze all open positions to save money. Thank you very much. Remind me NOT to be a company man... but I digress...
By around 1:00pm they call off the DR test. At least I didn't have to worry about going up there AFTER work to get the test done, though there was talk of doing so, since, after all, we have the site until midnight tonight!
So I'm putting out fires left and right - trying to get work done, and all the while I'm supporting this Microsoft Cluster-F***k. Removing logical LUNs, adding LUNs, verifying the zones in the McDATA SAN fabric, removing disks on the IBM storage box, adding them... basically turning the disks upside down to try to get these freakin' servers to both see all the disks.
Meanwhile, I contact IBM to assist. They dial into the box, and as I suspected, the box is running just fine. The reserve is at the software level, and there are two ways to resolve it:
1. Get the original system that had the reserve back online
2. Do a POR on the storage box. (POR = Power-On Reset, i.e., REBOOT the box.) Mind you, this is a production box - with about 17 TB (that's terabytes) of production data on it. With DB9/DB10 down, it's only affecting about 300-400 users. Bring this box down, and the whole company infrastructure suffers. Nationwide. All our important stuff. Basically business would grind to a halt and the company would lose money. Big time.
So we all wisely chose option #1. Problem is, it is going to take a long time to recover the bad operating system - we were hoping the RAID 5 array could rebuild the two failed drives (it didn't), so plan two was to recover from Tivoli backups. (That finally worked... but I'm getting ahead of myself here...)
So, back to our saga - as all this LUN masking / zoning / bouncing of disks left and right is happening, another Linux system desperately needs a backup put in place. No biggie - it's an FTP from VM to MVS to initiate a 5-job process that backs up about 15GB worth of disk, and an FTP back to the VM/Linux server. Got the JCL set up easily (since the DR test was a very similar setup and we got all that ironed out last week). Problem is, I'm having trouble moving the code to production via our new change control package, SCLM.
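(For the curious: the submission half of that is nothing fancy - the MVS FTP server has a JES interface, so once you switch FILETYPE to JES, a PUT of the JCL submits the job. In our case it's driven from the VM side, but the idea is the same. A rough sketch only - the host name, file name and credentials below are placeholders, not our real setup:)

    # rough sketch: submit pre-built JCL to MVS over FTP (all names are placeholders)
    from ftplib import FTP

    ftp = FTP("mvs.host.example")          # hypothetical MVS host
    ftp.login("MYUSER", "MYPASS")          # placeholder credentials
    ftp.sendcmd("SITE FILETYPE=JES")       # tell the MVS FTP server to treat PUTs as job submissions
    with open("backup.jcl", "rb") as jcl:  # the JCL that kicks off the backup job chain
        reply = ftp.storlines("STOR BACKUP.JCL", jcl)
    print(reply)                           # the server's reply includes the JES job id it assigned
    ftp.quit()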
SCLM - oh yeah, I was 'deemed' the administrator of that product too. Went live a couple of weeks ago with it. It was not a smooth transition. But the system manager insists that I take on all responsibilities for this new product (which, incidentally, wasn't fully and properly installed by him and his staff - typical install - there's always a shakedown period and fallout that needs resolution. Problem is, I'm the one who's getting dumped on.)
So anyway, I'm using SCLM to move this code to production, and hit a snag. I ask the system manager guy to help me resolve the issue and politely note that it is a production issue. He refuses to even look at my change record. I speak to my boss, and we speak to his boss, who calls Mr. System Manager. When my boss's boss asks this guy to help, there is silence on the other end. So my boss-boss says, "This is BS - I'm coming over and we're resolving this now." Long story short (too late), Mr. System Manager reluctantly moves my change into production. Of course we missed the backup window, and will have to wait till tomorrow to get a clean backup. This guy has been giving others major grief as well - I just don't understand why people do malicious things, especially at work - I mean really - it's just freakin' work. Get the job done, go home, and live your life. No reason to make everyone else miserable just because you're a miserable SOB...
Here a LUN, there a LUN, everywhere a LUN-LUN - still chasing the problem with good ol' DB9/DB10 - finally the DB9 system is restored about 3pm - but the persistent problem won't go away - DB10 still can't see half the disks, and DB9 can't see the other half. So we start bouncing the servers, removing disks here, adding them there... till finally, about an hour ago, both systems have come up and can see all the disks! But we're not out of the woods yet...
Now all that is needed is to get Microsoft Clustering reinstalled on both systems, and have them come up again without problems. The NT/Windows server guy has been working on it for the past hour. It's not supposed to take this long... afraid to ask what the holdup is...
It's 8:15pm. I'm at work, I'm frustrated beyond normal - especially since Mr. Manager was such an idiot - taking a personal vendetta to the point where it jeopardizes production...I am tired, I probably smell, and I just want this day to end...
All I know is - this coming weekend is the first DE of the year - three days at Pocono International Raceway. I don't want to know nothing about work starting Thursday afternoon...
I could deal with being over-worked. Deadlines for me are now a figment of the imagination - "if you don't give me backup, I can't promise you I'll meet the deadlines you impose on me." And so far, I've been keeping my head above water. But when I'm getting slammed with more work, and not getting the help I need when I ask for it because of a personal vendetta, that's just wrong. And unacceptable.
Rant over, if you've read all this, I am amazed by your persistence. Thank you.
-Z-man.
__________________
2010 Cayman S - 12-2020 -
2014 MINI Cooper S Coupe - 05-17 - 05-21
1989 944S2 - 06-01 - 01-14
Carpe Viam.
<><