Pelican Parts Forums - Lousy day - and it's not over! (Work rant)

Got a call at 4:30am about a production server that had a hardware failure that affected the SAN storage it was attached to. (I'm the SAN guy here at work). So I grab my clothes, jump in the car and come into work to verify that nothing has changed on the SAN environment. This after not being able to convince the guy on the other side of the phone that the SAN doesn't simply 'break' by itself - if there are no warning lights or if the box doesn't "phone home" to IBM, there simply is no problem. We're talking about enterprise level storage here - five 9's stuff - RAID 5 with dynamic sparing, dual fibre connected servers -- there's so much redundancy built into the SAN fabric. I mean, if a cooling fan in the storage device (IBM ESS 2105-800) stops, the other cooling fans actually increase their RPMs in order to keep the box at the optimum temps.
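For readers who don't live in storage land, the "five 9's" shorthand above is just availability arithmetic. A minimal sketch in Python, assuming a 365-day year (the function name is made up for illustration and has nothing to do with any IBM tooling):

```python
# Rough availability arithmetic: how much downtime "N nines" permits per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget, in minutes, for an availability of N nines."""
    unavailability = 10 ** -nines  # five nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes/year")
```

Five nines works out to a little over five minutes of downtime a year, which is why nobody wants to reboot a box in that class on a hunch.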

Oh, and by the way, I'm supposed to coordinate a disaster recovery of an important Linux server at an IBM site about an hour from work this morning. Hoping that I can verify the SAN, and then jump over to IBM and kick off the DR test without skipping a beat. Ah the best laid plans of men...

Ok, so anyway, I get my butt into work early - 5:30/6-ish, and verify the system. Sure enough, everything's fine on the SAN side, but the lousy Microsoft servers cannot connect. The operating system on DB9 is completely corrupt and has taken a non-releasable reserve on a LUN (logical disk) that DB10 needs to see. (Note: I hate Microsoft Clustering - it's about the worst server/SAN methodology I have ever worked with). We come to this conclusion around 10:00am. (I was supposed to kick off the DR test around 8:00am. Four people are anxiously awaiting my arrival - but I gotta work on production first. Two of them decide to come into work instead. Good call.)

Oh, and in the meantime, I'm getting slammed left and right with a million other production related problems. BTW: A little less than a year ago, my colleague was fired. Why? Harassment. (Yeah, she was harassing me - to the point that as soon as she came back from her appointment with HR, where she was told explicitly to leave me alone, she ripped me a new hole. In front of our boss. That was the last straw.) So half the staff was eliminated, but the work has increased ever since then. And they will not fill her position with someone else. Why? The company has decided to freeze all open positions to save money. Thank you very much. Remind me NOT to be a company man.... but I digress...

By around 1:00pm they call off the DR test. At least I didn't have to worry about going up there AFTER work to get the test done, though there was talk of doing so, since, after all, we have the site until midnight tonight!

So I'm putting out fires left and right - trying to get work done, and all the while, I'm supporting this Microsoft Cluster-F***k. Removing logical luns, adding luns, verifying the zones in the McData SAN fabric, removing disks on the IBM storage box, adding them....basically turning the disks upside down to try to get these freakin' servers to both see all the disks.

Meanwhile, I contact IBM to assist. They dial into the box, and as I suspected, the box is running just fine. The reserve is on the software level, and there are two ways to resolve it:
1. Get the original system that had the reserve back online
2. Do a POR on the storage box. (POR = POWER ON RESET, IE: REBOOT the box.) Mind you, this is a production box - with about 17 TB (that's terabytes) of production data on it. With DB9/DB10 down, it's only affecting about 300-400 users. Bring this box down, and the whole company infrastructure suffers. Nationwide. All our important stuff. Basically business would grind to a halt and the company would lose money. Big time.
So we all wisely chose option #1. Problem is, it is going to take a long time to recover the bad operating system - we were hoping the RAID 5 disks could rebuild the two failed drives (they couldn't) so plan two was to recover from Tivoli backups. (That finally worked...but I'm getting ahead of myself here...)
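Why a RAID 5 set can shrug off one dead drive but not two comes down to single-parity XOR. The sketch below is a toy stripe in Python, not how the array in this story actually lays out data (real arrays rotate parity and add sparing); the block contents and helper name are made up for illustration:

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks plus one parity block make up a single toy stripe.
data = [b"DB09", b"DB10", b"SAN!"]
parity = xor_blocks(data)

# One drive lost (data[1]): XOR of the survivors and the parity rebuilds it.
rebuilt = xor_blocks([data[0], data[2], parity])

# Two drives lost (data[0] and data[1]): one parity equation cannot yield two
# unknowns. The survivors only give the XOR of the missing pair, not either block.
xor_of_missing = xor_blocks([data[2], parity])

if __name__ == "__main__":
    print("single failure rebuilt:", rebuilt == data[1])
    print("double failure leaves only the XOR of the pair:",
          xor_of_missing == xor_blocks([data[0], data[1]]))
```

With two members gone there is nothing left to solve for, which is when the backups earn their keep.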

So, back to our saga - as all this LUN masking/ Zoning / bouncing disks left and right is happening, another LINUX system desperately needs a backup put in place. No biggie - it's an FTP from VM to MVS to initiate a 5-job process that backs up about 15GB worth of disk, and an FTP back to the VM/LINUX server. Got the JCL set up easy (since the DR test was a very similar setup and we got all that ironed out last week). Problem is, I'm having trouble putting the code to production, via our new change control package, SCLM.

SCLM - oh yeah, I was 'deemed' the administrator of that product too. Went live a couple of weeks ago with it. It was not a smooth transition. But the System manager insists that I take on all responsibilities of this new product (that incidentally, wasn't fully and properly installed by him and his staff - typical install - there's always a shakedown period and fallout that needs resolution. Problem is that I'm the one who's getting dumped on).

So anyway, I'm using SCLM to move this code to production, and hit a snag. I ask the system manager guy to help me resolve the issue and politely note that it is a production issue. He refuses to even look at my change record. I speak to my boss, and we speak to his boss, who calls Mr. System Manager. When my boss' boss asks this guy to help, there is silence on the other end. So my boss-boss says "This is BS - I'm coming over and we're resolving this now." Long story short (too late), Mr. System Manager reluctantly moves my change into production. Of course we missed the backup window, and will have to wait till tomorrow to get a clean backup. This guy has been giving others major grief as well - I just don't understand why people do malicious things, especially at work - I mean really - it's just freakin' work. Get the job done, go home, and live your life. No reason to make everyone else miserable just because you're a miserable SOB...

Here a LUN, there a LUN, everywhere a LUN-LUN - still chasing the problem with good ol' DB9/DB10 - finally the DB9 system is restored about 3pm - but the persistent problem won't go away - DB10 still can't see half the disks, and DB9 can't see the other half. So we start bouncing the servers, removing disks here, adding them there....till finally, about an hour ago, both systems have come up and can see all disks! But we're not out of the woods yet...

Now all that is needed is to get Microsoft Clustering reinstalled on both systems, and have them come up again without problems. The NT/Windows server guy has been working on it for the past hour. It's not supposed to take this long...afraid to ask what the holdup is...

It's 8:15pm. I'm at work, I'm frustrated beyond normal - especially since Mr. Manager was such an idiot - taking a personal vendetta to the point where it jeopardizes production...I am tired, I probably smell, and I just want this day to end...

All I know is - this coming weekend is the first DE of the year - three days at Pocono International Raceway. I don't want to know nothing about work starting Thursday afternoon...

I could deal with being over-worked. Deadlines for me are now a figment of the imagination - "if you don't give me backup, I can't promise you I'll meet the deadlines you impose on me." And so far, I've been keeping my head above the water. But when I'm getting slammed with more work, and not getting the help I need when I ask for it due to a personal vendetta, that's just wrong. And unacceptable.

Rant over, if you've read all this, I am amazed by your persistence. Thank you.
-Z-man.

14 Hours? Not too bad....I've been on site for 4 days straight, no shower, no shave, can't afford for the plant to go down during the acceptance test. Catch sleep here and there on cardboard on the floor.

Or the time I put in a Full day in Rhode Island then drove to Hartford only to pull an all nighter to restore the data recording system (A Microvax) for a power plant. I think I finally got to a hotel room about 10am the second day (Up about 40 hours including a 2 hour drive).

Yep, the good ole days. I don't miss them at all.

(Jeez, Z.....Be quiet or snowman will think you did not choose wisely in your career path....)

Yeah, I spent 17 hours in surgery, beginning at about 9Pm, five different patients.

Don't miss that either.

The workload, though overwhelming, I can bear. I've done the 2-3 days of work with little or no sleep before. It's the personality BS and the political crap and personal vendettas that I can't stand. :mad:

-Z.

Z

I don't have a clue what you wrote, but I hope you get some rest.

My plane arrived an hour late in New Orleans and I spent almost an hour getting my internet connection to work. Ended up changing hotel rooms. Ordered steak and got fish, with nothing else, just the f'n slab of fish, called room service 45 minutes ago, still waiting.

It could always be worse. Your sig line could read 2004 Toyota Prius. :D

I think you misspeled that. I think it's "Toyota Pious"

Quick everybody post nekkid pictures in OT while Z-man's asleep.

Shaaddduuppppp! You're gonna ruin it for all of us!

Z, I feel your pain! Got into work about 10:30 and there was no more half-n-half for my coffee. Man I hate that! Back at my desk I got a paper cut opening up my pay stub. Took a nap before lunch and was woken up by a wrong number - at least I think it was. Maybe it was a cell call. I couldn't understand anything past "Mike, Ineedyour help figueringoft oyu bla bla bla". I don't have time for that nonsense!

After my normal 2 hour lunch, which gave me heartburn (dang triple servings of those beef fajitas and beer!), I took another nice nap in my car in the parking lot. What a lovely 85 degree day - windows down and some soft jazz playing. A gentle breeze blowing through the car when I was rudely woken up by some inconsiderate naive trying to trim the grass around our building. About 4:45 I ended up going back up to my office where I had to talk to somebody about something - can't really remember what it was about but they said that we had a meeting and they were waiting in my office for 45 minutes. Oh well, CEOs really have no sense of timing or consideration. I really have to slow down this pace before I burn out!

OK - my rant is over...

I really wish I didn't understand your post.

Microshaft screws everybody in the end...

God I remember being on call for production systems... After reading your nightmare Z-man I think I got sweaty palms like I was there all over again...

Cobol, JCL, and VSAM, oh my!

Older than that..FORTRAN and PL-1 !!!

(working from decks of punch cards in the late 60s..)

The jobs would line up and be run at night. On one, which I remember was a simple modulus of elasticity program, the engineer forgot to put an "end of run" card. Thing ran for hours. In real life, what would have been a bending problem would have been a coil!! Held up other jobs and led to a 36 hour marathon. Thank heavens for modern computers!!

Zoltan,
I read your post. Then I re-read a bit of it. I was going to c&p it to my son (a computer techie), but then realized I understood enough of it to understand your day was **** through and through.
Sorry to hear that. Hope today is better and the support issue gets resolved.
Les

I love Z/OS!

That, and we test for a minimum of a year before hitting production. Just installing something on a box around here is the quickest way to the unemployment line.

Tell us if you experience anymore in-SAN-ity...

Quote:

Originally posted by Z-man
Rant over, if you've read all this, I am amazed by your persistence. Thank you.

The burning question for me is: did you get to keep the stapler?

Well, after a good night's sleep, I am much better. The DB9/DB10 systems are now stable. So it's back to the grid.

Good to know there are others here at Pelican with an 'old school' IT background. Crazy thing is, as this storage environment is growing, I've been responsible for open systems stuff as well - UNIX/AIX & SUN Oracle, as well as crappy Microsoft. And ya know what? The most stable platform is still the mainframe, though the midrange stuff comes pretty close. As far as Microsoft - it works, unless it breaks, then it's a real pain to get back, like yesterday.

legion: in-SAN-ity: I like that! :D

-Z.

A line from a B grade movie (Poolhall Junkies) comes to mind here. Delivered by Rod Steiger, his last acting role:
"Every day is a good day. Try missing one once."

grid girls make everything all right ;)

Quote:

Originally posted by Z-man
So half the staff was eliminated, but the work has increased ever since then. And they will not fill her position with someone else. Why? The company has decided to freeze all open positions to save money.

Hey Z
I see things are the same on your side of the street as well. I can't wait till this whole thing is over. We're all miserable here right now

z...did you, umm..try restarting the computer first? :)
ryan

Quote:

Originally posted by Moneyguy1
Older than that..FORTRAN and PL-1 !!!

(working from decks of punch cards in the late 60s..)

The jobs would line up and be run at night. On one, which I remember was a simple modulus of elasticity program, the engineer forgot to put an "end of run" card. Thing ran for hours. In real life, what would have been a bending problem would have been a coil!! Held up other jobs and led to a 36 hour marathon. Thank heavens for modern computers!!

Bob,

Also a Fortran/Cobol guy here, trained on the old IBM 360 using punch cards and jumper boards.

MS is a pile of doggy crap on many things. Some of their networking and RAID issues are unbelievable...

Quote:

Originally posted by Joeaksa
Also a Fortran/Cobol guy here, trained on the old IBM 360 using punch cards and jumper boards.

They renamed the 360 Z/os a few years ago. My employer is a major contributor to my university, so I learned IBM Cobol (though did HP Cobol on the HP3000 when I graduated). I now code in Aion.

Shoulda given in and banged your colleague that was 'harassing' you. Then she'd still be there to help you.

Also, unless you're making the absolutely unattainable megabucks, you SERIOUSLY need a new job.

ianc

Quote:

Originally posted by ianc
Shoulda given in and banged your colleague that was 'harassing' you. Then she'd still be there to help you.

Also, unless you're making the absolutely unattainable megabucks, you SERIOUSLY need a new job.

ianc

EDIT: Dude get real. The lady that was harassing me was a middle-aged woman with a family. I was in no way attracted to her, and neither was she to me. Um, there are other forms of harassment besides what you are thinking. :rolleyes: It was more like constant vicious attacks - she was downright venomous. Her anger over stupid things was downright scary. She would yell at me constantly. In front of my boss.

-Z.

Not a regular poster here, but I had to share my sympathies with you. We're at the beginning of a year+ long project to pull all PCs out of our medical facilities nationwide, and replace them with Citrix clients coming back into our datacenter. We're doing this without testing... just taking the word of the consultants brought in to build the solution that it will all work. You can guess how well that has played out through our pilot sites so far...

You had me flinching when you talked about MS clustering. I'm officially a Windows-based sysadmin. We are moving all file & print services enterprise-wide to blade servers attached to a SAN I just got run through training on. The kicker? It's all going to be based on MS clustering. I know that one of these days the OS will break, and it won't be pretty... *shudder*

I'm drinking a beer for you tonight.

Quote:

Originally posted by shrouded
You had me flinching when you talked about MS clustering. I'm officially a Windows-based sysadmin. We are moving all file & print services enterprise-wide to blade servers attached to a SAN I just got run through training on. The kicker? It's all going to be based on MS clustering. I know that one of these days the OS will break, and it won't be pretty... *shudder*

I'm drinking a beer for you tonight.

MS Clustering is just not robust enough to work well in a SAN environment. Sadly, the whole purpose of MS Clustering is to share disks in (drum roll please) a SAN environment. :eek:

Every time we've had one of the servers take a hard hit, it has been brutal to get things back up. Often the problem is that the server that crashed put a reserve on a LUN in the storage device. The storage device will not release the reserve unless the original server comes back and says it can. So while the other server in the cluster knows that the crashed server isn't up (via the heartbeat ethernet connection) it can't see the disk since the storage device is waiting for the crashed server to respond. (Not bad explanation of MS for a mainframe guy, eh?! :D )
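The stuck-reserve behavior described above can be modeled in a few lines. This is only a toy in Python, assuming the simplest possible rule (only the host holding the reserve can release it); it is not the actual SCSI reservation protocol or anything the storage box runs, and the class, method, and LUN names are hypothetical:

```python
class ToyLun:
    """Toy model of a LUN with an exclusive reserve; not a real SCSI implementation."""

    def __init__(self, name):
        self.name = name
        self.reserved_by = None  # host currently holding the reserve, if any

    def reserve(self, host):
        if self.reserved_by not in (None, host):
            raise PermissionError(f"{self.name} is reserved by {self.reserved_by}")
        self.reserved_by = host

    def release(self, host):
        # Only the holder may release; the storage box never drops it on its own.
        if self.reserved_by != host:
            raise PermissionError(f"{host} does not hold the reserve on {self.name}")
        self.reserved_by = None

    def read(self, host):
        if self.reserved_by not in (None, host):
            raise PermissionError(f"{host} cannot see {self.name}")
        return f"{host} read {self.name}"


if __name__ == "__main__":
    lun = ToyLun("lun0")      # hypothetical LUN name
    lun.reserve("DB9")        # DB9 takes the reserve, then crashes
    try:
        lun.read("DB10")      # the surviving node is locked out
    except PermissionError as err:
        print("blocked:", err)
    lun.release("DB9")        # only possible once DB9 is rebuilt and back online
    print(lun.read("DB10"))
```

In this model the surviving node knowing its partner is dead changes nothing, because the release has to come from the holder. That is the whole bind.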

I think the answer lies on the server level - in the event that a server in a cluster is lost, the healthy server should be able to mimic the signature (server name and WWPN) of the bad server and indicate to the storage device that the reserve is no longer needed.

Currently, the only way we were able to resolve the reserve was to bring up the corrupt server (rebuild the op sys) and get it to communicate with the storage device, after which it took several attempts to get the two servers to not only communicate with each other, but to recognize all the disks attached to them.

Ok, too much techno babble, but someday shrouded may need this information!

is emc still the major player in the data storage arena? at least on the hardware side? i used to work as a recruiter in the data storage only field (storbyte.com). unfortunately, back in 2002, so many of the software side companies were falling off of the map..either being bought out or disappearing altogether. i had storage guy resumes running out my ears and nobody was hiring. final straw was a new-hire requisition i had for a sales guy at sun was pulled after end-of-fiscal year meetings..sun decided to not only freeze hiring, but actually let people go as well..there went a guaranteed 30k in commissions out the window. sun was my largest client. if i'd only gotten the guy placed the month before, i might still be in business. :(
ryan

Quote:

Dude get real.

OK, I was kidding on the banging part.

I wasn't on the other though. If your job is making you that unhappy, you should be looking for another one real hard. 10 years down the road you will be a pretty unhappy camper when you look back, even if you are raking it in.

I also am a sysadmin. We have a Dell-branded EMC FC4700 that has been nothing but headaches. My experience is the opposite of other people's here though. We have four MS clusters: Exchange, File, Financial, and SQL. The MS clusters NEVER screw up; it is always the EMC that is causing me headaches. More than once I've been here to 3-4 AM cursing them.

We will be going with Netapp next time.

ianc

Quote:

Originally posted by Z-man
EDIT: Dude get real. The lady that was harassing me was a middle-aged woman with a family. I was in no way attracted to her, and neither was she to me.

Sometimes you have to take one for the team and do things you don't like. For instance, I once had to change the radiator in a 944... :(

Quote:

Originally posted by widebody911
Sometimes you have to take one for the team and do things you don't like. For instance, I once had to change the radiator in a 944... :(

I would much rather shove a couple of ferrets down my trousers than take one for the team like that. :eek:
-------------------------------------------
ianc: While I have gone through some difficult times here at work, when things settle down, it's not as bad. Unfortunately, with the CEO announcing that more layoffs will happen soon, the atmosphere in the office is very strained, to say the least. I love doing what I do (Storage admin), and I love working for a car company. (Cause I'm a car nut). It's just that sometimes I get overwhelmed at work, and sometimes people take their personal agendas too far at work. It's just work. I only work here so I can have enough $$ to be able to play when I'm not here!

Quote:

ianc
We will be going with Netapp next time.

Part of our SAN upgrade proposal currently on the block is getting a Netapp device for UNIX systems. Looks like some nice technology. And I believe it can work alongside Tivoli TSM.
-------------------------------------------
bigchillcar: Yep, EMC is still a big player in the storage field, but IBM has really taken the lead on SAN devices, IMHO. There are other storage companies popping up too - one that looks promising is a company called Xiotech. If we weren't so committed to IBM storage, I'd have these guys in here to at least demo a box. It is interesting how the storage field was shrinking a couple of years ago, but now, with the advent of SAN infrastructures, it has really grown. All good stuff!

Now, where can I find a couple of ferrets?!? :eek:
-Z-man.

Quote:

Originally posted by Z-man
Sometimes you have to take one for the team and do things you don't like. For instance, I once had to change the radiator in a 944...

I would much rather shove a couple of ferrets down my trousers than take one for the team like that.

Absolutely, which is why I prefer air-cooled vehicles...

Quote:

There are other storage companies popping up too - one that looks promising is a company called Xiotech.

this just kills me..another one of the very few big storage companies where i'd managed to obtain a contract from as an 'approved vendor'...xiotech. matter of fact, for two years after i shut down, i still received a standardized contract from them in the mail. if only i could have lived a little longer on my credit cards and weathered the data storage storm. damn. :mad:
ryan

I have to ask.....why in the he11 would they use RAID5 for an OS? It's slow as heck anyway, then degrades exponentially on a disk failure/rebuild. It's ok for a read-only data archive, but not for an OS.... RAID 0+1 is your friend :)
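The trade-off behind that gripe can be put in rough numbers. This is a back-of-envelope sketch in Python using the textbook small-write penalties (4 back-end I/Os per random write for RAID 5, 2 for a mirrored layout); it ignores controller cache, which on enterprise arrays hides a lot of this, and the disk count and per-disk IOPS figures are made-up inputs rather than anything from this thread:

```python
# Textbook back-end I/Os needed per small random host write.
WRITE_PENALTY = {"raid5": 4, "raid01": 2}


def raid5_usable(disks: int, disk_tb: float) -> float:
    """Usable capacity of a RAID 5 set: one disk's worth goes to parity."""
    if disks < 3:
        raise ValueError("RAID 5 needs at least 3 disks")
    return (disks - 1) * disk_tb


def raid01_usable(disks: int, disk_tb: float) -> float:
    """Usable capacity of a RAID 0+1 set: half the disks are mirror copies."""
    if disks < 4 or disks % 2:
        raise ValueError("RAID 0+1 needs an even number of disks, at least 4")
    return disks / 2 * disk_tb


def random_write_iops(level: str, disks: int, iops_per_disk: float) -> float:
    """Rough host-visible random write IOPS for the whole set."""
    return disks * iops_per_disk / WRITE_PENALTY[level]


if __name__ == "__main__":
    disks, disk_tb, iops = 8, 1.0, 150  # hypothetical example inputs
    print("RAID 5  :", raid5_usable(disks, disk_tb), "TB,",
          random_write_iops("raid5", disks, iops), "write IOPS")
    print("RAID 0+1:", raid01_usable(disks, disk_tb), "TB,",
          random_write_iops("raid01", disks, iops), "write IOPS")
```

On those assumptions the mirrored layout gives up capacity to buy roughly double the random write throughput, which is the usual argument for keeping an OS or database log off RAID 5.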

At my last job, we lost 35TB of data, yes, lost it, when two disks of the same RSS (redundant stripe set) group failed and the storage unit (HP EVA 5000) forgot about its disk group, RAID level, and disk members..... THAT was a long day, er uh, week! Oh, and HP said it was impossible for that to happen. We got the ol' "one in a million" comment......

I feel for you man!

-B