If your company runs on computers....
Don't go cheap. Seriously, if your business, if your ability to make money, depends on computers and related equipment, why skimp on it? I have a client whose entire operation gets done with computers (which isn't too odd these days). All order taking/tracking/processing, account settling, credit card running, communication, EVERYTHING. We had a server failure a while back and it cost us $20K in lost productivity in a single day. Now, that may not be a lot to those of you that work at big companies, but in our business...that's a lot of bread. So anyway, my point is this:
All our "mission-critical" stuff runs on garden variety PC hardware! The ONE server we have, that handles DNS, AD, accounting, credit card processing, internet access...is a homebrew white box. Intel mobo, P4 3.0Ghz, SATA RAID. The same thing you likely have sitting at your desk. I know that these days anything with a big disk, lots of RAM and a fast CPU is a "Server" but there's a DAMN good reason SUN, IBM, HP, SGI, etc make good money selling *real* servers. A homebrew white box running Windows 2003 hardly counts, in my book, when the business depends on it. Spend a few extra dollars and get something rock solid. The ONE this server went down would have more than paid for an entry level IBM/SUN/HP box. And I wouldn't have to field calls at all different times and days. Oh yea, this pisses me off, too: they splurged on a 50 user license 3DES equipped Cisco PIX 501 firewall, and then went with a run of the mill Linksys switch. Why not get a *good* switch? Hell, get one off ebay...I have a Cisco 1900 on my desk. This whole rant came about because I got a text message from my client saying that the server was down, and they need the password to reboot it. Last week I had to install a new HSF because the stock one SUCKED, and the CPU was ideling @55c. So anyway, I thought it was weird the server was down; I tested it before and it was fine. Temps were down 25c at idle. Oh yea, the original "server" kill -9ed itself due to overheating. So, anyway, I go in to check on it, and they hired some new kid to assembly parts that is "a computer geek." He decided to bounce the server, since the accounting software locked up. That's how he "knew" the server was "down." I check through event logs and what not, and can't find anything. Nada. Zip. Only a message that the last shutdown was unexpected. Duh. Then I start showing him the network gear and stuff. Suddenly the lights go out for about 3/4second on the switch and came back on. Moved it some more and they went off again. Maybe it was a switch issue, and not a "server" issue? MAYBE HE SHOULDN'T HAVE PULLED THE ****ING PLUG NOT KNOWING WHAT WAS WRONG! God damn it, he didn't even try and troubleshoot anything! Anyway, I'm pissed off, some kid now has the Administrator password to the "server" and he pulled the plug on it due what was probably nothing more than a network glitch. Thank God for RBAC, I'm going to give him an account but lock it down pretty well. And change the admin password. And why the hell can't we buy good ****? Why can't I have a GOOD switch back there? Seriously, this is the same switch you'd find in any Dick, Jane or Joe's house; not a problem if it reboots itself there or looses a connection. But at work? Big problem. IF YOU INSIST ON RUNNING YOUR ENTIRE OPERATION OFF COMPUTERS, DON'T BUY CHEAP ****! rant over, Nomex on, flame away. |
Well if you bill your client by the hour, then all these service calls aren't so bad. As long as they know the issue is with the hardware and not their main IT Consultant.
|
Do all the zOS boxes I'm looking at qualify as "servers"?
|
Right now I'm learning my way around Superdomes. We have two of them, with 128 CPUs and close to a TB of RAM - each.
|
You need to move on from these mom and pop shops and get into the real data centers. We just spent $20 million this week alone on new servers to replace our aging four year old servers. I spared no expense, and I never do on new enterprise class hardware. I know SAs have it rough, so I try to buy the best technology possible every time I refresh a series of systems. Everything business critical should be N+1 for maximum availability.
One other thing bothers me about your client: he has credit card processing on servers that are accessible by the office "computer geek." That's borderline illegal, not to mention just plain reckless from an information security standpoint. Anything that handles people's personal information should be locked in raised floor space and secured behind some sort of audited system. I'm sorry that you have to go through this; I would have pulled my hair out by this point. |
What a nightmare -- that's just plain awful.
If it's any consolation, the US Navy isn't doing a whole lot better. My submarine spent 4 months at the pier for a whole lot of shiny new computers, including all of sonar, most of fire control (the stuff that takes sonar information and does useful interpretation of the data) and the entire ship's LAN (which contains some remarkably useful stuff, and without which all ship's operations come to a halt).

The fire control computers don't talk to the sonar computers, completely defeating the point of having fire control computers in the first place. The sonar computers overheat and shut down, because nobody apparently bothered running any kind of thermal analysis on what happens if ambient temperature rises above 65°F. That means that anytime we do anything unexpected with ventilation in the sonar computer space, we can't hear ("see") anything.

The ship's LAN is no better -- most of it is installed in a space that was never meant for that many servers (at least we have decent gear), so overheating is a routine problem. We've performed a variety of our own modifications and have managed to keep temperatures within specification in there. But that doesn't fix the problem of the other server rack they installed, which was heavy enough to slightly alter the shape of the ship, disrupting the weapons handling system ever so slightly. Insubstantial? Sure, until you need to load a torpedo into a port-side tube, at which point the whole system binds up and bends lots of bits of expensive metal. Good stuff.

I'll stop ranting now -- suffice it to say that I sympathize. |
We are a very small company but in the software biz so the infrastructure is mission-critical. We have decent equipment and very few issues in general. My main worry is the infrastructure of our space - like anything built more than a few years ago, it is short on electricity, cooling, cable management, and physical security (windows+walls). Not a showstopper, yet.
We do have a few "white box" Wintel machines that we use for dev/QA sandboxes - no critical files or apps there. Everything else is racked servers, dual power supplies, dual NICs, etc. The small cost delta is not worth risking the downtime from cheapskate stuff dying - the commodity hardware isn't even very expensive these days. A few grand buys a pretty robust name-brand box. The HP 9000 server that was one of our main workhorses for many years was something like $85,000 retail when new in 1996. We moved most of its load to a $6,000 Sun in 2003, then I finally replaced the HP in 2004 with another HP machine bought on ebay for $1,600...boy, how things change |
Wayne - test it anyway. Sounds like you've made it part of a disaster/backup plan. Make sure it really works. That way, if it doesn't, you can find something that will work and have it set up and waiting *before* you need it :)
|
The zOS boxes we run are super-robust.
They've been running for over 20 years with an average of one hardware issue that affects applications every 3 years, and I can count on one hand the number of times any of those boxes have been rebooted (IPLed in mainframe lingo) during that time period. They can be upgraded while still running. The memory on the boxes can be partitioned to isolate applications from each other. http://www-03.ibm.com/systems/z/hardware/ |
For real disaster recovery you should have boxes off site. All the extra hardware in the world won't help if it's all in the same spot that had a fire, got flooded, etc. At minimum you should do a scheduled tape rotation and always keep one set off site. Something like the following:
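Even the built-in tools are enough to get started. Windows 2003 ships with ntbackup, for example (the path and job name below are made up, and check ntbackup /? before trusting my switches, I'm going from memory):

    rem -- nightly system state backup (AD, registry, etc.) to a .bkf you rotate off site
    ntbackup backup systemstate /J "Nightly" /F "E:\backups\ss.bkf"

Point it at a .bks selection file for your data shares, schedule it, rotate the media, and carry one set off site. The cost is a rounding error next to one day of downtime. |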
I work for a major company and all data is backed up nightly and stored off site.
We just completed a huge ordeal for disaster recovery - making sure we have all primary systems available off site. Katrina really opened our eyes. |
I have to agree with most of what's said here. We (a university of 15,000 students) run almost exclusively Dell and Sun machines, not just because of their reliability but also their support. Both companies provide 4 hour onsite service, all of our major systems are redundant with SAN storage, etc, etc. The biggest problem we're running into currently is thermal. We've been using one of our two machine rooms for 30 years, and our A/C is marginal when on generator power. It's one thing to think about when designing a datacenter... if the power goes out and you go to UPS/generator, will your A/C units keep up?
Not to hijack the thread, but what is everyone using in their major data centers for monitoring? We've been on BMC Patrol for the last year. |
As far as data center monitoring tools, we use ECM by Configuresoft, SiteScope, Landesk, a few top secret internal jobs that our command centers use, MOM, and I can't remember what we use on the Sun/Unix/Linux/AIX side. (I'm a Windows Architect) |
You can go overboard. We have a pair of dual processor 64-bit boxes with 4 GB of memory acting as domain controllers for our company......of 25 people :)
That 4 terabyte SAN we bought last year ain't exactly filling up too quickly either. We had a bit of cash lying around last year..... |
Thanks for the sympathy, I figured I might find kindred spirits here :-P I'm going to reply to a few posts...and then I have some other good stories...heh.
Neilk, most of the hours I bill for, but some I can't. Especially since they actually think it's mostly my fault! For example, when I installed the new HSF on the "server," a few days later they couldn't find the link/shortcut to the credit card processing system on their desktops. Their conclusion? I must have broken it (the CC software) while "rebuilding the server" (their words, not mine). That means I had to stop fishing/hunting, pack up my laptop and notebook, find somewhere with an internet connection, then get on the VPN and poke around. On my vacation...

Legion, widebody, I'm jealous. Not sure what else to say there...Big iron gives me big... :eek:

Wayne, I've already thought of, and proposed, that exact solution. I was going to build (or repair/have repaired) the old "server," throw a SATA RAID controller in there and 2 drives, and install 2003. Fault tolerant cluster setup. So if the main "server" goes down or stops responding, this one would take right over. They rejected my plan as unneeded, and too expensive. Keep in mind this is AFTER the server failure that cost us $20K in productivity. I would have even preferred going with a SAN or something, but I *knew* that wouldn't fly. I figured a 2 server cluster might, though.

HAHAHAHAHAHAHAHAHAHAH. BACKUPS?!?!?! ROFL!!! Ahahahahaha...man...that was a good one. Backups, lol. The only backups we have are the accounting software backups, and they only make one copy (against my recommendation) and KEEP IT ON SITE IN THE OPEN! I've stressed multiple times that 2 copies of each backup should be made, and one should be stored offsite, preferably in a safe. The "server" doesn't even have a tape drive, so we have backups of the accounting stuff, but if the hard drives take a ****, we have to hope only one of the two dies. Drives are mirrored. A tape drive "isn't in the budget."

Raised floors? Locked anything? Proper cooling? The current setup is in a back room, with no AC; a regular room. The door doesn't lock properly. I believe one of the key reasons the old "server" died was an extreme heat environment; it was in an attic that, by my boss's admission, you would sweat in if you just sat in there. Contributing to that was a CPU core idle temp of 55°C. And they wonder why it failed? Didn't help that they kept trying to turn it back on before I got there, and that it just kept overheating.

Oh, well hell, here's a good one! Let's talk security policy! Windows NT has this wonderful idea called RBAC (Role Based Access Controls)*. It's a dead sexy idea. Except it's not implemented in our environment...so users can do things they shouldn't be able to do. Oh, and how about a password policy? The passwords are TRIVIAL. I didn't even have to ASK for passwords, I guessed all of them. For email, accounting, CC processing, AD, you name it. I'm dying to implement a new password policy (something like the sketch at the end of this post), but they aren't too receptive. I don't think they understand the true costs that a breach would entail, especially as 60% of business is conducted via email.

Scott R, the office computer geek already has access to the accounting system, full access, so, hell, why not CC? Geeezz...:( And didn't you used to post on the 944 board when I first registered here? I seem to remember your nick, and haven't seen it in a while. I also remember you worked with computers.

*I'm just ranting, not assuming anyone here doesn't know what RBAC is.

Thanks for letting me rant, guys...feels good to just get this off my chest. For the SAs out here, a day in the life of a SysAdmin.
I like to read it when I hate work :-P Gives me ideas...
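PS -- the password policy sketch I mentioned above: the bare minimum is just the built-in domain account policy, and even the command line version takes thirty seconds (numbers are illustrative, and double-check net accounts /? on your box, I'm going from memory):

    rem -- tighten the domain account policy (run on the DC; values are examples)
    net accounts /minpwlen:8 /maxpwage:90 /uniquepw:5 /domain

Real lockout thresholds and complexity rules live in Group Policy, but even this beats passwords you can guess on the first try. |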
AHH yes BOFH.. it's brought me much comfort when all was lost in ITland.
|
I was but a lowly Sys Admin back in the 944 days, I moved into management, then to architecture. |
I like Wayne's semi-manual recovery system. I worked for a place with a "fail-over" system that connected different servers with different disks depending on what failed. The fail-over system was the weakest link in the whole setup. I think a number of exactly identical servers and disks (so there are no startup problems), swapping in the bits required, plus some know-how, is the best solution. And brand name equipment is easier to get working than no name stuff.
|
RAID is one thing, but what do you do in the event of a break-in?
A customer of ours lost all their computer hardware in a break-in Wednesday. It's a small company, and all of their servers and all laptops/desktops were stolen. They started backing up all their data 2 months ago (after we were hired to go over their system). Luckily we have all their data on tape. Without backups, they would now be bankrupt. No kidding! |
In this case the problem is that all their data would have been lost without backups.
You can have all the redundancy in the world (RAID, clusters, etc.), but it only takes one break-in and all your data (and probably your business) is gone. |
Well, I guess if a company chooses not to invest in a decent backup solution, they are unlikely to spend the cash to replicate their data to a different location. :)
We had to talk this customer into investing in a backup solution. Now, just a few weeks later, they see the full benefit of backups. |
Personally, I couldn't care less if the gear is cheap... I'd spend my money on people and process.
I've seen some pretty amazing things being done by 2 smart guys with next to no budget; fully redundant, backed up off-site nightly, etc., etc. Not a name brand in the place. I've also seen a whole crew of people with a zillion bucks worth of name brand hardware not have a clue about what/how things work and kill a business because they didn't test a single database backup routine to see if they could actually recover with it. Now, if it's a big operation, and you need high-end performance/reliability from your hardware, and the 4-hour response time from support contracts, then the premium suppliers are the way to go. If it's a small to medium sized requirement, you can use something more reasonably priced. As with anything computer related, I think it's always a matter of compromise... maximum performance, reliability, fault tolerance, and manageability/support for a set budget (which is NEVER enough, it seems). Personally, for smaller installations I'm a big fan of buying a ton of the same gear, with hot-swappable drives where appropriate, so that a broken box can be immediately reimplemented using another box beside it, with minimal downtime. (Assuming the disk controller didn't FUBAR the drives, etc). Bigger stuff I prefer reprovisioning with blades or larger "multi-domain" boxes. It's all about determining and managing risk, and balancing that with the business requirements. And making the owners/operators of said business aware of the implications of their decisions. $0.02 (FYI, I'm a Technical/Systems Architect, specializing in Oracle RAC and distributed Java apps, in large, global, secure systems, such as banks, governments, gaming companies, etc). |
I have set up backup hardware and software for them but it is still a big training and habit-development issue to actually USE it. This is one of many reasons why hosted solutions / software as a service kinds of models are appealing for certain applications and user communities. |
LA and where I live are earthquake towns, and data (backups) has to be not just offsite but out of town too. Like you say, the hardware is the easy stuff; the data and the ability to keep doing business are what keep you alive.
|
I'm a big fan of the IBM Blades, and XCAT. Done some serious benchmarking/testing against all other major blade vendors, and they are hands-down the leaders in the field, IMO.
I've rolled out 18 racks of them for a render farm, as well as some for a global gaming company. In that case, they are all behind load balancers and we couldn't care less whether any one of them works or not... they are just one instance/node in a giant pool. Most of those also net boot, and when they're not, they're booting from mirrored local drives. Never had a problem. (Well, one bad blade, but that was replaced the same day by IBM.)

If it comes to big-box reliability/performance for multiple apps, I'm a big fan of Solaris Containers/Zones (quick sketch below). Between that and DTrace, Solaris is my OS of choice. Haven't had too much call to use VMWare, but I've heard some pretty good things. (What is old is new again!) I HAVE used VMWare to test/train on Oracle RACs though... that was VERY slick.

When it comes to SANs, I don't trust them as far as I can throw them. I've had too much experience with Toshiba, Fujitsu, EMC, etc., etc., that have totally messed up entire systems because they've messed up a firmware update, etc. (Just had a big client be down for 9 days due to this... what a mess! Got me a free golf trip from the vendor, though, as they knew they'd messed up and appreciated that we could come in and fix things.) For true reliability, I always implement 2 separate SAN vendors for the same system. That way, if one "goes away", odds are the other one will still be around.
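For anyone who hasn't played with Zones: carving out an isolated container on Solaris 10 is about this much work (zone name and path are made up for the example, and the commands are from memory, so sanity-check against the zonecfg/zoneadm man pages):

    # configure a minimal zone (run as root in the global zone)
    zonecfg -z appzone 'create; set zonepath=/zones/appzone; commit'
    zoneadm -z appzone install   # populate the zone's filesystem
    zoneadm -z appzone boot      # start it up
    zlogin appzone               # log in and treat it like its own box

Each app gets what looks like its own machine, all on one set of big iron. |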
So I spec'd out about 30 blade chassis last year for Coca-Cola to run the bottle games, and the first SAN maintenance window we had killed the entire environment. Needless to say there was an angry mob of SAs at my doorstep the following Monday. Since then I have sworn them off. |
Well, THERE'S your problem! Windows? meh
I don't touch the stuff, personally. ;) (Joys of being the boss... you get to choose what you work on!)

We've tried running some Windows installs on the blades, and there are quite a few issues (or were, last year, when we tried it). Not just SAN stuff, but their IP failover on the internal blade switches is FUBAR. We had one minor "failure" that caused our ENTIRE internal network to completely come to a halt. We'd set up IP failover on the blade's 2 NICs, and brought the live one down, expecting to be able to test the failover. 120 machines over 8 VLANs all lost their routing info as a result of that driver going nuts and just spewing ARP/spanning-tree stuff everywhere. Caused all of our network gear to freeze up. Powered off the Windows blade (gotta love out-of-band management!), and everything came back up within 2 minutes as everything re-learned the internal networking info.

Thankfully it was still in the "testing" phase, so no harm, no foul. It's just a major "moment" when everything around you just STOPS. Racks of gear just stop blinking, etc. Surreal. |
Sorry to hear it!
I firmly believe in "right tool for the job", but I don't think Windows has a place in the server room. Desktop? Debatable. (I'm an OS X fan-boy, and we do all of our development on OS X, and run Parallels if we have to have Linux/Solaris x86 or Windoze for anything on our desktop.)

My friends and family still don't get the blank stare I give them when they ask me to fix their XP box... "Uhh... I don't know how to". They laugh and say something to the effect of "I thought you did computer stuff?" *sigh*

My sister was funny and got me one of those "no, I will not fix your computer" T-shirts for Xmas a few years ago, as my "go home to see the folks" trips seemed to include 2-3 days of "free" tech support for my folks and their friends. http://forums.pelicanparts.com/uploa...1174794162.jpg |
Here's another gold nugget...
Boss calls today to ask me if I can check on the accounting software; it isn't letting her log in, and the list of companies it shows...aren't ours. For various reasons, there are several folders (each containing a company), and she can't find the one we are currently using. So I stumble over to my laptop, log in to the VPN, Remote Desktop to someone's computer, fire up the acct. software...and I can find it, but when I go to connect, it says there is an operation being performed that only allows single user access. Most probably, our accountant forgot to log out. There is no admin control panel or anything for the acct. software we use, so I was going to force a logoff on the Domain Controller. Not the prettiest fix, but...

So anyway, apparently now my boss is also having trouble logging into the VPN, and the "office computer geek" text messages me (I'm in class now) asking how many people can log in to the VPN. I tell him 50. I get back, connect, log into the server, and as I'm looking around, for some reason I throw the systeminfo command at the cmd prompt. One thing I notice is...server up time: 20 minutes! Hmmmm...open up Event Viewer and this is what I find:

The previous system shutdown at 7:59:08 AM on 3/26/2007 was unexpected.

and also,

The reason supplied by user VITALMOOSE\Administrator for the last unexpected shutdown of this computer is: Other Failure: System Unresponsive
Reason Code: 0x8000005
Bug ID:
Bugcheck String:
Comment: System did not allow logon for vpn

Check that last line again... "System did not allow logon for vpn"

THE SERVER DOESN'T CONTROL VPN LOGIN!!!!!!!! THE PIX FIREWALL DOES!

So apparently, this kid bounced the server, either by pulling the plug or just holding the power button, and when it came back up he gave that lame ass reason. Next time I go into the office, that power button is getting disconnected. Now if I could just stop him from pulling the plug... Really good for data integrity, too. Guess I'll have to schedule a disk check...
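For the other remote admins out there: you can catch this kind of thing without clicking around Event Viewer. Both of these should work from a Remote Desktop cmd prompt on 2003 (the exact field name and filter syntax are from my memory of systeminfo and eventquery /?, so verify before scripting them; event ID 6008 is the "previous shutdown was unexpected" entry):

    rem -- how long has the box really been up?
    systeminfo | find "System Up Time"

    rem -- list unexpected-shutdown events from the System log
    cscript %windir%\system32\eventquery.vbs /l system /fi "id eq 6008"

Beats trusting anything the "office computer geek" tells you. |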
Jeffgrant,
I have the "I READ YOUR EMAIL" shirt. Don't wear that one to work.... |
Yeah... some people's "admin" skills consist of the MS-nurtured concept of "hmmm... try rebooting, see if that fixes it..."
*sigh* Right now I'm consulting at a big gaming company, so I can wear all those "not appropriate for real clients" shirts that I've got socked away in the closet.

Meetings scheduled with annoying people? --> STFU
Grumpy? --> NSFW (For some reason, I REALLY live up to this shirt... not really known for being too politically correct)

I'm almost twice the age of most of the people at the job... I'm now the old "go-to" guy if they have issues. "He's worked on MAINFRAMES!" *sigh*
http://forums.pelicanparts.com/uploa...1174922966.jpg
http://forums.pelicanparts.com/uploa...1174922983.jpg |
Those shirts are great, good for you :-P Must be nice to work at a place where not everything is deathly serious. You should see the stares I get when people figure out email *isn't* private. If I have the power to add, modify, and delete email accounts, as well as complete Admin control over the server, what makes you think I can't read your email?
User, sort of concerned: Can you really read my email?
Me, tired of questions: Yes
User: extremely scared look, like when you discipline a puppy.

Here's a good one, a followup to the story I posted earlier today. I just text messaged the "office computer geek" to ask if everything is working, because I haven't heard from them in a few hours. He says: "As far as I know. Error messages that look weird."

Really? No kidding? Error messages that look weird, huh? Well, those are just my favorite. Come to think of it...I wonder if it could have anything to do with you PULLING THE PLUG ON A RUNNING SERVER. Hmmmm.... I guess I'm off to check the event log viewer. |
LOL! Yeah... I don't suffer fools gladly... and have little patience when dealing with them.
Reminds me of a client meeting I had with a government agency a while back. After 6 weeks of planning, a major Oracle upgrade didn't go well, due to some pretty incompetent actions on the local "support" staff's part. We're in the postmortem meeting after about 28 hours on the go, and my patience is beyond thin. The main culprit is spouting off endlessly about how it's not his fault, yada yada yada. I'm just sitting at the table, trying to stay awake after giving my initial assessment, when the Big Kahuna from the client asks for my opinion of Idiot's assessment. Without thinking or pausing, I tell them what I was thinking "I think he's too fscking stupid to know he's stupid." There's silence in the room. "oh... was that my out loud voice?" EVERYONE started laughing their asses off, except for the Idiot. He storms out (never did see him again), and his boss catches up to me later and thanks me for my comments... he's been dying to say that for months, but was afraid of the union fallout. Sometimes it's fun being a contractor. |