Quote:
Originally Posted by stealthn
It always surprises me when platforms this big cannot design their architecture properly so that a simple change can take their entire environment down…
It doesn't surprise me ... at all, and it usually isn't a design issue. My "toughest" network outages were extremely complex and due to unforeseen "stupidity" by somebody ... not a design issue.
EVERYBODY knows you don't mix routing protocols (or static routes, etc.) in a complex network ... or else.
Then some "rookie" does something in a remote location .... and it immediately CRASHES every single million-dollar backbone router .... all at once ... with no data to look at, either

I finally captured it on the BIG BLUE box .... and after 18 months of chronic outages, a Cisco guy from RTP (not one of the average CEs we had on site every day) and I figured it out one evening in about an hour, after I rubbed his nose in the data

Just one of many .... I "knew" what was going on for months ... based upon observation, knowledge, and intuition. Getting to the cause was a head-scratcher ... until I captured an OSPF trace of the unpredictable, random, total network outages.
When something like a DNS server is "sickly", but not totally down ... that's not a design issue, and it will cripple EVERYTHING ....
97.648% of "my" network outages ... were DNS issues .... give or take
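To illustrate the "sickly, but not totally down" point: a plain up/down check will happily report a DNS server as fine while it answers slowly or drops a share of its queries. Here is a rough Python sketch of a check that tells the three states apart. The names `probe` and `classify` and the thresholds (500 ms, 20%) are made up for illustration, not taken from any real monitoring tool, and `socket.getaddrinfo` goes through the system resolver (caches included), so treat it as a starting point only.

```python
import socket
import time
from typing import Callable, List, Optional


def probe(hostname: str,
          resolve: Callable[[str], object] = lambda h: socket.getaddrinfo(h, None)
          ) -> Optional[float]:
    """Time one lookup. Returns milliseconds, or None if the lookup failed."""
    start = time.monotonic()
    try:
        resolve(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None
    return (time.monotonic() - start) * 1000.0


def classify(samples: List[Optional[float]],
             slow_ms: float = 500.0,    # hypothetical "too slow" threshold
             bad_ratio: float = 0.2     # hypothetical share of bad lookups tolerated
             ) -> str:
    """Classify a batch of probe results as 'healthy', 'sickly', or 'down'."""
    if not samples:
        raise ValueError("need at least one sample")
    failures = sum(1 for s in samples if s is None)
    if failures == len(samples):
        return "down"      # every lookup failed: the easy case to detect
    slow = sum(1 for s in samples if s is not None and s > slow_ms)
    if (failures + slow) / len(samples) >= bad_ratio:
        return "sickly"    # still answering, but slow or flaky
    return "healthy"
```

Something like `classify([probe("example.com") for _ in range(10)])` run on a schedule would flag the "sickly" state that a simple ping or port check misses.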

I don't miss it, but I loved the challenge of solving complex system/network issues and outages in a complex environment ..... on stuff that I designed
