HOWTO: Troubleshoot Any Networking Problem
Tech Section: How To Troubleshoot Any Networking Problem
All contents copyright 2006 Mark Minasi. You are encouraged to quote this material, SO LONG as you include this entire document; thanks.
Every day of the year, I get a bunch of e-mails from people trying to solve network problems. And while I love to help, I'd like even more to show folks how to solve any problem on their own. It occurred to me that, over the years, I've slowly learned a bit over two dozen "rules of network troubleshooting." I put together a 90-minute talk on them, and I've had the chance to give that talk to audiences of up to a thousand people, to good reception. But, as always, I can't get everywhere, so what follows is some of that talk. My intention here isn't to reveal any hidden Registry entries or point you to some heretofore-secret $40,000 network diagnostic device. No, I just want to offer what's worked for me in solving network troubles. I'm sure some of this will be simply a reminder of what you've already learned, but I find, at least in my case, that it's all too easy to forget a rule and have to re-learn it, painfully!
(By the way, if you'd like to listen to this talk, surf over to http://techmentorevents.com/samples/ -- they recorded me doing this once and put it up on their Web site.)
Separate the C and Si Problems
I've solved a lot of network problems, but this one was a toughie.
"I've got a DHCP server that is delivering IP addresses to two segments. The systems on the same segment as the DHCP server are getting IP addresses with no trouble, but the systems on the other segment, none of them work!"
My first question (and probably yours, if you're a network techie) is, "does the router between the two segments pass DHCP requests?" (In geek-ese, you may know that the other way to say this is "does the router support RFC 1542 BOOTP forwarding?") Or alternatively, I ask, "is there a DHCP forwarder on the second segment?"
"Yes," the person replies, explaining that the router passes BOOTP packets.
Hmmm. So what else might it be? Check IP connectivity -- does the router block any particular port? If it's in a network with an Active Directory and the DHCP server is on a 2000 or 2003 server, has that server been authorized in AD? No port blocks, and yes, it's been authorized. That's when I realize that it was a stupid question -- if DHCP weren't working, the first segment wouldn't have IP addresses. Ah, but what if -- a eureka moment! -- somehow the DHCP server hadn't been authorized for the past six days, and for some reason all of the systems on the nearby segment still had lease time left but all of the ones on the second segment had their leases run out earlier, and so were the canaries in the coal mine? So I tell the person to try to do an IPCONFIG /RENEW on one system on each segment. The one on the first segment succeeds, the one on the second doesn't.
Ready for the answer? It's simple: the guy had no idea what the heck BOOTP forwarding was, figured that his router guys must have allowed for that -- after all, they did go to a CCNA boot camp -- and just told me what I wanted to hear. In other words, it is always possible that the carbon-based parts of the network ("C" is the symbol for the element carbon) don't report reliable information, and so the problem lay not in the silicon part of the network ("Si" is the symbol for the element silicon) but in the carbon component. To paraphrase Shakespeare, "the fault, dear Brutus, lay not in the chips but in the people."
Don't misunderstand me, I'm not saying that everyone lies or is incompetent. But I am saying that under stress people don't always think as clearly as they should, and that network support people have had a lot of new things thrown in their laps in the past few years -- remember when we "discovered" security in 2001, or that we all need database servers whether we want them or not in 2004? -- without receiving a concomitant increase in staffing. We're all just human. We make mistakes. Think about how we make silicon-based systems more reliable: we cluster them. The same thing works for carbon-based units: more eyeballs looking at a problem often make for a more quickly-solved problem.
And -- this is important -- remember that we techies tend to think of computer problems in terms of the silicon side sometimes more than we do the carbon side. In fact, sometimes we see the carbon side as being sort of minimal, and only relevant in a few cases. But if you sit back and think about most of the things that you have to fix, you'll end up seeing that most of those problems have a carbon component that is at least as important as the silicon component. I mean, Trojans don't write themselves, y'know?
Write Things Down
Now and then, I'll run into some problem that doesn't surrender itself to my charms quickly. I circle it, nudge it, try a lot of things, and finally fix it. Finally! It's been hours, the whole process was mildly traumatic, and so I say to myself "well, I'll never forget that problem and solution."
But, of course, I'm wrong when I say that. Because believe me, there's another trauma lurking tomorrow, or next week. And that new crisis will flush out the old memory. So I try to be methodical about writing down things that have vexed me and their solutions. My advice: keep a notebook, or some electronic version of a notebook. (I have over 500 memo files in my Palm.) You'll slowly build yourself a powerful "knowledge base" of your own.
Oh, and writing things down has another benefit: it slows you down. It's terribly easy to rush into a problem, running down some blind alley that seemed like a familiar problem/solution pair; taking the time to express the problem in a new format -- writing rather than just thinking -- can often wake up parts of your brain that weren't previously paying attention to the problem that you're trying to solve.
So the system worked yesterday, but it doesn't today... what did you (or someone else with an administrator account) do to it?
Clearly a quick look at what's changed in the past few minutes, hours or days is a good idea and certainly not one that's only occurred to me. But quite honestly it can sometimes be difficult to figure out exactly what has changed in a software environment like today's. Did this happen on the Wednesday after the second Tuesday in the month -- in other words, is it the day after Patch Tuesday, when Microsoft releases its monthly patches? That's a lot of changes!
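If you'd rather not count Tuesdays on your fingers, here is a minimal Python sketch of that date check. It assumes only what's stated above, that Patch Tuesday is the second Tuesday of the month; the function names are my own, made up for illustration.

```python
import datetime


def patch_tuesday(year, month):
    """Return the date of the second Tuesday of the given month."""
    first = datetime.date(year, month, 1)
    # date.weekday() counts Monday as 0, so Tuesday is 1.
    days_until_first_tuesday = (1 - first.weekday()) % 7
    return first + datetime.timedelta(days=days_until_first_tuesday + 7)


def is_day_after_patch_tuesday(day):
    """True if `day` is the Wednesday that follows Patch Tuesday."""
    return day == patch_tuesday(day.year, day.month) + datetime.timedelta(days=1)
```

For example, patch_tuesday(2006, 1) works out to January 10, 2006, so a problem that first appeared on January 11 is a good excuse to look at that month's patches before anything else.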
Sometimes, of course, finding out what changed is one of those carbon/silicon problems. Recently a friend asked if I could figure out why she could no longer access her office XP desktop that she'd previously been able to get to from her home system. I asked her if she'd done anything different. Nope, nothing, she said. Didn't you tell me a week or two ago that you'd had some kind of issue with a virus, I asked? Oh, sure, she replied, she just had to update her antivirus suite. I guessed that the suite included a personal firewall, like most modern ones and, sure enough, by "just updating" her anti-virus software -- the software, not the pattern files -- she'd installed a personal firewall. The firewall blocked port 3389, of course, and so Remote Desktop didn't work.
Use Your References
Very few of us have the time or brain power to stay on top of all bugs, patches, upgrades and the like. Sure, it's terribly manly to be able to quote Microsoft Knowledge Base articles by number, but that's really just showing off, right? That's why it's a good idea to rely on the many on-line resources that can help you in solving network problems. At the top of the list must be:
Microsoft's Knowledge Base, of course, either in online or CD-based TechNet form
Google, at least until Steve Ballmer achieves his stated goal of crushing them
www.eventid.net, a neat Web site that lists all of the possible Event ID numbers that you'll find in your Windows event logs and suggests what might have caused them, and what might fix the trouble. What's particularly valuable about this is that the solutions come from anyone who wants to suggest one -- helpful people who've suffered through trying to troubleshoot Event ID so-and-so on Exchange for a week or so finally get the answer and, when they do, they post it on eventid.net. A great resource!
Knowledge bases for whatever non-Microsoft stuff you use
Google Groups, one nice way of reading what others have asked and, sometimes, answered on newsgroups about a particular topic
Favorite magazines and their Web sites, favorite books, etc.
Record the Exact Error Message
Ever since the growth of the Web, this has become absolutely essential. There are so many programs and so many error messages, some of which point to the tremendous numbers of bugs and workarounds, that an exact copy of the error message and, if possible, an actual event ID are often the key to figuring out a problem.
For example, a while back I ran into a problem with ntbackup.exe, the Backup program that comes with Windows. I needed to recover a lost file and so popped a tape into the server's drive. The tape whirred for a bit and then Windows 2000 asked me to insert the tape. Hmmm... I just did insert the tape. My right eyebrow rose involuntarily and my heart sank as I realized that I might just not be able to get that file off the tape after all. So I searched the Knowledge Base for "ntbackup" because, again, that's the program's name.
That was my first mistake. "Ntbackup.exe" may be the name of the file, but as far as the Knowledge Base is concerned, "Backup" is its name. "Ntbackup" nets a small number of articles, none of which helped.
My second mistake -- like many people, I get dumber when I get frantic -- was not to write down the error message. So, as the saying goes, I jumped on my horse and ran off in all directions trying to solve the problem until my brain calmed down. I wrote down the exact error message and went to the Microsoft KB page. Then -- my brain was working now -- I did not use the "Search" field on the KB page, but instead used the Google Toolbar to search the KB by typing the error message into the Toolbar and clicking "Search This Site." The answer came up -- delete the tape library file on disk, reinsert the tape and let Backup re-catalog the tape -- and I got the file.
Double-Check The Antivirus and Antispyware
It'd be great if anti-virus and anti-spyware tools were intelligent, intuitive, and invisible. But they're not, of course.
First of all, AV/AS can be the cause of network troubles. Almost anyone who runs AV/AS software on their system has a tale to tell about how he couldn't install some piece of software (or, more mysteriously, a driver), only to find that it was the AV/AS system that kept it from installing. Worse yet, some users are smart enough to have figured that out, and so they disable the AV/AS software so that they can get something done... and forget to turn it back on.
Or it could be that, as someone once said, "an antivirus program with old pattern files is better than no antivirus at all... but not by much." I can't tell you how many times I've been asked to look at someone's system to solve some strange behavior. "Do you have antivirus software loaded and have you scanned recently?" I ask. "Sure," they reply, "I scan weekly." They just haven't noticed those annoying little pop-ups that keep telling them that their subscription has run out and that they're not getting pattern file updates any more. No wonder they've got Zotob...
But let me go a bit off-topic and tell some truth about how I use antivirus software. In general I don't use antivirus software, because it does get in the way of running a system. How, then, do I protect myself from viruses? With what I consider to be the best AV tool around: a working brain. Yes, I understand that you've got non-technical users and that AV/AS software helps keep them from doing dumb stuff. But malware appears more and more quickly and exploits new bugs more quickly -- how often do you install new pattern files? In the final analysis, most of what you need to protect your systems from viruses, worms, etc. is just to teach your users to be careful about what they click on when they visit Web pages, to develop an intuition about what kind of Web pages to visit and to be careful about what attachments they open in e-mail.
Think of it this way: suppose you got an e-mail from me that said something like "Dood... U mUst check this out! Grate info inside!!!!" with some attachment called "invoice.exe." Would you say to yourself "heck, it's from Mark, I'd better open it," or would you say "hmmm... either Minasi has lost the ability to use standard grammar and spelling, or this might not be on the up-and-up"? You could then either e-mail me back and ask if I meant to send you this e-mail (which is exactly what I did when I got my first copy of the "I love you" virus ages ago), or you could take a moment, update your pattern files or just visit some malware news site and then decide whether or not to open it.
When you think about it, that's not a terribly hard skill to share with your users -- just a little "street smarts" for the Net.
If, by the way, you find yourself working on a system but do not have an AV/AS package handy, visit www.antivirus.com. It hosts a very nice, free ActiveX malware scanner offered by the Trend Micro folks. That public service is why, when people do ask me for a recommendation on an AV/AS package, I point them Trend's way. And while you probably know this already, the Microsoft Anti-Spyware Tool is a darn good piece of software, and free besides. The www.grisoft.com site also offers an antivirus package that's free for home users.
Wait 15 Minutes -- Microsoft's Favorite Time Interval
You may have heard this from a user once or twice (or a thousand or two thousand) times:
"I just logged on, but I can't see my computer in Network Neighborhood. What should I do?"
Of course, the real answer might be something along the lines of "well, actually Network Neighborhood is supposed to list servers, and your workstation isn't a server," but that's not going to help much. The reason that the machine hasn't yet appeared in Network Neighborhood is that the process that keeps Network Neighborhood up to date isn't instantaneous; instead, it's built to reflect network changes within about 15 minutes.
15 minutes seems to be a lucky number in the Microsoft world. A number of large and small things are intended to get done within 15 minutes. For example, in addition to Network Neighborhood, Active Directory's Knowledge Consistency Checker tool -- which is embedded in every domain controller -- wakes up every 15 minutes to re-check that its DC's replication partners still exist. When choosing those partners, the KCC also ensures that no matter how many DCs exist in a site, any change to the AD that occurs on one DC will be transmitted to every other DC in the site within ... you guessed it ... fifteen minutes.
So when you make some kind of change to your network and the change seems not to have taken effect, relax... take the 15 minutes and document what you've done so far. Draw a picture of what you did, list the steps, and you'll often find that by the time you're done, either the change is apparent, or in the process of listing what you did, you discover that you forgot a step.
Check: Is it Plugged in?
Check it twice...
Okay, seriously, again I mean no disrespect to users or anyone else. It's just so easy to overlook the things that we can easily take for granted. I mean, when someone falls to the floor, do you immediately whip out your oxygen sensing probe and check that there's a detectable amount of oxygen in the room? I know -- the fact that you didn't also fall down kind of negates the need. But you know what I mean; we take the mundane for granted. And we take the reliable for granted; in billions of connections around the world transmitting gazillions of bits, things are plugged in 99.9999-plus percent of the time.
This is where checklists can be of help for two reasons: first, to remind you to check even the unusual stuff, and, second, as an excuse for asking someone else what sounds like an insulting question. Asking someone if everything's plugged in can make someone who's already upset more upset and angry. So before you ask, couch it as a bit of a joke -- "Now, I've got to ask you these next few questions and they are, well, a bit silly. But my boss makes me ask them anyway, so... would you take just a moment and crawl under your desk to make sure that everything's plugged in? I didn't ask recently, that turned out to be the problem, and I got my butt barbequed for calling in the third-level help desk guys on the job, only to have them reach down and plug in a CAT5 cable that had fallen out of the back of the user's laptop."
Assemble Your Toolkit
You probably fix many network problems away from your desk, at a client's machine. Or, for that matter, skip "client" -- if you're like most of us network techie types, then you're probably not only tech support for your organization, you're also tech support for your friends, family, and neighborhood, so this might be gratis work. But no matter how much you're getting paid or not getting paid, I can just about guarantee that any time you try to fix something away from your normal workspace, there's an extremely good chance that you'll figure out the problem...
... and then realize that the tool to fix it is back at your desk.
So do a little thinking about what you need close to hand, and assemble a toolkit. Yours might include
CDs and DVDs containing drivers, system software, service packs, hotfixes, all of those indispensable free tools from Sysinternals (www.sysinternals.com), some kind of bootable repair OS like one of the Linuxes that let you clear out a forgotten password, or Winternals' ERD Commander (www.winternals.com), or a BartPE bootable Windows disk (http://www.nu2.nu/pebuilder/), or Ultimate Boot CD (http://www.ultimatebootcd.com/). Oh, and don't forget Support Tools and the Resource Kit. And it's never a bad idea to bring along the CD version of Mastering Windows Server 2003. (Sorry, the Marketing department made me add that. Hey, wait, I am the Marketing department. Hmmm...)
URLs for useful troubleshooting Web sites, like Trend Micro's very nice online virus and spyware scanner at www.antivirus.com. Or online knowledge bases of the vendors of whatever software you use. I also find the Internet bandwidth testers useful, like www.pcpitstop.com. What's particularly nice about PC Pitstop is that they have a Web-based Traceroute routine that works even in cases where the command-line tracert utility can't, when unwise router folks decide to turn off ICMP to defeat ping, but more on that later.
Spare parts -- but more on them later.
Phone numbers for tech support on your Internet provider, phone numbers for other members of your team, support numbers for your software and hardware vendors, and the like. And don't forget cell phone numbers -- they change more often than do land lines, and are often more useful.
Keys to open the cages on the racks so you can get to the server.
Contact information to get onto the site.
Your toolkit will almost certainly contain different things; this is just a start.
Check IP Connectivity
If system mypc.acme.com doesn't talk to system yourpc.bigfirm.com, then there may be many reasons for that. But the first question should be the simplest one: "do they have basic IP connectivity?" Almost everyone reading this knows this, but let me say it anyway: the simplest and easiest-to-find tool in this category is "ping." You can ping either a particular IP address or a host name. So, for example, if I'm at machine mypc and machine yourpc.bigfirm.com has an IP address of 10.50.50.70, then I can test the IP connection between mypc and yourpc by typing either
ping yourpc.bigfirm.com
or
ping 10.50.50.70
Either way, ping will tell me whether or not yourpc responded. Now, for those of you who've been doing this for a long time, forgive me -- that's Internet Troubleshooting 101. But recall that either one of those pings may fail even if there's a perfectly good connection between the two systems. Why? Two reasons; let's consider the first.
The target system -- yourpc, in this case -- may have a firewall that keeps it from responding to ping requests. That firewall may exist in the form of a hardware firewall or router, a piece of networking equipment between mypc and yourpc. Or the system software sitting right on yourpc may have a software firewall of its own which blocks responses to ping requests. Software firewalls used to be unusual, but since mid-2004, with the advent of XP's SP2, they've become quite common. And while I like the idea of firewalls in general, I believe that blocking pings is a bad idea.
Network folks decide to block pings because criminals like to use ping to detect systems on a network. Once they've detected a system, then they can try to attack that system. (Let me repeat that. The mere fact that a criminal knows that you have a system at IP address 220.127.116.11 does not mean that he now has complete control of your system. It just means that he knows that there is a system there. He has, as of yet, no knowledge at all of how secure or insecure your system is. It's sort of like saying, "if I build my house underground and grow grass on the roof then I'll never be burgled." Certainly it would reduce the probability, although it wouldn't negate it altogether. But it would reduce the enjoyment that most people would get out of their houses.)
So network security people believe that telling a system to ignore ping requests, either via a hardware or software firewall, secures their system by thwarting bad guys from the start. Ping runs atop a piece of Internet software called the Internet Control Message Protocol or ICMP. Thus, software firewalls don't always have a check box saying "block ping;" they may instead offer the ability to block ICMP. I've already said that I think blocking ICMP is a bad idea; here are a couple of reasons why.
In the first place, there are network programs that rely upon ping and with ping blocked, strange things happen. For example, domain controllers in Active Directories need to respond to ping in order for group policies to work correctly; if a DC doesn't respond to pings then its client thinks that it -- the client -- has dialed up to the domain rather than is connected on the LAN, causing the client to ignore logon scripts, software installation and folder redirection.
In the second place, there are tons of other ways for bad guys to find you. Ping just tickles ICMP, and disabling ICMP again essentially renders ping deaf. But IP, TCP and UDP have many "ears" -- you know them as ports -- that do not need ICMP to function. There are several tools out there that do what ping does, but not by using ICMP; instead, these tools look for activity on a particular TCP or UDP port. (There are 64K of each of those ports, in case you didn't know.) And most functioning systems in a corporate network can't afford to make all of their ports deaf. Let's use for one example a nice Microsoft command-line tool called portqry; you can find it at www.microsoft.com/downloads. It lets you essentially "ping" any port that you like. So let's say that we want to find out if Microsoft's Web servers are active. We could ping them, but we won't get a response; for some reason Microsoft has disabled ICMP on their Web servers. But Web servers aren't of much use unless they communicate on port 80, so we'll just use portqry as a kind of "ping for port 80," by opening up a command line and doing the following:
E:\>portqry -n www.microsoft.com -e 80
Querying target system called:
Attempting to resolve name to IP address...
Name resolved to 18.104.22.168
TCP port 80 (http service): LISTENING
Where ping fails, portqry does an admirable job. So why bother telling your software firewalls to block pings? Leave everything else blocked if you like, but save yourself troubleshooting time down the road and just tell your software firewall to allow pings. Tracert uses ICMP also, so you may find that command doesn't work. As I mentioned before, however, PC Pitstop has a nice Web-based tracert that works even on sites that have disabled ICMP.
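If you don't have portqry handy, the same "ping a port" idea can be sketched in a few lines of Python. Treat this as an illustration rather than a replacement for portqry: it simply attempts an ordinary TCP connection and reports whether one could be made, it doesn't handle UDP at all, and the function name and three-second timeout are my own assumptions.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the name (if any) and completes the
        # TCP handshake; the "with" block closes the socket right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Calling tcp_port_open("www.microsoft.com", 80) asks the same question as the portqry command above, and tcp_port_open(host, 3389) would check the Remote Desktop port for whatever host you name. Keep in mind that a False result only says the connection didn't happen; it can't distinguish a closed port from a firewall silently dropping the packets, or from a name that didn't resolve.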
Isolate Name Resolution
I said that a ping might fail even if mypc and yourpc were both functioning for two reasons. The first was, as you just read, firewalls. The second is name resolution.
This is a large topic and one that I've covered in the Server 2003 book, so let me keep this brief. Our systems have IP addresses, like 10.60.60.3 or the like, and they're perfectly happy to respond to requests to those IP addresses. But we humans are less happy with a name like 10.60.60.3 and more happy with a name like pc31.bigfirm.com or \\MYPC. Those names aren't for the use of the computer, they're for our use, so if the system with IP address 10.60.60.3 has a DNS name of pc32.bigfirm.com and a share named DATA, then we can map to that drive either by typing
net use * \\pc32.bigfirm.com\data
or
net use * \\10.60.60.3\data
Both have the same end effect, assuming that all's well. But under the hood, the first NET USE needs to do an extra step that the second one doesn't. Before the network software can contact pc32.bigfirm.com to start establishing the connection to the file server, it's got to first stop and ask DNS, "what is the IP address associated with the system named pc32.bigfirm.com?" For that to work, you need a functioning DNS server. In the case of the second NET USE, you don't need a DNS server, as the question "what is the IP address of the server that you want to contact?" is already answered.
Now take that information and consider: what if we typed that first command, but our DNS server was inoperative or our system was misconfigured so that it couldn't find any DNS server at all? Then when it starts to do the job, your system queries DNS to find the IP address of the target file server. DNS can't answer the question, either because there isn't any DNS server, the DNS server is configured badly, or the DNS server is inoperative for some reason. Your system never gets an answer from DNS, and so cannot go on. Now, in the perfect world, your system would say "there may well be a file server there, but I have no way of knowing, as I can't even get started doing what you've asked because of a DNS failure. Try the command again, but substitute the file server's IP address for its name; I may be able to connect you then." But instead you get some short, cryptic answer.
The same thing would apply to a diagnostic command like ping; telling a system to ping 10.60.60.3 tests only the cables, NICs and routers between the system that issued the ping and the one at 10.60.60.3. But telling a system to ping pc32.bigfirm.com requires not only the cables, NICs and routers, but the DNS infrastructure as well. So when testing things, try to first direct your tests at IP addresses. Then, if that test works, try the test again, but this time call the target by its name, not its IP address. That way, if the first test passes and the second fails, then you can be pretty sure that the problem lies with the name resolution -- the DNS or WINS servers, probably -- rather than with the IP infrastructure.
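The address-first, name-second order can be sketched in Python as follows. This is a rough illustration under a few assumptions of mine: it tests a TCP port rather than sending a real ICMP ping (raw ICMP usually needs administrator rights), it only looks at IPv4 addresses, it asks whatever resolver the operating system uses (which may consult more than just DNS), and all of the function names and result strings are invented for the example.

```python
import socket


def can_connect(ip, port, timeout=3.0):
    """True if a TCP connection to ip:port succeeds; no name lookup needed."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def resolve_ipv4(name):
    """Return the set of IPv4 addresses the system resolver gives for name."""
    try:
        infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
    except socket.gaierror:
        return set()  # the resolver could not answer at all
    return {info[4][0] for info in infos}


def diagnose(name, ip, port, timeout=3.0):
    """Test by address first, then by name, and say which layer looks bad."""
    if not can_connect(ip, port, timeout):
        return "ip connectivity"
    addresses = resolve_ipv4(name)
    if not addresses:
        return "name resolution"
    if ip not in addresses:
        return "name resolves to a different address"
    return "ok"
```

With the example above, diagnose("pc32.bigfirm.com", "10.60.60.3", 445) would try the file-sharing port by address first; a result of "name resolution" means the by-address test passed and only the lookup failed, which is exactly the case that points at DNS or WINS rather than at cables and routers.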
Know WINS Versus DNS
Speaking of name resolution, understand that every system in the Windows world has two names -- its WINS name and its DNS name. WINS is part of NetBIOS, an old way of naming systems that is almost exclusively used in the Microsoft world. It's supposedly obsolete but it is still so embedded into Windows systems that, well, Microsoft's been trying to root it out of Windows for six years and has a ways to go before they succeed. DNS, in contrast, is an Internet standard that's used both in the Windows and non-Windows world and that Microsoft embraced in 2000 with the advent of Active Directory.
Okay, you ask, what does that all mean?
Well, again, nearly every Windows computer has a WINS/NetBIOS name and a DNS name. Some programs need to hear a DNS name, other programs need a WINS/NetBIOS name. Some programs will take either. Sometimes it's easier to think of this not as names but as "identifiers." For example, I've got a phone number, e-mail address, and a street address. They're all "identifiers" for me, they're all "names." To call me, you'd need the identifier that is the phone number -- knowing my street address will not help you call my phone or send me an e-mail.
Similarly, not only can networks run into general name resolution problems, they can also suffer from "wrong name resolution" problems. So if you haven't already, read up on WINS and DNS. This is important, particularly in the Active Directory world. DNS causes at least half of the problems that look like AD problems.
Check the Logs
Modern software writes logs. Lots of 'em. Event log entries. Logs of their own. For example, did you ever have DCPROMO fail, only to offer an explanation about as long as a fortune cookie, but less helpful? I'll bet you didn't know that in \Windows\Debug you'll find a file called dcpromo.log. It's actually helpful sometimes. Group policy has its own error log named userenv.log that you can enable with a Registry entry. DHCP, DNS, and WINS will log the heck out of themselves.
But many of us overlook logs. Why do we do this? I'm not sure, but I think it's that Windows trains us not to look. When it thinks something's important, then it sticks an annoying dialog box in your face, even if the subject of the dialog box is not important, so we kind of assume that Windows will do something to get our attention any time something bad happens. As a result, when we actually get a moment to look at an Event Log, we're stunned to see a sea of red crosses and some really scary events.
Check those logs. There's often quite a wealth of useful stuff in there. And make peeking at more than one system's Event Log easier by going to Microsoft's site to download a free tool called eventcombmt. It'll grab logs from multiple systems, filter them as you like and produce summary reports. It's a bit rough-edged but it works well and, of course, the price is right!
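For plain-text logs such as dcpromo.log, even a crude keyword count can tell you which file deserves a closer read. The sketch below is a toy stand-in for that kind of filtering, not a description of how eventcombmt works: the keyword list is my own guess at what's worth flagging, and it makes no assumptions about any particular log's format beyond it being lines of text.

```python
import re
from collections import Counter

# Hypothetical list of "trouble" words; adjust it for the logs you read.
TROUBLE = re.compile(r"\b(error|fail(?:ed|ure)?|denied)\b", re.IGNORECASE)


def count_trouble(lines):
    """Count how often each trouble word appears in an iterable of lines."""
    hits = Counter()
    for line in lines:
        for match in TROUBLE.finditer(line):
            hits[match.group(1).lower()] += 1
    return hits
```

To use it on a file, open the log and pass the file object straight in, for example count_trouble(open(path, errors="replace")), where path is whichever log you're curious about. Run it over the same log from several machines and the odd one out tends to be obvious.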
Simplify the Problem
I wish I had a dime for every time I get an e-mail that starts off, "I'm trying to make [fill in the blank software] run. The client and server just don't talk." I'm surprised because this often refers to software that I use a lot. I ask a few questions, get a few answers, am still confused, and then the light turns on.
"Are these two systems on the same segment?"
Well, actually, no, I usually then hear, and with a bit more prodding I discover that there are two firewalls, a public cable Internet connection and a NAT router between them. I take ten deep breaths, and then suggest that they test it on a single segment with no other hardware between them. If it still doesn't work, then it's probably a serious problem with whatever software they're using. If, on the other hand, it starts working, then it's time to re-insert those devices one at a time until things stop working. Then it's clear what needs a bit of configuring.
Look for the unusual. Is your test machine, or the one that you're trying to install to, unusual in some way? Is it multihomed? Is it a virtual rather than a physical machine? Does it run some kind of software firewall? While we're at it, consider doing the test on an isolated segment with the anti-virus and anti-spyware software turned off. Again, if everything works in that situation, then turn things back on one at a time until the problem recurs. (And if it is multihomed, then play around with the binding order of protocols on the NICs. That can solve a surprising number of things.)
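The "strip it down, then add things back one at a time" routine is simple enough to write out as a procedure. Here's a sketch in Python; the function and its arguments are hypothetical, and the works() test stands in for whatever manual check you actually perform (launch the client, try to connect, and so on).

```python
def find_breaking_component(components, works):
    """Re-insert components one at a time and report the first that breaks things.

    components: devices or programs to add back, in the order you'll try them.
    works: a caller-supplied test taking the list of components currently in
           place and returning True if the software functions with them.
    Returns the offending component, or None if everything works together.
    """
    in_place = []
    if not works(in_place):
        # Fails even on the bare segment: suspect the software itself.
        raise RuntimeError("fails with nothing in the way")
    for component in components:
        in_place.append(component)
        if not works(list(in_place)):
            return component
    return None
```

Note the design choice: components stay in place once added, so if the trouble comes from two devices interacting, the one named is simply the one whose arrival exposed it. That's usually still the right place to start configuring.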
Simplify the Network
While we're at the simplification stuff...
Any network that's been running for more than a few years contains a lot of software and a lot of hardware... but the longer a network runs, the more mostly useless hardware and software it accumulates. Maybe it's time to finally turn off that Banyan VINES server. And do we really use that dedicated Netgear print server doodad? The last Windows 98 box is gone... it's probably okay to kill NetBEUI. We haven't used that VPN in two years; let's bite the bullet and take its clunky client software off the laptops.
Fewer moving parts mean cleaner operation and fewer things to break. And while you're cleaning house, it's time to...
Know Your Network
It's a Sunday in January, 2003. A worm of some kind is loose on your network and it is saturating your LAN's bandwidth. You put a network monitor on your network, and discover that the bad device is at IP address 10.4.198.33. You and your co-worker exchange a look of triumph. But, a half-second later, those looks fade as you each ask yourselves the same question: which computer is that?
A network diagram really helps in this case. Even the smallest firms will find some network documentation handy at one time or another. Physical locations of systems, IP addresses, what software runs on them, what protocols run on them. Locations of WAPs, hubs, routers, switches, as well as a pointer to wherever their drivers, configuration utilities or the like reside.
I know, this seems simple. But there are two things to remember about this, and they're also simple -- but people forget them. First, do this network documentation before a problem occurs. And, second, be sure to have a copy or two that doesn't live on the network!
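Even a plain spreadsheet answers the "which computer is that?" question. As a rough sketch, assuming you keep an inventory as CSV, a lookup might go like this; every host name, location and column name here is invented for the example.

```python
import csv
import io

# Hypothetical inventory; in practice this would be a file that you also keep
# printed out or on media that does not depend on the network being up.
INVENTORY_CSV = """ip,hostname,location,role
10.4.198.12,FILESRV1,Server room rack 2,File server
10.4.198.33,ACCT-PC07,2nd floor room 214,Accounting desktop
10.4.198.40,PRINT01,1st floor hallway,Print server
"""

def load_inventory(text):
    """Return a dict mapping IP address -> inventory row."""
    return {row["ip"]: row for row in csv.DictReader(io.StringIO(text))}

inventory = load_inventory(INVENTORY_CSV)
suspect = inventory.get("10.4.198.33")
if suspect:
    print(f"{suspect['hostname']} ({suspect['role']}), {suspect['location']}")
else:
    print("Not in the inventory: time to update the documentation")
```

If the lookup comes back empty, that's useful too: either the documentation is stale or something is on the network that shouldn't be.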
Isolate the Bad Component
Related to "simplify," I mean here to use clues to help zero in on the troublesome piece. Does turning something off make the problem go away? Does only one client have the problem? Then focus on the client. Do all clients have the problem? Then focus on the server. Does only one floor have the problem? Check that floor's hubs, switches and routers. Can you attach a new client and get it to work? Then perhaps some hotfix or other upgrade got in the way.
And speaking of routers, hubs and switches...
Hardware Breaks... Even Reliable Hardware
In the 21st Century, we're sort of used to software being the source of most of our problems. That's because in the thirty-odd years since microcomputers appeared, we've seen hardware get physically smaller. It's gotten simpler in the sense that what once required a big gray box with a twelve-inch-square motherboard and eight add-in cards now appears on a three-by-four inch motherboard with no add-in cards, and that motherboard contains a small fraction of the number of discrete chips of the older one. It doesn't draw as much power and is therefore cooler -- and therefore longer-lived. We have also seen the slow disappearance of moving parts, that bête noire of reliability; when was the last time you had to align a floppy drive head? It's true also that the increasingly low cost of chips allows hardware vendors to use dedicated computing devices to make unreliable physical devices like hard disk heads and platters more reliable through automatic error detecting and correcting systems.
I suspect that anyone keeping a log of computer problems would find that 95-plus percent of their problems were software-related rather than hardware-related. Take out the inevitable problems you see in new hardware, forswear overclocking and home-brew combinations of random CPUs and motherboards and that 95 percent rises further. Yup, nowadays, hardware is pretty good.
Which is, unfortunately, bad. It predisposes us not to consider hardware as a source of network mysteries. Here are a few examples.
A couple of years ago, network access in one of my buildings just plain stopped. The machines all worked, but nothing could ping them. Routing tables looked right, nothing much in the Event Log. The cause? The power "brick" for the hub that the six machines in that building shared had just plain stopped working.
Hard disks are very reliable. But only "very." About once every three years, one dies on me, and ofttimes without giving much notice.
There's a reason why you can buy computers with redundant power supplies, as I learned three years ago. A (previously) very reliable e-mail server stopped working, and of course it did it when I was out of town. The problem? I'd saved a bit of money when I bought the server hardware and put it on a clone. (This was back in the days when there was still a significant price difference between big-name and no-name hardware.) It came with the usual $15 power supply, which figured that after five years it had worked too hard. For want of a nail...
I'm fortunate that the denizens of my online forum are a bunch of really smart folks. Now and then, however, someone will pose a real puzzler -- such-and-such system doesn't work on the network despite all sorts of ministrations. One forum member will ask, "is the cable okay?" and the fellow with the problem will sort of push the suggestion aside. The thread dies out as everyone exhausts their ideas. Ah, but once in a while our questioner turns out to be an honest person and returns a couple of weeks later, revealing the eventual solution. You guessed it: cables. My favorite was the person who told me that it couldn't be the cable because it worked most of the time. When he emailed back, it turned out that the network cable ran under the door saddle and whenever anyone stepped on the door threshold, it compressed the cable... and it temporarily stopped working.
My good friend Gary had a laptop that was habitually overheating. The laptop would get too hot and just shut itself down. He'd tried everything short of sitting outside in February to make the laptop work. Then he remembered something that we've all learned at one point: dust. He pulled out the battery, hard disk, RAM, and anything else he could, then got a can of compressed air and went to town on it. Problem solved, laptop is now cool as a cucumber.
But how to quickly diagnose this kind of stuff? That leads us to...
Have Spare Parts On Hand
All too often, hardware doesn't completely die, it just gets sick. So it needs testing. Now, you may know that in many industries you can purchase really nice test equipment from companies with names like Fluke and Agilent. (Really expensive equipment too, by the way, but worth it in saved time.) But there isn't, nor will there be any time soon, a big market for PC testing equipment. Sure, you can buy tools to test Ethernet or IEEE 802.something networking. But motherboard testers for your laptop or your RAID card aren't in the offing, mainly because the PC and PC networking market changes so rapidly that by the time a piece of test equipment appears for a given PC component, that component has become irrelevant. I should, however, parenthetically note that this lack of PC and PC networking test equipment will probably change now that the pace of change in PC hardware has slackened. But that equipment might still end up priced out of the hands of many.
What, then, is often the least expensive piece of test equipment? A spare part. When I'd order a ton of some kind of equipment for a client, I'd often advise a client to buy an extra one. If buying, say, 100 desktop computers, then it'd be nice to have a 101st on hand so that as we take the PCs out of the boxes and try them out, then we can quickly verify what ails the occasional troubled new PC by swapping parts from that extra PC. As veteran PC troubleshooters often say, "swap 'til you drop."
Additionally, you may want to consider spares for anything that is a choke point in your network. I used to have an Internet connection via a frame relay to my ISP. (DSL doesn't go where I live. Nor does cell reception beyond one bar. Thank heavens for cable modem.) The ISP was great (Continental VisiNet, www.visi.net) but I had to leave them because the frame relay was run by Verizon, who for some reason could not keep a simple 256 Kbps connection up for more than about 80 percent of the time. (The Verizon guys once told me that they had no idea why they'd bothered taking the job, as anything below a T1 was apparently beneath them. I think that in my next life, I'm going to skip this silly small service business stuff and get me a monopoly. Definitely.)
Anyway, I was connected to the world via a Cisco 1602 router and if lightning struck anywhere in the surrounding five towns, the Cisco would die. (Yes, I installed every kind of lightning protection I could find.) The best answer was to have another 1602 around, already configured to be swapped out while I sent the fried one out for repair.
I have made this comment with tongue-in-cheek for years, but it does bear some truth: "the two most effective tools in the Microsoft world are 'reboot' and 'reinstall.'" (I should mention, however, that XP's System Restore has drastically reduced the number of reinstalls that I've had to do to that product, and I can't wait to see System Restore come to Server in Longhorn.)
I remind you about rebooting because where once we just knew that anything more minor than changing the background color required a reboot, modern Windows can do an awful lot of things without needing a reboot. Consider that you can take a vanilla copy of Server and add DHCP, WINS, DNS, IIS, and the majority of patches delivered for XP and 2003 in the past year... all without a reboot.
But many things do require a reboot, and if you've made a change to your system software and it hasn't quite "taken" yet, then give it a reboot, if you can. ("If you can" because I know that some of you have annual bonuses tied to maintaining that "five nines" thing, and if I did the math right then 99.999 percent uptime means no more than about five minutes' downtime all year.) See if that reboot helps things start working.
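If you'd like to check that math, here's a small sketch. It assumes a 365-day year, under which five nines comes out at a little over five minutes of downtime.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years

def allowed_downtime_minutes(uptime_percent):
    """Minutes per year a system may be down at the given uptime percentage."""
    return MINUTES_PER_YEAR * (100 - uptime_percent) / 100

# Compare a few common uptime targets side by side.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime a year")
```

Seen that way, a single unplanned reboot of a slow-starting server can eat a five-nines budget for the whole year.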
Group policies can require a reboot or two. XP, in fact, has its own strange way of processing GPs such that some settings can take up to three reboots to take effect. Group Policy Management Console, a free download from Microsoft and an optional component of Server R2, has a very nice set of reports that can help figure out why a GP setting hasn't taken effect. (But don't try to run GPMC on an x64 system; .NET problems keep any of the Windows x64 builds from running GPMC. Bummer.)
Hardware often needs "rebooting" after being reconfigured. Routers, modems and the like won't always show the effects of reconfiguration until you actually power them off and on.
And speaking of reboots versus powering off and on, remember that if you are rebooting a system because you think that you've cleaned some kind of virus, spyware or whatever off that system, then always shut the system down altogether, and then turn it back on, a so-called "cold boot." It's possible to create a piece of software that can survive a warm boot, so a virus that you've cleaned off the hard disk might be lurking in RAM hoping for a warm boot -- and another chance to infect your hard disk.
Know What's Normal
Until I got rid of Verizon's unreliable frame relay, something called a Frame Relay Access Device (FRAD) sat in my office. It was a beige box that contained six LEDs. Each LED could be green, amber, red, or off.
The first time the frame behaved strangely, I glanced at the FRAD to see if the lights told me anything. That's when I noticed that most of them weren't labeled, which was, well, disturbing. Then I noticed that two of them were off -- just dead. Two were green and two amber. Of course, the first thing I thought was, "guess I'll have to call Verizon."
As it turned out, two greens and two ambers was bad, three greens and one amber was good. The bottom two never ever lit up in the five years that I used the FRAD. But it made me do what I should have done before: make notes on "normal." And here's a cheap way to keep track of what's "normal" on a datacomm box's LEDs... just get out your digital camera and take a picture.
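A photo is plenty, but if you'd rather keep your notes on "normal" as data, here's a minimal sketch. The LED positions and their order are my assumption for the example; all I actually knew about the FRAD was three greens, one amber and two dark when healthy.

```python
# Baseline recorded while the link was healthy. The positions and order are
# hypothetical; only the counts (three green, one amber, two off) are known.
NORMAL = {1: "green", 2: "green", 3: "green", 4: "amber", 5: "off", 6: "off"}

def led_differences(baseline, observed):
    """Return {position: (expected, seen)} for every LED that differs."""
    return {
        pos: (expected, observed.get(pos, "unknown"))
        for pos, expected in baseline.items()
        if observed.get(pos, "unknown") != expected
    }

# What the box showed on the bad day: two green, two amber, two off.
today = {1: "green", 2: "green", 3: "amber", 4: "amber", 5: "off", 6: "off"}
diffs = led_differences(NORMAL, today)
print(diffs or "Looks normal")
```

With a baseline on file, "two greens and two ambers" stops being a mystery and becomes "LED 3 should be green and isn't."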
Make One Change At A Time
This is sooo obvious, and soooo hard...
Let's say that you have to install something in your Web server -- perhaps a FireWire card. But you've had some extra RAM lying around, so why not beef up the memory while you're at it? And as long as the server's case is open, it really is the time to get some compressed air and blow out the dust bunnies, right? And just maybe, since the cover's open ... NO!!!!
Let's be realistic about this: machines are out to get us techies. You know it, and I know it. But they have to play by certain rules, and one of those rules is, "if the human changes something in me [the machine], then I'm really only allowed to break if I can create an excuse that's relevant to what the human changed." In other words, if all you do is add the FireWire card and the system refuses to boot, then the chances are really good that simply removing the FireWire card will un-do the damage, unless you also dropped something into the case while installing the FireWire card. In contrast, doing two, three, or four things all at once, then putting the case back on, gives the machine all kinds of plausible reasons to fail, and basically puts you in a position of having to pretty much reduce the system to parts and re-build it from there. Ugh.
Now, I'm not saying not to add that RAM or blow out those dust bunnies... just that you'll be happiest if you do each of those jobs one at a time, then power up the system to ensure that the machine's not misbehaving, then power it down and make the next change.
The famous Felix Unger exclamation from an episode of The Odd Couple aside, assuming can get you in a lot of trouble. When a big part of my network stopped working, recall that I never even imagined that the power brick on the Netgear hub would die, so I didn't think to look at it for precious minutes while I looked at what seemed certain to be the cause. (I've forgotten what did seem certain at the time. Now I suppose I'd assume that it could be the power brick, and so overlook a bad cable. Well, I hope I wouldn't do that any more.)
Adopt a no-assumptions approach to troubleshooting. Sure, you need to create an order in which to test things -- for example, you might decide that software breaks more often than hardware, and so you check the software first -- but never leave a component off the "things to check" list just because it's reliable. It might just be reliable. Really reliable. Just not 100 percent reliable.
Get And Learn To Use a Network Monitor
Networks can seem sometimes like nothing more than piles of black boxes whose only outward signals are green, blue, red and amber LEDs. (Minasi's first law of data communication devices is, "the more little lights on the data comm thingies, the better.") But sometimes it'd be nice to just open up that cable and actually see the bits go by to understand what's going on. That's where a network monitor, sniffer, whatever you want to call it, is valuable. Programs of this type capture and analyze data transmitted across your network. With them, you can actually see how, for example, your system gets its IP address from DHCP.
There's a simple piece of network monitoring software that ships with every copy of Server, but it only shows traffic traveling to and from the server. Other network monitor software uses what's called a "promiscuous" network capture driver, which means that you can capture not only your system's traffic, but any other traffic on the network. (Actually, that only works if you've got Ethernet hubs rather than switches. You can make switches "promiscuous," but it's not normal behavior.) There's a free network sniffer called Ethereal from www.ethereal.com. The Unix world has always had a command-line network sniffer called "tcpdump" and now we Windows types have one as well called WinDump that you can find at www.winpcap.org/windump. The output is harder to read than the nicely-formatted stuff that a GUI sniffer produces but it's something of a standard and, besides, it's much easier to show tcpdump/windump output on a printed page than it is screen captures of some GUI sniffer.
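To take some of the mystery out of what a sniffer does with the bytes it captures, here's a minimal sketch that decodes just the 14-byte Ethernet II header. The frame is built by hand for the example (it is not captured traffic) and the source MAC address is made up; a real sniffer goes on to decode the IP, UDP and DHCP layers in the same spirit.

```python
import struct

def format_mac(raw):
    """Render six raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)

def parse_ethernet_header(frame):
    """Decode the 14-byte Ethernet II header at the start of a frame."""
    # Network byte order: 6-byte destination, 6-byte source, 2-byte EtherType.
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    return {
        "dst": format_mac(dst),
        "src": format_mac(src),
        "ethertype": f"0x{ethertype:04x}",
    }

# A hand-built example frame: broadcast destination, a made-up source MAC,
# EtherType 0x0800 (IPv4), then 20 bytes of placeholder payload.
frame = bytes.fromhex("ffffffffffff" "001122334455" "0800") + b"\x00" * 20

header = parse_ethernet_header(frame)
print(header)
```

A broadcast destination like the one in this example is what you'd expect to see on a client's first DHCP request, since the client doesn't yet know where the server is.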
Keep An External IP Address
When trying to fix many kinds of server problems, the ultimate connectivity test is often "can I get out to the Internet from inside our intranet?" or "can someone on the Internet get to our public servers?" The only way to check the second of those two is with an IP address not connected to your network at all, an address that does not directly appear in any of your routing tables. Here's a simple way to maintain such a thing: get a data service for your cell phone and the USB cable that lets you connect the cell to your PC. Then you can always get yourself an external IP address and ping away. External e-mail addresses are helpful as well; I use a Hotmail address to test sending e-mail to my internal e-mail account whenever testing some change to my internal e-mail server.
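Before trusting a test address as truly "external," it's worth confirming that it doesn't fall inside any network you route. Here's a rough sketch using Python's standard ipaddress module; the network list is hypothetical and uses private and documentation ranges purely as placeholders for your own.

```python
import ipaddress

# Hypothetical list of the networks that appear in your own routing tables.
OUR_NETWORKS = [
    ipaddress.ip_network(n)
    for n in ("10.4.0.0/16", "192.168.10.0/24", "203.0.113.0/24")
]

def is_outside(address, networks=OUR_NETWORKS):
    """True if the address falls inside none of the listed networks."""
    addr = ipaddress.ip_address(address)
    return not any(addr in net for net in networks)

# The first address stands in for the one your cell phone connection hands you.
for candidate in ("198.51.100.7", "10.4.198.33", "203.0.113.20"):
    verdict = "outside" if is_outside(candidate) else "one of ours"
    print(f"{candidate}: {verdict}")
```

Only a test run from an address that comes back "outside" tells you what the rest of the Internet actually sees.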
Double-Check Security and Permissions
Can't do something that you think you ought to do? Then ask: have you done something to "harden" the security of your system recently?
After Code Red, I took some time and hardened my IIS server, and I mean really hardened it. About a year later, I wanted to learn how to use the Index Service so as to create a search engine for my newsletters. But after two weeks' work, I still couldn't get the Index Service to do a blasted thing. Finally, it occurred to me to build a fresh, out-of-the-box test system... and everything worked fine. You see, in the process of "hardening" my Web site, I'd inadvertently removed the System account's permission to read my Web site. The Index Service needed to index content inside the Web site, and, as the Index Service runs as System, it was denied the ability to read the files.
Sometimes you can detect this by auditing "processes" in the Security event log. That may give you a clue about whether or not you're being defeated by your security measures!
Call the Outside Communications Service (telephone company, cable company, ISP) Last
They don't care if you have a problem; they get to charge you monthly whether you use those bits or not. They assume that you're a moron and are completely prepared to lie to you just to get you off the phone. (Many are actually timed as to how long they're on the phone and are rated higher if they can process more calls per hour rather than, say, customer satisfaction. In other words, their bosses pay them to get you off the phone as soon as possible.) What's that, you don't believe they'd lie? Okay, how about a couple of real-life examples.
When I was getting DSL installed, the installer was late and when I talked to the dispatcher, she told me that he wasn't bringing a DSL modem even though I'd already been invoiced for one. I was irritated. "Don't worry," she told me, "you can use your current modem." "My current DIAL UP modem?" I asked. "Yes," she said, continuing "you won't get as good a speed as you will with OUR DSL modem, but it'll still be many times faster than you're getting now." Or how about the Charter Cable guy who told me that the 5 kbps throughput that I was seeing on my cable modem wasn't Charter's fault, it was the "phone company." I was puzzled and asked what they had to do with it. He explained to me that "the phone company runs the Internet." Nope, I am not embellishing. That's a direct quote, and the gentleman who said it did so in November 2005.
So before you call your provider, get your facts straight. Can't ping out? Then do some tracert commands to find out exactly where things fall down. At an ISP that I used to work with, the tracert often failed at an IP address that I knew was in their router farm. So after they'd suggested by their tone that I was an idiot and they were just waiting to show me how stupid I was, I'd say "hey, I ran a traceroute and it stops at IP address such-and-such. Is that one of yours?" And oddly enough, I got connected to a third-level technician immediately!
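If you save tracert's output to a file, picking out the last hop that answered is easy to automate. The sketch below assumes the general layout of Windows tracert output; the sample text is invented and abbreviated, and its addresses come from private and documentation ranges rather than any real route.

```python
import re

# Invented, shortened sample in the general layout of Windows tracert output.
SAMPLE = """\
Tracing route to 198.51.100.80 over a maximum of 30 hops

  1    <1 ms    <1 ms    <1 ms  192.168.1.1
  2     9 ms     8 ms    10 ms  203.0.113.1
  3    12 ms    11 ms    12 ms  203.0.113.45
  4     *        *        *     Request timed out.
  5     *        *        *     Request timed out.
"""

HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")      # hop number, then the rest
IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")  # dotted-quad address

def last_responding_hop(tracert_output):
    """Return (hop_number, ip) of the last hop that answered, or None."""
    last = None
    for line in tracert_output.splitlines():
        match = HOP_LINE.match(line)
        if not match:
            continue  # header, blank or footer line
        addresses = IPV4.findall(match.group(2))
        if addresses:  # timed-out hops carry no address
            last = (int(match.group(1)), addresses[-1])
    return last

hop = last_responding_hop(SAMPLE)
print(hop)
```

That hop number and address are exactly the facts to have in hand when you ask the provider "is that one of yours?"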
And before you do contact your provider, be sure to do what they'll tell you to do anyway -- pull the plug on the router, frame relay, cable modem, DSU/CSU or whatever, count ten and plug it back in. (I find it fascinating that I pay for a business cable account and when I call to report that there's something wrong with my connection, they say "unplug the cable modem and plug it back in." I say "why? It's connected to a UPS -- yes, that's a UPS, not an SPS -- and that's backed up with a generator. The only reason that it'd need to be cycled would be if there were a serious problem in its design or firmware and if that's the case, then why are you using that brand of cable modem instead of ..." but then I realize that some arguments aren't won with logic alone, nor do I own any firearms, so I unplug the cable modem, count ten and plug it back in. Oddly enough, that never fixes things.) Being able to say "I've already cycled the power, disconnected my cable from the router and directly connected it to my laptop" and so on makes for faster service and doesn't let them put you on "ignore." (Oops, I meant "hold.")
Walk Around the Block, or Explain the Problem to Someone
This is a great tip. Honest.
We humans have got really good brains. Sometimes, though, we just don't know how to use them. Ever been faced with a problem that stumped you for an hour or two and, once you figure out the solution or are presented with the solution, you say "aw, heck, I knew that!" Of course you did; the answer was there, you just didn't have a path to it -- I think of it as "now, if this neuron and that neuron in my brain were to get together and have lunch now and they..." And besides, how many of you get to troubleshoot in a calm, relaxed, supportive environment? Stress makes us ready to run from a predator or kill something that's trying to kill us first; it's not so good at making you able to troubleshoot group policy conflicts.
So remove the stress and massage your brain a bit. I find that re-framing the problem in some way causes different parts of my brain to get involved. For example, sitting down and writing the problem out on paper may cause your brain to find the answer when the problem seemed insoluble moments ago. Explaining the problem to someone else causes all of those verbal neurons to wake up, and there must be a lot of them, because it's surprising how many times simply talking something out solves it, doesn't it?
Or get the big muscles moving by taking a walk around the block. It gives you a chance to get some air, restate the problem, and see it in a different light. Believe me, most problems don't get much worse in the ten minutes it'll take you to take in a short stroll. And who knows, it may keep the clients from breathing down your neck for a minute or two, releasing some of your CPU time from the "ohmigod ohmigod ohmigod..." loop and freeing it up for the problem-solving section.
I hope that with these suggestions I've reminded you of some of your own stories, perhaps suggested a new approach or two, and possibly even brought a smile to your face once or twice. Again, in this article I meant no ill will to anyone and in case I've not made it clear so far, I have no illusions about having all -- or even most -- of the answers. I figure that if I can just remember not to repeat a mistake, then eventually I will have made every possible mistake... and then I'll be perfect. Until then, however, I can't wait to hear your rules for troubleshooting network problems. Thanks for reading!
All contents copyright 2006 Mark Minasi. You are encouraged to quote this material, SO LONG as you include this entire document; thanks.