Peter Seebach (email@example.com), Freelance writer, 10 Aug 2004
Abstract: Frequent computer crashes are totally unacceptable. They cost incredible amounts in lost productivity, frustrate you, and serve as a barrier between you and the promised features of purchased software and hardware.
It's close to the oldest story in computing. A friend of mine used to wear a button on his jacket that said, "You'll have to wait; the computer is down." The message proved correct often enough to be funny. Many jokes revolve around the sure and certain knowledge that computers crash often enough that you have to save your data often. People now consider it normal to find that something -- product, service, whatever -- is unavailable because a computer, somewhere, is down.
The way manufacturers address these issues carries a certain amount of spin control. Software manuals advise you to save your work "in case of power outages," but for most users, crashes are more common than power outages. Depending on the platform and applications you use, your computer might crash daily or worse. Indeed, I've had to complete long projects using software that crashed once an hour or more; it gets very, very tedious. Of course, the problem is always unusual and isolated. When the PDA version of an instant messaging client began acting up, dozens of users posted complaints about problems using the client on many different models of PDAs and phones, all with exactly the same symptoms. A company representative, in a thread with two dozen users reporting symptoms, explained that the company couldn't track down every unique bug affecting a single user.
Crashes, you are told, result from complicated and unpredictable interactions between multiple software components. Why, then, do software components interact in such complicated and error-prone ways? Why can you experience problems running a given vendor's software on that same vendor's operating system (OS)? Even running on the same vendor's hardware, you sometimes discover a disturbing tendency for programs not to work quite as expected.
This habituation of users to constant low-level failure disturbs me. I mostly work on a UNIX system, running an editor whose code was last updated in 1996. I submitted a couple of bug reports on earlier versions of this editor in the early 1990s. They were fixed. The editor hasn't crashed since. When I use software applications, even expensive ones, on Mac and Windows systems, I'm always quite shocked. I've never quite adjusted to people saying things like "Oh, sometimes it just doesn't start. Just double-click it again until it comes up."
One of the most persistent themes in computer instability is incompatibility introduced by changes in the environment in which a program runs. For instance, programs designed to run on an original IBM PC often failed to run on later, faster machines. For a while, many machines had a turbo button that slowed the machine to the speed of an older one, for compatibility with programs that only worked at the slower speed.
Solutions to such problems have been known for a long time: an OS can provide developers with a standardized and safe way to interact with the hardware, offering a path to greater compatibility. For instance, instead of letting each program manipulate the disk directly, the OS may provide a standardized file interface. Sounds great, right? Unfortunately, programs sometimes try to bypass that interface. They do so for various reasons, often citing improved performance. In reality, the result is a program that crashes horribly when the computer is upgraded. New hardware can be incompatible. A new OS can preserve the functionality of the standard interface while changing the internals, so a program that relied on those unsupported internals will start crashing.
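The principle is easy to see in modern code. A minimal sketch (the function name and file path are illustrative, not from the column): a program that saves data through the OS file interface never needs to know about disk geometry or filesystem layout, so it keeps working across hardware and OS upgrades, while a program poking the disk directly is betting on internals no one promised to keep.

```python
# Sketch: write through the OS-provided file interface rather than
# manipulating storage directly. The path and record are illustrative.
import os
import tempfile

def save_record(path, data):
    # The OS file API hides hardware and filesystem details; the same
    # call works regardless of what disk or OS version is underneath.
    with open(path, "w", encoding="utf-8") as f:
        f.write(data)
    # Read it back through the same standardized interface.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

tmp = os.path.join(tempfile.gettempdir(), "demo_record.txt")
print(save_record(tmp, "appointment: 10 Aug 2004"))
```

A program written this way survives an upgrade precisely because it asked the OS instead of assuming how the OS happened to work.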
Count video games among the worst offenders in this regard, but other programs have been affected. Some programs do, admittedly, have legitimate reasons to bypass the system. For instance, disk repair utilities need to bypass the OS's file system code and manipulate the disk directly. Of course, that can cause a painful stream of upgrades and counter-upgrades as changes in an OS require updates to utility programs.
A serious enough change to a protocol can defy workarounds, though. The Palm Tungsten T3 introduced fundamental changes to the way the core databases holding PIM (Personal Information Management) data are represented on the device. While many programs continue to run as long as they use the official protocol, others -- such as backup utilities -- run into problems. The most common symptom is a PDA that, on being restored from backup, says nothing but DataMgr.c, line 9529, Index out of range. Oops. I found a patch on Palm's Web site that makes this a little less frequent, but I never did find an official explanation. Eventually, I deleted enough files from my backup set that the PDA ran. I have yet to discover why the fairly common real-world case of needing to restore from backup wasn't better tested and supported.
Most crashes do have workarounds that, rather than fix the problem, let the problem affect you less often. Workarounds, unfortunately, are not necessarily convenient or appropriate. Indeed, a workaround often consists of avoiding a given advertised feature, or, "every five minutes, save your work and exit the program."
It gets worse. When my mom's PDA developed a tendency to crash every time she tried to add a new appointment, the only solution was to delete the relevant database. At that point I was glad I had installed a file management utility on her PDA. Walking her through the procedure over the phone proved quite an adventure. Mind you, my mom's no technical idiot: she's the office go-to person for computer advice. However, when, a week or so into learning a new gizmo, you must master a fairly bare-bones utility program just to find and delete a database file, that's a bit different.
Workarounds harm consumers in another way: they make companies feel that a fix isn't required. This old saying applies fairly well: "Put your shoulder to the wheel, your nose to the grindstone, and your ear to the ground. Now try to work in that position." By the time I get the hang of not using the bold button right after an undo operation, only scrolling the font menu down slowly, using Save As to save files to avoid the corruption bug, saving every five minutes, never using drag-and-drop to move text with mixed formatting, always formatting things for the laser printer before reformatting them for the inkjet, saving to the hard disk and then copying the file to a floppy as a separate step.... Wait, where was I? I think it had something to do with a project due on Monday. Workarounds should be temporary measures, not long-term states.
Workarounds, acceptable for a short time, are no replacement for an actual fix. It's not okay for programs, especially OSs, to crash under normal use. Oftentimes, the culprit is a desire to add new features before old ones are even fully working. Sometimes, it's just laziness.
If you develop software of any sort, make sure that reliability stays on the radar. Many good software engineering practices are known to help create reliable code. To produce useful results, most of them must actually be practiced, not merely written down in the bullet lists of a project plan.
Many of these problems would show up in testing. Test your programs on a variety of systems, not just on one or two with similar hardware. Testing should be an ongoing, built-in part of development, not an afterthought. If you test your code on multiple versions of an OS, with different hardware, you boost the odds of catching problems that affect only some people.
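One cheap habit that supports this kind of testing: record the environment alongside every test run, so a failure that appears on only some OS or hardware combinations can be traced instead of dismissed. A hypothetical sketch (the function names here are illustrative):

```python
# Sketch: tag each test result with an environment fingerprint, so
# platform-specific failures stand out. Function names are illustrative.
import platform
import sys

def environment_fingerprint():
    # Enough detail to tell "passes on my machine" from a real pass.
    return {
        "os": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
    }

def run_with_environment(test_fn):
    env = environment_fingerprint()
    try:
        test_fn()
        return {"env": env, "passed": True}
    except AssertionError:
        return {"env": env, "passed": False}

result = run_with_environment(lambda: None)  # a trivially passing test
print(result["passed"])  # prints True
```

Collect these records across every machine in the test pool, and "it only crashes on some PDA models" turns from an anecdote into a pattern you can see.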
Most importantly, when developing, don't let a bug slip away just because you have a workaround, and especially not because it's rare. Those bugs will truly frustrate your users and give you a reputation for shoddy code. Users complain much more about shoddy releases than about programs that were tested properly and released a little late.
This week's action item: Keep a crash log for a week or two. How often do computers fail near you? Distinguish between individual applications and the whole machine failing.
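If you want to keep the log in a file rather than on paper, a minimal sketch of the action item might look like this (the tags and wording are illustrative): one timestamped line per failure, marked as either a single application crashing or the whole machine going down.

```python
# Sketch of the suggested crash log: one timestamped entry per failure,
# tagged "app" (one program) or "system" (the whole machine).
import datetime

def log_crash(log, description, scope):
    # scope distinguishes individual applications from whole-machine failures
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    log.append(f"{stamp}\t{scope}\t{description}")

entries = []
log_crash(entries, "word processor died on save", "app")
log_crash(entries, "machine froze, hard reboot required", "system")
print(len(entries))  # prints 2
```

After a week or two, the counts in each column answer the action item's question directly.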
Peter Seebach would love to explain exactly why and how he wrote this column, but it's against his policy. And no, you can't talk to any human that had input into it. (Kidding.) Join him to best this force of nature at email@example.com.