Self-Healing in Modern Operating Systems
As I drive the stretch of Route 101 that connects San Francisco to Menlo Park each day, smiling billboard faces reassure me that all is well in computerdom in 2004. Networks and servers, they tell me, can self-defend, self-diagnose, self-heal, and even have enough computing power left over from all this introspection to perform their owner-assigned tasks.
Then, after arriving at my office, I reacquaint myself with reality: every IT manager, system administrator, and developer is fighting the monster of computing complexity. The worst possible situation to be in is trying to identify, root-cause, and resolve a problem in today’s complex stack. Regardless of who their vendor is, the administrators I talk to around the world don’t look much like the ones on the billboard, who look as if the only other thing they need from their servers is perhaps a martini dispenser.
While we need no reminder of the cost of complexity to the industry, it is worth wondering: Where are we really on the road to self-healing systems? How much of the problem is still open research, and how much is a lack of execution or priority on the part of vendors? Are we making more progress in hardware or in software? And how useful a solution can we expect from the software we have now, as opposed to software that has been modified or rewritten?