Disk drives lack reliable failure model
In storage circles, much discussion has arisen from the very interesting papers on disk drive reliability presented recently at FAST '07. Other columnists and bloggers, such as Frank Hayes and Robin Harris, have already done an excellent job of covering them. Rather than repeat the details, I'd like to look at the implications for service level commitments in the storage infrastructure.
In tiered storage architectures, distinctions among service levels are commonly based on attributes like performance and availability. Given the findings of these studies, it's worthwhile to review service levels and the design of the storage tiers that support them. Of the various findings, two factors stand out in this regard. The first is the lack of a reliable model for predicting failure. The Google study, which examined attributes such as age, heat, access, and SMART diagnostic data in consumer drives, found that many drives failed without prior indication. The Carnegie Mellon (CMU) study does suggest that age is a factor in reliability, but that it becomes significant far sooner than expected, in as little as two years. So, while the probability of a drive failing increases as it ages, the only meaningful actions that can be taken from a service delivery perspective are to continue with regular tech refreshes (e.g., a 3-year cycle) and perhaps to institute a process to record and analyze disk failures à la these studies, but tailored to the particular environment.
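As a rough illustration of what such a record-and-analyze process might look like, the Python sketch below computes an annualized failure rate for each year of drive age from a simple inventory log. The record layout (an install date plus an optional failure date) and the function name afr_by_age are assumptions made for this example; they are not taken from either study, and a real environment would likely track more attributes.

```python
from collections import defaultdict
from datetime import date


def afr_by_age(drives, as_of):
    """Return {age_in_whole_years: annualized failure rate}.

    drives: iterable of (installed, failed) pairs, where both are
            datetime.date values and failed is None for a drive
            that is still in service (hypothetical record layout).
    as_of:  the date on which the log was pulled.
    """
    exposure = defaultdict(float)  # drive-years observed in each age bucket
    failures = defaultdict(int)    # failures observed in each age bucket

    for installed, failed in drives:
        end = failed if failed is not None else as_of
        age_years = (end - installed).days / 365.25
        if age_years < 0:
            continue  # skip records with inconsistent dates

        full_years = int(age_years)
        # The drive was observed for a full year in every completed bucket...
        for year in range(full_years):
            exposure[year] += 1.0
        # ...and for a partial year in the bucket where observation ended.
        exposure[full_years] += age_years - full_years

        if failed is not None:
            failures[full_years] += 1

    return {
        year: failures[year] / exposure[year]
        for year in sorted(exposure)
        if exposure[year] > 0
    }


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    log = [
        (date(2020, 1, 1), None),
        (date(2020, 1, 1), date(2021, 7, 1)),
    ]
    for age, rate in afr_by_age(log, as_of=date(2022, 1, 1)).items():
        print(f"year {age}: AFR {rate:.2f}")
```

Dividing failures by drive-years of exposure, rather than by a simple drive count, keeps buckets comparable when drives enter and leave service at different times. With a small fleet the per-bucket rates will be noisy, so they are better read as a trend to watch than as a prediction.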