Where do I put digital content for secure, affordable storage?
Many discussions about storage concentrate on devices: the pros and cons of hard drives, or hard drives on shelves, or data tape of various formats. Companies are also still advertising optical media for archive storage.
Since these answers are meant to be short, the short answer about storage devices is that they all have problems. Asking which kind of storage to use is the wrong question. The real issue is how to manage content, on whatever kind of storage. Every option has risks, and the key is active storage management, which is itself a form of maintenance.
The essence of preserving any collection, analogue or digital, on shelves or ‘in the cloud’, is a continuous programme of maintenance. Preservation has to happen every day.
Here are some basic principles:
1) two copies is the absolute minimum requirement for protection against risk. Even the slightest risk will eventually result in loss of content if there is only one copy.
2) checking is necessary on a regular basis, to make sure you still have (at least) two good copies.
3) checking requires fixity information; a sizeable collection isn’t checked by manually opening files and watching them and listening to them. Checking needs to be automated, and relies on a calculation that ensures simply that ‘no bits have changed’ since the file was originally checked by some more intelligent process.
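As an illustration of principles 2 and 3, here is a minimal sketch of an automated fixity check in Python. It assumes that the fixity information is a SHA-256 digest recorded when the file was last known to be good; the function names (`sha256_of`, `check_copies`, `has_two_good_copies`) are hypothetical and chosen only for this example, and other checksum algorithms would work in the same way.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large media files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_copies(recorded_digest, copy_paths):
    """Map each copy's path to True (bits unchanged) or False (missing or changed)."""
    results = {}
    for path in copy_paths:
        candidate = Path(path)
        results[str(path)] = (
            candidate.is_file() and sha256_of(candidate) == recorded_digest
        )
    return results


def has_two_good_copies(recorded_digest, copy_paths):
    """True if at least two copies still match the recorded digest."""
    return sum(check_copies(recorded_digest, copy_paths).values()) >= 2
```

A scheduled job could call `has_two_good_copies` for every file in the collection and, whenever it returns False, replace the failed copy from a good one. The check only confirms that no bits have changed; it says nothing about whether the content was correct when the digest was first recorded.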
Given a commitment to multiple copies and regular checking, there are still decisions to make about cost and performance. If material has to have fast access, then disc storage or a tape robot is needed. If an archive can get by with access within hours rather than seconds, data tape can be kept on shelves.
The PrestoPRIME project gathered statistics from many large studies of storage devices, and came up with a clear finding: data tape is cheaper and more reliable than disc drives. [Threats to data integrity from use of large-scale management environments, p37]
PrestoPRIME also produced an online Storage Planning Tool that shows how often content needs to be checked in order to reach a desired quality assurance level. The tool deals with complex storage strategies mixing discs and tape (or mixing any storage method that has known performance statistics, including cloud services if they have known and verifiable reliability statistics) and allows ‘what if’ calculations to estimate the cheapest way to achieve a set level of reliability.
The approach is statistical: it needs to know the failure rates for a given technology. From the failure rate, the frequency of checking and the number of copies, the cost and effectiveness of a strategy can be calculated. The result will be the number of files damaged or completely lost over a period of time (such as 20 years). There is no point at all asking for a guarantee of no loss, because that can’t even be calculated. If there is any risk at all, then there is a finite probability of loss.
The whole statistical approach to quality assurance, in archives as anywhere else, is expressed as the ‘number of nines’ in the probability that content will NOT be lost: it could be ‘4 nines’ = 99.99% safe, or ‘7 nines’ = 99.99999% safe, or even ‘9 nines’ = 99.9999999% safe. Storage strategies and costs can be calculated to achieve a required ‘number of nines’, but there is no way to apply a statistical approach to achieving an assurance of 100%. “I don’t ever want to lose anything” is a statement of an ideal or an aspiration. “I must have ‘7 nines’ of risk protection” is a statement that an engineer can work with, to design a storage and checking strategy that meets ‘7 nines’.
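To make the ‘number of nines’ calculation concrete, here is a rough sketch in Python. It is not the model used by the PrestoPRIME Storage Planning Tool. It is a deliberately simplified illustration that assumes each copy fails independently at a fixed annual rate, that every check finds and repairs damaged copies as long as one good copy survives, and that a file is lost only if all copies fail between two consecutive checks. The function names and the example failure rate of 2% per year are assumptions made for this example.

```python
import math


def loss_probability(annual_failure_rate, copies, checks_per_year, years):
    """Probability that a file is lost within `years` under the simplified model."""
    if not 0.0 < annual_failure_rate < 1.0:
        raise ValueError("annual_failure_rate must be between 0 and 1")
    if copies < 1 or checks_per_year < 1 or years <= 0:
        raise ValueError("copies, checks_per_year and years must be positive")
    # Chance that a single copy fails during one checking interval.
    per_interval = -math.expm1(math.log1p(-annual_failure_rate) / checks_per_year)
    # The file is lost if every copy fails within the same interval.
    all_copies_fail = per_interval ** copies
    intervals = checks_per_year * years
    # 1 - (1 - all_copies_fail) ** intervals, computed without losing precision.
    return -math.expm1(intervals * math.log1p(-all_copies_fail))


def nines(loss):
    """Number of nines in the probability that content is NOT lost."""
    return -math.log10(loss)


def checks_needed(target_nines, annual_failure_rate, copies, years,
                  candidates=(1, 2, 4, 12, 52, 365)):
    """Smallest candidate checking frequency that meets the target, or None."""
    for checks_per_year in candidates:
        loss = loss_probability(annual_failure_rate, copies, checks_per_year, years)
        if nines(loss) >= target_nines:
            return checks_per_year
    return None


if __name__ == "__main__":
    # Assumed example figures: 2% annual failure rate per copy, 20-year horizon.
    for copies in (2, 3):
        for checks_per_year in (1, 12):
            loss = loss_probability(0.02, copies, checks_per_year, 20)
            print(copies, "copies,", checks_per_year, "checks/year:",
                  round(nines(loss), 2), "nines")
```

Even this crude model shows the two levers described above: adding a copy or checking more often both raise the number of nines, and no finite combination of the two reaches 100%. A real planning exercise would use measured failure statistics for the storage technologies concerned and would also account for cost.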