Preserving Files – How to make sure your files are safe and uncorrupted

There are formal standards and technology for digital preservation: OAIS and all that. The field has had intense development for the last ten years, with many results:
• standards for trusted repositories (OAIS)
• open-source repositories (DSpace, Fedora, Greenstone …)
• standards for risk management (TRAC)
• open-source tools for
o identifying file types (DROID, PRONOM, FITS)
o verifying that files conform to standards (so that a PDF is really a PDF) (JHOVE)
o extracting embedded metadata so a new file can be added to the catalogue (ExifTool, FFmpeg and others)
• standards for describing digital files (MODS and lots and lots more)
• standards and tools for combining files and metadata into a unit (METS, BagIt … or MXF)

There are two other main approaches to maintaining a digital collection:
1. Digital Asset Management (DAM) systems. These have been around in some form for nearly 20 years. The Imagen system from Cambridge Imaging (www.cambridgeimaging.com) is a good example of an asset management system tailored for the needs of audiovisual collections.
2. Do it yourself: make a spreadsheet (or simple database) of information about what you have, and use manual processes to ensure you have backups and that they work.

Here we come to the key issue: do your backups work? I won’t even consider the case of a collection that doesn’t have backups, as that approach is clearly doomed. The easiest way to ensure that backups are present and usable is to have software, such as an asset management system or storage management system, that automatically makes backups of every new file that enters the system, and automatically checks at regular intervals that the main files and the backups are viable.

Periodic checking is vital. Anybody can write files twice to storage and walk away. The issue is: what will still be there, error-free, when you come back?

 

Matthew Addis (then of Southampton University) and the PrestoPRIME project developed a simplification of OAIS, as shown in the diagram. This approach is at the heart of the service provided by Arkivum (http://arkivum.com/), a company founded to offer not only storage but also guarantees (with indemnities) that your content will be kept error-free.

[Diagram: file preservation process]

The process begins (green circle) with two files (master and backup) and a process that automatically checks the masters to make sure they are ok. If there is a failure, the state of the collection switches to amber, meaning warning. The vital issue is detection of the failure, because only then can the system enter the yellow state (failure detected, corrective action initiated). Then the file is restored from backup, a process that only works if the backup file is also still ok. If it is, the system returns to green. If not, the file cannot be recovered and the status (for that file) is red: that file is lost and gone forever. The only exceptions are very expensive and time-consuming intervention to recover something from a corrupted file or storage system, or just possibly the re-ingest (re-digitisation) of items that were not ‘born digital’.
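The colour states described above can be sketched as a small state machine. This is only an illustration of the description in this post, not the PrestoPRIME or Arkivum implementation; the names Status, Event and step are made up for the example.

```python
from enum import Enum


class Status(Enum):
    GREEN = "master and backup intact"
    AMBER = "failure has occurred but is not yet detected"
    YELLOW = "failure detected, corrective action initiated"
    RED = "file lost"


class Event(Enum):
    FAILURE = "master file becomes corrupt"
    DETECTED = "periodic check notices the failure"
    RESTORE_OK = "backup was intact and the master was restored"
    RESTORE_FAILED = "backup was also corrupt"


# Allowed transitions, as described in the text (hypothetical encoding).
TRANSITIONS = {
    (Status.GREEN, Event.FAILURE): Status.AMBER,
    (Status.AMBER, Event.DETECTED): Status.YELLOW,
    (Status.YELLOW, Event.RESTORE_OK): Status.GREEN,
    (Status.YELLOW, Event.RESTORE_FAILED): Status.RED,
}


def step(status, event):
    """Return the next status; events that do not apply leave it unchanged."""
    return TRANSITIONS.get((status, event), status)


if __name__ == "__main__":
    status = Status.GREEN
    for event in (Event.FAILURE, Event.DETECTED, Event.RESTORE_OK):
        status = step(status, event)
    print(status.name)  # GREEN
```

Note that there is no way out of AMBER except detection, and no way out of RED at all, which is why regular checking matters so much.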

Several times now I’ve referred to checking that a file is ok. A simple concept, but how is it implemented? A small collection can be manually tested to see if the files open and play, but that doesn’t scale. The proper approach is to compute a “fixity check” (a checksum: a short number calculated from the complete contents of the file) on a file that is known to be good. Forever after, if a new fixity calculation produces the same number, the file has not changed.
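For readers who want to see what a fixity calculation looks like, here is a minimal sketch in Python using the standard hashlib module. The choice of SHA-256 and the function name fixity are assumptions made for this example; other checksum algorithms work the same way.

```python
import hashlib


def fixity(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the checksum of a file as a hexadecimal string.

    The file is read in chunks so that large audiovisual files
    do not have to fit in memory.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Record the value once while the file is known to be good, then compare it with a fresh calculation later: if even one byte has changed, the two values will differ.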

A good asset management system will compute fixity checks for all new files, make two (or more) copies of all new files, and periodically recompute the fixity numbers to prove the files are still intact. In case of error, the system should replace the broken copy with a backup that does pass its fixity check.
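The check-and-replace behaviour described above might look roughly like this for a single file with one backup. It is a sketch under simple assumptions (one master, one backup, a SHA-256 value recorded earlier), not how any particular asset management system works; check_and_repair and sha256_of are hypothetical names.

```python
import hashlib
import shutil


def sha256_of(path):
    """Return the SHA-256 of a file, or None if it cannot be read."""
    digest = hashlib.sha256()
    try:
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                digest.update(chunk)
    except OSError:
        return None
    return digest.hexdigest()


def check_and_repair(master, backup, expected):
    """Check both copies against the recorded checksum and repair if possible."""
    master_ok = sha256_of(master) == expected
    backup_ok = sha256_of(backup) == expected
    if master_ok and backup_ok:
        return "ok"
    if master_ok:
        shutil.copyfile(master, backup)  # refresh the damaged backup
        return "backup restored"
    if backup_ok:
        shutil.copyfile(backup, master)  # restore the damaged master
        return "master restored"
    return "lost"  # neither copy matches: nothing good to restore from
```

The important design point is that a copy is only ever overwritten by one that has just passed its own fixity check, so a corrupt file can never replace a good one.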

However, not all asset management systems manage backups, not all systems check fixity – and not all people have a comprehensive asset management system. There is now a simple, free tool that anyone can use for fixity calculation and checking. I use it on my personal computer for the inventory and monthly verification of a collection of about 25,000 photos. Now that the software is set up, it recomputes all 25,000 fixity codes every month and informs me of any problems.
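To give a feel for what an inventory plus monthly verification involves, here is a rough sketch. It is not how the tool mentioned above works internally; build_manifest and verify are hypothetical names, and whole files are read into memory, which is fine for photos but not for large video files.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Compare the folder as it is now against an earlier manifest."""
    current = build_manifest(root)
    return {
        "changed": sorted(n for n in manifest
                          if n in current and current[n] != manifest[n]),
        "missing": sorted(n for n in manifest if n not in current),
        "new": sorted(n for n in current if n not in manifest),
    }
```

In practice the manifest would be saved somewhere outside the collection folder (for example as a JSON file) and verify would be run on a schedule, with any non-empty "changed" or "missing" list treated as a warning.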

The software is called Fixity and comes from AVPreserve, a leading consultancy in audiovisual preservation (https://www.avpreserve.com/tools/fixity/). I strongly recommend that anyone with responsibility for large collections of files move immediately to fixity calculation and regular (at least monthly) fixity checking, to provide clear proof that their files are intact.

So how big is a ‘large collection’? I think the tipping point is around 1000 files. More than that, and manual approaches just run out of steam — and lead to loss. You forget where things are, you forget to do backups, the backups are disorganised or incomplete or out of date — and the files may be corrupt without you ever noticing. Using a tool, like Fixity, puts you on the road to preservation instead of the road to loss.
