The Human Genome Archive Project
The importance of keeping scientific archives in the digital age – by Jenny Shaw
In an electronic age, what sort of archive material will historians be able to research? This question is at the heart of the new Human Genome Archive Project sponsored by the Wellcome Library. Today, all sorts of researchers delete their emails or send old datasets to the trash-bin; memory sticks get lost; research papers are erased. Now, more than ever, researchers need to work together to find new ways to preserve e-history as it happens. If they do not, then future historians will be unable to reconstruct all the contributions that made possible major scientific initiatives such as the Human Genome Project.
The Human Genome Project (HGP) opened a new frontier in genetics and was one of the most exciting international scientific collaborations. On 26 June 2000, it was announced to the world that the first working draft of the human genome sequence was complete. This scientific achievement was made possible through an unprecedented partnership across public, private and non-profit sectors, and brought the potential to spark a revolution in medical discovery. The data for the HGP were openly released online through the sequence databases, making them secure and available for scientific researchers. But what of the organisational records, personal papers and other material created during the sequencing effort? Who is making sure that these are secured for historical researchers?
In June 2009, an initial meeting was held at Cold Spring Harbor Laboratory on Long Island, New York, where concern was expressed that the historical legacy of the HGP was at risk unless action was taken to secure it. Following preliminary work and the start of projects in other countries, the Wellcome Library launched the UK strand of the Human Genome Archive Project (HGAP) in January 2012.
The core aim of the HGAP is to preserve the documentary heritage of the HGP created between 1977 and 2004, from the development of Sanger sequencing to the publication of the ‘gold standard’ human genome in Nature. After developing an effective survey methodology, the HGAP will survey key holdings already preserved in recognised archives, as well as individual or organisational records not currently held in recognised archives. It will ensure that material in any format is secured so that it can eventually be made available to researchers.
What we are doing is not particularly novel – surveying historical material with the aim of preserving it – but the timing is. Although the project will encompass records created in all formats, including paper, a very large amount of the material created during the HGP is in born-digital format – that is, material created electronically rather than converted to a digital format through processes such as scanning or photography. This is crucial to the timing of the HGAP.
When an archive is contacted about taking on a scientist’s records, it is often after their retirement, or more commonly by a relative after their death. This model works in the hard-copy, analogue world. It allows a suitable passing of time to place the scientist’s work into perspective before decisions about preservation and providing access to their records need to be made. However, in the digital age this standard approach is now unsuitable. Increasingly, archivists need to start working with scientists before they retire. Although this brings new challenges, such as fitting into already busy schedules, it has the potential to allow better collections of material, with richer contextual information, to be preserved in archive collections of the future. So what are the e-challenges for archivists in a digital age?
One of the key reasons that archivists need to act earlier to preserve digital material is its vulnerability. The media are full of stories of hardware failure, data loss and digital black holes. Lots of digital material from the 1980s has already been lost, a poor comparison with paper or parchment manuscripts which have survived for hundreds of years. Unless we act now, there is a real risk that key material from the late 20th century will not survive. One of the main problems is that digital material needs to be interpreted by a whole host of software and hardware. This means that while a box of paper records can still easily be read after decades of benign neglect in the loft or under the spare bed, the ability to read digital material kept in the same conditions might well be lost.
The pace of technological change is quick, and both hardware and software often become obsolete in a short time. The 3.5” floppy disk was ubiquitous during the 1990s, but it is already difficult to find a computer with the necessary drive to read these disks. Add old operating systems and software, such as WordStar, into the mix and the situation becomes even more complex and difficult to manage. By being more proactive, doing e-archiving in collaboration with research teams, archivists can help to preserve more digital material. Time really is of the essence.
One of the scientists with whom the HGAP has been working closely is Michael Ashburner, Emeritus Professor of Genetics at the University of Cambridge. He was a leading figure in the sequencing of the Drosophila (fruit fly) genome. Some of the material we have found in the course of a survey of his papers highlights many of the common issues facing archivists in the digital age. Ashburner was an early adopter of computers for his genetics work and we have encountered digital material on a range of storage media. Some of these formats are more straightforward to handle than others and we have had to make difficult decisions about what we can deal with and what is prohibitively expensive to preserve. No organisation has limitless resources, so it is important to carefully balance the cost of recovering information against the potential historical benefit.
The decision has been made not to take Ashburner’s rolls of magnetic data tape, mainly because they contain sequence data rather than research records, but also because the cost of retrieving the information outweighed the potential benefit. We plan to capture a printed index of what was on the tapes and have documented our decisions. We are, however, hoping to be able to recover important information from some 5.25” floppy disks. These are going to be used as a test case to get baseline figures for the cost of data recovery and to explore whether we are able to work with the results in a meaningful way.
Extracting the data from the storage medium is often just the start of the preservation process. Even for the 3.5” floppy disks – some PC formatted and others Mac formatted – we have needed to use an external disk drive on our virus-checking laptops. After we have checked the disks to make sure they are clean, we bring the contents into our digital preservation system. Once in our system, they will be placed on ‘technology watch’: the file format will be monitored to make sure that it remains accessible. The example of the Ashburner digital material shows that, often, the older something is, the harder and more expensive it is to deal with.
Another key technology issue is how the use of personal computers has changed the way material is organised within filing systems – or not, as the case may be. The shift from centralised filing systems, often managed by a dedicated person, to personal filing systems is significant. It helps if an archivist is able to work with the record creator to understand the idiosyncrasies of their filing system; this also provides the opportunity to preserve the original order of files and folders when they are transferred to the archive repository. There are many benefits to starting conversations with potential donors sooner rather than later, but it can also raise issues surrounding sensitivity and access to material once it has been deposited.
Taking in material while a scientist is still active means that interactions with other scientists might still be live issues and it is likely that the third parties mentioned will also still be alive. Managing sensitive information, however, is not a new challenge for archivists; indeed, the Wellcome Library already has a significant amount of material in our collection that contains personal or sensitive information. We take our responsibilities under the Data Protection Act seriously and have a robust access policy in place, which has been approved by the UK Information Commissioner’s Office. This policy covers material in any format, including born-digital and digitised content.
People often think about their personal digital material differently to its hard-copy equivalents. Email is a good case in point. The Digital Preservation Coalition published a report on preserving email in 2011, which identified the paradox that exists with digital communications: although email is ubiquitous it is also ephemeral. Few people manage or care for their electronic communications with the same rigour that they used for their hard-copy correspondence. Archives up and down the land have lots of collections of letters, and few would argue against the value of this material. The same attitude does not always extend to email, which can be seen as less relevant for archive repositories. When the British Library bought the poet Wendy Cope’s email collection in April 2011, dissenting voices questioned its worth. But email is the future personal letter and it needs preserving too.
Although email bears a strong similarity to letters, it does have significant differences. Email communication is often less formal than a written letter, and can also be used in a wider range of situations: for example, it is often used to replace communication by telephone. Email does not have the natural cooling-off period that a written letter might allow, so messages can be fired off in the heat of the moment. These uses of email often make potential depositors less comfortable with the idea of preserving it in an archive. It needs to be handled sensitively, and this is where a professional archive service can help.
Email should be a valuable part of modern archive collections and has a major advantage over written letters: the ability to easily capture both sides of the correspondence. Although many collections describe their contents as being correspondence, they are in fact letters – a one-sided half of the conversation. The beauty of email is that both sides are contained within a single account and are often found threaded together. With time, maybe we will also grow to value the form of the email just as we do the written letter and look at aspects such as the signature, the address being used and the font. Maybe someone showed their personality in an email with capital letters and exclamation marks, used a friendly or gruff tone, or helpfully felt the need to summarise key issues, making the exchange a valuable research tool for future historians.
Preserving born-digital material is fundamentally changing when and how we do things, but not what or why. Archivists have always needed to engage with scientists to capture a meaningful record of their work. The challenge is for scientists to help make available not just their published outputs but also the records of their working lives. These should be preserved in partnership. Unlocking the genome sequence has been an extraordinary scientific achievement which deserves an archive record of the human interactions that helped to create such an important worldwide resource.
Henry Wellcome believed that history is not just in our making but in our keeping too. The HGAP seeks to build on his legacy by looking beyond the next historical corner, where the researchers of tomorrow will discover new findings about the important scientific work of today.
Jenny Shaw is the project archivist for the Human Genome Archive Project based at the Wellcome Library and the Wellcome Trust Sanger Institute (email@example.com). She welcomes enquiries from leading scientists and researchers who feel that their archives should be preserved for future generations.