Organising for digital archiving: new distribution models in the scientific information chain

 

John Mackenzie Owen

TICER / Tilburg University

owen@hum.uva.nl

 

The changing role of libraries

 

Digital archiving is a concept that has little to do with archives in the conventional sense. It is most often used to refer to the role of libraries for the long-term storage and preservation of published information in digital form. The subject of this paper is therefore how libraries should organise the archiving function in a way that will ensure the availability of scientific publications for future generations.

 

The dominant factor in the current development of libraries is the ongoing move towards digital distribution of information through the global network infrastructure. This is creating a shift from the traditional role of the library as a ‘clearing house’ and warehouse for printed publications to a role as a supplier of networked services for digital information resources. The library of the future can be characterised as follows (Mackenzie Owen, J.S. and Wiercx, A., 1996):

 

·       Services will be based on digital, networked information resources;

·       User interaction with the library will take place from the desktop (remote access) instead of by physically visiting the library (on-site access);

·       Emphasis will be on access to networked resources instead of on storing materials in the library;

·       The traditional library catalogue will evolve into a networked resource discovery mechanism;

·       Bibliographic data included in library systems will be extended to include non-document resources (e.g. persons, organisations and datasets);

·       New organisational models and distributed functions will arise, based on co-operation and domain-based services.

 

 

The traditional archiving role of libraries

 

Libraries belong to the so-called ‘memory organisations’, together with archives and museums. This reflects the fact that the global library system acts as the collective memory of the world’s cultural and scientific heritage as recorded in the printed word. In the world of printed publications only libraries perform a memory function which guarantees to a certain extent that publications are not lost after immediate use (fig. 1).

 

[Picture]

 

 

The long-term storage of publications is more a (fortunate) outcome of librarians’ reluctance to discard infrequently used publications than the outcome of sound management. In fact, very few libraries have an explicitly stated responsibility with respect to long-term storage and preservation. A legal responsibility for this exists only for the national libraries of Europe through deposit legislation.

 

However, the memory function in the publication chain is far from perfect. It is highly selective and random. What is preserved for future generations depends on a large number of decisions, and any publication stands a chance of being lost from the collective memory. To give a few examples:

 

·       What enters into the library system depends on what authors and publishers decide to publish. There are many examples of cultural and scientific works which are inaccessible because they have remained unpublished;

·       All libraries have an acquisitions policy which determines which publications enter the library collection. There is no system which guarantees that each publication will be acquired by at least one library, with the exception of publications in countries with a well-organised legal deposit system;

·       Libraries do not always store indefinitely all publications they acquire. Although there is a natural tendency amongst libraries not to discard items from the collection, this sometimes is necessary, e.g. for economic reasons;

·       Publications stored in libraries are sometimes lost due to media deterioration, caused either by inadequate storage conditions or by other factors such as the chemical self-destruction of publications printed on chlorine paper;

·       Many kinds of disasters, such as fire or flooding, can lead to loss of publications;

·       Finally, political factors and censorship frequently prevent publications from being acquired by libraries, or lead to their removal from the collection.

 

However, in the printed world preservation of the intellectual record is enhanced by large print-runs, which means that publications are usually produced in at least hundreds of copies and are acquired by many different libraries, often distributed over the entire world. The chance that at least a single copy of a publication is preserved once it has been published is usually quite large.
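
The effect of the print-run on preservation can be illustrated with a minimal sketch. It assumes, purely for illustration, that every copy is lost independently and with the same probability; the function name and the figures used are hypothetical and are not taken from any study.

```python
def survival_probability(copies: int, loss_probability: float) -> float:
    """Chance that at least one of `copies` copies is preserved.

    Illustrative assumption: each copy is lost independently with the
    same probability `loss_probability` (a hypothetical parameter).
    """
    if copies < 0:
        raise ValueError("copies must not be negative")
    if not 0.0 <= loss_probability <= 1.0:
        raise ValueError("loss_probability must lie between 0 and 1")
    # All copies are lost with probability loss_probability ** copies;
    # at least one copy survives in every other case.
    return 1.0 - loss_probability ** copies


if __name__ == "__main__":
    # Hypothetical figures: each copy has a 90% chance of being lost.
    for copies in (1, 10, 200):
        print(copies, survival_probability(copies, 0.9))
```

Under these assumptions a publication held in a single copy survives with a chance of only about one in ten, whereas with a few hundred copies in circulation the chance that at least one is preserved is very close to certainty.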


Archiving in the digital world

 

In the world of digital publications the collective memory is as selective and random as it is in the world of print. In fact, the situation is far worse when we consider the following:

 

·       Digital publications are produced and archived in a far smaller number of copies – in most cases only a single copy is made available and stored on the network;

·       The cost per access for digital archiving is higher than that of print archiving; since budgets are limited, fewer copies will eventually be archived;

·       Digital materials periodically need to be migrated to new storage media, data formats and system environments; the future cost of migration is uncertain and it is likely that many digital archives will not be maintained in a way which guarantees that all materials will remain accessible;

·       In general there is a lack of understanding of digital archiving issues, which at least initially could lead to data loss;

·       Libraries are still focused on print publications, and tend to neglect their memory function for digital publications;

·       Finally, the dynamic, interactive, distributed document types which are now emerging are extremely difficult, if not impossible, to archive in comparison with the current text and image based documents.

 

Digital publishing models

 

The nature of digital publications makes the archival task for libraries more difficult. But without adequate measures, there will be no archiving by libraries at all, and as a consequence the collective memory of science will disappear. This becomes clear if we look at the various models for digital publishing that are now beginning to emerge. These models all imply distribution directly from the creator or publisher to the end-user over the network, with no direct involvement from intermediary organisations such as libraries. The immediate consequence of this is that the distribution channel no longer has a memory function performed by organisations that have long-term archiving as their implicit (most libraries) or explicit (deposit libraries) responsibility. Consider the three publishing models described in fig. 2:

 

·       Self-publishing, i.e. by individual authors or their parent organisations. There is no guarantee that they will have the inclination or the resources to maintain long-term availability. The archives (such as WWW and FTP-sites) they set up on the network will be subject to frequent changes and will usually have a short life-span, as is already noticeable to anybody trying to access materials put onto the Internet more than a year ago.

·       Publisher archives. Many large, international scientific publishers are now creating so-called ‘archives’ or repositories for distributing their publications in digital form. Although some now also distribute journals in digital form to libraries, they most certainly will not continue to do so. It is also clear (and some publishers have already explicitly stated this) that materials will only be available through these repositories for as long as there is sufficiently frequent demand to justify the cost of storage. After a certain period (probably 2 to 5 years) publications will be removed from the repository and will no longer be available. When a publication goes ‘out of print’ in this way, there will be no copies stored in libraries as is the case with printed publications.

·       Push technology. The current publication model is based on the ‘pull’-concept: users interested in a publication go to a library or digital repository and pull the document out of the files for personal use. This is precisely the reason why (short term) storage is required: to hold the information in a file until a user comes and asks for it. In certain areas of publishing (and perhaps in future in science publishing too) this model is being replaced by the ‘push’-concept: the user indicates the type of materials he or she is interested in, and relevant materials are sent to the user as soon as they are created or published. In this model, there is no need for a memory function anywhere in the distribution channel.

 

[Picture]

 

 

Solutions for digital archiving

 

From our analysis it becomes clear that digital archiving, i.e. maintaining accessibility of scientific publications for future use, is a function that needs to be organised in an explicit way. It is highly unlikely that creators and publishers of digital information will be able to provide a coherent and persistent memory system. They have no commercial interest in long-term archiving, and they will not have the technical skills and funds to maintain digital collections indefinitely. The idea put forward in the United States by the Task Force on Archiving of Digital Information of the CPA and the RLG (Task Force, 1996) that the creator of digital information should be responsible for long-term archiving is therefore potentially dangerous, since it could prevent other and better solutions from being developed.

 

What is needed for digital archiving is a system which gives the responsibility for digital archiving to organisations which have a specific archival function, which can develop the highly specialised skills required for long-term storage and preservation, and which can guarantee global accessibility to archival materials over the network. The European approach, which can serve as a model for other geographic areas, is the system of national deposit libraries (Mackenzie Owen, J.S. & Walle, J. v.d., 1996). These have a legal responsibility for archiving print materials which is currently being extended to cover digital publications. This system could well be supplemented by other archival organisations in specific subject domains, e.g. scientific institutes and emerging virtual libraries operating on a global scale.

 

Digital deposit libraries could interlink to form a comprehensive archival backbone for other libraries to provide service to users. In this way, there is no need for these other libraries to maintain their own digital collections (other than very frequently used current materials). Although the cost of digital archiving is higher than that of print archiving, this system would create enormous savings as compared to the current system. In the current system, the same publication is stored in a large number of libraries, each creating its own archival cost. In the system proposed here, only one storage location is required (or at least an extremely limited number for reasons of security and network efficiency). On a global scale the reduction in archival cost could be very large.
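
The cost argument above can be made concrete with a small sketch. All names and figures in it are hypothetical and serve only to show the arithmetic: it assumes a fixed archival cost per storage location, higher for a digital archive than for a print library, and compares many print locations with a handful of digital ones.

```python
def total_archival_cost(locations: int, cost_per_location: float) -> float:
    """Total cost of archiving one publication at a number of locations."""
    if locations < 0 or cost_per_location < 0:
        raise ValueError("locations and cost_per_location must not be negative")
    return locations * cost_per_location


def relative_saving(print_libraries: int, print_cost: float,
                    digital_sites: int, digital_cost: float) -> float:
    """Fraction of the print archiving cost saved by the digital system.

    A negative result means the digital system would be more expensive.
    """
    print_total = total_archival_cost(print_libraries, print_cost)
    digital_total = total_archival_cost(digital_sites, digital_cost)
    if print_total == 0:
        raise ValueError("print archiving cost must be greater than zero")
    return 1.0 - digital_total / print_total


if __name__ == "__main__":
    # Hypothetical figures: 500 libraries each holding a print copy at a
    # cost of 1 unit, against 3 digital deposit sites at 10 units each.
    print(relative_saving(500, 1.0, 3, 10.0))
```

With these illustrative figures the digital system costs 30 units against 500 for print, a saving of roughly 94 per cent even though each digital location is ten times as expensive as a print location. The saving disappears only when the number of digital locations, multiplied by the higher unit cost, approaches the total cost of the print holdings.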

 

The system of archival deposit libraries for digital materials is based on two simple principles, viz. that publishers are willing (or legally obliged) to deposit a copy of digital materials on publication, and that the deposit library is allowed to provide global access to these materials as soon as they are no longer accessible from a repository under control of the publisher.

 

The future intermediary role of libraries

 

What does this mean for libraries in general? The ongoing move towards digital distribution of information through the global network infrastructure described at the beginning of this paper has major consequences for the traditional archival function. In the networked world a single storage location is sufficient. There is no need for the traditional ‘many copies, many libraries’ approach. In addition, publishers will not allow libraries to store digital publications because they wish to control access and maintain direct relationships with their customers, i.e. the end users. Therefore, publishers will set up digital repositories as short-term archives (possibly through outsourcing to subscription agents). Publishers will not, however, take on the responsibility for long-term archiving, which is expensive and requires specialised skills and infrastructure. Therefore, digital archives can only be maintained by national libraries and/or large, specialised, international, domain-based virtual libraries (Mackenzie Owen, J.S., 1996).

 

 

The large national deposit libraries are, at least in Europe, best equipped to perform the long-term archival function to maintain access to the literature of science. It is therefore essential that they obtain the legal basis which extends their responsibilities to include digital materials. However, archiving on a national scale is not sufficient in a globally networked environment. It is therefore also necessary that the digital deposit libraries join forces to create a globally interlinked archival system. That will be the future memory of science.

 

References

 

Mackenzie Owen, J.S. (1996) - Preservation of digital materials for libraries. In: European research libraries co-operation; the LIBER quarterly, 6(1996)4, p. 435-451.

 

Mackenzie Owen, J.S. & Walle, J. v.d. (1996) - A study of issues faced by national libraries in the field of deposit collections of electronic publications: final report. - Luxembourg: European Commission.

 

Mackenzie Owen, J.S. and Wiercx, A. (1996) - Knowledge models for networked library services. - Luxembourg: European Commission.

 

Task Force (1996) - Preserving digital information: report of the Task Force on Archiving of Digital Information commissioned by the CPA and the RLG: final report and recommendations.