Digital libraries and digital archives: new distribution models in the information chain

John Mackenzie Owen

TICER / Tilburg University

owen@hum.uva.nl

The changing role of libraries

The ongoing move towards digital distribution of information through the global network infrastructure is creating a shift from the traditional role of the library as a ‘clearing house’ for printed publications to a role as a supplier of networked services for digital information resources. The library of the future can be characterised as follows (Mackenzie Owen, J.S. and Wiercx, A., 1996):

· Services will be based on digital, networked information resources;

· User interaction with the library will be from the desktop (distance access) instead of by physically visiting the library (on-site access);

· Emphasis will be on access to networked resources instead of on storing materials in the library;

· The traditional library catalogue will evolve into a networked resource discovery mechanism;

· Bibliographic data included in library systems will be extended to include non-document resources (e.g. persons, organisations, datasets etc.);

· New organisational models and distributed functions will arise, based on co-operation and domain-based services.

The traditional archiving role of libraries

Libraries belong to the so-called ‘memory organisations’, together with archives and museums. This reflects the fact that the global library system acts as the collective memory of the world’s cultural and scientific heritage as recorded in the printed word. In the world of printed publications only libraries perform a memory function which guarantees to a certain extent that publications are not lost after immediate use (fig. 1).

The long-term storage of publications is more a (fortunate) outcome of librarians’ reluctance to discard infrequently used publications than the outcome of sound management. In fact, very few libraries have an explicitly stated responsibility with respect to long-term storage and preservation. A legal responsibility for this exists only for the national libraries of Europe through deposit legislation.

However, the memory function in the publication chain is far from perfect. It is highly selective and random. What is preserved for future generations depends on a large number of decisions, and any publication stands a chance of being lost from the collective memory. To give a few examples:

· What enters into the library system depends on what authors and publishers decide to publish. There are many examples of cultural and scientific works which are inaccessible because they have remained unpublished;

· All libraries have an acquisitions policy that determines which publications enter the library collection. There is no system which guarantees that each publication will be acquired by at least one library, with the exception of publications in countries with a well-organised legal deposit system;

· Libraries do not always store indefinitely all publications they acquire. Although there is a natural tendency amongst libraries not to discard items from the collection, this is sometimes necessary, e.g. for economic reasons;

· Publications stored in libraries are sometimes lost due to media deterioration, either caused by inadequate storage conditions or other factors such as the chemical self-destruction of publications printed on chlorine paper;

· Many kinds of disasters, such as fire or flooding, can lead to loss of publications;

· Finally, political factors and censorship frequently prevent publications from being acquired by libraries, or lead to their removal from the collection.

However, in the printed world preservation of the intellectual record is enhanced by large print-runs, which means that publications are usually produced in at least hundreds of copies and are acquired by many different libraries, often distributed over the entire world. The chance that at least a single copy of a publication is preserved once it has been published is usually quite large.

Archiving in the digital world

In the world of digital publications the collective memory is as selective and random as it is in the world of print. In fact, the situation is far worse when we consider the following:

· Digital publications are produced and archived in a far smaller number of copies – in most cases only a single copy is made available and stored on the network;

· The cost per access for digital archiving is higher than that of print archiving; since budgets are limited, fewer copies will eventually be archived;

· Digital materials periodically need to be migrated to new storage media, data formats and system environments; the future cost of migration is uncertain and it is likely that many digital archives will not be maintained in a way which guarantees that all materials will remain accessible;

· In general there is a lack of understanding of digital archiving issues, which at least initially could lead to data loss;

· Libraries are still focused on print publications, and tend to neglect their memory function for digital publications;

· Finally, the dynamic, interactive, distributed document types which are now emerging are extremely difficult, if not impossible, to archive in comparison with the current text and image based documents.

Digital publishing models

The nature of digital publications makes the archival task for libraries more difficult. But without adequate measures, there will be no archiving by libraries at all, and as a consequence the collective memory of science will disappear. This becomes clear if we look at the various models for digital publishing that are now beginning to emerge. These models all imply distribution directly from the creator or publisher to the end-user over the network, with no direct involvement from intermediary organisations such as libraries. The immediate consequence of this is that the distribution channel no longer has a memory function performed by organisations that have long-term archiving as their implicit (most libraries) or explicit (deposit libraries) responsibility. Consider the three publishing models described in fig. 2:

· Self-publishing, i.e. by individual authors or their parent organisations. There is no guarantee that they will have the inclination or the resources to maintain long-term availability. The archives (such as WWW and FTP-sites) they set up on the network will be subject to frequent changes and will usually have a short life-span, as is already noticeable to anybody trying to access materials put onto the Internet more than a year ago.

· Publisher archives. Many large, international publishers are now creating so-called ‘archives’ or repositories for distributing their publications in digital form. Although some now also distribute journals in digital form to libraries, they most certainly will not continue to do so. Moreover, it is clear (and some publishers have already explicitly stated this) that materials will only be available through these repositories for as long as there is sufficiently frequent demand to justify the cost of storage. After a certain period (probably 2 to 5 years) publications will be removed from the repository and will no longer be available. When a publication goes ‘out of print’ in this way, there will be no copies stored in libraries as is the case with printed publications.

· Push technology. The current publication model is based on the ‘pull’-concept: users interested in a publication go to a library or digital repository and pull the document out of the files for personal use. This is precisely the reason why (short-term) storage is required: to hold the information in a file until a user comes and asks for it. In certain areas of publishing – and perhaps in future in science publishing too – this model is being replaced by the ‘push’-concept: the user indicates the type of materials he or she is interested in, and relevant materials are immediately sent to the user when they are created or published. In this model, there is no need for a memory function anywhere in the distribution channel.
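The difference between the pull and push concepts can be made concrete with a minimal sketch. The class and method names below (PullRepository, PushChannel, publish, fetch, subscribe) are hypothetical and serve only as an illustration; they do not describe any existing system. The point is that the pull model needs a store in order to work at all, whereas the push model can deliver everything without retaining anything.

```python
from collections import defaultdict


class PullRepository:
    """Pull model: documents are held until a user asks for them."""

    def __init__(self):
        self._store = {}  # document id -> content; this is the (short-term) memory

    def publish(self, doc_id, content):
        self._store[doc_id] = content

    def fetch(self, doc_id):
        # Works at any later time, as long as the document is still stored.
        return self._store.get(doc_id)


class PushChannel:
    """Push model: documents are forwarded on publication and not retained."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of user inboxes

    def subscribe(self, topic, inbox):
        self._subscribers[topic].append(inbox)

    def publish(self, topic, content):
        # Deliver to the users subscribed right now; nothing is kept afterwards.
        for inbox in self._subscribers[topic]:
            inbox.append(content)


repo = PullRepository()
repo.publish("paper-1", "text of paper 1")
late_reader_copy = repo.fetch("paper-1")  # a later reader can still obtain it

channel = PushChannel()
early_inbox = []
channel.subscribe("physics", early_inbox)
channel.publish("physics", "text of paper 2")
late_inbox = []
channel.subscribe("physics", late_inbox)  # subscribes after publication
```

In this sketch the user who subscribes after publication receives nothing, because the channel itself holds no copy. Any memory function would have to be added to the channel explicitly.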

Solutions for digital archiving

From our analysis it becomes clear that digital archiving, i.e. maintaining accessibility of publications for future use, is a function that needs to be organised in an explicit way. It is highly unlikely that creators and publishers of digital information will be able to provide a coherent and persistent memory system. They have no commercial interest in long-term archiving, and they will not have the technical skills and funds to maintain digital collections indefinitely.

What is needed for digital archiving is a system which gives the responsibility for digital archiving to organisations which have a specific archival function, which can develop the highly specialised skills required for long-term storage and preservation, and which can guarantee global accessibility to archival materials over the network. The European approach, which can serve as a model for other geographic areas, is the system of national deposit libraries (Mackenzie Owen, J.S. & Walle, J. v.d., 1996). These have a legal responsibility for archiving print materials that is currently being extended to cover digital publications. This system could well be supplemented by other archival organisations in specific subject domains, e.g. scientific institutes and emerging virtual libraries operating on a global scale.

Digital deposit libraries could interconnect to form a comprehensive archival backbone for other libraries to provide service to users. In this way, there is no need for these other libraries to maintain their own digital collections (other than very frequently used current materials). Although the cost of digital archiving is higher than that of print archiving, this system would create enormous savings as compared to the current system. In the current system, the same publication is stored in a large number of libraries, each creating its own archival cost. In the system proposed here, only one storage location is required (or at least an extremely limited number for reasons of security and network efficiency). On a global scale the reduction in archival cost could be very large.
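The cost argument can be illustrated with a small calculation. All figures below are hypothetical and chosen only to show the structure of the comparison; they are not measurements, and the simple model (total cost equals cost per copy times number of copies) is an assumption.

```python
def total_archival_cost(cost_per_copy, copies):
    """Total yearly cost of keeping `copies` archived copies of one publication."""
    return cost_per_copy * copies


# Hypothetical figures, chosen only to illustrate the argument.
print_cost_per_copy = 1.0     # cost unit per library per year
digital_cost_per_copy = 5.0   # assumed five times higher than print
print_copies = 200            # libraries each holding the printed publication
digital_copies = 3            # a few deposit sites, for security and efficiency

print_total = total_archival_cost(print_cost_per_copy, print_copies)
digital_total = total_archival_cost(digital_cost_per_copy, digital_copies)
saving = print_total - digital_total
```

Under this simple model the digital system is cheaper whenever the ratio of digital to print cost per copy is smaller than the ratio of print copies to digital copies.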

The system of archival deposit libraries for digital materials is based on two simple principles, viz. that publishers are willing (or legally obliged) to deposit a copy of digital materials on publication, and that the deposit library is allowed to provide global access to these materials as soon as they are no longer accessible from a repository under control of the publisher.
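The two principles can be expressed as a simple access rule. The sketch below is illustrative only: the record fields, the function name and the document identifiers are hypothetical, and a real deposit system would need to deal with licences, embargo periods and partial availability.

```python
from dataclasses import dataclass


@dataclass
class DepositRecord:
    doc_id: str
    deposited: bool             # principle 1: publisher deposited a copy on publication
    publisher_accessible: bool  # still served from the publisher's own repository?


def deposit_library_may_serve(record):
    """Principle 2: global access opens once the publisher no longer offers the item."""
    return record.deposited and not record.publisher_accessible


current = DepositRecord("journal-1997-01", deposited=True, publisher_accessible=True)
withdrawn = DepositRecord("journal-1992-07", deposited=True, publisher_accessible=False)
never_deposited = DepositRecord("report-x", deposited=False, publisher_accessible=False)
```

Only the withdrawn item is served by the deposit library here. The item that was never deposited shows why the first principle matters: once the publisher removes it, no copy remains anywhere in the system.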

The future intermediary role of libraries

What does this mean for libraries in general? The ongoing move towards digital distribution of information through the global network infrastructure described at the beginning of this paper has major consequences for the traditional archival function. In the networked world a single storage location is sufficient. There is no need for the traditional ‘many copies, many libraries’ approach. In addition, publishers will not allow libraries to store digital publications because they wish to control access and maintain direct relationships with their customers, i.e. the end users. Therefore, publishers will set up digital repositories as short-term archives (possibly through outsourcing to subscription agents). However, publishers will not take on the responsibility for long-term archiving, which is expensive and requires specialised skills and infrastructure. Therefore, digital archives can only be maintained by national libraries and/or large, specialised, international, domain-based virtual libraries (Mackenzie Owen, J.S., 1996).

The large national deposit libraries are, at least in Europe, beginning to perform the long-term archival function to maintain access to digital information. It is therefore essential that they obtain a legal basis that extends their responsibilities to include digital materials. However, archiving on a national scale is not sufficient in a globally networked environment. It is also necessary that the digital deposit libraries join forces to create a globally interconnected archival system, together with specialised digital archives, e.g. for specific areas of science.

Recently there has been some discussion on the use of the word ‘(digital) archive’ for what many librarians would regard as the digital library collection. Traditionally, the distinction between libraries and archives is based on the following characteristics:

· Libraries collect items which are 'published' (either by official publishers or as grey literature by other organisations or individuals), whereas archives collect items related to 'work processes' (e.g. the work carried out by a specific organisation) and organisational entities or individuals.

· Libraries collect items in anticipation of their primary use (reading, studying); archives collect items after their primary use (the 'work process' in which they were used).

· Libraries collect items which are available in multiple copies; archives collect items which are, in the majority of cases, unique (e.g. correspondence).

· The value of library items is in their content as such; the value of archival items is not in their intrinsic content, but in what they tell us about the work process in which they were used and/or the organisation or individual by which/whom they were used. That is the reason why archival items lose their meaning if they are not stored 'in context', i.e. in relation to other items from the same process or originator.

In the traditional sense, therefore, digital archives are not archives. However, in the context of digital information the term 'archive' is acquiring a rather different meaning. It is now being used merely to refer to a storage location for digital objects. Publishers in particular tend to use the term 'digital archive' to refer to repositories of digital publications on the Internet. There seems to be a need for a term to describe these repositories. Of the various functions of a library, the storage function is becoming isolated from the rest of library services, and is indeed shifting from libraries to publishers and other organisations (cf. the Los Alamos pre-print site). In fact, it is becoming clear that digital libraries will provide many types of useful services, but will themselves not maintain digital collections (the 'storage versus access' debate). Therefore, the term 'digital archive' refers to a 'collection' (of digital documents) which is not part of a library.

For digital libraries without an explicit archival and preservation responsibility, the digital collection will be relatively unimportant. Local storage of digital materials will have the function of a short-term cache (e.g. to improve the efficiency of access to frequently requested materials), not of a long-term archive. This means that digital libraries will be able to – and have to – concentrate on their key functions: providing access to materials stored in large digital archives, handling license agreements for end-user access to copyright materials, providing a coherent set of access and delivery tools and procedures, and offering service and support to users. In addition, digital libraries could develop a role in end-user digital publishing and as an intermediary between authors and digital archives.
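The role of local storage as a short-term cache in front of a long-term archive can be sketched as follows. This is a minimal illustration under stated assumptions: the CachingLibrary class is hypothetical, the archive is represented by an in-memory dictionary rather than a networked service, and least-recently-used eviction is just one possible caching policy.

```python
from collections import OrderedDict


class CachingLibrary:
    """Digital library that keeps only a small, short-term cache of documents."""

    def __init__(self, archive, capacity):
        self._archive = archive      # stand-in for a remote long-term archive
        self._capacity = capacity    # maximum number of locally cached items
        self._cache = OrderedDict()  # least recently used item comes first
        self.archive_requests = 0

    def get(self, doc_id):
        if doc_id in self._cache:
            self._cache.move_to_end(doc_id)  # mark as recently used
            return self._cache[doc_id]
        self.archive_requests += 1
        content = self._archive[doc_id]      # fetch from the archive
        self._cache[doc_id] = content
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used item
        return content

    def cached_ids(self):
        return list(self._cache)


archive = {"a": "doc A", "b": "doc B", "c": "doc C"}
library = CachingLibrary(archive, capacity=2)
library.get("a")
library.get("b")
library.get("a")  # served from the cache, no archive request
library.get("c")  # cache is full, so "b" is evicted
```

The library in this sketch never holds more than two documents, yet every document remains obtainable because the archive, not the library, carries the preservation responsibility.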

References

Mackenzie Owen, J.S. (1996) - Preservation of digital materials for libraries. In: European research libraries co-operation; the LIBER quarterly, 6(1996)4, p. 435-451.

Mackenzie Owen, J.S. & Walle, J. v.d. (1996) - A study of issues faced by national libraries in the field of deposit collections of electronic publications: final report. - Luxembourg: European Commission.

Mackenzie Owen, J.S. and Wiercx, A. (1996) - Knowledge models for networked library services. - Luxembourg: European Commission.

Task Force (1996) - Preserving digital information: report of the Task Force on Archiving of Digital Information commissioned by the CPA and the RLG: final report and recommendations.

(This paper is a revised and expanded version of a paper for the Academia Europaea Workshop ‘The impact of electronic publishing on the academic community’, Stockholm, April 1997)