Organising for digital archiving: new distribution models in the scientific information chain
John Mackenzie Owen
TICER / Tilburg University
Digital archiving is a
concept that has little to do with archives in the conventional sense. It is
most often used to refer to the role of libraries for the long-term storage and
preservation of published information in digital form. The subject of this
paper is therefore how libraries should organise the archiving function in a
way that will ensure the availability of scientific publications for future generations.
The dominant factor in
the current development of libraries is the ongoing move towards digital
distribution of information through the global network infrastructure. This is
creating a shift from the traditional role of the library as a ‘clearing house’
and warehouse for printed publications to a role as a supplier of networked
services for digital information resources. The library of the future can be
characterised as follows (Mackenzie Owen, J.S. and
Wiercx, A., 1996):
· Services will be based on digital, networked information resources;
· User interaction with the library will be from the desktop (distance access)
instead of by physically visiting the library (on-site access);
· Emphasis will be on access to networked resources instead of on storing
materials in the library;
· The traditional library catalogue will evolve into a networked resource
discovery mechanism;
· Bibliographic data included in library systems will be extended to include
non-document resources (e.g. persons, organisations, datasets etc.);
· New organisational models and distributed functions will arise, based on
co-operation and domain-based services.
Libraries belong to
the so-called ‘memory organisations’, together with archives and museums. This
reflects the fact that the global library system acts as the collective memory
of the world’s cultural and scientific heritage as recorded in the printed
word. In the world of printed publications only libraries perform a memory function
which guarantees to a certain extent that publications are not lost after
immediate use (fig. 1).
The long-term storage
of publications is more a (fortunate) outcome of librarians' reluctance to discard
infrequently used publications than the outcome of sound management. In fact,
very few libraries have an explicitly stated responsibility with respect to
long-term storage and preservation. A legal responsibility for this exists only
for the national libraries of Europe through deposit legislation.
However, the memory
function in the publication chain is far from perfect. It is highly selective and random. What is preserved for
future generations depends on a large number of decisions, and any publication
stands a chance of being lost from the collective memory. To give a few
examples:
· What enters into the library system depends on what authors and publishers
decide to publish. There are many examples of cultural and scientific works
which are inaccessible because they have remained unpublished;
· All libraries have an acquisitions policy which determines which publications
enter the library collection. There is no system which guarantees that each
publication will be acquired by at least one library, with the exception of
publications in countries with a well-organised legal deposit system;
· Libraries do not always store indefinitely all publications they acquire.
Although there is a natural tendency amongst libraries not to discard items
from the collection, this sometimes is necessary, e.g. for economic reasons;
· Publications stored in libraries are sometimes lost due to media
deterioration, either caused by inadequate storage conditions or other factors
such as the chemical self-destruction of publications printed on chlorine paper;
· Many kinds of disasters, such as fire or flooding, can lead to loss of
publications;
· Finally, political factors and censorship frequently prevent publications
from being acquired by libraries, or lead to their removal from the collection.
However, in the
printed world preservation of the intellectual record is enhanced by large
print-runs, which means that publications are usually produced in at least
hundreds of copies and are acquired by many different libraries, often distributed
over the entire world. The chance that at least a single copy of a publication
is preserved once it has been published is usually quite large.
In the world of
digital publications the collective memory is as selective and random as it is
in the world of print. In fact, the situation is far worse when we consider the
following:
· Digital publications are produced and archived in a far smaller number of
copies – in most cases only a single copy is made available and stored on the
network;
· The cost per access for digital archiving is higher than that of print
archiving; since budgets are limited, fewer copies will eventually be archived;
· Digital materials periodically need to be migrated to new storage media, data
formats and system environments; the future cost of migration is uncertain and
it is likely that many digital archives will not be maintained in a way which
guarantees that all materials will remain accessible;
· In general there is a lack of understanding of digital archiving issues,
which at least initially could lead to data loss;
· Libraries are still focused on print publications, and tend to neglect their
memory function for digital publications;
· Finally, the dynamic, interactive, distributed document types which are now
emerging are extremely difficult, if not impossible, to archive in comparison
with the current text and image based documents.
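The contrast between large print runs and single digital copies can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that every copy is lost independently and with the same probability; neither that assumption nor the numbers used come from the analysis above, and the function name is hypothetical.

```python
def survival_probability(copies: int, loss_probability: float) -> float:
    """Chance that at least one of `copies` copies survives.

    Illustrative assumption: each copy is lost independently with the
    same probability `loss_probability`.
    """
    if copies < 0:
        raise ValueError("copies must be non-negative")
    if not 0.0 <= loss_probability <= 1.0:
        raise ValueError("loss_probability must lie between 0 and 1")
    # All copies are lost with probability loss_probability ** copies.
    return 1.0 - loss_probability ** copies


# Hypothetical figures: a print run of 300 copies versus one networked copy,
# each copy having a 90% chance of being lost over the period considered.
print_survival = survival_probability(copies=300, loss_probability=0.9)
digital_survival = survival_probability(copies=1, loss_probability=0.9)
print(print_survival, digital_survival)
```

Under these assumed figures the printed work is almost certain to survive somewhere, while the single digital copy survives only if that one copy does. Real losses are not independent (a policy change or a format becoming obsolete can affect many copies at once), so this should be read as an illustration of the argument, not as an estimate.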
The nature of digital publications
makes the archival task for libraries more difficult. But without adequate
measures, there will be no archiving by libraries at all, and as a consequence
the collective memory of science will disappear. This becomes clear if we look
at the various models for digital publishing that are now beginning to emerge.
These models all imply distribution directly from the creator or publisher to
the end-user over the network, with no direct involvement from intermediary
organisations such as libraries. The immediate consequence of this is that the
distribution channel no longer has a memory function performed by organisations
that have long-term archiving as their implicit (most libraries) or explicit
(deposit libraries) responsibility. Consider the three publishing models
described in fig. 2:
· Self-publishing, i.e. by individual authors or their parent organisations.
There is no guarantee that they will have the inclination or the resources to
maintain long-term availability. The archives (such as WWW and FTP-sites) they
set up on the network will be subject to frequent changes and will usually have
a short life-span, as is already noticeable to anybody trying to access
materials put onto the Internet more than a year ago.
· Publisher archives. Many large, international scientific publishers are now
creating so-called ‘archives’ or repositories for distributing their
publications in digital form. Although some now also distribute journals in
digital form to libraries, they most certainly will not continue to do so.
Moreover, it is clear (and some publishers have already explicitly stated this)
that materials will only be available through these repositories for as long as
there is sufficiently frequent demand to justify the cost of storage. After a
certain period (probably 2 to 5 years) publications will be removed from the
repository and will no longer be available. When a publication goes ‘out of
print’ in this way, there will be no copies stored in libraries as is the case
with printed publications.
· Push technology. The current publication model is based on the
‘pull’-concept: users interested in a publication go to a library or digital
repository and pull the document out of the files for personal use. This is
precisely the reason why (short term) storage is required: to hold the
information in a file until a user comes and asks for it. In certain areas of
publishing – and perhaps in future in science publishing too – this model is
being replaced by the ‘push’-concept: the user indicates the type of materials
he or she is interested in, and relevant materials are immediately sent to the
user when they are created or published. In this model, there is no need for a
memory function anywhere in the distribution channel.
From our analysis it
becomes clear that digital archiving, i.e. maintaining accessibility of
scientific publications for future use, is a function that needs to be
organised in an explicit way. It is highly unlikely that creators and publishers
of digital information will be able to provide a coherent and persistent memory
system. They have no commercial interest in long-term archiving, and they will
not have the technical skills and funds to maintain digital collections
indefinitely. The idea put forward in the United States by the Task Force on
Archiving of Digital Information of the CPA and the RLG (Task Force, 1996) that
the creator of digital information should be responsible for long-term
archiving is therefore potentially dangerous, since it could prevent other and
better solutions from being developed.
What is needed for
digital archiving is a system which gives the responsibility for digital
archiving to organisations which have a specific archival function, which can
develop the highly specialised skills required for long-term storage and
preservation, and which can guarantee global accessibility to archival
materials over the network. The European approach, which can serve as a model
for other geographic areas, is the system of national deposit libraries
(Mackenzie Owen, J.S. & Walle, J. v.d., 1996). These have a legal
responsibility for archiving print materials which is currently being extended
to cover digital publications. This system could well be supplemented by other
archival organisations in specific subject domains, e.g. scientific institutes
and emerging virtual libraries operating on a global scale.
Digital deposit
libraries could interlink to form a comprehensive archival backbone for other
libraries to provide service to users. In this way, there is no need for these
other libraries to maintain their own digital collections (other than very
frequently used current materials). Although the cost of digital archiving is
higher than that of print archiving, this system would create enormous savings
as compared to the current system. In the current system, the same publication
is stored in a large number of libraries, each creating its own archival cost.
In the system proposed here, only one storage location is required (or at least
an extremely limited number for reasons of security and network efficiency). On
a global scale the reduction in archival cost could be very large.
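The cost argument above can be illustrated with simple arithmetic. The sketch below is a minimal example under stated assumptions: the unit costs and the numbers of storage locations are hypothetical values chosen only to show the shape of the comparison, and the function name is not taken from any existing system.

```python
def total_archival_cost(cost_per_copy: float, locations: int) -> float:
    """Total cost of archiving one publication across all storage locations."""
    return cost_per_copy * locations


# Hypothetical figures: a printed publication held by 200 libraries at unit
# cost 1.0 each, versus a digital publication that costs ten times as much
# per copy but is held at only 3 locations (for security and efficiency).
print_total = total_archival_cost(cost_per_copy=1.0, locations=200)
digital_total = total_archival_cost(cost_per_copy=10.0, locations=3)

# Fraction of the print-era cost that is saved under these assumptions.
saving = 1 - digital_total / print_total
print(print_total, digital_total, saving)
```

With these assumed values the digital system remains cheaper overall even though each archived copy costs far more, because so few copies are kept. The conclusion depends entirely on the ratio between the per-copy cost increase and the reduction in the number of storage locations.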
The system of archival
deposit libraries for digital materials is based on two simple principles, viz.
that publishers are willing (or legally obliged) to deposit a copy of digital
materials on publication, and that the deposit library is allowed to provide
global access to these materials as soon as they are no longer accessible from
a repository under control of the publisher.
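The two principles can be read as a simple access rule. The following is a minimal sketch of that rule, not a description of any existing deposit system; the class, field and function names are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Publication:
    title: str
    # Principle 1: the publisher deposited a copy on publication.
    deposited: bool
    # Whether the work is still accessible from a repository under
    # control of the publisher.
    in_publisher_repository: bool


def deposit_library_may_serve(pub: Publication) -> bool:
    """Principle 2: the deposit library provides global access only once the
    publisher's own repository no longer makes the work accessible."""
    return pub.deposited and not pub.in_publisher_repository


current = Publication("Recent article", deposited=True, in_publisher_repository=True)
withdrawn = Publication("Older article", deposited=True, in_publisher_repository=False)
print(deposit_library_may_serve(current), deposit_library_may_serve(withdrawn))
```

In this reading the deposit library never competes with the publisher while a work is still offered commercially, and it becomes the point of access as soon as the publisher withdraws the work. A work that was never deposited cannot be served at all, which is why the first principle matters.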
What does this mean
for libraries in general? The ongoing move towards digital distribution of
information through the global network infrastructure described at the
beginning of this paper has major consequences for the traditional archival
function. In the networked world a single location is sufficient. There is no
need for the traditional ‘many copies, many libraries’ approach. In addition,
publishers will not allow libraries to store digital publications because they
wish to control access and maintain direct relationships with their customers,
i.e. the end user. Therefore, publishers will set up digital repositories as
short-term archives (possibly through outsourcing to subscription agents).
However, publishers will not take on the responsibility for long-term
archiving. But long-term digital archiving is expensive and requires
specialised skills and infrastructure. Therefore, digital archives can only be
maintained by national libraries and/or large, specialised, international,
domain-based virtual libraries (Mackenzie Owen, J.S., 1996).
The large national
deposit libraries are, at least in Europe, best equipped to perform the
long-term archival function to maintain access to the literature of science. It
is therefore essential that they obtain the legal basis which extends their
responsibilities to include digital materials. However, archiving on a national
scale is not sufficient in a globally networked environment. It is therefore
also necessary that the digital deposit libraries join forces to create a
globally interlinked archival system. That will be the future memory of
science.
Mackenzie Owen, J.S. (1996) - Preservation of digital materials for libraries. In: European research libraries co-operation; the LIBER quarterly, 6(1996)4, p. 435-451.
Mackenzie Owen, J.S. & Walle, J. v.d. (1996) - A study of issues faced by national libraries in the field of deposit collections of electronic publications: final report. - Luxembourg: European Commission.
Mackenzie Owen, J.S. and Wiercx, A. (1996) - Knowledge models for networked library services. - Luxembourg: European Commission.
Task Force (1996) - Preserving digital information: report of the Task Force on Archiving of Digital Information commissioned by the CPA and the RLG: final report and recommendations.