Is there a role for traditional knowledge organization systems in the Digital Age?

Claudio Gnoli
University of Pavia, Italy
Department of Mathematics. Library

Many information seekers — and sometimes librarians themselves — abandoned card catalogs and rushed to embrace the reach and simplicity of full-text search when Altavista, Google, and other search engines burst onto the scene. But the value of library classification systems and indexing of content by subjects did not suddenly expire. In fact, developments in library and information science in the middle of the 20th century show promise for overcoming the exasperating limitations of full-text search. Among those innovations, a model for knowledge organization called “faceted classification” is a quite natural fit for the computerized information environment.

A brief history of knowledge organization

Organizing knowledge is a very old need: Aristotle and his disciples, for example, were great systematizers of the domains of knowledge of their time. When modern science and technology began to emerge in the 17th century, such brilliant thinkers as Francis Bacon and the editors of the Encyclopédie pondered ways to organize growing bodies of knowledge and their disciplines. The French editors of the Encyclopédie, seeking a way to manage a huge work presented in alphabetical sorting, conceived an impressive and sophisticated system of internal links.

Since the end of the 19th century, formal systems to organize knowledge have been developed by librarians [1], who had to manage large collections of books that might be on any subject. As libraries grew in size, the need for systems to organize those growing collections became a compelling practical issue. A single librarian could no longer remember all the books in the collection and where each of them was located. The largest library in the world, the Library of Congress in Washington, developed systems to index its documents by subject (the Library of Congress Subject Headings) and to shelve them by disciplines (the Library of Congress Classification). Because LC catalog cards (and later automated catalog records) were shared with many other libraries and used as a reference source, LC systems have been adopted widely, both in the United States and in other countries.

Another pioneer in library classification, Melvil Dewey, librarian at Amherst College, devised a clever way to specify book subjects by using only the ten digits. Subjects were specified more precisely by adding digits to the right of a shelving number. The Dewey Decimal Classification, which was created more than a century ago, has become the most widely used system to classify books around the world. It is continuously updated to reflect the emergence and development of disciplines; for example, recent editions had to revise the numbering for computer science substantially.

The shelving problem of traditional library classification

Books on closely related subjects can be assigned similar Dewey numbers, making it possible to place them near each other on library shelves and making it easier for information seekers to find books on those topics when looking at those shelves. This is very useful, but it doesn't solve all the problems of locating information on a subject. Suppose you have to determine a shelf location for books on the following subjects:

  A. New businesses in the United States
  B. New businesses in the 18th century
  C. The USA in the 18th century

If you give A and B a similar number, with time and place as secondary specifications, then books A and B can be arranged close to each other on the shelves, while book C might be on a distant shelf — for example, a shelf devoted to geography. But what if you need to find all books about life in the 18th century? Clearly, the Dewey system can group subjects according to only one dimension, because shelves are linear objects, not cubic or four-dimensional.

So librarians developed standard ways to express semantic relations that overcame some of the physical limitations of shelving. For example, you might find a catalog card that provides the following subject information:

18th century

Faceted classification — a fundamental change in knowledge organization

While such cross references overcame some of the limitations of hierarchical classification systems, enabling people to find books on a subject even when those books were classified under a different subject, it took another deep insight in the library and information science community to produce a truly flexible and expressive way of classifying and retrieving information. Beginning in the 1930s, the Indian mathematician and librarian S. R. Ranganathan introduced the idea of decomposing complex subjects into different facets.

Book A above would have two facets: New businesses and United States. Each facet belongs to a general category — for example, Action, Agent, Tool, Place, Time, etc. Ranganathan and his successors in the London-based Classification Research Group explored in depth the most helpful ways to sort faceted subjects and developed a coherent theory with advanced principles, including

These are complex but important questions, which will be addressed in coming issues of BRAKOR in more detail. The most advanced general system applying research in faceted classification is the Bliss Bibliographic Classification 2nd edition (BC2), which is currently being published in England.
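
As a rough illustration of the idea, the three example books from the shelving problem can be recorded as sets of facets rather than as single shelf numbers. The sketch below is only an assumption about how that might look: the category names (Topic, Place, Time), the values, and the find function are hypothetical and do not come from Ranganathan's or any other published scheme.

```python
# Hypothetical facet categories and values, chosen only to mirror the
# three example books; they are not taken from any published scheme.
books = {
    "A": {"Topic": "new businesses", "Place": "United States"},
    "B": {"Topic": "new businesses", "Time": "18th century"},
    "C": {"Place": "United States", "Time": "18th century"},
}


def find(collection, **wanted):
    """Return the ids of the items whose facets match every requested value."""
    return sorted(
        item_id
        for item_id, facets in collection.items()
        if all(facets.get(category) == value for category, value in wanted.items())
    )


# Each facet can serve as an entry point, which a single shelf order cannot do.
print(find(books, Time="18th century"))      # ['B', 'C']
print(find(books, Topic="new businesses"))   # ['A', 'B']
print(find(books, Place="United States", Time="18th century"))  # ['C']
```

Because every facet is stored separately, the question about the 18th century is answered as easily as the question about new businesses, whichever facet would have decided the shelf position.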

Knowledge organization expands into our work lives

Books represent only a small percentage of the documents used by people in everyday work today. In your office, you have not only books but computer files, databases, internal communications, notes of various kinds, and many other forms of communication. You might once have searched for information about a topic in the Encyclopedia Britannica, but today you will probably search, with a quick Google query, in the biggest “encyclopedia” ever available — the Internet.

So why should we care so much about all those specialized methods recommended by librarians? Many people, indeed, think that library science belongs to an obsolete world of dark rooms filled with shelves and that it has nothing to do with the real needs of contemporary life. Differing jargons contribute to the problem. Librarians, computer scientists, and managers once lived in quite separate communities, each with its own terminology and often unaware that they addressed problems that looked different but were, in fact, substantially similar.

On the other hand, people in business and computing environments have coined library metaphors for such services as “digital libraries,” “virtual libraries,” and “virtual reference desks” — sensing, perhaps subconsciously, the affinity of some modern technologies with traditional library concepts.

So couldn't a digital library be organized efficiently by applying a (digital) library classification system?

Knowledge organization and computers — early awareness in the LIS community

The question is not as new as it might seem. Researchers in faceted classification theory in the 1950s were already cognizant of the coming computer revolution. They could guess the importance of computers in future knowledge organization. They realized that computers would be a complement to classification theory, not an alternative.

In fact, the important difference lies in the physical forms in which documents are recorded and the technological means to access them. But documents — no matter whether they are Web sites, computer files, books, manuscripts, engraved stones, or whatever — still have content. And if you want to find them when you need them, you have to analyze the subjects they contain.

The personal computer did not become part of our daily work lives until the 1980s, but Brian Vickery, one of the outstanding researchers in library classification, wrote some forty years ago:

Various mechanized systems are now available, such as hand-sorted or machine-sorted punched cards. […] Mechanized systems only alter the mechanics of retrieval, the physical operations by which a search is effected. They do not alter the basic problems of subject analysis. The structure of a subject field, as laid bare by facet analysis, remains the same, and the same classification schedule can be adapted to either card cataloguing or mechanized searching. [2]

At the same time, new technologies have provided us with new forms of indexing and searching. Full-text search makes all the words in a document searchable. That was simply not possible with printed documents. We can view this as an additional form of semantic indexing, the simplest and most basic possible form.

However, a single word taken out of a text can be misleading without its context, as anyone who has used search engines knows. So brute-force searches of words in full texts don't solve all problems of retrieval. This is why selected keywords, abstracts, subject headings, thesauri, and classification schemes are necessary. Advances in full-text search are themselves helping to bridge that divide. For example, it is possible, given a collection of reference documents that are already classified, to state that a new document belongs to a given class on the basis of automatic analysis of the distribution of words in it. Other uses of such advances include, for example, defining word-distribution patterns for “relevance” and “non-relevance” and applying those patterns to spam-filtering or attributing authorship to anonymous documents.[3]
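
The classification of a new document from the distribution of its words can be sketched in a few lines. The article does not name a particular method, so the example below swaps in one common technique, a naive Bayes classifier with add-one smoothing, purely for illustration. The class names, the tiny reference collection, and the function names are all assumptions made up for this sketch.

```python
import math
from collections import Counter


def tokenize(text):
    """Split a text into lower-case words (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(labelled_docs):
    """Count word occurrences per class from (class, text) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for label, text in labelled_docs:
        word_counts.setdefault(label, Counter()).update(tokenize(text))
        doc_counts[label] += 1
    return word_counts, doc_counts


def rank_classes(text, word_counts, doc_counts):
    """Return the classes ordered from most to least probable for the text."""
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical, already classified reference documents.
reference = [
    ("mathematics", "algebra geometry theorem proof number"),
    ("mathematics", "equation proof number set"),
    ("commerce", "business market trade company profit"),
    ("commerce", "company trade export market"),
]
model = train(reference)
print(rank_classes("a new proof of an old theorem", *model))
```

The function returns a ranked list rather than a single answer, so the output can be read as a set of suggestions for a human indexer instead of a final decision.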

The success of the Google search engine shows that automatic analysis of links in a complex hypertextual environment (which is what basically makes Google better than its rivals) can achieve very precise results. Interestingly, the underlying principle of that process, citation analysis, was developed some decades ago by bibliographic services, to assess the impact of papers in the scientific literature.
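
The principle shared by link analysis and citation analysis, that a document gains weight when other weighty documents point to it, can be sketched as a simple iterative computation. What follows is a simplified PageRank-style calculation written only for illustration; it is not a description of how any search engine or citation index actually ranks documents, and the link graph, the damping value, and the function name are assumptions.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Score pages so that links from high-scoring pages count for more."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    score = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small base score regardless of its links.
        new = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its score equally among the pages it cites.
                share = damping * score[page] / len(targets)
                for target in targets:
                    new[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * score[page] / n
                for other in pages:
                    new[other] += share
        score = new
    return score


# Hypothetical link (or citation) graph: page -> pages it points to.
links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
scores = link_scores(links)
print(max(scores, key=scores.get))  # c
```

In this toy graph the page cited by all the others ends up with the highest score, and the one page it cites in turn inherits most of that weight.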

Lessons from library science for the Digital Era

The previous examples show that the world of books can still teach useful lessons in the Digital Era. Sometimes the experience of the library science community is even more useful in the computer environment than in the world of paper documents. In fact, computers exist to manage and process labeled data, so they are the perfect ally of knowledge-organization systems.

For example, computers can make the work of indexers faster by suggesting to them a list of possible categories, to which the document seems to be similar on the basis of the words it contains. A human indexer is still needed to judge the actual relevance of the suggested classes and to apply principles of good indexing, but their work is made lighter by the help of the machine.

Ranganathan himself was already aware of the possibilities. Indeed, in the last edition of his foundational book [4] he wrote:

World War II ushered in the Electronic Age into the world. The Computer is one of the versatile forms in which Electronics can help mankind. We must accept the computer. We must derive all the benefits it is capable of giving us. […]

Apart from this, an important question is, “Can classificationists and classifiers abdicate their function and depend on the computer to taking their place in the chain of work involved in the rendering of library service to the satisfaction of all the Five Laws of Library Science?” They cannot. Because, classification involves judgement — judgement of the subject of the document in all its facets and arrays manifest in it. This cannot be done by the statistical analysis of the words in documents, which alone the machine can do.

Thinking that “automagical” procedures make human indexing unnecessary is a short-sighted view. It is just as short-sighted as the opposite view, namely that the value of traditional indexing techniques is limited to an elitist and isolated library environment and has nothing to do with the new chaos of electronic documents. Clearly, to obtain both powerful and effective organization of large amounts of information in any form, we need a synthesis between library classification perspectives and computer-based indexing and retrieval.

Online directories and advanced techniques from LIS

Popular Internet directories, including Yahoo! and DMOZ (which Google incorporates), have built their own semantic schemas to organize their information. Although editors of those directories claim they derived inspiration from such library classifications as Dewey, they don't really exploit the advanced techniques available from recent research — in particular, faceted classification.

But some online directories that have included librarians on their staffs do make more direct use of library knowledge organization systems:

Although these examples do use knowledge organization systems coming from the library world, unfortunately they have not chosen the most advanced systems. Popularity of the systems used and connection with other systems are their main advantages.

However, it seems that it is now time to explore better ways to apply the heritage of modern classification techniques to the contemporary requirements of information storage and retrieval, and faceted classification should clearly be considered as one of the more important advanced systems.

Recent implementations of faceted classification

In recent years, several web sites have experimented with principles of faceted classification to allow retrieval from their databases. For examples, see the “Example Web Sites” section of William Denton's “Putting Facets on the Web: An Annotated Bibliography” at <http://www.miskatonic.org/library/facet-biblio.html>. Information architects are working to build search interfaces that are as usable and effective as possible, and the faceted approach is getting popular for such purposes.

This does not mean that the full power of faceted classification is exploited yet. The indexed databases often have simple structures and limited size. So, for example, no attention is usually paid to the citation order of facets (that is, whether to express Objects, or Actions, or Properties first and why) or to the details of sorting in browsing mode (for example, what to do with subjects in which one or more facets are not expressed). However, these features become important when one has to deal with a large corpus of documents, as was the case for the libraries for which Ranganathan and the CRG conceived their sophisticated systems. Big organizations, too, may generate knowledge in many forms, with complex relationships among them, so an adequately refined system could be a very fruitful investment for them.
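
To make these two questions concrete, here is a minimal sketch of a browsing order built from a fixed citation order. Both choices in it are assumptions made for illustration: the particular order of the facet categories, and the rule that a subject with a missing facet files before subjects in which that facet is expressed, so that general works precede more specific ones. The records and the function name are hypothetical.

```python
# Hypothetical citation order: the sequence in which facets are compared.
CITATION_ORDER = ("Object", "Action", "Place", "Time")


def shelf_key(facets):
    """Build a sort key that follows the citation order.

    A missing facet sorts before any expressed value, so a general work
    files ahead of the more specific works on the same subject.
    """
    return tuple(
        (0, "") if facets.get(category) is None else (1, facets[category])
        for category in CITATION_ORDER
    )


# Hypothetical indexed records, given in no particular order.
records = [
    {"Object": "businesses", "Action": "founding", "Place": "United States"},
    {"Object": "businesses"},
    {"Object": "businesses", "Action": "founding"},
    {"Object": "businesses", "Place": "Italy"},
]
for record in sorted(records, key=shelf_key):
    print(record)
```

Changing the order of the names in CITATION_ORDER changes which subjects end up next to each other in the browsing sequence, which is why the choice matters once the collection is large.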

Some useful suggestions can come from current projects to apply faceted indexing in its full potential to digital documents. One such project is being developed at University College London, under the name of FATKS: Facet Analytical Theory in Knowledge Systems for Humanities <http://www.ucl.ac.uk/fatks/> [6]. The staff of FATKS includes renowned experts in library classification theory and even members of the original Classification Research Group. A fully faceted classification system for humanities is being developed, called FATHUM, merging the faceted structure of such systems as BC2 with an expressive notation (which BC2 lacks), so that individual facets can be automatically searched and processed in clever ways. Another notable project with deep roots in classical faceted subject indexing is being developed in Thailand for application to Web documents [7].

The worlds of librarians and computer scientists, unfortunately, have barely been integrated until recently. The result has been an unnecessary proliferation of new terminology for useful concepts and techniques already developed decades ago in serious research in the library and information science community. One recent case is Giovanni Sacco's invention of “dynamic taxonomies” — which is, in fact, an excellent implementation of the principles of faceted classification. This situation can result in confusion and difficulty in finding one's way inside the developing world of advanced knowledge organization and retrieval. We need greater awareness of previous research and a well-grounded shared terminology in order to avoid wasting the time and resources of both researchers and managers interested in choosing the most suitable systems for their purposes.


References

[1] The subject approach to information: 5th ed / A C Foskett — Library association: London: 1996

[2] Faceted classification: a guide to the construction and use of special schemes: 4th ed / Brian C Vickery — Aslib: London: 1960

[3] A tutorial on automated text categorization / Fabrizio Sebastiani = ASAI 99: 1' Argentinian symposium on artificial intelligence: Buenos Aires: 1999: proceedings. p 7-35 / Analia Anandi, Alejandro Zunino: eds || <http://faure.iei.pi.cnr.it/~fabrizio/Publications/ASAI99.pdf>

[4] Prolegomena to library classification. Chapter XA / S R Ranganathan — Sarada Ranganathan endowment for library science: Bangalore (India): 1967

[5] Evaluating Dewey concepts as a knowledge base for automatic subject assignment / Roger Thompson, Keith Shafer, Diane Vizine-Goetz — OCLC <http://orc.rsch.oclc.org:6109/eval_dc.html>: 1997

[6] Faceted classification as a basis for knowledge organization in a digital environment: the Bliss Bibliographic Classification as a model for vocabulary management and the creation of multidimensional knowledge structures / Vanda Broughton = The new review of hypermedia and multimedia. 7: 2001. p 67-102

[7] Faceted indexing based system for organizing and accessing Internet resources / Francis J Devadason, Neelawat Intaraksa, Ponprapa Patamawongjariya, Kavita Desai = Knowledge organization. 29: 2002. 2. p 65-77