Information Research, Vol. 7 No. 4, July 2002


The Semantic Web, universalist ambition and some lessons from librarianship

Terrence A. Brooks
Information School,
University of Washington
Seattle, WA 98195, USA



Abstract
Building the semantic web encounters problems similar to those of building large bibliographic systems. The experience of librarianship in controlling large, heterogeneous collections of bibliographic data suggests that the real obstacles facing a semantic web will be logical and textual, not mechanical. Three issues are explored in this essay: development of a standard container of information, desirability of standardizing the information hosted by this standardized container, and auxiliary tools to help users find information. Value spaces are suggested as a solution, but dismissed as impracticable. The standardization necessary for the success of the Semantic Web may not be achievable in the Web environment.


A vision of shared meaning

Increasing the intelligibility of the Web is a compelling vision. Imagine how the utility of local data could be enhanced if they were meaningfully linked to data posted by strangers far away. The Web could evolve into a comprehensive meaning system, a universal encyclopedia or "world brain," as prophesied by H.G. Wells (1938). Clever programs could roam this meaning space discovering useful, unanticipated information, emulating Bachman's (1973) vision of database programmers navigating an n-dimensional database space.

The extensible markup language (XML) and its attendant technologies are the fundamental facilitators of the semantic web (Berners-Lee, 2001). XML replaces presentation markup, e.g.: <h4> My name is Terry </h4> with markup that provides a context for understanding the meaning of the data, e.g.: <name> Terry </name>. Extensible technologies facilitate an era of the distributed object where XML elements will roam the Internet as autonomous units in a sea of contextual relationships. Poynder (May 5, 2002) describes the Web populated with dictionaries of meaning that autonomous agents interrogate as they traverse cyberspace. Semantic markup could potentially be exploited in many ways; for example, disambiguating information resources and aiding information discovery in a rapidly expanding and heterogeneous Web. Problems like the following could be solved:

In addition, this markup makes it much easier to develop programs that can tackle complicated questions whose answers do not reside on a single Web page. Suppose you wish to find the Ms. Cook you met at a trade conference last year. You don't remember her first name, but you remember that she worked for one of your clients and that her son was a student at your alma mater. An intelligent search program can sift through all the pages of people whose name is "Cook" (sidestepping all the pages relating to cooks, cooking, the Cook Islands and so forth), find the ones that mention working for a company that's on your list of clients and follow links to Web pages of their children to track down if any are in school at the right place. (Berners-Lee 2001)
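
The search described in this passage can be sketched in a few lines of Python, under strong assumptions. The element names (<person>, <family>, <employer>, <child>, <school>), the identifiers, and the sample data below are all hypothetical and come from no published vocabulary; the sketch presumes that every page author has already agreed to use them, which is exactly the kind of standardization at issue in this essay.

```python
import xml.etree.ElementTree as ET

# Hypothetical semantic markup; all element names and data are assumptions
# made for illustration only.
PAGES = """
<web>
  <person id="p1">
    <name><given>Mary</given><family>Cook</family></name>
    <employer>Acme Corp</employer>
    <child ref="p3"/>
  </person>
  <person id="p2">
    <name><given>Sally</given><family>Cook</family></name>
    <employer>Other Ltd</employer>
  </person>
  <person id="p3">
    <name><given>Tom</given><family>Cook</family></name>
    <school>Example University</school>
  </person>
  <recipe><title>How to cook rice</title></recipe>
</web>
"""

def find_ms_cook(xml_text, clients, alma_mater):
    """Return given names of people named Cook who work for a client
    and have a child enrolled at the given school."""
    root = ET.fromstring(xml_text)
    # Index people by id so that <child ref="..."/> links can be followed.
    people = {p.get("id"): p for p in root.iter("person")}
    matches = []
    for person in people.values():
        # Only <family> elements are compared, so recipes that merely
        # contain the word "cook" are never considered.
        if person.findtext("name/family") != "Cook":
            continue
        if person.findtext("employer") not in clients:
            continue
        for child in person.findall("child"):
            target = people.get(child.get("ref"))
            if target is not None and target.findtext("school") == alma_mater:
                matches.append(person.findtext("name/given"))
                break
    return matches

print(find_ms_cook(PAGES, {"Acme Corp"}, "Example University"))
```

The program itself is trivial. It works only because the name, employer, and school values are predictable in both form and content across all records.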

The "Ms. Cook" retrieval problem

Finding a particular "Ms. Cook" in a semantic web is essentially an information retrieval problem, similar to the bibliographic problem of finding an author named "Ms. Cook." Librarians possess considerable experience dealing with this sort of problem. Their strategy for controlling bibliographic data can be summed up in a few words: Make the structural form of the data predictable, make the information contents hosted by this form predictable, and where information is structured arbitrarily, provide access tools to help the searcher find the difficult-to-anticipate information.

In some ways a semantic web and large bibliographic databases are similar. A semantic web is a single meaning system organizing a large collection of widely disparate information. So are large bibliographic databases. For example, the WorldCat database (sponsored by OCLC, Online Computer Library Center at http://oclc.org/home/) is a union catalog that hosts about 48 million records (as of Spring 2002) in 400 languages and indexes a heterogeneous collection of material including books, maps, films and slides, sound recordings, and so on. The WorldCat database has been called the most important database in academe (Smith, 1996).

A semantic web and large bibliographic databases also both employ expressive data structures. The Machine Readable Cataloging (MARC 21) record provides each field and subfield with a semantically significant field number or code. Usage conventions define exactly what sort of data can be placed in each field and subfield. One can distinguish, consequently, "John F. Kennedy" as the author of a work, the subject of a work, a person named in the work, and so on. XML also permits the definition of element names that express usage aspects of a personal name; for example, one can create tags such as <author>, <subject> or <named_person>.
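
As a minimal sketch of how such role-bearing tags separate the uses of one name, the Python fragment below retrieves titles by the role a name plays in each record. The <catalog> and <work> wrappers and the helper titles_by_role are hypothetical, and the two records are illustrative sample data rather than actual catalog entries.

```python
import xml.etree.ElementTree as ET

# Illustrative records; the record layout is an assumption, not MARC 21.
CATALOG = """
<catalog>
  <work>
    <title>Profiles in Courage</title>
    <author>John F. Kennedy</author>
  </work>
  <work>
    <title>A Thousand Days</title>
    <author>Arthur M. Schlesinger, Jr.</author>
    <subject>John F. Kennedy</subject>
  </work>
</catalog>
"""

def titles_by_role(xml_text, name, role):
    """Return titles of works in which `name` appears in a `role` element."""
    root = ET.fromstring(xml_text)
    return [
        work.findtext("title")
        for work in root.iter("work")
        if any(el.text == name for el in work.findall(role))
    ]

print(titles_by_role(CATALOG, "John F. Kennedy", "author"))
print(titles_by_role(CATALOG, "John F. Kennedy", "subject"))
```

The same string yields different results depending on the element that contains it, but only if the name is entered identically in every record.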

There are, of course, great systematic differences between the Web and large bibliographic databases. The Web is orders of magnitude larger. It is growing faster. The origins of Web pages are not a few cooperating agencies. Web pages do not reflect a single, well-groomed record structure. Web pages do not benefit from coordinated activity distinguishing material written by "Ms. Cook" from material describing "Ms. Cook." Furthermore, Web pages have no coordinated activity distinguishing "Mary Cook" from "Sally Cook," or even this "Mary Cook" from that "Mary Cook." This problem is commonly encountered when one uses a Web tool to search for "Mary Cook," receives hundreds of thousands of Web pages in return, and discovers that the vast majority are irrelevant.

Librarians have been struggling with these problems for decades. It is possible that their practical experience dealing with bibliographic data could be profitably applied to the semantic web proposal, especially if an exemplary semantic web activity were searching for a certain "Ms. Cook" in a heterogeneous, rapidly growing and decentralized Web.

Principal elements of a bibliographic system

The basic strategy for constructing a bibliographic database system is standardizing the container of the information, structuring the information contents within this container, and then building ancillary tools that help users find information whose form they cannot anticipate.

The following are example methodologies and technologies: