Information Research, Vol. 7 No. 4, July 2002


The Semantic Web: opportunities and challenges for next-generation Web applications

Shiyong Lu, Ming Dong and Farshad Fotouhi
Department of Computer Science
Wayne State University, Detroit, MI 48202



Abstract
Recently there has been growing interest in the investigation and development of the next-generation web - the Semantic Web. While most current forms of web content are designed to be presented to humans and are barely understandable by computers, the content of the Semantic Web is structured in a semantic way so that it is meaningful to computers as well as to humans. In this paper, we report a survey of recent research on the Semantic Web. In particular, we present the opportunities that this revolution will bring: web-services, agent-based distributed computing, semantics-based web search engines, and semantics-based digital libraries. We also discuss the technical and cultural challenges of realizing the Semantic Web: the development of ontologies, formal semantics of Semantic Web languages, and trust and proof models. We hope that this will shed some light on the direction of future work in this field.


Introduction

The Internet and the World Wide Web have brought a revolution to information technology and to the daily lives of most people. However, most current forms of web content are designed and structured for use by people and are barely understandable by computers. The goal of the Semantic Web, as envisioned by Berners-Lee (1998), is to develop expressive languages that describe information in forms understandable by machines.

XML (Extensible Markup Language, 1998) has brought great features and promising prospects to the development of the Semantic Web. Currently, there are numerous techniques and tools available for XML, e.g., SAX (Simple API for XML), DOM (Document Object Model, 1998), XSL (Extensible Stylesheet Language), XSLT (XSL Transformation), XPath, XLink, and XPointer, and XML parsers are available in different languages and for different platforms. Using XML, one can describe document types for various domains and purposes. For example, XML documents may represent multi-media presentations (Hoschka, 1998), and business transactions (XML/EDI-Group). Applications can access XML documents via standard interfaces like SAX and DOM. A number of XML query languages have been proposed, including XML-QL (Deutsch, et al., 1998), X-QL (Robie, 1999), Lorel (Goldman, et al., 1999), and XQuery (Chamberlin, et al., 2001). Furthermore, some researchers propose that these query languages should be extended with an update capability so that an XML document repository becomes an XML database (Tatarinov, et al., 2001).

XML will continue to play an important role in the development of the Semantic Web. However, it does not provide a full solution to the requirements of the Semantic Web. XML can represent only some semantic properties through its syntactic structure, i.e., through the nesting or sequential ordering of elements (XML tags). XML queries need to be aware of this syntactic structure via the document type that is defined by a DTD (Document Type Definition). Although one might derive some sort of semantics from the structure of the documents within the context of the document type, the semantics of each element (XML tag) is not defined, and its interpretation relies entirely on implicit knowledge hardcoded in application programs. To develop a Web with semantics, resources on the Web need to be represented in, or annotated with, structured machine-understandable descriptions of their contents and relationships, using vocabularies and constructs that have been explicitly and formally defined with a domain ontology.
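The dependence of XML queries on syntactic structure can be illustrated with a minimal Python sketch. The two document fragments, their element names, and the helper `author_of` are hypothetical and chosen only for illustration; both fragments record the same fact, but a query written for one structure finds nothing in the other.

```python
import xml.etree.ElementTree as ET

# Two hypothetical fragments that record the same fact with different nesting.
DOC_A = "<book><title>Semantic Web Primer</title><author>Lee</author></book>"
DOC_B = ("<author><name>Lee</name>"
         "<wrote><title>Semantic Web Primer</title></wrote></author>")


def author_of(xml_text, path):
    """Return the text found at `path`, or None if the structure does not match."""
    node = ET.fromstring(xml_text).find(path)
    return node.text if node is not None else None


# The path "author" encodes knowledge of document A's structure ...
print(author_of(DOC_A, "author"))  # Lee
# ... and finds nothing in document B, although B states the same fact.
print(author_of(DOC_B, "author"))  # None
```

Nothing in either fragment tells a program that `author` in one and `name` in the other denote the same concept; that knowledge lives in the application code, which is the gap a domain ontology is meant to close.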

The most widely accepted definition of ontology seems to be the following one by Gruber (1993): an ontology is a "formal specification of a conceptualization", and is shared within a specific domain. The world view that an ontology embodies is usually conceived as a hierarchical description of a set of concepts (is-a hierarchy), a set of properties and their relationships, and a set of inference rules. Berners-Lee (1998) outlined the architecture of the Semantic Web in the following three layers:

  1. The metadata layer. The data model at this layer contains just the concepts of resource and properties. Currently, the RDF (Resource Description Framework) (Lassila & Swick, 1999) is believed to be the most popular data model for the metadata layer.
  2. The schema layer. Web ontology languages are introduced at this layer to define a hierarchical description of concepts (is-a hierarchy) and properties. Currently, RDFS (RDF Schema) (Brickley & Guha, 2002) is considered as a candidate schema layer language.
  3. The logical layer. More powerful web ontology languages are introduced at this layer. These languages provide a richer set of modeling primitives that can be mapped to the well-known expressive Description Logics (1999). Currently, OIL (Ontology Inference Layer, 2000) and DAML+OIL (DARPA Agent Markup Language + Ontology Inference Layer, 2001) are two popular logical layer languages.
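The first two layers can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not an implementation of RDF or RDFS: statements are plain tuples, the `ex:` resources are hypothetical, and only the standard property names `rdf:type` and `rdfs:subClassOf` are taken from the RDF and RDFS vocabularies.

```python
# Metadata layer: statements as (resource, property, value) triples.
# Schema layer: an is-a hierarchy expressed with rdfs:subClassOf.
TRIPLES = {
    ("ex:paper134", "rdf:type", "ex:JournalArticle"),
    ("ex:paper134", "ex:author", "ex:Lu"),
    ("ex:JournalArticle", "rdfs:subClassOf", "ex:Article"),
    ("ex:Article", "rdfs:subClassOf", "ex:Document"),
}


def superclasses(cls, graph):
    """Collect every class reachable from `cls` via rdfs:subClassOf."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for subject, prop, value in graph:
            if subject == current and prop == "rdfs:subClassOf" and value not in found:
                found.add(value)
                frontier.append(value)
    return found


def types_of(resource, graph):
    """Direct types of `resource` plus the types implied by the is-a hierarchy."""
    direct = {v for s, p, v in graph if s == resource and p == "rdf:type"}
    inferred = set()
    for cls in direct:
        inferred |= superclasses(cls, graph)
    return direct | inferred
```

With these definitions, `types_of("ex:paper134", TRIPLES)` also contains `ex:Article` and `ex:Document`, although neither is stated directly. Logical layer languages add far richer constructs than this transitive closure and are not covered by the sketch.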

With the creation and development of the Semantic Web, various web resources will be able to be accessed by machines in a semantic fashion. The questions are, what opportunities will this new technology bring to us and what challenges and work are we facing now to get us from today's Web to the Web of tomorrow - the Semantic Web? In this paper, we share our understanding of the answers to these questions and we hope this will shed some light on future research in this area.

Opportunities

As with other technologies, interest in creating and developing the Semantic Web is motivated by the opportunities it might bring: it can either solve new problems or solve old problems in a better way. Here, instead of enumerating all the opportunities enabled by the Semantic Web, we focus our discussion on the following closely related aspects: web-services, agent-based distributed computing, semantics-based web search engines, and semantics-based digital libraries.

Web-services

Among the most important web resources on the Semantic Web are the so-called web-services. Here, web-services refers to "web sites that do not merely provide static information but allow one to effect some action or change in the world". The Semantic Web will enable users to locate, select, employ, compose, and monitor web-services automatically (Ankolekar, et al., 2001).

Industry has already seen the potential market enabled by web-services, and some effort has been put into the development of standards for electronic commerce, in particular for the description of web-services. For example, Microsoft, IBM and Ariba proposed UDDI (Universal Description, Discovery, and Integration, 2000) to describe a standard for an online registry, and the publishing and dynamic discovery of web-services offered by businesses; Microsoft and IBM proposed WSDL (Web Services Description Language) as an XML language to describe interfaces to web-services registered with a UDDI database; the DAML Services Coalition proposed DAML-S (DARPA Agent Markup Language - Services, 2001) as an ontology to describe web-services; and OASIS and the United Nations developed ebXML (Electronic Business XML Initiative, 2000) to describe business interactions from a workflow perspective.

A number of communication protocols have been developed for the invocation of web-services. Remote Procedure Call (Birrell & Nelson, 1984) is a client/server infrastructure that allows a client component of an application to employ a function call to access a server on a remote system. The CGI (Common Gateway Interface) mechanism is a standard for external gateway programs to interface with information servers such as HTTP servers. CORBA (Common Object Request Broker Architecture) uses a registry to store interfaces of distributed objects so that a client can invoke a method of a remote server object without knowing its location, programming language, operating system and other system aspects that are not part of the object's interface. SOAP (Simple Object Access Protocol) is a protocol for the exchange of information in a distributed environment, in particular, the Web. Using SOAP, one can describe the content of a message and the way to process it, define application-dependent data types, and represent remote procedure calls and responses. SOAP can potentially be used together with a variety of other protocols; currently, its binding with HTTP is supported. Java RMI (Java Remote Method Invocation) enables programmers to develop Java applications in which a client can invoke a method of a remote Java object. The ACL (Agent Communication Language) developed by SRI International in the framework of OAA (Open Agent Architecture) allows one to define and publish the capabilities of agents so that a request can be matched with a server agent by the facilitators (brokers) (Martin, et al., 1999).
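To make the idea of representing a remote procedure call as a SOAP message concrete, the following Python sketch builds and reads a minimal envelope with the standard library. The envelope namespace is the one defined for SOAP 1.1; the application namespace `urn:example:quotes`, the `GetQuote` operation and its `symbol` parameter are hypothetical. The sketch only constructs the message and does not send it over HTTP.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:quotes"                          # hypothetical application namespace


def build_request(symbol):
    """Serialize a hypothetical GetQuote call as a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}GetQuote")
    ET.SubElement(call, f"{{{APP_NS}}}symbol").text = symbol
    return ET.tostring(envelope, encoding="utf-8")


def read_symbol(message):
    """Extract the call parameter on the receiving side."""
    root = ET.fromstring(message)
    path = f"{{{SOAP_NS}}}Body/{{{APP_NS}}}GetQuote/{{{APP_NS}}}symbol"
    return root.find(path).text
```

In an actual deployment the bytes returned by `build_request` would be carried in the body of an HTTP POST, which is the binding mentioned above.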

Agent-based distributed computing paradigm

The Semantic Web will use ontologies to describe various web resources; hence, knowledge on the Web will be represented in a structured, logical, and semantic way. This will change the way that agents navigate, harvest and utilize information on the Web (Payne, et al., 2002). On the one hand, the Semantic Web is a web of distributed knowledge bases, and agents can read and reason about published knowledge with the guidance of ontologies. On the other hand, the Semantic Web is a collection of web-services described by ontologies like DAML-S (DARPA Agent Markup Language - Services) (Ankolekar et al., 2001), and this will facilitate dynamic matchmaking among heterogeneous agents: service provider agents can advertise their capabilities to middle agents; middle agents store these advertisements; a service requester agent can ask a middle agent whether it knows of some provider agents with desired capabilities; and the middle agent matches the request against the stored advertisements and returns the result, a subset of the stored advertisements (Sycara, et al., 2002).
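The advertise, store and match cycle described above can be sketched as follows. This is a deliberately simplified illustration: the class and capability names are hypothetical, and matching is plain set containment, whereas ontology-based matchmaking would compare capability descriptions semantically.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Advertisement:
    provider: str
    capabilities: frozenset


@dataclass
class MiddleAgent:
    advertisements: list = field(default_factory=list)

    def advertise(self, provider, capabilities):
        """Store a provider agent's advertisement."""
        self.advertisements.append(Advertisement(provider, frozenset(capabilities)))

    def match(self, desired):
        """Return the subset of stored advertisements covering every desired capability."""
        wanted = frozenset(desired)
        return [ad for ad in self.advertisements if wanted <= ad.capabilities]


broker = MiddleAgent()
broker.advertise("agent-a", {"book-flight", "book-hotel"})
broker.advertise("agent-b", {"book-hotel"})
matches = broker.match({"book-flight"})  # only agent-a qualifies
```

The requester receives advertisements rather than results, so it remains free to choose among the matching providers.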

When agents are equipped with intelligence and mobility, the conventional client/server computing paradigm might be replaced by an agent-based distributed computing paradigm, in which agents can migrate from one site to another, carrying their code, data, running states (including internal beliefs), and intelligence (specified by the users), and fulfill their missions autonomously and intelligently. Many researchers have speculated that mobile agents are inevitable for an open and distributed environment like the Semantic Web and have seen the advantages of this new computing paradigm (Lange & Oshima, 1999; Harrison, 1995).

A number of mobile agent systems have been developed (Mauldin, 1991; Luke, et al., 1997).

Semantics-based web search engines

Search engines are among the most useful resources on the Web, and currently there are two types of search engines.

Both types of search engines are based on keywords, and hence are subject to two well-known linguistic phenomena that strongly degrade a query's precision and recall: polysemy (one word might have several meanings) and synonymy (several terms, i.e. words or phrases, might designate the same concept). A number of stemming algorithms (Lennon, et al., 1981) have been developed to address the synonymy issue, including suffix removal, strict truncation of character strings, word segmentation, letter bigrams and linguistic morphology. The idea is that different derivations of a word are similar to each other in their forms (e.g. they have the same prefix) and can be traced back to the same root (stem) using these stemming methods. However, these methods are subject to the following stemming errors. Words with different meanings might be reduced to the same root; for example, the words general, generous, generation, and generic might all be reduced to the same root. Conversely, different words with the same meaning cannot be reduced to the same root; for example, automobile and car. The situation becomes worse for the large-scale robot-based search engines, since only limited semantics can be derived from the lexical or syntactic content of web pages.
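Both kinds of stemming error can be reproduced with a toy suffix-removal stemmer. The suffix list below is an assumption made for illustration and is not one of the published algorithms cited above.

```python
# Illustrative suffix list, not a published rule set.
SUFFIXES = ("ation", "ous", "al", "ic")


def stem(word):
    """Strip the longest matching suffix, keeping a stem of at least three letters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Words with different meanings collapse to one root ...
conflated = {stem(w) for w in ("general", "generous", "generation", "generic")}
# ... while synonyms with unrelated spellings stay apart.
synonyms = {stem(w) for w in ("automobile", "car")}
```

Here `conflated` contains the single root `gener`, while `synonyms` still contains two distinct entries, so a query for one synonym would miss documents that use the other.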

Several systems have been built to overcome these problems based on the idea of annotating Web pages with special HTML tags to represent semantics, including the SHOE (Simple HTML Ontology Extensions) system (Luke, et al., 1997) and the GDA system (Ttiyama & Hasida, 1997). However, the limitation of these systems is that they can only process web pages that are annotated with these HTML tags, and so far there is no agreement on a universally acceptable set of HTML tags.

XML is a promising technique since it keeps content, structure, and representation apart and is a much more adequate means for knowledge representation. However, XML can represent only some semantic properties through its syntactic structure, and XML queries need to be aware of this syntactic structure. With the advent of the Semantic Web, resources on the Web will be represented semantically in ontologies. Semantics-based web search engines can then be built in which each query is executed within the context of some ontology. The guidance from ontologies will increase the recall and precision of the search result. For example, if one poses the query "return all the reviewers for book 'The Semantic Web: an Introduction'" to a semantics-based web search engine, the engine will return only reviewers of this book instead of returning web pages that contain the keyword "reviewer" and/or the term "The Semantic Web: an Introduction". As another example, if one poses the query "return all the chairs", then with the guidance of a furniture ontology only furniture chairs are returned, and with the guidance of a person ontology only people who are chairs of some organization are returned. In contrast, keyword-based search engines will return web sites that contain the keyword "chairs", including chairs that refer to furniture and chairs that refer to people. It is worth mentioning that some systems that use ontologies to enhance web search engines have been developed (Barros, et al., 1998; Erdmann, et al., 2001). Since ontologies are built on a domain basis, web search engines might also be built on a domain basis, and hence metasearch engines, which interface with multiple remote search engines and select and rank them intelligently, might be very useful (Dreilinger & Howe, 1997).
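The "chairs" example can be sketched in Python as follows. The resource list, the URLs and the concept names are hypothetical, and the sketch assumes that each resource has already been annotated with a concept from some ontology; how such annotations are produced is outside its scope.

```python
# Hypothetical annotated resources: each carries a concept from some ontology.
RESOURCES = [
    {"url": "http://example.org/oak-chair",
     "concept": "furniture:Chair", "text": "oak chairs on sale"},
    {"url": "http://example.org/dept-chair",
     "concept": "person:Chair", "text": "directory of department chairs"},
    {"url": "http://example.org/ski-lift",
     "concept": "transport:Lift", "text": "ski lift chairs timetable"},
]


def keyword_search(keyword):
    """Keyword matching: every page mentioning the word is returned."""
    return [r["url"] for r in RESOURCES if keyword in r["text"]]


def semantic_search(concept, ontology):
    """Ontology-guided matching: the query is interpreted within one ontology."""
    target = f"{ontology}:{concept}"
    return [r["url"] for r in RESOURCES if r["concept"] == target]
```

Under these assumptions, `keyword_search("chairs")` returns all three pages, whereas `semantic_search("Chair", "furniture")` and `semantic_search("Chair", "person")` each return exactly one.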

Semantics-based digital libraries

The volume of digital multimedia data in various formats on the Internet has increased tremendously in recent years. With the development of digital photography, more and more people are able to store their personal photographs on their PCs, and sharing picture albums and home videos on the Internet is becoming more and more popular. Furthermore, many organizations have large image and video collections in digital format available for online access. Film producers want to advertise movies through interactive preview clips. Travel agencies are interested in digital archives of holiday resort photographs. Hospitals would like to build medical image databases. These emerging applications for multimedia digital libraries require interdisciplinary research in the areas of image processing, computer vision, information retrieval and database management.

Semantics-based retrieval of multimedia digital content is important for efficient use of multimedia data repositories. Traditional content-based multimedia retrieval techniques describe images/videos based on low-level features (such as color, texture, and shape) and support retrieval based on these features (Smeulders, et al., 2000). However, humans typically do not view images/videos in terms of low-level features, so a semantics-based query capability is highly desirable. For example, one might want to formulate a query like "return all the scenes in clip 1 in which a boy is riding a bicycle". Retrieving images/videos based on low-level features cannot provide satisfactory results (Vaiulaya, et al., 2001). Effective and precise multimedia retrieval by semantics remains an open and challenging problem (Vaiulaya, et al., 2001; Naphade, et al., 2001).

Recently, ontologies have begun to be used in the context of digital libraries. For example, ScholOnto (Shum, 2000) is an ontology-based digital library that supports scholarly interpretation and discourse, and ARION (Corcho, 2000) is another ontology-based digital library that supports search and navigation of geospatial data sets and environmental applications.

We believe that various digital libraries will become another major web resource of the Semantic Web. The challenges here are: (1) the development of efficient and effective classification and indexing mechanisms for each type of digital library, and (2) semantic interoperability between digital libraries of similar types and between a digital library and the Semantic Web.

Challenges

New opportunities impose new challenges. In the following, we focus our discussion on three challenges that we are facing now: the development of ontologies, the development of the formal semantics of Semantic Web languages, and the development of trust and proof models.

The development of ontologies

It is well recognized within the Semantic Web community that ontologies will play an essential role in the development of the Semantic Web. Considerable effort has been devoted to research on different aspects of ontologies, including ontology representation languages (Corcho, 2000), ontology development (Jones, et al., 1998), ontology learning approaches (Maedche & Staab, 2001), and ontology library systems (Ding & Fensel, 2001), which manage, adapt, and standardize ontologies.

Management. The main purpose of ontologies is to enable knowledge sharing and re-use, hence a typical ontology library system supports open storage and organization, identification and versioning. Open storage and organization address how ontologies are stored and organized in a library system to facilitate access and management of ontologies. Identification associates each ontology with a unique identifier. Versioning is an important feature since ontologies evolve over time and a versioning mechanism can ensure the consistency of different versions of ontologies.
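As a minimal sketch of the identification and versioning functions just described, the following Python class keeps every published version of an ontology under a unique identifier. The class name, the identifiers and the tuple-based version numbers are assumptions made for illustration and are not taken from any of the surveyed library systems.

```python
class OntologyLibrary:
    """Toy ontology store with unique identifiers and immutable versions."""

    def __init__(self):
        self._store = {}  # identifier -> {version tuple: content}

    def publish(self, identifier, version, content):
        """Add a new version; existing versions are never overwritten."""
        versions = self._store.setdefault(identifier, {})
        if version in versions:
            raise ValueError(f"{identifier} version {version} already exists")
        versions[version] = content

    def get(self, identifier, version=None):
        """Return (version, content); the latest version if none is requested."""
        versions = self._store[identifier]
        if version is None:
            version = max(versions)
        return version, versions[version]


library = OntologyLibrary()
library.publish("ex:furniture", (1, 0), "Chair is-a Furniture")
library.publish("ex:furniture", (1, 1), "Chair is-a Seat; Seat is-a Furniture")
```

Refusing to overwrite a published version is one simple way to keep earlier versions available to applications that still depend on them.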

Adaptation. Since ontologies evolve over time, how to extend and update existing ontologies is an important issue. This includes searching, editing and reasoning about ontologies in an ontology library system.

Standardization. Integration and interoperability are always a concern of any open system. This is especially true of the Semantic Web, an open system that has to be scalable at the Internet level. Currently, a number of ontology representation languages have been proposed (Corcho, et al., 2000) and various ontology library systems have been built (Ding & Fensel, 2001). The question is which ontology representation language should become the standard. Each of them seems to have its advantages and disadvantages, and has its proponents and opponents. This may simply reflect human nature: each of us has his or her own preferences. Since the Semantic Web is still at an early stage, it might be too early to enforce any standardization. Each representation language can grow on its own, and the one or few that prevail will become the de facto standards. XML might serve as the meta-language of these representations to facilitate future interoperation and integration.

Formal semantics of the Semantic Web languages

The functional architecture of the Semantic Web has three layers: the metadata layer, the schema layer and the logical layer. Currently, the RDF (Resource Description Framework, 1999) is believed to be the most popular data model for the metadata layer. Although it is believed that the RDF data model is sufficient for defining and using metadata, the semantics of reification (a statement about a statement) is yet to be defined. RDFS (RDF Schema) extends RDF and is currently a popular schema layer language. It has been recognized that RDFS lacks a formal semantics, and one proposal is to define a metamodeling architecture for RDFS similar to the one for UML (Unified Modeling Language), and hence to define a formal semantics (Pan & Horrocks, 2001). This approach, although formal, is complicated and not intuitive. Although RDFS has been criticized for its semantic confusion and some apparent paradoxes, it has not been shown that a formal semantics is impossible. To smooth the way for the development of the Semantic Web, we believe that the semantics of RDFS needs to be resolved: either a formal semantics is defined for it, or the problem with RDFS is pinned down so that the semantic issue can be resolved.
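Reification can be made concrete with a short Python sketch. It uses the reification vocabulary of RDF (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`); the `ex:` names and the tuple representation are hypothetical and serve only as an illustration.

```python
def reify(triple, statement_id):
    """Describe `triple` as a resource so that other statements can refer to it."""
    subject, predicate, value = triple
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", value),
    ]


claim = ("ex:paper134", "ex:author", "ex:Lu")
graph = reify(claim, "ex:stmt1")
# A statement about the statement: who asserted it.
graph.append(("ex:stmt1", "ex:assertedBy", "ex:Alice"))
```

Note that `claim` itself does not appear in `graph`. Whether the four describing triples should imply the original statement is exactly the kind of question that a formal semantics of reification has to settle.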

Proof and trust

As an open and distributed system, the Semantic Web bears the spirit that "anybody can say anything on anybody". People all over the world might assert statements that conflict with one another. Hence, one needs to make sure that the original source did in fact make a particular statement (proof) and that the source is trustworthy (trust).

Currently, the notions of proof and trust have yet to be formalized, and a theory that integrates them into the inference engines of the Semantic Web is yet to be developed. However, these technologies are very important and are the foundation for building real commercial applications (e.g., B2B and B2C systems).
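Since no formal model exists yet, the following Python sketch is only one hypothetical way of separating the two checks. It assumes that each source shares a secret key with the verifier and uses a keyed hash (HMAC) from the standard library as a stand-in for proof of origin; a real system would more likely rely on public-key signatures. The sources, keys and trust list are invented for illustration.

```python
import hashlib
import hmac

# Hypothetical sources, keys and trust policy.
SOURCE_KEYS = {"ex:Alice": b"alice-secret", "ex:Bob": b"bob-secret"}
TRUSTED = {"ex:Alice"}


def sign(source, statement):
    """Tag a statement so its origin can later be checked."""
    key = SOURCE_KEYS[source]
    return hmac.new(key, statement.encode("utf-8"), hashlib.sha256).hexdigest()


def accept(source, statement, tag):
    """Accept only if the source really made the statement (proof) and is trusted (trust)."""
    key = SOURCE_KEYS.get(source)
    if key is None:
        return False
    expected = hmac.new(key, statement.encode("utf-8"), hashlib.sha256).hexdigest()
    proven = hmac.compare_digest(expected, tag)
    return proven and source in TRUSTED


statement = "ex:paper134 ex:author ex:Lu"
```

Under these assumptions, a statement tagged by `ex:Alice` is accepted, one tagged by `ex:Bob` is proven but rejected as untrusted, and one attributed to `ex:Alice` but tagged with Bob's key fails the proof check.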

Conclusion

In this paper, we reported a survey of recent research on the Semantic Web. In particular, we presented the opportunities that this new revolution will bring to us, and the challenges that we are facing during the development of the Semantic Web. We hope that this paper will shed some light on the direction of future work.

The Semantic Web is still a vision. We believe that the Web will grow towards this vision in much the same way as communities develop in the real world: Semantic Web communities will appear and grow first, and then the interaction and interoperation among different communities will finally interweave them into the Semantic Web.

Acknowledgements

We thank Dr. Terry Brooks and the two anonymous reviewers for their helpful suggestions and comments.

References


How to cite this paper:

Lu, Shiyong, Dong, Ming and Fotouhi, Farshad (2002) "The Semantic Web: opportunities and challenges for next-generation Web applications." Information Research 7(4), Available at: http://InformationR.net/ir/7-4/paper134..html

© the authors, 2002. Updated: 30th April, 2002

