High-level design of a semantic search engine.
Cristescu, Sorin-Alexandru
Abstract: The Internet consists of an increasing number of web
pages and web services. The vast majority are described by
human-readable HTML and WSDL, respectively, which makes them suitable to
be found by syntactic searches (textual matching). Annotating them with
semantic information unleashes a huge potential, providing more suitable
results in the search process, which benefits both humans (in the case
of web pages) and companies (in the case of web services). This
paper presents the key requirements and design decisions for a semantic
search engine dealing with web services annotated with semantic
information using SAWSDL (SAWSDL, 2007).
Key words: semantic search engine, semantic web
1. INTRODUCTION
Web services are traditionally registered in a UDDI-like
repository. A client needs to discover such a web service and then
invoke it. In the real world this process is only used in a very limited
way, mostly inside companies that control both the registration and the
discovery process. This is due to the limitations of both phases, which
are based on syntactic matching, rather than semantic.
We propose a registration and discovery process that would open up
the usability of web services to real-world applications over the
Internet. The idea is basically to have a search engine for web
services, similar to a traditional search engine for web pages. However,
in order for this to alleviate the limitations of the traditional search
engines, we propose one based on semantic information. Our precondition
is that web service descriptions (WSDL) need to be annotated with
semantic concepts according to the SAWSDL standard (SAWSDL, 2007).
A traditional search engine needs to perform three basic steps in
order to provide results to its consumers: crawling the web, indexing
the web pages and actually searching for textual matches. A semantic
search engine needs to provide the same functionality, with the notable
difference that the indexing and searching are based on concepts
(semantics), rather than on pure text (syntax). In this paper we present
an algorithm for semantic matching of web services annotated using
SAWSDL. We also present a high-level design of the semantic search
engine.
In the area of semantic matching for web services, most of the
research focuses on matching based on a semantic distance between the
concepts. Pukkasenung et al. (2010) propose a matching algorithm based
on the semantic distance for each of the inputs, outputs, preconditions and
effects (IOPEs). We also believe that using IOPEs gives accurate
results, since we maximize the semantic information to base the matching
on. Moreover, if matching of inputs and outputs scales to many users,
then extending the algorithm with preconditions and effects introduces
only a small performance penalty, since matching the latter is less
complex than matching inputs or outputs.
However, we propose matching IOPEs based on Tversky's model
(Cardoso et al., 2008) and only use the semantic distance as a first
filter, as explained in section 2. This provides more accurate results
and at the same time it allows evaluating similarity of concepts defined
in different ontologies.
Regarding registration, virtually all research in the area of
semantic web services mentions registration of these services to UDDI by
creating plug-ins that allow adding semantic capabilities and inference
power to UDDI. However, the very creators of the standard--Microsoft,
IBM and SAP--dropped their support for UDDI in 2007. We propose a
replacement for UDDI that needs to address the main limitations of this
standard: overwhelming complexity, lack of security, poor scalability,
lack of good tooling, etc. Such a replacement is in fact the indexing
step of a semantic search engine. Thus, semantic web services would be
indexed based on their IOPEs annotated with semantic concepts. For
example, if an input of a web service is annotated with the concept
OrderRequest (http://www.w3.org/2002/ws/sawsdl/spec/ontology/purchaseorder#OrderRequest),
then this concept is used in the index, much as a
traditional search engine would index words. The advantage of
approaching web service registration as indexing in a search engine is
that we can take advantage of the work already done in the indexing area
of classic search engines.
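The registration-as-indexing idea can be sketched as follows; the service URL, concept URI and docID derivation below are illustrative choices, not the paper's exact scheme:

```python
import hashlib
from collections import defaultdict

def doc_id(url: str) -> str:
    """Derive a stable docID from the service's WSDL URL (one possible scheme)."""
    return hashlib.md5(url.encode()).hexdigest()[:8]

# concept URI -> set of docIDs whose annotations use that concept
index = defaultdict(set)

def register(wsdl_url: str, concept_uris) -> None:
    """Index an advertised service under each concept annotating its IOPEs."""
    d = doc_id(wsdl_url)
    for uri in concept_uris:
        index[uri].add(d)

concept = "http://www.w3.org/2002/ws/sawsdl/spec/ontology/purchaseorder#OrderRequest"
register("http://example.org/orders.wsdl", [concept])
hits = index[concept]  # all services advertising this concept
```

A lookup for a concept URI then returns the docIDs of all services annotated with it, exactly as a word lookup does in a classic engine.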
The search step then would mean matching the IOPEs of a requested
web service to the advertised services based on Tversky's semantic
matching model, which compares the properties of concepts.
A major difference with respect to traditional search is that in
the case of semantic web service search we don't need all matching results, but
only the best match, which is the web service to be invoked by the
client who's looking for a match.
2. HIGH-LEVEL DESIGN
The first important requirement on the data our search engine works
with is that the WSDL descriptions of the services be annotated with
semantic information. Our semantic search engine deals with services
whose descriptions (WSDL files) contain IOPEs annotated with ontological
concepts using SAWSDL. Our crawlers are fed with URLs pointing to
semantically annotated WSDL documents. They fetch the documents and
store them compressed in a repository. Each document is assigned a
unique docID, derived from its URL. Importantly, the crawlers also fetch
the ontology files used to annotate the WSDLs.
The indexer parses each WSDL document and creates a forward index,
which keeps the relation between each docID and the hits, i.e. I, O, P,
E concepts. At the same time, for each ontology referred to by the IOPE
concepts, the indexer invokes an inference engine (Jena) to calculate
the Tversky matching values between each pair of concepts of that
ontology. This list of matches is sorted in descending order by the
matching values and is stored together with the corresponding ontology
file. As we'll see later, this greatly improves the performance of
the search.
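A minimal sketch of this precomputation, assuming toy concepts with illustrative property sets and a symmetric weighting (alpha = beta = 0.5) in Tversky's ratio model:

```python
from itertools import combinations

def tversky(a: set, b: set, alpha: float = 0.5, beta: float = 0.5) -> float:
    """Tversky's ratio model over two concepts' property sets."""
    common = len(a & b)
    return common / (common + alpha * len(a - b) + beta * len(b - a))

# Toy concepts of one ontology, with illustrative property sets.
concepts = {
    "Order":        {"id", "items"},
    "ExpressOrder": {"id", "items", "deadline"},
    "Invoice":      {"id", "amount"},
}

# Pairwise values, sorted descending so the search can walk them best-first.
matches = sorted(
    ((tversky(concepts[x], concepts[y]), x, y)
     for x, y in combinations(concepts, 2)),
    reverse=True,
)
```

Storing `matches` alongside the ontology file means the search never has to call the inference engine at query time.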
Then the forward index is converted into four inverted indexes,
each corresponding to I, O, P and E respectively. For each concept from
IOPE, a unique wordID is generated. The indexes are sorted by this
wordID.
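The forward-to-inverted conversion could look like the following sketch; the wordID hashing and the dictionary layout are assumptions, not the prototype's actual on-disk format:

```python
import hashlib
from collections import defaultdict

def word_id(concept_uri: str) -> str:
    """Generate a unique, stable wordID for a concept URI."""
    return hashlib.sha1(concept_uri.encode()).hexdigest()[:12]

def invert(forward: dict) -> dict:
    """Turn {docID: {kind: [concepts]}} into four inverted indexes, one per IOPE kind."""
    inverted = {kind: defaultdict(list) for kind in ("I", "O", "P", "E")}
    for doc, hits in forward.items():
        for kind in ("I", "O", "P", "E"):
            for concept in hits.get(kind, []):
                inverted[kind][word_id(concept)].append(doc)
    # sort each index by wordID, as the design requires
    return {kind: dict(sorted(idx.items())) for kind, idx in inverted.items()}

forward = {"doc1": {"I": ["po#OrderRequest"], "O": ["po#OrderConfirmation"]}}
inv = invert(forward)
```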
With these preparations in place, the search process needs to
return the best matching web service for a given profile. By profile we
mean a set of IOPEs representing the desired characteristics of a
service. The preconditions and effects are not mandatory, but the inputs
and outputs are.
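A profile could be modeled as a small record type; this sketch (field names are illustrative) enforces that inputs and outputs are mandatory while preconditions and effects default to empty:

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    inputs: list
    outputs: list
    preconditions: list = field(default_factory=list)
    effects: list = field(default_factory=list)

    def __post_init__(self):
        # inputs and outputs are mandatory; P and E may be empty
        if not self.inputs or not self.outputs:
            raise ValueError("a profile needs at least one input and one output")

p = Profile(inputs=["po#OrderRequest"], outputs=["po#OrderConfirmation"])
```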
The search algorithm is structured on three levels:
1. First a syntactic (textual) matching is done with the IOPEs of
the given profile; this is the same as what a traditional search engine
would do, but it only returns the most relevant match (preferably one
that covers all IOPEs);
2. If there was no result, a first kind of semantic matching is
done: for each output concept in the given profile, we take each
subclass in its ontology in increasing order of the semantic distance,
as defined in (Pukkasenung et al., 2010); we compute the wordID for the
current subclass and look it up in the corresponding index; thus, we try
to find the best matches for all outputs according to the semantic
distance; if at least one output can't be matched, the algorithm
moves to step 3; then we do the same for inputs, but now we take the
superclasses of each given input; if we have at least one input matched,
we find the matches common for the outputs and the inputs; finally, we
check which of these also match the preconditions and effects of the
given profile, if they exist;
3. If there was no result from the previous step, more complex
semantic matching is performed: for each output concept in the given
profile, we take all concepts from the same ontology (except the
subclasses covered at step 2) in decreasing order of Tversky's
model value pre-calculated for the current output; we compute the wordID
for each such concept and look it up in the corresponding index; if at
least one output can't be matched, the algorithm stops and no match
is returned; then we do the same for inputs; if we have at least one
input matched, we find the matches common to the outputs and the inputs;
finally, we check which of these also match the preconditions and
effects of the given profile, if they exist.
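The three-level cascade can be sketched as a fall-through over matchers; the matcher functions below are dummies standing in for the index lookups described in the steps above:

```python
def search(profile, level_matchers):
    """Try each matching level in order; return the first hit, else None."""
    for match in level_matchers:
        result = match(profile)
        if result is not None:
            return result
    return None

# Dummy matchers: level 1 (syntactic) finds nothing, level 2 (semantic
# distance) succeeds, so level 3 (Tversky) is never consulted.
syntactic = lambda profile: None
semantic_distance = lambda profile: "WeatherServiceB"
tversky_level = lambda profile: "WeatherServiceC"

best = search({"O": ["wx#Forecast"]}, [syntactic, semantic_distance, tversky_level])
```

Each level is strictly cheaper than the next, so the cascade only pays for semantic reasoning when cheaper matching fails.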
Note that in step 1 we do a standard, textual match. This helps
when an advertised web service is annotated with a concept such as:
http://www.w3.org/2002/ws/sawsdl/spec/ontology/purchaseorder#OrderRequest,
which is also part of the required profile and where
the namespace and the name of the concept match textually. This
obviates the need for a semantic match, since we can assume both the
required profile and the advertised service mean the same thing.
Step 2 performs a first semantic match. Note that we need all
outputs of a required profile to be matched. If any output is not
matched by an advertised web service, we'd remain with a partial
answer, which is not acceptable, since the outputs are used to compose
and further invoke web services. As for the inputs, the requirement is
less strict: we need at least one input to match, not necessarily all. In
order to match an input, we use the covariance principle: if the
required profile's input is a subclass of the advertised
service's input, they match. Intuitively this means that if an
advertised service can deal with a certain concept, it can also deal
with its derived concepts, since they are just restrictions of the
original concept. However, for the outputs we have the dual principle
(contravariance): if an advertised service outputs a certain concept, we
can never require a more restrictive version (i.e. a subclass) of that.
Only a superclass of that concept will match.
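The two variance rules can be sketched over a toy class hierarchy (concept names are illustrative):

```python
# child -> parent edges of a toy concept hierarchy
parents = {"ExpressOrderRequest": "OrderRequest", "OrderRequest": "Request"}

def is_subclass_of(concept, ancestor):
    """Walk up the hierarchy to test the subclass relation."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = parents.get(concept)
    return False

def input_matches(required_input, advertised_input):
    # covariance: the required input may be a subclass of the advertised one
    return is_subclass_of(required_input, advertised_input)

def output_matches(required_output, advertised_output):
    # contravariance: the advertised output may be a subclass of the required one
    return is_subclass_of(advertised_output, required_output)
```

For example, a service advertising an OrderRequest input also accepts an ExpressOrderRequest, but a service advertising an OrderRequest output can only satisfy a request for Request or OrderRequest, never ExpressOrderRequest.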
Finally, if there are no matches so far, Tversky's model is used,
which provides a matching value for any two concepts of an ontology. We
start with the best Tversky match (e.g. two concepts which differ by
just a property) and then, if no match is found, we proceed with lower
matches, until a certain threshold. The threshold is necessary to
eliminate really bad matches and it is a parameter of the search
process.
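In set-theoretic terms, Tversky's ratio model gives S(A, B) = |A ∩ B| / (|A ∩ B| + α|A \ B| + β|B \ A|) over the concepts' property sets. The threshold cut over the pre-sorted matches can then be sketched as follows (the pairs, scores and threshold value are illustrative):

```python
THRESHOLD = 0.3  # search parameter; the value here is illustrative

# pre-sorted Tversky matches for one required output concept (toy data)
sorted_matches = [("ExpressOrder", 0.8), ("Invoice", 0.5), ("Shipment", 0.2)]

# walk best-first, keeping only matches at or above the threshold
candidates = [concept for concept, score in sorted_matches if score >= THRESHOLD]
```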
3. NON-FUNCTIONAL DESIGN
Just like a traditional search engine, a semantic one must fulfill
certain performance criteria. Naturally, we can base our work on the
experience of traditional search engines.
Among the non-functional requirements, availability and scalability
score high. Typically they are addressed with redundancy, where data is
replicated and distributed across the nodes of a distributed hash table (DHT).
By design, disk seeks need to be avoided as much as possible, as
they are expensive operations. Indexes are typically implemented as
B+trees, which helps in minimizing the number of disk seeks needed to
find an item.
A search engine such as Google (Brin & Page, 1998) uses
custom-made storage systems for indexing its data; Google's Bigtable,
for example, is a distributed storage system designed to scale to
petabytes of data across thousands of servers. For our prototype, we use the Apache Hadoop
Distributed File System (HDFS, 2007), an open source implementation in
Java of a distributed file system designed to hold huge amounts of
information and provide fast access to it.
Compression techniques should be chosen as a tradeoff between
compression/decompression speed and compression ratio. The original
Google implementation chose zlib to compress the crawled HTML pages. Our
prototype also uses zlib to compress the ontology files, which tend to
be large.
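The compression step for ontology files can be sketched with Python's standard zlib binding (the payload below is a placeholder standing in for a fetched ontology file):

```python
import zlib

# placeholder payload standing in for a fetched ontology file
ontology_bytes = b"<rdf:RDF>...</rdf:RDF>" * 100

# level 6 is zlib's default tradeoff between speed and compression ratio
compressed = zlib.compress(ontology_bytes, 6)
restored = zlib.decompress(compressed)
```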
4. LIMITATIONS AND FUTURE WORK
In order to open the semantic web to real-world applications, tools
are needed that facilitate matching the web services by their
capabilities, composing and invoking them. For this to happen, we need a
proper search engine for semantic web services. In this paper we present
the high-level design of such a search engine, with the precondition
that the WSDL files describing the web services be semantically
annotated according to SAWSDL. Our proposal is a combination between
traditional, textual (syntactic) matching and semantic matching, the
latter involving semantic distance and Tversky's model to provide
accurate results. While the first results are encouraging, more work
needs to be done especially in matching ontologies: what happens if two
concepts come from different ontologies but mean the same thing? We
need to be able to match them.
Also, note that we limit our algorithm to SAWSDL-annotated
descriptions. This excludes, e.g., annotations following the WSDL-S
standard, as well as annotated RESTful services, which are described by
WADL. These limitations need to be addressed in a
real-world semantic search engine. They are the next steps in our
research.
5. REFERENCES
Brin S. & Page L. (1998). The anatomy of a large-scale
hypertextual Web search engine, Proceedings of the Seventh International
World Wide Web Conference, Brisbane, Australia,
doi:10.1016/S0169-7552(98)00110-X, pp. 107-117
Cardoso J., Miller J. & Emani S. (2008). Web Services Discovery
Utilizing Semantically Annotated WSDL, In: Reasoning Web, Baroglio et
al. (Ed.), pp. 240-268, Springer-Verlag, ISBN 978-3-540-85656-6, Berlin
Pukkasenung P., Sophatsathi P. & Lursinsap C. (2010). An
Efficient Semantic Web Service Discovery Using Hybrid Matching,
Proceedings of the Second International Conference on Knowledge and
Smart Technologies, Chonburi, Thailand, pp. 49-53
*** (2007). http://hadoop.apache.org/hdfs/ -- Hadoop Distributed
File System, Accessed on: 2011-08-11
*** (2007). http://www.w3.org/TR/sawsdl/ -- SAWSDL, Accessed on:
2011-08-11