Meta-Search Engine

Pascal Francq

March 24, 2012

Abstract

A meta-search engine is a piece of software that uses multiple engines to retrieve a set of relevant documents and presents them consistently. This article describes how meta-search engines are integrated within the GALILEI framework.

1 Introduction

The goal of a meta-search engine is to combine results provided by different search engines. There are several reasons to use multiple search engines:
  • they cover different corpora (for example, the Web and document databases);
  • they provide different coverage of a given corpus (as in the case of Web search engines);
  • they rank the documents with different criteria.
Figure 1 Meta-search engine.
As shown in Figure 1, a meta-search engine follows several steps when it receives a query from a user:
broker: The query is dispatched to different search engines. It is often necessary to reformulate the query for each engine, since they don’t share the same syntax and support different operators. Each engine retrieves an ordered set of documents.
rank: Once all the documents are retrieved, the meta-search engine must rank them consistently in order to identify the most relevant documents.
present: The documents must be presented to the user. The simplest solution is to list the documents in descending order of their global ranking, but other approaches exist. For example, Yippy (formerly Clusty) organizes the documents by theme: when the user sends the query “genetic algorithms”, it suggests several categories (“optimization”, “publications”, etc.).
Some meta-search engines offer more advanced features such as incremental searches.
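The broker, rank and present steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the engines are stub functions, the names `broker`, `rank`, `present` and `rewrite` are hypothetical, scores are assumed to lie in [0, 1], and the merge rule (keep the best score per document) is just one simple choice among many.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical engine type: takes a query string and returns (document, score)
# pairs ordered from most to least relevant. Scores are assumed to be in [0, 1].
Engine = Callable[[str], List[Tuple[str, float]]]
Hits = Dict[str, List[Tuple[str, float]]]


def broker(query: str, engines: Dict[str, Engine],
           rewrite: Dict[str, Callable[[str], str]]) -> Hits:
    """Dispatch the query to every engine, reformulating it when a rewriter exists."""
    results = {}
    for name, engine in engines.items():
        adapted = rewrite.get(name, lambda q: q)(query)
        results[name] = engine(adapted)
    return results


def rank(results: Hits) -> List[Tuple[str, float]]:
    """Merge per-engine results; here the best score per document is kept."""
    best: Dict[str, float] = {}
    for hits in results.values():
        for doc, score in hits:
            best[doc] = max(score, best.get(doc, 0.0))
    # Descending score, ties broken alphabetically for a stable output.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


def present(ranked: List[Tuple[str, float]]) -> List[str]:
    """Format the ranked documents as numbered lines."""
    return [f"{pos}. {doc} ({score:.2f})"
            for pos, (doc, score) in enumerate(ranked, 1)]


# Stub engines standing in for real search engines (assumption).
engines = {
    "web": lambda q: [("doc_a", 0.9), ("doc_b", 0.4)],
    "db": lambda q: [("doc_b", 0.7), ("doc_c", 0.2)],
}
rewrite = {"db": lambda q: q.upper()}  # pretend this engine needs upper case
lines = present(rank(broker("genetic algorithms", engines, rewrite)))
```

A thematic presentation such as the one offered by Yippy would replace `present` with a clustering step; that is outside the scope of this sketch.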

2 Main Choices

Two technological questions must be answered when developing a meta-search engine:
  1. Which search engines to choose for which query?
  2. How to rank the documents retrieved by different search engines?
Most meta-search engines use all known search engines for each request, but this is not very satisfactory. Suppose that two information sources are available: Wikipedia and MedLine (a medical document database). If the user formulates the query “deep purple”, it seems evident that MedLine will be of no help. On the other hand, the most relevant documents for the query “penicillin side effects” are probably to be found in MedLine. Ideally, the information sources to search (that is, the search engines to use) should be chosen according to the given query. But this presupposes that some metadata exists for each search engine. Some automatic methods try to evaluate the relevance of each search engine by analyzing a small sample of the documents it retrieves (2 or 3 documents) [1].
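To make the idea of query-dependent engine selection concrete, here is a minimal Python sketch. It is not the method of [1]: it assumes a naive term-overlap heuristic computed over a few sampled documents per engine, and the names `engine_relevance`, `select_engines` and the `threshold` parameter are hypothetical.

```python
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and split it on whitespace."""
    return set(text.lower().split())


def engine_relevance(query: str, sample_docs: List[str]) -> float:
    """Fraction of query terms found in at least one sampled document."""
    terms = tokenize(query)
    if not terms:
        return 0.0
    seen: Set[str] = set()
    for doc in sample_docs:
        seen |= tokenize(doc)
    return len(terms & seen) / len(terms)


def select_engines(query: str, samples: Dict[str, List[str]],
                   threshold: float = 0.5) -> List[str]:
    """Keep the engines whose sampled documents cover enough query terms."""
    return sorted(name for name, docs in samples.items()
                  if engine_relevance(query, docs) >= threshold)


# Two sampled documents per engine (invented text, for illustration only).
samples = {
    "encyclopedia": ["Deep Purple are an English rock band",
                     "Purple is a colour"],
    "medical": ["Penicillin allergy and adverse effects",
                "Antibiotic side effects in adults"],
}
music = select_engines("deep purple", samples)
drugs = select_engines("penicillin adverse effects", samples)
```

With these invented samples, only the encyclopedia is selected for the first query and only the medical source for the second.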
Regarding document ranking, several approaches also exist. A very simple solution consists in listing first the documents retrieved by the most engines. Some meta-search engines don’t use the ranking proposed by the search engines, but compute their own (most of the time expressing some similarity between the retrieved documents and the query). It is also possible to assign a weight to each search engine (possibly depending on the particular query) and to rank first the documents retrieved by the most heavily weighted search engines.
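The simplest of these approaches, ranking first the documents returned by the largest number of engines, can be sketched as follows. The function name and the alphabetical tie-breaking rule are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List


def rank_by_retrieval_count(results: Dict[str, List[str]]) -> List[str]:
    """Order documents by the number of engines that returned them."""
    counts: Counter = Counter()
    for docs in results.values():
        counts.update(set(docs))  # an engine counts at most once per document
    # Most retrieved first; ties are broken alphabetically (an assumption).
    return sorted(counts, key=lambda doc: (-counts[doc], doc))


results = {
    "engine_1": ["doc_a", "doc_b", "doc_c"],
    "engine_2": ["doc_b", "doc_c"],
    "engine_3": ["doc_c", "doc_d"],
}
ranked = rank_by_retrieval_count(results)
```

Here `doc_c` is returned by three engines and `doc_b` by two, so they come first; the per-engine order is ignored entirely, which is the main weakness of this approach.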

3 The GALILEI Framework

The GALILEI platform doesn’t provide any meta-search engine as such. Instead, it assumes that several meta-search engines may be provided through specific plug-ins, with only one of them being the current one at any time. It is the responsibility of the developer of a meta-search engine plug-in to ensure that:
  1. It interprets the query and identifies the keywords and (possibly) the operators.
  2. It calls some search engines (which are also implemented as plug-ins). If necessary, it must adapt the query for each search engine called.
  3. When a search engine, $E_i$, retrieves a document, $d_j$, for a query, $q$, it must associate it with a ranking, $r_{i,j}(q)$, such that $0 \le r_{i,j}(q) \le 1$. This constraint ensures that each engine normalizes its rankings. Without this normalization, rankings from different engines cannot be compared.
  4. It produces a global ranking of all the documents retrieved by the selected search engines.
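The normalization requirement in item 3 can be illustrated with a short Python sketch. It assumes that normalized rankings lie in [0, 1] and shows two common options: dividing raw scores by the largest one, and deriving a ranking from list position when an engine returns no scores. Both function names are hypothetical and neither is taken from the GALILEI code.

```python
from typing import Dict, List


def normalize_by_max(raw_scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative raw scores so that the best document gets 1.0."""
    top = max(raw_scores.values(), default=0.0)
    if top <= 0.0:
        return {doc: 0.0 for doc in raw_scores}
    return {doc: score / top for doc, score in raw_scores.items()}


def normalize_by_position(ordered_docs: List[str]) -> Dict[str, float]:
    """Derive a ranking in (0, 1] from list position (first document gets 1.0)."""
    n = len(ordered_docs)
    return {doc: (n - pos) / n for pos, doc in enumerate(ordered_docs)}


scores = normalize_by_max({"doc_a": 12.0, "doc_b": 3.0, "doc_c": 6.0})
positions = normalize_by_position(["doc_a", "doc_b", "doc_c", "doc_d"])
```

Once every engine reports values on the same scale, the meta-search engine can combine them without one engine dominating merely because its raw scores are larger.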
Currently, the GALILEI framework provides only a very simple meta-search engine, originally developed by Valery Vandaele in 2003-2004.

3.1 Queries

The current meta-search engine doesn’t provide any particular query language. So, once a query is submitted, it is sent as is to the different search engines. If the latter support specific operators, such as those for document fragment retrieval, users must master them to build useful search expressions.
That said, Vandaele proposes to create multiple queries by combining the terms provided by the user. Let us suppose that a query, $q$, contains $n$ tokens (a token is a sequence of characters delimited by spaces), and that two parameters, $k_{min}$ and $k_{max}$, are given such that $1 \le k_{min} \le k_{max} \le n$. New queries are built from the combinations of $k$ tokens where $k_{min} \le k \le k_{max}$. If the user enters the query “1 2 3” with $k_{min}=1$ and $k_{max}=2$, the following queries are sent to the search engines: “1”, “2”, “3”, “1 2”, “1 3” and “2 3”.
So, to each query $q$ corresponds a set of queries $Q(q)$. In particular, in single-query mode: $Q(q)=\{q\}$.
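The query expansion described above can be sketched in Python with `itertools.combinations`. The function name `expand_query` and the parameter names `k_min` and `k_max` are chosen for this example only; the actual plug-in may name and implement them differently.

```python
from itertools import combinations
from typing import List


def expand_query(query: str, k_min: int, k_max: int) -> List[str]:
    """Build every sub-query made of k tokens, for k_min <= k <= k_max."""
    tokens = query.split()  # a token is delimited by whitespace
    if not 1 <= k_min <= k_max <= len(tokens):
        raise ValueError("expected 1 <= k_min <= k_max <= number of tokens")
    return [" ".join(combo)
            for k in range(k_min, k_max + 1)
            for combo in combinations(tokens, k)]


expanded = expand_query("1 2 3", 1, 2)
single = expand_query("1 2 3", 3, 3)
```

For the query “1 2 3” with sizes 1 to 2, this yields the six queries listed above. When both bounds equal the number of tokens, only the original query is produced, which matches the result of single-query mode.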

3.2 Ranking and Engine Choices

Currently, all known engines are used for each query to process. To each engine, $E_i$, is associated a weight, $w_i$. These weights are fixed by the engines themselves and may vary from one query to another.
Let us suppose that, for a query, $q$, a set of documents, $D(q)$, is retrieved where each document $d_j \in D(q)$ has a set of rankings, $\{r_{i,j}(q)\}$. Of course, $r_{i,j}(q)=0$ if the engine $E_i$ doesn’t retrieve document $d_j$ for query $q$. A global ranking for document $d_j$ and query $q$ is computed using:
$$R_j(q) = \sum_{E_i \in \mathcal{E}} w_i \, r_{i,j}(q)$$
where $\mathcal{E}$ is the set of search engines.
The documents are then ranked in descending order of this global ranking.
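As an illustration, the following Python sketch combines per-engine rankings and engine weights into a global ranking. It assumes a plain weighted sum over all engines, with a ranking of 0 for any document an engine did not retrieve; the function name and the data layout are hypothetical and are not taken from the GALILEI code.

```python
from typing import Dict, List, Tuple


def global_ranking(rankings: Dict[str, Dict[str, float]],
                   weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Weighted sum of normalized rankings, sorted in descending order.

    rankings[engine][doc] is the normalized ranking given by an engine;
    a missing document counts as 0 for that engine.
    """
    docs = set()
    for per_engine in rankings.values():
        docs.update(per_engine)
    scores = {
        doc: sum(weights[engine] * rankings[engine].get(doc, 0.0)
                 for engine in rankings)
        for doc in docs
    }
    # Descending global ranking, ties broken alphabetically (an assumption).
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


weights = {"engine_1": 1.0, "engine_2": 0.5}
rankings = {
    "engine_1": {"doc_a": 0.5, "doc_b": 0.75},
    "engine_2": {"doc_a": 1.0, "doc_c": 0.5},
}
ordered = global_ranking(rankings, weights)
```

Dividing each sum by the total of the weights would turn it into a weighted average; since that total is the same for every document of a given query, the resulting order would not change.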

4 Implementation

The class GMetaEngine implements a generic meta-search engine. In practice, it provides methods that search engine plug-ins use to signal the retrieval of a given document with a given ranking. The results are stored as a container of GDocFragment.
The simple meta-search engine currently developed in the GALILEI framework is available through the Subversion repository metaenginesum.

References

[1] Faiza Abbaci, Méthodes de sélection de collections dans un environnement de recherche d’informations distribuée, PhD Thesis, École des Mines de Saint-Etienne, 2003.