Document Fragment Retrieval

Pascal Francq

September 22, 2015 (March 24, 2012)

Abstract

Most search engines retrieve documents and provide also a fragment of the document containing the terms of the query submitted. With highly structured documents, such as XML ones, several fragments containing the terms may have different relevance for a given query (for example a title of a chapter is considered as more relevant than a simple paragraph). In this article, a multicriteria method to rank different document fragments is proposed.

1 Introduction

Traditionally, search engines return to a keyword-based query a document list ordered by their supposed relevance to that query. In practice, they first select a given set of documents (selection task) and then rank them (ranking task). When results are presented, search engines generally propose a text fragment containing the keywords used in the query.
Nowadays, several document types are structured, i.e. they are represented as a tree of elements. In particular, XML technologies propose a framework to build semantically structured documents. Here is an example of a XML document describing a book catalog:
<?xml version="1.0"?>
<catalog xmlns:xsi="http://test.org/catalog">
      <book id="Francq2011">
      <author>Pascal Francq</author>
      <title>Internet. Tome 1 : Le caractère fétiche</title>
      <cite>Donald Knuth</cite>
      <genre>Internet</genre>
   </book>
      <book id="Knuth1997">
      <author>Donald Knuth</author>
      <title>The Art of Computer Programming. Volume 1: Fundamental Algorithms</title>
      <genre>Computer</genre>
   </book>
</catalog>
Information retrieval with structured documents has several characteristics including allowing complex structured queries (such as to search for information related to the author “Donald Knuth”) and retrieving relevant fragments rather than the whole document (for example extract only the books of the genre “Computer” in the above document).
In the context of a research project within the GALILEI project, Faiza Abbaci and Jean-Baptiste Valsamis work on a XML search engine [1, 2]. The aim of the project was to search inside an existing collection of technical documents available in XML for fragments corresponding to a given query. This article is highly based on their work and presents a generalized version adapted for all kind of structured documents.
The reader should remember that in the document structure, each token of the document (textual or not) corresponds to a node in the tree structure associated with the corresponding concept (term, XML tag, document division, etc.). In this context, we define a document fragment as composed from a root node and, eventually, a snapshot representing a text window containing the keywords of the query selecting it. An example of a document fragment in the above document is the root node <author> associated to the snapshot “Donald Knuth”.

2 Query Language

A query language defines a formal syntax to express an information need. In practice, it combines keywords and operators on these keywords (such as logical operators). In the tensor space model used in the GALILEI framework, a keyword corresponds to a concept. Concepts belong to category corresponding to its “nature”, mostly textual content (traditional query keywords), semantic content (such XML tags) and metadata (for example the XML tag <author>). Moreover, each concept is associated to a particular type (terms, a particular XML schema, etc.).
The current query language proposes a set of operators defined in Table 1↓. If no operator separates several concepts, it supposes that the AND operator is used.
Operator Signification Description
| OR
selects all the fragments containing the concepts or .
+,& AND
selects all the fragments containing the concepts and .
/ INCLUDE
selects all the fragments where a node containing concept has a child node containing concept .
. SIBLING
selects all the fragments where nodes containing concepts and have the same parent node.
~ NOT
selects all the fragments that doesn’t contain concept .
: NAMESPACE
: refers to a concept of a concept type .
Table 1 Operators.
Suppose the following query:
(“http://test.org/catalog”:author)/(knuth|francq)
It searches for all fragments containing a concept author (a XML tag) of the type http://test.org/catalog [A]  [A] In the document description, each XML schema used is considered to be a concept type of the category “Semantic”. containing the terms knuth or francq (a concept appearing without a namespace is considered as a term).

3 Fragment Selection

The first step of the document fragment retrieval is to select all the fragments corresponding to a given query. In practice, it consists to identify the fragments containing or not a given concept, and then to combine the results with the corresponding operators (a query can always be expressed as a binary tree which nodes are either operators or keywords). Let us suppose the query . First, two lists of fragments must be created: one with the fragments containing concept , and one with those not containing . Secondly, since we have an AND operator, a new list is build containing the fragments presented in both lists. So, the fragment selection is based on two tasks: identifying fragments containing (or not) a given concept, and combine fragment lists.

3.1 Identify Fragments

One fundamental task of the fragment selection is to identify all the fragments containing one concept, . With regards to the document indexes available in the GALILEI platform, the following actions should be performed:
  1. Using the document index, search for all the documents that contain concept .
  2. Using the document structure index, retrieve the tree structure of each document and find all the occurrences of concept .
  3. For each occurrence, build a fragment.
The last operation depends of the type of the concept searched. Mostly, the concepts appearing in a query are terms. In such cases, the fragment is formed with the parent node of the occurrence as root node (typically a semantic token or a document part) and a snapshot formed with the term. Let us suppose the concept knuth. It appears two times in the above document: once within the tag cite and once within the tag author. As result, two fragments are selected: they are composed from these two tags as root node and the snapshot “Knuth”.
Sometimes, the concept of a query is not a term. For example, a user may search for the tag author of the XML schema http://test.org/catalog. In this case, the fragment is simply formed with the node corresponding to the tag as root node, and the snapshot includes all its children. Of course, if the concept searched is the root node of a document, the whole document is retrieved.

3.2 Combine Fragment Lists

For each concept appearing in the query, we create a list of fragments containing it (or not containing if it is associated with the NOT operator). The lists build for the different concepts must then be combined with the corresponding operator. Given the query , we must select all the fragments contained in both the list build for and the list build for . This shows that to combine two lists with a given operator, we must apply this operator on each pair of fragments, the final combination being the union of each result.
The result depends from the operator:
OR The two fragments are returned, eventually merged if they are identical (they share the root node and their snapshots are identical).
AND The combination of the two fragments provides a result only if their are related to the same document, and if there exist a common node in the document tree for the two root nodes. If this is the case, the result is a new fragment composed from the common node as root node, and the snapshot is the union of both snapshots.
INCLUDE The combination of the two fragments provides a result only if their are related to the same document, and if the root node of the left operand is a parent node of the root node of the right operand. If this is the case, the result is the fragment corresponding to the left operand which snapshot is extended by including the one of the right operand.
SIBLING The combination of the two fragments provides a result only if their are related to the same document, and if the two root nodes share a parent node. If this is the case, the result is a new fragment composed from this parent node as root node, and the snapshot is the union of both snapshots.
Once a given operator is applied on two fragment lists, we obtain a new fragment list (which may be empty).

3.3 Fragment List Reduction

An operator applied on fragment lists of popular concepts may produce a large fragment list as result (in particular for the OR operator). It may therefore be interesting to study if a fragment list can be reduced, i.e. if some fragments can be merged.
Let us suppose the query “art|programming” and the above document. The first part of the query selects a fragment composed from (title,Art) and the second a fragment formed with (title,Programming). While these two fragments are different, there are in fact related to a same logical fragment (title, Art of Computer Programming). It seems therefore a better idea to replace the two fragments by this logical one.
Let us now suppose the query “pascal|francq” applied on the following document:
<description>
   Pascal is a well-known French philosopher. He also works on optical and mathematical problems. 
   Largo Winch is the principal hero of the comic book by Jean Van Hamme and Philippe Francq.
</description
In this example, it makes more sense to select the two fragments (description,Pascal) and (description,Francq) despite their common parent node because they are separated by a large syntactic distance.
In practice, we suppose that two fragments can be merged if they share the same root node and if the nodes of their snapshots are not separated by a maximal syntactic distance (for example 20). The resulting fragment is then obtained by merging the two snapshots. To reduce a given fragment list, each fragment pair is analyzed to see if they can be merged.

4 Fragment Ranking

Once we obtain a list of document fragments, the next step of the retrieval process is to rank them. Here, we propose to compute a set of criteria for each fragment, and then to use the PROMETHEE multicriteria method for the ranking.

4.1 Document Relevance

We may suppose that “relevant” fragments are contained in “relevant” documents. Therefore, a first criteria is the relevance of the document containing a fragment to the query. It can be seen as an aggregation of the relevance of a document to each concept forming the query. To evaluate the relevance, of a document, , for a given concept, , we can use the classical term frequency () and inverse document frequency () factors [3, 4]: where represents the total number of documents, the number of those containing concept and the number of occurrence of concept in the document .
As aggregating function one could simply add the relevance of each query concept. Such a solution promotes the documents containing the most number of concepts. For example, for the query “( and ) or ( and and )”, the documents corresponding to the right part of the query will have a higher relevance than those corresponding to the left of the query. Since all documents satisfies equally the query, it seems more fair to average each relevance rather than to sum them.
In the tensor space model, documents are represented through multiple vectors, each vector being associated to a meta-concept, . Since each concept of the query may appear in each vector, a specific relevance can be computed for each vector. As for the similarity measure, I propose that the relevance of each vector for each concept it contains is weighted by its size. The relevance, , of a document, , for a given query, , is therefore defined by:
where represents the number of non-zero weights in the vector corresponding to a meta-concept, , in the document tensor, , and represents its relevance for a concept of the query.
Since this criteria is identical for all fragments of a given document, it can be computed once for each document.

4.2 Fragment Relevance

A second criteria is the relevance of each fragment regarding the occurrences of the concepts of the query. As for the document, it can be computed as the average relevance of the fragment for each concept, , contained in the query.
If the concept is a term, it must appear in the snapshot of the fragment (section 3.1↑). This snapshot can then be seen as a “mini document” for which the and (for inverse fragment frequency, an adaptation of the traditional factor for document fragment) factors are used [2]. The relevance, of a fragment, (which root node corresponds to the concept ), for a concept, , can be computed with:
where represents the total number of fragments in all documents having as root node, the number of those containing concept and the occurrence of the concept in the fragment.
If the concept is not a term (for example the XML tag “http://test.org/catalog”:author), it may be the root node of the fragment. In this case, the whole fragment is the information searched and we must assigned to it a much higher value that (for example 1000). If the concept is the root node of the fragment and appears only in its snapshot, we use the same relevance formula as for a term.
The relevance for a given fragment for a query concept can therefore be defined by:
The relevance, , of a fragment, , for a given query, , can be defined by:
where represents the number of concepts of the query in the fragment .

4.3 Fragment Specificity

In traditional information retrieval, users expect that a suggested document is accompanied by a fragment that best “specified” it. Such a fragment should help the users to determine the relevance of the document. Typically, it must be short, ideally one or two sentences giving the context in which the query keywords appeared. For document fragment retrieval, a similar criteria can be used.
To express this criteria, one can argue that “short” fragments are more specific than “large one”. To compute it, we may simply count the number of child nodes of a given fragment (the most there are, the less specific is a given fragment) [2].

4.4 Query Keyword Distances

When a user formulates a query with multiple keywords, it is highly probable that he or she searches for documents with sentences containing these keywords (for example “genetic algorithm” or “linux ubuntu”). In other words, a fragment is more relevant if the query keywords it contains occur close to each other. For flat documents, we can use the syntactic distance between two keywords. But for structured documents, we must also add a vertical distance. In the above document, the “distance” between the terms “fétiche” and “knuth” should be greater than between the terms “internet” and “fétiche” also if they syntactic distance is lesser (in fact, they have different parent nodes).
One way to express such a criteria is to use an average distance between each keyword pairs in the fragment. Such a distance can be computed as the minimum size of a text window that includes the root node of a fragment and both keywords of a pair [2].

4.5 Fragment Type

A last criteria is to suppose that some fragment types are more relevant than others. For example, it is perhaps more interesting to retrieve a chapter title containing the query keywords than a paragraph. A simple way to express such a criteria is to assign a given weight to each fragment regarding the type of its root node. Table 2↓ proposes a weighting schema.
Root node Weight
Textual content 0
Link (such as URI) 1
Semantic (such as XML tag) 2
Attributes (such as XML attributes) 3
Metadata 4
Table 2 Fragment type weights.

5 Experiments

Some experiments were done with the INEX corpus [2], with the document and fragment relevance criteria computed using the sum of the relevance for each concept rather than their average value. Table 3↓ presents the PROMETHEE criteria giving the best results. It interesting to notice that the fragment type criteria appears as irrelevant.
Optimum Weight p q
Document relevant Maximize 1 0.9 0.5
Fragment relevance Maximize 1 0.9 0.5
Specificity Minimize 3 0.9 0.1
Distance Minimize 4 0.6 0.2
Type Maximize 0 - -
Table 3 Criteria parameters.
Since several changes were done since these experiments, new ones should be performed in the future to evaluate the performances.

6 Implementation

The class GEngine implements a generic engine. The document fragment retrieval presented here is available through the subversion repository xmlengine (named for an obvious historical reason). It maintains a structure (stored in a binary file) that holds the necessary information to compute the relevance fragment (it gathers the total number of occurrences of semantic tokens and the number of those containing each textual token).

References

[1] Faiza Abbaci, Jean-Baptiste Valsamis & Pascal Francq, “Index and Search XML Documents by Combining Content and Structure”, In Proceedings of the 2006 Internation Conference on Internet Computing, pp. 107—112, 2006.

[2] Faiza Abbaci & Pascal Francq, “XML components ranking: Which relevant ranking criteria? Which relevant criteria merging?”, In Applications of Digital Information and Web Technologies (ICADIWT) 2008, pp. 216—222, 2008.

[3] Gerard Salton, Automatic Information Organization and Retrieval, McGraw-Hill, 1968.

[4] Gerard Salton, The SMART Retrieval System — Experiments in Automatic Document Processing, Prentice Hall Inc., 1971.