Profile Description

Pascal Francq

January 21, 2014 (April 20, 2011)

Abstract

Computing a profile description is the process of describing a user’s interest based on assessments he or she made on documents. The method proposed here is an adaptation of the user relevance feedback methods.
A profile represents a user’s interest. In the knowledge model of the GALILEI framework, the choice is to describe profiles as tensors. Computing a profile description is therefore the process of building a tensor that models the corresponding interest.
The process may also construct an index for each profile in order to provide a search-optimized description form. While the description is optimized to find all the concepts associated with a given profile, the latter is optimized to search all the profiles containing a given concept. The GALILEI framework adopts several mechanisms to store the descriptions and the index.

1 User Relevance Feedback

It is difficult for most users to construct the type of query which best describes their interests. A user query can be seen as a “bad” approximation of the query that would retrieve all the relevant documents in a collection. An idea may be to reformulate the query. User relevance feedback is a technique based on feedback from a user on the documents retrieved through an initial query; it is decomposed into the following steps:
  1. The user formulates a query.
  2. Based on this query, a set of documents is retrieved.
  3. The user examines the top 10 (or 20) ranked documents and marks the ones which are relevant.
  4. The system uses information extracted from these relevant documents to modify the original query.
  5. Based on this modified query, a new set of documents is retrieved.
Figure 1↓ illustrates the user relevance feedback process.
figs/UserRelevanceFeedback.svg
Figure 1 User Relevance Feedback.
This method is well-suited to the context of profile descriptions. Some experiments on the SMART system [1] with the vector model and, later, the probabilistic weighting model [2] have shown that query reformulation improves retrieval performance in the case of small text collections. Moreover, the interest of user relevance feedback for large document collections was also shown, in particular on the Web [3].
There are two ways to reformulate a query. Firstly, some new index terms can be added, a process called query expansion. Secondly, the weights assigned to the initially used index terms can be changed, the so-called term re-weighting.
The first advantage of user relevance feedback is that a user only has to focus on the relevance of the documents and not on the underlying details of the query. Secondly, the user copes with multiple small steps rather than one big task, which makes information retrieval easier. Thirdly, the process checks the importance of the terms against the relevance of the corresponding documents.
There are other approaches to reformulate queries based on term clustering:
  • using information derived from initially retrieved documents (“Automatic Local Analysis”) [4, 5];
  • using information derived from the complete collection (“Automatic Global Analysis”) [6, 7].
These approaches were not studied in the GALILEI project because they cannot be easily adapted to propose new profile computing methods.

1.1 Query Expansion and the Vector Model

In the traditional vector space model, the application of user relevance feedback is based on the assumption that the term-weight vectors of the documents assessed to be relevant are similar to each other. Furthermore, the documents considered to be irrelevant have term-weight vectors unlike those of the relevant documents. The basic idea is to reformulate the initial query so that it moves closer to the term-weight vectors of the relevant documents.
Let us suppose first of all that the complete set, $C_r$, of relevant documents for a given query is known. In this situation, it has been demonstrated that the best query vector for distinguishing between the relevant and the irrelevant documents is given by:
$$\vec{q}_{opt}=\frac{1}{|C_r|}\sum_{\vec{d}_j\in C_r}\vec{d}_j-\frac{1}{|\mathcal{D}|-|C_r|}\sum_{\vec{d}_j\in\mathcal{D}\setminus C_r}\vec{d}_j$$
where $\mathcal{D}$ is the document corpus.
Of course, an a priori knowledge of $C_r$ is completely unrealistic. So, to avoid this problem, a better idea is to incrementally update the initial query. An incremental change is only carried out on the documents retrieved and assessed by the user. Let $D_r$ be the set of relevant documents among the documents retrieved and $D_n$ the set of irrelevant ones. There are three conventional ways to compute the modified query, $\vec{q}_m$:
$$\textrm{Standard Rocchio:}\quad\vec{q}_m=\alpha\vec{q}+\frac{\beta}{|D_r|}\sum_{\vec{d}_j\in D_r}\vec{d}_j-\frac{\gamma}{|D_n|}\sum_{\vec{d}_j\in D_n}\vec{d}_j$$
$$\textrm{Ide Regular:}\quad\vec{q}_m=\alpha\vec{q}+\beta\sum_{\vec{d}_j\in D_r}\vec{d}_j-\gamma\sum_{\vec{d}_j\in D_n}\vec{d}_j$$
$$\textrm{Ide Dec Hi:}\quad\vec{q}_m=\alpha\vec{q}+\beta\sum_{\vec{d}_j\in D_r}\vec{d}_j-\gamma\,\max_{rank}(D_n)$$
where $\max_{rank}(D_n)$ is a reference to the highest ranked irrelevant document, and $\alpha$, $\beta$ and $\gamma$ are tuning constants. In the original formulations, Rocchio [8] specifies $\alpha=1$ and Ide [9] specifies $\alpha=\beta=\gamma=1$; the expressions shown are the modern variants. In fact, the three methods give similar results with respect to their information retrieval performances.
Rocchio’s formulation is derived directly from the expression of $\vec{q}_{opt}$, to which the original query $\vec{q}$ is added. This is justified by the fact that the original query may contain important information. Generally, the information contained in the relevant retrieved documents is more important than that found in the irrelevant retrieved documents [10]. Translated into parameter values, $\gamma$ is smaller than $\beta$. It is also possible to take $\gamma=0$, which is called a positive feedback strategy.
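To make the reformulation concrete, the following minimal sketch implements the standard Rocchio formula in Python, under the assumption that queries and documents are represented as dictionaries mapping index terms to weights; the function name and the default values of the tuning constants are illustrative only.

    from collections import defaultdict

    def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        """Standard Rocchio: expand and re-weight a query (term -> weight)."""
        modified = defaultdict(float)
        for term, weight in query.items():
            modified[term] += alpha * weight
        for doc in relevant:  # move the query towards the relevant centroid
            for term, weight in doc.items():
                modified[term] += beta * weight / len(relevant)
        for doc in irrelevant:  # and away from the irrelevant one
            for term, weight in doc.items():
                modified[term] -= gamma * weight / len(irrelevant)
        return dict(modified)

    # Toy run: the modified query is expanded with the term "album".
    q = rocchio({"deep": 1.0, "purple": 1.0},
                relevant=[{"deep": 2.0, "purple": 2.0, "album": 1.0}],
                irrelevant=[{"paint": 3.0, "purple": 1.0}])

Note that this sketch performs both query expansion (new terms appear) and term re-weighting (existing weights change) in a single pass.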

1.2 Evaluation of Relevance Feedback Strategies

Suppose that we want to compute the retrieval performance of the modified query vector, $\vec{q}_m$, generated by the Rocchio method. A natural approach is to retrieve a set of documents using this query, to evaluate the ranking with the traditional measures for a set of retrieved documents, and to compare the result with the same measures applied to the initial query, $\vec{q}$. The difference in their performances is generally impressive. The problem is that this method is not “honest”, because some of the highly ranked relevant documents retrieved by $\vec{q}_m$ are those already identified as relevant by the user during the feedback process [11]. Thus, this comparison is not meaningful, because these documents have already been seen by the user and cannot be considered as a direct benefit of the modified query.
A better solution to evaluate $\vec{q}_m$ is to compute the measures on the residual collection only, in other words, on the set of all the documents minus those examined by the user. Because highly ranked documents are removed, the performance of $\vec{q}_m$ may be lower than that of the initial query, $\vec{q}$. This method can therefore only be used to compare different relevance feedback strategies, as the sketch below illustrates.
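A minimal sketch of this residual evaluation, assuming the ranking is given as a list of document identifiers; the function name and signature are illustrative:

    def residual_precision_at_k(ranking, relevant, seen, k=10):
        """Precision at k on the residual collection: the documents already
        examined by the user during the feedback step are removed first."""
        residual = [doc for doc in ranking if doc not in seen]
        return sum(1 for doc in residual[:k] if doc in relevant) / k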

2 Profiles and Assessments

2.1 The Assessments

Since users may have multiple interests, the information needed by the system depends on each individual interest. Users are therefore made up of profiles, each profile corresponding to a particular interest. When a user creates a profile, he or she does not have to define it: a profile is just a label identifying a particular interest. When a user interacts with the system, he or she must always specify an active profile. To compute the profile descriptions, our approach is based on the relevance assessments on documents and the analysis of their content.
As an example, let us suppose that a profile is interested in the rock band “Deep Purple”, and different documents have been proposed by the system:
  1. A document about the “In Rock” album, which is one of the rock band’s albums.
  2. A document about the rock band “Black Sabbath”, for which some “Deep Purple” musicians have played.
  3. A document about Nino Tempo, a singer who made a record “deep purple”, which inspired the rock band’s name.
  4. A document about a house painted deep purple, a color which has nothing to do with the subject.
Most information retrieval systems only support user feedback of the type “within the scope of the subject” or “outside the scope of the subject”, which poorly captures user preferences. For instance, it is evident that the first document concerns Deep Purple, and that the fourth has nothing to do with the rock band. For the second and the third documents, the answer is not trivial. In fact, while these documents do not directly concern “Deep Purple”, they are indirectly related to the subject, i.e. they could contain useful concepts corresponding to the subject without actually falling exactly within its scope.
Another, more sophisticated, method involves a rating, for example a number between 1 and 10, for each document consulted. In our example, the first document would obtain the highest score and the fourth the lowest one. A specific score could be assigned to the second and the third documents, but the exact value would be somewhat arbitrary. Because of this, this method is likely to be too complicated for most users.
Based on these observations, an intermediate solution based on the three assessments presented in Table 1↓ has been adopted. The “fuzzy relevant” and “irrelevant” categories together contain all the documents that are not directly part of the subject.
Assessment Description
Relevant The document is interesting.
Fuzzy relevant The document is not interesting, but falls partially within the scope of the domain.
Irrelevant The document is outside the scope of the domain.
Table 1 Assessments.
With this classification, the first document is assessed to be relevant and the fourth to be irrelevant. For the second and the third documents, the user could opt for the fuzzy relevant assessment. Whether both documents are considered to be fuzzy relevant, or the third one to be irrelevant, is a subjective question which may depend on user preferences.
The main advantage of this classification is its user-friendliness. When users are not sure about the relevance of a document, they assess it to be fuzzy relevant. In fact, this approach is easier than associating a specific rating with each document.
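As a simple sketch, the three-level scale could be represented as follows (hypothetical Python names, not the actual GALILEI API):

    from enum import Enum

    class Assessment(Enum):
        RELEVANT = 1        # the document is interesting
        FUZZY_RELEVANT = 2  # partially within the scope of the domain
        IRRELEVANT = 3      # outside the scope of the domain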
Another way to understand the role of the fuzzy relevant assessment is shown in Figure 2↓. This figure represents the set of concepts and how the different types of documents relate to these concepts. The “Interest” set includes all the concepts in which a profile is interested. The “relevant” documents contain a set of “relevant” concepts which constitutes a subset of “Interest”. The “fuzzy relevant” documents define a set of “fuzzy relevant” concepts, which may contain some concepts that also belong to “Interest” and “relevant”.
figs/DocAssessment.svg
Figure 2 Assessments and concepts.
In fact, the intersection of the “relevant” and “fuzzy relevant” sets, which corresponds to the dashed area of Figure 2↑, provides a list of concepts whose discrimination values are too low to isolate the interesting part from the overall subject. But the “fuzzy relevant” documents still contain some interesting concepts, while the “irrelevant” ones do not contain any at all. So the “fuzzy relevant” documents cannot be treated in the same way as the “relevant” documents, because most of the concepts contained in them are probably not relevant; but they cannot be treated as “irrelevant” documents either, because they certainly include some relevant concepts. Therefore, these “fuzzy relevant” documents must be handled independently of the other types of documents, which justifies the corresponding assessment.

2.2 Users, Profiles and Documents

Rigorously, the GALILEI framework defines the set of users, $\mathcal{U}$, and the set of profiles, $\mathcal{P}$:
$$\mathcal{U}=\{u_1,u_2,\ldots\}\qquad\mathcal{P}=\{p_1,p_2,\ldots\}$$
where $u(p)$ returns the user associated with profile $p$ (a profile is associated with one and only one user).
Based on the assessments, we may associate with each profile, $p$, four document subsets:
 $D_R(p)$: the documents assessed as relevant by the profile.
 $D_F(p)$: the documents assessed as fuzzy relevant by the profile.
 $D_I(p)$: the documents assessed as irrelevant by the profile.
 $D(p)$: all the documents assessed by the profile.
Moreover, we may associate with each document, $d$, four profile subsets:
 $P_R(d)$: the profiles assessing the document as relevant.
 $P_F(d)$: the profiles assessing the document as fuzzy relevant.
 $P_I(d)$: the profiles assessing the document as irrelevant.
 $P(d)$: all the profiles assessing the document.
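These eight subsets can be derived from a single store of assessments. A minimal sketch, assuming the assessments are kept in a dictionary mapping (profile, document) pairs to one of the three labels; all names are illustrative:

    # Hypothetical store: (profile, document) -> "relevant" | "fuzzy" | "irrelevant".
    assessments = {("p1", "d1"): "relevant", ("p1", "d2"): "fuzzy"}

    def docs(profile, label=None):
        """D_R(p), D_F(p) or D_I(p) when a label is given, D(p) otherwise."""
        return {d for (p, d), a in assessments.items()
                if p == profile and label in (None, a)}

    def profiles(document, label=None):
        """P_R(d), P_F(d) or P_I(d) when a label is given, P(d) otherwise."""
        return {p for (p, d), a in assessments.items()
                if d == document and label in (None, a)}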

3 Principle of Locality

Whatever the method used to compute a profile, it exploits, in one way or another, the descriptions of the documents assessed for that profile. By default, the documents are described as tensors, where the different elements represent the importance of the concepts for a particular description.
To use these tensors, it is necessary to adapt the concept weighting. Such an adaptation weights the importance of a concept in a particular context of a document with regard to a given set of documents. This set can be the whole corpus or a subset of it.
In physics, the principle of locality states that an object is influenced directly only by its immediate surroundings [A]  [A] Principle of locality, Wikipedia, consulted December 21, 2010. If we apply this principle when computing a profile description, only the documents assessed should be taken into account. Concerning the concept weighting, it implies that the reference set should be limited to the documents assessed by that particular profile (rather than the whole document corpus). In fact, we can go further: when only the descriptions of the documents assessed as relevant are used, the concept weighting is limited to this particular subset.
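The sketch below illustrates this idea on an idf-like weighting: the document frequencies are computed over a locally chosen subset (for example the documents assessed as relevant by one profile) instead of the whole corpus. The dictionary representation and the function name are assumptions, not the actual GALILEI weighting scheme.

    import math

    def local_idf(subset):
        """Inverse document frequencies restricted to a subset of documents.

        Following the principle of locality, `subset` contains only the
        vectors of the documents assessed by one profile (or even only
        those assessed as relevant), not the whole corpus."""
        df = {}
        for vector in subset:
            for concept in vector:
                df[concept] = df.get(concept, 0) + 1
        return {c: math.log(len(subset) / f) for c, f in df.items()}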

4 Relevance Profile Computing Method

In this section, a profile computing method based only on the documents assessed as relevant is described. We know that all the concepts in a document do not have the same importance; therefore, a specific weight is associated with each concept. The underlying hypothesis is that the semantic load is independent of the physical division of the documents, i.e. whether two different documents or the corresponding single document [B]  [B] This document can be obtained using “copy/paste” of both documents. are analyzed, the result must be the same. The idea, shown in Figure 3↓, is to compute the description by adding all the documents of the “relevant” subset as though they constitute a single one.
figs/ProfilesRelevant.svg
Figure 3 Relevance profile computing method.
By using the tensor operations and the concept weighting adapted to the tensor space model, we can compute the tensor, $\mathbf{p}$, describing the profile with:
$$\mathbf{p}=\sum_{d\in D_R(p)}\mathbf{d}$$
where $\mathbf{d}$ is the tensor describing document $d$.
So the weight of each concept in the profile description corresponds to the weight it would receive in a single document of the tensor space model, where that document is the concatenation of the set of documents assessed to be relevant by the profile.
This equation supposes that all the relevant documents are used to compute the profile description. It is also possible to propose an alternative incremental formula to compute $\mathbf{p}(t+1)$ at time $t+1$ based on the tensor $\mathbf{p}(t)$ at time $t$:
$$\mathbf{p}(t+1)=\lambda\,\mathbf{p}(t)+\sum_{d\in\Delta D_R(p)}\mathbf{d}$$
where $\lambda$ is a tuning constant (for example 1), and $\Delta D_R(p)$ represents the latest documents assessed as relevant.
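A minimal sketch of both formulas, under the assumption that document and profile tensors are flattened to dictionaries mapping concepts to weights; the names are illustrative:

    def relevance_description(relevant_vectors):
        """Sum the relevant documents as though they constituted a single one."""
        profile = {}
        for vector in relevant_vectors:
            for concept, weight in vector.items():
                profile[concept] = profile.get(concept, 0.0) + weight
        return profile

    def incremental_update(profile, new_relevant_vectors, lam=1.0):
        """Damp the previous tensor by lambda, then add the latest assessments."""
        updated = {concept: lam * weight for concept, weight in profile.items()}
        for vector in new_relevant_vectors:
            for concept, weight in vector.items():
                updated[concept] = updated.get(concept, 0.0) + weight
        return updated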

5 Feedback Profile Computing Method

In this section, a profile computing method based on the user relevance feedback is described. There is a parallel between queries submitted to search engines and profiles: a profile can be seen as an ideal query that retrieves all the relevant documents for a given interest from a corpus (and only those).
The user relevance feedback methods presented in section 1↑ for the vector model are based on the breakdown of a collection of documents into a “relevant” and an “irrelevant” subset. The documents of the “irrelevant” subset may still contain some index terms describing useful concepts of the user’s interest, because they were retrieved through the query, even though they do not cover the interest itself. As explained, the GALILEI framework supposes three assessment types, each corresponding to a particular document subset. The “relevant” and “irrelevant” subsets have the same meaning as those defined by the user relevance feedback methods. The “fuzzy relevant” subset may contain some interesting concepts (see section 2↑).
It is possible to create a computing method based on the formulas provided by query expansion by taking three differences into account:
  1. In the case of profiles, the initial queries could correspond to their previous descriptions (this is useful for an incremental mode).
  2. While it is clear that the “fuzzy relevant” subset must influence the profile computation, two hypotheses exist:
    Optimist Most of the concepts contained in the “fuzzy relevance” subset are interesting, i.e. they must be “added” to those of the “relevant” documents.
    Pessimist Most of the concepts contained in the “fuzzy relevance” subset are not interesting, i.e. they must be “subtracted” from those of the “relevant” documents.
  3. As for the user relevance feedback methods, the concepts contained in the documents of the “irrelevant” subset must be “subtracted”.
To compute the tensor, $\mathbf{p}$, describing profile $p$, we use a linear combination of the vectors of the corresponding assessed documents. Depending on the assessment of a document, its description is “added” to or “subtracted” from $\mathbf{p}$ with more or less importance. By using the tensor operations and the concept weighting adapted to the tensor space model, we can compute the tensor with:
$$\mathbf{p}=\beta\sum_{d\in D_R(p)}\mathbf{d}+\gamma\sum_{d\in D_F(p)}\mathbf{d}+\delta\sum_{d\in D_I(p)}\mathbf{d}$$
where $\mathbf{d}$ is the tensor describing document $d$, and $\beta$, $\gamma$ and $\delta$ are factors representing the importance of each document subset of the profile.
Of course, the relevance profile computing method is the particular case where $\beta=1$, $\gamma=0$ and $\delta=0$.
We always have $\beta>0$ and $\delta<0$. The sign of $\gamma$ depends on the hypothesis adopted for the corresponding subset: for the optimist hypothesis, $\gamma>0$; for the pessimist one, $\gamma<0$. Because the concepts of the “relevant” subset are more important than those of the other subsets, the corresponding factor must be higher than the others. Because the “fuzzy relevant” subset may contain some interesting concepts, the corresponding factor must be higher than that of the “irrelevant” subset. We can therefore make the following assumption concerning the values of the factors $\beta$, $\gamma$ and $\delta$:
$$\beta>\gamma>\delta$$
Two particular values must be examined:
 $\delta=0$: The “irrelevant” subset is supposed to have no concept able to characterize a profile. Moreover, because the number of documents assessed as irrelevant can increase very fast during a profile’s lifetime, using these documents to compute the profile may increase the computational time.
 $\gamma=0$: The “fuzzy relevant” subset is supposed to contain equally interesting and uninteresting information, and is therefore not taken into account.
It is also possible to propose an alternative incremental formula to compute $\mathbf{p}(t+1)$ at time $t+1$ based on the tensor $\mathbf{p}(t)$ at time $t$:
$$\mathbf{p}(t+1)=\lambda\,\mathbf{p}(t)+\beta\sum_{d\in\Delta D_R(p)}\mathbf{d}+\gamma\sum_{d\in\Delta D_F(p)}\mathbf{d}+\delta\sum_{d\in\Delta D_I(p)}\mathbf{d}$$
where $\lambda$ is a tuning constant (for example 1), $\Delta D_R(p)$ represents the latest documents assessed as relevant, $\Delta D_F(p)$ those assessed as fuzzy relevant, and $\Delta D_I(p)$ those assessed as irrelevant.
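A sketch of the linear combination, with illustrative factor values satisfying $\beta>\gamma>\delta$ under the optimist hypothesis; setting beta=1, gamma=0 and delta=0 recovers the relevance profile computing method:

    def feedback_description(relevant, fuzzy, irrelevant,
                             beta=1.0, gamma=0.5, delta=-0.5):
        """Combine the assessed document vectors with per-subset factors."""
        profile = {}
        subsets = ((beta, relevant), (gamma, fuzzy), (delta, irrelevant))
        for factor, vectors in subsets:
            for vector in vectors:
                for concept, weight in vector.items():
                    profile[concept] = profile.get(concept, 0.0) + factor * weight
        return profile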

6 Profile Descriptions

6.1 Negative Weights

An examination of the expressions of the feedback profile computing method shows that at least one factor is negative. Because the tensor of a document, $\mathbf{d}$, contains only positive or zero weights, a profile tensor may contain negative weights. In particular, if a concept only appears in the documents assessed as irrelevant, the corresponding weight in the profile tensor will be highly negative.
The following interpretation is very intuitive:
  • A highly positive weight corresponds to a concept which appears much more frequently in the documents considered to contain many interesting concepts.
  • A weight close to zero corresponds to a concept which appears equally in all the documents assessed.
  • A highly negative weight corresponds to a concept which appears much more frequently in the documents considered to contain no interesting concepts.
Let us consider two profiles, $p_1$ and $p_2$, whose descriptions contain the same concepts in the same contexts, with weights above or below zero. When comparing the weights, $w_1$ and $w_2$, of a given concept in a particular context, four possibilities exist:
 $w_1>0$ and $w_2>0$: For both profiles the concept appears to be interesting. The profiles are probably similar.
 $w_1>0$ and $w_2<0$: Whereas the concept appears interesting for profile $p_1$, it does not appear as such for $p_2$. The profiles are probably dissimilar.
 $w_1<0$ and $w_2>0$: Whereas the concept does not appear interesting for profile $p_1$, it appears as such for $p_2$. The profiles are probably dissimilar.
 $w_1<0$ and $w_2<0$: For both profiles the concept appears uninteresting. Nothing can be supposed concerning the similarity of the profiles.
In fact, when two profiles share a large number of concepts with negative weights, no conclusions can be drawn concerning their similarity. Let us suppose that one of the profiles is interested in “genetic algorithms” and the other in the rock band “Black Sabbath”. Whereas both profiles may consider a document on “soccer” to be irrelevant, i.e. the concepts related to soccer will have negative weights in both descriptions, it is evident that these profiles have nothing in common.
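One way to translate this observation into code is a cosine-like measure in which the product of two negative weights is simply ignored. This is only an illustration of the interpretation above, not the similarity measure actually used by the GALILEI framework:

    import math

    def profile_similarity(p1, p2):
        """Cosine-like similarity where shared negative weights do not count."""
        dot = 0.0
        for concept in p1.keys() & p2.keys():
            w1, w2 = p1[concept], p2[concept]
            if w1 < 0 and w2 < 0:
                continue  # both uninteresting: no conclusion can be drawn
            dot += w1 * w2
        n1 = math.sqrt(sum(w * w for w in p1.values()))
        n2 = math.sqrt(sum(w * w for w in p2.values()))
        return dot / (n1 * n2) if n1 and n2 else 0.0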

6.2 Non-null Weights

When the number of documents assessed increases, the matrix corresponding to the tensor becomes less sparse. The consequence is an increase in the amount of memory needed to manage the profiles, but also a higher computation time. It may therefore be interesting to limit the number of non-null weights associated with each vector. The most intuitive approach is, of course, to keep only the most weighted concepts of each vector.
Another approach could be to keep not only the most weighted concepts, but also the most negatively weighted ones. The idea behind this is to include in the profile description the least relevant concepts in order to hold off the most dissimilar profiles and documents.
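A sketch of such a truncation, keeping the most positive and, optionally, the most negative weights; the parameter names and default values are illustrative:

    def truncate(profile, k_pos=100, k_neg=0):
        """Keep only the k_pos highest and the k_neg most negative weights.

        Assumes the profile holds more than k_pos + k_neg concepts."""
        ranked = sorted(profile.items(), key=lambda item: item[1], reverse=True)
        kept = ranked[:k_pos] + (ranked[-k_neg:] if k_neg else [])
        return dict(kept)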

7 Comparisons of the Profile Computing Methods

The purpose behind computing and describing profiles is to group them into communities of interests. The validation method proposes to construct profiles from test collections containing categorized documents, i.e. each document belongs to at least one ideal topic. When different profiles assess documents from the same ideal topic to be interesting, these profiles must be grouped together. The important point is thus to distinguish the profiles linked to a given ideal topic from all the others. To determine the quality of the profile computing methods, the following criteria are therefore checked on the statistics of the test collections (see the sketch after this list):
  • Profiles linked to the same ideal topic should have a much higher degree of similarity than profiles linked to different ideal topics.
  • The minimum similarity between the profiles linked to the same ideal topic should be much higher than the similarities between profiles linked to different ideal topics.
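These two criteria can be checked with a short computation over all profile pairs. A minimal sketch, assuming each profile has a known ideal topic and a pairwise similarity function is available; all names are hypothetical:

    from itertools import combinations
    from statistics import mean

    def topic_statistics(profile_ids, topic_of, similarity):
        """Similarity statistics within and across ideal topics.

        Assumes at least one intra-topic and one inter-topic pair exist."""
        intra, inter = [], []
        for a, b in combinations(profile_ids, 2):
            sim = similarity(a, b)
            (intra if topic_of[a] == topic_of[b] else inter).append(sim)
        return {"mean_intra": mean(intra), "mean_inter": mean(inter),
                "min_intra": min(intra), "max_inter": max(inter)}

The first criterion compares mean_intra with mean_inter; the second compares min_intra with max_inter.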
Moreover, the question to be answered is: Which computing method gives the best results?
A previous series of tests carried out with the classical vector space model (which is a particular case of the tensor space model) has led to some conclusions [12]:
  1. The feedback profile computing method is the best one.
  2. The optimist hypothesis gives the most interesting results.
  3. The number of non-null weights can be limited.
  4. The best values for the factors $\beta$, $\gamma$ and $\delta$ were determined experimentally (see [12]).
New tests will be performed with the tensor space model in the near future in order to compare the different methods proposed.

8 Implementation

The feedback plug-in implements the feedback profile computing method (since the relevance profile computing method corresponds to the particular parameter values $\beta=1$, $\gamma=0$ and $\delta=0$).

References

[1] Gerard Salton, The SMART Retrieval System — Experiments in Automatic Document Processing, Prentice Hall Inc., 1971.

[2] Stephen Robertson & Karen Spärck Jones, “Relevance Weighting of Search Terms”, Journal of the American Society for Information Science, 27(3), pp. 129–146, 1976.

[3] Venkat N. Gudivada, Vijay V. Raghavan, William Grosky & Rajesh Kasanagottu, “Information Retrieval on the World Wide Web”, IEEE Internet Computing, 1(5), pp. 56–68, 1997.

[4] Rony Attar & Aviezri S. Fraenkel, “Local Feedback in Full-Text Retrieval Systems”, Journal of the ACM, 24(3), pp. 397–417, 1977.

[5] Jinxi Xu & W. Bruce Croft, “Query Expansion Using Local and Global Document Analysis”, In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 4–17, 1996.

[6] Carolyn J. Crouch & Bokyung Yang, “Experiments in Automatic Statistical Thesaurus Construction”, In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 77–88, 1992.

[7] Yonggang Qiu & Hans-Peter Frei, “Concept Based Query Expansion”, In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 160–169, 1993.

[8] J. J. Rocchio, “Relevance Feedback in Information Retrieval”, In The SMART Retrieval System — Experiments in Automatic Document Processing, Gerard Salton (ed.), pp. 313–323, Prentice Hall Inc., 1971.

[9] Eleanor Ide, “New Experiments in Relevance Feedback”, In The SMART Retrieval System — Experiments in Automatic Document Processing, Gerard Salton (ed.), pp. 337–354, Prentice Hall Inc., 1971.

[10] Gerard Salton & Michael McGill, Modern Information Retrieval, McGraw-Hill Book Co., 1983.

[11] William B. Frakes & Ricardo Baeza-Yates, Information Retrieval: Data Structures & Algorithms, Prentice Hall, 1992.

[12] Pascal Francq, Collaborative and Structured Search: An Integrated Approach for Sharing Documents Among Users, PhD Thesis, Université libre de Bruxelles, 2003.