Group Description

Pascal Francq

January 7, 2012 (April 20, 2011)

Abstract

Once an [object clustering] is performed, objects (documents or profiles) are regrouped in groups (topics or communities of interests). It is then necessary to compute a description from this group.

1 Introduction

The GALILEI framework deals with two categories of objects that could be clustered: the documents in topics (each document being assigned to one topic only) and the profiles in communities of interests (each profile being assigned to one community only):
where ) returns the community of interest containing the profile, , and ) returns the topic of the document, .
Since in the knowledge model of the GALILEI framework, both categories are represented as tensors, it is possible to compute a description for topic and a community of interests based on those of the documents and the profiles grouped.
The process may also construct an index for each group in order to provide a search-optimized description form. While the description is optimized to find all the concepts associated to a given group, the latest is optimized to search all the documents containing a given group. GALILEI adopts several mechanisms to store the descriptions and the index.

2 Group Description Computing Methods

2.1 Principle

Let us suppose that each group (topic or community of interests), , consists of a set of objects (documents or profiles), . We define the function as giving the group containing object , and as the number of objects contained in this group.
It can be interesting to describe a group in other terms than its composition. The representation of groups is widely treated in the literature [1, 2, 3], but here an intuitive approach is used. In fact, a group can be interpreted as the “ideal object” representing the relevant characteristics of all the objects grouped together. Regarding information retrieval, describing a group correspond to the query that will retrieve all the relevant documents (and only those). In traditional information retrieval systems, queries are represented through index terms consisting of an index term space. Here, the space is extended and tensors represented documents.
The groups can therefore be described through tensors in the concept space. This choice has two main advantages:
  1. It is very easy and fast since it is based on matrix operations.
  2. It uses the same space as the documents and the profiles. The similarity between a community of interests and a profile, or between a topic and a document can therefore be computed using the measure defined in the GALILEI framework.
The prototype and the centroid methods were developed to compute the description of the groups from the object descriptions.

2.2 Prototype Method

This solution is based on the assumption that the prototype of a group describes the best this virtual community. The prototype, , of a group, , is the object with the highest sum of similarities with the object belonging to this group. In order words, it is the object that is most similar to all the other objects of the same group.
Let us that a similarity measure, between two objects, and , exists. The sum of the similarities, , with all other objects grouped into the same group is computed for each object, , of a group, :
The prototype, , of a group, , is defined as:
An interesting property of this method is that it only needs a measure of similarity between objects. This method can therefore be used with any type of object description and is not dependent on the particular tensor space model used. Another feature that could be interesting is that the description of a group (topic or community of interests) correspond to a real object (document or profile). In the case of profiles, the latest may be seen as a “spokesperson” of his or her community.

2.3 Centroid Method

As previously explained, the group descriptions used are based on the tensor space model and the objects are described as tensors in the concept space. Another way to describe a group is to compute the “center of gravity” of the set of tensors representing the objects of the group, i.e. the group is considered as an “average object”.
The tensor, , representing a virtual community, , is defined as the weighted sum of the tensors, , of its objects. By using the tensor operations and the concept weighting adapted to the tensor space model, we can compute this tensor:
While this method seems a natural description, it could not work with other types of object descriptions. The method can only be used when the object descriptions are expressed in a same space as the one of the groups.

3 Group Description

3.1 Negative Weights

Since the weights of the concepts in the profile descriptions can be negative (because a subtraction is used in the feedback profile computing method), negative weights can also appear in the community of interests descriptions. Let us consider two communities of interests, and , that have the same concepts in the same context with above or below zero weights, i.e. and . When comparing the weight of a given concept in a particular context, four possibilities exist:
 and  For both communities the concept appears to be relevant. The communities are probably similar.
 and  For community , whereas the concept appears as interesting, it does not appear as such in . The communities are probably dissimilar.
 and  For community , whereas the concept does appear as irrelevant, it appears as relevant in . The communities are probably dissimilar.
 and  For both communities, the concept does not appear as interesting. Nothing can be supposed concerning the similarity of the communities.
In fact, when two communities share a large number of concepts with negative weights, no conclusions can be drawn concerning their similarity.

3.2 Non-negative Weights

When the number of documents and profiles increased, the matrices corresponding to the tensors becomes less sparse. The consequence is an increase of the amount of memory needed to manage the topics and the communities of interests, but also a higher computation time. It may therefore be interesting to limit the number of non-negative weights associated to each vector. The most intuitive approach is, of course, to take only the most weighted concepts into account for each vector.
Another approach could be to take the only the most weighted concepts, but also the most negative weights. The idea behind is to include in the topic and communities descriptions the concepts that are the less relevant in order to hold off the most dissimilar profiles and documents.

4 Comparisons of the Group Computing Methods

Previous tests were done with the classical vector space model for the communities of interests [4]. Both methods give similar results, but the centroid method seems to be slightly better than the prototype one. This can be explained by the facts that the centroid method is an “average” profile, i.e. it is more similar to each profile in the community than the prototype of the community.
New tests will be performed with the tensor space model in a near future in order to compare the different methods proposed.

5 Implementation

The two group description computing methods are implemented in the gravitation plug-ins. It implements a class for the topics (GTopicCalcGravitation) and one for the communities if interests (GCommunityCalcGravitation). Each class provides a method for the prototype method, ComputePrototype (which uses the method GGroup::RelevantObj), and a method for the centroid method, ComputeCentroid.

References

[1] Odell Duran, Cluster Analysis: A Survey, Springer-Verlag, 1974.

[2] Edwin Diday & J. C. Simon, “Clustering analysis”, Digital Pattern Recognition, King S. Fu (ed.), pp. 47—94, Springer Verlag, 1976.

[3] Ryszard S. Michalski, Robert E. Stepp & Edwin Diday, “A Recent Advance in Data Analysis: Clustering Objects In Classes Characterized by Conjunctive Concepts in Progress”, Pattern Recognition, Laveen N. Kanal & Azriel Rosenfeld (ed.), pp. 35—56, North-Holland Publishing, 1981.

[4] Pascal Francq, Collaborative and Structured Search: An Integrated Approach for Sharing Documents Among Users, PhD Thesis, Université libre de Bruxelles, 2003.