Similarity-based Grouping Genetic Algorithm

Pascal Francq

June 24, 2013 (March 13, 2011)

Abstract

The Similarity-based Grouping Genetic Algorithm (SGGA) is a semi-supervised clustering to group a set of objects.

Table of Contents

Section 1: Introduction

Section 2: Coding

Section 3: Heuristic

Section 4: Operators

Section 5: Evaluation

Section 6: Validation

References

1 Introduction

The Similarity-based Grouping Genetic Algorithm (SGGA) is a semi-supervised clustering to group a set of objects,

. It is an application of the Grouping Genetic Algorihtms (GGA) developed by Falkenauer [1].

The algorithm supposes that three different matrices of measures between two objects

and

exist:

A matrix containing the measures of similarity, .
A matrix containing a human-based agreement ratio between two objects, .
A matrix containing a human-based disagreement ratio between two objects, .

The agreement and disagreement ratios represent the semi-supervised aspects of the algorithm. In practice, the SGGA needs at least the matrix of similarity for each object to group. The other matrices represent some additional constraints.

2 Coding

The SGGA uses the default coding of the GGA proposed by Falkenauer [1].

3 Heuristic

The SGGA implements a variation of a first-fit heuristic. The developed heuristic randomly handles the objects that must be grouped. For each object, the following steps heuristic are performed:

The heuristic search the first object in that is already in a cluster. If one is found, is grouped in the same cluster.
The heuristic parses each cluster to find those that satisfy the following constraints:
1. The cluster does not contains the maximum number of objects allowed (by default, this limit is infinitive).
2. The cluster does not contained any objects of .
3. The cluster does not contained an object having a similarity with lower than a minimum one.
If no valid cluster is found, the heuristic creates a new cluster and puts in it. Otherwise, the cluster containing the object with the highest similarity with is choose.

Once all objects are grouped, the heuristic verifies that each cluster contains at least a minimum number of objects (by default, this value is zero). For each cluster that doesn’t contain this number, the heuristic looks for each objects,

, can be moved in another cluster using the following method:

The heuristic searches a valid cluster containing one of the objects of . If one is found, the object is moved to it.
If no one is found, the heuristic searches a valid cluster that doesn’t contained objects of . If one is found, the object is moved to it.
If all the objects of a cluster were moved, the cluster is removed from the clustering.

4 Operators

The NNGGA uses the default operators of the GGA proposed by Falkenauer [1].

5 Evaluation

To evaluate the chromosomes, the NNGGA uses three criteria. Let us first define two functions:

represents the average values of the measure, , for object with all the other objects grouped of the same cluster.
represents the average values of the measure, , for object with all the other objects not grouped in the same cluster.

The three criteria to optimize are:

A similarity criteria to maximize defined by:
where represents the number of objects.
An agreement criteria to maximize defined by:

where represents the number of objects.
A disagreement criteria to maximize defined by:

where represents the number of objects.

To combine these criteria, the NNGGA uses the PROMETHEE multi-criteria method.

6 Validation

The NNGGA was used in the GALILEI framework for two problems:

The document clustering.
The profile clustering.

References

[1] Emanuel Falkenauer, Genetic Algorithms and Grouping Problems, John Wiley, 1998.