Nearest Neighbors Grouping Genetic Algorithm

Pascal Francq

June 24, 2013 (March 13, 2011)

Abstract

The Nearest Neighbors Grouping Genetic Algorithm (NNGGA) is a semi-supervised clustering to group a set of objects.

1 Introduction

The Nearest Neighbors Grouping Genetic Algorithm (NNGGA) is a semi-supervised clustering to group a set of objects, . It combines a Grouping Genetic Algorihtms (GGA) with an approach based on the nearest neighbors.
The algorithm supposes that three different measures between two objects and exist:
  1. A measure of similarity, .
  2. A human-based agreement ratio between two objects, .
  3. A human-based disagreement ratio between two objects, .
The agreement and disagreement ratios represent the semi-supervised aspects of the algorithm. If they are defined for some objects, the NNGGA uses them as constraints for the clustering.
Most clustering algorithms suppose that a given matrix exists defining some kind of similarity measure between all the objects to cluster. Despite the fact that it is often a sparse matrix, it can still have a high price to compute and store. Moreover, in our case, we want to exploit three measures which implies to compute three such matrices. Therefore, as its name suggests, the NNGGA uses only the nearest neighbors of each object in regards with these three measures. In practice, for each object , the NNGGA exploits the following lists:
  1. The nearest neighbors of object (those objects we have the highest similarity with it), .
  2. A list of objects sharing relevance agreements with , .
  3. A list of objects having some relevance disagreements with , .
In practice, the NNGGA needs at least the nearest neighbors for each object to group. The other lists represent some additional constraints.
The NNGGA is a particular application of the GGA developed by Falkenauer [1]. In fact, it is adapted from the Similarity-based Grouping Genetic Algorihtms (SGGA) when the number of objects is too high to compute (or maintain) these matrices of measures.

2 Coding

The NNGGA uses the default coding of the GGA proposed by Falkenauer [1].

3 Heuristic

The NNGGA implements a variation of a first-fit heuristic. The developed heuristic randomly handles the objects that must be grouped. For each object, the following steps heuristic are performed:
  1. The heuristic search the first object in that is already in a cluster. If one is found, is grouped in the same cluster.
  2. Otherwise, the heuristic counts the number of objects of in each cluster.
  3. The clusters are then treated in descending order of these numbers. The heuristic finds the first "valid" cluster that satisfy the following constraints:
    1. The cluster does not contains the maximum number of objects allowed (by default, this limit is infinitive).
    2. The cluster does not contained any objects of .
    3. The cluster does not contained an object having a similarity with lower than a minimum one.
  4. If no valid cluster is found, the heuristic creates a new cluster and puts in it.
Once all objects are grouped, the heuristic verifies that each cluster contains at least a minimum number of objects (by default, this value is zero). For each cluster that doesn’t contain this number, the heuristic looks for each objects, , can be moved in another cluster using the following method:
  1. The heuristic searches a valid cluster containing one of the objects of . If one is found, the object is moved to it.
  2. If no one is found, the heuristic searches a valid cluster that doesn’t contained objects of . If one is found, the object is moved to it.
  3. If all the objects of a cluster were moved, the cluster is removed from the clustering.

4 Operators

The NNGGA uses the default operators of the GGA proposed by Falkenauer [1].

5 Evaluation

To evaluate the chromosomes, the NNGGA uses three criteria. Let us first define two functions:
  1. represents the average values of the measure, , for object with all the other objects grouped of the same cluster.
  2. represents the average values of the measure, , for object with all the other objects not grouped in the same cluster.
The three criteria to optimize are:
  1. A similarity criteria to maximize defined by:
    where represents the number of objects.
  2. An agreement criteria to maximize defined by:
    where represents the number of objects.
  3. A disagreement criteria to maximize defined by:
    where represents the number of objects.
To combine these criteria, the NNGGA uses the PROMETHEE multi-criteria method.

6 Validation

The NNGGA was used in the GALILEI framework for two problems:
  1. The document clustering.
  2. The profile clustering.

References

[1] Emanuel Falkenauer, Genetic Algorithms and Grouping Problems, John Wiley, 1998.