Introduction

This is the documentation and user guide of some sparse Kernel Spectral Clustering(KSC) applications developed based on the \(\texttt{leuven}\) library and framework [3] (\(\texttt{leuven}\) library git repository and Documentation).

The document contains instructions for building and using the applications developed for training (see the Application for training Section), hyper parameter tuning (see the Application for hyper parameter tuning Section) and out of sample extension or testing (see the Application for testing Section) the sparse KSC model. The sparsity is achieved by the combination of incomplete Cholesky decomposition (ICD) based low rank approximation of the (training data) kernel matrix with the so-called reduced set method. The KSC model, trained on a training set, can be used to cluster any unseen data thanks its out-of-sample extension capability. Before describing the applications, the ICD is discussed briefly in the On the application of incomplete Choleksy decomposition Section due to its strong influence to the final accuracy of the KSC model and the clustering result itself.

The description of the applications is followed by a number of test problems that gives insight into the usage of the provided applications (using special toy problems). Auxiliary scripts, provided for visualising or evaluating the output of the applications e.g. results of the hyper parameter tuning or the clustering itself, are also demonstrated.

Table 1 Description of the provided tests.

Test

Problem

Description

Purpose

Pre-generated, N = 10 000, d = 2 dimensional samples from 10, well separated clusters.

Simple clustering problem. Not only the data but the scripts, for executing the KSC applications with the appropriate input parameter values, are also provided.

Verify if all applications works correctly.

More flexible, dynamic data generation (including number of cluster centers, data set size, dimension, etc.).

Problems with variable cluster overlaps. The data sets are generated dynamically as part of the test with the possibility of configurations (number of cluster centers, separation of clusters, dimension, etc.). Shell scripts, for the KSC applications, are also generated at the same time, with initial parameter values according to the the given data generation settings.

Give a deeper insight into the nature of the:

  • KSC algorithm trough the applications

  • effects of using the ICD as kernel matrix approximation

  • behaviour and capability of the KSC algorithm and applications in case of more and more overlapping clusters

Four different, pre-defined 2D clustering problems with 4 cluster centers each.

Four clustering problems, with variable complexity, including even very non-linear problems that are known to be challenging such as the intertwined spiral problem. The complete set is automatic: data generation, KSC shell script generation, clustering and result visualisation.

Demonstrate the capability of the underlying sparse KSC algorithm and the related applications to solve even the most challenging clustering problems in a very efficient way (both in terms of memory usage and computing speed).