Example Applications¶
When using the Kernel Spectral Clustering (KSC) algorithm to discover clusters in a given data set, one needs to provide hyper parameter values such as the number of clusters and parameters of the applied kernel function. Different values of these parameters results in different KSC models. One of the advantages of the KSC algorithm is that these different KSC models can be compared and their difference from an optimal/ideal case can be quantified. This makes possible to apply hyper parameter tuning for selecting the optimal cluster number and kernel parameter values.
Moreover, the out-of-sample extension capability of the KSC model makes possible to train the model a given training set and apply it to cluster on any new, out of sample points. Therefore, in spite of the unsupervised nature of the clustering problem, one can benefit from training and validation steps for determining the optimal values of the hyper parameters (similarly to supervised problems) when using the KSC model: a KSC model can be trained (by using a training data set) at each point of a hyper parameter value grid, each of these KSC models can be evaluated on a separate validation set and the model (i.e. hyper parameter values) that yields the closest-to-optimal model selection criterion value (i.e. measure from optimality) can be selected as optimal KSC model setting.
In reflection to this, example applications have been developed for training (\(\texttt{KscIchol}\_\texttt{Train}\)), hyper parameter tuning (\(\texttt{KscIchol}\_\texttt{Tune}\) and testing (\(\texttt{KscIchol}\_\texttt{Test}\) of a sparse KSC model based on the incomplete Cholesky decomposition (ICD) of the kernel matrix. These applications are described shortly in the next sections followed by example applications. However, since the underlying KSC model exploits sparsity, achieved by the ICD on the training data kernel matrix, ICD is discussed briefly in the next section completing with some practical guides regarding its application.
On the application of incomplete Choleksy decomposition¶
The KSC model, implemented in the following applications, is sparse, i.e. makes use only a small, well selected sub-set of the training data set when set up (trained). This sparsity is achieved by using the incomplete Cholesky decomposition(ICD) of the training data kernel matrix. Since all the following applications start with this decomposition step, that can influence significantly the quality of the corresponding KSC model (i.e. the clustering result), it’s briefly discussed here. Note, that while a more detailed theoretical discussion can be found elsewhere, the following summary focuses more on practical aspects. This might help to understand the role of the related parameters (such as maximum rank, tolerance, etc.) and to find their appropriate values needed by the KSC applications discussed later.
Cholesky decomposition and orthogonalisation¶
Cholesky factorisation of the kernel matrix is actually equivalent of applying Gram-Schmidt (GS) orthogonalisation (QR decomposition) to the corresponding feature map matrix.
Gram-Schmidt orthogonalisation (QR decomposition)¶
When applying GS orthogonalisation on a set of linearly independent vectors of a finite dimensional inner product space, the input vectors are expressed in a new (orthonormal) set of basis vectors. These basis vectors are greedily generated by orthogonalising the input data vectors one-by-one to all the previously generated basis vectors. This orthogonalisation is done by projecting the next input data vector to the orthogonal complement of the sub-space spanned by the current set of basis vectors: (vector) projecting the data vector to all previously generated bases, subtracting these projections form the input data vector (and normalising the result).
The final result of these above subtractions is called the residual norm and it tells how much of the given input data vector lies out of the sub-space spanned by the current set of basis vectors. In other words, the residual norm indicates how (linearly) independent of the given data vector from the previously processed vectors:
zero residual norm: linear independence do not hold and the corresponding input data vector can be expressed as linear combination of (some of) the previously processed ones i.e. this vectors lies into the sub-space spanned by the current set of basis vectors (sum of the projections to the current set of basis vectors already equals to the vector)
residual norm equals to the norm of the vector itself: the vector lies entirely into the orthogonal complement of the sub-space spanned by the current set of basis vectors (sum of the projections to the current set of basis vectors is zero i.e. orthogonal to all of these basis vectors)
Changing the order, in which the input data is processed, by selecting the one with the largest residual norm to be processed at the next step (pivoting) and eventually ignoring those with small residual norms (partial, incomplete) leads to the incomplete and pivoted versions of the QR decomposition.
Note
The obtained orthonormal basis vectors form the columns of the \(Q\) while the \(ij\) elements of the upper-triangular matrix \(R\) stores the projection of the j-th input data vector to the i-th basis. Therefore, QR factorisation yields a transformation of the input data to the basis stored as columns of \(Q\) with the corresponding projections of the original data stored as the rows of \(R\). As it will be shown in the followings, this new representation provides the same inner products as the original one.
QR factorisation of the feature map¶
Let \(\{\mathbf{u}_i\}_{i=1}^{N},\mathbf{u}_i \in \mathbb{R}^d\) to be the input data set and \(\varphi(\cdot):\mathbb{R}^d\to\mathbb{R}^{n_h}\) its mapping to the high(even infinite)-dimensional feature space resulting the \(\Phi \in \mathbb{R}^{(N\times n_h)}\) feature space matrix (each row corresponds to the feature map of an input data) as
If \(\Phi^T_{(n_h\times N)}=Q_{(n_h\times N)}R_{(N\times N)}\) is the QR decomposition (Gram–Schmidt orthogonalization) of \(\Phi^T\), then the kernel matrix
where the fact that the matrix \(Q\) builds up from mutually orthonormal columns (i.e. basis vectors) was used.
Note
Applying QR factorisation on the feature map \(\Phi^T=QR\) of the input data is equivalent to the Cholesky decomposition of the corresponding kernel matrix \(\Omega=\Phi \Phi^T=R^TR\). While the fist is not feasible due to the high, or often infinite dimensions of the feature space the second can be done easily and yields the same \(R\) i.e. the projections of the feature vectors onto data to a new basis (that would be stored in Q). It can be seen, if the feature map would actually be transformed further into a new basis in which the original inner product is conserved since \(\Phi\Phi^T=R^TR\).
Incomplete Cholesky decomposition of the kernel matrix¶
It has been shown above, that QR factorisation of the feature map is actually equivalent to the Cholesky decomposition of the corresponding kernel matrix. As it has been mentioned earlier at the discussion of the QR factorisation, the residual norm can be used as a measure of (linear) independence. Moreover, the order of processing the data vectors in the QR factorisation can be chosen by selecting the vector, at each step, with the maximal residual norm (pivoting) and data with small residual norms can even be ignored. These leads to the pivoted, incomplete version of the QR factorisation and the corresponding Cholesky decomposition.
This pivoted, incomplete version of the algorithm can be especially useful when the number of linearly independent or nearly independent feature map vectors is significantly smaller than the number of data itself. The rank of the feature map of the input data (i.e. number of independent rows in \(\Phi\) or dimension of the row space) is equal to the rank of the kernel matrix. Although the kernel matrix might have even full rank, its spectrum decays rapidly in case of many kernels and can be well approximated by low rank matrices. This happens when the data lives “near” a low-dimensional subspace in feature space which means that even with a low-number (low-dimensional) of basis vectors a small sum residual norm (“near”) can be achieved (the sum of the vector projections to these basis vectors already approximate very well the feature map vectors).
Incomplete Cholesky decomposition (ICD) can be used in such cases to obtain a low-rank approximation of the kernel matrix that can help e.g. to reduce the size of the eigenvalue problem in which the full kernel matrix is involved or to deal with the full kernel matrix that would not fit into the memory through an appropriate approximation.
ICD approximation error¶
It can be shown, that when applying ICD on the \(\Omega=\Phi \Phi^T, \Omega \in \mathbb{R}^{N\times N}\) kernel matrix such that the approximation after the \(k\)-th step is \(\Omega \approx \tilde{\Omega} = G^TG, G \in \mathbb{R}^{k\times N}\) upper triangular, then \(\nu_j^2 = \Omega_{jj} -\tilde{\Omega}_{jj}\) where \(\nu_j^2\) is the squared residual norm of the \(j\)-th feature map vector. The sum of the squared residual norms after the \(k\)-th step is then \(\sum_{j=1}^{N} \nu_j^2 = \sum_{j=1}^{N} [\Omega_{jj} -\tilde{\Omega}_{jj}] = \|\Omega - \tilde{\Omega}\|_1\) where \(\|\cdot\|_1\) denotes the trace of the matrix.
Note, that including a data vector into the orthogonalisation at the \(k+1\)-th step, adds an additional, \(k+1\)-th basis vector to the new representation, such that the vector lies entirely into the corresponding \(k+1\) dimensional sub-space. Therefore, it zeros out the corresponding residual norm. So actually \(k\) out of the total \(N\) residual norms are already zero in the above summation i.e. \(\sum_{j=1}^{N} \nu_j^2 = \sum_{j=k+1}^{N} \nu_j^2\) by assuming that the already selected \(k\) data are shuffled to the front.
Consequently, when pivoting is applied i.e. selecting the \(i\)-th data vector at the \(k+1\)-th step to include into the orthogonalisation such that \(i = \text{arg}\,\max_{i}\{\nu_i\}, i \in \{k+1,\cdots,N\}\), is equivalent to the \(\sum_{j=1}^{N} \nu_j^2 = \|\Omega - \tilde{\Omega}\|_1\) trace minimisation At each step, the maximum term (residual norm) of this summation is zeroed out and the \(\eta := \sum_{j=1}^{N} \nu_j^2/N = \sum_{j=k+1}^{N} \nu_j^2/N = \|\Omega - \tilde{\Omega}\|_1/N, \eta \in [0,1]\) value can be used as the measure of the current approximation error. This value indicates the fraction of (squared norm of) projections of the feature map that lies into the orthogonal complement of the current sub-space of the new representation. In other words, the lower this value the more information is captured by the new (hopefully low) dimensional sub-space of the approximation.
ICD practical guide¶
When clustering is applied, it is implicitly assumed, that the corresponding input data contains some hidden structures and a clustering algorithm is employed to discover these. In case of kernel methods (in general), the problem is solved in the feature space generated by a non-linear transformation of the input space. Their advantage is, that problems, that are linearly unsolvable in the input space, become (linearly) solvable in the feature space when the appropriate transformation, i.e. appropriate kernel function and its parameter(s), is applied. Different kernel parameters leads to different feature maps and kernel matrix.
According to the above discussion, ICD of the kernel matrix can lead to an appropriate low rank approximation of the kernel matrix when the data lives “near” a low-dimensional subspace in the feature space. The feature space representation of the data is determined by the feature map i.e. the actual kernel and its parameter(s). Therefore, the rank of the kernel matrix (as well as any structure in the feature map) is strongly determined by the applied kernel and its parameters.
An RBF kernel is employed in the following applications with a single bandwidth parameter
too small value, i.e. small compared to the distance between the underlying data points, leads to diagonal kernel matrix (identity matrix):
all resulted feature map vectors are mutually orthogonal, maximally dissimilar (actually each of the individual feature map points forms a separate cluster)
all will have equally maximal residual norms (neither pivoting nor low rank approximation do no work)
indicated by the permutation vector: data are selected in a continuous way i.e. no pivoting so the permutation vector contains continuous data indices
too large value , i.e. large compared to the distance between the underlying data points, leads to one block kernel matrix:
all resulted feature map vectors are almost linearly dependent, very similar (actually the whole feature map is one single cluster)
the feature map lives near to a very low dimensional space, very low rank approximation already gives very low error (the sum of the residual norms drops quickly)
indicated by the final rank and error values: decomposition terminates quickly with very low final rank and very low final approximation error
Between these two limiting cases there might be many different appropriate values leading to different structures in the feature map, different approximation of the corresponding kernel matrix and eventually different final clustering results. The suitable range of the ICD kernel parameter values can be find by identifying the values leading to the above limiting cases.
An additional possibility is to apply some kernel bandwidth estimation algorithms. While they sometimes give a good initial value, according to our experience, usually they require further tuning/scaling to eventually find the optimal value.
In all cases, the final/optimal value of the ICD kernel parameter needs to be selected based on the optimality of the final clustering result. Note, that the implemented KSC model has the advantage to provide a model selection criterion value that can guide this decision. See more at the tests later.
Similar comments applies on the proper values of the final rank (number of pivots) and approximation error (tolerance) parameter values. Since they related to the dimensionality of the feature space map they both depend strongly on the kernel function and its parameter value(s). However, the \(\eta \sim 0.6-0.9\) tolerance and \(\texttt{max rank} \sim\) few (\(2-20\)) times number of clusters usually provides appropriate results. See more at the tests later.
Application for hyper parameter tuning¶
The \(\texttt{KscIchol}\_\texttt{Tune}\) example application has been developed, based on the framework provided by the \(\texttt{leuven}\) library, to perform hyper parameter tuning of a sparse KSC model using a 1 dimensional RBF kernel.
The hyper parameter tuning is equivalent to a grid search to find the kernel number and RBF kernel bandwidth hyper parameter combinations resulting a KSC model, set up based on the training set, that provides the closest-to-optimal model selection criterion value on the validation set. The 2D hyper parameter grid is determined by the candidate cluster numbers (given by its min/max values) and list of candidate RBF kernel bandwidth values. The application will train a KSC model at each point of this 2D hyper parameter grid by using the training data set and evaluate the performance of the trained KSC model on the validation set. At the end, the application will write these model selection criterion values as well as the corresponding hyper parameter grid values to files.
The KSC model is sparse and the sparsity is achieved by the combination of the incomplete Cholesky decomposition (ICD) based approximation of the kernel matrix with the reduced set method. Therefore, before any KSC model training, the application will perform the ICD of the training data kernel matrix by using an RBF kernel with the (possible multidimensional) kernel parameter, tolerance and maximum rank parameter. (See more at the corresponding part, i.e. Incomplete Cholesky factorisation section, of the documentation of the \(\texttt{leuven}\) library.
The following steps are done in the application
the training data are loaded from the file
incomplete Cholesky decomposition of the training data kernel matrix
at each point of the 2D hyper parameter grid
a sparse KSC model is trained based on the above incomplete Cholesky approximation of the training data kernel matrix
the model is evaluated on the validation set
the resulted model evaluation criterion values as well as the hyper parameter values are saved to file
See more on the application and its input arguments by executing it with the \(\texttt{KscIchol}\_\texttt{Tune --help}\) option that will print out the following description
bash-3.2$ ./KscIchol_Tune --help
Application that tunes a sparse KSC model using a 1D RBF kernel to find the
optimal values of the kernel parameter (bandwidth) and optimal number of clusters.
The application trains a KSC model at each point of a 2D 'kernel parameter' -
'cluster number' grid (defined by the input arguments) on on the given training
data set. Each of these model is applied on the validation data set and the cor-
responding model evaluation criterion is computed. The 2D point giving the best
value of the model evaluation is reported and all values over the 2D grid are writ-
ten to the output file. The optimal kernel parameter and cluster number parameters
then can be determined by inspecting these values.
Usage:
KSC Tuning Application [OPTION...]
Cholesky decomposition [REQUIRED] options:
--icholTolError arg (double) Tolerated approximate error in the
inc. Cholesky decomposition.
--icholMaxRank arg (size_t) Maximum number of data to select
in the inc. Cholesky decomposition.
--icholRBFKernelPar arg (INP_DTYPE) RBF kernel parameter to be used in
the inc. Cholesky decomposition(scalar or
vector).
Training data set [REQUIRED] options:
--trDataNumber arg (size_t) Number of training data.
--trDataDimension arg (size_t) Dimension of the training data.
--trDataFile arg (string) File name of the training data.
Validation data set [REQUIRED] options:
--valDataNumber arg (size_t) Number of validation data.
--valDataFile arg (string) File name of the validation data.
Tuning [REQUIRED] options:
--minClusterNumber arg (size_t) Minimum cluster number for grid
search.
--maxClusterNumber arg (size_t) Maximum cluster numebr for grid
search.
--kernelParameters arg (vector of INP_DTYPE) List of RBF kernel
parameters for grid search (separated by ',')
Clustering [OPTIONAL] options:
--clEncodingScheme arg (string) KSC cluster membership encoding
scheme (BLF, AMS or BAS). (default: BAS)
--clEvalOutlierThrs arg (size_t) clusters with cardinality below this
are considered to be outliers with zero
contibution to quality measure (0). (default: 0)
--clEvalWBalance arg (DTYPE) Weight to give to the balance term in
the model evaluation (must be in [0,1]).
(default: 0.2)
Others [OPTIONAL] options:
--verbosityLevel arg (size_t) Verbosity level. (default: 2)
--numBLASThreads arg (size_t) Number of threads to be used in
BLAS/LAPACK. (default: 4)
--resFile arg (string) The result of tuning the KSC model is
written into this file. (default: TuningRes.dat)
-h, --help (flag) Print usage and available parameters
Application for testing¶
The \(\texttt{KscIchol}\_\texttt{Test}\) example application has been developed for testing a KSC model, i.e. applying a KSC model, trained on a given training data set, to cluster any additional, unseen data.
This step usually follows the above hyper parameter tuning since the (optimal) number of clusters and kernel function (1D RBF in this case) parameter needs to be provided. The application trains the KSC model on the given training data set and applies the trained model to partition the test data set.
As it has been mentioned at …, the testing, as all the other ICD based sparse KSC model applications, starts with the ICD of the training data kernel matrix. Therefore, the corresponding parameters such as RBF kernel bandwidth(s), tolerance, etc. also need to be given.
See more on the application and its input arguments by executing it with the \(\texttt{KscIchol}\_\texttt{Test --help}\) option that will print out the following description
bash-3.2$ ./KscIchol_Test --help
Application that trains a sparse KSC model using a 1D RBF kernel on the given
training data set and applies the trained model to cluster a given test data set.
The sparsity is achived with the combination of approximating the training set
kernel matrix by its incomplete Cholesky factorisation (i.e. incomplete QR
decomposition of the training data feature map) and using the reduced set method.
Usage:
KSC Training & Testing Application [OPTION...]
Cholesky decomposition [REQUIRED] options:
--icholTolError arg (double) Tolerated approximate error in the
inc. Cholesky decomposition.
--icholMaxRank arg (size_t) Maximum number of data to select
in the inc. Cholesky decomposition.
--icholRBFKernelPar arg (INP_DTYPE) RBF kernel parameter to be used in
the inc. Cholesky decomposition (scalar or
vector).
Cholesky decomposition [OPTIONAL] options:
--icholRedSetFile arg (string) The reduced set data is written into
this file (if given). (default: "")
--icholPermVectFile arg (string) The permutations (done during the
ICD) is written into this file (if given).
(default: "")
Training data set [REQUIRED] options:
--trDataNumber arg (size_t) Number of training data.
--trDataDimension arg (size_t) Dimension of the training data.
--trDataFile arg (string) File name of the training data.
Test data set [REQUIRED] options:
--tstDataNumber arg (size_t) Number of test data.
--tstDataFile arg (string) File name of the test data.
Clustering [REQUIRED] options:
--clNumber arg (size_t) Number of required cluster.
--clRBFKernelPar arg (INP_DTYPE) RBF kernel parameter to be used in
the KSC.
Clustering [OPTIONAL] options:
--clEncodingScheme arg (string) KSC cluster membership encoding
scheme (BLF, AMS or BAS). (default: BAS)
--clEvalOutlierThrs arg (size_t) clusters with cardinality below this
are considered to be outliers with zero
contibution to quality measure (0). (default: 0)
--clEvalWBalance arg (DTYPE) Weight to give to the balance term in
the model evaluation (must be in [0,1]).
(default: 0.2)
--clResFile arg (string) The result of clustering the test
data is written into this file (if clLevel>0).
(default: CRes.dat)
--clLevel arg (size_t) KSC clustering level: 0 - clustering
only; 1 - additional membership strength (AMS,
BAS); 2 - membership strength for all clusters
(only with AMS). (default: 1)
Others [OPTIONAL] options:
--verbosityLevel arg (size_t) Verbosity level. (default: 2)
--numBLASThreads arg (size_t) Number of threads to be used in
BLAS/LAPACK. (default: 4)
-h, --help (flag) Print usage and available parameters
Application for training¶
The \(\texttt{KscIchol}\_\texttt{Train}\) example application has been developed for training a KSC model on a given training data set.
This step usually applied before the tuning in the exploratory phase to find out the approximate range of the input parameter values both for the ICD (kernel parameter, tolerance, etc.) and training (range of number of clusters, kernel parameter values, etc.) parts of the tuning.
See more on the application and its input arguments by executing it with the \(\texttt{KscIchol}\_\texttt{Train --help}\) option that will print out the following description
bash-3.2$ ./KscIchol_Train --help
Application that trains a sparse KSC model using a 1D RBF kernel on the
given training data set. The sparsity is achived with the combination of
approximating the training set kernel matrix by its incomplete Cholesky
factorisation (i.e. incomplete QR decomposition of the training data feature
map) and using the reduced set method.
Usage:
KSC Training Application [OPTION...]
Cholesky decomposition [REQUIRED] options:
--icholTolError arg (double) Tolerated approximate error in the
inc. Cholesky decomposition.
--icholMaxRank arg (size_t) Maximum number of data to select
in the inc. Cholesky decomposition.
--icholRBFKernelPar arg (INP_DTYPE) RBF kernel parameter to be used in
the inc. Cholesky decomposition (scalar or
vector).
Cholesky decomposition [OPTIONAL] options:
--icholRedSetFile arg (string) The reduced set data is written into
this file (if given). (default: "")
--icholPermVectFile arg (string) The permutations (done during the
ICD) is written into this file (if given).
(default: "")
Training data set [REQUIRED] options:
--trDataNumber arg (size_t) Number of training data.
--trDataDimension arg (size_t) Dimension of the training data.
--trDataFile arg (string) File name of the training data.
Clustering [REQUIRED] options:
--clNumber arg (size_t) Number of required cluster.
--clRBFKernelPar arg (INP_DTYPE) RBF kernel parameter to be used in
the KSC.
Clustering [OPTIONAL] options:
--clEncodingScheme arg (string) KSC cluster membership encoding
scheme (BLF, AMS or BAS). (default: BAS)
--clEvalOutlierThrs arg (size_t) clusters with cardinality below this
are considered to be outliers with zero
contibution to quality measure (0). (default: 0)
--clEvalWBalance arg (DTYPE) Weight to give to the balance term in
the model evaluation (must be in [0,1]).
(default: 0.2)
--clResFile arg (string) The result of clustering the training
data is written into this file (if clLevel>0).
(default: CRes.dat)
--clResDataFile arg (string) The reordered training data is
written into this file (if clLevel>0). (default:
CData.dat)
--clLevel arg (size_t) KSC clustering level: 0 - clustering
only; 1 - additional membership strength (AMS,
BAS); 2 - membership strength for all clusters
(only with AMS). (default: 1)
Others [OPTIONAL] options:
--verbosityLevel arg (size_t) Verbosity level. (default: 2)
--numBLASThreads arg (size_t) Number of threads to be used in
BLAS/LAPACK. (default: 4)
-h, --help (flag) Print usage and available parameters