Test3: script

This script generates data for four different clustering problems, invokes the KSC application to cluster each of them and generates plots to visualise the results (located under the \(\texttt{res}\) directory at the end). This is the only script for Test3.

See "Generate data for clustering" for more details on the data generation.

Example

bash-3.2$ python kscICholTest3.py
==== (Python) === : clustering the  4Circles  data set ...
==== (Python) === : generating decision boundary for the  4Circles  data set ...
==== (Python) === : clustering the  4Clusters  data set ...
==== (Python) === : generating decision boundary for the  4Clusters  data set ...
==== (Python) === : clustering the  4Moons  data set ...
==== (Python) === : generating decision boundary for the  4Moons  data set ...
==== (Python) === : clustering the  4Spirals  data set ...
==== (Python) === : generating decision boundary for the  4Spirals  data set ...

kscICholTest3.Help()

kscICholTest3.main(argv)

Generate data for clustering

Auxiliary functions for generating data, distributed in different shapes, for clustering. The generated clusters are typically hard for classical clustering algorithms to detect and separate (e.g. 4 intertwined spirals or concentric rings). The provided functionalities are used by Test3.

Description

Data are generated in 2 dimensions, clustered around 4 centers with different shapes:

4Clusters : normally distributed around 4 random point centers

4Moons : distributed around 4 moon-shaped centers

4Circles : distributed around 4 concentric circles as centers

4Spirals : distributed around 4 intertwined spirals as centers

These can be selected by providing one of the above strings as input argument. The generated data will be shuffled and standardised. Subsets for training and validation will also be selected according to the sizes given as input arguments. The generated data set, the corresponding labels and the training and validation sets will be saved into files at the location specified by the corresponding input argument (see Example).
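The shuffle, standardise and sub-sample steps described above can be sketched with NumPy as follows. This is only an illustration: the helper name `shuffle_standardise_split` is hypothetical, and taking the training and validation sets as disjoint slices of the shuffled data is an assumption, not necessarily what `GenData` does.

```python
import numpy as np

def shuffle_standardise_split(data, labels, n_train, n_valid, seed=0):
    """Hypothetical helper: shuffle rows, standardise columns, sub-sample."""
    rng = np.random.default_rng(seed)
    # Shuffle data points and labels with the same permutation.
    perm = rng.permutation(data.shape[0])
    data, labels = data[perm], labels[perm]
    # Standardise each feature to zero mean and unit variance.
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    # Assumption: training and validation sets are disjoint slices.
    train = data[:n_train]
    valid = data[n_train:n_train + n_valid]
    return data, labels, train, valid
```

Because the rows are shuffled first, taking leading slices is equivalent to drawing random subsets without replacement.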

Example

The following example generates 100 000 data points in 2 dimensions as 4 concentric rings. The data will be shuffled, standardised and saved to the \(\texttt{output/data}\_\texttt{4Circles.dat}\) file together with the \(\texttt{output/data}\_\texttt{4Circles}\_\texttt{Train}\_\texttt{N20000.dat}\) and \(\texttt{output/data}\_\texttt{4Circles}\_\texttt{Valid}\_\texttt{N20000.dat}\) files containing the 20 000 and 80 000 sub-sampled data points for training and validation, respectively.

GenData('4Circles', 'output', nSamples=100000, nTrain=20000, nValid=80000)

A plot of the generated data is not requested.

generateData.genKClusters(nSamples, nFeatures=2, nClusters=4, levSeparation=2, rndseed=0)

Generates data for clustering that are normally distributed around the centers.

Parameters

nSamples (int) – number of sample points to generate
nFeatures (int) – dimensions of the data points
nClusters (int) – number of clusters to generate
levSeparation (float) – level of cluster separation
rndseed (int) – state of the random number generator

Returns

tuple containing the generated data and their labels

Return type

(numpy::array, numpy::array)
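As a rough illustration of data that are normally distributed around random centers, a minimal sketch could look like the following. The function name `sketch_blobs` is hypothetical, and the way the `separation` argument scales the spread of the centers is an assumption; the actual `genKClusters` may interpret `levSeparation` differently.

```python
import numpy as np

def sketch_blobs(n_samples, n_features=2, n_clusters=4, separation=2.0, seed=0):
    """Hypothetical sketch: Gaussian clusters around random centers."""
    rng = np.random.default_rng(seed)
    # Assumption: centers are uniform in a box whose size grows with separation.
    centers = separation * rng.uniform(-1.0, 1.0, size=(n_clusters, n_features))
    # Assign each sample to a cluster, then add unit-variance Gaussian noise.
    labels = rng.integers(0, n_clusters, size=n_samples)
    data = centers[labels] + rng.normal(0.0, 1.0, size=(n_samples, n_features))
    return data, labels
```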

generateData.genKMoons(nSamples, levNoise=0.1, rndseed=0)

Generates data for clustering: 4 moon-shaped clusters in 2 dimensions.

Parameters

nSamples (int) – number of sample points to generate
levNoise (float) – determines the thickness of the moons
rndseed (int) – state of the random number generator

Returns

tuple containing the generated data and their labels

Return type

(numpy::array, numpy::array)

generateData.genKCircles(nSamples, levNoise=0.075, rndseed=0)

Generates data for clustering: 4 concentric rings in 2 dimensions.

Parameters

nSamples (int) – number of sample points to generate
levNoise (float) – determines the thickness of the rings
rndseed (int) – state of the random number generator

Returns

tuple containing the generated data and their labels

Return type

(numpy::array, numpy::array)
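To show how a noise level can control the thickness of concentric rings, here is a minimal sketch. The name `sketch_rings` and the choice of radii 1, 2, 3, 4 are assumptions made for illustration only and are not taken from `genKCircles`.

```python
import numpy as np

def sketch_rings(n_samples, n_rings=4, noise=0.075, seed=0):
    """Hypothetical sketch: points on concentric rings with Gaussian noise."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_rings, size=n_samples)
    # Assumption: ring k has radius k + 1.
    radii = labels + 1.0
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n_samples)
    data = np.column_stack((radii * np.cos(angles), radii * np.sin(angles)))
    # Isotropic noise thickens each ring; larger noise gives wider rings.
    data += rng.normal(0.0, noise, size=data.shape)
    return data, labels
```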

generateData.genK4Spirals(nSamples, levNoise=0.1, rndseed=0)

Generates data for clustering: 4 intertwined spirals in 2 dimensions.

Parameters

nSamples (int) – number of sample points to generate
levNoise (float) – determines the thickness of the spirals
rndseed (int) – state of the random number generator

Returns

tuple containing the generated data and their labels

Return type

(numpy::array, numpy::array)
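One possible way to build intertwined spirals is to rotate a single Archimedean spiral arm by equal phase offsets. The sketch below assumes this construction; the name `sketch_spirals`, the arm length and the spiral type are illustrative choices, not the implementation of `genK4Spirals`.

```python
import numpy as np

def sketch_spirals(n_samples, n_arms=4, noise=0.1, seed=0):
    """Hypothetical sketch: Archimedean spiral arms with Gaussian noise."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_arms, size=n_samples)
    # Position along the arm; for an Archimedean spiral the radius equals t.
    t = rng.uniform(0.5, 3.0 * np.pi, size=n_samples)
    # Each arm is the same spiral rotated by a multiple of 2*pi / n_arms.
    phase = 2.0 * np.pi * labels / n_arms
    data = np.column_stack((t * np.cos(t + phase), t * np.sin(t + phase)))
    # Isotropic noise determines the thickness of each arm.
    data += rng.normal(0.0, noise, size=data.shape)
    return data, labels
```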

generateData.GenData(name, outpath, nSamples=100000, nTrain=0, nValid=0, doPlot=False)

Generates any of the 4 (blobs, moons, rings, spirals) data sets for clustering.

Generates 2D data of the requested size in 4 clusters of the requested shape. The data are shuffled and standardised. Data sets for training and validation, with the requested sizes, are sub-sampled. The corresponding files are saved at the requested location.

Parameters

name (str) – one of {‘4Clusters’,‘4Moons’,‘4Circles’,‘4Spirals’}
outpath (str) – location where the generated data files will be saved
nSamples (int) – number of sample points to generate
nTrain (int) – number of sample points for training
nValid (int) – number of sample points for validation
doPlot (bool) – flag to indicate if data should be plotted (visualised)

Yields

Files, saved under the specified location, containing

\(\texttt{data}\_\texttt{x.dat}\) : the complete data set (data points as rows)

\(\texttt{data}\_\texttt{x}\_\texttt{Labels.dat}\) : the corresponding cluster labels (for each row)

\(\texttt{data}\_\texttt{x}\_\texttt{Train}\_\texttt{Ny.dat}\) : the y sub-sampled data for training

\(\texttt{data}\_\texttt{x}\_\texttt{Valid}\_\texttt{Nz.dat}\) : the z sub-sampled data for validation

where x is one of the available data set names. The generated data are also plotted if this was requested.
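The naming scheme listed above can be written out as a small helper, which may be handy when locating the files afterwards. The function `sketch_output_paths` is hypothetical and only mirrors the file name patterns documented here; it does not read or write any data.

```python
import os

def sketch_output_paths(name, outpath, n_train, n_valid):
    """Hypothetical helper: file paths following the documented name patterns."""
    base = os.path.join(outpath, "data_" + name)
    return {
        "data": base + ".dat",                        # complete data set
        "labels": base + "_Labels.dat",               # cluster label per row
        "train": f"{base}_Train_N{n_train}.dat",      # training sub-sample
        "valid": f"{base}_Valid_N{n_valid}.dat",      # validation sub-sample
    }
```

For instance, `sketch_output_paths('4Moons', 'output', 1000, 2000)["train"]` gives the path of a file named `data_4Moons_Train_N1000.dat` inside `output`.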