Test3: script

This script generates data for four different clustering problems, invokes the KSC application to cluster each of them, and generates plots to visualise the results (located under the \(\texttt{res}\) directory at the end). This is the only script for Test3.

See Generate data for clustering for more details on the data generation.

Example

bash-3.2$ python kscICholTest3.py
==== (Python) === : clustering the  4Circles  data set ...
==== (Python) === : generating decision boundary for the  4Circles  data set ...
==== (Python) === : clustering the  4Clusters  data set ...
==== (Python) === : generating decision boundary for the  4Clusters  data set ...
==== (Python) === : clustering the  4Moons  data set ...
==== (Python) === : generating decision boundary for the  4Moons  data set ...
==== (Python) === : clustering the  4Spirals  data set ...
==== (Python) === : generating decision boundary for the  4Spirals  data set ...
kscICholTest3.Help()
kscICholTest3.main(argv)

Generate data for clustering

Auxiliary functions for generating data, distributed in different shapes, for clustering. The generated clusters are typically hard for classical clustering algorithms to detect and separate (e.g. 4 intertwined spirals or concentric rings). The provided functionality is used by Test3.

Description

Data are generated in 2 dimensions, clustered around 4 centers with different shapes:

  • 4Clusters : normally distributed around 4 random point centers

  • 4Moons : distributed around 4 moon-shaped centers

  • 4Circles : distributed around 4 concentric circles as centers

  • 4Spirals : distributed around 4 intertwined spirals as centers

These can be selected by providing one of the above strings as an input argument. The generated data will be shuffled and standardised. Sub-sets for training and validation will also be selected according to the sizes given as input arguments. The generated data set with the corresponding labels, as well as the training and validation sets, will be saved into files at the location specified by the corresponding input argument (see Example).
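The post-processing described above (shuffle, standardise, sub-sample) can be sketched as follows. This is only an illustration and not the actual implementation: the helper name `shuffle_standardise_split` is hypothetical, and taking the training and validation sub-sets as disjoint slices of the shuffled data is an assumption.

```python
import numpy as np


def shuffle_standardise_split(data, labels, n_train, n_valid, seed=0):
    """Hypothetical helper: shuffle, standardise and sub-sample a data set."""
    rng = np.random.default_rng(seed)
    # Shuffle the rows and their labels with the same permutation.
    perm = rng.permutation(data.shape[0])
    data, labels = data[perm], labels[perm]
    # Standardise each feature to zero mean and unit standard deviation.
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    # Assumption: training and validation sets are disjoint slices.
    train = data[:n_train]
    valid = data[n_train:n_train + n_valid]
    return data, labels, train, valid
```

With 100 000 samples, `n_train=20000` and `n_valid=80000`, this sketch would use every generated point exactly once across the two sub-sets.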

Example

The following example generates 100 000 data points in 2 dimensions as 4 concentric rings. The data will be shuffled, standardised and saved to the \(\texttt{output/data}\_\texttt{4Circles.dat}\) file, together with the \(\texttt{output/data}\_\texttt{4Circles}\_\texttt{Train}\_\texttt{N20000.dat}\) and \(\texttt{output/data}\_\texttt{4Circles}\_\texttt{Valid}\_\texttt{N80000.dat}\) files containing the 20 000 and 80 000 sub-sampled data points for training and validation:

GenData('4Circles', 'output', nSamples=100000, nTrain=20000, nValid=80000)

No plot of the generated data is requested in this example.


generateData.genKClusters(nSamples, nFeatures=2, nClusters=4, levSeparation=2, rndseed=0)

Generates data for clustering that are normally distributed around the centers.

Parameters
  • nSamples (int) – number of sample points to generate

  • nFeatures (int) – dimensions of the data points

  • nClusters (int) – number of clusters to generate

  • levSeparation (float) – level of cluster separation

  • rndseed (int) – state of the random number generator

Returns

tuple containing the generated data and their labels

Return type

(numpy::array, numpy::array)
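As a rough illustration of data normally distributed around random centers, a minimal NumPy sketch is given below. It is not the code behind genKClusters: the function name `gaussian_blobs` is hypothetical, and both the uniform placement of the centers and the way `separation` scales their spread are assumptions.

```python
import numpy as np


def gaussian_blobs(n_samples, n_features=2, n_clusters=4, separation=2.0, seed=0):
    """Hypothetical sketch: unit-variance Gaussian blobs around random centers."""
    rng = np.random.default_rng(seed)
    # Assumption: centers are uniform in a cube whose half-width is `separation`.
    centers = separation * rng.uniform(-1.0, 1.0, size=(n_clusters, n_features))
    # Assign each sample to a random cluster, then add Gaussian scatter.
    labels = rng.integers(0, n_clusters, size=n_samples)
    data = centers[labels] + rng.normal(size=(n_samples, n_features))
    return data, labels
```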

generateData.genKMoons(nSamples, levNoise=0.1, rndseed=0)

Generates data for clustering: 4 moon-shaped clusters in 2 dimensions.

Parameters
  • nSamples (int) – number of sample points to generate

  • levNoise (float) – determines the thickness of the moons

  • rndseed (int) – state of the random number generator

Returns

tuple containing the generated data and their labels

Return type

(numpy::array, numpy::array)

generateData.genKCircles(nSamples, levNoise=0.075, rndseed=0)

Generates data for clustering: 4 concentric rings in 2 dimensions.

Parameters
  • nSamples (int) – number of sample points to generate

  • levNoise (float) – determines the thickness of the rings

  • rndseed (int) – state of the random number generator

Returns

tuple containing the generated data and their labels

Return type

(numpy::array, numpy::array)
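To make the role of the noise level concrete, concentric rings can be sketched in polar coordinates as below. This is an illustrative sketch only, not the implementation of genKCircles: the name `concentric_rings`, the choice of radii 1, 2, 3, 4, and applying the noise to the radius are all assumptions.

```python
import numpy as np


def concentric_rings(n_samples, n_rings=4, noise=0.075, seed=0):
    """Hypothetical sketch: points scattered around concentric circles."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_rings, size=n_samples)
    # Assumption: ring k has radius k + 1; the noise sets the ring thickness.
    radius = labels + 1.0 + noise * rng.normal(size=n_samples)
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n_samples)
    data = np.column_stack((radius * np.cos(angle), radius * np.sin(angle)))
    return data, labels
```

A larger `noise` value thickens every ring, so neighbouring rings start to overlap and become harder to separate.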

generateData.genK4Spirals(nSamples, levNoise=0.1, rndseed=0)

Generates data for clustering: 4 intertwined spirals in 2 dimensions.

Parameters
  • nSamples (int) – number of sample points to generate

  • levNoise (float) – determines the thickness of the spirals

  • rndseed (int) – state of the random number generator

Returns

tuple containing the generated data and their labels

Return type

(numpy::array, numpy::array)
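Intertwined spirals can be sketched by rotating one spiral arm by a fixed angle per cluster. The snippet below is a hedged illustration rather than the implementation of genK4Spirals: the name `intertwined_spirals`, the Archimedean form (radius equal to the arm parameter) and the parameter range are assumptions.

```python
import numpy as np


def intertwined_spirals(n_samples, n_arms=4, noise=0.1, seed=0):
    """Hypothetical sketch: points scattered around rotated spiral arms."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_arms, size=n_samples)
    # Position along the arm; the radius grows linearly with it (assumption).
    t = rng.uniform(0.5, 4.0, size=n_samples)
    # Each arm is the same spiral rotated by a multiple of 2*pi / n_arms.
    angle = t + 2.0 * np.pi * labels / n_arms
    data = np.column_stack((t * np.cos(angle), t * np.sin(angle)))
    # Isotropic Gaussian noise sets the thickness of the arms.
    data += noise * rng.normal(size=data.shape)
    return data, labels
```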

generateData.GenData(name, outpath, nSamples=100000, nTrain=0, nValid=0, doPlot=False)

Generates any of the 4 (blobs, moons, rings, spirals) data sets for clustering.

Generates 2D data of the required size in 4 clusters, according to the required cluster shape. The data are shuffled and standardised. Data sets for training and validation, with the required sizes, are sub-sampled. The corresponding files are saved at the specified location.

Parameters
  • name (str) – one of {‘4Clusters’,‘4Moons’,‘4Circles’,‘4Spirals’}

  • outpath (str) – location where the generated data files will be saved

  • nSamples (int) – number of sample points to generate

  • nTrain (int) – number of sample points for training

  • nValid (int) – number of sample points for validation

  • doPlot (bool) – flag to indicate if data should be plotted (visualised)

Yields

Files, saved under the specified location, containing

  • \(\texttt{data}\_\texttt{x.dat}\) : the complete data set (data points as rows)

  • \(\texttt{data}\_\texttt{x}\_\texttt{Labels.dat}\) : the corresponding cluster labels (for each row)

  • \(\texttt{data}\_\texttt{x}\_\texttt{Train}\_\texttt{Ny.dat}\) : the y sub-sampled data for training

  • \(\texttt{data}\_\texttt{x}\_\texttt{Valid}\_\texttt{Nz.dat}\) : the z sub-sampled data for validation

where x is one of the available data set names, and y and z are the training and validation sample sizes. The generated data are also plotted if plotting was requested.
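The naming pattern listed above can be reproduced with a small helper, which may be handy for locating the files after a run. The function `expected_files` is hypothetical and is not part of generateData; it only assembles the names according to the pattern and does not check that the files exist.

```python
import os


def expected_files(name, outpath, n_train, n_valid):
    """Hypothetical helper: build the file names following the pattern above."""
    base = os.path.join(outpath, f"data_{name}")
    return {
        "data": f"{base}.dat",                      # complete data set
        "labels": f"{base}_Labels.dat",             # cluster labels per row
        "train": f"{base}_Train_N{n_train}.dat",    # training sub-sample
        "valid": f"{base}_Valid_N{n_valid}.dat",    # validation sub-sample
    }


files = expected_files("4Circles", "output", n_train=20000, n_valid=80000)
```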