Test3: script
This script generates data for four different clustering problems, invokes the KSC application to cluster each of them, and generates plots to visualise the results (located under the \(\texttt{res}\) directory at the end). This is the only script for Test3.
See Generate data for clustering for more details on the data generation.
Example
bash-3.2$ python kscICholTest3.py
==== (Python) === : clustering the 4Circles data set ...
==== (Python) === : generating decision boundary for the 4Circles data set ...
==== (Python) === : clustering the 4Clusters data set ...
==== (Python) === : generating decision boundary for the 4Clusters data set ...
==== (Python) === : clustering the 4Moons data set ...
==== (Python) === : generating decision boundary for the 4Moons data set ...
==== (Python) === : clustering the 4Spirals data set ...
==== (Python) === : generating decision boundary for the 4Spirals data set ...
Generate data for clustering
Auxiliary functions for generating data, distributed in different shapes, for clustering. The generated clusters are typically hard for classical clustering algorithms to detect and separate (e.g. 4 intertwined spirals or concentric rings). The provided functionalities are used by Test3.
Description
Data are generated in 2 dimensions, clustered around 4 centers with different shapes:
4Clusters: normally distributed around 4 random point centers
4Moons: distributed around 4 moon-shaped centers
4Circles: distributed around 4 concentric circles as centers
4Spirals: distributed around 4 intertwined spirals as centers
These can be selected by providing one of the above strings as input argument. The generated data will be shuffled and standardised. Sub-sets for training and validation will also be selected according to the sizes given as input arguments. The generated data set, the corresponding labels, and the training and validation sets will be saved into files at the location specified by the corresponding input argument (see Example).
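The post-processing described above (shuffle, standardise, sub-sample) can be sketched with NumPy as follows. This is only an illustration under stated assumptions, not the code of this module: the helper name `shuffle_standardise_split` is hypothetical, and drawing disjoint training and validation sets without replacement is an assumption.

```python
import numpy as np

def shuffle_standardise_split(data, labels, n_train, n_valid, seed=0):
    """Shuffle rows, standardise each feature, draw training/validation subsets."""
    rng = np.random.default_rng(seed)
    # Shuffle data and labels with the same permutation so rows stay aligned.
    perm = rng.permutation(data.shape[0])
    data, labels = data[perm], labels[perm]
    # Standardise: zero mean and unit standard deviation per feature (column).
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    # Assumption: the two subsets are disjoint and drawn without replacement.
    idx = rng.choice(data.shape[0], size=n_train + n_valid, replace=False)
    train, valid = data[idx[:n_train]], data[idx[n_train:]]
    return data, labels, train, valid

# Toy input: 1000 points in 2D with 4 dummy labels.
raw = np.random.default_rng(1).normal(loc=5.0, scale=3.0, size=(1000, 2))
lab = np.arange(1000) % 4
data, labels, train, valid = shuffle_standardise_split(raw, lab, 200, 800)
```

After standardisation every feature has zero mean and unit standard deviation, so all four data set shapes end up on a comparable scale.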
Example
The following example generates 100 000 data points in 2 dimensions as 4 concentric rings. The data will be shuffled, standardised and saved to the \(\texttt{output/data}\_\texttt{4Circles.dat}\) file, together with the \(\texttt{output/data}\_\texttt{4Circles}\_\texttt{Train}\_\texttt{N20000.dat}\) and \(\texttt{output/data}\_\texttt{4Circles}\_\texttt{Valid}\_\texttt{N80000.dat}\) files containing the 20 000 and 80 000 sub-sampled data points for training and validation, respectively:
GenData('4Circles', 'output', nSamples=100000, nTrain=20000, nValid=80000)
No plot of the generated data is requested in this example (doPlot is left at its default, False).
-
generateData.genKClusters(nSamples, nFeatures=2, nClusters=4, levSeparation=2, rndseed=0)
Generates data for clustering that are normally distributed around the centers.
- Parameters
nSamples (int) – number of sample points to generate
nFeatures (int) – dimensions of the data points
nClusters (int) – number of clusters to generate
levSeparation (float) – level of cluster separation
rndseed (int) – state of the random number generator
- Returns
tuple containing the generated data and their labels
- Return type
(numpy::array, numpy::array)
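As a rough illustration of what such a data set looks like, the following NumPy sketch draws points normally distributed around random centers. It is not the implementation of genKClusters: the function name `sketch_blobs`, the way `separation` scales the centers, and the fixed standard deviation of 0.5 are all assumptions made for the example.

```python
import numpy as np

def sketch_blobs(n_samples, n_features=2, n_clusters=4, separation=2.0, seed=0):
    """Return (data, labels) for normally distributed blobs (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Random cluster centers; a larger `separation` spreads them further
    # apart (assumed meaning, not taken from the documented function).
    centers = separation * rng.uniform(-1.0, 1.0, size=(n_clusters, n_features))
    # Assign each point to a cluster, then add Gaussian scatter around its center.
    labels = rng.integers(0, n_clusters, size=n_samples)
    data = centers[labels] + rng.normal(scale=0.5, size=(n_samples, n_features))
    return data, labels

blob_data, blob_labels = sketch_blobs(1000)
```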
-
generateData.genKMoons(nSamples, levNoise=0.1, rndseed=0)
Generates data for clustering: 4 moon-shaped clusters in 2 dimensions.
- Parameters
nSamples (int) – number of sample points to generate
levNoise (float) – determines the thickness of the moons
rndseed (int) – state of the random number generator
- Returns
tuple containing the generated data and their labels
- Return type
(numpy::array, numpy::array)
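One possible way to build four moon-shaped clusters is sketched below, assuming each moon is a noisy half-circle and the moons come in two interleaved pairs. The layout, the name `sketch_moons`, and the horizontal offset of 3.5 between the pairs are hypothetical choices; the actual genKMoons may arrange the moons differently.

```python
import numpy as np

def sketch_moons(n_samples, noise=0.1, seed=0):
    """Return (data, labels) for 4 noisy half-circles (illustrative only)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 4, size=n_samples)
    t = rng.uniform(0.0, np.pi, size=n_samples)
    upper = labels % 2 == 0
    # Even labels: upper half-circle; odd labels: lower half-circle,
    # shifted so that the two moons of a pair interleave.
    x = np.where(upper, np.cos(t), 1.0 - np.cos(t))
    y = np.where(upper, np.sin(t), 0.5 - np.sin(t))
    # Place the second pair of moons to the right of the first (arbitrary layout).
    x = x + 3.5 * (labels // 2)
    # The noise level controls the thickness of the moons.
    data = np.column_stack([x, y]) + rng.normal(scale=noise, size=(n_samples, 2))
    return data, labels

moon_data, moon_labels = sketch_moons(1000)
```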
-
generateData.genKCircles(nSamples, levNoise=0.075, rndseed=0)
Generates data for clustering: 4 concentric rings in 2 dimensions.
- Parameters
nSamples (int) – number of sample points to generate
levNoise (float) – determines the thickness of the rings
rndseed (int) – state of the random number generator
- Returns
tuple containing the generated data and their labels
- Return type
(numpy::array, numpy::array)
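Concentric rings can be sketched by sampling a uniform angle and a radius scattered around one of four fixed values. This is an illustrative sketch, not the body of genKCircles: the name `sketch_rings` and the ring radii 1, 2, 3, 4 are assumptions.

```python
import numpy as np

def sketch_rings(n_samples, noise=0.075, seed=0):
    """Return (data, labels) for 4 noisy concentric rings (illustrative only)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 4, size=n_samples)
    # Ring k has nominal radius k + 1; the noise sets the ring thickness.
    radius = (labels + 1.0) + rng.normal(scale=noise, size=n_samples)
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n_samples)
    data = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
    return data, labels

ring_data, ring_labels = sketch_rings(1000)
```

Because all rings share the same center, the clusters cannot be separated by distances to centroids, which is what makes this shape hard for centroid-based methods.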
-
generateData.genK4Spirals(nSamples, levNoise=0.1, rndseed=0)
Generates data for clustering: 4 intertwined spirals in 2 dimensions.
- Parameters
nSamples (int) – number of sample points to generate
levNoise (float) – determines the thickness of the spirals
rndseed (int) – state of the random number generator
- Returns
tuple containing the generated data and their labels
- Return type
(numpy::array, numpy::array)
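Four intertwined spirals can be sketched as four copies of one Archimedean spiral arm, each rotated by a quarter turn. The snippet below is an assumption-laden illustration rather than the code of genK4Spirals: the name `sketch_spirals`, the arm length, and the linear radius growth are all hypothetical.

```python
import numpy as np

def sketch_spirals(n_samples, noise=0.1, seed=0):
    """Return (data, labels) for 4 noisy interleaved spiral arms (illustrative only)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 4, size=n_samples)
    # Position along the arm; the radius grows linearly with it.
    t = rng.uniform(0.5, 2.0 * np.pi, size=n_samples)
    # Rotate each arm by a quarter turn so the four arms interleave.
    angle = t + labels * (np.pi / 2.0)
    data = np.column_stack([t * np.cos(angle), t * np.sin(angle)])
    # The noise level controls the thickness of the arms.
    data = data + rng.normal(scale=noise, size=data.shape)
    return data, labels

spiral_data, spiral_labels = sketch_spirals(1000)
```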
-
generateData.GenData(name, outpath, nSamples=100000, nTrain=0, nValid=0, doPlot=False)
Generates any of the 4 (blobs, moons, rings, spirals) data sets for clustering.
Generates 2D data of the required size in 4 clusters of the required shape. The data are shuffled and standardised. Data sets for training and validation, with the required sizes, are sub-sampled. The corresponding files are saved under the required location.
- Parameters
name (str) – one of {‘4Clusters’,‘4Moons’,‘4Circles’,‘4Spirals’}
outpath (str) – location where the generated data files will be saved
nSamples (int) – number of sample points to generate
nTrain (int) – number of sample points for training
nValid (int) – number of sample points for validation
doPlot (bool) – flag to indicate if data should be plotted (visualised)
- Yields
Files, saved under the specified location, containing
\(\texttt{data}\_\texttt{x.dat}\) : the complete data set (data points as rows)
\(\texttt{data}\_\texttt{x}\_\texttt{Labels.dat}\) : the corresponding cluster labels (for each row)
\(\texttt{data}\_\texttt{x}\_\texttt{Train}\_\texttt{Ny.dat}\) : the y sub-sampled data for training
\(\texttt{data}\_\texttt{x}\_\texttt{Valid}\_\texttt{Nz.dat}\) : the z sub-sampled data for validation
where x is one of the available data set names. The generated data are also plotted if doPlot is set.
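The naming pattern listed above can be turned into a small helper that predicts which files to look for after a call to GenData. The helper `expected_files` is hypothetical and not part of generateData; it only assembles strings according to the pattern and does not check what GenData writes when nTrain or nValid is left at 0.

```python
import os

def expected_files(name, outpath, n_train, n_valid):
    """Build the file paths implied by the documented naming pattern."""
    base = os.path.join(outpath, "data_" + name)
    return [
        base + ".dat",                    # complete data set
        base + "_Labels.dat",             # cluster labels, one per row
        f"{base}_Train_N{n_train}.dat",   # training sub-sample
        f"{base}_Valid_N{n_valid}.dat",   # validation sub-sample
    ]

files = expected_files("4Circles", "output", 20000, 80000)
for path in files:
    print(path)
```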