Test2: data and KSC application script generation

This script is for generating the data set and KSC application shell scripts for Test2.

Data generation

Data are generated from isotropic Gaussian distributions with a constant standard deviation. The number of cluster centers, their minimum separation (within a given box) and the number of features (dimensions) can be specified as input arguments. The number of generated samples and the training and validation sub-sample sizes can be changed as well. The data generation related input parameters are listed in Table 5.

Table 5 Data generation related input parameters, their configuration flags and default values.

Flag   Default    Description
----   --------   ----------------------------------------------
-c     8          number of cluster centers
-f     2          number of features
-n     100 000    number of samples
-t     20 000     number of training sub-samples
-v     10 000     number of validation sub-samples
-d     5          minimum distance between cluster centers
-s     0.8        standard deviation of the underlying Gaussians
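
The way the parameters of Table 5 interact can be illustrated with a minimal sketch. It is not the script's actual implementation: the function names `generate_samples` and `split_subsamples`, the equal-probability choice of the cluster for each sample and the disjoint random sub-sampling are all assumptions made for illustration, and the sketch uses a smaller sample count than the defaults to stay quick.

```python
import random


def generate_samples(centers, n_samples, stdv, rng):
    """Draw points from isotropic Gaussians placed at the given centers.

    Assumption: each sample picks its cluster uniformly at random.
    """
    samples, labels = [], []
    for _ in range(n_samples):
        k = rng.randrange(len(centers))
        # Isotropic: the same standard deviation along every feature.
        point = [rng.gauss(c, stdv) for c in centers[k]]
        samples.append(point)
        labels.append(k)
    return samples, labels


def split_subsamples(n_samples, n_train, n_valid, rng):
    """Pick disjoint random index sets for training and validation.

    Assumption: the sub-samples are drawn without overlap.
    """
    picked = rng.sample(range(n_samples), n_train + n_valid)
    return picked[:n_train], picked[n_train:]


rng = random.Random(0)
# Hypothetical centers for -c 3, -f 2 (the script draws them at random).
centers = [[-10.0, -10.0], [0.0, 5.0], [10.0, -5.0]]
samples, labels = generate_samples(centers, 1000, 0.8, rng)   # -n 1000, -s 0.8
train_idx, valid_idx = split_subsamples(len(samples), 200, 100, rng)  # -t 200, -v 100
```

The training and validation sub-sample sizes must together not exceed the number of samples in this sketch, otherwise `rng.sample` raises a `ValueError`.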

Note

Feature values are generated such that they lie in the [-20:20] box/range (before the standardisation). At initialisation, the required number of cluster centers is generated uniformly at random in the (hyper) box such that the generated cluster centers have the required minimum distance from each other. When this condition cannot be fulfilled after a given number of trials (1000), i.e. random cluster centers with the given minimum distance could not be generated, the script is interrupted with an error message. In this case, you need to change the minimum distance or the number of cluster centers parameter.
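
The center generation described in this note amounts to rejection sampling. The following is a minimal sketch of that idea, not the script's own code: the function name `generate_centers` is hypothetical, and it assumes that the limit of 1000 trials applies to each center separately, which the text does not specify.

```python
import math
import random


def generate_centers(n_centers, n_features, min_dist, rng,
                     box=20.0, max_trials=1000):
    """Draw centers uniformly in [-box, box]^n_features by rejection sampling."""
    centers = []
    for _ in range(n_centers):
        for _trial in range(max_trials):
            candidate = [rng.uniform(-box, box) for _ in range(n_features)]
            # Accept only if far enough from every center accepted so far.
            if all(math.dist(candidate, c) >= min_dist for c in centers):
                centers.append(candidate)
                break
        else:
            # No valid candidate within the trial budget (assumed per center).
            raise RuntimeError(
                "could not place all cluster centers: reduce the minimum "
                "distance (-d) or the number of cluster centers (-c)")
    return centers


# Default configuration of Table 5: -c 8, -f 2, -d 5.
centers = generate_centers(8, 2, 5.0, random.Random(0))
print(len(centers))
```

With one feature the box has length 40, so at most 9 centers can be 5 apart; a request such as 50 centers with `-f 1` and `-d 5` therefore always ends in the error branch.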

Shell script generation

After generating the data set and writing it to files (including the sub-samples for training and validation), shell scripts are generated for the KSC applications provided for training, hyper parameter tuning and out-of-sample extension, i.e. testing.

Note that, since the data set is generated by the same script, the training and validation set sizes, their location, the number of features, etc. are known. Moreover, the number of cluster centers, a hyper parameter whose exact value is usually unknown, is also known in this case. The shell scripts generated for the KSC applications will contain these known parameter values.

Moreover, initial values for the RBF kernel bandwidth hyper parameters are estimated by the script (both for the IDC and KSC parts) using the estimateBW.py script provided in the Utils. However, these estimates are often inaccurate and lead to very poor results. An input parameter, listed in Table 6, is available to scale these initial estimates in order to find appropriate values (see Test2 for more details).

Table 6 KSC shell script generation related input parameters, their configuration flags and default values.

Flag   Default             Description
----   -----------------   ------------------------------------------------------------------
-x     \(0.4\sqrt{N}\)     scaling factor for the bandwidth estimation
-e     0.85                level of tolerated error in the incomplete Cholesky approximation
-r     300                 maximum allowed rank of the incomplete Cholesky approximation
-m     'AMS'               cluster membership encoding-decoding {'AMS','BAS','BLF'}
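
As a quick check of the default `-x` value, the expression \(0.4\sqrt{N}\) can be evaluated directly. The sketch below assumes that \(N\) is the number of training sub-samples (`-t`); this is an inference, not something stated in the table, based on the fact that the default of 20 000 reproduces the scaling value printed by the script with default parameters. The function name `default_bw_scaling` is hypothetical.

```python
import math


def default_bw_scaling(n):
    """Default bandwidth scaling factor 0.4 * sqrt(N)."""
    return 0.4 * math.sqrt(n)


# Assumption: N is the number of training sub-samples (-t, default 20 000).
print(f"{default_bw_scaling(20000):.3e}")  # prints 5.657e+01
```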

Examples

This example shows how to run the data generation and the corresponding initial KSC shell script generation with the default parameters. Note that this prints the available parameters, their default values and the corresponding flags to change them:

bash-3.2$ python kscICholTest2.py
 ==== (Python) === : Parameters (-flag to change) ...
  ----------------------------------------------------------
  --- Number of cluster centers (-c)         :            8
  --- Number of features (-f)                :            2
  --- Number of smaples (-n)                 :    1.000e+05
  --- Training sub-smaples (-t)              :    2.000e+04
  --- Validation sub-smaples (-v)            :    1.000e+04
  --- Min. cluster center distance (-d)      :    5.000e+00
  --- STDV of the underlying Gaussians (-s)  :    8.000e-01
  --- Scaling of the kernel BW estimate (-x) :    5.657e+01
  --- Inc. Cholesky error tolerance (-e)     :    8.500e-01
  --- Inc. Cholesky maxium rank (-r)         :    3.000e+02
  --- KSC membership encoding-decoding (-m)  :          AMS
  ----------------------------------------------------------
 ==== (Python) === : Generating data ...
 ==== (Python) === : Generating (initial) script for hyper-parameter tuning ...
 ==== (Python) === : Generating (initial) script for training ...
 ==== (Python) === : Generating (initial) script for testing ...
kscICholTest2.PrintParam()
kscICholTest2.GenerateData()
kscICholTest2.GenerateScriptTraining()
kscICholTest2.GenerateScriptTuning()
kscICholTest2.GenerateScriptTesting()
kscICholTest2.Help()
kscICholTest2.main(argv)
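
The flags of Tables 5 and 6 could be read from `argv` along the following lines. This is only a sketch under stated assumptions: the source does not show how `main(argv)` parses its arguments, so the use of `getopt`, the form `-flag value`, and the names `parse_args`, `DEFAULTS` and `CASTS` are all hypothetical. The `-x` default is left as `None` here to stand for "use the script's own default".

```python
import getopt

# Defaults taken from Tables 5 and 6; None for -x means "script default".
DEFAULTS = {"c": 8, "f": 2, "n": 100000, "t": 20000, "v": 10000,
            "d": 5.0, "s": 0.8, "x": None, "e": 0.85, "r": 300, "m": "AMS"}

# Type of each flag's value (assumed from the default values in the tables).
CASTS = {"c": int, "f": int, "n": int, "t": int, "v": int, "r": int,
         "d": float, "s": float, "x": float, "e": float, "m": str}


def parse_args(argv):
    """Return the parameter dictionary after applying command line flags."""
    params = dict(DEFAULTS)
    # Every flag is assumed to take exactly one value, e.g. "-c 4".
    opts, _ = getopt.getopt(argv, "c:f:n:t:v:d:s:x:e:r:m:")
    for flag, value in opts:
        key = flag.lstrip("-")
        params[key] = CASTS[key](value)
    if params["m"] not in ("AMS", "BAS", "BLF"):
        raise ValueError("-m must be one of 'AMS', 'BAS', 'BLF'")
    return params


# Example: four cluster centers at least 8 apart, everything else default.
print(parse_args(["-c", "4", "-d", "8"]))
```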