Split

The splitting of a dataset is a common pre-processing step in data analysis and modelling, performed either to reduce the volume of the data or to create subsets for the model external validation process [1]. In the latter case, two representative subsets are often required; for that, the initial dataset is divided into training and test sets. Objects within the test set are not used during model development but are left out to measure the performance of the model in real-case scenarios [2,3]. The splitting can be performed either randomly or using a method of representative selection, so that the selected instances cover the data space uniformly.

Table of contents

  1. Kennard-Stone
  2. Random Partitioning
  3. Tips
  4. See also
  5. References
  6. Version History

Kennard-Stone

The Kennard-Stone method facilitates the selection of two representative subsets (e.g., training and test sets) with a uniform distribution over an initial dataset [4]. This function implements the approach of Daszykowski et al. [1]:

  1. Initially, a central object is selected from the dataset.
  2. Extreme objects from the data space border are then added to the training set.

Specifically, the method begins by calculating the mean of the dataset. The object closest to this mean, deemed the most representative, is assigned as the first object in the training set (first partition) and subsequently removed from the initial dataset. The second object chosen for the training set is the one furthest from the first and is also removed from the dataset.

Following this, the algorithm continues by iteratively adding objects to the training set. For each iteration, it calculates the Euclidean distance between each remaining unassigned object and the already selected objects. The object furthest from its closest neighbor in the training set is selected and added to the training set. This process repeats until the desired number of samples is included in the training set.
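The selection loop described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation of the max-min distance procedure, not the platform's own code; the function and variable names are ours:

```python
import numpy as np

def kennard_stone(X, n_select):
    """Select n_select row indices of X via the Kennard-Stone procedure.

    Starts from the object closest to the data mean, then repeatedly adds
    the remaining object whose distance to its nearest already selected
    object is largest (max-min criterion).
    """
    X = np.asarray(X, dtype=float)
    # 1. Most representative object: the one closest to the mean.
    dist_to_mean = np.linalg.norm(X - X.mean(axis=0), axis=1)
    selected = [int(np.argmin(dist_to_mean))]
    # Each object's distance to its nearest selected object so far.
    min_dist = np.linalg.norm(X - X[selected[0]], axis=1)
    while len(selected) < n_select:
        # 2. Pick the object furthest from its closest selected neighbour.
        candidate = int(np.argmax(min_dist))
        selected.append(candidate)
        min_dist = np.minimum(min_dist,
                              np.linalg.norm(X - X[candidate], axis=1))
    return selected  # row indices of the first partition (training set)
```

Selected objects get a zero distance to themselves, so they are never picked twice; the loop stops once the requested partition size is reached.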

Use the Kennard-Stone function by browsing in the top ribbon:

Data Transformation \(\rightarrow\) Split \(\rightarrow\) Kennard-Stone

Input

Data matrix to partition.

Configuration

Target Column Select the target column (dependent variable). This will be excluded from the splitting procedure. String columns are automatically excluded.
Percentage An integer in the range (0, 100) specifying the splitting ratio, i.e., the proportion of input rows included in the first partition. Default value: 40.
Perform Computations Choose whether to perform calculations on the CPU or the GPU (if available on the user’s PC). The GPU option enables faster computations by leveraging the graphics processing unit.

Output

Two data partitions. The results are not visible in the output spreadsheet, but the different partitions can be imported independently in other tabs.

Example

Input

In the left-hand spreadsheet of the tab, import the data matrix that is going to be split.

split-input
Configuration
  1. Select Data Transformation \(\rightarrow\) Split \(\rightarrow\) Kennard-Stone.
  2. Select the Target Column name from the dropdown list [1].
  3. Type in the Percentage field [2] the ratio of input rows included in the first partition.
  4. Select whether the calculations are performed on CPU or GPU [3].
  5. Click on the Execute button [4] to perform the data partitioning.
kennard-stone-configuration
Output
  1. In the right-hand spreadsheet of the tab the input data matrix is presented intact.
  2. Insert a new tab by clicking on the + button [1].
  3. Right click on the left-hand spreadsheet and select Import from SpreadSheet [2].
  4. In the configuration window select from the Select input tab the training (first partition) or the test set (second partition) [3].
  5. Click on the Execute button [4] and continue with the rest of your analysis steps.
random-split-output

Random Partitioning

Random row-wise split of the input data matrix into two subsets (e.g., training and test sets).

Use the Random Partitioning function by browsing in the top ribbon:

Data Transformation \(\rightarrow\) Split \(\rightarrow\) Random Partitioning

Input

Data matrix to partition.

Configuration

Training set percentage An integer in the range (0, 100) specifying the splitting ratio, i.e., the proportion of input rows included in the first partition. Default value: 40.
Usage of random generator seed Specify an integer as a fixed seed to acquire reproducible results upon re-execution of the splitting step. If you tick the Usage of random generator seed checkbox, a random seed is used instead and different splits are produced upon re-execution.
Stratified sampling Check this box, then select the column name from the dropdown list to ensure that the distribution of its values is approximately preserved during partitioning.
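The interaction of these three options can be sketched as follows. This is an assumed re-implementation for illustration only (the platform's internals may differ); `random_partition` and its parameters are our own names, with `seed=None` standing in for the ticked randomness checkbox:

```python
import numpy as np

def random_partition(n_rows, train_pct, seed=None, strata=None):
    """Randomly split row indices 0..n_rows-1 into two partitions.

    train_pct : integer in (0, 100), share of rows in the first partition.
    seed      : fixed integer for reproducible splits; None draws a fresh
                random split on every call (like ticking the checkbox).
    strata    : optional sequence of labels; when given, the split is drawn
                per label so each value's share is roughly preserved.
    """
    rng = np.random.default_rng(seed)
    if strata is None:
        perm = rng.permutation(n_rows)
        n_train = round(n_rows * train_pct / 100)
        return sorted(perm[:n_train]), sorted(perm[n_train:])
    train, test = [], []
    strata = np.asarray(strata)
    for value in np.unique(strata):
        # Shuffle and split each stratum separately.
        idx = rng.permutation(np.flatnonzero(strata == value))
        n_train = round(len(idx) * train_pct / 100)
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return sorted(train), sorted(test)
```

With a fixed seed the same partitions are returned on every call; with stratification each label contributes its own proportional share to both partitions.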

Output

Two data partitions. The results are not visible in the output spreadsheet, but the different partitions can be imported independently in other tabs.

Example

Input

In the left-hand spreadsheet of the tab, import the data matrix that is going to be split.

split-input
Configuration
  1. Select Data Transformation \(\rightarrow\) Split \(\rightarrow\) Random Partitioning.
  2. Type the Training set percentage [1] and the fixed seed [2] to achieve reproducible splits. Otherwise, tick the Usage of random generator seed checkbox [3].
  3. (Optional) Click on the Stratified sampling checkbox [4] and select the column name [5] that corresponds to the feature whose value distribution will be preserved in both partitions.
  4. Click on the Execute button [6] to perform the data partitioning.
random-split-configuration
Output
  1. In the right-hand spreadsheet of the tab the input data matrix is presented intact.
  2. Insert a new tab by clicking on the + button [1].
  3. Right click on the left-hand spreadsheet and select Import from SpreadSheet [2].
  4. In the configuration window select from the Select input tab the training (first partition) or the test set (second partition) [3].
  5. Click on the Execute button [4] and continue with the rest of your analysis steps.
random-split-output

Tips

  • The splitting functions are useful when two (or more) subsets of data are needed, e.g., as training and test/validation sets in model development, or when the original dataset is too large to process in its entirety, allowing analysis to be conducted on a more manageable subset.
  • The Kennard-Stone method, in contrast to random partitioning, is deterministic: it produces consistent partitions for given input parameters and data matrices. It also allows a more representative subset selection that covers the whole data space. However, it is more time-consuming than random partitioning for the same input data matrix and splitting ratio.
  • Consider data scaling, if necessary, prior to the use of the Kennard-Stone function, as the method relies on distance calculations.
  • External validation using the Kennard-Stone splitting should be supplemented by cross-validation and/or other validation methodologies, as the results may overestimate the model's predictive ability (the test set being too similar to the training set).
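The scaling tip above matters because features on large numeric scales dominate Euclidean distances. A minimal autoscaling (standardization) sketch, assuming NumPy; the function name is ours:

```python
import numpy as np

def autoscale(X):
    """Standardize columns to zero mean and unit variance, so that no
    single feature dominates the Euclidean distances computed by a
    distance-based selection method such as Kennard-Stone."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unchanged (avoid 0/0)
    return (X - X.mean(axis=0)) / std
```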

See also

The generated train and test sets from the splitting function can be imported to subsequent tabs/nodes (see Data Representation). For data scaling refer to the Normalizers functions.

Workflows

Bodyfat prediction case study

House pricing case study

Insurance charges case study

MA score case study

Salary prediction case study

Breast cancer case study

Credit card case study

Parkinson’s disease case study

Students’ performance case study

References

  1. Daszykowski M, Walczak B, Massart DL. Representative subset selection. Anal Chim Acta 2002. doi.org/10.1016/S0003-2670(02)00651-7.
  2. Witten IH, Frank E, Hall MA, Pal CJ. Data Mining: Practical Machine Learning Tools and Techniques. Fourth edition. Morgan Kaufmann; 2011. doi.org/10.1016/C2009-0-19715-5.
  3. Varsou D-D, Kolokathis PD, Antoniou M, Sidiropoulos NK, Tsoumanis A, Papadiamantis AG, et al. In silico assessment of nanoparticle toxicity powered by the Enalos Cloud Platform: Integrating automated machine learning and synthetic data for enhanced nanosafety evaluation. Comput Struct Biotechnol J 2024;25:47–60. doi.org/10.1016/j.csbj.2024.03.020.
  4. Kennard RW, Stone LA. Computer aided design of experiments. Technometrics 1969;11:137–48. doi.org/10.2307/1266770.

Version History

Introduced in Isalos Analytics Platform v0.2.3

Instructions last updated on June 2024