Regression
Regression is a typical supervised learning task in predictive modelling that aims to investigate the relationship between a set of independent features and a dependent target variable. Regression analysis algorithms are employed when the target variable is a continuous numeric value.1
Table of contents
- k Nearest Neighbors (kNN)
- Fully Connected Neural Network
- Radial Basis Function Network
- Linear SGD
- XGBoost
- Random Forest
- Tips
- See also
- References
- Version History
k Nearest Neighbors (kNN)
k Nearest Neighbors (kNN) is a simple non-parametric algorithm that operates by identifying the data points in the training set that are closest to a new, unseen input. This instance-based learning method determines the closeness of data points by calculating the Euclidean distance between instances across all attribute values. The k parameter denotes the number of nearest neighbors to consider.2 The target value for a new instance is predicted by averaging the distance-weighted target values of the k nearest data points. Because the predictions rely on Euclidean distances, the data are scaled internally by the function.
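For readers who want to see the prediction rule in code, here is a minimal Python sketch using scikit-learn (an illustrative stand-in, not the library behind Isalos) that reproduces the scaling and distance-weighted averaging described above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Toy training set: two numeric features, one continuous target.
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [5.0, 5.0]])
y_train = np.array([1.5, 2.0, 3.5, 5.0])

# Scale first: Euclidean distances are sensitive to feature scales
# (Isalos performs this scaling internally).
scaler = StandardScaler().fit(X_train)

# weights="distance" averages the k nearest target values weighted by
# the inverse of their Euclidean distances.
knn = KNeighborsRegressor(n_neighbors=3, weights="distance")
knn.fit(scaler.transform(X_train), y_train)

X_new = np.array([[2.5, 2.5]])
print(knn.predict(scaler.transform(X_new)))
```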
Use the k Nearest Neighbors (kNN) function by browsing in the top ribbon:
Analytics \(\rightarrow\) Regression \(\rightarrow\) k Nearest Neighbors (kNN)
Input
Data matrix with training set data.
Configuration
Target Column | Select from the drop-down menu the column containing the target variable that is going to be predicted. Columns with categorical features cannot be selected as targets. |
Number of Neighbors | An integer (k) representing the number of closest neighbors (data points) used to make predictions for a new data point. |
Output
A data matrix including the actual target value and the value predicted by the algorithm (“kNN Prediction”). For each data point, the closest neighbors from the training set are listed in the “Closest NN” columns, along with their corresponding distances in the “Distance from NN” columns.
Example
Input
In the left-hand spreadsheet of the tab import the data matrix including the target variable for prediction.

Configuration
- Select Analytics \(\rightarrow\) Regression \(\rightarrow\) k Nearest Neighbors (kNN).
- Select the column that is going to be predicted from the drop-down menu [1]. The columns containing categorical features are automatically excluded from the list.
- Type the Number of Neighbors [2] to consider.
- Click on the Execute button [3] to apply the algorithm on the input table.
Output
In the right-hand spreadsheet of the tab the output data matrix with the actual and the predicted value of the target is presented. The k closest instances identified in the training set are also given for each data point (5 in this case), along with the corresponding Euclidean distance from each neighbor. Note that “Closest NN1” represents the nearest neighbor, which is the data point itself when the model is applied to the training set. Consequently, the “Distance from NN1” is 0 for all training instances.

Application on external set
You can apply the trained k Nearest Neighbors (kNN) model to any external (test) data using the Existing Model Utilization function:
- Import the external data in the left-hand spreadsheet of the tab. Include the same columns used to build the kNN model.
- Select Analytics \(\rightarrow\) Existing Model Utilization. Select the trained kNN model [1] and click on the Execute button [2].
- Inspect the results in the right-hand spreadsheet of the tab. Note that in this case the closest neighbor listed in the “Closest NN1” column belongs to the training set, and the “Distance from NN1” is not zero.

Fully Connected Neural Network
A fully connected neural network is a type of feedforward artificial neural network consisting of multiple layers of neurons: an input layer, one or more hidden layers and an output layer, with each layer fully connected to the next. Non-linear activation functions are typically used in the hidden layers, allowing the network to learn complex patterns in the data. Such a multilayer perceptron (MLP) is trained with the backpropagation algorithm.1,2
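To make the role of the training hyperparameters concrete, the following scikit-learn sketch trains a comparable network with two hidden layers. It mirrors the Batch Size, Number of Epochs, Learning Rate and Momentum options of the Configuration table below, but it is only an illustrative stand-in, not the platform's implementation:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                       # 3 numeric features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Two hidden layers of 16 and 8 neurons with ReLU activations; SGD with
# momentum mirrors the Batch Size / Epochs / Learning Rate / Momentum
# fields of the dialog. max_iter plays the role of Number of Epochs.
mlp = MLPRegressor(hidden_layer_sizes=(16, 8), activation="relu",
                   solver="sgd", batch_size=32, max_iter=100,
                   learning_rate_init=0.01, momentum=0.9,
                   random_state=1)
mlp.fit(X, y)
print(mlp.predict(X[:5]))
```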
Use the Fully Connected Neural Network function by browsing in the top ribbon:
Analytics \(\rightarrow\) Regression \(\rightarrow\) Fully Connected Neural Network
Input
Data matrix with training set data. String columns are not taken into account in the algorithm implementation, therefore categorical features must be removed or encoded into numerical values.
Configuration
Batch Size | Select from the drop-down menu the number of training instances that will be propagated through the network. Four options are available for selection: 16, 32, 64, and 128. |
Number of Epochs | Specify the number (integer) of complete passes through the data during training. |
Learning Rate | Specify the learning rate (between 0 and 1), which controls the size of the steps taken during optimization. |
Momentum | Specify the momentum rate (between 0 and 1) for the backpropagation algorithm. |
+/- | Click on the + and - buttons to add or remove hidden layers, respectively. |
Hidden Layers | For each added hidden layer, specify the number of neurons and select the non-linear activation function used to map the weighted inputs to the output of each neuron. Options for the activation functions include: RELU, RELU6, LEAKYRELU, SELU, SWISH, RRELU, SIGMOID, SOFTMAX, SOFTPLUS, SOFTSIGN, TANH, THRESHOLDEDRELU, GELU, ELU, MISH, CUBE, HARDSIGMOID, HARDTANH, IDENTITY, RATIONALTANH, and RECTIFIEDTANH. |
Target Column | Select from the drop-down menu the column containing the feature that is going to be predicted. |
RNG Seed | Select an integer as seed to get reproducible results. The option to select a time-based random number-generated seed is available. |
Output
A data matrix including the actual target value and the value predicted by the algorithm (“Prediction”) is presented.
Example
Input
In the left-hand spreadsheet of the tab import the data matrix including the target variable for prediction. In case categorical (string) columns are included in the set, these should be encoded into representative numerical values.
Configuration
- Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Fully Connected Neural Network.
- Select the hyperparameters that determine the training procedure: the Batch Size [1], the Number of Epochs [2], the Learning Rate [3] and the Momentum [4].
- Select the hyperparameters that determine the Hidden Layers [5] of the neural network: the number of neurons [6] and the activation function [7] of each layer.
- Add [8] or remove [9] hidden layers to define the architecture of the neural network.
- Select the column that is going to be predicted from the drop-down menu [10].
- Select a Seed for reproducible results or a random number generated Time-based (RNG) Seed [11].
- Click on the Execute button [12] to apply the training algorithm on the input columns.
Output
In the right-hand spreadsheet of the tab the output data matrix with the actual and the predicted values of the target is presented.
Radial Basis Function Network
A radial basis function (RBF) network is an artificial neural network that employs RBF kernels as activation functions. The network consists of three layers: the input layer modeling a vector that passes the data, the hidden layer that performs the computations, and the output layer designated for regression problems. The output of the network is a linear combination of the activations (outputs) of the hidden units.1,3
A Radial Basis Function is a real-valued function $\varphi(r)$ whose value depends only on the distance $r = \lVert x - c \rVert$ between an input point $x$ and the center $c$ of each neuron, which serves as the reference point (Eq. 1).3

$\varphi(x) = \varphi(\lVert x - c \rVert)$ (Eq. 1)
The radial basis function kernels available in Isalos include:
Gaussian (Eq. 2): $\varphi(r) = e^{-(\varepsilon r)^{2}}$
Multiquadric (Eq. 3): $\varphi(r) = \sqrt{1 + (\varepsilon r)^{2}}$
Inverse Quadratic (Eq. 4): $\varphi(r) = \frac{1}{1 + (\varepsilon r)^{2}}$
Inverse Multiquadric (Eq. 5): $\varphi(r) = \frac{1}{\sqrt{1 + (\varepsilon r)^{2}}}$
Polyharmonic spline (Eq. 6): $\varphi(r) = r^{k}$ for $k$ odd, and $\varphi(r) = r^{k}\ln(r)$ for $k$ even, where $k$ is the order of the spline.
Thin Plate Spline (Eq. 7): $\varphi(r) = r^{2}\ln(r)$
Bump Function (Eq. 8): $\varphi(r) = \exp\left(-\frac{1}{1-(\varepsilon r)^{2}}\right)$ for $r < \frac{1}{\varepsilon}$, and $\varphi(r) = 0$ otherwise,
where $\varepsilon$ is the shape parameter used to scale the input of the radial kernel.
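These kernels are simple enough to state directly in code. The NumPy sketch below implements them together with the linear output combination of the network; the centers and weights shown are hypothetical illustration values, not output from Isalos:

```python
import numpy as np

# Radial basis kernels from Eqs. 2-8 (eps: shape parameter, k: spline order).
def gaussian(r, eps):             return np.exp(-(eps * r) ** 2)
def multiquadric(r, eps):         return np.sqrt(1 + (eps * r) ** 2)
def inverse_quadratic(r, eps):    return 1 / (1 + (eps * r) ** 2)
def inverse_multiquadric(r, eps): return 1 / np.sqrt(1 + (eps * r) ** 2)

def polyharmonic(r, k):
    r = np.maximum(r, 1e-12)                 # avoid log(0) at r = 0
    return r ** k * np.log(r) if k % 2 == 0 else r ** k

def thin_plate(r):
    return polyharmonic(r, 2)                # r^2 * ln(r)

def bump(r, eps):
    out = np.zeros_like(r, dtype=float)      # 0 outside the support
    inside = eps * r < 1
    out[inside] = np.exp(-1 / (1 - (eps * r[inside]) ** 2))
    return out

# The network output is a linear combination of the hidden activations:
# y(x) = sum_i w_i * phi(||x - c_i||).
centers = np.array([[0.0, 0.0], [1.0, 1.0]])   # hypothetical RBF centers
weights = np.array([0.7, -0.2])                # hypothetical trained weights
x = np.array([0.5, 0.2])
r = np.linalg.norm(x - centers, axis=1)        # distances to each center
print(weights @ gaussian(r, eps=1.0))
```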
Use the Radial Basis Function Network regression function by browsing in the top ribbon:
Analytics \(\rightarrow\) Regression \(\rightarrow\) Radial Basis Function Network
Input
Data matrix with training set data. String columns are not taken into account in the algorithm implementation, therefore categorical features must be removed or encoded into numerical values. RBF does not allow the use of integer or string input.
Configuration
Hidden Neurons | Specify the number of neurons in the hidden layer of the network. |
RBF Kernel | Select from the drop-down menu the radial basis function kernel. Options include: - GAUSSIAN (Eq. 2), - MULTIQUADRIC (Eq. 3), - INVERSE QUADRATIC (Eq. 4), - INVERSE MULTIQUADRIC (Eq. 5), - POLYHARMONIC SPLINE (Eq. 6), - THIN PLATE SPLINE (Eq. 7), and - BUMP FUNCTION (Eq. 8). A new configuration field appears after the RBF Kernel selection, for setting the Epsilon ($\varepsilon$) shape parameter or the order K ($k$) where applicable. |
Point Selection | Select manually the method that determines how the centers of the neural network are chosen. Options include: - Random Points from Training set: centers are chosen randomly from the training data. - Use KMeans: RBF centers are the cluster centers of the partitioned training data. |
RNG Seed | Select an integer as seed to get reproducible results. The option to select a time-based random number-generated seed is available. |
Target Column | Select from the drop-down menu the column containing the target variable that is going to be predicted. Columns with categorical features cannot be selected as targets. |
Output
A data matrix including the actual target value and the value predicted by the algorithm (“Prediction”) is presented.
Example
Input
In the left-hand spreadsheet of the tab import the data matrix including the target variable for prediction. Note that the “Species” categorical column is presented as double (D).

Configuration
- Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Radial Basis Function Network.
- Type the number (integer) of Hidden Neurons [1].
- Select the RBF Kernel [2] used as activation function of the hidden layer and subsequently select the Epsilon [3] or K parameter where applicable.
- Select the Point Selection [4] method to determine the centers of the network.
- Type an RNG Seed for reproducible results or a random number generated Time-based (RNG) Seed [5].
- Select the Target Column that is going to be predicted from the drop-down menu [6]. Columns with categorical features cannot be selected as targets.
- Click on the Execute button [7] to proceed with training.
Output
In the right-hand spreadsheet of the tab the output data matrix with the actual and the predicted value of the target variable is presented.

Linear SGD
Stochastic Gradient Descent (SGD) is a method used to minimize an objective (loss) function, which measures the error between the actual values and the predicted outcomes of data points. SGD is employed to incrementally fit various linear regressors by iteratively updating the parameters of a linear model. At each step, it estimates the gradient of the loss function and adjusts the parameters using a constant or decreasing learning rate. Unlike traditional gradient descent, which processes the entire training set in each iteration, SGD operates on a small batch of the training data, enabling faster convergence.1,5
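The update rule can be made concrete with a short NumPy sketch of squared-loss SGD with a linearly decaying learning rate. This is an illustrative reimplementation under assumed defaults, not the platform's code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                  # 2 numeric features
y = X @ np.array([2.0, -1.0]) + 0.5 + rng.normal(scale=0.1, size=500)

w, b = np.zeros(2), 0.0
eta0, epochs = 0.1, 10
for epoch in range(epochs):
    eta = eta0 * (1 - epoch / epochs)          # linear learning-rate decay
    for i in rng.permutation(len(X)):          # one instance per update step
        err = X[i] @ w + b - y[i]              # residual; gradient of 0.5*err**2
        w -= eta * err * X[i]
        b -= eta * err
print(w, b)                                    # approaches [2.0, -1.0] and 0.5
```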
Use the Linear SGD regression function by browsing in the top ribbon:
Analytics \(\rightarrow\) Regression \(\rightarrow\) Linear SGD
Input
Data matrix with training set data. String columns are not taken into account in the algorithm implementation, therefore categorical features must be removed or encoded into numerical values. Linear SGD does not allow the use of integer or string input.
Configuration
Target Column | Select from the drop-down menu the column containing the target variable that is going to be predicted. Columns with categorical features cannot be selected as targets. |
Loss function | Select manually from the drop-down menu the objective function to be minimized. Three options are available for selection, namely: - Squared Loss: average of the squared errors, - Absolute Loss: average of the absolute differences, and - Huber Loss: a combination of Squared Loss and Absolute Loss (see the standard definitions after this table). |
Optimizer | Select manually the gradient optimizer from the drop-down menu. Three options are available for selection, namely: - Linear Decay SGD: learning rate decreases linearly over time, - Simple SGD: constant learning rate throughout training, and - Squared Root Decay SGD: learning rate decreases with the inverse square root of the number of iterations. |
Learning rate | Specify the initial step size (between 0 and 1) used in the iterative optimization procedure (default value: 0.1). |
Number of epochs | Specify the number (integer) of complete passes through the data during training (default value: 10). |
RNG Seed | Select an integer as seed to get reproducible results. The option to select a time-based random number-generated seed is available. |
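For reference, the three loss options correspond to the following standard per-instance definitions (textbook forms; the Huber threshold $\delta$ used internally is not documented here):

$L_{\text{squared}}(y, \hat{y}) = (y - \hat{y})^{2}, \qquad L_{\text{absolute}}(y, \hat{y}) = \lvert y - \hat{y} \rvert$

$L_{\text{Huber}}(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^{2}, & \lvert y - \hat{y} \rvert \le \delta \\ \delta \lvert y - \hat{y} \rvert - \frac{1}{2}\delta^{2}, & \text{otherwise} \end{cases}$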
Output
A data matrix including the actual target value and the value predicted by the algorithm (“Prediction”) is presented.
Example
Input
In the left-hand spreadsheet of the tab import the data matrix including the target variable for prediction. Note that the “Species” categorical column is presented as double (D).

Configuration
- Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Linear SGD.
- Select the column containing the target variable that is going to be predicted from the drop-down menu [1]. Columns with categorical features cannot be selected as targets.
- Select the Loss function [2] and the gradient Optimizer [3] from the drop-down menus.
- Type the hyperparameters that determine the structure of the model: the Learning rate [4] and the Number of epochs [5]. Default values, data types (double or integer) and acceptable ranges are indicated as guidance on the input parameter values.
- Type an RNG Seed for reproducible results or a random number generated Time-based (RNG) Seed [6].
- Click on the Execute button [7] to apply the training algorithm on the input data.
Output
In the right-hand spreadsheet of the tab the output data matrix with the actual and the predicted value of the target is presented.

XGBoost
The Extreme Gradient Boosting (XGBoost) open-source library6 is used to implement the gradient boosting framework. The library uses a class of ensemble machine learning algorithms constructed from decision tree models. Ensemble learning operates by combining different individual base learners to obtain a final prediction.7 In an iterative process, trees are added to the ensemble so that the prediction error (loss) of previous models is reduced. In regression problems, the mean squared error is used as loss function.8
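Because the function wraps the open-source XGBoost library, the dialog fields map naturally onto the library's own parameters. Here is a hedged sketch using the xgboost Python package; the field-to-parameter mapping in the comments is our reading of the dialog, not taken from the Isalos documentation:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))                  # 4 numeric features
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=300)

model = xgb.XGBRegressor(
    booster="gbtree", objective="reg:squarederror",
    n_estimators=100,        # "number of estimators"
    learning_rate=0.3,       # "eta"
    gamma=0.0, max_depth=6, min_child_weight=1,
    colsample_bytree=1.0,    # "column sample by tree"
    subsample=1.0, tree_method="hist",
    reg_lambda=1.0,          # "lambda" (L2)
    reg_alpha=0.0,           # "alpha" (L1)
    random_state=42,         # "RNG Seed"
)
model.fit(X, y)
print(model.predict(X[:5]))
```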
Use the XGBoost regression function by browsing in the top ribbon:
Analytics \(\rightarrow\) Regression \(\rightarrow\) XGBoost
Input
Data matrix with training set data. String columns are not taken into account in the algorithm implementation, therefore categorical features must be removed or encoded into numerical values (integer or double).
Configuration
Target Column | Select from the drop-down menu the column containing the target variable that is going to be predicted. Columns with categorical features cannot be selected as targets. |
booster | Select from the drop-down menu which booster to use. Three options are available for selection, namely: - gbtree: tree-based models (default), - dart: tree-based models with dropout, and - gblinear: linear functions. |
objective | Select from the drop-down menu the learning objective of the method. Options include: - reg:squarederror : regression with squared loss, - reg:gamma : gamma regression with log-link, whose output is a mean of gamma distribution, and - reg:tweedie : tweedie regression with log-link. |
number of estimators | Type the number of models (integer) to train in the learning ensemble. |
eta | Specify the learning rate (between 0 and 1) which determines the step size shrinkage to prevent overfitting (default value: 0.3). |
gamma | Specify the minimum loss reduction required to make a further partition on a leaf node of the tree (default value: 0). |
max depth | Specify the maximum depth of a tree as a positive integer (default value: 6). |
min child weight | Specify the minimum sum of instance weight (hessian) needed in a child (default value: 1). |
column sample by tree | Specify the subsample ratio of features when constructing each tree. Subsampling will occur once for every tree constructed. |
sub sample | Specify the subsampling ratio (between 0 and 1) of the training instances. Subsampling will occur once in every boosting iteration (default value: 1). |
tree method | Select the tree construction algorithm used in XGBoost. Options include: - auto: heuristically chooses the fastest method, typically based on the dataset size, - exact: exact greedy algorithm, - approx: approximates the greedy algorithm using quantile sketch and gradient histogram, and - hist: fast histogram-optimized approximate greedy algorithm. |
lambda | Specify the L2 regularization term on leaf weights (default value: 1). |
alpha | Specify the L1 regularization term on leaf weights (default value: 0). |
RNG Seed | Select an integer as seed to get reproducible results. The option to select a time-based random number-generated seed is available by clicking on the Time-based RNG Seed checkbox. |
Output
A data matrix including the actual target value and the value predicted by the algorithm (“Prediction”) is presented.
Example
Input
In the left-hand spreadsheet of the tab import the data matrix including the target variable for prediction. Note that the “Species” categorical column is presented as double (D).

Configuration
- Select Analytics \(\rightarrow\) Regression \(\rightarrow\) XGBoost.
- Select the column that is going to be predicted from the drop-down menu [1].
- Select the tree booster [2] method, the objective function [3] for loss and type the number of estimators [4] involved in the ensemble.
- Select the hyperparameters involved in the regularization of the model: eta [5], gamma [6], lambda [12] and alpha [13]. Select the hyperparameters involved in tree construction: max depth [7] and min child weight [8]. Select the column sampling rate by tree [9] and the overall subsampling rate [10]. Default values, data types (double or integer) and acceptable ranges are indicated as guidance on the input parameter values.
- Select the tree construction algorithm [11] used in XGBoost.
- Type an RNG Seed for reproducible results or a random number generated Time-based RNG Seed [14].
- Click on the Execute button [15] to apply the training algorithm on the input data.
Output
In the right-hand spreadsheet of the tab the output data matrix with the actual and the predicted value of the target variable is presented.

Random Forest
Random forest is an ensemble learning method that operates by building multiple randomized decision trees during training and combining their individual predictions. The decision trees are constructed in parallel, with no interaction between them, using random subsets of the training data and input attributes to ensure diversity. The predictions made independently by all the trees in the forest are aggregated and averaged to produce the final prediction.7,9
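A minimal scikit-learn sketch of the same technique follows; the comments show how its parameters plausibly correspond to the dialog fields described below, but this is an illustration, not the platform's implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))                  # 5 numeric features
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.2, size=200)

# n_estimators ~ "Number of ensembles", max_features ~ "Features fraction",
# min_impurity_decrease ~ "Min impurity decrease", random_state ~ "Seed".
forest = RandomForestRegressor(n_estimators=10, max_features=0.9,
                               min_impurity_decrease=0.1, random_state=1)
forest.fit(X, y)
print(forest.predict(X[:5]))                   # average over the 10 trees
```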
Use the Random Forest function by browsing in the top ribbon:
Analytics \(\rightarrow\) Regression \(\rightarrow\) Random Forest
Input
Data matrix with training set data. String columns are not taken into account in the algorithm implementation, therefore categorical features must be excluded or encoded into numerical values (integer or double).
Configuration
Features fraction | Specify the feature subsampling rate represented as a fraction of features (between 0 and 1) available in each tree split (default value: 0.9). |
Min impurity decrease | Specify the impurity decrease threshold (between 0 and 1) necessary to determine the quality of splits in the decision trees. A split is only considered if it results in a decrease of impurity greater than or equal to this value (default value: 0.1). |
Seed | Select an integer as seed to get reproducible results. The option to select a time-based random number-generated seed is available. |
Number of ensembles | Specify the number of individual trees to be generated by the algorithm (default value: 10). |
Target column | Select manually from the drop-down menu the column name containing the target variable that is going to be predicted. |
Output
A data matrix including the actual target value and the value predicted by the algorithm (“Prediction”) is presented.
Example
Input
In the left-hand spreadsheet of the tab import the data matrix including the target variable for prediction. Note that the “Species” categorical column is presented as double (D).

Configuration
- Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Random Forest.
- Specify the hyperparameters that determine the structure of the model: the Features fraction [1], Min impurity decrease [2] and Number of ensembles [4]. Default values, data types (double or integer) and acceptable ranges are indicated as guidance on the input parameter values.
- Type a Seed for reproducible results or a random number generated Time-based RNG Seed [3].
- Select the column name with the target variable that is going to be predicted from the drop-down menu [5].
- Click on the Execute button [6] to apply the training algorithm on the input columns.
Output
In the right-hand spreadsheet of the tab the output data matrix with the actual and the predicted value of the target is presented.

Tips
k Nearest Neighbors:
- kNN works most efficiently on small to medium datasets and low-dimensional data. It is sensitive to missing data.
- The performance of the model is highly influenced by the selection of k.
Radial Basis Function Network:
- The number of neurons in the hidden layer has a high impact on the model performance, since a large number of neurons can lead to overfitting.
Linear SGD:
- Linear SGD is effective on large datasets and can handle high-dimensional feature spaces. Higher learning rates may be required when Huber loss is selected, because it is less sensitive to outliers.
- Consider scaling the input data with a Z score normalizer to center them to zero mean and unit standard deviation; a minimal sketch follows below.
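A minimal NumPy sketch of the Z score transformation itself (for illustration only; Isalos provides its own normalizer function):

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])
# Center each column to zero mean and scale to unit standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled)
```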
XGBoost:
- Be cautious during hyperparameter tuning: choosing smaller eta values, as well as increasing the lambda, alpha and gamma values, results in a more conservative boosting process. Increasing the value of the max depth parameter makes the model more complex and more likely to overfit.
Random Forest:
- This algorithm performs well with datasets that contain missing values. However, it is not as efficient with a large number of sparse features or with categorical variables of many levels that are improperly encoded.
See also
The model generated by either the k Nearest Neighbors (kNN), Fully Connected Neural Network, Radial Basis Function Network, Linear SGD, XGBoost or Random Forest algorithms can be applied to any input data through the Existing Model Utilization function (e.g., a regression algorithm trained from the training set data of a machine learning model can be applied to the test/external set data).
Workflows
Bodyfat prediction case study
House pricing case study
Insurance charges case study
MA score case study
Salary prediction case study
References
- Witten IH, Frank E, Hall MA, Pal CJ. Data Mining: Practical Machine Learning Tools and Techniques. 4th ed. Morgan Kaufmann; 2011. https://doi.org/10.1016/C2009-0-19715-5.
- Murphy KP. Machine Learning: A Probabilistic Perspective. The MIT Press; 2012. 10.5555/2380985.
- Lee C-C, Chung P-C, Tsai J-R, Chang C-I. Robust radial basis function neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 1999;29:674–85. https://doi.org/10.1109/3477.809023.
- Ghosh J, Nag A. An Overview of Radial Basis Function Networks. In: Howlett RJ, Jain LC, editors. Radial Basis Function Networks 2: New Advances in Design, Heidelberg: Physica-Verlag HD; 2001, p. 1–36. https://doi.org/10.1007/978-3-7908-1826-0_1.
- Bottou L. Large-Scale Machine Learning with Stochastic Gradient Descent. In: Lechevallier Y, Saporta G, editors. Proceedings of COMPSTAT’2010, Heidelberg: Physica-Verlag HD; 2010, p. 177–86. https://doi.org/10.1007/978-3-7908-2604-3_16.
- XGBoost Parameters — xgboost 2.1.0-dev documentation n.d. https://xgboost.readthedocs.io/en/latest/parameter.html (accessed June 3, 2024).
- Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York, NY: Springer New York; 2009. https://doi.org/10.1007/978-0-387-84858-7.
- Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, p. 785–94. https://doi.org/10.1145/2939672.2939785.
- Breiman L. Random Forests. Machine Learning 2001;45:5–32. https://doi.org/10.1023/A:1010933404324.
Version History
Introduced in Isalos Analytics Platform v0.1.18
Instructions last updated on January 2025