Regression

Regression is a typical supervised learning task in predictive modeling that aims to investigate the relationship between a set of independent features and a dependent target variable. Regression analysis algorithms are employed when the target variable is a continuous numeric value.1

Table of contents

  1. k Nearest Neighbors (kNN)
  2. Fully Connected Neural Network
  3. Radial Basis Function Network
  4. Linear SGD
  5. XGBoost
  6. Random Forest
  7. Tips
  8. See also
  9. References
  10. Version History

k Nearest Neighbors (kNN)

k Nearest Neighbors (kNN) is a simple non-parametric algorithm that operates by identifying the data points from the training set that are closest to a new, unseen input. This instance-based learning method determines the closeness of data points by calculating the Euclidean distance between instances over all attribute values. The k parameter denotes the number of nearest neighbors to consider.2 The target value for a new instance is predicted by averaging the distance-weighted target values of the k nearest data points. Because the predictions rely on Euclidean distances, the data are scaled internally within the function.
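Although Isalos performs these computations internally, the idea can be sketched outside the platform. The following is a minimal, illustrative scikit-learn analogue on toy data (the column roles and parameter values are assumptions for demonstration), not the platform's implementation:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # toy training features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Scale first (Euclidean distances are scale-sensitive), then predict by
# averaging the distance-weighted targets of the k nearest training points.
knn = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=5, weights="distance"),
)
knn.fit(X, y)
print(knn.predict(X[:3]))                            # "kNN Prediction" values
```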

Use the k Nearest Neighbors (kNN) function by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) k Nearest Neighbors (kNN)

Input

Data matrix with training set data.

Configuration

Target Column Select from the drop-down menu the column containing the target variable that is going to be predicted. Columns with categorical features cannot be selected as targets.
Number of Neighbors An integer representing the number of nearest neighbor data points (k) used to make predictions for a new data point.

Output

A data matrix including the actual target value and the value predicted by the algorithm (“kNN Prediction”). For each data point, the closest neighbors from the training set are listed in the “Closest NN” columns, along with their corresponding distances in the “Distance from NN” columns.
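For intuition, the neighbor and distance columns correspond to what scikit-learn's kneighbors method returns; a hedged continuation of the toy sketch above:

```python
# Reuse the fitted pipeline from the kNN sketch above (illustrative only).
scaler = knn.named_steps["standardscaler"]
reg = knn.named_steps["kneighborsregressor"]
# k distances and training-set indices per row, analogous to the
# "Distance from NN" and "Closest NN" columns of the output matrix.
distances, indices = reg.kneighbors(scaler.transform(X))
print(indices[0], distances[0])  # on training data, NN1 is the row itself (distance 0)
```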

Example

Input

In the left-hand spreadsheet of the tab, import the data matrix including the target variable for prediction.

kNN input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) k Nearest Neighbors (kNN).
  2. Select the column that is going to be predicted from the drop-down menu [1]. The columns containing categorical features are automatically excluded from the list.
  3. Type the Number of Neighbors [2] to consider.
  4. Click on the Execute button [3] to apply the algorithm on the input table.
kNN
Output

In the right-hand spreadsheet of the tab, the output data matrix with the actual and the predicted value of the target is presented. In addition, the k nearest training instances identified for each data point (5 in this case) are given, along with the corresponding Euclidean distance to each neighbor. Note that “Closest NN1” represents the nearest neighbor, which is the data point itself when the model is applied to the training set; consequently, “Distance from NN1” is 0 for all training instances.

kNN output
Application on external set

You can apply the trained k Nearest Neighbors (kNN) model to any external (test) data using the Existing Model Utilization function:

  1. Import the external data in the left-hand spreadsheet of the tab. Include the same columns used to build the kNN model.
  2. Select Analytics \(\rightarrow\) Existing Model Utilization. Select the trained kNN model [1] and click on the Execute button [2].
kNN application
  3. Inspect the results in the right-hand spreadsheet of the tab. Note that in this case the closest neighbor listed in the “Closest NN1” belongs to the training set, and the “Distance from NN1” is not zero.
kNN application

Fully Connected Neural Network

A fully connected neural network is a type of feedforward artificial neural network consisting of multiple layers of neurons: an input layer, one or more hidden layers and an output layer, which are fully connected with each other. A variety of non-linear activation functions are typically used in the hidden layers, allowing the network to learn complex patterns in the data. This multilayer perceptron (MLP) is trained with the backpropagation algorithm.1,2
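As an illustrative analogue (not the platform's internal implementation), the configuration options described below map roughly onto the parameters of scikit-learn's MLPRegressor:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.05, size=200)

# Two hidden layers (16 and 8 neurons) with ReLU activations; batch size,
# epochs, learning rate and momentum mirror the configuration fields below.
mlp = MLPRegressor(
    hidden_layer_sizes=(16, 8),
    activation="relu",
    solver="sgd",
    batch_size=32,
    max_iter=100,             # number of epochs
    learning_rate_init=0.01,  # learning rate
    momentum=0.9,
    random_state=1,           # RNG seed for reproducibility
)
mlp.fit(X, y)
print(mlp.predict(X[:3]))     # "Prediction" values
```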

Use the Fully Connected Neural Network function by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Fully Connected Neural Network

Input

Data matrix with training set data. String columns are not taken into account in the algorithm implementation, therefore categorical features must be removed or encoded into numerical values.

Configuration

Batch Size Select from the drop-down menu the number of training instances that will be propagated through the network. Four options are available for selection: 16, 32, 64, and 128.
Number of Epochs Specify the number (integer) of complete passes through the data during training.
Learning Rate Specify the learning rate (between 0 and 1), which controls the size of the steps taken during optimization.
Momentum Specify the momentum rate (between 0 and 1) for the backpropagation algorithm.
+/- Click on the + and - buttons to add or remove hidden layers, respectively.
Hidden Layers For each added hidden layer, specify the number of neurons and select the non-linear activation function used to map the weighted inputs to the output of each neuron. Options for the activation functions include:
    - RELU,
    - RELU6,
    - LEAKYRELU,
    - SELU,
    - SWISH,
    - RRELU,
    - SIGMOID,
    - SOFTMAX,
    - SOFTPLUS,
    - SOFTSIGN,
    - TANH,
    - THRESHOLDEDRELU,
    - GELU,
    - ELU,
    - MISH,
    - CUBE,
    - HARDSIGMOID,
    - HARDTANH,
    - IDENTITY,
    - RATIONALTANH, and
    - RECTIFIEDTANH.
Target Column Select from the drop-down menu the column containing the feature that is going to be predicted.
RNG Seed Select an integer as seed to get reproducible results. The option to select a time-based randomly generated seed is available.

Output

A data matrix including the actual target value and the value predicted by the algorithm (“Prediction”) is presented.

Example

Input

In the left-hand spreadsheet of the tab, import the data matrix including the target variable for prediction. If categorical (string) columns are included in the set, they should be encoded into representative numerical values.

MLP input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Fully Connected Neural Network
  2. Select the hyperparameters that determine the training procedure: the Batch Size [1], the Number of Epochs [2], the Learning Rate [3] and the Momentum [4].
  3. Select the hyperparameters that determine the Hidden Layers [5] of the neural network: the number of neurons [6] and the activation function [7] of each layer.
  4. Add [8] or remove [9] hidden layers to define the architecture of the neural network.
  5. Select the column that is going to be predicted from the drop-down menu [10].
  6. Select a Seed for reproducible results, or use a time-based randomly generated seed (Time-based RNG Seed) [11].
  7. Click on the Execute button [12] to apply the training algorithm on the input columns.
MLP
Output

In the right-hand spreadsheet of the tab, the output data matrix with the actual and the predicted values of the target is presented.

MLP output

Radial Basis Function Network

A radial basis function (RBF) network is an artificial neural network that employs RBF kernels as activation functions. The network consists of three layers: the input layer modeling a vector that passes the data, the hidden layer that performs the computations, and the output layer designated for the regression problem. The output of the neural network is a linear combination of the activations (outputs) of the hidden units.1,3

A Radial Basis Function is a real-valued function $\varphi(r)$ whose value depends only on the distance $r = \|x - c\|$ between an input point $x$ and the center $c$ of each neuron, used as the reference point (Eq. 1).3

$$ \begin{equation} \varphi(x) = \varphi(\|x - c\|)\end{equation} {\qquad [1] \qquad} $$

The radial basis function kernels available in Isalos include:

Gaussian:

$$ \begin{equation} \varphi(r) = e^{-(\varepsilon r)^{2}}\end{equation} {\qquad [2] \qquad} $$

Multiquadric:

$$ \begin{equation} \varphi(r) = \sqrt{1 + (\varepsilon r)^{2}}\end{equation} {\qquad [3] \qquad} $$

Inverse Quadratic:

$$ \begin{equation} \varphi(r) = \frac{1}{1 + (\varepsilon r)^{2}}\end{equation} {\qquad [4] \qquad} $$

Inverse Multiquadric:

$$ \begin{equation} \varphi(r) = \frac{1}{\sqrt{1 + (\varepsilon r)^{2}}}\end{equation} {\qquad [5] \qquad} $$

Polyharmonic spline:

$$ \begin{equation} \varphi(r) = \begin{cases} r^{k}, \: k = 1, 3, 5, \ldots \\ r^{k} \ln(r), \: k = 2, 4, 6, \ldots \end{cases} \end{equation} {\qquad [6] \qquad} $$

where $k$ is the order of the spline.

Thin Plate Spline:

$$ \begin{equation} \varphi(r) = r^{2} \ln(r)\end{equation} {\qquad [7] \qquad} $$

Bump Function:

$$ \begin{equation} \varphi(r) = \begin{cases} \exp\left( -\frac{1}{1 - (\varepsilon r)^{2}} \right), \: \text{ if } r < \frac{1}{\varepsilon}, \\ 0, \: \text{ otherwise } \end{cases} \end{equation} {\qquad [8] \qquad} $$

where $\varepsilon$ is the shape parameter used to scale the input of the radial kernel.
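To make the mechanics concrete, the sketch below builds a tiny RBF network in NumPy, assuming a Gaussian kernel (Eq. 2), KMeans-selected centers and a least-squares fit of the linear output layer; it is illustrative only, not the Isalos implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(150, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1])     # toy target

n_hidden, eps = 20, 1.0                   # Hidden Neurons, Epsilon

# Point Selection via KMeans: the centers are the cluster centroids.
centers = KMeans(n_clusters=n_hidden, n_init=10, random_state=0).fit(X).cluster_centers_

# Hidden layer: Gaussian kernel phi(r) = exp(-(eps*r)^2) (Eq. 2), where r is
# the distance of each input point to each center.
r = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
Phi = np.exp(-(eps * r) ** 2)

# Output layer: linear combination of the hidden activations.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print((Phi @ w)[:3])                      # "Prediction" values
```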

Use the Radial Basis Function Network regression function by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Radial Basis Function Network

Input

Data matrix with training set data. String columns are not taken into account in the algorithm implementation, therefore categorical features must be removed or encoded into numerical values. Note that the RBF function does not accept integer or string input columns.

Configuration

Hidden Neurons Specify the number of neurons in the hidden layer of the network.
RBF Kernel Select from the drop-down menu the radial basis function kernel. Options include:
    - GAUSSIAN Eq. 2,
    - MULTIQUADRIC Eq. 3,
    - INVERSE QUADRATIC Eq. 4,
    - INVERSE MULTIQUADRIC Eq. 5,
    - POLYHARMONIC SPLINE Eq. 6,
    - THIN PLATE SPLINE Eq. 7, and
    - BUMP FUNCTION Eq. 8.
A new configuration field appears after the RBF Kernel selection, for specifying the Epsilon ($\varepsilon$) shape parameter or the spline order K ($k$), where applicable.
Point Selection Select the method that determines how the centers of the neural network are chosen. Options include:
    - Random Points from Training set: chosen randomly from the training data.
    - Use KMeans: RBF centers are the cluster centers of the partitioned training data.
RNG Seed Select an integer as seed to get reproducible results. The option to select a time-based randomly generated seed is available.
Target Column Select from the drop-down menu the column containing the target variable that is going to be predicted. Columns with categorical features cannot be selected as targets.

Output

A data matrix including the actual target value and the value predicted by the algorithm (“Prediction”) is presented.

Example

Input

In the left-hand spreadsheet of the tab, import the data matrix including the target variable for prediction. Note that the “Species” categorical column is presented as double (D).

Rbf input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Radial Basis Function Network
  2. Type the number (integer) of Hidden Neurons [1].
  3. Select the RBF Kernel [2] used as activation function of the hidden layer and subsequently select the Epsilon [3] or K parameter where applicable.
  4. Select the Point Selection [4] method to determine the centers of the network.
  5. Type an RNG Seed for reproducible results, or use a time-based randomly generated seed (Time-based RNG Seed) [5].
  6. Select the Target Column that is going to be predicted from the drop-down menu [6]. Columns with categorical features cannot be selected as targets.
  7. Click on the Execute button [7] to proceed with training.
Rbf configuration
Output

In the right-hand spreadsheet of the tab, the output data matrix with the actual and the predicted value of the target variable is presented.

Rbf output

Linear SGD

Stochastic Gradient Descent (SGD) is a method used to minimize an objective (loss) function, which measures the error between the actual values and the predicted outcomes of data points. SGD is employed to incrementally fit various linear regressors by iteratively updating the parameters of a linear model. At each step, it estimates the gradient of the loss function and adjusts the parameters using a constant or decreasing learning rate. Unlike traditional gradient descent, which processes the entire training set in each iteration, SGD operates on a small batch of the training data, enabling faster convergence.1,5
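As a rough analogue outside Isalos, scikit-learn's SGDRegressor exposes similar controls (loss function, learning-rate schedule, initial step size, number of epochs); a minimal sketch on toy data, with parameter values chosen for illustration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Huber loss with a decaying learning rate; eta0 is the initial step size
# and max_iter the number of epochs (cf. the configuration fields below).
sgd = make_pipeline(
    StandardScaler(),                  # z-score scaling, as advised in Tips
    SGDRegressor(loss="huber", learning_rate="invscaling",
                 eta0=0.1, max_iter=10, random_state=3),
)
sgd.fit(X, y)
print(sgd.predict(X[:3]))              # "Prediction" values
```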

Use the Linear SGD regression function by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Linear SGD

Input

Data matrix with training set data. String columns are not taken into account in the algorithm implementation, therefore categorical features must be removed or encoded into numerical values. Note that Linear SGD does not accept integer or string input columns.

Configuration

Target Column Select from the drop-down menu the column containing the target variable that is going to be predicted. Columns with categorical features cannot be selected as targets.
Loss function Select from the drop-down menu the objective function to be minimized. Three options are available for selection, namely:
    - Squared Loss: average of the squared errors,
    - Absolute Loss: average of the absolute differences, and
    - Huber Loss: combination of Squared Loss and Absolute Loss.
Optimizer Select the gradient optimizer from the drop-down menu. Three options are available for selection, namely:
    - Linear Decay SGD: learning rate decreases over time,
    - Simple SGD: constant learning rate throughout training, and
    - Squared Root Decay SGD: learning rate decreases with the inverse square root of the number of iterations.
Learning rate Specify the initial step size (between 0 and 1) used in the iterative optimization procedure (default value: 0.1).
Number of epochs Specify the number (integer) of complete passes through the data during training (default value: 10).
RNG Seed Select an integer as seed to get reproducible results. The option to select a time-based randomly generated seed is available.

Output

A data matrix including the actual target value and the value predicted by the algorithm (“Prediction”) is presented.

Example

Input

In the left-hand spreadsheet of the tab, import the data matrix including the target variable for prediction. Note that the “Species” categorical column is presented as double (D).

SGD input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Linear SGD.
  2. Select the column containing the target variable that is going to be predicted from the drop-down menu [1]. Columns with categorical features cannot be selected as targets.
  3. Select the Loss function [2] and the gradient Optimizer [3] from the drop-down menus.
  4. Type the hyperparameters that determine the structure of the model: the Learning rate [4] and the Number of epochs [5]. Default values, data types (double or integer) and acceptable ranges are indicated as guidance on the input parameter values.
  5. Type an RNG Seed for reproducible results, or use a time-based randomly generated seed (Time-based RNG Seed) [6].
  6. Click on the Execute button [7] to apply the training algorithm on the input data.
SGD
Output

In the right-hand spreadsheet of the tab, the output data matrix with the actual and the predicted value of the target is presented.

SGD

XGBoost

The Extreme Gradient Boosting (XGBoost) open-source library6 is used to implement the gradient boosting framework. The library uses a class of ensemble machine learning algorithms constructed from decision tree models. Ensemble learning operates by combining different individual base learners to obtain a final prediction.7 In an iterative process, trees are added to the ensemble so that the prediction error (loss) of the previous models is reduced. In regression problems, the mean squared error is used as the loss function.8
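The configuration fields below correspond to parameters of the XGBoost library itself; a minimal sketch with the xgboost Python API (toy data, default-like values chosen for illustration):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Each argument mirrors a configuration field described below.
model = xgb.XGBRegressor(
    booster="gbtree",
    objective="reg:squarederror",
    n_estimators=100,        # number of estimators
    learning_rate=0.3,       # eta
    gamma=0.0,
    max_depth=6,
    min_child_weight=1,
    colsample_bytree=1.0,    # column sample by tree
    subsample=1.0,           # sub sample
    tree_method="auto",
    reg_lambda=1.0,          # lambda (L2)
    reg_alpha=0.0,           # alpha (L1)
    random_state=42,         # RNG seed
)
model.fit(X, y)
print(model.predict(X[:3]))  # "Prediction" values
```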

Use the XGBoost regression function by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) XGBoost

Input

Data matrix with training set data. String columns are not taken into account in the algorithm implementation, therefore categorical features must be removed or encoded into numerical values (integer or double).

Configuration

Target Column Select from the drop-down menu the column containing the target variable that is going to be predicted. Columns with categorical features cannot be selected as targets.
booster Select from the drop-down menu which booster to use. Three options are available for selection, namely:
    - gbtree: default tree-based models,
    - dart: tree-based models, and
    - gblinear: linear functions.
objective Select from the drop-down menu the learning objective of the method. Options include:
    - reg:squarederror: regression with squared loss,
    - reg:gamma: gamma regression with log-link, whose output is the mean of the gamma distribution, and
    - reg:tweedie: tweedie regression with log-link.
number of estimators Type the number of models (integer) to train in the learning ensemble.
eta Specify the learning rate (between 0 and 1) which determines the step size shrinkage to prevent overfitting (default value: 0.3).
gamma Specify the minimum loss reduction required to make a further partition on a leaf node of the tree (default value: 0).
max depth Specify the maximum depth of a tree as a positive integer (default value: 6).
min child weight Specify the minimum sum of instance weight (hessian) needed in a child (default value: 1).
column sample by tree Specify the subsample ratio of features when constructing each tree. Subsampling will occur once for every tree constructed.
sub sample Specify the subsampling ratio (between 0 and 1) of the training instances. Subsampling will occur once in every boosting iteration (default value: 1).
tree method Select the tree construction algorithm used in XGBoost. Options include:
    - auto: heuristically chooses the fastest method, typically based on the dataset size,
    - exact: exact greedy algorithm,
    - approx: approximates the greedy algorithm using quantile sketch and gradient histogram, and
    - hist: fast histogram optimized approximate greedy algorithm.
lambda Specify the L2 regularization term on leaf weights (default value: 1).
alpha Specify the L1 regularization term on leaf weights (default value: 0).
RNG Seed Select an integer as seed to get reproducible results. The option to select a time-based randomly generated seed is available by clicking on the Time-based RNG Seed checkbox.

Output

A data matrix including the actual target value and the value predicted by the algorithm (“Prediction”) is presented.

Example

Input

In the left-hand spreadsheet of the tab, import the data matrix including the target variable for prediction. Note that the “Species” categorical column is presented as double (D).

XGBoost input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) XGBoost.
  2. Select the column that is going to be predicted from the drop-down menu [1].
  3. Select the booster [2] method and the objective function [3] for the loss, and type the number of estimators [4] in the ensemble.
  4. Select the hyperparameters involved in the regularization of the model: eta [5], gamma [6], lambda [12] and alpha [13]. Select the hyperparameters involved in tree construction: max depth [7] and min child weight [8]. Select the column sampling rate by tree [9] and the overall subsampling rates [10]. Default values, data types (double or integer) and acceptable ranges are indicated as guidance on the input parameter values.
  5. Select the tree construction algorithm [11] used in the XGBoost.
  6. Type an RNG Seed for reproducible results, or use a time-based randomly generated seed (Time-based RNG Seed) [14].
  7. Click on the Execute button [15] to apply the training algorithm on the input data.
XGBoost
Output

In the right-hand spreadsheet of the tab, the output data matrix with the actual and the predicted value of the target variable is presented.

XGBoost output

Random Forest

The random forest regressor is an ensemble learning method that operates by building multiple randomized decision trees during training and combining the predictions of the individual trees. The decision trees are constructed in parallel, with no interaction between them, using random subsets of the training data and of the input attributes to ensure diversity. The predictions made independently by all the trees in the forest are aggregated and averaged to produce the final prediction.7,9
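An analogous model can be sketched with scikit-learn's RandomForestRegressor; the mapping of the configuration fields onto its parameters is an assumption for illustration, not the platform's exact implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=200)

# Assumed mapping of the configuration fields described below:
#   Features fraction     -> max_features (fraction of features per split)
#   Min impurity decrease -> min_impurity_decrease
#   Number of ensembles   -> n_estimators
#   Seed                  -> random_state
rf = RandomForestRegressor(
    n_estimators=10,
    max_features=0.9,
    min_impurity_decrease=0.1,
    random_state=0,
)
rf.fit(X, y)
print(rf.predict(X[:3]))     # averaged "Prediction" of the individual trees
```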

Use the Random Forest function by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Random Forest

Input

Data matrix with training set data. String columns are not taken into account in the algorithm implementation, therefore categorical features must be excluded or encoded into numerical values (integer or double).

Configuration

Features fraction Specify the feature subsampling rate represented as a fraction of features (between 0 and 1) available in each tree split (default value: 0.9).
Min impurity decrease Specify the impurity decrease threshold (between 0 and 1) necessary to determine the quality of splits in the decision trees. A split is only considered if it results in a decrease of impurity greater than or equal to this value (default value: 0.1).
Seed Select an integer as seed to get reproducible results. The option to select a time-based randomly generated seed is available.
Number of ensembles Specify the number of individual trees to be generated by the algorithm (default value: 10).
Target column Select from the drop-down menu the column containing the target variable that is going to be predicted.

Output

A data matrix including the actual target value and the value predicted by the algorithm (“Prediction”) is presented.

Example

Input

In the left-hand spreadsheet of the tab, import the data matrix including the target variable for prediction. Note that the “Species” categorical column is presented as double (D).

Random Forest input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Random Forest.
  2. Specify the hyperparameters that determine the structure of the model: the Features fraction [1], Min impurity decrease [2] and Number of ensembles [4]. Default values, data types (double or integer) and acceptable ranges are indicated as guidance on the input parameter values.
  3. Type a Seed for reproducible results, or use a time-based randomly generated seed (Time-based RNG Seed) [3].
  4. Select the column name with the target variable that is going to be predicted from the drop-down menu [5].
  5. Click on the Execute button [6] to apply the training algorithm on the input columns.
Random Forest
Output

In the right-hand spreadsheet of the tab, the output data matrix with the actual and the predicted value of the target is presented.

Random Forest

Tips

k Nearest Neighbors:

  • It works more efficiently for small to medium datasets and low-dimensional data. kNN is sensitive to missing data.
  • The performance of the model is highly influenced by the selection of k.

Radial Basis Function Network:

  • The number of neurons in the hidden layer has a high impact on the model performance, since a large number of neurons can lead to overfitting.

Linear SGD:

  • It is effective in large datasets and can handle high-dimensional feature spaces. Higher learning rates may be required when Huber loss is selected, because it is less sensitive to outliers.
  • Consider scaling the input data with a Z score normalizer to center them to zero mean and unit standard deviation.

XGBoost:

  • Be cautious during hyperparameter tuning: choosing smaller eta values, as well as increasing the lambda, alpha and gamma values, results in a more conservative boosting process. Increasing the value of the max depth parameter makes the model more complex and more likely to overfit.

Random Forest:

  • This algorithm performs well with datasets that contain missing values. However, it is not as efficient with a large number of sparse features or with categorical variables of many levels that are improperly encoded.

See also

The model generated by either the k Nearest Neighbors (kNN), Fully Connected Neural Network, Radial Basis Function Network, Linear SGD, XGBoost or Random Forest algorithms can be applied to any input data through the Existing Model Utilization function (e.g., a regression algorithm trained from the training set data of a machine learning model can be applied to the test/external set data).
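In code terms, this corresponds to reusing an already fitted estimator on new data; continuing the (hypothetical) kNN sketch from the kNN section:

```python
# X_external stands in for the imported external/test set and must contain
# the same feature columns (here 3) used to build the model.
X_external = rng.normal(size=(10, 3))
print(knn.predict(X_external))   # predictions for the external set
```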

Workflows

Bodyfat prediction case study

House pricing case study

Insurance charges case study

MA score case study

Salary prediction case study

References

  1. Witten IH, Frank E, Hall MA, Pal CJ. Data Mining: Practical Machine Learning Tools and Techniques. 4th ed. Morgan Kaufmann; 2011. https://doi.org/10.1016/C2009-0-19715-5.
  2. Murphy KP. Machine Learning: A Probabilistic Perspective. The MIT Press; 2012. 10.5555/2380985.
  3. Lee C-C, Chung P-C, Tsai J-R, Chang C-I. Robust radial basis function neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 1999;29:674–85. https://doi.org/10.1109/3477.809023.
  4. Ghosh J, Nag A. An Overview of Radial Basis Function Networks. In: Howlett RJ, Jain LC, editors. Radial Basis Function Networks 2: New Advances in Design, Heidelberg: Physica-Verlag HD; 2001, p. 1–36. https://doi.org/10.1007/978-3-7908-1826-0_1.
  5. Bottou L. Large-Scale Machine Learning with Stochastic Gradient Descent. In: Lechevallier Y, Saporta G, editors. Proceedings of COMPSTAT’2010, Heidelberg: Physica-Verlag HD; 2010, p. 177–86. https://doi.org/10.1007/978-3-7908-2604-3_16.
  6. XGBoost Parameters — xgboost 2.1.0-dev documentation n.d. https://xgboost.readthedocs.io/en/latest/parameter.html (accessed June 3, 2024).
  7. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York, NY: Springer New York; 2009. https://doi.org/10.1007/978-0-387-84858-7.
  8. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, p. 785–94. https://doi.org/10.1145/2939672.2939785.
  9. Breiman L. Random Forests. Machine Learning 2001;45:5–32. https://doi.org/10.1023/A:1010933404324.

Version History

Introduced in Isalos Analytics Platform v0.1.18

Instructions last updated on January 2025