Regression

Regression is a typical supervised learning task in predictive modelling that aims to investigate the relationship between a set of independent features and a dependent target variable. Regression analysis algorithms are employed when the target variable is a continuous numeric value.1

Table of contents

  1. k Nearest Neighbors (kNN)
  2. Fully Connected Neural Network
  3. Radial Basis Function Network
  4. Linear SGD
  5. XGBoost
  6. Random Forest
  7. Statistical fitting
    1. MLR
    2. Generalized Linear Models
    3. Generalized Estimating Equations
  8. Tips
  9. See also
  10. References
  11. Version History

k Nearest Neighbors (kNN)

k Nearest Neighbors (kNN) is a simple non-parametric algorithm that operates by identifying the data points from the training set that are closest to a new, unseen input. This instance-based learning method measures the closeness of data points by calculating the Euclidean distance between instances over all attribute values. The k parameter denotes the number of nearest neighbors to consider.2 The target value for a new instance is predicted as the distance-weighted average of the target values of the k nearest data points. Because Euclidean distances are sensitive to the magnitude of each feature, the data are scaled within the function.
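The prediction rule described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the z-score scaling, the inverse-distance weights and the function name `knn_predict` are hypothetical choices, since the exact scaling and weighting used by the platform are not specified here.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=5, eps=1e-12):
    """Distance-weighted kNN regression on scaled features (illustrative sketch)."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    y_train = np.asarray(y_train, dtype=float)

    # Scale every feature with training-set statistics (z-score is an assumption).
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    sd[sd == 0] = 1.0
    Z_train = (X_train - mu) / sd
    Z_new = (X_new - mu) / sd

    preds = np.empty(len(Z_new))
    for i, z in enumerate(Z_new):
        # Euclidean distance from the new point to every training point.
        dist = np.sqrt(((Z_train - z) ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:k]
        # Inverse-distance weights (assumption); eps avoids division by zero.
        w = 1.0 / (dist[nearest] + eps)
        preds[i] = np.sum(w * y_train[nearest]) / np.sum(w)
    return preds

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0])
print(knn_predict(X_train, y_train, np.array([[1.5]]), k=2))
```

With inverse-distance weights, a training point queried against its own training set is dominated by itself (distance 0), which matches the behaviour of the nearest neighbor on training data.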

Use the k Nearest Neighbors (kNN) function by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) k Nearest Neighbors (kNN)

Input

Data matrix with training set data.

Configuration

Target Column Select from the drop-down menu the column containing the target variable that is going to be predicted. Columns with categorical features cannot be selected as targets.
Number of Neighbors An integer giving the number of nearest data points (k) used to make the prediction for a new data point.

Output

A data matrix including the actual target value and the value predicted by the algorithm (“kNN Prediction”). For each data point, the closest neighbors from the training set are listed in the “Closest NN” columns, along with their corresponding distances in the “Distance from NN” columns.

Example

Input

In the left-hand spreadsheet of the tab import the data matrix including the target variable for prediction.

kNN input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) k Nearest Neighbors (kNN).
  2. Select the column that is going to be predicted from the drop-down menu [1]. The columns containing categorical features are automatically excluded from the list.
  3. Type the Number of Neighbors [2] to consider.
  4. Click on the Execute button [3] to apply the algorithm on the input table.
kNN
Output

In the right-hand spreadsheet of the tab the output data matrix with the actual and the predicted value of the target is presented. Also, the k most proximate identified instances from the training set are given for each data point (5 in this case), along with the corresponding Euclidean distances from each neighbor. Note that the “Closest NN1” represents the nearest neighbor, which is the data point itself when applied on the training set. Consequently, the “Distance from NN1” is 0 for all given training instances.

kNN output
Application on external set

You can apply the trained k Nearest Neighbors (kNN) model to any external (test) data using the Existing Model Utilization function:

  1. Import the external data in the left-hand spreadsheet of the tab. Include the same columns used to build the kNN model.
  2. Select Analytics \(\rightarrow\) Existing Model Utilization. Select the trained kNN model [1] and click on the Execute button [2].
kNN application
  3. Inspect the results in the right-hand spreadsheet of the tab. Note that in this case the closest neighbor listed in the “Closest NN1” column belongs to the training set, and the “Distance from NN1” is not zero.
kNN application

Fully Connected Neural Network

A fully connected neural network, also known as a multilayer perceptron (MLP), is a type of feedforward artificial neural network consisting of multiple layers of neurons: an input layer, one or more hidden layers and an output layer, each fully connected to the next. Non-linear activation functions are typically used in the hidden layers, allowing the network to learn complex patterns in data. The MLP is trained with the backpropagation algorithm and, in regression, predicts continuous target values.1,2
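To make the training procedure concrete, the following is a minimal NumPy sketch of a network with one hidden layer trained by backpropagation on mini-batches. The tanh activation, the mean squared error loss, the classical momentum update and the names `train_mlp` and `predict_mlp` are assumptions chosen for illustration, not a description of the platform's implementation.

```python
import numpy as np

def train_mlp(X, y, hidden=8, epochs=300, batch_size=16, lr=0.01, momentum=0.9, seed=0):
    """One-hidden-layer regression network trained with mini-batch backpropagation."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    params = {
        "W1": rng.normal(0.0, 0.5, (d, hidden)), "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.5, (hidden, 1)), "b2": np.zeros(1),
    }
    velocity = {name: np.zeros_like(value) for name, value in params.items()}
    for _ in range(epochs):
        order = rng.permutation(n)                    # one epoch = one pass over the data
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = X[idx], y[idx].reshape(-1, 1)
            # Forward pass: tanh hidden layer, linear output.
            h = np.tanh(xb @ params["W1"] + params["b1"])
            out = h @ params["W2"] + params["b2"]
            # Backward pass for the mean squared error loss.
            d_out = 2.0 * (out - yb) / len(idx)
            d_h = (d_out @ params["W2"].T) * (1.0 - h ** 2)
            grads = {"W2": h.T @ d_out, "b2": d_out.sum(axis=0),
                     "W1": xb.T @ d_h, "b1": d_h.sum(axis=0)}
            # Momentum update: the velocity accumulates past gradients.
            for name in params:
                velocity[name] = momentum * velocity[name] - lr * grads[name]
                params[name] += velocity[name]
    return params

def predict_mlp(params, X):
    h = np.tanh(X @ params["W1"] + params["b1"])
    return (h @ params["W2"] + params["b2"]).ravel()

X = np.linspace(-1.0, 1.0, 64).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0
model = train_mlp(X, y)
print(np.mean((predict_mlp(model, X) - y) ** 2))
```

The arguments `batch_size`, `epochs`, `lr` and `momentum` play the roles of the Batch Size, Number of Epochs, Learning Rate and Momentum settings.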

Use the Fully Connected Neural Network function by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Fully Connected Neural Network

Input

Data matrix with training set data. String columns are not taken into account in the algorithm implementation, therefore categorical features must be removed or encoded into numerical values.

Configuration

Batch Size Select from the drop-down menu the number of training instances propagated through the network in each training iteration. Four options are available: 16, 32, 64, and 128.
Number of Epochs Specify the number (integer) of complete passes through the data during training.
Learning Rate Specify the learning rate (between 0 and 1), which controls the size of the steps taken during optimization.
Momentum Specify the momentum rate (between 0 and 1) for the backpropagation algorithm.
+/- Click on the + and - buttons to add or remove hidden layers, respectively.
Hidden Layers For each added hidden layer, specify the number of neurons and select the non-linear activation function used to map the weighted inputs to the output of each neuron. Options for the activation functions include:
    - RELU,
    - RELU6,
    - LEAKYRELU,
    - SELU,
    - SWISH,
    - RRELU,
    - SIGMOID,
    - SOFTMAX,
    - SOFTPLUS,
    - SOFTSIGN,
    - TANH,
    - THRESHOLDEDRELU,
    - GELU,
    - ELU,
    - MISH,
    - CUBE,
    - HARDSIGMOID,
    - HARDTANH,
    - IDENTITY,
    - RATIONALTANH, and
    - RECTIFIEDTANH.
Target Column Select from the drop-down menu the column containing the feature that is going to be predicted.
RNG Seed Select an integer as seed to get reproducible results. The option to select a time-based random number-generated seed is available.

Output

A data matrix including the actual target value and the value predicted by the algorithm (“Prediction”) is presented.

Example

Input

In the left-hand spreadsheet of the tab import the data matrix including the target variable for prediction. If categorical (string) columns are included in the set, they should be encoded into representative numerical values.

MLP input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Fully Connected Neural Network
  2. Select the hyperparameters that determine the training procedure: the Batch Size [1], the Number of Epochs [2], the Learning Rate [3] and the Momentum [4].
  3. Select the hyperparameters that determine the Hidden Layers [5] of the neural network: the number of neurons [6] and the activation function [7] of each layer.
  4. Add [8] or remove [9] hidden layers to define the architecture of the neural network.
  5. Select the column that is going to be predicted from the drop-down menu [10].
  6. Select a Seed for reproducible results or a random number generated Time-based (RNG) Seed [11].
  7. Click on the Execute button [12] to apply the training algorithm on the input columns.
MLP
Output

In the right-hand spreadsheet of the tab the output data matrix with the actual and the predicted values of the target is presented.

MLP output

Radial Basis Function Network

A radial basis function (RBF) network is an artificial neural network that employs RBF kernels as activation functions. The network consists of three layers: an input layer that receives the feature vector, a hidden layer that applies the radial basis functions, and an output layer that produces the regression estimate. The output of the network is a linear combination of the activations (outputs) of the hidden units.1, 3

A radial basis function is a real-valued function $\varphi(r)$ whose value depends only on the distance between an input point ($r$) and the center ($c$) of each neuron, which acts as a reference point (Eq. 1).3

$$ \begin{equation} \varphi(r) = d(\|r - c\|) {\qquad [1] \qquad} \end{equation} $$

The radial basis function kernels available in Isalos include:

Gaussian:

$$ \begin{equation} \varphi(r) = e^{-(\varepsilon r)^{2}}\end{equation} {\qquad [2] \qquad} $$

Multiquadric:

$$ \begin{equation} \varphi(r) = \sqrt{1 + (\varepsilon r)^{2}}\end{equation} {\qquad [3] \qquad} $$

Inverse Quadratic:

$$ \begin{equation} \varphi(r) = \frac{1}{1 + (\varepsilon r)^{2}}\end{equation} {\qquad [4] \qquad} $$

Inverse Multiquadric:

$$ \begin{equation} \varphi(r) = \frac{1}{\sqrt{1 + (\varepsilon r)^{2}}}\end{equation} {\qquad [5] \qquad} $$

Polyharmonic spline:

$$ \begin{equation} \varphi(r) = \begin{cases} r^{k}, \: k = 1, 3, 5, \ldots \\ r^{k} \ln(r), \: k = 2, 4, 6, \ldots \end{cases} \end{equation} {\qquad [6] \qquad} $$

where $k$ is the order of the spline.

Thin Plate Spline:

$$ \begin{equation} \varphi(r) = r^{2} \ln(r)\end{equation} {\qquad [7] \qquad} $$

Bump Function:

$$ \begin{equation} \varphi(r) = \begin{cases} \exp\left( -\frac{1}{1 - (\varepsilon r)^{2}} \right), \: \text{ if } r < \frac{1}{\varepsilon}, \\ 0, \: \text{ otherwise } \end{cases} \end{equation} {\qquad [8] \qquad} $$

where $\varepsilon$ is the shape parameter used to scale the input of the radial kernel.
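As an illustration, the kernels of Eqs. 2 to 8 can be written directly in NumPy, with `r` taken to be the (non-negative) distance to a center. The second half of the sketch fits the linear output layer by least squares on the hidden activations; that fitting method, the added bias column and all function names are assumptions made for this example.

```python
import numpy as np

def gaussian(r, eps=1.0):
    return np.exp(-(eps * r) ** 2)

def multiquadric(r, eps=1.0):
    return np.sqrt(1.0 + (eps * r) ** 2)

def inverse_quadratic(r, eps=1.0):
    return 1.0 / (1.0 + (eps * r) ** 2)

def inverse_multiquadric(r, eps=1.0):
    return 1.0 / np.sqrt(1.0 + (eps * r) ** 2)

def polyharmonic_spline(r, k=1):
    r = np.asarray(r, dtype=float)
    if k % 2 == 1:
        return r ** k
    # Even order: r^k ln(r), using the limit value 0 at r = 0.
    safe = np.where(r > 0, r, 1.0)
    return np.where(r > 0, r ** k * np.log(safe), 0.0)

def thin_plate_spline(r):
    return polyharmonic_spline(r, k=2)

def bump(r, eps=1.0):
    r = np.asarray(r, dtype=float)
    inside = r < 1.0 / eps
    denom = np.where(inside, 1.0 - (eps * r) ** 2, 1.0)
    return np.where(inside, np.exp(-1.0 / denom), 0.0)

def rbf_design(X, centers, kernel=gaussian, **kernel_args):
    # Hidden-layer activations: kernel of the distance to every center, plus a bias column.
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.column_stack([np.ones(len(X)), kernel(dist, **kernel_args)])

def fit_rbf_network(X, y, centers, kernel=gaussian, **kernel_args):
    # Output layer = linear combination of activations, solved by least squares (assumption).
    weights, *_ = np.linalg.lstsq(rbf_design(X, centers, kernel, **kernel_args), y, rcond=None)
    return weights

def predict_rbf_network(X, centers, weights, kernel=gaussian, **kernel_args):
    return rbf_design(X, centers, kernel, **kernel_args) @ weights

X = np.linspace(0.0, 2.0, 5).reshape(-1, 1)
y = X.ravel() ** 2
w = fit_rbf_network(X, y, centers=X)
print(predict_rbf_network(X, X, w))
```

Here every training point is used as a center purely to keep the example short; in practice the centers would be a random subset of the training data or k-means cluster centers.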

Use the Radial Basis Function Network regression function by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Radial Basis Function Network

Input

Data matrix with training set data. String columns are not taken into account in the algorithm implementation, therefore categorical features must be removed or encoded into numerical values. RBF does not allow the use of integer or string input.

Configuration

Hidden Neurons Specify the number of neurons in the hidden layer of the network.
RBF Kernel Select from the drop-down menu the radial basis function kernel. Options include:
    - GAUSSIAN Eq. 2,
    - MULTIQUADRIC Eq. 3,
    - INVERSE QUADRATIC Eq. 4,
    - INVERSE MULTIQUADRIC Eq. 5,
    - POLYHARMONIC SPLINE Eq. 6,
    - THIN PLATE SPLINE Eq. 7, and
    - BUMP FUNCTION Eq. 8.
A new configuration field appears accordingly after RBF Kernel selection, for the selection of Epsilon ($\varepsilon$) shape parameter or K ($k$) where applicable.
Point Selection Select the method used to choose the centers of the hidden neurons. Options include:
    - Random Points from Training set: chosen randomly from the training data.
    - Use KMeans: RBF centers are the cluster centers of the partitioned training data.
RNG Seed Select an integer as seed to get reproducible results. The option to select a time-based random number-generated seed is available.
Target Column Select from the drop-down menu the column containing the target variable that is going to be predicted. Columns with categorical features cannot be selected as targets.

Output

A data matrix including the actual target value and the value predicted by the algorithm (“Prediction”) is presented.

Example

Input

In the left-hand spreadsheet of the tab import the data matrix including the target variable for prediction. Note that the “Species” categorical column is presented as double (D).

Rbf input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Radial Basis Function Network
  2. Type the number (integer) of Hidden Neurons [1].
  3. Select the RBF Kernel [2] used as activation function of the hidden layer and subsequently select the Epsilon [3] or K parameter where applicable.
  4. Select the Point Selection [4] method to determine the centers of the network.
  5. Type an RNG Seed for reproducible results or a random number generated Time-based (RNG) Seed [5].
  6. Select the Target Column that is going to be predicted from the drop-down menu [6]. Columns with categorical features cannot be selected as targets.
  7. Click on the Execute button [7] to proceed with training.
Rbf configuration
Output

In the right-hand spreadsheet of the tab the output data matrix with the actual and the predicted value of the target variable is presented.

Rbf output

Linear SGD

Stochastic Gradient Descent (SGD) is a method used to minimize an objective (loss) function, which measures the error between the actual values and the predicted outcomes of data points. SGD is employed to incrementally fit various linear regressors by iteratively updating the parameters of a linear model. At each step, it estimates the gradient of the loss function and adjusts the parameters using a constant or decreasing learning rate. Unlike traditional gradient descent, which processes the entire training set in each iteration, SGD operates on a small batch of the training data, enabling faster convergence.1,5
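A minimal sketch of mini-batch SGD for a linear model with squared loss is given below. The function name `sgd_linear_regression`, the zero initialization and the per-epoch shuffling are assumptions made for illustration.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.1, epochs=10, batch_size=1, seed=0):
    """Fit y ~ X @ w + b by stochastic gradient descent on the squared loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n)                 # visit the data in random order
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w + b - y[idx]          # prediction error on the batch
            # Gradient of 0.5 * mean(err^2) with respect to w and b.
            w -= lr * (X[idx].T @ err) / len(idx)
            b -= lr * err.mean()
    return w, b

X = np.linspace(-1.0, 1.0, 50).reshape(-1, 1)
y = 3.0 * X.ravel() - 1.0
w, b = sgd_linear_regression(X, y, lr=0.1, epochs=200, batch_size=5)
print(w, b)
```

Each update uses only the rows in `idx`, which is what distinguishes SGD from full-batch gradient descent, where every step would use all rows of `X`.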

Use the Linear SGD regression function by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Linear SGD

Input

Data matrix with training set data. String columns are not taken into account in the algorithm implementation, therefore categorical features must be removed or encoded into numerical values. Linear SGD does not allow the use of integer or string input.

Configuration

Target Column Select from the drop-down menu the column containing the target variable that is going to be predicted. Columns with categorical features cannot be selected as targets.
Loss function Select manually from the drop-down menu the objective function to be minimized. Three options are available for selection, namely:
    - Squared Loss: average of the squared errors,
    - Absolute Loss: average of the absolute differences, and
    - Huber Loss: combination of Squared Loss and Absolute Loss.
Optimizer Select manually the gradient optimizer from the drop-down menu. Three options are available for selection, namely:
    - Linear Decay SGD: learning rate decreases linearly with the number of iterations,
    - Simple SGD: constant learning rate throughout training, and
    - Squared Root Decay SGD: learning rate decreases in inverse proportion to the square root of the number of iterations.
Learning rate Specify the initial step size (between 0 and 1) used in the iterative optimization procedure (default value: 0.1).
Number of epochs Specify the number (integer) of complete passes through the data during training (default value: 10).
RNG Seed Select an integer as seed to get reproducible results. The option to select a time-based random number-generated seed is available.
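The loss functions and optimizer schedules listed above can be sketched as follows. The Huber threshold `delta`, the exact decay formulas (initial rate divided by the iteration count or by its square root, counting from 1) and the function names are assumptions for illustration; the platform may define them differently.

```python
import numpy as np

def loss_gradient(residual, loss="squared", delta=1.0):
    """Derivative of the per-sample loss with respect to the prediction.

    residual = prediction - target.
    """
    if loss == "squared":
        return residual                          # from 0.5 * residual^2
    if loss == "absolute":
        return np.sign(residual)                 # from |residual|
    if loss == "huber":
        # Quadratic near zero, linear beyond +/- delta (assumed threshold).
        return np.clip(residual, -delta, delta)
    raise ValueError(f"unknown loss: {loss}")

def learning_rate(initial, iteration, optimizer="simple"):
    """Step size at a given iteration (iteration >= 1)."""
    if optimizer == "simple":
        return initial
    if optimizer == "linear_decay":
        return initial / iteration
    if optimizer == "sqrt_decay":
        return initial / np.sqrt(iteration)
    raise ValueError(f"unknown optimizer: {optimizer}")

residuals = np.array([-3.0, 0.5, 2.0])
print(loss_gradient(residuals, "huber"))
print(learning_rate(0.1, 4, "sqrt_decay"))
```

The Huber gradient shows why that loss is less sensitive to outliers than the squared loss: large residuals contribute a bounded gradient.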

Output

A data matrix including the actual target value and the value predicted by the algorithm (“Prediction”) is presented.

Example

Input

In the left-hand spreadsheet of the tab import the data matrix including the target variable for prediction. Note that the “Species” categorical column is presented as double (D).

Rbf output
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Linear SGD.
  2. Select the column containing the target variable that is going to be predicted from the drop-down menu [1]. Columns with categorical features cannot be selected as targets.
  3. Select the Loss function [2] and the gradient Optimizer [3] from the drop-down menus.
  4. Type the hyperparameters that determine the structure of the model: the Learning rate [4] and the Number of epochs [5]. Default values, data types (double or integer) and acceptable ranges are indicated as guidance on the input parameter values.
  5. Type an RNG Seed for reproducible results or a random number generated Time-based (RNG) Seed [6].
  6. Click on the Execute button [7] to apply the training algorithm on the input data.
SGD
Output

In the right-hand spreadsheet of the tab the output data matrix with the actual and the predicted value of the target is presented.

SGD

XGBoost

The Extreme Gradient Boosting (XGBoost) open-source library6 is used to implement the gradient boosting framework. The library uses a class of ensemble machine learning algorithms constructed from decision tree models. Ensemble learning operates by combining different individual base learners to obtain a final prediction.7 In an iterative process, trees are added to the ensemble so that the prediction error (loss) of previous models is reduced. In regression problems, the mean squared error is used as loss function.8
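The iterative idea can be illustrated with a deliberately simplified sketch: depth-1 trees (stumps) are fitted one after another to the residuals of the current ensemble, and each contribution is shrunk by the learning rate `eta`. This is not the XGBoost algorithm itself, which additionally uses second-order gradient information, regularization and deeper trees; all function names are hypothetical.

```python
import numpy as np

def fit_stump(X, residual):
    """Best single-split tree for the residuals under squared error."""
    best = (np.inf, 0, np.inf, residual.mean(), residual.mean())
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            lv, rv = residual[left].mean(), residual[~left].mean()
            sse = ((residual[left] - lv) ** 2).sum() + ((residual[~left] - rv) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, t, lv, rv)
    return best[1:]          # (feature index, threshold, left value, right value)

def predict_stump(stump, X):
    j, t, lv, rv = stump
    return np.where(X[:, j] <= t, lv, rv)

def boost(X, y, n_estimators=50, eta=0.3):
    base = y.mean()                       # initial prediction
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_estimators):
        residual = y - pred               # negative gradient of 0.5 * (y - pred)^2
        stump = fit_stump(X, residual)
        stumps.append(stump)
        pred = pred + eta * predict_stump(stump, X)   # shrunken update
    return base, stumps, eta

def predict_boost(model, X):
    base, stumps, eta = model
    return base + eta * sum(predict_stump(s, X) for s in stumps)

X = np.linspace(-2.0, 2.0, 40).reshape(-1, 1)
y = X.ravel() ** 2
model = boost(X, y)
print(np.mean((predict_boost(model, X) - y) ** 2))
```

A smaller `eta` makes each tree correct less of the remaining error, so more estimators are needed, which is the trade-off behind the eta and number of estimators settings.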

Use the XGBoost regression function by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) XGBoost

Input

Data matrix with training set data. String columns are not taken into account in the algorithm implementation, therefore categorical features must be removed or encoded into numerical values (integer or double).

Configuration

Target Column Select from the drop-down menu the column containing the target variable that is going to be predicted. Columns with categorical features cannot be selected as targets.
booster Select from the drop-down menu which booster to use. Three options are available for selection, namely:
    - gbtree: default tree-based models,
    - dart: tree-based models with dropout, and
    - gblinear: linear functions.
objective Select from the drop-down menu the learning objective of the method. Options include:
    - reg:squarederror: regression with squared loss,
    - reg:gamma: gamma regression with log-link, whose output is a mean of gamma distribution, and
    - reg:tweedie: tweedie regression with log-link.
number of estimators Type the number of models (integer) to train in the learning ensemble.
eta Specify the learning rate (between 0 and 1) which determines the step size shrinkage to prevent overfitting (default value: 0.3).
gamma Specify the minimum loss reduction required to make a further partition on a leaf node of the tree (default value: 0).
max depth Specify the maximum depth of a tree as a positive integer (default value: 6).
min child weight Specify the minimum sum of instance weight (hessian) needed in a child (default value: 1).
column sample by tree Specify the subsample ratio of features when constructing each tree. Subsampling will occur once for every tree constructed.
sub sample Specify the subsampling ratio (between 0 and 1) of the training instances. Subsampling will occur once in every boosting iteration (default value: 1).
tree method Select the tree construction algorithm used in XGBoost. Options include:
    - auto: heuristically chooses the fastest method, typically based on the dataset size,
    - exact: exact greedy algorithm,
    - approx: approximates the greedy algorithm using quantile sketch and gradient histogram, and
    - hist: fast histogram optimized approximate greedy algorithm.
lambda Specify the L2 regularization term on leaf weights (default value: 1).
alpha Specify the L1 regularization term on leaf weights (default value: 0).
RNG Seed Select an integer as seed to get reproducible results. The option to select a time-based random number-generated seed is available by clicking on the Time-based RNG Seed checkbox.

Output

A data matrix including the actual target value and the value predicted by the algorithm (“Prediction”) is presented.

Example

Input

In the left-hand spreadsheet of the tab import the data matrix including the target variable for prediction. Note that the “Species” categorical column is presented as double (D).

XGBoost input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) XGBoost.
  2. Select the column that is going to be predicted from the drop-down menu [1].
  3. Select the tree booster [2] method, the objective function [3] for loss and type the number of estimators [4] involved in the ensemble.
  4. Select the hyperparameters involved in the regularization of the model: eta [5], gamma [6], lambda [12] and alpha [13]. Select the hyperparameters involved in tree construction: max depth [7] and min child weight [8]. Select the column sampling rate by tree [9] and the overall subsampling rates [10]. Default values, data types (double or integer) and acceptable ranges are indicated as guidance on the input parameter values.
  5. Select the tree construction algorithm [11] used in the XGBoost.
  6. Type an RNG Seed for reproducible results or a random number generated Time-based RNG Seed [14].
  7. Click on the Execute button [15] to apply the training algorithm on the input data.
XGBoost
Output

In the right-hand spreadsheet of the tab the output data matrix with the actual and the predicted value of the target variable is presented.

XGBoost output

Random Forest

Random forest regressor is an ensemble learning method that operates by building multiple randomized decision trees during training and obtaining the prediction of the individual trees. The decision trees are constructed in parallel, with no interaction between them, using random subsets of training data and input attributes to ensure diversity. Predictions independently made by all the trees in the forest are aggregated and averaged to produce a final prediction.7,9
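The following sketch shows the three ingredients named above: bootstrap samples of the training data, a random subset of features at each split, and averaging over trees. The tree builder, its `max_depth` limit, the use of variance reduction as the impurity measure and all names are assumptions for illustration only.

```python
import numpy as np

def build_tree(X, y, rng, features_fraction=0.9, min_impurity_decrease=0.0,
               depth=0, max_depth=4):
    """Randomized regression tree; a leaf is a float, a split is a tuple."""
    node_sse = ((y - y.mean()) ** 2).sum()
    best = None
    if depth < max_depth and node_sse > 0:
        # Only a random subset of the features is considered at this split.
        n_feat = max(1, int(round(features_fraction * X.shape[1])))
        for j in rng.choice(X.shape[1], size=n_feat, replace=False):
            for t in np.unique(X[:, j])[:-1]:
                left = X[:, j] <= t
                sse = (((y[left] - y[left].mean()) ** 2).sum()
                       + ((y[~left] - y[~left].mean()) ** 2).sum())
                if best is None or sse < best[0]:
                    best = (sse, j, t, left)
    # Stop when no split exists or the impurity decrease is below the threshold.
    if best is None or (node_sse - best[0]) / len(y) < min_impurity_decrease:
        return float(y.mean())
    _, j, t, left = best
    args = (rng, features_fraction, min_impurity_decrease, depth + 1, max_depth)
    return (j, t, build_tree(X[left], y[left], *args), build_tree(X[~left], y[~left], *args))

def predict_tree(tree, x):
    while isinstance(tree, tuple):
        j, t, left_branch, right_branch = tree
        tree = left_branch if x[j] <= t else right_branch
    return tree

def fit_forest(X, y, n_trees=10, seed=0, **tree_args):
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))    # bootstrap sample with replacement
        forest.append(build_tree(X[idx], y[idx], rng, **tree_args))
    return forest

def predict_forest(forest, X):
    # Final prediction = average of the individual tree predictions.
    return np.array([np.mean([predict_tree(t, x) for t in forest]) for x in X])

x1 = np.linspace(-1.0, 1.0, 40)
X = np.column_stack([x1, np.zeros_like(x1)])
y = (x1 >= 0).astype(float)
forest = fit_forest(X, y, n_trees=10, seed=0)
print(predict_forest(forest, np.array([[-0.9, 0.0], [0.9, 0.0]])))
```

Because the trees never exchange information, the loop in `fit_forest` could be run in parallel without changing the result.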

Use the Random Forest function by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Random Forest

Input

Data matrix with training set data. String columns are not taken into account in the algorithm implementation, therefore categorical features must be excluded or encoded into numerical values (integer or double).

Configuration

Features fraction Specify the feature subsampling rate represented as a fraction of features (between 0 and 1) available in each tree split (default value: 0.9).
Min impurity decrease Specify the impurity decrease threshold (between 0 and 1) necessary to determine the quality of splits in the decision trees. A split is only considered if it results in a decrease of impurity greater than or equal to this value (default value: 0.1).
Seed Select an integer as seed to get reproducible results. The option to select a time-based random number-generated seed is available.
Number of ensembles Specify the number of individual trees to be generated by the algorithm (default value: 10).
Target column Select manually from the drop-down menu the column name containing the target variable that is going to be predicted.

Output

A data matrix including the actual target value and the value predicted by the algorithm (“Prediction”) is presented.

Example

Input

In the left-hand spreadsheet of the tab import the data matrix including the target variable for prediction. Note that the “Species” categorical column is presented as double (D).

Random Forest input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Random Forest.
  2. Specify the hyperparameters that determine the structure of the model: the Features fraction [1], Min impurity decrease [2] and Number of ensembles [4]. Default values, data types (double or integer) and acceptable ranges are indicated as guidance on the input parameter values.
  3. Type a Seed for reproducible results or a random number generated Time-based RNG Seed [3].
  4. Select the column name with the target variable that is going to be predicted from the drop-down menu [5].
  5. Click on the Execute button [6] to apply the training algorithm on the input columns.
Random Forest
Output

In the right-hand spreadsheet of the tab the output data matrix with the actual and the predicted value of the target is presented.

Random Forest

Statistical fitting

MLR

Multiple Linear Regression (MLR) is a statistical technique that predicts a dependent variable using two or more independent variables. MLR models the relationship between the dependent variable and the predictors using a linear equation. It extends simple linear regression by allowing more than one independent variable. An MLR model consists of the following key components:

  1. Mean Structure: The expected value of the response variable is modeled as a linear combination of predictors:
$$ \begin{equation} y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_p x_{ip} + \epsilon_i \end{equation} $$

where \(\beta_0\) is the intercept, \(\beta_j\) are regression coefficients, \(x_{ij}\) are predictor values, and \(\epsilon_i\) is the error term.

  2. Assumptions: MLR relies on several key assumptions for valid inference:
    1. Linearity: The relationship between predictors and outcome is linear.
    2. Independence: Observations are independent of each other.
    3. Homoscedasticity: Errors have constant variance across all levels of predictors.
    4. Normality of errors: Residuals are normally distributed (for hypothesis testing).
    5. No multicollinearity: Predictors are not highly correlated with each other.
  3. Estimation Method: Coefficients are typically estimated using Ordinary Least Squares (OLS), which minimizes the sum of squared residuals.
  4. Interpretation of Coefficients: Each \(\beta_j\) represents the expected change in the dependent variable for a one-unit change in predictor \(x_{ij}\), holding all other predictors constant.
  5. Model Evaluation: Goodness of fit and explanatory power are commonly assessed with metrics such as \(R^2\), adjusted \(R^2\), F-tests, and residual diagnostics.
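The estimation and evaluation steps listed above can be sketched with NumPy's least-squares solver. The function name `fit_mlr` and the toy data are hypothetical; the first returned coefficient is the intercept \(\beta_0\).

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares with an intercept, plus R^2 and adjusted R^2."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])           # column of ones for the intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimizes the sum of squared residuals
    fitted = A @ beta
    ss_res = ((y - fitted) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return beta, r2, adj_r2

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]
beta, r2, adj_r2 = fit_mlr(X, y)
print(beta, r2, adj_r2)
```

Since the toy target is an exact linear function of the two predictors, the recovered coefficients are approximately 1, 2 and -1 and \(R^2\) is essentially 1.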

This method is a simpler implementation that allows only continuous data and main effects to be included in the model. To perform multiple linear regression with categorical data and more complex options, refer to Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Linear Models.

Use the MLR method by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) MLR

Input

The input should contain only numerical data and at least 2 columns should be specified, one corresponding to the target variable and at least 1 more to act as an independent predictor. All columns included in the input spreadsheet will be included in the analysis.

Configuration

Target Column Select from the drop-down menu the column containing the target variable that is going to be predicted. Columns with categorical features cannot be selected as targets.

Output

A data matrix including the actual target value and the value predicted by the MLR method. The regression coefficients are also listed next to the predictions.

Example

Input

In the left-hand spreadsheet of the tab import the data matrix including the target variable for prediction.

MLR-input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) MLR.
  2. Select the column that is going to be the Target Column [1].
  3. Click on the Execute button [2] to apply the algorithm on the input table.
MLR-config
Output

In the right-hand spreadsheet of the tab the output data matrix with the actual and the predicted value of the target is presented. The coefficients of the regression model are also shown. The coefficient below “1” corresponds to the intercept, and each of the other coefficients corresponds to the variable whose name appears above the value.

MLR-output

Generalized Linear Models

Generalized Linear Models (GLMs) are a flexible class of regression models that generalize ordinary linear regression to allow for response variables that have error distributions other than the normal distribution. GLMs are particularly useful when the response variable is categorical (e.g., binary outcomes) or count data, where assumptions of normality and constant variance are inappropriate. A GLM consists of three components:

  1. Random Component: The response variable \(y_i\) is assumed to follow a distribution from the exponential family (e.g., Normal, Binomial, Poisson, Gamma).
  2. Systematic Component: A linear predictor \(\eta_i= x_i^T \beta\), where \(x_i\) is the vector of predictors for the i-th observation, and \(\beta\) is the vector of coefficients.
  3. Link Function: A smooth, monotonic function \(g(⋅)\) that relates the mean \(\mu_i= E[y_i]\) of the response to the linear predictor via \(g(\mu_i )= \eta_i\).

Unlike traditional linear regression, GLMs do not assume a linear relationship between the predictors and the response. Instead, the link function transforms the expected value of the response variable to a scale where it can be modeled as a linear combination of the predictors.

GLM parameters are estimated using Maximum Likelihood Estimation (MLE). The likelihood is constructed based on the assumed distribution of the response variable, and the log-likelihood function is maximized to obtain the parameter estimates. Because the log-likelihood is usually a nonlinear function of the parameters, iterative optimization methods such as Newton-Raphson and Fisher Scoring are used to obtain the maximum likelihood estimates of the parameters of each model.

Categorical variables are encoded using one-hot encoding, where each category is represented by a binary (0 or 1) dummy variable. The reference category for each categorical variable is the first observed category, and its corresponding dummy variable is omitted to avoid multicollinearity. This omission allows the reference category to be implicitly represented in the model intercept, providing a baseline for interpreting the effects of the other categories.
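As an illustration of the iterative maximum likelihood estimation described above, the sketch below applies Fisher Scoring to a Poisson GLM with a log link. The choice of distribution and link, the starting values and the function name `fit_poisson_glm` are assumptions for this example; for this canonical link, Fisher Scoring and Newton-Raphson produce the same updates.

```python
import numpy as np

def fit_poisson_glm(X, y, n_iter=25, tol=1e-10):
    """Poisson regression with log link, fitted by Fisher Scoring."""
    A = np.column_stack([np.ones(len(X)), X])    # design matrix with intercept
    beta = np.zeros(A.shape[1])
    beta[0] = np.log(y.mean())                   # start from the intercept-only model
    for _ in range(n_iter):
        eta = A @ beta                           # linear predictor
        mu = np.exp(eta)                         # inverse link: mean of the response
        score = A.T @ (y - mu)                   # gradient of the log-likelihood
        info = A.T @ (A * mu[:, None])           # Fisher information, A^T diag(mu) A
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

x = np.linspace(0.0, 3.0, 20).reshape(-1, 1)
y = np.exp(0.5 + 0.3 * x.ravel())                # noise-free means, for checking only
print(fit_poisson_glm(x, y))
```

The toy response is set equal to its own mean so that the estimates can be compared directly with the generating coefficients 0.5 and 0.3; real count data would be integers scattered around that mean.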

Linear

The classical linear regression model assumes that the response variable follows a normal (Gaussian) distribution, conditional on the explanatory variables. Specifically, for a continuous dependent variable \(Y\), the model expresses \(Y\) as a linear function of one or more independent variables \(X_1, X_2, …, X_p\), plus a normally distributed error term:

$$ \begin{equation} Y = \beta_0 + \beta_1X_1 + ... + \beta_pX_p + \epsilon \text{ , } \epsilon \sim \mathcal{N}(0,\sigma^2) \end{equation} $$

Within the framework of Generalized Linear Models (GLMs), linear regression corresponds to the case where the distribution of the dependent variable is Gaussian, and the link function is the identity function:

$$ \begin{equation} g(\mu_i) = \mu_i = \eta_i \end{equation} $$

This identity link implies that the expected value of the outcome is directly modeled as a linear combination of the predictors. Linear regression is most commonly used when the dependent variable is continuous, unbounded, and approximately normally distributed, and when the relationship between the predictors and the outcome is assumed to be additive and linear.

Use the Linear Regression method by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Linear Models

And then choosing “Linear” as the Type.

Input

All variables need to be specified in the datasheet. Numerical values will be used for the covariates and the dependent variable. Factors, however, can be textual as well as numerical. The design for Linear Regression requires at least two columns in the input sheet: one column representing either a categorical factor or a covariate (independent variable), and another column for the numerical response. Each row represents a single observation.

Configuration
Confidence Level (%) Specify the confidence level of the analysis. Values should range from 0 to 100 and correspond to percentages.
Dependent Variable Select the column that corresponds to values of the dependent variable.
Scale Parameter Method Determines how the scale (error variance) parameter is estimated. Options include: Fixed value, Deviance, or Pearson Chi-square.
Value Specifies the scale parameter value manually, only when Fixed value is selected in the Scale Parameter Method.
Factors/Covariates/Excluded Columns Select manually the columns that correspond to factors and the columns that correspond to covariates through the dialog window: Use the buttons to move columns between the Factors and Covariates list and Excluded Columns list. Single-arrow buttons will move all selected columns and double-arrow buttons will move all columns. At least one covariate or factor column should be specified.
Custom/Include All Main Effects/Full Factorial These options refer to the terms that will be included in the model. The Custom option allows the user to input a formula defining the exact terms to be included. The Include All Main Effects option allows the analysis of a model that only includes all main effects and finally, the Full Factorial option includes both all main effects and all possible interaction terms to build a full model. Note that the Include All Main Effects and Full Factorial options do not allow the use of a formula.
Formula Specify the model formula used for the analysis if the Custom option is selected. Include all variables listed under Factors or Covariates, separated by “+”. To include interaction terms, use the format VariableA:VariableB. If interaction terms are included, the dataset must contain all combinations of the levels of the involved categorical variables — i.e., the design must be fully crossed — to ensure the model can be properly estimated.
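
To make the three model-definition options concrete, the following sketch builds the list of terms each option would imply for a set of variables. The helper names and the variable names `Dose` and `Group` are hypothetical, and the parser only handles the "+" and ":" syntax described above; the platform's actual formula grammar may accept more.

```python
from itertools import combinations

def main_effects(variables):
    """Terms for a main-effects-only model: one term per variable."""
    return list(variables)

def full_factorial(variables):
    """All main effects plus every interaction of order 2 and higher."""
    terms = []
    for order in range(1, len(variables) + 1):
        for combo in combinations(variables, order):
            terms.append(":".join(combo))
    return terms

def parse_formula(formula):
    """Split a custom formula such as 'A + B + A:B' into its terms."""
    return [term.strip() for term in formula.split("+") if term.strip()]

variables = ["Dose", "Group"]
print(main_effects(variables))                      # ['Dose', 'Group']
print(full_factorial(variables))                    # ['Dose', 'Group', 'Dose:Group']
print(parse_formula("Dose + Group + Dose:Group"))   # ['Dose', 'Group', 'Dose:Group']
```

With three variables, the full factorial expansion already yields seven terms, which is one reason a custom formula can be preferable for larger designs.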
Output

The output of the linear regression procedure is organized into:

  1. The Predicted Values Table contains the actual values of the dependent variable and the corresponding predicted values generated by the model for each observation.
  2. The Goodness of Fit Table includes statistical measures that assess how well the model fits the data, such as Deviance, Log-Likelihood, AIC, BIC, and related metrics.
  3. The Parameter Estimates Table displays the estimated coefficients for each variable in the model, along with standard errors, confidence intervals, test statistics, degrees of freedom, and p-values.
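
For readers who want to relate the Goodness of Fit entries to each other, here is a small sketch of the textbook definitions of AIC and BIC in terms of the log-likelihood. It assumes the standard formulas; the source does not state whether the platform counts the scale parameter among the estimated parameters or reports corrected variants, so small differences from the table are possible.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical fit: log-likelihood -100 with 3 parameters on 50 observations.
print(aic(-100.0, 3))        # 206.0
print(bic(-100.0, 3, 50))
```

Lower values indicate a better trade-off between fit and complexity, and the criteria are only comparable between models fitted to the same data.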
Example
Input

The input datasheet must include one continuous dependent variable, which will serve as the target, and at least one column with a continuous or categorical independent variable.

linearGLM-input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Linear Models
  2. Set the Type [1] of regression to Linear.
  3. Specify the Confidence Level (%) [2] for the test.
  4. Select the Dependent Variable [3].
  5. Select the Scale Parameter Method [4].
  6. Specify the Value [5] of the scale parameter method if the Fixed value option was chosen as the scale parameter method.
  7. Select the columns by clicking on the arrow buttons [9] and moving columns between the Excluded Columns [6] and Factors [7] and Covariates [8] lists.
  8. Select your preferred option to define the model you want to analyze [10].
  9. If the Custom option is selected, specify the Formula [11] for the analysis.
  10. Click on the Execute button [12] to perform the Linear Regression method.
linearGLM-config
Output

The predictions, Goodness of Fit table and Parameter Estimates table are shown in the output spreadsheet.

linearGLM-output

Negative Binomial

Negative Binomial Regression is a type of generalized linear model (GLM) used for modeling count data that exhibit overdispersion – that is, when the variance exceeds the mean. It assumes that the dependent variable follows a Negative Binomial distribution, which is a generalization of the Poisson distribution that introduces an additional dispersion parameter to account for variability beyond the mean. The most commonly used link function is the log link, which models the logarithm of the expected count as a linear combination of the predictors:

\(\log(\mu) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k\)

Negative Binomial Regression is particularly useful in scenarios such as modeling the number of hospital visits, insurance claims, or any count-based outcome where the data are not well fitted by a Poisson model due to excess variation.
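
A quick way to screen for overdispersion before choosing between Poisson and Negative Binomial is to compare the sample variance of the counts with their mean. The sketch below does this with made-up counts; it ignores covariates, so it is only a rough first check and not a substitute for inspecting the fitted model.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by sample mean; about 1 for Poisson-like data."""
    return variance(counts) / mean(counts)

# Illustrative counts with many zeros and a few large values.
counts = [0, 1, 0, 3, 12, 0, 2, 7, 0, 25]
index = dispersion_index(counts)
print(round(index, 2))   # well above 1, suggesting overdispersion
```

In one common parameterisation (often called NB2), the Negative Binomial variance is \(\mu + \alpha\mu^2\), so the extra term \(\alpha\mu^2\) absorbs the variability that a Poisson model cannot. Whether the platform uses this parameterisation is not stated in the source.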

Use the Negative Binomial Regression method by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Linear Models

And then choosing “Negative Binomial” as the Type.

Input

All variables must be specified in the datasheet. The dependent variable must be a non-negative integer count, as Negative Binomial Regression is designed for modeling count data. The dependent variable should not contain decimal values, negative numbers, or missing entries. Independent variables may be either numerical or categorical. Categorical variables can be represented using either text labels or numerical codes. The input datasheet must contain at least two columns: one for the dependent variable and one or more for the independent variables. Each row corresponds to a single observation. This model is especially appropriate when the count data show overdispersion, meaning that the variance is greater than the mean – something that the standard Poisson model cannot adequately handle.

Configuration
Confidence Level (%) Specify the confidence level of the analysis. Values should range from 0 to 100 and correspond to percentages.
Max Iterations Defines the maximum number of iterations the model is allowed to perform during the estimation process. If the model fails to converge before reaching this number, the algorithm stops and returns the values of the last iteration.
Maximum Step-Halving Controls how many times the algorithm is allowed to halve the step size during parameter updates when an iteration leads to worse model fit.
Dependent Variable Select the column that corresponds to values of the dependent variable.
Parameter Estimation Select how the model parameters are estimated: Newton-Raphson, Fisher Scoring, or Hybrid.
Maximum Scoring Iterations The “Maximum Scoring Iterations” parameter is used when the hybrid estimation method is selected, instead of Newton-Raphson or Fisher scoring alone, and specifies the maximum number of iterations to be performed during the scoring phase.
Minimum Change in Parameter Estimates Sets the tolerance level for convergence — the smallest change in parameter estimates between iterations required to continue optimization. If the change in all parameters is below this value, the algorithm assumes convergence has been reached.
Scale Parameter Method Determines how the scale (error variance) parameter is estimated. Options include: Fixed value, Deviance, or Pearson Chi-square.
Value Specifies the scale parameter value manually, only when Fixed value is selected in the Scale Parameter Method.
Factors/Covariates/Excluded Columns Select manually the columns that correspond to factors and the columns that correspond to covariates through the dialog window: Use the buttons to move columns between the Factors and Covariates list and Excluded Columns list. Single-arrow buttons will move all selected columns and double-arrow buttons will move all columns. At least one covariate or factor column should be specified.
Custom/Include All Main Effects/Full Factorial These options refer to the terms that will be included in the model. The Custom option allows the user to input a formula defining the exact terms to be included. The Include All Main Effects option allows the analysis of a model that only includes all main effects and finally, the Full Factorial option includes both all main effects and all possible interaction terms to build a full model. Note that the Include All Main Effects and Full Factorial options do not allow the use of a formula.
Formula Specify the model formula used for the analysis if the Custom option is selected. Include all variables listed under Factors or Covariates, separated by “+”. To include interaction terms, use the format VariableA:VariableB. If interaction terms are included, the dataset must contain all combinations of the levels of the involved categorical variables — i.e., the design must be fully crossed — to ensure the model can be properly estimated.
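
The fully crossed requirement for interaction terms can be checked before running the analysis. The sketch below lists the level combinations that never occur in a dataset; the row layout (a list of dictionaries) and the column names `Treatment` and `Sex` are assumptions made for illustration only.

```python
from itertools import product

def missing_combinations(rows, factors):
    """Return level combinations of `factors` that do not appear in `rows`."""
    levels = [sorted({row[f] for row in rows}) for f in factors]
    observed = {tuple(row[f] for f in factors) for row in rows}
    return [combo for combo in product(*levels) if combo not in observed]

rows = [
    {"Treatment": "A", "Sex": "F", "Visits": 2},
    {"Treatment": "A", "Sex": "M", "Visits": 0},
    {"Treatment": "B", "Sex": "F", "Visits": 5},
    # No row with Treatment "B" and Sex "M": the design is not fully crossed.
]
print(missing_combinations(rows, ["Treatment", "Sex"]))   # [('B', 'M')]
```

An empty result means every combination is present, so an interaction such as Treatment:Sex has data in each cell.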
Output

The output of the Negative Binomial regression procedure is organized into three main sections: the Predicted Values Table, the Goodness of Fit Statistics, and the Parameter Estimates Table.

  1. The Predicted Values Table contains the actual values of the dependent variable and the corresponding predicted values generated by the model for each observation.
  2. The Goodness of Fit Table includes statistical measures that assess how well the model fits the data, such as Deviance, Log-Likelihood, AIC, BIC, and related metrics.
  3. The Parameter Estimates Table displays the estimated coefficients for each variable in the model, along with standard errors, confidence intervals, test statistics, degrees of freedom, and p-values.
Example
Input

The input datasheet must include one dependent variable of non-negative integer counts, which will serve as the target, and at least one column with a continuous or categorical independent variable.

negbinGLM-input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Linear Models
  2. Set the Type [1] of regression to Negative Binomial.
  3. Specify the Confidence Level (%) [2] for the test.
  4. Specify Max Iterations [3].
  5. Specify the Maximum Step-Halving [4].
  6. Select the Dependent Variable [5].
  7. Select the Parameter Estimation Method [6].
  8. Specify the Maximum Scoring Iterations [7] if the Hybrid option is selected as the Parameter Estimation Method.
  9. Specify the Minimum Change in Parameter Estimates [8].
  10. Select the Scale Parameter Method [9].
  11. Specify the Value [10] of the scale parameter method if the Fixed value option was chosen as the scale parameter method.
  12. Select the columns by clicking on the arrow buttons [14] and moving columns between the Excluded Columns [11] and Factors [12] and Covariates [13] lists.
  13. Select your preferred option to define the model you want to analyze [15].
  14. If the Custom option is selected, specify the Formula [16] for the analysis.
  15. Click on the Execute button [17] to perform the Negative Binomial Regression method.
negbinGLM-config
Output

The predictions, Goodness of Fit table and Parameter Estimates table are shown in the output spreadsheet.

negbinGLM-output

Poisson

Poisson Regression is a type of generalized linear model (GLM) used for modeling count data, where the dependent variable represents the number of times an event occurs within a fixed period, space, or exposure. It assumes that the response variable follows a Poisson distribution, where the mean is equal to the variance. The most common link function used is the log link, which relates the natural logarithm of the expected count to a linear combination of independent variables.

The independent variables can be either numerical or categorical, and they are used to explain variation in the count outcome. Poisson Regression is typically applied in cases such as modeling the number of doctor visits, traffic accidents, or product defects.
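
The effect of the log link is easiest to see numerically. In the sketch below, the coefficients are hypothetical values chosen for illustration, not output from the platform: the expected count is the exponential of the linear predictor, so a one-unit increase in a covariate multiplies the expected count by a constant factor.

```python
import math

def expected_count(intercept, slope, x):
    """Poisson mean under a log link: mu = exp(intercept + slope * x)."""
    return math.exp(intercept + slope * x)

intercept, slope = 0.5, 0.2          # hypothetical fitted coefficients
mu_at_2 = expected_count(intercept, slope, 2.0)
mu_at_3 = expected_count(intercept, slope, 3.0)

# The ratio between consecutive x values equals exp(slope) regardless of x.
rate_ratio = mu_at_3 / mu_at_2
print(round(rate_ratio, 4), round(math.exp(slope), 4))
```

This is why coefficients from log-link models are often reported after exponentiation, as multiplicative effects on the expected count.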

Use the Poisson Regression method by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Linear Models

And then choosing “Poisson” as the Type.

Input

All variables must be specified in the datasheet. The dependent variable must be a non-negative integer count, as Poisson Regression is designed for modeling count data. The dependent variable should not contain decimal values, negative numbers, or missing entries. Independent variables may be either numerical or categorical. Categorical variables can be represented using either text labels or numerical codes. The input datasheet must include at least two columns: one for the dependent variable and one or more for the independent variables. Each row corresponds to a single observation. This model is appropriate when the count data has a variance approximately equal to the mean.

Configuration
Confidence Level (%) Specify the confidence level of the analysis. Values should range from 0 to 100 and correspond to percentages.
Max Iterations Defines the maximum number of iterations the model is allowed to perform during the estimation process. If the model fails to converge before reaching this number, the algorithm stops and returns the values of the last iteration.
Maximum Step-Halving Controls how many times the algorithm is allowed to halve the step size during parameter updates when an iteration leads to worse model fit.
Dependent Variable Select the column that corresponds to values of the dependent variable.
Parameter Estimation Select how the model parameters are estimated: Newton-Raphson, Fisher Scoring, or Hybrid.
Maximum Scoring Iterations The “Maximum Scoring Iterations” parameter is used when the hybrid estimation method is selected, instead of Newton-Raphson or Fisher scoring alone, and specifies the maximum number of iterations to be performed during the scoring phase.
Minimum Change in Parameter Estimates Sets the tolerance level for convergence — the smallest change in parameter estimates between iterations required to continue optimization. If the change in all parameters is below this value, the algorithm assumes convergence has been reached.
Scale Parameter Method Determines how the scale (error variance) parameter is estimated. Options include: Fixed value, Deviance, or Pearson Chi-square.
Value Specifies the scale parameter value manually, only when Fixed value is selected in the Scale Parameter Method.
Factors/Covariates/Excluded Columns Select manually the columns that correspond to factors and the columns that correspond to covariates through the dialog window: Use the buttons to move columns between the Factors and Covariates list and Excluded Columns list. Single-arrow buttons will move all selected columns and double-arrow buttons will move all columns. At least one covariate or factor column should be specified.
Custom/Include All Main Effects/Full Factorial These options refer to the terms that will be included in the model. The Custom option allows the user to input a formula defining the exact terms to be included. The Include All Main Effects option allows the analysis of a model that only includes all main effects and finally, the Full Factorial option includes both all main effects and all possible interaction terms to build a full model. Note that the Include All Main Effects and Full Factorial options do not allow the use of a formula.
Formula Specify the model formula used for the analysis if the Custom option is selected. Include all variables listed under Factors or Covariates, separated by “+”. To include interaction terms, use the format VariableA:VariableB. If interaction terms are included, the dataset must contain all combinations of the levels of the involved categorical variables — i.e., the design must be fully crossed — to ensure the model can be properly estimated.
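
The iteration-related options (Max Iterations, Maximum Step-Halving, Minimum Change in Parameter Estimates) can be pictured with a small Newton-Raphson loop for a Poisson model with one covariate and a log link. This is a minimal sketch under stated assumptions, not the platform's implementation: the function names, default values, and data are invented, and because the log link is canonical for the Poisson family, Newton-Raphson and Fisher scoring produce the same updates here.

```python
import math

def poisson_loglik(beta, x, y):
    """Log-likelihood up to a constant that does not depend on beta."""
    return sum(yi * (beta[0] + beta[1] * xi) - math.exp(beta[0] + beta[1] * xi)
               for xi, yi in zip(x, y))

def fit_poisson(x, y, max_iter=100, max_step_halving=5, tol=1e-8):
    """Return (beta, iterations_used, converged) for log(mu) = b0 + b1 * x."""
    beta = [math.log(sum(y) / len(y)), 0.0]      # start at the intercept-only fit
    for iteration in range(1, max_iter + 1):
        mu = [math.exp(beta[0] + beta[1] * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Information matrix [[a, b], [b, c]] = X'WX with W = diag(mu).
        a = sum(mu)
        b = sum(mi * xi for xi, mi in zip(x, mu))
        c = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = a * c - b * b
        step = [(c * g0 - b * g1) / det, (a * g1 - b * g0) / det]
        # Step-halving: shrink the step while it makes the fit worse.
        current = poisson_loglik(beta, x, y)
        factor, halvings = 1.0, 0
        candidate = [beta[0] + step[0], beta[1] + step[1]]
        while poisson_loglik(candidate, x, y) < current and halvings < max_step_halving:
            factor /= 2
            halvings += 1
            candidate = [beta[0] + factor * step[0], beta[1] + factor * step[1]]
        # Convergence test on the largest change in any parameter estimate.
        change = max(abs(candidate[0] - beta[0]), abs(candidate[1] - beta[1]))
        beta = candidate
        if change < tol:
            return beta, iteration, True
    # Not converged: return the values of the last iteration.
    return beta, max_iter, False

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1, 2, 4, 7, 12]
beta, n_iter, converged = fit_poisson(x, y)
print(beta, n_iter, converged)
```

Calling `fit_poisson(x, y, max_iter=1)` stops after a single update and reports `converged` as `False`, which mirrors the documented behaviour of returning the last iteration's values when the iteration limit is reached.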
Output

The output of the Poisson regression procedure is organized into three main sections: the Predicted Values Table, the Goodness of Fit Statistics, and the Parameter Estimates Table.

  1. The Predicted Values Table contains the actual values of the dependent variable and the corresponding predicted values generated by the model for each observation.
  2. The Goodness of Fit Table includes statistical measures that assess how well the model fits the data, such as Deviance, Log-Likelihood, AIC, BIC, and related metrics.
  3. The Parameter Estimates Table displays the estimated coefficients for each variable in the model, along with standard errors, confidence intervals, test statistics, degrees of freedom, and p-values.
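
To show how the Confidence Level (%) setting relates to the intervals in the Parameter Estimates Table, the sketch below computes a Wald-type interval from an estimate and its standard error. This rests on an assumption: the source does not say whether the platform reports Wald or profile-likelihood intervals, or whether it uses normal or t quantiles, so the numbers it shows may differ. The estimate and standard error are hypothetical.

```python
import math
from statistics import NormalDist

def wald_interval(estimate, std_error, confidence_level_pct=95.0):
    """Normal-approximation interval; the level must lie strictly between 0 and 100."""
    alpha = 1 - confidence_level_pct / 100
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * std_error, estimate + z * std_error

low, high = wald_interval(0.2, 0.05, 95.0)   # hypothetical coefficient and SE
print(round(low, 3), round(high, 3))          # 0.102 0.298

# With a log link, exponentiating the bounds gives an interval for the rate ratio.
print(round(math.exp(low), 3), round(math.exp(high), 3))
```

Raising the confidence level widens the interval, since a larger quantile multiplies the same standard error.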
Example
Input

The dataset must include a target variable consisting of non-negative integer counts, appropriate for modeling with Poisson regression, and at least one column with a continuous or categorical independent variable.

poissonGLM-input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Linear Models
  2. Set the Type [1] of regression to Poisson.
  3. Specify the Confidence Level (%) [2] for the test.
  4. Specify Max Iterations [3].
  5. Specify the Maximum Step-Halving [4].
  6. Select the Dependent Variable [5].
  7. Select the Parameter Estimation Method [6].
  8. Specify the Maximum Scoring Iterations [7] if the Hybrid option is selected as the Parameter Estimation Method.
  9. Specify the Minimum Change in Parameter Estimates [8].
  10. Select the Scale Parameter Method [9].
  11. Specify the Value [10] of the scale parameter method if the Fixed value option was chosen as the scale parameter method.
  12. Select the columns by clicking on the arrow buttons [14] and moving columns between the Excluded Columns [11] and Factors [12] and Covariates [13] lists.
  13. Select your preferred option to define the model you want to analyze [15].
  14. If the Custom option is selected, specify the Formula [16] for the analysis.
  15. Click on the Execute button [17] to perform the Poisson Regression method.
poissonGLM-config
Output

The predictions, Goodness of Fit table and Parameter Estimates table are shown in the output spreadsheet.

poissonGLM-output

Gamma

Gamma regression is a type of generalized linear model (GLM) used when the response variable is continuous, strictly positive, and right-skewed. It assumes that the dependent variable follows a Gamma distribution, which is well-suited for modeling positive data with a variance that increases with the mean. A common choice for the link function in Gamma regression is the log link, which relates the mean of the response variable to the linear predictors via the natural logarithm.

Gamma regression is particularly useful in applications such as modeling costs, waiting times, insurance claims, or any scenario where the outcome is positive and skewed. It provides a flexible framework to account for heteroscedasticity and asymmetry in data.
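
The Deviance reported for a Gamma model can be related to the observed and predicted values with the standard Gamma deviance formula, \(D = 2\sum_i \left[(y_i - \mu_i)/\mu_i - \log(y_i/\mu_i)\right]\). The sketch below evaluates it for made-up values; whether the platform reports this unscaled form or a scaled version is not stated in the source.

```python
import math

def gamma_deviance(y, mu):
    """Unscaled Gamma deviance; requires every y and mu to be strictly positive."""
    return 2 * sum((yi - mi) / mi - math.log(yi / mi) for yi, mi in zip(y, mu))

observed = [1.2, 3.5, 0.8, 10.0]     # illustrative positive responses
predicted = [1.0, 3.0, 1.0, 8.0]     # illustrative fitted means
print(round(gamma_deviance(observed, predicted), 4))

# A perfect fit gives zero deviance.
print(gamma_deviance(observed, observed))   # 0.0
```

The logarithm of \(y_i/\mu_i\) is undefined when an observation is zero, which is one way to see why the dependent variable must be strictly positive.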

Use the Gamma Regression method by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Linear Models

And then choosing “Gamma” as the Type.

Input

All variables must be specified in the datasheet. The dependent variable must be a strictly positive continuous numerical variable, as Gamma regression is not defined for zero or negative values. Independent variables may be either numerical or categorical. Categorical factors can be represented as text labels or coded numerically. The input datasheet must include at least two columns: one for the dependent variable and one or more columns for the independent variables. Each row in the datasheet represents a single observation. It is important to check for zero or missing values in the dependent variable, as they may cause the model to fail or produce invalid estimates.

Configuration
Confidence Level (%) Specify the confidence level of the analysis. Values should range from 0 to 100 and correspond to percentages.
Max Iterations Defines the maximum number of iterations the model is allowed to perform during the estimation process. If the model fails to converge before reaching this number, the algorithm stops and returns the values of the last iteration.
Maximum Step-Halving Controls how many times the algorithm is allowed to halve the step size during parameter updates when an iteration leads to worse model fit.
Dependent Variable Select the column that corresponds to values of the dependent variable.
Parameter Estimation Select how the model parameters are estimated: Newton-Raphson, Fisher Scoring, or Hybrid.
Maximum Scoring Iterations The “Maximum Scoring Iterations” parameter is used when the hybrid estimation method is selected, instead of Newton-Raphson or Fisher scoring alone, and specifies the maximum number of iterations to be performed during the scoring phase.
Minimum Change in Parameter Estimates Sets the tolerance level for convergence — the smallest change in parameter estimates between iterations required to continue optimization. If the change in all parameters is below this value, the algorithm assumes convergence has been reached.
Scale Parameter Method Determines how the scale (error variance) parameter is estimated. Options include: Fixed value, Deviance, or Pearson Chi-square.
Value Specifies the scale parameter value manually, only when Fixed value is selected in the Scale Parameter Method.
Factors/Covariates/Excluded Columns Select manually the columns that correspond to factors and the columns that correspond to covariates through the dialog window: Use the buttons to move columns between the Factors and Covariates list and Excluded Columns list. Single-arrow buttons will move all selected columns and double-arrow buttons will move all columns. At least one covariate or factor column should be specified.
Custom/Include All Main Effects/Full Factorial These options refer to the terms that will be included in the model. The Custom option allows the user to input a formula defining the exact terms to be included. The Include All Main Effects option allows the analysis of a model that only includes all main effects and finally, the Full Factorial option includes both all main effects and all possible interaction terms to build a full model. Note that the Include All Main Effects and Full Factorial options do not allow the use of a formula.
Formula Specify the model formula used for the analysis if the Custom option is selected. Include all variables listed under Factors or Covariates, separated by “+”. To include interaction terms, use the format VariableA:VariableB. If interaction terms are included, the dataset must contain all combinations of the levels of the involved categorical variables — i.e., the design must be fully crossed — to ensure the model can be properly estimated.
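
The Pearson Chi-square and Deviance options for the Scale Parameter Method can be sketched with the usual GLM definitions, in which the statistic is divided by the residual degrees of freedom (observations minus estimated parameters). This is an assumption about the general technique rather than a confirmed description of the platform's computation; the function names and numbers are illustrative, and the default variance function \(V(\mu) = \mu^2\) is the Gamma one.

```python
def pearson_scale(y, mu, n_params, variance=lambda m: m ** 2):
    """Pearson chi-square divided by residual degrees of freedom.

    The default variance function V(mu) = mu**2 corresponds to the Gamma family.
    """
    chi2 = sum((yi - mi) ** 2 / variance(mi) for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)

def deviance_scale(deviance, n_obs, n_params):
    """Deviance divided by residual degrees of freedom."""
    return deviance / (n_obs - n_params)

y = [1.0, 2.0, 3.0, 6.0]      # illustrative responses
mu = [2.0, 2.0, 3.0, 4.0]     # illustrative fitted means
print(pearson_scale(y, mu, n_params=2))                           # 0.25
print(pearson_scale(y, mu, n_params=2, variance=lambda m: m))     # 0.75, Poisson-type variance
print(deviance_scale(1.5, n_obs=4, n_params=2))                   # 0.75
```

Choosing Fixed value instead skips this estimation and uses the number entered in the Value field.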
Output

The output of the Gamma regression procedure is organized into three main sections: the Predicted Values Table, the Goodness of Fit Statistics, and the Parameter Estimates Table.

  1. The Predicted Values Table contains the actual values of the dependent variable and the corresponding predicted values generated by the model for each observation.
  2. The Goodness of Fit Table includes statistical measures that assess how well the model fits the data, such as Deviance, Log-Likelihood, AIC, BIC, and related metrics.
  3. The Parameter Estimates Table displays the estimated coefficients for each variable in the model, along with standard errors, confidence intervals, test statistics, degrees of freedom, and p-values.
Example
Input

The input datasheet must include one continuous, positive dependent variable, which will serve as the target, and at least one column with a continuous or categorical independent variable.

gammaGLM-input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Linear Models
  2. Set the Type [1] of regression to Gamma.
  3. Specify the Confidence Level (%) [2] for the test.
  4. Specify Max Iterations [3].
  5. Specify the Maximum Step-Halving [4].
  6. Select the Dependent Variable [5].
  7. Select the Parameter Estimation Method [6].
  8. Specify the Maximum Scoring Iterations [7] if the Hybrid option is selected as the Parameter Estimation Method.
  9. Specify the Minimum Change in Parameter Estimates [8].
  10. Select the Scale Parameter Method [9].
  11. Specify the Value [10] of the scale parameter method if the Fixed value option was chosen as the scale parameter method.
  12. Select the columns by clicking on the arrow buttons [14] and moving columns between the Excluded Columns [11] and Factors [12] and Covariates [13] lists.
  13. Select your preferred option to define the model you want to analyze [15].
  14. If the Custom option is selected, specify the Formula [16] for the analysis.
  15. Click on the Execute button [17] to perform the Gamma Regression method.
gammaGLM-config
Output

The predictions, Goodness of Fit table and Parameter Estimates table are shown in the output spreadsheet.

gammaGLM-output

Tweedie with Identity Link

Tweedie Regression is a type of generalized linear model (GLM) designed to handle semi-continuous response variables, that is, variables that exhibit a combination of many exact zeros and positive, continuous values. It assumes the response variable follows a Tweedie distribution, a member of the exponential dispersion family that encompasses distributions such as the normal, Poisson, gamma, and inverse Gaussian as special cases. When the Tweedie power parameter lies between 1 and 2, the distribution corresponds to a compound Poisson–Gamma process, which makes it particularly suitable for modeling zero-inflated, right-skewed data, such as insurance claim amounts, healthcare costs, or ecological measurements. In this formulation, the identity link function is used, meaning that the expected value of the response variable is modeled directly as a linear function of the predictors:

\(\mu = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k\)

This link is appropriate when the response values are already on a meaningful scale and do not require transformation. However, care must be taken to ensure that the predicted values remain in a valid range (e.g., non-negative), since the identity link does not constrain the output.
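
Because the identity link leaves predictions unconstrained, it can help to scan predicted values for negatives after fitting. The sketch below does this for a set of hypothetical coefficients and covariate rows; the helper names and numbers are invented for illustration.

```python
def linear_predictor(coefs, row):
    """Intercept plus the weighted sum of covariates (identity link: this is the mean)."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

def invalid_predictions(coefs, rows):
    """Return (row, prediction) pairs whose predicted mean is negative."""
    flagged = []
    for row in rows:
        mu = linear_predictor(coefs, row)
        if mu < 0:
            flagged.append((row, mu))
    return flagged

coefs = [10.0, -3.0]                 # hypothetical intercept and slope
rows = [[1.0], [3.0], [4.0]]         # one covariate per observation
print(invalid_predictions(coefs, rows))   # [([4.0], -2.0)]
```

If such rows appear within the range of covariate values of interest, a log link keeps predictions positive and may be the safer choice.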

Use the Tweedie Regression method with identity link by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Linear Models

And then choosing “Tweedie with Identity Link” as the Type.

Input

All variables must be specified in the datasheet. The dependent variable must be a non-negative continuous numeric value, which may include many zeros alongside positive, right-skewed values. This reflects the typical structure of semi-continuous data for which Tweedie Regression with identity link is appropriate. Because the identity link function models the expected value directly, care should be taken to ensure that the response values are within a reasonable range, and that the model will not predict invalid (e.g., negative) values. The dependent variable should not contain negative numbers or missing entries. Independent variables may be either numerical or categorical. Categorical variables can be represented using text labels or numerical codes. The input datasheet must include at least two columns: one for the dependent variable and one or more for the independent variables. Each row should represent a single observation.

Configuration
Confidence Level (%) Specify the confidence level of the analysis. Values should range from 0 to 100 and correspond to percentages.
Max Iterations Defines the maximum number of iterations the model is allowed to perform during the estimation process. If the model fails to converge before reaching this number, the algorithm stops and returns the values of the last iteration.
Maximum Step-Halving Controls how many times the algorithm is allowed to halve the step size during parameter updates when an iteration leads to worse model fit.
Dependent Variable Select the column that corresponds to values of the dependent variable.
Parameter Estimation Select how the model parameters are estimated: Newton-Raphson, Fisher Scoring, or Hybrid.
Maximum Scoring Iterations The “Maximum Scoring Iterations” parameter is used when the hybrid estimation method is selected, instead of Newton-Raphson or Fisher scoring alone, and specifies the maximum number of iterations to be performed during the scoring phase.
Minimum Change in Parameter Estimates Sets the tolerance level for convergence — the smallest change in parameter estimates between iterations required to continue optimization. If the change in all parameters is below this value, the algorithm assumes convergence has been reached.
Scale Parameter Method Determines how the scale (error variance) parameter is estimated. Options include: Fixed value, Deviance, or Pearson Chi-square.
Value Specifies the scale parameter value manually, only when Fixed value is selected in the Scale Parameter Method.
Factors/Covariates/Excluded Columns Select manually the columns that correspond to factors and the columns that correspond to covariates through the dialog window: Use the buttons to move columns between the Factors and Covariates list and Excluded Columns list. Single-arrow buttons will move all selected columns and double-arrow buttons will move all columns. At least one covariate or factor column should be specified.
Custom/Include All Main Effects/Full Factorial These options refer to the terms that will be included in the model. The Custom option allows the user to input a formula defining the exact terms to be included. The Include All Main Effects option allows the analysis of a model that only includes all main effects and finally, the Full Factorial option includes both all main effects and all possible interaction terms to build a full model. Note that the Include All Main Effects and Full Factorial options do not allow the use of a formula.
Formula Specify the model formula used for the analysis if the Custom option is selected. Include all variables listed under Factors or Covariates, separated by “+”. To include interaction terms, use the format VariableA:VariableB. If interaction terms are included, the dataset must contain all combinations of the levels of the involved categorical variables — i.e., the design must be fully crossed — to ensure the model can be properly estimated.
Output

The output of the Tweedie regression with identity link procedure is organized into three main sections: the Predicted Values Table, the Goodness of Fit Statistics, and the Parameter Estimates Table.

  1. The Predicted Values Table contains the actual values of the dependent variable and the corresponding predicted values generated by the model for each observation.
  2. The Goodness of Fit Table includes statistical measures that assess how well the model fits the data, such as Deviance, Log-Likelihood, AIC, BIC, and related metrics.
  3. The Parameter Estimates Table displays the estimated coefficients for each variable in the model, along with standard errors, confidence intervals, test statistics, degrees of freedom, and p-values.
Example
Input

The input datasheet must include one non-negative continuous dependent variable, which will serve as the target, and at least one column with a continuous or categorical independent variable.

tweedieidentityGLM-input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Linear Models
  2. Set the Type [1] of regression to Tweedie with Identity Link.
  3. Specify the Confidence Level (%) [2] for the test.
  4. Specify Max Iterations [3].
  5. Specify the Maximum Step-Halving [4].
  6. Select the Dependent Variable [5].
  7. Select the Parameter Estimation Method [6].
  8. Specify the Maximum Scoring Iterations [7] if the Hybrid option is selected as the Parameter Estimation Method.
  9. Specify the Minimum Change in Parameter Estimates [8].
  10. Select the Scale Parameter Method [9].
  11. Specify the Value [10] of the scale parameter method if the Fixed value option was chosen as the scale parameter method.
  12. Select the columns by clicking on the arrow buttons [14] and moving columns between the Excluded Columns [11] and Factors [12] and Covariates [13] lists.
  13. Select your preferred option to define the model you want to analyze [15].
  14. If the Custom option is selected, specify the Formula [16] for the analysis.
  15. Click on the Execute button [17] to perform the Tweedie Regression with Identity Link method.
tweedieidentityGLM-config
Output

The predictions, Goodness of Fit table and Parameter Estimates table are shown in the output spreadsheet.

tweedieidentityGLM-output

Tweedie with Log Link

Tweedie Regression is a type of generalized linear model (GLM) designed to handle semi-continuous response variables – that is, variables that combine many exact zeros with positive, continuous values. It assumes the response variable follows a Tweedie distribution, which belongs to the exponential dispersion family and includes the normal, Poisson, gamma and inverse Gaussian distributions as special cases. When the Tweedie power parameter lies between 1 and 2, the distribution corresponds to a compound Poisson–Gamma process, making it well suited for modeling zero-inflated, right-skewed data such as insurance claim amounts, healthcare expenditures, or ecological measurements. The most common link function used is the log link, where the logarithm of the expected value is modeled as a linear function of the predictors.
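
The compound Poisson–Gamma structure can be made tangible by simulating it: draw a Poisson number of events, then sum that many Gamma-distributed amounts. The sketch below uses only the Python standard library; the parameter values are arbitrary, and the mapping from the Tweedie mean, dispersion, and power parameters to these simulation parameters is deliberately left out.

```python
import math
import random

def sample_poisson(rng, lam):
    """Draw a Poisson count with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_compound_poisson_gamma(rng, lam, shape, scale):
    """Sum of a Poisson number of Gamma amounts; exactly 0 when no event occurs."""
    n_events = sample_poisson(rng, lam)
    return sum(rng.gammavariate(shape, scale) for _ in range(n_events))

rng = random.Random(0)               # fixed seed so the run is repeatable
samples = [sample_compound_poisson_gamma(rng, lam=1.0, shape=2.0, scale=50.0)
           for _ in range(1000)]
zero_share = sum(1 for s in samples if s == 0) / len(samples)
print(zero_share, max(samples))
```

The simulated values show the pattern described above: a sizeable share of exact zeros (close to \(e^{-\lambda}\) in expectation) together with positive, right-skewed amounts.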

Use the Tweedie Regression with Log Link method by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Linear Models

And then choosing “Tweedie with Log Link” as the Type.

Input

All variables must be specified in the datasheet. The dependent variable must be a non-negative continuous numeric value, which may include many zeros alongside positive, right-skewed continuous values. This reflects the typical structure of semi-continuous data for which Tweedie Regression is appropriate. The dependent variable should not contain negative numbers or missing entries. Independent variables may be either numerical or categorical. Categorical variables can be represented using text labels or numerical codes. The input datasheet must include at least two columns: one for the dependent variable and one or more for the independent variables. Each row should represent a single observation.

Configuration
Confidence Level (%) Specify the confidence level of the analysis. Values should range from 0 to 100 and correspond to percentages.
Max Iterations Defines the maximum number of iterations the model is allowed to perform during the estimation process. If the model fails to converge before reaching this number, the algorithm stops and returns the values of the last iteration.
Maximum Step-Halving Controls how many times the algorithm is allowed to halve the step size during parameter updates when an iteration leads to worse model fit.
Dependent Variable Select the column that corresponds to values of the dependent variable.
Parameter Estimation Choose how the model parameters are estimated: Newton-Raphson, Fisher Scoring or Hybrid.
Maximum Scoring Iterations Used only when the Hybrid estimation method is selected (not with Newton-Raphson or Fisher Scoring alone). Specifies the maximum number of iterations to be performed during the scoring phase.
Minimum Change in Parameter Estimates Sets the tolerance level for convergence — the smallest change in parameter estimates between iterations required to continue optimization. If the change in all parameters is below this value, the algorithm assumes convergence has been reached.
Scale Parameter Method Determines how the scale (error variance) parameter is estimated. Options include: Fixed value, Deviance, or Pearson Chi-square.
Value Specifies the scale parameter value manually, only when Fixed value is selected in the Scale Parameter Method.
Factors/Covariates/Excluded Columns Select manually the columns that correspond to factors and the columns that correspond to covariates through the dialog window: Use the buttons to move columns between the Factors and Covariates list and Excluded Columns list. Single-arrow buttons will move all selected columns and double-arrow buttons will move all columns. At least one covariate or factor column should be specified.
Custom/Include All Main Effects/Full Factorial These options refer to the terms that will be included in the model. The Custom option allows the user to input a formula defining the exact terms to be included. The Include All Main Effects option allows the analysis of a model that only includes all main effects and finally, the Full Factorial option includes both all main effects and all possible interaction terms to build a full model. Note that the Include All Main Effects and Full Factorial options do not allow the use of a formula.
Formula Specify the model formula used for the analysis if the Custom option is selected. Include all variables listed under Factors or Covariates, separated by “+”. To include interaction terms, use the format VariableA:VariableB. If interaction terms are included, the dataset must contain all combinations of the levels of the involved categorical variables — i.e., the design must be fully crossed — to ensure the model can be properly estimated.
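The interplay of Max Iterations, Maximum Step-Halving and Minimum Change in Parameter Estimates can be sketched in a few lines of Python. This is a simplified illustration, not the software's implementation: it fits an intercept-only model with a log link and uses a Poisson-type log-likelihood only because it has a simple closed form. The function name `fit_intercept` and its arguments are assumptions made for this example.

```python
import math


def fit_intercept(y, max_iter=100, max_halving=5, min_change=1e-8):
    """Newton iterations with step-halving for mu = exp(beta)."""
    n = len(y)
    total = sum(y)

    def loglik(b):
        # Poisson log-likelihood up to an additive constant.
        return total * b - n * math.exp(b)

    beta = 0.0
    halvings_used = 0
    for iteration in range(1, max_iter + 1):
        mu = math.exp(beta)
        score = total - n * mu      # first derivative of loglik
        information = n * mu        # negative second derivative
        step = score / information
        candidate = beta + step

        # Step-halving: shrink the step while the fit gets worse.
        halvings = 0
        while loglik(candidate) < loglik(beta) and halvings < max_halving:
            step /= 2.0
            candidate = beta + step
            halvings += 1
        halvings_used += halvings

        change = abs(candidate - beta)
        beta = candidate
        # Convergence: the parameter change fell below the tolerance.
        if change < min_change:
            return beta, iteration, halvings_used, True

    # Max Iterations reached: return the values of the last iteration.
    return beta, max_iter, halvings_used, False


beta, iterations, halvings_used, converged = fit_intercept([2, 0, 7, 5, 11])
print(beta, iterations, halvings_used, converged)
```

For this model the estimate should agree with the logarithm of the sample mean, which gives a quick sanity check on the iteration.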
Output

The output of the Tweedie regression with log link procedure is organized into three main sections: the Predicted Values Table, the Goodness of Fit Statistics, and the Parameter Estimates Table.

  1. The Predicted Values Table contains the actual values of the dependent variable and the corresponding predicted values generated by the model for each observation.
  2. The Goodness of Fit Table includes statistical measures that assess how well the model fits the data, such as Deviance, Log-Likelihood, AIC, BIC, and related metrics.
  3. The Parameter Estimates Table displays the estimated coefficients for each variable in the model, along with standard errors, confidence intervals, test statistics, degrees of freedom, and p-values.
Example
Input

The input datasheet must include one non-negative continuous dependent variable, which will serve as the target, and at least one column with a continuous or categorical independent variable.

tweedielogGLM-input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Linear Models
  2. Set the Type [1] of regression to Tweedie with Log Link.
  3. Specify the Confidence Level (%) [2] for the test.
  4. Specify Max Iterations [3].
  5. Specify the Maximum Step-Halving [4].
  6. Select the Dependent Variable [5].
  7. Select the Parameter Estimation Method [6].
  8. Specify the Maximum Scoring Iterations [7] if the Hybrid option is selected as the Parameter Estimation Method.
  9. Specify the Minimum Change in Parameter Estimates [8].
  10. Select the Scale Parameter Method [9].
  11. Specify the Value [10] of the scale parameter method if the Fixed value option was chosen as the scale parameter method.
  12. Select the columns by clicking on the arrow buttons [14] and moving columns between the Excluded Columns [11] and Factors [12] and Covariates [13] lists.
  13. Select your preferred option to define the model you want to analyze [15].
  14. If the Custom option is selected, specify the Formula [16] for the analysis.
  15. Click on the Execute button [17] to perform the Tweedie Regression with Log Link method.
tweedielogGLM-config
Output

The predictions, Goodness of Fit table and Parameter Estimates table are shown in the output spreadsheet.

tweedielogGLM-output

Generalized Estimating Equations

Generalized Estimating Equations (GEEs) extend Generalized Linear Models (GLMs) to handle correlated or clustered response data, such as repeated measures or longitudinal observations. GEEs are particularly useful when responses are not independent, as they allow for within-subject correlation, making them ideal for analyzing data collected over time or across related units. A GEE model consists of the following key components:

  1. Mean Structure: As in GLMs, the expected value of the response variable is modeled through a linear predictor:
$$ \begin{equation} \eta_i = x_i^T\beta \end{equation} $$

where \(x_i\) is the vector of predictors and \(\beta\) is the vector of coefficients.

  2. Link Function: A monotonic function \(g(\cdot)\) that links the mean of the response \(\mu_i= E[y_i]\) to the linear predictor, so that \(g(\mu_i)=\eta_i\).
  3. Working Correlation Matrix: Specifies the assumed correlation structure among repeated observations within the same subject. Common structures include:
    1. Independent: assumes no within-subject correlation.
    2. Exchangeable: assumes equal correlation between all pairs of observations.
    3. AR(1): assumes correlations decrease with time distance.
    4. Unstructured: allows a distinct correlation for each pair of observations, with no constraints.
    5. M-dependent: assumes nonzero correlations only up to lag M, and zero correlations beyond that distance.
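The structures listed above can be written out as small matrices. The sketch below is illustrative only: the function names and the correlation values are assumptions, and in practice the correlations are estimated from the data rather than supplied by hand. The Unstructured case is omitted because it has no fixed pattern.

```python
def independent(n):
    """Identity matrix: no within-subject correlation."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def exchangeable(n, alpha):
    """The same correlation alpha between every pair of observations."""
    return [[1.0 if i == j else alpha for j in range(n)] for i in range(n)]


def ar1(n, alpha):
    """Correlation alpha**lag, decaying with the distance in time."""
    return [[alpha ** abs(i - j) for j in range(n)] for i in range(n)]


def m_dependent(n, alphas):
    """One correlation per lag up to M = len(alphas), zero beyond lag M."""
    m = len(alphas)

    def corr(i, j):
        lag = abs(i - j)
        if lag == 0:
            return 1.0
        return alphas[lag - 1] if lag <= m else 0.0

    return [[corr(i, j) for j in range(n)] for i in range(n)]


# Four repeated measurements per subject, with assumed correlation values.
for row in ar1(4, 0.5):
    print(row)
for row in m_dependent(4, [0.6, 0.3]):
    print(row)
```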

GEE estimation relies on solving quasi-likelihood equations rather than maximizing a full likelihood function, making it a semi-parametric approach. Parameters are estimated using iterative methods such as iteratively reweighted least squares (IRLS), and standard errors are typically computed using robust estimators to ensure valid inference even if the correlation structure is misspecified. In GEE models, a subject variable is required to identify the unit over which repeated measurements occur (e.g., patient ID). Optionally, a within-subject variable (e.g., time or visit number) can be specified to define the ordering of repeated measurements. Cases are usually sorted by subject and within-subject variables to ensure correct modeling of correlation patterns. The robust covariance estimator is the default choice, providing consistent standard errors under mild assumptions. The model-based estimator may be more efficient if the correlation structure is correctly specified, but it is more sensitive to misspecification. Categorical variables are encoded using dummy coding, as in GLMs. The reference category is omitted by default to avoid multicollinearity and serve as a baseline for interpreting effects.
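The difference between the model-based and robust covariance estimators can be illustrated with a deliberately small case. The sketch below assumes an identity link, an Independent working correlation and a single covariate with no intercept, so that every quantity is a scalar. The function name and the data are hypothetical, and the software's own computations are more general.

```python
from collections import defaultdict


def gee_independence(subjects, x, y):
    """Fit y = beta * x and return (beta, model_var, robust_var)."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    beta = sxy / sxx  # solves the estimating equation sum x * (y - beta*x) = 0

    residuals = [yi - beta * xi for xi, yi in zip(x, y)]

    # Model-based variance: trusts the working (independence) assumption.
    scale = sum(r * r for r in residuals) / (len(y) - 1)
    model_var = scale / sxx

    # Robust (sandwich) variance: sums score contributions per subject,
    # so within-subject correlation in the residuals is accounted for.
    cluster_scores = defaultdict(float)
    for subject, xi, r in zip(subjects, x, residuals):
        cluster_scores[subject] += xi * r
    meat = sum(u * u for u in cluster_scores.values())
    robust_var = meat / sxx ** 2

    return beta, model_var, robust_var


# Three subjects with three visits each; every subject has its own offset,
# so residuals within a subject are strongly correlated.
subjects = [1, 1, 1, 2, 2, 2, 3, 3, 3]
x = [1, 2, 3, 1, 2, 3, 1, 2, 3]
y = [3, 5, 7, 2, 4, 6, 1, 3, 5]

beta, model_var, robust_var = gee_independence(subjects, x, y)
print(beta, model_var, robust_var)
```

In this example the robust variance is larger than the model-based one, because the independence assumption ignores the correlation that the subject offsets induce.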

Each variant of this method is specified by the distribution of the response variable and the link function used. In the table below we specify the distribution and link function used for each variant:

GEE Variant | Distribution | Link Function
Linear | Normal | Identity
Negative Binomial | Negative Binomial | Log
Poisson | Poisson | Log
Gamma | Gamma | Log
Tweedie with Identity Link | Tweedie | Identity
Tweedie with Log Link | Tweedie | Log

Linear

Use the Linear Regression method by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Estimating Equations

And then choosing “Linear” as the Type.

Input

All variables must be specified in the datasheet. The dependent variable must be numerical, as Linear Regression within the GEE framework models continuous outcomes. Covariates should also be numerical, while categorical predictors (factors) can be represented using either text labels or numeric codes. The input datasheet must include at least two columns: one for the dependent variable and one or more for the independent variables (covariates or factors). Each row should correspond to a single observation. Additionally, a subject identifier column is required to define clusters of repeated measurements, and optionally a within-subject variable (e.g., time) to specify the order of observations within each subject.

Configuration
Confidence Level (%) Specify the confidence level of the analysis. Values should range from 0 to 100 and correspond to percentages.
Max Iterations Defines the maximum number of iterations the model is allowed to perform during the estimation process. If the model fails to converge before reaching this number, the algorithm stops and returns the values of the last iteration.
Iterations Between Updates Defines how often the working correlation matrix is updated during estimation.
Dependent Variable Select the column that corresponds to values of the dependent variable.
Minimum Change in Parameter Estimates Sets the tolerance level for convergence — the smallest change in parameter estimates between iterations required to continue optimization. If the change in all parameters is below this value, the algorithm assumes convergence has been reached.
Hessian Convergence Specifies a threshold for convergence based on the second derivatives (Hessian matrix).
Scale Parameter Adjusts the variance of the model.
Within Subject Selects the column that defines the ordering of repeated observations within each subject.
Structure Defines the working correlation structure for repeated measurements within subjects. Choices include Independent, Exchangeable, AR(1), Unstructured and M-Dependent.
M In the M-Dependent correlation structure, M specifies the maximum number of consecutive observations within each cluster that are assumed to be correlated, with correlations set to zero for observations more than M time points apart. M must be an integer in the range from 1 to (number of within-subject observations – 1).
Covariance Matrix In the Covariance Matrix option, the user selects how the model’s standard errors will be estimated. The Robust estimator is resistant to misspecification of the correlation structure, while the Model-based estimator assumes the specified correlation is correct.
Subjects/Factors/Covariates/Excluded Columns Select manually the columns that correspond to subjects, factors and the columns that correspond to covariates through the dialog window: Use the buttons to move columns between the Subjects, Factors and Covariates list and Excluded Columns list. Single-arrow buttons will move all selected columns. At least one covariate or factor column should be specified. Also, a column for the Subject is required.
Custom/Include All Main Effects/Full Factorial These options refer to the terms that will be included in the model. The Custom option allows the user to input a formula defining the exact terms to be included. The Include All Main Effects option allows the analysis of a model that only includes all main effects and finally, the Full Factorial option includes both all main effects and all possible interaction terms to build a full model. Note that the Include All Main Effects and Full Factorial options do not allow the use of a formula.
Formula Specify the model formula used for the analysis if the Custom option is selected. Include all variables listed under Factors or Covariates, separated by “+”. To include interaction terms, use the format VariableA:VariableB. If interaction terms are included, the dataset must contain all combinations of the levels of the involved categorical variables — i.e., the design must be fully crossed — to ensure the model can be properly estimated.
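Before adding an interaction term such as VariableA:VariableB, it can help to verify that the design is fully crossed. The following Python sketch lists the level combinations that never occur in the data. The function name, the column names and the rows are hypothetical and serve only as an illustration.

```python
from itertools import product


def missing_combinations(rows, factor_a, factor_b):
    """Return the level pairs of two factors that do not occur in rows."""
    levels_a = sorted({row[factor_a] for row in rows})
    levels_b = sorted({row[factor_b] for row in rows})
    observed = {(row[factor_a], row[factor_b]) for row in rows}
    return [combo for combo in product(levels_a, levels_b)
            if combo not in observed]


# Hypothetical data: treatment "B" was never observed together with sex "M".
rows = [
    {"treatment": "A", "sex": "F"},
    {"treatment": "A", "sex": "M"},
    {"treatment": "B", "sex": "F"},
    {"treatment": "A", "sex": "F"},
]

missing = missing_combinations(rows, "treatment", "sex")
print(missing)
```

An empty list means that every combination is present, so the interaction term can be estimated.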
Output

The output of the Generalized Estimating Equations (GEEs) procedure is organized into three main sections: the Predicted Values Table, the Parameter Estimates Table and the Working Correlation Matrix.

  1. The Predicted Values Table shows the fitted values for the dependent variable on the specified model.
  2. The Parameter Estimates Table includes the estimated coefficients for the independent variables. Each row corresponds to a predictor and includes its coefficient, standard error, confidence interval, test statistic, degrees of freedom, and p-value.
  3. The Working Correlation Matrix displays the estimated correlations between repeated measurements of the same subject over time.
Example
Input

The dataset must include a continuous target variable suitable for modeling with linear regression. It should also contain at least one independent variable, which can be continuous or categorical. In addition, the dataset must include a subject identifier to group repeated observations from the same individual or unit, as well as a within-subject time or measurement indicator to reflect the longitudinal or clustered nature of the data.

linearGEE-input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Estimating Equations.
  2. Set the Type [1] of regression to Linear.
  3. Specify the Confidence Level (%) [2] for the test.
  4. Specify Max Iterations [3].
  5. Specify Iterations Between Updates [4].
  6. Select the Dependent Variable [5].
  7. Specify the Minimum Change in Parameter Estimates [6].
  8. Specify the Hessian Convergence [7].
  9. Specify the Scale Parameter [8].
  10. Select the Within Subject [9].
  11. Select the working correlation Structure [10] and specify the value of M if you select M-Dependent as the correlation structure.
  12. Select the method used to estimate the covariance matrix [11] of the standard errors.
  13. Select the columns by clicking on the arrow buttons [16] and moving columns between the Excluded Columns [12] and Subjects [13] and Factors [14] and Covariates [15] lists.
  14. Select your preferred option to define the model you want to analyze [17].
  15. If the Custom option is selected, specify the Formula [18] for the analysis.
  16. Click on the Execute button [19] to perform the Linear Regression method.
linearGEE-config
Output

The Predictions, Parameter Estimates table and Working Correlation Matrix are shown in the output spreadsheet.

linearGEE-output

Negative Binomial

Use the Negative Binomial Regression method by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Estimating Equations

And then choosing “Negative Binomial” as the Type.

Input

All variables must be specified in the datasheet. The dependent variable must be a count variable, consisting of non-negative integers, and suitable for modeling with a Negative Binomial distribution. Negative Binomial Regression within the GEE framework is appropriate for overdispersed count data where the variance exceeds the mean, such as the number of events, visits, or occurrences over time. Covariates should be numerical, while categorical predictors can be represented using either text labels or numeric codes. The input datasheet must include at least two columns: one for the dependent variable and one or more for the independent variables (covariates or factors). Each row should correspond to a single observation. Additionally, a subject identifier column is required to define clusters of repeated measurements, and optionally a within-subject variable (e.g., time or measurement occasion) to specify the order of observations within each subject. The structure of the dataset should support modeling intra-subject correlation under the Negative Binomial distribution assumption using an appropriate link function (e.g., log link) within the GEE framework.
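A quick way to judge whether a count variable is overdispersed is to compare its sample variance with its sample mean. The sketch below is a rough descriptive check under simple assumptions: it ignores covariates and clustering, and the function name and data are hypothetical.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean.

    A ratio near 1 is consistent with a Poisson-like variable, while a
    ratio clearly above 1 suggests overdispersion.
    """
    return variance(counts) / mean(counts)


# Hypothetical visit counts with many zeros and a few large values.
counts = [0, 1, 0, 12, 3, 0, 7, 1, 0, 15]
ratio = dispersion_ratio(counts)
print(ratio)
```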

Configuration
Confidence Level (%) Specify the confidence level of the analysis. Values should range from 0 to 100 and correspond to percentages.
Max Iterations Defines the maximum number of iterations the model is allowed to perform during the estimation process. If the model fails to converge before reaching this number, the algorithm stops and returns the values of the last iteration.
Iterations Between Updates Defines how often the working correlation matrix is updated during estimation.
Dependent Variable Select the column that corresponds to values of the dependent variable.
Minimum Change in Parameter Estimates Sets the tolerance level for convergence — the smallest change in parameter estimates between iterations required to continue optimization. If the change in all parameters is below this value, the algorithm assumes convergence has been reached.
Hessian Convergence Specifies a threshold for convergence based on the second derivatives (Hessian matrix).
Scale Parameter Adjusts the variance of the model.
Within Subject Selects the column that defines the ordering of repeated observations within each subject.
Structure Defines the working correlation structure for repeated measurements within subjects. Choices include Independent, Exchangeable, AR(1), Unstructured and M-Dependent.
M In the M-Dependent correlation structure, M specifies the maximum number of consecutive observations within each cluster that are assumed to be correlated, with correlations set to zero for observations more than M time points apart. M must be an integer in the range from 1 to (number of within-subject observations – 1).
Covariance Matrix In the Covariance Matrix option, the user selects how the model’s standard errors will be estimated. The Robust estimator is resistant to misspecification of the correlation structure, while the Model-based estimator assumes the specified correlation is correct.
Subjects/Factors/Covariates/Excluded Columns Select manually the columns that correspond to subjects, factors and the columns that correspond to covariates through the dialog window: Use the buttons to move columns between the Subjects, Factors and Covariates list and Excluded Columns list. Single-arrow buttons will move all selected columns. At least one covariate or factor column should be specified. Also, a column for the Subject is required.
Custom/Include All Main Effects/Full Factorial These options refer to the terms that will be included in the model. The Custom option allows the user to input a formula defining the exact terms to be included. The Include All Main Effects option allows the analysis of a model that only includes all main effects and finally, the Full Factorial option includes both all main effects and all possible interaction terms to build a full model. Note that the Include All Main Effects and Full Factorial options do not allow the use of a formula.
Formula Specify the model formula used for the analysis if the Custom option is selected. Include all variables listed under Factors or Covariates, separated by “+”. To include interaction terms, use the format VariableA:VariableB. If interaction terms are included, the dataset must contain all combinations of the levels of the involved categorical variables — i.e., the design must be fully crossed — to ensure the model can be properly estimated.
Output

The output of the Generalized Estimating Equations (GEEs) procedure is organized into three main sections: the Predicted Values Table, the Parameter Estimates Table and the Working Correlation Matrix.

  1. The Predicted Values Table shows the fitted values for the dependent variable on the specified model.
  2. The Parameter Estimates Table includes the estimated coefficients for the independent variables. Each row corresponds to a predictor and includes its coefficient, standard error, confidence interval, test statistic, degrees of freedom, and p-value.
  3. The Working Correlation Matrix displays the estimated correlations between repeated measurements of the same subject over time.
Example
Input

The dataset must include a count target variable suitable for modeling with negative binomial regression. The dependent variable should consist of non-negative integer values and must exhibit overdispersion, which is a key condition for using the negative binomial distribution. It should also contain at least one independent variable, which can be continuous or categorical. In addition, the dataset must include a subject identifier to group repeated observations from the same individual or unit, as well as a within-subject time or measurement indicator to reflect the longitudinal or clustered nature of the data. The structure of the dataset should allow for the modeling of within-cluster correlation using a log link function under the negative binomial distribution assumption in the GEE framework.

negbinGEE-input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Estimating Equations.
  2. Set the Type [1] of regression to Negative Binomial.
  3. Specify the Confidence Level (%) [2] for the test.
  4. Specify Max Iterations [3].
  5. Specify Iterations Between Updates [4].
  6. Select the Dependent Variable [5].
  7. Specify the Minimum Change in Parameter Estimates [6].
  8. Specify the Hessian Convergence [7].
  9. Specify the Scale Parameter [8].
  10. Select the Within Subject [9].
  11. Select the working correlation Structure [10] and specify the value of M if you select M-Dependent as the correlation structure.
  12. Select the method used to estimate the covariance matrix [11] of the standard errors.
  13. Select the columns by clicking on the arrow buttons [16] and moving columns between the Excluded Columns [12] and Subjects [13] and Factors [14] and Covariates [15] lists.
  14. Select your preferred option to define the model you want to analyze [17].
  15. If the Custom option is selected, specify the Formula [18] for the analysis.
  16. Click on the Execute button [19] to perform the Negative Binomial Regression method.
negbinGEE-config
Output

The Predictions, Parameter Estimates table and Working Correlation Matrix are shown in the output spreadsheet.

negbinGEE-output

Poisson

Use the Poisson Regression method by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Estimating Equations

And then choosing “Poisson” as the Type.

Input

All variables must be specified in the datasheet. The dependent variable must be a count variable, consisting of non-negative integer values and suitable for modeling with a Poisson distribution, as Poisson Regression within the GEE framework is used for modeling count data such as number of events, visits, or occurrences over time. Covariates should be numerical, while categorical predictors can be represented using either text labels or numeric codes. The input datasheet must include at least two columns: one for the dependent variable and one or more for the independent variables (covariates or factors). Each row should correspond to a single observation. Additionally, a subject identifier column is required to define clusters of repeated measurements, and optionally a within-subject variable to specify the order of observations within each subject. The structure of the dataset should support modeling intra-subject correlation under the Poisson distribution assumption using an appropriate link function (e.g., log link) within the GEE framework.

Configuration
Confidence Level (%) Specify the confidence level of the analysis. Values should range from 0 to 100 and correspond to percentages.
Max Iterations Defines the maximum number of iterations the model is allowed to perform during the estimation process. If the model fails to converge before reaching this number, the algorithm stops and returns the values of the last iteration.
Iterations Between Updates Defines how often the working correlation matrix is updated during estimation.
Dependent Variable Select the column that corresponds to values of the dependent variable.
Minimum Change in Parameter Estimates Sets the tolerance level for convergence — the smallest change in parameter estimates between iterations required to continue optimization. If the change in all parameters is below this value, the algorithm assumes convergence has been reached.
Hessian Convergence Specifies a threshold for convergence based on the second derivatives (Hessian matrix).
Scale Parameter Adjusts the variance of the model.
Within Subject Selects the column that defines the ordering of repeated observations within each subject.
Structure Defines the working correlation structure for repeated measurements within subjects. Choices include Independent, Exchangeable, AR(1), Unstructured and M-Dependent.
M In the M-Dependent correlation structure, M specifies the maximum number of consecutive observations within each cluster that are assumed to be correlated, with correlations set to zero for observations more than M time points apart. M must be an integer in the range from 1 to (number of within-subject observations – 1).
Covariance Matrix In the Covariance Matrix option, the user selects how the model’s standard errors will be estimated. The Robust estimator is resistant to misspecification of the correlation structure, while the Model-based estimator assumes the specified correlation is correct.
Subjects/Factors/Covariates/Excluded Columns Select manually the columns that correspond to subjects, factors and the columns that correspond to covariates through the dialog window: Use the buttons to move columns between the Subjects, Factors and Covariates list and Excluded Columns list. Single-arrow buttons will move all selected columns. At least one covariate or factor column should be specified. Also, a column for the Subject is required.
Custom/Include All Main Effects/Full Factorial These options refer to the terms that will be included in the model. The Custom option allows the user to input a formula defining the exact terms to be included. The Include All Main Effects option allows the analysis of a model that only includes all main effects and finally, the Full Factorial option includes both all main effects and all possible interaction terms to build a full model. Note that the Include All Main Effects and Full Factorial options do not allow the use of a formula.
Formula Specify the model formula used for the analysis if the Custom option is selected. Include all variables listed under Factors or Covariates, separated by “+”. To include interaction terms, use the format VariableA:VariableB. If interaction terms are included, the dataset must contain all combinations of the levels of the involved categorical variables — i.e., the design must be fully crossed — to ensure the model can be properly estimated.
Output

The output of the Generalized Estimating Equations (GEEs) procedure is organized into three main sections: the Predicted Values Table, the Parameter Estimates Table and the Working Correlation Matrix.

  1. The Predicted Values Table shows the fitted values for the dependent variable on the specified model.
  2. The Parameter Estimates Table includes the estimated coefficients for the independent variables. Each row corresponds to a predictor and includes its coefficient, standard error, confidence interval, test statistic, degrees of freedom, and p-value.
  3. The Working Correlation Matrix displays the estimated correlations between repeated measurements of the same subject over time.
Example
Input

The dataset must include a count outcome variable with non-negative integers suitable for Poisson regression. It should also include at least one independent variable (continuous or categorical), a subject identifier to group repeated measures, and a within-subject time or measurement variable to indicate observation order. The data structure should support modeling within-cluster correlation using a log link under the Poisson distribution in the GEE framework.

poissonGEE-input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Estimating Equations.
  2. Set the Type [1] of regression to Poisson.
  3. Specify the Confidence Level (%) [2] for the test.
  4. Specify Max Iterations [3].
  5. Specify Iterations Between Updates [4].
  6. Select the Dependent Variable [5].
  7. Specify the Minimum Change in Parameter Estimates [6].
  8. Specify the Hessian Convergence [7].
  9. Specify the Scale Parameter [8].
  10. Select the Within Subject [9].
  11. Select the working correlation Structure [10] and specify the value of M if you select M-Dependent as the correlation structure.
  12. Select the method used to estimate the covariance matrix [11] of the standard errors.
  13. Select the columns by clicking on the arrow buttons [16] and moving columns between the Excluded Columns [12] and Subjects [13] and Factors [14] and Covariates [15] lists.
  14. Select your preferred option to define the model you want to analyze [17].
  15. If the Custom option is selected, specify the Formula [18] for the analysis.
  16. Click on the Execute button [19] to perform the Poisson Regression method.
poissonGEE-config
Output

The Predictions, Parameter Estimates table and Working Correlation Matrix are shown in the output spreadsheet.

poissonGEE-output

Gamma

Use the Gamma Regression method by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Estimating Equations

And then choosing “Gamma” as the Type.

Input

All variables must be specified in the datasheet. The dependent variable must be continuous, strictly positive, and suitable for modeling with a Gamma distribution, as Gamma Regression within the GEE framework is used for positively skewed outcomes such as cost, time, or rates. Covariates should be numerical, while categorical predictors can be represented using either text labels or numeric codes. The input datasheet must include at least two columns: one for the dependent variable and one or more for the independent variables (covariates or factors). Each row should correspond to a single observation. Additionally, a subject identifier column is required to define clusters of repeated measurements, and optionally a within-subject variable to specify the order of observations within each subject. The structure of the dataset should support modeling intra-subject correlation under the Gamma distribution assumption using an appropriate link function (e.g., log link) within the GEE framework.

Configuration
Confidence Level (%) Specify the confidence level of the analysis. Values should range from 0 to 100 and correspond to percentages.
Max Iterations Defines the maximum number of iterations the model is allowed to perform during the estimation process. If the model fails to converge before reaching this number, the algorithm stops and returns the values of the last iteration.
Iterations Between Updates Defines how often the working correlation matrix is updated during estimation.
Dependent Variable Select the column that corresponds to values of the dependent variable.
Minimum Change in Parameter Estimates Sets the tolerance level for convergence — the smallest change in parameter estimates between iterations required to continue optimization. If the change in all parameters is below this value, the algorithm assumes convergence has been reached.
Hessian Convergence Specifies a threshold for convergence based on the second derivatives (Hessian matrix).
Scale Parameter Adjusts the variance of the model.
Within Subject Selects the column that defines the ordering of repeated observations within each subject.
Structure Defines the working correlation structure for repeated measurements within subjects. Choices include Independent, Exchangeable, AR(1), Unstructured and M-Dependent.
M In the M-Dependent correlation structure, M specifies the maximum number of consecutive observations within each cluster that are assumed to be correlated, with correlations set to zero for observations more than M time points apart. M must be an integer in the range from 1 to (number of within-subject observations – 1).
Covariance Matrix In the Covariance Matrix option, the user selects how the model’s standard errors will be estimated. The Robust estimator is resistant to misspecification of the correlation structure, while the Model-based estimator assumes the specified correlation is correct.
Subjects/Factors/Covariates/Excluded Columns Select manually the columns that correspond to subjects, factors and the columns that correspond to covariates through the dialog window: Use the buttons to move columns between the Subjects, Factors and Covariates list and Excluded Columns list. Single-arrow buttons will move all selected columns. At least one covariate or factor column should be specified. Also, a column for the Subject is required.
Custom/Include All Main Effects/Full Factorial These options refer to the terms that will be included in the model. The Custom option allows the user to input a formula defining the exact terms to be included. The Include All Main Effects option allows the analysis of a model that only includes all main effects and finally, the Full Factorial option includes both all main effects and all possible interaction terms to build a full model. Note that the Include All Main Effects and Full Factorial options do not allow the use of a formula.
Formula Specify the model formula used for the analysis if the Custom option is selected. Include all variables listed under Factors or Covariates, separated by “+”. To include interaction terms, use the format VariableA:VariableB. If interaction terms are included, the dataset must contain all combinations of the levels of the involved categorical variables — i.e., the design must be fully crossed — to ensure the model can be properly estimated.
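To make the Structure and M options more concrete, the sketch below builds the working correlation matrix for four of the listed structures. It is an illustration under assumptions, not the platform's implementation: the function name and the `alpha` argument are hypothetical, the correlation values are made up, and the Unstructured option (one free correlation per pair of time points) is left out.

```python
def working_correlation(structure, n, alpha=0.0, m=1):
    """Build an n x n working correlation matrix as nested lists.

    alpha is a single correlation for Exchangeable and AR(1); for
    M-Dependent it is a list with one correlation per lag 1..m.
    """
    if structure == "M-Dependent" and not 1 <= m <= n - 1:
        raise ValueError("M must be an integer from 1 to n - 1")

    def entry(i, j):
        lag = abs(i - j)
        if lag == 0:
            return 1.0                      # every observation correlates fully with itself
        if structure == "Independent":
            return 0.0                      # no within-subject correlation
        if structure == "Exchangeable":
            return alpha                    # same correlation for every pair
        if structure == "AR(1)":
            return alpha ** lag             # correlation decays with the lag
        if structure == "M-Dependent":
            return alpha[lag - 1] if lag <= m else 0.0  # zero beyond lag M
        raise ValueError(f"unsupported structure: {structure}")

    return [[entry(i, j) for j in range(n)] for i in range(n)]


ar1 = working_correlation("AR(1)", 4, alpha=0.5)
mdep = working_correlation("M-Dependent", 4, alpha=[0.6, 0.3], m=2)
for row in mdep:
    print(row)
```

With four within-subject observations and M = 2, observations three time points apart are treated as uncorrelated, which matches the description of M above.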
Output

The output of the Generalized Estimating Equations (GEEs) procedure is organized into three main sections: the Predicted Values Table, the Parameter Estimates Table and the Working Correlation Matrix.

  1. The Predicted Values Table shows the fitted values of the dependent variable under the specified model.
  2. The Parameter Estimates Table includes the estimated coefficients for the independent variables. Each row corresponds to a predictor and includes its coefficient, standard error, confidence interval, test statistic, degrees of freedom, and p-value.
  3. The Working Correlation Matrix displays the estimated correlations between repeated measurements of the same subject over time.
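The columns of the Parameter Estimates Table are related to one another. As a rough sketch, assuming Wald-type inference based on the normal distribution (the exact method used by the platform is not stated here), the confidence interval, test statistic and p-value can be derived from a coefficient and its standard error. The function name and the numeric values are hypothetical.

```python
from statistics import NormalDist


def wald_summary(estimate, std_error, confidence_level=95.0):
    """Wald confidence interval, chi-square statistic (1 df) and p-value."""
    if not 0.0 < confidence_level < 100.0:
        raise ValueError("confidence level must lie strictly between 0 and 100")
    normal = NormalDist()
    # Two-sided critical value, e.g. about 1.96 for a 95% confidence level.
    z_crit = normal.inv_cdf(0.5 + confidence_level / 200.0)
    z = estimate / std_error
    return {
        "lower": estimate - z_crit * std_error,
        "upper": estimate + z_crit * std_error,
        "wald_chi_square": z * z,
        "df": 1,
        "p_value": 2.0 * (1.0 - normal.cdf(abs(z))),
    }


# Hypothetical coefficient and standard error, not taken from any real output.
summary = wald_summary(estimate=0.40, std_error=0.10)
print(summary)
```

A higher Confidence Level (%) widens the interval, while the test statistic and p-value depend only on the coefficient and its standard error.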
Example
Input

The dataset must include a positive continuous target variable suitable for modeling with Gamma regression. The outcome should be strictly positive and continuous, reflecting a right-skewed distribution typical of Gamma-distributed data. It should also contain at least one independent variable, which can be continuous or categorical. In addition, the dataset must include a subject identifier to group repeated observations from the same individual or unit, as well as a within-subject time or measurement indicator to reflect the longitudinal or clustered nature of the data. The structure of the dataset should allow for the modeling of within-cluster correlation using a log link function in the GEE framework.

gammaGEE-input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Estimating Equations.
  2. Set the Type [1] of regression to Gamma.
  3. Specify the Confidence Level (%) [2] for the test.
  4. Specify Max Iterations [3].
  5. Specify Iterations Between Updates [4].
  6. Select the Dependent Variable [5].
  7. Specify the Minimum Change in Parameter Estimates [6].
  8. Specify the Hessian Convergence [7].
  9. Specify the Scale Parameter [8].
  10. Select the Within Subject [9].
  11. Select the working correlation Structure [10] and, if you select M-Dependent as the correlation structure, specify the value of M.
  12. Select the method used to estimate the covariance matrix [11] of the standard errors.
  13. Select the columns by clicking on the arrow buttons [16] and moving columns between the Excluded Columns [12] and Subjects [13] and Factors [14] and Covariates [15] lists.
  14. Select your preferred option to define the model you want to analyze [17].
  15. If the Custom option is selected, specify the Formula [18] for the analysis.
  16. Click on the Execute button [19] to perform the Gamma Regression method.
gammaGEE-config
Output

The Predictions, Parameter Estimates table and Working Correlation Matrix are shown in the output spreadsheet.

gammaGEE-output

Tweedie with Identity Link

Use the Tweedie Regression with Identity Link method by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Estimating Equations

And then choosing “Tweedie with Identity Link” as the Type.

Input

The dataset must be appropriately structured to support Tweedie regression with an identity link function within the framework of Generalized Estimating Equations (GEE). The dependent variable should be continuous and non-negative, and exhibit characteristics suitable for modeling with the Tweedie distribution, which is particularly well-suited to semicontinuous outcomes that contain a point mass at zero and positive continuous values. Such outcomes commonly arise in fields such as actuarial science, health economics, and environmental studies, in responses like claim amounts, medical costs, or precipitation levels. The dataset must include at least the following components: a dependent variable, one or more independent variables (which may be numerical or categorical, with categorical variables represented as text labels or numeric codes), a subject identifier column to define clusters of repeated observations from the same unit or individual, and optionally, a within-subject measurement indicator to capture the temporal or sequential structure of the observations within each cluster. Each row in the dataset must correspond to a single observation. The data structure must support the estimation of within-cluster correlation using GEE under the Tweedie distribution with an identity link, which enables modeling of repeated or correlated data where the response variable reflects non-negative semicontinuous behavior, and where a direct, untransformed relationship between the predictors and the mean of the response is appropriate.
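In symbols, the model described above can be summarized as follows. This is a sketch in standard notation rather than the platform's own: \(i\) indexes subjects, \(j\) indexes observations within a subject, \(\phi\) is the scale parameter, and \(p\) is the Tweedie power parameter, which is an assumption here because it is not listed among the settings on this page.

\[
\mathrm{E}[Y_{ij}] = \mu_{ij}, \qquad
g(\mu_{ij}) = \mu_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta}, \qquad
\mathrm{Var}(Y_{ij}) = \phi\,\mu_{ij}^{\,p}, \quad 1 < p < 2 .
\]

For \(1 < p < 2\) the Tweedie distribution has a point mass at zero together with a continuous positive part. Because the identity link applies no transformation, each coefficient is an additive change in the expected outcome per unit change in the predictor, and nothing in the link itself keeps the fitted means non-negative.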

Configuration
Confidence Level (%) Specify the confidence level of the analysis. Values should range from 0 to 100 and correspond to percentages.
Max Iterations Defines the maximum number of iterations the model is allowed to perform during the estimation process. If the model fails to converge before reaching this number, the algorithm stops and returns the values of the last iteration.
Iterations Between Updates Defines how often the working correlation matrix is updated during estimation.
Dependent Variable Select the column that corresponds to values of the dependent variable.
Minimum Change in Parameter Estimates Sets the tolerance level for convergence — the smallest change in parameter estimates between iterations required to continue optimization. If the change in all parameters is below this value, the algorithm assumes convergence has been reached.
Hessian Convergence Specifies a threshold for convergence based on the second derivatives (Hessian matrix).
Scale Parameter Adjusts the variance of the model.
Within Subject Selects the column that defines the ordering of repeated observations within each subject.
Structure Defines the working correlation structure for repeated measurements within subjects. Choices include Independent, Exchangeable, AR(1), Unstructured and M-Dependent.
M In the M-Dependent correlation structure, M specifies the maximum number of consecutive observations within each cluster that are assumed to be correlated, with correlations set to zero for observations more than M time points apart. M must be an integer in the range from 1 to (number of within-subject observations – 1).
Covariance Matrix In the Covariance Matrix option, the user selects how the model’s standard errors will be estimated. The Robust estimator is resistant to misspecification of the correlation structure, while the Model-based estimator assumes the specified correlation is correct.
Subjects/Factors/Covariates/Excluded Columns Select manually the columns that correspond to subjects, factors and the columns that correspond to covariates through the dialog window: Use the buttons to move columns between the Subjects, Factors and Covariates list and Excluded Columns list. Single-arrow buttons will move all selected columns. At least one covariate or factor column should be specified. Also, a column for the Subject is required.
Custom/Include All Main Effects/Full Factorial These options refer to the terms that will be included in the model. The Custom option allows the user to input a formula defining the exact terms to be included. The Include All Main Effects option allows the analysis of a model that only includes all main effects and finally, the Full Factorial option includes both all main effects and all possible interaction terms to build a full model. Note that the Include All Main Effects and Full Factorial options do not allow the use of a formula.
Formula Specify the model formula used for the analysis if the Custom option is selected. Include all variables listed under Factors or Covariates, separated by “+”. To include interaction terms, use the format VariableA:VariableB. If interaction terms are included, the dataset must contain all combinations of the levels of the involved categorical variables — i.e., the design must be fully crossed — to ensure the model can be properly estimated.
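The difference between the Include All Main Effects and Full Factorial options, and the fully crossed requirement for interaction terms, can be illustrated with a short sketch. The helper names, the variable names and the use of `A:B:C` for a three-way interaction are assumptions made for illustration; only the `VariableA:VariableB` format is documented above.

```python
from itertools import combinations, product


def model_terms(variables, option):
    """List the terms implied by a model option."""
    if option == "Include All Main Effects":
        return list(variables)
    if option == "Full Factorial":
        terms = []
        # Main effects first, then interactions of increasing order.
        for order in range(1, len(variables) + 1):
            terms += [":".join(combo) for combo in combinations(variables, order)]
        return terms
    raise ValueError(f"unsupported option: {option}")


def missing_cells(rows, factors):
    """Return level combinations of the factors that never occur in rows."""
    levels = [sorted({row[f] for row in rows}) for f in factors]
    observed = {tuple(row[f] for f in factors) for row in rows}
    return [cell for cell in product(*levels) if cell not in observed]


terms = model_terms(["region", "plan", "age"], "Full Factorial")
formula = " + ".join(terms)
print(formula)

# Hypothetical data in which the combination (S, plus) was never observed.
rows = [
    {"region": "N", "plan": "basic"},
    {"region": "N", "plan": "plus"},
    {"region": "S", "plan": "basic"},
]
print(missing_cells(rows, ["region", "plan"]))
```

An empty result from `missing_cells` indicates that the design is fully crossed for the listed factors, so an interaction between them can be estimated.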
Output

The output of the Generalized Estimating Equations (GEEs) procedure is organized into three main sections: the Predicted Values Table, the Parameter Estimates Table and the Working Correlation Matrix.

  1. The Predicted Values Table shows the fitted values of the dependent variable under the specified model.
  2. The Parameter Estimates Table includes the estimated coefficients for the independent variables. Each row corresponds to a predictor and includes its coefficient, standard error, confidence interval, test statistic, degrees of freedom, and p-value.
  3. The Working Correlation Matrix displays the estimated correlations between repeated measurements of the same subject over time.
Example
Input

The dataset must include a continuous, non-negative outcome variable suitable for Tweedie regression with an identity link function. The dependent variable should reflect semicontinuous data with a mass at zero and positive continuous values, as commonly observed in applications such as insurance claims, healthcare expenditures, or environmental measures. The dataset should also include at least one independent variable, which may be continuous or categorical, a subject identifier to define clusters of repeated measurements from the same individual or unit, and a within-subject time or measurement variable to indicate the order of observations within each subject. The data structure should support the modeling of within-cluster correlation using an identity link under the Tweedie distribution in the GEE framework, allowing for a direct linear relationship between the covariates and the expected value of the outcome.

tweedieidentityGEE-input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Estimating Equations.
  2. Set the Type [1] of regression to Tweedie with Identity Link.
  3. Specify the Confidence Level (%) [2] for the test.
  4. Specify Max Iterations [3].
  5. Specify Iterations Between Updates [4].
  6. Select the Dependent Variable [5].
  7. Specify the Minimum Change in Parameter Estimates [6].
  8. Specify the Hessian Convergence [7].
  9. Specify the Scale Parameter [8].
  10. Select the Within Subject [9].
  11. Select the working correlation Structure [10] and, if you select M-Dependent as the correlation structure, specify the value of M.
  12. Select the method used to estimate the covariance matrix [11] of the standard errors.
  13. Select the columns by clicking on the arrow buttons [16] and moving columns between the Excluded Columns [12] and Subjects [13] and Factors [14] and Covariates [15] lists.
  14. Select your preferred option to define the model you want to analyze [17].
  15. If the Custom option is selected, specify the Formula [18] for the analysis.
  16. Click on the Execute button [19] to perform the Tweedie Regression with Identity Link method.
tweedieidentityGEE-config
Output

The Predictions, Parameter Estimates table and Working Correlation Matrix are shown in the output spreadsheet.

tweedieidentityGEE-output

Tweedie with Log Link

Use the Tweedie Regression with Log Link method by browsing in the top ribbon:

Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Estimating Equations

And then choosing “Tweedie with Log Link” as the Type.

Input

All variables must be specified in the datasheet. The dataset must be appropriately structured to support the application of Tweedie regression with a log link function within the Generalized Estimating Equations (GEE) framework. The dependent variable should be continuous, non-negative, and exhibit characteristics consistent with the Tweedie distribution, which is particularly suitable for semicontinuous outcomes that contain a point mass at zero alongside positive continuous values. Common examples of such data include insurance claim amounts, healthcare expenditures, and rainfall measurements. The dataset must include, at minimum, the following columns: a dependent variable, one or more independent variables, which may be numeric or categorical (categorical predictors can be represented either by text labels or numeric codes), and a subject identifier that defines clusters of repeated observations. Optionally, it may include a within-subject measurement indicator representing the order or timing of observations within each cluster. Each row of the dataset should correspond to a single observation. The overall structure must support the modeling of within-cluster correlation under the Tweedie distribution assumption with a log link, enabling the analysis of longitudinal or clustered data where the response variable combines features of both count and continuous distributions, such as the zero inflation and positive skewness commonly observed in real-world outcomes.

Configuration
Confidence Level (%) Specify the confidence level of the analysis. Values should range from 0 to 100 and correspond to percentages.
Max Iterations Defines the maximum number of iterations the model is allowed to perform during the estimation process. If the model fails to converge before reaching this number, the algorithm stops and returns the values of the last iteration.
Iterations Between Updates Defines how often the working correlation matrix is updated during estimation.
Dependent Variable Select the column that corresponds to values of the dependent variable.
Minimum Change in Parameter Estimates Sets the tolerance level for convergence — the smallest change in parameter estimates between iterations required to continue optimization. If the change in all parameters is below this value, the algorithm assumes convergence has been reached.
Hessian Convergence Specifies a threshold for convergence based on the second derivatives (Hessian matrix).
Scale Parameter Adjusts the variance of the model.
Within Subject Selects the column that defines the ordering of repeated observations within each subject.
Structure Defines the working correlation structure for repeated measurements within subjects. Choices include Independent, Exchangeable, AR(1), Unstructured and M-Dependent.
M In the M-Dependent correlation structure, M specifies the maximum number of consecutive observations within each cluster that are assumed to be correlated, with correlations set to zero for observations more than M time points apart. M must be an integer in the range from 1 to (number of within-subject observations – 1).
Covariance Matrix In the Covariance Matrix option, the user selects how the model’s standard errors will be estimated. The Robust estimator is resistant to misspecification of the correlation structure, while the Model-based estimator assumes the specified correlation is correct.
Subjects/Factors/Covariates/Excluded Columns Select manually the columns that correspond to subjects, factors and the columns that correspond to covariates through the dialog window: Use the buttons to move columns between the Subjects, Factors and Covariates list and Excluded Columns list. Single-arrow buttons will move all selected columns. At least one covariate or factor column should be specified. Also, a column for the Subject is required.
Custom/Include All Main Effects/Full Factorial These options refer to the terms that will be included in the model. The Custom option allows the user to input a formula defining the exact terms to be included. The Include All Main Effects option allows the analysis of a model that only includes all main effects and finally, the Full Factorial option includes both all main effects and all possible interaction terms to build a full model. Note that the Include All Main Effects and Full Factorial options do not allow the use of a formula.
Formula Specify the model formula used for the analysis if the Custom option is selected. Include all variables listed under Factors or Covariates, separated by “+”. To include interaction terms, use the format VariableA:VariableB. If interaction terms are included, the dataset must contain all combinations of the levels of the involved categorical variables — i.e., the design must be fully crossed — to ensure the model can be properly estimated.
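The interaction between Max Iterations and Minimum Change in Parameter Estimates can be sketched as a generic stopping rule. This is an illustration only: the `iterate` function and the toy update are hypothetical stand-ins for the estimation step and do not reproduce the platform's algorithm.

```python
def iterate(update, beta, min_change=1e-6, max_iterations=100):
    """Apply update() until every parameter changes by less than min_change.

    Returns (estimates, iterations used, whether convergence was reached).
    """
    for iteration in range(1, max_iterations + 1):
        new_beta = update(beta)
        largest_change = max(abs(n - o) for n, o in zip(new_beta, beta))
        beta = new_beta
        if largest_change < min_change:
            return beta, iteration, True
    # Iteration limit reached: return the values of the last iteration.
    return beta, max_iterations, False


# Toy update: each step halves the distance to the fixed point (2.0, -1.0).
target = (2.0, -1.0)


def toy_update(beta):
    return [b + 0.5 * (t - b) for b, t in zip(beta, target)]


beta, n_iter, converged = iterate(toy_update, [0.0, 0.0], min_change=1e-6, max_iterations=100)
stopped_early = iterate(toy_update, [0.0, 0.0], min_change=1e-6, max_iterations=5)
print(converged, n_iter, stopped_early[2])
```

With a generous iteration limit the loop stops as soon as all parameter changes fall below the tolerance. With a limit of five it returns the estimates of the last iteration and flags that convergence was not reached.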
Output

The output of the Generalized Estimating Equations (GEEs) procedure is organized into three main sections: the Predicted Values Table, the Parameter Estimates Table and the Working Correlation Matrix.

  1. The Predicted Values Table shows the fitted values of the dependent variable under the specified model.
  2. The Parameter Estimates Table includes the estimated coefficients for the independent variables. Each row corresponds to a predictor and includes its coefficient, standard error, confidence interval, test statistic, degrees of freedom, and p-value.
  3. The Working Correlation Matrix displays the estimated correlations between repeated measurements of the same subject over time.
Example
Input

The dataset must include a non-negative, continuous outcome variable suitable for Tweedie regression with a log link. The dependent variable should reflect semicontinuous data with a mass at zero and positive continuous values, as seen in insurance claims, healthcare costs, or rainfall. It must also contain at least one independent variable (continuous or categorical), a subject identifier to define clusters of repeated measures, and a within-subject measurement variable to indicate the order of observations. The dataset structure should support within-cluster correlation modeling using a log link under the Tweedie distribution in the GEE framework, allowing for a multiplicative relationship between predictors and the expected outcome.
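The multiplicative relationship implied by the log link can be checked numerically. In the sketch below the coefficients are hypothetical values chosen for illustration, not estimates from any dataset, and a single covariate is assumed.

```python
import math


def expected_outcome(intercept, slope, x):
    """Mean under a log link: log(mu) = intercept + slope * x."""
    return math.exp(intercept + slope * x)


intercept, slope = 1.0, 0.2  # hypothetical coefficients

# Ratio of expected outcomes for a one-unit increase in the covariate.
ratio = expected_outcome(intercept, slope, 6.0) / expected_outcome(intercept, slope, 5.0)
print(ratio, math.exp(slope))
```

The ratio equals `exp(slope)` regardless of the starting value of the covariate, which is why coefficients under a log link are usually read as multiplicative effects on the expected outcome.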

tweedielogGEE-input
Configuration
  1. Select Analytics \(\rightarrow\) Regression \(\rightarrow\) Statistical fitting \(\rightarrow\) Generalized Estimating Equations.
  2. Set the Type [1] of regression to Tweedie with Log Link.
  3. Specify the Confidence Level (%) [2] for the test.
  4. Specify Max Iterations [3].
  5. Specify Iterations Between Updates [4].
  6. Select the Dependent Variable [5].
  7. Specify the Minimum Change in Parameter Estimates [6].
  8. Specify the Hessian Convergence [7].
  9. Specify the Scale Parameter [8].
  10. Select the Within Subject [9].
  11. Select the working correlation Structure [10] and, if you select M-Dependent as the correlation structure, specify the value of M.
  12. Select the method used to estimate the covariance matrix [11] of the standard errors.
  13. Select the columns by clicking on the arrow buttons [16] and moving columns between the Excluded Columns [12] and Subjects [13] and Factors [14] and Covariates [15] lists.
  14. Select your preferred option to define the model you want to analyze [17].
  15. If the Custom option is selected, specify the Formula [18] for the analysis.
  16. Click on the Execute button [19] to perform the Tweedie Regression with Log Link method.
tweedielogGEE-config
Output

The Predictions, Parameter Estimates table and Working Correlation Matrix are shown in the output spreadsheet.

tweedielogGEE-output

Tips

k Nearest neighbors:

  • It works more efficiently for small to medium datasets and low-dimensional data. kNN is sensitive to missing data.
  • The performance of the model is highly influenced by the selection of k.
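The influence of k can be seen in a small distance-weighted kNN sketch. It assumes inverse-distance weights and omits the scaling step, so it should be read as an illustration rather than the platform's implementation; the data are made up and the function name is hypothetical.

```python
import math


def knn_predict(train_X, train_y, query, k, eps=1e-12):
    """Predict by an inverse-distance weighted average of the k nearest targets."""
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    # eps avoids division by zero when the query coincides with a training point.
    weights = [1.0 / (d + eps) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


# Hypothetical one-feature training set with one distant, extreme target value.
train_X = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [1.0, 2.0, 3.0, 50.0]
query = (2.4,)

pred_k1 = knn_predict(train_X, train_y, query, k=1)
pred_k2 = knn_predict(train_X, train_y, query, k=2)
pred_k4 = knn_predict(train_X, train_y, query, k=4)
print(pred_k1, pred_k2, pred_k4)
```

With k = 1 the prediction copies the single nearest target, while k = 4 pulls the prediction towards the distant extreme value, so trying several values of k on held-out data is usually worthwhile.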

Radial Basis Function Network:

  • The number of neurons in the hidden layer has a high impact on the model performance, since a large number of neurons can lead to overfitting.

Linear SGD:

  • It is effective in large datasets and can handle high-dimensional feature spaces. Higher learning rates may be required when Huber loss is selected, because it is less sensitive to outliers.
  • Consider scaling the input data with a Z score normalizer so that each feature is centered at zero mean and has unit standard deviation.
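Z score normalization of a single column can be sketched as follows. The function name is hypothetical, and the use of the population standard deviation is an assumption; a given normalizer may use the sample standard deviation instead.

```python
from statistics import mean, pstdev


def z_score(column):
    """Center a column at zero mean and scale it to unit standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("a constant column cannot be scaled to unit standard deviation")
    return [(value - mu) / sigma for value in column]


scaled = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(scaled)
```

After scaling, features measured on very different ranges contribute comparably to the gradient updates, which tends to make the choice of learning rate less delicate.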

XGBoost:

  • Be cautious during hyperparameter tuning: choosing smaller eta values, as well as increasing the lambda, alpha and gamma values, results in a more conservative boosting process. Increasing the value of the max depth parameter makes the model more complex and more likely to overfit.

Random Forest:

  • This algorithm performs well with datasets that contain missing values. However, it is not as efficient with a large number of sparse features or with categorical variables of many levels that are improperly encoded.

See also

The model generated by any of the k Nearest Neighbors (kNN), Fully Connected Neural Network, Radial Basis Function Network, Linear SGD, XGBoost or Random Forest algorithms can be applied to any input data through the Existing Model Utilization function (e.g., a regression model trained on the training set data can be applied to the test/external set data).

Workflows

Bodyfat prediction case study

House pricing case study

Insurance charges case study

MA score case study

Salary prediction case study

References

  1. Witten IH, Frank E, Hall MA, Pal CJ. Data Mining: Practical Machine Learning Tools and Techniques. 4th ed. Morgan Kaufmann; 2011. https://doi.org/10.1016/C2009-0-19715-5.
  2. Murphy KP. Machine Learning: A Probabilistic Perspective. The MIT Press; 2012. https://doi.org/10.5555/2380985.
  3. Lee C-C, Chung P-C, Tsai J-R, Chang C-I. Robust radial basis function neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 1999;29:674–85. https://doi.org/10.1109/3477.809023.
  4. Ghosh J, Nag A. An Overview of Radial Basis Function Networks. In: Howlett RJ, Jain LC, editors. Radial Basis Function Networks 2: New Advances in Design, Heidelberg: Physica-Verlag HD; 2001, p. 1–36. https://doi.org/10.1007/978-3-7908-1826-0_1.
  5. Bottou L. Large-Scale Machine Learning with Stochastic Gradient Descent. In: Lechevallier Y, Saporta G, editors. Proceedings of COMPSTAT’2010, Heidelberg: Physica-Verlag HD; 2010, p. 177–86. https://doi.org/10.1007/978-3-7908-2604-3_16.
  6. XGBoost Parameters — xgboost 2.1.0-dev documentation n.d. https://xgboost.readthedocs.io/en/latest/parameter.html (accessed June 3, 2024).
  7. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York, NY: Springer New York; 2009. https://doi.org/10.1007/978-0-387-84858-7.
  8. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, p. 785–94. https://doi.org/10.1145/2939672.2939785.
  9. Breiman L. Random Forests. Machine Learning 2001;45:5–32. https://doi.org/10.1023/A:1010933404324.
  10. Nelder JA, Wedderburn RWM. Generalized linear models. J R Stat Soc A. 1972;135(3):370-84. https://doi.org/10.2307/2344614.
  11. McCullagh P, Nelder JA. Generalized linear models. 2nd ed. London: Chapman & Hall/CRC; 1989. https://doi.org/10.1201/9780203753736.
  12. McCullagh P. Regression models for ordinal data. J R Stat Soc B. 1980;42(2):109-42. https://doi.org/10.1111/j.2517-6161.1980.tb01109.x.
  13. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13-22. https://doi.org/10.1093/biomet/73.1.13.

Version History

Introduced in Isalos Analytics Platform v0.1.18

Instructions last updated in May 2025