Appendix A — ML-Stats Dictionary
Machine learning and statistics overlap substantially, and this overlap is reflected in data science. Thus, it is imperative to build solid bridges between both disciplines so that their large amount of jargon and terminology is clear. This ML-Stats dictionary (ML stands for Machine Learning) aims to be one of these bridges, especially within supervised learning and regression analysis contexts.

Below, you will find definitions highlighted in blue if they correspond to statistical terminology, or in magenta if they correspond to machine learning terminology. These definitions come from all definition admonitions introduced throughout the fifteen main chapters of this textbook. This colour scheme brings all the terminology together so that you can switch from one field to the other easily. With practice and time, you will be able to jump back and forth when using these concepts.
Attention!
Noteworthy terms (either statistical or machine learning-related) will include a particular admonition identifying which terms are equivalent.
A
Alternative hypothesis
The alternative hypothesis, denoted by \(H_1\) or \(H_a\), is the competing statement about the parameter. It describes the kind of departure from the null hypothesis that the test is designed to detect. If the null hypothesis restricts the parameter to \(\Theta_0\), the alternative hypothesis often restricts it to another part of the parameter space:
\[ H_1 \text{: } \boldsymbol{\theta} \in \Theta_1, \qquad \Theta_1 \subseteq \Theta. \]
For one-sided tests, \(\Theta_1\) specifies a direction, such as values greater than a benchmark or values less than a target. On the other hand, for two-sided tests, \(\Theta_1\) specifies departures from the null value in either direction, such as values that are either smaller or larger than the benchmark.
Note that the alternative hypothesis determines what counts as “more extreme” evidence:
- For a right-tailed test, large positive values of the test statistic support the alternative hypothesis.
- For a left-tailed test, large negative values support the alternative hypothesis.
- For a two-sided test, values far from the null value in either direction support the alternative hypothesis.
Attribute
An attribute is a variable used to help explain, describe, or predict the behaviour of a response variable in a regression model. Attributes represent the observed characteristics or conditions associated with each observational unit. Depending on the context, attributes may be quantitative or categorical, and their role may be primarily explanatory, predictive, or both.
Equivalent to:
Covariate, exogenous variable, explanatory variable, feature, independent variable, input, predictor or regressor.
C
Critical value
A critical value is a cutoff from the null distribution used to decide whether the observed test statistic falls in the rejection region. For a test with significance level \(\alpha\), the critical value is chosen so that the rejection region has probability \(\alpha\) under the null hypothesis.
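As a minimal sketch, assuming the null distribution of the test statistic is standard Normal (an assumption made only for this illustration, as is the choice \(\alpha = 0.05\)), critical values can be read off as quantiles of the null distribution:

```python
from statistics import NormalDist

alpha = 0.05                      # significance level (illustrative choice)
null_dist = NormalDist(0.0, 1.0)  # assumed standard Normal null distribution

# Right-tailed test: reject when the statistic exceeds the (1 - alpha) quantile.
z_right = null_dist.inv_cdf(1 - alpha)

# Two-sided test: split alpha equally between both tails.
z_two_sided = null_dist.inv_cdf(1 - alpha / 2)

# The rejection region of the right-tailed test has probability alpha under H_0.
rejection_prob = 1 - null_dist.cdf(z_right)

print(round(z_right, 3), round(z_two_sided, 3), round(rejection_prob, 3))
```

Other null distributions (for example, a \(t\) or chi-squared distribution) would follow the same logic with a different quantile function.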
Conditional probability
Suppose \(A\) and \(B\) are two events in the same sample space \(\mathcal{S}\), with
\[ \Pr(B) > 0. \]
The conditional probability of \(A\) given \(B\) is defined as
\[ \Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}. \tag{A.1}\]
The event \(B\) is called the conditioning event. The expression \(\Pr(A \mid B)\) is read as “the probability of \(A\) given \(B\).” It describes the probability of \(A\) after we have restricted attention to situations where \(B\) has occurred.
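Equation A.1 can be checked by direct enumeration. The sketch below assumes a hypothetical random phenomenon, two fair six-sided dice, with events chosen only for illustration: \(A\) is "the sum equals 8" and \(B\) is "the first die is even".

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of two fair dice (illustrative assumption).
sample_space = list(product(range(1, 7), repeat=2))

def pr(event):
    """Probability of an event when all outcomes are equally likely."""
    favourable = sum(1 for s in sample_space if event(s))
    return Fraction(favourable, len(sample_space))

def is_a(s):
    return s[0] + s[1] == 8   # event A: the dice sum to 8

def is_b(s):
    return s[0] % 2 == 0      # event B: the first die is even

pr_b = pr(is_b)
pr_a_and_b = pr(lambda s: is_a(s) and is_b(s))

# Equation A.1: Pr(A | B) = Pr(A and B) / Pr(B), valid because Pr(B) > 0.
pr_a_given_b = pr_a_and_b / pr_b
print(pr_a_given_b)
```

Restricting attention to the 18 outcomes where \(B\) occurs, 3 of them also belong to \(A\), which matches the ratio computed by the formula.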
Confidence interval
A confidence interval is an interval computed from sample data using a procedure designed to capture the unknown parameter a specified proportion of the time across repeated random samples. For example, a 95% confidence interval procedure is one that, under its assumptions, would produce intervals containing the true parameter in about 95% of repeated samples. For a single observed interval, we do not say that there is a 95% probability that the fixed frequentist parameter lies inside the interval. Instead, we say that the interval was produced by a procedure with 95% long-run coverage.
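The long-run coverage idea can be illustrated by simulation. This is a rough sketch under stated assumptions: a hypothetical Normal population with known true mean, and a large-sample Normal-approximation interval for the mean. All numerical settings are illustrative.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(2024)  # fixed seed so the illustration is reproducible

true_mu, sigma = 10.0, 2.0   # hypothetical population parameters
n, reps = 50, 2000           # sample size and number of repeated samples
z = NormalDist().inv_cdf(0.975)  # quantile for a nominal 95% interval

covered = 0
for _ in range(reps):
    sample = [random.gauss(true_mu, sigma) for _ in range(n)]
    centre = mean(sample)
    half_width = z * stdev(sample) / n ** 0.5
    # Each repetition yields a different interval; the parameter stays fixed.
    if centre - half_width <= true_mu <= centre + half_width:
        covered += 1

coverage = covered / reps
print(coverage)
```

The printed proportion should be close to 0.95. It describes the procedure across repeated samples, not any single interval.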
Continuous random variable
Let \(Y\) be a random variable with support \(\mathcal{Y}\). If \(\mathcal{Y}\) is an uncountably infinite set of possible values, then \(Y\) is called a continuous random variable. Continuous random variables can have different kinds of support. For example, they may be:
- completely unbounded, with possible values from \(-\infty\) to \(\infty\);
- positively unbounded, with possible values from \(0\) to \(\infty\);
- negatively unbounded, with possible values from \(-\infty\) to \(0\); or
- bounded, with possible values between two finite values \(a\) and \(b\).
Covariate
A covariate is a variable used to help explain, describe, or predict the behaviour of a response variable in a regression model. Covariates represent the observed characteristics or conditions associated with each observational unit. Depending on the context, covariates may be quantitative or categorical, and their role may be primarily explanatory, predictive, or both.
Equivalent to:
Attribute, exogenous variable, explanatory variable, feature, independent variable, input, predictor or regressor.
Cumulative distribution function
Let \(Y\) be a random variable. The cumulative distribution function (CDF) of \(Y\) is the function \(F_Y(y)\) defined as
\[ F_Y(y) = \Pr(Y \leq y). \]
The CDF gives the probability that the random variable \(Y\) takes a value less than or equal to a chosen cutoff value \(y\). In this sense, it accumulates probability from the left up to \(y\).
For a discrete random variable with probability mass function (PMF) \(p_Y(y)\) and support \(\mathcal{Y}\), the CDF can be written as
\[ F_Y(y) = \sum_{u \in \mathcal{Y}: u \leq y} p_Y(u). \]
For a continuous random variable with probability density function (PDF) \(f_Y(y)\) and support \(\mathcal{Y}\), the CDF can be written as
\[ F_Y(y) = \int_{-\infty}^{y} f_Y(u)\,du, \]
with the understanding that the density is zero outside the support of \(Y\).
In these expressions, \(y\) is the cutoff value at which the CDF is evaluated, while \(u\) is a running variable used inside the summation or integral. This distinction keeps the notation clear: we are accumulating all probability attached to values \(u\) that are less than or equal to the cutoff \(y\).
In plain words, the PMF or PDF describes how probability is assigned locally, while the CDF describes how probability accumulates up to a given value.
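As a small sketch of the discrete case, assume a hypothetical fair six-sided die, so that \(p_Y(u) = 1/6\) for \(u \in \{1, \ldots, 6\}\). The CDF then accumulates the PMF over all support values \(u \leq y\):

```python
from fractions import Fraction

# PMF of a fair six-sided die (illustrative assumption).
pmf = {u: Fraction(1, 6) for u in range(1, 7)}

def cdf(y):
    """F_Y(y): sum of p_Y(u) over support values u with u <= y."""
    return sum((p for u, p in pmf.items() if u <= y), Fraction(0))

print(cdf(4))     # cutoff at a support value
print(cdf(2.5))   # cutoff between support values
print(cdf(0))     # cutoff below the support
```

Note that the cutoff \(y\) does not need to belong to the support: the CDF is defined for any real value.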
D
Data leakage
Data leakage occurs when information from outside the proper training process is allowed to influence model fitting, model selection, or predictive assessment in a way that would not be available when making predictions on genuinely new observations. In practice, leakage can lead to performance results that look artificially strong because the model has, directly or indirectly, already “seen” information from the data used for evaluation.
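One common source of leakage is preprocessing. The sketch below assumes standardization as the preprocessing step and uses made-up numbers: computing the centring and scaling constants on the full dataset lets the held-out observation influence the transformation, whereas the proper workflow estimates them from the training portion only.

```python
from statistics import mean, pstdev

# Hypothetical feature values; the last observation is held out for testing.
x = [2.0, 4.0, 6.0, 8.0, 100.0]
x_train, x_test = x[:4], x[4:]

# Leaky: constants computed with the test observation included.
mu_leaky, sd_leaky = mean(x), pstdev(x)

# Proper: constants computed from the training data only.
mu_train, sd_train = mean(x_train), pstdev(x_train)

# The test observation is transformed with training-only constants.
z_test_proper = [(v - mu_train) / sd_train for v in x_test]
z_test_leaky = [(v - mu_leaky) / sd_leaky for v in x_test]

print(mu_train, mu_leaky)
print(z_test_proper, z_test_leaky)
```

The two versions give different standardized values for the held-out point, which is exactly the information that would not be available for genuinely new observations.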
Dependent variable
A dependent variable is the measurement or quantity of main interest in a supervised learning or regression analysis problem. It is the variable whose behaviour we aim to explain, model, or predict using one or more explanatory variables.
Depending on the application, the dependent variable may represent a continuous quantity, a count, a binary result, a proportion, or another type of measurement. The nature of the dependent variable plays a fundamental role in determining the appropriate regression model and probability model for the analysis.
Equivalent to:
Endogenous variable, response variable, outcome, output or target.
Discrete random variable
Let \(Y\) be a random variable with support \(\mathcal{Y}\). If \(\mathcal{Y}\) is a finite set or a countably infinite set of possible values, then \(Y\) is called a discrete random variable. Discrete random variables commonly appear as:
- binary variables, whose support contains two possible values;
- categorical variables, whose support contains three or more categories, either nominal or ordinal; or
- count variables, whose support contains nonnegative integer values.
Dispersion
E
Endogenous variable
An endogenous variable is the measurement or quantity of main interest in a supervised learning or regression analysis problem. It is the variable whose behaviour we aim to explain, model, or predict using one or more explanatory variables.
Depending on the application, the endogenous variable may represent a continuous quantity, a count, a binary result, a proportion, or another type of measurement. The nature of the endogenous variable plays a fundamental role in determining the appropriate regression model and probability model for the analysis.
Equivalent to:
Dependent variable, outcome, output, response variable or target.
Equidispersion
Estimate
An estimate is the numerical value obtained after applying an estimator to the observed data.
For example, if \(325\) out of \(500\) surveyed children prefer chocolate ice cream, then the observed estimate of \(\pi\) is
\[ \hat{\pi}_{\operatorname{obs}} = \bar{d} = \frac{325}{500} = 0.65. \]
Here, \(\hat{\pi}_{\operatorname{obs}}\) is the observed estimate.
Estimator
An estimator is a rule or function of a random sample used to estimate an unknown parameter. Because an estimator is computed from random variables, it is itself a random variable.
For example, if
\[ D_1,\ldots,D_{n_d} \overset{\text{i.i.d.}}{\sim} \operatorname{Bernoulli}(\pi), \]
then the sample proportion
\[ \hat{\pi} = \bar{D} = \frac{1}{n_d} \sum_{i=1}^{n_d} D_i \]
is an estimator of \(\pi\).
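The distinction between an estimator and an estimate can be sketched in code: the estimator is the function, while the estimate is the number returned once observed data are plugged in. The data vector below is a hypothetical reconstruction that only matches the counts of the ice cream example (325 ones out of 500).

```python
def sample_proportion(d):
    """Estimator: a rule that maps any 0/1 sample to a proportion."""
    return sum(d) / len(d)

# Hypothetical observed sample consistent with 325 successes out of 500.
d_obs = [1] * 325 + [0] * 175

# Estimate: the numerical value obtained from the observed data.
pi_hat_obs = sample_proportion(d_obs)
print(pi_hat_obs)
```

Applying the same function to a different random sample would generally return a different estimate, which is why the estimator itself is treated as a random variable.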
Expected value
Let \(Y\) be a random variable with support \(\mathcal{Y}\). The expected value, or mean, of \(Y\) is a probability-weighted average of the possible values of the random variable.
If \(Y\) is discrete with probability mass function (PMF) \(p_Y(y)\), then
\[ \mathbb{E}(Y) = \sum_{y \in \mathcal{Y}} y\,p_Y(y). \]
If \(Y\) is continuous with probability density function (PDF) \(f_Y(y)\), then
\[ \mathbb{E}(Y) = \int_{\mathcal{Y}} y\,f_Y(y)\,dy. \]
The expected value is a population-level summary: it describes the centre of the probability distribution, not the average of one particular observed sample.
Equivalent to:
Mean.
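For the discrete case, the probability-weighted average can be computed directly from the PMF. This sketch assumes a hypothetical Binomial random variable with 3 trials and success probability \(1/2\), chosen only for illustration:

```python
from fractions import Fraction
from math import comb

n, p = 3, Fraction(1, 2)  # illustrative Binomial parameters

# Binomial PMF over the support {0, 1, ..., n}.
pmf = {y: comb(n, y) * p**y * (1 - p) ** (n - y) for y in range(n + 1)}

# E(Y) = sum over the support of y * p_Y(y).
expected_value = sum(y * prob for y, prob in pmf.items())

print(sum(pmf.values()))  # total probability
print(expected_value)
```

The result equals \(n p\), the known mean of a Binomial distribution, and it is computed from the distribution alone, without any observed sample.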
Exogenous variable
An exogenous variable is a variable used to help explain, describe, or predict the behaviour of a response variable in a regression model. Exogenous variables represent the observed characteristics or conditions associated with each observational unit. Depending on the context, exogenous variables may be quantitative or categorical, and their role may be primarily explanatory, predictive, or both.
Equivalent to:
Attribute, covariate, explanatory variable, feature, independent variable, input, predictor or regressor.
Explanatory variable
An explanatory variable is a variable used to help explain, describe, or predict the behaviour of a response variable in a regression model. Explanatory variables represent the observed characteristics or conditions associated with each observational unit. Depending on the context, explanatory variables may be quantitative or categorical, and their role may be primarily explanatory, predictive, or both.
Equivalent to:
Attribute, covariate, exogenous variable, feature, independent variable, input, predictor or regressor.
F
False negative
A false negative occurs when a test fails to reject the null hypothesis \(H_0\) even though \(H_0\) is false in a scientifically or practically relevant way. This is also known as a Type II error: the test does not detect a departure from the null hypothesis, even though such a departure is actually present.
Equivalent to:
Type II error.
False positive
A false positive occurs when a test rejects the null hypothesis \(H_0\) even though \(H_0\) is true. This is also known as a Type I error: the test concludes that there is enough evidence against the null hypothesis, but the null hypothesis is actually true.
Equivalent to:
Type I error.
Feature
A feature is a variable used to help explain, describe, or predict the behaviour of a response variable in a regression model. Features represent the observed characteristics or conditions associated with each observational unit. Depending on the context, features may be quantitative or categorical, and their role may be primarily explanatory, predictive, or both.
Equivalent to:
Attribute, covariate, exogenous variable, explanatory variable, independent variable, input, predictor or regressor.
Fitted value
In a regression model, the fitted value for the \(i\)th observation is the model’s estimate of the conditional mean of the response variable given the \(k\) observed regressors. After estimating the model parameters from the training data, we plug the \(i\)th row of regressor values into the fitted systematic component to obtain
\[ \hat{y}_i = \widehat{\mathbb{E}}\left(Y_i \mid x_{i,1}, \ldots , x_{i,k}\right). \]
Frequentist statistics
Frequentist statistics treats unknown parameters as fixed but unknown quantities. Randomness comes from the data-generating process, the sampling process, or the experimental assignment mechanism. Under this view, probabilities describe the long-run behaviour of random quantities under repeated use of the same procedure. For example, a standard error describes how much an estimator would vary across repeated random samples, and a confidence level describes the long-run coverage behaviour of an interval procedure.
G
Generalized linear models
Generalized linear models (GLMs) are an umbrella of regression approaches that model the conditional expected value of the response variable \(Y\) based on a set of observed regressors \(x\). Unlike a traditional model such as ordinary least squares (OLS) for a continuous response, which relies solely on the Normal distribution to make inference, GLMs relax this distributional assumption and allow for a variety of probability distributions for the response variable. Note that this umbrella encompasses approaches that accommodate continuous or discrete responses \(Y\). According to Casella and Berger (2024), a typical GLM consists of three key components:
- Random Component: The response variables in a training dataset of size \(n\) (i.e., the random variables \(Y_1, Y_2, \ldots, Y_n\)) are statistically independent but not identically distributed. Still, they do belong to the same family of probability distributions (e.g., Gamma, Beta, Poisson, Bernoulli, etc.).
- Systematic Component: For the \(i\)th observation, this component depicts how the \(k\) regressors \(x_{i, j}\) (for \(j = 1, 2, \ldots, k\)) come into the GLM as a linear combination involving \(k + 1\) regression parameters \(\beta_0, \beta_1, \ldots, \beta_k\). This relationship is expressed as
\[ \eta_i = \beta_0 + \beta_1 x_{i, 1} + \beta_2 x_{i, 2} + \ldots + \beta_k x_{i, k}. \]
- Link Function: This component connects (or “links”) the systematic component \(\eta_i\) with the mean of the random variable \(Y_i\), denoted as \(\mu_i\). The link function is mathematically represented as
\[ g(\mu_i) = \eta_i. \]
Nelder and Wedderburn (1972) introduced this umbrella term called GLM in the statistical literature and identified a set of distinct statistical models that shared the above three components.
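The systematic component and the link function can be sketched numerically. The example below assumes a log link, \(g(\mu_i) = \log(\mu_i)\), as used in Poisson regression. The coefficient and regressor values are hypothetical and are not estimated from any data.

```python
import math

beta = [0.5, 0.2, -0.1]  # hypothetical (beta_0, beta_1, beta_2)
x_i = [3.0, 4.0]         # hypothetical regressor values (x_{i,1}, x_{i,2})

def linear_predictor(beta, x):
    """Systematic component: eta_i = beta_0 + sum_j beta_j * x_{i,j}."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def inverse_log_link(eta):
    """Inverse of the log link: mu_i = exp(eta_i), always positive."""
    return math.exp(eta)

eta_i = linear_predictor(beta, x_i)
mu_i = inverse_log_link(eta_i)

print(eta_i, mu_i)
```

The linear predictor \(\eta_i\) can take any real value, while the inverse link maps it to a mean \(\mu_i\) that respects the range of the response, here a positive mean for a count.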
Goodness of fit
Goodness of fit refers to the process of assessing whether a fitted probability model is compatible with the main patterns in the observed data. A goodness-of-fit assessment does not prove that the model is true. Instead, it asks whether the fitted model is adequate enough for the inferential or predictive purpose of the regression analysis.
H
Hypothesis testing
Hypothesis testing is a frequentist decision procedure that uses sample data to assess evidence against a null hypothesis. A hypothesis test compares the observed data with what would be expected under the null hypothesis. The outcome is not a statement that the null hypothesis is true or false with certainty. Instead, it is a conclusion about whether the observed evidence is strong enough, according to the chosen significance level and test procedure, to reject the null hypothesis.
I
Independence
Two events \(A\) and \(B\) are independent if knowing that one event occurred does not change the probability of the other event. Equivalently, if \(\Pr(B) > 0\), then Equation A.1 for conditional probability simplifies to
\[ \Pr(A \mid B) = \Pr(A). \]
Using the multiplication rule, this is equivalent to
\[ \Pr(A \cap B) = \Pr(A) \times \Pr(B). \]
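The product rule gives a direct way to check independence by enumeration. The sketch below assumes two fair six-sided dice and events chosen only for illustration: one pair of events is independent and the other is not.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes of two fair dice (illustrative assumption).
sample_space = list(product(range(1, 7), repeat=2))

def pr(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(sum(1 for s in sample_space if event(s)), len(sample_space))

def first_is_3(s):
    return s[0] == 3

def sum_is_7(s):
    return s[0] + s[1] == 7

def sum_is_8(s):
    return s[0] + s[1] == 8

def independent(e1, e2):
    """Check whether Pr(E1 and E2) equals Pr(E1) * Pr(E2)."""
    joint = pr(lambda s: e1(s) and e2(s))
    return joint == pr(e1) * pr(e2)

print(independent(first_is_3, sum_is_7))
print(independent(first_is_3, sum_is_8))
```

Knowing that the first die shows 3 leaves the chance of a sum of 7 unchanged, but it changes the chance of a sum of 8, so only the first pair satisfies the product rule.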
Independence for random variables
Random variables \(Y_1, Y_2, \ldots, Y_n\) are mutually independent if their joint probability distribution factors into the product of their marginal distributions.
The joint probability distribution, often shortened to the joint distribution, describes the probabilistic behaviour of the full collection of random variables together. For example, it assigns probability or density to the vector of possible values
\[ \mathbf{y} = (y_1, y_2, \ldots, y_n)^{\top}. \]
A marginal probability distribution, often shortened to a marginal distribution, describes the probabilistic behaviour of one random variable on its own, without explicitly conditioning on or modelling the others. For example, \(p_{Y_i}(y_i)\) is the marginal probability mass function (PMF) of the single random variable \(Y_i\), while \(f_{Y_i}(y_i)\) is the marginal probability density function (PDF) of the single random variable \(Y_i\).
Thus, independence says that the joint behaviour of the full collection can be recovered by multiplying the separate one-variable behaviours.
If \(Y_1, Y_2, \ldots, Y_n\) are discrete random variables with marginal PMFs \(p_{Y_1}(y_1), p_{Y_2}(y_2), \ldots, p_{Y_n}(y_n)\), then independence implies
\[ p_{Y_1,\ldots,Y_n}(y_1,\ldots,y_n) = \prod_{i=1}^n p_{Y_i}(y_i). \]
If \(Y_1, Y_2, \ldots, Y_n\) are continuous random variables with marginal PDFs \(f_{Y_1}(y_1), f_{Y_2}(y_2), \ldots, f_{Y_n}(y_n)\), then independence implies
\[ f_{Y_1,\ldots,Y_n}(y_1,\ldots,y_n) = \prod_{i=1}^n f_{Y_i}(y_i). \]
Independent variable
An independent variable is a variable used to help explain, describe, or predict the behaviour of a response variable in a regression model. Independent variables represent the observed characteristics or conditions associated with each observational unit. Depending on the context, independent variables may be quantitative or categorical, and their role may be primarily explanatory, predictive, or both.
Equivalent to:
Attribute, covariate, exogenous variable, explanatory variable, feature, input, predictor or regressor.
Input
An input is a variable used to help explain, describe, or predict the behaviour of a response variable in a regression model. Inputs represent the observed characteristics or conditions associated with each observational unit. Depending on the context, inputs may be quantitative or categorical, and their role may be primarily explanatory, predictive, or both.
Equivalent to:
Attribute, covariate, exogenous variable, explanatory variable, feature, independent variable, predictor or regressor.
L
Likelihood function
Suppose \(Y_1,\ldots,Y_n\) are modelled as an i.i.d. random sample from a probability model with parameter vector
\[ \boldsymbol{\theta} = (\theta_1,\theta_2,\ldots,\theta_k)^\top \in \Theta. \]
After observing data
\[ \mathbf{y} = (y_1,\ldots,y_n)^\top, \]
the likelihood function is the joint probability mass function (PMF) or joint probability density function (PDF) of the observed data, viewed as a function of \(\boldsymbol{\theta}\).
For a discrete i.i.d. model with common PMF \(p_Y(y;\boldsymbol{\theta})\), the likelihood function is:
\[ \mathcal{L}(\boldsymbol{\theta};\mathbf{y}) = \prod_{i=1}^n p_Y(y_i;\boldsymbol{\theta}). \]
For a continuous i.i.d. model with common PDF \(f_Y(y;\boldsymbol{\theta})\), the likelihood function is:
\[ \mathcal{L}(\boldsymbol{\theta};\mathbf{y}) = \prod_{i=1}^n f_Y(y_i;\boldsymbol{\theta}). \]
The observed data \(\mathbf{y}\) appear after the semicolon because they are fixed once collected. The parameter vector \(\boldsymbol{\theta}\) is the input of the likelihood function because maximum likelihood estimation compares candidate parameter values.
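As a minimal sketch of the discrete case, assume an i.i.d. Bernoulli model with a single parameter \(\pi\) and a small, made-up observed sample. The data stay fixed, and the likelihood is evaluated at several candidate parameter values:

```python
def bernoulli_pmf(y, pi):
    """p_Y(y; pi) for a Bernoulli random variable with y in {0, 1}."""
    return pi if y == 1 else 1 - pi

def likelihood(pi, y_obs):
    """Product of the PMF over the observed data, as a function of pi."""
    value = 1.0
    for y in y_obs:
        value *= bernoulli_pmf(y, pi)
    return value

y_obs = [1, 0, 1, 1, 0]  # hypothetical observed sample (fixed once collected)

for candidate in (0.2, 0.6, 0.9):
    print(candidate, likelihood(candidate, y_obs))
```

Among these three candidates, \(\pi = 0.6\) gives the largest likelihood, which agrees with the sample proportion \(3/5\).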
Log-likelihood function
Suppose \(Y_1,\ldots,Y_n\) are modelled as an i.i.d. random sample from a probability model with parameter vector
\[ \boldsymbol{\theta} = (\theta_1,\theta_2,\ldots,\theta_k)^\top \in \Theta. \]
After observing data
\[ \mathbf{y} = (y_1,\ldots,y_n)^\top, \]
the log-likelihood function is the natural logarithm of the likelihood function. We denote it by the lowercase script-like symbol \(\ell(\cdot)\):
\[ \ell(\boldsymbol{\theta};\mathbf{y}) = \log\left[ \mathcal{L}(\boldsymbol{\theta};\mathbf{y}) \right]. \]
Here, \(\ell(\cdot)\) is not a new probability model. It is simply the likelihood function viewed on the logarithmic scale.
For a discrete i.i.d. model,
\[ \ell(\boldsymbol{\theta};\mathbf{y}) = \sum_{i=1}^n \log p_Y(y_i;\boldsymbol{\theta}), \]
and for a continuous i.i.d. model,
\[ \ell(\boldsymbol{\theta};\mathbf{y}) = \sum_{i=1}^n \log f_Y(y_i;\boldsymbol{\theta}). \]
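One practical reason for working on the logarithmic scale is numerical stability. The sketch below assumes an i.i.d. Bernoulli model and a hypothetical sample of 5000 observations: the product of many probabilities underflows to zero in floating-point arithmetic, while the sum of log-probabilities remains finite.

```python
import math

def bernoulli_pmf(y, pi):
    """p_Y(y; pi) for a Bernoulli random variable with y in {0, 1}."""
    return pi if y == 1 else 1 - pi

# Hypothetical observed sample: 3250 ones and 1750 zeros.
y_obs = [1] * 3250 + [0] * 1750
pi = 0.65  # candidate parameter value

# Likelihood: product of 5000 numbers below one (underflows to 0.0).
likelihood = 1.0
for y in y_obs:
    likelihood *= bernoulli_pmf(y, pi)

# Log-likelihood: sum of the log-PMF values (stays finite).
log_likelihood = sum(math.log(bernoulli_pmf(y, pi)) for y in y_obs)

print(likelihood, log_likelihood)
```

Both quantities carry the same information about \(\pi\) in exact arithmetic, but only the log-likelihood is usable on a computer for samples of this size.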
M
Maximum likelihood estimation
Let \(\mathcal{L}(\boldsymbol{\theta};\mathbf{y})\) be the likelihood function for observed data \(\mathbf{y}\) under a probability model with parameter vector \(\boldsymbol{\theta} \in \Theta\). The maximum likelihood estimate is the parameter value that maximizes the likelihood function:
\[ \hat{\boldsymbol{\theta}}_{\operatorname{MLE,obs}} = \underset{\boldsymbol{\theta}\in\Theta}{\operatorname{arg\,max}} \ \mathcal{L}(\boldsymbol{\theta};\mathbf{y}). \]
Equivalently, because \(\log(\cdot)\) is strictly increasing, we usually maximize the log-likelihood function:
\[ \hat{\boldsymbol{\theta}}_{\operatorname{MLE,obs}} = \underset{\boldsymbol{\theta}\in\Theta}{\operatorname{arg\,max}} \ \ell(\boldsymbol{\theta};\mathbf{y}). \]
In supervised learning, this idea is closely related to model training through loss minimization. Instead of saying that the model finds parameter values that maximize the log-likelihood,
\[ \hat{\boldsymbol{\theta}}_{\operatorname{MLE,obs}} = \underset{\boldsymbol{\theta}\in\Theta}{\operatorname{arg\,max}} \ \ell(\boldsymbol{\theta};\mathbf{y}), \]
we can equivalently say that it finds parameter values that minimize the negative log-likelihood:
\[ \hat{\boldsymbol{\theta}}_{\operatorname{MLE,obs}} = \underset{\boldsymbol{\theta}\in\Theta}{\operatorname{arg\,min}} \ \left[-\ell(\boldsymbol{\theta};\mathbf{y})\right]. \]
Thus, the negative log-likelihood can be interpreted as a loss function: smaller values indicate that the fitted model is more compatible with the observed data under the chosen probability model. This connection helps explain why maximum likelihood estimation appears so often behind the scenes in regression and machine learning software.
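The equivalence between maximizing the log-likelihood and minimizing the negative log-likelihood can be sketched with a simple grid search. This is an illustration only: it assumes an i.i.d. Bernoulli model with 325 successes out of 500 observations, and the grid search stands in for the numerical optimizers that statistical software actually uses.

```python
import math

n_success, n_total = 325, 500  # counts assumed for this illustration

def log_likelihood(pi):
    """Bernoulli log-likelihood for the assumed counts."""
    return n_success * math.log(pi) + (n_total - n_success) * math.log(1 - pi)

def negative_log_likelihood(pi):
    """Loss-function view of the same quantity."""
    return -log_likelihood(pi)

# Candidate values 0.001, 0.002, ..., 0.999 (endpoints excluded to avoid log(0)).
grid = [i / 1000 for i in range(1, 1000)]

pi_hat_max = max(grid, key=log_likelihood)
pi_hat_min = min(grid, key=negative_log_likelihood)

print(pi_hat_max, pi_hat_min)
```

Both searches return the same grid point, which coincides with the sample proportion \(325/500\).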
Mean
Let \(Y\) be a random variable with support \(\mathcal{Y}\). The mean, or expected value, of \(Y\) is a probability-weighted average of the possible values of the random variable.
If \(Y\) is discrete with probability mass function (PMF) \(p_Y(y)\), then
\[ \mathbb{E}(Y) = \sum_{y \in \mathcal{Y}} y\,p_Y(y). \]
If \(Y\) is continuous with probability density function (PDF) \(f_Y(y)\), then
\[ \mathbb{E}(Y) = \int_{\mathcal{Y}} y\,f_Y(y)\,dy. \]
The mean is a population-level summary: it describes the centre of the probability distribution, not the average of one particular observed sample.
Equivalent to:
Expected value.
Measure of central tendency
A measure of central tendency is a summary that identifies a central or typical value of a probability distribution. Probabilistically, it describes where a random variable tends to be located when we imagine observing many realizations from the same population or system.
Measure of uncertainty
A measure of uncertainty summarizes the spread or variability of a random variable around its central tendency. Larger values indicate that the random variable tends to vary more across realizations, while smaller values indicate that realizations tend to be more tightly concentrated.
N
Null hypothesis
The null hypothesis, denoted by \(H_0\), is the reference statement about a population or system parameter. It often represents a status quo, benchmark value, absence of effect, or simpler model. Mathematically, if the parameter vector is \(\boldsymbol{\theta}\in\Theta\), the null hypothesis can be written as
\[ H_0 \text{: } \boldsymbol{\theta} \in \Theta_0, \qquad \Theta_0 \subseteq \Theta. \]
In practice, the null hypothesis should be translated into plain language so that the statistical test remains connected to the original inquiry.
Null distribution
The null distribution is the probability distribution of a test statistic computed under the assumption that the null hypothesis is true. It tells us what values of the test statistic would be ordinary or unusual if the null hypothesis were the correct reference condition.
O
Observed effect
An observed effect is the difference between an observed estimate and the value of the parameter specified by the null hypothesis. For a scalar parameter \(\theta\), an observed effect often has the form
\[ \hat{\theta}_{\operatorname{obs}} - \theta_0, \]
where \(\hat{\theta}_{\operatorname{obs}}\) is the estimate computed from the observed sample and \(\theta_0\) is the null value.
Outcome
An outcome is the measurement or quantity of main interest in a supervised learning or regression analysis problem. It is the variable whose behaviour we aim to explain, model, or predict using one or more explanatory variables.
Depending on the application, the outcome may represent a continuous quantity, a count, a binary result, a proportion, or another type of measurement. The nature of the outcome plays a fundamental role in determining the appropriate regression model and probability model for the analysis.
Equivalent to:
Dependent variable, endogenous variable, response variable, output or target.
Output
An output is the measurement or quantity of main interest in a supervised learning or regression analysis problem. It is the variable whose behaviour we aim to explain, model, or predict using one or more explanatory variables.
Depending on the application, the output may represent a continuous quantity, a count, a binary result, a proportion, or another type of measurement. The nature of the output plays a fundamental role in determining the appropriate regression model and probability model for the analysis.
Equivalent to:
Dependent variable, endogenous variable, response variable, outcome or target.
Overdispersion
P
Parameter
A parameter is an unknown characteristic that summarizes some feature of a population or system. Parameters are commonly numerical, such as a mean, variance, probability, rate, or a regression coefficient, although some summaries can be categorical in non-regression settings.
In this book, scalar parameters are usually denoted by Greek letters such as \(\theta\), \(\mu\), \(\sigma^2\), \(\lambda\), or \(\beta_j\) (for more insights, see Appendix B). The word scalar means that the population parameter is a single numerical quantity. For example, \(\mu\) may denote one population mean, \(\lambda\) may denote one rate parameter, and \(\beta_j\) may denote one regression coefficient associated with the \(j\)th regressor.
When a model contains several parameters, it is often more convenient to collect them into a vector. A parameter vector will usually be denoted by a bold symbol such as \(\boldsymbol{\beta}\) or \(\boldsymbol{\theta}\). For instance, in a regression model with an intercept and \(k\) regressors, we may write
\[ \boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_k)^\top. \]
Here, \(\boldsymbol{\beta}\) denotes the full vector of regression coefficients, while each \(\beta_j\) (\(j = 0, 1, \dots, k\)) denotes one scalar component of that vector. This notation helps us distinguish between questions about a single coefficient, such as whether \(\beta_1 = 0\), and questions about the collection of parameters that defines the fitted regression model as a whole.
Parametric family
A parametric family is a collection of probability distributions that share the same mathematical form but differ according to the value of one or more unknown parameters. If \(Y\) is a random variable and \(\boldsymbol{\theta}\) denotes a parameter vector, we can write a parametric family abstractly as
\[ \{p_Y(y; \boldsymbol{\theta}) : \boldsymbol{\theta} \in \Theta\} \]
for a discrete random variable, or as
\[ \{f_Y(y; \boldsymbol{\theta}) : \boldsymbol{\theta} \in \Theta\} \]
for a continuous random variable.
Here, \(\Theta\) denotes the parameter space, which is the set of possible values the parameter vector \(\boldsymbol{\theta}\) can take. Each specific value of \(\boldsymbol{\theta}\) identifies one member of the family.
Parametric model
A parametric model is a type of model that assumes a specific functional relationship between the response variable of interest, \(Y\), which is considered a random variable, and one or more observed explanatory variables, \(x\). This relationship is characterized by a finite set of parameters and can often be expressed as a linear combination of the observed \(x\) variables, which favours interpretability.
Moreover, since \(Y\) is a random variable, there is room to make further assumptions on it in the form of a probability distribution, independence or even homoscedasticity (the condition where all responses in the population have the same variance). It is essential to check these assumptions after fitting this type of model, as any deviations may result in misleading or biased estimates, predictions, and inferential conclusions.
Point estimate
A point estimate is a single numerical value used to estimate an unknown population or system parameter. It is obtained by applying an estimator to the observed data.
Population
A population is the whole collection of individuals or items that share the attributes we want to study. In a statistical analysis, the population should be defined as precisely as possible, as it determines the scope of the conclusions we can draw.
Power
The power of a hypothesis test is the probability of rejecting the null hypothesis (\(H_0\)) when a particular alternative hypothesis (\(H_1\)) is true:
\[ \operatorname{Power} = \Pr(\text{Reject } H_0 \mid H_1 \text{ is true in a specified way}). \]
If \(\beta\) denotes the probability of a Type II error for a specified alternative hypothesis, then
\[ \operatorname{Power} = 1-\beta. \]
Power depends on the sample size, significance level, variability, test direction, and effect size that the study aims to detect.
Equivalent to:
True positive rate.
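As a rough sketch of how power is computed, assume a right-tailed \(z\)-test for a population mean with known standard deviation. The test, the null and alternative values, and every numerical setting below are assumptions made for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()

alpha = 0.05           # significance level
sigma, n = 1.0, 25     # assumed known standard deviation and sample size
mu_0, mu_1 = 0.0, 0.5  # null value and the specific alternative of interest

# Critical value of the right-tailed test under H_0.
z_crit = std_normal.inv_cdf(1 - alpha)

# Under the alternative, the z-statistic is centred at this shift.
shift = (mu_1 - mu_0) * n ** 0.5 / sigma

# Type II error probability: failing to reject when mu = mu_1.
beta = std_normal.cdf(z_crit - shift)
power = 1 - beta

print(round(beta, 3), round(power, 3))
```

Increasing the sample size, the effect size, or the significance level raises the shift relative to the critical value and therefore increases power.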
Power analysis
Power analysis is a set of statistical tools used to compute the minimum required sample size \(n\) for a given inferential study. These tools require the significance level, the desired power, and the effect size (i.e., the magnitude of the signal) the researcher aims to detect via their inferential study. The goal is to plan a study that has a high enough probability of detecting a true and meaningful effect when such an effect is actually present.
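A sample size calculation can be sketched for one simple setting. The code assumes a right-tailed \(z\)-test for a mean with known standard deviation, for which the required sample size is \(n = \left[(z_{1-\alpha} + z_{\text{power}})\,\sigma / \delta\right]^2\), rounded up. The function name `required_n` and all inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def required_n(alpha, power, effect, sigma):
    """Minimum n for a right-tailed z-test (illustrative setting only)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)  # critical value under H_0
    z_power = std_normal.inv_cdf(power)      # quantile matching the target power
    return ceil(((z_alpha + z_power) * sigma / effect) ** 2)

n_required = required_n(alpha=0.05, power=0.80, effect=0.5, sigma=1.0)
print(n_required)
```

Other tests (two-sided tests, \(t\)-tests, tests for proportions) use different formulas, but they take the same three inputs: significance level, power, and effect size.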
Predictor
A predictor is a variable used to help explain, describe, or predict the behaviour of a response variable in a regression model. Predictors represent the observed characteristics or conditions associated with each observational unit. Depending on the context, predictors may be quantitative or categorical, and their role may be primarily explanatory, predictive, or both.
Equivalent to:
Attribute, covariate, exogenous variable, explanatory variable, feature, independent variable, input or regressor.
Probability
Let \(A\) be an event in a random phenomenon with sample space \(\mathcal{S}\). The probability of event \(A\) is denoted by
\[ \Pr(A). \]
Under a frequentist interpretation, \(\Pr(A)\) can be understood as the limiting relative frequency of event \(A\) over repeated observations of the same random phenomenon:
\[ \Pr(A) = \lim_{n \to \infty} \frac{\text{Number of times event } A \text{ is observed in } n \text{ repetitions}}{n}. \]
Note that a probability must satisfy
\[ 0 \leq \Pr(A) \leq 1. \]
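The frequentist interpretation can be illustrated with a small simulation. The sketch below assumes a hypothetical random phenomenon (rolling a fair six-sided die) and takes event \(A\) to be "the die shows a six", so that \(\Pr(A) = 1/6\). The helper name `relative_frequency` and the seed are arbitrary choices.

```python
import random


def relative_frequency(n_repetitions, seed=2024):
    """Share of simulated fair-die rolls that show a six (hypothetical example)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = sum(1 for _ in range(n_repetitions) if rng.randint(1, 6) == 6)
    return hits / n_repetitions


# Relative frequency of event A for an increasing number of repetitions.
frequencies = {n: relative_frequency(n) for n in (100, 10_000, 100_000)}
```

Each relative frequency lies between 0 and 1, and the values are expected to settle near \(1/6\) as the number of repetitions grows. A finite simulation can only suggest the limit; it does not prove it.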
Probability distribution
Let \(Y\) be a random variable with support \(\mathcal{Y}\). A probability distribution describes how probability is assigned to the possible values of \(Y\):
- If \(Y\) is discrete, the distribution assigns probability to individual values in \(\mathcal{Y}\).
- If \(Y\) is continuous, the distribution assigns probability to intervals of values in \(\mathcal{Y}\).
Probability density function
Let \(Y\) be a continuous random variable with support \(\mathcal{Y}\). A probability density function (PDF) is a function \(f_Y(y)\) used to compute probabilities over intervals. Specifically, for two values \(a\) and \(b\) in the support of \(Y\), with \(a \leq b\),
\[ \Pr(a \leq Y \leq b) = \int_a^b f_Y(y)\,dy. \]
Thus, for a continuous random variable, the PDF is not itself a probability at a single point. Instead, probabilities are obtained by integrating the PDF over intervals.
A valid PDF must satisfy two conditions:
\[ f_Y(y) \geq 0 \quad \text{for all } y \in \mathcal{Y}, \]
and
\[ \int_{\mathcal{Y}} f_Y(y)\,dy = 1. \]
If the PDF depends on an unknown parameter or parameter vector, we will write this dependence using a semicolon. For example,
\[ f_Y(y;\boldsymbol{\theta}) \]
denotes the PDF of \(Y\) evaluated at \(y\), under a probability model controlled by the parameter vector \(\boldsymbol{\theta}\). The value before the semicolon, \(y\), is a possible value of the random variable. The quantity after the semicolon, \(\boldsymbol{\theta}\), contains the parameter or parameters that determine the shape or behaviour of the density.
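As a numerical illustration of these properties, the following sketch uses an Exponential density, \(f_Y(y;\lambda) = \lambda e^{-\lambda y}\) for \(y \geq 0\), which is an example chosen here and not one prescribed by this definition. The midpoint-rule integrator `integrate`, the rate value, and the truncation of the support at 40 are all simplifying assumptions.

```python
from math import exp


def exp_pdf(y, rate):
    """Exponential density f_Y(y; rate); zero outside the support y >= 0."""
    return rate * exp(-rate * y) if y >= 0 else 0.0


def integrate(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


rate = 2.0  # illustrative parameter value
prob_interval = integrate(lambda y: exp_pdf(y, rate), 0.5, 1.5)  # Pr(0.5 <= Y <= 1.5)
total_area = integrate(lambda y: exp_pdf(y, rate), 0.0, 40.0)    # close to 1
density_at_zero = exp_pdf(0.0, rate)                             # a density value, not a probability
```

Here `density_at_zero` equals 2, which exceeds 1. This is allowed because a density value is not a probability; only the integrated area over an interval is.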
Probability mass function
Let \(Y\) be a discrete random variable with support \(\mathcal{Y}\). The probability mass function (PMF) of \(Y\) is the function
\[ p_Y(y) = \Pr(Y = y), \]
which assigns a probability to each possible value \(y \in \mathcal{Y}\). Thus, for any possible value \(y\), the PMF evaluated at \(y\) gives the probability that the random variable \(Y\) takes that value. A valid PMF must satisfy
\[ p_Y(y) \geq 0 \quad \text{for all } y \in \mathcal{Y}, \]
and
\[ \sum_{y \in \mathcal{Y}} p_Y(y) = 1. \]
For values outside the support, we define
\[ p_Y(y) = 0 \quad \text{for } y \notin \mathcal{Y}. \]
If the PMF depends on an unknown parameter or parameter vector, we will write this dependence using a semicolon. For example,
\[ p_Y(y;\boldsymbol{\theta}) \]
denotes the PMF of \(Y\) evaluated at \(y\), under a probability model controlled by the parameter vector \(\boldsymbol{\theta}\). The value before the semicolon, \(y\), is a possible value of the random variable. The quantity after the semicolon, \(\boldsymbol{\theta}\), contains the parameter or parameters that determine the probabilities assigned by the model.
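The conditions above can be checked numerically for a concrete model. The sketch below assumes a Binomial random variable with a hypothetical number of trials `n_trials` and success probability `pi`; the function name `binomial_pmf` is also an illustrative choice.

```python
from math import comb


def binomial_pmf(y, n, pi):
    """PMF p_Y(y; n, pi) of a Binomial random variable with support {0, 1, ..., n}."""
    if not (isinstance(y, int) and 0 <= y <= n):
        return 0.0  # values outside the support receive probability zero
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)


n_trials, pi = 4, 0.3  # illustrative parameter values
pmf_values = [binomial_pmf(y, n_trials, pi) for y in range(n_trials + 1)]
total_probability = sum(pmf_values)               # should equal 1 up to rounding
outside_support = binomial_pmf(7, n_trials, pi)   # 7 is not in the support
```

Every entry of `pmf_values` is non-negative, the entries add up to 1 (up to floating-point rounding), and the value requested outside the support is 0.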
Probability model
A probability model is a mathematical representation of how a random variable behaves under uncertainty in a population or system of interest. It specifies the possible values the random variable can take together with a distribution that describes how probable those values are, usually through one or more governing parameters. This idea aligns with the notion of a generative model, where we use a probability distribution to describe how the observed data could have been generated from the population or system under study.
\(p\)-value
A \(p\)-value is the probability, computed under the null hypothesis, of observing a test statistic as extreme as or more extreme than the one obtained from the observed sample, in the direction specified by the alternative hypothesis. The \(p\)-value is not the probability that the null hypothesis is true. It is a probability about the test statistic under the null hypothesis.
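The role of the alternative hypothesis in defining "more extreme" can be sketched in code. The snippet below assumes that the test statistic follows a standard Normal distribution under the null hypothesis, which holds only for some tests. The function `z_p_value` and its `alternative` labels are hypothetical names.

```python
from statistics import NormalDist


def z_p_value(z_obs, alternative="two-sided"):
    """p-value for an observed z statistic (standard Normal null distribution assumed)."""
    z = NormalDist()
    if alternative == "greater":      # extreme means large positive values
        return 1 - z.cdf(z_obs)
    if alternative == "less":         # extreme means large negative values
        return z.cdf(z_obs)
    if alternative == "two-sided":    # extreme means far from zero in either direction
        return 2 * (1 - z.cdf(abs(z_obs)))
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")


# Illustrative observed statistic.
p_two_sided = z_p_value(1.96)
p_greater = z_p_value(1.96, alternative="greater")
p_less = z_p_value(1.96, alternative="less")
```

For the same observed statistic, the three alternatives give three different \(p\)-values: the two-sided value is close to 0.05, the "greater" value is half of it, and the "less" value is close to 1.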
R
Random sample
A random sample is a collection of random variables
\[ Y_1, Y_2, \ldots, Y_n \]
with support \(\mathcal{Y}\), that are commonly assumed to be independent and identically distributed from the same probability distribution. This assumption is abbreviated as i.i.d.
The phrase identically distributed means that each random variable in the sample is governed by the same probability model. The phrase independent means that the joint probability distribution of the full sample factors into the product of the marginal distributions of the individual observations.
If the common distribution is represented by \(F_Y\), we write
\[ Y_1, Y_2, \ldots, Y_n \overset{\text{i.i.d.}}{\sim} F_Y. \]
More explicitly, if the common probability model depends on an unknown parameter vector \(\boldsymbol{\theta}\), then each observation is governed by the same parameter vector. The notation \(\boldsymbol{\theta}\) may contain one parameter or several parameters, depending on the model. For example, \(\boldsymbol{\theta}\) could contain only one probability parameter, such as \(\pi\), or it could contain several regression coefficients, such as
\[ \boldsymbol{\beta} = (\beta_0,\beta_1,\ldots,\beta_k)^\top. \]
For a discrete i.i.d. random sample with common probability mass function (PMF) \(p_Y(y;\boldsymbol{\theta})\), the joint PMF is
\[ p_{Y_1,\ldots,Y_n}(y_1,\ldots,y_n;\boldsymbol{\theta}) = \prod_{i=1}^n p_Y(y_i;\boldsymbol{\theta}), \qquad y_i \in \mathcal{Y}. \tag{A.2}\]
For a continuous i.i.d. random sample with common probability density function (PDF) \(f_Y(y;\boldsymbol{\theta})\), the joint PDF is
\[ f_{Y_1,\ldots,Y_n}(y_1,\ldots,y_n;\boldsymbol{\theta}) = \prod_{i=1}^n f_Y(y_i;\boldsymbol{\theta}), \qquad y_i \in \mathcal{Y}. \tag{A.3}\]
The semicolon in Equation A.2 and Equation A.3 follows the notation convention for parametrized PMFs and PDFs. The values \(y_1,\ldots,y_n\) appear before the semicolon because they are possible observed values of the random variables. The parameter vector \(\boldsymbol{\theta}\) appears after the semicolon because it is part of the model specification.
Once data are collected, the realized values are written as
\[ y_1, y_2, \ldots, y_n. \]
The uppercase symbols \(Y_1,\ldots,Y_n\) denote the random variables before observation, while the lowercase symbols \(y_1,\ldots,y_n\) denote the observed values after data collection.
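The factorization of the joint PMF under the i.i.d. assumption can be sketched for a simple case. The example below assumes a Bernoulli model with a single parameter \(\pi\) and a hypothetical observed sample; the helper names `bernoulli_pmf` and `joint_pmf` are illustrative.

```python
from math import prod


def bernoulli_pmf(y, pi):
    """Common PMF p_Y(y; pi) with support {0, 1}."""
    if y == 1:
        return pi
    if y == 0:
        return 1 - pi
    return 0.0  # outside the support


def joint_pmf(observed, pi):
    """Joint PMF of an i.i.d. sample: the product of the common PMF over observations."""
    return prod(bernoulli_pmf(y, pi) for y in observed)


observed_sample = [1, 0, 1, 1]                          # hypothetical realized values y_1, ..., y_4
joint_probability = joint_pmf(observed_sample, pi=0.6)  # 0.6 * 0.4 * 0.6 * 0.6
```

The same parameter value is passed to every factor, which reflects the "identically distributed" part of the assumption, while taking the product reflects independence.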
Random variable
A random variable is a function that assigns a numerical value to each possible outcome of a random phenomenon. Before the phenomenon is observed, the random variable represents uncertainty. After observation, we obtain a realized value. In this book, an uppercase letter such as \(Y\) denotes a random variable, whereas a lowercase letter such as \(y_i\) denotes the \(i\)th observed realization.
Regression analysis
Regression analysis is a collection of statistical methods used to study the relationship between a response variable and one or more explanatory variables. Its goals may include explaining associations, estimating effects, quantifying uncertainty, or predicting future outcomes.
More broadly, regression analysis provides a principled framework for connecting a scientific or practical question to data, a probability model, an estimation method, and an interpretation of results. The specific form of the regression model depends on the nature of the response variable, the modelling assumptions, and whether the main objective is inference, prediction, or both.
Regressor
A regressor is a variable used to help explain, describe, or predict the behaviour of a response variable in a regression model. Regressors represent the observed characteristics or conditions associated with each observational unit. Depending on the context, regressors may be quantitative or categorical, and their role may be primarily explanatory, predictive, or both.
Equivalent to:
Attribute, covariate, exogenous variable, explanatory variable, feature, independent variable, input or predictor.
Residual
A residual is the difference between the observed response variable and its fitted value from the model. For the \(i\)th observation, the residual is
\[ e_i = y_i - \hat{y}_i, \]
where \(y_i\) is the observed response variable and \(\hat{y}_i\) is the fitted value.
Residuals measure the part of the response variable that the fitted model did not explain through its systematic component. Many goodness-of-fit diagnostics are built by inspecting whether these residuals behave like the model’s assumed “random noise.”
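As a small worked sketch, the snippet below fits a straight line by ordinary least squares to a hypothetical dataset and then computes the residuals \(e_i = y_i - \hat{y}_i\). The data values and the choice of a single regressor are assumptions made only for illustration.

```python
from statistics import fmean

# Hypothetical observed data: one regressor x and one response y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Closed-form least-squares estimates for a simple linear regression.
x_bar, y_bar = fmean(x), fmean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Fitted values and residuals e_i = y_i - y_hat_i.
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
```

For a least-squares fit that includes an intercept, the residuals add up to zero (up to rounding), so diagnostics usually focus on their pattern and spread, not on their average.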
Response variable
A response variable is the measurement or quantity of main interest in a supervised learning or regression analysis problem. It is the variable whose behaviour we aim to explain, model, or predict using one or more explanatory variables.
Depending on the application, the response variable may represent a continuous quantity, a count, a binary result, a proportion, or another type of measurement. The nature of the response variable plays a fundamental role in determining the appropriate regression model and probability model for the analysis.
Equivalent to:
Dependent variable, endogenous variable, outcome, output or target.
S
Sample space
The sample space \(\mathcal{S}\) is the set of all possible outcomes of a random phenomenon. An event \(A\) is a subset of the sample space, so
\[ A \subseteq \mathcal{S}. \]
Note that the probability of the whole sample space is
\[ \Pr(\mathcal{S}) = 1. \]
Sampling distribution
The sampling distribution of an estimator is the probability distribution of that estimator across repeated random samples from the same population or system under the same sampling design.
Sampling variability
Sampling variability is the variation in an estimator across repeated random samples from the same population or system under the same sampling design. It reflects the fact that an estimator is a random variable before the data are observed.
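A simulation can make the ideas of sampling distribution and sampling variability tangible. The sketch below assumes a hypothetical Normal population with mean `mu` and standard deviation `sigma`, draws many random samples of size `n`, and records the sample mean of each one. All numerical settings, including the seed, are arbitrary.

```python
import random
from math import sqrt
from statistics import fmean, stdev

rng = random.Random(1)  # fixed seed so the run is reproducible
mu, sigma, n, n_samples = 10.0, 2.0, 25, 5_000  # illustrative settings

# One sample mean per simulated random sample from the same population.
sample_means = [
    fmean([rng.gauss(mu, sigma) for _ in range(n)])
    for _ in range(n_samples)
]

centre_of_means = fmean(sample_means)   # centre of the simulated sampling distribution
spread_of_means = stdev(sample_means)   # simulated sampling variability
theoretical_spread = sigma / sqrt(n)    # known result for the mean of an i.i.d. sample
```

The collection `sample_means` approximates the sampling distribution of the sample mean, and its standard deviation should be close to \(\sigma/\sqrt{n} = 0.4\) in this setting.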
Semiparametric model
A semiparametric model is a statistical model that incorporates both parametric and nonparametric parts. In the context of linear regression, these parts can be described as follows:
- The parametric part includes the systematic component where \(k\) observed regressors \(x\) are modelled along with \(k + 1\) regression parameters \(\beta_0, \beta_1, \ldots, \beta_k\) in a linear combination.
- The nonparametric part does not impose specific assumptions on one or more modelling components, allowing these elements to be estimated from the observed training dataset without assuming a specific probability distribution.
Significance level
The significance level, denoted by \(\alpha\), is the long-run probability of rejecting the null hypothesis (\(H_0\)) when the null hypothesis is true:
\[ \alpha = \Pr(\text{Reject } H_0 \mid H_0 \text{ is true}). \]
In plain language, \(\alpha\) is the test procedure’s tolerated Type I error rate. A smaller \(\alpha\) makes it harder to reject \(H_0\) and reduces the long-run probability of a Type I error, at the cost of making the test less sensitive to some alternatives.
Equivalent to:
Type I error rate.
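The "long-run" reading of \(\alpha\) can be sketched with a simulation in which the null hypothesis is true by construction. The snippet assumes a two-sided \(z\)-test for a mean with known standard deviation and simulates many hypothetical studies; the settings and seed are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

rng = random.Random(7)  # fixed seed so the run is reproducible
alpha, mu_0, sigma, n, n_studies = 0.05, 0.0, 1.0, 10, 20_000  # illustrative settings
critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cut-off for the z statistic

rejections = 0
for _ in range(n_studies):
    sample = [rng.gauss(mu_0, sigma) for _ in range(n)]  # data generated with H0 true
    z_obs = (fmean(sample) - mu_0) / (sigma / sqrt(n))
    if abs(z_obs) > critical:
        rejections += 1  # a Type I error, since H0 is true in every simulated study

rejection_rate = rejections / n_studies
```

Across the simulated studies, the share of rejections should be close to the chosen \(\alpha\), even though every one of those rejections is a mistake.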
Standard error in estimation
The standard error of an estimator is the standard deviation of its sampling distribution. It quantifies how much the estimator would vary across repeated random samples. In general, for an estimator \(\hat{\theta}\),
\[ \operatorname{SE}(\hat{\theta}) = \operatorname{SD}(\hat{\theta}). \]
In practice, the standard error is usually estimated from the observed data.
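For instance, when the estimator is the sample mean of an i.i.d. sample, its standard deviation is \(\sigma/\sqrt{n}\), which is commonly estimated by replacing \(\sigma\) with the sample standard deviation \(s\). The sketch below applies that plug-in estimate to a hypothetical dataset; the helper name `standard_error_of_mean` is illustrative.

```python
from math import sqrt
from statistics import stdev


def standard_error_of_mean(sample):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    return stdev(sample) / sqrt(len(sample))  # stdev uses the n - 1 denominator


observed = [4.0, 8.0, 6.0, 5.0, 7.0]  # hypothetical observed values
se_hat = standard_error_of_mean(observed)
```

Other estimators, such as regression coefficients, have their own standard error formulas, so this sketch covers only the sample mean.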
Standard error in inference
A standard error measures the repeated-sampling variability of an estimator. In hypothesis testing, the standard error is used to judge whether the observed effect is large or small relative to the variability expected under the sampling process.
Statistical hypothesis
A statistical hypothesis is a statement about an unknown population or system parameter. Suppose a probability model contains a parameter vector
\[ \boldsymbol{\theta} = (\theta_1,\theta_2,\ldots,\theta_k)^{\top}, \]
with parameter space \(\Theta\). A statistical hypothesis restricts the possible values of \(\boldsymbol{\theta}\) to a subset of that parameter space:
\[ H \text{: } \boldsymbol{\theta} \in \Theta^*, \qquad \Theta^* \subseteq \Theta. \]
In applied work, this mathematical statement should also have a plain-language interpretation connected to the data science inquiry.
Supervised learning
Supervised learning is a data modelling framework in which we use observed pairs \((\mathbf{x}_i, y_i)\) for the \(i\)th observation, where \(\mathbf{x}_i\) denotes the explanatory variables recorded for that observation, to learn a rule or function that maps explanatory variables to a response variable. The goal may be primarily predictive (accurate future predictions), primarily inferential (understanding associations or effects between \(\mathbf{x}_i\) and \(y_i\)), or a mixture of both.
Survival analysis
System
A system is a process, mechanism, or operational setting whose behaviour is governed by unknown features we want to study. The word system is useful when the object of study is not naturally a collection of people or items.
T
Target
A target is the measurement or quantity of main interest in a supervised learning or regression analysis problem. It is the variable whose behaviour we aim to explain, model, or predict using one or more explanatory variables.
Depending on the application, the target may represent a continuous quantity, a count, a binary result, a proportion, or another type of measurement. The nature of the target plays a fundamental role in determining the appropriate regression model and probability model for the analysis.
Equivalent to:
Dependent variable, endogenous variable, response variable, outcome or output.
Test set
A test set (or testing set) is the subset of the observed random sample reserved for a final and unbiased assessment of model performance after all model-building decisions have been completed. Its role is to provide a trustworthy evaluation of how well the chosen model is expected to perform on future or unseen observations from the same population or system of interest in predictive settings.
On the other hand, in inferential settings, the test set may serve a protective role against double dipping, since it allows us to reserve part of the observed random sample for a more formal assessment (via hypothesis testing) after exploratory work has been carried out on the training set.
Test statistic
A test statistic is a numerical summary of the observed sample used to measure how far the observed evidence is from the reference behaviour implied by the null hypothesis.
Training set
A training set is the subset of the observed random sample used to fit or estimate one or more candidate models. In regression analysis, this means that the training set is used to estimate unknown model terms, such as regression coefficients, and to carry out classical diagnostic checks on the fitted model.
In a supervised learning context, the training set is the portion of the observed random sample from which the model primarily learns the relationship between the explanatory variables and the response variable.
True positive rate
The true positive rate (also known as power) of a hypothesis test is the probability of rejecting the null hypothesis (\(H_0\)) when a particular alternative hypothesis (\(H_1\)) is true:
\[ \operatorname{Power} = \Pr(\text{Reject } H_0 \mid H_1 \text{ is true in a specified way}). \]
If \(\beta\) denotes the probability of a Type II error for a specified alternative hypothesis, then
\[ \operatorname{Power} = 1-\beta. \]
The true positive rate depends on the sample size, significance level, variability, test direction, and effect size that the study aims to detect.
Equivalent to:
Power.
Type I error rate
The Type I error rate, denoted by \(\alpha\), is the long-run probability of rejecting the null hypothesis (\(H_0\)) when the null hypothesis is true:
\[ \alpha = \Pr(\text{Reject } H_0 \mid H_0 \text{ is true}). \]
In other words, \(\alpha\) is the significance level of the test procedure. A smaller \(\alpha\) makes it harder to reject \(H_0\) and reduces the long-run probability of a Type I error, at the cost of making the test less sensitive to some alternatives.
Equivalent to:
Significance level.
Type I error
A Type I error occurs when a test rejects the null hypothesis \(H_0\) even though \(H_0\) is true. In plain language, this is a false positive: the test concludes that there is enough evidence against the null hypothesis, but the null hypothesis is actually true.
Equivalent to:
False positive.
Type II error
A Type II error occurs when a test fails to reject the null hypothesis \(H_0\) even though \(H_0\) is false in a scientifically or practically relevant way. In plain language, this is a false negative: the test does not detect a departure from the null hypothesis, even though such a departure is actually present.
Equivalent to:
False negative.
U
Underdispersion
V
Validation set
A validation set is the subset of the observed random sample used to compare candidate models, modelling strategies, or tuning decisions after those models have been fitted on the training set. Its main role is to assess how well different modelling choices generalize to unseen observations before a final model is selected.
In predictive regression, the validation set is especially useful when comparing alternative model specifications, such as different sets of explanatory variables, transformations, or interaction terms. It is not primarily used for classical residual diagnostics, but rather for model comparison and predictive assessment.
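One simple way to obtain training, validation, and test sets is to shuffle the row indices of the observed sample and cut them into three disjoint blocks. The sketch below assumes a 60/20/20 split and a hypothetical helper named `three_way_split`; the proportions are a common convention, not a rule stated in this book.

```python
import random


def three_way_split(n_obs, train_share=0.6, validation_share=0.2, seed=0):
    """Randomly partition row indices into training, validation, and test sets."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is reproducible
    n_train = round(train_share * n_obs)
    n_validation = round(validation_share * n_obs)
    train = indices[:n_train]
    validation = indices[n_train:n_train + n_validation]
    test = indices[n_train + n_validation:]  # whatever remains is held out
    return train, validation, test


# Illustrative sample of 100 observations.
train_idx, validation_idx, test_idx = three_way_split(100)
```

Because the three index lists do not overlap, no observation used to fit or compare models is reused in the final assessment.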
Variance
Let \(Y\) be a discrete or continuous random variable with finite expected value \(\mathbb{E}(Y)\). The variance of \(Y\) is the expected squared deviation from its mean:
\[ \operatorname{Var}(Y) = \mathbb{E}\left\{[Y - \mathbb{E}(Y)]^2\right\}. \]
An equivalent expression is
\[ \operatorname{Var}(Y) = \mathbb{E}(Y^2) - [\mathbb{E}(Y)]^2. \]
The standard deviation of \(Y\) is
\[ \operatorname{SD}(Y) = \sqrt{\operatorname{Var}(Y)}. \]
The standard deviation is often easier to interpret than the variance because it is measured in the same units as the random variable.
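The two variance expressions can be compared numerically. The sketch below assumes a hypothetical discrete random variable, the outcome of a fair six-sided die, and stores its PMF in a dictionary.

```python
from math import sqrt

# Hypothetical PMF: a fair six-sided die, each value with probability 1/6.
pmf = {y: 1 / 6 for y in range(1, 7)}

mean = sum(y * p for y, p in pmf.items())                        # E(Y)
var_definition = sum((y - mean) ** 2 * p for y, p in pmf.items())  # E{[Y - E(Y)]^2}
var_shortcut = sum(y**2 * p for y, p in pmf.items()) - mean**2     # E(Y^2) - [E(Y)]^2
sd = sqrt(var_definition)                                        # same units as Y
```

Both expressions give \(35/12 \approx 2.92\) for this example, with a standard deviation of about 1.71 on the scale of the die faces.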
