Appendix A — The Fusionified ML-Stats Dictionary
Fun fact!
Fusionified! A mix of flavours from various cuisines that somehow (or miraculously) works.
Machine learning and statistics share a substantial synergy that is reflected in data science. Thus, it is imperative to construct solid bridges between both disciplines to keep their tremendous amount of jargon and terminology clear. This ML-Stats dictionary (ML stands for machine learning) aims to be one of these bridges in this textbook, especially within supervised learning and regression analysis contexts.
Below, you will find definitions highlighted in blue if they correspond to statistical terminology, or in magenta if the terminology is machine learning-related. These definitions come from all the definition admonitions across this textbook, such as in (Definition-sample?). This colour scheme strives to tie all the terminology together so that we can switch from one field to the other easily. With practice and time, we should be able to jump back and forth when using these concepts.
Attention!
Noteworthy terms, whether statistical or machine learning-related, will include a particular admonition identifying which other terms (statistical or machine learning-related) are equivalent or somewhat equivalent (or even NOT equivalent, if that is the case!).
A
Alternative hypothesis
Attribute
Equivalent to:
Covariate, exogenous variable, explanatory variable, feature, independent variable, input, predictor or regressor.
Average
Let \(Y\) be a random variable whose support is \(\mathcal{Y}\). In general, the expected value or mean \(\mathbb{E}(Y)\) of this random variable is defined as a weighted average according to its corresponding probability distribution. In other words, this measure of central tendency \(\mathbb{E}(Y)\) aims to find the middle value of this random variable by weighting all its possible values in its support \(\mathcal{Y}\) as dictated by its probability distribution.
Given the above definition, when \(Y\) is a discrete random variable whose probability mass function (PMF) is \(P_Y(Y = y)\), then its weighted average is mathematically defined as
\[ \mathbb{E}(Y) = \sum_{y \in \mathcal{Y}} y \cdot P_Y(Y = y). \]
When \(Y\) is a continuous random variable whose probability density function (PDF) is \(f_Y(y)\), its weighted average is mathematically defined as
\[ \mathbb{E}(Y) = \int_{\mathcal{Y}} y \cdot f_Y(y) \mathrm{d}y. \]
Equivalent to:
Expected value or mean.
B
Bayesian statistics
This statistical school of thinking also relies on the frequency of events to estimate specific parameters of interest in a population or system. Nevertheless, unlike frequentist statisticians, Bayesian statisticians use prior knowledge about the population parameters, which they update with the current evidence they can gather. This evidence comes in the form of the repetition of \(n\) experiments involving a random phenomenon. All these ingredients allow Bayesian statisticians to make inferences by conducting appropriate hypothesis tests, which are designed differently from their mainstream frequentist counterparts.
Under the umbrella of this approach, we assume that our governing parameters are random; i.e., they have their own sample space and probabilities associated with their corresponding outcomes. The statistical process of inference is heavily backed by probability theory, mostly in the form of Bayes' theorem (named after Reverend Thomas Bayes, an English statistician from the 18th century). This theorem uses our current evidence along with our prior beliefs to deliver a posterior distribution of our random parameter(s) of interest.
Bayes’ rule
Suppose you have two events of interest, \(A\) and \(B\), in a random phenomenon of a population or system of interest. From Equation A.4, we can state the following expression for the conditional probability of \(A\) given \(B\):
\[ P(A | B) = \frac{P(A \cap B)}{P(B)} \quad \text{if $P(B) > 0$.} \tag{A.1}\]
Note the conditional probability of \(B\) given \(A\) can be stated as:
\[ \begin{align*} P(B | A) &= \frac{P(B \cap A)}{P(A)} \quad \text{if $P(A) > 0$} \\ &= \frac{P(A \cap B)}{P(A)} \quad \text{since $P(B \cap A) = P(A \cap B)$.} \end{align*} \tag{A.2}\]
Then, we can manipulate Equation A.2 as follows:
\[ P(A \cap B) = P(B | A) \times P(A). \]
The above result can be plugged into Equation A.1:
\[ \begin{align*} P(A | B) &= \frac{P(A \cap B)}{P(B)} \\ &= \frac{P(B | A) \times P(A)}{P(B)}. \end{align*} \tag{A.3}\]
Equation A.3 is called Bayes’ rule. We are essentially flipping conditional probabilities around.
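To make Equation A.3 concrete, here is a minimal numerical sketch in Python; the joint and marginal probabilities below are assumed purely for illustration.

```python
# A numerical check of Bayes' rule (Equation A.3); standard library only.
p_A_and_B = 0.12   # P(A ∩ B), an assumed illustrative value
p_A = 0.30         # P(A)
p_B = 0.40         # P(B)

p_B_given_A = p_A_and_B / p_A   # Equation A.2
p_A_given_B = p_A_and_B / p_B   # Equation A.1

# Bayes' rule recovers P(A | B) by flipping P(B | A) around.
print(p_A_given_B)               # ≈ 0.3
print(p_B_given_A * p_A / p_B)   # ≈ 0.3 as well, matching Equation A.3
```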
C
Conditional probability
Suppose you have two events of interest, \(A\) and \(B\), in a random phenomenon of a population or system of interest. These two events belong to the sample space \(S\). Moreover, assume that the probability of event \(B\), which is called the conditioning event, is such that
\[ P(B) > 0. \]
Hence, the conditional probability of event \(A\) given event \(B\) is defined as
\[ P(A | B) = \frac{P(A \cap B)}{P(B)}, \tag{A.4}\]
where \(P(A \cap B)\) is read as the probability of the intersection of events \(A\) and \(B\).
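As a quick illustration of Equation A.4, the following sketch (assuming numpy is available) estimates a conditional probability by simulation; the two dice events are our own illustrative choices.

```python
import numpy as np

# Simulate two fair dice; B = "first die shows 3", A = "the sum is 8".
rng = np.random.default_rng(seed=123)
n = 1_000_000
die1 = rng.integers(1, 7, size=n)
die2 = rng.integers(1, 7, size=n)

A = (die1 + die2) == 8
B = die1 == 3

# P(A | B) = P(A ∩ B) / P(B), estimated via relative frequencies.
print(np.mean(A & B) / np.mean(B))   # close to 1/6, since we need die2 == 5
```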
Confidence interval
Continuous random variable
Let \(Y\) be a random variable whose support is \(\mathcal{Y}\). If this support \(\mathcal{Y}\) corresponds to an uncountably infinite set of possible values, then \(Y\) is considered a continuous random variable.
Note a continuous random variable could be
- completely unbounded (i.e., its set of possible values goes from \(-\infty\) to \(\infty\) as in \(-\infty < y < \infty\)),
- positively unbounded (i.e., its set of possible values goes from \(0\) to \(\infty\) as in \(0 \leq y < \infty\)),
- negatively unbounded (i.e., its set of possible values goes from \(-\infty\) to \(0\) as in \(-\infty < y \leq 0\)), or
- bounded between two values \(a\) and \(b\) (i.e., its set of possible values goes from \(a\) to \(b\) as in \(a \leq y \leq b\)).
Covariate
Equivalent to:
Attribute, exogenous variable, explanatory variable, feature, independent variable, input, predictor or regressor.
Critical value
Cumulative distribution function
Let \(Y\) be a random variable, either discrete or continuous. Its cumulative distribution function (CDF) \(F_Y(y) : \mathbb{R} \rightarrow [0, 1]\) refers to the probability that \(Y\) is less than or equal to an observed value \(y\):
\[ F_Y(y) = P(Y \leq y). \]
Then, we have the following by type of random variable:
- When \(Y\) is discrete, whose support is \(\mathcal{Y}\), suppose it has a probability mass function (PMF) \(P_Y(Y = y)\). Then, the CDF is mathematically represented as:
\[ F_Y(y) = \sum_{\substack{t \in \mathcal{Y} \\ t \leq y}} P_Y(Y = t). \tag{A.5}\]
- When \(Y\) is continuous, whose support is \(\mathcal{Y}\), suppose it has a probability density function (PDF) \(f_Y(y)\). Then, the CDF is mathematically represented as:
\[ F_Y(y) = \int_{-\infty}^y f_Y(t) \mathrm{d}t. \tag{A.6}\]
Note that in Equation A.5 and Equation A.6 we use the auxiliary variable \(t\): since the observed value \(y\) acts as the upper limit of the summation or integral, it cannot also play the role of the variable over which we sum or integrate in the PMF or PDF.
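For instance, assuming scipy is available, the following sketch checks Equation A.5 for an arbitrarily chosen Binomial(10, 0.4) random variable.

```python
from scipy.stats import binom

# Y ~ Binomial(n = 10, p = 0.4), an illustrative choice.
n, p, y = 10, 0.4, 3

# CDF directly: F_Y(3) = P(Y <= 3).
print(binom.cdf(y, n, p))
# Equation A.5: sum the PMF over all t in the support with t <= y.
print(sum(binom.pmf(t, n, p) for t in range(y + 1)))   # same value
```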
D
Dependent variable
In supervised learning, it is the main variable of interest we are trying to learn or predict or, equivalently, the variable we are trying to explain in a statistical inference framework.
Equivalent to:
Endogenous variable, response variable, outcome, output or target.
Discrete random variable
Let \(Y\) be a random variable whose support is \(\mathcal{Y}\). If this support \(\mathcal{Y}\) corresponds to a finite set or a countably infinite set of possible values, then \(Y\) is considered a discrete random variable.
For instance, we can encounter discrete random variables which could be classified as
- binary (i.e., a finite set of two possible values),
- categorical (either nominal or ordinal, which have a finite set of three or more possible values), or
- counts (which might have a finite set or a countably infinite set of possible values as integers).
Dispersion
E
Endogenous variable
Equivalent to:
Dependent variable, outcome, output, response variable or target.
Equidispersion
Expected value
Let \(Y\) be a random variable whose support is \(\mathcal{Y}\). In general, the expected value or mean \(\mathbb{E}(Y)\) of this random variable is defined as a weighted average according to its corresponding probability distribution. In other words, this measure of central tendency \(\mathbb{E}(Y)\) aims to find the middle value of this random variable by weighting all its possible values in its support \(\mathcal{Y}\) as dictated by its probability distribution.
Given the above definition, when \(Y\) is a discrete random variable whose probability mass function (PMF) is \(P_Y(Y = y)\), then its expected value is mathematically defined as
\[ \mathbb{E}(Y) = \sum_{y \in \mathcal{Y}} y \cdot P_Y(Y = y). \tag{A.7}\]
When \(Y\) is a continuous random variable whose probability density function (PDF) is \(f_Y(y)\), its expected value is mathematically defined as
\[ \mathbb{E}(Y) = \int_{\mathcal{Y}} y \cdot f_Y(y) \mathrm{d}y. \tag{A.8}\]
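As a quick sketch of Equation A.7, consider a fair six-sided die (an illustrative choice); only the Python standard library is needed.

```python
# Expected value of a fair six-sided die via Equation A.7.
support = [1, 2, 3, 4, 5, 6]
pmf = {y: 1 / 6 for y in support}   # P_Y(Y = y) = 1/6 for each face

expected_value = sum(y * pmf[y] for y in support)
print(expected_value)               # 3.5, the weighted average of the faces
```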
Equivalent to:
Average or mean.
Exogenous variable
Equivalent to:
Attribute, covariate, explanatory variable, feature, independent variable, input, predictor or regressor.
Explanatory variable
Equivalent to:
Attribute, covariate, exogenous variable, feature, independent variable, input, predictor or regressor.
F
False negative
Equivalent to:
Type II error.
False positive
Equivalent to:
Type I error.
Feature
Equivalent to:
Attribute, covariate, exogenous variable, explanatory variable, independent variable, input, predictor or regressor.
Frequentist statistics
This statistical school of thinking heavily relies on the frequency of events to estimate specific parameters of interest in a population or system. This frequency of events is reflected in the repetition of \(n\) experiments involving a random phenomenon within this population or system.
Under the umbrella of this approach, we assume that our governing parameters are fixed. Note that, within the philosophy of this school of thinking, we can only make precise and accurate estimations as long as we repeat our experiments as many times as possible, i.e., as
\[ n \rightarrow \infty. \]
G
Generalized linear model
Generative model
Suppose you observe some data \(y\) from a population or system of interest. Moreover, let us assume this population or system is governed by \(k\) parameters contained in the following vector:
\[ \boldsymbol{\theta} = (\theta_1, \theta_2, \cdots, \theta_k)^T. \]
If we state that our observed data \(y\) follows a certain probability distribution \(\mathcal{D}(\cdot)\), then we will have a generative model \(m\) such that
\[ m: y \sim \mathcal{D}(\boldsymbol{\theta}). \]
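For instance, assuming numpy is available and taking \(\mathcal{D}(\cdot)\) to be a Normal distribution with \(\boldsymbol{\theta} = (\mu, \sigma)^T\) (both the distribution and the parameter values are illustrative assumptions), the sketch below generates data from such a model.

```python
import numpy as np

# Generative model m: y ~ Normal(mu, sigma), with assumed parameters θ.
rng = np.random.default_rng(seed=42)
mu, sigma = 5.0, 2.0
y = rng.normal(loc=mu, scale=sigma, size=1000)

# In practice we would observe y and work backwards to estimate θ.
print(y.mean(), y.std())   # close to mu = 5 and sigma = 2
```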
H
Hypothesis testing
I
Independence
Suppose you have two events of interest, \(A\) and \(B\), in a random phenomenon of a population or system of interest. These two events are statistically independent if the occurrence of event \(B\) does not affect the probability of event \(A\), and vice versa. Therefore, the probability of their intersection is given by:
\[ P(A \cap B) = P(A) \times P(B). \]
Let us expand the above definition to a random variable framework:
- Suppose you have a set of \(n\) mutually independent discrete random variables \(Y_1, \dots, Y_n\) whose supports are \(\mathcal{Y_1}, \dots, \mathcal{Y_n}\) with probability mass functions (PMFs) \(P_{Y_1}(Y_1 = y_1), \dots, P_{Y_n}(Y_n = y_n)\), respectively. Under independence, the joint PMF of these \(n\) random variables is the product of their corresponding standalone PMFs (see the sketch after this list):
\[ \begin{align*} P_{Y_1, \dots, Y_n}(Y_1 = y_1, \dots, Y_n = y_n) &= \prod_{i = 1}^n P_{Y_i}(Y_i = y_i) \\ & \qquad \text{for all} \\ & \qquad \quad y_i \in \mathcal{Y}_i, i = 1, \dots, n. \end{align*} \tag{A.9}\]
- Suppose you have a set of \(n\) mutually independent continuous random variables \(Y_1, \dots, Y_n\) whose supports are \(\mathcal{Y_1}, \dots, \mathcal{Y_n}\) with probability density functions (PDFs) \(f_{Y_1}(y_1), \dots, f_{Y_n}(y_n)\), respectively. Under independence, the joint PDF of these \(n\) random variables is the product of their corresponding standalone PDFs:
\[ \begin{align*} f_{Y_1, \dots, Y_n}(y_1, \dots, y_n) &= \prod_{i = 1}^n f_{Y_i}(y_i) \\ & \qquad \text{for all} \\ & \qquad \quad y_i \in \mathcal{Y}_i, i = 1, \dots, n. \end{align*} \tag{A.10}\]
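As a minimal sketch of Equation A.9, take \(n = 2\) independent fair dice (an illustrative choice); only the Python standard library is needed.

```python
# Joint PMF of two independent fair dice via Equation A.9.
pmf = {y: 1 / 6 for y in range(1, 7)}   # standalone PMF of each die

# Under independence, P(Y1 = 2, Y2 = 5) is the product of the
# standalone probabilities.
print(pmf[2] * pmf[5])   # 1/36 ≈ 0.0278
```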
Independent variable
Equivalent to:
Attribute, covariate, exogenous variable, explanatory variable, feature, input, predictor or regressor.
Input
Equivalent to:
Attribute, covariate, exogenous variable, explanatory variable, feature, independent variable, predictor or regressor.
M
Mean
Let \(Y\) be a random variable whose support is \(\mathcal{Y}\). In general, the expected value or mean \(\mathbb{E}(Y)\) of this random variable is defined as a weighted average according to its corresponding probability distribution. In other words, this measure of central tendency \(\mathbb{E}(Y)\) aims to find the middle value of this random variable by weighting all its possible values in its support \(\mathcal{Y}\) as dictated by its probability distribution.
Given the above definition, when \(Y\) is a discrete random variable whose probability mass function (PMF) is \(P_Y(Y = y)\), then its mean is mathematically defined as
\[ \mathbb{E}(Y) = \sum_{y \in \mathcal{Y}} y \cdot P_Y(Y = y). \]
When \(Y\) is a continuous random variable whose probability density function (PDF) is \(f_Y(y)\), its mean is mathematically defined as
\[ \mathbb{E}(Y) = \int_{\mathcal{Y}} y \cdot f_Y(y) \mathrm{d}y. \]
Equivalent to:
Average or expected value.
Measure of central tendency
Probabilistically, a measure of central tendency is defined as a metric that identifies a central or typical value of a given probability distribution. In other words, a measure of central tendency refers to a central or typical value that a given random variable might take when we observe various realizations of this variable over a long period.
Measure of uncertainty
Probabilistically, a measure of uncertainty refers to the spread of a given random variable when we observe its different realizations in the long term. Note a larger spread indicates more variability in these realizations. On the other hand, a smaller spread denotes less variability in these realizations.
N
Null hypothesis
O
Observed effect
Outcome
In supervised learning, it is the main variable of interest we are trying to learn or predict or, equivalently, the variable we are trying to explain in a statistical inference framework.
Equivalent to:
Dependent variable, endogenous variable, response variable, output or target.
Output
In supervised learning, it is the main variable of interest we are trying to learn or predict or, equivalently, the variable we are trying to explain in a statistical inference framework.
Equivalent to:
Dependent variable, endogenous variable, response variable, outcome or target.
Overdispersion
P
Parameter
It is a characteristic (numerical or even non-numerical, such as a distinctive category) that summarizes the state of our population or system of interest.
Note that population parameters are conventionally denoted by Greek letters. Moreover, in practice, these population parameter(s) of interest will be unknown to the data scientist or researcher. Instead, they would use formal statistical inference to estimate them.
Population
It is a whole collection of individuals or items that share distinctive attributes. As data scientists or researchers, we are interested in studying these attributes, which we assume are governed by parameters. In practice, we must be as specific as possible when defining our given population, so that our entire data modelling process is properly framed from its very early stages.
Note that the term population could be exchanged for the term system, given that certain contexts do not specifically refer to individuals or items. Instead, these contexts could refer to processes whose attributes are also governed by parameters.
Power
Equivalent to:
True positive rate.
Predictor
Equivalent to:
Attribute, covariate, exogenous variable, explanatory variable, feature, independent variable, input or regressor.
Probability
Let \(A\) be an event of interest in a random phenomenon, in a population or system of interest, whose possible outcomes all belong to a given sample space \(S\). Generally, the probability of this event \(A\) happening can be mathematically depicted as \(P(A)\). Moreover, suppose we observe the random phenomenon \(n\) times, as if we were running some class of experiment; then \(P(A)\) is defined as the following ratio:
\[ P(A) = \frac{\text{Number of times event $A$ is observed}}{n}, \tag{A.11}\]
as the number of times \(n\) we observe the random phenomenon goes to infinity.
Equation A.11 will always put \(P(A)\) in the following numerical range:
\[ 0 \leq P(A) \leq 1. \]
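Assuming numpy is available, the following simulation sketch illustrates Equation A.11 with the illustrative event \(A\): "a fair die shows 6"; the relative frequency approaches \(1/6\) as \(n\) grows.

```python
import numpy as np

rng = np.random.default_rng(seed=2024)
for n in [100, 10_000, 1_000_000]:
    rolls = rng.integers(1, 7, size=n)
    # Number of times event A is observed, divided by n (Equation A.11).
    print(n, np.mean(rolls == 6))   # approaches 1/6 ≈ 0.1667
```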
Probability distribution
When we set a random variable \(Y\), we also set a new set of \(v\) possible outcomes \(\mathcal{Y} = \{ y_1, \dots, y_v\}\) coming from the sample space \(S\). This new set of possible outcomes \(\mathcal{Y}\) corresponds to the range of the random variable \(Y\) (i.e., all the possible values that \(Y\) could take on once we execute a given random experiment involving it).
That said, let us suppose we have a sample space of \(u\) elements defined as
\[ S = \{ s_1, \dots, s_u \}, \]
where each one of these elements has a probability assigned via a function \(P_S(\cdot)\) such that
\[ P(S) = \sum_{i = 1}^u P_S(s_i) = 1, \]
which satisfies Equation A.14.
Then, the probability distribution of \(Y\), i.e., \(P_Y(\cdot)\), assigns a probability to each observed value \(Y = y_j\) (with \(j = 1, \dots, v\)) by collecting all the outcomes of the random experiment in the sample space, i.e., the \(s_i \in S\) (for \(i = 1, \dots, u\)) such that \(Y(s_i) = y_j\):
\[ P_Y(Y = y_j) = P \left( \left\{ s_i \in S : Y(s_i) = y_j \right\} \right). \]
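To see this mapping in action, here is a small sketch (standard library only) where \(S\) collects the four equally likely outcomes of two fair coin flips and \(Y\) counts the number of heads; both choices are ours for illustration.

```python
from itertools import product

S = list(product("HT", repeat=2))   # sample space: {HH, HT, TH, TT}
P_S = {s: 1 / 4 for s in S}         # each outcome has probability 1/4


def Y(s):
    return s.count("H")             # random variable: number of heads


# P_Y(Y = y_j) adds up P_S(s_i) over all s_i in S with Y(s_i) = y_j.
P_Y = {}
for s in S:
    P_Y[Y(s)] = P_Y.get(Y(s), 0) + P_S[s]
print(P_Y)                          # {2: 0.25, 1: 0.5, 0: 0.25}
```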
Probability density function
Let \(Y\) be a continuous random variable whose support is \(\mathcal{Y}\). Furthermore, consider a function \(f_Y(y)\) such that
\[ f_Y(y) : \mathbb{R} \rightarrow \mathbb{R} \]
with
\[ f_Y(y) \geq 0. \]
Then, \(f_Y(y)\) is considered a probability density function (PDF) if the probability of \(Y\) taking on a value within the range represented by the subset \(A \subset \mathcal{Y}\) is equal to
\[ P_Y(Y \in A) = \int_A f_Y(y) \mathrm{d}y \]
with
\[ \int_{\mathcal{Y}} f_Y(y) \mathrm{d}y = 1. \]
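Assuming scipy is available, the sketch below checks both conditions for the PDF of a standard Normal random variable (an illustrative choice).

```python
from scipy.integrate import quad
from scipy.stats import norm

# P(Y ∈ A) for A = [-1, 1]: integrate the PDF over A.
p_A, _ = quad(norm.pdf, -1, 1)
print(p_A)    # ≈ 0.6827

# The PDF integrates to one over the whole support.
total, _ = quad(norm.pdf, float("-inf"), float("inf"))
print(total)  # ≈ 1.0
```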
Probability mass function
Let \(Y\) be a discrete random variable whose support is \(\mathcal{Y}\). Moreover, suppose that \(Y\) has a probability distribution such that
\[ P_Y(Y = y) : \mathbb{R} \rightarrow [0, 1] \]
where, for all \(y \notin \mathcal{Y}\), we have
\[ P_Y(Y = y) = 0 \]
and
\[ \sum_{y \in \mathcal{Y}} P_Y(Y = y) = 1. \]
Then, \(P_Y(Y = y)\) is considered a probability mass function (PMF).
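For instance, assuming scipy is available, the sketch below checks these conditions for a Binomial(5, 0.3) random variable (an illustrative choice).

```python
from scipy.stats import binom

n, p = 5, 0.3
support = range(n + 1)   # Y takes values in {0, 1, ..., 5}

probs = [binom.pmf(y, n, p) for y in support]
print(all(0 <= q <= 1 for q in probs))   # True: each probability is in [0, 1]
print(sum(probs))                        # 1.0: the PMF sums to one
print(binom.pmf(7, n, p))                # 0.0 for a value outside the support
```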
\(p\)-value
R
Random sample
A random sample is a collection of random variables \(Y_1, \dots, Y_n\) of size \(n\) coming from a given population or system of interest. Note that the most elementary definition of a random sample assumes that these \(n\) random variables are mutually independent and identically distributed (which is abbreviated as iid).
The fact that these \(n\) random variables are identically distributed indicates that they have the same mathematical form for their corresponding probability mass functions (PMFs) or probability density functions (PDFs), depending on whether they are discrete or continuous, respectively. Hence, under a generative modelling approach in a population or system of interest governed by \(k\) parameters contained in the vector
\[ \boldsymbol{\theta} = (\theta_1, \theta_2, \cdots, \theta_k)^T, \]
we can apply the iid property in an elementary random sample to obtain the following joint probability distributions:
- In the case of \(n\) iid discrete random variables \(Y_1, \dots, Y_n\) whose common standalone PMF is \(P_Y(Y = y)\) with support \(\mathcal{Y}\), the joint PMF is mathematically expressed as
\[ \begin{align*} P_{Y_1, \dots, Y_n}(Y_1 = y_1, \dots, Y_n = y_n | \boldsymbol{\theta}) &= \prod_{i = 1}^n P_Y(Y = y_i | \boldsymbol{\theta}) \\ & \qquad \text{for all} \\ & \qquad \quad y_i \in \mathcal{Y}, i = 1, \dots, n. \end{align*} \tag{A.12}\]
- In the case of \(n\) iid continuous random variables \(Y_1, \dots, Y_n\) whose common standalone PDF is \(f_Y(y)\) with support \(\mathcal{Y}\), the joint PDF is mathematically expressed as
\[ \begin{align*} f_{Y_1, \dots, Y_n}(y_1, \dots, y_n | \boldsymbol{\theta}) &= \prod_{i = 1}^n f_Y(y_i | \boldsymbol{\theta}) \\ & \qquad \text{for all} \\ & \qquad \quad y_i \in \mathcal{Y}, i = 1, \dots, n. \end{align*} \tag{A.13}\]
Unlike Equation A.9 and Equation A.10, note that Equation A.12 and Equation A.13 use the single subscript \(Y\) in the corresponding probability distributions, since the random variables are identically distributed. Furthermore, the joint distributions are conditioned on the population parameter vector \(\boldsymbol{\theta}\), which reflects our generative modelling approach.
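For instance, assuming numpy and scipy are available, the sketch below evaluates Equation A.13 for an iid Normal sample; the parameter values and the observations are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0                  # parameter vector θ (assumed known here)
y = np.array([0.5, -1.2, 0.3, 2.0])   # an observed random sample with n = 4

# Equation A.13: the joint PDF is the product of the standalone PDFs.
joint_pdf = np.prod(norm.pdf(y, loc=mu, scale=sigma))
print(joint_pdf)   # also read as the likelihood of θ given this sample
```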
Somewhat equivalent to:
Training dataset.
Random variable
A random variable is a function that assigns a real number to each outcome belonging to the sample space \(S\); the value we obtain after executing a given random experiment is one of these real numbers. A random variable, along with its support of real numbers, is depicted with an uppercase letter such that
\[Y \in \mathbb{R}.\]
Regression analysis
Regressor
Equivalent to:
Attribute, covariate, exogenous variable, explanatory variable, feature, independent variable, input or predictor.
Response variable
In supervised learning, it is the main variable of interest we are trying to learn or predict or, equivalently, the variable we are trying to explain in a statistical inference framework.
Equivalent to:
Dependent variable, endogenous variable, outcome, output or target.
S
Sample space
Let \(A\) be an event of interest in a random phenomenon in a population or system of interest. The sample space \(S\) of event \(A\) denotes the set of all the possible random outcomes we might encounter every time we randomly observe \(A\), as if we were running some class of experiment.
Note each of these outcomes has a determined probability associated with it. If we add up all these probabilities, the probability of the sample space \(S\) will be one, i.e.,
\[ P(S) = 1. \tag{A.14}\]
Significance level
Standard error
T
Target
In supervised learning, it is the main variable of interest we are trying to learn or predict or, equivalently, the variable we are trying to explain in a statistical inference framework.
Equivalent to:
Dependent variable, endogenous variable, response variable, outcome or output.
Test statistic
Training dataset
Somewhat equivalent to:
Random sample.
Type I error
Equivalent to:
False positive.
Type II error
Equivalent to:
False negative.
U
Underdispersion
V
Variance
Let \(Y\) be a discrete or continuous random variable whose support is \(\mathcal{Y}\) with a mean represented by \(\mathbb{E}(Y)\). Then, the variance of \(Y\) is the mean of the squared deviation from this mean, as follows:
\[ \text{Var}(Y) = \mathbb{E}\left\{[ Y - \mathbb{E}(Y)]^2 \right\}. \]
Note the expression above is equivalent to:
\[ \text{Var}(Y) = \mathbb{E}(Y^2) - \left[ \mathbb{E}(Y) \right]^2. \]
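As a closing sketch (standard library only), we can check that both variance expressions agree for a fair six-sided die (an illustrative choice).

```python
support = [1, 2, 3, 4, 5, 6]
pmf = {y: 1 / 6 for y in support}

E_Y = sum(y * pmf[y] for y in support)                      # E(Y) = 3.5
var_dev = sum((y - E_Y) ** 2 * pmf[y] for y in support)     # E{[Y - E(Y)]^2}
var_alt = sum(y ** 2 * pmf[y] for y in support) - E_Y ** 2  # E(Y^2) - [E(Y)]^2
print(var_dev, var_alt)                                     # both ≈ 2.9167
```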