Appendix A — The Fusionified ML-Stats Dictionary
Fun fact!
Fusionified! A mix of flavours from various cuisines that somehow (or miraculously) works.
Machine learning and statistics share a substantial synergy that is reflected in data science. Thus, it is imperative to construct solid bridges between both disciplines to keep their tremendous amount of jargon and terminology clear. This ML-Stats dictionary (ML stands for machine learning) aims to be one of these bridges in this textbook, especially within supervised learning and regression analysis contexts.
Below, you will find definitions highlighted in blue if they correspond to statistical terminology, or in magenta if the terminology is machine learning-related. These definitions come from all the definition admonitions across this textbook, such as in (Definition-sample?). This colour scheme strives to tie all the terminology together so that we can switch from one field to the other easily. With practice and time, we should be able to jump back and forth when using these concepts.
Attention!
Noteworthy terms, whether statistical or machine learning-related, will include a particular admonition identifying which other terms (statistical or machine learning-related) are equivalent or somewhat equivalent (or even NOT equivalent, if that is the case!).
A
Alternative hypothesis
Attribute
Equivalent to:
Covariate, exogenous variable, explanatory variable, feature, independent variable, input, predictor or regressor.
Average
Let \(Y\) be a random variable whose support is \(\mathcal{Y}\). In general, the expected value or mean \(\mathbb{E}(Y)\) of this random variable is defined as a weighted average according to its corresponding probability distribution. In other words, this measure of central tendency \(\mathbb{E}(Y)\) aims to find the middle value of this random variable by weighting all its possible values in its support \(\mathcal{Y}\) as dictated by its probability distribution.
Given the above definition, when \(Y\) is a discrete random variable whose probability mass function (PMF) is \(P_Y(Y = y)\), then its weighted average is mathematically defined as
\[ \mathbb{E}(Y) = \sum_{y \in \mathcal{Y}} y \cdot P_Y(Y = y). \]
When \(Y\) is a continuous random variable whose probability density function (PDF) is \(f_Y(y)\), its weighted average is mathematically defined as
\[ \mathbb{E}(Y) = \int_{\mathcal{Y}} y \cdot f_Y(y) \mathrm{d}y. \]
Equivalent to:
Expected value or mean.
B
Bayesian statistics
This statistical school of thinking also relies on the frequency of events to estimate specific parameters of interest in a population or system. Nevertheless, unlike frequentist statisticians, Bayesian statisticians use prior knowledge about the population parameters, which they update with the current evidence they can gather. This evidence comes in the form of the repetition of \(n\) experiments involving a random phenomenon. All these ingredients allow Bayesian statisticians to make inferences by conducting appropriate hypothesis tests, which are designed differently from their mainstream frequentist counterparts.
Under the umbrella of this approach, we assume that our governing parameters are random; i.e., they have their own sample space and probabilities associated with their corresponding outcomes. The statistical process of inference is heavily backed by probability theory, mostly in the form of Bayes' theorem (named after Reverend Thomas Bayes, an English statistician from the 18th century). This theorem uses our current evidence along with our prior beliefs to deliver a posterior distribution of our random parameter(s) of interest.
Bayes’ rule
Suppose you have two events of interest, \(A\) and \(B\), in a random phenomenon of a population or system of interest. From Equation A.4, we can state the following expression for the conditional probability of \(A\) given \(B\):
\[ P(A | B) = \frac{P(A \cap B)}{P(B)} \quad \text{if $P(B) > 0$.} \tag{A.1}\]
Note the conditional probability of \(B\) given \(A\) can be stated as:
\[ \begin{align*} P(B | A) &= \frac{P(B \cap A)}{P(A)} \quad \text{if $P(A) > 0$} \\ &= \frac{P(A \cap B)}{P(A)} \quad \text{since $P(B \cap A) = P(A \cap B)$.} \end{align*} \tag{A.2}\]
Then, we can manipulate Equation A.2 as follows:
\[ P(A \cap B) = P(B | A) \times P(A). \]
The above result can be plugged into Equation A.1:
\[ \begin{align*} P(A | B) &= \frac{P(A \cap B)}{P(B)} \\ &= \frac{P(B | A) \times P(A)}{P(B)}. \end{align*} \tag{A.3}\]
Equation A.3 is called Bayes’ rule. We are essentially flipping conditional probabilities around.
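To make Equation A.3 concrete, here is a minimal numerical sketch in Python; the joint and marginal probabilities below are assumed purely for illustration.

```python
# A numerical check of Bayes' rule (Equation A.3); standard library only.
p_A_and_B = 0.12   # P(A ∩ B), an assumed illustrative value
p_A = 0.30         # P(A)
p_B = 0.40         # P(B)

p_B_given_A = p_A_and_B / p_A   # Equation A.2
p_A_given_B = p_A_and_B / p_B   # Equation A.1

# Bayes' rule recovers P(A | B) by flipping P(B | A) around.
print(p_A_given_B)               # ≈ 0.3
print(p_B_given_A * p_A / p_B)   # ≈ 0.3 as well, matching Equation A.3
```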
C
Conditional probability
Suppose you have two events of interest, \(A\) and \(B\), in a random phenomenon of a population or system of interest. These two events belong to the sample space \(S\). Moreover, assume that the probability of event \(B\), which is called the conditioning event, is such that
\[ P(B) > 0. \]
Hence, the conditional probability of event \(A\) given event \(B\) is defined as
\[ P(A | B) = \frac{P(A \cap B)}{P(B)}, \tag{A.4}\]
where \(P(A \cap B)\) is read as the probability of the intersection of events \(A\) and \(B\).
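As a quick illustration of Equation A.4, the following sketch (assuming numpy is available) estimates a conditional probability by simulation; the two dice events are our own illustrative choices.

```python
import numpy as np

# Simulate two fair dice; B = "first die shows 3", A = "the sum is 8".
rng = np.random.default_rng(seed=123)
n = 1_000_000
die1 = rng.integers(1, 7, size=n)
die2 = rng.integers(1, 7, size=n)

A = (die1 + die2) == 8
B = die1 == 3

# P(A | B) = P(A ∩ B) / P(B), estimated via relative frequencies.
print(np.mean(A & B) / np.mean(B))   # close to 1/6, since we need die2 == 5
```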
Confidence interval
Continuous random variable
Let \(Y\) be a random variable whose support is \(\mathcal{Y}\). If this support \(\mathcal{Y}\) corresponds to an uncountably infinite set of possible values, then \(Y\) is considered a continuous random variable.
Note a continuous random variable could be
- completely unbounded (i.e., its set of possible values goes from \(-\infty\) to \(\infty\) as in \(-\infty < y < \infty\)),
- positively unbounded (i.e., its set of possible values goes from \(0\) to \(\infty\) as in \(0 \leq y < \infty\)),
- negatively unbounded (i.e., its set of possible values goes from \(-\infty\) to \(0\) as in \(-\infty < y \leq 0\)), or
- bounded between two values \(a\) and \(b\) (i.e., its set of possible values goes from \(a\) to \(b\) as in \(a \leq y \leq b\)).
Covariate
Equivalent to:
Attribute, exogenous variable, explanatory variable, feature, independent variable, input, predictor or regressor.
Critical value
Cumulative distribution function
Let \(Y\) be a random variable, either discrete or continuous. Its cumulative distribution function (CDF) \(F_Y(y) : \mathbb{R} \rightarrow [0, 1]\) refers to the probability that \(Y\) is less than or equal to an observed value \(y\):
\[ F_Y(y) = P(Y \leq y). \]
Then, we have the following by type of random variable:
- When \(Y\) is discrete, whose support is \(\mathcal{Y}\), suppose it has a probability mass function (PMF) \(P_Y(Y = y)\). Then, the CDF is mathematically represented as:
\[ F_Y(y) = \sum_{\substack{t \in \mathcal{Y} \\ t \leq y}} P_Y(Y = t). \tag{A.5}\]
- When \(Y\) is continuous, whose support is \(\mathcal{Y}\), suppose it has a probability density function (PDF) \(f_Y(y)\). Then, the CDF is mathematically represented as:
\[ F_Y(y) = \int_{-\infty}^y f_Y(t) \mathrm{d}t. \tag{A.6}\]
Note that in Equation A.5 and Equation A.6 we use the auxiliary variable \(t\): since the observed value \(y\) acts as the upper limit of the summation or integral, it cannot also play the role of the variable over which we sum or integrate in the PMF or PDF.
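For instance, assuming scipy is available, the following sketch checks Equation A.5 for an arbitrarily chosen Binomial(10, 0.4) random variable.

```python
from scipy.stats import binom

# Y ~ Binomial(n = 10, p = 0.4), an illustrative choice.
n, p, y = 10, 0.4, 3

# CDF directly: F_Y(3) = P(Y <= 3).
print(binom.cdf(y, n, p))
# Equation A.5: sum the PMF over all t in the support with t <= y.
print(sum(binom.pmf(t, n, p) for t in range(y + 1)))   # same value
```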
D
Dependent variable
In supervised learning, it is the main variable of interest we are trying to learn or predict or, equivalently, the variable we are trying to explain in a statistical inference framework.
Equivalent to:
Endogenous variable, response variable, outcome, output or target.
Discrete random variable
Let \(Y\) be a random variable whose support is \(\mathcal{Y}\). If this support \(\mathcal{Y}\) corresponds to a finite set or a countably infinite set of possible values, then \(Y\) is considered a discrete random variable.
For instance, we can encounter discrete random variables which could be classified as
- binary (i.e., a finite set of two possible values),
- categorical (either nominal or ordinal, which have a finite set of three or more possible values), or
- counts (which might have a finite set or a countably infinite set of possible values as integers).
Dispersion
E
Endogenous variable
Equivalent to:
Dependent variable, outcome, output, response variable or target.
Equidispersion
Expected value
Let \(Y\) be a random variable whose support is \(\mathcal{Y}\). In general, the expected value or mean \(\mathbb{E}(Y)\) of this random variable is defined as a weighted average according to its corresponding probability distribution. In other words, this measure of central tendency \(\mathbb{E}(Y)\) aims to find the middle value of this random variable by weighting all its possible values in its support \(\mathcal{Y}\) as dictated by its probability distribution.
Given the above definition, when \(Y\) is a discrete random variable whose probability mass function (PMF) is \(P_Y(Y = y)\), then its expected value is mathematically defined as
\[ \mathbb{E}(Y) = \sum_{y \in \mathcal{Y}} y \cdot P_Y(Y = y). \tag{A.7}\]
When \(Y\) is a continuous random variable whose probability density function (PDF) is \(f_Y(y)\), its expected value is mathematically defined as
\[ \mathbb{E}(Y) = \int_{\mathcal{Y}} y \cdot f_Y(y) \mathrm{d}y. \tag{A.8}\]
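As a quick sketch of Equation A.7, consider a fair six-sided die (an illustrative choice); only the Python standard library is needed.

```python
# Expected value of a fair six-sided die via Equation A.7.
support = [1, 2, 3, 4, 5, 6]
pmf = {y: 1 / 6 for y in support}   # P_Y(Y = y) = 1/6 for each face

expected_value = sum(y * pmf[y] for y in support)
print(expected_value)               # 3.5, the weighted average of the faces
```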
Equivalent to:
Average or mean.
Exogenous variable
Equivalent to:
Attribute, covariate, explanatory variable, feature, independent variable, input, predictor or regressor.
Explanatory variable
Equivalent to:
Attribute, covariate, exogenous variable, feature, independent variable, input, predictor or regressor.
F
False negative
Equivalent to:
Type II error.
False positive
Equivalent to:
Type I error.
Feature
Equivalent to:
Attribute, covariate, exogenous variable, explanatory variable, independent variable, input, predictor or regressor.
Frequentist statistics
This statistical school of thinking heavily relies on the frequency of events to estimate specific parameters of interest in a population or system. This frequency of events is reflected in the repetition of \(n\) experiments involving a random phenomenon within this population or system.
Under the umbrella of this approach, we assume that our governing parameters are fixed. Note that, within the philosophy of this school of thinking, we can only make precise and accurate estimations as long as we repeat our experiments as many times as possible, i.e., as
\[ n \rightarrow \infty. \]
G
Generalized linear model
Generative model
Suppose you observe some data \(y\) from a population or system of interest. Moreover, let us assume this population or system is governed by \(k\) parameters contained in the following vector:
\[ \boldsymbol{\theta} = (\theta_1, \theta_2, \cdots, \theta_k)^T. \]
If we state that our observed data \(y\) follows a certain probability distribution \(\mathcal{D}(\cdot)\), then we will have a generative model \(m\) such that
\[ m: y \sim \mathcal{D}(\boldsymbol{\theta}). \]
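For instance, assuming numpy is available and taking \(\mathcal{D}(\cdot)\) to be a Normal distribution with \(\boldsymbol{\theta} = (\mu, \sigma)^T\) (both the distribution and the parameter values are illustrative assumptions), the sketch below generates data from such a model.

```python
import numpy as np

# Generative model m: y ~ Normal(mu, sigma), with assumed parameters θ.
rng = np.random.default_rng(seed=42)
mu, sigma = 5.0, 2.0
y = rng.normal(loc=mu, scale=sigma, size=1000)

# In practice we would observe y and work backwards to estimate θ.
print(y.mean(), y.std())   # close to mu = 5 and sigma = 2
```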
H
Hypothesis testing
I
Independence
Suppose you have two events of interest, \(A\) and \(B\), in a random phenomenon of a population or system of interest. These two events are statistically independent if the occurrence of event \(B\) does not affect the probability of event \(A\), and vice versa. Therefore, the probability of their intersection is given by:
\[ P(A \cap B) = P(A) \times P(B). \]
Let us expand the above definition to a random variable framework:
- Suppose you have a set of \(n\) mutually independent discrete random variables \(Y_1, \dots, Y_n\) whose supports are \(\mathcal{Y_1}, \dots, \mathcal{Y_n}\) with probability mass functions (PMFs) \(P_{Y_1}(Y_1 = y_1), \dots, P_{Y_n}(Y_n = y_n)\), respectively. Under independence, the joint PMF of these \(n\) random variables is the product of their corresponding standalone PMFs (see the sketch after this list):
\[ \begin{align*} P_{Y_1, \dots, Y_n}(Y_1 = y_1, \dots, Y_n = y_n) &= \prod_{i = 1}^n P_{Y_i}(Y_i = y_i) \\ & \qquad \text{for all} \\ & \qquad \quad y_i \in \mathcal{Y}_i, i = 1, \dots, n. \end{align*} \tag{A.9}\]
- Suppose you have a set of \(n\) mutually independent continuous random variables \(Y_1, \dots, Y_n\) whose supports are \(\mathcal{Y_1}, \dots, \mathcal{Y_n}\) with probability density functions (PDFs) \(f_{Y_1}(y_1), \dots, f_{Y_n}(y_n)\), respectively. Under independence, the joint PDF of these \(n\) random variables is the product of their corresponding standalone PDFs:
\[ \begin{align*} f_{Y_1, \dots, Y_n}(y_1, \dots, y_n) &= \prod_{i = 1}^n f_{Y_i}(y_i) \\ & \qquad \text{for all} \\ & \qquad \quad y_i \in \mathcal{Y}_i, i = 1, \dots, n. \end{align*} \tag{A.10}\]
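As a minimal sketch of Equation A.9, take \(n = 2\) independent fair dice (an illustrative choice); only the Python standard library is needed.

```python
# Joint PMF of two independent fair dice via Equation A.9.
pmf = {y: 1 / 6 for y in range(1, 7)}   # standalone PMF of each die

# Under independence, P(Y1 = 2, Y2 = 5) is the product of the
# standalone probabilities.
print(pmf[2] * pmf[5])   # 1/36 ≈ 0.0278
```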
Independent variable
Equivalent to:
Attribute, covariate, exogenous variable, explanatory variable, feature, input, predictor or regressor.
Input
Equivalent to:
Attribute, covariate, exogenous variable, explanatory variable, feature, independent variable, predictor or regressor.
M
Mean
Let \(Y\) be a random variable whose support is \(\mathcal{Y}\). In general, the expected value or mean \(\mathbb{E}(Y)\) of this random variable is defined as a weighted average according to its corresponding probability distribution. In other words, this measure of central tendency \(\mathbb{E}(Y)\) aims to find the middle value of this random variable by weighting all its possible values in its support \(\mathcal{Y}\) as dictated by its probability distribution.
Given the above definition, when \(Y\) is a discrete random variable whose probability mass function (PMF) is \(P_Y(Y = y)\), then its mean is mathematically defined as
\[ \mathbb{E}(Y) = \sum_{y \in \mathcal{Y}} y \cdot P_Y(Y = y). \]
When \(Y\) is a continuous random variable whose probability density function (PDF) is \(f_Y(y)\), its mean is mathematically defined as
\[ \mathbb{E}(Y) = \int_{\mathcal{Y}} y \cdot f_Y(y) \mathrm{d}y. \]
Equivalent to:
Average or expected value.
Measure of central tendency
Probabilistically, a measure of central tendency is defined as a metric that identifies a central or typical value of a given probability distribution. In other words, a measure of central tendency refers to a central or typical value that a given random variable might take when we observe various realizations of this variable over a long period.
Measure of uncertainty
Probabilistically, a measure of uncertainty refers to the spread of a given random variable when we observe its different realizations in the long term. Note a larger spread indicates more variability in these realizations. On the other hand, a smaller spread denotes less variability in these realizations.
N
Null hypothesis
O
Observed effect
Outcome
In supervised learning, it is the main variable of interest we are trying to learn or predict or, equivalently, the variable we are trying to explain in a statistical inference framework.
Equivalent to:
Dependent variable, endogenous variable, response variable, output or target.
Output
In supervised learning, it is the main variable of interest we are trying to learn or predict or, equivalently, the variable we are trying to explain in a statistical inference framework.
Equivalent to:
Dependent variable, endogenous variable, response variable, outcome or target.
Overdispersion
P
Parameter
It is a characteristic (numerical or even non-numerical, such as a distinctive category) that summarizes the state of our population or system of interest.
Note that population parameters are conventionally denoted by Greek letters. Moreover, in practice, these population parameter(s) of interest will be unknown to the data scientist or researcher. Instead, they would use formal statistical inference to estimate them.
Population
It is a whole collection of individuals or items that share distinctive attributes. As data scientists or researchers, we are interested in studying these attributes, which we assume are governed by parameters. In practice, we must be as specific as possible when defining our given population, so that our entire data modelling process is properly framed from its very early stages.
Note that the term population could be exchanged for the term system, given that certain contexts do not specifically refer to individuals or items. Instead, these contexts could refer to processes whose attributes are also governed by parameters.
Power
Equivalent to:
True positive rate.
Predictor
Equivalent to:
Attribute, covariate, exogenous variable, explanatory variable, feature, independent variable, input or regressor.
Probability
Let \(A\) be an event of interest in a random phenomenon, in a population or system of interest, whose possible outcomes all belong to a given sample space \(S\). Generally, the probability of this event \(A\) happening can be mathematically depicted as \(P(A)\). Moreover, suppose we observe the random phenomenon \(n\) times, as if we were running some class of experiment; then \(P(A)\) is defined as the following ratio:
\[ P(A) = \frac{\text{Number of times event $A$ is observed}}{n}, \tag{A.11}\]
as the number of times \(n\) we observe the random phenomenon goes to infinity.
Equation A.11 will always put \(P(A)\) in the following numerical range:
\[ 0 \leq P(A) \leq 1. \]
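Assuming numpy is available, the following simulation sketch illustrates Equation A.11 with the illustrative event \(A\): "a fair die shows 6"; the relative frequency approaches \(1/6\) as \(n\) grows.

```python
import numpy as np

rng = np.random.default_rng(seed=2024)
for n in [100, 10_000, 1_000_000]:
    rolls = rng.integers(1, 7, size=n)
    # Number of times event A is observed, divided by n (Equation A.11).
    print(n, np.mean(rolls == 6))   # approaches 1/6 ≈ 0.1667
```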
Probability distribution
When we set a random variable \(Y\), we also set a new set of \(v\) possible outcomes \(\mathcal{Y} = \{ y_1, \dots, y_v\}\) coming from the sample space \(S\). This new set of possible outcomes \(\mathcal{Y}\) corresponds to the range of the random variable \(Y\) (i.e., all the possible values that \(Y\) could take on once we execute a given random experiment involving it).
That said, let us suppose we have a sample space of \(u\) elements defined as
\[ S = \{ s_1, \dots, s_u \}, \]
where each one of these elements has a probability assigned via a function \(P_S(\cdot)\) such that
\[ P(S) = \sum_{i = 1}^u P_S(s_i) = 1, \]
which satisfies Equation A.14.
Then, the probability distribution of \(Y\), i.e., \(P_Y(\cdot)\), assigns a probability to each observed value \(Y = y_j\) (with \(j = 1, \dots, v\)) by collecting all the outcomes of the random experiment in the sample space, i.e., the \(s_i \in S\) (for \(i = 1, \dots, u\)) such that \(Y(s_i) = y_j\):
\[ P_Y(Y = y_j) = P \left( \left\{ s_i \in S : Y(s_i) = y_j \right\} \right). \]
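To see this mapping in action, here is a small sketch (standard library only) where \(S\) collects the four equally likely outcomes of two fair coin flips and \(Y\) counts the number of heads; both choices are ours for illustration.

```python
from itertools import product

S = list(product("HT", repeat=2))   # sample space: {HH, HT, TH, TT}
P_S = {s: 1 / 4 for s in S}         # each outcome has probability 1/4


def Y(s):
    return s.count("H")             # random variable: number of heads


# P_Y(Y = y_j) adds up P_S(s_i) over all s_i in S with Y(s_i) = y_j.
P_Y = {}
for s in S:
    P_Y[Y(s)] = P_Y.get(Y(s), 0) + P_S[s]
print(P_Y)                          # {2: 0.25, 1: 0.5, 0: 0.25}
```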
Probability density function
Let \(Y\) be a continuous random variable whose support is \(\mathcal{Y}\). Furthermore, consider a function \(f_Y(y)\) such that
\[ f_Y(y) : \mathbb{R} \rightarrow \mathbb{R} \]
with
\[ f_Y(y) \geq 0. \]
Then, \(f_Y(y)\) is considered a probability density function (PDF) if the probability of \(Y\) taking on a value within the range represented by the subset \(A \subset \mathcal{Y}\) is equal to
\[ P_Y(Y \in A) = \int_A f_Y(y) \mathrm{d}y \]
with
\[ \int_{\mathcal{Y}} f_Y(y) \mathrm{d}y = 1. \]
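Assuming scipy is available, the sketch below checks both conditions for the PDF of a standard Normal random variable (an illustrative choice).

```python
from scipy.integrate import quad
from scipy.stats import norm

# P(Y ∈ A) for A = [-1, 1]: integrate the PDF over A.
p_A, _ = quad(norm.pdf, -1, 1)
print(p_A)    # ≈ 0.6827

# The PDF integrates to one over the whole support.
total, _ = quad(norm.pdf, float("-inf"), float("inf"))
print(total)  # ≈ 1.0
```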
Probability mass function
Let \(Y\) be a discrete random variable whose support is \(\mathcal{Y}\). Moreover, suppose that \(Y\) has a probability distribution such that
\[ P_Y(Y = y) : \mathbb{R} \rightarrow [0, 1] \]
where, for all \(y \notin \mathcal{Y}\), we have
\[ P_Y(Y = y) = 0 \]
and
\[ \sum_{y \in \mathcal{Y}} P_Y(Y = y) = 1. \]
Then, \(P_Y(Y = y)\) is considered a probability mass function (PMF).
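For instance, assuming scipy is available, the sketch below checks these conditions for a Binomial(5, 0.3) random variable (an illustrative choice).

```python
from scipy.stats import binom

n, p = 5, 0.3
support = range(n + 1)   # Y takes values in {0, 1, ..., 5}

probs = [binom.pmf(y, n, p) for y in support]
print(all(0 <= q <= 1 for q in probs))   # True: each probability is in [0, 1]
print(sum(probs))                        # 1.0: the PMF sums to one
print(binom.pmf(7, n, p))                # 0.0 for a value outside the support
```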
\(p\)-value
R
Random sample
A random sample is a collection of random variables \(Y_1, \dots, Y_n\) of size \(n\) coming from a given population or system of interest. Note that the most elementary definition of a random sample assumes that these \(n\) random variables are mutually independent and identically distributed (which is abbreviated as iid).
The fact that these \(n\) random variables are identically distributed indicates that they have the same mathematical form for their corresponding probability mass functions (PMFs) or probability density functions (PDFs), depending on whether they are discrete or continuous, respectively. Hence, under a generative modelling approach in a population or system of interest governed by \(k\) parameters contained in the vector
\[ \boldsymbol{\theta} = (\theta_1, \theta_2, \cdots, \theta_k)^T, \]
we can apply the iid property in an elementary random sample to obtain the following joint probability distributions:
- In the case of \(n\) iid discrete random variables \(Y_1, \dots, Y_n\) whose common standalone PMF is \(P_Y(Y = y)\) with support \(\mathcal{Y}\), the joint PMF is mathematically expressed as
\[ \begin{align*} P_{Y_1, \dots, Y_n}(Y_1 = y_1, \dots, Y_n = y_n | \boldsymbol{\theta}) &= \prod_{i = 1}^n P_Y(Y = y_i | \boldsymbol{\theta}) \\ & \qquad \text{for all} \\ & \qquad \quad y_i \in \mathcal{Y}, i = 1, \dots, n. \end{align*} \tag{A.12}\]
- In the case of \(n\) iid continuous random variables \(Y_1, \dots, Y_n\) whose common standalone PDF is \(f_Y(y)\) with support \(\mathcal{Y}\), the joint PDF is mathematically expressed as
\[ \begin{align*} f_{Y_1, \dots, Y_n}(y_1, \dots, y_n | \boldsymbol{\theta}) &= \prod_{i = 1}^n f_Y(y_i | \boldsymbol{\theta}) \\ & \qquad \text{for all} \\ & \qquad \quad y_i \in \mathcal{Y}, i = 1, \dots, n. \end{align*} \tag{A.13}\]
Unlike Equation A.9 and Equation A.10, note that Equation A.12 and Equation A.13 use the single subscript \(Y\) in the corresponding probability distributions, since the random variables are identically distributed. Furthermore, the joint distributions are conditioned on the population parameter vector \(\boldsymbol{\theta}\), which reflects our generative modelling approach.
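For instance, assuming numpy and scipy are available, the sketch below evaluates Equation A.13 for an iid Normal sample; the parameter values and the observations are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0                  # parameter vector θ (assumed known here)
y = np.array([0.5, -1.2, 0.3, 2.0])   # an observed random sample with n = 4

# Equation A.13: the joint PDF is the product of the standalone PDFs.
joint_pdf = np.prod(norm.pdf(y, loc=mu, scale=sigma))
print(joint_pdf)   # also read as the likelihood of θ given this sample
```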
Somewhat equivalent to:
Training dataset.
Random variable
A random variable is a function that assigns a real number to each outcome belonging to the sample space \(S\); the value we obtain after executing a given random experiment is one of these real numbers. A random variable, along with its support of real numbers, is depicted with an uppercase letter such that
\[Y \in \mathbb{R}.\]
Regression analysis
Regressor
Equivalent to:
Attribute, covariate, exogenous variable, explanatory variable, feature, independent variable, input or predictor.
Response variable
In supervised learning, it is the main variable of interest we are trying to learn or predict or, equivalently, the variable we are trying to explain in a statistical inference framework.
Equivalent to:
Dependent variable, endogenous variable, outcome, output or target.
S
Sample space
Let \(A\) be an event of interest in a random phenomenon in a population or system of interest. The sample space \(S\) of event \(A\) denotes the set of all the possible random outcomes we might encounter every time we randomly observe \(A\), as if we were running some class of experiment.
Note each of these outcomes has a determined probability associated with it. If we add up all these probabilities, the probability of the sample space \(S\) will be one, i.e.,
\[ P(S) = 1. \tag{A.14}\]
Significance level
Standard error
T
Target
In supervised learning, it is the main variable of interest we are trying to learn or predict or, equivalently, the variable we are trying to explain in a statistical inference framework.
Equivalent to:
Dependent variable, endogenous variable, response variable, outcome or output.
Test statistic
Training dataset
Somewhat equivalent to:
Random sample.
Type I error
Equivalent to:
False positive.
Type II error
Equivalent to:
False negative.
U
Underdispersion
V
Variance
Let \(Y\) be a discrete or continuous random variable whose support is \(\mathcal{Y}\) with a mean represented by \(\mathbb{E}(Y)\). Then, the variance of \(Y\) is the mean of the squared deviation from this mean, as follows:
\[ \text{Var}(Y) = \mathbb{E}\left\{[ Y - \mathbb{E}(Y)]^2 \right\}. \]
Note the expression above is equivalent to:
\[ \text{Var}(Y) = \mathbb{E}(Y^2) - \left[ \mathbb{E}(Y) \right]^2. \]
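As a closing sketch (standard library only), we can check that both variance expressions agree for a fair six-sided die (an illustrative choice).

```python
support = [1, 2, 3, 4, 5, 6]
pmf = {y: 1 / 6 for y in support}

E_Y = sum(y * pmf[y] for y in support)                      # E(Y) = 3.5
var_dev = sum((y - E_Y) ** 2 * pmf[y] for y in support)     # E{[Y - E(Y)]^2}
var_alt = sum(y ** 2 * pmf[y] for y in support) - E_Y ** 2  # E(Y^2) - [E(Y)]^2
print(var_dev, var_alt)                                     # both ≈ 2.9167
```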