2  Basic Cuisine: A Review on Probability and Frequentist Statistical Inference

This chapter delves into probability and frequentist statistical inference. We can view these sections as a quick review of introductory probability and statistics concepts. Moreover, this review will be important for understanding the philosophy behind modelling and parameter estimation as outlined in Section 1.2.5. Then, we will pave the way to the rationale behind statistical inference in the results stage (as in Section 1.2.7) of our workflow from Figure 1.1. Note that we aim to explain all these statistical and probabilistic concepts in the most practical way possible via a made-up case study that runs throughout this chapter. Still, we will use an appropriate level of jargon and will follow the colour convention found in Appendix A along with the definition callout box.

Learning Objectives

By the end of this chapter, you will be able to:

  • Discuss why having a complete conceptual understanding of the process of statistical inference is key when conducting studies for general audiences.
  • Explain why probability is the language of statistics.
  • Recall foundational probabilistic insights.
  • Break down the differences between the two schools of statistical thinking: frequentist and Bayesian.
  • Apply the philosophy of generative modelling along with probability distributions in parameter estimation.
  • Justify using measures of central tendency and uncertainty to characterize probability distributions.
  • Illustrate how random sampling can be used in parameter estimation.
  • Describe conceptually what maximum likelihood estimation entails in a frequentist framework.
  • Formulate a maximum likelihood estimation-based approach in parameter estimation.
  • Outline the process of a frequentist classical-based hypothesis testing to solve inferential inquiries in regression modelling.
  • Contrast the differences and similarities between supervised learning and regression analysis.

Imagine you are an undergraduate engineering student who, last term, took and passed a first course in probability and statistics (inference included!) in an industrial engineering context. As can happen while taking an introductory course in probability and statistics, you used to feel quite overwhelmed by the large amount of jargon and formulas one had to grasp and use regularly for primary engineering fields such as quality control in a manufacturing facility. Population parameters, hypothesis testing, test statistics, significance level, \(p\)-values, and confidence intervals (do not worry, our statistical/machine learning scheme will come in later in this review) were appearing here and there! And to your frustration, you could never find a statistical connection between all these inferential tools! Instead, you relied on mechanistic procedures when solving assignments or exam problems.

For instance, when performing a two-sample \(t\)-test, you struggled to articulate what the hypotheses indicated about the corresponding population parameter(s) or how the test statistic related to these hypotheses. Moreover, your interpretation of the resulting \(p\)-value and/or confidence interval was purely mechanical, with the inherent claim:

With a significance level \(\alpha = 0.05\), we reject (or fail to reject, if that is the case!) the null hypothesis given that…

Truthfully, this whole mechanical way of doing statistics is not ideal in a teaching, research or industry environment. Along the same lines, the above situation should also not happen when we learn key statistical topics for the very first time as undergraduate students. That is why we will investigate a more intuitive way of viewing probability and its crucial role in statistical inference. This matter will help us deliver more coherent storytelling (as in Section 1.2.8) when presenting our results in practice during any regression analysis to our peers or stakeholders. Note that the role of probability also extends to model training (as in Section 1.2.5) when it comes to supervised learning and not just regarding statistical inference.

Having said all this, it is time to introduce a statement that is key when teaching hypothesis testing in an introductory statistical inference course:

In statistical inference, everything always boils down to randomness and how we can control it!

That is quite a bold statement! Nonetheless, once one starts teaching statistical topics to audiences not entirely familiar with the usual field jargon, the idea of randomness always persists across many different tools. And, of course, regression analysis is not an exception at all since it also involves inference on population parameters of interest! This is why we have allocated this section of the textbook to explaining core probabilistic and inferential concepts, paving the way to their role in regression analysis.

Heads-up on what we mean by a non-ideal mechanical analysis!

The reader might need clarification on why the mechanical way of performing hypothesis testing is considered non-ideal, mainly when the term cookbook is used in the book’s title. The cookbook concept here actually refers to a homogenized recipe for data modelling, as seen in the workflow from Figure 1.1. However, there’s a crucial distinction between this and the non-ideal mechanical way of hypothesis testing.

On the one hand, the non-ideal mechanical way refers to the use of a tool without understanding the rationale behind what this tool stands for, resulting in vacuous, boilerplate statements that we would not be able to explain any further, such as the statement we previously indicated:

With a significance level \(\alpha = 0.05\), we reject (or fail to reject, if that is the case!) the null hypothesis given that…

What if a stakeholder of our analysis asks us in plain words what a significance level means? Why are we phrasing our conclusion on the null hypothesis and not directly on the alternative one? As a data scientist, one should be able to explain why the whole inference process yields that statement without misleading the stakeholders’ understanding. For sure, this also implies appropriate communication skills that cater to general audiences rather than just statistical ones.

Conversely, the data modelling workflow in Figure 1.1 involves stages that necessitate a comprehensive and precise understanding of our analysis. Progressing to the next stage without a complete grasp of the current one risks perpetuating false insights, potentially leading to faulty data storytelling of the entire analysis.

Finally, even though this book has suggested reviews related to the basics of probability via different distributions and the fundamentals of frequentist statistical inference, as stated in Audience and Scope, we will revisit the essential concepts throughout the rest of this chapter.

Without further ado, let us start with reviewing core concepts in probability via quite a tasty example.

2.1 Basics of Probability

In terms of regression analysis and its supervised learning counterpart (either on an inferential or predictive framework), probability can be viewed as the solid foundation on which more complex tools, including estimation and hypothesis testing, are built. Having said that, let us scaffold across all the necessary probabilistic concepts that will allow us to move forward into these more complex tools.

2.1.1 First Insights

Under the above solid foundation, our data comes from a given population or system of interest. Moreover, the population or system is assumed to be governed by parameters that, as data scientists or researchers, we are most interested in studying. That said, the terms population and parameter will pave the way to our first statistical definitions.

Definition of population

It is a whole collection of individuals or items that share distinctive attributes. As data scientists or researchers, we are interested in studying these attributes, which we assume are governed by parameters. In practice, we must be as specific as possible when defining our given population so that we properly frame our entire data modelling process from its very early stages. Examples of a population could be the following:

  • Children between the ages of 5 and 10 years old in states of the American West Coast.
  • Customers of musical vinyl records in the Canadian provinces of British Columbia and Alberta.
  • Avocado trees grown in the Mexican state of Michoacán.
  • Adult giant pandas in the Southwestern Chinese province of Sichuan.
  • Mature açaí palm trees from the Brazilian Amazonian jungle.

Image by Eak K. via Pixabay.

Note that the term population could be exchanged for the term system, given that certain contexts do not particularly refer to individuals or items. Instead, these contexts could refer to processes whose attributes are also governed by parameters. Examples of a system could be the following:

  • The production of cellular phones from a given model in a set of manufacturing facilities.
  • The sale process in the Vancouver franchises of a well-known ice cream parlour.
  • The transit cycle during rush hours on weekdays in the twelve lines of Mexico City’s subway.

Definition of parameter

It is a characteristic (numerical or even non-numerical, such as a distinctive category) that summarizes the state of our population or system of interest. Examples of a population parameter can be described as follows:

  • The average weight of children between the ages of 5 and 10 years old in states of the American West Coast (numerical).
  • The variability in the height of the mature açaí palm trees from the Brazilian Amazonian jungle (numerical).
  • The proportion of defective items in the production of cellular phones in a set of manufacturing facilities (numerical).
  • The average customer waiting time to get their order in the Vancouver franchises of a well-known ice cream parlour (numerical).
  • The most favourite pizza topping of vegetarian adults between the ages of 30 and 40 years old in Edmonton (non-numerical).

Image by meineresterampe via Pixabay.

Note that the standard mathematical notation for population parameters uses Greek letters. Moreover, in practice, these population parameter(s) of interest will be unknown to the data scientist or researcher. Instead, they will use formal statistical inference to estimate them.

The parameter definition points out a crucial fact in investigating any given population or system:

Our parameter(s) of interest are usually unknown!

Given this fact, we would find ourselves in a pretty unfortunate and inconvenient position whenever we wanted to discover any significant insights about the population or system. Therefore, let us proceed to our so-called tasty example so we can dive into the need for statistical inference and see why probability is our perfect ally in this parameter quest.

Imagine you are the owner of a large fleet of ice cream carts, around 900 to be exact. These ice cream carts operate across different parks in the following Canadian cities: Vancouver, Victoria, Edmonton, Calgary, Winnipeg, Ottawa, Toronto, and Montréal. In the past, to optimize operational costs, you decided to limit ice cream cones to only two items: vanilla and chocolate flavours, as in Figure 2.1.

Figure 2.1: The two flavours of the ice cream cone you sell across all your ice cream carts: vanilla and chocolate. Image by tomekwalecki via Pixabay.

Now, let us direct this whole case onto a more statistical and probabilistic field; suppose you have a well-defined overall population of interest for those above eight Canadian cities: children between 4 and 11 years old attending these parks during the Summer weekends. Of course, Summer time is coming this year, and you would like to know which ice cream cone flavour is the favourite one for this population (and by how much!). As a business owner, investigating ice cream flavour preferences would allow you to plan Summer restocks more carefully with your corresponding suppliers. Therefore, it would be essential to start collecting consumer data so the company can tackle this demand query.

Also, suppose there is a second query. For the sake of our case, we will call it a time query. As a critical component of demand planning, besides estimating which cone flavour is the most preferred one (and by how much!) for the above population of interest, the operations area is currently requiring a realistic estimation of the average waiting time from one customer to the next one in any given cart during Summer weekends. This average waiting time would allow the operations team to plan carefully how much stock each cart should have so there will not be any waste or shortage.

Image by Icons8 Team via Unsplash.

Note that the nature of the aforementioned time query relates to a larger population, which we can define as all our ice cream customers during the Summer weekends. Given the requirements of our operations team, this second definition expands the query to our general ice cream customers, and not just the children between 4 and 11 years old attending the parks during Summer weekends. Consequently, it is crucial to note that the nature of our queries will dictate how we define our population and, in turn, our subsequent data modelling and statistical inference.

Summer time represents the most profitable season from a business perspective; thus, solving the above two queries is a significant priority for your company. Hence, you decide to organize a meeting with your eight general managers (one per Canadian city), during which the following was decided:

  1. For the demand query, a comprehensive market study will be run on the population of interest across the eight Canadian cities right before next Summer; suppose we are currently in Spring.
  2. For the time query, since the operations team has not previously recorded any historical data, ALL vendor staff from 900 carts will start collecting data on the waiting time in seconds between each customer this upcoming Summer.

Surprisingly, when discussing the study requirements for the marketing firm that would be in charge of the demand query, Vancouver’s general manager dares to state the following:

Since we’re already planning to collect consumer data on these cities, let’s mimic a census-type study to ensure we can have the MOST PRECISE results on their preferences.

On the other hand, when agreeing on the specific operations protocol to start recording waiting times for all the 900 vending carts this upcoming Summer, Ottawa’s general manager provides a comment for further statistical food for thought:

The operations protocol for recording waiting times in the 900 vending carts looks too cumbersome to implement straightforwardly this upcoming Summer. Why don’t we select A SMALLER GROUP of ice cream carts across the eight cities to have a more efficient process implementation that would allow us to optimize operational costs?

Bingo! Ottawa’s general manager just nailed the probabilistic way of making inferences about our population parameter of interest for the time query. Indeed, their comment was primarily framed from a business perspective of optimizing operational costs. Still, this fact does not take away a crucial insight on which statistical inference is built: a random sample (as in its corresponding definition). As for Vancouver’s general manager, ironically, their statement is NOT PRECISE at all! Mimicking a census-type study might not be the best decision for the demand query given the time constraint and the potential size of its target population.

Realistically, there is no cheap and efficient way to conduct a census-type study for any of the two queries!

Moving on to one of the core topics in this chapter, we can state that probability is viewed as the language to decode random phenomena that occur in any given population or system of interest. In our example, we have two random phenomena:

  1. For the demand query, a phenomenon can be represented by the preferred ice cream cone flavour of any randomly selected child between 4 and 11 years old attending the parks of the above eight Canadian cities during the Summer weekends.
  2. Regarding the time query, a phenomenon of this kind can be represented by any randomly recorded waiting time between two customers during a Summer weekend in any of the above eight Canadian cities.

Now, let us finally define what we mean by probability along with the inherent concept of sample space.

Definition of probability

Let \(A\) be an event of interest in a random phenomenon of a population or system of interest, whose possible outcomes all belong to a given sample space \(S\). Generally, the probability of this event \(A\) happening can be mathematically depicted as \(P(A)\). Moreover, suppose we observe the random phenomenon \(n\) times, as if we were running some kind of experiment; then \(P(A)\) is defined as the following ratio:

\[ P(A) = \frac{\text{Number of times event $A$ is observed}}{n}, \tag{2.1}\]

as the number of times \(n\) we observe the random phenomenon goes to infinity.

Equation 2.1 will always put \(P(A)\) in the following numerical range:

\[ 0 \leq P(A) \leq 1. \]
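To make Equation 2.1 concrete, below is a minimal simulation sketch in Python with hypothetical numbers: event \(A\) is “a randomly selected child prefers chocolate,” and we assume a true (in practice, unknown!) probability of 0.6. Note how the relative frequency stabilizes around the assumed probability as \(n\) grows:

```python
import numpy as np

rng = np.random.default_rng(seed=123)

# Hypothetical setup: event A = "a randomly selected child prefers chocolate",
# with an assumed (and, in practice, unknown!) true probability of 0.6.
true_prob_chocolate = 0.6

for n in [10, 100, 1_000, 10_000, 100_000]:
    # Observe the random phenomenon n times (1 = chocolate, 0 = vanilla).
    observations = rng.binomial(n=1, p=true_prob_chocolate, size=n)
    # Equation 2.1: number of times event A is observed, divided by n.
    print(f"n = {n:>7}: estimated P(A) = {observations.sum() / n:.4f}")
```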

Definition of sample space

Let \(A\) be an event of interest in a random phenomenon of a population or system of interest. The sample space \(S\) associated with event \(A\) denotes the set of all the possible random outcomes we might encounter every time we observe the phenomenon, as if we were running some kind of experiment.

Note each of these outcomes has a determined probability associated with it. If we add up all these probabilities, the probability of the sample space \(S\) will be one, i.e.,

\[ P(S) = 1. \tag{2.2}\]

2.1.2 Schools of Statistical Thinking

Note the above definition for the probability of an event \(A\) specifically highlights the following:

… as the \(n\) times we observe the random phenomenon goes to infinity.

The “infinity” term is key to understanding the philosophy behind the frequentist school of statistical thinking, in contrast to its Bayesian counterpart. In general, the frequentist way of practicing statistics in terms of probability and inference is the approach we usually learn in introductory courses, more specifically when it comes to hypothesis testing and confidence intervals, which will be explored in Section 2.3. That said, the Bayesian approach is another way of practicing statistical inference. Its philosophy differs in what information is used to infer our population parameters of interest. Below, we briefly define both schools of thinking.

Definition of frequentist statistics

This statistical school of thinking heavily relies on the frequency of events to estimate specific parameters of interest in a population or system. This frequency of events is reflected in the repetition of \(n\) experiments involving a random phenomenon within this population or system.

Under the umbrella of this approach, we assume that our governing parameters are fixed. Note that, within the philosophy of this school of thinking, we can only make precise and accurate predictions as long as we repeat our \(n\) experiments as many times as possible, i.e.,

\[ n \rightarrow \infty. \]

Definition of Bayesian statistics

This statistical school of thinking also relies on the frequency of events to estimate specific parameters of interest in a population or system. Nevertheless, unlike frequentist statisticians, Bayesian statisticians use prior knowledge on the population parameters, along with the current evidence they can gather, to update their estimations. This evidence comes in the form of the repetition of \(n\) experiments involving a random phenomenon. All these ingredients allow Bayesian statisticians to make inferences by conducting appropriate hypothesis tests, which are designed differently from their mainstream frequentist counterparts.

The only known portrait of Reverend Thomas Bayes, according to O’Donnell (1936), even though Bellhouse (2004) argues it might not be a portrait of Bayes.

Under the umbrella of this approach, we assume that our governing parameters are random; i.e., they have their own sample space and probabilities associated with their corresponding outcomes. The statistical process of inference is heavily backed up by probability theory, mostly in the form of Bayes’ theorem (named after Reverend Thomas Bayes, an English statistician from the 18th century). This theorem uses our current evidence along with our prior beliefs to deliver a posterior distribution of our random parameter(s) of interest.

Let us put the definitions for the above schools of statistical thinking into a more concrete example. We can use the demand query from our ice cream case as a starting point. More concretely, we can dig more into a standalone population parameter such as the probability that a randomly selected child between 4 and 11 years old, attending the parks of the above eight Canadian cities during the Summer weekends, prefers the chocolate-flavoured ice cream cone over the vanilla one. Think about the following two hypothetical questions:

  1. From a frequentist point of view, what is the estimated probability of preferring chocolate over vanilla after randomly surveying \(n = 100\) children from our population of interest?
  2. Using a Bayesian approach, suppose the marketing team has found ten prior market studies on similar children populations on their preferred ice cream flavour (between chocolate and vanilla). Therefore, along with our actual random survey of \(n = 100\) children from our population of interest, what is the posterior estimation of the probability of preferring chocolate over vanilla?

By comparing these two questions, we can see one characteristic in common when it comes to estimating the probability of preferring chocolate over vanilla: both the frequentist and Bayesian approaches rely on the evidence gathered from the random survey of \(n = 100\) children from our population of interest. On the one hand, the frequentist approach relies solely on the observed data to estimate this single probability of preferring chocolate over vanilla. On the other hand, the Bayesian approach uses the observed data in conjunction with the prior knowledge provided by the ten estimated probabilities to deliver a whole posterior distribution (i.e., the posterior estimation) of the probability of preferring chocolate over vanilla.
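To illustrate both estimations side by side, here is a minimal Python sketch with hypothetical numbers: 58 out of the \(n = 100\) surveyed children prefer chocolate, and the ten prior market studies are encoded as a Beta(12, 8) prior roughly centred at 0.6. The Beta prior is a common conjugate choice for Binomial-type evidence, though by no means the only one:

```python
from scipy.stats import beta

# Hypothetical survey outcome: 58 of n = 100 children prefer chocolate.
n, chocolate = 100, 58

# Frequentist estimate: relies solely on the observed data.
p_hat_freq = chocolate / n

# Bayesian sketch: a Beta(12, 8) prior (hypothetical, centred at 0.6) is
# conjugate to the Binomial evidence, so the posterior is also a Beta
# distribution: Beta(prior a + successes, prior b + failures).
posterior = beta(12 + chocolate, 8 + (n - chocolate))

print(f"Frequentist estimate:  {p_hat_freq:.3f}")
print(f"Posterior mean:        {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```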

Heads-up on the debate between frequentist and Bayesian statistics!

Even though most of us began our statistical journey in a frequentist framework, we might be tempted to state that a Bayesian paradigm for parameter estimation and inference is better than a frequentist one, since the latter only takes into account the observed evidence without any prior knowledge on our parameters of interest.

Image by Manfred Steger via Pixabay.

In the statistical community, there could be a fascinating debate between the pros and cons of each school of thinking. That said, it is crucial to state that no paradigm is considered wrong! Instead, using a pragmatic strategy of performing statistics according to our data science context is more convenient.

Tip on further Bayesian and frequentist insights!

Let us check the following two examples (aside from our ice cream case) to illustrate the above pragmatic way of doing things:

  1. Take the production of cellular phones from a given model in a set of manufacturing facilities as the context. Hence, one might find a frequentist estimation of the proportion of defective items as a quicker and more efficient way to correct any given manufacturing process. That is, we will sample products from our finalized batches and check their status (defective or non-defective, our observed evidence) to deliver a proportion estimation of defective items.
  2. Now, take a physician’s context. It would not make much sense to study the probability that a patient develops a certain disease by only using a frequentist approach, i.e., looking solely at the current symptoms, which account for the observed evidence. Instead, a Bayesian approach would be more suitable to study this probability: it combines the observed evidence with the patient’s history (i.e., the prior knowledge) to deliver our posterior belief on the disease probability.

Having said all this, it is important to reiterate that the focus of this textbook is purely frequentist with regard to data modelling in regression analysis. If you would like to explore the fundamentals of the Bayesian paradigm, Johnson, Ott, and Dogucu (2022) have developed an amazing textbook on the basic probability theory behind this school of statistical thinking, along with a whole variety of regression techniques including the parameter estimation rationale.

2.1.3 Random Variables

Moving along, let us now formally define random variables and their main types, drawing on the insights provided by Casella and Berger (2024) and Soch et al. (2024).

Definition of random variable

A random variable is a function that assigns a real number to each outcome belonging to the sample space \(S\); its observed value is one of these real numbers once we execute a given random experiment. By convention, a random variable is depicted with an uppercase letter, and its support is contained in the real numbers, such that

\[Y \in \mathbb{R}.\]

Definition of discrete random variable

Let \(Y\) be a random variable whose support is \(\mathcal{Y}\). If this support \(\mathcal{Y}\) corresponds to a finite set or a countably infinite set of possible values, then \(Y\) is considered a discrete random variable.

For instance, we can encounter discrete random variables which could be classified as

  • binary (i.e., a finite set of two possible values),
  • categorical (either nominal or ordinal, which have a finite set of three or more possible values), or
  • counts (which might have a finite set or a countably infinite set of possible values as integers).

Image by Pexels via Pixabay.

Definition of continuous random variable

Let \(Y\) be a random variable whose support is \(\mathcal{Y}\). If this support \(\mathcal{Y}\) corresponds to an uncountably infinite set of possible values, then \(Y\) is considered a continuous random variable.

Note a continuous random variable could be

  • completely unbounded (i.e., its set of possible values goes from \(-\infty\) to \(\infty\) as in \(-\infty < y < \infty\)),
  • positively unbounded (i.e., its set of possible values goes from \(0\) to \(\infty\) as in \(0 \leq y < \infty\)),
  • negatively unbounded (i.e., its set of possible values goes from \(-\infty\) to \(0\) as in \(-\infty < y \leq 0\)), or
  • bounded between two values \(a\) and \(b\) (i.e., its set of possible values goes from \(a\) to \(b\) as in \(a \leq y \leq b\)).

Image by arielrobin via Pixabay.

2.1.4 The Wonders of Generative Modelling and Probability Distributions

Definition of generative model

Suppose you observe some data \(y\) from a population or system of interest. Moreover, let us assume this population or system is governed by \(k\) parameters contained in the following vector:

\[ \boldsymbol{\theta} = (\theta_1, \theta_2, \cdots, \theta_k)^T. \]

If we state that our observed data \(y\) follows a certain probability distribution \(\mathcal{D}(\cdot)\), then we will have a generative model \(m\) such that

\[ m: y \sim \mathcal{D}(\boldsymbol{\theta}). \]

Image by Manfred Steger via Pixabay.
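As a quick sketch of this definition, suppose we posit a generative model for the time query, where \(\mathcal{D}(\cdot)\) is an Exponential distribution and the parameter vector collapses to a single hypothetical mean waiting time. Simulating from the model then amounts to drawing data it could have generated:

```python
import numpy as np

rng = np.random.default_rng(seed=307)

# Hypothetical generative model for the time query:
#     m: y ~ Exponential(beta),
# with an assumed mean waiting time of beta = 90 seconds between customers.
beta = 90.0

# "Observing data" from this system amounts to drawing from D(beta).
waiting_times = rng.exponential(scale=beta, size=5)
print(waiting_times.round(1))  # five simulated waiting times, in seconds
```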

Definition of probability distribution

When we define a random variable \(Y\), we also establish a new set of \(v\) possible outcomes \(\mathcal{Y} = \{ y_1, \dots, y_v\}\) coming from the sample space \(S\). This new set of possible outcomes \(\mathcal{Y}\) corresponds to the range of the random variable \(Y\) (i.e., all the possible values it could take on once we execute a given random experiment involving \(Y\)).

That said, let us suppose we have a sample space of \(u\) elements defined as

\[ S = \{ s_1, \dots, s_u \}, \]

where each one of these elements has a probability assigned via a function \(P_S(\cdot)\) such that

\[ P(S) = \sum_{i = 1}^u P_S(s_i) = 1, \]

in agreement with Equation 2.2.

Then, the probability distribution of \(Y\), i.e., \(P_Y(\cdot)\) assigns a probability to each observed value \(Y = y_j\) (with \(j = 1, \dots, v\)) if and only if the outcome of the random experiment belongs to the sample space, i.e., \(s_i \in S\) (for \(i = 1, \dots, u\)) such that \(Y(s_i) = y_j\):

\[ P_Y(Y = y_j) = P \left( \left\{ s_i \in S : Y(s_i) = y_j \right\} \right). \]

Definition of probability mass function

Let \(Y\) be a discrete random variable whose support is \(\mathcal{Y}\). Moreover, suppose that \(Y\) has a probability distribution such that

\[ P_Y(Y = y) : \mathbb{R} \rightarrow [0, 1] \]

where, for all \(y \notin \mathcal{Y}\), we have

\[ P_Y(Y = y) = 0 \]

and

\[ \sum_{y \in \mathcal{Y}} P_Y(Y = y) = 1. \] Then, \(P_Y(Y = y)\) is considered a probability mass function (PMF).
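For instance, we can check these two PMF conditions numerically. Below is a minimal sketch using a Binomial random variable with hypothetical parameters \(n = 10\) and \(p = 0.6\), whose finite support is \(\{0, 1, \dots, 10\}\):

```python
from scipy.stats import binom

# Y ~ Binomial(n = 10, p = 0.6): PMF values over the whole support.
support = range(0, 11)
pmf_values = [binom.pmf(y, n=10, p=0.6) for y in support]

print(all(0 <= p <= 1 for p in pmf_values))  # True: each P_Y(Y = y) is in [0, 1]
print(sum(pmf_values))                       # 1.0 (up to floating-point error)
print(binom.pmf(42, n=10, p=0.6))            # 0.0 for any y outside the support
```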

Definition of probability density function

Let \(Y\) be a continuous random variable whose support is \(\mathcal{Y}\). Furthermore, consider a function \(f_Y(y)\) such that

\[ f_Y(y) : \mathbb{R} \rightarrow \mathbb{R} \]

with

\[ f_Y(y) \geq 0. \]

Then, \(f_Y(y)\) is considered a probability density function (PDF) if the probability of \(Y\) taking on a value within the range represented by the subset \(A \subset \mathcal{Y}\) is equal to

\[ P_Y(Y \in A) = \int_A f_Y(y) \mathrm{d}y \]

with

\[ \int_{\mathcal{Y}} f_Y(y) \mathrm{d}y = 1. \]
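Analogously, we can check these PDF conditions numerically. The sketch below uses a standard Normal random variable (a hypothetical choice) and numerical integration:

```python
from scipy.integrate import quad
from scipy.stats import norm

# Y ~ Normal(0, 1): a completely unbounded continuous random variable.
f_Y = norm(loc=0, scale=1).pdf

# P_Y(Y in A) for the range A = [-1, 1], via numerical integration.
prob_A, _ = quad(f_Y, -1, 1)
print(prob_A)   # about 0.6827

# The PDF must integrate to one over the whole support.
total, _ = quad(f_Y, -float("inf"), float("inf"))
print(total)    # 1.0 (up to numerical error)
```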

2.1.5 Characterizing Probability Distributions

Definition of measure of central tendency

Probabilistically, a measure of central tendency is defined as a metric that identifies a central or typical value of a given probability distribution. In other words, a measure of central tendency refers to a central or typical value that a given random variable might take when we observe various realizations of this variable over a long period.

Image by Manfred Steger via Pixabay.

Definition of measure of uncertainty

Probabilistically, a measure of uncertainty refers to the spread of a given random variable when we observe its different realizations in the long term. Note a larger spread indicates more variability in these realizations. On the other hand, a smaller spread denotes less variability in these realizations.

Image by Manfred Steger via Pixabay.

Definition of expected value

Let \(Y\) be a random variable whose support is \(\mathcal{Y}\). In general, the expected value or mean \(\mathbb{E}(Y)\) of this random variable is defined as a weighted average according to its corresponding probability distribution. In other words, this measure of central tendency \(\mathbb{E}(Y)\) aims to find the middle value of this random variable by weighting all its possible values in its support \(\mathcal{Y}\) as dictated by its probability distribution.

Given the above definition, when \(Y\) is a discrete random variable whose PMF is \(P_Y(Y = y)\), then its expected value is mathematically defined as

\[ \mathbb{E}(Y) = \sum_{y \in \mathcal{Y}} y \cdot P_Y(Y = y). \tag{2.3}\]

When \(Y\) is a continuous random variable whose PDF is \(f_Y(y)\), its expected value is mathematically defined as

\[ \mathbb{E}(Y) = \int_{\mathcal{Y}} y \cdot f_Y(y) \mathrm{d}y. \tag{2.4}\]

Image by Manfred Steger via Pixabay.
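We can compute Equation 2.3 and Equation 2.4 directly. Here is a minimal sketch with hypothetical choices: a Binomial(10, 0.6) random variable for the discrete case and a Normal(2, 1) random variable for the continuous one:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import binom, norm

# Discrete case (Equation 2.3): weight each support value by its PMF.
support = np.arange(0, 11)
print(np.sum(support * binom.pmf(support, n=10, p=0.6)))  # 6.0, i.e., n * p

# Continuous case (Equation 2.4): integrate y times the PDF over the support.
f_Y = norm(loc=2, scale=1).pdf
expected, _ = quad(lambda y: y * f_Y(y), -np.inf, np.inf)
print(expected)  # 2.0 (up to numerical error)
```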

Definition of cumulative distribution function

Let \(Y\) be a random variable, either discrete or continuous. Its cumulative distribution function (CDF) \(F_Y(y) : \mathbb{R} \rightarrow [0, 1]\) refers to the probability that \(Y\) is less than or equal to an observed value \(y\):

\[ F_Y(y) = P(Y \leq y). \tag{2.5}\]

Then, we have the following by type of random variable:

  • When \(Y\) is discrete, whose support is \(\mathcal{Y}\), suppose it has a PMF \(P_Y(Y = y)\). Then, the CDF is mathematically represented as:

\[ F_Y(y) = \sum_{\substack{t \in \mathcal{Y} \\ t \leq y}} P_Y(Y = t). \tag{2.6}\]

  • When \(Y\) is continuous, whose support is \(\mathcal{Y}\), suppose it has a PDF \(f_Y(y)\). Then, the CDF is mathematically represented as:

\[ F_Y(y) = \int_{-\infty}^y f_Y(t) \mathrm{d}t. \tag{2.7}\]

Note that in Equation 2.6 and Equation 2.7 we use the auxiliary variable \(t\) because the observed value \(y\) plays the role of the upper limit of the summation or integral; hence, it cannot also be the variable being summed or integrated over in the PMF or PDF.
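The sketch below evaluates Equation 2.6 and Equation 2.7 by hand (same hypothetical Binomial and Normal choices as before) and compares them against library CDFs:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import binom, norm

# Discrete CDF (Equation 2.6): sum the PMF over all t in the support with t <= y.
y = 6
support = np.arange(0, 11)
by_hand = binom.pmf(support[support <= y], n=10, p=0.6).sum()
print(by_hand, binom.cdf(y, n=10, p=0.6))  # both about 0.6177

# Continuous CDF (Equation 2.7): integrate the PDF from -infinity up to y.
by_hand, _ = quad(norm(loc=0, scale=1).pdf, -np.inf, 1.0)
print(by_hand, norm.cdf(1.0))              # both about 0.8413
```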

Heads-up on the properties of the cumulative distribution function!

It is important to clarify that a valid CDF \(F_Y(y)\) fulfils the following properties:

  1. \(F_Y(y)\) must be a non-decreasing function.
  2. Given that \(F_Y(y) : \mathbb{R} \rightarrow [0, 1]\), it must never evaluate to a value \(< 0\) or \(> 1\). The output of a CDF is a cumulative probability, hence the previous bounds.
  3. When \(y \rightarrow -\infty\), it follows that \(F_Y(y) \rightarrow 0\).
  4. When \(y \rightarrow \infty\), it follows that \(F_Y(y) \rightarrow 1\).

Now, in the case of a CDF corresponding to a continuous random variable \(Y\), there is an additional handy property that relates the CDF \(F_Y(y)\) to the PDF \(f_Y(y)\):

\[ f_Y(y) = \frac{\mathrm{d}}{\mathrm{d}y} F_Y(y). \tag{2.8}\]

Equation 2.8 indicates that the PDF of \(Y\) can be obtained by taking the first derivative of the CDF with respect to \(y\).
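As a quick numerical check of Equation 2.8, we can approximate the derivative of the CDF with a central difference and compare it against the PDF (standard Normal assumed):

```python
from scipy.stats import norm

# Approximate dF_Y/dy at y = 1 and compare against f_Y(1).
y, h = 1.0, 1e-6
derivative_of_cdf = (norm.cdf(y + h) - norm.cdf(y - h)) / (2 * h)
print(derivative_of_cdf, norm.pdf(y))  # both about 0.2420
```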

Tip on the Law of the Unconscious Statistician!

The law of the unconscious statistician (LOTUS) is a particular theorem in probability theory that allows us to compute a wide variety of expected values. Let us properly define it for both discrete and continuous random variables.

Theorem 2.1 Let \(Y\) be a discrete random variable whose support is \(\mathcal{Y}\). The LOTUS indicates that the expected value of a general function \(g(Y)\) of this random variable \(Y\) can be obtained via \(g(Y)\) along with the corresponding PMF \(P_Y(Y = y)\). Hence, the expected value of \(g(Y)\) can be obtained as

\[ \mathbb{E}\left[ g(Y) \right] = \sum_{y \in \mathcal{Y}} g(y) \cdot P_Y(Y = y). \tag{2.9}\]

Proof. Let us explore the rationale provided by Soch et al. (2024). Thus, we will rename the general function \(g(Y)\) as another random variable called \(Z\) such that:

\[ Z = g(Y). \tag{2.10}\]

Note this function \(g(Y)\) can take on equal values \(g(y_1), g(y_2), \dots\) coming from different observed values \(y_1, y_2, \dots\); for example, if

\[ g(y) = y^2 \]

both

\[ y_1 = 2 \quad \text{and} \quad y_2 = -2 \]

yield

\[ g(y_1) = g(y_2) = 4. \]

The above Equation 2.10 is formally called a random variable transformation from the general function of random variable \(Y\), \(g(Y)\), to a new random variable \(Z\). Having said that, when we set up a transformation of this class, there will be a support mapping from this general function \(g(Y)\) to \(Z\). This will also yield a proper PMF,

\[ P_Z(Z = z) : \mathbb{R} \rightarrow [0, 1] \quad \forall z \in \mathcal{Z}, \]

given that \(g(Y)\) is a random variable-based function.

Therefore, using the expected value definition for a discrete random variable as in Equation 2.3, we have the following for \(Z\):

\[ \mathbb{E}(Z) = \sum_{z \in \mathcal{Z}} z \cdot P_Z(Z = z). \tag{2.11}\]

Within the support \(\mathcal{Z}\), suppose that \(z_1, z_2, \dots\) are the possible different values of \(Z\) corresponding to function \(g(Y)\). Then, for the \(i\)th value \(z_i\) in this correspondence, let \(I_i\) be the collection of all \(y_j\) such that

\[ g(y_j) = z_i. \tag{2.12}\]

Now, let us tweak the above expression from Equation 2.11 a bit to include this setting:

\[ \begin{align*} \mathbb{E}(Z) &= \sum_{z \in \mathcal{Z}} z \cdot P_Z(Z = z) \\ &= \sum_{i} z_i \cdot P_{g(Y)}(Z = z_i) \\ & \qquad \text{we subset the summation to all $z_i$ with $Z = g(Y)$}\\ &= \sum_{i} z_i \sum_{j \in I_i} P_Y(Y = y_j). \\ \end{align*} \tag{2.13}\]

The last line of Equation 2.13 maps the probabilities associated with all \(z_i\) in the corresponding PMF of \(Z\), \(P_Z(\cdot)\) via the function \(g(Y)\), to the original PMF of \(Y\), \(P_Y(\cdot)\), for all those \(y_j\) contained in the collection \(I_i\). Given that certain values \(z_i\) can be obtained with more than one value \(y_j\), such as in the above example when \(g(y) = y^2\) for \(y_1 = 2\) and \(y_2 = -2\), note we have a second summation of probabilities applied to the PMF of \(Y\).

Moving along with Equation 2.13 in conjunction with Equation 2.12, we have that:

\[ \begin{align*} \mathbb{E}(Z) &= \sum_{i} z_i \sum_{j \in I_i} P_Y(Y = y_j) \\ &= \sum_{i} \sum_{j \in I_i} z_i \cdot P_Y(Y = y_j) \\ &= \sum_{i} \sum_{j \in I_i} g(y_j) \cdot P_Y(Y = y_j). \end{align*} \tag{2.14}\]

The double summation in Equation 2.14 can be summarized into a single one, given neither of the factors on the right-hand side is subindexed by \(i\). Furthermore, this standalone summation can be applied to all \(y \in \mathcal{Y}\) while getting rid of the subindex \(j\) in the factors on the right-hand side:

\[ \begin{align*} \mathbb{E}(Z) &= \sum_{i} \sum_{j \in I_i} g(y_j) \cdot P_Y(Y = y_j) \\ &= \sum_{y \in \mathcal{Y}} g(y) \cdot P_Y(Y = y) \\ &= \mathbb{E}\left[ g(Y) \right]. \end{align*} \]

Therefore, we have:

\[ \mathbb{E}\left[ g(Y) \right] = \sum_{y \in \mathcal{Y}} g(y) \cdot P_Y(Y = y). \quad \square \]

Theorem 2.2 Let \(Y\) be a continuous random variable whose support is \(\mathcal{Y}\). The LOTUS indicates that the expected value of a general function \(g(Y)\) of this random variable \(Y\) can be obtained via \(g(Y)\) along with the corresponding PDF \(f_Y(y)\). Thus, the expected value of \(g(Y)\) can be obtained as

\[ \mathbb{E}\left[ g(Y) \right] = \int_{\mathcal{Y}} g(y) \cdot f_Y(y) \mathrm{d}y. \tag{2.15}\]

Proof. Let us explore the rationale provided by Soch et al. (2024). Hence, we will rename the general function \(g(Y)\) as another random variable called \(Z\) such that:

\[ Z = g(Y). \tag{2.16}\]

As in the discrete LOTUS proof, the above Equation 2.16 is formally called a random variable transformation from the general function of random variable \(Y,\) \(g(Y)\), to a new random variable \(Z\). Therefore, when we set up a transformation of this class, there will be a support mapping from this general function \(g(Y)\) to \(Z\). This will also yield a proper PDF:

\[ f_Z(z) : \mathbb{R} \rightarrow [0, \infty) \quad \forall z \in \mathcal{Z}, \]

given that \(g(Y)\) is a random variable-based function.

Analogous to Equation 2.5, we will use the concept of the CDF for a continuous random variable \(Z\):

\[ \begin{align*} F_Z(z) &= P(Z \leq z) \\ &= P\left[g(Y) \leq z \right] \\ &= P\left[Y \leq g^{-1}(z) \right] \\ &= F_Y\left[ g^{-1}(z) \right]. \end{align*} \tag{2.17}\]

A well-known Calculus result is the inverse function theorem. Assuming that

\[ z = g(y) \]

is an invertible and differentiable function (taken here to be strictly increasing, so that the third line of Equation 2.17 holds), then the inverse

\[ y = g^{-1}(z) \tag{2.18}\]

must be differentiable as in:

\[ \frac{\mathrm{d}}{\mathrm{d}z} \left[ g^{-1}(z) \right] = \frac{1}{g' \left[ g^{-1}(z) \right]}. \tag{2.19}\]

Note that we differentiate Equation 2.18 as follows:

\[ \frac{\mathrm{d}}{\mathrm{d}z} y = \frac{\mathrm{d}}{\mathrm{d}z} \left[ g^{-1}(z) \right]. \tag{2.20}\]

Then, plugging Equation 2.20 into Equation 2.19, we obtain:

\[ \begin{gather*} \frac{\mathrm{d}}{\mathrm{d}z} y = \frac{1}{g' \left[ g^{-1}(z) \right]} \\ \mathrm{d}y = \frac{1}{g' \left[ g^{-1}(z) \right]} \mathrm{d}z. \end{gather*} \tag{2.21}\]

Analogous to Equation 2.8, we use the property that relates the CDF \(F_Z(z)\) to the PDF \(f_Z(z)\):

\[ f_Z(z) = \frac{\mathrm{d}}{\mathrm{d}z} F_Z(z). \]

Using Equation 2.17, we have:

\[ \begin{align*} f_Z(z) &= \frac{\mathrm{d}}{\mathrm{d}z} F_Z(z) \\ &= \frac{\mathrm{d}}{\mathrm{d}z} F_Y\left[ g^{-1}(z) \right] \\ &= f_Y\left[ g^{-1}(z) \right] \frac{\mathrm{d}}{\mathrm{d}z} \left[ g^{-1}(z) \right]. \end{align*} \]

Then, via Equation 2.19, it follows that:

\[ f_Z(z) = f_Y\left[ g^{-1}(z) \right] \frac{1}{g' \left[ g^{-1}(z) \right]}. \tag{2.22}\]

Therefore, using the expected value definition for a continuous random variable as in Equation 2.4, we have for \(Z\) that

\[ \mathbb{E}(Z) = \int_{\mathcal{Z}} z \cdot f_Z(z) \mathrm{d}z, \]

which yields via Equation 2.22:

\[ \mathbb{E}(Z) = \int_{\mathcal{Z}} z \cdot f_Y \left[ g^{-1}(z) \right] \frac{1}{g' \left[ g^{-1}(z) \right]} \mathrm{d}z. \]

Using Equation 2.18 and Equation 2.21, it follows that:

\[ \begin{align*} \mathbb{E}(Z) &= \int_{\mathcal{Z}} z \cdot f_Y(y) \frac{1}{g' \left[ g^{-1}(z) \right]} \mathrm{d}z \\ &= \int_{\mathcal{Y}} g(y) \cdot f_Y(y) \mathrm{d}y. \end{align*} \]

Note the last line in the above equation changes the integration limits to the support of \(Y\), given all terms end up depending on \(y\) on the right-hand side.

Finally, given the random variable transformation from Equation 2.16, we have:

\[ \mathbb{E}\left[ g(Y) \right] = \int_{\mathcal{Y}} g(y) \cdot f_Y(y) \mathrm{d}y. \quad \square \]
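As a numerical sanity check of the LOTUS, the sketch below takes \(g(y) = y^2\) and a hypothetical Binomial(10, 0.6) random variable, comparing Equation 2.9 against a plain Monte Carlo average of \(g(Y)\):

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(seed=2024)

# LOTUS (Equation 2.9): sum g(y) weighted by the PMF over the support.
support = np.arange(0, 11)
lotus = np.sum(support ** 2 * binom.pmf(support, n=10, p=0.6))

# Monte Carlo check: average g(Y) over many simulated realizations of Y.
draws = rng.binomial(n=10, p=0.6, size=1_000_000)
monte_carlo = np.mean(draws.astype(float) ** 2)

print(lotus)        # E(Y^2) = Var(Y) + [E(Y)]^2 = 2.4 + 36 = 38.4
print(monte_carlo)  # close to 38.4
```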

Definition of variance

Let \(Y\) be a discrete or continuous random variable whose support is \(\mathcal{Y}\) with a mean represented by \(\mathbb{E}(Y)\). Then, the variance of \(Y\) is the mean of the squared deviation from the corresponding mean as follows:

\[ \text{Var}(Y) = \mathbb{E}\left\{[ Y - \mathbb{E}(Y)]^2 \right\}. \tag{2.23}\]

Note the expression above is equivalent to:

\[ \text{Var}(Y) = \mathbb{E}(Y^2) - \left[ \mathbb{E}(Y) \right]^2. \tag{2.24}\]

Image by Manfred Steger via Pixabay.

Heads-up on the two mathematical expressions of the variance!

Proving the equivalence of Equation 2.23 and Equation 2.24 requires introducing some further properties of the expected value of a random variable while using the LOTUS. We will dig into the insights provided by Casella and Berger (2024).

Theorem 2.3 Let \(Y\) be a discrete or continuous random variable. Furthermore, let \(a\), \(b\), and \(c\) be constants. Thus, for any functions \(g_1(y)\) and \(g_2(y)\) whose means exist, we have that:

\[ \mathbb{E}\left[ a g_1(Y) + b g_2(Y) + c \right] = a \mathbb{E}\left[ g_1(Y) \right] + b \mathbb{E}\left[ g_2(Y) \right] + c. \tag{2.25}\]

Firstly, let us prove Equation 2.25 for the discrete case.

Proof. Let \(Y\) be a discrete random variable whose support is \(\mathcal{Y}\) and PMF is \(P_Y(Y = y)\). Let us apply the LOTUS as in Equation 2.9:

\[ \mathbb{E}\left[ a g_1(Y) + b g_2(Y) + c \right] = \sum_{y \in \mathcal{Y}} \left[ a g_1(y) + b g_2(y) + c \right] \cdot P_Y(Y = y). \] We can distribute the summation across each addend as follows:

\[ \begin{align*} \mathbb{E}\left[ a g_1(Y) + b g_2(Y) + c \right] &= \sum_{y \in \mathcal{Y}} \left[ a g_1(y) \right] \cdot P_Y(Y = y) + \\ & \qquad \sum_{y \in \mathcal{Y}} \left[ b g_2(y) \right] \cdot P_Y(Y = y) + \\ & \qquad \sum_{y \in \mathcal{Y}} c \cdot P_Y(Y = y). \end{align*} \]

Let us take the constants out of the corresponding summations:

\[ \begin{align*} \mathbb{E}\left[ a g_1(Y) + b g_2(Y) + c \right] &= a \sum_{y \in \mathcal{Y}} g_1(y) \cdot P_Y(Y = y) + \\ & \qquad b \sum_{y \in \mathcal{Y}} g_2(y) \cdot P_Y(Y = y) + \\ & \qquad c \underbrace{\sum_{y \in \mathcal{Y}} P_Y(Y = y)}_1 \\ &= a \underbrace{\sum_{y \in \mathcal{Y}} g_1(y) \cdot P_Y(Y = y)}_{\mathbb{E} \left[ g_1(Y) \right]} + \\ & \qquad b \underbrace{\sum_{y \in \mathcal{Y}} g_2(y) \cdot P_Y(Y = y)}_{\mathbb{E} \left[ g_2(Y) \right]} + c. \end{align*} \]

For the first and second addends on the right-hand side in the above equation, let us apply the LOTUS again:

\[ \mathbb{E}\left[ a g_1(Y) + b g_2(Y) + c \right] = a \mathbb{E} \left[ g_1(Y) \right] + b \mathbb{E} \left[ g_2(Y) \right] + c. \quad \square \]

Secondly, let us prove Equation 2.25 for the continuous case.

Proof. Let \(Y\) be a continuous random variable whose support is \(\mathcal{Y}\) and PDF is \(f_Y(y)\). Let us apply the LOTUS as in Equation 2.15:

\[ \mathbb{E}\left[ a g_1(Y) + b g_2(Y) + c \right] = \int_{\mathcal{Y}} \left[ a g_1 (y) + b g_2(y) + c \right] \cdot f_Y(y) \mathrm{d}y. \]

We distribute the integral on the right-hand side of the above equation:

\[ \begin{align*} \mathbb{E}\left[ a g_1(Y) + b g_2(Y) + c \right] &= \int_{\mathcal{Y}} \left[ a g_1 (y) \right] \cdot f_Y(y) \mathrm{d}y + \\ & \qquad \int_{\mathcal{Y}} \left[ b g_2(y) \right] \cdot f_Y(y) \mathrm{d}y + \\ & \qquad \int_{\mathcal{Y}} c \cdot f_Y(y) \mathrm{d}y. \end{align*} \]

Let us take the constants out of the corresponding integrals:

\[ \begin{align*} \mathbb{E}\left[ a g_1(Y) + b g_2(Y) + c \right] &= a \int_{\mathcal{Y}} g_1 (y) \cdot f_Y(y) \mathrm{d}y + \\ & \qquad b \int_{\mathcal{Y}} g_2(y) \cdot f_Y(y) \mathrm{d}y + \\ & \qquad c \underbrace{\int_{\mathcal{Y}} f_Y(y) \mathrm{d}y}_{1} \\ &= a \underbrace{\int_{\mathcal{Y}} g_1 (y) \cdot f_Y(y) \mathrm{d}y}_{\mathbb{E} \left[ g_1(Y) \right]} + \\ & \qquad b \underbrace{\int_{\mathcal{Y}} g_2(y) \cdot f_Y(y) \mathrm{d}y}_{\mathbb{E} \left[ g_2(Y) \right]} + c. \end{align*} \]

For the first and second addends on the right-hand side in the above equation, let us apply the LOTUS again:

\[ \mathbb{E}\left[ a g_1(Y) + b g_2(Y) + c \right] = a \mathbb{E} \left[ g_1(Y) \right] + b \mathbb{E} \left[ g_2(Y) \right] + c. \quad \square \]

Finally, after applying some algebraic rearrangements along with the expected value properties shown in Equation 2.25, we can show that Equation 2.23 and Equation 2.24 are equivalent:

Proof. \[ \begin{align*} \text{Var}(Y) &= \mathbb{E}\left\{[ Y - \mathbb{E}(Y)]^2 \right\} \\ &= \mathbb{E} \left\{ Y^2 - 2Y \mathbb{E}(Y) + \left[ \mathbb{E}(Y) \right]^2 \right\} \\ &= \mathbb{E}(Y^2) - \mathbb{E} \left[ 2Y \mathbb{E}(Y) \right] + \mathbb{E} \left\{ \left[ \mathbb{E}(Y) \right]^2 \right\} \\ & \qquad \text{distributing the expected value operator} \\ &= \mathbb{E}(Y^2) - 2 \mathbb{E} \left[ Y \mathbb{E}(Y) \right] + \mathbb{E} \left\{ \left[ \mathbb{E}(Y) \right]^2 \right\} \\ & \qquad \text{since $2$ is a constant} \\ &= \mathbb{E}(Y^2) - 2 \mathbb{E}(Y) \mathbb{E}(Y) + \left[ \mathbb{E}(Y) \right]^2 \\ & \qquad \text{since $\mathbb{E}(Y)$ is a constant} \\ &= \mathbb{E}(Y^2) - 2 \left[ \mathbb{E}(Y) \right]^2 + \left[ \mathbb{E}(Y) \right]^2 \\ &= \mathbb{E}(Y^2) - \left[ \mathbb{E}(Y) \right]^2. \qquad \qquad \qquad \qquad \qquad \square \end{align*} \]
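We can also verify this equivalence numerically. The sketch below evaluates Equation 2.23 and Equation 2.24 for a hypothetical Binomial(10, 0.6) random variable:

```python
import numpy as np
from scipy.stats import binom

# PMF of Y ~ Binomial(10, 0.6) over its whole support.
support = np.arange(0, 11)
pmf = binom.pmf(support, n=10, p=0.6)

mean = np.sum(support * pmf)
var_deviation = np.sum((support - mean) ** 2 * pmf)    # Equation 2.23
var_shortcut = np.sum(support ** 2 * pmf) - mean ** 2  # Equation 2.24

print(var_deviation, var_shortcut)  # both 2.4, i.e., n * p * (1 - p)
```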

2.1.6 The Rationale in Random Sampling

Definition of conditional probability

Suppose you have two events of interest, \(A\) and \(B\), in a random phenomenon of a population or system of interest. These two events belong to the sample space \(S\). Moreover, assume that the probability of event \(B\) is such that

\[ P(B) > 0, \]

where \(B\) is considered the conditioning event.

Hence, the conditional probability of event \(A\) given event \(B\) is defined as

\[ P(A | B) = \frac{P(A \cap B)}{P(B)}, \tag{2.26}\]

where \(P(A \cap B)\) is read as the probability of the intersection of events \(A\) and \(B\).

Image by Pexels via Pixabay.
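To see Equation 2.26 in action, here is a minimal simulation sketch with a hypothetical two-dice example: \(A\) is “the sum equals 8” and \(B\) is “the first die shows at least 4”:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Roll two fair dice many times.
n = 1_000_000
die1 = rng.integers(1, 7, size=n)
die2 = rng.integers(1, 7, size=n)
A = (die1 + die2) == 8  # event A: the sum equals 8
B = die1 >= 4           # event B: the first die shows at least 4

# Direct estimate: restrict attention to the outcomes where B occurs.
print((A & B).sum() / B.sum())

# Equation 2.26: P(A and B) / P(B); both estimates are about 1/6 = 0.1667.
print(((A & B).sum() / n) / (B.sum() / n))
```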

Tip on the rationale behind conditional probability!

We can delve into the rationale of Equation 2.26 by using a handy concept called cardinality, which refers to the total number of possible outcomes of a random phenomenon belonging to any given event or sample space. Note that the rationale below assumes that all outcomes in the sample space are equally likely.

Proof. Let \(|S|\) be the cardinality corresponding to the sample space in a random phenomenon. Hence, as in Equation 2.2, we have that:

\[ P(S) = \frac{|S|}{|S|} = 1. \]

Moreover, suppose that \(A\) is the primary event of interest, whose cardinality is represented by \(|A|\). As an alternative to Equation 2.1, the probability of \(A\) can be represented as

\[ P(A) = \frac{|A|}{|S|}. \]

On the other hand, the probability of the conditioning event \(B\), whose cardinality is \(|B|\), is

\[ P(B) = \frac{|B|}{|S|}. \tag{2.27}\]

Now, let \(|A \cap B|\) be the cardinality of the intersection between events \(A\) and \(B\). Its probability can be represented as:

\[ P(A \cap B) = \frac{|A \cap B|}{|S|}. \tag{2.28}\]

Analogous to Equation 2.27 and Equation 2.28, we can view the conditional probability \(P(A | B)\) as an updated probability of the primary event \(A\) restricted to the cardinality of the conditioning event \(|B|\). This places \(|A \cap B|\) in the numerator and \(|B|\) in the denominator as follows:

\[ P(A | B) = \frac{|A \cap B|}{|B|}. \tag{2.29}\]

Therefore, we can play around with Equation 2.29 along with Equation 2.27 and Equation 2.28 as follows:

\[ \begin{align*} P(A | B) &= \frac{|A \cap B|}{|B|} \\ &= \frac{\frac{|A \cap B|}{|S|}}{\frac{|B|}{|S|}} \qquad \text{dividing numerator and denominator by $|S|$} \\ &= \frac{P(A \cap B)}{P(B)}. \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad \square \end{align*} \]

Definition of the Bayes’ rule

Suppose you have two events of interest, \(A\) and \(B\), in a random phenomenon of a population or system of interest. From Equation 2.26, we can state the following expression for the conditional probability of \(A\) given \(B\):

\[ P(A | B) = \frac{P(A \cap B)}{P(B)} \quad \text{if $P(B) > 0$.} \tag{2.30}\]

Note the conditional probability of \(B\) given \(A\) can be stated as:

\[ \begin{align*} P(B | A) &= \frac{P(B \cap A)}{P(A)} \quad \text{if $P(A) > 0$} \\ &= \frac{P(A \cap B)}{P(A)} \quad \text{since $P(B \cap A) = P(A \cap B)$.} \end{align*} \tag{2.31}\]

Then, we can manipulate Equation 2.31 as follows:

\[ P(A \cap B) = P(B | A) \times P(A). \]

The above result can be plugged into Equation 2.30:

\[ \begin{align*} P(A | B) &= \frac{P(A \cap B)}{P(B)} \\ &= \frac{P(B | A) \times P(A)}{P(B)}. \end{align*} \tag{2.32}\]

Equation 2.32 is called the Bayes’ rule. We are basically flipping around conditional probabilities.
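A small numerical sketch of this flip, with hypothetical ice cream numbers, could look as follows:

```python
# Hypothetical probabilities for our population of interest.
p_A = 0.30          # P(A): the child is between 4 and 7 years old
p_B = 0.55          # P(B): the child prefers chocolate
p_B_given_A = 0.70  # P(B | A): prefers chocolate, given 4 to 7 years old

# Bayes' rule (Equation 2.32): flip P(B | A) into P(A | B).
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 4))  # about 0.3818
```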

Definition of independence

Suppose you have two events of interest, \(A\) and \(B\), in a random phenomenon of a population or system of interest. These two events are statistically independent if the occurrence of event \(B\) does not affect the probability of event \(A\), and vice versa. Therefore, the probability of their corresponding intersection is given by:

\[ P(A \cap B) = P(A) \times P(B). \tag{2.33}\]

Let us expand the above definition to a random variable framework:

  • Suppose you have a set of \(n\) mutually independent discrete random variables \(Y_1, \dots, Y_n\) whose supports are \(\mathcal{Y_1}, \dots, \mathcal{Y_n}\) with PMFs \(P_{Y_1}(Y_1 = y_1), \dots, P_{Y_n}(Y_n = y_n)\) respectively. That said, the joint PMF of these \(n\) random variables is the multiplication of their corresponding standalone PMFs:

\[ \begin{align*} P_{Y_1, \dots, Y_n}(Y_1 = y_1, \dots, Y_n = y_n) &= \prod_{i = 1}^n P_{Y_i}(Y_i = y_i) \\ & \qquad \text{for all} \\ & \qquad \quad y_i \in \mathcal{Y}_i, i = 1, \dots, n. \end{align*} \tag{2.34}\]

  • Suppose you have a set of \(n\) mutually independent continuous random variables \(Y_1, \dots, Y_n\) whose supports are \(\mathcal{Y_1}, \dots, \mathcal{Y_n}\) with PDFs \(f_{Y_1}(y_1), \dots, f_{Y_n}(y_n)\) respectively. That said, the joint PDF of these \(n\) random variables is the multiplication of their corresponding standalone PDFs:

\[ \begin{align*} f_{Y_1, \dots, Y_n}(y_1, \dots, y_n) &= \prod_{i = 1}^n f_{Y_i}(y_i) \\ & \qquad \text{for all} \\ & \qquad \quad y_i \in \mathcal{Y}_i, i = 1, \dots, n. \end{align*} \tag{2.35}\]
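As a quick simulation sketch of Equation 2.33 (with hypothetical probabilities), we can check that the relative frequency of the intersection of two independently generated events matches the product of their individual relative frequencies:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Two independently generated events with hypothetical probabilities.
n = 1_000_000
A = rng.binomial(n=1, p=0.3, size=n).astype(bool)  # P(A) = 0.3
B = rng.binomial(n=1, p=0.6, size=n).astype(bool)  # P(B) = 0.6

print((A & B).mean())       # estimate of P(A and B)
print(A.mean() * B.mean())  # P(A) * P(B); both are about 0.18
```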

Tip on the rationale behind the rule of independent events!

We can delve into the rationale of Equation 2.33 by using the Bayes’ rule from Equation 2.32 along with the basic conditional probability formula from Equation 2.26.

Proof. Firstly, let us assume that a given event \(B\) does not affect event \(A\), which can be probabilistically represented as

\[ P(A | B) = P(A). \tag{2.36}\]

If the statement in Equation 2.36 holds, then by using the Bayes’ rule from Equation 2.32, we can manipulate the conditional probability formula as follows:

\[ \begin{align*} P(B | A) &= \frac{P(B \cap A)}{P(A)} \\ &= \frac{P(A \cap B)}{P(A)} \qquad \text{since $P(B \cap A) = P(A \cap B)$} \\ &= \frac{P(A | B) \times P(B)}{P(A)} \qquad \text{by the Bayes' rule} \\ &= \frac{P(A) \times P(B)}{P(A)} \qquad \text{since $P(A | B) = P(A)$} \\ &= P(B). \end{align*} \]

Then, again by using the Bayes’ rule, we obtain \(P(B \cap A)\) as follows:

\[ \begin{align*} P(B \cap A) &= P(B | A) \times P(A) \\ &= P(B) \times P(A) \qquad \text{since $P(B | A) = P(B)$.} \end{align*} \]

Finally, we have that:

\[ \begin{align*} P(A \cap B) &= P(B \cap A) \\ &= P(B) \times P(A) \\ &= P(A) \times P(B). \qquad \square \end{align*} \]

Definition of random sample

A random sample is a collection of random variables \(Y_1, \dots, Y_n\) of size \(n\) coming from a given population or system of interest. Note that the most elementary definition of a random sample assumes that these \(n\) random variables are mutually independent and identically distributed (which is abbreviated as iid).

The fact that these \(n\) random variables are identically distributed indicates that they have the same mathematical form for their corresponding PMFs or PDFs, depending on whether they are discrete or continuous respectively. Hence, under a generative modelling approach in a population or system of interest governed by \(k\) parameters contained in the vector

\[ \boldsymbol{\theta} = (\theta_1, \theta_2, \cdots, \theta_k)^T, \]

we can apply the iid property in an elementary random sample to obtain the following joint probability distributions:

  • In the case of \(n\) iid discrete random variables \(Y_1, \dots, Y_n\) whose common standalone PMF is \(P_Y(Y = y)\) with support \(\mathcal{Y}\), the joint PMF is mathematically expressed as

\[ \begin{align*} P_{Y_1, \dots, Y_n}(Y_1 = y_1, \dots, Y_n = y_n | \boldsymbol{\theta}) &= \prod_{i = 1}^n P_Y(Y = y_i | \boldsymbol{\theta}) \\ & \qquad \text{for all} \\ & \qquad \quad y_i \in \mathcal{Y}, i = 1, \dots, n. \end{align*} \tag{2.37}\]

  • In the case of \(n\) iid continuous random variables \(Y_1, \dots, Y_n\) whose common standalone PDF is \(f_Y(y)\) with support \(\mathcal{Y}\), the joint PDF is mathematically expressed as

\[ \begin{align*} f_{Y_1, \dots, Y_n}(y_1, \dots, y_n | \boldsymbol{\theta}) &= \prod_{i = 1}^n f_Y(y_i | \boldsymbol{\theta}) \\ & \qquad \text{for all} \\ & \qquad \quad y_i \in \mathcal{Y}, i = 1, \dots, n. \end{align*} \tag{2.38}\]

Unlike Equation 2.34 and Equation 2.35, note that Equation 2.37 and Equation 2.38 use a common subscript \(Y\) in the corresponding probability distributions since we have identically distributed random variables. Furthermore, the joint distributions are conditioned on the population parameter vector \(\boldsymbol{\theta}\), which reflects our generative modelling approach.
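Tying this back to our time query, the sketch below draws a small iid random sample of waiting times from a hypothetical Exponential model and evaluates Equation 2.38 as a product of standalone PDFs, a quantity we will revisit as the likelihood in Section 2.2:

```python
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(seed=99)

# An iid random sample of size n = 5 from y ~ Exponential(beta), with a
# hypothetical mean waiting time of beta = 90 seconds.
beta = 90.0
sample = rng.exponential(scale=beta, size=5)

# Equation 2.38: the joint PDF, conditioned on the parameter, is the product
# of the common standalone PDF evaluated at each observation.
joint_pdf = np.prod(expon(scale=beta).pdf(sample))
print(sample.round(1))
print(joint_pdf)  # a tiny number: the density of observing this exact sample
```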

Image by Pexels via Pixabay.

2.2 What is Maximum Likelihood Estimation?

2.3 Basics of Frequentist Statistical Inference

Figure 2.2: Results stage from the data science workflow in Figure 1.1. This stage is directly followed by storytelling and preceded by goodness of fit.
Figure 2.3: A classical-based hypothesis testing workflow structured in four substages: general settings, hypotheses definitions, test flavour and components, and inferential conclusions.

2.3.1 General Settings

Figure 2.4: General settings substage from the classical-based hypothesis testing workflow in Figure 2.3. This substage is directly followed by the hypotheses definitions.

Definition of hypothesis testing

A frequentist hypothesis test is a formal procedure that uses sample data to weigh the evidence against a default claim about one or more population parameters (the null hypothesis) in favour of a competing claim (the alternative hypothesis), while controlling how often randomness alone could mislead us.

Definition of null hypothesis

A null hypothesis, usually denoted \(H_0\), is the default claim about the population parameter(s) of interest, typically stating that there is no effect or no difference; we assume it is true unless the observed evidence indicates otherwise.

Definition of alternative hypothesis

An alternative hypothesis, usually denoted \(H_a\) or \(H_1\), is the competing claim we gather evidence for, stating that there is an effect or a difference in the population parameter(s) of interest.

Definition of type I error

Type I error is defined as rejecting the null hypothesis when it is actually true.

Definition of type II error

Type II error is defined as failing to reject the null hypothesis when it is actually false.

Definition of significance level

Significance level, usually denoted \(\alpha\), is defined as the maximum probability of committing a type I error that we are willing to tolerate; it is fixed before running the test (a common convention being \(\alpha = 0.05\)).

Definition of power

The statistical power of a test is defined as the probability of correctly rejecting the null hypothesis when it is actually false, i.e., one minus the probability of a type II error.

2.3.2 Hypotheses Definitions

Figure 2.5: Hypotheses definitions substage from the classical-based hypothesis testing workflow in Figure 2.3. This substage is directly preceded by general settings and followed by test flavour and components.

2.3.3 Test Flavour and Components

Figure 2.6: Test flavour and components substage from the classical-based hypothesis testing workflow in Figure 2.3. This substage is directly preceded by hypotheses definitions and followed by inferential conclusions.

Definition of observed effect

An observed effect is the quantity computed from our sample that summarizes how far the data depart from what the null hypothesis states (e.g., a difference between two sample means).

Definition of standard error

A standard error is the estimated standard deviation of the sampling distribution of an estimator or observed effect; it quantifies how much the estimate would vary across repeated random samples.

Definition of test statistic

A test statistic is a standardized quantity computed from the sample, typically the observed effect scaled by its standard error, whose distribution under the null hypothesis is known and can be used to assess the evidence.

2.3.4 Inferential Conclusions

Figure 2.7: Inferential conclusions substage from the classical-based hypothesis testing workflow in Figure 2.3. This substage is directly preceded by test flavour and components and followed by the delivery of the corresponding significance conclusions within the results stage of the data science workflow, as shown in Figure 2.2.

Definition of critical value

A critical value is the threshold, obtained from the distribution of the test statistic under the null hypothesis for a given significance level \(\alpha\), beyond which we reject the null hypothesis.

Definition of \(p\)-value

A \(p\)-value is the probability, computed under the assumption that the null hypothesis is true, of observing a test statistic at least as extreme as the one obtained from our sample.

Definition of confidence interval

A confidence interval is a range of plausible values for the population parameter of interest, constructed so that, across many repeated random samples, a fixed proportion of such intervals (e.g., 95%) would contain the true parameter value.
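Once these components are in place, running a classical test is mechanically simple; the conceptual work lies in interpreting its output. As a minimal sketch, with simulated hypothetical waiting times and Welch's two-sample \(t\)-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=553)

# Hypothetical samples: waiting times (in seconds) from two groups of carts.
group_1 = rng.normal(loc=90, scale=15, size=40)
group_2 = rng.normal(loc=100, scale=15, size=40)

# Welch's two-sample t-test of H0: equal population means.
result = stats.ttest_ind(group_1, group_2, equal_var=False)
print(result.statistic)      # the test statistic
print(result.pvalue)         # the p-value
print(result.pvalue < 0.05)  # reject H0 at significance level alpha = 0.05?
```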

2.4 Supervised Learning and Regression Analysis