
2 Basic Cuisine: A Review on Probability and Frequentist Statistical Inference
This chapter delves into probability and frequentist statistical inference. We can view these sections as a quick review of introductory probability and statistics concepts. Moreover, this review will be important for understanding the philosophy of parameter estimation in modelling as outlined in Section 1.2.5. It will also pave the way to the rationale behind statistical inference in the Results stage (as in Section 1.2.7) of our workflow from Figure 1.1. Note that we aim to explain all these statistical and probabilistic concepts in the most practical way possible via a made-up case study that runs throughout this chapter (while still presenting useful theoretical admonitions as explained in Chapter 1). We will use an appropriate level of jargon and will follow the colour convention found in Appendix A, along with the definition callout box.
Learning Objectives
By the end of this chapter, you will be able to:
- Discuss why having a complete conceptual understanding of the process of statistical inference is key when conducting studies for general audiences.
- Explain why probability is the language of statistics.
- Recall foundational probabilistic insights.
- Break down the differences between the two schools of statistical thinking: frequentist and Bayesian.
- Apply the philosophy of generative modelling along with probability distributions in parameter estimation.
- Justify using measures of central tendency and uncertainty to characterize probability distributions.
- Illustrate how random sampling can be used in parameter estimation.
- Describe conceptually what maximum likelihood estimation entails in a frequentist framework.
- Formulate a maximum likelihood estimation-based approach in parameter estimation.
- Outline the process of a frequentist classical-based hypothesis testing to solve general inferential inquiries.
- Contrast the differences and similarities between supervised learning and regression analysis.
Let us start with a relatable story!
Imagine you are an undergraduate engineering student. Last term, you took and passed your first course in probability and statistics (inference included!) in an industrial engineering context. As can happen while taking an introductory course in probability and statistics, you used to feel quite overwhelmed by the large amount of jargon and formulas one had to grasp and use regularly in primary engineering fields such as quality control in a manufacturing facility. Population parameters, hypothesis testing, test statistics, significance level, \(p\)-values, and confidence intervals (do not worry, our statistical/machine learning scheme will come in later in this review) were appearing here and there! And to your frustration, you could never find a statistical connection between all these inferential tools! Instead, you relied on mechanistic procedures when solving assignments or exam problems.
For instance, when performing hypothesis testing for a two-sample \(t\)-test, you struggled to reflect on what the hypotheses indicated about the corresponding population parameters or how the test statistic was related to these hypotheses. Moreover, your interpretation of the resulting \(p\)-value and/or confidence interval was purely mechanical, with the inherent claim:
With a significance level \(\alpha = 0.05\), we reject (or fail to reject, if that is the case!) the null hypothesis given that…
Truthfully, this whole mechanical way of doing statistics is not ideal in a teaching, research or industry environment. Along the same lines, the above situation should also not happen when we learn key statistical topics for the very first time as undergraduate students. That is why we will investigate a more intuitive way of viewing probability and its crucial role in statistical inference. This matter will help us deliver more coherent storytelling (as in Section 1.2.8) when presenting our results in practice during any regression analysis to our peers or stakeholders. Note that the role of probability also extends to model training (as in Section 1.2.5) when it comes to supervised learning and not just regarding statistical inference.
Having said all this, it is time to introduce a statement that is key when teaching hypothesis testing in an introductory statistical inference course:
In statistical inference, everything always boils down to randomness and how we can control it!
That is quite a bold statement! Nonetheless, once one starts teaching statistical topics to audiences not entirely familiar with the usual field jargon, the idea of randomness always persists across many different tools. And, of course, regression analysis is not an exception at all since it also involves inference on population parameters of interest! This is why we have allocated this section in the textbook to explain core probabilistic and inferential concepts to pave the way to its role in regression analysis.
Heads-up on what we mean by a non-ideal mechanical analysis!
The reader might need clarification on why the mechanical way of performing hypothesis testing is considered non-ideal, mainly when the term cookbook is used in the book’s title. The cookbook concept here actually refers to a homogenized recipe for data modelling, as seen in the workflow from Figure 1.1. However, there’s a crucial distinction between this and the non-ideal mechanical way of hypothesis testing.
On the one hand, the non-ideal mechanical way refers to the use of a tool without understanding the rationale behind what this tool stands for, resulting in vacuous and standard statements that we would not be able to explain any further, such as the statement we previously indicated:
With a significance level \(\alpha = 0.05\), we reject (or fail to reject, if that is the case!) the null hypothesis given that…
What if a stakeholder of our analysis asks us in plain words what a significance level means? Why are we phrasing our conclusion on the null hypothesis and not directly on the alternative one? As a data scientist, one should be able to explain why the whole inference process yields that statement without misleading the stakeholders’ understanding. For sure, this also implies appropriate communication skills that cater to general audiences rather than just technical ones.
Conversely, the data modelling workflow in Figure 1.1 involves stages that necessitate a comprehensive and precise understanding of our analysis. Progressing to the next stage (without a complete grasp of the current one) risks perpetuating false insights, potentially leading to faulty data storytelling of the entire analysis.
Finally, even though this book has suggested reviews related to the basics of probability via different distributions and the fundamentals of frequentist statistical inference, as stated in Audience and Scope, we will revisit essential concepts as follows:
- The role of random variables and probability distributions, and the governance of population (or system) parameters (i.e., the so-called Greek letters we usually see in statistical inference and regression analysis). Section 2.1 will explore these topics in more detail while connecting them to the subsequent inferential terrain under a frequentist context.
- When delving into supervised learning and regression analysis, we might wonder how randomness is incorporated into model fitting (i.e., parameter estimation). That is quite a fascinating aspect, implemented via a crucial statistical tool known as maximum likelihood estimation. This tool is heavily related to the concept of loss function in supervised learning. Section 2.2 will explore these matters in more detail and how the idea of a random sample is connected to this estimation tool.
- Section 2.3 will explore the basics of hypothesis testing and its intrinsic components such as null and alternative hypotheses, type I and type II errors, significance level, power, observed effect, standard error, test statistic, critical value, \(p\)-value, and confidence interval.
- Finally, Section 2.4 will briefly discuss the connections between supervised learning and regression analysis regarding terminology.
Without further ado, let us start with reviewing core concepts in probability via quite a tasty example.
2.1 Basics of Probability
In terms of regression analysis and its supervised learning counterpart (either in an inferential or predictive framework), probability can be viewed as the solid foundation on which more complex tools, including estimation and hypothesis testing, are built. Having said that, let us scaffold across all the necessary probabilistic concepts that will allow us to move forward into these more complex tools.
2.1.1 First Insights
To start building up our solid probabilistic foundation, we assume our data comes from a given population or system of interest. Moreover, this population or system is assumed to be governed by parameters which, as data scientists or researchers, we are most interested in studying. That said, the terms population and parameter will pave the way to our first statistical definitions.
Definition of population
It is a whole collection of individuals or items that share distinctive attributes. As data scientists or researchers, we are interested in studying these attributes, which we assume are governed by parameters. In practice, we must be as specific as possible when defining our given population such that we would frame our entire data modelling process since its very early stages. Examples of a population could be the following:
- Children between the ages of 5 and 10 years old in states of the American West Coast.
- Customers of musical vinyl records in the Canadian provinces of British Columbia and Alberta.
- Avocado trees grown in the Mexican state of Michoacán.
- Adult giant pandas in the Southwestern Chinese province of Sichuan.
- Mature açaí palm trees from the Brazilian Amazonian jungle.
Note that the term population could be exchanged for the term system, given that certain contexts do not particularly refer to individuals or items. Instead, these contexts could refer to processes whose attributes are also governed by parameters. Examples of a system could be the following:
- The production of cellular phones from a given model in a set of manufacturing facilities.
- The sale process in the Vancouver franchises of a well-known ice cream parlour.
- The transit cycle during rush hours on weekdays in the twelve lines of Mexico City’s subway.
Definition of parameter
It is a characteristic (numerical or even non-numerical, such as a distinctive category) that summarizes the state of our population or system of interest. Examples of a population parameter can be described as follows:
- The average weight of children between the ages of 5 and 10 years old in states of the American West Coast (numerical).
- The variability in the height of the mature açaí palm trees from the Brazilian Amazonian jungle (numerical).
- The proportion of defective items in the production of cellular phones in a set of manufacturing facilities (numerical).
- The average customer waiting time to get their order in the Vancouver franchises of a well-known ice cream parlour (numerical).
- The most favourite pizza topping of vegetarian adults between the ages of 30 and 40 years old in Edmonton (non-numerical).
Note the standard mathematical notation for population parameters is Greek letters (for more insights, you can check Appendix B). Moreover, in practice, these population parameter(s) of interest will be unknown to the data scientist or researcher. Instead, they will use formal statistical inference to estimate them.
The parameter definition points out a crucial fact in investigating any given population or system:
Our parameter(s) of interest are usually unknown!
Given this fact, it would be pretty unfortunate and inconvenient if we eventually wanted to discover any significant insights about the population or system. Therefore, let us proceed to our so-called tasty example so we can dive into the need for statistical inference and why probability is our perfect ally in this parameter quest.
Imagine you are the owner of a large fleet of ice cream carts, around 900 to be exact. These ice cream carts operate across different parks in the following Canadian cities: Vancouver, Victoria, Edmonton, Calgary, Winnipeg, Ottawa, Toronto, and Montreal. In the past, to optimize operational costs, you decided to limit ice cream cones to only two items: vanilla and chocolate flavours, as in Figure 2.1.

Now, let us direct this whole case onto more statistical and probabilistic terrain; suppose you have a well-defined overall population of interest for the above eight Canadian cities: children between 4 and 11 years old attending these parks during the Summer weekends. Of course, Summer time is coming this year, and you would like to know which ice cream cone flavour is the favourite one for this population (and by how much!). As a business owner, investigating ice cream flavour preferences would allow you to plan Summer restocks more carefully with your corresponding suppliers. Therefore, it would be essential to start collecting consumer data so the company can tackle this demand query.
Also, suppose there is a second query. For the sake of our case, we will call it a time query. As a critical component of demand planning, besides estimating which cone flavour is the most preferred one (and by how much!) for the above population of interest, the operations area is currently requiring a realistic estimation of the average waiting time from one customer to the next one in any given cart during Summer weekends. This average waiting time would allow the operations team to plan carefully how much stock each cart should have so there will not be any waste or shortage.
Note that the nature of the aforementioned time query relates to a larger population. Therefore, we can define it as all our ice cream customers during the Summer weekends. Furthermore, this second definition expands the query to our general ice cream customers, given the requirements of our operations team, rather than only the children between 4 and 11 years old attending the parks during Summer weekends. Consequently, it is crucial to note that the nature of our queries will dictate how we define our population and our subsequent data modelling and statistical inference.
Summer time represents the most profitable season from a business perspective, thus solving these above two queries is a significant priority for your company. Hence, you decide to organize a meeting with your eight general managers (one per Canadian city). Finally, during the meeting with the general managers, it was decided to do the following:
- For the demand query, a comprehensive market study will be run on the population of interest across the eight Canadian cities right before next Summer; suppose we are currently in Spring.
- For the time query, since the operations team has not previously recorded any historical data (surprisingly!), ALL vendor staff from 900 carts will start collecting data on the waiting time in seconds between each customer this upcoming Summer.
When discussing study requirements with the marketing firm that would be in charge of the demand query, Vancouver’s general manager dares to state the following:
Since we’re already planning to collect consumer data on these cities, let’s mimic a census-type study to ensure we can have the MOST PRECISE results on their preferences.
On the other hand, when agreeing on the specific operations protocol to start recording waiting times for all the 900 vending carts this upcoming Summer, Ottawa’s general manager provides a comment for further statistical food for thought:
The operations protocol for recording waiting times in the 900 vending carts looks too cumbersome to implement straightforwardly this upcoming Summer. Why don’t we select A SMALLER SET of waiting times between two general customers across the 900 ice cream carts in the eight cities to have a more efficient process implementation that would allow us to optimize operational costs?
Bingo! Ottawa’s general manager just nailed the probabilistic way of making inference on our population parameter of interest for the time query. Indeed, their comment was primarily framed from a business perspective of optimizing operational costs. Still, this fact does not take away a crucial insight on which statistical inference is built: a random sample (as in its corresponding definition). As for Vancouver’s general manager, ironically, their statement is NOT PRECISE (from an inferential point of view)! Mimicking a census-type study might not be the most optimal decision for the demand query given the time constraint and the potential size of its target population.
Heads-up on the use of random sampling with probabilistic foundations!
Let us clarify things from the start, especially from a statistical perspective:
Realistically, there is no cheap and efficient way to conduct a census-type study for either of the two queries.
We must rely on probabilistic random sampling, selecting two small subsets of individuals from our two populations of interest. This approach allows us to save both financial and operational resources compared to conducting a complete census. However, random sampling requires us to use various probabilistic and inferential tools to manage and report the uncertainty associated with the estimation of the corresponding population parameters, which will help us answer our initial main queries.
Therefore, having said all this, let us assume that in this ice cream case, the company decided to go ahead with random sampling to answer both queries.
Moving on to one of the core topics in this chapter, we can state that probability is viewed as the language to decode random phenomena that occur in any given population or system of interest. In our example, we have two random phenomena:
- For the demand query, a phenomenon can be represented by the preferred ice cream cone flavour of any randomly selected child between 4 and 11 years old attending the parks of the above eight Canadian cities during the Summer weekends.
- Regarding the time query, a phenomenon of this kind can be represented by any randomly recorded waiting time between two customers during a Summer weekend in any of the above eight Canadian cities across the 900 ice cream carts.
Now, let us finally define what we mean by probability along with the inherent concept of sample space.
Definition of probability
Let \(A\) be an event of interest in a random phenomenon of a population or system of interest, all of whose possible outcomes belong to a given sample space \(S\). Generally, the probability of event \(A\) happening can be mathematically depicted as \(P(A)\). Moreover, suppose we observe the random phenomenon \(n\) times, as if we were running some class of experiment; then \(P(A)\) is defined as the following ratio:
\[ P(A) = \frac{\text{Number of times event $A$ is observed}}{n}, \tag{2.1}\]
as the \(n\) times we observe the random phenomenon goes to infinity.
Equation 2.1 will always put \(P(A)\) in the following numerical range:
\[ 0 \leq P(A) \leq 1. \]
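To make Equation 2.1 concrete, here is a minimal simulation sketch in `R`; the event probability of \(0.3\) and the seed are arbitrary assumptions for illustration. As \(n\) grows, the running relative frequency approaches the true \(P(A)\):

```r
# Minimal sketch of Equation 2.1: the relative frequency of an event A
# stabilizes around P(A) as the number of observations n grows
set.seed(123)  # arbitrary seed for reproducibility

n <- 100000                                    # repetitions of the random phenomenon
a_observed <- rbinom(n, size = 1, prob = 0.3)  # 1 whenever event A occurs, with P(A) = 0.3

# Running relative frequency of A after 10, 1,000, and 100,000 repetitions
cumsum(a_observed)[c(10, 1000, 100000)] / c(10, 1000, 100000)
```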
Definition of sample space
Let \(A\) be an event of interest in a random phenomenon of a population or system of interest. The sample space \(S\) denotes the set of all the possible random outcomes we might encounter every time we randomly observe this phenomenon, as if we were running some class of experiment.
Note each of these outcomes has a determined probability associated with it. If we add up all these probabilities, the probability of the sample space \(S\) will be one, i.e.,
\[ P(S) = 1. \tag{2.2}\]
2.1.2 Schools of Statistical Thinking
Note the above definition for the probability of an event \(A\) specifically highlights the following:
… as the \(n\) times we observe the random phenomenon goes to infinity.
The “infinity” term is key when it comes to understanding the philosophy behind the frequentist school of statistical thinking in contrast to its Bayesian counterpart. In general, the frequentist way of practicing statistics in terms of probability and inference is the approach we usually learn in introductory courses, more specifically when it comes to hypothesis testing and confidence intervals which will be explored in Section 2.3. That said, the Bayesian approach is another way of practicing statistical inference. Its philosophy differs in what information is used to infer our population parameters of interest. Below, we briefly define both schools of thinking.
Definition of frequentist statistics
This statistical school of thinking heavily relies on the frequency of events to estimate specific parameters of interest in a population or system. This frequency of events is reflected in the repetition of \(n\) experiments involving a random phenomenon within this population or system.
Under the umbrella of this approach, we assume that our governing parameters are fixed. Note that, within the philosophy of this school of thinking, we can only make precise and accurate predictions as long as we repeat our \(n\) experiments as many times as possible, i.e.,
\[ n \rightarrow \infty. \]
Definition of Bayesian statistics
This statistical school of thinking also relies on the frequency of events to estimate specific parameters of interest in a population or system. Nevertheless, unlike frequentist statisticians, Bayesian statisticians use prior knowledge on the population parameters, along with the current evidence they can gather, to update their estimates of them. This evidence comes in the form of the repetition of \(n\) experiments involving a random phenomenon. All these ingredients allow Bayesian statisticians to make inference by conducting appropriate hypothesis tests, which are designed differently from their mainstream frequentist counterparts.
Under the umbrella of this approach, we assume that our governing parameters are random; i.e., they have their own sample space and probabilities associated with their corresponding outcomes. The statistical process of inference is heavily backed up by probability theory, mostly in the form of Bayes’ theorem (named after Reverend Thomas Bayes, an English statistician from the 18th century). This theorem uses our current evidence along with our prior beliefs to deliver a posterior distribution of our random parameter(s) of interest.
Let us put the definitions for these two schools of statistical thinking into a more concrete example. We can use the demand query from our ice cream case as a starting point. More concretely, we can dig more into a standalone population parameter such as the probability that a randomly selected child between 4 and 11 years old, attending the parks of the above eight Canadian cities during the Summer weekends, prefers the chocolate-flavoured ice cream cone over the vanilla one. Think about the following two hypothetical questions:
- From a frequentist point of view, what is the estimated probability of preferring chocolate over vanilla after randomly surveying \(n = 100\) children from our population of interest?
- Using a Bayesian approach, suppose the marketing team has found ten prior market studies on similar children populations on their preferred ice cream flavour (between chocolate and vanilla). Therefore, along with our actual random survey of \(n = 100\) children from our population of interest, what is the posterior estimation of the probability of preferring chocolate over vanilla?
By comparing the above two questions, we can see one characteristic in common when it comes to estimating the probability of preferring chocolate over vanilla: both the frequentist and Bayesian approaches rely on the evidence gathered from the random survey of \(n = 100\) children from our population of interest. On the one hand, the frequentist approach relies solely on the observed data to estimate this single probability of preferring chocolate over vanilla. On the other hand, the Bayesian approach uses the observed data in conjunction with the prior knowledge provided by the ten previous market studies to deliver a whole posterior distribution (i.e., the posterior estimation) of the probability of preferring chocolate over vanilla.
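As a taste of what the frequentist answer to the first question looks like computationally, below is a minimal `R` sketch; the seed and the underlying preference probability of \(0.65\) are assumptions purely for illustration:

```r
# Hypothetical survey of n = 100 children, each answering 1 (chocolate)
# or 0 (vanilla); the true preference probability is assumed to be 0.65
set.seed(123)  # arbitrary seed for reproducibility
survey <- rbinom(100, size = 1, prob = 0.65)

# Frequentist point estimate: the observed relative frequency of chocolate
mean(survey)
```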
Heads-up on the debate between frequentist and Bayesian statistics!
Even though most of us began our statistical journey in a frequentist framework, we might be tempted to state that a Bayesian paradigm for parameter estimation and inference is better than a frequentist one, since the latter only takes into account the observed evidence without any prior knowledge on our parameters of interest.
In the statistical community, there could be a fascinating debate between the pros and cons of each school of thinking. That said, it is crucial to state that no paradigm is considered wrong! Instead, using a pragmatic strategy of performing statistics according to our data science context is more convenient.
Tip on further Bayesian and frequentist insights!
Let us check the following two examples (aside from our ice cream case) to illustrate the above pragmatic way of doing things:
- Take the production of cellular phones from a given model in a set of manufacturing facilities as the context. Hence, one might find a frequentist estimation of the proportion of defective items as a quicker and more efficient way to correct any given manufacturing process. That is, we will sample products from our finalized batches and check their status (defective or non-defective, our observed evidence) to deliver a proportion estimation of defective items.
- Now, take a physician’s context. It would not make much sense to study the probability that a patient develops a certain disease by only using a frequentist approach, i.e., looking solely at the current symptoms, which account for the observed evidence. Instead, a Bayesian approach would be more suitable to study this probability, since it uses the observed evidence combined with the patient’s history (i.e., the prior knowledge) to deliver our posterior belief on the disease probability.
Having said all this, it is important to reiterate that the focus of this textbook is purely frequentist with regard to data modelling in regression analysis. If you would like to explore the fundamentals of the Bayesian paradigm, Johnson, Ott, and Dogucu (2022) have developed an amazing textbook on the basic probability theory behind this school of statistical thinking, along with a whole variety of regression techniques including the parameter estimation rationale.
2.1.3 The Random Variables
As we continue our frequentist quest to review the probabilistic insights related to parameter estimation and statistical inference, we will focus on our ice cream case while providing a comprehensive array of definitions. Many of these definitions are inspired by the work of Casella and Berger (2024) and Soch et al. (2024).
Each time we introduce a new probabilistic or statistical concept, we will apply it immediately to this ice cream case, allowing for hands-on practice that meets the learning objectives of this chapter. It is important to pay close attention to the definition and heads-up admonitions, as they are essential for fully understanding how these concepts apply to the ice cream case. On the other hand, the tip admonitions are designed to offer additional theoretical insights that may interest you, but they can be skipped if you prefer.
| | Demand Query | Time Query |
|---|---|---|
| Statement | We would like to know which ice cream flavour is the favourite one (either chocolate or vanilla) and by how much. | We would like to know the average waiting time from one customer to the next one in any given ice cream cart. |
| Population of interest | Children between 4 and 11 years old attending different parks in Vancouver, Victoria, Edmonton, Calgary, Winnipeg, Ottawa, Toronto, and Montreal during Summer weekends. | All our general customer-to-customer waiting times in the different parks of Vancouver, Victoria, Edmonton, Calgary, Winnipeg, Ottawa, Toronto, and Montreal during Summer weekends across the 900 ice cream carts. |
| Parameter | Proportion of individuals from the population of interest who prefer the chocolate flavour versus the vanilla flavour. | Average waiting time from one customer to the next one. |
Table 2.1 presents the general statements and populations of interest derived from our two queries: demand and time. It is important to note that these general statements are based on the storytelling we initiated in Section 2.1.1. In practice, summarizing the overarching statistical problem is essential. This will enable us to translate the corresponding issue into a specific statement and population, from which we can define the parameters we aim to estimate later in our statistical process.
Now, recall that in our initial meeting with the general managers, Ottawa’s general manager provided valuable statistical insights regarding the foundation of a random sample. For the time query, they suggested selecting a smaller set of waiting times between two general customers across the 900 ice cream carts. We already addressed this process as sampling, more specifically random sampling in technical language.
Similarly, we can apply this concept to the demand query by selecting a subgroup of children aged 4 to 11 who are visiting different parks in these eight cities. Then, we can ask them about their favourite ice cream flavour, specifically whether they prefer chocolate or vanilla. It is important to note that we are not conducting any census-type studies; instead, we are carrying out two studies that heavily rely on sampling to estimate population parameters.
Furthermore, we want to ensure that our two groups of observations (both children and waiting times) are representative of their respective populations. So, how can we achieve this? The key baseline is what we call simple random sampling. This process involves the following per query (a small code sketch follows the list):
- For the demand query, let us assume there are \(N_D\) observations in our population of interest. In a simple random sampling scheme, our random sample will consist of \(n_D\) observations (noting that \(n_D \ll N_D\)), each having the same probability of being selected for our estimation and inferential purposes, which is given by \(\frac{1}{N_D}\).
- For the time query, assume there are \(N_T\) observations in our population of interest. Again, in a simple random sampling scheme, our random sample will consist of \(n_T\) observations (noting that \(n_T \ll N_T\)), each having the same probability of selection for estimation and inferential purposes, which is \(\frac{1}{N_T}\).
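To make the above scheme concrete, here is a minimal `R` sketch of simple random sampling for the demand query; the population and sample sizes below are assumptions for illustration:

```r
# Minimal sketch of simple random sampling for the demand query
N_D <- 2000000  # assumed population size
n_D <- 100      # assumed sample size, with n_D much smaller than N_D

set.seed(123)                           # arbitrary seed for reproducibility
sampled_ids <- sample(N_D, size = n_D)  # n_D unit labels drawn without replacement
head(sampled_ids)
```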
We can observe the concept of randomness reflected throughout the sampling schemes mentioned above. This aligns with what we referred to as random phenomena in both queries back in Section 2.1.1. Consequently, there should be a way to mathematically represent these phenomena, and the random variable is the starting point in this process.
Definition of random variable
A random variable is a function that assigns a real number to each possible outcome in the sample space \(S\), and whose realized value is one of these real numbers after executing a given random experiment. For instance, a random variable (whose support lies in the real numbers) is depicted with an uppercase letter such that
\[Y \in \mathbb{R}.\]
To begin experimenting with random variables in this ice cream case, we need to define them as clearly as possible, remembering to use uppercase letters as follows:
\[ \begin{align*} D_i &= \text{A favourite ice cream flavour of a randomly surveyed $i$th child} \\ & \qquad \text{between 4 and 11 years old attending the parks of} \\ & \qquad \text{Vancouver, Victoria, Edmonton, Calgary,} \\ & \qquad \text{Winnipeg, Ottawa, Toronto, and Montreal} \\ & \qquad \text{during the Summer weekends} \\ & \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \text{for $i = 1, \dots, n_D.$} \\ \\ T_j &= \text{A randomly recorded $j$th waiting time in minutes between two} \\ & \qquad \text{customers during a Summer weekend in any of the above} \\ & \qquad \text{eight Canadian cities across the 900 ice cream carts} \\ & \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \text{for $j = 1, \dots, n_T.$} \\ \end{align*} \]
Note that the demand query corresponds to the \(i\)th random variable \(D_i\), where the subindex \(i\) ranges from \(1\) to \(n_D\). The term \(n_D\) represents the size of our sample for this query and theoretically indicates the number of random variables we intend to observe from our population of interest during our sampling. On the other hand, for the time query, we have the \(j\)th random variable \(T_j\), with the subindex \(j\) ranging from \(1\) to \(n_T\). In the context of this query, \(n_T\) denotes the size of our respective sample and indicates how many random variables we plan to observe from our population of interest as part of our sampling.
Now, \(D_i\) will require real numbers that correspond to potential outcomes derived from the specific demand sample space of ice cream flavour, which we can denote as \(S_D\). It is crucial to note that a given child from our population may prefer a flavour other than chocolate or vanilla—for example, strawberry, salted caramel, or pistachio. However, we are limited by our available flavour menu as a company. Therefore, we will restrict our survey question regarding these potential \(n_D\) surveyed children as follows:
\[ d_i = \begin{cases} 1 \qquad \text{The surveyed child prefers chocolate.}\\ 0 \qquad \text{Otherwise.} \end{cases} \tag{2.3}\]
In the modelling associated with Equation 2.3, an observed value \(d_i\) of the random variable (thus, the lowercase) can only be \(1\) if the surveyed child prefers chocolate and \(0\) otherwise. The term “otherwise” refers to any flavour other than chocolate, which, in our limited menu context, is vanilla!
To define the real numbers from a given waiting time sample space \(S_T\), associated with an observed value \(t_j\) of the random variable (thus, the lowercase) measured in minutes, we need to establish a possible range for these waiting times. It would not make sense to observe negative waiting times in this ice cream scenario; therefore, our lower bound for this range of potential values should be \(0\) minutes. However, we cannot set an upper limit on these waiting times, since any ice cream vendor might need to wait for \(1, 2, 3, \ldots, 10, \ldots, 20, \ldots, 60, \ldots\) minutes for the next customer to arrive. In fact, it is possible to wait for a very long time, especially on a low sales day! Thus, the range of this observed value can be expressed as:
\[ t_j \in [0, \infty), \]
where the \(\infty\) symbol indicates no upper bound.
After defining the possible values for our two random variables \(D_i\) and \(T_j\), we will now classify them correctly using further probabilistic definitions as shown below.
Definition of discrete random variable
Let \(Y\) be a random variable whose support is \(\mathcal{Y}\). If this support \(\mathcal{Y}\) corresponds to a finite set or a countably infinite set of possible values, then \(Y\) is considered a discrete random variable.
For instance, we can encounter discrete random variables which could be classified as
- binary (i.e., a finite set of two possible values),
- categorical (either nominal or ordinal, which have a finite set of three or more possible values), or
- counts (which might have a finite set or a countably infinite set of possible values as integers).
Definition of continuous random variable
Let \(Y\) be a random variable whose support is \(\mathcal{Y}\). If this support \(\mathcal{Y}\) corresponds to an uncountably infinite set of possible values, then \(Y\) is considered a continuous random variable.
Note a continuous random variable could be
- completely unbounded (i.e., its set of possible values goes from \(-\infty\) to \(\infty\) as in \(-\infty < y < \infty\)),
- positively unbounded (i.e., its set of possible values goes from \(0\) to \(\infty\) as in \(0 \leq y < \infty\)),
- negatively unbounded (i.e., its set of possible values goes from \(-\infty\) to \(0\) as in \(-\infty < y \leq 0\)), or
- bounded between two values \(a\) and \(b\) (i.e., its set of possible values goes from \(a\) to \(b\) as in \(a \leq y \leq b\)).
Therefore, we can classify our two random variables as follows:
- For the demand query, the support of \(D_i\) (denoted as \(\mathcal{D}\)) is a finite set with two possible values: \(d_i \in \{0, 1\}\), as noted in Equation 2.3. Therefore, \(D_i\) is categorized as a binary discrete random variable.
- For the time query, the support of \(T_j\) (denoted as \(\mathcal{T}\)) is positively unbounded. This results in an uncountably infinite set of values that \(T_j\) can take, including (but not limited to!) \(0, \dots, 0.01, \ldots, 0.02, \ldots, 0.00234, \ldots, 1, \ldots, 1.5576, \ldots\) minutes. Therefore, \(T_j\) is classified as a positively unbounded continuous random variable.
So far, we have successfully translated our two statistical queries into proper random variables, along with clear definitions and classifications derived from our problem statements, as well as the populations of interest, as noted in Table 2.1. However, we still need to find a way to include our parameters. The upcoming section will allow us to do that.
2.1.4 The Wonders of Generative Modelling and Probability Distributions
Before exploring the wonders of generative models, let us introduce Table 2.2, an extension of Table 2.1 that now includes the elements discussed in Section 2.1.3.
| | Demand Query | Time Query |
|---|---|---|
| Statement | We would like to know which ice cream flavour is the favourite one (either chocolate or vanilla) and by how much. | We would like to know the average waiting time from one customer to the next one in any given ice cream cart. |
| Population of interest | Children between 4 and 11 years old attending different parks in Vancouver, Victoria, Edmonton, Calgary, Winnipeg, Ottawa, Toronto, and Montreal during Summer weekends. | All our general customer-to-customer waiting times in the different parks of Vancouver, Victoria, Edmonton, Calgary, Winnipeg, Ottawa, Toronto, and Montreal during Summer weekends across the 900 ice cream carts. |
| Parameter | Proportion of individuals from the population of interest who prefer the chocolate flavour versus the vanilla flavour. | Average waiting time from one customer to the next one. |
| Random variable | \(D_i\) for \(i = 1, \dots, n_D\). | \(T_j\) for \(j = 1, \dots, n_T\). |
| Random variable definition | A favourite ice cream flavour of a randomly surveyed \(i\)th child between 4 and 11 years old attending the parks of Vancouver, Victoria, Edmonton, Calgary, Winnipeg, Ottawa, Toronto, and Montreal during the Summer weekends. | A randomly recorded \(j\)th waiting time in minutes between two customers during a Summer weekend across the 900 ice cream carts found in Vancouver, Victoria, Edmonton, Calgary, Winnipeg, Ottawa, Toronto, and Montreal. |
| Random variable type | Discrete and binary. | Continuous and positively unbounded. |
| Random variable support | \(d_i \in \{ 0, 1\}\) as in Equation 2.3. | \(t_j \in [0, \infty).\) |
Having summarized all our probabilistic elements in Table 2.2, the parameters of interest must come into play for our data modelling game! Hence, the question is:
Is there any feasible way to do so via the foundations of random variables?
The answer lies in what we call a generative model, for which we have a whole toolbox corresponding to another important concept called probability distributions, as shown below.
Definition of generative model
Suppose you observe some data \(y\) from a population or system of interest. Moreover, let us assume this population or system is governed by \(k\) parameters contained in the following vector:
\[ \boldsymbol{\theta} = (\theta_1, \theta_2, \cdots, \theta_k)^T. \]
If we state that the random variable \(Y\) follows a certain probability distribution \(\mathcal{D}(\cdot)\), then we will have a generative model \(m\) such that
\[ m: Y \sim \mathcal{D}(\boldsymbol{\theta}). \]
Definition of probability distribution
When we set up a random variable \(Y\), we also obtain a new set of \(v\) possible outcomes \(\mathcal{Y} = \{ y_1, \dots, y_v\}\) coming from the sample space \(S\). This set of possible outcomes \(\mathcal{Y}\) corresponds to the support of the random variable \(Y\) (i.e., all the possible values it could take on once we execute a given random experiment involving \(Y\)).
That said, let us suppose we have a sample space of \(u\) elements defined as
\[ S = \{ s_1, \dots, s_u \}, \]
where each one of these elements has a probability assigned via a function \(P_S(\cdot)\) such that
\[ P(S) = \sum_{i = 1}^u P_S(s_i) = 1, \]
which is consistent with Equation 2.2.
Then, the probability distribution of \(Y\), i.e., \(P_Y(\cdot)\) assigns a probability to each observed value \(Y = y_j\) (with \(j = 1, \dots, v\)) if and only if the outcome of the random experiment belongs to the sample space, i.e., \(s_i \in S\) (for \(i = 1, \dots, u\)) such that \(Y(s_i) = y_j\):
\[ P_Y(Y = y_j) = P \left( \left\{ s_i \in S : Y(s_i) = y_j \right\} \right). \]
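As a toy illustration of this definition (separate from our ice cream case), consider a fair six-sided die in a minimal `R` sketch, where the random variable \(Y\) is simply the face shown:

```r
# Toy illustration: Y is the face shown by a fair six-sided die
s <- 1:6              # sample space outcomes s_1, ..., s_6
p_s <- rep(1 / 6, 6)  # probability P_S(s_i) assigned to each outcome

sum(p_s)  # P(S) = 1, as Equation 2.2 requires

# P_Y(Y = 2): add up the probabilities of all outcomes s_i with Y(s_i) = 2;
# since Y is the identity map here, only the outcome s = 2 qualifies
sum(p_s[s == 2])
```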
Since we have two different queries, we will use two instances of generative models. It is worth noting that more complex modelling could refer to a single generative model. However, for the purposes of this review chapter, we will keep it simple via two separate generative models.
Now, let us introduce a specific notation for our discussion: the Greek alphabet. Greek letters are frequently used to statistically represent population parameters in modelling setups, estimation, and statistical inference. These letters will be quite useful for our parameters in this ice cream case!
Tip on the Greek alphabet in statistics!
In the early stages of learning statistical modelling, including concepts such as regression analysis, it is common to feel overwhelmed by unfamiliar letters and terminology. Whenever confusion arises in any of the main chapters of this book regarding these letters, we recommend referring to the Greek alphabet shared in Appendix B. It is important to note that frequentist statistical inference primarily uses lowercase Greek letters. With consistent practice over time, you will likely memorize most of this alphabet!
Let us retake the row corresponding to parameters in Table 2.2 and assign their corresponding Greek letters:
- For the demand query, we are interested in the parameter \(\pi\), which represents the proportion of individuals from the children population who prefer the chocolate flavour over the vanilla flavour. It is crucial to note that a proportion is always bounded between \(0\) and \(1\), similar to how probabilities function! For instance, a proportion of \(0.2\) would mean that \(20\%\) of the children in our population prefer chocolate flavour over vanilla. This definition establishes our demand query parameter as follows:
\[ \pi \in [0, 1]. \]
Heads-up on the use of \(\pi\)!
In this textbook, unless stated otherwise, the letter \(\pi\) will denote a population parameter and not the mathematical constant \(3.141592...\)
- For the time query, we are interested in the parameter \(\beta\), which represents the average waiting time in minutes from one customer to the next one in our population of interest. Unlike the above \(\pi\) parameter, \(\beta\) is only positively unbounded given the definition of our random variable \(T_j\). Therefore, this definition establishes our time query parameter as follows:
\[ \beta \in (0, \infty). \]
Having defined our parameters of interest with proper lowercase Greek letters, it is time to declare our corresponding generative models on a general basis. For the demand query, there will be a single parameter called \(\pi\), where the randomly surveyed child \(D_i\) will follow the model \(m_D\) such that
\[ \begin{gather*} m_D : D_i \sim \mathcal{D}_D(\pi) \\ \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad \text{for $i = 1, \dots, n_D.$} \end{gather*} \tag{2.4}\]
Now, for the time query, there will also be a single parameter called \(\beta\). Thus, the randomly recorded waiting time \(T_j\) will follow the model \(m_T\) such that
\[ \begin{gather*} m_T : T_j \sim \mathcal{D}_T(\beta) \\ \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad \text{for $j = 1, \dots, n_T.$} \end{gather*} \tag{2.5}\]
Nonetheless, we might wonder the following:
How can we determine the corresponding distributions \(\mathcal{D}_D(\pi)\) and \(\mathcal{D}_T(\beta)\)?
Of course, the above definition of a probability distribution will come in handy to resolve this question. That said, given that we have two types of random variables (discrete and continuous), it is necessary to introduce two specific types of probability functions: the probability mass function (PMF) and the probability density function (PDF).
Definition of probability mass function
Let \(Y\) be a discrete random variable whose support is \(\mathcal{Y}\). Moreover, suppose that \(Y\) has a probability distribution such that
\[ P_Y(Y = y) : \mathbb{R} \rightarrow [0, 1] \]
where, for all \(y \notin \mathcal{Y}\), we have
\[ P_Y(Y = y) = 0 \]
and
\[ \sum_{y \in \mathcal{Y}} P_Y(Y = y) = 1. \tag{2.6}\] Then, \(P_Y(Y = y)\) is considered a PMF.
As we have discussed throughout this ice cream case, let us begin with the demand query. We have already defined the \(i\)th random variable \(D_i\) as discrete and binary. In statistical literature, certain random variables in common random processes can be modelled using what we call parametric families. We refer to these tools as parametric families because they are characterized by a specific set of parameters (in our case, each query has a single-element set, such as \(\pi\) or \(\beta\)).
Moreover, we call them families since each member corresponds to a particular value of our parameter(s). For instance, in our demand query, a chosen member could be where \(\pi = 0.8\) within the respective chosen parametric family to model our surveyed children. Other possible members could correspond to \(\pi = 0.2\), \(\pi = 0.4\) or \(\pi = 0.6\). In fact, the number of members in our chosen parametric family is infinite in this demand query!
Therefore, what parametric family can we choose for our demand query?
The question above introduces a new, valuable resource that is further elaborated upon in Appendix C. This resource outlines the various distributions that will be utilized in this textbook. In reality, the realm of parametric families—specifically, distributions—is quite extensive, and this material serves as only a brief overview of the many parametric families documented in statistical literature.
Tip on data modelling alternatives via different parametric families!
Any data model is simply an abstraction of reality, and different parametric families can provide various alternatives for modelling. In practice, we often need to select a specific family based on our particular inquiries and the conditions of our data. This process requires time and experience to master. Furthermore, it is important to note that different families are often interconnected!
If you wish to explore the world of univariate distribution families—which are used to model a single random variable—Leemis (n.d.) has created a comprehensive relational chart that covers 76 distinct probability distributions: 19 are discrete, and 57 are continuous. However, this chart does not encompass all the possible families that one might encounter in statistical literature (you can check another list at the end of this section).
Referring back to our discussion about Appendix C, it is time to choose the most suitable parametric family for a discrete and binary random variable, such as the \(i\)th random variable \(D_i\). A particular case we can examine is the Bernoulli distribution (also commonly known as a Bernoulli trial). The Bernoulli distribution applies to a discrete random variable that can take one of two values: \(0\), which we refer to as a failure, and \(1\), identified as a success. This aligns with our previous definition from Equation 2.3:
\[ d_i = \begin{cases} 1 \qquad \text{The surveyed child prefers chocolate.}\\ 0 \qquad \text{Otherwise.} \end{cases} \]
The equation above defines the chocolate preference of the \(i\)th surveyed child as a success, while another flavour—specifically vanilla in the context of our limited menu—is categorized as a failure. Thus, we can denote the support as \(d_i \in \{0, 1\}\).
We need to define our population parameter for this demand query in the context of a Bernoulli trial, which is denoted by \(\pi \in [0, 1]\). This represents the proportion of children who prefer the chocolate flavour over the vanilla flavour. In a Bernoulli trial, this parameter refers to the probability of success. Lastly, we can specify our generative model accordingly:
\[ \begin{gather*} m_D : D_i \sim \text{Bern}(\pi) \\ \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad \text{for $i = 1, \dots, n_D.$} \end{gather*} \]
It is time to start with formal equations! We need to define the PMF corresponding to the above generative model. The statistical literature assigns the following PMF for a Bernoulli trial \(D_i\):
\[ P_{D_i} \left( D_i = d_i \mid \pi \right) = \pi^{d_i } (1 - \pi)^{1 - d_i } \quad \text{for $d_i \in \{ 0, 1 \}$.} \tag{2.7}\]
A further question arises regarding whether Equation 2.7 satisfies the condition on the total probability of the sample space defined in Equation 2.6 under the definition of a PMF. This condition states that a valid PMF should result in a total probability equal to one when we sum all the probabilities produced by this function over every possible value that the random variable can take.
Hence, we can state Equation 2.7 is a proper probability distribution (i.e., all the standalone probabilities over the support of \(D_i\) add up to one) given that:
Proof. \[ \begin{align*} \sum_{d_i = 0}^1 P_{D_i} \left( D_i = d_i \mid \pi \right) &= \sum_{d_i = 0}^1 \pi^{d_i} (1 - \pi)^{1 - d_i} \\ &= \underbrace{\pi^0}_{1} (1 - \pi) + \pi \underbrace{(1 - \pi)^{0}}_{1} \\ &= (1 - \pi) + \pi \\ &= 1. \qquad \qquad \qquad \qquad \quad \square \end{align*} \tag{2.8}\]
Indeed, this Bernoulli PMF is a proper probability distribution!
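We can also verify this numerically with a quick `R` check, using `dbinom()` with `size = 1` (the Bernoulli case of the Binomial family) and an arbitrary value of \(\pi\) assumed for illustration:

```r
# Numerical check of Equation 2.8 for an arbitrary member of the family
pi_val <- 0.65  # hypothetical probability of success

# dbinom() with size = 1 evaluates the Bernoulli PMF in Equation 2.7
sum(dbinom(c(0, 1), size = 1, prob = pi_val))  # adds up to exactly 1
```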
The probability distribution, obtained from Equation 2.8, is summarized in Table 2.3. Note that the chocolate preference has a probability equal to \(\pi\), whereas the vanilla preference corresponds to the complement \(1 - \pi\). This probability arrangement completely fulfils the corresponding probability condition of the sample space seen in Equation 2.6.
| \(d_i\) | \(P_{D_i} \left( D_i = d_i \mid \pi \right)\) |
|---|---|
| \(0\) | \(1 - \pi\) |
| \(1\) | \(\pi\) |
To proceed with the time query, we need to analyze the \(j\)th continuous random variable \(T_j\) and subsequently work with a PDF.
Definition of probability density function
Let \(Y\) be a continuous random variable whose support is \(\mathcal{Y}\). Furthermore, consider a function \(f_Y(y)\) such that
\[ f_Y(y) : \mathbb{R} \rightarrow \mathbb{R} \]
with
\[ f_Y(y) \geq 0. \]
Then, \(f_Y(y)\) is considered a PDF if the probability of \(Y\) taking on a value within the range represented by the subset \(A \subset \mathcal{Y}\) is equal to
\[ P_Y(Y \in A) = \int_A f_Y(y) \mathrm{d}y \]
with
\[ \int_{\mathcal{Y}} f_Y(y) \mathrm{d}y = 1. \tag{2.9}\]
To begin our second analysis, let us examine the nature of the variable \(T_j\) represented as a continuous random variable. This variable is nonnegative, meaning it is positively unbounded, as it models a waiting time. We can interpret \(T_j\) as the waiting time until a specific event of interest occurs, such as when the next customer arrives at the ice cream cart. In statistical literature, this is commonly referred to as a survival time. Hence, we might wonder:
What is the most suitable parametric family to model a survival time?
Well, in this case, as in statistical literature more generally, there is more than one alternative to model a continuous and nonnegative survival time. Appendix C offers four possible ways:
- Exponential. A random variable with a single parameter that can come in either of the following forms:
- As a rate \(\lambda \in (0, \infty)\), which generally defines the mean number of events of interest per time interval or space unit.
- As a scale \(\beta \in (0, \infty)\), which generally defines the mean time until the next event of interest occurs.
- Weibull. A random variable that is a generalization of the Exponential distribution. Note its distributional parameters are the scale continuous parameter \(\beta \in (0, \infty)\) and shape continuous parameter \(\gamma \in (0, \infty)\).
- Gamma. A random variable whose distributional parameters are the shape continuous parameter \(\eta \in (0, \infty)\) and scale continuous parameter \(\theta \in (0, \infty)\).
- Lognormal. A random variable whose logarithmic transformation yields a Normal distribution. Its distributional parameters are the Normal location continuous parameter \(\mu \in (-\infty, \infty)\) and Normal scale continuous parameter \(\sigma^2 \in (0, \infty)\).
In our context, as summarized in the corresponding generative model, it is in our best interest to select a probability distribution characterized by a single parameter. Therefore, the Exponential distribution is the most suitable choice for our current time query, particularly under the scale parametrization, since we aim to estimate the waiting time between two customers.
Heads-up on survival analysis!
Although our ice cream case can be straightforwardly modelled using an Exponential distribution for our time query, by using a single population parameter which indicates a mean waiting time between two customers, it is important to stress that other distributions, such as the Weibull, Gamma, or Lognormal, are also entirely valid options. In fact, utilizing these distributions, that involve more than just a standalone parameter, can enhance the flexibility of our data modelling!
Additionally, there is a specialized statistical field focused on modelling waiting times—specifically, the time until an event of interest occurs. These types of times are formally referred to as survival times, and the associated field is known as survival analysis. It is worth noting that regression analysis can be extended to this area, and Chapter 6 will provide a more in-depth exploration of various parametric models that involve the Exponential, Weibull, and Lognormal distributions.
Since we are using an Exponential distribution, we need to establish our population parameter for this time query. As mentioned in Table 2.2, this parameter refers to the average (or mean) waiting time from one customer to the next. This corresponds to a scale parametrization, where the parameter \(\beta \in (0, \infty)\) defines the mean time until the next event of interest occurs (in this case, the next customer!). Therefore, we can specify our generative model as follows:
\[ \begin{gather*} m_T : T_j \sim \text{Exponential}(\beta) \\ \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad \text{for $j = 1, \dots, n_T.$} \end{gather*} \]
Since \(T_j\) is a continuous random variable, we must define the PDF corresponding to the above generative model. The statistical literature assigns the following PDF for \(T_j\):
\[ f_{T_j} \left(t_j \mid \beta \right) = \frac{1}{\beta} \exp \left( -\frac{t_j}{\beta} \right) \quad \text{for $t_j \in [0, \infty )$.} \tag{2.10}\]
Now, we might wonder whether Equation 2.10 satisfies the condition on the total probability of the sample space defined in Equation 2.9 under the definition of a PDF. This condition states that a valid PDF should result in a total probability equal to one when we integrate this function over the whole support of \(T_j\).
Thus, we can state that Equation 2.10 is a proper probability distribution (i.e., Equation 2.10 integrates to one over the support of \(T_j\)) given that:
Proof. \[ \begin{align*} \int_{t_j = 0}^{t_j = \infty} f_{T_j} \left(t_j \mid \beta \right) \mathrm{d}t_j &= \int_{t_j = 0}^{t_j = \infty} \frac{1}{\beta} \exp \left( -\frac{t_j}{\beta} \right) \mathrm{d}t_j \\ &= \frac{1}{\beta} \int_{t_j = 0}^{t_j = \infty} \exp \left( -\frac{t_j}{\beta} \right) \mathrm{d}t_j \\ &= - \frac{\beta}{\beta} \exp \left( -\frac{t_j}{\beta} \right) \Bigg|_{t_j = 0}^{t_j = \infty} \\ &= - \exp \left( -\frac{t_j}{\beta} \right) \Bigg|_{t_j = 0}^{t_j = \infty} \\ &= - \left[ \exp \left( -\infty \right) - \exp \left( 0 \right) \right] \\ &= - \left( 0 - 1 \right) \\ &= 1. \qquad \qquad \qquad \qquad \quad \square \end{align*} \tag{2.11}\]
Indeed, the Exponential PDF, under a scale parametrization, is a proper probability distribution!
Unlike our demand query, which features a table illustrating the PMF for \(D_i \in \{ 0, 1 \}\) (see Table 2.3), it is not feasible to create a table for the PDF of \(T_j \in [0, \infty)\) because it represents an uncountably infinite set of possible values. However, we can plot the corresponding PDF using three specific members of the Exponential parametric family as examples. Figure 2.2 presents these three example members, with scale parameter values of \(\beta = 0.25, 0.5, 1\) minutes, representing waiting times through their corresponding PDFs. Based on our findings in Equation 2.11, we know that the area under these three density plots equals one, indicating the total probability of the sample space. Additionally, it is important to note that as we increase the scale parameter, larger observed values \(t_j\) become more probable.
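Figure 2.2 itself is not reproduced in this layout, but a minimal base R sketch of those three density curves (our own illustration; the colours and plotting range are arbitrary choices, and recall that R parametrizes the Exponential via the rate \(1/\beta\)) could be:
# Sketch of Figure 2.2: Exponential PDFs with beta = 0.25, 0.5, 1 minutes
t <- seq(0, 4, by = 0.01)
plot(t, dexp(t, rate = 1 / 0.25), type = "l", col = "red",
  xlab = "Waiting time t (min)", ylab = "Density f(t)")
lines(t, dexp(t, rate = 1 / 0.5), col = "blue")
lines(t, dexp(t, rate = 1), col = "darkgreen")
legend("topright", legend = c("beta = 0.25", "beta = 0.5", "beta = 1"),
  col = c("red", "blue", "darkgreen"), lty = 1)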

2.1.5 Characterizing Probability Distributions
So far, we have been exploring the basics of random variables, as well as the importance of generative modelling and probability distributions in addressing different data inquiries. These concepts are fundamental to understanding the population parameter setup before we actually collect data and solve these inquiries to create effective storytelling. Before we delve into those stages, however, we need to identify and explain efficient ways to summarize probability distributions. This will help us make our storytelling compelling for a general audience, as we will discuss further.
Heads-up on coding tabs!
You may be wondering: Where do we begin with some R or Python code?
It is time to introduce our very first lines of code and provide some explanations about the coding approach in this book. As implied in the Preface, our goal is to make this book “bilingual,” meaning that all hands-on coding practices can be performed in either R or Python. Whenever we present a specific proof of concept or data modelling exercise, you will find two tabs: one for R and another for Python. We will first show the input code, followed by the output.
With this format, you can choose your coding journey based on your language preferences and interests as you progress through the book.
Alright! Moving forward with the code, we need to work with some simulated populations to create the corresponding proofs of concept in this section and the subsequent ones. Let us start with our demand query. We will consider a population size of \(N_D = 2,000,000\) children. The code below (in either R or Python) assigns this value as N_D, along with a simulation seed to ensure our results are reproducible. Additionally, for the simulation purposes related to our generative modelling, we will assume that 65% of these children prefer chocolate over vanilla (i.e., \(\pi = 0.65\)).
Heads-up on real and unknown parameters!
Although we are assigning a value of \(\pi = 0.65\) as our true population parameter in this query, we can never know the exact value in practice unless we conduct a full census. This is why we rely on probabilistic tools, via random sampling and statistical inference, to estimate this \(\pi\).
Let us recall that we are assuming each child is modelled as a Bernoulli trial, where a success (denoted as 1) indicates that the child “prefers chocolate.” This also reflects the flavour mapping in the code. Furthermore, instead of using a Bernoulli random number generator, we are utilizing a Binomial random number generator. This is because the Binomial case with parameters \(n = 1\) and \(\pi\) is equivalent to a Bernoulli trial with parameter \(\pi\). Hence, consider the following general Binomial case:
\[ Y \sim \text{Bin}(n = 1, \pi), \]
whose PMF is simplified as a Bernoulli given that
\[ \begin{align*} P_Y \left( Y = y \mid n = 1, \pi \right) &= {1 \choose y} \pi^y (1 - \pi)^{1 - y} \\ &= \underbrace{\frac{1!}{y!(1 - y)!}}_{\text{$1$ for $y \in \{ 0, 1 \}$}} \pi^y (1 - \pi)^{1 - y} \\ &= \pi^y (1 - \pi)^{1 - y} \\ & \qquad \qquad \qquad \qquad \qquad \text{for $y \in \{ 0, 1 \}$.} \end{align*} \]
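As a quick sanity check of this equivalence (our own illustration, not part of the original tabs), R's Binomial PMF with size = 1 returns the Bernoulli probabilities directly:
# P(Y = 1) and P(Y = 0) for a Binomial(n = 1, pi = 0.65), i.e., a Bernoulli trial
dbinom(1, size = 1, prob = 0.65) # returns 0.65, which is pi
dbinom(0, size = 1, prob = 0.65) # returns 0.35, which is 1 - pi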
The final output of this quick simulation, which models a population of children as Bernoulli trials with a probability of success \(\pi = 0.65\), consists of a data frame containing \(N_D = 2,000,000\) rows, with each row representing a child and their preferred ice cream flavour: either chocolate or vanilla. It is worth noting that the outputs from R and Python differ because each language employs a different pseudo-random number generator.
set.seed(123) # Seed for reproducibility
# Population size
N_D <- 2000000
# Simulate binary outcomes: 1 = chocolate, 0 = vanilla
flavour_bin <- rbinom(N_D, size = 1, prob = 0.65)
# Map binary to flavour names
flavours <- ifelse(flavour_bin == 1, "chocolate", "vanilla")
# Create data frame
children_pop <- data.frame(
children_ID = 1:N_D,
fav_flavour = flavours
)
# Showing the first 100 children of the population
head(children_pop, n = 100)
# Importing libraries
import numpy as np
import pandas as pd
np.random.seed(123) # Seed for reproducibility
# Population size
N_D = 2000000
# Simulate binary outcomes: 1 = chocolate, 0 = vanilla
flavour_bin = np.random.binomial(n = 1, p = 0.65, size = N_D)
# Map binary to flavour names
flavours = np.where(flavour_bin == 1, "chocolate", "vanilla")
# Create data frame
children_pop = pd.DataFrame({
    "children_ID": np.arange(1, N_D + 1),
    "fav_flavour": flavours
})
# Showing the first 100 children of the population
print(children_pop.head(100))
Similarly, for our time query, the code below simulates a population of 500,000 waiting times from an Exponential distribution with a mean of 10 minutes (note that R parametrizes the Exponential via the rate \(1/\beta\)).
set.seed(123) # Seed for reproducibility
# Population size
n <- 500000
# In R, 'rate' is 1 / scale; waiting times are rounded to two decimal places
waiting_times <- round(rexp(n, rate = 1 / 10), 2)
# Create data frame
waiting_pop <- data.frame(
time_ID = 1:n,
waiting_time = waiting_times
)
# Showing the first 100 waiting times of the population
head(waiting_pop, n = 100)
np.random.seed(123) # Seed for reproducibility
# Population size
n = 500000
# Simulate waiting times
waiting_times = np.round(np.random.exponential(scale = 10, size = n), 2)
# Create DataFrame
waiting_pop = pd.DataFrame({
    "time_ID": np.arange(1, n + 1),
    "waiting_time": waiting_times
})
# Showing the first 100 waiting times of the population
print(waiting_pop.head(100))
Now, imagine that the ice cream case has progressed further: data collection is complete and the analysis is underway. You have a follow-up meeting with your eight general managers, one from each Canadian city, to discuss how to address the statements related to both the demand and time queries as depicted in Table 2.2, in relation to our populations of interest. Additionally, suppose you have collected data from \(n_D = 500\) randomly surveyed children across these eight Canadian cities.
np.random.seed(123) # Seed for reproducibility
# Simple random sample of 500 rows
children_sample = children_pop.sample(n = 500)
# Showing the first 100 sampled children
print(children_sample.head(100))
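Only the Python tab survived the layout here; a base R counterpart for the same simple random sample (our sketch, which may differ from whichever call the original R tab used) could be:
set.seed(123) # Seed for reproducibility
# Simple random sample of 500 rows
children_sample <- children_pop[sample(nrow(children_pop), size = 500), ]
# Showing the first 100 sampled children
head(children_sample, n = 100)
The analogous R call for the waiting-time sample below would draw 200 rows from waiting_pop in the same way.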
Also, you have sampled data on \(n_T = 200\) randomly recorded waiting times between customers across our 900 ice cream carts in the same cities.
np.random.seed(123) # Seed for reproducibility
# Simple random sample of 200 waiting times
waiting_sample = waiting_pop.sample(n = 200)
# Showing the first 100 sampled waiting times
print(waiting_sample.head(100))
In terms of the executive meeting with the eight general managers, it would not be an efficient use of time to go individually over these \(n_D = 500\) and \(n_T = 200\) data points along with abstract mathematical concepts such as PMFs or PDFs, as well as probabilistic definitions of random variables and parameters represented by Greek letters. Instead, there should be a simpler, more straightforward way to explain how these random variables behaved during our data collection process. The key to addressing this complexity lies in understanding measures of central tendency and uncertainty.
Definition of measure of central tendency
Probabilistically, a measure of central tendency is defined as a metric that identifies a central or typical value of a given probability distribution. In other words, a measure of central tendency refers to a central or typical value that a given random variable might take when we observe various realizations of this variable over a long period.
Definition of expected value
Let \(Y\) be a random variable whose support is \(\mathcal{Y}\). In general, the expected value or mean \(\mathbb{E}(Y)\) of this random variable is defined as a weighted average according to its corresponding probability distribution. In other words, this measure of central tendency \(\mathbb{E}(Y)\) aims to find the middle value of this random variable by weighting all its possible values in its support \(\mathcal{Y}\) as dictated by its probability distribution.
Given the above definition, when \(Y\) is a discrete random variable whose PMF is \(P_Y(Y = y)\), then its expected value is mathematically defined as
\[ \mathbb{E}(Y) = \sum_{y \in \mathcal{Y}} y \cdot P_Y(Y = y). \tag{2.12}\]
When \(Y\) is a continuous random variable whose PDF is \(f_Y(y)\), its expected value is mathematically defined as
\[ \mathbb{E}(Y) = \int_{\mathcal{Y}} y \cdot f_Y(y) \mathrm{d}y. \tag{2.13}\]
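To make Equation 2.12 and Equation 2.13 concrete, here is a small R check (our own, assuming the simulated parameter values from above): the Bernoulli mean equals \(\pi\) and the Exponential mean equals \(\beta\).
# Equation 2.12: E(D) for D ~ Bernoulli(pi = 0.65), summing over the support {0, 1}
sum(c(0, 1) * dbinom(c(0, 1), size = 1, prob = 0.65)) # 0.65
# Equation 2.13: E(T) for T ~ Exponential(beta = 10), integrating over [0, Inf)
integrate(function(t) t * dexp(t, rate = 1 / 10), lower = 0, upper = Inf) # ~10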
Definition of measure of uncertainty
Probabilistically, a measure of uncertainty refers to the spread of a given random variable when we observe its different realizations in the long term. Note a larger spread indicates more variability in these realizations. On the other hand, a smaller spread denotes less variability in these realizations.
Tip on the Law of the Unconscious Statistician!
The law of the unconscious statistician (LOTUS) is a particular theorem in probability theory that allows us to compute a wide variety of expected values. Let us properly define it for both discrete and continuous random variables.
Theorem 2.1 Let \(Y\) be a discrete random variable whose support is \(\mathcal{Y}\). The LOTUS indicates that the expected value of a general function \(g(Y)\) of this random variable \(Y\) can be obtained via \(g(Y)\) along with the corresponding PMF \(P_Y(Y = y)\). Hence, the expected value of \(g(Y)\) can be obtained as
\[ \mathbb{E}\left[ g(Y) \right] = \sum_{y \in \mathcal{Y}} g(y) \cdot P_Y(Y = y). \tag{2.14}\]
Proof. Let us explore the rationale provided by Soch et al. (2024). Thus, we will rename the general function \(g(Y)\) as another random variable called \(Z\) such that:
\[ Z = g(Y). \tag{2.15}\]
Note this function \(g(Y)\) can take on equal values \(g(y_1), g(y_2), \dots\) coming from different observed values \(y_1, y_2, \dots\); for example, if
\[ g(y) = y^2 \]
both
\[ y_1 = 2 \quad \text{and} \quad y_2 = -2 \]
yield
\[ g(y_1) = g(y_2) = 4. \]
The above Equation 2.15 is formally called a random variable transformation from the general function of random variable \(Y\), \(g(Y)\), to a new random variable \(Z\). Having said that, when we set up a transformation of this class, there will be a support mapping from this general function \(g(Y)\) to \(Z\). This will also yield a proper PMF,
\[ P_Z(Z = z) : \mathbb{R} \rightarrow [0, 1] \quad \forall z \in \mathcal{Z}, \]
given that \(g(Y)\) is a random variable-based function.
Therefore, using the expected value definition for a discrete random variable as in Equation 2.12, we have the following for \(Z\):
\[ \mathbb{E}(Z) = \sum_{z \in \mathcal{Z}} z \cdot P_Z(Z = z). \tag{2.16}\]
Within the support \(\mathcal{Z}\), suppose that \(z_1, z_2, \dots\) are the possible different values of \(Z\) corresponding to function \(g(Y)\). Then, for the \(i\)th value \(z_i\) in this correspondence, let \(I_i\) be the collection of all \(y_j\) such that
\[ g(y_j) = z_i. \tag{2.17}\]
Now, let us tweak a bit the above expression from Equation 2.16 to include this setting:
\[ \begin{align*} \mathbb{E}(Z) &= \sum_{z \in \mathcal{Z}} z \cdot P_Z(Z = z) \\ &= \sum_{i} z_i \cdot P_{g(Y)}(Z = z_i) \\ & \qquad \text{we subset the summation to all $z_i$ with $Z = g(Y)$}\\ &= \sum_{i} z_i \sum_{j \in I_i} P_Y(Y = y_j). \\ \end{align*} \tag{2.18}\]
The last line of Equation 2.18 maps the probabilities associated to all \(z_i\) in the corresponding PMF of \(Z\), \(P_Z(\cdot)\) via the function \(g(Y)\), to the original PMF of \(Y\), \(P_Y(\cdot)\), for all those \(y_j\) contained in the collection \(I_i\). Given that certain values \(z_i\) can be obtained with more than one value \(y_j\), such as in the above example when \(g(y) = y^2\) for \(y_1 = 2\) and \(y_2 = -2\), note we have a second summation of probabilities applied to the PMF of \(Y\).
Moving along with Equation 2.18 in conjunction with Equation 2.17, we have that:
\[ \begin{align*} \mathbb{E}(Z) &= \sum_{i} z_i \sum_{j \in I_i} P_Y(Y = y_j) \\ &= \sum_{i} \sum_{j \in I_i} z_i \cdot P_Y(Y = y_j) \\ &= \sum_{i} \sum_{j \in I_i} g(y_j) \cdot P_Y(Y = y_j). \end{align*} \tag{2.19}\]
The double summation in Equation 2.19 can be summarized into a single one, given neither of the factors on the right-hand side is subindexed by \(i\). Furthermore, this standalone summation can be applied to all \(y \in \mathcal{Y}\) while getting rid of the subindex \(j\) in the factors on the right-hand side:
\[ \begin{align*} \mathbb{E}(Z) &= \sum_{i} \sum_{j \in I_i} g(y_j) \cdot P_Y(Y = y_j) \\ &= \sum_{y \in \mathcal{Y}} g(y) \cdot P_Y(Y = y) \\ &= \mathbb{E}\left[ g(Y) \right]. \end{align*} \]
Therefore, we have:
\[ \mathbb{E}\left[ g(Y) \right] = \sum_{y \in \mathcal{Y}} g(y) \cdot P_Y(Y = y). \quad \square \]
Theorem 2.2 Let \(Y\) be a continuous random variable whose support is \(\mathcal{Y}\). The LOTUS indicates that the expected value of a general function \(g(Y)\) of this random variable \(Y\) can be obtained via \(g(Y)\) along with the corresponding PDF \(f_Y(y)\). Thus, the expected value of \(g(Y)\) can be obtained as
\[ \mathbb{E}\left[ g(Y) \right] = \int_{\mathcal{Y}} g(y) \cdot f_Y(y) \, \mathrm{d}y. \tag{2.20}\]
Proof. Let us explore the rationale provided by Soch et al. (2024). Hence, we will rename the general function \(g(Y)\) as another random variable called \(Z\) such that:
\[ Z = g(Y). \tag{2.21}\]
As in the discrete LOTUS proof, the above Equation 2.21 is formally called a random variable transformation from the general function of random variable \(Y,\) \(g(Y)\), to a new random variable \(Z\). Therefore, when we set up a transformation of this class, there will be a support mapping from this general function \(g(Y)\) to \(Z\). This will also yield a proper PDF:
\[ f_Z(z) : \mathbb{R} \rightarrow [0, \infty) \quad \forall z \in \mathcal{Z}, \]
given that \(g(Y)\) is a random variable-based function.
Now, we will use the concept of the cumulative distribution function (CDF) for a continuous random variable \(Z\):
\[ \begin{align*} F_Z(z) &= P(Z \leq z) \\ &= P\left[g(Y) \leq z \right] \\ &= P\left[Y \leq g^{-1}(z) \right] \\ &= F_Y\left[ g^{-1}(z) \right]. \end{align*} \tag{2.22}\]
A well-known Calculus result is the inverse function theorem. Assuming that
\[ z = g(y) \]
is an invertible and differentiable function, then the inverse
\[ y = g^{-1}(z) \tag{2.23}\]
must be differentiable as in:
\[ \frac{\mathrm{d}}{\mathrm{d}z} \left[ g^{-1}(z) \right] = \frac{1}{g' \left[ g^{-1}(z) \right]}. \tag{2.24}\]
Note that we differentiate Equation 2.23 as follows:
\[ \frac{\mathrm{d}}{\mathrm{d}z} y = \frac{\mathrm{d}}{\mathrm{d}z} \left[ g^{-1}(z) \right]. \tag{2.25}\]
Then, plugging Equation 2.25 into Equation 2.24, we obtain:
\[ \begin{gather*} \frac{\mathrm{d}}{\mathrm{d}z} y = \frac{1}{g' \left[ g^{-1}(z) \right]} \\ \mathrm{d}y = \frac{1}{g' \left[ g^{-1}(z) \right]} \mathrm{d}z. \end{gather*} \tag{2.26}\]
Then, we use the property that relates the CDF \(F_Z(z)\) to the PDF \(f_Z(z)\):
\[ f_Z(z) = \frac{\mathrm{d}}{\mathrm{d}z} F_Z(z). \]
Using Equation 2.22, we have:
\[ \begin{align*} f_Z(z) &= \frac{\mathrm{d}}{\mathrm{d}z} F_Z(z) \\ &= \frac{\mathrm{d}}{\mathrm{d}z} F_Y\left[ g^{-1}(z) \right] \\ &= f_Y\left[ g^{-1}(z) \right] \frac{\mathrm{d}}{\mathrm{d}z} \left[ g^{-1}(z) \right]. \end{align*} \]
Then, via Equation 2.24, it follows that:
\[ f_Z(z) = f_Y\left[ g^{-1}(z) \right] \frac{1}{g' \left[ g^{-1}(z) \right]}. \tag{2.27}\]
Therefore, using the expected value definition for a continuous random variable as in Equation 2.13, we have for \(Z\) that
\[ \mathbb{E}(Z) = \int_{\mathcal{Z}} z \cdot f_Z(z) \mathrm{d}z, \]
which yields via Equation 2.27:
\[ \mathbb{E}(Z) = \int_{\mathcal{Z}} z \cdot f_Y \left[ g^{-1}(z) \right] \frac{1}{g' \left[ g^{-1}(z) \right]} \mathrm{d}z. \]
Using Equation 2.23 and Equation 2.26, it follows that:
\[ \begin{align*} \mathbb{E}(Z) &= \int_{\mathcal{Z}} z \cdot f_Y(y) \frac{1}{g' \left[ g^{-1}(z) \right]} \mathrm{d}z \\ &= \int_{\mathcal{Y}} g(y) \cdot f_Y(y) \mathrm{d}y. \end{align*} \]
Note the last line in the above equation changes the integration limits to the support of \(Y\), given all terms end up depending on \(y\) on the right-hand side.
Finally, given the random variable transformation from Equation 2.21, we have:
\[ \mathbb{E}\left[ g(Y) \right] = \int_{\mathcal{Y}} g(y) \cdot f_Y(y) \mathrm{d}y. \quad \square \]
Definition of variance
Let \(Y\) be a discrete or continuous random variable whose support is \(\mathcal{Y}\) with a mean represented by \(\mathbb{E}(Y)\). Then, the variance of \(Y\) is the mean of the squared deviation from the corresponding mean as follows:
\[ \text{Var}(Y) = \mathbb{E}\left\{[ Y - \mathbb{E}(Y)]^2 \right\}. \\ \tag{2.28}\]
Note the expression above is equivalent to:
\[ \text{Var}(Y) = \mathbb{E} \left( Y^2 \right) - \left[ \mathbb{E}(Y) \right]^2. \tag{2.29}\]
Heads-up on the two mathematical expressions of the variance!
Proving the equivalence of Equation 2.28 and Equation 2.29 requires introducing some further properties of the expected value of a random variable while using the LOTUS. We will dig into the insights provided by Casella and Berger (2024).
Theorem 2.3 Let \(Y\) be a discrete or continuous random variable. Furthermore, let \(a\), \(b\), and \(c\) be constants. Thus, for any functions \(g_1(y)\) and \(g_2(y)\) whose means exist, we have that:
\[ \mathbb{E}\left[ a g_1(Y) + b g_2(Y) + c \right] = a \mathbb{E}\left[ g_1(Y) \right] + b \mathbb{E}\left[ g_2(Y) \right] + c. \tag{2.30}\]
Firstly, let us prove Equation 2.30 for the discrete case.
Proof. Let \(Y\) be a discrete random variable whose support is \(\mathcal{Y}\) and PMF is \(P_Y(Y = y)\). Let us apply the LOTUS as in Equation 2.14:
\[ \mathbb{E}\left[ a g_1(Y) + b g_2(Y) + c \right] = \sum_{y \in \mathcal{Y}} \left[ a g_1(y) + b g_2(y) + c \right] \cdot P_Y(Y = y). \] We can distribute the summation across each addend as follows:
\[ \begin{align*} \mathbb{E}\left[ a g_1(Y) + b g_2(Y) + c \right] &= \sum_{y \in \mathcal{Y}} \left[ a g_1(y) \right] \cdot P_Y(Y = y) + \\ & \qquad \sum_{y \in \mathcal{Y}} \left[ b g_2(y) \right] \cdot P_Y(Y = y) + \\ & \qquad \sum_{y \in \mathcal{Y}} c \cdot P_Y(Y = y). \end{align*} \]
Let us take the constants out of the corresponding summations:
\[ \begin{align*} \mathbb{E}\left[ a g_1(Y) + b g_2(Y) + c \right] &= a \sum_{y \in \mathcal{Y}} g_1(y) \cdot P_Y(Y = y) + \\ & \qquad b \sum_{y \in \mathcal{Y}} g_2(y) \cdot P_Y(Y = y) + \\ & \qquad c \underbrace{\sum_{y \in \mathcal{Y}} P_Y(Y = y)}_1 \\ &= a \underbrace{\sum_{y \in \mathcal{Y}} g_1(y) \cdot P_Y(Y = y)}_{\mathbb{E} \left[ g_1(Y) \right]} + \\ & \qquad b \underbrace{\sum_{y \in \mathcal{Y}} g_2(y) \cdot P_Y(Y = y)}_{\mathbb{E} \left[ g_2(Y) \right]} + c. \end{align*} \]
For the first and second addends on the right-hand side in the above equation, let us apply the LOTUS again:
\[ \mathbb{E}\left[ a g_1(Y) + b g_2(Y) + c \right] = a \mathbb{E} \left[ g_1(Y) \right] + b \mathbb{E} \left[ g_2(Y) \right] + c. \quad \square \]
Secondly, let us prove Equation 2.30 for the continuous case.
Proof. Let \(Y\) be a continuous random variable whose support is \(\mathcal{Y}\) and PDF is \(f_Y(y)\). Let us apply the LOTUS as in Equation 2.20:
\[ \mathbb{E}\left[ a g_1(Y) + b g_2(Y) + c \right] = \int_{\mathcal{Y}} \left[ a g_1 (y) + b g_2(y) + c \right] \cdot f_Y(y) \mathrm{d}y. \]
We distribute the integral on the right-hand side of the above equation:
\[ \begin{align*} \mathbb{E}\left[ a g_1(Y) + b g_2(Y) + c \right] &= \int_{\mathcal{Y}} \left[ a g_1 (y) \right] \cdot f_Y(y) \mathrm{d}y + \\ & \qquad \int_{\mathcal{Y}} \left[ b g_2(y) \right] \cdot f_Y(y) \mathrm{d}y + \\ & \qquad \int_{\mathcal{Y}} c \cdot f_Y(y) \mathrm{d}y. \end{align*} \]
Let us take the constants out of the corresponding integrals:
\[ \begin{align*} \mathbb{E}\left[ a g_1(Y) + b g_2(Y) + c \right] &= a \int_{\mathcal{Y}} g_1 (y) \cdot f_Y(y) \mathrm{d}y + \\ & \qquad b \int_{\mathcal{Y}} g_2(y) \cdot f_Y(y) \mathrm{d}y + \\ & \qquad c \underbrace{\int_{\mathcal{Y}} f_Y(y) \mathrm{d}y}_{1} \\ &= a \underbrace{\int_{\mathcal{Y}} g_1 (y) \cdot f_Y(y) \mathrm{d}y}_{\mathbb{E} \left[ g_1(Y) \right]} + \\ & \qquad b \underbrace{\int_{\mathcal{Y}} g_2(y) \cdot f_Y(y) \mathrm{d}y}_{\mathbb{E} \left[ g_2(Y) \right]} + c. \end{align*} \]
For the first and second addends on the right-hand side in the above equation, let us apply the LOTUS again:
\[ \mathbb{E}\left[ a g_1(Y) + b g_2(Y) + c \right] = a \mathbb{E} \left[ g_1(Y) \right] + b \mathbb{E} \left[ g_2(Y) \right] + c. \quad \square \]
Finally, after applying some algebraic rearrangements and the expected value properties shown in Equation 2.30, Equation 2.28 and Equation 2.29 are equivalent as follows:
Proof. \[ \begin{align*} \text{Var}(Y) &= \mathbb{E}\left\{[ Y - \mathbb{E}(Y)]^2 \right\} \\ &= \mathbb{E} \left\{ Y^2 - 2Y \mathbb{E}(Y) + \left[ \mathbb{E}(Y) \right]^2 \right\} \\ &= \mathbb{E} \left( Y^2 \right) - \mathbb{E} \left[ 2Y \mathbb{E}(Y) \right] + \mathbb{E} \left[ \mathbb{E}(Y) \right]^2 \\ & \qquad \text{distributing the expected value operator} \\ &= \mathbb{E} \left( Y^2 \right) - 2 \mathbb{E} \left[ Y \mathbb{E}(Y) \right] + \mathbb{E} \left[ \mathbb{E}(Y) \right]^2 \\ & \qquad \text{since $2$ is a constant} \\ &= \mathbb{E} \left( Y^2 \right) - 2 \mathbb{E}(Y) \mathbb{E} \left( Y \right) + \left[ \mathbb{E}(Y) \right]^2 \\ & \qquad \text{since $\mathbb{E}(Y)$ is a constant} \\ &= \mathbb{E} \left( Y^2 \right) - 2 \left[ \mathbb{E}(Y) \right]^2 + \left[ \mathbb{E}(Y) \right]^2 \\ &= \mathbb{E} \left( Y^2 \right) - \left[ \mathbb{E}(Y) \right]^2. \qquad \qquad \qquad \qquad \qquad \square \end{align*} \]
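As a numerical confirmation of this equivalence (our own sketch, again assuming \(\beta = 10\)), both variance expressions agree for the Exponential distribution, whose variance is \(\beta^2\):
# Mean and second moment of T ~ Exponential(beta = 10) by numerical integration
E_T <- integrate(function(t) t * dexp(t, rate = 1 / 10), 0, Inf)$value
E_T2 <- integrate(function(t) t^2 * dexp(t, rate = 1 / 10), 0, Inf)$value
E_T2 - E_T^2 # Equation 2.29: E(T^2) - [E(T)]^2 = beta^2 = 100
# Equation 2.28: mean squared deviation from E(T), also 100
integrate(function(t) (t - E_T)^2 * dexp(t, rate = 1 / 10), 0, Inf)$value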
Definition of estimator
An estimator is a mathematical rule involving the random variables \(Y_1, \dots, Y_n\) from our random sample of size \(n\). As its name says, this rule allows us to estimate our population parameter of interest.
Definition of estimate
Suppose we have an observed random sample of size \(n\) with values \(y_1, \dots , y_n\). Then, we apply a given estimator mathematical rule to these \(n\) observed values. Hence, this numerical computation is called an estimate of our population parameter of interest.
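The code chunks that produced the outputs below did not survive this section's layout. As a hedged reconstruction, the natural estimator rules here are the sample proportion for \(\pi\) and the sample mean for \(\beta\), applied to the observed samples (R shown; the Python tab would be analogous):
# Estimate of pi: proportion of sampled children preferring chocolate
round(mean(children_sample$fav_flavour == "chocolate"), 2)
# Estimate of beta: sample mean of the sampled waiting times (in minutes)
round(mean(waiting_sample$waiting_time), 2)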
Applied to our samples, these rules yield estimates of \(\pi\) of 0.68 (R) and 0.67 (Python), and estimates of \(\beta\) of 10.63 minutes (R) and 10.13 minutes (Python); the discrepancies again stem from each language's pseudo-random number generator.
2.1.6 The Rationale in Random Sampling
Definition of conditional probability
Suppose you have two events of interest, \(A\) and \(B\), in a random phenomenon of a population or system of interest. These two events belong to the sample space \(S\). Moreover, assume that event \(B\), which is considered the conditioning event, has a probability such that
\[ P(B) > 0. \]
Hence, the conditional probability of event \(A\) given event \(B\) is defined as
\[ P(A | B) = \frac{P(A \cap B)}{P(B)}, \tag{2.31}\]
where \(P(A \cap B)\) is read as the probability of the intersection of events \(A\) and \(B\).
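As a quick worked example of Equation 2.31 (ours, not from the original text): when rolling a fair die, let \(A\) be the event of an even number and \(B\) the event of a number greater than 3. Then:
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(\{4, 6\})}{P(\{4, 5, 6\})} = \frac{2/6}{3/6} = \frac{2}{3}. \]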
Tip on the rationale behind conditional probability!
We can delve into the rationale of Equation 2.31 by using a handy probabilistic concept called cardinality, which refers to the corresponding total number of possible outcomes in a random phenomenon belonging to any given event or sample space.
Proof. Let \(|S|\) be the cardinality corresponding to the sample space in a random phenomenon. Hence, as in Equation 2.2, we have that:
\[ P(S) = \frac{|S|}{|S|} = 1. \]
Moreover, suppose that \(A\) is the primary event of interest whose cardinality is represented by \(|A|\). Alternatively to Equation 2.1, the probability of \(A\) can be represented as
\[ P(A) = \frac{|A|}{|S|}. \]
On the other hand, the probability of the conditioning event \(B\) is
\[ P(B) = \frac{|B|}{|S|}. \tag{2.32}\]
Now, let \(|A \cap B|\) be the cardinality of the intersection between events \(A\) and \(B\). Its probability can be represented as:
\[ P(A \cap B) = \frac{|A \cap B|}{|S|}. \tag{2.33}\]
Analogous to Equation 2.32 and Equation 2.33, we can view the conditional probability \(P(A | B)\) as an updated probability of the primary event \(A\) restricted to the cardinality of the conditioning event \(|B|\). This places \(|A \cap B|\) in the numerator and \(|B|\) in the denominator as follows:
\[ P(A | B) = \frac{|A \cap B|}{|B|}. \tag{2.34}\]
Therefore, we can play around with Equation 2.34 along with Equation 2.32 and Equation 2.33 as follows:
\[ \begin{align*} P(A | B) &= \frac{|A \cap B|}{|B|} \\ &= \frac{\frac{|A \cap B|}{|S|}}{\frac{|B|}{|S|}} \qquad \text{dividing numerator and denominator over $|S|$} \\ &= \frac{P(A \cap B)}{P(B)}. \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad \square \end{align*} \]
Definition of the Bayes’ rule
Suppose you have two events of interest, \(A\) and \(B\), in a random phenomenon of a population or system of interest. From Equation 2.31, we can state the following expression for the conditional probability of \(A\) given \(B\):
\[ P(A | B) = \frac{P(A \cap B)}{P(B)} \quad \text{if $P(B) > 0$.} \tag{2.35}\]
Note the conditional probability of \(B\) given \(A\) can be stated as:
\[ \begin{align*} P(B | A) &= \frac{P(B \cap A)}{P(A)} \quad \text{if $P(A) > 0$} \\ &= \frac{P(A \cap B)}{P(A)} \quad \text{since $P(B \cap A) = P(A \cap B)$.} \end{align*} \tag{2.36}\]
Then, we can manipulate Equation 2.36 as follows:
\[ P(A \cap B) = P(B | A) \times P(A). \]
The above result can be plugged into Equation 2.35:
\[ \begin{align*} P(A | B) &= \frac{P(A \cap B)}{P(B)} \\ &= \frac{P(B | A) \times P(A)}{P(B)}. \end{align*} \tag{2.37}\]
Equation 2.37 is called the Bayes' rule. We are essentially flipping the conditional probabilities around.
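Continuing our die example from above, the Bayes' rule recovers the flipped conditional probability:
\[ P(B \mid A) = \frac{P(A \mid B) \times P(B)}{P(A)} = \frac{(2/3) \times (3/6)}{3/6} = \frac{2}{3}, \]
which matches counting directly: \(P(B \mid A) = P(\{4, 6\}) / P(\{2, 4, 6\}) = 2/3\).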
Definition of independence
Suppose you have two events of interest, \(A\) and \(B\), in a random phenomenon of a population or system of interest. These two events are statistically independent if event \(B\) does not affect event \(A\) and vice versa. Therefore, the probability of their corresponding intersection is given by:
\[ P(A \cap B) = P(A) \times P(B). \tag{2.38}\]
Let us expand the above definition to a random variable framework:
- Suppose you have a set of \(n\) discrete random variables \(Y_1, \dots, Y_n\) whose supports are \(\mathcal{Y_1}, \dots, \mathcal{Y_n}\) with PMFs \(P_{Y_1}(Y_1 = y_1), \dots, P_{Y_n}(Y_n = y_n)\) respectively. That said, the joint PMF of these \(n\) random variables is the multiplication of their corresponding standalone PMFs:
\[ \begin{align*} P_{Y_1, \dots, Y_n}(Y_1 = y_1, \dots, Y_n = y_n) &= \prod_{i = 1}^n P_{Y_i}(Y_i = y_i) \\ & \qquad \text{for all} \\ & \qquad \quad y_i \in \mathcal{Y}_i, i = 1, \dots, n. \end{align*} \tag{2.39}\]
- Suppose you have a set of \(n\) continuous random variables \(Y_1, \dots, Y_n\) whose supports are \(\mathcal{Y_1}, \dots, \mathcal{Y_n}\) with PDFs \(f_{Y_1}(y_1), \dots, f_{Y_n}(y_n)\) respectively. That said, the joint PDF of these \(n\) random variables is the multiplication of their corresponding standalone PDFs:
\[ \begin{align*} f_{Y_1, \dots, Y_n}(y_1, \dots, y_n) &= \prod_{i = 1}^n f_{Y_i}(y_i) \\ & \qquad \text{for all} \\ & \qquad \quad y_i \in \mathcal{Y}_i, i = 1, \dots, n. \end{align*} \tag{2.40}\]
Tip on the rationale behind the rule of independent events!
We can delve into the rationale of Equation 2.38 by using the Bayes’ rule from Equation 2.37 along with the basic conditional probability formula from Equation 2.31.
Proof. Firstly, let us assume that a given event \(B\) does not affect event \(A\) which can be probabilistically represented as
\[ P(A | B) = P(A). \tag{2.41}\]
If the statement in Equation 2.41 holds, by using the Bayes’ rule from Equation 2.37, we have the following manipulation for the below conditional probability formula:
\[ \begin{align*} P(B | A) &= \frac{P(B \cap A)}{P(A)} \\ &= \frac{P(A \cap B)}{P(A)} \qquad \text{since $P(B \cap A) = P(A \cap B$)} \\ &= \frac{P(A | B) \times P(B)}{P(A)} \qquad \text{by the Bayes' rule} \\ &= \frac{P(A) \times P(B)}{P(A)} \qquad \text{since $P(A | B) = P(A)$} \\ &= P(B). \end{align*} \]
Then, again by using the Bayes’ rule, we obtain \(P(B \cap A)\) as follows:
\[ \begin{align*} P(B \cap A) &= P(B | A) \times P(A) \\ &= P(B) \times P(A) \qquad \text{since $P(B | A) = P(B)$.} \end{align*} \]
Finally, we have that:
\[ \begin{align*} P(A \cap B) &= P(B \cap A) \\ &= P(B) \times P(A) \\ &= P(A) \times P(B). \qquad \square \end{align*} \]
Definition of random sample
A random sample is a collection of random variables \(Y_1, \dots, Y_n\) of size \(n\) coming from a given population or system of interest. Note that the most elementary definition of a random sample assumes that these \(n\) random variables are mutually independent and identically distributed (which is abbreviated as iid).
The fact that these \(n\) random variables are identically distributed indicates that they have the same mathematical form for their corresponding PMFs or PDFs, depending on whether they are discrete or continuous respectively. Hence, under a generative modelling approach in a population or system of interest governed by \(k\) parameters contained in the vector
\[ \boldsymbol{\theta} = (\theta_1, \theta_2, \cdots, \theta_k)^T, \]
we can apply the iid property in an elementary random sample to obtain the following joint probability distributions:
- In the case of \(n\) iid discrete random variables \(Y_1, \dots, Y_n\) whose common standalone PMF is \(P_Y(Y = y | \boldsymbol{\theta})\) with support \(\mathcal{Y}\), the joint PMF is mathematically expressed as
\[ \begin{align*} P_{Y_1, \dots, Y_n}(Y_1 = y_1, \dots, Y_n = y_n | \boldsymbol{\theta}) &= \prod_{i = 1}^n P_Y(Y = y_i | \boldsymbol{\theta}) \\ & \quad \text{for all} \\ & \quad \quad y_i \in \mathcal{Y}, i = 1, \dots, n. \end{align*} \tag{2.42}\]
- In the case of \(n\) iid continuous random variables \(Y_1, \dots, Y_n\) whose common standalone PDF is \(f_Y(y | \boldsymbol{\theta})\) with support \(\mathcal{Y}\), the joint PDF is mathematically expressed as
\[ \begin{align*} f_{Y_1, \dots, Y_n}(y_1, \dots, y_n | \boldsymbol{\theta}) &= \prod_{i = 1}^n f_Y(y_i | \boldsymbol{\theta}) \\ & \quad \text{for all} \\ & \quad \quad y_i \in \mathcal{Y}, i = 1, \dots, n. \end{align*} \tag{2.43}\]
Unlike Equation 2.39 and Equation 2.40, note that Equation 2.42 and Equation 2.43 do not indicate a subscript for \(Y\) in the corresponding probability distributions since we have identically distributed random variables. Furthermore, the joint distributions are conditioned on the population parameter vector \(\boldsymbol{\theta}\) which reflects our generative modelling approach.
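To preview how Equation 2.42 will be used later on, here is a small R illustration (ours, with a hypothetical observed sample assumed) of the joint PMF of an iid Bernoulli sample evaluated at \(\pi = 0.65\):
# Hypothetical observed iid Bernoulli sample: 1 = chocolate, 0 = vanilla
y <- c(1, 0, 1, 1)
# Equation 2.42: joint PMF as a product of the common standalone PMF
prod(dbinom(y, size = 1, prob = 0.65)) # 0.65^3 * 0.35 ~= 0.0961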
2.2 What is Maximum Likelihood Estimation?
2.3 Basics of Frequentist Statistical Inference


2.3.1 General Settings

Based on the work by Soch et al. (2024), let us check some key definitions.
Definition of hypothesis
Suppose you observe some data \(y\) from some population(s) or system(s) of interest governed by \(k\) parameters contained in the following vector:
\[ \boldsymbol{\theta} = (\theta_1, \theta_2, \cdots, \theta_k)^T. \]
Moreover, we assume this observed data \(y\) follows a certain probability distribution \(\mathcal{D}(\cdot)\) in a generative model \(m\) as in
\[ m: y \sim \mathcal{D}(\boldsymbol{\theta}). \]
Given that \(\boldsymbol{\theta} \in \boldsymbol{\Theta}\), where \(\boldsymbol{\Theta} \subseteq \mathbb{R}^k\) is the parameter space, a statistical hypothesis is a general statement about the parameter vector \(\boldsymbol{\theta}\) in regards to specific values contained in a subset \(\boldsymbol{\Theta}^*\) such that
\[ H: \boldsymbol{\theta} \in \boldsymbol{\Theta}^* \quad \text{where} \quad \boldsymbol{\Theta}^* \subset \boldsymbol{\Theta}. \]
Definition of null hypothesis
In hypothesis testing, a null hypothesis is denoted by \(H_0\). The whole inferential process is designed to assess the strength of the evidence in favour of or against this null hypothesis. In plain words, \(H_0\) is an inferential statement associated with the status quo in some population(s) or system(s) of interest, which might refer to no signal for the researcher in question.
Again, suppose you observe some data \(y\) from some population(s) or system(s) of interest governed by \(k\) parameters contained in the following vector:
\[ \boldsymbol{\theta} = (\theta_1, \theta_2, \cdots, \theta_k)^T. \]
Moreover, we assume this observed data \(y\) follows a certain probability distribution \(\mathcal{D}(\cdot)\) in a generative model \(m\) as in
\[ m: y \sim \mathcal{D}(\boldsymbol{\theta}). \]
Let \(\boldsymbol{\Theta}_0 \subset \boldsymbol{\Theta}\) denote the status quo for the parameter(s) to be tested. Then, the null hypothesis is mathematically defined as
\[ H_0: \boldsymbol{\theta} \in \boldsymbol{\Theta}_0 \quad \text{where} \quad \boldsymbol{\Theta}_0 \subset \boldsymbol{\Theta}. \tag{2.44}\]
Definition of alternative hypothesis
In hypothesis testing, an alternative hypothesis is denoted by \(H_1\). This hypothesis corresponds to the complement (i.e., the opposite) of the null hypothesis \(H_0\). Since the whole inferential process is designed to assess the strength of the evidence in favour of or against \(H_0\), any inferential conclusion against \(H_0\) can be worded as “rejecting \(H_0\) in favour of \(H_1\).” In plain words, \(H_1\) is an inferential statement associated with a non-status quo in some population(s) or system(s) of interest, which might refer to actual signal for the researcher in question.
Let us assume you observe some data \(y\) from some population(s) or system(s) of interest governed by \(k\) parameters contained in the following vector:
\[ \boldsymbol{\theta} = (\theta_1, \theta_2, \cdots, \theta_k)^T. \]
Moreover, suppose this observed data \(y\) follows a certain probability distribution \(\mathcal{D}(\cdot)\) in a generative model \(m\) as in
\[ m: y \sim \mathcal{D}(\boldsymbol{\theta}). \]
Let \(\boldsymbol{\Theta}_0^c \subset \boldsymbol{\Theta}\) denote the non-status quo for the parameter(s) to be tested. Then, the alternative hypothesis is mathematically defined as
\[ H_1: \boldsymbol{\theta} \in \boldsymbol{\Theta}_0^c \quad \text{where} \quad \boldsymbol{\Theta}_0^c \subset \boldsymbol{\Theta}. \tag{2.45}\]
Definition of hypothesis testing
Hypothesis testing is the decision rule we apply between the null and alternative hypotheses, via our sample data, to either fail to reject or reject the null hypothesis.
Definition of type I error (false positive)
Type I error is defined as incorrectly rejecting the null hypothesis \(H_0\) in favour of the alternative hypothesis \(H_1\) when, in fact, \(H_0\) is true. Analogously, this type of error is also called a false positive.
Definition of type II error (false negative)
Type II error is defined as incorrectly failing to reject the null hypothesis \(H_0\) in favour of the alternative hypothesis \(H_1\) when, in fact, \(H_0\) is false. Analogously, this type of error is also called a false negative. Table 2.4 summarizes the types of inferential conclusions as a function of whether \(H_0\) is true or not.
| | \(H_0\) is true | \(H_0\) is false |
|---|---|---|
| Reject \(H_0\) | Type I error (False positive) | Correct |
| Fail to reject \(H_0\) | Correct | Type II error (False negative) |
Definition of significance level
The significance level \(\alpha\) is defined as the conditional probability of rejecting the null hypothesis \(H_0\) given that \(H_0\) is true. This can be mathematically represented as
\[ P \left( \text{Reject $H_0$} | \text{$H_0$ is true} \right) = \alpha. \]
In plain words, \(\alpha \in [0, 1]\) allows us to probabilistically control for type I error since we are dealing with random variables in our inferential process. The significance level can be thought of as one of the main hypothesis testing and power analysis settings. The smaller the significance level in our power analysis and hypothesis testing, the less prone we are to commit a type I error.
Definition of power
The statistical power of a test, \(1 - \beta\), is the complement of the conditional probability \(\beta\) of failing to reject the null hypothesis \(H_0\) given that \(H_0\) is false, which is mathematically represented as
\[ P \left( \text{Failing to reject $H_0$} | \text{$H_0$ is false} \right) = \beta; \]
yielding
\[ \text{Power} = 1 - \beta. \]
In plain words, \(1 - \beta \in [0, 1]\) is the probabilistic ability of our hypothesis testing to detect any signal in our inferential process, if there is any. The larger the power in our power analysis, the less prone we are to commit a type II error.
Definition of power analysis
Power analysis is a set of statistical tools used to compute the minimum required sample size \(n\) for any given inferential study. These tools require the significance level, power, and effect size (i.e., the magnitude of the signal) the researcher aims to detect via their inferential study. This analysis helps ensure that the study is large enough for observed results to reflect a true and meaningful effect rather than chance variation.
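As a hedged illustration, base R's power.t.test() can back out the minimum \(n\) per group for a two-sample \(t\)-test, given the significance level, power, and effect size (the numerical values below are ours, purely for illustration):
# Minimum sample size per group to detect a mean difference of 2 minutes
# (sd = 10), with alpha = 0.05 and power = 0.80
power.t.test(delta = 2, sd = 10, sig.level = 0.05, power = 0.80,
  type = "two.sample", alternative = "two.sided")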
2.3.2 Hypotheses Definitions

2.3.3 Test Flavour and Components

Definition of observed effect
An observed effect is the difference between the estimate provided by the observed random sample (of size \(n\), as in \(y_1, \dots, y_n\)) and the hypothesized value(s) of the population parameter(s) depicted in the statistical hypotheses.
Definition of standard error
The standard error allows us to quantify the extent to which an estimate coming from an observed random sample (of size \(n\), as in \(y_1, \dots, y_n\)) may deviate from the expected value under the assumption that the null hypothesis is true.
It plays a critical role in determining whether an observed effect is likely attributable to random variation or represents a statistically significant finding. In the absence of the standard error, it would not be possible to rigorously assess the reliability or precision of an estimate.
Definition of test statistic
The test statistic is a function of the random sample of size \(n\); that is, it is a function of the random variables \(Y_1, \dots, Y_n\). Therefore, the test statistic will also be a random variable, whose observed value will describe how closely the probability distribution from which the random sample comes matches the probability distribution assumed under the null hypothesis \(H_0\).
More specifically, once we have obtained the observed effect and standard error from our observed random sample, we can compute the corresponding observed test statistic. This test statistic computation will be placed on the corresponding \(x\)-axis of the probability distribution of \(H_0\) so we can reject or fail to reject it accordingly.
2.3.4 Inferential Conclusions

Definition of critical value
The critical value of a hypothesis test defines the region for which we might reject \(H_0\) in favour of \(H_1\). This critical value is a function of the significance level \(\alpha\) and the test flavour. It is located on the corresponding \(x\)-axis of the probability distribution of \(H_0\). Hence, this value acts as a threshold to decide either of the following:
- If the observed test statistic exceeds a given critical value, then we have enough statistical evidence to reject \(H_0\) in favour of \(H_1\).
- If the observed test statistic does not exceed a given critical value, then we do not have enough statistical evidence to reject \(H_0\).
Definition of \(p\)-value
A \(p\)-value refers to the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic coming from our observed random sample of size \(n\). This \(p\)-value is obtained via the probability distribution of \(H_0\) and the observed test statistic.
As an alternative to the critical value, we can reject or fail to reject the null hypothesis \(H_0\) using this \(p\)-value as follows:
- If the \(p\)-value associated with the observed test statistic does not exceed a given significance level \(\alpha\) (i.e., \(p \leq \alpha\)), then we have enough statistical evidence to reject \(H_0\) in favour of \(H_1\).
- If the \(p\)-value associated with the observed test statistic exceeds a given significance level \(\alpha\) (i.e., \(p > \alpha\)), then we do not have enough statistical evidence to reject \(H_0\).
Definition of confidence interval
A confidence interval provides an estimated range of values within which the true population parameter is likely to fall, based on the sample data. It reflects the degree of uncertainty associated with the obtained estimate. For instance, a 95% confidence interval means that if the study were repeated many times using different random samples from the same population or system of interest, approximately 95% of the resulting intervals would contain the true parameter.
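To tie the test statistic, \(p\)-value, and confidence interval together, here is a hedged one-sample illustration on our sampled waiting times (assuming the waiting_sample data frame from earlier, and a null value of 10 minutes chosen by us purely for illustration):
# One-sample t-test of H0: mean waiting time = 10 minutes, alpha = 0.05
# The printed output reports the observed test statistic, the p-value,
# and a 95% confidence interval for the mean waiting time
t.test(waiting_sample$waiting_time, mu = 10, conf.level = 0.95)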