2 Basic Cuisine: A Review on Probability and Frequentist Statistical Inference
This chapter will delve into probability and frequentist statistical inference. We can view these sections as a quick review of introductory probability and statistics concepts. Moreover, this review will be important to understanding the philosophy behind model parameter estimation as outlined in Section 1.2.5. Then, we will pave the way to the rationale behind statistical inference in the results stage (as in Section 1.2.7) of our workflow from Figure 1.1. Note that we aim to explain all these statistical and probabilistic concepts in the most practical way possible via a made-up case study that runs throughout this chapter. Still, we will use an appropriate level of jargon and will follow the colour convention found in Appendix A, along with the definition callout box.
Imagine you are an undergraduate engineering student. Last term, you took and passed your first course in probability and statistics (inference included!) in an industrial engineering context. As often happens when taking an introductory course in probability and statistics, you used to feel quite overwhelmed by the large amount of jargon and formulas one had to grasp and use regularly for primary engineering fields such as quality control in a manufacturing facility. Population parameters, hypothesis testing, test statistics, significance level, \(p\)-values, and confidence intervals (do not worry, our statistical/machine learning scheme will come in later in this review) were appearing here and there! And to your frustration, you could never find a statistical connection between all these inferential tools! Instead, you relied on mechanistic procedures when solving assignments or exam problems.
For instance, when performing hypothesis testing for a two-sample \(t\)-test, you struggled to reflect on what the hypotheses were trying to indicate for the corresponding population parameter(s) or how the test statistic was related to these hypotheses. Moreover, your interpretation of the resulting \(p\)-value and/or confidence interval was purely mechanical, with the inherent claim:
With a significance level \(\alpha = 0.05\), we reject (or fail to reject, if that is the case!) the null hypothesis given that…
Truthfully, this whole mechanical way of doing statistics is not ideal in a teaching, research or industry environment. Along the same lines, the above situation should also not happen when we learn key statistical topics for the very first time as undergraduate students. That is why we will investigate a more intuitive way of viewing probability and its crucial role in statistical inference. This matter will help us deliver more coherent storytelling (as in Section 1.2.8) when presenting our results in practice during any regression analysis to our peers or stakeholders. Note that the role of probability also extends to model training (as in Section 1.2.5) when it comes to supervised learning and not just regarding statistical inference.
Having said all this, it is time to introduce a statement that is key when teaching hypothesis testing in an introductory statistical inference course:
In statistical inference, everything always boils down to randomness and how we can control it!
That is quite a bold statement! Nonetheless, once one starts teaching statistical topics to audiences not entirely familiar with the usual field jargon, the idea of randomness always persists across many different tools. And, of course, regression analysis is not an exception at all since it also involves inference on population parameters of interest! This is why we have allocated this section of the textbook to explaining core probabilistic and inferential concepts, paving the way to their role in regression analysis.
Heads-up on what we mean by a non-ideal mechanical analysis!
The reader might need clarification on why the mechanical way of performing hypothesis testing is considered non-ideal, especially since the term cookbook is used in the book’s title. The cookbook concept here actually refers to a homogenized recipe for data modelling, as seen in the workflow from Figure 1.1. However, there is a crucial distinction between this and the non-ideal mechanical way of hypothesis testing.
On the one hand, the non-ideal mechanical way refers to the use of a tool without understanding the rationale behind what this tool stands for, resulting in vacuous, boilerplate statements that we would not be able to explain any further, such as the statement we previously indicated:
With a significance level \(\alpha = 0.05\), we reject (or fail to reject, if that is the case!) the null hypothesis given that…
What if a stakeholder of our analysis asks us in plain words what a significance level means? Why are we phrasing our conclusion in terms of the null hypothesis and not directly in terms of the alternative one? As a data scientist, one should be able to explain why the whole inference process yields that statement without misleading the stakeholders’ understanding. For sure, this also calls for appropriate communication skills that cater to general audiences rather than just statistical ones.
Conversely, the data modelling workflow in Figure 1.1 involves stages that necessitate a comprehensive and precise understanding of our analysis. Progressing to the next stage without a complete grasp of the current one risks perpetuating false insights, potentially leading to faulty data storytelling of the entire analysis.
Finally, even though this book has suggested reviews related to the basics of probability via different distributions and the fundamentals of frequentist statistical inference, as stated in Audience and Scope, we will revisit essential concepts as follows:
- The role of random variables and probability distributions, and how a population (or system) is governed by parameters (i.e., the so-called Greek letters we usually see in statistical inference and regression analysis). Section 2.1 will explore these topics in more detail while connecting them to the subsequent inferential terrain under a frequentist context.
- When delving into supervised learning and regression analysis, we might wonder how randomness is incorporated into model fitting (i.e., parameter estimation). That is quite a fascinating aspect, implemented via a crucial statistical tool known as maximum likelihood estimation. This tool is heavily related to the concept of loss function in supervised learning. Section 2.2 will explore these matters in more detail and how the idea of a random sample is connected to this estimation tool.
- Section 2.3 will explore the basics of hypothesis testing and its intrinsic components such as null and alternative hypotheses, type I and type II errors, test statistic, standard error, \(p\)-value, and confidence interval.
- Finally, Section 2.4 will briefly discuss the connections between supervised learning and regression analysis regarding terminology.
Without further ado, let us start with reviewing core concepts in probability via quite a tasty example.
2.1 Basics of Probability
In terms of regression analysis and its supervised learning counterpart (either in an inferential or predictive framework), probability can be viewed as the solid foundation on which more complex tools, including estimation and hypothesis testing, are built. Under this foundation, our data comes from a given population or system of interest. Moreover, this population or system is assumed to be governed by parameters which, as data scientists or researchers, we are most interested in studying. That said, the terms population and parameter will pave the way to our first statistical definitions.
Definition of population
It is the whole collection of individuals or items that share distinctive attributes. As data scientists or researchers, we are interested in studying these attributes, which we assume are governed by parameters. In practice, we must be as precise as possible when defining our given population so that we can properly frame our entire data modelling process from its very early stages. Examples of a population could be the following:
- Children between the ages of 5 and 10 years old in states of the American West Coast.
- Customers of musical vinyl records in the Canadian provinces of British Columbia and Alberta.
- Avocado trees grown in the Mexican state of Michoacán.
- Adult giant pandas in the Southwestern Chinese province of Sichuan.
- Mature açaí palm trees from the Brazilian Amazonian jungle.
Note that the term population could be exchanged for the term system, given that certain contexts do not particularly refer to individuals or items. Instead, these contexts could refer to processes whose attributes are also governed by parameters. Examples of a system could be the following:
- The production of cellular phones from a given model in a set of manufacturing facilities.
- The sale process in the Vancouver franchises of a well-known ice cream parlour.
- The transit cycle during rush hours on weekdays in the twelve lines of Mexico City’s subway.
Definition of parameter
It is a characteristic (numerical or even non-numerical, such as a distinctive category) that summarizes the state of our population or system of interest. Examples of a population parameter can be described as follows:
- The average weight of children between the ages of 5 and 10 years old in states of the American West Coast (numerical).
- The variability in the height of the mature açaí palm trees from the Brazilian Amazonian jungle (numerical).
- The proportion of defective items in the production of cellular phones in a set of manufacturing facilities (numerical).
- The average customer waiting time to get their order in the Vancouver franchises of a well-known ice cream parlour (numerical).
- The favourite pizza topping of vegetarian adults between the ages of 30 and 40 years old in Edmonton (non-numerical).
Note that the standard mathematical notation for population parameters is Greek letters. Moreover, in practice, these population parameter(s) of interest will be unknown to the data scientist or researcher. Instead, they will use formal statistical inference to estimate them.
The parameter definition points out a crucial fact in investigating any given population or system:
Our parameter(s) of interest are usually unknown!
Given this fact, we would be in a pretty unfortunate and inconvenient position if we eventually wanted to discover any significant insights about the population or system. Therefore, let us proceed to our so-called tasty example so we can dive into the need for statistical inference and why probability is our perfect ally in this parameter quest.
Imagine you are the owner of a large fleet of ice cream carts, around 900 to be exact. These ice cream carts operate across different parks in the following Canadian cities: Vancouver, Victoria, Edmonton, Calgary, Winnipeg, Ottawa, Toronto, and Montréal. In the past, to optimize operational costs, you decided to limit ice cream cones to only two items: vanilla and chocolate flavours, as in Figure 2.1.
Now, let us steer this whole case onto more statistical and probabilistic ground; suppose you have a well-defined overall population of interest for the above eight Canadian cities: children between 4 and 11 years old attending these parks during the Summer weekends. Of course, Summer is coming this year, and you would like to know which ice cream cone flavour is the favourite one for this population (and by how much!). As a business owner, investigating ice cream flavour preferences would allow you to plan Summer restocks more carefully with your corresponding suppliers. Therefore, it would be essential to start collecting consumer data so the company can tackle this demand query.
Also, suppose there is a second query. For the sake of our case, we will call it a time query. As a critical component of demand planning, besides estimating which cone flavour is the most preferred one (and by how much!) for the above population of interest, the operations area is currently requiring a realistic estimation of the average waiting time from one customer to the next one in any given cart during Summer weekends. This average waiting time would allow the operations team to plan carefully how much stock each cart should have so there will not be any waste or shortage.
Note that the nature of the aforementioned time query relates to a larger population. Therefore, we can define it as all our ice cream customers during the Summer weekends. Furthermore, this second definition frames the query around our general ice cream customers, given the requirements of our operations team, and not only the children between 4 and 11 years old attending the parks during Summer weekends. Consequently, it is crucial to note that the nature of our queries will dictate how we define our population and, in turn, our subsequent data modelling and statistical inference.
Summer represents the most profitable season from a business perspective; thus, solving the above two queries is a significant priority for your company. Hence, you decide to organize a meeting with your eight general managers (one per Canadian city). During this meeting, the following is decided:
- For the demand query, a comprehensive market study will be run on the population of interest across the eight Canadian cities right before next Summer; suppose we are currently in Spring.
- For the time query, since the operations team has not previously recorded any historical data, ALL vendor staff from 900 carts will start collecting data on the waiting time in seconds between each customer this upcoming Summer.
Surprisingly, when discussing the requirements of the demand query’s market study with the marketing firm that would be in charge of it, Vancouver’s general manager dares to state the following:
Since we’re already planning to collect consumer data on these cities, let’s mimic a census-type study to ensure we can have the MOST PRECISE results on their preferences.
On the other hand, when agreeing on the specific operations protocol to start recording waiting times for all the 900 vending carts this upcoming Summer, Ottawa’s general manager provides a comment for further statistical food for thought:
The operations protocol for recording waiting times in the 900 vending carts looks too cumbersome to implement straightforwardly this upcoming Summer. Why don’t we select A SMALLER GROUP of ice cream carts across the eight cities to have a more efficient process implementation that would allow us to optimize operational costs?
Bingo! Ottawa’s general manager just nailed the probabilistic way of making inference on our population parameter of interest for the time query. Indeed, their comment was primarily framed from a business perspective of optimizing operational costs. Still, this fact does not take away a crucial insight on which statistical inference is built: a random sample (as in Important 2.1). As for Vancouver’s general manager, ironically, their statement is NOT PRECISE at all! Mimicking a census-type study might not be the optimal decision for the demand query given the time constraint and the potential size of its target population.
Realistically, there is no cheap and efficient way to conduct a census-type study for any of the two queries!
We can state that probability is viewed as the language to decode random phenomena that occur in any given population or system of interest. In our example, we have two random phenomena:
- For the demand query, a phenomenon can be represented by the preferred ice cream cone flavour of any randomly selected child between 4 and 11 years old attending the parks of the above eight Canadian cities during the Summer weekends.
- Regarding the time query, a phenomenon of this kind can be represented by any randomly recorded waiting time between two customers during a Summer weekend in any of the above eight Canadian cities.
Hence, let us finally define what we mean by probability along with the inherent concept of sample space.
Definition of probability
Let \(A\) be an event of interest in a random phenomenon, in a population or system of interest, whose possible outcomes all belong to a given sample space \(S\). Generally, the probability of this event \(A\) happening can be mathematically depicted as \(P(A)\). Moreover, suppose we observe the random phenomenon \(n\) times, as if we were running some class of experiment; then \(P(A)\) is defined as the following ratio:
\[ P(A) = \frac{\text{Number of times event $A$ is observed}}{n}, \tag{2.1}\]
as \(n\), the number of times we observe the random phenomenon, goes to infinity.
Equation 2.1 will always put \(P(A)\) in the following numerical range:
\[ 0 \leq P(A) \leq 1. \]
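To make Equation 2.1 more tangible, suppose (purely for illustration, with made-up numbers) that we randomly survey \(n = 500\) children from the demand query’s population of interest and that 300 of them pick the chocolate-flavoured cone. If \(A\) denotes the event “a randomly selected child prefers chocolate,” our estimated probability would be

\[ P(A) \approx \frac{300}{500} = 0.6, \]

which indeed lies between 0 and 1. Of course, under the definition above, this ratio only becomes the actual probability of \(A\) as the number of observed children grows towards infinity.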
Definition of sample space
Let \(A\) be an event of interest in a random phenomenon in a population or system of interest. The sample space \(S\) associated with this random phenomenon denotes the set of all the possible random outcomes we might encounter every time we observe it, as if we were running some class of experiment.
Note that each of these outcomes has a determined probability associated with it. If we add up all these probabilities, the probability of the sample space \(S\) will be one, i.e.,
\[ P(S) = 1. \tag{2.2}\]
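For instance, in our demand query, a randomly selected child can only prefer one of the two available cone flavours, so the corresponding sample space is \(S = \{\text{chocolate}, \text{vanilla}\}\) and, by Equation 2.2,

\[ P(S) = P(\text{chocolate}) + P(\text{vanilla}) = 1. \]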
Note that the definition of the probability of an event \(A\) above specifically highlights the following:
… as \(n\), the number of times we observe the random phenomenon, goes to infinity.
The “infinity” term is key when it comes to understanding the philosophy behind the frequentist school of statistical thinking in contrast to its Bayesian counterpart. In general, the frequentist way of practicing statistics in terms of probability and inference is the approach we usually learn in introductory courses, more specifically when it comes to hypothesis testing and confidence intervals, which will be explored in Section 2.3. That said, the Bayesian approach is another way of practicing statistical inference. Its philosophy differs in what information is used to infer our population parameters of interest. Below, we briefly define both schools of thinking.
Definition of frequentist statistics
This statistical school of thinking heavily relies on the frequency of events to estimate specific parameters of interest in a population or system. This frequency of events is reflected in the repetition of \(n\) experiments involving a random phenomenon within this population or system.
Under the umbrella of this approach, we assume that our governing parameters are fixed. Note that, within the philosophy of this school of thinking, we can only make precise and accurate predictions as long as we repeat our \(n\) experiments as many times as possible, i.e.,
\[ n \rightarrow \infty. \]
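To illustrate this \(n \rightarrow \infty\) idea, below is a minimal simulation sketch (not part of our case’s actual data; it assumes a hypothetical “true” preference probability of 0.6 and uses NumPy) showing how the relative frequency from Equation 2.1 settles around the governing parameter as \(n\) grows:

```python
# Minimal sketch: the relative frequency of an event approaches its governing
# probability as the number of observations n grows (frequentist philosophy).
# The "true" probability below is a hypothetical value chosen for illustration.
import numpy as np

rng = np.random.default_rng(seed=123)  # reproducible pseudo-random generator
true_prob_chocolate = 0.6              # hypothetical population parameter

for n in [10, 100, 1_000, 10_000, 100_000]:
    # Simulate n independent children; 1 = prefers chocolate, 0 = prefers vanilla.
    preferences = rng.binomial(n=1, p=true_prob_chocolate, size=n)
    # Equation 2.1: number of times the event is observed, divided by n.
    estimate = preferences.sum() / n
    print(f"n = {n:>6}: estimated P(chocolate) = {estimate:.3f}")
```

As \(n\) increases, the printed estimates tend to hover ever more tightly around 0.6, which is precisely the behaviour the frequentist school relies on.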
Definition of Bayesian statistics
This statistical school of thinking also relies on the frequency of events to estimate specific parameters of interest in a population or system. Nevertheless, unlike frequentist statisticians, Bayesian statisticians use prior knowledge of the population parameters, along with the current evidence they can gather, to update their estimates of these parameters. This evidence comes in the form of the repetition of \(n\) experiments involving a random phenomenon. All these ingredients allow Bayesian statisticians to make inference by conducting appropriate hypothesis tests, which are designed differently from their mainstream frequentist counterparts.
Under the umbrella of this approach, we assume that our governing parameters are random; i.e., they have their own sample space and probabilities associated with their corresponding outcomes. The statistical process of inference is heavily backed up by probability theory, mostly in the form of Bayes’ theorem (named after Reverend Thomas Bayes, an English statistician from the 18th century). This theorem uses our current evidence, along with our prior beliefs, to deliver a posterior distribution of our random parameter(s) of interest.
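For reference, and using generic notation not formally introduced in this chapter, Bayes’ theorem for a parameter of interest \(\theta\) and observed data can be sketched as

\[ P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \, P(\theta)}{P(\text{data})}, \]

where \(P(\theta)\) encodes our prior beliefs, \(P(\text{data} \mid \theta)\) is the likelihood of the current evidence, and \(P(\theta \mid \text{data})\) is the resulting posterior distribution.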
Let us put the definitions of the above schools of statistical thinking into a more concrete example. We can use the demand query from our ice cream case as a starting point. More concretely, we can dig into a standalone population parameter such as the probability that a randomly selected child between 4 and 11 years old, attending the parks of the above eight Canadian cities during the Summer weekends, prefers the chocolate-flavoured ice cream cone over the vanilla one. Think about the following two hypothetical questions:
- From a frequentist point of view, what is the estimated probability of preferring chocolate over vanilla after randomly surveying \(n = 100\) children from our population of interest?
- Using a Bayesian approach, suppose the marketing team has found ten prior market studies on similar children’s populations regarding their preferred ice cream flavour (between chocolate and vanilla). Therefore, along with our actual random survey of \(n = 100\) children from our population of interest, what is the posterior estimate of the probability of preferring chocolate over vanilla?
In both cases, we can model the flavour preference of a single randomly selected child from our population of interest as a random variable \(X\) taking the value \(x = 1\) if the child prefers chocolate and \(x = 0\) if they prefer vanilla, where the parameter \(\pi\) denotes the probability of preferring chocolate. This yields the following probability mass function:

\[ P(X = x \mid \pi) = \pi^x (1 - \pi)^{1 - x} \quad \text{for} \quad x = 0, 1. \tag{2.3}\]
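To see how the two hypothetical questions above could play out for the model in Equation 2.3 (a Bernoulli-type model for a single child’s preference), here is a minimal sketch with made-up numbers. The survey counts and the conjugate Beta prior, one common way to summarize prior studies that we assume here purely for illustration, are not part of our case’s actual data:

```python
# Minimal sketch (made-up numbers): frequentist versus Bayesian estimation of
# the Bernoulli parameter pi, the probability of preferring chocolate.

n = 100            # hypothetical survey size
chocolate = 62     # hypothetical number of children preferring chocolate

# Frequentist answer: under the model in Equation 2.3, the maximum likelihood
# estimate of pi is simply the sample proportion (Equation 2.1 in action).
pi_hat = chocolate / n
print(f"Frequentist estimate of pi: {pi_hat:.2f}")

# Bayesian answer: summarize the ten prior market studies as a Beta(a, b)
# prior (a common conjugate choice assumed here). The posterior distribution
# is then Beta(a + successes, b + failures); we report its mean.
a_prior, b_prior = 55, 45      # hypothetical prior pseudo-counts
a_post = a_prior + chocolate
b_post = b_prior + (n - chocolate)
posterior_mean = a_post / (a_post + b_post)
print(f"Bayesian posterior mean of pi: {posterior_mean:.2f}")
```

Note how the frequentist estimate relies only on the current survey, whereas the Bayesian posterior blends the prior pseudo-counts with that same evidence; as the survey size grows, the two estimates would get closer and closer.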