1 Getting Ready for Regression Cooking!

First things first! Let us prepare for all the different regression techniques to be introduced in Chapter 3.

Learning Objectives

By the end of this chapter, you will be able to:

Define the three core pillars to be applied in regression modelling throughout this book.
Outline how the ML-Stats dictionary works to bridge the terminology used in machine learning (ML) and statistics.
Explain how the data science workflow can be applied in regression analysis
Describe how the mind map of regression analysis acts as the primary chapter structure of this book and as a toolboox.
Contrast the differences and simmilarities between supervised learning and regression analysis.

That said, we want to highlight one guiding principle for all of our work:

Different modelling estimation techniques in regression analysis become easier to understand once we develop a strong probabilistic and inferential grasp of populations or systems of interest.

The above guiding principle rests on foundational statistical ideas on how data is generated and how it can be modelled through various regression methods. We will explore these underlying concepts in Chapter 2. Before doing so, however, this chapter will build on the following three core pillars:

Implementing a structured data science workflow as outlined in Section 1.2.
Selecting the appropriate workflow approach based on an inferential or predictive paradigm, as shown in Figure 1.1.
Choosing the appropriate regression model based on the response variable or outcome of interest, using the mind map in Section 1.3 (analogous to a regression toolbox).

The rationale behind the three pillars

Each data science problem involving regression analysis has unique characteristics, depending on if the inquiry is inferential or predictive. Different types of outcomes (or response variables) require distinct modelling approaches. For example, we might analyze survival times (e.g., the time until one particular equipment of a given brand fails), categorical outcomes (e.g., a preferred musical genre in the Canadian young population), or count-based outcomes (e.g., how many customers we would expect on a regular Monday morning in the branches of a major national bank), etc. Moreover, under this regression context, our analysis extends beyond the outcome itself, but we also examine how it relates to other explanatory variables (the so-called features). For instance, if we are studying musical genre preferences among young Canadians, we could explore how age groups influence these preferences or compare genre popularity across provinces.

At first glance, it might seem that every regression problem should have a unique workflow tailored to its specific model. However, this is not entirely the case. In Figure 1.1, we introduce a structured regression workflow designed as a proof of concept for thirteen different regression models. Each flow is covered in a separate chapter of this book alongside a review of probability and statistics (i.e, thirteen chapters in this book aside from the probability and statistics review). This workflow standardizes our approach, making analysis more transparent and efficient while allowing us to communicate insights effectively through data storytelling. Naturally, this workflow includes decision points that determine whether the approach follows an inferential or predictive path (the second pillar). As for our third pillar, this comes into play at the data modelling stage, where the regression toolbox Figure 1.10 guides model selection based on the response variable type.

Let us establish a convention for using admonitions throughout this textbook. These admonitions will help distinguish between key concepts, important insights, and supplementary material, ensuring clarity as we explore different regression techniques. We will start using these admonitions in Section 1.1.

Definition

A formal statistical and/or machine learning definition. This admonition aims to untangle the significant amount of jargon and concepts that both fields have. When applicable, alternative terminology will be included to highlight equivalent terms across statistics and machine learning.

Heads-up!

An idea (or ideas) related to a modelling approach, a specific workflow stage, or an important data science concept. This admonition is used to flag crucial statistical or machine learning topics that warrant deeper exploration.

Tip

An idea (or ideas) that may extend beyond the immediate discussion but provides additional context or helpful background. When applicable, references to further reading will be provided.

The core idea of the above admonition arrangement is to allow the reader to discern between ideas or concepts that are key to grasp from those whose understanding might not be highly essential (but still interesting to check out in further literature). With this structure in place, we can now introduce another foundational resource: a common ground between machine learning and statistics which will be elaborated on in the next section.

1.1 The ML-Stats Dictionary

Machine learning and statistics often overlap, especially in regression modelling. Topics covered in a regression-focused course, under a purely statistical framework, can also appear in machine learning-based courses on supervised learning, but the terminology can differ. Recognizing this overlap, the Master of Data Science (MDS) program at the University of British Columbia (UBC) provides the MDS Stat-ML dictionary (Gelbart 2017) under the following premises:

This document is intended to help students navigate the large amount of jargon, terminology, and acronyms encountered in the MDS program and beyond.

This section covers terms that have different meanings in different contexts, specifically statistics vs. machine learning (ML).

Both disciplines have a tremendous amount of jargon and terminology. As mentioned in the Preface, machine learning and statistics construct a substantial synergy reflected in data science. Despite this overlap, misunderstandings can still happen due to differences in terminology. To prevent this, we need clear bridges between these disciplines via a ML-Stats dictionary (ML stands for Machine Learning).

Heads-up on how the ML-Stats Dictionary is built and structured!

The complete ML-Stats dictionary can be found in Appendix A. This resource builds upon the concepts introduced in the definition callout box throughout the fifteen main chapters of this textbook. The dictionary aims to clarify terminology that varies between statistics and machine learning, specifically in the context of supervised learning and regression analysis.

Terms in this dictionary related to statistics will be highlighted in blue, while terms related to machine learning will be highlighted in magenta. This color scheme is designed to help readers easily navigate between the two disciplines. With practice, you will become proficient in applying concepts from both fields.

The above appendix will be the section in this book where the reader can find all those statistical and machine learning-related terms in alphabetical order. Notable terms (either statistical or machine learning-related) will include an admonition identifying which terms (again, either statistical or machine learning-related) are equivalent or somewhat equivalent (or even not equivalent if that is the case). For instance, consider the statistical term called dependent variable:

In supervised learning, it is the main variable of interest we are trying to learn or predict, or equivalently, in a statistical inference framework, the variable we are trying explain.

Then, the above definition will be followed by this admonition:

Equivalent to:

Response variable, outcome, output or target.

Note that we have identified four equivalent terms for the term dependent variable. Furthermore, these terms can be statistical or machine learning-related.

Heads-up on the use of terminology!

Throughout this book, we will use specific concepts interchangeably while explaining different regression methods. If confusion arises, you must always check definitions and equivalences (or non-equivalences) in Appendix A.

Next, we will introduce the three main foundations of this textbook: a data science workflow, choosing the correct workflow flavour (inferential or predictive), and building your regression toolbox.

1.2 The Data Science Workflow

Understanding the data science workflow is essential for mastering regression analysis. This workflow serves as a blueprint that guides us through each stage of our analysis, ensuring that we apply a systematic approach to solving our inquiries in a reproducible way. Each of the three pillars of this textbook—data science workflow, the right workflow flavour (inferential or predictive), and a regression toolbox—are deeply interconnected. Regardless of the regression model we explore, this general workflow provides a consistent framework that helps us navigate our data analysis with clarity and purpose.

As shown in Figure 1.1, the data science workflow is composed of the following eight stages (each of which will be discussed in more detail in subsequent subsections):

Study design: Define the research question, objectives, and variables of interest to ensure the analysis is purpose-driven and aligned with the problem at hand.
Data collection and wrangling: Gather and clean data, addressing issues such as missing values, outliers, and inconsistencies to transform it into a usable format.
Exploratory data analysis (EDA): Explore the data through statistical summaries and visualizations to identify patterns, trends, and potential anomalies.
Data modelling: Apply statistical or machine learning models to uncover relationships between variables or make predictions based on the data.
Estimation: Calculate model parameters to quantify relationships between variables and assess the accuracy and reliability of the model.
Goodness of fit: Evaluate the model’s performance using metrics and model diagnostic checks to determine how well it explains the data.
Results: Interpret the model’s outputs to derive meaningful insights and provide answers to the original research question.
Storytelling Communicate the findings through a clear, engaging narrative that is accessible to a non-technical audience.

By adhering to this workflow, we ensure that our regression analysis are not only systematic and thorough but also capable of producing results that are meaningful within the context of the problem we aim to solve.

Heads-up on the importance of a formal structure in regression analysis!

From the earliest stages of learning data analysis, understanding the importance of a structured workflow is crucial. If we do not adhere to a predefined workflow, we risk misinterpreting the data, leading to incorrect conclusions that fail to address the core questions of our analysis. Such missteps can result in outcomes that are not only meaningless but potentially misleading when taken out of the problem’s context.

Therefore, it is essential for aspiring data scientists to internalize this workflow from the very beginning of their education. A systematic approach ensures that each stage of the analysis is conducted with precision, ultimately producing reliable and contextually relevant results.

Figure 1.1: Data science workflow for *inferential* and *predictive* inquiries in regression analysis and supervised learning, respectively. The workflow is structured in eight stages: *study design*, *data collection and wrangling*, *exploratory data analysis*, *data modelling*, *estimation*, *goodness of fit*, *results*, and *storytelling*.

1.2.1 Study Design

The first stage of this workflow is centered around defining the main statistical inquiries we aim to address throughout the data analysis process. As a data scientist, your primary task is to translate these inquiries from the stakeholders into one of two categories: inferential or predictive. This classification determines the direction of your analysis and the methods you will use.

Inferential: The objective here is to explore and quantify relationships of association or causation between explanatory variables (regressors) and the response variable within the context of the problem at hand. For example, you may seek to determine whether a specific marketing campaign (regressor) significantly impacts sales revenue (response) and, if so, by how much.
Predictive: In this case, the focus is on making accurate predictions about the response variable based on future observations of the regressors. Unlike inferential inquiries, where understanding the relationship between variables is key, the primary goal here is to maximize prediction accuracy. This approach is fundamental in machine learning. For instance, you might build a model to predict future sales revenue based on past marketing expenditures, without necessarily needing to understand the underlying relationship between the two.

Example: Predicting Housing Prices

To illustrate the study design stage, let’s consider a simple example: predicting housing prices in a specific city.

If our goal is inferential, we might be interested in understanding the relationship between various factors (like square footage, number of bedrooms, and proximity to schools) and housing prices. Specifically, we would ask questions like, “How much does the number of bedrooms affect the price of a house, after accounting for other factors?”
If our goal is predictive, we would focus on creating a model that can accurately predict the price of a house based on its features, regardless of whether we fully understand how each factor contributes to the price.

In both cases, the study design stage involves clearly defining these objectives and determining the appropriate methods to address them. This stage sets the foundation for all subsequent steps in the data science workflow, as illustrated in Figure 1.2. Once the study design is established, the next stage is data collection and wrangling.

Figure 1.2: *Study design* stage from the data science workflow in Figure 1.1. This stage is directly followed by *data collection and wrangling*.

1.2.2 Data Collection and Wrangling

Once we have clearly defined our statistical questions, the next crucial step is to collect the data that will form the basis of our analysis. The way we collect this data is vital because it directly affects the accuracy and reliability of our results:

For inferential inquiries, we focus on understanding large groups or systems (populations) that we cannot fully observe. These populations are governed by characteristics (parameters) that we want to estimate. Because we can’t study every individual in the population, we collect a smaller, representative subset called a sample. The method we use to collect this sample—known as sampling—is crucial. A proper sampling method ensures that our sample reflects the larger population, allowing us to make accurate generalizations (inferences) about the entire group. After collecting the sample, it’s common practice to randomly split the data into training and test sets. This split allows us to build and validate our models, ensuring that the findings are robust and not overly tailored to the specific data at hand.

A Quick Debrief on Sampling!

Although this book does not cover sampling methods in detail, it’s important to know that the way you collect your sample can greatly influence your results. Depending on the problem, you might use different techniques:

Simple Random Sampling: Every individual in the population has an equal chance of being selected.
Systematic Sampling: You select individuals at regular intervals from a list of the population.
Stratified Sampling: You divide the population into subgroups (strata) and take a proportional sample from each subgroup.
Clustered Sampling: You divide the population into clusters and randomly select whole clusters for your sample.
Etc.

As in the case of Regression Analysis, statistical sampling is a vast field, and we could spend a whole course on it. If you’re interested in learning more about these methods, Sampling: design and analysis by Lohr is a great resource.

For predictive inquiries, our goal is often to use existing data to make predictions about future events or outcomes. In these cases, we usually work with large datasets (databases) that have already been collected. Instead of focusing on whether the data represents a population (as in inferential inquiries), we focus on cleaning and preparing the data so that it can be used to train models that make accurate predictions. After wrangling the data, it is typically split into training, validation, and test sets. The training set is used to build the model, the validation set is used to tune model parameters, and the test set evaluates the model’s final performance on unseen data.

Example: Collecting Data for Housing Price Predictions

Let’s continue with our housing price prediction example to illustrate these concepts:

Inferential Approach: Suppose we want to understand how the number of bedrooms affects housing prices in a city. To do this, we would collect a sample of house sales that accurately represents the city’s entire housing market. For instance, we might use stratified sampling to ensure that we include houses from different neighborhoods in proportion to how common they are. After collecting the data, we would split it into training and test sets. The training set helps us build our model and estimate the relationship between variables, while the test set allows us to evaluate how well our findings generalize to new data.
Predictive Approach: If our goal is to predict the selling price of a house based on its features (like size, number of bedrooms, and location), we would gather a large dataset of recent house sales. This data might come from a real estate database that tracks the details of each sale. Before we can use this data to train a model, we would clean it by filling in any missing information, converting data to a consistent format, and making sure all variables are ready for analysis. After preprocessing, we would split the data into training, validation, and test sets. The training set would be used to fit the model, the validation set to fine-tune it, and the test set to assess how well the model can predict prices for houses it hasn’t seen before.

As shown in Figure 1.3, the data collection and wrangling stage is fundamental to the workflow. It directly follows the study design and sets the stage for exploratory data analysis.

Figure 1.3: *Data collection and wrangling* stage from the data science workflow in Figure 1.1. This stage is directly followed by *exploratory data analysis* and preceded by *study design*.

1.2.3 Exploratory Data Analysis

Before diving into data modelling, it’s crucial to develop a deep understanding of the relationships between the variables in your training data. This is where the third stage of the data science workflow—Exploratory Data Analysis (EDA)—comes into play. EDA serves as a vital process that allows you to visualize and summarize your data, uncover patterns, detect anomalies, and test key assumptions that will inform your modelling decisions.

The first step in EDA is to classify your variables according to their types. This classification is essential because it guides your choice of analysis techniques and models. Specifically, you need to determine whether each variable is discrete or continuous, and whether it has any specific characteristics such as being bounded or unbounded.

Response Variable:
- Determine if your response variable is discrete (e.g., binary, count-based, categorical) or continuous.
- If it is continuous, consider whether it is bounded (e.g., percentages that range between 0 and 100) or unbounded (e.g., a variable like temperature that can take on a wide range of values).
Regressors:
- For each regressor, identify whether it is discrete or continuous.
- If a regressor is discrete, classify it further as binary, count-based, or categorical.
- If a regressor is continuous, determine whether it is bounded or unbounded.

This classification step ensures that you are prepared to choose the correct visualization and statistical methods for your analysis, as different types of variables often require different approaches.

After classifying your variables, the next step is to create visualizations and calculate descriptive statistics using your training data. This involves coding plots that can reveal the underlying distribution of each variable and the relationships between them. For instance, you might create histograms to visualize distributions, scatter plots to explore relationships between continuous variables, and box plots to compare categorical variables against the response variable.

Alongside these visualizations, it is important to calculate key descriptive statistics such as the mean, median, and standard deviation. These statistics provide a numerical summary of your data, offering insights into central tendency and variability. You might also use a correlation matrix to assess the strength of relationships between continuous variables.

Once you have generated these plots and statistics, they should be displayed in a clear and logical manner. The goal here is to interpret the data and draw preliminary conclusions about the relationships between variables. Presenting these findings effectively helps to uncover key insights and prepares you for the modelling stage.

Finally, the insights gained from this exploratory analysis must be clearly articulated. This involves summarizing the key findings and considering their implications for the next stage of the workflow—data modelling. Observing patterns, correlations, and potential outliers in this stage will inform your modelling approach and ensure that it is grounded in a thorough and informed analysis.

This structured approach to EDA is visually summarized in Figure 1.4, illustrating the sequential steps from variable classification to the delivery of exploratory insights.

Example: EDA for Housing Price Predictions

To illustrate the EDA process, let’s apply it to the example of predicting housing prices.

We start with variable classification:

The response variable is the sale price of a house, a continuous and unbounded variable.
The regressors include:
- Number of bedrooms: Discrete, count-based.
- Square footage: Continuous and unbounded.
- Neighborhood type: Discrete, categorical (e.g., urban, suburban, rural).
- Proximity to schools: Continuous, potentially bounded by distance.

Once the variables are classified, we move on to coding plots and calculating descriptive statistics. Here are a couple of visualizations that can be helpful in this context:

A histogram of sale prices helps visualize the distribution and spot any outliers.
A scatter plot of square footage versus sale price shows the relationship, typically revealing a positive correlation.
Box plots compare sale prices across different neighborhood types, highlighting any variations in median prices.
Descriptive statistics like the mean and standard deviation provide a numerical summary, while a correlation matrix helps assess relationships between continuous variables like square footage and sale price.

Finally, in displaying and interpreting results, these plots and statistics guide us in understanding the data:

The histogram might show most houses fall within a mid-range price.
The scatter plot could confirm that larger houses generally sell for more.
Box plots may reveal that urban homes tend to have higher prices.

These exploratory insights help identify key predictors like square footage and neighborhood type, and highlight any outliers that may require further attention during modelling.

By following these steps, the EDA process in the housing price prediction example lays a solid foundation for effective modelling, ensuring that the key variables and their relationships are well understood.

Figure 1.4: *Exploratory data analysis* stage from the data science workflow in Figure 1.1. This stage is directly followed by *data modelling* and preceded by *data collection and wrangling*.

1.2.4 Data Modelling

Figure 1.5: *Data modelling* stage from the data science workflow in Figure 1.1. This stage is directly preceded by *exploratory data analysis*. On the other hand, it is directly followed by *estimation* but indirectly with *goodness of fit*. If necessary, the *goodness of fit* stage could retake the process to *data modelling*.

1.2.5 Estimation

Figure 1.6: *Estimation* stage from the data science workflow in Figure 1.1. This stage is directly preceded by *data modelling* and followed by *goodness of fit*. If necessary, the *goodness of fit* stage could retake the process to *data modelling* and then to *estimation*.

1.2.6 Goodness of Fit

Figure 1.7: *Goodness of fit* stage from the data science workflow in Figure 1.1. This stage is directly preceded by *estimation* and followed by *results*. If necessary, the *goodness of fit* stage could retake the process to *data modelling* and then to *estimation*.

1.2.7 Results

Figure 1.8: *Results* stage from the data science workflow in Figure 1.1. This stage is directly followed by *storytelling* and preceded by *goodness of fit*.

1.2.8 Storytelling

Figure 1.9: *Storytelling* stage from the data science workflow in Figure 1.1. This stage preceded by *results*.

1.3 Mind Map of Regression Analysis

Having defined the necessary statistical aspects to execute a proper supervised learning analysis, either inferential or predictive across its seven sequential phases, we must dig into the different approaches we might encounter in practice as regression models. The nature of our outcome of interest will dictate any given modelling approach to apply, depicted as clouds in Figure 1.10. Note these regression models can be split into two sets depending on whether the outcome of interest is continuous or discrete. Therefore, under a probabilistic view, identifying the nature of a given random variable is crucial in regression analysis.

mindmap
  root((Regression 
  Analysis)
    Continuous <br/>Outcome Y
      {{Unbounded <br/>Outcome Y}}
        )Chapter 3: <br/>Ordinary <br/>Least Squares <br/>Regression(
          (Normal <br/>Outcome Y)
      {{Nonnegative <br/>Outcome Y}}
        )Chapter 4: <br/>Gamma Regression(
          (Gamma <br/>Outcome Y)
      {{Bounded <br/>Outcome Y <br/> between 0 and 1}}
        )Chapter 5: Beta <br/>Regression(
          (Beta <br/>Outcome Y)
      {{Nonnegative <br/>Survival <br/>Time Y}}
        )Chapter 6: <br/>Parametric <br/> Survival <br/>Regression(
          (Exponential <br/>Outcome Y)
          (Weibull <br/>Outcome Y)
          (Lognormal <br/>Outcome Y)
        )Chapter 7: <br/>Semiparametric <br/>Survival <br/>Regression(
          (Cox Proportional <br/>Hazards Model)
            (Hazard Function <br/>Outcome Y)
    Discrete <br/>Outcome Y
      {{Binary <br/>Outcome Y}}
        {{Ungrouped <br/>Data}}
          )Chapter 8: <br/>Binary Logistic <br/>Regression(
            (Bernoulli <br/>Outcome Y)
        {{Grouped <br/>Data}}
          )Chapter 9: <br/>Binomial Logistic <br/>Regression(
            (Binomial <br/>Outcome Y)
      {{Count <br/>Outcome Y}}
        {{Equidispersed <br/>Data}}
          )Chapter 10: <br/>Classical Poisson <br/>Regression(
            (Poisson <br/>Outcome Y)
        {{Overdispersed <br/>Data}}
          )Chapter 11: <br/>Negative Binomial <br/>Regression(
            (Negative Binomial <br/>Outcome Y)
        {{Overdispersed or <br/>Underdispersed <br/>Data}}
          )Chapter 13: <br/>Generalized <br/>Poisson <br/>Regression(
            (Generalized <br/>Poisson <br/>Outcome Y)
        {{Zero Inflated <br/>Data}}
          )Chapter 12: <br/>Zero Inflated <br/>Poisson <br/>Regression(
            (Zero Inflated <br/>Poisson <br/>Outcome Y)
      {{Categorical <br/>Outcome Y}}
        {{Nominal <br/>Outcome Y}}
          )Chapter 14: <br/>Multinomial <br/>Logistic <br/>Regression(
            (Multinomial <br/>Outcome Y)
        {{Ordinal <br/>Outcome Y}}
          )Chapter 15: <br/>Ordinal <br/>Logistic <br/>Regression(
            (Logistic <br/>Distributed <br/>Cumulative Outcome <br/>Probability)

Figure 1.10: Regression analysis mind map depicting all modelling techniques to be explored in this book. Depending on the type of outcome \(Y\), these techniques are split into two large zones: discrete and continuous.

That said, we will go beyond OLS regression and explore further regression techniques. In practice, these techniques have been developed in the statistical literature to address practical cases where the OLS modelling framework and assumptions are not suitable anymore. Thus, throughout this block, we will cover (at least) one new regression model per lecture.

As we can see in the clouds of Figure 1.10, there are 13 regression models: 8 belonging to discrete outcomes and 5 to continuous outcomes. Each of these models is contained in a chapter of this book, beginning with the most basic regression tool known as ordinary least-squares in Chapter 3. We must clarify that the current statistical literature is not restricted to these 13 regression models. The field of regression analysis is vast, and one might encounter more complex models to target certain specific inquiries. Nonetheless, I consider these models the fundamental regression approaches that any data scientist must be familiar with in everyday practice.

Even though this book comprises 13 chapters, each depicting a different regression model, we have split these chapters into two major subsets: those with continuous outcomes and those with discrete outcomes.

2 Supervised Learning and Regression Analysis