1  Getting Ready for Regression Cooking!

First things first! Let us prepare for all the different regression techniques to be introduced starting in Chapter 3.

Learning Objectives

By the end of this chapter, you will be able to:

  • Define the three core pillars to be applied in regression modelling throughout this book: a data science workflow, the right workflow flavour, and the most appropriate model.
  • Outline how the ML-Stats dictionary works to bridge the terminology used in machine learning (ML) and statistics.
  • Explain how the data science workflow can be applied in regression analysis.
  • Describe how the mind map of regression analysis acts as the primary chapter structure of this book and as a toolbox.
  • Compare and contrast supervised learning and regression analysis.

That said, we want to highlight one guiding principle for all of our work:

Different modelling estimation techniques in regression analysis become easier to understand once we develop a strong probabilistic and inferential grasp of populations or systems of interest.

Image by Lucas Israel via Pixabay.

The above guiding principle rests on foundational statistical ideas on how data is generated and how it can be modelled through various regression methods. We will explore these underlying concepts in Chapter 2. Before doing so, however, this chapter will build on the following three core pillars:

  1. Implementing a structured data science workflow as outlined in Section 1.2.
  2. Selecting the appropriate workflow approach based on an inferential or predictive paradigm, as shown in Figure 1.1.
  3. Choosing the appropriate regression model based on the response variable or outcome of interest, using the mind map in Section 1.3 (analogous to a regression toolbox).

The rationale behind the three pillars

Each data science problem involving regression analysis has unique characteristics, depending on whether the inquiry is inferential or predictive. Different types of outcomes (or response variables) require distinct modelling approaches. For example, we might analyze survival times (e.g., the time until a particular piece of equipment of a given brand fails), categorical outcomes (e.g., the preferred musical genre among young Canadians), or count-based outcomes (e.g., how many customers we would expect on a regular Monday morning in the branches of a major national bank). Moreover, in this regression context, our analysis extends beyond the outcome itself: we also examine how it relates to other explanatory variables (the so-called features). For instance, if we are studying musical genre preferences among young Canadians, we could explore how age groups influence these preferences or compare genre popularity across provinces.

At first glance, it might seem that every regression problem should have a unique workflow tailored to its specific model. However, this is not entirely the case. In Figure 1.1, we introduce a structured regression workflow designed as a proof of concept for thirteen different regression models. Each model is covered in a separate chapter of this book alongside a review of probability and statistics (i.e., thirteen chapters in this book aside from the probability and statistics review). This workflow standardizes our approach, making analysis more transparent and efficient while allowing us to communicate insights effectively through data storytelling. Naturally, this workflow includes decision points that determine whether the approach follows an inferential or predictive path (the second pillar). As for our third pillar, it comes into play at the data modelling stage, where the regression toolbox in Figure 1.15 guides model selection based on the response variable type.

Let us establish a convention for using admonitions throughout this textbook. These admonitions will help distinguish between key concepts, important insights, and supplementary material, ensuring clarity as we explore different regression techniques. We will start using these admonitions in Section 1.1.

Definition

A formal statistical and/or machine learning definition. This admonition aims to untangle the significant amount of jargon and concepts that both fields have. When applicable, alternative terminology will be included to highlight equivalent terms across statistics and machine learning.

Heads-up!

An idea (or ideas) related to a modelling approach, a specific workflow stage, or an important data science concept. This admonition is used to flag crucial statistical or machine learning topics that warrant deeper exploration.

Tip

An idea (or ideas) that may extend beyond the immediate discussion but provides additional context or helpful background. When applicable, references to further reading will be provided.

The core idea of the above admonition arrangement is to help the reader distinguish ideas and concepts that are essential to grasp from those that are not strictly necessary but are still worth exploring in the further literature. With this structure in place, we can now introduce another foundational resource: a common ground between machine learning and statistics, which will be elaborated on in the next section.

1.1 The ML-Stats Dictionary

Machine learning and statistics often overlap, especially in regression modelling. Topics covered in a regression-focused course, under a purely statistical framework, can also appear in machine learning-based courses on supervised learning, but the terminology can differ. Recognizing this overlap, the Master of Data Science (MDS) program at the University of British Columbia (UBC) provides the MDS Stat-ML dictionary (Gelbart 2017) under the following premises:

This document is intended to help students navigate the large amount of jargon, terminology, and acronyms encountered in the MDS program and beyond.

This section covers terms that have different meanings in different contexts, specifically statistics vs. machine learning (ML).

Both disciplines have a tremendous amount of jargon and terminology. As mentioned in the Preface, machine learning and statistics form a substantial synergy that is reflected in data science. Despite this overlap, misunderstandings can still arise from differences in terminology. To prevent them, we need clear bridges between the two disciplines via an ML-Stats dictionary (ML stands for machine learning).

Heads-up on how the ML-Stats Dictionary is built and structured!

The complete ML-Stats dictionary can be found in Appendix A. This resource builds upon the concepts introduced in the definition callout box throughout the fifteen main chapters of this textbook. The dictionary aims to clarify terminology that varies between statistics and machine learning, specifically in the context of supervised learning and regression analysis.

Image by manfredsteger via Pixabay.

Terms in this dictionary related to statistics will be highlighted in blue, while terms related to machine learning will be highlighted in magenta. This color scheme is designed to help readers easily navigate between the two disciplines. With practice, you will become proficient in applying concepts from both fields.

Appendix A is the part of this book where the reader can find all of these statistical and machine learning-related terms in alphabetical order. Notable terms (either statistical or machine learning-related) will include an admonition identifying which terms from the other discipline are equivalent, somewhat equivalent, or not equivalent at all. For instance, consider the statistical term dependent variable:

In supervised learning, it is the main variable of interest we are trying to learn or predict, or equivalently, in a statistical inference framework, the variable we are trying to explain.

Then, the above definition will be followed by this admonition:

Equivalent to:

Response variable, outcome, output or target.

Note that we have identified four equivalent terms for the term dependent variable. Furthermore, these terms can be statistical or machine learning-related.

Heads-up on the use of terminology!

Throughout this book, we will use specific concepts interchangeably while explaining different regression methods. If confusion arises, you can always check the definitions and equivalences (or non-equivalences) in Appendix A.

Next, we will introduce the three main foundations of this textbook: a data science workflow, choosing the correct workflow flavour (inferential or predictive), and building your regression toolbox.

1.2 The Data Science Workflow

Image by manfredsteger via Pixabay.

Understanding the data science workflow is essential for mastering regression analysis. This workflow serves as a blueprint that guides us through each stage of our analysis, ensuring that we apply a systematic approach to solving our inquiries in a reproducible way. The three pillars of this textbook—the data science workflow, the right workflow flavour (inferential or predictive), and the regression toolbox—are deeply interconnected. Regardless of the regression model we explore, this general workflow provides a consistent framework that helps us navigate our data analysis with clarity and purpose.

As shown in Figure 1.1, the data science workflow is composed of the following eight stages (each of which will be discussed in more detail in subsequent subsections):

  1. Study design: Define the research question, objectives, and variables of interest to ensure the analysis is purpose-driven and aligned with the problem at hand.
  2. Data collection and wrangling: Gather and clean data, addressing issues such as missing values, outliers, and inconsistencies to transform it into a usable format.
  3. Exploratory data analysis (EDA): Explore the data through statistical summaries and visualizations to identify patterns, trends, and potential anomalies.
  4. Data modelling: Apply statistical or machine learning models to uncover relationships between variables or make predictions based on the data.
  5. Estimation: Calculate model parameters to quantify relationships between variables and assess the accuracy and reliability of the model.
  6. Goodness of fit: Evaluate the model’s performance using metrics and model diagnostic checks to determine how well it explains the data.
  7. Results: Interpret the model’s outputs to derive meaningful insights and provide answers to the original research question.
  8. Storytelling: Communicate the findings through a clear, engaging narrative that is accessible to a non-technical audience.

By adhering to this workflow, we ensure that our regression analyses are not only systematic and thorough but also capable of producing results that are meaningful within the context of the problem we aim to solve.

Heads-up on the importance of a formal structure in regression analysis!

From the earliest stages of learning data analysis, understanding the importance of a structured workflow is crucial. If we do not adhere to a predefined workflow, we risk misinterpreting the data, leading to incorrect conclusions that fail to address the core questions of our analysis. Such missteps can result in outcomes that are not only meaningless but potentially misleading when taken out of the problem’s context.

Therefore, it is essential for aspiring data scientists to internalize this workflow from the very beginning of their education. A systematic approach ensures that each stage of the analysis is conducted with precision, ultimately producing reliable and contextually relevant results.

Figure 1.1: Data science workflow for inferential and predictive inquiries in regression analysis and supervised learning, respectively. The workflow is structured in eight stages: study design, data collection and wrangling, exploratory data analysis, data modelling, estimation, goodness of fit, results, and storytelling.

1.2.1 Study Design

The first stage of this workflow is centered around defining the main statistical inquiries we aim to address throughout the data analysis process. As a data scientist, your primary task is to translate these inquiries from the stakeholders into one of two categories: inferential or predictive. This classification determines the direction of your analysis and the methods you will use.

  • Inferential: The objective here is to explore and quantify relationships of association or causation between explanatory variables (referred to as regressors in the models discussed in this textbook) and the response variable within the context of the specific problem at hand. For example, you might seek to determine whether a particular marketing campaign (the regressor) has a statistically significant influence on sales revenue (the response) and, if it does, by how much.
  • Predictive: In this case, the focus is on making accurate predictions about the response variable based on future observations of the regressors. Unlike inferential inquiries, where understanding the relationship between variables is key, the primary goal here is to maximize prediction accuracy. This approach is fundamental in machine learning. For instance, you might build a model to predict future sales revenue based on past marketing expenditures, without necessarily needing to understand the underlying relationship between the two.

Image by Manfred Steger via Pixabay.

Heads-up on the inquiry focus of this book!

In the regression chapters of this book, we will emphasize both types of inquiries. As we follow the workflow from Figure 1.1, we will explore the two pathways identified by the decision points concerning inference and prediction.

Example: Housing Prices

To illustrate the study design stage, let us consider a simple example involving housing prices in a specific city:

  • If our goal is inferential, we might be interested in understanding the relationship between various factors—such as square footage, number of bedrooms, and proximity to schools—and housing prices. Specifically, we would ask questions like:

How does the number of bedrooms affect the price of a house, once we account for other factors?

  • If our goal is predictive, we would focus on creating a model that can accurately predict the price of a house based on its features (i.e., the characteristics of a given house), regardless of whether we fully understand how each feature contributes to the price. Hence, we would be able to answer questions such as:

What would be the predicted price of a house with 3,500 square feet and 3 bedrooms located on a block where the closest school is 2.5 km away?

Image by Tomislav Kaučić via Pixabay.

In both cases, the study design stage involves clearly defining these objectives and determining the appropriate data modelling methods to address them. This stage sets the foundation for all subsequent steps in the data science workflow. After establishing the study design, the next step is data collection and wrangling, as shown in Figure 1.2.

Figure 1.2: Study design stage from the data science workflow in Figure 1.1. This stage is directly followed by data collection and wrangling.

1.2.2 Data Collection and Wrangling

Image by Manfred Steger via Pixabay.

Once we have clearly defined our statistical questions, the next crucial step is to collect the data that will form the basis of our analysis. The way we collect this data is vital because it directly affects the accuracy and reliability of our results:

  • For inferential inquiries, we focus on understanding populations or systems that we cannot fully observe. These populations are governed by characteristics (referred to as parameters) that we want to estimate. Because we cannot study every individual in the population or system, we collect a smaller, representative subset called a sample. The method we use to collect this sample—known as sampling—is crucial. A proper sampling method ensures that our sample reflects the larger population or system, allowing us to make accurate and precise generalizations (i.e., inferences) about the entire population or system. After collecting the sample, it is common practice to randomly split the data into training and test sets. This split allows us to build and assess our models, ensuring that the findings are robust and not overly tailored to the specific data at hand.
  • For predictive inquiries, our goal is often to use existing data to make predictions about future events or outcomes. In these cases, we usually work with large datasets (databases) that have already been collected. Instead of focusing on whether the data represents a population (as in inferential inquiries), we focus on cleaning and preparing the data so that it can be used to train models that make accurate predictions. After wrangling the data, it is typically split into training, validation (if necessary, depending on our chosen modelling strategy), and test sets. The training set is used to build the model, the validation set is used to tune model parameters, and the test set evaluates the model’s final performance on unseen data.
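
To make these splitting schemes concrete, here is a minimal sketch in Python using scikit-learn's `train_test_split()`. The synthetic data frame, column names, proportions, and random seed are all illustrative assumptions rather than recommendations from this book.

```python
# Minimal data-splitting sketch; the synthetic data, proportions, and seed are
# illustrative assumptions only.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(123)
sample_data = pd.DataFrame({
    "sqft": rng.uniform(500, 4000, size=1000),
    "price": rng.uniform(1e5, 1e6, size=1000),
})

# Inferential inquiry: a single train/test split is often sufficient.
train_inf, test_inf = train_test_split(sample_data, test_size=0.3, random_state=123)

# Predictive inquiry: carve a validation set out of a temporary holdout when
# the chosen modelling strategy requires hyperparameter tuning.
train_pred, holdout = train_test_split(sample_data, test_size=0.4, random_state=123)
valid_pred, test_pred = train_test_split(holdout, test_size=0.5, random_state=123)

print(len(train_inf), len(test_inf))                     # 700 300
print(len(train_pred), len(valid_pred), len(test_pred))  # 600 200 200
```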

Tip on sampling techniques!

Careful attention to sampling design is a crucial step in any research aimed at supporting valid regression-based inference. The selection of an appropriate sampling design should be guided by the structural characteristics of the population as well as the specific goals of the analysis. A well-designed sampling strategy enhances the accuracy, precision, and generalizability of parameter estimates derived from regression models, particularly when the intention is to extend model-based conclusions beyond the observed data to the whole population or system.

Below, we summarize some commonly used probability-based sampling designs, each of which has distinct implications for model validity and estimation efficiency:

  • Simple random sampling: Every unit in the population has an equal probability of selection. While this method is straightforward to implement and analyze, it may be inefficient or impractical for populations with heterogeneous subgroups.
  • Systematic sampling: Sampling occurs at fixed intervals from an ordered list, starting from a randomly chosen point. This design can improve efficiency under certain ordering schemes, but caution is necessary to avoid biases related to periodicity.
  • Stratified sampling: The population is divided into mutually exclusive strata based on key characteristics (e.g., age, income, region, etc.). Samples are drawn within each stratum, often in proportion to the strata sizes or based on optimal allocation. This approach increases precision for subgroup estimates and enhances overall model efficiency.
  • Cluster sampling: The population is divided into naturally occurring clusters (e.g., households, schools, geographic units, etc.), and entire clusters are sampled randomly. This design is often preferred for cost efficiency, but it typically requires adjustments for intracluster correlation during analysis.

In the context of our regression-based inferential framework, it is necessary to carefully plan data collection and preparation around the sampling strategy. The choice of sampling design can influence not only model estimation but also the interpretation and generalizability of the results. While this textbook does not provide an exhaustive treatment of sampling theory, we recommend Lohr (2021) for an in-depth reference. Their work offers both theoretical insights and applied examples that are highly relevant for data scientists engaged in model-based inference.
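
As a small, concrete illustration of one of these designs, the sketch below draws a proportional stratified sample from a synthetic population using pandas. The population, strata, and 5% sampling fraction are hypothetical choices for demonstration only.

```python
# Proportional stratified sampling sketch; the synthetic population, strata,
# and sampling fraction are hypothetical choices for demonstration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
population = pd.DataFrame({
    "neighbourhood": rng.choice(
        ["rural", "suburban", "urban"], size=10_000, p=[0.2, 0.5, 0.3]
    ),
    "price": rng.normal(loc=300_000, scale=75_000, size=10_000),
})

# Sample 5% within each stratum so the sample mirrors the population's
# neighbourhood composition (proportional allocation).
stratified_sample = population.groupby("neighbourhood").sample(
    frac=0.05, random_state=42
)

# The sample's stratum proportions should be close to the population's.
print(stratified_sample["neighbourhood"].value_counts(normalize=True).round(2))
```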

Example: Collecting Data for Housing Inference and Predictions

Let us continue with our housing example to illustrate the above concepts:

  • Inferential Approach: Suppose we want to understand how the number of bedrooms affects housing prices in a city. To do this, we would collect a sample of house sales that accurately represents the city’s entire housing market. For instance, we might use stratified sampling to ensure that we include houses from different neighbourhoods in proportion to how common they are. After collecting the data, we would split it into training and test sets. The training set helps us build our model and estimate the relationship between variables, while the test set allows us to evaluate how well our findings generalize to new data.
  • Predictive Approach: If our goal is to predict the selling price of a house based on its features (such as size, number of bedrooms, and location), we would gather a large dataset of recent house sales. This data might come from a real estate database that tracks the details of each sale. Before we can use this data to train a model, we would clean it by filling in any missing information, converting data to a consistent format, and making sure all variables are ready for analysis. After preprocessing, we would split the data into training, validation, and test sets. The training set would be used to fit the model, the validation set to fine-tune it, and the test set to assess how well the model can predict prices for houses it has not seen before.

Image by Stefan via Pixabay.

As shown in Figure 1.3, the data collection and wrangling stage is fundamental to the workflow. It directly follows the study design and sets the stage for exploratory data analysis.

Figure 1.3: Data collection and wrangling stage from the data science workflow in Figure 1.1. This stage is directly followed by exploratory data analysis and preceded by study design.

1.2.3 Exploratory Data Analysis

Before diving into data modelling, it is crucial to develop a deep understanding of the relationships between the variables in our training data. This is where the third stage of the data science workflow comes into play: exploratory data analysis (EDA). EDA serves as a vital process that allows us to visualize and summarize our data, uncover patterns, detect anomalies, and test key assumptions that will inform our modelling decisions.

Image by Manfred Steger via Pixabay.

The first step in EDA is to classify our variables according to their types. This classification is essential because it guides our choice of analysis techniques and models. Specifically, we need to determine whether each variable is discrete or continuous, and whether it has any specific characteristics such as being bounded or unbounded.

  • Response (i.e., the \(Y\)):
    • Determine if the response variable is discrete (e.g., binary, count-based, categorical) or continuous.
    • If it is continuous, let us consider whether it is bounded (e.g., percentages that range between \(0\) and \(100\)) or unbounded (e.g., a variable like company profits/losses that can take on a wide range of values).
  • Regressors (i.e., the \(X\)s):
    • For each regressor, we must identify whether it is discrete or continuous.
    • If a regressor is discrete, let us classify it further as binary, count-based, or categorical.
    • If a regressor is continuous, let us determine whether it is bounded or unbounded.

This classification scheme helps us select the appropriate visualization and statistical methods for our analysis, as different variable types often need different approaches. It ensures that we are well-equipped to make the right choices in our analyses.
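
As a first pass, this classification can be partially automated from the column types of the training data, as in the short sketch below; the data frame and its columns are hypothetical placeholders, and deciding whether a continuous variable is bounded still requires domain knowledge.

```python
# First-pass variable classification sketch; the data frame and its columns are
# hypothetical, and deciding on boundedness still requires domain knowledge.
import pandas as pd

train = pd.DataFrame({
    "sale_price": [350_000.0, 420_500.0, 289_900.0],
    "n_bedrooms": [3, 4, 2],
    "neighbourhood": pd.Categorical(["urban", "suburban", "rural"]),
})

def first_pass_type(series: pd.Series) -> str:
    """Rough discrete/continuous label based only on the pandas dtype."""
    if isinstance(series.dtype, pd.CategoricalDtype) or series.dtype == object:
        return "discrete (categorical)"
    if pd.api.types.is_integer_dtype(series):
        return "discrete (binary or count)"
    return "continuous (then check whether it is bounded)"

for column in train.columns:
    print(f"{column}: {first_pass_type(train[column])}")
```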

After classifying your variables, the next step is to create visualizations and calculate descriptive statistics using our training data. This involves coding plots that can reveal the underlying distribution of each variable and the relationships between them. For instance, we might create histograms to visualize distributions, scatter plots to explore relationships between continuous variables, and box plots to compare a continuous variable across the levels of discrete or categorical variables.

Alongside these visualizations, it is important to calculate key descriptive statistics such as the mean, median, and standard deviation if our variables are numeric. These statistics provide a summary of our data, offering insights into central tendency and variability. We might also use a correlation matrix to assess the strength of relationships between continuous variables.

Image by Manfred Steger via Pixabay.

Once we have generated these plots and statistics, they should be displayed in a clear and logical manner. The goal here is to interpret the data and draw preliminary conclusions about the relationships between variables. Presenting these findings effectively helps to uncover key insights and prepares you for the subsequent modelling stage. Finally, the insights gained from our EDA must be clearly articulated. This involves summarizing the key findings and considering their implications for the next stage of the workflow—data modelling. Observing patterns, correlations, and potential outliers in this stage will inform your modelling approach and ensure that it is grounded in a thorough and informed analysis.

Heads-up on the use of EDA to deliver inferential conclusions!

While EDA plays a critical role in uncovering patterns, detecting anomalies, and generating hypotheses, it is important to emphasize that its results should not be generalized beyond the specific sample data being analyzed. EDA is inherently descriptive and focused on the sample, and it is not intended to support inferential claims about larger populations. The insights gained from EDA are contingent on the specific sample and may not accurately reflect systematic relationships within the broader population. Nevertheless, EDA can provide valuable information to inform our modelling decisions.

Image by Manfred Steger via Pixabay.

Generalizing findings to a larger population requires formal statistical inference, which takes into account sampling variability, model uncertainty, and the precision of estimates. This is particularly important in regression analysis, where extending patterns observed in a sample to the wider population needs rigorous modelling assumptions, estimation procedures, and a quantification of uncertainty (e.g., through confidence intervals). Treating EDA findings as if they were inferential conclusions can lead to misleading interpretations throughout our data science workflow.

Example: EDA for Housing Data

To illustrate the EDA process, we will follow it within the context of the housing example used in the previous two workflow stages, utilizing simulated data. Suppose we have a sample of \(n = 2,000\) houses drawn from various Canadian cities through cluster sampling. As shown in Table 1.1, our earlier inferential and predictive inquiries focus on sale price in CAD as our response variable in a regression context. Note that this numeric response cannot be negative, which classifies it as positively unbounded. Additionally, Table 1.1 provides the relevant details for the regressors in this case: the number of bedrooms, square footage, neighbourhood type, and proximity to schools.

Table 1.1: Classification table for variables in housing data.

| Variable | Type | Scale | Model Role |
|----------|------|-------|------------|
| Sale Price (CAD) | Continuous | Positively unbounded | Response |
| Number of Bedrooms | Discrete | Count | Regressor |
| Square Footage | Continuous | Positively unbounded | Regressor |
| Neighbourhood Type (Rural, Suburban or Urban) | Discrete | Categorical | Regressor |
| Proximity to Schools (km) | Continuous | Positively unbounded | Regressor |

We will randomly split the sampled data into training and testing sets for both inferential and predictive inquiries. Specifically, 20% of the data will be allocated to the training set, while the remaining 80% will serve as the testing set. For the predictive analysis, we will not create a validation set because our chosen modelling strategy (to be discussed in Section 1.2.4) does not require it. In addition, Table 1.2 displays the first 100 rows of our training data, which is a subset of size \(400\).

Table 1.2: First 100 rows of training housing data.

Heads-up on data splitting for inferential inquiries!

In machine learning, data splitting is a foundational practice designed to prevent data leakage in predictive inquiries. However, you may wonder:

Why should we also split the data for inferential inquiries?

In the context of statistical inference, especially when making claims about population parameters, data splitting plays a different but important role: it helps prevent double dipping. Double dipping refers to the misuse of the same data for both exploring hypotheses (as in EDA) and formally testing those hypotheses. This practice undermines the validity of inferential claims by increasing the probability of Type I errors—incorrectly rejecting the null hypothesis \(H_0\) when it is actually true for the population under study.

Image by Manfred Steger via Pixabay.

To illustrate this, consider conducting a one-sample \(t\)-test in a double-dipping scenario. Suppose we first observe a sample mean of \(\bar{x} = 9.5\), and then decide to test the null hypothesis

\[\text{$H_0$: } \mu \geq 10\]

against the alternative hypothesis

\[\text{$H_1$: } \mu < 10,\]

after performing EDA on the same data. If we proceed with the formal \(t\)-test using that same data, we are essentially tailoring the hypothesis to fit our sample. Empirical simulations can show that such practices lead to inflated false positive rates, which threaten the reproducibility and integrity of statistical inference.

Unlike predictive modelling, data splitting is not a routine practice in statistical inference. However, it becomes relevant when the line between exploration and formal testing is blurred. For more information on double dipping in statistical inference, Chapter 6 of Reinhart (2015) offers in-depth insights and some real-life examples.
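
The sketch below is one such empirical simulation under assumed settings (true mean equal to the null value of 10, with arbitrary sample size, variance, and seed): the direction of the one-sided test is chosen only after peeking at the sample mean, and the resulting false positive rate roughly doubles the nominal 5% level.

```python
# Double-dipping simulation sketch; the true mean equals the null value, so
# every rejection is a false positive. Sample size, variance, and seed are
# arbitrary illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)
mu_0, sigma, n, alpha, n_sims = 10, 2, 30, 0.05, 10_000

false_positives = 0
for _ in range(n_sims):
    sample = rng.normal(loc=mu_0, scale=sigma, size=n)
    # Double dipping: pick the one-sided alternative that points in the same
    # direction as the observed deviation of the sample mean from mu_0.
    alternative = "less" if sample.mean() < mu_0 else "greater"
    result = stats.ttest_1samp(sample, popmean=mu_0, alternative=alternative)
    false_positives += result.pvalue < alpha

# Roughly 0.10 instead of the nominal 0.05, because the data chose the hypothesis.
print(f"Empirical Type I error rate: {false_positives / n_sims:.3f}")
```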

After classifying the variables and splitting our data, we will move on to coding the plots and calculating the summary statistics. We provide a list of these plots and summary statistics, along with their corresponding EDA outputs and interpretations, based on our training data, which contains \(400\) observations:

  • A histogram of sale prices, as in Figure 1.4, shows the response’s distribution and helps identify any outliers. The training set reveals a fairly symmetric distribution of sale prices, with a noticeable concentration of sales between \(\$200,000\) and \(\$400,000\). However, there are a few outliers. Even with just 20% of the total data, this plot provides valuable graphical insights into central tendency and variability.
Figure 1.4: Histogram of sale prices via training set.
  • Side-by-side jitter plots, as in Figure 1.5, visualize the distribution of sale prices across different bedroom counts, highlighting spread. Overall, these plots indicate a positive association between the number of bedrooms and home sale price. Note that the average price (represented by red diamonds) tends to increase with the addition of more bedrooms. The training set predominantly has homes with 3 to 5 bedrooms, and there are some high-priced outliers present even among mid-sized homes.
Figure 1.5: Side-by-side jitter plots of sale prices by number of bedrooms via training set (red diamonds indicate sale price means by number of bedrooms).
  • A scatter plot displaying the relationship between square footage and sale price, as in Figure 1.6, illustrates how these two continuous variables interact. There is a clear upward trend in the training data, indicated by the fitted solid red line of the simple linear regression. Although the variability increases with larger square footage, the overall positive linear pattern is still clear.
Figure 1.6: Scatter plot of square footage versus sale prices via training set (solid red line indicates a simple linear regression fitting).
  • Side-by-side box plots, as in Figure 1.7, are used to compare sale prices across different types of neighbourhoods, highlighting variations in median prices. The training data reveals neighbourhood-specific price patterns: urban homes tend to have higher prices, while rural homes are generally less expensive. However, from a graphical perspective, we do not observe significant differences in price spreads between these types of neighbourhoods.
Figure 1.7: Side-by-side box plots of sale prices by neighbourhood type via training set.
  • The scatter plot showing the relationship between proximity to schools and sale price, as in Figure 1.8, reveals an almost flat trend in the training data. This observation is supported by the fitted solid red line of the simple linear regression, indicating a weak graphical relationship between these two variables.
Figure 1.8: Scatter plot of proximity to schools versus sale prices via training set.
  • Descriptive statistics from Table 1.3, such as the mean and standard deviation, summarize continuous variables. In addition, a Pearson correlation matrix from Table 1.4 numerically assesses the relationships between these variables. Note that square footage is positively correlated with sale price, while proximity to schools has a negative association.
Table 1.3: Descriptive statistics of housing data via training set.
Table 1.4: Pearson correlation matrix of housing data, via training set, for numeric variables.

In displaying and interpreting results, the plots and statistics will guide us in understanding the data. In this specific example, these exploratory insights help identify key factors, such as square footage and neighbourhood type, that influence housing prices. They also highlight any outliers that may need further attention during modelling. By following this EDA process, we will establish a solid descriptive foundation for effective data modelling, ensuring that the key variables and their relationships are well understood.
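
For readers who want to reproduce this style of EDA, the sketch below applies the same kinds of summaries to a small simulated stand-in for the training set. The variable names echo Table 1.1, but the simulated numbers are placeholders and will not match the figures and tables above.

```python
# EDA sketch on a simulated stand-in for the housing training set; the numbers
# are placeholders and do not reproduce the book's figures or tables.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 400
training_set = pd.DataFrame({
    "sqft": rng.uniform(800, 4000, size=n),
    "bedrooms": rng.integers(1, 6, size=n),
    "school_km": rng.uniform(0.1, 10, size=n),
})
training_set["sale_price"] = (
    100_000
    + 90 * training_set["sqft"]
    + 15_000 * training_set["bedrooms"]
    + rng.normal(0, 40_000, size=n)
)

# Descriptive statistics and Pearson correlations for the numeric variables.
print(training_set.describe().round(1))
print(training_set.corr().round(2))

# Histogram of the response and a scatter plot with a simple linear fit.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(training_set["sale_price"], bins=30)
ax1.set(xlabel="Sale price (CAD)", ylabel="Count")
slope, intercept = np.polyfit(training_set["sqft"], training_set["sale_price"], deg=1)
ax2.scatter(training_set["sqft"], training_set["sale_price"], alpha=0.4)
ax2.plot(training_set["sqft"], intercept + slope * training_set["sqft"], color="red")
ax2.set(xlabel="Square footage", ylabel="Sale price (CAD)")
plt.tight_layout()
plt.show()
```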

Finally, this structured approach to EDA is visually summarized in Figure 1.9, which shows the sequential steps from variable classification to the delivery of exploratory insights.

Figure 1.9: Exploratory data analysis stage from the data science workflow in Figure 1.1. This stage is directly followed by data modelling and preceded by data collection and wrangling.

1.2.4 Data Modelling

Figure 1.10: Data modelling stage from the data science workflow in Figure 1.1. This stage is directly preceded by exploratory data analysis. It is directly followed by estimation and, indirectly, by goodness of fit. If necessary, the goodness of fit stage can send the process back to data modelling.

1.2.5 Estimation

Figure 1.11: Estimation stage from the data science workflow in Figure 1.1. This stage is directly preceded by data modelling and followed by goodness of fit. If necessary, the goodness of fit stage can send the process back to data modelling and then to estimation.

1.2.6 Goodness of Fit

Figure 1.12: Goodness of fit stage from the data science workflow in Figure 1.1. This stage is directly preceded by estimation and followed by results. If necessary, this stage can send the process back to data modelling and then to estimation.

1.2.7 Results

Figure 1.13: Results stage from the data science workflow in Figure 1.1. This stage is directly followed by storytelling and preceded by goodness of fit.

1.2.8 Storytelling

Figure 1.14: Storytelling stage from the data science workflow in Figure 1.1. This stage is preceded by results.

1.3 Mind Map of Regression Analysis


Image by Manfred Steger via Pixabay.

Having defined the necessary statistical aspects to execute a proper supervised learning analysis, either inferential or predictive, across the eight sequential stages of our workflow, we must dig into the different approaches we might encounter in practice as regression models. The nature of our outcome of interest dictates which modelling approach to apply; each approach is depicted as a cloud in Figure 1.15. Note that these regression models can be split into two sets depending on whether the outcome of interest is continuous or discrete. Therefore, under a probabilistic view, identifying the nature of a given random variable is crucial in regression analysis.

mindmap
  root((Regression <br/>Analysis))
    Continuous <br/>Outcome Y
      {{Unbounded <br/>Outcome Y}}
        )Chapter 3: <br/>Ordinary <br/>Least Squares <br/>Regression(
          (Normal <br/>Outcome Y)
      {{Nonnegative <br/>Outcome Y}}
        )Chapter 4: <br/>Gamma Regression(
          (Gamma <br/>Outcome Y)
      {{Bounded <br/>Outcome Y <br/> between 0 and 1}}
        )Chapter 5: Beta <br/>Regression(
          (Beta <br/>Outcome Y)
      {{Nonnegative <br/>Survival <br/>Time Y}}
        )Chapter 6: <br/>Parametric <br/> Survival <br/>Regression(
          (Exponential <br/>Outcome Y)
          (Weibull <br/>Outcome Y)
          (Lognormal <br/>Outcome Y)
        )Chapter 7: <br/>Semiparametric <br/>Survival <br/>Regression(
          (Cox Proportional <br/>Hazards Model)
            (Hazard Function <br/>Outcome Y)
    Discrete <br/>Outcome Y
      {{Binary <br/>Outcome Y}}
        {{Ungrouped <br/>Data}}
          )Chapter 8: <br/>Binary Logistic <br/>Regression(
            (Bernoulli <br/>Outcome Y)
        {{Grouped <br/>Data}}
          )Chapter 9: <br/>Binomial Logistic <br/>Regression(
            (Binomial <br/>Outcome Y)
      {{Count <br/>Outcome Y}}
        {{Equidispersed <br/>Data}}
          )Chapter 10: <br/>Classical Poisson <br/>Regression(
            (Poisson <br/>Outcome Y)
        {{Overdispersed <br/>Data}}
          )Chapter 11: <br/>Negative Binomial <br/>Regression(
            (Negative Binomial <br/>Outcome Y)
        {{Overdispersed or <br/>Underdispersed <br/>Data}}
          )Chapter 13: <br/>Generalized <br/>Poisson <br/>Regression(
            (Generalized <br/>Poisson <br/>Outcome Y)
        {{Zero Inflated <br/>Data}}
          )Chapter 12: <br/>Zero Inflated <br/>Poisson <br/>Regression(
            (Zero Inflated <br/>Poisson <br/>Outcome Y)
      {{Categorical <br/>Outcome Y}}
        {{Nominal <br/>Outcome Y}}
          )Chapter 14: <br/>Multinomial <br/>Logistic <br/>Regression(
            (Multinomial <br/>Outcome Y)
        {{Ordinal <br/>Outcome Y}}
          )Chapter 15: <br/>Ordinal <br/>Logistic <br/>Regression(
            (Logistic <br/>Distributed <br/>Cumulative Outcome <br/>Probability)

Figure 1.15: Regression analysis mind map depicting all modelling techniques to be explored in this book. Depending on the type of outcome \(Y\), these techniques are split into two large zones: discrete and continuous.

That said, we will go beyond ordinary least-squares (OLS) regression and explore further regression techniques. In practice, these techniques have been developed in the statistical literature to address cases where the OLS modelling framework and its assumptions are no longer suitable. Thus, throughout this book, we will cover (at least) one new regression model per chapter.

As we can see in the clouds of Figure 1.15, there are 13 regression models: 8 belonging to discrete outcomes and 5 to continuous outcomes. Each of these models is contained in a chapter of this book, beginning with the most basic regression tool, ordinary least squares, in Chapter 3. We must clarify that the current statistical literature is not restricted to these 13 regression models. The field of regression analysis is vast, and one might encounter more complex models that target certain specific inquiries. Nonetheless, we consider these models the fundamental regression approaches that any data scientist must be familiar with in everyday practice.

Even though this book comprises 13 regression chapters, each depicting a different model, we have split these chapters into two major subsets: those with continuous outcomes and those with discrete outcomes.
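
One informal way to put this toolbox to work is as a lookup from the type of outcome to a candidate model, as in the simplified sketch below. The mapping condenses Figure 1.15 and deliberately omits its finer decision points (e.g., grouped versus ungrouped binary data, or equidispersed versus overdispersed counts).

```python
# Simplified lookup sketch of the regression toolbox in Figure 1.15; the real
# choice also hinges on finer decision points such as dispersion and grouping.
REGRESSION_TOOLBOX = {
    "continuous, unbounded": "Ordinary least squares regression (Chapter 3)",
    "continuous, nonnegative": "Gamma regression (Chapter 4)",
    "continuous, between 0 and 1": "Beta regression (Chapter 5)",
    "nonnegative survival time": "Parametric or semiparametric survival regression (Chapters 6 and 7)",
    "discrete, binary": "Binary or binomial logistic regression (Chapters 8 and 9)",
    "discrete, count": "Poisson-family regression (Chapters 10 to 13)",
    "discrete, nominal categorical": "Multinomial logistic regression (Chapter 14)",
    "discrete, ordinal categorical": "Ordinal logistic regression (Chapter 15)",
}

print(REGRESSION_TOOLBOX["discrete, count"])
```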

2 Supervised Learning and Regression Analysis