1 Getting Ready for Regression Cooking!
It is time to prepare for all the different regression techniques we will cover beginning in Chapter 3. That said, there is a strong message we want to convey throughout this work:
Different modelling estimation techniques in regression analysis can be smoothly grasped when we develop a fair probabilistic and inferential intuition on our populations or systems of interest.
The above statement rests on a fundamental statistical foundation concerning how data is generated and how it can be modelled via different regression approaches. More details on the concepts and ideas associated with this foundation will be delivered in Chapter 2. Hence, before reviewing those statistical concepts and ideas, this chapter will elaborate on the three big pillars we previously pointed out in Audience and Scope:
- The use of an ordered data science workflow in Section 1.2,
- choosing the proper workflow flavour according to either an inferential or predictive paradigm as shown in Figure 1.1, and
- the correct use of an appropriate regression model based on the response variable or outcome of interest as shown in the mind map from Section 1.3 (analogous to a regression toolbox).
The Rationale Behind the Three Pillars
Each data science-related problem that uses regression analysis might have distinctive characteristics depending on whether its inquiries are inferential (statistics!) or predictive (machine learning!). Specific problems would involve outcomes (or response variables) related to survival times (e.g., the time until a particular piece of equipment of a given brand fails), categories (e.g., a preferred musical genre in the Canadian young population), counts (e.g., how many customers we would expect on a regular Monday morning in the branches of a major national bank), etc. Moreover, under this regression context, our analyses would be expanded to explore and assess how our outcome of interest is related to a further set of variables (the so-called features!). For instance, following up with the categorical outcome of the preferred musical genre in the Canadian young population, we might analyze how particular age groups prefer certain genres over others or even how preferred genres compare to each other across different Canadian provinces in this young population. The sky is the limit here!
Therefore, we might be tempted to say that each regression problem should have its own workflow, given that the regression model to use would involve particular analysis phases. However, it turns out that is not the case to a certain extent, and we have a regression workflow in Figure 1.1 to support this bold statement as a proof of concept for thirteen different regression models (i.e., thirteen chapters in this book aside from the probability and statistics review). The workflow aims to homogenize our data analyses and make our modelling process more transparent and smoother, so we can adequately deliver concluding insights as data storytelling while addressing our initial main inquiries. Of course, when depicting the workflow as a flowchart, there will be decision points that make it either inferential or predictive (the second pillar). Finally, where does the third pillar come into play in this workflow? This pillar is contained in the data modelling stage, where the mind map from Figure 1.10 will come in handy.
Moving along, let us establish a convention on the use of admonitions, beginning in Section 1.1 of this textbook:
Definition
A formal statistical and/or machine learning definition. This admonition aims to untangle the significant amount of jargon and concepts that both fields have. Furthermore, alternative terminology will be brought up when necessary to indicate the same definition across both fields.
Heads-up!
An idea (or ideas!) of key relevance for any given modelling approach, specific workflow stage or data science-related terminology. This admonition also extends to crucial statistical or machine learning topics that the reader would be interested in exploring more in-depth.
Tip
An idea (or ideas!) that might be slightly out of the scope of the topic any specific section is discussing. Still, we will provide significant insights on the matter along with further literature references to look for.
The core idea of the above admonition arrangement is to allow the reader to discern between ideas or concepts that are key to grasp and those whose understanding is not essential (but still interesting to check out in further literature!). Before jumping into our three pillars in Section 1.2 and Section 1.3, let us elaborate on an additional resource related to setting a common ground between machine learning and statistics.
1.1 The ML-Stats Dictionary
Machine learning and statistics usually overlap across many subjects, and regression modelling is no exception. Topics we teach in a fully regression-based course, under a purely statistical framework, also appear in machine learning-based courses such as fundamental supervised learning, but often with different terminology. On this basis, the Master of Data Science (MDS) program at the University of British Columbia (UBC) provides the MDS Stat-ML dictionary (Gelbart 2017) under the following premises:
This document is intended to help students navigate the large amount of jargon, terminology, and acronyms encountered in the MDS program and beyond.
This section covers terms that have different meanings in different contexts, specifically statistics vs. machine learning (ML).
Indeed, both disciplines have a tremendous amount of jargon and terminology. Furthermore, as previously emphasized in the Preface, machine learning and statistics construct a substantial synergy reflected in data science. Even so, people in both fields could encounter miscommunication issues when working together. This should not happen if we build solid bridges between both disciplines. Therefore, the above definition callout box will pave the way to a complementary resource called the ML-Stats dictionary (ML stands for Machine Learning). Building this comprehensive ML-Stats dictionary is imperative, and our textbook offers a perfect opportunity to do so. Primarily, this dictionary will clarify any potential confusion between statistics and machine learning regarding terminology within supervised learning and regression analysis contexts.
Heads-up on terminology highlights!
Following the spirit of the ML-Stats dictionary, throughout the book, all statistical terms will be highlighted in blue, whereas the machine learning terms will be highlighted in magenta. This colour scheme strives to connect the terminology of both fields so we can switch from one field to the other more easily. With practice and time, we should be able to jump back and forth when using these concepts.
Finally, Appendix A will be the section in this book where the reader can find all those statistical and machine learning-related terms in alphabetical order. Notable terms (either statistical or machine learning-related) will include an admonition identifying which terms (again, either statistical or machine learning-related) are equivalent (or NOT equivalent if that is the case). Take as an example the statistical term dependent variable:
In supervised learning, it is the main variable of interest we are trying to learn or predict, or equivalently, the variable we are trying to explain in a statistical inference framework.
Then, the above definition will be followed by this admonition:
Equivalent to:
Response variable, outcome, output or target.
Above, we have identified four equivalent terms for the term dependent variable. Furthermore, according to our already defined colour scheme, these terms can be statistical or machine learning-related. Finally, note we will start using this colour scheme in Chapter 2.
Heads-up on the use of terminology!
Throughout this book, we will interchangeably use specific terms when explaining the different regression approaches in each chapter. Whenever confusion arises about using these interchangeable terms, it is highly recommended to consult their definitions and equivalences (or non-equivalences) in Appendix A.
Now, let us proceed to explain the three pillars upon which this textbook is built: a data science workflow, the right workflow flavour (inferential or predictive), and a regression toolbox.
1.2 The Data Science Workflow
Understanding the data science workflow is essential for mastering regression analysis. This workflow serves as a blueprint that guides us through each stage of our analysis, ensuring that we apply a systematic approach to solving our inquiries in a reproducible way. The three pillars of this textbook are deeply interconnected: the data science workflow, the right workflow flavour (inferential or predictive), and a regression toolbox. Regardless of the regression model we explore, this general workflow provides a consistent framework that helps us navigate our data analysis with clarity and purpose. As shown in Figure 1.1, the data science workflow is composed of the following stages (each of which will be discussed in more detail in subsequent subsections):
- Study design: Define the research question, objectives, and variables of interest to ensure the analysis is purpose-driven and aligned with the problem at hand.
- Data collection and wrangling: Gather and clean data, addressing issues such as missing values, outliers, and inconsistencies to transform it into a usable format.
- Exploratory data analysis: Explore the data through statistical summaries and visualizations to identify patterns, trends, and potential anomalies.
- Data modelling: Apply statistical or machine learning models to uncover relationships between variables or make predictions based on the data.
- Estimation: Calculate model parameters to quantify relationships between variables and assess the accuracy and reliability of the model.
- Goodness of fit: Evaluate the model’s performance using metrics and diagnostic checks to determine how well it explains the data.
- Results: Interpret the model’s outputs to derive meaningful insights and provide answers to the original research question.
- Storytelling: Communicate the findings through a clear, engaging narrative that is accessible to a non-technical audience.
By adhering to this workflow, we ensure that our regression analyses are not only systematic and thorough but also capable of producing results that are meaningful within the context of the problem we aim to solve.
From the earliest stages of learning data analysis, understanding the importance of a structured workflow is crucial. If we do not adhere to a predefined workflow, we risk misinterpreting the data, leading to incorrect conclusions that fail to address the core questions of our analysis. Such missteps can result in outcomes that are not only meaningless but potentially misleading when taken out of the problem’s context. Therefore, it is essential for aspiring data scientists to internalize this workflow from the very beginning of their education. A systematic approach ensures that each stage of the analysis is conducted with precision, ultimately producing reliable and contextually relevant results.
1.2.1 Study Design
The first stage of this workflow is centered around defining the main statistical inquiries we aim to address throughout the data analysis process. As a data scientist, your primary task is to translate these inquiries from the stakeholders into one of two categories: inferential or predictive. This classification determines the direction of your analysis and the methods you will use.
Inferential: The objective here is to explore and quantify relationships of association or causation between explanatory variables (regressors) and the response variable within the context of the problem at hand. For example, you may seek to determine whether a specific marketing campaign (regressor) significantly impacts sales revenue (response) and, if so, by how much.
Predictive: In this case, the focus is on making accurate predictions about the response variable based on future observations of the regressors. Unlike inferential inquiries, where understanding the relationship between variables is key, the primary goal here is to maximize prediction accuracy. This approach is fundamental in machine learning. For instance, you might build a model to predict future sales revenue based on past marketing expenditures, without necessarily needing to understand the underlying relationship between the two.
Example: Predicting Housing Prices
To illustrate the study design stage, let’s consider a simple example: predicting housing prices in a specific city.
If our goal is inferential, we might be interested in understanding the relationship between various factors (like square footage, number of bedrooms, and proximity to schools) and housing prices. Specifically, we would ask questions like, “How much does the number of bedrooms affect the price of a house, after accounting for other factors?”
If our goal is predictive, we would focus on creating a model that can accurately predict the price of a house based on its features, regardless of whether we fully understand how each factor contributes to the price.
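To make this contrast concrete, here is a minimal code sketch of how the two goals might translate into practice. It is only an illustration: the data frame `houses`, its columns (`price`, `bedrooms`, `sqft`, `dist_school`), and the use of plain linear regression via `statsmodels` and `scikit-learn` are all hypothetical placeholders, not the formal modelling workflow we develop from Chapter 3 onwards.

```python
# A minimal sketch contrasting inferential and predictive goals.
# `houses` is a hypothetical pandas DataFrame with columns
# `price`, `bedrooms`, `sqft`, and `dist_school`.
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

houses = pd.read_csv("houses.csv")  # hypothetical file

# Inferential goal: estimate and test the association between the
# number of bedrooms and price, adjusting for the other features.
inferential_fit = smf.ols("price ~ bedrooms + sqft + dist_school", data=houses).fit()
print(inferential_fit.summary())  # coefficients, standard errors, p-values

# Predictive goal: maximize out-of-sample accuracy; the individual
# coefficients are of secondary interest.
X, y = houses[["bedrooms", "sqft", "dist_school"]], houses["price"]
predictive_fit = LinearRegression().fit(X, y)
new_houses = X.head(5)  # stand-in for future observations
print(predictive_fit.predict(new_houses))
```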
In both cases, the study design stage involves clearly defining these objectives and determining the appropriate methods to address them. This stage sets the foundation for all subsequent steps in the data science workflow, as illustrated in Figure 1.2. Once the study design is established, the next stage is data collection and wrangling.
1.2.2 Data Collection and Wrangling
Once we have clearly defined our statistical questions, the next crucial step is to collect the data that will form the basis of our analysis. The way we collect this data is vital because it directly affects the accuracy and reliability of our results:
- For inferential inquiries, we focus on understanding large groups or systems (populations) that we cannot fully observe. These populations are governed by characteristics (parameters) that we want to estimate. Because we can’t study every individual in the population, we collect a smaller, representative subset called a sample. The method we use to collect this sample—known as sampling—is crucial. A proper sampling method ensures that our sample reflects the larger population, allowing us to make accurate generalizations (inferences) about the entire group. After collecting the sample, it’s common practice to randomly split the data into training and test sets. This split allows us to build and validate our models, ensuring that the findings are robust and not overly tailored to the specific data at hand.
Although this book does not cover sampling methods in detail, it’s important to know that the way you collect your sample can greatly influence your results. Depending on the problem, you might use different techniques:
- Simple Random Sampling: Every individual in the population has an equal chance of being selected.
- Systematic Sampling: You select individuals at regular intervals from a list of the population.
- Stratified Sampling: You divide the population into subgroups (strata) and take a proportional sample from each subgroup.
- Cluster Sampling: You divide the population into clusters and randomly select whole clusters for your sample.
- Etc.
As with regression analysis, statistical sampling is a vast field, and we could spend a whole course on it. If you’re interested in learning more about these methods, Sampling: Design and Analysis by Lohr is a great resource. A brief code sketch after this list illustrates two of these sampling schemes.
- For predictive inquiries, our goal is often to use existing data to make predictions about future events or outcomes. In these cases, we usually work with large datasets (databases) that have already been collected. Instead of focusing on whether the data represents a population (as in inferential inquiries), we focus on cleaning and preparing the data so that it can be used to train models that make accurate predictions. After wrangling the data, it is typically split into training, validation, and test sets. The training set is used to build the model, the validation set is used to tune model parameters, and the test set evaluates the model’s final performance on unseen data.
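As promised above, here is a minimal sketch of two of the sampling schemes from the list, written with pandas. The data frame `population`, its `neighborhood` column, and the 10% sampling fraction are hypothetical choices made only for illustration.

```python
# A minimal sketch of simple random and stratified sampling with pandas.
# `population` is a hypothetical DataFrame of house sales that includes
# a `neighborhood` column; the 10% fraction is arbitrary.
import pandas as pd

population = pd.read_csv("house_sales.csv")  # hypothetical file

# Simple random sampling: every row has the same chance of selection.
srs_sample = population.sample(frac=0.10, random_state=123)

# Stratified sampling: sample 10% within each neighborhood so that the
# sample preserves the neighborhood proportions of the population.
# (Group-wise sampling requires a reasonably recent version of pandas.)
stratified_sample = (
    population
    .groupby("neighborhood", group_keys=False)
    .sample(frac=0.10, random_state=123)
)

print(len(srs_sample), len(stratified_sample))
```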
Example: Collecting Data for Housing Price Predictions
Let’s continue with our housing price prediction example to illustrate these concepts:
Inferential Approach: Suppose we want to understand how the number of bedrooms affects housing prices in a city. To do this, we would collect a sample of house sales that accurately represents the city’s entire housing market. For instance, we might use stratified sampling to ensure that we include houses from different neighborhoods in proportion to how common they are. After collecting the data, we would split it into training and test sets. The training set helps us build our model and estimate the relationship between variables, while the test set allows us to evaluate how well our findings generalize to new data.
Predictive Approach: If our goal is to predict the selling price of a house based on its features (like size, number of bedrooms, and location), we would gather a large dataset of recent house sales. This data might come from a real estate database that tracks the details of each sale. Before we can use this data to train a model, we would clean it by filling in any missing information, converting data to a consistent format, and making sure all variables are ready for analysis. After preprocessing, we would split the data into training, validation, and test sets. The training set would be used to fit the model, the validation set to fine-tune it, and the test set to assess how well the model can predict prices for houses it hasn’t seen before.
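As a rough sketch of how the wrangling and splitting described above might look in code, the snippet below fills in missing values, fixes a data type, and carves out training, validation, and test sets. The file name, column names, and 70/15/15 proportions are all hypothetical.

```python
# A rough sketch of data wrangling plus a train/validation/test split.
# The file name, column names, and split proportions are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

sales = pd.read_csv("recent_house_sales.csv")  # hypothetical file

# Basic wrangling: fill missing square footage with the median and
# store the sale date as a proper datetime.
sales["sqft"] = sales["sqft"].fillna(sales["sqft"].median())
sales["sale_date"] = pd.to_datetime(sales["sale_date"])

# First split off 30% of the rows, then halve that portion into
# validation and test sets (70% train / 15% validation / 15% test).
train, temp = train_test_split(sales, test_size=0.30, random_state=123)
validation, test = train_test_split(temp, test_size=0.50, random_state=123)

print(len(train), len(validation), len(test))
```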
As shown in Figure 1.3, the data collection and wrangling stage is fundamental to the workflow. It directly follows the study design and sets the stage for exploratory data analysis.
1.2.3 Exploratory Data Analysis
Before diving into data modelling, it’s crucial to develop a deep understanding of the relationships between the variables in your training data. This is where the third stage of the data science workflow, Exploratory Data Analysis (EDA), comes into play. EDA serves as a vital process that allows you to visualize and summarize your data, uncover patterns, detect anomalies, and test key assumptions that will inform your modelling decisions.
The first step in EDA is to classify your variables according to their types. This classification is essential because it guides your choice of analysis techniques and models. Specifically, you need to determine whether each variable is discrete or continuous, and whether it has any specific characteristics such as being bounded or unbounded.
- Response Variable:
- Determine if your response variable is discrete (e.g., binary, count-based, categorical) or continuous.
- If it is continuous, consider whether it is bounded (e.g., percentages that range between 0 and 100) or unbounded (e.g., a variable like temperature that can take on a wide range of values).
- Regressors:
- For each regressor, identify whether it is discrete or continuous.
- If a regressor is discrete, classify it further as binary, count-based, or categorical.
- If a regressor is continuous, determine whether it is bounded or unbounded.
This classification step ensures that you are prepared to choose the correct visualization and statistical methods for your analysis, as different types of variables often require different approaches.
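A quick way to start this classification in practice is to inspect each column's data type and number of distinct values. The sketch below assumes a hypothetical training data frame `train`; the ten-distinct-values threshold is only a heuristic for flagging columns worth a closer look, not a formal rule.

```python
# A minimal sketch for a first pass at classifying variables.
# `train` is a hypothetical training DataFrame; the 10-value threshold
# is only a heuristic for flagging likely discrete columns.
import pandas as pd

train = pd.read_csv("train_housing.csv")  # hypothetical file

for column in train.columns:
    dtype = train[column].dtype
    n_unique = train[column].nunique()
    if dtype == "object" or n_unique <= 10:
        kind = "likely discrete (binary, count-based, or categorical)"
    else:
        kind = "likely continuous (check whether it is bounded)"
    print(f"{column}: dtype={dtype}, unique values={n_unique} -> {kind}")
```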
After classifying your variables, the next step is to create visualizations and calculate descriptive statistics using your training data. This involves coding plots that can reveal the underlying distribution of each variable and the relationships between them. For instance, you might create histograms to visualize distributions, scatter plots to explore relationships between continuous variables, and box plots to compare categorical variables against the response variable.
Alongside these visualizations, it is important to calculate key descriptive statistics such as the mean, median, and standard deviation. These statistics provide a numerical summary of your data, offering insights into central tendency and variability. You might also use a correlation matrix to assess the strength of relationships between continuous variables.
Once you have generated these plots and statistics, they should be displayed in a clear and logical manner. The goal here is to interpret the data and draw preliminary conclusions about the relationships between variables. Presenting these findings effectively helps to uncover key insights and prepares you for the modelling stage.
Finally, the insights gained from this exploratory analysis must be clearly articulated. This involves summarizing the key findings and considering their implications for the next stage of the workflow: data modelling. Observing patterns, correlations, and potential outliers in this stage will inform your modelling approach and ensure that it is grounded in a thorough and informed analysis.
This structured approach to EDA is visually summarized in Figure 1.4, illustrating the sequential steps from variable classification to the delivery of exploratory insights.
Example: EDA for Housing Price Predictions
To illustrate the EDA process, let’s apply it to the example of predicting housing prices.
We start with variable classification:
- The response variable is the sale price of a house, a continuous and unbounded variable.
- The regressors include:
- Number of bedrooms: Discrete, count-based.
- Square footage: Continuous and unbounded.
- Neighborhood type: Discrete, categorical (e.g., urban, suburban, rural).
- Proximity to schools: Continuous, potentially bounded by distance.
Once the variables are classified, we move on to coding plots and calculating descriptive statistics. Here are some visualizations and summaries that can be helpful in this context (a short code sketch after this list shows how they might be produced):
- A histogram of sale prices helps visualize the distribution and spot any outliers.
- A scatter plot of square footage versus sale price shows the relationship, typically revealing a positive correlation.
- Box plots compare sale prices across different neighborhood types, highlighting any variations in median prices.
- Descriptive statistics like the mean and standard deviation provide a numerical summary, while a correlation matrix helps assess relationships between continuous variables like square footage and sale price.
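Below is a minimal sketch of how these plots and summaries might be coded with pandas and matplotlib, assuming a hypothetical training set `train` with columns `price`, `sqft`, and `neighborhood_type`; the libraries and column names are illustrative choices rather than fixed requirements.

```python
# A minimal sketch of the EDA plots and summaries described above.
# `train` is a hypothetical training DataFrame with columns
# `price`, `sqft`, and `neighborhood_type`.
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train_housing.csv")  # hypothetical file

# Histogram of sale prices: distribution shape and possible outliers.
train["price"].plot.hist(bins=30, title="Sale price distribution")
plt.show()

# Scatter plot of square footage versus sale price.
train.plot.scatter(x="sqft", y="price", title="Price vs. square footage")
plt.show()

# Box plots of sale price by neighborhood type.
train.boxplot(column="price", by="neighborhood_type")
plt.show()

# Descriptive statistics and a correlation matrix for the
# continuous variables.
print(train[["price", "sqft"]].describe())
print(train[["price", "sqft"]].corr())
```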
Finally, in displaying and interpreting results, these plots and statistics guide us in understanding the data:
- The histogram might show most houses fall within a mid-range price.
- The scatter plot could confirm that larger houses generally sell for more.
- Box plots may reveal that urban homes tend to have higher prices.
These exploratory insights help identify key predictors like square footage and neighborhood type, and highlight any outliers that may require further attention during modeling.
By following these steps, the EDA process in the housing price prediction example lays a solid foundation for effective modelling, ensuring that the key variables and their relationships are well understood.
1.2.4 Data Modelling
1.2.5 Estimation
1.2.6 Goodness of Fit
1.2.7 Results
1.2.8 Storytelling
1.3 Mindmap of Regression Analysis
Having defined the necessary statistical aspects to execute a proper supervised learning analysis, either inferential or predictive, across its sequential stages, we must dig into the different approaches we might encounter in practice as regression models. The nature of our outcome of interest will dictate which modelling approach to apply, as depicted in the clouds of Figure 1.10. Note these regression models can be split into two sets depending on whether the outcome of interest is continuous or discrete. Therefore, under a probabilistic view, identifying the nature of a given random variable is crucial in regression analysis.
That said, we will go beyond OLS regression and explore further regression techniques. In practice, these techniques have been developed in the statistical literature to address cases where the OLS modelling framework and assumptions are no longer suitable. Thus, throughout this book, we will cover (at least) one new regression model per chapter.
As we can see in the clouds of Figure 1.10, there are 13 regression models: 8 for discrete outcomes and 5 for continuous outcomes. Each of these models is contained in a chapter of this book, beginning with the most basic regression tool, known as ordinary least-squares, in Chapter 3. We must clarify that the current statistical literature is not restricted to these 13 regression models. The field of regression analysis is vast, and one might encounter more complex models to target certain specific inquiries. Nonetheless, we consider these models the fundamental regression approaches that any data scientist must be familiar with in everyday practice.
Since each of these 13 regression models has its own chapter, we have split these chapters into two major subsets: those with continuous outcomes and those with discrete outcomes.