10  Classical Poisson Regression

When to Use and Not Use Classical Poisson Regression

Classical Poisson regression is a type of generalized linear model that is appropriately used under the following conditions:

  • The outcome variable is discrete and count-type as shown in Figure 10.1. Examples include number of doctor visits, number of web clicks, number of accidents, etc.
  • Outcomes are statistically independent from one to the next.
  • The mean and variance of the counts are approximately equal (i.e., there is equidispersion).
  • Counts occur over fixed and known exposures (such as time intervals or geographic areas).
  • We intend to use the logarithm of the mean of the outcome as a link function, which ensures that our predictions are always positive.

However, Classical Poisson regression should not be used in the following scenarios:

  • The outcome is not a count (e.g., binary outcomes, proportions, or continuous variables).
  • The data shows overdispersion (i.e., the variance exceeds the mean). In this case, check Negative Binomial or Generalized Poisson regressions.
  • The data shows underdispersion (i.e., the variance is less than the mean), which Poisson regression cannot accommodate. Alternatively, check Generalized Poisson regression.
  • The outcome variable includes a large number of zeros. Instead, check Zero-Inflated Poisson regression.
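
A quick way to screen for the dispersion and zero-count issues above is to compare the sample mean and variance of the outcome and to inspect the share of zeros. The sketch below uses a simulated placeholder vector y rather than any dataset from this chapter:

# Quick dispersion and zero-count screening on a generic count outcome `y`
set.seed(1)
y <- rpois(200, lambda = 3)              # placeholder counts for illustration

c(mean = mean(y), variance = var(y))     # roughly equal under equidispersion
var(y) / mean(y)                         # ratios well above 1 hint at overdispersion
mean(y == 0)                             # a large share of zeros hints at zero inflation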

mindmap
  root((Regression <br/>Analysis))
    Continuous <br/>Outcome Y
      {{Unbounded <br/>Outcome Y}}
        )Chapter 3: <br/>Ordinary <br/>Least Squares <br/>Regression(
          (Normal <br/>Outcome Y)
      {{Nonnegative <br/>Outcome Y}}
        )Chapter 4: <br/>Gamma Regression(
          (Gamma <br/>Outcome Y)
      {{Bounded <br/>Outcome Y <br/> between 0 and 1}}
        )Chapter 5: Beta <br/>Regression(
          (Beta <br/>Outcome Y)
      {{Nonnegative <br/>Survival <br/>Time Y}}
        )Chapter 6: <br/>Parametric <br/> Survival <br/>Regression(
          (Exponential <br/>Outcome Y)
          (Weibull <br/>Outcome Y)
          (Lognormal <br/>Outcome Y)
        )Chapter 7: <br/>Semiparametric <br/>Survival <br/>Regression(
          (Cox Proportional <br/>Hazards Model)
            (Hazard Function <br/>Outcome Y)
    Discrete <br/>Outcome Y
      {{Binary <br/>Outcome Y}}
        {{Ungrouped <br/>Data}}
          )Chapter 8: <br/>Binary Logistic <br/>Regression(
            (Bernoulli <br/>Outcome Y)
        {{Grouped <br/>Data}}
          )Chapter 9: <br/>Binomial Logistic <br/>Regression(
            (Binomial <br/>Outcome Y)
      {{Count <br/>Outcome Y}}
        {{Equidispersed <br/>Data}}
          )Chapter 10: <br/>Classical Poisson <br/>Regression(
            (Poisson <br/>Outcome Y)

Figure 10.1: Regression analysis mind map depicting all modelling techniques explored so far in this book. Depending on the type of outcome \(Y\), these techniques are split into two large zones: discrete and continuous.

Learning Objectives

By the end of this chapter, you will be able to:

  • Describe why ordinary linear regression is inappropriate for count data.
  • Determine when Poisson regression is an appropriate modeling choice.
  • Write down the likelihood function of a Poisson regression model.
  • Understand the computational procedure for estimating Poisson regression coefficients.
  • Interpret Poisson regression coefficients in real scenarios.
  • Evaluate model performance and construct confidence intervals.

10.1 Introduction

This chapter introduces a generalized linear regression model that can be applied to Poisson-distributed count data. Compared to ordinary regression, which assumes normality and constant variance (homoskedasticity), Poisson regression models count data that are skewed and heteroskedastic.

Some research questions you might explore using Poisson regression:

  • How is the number of insurance claims filed by policyholders in a year associated with ages, vehicle types, and regions?
  • How can the number of complaints received by a telecom company from customers be explained by service types and contract lengths?
  • How does the distribution of counts of satellite male horseshoe crabs residing around a female horseshoe crab nest vary by the phenotypic characteristics of the female horseshoe crabs?

10.1.1 Poisson Regression Assumptions

  • Independence: Poisson regression assumes the responses are counts (nonnegative integers: 0, 1, 2, …). The responses are mutually independent, with the \(i\)th response having mean parameter \(\lambda_i\), \(i=1,\dots,n\).

  • Log-linearity: Poisson regression models the log-link function of the response as a linear combination of the explanatory variables, i.e., \(\log(\lambda_i) = \beta_0 + \beta_1 X_{i1}+\dots+\beta_p X_{ip}\) with \(\lambda_i> 0\) for all \(i=1,\dots,n\).

  • Heteroskedasticity: Poisson regression assumes a heteroskedastic response, i.e., the variance of the response increases as the mean increases; this contrasts with ordinary regression with Gaussian noise, whose variance is constant and independent of the mean. More specifically, Poisson regression assumes equidispersion, i.e., the variance equals the expectation of the response; compare this with overdispersion in the negative binomial distribution, where the variance exceeds the mean.

10.1.2 A Graphical Look

Below is a graphical demonstration comparing ordinary linear regression with a Gaussian-distributed response to Poisson regression with a Poisson-distributed response. The left panel illustrates how the price of smartphones (in 100 USD) increases with the number of days listed on an online marketplace. The right panel illustrates the number of bacteria colonies on a petri dish versus the hours of incubation.

Ordinary linear regression produces a linear fitted blue line (left panel), while Poisson regression, due to the log link function, produces a fitted blue curve (right panel). The explanatory variable (on the x axis) in each scenario is segmented into five sections by the gray dashed vertical lines. The red horizontal segments represent the frequencies of the binned response in each section, i.e., a rotated histogram of the response. In ordinary linear regression, the response in each section is symmetrically distributed with similar variation across the five sections (homoskedasticity); in Poisson regression, the response is skewed and heteroskedastic (specifically, the variance increases as the response increases) across sections.
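
To complement this description numerically (a hedged sketch with simulated data, not the smartphone or bacteria datasets referenced above), the growing variance of a Poisson response can be checked by binning a simulated predictor and computing the within-section variances:

# Simulated comparison: Gaussian response (constant variance) vs. Poisson response
set.seed(123)
x <- runif(500, 0, 10)                              # a generic explanatory variable
y_norm <- 1 + 0.5 * x + rnorm(500, sd = 1)          # Gaussian response, constant variance
y_pois <- rpois(500, lambda = exp(0.2 + 0.3 * x))   # Poisson response, log-linear mean

sections <- cut(x, breaks = 5)                      # five sections of the x axis
tapply(y_norm, sections, var)                       # roughly constant across sections
tapply(y_pois, sections, var)                       # grows with the mean across sections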

10.2 Case Study: Horseshoe Crab Satellites

Let’s take a closer look at the example in the third question mentioned in the Introduction—exploring the differential distribution of the number of satellite male horseshoe crabs residing around a female horseshoe crab nest across various phenotypic characteristics. Using the dataset crabs provided by Brockmann, 1996, let’s load the dataset and do some data wrangling.

10.3 Data Collection and Wrangling

There are 173 records of female horseshoe crab nests and their male satellites with 5 features (referring to Section 3.3.3 in Agresti, 2018):

  • satell: satellite size of each nest, i.e., the number of male horseshoe crabs around each female horseshoe crab nest
  • width: shell width in cm
  • color: 1 = medium light, 2 = medium, 3 = medium dark, 4 = dark
  • spine: spine condition: 1 = both good, 2 = one broken, 3 = both broken
  • weight: weight in kg

library(asbio)
data(crabs)
crabs <- as_tibble(crabs)
# reorder the columns
crabs <- crabs %>%
  select(satell, width, everything())
# display the dataset with dimension and column types
crabs
import pandas as pd

# `crabs` is assumed to be available here as a pandas DataFrame with the same
# columns as the R `asbio::crabs` data (e.g., transferred from the R session)
print(crabs.dtypes)
crabs
# A tibble: 173 × 5
   satell width color spine weight
    <int> <dbl> <fct> <fct>  <dbl>
 1      8  28.3 2     3       3.05
 2      0  22.5 3     3       1.55
 3      9  26   1     1       2.3 
 4      0  24.8 3     3       2.1 
 5      4  26   3     3       2.6 
 6      0  23.8 2     3       2.1 
 7      0  26.5 1     1       2.35
 8      0  24.7 3     2       1.9 
 9      0  23.7 2     1       1.95
10      0  25.6 3     3       2.15
# ℹ 163 more rows
satell       int64
width      float64
color     category
spine     category
weight     float64
dtype: object
satell width color spine weight
0 8 28.3 2 3 3.050
1 0 22.5 3 3 1.550
2 9 26.0 1 1 2.300
3 0 24.8 3 3 2.100
4 4 26.0 3 3 2.600
... ... ... ... ... ...
168 3 26.1 3 3 2.750
169 4 29.0 3 3 3.275
170 0 28.0 1 1 2.625
171 0 27.0 4 3 2.625
172 0 24.5 2 2 2.000

173 rows × 5 columns

Let’s now split the data into training and test sets.

# Use 70% data for training and 30% for testing
par <- 0.7

# Set seed for reproducible splitting
set.seed(1046)
n <- nrow(crabs)
train_indices <- sample(seq_len(n), size = floor(par * n))

# Create training and test sets
train_set <- crabs[train_indices, ]
test_set <- crabs[-train_indices, ]

# Print the number of training and test samples
cat("Number of training samples:", nrow(train_set), "\n")
cat("Number of test samples:", nrow(test_set), "\n")
import numpy as np

# Use 70% data for training and 30% for testing
par = 0.7

# Set seed for reproducible splitting
np.random.seed(1046)
n = len(crabs)
train_indices = np.random.choice(n, size=int(par * n), replace=False)

# Create training and test sets
train_set = crabs.iloc[train_indices]
test_set = crabs.drop(index=train_indices)

# Print the number of training and test samples
print(f"Number of training samples: {len(train_set)}")
print(f"Number of test samples: {len(test_set)}")
Number of training samples: 122 
Number of test samples: 51 
Number of training samples: 122
Number of test samples: 51

10.4 Exploratory Data Analysis

Suppose that the relationship between the satellite size per nest (satell) and width of the female horseshoe crab (width in cm) is our main research interest. Let’s explore their relationship in the following scatterplot with the distribution of satell on the margin.

# draw the scatterplot of satellite size versus width 
p <- crabs %>%
  ggplot(aes(y = satell, x = width)) + 
  geom_point(alpha = 0.8) + 
  labs(x = "Width (cm)", y = "Satellite Size") +
  scale_x_continuous(breaks = seq(floor(min(crabs$width)), ceiling(max(crabs$width)), by = 2)) + 
  scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
  theme_bw()
# add a marginal histogram with binwidth of 1, i.e., bins for integers; equivalent to the barplot
ggMarginal(p, type = "histogram", margins = "y", binwidth = 1, boundary = -0.5) # start with `satell=-0.5` to avoid the first bins (0s and 1s) being combined in one bin
import seaborn as sns
sns.set(style = "whitegrid")
# create bins of width 1 for the marginal histogram
bin_edges = np.arange(crabs['satell'].min(), crabs['satell'].max() + 2, 1)  # bins of width 1
# draw the scatterplot of satellite size versus width with marginal distributions
g = sns.jointplot(
    data = crabs, 
    x = "width", 
    y = "satell", 
    color = "black",
    kind = "scatter",
    marginal_ticks = False,
    marginal_kws = dict(bins = bin_edges, fill = True, color="black", alpha=0.6)
)
g.ax_marg_x.remove() # remove the marginal histogram on x axis
# add labels
g.ax_joint.set_xlabel("Width (cm)") 
g.ax_joint.set_ylabel("Satellite Size")
# add axes' limits
ymin, ymax = crabs['satell'].min(), crabs['satell'].max()
xmin, xmax = crabs['width'].min(), crabs['width'].max()
g.ax_joint.set_xlim(xmin - 1, xmax + 1);
g.ax_joint.set_ylim(ymin - 0.5, ymax + 0.5);
g.ax_marg_y.set_ylim(ymin - 0.5, ymax + 0.5);

The distribution of satellite sizes is highly right skewed, which violates the normality assumption of the ordinary linear regression.

In the scatterplot above, we can see that the satellite sizes become more spread out as the width increases, and the average satellite size tends to increase with width. Let's now split the width into a few intervals and compute a representative point for each interval to get a clearer look at the trend.

# set up the number of intervals
n_intervals <- 10

# compute the average points for each interval of `width`
crabs_binned <- crabs %>%
  mutate(width_inl = cut(width, breaks = n_intervals)) %>%
  group_by(width_inl) %>%
  summarize(
    mean_width = mean(width),
    mean_satell = mean(satell),
    .groups = "drop"
  )
crabs_binned
# set up the number of intervals
n_intervals = 10

# compute the average points for each interval of `width`
crabs['width_inl'] = pd.cut(crabs['width'], bins=n_intervals)
crabs_binned = (
    crabs
    .groupby('width_inl', observed=True)
    .agg(mean_width=('width', 'mean'), mean_satell=('satell', 'mean'))
    .reset_index()
    .round(2)
)
crabs_binned
# A tibble: 10 × 3
   width_inl   mean_width mean_satell
   <fct>            <dbl>       <dbl>
 1 (21,22.2]         21.5        0   
 2 (22.2,23.5]       23.0        1   
 3 (23.5,24.8]       24.2        1.77
 4 (24.8,26]         25.5        2.98
 5 (26,27.2]         26.6        2.53
 6 (27.2,28.5]       27.9        4.15
 7 (28.5,29.8]       29.1        4   
 8 (29.8,31]         30.1        4.86
 9 (31,32.2]         31.8        3   
10 (32.2,33.5]       33.5        7   
width_inl mean_width mean_satell
0 (20.988, 22.25] 21.50 0.00
1 (22.25, 23.5] 22.97 1.00
2 (23.5, 24.75] 24.25 1.77
3 (24.75, 26.0] 25.49 2.98
4 (26.0, 27.25] 26.60 2.53
5 (27.25, 28.5] 27.92 4.15
6 (28.5, 29.75] 29.08 4.00
7 (29.75, 31.0] 30.11 4.86
8 (31.0, 32.25] 31.80 3.00
9 (32.25, 33.5] 33.50 7.00

We've prepared a summarized dataset of average points: each entry includes a width interval, the average width, and the average satellite size within that interval. Now, let's overlay these representative points on the scatterplot to take a closer look at the trend.

# add the average points onto the scatterplot, connect them, and mark intervals
breaks <- seq(min(crabs$width), max(crabs$width), length.out = n_intervals + 1)
p + 
  geom_vline(xintercept = breaks, linetype = "dashed", color = "gray50") +
  geom_point(data = crabs_binned, aes(x = mean_width, y = mean_satell), color = "red", size = 2) +
  geom_line(data = crabs_binned, aes(x = mean_width, y = mean_satell), color = "red", linewidth = 1)
# plotnine provides the ggplot-style interface used below
from plotnine import (ggplot, aes, geom_point, geom_vline, geom_line, labs,
                      scale_x_continuous, scale_y_continuous, theme_bw)

# draw the scatterplot of satellite size versus width
p = (
    ggplot(crabs, aes(x='width', y='satell')) +
    geom_point(alpha=0.7, color='black') +  # points in gray
    labs(x="Width (cm)", y="Satellite Size") +
    scale_x_continuous(breaks=range(int(np.floor(crabs['width'].min())), int(np.ceil(crabs['width'].max())) + 1, 2)) +
    scale_y_continuous(breaks=range(int(np.floor(crabs['satell'].min())), int(np.ceil(crabs['satell'].max())) + 1, 2)) +
    theme_bw()
)
# add the average points onto the scatterplot, connect them, and mark intervals
breaks = np.linspace(crabs['width'].min(), crabs['width'].max(), n_intervals + 1)
p_final = (
    p +
    geom_vline(xintercept=breaks, linetype='dashed', color='gray') +
    geom_point(data=crabs_binned, mapping=aes(x='mean_width', y='mean_satell'), color='red', size=2) +
    geom_line(data=crabs_binned, mapping=aes(x='mean_width', y='mean_satell'), color='red', size=1)
)
p_final.show()

We can see a general increasing trend in the satellite size as the female crab's shell width grows.

10.5 Data Modelling

The Poisson regression model assumes a sample of \(n\) independent (but not identically distributed!) count observations \(Y_i\), which have the following distribution: \[Y_i \sim \mathrm{Poisson}(\lambda_i).\] Each \(i\)th observation has its own \(\mathbb{E}(Y_i)=\lambda_i>0\), which also implies \(Var(Y_i)=\lambda_i>0\).

Parameter \(\lambda_i\) is the risk of an event occurrence (coming from the definition of the Poisson random variable) over a given timeframe or space. It is a continuous distributional parameter! For the crabs dataset, the events counted are the satellite male horseshoe crabs residing around a given space: the female breeding nest.

Since the Poisson Regression model is also a GLM, we need to set up a link function for the mean. Let \(X_{i,\text{width}}\) be the \(i\)th value for the regressor width in our dataset. An easy modelling solution would be an identity link function as in \[ h(\lambda_i)=\lambda_i=\beta_0+\beta_1 X_{i,\text{width}}. \label{eq:pois-uni-iden} \]

However, we have a response range issue!

The model (eq:pois-uni-iden?) has a significant drawback: the right-hand side can take on negative values, which does not align with the nature of the parameter \(\lambda_i\) (which always has to be positive).

Recall the essential characteristic of a GLM that should come into play for a link function. In this class of models, the direct relationship between the original response and the regressors may be non-linear as in \(h(\lambda_i)\). Hence, instead of using the identity link function, we will use the natural logarithm of the mean: \(\log(\lambda_i)\).

Before continuing with the crabs dataset, let us generalize the Poisson regression model with \(k\) regressors as: \[ h(\lambda_i) = \log(\lambda_i)=\beta_0+\beta_1 X_{i,1}+\dots+\beta_k X_{i,k}. \label{eq:pois-k} \] The interpretation becomes clearer by exponentiating (eq:pois-k?): \[ \lambda_i = \exp(\beta_0+\beta_1 X_{i,1}+\dots+\beta_k X_{i,k}), \label{eq:pois-k-exp} \] where an increase of one unit in any of the \(k\) regressors (while keeping the rest of them constant) multiplies the mean \(\lambda_i\) by a factor of \(\exp(\beta_j)\), for \(j=1,\dots,k\).
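
For instance, if \(\beta_j = 0.1\), then a one-unit increase in the \(j\)th regressor multiplies the expected count by \(e^{0.1} \approx 1.105\), i.e., a roughly \(10.5\%\) increase; a negative coefficient such as \(\beta_j = -0.1\) multiplies it by \(e^{-0.1} \approx 0.905\), a roughly \(9.5\%\) decrease.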

In the crabs dataset with width as a regressor, the Poisson regression model is depicted as: \[ \log(\lambda_i)=\beta_0+\beta_1 X_{i,\text{width}}. \label{eq:pois-uni} \]

10.6 Estimation

In generalized linear models (GLMs), parameter estimation is typically performed using maximum likelihood estimation (MLE). A widely used algorithm to compute the MLE in GLMs is the iteratively reweighted least squares (IRLS) method, which estimates the regression coefficients by iteratively solving weighted least squares problems. Below we cover the essential elements of the IRLS method and the MLE problem it solves. We use Poisson regression as an example, but the procedure extends straightforwardly to other distributions.

The general idea to solve the generalized linear regression is to estimate the parameters by maximizing the likelihood. For Poisson regression in (eq:pois-k?) specifically, it is written as: \[ \prod_{i=1}^n p_{Y_i}(y_i) = \prod_{i=1}^n \frac{e^{-\lambda_i}{\lambda_i}^{y_i}}{y_i!}, \] where \(p_{Y_i}(y_i)\) is the probability mass function of \(Y_i\), for \(i=1,\dots,n\).

Let \(\boldsymbol{\beta}:=(\beta_0,\dots,\beta_k)^{\top}\) be the coefficient vector; then (eq:pois-k-exp?) can be rewritten as \(\lambda_i=\exp(X_{i}^{\top}\boldsymbol{\beta})\), where \(X_{i}^{\top}\) is row \(i\) of the design matrix \(X\). Maximizing the likelihood is equivalent to minimizing the negative log-likelihood with respect to the parameters \(\beta_j, j=0,\dots,k\): \[ {\arg\min}_{\boldsymbol{\lambda}} \, -\log \prod_{i=1}^n\frac{e^{-\lambda_i}{\lambda_i}^{y_i}}{y_i!} = {\arg\min}_{\boldsymbol{\lambda}} \sum_{i=1}^n \big[\lambda_i - y_i \log(\lambda_i)\big], \] where the \(\log(y_i!)\) terms are dropped because they do not depend on the parameters. Plugging in \(\lambda_i=\exp(X_{i}^{\top}\boldsymbol{\beta})\) gives the minimization problem with respect to the Poisson regression coefficients: \[ {\arg\min}_{\boldsymbol{\beta}} \sum_{i=1}^n \Big[\exp(X_{i}^{\top}\boldsymbol{\beta}) - y_i X_{i}^{\top}\boldsymbol{\beta}\Big]. \label{eq:pois-log} \]
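
To make this objective concrete, it can be minimized numerically with a general-purpose optimizer and compared against the glm() fit later in this section; this is a minimal sketch (assuming the train_set split created above), not part of the book's workflow:

# Negative log-likelihood of Poisson regression (the constant log(y!) terms are dropped)
negloglik <- function(beta, X, y) sum(exp(X %*% beta) - y * (X %*% beta))
# Its gradient with respect to beta
grad <- function(beta, X, y) as.vector(t(X) %*% (exp(X %*% beta) - y))

X <- model.matrix(~ width, data = train_set)   # design matrix: intercept and width
fit <- optim(c(0, 0), negloglik, gr = grad, X = X, y = train_set$satell,
             method = "BFGS", control = list(maxit = 500))
fit$par   # should be close to the coefficients of the glm() fit below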

Since the objective in (eq:pois-log?) is not quadratic and has no closed-form solution, we cannot solve it exactly. IRLS computes an approximate solution to a specified accuracy or tolerance. The basic idea is that IRLS uses Fisher scoring (or Newton-Raphson) to build an approximation of the original objective (eq:pois-log?) and solves it iteratively, with the weights updated at each iteration, until the specified accuracy is achieved, i.e., the objective converges.

The MLE \(\hat{\boldsymbol{\beta}}\) satisfies the score function \(U(\boldsymbol{\beta}) = \nabla_{\boldsymbol{\beta}} \ell(\boldsymbol{\beta}) = 0\). Applying the Fisher scoring update gives: \[ \boldsymbol{\beta}^{t+1} \leftarrow \boldsymbol{\beta}^{t} + {\mathcal{I}(\boldsymbol{\beta}^t)}^{-1} U(\boldsymbol{\beta}^{t}), \] where \(\mathcal{I}(\boldsymbol{\beta}^t) := \mathbb{E}\big[- \nabla^2_{\boldsymbol{\beta}} \ell(\boldsymbol{\beta}^t)\big]\) is the Fisher information, which is the expectation of the negative second-order derivative of the log-likelihood with respect to the parameters.

The Fisher scoring update corresponds to solving the following approximation of the original objective at iterate \(t+1\), which is a weighted least squares problem: \[ \begin{align} \boldsymbol{\beta}^{t+1} &= {\arg\min}_{\boldsymbol{\beta}} \sum_{i=1}^n w_i^t (y_i^t - X_i^{\top}\boldsymbol{\beta})^2 \\ &= {\arg\min}_{\boldsymbol{\beta}} \sum_{i=1}^n w_i^t (y_i^t - \beta_0 - \beta_1 x_{i,\text{width}})^2 \label{eq:irls-approx} \end{align} \] given the estimates at iterate \(t\) for the univariate case using the crabs dataset. The approximated problem has a closed-form solution and is much easier to solve than the original objective. The working observations \(y_i^t\) and weights \(w_i^t\) are updated at each iterate: the working observations \(y_i^{t+1}\) are a function of the original observations \(y_i^0\) and the current coefficient estimates, and the weights are updated by \[w_i^{t+1} = \Big(\frac{1}{Var(Y_i)}\Big)\Big(\frac{1}{g'(\lambda_i)}\Big)^2,\] where \(\lambda_i\) is the mean of the response \(y_i\) and \(\eta_i=g(\lambda_i)\) is the link function applied to the mean. In our case, the weights at iterate \(t+1\) can be updated given the coefficient estimates at iterate \(t\) using the following formula: \[ w_i^{t+1} = \exp(X_i^{\top} \boldsymbol{\beta}^t), \quad \forall i=1,\dots,n. \]

The IRLS iteration procedure is as follows.

  1. Choose a set of initial coefficients \(\boldsymbol{\beta}^0 = (\beta_0^0,\dots,\beta_k^0)\), which can be a vector of zeros.
  2. Compute the weights based on the estimates from the previous iterate, or the initial coefficients at the first iterate.
  3. Solve the approximated problem (eq:irls-approx?).
  4. Check the convergence condition.
  5. Return the estimates if the objective has converged; otherwise, go back to step 2.
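
A minimal from-scratch sketch of this loop for Poisson regression with a log link is shown below (an illustration only, not the implementation inside glm(); it assumes a numeric design matrix X whose first column is the intercept and a count vector y):

# Illustrative IRLS loop for Poisson regression with a log link
irls_poisson <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))                       # step 1: initial coefficients (zeros)
  for (iter in seq_len(max_iter)) {
    eta <- as.vector(X %*% beta)                # current linear predictor
    lambda <- exp(eta)                          # current fitted means
    w <- lambda                                 # step 2: Poisson/log-link weights
    z <- eta + (y - lambda) / lambda            # working observations
    beta_new <- solve(t(X) %*% (w * X), t(X) %*% (w * z))  # step 3: weighted least squares
    if (max(abs(beta_new - beta)) < tol) {      # step 4: convergence check
      beta <- as.vector(beta_new)
      break                                     # step 5: converged, stop iterating
    }
    beta <- as.vector(beta_new)
  }
  beta                                          # step 5: return the estimates
}

# For example, on the training data this should closely match the glm() fit below:
# X <- model.matrix(~ width, data = train_set)
# irls_poisson(X, train_set$satell)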

Let’s now fit the model on the training data.

# Fit Poisson regression
poisson_model <- glm(satell ~ width, family = poisson(link = "log"), data = train_set)

# View summary
summary(poisson_model)
# Fit Poisson regression
import statsmodels.api as sm
from statsmodels.formula.api import glm

poisson_model = glm(formula='satell ~ width', data=train_set, family=sm.families.Poisson(link=sm.families.links.log())).fit()

# View summary
print(poisson_model.summary())

Call:
glm(formula = satell ~ width, family = poisson(link = "log"), 
    data = train_set)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.43023    0.64716  -5.300 1.16e-07 ***
width        0.16973    0.02386   7.112 1.14e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 476.83  on 121  degrees of freedom
Residual deviance: 428.20  on 120  degrees of freedom
AIC: 679.01

Number of Fisher Scoring iterations: 6
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                 satell   No. Observations:                  122
Model:                            GLM   Df Residuals:                      120
Model Family:                 Poisson   Df Model:                            1
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -337.50
Date:                Mon, 09 Feb 2026   Deviance:                       428.20
Time:                        14:27:54   Pearson chi2:                     418.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.3287
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -3.4302      0.647     -5.300      0.000      -4.699      -2.162
width          0.1697      0.024      7.112      0.000       0.123       0.216
==============================================================================

10.7 Goodness of Fit

In GLMs, we can use the residual deviance and a chi-squared test to check the goodness of fit of the model. The residual deviance measures how well the fitted model explains the observed outcomes, compared to a perfect model (the saturated model) that explains the data exactly. A smaller residual deviance means a better fit to the data. It is computed as \[\text{Residual Deviance } = 2 (\ell_{\mathrm{saturated}} - \ell_{\mathrm{fitted}}),\] where \(\ell_A\) is the log-likelihood of model \(A\). For Poisson regression specifically, the residual deviance is \(2\sum_{i=1}^n \Big[y_i\log\big(\frac{y_i}{\hat{\lambda}_i}\big) - (y_i-\hat{\lambda}_i)\Big]\).
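
To make this formula concrete, here is a quick sketch (using the poisson_model fit and train_set split from earlier in this chapter) that recomputes the residual deviance by hand:

# Hand-computed Poisson residual deviance (0 * log(0) terms are treated as 0)
lambda_hat <- fitted(poisson_model)
y_obs <- train_set$satell
ratio <- ifelse(y_obs > 0, y_obs / lambda_hat, 1)
2 * sum(y_obs * log(ratio) - (y_obs - lambda_hat))  # should agree with deviance(poisson_model)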

When the model is a good fit, the residual deviance is expected to follow a chi-squared distribution with degrees of freedom \(df = n - p\), where \(n\) is the number of observations and \(p\) is the number of parameters, asymptotically for a large enough sample size. We then can compute the \(p\)-value using the chi-squared distribution: \[ p\text{-value} = 1-P(\chi^2_{\mathrm{df}} \leq \text{residual deviance}). \]

The \(p\)-value is the probability of observing a residual deviance as large as (or larger than) the one we obtained, if the model is truly correct. A large \(p\)-value (e.g., > 0.05) means that the observed deviance is not surprisingly large: our model is a good fit and could plausibly have generated the data. A small \(p\)-value (e.g., < 0.05) means that the deviance is too large to be due to chance: the model is a poor fit and is likely missing something important.

# Residual deviance and degrees of freedom
res_dev <- deviance(poisson_model)
df_res <- df.residual(poisson_model)

# Chi-squared goodness-of-fit test
p_value <- 1 - pchisq(res_dev, df_res)

cat("Residual Deviance:", round(res_dev, 4), "\n")
cat("Degrees of Freedom:", df_res, "\n")
cat("Goodness-of-fit p-value:", round(p_value, 4), "\n")
from scipy import stats

res_dev = poisson_model.deviance
df_res = poisson_model.df_resid

p_value = 1 - stats.chi2.cdf(res_dev, df_res)

print(f"Residual Deviance: {res_dev:.4f}")
print(f"Degrees of Freedom: {df_res}")
print(f"Goodness-of-fit p-value: {p_value:.4f}")
Residual Deviance: 428.2018 
Degrees of Freedom: 120 
Goodness-of-fit p-value: 0 
Residual Deviance: 428.2018
Degrees of Freedom: 120
Goodness-of-fit p-value: 0.0000

The probability of observing a deviance as large as this if the model is truly correct (i.e., the goodness-of-fit \(p\)-value) is essentially 0, indicating significant evidence of lack of fit. There can be several reasons for lack of fit: the model may be misspecified, e.g., missing important covariates or nonlinear effects; the link function might be incorrect, so that the model systematically overestimates or underestimates the data; there may be outliers or influential points that inflate the deviance; or the data might be overdispersed, so that the residual deviance is inflated by the large variance.

10.8 Inference

The estimated model can be used for two purposes: inference and prediction. In terms of inference, we use the fitted model to identify the relationship between the response and the regressors. Wald's test is a general method for hypothesis testing in maximum likelihood estimation (MLE), including generalized linear models (GLMs) such as Poisson regression. It tests hypotheses of the form: \[ \begin{align} H_0 &: \beta_j = 0, \\ H_a &: \beta_j \neq 0, \end{align} \] using the fact that \(\frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \sim \mathcal{N}(0,1)\) asymptotically under \(H_0\), where \(\hat{\beta}_j\) is the \(j\)th estimated regression coefficient and \(SE(\hat{\beta}_j)\) is its standard error. This ratio is the Wald test statistic \[z_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)},\] which is used to determine the statistical significance of \(\hat{\beta}_j\); equivalently, the squared statistic asymptotically follows a one-degree-of-freedom chi-squared distribution, i.e., \[W_j = \Big(\frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}\Big)^2\sim \chi_1^2.\]
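
To make this computation concrete, the Wald statistics and \(p\)-values reported by the software can be reproduced by hand; the sketch below assumes the poisson_model object fitted in Section 10.6:

# Wald z-statistics and two-sided p-values from the coefficients and their standard errors
beta_hat <- coef(poisson_model)
se_hat <- sqrt(diag(vcov(poisson_model)))
z_stat <- beta_hat / se_hat
p_val <- 2 * pnorm(-abs(z_stat))
cbind(estimate = beta_hat, std.error = se_hat, statistic = z_stat, p.value = p_val)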

Remark 1: A statistic like \(t_j=\frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}\) in a \(t\)-test is referred to as a \(t\)-value. It assumes finite-sample normality: a Student's \(t\)-distribution under the null hypothesis \(H_0\) with \(n-k-1\) degrees of freedom. It can only be used for linear models with normal errors. In contrast, Wald's test applies to any MLE (e.g., Poisson or logistic regression) and hence to any GLM, and assumes asymptotic normality: a standard normal or \(\chi^2\) distribution under \(H_0\).

Remark 2: Wald's test is valid under several conditions: (1) a large sample size, so that the test statistic is asymptotically normal; (2) regularity conditions for the MLE, including a differentiable likelihood, positive definite Fisher information, and a correctly specified model and parameter space; (3) well-estimated parameters, i.e., parameters not near the boundary of the parameter space; (4) stable and finite standard errors; and (5) no outliers or high-leverage points.

We can obtain the corresponding \(p\)-values for each \(\beta_j\) associated to the Wald’s test statistic under the null hypothesis \(H_0\). The smaller the \(p\)-value, the stronger the evidence against the null hypothesis \(H_0\) in our sample. Hence, small \(p\)-values (less than the significance level \(\alpha\)) indicate that the data provides evidence in favour of association (or causation if that is the case) between the response variable and the \(j\)th regressor.

Similarly, given a specified \((1-\alpha)\times 100\%\) level of confidence, we can construct confidence intervals for the corresponding true value of \(\beta_j\): \[\hat{\beta}_j \pm z_{\alpha/2} \, SE(\hat{\beta}_j),\] where \(z_{\alpha/2}\) is the upper \(\alpha/2\) quantile of the standard normal distribution.
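
A minimal sketch of these Wald-type intervals for the univariate model is shown below (assuming poisson_model; note that tidy() with conf.int = TRUE in R reports profile-likelihood intervals, which may differ slightly from these):

# 95% Wald confidence intervals on the coefficient (log) scale
alpha <- 0.05
z_crit <- qnorm(1 - alpha / 2)
beta_hat <- coef(poisson_model)
se_hat <- sqrt(diag(vcov(poisson_model)))
cbind(conf.low = beta_hat - z_crit * se_hat, conf.high = beta_hat + z_crit * se_hat)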

Let’s now compute the 95% confidence interval.

tidy(poisson_model, conf.int = TRUE) %>% mutate_if(is.numeric, round, 3)
# Get coefficient table
summary_df = poisson_model.summary2().tables[1]

# Compute confidence intervals
conf_int = poisson_model.conf_int()
conf_int.columns = ['conf_low', 'conf_high']

# Combine with coefficient table
summary_df = summary_df.join(conf_int)

# Round all numeric columns to 3 decimals
summary_df = summary_df.round(3)

print(summary_df)
# A tibble: 2 × 7
  term        estimate std.error statistic p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 (Intercept)    -3.43     0.647     -5.3        0   -4.70     -2.16 
2 width           0.17     0.024      7.11       0    0.123     0.216
           Coef.  Std.Err.      z  P>|z|  [0.025  0.975]  conf_low  conf_high
Intercept  -3.43     0.647 -5.300    0.0  -4.699  -2.162    -4.699     -2.162
width       0.17     0.024  7.112    0.0   0.123   0.216     0.123      0.216

Our sample gives us evidence to reject \(H_0\) (with a nearly zero \(p\)-value), which suggests that carapace width is statistically associated with the logarithm of the mean satellite size. Now, it is time to plot the fitted values coming from poisson_model to check whether it provides a positive relationship between width and the original scale of the response satell.

p +
  geom_smooth(
    data = train_set, aes(width, satell),
    method = "glm", formula = y ~ x,
    method.args = list(family = poisson), se = FALSE
  ) +
  labs(title="Poisson Regression Fitted Curve")
import matplotlib.pyplot as plt

# Compute the fitted values
width_range = np.linspace(train_set['width'].min(), train_set['width'].max(), 100)
predict_df = pd.DataFrame({'width': width_range})
predict_df['predicted'] = poisson_model.predict(predict_df)

# Draw the scatterplot
sns.scatterplot(data=train_set, x='width', y='satell', label='Observed')

# Add the Poisson regression line
sns.lineplot(data=predict_df, x='width', y='predicted', color='red', label='Poisson fit')

# Add title
plt.title('Poisson Regression Fitted Curve')
plt.xlabel('Width')
plt.ylabel('Satellite size')
plt.legend()
plt.show()

The blue line in the plot above is the fitted Poisson regression of satell versus width. The positive relationship is now clear with this regression line.

10.9 Results

Let us fit a second model with two regressors: width (\(X_{\mathrm{width}_i}\)) and color (\(X_{\mathrm{color_2}_i}\), \(X_{\mathrm{color_3}_i}\), \(X_{\mathrm{color_4}_i}\)) for the \(i\)th observation: \[ h(\lambda_i) = \log(\lambda_i) = \beta_0 + \beta_1 X_{\mathrm{width}_i} + \beta_2 X_{\mathrm{color_2}_i} + \beta_3 X_{\mathrm{color_3}_i} + \beta_4 X_{\mathrm{color_4}_i}. \] The explanatory variable color is of factor-type (discrete) and nominal (its levels do not follow any specific order). There are four levels: 1 = medium light, 2 = medium, 3 = medium dark, 4 = dark.

# Factorize the variable
train_set$color <- as.factor(train_set$color)
# Print color categories
levels(train_set$color)
# Factorize the variable
train_set['color'] = train_set['color'].astype('category')
# Print color categories
train_set['color'].cat.categories
[1] "1" "2" "3" "4"
Index([1, 2, 3, 4], dtype='int64')

Using categorical variables such as color involves using dummy variables as in Binary Logistic regression. Taking the baseline 1 = medium light by default, this Poisson regression model will incorporate three dummy variables: \(X_{\mathrm{color_2}_i}\), \(X_{\mathrm{color_3}_i}\), and \(X_{\mathrm{color_4}_i}\). For different levels of color, these dummy variables take on the following values:

  • When color is 1 = medium light, then all three dummy variables \(X_{\mathrm{color_2}_i} = X_{\mathrm{color_3}_i} = X_{\mathrm{color_4}_i}=0\).
  • When color is 2 = medium, then \(X_{\mathrm{color_2}_i}=1\) while the other two dummy variables \(X_{\mathrm{color_3}_i} = X_{\mathrm{color_4}_i}=0\).
  • When color is 3 = medium dark, then \(X_{\mathrm{color_3}_i}=1\) while the other two dummy variables \(X_{\mathrm{color_2}_i} = X_{\mathrm{color_4}_i}=0\).
  • When color is 4 = dark, then \(X_{\mathrm{color_4}_i}=1\) while the other two dummy variables \(X_{\mathrm{color_2}_i} = X_{\mathrm{color_3}_i}=0\).

Note that the level 1 = medium light is depicted as the baseline here. Hence, the interpretation of the coefficients in the model for each dummy variable will be compared to this baseline.
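
To see this coding concretely, here is a quick peek (a sketch, assuming train_set with color converted to a factor as above) at the design matrix R builds for this model:

# First few rows of the design matrix with the three color dummy variables
head(model.matrix(~ width + color, data = train_set))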

Now, let us fit this second Poisson regression model and print the model summary including coefficient estimates, standard errors, Wald’s test statistic (\(z\)-test), the corresponding \(p\)-value, and 95% confidence intervals.

# Fit the Poisson model
poisson_model2 <- glm(satell ~ width + color, family = poisson, data = train_set)
# Summarise the model output
summary_df <- tidy(poisson_model2, conf.int = TRUE) %>% mutate_if(is.numeric, round, 3)
# Display the output in table
kable(summary_df, "pipe")
# Fit the Poisson regression model
poisson_model2 = glm(formula='satell ~ width + color', data=train_set, family=sm.families.Poisson()).fit()
# Extract the summary table
summary_df = poisson_model2.summary2().tables[1]
# Display the table
print(summary_df.round(3))
term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)    -2.915      0.687     -4.244    0.000    -4.257     -1.564
width           0.161      0.025      6.495    0.000     0.112      0.209
color2         -0.224      0.180     -1.244    0.213    -0.562      0.145
color3         -0.458      0.212     -2.160    0.031    -0.867     -0.034
color4         -0.410      0.231     -1.773    0.076    -0.862      0.047
            Coef.  Std.Err.      z  P>|z|  [0.025  0.975]
Intercept  -2.915     0.687 -4.244  0.000  -4.261  -1.568
color[T.2] -0.224     0.180 -1.244  0.213  -0.576   0.129
color[T.3] -0.458     0.212 -2.160  0.031  -0.873  -0.042
color[T.4] -0.410     0.231 -1.773  0.076  -0.862   0.043
width       0.161     0.025  6.495  0.000   0.112   0.209

We can see that width and color3 are significant according to the \(p\)-value column under the significance level \(\alpha=0.05\).

10.9.1 Interpreting the results of a continuous variable

First, let us focus on the coefficient corresponding to width, while keeping color constant. Consider an observation with a given shell width \(X_{\mathrm{width}}=\texttt{w}\) cm, and another observation with a given \(X_{\mathrm{width+1}}=\texttt{w}+1\) cm (i.e., an increase of \(1\) cm). Denote the corresponding expected satell sizes as \(\lambda_{\mathrm{width}}\) and \(\lambda_{\mathrm{width+1}}\) respectively. Then we have their corresponding regression equations:

\[ \begin{align} \log \left( \lambda_{\mathrm{width}} \right) &= \beta_0 + \beta_1 \overbrace{\texttt{w}}^{X_{\mathrm{width}}} + \overbrace{\beta_2 X_{\mathrm{color_2}} + \beta_3 X_{\mathrm{color_3}} + \beta_4 X_{\mathrm{color_4}}}^{\mathrm{Constant}} \\ \log \left( \lambda_{\mathrm{width + 1}} \right) &= \beta_0 + \beta_1 \underbrace{(\texttt{w} + 1)}_{X_{\mathrm{width + 1}}} + \underbrace{\beta_2 X_{\mathrm{color_2}} + \beta_3 X_{\mathrm{color_3}} + \beta_4 X_{\mathrm{color_4}}}_{\mathrm{Constant}}. \end{align} \]

We take the difference between both equations as: \[ \begin{align} \log \left( \lambda_{\mathrm{width + 1}} \right) - \log \left( \lambda_{\mathrm{width}} \right) &= \beta_1 (\texttt{w} + 1) - \beta_1 \texttt{w} \\ &= \beta_1. \end{align} \]

We apply the logarithm property for a ratio: \[ \begin{align*} \log \left( \frac{\lambda_{\mathrm{width + 1}} }{\lambda_{\mathrm{width}}} \right) &= \log \left( \lambda_{\mathrm{width + 1}} \right) - \log \left( \lambda_{\mathrm{width}} \right) \\ &= \beta_1. \end{align*} \]

Finally, we exponentiate the previous equation: \[\frac{\lambda_{\mathrm{width+1}}}{\lambda_{\mathrm{width}}} = e^{\beta_1},\] which indicates that the mean satellite size varies in a multiplicative way when the shell width increases by 1 cm.

Therefore, by using the estimate \(\hat{\beta_1}\) (note the hat notation) coming from the model poisson_model2, we calculate this multiplicative effect as follows (via exponentiate = TRUE in tidy()):

summary_exp_df <- tidy(poisson_model2, exponentiate = TRUE, conf.int = TRUE) %>%
  mutate_if(is.numeric, round, 3)
knitr::kable(summary_exp_df, "pipe")
summary_exp_df = summary_df.copy()
summary_exp_df['Coef.'] = np.exp(summary_df['Coef.'])
summary_exp_df['[0.025'] = np.exp(summary_df['[0.025'])
summary_exp_df['0.975]'] = np.exp(summary_df['0.975]'])
summary_exp_df = summary_exp_df.round(3)
print(summary_exp_df)
term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)     0.054      0.687     -4.244    0.000     0.014      0.209
width           1.174      0.025      6.495    0.000     1.118      1.232
color2          0.800      0.180     -1.244    0.213     0.570      1.156
color3          0.633      0.212     -2.160    0.031     0.420      0.967
color4          0.664      0.231     -1.773    0.076     0.422      1.048
            Coef.  Std.Err.      z  P>|z|  [0.025  0.975]
Intercept   0.054     0.687 -4.244  0.000   0.014   0.208
color[T.2]  0.800     0.180 -1.244  0.213   0.562   1.137
color[T.3]  0.633     0.212 -2.160  0.031   0.418   0.958
color[T.4]  0.664     0.231 -1.773  0.076   0.422   1.044
width       1.174     0.025  6.495  0.000   1.119   1.232

\(\frac{\hat{\lambda}_{\mathrm{width + 1}} }{\hat{\lambda}_{\mathrm{width}}} = e^{\hat{\beta}_1} = 1.174\) indicates that the average satellite size (satell) of male horseshoe crabs around a female breeding nest increases by \(17.4\%\) when the shell width increases by \(1\) cm, while keeping color constant. The significant coefficients for the color dummy variables are interpreted as differences between the corresponding color groups and the baseline level, as shown next.

10.9.2 Interpreting the results of a categorical variable

Consider two observations, one with color 1 = medium light of the prosoma (the baseline) and another with 4 = dark. Their corresponding responses are denoted as \(\lambda_{\mathrm{D}}\) (for dark) and \(\lambda_{\mathrm{L}}\) (for medium light). While holding \(X_{\mathrm{width}}\) constant, their regression equations are: \[ \begin{gather*} \log \left( \lambda_{\mathrm{D}} \right) = \beta_0 + \overbrace{\beta_1 X_{\mathrm{width}}}^{\text{Constant}} + \beta_2 X_{\mathrm{color_2}_{\mathrm{D}}} + \beta_3 X_{\mathrm{color_3}_{\mathrm{D}}} + \beta_4 X_{\mathrm{color_4}_{\mathrm{D}}} \\ \log \left( \lambda_{\mathrm{L}} \right) = \beta_0 + \underbrace{\beta_1 X_{\mathrm{width}}}_{\text{Constant}} + \beta_2 X_{\mathrm{color_2}_{\mathrm{L}}} + \beta_3 X_{\mathrm{color_3}_{\mathrm{L}}} + \beta_4 X_{\mathrm{color_4}_{\mathrm{L}}} \end{gather*} \]

The corresponding color indicator variables for both \(\lambda_{\mathrm{D}}\) and \(\lambda_{\mathrm{L}}\) take on these values: \[ \begin{align*} \log \left( \lambda_{\mathrm{D}} \right) &= \beta_0 + \overbrace{\beta_1 X_{\mathrm{width}}}^{\text{Constant}} + \beta_2 X_{\mathrm{color_2}_{\mathrm{D}}} + \beta_3 X_{\mathrm{color_3}_{\mathrm{D}}} + \beta_4 X_{\mathrm{color_4}_{\mathrm{D}}} \\ &= \beta_0 + \beta_1 X_{\mathrm{width}}+ \beta_2 \times 0 + \beta_3 \times 0 + \beta_4 \times 1 \\ &= \beta_0 + \beta_1 X_{\mathrm{width}} + \beta_4, \\ \log \left( \lambda_{\mathrm{L}} \right) &= \beta_0 + \beta_1 X_{\mathrm{width}} + \beta_2 X_{\mathrm{color_2}_{\mathrm{L}}} + \beta_3 X_{\mathrm{color_3}_{\mathrm{L}}} + \beta_4 X_{\mathrm{color_4}_{\mathrm{L}}} \\ &= \beta_0 + \beta_1 X_{\mathrm{width}}+ \beta_2 \times 0 + \beta_3 \times 0 + \beta_4 \times 0 \\ &= \beta_0 + \underbrace{\beta_1 X_{\mathrm{width}}}_{\text{Constant}}. \end{align*} \]

Therefore, what is the association of the level dark with respect to the baseline medium light? Let us take the differences again: \[ \begin{align*} \log \left( \frac{\lambda_{\mathrm{D}} }{\lambda_{\mathrm{L}}} \right) &= \log \left( \lambda_{\mathrm{D}} \right) - \log \left( \lambda_{\mathrm{L}} \right) \\ &= \beta_4. \end{align*} \]

Then, we exponentiate the previous equation:

\[ \frac{\lambda_{\mathrm{D}} }{\lambda_{\mathrm{L}}} = e^{\beta_4}. \]

The expression \(\frac{\lambda_{\mathrm{D}} }{\lambda_{\mathrm{L}}} = e^{\beta_4}\) indicates that the response varies in a multiplicative way when the color of the prosoma changes from medium light to dark.

\(\frac{\hat{\lambda}_{\mathrm{D}} }{\hat{\lambda}_{\mathrm{L}}} = e^{\hat{\beta}_4} = 0.66\) indicates that the mean satellite size (satell) of male horseshoe crabs around a female breeding nest decreases by \(34\%\) when the color of the prosoma changes from medium light to dark, while keeping the carapace width constant.

10.9.3 Prediction using test data

Let us now predict on the test set. Besides the residual deviance discussed in Goodness of Fit, the model performance can also be measured by the Kullback-Leibler (KL) divergence, which for observation \(i\) is defined as \[ D_{\mathrm{KL}}(y_i \Vert \hat{\lambda}_i) = y_i \log\left(\frac{y_i}{\hat{\lambda}_i} \right) - y_i + \hat{\lambda}_i. \] The KL divergence is a measure of how one probability distribution diverges from a reference distribution. In the context of Poisson regression, it quantifies how much the predicted Poisson distribution diverges from the observed data.

Averaging \(\overline{D}_{\mathrm{KL}}(y \Vert \hat{\lambda}) = \frac{1}{n} \sum_{i=1}^n D_{\mathrm{KL}}(y_i \Vert \hat{\lambda}_i)\) across all observations gives a measure of how well the model captures the true data distribution. Smaller averaged KL values indicate a better fit.

# Factorize `color` in test set
test_set$color <- as.factor(test_set$color)
# Predict
predictions <- predict(poisson_model2, newdata = test_set, type = "response")
actuals <- test_set$satell
# Averaged KL divergence (0 * log(0) terms are treated as 0)
ratio <- ifelse(actuals > 0, actuals / predictions, 1)
kl_avg <- mean(actuals * log(ratio) - actuals + predictions)

cat("Averaged KL divergence:", round(kl_avg, 4), "\n")
# Factorize `color` in test set
test_set['color'] = test_set['color'].astype('category')
# Predict
predictions = poisson_model2.predict(test_set)
actuals = test_set['satell']
# Averaged KL divergence (0 * log(0) terms are treated as 0)
ratio = np.where(actuals > 0, actuals / predictions, 1.0)
kl_avg = (actuals * np.log(ratio) - actuals + predictions).mean()

print(f"Averaged KL divergence: {kl_avg:.4f}")

The Poisson residual deviance equals twice the sum of these KL divergence terms across all data points. It measures the difference in log-likelihood between a saturated model (perfect fit) and the fitted model, as defined in the Goodness of Fit section. Lower deviance values indicate that the model predictions are closer to the observed data.

10.10 Storytelling

The data case explores the relationship between satellite size, i.e., the number of male horseshoe crabs around the female horseshoe crab breeding nests, and several phenotypic characteristics of the female horseshoe crabs. The graphical display suggests that there is a dependence between satellite sizes and female horseshoe crab shell widths. The Poisson regression modelling suggests that, at the 5% significance level and holding the other variables constant, shell width is significantly related to satellite size and the medium dark color is significantly different from the medium light baseline color.

10.11 Practice Problems

In this final section, you can test your understanding with the following conceptual interactive exercises, and then proceed to work through an inferential analysis of a simple dataset.

10.11.1 Conceptual Questions

  1. What makes Poisson regression appropriate for modelling the number of satellites per crab?
Click to reveal answer The response variable (number of satellites) is a non-negative integer, shows right skewness, and represents event counts for a fixed unit (one crab), all of which make Poisson regression appropriate.
  1. What is the key assumption made about the distribution of the response by a classical Poisson regression model?
Click to reveal answer Equidispersion: The conditional distribution of the response given the covariates follows a Poisson distribution with mean equal to variance.
  1. Why do we use the log function in Poisson regression?
Click to reveal answer The log function (technically the log link) ensures that the predicted mean of the response variable is always positive, which is necessary since counts cannot be negative.
  1. In the univariate model, what does a positive coefficient on carapace width mean?
Click to reveal answer Larger female crabs are expected to attract more satellite males, holding all other variables constant.
  1. What can the incidence rate ratio (IRR) tell us in the context of carapace width for this dataset?
Click to reveal answer The IRR indicates the multiplicative change in the expected number of satellites for each one-unit increase in carapace width, holding other variables constant.
  1. In this chapter, why did we use binning (with grouped means) before fitting to a Poisson model?
Click to reveal answer Binning helps to visualize the relationship between carapace width and satellite size, by highlighting trends in the mean response of the data, to assess whether a log-linear trend is plausible.
  1. What does residual deviance measure in a Poisson regression model?
Click to reveal answer How far the fitted model is from the saturated model (perfect fit) – lower values indicate a better fit.
  1. How do we assess the goodness-of-fit of a Poisson regression model?
Click to reveal answer Use the residual deviance, and its associated p-value from a chi-squared test to assess goodness-of-fit: A high p-value indicates a good fit, while a low p-value indicates a lack-of-fit.
  1. What does overdispersion mean? Why is it a problem for Poisson regression?
Click to reveal answer Overdispersion occurs when the variance of the response variable exceeds its mean. This violates the equidispersion assumption, which can lead to underestimated standard errors and overstated significance of predictors.
  1. How can we diagnose overdispersion in the horseshoe crab dataset?
Click to reveal answer We can compare the residual deviance to the degrees of freedom. If the ratio is significantly greater than 1, this indicates overdispersion.
  1. Why are categorical predictors, such as color, included as indicator variables in Poisson regression?
Click to reveal answer They allow the model to estimate separate effects for each category relative to a baseline category, which allows for the analysis of how different levels of the categorical variable influence the response.
  1. Why do we interpret on the response scale (exponentiated coefficients) rather than the log scale in Poisson regression?
Click to reveal answer This is simply more intuitive, as it tells us the multiplicative effect on the expected count of the response variable for a one-unit change in the predictor.
  1. How do we interpret the coefficient for a categorical variable in Poisson regression?
Click to reveal answer The coefficient (exponentiated!) represents the change in the expected count of the response variable when moving from the baseline category to the category represented by the coefficient (holding other variables constant).

10.11.2 Coding Question

Problem 1

In this problem, you are given a set of 1,000 individuals with different demographic and socioeconomic characteristics, along with their reported number of annual doctor visits. Your task will be to perform a full Poisson regression analysis following a similar structure to this chapter. The variables involved are:

Variable Name     Description
doctor_visits     The number of visits in the past year (0-20+)
age               Age in years
income            Annual income in USD
education_years   Total years of education
married           Married or not (0 or 1)
urban             Urban residence or not (0 or 1)
insurance         Health insurance or not (0 or 1)

By the end of these questions, you will be able to describe how age, insurance and socioeconomic factors affect the number of doctor visits individuals need. You will determine whether the Poisson model is appropriate, and which predictors matter most in answering this problem.

A. Data Wrangling and Exploratory Data Analysis

Let us check if doctor_visits meets the assumptions for a classical Poisson model:

  1. Check if it is a non-negative integer.

  2. Examine its distribution, computing the mean and variance.

  3. Is equidispersion (Var \(\approx\) Mean) plausible here?

Click to reveal answer
# MASS masks dplyr::select, so restore the dplyr version
select <- dplyr::select


load("./data/poisson_regression.rda")
df <- poisson_regression  

# Convert to categorical 
df <- df %>%
  mutate(
    married   = factor(married),
    urban     = factor(urban),
    insurance = factor(insurance)
  )

############################################################
# A. Poisson Regression Assumptions
############################################################

summary(df$doctor_visits)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    3.00    5.00    5.13    7.00   21.00 
mean_val <- mean(df$doctor_visits)
var_val  <- var(df$doctor_visits)
c(mean = mean_val, variance = var_val)
    mean variance 
5.130000 9.624725 
table(df$doctor_visits)

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  20  21 
 18  59 108 155 140 151  93  84  62  43  28  23  11  10   1   3   5   3   2   1 
cat("Mean =", mean_val, "Variance =", var_val, "\n")
Mean = 5.13 Variance = 9.624725 
############################################################
# Imports
############################################################

import pyreadr
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

sns.set(style="whitegrid")

############################################################
# Load data (RDA)
############################################################

result = pyreadr.read_r("./data/poisson_regression.rda")
df = result["poisson_regression"]

# Convert categorical variables
for col in ["married", "urban", "insurance"]:
    df[col] = df[col].astype("category")

############################################################
# A. Poisson Regression Assumptions
############################################################

print(df["doctor_visits"].describe())
count    1000.000000
mean        5.130000
std         3.102374
min         0.000000
25%         3.000000
50%         5.000000
75%         7.000000
max        21.000000
Name: doctor_visits, dtype: float64
mean_val = df["doctor_visits"].mean()
var_val  = df["doctor_visits"].var()

print({"mean": mean_val, "variance": var_val})
{'mean': np.float64(5.13), 'variance': np.float64(9.624724724724727)}
print(df["doctor_visits"].value_counts().sort_index())
doctor_visits
0.0      18
1.0      59
2.0     108
3.0     155
4.0     140
5.0     151
6.0      93
7.0      84
8.0      62
9.0      43
10.0     28
11.0     23
12.0     11
13.0     10
14.0      1
15.0      3
16.0      5
17.0      3
20.0      2
21.0      1
Name: count, dtype: int64
We see that the distribution is right-skewed, with many small counts and a few larger values. The sample mean (5.13) and variance (9.62) are of the same order of magnitude, although the variance is noticeably larger than the mean, so equidispersion is only roughly plausible and overdispersion should be checked once a model is fit.

B. A Graphical Look

  1. Create a scatterplot of doctor_visits vs age, adding jitter to the points for visibility.

  2. Comment on the (1) skewness, (2) heteroskedasticity and (3) whether the variability increases with the mean, relating this to your observation in question A3.

Click to reveal answer
############################################################
# B. A Graphical Look
############################################################

# Jitter scatter with marginal histograms
p_scatter <- ggplot(df, aes(x = age, y = doctor_visits)) +
  geom_jitter(alpha = 0.3, width = 0.2, height = 0.2) +
  labs(title = "Doctor Visits vs Age")

p_hist_x <- ggplot(df, aes(x = age)) + geom_histogram(bins = 30)
p_hist_y <- ggplot(df, aes(x = doctor_visits)) + geom_histogram(bins = 30)

print(p_scatter)

print(p_hist_x)

print(p_hist_y)

############################################################
# B. A Graphical Look
############################################################

plt.figure(figsize=(6,4))
plt.scatter(df["age"], df["doctor_visits"], alpha=0.3)
plt.xlabel("Age")
plt.ylabel("Doctor Visits")
plt.title("Doctor Visits vs Age")
plt.show()

plt.figure(figsize=(6,4))
plt.hist(df["age"], bins=30)
plt.title("Age Distribution")
plt.show()

plt.figure(figsize=(6,4))
plt.hist(df["doctor_visits"], bins=30)
plt.title("Doctor Visits Distribution")
plt.show()

The distribution of doctor_visits is right-skewed, the spread increases with age, and the variability appears larger at higher mean values. This is consistent with the Poisson mean–variance relationship noted in A3.

C. Data Wrangling

Now, prepare the data for modelling:

  1. Convert all appropriate variables to factors. Report the number of levels in each factor to verify this is working.

  2. Drop any missing observations.

  3. Produce a summary table of all variables.

Click to reveal answer
############################################################
# C. Data Wrangling
############################################################

summary(df)
      age            income      education_years urban   married insurance
 Min.   : 2.70   Min.   :-7021   Min.   :10.00   0:293   0:498   0:422    
 1st Qu.:28.18   1st Qu.:40416   1st Qu.:12.00   1:707   1:502   1:578    
 Median :34.60   Median :50727   Median :15.00                            
 Mean   :34.60   Mean   :50126   Mean   :15.07                            
 3rd Qu.:41.70   3rd Qu.:59808   3rd Qu.:18.00                            
 Max.   :70.70   Max.   :92761   Max.   :20.00                            
 doctor_visits  
 Min.   : 0.00  
 1st Qu.: 3.00  
 Median : 5.00  
 Mean   : 5.13  
 3rd Qu.: 7.00  
 Max.   :21.00  
lapply(df[c("married","urban","insurance")], table)
$married

  0   1 
498 502 

$urban

  0   1 
293 707 

$insurance

  0   1 
422 578 
############################################################
# C. Data Wrangling
############################################################

print(df.describe(include="all"))
                age        income  ...  insurance  doctor_visits
count   1000.000000   1000.000000  ...     1000.0    1000.000000
unique          NaN           NaN  ...        2.0            NaN
top             NaN           NaN  ...        1.0            NaN
freq            NaN           NaN  ...      578.0            NaN
mean      34.604300  50125.837370  ...        NaN       5.130000
std       10.013233  14377.333877  ...        NaN       3.102374
min        2.700000  -7020.670000  ...        NaN       0.000000
25%       28.175000  40416.130000  ...        NaN       3.000000
50%       34.600000  50726.830000  ...        NaN       5.000000
75%       41.700000  59808.250000  ...        NaN       7.000000
max       70.700000  92760.620000  ...        NaN      21.000000

[11 rows x 7 columns]
for col in ["married", "urban", "insurance"]:
    print(f"\n{col} counts:")
    print(df[col].value_counts())

married counts:
married
1.0    502
0.0    498
Name: count, dtype: int64

urban counts:
urban
1.0    707
0.0    293
Name: count, dtype: int64

insurance counts:
insurance
1.0    578
0.0    422
Name: count, dtype: int64

The variables married, urban, and insurance are binary indicators with a small number of discrete levels, so they should be treated as factors. The summaries above also report complete counts of 1000 for every variable, so no observations need to be dropped for missingness.
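As a minimal sketch of the conversion step (assuming the raw married, urban, and insurance columns are numeric 0/1 indicators), the pandas version used by the later prediction code could look like:

# Treat the binary indicators as categorical (factor-like) columns,
# report the number of levels, and drop any incomplete rows
for col in ["married", "urban", "insurance"]:
    df[col] = df[col].astype("category")
    print(col, "levels:", df[col].cat.categories.size)

df = df.dropna()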

D. Exploratory Data Analysis

Following the horseshoe crab example closely:

  1. Bin age into 10 equally-sized groups, and compute the mean age and doctor_visits of each.

  2. Overlay the bin means on the scatterplot produced in question B1.

  3. Comment on the trend: How does doctor visitation vary with age? Does this trend (if any) appear linear?

Click to reveal answer
############################################################
# D. Exploratory Data Analysis
############################################################

df_binned <- df %>%
  mutate(age_bin = ntile(age, 10)) %>%
  group_by(age_bin) %>%
  summarize(
    mean_age = mean(age),
    mean_visits = mean(doctor_visits),
    .groups = "drop"
  )

ggplot(df, aes(x = age, y = doctor_visits)) +
  geom_jitter(alpha = 0.25) +
  geom_point(data = df_binned, aes(x = mean_age, y = mean_visits),
             color = "red", size = 3) +
  geom_line(data = df_binned, aes(x = mean_age, y = mean_visits),
            color = "red") +
  labs(title = "Binned Mean Doctor Visits vs Age")

############################################################
# D. Exploratory Data Analysis (Binning)
############################################################

df["age_bin"] = pd.qcut(df["age"], 10, labels=False)

df_binned = (
    df.groupby("age_bin")
      .agg(mean_age=("age", "mean"),
           mean_visits=("doctor_visits", "mean"))
      .reset_index()
)

plt.figure(figsize=(6,4))
plt.scatter(df["age"], df["doctor_visits"], alpha=0.25)
plt.plot(df_binned["mean_age"], df_binned["mean_visits"],
         color="red", marker="o")
plt.title("Binned Mean Doctor Visits vs Age")
plt.xlabel("Age")
plt.ylabel("Doctor Visits")
plt.show()

The number of doctor visits tends to increase with age on average: the binned means show a smooth upward trend that is not strictly linear on the response scale.
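Because the Poisson model is linear in age on the log scale, one quick check of the trend's shape is to plot the logarithm of the binned means (a minimal sketch reusing df_binned from the Python answer above):

# If log(mean visits) is roughly a straight line in age, a log-linear
# (Poisson) model for the mean is a reasonable description of the trend
plt.figure(figsize=(6,4))
plt.plot(df_binned["mean_age"], np.log(df_binned["mean_visits"]), marker="o")
plt.xlabel("Mean Age (bin)")
plt.ylabel("log(Mean Doctor Visits)")
plt.title("Log of Binned Mean Visits vs Age")
plt.show()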

E. Simple Data Modelling & Estimation

We will model the expected number of visits of some individual \(i\) as \(\log(\lambda_i) = \beta_0 + \beta_1 X_{i,\text{age}}\).

  1. Fit the univariate Poisson regression above.

  2. Print a coefficient table.

  3. Compute and display the incidence rate ratio, \(\text{IRR} = \exp(\beta_1)\).

  4. Compute and display the 95% confidence interval for \(\beta_1\).

Click to reveal answer
############################################################
# E. Simple Data Modelling & Estimation
############################################################

mod_uni <- glm(doctor_visits ~ age, family = poisson(), data = df)
summary(mod_uni)

Call:
glm(formula = doctor_visits ~ age, family = poisson(), data = df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 0.545217   0.054529   9.999   <2e-16 ***
age         0.030181   0.001401  21.537   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1822.8  on 999  degrees of freedom
Residual deviance: 1357.5  on 998  degrees of freedom
AIC: 4662.7

Number of Fisher Scoring iterations: 4
tidy(mod_uni, conf.int = TRUE) %>%
  mutate(IRR = exp(estimate),
         IRR_low = exp(conf.low),
         IRR_high = exp(conf.high))
# A tibble: 2 × 10
  term   estimate std.error statistic   p.value conf.low conf.high   IRR IRR_low
  <chr>     <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl> <dbl>   <dbl>
1 (Inte…   0.545    0.0545      10.00 1.54e- 23   0.438     0.652   1.72    1.55
2 age      0.0302   0.00140     21.5  7.04e-103   0.0274    0.0329  1.03    1.03
# ℹ 1 more variable: IRR_high <dbl>
############################################################
# E. Simple Poisson Regression
############################################################

mod_uni = smf.glm(
    "doctor_visits ~ age",
    data=df,
    family=sm.families.Poisson()
).fit()

print(mod_uni.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:          doctor_visits   No. Observations:                 1000
Model:                            GLM   Df Residuals:                      998
Model Family:                 Poisson   Df Model:                            1
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2329.3
Date:                Mon, 09 Feb 2026   Deviance:                       1357.5
Time:                        14:27:57   Pearson chi2:                 1.30e+03
No. Iterations:                     4   Pseudo R-squ. (CS):             0.3720
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.5452      0.055      9.999      0.000       0.438       0.652
age            0.0302      0.001     21.536      0.000       0.027       0.033
==============================================================================
# IRRs + confidence intervals
uni_params = mod_uni.params
uni_ci = mod_uni.conf_int()

uni_results = pd.DataFrame({
    "estimate": uni_params,
    "conf_low": uni_ci[0],
    "conf_high": uni_ci[1],
    "IRR": np.exp(uni_params),
    "IRR_low": np.exp(uni_ci[0]),
    "IRR_high": np.exp(uni_ci[1])
})

print(uni_results)
           estimate  conf_low  conf_high       IRR   IRR_low  IRR_high
Intercept  0.545217  0.438340   0.652093  1.724982  1.550132  1.919554
age        0.030181  0.027435   0.032928  1.030641  1.027814  1.033476

The IRR is the multiplicative change in the expected number of doctor visits associated with a one-year increase in age. The 95% confidence interval for \(\beta_1\) does not include zero, indicating a statistically significant association between age and doctor visits.
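As a quick worked example using the estimates above, \(\exp(\hat{\beta}_1) = \exp(0.0302) \approx 1.031\), so each additional year of age multiplies the expected number of visits by about 1.03 (roughly a 3% increase), and a ten-year increase in age multiplies it by \(\exp(10 \times 0.0302) \approx 1.35\).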

F. Goodness of Fit

Using your univariate model from question E:

  1. Report the residual deviance and the degrees of freedom.

  2. Compute a chi-square goodness-of-fit test.

  3. Compute the Pearson dispersion statistic.

  4. Does this indicate overdispersion or underdispersion, and how do you know?

Click to reveal answer
############################################################
# F. Goodness of Fit
############################################################

res_dev  <- deviance(mod_uni)
df_res   <- df.residual(mod_uni)
gof_pval <- 1 - pchisq(res_dev, df_res)

pearson_X2 <- sum(residuals(mod_uni, type = "pearson")^2)
dispersion <- pearson_X2 / df_res

cat("Residual deviance =", res_dev, "df =", df_res, "GOF p-value =", gof_pval, "\n")
Residual deviance = 1357.524 df = 998 GOF p-value = 1.957323e-13 
cat("Pearson X2 =", pearson_X2, "Dispersion =", dispersion, "\n")
Pearson X2 = 1299.367 Dispersion = 1.301971 
############################################################
# F. Goodness of Fit
############################################################

res_dev = mod_uni.deviance
df_res  = mod_uni.df_resid
gof_pval = 1 - chi2.cdf(res_dev, df_res)

pearson_X2 = np.sum(mod_uni.resid_pearson**2)
dispersion = pearson_X2 / df_res

print(f"Residual deviance = {res_dev:.3f}")
Residual deviance = 1357.524
print(f"df = {df_res}")
df = 998
print(f"GOF p-value = {gof_pval:.4f}")
GOF p-value = 0.0000
print(f"Pearson X2 = {pearson_X2:.3f}")
Pearson X2 = 1299.367
print(f"Dispersion = {dispersion:.3f}")
Dispersion = 1.302

The dispersion statistic is greater than 1, indicating mild overdispersion relative to the Poisson assumption.
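For reference, the dispersion statistic computed above is the Pearson chi-square divided by its residual degrees of freedom, \[ \hat{\phi} = \frac{X^2}{n - p} = \frac{1}{n - p}\sum_{i=1}^{n}\frac{(y_i - \hat{\lambda}_i)^2}{\hat{\lambda}_i} = \frac{1299.367}{998} \approx 1.30, \] whereas a value near 1 is expected under exact equidispersion.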

G. Inference

  1. Perform the Wald test for \(\beta_1\).

  2. Determine the 95% confidence interval for the IRR computed in question E3.

  3. Explain how the estimated coefficient \(\hat{\beta}_1\) affects the relationship between age and the expected number of doctor visits. In particular:

    1. Describe how the expected number of doctor visits changes as age increases.

    2. Is the relationship linear, exponential, or something else? Explain your reasoning.

    3. Do equal increases in age lead to multiplicative or additive changes in the expected number of visits? Why?

Click to reveal answer
############################################################
# G. Inference
############################################################

coef_uni <- tidy(mod_uni, conf.int = TRUE)
coef_uni
# A tibble: 2 × 7
  term        estimate std.error statistic   p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)   0.545    0.0545      10.00 1.54e- 23   0.438     0.652 
2 age           0.0302   0.00140     21.5  7.04e-103   0.0274    0.0329
cat("IRR for age =", exp(coef(mod_uni)["age"]), "\n")
IRR for age = 1.030641 

The estimated coefficient \(\hat{\beta}_1\) is included in the model through the log link function, \[ \log(\lambda_i) = \beta_0 + \beta_1 \,\text{age}_i. \] This implies that the expected number of doctor visits satisfies \[ \lambda_i = \exp(\beta_0)\exp(\beta_1 \,\text{age}_i). \]

As age increases, the expected number of doctor visits increases multiplicatively rather than additively: Each one-year increase in age multiplies the expected number of visits by a constant factor equal to the incidence rate ratio (IRR), i.e. \(\exp(\hat{\beta}_1)\).

Therefore, the relationship between age and expected doctor visits is exponential in the response scale and linear in the log scale. Equal increases in age produce proportionally equal changes in the expected count (a feature of all Poisson regression models).

############################################################
# G. Inference
############################################################

print("IRR for age =", np.exp(mod_uni.params["age"]))
IRR for age = 1.0306414682533183
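For completeness, a short Python sketch of the Wald test and the confidence interval for the IRR (reusing mod_uni from part E):

# Wald z statistic and p-value for the age coefficient
print("Wald z =", mod_uni.tvalues["age"], "p-value =", mod_uni.pvalues["age"])

# 95% confidence interval for the IRR: exponentiate the CI for beta_1
ci_age = mod_uni.conf_int().loc["age"]
print("IRR 95% CI:", np.exp(ci_age[0]), "to", np.exp(ci_age[1]))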


H. Multivariate Poisson Regression

Now, extend the model to: \[\log(\lambda_i) = \beta_0 + \sum_{j}\beta_j X_{i,j},\] where \(j\) runs over age, income, education_years, married, urban, and insurance.

  1. Fit the multivariable model above.

  2. Compute the IRRs and 95% confidence intervals.

  3. Interpret the effect of insurance (insured or not).

  4. Which predictors are the most significant? How do you know?

  5. Fit another model with age * insurance, and perform a likelihood-ratio test vs the additive model above.

  6. Based on the result, does insurance modify the effect of age?

Click to reveal answer
############################################################
# H. Multivariate Poisson Regression
############################################################

form_multi <- doctor_visits ~ age + income + education_years +
  married + urban + insurance

mod_multi <- glm(form_multi, family = poisson(), data = df)
summary(mod_multi)

Call:
glm(formula = form_multi, family = poisson(), data = df)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.710e-02  1.073e-01   0.159    0.873    
age              2.913e-02  1.396e-03  20.863  < 2e-16 ***
income          -1.165e-05  9.493e-07 -12.276  < 2e-16 ***
education_years  5.647e-02  4.571e-03  12.353  < 2e-16 ***
married1         1.229e-01  2.814e-02   4.369 1.25e-05 ***
urban1           1.742e-01  3.181e-02   5.474 4.40e-08 ***
insurance1       1.359e-01  2.872e-02   4.730 2.24e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1822.8  on 999  degrees of freedom
Residual deviance: 1002.2  on 993  degrees of freedom
AIC: 4317.3

Number of Fisher Scoring iterations: 4
tidy(mod_multi, conf.int = TRUE) %>%
  mutate(IRR = exp(estimate),
         IRR_low = exp(conf.low),
         IRR_high = exp(conf.high))
# A tibble: 7 × 10
  term    estimate std.error statistic  p.value conf.low conf.high   IRR IRR_low
  <chr>      <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl> <dbl>   <dbl>
1 (Inter…  1.71e-2   1.07e-1     0.159 8.73e- 1 -1.94e-1   2.27e-1 1.02    0.824
2 age      2.91e-2   1.40e-3    20.9   1.15e-96  2.64e-2   3.19e-2 1.03    1.03 
3 income  -1.17e-5   9.49e-7   -12.3   1.22e-34 -1.35e-5  -9.79e-6 1.000   1.000
4 educat…  5.65e-2   4.57e-3    12.4   4.67e-35  4.75e-2   6.54e-2 1.06    1.05 
5 marrie…  1.23e-1   2.81e-2     4.37  1.25e- 5  6.78e-2   1.78e-1 1.13    1.07 
6 urban1   1.74e-1   3.18e-2     5.47  4.40e- 8  1.12e-1   2.37e-1 1.19    1.12 
7 insura…  1.36e-1   2.87e-2     4.73  2.24e- 6  7.97e-2   1.92e-1 1.15    1.08 
# ℹ 1 more variable: IRR_high <dbl>
# Interaction: age * insurance
mod_int <- glm(doctor_visits ~ age * insurance + income + education_years +
                 married + urban, family = poisson(), data = df)

anova(mod_multi, mod_int, test = "Chisq")
Analysis of Deviance Table

Model 1: doctor_visits ~ age + income + education_years + married + urban + 
    insurance
Model 2: doctor_visits ~ age * insurance + income + education_years + 
    married + urban
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       993     1002.2                     
2       992     1001.8  1  0.43642   0.5089
############################################################
# H. Multivariate Poisson Regression
############################################################

formula_multi = (
    "doctor_visits ~ age + income + education_years "
    "+ married + urban + insurance"
)

mod_multi = smf.glm(
    formula_multi,
    data=df,
    family=sm.families.Poisson()
).fit()

print(mod_multi.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:          doctor_visits   No. Observations:                 1000
Model:                            GLM   Df Residuals:                      993
Model Family:                 Poisson   Df Model:                            6
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2151.7
Date:                Mon, 09 Feb 2026   Deviance:                       1002.2
Time:                        14:27:57   Pearson chi2:                     940.
No. Iterations:                     4   Pseudo R-squ. (CS):             0.5598
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            0.0171      0.107      0.159      0.873      -0.193       0.227
married[T.1.0]       0.1229      0.028      4.369      0.000       0.068       0.178
urban[T.1.0]         0.1742      0.032      5.474      0.000       0.112       0.237
insurance[T.1.0]     0.1359      0.029      4.730      0.000       0.080       0.192
age                  0.0291      0.001     20.863      0.000       0.026       0.032
income           -1.165e-05   9.49e-07    -12.276      0.000   -1.35e-05   -9.79e-06
education_years      0.0565      0.005     12.353      0.000       0.048       0.065
====================================================================================
multi_params = mod_multi.params
multi_ci = mod_multi.conf_int()

multi_results = pd.DataFrame({
    "estimate": multi_params,
    "conf_low": multi_ci[0],
    "conf_high": multi_ci[1],
    "IRR": np.exp(multi_params),
    "IRR_low": np.exp(multi_ci[0]),
    "IRR_high": np.exp(multi_ci[1])
})

print(multi_results)
                  estimate  conf_low  conf_high       IRR   IRR_low  IRR_high
Intercept         0.017104 -0.193212   0.227420  1.017251  0.824308  1.255357
married[T.1.0]    0.122949  0.067791   0.178106  1.130826  1.070142  1.194953
urban[T.1.0]      0.174160  0.111803   0.236516  1.190246  1.118293  1.266828
insurance[T.1.0]  0.135868  0.079568   0.192167  1.145530  1.082820  1.211873
age               0.029131  0.026394   0.031868  1.029559  1.026746  1.032381
income           -0.000012 -0.000014  -0.000010  0.999988  0.999986  0.999990
education_years   0.056471  0.047511   0.065431  1.058096  1.048658  1.067619
# Interaction model
mod_int = smf.glm(
    "doctor_visits ~ age * insurance + income + education_years + married + urban",
    data=df,
    family=sm.families.Poisson()
).fit()

lr_stat = 2 * (mod_int.llf - mod_multi.llf)
df_diff = mod_int.df_model - mod_multi.df_model
p_lr = 1 - chi2.cdf(lr_stat, df_diff)

print("LR test statistic =", lr_stat)
LR test statistic = 0.43641994780045934
print("df =", df_diff)
df = 1
print("p-value =", p_lr)
p-value = 0.5088554748118505

Holding all other variables constant, insured individuals are expected to have about 14.6% more doctor visits than uninsured individuals (IRR ≈ 1.15). The 95% confidence interval for the IRR does not include 1, indicating this effect is statistically significant.

All of the predictors are statistically significant: every 95% confidence interval for the IRR excludes 1. Judged by the size of the Wald z statistics, age, income, and education_years are the most significant, with age showing the most precise and consistent effect; income, despite its high significance, has a negligible effect size per dollar. Finally, the likelihood-ratio test comparing the additive model to the model with the age * insurance interaction gives a p-value of about 0.51, so there is no evidence that insurance modifies the effect of age.

I. Prediction & KL Divergence

Using the multivariate model from question H:

  1. Predict the expected number of doctor visits over a grid of age values spanning the observed range of ages.

  2. Plot predicted visits for insured and uninsured groups, including 95% confidence interval bands.

  3. Compute the KL divergence. How does this value reflect the model fit?

Click to reveal answer
############################################################
# I. Prediction & KL Divergence
############################################################

# Prediction grid for ages
age_grid <- seq(min(df$age), max(df$age), length.out = 100)
newdata <- expand.grid(
  age = age_grid,
  income = mean(df$income),
  education_years = mean(df$education_years),
  married = levels(df$married)[1],
  urban = levels(df$urban)[1],
  insurance = levels(df$insurance)
)

pred_link <- predict(mod_multi, newdata, type = "link", se.fit = TRUE)

newdata$fit <- exp(pred_link$fit)
newdata$lower95 <- exp(pred_link$fit - 1.96 * pred_link$se.fit)
newdata$upper95 <- exp(pred_link$fit + 1.96 * pred_link$se.fit)

ggplot(newdata, aes(x = age, y = fit, color = insurance)) +
  geom_line(linewidth = 1) +
  geom_ribbon(aes(ymin = lower95, ymax = upper95, fill = insurance),
              alpha = 0.15, color = NA) +
  labs(y = "Predicted Doctor Visits", title = "Predicted Visits by Age and Insurance")

# KL Divergence
lambda_hat <- fitted(mod_multi)
y <- df$doctor_visits
KL_i <- ifelse(y == 0,
               -y + lambda_hat,
               y * log(y / lambda_hat) - y + lambda_hat)
KL_avg <- mean(KL_i)

cat("Average KL divergence =", KL_avg, "\n")
Average KL divergence = 0.5010995 
############################################################
# I. Prediction & KL Divergence
############################################################

age_grid = np.linspace(df["age"].min(), df["age"].max(), 100)

newdata = pd.DataFrame({
    "age": np.repeat(age_grid, df["insurance"].cat.categories.size),
    "income": df["income"].mean(),
    "education_years": df["education_years"].mean(),
    "married": df["married"].cat.categories[0],
    "urban": df["urban"].cat.categories[0],
    "insurance": np.tile(df["insurance"].cat.categories, 100)
})

# Ensure categories match
for col in ["married", "urban", "insurance"]:
    newdata[col] = newdata[col].astype("category")
    newdata[col] = newdata[col].cat.set_categories(df[col].cat.categories)

# For a GLM, get_prediction() returns fitted means and confidence intervals
# on the response (count) scale, so no further exponentiation is needed
pred = mod_multi.get_prediction(newdata)
pred_summary = pred.summary_frame()

newdata["fit"] = pred_summary["mean"]
newdata["lower95"] = pred_summary["mean_ci_lower"]
newdata["upper95"] = pred_summary["mean_ci_upper"]

plt.figure(figsize=(7,5))
# Plot each insurance group separately so each CI band matches its own curve
for level, grp in newdata.groupby("insurance", observed=True):
    plt.plot(grp["age"], grp["fit"], label=str(level))
    plt.fill_between(grp["age"], grp["lower95"], grp["upper95"], alpha=0.15)
plt.xlabel("Age")
plt.ylabel("Predicted Doctor Visits")
plt.title("Predicted Visits by Age and Insurance")
plt.legend(title="insurance")
plt.show()

# KL divergence
lambda_hat = mod_multi.fittedvalues.values
y = df["doctor_visits"].values

# Per-observation KL divergence between Poisson(y_i) and Poisson(lambda_hat_i);
# the y*log(y/lambda_hat) term is defined as 0 when y = 0, which avoids the
# divide-by-zero warning from evaluating log at the zero counts
KL_i = lambda_hat - y
pos = y > 0
KL_i[pos] += y[pos] * np.log(y[pos] / lambda_hat[pos])
KL_avg = KL_i.mean()
print("Average KL divergence =", KL_avg)
Average KL divergence = 0.5010994579709896

A smaller average KL divergence indicates that the fitted Poisson model produces predicted counts that are close to the observed data, reflecting better model fit.
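For reference, the per-observation quantity computed above is \[ \text{KL}_i = y_i \log\!\left(\frac{y_i}{\hat{\lambda}_i}\right) - y_i + \hat{\lambda}_i, \] which is the Kullback–Leibler divergence between a Poisson distribution with mean \(y_i\) and one with mean \(\hat{\lambda}_i\), using the convention that \(y_i \log(y_i/\hat{\lambda}_i) = 0\) when \(y_i = 0\). The reported value is the average of these per-observation divergences.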

J. Storytelling

The goal of this analysis was to understand how individual characteristics are associated with the expected number of doctor visits.

  1. Summarize the main findings of the Poisson regression analysis. Which variables are most strongly associated with doctor visitation, and in what direction?

  2. How do age, insurance status, and education jointly shape expected doctor visits? Describe the relationship in substantive terms rather than model coefficients.

  3. Assess whether the classical Poisson model appears to be an appropriate choice for this dataset. What evidence supports or weakens this conclusion?

  4. Based on the fitted model, what broad conclusions can be drawn about patterns of healthcare use in this population?

Click to reveal answer

The Poisson regression analysis shows that doctor visitation increases systematically with age, education, insurance coverage, being married, and urban residence, while higher income is associated with a small decrease in expected visits. Among the positive effects, age and education show the strongest and most precise associations, and insurance coverage is associated with roughly 15% more expected visits relative to being uninsured.

Older individuals and those with higher levels of education tend to visit doctors more frequently, and insured individuals have higher expected doctor use across all ages. Insurance primarily shifts the overall level of doctor visits upward rather than changing how visits increase with age.

The classical Poisson model appears reasonably appropriate for this dataset. The outcome variable is a non-negative count, the mean–variance relationship is approximately consistent with Poisson assumptions, and goodness-of-fit diagnostics show only mild overdispersion. Alternative models could be considered, but there is no strong evidence that the Poisson model is fundamentally misspecified.

Overall, the results suggest that healthcare use in this population is strongly patterned by age and socioeconomic factors, with insurance coverage playing an important role in increasing access to care. The model provides a coherent and interpretable summary of how these factors relate to doctor visitation rates.