\( \newcommand{\bm}[1]{\boldsymbol{#1}} \)

Chapter 6 Model Checking

In this chapter, we discuss in detail each of the criteria used to compare the models that were investigated in the development of the ACA.

6.1 WAIC

Model fit was assessed using WAIC. The advantages of WAIC over DIC are discussed in Section 3.4.3. Recall that there are two variants of WAIC, depending on how the term \(p_{\text{WAIC}}\) is defined. For this investigation, we used the second variant, which yields the criterion:

\[\begin{align*} \text{WAIC}_2 & = 2p_{\text{WAIC }2} - 2 \text{ LPPD} \\ & = 2\sum_{i=1}^N \text{Var}_{\text{post}}\left[\log p(y_i|\theta_i)\right] - 2\log \prod_{i=1}^N p_{\text{post}}(y_i) \\ & = 2\sum_{i=1}^N \text{Var} \left[\log p(y_i|\theta_i) | y_i\right] - 2\log \prod_{i=1}^N \text{E}_{\bm{\theta}} \left[p(y_i | \theta_i) | y_i\right] \\ & = 2\sum_{i=1}^N \frac{1}{M-1} \sum_{m=1}^M \left[ \log p\left(y_i \bigg| \theta_i^{(m)}\right) -\frac{1}{M} \sum_{m'=1}^M \log p\left(y_i \bigg| \theta_i^{(m')}\right) \right]^2 \\ & \hspace{2em} - 2\sum_{i=1}^N \log \frac{1}{M} \sum_{m=1}^M p \left(y_i \bigg| \theta_i^{(m)}\right)\\ & = 2\sum_{i=1}^N \left\{ \underset{m=1,\ldots,M}{\text{var}}\left[\log L_i^{(m)}\right] - \log \frac{1}{M} \sum_{m=1}^M L_i^{(m)} \right\} \end{align*}\]

where \[L_i^{(m)} = p \left(y_i \bigg| \theta_i^{(m)}\right)\] is the likelihood of the \(i^{\text{th}}\) observation evaluated at the \(m^{\text{th}}\) posterior sample of the parameters.

Given a posterior sample in R, computing the WAIC is straightforward as the following R code demonstrates.

# Simulate MCMC sample of Poisson rates
N <- 100   # number of observations (e.g. areas)
M <- 5000  # number of MCMC iterations
set.seed(456)
eta <- abs(rnorm(N, 0, 20))
# M x N matrix of MCMC samples centred on the rates simulated above
eta <- abs(matrix(eta, M, N, byrow = TRUE) + matrix(rnorm(M*N, 0, 0.5), M, N))

# Simulate observations (we just use first sample)
y <- rpois(N, eta[1,])

# User-defined function for computing WAIC
#   theta: M x N matrix of MCMC samples (one column per observation)
#   y:     vector of N observations
#   L:     likelihood function with signature L(y, theta, log), e.g. dpois
get.WAIC <- function(theta, y, L){
    # Pointwise evaluation of the log-likelihood and likelihood
    LogLike <- matrix(NA, nrow(theta), ncol(theta))
    for(i in 1:ncol(theta)){
        LogLike[,i] <- L(y[i], theta[,i], log = TRUE)
    }
    Like <- exp(LogLike)
    Var <- apply(LogLike, 2, var)   # pointwise variance of the log-likelihood
    S <- apply(Like, 2, mean)       # pointwise mean of the likelihood
    WAIC <- 2 * sum(Var - log(S))
    return(WAIC)
}

# Compute WAIC
WAIC <- get.WAIC(theta = eta, y = y, L = dpois)
WAIC
## [1] 518.6738

Note that this user-defined function assumes the parameter \(\theta_i\) for each observation is univariate, so that the MCMC samples form an M x N matrix. This code would need to be extended to accommodate multivariate data, e.g. spatio-temporal data, or multi-parameter likelihoods, e.g. a Gaussian likelihood with unknown mean and variance; a sketch of the latter is given below. It should also be fairly straightforward to modify this function to compute the alternative WAIC formula and/or other model fit criteria such as the DIC.
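
As an illustration of the second extension, the following sketch (not part of the ACA code) adapts the same approach to a Gaussian likelihood with unknown mean and standard deviation; mu and sigma are hypothetical M x N matrices of posterior draws.

# A minimal sketch (hypothetical objects): mu and sigma are M x N matrices of
# posterior draws of the mean and standard deviation, y is the vector of N
# observations.
get.WAIC.norm <- function(mu, sigma, y){
    LogLike <- matrix(NA, nrow(mu), ncol(mu))
    for(i in 1:ncol(mu)){
        LogLike[,i] <- dnorm(y[i], mu[,i], sigma[,i], log = TRUE)
    }
    Like <- exp(LogLike)
    2 * sum(apply(LogLike, 2, var) - log(apply(Like, 2, mean)))
}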

To rank the models in the simulation study, the WAIC was computed for each model, and then the ventiles (20-quantiles) were calculated from that sample of WAIC values. This gave each of the models a rank between 1 and 20.
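
For example, assuming waic is a hypothetical vector containing the WAIC value for each model, the ventile-based ranks could be computed along the following lines.

# A minimal sketch (hypothetical object): waic is a vector of WAIC values,
# one per model. Allocate the models to 20 equal-sized groups (ventiles),
# giving each model a rank between 1 and 20.
rank.waic <- ceiling(20 * rank(waic, ties.method = "first") / length(waic))
# (similar in spirit to dplyr::ntile(waic, 20))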

6.2 Computation Time

Computation time was less of a priority than achieving a good model fit and satisfying the model assumptions. However, if two models performed equally well in terms of other criteria, then a faster computation time is desirable. Moreover, there are plans to extend the models used in the ACA to include additional covariates and potentially another dimension for time. These extensions have the potential to increase the computational costs substantially.

For this criterion, the models in the simulation study were ranked according to the ventiles (20-quantiles) of the computation time. This gave each of the models a rank between 1 and 20.

Note that the software, the data (number of observations), and the model (number of parameters and complexity of hierarchical structure) can all have an impact on computation time. For the simulation study, the main factors affecting the computation time were the software and the models. This was also true for the ACA. CARBayes, which was used for the incidence models in the ACA, was exceptionally fast compared to WinBUGS, which was used for the survival models (since CARBayes was not capable of implementing our formulation of survival models). However, there was still some variation in computation time due to differences in the data and model structure, especially for the survival models. The computation times for the simulation study are given above in Section 5.4.1. The computation times for the survival models for the ACA are shown below:

The computation times for the incidence models were much more consistent - approximately 550s (0.15 hours) for each cancer/sex.

6.3 Moran's I Statistic

Moran's I statistic (Moran 1950) measures the degree of autocorrelation in a set of data given a matrix of weights that describes the dependence between each pair of observations. This can be used to assess the model assumption that the residuals are independent and identically distributed. The idea is that a spatial model ought to account for the spatial autocorrelation in the data in such a way that the model residuals resemble white noise. So rather than computing Moran's I for the observed values, Moran's I is applied to the residuals. The residuals are defined as the difference between the observed values and the fitted values, where for our Bayesian models the fitted values can be taken as the expected values under the likelihood evaluated at a reasonable plug-in estimate of the parameters, such as the posterior mean.

Values of Moran's I close to 0 indicate very low or no spatial autocorrelation, while values close to -1 and 1 indicate strong negative and strong positive autocorrelation respectively. The closer Moran's I is to zero, the better the model accounts for spatial autocorrelation (Anderson and Ryan 2017). Thus models can be compared by either:

  • directly comparing the values of Moran's I statistic (values above 0.2 suggest some positive spatial autocorrelation); or
  • comparing the p-values (small p-values indicate evidence to reject the null hypothesis of no spatial autocorrelation, so better models should have larger p-values).

The ape package provides an easy means of computing Moran's I statistic and its p-value. The two main inputs required are the vector of residuals and a spatial weights matrix defining the spatial dependencies between areas. For the ACA, we chose to use the first-order, binary, adjacency weights matrix (see Sections 4.3 and 5.2.1). The syntax is simply ape::Moran.I(residuals, W).
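
A minimal sketch of this is given below, assuming hypothetical objects y (observed values), fitted (fitted values at the posterior means) and W (the first-order binary adjacency matrix).

# A minimal sketch (hypothetical objects): y are the observed values, fitted
# are the fitted values at the posterior means, and W is the first-order
# binary adjacency matrix.
library(ape)
resids <- y - fitted
moran <- Moran.I(resids, weight = W)
moran$observed   # Moran's I statistic (closer to 0 is better)
moran$p.value    # p-value for the null hypothesis of no spatial autocorrelation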

Moran's I was not used as one of the criteria for ranking and comparing the models. However, it is still a useful statistic for checking that a model satisfies the residual assumptions, and it could potentially be used as an alternative method of assessing under- and over-smoothing.

6.4 Under- and Over-smoothing

These criteria were arguably the most difficult to define due to the lack of literature on this subject. Although these terms are used quite often, there does not appear to be any consensus on how to define these criteria for the purpose of evaluating or comparing models. But the broad meaning conveyed by the terms is self-explanatory: under-smoothing means the model is not smoothing enough; over-smoothing means the model is smoothing too much. The difficulty lies not only in quantifying how much smoothing is too little or too much, but also in choosing which metric to use. Often, substantial under- and over-smoothing can be detected through visualisations such as a:

  • spatial plot (i.e. map) of the estimated SIR values;
  • plot of the empirical density of the SIR; or
  • plot of point estimates (e.g. median) of the area-specific SIR values, overlaid with the raw SIR values, and for clarity, ordered by the magnitude of either the raw or estimated values (see Sections 7.3 and 8.3 for examples).

These visualisations may be useful for detecting specific cases of under- and over-smoothing, but they are not very useful for the task of comparing models. However, the third visualisation listed above suggests that the percentiles of the SIR values could be used as the basis for a metric. As the raw SIR values are smoothed towards 1, the percentiles of the estimated values also move towards 1. Therefore, it would be hoped that the range between the 5th and 95th percentiles of the estimated SIRs (a 90% interval) would be narrower than the corresponding range for the raw values. Conversely, a much narrower interval over a less extreme range, such as the 60% interval between the 20th and 80th percentiles, may be indicative of over-smoothing.

Based on these ideas, we decided to use the following criteria to quantify under- and over-smoothing (a short R sketch of the underlying counts follows the list):

  • Under-smoothing: Rank based on the number of posterior SIRs > 98th percentile of raw SIRs plus number of posterior SIRs < 2nd percentile of raw SIRs. Smallest rank corresponds to smallest number.
  • Over-smoothing: Rank based on the number of posterior SIRs > 80th percentile of raw SIRs plus number of posterior SIRs < 20th percentile of raw SIRs. Smallest rank corresponds to largest number.
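
The counts underpinning these two ranks could be computed along the following lines; sir.raw and sir.est are hypothetical vectors holding the raw SIRs and the posterior SIR estimates for one model.

# A minimal sketch (hypothetical objects): sir.raw are the raw SIRs and
# sir.est are the posterior SIR estimates from a fitted model.
q <- quantile(sir.raw, probs = c(0.02, 0.20, 0.80, 0.98))
n.under <- sum(sir.est > q["98%"]) + sum(sir.est < q["2%"])    # larger suggests under-smoothing
n.over  <- sum(sir.est > q["80%"]) + sum(sir.est < q["20%"])   # smaller suggests over-smoothing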

Some additional notes to keep in mind:

  • The level of smoothing is automatically determined by the model. However, the prior distributions can have a very large impact on the level of smoothing, especially for the prior on the variance of the (structured) spatial random effect. A prior that is too vague can lead to under-smoothing. A prior that is too narrow can lead to over-smoothing, and has a greater potential to lead to non-sensible results if it does not agree with the data.
  • Based on the literature, the BYM model is known to over-smooth. By fitting the BYM model to your data, you could use the results as a benchmark for assessing the level of smoothing produced by other models, and calibrate these models if necessary.

Since developing the ACA, new research that was inspired by the difficulty we experienced in quantifying under- and over-smoothing has led to a better understanding of this phenomenon and provided more stringent criteria (see Duncan and Mengersen (2020)).

6.5 CI Plausibility

Comparing models based on point estimates like the posterior median of the SIR is valid, but it ignores the uncertainty of those estimates. If two models had similar area-specific point estimates, but one model had much wider CIs, then the estimates from that model would be considered less reliable, or less plausible. However, models with very precise estimates (i.e. very narrow CIs) may also be less plausible. This criterion involved ranking models according to the following rules (a short sketch for inspecting CI widths follows the list):

  • rank 0 = reasonable CIs;
  • rank 1 = OK, but some areas have CIs that are too wide/precise; and
  • rank 2 = bad (CIs too precise or too wide).
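
While the ranking itself was a judgement call, inspecting the distribution of the CI widths can help inform it. A minimal sketch, assuming sir.chain is a hypothetical M x N matrix of posterior draws of the area-specific SIRs:

# A minimal sketch (hypothetical object): sir.chain is an M x N matrix of
# posterior draws of the area-specific SIRs (one column per area).
ci <- apply(sir.chain, 2, quantile, probs = c(0.025, 0.975))
ci.width <- ci[2, ] - ci[1, ]
summary(ci.width)   # unusually wide or narrow widths flag implausible CIs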

This criterion is linked to the concept of smoothing: 'well-behaved', reliable estimates tend to result from models where more smoothing occurs. However, the goal is broader: to identify models with estimates that are not "believable".

(While this criterion is arguably more subjective than that used for under- and over-smoothing, the ranks were limited to a maximum value of 2, which limits the contribution of this criterion to the consensus in determining the preferred model.)

6.6 Convergence

There are numerous statistics for assessing the convergence of an MCMC chain. The Geweke convergence diagnostic (Geweke 1992) is one of the simplest, and it requires only a single MCMC chain (a desirable attribute when the computational cost of running a model is very high). The gist of this diagnostic is as follows: if the chain has converged, then different portions of the chain, say the beginning and the end, should not differ significantly in terms of their empirical means. Formally, the diagnostic compares the means of the first 10% of the chain and the last 50% of the chain using a two-sample test, with the variances of the two segments estimated separately. We used the p-value of this test (applied to the chain of estimated SIRs for each area) to rank models as follows:

  • rank 0 = OK (<1% of areas with a Geweke p-value < 0.01);
  • rank 1 = poor (1 to <10% of areas with a Geweke p-value <0.01);
  • ranks 2 to 20 = bad (10% or more of areas with a Geweke p-value < 0.01; the rank is this percentage divided by 5 and rounded).

Convergence can also be assessed visually using trace and autocorrelation plots. However, this approach is considered less formal than using diagnostics like Geweke, and can be too tedious, especially if there are a large number of areas.

For this investigation, the Geweke diagnostic was computed using the wbgeweke command in Stata v15.0. It can also be computed in R using the geweke.diag function from the coda package.
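
A minimal sketch of the latter approach, assuming sir.chain is a hypothetical M x N matrix of MCMC draws of the area-specific SIRs (one column per area), is given below; the final line applies the ranking rules listed above.

# A minimal sketch (hypothetical object): sir.chain is an M x N matrix of
# MCMC draws of the area-specific SIRs (one column per area).
library(coda)
z <- geweke.diag(as.mcmc(sir.chain), frac1 = 0.1, frac2 = 0.5)$z
p <- 2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided p-values
pct <- 100 * mean(p < 0.01)                  # percentage of areas failing
rank.conv <- if (pct < 1) 0 else if (pct < 10) 1 else min(20, round(pct / 5))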

6.7 Consensus: Determining the Preferred Model

Based on these criteria, an overall rank was constructed to try to reach a consensus on the preferred model. This consensus rank is simply the sum of the ranks for the other criteria: WAIC, computation time, under-smoothing, over-smoothing, CI plausibility, and convergence - as shown in Section 5.4.1. Recall that some models were excluded based on theoretical and practical grounds.
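
For example, if the per-criterion ranks are stored in a hypothetical data frame ranks, with one row per model and one column per criterion, the consensus rank is simply the row sum.

# A minimal sketch (hypothetical object): ranks has one row per model and one
# column per criterion (WAIC, computation time, smoothing, etc.).
consensus <- rowSums(ranks)
which.min(consensus)   # candidate for the preferred model (lowest total rank)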

There are two potential criticisms of our method of comparing models. The first concerns the criteria themselves. For example:

  • should model fit, or under- and over-smoothing be measured differently?
  • should the underlying metrics (rescaled) be used instead of ranks?
  • should other criteria have been considered?

The second criticism concerns how the criteria are weighted through the ranks. For example, assuming only two criteria, say WAIC and computation time, a model with rank 20 for WAIC and rank 1 for computation time would have the same consensus rank as a model with rank 1 for WAIC and rank 20 for computation time. This implies that model fit (based on WAIC) and computation time are equally important. Although some effort was made to give more weight to criteria that were deemed more important, in hindsight, with the experience gained from seeing the results of the preferred model on real data in the ACA, these rankings might have been constructed differently.

Even with the aid of this consensus rank, it was difficult to determine the preferred model. Given the above criticisms, the ranks were treated as a guideline. The decision to choose the Leroux model as the preferred model was based on a combination of these ranks as well as what was known about the models from the literature. For example, the BYM model placed in the top 3 ranks for each of the three simulated cancers, but it is also known to over-smooth (which may not be reflected adequately by our criterion).

Having seen the results for the Leroux model in the ACA on real cancer data, we believe that this decision was justified.