6 Hierarchical Bottom-Up Modelling Tutorial

6.1 Introduction

The purpose of this tutorial is to introduce users to the hierarchical bottom-up population modelling. The following series of tutorials are not designed to be an introduction to statistics. For these tutorials, it is recommended that users are familiar with some statistical concepts, and for that purpose, we suggest useful resources below that users can read:

Probabilistic distribution
- Linear regression
- Normal distribution, Poisson distribution, Log-normal distribution
Markov chain Monte Carlo (MCMC), a simulation-based method for model estimation:
- Markov chains explained visually by Victor Powell
- Metropolis-Hastings Monte Carlo, a specific case of MCMC that is used in stan. This chapter comes from the Bayes Rules! online book by Alicia A. Johnson, Miles Ott and Mine Dogucu
Prior probability

Population modelling for this tutorial is done using the rstan package in R. Further information on stan can be found below

Software: Stan

Stan is a C++ library for Bayesian modeling and inference that primarily uses the No-U-Turn sampler (NUTS) Hoffman & Gelman to obtain posterior simulations given a user-specified model and data
Interface: rstan

The rstan package allows one to conveniently fit Stan models from R (R Core Team 2014) and access the output, including posterior inferences and intermediate quantities such as evaluations of the log posterior density and its gradients.

Note For a more interactive version of the tutorial, including knowledge quizzes, please go to: Bottom-Up Tutorials

6.2 Part I: How to think about population as a Bayesian

The first part of this tutorial introduces the basics of Bayesian modelling and how to apply Bayesian principles to bottom-up gridded population estimation. It covers how to transition from frequentist modelling to Bayesian thinking, and applying Bayesian principles to fit a poisson model for population count. This model is then built upon to fit a poisson-lognormal model, which is used for bottom-up population modelling. In summary, the objectives of this part are to:

Write a simple linear regression in a Bayesian framework
Adapt the statistician toolbox to a real-world example
Fit a Normal model in stan with simulated data:
- Format data for stan
- Specify a model in the stan language
- Set up a MCMC sampler to fit the model
Fit a Poisson model for population count
- Evaluate results and limitations
Fit a Poisson-Lognormal model for modelling population

6.2.1 Set-up

This model will be implemented in the R programming language (R Core Team 2020a). The first step is to setup the R environment so that it contains the required R packages and the data that is needed for this tutorials. Firstly, ensure the most recent versions of the following softwares are downloaded:

R (https://www.r-project.org/)
RStudio (https://rstudio.com/products/rstudio/download/)

Next, install and set up the rstan package by carefully following the given directions. It is also important to ensure the following r packages are installed if not already:

install.packages('tidyverse') # managing data
install.packages('ggdag') # drawing DAG
install.packages('kableExtra') # visualising table
install.packages("here")

Once installed, load the installed package into the R environment.

library(ggdag) 
library(kableExtra) 
library(here)
library(tidyverse)
library(rstan)

After loading the packages in R, stan needs to be setup for compilation of the model by specifying the number of cores to use and increase the speed running time for fitting a model

# stan setup
options(mc.cores = parallel::detectCores()) #set up the maximum number of cores used by stan
rstan::rstan_options(auto_write = TRUE) # speed up running time

The rstan package allows one to conveniently fit Stan models from R (R Core Team 2020a) and access the output, including posterior inferences and intermediate quantities such as evaluations of the log posterior density and its gradients.

6.2.2 From a frequentist to a Bayesian mindset

In a standard frequentist approach, a linear regression between $Y$ the response variable and $X$ the predictors can be formulated as:

\[\begin{equation} Y = \alpha + \beta X + \epsilon \\ \epsilon \sim Normal(mean=0,sd=\sigma)\tag{6.1} \end{equation}\]

Equation (6.1) can be rewritten as: \[Y \sim Normal(mean=\mu, sd=\sigma)\] \[ \mu=\alpha + \beta X\]

This format is more flexible for working with non-normal error structures and custom modelling components. This linear regression can be represented using a Directed Acyclic Graph (DAG) to illustrate the relationships between model parameters and input data:

Figure 6.1: Graph of linear regression

In the DAG above, the different data are represented by squares and the parameters by circles. Directions of arrows indicate dependence. In a Bayesian model, all root node parameters (those with no arrows pointing towards them) need priors to be specified: \[ \alpha \sim Normal(mean=0, sd=1000)\] \[ \beta \sim Normal(mean=0, sd=1000)\] \[ \sigma \sim Uniform(min=0, max=1000)\]

These are examples of weakly informative priors, because the variances are large relative to the data. Weakly informative priors should not have any noticeable influence on the final parameter estimates.

6.2.3 How to choose priors

To identify a distribution to use for priors, it is useful to ask the following questions:

“What values are possible for this parameter?”
“How certain am I about it?”.

In most Bayesian analyses, priors can be built based on the known properties of potential datasets. Defining these priors consists of indicating what is the likely range of values for the parameters and downweighting out-of-range values. Furthermore, if we are uncertain about the range, the variability of the priors (through their scale/variance parameter) can be increased to avoid the priors influencing too much the posteriors.

Regression coefficients are generally continuous numbers that can take values from -∞ to +∞. The normal distribution is a good choice of prior for these parameters because it has the same characteristics.

Standard deviations are continuous numbers that must be positive. A normal distribution is not a good choice for this prior because it includes negative numbers. The simplest is a uniform(0,a) with a defining the upper bound on likely values. In reality this is not perfect as it enforces a hard constraint. Other distributions can be considered as a half-Normal or a half-Cauchy distribution.

The bayes rules book includes a useful chapter on the interaction between priors and data. Furthermore, the stan team provides a set of interesting guidelines to prior setting.

6.2.4 Simulating data

We will simulate fake observations to introduce the basic concepts of Bayesian modelling and their implementation in stan.

We produce our fake data as 1000 draws of a Normal distribution with mean 5 and standard deviation 50:

# 2 Simulated data ----

# Simulate data
seed <- 2004
set.seed(seed)
data <- tibble(y=rnorm(1e3, mean= 5, sd=50))

Note that a seed is defined for the results to be exactly replicated.

The observed distribution of the simulated data is the following:

Figure 6.2: Simulated observations distribution

6.2.5 Modelling the data

Let’s pretend we don’t know how the observations, $y$ have been created. We want to model $y$ and we see that it has a clear bell shape. It can thus be approximated with a Normal distribution that is: \[Y \sim Normal(mean=\mu, sd=\sigma)\]

The corresponding DAG is:

Figure 6.3: Model with simulated data: Normal distribution of Y

We need to assign some priors to the two parameters $\mu$ the mean, and $\sigma$ the standard deviation. Let’s imagine that we don’t have access to more information on the variable $Y$ other than its manifestation in the dataset.

We might assume that $\mu$ is around zero but without great confidence (see Figure 6.4). In other words, we can assume that $\mu$ can be drawn from a Normal(0,100).

Figure 6.4: Mu prior distribution

Figure 6.4 shows what our prior for $\mu$ means: we think that $\mu$ is likely to be around zero, however it could be up to 200. Given that the observed $y$, $\hat{y}$, has no occurrences above 200, it is unlikely that the mean of $y$ is 200 but we don’t exclude the possibilty. However we do consider that 0 is more likely than 200.

For $\sigma$, there are stricter expectations - standard deviations are positive. As such a Uniform(0, 200) is chosen as prior for $\sigma$.

Figure 6.5: Sigma prior distribution

Figure 6.5 illustrates that a uniform prior generates an equal probability of $\sigma$ being between 0 and 200.

The final model is:

\[\begin{equation} Y \sim Normal( \mu, \sigma ) \\[1pt] \\ \mu \sim Normal( 0, 100 ) \\ \sigma \sim Uniform( 0, 200 )\tag{6.2} \end{equation}\]

6.2.6 Implementing the model in `stan`

To estimate $\mu$ and $\sigma$, we will write our first model in stan.

// Model for simulated data: y as normal distribution
data {
  int<lower=0> n; // number of observations
  real y[n]; // observations
}

// The parameters accepted by the model. Our model
// accepts two parameters 'mu' and 'sigma'.
parameters {
  real mu;
  real<lower=0> sigma;
}

// The model to be estimated. We model the output
// 'y' to be normally distributed with mean 'mu'
// and standard deviation 'sigma'.
model {
  y ~ normal(mu, sigma);
  
  mu ~ normal(0,100);
  sigma ~ uniform(0,200);
}

A stan model is composed of code blocks. The most fundamental are:

The data block: describes the input data to the model
The parameters block: that describes the parameters to be estimated
The model block: that describes the stochastic elements: (1) the interaction between the parameters and the data, (2) the prior distribution

The stan software requires the user to declare all variables, both parameters and data, with their type (int, real) and size (as indicated with [n]). It is possible to incorporate constraints on the variable support, for example, it is not possible to have a negative $\sigma$ (real<lower=0> sigma). More details here.

Note: stan requires to leave one blank line at the end of the script. We will store the model in a stan file called tutorial1_model.stan in the tutorial1 folder.

6.2.7 Preparing the data for stan

Stan software takes as input a list of the observed data that defines the variables indicated on the data block.

# prepare data for stan
stan_data <- list(
  y = data$y,
  n = nrow(data))

6.2.8 Running the model

Set up the parameters of the Markov Chain algorithm.

# mcmc settings
chains <- 4
warmup <- 250
iter <- 500

The chains argument specifies the number of Markov chains to run simultaneously. The Markov chains should replicate a fully random process. However, the design of the chain algorithm makes every sample dependant on the previous sample. To recreate a random setting several chains can be run independently to explore the parameter space and that ideally will converge to the same consistent solution.

The warmup parameter is the number of samples at the beginning of the estimation process that are discarded from the results.

The iter parameter specifies the number of iterations, that is the length of the Markov chain. The longer the chain the more likely it is to stabilize around the correct estimate.

Then we define the parameters that we want to monitor, which are stored during the estimation process:

# parameters to monitor
pars <- c('mu','sigma')

Now, the model can be run.

# mcmc
fit <- rstan::stan(file = file.path(here::here('dat/BUM_tutorial/tutorial1_model.stan')), 
                   data = stan_data,
                   iter = warmup + iter, 
                   chains = chains,
                   warmup = warmup, 
                   pars = pars,
                   seed = seed)

6.2.9 Checking the MCMC simulations

To check whether the Markov chains converge to a unique solution, a traceplot can be used to visualize the two parameters, $\mu$ and $\sigma$. A traceplot describes the evolution of the parameter estimation across the Markov chain iterations. A good traceplot sees the mixing of the different chains, evidence of convergence to a single estimate.

Figure 6.6: Model Traceplot

Figure 6.6 shows both the warm-up period (up until 250) and the following iterations. It shows that convergence happened before the end of the warm-up period and is stable over the iterations, because the four chains mixed well.

From Figure 6.6 (and the absence of warning from stan ), it can be concluded that the model converged.

6.2.10 Evaluating the estimated parameters

Bayesian statistics consider parameters as stochastic, and thus estimates a distribution for each parameter. In practice, stan stores parameters estimates for each iteration post-warmup for each chain which gives us 500x4 estimates of the posterior distribution. A summary of the estimated distribution can be extracted from the stanfit object using the summary function:

# summarise estimated parameters
estimated <- summary(fit, pars=pars)$summary

estimated %>% kbl() %>% kable_minimal()

	mean	se_mean	sd	2.5%	25%	50%	75%	97.5%	n_eff	Rhat
mu	7.608826	0.0450440	1.579665	4.48944	6.528863	7.641456	8.663377	10.74402	1229.864	1.001695
sigma	49.640567	0.0270366	1.093665	47.47846	48.898752	49.630383	50.363545	51.84672	1636.310	1.000865

The the parameters $\hat\mu$ and $\hat\sigma$, can be compared with the observed average and standard deviation of $\hat{y}$ as well as the true value, using the stan_plot function.

Figure 6.7: Compare parameters using stan_plot function

This shows that (1) the observed mean and standard deviation are within the 95% credible intervals of the estimated parameters, and (2) the true mean and standard deviation are within the 95% credible intervals of the estimated parameters,

The model structure is inline with the observed data and is able to approximate the true data generating process.

6.2.11 Let’s try with real data

This section of the tutorials will cover actual population modelling using sample data for Nigeria. The first step is to download the data for the modelling. It can be found in the supplementary material of the seminal paper describing WorldPop bottom-up population models (Leasure et al. 2020f).

# download  tutorial data
download.file(
  "https://www.pnas.org/highwire/filestream/949050/field_highwire_adjunct_files/1/pnas.1913050117.sd01.xls",
  'tutorials/data/nga_demo_data.xls',
  method='libcurl',
  mode='wb'
)

6.2.12 The data

The dataset consists of household surveys which capture the total population in 1141 clusters in 15 of 37 states in Nigeria, collected during 2016 and 2017. Clusters varied slightly in size, but were all approximately 3 hectares. These clusters were randomly sampled locations whose boundaries were drawn based on high resolution satellite imagery. The surveys are further described in Leasure et al. (2020f) and Weber et al. (2018c). Survey site locations are shown in Figure 8.1.

Microcensus surveys mapped as the number of survey locations within a 20km grid cell. Region groupings are shaded and numbered R1 - R11. Source: Leasure et al. (2020).

Figure 6.8: Microcensus surveys mapped as the number of survey locations within a 20km grid cell. Region groupings are shaded and numbered R1 - R11. Source: Leasure et al. (2020).

The map in Figure 8.1 shows some key characteristics of the stratified-random sample design:

Only some states were sampled
At least 1 state per “region” was sampled
Within states, locations were randomly sampled within settlement types

Let explore the table attributes:

#load data
data <- readxl::read_excel(here("dat/BUM_tutorial/nga_demo_data.xls"))
# create unique cluster id
data <- data %>% 
  mutate(
    
    id= paste0('cluster_',(1:n())), # compute cluster id
    pop_density = N/A # compute population density
  )
data %>% dplyr::select(id,N, A) %>% head(10) %>% kbl() %>% kable_minimal()

id	N	A
cluster_1	547	3.036998
cluster_2	803	2.989821
cluster_3	750	3.078278
cluster_4	281	3.019307
cluster_5	730	3.025204
cluster_6	529	3.090072
cluster_7	505	3.007513
cluster_8	402	3.019307
cluster_9	388	3.202116
cluster_10	900	3.054689

Each row is a survey site, with population counts (N) and the settled area (A) in hectares.

6.2.13 Response variable: the population count

We want to model the distribution of population count at each survey site:

Figure 6.9: Observed population count distribution at survey sites

Note the wide variation in population count per survey site, with a maximum of 2370 people.

6.2.14 Modelling Population count: a Poisson distribution

By definition, population count is a discrete, positive variable. A good distribution candidate is the Poisson distribution.

\[ population \sim Poisson( \lambda ) \] The corresponding DAG shows the interaction between the population count and the model parameter:

Figure 6.10: Model 1: Poisson distribution of population count

The $\lambda$ then requires a prior - which corresponds to the mean of the Poisson distribution. We will choose a relatively uninformed prior based on our understanding of the data. We know that the mean observed population count is around 450 per cluster. We set up the prior for $\lambda$ to follow a Uniform between 0 and 3000 to ensure a positive parameter while reducing the information given to the estimation process, and thus the bias introduced.

Figure 6.11: Lambda prior distribution

The final model is:

\[\begin{equation} population \sim Poisson( \lambda ) \\ \\ \lambda \sim Uniform( 0, 3000 )\tag{6.3} \end{equation}\]

6.2.15 Implementing the model

To estimate $\lambda$, we will write our first population model in stan.

// Model 1: Population count as a Poisson distribution 
data{
  int<lower=0> n; // number of microcensus clusters
  int<lower=0> population[n]; // count of people
}
parameters{
  // rate
  real<lower=0> lambda; 
}

model{
  // population totals
  population ~ poisson(lambda);
  // rate
  lambda ~ uniform(0, 3000);
}

We declare the input variable population as integer because our population data are counts and set it up to be positive. We define $\lambda$ as a positive real. We store this model under tutorial1_model1.stan.

6.2.16 Estimating the model

We prepare the data for stan:

# prepare data for stan
stan_data <- list(
  population = data$N,
  n = nrow(data))

We keep the same parameters as previously for the Markov Chain algorithm and declare $\lambda$ as the parameter to monitor

# parameters to monitor
pars <- c('lambda')

And we are ready to run the model!

# mcmc
fit <- rstan::stan(file = file.path('dat/BUM_tutorial/tutorial1_model1.stan'), 
                   data = stan_data,
                   iter = warmup + iter, 
                   chains = chains,
                   warmup = warmup, 
                   pars = pars,
                   seed = seed)

The traceplot shows a model that converges to the observed mean:

Figure 6.12: Traceplot

Note the wide variation at the start of the estimation. This is due to the uniform prior for $\lambda$ declaring that any value between 0 and 3000 is similarly possible.

6.2.17 Evaluating the model goodness-of-fit

6.2.18 Estimated parameters

Plot the estimated parameter $\hat\lambda$ to see how it compares with the observed average population count.

$Estimated parameter $\hat\lambda$ comparison with observed average population count$

Figure 6.13: Estimated parameter $\hat\lambda$ comparison with observed average population count

The estimated mean is very close to the observed mean.

6.2.19 Predicted population count

To see if the model is coherent with the observations, the predicted population count can be computed for every survey site. It is part of posterior predictive checking which is based on the following idea: if a model is a good fit then the user should be able to use it to generate data that looks a lot like the data we observed.

It is possible in stan to do this as part of the estimation process through the generated quantities block .

// Model 1bis: Population count as a normal distribution with integrated predictions
...
generated quantities{
   real population_hat[n];

   for(idx in 1:n){
     population_hat[idx] = poisson_rng( lambda );
   }
}

We define the parameter population_hat as a draw (as represented by the suffix rng for random number generator) from a Poisson distribution with the estimated $\hat\lambda$ for each iteration. Run the model stored under tutorial1_model1bis.stan:

pars <- c('lambda', 'population_hat')

# mcmc
fit_model1 <- rstan::stan(file = file.path('dat/BUM_tutorial/tutorial1_model1bis.stan'), 
                   data = stan_data,
                   iter = warmup + iter, 
                   chains = chains,
                   warmup = warmup, 
                   pars = pars,
                   seed = seed)

And extract the predicted population count.

# extract predictions
predicted_pop_model1 <- as_tibble(rstan::extract(fit_model1, 'population_hat')$population_hat)

colnames(predicted_pop_model1) <- data$id

This produces a table with 500 predictions * 4 chains for each survey site.

predicted_pop_model1 %>% 
  mutate(iteration= paste0('iter_', 1:(iter*chains))) %>% 
  dplyr::select(iteration, 1:10) %>% head(10) %>% kbl() %>% kable_minimal()

iteration	cluster_1	cluster_2	cluster_3	cluster_4	cluster_5	cluster_6	cluster_7	cluster_8	cluster_9	cluster_10
iter_1	450	446	380	409	462	440	393	431	446	450
iter_2	440	415	479	420	429	456	435	405	449	487
iter_3	435	430	468	455	423	450	437	423	452	478
iter_4	445	405	449	454	449	452	451	428	412	438
iter_5	440	380	436	420	453	441	448	444	446	434
iter_6	427	461	491	418	453	470	432	491	433	462
iter_7	438	433	448	441	429	438	454	453	485	418
iter_8	489	421	447	451	471	486	457	464	437	458
iter_9	438	454	446	415	446	496	410	450	427	438
iter_10	444	460	441	418	432	449	433	422	447	416

This provides a posterior prediction distribution of population count for every survey site. Figure 6.14 shows a posterior distribution for the first survey site.

Figure 6.14: Example of posterior prediction distribution for one cluster

For every survey site, its mean prediction and 95% credible interval can be extracted.

# summarize predictions
comparison_df <- predicted_pop_model1 %>% 
      pivot_longer(everything(),names_to = 'id', values_to = 'predicted') %>% 
      group_by(id) %>% 
      summarise(across(everything(), list(mean=~mean(.), 
                                          upper=~quantile(., probs=0.975), 
                                          lower=~quantile(., probs=0.025))))
comparison_df %>% head() %>% kbl() %>% kable_minimal()

id	predicted_mean	predicted_upper	predicted_lower
cluster_1	444.0695	486.000	403.975
cluster_10	444.8125	486.000	402.975
cluster_100	445.4620	487.000	405.000
cluster_1000	444.0330	485.000	403.000
cluster_1001	446.0340	488.000	405.000
cluster_1002	444.2210	485.025	402.000

Note that all predictions are very similar. The model includes only one parameter — lambda — and the site-level predictions do not account for any site-level characteristics. Therefore, predictions at each site are drawn from the exact same distribution. To see the global picture, plot the observed vs the predicted population count. A perfect model would see all points on the 1:1 line.

Figure 6.15: Comparison between observed and predicted population count (with the Poisson model). Orange line indicates the 1:1 line

Figure 6.15 provides a great visualization of the prediction process. Since the model has no covariates (introduced in tutorial 3) and no hierarchical structure (introduced in tutorial 2), there is no subnational variation introduced. Furthermore the credible intervals entail few observations: few grey lines intersect the orange line. This indicates issues with the modelling.

We can compute goodness-of-fit metrics to complete the picture:

The bias, the mean of the residuals (prediction - observation)
The imprecision, standard deviation of the residual
The inaccuracy, mean of the absolute residuals
The proportion of observations falling into the predicted credible interval

# compute goodness-of-fit metrics
comparison_df %>%
  dplyr::mutate(residual = predicted_mean-data$N,
          in_CI = ifelse(data$N>predicted_lower &data$N<predicted_upper, T, F)) %>% 
  summarise(
    `Bias`= mean(residual),
    `Imprecision` = sd(residual),
    `Inaccuracy` = mean(abs(residual)),
    `Correct credible interval (in %)` = round(sum(in_CI)/n()*100,1)
  ) %>% 
    kbl(caption = "Poisson model goodness-of-fit metrics") %>% kable_minimal()

Table 6.1: Poisson model goodness-of-fit metrics
Bias	Imprecision	Inaccuracy	Correct credible interval (in %)
-0.0291301	315.8238	230.918	10.1

Table 6.1 confirms that the model is incorrectly specified: only 10% of the observations falls in their credible intervals.

This limitation is due to the impossibility to model overdispersion within a Poisson framework. Indeed $\lambda$ defines both the mean and the variance of a Poisson variable.

A source of overdispersion comes from the size of the clusters. We observed population counts for units with different area, in particular different sizes of settled areas as shown in Figure 6.16.

Figure 6.16: Scatterplot contrasting population count with settled area for every cluster

6.2.20 Modelling Population count: a Poisson Lognormal model

To incorporate overdispersion in the model, population count can be decomposed as follows:

\[ population = pop\_density * settled\_area \] The left hand side of the equation is population, a discrete observed variable, whereas the right hand side is composed of two continuous variables, settled_area the settled area, an observed variable, and pop_density, the population density, a latent variable.

6.2.21 Writing the model

Formally Equation (6.3) can be rewritten as:

\[ population \sim Poisson( pop\_density * settled\_area) \] This formulation unveils the continuous positive scale-free latent variable, pop_density that can be modelled with its own distribution.

A Lognormal is used here, which is a continuous probability distribution of a random variable whose logarithm is normally distributed. It is characterised by a positive distribution skewed to the right. The lognormal has two parameters, $\mu$, its location parameter that defines its median and $\sigma$ its scale parameter that defines its geometric standard deviation. This model allows us to capture overdispersion through $\sigma$.

Figure 6.17: Probability distribution functions of log-normal distributions Source: Krishnavedala, CC0, via Wikimedia Commons

Applied to the population modelling it gives:

Figure 6.18: Model 2: Lognormal-Poisson distribution of population count

And in formal modelling:

\[\begin{equation} population \sim Poisson( pop\_density * settled\_area) \\ pop\_density \sim Lognormal( \mu, \sigma) \end{equation}\]

Note that this equation is equivalent to:

\[\begin{equation} log(pop\_density) \sim Normal( \mu, \sigma) \end{equation}\]

Under this form it is clear to see that the model is not linear, but log-linear. It can be rewritten as:

\[\begin{equation} pop\_density \sim exp(Normal( \mu, \sigma)) \end{equation}\]

6.2.22 Defining the priors

In the Lognormal, there are two parameters, $\mu$ representing the median of the population density on the log scale, and $\sigma$, the geometric standard deviation of population density on the log scale. Their priors are similarly set up as before and retrieve from the data that the log observed median of the population density is 4.82 and the observed log geometric standard deviation is0.87.

We choose as prior for $\mu$ a Normal(5,4):

Figure 6.19: Prior distribution for mu

We choose as prior for $\sigma$ a Uniform(0,4)

Figure 6.20: Prior distribution for mu

The resulting model can be written as follows:

\[\begin{equation} population \sim Poisson( pop\_density * settled\_area) \\ pop\_density \sim Lognormal( \mu, \sigma) \\ \\ \mu \sim Normal( 5, 4 ) \\ \sigma \sim Uniform( 0, 4 )\tag{6.4} \end{equation}\]

6.2.23 Implementing the model

We adapt the stan code to the model change which affects all code blocks:

// Model 2: Population count as a Poisson-Lognormal distribution 
data{
  int<lower=0> n; // number of microcensus clusters
  int<lower=0> population[n]; // count of people
  vector<lower=0>[n] area; // settled area
}
parameters{
  // population density
  vector<lower=0>[n] pop_density;
  // intercept
  real mu; 
  // variance
  real<lower=0> sigma; 
}
model{
  // population totals
  population ~ poisson(pop_density .* area);
  pop_density ~ lognormal( mu, sigma );
  // intercept
  mu ~ normal(5, 4);
  // variance
  sigma ~ uniform(0, 4);
}
generated quantities{
   int<lower=0> population_hat[n];
   real<lower=0> density_hat[n];

   for(idx in 1:n){
     density_hat[idx] = lognormal_rng( mu, sigma );
     population_hat[idx] = poisson_rng(density_hat[idx] * area[idx]);
   }
}

Note the need to use the vector format for area and pop_density. This formatting is implied by definition of the multiplication operation for the Poisson rate.

The model is stored under tutorial1_model2.stan, and the corresponding data is prepared as follows:

# prepare data for stan
stan_data_model2 <- list(
  population = data$N,
  n = nrow(data),
  area = data$A)

Note that the population density is not an input data of the model, but an unobserved latent variable that we model through the Lognormal.

Then declare the parameters to monitor (including density_hat) and run the model.

# set parameters to monitor
pars <- c('mu','sigma', 'population_hat', 'density_hat')

# mcmc
fit_model2 <- rstan::stan(file = file.path('dat/BUM_tutorial/tutorial1_model2.stan'), 
                   data = stan_data_model2,
                   iter = warmup + iter, 
                   chains = chains,
                   warmup = warmup, 
                   pars = pars,
                   seed = seed)

No warnings are shown.

Question: Can you plot the traceplot and interpret it?

Click for the solution

We plot the traceplot:

Figure 6.21: Traceplot

We see that the chains have mixed well and that the posterior distributions of the parameters are not constrained by their prior specification.

We plot the predicted density and the predicted count against the observations:

# extract posterior predictions
predicted_pop_model2 <- as_tibble(extract(fit_model2, 'population_hat')$population_hat)

predicted_dens_model2 <- as_tibble(extract(fit_model2, 'density_hat')$density_hat)
colnames(predicted_pop_model2) <- data$id
colnames(predicted_dens_model2) <- data$id

# summarise posterior predictions
comparison_df <- rbind(predicted_dens_model2 %>% 
   pivot_longer(everything(),names_to = 'id', values_to = 'predicted') %>% 
   group_by(id) %>% 
   summarise(across(everything(), list(mean=~mean(.), 
                                       upper=~quantile(., probs=0.975), 
                                       lower=~quantile(., probs=0.025)))) %>% 
   mutate(source= 'Poisson-Lognormal model',
          type= 'Population density') %>% 
   left_join(data %>% 
               select(id, pop_density)%>% 
              rename(reference=pop_density), by = 'id'),
  predicted_pop_model2 %>% 
  pivot_longer(everything(),names_to = 'id', values_to = 'predicted') %>% 
  group_by(id) %>% 
  summarise(across(everything(), list(mean=~mean(.), 
                                      upper=~quantile(., probs=0.975), 
                                      lower=~quantile(., probs=0.025)))) %>% 
  mutate(source= 'Poisson-Lognormal model',
         type='Population count') %>% 
  left_join(data %>% 
              dplyr::select(id, N) %>% 
              rename(reference=N), by = 'id'))
# plot posterior predictions
ggplot(comparison_df %>% 
         mutate(type= factor(type, levels=c('Population density', 'Population count')))) +
  geom_pointrange(aes(x=reference, y=predicted_mean, ymin=predicted_lower, ymax=predicted_upper
                      ),
                   fill='grey50', color='grey70', shape=21
                  )+
  geom_abline(slope=1, intercept = 0, color='orange', size=1)+
  theme_minimal()+
  labs(title = '', x='Observations', y='Predictions')+ 
  facet_wrap(.~type, scales = 'free')

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

The population_density produces the same estimation pattern as in the Poisson model, that is a similar mean posterior prediction distribution for every survey site. In contrast, the length of the confidence intervals varies between cluster predictions due to the variance term in the Lognormal. Furthermore the predicted population_count is influenced by the varying settled_area.

We use the same metrics based on residuals to assess the goodness-of-fit of the model.

# compute goodness-of-fit metrics
comparison_df %>%
  filter(type=='Population count') %>% 
  dplyr::mutate(residual = predicted_mean-reference,
          in_CI = ifelse(reference>predicted_lower &reference<predicted_upper, T, F)) %>% 
  summarise(
    `Bias`= mean(residual),
    `Imprecision` = sd(residual),
    `Inaccuracy` = mean(abs(residual)),
    `Correct credible interval (in %)` = round(sum(in_CI)/n()*100,1)
  ) %>% 
    kbl(caption = "Poisson-Lognormal model goodness-of-fit metrics") %>% kable_minimal()

Table 6.2: Poisson-Lognormal model goodness-of-fit metrics
Bias	Imprecision	Inaccuracy	Correct credible interval (in %)
61.66909	338.0013	265.4084	95.2

The proportion of observations that fall in their credible intervals shows a well-behaved model: 95.5% of the observations falls in the 95% credible interval. On average the model overestimates the population count by around 60 people, but this statistic varies on average by 337 people.

No variations are included in this model: we built a national model of population count. It will require further refinements that will be introduced in the next tutorials.

And for that purpose, store this last model as a RDS file.

saveRDS(fit_model2, 'dat/BUM_tutorial/tutorial1_model2_fit.rds')

6.3 Part II: How to model large-scale spatial variation

Part I, looked at modelling population count with a Poisson-Lognormal model. This modelling does not include site-level variations of population count. Part II will explore the spatial patterns of population counts and how they are clustered across space.

This section will explore the grouping of observations and the unravelling of their hierarchical structure, demonstrating how Bayesian models offer a great modelling toolbox to represent hierarchical data. This will involve testing different hierarchical modelling structures to see which one fits the data best. Hierarchical models define a complex parameter space. This will therefore involve revising the prior setting to account for the complexity.

6.3.1 Goals

Understand the concept of grouped observations
Estimate population density in a no-pooling framework
Estimate population density in a partial pooling framework
Integrate additional hierarchical level to the modelling

6.3.2 Supporting readings

An Introduction to Hierarchical Modeling by Michael Freeman. Great visualisation of grouped data and goals of general hierarchical modelling
Bayes rules! on hierarchical model by Alicia A. Johnson, Miles Ott, Mine Dogucu. Good explanation of hierarchical modelling in stan with many examples
Exchangeability and hierarchical models in Bayesian Data Analysis (p.104) by Andrew Gelman, John Carlin, Hal Stern, Donald Rubin, David Dunson, and Aki Vehtari

6.3.3 Extra packages

We will use the bayesplot package that offers more flexibility in evaluating Bayesian estimations and plotly for interactive plotting.

install.packages('bayesplot') # additional evaluations of stan models
install.packages('plotly') # interactive plot
install.packages("RColorBrewer")

Vignettes can be found here: bayesplot and plotly.

# 1 Set-up ----

# load libraries
library(ggdag) # drawing DAG
library(kableExtra) # visualising table
library(here) # handling path
library(rstan) # running Bayesian models
library(plotly) # interactive plot
library(RColorBrewer) # colours
library(tidyverse) # managing data

# stan setup
options(mc.cores = parallel::detectCores()-1)
rstan::rstan_options(auto_write = TRUE) # speed up running time of compiled model

knitr::opts_chunk$set(fig.align = "center")

6.3.4 Hierarchical structure in the data

We are still working with the Nigeria population dataset we downloaded in Part I. Let’s unravel the clustering patterns in population counts. Figure 6.22 shows the spatial variation of population density across Nigeria. Population density (people/settled hectares) is represented because it is more comparable across survey sites.

We can observe clustered patterns such as high population densities in the North and the Southeast.

Figure 6.22: Map of survey site population densities overlaid with Nigeria region boundaries

Let’s plot the population density distribution by regions:

Figure 6.23: Boxplot of population densities per region

In addition to the 11 regions, the data provides two other Nigerian administrative divisions: state (15) and local (standing for local government areas, 223), which gives extra flexibility for grouping the survey sites. The underlying assumption is that the administrative structure of a country gives information on the population density variation.

Another key grouping is the type attribute indicated in the data. It refers to one of eight settlement types classified from a settlement map based on satellite imagery (Pleiades, Airbus and WorldView2, DigitalGlobe), pictured in Figure 6.24 Weber et al. (2018c).

Exemplars of the urban residential (A–F), rural residential (M), and non-residential (Z) types for Kano and Kaduna states, Nigeria, from Weber et al (2018)

Figure 6.24: Exemplars of the urban residential (A–F), rural residential (M), and non-residential (Z) types for Kano and Kaduna states, Nigeria, from Weber et al (2018)

Figure 6.25: Boxplot of population densities per settlement type

Settlement type clearly stratifies the population density - for example, settlement type 1 has a substantial higher population density than settlement type 4.

Note that out of the eight settlement types displayed in Figure 6.24, only five types are actually present in the data:

No sites were surveyed in the non-residential type area (Z).
Types C, D and E have been merged together to enforce their prevalence in the South region

6.3.5 Full picture of the data grouping

Figure 6.26 shows the full structure of the survey sites, with the following nesting: settlement type, region, state and local government area. Hovering on the schema, shows how many survey sites are in each grouping. The detailed structure can be accessed by clicking on a specific group.

# Visualise grouping of the data
# create unique id for the nested admin level
data1 <- data %>% 
  dplyr::mutate(state= paste0(state,region),
         local = paste0(state, local))

# create data for sunburst plot
d1 <- rbind(
  # first layer
  data1 %>% 
    group_by(type) %>% 
    summarise(n=n()) %>% 
    dplyr::mutate(
      ids = paste0('settlement', type),
      labels = paste0('settlement <br>', type),
      parents = '') %>% 
    ungroup() %>% 
    dplyr::select(ids,labels, parents,n),
  # second layer
  data1 %>% 
    group_by(type, region) %>% 
    summarise(n=n()) %>% 
    dplyr::mutate(
      ids = paste('settlement', type, '-', 'region', region),
      labels = paste0('region ', region),
      parents = paste0('settlement', type))%>% 
    ungroup() %>% 
    dplyr::select(ids,labels, parents,n),
  # third layer
  data1 %>% 
    group_by(type, region, state) %>% 
    summarise(n=n()) %>% 
    dplyr::mutate(
      ids = paste('settlement', type, '-', 'region', region, '-', 'state', state, '-', 'region', region),
      labels = paste0('state ', state),
      parents = paste('settlement', type, '-', 'region', region))%>% 
    ungroup() %>% 
    dplyr::select(ids,labels, parents,n),
  # fourth layer
  data1 %>% 
    group_by(type, region, state, local) %>% 
    summarise(n=n()) %>% 
    dplyr::mutate(
      ids = paste('settlement', type, '-', 'region', region, '-', 'state', state,  '-', 'local', local),
      labels = paste0('local ', local),
      parents = paste('settlement', type, '-', 'region', region, '-', 'state', state, '-', 'region', region))%>% 
    ungroup() %>% 
    dplyr::select(ids,labels, parents,n)
) %>% 
 dplyr::mutate(
    hover= paste(ids, '\n sample size', n)
  )

plot_ly(d1, ids = ~ids, labels = ~labels, parents = ~parents, type = 'sunburst', 
        hovertext=~hover, insidetextorientation='radial')

Figure 6.26: Full grouping structure of the survey (settlement, region, state, local)

Note that:

not every state and local government area are present in the data due to sampling limitations
not every group combination is represented in the data. Some region/state/local governmental area do not have survey sites belonging to all settlement types.

Figure 6.27 shows the regions with missing settlement type. From this analysis alone, it is not possible to assess whether those settlement types are missing because they were not sampled in the survey or because they just don’t exist (for example a remote region that does not contain any highly urban settlement type).

makeSunburst2layer <- function(data){
  layers <- rbind(
  # first layer
  data %>% 
    group_by(type) %>% 
    summarise(n=sum(!is.na(N))) %>% 
    dplyr::mutate(
      ids = paste0('settlement', type),
      labels = paste0('settlement <br>', type),
      parents = '') %>% 
    ungroup() %>% 
    dplyr::select(ids,labels, parents,n),
  # second layer
  data %>% 
    group_by(type, region) %>% 
    summarise(n=sum(!is.na(N))) %>% 
    dplyr::mutate(
      ids = paste('settlement', type, '-', 'region', region),
      labels = paste0('region ', region),
      parents = paste0('settlement', type))%>% 
    ungroup() %>% 
    dplyr::select(ids,labels, parents,n)) %>%
    dplyr::mutate(
      hover= paste(ids, '\n sample size', n),
      color= ifelse(n==0, 'yellow','')
    )
  
 return(layers)
}

# create missing combinations
data1_complete <- data1 %>% 
  complete(region, nesting(type)) 


plot_ly() %>% 
   add_trace(data=makeSunburst2layer(data1), 
             ids = ~ids, labels = ~labels, parents = ~parents, 
             type = 'sunburst', 
             hovertext=~hover, marker= list(colors=~color),  
             insidetextorientation='radial',
             domain = list(column = 0)) %>% 
     add_trace(data=makeSunburst2layer(data1_complete), 
             ids = ~ids, labels = ~labels, parents = ~parents, 
             type = 'sunburst', 
             hovertext=~hover, marker= list(colors=~color),  
             insidetextorientation='radial',
             domain = list(column = 1))  %>%  
  layout(
      grid = list(columns =2, rows = 1),
      margin = list(l = 0, r = 0, b = 0, t = 0))

Figure 6.27: Grouping structure of the survey (settlement, region) with missing combinations in yellow

6.3.6 Hierarchical structure in the model

When all observations are treated together and indistinctly in the model, it is called complete pooling. This is how population count was modelled in Part I. It assumes perfect exchangeability between the observations, that the indexing of the clusters doesn’t impact the probability of the population density. In other words, it assumes that no ordering or grouping of the data can be made. Generally, the less we know about a problem, the more confidently we can make claims of exchangeability. The consequence of complete pooling is that single parameters are estimated for the entire dataset. This can produce spurious conclusions that are valid at global level but invalid at group level, through diluting group-specific patterns (cf.the ecological fallacy).

An excellent example of the concept of exchangeability is provided in the Bayesian data analysis book, proposed as supporting reading.

When a model is fit for each data grouping independently, it is called no pooling and creates a model per grouping. It reduces the sample size substantially, and makes it impossible to extrapolate for a grouping that is not present in the observation sample.

The purpose of hierarchical modelling is to aim at partial pooling where information is shared between groups while accounting for the hierarchical structure of the data. It provides insights on:

between-group variability: population densities differ from one region to the other
within-group variability: some regions have a greater variability than others (cf region 9 in Figure 6.23)

Link between random models and hierarchical models: Random models are any model that assumes that parameters are not fixed but vary between groups, that is have a random effect. Hierarchical models are random models that provides a overarching structure to draw information from other groups.

Refer back to the previous model:

\[\begin{equation} population \sim Poisson( pop\_density * settled\_area) \\[10pt] pop\_density \sim Lognormal( \mu, \sigma) \\ \\\\ \mu \sim Normal( 5, 4 ) \\ \sigma \sim Uniform( 0, 4 ) \end{equation}\]

$\mu$ and $\sigma$ have both a fixed effect: they are estimated for the entire dataset at once.

6.3.7 A hierarchical intercept by settlement type

We want to differentiate the parameters according to the hierarchical structure of the data.

The first parameter is $\mu$, the median on the log scale of the population density distribution. Modelling $\mu$ hierarchically means acknowledging that the median differs depending on the grouping, for example, the different settlement types. Going forward we will decompose $\mu$ into a $\alpha$ term for which we will design a random effect.

Let’s explore the different scenarios for estimating $\alpha$ by settlement type (only three settlement types are shown for sake of simplicity):

Figure 6.28: Different estimates of $lpha$ by settlement type (examples with three settlement types for sake of simplicity)

We will see the difference between the three models: complete pooling (one $\alpha$), no pooling (3 $\alpha$s) and partial pooling (3 $\alpha$ + 1 overarching $\alpha$).

Model 2 from tutorial 1 is based on complete pooling of the data or fixed effect: a unique $\alpha$ for all settlement type.

6.3.8 No-pooling

Let’s implement a no-pooling framework, that is $\alpha$ will be estimated independently for each settlement type $t$. Equation (6.3) represents its formal model, whereby $\alpha$ is indexed by $t$ the settlement type and each $\alpha_t$ has a identical but independent prior Normal(5,4).

\[\begin{equation} population \sim Poisson( pop\_density * settled\_area) \\ pop\_density \sim Lognormal( \alpha_t, \sigma) \\[10pt] \alpha_t \sim Normal( 5, 4 ) \\ \sigma \sim Uniform( 0, 4 )\tag{6.3} \end{equation}\]

This is written in Stan as follows:

// Model 1: Independent alpha by settlement type
data{
  ...
  int<lower=0> type[n]; // settlement type
  int<lower=0> ntype; // number of settlement types
}
parameters{
  ...
  // independent intercept by settlement type
  vector[ntype] alpha_t; 
}
model{
  // population totals
  ...
  pop_density ~ lognormal( alpha_t[type], sigma );
  
  // independent intercept by settlement type
  alpha_t ~ normal(5, 4);
  ...
}
generated quantities{
  ...
   for(idx in 1:n){
     density_hat[idx] = lognormal_rng( alpha_t[type[idx]], sigma );
     population_hat[idx] = poisson_rng(density_hat[idx] * area[idx]);
   }
}

In the data block , we declare the observation structure of settlement type.

In the parameters block , $\alpha$ becomes a vector of size 5, the number of settlement type in the data.

In the model block, the same prior is chosen for every $\alpha_t$, five normal distributions centered on 5 and with a standard deviation of 4.

Notice how the indexing works in the generated quantities block.

Once the model is written, we prepare the corresponding data, adding the settlement type:

# 3. Hierarchical model by settlement type ----

# No-pooling
# prepare data for stan
stan_data_model1 <- list(
  population = data$N,
  n = nrow(data),
  area = data$A,
  type = data$type,
  ntype= n_distinct(data$type))

And run the model similarly as in part I:

# mcmc settings
chains <- 4
warmup <- 250
iter <- 500
seed <- 1789

# parameters to monitor
pars <- c('alpha_t','sigma', 'population_hat', 'density_hat')

# mcmc
fit1 <- rstan::stan(file = file.path('dat/BUM_tutorial/tutorial2_model1.stan'), 
                   data = stan_data_model1,
                   iter = warmup + iter, 
                   chains = chains,
                   warmup = warmup, 
                   pars = pars,
                   seed = seed)

No warnings are output by stan, we can therefore infer that the model has converged.

Figure 6.29 shows that the estimated $\hat\alpha_t$ vary by settlement type. We also see how the previously estimated $\alpha$ was averaging between the different patterns.

Figure 6.29: Intercept estimated independently by settlement type

Draw the observed vs predicted plot to see the impact of estimating $\alpha$ by settlement type:

Figure 6.30: Comparison of observations with predictions from model 1

Figure 6.31: Comparison of observations with predictions from model 1

This illustrates that the predicted population density now has five modes that correspond to the five settlement types.

6.3.9 Partial pooling

This section will explore partial pooling. Partial-pooling involves a hierarchical set-up for the priors, that is, each $\alpha_t$ prior distribution will depend on an overarching $\alpha$ prior distribution. Essentially, $\alpha$ defines a country-wide distribution that constrains each $\alpha_t$ to be in a similar range.

\[\begin{equation} population \sim Poisson( pop\_density * settled\_area) \\ pop\_density \sim Lognormal( \alpha_t, \sigma) \\[5pt] \alpha_t \sim Normal( \alpha, \nu_\alpha ) \\[5pt] \alpha \sim Normal( 5, 10 ) \\ \nu_\alpha \sim Uniform( 0, 15 ) \\ \sigma \sim Uniform( 0, 10 )\tag{6.4} \end{equation}\]

Note that:

$\alpha_t$, the random effects are each drawn from a Normal distribution with the same overarching hyperparameters -parameters of prior distributions- $\alpha$ and $\nu_\alpha$.
The prior constraint on $\alpha$ and $\sigma$ from a Normal(5, 4) and Uniform (0,4) is relaxed to a Normal(5,10) and Uniform(0,10).Hierarchical models are complex to estimate as some parameters are defined as correlated. This therefore offers to the algorithm more space to explore.
Uniform(0, 15) is set as hyperprior -prior distribution of a hyperparameter- for $\nu_\alpha$, the standard deviation of the random effect prior, because standard deviations are positive. We set up a wider range for $\nu_\alpha$ than for $\sigma$ because from experience standard deviations of hierarchically defined parameters are more difficult to estimate than other standard deviations (see point 2).

In stan the model is written as follows:

// Model 2: Hierarchical alpha by settlement type
parameters{
  ...
  // hierarchical intercept by settlement
  vector[ntype] alpha_t; 
  real alpha;
  real<lower=0> nu_alpha;
}
model{
  ...
  // hierarchical intercept by settlement
  alpha_t ~ normal(alpha, nu_alpha);
  alpha ~ normal(5, 10);
  nu_alpha ~ uniform(0, 15);
}

We add the new parameters to the parameters to monitor and we run the model:

# Partial pooling

pars <- c('alpha_t','alpha', 'nu_alpha', 'sigma', 'population_hat', 'density_hat')

# mcmc
fit2 <- rstan::stan(file = file.path('dat/BUM_tutorial/tutorial2_model2.stan'), 
                   data = stan_data_model1,
                   iter = warmup + iter, 
                   chains = chains,
                   warmup = warmup, 
                   pars = pars,
                   seed = seed)

We can now compare the estimated parameters between a no pooling and a partial pooling framework.

Figure 6.32: Comparison between alpha_t estimated independently and hierarchically

There are no significant differences between the two models, but the no-pooling framework tends to over estimate the gap between each random effect, whereas the partial pooling framework tends to pull the random effects towards each other. Table 6.3 shows how incorporating the settlement type grouping has a direct effect on the prediction.

# Complete pooling 

# load previous model for complete pooling
fit_tuto1_model2 <- readRDS('dat/BUM_tutorial/tutorial1_model2_fit.rds')

# build comprehensive dataframe
comparison_df <- rbind(
 getPopPredictions(fit1) %>% 
   mutate(model='No pooling'),
 getPopPredictions(fit2) %>% 
   mutate(model='Partial pooling'),
 getPopPredictions(fit_tuto1_model2) %>% 
   mutate(model='Complete pooling'))


# compute goodness-of-fit metrics
comparison_df %>% group_by(model) %>% 
  summarise( `Bias`= mean(residual),
    `Inaccuracy` = mean(abs(residual)),
        `Imprecision` = sd(residual)
) %>%  kbl(caption = 'Goodness-of-fit metrics comparison between complete pooling, no pooling, and partial pooling') %>% kable_minimal()

Table 6.3: Goodness-of-fit metrics comparison between complete pooling, no pooling, and partial pooling
model	Bias	Inaccuracy	Imprecision
Complete pooling	61.66909	265.4084	338.0013
No pooling	46.07355	234.3725	307.5498
Partial pooling	45.37351	234.3248	307.1996

Save the model as

saveRDS(fit2, 'dat/BUM_tutorial/tutorial2_model3_fit.rds')

Estimating $\alpha$ by settlement type does improve the model as shown by smaller residuals (Bias and Imprecision) and less variation (Inaccuracy). The hierarchical model does reduce slightly Bias but not strikingly, which indicates that the number of survey sites per settlement type was big enough to estimate the $\alpha_t$ independently, in a no-pooling framework.

The next section will explore the second advantage of hierarchical modelling.

6.4 Part III: How to model small-scale spatial variation

Part II above explored how to model large-scale spatial patterns, with the example of population densities differing between large groupings, such as region or settlement type. These large-scale variations were integrated in a Bayesian framework by using a hierarchical random intercept model for the population density.

This section will explore integrating small-scale variations of population density that are linked to local context of human settlement. Figure 6.33 shows how high-resolution geospatial covariates provide precise information on local context.

Figure 6.33: Example of a covariate (residential area) around Onitscha, Nigeria

Adding high-resolution covariates helps to improve the model fit and helps guides population prediction in unsampled areas. For data to be used as covariates in the model, they should be:

correlated with differences in population density
measured consistently and completely across the study space
accurately mapped as high-resolution geospatial layers

High-resolution covariates suitable for the modelling are typically spatial covariates with national coverage and consistent data collection. While individual or household-level information such as survey data is useful for understanding differences in demographic characteristics, that type of information is difficult to use in the bottom-up approaches because it only comes from a sample of households. The primary objective of the population model is to make a spatially-complete prediction.

6.4.1 Goals

Understand the requirements for covariates to be included in the model
Get an overview of covariates used in WorldPop bottom-up population models
Become familiar with covariates processing for population model
Integrate a linear regression to refine the definition of the median of the Lognormal
Add a random slope model
Introduce the concept of initialisation in a MCMC model estimation

6.4.2 Supporting readings

To incorporate a spatial covariate into the modelling and use it as a support for prediction, gridded datasets, known as rasters are used. To proceed with this, a complementary understanding of GIS is required, in particular pertaining to vector and raster file management. The purpose of this tutorial is not spatial data processing, and thus practical experience of GIS is not necessarily a requirement here. This may however be of benefit in understanding the processing techniques involved in covariate data preparation which will be discussed in the following chapter. Suggested below are R-based resources on spatial manipulation to compliment the material of this chapter

The bible by Pebesma and Bivand on the sf (vector data) and stars (raster data) R packages with excellent overview of the different spatial manipulations
A hands-on introduction to sf and raster package by the University of Wageningen
A focus on raster manipulation with raster and terra by Hijmans

The same packages as loaded previously will be used here. If these have not been loaded already, they can be loaded as below.

# 1 Set-up ----

# load libraries
library(tidyverse) # managing data
library(ggdag) # drawing DAG
library(kableExtra) # visualising table
library(here) # handling path
library(rstan) # running Bayesian models
library(plotly) # interactive plot
library(RColorBrewer) # colours

# stan setup
options(mc.cores = parallel::detectCores()-1)
rstan::rstan_options(auto_write = TRUE) # speed up running time of compiled model

knitr::opts_chunk$set(fig.align = "center")

6.4.3 Formal modelling

We will work from model 3 in Part II, which is based on a Poisson distribution to model population count and on a Lognormal distribution with a hierarchical random intercept by settlement type and region to model population density .

Because the median of the population density is only defined with a random intercept, it results in 5 (number of settlement type) x 11 (number of region) options for the population density estimates. To add small-scale variations we refine the median of the Lognormal with a regression model that integrates the covariates.

More formally, let’s define $X$ as a matrix of size number of observations x number of covariates that contains the covariates values for each study site and $\beta$ a vector of size the number of covariates. Based on Equation 3 in tutorial 2 (and removing the prior distribution for $\alpha_{t,r}$ for sake of readability), we define $\mu$ the median of the Lognormal distribution as follows:

\[\begin{equation} population 〜 Poisson( pop\_density * settled\_area) \\ pop\_density 〜 Lognormal(\mu, \: \sigma) \\ \mu = \alpha_{t,r} + X \beta \\[10pt] \beta 〜 Normal(0,10)\tag{6.3} \end{equation}\]

Note that the prior for $\beta$ are identical normal distribution for each covariate with mean 0 and standard deviation 10 to avoid introducing any bias.

Figure 6.34 shows the updated dependent relationships of the model when integrating covariates.

Figure 6.34: Dependancy graph for model 1 of tutorial 3

6.4.4 Review of the covariates used in WorldPop

As of the date of publishment, five bottom-up population models have been produced at WorldPop:

WorldPop. 2019. Bottom-up gridded population estimates for Nigeria, version 1.2.
WorldPop, University of Southampton. https://dx.doi.org/10.5258/SOTON/WP00655.
WorldPop (School of Geography and Environmental Science, University of
Southampton). 2020. Bottom-up gridded population estimates for Zambia, version 1.0.
https://dx.doi.org/10.5258/SOTON/WP00662
WorldPop and Institut National de la Statistique et de la Démographie du Burkina Faso. 2021. Census-based gridded population estimates for Burkina Faso (2019), version 1.0. WorldPop, University of Southampton. https://dx.doi.org/10.5258/SOTON/WP00687.
Boo G, Darin E, Leasure DR, Dooley CA, Chamberlain HR, Lazar AN, Tatem AJ. 2020.
Modelled gridded population estimates for the Kinshasa, Kongo-Central, Kwango, Kwilu,
and Mai-Ndombe provinces in the Democratic Republic of the Congo, version 2.0.
WorldPop, University of Southampton. https://dx.doi.org/10.5258/SOTON/WP00669
WorldPop. 2020. Bottom-up gridded population estimates for the Kinshasa, Kongo-Central, Kwango, Kwilu, and Mai-Ndombe provinces in the Democratic Republic of the Congo, version 1.0. WorldPop, University of Southampton. https://dx.doi.org/10.5258/SOTON/WP00658

In addition to that, two models are currently being updated: Nigeria version 2.0 and Democratic Republic of the Congo v3.0.

These five models encompass a large range of covariates, specific to each country. Table 6.4 offers an overview of the final covariates set selected for each model.

review_cov <- read_csv('dat/BUM_tutorial/covs_review.csv')

review_cov %>% arrange(Type) %>% kbl(caption='Review of covariates used in WorldPop bottom-up population models') %>% kable_minimal()

Table 6.4: Review of covariates used in WorldPop bottom-up population models
Type	Covariate	Model	Source
Gridded population	UN-adjusted projected gridded estimates	Burkina Faso v1.0	WorldPop
Gridded population	Projected gridded estimates	Nigeria v1.2	WorldPop
Gridded population	Mean UN-adjusted projected gridded estimates within a 2km radius	Democratic Republic of Congo v1.0	WorldPop
Infrastucture	Fricition surface	Burkina Faso v1.0	Access to the Cities project
Infrastucture	Distance to secondary roads	Burkina Faso v1.0	National Geographical Office
Infrastucture	Household size	Nigeria v1.2, Nigeria v2.0	Demographic and Health Survey
Infrastucture	Residential roads density	Democratic Republic of Congo v1.0	OpenStreetMap
Infrastucture	Travel time to cities	Democratic Republic of Congo v1.0	Malaria Atlas Project
Infrastucture	Tertiary-sector activities density	Democratic Republic of Congo v1.0	OpenStreetMap
Natural feature	Distance to temporary rivers	Burkina Faso v1.0	National Geographical Office
Natural feature	Monthly variability of dry matter productivity	Democratic Republic of Congo v3.0	Copernicus
Natural feature	Monthly variability of surface air temperature	Democratic Republic of Congo v3.0	Copernicus
Natural feature	Land surface ‘roughness’ from Synthetic Aperture Radard VH	Nigeria v2.0	Sentinel-1
Natural feature	Land surface ‘roughness’ from Synthetic Aperture Radard VV	Nigeria v2.0	Sentinel-1
Settlement	Mean building count within a 5km radius	Burkina Faso v1.0	Ecopia & Maxar
Settlement	Mean building area within a 1km radius	Democratic Republic of Congo v2.0	Ecopia & Maxar
Settlement	Mean distance to nearest building within a 1km radius	Democratic Republic of Congo v2.0	Ecopia & Maxar
Settlement	Mean building count within a 1km radius	Democratic Republic of Congo v2.0	Ecopia & Maxar
Settlement	Mean building area	Zambia v1.0	Ecopia & Maxar
Settlement	Building density	Zambia v1.0	Ecopia & Maxar
Settlement	Coefficient of variation of building area	Zambia v1.0	Ecopia & Maxar
Settlement	Sum residential area within a 1km radius	Nigeria v1.2	Oak Ridge National Laboratory
Settlement	Sum nonresidential area within a 1km radius	Nigeria v1.2	Oak Ridge National Laboratory
Settlement	School density within a 1km radius	Nigeria v1.2	Oak Ridge National Laboratory
Settlement	Mean building perimeter	Democratic Republic of Congo v3.0	Ecopia & Maxar
Settlement	Compactness of building	Democratic Republic of Congo v3.0	Ecopia & Maxar

The covariates selected can be broadly classified as describing four main drivers of local population density variation:

previous population spatial distribution through WorldPop top-down population disaggregation (WorldPop Research Group et al. 2018)
infrastructure through OpenStreetMap (OpenStreetMap contributors 2018), country-specific resources (for example Institut Géographique du Burkina Faso (2015)) or modelled resources (for example travel time to cities (Weiss et al. 2018a))
natural features through remote sensing product such as radar data (Sentinel-1) or dry matter productivity (Copernicus)
settlement shape through the morphology of building footprints (Ecopia.AI & Maxar Technologies 2019) or settled area (Oak Ridge National Laboratory 2018a)

Other covariates sources that were considered (but not selected in the final models) include:

Conflict locations from the Armed Conflict Location and Event Data Project
Climatic variables from the Climatic Research Unit at the university of Anglia
Active mining concessions from the IPIS group
Land cover classification from the European Spatial Agency
Elevation and slope from WorldDEM
Fossil fuel emissions from the Open-source Data Inventory for Anthropogenic CO2 Project

To build a model, covariates that could be related to the specific population data for that project are collated. This can include as many as 900 or more individual datasets. Then using geospatial analysis techniques, we produce
gridded versions of the covariates with identical spatial resolution, alignment and extent. For raster datasets, this can involve resampling and clipping the extents of the dataset, while for vector datasets this may involve computing count, density, distance to nearest features or interpolation techniques.

6.4.5 Covariates engineering

Further covariates engineering steps can help extract additional information from the gathered covariates.

Log-transformation

Considering the logarithm of covariates helps in handling extreme values.

Focal statistics

Focal statistics consist of summarising covariates in a moving window around each grid cell. As seen in Table 6.4, we used different window sizes (1km, 2km or 5km) and summary statistics (mean or coefficient of variation). It provides contextual information around the grid cells.

Standardisation

Scaling the covariate (that is subtracting the mean and dividing by the standard deviation) helps enhancing meaningful local variations from the mean. The scaling can even be refined by computing the mean and the standard deviation by region, such that local variations are representative of the region.

6.4.5.1 A note on covariate selection

After engineering the gathered covariates, this may produce 1000+ potential covariates.

To select the best one for prediction purposes, generally one of the two following methods are used:

pairwise correlation and scatter plot with the population density at study site level
univariate model, testing each covariate successively

6.4.6 Including covariates in the model

The remaining parts of this tutorial will focus on the data downloaded from Leasure et al. (2020f) which corresponds to the Nigeria model v1.2.

6.4.7 Overview of the covariates in Nigeria, v1.2

Six covariates were used in Nigeria v1.2 model:

x_1: gridded population estimates from WorldPop Global
x_2: school densities within a 1km radius
x_3: household sizes by interpolating Demographic Health Survey results from 2013
x_4: settled area within a 1km radius
x_5: residential area in a 1km radius
x_6: nonresidential settled area within a 1km radius

Covariate x_4 was scaled based on its mean and standard deviation nationally, whereas covariates x_5 and x_6 were scaled based on their mean and standard deviation within a 50-km radius. Leasure et al scaled x_5 and x_6 in this way because they suspected that neighborhood types may not be directly comparable across regions (especially northern versus southern Nigeria). This scaling also reduced correlation with x_4 .

They scaled the WorldPop Global estimates (x_1) based on their mean and standard deviation nationally. They treated this covariate as an indicator of relative population densities based on the geospatial covariates that were used in the random forest model.

Covariate x_2 was scaled using its mean and standard deviation within a 50km radius. They scaled this covariate within a 50km moving window because what constitutes a “high density” of schools varies by region and this distinction was lost when the covariate was scaled nationally. This also helped to control for possible differences in school mapping effort in different regions.

They scaled x_3 based on its mean and standard deviation nationally. One key reason for including this covariate was to account for a strong north–south gradient in household sizes, with significantly more people per household in northern Nigeria than in southern Nigeria.

6.4.8 Preparing the data

To integrate the covariates in the model, the first step is to build a dataset consisting of the average covariate values for each study site (captured using zonal statistics).

Note that this constitutes a change in support: we might want to check if the range of covariates values at study site level is representative of the covariate values at grid cell level.

Figure ?? shows the relation between the covariates and the population density at study site level. We see that household size (x_3) is positively associated with population density. The negative values are due to the scaling method adopted. Conversely, nonresidential settled area (x_6) is negatively associated with population density. This is expected: the more the surroundings are nonresidential the lower the population density.

Figure 6.35: Scatterplot of covariates values vs population density for each study site

## `geom_smooth()` using formula = 'y ~ x'

Figure 6.36: Scatterplot of covariates values vs population density for each study site

Before implementing the model in stan, the covariates are uniformly scaled at each study site level, such that the $\beta_k$ have the same scale. We first compute the scaling coefficients (mean and standard deviation) for each covariate:

# compute scaling factors (mean and sd)
covariatesScaling <- function(var){
  mean_var <- mean(var)
  sd_var <- sd(var)
  return(
    data.frame(
      'cov_mean'= mean_var,
      'cov_sd' = sd_var
    )
  )
} 

covs <- data %>% 
  select(starts_with('x'))

scale_factor <- bind_rows(apply(covs, 2, covariatesScaling))
scale_factor$cov <- colnames(covs)

scale_factor %>% select(cov, cov_mean, cov_sd) %>% kbl %>%  kable_minimal()

cov	cov_mean	cov_sd
x1	3.913963	8.071879
x2	2.950790	3.970139
x3	0.000000	1.000000
x4	5.081461	4.748466
x5	3.941279	4.377713
x6	2.729769	5.028868

We then apply the scaling coefficient to the covariates:

# apply scaling factors to covariates
covs_scaled <-  covs %>% 
  mutate(cluster_id = 1:nrow(covs)) %>% 
  pivot_longer(-cluster_id,names_to = 'cov') %>% 
  left_join(scale_factor, by="cov") %>% 
  mutate(value= (value-cov_mean)/cov_sd ) %>% 
  select(-cov_mean, -cov_sd) %>% 
  pivot_wider(names_from = cov, values_from = value, id_cols = cluster_id) %>% 
  select(-cluster_id)

We store the scaled covariates and the scaling coefficients for the prediction stage (in Part IV).

# save scaling factor
write_csv(covs_scaled, here('dat/BUM_tutorial/covs_scaled.csv'))
write_csv(scale_factor, here('dat/BUM_tutorial/scale_factor.csv'))

6.4.9 Implementing the model

Equation (6.3) is implemented in stan as follows:

// Model 1: Hierarchical alpha by settlement type , region + covariates
data{
  ...
  // slope
  int<lower=0> ncov; // number of covariates
  matrix[n, ncov] cov; // covariates
}
parameters{
  ...
  // slope
  row_vector[ncov] beta; 
}
transformed parameters{
  ...
  for(idx in 1:n){
    pop_density_median[idx] = alpha_t_r[type[idx], region[idx]] + sum( cov[idx,] .* beta );
  }
}
model{
  ...
  //slope
  beta ~ normal(0,10);
}
generated quantities{
  ...
   for(idx in 1:n){
    density_hat[idx] = lognormal_rng( alpha_t_r[type[idx], region[idx]] + sum(cov[idx,] .* beta), sigma );
   }
}

Note the two new data types - matrix for the covariate values and row_vector for $\beta$ - as well as the new operator, .* . A row vector is a matrix with one row. .* performs elementwise multiplication. These elements are required due to the way the linear regression is coded for pop_density_median:in a for loop, running through each study site defined by their idx. The corresponding covariates values, cov[idx,] which is a row vector are extracted for each study site.
To obtain a vector for each covariate value associated with each parameter $\beta_k$, beta needs to be a row_vector and the multiplication needs to be performed element by element.

We keep the same set up for the MCMC:

# 3. Modelling with covariates ----

# mcmc settings
chains <- 4
warmup <- 500
iter <- 500
seed <- 1789

And we add the covariates to stan input data:

# prepare data for stan
stan_data_model1 <- list(
  population = data$N,
  n = nrow(data),
  area = data$A,
  type = data$type,
  ntype= n_distinct(data$type),
  region = data$region,
  nregion = n_distinct(data$region),
  seed=seed,
  cov = covs_scaled,
  ncov = ncol(covs_scaled)
  )

beta is added as a parameter to monitor and run the model.

pars <- c('alpha','sigma','beta','alpha_t', 'nu_alpha', 'nu_alpha_t', 'population_hat',  'density_hat')

# mcmc
fit1 <- rstan::stan(file = file.path('dat/BUM_tutorial/tutorial3_model1.stan'), 
                   data = stan_data_model1,
                   iter = warmup + iter, 
                   chains = chains,
                   warmup = warmup, 
                   pars = pars,
                   seed = seed)

## Warning in validityMethod(object): The following variables have undefined values:
## population_hat[1],The following variables have undefined values: population_hat[2],The
## following variables have undefined values: population_hat[3],The following variables have
## undefined values: population_hat[4],The following variables have undefined values:
## population_hat[5],The following variables have undefined values: population_hat[6],The
## following variables have undefined values: population_hat[7],The following variables have
## undefined values: population_hat[8],The following variables have undefined values:
## population_hat[9],The following variables have undefined values: population_hat[10],The
## following variables have undefined values: population_hat[11],The following variables have
## undefined values: population_hat[12],The following variables have undefined values:
## population_hat[13],The following variables have undefined values: population_hat[14],The
## following variables have undefined values: population_hat[15],The following variables have
## undefined values: population_hat[16],The following variables have undefined values:
## population_hat[17],The following variables have undefined values: population_hat[18],The
## following variables have undefined values: population_hat[19],The following variables have
## undefined values: population_hat[20],The following variables have undefined values:
## population_hat[21],The following variables have undefined values: population_hat[22],The
## following variables have undefined values: population_hat[23],The following variables have
## undefined values: population_hat[24],The following variables have undefined values:
## population_hat[25],The following variables have undefined values: population_hat[26],The
## following variables have undefined values: population_hat[27],The following variables have
## undefined values: population_hat[28],The following variables have undefined values:
## population_hat[29],The following variables have undefined values: population_hat[30],The
## following variables have undefined values: population_hat[31],The following variables have
## undefined values: population_hat[32],The following variables have undefined values:
## population_hat[33],The following variables have undefined values: population_hat[34],The
## following variables have undefined values: population_hat[35],The following variables have
## undefined values: population_hat[36],The following variables have undefined values:
## population_hat[37],The following variables have undefined values: population_hat[38],The
## following variables have undefined values: population_hat[39],The following variables have
## undefined values: population_hat[40],The following variables have undefined values:
## population_hat[41],The following variables have undefined values: population_hat[42],The
## following variables have undefined values: population_hat[43],The following variables have
## undefined values: population_hat[44],The following variables have undefined values:
## population_hat[45],The following variables have undefined values: population_hat[46],The
## following variables have undefined values: population_hat[47],The following variables have
## undefined values: population_hat[48],The following variables have undefined values:
## population_hat[49],The following variables have undefined values: population_hat[50],The
## following variables have undefined values: population_hat[51],The following variables have
## undefined values: population_hat[52],The following variables have undefined values:
## population_hat[53],The following variables have undefined values: population_hat[54],The
## following variables have undefined values: population_hat[55],The following variables have
## undefined values: population_hat[56],The following variables have undefined values:
## population_hat[57],The following variables have undefined values: population_hat[58],The
## following variables have undefined values: population_hat[59],The following variables have
## undefined values: population_hat[60],The following variables have undefined values:
## population_hat[61],The following variables have undefined values: population_hat[62],The
## following variables have undefined values: population_hat[63],The following variables have
## undefined values: population_hat[64],The following variables have undefined values:
## population_hat[65],The following variables have undefined values: population_hat[66],The
## following variables have undefined values: population_hat[67],The following variables have
## undefined values: population_hat[68],The following variables have undefined values:
## population_hat[69],The following variables have undefined values: population_hat[70],The
## following variables have undefined values: population_hat[71],The following variables have
## undefined values: population_hat[72],The following variables have undefined values:
## population_hat[73],The following variables have undefined values: population_hat[74],The
## following variables have undefined values: population_hat[75],The following variables have
## undefined values: population_hat[76],The following variables have undefined values:
## population_hat[77],The following variables have undefined values: population_hat[78],The
## following variables have undefined values: population_hat[79],The following variables have
## undefined values: population_hat[80],The following variables have undefined values:
## population_hat[81],The following variables have undefined values: population_hat[82],The
## following variables have undefined values: population_hat[83],The following variables have
## undefined values: population_hat[84],The following variables have undefined values:
## population_hat[85],The following variables have undefined values: population_hat[86],The
## following variables have undefined values: population_hat[87],The following variables have
## undefined values: population_hat[88],The following variables have undefined values:
## population_hat[89],The following variables have undefined values: population_hat[90],The
## following variables have undefined values: population_hat[91],The following variables have
## undefined values: population_hat[92],The following variables have undefined values:
## population_hat[93],The following variables have undefined values: population_hat[94],The
## following variables have undefined values: population_hat[95],The following variables have
## undefined values: population_hat[96],The following variables have undefined values:
## population_hat[97],The following variables have undefined values: population_hat[98],The
## following variables have undefined values: population_hat[99],The following variables have
## undefined values: population_hat[100],The following variables have undefined values:
## population_hat[101],The following variables have undefined values: population_hat[102],The
## following variables have undefined values: population_hat[103],The following variables
## have undefined values: population_hat[104],The following variables have undefined values:
## population_hat[105],The following variables have undefined values: population_hat[106],The
## following variables have undefined values: population_hat[107],The following variables
## have undefined values: population_hat[108],The following variables have undefined values:
## population_hat[109],The following variables have undefined values: population_hat[110],The
## following variables have undefined values: population_hat[111],The following variables
## have undefined values: population_hat[112],The following variables have undefined values:
## population_hat[113],The following variables have undefined values: population_hat[114],The
## following variables have undefined values: population_hat[115],The following variables
## have undefined values: population_hat[116],The following variables have undefined values:
## population_hat[117],The following variables have undefined values: population_hat[118],The
## following variables have undefined values: population_hat[119],The following variables
## have undefined values: population_hat[120],The following variables have undefined values:
## population_hat[121],The following variables have undefined values: population_hat[122],Th

The model runs into convergence issues. The error message that is returned is not very informative. However the Viewer pane in Rstudio contains more information, namely:

Chain 2: Exception: lognormal_lpdf: Scale parameter is 0, but must be > 0!.

This inform us that the combination of $\alpha$, $\beta$ and the covariates value that is currently being tested has resulted in some zeros in the scale term of the lognormal which is forbidden.

6.4.10 A note on initialisation

MCMC simulations start exploring the parameter space from one initial value. This initial value is controlled in stan by the option init. Its default is random, that is “Let Stan generate random initial values for all parameters. The seed of the random number generator used by Stan can be specified via the seed argument. If the seed for Stan is fixed, the same initial values are used. The default is to randomly generate initial values between -2 and 2 on the unconstrained support”.

Defining the starting values helps the algorithm to start close to the region of interest so that no time is lost in exploring an area of the parameter space that we know doesn’t fit the likely values and might run into parameters combination that is unfit to our model structure. Note that the initialisation is not constraining, it just gives a hint to the algorithm.

Only root parameters are initialised. Root parameters are parameters that don’t depend on other parameters. Dependent parameters will then inherit the initialisation.

# add initialisation
inits.out <- list()
set.seed(stan_data_model1$seed)

for (c in 1:chains){
  inits.i <- list()

  inits.i$sigma <- runif(1, 0.4, 0.8)
  inits.i$alpha <- runif(1, 3, 6)
  inits.i$beta <- runif(stan_data_model1$ncov, -1, 1)
  
  inits.out[[c]] <- inits.i
}

Note that initial values are defined for each chain. They are based around estimated values in previous models, adding some random jittering.

The estimation is re-run with these initialisation values.

fit1bis <- rstan::stan(file = file.path('dat/BUM_tutorial/tutorial3_model1.stan'), 
                   data = stan_data_model1,
                   iter = warmup + iter, 
                   chains = chains,
                   warmup = warmup, 
                   pars = pars,
                   seed = seed,
                   init= inits.out)

## Warning in validityMethod(object): The following variables have undefined values:
## population_hat[1],The following variables have undefined values: population_hat[2],The
## following variables have undefined values: population_hat[3],The following variables have
## undefined values: population_hat[4],The following variables have undefined values:
## population_hat[5],The following variables have undefined values: population_hat[6],The
## following variables have undefined values: population_hat[7],The following variables have
## undefined values: population_hat[8],The following variables have undefined values:
## population_hat[9],The following variables have undefined values: population_hat[10],The
## following variables have undefined values: population_hat[11],The following variables have
## undefined values: population_hat[12],The following variables have undefined values:
## population_hat[13],The following variables have undefined values: population_hat[14],The
## following variables have undefined values: population_hat[15],The following variables have
## undefined values: population_hat[16],The following variables have undefined values:
## population_hat[17],The following variables have undefined values: population_hat[18],The
## following variables have undefined values: population_hat[19],The following variables have
## undefined values: population_hat[20],The following variables have undefined values:
## population_hat[21],The following variables have undefined values: population_hat[22],The
## following variables have undefined values: population_hat[23],The following variables have
## undefined values: population_hat[24],The following variables have undefined values:
## population_hat[25],The following variables have undefined values: population_hat[26],The
## following variables have undefined values: population_hat[27],The following variables have
## undefined values: population_hat[28],The following variables have undefined values:
## population_hat[29],The following variables have undefined values: population_hat[30],The
## following variables have undefined values: population_hat[31],The following variables have
## undefined values: population_hat[32],The following variables have undefined values:
## population_hat[33],The following variables have undefined values: population_hat[34],The
## following variables have undefined values: population_hat[35],The following variables have
## undefined values: population_hat[36],The following variables have undefined values:
## population_hat[37],The following variables have undefined values: population_hat[38],The
## following variables have undefined values: population_hat[39],The following variables have
## undefined values: population_hat[40],The following variables have undefined values:
## population_hat[41],The following variables have undefined values: population_hat[42],The
## following variables have undefined values: population_hat[43],The following variables have
## undefined values: population_hat[44],The following variables have undefined values:
## population_hat[45],The following variables have undefined values: population_hat[46],The
## following variables have undefined values: population_hat[47],The following variables have
## undefined values: population_hat[48],The following variables have undefined values:
## population_hat[49],The following variables have undefined values: population_hat[50],The
## following variables have undefined values: population_hat[51],The following variables have
## undefined values: population_hat[52],The following variables have undefined values:
## population_hat[53],The following variables have undefined values: population_hat[54],The
## following variables have undefined values: population_hat[55],The following variables have
## undefined values: population_hat[56],The following variables have undefined values:
## population_hat[57],The following variables have undefined values: population_hat[58],The
## following variables have undefined values: population_hat[59],The following variables have
## undefined values: population_hat[60],The following variables have undefined values:
## population_hat[61],The following variables have undefined values: population_hat[62],The
## following variables have undefined values: population_hat[63],The following variables have
## undefined values: population_hat[64],The following variables have undefined values:
## population_hat[65],The following variables have undefined values: population_hat[66],The
## following variables have undefined values: population_hat[67],The following variables have
## undefined values: population_hat[68],The following variables have undefined values:
## population_hat[69],The following variables have undefined values: population_hat[70],The
## following variables have undefined values: population_hat[71],The following variables have
## undefined values: population_hat[72],The following variables have undefined values:
## population_hat[73],The following variables have undefined values: population_hat[74],The
## following variables have undefined values: population_hat[75],The following variables have
## undefined values: population_hat[76],The following variables have undefined values:
## population_hat[77],The following variables have undefined values: population_hat[78],The
## following variables have undefined values: population_hat[79],The following variables have
## undefined values: population_hat[80],The following variables have undefined values:
## population_hat[81],The following variables have undefined values: population_hat[82],The
## following variables have undefined values: population_hat[83],The following variables have
## undefined values: population_hat[84],The following variables have undefined values:
## population_hat[85],The following variables have undefined values: population_hat[86],The
## following variables have undefined values: population_hat[87],The following variables have
## undefined values: population_hat[88],The following variables have undefined values:
## population_hat[89],The following variables have undefined values: population_hat[90],The
## following variables have undefined values: population_hat[91],The following variables have
## undefined values: population_hat[92],The following variables have undefined values:
## population_hat[93],The following variables have undefined values: population_hat[94],The
## following variables have undefined values: population_hat[95],The following variables have
## undefined values: population_hat[96],The following variables have undefined values:
## population_hat[97],The following variables have undefined values: population_hat[98],The
## following variables have undefined values: population_hat[99],The following variables have
## undefined values: population_hat[100],The following variables have undefined values:
## population_hat[101],The following variables have undefined values: population_hat[102],The
## following variables have undefined values: population_hat[103],The following variables
## have undefined values: population_hat[104],The following variables have undefined values:
## population_hat[105],The following variables have undefined values: population_hat[106],The
## following variables have undefined values: population_hat[107],The following variables
## have undefined values: population_hat[108],The following variables have undefined values:
## population_hat[109],The following variables have undefined values: population_hat[110],The
## following variables have undefined values: population_hat[111],The following variables
## have undefined values: population_hat[112],The following variables have undefined values:
## population_hat[113],The following variables have undefined values: population_hat[114],The
## following variables have undefined values: population_hat[115],The following variables
## have undefined values: population_hat[116],The following variables have undefined values:
## population_hat[117],The following variables have undefined values: population_hat[118],The
## following variables have undefined values: population_hat[119],The following variables
## have undefined values: population_hat[120],The following variables have undefined values:
## population_hat[121],The following variables have undefined values: population_hat[122],Th

## Warning: There were 1 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.

## Warning: Examine the pairs() plot to diagnose sampling problems

The former error message no longer appears and the model contains samples. We still observe the same issue of tutorial 2, the integer overflow for the prediction of population count in the warmup period. We can safely say that the model has converged.

We can plot the $\hat\beta_k$. We can see that the sign and magnitude of the covariate effects are inline with the association shown in Figure 6.37.

Figure 6.37: Plot of beta estimation

6.4.11 Comparing prediction with previous model

The next question is: How much improvement is brought by integrating covariates in the model?

We load Tutorial 2 model 3 to compare:

# load tutorial 2 final model
fit0 <- readRDS('dat/BUM_tutorial/tutorial2_model3_fit.rds')

We can then compute the predictions for every study site and compare the goodness-of-fit when adding the covariates.

# extract predictions
getPopPredictions <- function(model_fit, 
                              estimate='population_hat',
                              obs='N', reference_data=data){
  # extract predictions
  predicted_pop <- as_tibble(extract(model_fit, estimate)[[estimate]])
  colnames(predicted_pop) <- reference_data$id
  
  # summarise predictions
  predicted_pop <- predicted_pop %>% 
    pivot_longer(everything(),names_to = 'id', values_to = 'predicted') %>% 
    group_by(id) %>% 
    summarise(
      across(everything(), list(mean=~mean(.), 
                                upper=~quantile(., probs=0.975), 
                                lower=~quantile(., probs=0.025)))
      ) %>% 
    left_join(reference_data %>% 
                rename('reference'=all_of(obs)) %>% 
                select(id, reference), by = 'id')%>% 
    mutate(
      residual= predicted_mean - reference,
      ci_size = predicted_upper- predicted_lower,
      estimate = estimate
      )

return(predicted_pop)
}

comparison_df <- rbind(
 getPopPredictions(fit0) %>% 
   mutate(Model='Without covariates'),
 getPopPredictions(fit1) %>% 
   mutate(Model='With covariates'))

# compute goodness-of-fit metrics
comparison_df %>% group_by(Model) %>% 
  summarise( `Bias`= mean(residual),
    `Inaccuracy` = mean(abs(residual)),
        `Imprecision` = sd(residual)
) %>%  kbl(caption = 'Goodness-of-metrics comparison with and without covariates ') %>% kable_minimal()

Table 6.5: Goodness-of-metrics comparison with and without covariates
Model	Bias	Inaccuracy	Imprecision
With covariates	32.93574	193.2409	275.3399
Without covariates	45.37351	234.3248	307.1996

An improvement can be seen on every goodness-of-fit metrics.

6.4.12 Grouping and covariates effect: a random slope model

In tutorial 2 we have seen that a model fitting a unique $\alpha$ for all the observations could be improved by splitting the observations into groupings that would share a similar pattern of population density.

The idea is similar to $\beta$: some covariates effects may vary by grouping. Example: x_4, the sum of the settled area within a 1km radius might, might have a greater predictive power in rural areas than in urban areas. The different relationships between covariate and population density by settlement type is highlighted in Figure 6.38.

Figure 6.38: Scatterplot of covariates vs population density by settlement type

Modelling $\beta_k$ by settlement type is called a random slope model.

Question: Do we want to model the $\beta_k$ hierarchically?

Click for the solution

Modelliong $\beta$ hierarchically means assuming that there is a national pattern with subnational refinement. The $\beta_{k,t}$ can however have opposite directions (see Figure 6.38) which speaks against a common overarching $\beta$.

Formally, a random slope model is written as follows:

\[\begin{equation} population 〜 Poisson( pop\_density * settled\_area) \\ pop\_density 〜 Lognormal(\mu, \: \sigma) \\[10pt] \mu = \alpha_{t,r} + \beta^{random}_t X^{random} + \beta^{fixed} X^{fixed} \\[15pt] \beta^{random}_t 〜 Normal(0,10) \\ \beta^{fixed} 〜 Normal(0,10) \end{equation}\]

The difference can be seen in the indexing: $\beta^{random}$ is indexed by $t$. Similarly as in the no-pooling framework in Tutorial 2, we set the priors to be independent prior, namely Normal(0,10).

The stan implementation is as follows:

// Model 1: Hierarchical alpha by settlement type , region + covariates
data{
  ...
    // fixed slope
  int<lower=0> ncov_fixed; // number of covariates -1
  matrix[n, ncov_fixed] cov_fixed; // covariates
  // random slope
  vector[n] cov_random;
}
parameters{
  ...
  // slope
  row_vector[ncov_fixed] beta_fixed; 
  vector[ntype] beta_random;
}
transformed parameters{
  ...
  vector[n] beta;

  for(idx in 1:n){
    beta[idx] = sum( cov_fixed[idx,] .* beta_fixed) + cov_random[idx] * beta_random[type[idx]];
    pop_density_median[idx] = alpha_t_r[type[idx], region[idx]] + beta[idx];
  }
}
model{
  ...
  //slope
  beta_fixed ~ normal(0,10);
  beta_random ~ normal(0,10);
}
generated quantities{
  ...
 vector[n] beta_hat;

  for(idx in 1:n){
    beta_hat[idx] = sum( cov_fixed[idx,] .* beta_fixed) + cov_random[idx] * beta_random[type[idx]];
    density_hat[idx] = lognormal_rng( alpha_t_r[type[idx], region[idx]] + beta_hat[idx], sigma );
  ...
}

Note that the code is written to model only one random covariate, such that beta_random is only a vector that contains the $\beta^{random}_t$. For implementing several random effect a matrix is required (settlement type x number of covariates).

To run the model, we distinguish in the input data between the covariates that are fixed and the one that is random. We choose x_4 , the sum of settled area within a 1km radius, to be modelled with a random effect.

This setting allows the user to test the model with different covariate candidates for the random effect.

# prepare stan data
stan_data_model2 <- list(
  population = data$N,
  n = nrow(data),
  area = data$A,
  type = data$type,
  ntype= n_distinct(data$type),
  region = data$region,
  nregion = n_distinct(data$region),
  seed=seed,
  cov_fixed = covs_scaled %>% select(-x4),
  ncov_fixed = ncol(covs_scaled) -1,
  cov_random = covs_scaled$x4
  )

pars <- c('alpha','sigma','beta_fixed','beta_random','alpha_t','alpha_t_r', 'nu_alpha', 'nu_alpha_t', 'population_hat',  'density_hat')

# initialise
inits.out <- list()
set.seed(stan_data_model2$seed)

for (c in 1:chains){
  inits.i <- list()
  # intercept
  inits.i$sigma <- runif(1, 0.4, 0.8)
  inits.i$alpha <- runif(1, 3, 6)
  inits.i$beta_fixed <- runif(stan_data_model2$ncov_fixed, -1, 1)
  inits.i$beta_random <- runif(stan_data_model2$ntype, -1, 1)

  inits.out[[c]] <- inits.i
}

# mcmc
fit2 <- rstan::stan(file = file.path('dat/BUM_tutorial/tutorial3_model2.stan'), 
                   data = stan_data_model2,
                   iter = warmup + iter, 
                   chains = chains,
                   warmup = warmup, 
                   pars = pars,
                   seed = seed,
                   init = inits.out)

## Warning: There were 3 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.

## Warning: Examine the pairs() plot to diagnose sampling problems

No error indicating a convergence issue is outputted from the above code. We plot beta_random that is a vector with a $\hat\beta_t$ for each settlement type.

Figure 6.39: Plot of beta estimation

We see that modelling $\beta^{x4}$ by settlement type highlights different patterns: we observe a non-significant effect for settlement 1 and 3 the most urbanised types. This is expected, as the sum of the settled area is likely to be homogeneous across urbanised areas. In addition, we see that the previous estimated $\beta_4$ was masking effect in opposite directions between settlement 2 and settlement 4,5. To finalise, We evaluate the effect on the predicted population count for each study site.

# extract predictions
comparison_df <- rbind(
 getPopPredictions(fit1) %>% 
   mutate(model='Fixed effect'),
 getPopPredictions(fit2) %>% 
   mutate(model='Random effect in x4'))
# compute goodness-of-fit metrics
comparison_df %>% group_by(model) %>% 
  summarise( `Bias`= mean(residual),
    `Inaccuracy` = mean(abs(residual)),
        `Imprecision` = sd(residual)
) %>%  kbl(caption = 'Goodness-of-metrics comparison with and without random effect in x4 ') %>% kable_minimal()

Table 6.6: Goodness-of-metrics comparison with and without random effect in x4
model	Bias	Inaccuracy	Imprecision
Fixed effect	32.93574	193.2409	275.3399
Random effect in x4	33.06451	191.9198	274.2706

We see a slight decrease of bias and an increase of the precision of the estimates.

Save the results of this final model as a RDS file. This will be explored further in Tutorial 4.

# save model
saveRDS(fit2, 'dat/BUM_tutorial/tutorial3_model2_fit.rds')

6.5 Part IV: Model diagnostics and predictions

From Part I to Part III, we have explored building a bottom-up population models through sequential modelling. Part I explored modelling population count as a Poisson-Lognormal compound, accounting for settled area. Part II developed this, adding a hierarchical random intercept by settlement type and region, to differentiate parameter estimation by the natural clustering of the observations. Part III continued by modelling small-scale variations in population density by adding a linear regression based on geospatial covariates. We covered thus all the fundamental modelling blocks of WorldPop bottom-up population model, described in Leasure et al. (2020f).

This tutorial will cover advanced model diagnostics for a complete goodness-of-fit assessment. Following checks that the model successfully passes the different diagnostics, we will then predict population count for every grid cell of the study area. Predicting population will be the opportunity to talk about estimates uncertainty.

6.5.1 Goals

Explore convergence issue
Understand posterior predictive check
Understand cross-validation and overfitting
Predict population for every grid cell of the study area
Visualise prediction uncertainty
Aggregate prediction in custom spatial units

6.5.2 Supporting readings

Evaluating regression models by the Bayes rules! book, on posterior predictive check and the difference between correct estimation and correct modelling.
Tidy data and Bayesian analysis make uncertainty visualization fun by Matthew Kay, University of Michigan. It emphasizes the importance of communicating uncertainty with estimates and explains how a Bayesian framework addresses that issue.
You’re fit and you know it: overfitting and cross-validation, by Andy Elmsley. Good introduction to cross-validation despite applying it to machine learning problems.

6.5.3 Extra packages

Some additional packages will be required going forward, mainly for GIS manipulation:

raster for predicting the gridded population in a raster format
sf for aggregating the prediction in custom spatial unit
tmap to plot spatial data
RColorBrewer to use nice color palette in plots

#install.packages('raster')
#install.packages('RColorBrewer')
#install.packages('sf')
#install.packages('tmap')

Vignettes can be found here: raster, RColorBrewer, sf and tmap.

Load these packages in addition to the packages used in previous tutorials

# 1 Set-up ----

# load libraries
library(raster) #Raster data manipulation
library(RColorBrewer) # color code
library(sf) # vector data manipulation
library(tmap) # for mapping
library(tidyverse) # managing data
library(ggdag) # drawing DAG
library(kableExtra) # visualising table
library(here) # handling path
library(rstan) # running Bayesian models
library(plotly) # interactive plot

# stan setup
options(mc.cores = parallel::detectCores()-1)
rstan::rstan_options(auto_write = TRUE) # speed up running time of compiled model

6.5.4 Advanced model diagnostics

Before predicting population count for the entire study area, it is important to ensure sure that the model is correct. Thus, additional checks here can be beneficial. We will thus perform additional checks: on the model convergence with traceplots and stan warnings, on the estimated parameters with a posterior predictive check and on the predicted population count at study site with a cross-validation approach.

This section will first explore this with the initial relevant population model that was built in Part I, the “basic” Poisson-Lognormal model:

\[\begin{equation} population 〜 Poisson( pop\_density * settled\_area) \\ pop\_density 〜 Lognormal( \mu, \sigma) \\ \\ \mu 〜 Normal( 5, 4 ) \\ \sigma 〜 Uniform( 0, 4 )\tag{6.4} \end{equation}\]

We do some set-ups that should now be familiar.

# 1. Model diagnostics ----

# load data
data <- readxl::read_excel(here('dat/BUM_tutorial/nga_demo_data.xls'))
data <- data %>% 
  mutate(
    id = as.character(1:nrow(data)),
    pop_density=N/A
  )
# mcmc settings
chains <- 4
iter <- 500
seed <- 1789

# stan data
stan_data <- list(
  population = data$N,
  n = nrow(data),
  area = data$A)

# set parameters to monitor
pars <- c('mu', 'sigma', 'density_hat', 'population_hat')

6.5.5 Assessing convergence

In this section we will explore models that have convergence issues and discover how to evaluate where they stem from and how to fix it.

Firstly, let’s look at the impact of the warmup length, the early phase of the simulations in which the sequences get closer to the mass of the distribution.

We choose a warmup of 20 iterations (instead of 250).

# 1.1 Short warmup period ----

# set short warmup
warmup <- 20

# mcmc
fit_lowWarmup <- rstan::stan(file = file.path('dat/BUM_tutorial/tutorial1_model2.stan'), 
                          data = stan_data,
                          iter = warmup + iter, 
                          chains = chains,
                          warmup = warmup, 
                          pars = pars,
                          seed = seed)

## Warning: There were 535 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded

## Warning: There were 3 chains where the estimated Bayesian Fraction of Missing Information was low. See
## https://mc-stan.org/misc/warnings.html#bfmi-low

## Warning: Examine the pairs() plot to diagnose sampling problems

Stan shows several warnings. You can explore the meaning of them here.

Figure 6.40: Traceplot

The traceplots indicate that some exploration is still happening after the shaded area, that is after the end of the warmup period.This can be handled by increasing the length of the warmup period.

Next we will look at a more serious convergence issue. Let’s say that we choose unfit priors for our model. Typically we can declare a Uniform with stricter bounds for $\sigma$:

\[\begin{equation} \sigma 〜 Uniform( 0, 1 ) \end{equation}\]

We store the model under dat/BUM_tutorial/tutorial1_model2_wrong.stan and estimate it.

# 1.2 Wrong prior ----

warmup <- 250

# mcmc
fit_wrong <- rstan::stan(file = file.path('dat/BUM_tutorial/tutorial1_model2_wrong.stan'), 
                   data = stan_data,
                   iter = warmup + iter, 
                   chains = chains,
                   warmup = warmup, 
                   pars = pars,
                   seed = seed)

## Warning: There were 1978 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.

## Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
## https://mc-stan.org/misc/warnings.html#bfmi-low

## Warning: Examine the pairs() plot to diagnose sampling problems

## Warning: The largest R-hat is 4.4, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#r-hat

## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess

## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess

Many metrics detect a convergence issue. The most worrying one is the divergent transitions that are often related with issues in the model structure itself and not only within the estimation process.

$Traceplot for $\mu$ and $\sigma$$

Figure 6.41: Traceplot for $\mu$ and $\sigma$

The traceplot for $\mu$ indicates that the chains didn’t mix. Moreover, the traceplot for $\sigma$ clearly illustrates the issue in bounding with a ceiling at 1 that overly constrained the estimation.

6.5.6 Checking predicted posterior

Let’s refer back to a correctly fit model.

# 2.3 Predicted posterior check ----

# mcmc
fit <- rstan::stan(file = file.path('dat/BUM_tutorial/tutorial1_model2.stan'), 
                          data = stan_data,
                          iter = warmup + iter, 
                          chains = chains,
                          warmup = warmup, 
                          pars = pars,
                          seed = seed)

As shown in previous paragraph, Bayesian estimation highly relies on choosing correct priors for the parameters. Those priors should reflect expert knowledge about the parameter of interest. The prior model defines the space for the posterior estimation. It is therefore important to ensure that the prior support does not overly constrain the posterior. This is called a posterior predictive check.

To do posterior predictive checks, We need to extract the estimated posterior distribution from the stanfit object. This is done using the extract function from rstan.

# extract estimated mu
mus <- data.frame(
    # extract parameter
    posterior=rstan::extract(fit, pars='mu')$mu,
    parameter = 'mu'
    )
glimpse(mus)

## Rows: 2,000
## Columns: 2
## $ posterior <dbl> 4.752114, 4.773860, 4.759803, 4.752098, 4.783852, 4.759727, 4.730896, 4…
## $ parameter <chr> "mu", "mu", "mu", "mu", "mu", "mu", "mu", "mu", "mu", "mu", "mu", "mu",…

mus contains the estimated mu for each iteration post-warmup (500) of each Markov chain (4).

The aim here is to recreate the prior distribution. In model 2 of Part III, we used as prior: \[\mu \sim Normal(5,4)\], we sample from this prior distribution, the same number of draws that we have for the posterior distribution (that is 2000).

# retrieve stan parameter
post_warmup_draws <- iter - warmup

# simulate from the prior distribution
mus$prior <- rnorm(chains * post_warmup_draws, 5, 4)

We build for $\sigma$ a similar dataframe to mus that contrasts its prior distribution with its posterior distribution and bind it together with mus.

# build dataframe with parameters distribution comparison
parameters <-  rbind(
  # alpha
  mus ,
  #sigma
  data.frame(
      posterior=rstan::extract(fit, pars='sigma')$sigma,
      prior = runif(chains * post_warmup_draws, 0, 4),
      parameter = 'sigma')
  )  %>% 
  pivot_longer(-parameter,
    names_to = 'distribution',
    values_to = 'value'
  )

# plot distribution comparison for both parameter
ggplotly(ggplot(parameters, aes(x=value, color=distribution, fill=distribution))+
  geom_density()+
  theme_minimal()+
    facet_wrap(.~parameter, scales = 'free'))

Figure 6.42: Posterior predictive checks for alpha and sigma

Figure 6.42 shows a posterior predictive check for both $\mu$ and $\sigma$. It compares the prior distribution that we set in the model and the estimated posterior distribution. The range of values here can can vary significantly between the prior and posterior, and can be explored further by zooming in.

We confirm that our priors did not falsely constrain the posterior, with a very wide range of possible values. The unique constrain we can observe is the lower bound on $\sigma$ at 0 but this is normal as a variance has to be positive.

6.5.7 Cross-validation of model predictions

After checking the model structure, we revisit the goodness-of-fit metrics based on population predictions.

In Part I to III, we checked the goodness-of-fit of the modelled predictions in sample, that is predicting population and comparing it to the observations for the study sites that were already used to fit the model.

In-sample checks do not account for model overfitting. Overfitting occurs when the model does not only reproduce the information from the observations but also the noise. The model should also be able to accurately predict population for unknown new study sites. This real-world scenario can be reproduced by keeping a percentage of the observations for model fitting and leaving the remaining study sites for predicting, a mechanism called cross-validation.

Firstly, the data needs to be to splited in two, a training set (70%) that will be used for fitting the model and a predicting set (30%) which will be used for model prediction:

# 2.4 Cross validation of predictions ----

set.seed(2004)
# sample observations
train_size <-  round(nrow(data)*0.7)
train_idx <- sample(1:nrow(data), size = train_size, replace=F)

# build train datasets
train_data <- data[train_idx,]

# build test datasets
test_data <- data[-train_idx,]

In stan, cross-validation can be done simultaneously with model fitting by providing the testing dataset to the generated quantities modelling block. Because all variables have to be defined for each set, it significantly lengthens the code.

data{
  
  int<lower=0> n_train; // number of microcensus clusters in training set
  int<lower=0> n_test; // number of microcensus clusters in predicting set

  int<lower=0> population_train[n_train]; // count of people
  
  vector<lower=0>[n_train] area_train; // settled area in training
  vector<lower=0>[n_test] area_test; // settled area in testing
}

parameters{
  // population density
  vector<lower=0>[n_train] pop_density_train;
  
  //intercept
  real mu;

  // variance
  real<lower=0> sigma; 
}

model{
  
  // population totals
  population_train ~ poisson(pop_density_train .* area_train);
  pop_density_train ~ lognormal( mu, sigma );
  
  //  intercept
  mu ~ normal(5, 4);
  
  // variance
  sigma ~ uniform(0, 4);
}

generated quantities{
  
  int<lower=-0> population_hat[n_test];
  real<lower=0> density_hat[n_test];

  for(idx in 1:n_test){
    density_hat[idx] = lognormal_rng( mu, sigma );
    population_hat[idx] = poisson_rng(density_hat[idx] * area_test[idx]);
  }
}

Prepare the data for stan and run the model:

# prepare data
stan_data_xval <- list(
  population_train = train_data$N,
  n_train = nrow(train_data),
  n_test = nrow(test_data),

  area_train = train_data$A,
  area_test = test_data$A,

  seed=seed
)

# mcmc setting
pars <- c('mu','sigma', 'population_hat',  'density_hat')

# mcmc
fit_xval <- rstan::stan(file = file.path('dat/BUM_tutorial/tutorial1_model2xval.stan'), 
                   data = stan_data_xval,
                   iter = warmup + iter, 
                   chains = chains,
                   warmup = warmup, 
                   pars = pars,
                   seed = seed)

We now have access to two predictions sets: one of the dataset, fit, from the in-sample prediction and the other, fit_xval, from the cross-validation prediction.

We compare the goodness-of-fit-metrics from the two models:

# predict population
getPopPredictions <- function(model_fit, 
                              estimate='population_hat',
                              obs='N', reference_data=data){
  # extract predictions
  predicted_pop <- as_tibble(rstan::extract(model_fit, estimate)[[estimate]])
  colnames(predicted_pop) <- reference_data$id
  
  # summarise predictions
  predicted_pop <- predicted_pop %>% 
    pivot_longer(everything(),names_to = 'id', values_to = 'predicted') %>% 
    group_by(id) %>% 
    summarise(
      across(everything(), list(mean=~mean(.), 
                                upper=~quantile(., probs=0.975), 
                                lower=~quantile(., probs=0.025)))
      ) %>% 
    # merge with observations
    left_join(reference_data %>% 
                rename('reference'=all_of(obs)) %>% 
                dplyr::select(id, reference), by = 'id') %>%
    # 
    mutate(
      residual= predicted_mean - reference,
      ci_size = predicted_upper- predicted_lower,
      estimate = estimate
      )
return(predicted_pop)
}

# build comparison dataframe between in-sample and xvalidation
comparison_df <- rbind(
 getPopPredictions(fit) %>% 
   mutate(Model='In-sample'),
 getPopPredictions(fit_xval, reference_data = test_data) %>% 
   mutate(Model='Cross-validation'))

# compute goodness-of-fit metrics
comparison_df %>% group_by(Model) %>% 
  summarise( `Bias`= mean(residual),
    `Inaccuracy` = mean(abs(residual)),
        `Imprecision` = sd(residual)
) %>%  kbl(caption = 'Goodness-of-metrics computed in-sample vs cross-validation') %>% kable_minimal()

Table 6.7: Goodness-of-metrics computed in-sample vs cross-validation
Model	Bias	Inaccuracy	Imprecision
Cross-validation	96.48372	262.1705	312.2181
In-sample	60.85992	266.0833	337.9697

Table 6.7 illustrates that all metrics shows a worse fit: the bias, the imprecision and the inaccuracy increase. However they still remains in the same range, which seems to indicate that our model is not overfitting.

It is a good practice is to always assess the model fit on the testing set to avoid producing overly optimistic goodness-of-fit metrics.

6.5.8 Gridded population prediction

We will cover in this section model predictions for the entire study area at the grid cell level. Population can be predicted for any spatial unit as long as 1. all the covariates and input data can be produced for the spatial unit and 2. the predicting spatial unit is not too different in size from the fitting spatial unit, such that the estimated relationship between covariates and population density can still be applied.

To predict population for grid cell of 100m x 100m, we will use the best population model: model 2 from part III.

# 3. Population prediction ----

tutorial3_model2_fit <- readRDS("dat/BUM_tutorial/tutorial3_model2_fit.rds")

This requires extracting the estimated parameters distributions: $\hat\alpha_{t,r}, \hat\sigma, \hat\beta^{fixed}, \hat\beta^{random}_t$. These estimated parameters are then combined with the input rasters following this modified model for prediction:

\[\begin{equation} population\_predicted 〜 Poisson( \hat{pop\_density} * settled\_area) \\ \hat{pop\_density} 〜 Lognormal(\hat\mu, \: \hat\sigma) \\ \hat\mu = \hat\alpha_{t,r} + \hat\beta^{random}_t X^{random} + \hat\beta^{fixed} X^{fixed} \tag{6.5} \end{equation}\]

$X$ and $settled\_area$ are the input data at grid cell level:

Grid-cell based model prediction requires 10 input rasters:

settled area
settlement type
administrative region
the six covariates
a mastergrid that defines the grid cells for which to predict the population. These grid cells are (1) contained inside the study area, (2) considered as settled.

Unfortunately, it is not possible to redistribute the settlement map used in Nigeria v1.2, provided by Oak Ridge National Laboratory (2018a). Nonetheless, the code has been shared to show the process of the population prediction. We will focus on region 8 that contains only 671 110 cells instead of the 166 412 498 cells for the entire country.

6.5.9 Preparing data for prediction

To predict gridded population, the rasters set is converted to a table where the rows represent the grid cell values and the columns represent the variable names. The spatial structure of the raster is not required for model prediction. Every grid cell can be estimated independently as long as we know its administrative region, settlement type and covariates attributes.

library(raster) #Raster data manipulation
wd_path <- 'Z:/Projects/WP517763_GRID3/Working/NGA/ed/bottom-up-tutorial/'

# function that loads the raster and transforms it into an array
getRasterValues <- function(x){
  return( raster(paste0(wd_path,x))[])
}

# get a list of the rasters
rasterlist <- list.files(wd_path,
                         pattern='.tif', full.names = F)
# apply function to raster list
raster_df <- as_tibble(sapply(rasterlist, getRasterValues))

raster_df$gridcell_id <- 1:nrow(raster_df)

a <- raster_df %>% filter(mastergrid.tif==1) %>% 
  dplyr::select(starts_with('settlement'), gridcell_id) 
sett_type <- 1:6
a <- a %>% 
    rowwise() %>%
    mutate(
      settlementType=sett_type[which.max(c_across(starts_with('settlement')))])
raster_df <- left_join(raster_df %>% dplyr::select(-starts_with('settlementarea')),
                       a %>% dplyr::select(gridcell_id, settlementType))

colnames(raster_df) <- str_remove(colnames(raster_df), '.tif')

raster_df <- raster_df %>% 
  mutate(mastergrid = ifelse(is.na(x1), 0, mastergrid),
         mastergrid = ifelse(settlementType==6, 0, mastergrid))

write_csv(raster_df %>% filter(mastergrid==1)%>% head(n=10), 'wd/misc/raster_predict_ex.csv')

raster_df <- read.csv('../wd/misc/raster_predict_ex.csv')

The raster table contains 671 110 rows and 10 columns. 95% of the rows are NAs because they are not considered as settled.

settled <- raster_df %>% filter(mastergrid==1)
settled %>% head() %>% kbl() %>% kable_minimal()

The covariates need to be scaled using the scaling factors computed during the fitting stage.

#load scaling factor
scale_factor <- read_csv('dat/BUM_tutorial/scale_factor.csv')

scaled_covs <- settled %>% 
  # subset covs column
  dplyr::select(starts_with('x'), gridcell_id) %>% 
  
  # convert table to long format
  pivot_longer(-gridcell_id, names_to='cov', values_to = 'value') %>%
  
  # add scaling factor
  left_join(scale_factor) %>% 
  
  # scale covs
  mutate(value_scaled= (value-cov_mean)/cov_sd) %>% 
  dplyr::select(gridcell_id, cov, value_scaled) %>% 
  
  # convert table back to wide format
  pivot_wider(names_from = cov, values_from = value_scaled)

# replace the covs with their scaled version
raster_df <- raster_df %>% dplyr::select(-starts_with('x')) %>% 
  left_join(scaled_covs)

6.5.10 Extracting estimated parameters

The second step is to extract the estimated parameter distributions. We use the model fitted with the full observations dataset and not the one from the cross-validation for maximum information.

#extract betas
beta_fixed <- t(rstan::extract(model_fit, pars='beta_fixed')$beta_fixed)
beta_random <- t(rstan::extract(model_fit, pars='beta_random')$beta_random)
#extract alphas
alphas <- rstan::extract(model_fit, pars='alpha_t_r')$alpha_t_r
#extract sigmas
sigmas <-  rstan::extract(model_fit, pars='sigma')$sigma

We first focus on producing $\hat\beta^{fixed} X^{fixed}$ from Equation (6.5), the covariates modelled with a fixed effect.

$\hat\beta^{fixed}$ is a 2000 x 5 matrix for the 5 covariates and 2000 estimations. We transpose it to multiply by $X^{fixed}$, a 310 512 X 5 matrix of covariates for every settled cells.

# extract covariates modelled with a fixed effect
cov_fixed <- settled %>% 
  dplyr::select(x1:x3, x5:x6) %>% 
  as.matrix()

cov_fixed <- cov_fixed %*% beta_fixed

We then produce $\hat\beta^{random}_t X^{random}$, the covariates modelled with a random effect.

$\hat\beta^{random}_t$ is a 2000 x 5 matrix for the 5 settlement types. Each grid cell needs to be associated with the correct $\hat\beta^{random}_t$ based on the grid cell settlement type and then multiplied by $X^{random}$ the covariates values.

beta_random <- as_tibble(beta_random)
# add settlement type to beta random
beta_random$settlementType <- 1:5

# extract covariates modelled with a random effect
cov_random <- settled %>% 
  # subset random covariate and settlement type
  dplyr::select(settlementType,x4) %>% 
  # associate correct estimated beta_t
  left_join(beta_random) %>% 
  # multiply cov by slope
  mutate(
    across(starts_with('V'), ~ .x*x4)
  ) %>% 
  # keep only the estimates
  dplyr::select(-settlementType, -x4) %>% 
  as.matrix()

We eventually need to associate the correct $\hat\alpha_{t,8}$ to each grid cell.

$\hat\alpha_{t,8}$ has a similar format as $\hat\beta^{random}_t$. We will thus proceed in the same way.

# subset alpha_t for region 8
alpha_t_8 <- as_tibble(t(alphas[,,8]))
# assign settlement type
alpha_t_8$settlementType <- 1:5

alpha_predicted <- settled %>% 
  dplyr::select(settlementType) %>% 
  left_join(alpha_t_8) %>% 
  dplyr::select(-settlementType) %>% 
  as.matrix()

We can finally compute $\hat\mu$ as the sum of $\hat\alpha_{t,r}$, $\hat\beta^{random}_t X^{random}$ and $\hat\beta^{fixed} X^{fixed}$

# sum mu components
mu_predicted <- alpha_predicted + cov_fixed + cov_random

6.5.11 Predicting population count for every grid cell

To estimate the grid-cell population, we need to simulate from Equation (6.5) with the estimated $\hat\mu$ and $\hat\sigma$.

predictions <- as_tibble(mu_predicted) %>% 
  # add grid cell id and the settled area to mu
  mutate(
      gridcell_id= settled$gridcell_id,
      settled_area = settled$settled_area
    )  %>% 
  # long format
  pivot_longer(
    -c(gridcell_id, settled_area), 
    names_to = 'iterations', values_to = 'mu_predicted') %>%
  mutate(
    # add sigma iterations for every grid cell
    sigma=rep(sigmas, nrow(mu_predicted)),
    # draw density for a log normal
    density_predicted = rlnorm(n(), mu_predicted, sigma),
    # draw population count from a Poisson
    population_predicted = rpois(n(), density_predicted*settled_area)
    ) %>% 
  dplyr::select(-mu_predicted,-sigma, -density_predicted) %>% 
  # convert it back to grid cell x iterations matrix
  pivot_wider(-c(iteration, population_predicted),
      names_from = iteration,
      values_from = population_predicted
    )

predictions contains 2000 population predictions for every settled grid. cell. Let’s see how does it look like for the 5 settled grid cells:

predictions <-  readRDS('../data/gridded_pop_predicted.rds')

predictions[1:5, ] %>% dplyr::select(gridcell_id, everything()) %>% 
  kbl() %>% kable_minimal()  %>% scroll_box(width = "100%")

Note: Model prediction is memory-intensive. We thus usually run it in parallel for blocks of grid cells.

6.5.12 Predicting distributions

Bayesian modelling is a stochastic modelling. It accounts thus for the unknowns, that may be caused by the following factors:

Weak predictors
Errors in input data
Incomplete sample representativity
Observations variation

These unknowns result in prediction uncertainty. Specifically, it results in predicting a distribution of the likely population count.

As seen in previous sections, every grid cell has 2000 predictions. Figure 6.43 shows an example for five grids cells.

prediction_ex <- predictions %>% slice(1:5) %>% 
         pivot_longer(-gridcell_id, names_to = 'iterations', values_to = 'prediction') %>% 
  group_by(gridcell_id) %>% 
  mutate(
    mean_pop = paste0(gridcell_id, ' (mean=', round(mean(prediction)), ')')
  )


ggplotly(ggplot(prediction_ex,   aes(x=prediction, color=mean_pop)) +
  geom_density()+
  theme_minimal()+
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())+
  labs(x='Population predicted', color='\n\nGridcell id \n(mean population)')) %>%
  layout(margin=list(t=100))

Figure 6.33 shows how high-resolution geospatial covariates provide precise information on local context.

Figure 6.43: Example of predicted population count for 5 grid cells and their mean population count

A meaningful statistic to retrieve from the distribution is the mean, which can be considered as the most likely value.

6.5.13 Gridded population

The gridded bottom-up population that WorldPop releases are based on the mean value per grid cell.

The following explores how a gridded population dataset can be produced from our predictions.

raster_df <- readRDS('../../wd/misc/gridcell_id.rds')

First compute the mean prediction for every grid cell:

mean_prediction <- tibble( 
  mean_prediction = apply(predictions %>% dplyr::select(-gridcell_id), 1, mean)
  )

Next, add the unsettled grid cell by joining with the raster_df, containing every grid cell.

mean_prediction <- mean_prediction %>% 
  # add gridcell_id
  mutate(gridcell_id = predictions$gridcell_id) %>% 
  # join unsettled grid cell
  right_join(raster_df %>% 
               dplyr::select(gridcell_id)) %>% 
  # sort according to position in the raster
  arrange(gridcell_id)

To retrieve the raster structure, we load a reference master - typically the mastergrid.

wd_path <- 'Z:/Projects/WP517763_GRID3/Working/NGA/ed/bottom-up-tutorial/'
mastergrid <- raster(paste0(wd_path, 'mastergrid.tif'))

And assign the predicted population count.

# create raster
mean_r <- mastergrid
# assign mean prediction 
mean_r[] <- mean_prediction$mean_prediction

We can finally plot the raster to see our gridded population:

# create color ramp
cuts <- quantile(mean_prediction$mean_prediction,
                 p=seq(from=0, to=1, length.out=7), na.rm=T)
col <-  brewer.pal(n = 7, name = "YlOrRd")

#plot mean prediction
plot(mean_r, col=col)

Figure 6.44: Gridded mean population prediction for region 8

Figure 6.44 shows typical population spatial distribution. The top-left corner shows a cluster of cities, with denser neighborhoods in the centers. Lines of denser settlements are linked with the spatial distribution of population around roads. In white are displayed the unsettled grid cells.

6.5.14 Gridded uncertainty

Additionally, from the full posterior distribution, the 95% credible interval (CI) can be retrieved for every prediction and the related relative uncertainty computed as:

\[\begin{equation} uncertainty = \frac{upper\_CI - lower\_CI}{mean\_prediction} \end{equation}\]

The relative uncertainty highlights areas where the mean prediction is less reliable.

The same process used for the gridded mean prediction can be used to compute the gridded uncertainty.

# retrieve the 95% credible interval
ci_prediction <- apply(predictions %>% dplyr::select(-gridcell_id),1, quantile, p=c(0.025, 0.975))

ci_prediction <- as_tibble(t(ci_prediction)) %>% 
    # add gridcell_id
  mutate(gridcell_id = predictions$gridcell_id) %>% 
  # join unsettled grid cell
  right_join(raster_df %>% 
               dplyr::select(gridcell_id)) %>% 
  # sort according to position in the raster
  arrange(gridcell_id) %>% 
  mutate(
    mean_prediction = mean_prediction$mean_prediction,
    uncertainty = (`97.5%` - `2.5%`)/ mean_prediction
  )

# create uncertainty raster
uncertainty_r <- mastergrid
uncertainty_r[] <- ci_prediction$uncertainty

#plot uncertainty raster
cuts <- quantile(ci_prediction$uncertainty,
                 p=seq(from=0, to=1, length.out=7), na.rm=T)
col <-  brewer.pal(n = 7, name = "Blues")
plot(uncertainty_r, col=col)

Figure 6.45: Map of Predicted Population Uncertainty.

We see that higher uncertainty can be observed on the outskirt of denser settlement. This is often the case due to the rapid expansion of settlement extent that is difficult to capture.

6.5.15 Aggregating prediction

Gridded populations are particularly useful in visualizing the spatial distribution of populations across an area. But often we want to get aggregates, that is a population estimate for a custom area such as a city or the catchment area of a hospital, which can prove more useful in application of the data.

Let’s look at an example. We manually drew the extent of one city in our study area.

Disclaimer: the extents are not the official extent

tmap_options(check.and.fix = TRUE)
city <- st_read(paste0(wd_path, 'study_city.gpkg'), quiet=T)

names(mean_r) <- 'Gridded population'
tmap_mode('view')
tm_shape(city)+
  tm_borders()+
  tm_shape(mean_r)+
  tm_raster()+
  tm_basemap('Esri.WorldGrayCanvas')

Generating the mean population prediction for the city corresponds to computing zonal statistic on the gridded population, that is summing up all the grid cells within the city boundary.

mean_city <- extract(mean_r, city, fun=sum, na.rm=TRUE, df=TRUE)
mean_city %>%  
  mutate(features = 'city', .after=ID) %>% 
  rename("Mean population prediction"=Gridded.population) %>% 
  kbl(caption='Mean population prediction for the city computed with the gridded population') %>% kable_minimal(full_width = F)

Obtaining the uncertainty for that estimate is more complicated. Indeed we can’t sum up uncertainty of the grid cell to retrieve the uncertainty of the aggregate because credible intervals are based on quantile computation, and quantiles can’t be summed.

The full population prediction distribution for the city therefore needs to be reconstructed by aggregating all grid-cells population prediction distribution.

The first step is to convert the city polygon to a raster with same extent as the gridded population. Cells that fall inside the city boundary contain a value of 1, and those that fall outside hold a value of 0.

city_r <- rasterize(city, mean_r)

The city raster is converted to an array and joined to raster_df to get the grid cell id. We filter out the grid cells that belongs to the city and then merge it with the predictions:

city_prediction <- raster_df %>% 
  # select grid cell id from raster df
  dplyr::select(gridcell_id) %>% 
  # add city dummy
  mutate(city = city_r[]) %>% 
  # keep grid cells inside the city
  filter(city==1) %>% 
  # join predictions
  left_join(predictions) %>% 
  # keep predictions
  dplyr::select(starts_with('V'))

city_prediction contains all the grid cells within the city with their corresponding full population prediction distribution.

The predictions are summed for every grid cell at each iteration.

city_prediction <- as_tibble(apply(city_prediction,2, sum, na.rm=T))

city_prediction becomes an array of size 2000 containing the 2000 population totals for the city. We can then derive meaningful statistics from the distribution of population total for the city.

city_prediction_stats <- city_prediction %>% 
  summarise( 
    "Mean population" = round(mean(value)),
    "Upper bound" = round(quantile(value, p=0.975)),
    'Lower bound' = round(quantile(value, p=0.025)))

ggplot(city_prediction, aes(x=value))+
  geom_density(size=1, color='orange')+
  theme_minimal()+
    theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())+
  labs(y='', x='Predicted population for the city')+
  annotate("segment",x=city_prediction_stats$`Mean population`,
           xend=city_prediction_stats$`Mean population`, y=0, yend=0.000005, size=1)+
  annotate('text',x=city_prediction_stats$`Mean population`+5000, y=0.0000012, hjust = 0, fontface =2,
           label=paste0('Mean prediction: \n', city_prediction_stats$`Mean population`, ' people'))+
  annotate("segment",x=city_prediction_stats$`Upper bound`,
           xend=city_prediction_stats$`Upper bound`, y=0, yend=0.000005)+
  annotate('text',x=city_prediction_stats$`Upper bound`+5000, y=0.000001, hjust = 0,
           label=paste0('97.5% prediction \nbound: \n', city_prediction_stats$`Upper bound`, ' people'))+
    annotate("segment",x=city_prediction_stats$`Lower bound`,
           xend=city_prediction_stats$`Lower bound`, y=0, yend=0.000005)+
    annotate('text',x=city_prediction_stats$`Lower bound`+5000, y=0.000001, hjust = 0,
           label=paste0('2.5% prediction \nbound: \n', city_prediction_stats$`Lower bound`, ' people'))

Figure 6.46: Full distribution of the predicted population total for the city.

The figure above shows that the mean prediction matches the one computed with the gridded population raster. Additionally, it illustrates that the model produces predictions with a very wide credible interval.

Having the full distribution can help answer questions such as “what is the probability that there is more than xx people in the area” or “with a 95% certainty, what is the maximum number of people in that area?”

Contribution

This tutorial was written by Edith Darin and Douglas Leasure with supervision from Andy Tatem

Suggested citation

Darin E, Leasure DR, Tatem AJ. 2023. Statistical population modelling for census support. United Nations Population Fund (UNFPA), Leverhulme Centre for Demographic Science, University of Oxford, and WorldPop, University of Southampton. https://wpgp.github.io/bottom-up-tutorial/, doi:10.5281/zenodo.7945266

References

Ecopia.AI, Maxar Technologies. 2019. Digitize africa data.

Institut Géographique du Burkina Faso. 2015. Base nationale de données topographiques.

Leasure DR, Jochem WC, Weber EM, Seaman V, Tatem AJ. 2020f. National population mapping from sparse survey data: A hierarchical bayesian modeling framework to account for uncertainty. Proceedings of the National Academy of Sciences.

Oak Ridge National Laboratory. 2018a. LandScan HD: nigeria.

OpenStreetMap contributors. 2018. Planet dump retrieved from https://planet.osm.org.

R Core Team. 2020a. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Weber EM, Seaman VY, Stewart RN, Bird TJ, Tatem AJ, McKee JJ, Bhaduri BL, Moehl JJ, Reith AE. 2018c. Census-independent population mapping in northern nigeria. Remote Sensing of Environment 204:786–798. doi:10.1016/j.rse.2017.09.024. http://www.sciencedirect.com/science/article/pii/S0034425717304364.

Weiss DJ, Nelson A, Gibson H, Temperley W, Peedell S, Lieber A, Hancher M, Poyart E, Belchior S, Fullman N, others. 2018a. A global map of travel time to cities to assess inequalities in accessibility in 2015. Nature 553:333336.

WorldPop Research Group, Department of Geography and Geosciences, University of Louisville, Departement de Geographie, Universite de Namur, Center for International Earth Science Information Network (CIESIN), Columbia University. 2018. Global high resolution population denominators project - funded by the bill and melinda gates foundation (OPP1134076). doi:10.5258/SOTON/WP00645.