logo

Introduction

The first part of this tutorial will cover how to revamp a basic frequentist model into a Bayesian model and will introduce some concepts such as Directed Acyclic Graph (DAG) and priors.

The second part will be devoted to introduce population modelling from sparse but representative study sites sample. We will present the dataset we will be working with for the next three tutorials and build our first two population models:

  • a model based on the Normal distribution

  • a multilevel model based on a Poisson-Lognormal compound.

This will be the occasion to experiment with the Stan software and its R interface rstan.

Goals

  1. Write a simple linear regression in a Bayesian framework

  2. Adapt the statistician toolbox to a real-world example

  3. Fit a Normal model in stan with simulated data:

    1. Format data for stan

    2. Specify a model in the stan language

    3. Set up a MCMC sampler to fit the model

  4. Fit a Poisson model for population count

    1. Evaluate results and limitations
  5. Fit a Poisson-Lognormal model for modelling population

Supporting readings

This series of tutorials are not an introduction to statistics. For this specific lesson, it would be good to be familiar with some statistical concepts, and for that purpose we indicate useful resources:

And we add to that list the documentation for the software used:

  • Software: Stan

    Stan is a C++ library for Bayesian modeling and inference that primarily uses the No-U-Turn sampler (NUTS) Hoffman and Gelman (n.d.) to obtain posterior simulations given a user-specified model and data

  • Interface: rstan

    The rstan package allows one to conveniently fit Stan models from R (R Core Team 2014) and access the output, including posterior inferences and intermediate quantities such as evaluations of the log posterior density and its gradients.

From a frequentist to a Bayesian mindset

In a standard frequentist approach, a linear regression between \(Y\) the response variable and \(X\) the predictors can be formulated as:

\[\begin{equation} Y = \alpha + \beta X + \epsilon \\ \epsilon \sim Normal(mean=0,sd=\sigma)\tag{1} \end{equation}\]

Equation (1) can be rewritten as: \[Y \sim Normal(mean=\mu, sd=\sigma)\] \[ \mu=\alpha + \beta X\]

This format is more flexible when we start working with non-normal error structures and custom modelling components.

This linear regression can be represented using a directed acyclic graph (DAG) that helps to picture the relationships between model parameters and input data:

In a DAG, squares represent data and circles represent parameters. Directions of arrows indicate dependence.

In a Bayesian model, all root node parameters (those with no arrows pointing towards them) need priors to be specified: \[ \alpha \sim Normal(mean=0, sd=1000)\] \[ \beta \sim Normal(mean=0, sd=1000)\] \[ \sigma \sim Uniform(min=0, max=1000)\]

These are examples of weakly informative priors, because the variances are large relative to the data. Weakly informative priors should not have any noticeable influence on the final parameter estimates.

How to choose priors

To identify a distribution to use for priors, ask yourself:

  1. “What values are possible for this parameter?” and

  2. “How certain am I about it?”.

In most of Bayesian analyses, we generally know what our data do and do not look like such that we can build priors that reflect the properties of potential datasets. Defining such priors consists of indicating what is the likely range of values for our parameters and downweighting out-of-range values. Furthermore if we are uncertain about the range we can increase the variability of the priors (through their scale/variance parameter), because we don’t want the priors to influence too much the posteriors.

Regression coefficients are generally continuous numbers that can take values from -∞ to +∞. The normal distribution is a good choice of prior for these parameters because it has the same characteristics.

Standard deviations are continuous numbers that must be positive. A normal distribution is not a good choice for this prior because it includes negative numbers. The simplest is a uniform(0,a) with a defining the upper bound on likely values. In reality this is not perfect as it enforces a hard constraint. Other distributions can be considered as a half-Normal or a half-Cauchy distribution.

The bayesrules book has a very good chapter on the interaction between priors and data.

The stan team put together interesting guidelines to prior setting.

Let’s try with simulated data

Setting up

First, download the most recent versions of the following softwares:

Next, install and set up the rstan package by carefully following the directions.

# stan setup
options(mc.cores = parallel::detectCores()) #set up the maximum number of cores used by stan
rstan::rstan_options(auto_write = TRUE) # speed up running time 
  • Install a set of data wrangling packages:
install.packages(c( "tidyverse", # for data manipulation
                    'kableExtra', # for good visualisation of tables
                    'here' # for handling relative path to external assets (pic, data)
                    ), 
                 dependencies = TRUE)

For more information, check out their vignettes: tidyverse, kableExtra and here.

Simulating data

We will simulate fake observations to introduce the basic concepts of Bayesian modelling and their implementation in stan.

We produce our fake data as 1000 draws of a Normal distribution with mean 5 and standard deviation 50:

# 2 Simulated data ----

# Simulate data
seed <- 2004
set.seed(seed)
data <- tibble(y=rnorm(1e3, mean= 5, sd=50))

Note that we define a seed for the results to be exactly replicated.

The observed distribution of our simulated data is the following:

# plot simulated data
ggplot(data, aes(x=y))+
  geom_histogram(bins = 50)+
  theme_minimal()+
  geom_vline(xintercept = mean(data$y), color='orange', size=1)+
  annotate('text', x=15, y=30, label=paste0('Observed mean: ', round(mean(data$y),2)), colour='orange', angle=90)+
  geom_segment(x=0,y=1, xend=sd(data$y), yend=1, colour='orange', size=1)+
  annotate('text', x=50, y=4, label=paste0('Observed sd: ', round(sd(data$y),2)), colour='orange')+
  labs(y='', x='observed y')+
  theme(axis.text.y = element_blank())
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
Simulated observations distribution

Figure 1: Simulated observations distribution

Modelling the data

Let’s pretend we don’t know how the observations, \(y\) have been created. We want to model \(y\) and we see that it has a clear bell shape. It can thus be approximated with a Normal distribution that is: \[Y \sim Normal(mean=\mu, sd=\sigma)\]

The corresponding DAG is:

# Normal model DAG
dagify(
  Y ~ mu,
  Y ~ sigma) %>%
  tidy_dagitty(seed=41) %>% 
  mutate(color=c('parameter', 'parameter','data')) %>% 
  ggplot(aes(x = x, y = y, xend = xend, yend = yend, color=color,shape=color)) +
  geom_dag_point() +
  geom_dag_edges() +
  geom_dag_text(col = "grey20",size=6,  parse=T) +
  scale_shape_manual(values=c(15,19))+
  theme_dag()+ labs(title = 'Model with simulated data: Normal distribution of Y', color='', shape='')

We need to give some prior to the two parameters \(\mu\) the mean and \(\sigma\) the standard deviation. Let’s imagine that we don’t have access to more information on the variable \(Y\) than its manifestation in our dataset.

We can guess that \(\mu\) is around zero but without great confidence (see Figure 1). In other words, we can assume that \(\mu\) can be drawn from a Normal(0,100).

# define mu prior
data$prior_mu <- rnorm(1e3, mean= 0, sd=100)

ggplot(data)+
  geom_histogram(aes(x=y, after_stat(density)), bins=50, fill='grey30')+
  geom_density(aes(x=prior_mu), colour='#00BFC4', size=1)+
  theme_minimal()+
  annotate('text', y=0.0020, x=250, label=paste0('Normal(0,100)'), colour='#00BFC4', size=5)+
    theme(axis.text.y = element_blank())+
  labs(y='', x='observed y')
Mu prior distribution

Figure 2: Mu prior distribution

Figure 2 shows what our prior for \(\mu\) means: we think that \(\mu\) is likely to be around zero but could be up to 200. Given that our observed \(y\), \(\hat{y}\), has no occurrences above 200, it is unlikely that the mean of \(y\) is 200 but we don’t exclude the possibilty. However we do consider that 0 is more likely than 200.

For \(\sigma\), we have stricter expectations. Indeed standard deviations are positive. We thus choose as prior for \(\sigma\) a Uniform(0, 200).

# define sigma prior
data$prior_sigma <- runif(1e3,min = 0, max = 200)

ggplot(data)+
  geom_histogram(aes(x=y, after_stat(density)), bins=50, fill='grey30')+
  geom_density(aes(x=prior_sigma), colour='#00BFC4', size=1, trim=T)+
  theme_minimal()+
  annotate('text', y=0.0060, x=150, label=paste0('Uniform(0,200)'), colour='#00BFC4', size=5)+
      theme(axis.text.y = element_blank())+
  labs(y='', x='observed y')
Sigma prior distribution

Figure 3: Sigma prior distribution

Figure 3 shows that a uniform prior means a equal probability of \(\sigma\) to be 0 as to be 200.

The final model is:

\[\begin{equation} Y \sim Normal( \mu, \sigma ) \\[1pt] \\ \mu \sim Normal( 0, 100 ) \\ \sigma \sim Uniform( 0, 200 )\tag{2} \end{equation}\]

Implementing the model in stan

To estimate \(\mu\) and \(\sigma\), we will write our first model in stan.

// Model for simulated data: y as normal distribution
data {
  int<lower=0> n; // number of observations
  real y[n]; // observations
}

// The parameters accepted by the model. Our model
// accepts two parameters 'mu' and 'sigma'.
parameters {
  real mu;
  real<lower=0> sigma;
}

// The model to be estimated. We model the output
// 'y' to be normally distributed with mean 'mu'
// and standard deviation 'sigma'.
model {
  y ~ normal(mu, sigma);
  
  mu ~ normal(0,100);
  sigma ~ uniform(0,200);
}

A stan model is composed of code blocks. The fundamental ones are:

  • the data block that describes the input data to the model

  • the parameters block that describes the parameters to be estimated

  • the model block that describes the stochastic elements: (1) the interaction between the parameters and the data, (2) the prior distribution

The stan software requires us to declare all variables, both parameters and data, with their type (int, real) and size (as indicated with [n])1 . It is possible to incorporate constraints on the variable support, e.g. it is not possible to have a negative \(\sigma\) (real<lower=0> sigma).

Note: stan requires to leave one blank line at the end of the script.

We will store the model in a stan file called tutorial1_model.stan in the tutorial1 folder.

Preparing the data for stan

Stan software takes as input a list of the observed data that defines the variables indicated on the data block.

# prepare data for stan
stan_data <- list(
  y = data$y,
  n = nrow(data))

Running the model

We set up the parameters of the Markov Chain algorithm.

# mcmc settings
chains <- 4
warmup <- 250
iter <- 500

The chains argument specified the number of Markov chains to run simultaneously. We want the Markov chains to replicate a fully random process. However, the design of the chain algorithm makes every sample dependant on the previous sample. To recreate a random setting we run independently several chains to explore the parameter space and that hopefully converge to the same consistent solution.

The warmup parameter is the number of samples at the beginning of the estimation process that we discard from the results. This is similar to cooking pancakes in the sense that you need the algorithm to warm up before nearing reasonable values.

The iter parameter specifies the number of iterations, that is the length of the Markov chain. The longer the chain the more likely it is to stabilize around the correct estimate.

Then we define the parameters that we want to monitor, which are stored during the estimation process:

# parameters to monitor
pars <- c('mu','sigma')

And we are ready to run the model!

# mcmc
fit <- rstan::stan(file = file.path('tutorial1_model.stan'), 
                   data = stan_data,
                   iter = warmup + iter, 
                   chains = chains,
                   warmup = warmup, 
                   pars = pars,
                   seed = seed)

Checking the MCMC simulations

We check the Markov chains to see if they converged to a unique solution. It can be visualized with a traceplot for the two parameters, \(\mu\) and \(\sigma\). A traceplot describes the evolution of the parameter estimation across the Markov chain iterations. A good traceplot sees the mixing of the different chains, evidence of convergence to a single estimate.

# plot trace
stan_trace(fit, inc_warmup = T)
Model Traceplot

Figure 4: Model Traceplot

Figure 4 shows both the warm-up period (up until 250) and the following iterations. We see that convergence happened before the end of the warm-up period and is stable over the iterations, because the four chains mixed well.

From Figure 4 (and the absence of warning from stan ), we can conclude that the model converged.

Evaluating the estimated parameters

Bayesian statistics consider parameters as stochastic, thus it estimates a distribution for each parameter. In practice stan stores parameters estimates for each iteration post-warmup for each chain which gives us 500x4 estimates of the posterior distribution.

We can extract from the stanfit object a summary of the estimated distribution using the summary function:

# summarise estimated parameters
estimated <- summary(fit, pars=pars)$summary

estimated %>% kbl() %>% kable_minimal()
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
mu 7.585318 0.0397414 1.556887 4.518311 6.53370 7.607114 8.649467 10.59326 1534.718 0.9998234
sigma 49.668592 0.0246272 1.126673 47.452121 48.91305 49.688114 50.404291 51.91864 2092.976 1.0003155

We can then compare the parameters \(\hat\mu\) and \(\hat\sigma\) with the observed average and standard deviation of \(\hat{y}\) as well as the true value, using the stan_plot function.

# plot estimated parameters
mean_pop <- mean(data$y)
sd_pop <- sd(data$y)


mu_plot <- stan_plot(fit, 'mu',fill_color='orange')+
  annotate('segment',x=mean_pop, xend=mean_pop, 
           y=0.7, yend=1.2,col='grey40', size=1)+
  annotate('text',x=mean_pop, 
           y=1.5, col='grey40',label= paste0('Observed\n',round(mean_pop,2)), fontface =2, size=4.5)+
    annotate('segment',x=5, xend=5, 
           y=0.7, yend=1.2,col='grey10', size=1)+
    annotate('text',x=5, 
           y=1.5, col='grey10',label= paste0('True\n',5), fontface =2, size=4.5)+
    annotate('text',x=estimated['mu', 'mean'], 
           y=1.5, col='orange',label= paste0('Estimated\n',round(estimated['mu', 'mean'],2)), fontface =2, size=4.5)
sigma_plot <- stan_plot(fit, 'sigma', fill_color='orange')+
  annotate('segment',x=sd_pop, xend=sd_pop, 
           y=0.7, yend=1.2,col='grey40', size=1)+
  annotate('text',x=sd_pop, 
           y=1.5, col='grey40',
           label= paste0('Observed\n', round(sd_pop,2)), fontface =2, size=4.5)+
    annotate('segment',x=50, xend=50, 
           y=0.7, yend=1.2,col='grey10', size=1)+
    annotate('text',x=50, 
           y=1.5, col='grey10',label= paste0('True\n',50), fontface =2, size=4.5)+
    annotate('text',x=estimated['sigma', 'mean'], 
           y=1.5, col='orange',label= paste0('Estimated\n',round(estimated['sigma', 'mean'],2)), fontface =2, size=4.5)

gridExtra::grid.arrange(mu_plot, sigma_plot, nrow=2)

We see that (1) the observed mean and standard deviation are within the 95% credible intervals of the estimated parameters, (2) the true mean and standard deviation are within the 95% credible intervals of the estimated parameters,

The model structure is inline with the observed data and manages to approximate the true data generating process.

Let’s try with real data

Let’s download the data we will be modelling. It belongs to the supplementary material of the seminal paper describing WorldPop bottom-up population models (Leasure et al. 2020).

# download  tutorial data
download.file(
  "https://www.pnas.org/highwire/filestream/949050/field_highwire_adjunct_files/1/pnas.1913050117.sd01.xls",
  'tutorials/data/nga_demo_data.xls',
  method='libcurl',
  mode='wb'
)

The data

The data consist of household surveys that collected information on the total population in 1141 clusters in 15 of 37 states in Nigeria during 2016 and 2017. Clusters varied slightly in size, but were all approximately 3 hectares. These clusters were randomly sampled locations whose boundaries were drawn based on high resolution satellite imagery. The surveys are further described in Leasure et al. (2020) and Weber et al. (2018). Survey site locations are shown in Figure 5.

Microcensus surveys mapped as the number of survey locations within a 20km grid cell. Region groupings are shaded and numbered R1 - R11. Source: Leasure et al. (2020).

Figure 5: Microcensus surveys mapped as the number of survey locations within a 20km grid cell. Region groupings are shaded and numbered R1 - R11. Source: Leasure et al. (2020).

The map in Figure 5 shows some key characteristics of the stratified-random sample design:

  • Only some states were sampled

  • But at least 1 state per “region” was sampled

  • Within states, locations were randomly sampled within settlement types

Let’s look at the table attributes:

#load data
data <- readxl::read_excel(here('tutorials/data/nga_demo_data.xls'))
# create unique cluster id
data <- data %>% 
  mutate(
    id= paste0('cluster_',(1:n())), # compute cluster id
    pop_density = N/A # compute population density
  )
data %>% select(id,N, A) %>% head(10) %>% kbl() %>% kable_minimal()
id N A
cluster_1 547 3.036998
cluster_2 803 2.989821
cluster_3 750 3.078278
cluster_4 281 3.019307
cluster_5 730 3.025204
cluster_6 529 3.090072
cluster_7 505 3.007513
cluster_8 402 3.019307
cluster_9 388 3.202116
cluster_10 900 3.054689

Each row is a survey site, with population counts (N) and the settled area (A) in hectares.

Response variable: the population count

We want to model the distribution of population count at each survey site:

# plot population count
ggplot(data, aes(x=N))+
  geom_histogram(bins=50)+
  theme_minimal()+
  theme(axis.text.y = element_blank())+
  labs(title = "", y='', x='Observed population count')+
  geom_vline(xintercept = mean(data$N), color='orange', size=1)+
  annotate('text', x=500, y=25, label=paste0('Observed mean: ', round(mean(data$N))), colour='orange', angle=90)
Observed population count distribution at survey sites

Figure 6: Observed population count distribution at survey sites

Note the wide variation in population count per survey site, with a maximum of 2370 people.

Modelling Population count: a Poisson distribution

Population count is by definition a discrete, positive variable. A good distribution candidate is the Poisson distribution.

\[ population \sim Poisson( \lambda ) \]

The corresponding DAG shows the interaction between the population count and the model parameter:

We then have to define the prior for \(\lambda\), which corresponds to the mean of the Poisson distribution. We will choose a relatively uninformed prior based on our understanding of the data.

We know that the mean observed population count is around 450 per cluster. We set up the prior for \(\lambda\) to follow a Uniform between 0 and 3000 to ensure a positive parameter while reducing the information given to the estimation process, and thus the bias introduced.

# define lambda prior
data$prior_lambda <- runif(nrow(data), min=0, max=3000)

ggplot(data)+
  geom_histogram(aes(x=N, after_stat(density)), bins=50, fill='grey30')+
  geom_density(aes(x=prior_lambda), colour='#00BFC4', size=1)+
  theme_minimal()+
  annotate('text', y=0.0005, x=2000, label=paste0('Uniform(0,3000)'), colour='#00BFC4', size=5)+
    theme(axis.text.y = element_blank())+
  labs(y='', x='Observed population count')
Lambda prior distribution

Figure 7: Lambda prior distribution

The final model is:

\[\begin{equation} population \sim Poisson( \lambda ) \\ \\ \lambda \sim Uniform( 0, 3000 )\tag{3} \end{equation}\]

Implementing the model

To estimate \(\lambda\), we will write our first population model in stan.

// Model 1: Population count as a Poisson distribution 
data{
  int<lower=0> n; // number of microcensus clusters
  int<lower=0> population[n]; // count of people
}
parameters{
  // rate
  real<lower=0> lambda; 
}

model{
  // population totals
  population ~ poisson(lambda);
  // rate
  lambda ~ uniform(0, 3000);
}

We declare the input variable population as integer because our population data are counts and set it up to be positive. We define \(\lambda\) as a positive real.

We store this model under tutorial1_model1.stan.

Estimating the model

We prepare the data for stan:

# prepare data for stan
stan_data <- list(
  population = data$N,
  n = nrow(data))

We keep the same parameters as previously for the Markov Chain algorithm and declare \(\lambda\) as the parameter to monitor

# parameters to monitor
pars <- c('lambda')

And we are ready to run the model!

# mcmc
fit <- rstan::stan(file = file.path('tutorial1_model1.stan'), 
                   data = stan_data,
                   iter = warmup + iter, 
                   chains = chains,
                   warmup = warmup, 
                   pars = pars,
                   seed = seed)

The traceplot shows a model that converges to the observed mean:

traceplot(fit, inc_warmup=T)

Note the wide variation at the start of the estimation. This is due to the uniform prior for \(\lambda\) declaring that any value between 0 and 3000 is similarly possible.

Evaluating the model goodness-of-fit

Estimated parameters

We plot the estimated parameter \(\hat\lambda\) to see how it compares with the observed average population count.

# plot estimated parameters
mean_pop <- mean(data$N)

stan_plot(fit, 'lambda',fill_color='orange')+
  annotate('segment',x=mean_pop, xend=mean_pop, 
           y=0.7, yend=1.2,col='grey40', size=1)+
  annotate('text',x=mean_pop, 
           y=1.5, col='grey40',label= paste0('Observed average\n',round(mean_pop,2)), fontface =2, size=4.5)

The estimated mean is very close to the observed mean.

Predicted population count

To see if the model is coherent with the observations, we can compute the predicted population count for every survey site. It is part of posterior predictive checking which is based on the following idea: if a model is a good fit then we should be able to use it to generate data that looks a lot like the data we observed.

It is possible in stan to do it as part of the estimation process through the generated quantities block .

// Model 1bis: Population count as a normal distribution with integrated predictions
...
generated quantities{
   real population_hat[n];

   for(idx in 1:n){
     population_hat[idx] = poisson_rng( lambda );
   }
}

We define the parameter population_hat as a draw (as represented by the suffix rng for random number generator) from a Poisson distribution with the estimated \(\hat\lambda\) for each iteration.

We run the model stored under tutorial1_model1bis.stan:

pars <- c('lambda', 'population_hat')

# mcmc
fit_model1 <- rstan::stan(file = file.path('tutorial1_model1bis.stan'), 
                   data = stan_data,
                   iter = warmup + iter, 
                   chains = chains,
                   warmup = warmup, 
                   pars = pars,
                   seed = seed)

And extract the predicted population count.

# extract predictions
predicted_pop_model1 <- as_tibble(extract(fit_model1, 'population_hat')$population_hat)

colnames(predicted_pop_model1) <- data$id

We obtain a table with 500 predictions * 4 chains for each survey site.

predicted_pop_model1 %>% 
  mutate(iteration= paste0('iter_', 1:(iter*chains))) %>% 
  select(iteration, 1:10) %>% head(10) %>% kbl() %>% kable_minimal()
iteration cluster_1 cluster_2 cluster_3 cluster_4 cluster_5 cluster_6 cluster_7 cluster_8 cluster_9 cluster_10
iter_1 440 453 423 421 453 443 401 447 423 417
iter_2 473 430 431 459 441 416 466 422 468 439
iter_3 426 448 460 450 429 485 472 418 431 456
iter_4 465 439 472 445 448 462 468 428 445 393
iter_5 464 415 438 460 488 422 457 429 499 406
iter_6 405 409 429 411 441 430 470 485 449 449
iter_7 426 427 408 415 452 392 439 401 480 460
iter_8 457 449 435 461 434 438 448 451 408 476
iter_9 437 429 421 483 438 450 441 421 423 454
iter_10 455 474 408 462 451 461 499 471 451 488

We get thus a posterior prediction distribution of population count for every survey site. Figure 8 shows a posterior distribution for the first survey site.

# plot posterior prediction for one cluster
ggplot(predicted_pop_model1, aes(x=cluster_1))+
  geom_density(size=1.5, color='orange')+
  theme_minimal()+
    theme(axis.text.y = element_blank())+
  labs(title = "Population prediction for cluster 1 ", y='', x='')
Example of posterior prediction distribution for one cluster

Figure 8: Example of posterior prediction distribution for one cluster

We can extract for every survey site its mean prediction and 95% credible interval.

# summarize predictions
comparison_df <- predicted_pop_model1 %>% 
      pivot_longer(everything(),names_to = 'id', values_to = 'predicted') %>% 
      group_by(id) %>% 
      summarise(across(everything(), list(mean=~mean(.), 
                                          upper=~quantile(., probs=0.975), 
                                          lower=~quantile(., probs=0.025))))
comparison_df %>% head() %>% kbl() %>% kable_minimal()
id predicted_mean predicted_upper predicted_lower
cluster_1 445.1085 488.000 402.000
cluster_10 445.0675 488.000 404.000
cluster_100 445.1840 487.025 404.975
cluster_1000 445.6165 484.000 403.000
cluster_1001 445.2855 487.000 404.975
cluster_1002 444.4365 486.000 403.000

We note that all predictions are very similar. The model includes only one parameter — lambda — and the site-level predictions do not account for any site-level characteristics. Therefore, predictions at each site are drawn from the exact same distribution.

Let’s see the global picture by plotting the observed vs the predicted population count. A perfect model would see all points on the 1:1 line.

# add observed values
comparison_df <- comparison_df %>% 
  left_join(data %>% 
              select(id, N), by = 'id')

# plot predicted vs observed
ggplot(comparison_df) +
  geom_pointrange(aes(x=N, y=predicted_mean, ymin=predicted_lower, ymax=predicted_upper
                      ),
                   fill='grey50', color='grey70', shape=21
                  )+
  geom_abline(slope=1, intercept = 0, color='orange', size=1)+
  theme_minimal()+
  labs(title = '', x='Observed population count', y='Predicted population')
Comparison between observed and predicted population count (with the Poisson model). Orange line indicates the 1:1 line

Figure 9: Comparison between observed and predicted population count (with the Poisson model). Orange line indicates the 1:1 line

Figure 9 is a great visualization of the prediction process. Since the model has no covariates (introduced in tutorial 3) and no hierarchical structure (introduced in tutorial 2), there is no subnational variation introduced. Furthermore the credible intervals entail few observations: few grey lines intersect the orange line. This indicates issues with the modelling.

We can compute goodness-of-fit metrics to complete the picture:

  • The bias, the mean of the residuals (prediction - observation)

  • The imprecision, standard deviation of the residual

  • The inaccuracy, mean of the absolute residuals

  • The proportion of observations falling into the predicted credible interval

# compute goodness-of-fit metrics
comparison_df %>%
  mutate(residual = predicted_mean-N,
          in_CI = ifelse(N>predicted_lower &N<predicted_upper, T, F)) %>% 
  summarise(
    `Bias`= mean(residual),
    `Imprecision` = sd(residual),
    `Inaccuracy` = mean(abs(residual)),
    `Correct credible interval (in %)` = round(sum(in_CI)/n()*100,1)
  ) %>% 
    kbl(caption = "Poisson model goodness-of-fit metrics") %>% kable_minimal()
Table 1: Poisson model goodness-of-fit metrics
Bias Imprecision Inaccuracy Correct credible interval (in %)
0.0293225 315.8384 230.9604 10

Table 1 confirms that the model is incorrectly specified: only 10% of the observations falls in their credible intervals.

This limitation is due to the impossibility to model overdispersion within a Poisson framework. Indeed \(\lambda\) defines both the mean and the variance of a Poisson variable.

A source of overdispersion comes from the size of the clusters. We observed population counts for units with different area, in particular different sizes of settled areas as shown in Figure 10.

#plot overdispersion
ggplot(data, aes(x=A, y=N))+
  geom_point()+
  theme_minimal()+
  labs(x='Settled area in hectares', y='Population count')
Scatterplot contrasting population count with settled area for every cluster

Figure 10: Scatterplot contrasting population count with settled area for every cluster

Modelling Population count: a Poisson Lognormal model

To incorporate overdispersion in our model, we decompose population count as follows:

\[ population = pop\_density * settled\_area \]

The left hand side of the equation is population, a discrete observed variable, whereas the right hand side is composed of two continuous variables, settled_area the settled area, an observed variable, and pop_density, the population density, a latent variable.

Writing the model

Formally we can rewrite Equation (3) as:

\[ population \sim Poisson( pop\_density * settled\_area) \]

This formulation unveils the continuous positive scale-free latent variable, pop_density that can be modelled with its own distribution.

We opt for a Lognormal which is a continuous probability distribution of a random variable whose logarithm is normally distributed. It is characterised by a positive distribution skewed to the right. The lognormal has two parameters, \(\mu\), its location parameter that defines its median and \(\sigma\) its scale parameter that defines its geometric standard deviation. This model allows us to capture overdispersion through \(\sigma\).

Probability distribution functions of log-normal distributions Source: Krishnavedala, CC0, via Wikimedia Commons

Figure 11: Probability distribution functions of log-normal distributions Source: Krishnavedala, CC0, via Wikimedia Commons

Applied to our population modelling it gives us:

And in formal modelling:

\[\begin{equation} population \sim Poisson( pop\_density * settled\_area) \\ pop\_density \sim Lognormal( \mu, \sigma) \end{equation}\]

Note that this equation is equivalent to:

\[\begin{equation} log(pop\_density) \sim Normal( \mu, \sigma) \end{equation}\]

Under this form we see that the model is not linear, but log-linear. It can be rewritten as:

\[\begin{equation} pop\_density \sim exp(Normal( \mu, \sigma)) \end{equation}\]

Defining the priors

In the Lognormal, there are two parameters, \(\mu\) that represents the median of the population density on the log scale, and \(\sigma\) the geometric standard deviation of population density on the log scale.

We set up their priors similarly as before and retrieve from the data that the log observed median of the population density is 4.82 and the observed log geometric standard deviation is 0.87.

We choose as prior for \(\mu\) a Normal(5,4):

# 5 Model2: Lognormal ----

# define prior for mu
data$prior_mu <- rnorm(nrow(data), mean= 5, sd=4)

ggplot(data)+
  geom_histogram(aes(x=log(pop_density), after_stat(density)), bins=100, fill='grey30')+
  geom_density(aes(x=prior_mu), colour='#00BFC4', size=1)+
  theme_minimal()+
  annotate('text', y=0.1, x=15, label=paste0('Normal(5,4)'), colour='#00BFC4', size=5)+
    geom_vline(xintercept = median(log(data$pop_density)), color='orange', size=1)+
  annotate('text', x=5.1, y=0.3, label=paste0('Observed median: ', round(median(log(data$pop_density))), 2), colour='orange', angle=90)+
    theme(axis.text.y = element_blank())+
  labs(y='', x='Observed population density (log)')+
  xlim(-8,17)
Prior distribution for mu

Figure 12: Prior distribution for mu

We choose as prior for \(\sigma\) a Uniform(0,4)

# define prior for sigma
data$prior_sigma <- runif(nrow(data), min = 0, max=4)+5

ggplot(data)+
  geom_histogram(aes(x=log(pop_density), after_stat(density)), bins=100, fill='grey30')+
  geom_density(aes(x=prior_sigma), colour='#00BFC4', size=1, trim=T)+
  theme_minimal()+
  annotate('text', y=0.3, x=7.7, label=paste0('Uniform(0,4)'), colour='#00BFC4', size=5)+
    theme(axis.text.y = element_blank())+
  labs(y='', x='Observed population density (log)')
Prior distribution for mu

Figure 13: Prior distribution for mu

The resulting model can be written as follows:

\[\begin{equation} population \sim Poisson( pop\_density * settled\_area) \\ pop\_density \sim Lognormal( \mu, \sigma) \\ \\ \mu \sim Normal( 5, 4 ) \\ \sigma \sim Uniform( 0, 4 )\tag{4} \end{equation}\]

Implementing the model

We adapt the stan code to the model change which affects all code blocks:

// Model 2: Population count as a Poisson-Lognormal distribution 
data{
  int<lower=0> n; // number of microcensus clusters
  int<lower=0> population[n]; // count of people
  vector<lower=0>[n] area; // settled area
}
parameters{
  // population density
  vector<lower=0>[n] pop_density;
  // intercept
  real mu; 
  // variance
  real<lower=0> sigma; 
}
model{
  // population totals
  population ~ poisson(pop_density .* area);
  pop_density ~ lognormal( mu, sigma );
  // intercept
  mu ~ normal(5, 4);
  // variance
  sigma ~ uniform(0, 4);
}
generated quantities{
   int<lower=0> population_hat[n];
   real<lower=0> density_hat[n];

   for(idx in 1:n){
     density_hat[idx] = lognormal_rng( mu, sigma );
     population_hat[idx] = poisson_rng(density_hat[idx] * area[idx]);
   }
}

Note that we need to use the vector format for area and pop_density. This formatting is implied by definition of the multiplication operation for the Poisson rate.

We store the model under tutorial1_model2.stan, and prepare the corresponding data:

# prepare data for stan
stan_data_model2 <- list(
  population = data$N,
  n = nrow(data),
  area = data$A)

Note that the population density is not an input data of the model, but an unobserved latent variable that we model through the Lognormal.

Then we declare the parameters to monitor (including density_hat) and run the model.

# set parameters to monitor
pars <- c('mu','sigma', 'population_hat', 'density_hat')

# mcmc
fit_model2 <- rstan::stan(file = file.path('tutorial1_model2.stan'), 
                   data = stan_data_model2,
                   iter = warmup + iter, 
                   chains = chains,
                   warmup = warmup, 
                   pars = pars,
                   seed = seed)

No warnings are shown.

Question: Can you plot the traceplot and interpret it?

Click for the solution

We plot the traceplot:

# plot trace
traceplot(fit_model2, c('mu', 'sigma'))

We see that the chains have mixed well and that the posterior distributions of the parameters are not constrained by their prior specification.

We plot the predicted density and the predicted count against the observations:

# extract posterior predictions
predicted_pop_model2 <- as_tibble(extract(fit_model2, 'population_hat')$population_hat)

predicted_dens_model2 <- as_tibble(extract(fit_model2, 'density_hat')$density_hat)
colnames(predicted_pop_model2) <- data$id
colnames(predicted_dens_model2) <- data$id

# summarise posterior predictions
comparison_df <- rbind(predicted_dens_model2 %>% 
   pivot_longer(everything(),names_to = 'id', values_to = 'predicted') %>% 
   group_by(id) %>% 
   summarise(across(everything(), list(mean=~mean(.), 
                                       upper=~quantile(., probs=0.975), 
                                       lower=~quantile(., probs=0.025)))) %>% 
   mutate(source= 'Poisson-Lognormal model',
          type= 'Population density') %>% 
   left_join(data %>% 
               select(id, pop_density)%>% 
              rename(reference=pop_density), by = 'id'),
  predicted_pop_model2 %>% 
  pivot_longer(everything(),names_to = 'id', values_to = 'predicted') %>% 
  group_by(id) %>% 
  summarise(across(everything(), list(mean=~mean(.), 
                                      upper=~quantile(., probs=0.975), 
                                      lower=~quantile(., probs=0.025)))) %>% 
  mutate(source= 'Poisson-Lognormal model',
         type='Population count') %>% 
  left_join(data %>% 
              select(id, N) %>% 
              rename(reference=N), by = 'id'))
# plot posterior predictions
ggplot(comparison_df %>% 
         mutate(type= factor(type, levels=c('Population density', 'Population count')))) +
  geom_pointrange(aes(x=reference, y=predicted_mean, ymin=predicted_lower, ymax=predicted_upper
                      ),
                   fill='grey50', color='grey70', shape=21
                  )+
  geom_abline(slope=1, intercept = 0, color='orange', size=1)+
  theme_minimal()+
  labs(title = '', x='Observations', y='Predictions')+ 
  facet_wrap(.~type, scales = 'free')

We see for the population_density the same estimation pattern as in the Poisson model, that is a similar mean posterior prediction distribution for every survey sites. In contrast, the length of the confidence intervals varies between cluster predictions due to the variance term in the Lognormal. Furthermore the predicted population_count is influenced by the varying settled_area.

We use the same metrics based on residuals to assess the goodness-of-fit of the model.

# compute goodness-of-fit metrics
comparison_df %>%
  filter(type=='Population count') %>% 
  mutate(residual = predicted_mean-reference,
          in_CI = ifelse(reference>predicted_lower &reference<predicted_upper, T, F)) %>% 
  summarise(
    `Bias`= mean(residual),
    `Imprecision` = sd(residual),
    `Inaccuracy` = mean(abs(residual)),
    `Correct credible interval (in %)` = round(sum(in_CI)/n()*100,1)
  ) %>% 
    kbl(caption = "Poisson-Lognormal model goodness-of-fit metrics") %>% kable_minimal()
Table 2: Poisson-Lognormal model goodness-of-fit metrics
Bias Imprecision Inaccuracy Correct credible interval (in %)
61.60042 338.2423 265.8362 95.4

The proportion of observations that fall in their credible intervals shows a well-behaved model: 95.5% of the observations falls in the 95% credible interval. On average the model overestimates the population count by around 60 people, but this statistic varies on average by 337 people.

No variations are included in this model: we built a national model of population count. It will require further refinements that will be introduced in the next tutorials.

And for that purpose we will store this last model as a RDS file.

saveRDS(fit_model2, 'tutorial1_model2_fit.rds')

Acknowledgements

This tutorial was written by Edith Darin from WorldPop, University of Southampton and Douglas Leasure from Leverhulme Centre for Demographic Science, University of Oxford, with supervision from Andrew Tatem, WorldPop, University of Southampton.

Funding for the work was provided by the United Nations Population Fund (UNFPA).

License

You are free to redistribute this document under the terms of a Creative Commons Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0) license.

References

Hoffman, Matthew D, and Andrew Gelman. n.d. “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo,” 31.
Leasure, Douglas R., Warren C. Jochem, Eric M. Weber, Vincent Seaman, and Andrew J. Tatem. 2020. “National Population Mapping from Sparse Survey Data: A Hierarchical Bayesian Modeling Framework to Account for Uncertainty.” Proceedings of the National Academy of Sciences 117 (39): 24173–79. https://doi.org/10.1073/pnas.1913050117.
Weber, Eric M., Vincent Y. Seaman, Robert N. Stewart, Tomas J. Bird, Andrew J. Tatem, Jacob J. McKee, Budhendra L. Bhaduri, Jessica J. Moehl, and Andrew E. Reith. 2018. “Census-Independent Population Mapping in Northern Nigeria.” Remote Sensing of Environment 204 (January): 786–98. https://doi.org/10.1016/j.rse.2017.09.024.

  1. More details here.↩︎