# 1 Set-up ----

# load libraries
library(tidyverse) # managing data
library(ggdag) # drawing DAG
library(kableExtra) # visualising table
library(here) # handling path
library(rstan) # running Bayesian models
library(plotly) # interactive plot

# stan setup
options(mc.cores = parallel::detectCores()-1)
rstan::rstan_options(auto_write = TRUE) # speed up running time of compiled model

Introduction

Tutorial 2 explored how to model large-scale spatial patterns: how population density differs per large grouping such as by region or settlement type. We integrated those large-scale variations in a Bayesian framework by using a hierarchical random intercept model for the population density.

Tutorial 3 aims at integrating small-scale variations of population density that are linked to local context of human settlement. Figure 1 shows how high-resolution geospatial covariates provide precise information on local context.

Figure 1: Example of a covariate (residential area) around Onitscha, Nigeria

Adding high-resolution covariates helps to improve the model fit as well as guides population prediction in unsampled areas.

For data to be used as covariates in the model, they should be:

correlated with differences in population density
measured consistently and completely across the study space
accurately mapped as high-resolution geospatial layers

High-resolution covariates that are suitable for the modelling are typically spatial covariates with national coverage and consistent data collection. While individual or household-level information, such as collected during surveys, is useful for understanding differences in demographic characteristics, that type of information is difficult to use in the bottom-up approaches because it only comes from a sample of households. The primary objective of the population model is to make a spatially-complete prediction.

Goals

Understand requirements for covariates to be included in the model
Get an overview of covariates used in WorldPop bottom-up population models
Familiarize with covariates processing for population model
Integrate a linear regression to refine the definition of the median of the Lognormal
Add a random slope model
Introduce the concept of initialisation in a MCMC model estimation

Supporting readings

To include a spatial covariate in the modelling and to use it as a support for prediction, we use gridded datasets, known as rasters. You would need thus to have basic GIS knowledge about raster and vector file management. It does not need to be in R, it can be in QGIS, ArcGIS, Python or any GIS software of your choice.

The purpose of this tutorial is not spatial data processing. We will just mention processing techniques that are required to prepare covariates data for population modelling.

However, here are some R resources on spatial manipulation:

The bible by Pebesma and Bivand on the sf (vector data) and stars (raster data) R packages with excellent overview of the different spatial manipulations
A hands-on introduction to sf and raster package by the University of Wageningen
A focus on raster manipulation with raster and terra by Hijmans

Formal modelling

We will work from model 3 in tutorial 2 which is based on a Poisson distribution to model population count and on a Lognormal distribution with a hierarchical random intercept by settlement type and region to model population density .

Because the median of the population density is only defined with a random intercept, it results in 5 (number of settlement type) x 11 (number of region) options for the population density estimates. To add small-scale variations we refine the median of the Lognormal with a regression model that integrates the covariates.

More formally, let’s define \(X\) a matrix of size number of observations x number of covariates that contains the covariates values for each study site and \(\beta\) a vector of size the number of covariates. Based on Equation 3 in tutorial 2 (and removing the prior distribution for \(\alpha_{t,r}\) for sake of readability), we define \(\mu\) the median of the Lognormal distribution as follows:

\[\begin{equation} population 〜 Poisson( pop\_density * settled\_area) \\ pop\_density 〜 Lognormal(\mu, \: \sigma) \\ \mu = \alpha_{t,r} + X \beta \\[10pt] \beta 〜 Normal(0,10)\tag{1} \end{equation}\]

Note that the prior for \(\beta\) are identical normal distribution for each covariate with mean 0 and standard deviation 10 to avoid introducing any bias.

Figure 2 shows the updated dependent relationships of our model when integrating covariates.

Figure 2: Dependancy graph for model 1 of tutorial 3

Review of the covariates used in WorldPop

To date, five bottom-up population models have been produced at WorldPop:

WorldPop. 2019. Bottom-up gridded population estimates for Nigeria, version 1.2.
WorldPop, University of Southampton. https://dx.doi.org/10.5258/SOTON/WP00655.
WorldPop (School of Geography and Environmental Science, University of
Southampton). 2020. Bottom-up gridded population estimates for Zambia, version 1.0.
https://dx.doi.org/10.5258/SOTON/WP00662
WorldPop and Institut National de la Statistique et de la Démographie du Burkina Faso. 2021. Census-based gridded population estimates for Burkina Faso (2019), version 1.0. WorldPop, University of Southampton. https://dx.doi.org/10.5258/SOTON/WP00687.
Boo G, Darin E, Leasure DR, Dooley CA, Chamberlain HR, Lazar AN, Tatem AJ. 2020.
Modelled gridded population estimates for the Kinshasa, Kongo-Central, Kwango, Kwilu,
and Mai-Ndombe provinces in the Democratic Republic of the Congo, version 2.0.
WorldPop, University of Southampton. https://dx.doi.org/10.5258/SOTON/WP00669
WorldPop. 2020. Bottom-up gridded population estimates for the Kinshasa, Kongo-Central, Kwango, Kwilu, and Mai-Ndombe provinces in the Democratic Republic of the Congo, version 1.0. WorldPop, University of Southampton. https://dx.doi.org/10.5258/SOTON/WP00658

In addition to that, two models are currently being updated: Nigeria version 2.0 and Democratic Republic of the Congo v3.0.

Those six models encompass a large range of covariates, that are specific to each country. Table 1 offers an overview of the final covariates set selected for each model.

review_cov <- read_csv(here('./tutorials/tutorial3/covs_review.csv'))

review_cov %>% arrange(Type) %>% kbl(caption='Review of covariates used in WorldPop bottom-up population models') %>% kable_minimal()

Table 1: Review of covariates used in WorldPop bottom-up population models
Type	Covariate	Model	Source
Gridded population	UN-adjusted projected gridded estimates	Burkina Faso v1.0	WorldPop
Gridded population	Projected gridded estimates	Nigeria v1.2	WorldPop
Gridded population	Mean UN-adjusted projected gridded estimates within a 2km radius	Democratic Republic of Congo v1.0	WorldPop
Infrastucture	Fricition surface	Burkina Faso v1.0	Access to the Cities project
Infrastucture	Distance to secondary roads	Burkina Faso v1.0	National Geographical Office
Infrastucture	Household size	Nigeria v1.2, Nigeria v2.0	Demographic and Health Survey
Infrastucture	Residential roads density	Democratic Republic of Congo v1.0	OpenStreetMap
Infrastucture	Travel time to cities	Democratic Republic of Congo v1.0	Malaria Atlas Project
Infrastucture	Tertiary-sector activities density	Democratic Republic of Congo v1.0	OpenStreetMap
Natural feature	Distance to temporary rivers	Burkina Faso v1.0	National Geographical Office
Natural feature	Monthly variability of dry matter productivity	Democratic Republic of Congo v3.0	Copernicus
Natural feature	Monthly variability of surface air temperature	Democratic Republic of Congo v3.0	Copernicus
Natural feature	Land surface ‘roughness’ from Synthetic Aperture Radard VH	Nigeria v2.0	Sentinel-1
Natural feature	Land surface ‘roughness’ from Synthetic Aperture Radard VV	Nigeria v2.0	Sentinel-1
Settlement	Mean building count within a 5km radius	Burkina Faso v1.0	Ecopia & Maxar
Settlement	Mean building area within a 1km radius	Democratic Republic of Congo v2.0	Ecopia & Maxar
Settlement	Mean distance to nearest building within a 1km radius	Democratic Republic of Congo v2.0	Ecopia & Maxar
Settlement	Mean building count within a 1km radius	Democratic Republic of Congo v2.0	Ecopia & Maxar
Settlement	Mean building area	Zambia v1.0	Ecopia & Maxar
Settlement	Building density	Zambia v1.0	Ecopia & Maxar
Settlement	Coefficient of variation of building area	Zambia v1.0	Ecopia & Maxar
Settlement	Sum residential area within a 1km radius	Nigeria v1.2	Oak Ridge National Laboratory
Settlement	Sum nonresidential area within a 1km radius	Nigeria v1.2	Oak Ridge National Laboratory
Settlement	School density within a 1km radius	Nigeria v1.2	Oak Ridge National Laboratory
Settlement	Mean building perimeter	Democratic Republic of Congo v3.0	Ecopia & Maxar
Settlement	Compactness of building	Democratic Republic of Congo v3.0	Ecopia & Maxar

The covariates selected can be broadly classified as describing four main drivers of local population density variation:

previous population spatial distribution through WorldPop top-down population disaggregation (WorldPop Research Group et al. 2018)
infrastructure through OpenStreetMap (OpenStreetMap contributors 2018), country-specific resources (for example Institut Géographique du Burkina Faso (2015)) or modelled resources (for example travel time to cities (Weiss et al. 2018))
natural features through remote sensing product such as radar data (Sentinel-1) or dry matter productivity (Copernicus)
settlement shape through the morphology of building footprints (Ecopia.AI and Maxar Technologies 2019) or settled area (Oak Ridge National Laboratory 2018)

Other covariates sources that were considered (but not selected in the final models) were:

Conflict locations from the Armed Conflict Location and Event Data Project
Climatic variables from the Climatic Research Unit at the university of Anglia
Active mining concessions from the IPIS group
Land cover classification from the European Spatial Agency
Global forest change from the University of Maryland
Elevation and slope from WorldDEM
Fossil fuel emissions from the Open-source Data Inventory for Anthropogenic CO2 Project

To build a model, we first gather all covariates that can be related to our specific population data. It can reach up to 900+. We then use geospatial analysis techniques to obtain gridded version of the covariates with identical spatial resolution, alignment and extent. It involves resampling and clipping for covariates provided as raster files or for covariates provided as vector files computing count, density, distance to nearest features or even interpolation techniques.

Covariates engineering

Further covariates engineering steps can help extracting even more information from the gathered covariates.

Log-transformation

Considering the logarithm of covariates helps handling extreme values.

Focal statistics

Focal statistics consist in summarising covariates in a moving window around each grid cell. As seen in Table 1, we used different window sizes (1km, 2km or 5km) and summary statistics (mean or coefficient of variation). It provides contextual information around the grid cells.

Standardisation

Scaling the covariate (that is subtracting the mean and dividing by the standard deviation) helps enhancing meaningful local variations from the mean. The scaling can even be refined by computing the mean and the standard deviation by region, such that local variations are representative of the region.

A note on covariate selection

After engineering the gathered covariates, we might end up with 1000+ potential covariates.

To select the best one for prediction purposes, we generally used one of the two following methods:

pairwise correlation and scatter plot with the population density at study site level
univariate model, testing each covariate successively

Including covariates in the model

We focus in the remaining parts of the tutorial on the data we downloaded from Leasure et al. (2020) which corresponds to the Nigeria model v1.2.

Overview of the covariates in Nigeria, v1.2

Six covariates were used in Nigeria v1.2 model:

x_1: gridded population estimates from WorldPop Global
x_2: school densities within a 1km radius
x_3: household sizes by interpolating Demographic Health Survey results from 2013
x_4: settled area within a 1km radius
x_5: residential area in a 1km radius
x_6: nonresidential settled area within a 1km radius

Covariate x_4 was scaled based on its mean and standard deviation nationally, whereas covariates x_5 and x_6 were scaled based on their mean and standard deviation within a 50-km radius. Leasure et al scaled x_5 and x_6 in this way because they suspected that neighborhood types may not be directly comparable across regions (especially northern versus southern Nigeria). This scaling also reduced correlation with x_4 .

They scaled the WorldPop Global estimates (x_1) based on their mean and standard deviation nationally. They treated this covariate as an indicator of relative population densities based on the geospatial covariates that were used in the random forest model.

Covariate x_2 was scaled using its mean and standard deviation within a 50km radius. They scaled this covariate within a 50km moving window because what constitutes a “high density” of schools varies by region and this distinction was lost when the covariate was scaled nationally. This also helped to control for possible differences in school mapping effort in different regions.

They scaled x_3 based on its mean and standard deviation nationally. One key reason for including this covariate was to account for a strong north–south gradient in household sizes, with significantly more people per household in northern Nigeria than in southern Nigeria.

Preparing the data

To integrate the covariates in the model, we build first a dataset with the average of the covariate values for each study site using zonal statistics.

Note that this constitutes a change in support: we might want to check if the range of covariates values at study site level is representative of the covariate values at grid cell level.

Figure 3 shows the relation between the covariates and the population density at study site level. We see that household size (x_3) is positively associated with population density. The negative value are due to the scaling method adopted. On the opposite, nonresidential settled area (x_6) is negatively associated with population density which is expected: the more the surroundings are nonresidential the lower the population density.

# 2. Covariates preparation ----
# load data
data <- readxl::read_excel(here('tutorials/data/nga_demo_data.xls'))
data <- data %>% 
  mutate(
    pop_density=N/A,
    id = as.character(1:nrow(data))
  )

# contrast covariates with pop density
data_long <- data %>% 
  pivot_longer(starts_with('x'), names_to = 'cov')

ggplot(data_long, aes(x=pop_density,y=value))+
  geom_point()+
  geom_smooth(method = "lm", se = FALSE,color='orange')+
  theme_minimal()+
  facet_wrap(.~cov, ncol=3, scales = 'free')+
  labs(x='Population density', y='')

Figure 3: Scatterplot of covariates values vs population density for each study site

Before implementing the model in stan, we uniformly scale the covariates at study site level, such that the \(\beta_k\) have the same scale. We first compute the scaling coefficients (mean and standard deviation) for each covariate:

# compute scaling factors (mean and sd)
covariatesScaling <- function(var){
  mean_var <- mean(var)
  sd_var <- sd(var)
  return(
    data.frame(
      'cov_mean'= mean_var,
      'cov_sd' = sd_var
    )
  )
} 

covs <- data %>% 
  select(starts_with('x'))

scale_factor <- bind_rows(apply(covs, 2, covariatesScaling))
scale_factor$cov <- colnames(covs)

scale_factor %>% select(cov, cov_mean, cov_sd) %>% kbl %>%  kable_minimal()

cov	cov_mean	cov_sd
x1	3.913963	8.071879
x2	2.950790	3.970139
x3	0.000000	1.000000
x4	5.081461	4.748466
x5	3.941279	4.377713
x6	2.729769	5.028868

We then apply the scaling coefficient to the covariates:

# apply scaling factors to covariates
covs_scaled <-  covs %>% 
  mutate(cluster_id = 1:nrow(covs)) %>% 
  pivot_longer(-cluster_id,names_to = 'cov') %>% 
  left_join(scale_factor, by="cov") %>% 
  mutate(value= (value-cov_mean)/cov_sd ) %>% 
  select(-cov_mean, -cov_sd) %>% 
  pivot_wider(names_from = cov, values_from = value, id_cols = cluster_id) %>% 
  select(-cluster_id)

We store the scaled covariates and the scaling coefficients for the prediction stage (in Tutorial 4).

# save scaling factor
write_csv(covs_scaled, here('tutorials/data/covs_scaled.csv'))
write_csv(scale_factor, here('tutorials/data/scale_factor.csv'))

Implementing the model

Equation (1) is implemented in stan as follows:

// Model 1: Hierarchical alpha by settlement type , region + covariates
data{
  ...
  // slope
  int<lower=0> ncov; // number of covariates
  matrix[n, ncov] cov; // covariates
}
parameters{
  ...
  // slope
  row_vector[ncov] beta; 
}
transformed parameters{
  ...
  for(idx in 1:n){
    pop_density_median[idx] = alpha_t_r[type[idx], region[idx]] + sum( cov[idx,] .* beta );
  }
}
model{
  ...
  //slope
  beta ~ normal(0,10);
}
generated quantities{
  ...
   for(idx in 1:n){
    density_hat[idx] = lognormal_rng( alpha_t_r[type[idx], region[idx]] + sum(cov[idx,] .* beta), sigma );
   }
}

Note the two new data types - matrix for the covariate values and row_vector for \(\beta\) - as well as the new operator, .* . A row vector is a matrix with one row. .* performs elementwise multiplication. We need these elements because of the way we coded the linear regression for pop_density_median:in a for loop running through each study site defined by their idx. We extract for each study site the corresponding covariates values, cov[idx,] which is a row vector. To get a vector of each covariate value associated with each parameter \(\beta_k\), beta needs to be a row_vector and the multiplication needs to be performed element by element.

We keep the same set up for the MCMC:

# 3. Modelling with covariates ----

# mcmc settings
chains <- 4
warmup <- 500
iter <- 500
seed <- 1789

And we add the covariates to stan input data:

# prepare data for stan
stan_data_model1 <- list(
  population = data$N,
  n = nrow(data),
  area = data$A,
  type = data$type,
  ntype= n_distinct(data$type),
  region = data$region,
  nregion = n_distinct(data$region),
  seed=seed,
  cov = covs_scaled,
  ncov = ncol(covs_scaled)
  )

We add beta as parameter to monitor and run the model.

pars <- c('alpha','sigma','beta','alpha_t', 'nu_alpha', 'nu_alpha_t', 'population_hat',  'density_hat')

# mcmc
fit1 <- rstan::stan(file = file.path(here('./tutorials/tutorial3/tutorial3_model1.stan')), 
                   data = stan_data_model1,
                   iter = warmup + iter, 
                   chains = chains,
                   warmup = warmup, 
                   pars = pars,
                   seed = seed)

## Warning in .local(object, ...): some chains had errors; consider specifying
## chains = 1 to debug

## here are whatever error messages were returned

## [[1]]
## Stan model 'anon_model' does not contain samples.

## Warning in validityMethod(object): The following variables have undefined
## values: population_hat[1],The following variables have undefined values:
## population_hat[2],The following variables have undefined values:
## population_hat[3],The following variables have undefined values:
## population_hat[4],The following variables have undefined values:
## population_hat[5],The following variables have undefined values:
## population_hat[6],The following variables have undefined values:
## population_hat[7],The following variables have undefined values:
## population_hat[8],The following variables have undefined values:
## population_hat[9],The following variables have undefined values:
## population_hat[10],The following variables have undefined values:
## population_hat[11],The following variables have undefined values:
## population_hat[12],The following variables have undefined values:
## population_hat[13],The following variables have undefined values:
## population_hat[14],The following variables have undefined values:
## population_hat[15],The following variables have undefined values:
## population_hat[16],The following variables have undefined values:
## population_hat[17],The following variables have undefined values:
## population_hat[18],The following variables have undefined values:
## population_hat[19],The following variables have undefined values:
## population_hat[20],The following variables have undefined values:
## population_hat[21],The following variables have undefined values:
## population_hat[22],The following variables have undefined values:
## population_hat[23],The following variables have undefined values:
## population_hat[24],The following variables have undefined values:
## population_hat[25],The following variables have undefined values:
## population_hat[26],The following variables have undefined values:
## population_hat[27],The following variables have undefined values:
## population_hat[28],The following variables have undefined values:
## population_hat[29],The following variables have undefined values:
## population_hat[30],The following variables have undefined values:
## population_hat[31],The following variables have undefined values:
## population_hat[32],The following variables have undefined values:
## population_hat[33],The following variables have undefined values:
## population_hat[34],The following variables have undefined values:
## population_hat[35],The following variables have undefined values:
## population_hat[36],The following variables have undefined values:
## population_hat[37],The following variables have undefined values:
## population_hat[38],The following variables have undefined values:
## population_hat[39],The following variables have undefined values:
## population_hat[40],The following variables have undefined values:
## population_hat[41],The following variables have undefined values:
## population_hat[42],The following variables have undefined values:
## population_hat[43],The following variables have undefined values:
## population_hat[44],The following variables have undefined values:
## population_hat[45],The following variables have undefined values:
## population_hat[46],The following variables have undefined values:
## population_hat[47],The following variables have undefined values:
## population_hat[48],The following variables have undefined values:
## population_hat[49],The following variables have undefined values:
## population_hat[50],The following variables have undefined values:
## population_hat[51],The following variables have undefined values:
## population_hat[52],The following variables have undefined values:
## population_hat[53],The following variables have undefined values:
## population_hat[54],The following variables have undefined values:
## population_hat[55],The following variables have undefined values:
## population_hat[56],The following variables have undefined values:
## population_hat[57],The following variables have undefined values:
## population_hat[58],The following variables have undefined values:
## population_hat[59],The following variables have undefined values:
## population_hat[60],The following variables have undefined values:
## population_hat[61],The following variables have undefined values:
## population_hat[62],The following variables have undefined values:
## population_hat[63],The following variables have undefined values:
## population_hat[64],The following variables have undefined values:
## population_hat[65],The following variables have undefined values:
## population_hat[66],The following variables have undefined values:
## population_hat[67],The following variables have undefined values:
## population_hat[68],The following variables have undefined values:
## population_hat[69],The following variables have undefined values:
## population_hat[70],The following variables have undefined values:
## population_hat[71],The following variables have undefined values:
## population_hat[72],The following variables have undefined values:
## population_hat[73],The following variables have undefined values:
## population_hat[74],The following variables have undefined values:
## population_hat[75],The following variables have undefined values:
## population_hat[76],The following variables have undefined values:
## population_hat[77],The following variables have undefined values:
## population_hat[78],The following variables have undefined values:
## population_hat[79],The following variables have undefined values:
## population_hat[80],The following variables have undefined values:
## population_hat[81],The following variables have undefined values:
## population_hat[82],The following variables have undefined values:
## population_hat[83],The following variables have undefined values:
## population_hat[84],The following variables have undefined values:
## population_hat[85],The following variables have undefined values:
## population_hat[86],The following variables have undefined values:
## population_hat[87],The following variables have undefined values:
## population_hat[88],The following variables have undefined values:
## population_hat[89],The following variables have undefined values:
## population_hat[90],The following variables have undefined values:
## population_hat[91],The following variables have undefined values:
## population_hat[92],The following variables have undefined values:
## population_hat[93],The following variables have undefined values:
## population_hat[94],The following variables have undefined values:
## population_hat[95],The following variables have undefined values:
## population_hat[96],The following variables have undefined values:
## population_hat[97],The following variables have undefined values:
## population_hat[98],The following variables have undefined values:
## population_hat[99],The following variables have undefined values:
## population_hat[100],The following variables have undefined values:
## population_hat[101],The following variables have undefined values:
## population_hat[102],The following variables have undefined values:
## population_hat[103],The following variables have undefined values:
## population_hat[104],The following variables have undefined values:
## population_hat[105],The following variables have undefined values:
## population_hat[106],The following variables have undefined values:
## population_hat[107],The following variables have undefined values:
## population_hat[108],The following variables have undefined values:
## population_hat[109],The following variables have undefined values:
## population_hat[110],The following variables have undefined values:
## population_hat[111],The following variables have undefined values:
## population_hat[112],The following variables have undefined values:
## population_hat[113],The following variables have undefined values:
## population_hat[114],The following variables have undefined values:
## population_hat[115],The following variables have undefined values:
## population_hat[116],The following variables have undefined values:
## population_hat[117],The following variables have undefined values:
## population_hat[118],The following variables have undefined values:
## population_hat[119],The following variables have undefined values:
## population_hat[120],The following variables have undefined values:
## population_hat[121],The following variables have undefined values:
## population_hat[122],Th

The model runs into convergence issues. The error message that is returned is not very informative. However the Viewer pane in Rstudio contains more information, namely:

Chain 2: Exception: lognormal_lpdf: Scale parameter is 0, but must be > 0!.

It means that the combination of \(\alpha\), \(\beta\) and covariates value that is currently being tested lead to some zeros in the scale term of the lognormal which is forbidden.

A note on initialisation

This is an opportunity to discuss about initialisation. MCMC simulations start exploring the parameter space from one initial value.

This initial value is controlled in stan by the option init. Its default is random, that is “Let Stan generate random initial values for all parameters. The seed of the random number generator used by Stan can be specified via the seed argument. If the seed for Stan is fixed, the same initial values are used. The default is to randomly generate initial values between -2 and 2 on the unconstrained support”.

Defining the starting values helps the algorithm to start close to the region of interest such that no time is lost in exploring an area of the parameter space that we know doesn’t fit the likely values and might run into parameters combination that is unfit to our model structure. Note that the initialisation is not constraining, it just gives a hint to the algorithm.

We initialise only root parameters. Root parameters are parameters that don’t depend on other parameters. Dependent parameters will then inherit the initialisation.

# add initialisation
inits.out <- list()
set.seed(stan_data_model1$seed)

for (c in 1:chains){
  inits.i <- list()

  inits.i$sigma <- runif(1, 0.4, 0.8)
  inits.i$alpha <- runif(1, 3, 6)
  inits.i$beta <- runif(stan_data_model1$ncov, -1, 1)
  
  inits.out[[c]] <- inits.i
}

Note that we define initial values for each chain. We base them around estimated values in previous models,
and add some random jittering.

We run the estimation with these initialisation values.

fit1bis <- rstan::stan(file = file.path(here('./tutorials/tutorial3/tutorial3_model1.stan')), 
                   data = stan_data_model1,
                   iter = warmup + iter, 
                   chains = chains,
                   warmup = warmup, 
                   pars = pars,
                   seed = seed,
                   init= inits.out)

## Warning in validityMethod(object): The following variables have undefined
## values: population_hat[1],The following variables have undefined values:
## population_hat[2],The following variables have undefined values:
## population_hat[3],The following variables have undefined values:
## population_hat[4],The following variables have undefined values:
## population_hat[5],The following variables have undefined values:
## population_hat[6],The following variables have undefined values:
## population_hat[7],The following variables have undefined values:
## population_hat[8],The following variables have undefined values:
## population_hat[9],The following variables have undefined values:
## population_hat[10],The following variables have undefined values:
## population_hat[11],The following variables have undefined values:
## population_hat[12],The following variables have undefined values:
## population_hat[13],The following variables have undefined values:
## population_hat[14],The following variables have undefined values:
## population_hat[15],The following variables have undefined values:
## population_hat[16],The following variables have undefined values:
## population_hat[17],The following variables have undefined values:
## population_hat[18],The following variables have undefined values:
## population_hat[19],The following variables have undefined values:
## population_hat[20],The following variables have undefined values:
## population_hat[21],The following variables have undefined values:
## population_hat[22],The following variables have undefined values:
## population_hat[23],The following variables have undefined values:
## population_hat[24],The following variables have undefined values:
## population_hat[25],The following variables have undefined values:
## population_hat[26],The following variables have undefined values:
## population_hat[27],The following variables have undefined values:
## population_hat[28],The following variables have undefined values:
## population_hat[29],The following variables have undefined values:
## population_hat[30],The following variables have undefined values:
## population_hat[31],The following variables have undefined values:
## population_hat[32],The following variables have undefined values:
## population_hat[33],The following variables have undefined values:
## population_hat[34],The following variables have undefined values:
## population_hat[35],The following variables have undefined values:
## population_hat[36],The following variables have undefined values:
## population_hat[37],The following variables have undefined values:
## population_hat[38],The following variables have undefined values:
## population_hat[39],The following variables have undefined values:
## population_hat[40],The following variables have undefined values:
## population_hat[41],The following variables have undefined values:
## population_hat[42],The following variables have undefined values:
## population_hat[43],The following variables have undefined values:
## population_hat[44],The following variables have undefined values:
## population_hat[45],The following variables have undefined values:
## population_hat[46],The following variables have undefined values:
## population_hat[47],The following variables have undefined values:
## population_hat[48],The following variables have undefined values:
## population_hat[49],The following variables have undefined values:
## population_hat[50],The following variables have undefined values:
## population_hat[51],The following variables have undefined values:
## population_hat[52],The following variables have undefined values:
## population_hat[53],The following variables have undefined values:
## population_hat[54],The following variables have undefined values:
## population_hat[55],The following variables have undefined values:
## population_hat[56],The following variables have undefined values:
## population_hat[57],The following variables have undefined values:
## population_hat[58],The following variables have undefined values:
## population_hat[59],The following variables have undefined values:
## population_hat[60],The following variables have undefined values:
## population_hat[61],The following variables have undefined values:
## population_hat[62],The following variables have undefined values:
## population_hat[63],The following variables have undefined values:
## population_hat[64],The following variables have undefined values:
## population_hat[65],The following variables have undefined values:
## population_hat[66],The following variables have undefined values:
## population_hat[67],The following variables have undefined values:
## population_hat[68],The following variables have undefined values:
## population_hat[69],The following variables have undefined values:
## population_hat[70],The following variables have undefined values:
## population_hat[71],The following variables have undefined values:
## population_hat[72],The following variables have undefined values:
## population_hat[73],The following variables have undefined values:
## population_hat[74],The following variables have undefined values:
## population_hat[75],The following variables have undefined values:
## population_hat[76],The following variables have undefined values:
## population_hat[77],The following variables have undefined values:
## population_hat[78],The following variables have undefined values:
## population_hat[79],The following variables have undefined values:
## population_hat[80],The following variables have undefined values:
## population_hat[81],The following variables have undefined values:
## population_hat[82],The following variables have undefined values:
## population_hat[83],The following variables have undefined values:
## population_hat[84],The following variables have undefined values:
## population_hat[85],The following variables have undefined values:
## population_hat[86],The following variables have undefined values:
## population_hat[87],The following variables have undefined values:
## population_hat[88],The following variables have undefined values:
## population_hat[89],The following variables have undefined values:
## population_hat[90],The following variables have undefined values:
## population_hat[91],The following variables have undefined values:
## population_hat[92],The following variables have undefined values:
## population_hat[93],The following variables have undefined values:
## population_hat[94],The following variables have undefined values:
## population_hat[95],The following variables have undefined values:
## population_hat[96],The following variables have undefined values:
## population_hat[97],The following variables have undefined values:
## population_hat[98],The following variables have undefined values:
## population_hat[99],The following variables have undefined values:
## population_hat[100],The following variables have undefined values:
## population_hat[101],The following variables have undefined values:
## population_hat[102],The following variables have undefined values:
## population_hat[103],The following variables have undefined values:
## population_hat[104],The following variables have undefined values:
## population_hat[105],The following variables have undefined values:
## population_hat[106],The following variables have undefined values:
## population_hat[107],The following variables have undefined values:
## population_hat[108],The following variables have undefined values:
## population_hat[109],The following variables have undefined values:
## population_hat[110],The following variables have undefined values:
## population_hat[111],The following variables have undefined values:
## population_hat[112],The following variables have undefined values:
## population_hat[113],The following variables have undefined values:
## population_hat[114],The following variables have undefined values:
## population_hat[115],The following variables have undefined values:
## population_hat[116],The following variables have undefined values:
## population_hat[117],The following variables have undefined values:
## population_hat[118],The following variables have undefined values:
## population_hat[119],The following variables have undefined values:
## population_hat[120],The following variables have undefined values:
## population_hat[121],The following variables have undefined values:
## population_hat[122],Th

The former error message is not displayed anymore and the model contains samples. We still observe the same issue of tutorial 2, the integer overflow for the prediction of population count in the warmup period. We can safely say that the model has converged.

We can plot the \(\hat\beta_k\). The sign and magnitude of the covariate effects are inline with the association shown in Figure 3.

# plot beta estimation
stan_plot(fit1, pars='beta', fill_color='orange')

Comparing prediction with previous model

The next question is: How much improvement is brought by integrating covariates in the model?

We load Tutorial 2 model 3 to compare:

# load tutorial 2 final model
fit0 <- readRDS(here('./tutorials/tutorial2/tutorial2_model3_fit.rds'))

We can then compute the predictions for every study site and compare the goodness-of-fit when adding the covariates.

# extract predictions
getPopPredictions <- function(model_fit, 
                              estimate='population_hat',
                              obs='N', reference_data=data){
  # extract predictions
  predicted_pop <- as_tibble(extract(model_fit, estimate)[[estimate]])
  colnames(predicted_pop) <- reference_data$id
  
  # summarise predictions
  predicted_pop <- predicted_pop %>% 
    pivot_longer(everything(),names_to = 'id', values_to = 'predicted') %>% 
    group_by(id) %>% 
    summarise(
      across(everything(), list(mean=~mean(.), 
                                upper=~quantile(., probs=0.975), 
                                lower=~quantile(., probs=0.025)))
      ) %>% 
    left_join(reference_data %>% 
                rename('reference'=all_of(obs)) %>% 
                select(id, reference), by = 'id')%>% 
    mutate(
      residual= predicted_mean - reference,
      ci_size = predicted_upper- predicted_lower,
      estimate = estimate
      )

return(predicted_pop)
}

comparison_df <- rbind(
 getPopPredictions(fit0) %>% 
   mutate(Model='Without covariates'),
 getPopPredictions(fit1) %>% 
   mutate(Model='With covariates'))

# compute goodness-of-fit metrics
comparison_df %>% group_by(Model) %>% 
  summarise( `Bias`= mean(residual),
    `Inaccuracy` = mean(abs(residual)),
        `Imprecision` = sd(residual)
) %>%  kbl(caption = 'Goodness-of-metrics comparison with and without covariates ') %>% kable_minimal()

Table 2: Goodness-of-metrics comparison with and without covariates
Model	Bias	Inaccuracy	Imprecision
With covariates	33.13522	193.2779	275.0942
Without covariates	35.31776	200.4041	285.1552

We see an improvement on every goodness-of-fit metrics.

Grouping and covariates effect: a random slope model

In tutorial 2 we have seen that a model fitting a unique \(\alpha\) for all the observations could be improved by splitting the observations into groupings that would share a similar pattern of population density.

The idea is similar with \(\beta\): some covariates effects might vary by grouing. Example: x_4, the sum of settled area within a 1km radius might, might have a greater predictive power in rural areas than in urban areas. The difference relationships between covariate and population density by settlement type is highlighted in Figure 4.

# 4. Modelling covariates with random slope ----

ggplot(data_long %>% 
         group_by(type) %>% 
         mutate(
           type = paste0(type,' n=' ,n()),
           type=as.factor(type)), aes(x=pop_density,y=value, color=type))+
  geom_point()+
  geom_smooth(method = "lm", se = FALSE)+
  theme_minimal()+
  facet_wrap(.~cov, ncol=3, scales = 'free')+
  labs(y='', x='Population density', color='Settlement type')

Figure 4: Scatterplot of covariates vs population density by settlement type

Modelling \(\beta_k\) by settlement type is called a random slope model.

Question: Do we want to model the \(\beta_k\) hierarchically?

Click for the solution

Modelliong \(\beta\) hierarchically means assuming that there is a national pattern with subnational refinement. The \(\beta_{k,t}\) can however have opposite directions (see Figure 4) which speaks against a common overarching \(\beta\).

Formally, a random slope model is written as follows:

\[\begin{equation} population 〜 Poisson( pop\_density * settled\_area) \\ pop\_density 〜 Lognormal(\mu, \: \sigma) \\[10pt] \mu = \alpha_{t,r} + \beta^{random}_t X^{random} + \beta^{fixed} X^{fixed} \\[15pt] \beta^{random}_t 〜 Normal(0,10) \\ \beta^{fixed} 〜 Normal(0,10) \end{equation}\]

The difference can be seen in the indexing: \(\beta^{random}\) is indexed by \(t\). Similarly as in the no-pooling framework in Tutorial 2, we set the priors to be independent prior, namely Normal(0,10).

The stan implementation is as follows:

// Model 1: Hierarchical alpha by settlement type , region + covariates
data{
  ...
    // fixed slope
  int<lower=0> ncov_fixed; // number of covariates -1
  matrix[n, ncov_fixed] cov_fixed; // covariates
  // random slope
  vector[n] cov_random;
}
parameters{
  ...
  // slope
  row_vector[ncov_fixed] beta_fixed; 
  vector[ntype] beta_random;
}
transformed parameters{
  ...
  vector[n] beta;

  for(idx in 1:n){
    beta[idx] = sum( cov_fixed[idx,] .* beta_fixed) + cov_random[idx] * beta_random[type[idx]];
    pop_density_median[idx] = alpha_t_r[type[idx], region[idx]] + beta[idx];
  }
}
model{
  ...
  //slope
  beta_fixed ~ normal(0,10);
  beta_random ~ normal(0,10);
}
generated quantities{
  ...
 vector[n] beta_hat;

  for(idx in 1:n){
    beta_hat[idx] = sum( cov_fixed[idx,] .* beta_fixed) + cov_random[idx] * beta_random[type[idx]];
    density_hat[idx] = lognormal_rng( alpha_t_r[type[idx], region[idx]] + beta_hat[idx], sigma );
  ...
}

Note that we wrote the code to model only one random covariate, such that beta_random is only a vector that contains the \(\beta^{random}_t\). For implementing several random effect we would need a matrix (settlement type x number of covariates).

To run the model, we distinguish in the input data between the covariates that are fixed and the one that is random. We choose x_4 , the sum of settled area within a 1km radius, to be modelled with a random effect.

Note that this setting allows to test the model with different covariate candidates for the random effect.

# prepare stan data
stan_data_model2 <- list(
  population = data$N,
  n = nrow(data),
  area = data$A,
  type = data$type,
  ntype= n_distinct(data$type),
  region = data$region,
  nregion = n_distinct(data$region),
  seed=seed,
  cov_fixed = covs_scaled %>% select(-x4),
  ncov_fixed = ncol(covs_scaled) -1,
  cov_random = covs_scaled$x4
  )

pars <- c('alpha','sigma','beta_fixed','beta_random','alpha_t','alpha_t_r', 'nu_alpha', 'nu_alpha_t', 'population_hat',  'density_hat')

# initialise
inits.out <- list()
set.seed(stan_data_model2$seed)

for (c in 1:chains){
  inits.i <- list()
  # intercept
  inits.i$sigma <- runif(1, 0.4, 0.8)
  inits.i$alpha <- runif(1, 3, 6)
  inits.i$beta_fixed <- runif(stan_data_model2$ncov_fixed, -1, 1)
  inits.i$beta_random <- runif(stan_data_model2$ntype, -1, 1)

  inits.out[[c]] <- inits.i
}

# mcmc
fit2 <- rstan::stan(file = file.path(here('./tutorials/tutorial3/tutorial3_model2.stan')), 
                   data = stan_data_model2,
                   iter = warmup + iter, 
                   chains = chains,
                   warmup = warmup, 
                   pars = pars,
                   seed = seed,
                   init = inits.out)

## Warning in validityMethod(object): The following variables have undefined
## values: population_hat[1],The following variables have undefined values:
## population_hat[2],The following variables have undefined values:
## population_hat[3],The following variables have undefined values:
## population_hat[4],The following variables have undefined values:
## population_hat[5],The following variables have undefined values:
## population_hat[6],The following variables have undefined values:
## population_hat[7],The following variables have undefined values:
## population_hat[8],The following variables have undefined values:
## population_hat[9],The following variables have undefined values:
## population_hat[10],The following variables have undefined values:
## population_hat[11],The following variables have undefined values:
## population_hat[12],The following variables have undefined values:
## population_hat[13],The following variables have undefined values:
## population_hat[14],The following variables have undefined values:
## population_hat[15],The following variables have undefined values:
## population_hat[16],The following variables have undefined values:
## population_hat[17],The following variables have undefined values:
## population_hat[18],The following variables have undefined values:
## population_hat[19],The following variables have undefined values:
## population_hat[20],The following variables have undefined values:
## population_hat[21],The following variables have undefined values:
## population_hat[22],The following variables have undefined values:
## population_hat[23],The following variables have undefined values:
## population_hat[24],The following variables have undefined values:
## population_hat[25],The following variables have undefined values:
## population_hat[26],The following variables have undefined values:
## population_hat[27],The following variables have undefined values:
## population_hat[28],The following variables have undefined values:
## population_hat[29],The following variables have undefined values:
## population_hat[30],The following variables have undefined values:
## population_hat[31],The following variables have undefined values:
## population_hat[32],The following variables have undefined values:
## population_hat[33],The following variables have undefined values:
## population_hat[34],The following variables have undefined values:
## population_hat[35],The following variables have undefined values:
## population_hat[36],The following variables have undefined values:
## population_hat[37],The following variables have undefined values:
## population_hat[38],The following variables have undefined values:
## population_hat[39],The following variables have undefined values:
## population_hat[40],The following variables have undefined values:
## population_hat[41],The following variables have undefined values:
## population_hat[42],The following variables have undefined values:
## population_hat[43],The following variables have undefined values:
## population_hat[44],The following variables have undefined values:
## population_hat[45],The following variables have undefined values:
## population_hat[46],The following variables have undefined values:
## population_hat[47],The following variables have undefined values:
## population_hat[48],The following variables have undefined values:
## population_hat[49],The following variables have undefined values:
## population_hat[50],The following variables have undefined values:
## population_hat[51],The following variables have undefined values:
## population_hat[52],The following variables have undefined values:
## population_hat[53],The following variables have undefined values:
## population_hat[54],The following variables have undefined values:
## population_hat[55],The following variables have undefined values:
## population_hat[56],The following variables have undefined values:
## population_hat[57],The following variables have undefined values:
## population_hat[58],The following variables have undefined values:
## population_hat[59],The following variables have undefined values:
## population_hat[60],The following variables have undefined values:
## population_hat[61],The following variables have undefined values:
## population_hat[62],The following variables have undefined values:
## population_hat[63],The following variables have undefined values:
## population_hat[64],The following variables have undefined values:
## population_hat[65],The following variables have undefined values:
## population_hat[66],The following variables have undefined values:
## population_hat[67],The following variables have undefined values:
## population_hat[68],The following variables have undefined values:
## population_hat[69],The following variables have undefined values:
## population_hat[70],The following variables have undefined values:
## population_hat[71],The following variables have undefined values:
## population_hat[72],The following variables have undefined values:
## population_hat[73],The following variables have undefined values:
## population_hat[74],The following variables have undefined values:
## population_hat[75],The following variables have undefined values:
## population_hat[76],The following variables have undefined values:
## population_hat[77],The following variables have undefined values:
## population_hat[78],The following variables have undefined values:
## population_hat[79],The following variables have undefined values:
## population_hat[80],The following variables have undefined values:
## population_hat[81],The following variables have undefined values:
## population_hat[82],The following variables have undefined values:
## population_hat[83],The following variables have undefined values:
## population_hat[84],The following variables have undefined values:
## population_hat[85],The following variables have undefined values:
## population_hat[86],The following variables have undefined values:
## population_hat[87],The following variables have undefined values:
## population_hat[88],The following variables have undefined values:
## population_hat[89],The following variables have undefined values:
## population_hat[90],The following variables have undefined values:
## population_hat[91],The following variables have undefined values:
## population_hat[92],The following variables have undefined values:
## population_hat[93],The following variables have undefined values:
## population_hat[94],The following variables have undefined values:
## population_hat[95],The following variables have undefined values:
## population_hat[96],The following variables have undefined values:
## population_hat[97],The following variables have undefined values:
## population_hat[98],The following variables have undefined values:
## population_hat[99],The following variables have undefined values:
## population_hat[100],The following variables have undefined values:
## population_hat[101],The following variables have undefined values:
## population_hat[102],The following variables have undefined values:
## population_hat[103],The following variables have undefined values:
## population_hat[104],The following variables have undefined values:
## population_hat[105],The following variables have undefined values:
## population_hat[106],The following variables have undefined values:
## population_hat[107],The following variables have undefined values:
## population_hat[108],The following variables have undefined values:
## population_hat[109],The following variables have undefined values:
## population_hat[110],The following variables have undefined values:
## population_hat[111],The following variables have undefined values:
## population_hat[112],The following variables have undefined values:
## population_hat[113],The following variables have undefined values:
## population_hat[114],The following variables have undefined values:
## population_hat[115],The following variables have undefined values:
## population_hat[116],The following variables have undefined values:
## population_hat[117],The following variables have undefined values:
## population_hat[118],The following variables have undefined values:
## population_hat[119],The following variables have undefined values:
## population_hat[120],The following variables have undefined values:
## population_hat[121],The following variables have undefined values:
## population_hat[122],Th

## Warning: There were 6 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.

## Warning: Examine the pairs() plot to diagnose sampling problems

## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess

No convergence issue. We plot beta_random that is a vector with a \(\hat\beta_t\) for each settlement type.

# plot beta estimation
stan_plot(fit2, pars='beta_random', fill_color='orange')+
    # add alpha from tutorial 1
  geom_vline(xintercept=-0.006515444, size=1.5, linetype=2)+
  annotate('text', x=0.1, y=5.7, label="beta for cov 4 \nfrom first model")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.

We see that modelling \(\beta^{x4}\) by settlement type unravels different patterns: we observe a non-significant effect for settlement 1 and 3 the most urbanised types. This was expected as the sum of settled area is likely to be homogeneous across urbanised area. We see also that the previous estimtaed \(\beta_4\) was masking effect in opposite direction between settlement 2 and settlement 4,5.

We now want to evaluate the effect on the predicted population count for each study site.

# extract predictions
comparison_df <- rbind(
 getPopPredictions(fit1) %>% 
   mutate(model='Fixed effect'),
 getPopPredictions(fit2) %>% 
   mutate(model='Random effect in x4'))
# compute goodness-of-fit metrics
comparison_df %>% group_by(model) %>% 
  summarise( `Bias`= mean(residual),
    `Inaccuracy` = mean(abs(residual)),
        `Imprecision` = sd(residual)
) %>%  kbl(caption = 'Goodness-of-metrics comparison with and without random effect in x4 ') %>% kable_minimal()

Table 3: Goodness-of-metrics comparison with and without random effect in x4
model	Bias	Inaccuracy	Imprecision
Fixed effect	33.13522	193.2779	275.0942
Random effect in x4	33.41718	192.4844	274.8726

We see a slight decrease of bias and an increase of the precision of the estimates.

We will save the results of this final model as a RDS file to explore it in Tutorial 4.

# save model
saveRDS(fit2, here('./tutorials/tutorial3/tutorial3_model2_fit.rds'))

Acknowledgements

This tutorial was written by Edith Darin from WorldPop, University of Southampton and Douglas Leasure from Leverhulme Centre for Demographic Science, University of Oxford, with supervision from Andrew Tatem, WorldPop, University of Southampton. Funding for the work was provided by the United Nations Population Fund (UNFPA).

License

You are free to redistribute this document under the terms of a Creative Commons Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0) license.

References

Ecopia.AI, and Maxar Technologies. 2019. Digitize Africa Data.

Institut Géographique du Burkina Faso. 2015. “Base Nationale de Données Topographiques.”

Leasure, Douglas R, Warren C Jochem, Eric M Weber, Vincent Seaman, and Andrew J Tatem. 2020. “National Population Mapping from Sparse Survey Data: A Hierarchical Bayesian Modeling Framework to Account for Uncertainty.” Proceedings of the National Academy of Sciences.

Oak Ridge National Laboratory. 2018. “LandScan HD: Nigeria.”

OpenStreetMap contributors. 2018. Planet Dump Retrieved from Https://Planet.osm.org.

Weiss, D J, A Nelson, HS Gibson, W Temperley, S Peedell, A Lieber, M Hancher, et al. 2018. “A Global Map of Travel Time to Cities to Assess Inequalities in Accessibility in 2015.” Nature 553 (7688): 333336.

WorldPop Research Group, Department of Geography and Geosciences, University of Louisville, Departement de Geographie, Universite de Namur, and Center for International Earth Science Information Network (CIESIN), Columbia University. 2018. “Global High Resolution Population Denominators Project - Funded by the Bill and Melinda Gates Foundation (OPP1134076).” https://doi.org/10.5258/SOTON/WP00645.

Population modelling for census support

Tutorial 3: How to model small-scale spatial variations?

Compiled on 10/05/2023