16 Top-Down Population Modelling

16.1 Introduction

Top-down population mapping is a census-dependent process that estimates and redistributes population numbers from larger administrative areas to smaller, more localised grid cells (Stevens et al. 2015b, Leyk et al. 2019, McKeen et al. 2023, Yankey et al. 2024). This method is crucial for producing population counts for small area units, typically at 100 m resolution. It disaggregates population data from broad administrative regions, such as provinces or districts, to granular levels that support a range of analytical and planning purposes.

Population and housing censuses are the most important resources to produce accurate population data at the national and sub-national level. These are typically undertaken every decade, and population projections are used to create subnational estimates in the intercensal period. These data are typically only available as counts per administrative unit, masking small area variations and making them difficult to integrate with other datasets. Top-down modelling approaches take population counts at the level of administrative units, like regions or municipalities, and disaggregate them down to counts for each 100x100m grid square across the country or region of interest. This is achieved by utilising machine learning techniques that take advantage of relationships with a stack of 100x100m resolution geospatial covariate datasets. This method helps to translate these broad estimates into a more granular form while maintaining the official total population at the administrative unit level of the input data.

Top-down population mapping has employed a variety of methodological techniques for population disaggregation (Qiu et al. 2022). The most basic of these is the area weighting approach (Eicher & Brewer 2001), which spreads the population evenly across grid cells without using geospatial covariates or other data that might inform how the population is redistributed. Dasymetric population mapping (Mennis 2003, Mennis & Hultgren 2006, Stevens et al. 2015b) was developed to overcome this limitation of uniform redistribution. It combines ancillary geospatial covariates with observed population data to produce a weighting layer, which is then used to disaggregate the population totals into grid cells using advanced statistical and machine learning approaches. These ancillary geospatial covariates include land use and land cover data, climate variables like temperature and rainfall, physical features and infrastructure like roads and schools, and settlement data like building footprints. These datasets have a direct or indirect effect on population distribution across a given country, highlighting the need to incorporate such data into the redistribution process. The top-down population modelling method has been used to produce many global population datasets (Nordstrand & Frye 2014, CIESIN 2018, Florczyk et al. 2019, McKeen et al. 2023, Sims et al. 2023).
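The difference between area weighting and dasymetric redistribution can be illustrated with a minimal Python sketch. The function names, the unit total, and the cell weights below are hypothetical and chosen only for illustration; in practice the weights would come from a model fitted to geospatial covariates.

```python
def area_weighting(total, n_cells):
    """Spread an administrative-unit total evenly over its grid cells."""
    return [total / n_cells] * n_cells


def dasymetric(total, weights):
    """Split an administrative-unit total in proportion to per-cell weights."""
    weight_sum = sum(weights)
    if weight_sum == 0:
        raise ValueError("weights must not all be zero")
    return [total * w / weight_sum for w in weights]


# Hypothetical administrative unit: 1,000 people across four grid cells.
unit_total = 1000
cell_weights = [0.5, 3.0, 1.0, 0.5]  # illustrative values, not real model output

even = area_weighting(unit_total, len(cell_weights))
weighted = dasymetric(unit_total, cell_weights)

print(even)      # every cell receives the same share
print(weighted)  # cells with larger weights receive more people
```

In both cases the cell values sum to the unit total; only the way the total is shared among cells differs.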

16.2 Approach to Top-Down Disaggregation

Top-down population mapping primarily relies on the Random Forest algorithm for population disaggregation. The popRF R package (Bondarenko et al. 2021) was developed for this purpose and can be downloaded from its GitHub repository. The package offers a user-friendly workflow, allowing users to perform population disaggregation without requiring extensive technical expertise in the Random Forest algorithm.

16.2.1 Random Forest

Random Forest (RF) is a non-parametric supervised machine learning algorithm that aggregates several classification and regression trees (CARTs) (Breiman 2001a). The RF model was developed to overcome the overfitting associated with using a single CART model. By combining the predictions of many trees, each grown on a bootstrapped sample of the data (bagging), RF models produce better predictions than a single tree. The RF model is also able to handle noisy and highly correlated input data (Grippa et al. 2019).

In essence, the RF method involves creating subsets of the data, fitting a decision tree to each subset, and combining the trees for prediction. Each tree is grown on a bootstrapped sample drawn with replacement from the training data. Such a sample contains roughly two-thirds of the distinct observations, while the remaining third is kept out of training for that tree (training data bagging). At each node, a random subset of features is considered, and the node is split on the feature that best separates the observations. For the variable of interest, each tree provides a predicted value, and the average over all the trees in the forest is used as the final output. The third of the data that is held out from each tree is used to calculate a performance measure called the Out-of-Bag (OOB) error. The OOB error can also be used to assess the importance of the independent variables. The most common way to do this is to use the increase in the mean square error (iMSE). More specifically, the values of each feature are randomly permuted in turn, and the OOB error is recomputed. This value is compared with the error of the original model before the permutation. If a variable is very important, we expect a large increase in the OOB error, and vice versa (Georganos et al. 2021).
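The "two-thirds in, one-third out" split follows from sampling with replacement: the chance that a given observation appears in a bootstrap sample of size n is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 0.632) as n grows. The short Python sketch below checks this empirically for one tree. The function name and the sample size are illustrative assumptions, and the sketch shows only the sampling step, not a full Random Forest.

```python
import random


def bootstrap_split(n_obs, rng):
    """Draw one bootstrap sample and return (in-bag indices, out-of-bag indices)."""
    # Sample n_obs indices with replacement, as done for each tree.
    in_bag = [rng.randrange(n_obs) for _ in range(n_obs)]
    # Observations never drawn form the out-of-bag set for this tree.
    out_of_bag = sorted(set(range(n_obs)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(42)  # fixed seed so the sketch is reproducible
n_obs = 10_000           # hypothetical number of training observations

in_bag, out_of_bag = bootstrap_split(n_obs, rng)
unique_share = len(set(in_bag)) / n_obs
oob_share = len(out_of_bag) / n_obs

print(f"distinct observations in bag: {unique_share:.3f}")
print(f"observations out of bag:      {oob_share:.3f}")
```

Because each tree has its own out-of-bag set, every observation can be predicted using only the trees that did not see it, which is what makes the OOB error an internal validation measure.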

The RF model is also advantageous because it has only a few (hyper)parameters that need to be tuned to improve its performance. The most important are the number of trees in the model and the number of features randomly selected at each node of the tree (Flasse et al. 2021).

The popRF package is designed to implement top-down population disaggregation using the RF model, and it performs the entire disaggregation process automatically. In popRF, we first calculate the average values of the geospatial covariates at the training level (administrative level), and then use these covariates to train the RF model. The response variable associated with the geospatial covariates at the training level is the population density, which is log transformed because of its highly skewed distribution. The output of the popRF regression model is the predicted population density on a log scale, which is back transformed to retrieve the population density and create a weighting layer. A tutorial on how to create a weighting layer and redistribute the population counts at a finer level is provided in the next chapter (chapter 18).
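The log transformation, back transformation, and redistribution steps can be sketched in a few lines of Python. This is not popRF code: the function names, unit labels, totals, and predicted values are hypothetical, and the predicted log densities stand in for the output of a fitted RF model.

```python
import math


def log_density(population, area):
    """Training response: log of population density for an administrative unit."""
    return math.log(population / area)


def disaggregate(unit_totals, cell_units, cell_log_density):
    """Back-transform predicted log densities and rescale them to unit totals."""
    # Back-transform from the log scale to obtain the weighting layer.
    weights = [math.exp(v) for v in cell_log_density]
    # Sum the weights within each administrative unit.
    weight_sums = {}
    for unit, w in zip(cell_units, weights):
        weight_sums[unit] = weight_sums.get(unit, 0.0) + w
    # Each cell receives its unit's total in proportion to its weight.
    return [unit_totals[u] * w / weight_sums[u]
            for u, w in zip(cell_units, weights)]


# Hypothetical inputs: two administrative units and five grid cells.
unit_totals = {"A": 1200.0, "B": 300.0}
cell_units = ["A", "A", "A", "B", "B"]
cell_log_density = [2.0, 3.5, 1.0, 0.5, 0.5]  # stand-ins for RF predictions

cell_counts = disaggregate(unit_totals, cell_units, cell_log_density)
print([round(c, 1) for c in cell_counts])
```

Rescaling within each unit is what keeps the gridded counts consistent with the input totals: the cells of unit A sum to 1,200 and those of unit B to 300, whatever the predicted densities are.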

16.3 Global I

In 2020, WorldPop launched the Global I project with the objective of producing high-resolution global population data using a top-down modelling approach. The project aimed to produce granular population data worldwide at a resolution of 100 m, making it accessible for various applications and analyses. Sub-national census-based population estimates were combined with a variety of ancillary geospatial covariates that correlate with those estimates to disaggregate population data to 100 m resolution for all countries globally. The gridded population dataset from Global I was also disaggregated into age and sex classes, which allows for a detailed analysis of the population distribution across demographic groups. Such granularity is invaluable for targeted interventions in public health, resource allocation, urban planning, and other fields requiring demographic insights. Global I represents a significant advancement in global population mapping, providing detailed, reliable, and accessible data that supports more informed decision-making and policy development around the world. You can download the Global I dataset for various countries from WOPR. WorldPop produces two gridded population data outputs, namely top-down unconstrained and top-down constrained.

16.4 Top-Down Unconstrained

The unconstrained modelling approach disaggregates census data to all grid cells within a study location, irrespective of whether a grid cell is settled or not. This approach is based on the assumption that no settlement dataset can accurately identify all residential buildings globally, so none can serve as a mask to map uninhabited areas. As a result, this ‘unconstrained’ model predicts population numbers for all 100x100m land grid cells in the study location by disaggregating a census database. These datasets are particularly useful when the accuracy of satellite-based settlement mapping is questionable, especially for detecting small rural settlements. This method has the advantage of being less sensitive to the accuracy of building or settlement mapping than constrained modelling. However, it also has drawbacks: a non-zero population is allocated to all land grid cells, which can lead to population misallocation in uninhabited areas and underestimation of urban populations in some regions.

16.5 Top-Down Constrained

The top-down constrained model disaggregates census data into settled grid cells by using settlement data as additional information for the allocation. The constrained model allocates population only to settled grid cells, unlike the unconstrained model, which allocates population to a grid cell regardless of its settled status. The mapping of human settlements and buildings using satellite imagery has greatly advanced in spatial detail, accuracy, and availability. Previously, small settlements and isolated buildings were often missed, making such datasets ineffective for delineating uninhabited areas. Recent improvements, however, have enabled the use of these datasets as masks to identify uninhabited areas over recent periods. When settlements and buildings are accurately mapped, this approach produces a more precise population distribution, preventing small population predictions in likely uninhabited areas. A disadvantage is the reliance on the accuracy of satellite-based settlement and building mapping; missed settlements and buildings can lead to over-allocation of population to neighbouring areas, while incorrectly identified settlements and buildings can result in under-allocation.
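The contrast between the two outputs can be sketched in Python as the same proportional allocation with and without a settlement mask. The function name, weights, mask, and unit total are hypothetical, and applying the mask by zeroing the weights of unsettled cells is an assumption made for illustration rather than a description of the production workflow.

```python
def allocate(total, weights, settled=None):
    """Share a unit total among grid cells, optionally only among settled cells."""
    if settled is not None:
        # Constrained case: unsettled cells get zero weight.
        weights = [w if s else 0.0 for w, s in zip(weights, settled)]
    weight_sum = sum(weights)
    if weight_sum == 0:
        raise ValueError("no cell available to receive population")
    return [total * w / weight_sum for w in weights]


# Hypothetical administrative unit: 800 people across four grid cells.
unit_total = 800
weights = [4.0, 1.0, 0.5, 2.5]           # illustrative weighting-layer values
settled = [True, False, False, True]     # illustrative settlement mask

unconstrained = allocate(unit_total, weights)
constrained = allocate(unit_total, weights, settled)

print([round(v, 1) for v in unconstrained])  # every cell receives some population
print([round(v, 1) for v in constrained])    # unsettled cells receive zero
```

Both outputs sum to the unit total. In the constrained case the population that would have gone to unsettled cells is shared among the settled ones, which is why errors in the settlement mask translate directly into over- or under-allocation.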

16.6 Contributions

This chapter was written by Ortis Yankey and Assane Gadiaga.

References

Bondarenko M, Nieves JJ, Stevens FR, Gaughan AE, Jochem C, Kerr D, Sorichetta A. 2021. popRF: Random forest-informed population disaggregation R package. https://cran.r-project.org/package=popRF.

Breiman L. 2001a. Random forests. Machine Learning 45:5–32.

CIESIN. 2018. Gridded population of the world, version 4 (GPWv4): Population count adjusted to match 2015 revision of UN WPP country totals, revision 11. https://doi.org/10.7927/H4PN93PB.

Eicher CL, Brewer CA. 2001. Dasymetric mapping and areal interpolation: Implementation and evaluation. Cartography and Geographic Information Science 28:125–138.

Flasse L, Schewin C, Grapin-Botton A. 2021. Pancreas morphogenesis: Branching in and then out. Current Topics in Developmental Biology 143:75–110.

Florczyk AJ, Corbane C, Ehrlich D, Freire S, Kemper T, Maffenini L, Melchiorri M, Pesaresi M, Politis P, Schiavina M. 2019. GHSL data package 2019. Luxembourg, EUR 29788:290498.

Georganos S, Grippa T, Niang Gadiaga A, Linard C, Lennert M, Vanhuysse S, Mboga N, Wolff E, Kalogirou S. 2021. Geographical random forests: A spatial extension of the random forest algorithm to address spatial heterogeneity in remote sensing and population modelling. Geocarto International 36:121–136.

Grippa T, Linard C, Lennert M, Georganos S, Mboga N, Vanhuysse S, Gadiaga A, Wolff E. 2019. Improving urban population distribution models with very-high resolution satellite information. Data 4:13.

Leyk S, Gaughan AE, Adamo SB, Sherbinin A de, Balk D, Freire S, Rose A, Stevens FR, Blankespoor B, Frye C. 2019. The spatial allocation of population: A review of large-scale gridded population data products and their fitness for use. Earth System Science Data 11:1385–1409.

McKeen T, Bondarenko M, Kerr D, Esch T, Marconcini M, Palacios-Lopez D, Zeidler J, Valle RC, Juran S, Tatem AJ. 2023. High-resolution gridded population datasets for Latin America and the Caribbean using official statistics. Scientific Data 10:436.

Mennis J. 2003. Generating surface models of population using dasymetric mapping. The Professional Geographer 55:31–42.

Mennis J, Hultgren T. 2006. Intelligent dasymetric mapping and its application to areal interpolation. Cartography and Geographic Information Science 33:179–194.

Nordstrand E, Frye C. 2014. World population estimate. doi:10.13140/RG.2.2.18213.14565.

Qiu Y, Zhao X, Fan D, Li S, Zhao Y. 2022. Disaggregating population data for assessing progress of SDGs: Methods and applications. International Journal of Digital Earth 15:2–29.

Sims K, Reith A, Bright E, Kaufman J, Pyle J, Epting J, Gonzales J, Adams D, Powell E, Urban M, Rose A. 2023. LandScan Global 2022. doi:10.48690/1529167. landscan.ornl.gov.

Stevens FR, Gaughan AE, Linard C, Tatem AJ. 2015b. Disaggregating census data for population mapping using random forests with remotely-sensed and ancillary data. PLoS ONE 10:e0107042.

Yankey O, Utazi CE, Nnanatu CC, Gadiaga AN, Abbot T, Lazar AN, Tatem AJ. 2024. Disaggregating census data for population mapping using a Bayesian additive regression tree model. Applied Geography 172:103416.