2 Working with Data Frames

This module focuses on working with data frames, covering installing and loading packages, importing data into the R environment, and basic data wrangling methods. Module 2 also includes a thorough introduction to data visualisation methods, key foundational knowledge that will be used in future modules.

2.1 Importing and exporting data in and from R

2.1.1 Installing and loading packages

R uses packages to store collections of functions, and thousands of these packages are available (more than 16,000). Installed packages are kept in a library on your computer. The “official” packages for R can be found on CRAN, but there are many other useful packages available on GitHub.

A package is a collection of R functions, code, and occasionally data that enables efficient work in R without the need to write complicated code yourself.

Examples of common R packages include:

tidyverse: a collection of R packages designed for data manipulation and visualisation, including the following packages
ggplot2: designed for elegant data visualisation and graphics
dplyr: designed for data manipulation and cleaning
tidyr: designed for reshaping and cleaning data
readr: designed for reading and writing delimited files
purrr: designed for functional programming in R
tibble: designed to be a modern reimplementation of data frames
stringr: designed for working with strings
lubridate: designed for working with dates and times
forcats: designed for working with categorical variables
sf: designed for working with spatial vector data
rstan or INLA: designed for use with Bayesian statistics

To use a package in R, it must be installed once and then loaded into each session in which it is needed. Some packages are already installed and loaded as part of base R. For those which are not, the function install.packages() installs the package, with the package name enclosed in either ' or ", and the function library() loads the installed package into the session.

#install ggplot2 package 
install.packages("ggplot2")
#load the ggplot2 package
library(ggplot2)

Alternatively, available packages can be accessed from the “Packages” window in Pane 3 (bottom-right pane) of the RStudio GUI. In this window, you can either search for the package directly, or scroll through the list and check the tick box of the desired package to load it into R.
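
When a script is run repeatedly, it can be convenient to check whether a package is already installed before loading it. The following is a minimal sketch of that idea; the helper name is_installed is hypothetical and not part of any package, while requireNamespace() is a base R function that reports whether a package can be found.

```r
# Hypothetical helper: TRUE if the named package is available on this machine
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

pkg <- "ggplot2"
if (is_installed(pkg)) {
  # character.only = TRUE lets library() accept the name stored in a variable
  library(pkg, character.only = TRUE)
} else {
  message("Package '", pkg, "' is not installed; run install.packages(\"",
          pkg, "\") first.")
}
```

The check avoids reinstalling a package every time the script runs, which install.packages() would otherwise do.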

2.1.2 Reading data into the R environment

Different data formats require different functions for importing into R, with some formats requiring additional packages. From base R, the function read.table() reads the .txt file given as an argument. Called on its own, however, this function only prints the contents of the file in the console and does not store them in the R environment. To keep the data in the R environment, you must assign the result to a variable. The other key functions for importing data into R are given in the table below, followed by example code showing how to use them.


Data type	Function	Package required
`.csv`	`read.csv()`	none (in `base`)
`.txt`, text files	`read.table()`	none (in `base`)
`.xlsx`, Excel workbook	`read_excel()`	`readxl`
`.dta`, STATA files	`read.dta()`	`foreign`
`.sav`, SPSS files	`read.spss()`	`foreign`

#import a text or .txt file 
text_data<-read.table(file = paste0(data_path,"sample_data.txt")) 

#import a .csv file 
csv_data1 <-read.table(file = paste0(data_path,"sample_data.csv"), 
                       sep = ",",
                       header = TRUE) 
csv_data2 <-read.csv(file = paste0(data_path,"sample_data.csv"))

#import an Excel Workbook file 
install.packages("readxl") 
library(readxl) 
excel_data<-read_excel(path = paste0(data_path,"sample_data.xlsx")) 

#import a STATA file 
install.packages("foreign") 
library(foreign) 
stata_data<-read.dta(file = paste0(data_path,"sample_data.dta"))

Instead of building the file path with paste0() (where data_path is a string holding the folder that contains the data files), the function file.choose() can be used, which opens a window on the computer for you to select the file you wish to import.

#import a text or .txt file 
text_data2<-read.table(file = file.choose())
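
The difference between printing a file and importing it can be tried without any external data. The sketch below writes a tiny, made-up .csv to a temporary file (the column names id and height and the object names are illustrative assumptions) and then reads it back in both ways.

```r
# Write a small made-up CSV to a temporary file so no external data is needed
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("id,height", "1,170", "2,165", "3,180"), tmp_csv)

# Without assignment the contents are only printed in the console
read.csv(tmp_csv)

# With assignment the data frame is stored in the R environment
demo_data <- read.csv(tmp_csv)
str(demo_data)

# read.table() reads the same file once the separator and header are specified
demo_data2 <- read.table(tmp_csv, sep = ",", header = TRUE)
all.equal(demo_data, demo_data2)
```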

Another option for loading data in R is to use the menus in the main toolbar following File > Import Dataset > From [data format of choice] > [browse for file on the pop-up window] or in the Environment pane following Import Dataset > From [data format of choice] > [browse for file on the pop-up window]. The two menus are shown in Figures 2.1 and 2.2 respectively. This practice is not encouraged, however, because the import step is not recorded in the script. It is better practice to use the functions above in the script to import your chosen dataset.

Figure 2.1: Import table via Toolbar

Figure 2.2: Import data via Environment

Exercise: What are the defaults for the header argument in the functions read.table() and read.csv()?

header = T in read.table() and header = T in read.csv()
header = T in read.table() and header = F in read.csv()
header = F in read.table() and header = T in read.csv()
header = F in read.table() and header = F in read.csv()
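
One way to check an answer to this kind of question yourself, sketched here with base R only, is to inspect the default arguments of a function with formals() or args().

```r
# Look up the default value of the 'header' argument for each function
formals(read.table)$header
formals(read.csv)$header

# args() prints the full list of arguments and their defaults
args(read.csv)
```

The help pages (?read.table) document the same defaults along with their meaning.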

2.1.3 Creating work paths

Setting the working directory to a single folder means that relative file names are resolved against that folder only; files stored in its subfolders are not found unless the subfolder is included in the path.

This becomes inconvenient if you need to access several folders within a folder, or want your input and output files, or your intermediate and final results, stored in different folders. In these instances, rather than setting a working directory, it can be better to store the path to a ‘parent directory’ as a string and build the paths to individual folders and files from it. The data can then be imported with a function such as read.table(), which can read both standard .txt and .csv data.

A work path can be created by setting a path to the parent directory.

#parent directory
path <- "C:/Users/YOUR_NAME" 

#save in a folder called "training"
out_path <- paste0(path, "/training") 
if(!file.exists(out_path))
{
  dir.create(file.path(out_path))
}

You can also create a work path using the current working directory as the parent directory.

#working directory as parent directory
out_path <- paste0(getwd(), "/training")
if(!file.exists(out_path))
{
  dir.create(file.path(out_path))
}

You can also set the working directory to this folder, creating the folder first if it does not exist.

#set the working directory to out_path, creating the folder if needed
if(file.exists(out_path)){
  setwd(file.path(out_path))
}else{
  dir.create(file.path(out_path))
  setwd(file.path(out_path))
}
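
As a further sketch of the parent-directory idea, the base R function file.path() can build paths with the correct separator, so separate input and output folders can sit under one parent. The folder names input and output, the file name cars_head.csv, and the use of tempdir() as the parent are assumptions made so that the example runs anywhere.

```r
# Use a temporary folder as the parent directory for this illustration
parent_dir <- tempdir()

# Build paths to separate input and output folders under the parent
in_path_demo  <- file.path(parent_dir, "training", "input")
out_path_demo <- file.path(parent_dir, "training", "output")

# Create each folder (and any missing parent folders) if it does not exist
for (p in c(in_path_demo, out_path_demo)) {
  if (!dir.exists(p)) {
    dir.create(p, recursive = TRUE)
  }
}

# Write a file into the output folder without changing the working directory
write.csv(head(mtcars), file.path(out_path_demo, "cars_head.csv"),
          row.names = FALSE)

# List everything stored under the 'training' folder
list.files(file.path(parent_dir, "training"), recursive = TRUE)
```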

2.1.4 Exploring data attributes


Function	Description
`class()`	class of the variable
`colnames()`	gives the column names
`describe()`	summary of data - requires `psych` package
`dim()`	gives dimensions of the data
`factor()`	change to a factor
`glimpse()`	view data structure - requires `dplyr` package
`head()`	gives first six (6) rows of data
`head(mydata, n)`	gives first `n` rows of data called `mydata`
`headtail()`	first and last rows of data - requires `psych` package
`length()`	length of a vector
`ls()`	lists all items available in the R environment
`names()`	gives column or variable names
`ncol()`	number of columns
`nrow()`	number of rows
`rownames()`	gives the row names
`rm()`	removes selected items from the R environment
`str()`	gives full structure of data
`summary()`	summary of a vector
`table()`	frequency table of a vector
`tail()`	gives last six (6) rows of data
`unique()`	unique values of a vector
`View()`	view data

#'mtcars' is a dataset available in the 'datasets' package with data on 
#11 different aspects of 32 automobiles, taken from the 1974 Motor 
#Trend US magazine
library(datasets)
#information on the dataset in the 'Help' pane
?mtcars
#load data and assign to 'cars_data'
cars_data <- mtcars

#dimension of data
dim(cars_data)

## [1] 32 11

#variable names of data
names(cars_data)

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

#first 6 rows of data
head(cars_data)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

#last 6 rows of data
tail(cars_data)

##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

#install psych package
install.packages("psych")

#load psych package
library(psych)

#psych package provides more flexible alternatives to functions in base R
describe(cars_data)

##      vars  n   mean     sd median trimmed    mad   min    max  range  skew kurtosis    se
## mpg     1 32  20.09   6.03  19.20   19.70   5.41 10.40  33.90  23.50  0.61    -0.37  1.07
## cyl     2 32   6.19   1.79   6.00    6.23   2.97  4.00   8.00   4.00 -0.17    -1.76  0.32
## disp    3 32 230.72 123.94 196.30  222.52 140.48 71.10 472.00 400.90  0.38    -1.21 21.91
## hp      4 32 146.69  68.56 123.00  141.19  77.10 52.00 335.00 283.00  0.73    -0.14 12.12
## drat    5 32   3.60   0.53   3.70    3.58   0.70  2.76   4.93   2.17  0.27    -0.71  0.09
## wt      6 32   3.22   0.98   3.33    3.15   0.77  1.51   5.42   3.91  0.42    -0.02  0.17
## qsec    7 32  17.85   1.79  17.71   17.83   1.42 14.50  22.90   8.40  0.37     0.34  0.32
## vs      8 32   0.44   0.50   0.00    0.42   0.00  0.00   1.00   1.00  0.24    -2.00  0.09
## am      9 32   0.41   0.50   0.00    0.38   0.00  0.00   1.00   1.00  0.36    -1.92  0.09
## gear   10 32   3.69   0.74   4.00    3.62   1.48  3.00   5.00   2.00  0.53    -1.07  0.13
## carb   11 32   2.81   1.62   2.00    2.65   1.48  1.00   8.00   7.00  1.05     1.26  0.29
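
Several other functions from the table can be tried on the same dataset. The following is a short sketch using only base R; the object name cars_demo is chosen for illustration.

```r
# Copy the built-in dataset under an illustrative name
cars_demo <- mtcars

class(cars_demo)            # class of the object
nrow(cars_demo)             # number of rows
ncol(cars_demo)             # number of columns
str(cars_demo)              # structure: type and first values of each column
rownames(cars_demo)[1:3]    # first three row names
unique(cars_demo$gear)      # distinct values of the gear variable
table(cars_demo$cyl)        # frequency of each number of cylinders
summary(cars_demo$mpg)      # summary statistics for a single variable
```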

2.1.5 Viewing data

Data imported into R can be viewed in the Environment pane, with the column/variable names given, the data type of each variable and the first 10 observations (provided the observations do not go ‘off’ the display due to length) for each variable given as a summary of the data structure. If the summary is not given, ensure that the display of the Environment is set to ‘list’ and click the arrow to the left of the dataset name.

Figure 2.3: View data in Environment

Alternatively, the function View() can be used to display the full data in a new tab in the first (1st) pane, adjacent to the script.

Figure 2.4: View data in new tab

2.1.6 Extracting data

As with the different data structures in R such as vectors and data frames, data can be extracted from a dataset using square brackets, [], known as sub-setting or slicing. Additionally, the dollar sign, $ can be used to extract variables/access columns from a dataset.

#subset the data to only show the rows which have mpg greater than 25
cars_data[cars_data$mpg>25,]

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

Data can also be sliced by a given index using a colon, :, to represent a given range.

#return the fifth (5th) to the tenth (10th) rows of the data
cars_data[5:10, ]

##                    mpg cyl  disp  hp drat   wt  qsec vs am gear carb
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.44 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.46 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.19 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.15 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.44 18.30  1  0    4    4

Through extracting variables from the datasets, the summary statistic functions for vectors can be used.

#quantiles for the hp variable
quantile(cars_data$hp)

##    0%   25%   50%   75%  100% 
##  52.0  96.5 123.0 180.0 335.0

#median value of cyl
median(cars_data$cyl)

## [1] 6

The dollar sign doesn’t just extract variables from the data, but can be used to create new variables as with data frames.

#create a variable for miles per litre by multiplying mpg by 0.264 (1 litre is 
#approximately 0.264 US gallons, and mpg in mtcars is miles per US gallon)
cars_data$mpl <- cars_data$mpg*0.264
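
Conditions can be combined, and columns can be selected by name inside the same square brackets. The sketch below illustrates this on mtcars; the object names cars_demo and efficient_4gear are illustrative.

```r
cars_demo <- mtcars

# Rows with mpg above 25 AND four gears, keeping only two named columns
efficient_4gear <- cars_demo[cars_demo$mpg > 25 & cars_demo$gear == 4,
                             c("mpg", "gear")]
efficient_4gear

# $ and [[ ]] both extract a single column as a vector
identical(cars_demo$hp, cars_demo[["hp"]])

# subset() from base R expresses the same selection without repeating the name
subset(cars_demo, mpg > 25 & gear == 4, select = c(mpg, gear))
```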

2.1.7 Attaching data

When working with variables from one dataset in R, it can become tedious to always require the dollar sign, $, to access the variables. The function attach() removes the need for this by adding the dataset to R's search path, so that its variables can be referred to by name without quoting the dataset each time.

#cannot compute the mean of mpg from cars_data without extracting it first
mean(mpg)#returns error

## Error: object 'mpg' not found

mean(cars_data$mpg)#works

## [1] 20.09062

#attach the dataset to the Environment
attach(cars_data) 

mean(mpg) #now works

## [1] 20.09062

If you want to detach the data for any reason, for example if you want to attach another dataset, you can use the detach() function in R.

#detach data
detach(cars_data)

mean(mpg) #now does not work again

## Error: object 'mpg' not found
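
A possible alternative, sketched here, is the base R function with(), which evaluates a single expression using the variables of a dataset without attaching anything. This avoids name clashes when several datasets contain variables with the same name.

```r
# Evaluate an expression inside the dataset without attaching it
mean_mpg <- with(mtcars, mean(mpg))
mean_mpg

# Several variables can be used in the same expression
with(mtcars, cor(mpg, wt))
```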

2.1.8 Writing data to external repositories

Different data formats require different functions for exporting from R, with some formats requiring additional packages. Since the data visible in the Environment (pane 4) is held in memory (RAM) and not on disk, it is important that any data you wish to keep is written to an external repository, such as a .csv file, otherwise it is erased when the R session ends. The key functions used to write data in the different (common) formats are given in the table below, with examples in the code that follows.


Data type	Function	Package required
`.csv`	`write.csv()`	none (in `base`)
`.txt`, text files	`write.table()`	none (in `base`)
`.xlsx`, Excel workbook	`write_xlsx()`	`writexl`
`.dta`, STATA files	`write.dta()`	`foreign`

#export a .csv file 
write.csv(sample_data, "exported_data.csv")

#export a text or .txt file 
write.table(sample_data, "exported_data.txt")

#export an Excel file 
install.packages("writexl") 
library(writexl)
write_xlsx(sample_data, "exported_data.xlsx")

#export a STATA file 
install.packages("foreign") 
library(foreign) 
write.dta(sample_data, "exported_data.dta")

Note: The entire R Environment can be saved using the save.image() function or by clicking on the “save icon” (floppy disk icon) on the toolbar of the Environment window (pane 4) and selecting where you wish to save the .RData file in the pop-up window.
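
To confirm that an export worked, the file can be read straight back in. The sketch below uses a small made-up data frame and a temporary file (both are assumptions for illustration), and sets row.names = FALSE so that write.csv() does not add an extra column of row names.

```r
# Small made-up data frame to export
sample_demo <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

# Export to a temporary .csv file, leaving out the row names
tmp_out <- tempfile(fileext = ".csv")
write.csv(sample_demo, tmp_out, row.names = FALSE)

# Read the file back and compare it with the original
reread_demo <- read.csv(tmp_out)
all.equal(sample_demo, reread_demo)
```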

2.2 Basic data wrangling methods for data preparation and handling in R

2.2.1 Dealing with NA (‘Not Available’) values

NA is used to represent missing (‘Not Available’) values in R, and is an important concept, particularly as R replaces missing values with NA when importing data (and represents undefined results, such as 0/0, with NaN, while dividing a non-zero number by zero gives Inf).

#import population data
Pop_data <- read.csv(file = paste0(data_path, "sim_population.csv"), 
                     header = TRUE)
#variable for gdp contains missing values, first 50 values
Pop_data$gdp[1:50]

##  [1]  8  6  2  2 10  2 10  7  9  9  3  9  9  8  3 12 NA  2  6  3 NA  4 NA  8  8  1  9 14  6 15  7 10  2  8
## [35]  2 14  5  6  7 12 12  1  3  4 15  4  3 13  7  8

#trying to sum the elements of the variable does not work as R cannot deal with
#the missing value
sum(Pop_data$gdp)

## [1] NA

If there is missing data in the variable/vector that you are trying to work with, the argument na.rm = TRUE can be added to many summary functions, which removes the missing values before the calculation is carried out.

#remove NA values for sum
sum(Pop_data$gdp, na.rm = TRUE)

## [1] 10831
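
The distinction between NA, NaN and Inf can be explored on a small made-up vector, as in the sketch below (the vector x_demo is an assumption for illustration).

```r
1 / 0    # Inf: dividing a non-zero number by zero
0 / 0    # NaN: an undefined result

# A small vector containing both a missing value and an undefined value
x_demo <- c(4, NA, 10, 0 / 0)

is.na(x_demo)    # TRUE for both NA and NaN
is.nan(x_demo)   # TRUE only for NaN

# na.rm = TRUE drops both NA and NaN before the mean is taken
mean_demo <- mean(x_demo, na.rm = TRUE)
mean_demo
```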

To identify which elements are missing from an R object, the function is.na() can be used, where the argument is chosen to be the object of interest. If you would like to know how many missing values there are, you can use the sum() function in addition to the is.na() function as shown below.

#identify the missing values in the gdp variable (first 50 values)
is.na(Pop_data$gdp[1:50])

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [18] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

#how many missing values in first 50 values?
sum(is.na(Pop_data$gdp[1:50]))

## [1] 3

Alternatively, if you would like to know which of the values are complete (not missing), the function complete.cases() can be used in the same way, returning a logical vector indicating which of the elements are complete.

#identify the complete values in the gdp variable (first 50 values)
complete.cases(Pop_data$gdp[1:50])

##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [18]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [35]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

#how many complete values in first 50 values?
sum(complete.cases(Pop_data$gdp[1:50]))

## [1] 47

Another useful function is na.omit(), which, instead of returning a logical vector identifying which values are complete or missing, returns the object with the missing values removed.

#remove the missing values
gdp_na_omit <- na.omit(Pop_data$gdp[1:50])
gdp_na_omit

##  [1]  8  6  2  2 10  2 10  7  9  9  3  9  9  8  3 12  2  6  3  4  8  8  1  9 14  6 15  7 10  2  8  2 14  5
## [35]  6  7 12 12  1  3  4 15  4  3 13  7  8
## attr(,"na.action")
## [1] 17 21 23
## attr(,"class")
## [1] "omit"

To quickly identify if there are missing values in your data frame/object, the na.fail() function can be used. With this function, if the object included as the argument contains any missing values, it will signal an error, however, if there are no missing values, the object is returned.

#check for missing values
#na.fail(Pop_data$gdp[1:50]) #returns an error
na.fail(gdp_na_omit) #returns the object, as it contains no missing values

##  [1]  8  6  2  2 10  2 10  7  9  9  3  9  9  8  3 12  2  6  3  4  8  8  1  9 14  6 15  7 10  2  8  2 14  5
## [35]  6  7 12 12  1  3  4 15  4  3 13  7  8
## attr(,"na.action")
## [1] 17 21 23
## attr(,"class")
## [1] "omit"
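
The same functions also work on whole data frames, where they operate row by row. The sketch below uses a small made-up data frame (the object df_demo and its values are assumptions for illustration).

```r
# Made-up data frame with missing values in two different columns
df_demo <- data.frame(id  = 1:4,
                      gdp = c(8, NA, 2, 5),
                      pop = c(164, 192, NA, 206))

# Number of missing values in each column
colSums(is.na(df_demo))

# TRUE for rows with no missing values in any column
complete.cases(df_demo)

# Keep only the rows that are complete
df_complete <- na.omit(df_demo)
df_complete
```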

2.2.2 Key data wrangling methods

Some key data manipulation can be done using the following functions in base, stats, dplyr and tidyr packages.

#install dplyr and tidyr packages 
install.packages("dplyr")
install.packages("tidyr")

#load the dplyr and tidyr packages
library(dplyr)
library(tidyr)


Function	Description	Package
`aggregate()`	obtain summary statistics by group	`stats`
`arrange()`	sort the rows of a data frame by one or more variables	`dplyr`
`drop_na()`	remove all rows containing NA	`tidyr`
`filter()`	Select rows based on certain conditions	`dplyr`
`gather()`	convert to long format	`tidyr`
`group_by()`	group the rows of a data frame by one or more variables	`dplyr`
`merge()`	join two datasets	`base`
`mutate()`	add new columns to a data frame based on existing columns	`dplyr`
`select()`	choose specific columns to work with	`dplyr`
`spread()`	convert to wide format	`tidyr`
`subset()`	subset data by given specifications	`base`
`summarise()`/ `summarize()`	compute summary statistics for each group created by `group_by()`	`dplyr`
`rename()`	change the names of columns in a data frame	`dplyr`

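The pipe operator, %>%, from the magrittr package (re-exported by dplyr) allows a sequence of operations to be chained, with the result of each step passed on as the first argument of the next. R (version 4.1.0 and later) also offers a built-in native pipe, |>, which works in a similar way and has slightly less overhead. The main difference between the two is that %>% does not require parentheses (brackets) when calling a function with no further arguments, whereas for |> the parentheses are always necessary. The with() function in base R is only loosely related: it evaluates an expression using the columns of a data frame, which avoids repeating the data frame name, but it does not chain operations.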

#save a copy of the dataset with rows containing NA values removed
Pop_data_pipe <- Pop_data %>% 
  drop_na() 
head(Pop_data_pipe)

##   ID pop precipitation_mean temperature_mean elevation     slope region buildings gdp
## 1  1 164           1581.950         24.00823 650.27273 0.9090909      1       250   8
## 2  2 192           1628.924         25.07551 364.75000 1.5000000      5       374   6
## 3  3 203           1072.788         25.11997 107.63636 3.5454545      2       418   2
## 4  4 206           1219.166         25.29793 340.75000 1.5000000      2       327   2
## 5  5 226           1117.308         25.17577  90.68421 6.5263158      2       443  10
## 6  6 229           1401.864         25.61566 305.71429 7.2857143      1       503   2
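
To see what a pipe does, it can help to compare it with the equivalent nested call. The sketch below uses the native pipe with base R functions only, so it assumes R version 4.1.0 or later but needs no extra packages; the object names are illustrative.

```r
# Nested call: read from the inside out
n_nested <- nrow(subset(mtcars, mpg > 25))

# Native pipe: read from left to right, one step at a time
n_piped <- mtcars |>
  subset(mpg > 25) |>
  nrow()

# Both approaches count the same rows
n_nested
n_piped
```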

Subset rows based on certain conditions.

#subset observations with a population of at least 2000 people
Pop_data %>% filter(pop>=2000)

##    ID  pop precipitation_mean temperature_mean elevation     slope region buildings gdp
## 1 914 2008           1432.101         25.85989  300.5000 1.3750000      1      3061  NA
## 2 915 2047           1469.931         25.95289  292.1111 0.4444444      1      3897  73
## 3 916 2174           1437.897         25.88929  307.3333 1.1111111      1      4440  48
## 4 917 2233           1425.850         25.84749  317.0000 2.0000000      1      4766  22
## 5 918 2329           1420.629         25.82820  308.8750 2.3750000      1      5294  NA
## 6 919 2369           1425.794         25.84525  308.5833 1.3333333      1      4923  NA

Select certain variables.

#select the variables region and pop
head(Pop_data %>% select(region, pop))

##   region pop
## 1      1 164
## 2      5 192
## 3      2 203
## 4      2 206
## 5      2 226
## 6      1 229

Group observations by variable and summarise by property.

#group the observations by region and summarise their population per region
Pop_data%>%
  group_by(region) %>%  
  summarise(regional_population = sum(pop))

## # A tibble: 5 × 2
##   region regional_population
##    <int>               <int>
## 1      1              270962
## 2      2              113832
## 3      3               90361
## 4      4               45677
## 5      5               34583
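
A grouped summary can also be obtained in base R with aggregate(), listed in the table above, or with tapply(). The sketch below applies both to mtcars so that it runs without any imported data; the object names are illustrative.

```r
# Mean miles per gallon for each number of cylinders, as a data frame
mean_mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mean_mpg_by_cyl

# The same summary as a named vector
mean_mpg_vec <- tapply(mtcars$mpg, mtcars$cyl, mean)
mean_mpg_vec
```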

Adding another variable.

#add another variable called density = pop/buildings
Pop_data <- Pop_data %>% 
  mutate(density = pop/buildings)

Renaming variables.

#rename density to Pop_density
Pop_data<- Pop_data %>%  
  rename(Pop_density = density)

Sort data into ascending or descending order.

#sort data by ascending order of ID
Pop_data_arranged_asc <- arrange(Pop_data, ID)

#sort data by descending order of ID
Pop_data_arranged_desc <- arrange(Pop_data, desc(ID))
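
The base R function merge(), listed in the table above, joins two datasets on a shared variable. The following is a minimal sketch with two made-up data frames; the names pop_demo, region_names and the region labels are assumptions for illustration.

```r
# Made-up population records, each belonging to a region
pop_demo <- data.frame(ID     = 1:4,
                       region = c(1, 2, 2, 3),
                       pop    = c(164, 203, 206, 90))

# Made-up lookup table giving a label for each region
region_names <- data.frame(region      = c(1, 2, 3),
                           region_name = c("North", "Centre", "South"))

# Join the two datasets on the shared 'region' variable
pop_named <- merge(pop_demo, region_names, by = "region")

# Sort the joined data by descending population using base R
pop_named[order(pop_named$pop, decreasing = TRUE), ]
```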

2.3 Data visualisation

Data visualisation forms an important part of exploratory data analysis, enabling a quick and easy understanding of the dataset. It also allows patterns to be detected and can be a useful tool for communicating results to non-experts.

When using base R plotting functions such as plot(), boxplot(), and hist(), the following arguments are useful for customising plots and often make them easier to understand.


Argument	Description
`main`	an overall title
`xlab`	a title for the $x$-axis
`ylab`	a title for the $y$-axis
`sub`	a sub-title
`type`	determines the type of plot e.g. “`p`” for points, “`l`” for lines
`pch`	determines type of symbol to be used for points
`col`	colour of plot
`lty`	determines the type of line e.g. “`1`” for solid, “`2`” for dashed
`lwd`	determines width of the line

Further arguments for graphical parameters can be found by going to the help file for par (?par).
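
The arguments in the table can be combined in a single call. The sketch below draws a scatter plot from mtcars and adds a fitted straight line to illustrate lty and lwd; the titles, colours and the object name cars_fit are illustrative choices.

```r
# Scatter plot of weight against fuel efficiency using the arguments above
plot(mtcars$wt, mtcars$mpg,
     main = "Fuel Efficiency by Weight",
     sub  = "Data: mtcars",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     type = "p",         # draw points
     pch  = 19,          # filled circles
     col  = "#004C92")

# Add a dashed, thicker fitted line to illustrate lty and lwd
cars_fit <- lm(mpg ~ wt, data = mtcars)
abline(cars_fit, lty = 2, lwd = 2, col = "red")
```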

The ggplot2 and ggpubr packages can be used instead of base R to create more elegant plots.

#install ggplot2 and ggpubr packages
install.packages("ggplot2")
install.packages("ggpubr")

#load the ggplot2 and ggpubr packages
library(ggplot2)
library(ggpubr)

2.3.1 Bar plots

A bar plot can be useful for discrete data.

#create a vector which has the values of total population by region
regional_population <- c()
for(r in 1:5){
  regional_population[r] <- sum(Pop_data$pop[Pop_data$region == r])
}

#plot the bar plot using base R
barplot(regional_population,
        ylab = "Population Total", 
        xlab = "Region",
        col = "#004C92",
        main = "Bar Plot of Population by Region",
        names.arg = c(1:5))
mtext(side = 3, "Using base R")

Adding the geom_bar() layer to a ggplot() call creates a bar plot. The following example shows the same plot as above, but with the depicted information easier to read.

#bar plot of population by region
barplot1 <- ggplot(Pop_data, aes(x = region, y = pop, fill = pop)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "yellow", high = "red") +
  labs(x = "Region", y = "Population Total", fill = "Total", subtitle = "Using ggplot2") +
  ggtitle("Bar Plot of Population by Region")

#use the ggpar function to customise plot 
ggpar(barplot1, xlab = "Region", ylab = "Population Total",
      legend = "right", legend.title = "Total",
      font.x = c(16),
      font.y = c(16),
      subtitle = "Using ggplot2 and ggpubr")

#import the age dataset
Age_data <- read.csv(file = paste0(data_path, "Age_data.csv"), header = TRUE)

#convert the age groups variable into a factor
Age_data$age_groups <- factor(Age_data$age_groups,
                              levels = c("0-4", "5 - 12", "13 - 18", 
                                         "19 - 24", "25 - 34", "35 - 44", 
                                         "45 - 54", "55 - 64", "65 - 74",
                                         "75 - 84", "85 - 94", "95+"))

#convert data into long format
age_long <- gather(Age_data, Sex, Pop,females : males)
age_long$ID <- paste0("ID", 1:nrow(age_long))

Descending and ascending bar plots display the data in descending or ascending order of the $y$ variable, rather than in the given order of the $x$ variable.

#making descending and ascending bar plots

#descending order
barplot_descending <- ggbarplot(age_long, x = "ID", y = "Pop",
                                fill = "Sex", 
                                color = "white",
                                palette = "jco",
                                sort.val = "desc",        
                                sort.by.groups = FALSE, 
                                x.text.angle = 90,
                                main = "Descending Bar Plot of Population by Sex",
                                subtitle = "Using ggpubr"
)

#use the ggpar function to customise plot 
barplot_descending2 <- ggpar(barplot_descending, 
                             xlab = "Subject ID", 
                             ylab = "Population Count",
                             legend = "right", legend.title = "Gender",
                             font.label = list(size = 15, face = "bold", 
                                               color = "red"),
                             font.x = c(16),
                             font.y = c(16))
barplot_descending2

#ascending order
barplot_ascending <- ggbarplot(age_long, x = "ID", y = "Pop",
                               fill = "Sex",              
                               color = "white",        
                               palette = "jco",         
                               sort.val = "asc",        
                               sort.by.groups = FALSE,  
                               x.text.angle = 90,      
                               main = "Ascending Bar Plot of Population by Sex",
                               subtitle = "Using ggpubr"
)

#use the ggpar function to customise plot 
barplot_ascending2 <- ggpar(barplot_ascending, 
                            xlab = "Subject ID", 
                            ylab = "Population Count",
                            legend = "right", legend.title = "Gender",
                            font.label = list(size = 15, face = "bold", 
                                              color = "red"),
                            font.x = c(16),
                            font.y = c(16))

barplot_ascending2

#rotate plots
ggbarplot(age_long, x = "ID", y = "Pop",
          fill = "Sex",               # change fill colour by Sex
          color = "white",            # Set bar border colours to white
          palette = "jco",            # jco journal colour palette. see ?ggpar
          sort.val = "desc",          # Sort the value in descending order
          sort.by.groups = FALSE,     # Don't sort inside each group
          x.text.angle = 90,          # Rotate vertically x axis texts
          ylab = "Population Count",
          legend.title = "Sex",
          rotate = TRUE,              # Draw the bars horizontally
          ggtheme = theme_minimal(),
          main = "Descending Bar Plot of Population by Sex",
          subtitle = "Using ggpubr"
)

2.3.2 Box plots

A box plot is a simple way to display the distribution of continuous (numerical) data, and can show the data by category (for example, separate box plots for males and females).

#plot the box plot using base R
boxplot(Pop_data$pop~Pop_data$region,
        ylab = "Population",
        xlab = "Region",
        main = "Box Plot of Population")
mtext(side = 3, "Using base R")

#plot the box plot using ggplot
ggplot(Pop_data, aes(x = region, y = pop, group = region, fill = region)) +
  geom_boxplot(fill = c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2"), 
               alpha = 0.75) +
  labs(x = "Region", y = "Population", subtitle = "Using ggplot2") +
  ggtitle("Box Plot of Population")

2.3.3 Histograms

Alternatively, a histogram can be utilised, displaying the frequency distribution of a variable.

#plot the histogram using base R, adding titles in the same way as for bar plots
hist(Pop_data$pop,
     main = "Histogram of Population", 
     xlab = "Population")
mtext(side = 3, "Using base R")

#plot the histogram using ggplot
ggplot(Pop_data, aes(x = pop)) +
  geom_histogram(bins = 10, color = "white", fill = "#004C92") +
  scale_fill_manual(values = "#004C92") +
  labs(x = "Population", y = "Frequency", subtitle = "Using ggplot2") +
  ggtitle("Histogram of Population")

The $x$-axis is split into ‘bins’ (or ‘cells’) and R counts the number of times the variable falls in each bin, with the width of each bar on the histogram changing depending on how many ‘bins’ there are. Some ‘bins’ are not visible due to being empty, for example, on the ggplot2 histogram above, there are 10 ‘bins’ but 2 are empty so only 8 are visible. Sometimes, the bars are too wide or too narrow for the results to be interpreted effectively. In these instances, you can use the argument breaks = for plotting with base R, and bins = for plotting with ggplot2. Note that in base R a single number given to breaks = is treated as a suggestion only, so the number of bars drawn may differ slightly from the number requested.

#plot the histogram using base R, changing the width of the bars.
hist(Pop_data$pop, 
     breaks = 20, 
     main = "Histogram of Population", 
     xlab = "Population")
mtext(side = 3, "Using base R")

#plot the histogram using ggplot
ggplot(Pop_data, aes(x = pop)) +
  geom_histogram(bins = 20, color = "white", fill = "#004C92") +
  scale_fill_manual(values = "#004C92") +
  labs(x = "Population", y = "Frequency", subtitle = "Using ggplot2") +
  ggtitle("Histogram of Population")
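
As a small sketch of how breaks = behaves in base R, the example below uses the built-in faithful data set rather than Pop_data (an assumption made only so that the code runs on its own). It compares a single suggested number of bins with an explicit vector of break points, which R uses exactly as given.

```r
#use the built-in faithful data set so the example is self-contained
x <- faithful$waiting

#a single number is only a suggestion: R chooses "pretty" break points near it
h_suggested <- hist(x, breaks = 20, plot = FALSE)
length(h_suggested$counts)   #number of bins actually used

#a vector of break points is used exactly as given
#seq(40, 100, by = 3) has 21 break points, which define 20 bins
exact_breaks <- seq(40, 100, by = 3)
h_exact <- hist(x, breaks = exact_breaks, plot = FALSE)
length(h_exact$counts)       #number of bins is now fixed at 20
```

Setting plot = FALSE returns the bin counts without drawing anything, which is a convenient way to check how many bins a histogram will have before plotting it.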

Histograms do not have to display the frequency of the variable of interest. Adding the argument freq = FALSE to the hist() function plots the density instead, so that the total area of the bars equals 1.

#plot the histogram using base R for density
hist(Pop_data$pop, 
     freq = FALSE,
     breaks = 20, 
     main = "Histogram of Population", 
     xlab = "Population")
mtext(side = 3, "Using base R")
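
To make the density scale more concrete, the sketch below recomputes the bar heights by hand and checks that the bar areas sum to 1. It uses the built-in faithful data set instead of Pop_data (an assumption, so that the code is self-contained). The final part shows one possible ggplot2 equivalent using after_stat(density), which assumes ggplot2 version 3.3.0 or later is installed.

```r
#built-in data so the example runs on its own
x <- faithful$waiting

#compute the histogram without drawing it
h <- hist(x, breaks = 20, plot = FALSE)

#density = count / (total count * bin width)
bin_width <- diff(h$breaks)
manual_density <- h$counts / (sum(h$counts) * bin_width)
all.equal(manual_density, h$density)

#the areas of the bars (height * width) sum to 1
sum(h$density * bin_width)

#one possible ggplot2 equivalent of freq = FALSE (only run if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_density <- ggplot(faithful, aes(x = waiting)) +
    geom_histogram(aes(y = after_stat(density)), bins = 20,
                   color = "white", fill = "#004C92") +
    labs(x = "Waiting time", y = "Density")
  print(p_density)
}
```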

#import height dataset
Height_data <- read.csv(file = paste0(data_path, "Height_data.csv"),
                        header = TRUE)
# Make histogram plot 
hist1 <- gghistogram(Height_data, x = "Height",                  
                     add = "mean", rug = TRUE,                  
                     color = "Sex", fill = "Sex",                  
                     palette = c("#FF0000", "#FFC600"),
                     bins = 15,
                     xlab = "Height (cm)",
                     ylab = "Frequency",
                     main = "Histogram of Height by Sex",
                     subtitle = "Using ggpubr") 
hist1

2.3.4 Density plots

Density plots are smoothed histograms. In base R, the function density() computes a kernel density estimate of the variable of interest, and the result can be passed straight to the plot() function.

#plot the density plot using base R
plot(density(Pop_data$pop), 
     main = "Density plot of Population",
     xlab = "Population",
     ylab = "Density")
mtext(side = 3, "Using base R")

#plot the density plot using ggplot
ggplot(Pop_data, aes(x = pop)) +
  geom_density(fill = "#004C92", alpha = 0.2, color = "#004C92") +
  scale_color_manual(values = "#004C92") +
  labs(x = "Population", y = "Density", subtitle = "Using ggplot2") +
  ggtitle("Density Plot of Population")
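
How smooth a density plot looks depends on the bandwidth of the estimate. The sketch below illustrates this with the adjust argument of density(), which multiplies the default bandwidth. The built-in faithful data set is used in place of Pop_data, and the adjust values are arbitrary choices for illustration.

```r
#built-in data so the example runs on its own
x <- faithful$waiting

d_default <- density(x)                 #default bandwidth
d_smooth  <- density(x, adjust = 2)     #twice the default bandwidth: smoother curve
d_rough   <- density(x, adjust = 0.5)   #half the default bandwidth: more detail

#compare the bandwidths that were actually used
c(default = d_default$bw, smooth = d_smooth$bw, rough = d_rough$bw)

#overlay the three estimates on one plot
plot(d_default, 
     main = "Effect of adjust on a Density Estimate",
     xlab = "Waiting time")
lines(d_smooth, lty = 2)
lines(d_rough, lty = 3)
legend("topleft", 
       legend = c("adjust = 1", "adjust = 2", "adjust = 0.5"), 
       lty = 1:3)
```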

The function ggdensity() from ggpubr can also be used to plot the density of a variable, including the density split by another variable, for example the density of height by ‘Sex’.

#make density plot 
density1 <- ggdensity(Height_data, x = "Height", 
                      add = "mean", rug = TRUE, 
                      color = "Sex", 
                      fill = "Sex",          
                      palette = c("#FF0000", "#FFC600"),
                      xlab = "Height (cm)",
                      ylab = "Density",
                      main = "Density Plot of Height by Sex",
                      subtitle = "Using ggpubr")
density1

2.3.5 Scatter plots

Scatter plots are useful for showing bivariate data, that is, for investigating the relationship between two continuous variables.

A line of best fit can be added to the scatter plot to help visualise the relationship. In base R, use the function lm() to model the relationship and abline() to add the fitted line to the plot. Alternatively, when plotting with ggplot2, use the function geom_smooth() with the argument method = "lm".

Modelling the relationship between variables is discussed in more detail in Module 4.

#scatter plot of temperature and rainfall
plot(Pop_data$temperature_mean, Pop_data$precipitation_mean,
     xlab = "Temperature",
     ylab = "Rainfall",
     main = "Scatter Plot of Temperature and Rainfall",
     pch = 20)
mtext(side = 3, "Using base R")
abline(lm(precipitation_mean ~ temperature_mean, data = Pop_data))

#plot the scatter plot and fit a line using ggplot
ggplot(Pop_data, aes(x = temperature_mean, y = precipitation_mean)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Temperature", y = "Rainfall", subtitle = "Using ggplot2") +
  ggtitle("Scatter Plot of Temperature and Rainfall")

## `geom_smooth()` using formula = 'y ~ x'

#points with regression line
ggplot(data = Pop_data, aes(x = elevation,  y = precipitation_mean))+
  geom_point(col = "red")+
  geom_smooth(method = "lm")+
  theme_bw()+
  labs(x = "Elevation", y = "Mean Precipitation", 
       subtitle = "Using ggplot2")+
  ggtitle("Scatter Plot of Elevation and Mean Precipitation")

## `geom_smooth()` using formula = 'y ~ x'
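
To show what abline(lm(...)) actually draws, the sketch below fits a straight line to the built-in faithful data set (used here as an assumption in place of Pop_data) and inspects the intercept and slope. It also checks a general property of a least squares line with an intercept: it passes through the point defined by the means of the two variables.

```r
#fit a straight line: waiting time as a function of eruption length
fit <- lm(waiting ~ eruptions, data = faithful)

#intercept and slope of the line of best fit
coef(fit)

#scatter plot with the fitted line added
plot(faithful$eruptions, faithful$waiting,
     xlab = "Eruption length",
     ylab = "Waiting time",
     main = "Scatter Plot with Line of Best Fit",
     pch = 20)
abline(fit, col = "red")

#the fitted value at the mean of x equals the mean of y
pred_at_mean <- predict(fit, 
                        newdata = data.frame(eruptions = mean(faithful$eruptions)))
pred_at_mean
mean(faithful$waiting)
```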

2.3.6 Violin plots

Violin plots are similar to box plots but also show the estimated probability density of the data at different values, which makes them useful for comparing distributions between groups.

#violin plot (fill_palette() comes from ggpubr)
ggplot(data = Height_data, aes(x = Sex,  y = Height))+  
  geom_violin(aes(fill =  Sex), alpha = 0.75)+  
  fill_palette(c("#FF0000", "#FFC600")) +  
  labs(y = "Height (cm)", subtitle = "Using ggplot2")+
  ggtitle("Violin Plot of Height by Sex")+
  theme_bw()

2.3.7 Line graphs

#points connected with a line; geom_line() joins the points in order of the x variable
ggplot(data = Pop_data, aes(x = elevation, y = precipitation_mean))+
  geom_point(col = "red")+
  geom_line()+
  theme(axis.text.x = element_text(size = 15),
        axis.text.y = element_text(size = 15),
        axis.line = element_line(color = "black"))+
  labs(x = "Elevation", y = "Mean Precipitation", subtitle = "Using ggplot2")+
  ggtitle("Line Graph of Elevation and Mean Precipitation")

2.3.8 Population pyramids

Population pyramids are an effective plot for comparing the structures of different populations, for example, age structure by gender.

#create population pyramid for age by sex
pyramid1 <- ggplot(Age_data) +
  #map fill to the group name so that the legend labels match the bars
  geom_col(aes(x = age_groups, y = females, fill = "Female")) +
  geom_col(aes(x = age_groups, y = -males, fill = "Male")) + 
  scale_y_continuous(breaks = seq(-200000, 200000, 50000), 
                     labels = paste0(as.character(c(seq(200, 0, -50), 
                                                    seq(50, 200, 50))))) + 
  coord_flip() + 
  scale_fill_discrete(name = "Gender")+
  labs(x = "Age groups", y = "Population (in 1000s)", subtitle = "Using ggplot2")+
  ggtitle("Population Pyramid for Age by Sex")+
  theme_minimal()

#use the ggpar function to customise plot 
pyramid2 <- ggpar(pyramid1, legend = "right", legend.title = "Gender",
                  font.label = list(size = 16, face = "bold", color ="red"),
                  font.x = c(16),
                  font.y = c(16),
                  subtitle = "Using ggplot2 and ggpubr",
                  xtickslab.rt = 45, ytickslab.rt = 45)

pyramid2
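
The key steps in a population pyramid are plotting one group as negative values and then relabelling the axis so that both sides read as positive counts. The sketch below shows this on a small data frame, toy_age, whose counts are hypothetical and invented purely for illustration. The ggplot2 part is only run if ggplot2 is installed.

```r
#hypothetical counts, invented purely for illustration
toy_age <- data.frame(
  age_groups = factor(c("0-14", "15-64", "65+"), 
                      levels = c("0-14", "15-64", "65+")),
  females = c(120000, 180000, 60000),
  males   = c(125000, 175000, 45000)
)

#males are plotted as negative values so their bars extend to the left
y_breaks <- seq(-200000, 200000, 50000)

#abs() removes the minus sign; dividing by 1000 gives labels in 1000s
y_labels <- abs(y_breaks) / 1000
y_labels

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  toy_pyramid <- ggplot(toy_age) +
    geom_col(aes(x = age_groups, y = females, fill = "Female")) +
    geom_col(aes(x = age_groups, y = -males, fill = "Male")) +
    scale_y_continuous(breaks = y_breaks, labels = y_labels) +
    coord_flip() +
    labs(x = "Age groups", y = "Population (in 1000s)", fill = "Gender")
  print(toy_pyramid)
}
```

Computing the labels from the breaks with abs() keeps the two vectors the same length, which avoids having to type the label sequence by hand.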

2.3.9 Lollipop plots

A lollipop plot is a variant of the bar plot in which each bar is replaced by a line (segment) ending in a dot.

#create lollipop plot for population
lollipop1 <- ggdotchart(age_long, x = "ID", y = "Pop",
                        color = "age_groups",             
                        #palette = c("#00AFBB", "#E7B800"),
                        sorting = "ascending",            
                        add = "segments",                
                        dot.size = 10,                      #large dot size
                        label = round(age_long$Pop/10000),   
                        font.label = list(color = "white", size = 12,
                                          vjust = 0.5),     
                        ggtheme = theme_pubr(),             
                        subtitle = "Using ggpubr",
                        main = "Lollipop Plot for Population"
)

#adjust lollipop plot with ggpar
lollipop2 <- ggpar(lollipop1, xlab = "Subject ID", ylab = "Population Count",
                   legend = "right", legend.title = "Age groups",
                   font.label = list(size = 15, face = "bold", color ="red"),
                   font.x = c(16),
                   font.y = c(16),
                   xtickslab.rt = 45, ytickslab.rt = 45)

#change size of legend points and axis text to fit in plot
lollipop2 + guides(color = guide_legend(override.aes = list(size = 1.5))) +
  theme(axis.text = element_text(size = 6))

2.3.10 Plots with an image background

#create a plot with an image background
img1 <- png::readPNG(paste0(pic_path,"worldpoplogo.png"))
ggplot(Pop_data, aes(x = pop)) +
  background_image(img1)+
  geom_histogram(bins = 20, color = "white", fill = "black", alpha = 0.75) +
  scale_fill_manual(values = "black") +
  labs(x = "Population", y = "Frequency", subtitle = "Using ggplot2") +
  ggtitle("Histogram of Population")

2.3.11 Viewing and saving plots

You can view current and previous plots in pane 3 of RStudio by navigating to the Plots tab and using the arrows to flip between plots.

Figure 2.5: View plots

You can also view multiple plots in one window/page by using the function par() with the argument mfrow = c(rows, columns), which divides the plot window into a grid of rows and columns.

#have a window with 4 plots, 2 in each row and 2 in each column
par(mfrow = c(2,2))

boxplot(Pop_data$pop~Pop_data$region,
        ylab = "Population",
        xlab = "Region",
        main = "Box Plot of Population by Region")

hist(Pop_data$pop, 
     breaks = 20, 
     main = "Histogram of Population", 
     xlab = "Population")

barplot(regional_population, 
        ylab = "Population Total", 
        xlab = "Region",
        col = "#004C92",
        main = "Bar Plot of Population by Region",
        names.arg = c(1:5))

plot(Pop_data$temperature_mean, Pop_data$precipitation_mean,
     xlab = "Temperature",
     ylab = "Rainfall",
     main = "Scatter Plot of Temperature and Rainfall",
     pch = 1)

The function par() can also change the size of the plot margins through the argument mar = c(bottom, left, top, right), a numeric vector giving the margin size, in lines of text, for each side of the plot.

#adjust margins of plots
par(mfrow = c(2,2), mar = c(2,3,3,2))

boxplot(Pop_data$pop~Pop_data$region,
        ylab = "Population",
        xlab = "Region",
        main = "Box Plot of Population by Region")

hist(Pop_data$pop, 
     breaks = 20, 
     main = "Histogram of Population", 
     xlab = "Population")

barplot(regional_population, 
        ylab = "Population Total", 
        xlab = "Region",
        col = "#004C92",
        main = "Bar Plot of Population by Region",
        names.arg = c(1:5))

plot(Pop_data$temperature_mean, Pop_data$precipitation_mean,
     xlab = "Temperature",
     ylab = "Rainfall",
     main = "Scatter Plot of Temperature and Rainfall",
     pch = 1)

Similarly, the function ggarrange from ggpubr can also be used to display multiple plots on one window/page.

#arrange multiple plots on one page
ggarrange(hist1, density1 + rremove("x.text"), 
          labels = c("A", "B"),
          ncol = 1, nrow = 2)

To return the plot window to its default margins and number of plots per window, use the function dev.off() with no arguments. This closes the current graphics device, so the next plot is drawn in a new window with the default settings.
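
An alternative that does not close the plot window is to store the previous settings when calling par() and restore them afterwards. The sketch below assumes the built-in faithful data set in place of Pop_data; the name old_par is an arbitrary choice.

```r
#par() returns the previous values of the settings it changes
old_par <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

hist(faithful$waiting, 
     main = "Waiting Time", 
     xlab = "Minutes")
hist(faithful$eruptions, 
     main = "Eruption Length", 
     xlab = "Minutes")

#restore the previous layout and margins without closing the plot window
par(old_par)
par("mfrow")
```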

To save plots as either an image or a PDF, in pane 3 follow Plots > Export > Save as [Image or PDF].
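
Plots can also be saved from code, which is useful when a script needs to be reproducible. The following is a minimal sketch rather than the only approach: it writes to a temporary folder (a hypothetical location chosen so the example runs anywhere) and uses the built-in faithful data set. Base R plots are saved by opening a file device such as pdf(), drawing the plot and closing the device with dev.off(); ggplot2 plots can be saved with ggsave(), assuming ggplot2 is installed.

```r
#hypothetical output location; replace tempdir() with a real folder in practice
out_file <- file.path(tempdir(), "histogram_waiting.pdf")

#open a PDF device, draw the plot, then close the device to write the file
pdf(out_file, width = 7, height = 5)
hist(faithful$waiting, 
     main = "Histogram of Waiting Time", 
     xlab = "Minutes")
dev.off()

#check that the file was created
file.exists(out_file)

#saving a ggplot2 plot (only run if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(faithful, aes(x = waiting)) +
    geom_histogram(bins = 20, color = "white", fill = "#004C92")
  gg_file <- file.path(tempdir(), "histogram_waiting_gg.pdf")
  ggsave(gg_file, plot = p, width = 7, height = 5)
}
```

Other file devices, such as png(), follow the same open, draw, close pattern.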

Figure 2.6: Save plots

2.4 End of module exercises

1. How many cars have a miles per gallon (mpg) of less than 15 in the mtcars data?

2. How many cars have exactly 4 cylinders (cyl) in the mtcars data?

3. What is the mean value of horsepower (hp) to 2 decimal places in the mtcars data?

4. What car has the lowest miles per gallon (mpg) in the mtcars data?

5. What is the median miles per gallon (mpg) value for cars with 8 cylinders (cyl) in the mtcars data?

6. What car has the highest weight (wt) for each number of cylinders (cyl) in the mtcars data?

7. Create a bar plot that shows the number of cars with each gear type (gear) in the mtcars data.

8. Create a bar plot that shows the number of cars with each gear type (gear) with the distribution of cylinders (cyl) for each type in the mtcars data. Add a legend for the cylinders.

9. Create a scatter plot to show the relationship between weight (wt) and miles per gallon (mpg) in the mtcars data.

10. Create a scatter plot to show the relationship between weight (wt) and miles per gallon (mpg) by cylinder (cyl) in the mtcars data.

2.5 Useful resources

R packages: datacamp
Importing data: intro2r
Basic data wrangling: Data Science in R: A Gentle Introduction
Data visualisation: R for Data Science
Data visualisation: The R Graph Gallery

#drive path for windows
data_path <- "data/"