An INTRODUCTION to R for SPATIAL ANALYSIS & MAPPING
In the digital age, social and environmental scientists have more spatial data at
their fingertips than ever before. But how do we capture this data, analyse and
display it, and, most importantly, how can it be used to study the world?
Spatial Analytics and GIS is a series of books that deal with potentially tricky tech-
nical content in a way that is accessible, usable and useful. Early titles include Urban
Analytics by Alex Singleton, Seth Spielman and David Folch, and An Introduction
to R for Spatial Analysis and Mapping by Chris Brunsdon and Lex Comber.
Series Editor: Richard Harris
About the Series Editor
Richard Harris is Professor of Quantitative Social Geography at the School of
Geographical Sciences, University of Bristol. He is the lead author on three text-
books about quantitative methods in geography and related disciplines, including
Quantitative Geography: The Basics (Sage, 2016).
Richard’s interests are in the geographies of education and the education of geog-
raphers. He is currently Director of the University of Bristol Q-Step Centre, part of
a multimillion-pound UK initiative to raise quantitative skills training among
social science students, and is working with the Royal Geographical Society (with
IBG) to support data skills in schools.
Books in this Series:
Geocomputation, Chris Brunsdon and Alex Singleton
Agent-Based Modelling and Geographical Information Systems,
Andrew Crooks, Nicolas Malleson, Ed Manley and Alison Heppenstall
Modelling Environmental Change, Colin Robertson
An Introduction to Big Data and Spatial Data Analytics in R,
Lex Comber and Chris Brunsdon
Published in Association with this Series:
Quantitative Geography, Richard Harris
An INTRODUCTION to R for SPATIAL ANALYSIS & MAPPING
CHRIS BRUNSDON and LEX COMBER
SECOND EDITION
SAGE Publications Ltd
1 Oliver’s Yard
55 City Road
London EC1Y 1SP
SAGE Publications Inc.
2455 Teller Road
Thousand Oaks, California 91320
SAGE Publications India Pvt Ltd
B 1/I 1 Mohan Cooperative Industrial Area
Mathura Road
New Delhi 110 044
SAGE Publications Asia-Pacific Pte Ltd
3 Church Street
#10-04 Samsung Hub
Singapore 049483
Editor: Robert Rojek
Assistant editor: John Nightingale
Production editor: Katherine Haw
Copyeditor: Richard Leigh
Proofreader: Neville Hankins
Indexer: Martin Hargreaves
Marketing manager: Susheel Gokarakonda
Cover design: Francis Kenney
Typeset by: C&M Digitals (P) Ltd, Chennai, India
Printed in the UK
© Chris Brunsdon and Lex Comber 2019
First edition published 2015. Reprinted 2015 (twice), 2016
(twice) and 2017 (twice)
This edition first published 2019
Apart from any fair dealing for the purposes of research
or private study, or criticism or review, as permitted under
the Copyright, Designs and Patents Act, 1988, this
publication may be reproduced, stored or transmitted in
any form, or by any means, only with the prior permission
in writing of the publishers, or in the case of reprographic
reproduction, in accordance with the terms of licences
issued by the Copyright Licensing Agency. Enquiries
concerning reproduction outside those terms should be
sent to the publishers.
Library of Congress Control Number: 2018943836
British Library Cataloguing in Publication data
A catalogue record for this book is available from
the British Library
ISBN 978-1-5264-2849-3
ISBN 978-1-5264-2850-9 (pbk)
At SAGE we take sustainability seriously. Most of our products are printed in the UK using responsibly sourced
papers and boards. When we print overseas we ensure sustainable papers are used as measured by the PREPS
grading system. We undertake an annual audit to monitor our sustainability.
PRAISE FOR AN INTRODUCTION TO R FOR SPATIAL ANALYSIS AND MAPPING 2E
‘There’s no better text for showing students and data analysts how to use R for
spatial analysis, mapping and reproducible research. If you want to learn how to
make sense of geographic data and would like the tools to do it, this is your guide.’
Richard Harris, University of Bristol
‘The future of GIS is open-source! An Introduction to R for Spatial Analysis and
Mapping is an ideal introduction to spatial data analysis and mapping using the
powerful open-source language R. Assuming no prior knowledge, Brunsdon and
Comber get the reader up to speed quickly with clear writing, excellent pedagogic
material and a keen sense of geographic applications. The second edition is timely
and fresh. This book should be required reading for every Geography and GIS
student, as well as faculty and professionals.’
Harvey Miller, The Ohio State University
‘While there are many books that provide an introduction to R, this is one of the
few that provides both a general and an application-specific (spatial analysis)
introduction and is therefore far more useful and accessible. Written by two
experts in the field, it covers both the theory and practice of spatial statistical
analysis and will be an important addition to the bookshelves of researchers whose
spatial analysis needs have outgrown currently available GIS software.’
Jennifer Miller, University of Texas at Austin
‘Students and other life-long learners need flexible skills to add value to spatial
data. This comprehensive, accessible and thoughtful book unlocks the spatial data
value chain. It provides an essential guide to the R spatial analysis ecosystem. This
excellent state-of-the-art treatment will be widely used in student classes, continu-
ing professional development and self-tuition.’
Paul Longley, University College London
‘In this second edition, the authors have once again captured the state of the art in
one of the most widely used approaches to spatial analysis. Spanning from the
absolute beginner to more advanced concepts and underpinned by a strong “learn
by doing” ethos, this book is ideally suited for both students and teachers of spatial
analysis using R.’
Jonny Huck, The University of Manchester
‘A timely update to the de facto reference and textbook for anyone – geographer, planner, or (geo)data scientist – needing to undertake mapping and spatial analysis in R. Complete with self-tests and valuable insights into the transition from sp
to sf, this book will help you to develop your ability to write flexible, powerful, and
fast geospatial code in R.’
Jonathan Reades, King’s College London
‘Brunsdon and Comber’s 2nd edition of their acclaimed text book is updated with
the key developments in spatial analysis and mapping in R and maintains the
pedagogic style that made the original volume such an indispensable resource for
teaching and research.’
Scott Orford, Cardiff University
CONTENTS
About the authors x
1 INTRODUCTION 1
1.1 Introduction to the Second Edition 1
1.2 Objectives of This Book 2
1.3 Spatial Data Analysis in R 3
1.4 Chapters and Learning Arcs 4
1.5 Specific Changes to the Second Edition 5
1.6 The R Project for Statistical Computing 7
1.7 Obtaining and Running the R Software 7
1.8 The R Interface 10
1.9 Other Resources and Accompanying Website 11
References 12
2 DATA AND PLOTS 13
2.1 Introduction 13
2.2 The Basic Ingredients of R: Variables and Assignment 14
2.3 Data Types and Data Classes 16
2.4 Plots 34
2.5 Another Plot Option: ggplot 43
2.6 Reading, Writing, Loading and Saving Data 50
2.7 Answers to Self-Test Questions 52
Reference 54
3 BASICS OF HANDLING SPATIAL DATA IN R 55
3.1 Overview 55
3.2 Introduction to sp and sf: The sf Revolution 57
3.3 Reading and Writing Spatial Data 63
3.4 Mapping: An Introduction to tmap 66
3.5 Mapping Spatial Data Attributes 81
3.6 Simple Descriptive Statistical Analyses 98
3.7 Self-Test Questions 107
3.8 Answers to Self-Test Questions 110
References 117
4 SCRIPTING AND WRITING FUNCTIONS IN R 118
4.1 Overview 118
4.2 Introduction 119
4.3 Building Blocks for Programs
Tibbles seek to be lazy by not changing variable names or types, and by not doing partial matching. And they are surly because they complain more. This forces cleaner coding by identifying problems earlier in the data analysis cycle.
Finally, the print method for tibble returns the first 10 records by default,
whereas for data.frame the head() function is frequently used to display just
the first 6 records. The tibble class also includes a description of the class of each
field (column) when it is printed.
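For reference, the conversion functions and cbind() examples below assume that a data frame df and a tibble tb (with dist and city fields) already exist from earlier in the chapter; a minimal reconstruction, assuming the tibble package is installed, is:
library(tibble)
df <- data.frame(dist = seq(0, 400, 100),
                 city = c("Leeds", "Nottingham", "Leicester",
                          "Durham", "Newcastle"))
tb <- as_tibble(df)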
It is possible to convert between tibbles and data frames using the following
functions:
data.frame(tb)
as_tibble(df)
The following functions work with both tibbles and data frames:
names()
colnames()
rownames()
length() # length of the underlying list
ncol()
nrow()
They can be subsetted in the same way as a matrix, using the [row, col-
umn] notation as above, and they can both be combined using cbind() and
rbind().
cbind(df, Pop = c(700,250,230,150,1200))
dist city Pop
1 0 Leeds 700
2 100 Nottingham 250
3 200 Leicester 230
4 300 Durham 150
5 400 Newcastle 1200
cbind(tb, Pop = c(700,250,230,150,1200))
dist city Pop
1 0 Leeds 700
2 100 Nottingham 250
3 200 Leicester 230
4 300 Durham 150
5 400 Newcastle 1200
You could explore the tibble vignette by entering:
vignette("tibble")
2.3.3 Self-Test Questions
In the next pages there are a number of self-test questions. In contrast to the previ-
ous sections where the code is provided in the text for you to work through (i.e.
you enter and run it yourself), the self-test questions are tasks for you to complete,
mostly requiring you to write R code. Answers to them are provided in Section 2.7.
The self-test questions relate to the main data types that have been introduced:
factors, matrices, lists (named and unnamed) and classes.
2.3.3.1 Factors
Recall from the descriptions above that factors are used to represent categorical
data – where a small number of categories are used to represent some characteris-
tic in a variable. For example, the colour of a particular model of car sold by a
showroom in a week can be represented using factors:
colours <- factor(c("red","blue","red","white",
"silver","red","white","silver",
"red","red","white","silver","silver"),
levels=c("red","blue","white","silver","black"))
Since the only colours this car comes in are red, blue, white, silver and black, these
are the only levels in the factor.
Self-Test Question 1. Suppose you were to enter:
colours[4] <- "orange"
colours
What would you expect to happen? Why?
Next, use the table function to see how many of each colour were sold. First
reassign the colours (as you may have altered this variable in the previous self-test
question):
colours <- factor(c("red","blue","red","white",
"silver","red","white","silver",
"red","red","white","silver","silver"),
levels=c("red","blue","white","silver","black"))
table(colours)
colours
red blue white silver black
5 1 3 4 0
Note that the result of the table function is just a standard vector, but that each
of its elements is named – the names in this case are the levels in the factor. Now
suppose you had simply recorded the colours as a character variable, in colours2
as below, and then computed the table:
colours2 <-c("red","blue","red","white",
"silver","red","white","silver",
"red","red","white","silver")
# Now, make the table
table(colours2)
colours2
blue red silver white
1 5 3 3
Self-Test Question 2. What two differences do you notice between the results of the
two table expressions?
Now suppose we also record the type of car – it comes in saloon, convertible and
hatchback. This can be specified by another factor variable called car.type:
car.type <- factor(c("saloon","saloon","hatchback",
"saloon","convertible","hatchback","convertible",
"saloon","hatchback","saloon","saloon",
"saloon","hatchback"),
levels=c("saloon","hatchback","convertible"))
The table function can also work with two arguments:
table(car.type, colours)
colours
car.type red blue white silver black
saloon 2 1 2 2 0
hatchback 3 0 0 1 0
convertible 0 0 1 1 0
This gives a two-way table of counts – that is, counts of red hatchbacks, silver
saloons and so on. Note that the output this time is a matrix. For now enter the
code below to save the table into a variable called crosstab to be used later on:
crosstab <- table(car.type,colours)
Self-Test Question 3. What is the difference between table(car.type,
colours) and table(colours,car.type)?
Finally in this section, ordered factors will be considered. Suppose a third
variable about the cars is the engine size, and that the three sizes are 1.1 litre,
1.3 litre and 1.6 litre. Again, this is stored in a variable, but this time the sizes are
ordered. Enter:
engine <- ordered(c("1.1litre","1.3litre","1.1litre",
"1.3litre","1.6litre","1.3litre","1.6litre",
"1.1litre","1.3litre","1.1litre", "1.1litre",
"1.3litre","1.3litre"),
levels=c("1.1litre","1.3litre","1.6litre"))
Recall that with ordered variables, it is possible to use comparison operators >
(greater than), < (less than), >= (greater than or equal to) and <= (less than or equal
to). For example:
engine > "1.1litre"
[1] FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
[10] FALSE FALSE TRUE TRUE
Self-Test Question 4. Using the engine, car.type and colours variables,
write expressions to give the following:
● The colours of all cars with engines with capacity greater than 1.1 litres.
● The counts of types (hatchback etc.) of all cars with capacity below 1.6 litres.
● The counts of colours of all hatchbacks with capacity greater than or
equal to 1.3 litre.
2.3.3.2 Matrices
In the previous section you created a matrix called crosstab. A number of func-
tions can be applied to matrices:
dim(crosstab) # Matrix dimensions
[1] 3 5
rowSums(crosstab) # Row sums
saloon hatchback convertible
7 4 2
colnames(crosstab) # Column names
[1] "red" "blue" "white" "silver" "black"
Another important tool for matrices is the apply function. To recap, this applies a
function to either the rows or columns of a matrix, giving a single-dimensional list
as a result. A simple example finds the largest value in each row:
apply(crosstab,1,max)
saloon hatchback convertible
2 3 1
In this case, the function max is applied to each row of crosstab. The 1 as the
second argument specifies that the function will be applied row by row. If it were 2
then the function would be column by column:
apply(crosstab,2,max)
red blue white silver black
3 1 2 2 0
A useful function is which.max. Given a list of numbers, it returns the index of the
largest one. For example:
example <- c(1.4,2.6,1.1,1.5,1.2)
which.max(example)
[1] 2
In this case, the second element is the largest.
Self-Test Question 5. What happens if there is more than one number taking the
largest value in a list? Use either the help facility or experimentation to find out.
Self-Test Question 6. The function which.max can be used in conjunction with apply.
Write an expression to find the index of the largest value in each row of crosstab.
The function levels returns the levels of a variable of type factor in
character form. For example:
levels(engine)
[1] "1.1litre" "1.3litre" "1.6litre"
The order they are returned in is the one specified in the original factor assign-
ment and the same order as row or column names produced by the table func-
tion. This means that levels can be used in conjunction with which.max when
applied to matrices to obtain the row or column names instead of an index number:
levels(colours)[which.max(crosstab[,1])]
[1] "blue"
Alternatively, the same effect can be achieved by the following:
colnames(crosstab)[which.max(crosstab[,1])]
[1] "blue"
You should unpick these last two lines of code to make sure you understand what
each element is doing.
colnames(crosstab)
[1] "red" "blue" "white" "silver" "black"
crosstab[,1]
saloon hatchback convertible
2 3 0
which.max(crosstab[,1])
hatchback
2
More generally, a function could be written to apply this operation to any variable
with names:
# Defines the function
which.max.name <- function(x) {
return(names(x)[which.max(x)])}
# Next, give the variable 'example' names for the values
names(example) <- c("Bradford","Leeds","York",
"Harrogate","Thirsk")
example
Bradford Leeds York Harrogate Thirsk
1.4 2.6 1.1 1.5 1.2
which.max.name(example)
[1] "Leeds"
Self-Test Question 7. The function which.max.name could be applied (using
apply) to a table or matrix to find the name of the row or column with the largest
value. If the crosstab table is considered a table of car sales, write an apply
expression to determine the best-selling colour for each car type and the best-
selling car type in each colour.
Note that in the last code snippet, a function was defined called which.
max.name. You have been using functions, but these have all been existing ones
as defined in R until now. Functions will be thoroughly dealt with in Chapter 4,
but you should note two things about them at this point. First is the form:
function name <- function(function inputs) {
  variable <- function actions
  return(variable)
}
Second are the syntactic elements of the curly brackets { } that bound the code,
and the return() function that defines the value to be returned.
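As an illustrative sketch (not code from the text), a simple function following this form might be:
# a function that returns the cube of its input
cube <- function(x) {
  result <- x * x * x
  return(result)
}
cube(3)
[1] 27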
2.3.3.3 Lists
From the text in this chapter, recall that lists can be named and unnamed. Here we
will only consider the named kind. Lists may be created by the list function in
the form:
var <- list(name1=value1, name2=value2, …)
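For example, a small illustrative sketch (the names here are hypothetical, not from the text):
sales.info <- list(week = 27, colours = c("red", "blue"), total = 13)
# named elements are accessed with the $ operator
sales.info$total
[1] 13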
Self-Test Question 8. Suppose you wanted to store both the row- and column-wise
apply results (from Question 7) in a list called most.popular with two named
elements called colour (containing the most popular colour for each car type) and
type (containing the most popular car type for each colour). Write an R expression
that assigns the best-selling colour and car types to a list.
2.3.3.4 Classes
The objective of this task is to create a class based on the list created in the previous
section. The class will consist of a list of most popular colours and car types,
together with a third element containing the total number of cars sold (called
total). Call this class sales.data. A function to create a variable of this class,
given colours and car.type, is as follows:
new.sales.data <- function(colours, car.type) {
xtab <- table(car.type,colours)
result <- list(colour=apply(xtab,1,which.max.name),
type=apply(xtab,2,which.max.name),
total=sum(xtab))
class(result) <- "sales.data"
return(result)}
This can be used to create a sales.data object which has the colours and
car.type variables assigned to it via the function:
this.week <- new.sales.data(colours,car.type)
this.week
$colour
saloon hatchback convertible
"red" "red" "white"
$type
red blue white silver black
"hatchback" "saloon" "saloon" "saloon" "saloon"
$total
[1] 13
attr(,"class")
[1] "sales.data"
In the above code, a new variable called this.week, of class sales.data, is
created. Following the ideas set out in the previous section, it is now possible to
create a print function for variables of class sales.data. This can be done by
writing a function called print.sales.data that takes an input or argument of
the sales.data class.
Self-Test Question 9. Write a print function for variables of class sales.data.
This is a difficult problem and should be tackled by those with previous program-
ming experience. Others can try this now but should return to it after the functions
have been formally introduced in Chapter 4.
2.4 PLOTS
There are a number of plot routines and packages in R. In this section some basic
plot types will be introduced, followed by some more advanced plotting com-
mands and functions. The aim of this section is to give you an understanding of how
the basic plot types can be used as building blocks in more advanced plotting
routines that are called in later chapters to display the results of spatial analysis.
2.4.1 Basic Plot Tools
The most basic plot is the scatter plot. Figure 2.1 was created from the function
rnorm which generates a set of random numbers. Note that each running of the
code will generate a slightly different plot as different random numbers are
generated.
x1 <- rnorm(100)
y1 <- rnorm(100)
plot(x1,y1)
The generic plot function creates a graph of the two variables, plotting them on
the x-axis and the y-axis. The default settings for the plot function produce a scat-
ter plot and you should note that by default the axes are labelled with expressions
passed to the plot function. Many parameters can be set for plot either by defin-
ing the plot environment (described later) or when the plot is called. For example,
the option col specifies the plot colour and pch the plot character:
plot(x1,y1,pch=16, col='red')
Other options include different types of plot: type = 'l' produces a line plot of
the two variables, and again the col option can be used to specify the line colour
and the option lwd specifies the plot line width. You should run the code below to
produce different line plots:
Figure 2.1 A basic scatter plot
x2 <- seq(0,2*pi,len=100)
y2 <- sin(x2)
plot(x2,y2,type='l')
plot(x2,y2,type='l', lwd=3, col='darkgreen')
You should examine the help for the plot command (reminder: type ?plot at the
R prompt) and explore different plot types that are available. Having called a new
plot as in the above examples, other data can be plotted using other commands:
points, lines, polygons, etc. You will see that plot by default assumes the
plot type is point unless otherwise specified. For example, in Figure 2.2 the line
data described by x2 and y2 are plotted, after which the points described by x2
and y2r are added to the plot.
plot(x2,y2,type='l', col='darkgreen', lwd=3, ylim=c(-1.2,1.2))
y2r <- y2 + rnorm(100,0,0.1)
points(x2,y2r, pch=16, col='darkred')
In the above code, the rnorm function creates a vector of small values which are
added to y2 to create y2r. The function points adds points to an existing plot.
Many other options for plots can be applied here. For example, note the ylim
option. This sets the limits of the y-axis, while xlim does the same for the x-axis.
You should apply the commands below to the plot data.
y4 <- cos(x2)
plot(x2, y2, type='l', lwd=3, col='darkgreen')
lines(x2, y4, lwd=3, lty=2, col='darkblue')
Notice that, similar to points, the function lines adds lines to an existing plot,
and note the lty option as well. This specifies the type of line (dotted, simple, etc.).
Figure 2.2 A line plot with points added
The function polygon adds a polygon to an existing plot. The option col sets
the polygon fill colour. By default a black border is drawn; however, including the
parameter border = NA would result in no border being drawn. In Figure 2.3
two different plots of the same data illustrate the application of these parameters.
You should examine the different plot types and parameters in par. Enter
?par for the help page to see the full list of different plot parameters. One
of these, mfrow, is used below to set a combined plot of one row and two
columns. This needs to be reset or the rest of your plots will continue to be
printed in this way. To do this enter:
par(mfrow = c(1,2))
plot(x2, y2, type='l', lwd=3, col='darkgreen')
plot(x2, y2, type='l', col='darkgreen', lwd=3, ylim=c(-1.2,1.2))
points(x2, y2r, pch=16, col='darkred')
par(mfrow = c(1,1))
The last line of code resets par.
Figure 2.3 Points with polygons added
x2 <- seq(0,2*pi,len=100)
y2 <- sin(x2)
y4 <- cos(x2)
# specify the plot layout and order
par(mfrow = c(1,2))
# plot #1
plot(y2,y4)
polygon(y2,y4,col='lightgreen')
# plot #2: this time with 'asp' to set the aspect ratio of the axes
plot(y2,y4, asp=1, type='n')
polygon(y2,y4,col='lightgreen')
In the second plot, the parameter asp fixes the aspect ratio, in this case to 1 so that
the x and y scales are the same, and type = 'n' draws the plot axes to correct
scale (i.e. of the y2 and y4 data) but adds no lines or points.
So far the plot commands have been used to plot pairs of x and y coordinates
in different ways: points, lines and polygons (this may suggest different vector
types in a GIS for some readers). We can extend these to start to consider geo-
graphical coordinates more explicitly with some geographical data. You will need
to install the GISTools package, which may involve setting a mirror site as
described in Chapter 1. The first time you use any package in R it needs to be
downloaded before it is installed.
install.packages("GISTools", depend = T)
Then you can call the package in the R console:
library(GISTools)
You will then see some messages when you load the package, letting you know that
the packages that GISTools makes use of have also been loaded automatically.
You only need to install a package onto your computer the first time you use it.
Once it is installed it can simply be called. That is, there is no need to download it
again, you can simply enter library(package).
Figure 2.4 Appling County plotted from coordinate pairs
The code below loads a number of datasets with the data(georgia) com-
mand. It then selects the first element from the georgia.polys dataset and
assigns it to a variable called appling. This contains the coordinates of the outline
of Appling County in Georgia. It then plots this to generate Figure 2.4.
# library(GISTools)
data(georgia)
# select the first element
appling <- georgia.polys[[1]]
# set the plot extent
plot(appling, asp=1, type='n', xlab="Easting", ylab="Northing")
# plot the selected features with hatching
polygon(appling, density=14, angle=135)
There are a number of things to note in this bit of code.
1. The call data(georgia) loads three datasets: georgia , georgia2
and georgia.polys .
2. The first element of georgia.polys contains the coordinates for the
outline of Appling County.
3. Polygons do not have to be regular; they can, as in this example, be
geographical zones. The code assigns the coordinates to a variable
called appling and this is a two-column matrix.
4. Thus, with an x and y pairing, the following plot commands all work
with data in this format: plot , lines , polygon , points .
5. As before, the plot command in the code above has the type = 'n'
parameter, and asp = 1 fixes the aspect ratio. The result is that
the x and y scales are the same but the command adds no lines or points.
The wider point being demonstrated here is how routines for plotting spatial data
that we will use subsequently are underpinned by these kinds of data structures
and core plotting routines. The code above illustrates the engines of, for example,
the mapping and visualisation packages tmap and ggplot.
2.4.2 Plot Colours
Plot colours can be specified names or as red, green and blue (RGB) values. The
former can be listed by entering the following:
colours()
RGB colours are composed of three values in the ranges 0 to 1. Having run the code
above, you should have a variable called appling in your workspace. Now try
entering the code below:
plot(appling, asp=1, type='n', xlab="Easting", ylab="Northing")
polygon(appling, col=rgb(0,0.5,0.7))
A fourth parameter can be added to rgb to indicate transparency as in the code
below, where the range is from 0 (invisible) to 1 (opaque).
polygon(appling, col=rgb(0,0.5,0.7,0.4))
Text can also be added to the plot and its placement in the plot window specified.
The cex parameter (for character expansion) determines the size of text. Note that
parameters like col also work with text and that HTML colours also work
(such as "B3B333"). The code below generates two plots. The first plots a set of
random points and then plots appling with a transparency shading over the top
(Figure 2.5).
# set the plot extent
plot(appling, asp=1, type='n', xlab="Easting", ylab="Northing")
# plot the points
points(x = runif(500,126,132)*10000,
       y = runif(500,103,108)*10000, pch=16, col='red')
# plot the polygon with a transparency factor
polygon(appling, col=rgb(0,0.5,0.7,0.4))
The second plots appling, but with some descriptive text (Figure 2.6).
plot(appling, asp=1, type='n', xlab="Easting", ylab="Northing")
polygon(appling, col="#B3B333")
# add text, specifying its placement, colour and size
text(1287000,1053000,"Appling County",cex=1.5)
text(1287000,1049000,"Georgia",col='darkred')
Figure 2.5 Appling County with transparency
Figure 2.6 Appling County with text
In the above code, the coordinates for the text placement need to be speci-
fied. The function locator is very useful in this context: it can be used to
determine locations in the plot window. Enter locator() at the R prompt,
and then left-click in the plot window at various locations. When you right-
click, the coordinates of these locations are returned to the R console window.
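As a hedged sketch of how locator() might be combined with text() to place a label interactively (the coordinates returned depend entirely on where you click):
coords <- locator() # left-click to choose locations, right-click to finish
text(coords$x, coords$y, "Appling County", cex = 1.5)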
Figure 2.7 Plotting rectangles
Other plot tools include rect, which draws rectangles. This is useful for plac-
ing map legends as your analyses develop. The following code produces the plot
in Figure 2.7.
plot(c(-1.5,1.5),c(-1.5,1.5),asp=1, type='n')
# plot the green/blue rectangle
rect(-0.5,-0.5,0.5,0.5, border=NA, col=rgb(0,0.5,0.5,0.7))
# then the second one
rect(0,0,1,1, col=rgb(1,0.5,0.5,0.7))
The command image plots tabular and raster data as shown in Figure 2.8. It has
default colour schemes, but other colour palettes exist. This book strongly recom-
mends the use of the RColorBrewer package, which is described in more detail
in Chapter 3, but an example of its application is given below:
Figure 2.8 Plotting raster data
# load some grid data
data(meuse.grid)
# define a SpatialPixelsDataFrame from the data
mat = SpatialPixelsDataFrame(points = meuse.grid[c("x", "y")],
data = meuse.grid)
# set some plot parameters (1 row, 2 columns)
par(mfrow = c(1,2))
# set the plot margins
par(mar = c(0,0,0,0))
# plot the points using the default shading
image(mat, "dist")
# load the package
library(RColorBrewer)
# select and examine a colour palette with 7 classes
greenpal <- brewer.pal(7,'Greens')
# and now use this to plot the data
image(mat, "dist", col=greenpal)
# reset par
par(mfrow = c(1,1))
You should note that par(mfrow = c(1,2)) results in one row and two col-
umns and that it is reset in the last line of code.
The command contour(mat, "dist") will generate a contour plot of
the matrix above. You should examine the help for contour; a nice example
of its use can be found in code in the help page for the volcano dataset
that comes with R. Enter the following in the R console:
?volcano
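A minimal sketch (not from the text) using the built-in volcano matrix:
contour(volcano) # contour lines only
filled.contour(volcano, color.palette = terrain.colors) # filled contour plot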
2.5 ANOTHER PLOT OPTION: ggplot
2.5.1 Introduction to ggplot
A suite of tools and functions for plotting is available via the ggplot2 package
which is included as part of the tidyverse (https://www.tidyverse.org).
The ggplot2 package applies principles described in The Grammar of Graphics
(Wilkinson, 2005) (hence the gg in the name of the package) which conceptualises
graphics and plots in terms of their theoretical components. The approach is to
handle each element of the graphic separately in a series of layers, and in so doing
to control each part of the plot. This is different from the basic plot functions used
above which apply specific plotting functions based on the type or class of data
that were passed to them.
The ggplot2 package can be installed by installing the whole tidyverse:
install.packages("tidyverse", dep = T)
Or it can be installed on its own:
install.packages("ggplot2", dep = T)
And then loaded into the workspace:
library(ggplot2)
The plots above can be re-created using either the qplot or ggplot functions
in the ggplot2 package. The function qplot() is used to produce quick, simple
plots in a similar way to the plot function. It takes x and y and a data argument
for a data frame containing x and y. Figure 2.9 re-creates Figure 2.2. Notice how
the elements in theme are used to control the display.
qplot(x2,y2r,col=I('darkred'), ylim=c(-1.2, 1.2)) +
geom_line(aes(x2,y2), col=I("darkgreen"), size = I(1.5)) +
theme(axis.text=element_text(size=20),
axis.title=element_text(size=20,face="bold"))
Notice how the plot type is first specified (in this case qplot()) and then subse-
quent lines include instructions for what to plot and how to plot it. Here geom_
line() was specified followed by some style instructions.
Try adding:
theme_bw()
or:
theme_dark()
to the above. Remember that you need to include a + for each additional element
in ggplot.
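For example, a sketch adding one of these themes as a further + element to the earlier plot:
qplot(x2, y2r, col = I('darkred'), ylim = c(-1.2, 1.2)) +
  geom_line(aes(x2, y2), col = I("darkgreen"), size = I(1.5)) +
  theme_dark()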
Figure 2.9 A simple qplot plot
To reproduce the Appling plots, the variable appling has to be converted from
a matrix to a data frame whose elements need to be labelled:
appling <- data.frame(appling)
colnames(appling) <- c("X", "Y")
Then qplot can be called as in Figure 2.10 to re-create Figure 2.5 defined above in
stages.
# create the first plot with qplot
p1 <- qplot(X, Y, data = appling, geom = "polygon", asp = 1,
colour = I("black"),
fill=I(rgb(0,0.5,0.7,0.4))) +
theme(axis.text=element_text(size=12),
axis.title=element_text(size=20))
# create a data.frame to hold the points
df <- data.frame(x = runif(500,126,132)*10000,
                 y = runif(500,103,108)*10000)
# now use ggplot to construct the layers of the plot
p2 <- ggplot(appling, aes(x = X, y= Y)) +
geom_polygon(fill = I(rgb(0,0.5,0.7,0.4))) +
geom_point(data = df, aes(x, y),col=I('red')) +
coord_fixed() +
theme(axis.text=element_text(size=12),
axis.title=element_text(size=20))
# finally combine these in a single plot
# using the grid.arrange function
# NB you may have to install the gridExtra package
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)
The result is shown in Figure 2.10 (A simple qplot plot of a polygon), the right-hand part of which re-creates Figure 2.5.
Notice a number of things. First, the structural differences in the way the graphic is called, including the specification of the type with the geom parameter (compared to the geom_line parameter earlier). Second, the assignment of the
plot objects to variables p1 and p2. Third, the use of the grid.arrange() func-
tion in the gridExtra package that allows two graphics to be included in the plot
window. Finally, you will have to install the gridExtra package before the first
time you use it:
install.packages("gridExtra", dep = T)
2.5.2 Different ggplot Types
This section briefly introduces different kinds of plots using ggplot for different
kinds of variables, including scatter plots, histograms and boxplots. In subsequent
chapters, different flavours and types of ggplot will be illustrated. But this is a
vast package and involves a bit of a learning curve at first. To fully understand all
that it can do is beyond the scope of this subsection in this chapter, but there is
plenty of help and advice on the internet. You could explore some of this yourself
by following some of the links at http://ggplot2.tidyverse.org.
The basic call to ggplot is complemented by an aesthetic prefixed by geom_
and has the following syntax:
ggplot(data = , aes(x,y,colour)) +
geom_XYZ()
To illustrate the different plotting options, we need to create some data and some
categorical variables. The code below extracts the data frame from georgia and
converts it to a tibble. This is like the attribute table of a shapefile. Note that ggplot
will work with any type of data frame.
# data.frame
df <- data.frame(georgia)
# tibble
tb <- as.tibble(df)
Enter the code below to see the first 10 records:
tb
You can see that this has attributes for the counties of Georgia, and a number
of variables are included. Next, the code below creates an indicator for rural/
not-rural, which we set to values using the levels function. Note the use of the
+ 0 to convert the TRUE and FALSE values to 1s and 0s:
tb$rural <- as.factor((tb$PctRural > 50) + 0)
levels(tb$rural) <- list("Non-Rural" = 0, "Rural"=1)
Then we create an income category variable around the interquartile range of the
MedInc variable (median county income). There are fancier ways to do it, but the
code below is tractable:
tb$IncClass <- rep("Average", nrow(tb))
tb$IncClass[tb$MedInc >= 41204] = "Rich"
tb$IncClass[tb$MedInc <= 29773] = "Poor"
The distributions can be checked if you wanted using the table() function:
table(tb$IncClass)
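One of the 'fancier' alternatives alluded to above (a sketch, not the authors' code, using a hypothetical IncClass2 field) derives the break points directly with quantile() and cut():
# breaks at the minimum, lower quartile, upper quartile and maximum of MedInc
qs <- quantile(tb$MedInc, probs = c(0, 0.25, 0.75, 1))
tb$IncClass2 <- cut(tb$MedInc, breaks = qs,
                    labels = c("Poor", "Average", "Rich"),
                    include.lowest = TRUE)
table(tb$IncClass2)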
Scatter plots can be used to show two variables together. The data pairs in tb
should be examined. For example, consider PctBach and PctEld, representing
the percentages of the county populations with bachelor’s degrees and who are
elderly (whatever that means).
ggplot(data = tb, mapping=aes(x=PctBach, y=PctEld)) +
geom_point()
The plot can be enhanced by passing a grouping variable to the colour parameter
in aes:
ggplot(data = tb, mapping=aes(x=PctBach, y=PctEld, colour=rural)) +
geom_point()
Now modify the code above to group by the IncClass variable created earlier.
What happens? What do you see? Does this make sense? Are there any trends? It
could tentatively be said that the poor areas are more elderly and have fewer peo-
ple with bachelor’s degrees. This might be confirmed by adding a trend line:
ggplot(data = tb, mapping = aes(x = PctBach, y = PctEld)) +
geom_point() +
geom_smooth(method = "lm")
Also note that style templates can be added and colours changed. Putting this all
together generates Figure 2.11:
ggplot(data = tb, mapping = aes(x = PctBach, y = PctEld)) +
geom_point() +
geom_smooth(method = "lm", col = "red", fill = "lightsalmon") +
theme_bw() +
xlab("% of population with bachelor degree") +
ylab("% of population that are elderly")
You can explore other styles by trying the ones listed under the help for theme_bw.
Next, histograms can be used to examine the distributions of income across the
159 counties of Georgia:
ggplot(tb, aes(x=MedInc)) +
geom_histogram(binwidth = 5000, colour = "red", fill = "grey")
The axes can be labelled, the theme set and title included as with the above exam-
ples, by including additional elements in the plot. Probability densities can also be
plotted as follows, generating Figure 2.12:
Figure 2.11 A ggplot scatter plot
Figure 2.12 A ggplot density histogram
ggplot(tb, aes(x=MedInc)) +
geom_histogram(aes(y=..density..),
binwidth=5000,colour="white") +
geom_density(alpha=.4, fill="darksalmon") +
# Ignore NA values for the median
geom_vline(aes(xintercept=median(MedInc, na.rm=T)),
color="orangered1", linetype="dashed", size=1)
Multiple plots can be generated using the facet() options in ggplot. These
create separate plots for each group. Here the PctBach variable is plotted and
median incomes compared:
ggplot(tb, aes(x=PctBach, fill=IncClass)) +
geom_histogram(color="grey30",
binwidth = 1) +
scale_fill_manual("Income Class",
values = c("orange", "palegoldenrod","firebrick3")) +
facet_grid(IncClass~.) +
xlab("% Bachelor degrees") +
ggtitle("Bachelors degree % in different income classes")
Another way of examining distributions is through boxplots. Boxplots display the
distribution of a continuous variable and can be broken down by a categorical
variable. A basic boxplot can be generated with the geom_boxplot aesthetic:
ggplot(tb, aes(x = "", PctBach)) +
geom_boxplot()
Figure 2.13 A ggplot boxplot with groups
This can be extended with some grouping, as before, and to compare more than
one treatment as in Figure 2.13:
ggplot(tb, aes(IncClass, PctBach, fill = factor(rural))) +
geom_boxplot() +
scale_fill_manual(name = "Rural",
values = c("orange", "firebrick3"),
labels = c("Non-Rural"="Not Rural","Rural"="Rural")) +
xlab("Income Class") +
ylab("% Bachelors")
This is only scratching the surface of the capability of ggplot. Additional refine-
ments will be demonstrated throughout this book.
2.6 READING, WRITING, LOADING AND SAVING DATA
There are a number of ways of getting data in and out of R, and three methods for
reading and writing different formats are briefly considered here: text files, R data
files and spatial data.
2.6.1 Text Files
Consider the appling data variable above. This is a matrix variable, containing
two columns and 125 rows. You can examine the data using dim and head:
# display the first six rows
head(appling)
# display the variable dimensions
dim(appling)
You will note that the data fields (columns) are not named; however, these can be
assigned.
colnames(appling) <- c("X", "Y")
The data can be written into a comma-separated variable file using the command
write.csv and then read back into a different variable, as follows:
write.csv(appling, file = "test.csv")
This writes a .csv file into the current working directory. You can check where this is by using the getwd() function. You can set the working directory either through the setwd() function or through the menu (Session > Set Working Directory).
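For example (a sketch; the path shown is hypothetical and should be replaced with a folder on your own machine):
getwd() # report the current working directory
setwd("/MyPath/MyFolder") # set it to a folder of your choice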
If you open it using a text editor or spreadsheet software, you will see that it
has three columns: X and Y as expected plus the index for each record. This is
because write.csv by default includes row.names = TRUE. Again examine the help file for this function. To write the file without row names, set row.names = F:
write.csv(appling, file = "test.csv", row.names = F)
R also allows you to read .csv files using the read.csv function. Read the file
you have created into a variable:
tmp.appling <- read.csv(file = "test.csv")
Notice that in this case what is read from the .csv file is assigned to the variable
tmp.appling. Try reading this file without assignment. The default for read.
csv is that the file has a header (i.e. the first row contains the names of the col-
umns) and that the separator between values in any record is a comma. However,
these can be changed depending on the nature of the file you are seeking to load
into R. A number of different types of files can be read into R. You should examine
the help files for reading data in different formats. Enter ??read to see some of
these listed. You will note that read.table and write.table require more
parameters to be specified than read.csv and write.csv.
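For example, reading the same file with read.table typically requires the header and separator to be stated explicitly (a sketch, assuming the test.csv file created above):
tmp2 <- read.table(file = "test.csv", header = TRUE, sep = ",")
head(tmp2)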
2.6.2 R Data Files
It is possible to save variables that are in your workspace to a designated file. This
can be loaded at the start of your next session. For example, if you have been run-
ning the code as introduced in this chapter you should have a number of variables,
from x at the start to engine and colours and the appling data above.
You can save this workspace using the drop-down menus in the RStudio inter-
face or using the save function. The RStudio menu route saves everything that is
present in your workspace, as listed by ls(), while the save command allows
you to specify what variables you wish to save.
# this will save everything in the workspace
save(list = ls(), file = "MyData.RData")
# this will save just appling
save(list = "appling", file = "MyData.RData")
# this will save appling and georgia.polys
save(list = c("appling", "georgia.polys"), file = "MyData.RData")
You should note that the .RData file binary format is very efficient at storing data:
the Appling .csv file used 4kb of memory, while the .RData file used only 2kb.
Similarly, .RData files can be loaded into R using the menu in the R interface or
within the R console by writing:
load("MyData.RData")
This will load the variables in the .RData file into the R console.
2.6.3 Spatial Data Files
It is appropriate to briefly consider how to get spatial data in and out of R, but note
that this is covered in more detail in Chapter 3.
The rgdal package includes two generic functions for reading and writing all
kinds of spatial data: readOGR() and writeOGR(). Load the rgdal package:
library(rgdal)
The georgia object in sp format can be written to a shapefile using the
writeOGR() function as follows:
writeOGR(obj=georgia, dsn=".", layer="georgia",
driver="ESRI Shapefile", overwrite_layer=T)
It can be read back into R using the readOGR() function:
new.georgia <- readOGR("georgia.shp")
Spatial data can be also be read in and written out using the sf functions st_
read() and st_write(). For example, to read in and write out the georgia.
shp shapefile that was created above (and to overwrite g2) the following code can
be used. You will need to install and load the sf package:
install.packages("sf", dep = T)
library(sf)
setwd("/MyPath/MyFolder")
g2 <- st_read("georgia.shp")
st_write(g2, "georgia.shp", delete_layer = T)
2.7 ANSWERS TO SELF-TEST QUESTIONS
Q1: orange is not one of the factor’s levels, so the result is an NA.
colours[4] <- "orange"
colours
 [1] red    blue   red    <NA>   silver red    white  silver
 [9] red    red    white  silver silver
Levels: red blue white silver black
Q2: There is no count for black in the character version – table does not know
that this value exists, since there is no levels information. Also the order of
colours is alphabetical in the character version. In the factor version, the
order is based on that specified in the factor function.
Q3: The first variable is tabulated along the rows, the second along the columns.
Q4: Find the colours of all cars with engines with capacity greater than 1.1 litres:
# Undo the colours[4] <- 'orange' line used above
colours <- factor(c("red","blue","red","white",
                    "silver","red","white","silver",
                    "red","red","white","silver","silver"),
                  levels=c("red","blue","white","silver","black"))
colours[engine > "1.1litre"]
[1] blue   white  silver red    white  red    silver silver
Levels: red blue white silver black
Counts of types of all cars with capacity below 1.6 litres:
table(car.type[engine < "1.6litre"])
saloon hatchback convertible
7 4 0
Counts of colours of all hatchbacks with capacity greater than or equal to 1.3 litres:
table(colours[(engine >= "1.3litre") & (car.type == "hatchback")])
red blue white silver black
2 0 0 1 0
Q5: The index returned corresponds to the first number taking the largest value.
Q6: An expression to find the index of the largest value in each row of crosstab
using which.max and apply:
apply(crosstab,1,which.max)
saloon hatchback convertible
1 1 3
Q7: Use apply functions to return the best-selling colour and car type:
apply(crosstab,1,which.max.name)
saloon hatchback convertible
"red" "red" "white"
apply(crosstab,2,which.max.name)
red blue white silver black
"hatchback" "saloon" "saloon" "saloon" "saloon"
Q8: An R expression that assigns the best-selling colour and car types to a list:
most.popular <- list(colour=apply(crosstab,1,which.max.name),
type=apply(crosstab,2,which.max.name))
most.popular
$colour
saloon hatchback convertible
"red" "red" "white"
$type
red blue white silver black
15 "hatchback" "saloon" "saloon" "saloon" "saloon"
Q9: A print function for variables of class sales.data:
print.sales.data <- function(x) {
cat("Weekly Sales Data:\n")
cat("Most popular colour:\n")
for (i in 1:length(x$colour)) {
cat(sprintf("%12s:%12s\n",names(x$colour)[i],x$colour[i]))}
cat("Most popular type:\n")
for (i in 1:length(x$type)) {
cat(sprintf("%12s:%12s\n",names(x$type)[i],x$type[i]))}
cat("Total Sold = ",x$total)
}
this.week
Weekly Sales Data:
Most popular colour:
saloon: red
hatchback: red
convertible: white
Most popular type:
red: hatchback
blue: saloon
white: saloon
silver: saloon
black: saloon
Total Sold = 13
Although the above is one possible solution to the question, it is not unique. You
may decide to create a very different looking print.sales.data function. Note
also that although until now we have concentrated only on print functions for
different classes, it is possible to create class-specific versions of any function.
REFERENCE
Wilkinson, L. (2005) The Grammar of Graphics. New York: Springer.
3 BASICS OF HANDLING SPATIAL DATA IN R
3.1 OVERVIEW
The aim of this chapter is to provide an introduction to the mapping and geograph-
ical data handling capabilities of R. It explicitly focuses on developing the building
blocks for the spatial data analyses in later chapters. These extend the mapping
functionality that was briefly introduced in the previous chapter and will be
extended further in Chapter 5. It includes an introduction to the sp and sf pack-
ages and the R spatial data formats they support, and the tmap package. This
chapter describes methods for moving between the sp and sf formats and for
producing choropleth maps – from basic to quite advanced outputs – and intro-
duces some methods for generating descriptive statistics. These skills are funda-
mental to the analyses that will be developed later in the book. This chapter will:
● Introduce the sp and sf R spatial data formats and describe how to use
them
● Describe how to compile maps based on multiple layers using both
basic plot functions and the tmap package
● Describe how to set different plot parameters and shading schemes
● Describe how to develop basic descriptive statistical analyses of spatial
data
3.1.1 Spatial Data
Data are often held in data tables or databases – a bit like a spreadsheet. The rows
represent some real-world feature (a person, a transaction, a date, etc.) and the
columns represent some attribute associated with that feature. Rows in databases
may be referred to as records and columns as fields. There are some cases where the
features can be either a record or a field – for example, a date could belong to a list
of daily supermarket transactions (as a record) or be an attribute associated with
an event at a location (as a field). For the purposes of much of the practical work
in this chapter data will be conceptualised in this way.
In R there are many data formats and packages for handling and manipulating
them. For example, the tibble format defined within the dplyr package as part
of the tidyverse is starting to supersede data frames (in fact it includes the
data.frame class). This is part of a concerted activity by many package develop-
ment teams to provide tidy and lazy data formats and processes for data science,
mapping and spatial data analysis. Some of the background to this activity can be
found on the webpage for tidyverse (https://www.tidyverse.org),
which is a collection of R packages designed for data science.
The preceding description of data, with records (rows) and fields (columns), can
be extended to spatial data in which each record typically represents some real-
world geographical feature – a place, a route, a region, etc. – and individual fields
provide a measurement or attribute associated with that feature. In geographical
data, features are typically represented as points, lines or areas.
Why spatial data? Nearly all data are spatial – they are collected somewhere.
If and when a third edition of this book is written in the future, we expect to
extend this argument to the spatio-temporal domain in which all data are
spatio-temporal – they are collected somewhere and at some time.
3.1.2 Installing and Loading Packages
The previous chapter included a number of basic analytical and graphical tech-
niques using R. However, few of these were particularly geographical. A number
of packages are available in R that allow sophisticated visualisation, manipulation
and analysis of spatial data. Some of this functionality will be demonstrated in this
chapter in conjunction with some mapping tools and specific data types to create
different examples of mapping in R. Remember that a package in R is a set of pre-
written functions (and possibly data items as well) that are not available when you
initially start R running, but can be loaded from the R library at the command line.
To illustrate these techniques, the chapter starts by developing some elementary
maps, building to more sophisticated mapping.
This chapter uses a number of packages: raster, OpenStreetMap,
RgoogleMaps, grid, rgdal, tidyverse, reshape2, ggmosaic, GISTools, sf
and tmap. You will have to install them before you use them for the first time. You will
have installed the GISTools and sf packages using the install.packages()
function if you worked through Chapter 2. Once you have downloaded and installed a
package, you can simply load the package when you use R subsequently.
The is.element query combined with the installed.packages() func-
tion can be used to check whether a package is installed.
is.element("sf", installed.packages())
If FALSE is returned then you need to install the package as above:
install.packages("sf", dep = TRUE)
Note the dep = TRUE parameter. This tells R to install the package together with its dependencies (i.e. other packages that it depends on). Then the package can be loaded:
library(sf)
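Putting the check and the install together for all of this chapter's packages, a convenience sketch (not from the text) might be:
pkgs <- c("raster", "OpenStreetMap", "RgoogleMaps", "grid", "rgdal",
          "tidyverse", "reshape2", "ggmosaic", "GISTools", "sf", "tmap")
for (p in pkgs) {
  # install any package that is not already present
  if (!is.element(p, installed.packages()[, 1]))
    install.packages(p, dep = TRUE)
}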
It is possible to inspect the functionality and tools available in sf or any other
package by examining the documentation.
help(sf)
# or
?sf
This provides the general description of the package. At the bottom of the help
window, there is a hyperlink to the index which, if you click on it, will
open a page with a list of all the tools available in the package. The CRAN
website also has full documentation for each package – for sf see
http://cran.r-project.org/web/packages/sf/index.html.
3.2 INTRODUCTION TO sp AND sf: THE sf REVOLUTION
As described in Chapter 1, the first edition of this book focused on the sp format
for spatial data in R. This format is defined in the sp package. It provides an organ-
ised set of spatial data classes, providing a unified way of moving from one pack-
age to another, taking advantage of the different tools and the functions they
include. However, R is dynamic and sometimes a new
paradigm is introduced; this
has been the case recently for spatial data in R, with the release of the sf package
by Pebesma et al. (2016).
In this chapter, both the sp and sf formats are introduced. The manipulation
and analysis of spatial data use, where possible, the sf format and associated
tools. However, some packages and operations for spatial analyses have not yet
been updated to work with sf. For example, at the time of writing, many of the
functions in spdep, such as those for cluster analysis using Moran’s I (see Anselin,
1995) and the G-statistic (described in Ord and Getis, 1995), only work with sp
format spatial data. For these reasons, this chapter (and others throughout the
book) will, where possible, describe the manipulation and analysis of spatial data
using sf format and functions but will switch between (and convert data between)
sp and sf formats as needed.
3.2.1 sp data format
The sp package defines a number of classes (or sp objects) for handling points,
lines and areas, as summarised in Table 3.1. The sp data formats underpin many
of the packages that you will use directly or indirectly (i.e. they are loaded by other
packages): they have dependencies on sp. An example is the GISTools package by
Brunsdon and Chen (2014) which has dependencies on maptools, sp, rgeos
and other packages. If you install and load GISTools you will see these packages
being loaded.
Table 3.1 Spatial data formats in R

Without attributes    With attributes              ArcGIS equivalent
SpatialPoints         SpatialPointsDataFrame       Point shapefiles
SpatialLines          SpatialLinesDataFrame        Line shapefiles
SpatialPolygons       SpatialPolygonsDataFrame     Polygon shapefiles

Source: Pebesma et al. (2016).
3.2.1.1 Spatial data in GISTools
GISTools, similar to many other R packages, comes with a number of embedded
datasets that can be loaded from the command line after the package is installed.
Two datasets will be used in this chapter, to illustrate spatial data manipulation,
mapping and analysis in both sf and sp. These are polygon and line data for New
Haven, Connecticut and the counties in the state of Georgia, both in the USA. The
New Haven data include crime statistics, roads, census blocks (including demo-
graphic information), railway lines and place names. The Georgia data include
outlines of the counties in Georgia with a number of attributes relating to the 1990
census including population (TotPop90), the percentage of the population that
are rural (PctRural), that have a college degree (PctBach), that are elderly
(PctEld), that are foreign born (PctFB), that are classed as being in poverty
(PctPov), that are black (PctBlack) and the median income of the county
(MedInc). The two datasets are shown in Figure 3.1.
Having installed GISTools, you can load the newhaven data or georgia
data using the data() function. Load the newhaven data and then examine what
is loaded and the types (or classes) of data that are loaded:
data(newhaven)
ls()
[1] "blocks"    "breach"    "burgres.f" "burgres.n"
[5] "famdisp"   "places"    "roads"     "tracts"
class(breach)
[1] "SpatialPoints"
attr(,"package")
[1] "sp"
class(blocks)
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"
The breach data are of the SpatialPoints class and simply describe
locations, with no attributes. The blocks data, on the other hand, are of the
SpatialPolygonsDataFrame class as they include some census variables
associated with each census block. Thus spatial data with attributes defined in this
way in sp hold their attributes in the data frame, and you can see this by looking
at the first few lines of the blocks data frame using the head function:
head(data.frame(blocks))
Note that the data frame of an sp object can also be accessed directly through its
@data slot:
head(blocks@data)
Both of these code snippets print the first six lines of attributes associated with the
census blocks data. A formal consideration of spatial attributes and how to analyse
and map them is given later in this chapter.
The census blocks in New Haven can be plotted using the R plot function:
plot(blocks)
Figure 3.1 The New Haven census blocks with roads in blue, and the counties in the state
of Georgia shaded by median income
The default plot function for the sp class of objects can be used to generate
maps, and this was the focus of the first edition of this book using the GISTools
package. It described how different plot commands could be combined to create
plot layers. For example, to draw a map of the roads in red, with the blocks in
black (the plot default colour) as in Figure 3.2, the code below could be entered:
par(mar = c(0,0,0,0))
plot(roads, col="red")
plot(blocks, add = T)
Figure 3.2 The New Haven census blocks and road data
3.2.2 sf Data Format
Recently a new class of R spatial objects has been defined and released as a package
called sf, which stands for ‘simple features’ (Pebesma et al., 2016). It seeks to
encode spatial data in a way that conforms to a formal standard (ISO 19125-1:2004).
This emphasises the spatial geometry of objects, the way that objects are stored in
databases. In brief, the aim of the team developing sf (actually many of them are
the same people who developed sp, so they do know what they are doing!) is to
provide a format for spatial data. An overview of the evolution of spatial data in R
can be found at https://edzer.github.io/UseR2017/.
The idea is that a feature is a thing, or an object in the real world, such as a build-
ing or a tree. As is the case with objects, they often consist of other objects such that
a set of features can form a single feature. Features have a geometry describing
where on Earth they are located, and they have attributes, which describe other
properties. There are many sf object types, but the key ones (which are similar to
lines, points and areas) are listed in Table 3.2 (taken from the sf vignette). This has
a much stronger theoretical structure, with for example multipoint features
being composed of point features etc. Only the more common types of geome-
tries defined within sf are described in Table 3.2; other geometries exist but are
much rarer.
Table 3.2 Spatial data formats in R from https://r-spatial.github.io/sf/articles/sf1.html
Feature type (ArcGIS equivalent): Description
POINT (Point shapefiles): zero-dimensional geometry containing a single point
LINESTRING (Line shapefiles): sequence of points connected by straight, non-self-intersecting line pieces; one-dimensional geometry
POLYGON (Polygon shapefiles): geometry with a positive area (two-dimensional); sequence of points form a closed, non-self-intersecting ring; the first ring denotes the exterior ring, zero or more subsequent rings denote holes in this exterior ring
MULTIPOINT (Point shapefiles): set of points; a MULTIPOINT is simple if no two points in the MULTIPOINT are equal
MULTILINESTRING (Line shapefiles): set of linestrings
MULTIPOLYGON (Polygon shapefiles): set of polygons
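To get a feel for these geometry types, sf provides constructor functions for building single geometries from coordinates. A minimal sketch (with made-up coordinates) might be:
library(sf)
# a POINT from a single coordinate pair
p <- st_point(c(-84.5, 33.5))
# a LINESTRING from a matrix of coordinates
l <- st_linestring(rbind(c(0, 0), c(1, 1), c(2, 1)))
# a POLYGON from a list containing one closed ring (first and last points match)
pol <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 0))))
class(p); class(l); class(pol)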
Ultimately, sf formats will completely replace sp, and packages that use sp
(such as GWmodel for geographically weighted regression) will all have to be
updated to use sf at some point, but that is a few years away.
The sf package has a number of vignettes or tutorials that you could explore.
These include an overview of the format, reading and writing from and to sf
formats including conversions to and from sp and sf, and some illustrations of
how sf objects can be manipulated.
The code below will create a new window
with a list of sf vignettes:
library(sf)
vignette(package = "sf")
And then to display a specific vignette topic, this can be called using the vignette
function:
vignette("sf1", package = "sf")
Vignettes are an important part of R packages. They provide explanations of
the package functionality additional to those found in the example code at
the end of a help page. They can be accessed using the vignette function
or through the R help. The sf1 vignette could also be accessed via the
package help index: enter help(sf), navigate to the index through the link
at the bottom of the overview page and then click on the User guides,
package vignettes and other documentation link.
3.2.2.1 sf spatial data
The sp objects loaded by the GISTools datasets georgia and
newhaven can be converted to sf. The fundamental function for converting to
sf is st_as_sf(). In the code below it is used to convert the georgia sp
object to sf:
# load the georgia data
data(georgia)
# conversion to sf
georgia_sf = st_as_sf(georgia)
class(georgia_sf)
[1] "sf" "data.frame"
You can examine the contents of georgia_sf by entering the following at the
console:
georgia_sf
Notice how when georgia_sf is called the spatial information and the first 10
records of the attribute table are printed to the screen, rather than the entire object
as with sp. For comparison you could enter:
georgia
The plot function is also different: it will create maps of sf objects, and if the sf
object has attributes it will shade the first few of these:
# all attributes
plot(georgia_sf)
# selected attribute
plot(georgia_sf[, 6])
# selected attributes
plot(georgia_sf[,c(4,5)])
Finally, note that sf objects have a data frame. You could compare the data frames
of sp and sf objects:
## sp SpatialPolygonDataFrame object
head(data.frame(georgia))
## sf polygon object
head(data.frame(georgia_sf))
Note that the data frames of the sf objects have geometry attributes.
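The geometry column can also be examined or set aside directly. A small sketch, assuming the st_geometry and st_drop_geometry functions available in reasonably recent versions of sf:
# extract just the geometry column
head(st_geometry(georgia_sf))
# or drop it, leaving an ordinary data frame of attributes
head(st_drop_geometry(georgia_sf))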
We can also convert to sp by using the as function:
g2 <- as(georgia_sf, "Spatial")
class(g2)
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"
This automatically recognises that georgia_sf is a MULTIPOLYGON object in sf
and converts it to a SpatialPolygonsDataFrame object in sp. You could try a
similar set of operations with the roads layer loaded earlier to demonstrate this:
roads_sf <- st_as_sf(roads)
class(roads_sf)
r2 <- as(roads_sf, "Spatial")
class(r2)
3.3 READING AND WRITING SPATIAL DATA
Very often we have data that are in a particular format such as shapefile format. R has
the ability to read and write data from and to many different spatial data formats
using functions in the rgdal and sf packages – we will consider them both here.
3.3.1 Reading to and Writing from sp Format
As was briefly described in Chapter 2, the rgdal package includes two generic
functions for reading and writing all kinds of spatial data: readOGR() and
writeOGR(). Load the rgdal package:
library(rgdal)
As a reminder, the georgia object in sp format can be written to a shapefile
using the writeOGR() function as follows:
writeOGR(obj=georgia, dsn=".", layer="georgia",
driver="ESRI Shapefile", overwrite_layer=T)
You will see that a shapefile has been written into your current working directory,
overwriting any previous instance of georgia.shp, with its associated support-
ing files (.dbf etc.) that can be recognised by other applications (QGIS etc.).
Similarly, this can be read into R and assigned to a variable using the readOGR
function:
new.georgia <- readOGR("georgia.shp")
If you enter:
class(new.georgia)
you will see that the class of the new.georgia object is sp. You should examine
the writeOGR and readOGR functions in the rgdal package.
R is also able to read and write other proprietary spatial data formats using a
number of packages, which you should be able to find through a search of the R
help system or via an internet search engine. The rgdal package is the R version
of the Geospatial Data Abstraction Library. It includes a number of methods for read-
ing and writing spatial objects, including to and from SpatialXDataFrame
objects. The full syntax can be important – the code below overwrites any existing
similarly named file:
writeOGR( new.georgia, dsn = ".", layer = "georgia",
driver="ESRI Shapefile", overwrite_layer = T)
The dsn parameter is important here: for shapefiles it determines the folder the
files are written to. In the above example it was set to "." which places the files in
the current working directory.
You could specify a file path here. For a PC it might be something like D:\
MyDocuments\MyProject\DataFiles; for a Mac, /Users/lex/my_docs/
project.
The setwd() and getwd() functions can be used in determining and setting
the file path. You may want to set the file path and then use the dsn setting as
above:
setwd("/Users/lex/my_docs/project")
writeOGR( new.georgia, dsn = ".", layer = "georgia",
driver="ESRI Shapefile", overwrite_layer = T)
Or you could use the getwd() function, save the results to a variable and pass this
to writeOGR:
td <- getwd()
writeOGR( new.georgia, dsn = td, layer = "georgia",
driver="ESRI Shapefile", overwrite_layer = T)
You should also examine the functions for reading and writing raster layers in
rgdal, which are readGDAL and writeGDAL. These read and write functions in
rgdal are incredibly powerful and can read/write almost any spatial data format.
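A minimal sketch of their use (with a hypothetical raster file called elev.tif in the working directory) might be:
library(rgdal)
# read a raster file into a SpatialGridDataFrame
elev <- readGDAL("elev.tif")
class(elev)
# write it back out as a GeoTIFF
writeGDAL(elev, "elev_copy.tif", drivername = "GTiff")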
3.3.2 Reading to and Writing from sf Format
Spatial data can also be read in and written out using the sf functions
st_read() and st_write(). For example, to read in the georgia.shp shape-
file that was created above (and to overwrite g2) the following code can be used:
setwd("/MyPath/MyFolder")
g2 <- st_read("georgia.shp")
The working directory needs to be set to ensure that st_read looks in the right
place to read the file from. Here a single argument is used to find both the data
source and the layer. This works when the data source contains a single layer.
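If the data source contained several layers, the dsn and layer arguments could be given separately; a small sketch reading the same shapefile from the current working directory would be:
g2 <- st_read(dsn = ".", layer = "georgia")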
Writing a simple features object to a file needs at least two arguments: the object
and a filename. As before, this will not work if the georgia.shp file already exists in the
working directory, so the delete_layer = T parameter needs to be specified.
st_write(g2, "georgia.shp", delete_layer = T)
The filename is taken as the data source name. The default for the layer name is the
basename (filename without path) of the data source name. For this, st_write
needs to guess the driver. The above command, for instance, is equivalent to:
st_write( g2, dsn = "georgia.shp", layer = "georgia.shp",
driver = "ESRI Shapefile", delete_layer = T)
Typically you will either give the full path as the filename, or first set R’s working
directory with setwd() and use a filename without a path.
Note that the output driver is guessed from the data source name, from either
its extension (.shp: ESRI Shapefile), or its prefix (PG:: PostgreSQL).
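As an illustration of driver guessing, a hedged sketch writing the same object to a GeoPackage simply by changing the file extension (the .gpkg ending selects the GPKG driver):
st_write(g2, "georgia.gpkg", delete_layer = TRUE)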
The list of extensions with corresponding driver (short driver name) can be
found in the sf2 vignette. You will also note that there are a number of functions
that can be used to read, write and convert. You can examine this:
vignette("sf2", package = "sf")
3.4 MAPPING: AN INTRODUCTION TO tmap
3.4.1 Introduction
The first parts of this chapter have outlined basic commands for plotting data and
for producing maps and graphics using R. These were based on the plot func-
tions associated with sp objects. This section will now concentrate on developing
and expanding these basic techniques using the functions in the tmap package. It
will introduce some new plot parameters and will show how to extract and down-
load Google Maps and to use OpenStreetMap data as background context and to
create interactive (at least zoomable) maps in tmap. As you develop more sophis-
ticated analyses in later sections you may wish to return to some of the examples
used in this section. It will develop mapping of vector spatial data (points, lines
and areas) and will also introduce some new R commands and techniques to help
put all of this together.
The tmap mapping package (Tennekes, 2015) focuses on mapping the spatial
distribution of thematic data attributes. It can take sp and sf objects. It has a simi-
lar grammar to plotting with ggplot in that it seeks to handle each element of the
map separately in a series of layers, and in so doing seeks to exercise control over
each element. This is different from the basic plot functions used above to map
sp and sf data.
In this section the workings of tmap will be introduced, and then in later sec-
tions on mapping attributes this will be expanded and refined to impose different
mapping styles and embellishments. To begin with, you will need some predeter-
mined data, and the code in this section will use the georgia and
georgia_sf objects that were created earlier. As ever, you may wish to think
about creating a script and a workspace folder in which you can store any results
you generate. As a reminder, you can clear your workspace to remove all the vari-
ables and datasets you have created and opened using the previous code and com-
mands. This can be done in RStudio via Session > Clear Workspace,
or from the console by entering:
rm(list=ls())
3.4.2 A quick tmap
The qtm() function can be used to compose a quick map. The code below loads
the georgia data, recreates georgia_sf and generates a quick tmap using
qtm. First load the data:
data(georgia)
Check that the data have loaded correctly using ls(). There should be three
Georgia datasets: georgia, georgia2 and georgia.polys. Then create the
sf object georgia_sf as before:
georgia_sf <- st_as_sf(georgia)
Finally load tmap and create a quick map as in Figure 3.3:
library(tmap)
qtm(georgia, fill = "red", style = "natural")
Figure 3.3 The map of Georgia generated by qtm()
Note the use of the style parameter. This is a shortcut to one of the predefined
styles in the tmap package (applied in full syntax with the tm_style function),
in this case "natural". These styles can be called in abbreviated form using qtm.
You should explore the qtm function through the help.
The fill parameter can be used to specify a colour as above, or a variable to
be mapped. The code below generates Figure 3.4, which shows the distribution of
the MedInc variable:
qtm(georgia_sf, fill="MedInc", text="Name", text.size=0.5,
format="World_wide", style="classic",
text.root=5, fill.title="Median Income")
Figure 3.4 Counties in the state of Georgia shaded by median income
3.4.3 Full tmap
The process of making maps using tmap is one in which a series of layers are
added to the map. First the tm_shape() is specified, followed by a tmap aes-
thetic function that specifies what is to be plotted. This can be illustrated by
running the code snippets below and inspecting the results. You should see
how the tmap functions are added as a series of layers to the map in a similar
way to ggplot. Before this an outline of Georgia is created using the st_
union() function in sf. An alternative for sp is the gUnaryUnion() func-
tion in the rgeos package loaded with GISTools. The manipulation of spatial
data using overlay, union and intersection functions is covered in more depth
in Chapter 5.
# do a merge
g <- st_union(georgia_sf)
# for sp
# g <- gUnaryUnion(georgia, id = NULL)
# plot the spatial layers
tm_shape(georgia_sf) +
tm_fill("tomato")
Add the county borders:
tm_shape(georgia_sf) +
tm_fill("tomato") +
tm_borders(lty = "dashed", col = "gold")
Add some styling:
tm_shape(georgia_sf) +
tm_fill("tomato") +
tm_borders(lty = "dashed", col = "gold") +
tm_style("natural", bg.color = "grey90")
Include the outline, noting the second call to tm_shape to plot the second spatial
object g:
tm_shape(georgia_sf) +
tm_fill("tomato") +
tm_borders(lty = "dashed", col = "gold") +
tm_style("natural", bg.color = "grey90") +
# now add the outline
tm_shape(g) +
tm_borders(lwd = 2)
And finally putting it all together to create Figure 3.5:
tm_shape(georgia_sf) +
tm_fill("tomato") +
tm_borders(lty = "dashed", col = "gold") +
tm_style("natural", bg.color = "grey90") +
# now add the outline
tm_shape(g) +
tm_borders(lwd = 2) +
tm_layout(title = "The State of Georgia",
title.size = 1,
title.position = c(0.55, "top"))
So what you can see in the above code are two sets of tmap plot commands: the
first set plots the georgia_sf dataset, specifying a dashed gold line to show the
county boundaries, a tomato (red) fill colour for the state and a map background
colour of light grey. The second set adds the outline created by the union operation
with a thicker line width before the title is added.
Figure 3.5 Counties in the state of Georgia
It is also possible to plot multiple different maps from different datasets
together, but this requires a bit more control over the tmap parameters. The code
below assigns each map to variables t1 and t2, and then a second set of functions
is used to manipulate these in a plot window. Note that georgia2 is in sp format
and has a different map projection than georgia. For this reason, the aspect ratio
of the second plot is explicitly specified in the code below. The value was deter-
mined through trial and error. You will need to install and load the grid package.
# 1st plot of georgia
t1 <- tm_shape(georgia_sf) +
tm_fill("coral") +
tm_borders() +
tm_layout(bg.color = "grey85")
# 2nd plot of georgia2
t2 <- tm_shape(georgia2) +
tm_fill("orange") +
tm_borders() +
# the asp parameter controls aspect
# this makes the 2nd plot align
tm_layout(asp = 0.86,bg.color = "grey95")
Now you can specify the layout of the combined map plot as in Figure 3.6:
library(grid)
# open a new plot page
grid.newpage()
# set up the layout
pushViewport(viewport(layout=grid.layout(1,2)))
# plot using the print command
print(t1, vp=viewport(layout.pos.col = 1, height = 5))
print(t2, vp=viewport(layout.pos.col = 2, height = 5))
Figure 3.6 Examples of the use of tmap to generate multiple maps in the same plot window
Thus different plot parameters can be used for different subsets of the data such
that they are plotted in ways that are different from the default. Sometimes we
would like to label the features in our maps. Have a look at the names of the coun-
ties in the georgia_sf dataset. These are held in the 13th attribute column, and
names(georgia_sf) will return a list of the names of all attributes:
data.frame(georgia_sf)[,13]
It would be useful to display these on the map, and this can be done using the
tm_text function in tmap. The result is shown in Figure 3.7.
tm_shape(georgia_sf) +
tm_fill("white") +
tm_borders() +
tm_text("Name", size = 0.3) +
tm_layout(frame = FALSE)
And we can subset the data as with the sp format. The code below subsets the
counties of Jefferson, Jenkins, Johnson, Washington, Glascock, Emanuel, Candler,
Bulloch, Screven, Richmond and Burke:
# the county indices below were extracted from the data.frame
index <- c(81, 82, 83, 150, 62, 53, 21, 16, 124, 121, 17)
georgia_sf.sub <- georgia_sf[index,]
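An alternative to hard-coding row indices is to select the counties by name; a small sketch using the Name attribute and the %in% operator:
county.names <- c("Jefferson", "Jenkins", "Johnson", "Washington", "Glascock",
                  "Emanuel", "Candler", "Bulloch", "Screven", "Richmond", "Burke")
georgia_sf.sub <- georgia_sf[georgia_sf$Name %in% county.names, ]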
The notation for subsetting is the same as for sp objects, and enables individual
areas or polygons to be selected from spatial datasets using the bracket notation as
used in matrices, data frames and vectors. The subset can be plotted to generate
Figure 3.8 using the code below.
tm_shape(georgia_sf.sub) +
tm_fill("gold1") +
tm_borders("grey") +
tm_text("Name", size = 1) +
# add the outline
tm_shape(g) +
tm_borders(lwd = 2) +
# specify some layout parameters
tm_layout(frame = FALSE, title = "A subset of Georgia",
title.size = 1.5, title.position = c(0., "bottom"))
Finally, we can bring together the different spatial data that have been created in a
single map as in Figure 3.9 using the code below. You should note how the different
tm_shape, tm_fill etc. functions are used to set up each layer of the map and
that tmap determines the map extent from the layers:
# the 1st layer
tm_shape(georgia_sf) +
tm_fill("white") +
tm_borders("grey", lwd = 0.5) +
# the 2nd layer
tm_shape(g) +
tm_borders(lwd = 2) +
# the 3rd layer
tm_shape(georgia_sf.sub) +
tm_fill("lightblue") +
tm_borders() +
# specify some layout parameters
tm_layout(frame = T, title = "Georgia with a subset of counties",
title.size = 1, title.position = c(0.02, "bottom"))
Figure 3.7 Adding text to map objects with tmap
3.4.4 Adding Context
In some situations a map with background context may be more informative.
There are a number of options for doing this, including OpenStreetMap,1 Google
Maps and Leaflet. This requires some additional packages to be downloaded and
installed in R. If you have not done so already, install the OpenStreetMap
package and load it into R:
install.packages(c("OpenStreetMap"),depend=T)
library(OpenStreetMap)
If using OpenStreetMap, the approach is to define the area of interest, to download
and plot the map tile from OpenStreetMap and then to plot your data over the tiles.
In this case the background map area is defined by the spatial extent of the Georgia
subset created above, which is used to determine the tiles to download from
OpenStreetMap. The results of the code below are shown in Figure 3.10. Note the
use of the spTransform function in the rgdal package in the last line of the code.
Figure 3.8 A subset of the counties in the state of Georgia
1 At the time of writing, there can be some compatibility issues with the rJava package required by
OpenStreetMap. These relate to the use of 32-bit and 64-bit programs, especially on Windows PCs.
If you experience problems installing OpenStreetMap, then it is suggested that you use the 32-bit
version of R, which is also installed as part of R for Windows.
This transforms the geographical projection of the georgia.sub data to the same
projection as the OpenStreetMap data layer. Here it is easier to work with sp objects.
# define upper left, lower right corners
georgia.sub <- georgia[index,]
ul <- as.vector(cbind(bbox(georgia.sub)[2,2],
bbox(georgia.sub)[1,1]))
lr <- as.vector(cbind(bbox(georgia.sub)[2,1],
bbox(georgia.sub)[1,2]))
# download the map tile
MyMap <- openmap(ul,lr)
# now plot the layer and the backdrop
par(mar = c(0,0,0,0))
plot(MyMap, removeMargin=FALSE)
plot(spTransform(georgia.sub, osm()), add = TRUE, lwd = 2)
Google Maps can also be downloaded and used as context. Again, this package
should be installed if you have not done so already.
Figure 3.9 The result of the code for plotting a spatial object and a spatial subset
install.packages(c("RgoogleMaps"),depend=T)
Then the area for the background map data is defined to identify the tiles to be
downloaded from Google Maps. Some of the plotting commands are specific to
the packages installed – note the first step to convert the subset to PolySet
format using the SpatialPolygons2PolySet function in maptools
(loaded with GISTools) and the last line that defines a polygon plot over
Google Maps:
# load the package
library(RgoogleMaps)
# convert the subset
shp <- SpatialPolygons2PolySet(georgia.sub)
# determine the extent of the subset
bb <- qbbox(lat = shp[,"Y"], lon = shp[,"X"])
# download map data and store it
MyMap <- GetMap.bbox(bb$lonR, bb$latR, destfile = "DC.jpg")
# now plot the layer and the backdrop
par(mar = c(0,0,0,0))
PlotPolysOnStaticMap(MyMap, shp, lwd=2,
col = rgb(0.25,0.25,0.25,0.025), add = F)
Figure 3.10 A subset of Georgia with an OpenStreetMap backdrop
It is also possible to use the tmap package for context using Leaflet. Leaflet is an
open source JavaScript library used to build interactive web mapping applica-
tions (see https://rstudio.github.io/leaflet/) and is embedded within
the tmap package. It is useful if you want to embed interactive maps in
an HTML file (e.g. by using RMarkdown). The code below maps georgia.sub
with an interactive Leaflet backdrop as in Figure 3.11. Note that the interactive
mode is set through the tmap_mode function, which in this case has been set to
'view', which requires an internet connection,
with the alternative being 'plot'.
tmap_mode('view')
tmap mode set to interactive viewing
tm_shape(georgia_sf.sub) +
tm_polygons(col = "#C6DBEF80" )
Finally, remember to reset the tmap_mode to plot:
tmap_mode("plot")
Figure 3.11 An interactive map of the Georgia subset with Leaflet/OpenStreetMap
backdrop
3.4.5 Saving Your Map
Having created a map in a window on the screen, you may now want to save the
map for either printing, or incorporating in a document. There are a number of
ways that this can be done. The simplest in RStudio is to click on the Export icon
in the plot pane for saving options (in R, right-click with the mouse on the map
window), select Copy to Clipboard, and then paste it into a word-processing docu-
ment (e.g. one being created in either OpenOffice or MS Word). Another is to
use Save as Image to save the map as an image file, with a name that you give it.
However, it is also possible to save images by using the R commands that were used
to create the map. This takes more initial effort, but has the advantage that it is pos-
sible to make minor edits and changes (such as altering the position of the scale, or
drawing the census block boundaries in a different colour) and to easily rerun the
code to re-create the image file. There are a number of formats for saving maps, such
as PDF, PNG and TIFF.
One way to create a file of commands is to edit a text file with a name ending in
.R – note the capital letter. In RStudio, open a new document by selecting File >
New File > R script. Then type in the following:
# load package and data
library(GISTools)
data(newhaven)
proj4string(roads) <- proj4string(blocks)
# plot spatial data
tm_shape(blocks) +
tm_borders() +
tm_shape(roads) + tm_lines(col = "red") +
# embellish the map
tm_scale_bar(width = 0.22) +
tm_compass(position = c(0.8, 0.07)) +
tm_layout( frame = F, title = "New Haven, CT", title.size = 1.5,
title.position = c(0.55, "top"), legend.outside = T)
Save the file as newhavenmap.R in your working directory.
When you start an R session you should set the working directory to be the folder
that you wish to use to write and read data to and from, to store your command
files, such as the newhavenmap.R file, and any workspace files or .RData files
that you save. In RStudio this is Session > Set Working Directory > .... In R in
Windows it is File > Change dir... and on a Mac it is Misc > Set Working Directory.
Now go back to the R command line and enter:
source("newhavenmap.R")
and your map will be redrawn. The file contains all of the commands to draw the
map, and ‘sourcing’ it makes R run through these in sequence. Suppose you now
wish to redraw the map, but with the roads drawn in blue, rather than red. In the
file editor, go to the tm_lines command, and edit the line to become:
tm_lines(col = "blue") +
and save the file again. Re-entering source("newhavenmap.R") now draws
the map, but with the roads drawn in blue. Another parameter sometimes used in
map drawing is the line width parameter, lwd. This time, edit the tm_borders
command in the file to become:
tm_borders(lwd = 3) +
and re-enter the source command. The map is redrawn with thicker boundaries
around the census blocks. The col and lwd parameters can of course be used in
combination. Edit the file again, so that the second line becomes:
tm_lines(col = "blue", lwd = 2) +
and source the file again. This time the roads are thicker and drawn in blue.
Another advantage of saving command files, as noted earlier, is that it is pos-
sible to place the graphics created into various graphics file formats. To create a
PDF, for example, the command:
pdf(file='map.pdf')
can be placed before the first line containing a tm_shape command in the
newhavenmap.R file. This tells R that after this command, any graphics will not
be drawn on the screen, but instead are written to the file map.pdf (or whatever
name you choose for the file). When you have written all of the commands you need
to create your map, then insert the following at the end of the tmap commands:
dev.off()
This is short for device off, and tells R to close the PDF file, and go back to
drawing graphics in windows on the screen in the future. To test this out, insert
a new first line at the beginning of newhavenmap.R and a new last line at the
end. Then re-source the file. This time no new graphics are drawn, but you have
now created a set of commands to write the graphic into a PDF file called
map.pdf. This file will be created in the folder in which you are working. To check that
this has worked, open your working directory folder in Windows Explorer, Mac
Finder, etc., and there should be a file called map.pdf. Click on it and whatever
PDF reader you use should open, and your map should be displayed as a PDF file.
This file can be incorporated into presentations, word-processing documents and
so on. A similar command, for producing PNG files, is:
png(file='map.png')
which writes all subsequent R graphics into a PNG file, until a dev.off() is issued.
To test this, replace the first line of newhavenmap.R with the above command,
and re-source it from the R command line. A new file will appear in the folder called
map.png which may be incorporated into documents as with the PDF file.
Of course you do not need to load a .R file to do this! You can place the opening
and closing commands around the mapping code.
There are a number of commonly used functions for writing maps out to PDF,
PNG, TIFF, etc., files:
pdf()
png()
tiff()
Examine the help for these.
The key thing you need to know is that these functions all open a file. The open
file needs to be closed using dev.off() after the map has been written to it. So
the syntax is:
pdf(file = "MyPlot.pdf", ...)   # other settings, e.g. width and height
# ... the tmap or plot commands that draw the map ...
dev.off()
You can write a .png file for the map using the code below. Note that you may
want to set the working directory that you write to using the setwd() function.
To illustrate this the code below creates some points for the georgia_sf polygon
centroids, sets the working directory and then creates a map:
pts_sf <- st_centroid(georgia_sf)
setwd('~/Desktop/')
# open the file
png(filename = "Figure1.png", w = 5, h = 7, units = "in", res = 150)
# make the map
tm_shape(georgia_sf) +
tm_fill("olivedrab4") +
tm_borders("grey", lwd = 1) +
# the points layer
tm_shape(pts_sf) +
tm_bubbles("PctBlack", title.size = "% Black", col = "gold")+
tm_format_NLD()
# close the png file
dev.off()
3.5 MAPPING SPATIAL DATA ATTRIBUTES
3.5.1 Introduction
This section describes some approaches for displaying and mapping spatial data
attributes. Some of these ideas and commands have already been used in the pre-
ceding illustrations, but this section provides a more formal and comprehensive
description.
All of the maps that you have generated thus far have simply displayed data (e.g.
the roads in New Haven and the counties in Georgia). This is fine if the aim is sim-
ply to map the locations of different features. However, we are often interested in
identifying and analysing the properties or attributes associated with different spa-
tial features. The New Haven and Georgia datasets introduced above both contain
areas or regions within them. In the case of the New Haven one these are the census
reporting areas (census blocks or tracts), and in Georgia the counties within the
state. These areas have attributes from the population census for each spatial unit.
These attributes are held in the data frame of the spatial object. For example, in the
code above you examined the data frame of the Georgia dataset and listed the attrib-
utes of individual objects within the dataset. Figure 3.1 actually maps the
median
income of each county in Georgia, although this code was not shown.
3.5.2 Attributes and Data Frames
The attributes associated with individual features (lines, points, areas in vector data
and cell values in raster data) provide the basis for spatial analyses and geographical
investigation. Before examining attributes directly, it is important to reconsider the
data structures that are commonly used to hold and manipulate spatial data in R.
Clear your workspace and load the New Haven data, convert to sf format and
then examine the blocks, breach and tracts data:
# clear workspace
rm(list = ls())
# load & list the data
data(newhaven)
ls()
# convert to sf
blocks_sf <- st_as_sf(blocks)
breach_sf <- st_as_sf(breach)
tracts_sf <- st_as_sf(tracts)
# have a look at the attributes and object class
summary(blocks_sf)
class(blocks_sf)
summary(breach_sf)
class(breach_sf)
summary(tracts_sf)
class(tracts_sf)
You should notice a number of things from these summaries:
● Each of the datasets is spatial: blocks_sf and tracts_sf are
POLYGON sf objects and breach_sf is a POINT object.
● They all have data frames attached to them that contain attributes whose
values are summarised by the summary function.
● breach_sf only has geometry attributes – it has no thematic
attributes, it just records locations.
The data frame of these spatial objects can be accessed in order to examine, manip-
ulate or classify the attribute data. Each row in the data frame contains attribute
values associated with one of the spatial objects, the individual polygons for exam-
ple in blocks_sf, and each column describes the values associated with a par-
ticular attribute for all of the objects. Accessing the data frame allows you to read,
alter or compute new attributes. Entering:
data.frame(blocks_sf)
would print all of the attribute information for each census block in New Haven to
the R console window, until the print limit was reached, while:
head(data.frame(blocks_sf))
prints out the first six rows. The attributes can be individually identified using their
names. To see the list of column names enter:
colnames(data.frame(blocks_sf))
# or
names(blocks_sf)
Note that for sp objects, an alternative is to use @data to access the data frame of
the SpatialPolygonsDataFrame objects, as well as the above code:
colnames(blocks@data)
head(blocks@data)
One of the data attributes or variables is called P_VACANT and describes the per-
centage of households that are unoccupied (i.e. vacant) in each of the blocks. To
access the variable itself, enter:
data.frame(blocks_sf$P_VACANT)
The $ operator works as it would on a standard data frame to access individual
variables (columns) in the data frame. For the data frames of spatial objects a short-
hand exists to access this variable. Enter:
blocks$P_VACANT
A third option is to attach the data frame. Enter:
attach(data.frame(blocks_sf))
All of the attribute variables now appear as ordinary R variables. For example, to
draw a histogram of the percentage vacant housing for each block, enter:
hist(P_VACANT)
Finally, it is good practice to detach any objects that have been attached after you
have finished using them. It is possible to attach many data frames simultaneously,
but this can lead to problems if you are not careful. To detach the data frame you
attached earlier, enter:
detach(data.frame(blocks_sf))
You can try a similar set of commands with the tracts data, but the breaches
data has no attributes: it simply records the locations of breaches of the peace. As
with any point data, the breaches of the peace data can be used to create a heat map
raster dataset.
# use kde.points to create a kernel density surface
breach.dens = st_as_sf(kde.points(breach,lims=tracts))
summary(breach.dens)
breach.dens is a raster/pixels dataset, and its attributes are held in a data frame
which can be examined:
breach.dens
Notice that this has the kernel density estimation and geometry attributes that de-
scribe the X and Y locations, and you can plot the breach.dens object:
plot(breach.dens)
Also note that you can remove the st_as_sf function from the kde.points
command to generate a SpatialPixelsDataFrame object, part of the sp fami-
ly of spatial objects. This can be plotted with the image function.
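A minimal sketch of that sp version, assuming the same newhaven data are loaded:
# the sp version of the density surface, a SpatialPixelsDataFrame
breach.dens.sp <- kde.points(breach, lims = tracts)
class(breach.dens.sp)
image(breach.dens.sp)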
A final key point about attributes is that you can create and assign new attrib-
utes to the spatial object, for both sf and sp. For example, the code below creates
a normally distributed random value for each of the 129 areas in the blocks_sf
object. Note the use of the $ to do this:
blocks_sf$RandVar <- rnorm(nrow(blocks_sf))
Of course it is more than likely that you will want to assign a new value to a spatial
object that arises from the result of some kind of analysis, data join, etc. It is very
easy to link new data attributes to spatial objects in this way.
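For example, a small hedged sketch deriving a new attribute from an existing one and attaching it to the sf object (P_OWNEROCC is a percentage, so this is simply its complement):
blocks_sf$NotOwnerOcc <- 100 - blocks_sf$P_OWNEROCC
head(data.frame(blocks_sf))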
3.5.3 Mapping Polygons and Attributes
A choropleth is a thematic map in which areas are shaded in proportion to their
attributes. The tmap package includes a number of ways of generating choropleth
maps. Enter:
tmap_mode('plot')
tm_shape(blocks_sf) +
tm_polygons("P_OWNEROCC")
This produces a map of the census blocks in New Haven, shaded by the proportion
of owner-occupied properties. The tm_polygons element automatically includes a legend
to allow the map to be interpreted, in this case the levels of owner occupancy associated
with each of the different shade colours.
There are a couple of things to note about the use of tmap. First, tmap_mode
was set to plot to generate a standard choropleth suitable for including in a
report rather than an interactive map for use in a webpage, for example. Recall
that the Leaflet mapping above used the interactive view (i.e. tmap_mode was
set to 'view'). Second, in a similar way to the ggplot operations in Chapter 2,
the tmap package constructs maps by combining different map elements. In this
case blocks_sf was passed to the tm_shape function and then the
tm_polygons function was used to specify the variable to be mapped, in this
case P_OWNEROCC.
You should note that it is also possible to pass sp format spatial objects to
tmap. Try replacing tm_shape(blocks_sf) with tm_shape(blocks) in
the code above and below. Also note that in this case the variable P_OWNEROCC
was mapped using five classes of equal interval. Try repeating the tmap code
above using a different variable such as P_VACANT. What happens? You will see
that tmap automatically determines the number of classes to be included and the
class intervals or breaks. Finally, a colour shading scheme is automatically allo-
cated to the map and the legend is included in the map. All of these, and many
of the other default mapping settings that tmap uses, can be controlled and
modified.
For example, to control the class intervals, the breaks parameter can be specified:
tm_shape(blocks_sf) +
tm_polygons("P_OWNEROCC", breaks=seq(0, 100, by=25))
This can be done in many different ways:
tm_shape(blocks_sf) +
tm_polygons("P_OWNEROCC", breaks=c(10, 40, 60, 90))
The legend placement and title can be modified. The tm_layout function is very
useful here:
tm_shape(blocks_sf) +
tm_polygons("P_OWNEROCC", title = "Owner Occ") +
tm_layout(legend.title.size = 1,
legend.text.size = 1,
legend.position = c(0.1, 0.1))
You could also try legend.position = c("center", "bottom"). Further
documentation on tm_layout can be found at https://www.rdocumentation.
org/packages/tmap/versions/1.11/topics/tm_layout.
It is also possible to alter the colours used in a shading scheme. The default
colour scheme uses increasing intensities of yellow to red. Graduated lists of col-
ours like this are generated using the RColorBrewer package, which is auto-
matically loaded with both tmap and GISTools. This package makes
use of a
set of colour palettes designed by Cynthia Brewer, intended to optimise the per-
ceptual difference between each shade in the palette, so that visually each shading
colour is distinct. The palettes available in this package are displayed with the
command:
display.brewer.all()
This displays the various colour palettes and their names in a plot window.
To generate a list of colours from one of these palettes, for example, enter the
following:
brewer.pal(5,'Blues')
[1] "#EFF3FF" "#BDD7E7" "#6BAED6" "#3182BD" "#08519C"
This is a list of colour codes used by R to specify the palette. The brewer.pal
arguments specify that a five-stage palette based on shades of blue is required.
The output of brewer.pal can be fed into tmap to give alternative colours in
shading schemes. For example, enter the code below and a choropleth map shad-
ed in red is displayed with its legend. The palette argument in tm_polygons
specifies the new colours in the shading scheme.
tm_shape(blocks_sf) +
tm_polygons("P_OWNEROCC", title = "Owner Occ", palette = "Reds") +
tm_layout(legend.title.size = 1)
Note that the same map would be produced if the tm_fill function were used
instead of tm_polygons; however, without a tm_borders function, the census
block outlines are not plotted. Try entering:
tm_shape(blocks_sf) +
tm_fill("P_OWNEROCC", title = "Owner Occ", palette = "Blues") +
tm_layout(legend.title.size = 1)
Figure 3.12 Different choropleth maps of owner-occupied properties in New Haven using
different shades and class intervals
A final adjustment is to change the way the class interval boundaries are com-
puted. By default they are based on equal-sized intervals of the attribute being
mapped, but other interval styles are available. Have a look at the help for tm_
polygons and you will see that a number of different classification styles can be
set through the style parameter. You should explore these. The class intervals can be changed to quantiles or any
other range of intervals using the breaks parameter. For example, the code below
produces three maps in Figure 3.12 with equal intervals (left), with intervals based
on k-means (middle) and with quantiles (right), using the quantileCuts func-
tion in GISTools, and using the pushViewport function in the grid package
as before to plot multiple maps together.
# with equal intervals: the tmap default
p1 <- tm_shape(blocks_sf) +
tm_polygons("P_OWNEROCC", title = "Owner Occ", palette = "Blues") +
tm_layout(legend.title.size = 0.7)
# with style = kmeans
p2 <- tm_shape(blocks_sf) +
tm_polygons("P_OWNEROCC", title = "Owner Occ", palette = "Oranges",
style = "kmeans") +
tm_layout(legend.title.size = 0.7)
# with quantiles
p3 <- tm_shape(blocks_sf) +
tm_polygons("P_OWNEROCC", title = "Owner Occ", palette = "Greens",
breaks = c(0, round(quantileCuts(blocks$P_OWNEROCC, 6), 1))) +
tm_layout(legend.title.size = 0.7)
# Multiple plots using the grid package
library(grid)
grid.newpage()
# set up the layout
pushViewport(viewport(layout=grid.layout(1,3)))
# plot using the print command
print(p1, vp=viewport(layout.pos.col = 1, height = 5))
print(p2, vp=viewport(layout.pos.col = 2, height = 5))
print(p3, vp=viewport(layout.pos.col = 3, height = 5))
It is also possible to display a histogram of the distribution of the variable or
attribute being mapped using the legend.hist parameter. This is very useful
for choropleth mapping as it gives a distribution of the attributes being examined.
Bringing this all together allows you to create a map with a number of refine-
ments as in Figure 3.13. Note, for example, the minus sign prefixed to the palette
name ("-GnBu") to reverse the palette order, and the various parameters passed to the
tm_layout function.
tm_shape(blocks_sf) +
tm_polygons("P_OWNEROCC", title = "Owner Occ", palette = "-GnBu",
breaks = c(0, round(quantileCuts(blocks$P_OWNEROCC, 6), 1)),
legend.hist = T) +
tm_scale_bar(width = 0.22) +
tm_compass(position = c(0.8, 0.07)) +
tm_layout(frame = F, title = "New Haven",
title.size = 2, title.position = c(0.55, "top"),
legend.hist.size = 0.5)
Figure 3.13 An illustration of the various options for mapping with tmap
It is possible to compute certain derived attribute values on the fly in tmap. The
code below first assigns a projection to the tracts layer (taken from the blocks
layer) and converts it to sf, then plots population density using the convert2density
parameter applied to the POP1990 attribute.
# add a projection to tracts data and convert tracts data to sf
proj4string(tracts) <- proj4string(blocks)
tracts_sf <- st_as_sf(tracts)
tracts_sf <- st_transform(tracts_sf, "+proj=longlat +ellps=WGS84")
# plot
tm_shape(blocks_sf) +
tm_fill(col="POP1990", convert2density=TRUE,
style="kmeans", title=expression("Population (per " * km^2 * ")"),
legend.hist=F, id="name") +
tm_borders("grey25", alpha=.5) +
# add tracts context
tm_shape(tracts_sf) +
tm_borders("grey40", lwd=2) +
tm_format_NLD(bg.color="white", frame = FALSE,
legend.hist.bg.color="grey90")
The convert2density option automatically converts the projection units (in
this case degrees of latitude and longitude) to a projection in metres and then
determines areal density in square kilometres. You can check this by creating your
own population density values and examining the explanations of how the functions
operate in the help pages for the functions used, such as st_area.
Compare the population density summary with the legend of the figure created
using the code above:
# add an area in km^2 to blocks
blocks_sf$area = st_area(blocks_sf) / (1000*1000)
# calculate population density manually
summary(blocks_sf$POP1990/blocks_sf$area)
A final consideration is the ability of tmap to map multiple attributes in the same
operation. The code below plots two attributes in the same call (Figure 3.14):
tm_shape(blocks_sf) +
tm_fill(c("P_RENTROCC", "P_BLACK")) +
tm_borders() +
tm_layout(legend.format = list(digits = 0),
legend.position = c("left", "bottom"),
legend.text.size = 0.5,
legend.title.size = 0.8)
In summary, the tm_fill and tm_polygons functions in the tmap package
generate choropleth maps of attributes held in spatial polygons data frame (sp) or
simple feature (sf) data objects. They automatically shade the variables using
equal intervals. The intervals and the palettes can both be adjusted. It is instructive
to examine the plotting functions and the way they operate. Enter:
Figure 3.14 tmap choropleth maps of census blocks in New Haven showing the
percentage of houses rented and occupied (P_RENTROCC) and the percentage of the
population recorded as black (P_BLACK)
tm_polygons
The function code detail is displayed in the R console window. You will see that it
takes a number of arguments and a number of default parameters. In addition to
using the R help system to understand functions, examining functions in this way
can also provide you with insight into their operation.
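Two further ways to inspect a function without printing its entire definition are the
base R args and body functions; a brief sketch:
# list just the arguments and their defaults
args(tm_polygons)
# display the body of the function
body(tm_polygons)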
3.5.4 Mapping Points and Attributes
Point data can be mapped in R, as well as polygons and lines. The newhaven data
include locations of reports of ‘breaches of the peace’. These events are essentially
public disorder incidents, on many occasions requiring police intervention. The
data are stored in a variable called breach, which was converted to sf format
above. Plotting this variable works in the same way as plotting polygons or lines,
using the tm_shape function:
tm_shape(blocks_sf) +
tm_polygons("white") +
tm_shape(breach_sf) +
tm_dots(size = 0.5, shape = 19, col = "red", alpha = 1)
This plots the locations of each of the breach of peace incidents with a symbol above
the blocks_sf layer using the tm_dots function. This can take a number of
parameters, including those to control the point size, colour and shape. The shape
is drawn from the core R pch (plot character) argument. You should examine the
help for pch and for points to see the different symbols (or shapes in the language
of tmap) that can be used.
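A quick way to see the standard plot characters that pch refers to is simply to plot
them; a minimal sketch using base R graphics:
# plot the 25 standard pch symbols along a diagonal, one symbol per value
plot(1:25, pch = 1:25, cex = 2, xlab = "pch value", ylab = "")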
If you have very dense point data then one point may obscure another. Adding
some transparency to the points can help visualise dense point data. The alpha
parameter can be used to add a transparency term to the colour. Try adjusting the
code above to change the transparency and the plot character. For example:
tm_shape(breach_sf) +
tm_dots(size = 0.5, shape = 19, col = "red", alpha = 0.5)
Commonly, point data come in a tabular format rather than as an R spatial
object (i.e. of class sp or sf format), with attributes that include the latitude and
longitude or easting and northing of the individual data points. One such dataset
is the quakes dataset included as part of R. It provides the locations of 1000 seis-
mic events (earthquakes) near Fiji. To load and examine the data enter:
# load the data
data(quakes)
# look at the first 6 records
head(quakes)
You will see that the data come with a number of attributes: lat, long, depth,
mag and stations. Here you will use the lat and long attributes to create a
spatial points dataset in sf format with the attributes included. Creating spatial
data from scratch in sf is a bit convoluted, so perhaps the easiest way is to create
an sp object and convert it. This is done in the code below:
# define the coordinates
coords.tmp <- cbind(quakes$long, quakes$lat)
# create the SpatialPointsDataFrame
quakes.sp <- SpatialPointsDataFrame(coords.tmp,
data = data.frame(quakes),
proj4string = CRS("+proj=longlat "))
Transparency can also be added to shading colours manually. Remember
that the full set of predefined and named colours available in R can be listed
by entering colours(). You can also list the colours in the RColorBrewer
palettes: to see the palettes enter display.brewer.all(), and to list the
colours in an individual palette enter brewer.pal(5, "Reds"). Any of these
can be used in the call above. Additionally, a transparency term can be added
to colours and palettes using the add.alpha function in the GISTools
package. For 50% transparency enter add.alpha(brewer.pal(5, "Reds"), 0.5).
# convert to sf
quakes_sf <- st_as_sf(quakes.sp)
The result can be mapped as shown in Figure 3.15, which shows the spatial context
of the data in the Pacific Ocean, to the north of New Zealand.
Figure 3.15 A plot of the Fiji earthquake data
# map the quakes
tm_shape(quakes_sf) +
tm_dots(size = 0.5, alpha = 0.3)
The last bit of code nicely illustrates how to create a spatial dataset in sp or sf
format in R. Essentially the sequence is:
● define the coordinates for the spatial object
● assign these to an sp class of object as in Table 3.1
● then, if required, convert the sp object to sf
You should examine the help for these classes of objects. In brief, points just need
coordinate pairs, but polygons and lines need lists of coordinates for each object.
help("SpatialPoints-class")
help("sf")
You will have noticed that the quakes dataset has an attribute describing the
depth of each earthquake. We can visualise the depths in a number of ways – for
example, by plotting all the data points, but specifying the size of each data point
to be proportional to the depth attribute, or by using choropleth mapping as above
with tmap. These are shown in the code blocks below and in the results are in
Figure 3.16. As a reminder, when you run this code and the other code in this
book, you should try manipulating and changing the parameters that are used to
explore different mapping approaches. The code below uses different plot charac-
ter sizes and colours to indicate the magnitude of the variable being considered:
library(grid)
# by size
p1 <- tm_shape(quakes_sf)+
tm_bubbles("depth", scale = 1, shape = 19, alpha = 0.3,
title.size="Quake Depths")
# by colour
p2 <- tm_shape(quakes_sf)+
tm_dots("depth", shape = 19, alpha = 0.5, size = 0.6,
palette = "PuBuGn",
title="Quake Depths")
# multiple plots using the grid package
grid.newpage()
# set up the layout
pushViewport(viewport(layout=grid.layout(1,2)))
# plot using the print command
print(p1, vp=viewport(layout.pos.col = 1, height = 5))
print(p2, vp=viewport(layout.pos.col = 2, height = 5))
It is also possible to select specific data subsets to plot. The code below just maps
earthquakes that have a magnitude greater than 5.5:
# create the index
index <- quakes_sf$mag > 5.5
summary(index)
# select the subset assign to tmp
tmp <- quakes_sf[index,]
# plot the subset
tm_shape(tmp) +
tm_dots( col=brewer.pal(5, "Reds")[4], shape=19,
alpha=0.5, size = 1) +
tm_layout(title="Quakes > 5.5",
title.position = c("centre", "top"))
The code used above includes logical operators and illustrates how they can
be used to select elements that satisfy some condition. These can be used
singularly or in combination to select in the following way:
Finally it is possible to use the PlotOnStaticMap function from the RgoogleMaps
package to plot the earthquake locations with some context from Google Maps.
This is similar to Figure 3.10, which mapped a subset of Georgia counties against an
OpenStreetMap backdrop. This time, points rather than polygons are being mapped
and different Google Maps backdrops are being used as context: standard in Figure 3.17
and satellite imagery in Figure 3.18. The code for Figure 3.17 is as follows:
library(RgoogleMaps)
# define Lat and Lon
Lat <- as.vector(quakes$lat)
Long <- as.vector(quakes$long)
# get the map tiles
# you will need to be online
MyMap <- MapBackground(lat=Lat, lon=Long)
Figure 3.16 Plotting points with plot size (left) and plot colour (right) related to the
attribute value.
data <- c(3, 6, 9, 99, 54, 32, -102)
index <- (data == 32 | data <= 6)
data[index]
These operations are described in greater detail in Chapter 4.
# define a size vector
tmp <- 1+(quakes$mag - min(quakes$mag))/max(quakes$mag)
PlotOnStaticMap(MyMap,Lat,Long,cex=tmp,pch=1,col='#FB6A4A30')
And here is the code for Figure 3.18:
MyMap <- MapBackground(lat=Lat, lon=Long, zoom = 10,
maptype = "satellite")
PlotOnStaticMap(MyMap,Lat,Long,cex=tmp,pch=1,
col='#FB6A4A50')
Figure 3.17 Plotting points with a standard Google Maps context
3.5.5 Mapping Lines and Attributes
This section considers line data spatial objects. These can be defined in a number
of ways and typically describe different network features such as roads. The first
step in the code below assigns a coordinate system to roads and then selects a
subset. This involves defining a polygon to clip the road data to, and converting
the datasets to sf objects.
data(newhaven)
proj4string(roads) <- proj4string(blocks)
Figure 3.18 Plotting points with Google Maps satellite image context
# 1. create a clip area
xmin <- bbox(roads)[1,1]
ymin <- bbox(roads)[2,1]
xmax <- xmin + diff(bbox(roads)[1,]) / 2
ymax <- ymin + diff(bbox(roads)[2,]) / 2
xx = as.vector(c(xmin, xmin, xmax, xmax, xmin))
yy = as.vector(c(ymin, ymax, ymax, ymin, ymin))
# 2. create a spatial polygon from this
crds <- cbind(xx,yy)
Pl <- Polygon(crds)
ID <- "clip"
Pls <- Polygons(list(Pl), ID=ID)
SPls <- SpatialPolygons(list(Pls))
df <- data.frame(value=1, row.names=ID)
clip.bb <- SpatialPolygonsDataFrame(SPls, df)
proj4string(clip.bb) <- proj4string(blocks)
# 3. convert to sf
# convert the data to sf
clip_sf <- st_as_sf(clip.bb)
roads_sf <- st_as_sf(roads)
# 4. clip out the roads and the data frame
roads_tmp <- st_intersection(st_cast(clip_sf), roads_sf)
Note that the last line generates a warning. This is because the st_intersection
function operates on the attributes as well as the geometries, under the assumption
that attribute values are spatially constant throughout each geometry. You can
avoid this either by replacing the last line with:
st_intersection(st_geometry(st_cast(clip_sf)), st_geometry(roads_sf))
or by making the assumption that the attribute is constant throughout the geome-
try explicitly before the intersection as follows:
st_agr(x) = "constant"
st_agr(y) = "constant"
where x is assigned st_cast(clip_sf) and y assigned roads_sf.
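A minimal sketch of this second approach, using x and y as described above (the
attribute-geometry relationship is declared constant before the intersection, so no
warning should be generated):
# declare the attributes to be constant throughout each geometry
x <- st_cast(clip_sf)
y <- roads_sf
st_agr(x) <- "constant"
st_agr(y) <- "constant"
# intersect without the warning
roads_tmp <- st_intersection(x, y)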
Having prepared the roads data subset in this way, a number of methods for
mapping spatial lines can be illustrated. These include maps based on classes
and continuous variables or attributes contained in the data frame. As before we
can start with a straightforward map which is then embellished in different
ways: shading by road type (the AV_LEGEND attribute) and line thickness
defined by road segment length (the attribute LENGTH_MI). The maps are
shown in Figure 3.19; note the different ways that the legend titles are specified.
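A minimal sketch of how such maps might be produced (assuming the roads_tmp
subset created above, the AV_LEGEND and LENGTH_MI attributes named in the text,
and the tm_lines function, whose title.col and title.lwd arguments set the legend
titles):
# a simple map of the road subset
tm_shape(roads_tmp) + tm_lines()
# shaded by road type, with the legend title set via title.col
tm_shape(roads_tmp) + tm_lines(col = "AV_LEGEND", title.col = "Road Type")
# line width scaled by segment length, with the legend title set via title.lwd
tm_shape(roads_tmp) + tm_lines(lwd = "LENGTH_MI", scale = 5, title.lwd = "Segment length")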
Figure 3.19 A subset of the New Haven roads data, plotted in different ways: simple,
shaded using an attribute, and line width based on an attribute
3.5.6 Mapping Raster Attributes
Earlier in this chapter a SpatialPixelsDataFrame object was created using a
kernel density function. In this section the Meuse dataset, included as part of the
sp package, will be used to illustrate how raster attributes can be mapped in R.
Load the meuse.grid dataset and examine its properties using the class
and summary functions.
# you may have to install the raster package
# install.packages("raster", dep = T)
library(raster)
data(meuse.grid)
class(meuse.grid)
summary(meuse.grid)
You should notice that meuse.grid is a data.frame object and that it has seven
attributes including an easting (x) and a northing (y). These are described in
the meuse.grid help pages (enter ?meuse.grid). The spatial properties of
the dataset can be examined by plotting the easting and northing attributes:
plot(meuse.grid$x, meuse.grid$y, asp = 1)
And it can be converted to a SpatialPixelsDataFrame object as described
in the help page for SpatialPixelsDataFrame and then to raster format.
Note that, at the time of writing, the sf package does not have raster functionality.
However, the raster package by Hijmans and van Etten (2014) handles gridded
raster data excellently.
meuse.sp = SpatialPixelsDataFrame(points =
meuse.grid[c("x", "y")], data = meuse.grid,
proj4string = CRS("+init=epsg:28992"))
meuse.r <- as(meuse.sp, "RasterStack")
To explore the data, you could try the simple plot and spplot functions as in
the code below. For the RasterStack object (meuse.r) the plot function displays
all of the layers, and for the sp object it plots the specified layers of the Meuse grid:
plot(meuse.r)
plot(meuse.sp[,5])
spplot(meuse.sp[, 3:4])
image(meuse.sp[, "dist"], col = rainbow(7))
spplot(meuse.sp, c("part.a", "part.b", "soil", "ffreq"),
col.regions=topo.colors(20))
However, it is possible to exercise more control over the mapping of the attributes
held in the raster object using the functionality of tmap. Some examples of tmap
mapping routines using tm_raster and different shading schemes are shown in
Figure 3.20, and with an interactive map context in Figure 3.21.
# set the tmap mode to plot
tmap_mode('plot')
# map dist and ffreq attributes
tm_shape(meuse.r) +
tm_raster( col = c("dist", "ffreq"), title = c("Distance","Flood Freq"),
palette = "Reds", style = c("kmeans", "cat"))
# set the tmap mode to view
tmap_mode('view')
# map the dist attribute
tm_shape(meuse.r) +
tm_raster(col = "dist", title = "Distance", breaks = c(seq(0,1,0.2))) +
tm_layout(legend.format = list(digits = 1))
Figure 3.20 Maps of the Meuse raster data
You could also experiment with some of the refinements as with the tm_polygons
examples above. For example:
tm_shape(meuse.r) +
tm_raster(col="soil", title="Soil",
palette="Spectral", style="cat") +
tm_scale_bar(width = 0.3) +
tm_compass(position = c(0.74, 0.05)) +
tm_layout( frame = F, title = "Meuse flood plain",
title.size = 2, title.position = c("0.2", "top"),
legend.hist.size = 0.5)
3.6 SIMPLE DESCRIPTIVE STATISTICAL ANALYSES
The final section of this chapter before the self-test questions describes how to
develop some basic descriptive statistical analyses of attributes held in R
data.frame objects. These are intended to provide an introduction to methods for
analysing the properties of spatial data attributes which will be extended in more
formal treatments of statistical and spatial analyses in later chapters. This section
first describes approaches for examining the properties of data variables using
histograms and boxplots, and then extends this to consider some simple ways of
analysing data variables in relation to each other using scatter plots and simple
regressions, before showing how mosaic plots can be used to visualise relation-
ships between variables. Importantly, a number of standard plotting routines are
introduced alongside their ggplot versions. You should load the tidyverse package
which includes ggplot2, and the reshape2 package which includes some data
manipulation functions:
Figure 3.21 Dynamic maps of the Meuse raster data with a Leaflet backdrop
install.packages("tidyverse", dep = T)
install.packages("reshape2", dep = T)
3.6.1 Histograms and Boxplots
There are a number of ways of generating simple summaries of any variable. The
function table can be used to summarise the counts of categorical or discrete
data, summary and fivenum provide summaries of continuous variables, and
histograms and boxplots can provide visual summaries. You should make sure the
New Haven data are loaded from the GISTools package and then use these func-
tions to explore the P_VACANT variable in blocks.
For example, typing summary(blocks$P_VACANT) or fivenum(blocks$P_VACANT)
will produce other summaries of the distribution of the variable. R has some
in-built functions for generating histograms and boxplots
with the hist and boxplot functions. However, as described in Chapter 2,
the ggplot2 package also includes functions for these visual data summa-
ries. Code for both standard R and ggplot operations is provided in the
snippets below; note the adjustment to the histogram bin sizes and the plot
labels.
data(newhaven)
# the tidyverse package loads the ggplot2 package
library(tidyverse)
# standard approach with hist
hist(blocks$P_VACANT, breaks = 40, col = "cyan",
border = "salmon",
main = "The distribution of vacant property percentages",
xlab = "percentage vacant", xlim = c(0,40))
# ggplot approach
ggplot(blocks@data, aes(P_VACANT)) +
geom_histogram(col = "salmon", fill = "cyan", bins = 40) +
xlab("percentage vacant") +
labs(title = "The distribution of vacant property percentages")
A further way of providing visual descriptive summaries of variables is to use
box-and-whisker plots via the boxplot function in R and the geom_boxplot
function in ggplot2. These can summarise a single variable or multiple
variables together. Here we will focus on the geom_boxplot function in the
ggplot2 package. In order to illustrate this, the blocks dataset can be split
into high- and low-vacancy areas based on whether the proportion of properties
vacant is greater than 10%. Setting the vac attribute as a factor is important
for both approaches, and the melt function in the reshape2 package is
critical for many ggplot operations. You should examine the result of running
melt(blocks@data). The geom_boxplot functions can be used to visualise
the differences between these two subsets in terms of the distribution of owner
occupancy and the proportion of different ethnic groups, as in Figure 3.22. First
pre-process the data:
library(reshape2)
# a logical test
index <- blocks$P_VACANT > 10
# assigned to 2 high, 1 low
blocks$vac <- index + 1
blocks$vac <- factor(blocks$vac, labels = c("Low", "High"))
Then apply the geom_boxplot function:
library(ggplot2)
ggplot(melt(blocks@data[, c("P_OWNEROCC", "P_WHITE", "P_BLACK", "vac")]),
aes(variable, value)) +
geom_boxplot() +
facet_wrap(~vac)
Figure 3.22 Box-and-whisker plot examples
The boxplot can be enhanced in many ways in ggplot. Some parameters are
used below. You may wish to search for examples of different themes and ways of
manipulating boxplots.
ggplot( melt(blocks@data[, c("P_OWNEROCC", "P_WHITE", "P_BLACK", "vac")]),
aes(variable, value)) +
geom_boxplot(colour = "yellow", fill = "wheat", alpha = 0.7) +
facet_wrap(~vac) +
xlab("") +
ylab("Percentage") +
theme_dark() +
ggtitle("Boxplot of High and Low property vacancies")
3.6.2 Scatter Plots and Regressions
The differences in the two subgroups suggest that there may be some statistical asso-
ciation between the amount of vacant properties and the proportions of different
ethnic groups, typically due to well-known socio-economic inequalities and power
imbalances. First, we can plot the data to see if we can visually identify any trends:
plot(blocks$P_VACANT/100, blocks$P_WHITE/100)
plot(blocks$P_VACANT/100, blocks$P_BLACK/100)
The scatter plots suggest that there may be a negative relationship between the pro-
portion of white people in a census block and the proportion of vacant properties
and that there may be a positive association with the proportion of black people.
It is difficult to be confident in these statements, but they can be examined more
formally by using simple regression models as estimated by the lm function and
then plotting the coefficient estimates or slopes.
# assign some variables
p.vac <- blocks$P_VACANT/100
p.w <- blocks$P_WHITE/100
p.b <- blocks$P_BLACK/100
# bind these together
df <- data.frame(p.vac, p.w, p.b)
# fit regressions
mod.1 <- lm(p.vac ~ p.w, data = df)
mod.2 <- lm(p.vac ~ p.b, data = df)
The function lm is used in R to fit regression models (lm stands for ‘linear
model’). The models to be fitted are specified in a special notation in R.
Effectively a model description is an R variable of its own. Although we do
not go into detail about the modelling language in this book, more can be
found in, for example, de Vries and Meys (2012: Chapter 15); for now, it is
sufficient to know that the R notation y ~ x suggests the basic regression
model y = ax + b. The notation is sufficiently rich to allow the specification of
a very broad set of linear models.
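A toy sketch (not from the New Haven data) may help to fix the notation: simulated
data with a known slope and intercept are passed to lm using the y ~ x formula, and
coef recovers the estimates:
# simulate data where y = 2x + 1 plus a little noise
set.seed(1)
x <- runif(100)
y <- 2 * x + 1 + rnorm(100, sd = 0.1)
toy.mod <- lm(y ~ x)
coef(toy.mod) # intercept should be close to 1, slope close to 2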
The two models above can be interpreted as follows: mod.1 describes the
extent to which changes in p.vac are associated with changes in p.w; mod.2
describes the extent to which changes in p.vac are associated with changes in
p.b. The coefficients can be inspected, and it is evident that the proportion of
white people is a weak negative predictor of the proportion of vacant proper-
ties in a census block and that the proportion of black people is a weak positive
predictor. Specifically, the model suggests relationships that indicate that the
amount of vacant properties in a census block decreases by 1% for each 3.5%
increase in the proportion of white people and that it increases by 1% for
each 3.7% increase in the proportion of black people in the census block.
However, the model fits are poor (examine the R-squared values), and when a
multivariate analysis model is computed, neither is found to be a significant
predictor of vacant properties. The models can be examined using the
summary command:
summary(mod.1)
Call:
lm(formula = p.vac ~ p.w, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.11747 -0.03729 -0.01199 0.01714 0.28271
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.11747 0.01092 10.755 <2e-16 ***
p.w -0.03548 0.01723 -2.059 0.0415 *
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.06195 on 127 degrees of freedom
Multiple R-squared: 0.03231, Adjusted R-squared: 0.02469
F-statistic: 4.24 on 1 and 127 DF, p-value: 0.04152
# not run below
# summary(mod.2)
# summary(lm(p.vac ~ p.w + p.b, data = df))
The trends can be plotted with the data as in Figure 3.23.
p1 <- ggplot(df,aes(p.vac, p.w))+
#stat_summary(fun.data=mean_cl_normal) +
geom_smooth(method='lm') +
geom_point() +
xlab("Proportion of Vacant Properties") +
ylab("Proporion White") +
labs(title="Regression of Vacant Properties against Proportion White")
p2 <- ggplot(df,aes(p.vac, p.b))+
#stat_summary(fun.data=mean_cl_normal) +
geom_smooth(method='lm') + geom_point() +
xlab("Proportion of Vacant Properties") +
ylab("Proportion Black") +
labs(title="Regression of Vacant Properties against Proportion Black")
grid.newpage()
# set up the layout
pushViewport(viewport(layout=grid.layout(2,1)))
# plot using the print command
print(p1, vp=viewport(layout.pos.row = 1, height = 5))
print(p2, vp=viewport(layout.pos.row = 2, height = 5))
Figure 3.23 Plotting regression coefficient slopes
3.6.3 Mosaic Plots
For data where there is some kind of true or false statement, mosaic plots can be used
to generate a powerful visualisation of the statistical properties and relationships
between variables. What they seek to do is to compare crosstabulations of counts
(hence the need for true or false statements) against a model where proportionally
equal counts are expected, in this case of vacant housing across ethnic groups.
First install the ggmosaic package:
# install the package
install.packages("ggmosaic", dep = T)
Then prepare the data using the melt function from the reshape2 package:
# create the dataset
pops <- data.frame(blocks[,14:18]) * data.frame(blocks)[,11]
pops <- as.matrix(pops/100)
colnames(pops) <- c("White", "Black", "Ameri", "Asian", "Other")
# a true / false for vacant properties
vac.10 <- (blocks$P_VACANT > 10)
# create a crosstabulation
mat.tab <- xtabs(pops ~ vac.10)
# melt the data
df <- melt(mat.tab)
Finally, create the mosaic plot, as in Figure 3.24, using the stat_mosaic function
in the ggmosaic extension to the ggplot2 package.
# load the packages
library(ggmosaic)
# call ggplot and stat_mosaic
ggplot(data = df) +
stat_mosaic(aes(weight = value, x = product(Var2),
fill=factor(vac.10)), na.rm=TRUE) +
theme(axis.text.x=element_text(angle=-90, hjust= .1)) +
labs(y='Proportion of Vacant Properties', x = 'Ethnic group',
title="Mosaic Plot of Vacant Properties with ethnicity") +
guides(fill=guide_legend(title = "> 10 percent", reverse = TRUE))
It has the usual ggplot feel. It shows that the census blocks with vacancy levels
higher than 10% are not evenly distributed among different ethnic groups: the tiles
in the mosaic plot have areas proportional to the counts (in this case the number of
people affected).
However, unlike the mosaicplot function in the graphics package, the
stat_mosaic plot does not provide information about residuals or whether
differences are significant. The mosaicplot function can be used, as in the code
below, to create Figure 3.25:
# standard mosaic plot
ttext = sprintf("Mosaic Plot of Vacant Properties
with ethnicity")
mosaicplot(t(mat.tab), xlab='',
ylab='Vacant Properties > 10 percent',
main=ttext, shade=TRUE, las=3, cex=0.8)
Figure 3.24 An example of a ggmosaic mosaic plot
Figure 3.25 An example of a standard 'graphics' mosaic plot with residuals
Figure 3.25 contains much more information. Its shading shows which groups
are under- or overrepresented, when compared against a model of expected
equality. The blue tiles show combinations of property vacancy and ethnicity that
are higher than would be expected, with the tiles shaded deep blue corresponding
to combinations whose residuals are greater than +4, when compared to the
model, indicating a much greater frequency in those cells than would be found if
the model of equality were true. The tiles shaded deep red correspond to the
residuals less than –4, indicating much lower frequencies than would be expected.
Thus the white ethnic group is significantly more strongly associated with areas
where vacant properties make up less than 10%, and the other ethnic groups are
significantly more strongly associated with areas where vacant properties make
up more than 10%, than would be expected in a model of equal distribution.
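The shading in Figure 3.25 is based on Pearson residuals, and these can also be
inspected directly; a brief sketch using a chi-squared test of the same
crosstabulation:
# Pearson residuals underlying the mosaic plot shading
chisq.test(t(mat.tab))$residuals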
3.7 SELF-TEST QUESTIONS
This chapter has introduced a number of commands and functions for mapping
spatial data and visualising spatial data attributes. The questions in this section
present a series of tasks for you to complete that build on the methods illustrated
in the preceding sections. The answers at the end of the chapter present snippets
of code that will complete the tasks, but, as ever, you may find that your code
differs from the answers provided. This is to be expected and is not something
that should concern you as there are usually many ways to achieve the same
objectives.
The tasks seek to extend the mapping skills that you have acquired through this
chapter (as a reminder, the expectation is that you run the code embedded in the
text throughout the book) and in places greater detail and explanation of the spe-
cific techniques are given. Four general areas are covered:
● Plots and maps: working with map data
● Misrepresentation of continuous variables: using different cut functions
for choropleth mapping
● Selecting data: creating variables and subsetting data using logical
statements
● Re-projections: transforming data using spTransform
Self-Test Question 1. Plots and maps: working with map data
Your task is to write code that will produce a map of the counties in Georgia,
shaded in a colour scheme of your choice but using 10 classes describing the distri-
bution of median income in thousands of dollars (this is described by the MedInc
attribute in the data frame). The maps should include a scale bar and a legend, and
the code should write the map to a TIFF file, with a resolution of 300 dots per inch
and a map size of 7 × 7 inches.
# Hints
display.brewer.all() # to show the Brewer palettes
breaks # to specify class breaks OR
style # in the tm_fill / tm_polygons help
# Tools
library(ggplot2) # for the mapping tools
data(georgia) # the Georgia data in the GISTools package
st_as_sf(georgia) # to convert the data to sf format
tm_layout # takes many parameters, e.g. legend.position
Self-Test Question 2. Misrepresentation of continuous variables: using different
breaks for choropleth mapping
It is well known that it is very easy to lie with maps (see Monmonier, 1996). One of
the very commonly used tricks for misrepresenting the spatial distribution of phe-
nomena relates to the inappropriate categorisation of continuous variables. Your
aim in this exercise is to produce three maps that represent the same feature, and
in so doing you will investigate the impact of different functions for grouping the
continuous variable in the choropleth maps.
Write code that will create three maps, in the same window, of the numbers of
houses in the New Haven census blocks. This is described by the HSE_UNITS
variable. Apply different functions to divide the HSE_UNITS variable in the
blocks dataset into five classes in different ways based on quantiles, absolute
ranges, and standard deviations. You need not add legends, scale bars, etc., but
should include map titles.
# Hints
p1 <- tm_shape(...) # assign the plots to a variable
pushViewport # from the grid package, used earlier...
viewport # ...to plot multiple tmaps
?quantileCuts # quantiles, ranges std.dev...
?rangeCuts # ... from GISTools package
?sdCuts
breaks # to specify breaks in tm_polygon
tmap_mode('plot') # to specify a map view
# Tools
library(tmap) # for the mapping tools
library(grid) # for plotting the maps together
data(newhaven) # to load the New Haven data
Self-Test Question 3. Selecting data: creating variables and subsetting data using
logical statements
In the previous sections on mapping polygon attributes and mapping lines,
different methods for selecting or subsetting the spatial data were introduced.
These applied an overlay of spatial data using st_intersection in the sf
package to select roads within the extent of an sf polygon object, and logical
operators were used to select earthquake locations that satisfied specific criteria.
Additionally, logical operators were introduced in the previous chapter. When
applied to a variable they return true or false statements or more correctly logical
data types. In this exercise, the objective is to create a secondary attribute and
then to use a logical statement to select data objects when applied to the attribute
you create.
A company wishes to market a product to the population in rural areas. The
company has a model that says that they will sell one unit of their product for
every 20 people in rural areas who are visited by one of their sales team, and
they would like to know which counties have a rural population density of
more than 20 people per square kilometre. Using the Georgia data, you should
develop some code that selects counties based on a rural population density
measure. You will need to calculate for each county some kind of rural popula-
tion density score and map the counties in Georgia that have a score of greater
than 20 rural people per square kilometre.
# Hints
library(GISTools) # for the mapping tools
data(georgia) # use georgia2 as it has a geographical projection
help("!") # to examine logic operators
as.numeric # use to coerce new attributes you create to numeric format
# e.g. georgia.sf$NewVariable <- as.numeric(1:159)
# Tools
st_area # a function in the sf package
Self-Test Question 4. Re-projections: transforming data using spTransform and
st_transform
Spatial data come with projections, which define an underlying geodetic model over
which the spatial data are projected. Different spatial datasets need to be aligned
over the same projection for the spatial features they describe to be compared and
analysed together. National grid projections typically represent the world as a flat
surface and allow distance and area calculations to be made, which cannot be so
easily done using models that use degrees and minutes. World geodetic systems
such as WGS84 provide a standard reference system. For example, in the previous
question you worked with the georgia2 dataset which is projected in metres,
whereas georgia is projected in degrees in WGS84. And, when you plotted the
Georgia subset with an OpenStreetMap backdrop, a transform operation was used
to convert the data to the projection used in OpenStreetMap plotting. A range of
different projections are described in formats for different packages and software
on the Spatial Reference website (http://www.spatialreference.org). A
typical re-projection would be something like:
# Using spTransform in sp
new.spatial.data <- spTransform(old.spatial.data, new.Projection)
# Using st_transform in sf
new.spatial.data.sf <- st_transform(old.spatial.data.sf, new.Projection)
You should note that the data need to have a projection in order to be transformed.
Projections can be assigned if you know what the projection is. Recall the code from
earlier in this chapter using the Fiji earthquake data which assigned a projection to
the coordinates:
library(GISTools)
library(rgdal)
library(sf)
data(quakes)
coords.tmp <- cbind(quakes$long, quakes$lat)
# create the SpatialPointsDataFrame
quakes.sp <- SpatialPointsDataFrame(coords.tmp,
data = data.frame(quakes),
proj4string = CRS("+proj=longlat "))
You can examine the projection properties of the SpatialPointsDataFrame
and sf objects after the latter is created, by entering:
summary(quakes.sp)
quakes_sf <- st_as_sf(quakes.sp)
head(quakes_sf)
If the proj4string properties of sp and sf objects are empty, these can be popu-
lated if you know the spatial reference system and then the data can be transformed.
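For example, a brief sketch, assuming the objects currently have no coordinate
reference system and that the coordinates are known to be WGS84 longitude and
latitude:
# assign (not transform) a CRS to an sp object
proj4string(quakes.sp) <- CRS("+proj=longlat +ellps=WGS84")
# assign a CRS to an sf object (4326 is the EPSG code for WGS84)
st_crs(quakes_sf) <- 4326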
The objective of this exercise is to re-project the New Haven blocks and
breach datasets from their original reference system to WGS84, using both the
st_transform function in sf and the spTransform function in rgdal, and
then to plot these transformed data on an OpenStreetMap backdrop. You may find
it useful to use a transparency term in your colours.
These datasets have a local projection system, using the State Plane Coordinate
System for Connecticut, in US survey feet. You should transform the breaches of
the peace and the census blocks data to latitude and longitude by assigning a pro-
jection using the CRS function in the sp package and st_crs function in the sf
package. Then the spTransform and st_transform functions can be applied.
Having transformed the datasets, you should map the locations of the breaches of
peace and the census blocks with an OpenStreetMap backdrop. You could use the
OpenStreetMap tools directly and/or the Leaflet embedded in the tmap tools
when tmap_mode is set to 'view'.
3.8 ANSWERS TO SELF-TEST QUESTIONS
Q1: Plots and maps: working with map data. Your map should look something like
Figure 3.26.
# load the data and the packages
library(GISTools)
library(sf)
library(tmap)
data(georgia)
# set the tmap plot type
tmap_mode('plot')
# convert to sf format
georgia_sf = st_as_sf(georgia)
# create the variable
georgia_sf$MedInc = georgia_sf$MedInc / 1000
# open the tiff file and give it a name
tiff("my_map.tiff")
# start the tmap commands
tm_shape(georgia_sf) +
tm_polygons("MedInc", title = "Median Income", palette = "GnBu",
style = "equal", n = 10) +
tm_layout(legend.title.size = 1,
legend.format = list(digits = 0),
legend.position = c(0.2, "top")) +
tm_legend(legend.outside=TRUE)
# close the tiff file
dev.off()
Figure 3.26 The map produced by the code for Q1
Q2: Misrepresentation of continuous variables – using different breaks for chorop-
leth mapping. Your map should look something like Figure 3.27.
# load packages and data
library(tmap)
library(GISTools)
library(sf)
library(grid)
data(newhaven)
# convert data to sf format
blocks_sf = st_as_sf(blocks)
# 1. Initial Investigation
# You could start by having a look at the data
attach(data.frame(blocks_sf))
hist(HSE_UNITS, breaks = 20)
# You should notice that it has a normal distribution
# but with some large outliers
# Then examine different cut schemes
quantileCuts(HSE_UNITS, 6)
rangeCuts(HSE_UNITS, 6)
sdCuts(HSE_UNITS, 6)
# detach the data frame
detach(data.frame(blocks_sf))
# 2. Do the task
# a) mapping classes defined by quantiles
# define some breaks
br <- c(0, round(quantileCuts(blocks_sf$HSE_UNITS, 6),0))
# you could examine br
p1 <- tm_shape(blocks_sf) +
tm_polygons("HSE_UNITS", title="Quantiles",
palette="Reds",
breaks=br)
# b) mapping classes defined by absolute ranges
# define some breaks
br <- c(0, round(rangeCuts(blocks$HSE_UNITS, 6),0))
# you could examine br
p2 <- tm_shape(blocks_sf) +
tm_polygons("HSE_UNITS", title="Ranges",
palette="Reds",
breaks=br)
# c) mapping classes defined by standard deviations
br <- c(0, round(sdCuts(blocks$HSE_UNITS, 6),0))
# you could examine br
p3 <- tm_shape(blocks_sf) +
tm_polygons("HSE_UNITS", title="Std Dev",
palette="Reds",
breaks=br)
# open a new plot page
grid.newpage()
# set up the layout
pushViewport(viewport(layout=grid.layout(1,3)))
# plot using the print command
print(p1, vp=viewport(layout.pos.col = 1, height = 5))
print(p2, vp=viewport(layout.pos.col = 2, height = 5))
print(p3, vp=viewport(layout.pos.col = 3, height = 5))
Figure 3.27 The map produced by the code for Q2
Q3: Selecting data: creating variables and subsetting data using logical statements.
The code is below and your map should look something like Figure 3.28.
library(GISTools)
library(sf)
data(georgia)
# convert data to sf format
georgia_sf = st_as_sf(georgia2)
# calculate rural population
georgia_sf$rur.pop <- as.numeric(georgia_sf$PctRural
* georgia_sf$TotPop90 / 100)
# calculate county areas in km^2
georgia_sf$areas <- as.numeric(st_area(georgia_sf)
/ (1000*1000))
# calculate rural density
georgia_sf$rur.pop.den <- as.numeric(georgia_sf$rur.pop
/ georgia_sf$areas)
# select counties with density > 20
georgia_sf$rur.pop.den <- (georgia_sf$rur.pop.den > 20)
# map them
tm_shape(georgia_sf) +
tm_polygons("rur.pop.den",
palette=c("chartreuse4","darkgoldenrod3"),
title=expression("Pop >20 (per " ∗ km^2 ∗ ")"),
auto.palette.mapping = F)
Q4: Transforming data. Your map should look something like Figure 3.29 or Figure
3.30, depending on which way you did it! First you will need to transform the data:
library(GISTools) # for the mapping tools
library(sf) # for the mapping tools
library(rgdal) # this has the spatial reference tools
library(tmap)
library(OpenStreetMap)
data(newhaven)
# Define a new projection
newProj <- CRS("+proj=longlat +ellps=WGS84")
# Transform blocks and breach
# 1. using spTransform
breach2 <- spTransform(breach, newProj)
blocks2 <- spTransform(blocks, newProj)
# 2. using st_transform
breach_sf <- st_as_sf(breach)
blocks_sf <- st_as_sf(blocks)
breach_sf <- st_transform(breach_sf, "+proj=longlat +ellps=WGS84")
blocks_sf <- st_transform(blocks_sf, "+proj=longlat +ellps=WGS84")
Figure 3.28 The map produced by the code for Q3
Then the transformed data can be mapped using Leaflet in the tmap package:
# set the mode
tmap_mode('view')
# plot the blocks
tm_shape(blocks_sf) +
tm_borders() +
# and then plot the breaches
tm_shape(breach_sf) +
tm_dots(shape=1, size=0.1, border.col = NULL, col = "red", alpha = 0.5)
It can also be mapped using the OpenStreetMap package. For this you need to
extract the map tiles using the bounding box of the transformed data:
ul <- as.vector(cbind(bbox(blocks2)[2,2],
bbox(blocks2)[1,1]))
lr <- as.vector(cbind(bbox(blocks2)[2,1],
bbox(blocks2)[1,2]))
# download the map tile
MyMap <- openmap(ul,lr)
Figure 3.29 The tmap map produced by the code for Q4
# now plot the layer and the backdrop
par(mar = c(0,0,0,0))
plot(MyMap, removeMargin=FALSE)
# notice how the data need to be transformed
# to the internal OpenStreetMap projection
plot(spTransform(blocks2, osm()), add = TRUE, lwd = 1)
plot(spTransform(breach2, osm()), add = T, pch = 19, col = "#DE2D2650")
Figure 3.30 The OpenStreetMap map produced by the code for Q4
REFERENCES
Anselin, L. (1995) Local indicators of spatial association – Lisa. Geographical
Analysis, 27(2): 93–115.
Brunsdon, C. and Chen, H. (2014) GISTools: Some further GIS capabilities for R. R
Package Version 0.7-4. http://cran.r-project.org/package=GISTools.
de Vries, A. and Meys, J. (2012) R for Dummies. Chichester: John Wiley & Sons.
Hijmans, R.J. and van Etten, J. (2014) Raster: Geographic data analysis and
modeling. R Package Version 2.6-7. http://cran.r-project.org/package=raster.
Monmonier, M. (1996) How to Lie with Maps, 2nd edition. Chicago: University of
Chicago Press.
Ord, J.K. and Getis, A. (1995) Local spatial autocorrelation statistics: Distributional
issues and an application. Geographical Analysis, 27(4): 286–306.
Pebesma, E., Bivand, R., Cook, I., Keitt, T., Sumner, M., Lovelace, R., Wickham, H.,
Ooms, J. and Racine, E. (2016) sf: Simple features for R. R Package Version 0.6-3.
http://cran.r-project.org/package=sf.
Tennekes, M. (2015) tmap: Thematic maps. R Package Version 1. http://cran.r-project.
org/package=tmap.
4
SCRIPTING AND WRITING
FUNCTIONS IN R
4.1 OVERVIEW
As you have been working through the code and exercises in this book you have
applied a number of different tools and techniques for extracting, displaying and
analysing data. In places you have used some quite advanced snippets of code.
However, this has all been done in a step-by-step manner, with each line of code
being run individually, and the occasional function has been applied individu-
ally to a specific dataset or attribute. Quite often in spatial analysis, we would like
to do the same thing repeatedly, but adjusting some of the parameters on each
iteration – for example, applying the same algorithm to different data, different
attributes, or using different thresholds. The aim of this chapter is to introduce
some basic programming principles and routines that will allow you to do many
things repeatedly in a single block of code. This is the basics of writing computer
programs. This chapter will:
● Describe how to combine commands into loops
● Describe how to control loops using if, else, repeat, etc.
● Describe logical operators to index and control
● Describe how to create functions, test them and to make them universal
● Explain how to automate short tasks in R
● Introduce the apply family of operations and how they can be used to
apply functions to different data structures
● Introduce dplyr functions for data table manipulations and operations
4.2 INTRODUCTION
In spatial data analysis and mapping, we frequently want to apply the same set
of commands over and over again, to cycle through data or lists of data and do
things to data depending on whether some condition is met or not, and so on.
These types of repeated actions are supported by functions, loops and conditional
statements. A few simple examples serve to illustrate how R programming com-
bines these ideas through functions with conditional commands, loops and
variables.
For example, consider the following variable tree.heights:
tree.heights <- c(4.3,7.1,6.3,5.2,3.2)
We may wish to print out the first element of this variable if it has a value less than
6: this is a conditional command as the operation (in this case to print something) is
carried out conditionally (i.e. if the condition is met).
tree.heights
[1] 4.3 7.1 6.3 5.2 3.2
if (tree.heights[1] < 6) { cat('Tree is small\n') } else
{ cat('Tree is large\n')}
Tree is small
Alternatively, we may wish to examine all of the elements in the variable
tree.heights and, depending on whether each individual value meets the
condition, perform the same operation. We can carry out operations repeatedly
using a loop structure as follows. Notice the construction of the for loop in
the form:
for(variable in sequence) R expression
This is illustrated in the code below:
for (i in 1:3) {
if (tree.heights[i] < 6) { cat('Tree',i,' is small\n') }
else { cat('Tree',i,' is large\n')} }
Tree 1 is small
Tree 2 is large
Tree 3 is large
A third situation is where we wish to perform the same set of operations, group of
conditional or looped commands over and over again, perhaps to different data.
We can do this by grouping code and defining our own functions.
assess.tree.height <- function(tree.list, thresh)
{ for (i in 1:length(tree.list))
{ if(tree.list[i] < thresh) {cat('Tree',i, ' is small\n')}
else { cat('Tree',i,' is large\n')}
}
}
assess.tree.height(tree.heights, 6)
Tree 1 is small
Tree 2 is large
Tree 3 is large
Tree 4 is small
Tree 5 is small
tree.heights2 <- c(8,4.5,6.7,2,4)
assess.tree.height(tree.heights2, 4.5)
Tree 1 is large
Tree 2 is large
Tree 3 is large
Tree 4 is small
Tree 5 is small
Notice how the code in the function assess.tree.height above modifies the
original loop: rather than for(i in 1:3) it now uses the length of the variable
1:length(tree.list) to determine how many times to loop through the data.
Also a variable thresh was used for whatever threshold the user wishes to specify.
The sections in this chapter develop more detailed ideas around functions,
loops and conditional statements and the testing and debugging of functions in
order to support automated analyses in R.
4.3 BUILDING BLOCKS FOR PROGRAMS
In the examples above, a number of programming concepts were introduced.
Before we start to develop these more formally into functions it is important to
explain these ingredients in a bit more detail.
4.3.1 Conditional Statements
Conditional statements test to see whether some condition is TRUE or FALSE, and
if the answer is TRUE some specific actions are undertaken. Conditional statements
are composed of if and else.
The if statement is followed by a condition, an expression that is evaluated,
and then a consequent to be executed if the condition is TRUE. The format of an if
statement is:
if – condition – consequent
Actually this could be read as ‘if the condition is true then the consequent is…’. The
components of a conditional statement are:
● the condition, an R expression that is either TRUE or FALSE
● the consequent, any valid R statement which is only executed if the
condition is TRUE
For example, consider the simple case below where the value of x is changed and
the same condition is applied. The results are different because of the different
values assigned to x: in the first case a statement is printed to the console, in the
second it is not.
x <- -7
if (x < 0) cat("x is negative")
x is negative
x <- 8
if (x < 0) cat("x is negative")
Frequently if statements also have an alternative consequent that is executed when
the condition is FALSE. Thus the format of the conditional statement is expanded to:
if – condition – consequent – else – alternative
Again, this could be read as ‘if the condition is true then do the consequent; or, if
the condition is not true then do the alternative’. The components of a conditional
statement that includes an alternative are:
● the condition, an R expression that is either TRUE or FALSE
● the consequent and alternative, which can be any valid R statements
● the consequent is executed if the condition is TRUE
● the alternative is executed if the condition is FALSE
The example is expanded below to accommodate the alternative:
x <- -7
if (x < 0) cat("x is negative") else cat("x is positive")
x is negative
x <- 8
if (x < 0) cat("x is negative") else cat("x is positive")
x is positive
The condition statement is composed of one or more logical operators and in R
these are defined in Table 4.1. In addition, R contains a number of logical func-
tions which can also be used to evaluate conditions. A sample of these is listed in
Table 4.2 but many others exist.
Table 4.1 Logical operators
Logical operator Description
== Equal
!= Not equal
> Greater than
< Less than
>= Greater than or equal
<= Less than or equal
! Not (goes in front of other expressions)
& And (combines expressions)
| Or (combines expressions)
Table 4.2 Logical functions
Logical function Description
any(x) TRUE if any in a vector of conditions x is true
all(x) TRUE if all of a vector of conditions x is true
is.numeric(x) TRUE if x contains a numeric value
is.logical(x) TRUE if x contains a true or false value
is.character(x) TRUE if x contains a character value
There are quite a few more is-type functions (i.e. logical evaluation functions)
that return TRUE or FALSE statements that can be used to develop conditional
tests. To explore these enter:
??is.
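A few quick examples of is-type functions in use (a brief sketch):
is.numeric(3.14) # TRUE
is.character("abc") # TRUE
is.logical(0) # FALSE: 0 is numeric, not logical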
The examples below illustrate how the logical tests all and any may be incorpo-
rated into conditional statements:
x <- c(1,3,6,8,9,5)
if (all(x > 0)) cat("All numbers are positive")
All numbers are positive
x <- c(1,3,6,-8,9,5)
if (any(x > 0)) cat("Some numbers are positive")
Some numbers are positive
any(x==0)
[1] FALSE
4.3.2 Code Blocks
Frequently we wish to execute a group of consequent statements together if, for
example, some condition is TRUE. Groups of statements are called code blocks and
in R are contained by { and }. The examples below show how code blocks can be
used if a condition is TRUE to execute consequent statements and can be expanded
to execute alternative statements if the condition is FALSE.
x <- c(1,3,6,8,9,5)
if (all(x > 0)) {
cat("All numbers are positive\n")
total <- sum(x)
cat("Their sum is ",total) }
All numbers are positive
Their sum is 32
The curly brackets are used to group the consequent statements: that is, they con-
tain all of the actions to be performed if the condition is met (i.e. is TRUE) and all of
the alternative actions if the condition is not met (i.e. is FALSE):
if condition { consequents } else { alternatives }
These are illustrated in the code below:
x <- c(1,3,6,8,9,-5)
if (all(x > 0)) {
cat("All numbers are positive\n")
total <- sum(x)
cat("Their sum is ",total) } else {
cat("Not all numbers are positive\n")
cat("This is probably an error as numbers are rainfall levels") }
Not all numbers are positive
This is probably an error as numbers are rainfall levels
4.3.3 Functions
The introductory section above included a function called assess.tree.
height. The format of a function is:
function name <- function(argument list) { R expression }
The R expression is usually a code block and in R the code is contained by curly
brackets or braces: { and }. Wrapping the code into a function allows it to be used
without having to retype the code each time you wish to use it. Instead, once the
function has been defined and compiled, it can be called repeatedly and with dif-
ferent arguments or parameters. Notice in the function below that there are a num-
ber of sets of containing brackets { } that are variously related to the condition, the
consequent and the alternative.
mean.rainfall <- function(rf)
{ if (all(rf> 0)) #open Function
{ mean.value <- mean(rf) #open Consequent
cat("The mean is ",mean.value)
} else #close Consequent
{ cat("Warning: Not all values are positive\n") #open Alternative
} #close Alternative
} #close Function
mean.rainfall( c(8.5,9.3,6.5,9.3,9.4))
The mean is 8.6
More commonly, functions are defined that do something to the input specified in
the argument list and return the result, either to a variable or to the console window,
rather than just printing something out. This is done using return() within the
function. Its format is return(R expression). Essentially what this does if it
is used in a function is to make R expression the value of the function. In the
following code the mean.rainfall2 function now returns the mean of the data
passed to it, and this is assigned to another variable:
mean.rainfall2 <- function(rf) {
if ( all(rf > 0)) {
return( mean(rf))} else {
return(NA)}
}
mr <- mean.rainfall2(c(8.5,9.3,6.5,9.3,9.4))
mr
[1] 8.6
Notice that the code blocks used in the functions contained within the curly
brackets or braces { and } are indented. There are a number of commonly
accepted protocols for doing this but no unique one. The aim is to make the
code and the nesting of sub-clauses indicated by { and } clear. In the code
for mean.rainfall above, { is used before the first line of the code block,
whereas for mean.rainfall2 the { is positioned immediately after the
function declaration.
It is possible to declare variables inside functions, and you should note that
these are distinct from external variables with the same name. Consider the
internal variable rf in the mean.rainfall2 function above. Because this is a
variable that is internal to the function, it only exists within the function and will
not alter any external variable of the same name. This is illustrated in the code
below.
rf <- "Tuesday"
mean.rainfall2(c(8.5,9.3,6.5,9.3,9.4))
[1] 8.6
rf
[1] "Tuesday"
4.3.4 Loops and Repetition
Very often, we would like to run a code block a certain number of times, for exam-
ple for each record in a data frame or a spatial data frame. This is done using for
loops. The format of a loop is:
for( 'loop variable' in 'list of values' ) do R expression
Again, typically code blocks are used, as in the following example of a for loop:
for (i in 1:5) {
i.cubed <- i * i * i
cat("The cube of",i,"is ",i.cubed,"\n")}
The cube of 1 is 1
The cube of 2 is 8
The cube of 3 is 27
The cube of 4 is 64
The cube of 5 is 125
When working with a data frame and other tabular-like data structures, it is com-
mon to want to perform a series of R expressions on each row, on each column or on
each data element. In a for loop the list of values can be a simple sequence
of 1 to n (1:n), where n is related to the number of rows or columns in a dataset or
the length of the input variable as in the assess.tree.height function above.
However, there are many other situations when a different list of values
is required. The function seq is a very useful helper function that generates num-
ber sequences. It has the following formats:
seq(from, to, by = step value)
or
seq(from, to, length = sequence length)
In the example below, it is used to generate a sequence of 0 to 1 in steps of 0.25:
for (val in seq(0,1,by=0.25)) {
val.squared <- val * val
cat("The square of",val,"is ",val.squared,"\n")}
The square of 0 is 0
The square of 0.25 is 0.0625
The square of 0.5 is 0.25
The square of 0.75 is 0.5625
The square of 1 is 1
Conditional loops are very useful when you wish to run a code block until a certain
condition is met.
Many packages with spatial operations and functions for spatial analyses have not yet been updated to
work with sf. For these reasons, this edition will, where possible, describe the
manipulation and analysis of spatial data using sf format and functions but will
switch between (and convert data between) sp and sf formats as needed. The
focus is no longer primarily on GISTools, but this package still provides some
analytical short-cuts and functionality and will be used if appropriate.
R is dynamic – things do not stay the same, and this is part of its attraction and
to be celebrated. New tools, packages and functions are constantly being pro-
duced, and they are updated to improve and develop them. In most cases this is
not problematic as the update almost always extends the functionality of the pack-
age without affecting the original code. However, in a few instances, specific pack-
ages are completely rewritten without backward compatibility. If this happens
then the R code that previously worked may not work with the new package as the
functions may take different parameters, arguments and critical data formats.
However, there is usually a period of transition over some package versions before
the code stops working altogether. So occasionally a completely new paradigm is
introduced, and this has been the case recently for spatial data in R with the release
of the sf package (Pebesma et al., 2016) and the tidyverse. The second edition
reflects these developments and updates.
1.2 OBJECTIVES OF THIS BOOK
This book assumes no prior knowledge of either R or spatial analysis and map-
ping. It provides an introduction to the use of R and the increasing number of tools
that can be used for explicitly spatial analyses, geocomputation and the statistical
analysis of geographical information. The text draws from a number of open
source, user-contributed libraries or ‘packages’ that support mapping and carto-
graphic outputs arising from both raster and vector analyses. The book implicitly
focuses on vector GIS as other texts cover raster with classic geostatistics (e.g.
Bivand et al., 2013), although rasters are implicitly included in some of the exer-
cises, for example the outputs of density surfaces and some of the geographically
weighted analyses as described in later chapters.
The original rationale for producing the first edition of this book in 2013
related to a number of factors. First, the increasing use of R as an analytical tool
across a range of different scientific disciplines is evident. Second, there are an
increasing number of data capture devices that are GPS-enabled: smartphones,
tablets, cameras, etc. This has resulted in more and more data (both formal and
informal) having location attached to them. Third, there is therefore an associ-
ated increase in demand for explicitly spatial analyses of such data, in order to
exploit the richness of analysis that location affords. Finally, at the time of writ-
ing, there are no books on the market that have a specific focus on spatial analy-
sis and mapping of such data in R that do not require any prior knowledge of
GIS, spatial analysis or geocomputation. One of the few textbooks on using R for
the analysis of spatial data is Bivand et al. (2013), although this is aimed at
advanced users. These factors have not changed. If anything, the number of R users has
increased, and of those more and more are increasingly working with spatial
data. This is reflected in the number of online tools, functions and tutorials
(greatly supported by the functionality of RMarkdown) and the continued
development of packages (existing and new) and data formats supporting spa-
tial data analysis. As introduced earlier, an excellent example of the latter is the
Simple Features format in the sf package. For these reasons, what we have
sought to do is to write a book with a geographical focus and (hopefully) user
friendliness and that reflects the latest developments in spatial analyses and
mapping in R.
As you work through this book you will learn a number of techniques for using
R directly to carry out spatial data analysis, visualisation and manipulation.
Although here we focus mostly on vector data (some raster analysis is demon-
strated) and on social and economic applications, and the packages that this book
uses have been chosen as being the most appropriate for analysing these kinds of
data, R also presents opportunities for the analysis of many other kinds of spatial
data – for example, relating to climate and landscape processes. While some of the
libraries and packages covered in this book may also be useful in the analysis of
the physical geographical and environmental data, there will no doubt be other
packages that may also play an important role – for example, the PBSMapping
package, developed by the Pacific Biological Station in Nanaimo, British
Columbia, Canada, offers a number of functions that may be useful for the analy-
sis of biogeographical data.
1.3 SPATIAL DATA ANALYSIS IN R
In recent years large amounts of spatial data have become widely available. For
example, there are many governmental open data initiatives that make census data,
crime data and various other data relating to social and economic processes freely
available. However, there is still a need to flexibly analyse, visualise and model
data of these kinds in order to understand the underlying patterns and processes
that the data describe. While there are many packages and software available that
are capable of analysing spatial data, in many situations standard statistical modelling
approaches are not appropriate: data observations may not be independent or the
relationship between variables may vary across geographical space. For this reason
many standard statistical packages provide only inadequate tools for analysis as
they cannot account for the complexities of spatial processes and spatial data.
Similarly, although standard GIS packages and software provide tools for the
visualisation of spatial data, their analytical capabilities are relatively limited,
inflexible and cannot represent the state of the art. On the other hand, many R
packages are created by experts and innovators in the field of spatial data analysis
and visualisation, and as R is, in fact, a programming language it is a natural test-
ing ground for newly developed approaches. Thus R provides arguably the best
environment for spatial data analysis and manipulation. One of the key differ-
ences between a standard GIS and R is that many people view GIS as a tool to
handle very large geographical databases rather than for more sophisticated
modelling and analysis, and this is reflected in the evolution of GIS software,
although R is catching up in its ability to easily handle very large datasets. We do
not regard R as competing with GIS, rather we see the two kinds of software as
having complementary functionality.
1.4 CHAPTERS AND LEARNING ARCS
The broad-level content and topics covered by the chapters have not changed. Nor
have the associated learning arcs. The revisions for the second edition have focused
on updates to visualisation and mapping tools through the ggplot2 and tmap
packages and to spatial data structures through sf.
The chapters build in the complexity of the analyses they develop, and by work-
ing through the illustrative code examples you will develop skills to create your
own routines, functions and programs. The book includes a mix of embedded exer-
cises, where the code is provided for you to work through with extensive explana-
tions, and self-test questions, which require you to develop an answer yourself. All
chapters have self-test questions. In some cases these are included in an explicitly
named section, and in others they are embedded in the rest of the text. The final
section in each chapter provides model answers to the self-test questions. Thus in
contrast to the
Conditional loops run until a condition is met; in R they can be specified using the
repeat and break functions. Here is an example:
i <- 1; n <- 654
repeat{
  i.squared <- i * i
  if (i.squared > n) break
  i <- i + 1}
cat("The first square number exceeding",n, "is ",i.squared,"\n")
The first square number exceeding 654 is 676
Finally, it is possible to include loops in functions as in the following example with
a conditional loop:
first.bigger.square <- function(n) {
  i <- 1
  repeat{
    i.squared <- i * i
    if (i.squared > n) break
    i <- i + 1 }
  return(i.squared)}
first.bigger.square(76987)
[1] 77284
4.3.5 Debugging
As you develop your code and compile it into functions, especially initially, you
will probably encounter a few teething problems: hardly any function of reason-
able size works first time! There are two general kinds of problem:
● The function crashes (i.e. it throws up an error)
● The function does not crash, but returns the wrong answer
Usually the second kind of error is the worst. Debugging is the process of finding
the problems in the function. A typical approach to debugging is to ‘step’ through
the function line by line and in so doing find out where a crash occurs, if one does.
You should then check the values of variables to see if they have the values they
are supposed to. R has tools to help with this.
To debug a function:
● Enter debug(function name)
● Then call the function
For example, enter:
debug(mean.rainfall2)
Then just use the function you are trying to debug and R goes into ‘debug mode’:
mean.rainfall2(c(8.5,9.3,6.5,9.3,9.4))
[1] 8.6
You will notice that the prompt becomes Browse[2]> and the line of the function
about to be executed is listed. You should note a number of features associated with
debug:
● Entering a return executes it, and debug goes to next line
● Typing in a variable lists the value of that variable
● R can ‘see’ variables that are specific to the function
● Typing in any other command executes that command
Entering c continues execution to the end of the current loop, function or block. Typing
Q exits the function. To return the function to normal (non-debug) execution, enter
undebug(function name); note that if there are no bugs, entering c simply runs the
rest of the function to completion.
A final comment is that learning to write functions and programming is a bit
like learning to drive: you may pass the test, but you will become a good driver by
spending time behind the wheel. Similarly, the best way to learn to write functions
is to practise, and the more you practise the better you will get at programming.
You should try to set yourself various function writing tasks and examine the func-
tions that are introduced throughout this book. Most of the commands that you
use in R are functions that can themselves be examined: entering them without any
brackets afterwards will reveal the blocks of code they use. Have a look at the
ifelse function by entering at the R prompt:
ifelse
This allows you to examine the code blocks, the control, etc., in existing functions.
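As an aside (these are standard base R helpers, not something introduced in the text above), you can also inspect the pieces of a function separately:
args(ifelse)  # just the argument list of ifelse
body(ifelse)  # just the code block of ifelse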
4.4 WRITING FUNCTIONS
4.4.1 Introduction
In this section you will gain some initial experience in writing functions that can
be used in R, using a number of coding illustrations. You should enter the code
blocks for these, compile them and then run them with some data to build up your
experience. Unless you already have experience in writing code, this will be your
first experience of programming. This section contains a series of specific tasks for
you to complete in the form of self-test questions. The answers to the questions are
provided in the final section of the chapter.
In the preceding section, the basic idea of writing functions was described. You
can write functions directly by entering them at the R command line:
cube.root <- function(x) {
result <- x ^ (1/3)
return(result)}
cube.root(27)
[1] 3
Note that ^ means 'raise to the power', and recall that a number to the power of
one-third is its cube root. The cube root of 27 is 3, since 27 = 3 × 3 × 3, hence the
answer printed out for cube.root(27). However, entering functions from the
command line is not always very convenient:
● If you make a typing error in an early line of the definition, it is not
possible to go back and correct it
● You would have to type in the definition every time you used R
A more sensible approach is to type the function definition into a text file. If you
write this definition into a file – calling it, say, functions.R – then you can load
this file when you run R, without having to type in the whole definition. Assuming
you have set R to work in the directory where you have saved this file, just enter:
source("functions.R")
This has the same effect of entering the entire function at the command line. In
fact any R commands in a file (not just function definitions) will be executed when
the source function is used. Also, because the function definition is edited in a
file, it is always possible to return to any typing errors and correct them – and if a
function contains an error, it is easy to correct this and just redefine the function by
re-entering the command above. Using an editor for writing and saving R code was
introduced in previous chapters.
Open a new R script or editing window. In it, enter in the code for the program:
cube.root <- function(x) {
result <- x ^ (1/3)
return(result)}
Then use Save As to save the file as functions.R in the directory you are work-
ing in. In R you can now use source as described:
source('functions.R')
cube.root(343)
cube.root(99)
Note that you can type in several function definitions in the same file. For example,
underneath the code for the cube.root function, you should define a function to
compute the area of a circle. Enter:
circle.area <- function(r) {
result <- pi * r ^ 2
return(result)}
If you save the file and enter source('functions.R') again then the function
circle.area will be defined as well as cube.root. Enter:
source('functions.R')
cube.root(343)
circle.area(10)
4.4.2 Data Checking
One issue when writing functions is making sure that the data that have been
given to the function are the right kind. For example, what happens when you try
to compute the cube root of a negative number?
cube.root(-343)
[1] NaN
That probably was not the answer you wanted. NaN stands for ‘not a number’,
and is the value returned when a mathematical expression is numerically indeter-
minate. In this case, this is actually due to a shortcoming with the ^ operator in R,
which only works for positive base values. In fact −7 is a perfectly valid cube root of
−343, since (−7) × (−7) × (−7) = −343. In fact we can state a conditional rule:
● If x ≥ 0: calculate the cube root of x normally
● Otherwise: use cube.root(-x)
That is, for cube roots of negative numbers, work out the cube root of the positive
number, then change it to negative. This can be dealt with in an R function by
using an if statement:
cube.root <- function(x) {
if (x >= 0) {
result <- x ^ (1/3) } else {
result <- -(-x) ^ (1/3) }
return(result)}
Now you should go back to the text editor and modify the code in functions.R
to reflect this. You can do this by modifying the original cube.root function. You
can now save this edited file, and use source to reload the updated function defi-
nition. The function should work with both positive and negative values.
cube.root(3)
[1] 1.44225
cube.root(-3)
[1] -1.44225
Next, try debugging the function – since it is working properly, you will not (hope-
fully!) find any errors, but this will demonstrate the debug facility. Enter:
debug(cube.root)
at the R command line (not in the file editor!). This tells R that you want to run
cube.root in debug mode. Next, enter:
cube.root(-50)
at the R command line and see how repeatedly pressing the return key steps you
through the function. Note particularly what happens at the if statement.
At any stage in the process you can type an R expression to check its value.
When you get to the if statement enter:
x > 0
at the command line and press Return to see whether it is true or false. Checking
the value of expressions at various points when stepping through the code is a good
way of identifying potential bugs or glitches in your code. Try running through the
code for a few other cube root calculations, by replacing −50 above with different
numbers, to get used to using the debugging facility. When you are finished, enter:
undebug(cube.root)
at the R command line. This tells R that you are ready to return cube.root to
running in normal mode. For further details about the debugger, at the command
line enter:
help(debug)
4.4.3 More Data Checking
In the last section, you saw how it was possible to check for negative values in the
cube.root function. However, other things can go wrong. For example, try entering:
cube.root('Leeds')
This will cause an error to occur and to be printed out by R. This is not surprising
because cube roots only make sense for numbers, not character variables. However,
it might be helpful if the cube root function could spot this and print a warning
explaining the problem, rather than just crashing with a fairly obscure error
message such as the one above, as it does at the moment. Again, this can be dealt
with using an if statement. The strategy to handle this is:
● If x is numerical: compute its cube root
● If x is not numerical: print a warning message explaining the problem
Checking whether a variable is numerical can be done using the is.numeric
function:
is.numeric(77)
is.numeric("Lex")
is.numeric("77")
v <- "Two Sevens Clash"
is.numeric(v)
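For reference, the results you should expect from these calls are shown below as comments (the outputs are ours, added for clarity):
is.numeric(77)     # TRUE
is.numeric("Lex")  # FALSE
is.numeric("77")   # FALSE - the quotes make this a character string
v <- "Two Sevens Clash"
is.numeric(v)      # FALSE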
The function could be rewritten to make use of is.numeric in the following
way:
cube.root <- function(x) {
if (is.numeric(x)) {
if (x >= 0) { result <- x^(1/3) }
else { result <- -(-x)^(1/3) }
return(result) }
else {
cat("WARNING: Input must be numerical, not character\n")
return(NA)}
}
Note that here there is an if statement inside another if statement – this is an
example of a nested code block. Note also that when no proper result is defined, it is
possible to return the value NA instead of a number (NA stands for ‘not available’).
Finally, recall that the \n in the cat statement tells R to add a carriage return (new
line) when printing out the warning. Try updating your cube root function in the
editor with this latest definition, and then try using it (in particular with character
variables) and stepping through it using debug.
An alternative way of dealing with cube roots of negative numbers is to use
the R functions sign and abs. The function sign(x) returns a value of 1 if
x is positive, −1 if it is negative, and 0 if it is zero. The function abs(x)
returns the absolute value of x without the sign, so for example abs(−7)
is 7, and abs(5) is 5. This means that you can specify the core statement in
the cube root function without using an if statement to test for negative
values, as:
result <- sign(x) * abs(x)^(1/3)
This will work for both positive and negative values of x.
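A quick check of this approach (our own illustration, with the expected results as comments):
x <- -27
sign(x) * abs(x)^(1/3)  # -3
x <- 27
sign(x) * abs(x)^(1/3)  # 3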
Self-Test Question 1. Define a new function cube.root.2 that uses this way of
computing cube roots – and also include a test to make sure x is a numerical vari-
able, and print out a warning message if it is not.
4.4.4 Loops Revisited
In this section you will revisit the idea of looping in function definitions. There are
two main kinds of loops in R: deterministic and conditional loops. The former are
executed a fixed number of times, specified at the beginning of the loop. The latter
are executed until a specific condition is met.
4.4.4.1 Conditional Loops
A very old example of a conditional loop is Euclid’s algorithm. This is a method for
finding the greatest common divisor (GCD) of a pair of numbers. The GCD of a pair
of numbers is the largest number that divides exactly (i.e. with remainder zero)
into each number in the pair. The algorithm is set out below:
1. Take a pair of numbers a and b – let the dividend be max(a, b), and the
divisor be min(a, b).
2. Let the remainder be the arithmetic remainder when the dividend is
divided by the divisor.
3. Replace the dividend with the divisor.
4. Replace the divisor with the remainder.
5. If the remainder is not equal to zero, repeat from step 2 to here.
6. Once the remainder is zero, the GCD is the dividend.
Without considering in depth the reasons why this algorithm works, it should be
clear that it makes use of a conditional loop. The test to see whether further looping
is required is in step 5 above. It should also be clear that the divisor, dividend and
remainder are all variables. Given these observations, we can turn Euclid’s algo-
rithm into an R function:
gcd <- function(a,b)
{
divisor <- min(a,b)
dividend <- max(a,b)
repeat
{ remainder <- dividend %% divisor
dividend <- divisor
divisor <- remainder
if (remainder == 0) break
}
return(dividend)
}
The one unfamiliar thing here is the %% symbol. This is just the remainder operator –
the value of x %% y is the remainder when x is divided by y.
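A couple of quick checks at the console (our own illustration) show how it behaves:
10 %% 3   # returns 1
15 %% 6   # returns 3
75 %% 25  # returns 0, i.e. 25 divides 75 exactly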
Using the editor, create a definition of this function, and read it into R. You can put
the definition into functions.R. Once the function is defined, it may be tested:
gcd(6,15)
gcd(25,75)
gcd(31,33)
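To see what the function is doing, the following worked trace of gcd(6,15) (written out by us as comments, not taken from the text) may help:
# start:       divisor = 6, dividend = 15
# iteration 1: remainder = 15 %% 6 = 3; dividend becomes 6, divisor becomes 3
# iteration 2: remainder = 6 %% 3 = 0; dividend becomes 3, divisor becomes 0; break
# result:      return(dividend) gives 3
gcd(6, 15)  # [1] 3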
Self-Test Question 2. Try to match up the lines in the function definition with the
lines in the description of Euclid’s algorithm. You may also find it useful to step
through an example of gcd in debug mode.
4.4.4.2 Deterministic Loops
As described in earlier sections, the form of a deterministic loop is:
for ( <var> in <list> )
{
... code in loop ...
}
where <var> refers to the looping variable and <list> to the set of values over which
it loops. It is common practice to refer to <var> in the code in the loop. For example, a
function to print the cube roots of numbers
from 1 to n takes the form:
cube.root.table <- function(n)
{
for (x in 1:n)
{
cat("The cube root of ",x," is", cube.root(x),"\n")
}
}
Self-Test Question 3. Write a function to compute and print out GCD(x,60) for
x in the range 1 to n. When this is done, write another function to compute and
print out GCD(x,y) for x in the range 1 to n1 and y in the range 1 to n2. In this
exercise you will need to nest one deterministic loop inside another one.
Self-Test Question 4. Modify the cube.root.table function so that the loop
variable runs from 0.5 in steps of 0.5 to n. The key to this is provided in the descrip-
tions of loops in the sections above.
4.4.5 Further Activity
You will notice that in the previous example the output is rather messy, with the
cube roots printing to several decimal places – it might look neater if you could
print to fixed number of decimal places. In the function cube.root.table
replace the cat line with:
cat(sprintf("The cube root of %4.0f is %8.4f \n",x, cube.root(x)))
Then enter help(sprintf) and try to work out what is happening in the code
above.
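As a hint, the two format codes control the width and number of decimal places of the printed values; a few stand-alone calls (our own illustration) show the effect:
sprintf("%4.0f", 3)        # "   3"     - width 4, no decimal places
sprintf("%8.4f", 1.44225)  # "  1.4423" - width 8, 4 decimal places
cat(sprintf("The cube root of %4.0f is %8.4f \n", 3, 1.44225))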
Self-Test Question 5. Create a for loop that cycles through each county / row in
the data frame of the georgia2 dataset in the GISTools package and creates
a list of the adjacent counties. The code to do this for a single county, Appling,
is as follows:
library(GISTools)
library(sf)
data(georgia)
# create an empty list for the results
adj.list <- list()
# convert georgia to sf
georgia_sf <- st_as_sf(georgia2)
# extract a single county
county.i <- georgia_sf[1,]
# determine the adjacent counties
# the [-1] removes Appling from its own list
adj.i <- unlist(st_intersects(county.i, georgia_sf))[-1]
# extract their names
adj.names.i <- georgia2$Name[adj.i]
# add to the list
adj.list[[1]] <- adj.i
# name the list elements
names(adj.list[[1]]) <- adj.names.i
This creates a list with a single element, with the names of the counties adjacent
to Appling and an index or reference to their location within the georgia2
dataset.
adj.list
[[1]]
Bacon Jeff Davis Pierce Tattnall Toombs
3 80 113 132 138
Wayne
151
Note that once lists are defined as in adj.list in the code above, elements can
be added:
# in sequence
adj.list[[2]] <- sample(1:100, 3)
# or not!
i = 4
adj.list[[i]] <- c("Chris", "and", "Lex")
# have a look!
adj.list
Self-Test Question 6. Take the loop you created in Question 5 and create a function
that returns a list of the indices of adjacent polygons for each polygon in any poly-
gon dataset in sf or sp format. Hint: you will need to do any conversions to sf
and define the list to be returned inside the function.
4.5 SPATIAL DATA STRUCTURES
This section unpicks some of the detail of spatial data structures in R as a precursor
to manipulating and interrogating spatial data with functions. It examines their
coordinate encoding and briefly revisits their attribute/variable structures.
To begin with, you will load the GISTools package and the georgia data.
However, before doing this and running the code below, you need to check that
you are in the correct working directory. You should already be in the habit of
doing this at the start of every R session. Also, if this is not a fresh R session then
you should clear the workspace of any variables and functions you have created.
This can be done by entering:
rm(list = ls())
Then load the GISTools package and the Georgia datasets:
library(GISTools)
data(georgia)
One of the variables is called georgia.polys. There are two ways to confirm
this. One way is to type ls() into R. This function tells R to list out all currently
defined variables:
ls()
The other way of checking that georgia.polys now exists is just to type it into
R and see it printed out.
georgia.polys
What is actually printed out has been excluded here, as it would go on for pages
and pages. However, the content of the variable will now be explained. geor-
gia.polys is a variable of type list, with 159 items in the list. Each item is a
matrix of k rows and 2 columns. The two columns correspond to x and y coordi-
nates describing a polygon made from k points. Each polygon corresponds to one
of the 159 counties that make up the state of Georgia in the USA. To check this
quickly, enter:
class(georgia.polys)
[1] "list"
head(georgia.polys[[1]])
[,1] [,2]
[1,] 1292287 1075896
[2,] 1292654 1075919
[3,] 1292949 1075590
[4,] 1294045 1075841
[5,] 1294603 1075472
[6,] 1295467 1075621
Each of the list elements, containing the bounding coordinates of each of the coun-
ties in Georgia, can be plotted. Enter the code below to produce Figure 4.1.
Figure 4.1 A simple plot of Appling County and two adjacent counties
# plot Appling
plot(georgia.polys[[1]],asp=1,type='l',
xlab = "Easting", ylab = "Northing")
# plot adjacent county outlines
points(georgia.polys[[3]],asp=1,type='l', col = "red")
points(georgia.polys[[151]],asp=1,type='l', col = "blue", lty = 2)
Notice the use of the plot and points functions as were introduced in
Chapter 2.
Figure 4.1 will not win any prizes for cartography – but it should be
recognisable as Appling County, as featured in earlier chapters. However, it
highlights that spatial data objects in R have coordinates whether they are defined
with the sp or the sf package. The code below extracts the coordinates for the first
polygon in the georgia2 dataset, a SpatialPolygonsDataFrame object whose
coordinates are the same as those held in georgia.polys and plotted above.
head(georgia2@polygons[[1]]@Polygons[[1]]@coords)
head(georgia2@data[, 13:14])
If georgia2 is converted to sf format the coordinates are also evident:
g <- st_as_sf(georgia2)
head(g[,13:14])
So we can see that both sp and sf objects explicitly hold the spatial attributes and
the thematic and variable attributes of spatial objects.
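As a further aside (not part of the worked example above), sf also provides the st_coordinates() function, which extracts the coordinates of an sf object as a plain matrix comparable to the entries of georgia.polys:
# assumes g was created above with g <- st_as_sf(georgia2)
head(st_coordinates(g[1, ]))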
4.6 apply FUNCTIONS
The final sections of this chapter describe a number of different functions that can
make programming easier by offering a number of different ways of interrogating,
manipulating and summarising spatial data, either by their variable attributes or
by their spatial properties. This section examines the apply family of functions
that come with the base installation of R.
Like other programming languages, R includes a group of functions which
are generally termed apply functions. These can be used to apply the same set
of operations over each element in a data object (row, column, list element).
They take some input data and a function as inputs. Here we will briefly
explore three of the most commonly used apply functions: apply, lapply
and mapply.
Load the newhaven data and examine the blocks object. It contains a number
of variables describing the percentage of different ethnicities living in each census
block:
library(GISTools)
data(newhaven)
## the @data route
head(blocks@data[, 14:17])
## the data frame route
head(data.frame(blocks[, 14:17]))
A basic illustration of apply that returns the percentage value of the largest group
in each block is as follows:
apply(blocks@data[,14:17], 1, max)
Have a look at the help for apply. The code above passes the 14th to 17th columns
of the blocks data frame to apply, the 1 is passed to the MARGIN parameter to
indicate that apply will operate over each row, and the function that is applied is
max. Compare the result when the MARGIN parameter is set to be columns:
apply(blocks@data[,14:17], 2, max)
The code above returns the largest percentage of each ethnic group in any census
block.
Now suppose we wanted to determine which ethnicity formed the largest
group in each block. One way would be to create a for loop. Another would be to
define a function and use apply.
# set up vector to hold result
result.vector <- vector()
for (i in 1:nrow(blocks@data)){
# for each row determine which column has the max value
result.i <- which.max(data.frame(blocks[i,14:17]))
# put into the result vector
result.vector <- append(result.vector, result.i)
}
This can also be determined using apply as in the code below and the two results
compared:
res.vec <-apply(data.frame(blocks[,14:17]), 1, which.max)
# compare the two results
identical(as.vector(res.vec), as.vector(result.vector))
Why use apply? Loops are tractable but slow! Typically apply functions are
much quicker than loops, as is clear if the timings are compared. In many cases
this will not matter, but it will when you have large data or heavy computations
and processing. You may have to define your own functions and in some cases
manipulate the data that are passed to apply, but they are a very useful family of
functions.
# Loop
t1 <- Sys.time()
result.vector <- vector()
for (i in 1:nrow(blocks@data)){
result.i <- which.max(data.frame(blocks[i,14:17]))
result.vector <- append(result.vector, result.i)
}
Sys.time() - t1
# Apply
t1 <- Sys.time()
res.vec <-apply(data.frame(blocks[,14:17]), 1, which.max)
Sys.time() - t1
The second example uses mapply to plot the coordinates of each element of the
georgia.polys list. Here a plot extent has to be defined, and then each polygon
is plotted in turn (actually this is what plotting routines for sf and sp objects do).
One way to do this is as follows:
plot(bbox(georgia2)[1,], bbox(georgia2)[2,], asp = 1,
type='n',xlab='',ylab='',xaxt='n',yaxt='n',bty='n')
for (i in 1:length(georgia.polys)){
points(georgia.polys[[i]],
type='l')
# small delay so that you can see the plotting
Sys.sleep(0.05)
}
Another would be to use mapply:
plot(bbox(georgia2) [1,], bbox(georgia2) [2,], asp = 1,
type='n',xlab='',ylab='',xaxt='n',yaxt='n',bty='n')
invisible(mapply(polygon,georgia.polys))
The for loop below returns two objects: count.vec, a vector of the number of
counties within 50 km of each of the 159 counties in the georgia2 dataset; and a
list object with 159 elements of the names of these.
# convert Georgia2 to sf
georgia_sf <- st_as_sf(georgia2)
# create a distance matrix
dMat <- as.matrix(dist(coordinates(georgia2)))
dim(dMat)
# create an empty vector
count.vec <- vector()
# create an empty list
names.list <- list()
# for each county...
for( i in 1:nrow(georgia_sf)) {
# which counties are within 50km
vec.i <- which(dMat[i,] <= 50000)
# add to the vector
count.vec <- append(count.vec, length(vec.i))
# find their names
names.i <- georgia_sf$Name[vec.i]
# add to the list
names.list[[i]] <- names.i
}
# have a look!
count.vec
names.list
You could of course use lapply to investigate the list you have just created. Notice
how this does not require a MARGIN to be specified as does apply. Rather it just
requires a function to be applied to each element in a list:
lapply(names.list, length)
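A related convenience worth knowing about (not used in the text) is sapply(), which works like lapply() but simplifies the result to a vector where possible:
sapply(names.list, length)  # a plain numeric vector of counts, one per county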
Self-Test Question 7. Recode the for loop above into two functions to be applied
to the distance matrix, dMat, and called in a similar way to the following:
count.vec <- apply(dMat, 1, my.func1)
names.list <- apply(dMat, 1, my.func2)
4.7 MANIPULATING DATA WITH dplyr
A second set of very useful tools in the context of programming is provided by the
data table operations within the dplyr package, included within the tidyverse.
These can be used with tabular data, including the data frames containing the
attributes of spatial data. To start you should clear your R workspace and install
and load the tidyverse package and explore the introduction vignette.
Recall that vignettes were introduced in Chapter 3.
vignette("dplyr", package = "dplyr")
For the dplyr vignettes you will also have to install the nycflights13
package that contains some example data describing flights and airlines, and note
that the default data table format for the tidyverse is tibble.
install.packages("nycflights13")
library("nycflights13")
class(flights)
flights
You can examine the other datasets included in this package as well:
data(package = "nycflights13")
You should explore the different functions for summarising and filtering individu-
al data tables. The important ones are summarised in Table 4.3.
Table 4.3 Functions in the dplyr package for manipulating data tables
Function Description
filter() Selects a subset of rows in a data frame, according to user-defined
conditional statements
slice() Selects a subset of rows in a data frame by their position (row number)
arrange() Changes the row order according to the columns specified (by 1st, 2nd and
then 3rd column, etc.)
desc() Orders a column in descending order
select() Selects the subset of specified columns and reorders them vertically
distinct() Finds unique values in a table
mutate() Creates and adds new columns based on operations applied to existing
columns, e.g. NewCol = Col1 + Col2
transmute() As mutate() but only retains the newly created variables
summarise() Summarises values with functions that are passed to it
sample_n() Takes a random sample of a fixed number of table rows
sample_frac() Takes a random sample of a fixed fraction of the rows
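By way of a small illustrative sketch (ours, not from the vignette), several of these verbs can be chained together on the flights data loaded above:
library(dplyr)
library(nycflights13)
flights %>%
  filter(month == 1, !is.na(arr_delay)) %>%  # January flights with a recorded delay
  select(carrier, dest, arr_delay) %>%       # keep three columns
  mutate(delayed = arr_delay > 0) %>%        # add a logical column
  arrange(desc(arr_delay)) %>%               # longest delays first
  slice(1:5)                                 # show the top five rows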
Then you should explore the two-table vignette.
vignette("two-table", package = "dplyr")
Again, you should work through the various join and summary operations in the
two-table vignette. The first command is to select variables from flights to
create flights2.
flights2 <- flights %>% select(year:day,hour,origin,dest,tailnum,carrier)
You will note that the vignette uses the piping syntax. The %>% command pipes
the flights dataset to the select function, specifying the columns of data to be
selected. The result is assigned to flights2. A non-piped version would be:
flights2 <- select(flights, year:day,hour,origin,dest,tailnum,carrier)
The dplyr package contains a number of methods for summarising and joining
tables, including different _join functions: inner_join, left_join, right_
join, full_join, semi_join and anti_join. You should familiarise your-
self with how these different join functions operate and how they relate to the two
data table inputs they take.
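A minimal sketch with two made-up tibbles (our own, not from the vignette) illustrates the general pattern shared by these join functions: each takes two tables and a key.
library(tidyverse)
df1 <- tibble(key = c("a", "b", "c"), x = 1:3)
df2 <- tibble(key = c("a", "b", "d"), y = c(10, 20, 40))
left_join(df1, df2, by = "key")   # keeps all rows of df1; y is NA for key "c"
inner_join(df1, df2, by = "key")  # keeps only keys present in both tables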
Self-Test Question 8. The code below creates flights2, a tibble data table
in dplyr with variables of the destination (dest), the number of flights in 2013
(count) and the latitude and longitude of the origin (OrLat and OrLon) in the
New York area.
library(nycflights13)
library(tidyverse)
# select the variables
flights2 <- flights %>% select(origin, dest)
# remove Alaska and Hawaii
flights2 <- flights2[-grep("ANC", flights2$dest),]
flights2 <- flights2[-grep("HNL", flights2$dest),]
# group by destination
flights2 <- group_by(flights2, dest)
flights2 <- summarize(flights2, count = n())
# assign Lat and Lon for Origin
flights2$OrLat <- 40.6925
flights2$OrLon <- -74.16867
# have a look!
flights2
# A tibble: 103 x 4
dest count OrLat OrLon
1 ABQ 254 40.7 −74.2
2 ACK 265 40.7 −74.2
3 ALB 439 40.7 −74.2
4 ATL 17215 40.7 −74.2
5 AUS 2439 40.7 −74.2
6 AVL 275 40.7 −74.2
7 BDL 443 40.7 −74.2
8 BGR 375 40.7 −74.2
9 BHM 297 40.7 −74.2
10 BNA 6333 40.7 −74.2
# ... with 93 more rows
Your task is to join the flights2 data table to the airports dataset and
determine the latitude and longitude of the destinations. A secondary task, if
you wish, is to then map the flights using the gcIntermediate function in the
geosphere package and the datasets in the maps package (both of which you
may need to install).
Some hints about the mapping are provided in the code below. This example
plots two locations and then uses the gcIntermediate function in geosphere
to plot a path between them.
library(maps)
library(geosphere)
# origin and destination examples
dest.eg <- matrix(c(77.1025, 28.7041), ncol = 2)
origin.eg <- matrix(c(-74.16867, 40.6925), ncol = 2)
# map the world from the maps package data
map("world", fill=TRUE, col="white", bg="lightblue")
# plot the points
points(dest.eg, col="red", pch=16, cex = 2)
points(origin.eg, col = "cyan", pch = 16, cex = 2)
# add the route
for (i in 1:nrow(dest.eg)) {
lines(gcIntermediate(dest.eg[i,], origin.eg[i,], n=50,
breakAtDateLine=FALSE, addStartEnd=FALSE,
sp=FALSE, sepNA), lwd = 2, lty = 2)
}
You may wish to explore the use of other basemaps from the maps package:
map("usa", fill=TRUE, col="white", bg="lightblue")
4.8 ANSWERS TO SELF-TEST QUESTIONS
Q1: A new cube root function:
cube.root.2 <- function(x)
{ if (is.numeric(x))
{ result <- sign(x) * abs(x)^(1/3)
return(result)
} else
{ cat("WARNING: Input must be numerical, not character\n")
return(NA) }
}
Q2: Match up the lines in the gcd function to the lines in the description of Euclid’s
algorithm:
gcd <- function(a,b)
{
divisor <- min(a,b) # line 1
dividend <- max(a,b) # line 1
repeat #line 5
{ remainder <- dividend %% divisor #line 2
dividend <- divisor # line 3
divisor <- remainder # line 4
if (remainder == 0) break #line 6
}
return(dividend)
}
Q3: (i) Write a function to compute and print out gcd(x,60):
gcd.60 <- function(a)
{
for(i in 1:a)
{ divisor <- min(i,60)
dividend <- max(i,60)
repeat
{ remainder <- dividend %% divisor
dividend <- divisor
divisor <- remainder
if (remainder == 0) break
}
cat(dividend, "\n")
}
}
Alternatively you could nest the predefined gcd function inside the modified
one:
gcd.60 <- function(a)
{ for(i in 1:a)
{ dividend <- gcd(i,60)
cat(i,":", dividend, "\n")
}
}
(ii) Write a function to compute and print out gcd(x,y):
gcd.all <- function(x,y)
{ for(n1 in 1:x)
{ for (n2 in 1:y)
{ dividend <- gcd(n1, n2)
cat("when x is",n1,"& y is",n2,"dividend =",dividend,"\n")
}
}
}
Q4: Modify cube.root.table to run from 0.5 to n in steps of 0.5. The obvious
solution to this is:
cube.root.table <- function(n)
{ for (x in seq(0.5, n, by = 0.5))
{ cat("The cube root of ",x," is",
sign(x) * abs(x)^(1/3),"\n")}
}
However, this will not work when negative values are passed to it: seq cannot
create the array. The function can be modified to accommodate sequences running
from 0.5 to both negative and positive values of n:
cube.root.table <- function(n)
{ if (n > 0 ) by.val = 0.5
  if (n < 0 ) by.val = -0.5
  for (x in seq(0.5, n, by = by.val))
  { cat("The cube root of ",x," is",
    sign(x) * abs(x)^(1/3),"\n") }
}
Q5: Create a for loop that cycles through each county/row in the data frame of the
georgia2 dataset and creates a list of the adjacent counties. You were given the code
for a single county – this needs to be put into a loop, replacing the 1 with i or similar.
# create an empty list for the results
adj.list <- list()
# convert georgia to sf
georgia_sf <- st_as_sf(georgia2)
for (i in 1:nrow(georgia_sf)) {
# extract a single county
county.i <- georgia_sf[i,]
# determine the adjacent counties
# remove county i from its own adjacency list
adj.i <- unlist(st_intersects(county.i, georgia_sf))
adj.i <- adj.i[adj.i != i]
# extract their names
adj.names.i <- georgia2$Name[adj.i]
# add to the list
adj.list[[i]] <- adj.i
# name the list elements
names(adj.list[[i]]) <- adj.names.i
}
Q6: Create a function that returns a list of the indices of adjacent polygons for each
polygon in any polygon dataset in sf or sp format.
return.adj <- function(sf.data){
# convert to sf regardless!
sf.data <- st_as_sf(sf.data)
adj.list <- list()
for (i in 1:nrow(sf.data)) {
# extract a single county
poly.i <- sf.data[i,]
# determine the adjacent polygons, removing polygon i from its own list
adj.i <- unlist(st_intersects(poly.i, sf.data))
adj.i <- adj.i[adj.i != i]
# add to the list
adj.list[[i]] <- adj.i
}
return(adj.list)
}
# test it!
return.adj(georgia_sf)
return.adj(blocks)
Q7: Recode the for loop into two functions replicating the functionality of the
loop:
# number of counties within 50km (the county itself is included, as in the loop)
my.func1 <- function(x){
  vec.i <- which(x <= 50000)
  return(length(vec.i))
}
# their names
my.func2 <- function(x){
vec.i <- which(x <= 50000)
names.i <- georgia_sf$Name[vec.i]
return(names.i)
}
count.vec <- apply(dMat,1, my.func1)
names.list <- apply(dMat,1, my.func2)
Q8: Join the flights2 data table to the airports dataset and determine the lati-
tude and longitude of the destinations. Then map the flights using the gcInter-
mediate function in the geosphere package and the datasets in the maps
package.
# Part 1: the join
flights2 <- flights2 %>% left_join(airports, c("dest" = "faa"))
flights2 <- flights2 %>% select(count,dest,OrLat,OrLon,
DestLat=lat,DestLon=lon)
# get rid of any NAs
flights2 <- flights2[!is.na(flights2$DestLat),]
flights2
# Part 2: the plot
# Using standard plots
dest.eg <- matrix(c(flights2$DestLon, flights2$DestLat), ncol = 2)
origin.eg <- matrix(c(flights2$OrLon, flights2$OrLat), ncol = 2)
map("usa", fill=TRUE, col="white", bg="lightblue")
points(dest.eg, col="red", pch=16, cex = 1)
points(origin.eg, col = "cyan", pch = 16, cex = 1)
for (i in 1:nrow(dest.eg)) {
lines(gcIntermediate(dest.eg[i,], origin.eg[i,], n=50,
breakAtDateLine=FALSE,
addStartEnd=FALSE, sp=FALSE, sepNA))
}
# using ggplot
all_states <- map_data("state")
dest.eg <- data.frame(DestLon = flights2$DestLon,
DestLat = flights2$DestLat)
origin.eg <- data.frame(OrLon = flights2$OrLon,
OrLat = flights2$OrLat)
library(GISTools)
# Figure 2 using ggplot
# create the main plot
mp <- ggplot() +
geom_polygon( data=all_states,
aes(x=long, y=lat, group = group),
colour="white", fill="grey20") +
coord_fixed() +
geom_point(aes(x = dest.eg$DestLon, y = dest.eg$DestLat),
color="#FB6A4A", size=2) +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
# create some transparent shading
cols=add.alpha(colorRampPalette(brewer.pal(9,"Reds"))(nrow(flights2)), 0.7)
# loop through the destinations
for (i in 1:nrow(flights2)) {
# line thickness related flights
lwd.i = 1+ (flights2$count[i]/max(flights2$count))
# a sequence of colours
cols.i = cols[i]
# create a dataset
link <- as.data.frame(gcIntermediate(dest.eg[i,], origin.eg[i,],n=50,
breakAtDateLine=FALSE, addStartEnd=FALSE, sp=FALSE, sepNA))
names(link) <- c("lon", "lat")
mp <- mp + geom_line(data=link, aes(x=lon, y=lat),
color= cols.i, size = lwd.i)
}
# plot!
mp
5
USING R AS A GIS
5.1 INTRODUCTION
In GIS and spatial analysis, we are often interested in finding out how the
information contained in one spatial dataset relates to that contained in
another. The kinds of questions we may be interested in include:
● How does X interact with Y?
● How many X are there in different locations of Y?
● How does the incidence of X relate to the rate of Y?
● How many of X are found within a certain distance of Y?
● How does process X vary with Y spatially?
X and Y may be diseases, crimes, pollution events, attributed census areas, envi-
ronmental factors, deprivation indices or any other geographical process or phe-
nomenon that you are interested in understanding. Answering such questions
using a spatial analysis frequently requires some initial data pre-processing and
manipulation. This might be to ensure that different data have the same spatial
extent, describe processes in a consistent way (e.g. to compare land cover types
from different classifications), are summarised over the same spatial framework
(e.g. census reporting areas), are of the same format (raster, vector, etc.) and are
projected in the same way (the latter was introduced in Chapter 3).
This chapter uses worked examples to illustrate a number of fundamental and
commonly applied spatial operations on spatial datasets. Many of these form the
basis of most GIS software. The datasets may be ones you have read into R from
shapefiles or ones that you have created in the course of your analysis. Essentially,
the operations illustrate different methods for extracting information from one spa-
tial dataset based on the spatial extent of another. Many of these are what are fre-
quently referred to as overlay operations in GIS software such as ArcGIS or QGIS,
but here are extended to include a number of other types of data manipulation. The
sections below describe the following operations:
● Intersections and clipping one dataset to the extent of another
● Creating buffers around features
● Merging the features in a spatial dataset
● Point-in-polygon and area calculations
● Creating distance attributes
● Combining spatial data and attributes
● Converting between raster and vector
As you work through the example code in this chapter a number of self-test ques-
tions are introduced. Some of these go into much greater detail and complexity
than in earlier chapters and come with extensive direction for you to work through
and follow.
The chapter draws on functionality from a number of packages that have
been introduced in earlier chapters (sf, sp, maptools, GISTools,
tidyverse, rgeos, etc.) for performing overlay and other spatial operations
on spatial datasets which create new data, information or attributes. In many
cases, it is up to the analyst (you!) to decide which operations to undertake and
in what order for a particular analysis and, depending on your objectives, a
given operation may be considered as a pre-processing step or as an analytical
one. For example, calculating distances, areas, or point-in-polygon counts prior
to a statistical test may be pre-processing steps prior to the actual data analysis
or used as the actual analysis itself. The key feature of these operations is that
they create new data or information. Similarly, this chapter will use both sf and
sp data formats as needed, both of which have their own set of functions linking
to rgeos. As a reminder, sf data formats are relatively new and have strong
links to dplyr (part of the tidyverse package). This chapter will highlight
operations in both, and where we think there is a distinct advantage to one
approach this will be presented.
It is important to recall that there are conversion functions for moving between
sf and sp formats:
library(sf)
library(GISTools) # a wrapper for sp, rgeos, etc.
# load some data
data(georgia)
class(georgia)
# convert to sf
georgia_sf <- st_as_sf(georgia)
class(georgia_sf)
# convert back to sp
georgia_v2 <- as(georgia_sf, "Spatial")
class(georgia_v2)
5.2 SPATIAL INTERSECTION AND CLIP OPERATIONS
The GISTools package comes with datasets describing tornadoes in the USA.
Load the package and these data into a new R session.
library(GISTools)
data(tornados)
You will see that four sp datasets are now loaded: torn, torn2, us_states
and us_states2. The torn and torn2 data describe the locations of tornadoes
recorded between 1950 and 2004, and the us_states and us_states2 datasets
are spatial data describing the states of the USA. Two of these are in WGS84 pro-
jections (torn and us_states) and two are projected in a GRS80 datum (torn2
and us_states2). We can plot these and examine the data as in Figure 5.1.
library(tmap)
library(sf)
# convert to sf objects
torn_sf <- st_as_sf(torn)
us_states_sf <- st_as_sf(us_states)
# plot extent and grey background
tm_shape(us_states_sf) +
tm_polygons("grey90") +
# add the torn points
tm_shape(torn_sf) +
tm_dots(col = "#FB6A4A", size = 0.04, shape = 1, alpha = 0.5) +
# map the state borders
tm_shape(us_states_sf) +
tm_borders(col = "black") +
tm_layout(frame = F)
Figure 5.1 The tornado data
Note that the sp plotting code takes a very similar form:
plot(us_states, col = "grey90")
plot(torn, add = T, pch = 1, col = "#FB6A4A4C", cex = 0.4)
plot(us_states, add = T)
Remember that you can examine the attributes of a variable using the summary()
function. For sp objects this also includes a summary of the object projection. This
can be seen using the st_geometry function in sf:
summary(torn)
summary(torn_sf)
st_geometry(torn_sf)
Now, consider the situation where the aim was to analyse the incidence of torna-
does in a particular area: we do not want to analyse all of the tornado data but only
those records that describe events in our study area – the area we are interested
in. The code below selects a group of US states, in this case Texas, New Mexico,
Oklahoma and Arkansas – note the use of the OR logical operator | to make the
selection.
index <- us_states$STATE_NAME == "Texas" |
us_states$STATE_NAME == "New Mexico" |
us_states$STATE_NAME == "Oklahoma" |
us_states$STATE_NAME == "Arkansas"
AoI <- us_states[index,]
# OR....
AoI_sf <- us_states_sf[index,]
This can be plotted using the usual commands as in the code below. You can see that
the plot extent is defined by the spatial extent of area of interest (called AoI_sf)
and that all of the tornadoes within that extent are displayed.
tm_shape(AoI_sf) +
tm_borders(col = "black") +
tm_layout(frame = F) +
# add the torn points
tm_shape(torn_sf) +
tm_dots(col = "#FB6A4A", size = 0.2, shape = 1, alpha = 0.5)
# OR in sp
plot(AoI)
plot(torn, add = T, pch = 1, col = "#FB6A4A4C")
There are a number of ways of clipping spatial data in R. The simplest of these is
to use the spatial extent of one as an index to subset another. (Note that this can be
done using sp objects as well.)
torn_clip_sf <- torn_sf[AoI_sf,]
This simply clips out the data from torn_sf that is within the spatial extent of
AoI_sf. You can check this:
tm_shape(torn_clip_sf) +
tm_dots(col = "#FB6A4A", size = 0.2, shape = 1, alpha = 0.5) +
tm_shape(AoI_sf) +
tm_borders()
However, such clip (or crop) operations simply subset data based on their spatial
extents. There may be occasions when you wish to combine the attributes of different
datasets based on the spatial intersection. The gIntersection function in
rgeos or the st_intersection in sf allows us to do this as shown in the code
below. The results are mapped in Figure 5.2.
AoI_torn_sf <- st_intersection(AoI_sf, torn_sf)
tm_shape(AoI_sf) + tm_borders(col = "black") + tm_layout(frame = F) +
# add the torn points
tm_shape(AoI_torn_sf) +
tm_dots(col = "#FB6A4A", size = 0.2, shape = 1, alpha = 0.5)
Figure 5.2 The tornado data in the defined area of interest
The st_intersection operation creates an sf dataset of the locations of the
tornadoes within the area of interest. The gIntersection function does the
same thing:
AoI.torn <- gIntersection(AoI, torn, byid = TRUE)
plot(AoI)
plot(AoI.torn, add = T, pch = 1, col = "#FB6A4A4C")
If you examine the data created by the intersection, you will notice that each of the
intersecting points has the full attribution from input datasets. You can examine the
attributes of the AoI_torn_sf data and the AoI.torn data by entering:
head(data.frame(AoI_torn_sf))
head(data.frame(AoI.torn))
Once extracted, the subset can be written out for use elsewhere as described in
Chapters 2 and 3. You should examine the help for both st_intersection
and gIntersection to see how they work. You should particularly note
that both functions operate on any pair of spatial objects provided they are
projected using the same datum (in this case WGS84). In order to perform spa-
tial operations you may need to re-project your data to the same datum using
spTransform or st_transform as described in Chapter 3.
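A minimal sketch of such a re-projection (ours, assuming the torn and us_states2 objects loaded from GISTools above and the sf package attached) transforms the WGS84 tornado points to the CRS of the projected states layer:
us_states2_sf <- st_as_sf(us_states2)
torn_proj_sf <- st_transform(st_as_sf(torn), st_crs(us_states2_sf))
st_crs(torn_proj_sf)  # now matches the projected states layer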
5.3 BUFFERS
In many situations, we are interested in events or features that occur near to our
area of interest as well as those within it. Environmental events such as torna-
does, for example, do not stop at state lines or other administrative boundaries.
Similarly, if we were studying crime locations or spatial access to facilities such
as shops or health services, we would want to know about locations near to the
study area border. Buffer operations provide a convenient way of doing this, and
buffers can be created in R using the gBuffer function in rgeos or the st_
buffer function in sf.
Continuing with the example above, we might be interested in extracting the
tornadoes occurring in Texas and those within 25 km of the state border. Thus
the objective is to create a 25 km buffer around the state of Texas and to use that
to select from the tornado dataset. Both buffer functions allow us to do that, and
require a distance for the buffer to be specified in terms of the units used in the
projection. However, in order to do this, a different projection is required as dis-
tances are difficult to determine directly from projections in degrees (essentially,
the relationship between planar distance measures such as metres and kilome-
tres to degrees varies with latitude). The buffer functions will return an error message
if you try to buffer a non-projected spatial dataset. Therefore, the code below
uses the projected US data, us_states2, and the resultant buffer is shown in
Figure 5.3.
# select an area of interest and apply a buffer
# in rgeos
AoI <- us_states2[us_states2$STATE_NAME == "Texas",]
AoI.buf <- gBuffer(AoI, width = 25000)
# in sf
us_states2_sf <- st_as_sf(us_states2)
AoI_sf <- st_as_sf(us_states2_sf[us_states2_sf$STATE_NAME == "Texas",])
AoI_buf_sf <- st_buffer(AoI_sf, dist = 25000)
# map the buffer and the original area
# sp format
par(mar=c(0,0,0,0))
plot(AoI.buf)
plot(AoI, add = T, border = "blue")
# tmap: commented out!
# tm_shape(AoI_buf_sf) + tm_borders("black") +
# tm_shape(AoI_sf) + tm_borders("blue") +
# tm_layout(frame = F)
Figure 5.3 Texas with a 25 km buffer
The buffered object, shown in Figure 5.3, or objects can be used as input to clip
or intersection operations as above, for example to extract data within a certain
distance of an object. You should also examine the impact on the output of other
parameters in both buffer functions that control how line segments are created,
the geometry of the buffer, join styles, etc. Note that any sp or sf objects can be
used as an input to gBuffer and st_intersection functions, respectively: try
applying them to the breach dataset that is put into working memory when the
newhaven data are loaded.
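For example, a sketch of this with the breach points might look as follows (our own code; check the projection and its units first, since the buffer distance is interpreted in the map units of that projection):
data(newhaven)
breach_sf <- st_as_sf(breach)
st_crs(breach_sf)                                   # confirm the projection and its map units
breach_buf_sf <- st_buffer(breach_sf, dist = 1000)  # 1000 map units around each point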
There are a number of options for defining how the buffer is created. If you enter
the code below, using IDs, then buffers are created around each of the counties
within the georgia2 dataset:
data(georgia)
georgia2_sf <- st_as_sf(georgia2)
# apply a buffer to each object
# sf
buf_t_sf <- st_buffer(georgia2_sf, 5000)
# rgeos
buf.t <- gBuffer(georgia2, width = 5000, byid = T, id = georgia2$Name)
# now plot the data
# sf
tm_shape(buf_t_sf) +
tm_borders() +
tm_shape(georgia2) +
tm_borders(col = "blue") +
tm_layout(frame = F)
# rgeos
plot(buf.t)
plot(georgia2, add = T, border = "blue")
The IDs of the resulting buffer datasets relate to each of the input features, which
in the above code has been specified to be the county names. This can be checked
by examining how the buffer object has been named using names(buf.t). If you
are not convinced that the indexing has been preserved then you can compare the
output with a familiar subset, Appling County:
plot(buf.t[1,])
plot(georgia2[1,], add = T, col = "blue")
5.4 MERGING SPATIAL FEATURES
In the intersection example above, four US states were selected and used to
define the area of interest over which the tornado data were extracted. An attrib-
ute describing in which state each tornado occurred was added to the data
frame of the intersected object. In other instances we may wish to consider the
area as a single object and to merge the features within it. This can be done using
the gUnaryUnion function in the rgeos package, or the st_union and
st_combine functions in the sf package, which were used in Chapter 3 to create
an outline of the state of Georgia from its constituent counties. In the code
below the US states are merged into a single object and then plotted over the
original data as shown in Figure 5.4. Note the use of the st_sf function to con-
vert the sfc output of the st_union function to sf class before passing to the
tmap functions.
Figure 5.4 The outline of the merged US states created by gUnaryUnion, with the
original state outlines in green
library(tmap)
### with rgeos and sp commented out
# AoI.merge <- gUnaryUnion(us_states)
# plot(us_states, border = "darkgreen", lty = 3)
# plot(AoI.merge, add = T, lwd = 1.5)
### with sf and tmap
us_states_sf <- st_as_sf(us_states)
AoI.merge_sf <- st_sf(st_union(us_states_sf))
tm_shape(us_states_sf) + tm_borders(col = "darkgreen", lty = 3) +
tm_shape(AoI.merge_sf) + tm_borders(lwd = 1.5, col = "black") +
tm_layout(frame = F)
The union operations merge spatial object sub-geometries. Once the merged
objects have been created they can be used as inputs into the intersection and buff-
ering procedures above in order to select data for analysis, as well as the analysis
operations described below. The merged objects can also be used in a cartographic
context to provide a border to the study area being considered.
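As a small illustration (not from the original worked example), the merged outline created above can be laid over a filled map of the states to act as a study area border:
# illustrative sketch: use the merged outline as a cartographic border
tm_shape(us_states_sf) + tm_fill("grey90") +
  tm_shape(AoI.merge_sf) + tm_borders(lwd = 2, col = "black") +
  tm_layout(frame = F)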
5.5 POINT-IN-POLYGON AND AREA CALCULATIONS
5.5.1 Point-in-Polygon
It is often useful to count the number of points falling within different zones in a
polygon dataset. This can be done using the poly.counts function in the
GISTools package, which extends the gContains function in rgeos, or using
a similar method with the st_contains function in sf.
Remember that you can examine how a function works by entering it into the
console without the brackets – try entering poly.counts at the console.
The code below assigns a list of counts of the number of tornadoes that occur
inside each US state to the variable torn.count and prints the first six of these
to the console using the head function:
torn.count <- poly.counts(torn, us_states)
head(torn.count)
1 2 3 4 5 6
79 341 87 1121 1445 549
The numbers along the top are the 'names' of the elements in the variable torn.count,
which in this case are the polygon ID numbers of the us_states variable. The
values are the counts of the points in the corresponding polygons. You can check
this by entering:
names(torn.count)
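A hedged sketch of the equivalent operation in sf, assuming the tornado data are loaded, uses st_contains with lengths and should produce the same counts:
# sf sketch of the point-in-polygon count
torn_sf <- st_as_sf(torn)
us_states_sf <- st_as_sf(us_states)
# st_contains returns, for each state, the indices of the points it contains;
# lengths() converts this into a count per polygon
torn.count.sf <- lengths(st_contains(us_states_sf, torn_sf))
head(torn.count.sf)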
5.5.2 Area Calculations
Another useful operation is to be able to calculate polygon areas. The gArea and
st_area functions in rgeos and sf do this. To check the projection, and there-
fore the map units, of an sp class object (including SpatialPolygons,
SpatialPoints, etc.), use the proj4string function, and for sf objects use
the st_crs function:
proj4string(us_states2)
st_crs(us_states2_sf)
This declares the projection to be in metres. To see the areas in square metres of
each US state, enter:
poly.areas(us_states2)
st_area(us_states2_sf)
These are not particularly useful, and more realistic measures are to report areas in
hectares or square kilometres:
# hectares
poly.areas(us_states2) / (100 * 100)
st_area(us_states2_sf) / (100 * 100)
# square kilometres
poly.areas(us_states2) / (1000 * 1000)
st_area(us_states2_sf) / (1000 * 1000)
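Because st_area returns values with explicit units (via the units package used by sf), an alternative sketch, assuming the units package is installed, is to convert the areas directly:
# convert the sf areas to other units
library(units)
areas <- st_area(us_states2_sf)   # values carry units of m^2
set_units(areas, km^2)            # square kilometres
set_units(areas, ha)              # hectares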
Self-Test Question 1. Create the code to produce maps of the densities of
breaches of the peace in each census block in New Haven in breaches per square
kilometre. For the analysis you will need to use the breach point data and the
census blocks in the newhaven dataset and undertake a point-in-polygon
operation, apply an area function and undertake a conversion to square kilo-
metres. The maps should be produced using the tm_shape and tm_fill func-
tions in the tmap package. The New Haven data are included in the GISTools
package:
data(newhaven)
Reminder: As with all self-test questions, worked answers are provided in the final
section of the chapter.
You should note that the New Haven dataset is projected in feet. One approach is to
leave the data in feet, apply the ft2miles function to the results of the area
calculation to obtain areas in square miles (as areas are in squared units, you will
need to apply it twice), and then convert to square kilometres, noting that there are
approximately 2.58999 square kilometres in each square mile. The code below cal-
culates the area in square kilometres of each block:
ft2miles(ft2miles(gArea(blocks, byid = T))) * 2.58999
5.5.3 Point and Areas Analysis Exercise
An important advantage of using R to handle spatial data is that it is very easy
to incorporate your data into statistical analysis and graphics routines. For
example, in the New Haven blocks data frame, there is a variable called P_
OWNEROCC which states the percentage of owner-occupied housing in each
census block. It may be of interest to see how this relates to the breach of peace
densities calculated in Self-Test Question 1. A useful statistic is the correlation
coefficient, which can be calculated and printed with the cor function:
data(newhaven)
blocks$densities=poly.counts(breach,blocks)/
ft2miles(ft2miles(poly.areas(blocks)))
cor(blocks$P_OWNEROCC,blocks$densities)
[1] -0.2038463
In this case the two variables have a correlation of around -0.2, a weak nega-
tive relationship, suggesting that, in general, places with a higher proportion of
owner-occupied homes tend to see fewer breaches of peace. It is also possible to
plot the relationship between the quantities:
ggplot(blocks@data, aes(P_OWNEROCC,densities))+
geom_point() +
geom_smooth(method = "lm")
A more detailed approach might be to model the number of breaches of peace. Typ-
ically, these are relatively rare, and a Poisson distribution might be an appropriate
model. A possible model might then be:
breaches ~ Poisson(AREA * exp(a + b * P_OWNEROCC))
where AREA is the area of a block, P_OWNEROCC is the percentage of owner occu-
piers in the block, and a and b are coefficients to be estimated, a being the intercept
term. The AREA variable plays the role of an offset – a variable that always has a
coefficient of 1. The idea here is that even if breaches of peace were uniformly dis-
tributed, the number of incidents in a given census block would be proportional to
the AREA of that block. In fact, we can rewrite the model such that the offset term
is the log of the area:
breaches ~ Poisson(exp(a + b * P_OWNEROCC + log(AREA)))
Seeing the model written this way makes it clear that the offset term has a coefficient
that must always be equal to 1. The model can be fitted in R using the following code:
# load and attach the data
data(newhaven)
attach(data.frame(blocks))
# calculate the breaches of the peace in each block
n.breaches = poly.counts(breach,blocks)
area = ft2miles(ft2miles(poly.areas(blocks)))
# fit the model
model1=glm(n.breaches~P_OWNEROCC,offset=log(area),family=poisson)
# detach the data
detach(data.frame(blocks))
The first two lines compute the counts, storing them in n.breaches, and the
areas, storing them in area. The next line fits the Poisson model. glm stands for
‘generalised linear model’, and extends the standard lm routine to fit models such
as Poisson regression. As a reminder, further information about linear models and
the R modelling language was provided in one of the information boxes in Chapter 3
and an example of its use was given. The family=poisson option specifies
that a Poisson model is to be fitted here. The offset option specifies the offset
term, and the first argument specifies the actual model to be fitted. The model-
fitting results are stored in the variable model1. Having created the model in this
way, entering:
model1
returns a brief summary of the fitted model. In particular, it can be seen that the
estimated coefficients are a = 3.02 and b = −0.0310.
A more detailed view can be obtained using:
summary(model1)
Among other things, the standard errors and Wald statistics for a and b are now
shown. The Wald Z-statistics are similar to t-statistics in ordinary least
squares regression, and may be tested against the normal distribution. The results
in Table 5.1 summarise the information, showing that both a and b are significant,
and that therefore there is a statistically significant relationship between owner
occupation and breach of peace incidents.
Table 5.1 Summary of the Poisson model of the breaches of the peace over census blocks

              Estimate   Std. error   Wald's Z   p-value
Intercept        3.02       0.11         27.4     <0.01
Owner Occ. %    -0.031      0.00364      -8.5     <0.01
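The figures reported in Table 5.1 can be extracted directly from the fitted model object; a minimal sketch is:
# extract the coefficient table behind Table 5.1
round(summary(model1)$coefficients, 4)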
It is also possible to extract diagnostic information from fitted models. For
example, the rstandard function extracts the standardised residuals from a
model. Whereas residuals are the difference between the observed value (i.e. in the
data) and the value when estimated using the model, standardised residuals are
rescaled to have a variance of 1. If the model being fitted is correct, then these
residuals should be independent, have a mean of 0, a variance of 1 and an approx-
imately normal distribution. One useful diagnostic is to map these values. The
code below computes them and stores them in a variable called s.resids:
s.resids = rstandard(model1)
Now to plot the map it will be more useful to specify a shading scheme directly
using the shading command:
resid.shades = shading(c(-2,2),c("red","grey","blue"))
This specifies that the map will have three class intervals: below −2, between −2
and 2, and above 2. These are useful intervals, given that the residuals should be
normally distributed, and these values are the approximate two-tailed 5% points of
this distribution. Residuals within these points will be shaded grey, large negative
residuals will be red, and large positive ones will be blue:
par(mar=c(0,0,0,0))
choropleth(blocks,s.resids,resid.shades)
Figure 5.5 The distribution of the model1 residuals, describing the relationship between
breaches of the peace and owner occupancy
From Figure 5.5 it can be seen that in fact there is notably more variation than one
might expect (there are 21 blocks shaded blue or red, about 16% of the total, when
around 5% would appear based on the model's assumptions), and also that the
shaded blocks seem to cluster together. This last observation casts doubt on the
assumption of independence, suggesting instead that some degree of spatial cor-
relation is present. One possible reason for this is that further variables may need
to be added to the model, to explain this extra variability and spatial clustering
among the residuals.
It is possible to extend this analysis by considering P_VACANT, the percentage
of vacant properties in each census block, as well as P_OWNEROCC. This is done by
extending model1 and entering:
attach(data.frame(blocks))
n.breaches = poly.counts(breach,blocks)
area = ft2miles(ft2miles(poly.areas(blocks)))
model2=glm(n.breaches~P_OWNEROCC+P_VACANT,
offset=log(area),family=poisson)
s.resids.2 = rstandard(model2)
detach(data.frame(blocks))
This sets up a new model, with a further term for the percentage of vacant housing
in each block, and stores it in model2. Entering summary(model2) shows that
the new predictor variable is significantly related to breaches of the peace, with a
positive relationship. Finally, it is possible to map the standardised residuals for
the new model reusing the shading scheme defined above:
s.resids.2 = rstandard(model2)
par(mar=c(0,0,0,0))
choropleth(blocks,s.resids.2,resid.shades)
Figure 5.6 The distribution of the model2 residuals, describing the relationship between
breaches of the peace with owner occupancy and vacant properties
This time, Figure 5.6 shows that there are fewer red- and blue-shaded census blocks,
although perhaps still more than we might expect, and there is still some evidence
of spatial clustering. Adding the extra variable has improved things to some extent,
but perhaps there is more investigative research to be done. A more comprehensive
treatment of spatial analysis of spatial data attributes is given in Chapter 7.
Self-Test Question 2. The above code uses the choropleth function in GISTools
to produce a map of outlying residuals. Create a similar-looking map but using the
tm_shape function of the tmap package. You may find it useful to unpick the cho-
ropleth function, to think about passing a user-defined palette to tm_polygons,
to assign s.resids.2 as a blocks variable, and/or to pass a set of break values.
5.6 CREATING DISTANCE ATTRIBUTES
Distance is fundamental to spatial analysis. For example, we may wish to analyse
the number of locations (health facilities, schools, etc.) within a certain distance of
the features we are considering. In the exercise below, distance measures are used
to evaluate differences in accessibility for different social groups, as recorded in
census areas. Such approaches form the basis of supply and demand modelling
and provide inputs into location–allocation models.
Distance could be approximated using a series of buffers created at specific
distance intervals around our features (whether points or polygons). These could be
used to determine the number of features or locations that are within different
distance ranges, as specified by the buffers using the poly.counts function
above. However, distances can also be measured directly and there are a number of
functions available in R to do this.
First, the most commonly used function is dist. This calculates the Euclidean
distance between points in n-dimensional feature space. The example below,
developed from the help for dist, shows how it is used to calculate the distances
between five records (rows) in a feature space of 20 hypothetical variables.
x <- matrix(rnorm(100), nrow = 5)
colnames(x) <- paste0("Var", 1:20)
dist(x)
as.matrix(dist(x))
If your data are projected (in metres, feet, etc.) then dist can also be used to calcu-
late the Euclidean distance between pairs of coordinates.
as.matrix(dist(coordinates(blocks))) # in feet
as.matrix(dist(coordinates(georgia2))) # in metres
When determining geographical distances, it is important that you consider the
projection properties of your data: if the data are projected using degrees (i.e. in lat-
itude and longitude) then this needs to be considered in any calculation of distance.
The gDistance function in rgeos calculates the Cartesian minimum (straight-
line) distance between two spatial datasets of class sp projected in planar coordi-
nates. Try entering:
# this will not work
gDistance(georgia[1,], georgia[2,])
# this will!
gDistance(georgia2[1,], georgia2[2,])
The st_distance function in sf is similar but is also able to calculate great circle
distances between unprojected points (i.e. those in latitude and longitude).
# convert to sf
georgia2_sf <- st_as_sf(georgia2)
georgia_sf <- st_as_sf(georgia)
st_distance(georgia2_sf[1,], georgia2_sf[2,])
st_distance(georgia_sf[1,], georgia_sf[2,])
# with points
sp <- st_as_sf(SpatialPoints(coordinates(georgia)))
st_distance(sp[1,], sp[1:3,])
The distance functions return a to–from matrix of the distances between each pair of
locations. These could describe distances between any objects, and such approaches
underpin supply and demand modelling and accessibility analyses.
For example, the code below uses gDistance to calculate the distances
between the centroids of the newhaven blocks data and the places locations.
The latter are simply random locations, but could represent any kind of facility or
supply feature, and the centroids of the census blocks in New Haven represent
demand locations. In the first few lines of code, the projections of the two variables
are set to be the same, before SpatialPoints is used to extract the geometric
centroids of the census block areas and the distances between places and cents
are calculated:
data(newhaven)
proj4string(places) <- CRS(proj4string(blocks))
cents <- SpatialPoints(coordinates(blocks),
proj4string = CRS(proj4string(blocks)))
# note the use of the ft2miles function to convert to miles
distances <- ft2miles(gDistance(places, cents, byid = T))
You can examine the result in relation to the inputs to gDistance and you will
see that the distances variable is a matrix of distances (in miles) from each of the
129 census block centroids to each of the nine locations described in the places
variable.
head(round(distances, 3))
It is possible to use the census block polygons in the above gDistance calcu-
lation, and the distances returned will be to the nearest point of the census area.
Using the census area centroid provides a more representative measure of the av-
erage distance experienced by people living in that area.
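A small sketch (not in the original text) that contrasts the two choices is given below; the centroid-based distances are typically larger than the distances to the nearest polygon edge:
# compare distances to block polygons with distances to block centroids
d.poly <- ft2miles(gDistance(places, blocks, byid = TRUE))
d.cent <- ft2miles(gDistance(places, cents, byid = TRUE))
summary(as.vector(d.cent - d.poly))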
A related function is the gWithinDistance function, which tests whether
each to–from distance pair is less than a specified threshold. It returns a matrix of
TRUE and FALSE describing whether the distances between the elements of the
two sp dataset elements are less than or equal to the specified distance or not. In
the example below the distance specified is 1.2 miles.
distances <- gWithinDistance(places, cents,
byid = T, dist = miles2ft(1.2))
You should note that the distance functions work with whatever distance units are
specified in the projections of the spatial features. This means the inputs need to
have the same units. Also remember that the newhaven data are projected in feet,
hence the use of the miles2ft and ft2miles functions.
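These helper functions in GISTools are simple unit conversions; for example, the illustrative calls below should return 5280 and 1 respectively:
# miles2ft() and ft2miles() simply multiply and divide by 5280
miles2ft(1)
ft2miles(5280)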
5.6.1 Distance Analysis/Accessibility Exercise
The use of distance measures in conjunction with census data is particularly useful
for analysing access to the supply of some facility or service for different social
groups. The code below replicates the analysis developed by Comber et al. (2008),
examining access to green spaces for different social groups. In this exercise a
hypothetical example is used: we wish to examine the equity of access to the loca-
tions recorded in the places variable (supply) for different ethnic groups as
recorded in the blocks dataset (demand), on the basis that we expect everyone to
be within 1 mile of a facility. We will use the census data to approximate the
number of people with and without access of less than 1 mile to the set of hypo-
thetical facilities.
First, the distances variable is recalculated in case it was overwritten in the
gWithinDistance example above. Then the minimum distance to a supply
facility is determined for each census area using the apply function. Finally, a
logical statement is used to generate a TRUE or FALSE statement for each block:
distances <- ft2miles(gDistance(places, cents, byid = T))
min.dist <- as.vector(apply(distances,1, min))
blocks$access <- min.dist < 1
# and this can be mapped
#qtm(blocks, "access")
The populations of each ethnic group in each census block can be extracted from
the blocks dataset:
# extract the ethnicity data from the blocks variable
ethnicity <- as.matrix(data.frame(blocks[,14:18])/100)
ethnicity <- apply(ethnicity, 2, function(x) (x * blocks$POP1990))
ethnicity <- matrix(as.integer(ethnicity), ncol = 5)
colnames(ethnicity) <- c("White", "Black",
"Native American", "Asian", "Other")
And then a crosstabulation is used to bring together the access data and the
populations:
# use xtabs to generate a crosstabulation
mat.access.tab = xtabs(ethnicity~blocks$access)
# then transposes the data
data.set = as.data.frame(mat.access.tab)
#sets the column names
colnames(data.set) = c("Access","Ethnicity", "Freq")
You should examine the data.set variable. This summarises all of the factors
being considered: access, ethnicity and the counts associated with all factor com-
binations. If we make an assumption that there is an interaction between ethnicity
and access, then this can be tested using a generalised linear model with a
Poisson distribution, fitted using the glm function:
modelethnic = glm(Freq~Access*Ethnicity,
data=data.set,family=poisson)
# the full model can be printed to the console
# summary(modelethnic)
The model coefficient estimates show that there is significantly less access for some
groups than would be expected under a model of equal access when compared to
the largest ethnic group, White, which was listed first in the data.set variable,
and significantly greater access for the Other ethnic group. Examine the model
coefficient estimates, paying particular attention to the AccessTRUE: coefficients:
summary(modelethnic)$coef
Then assign these to a variable:
mod.coefs = summary(modelethnic)$coef
By subtracting 1 from the coefficients and converting them to percentages, it is pos-
sible to attach some likelihoods to the access for different groups when compared
to the White ethnic group. Again, you should examine the terms in the model
outputs prefixed by AccessTRUE:, as below:
tab <- 100*(exp(mod.coefs[,1]) - 1)
tab <- tab[7:10]
names(tab) <- colnames(ethnicity)[2:5]
round(tab, 1)
          Black Native American           Asian           Other
          -35.1           -11.7           -29.8           256.3
The results in tab tell us that some ethnic groups have significantly less access
to the hypothetical supply facilities than the White ethnic group (as recorded in
the census): Black 35% less, Native American 12% less (although this is not
significant), and Asian 30% less. The Other ethnic group has 256% more access
than the White ethnic group.
It is possible to visualise the variations in access for different groups using a
mosaic plot. Mosaic plots show the counts (i.e. population) as well as the residuals
associated with the interaction between groups and their access, the full details of
which were given in Chapter 3.
mosaicplot(t(mat.access.tab),xlab='',ylab='Access to Supply',
main="Mosaic Plot of Access",shade=TRUE,las=3,cex=0.8)
Self-Test Question 3. In working through the exercise above you have developed
a number of statistical techniques. In answering this self-test question you will
explore the impact of using census data summarised over different areal units in
your analysis. Specifically, you will develop and compare the results of two sta-
tistical models using different census areas in the newhaven datasets: blocks
and tracts. You will analyse the relationship between residential property
occupation and burglaries. You will need to work through the code below before
the tasks associated with this question are set out. To see the relationship between
the census tracts and the census blocks, enter:
plot(blocks,border='red')
plot(tracts,lwd=2,add=TRUE)
You can see that the census blocks are nested within the tracts.
The analysis described below develops a statistical model to describe the rela-
tionship between residential property occupation and burglary using two of the
New Haven crime variables related to residential burglaries. These are both point
objects, called burgres.f and burgres.n: the former is a list of burglaries
where entry was forced into the property, and the latter is a list of burglaries where
entry was not forced, suggesting that the property was left insecure, perhaps by
leaving a door or window open. The burglaries data cover the six-month period
between 1 August 2007 and 31 January 2008.
The questions you will consider are:
● Do both kinds of residential burglary occur in the same places – that is,
if a place is a high-risk area for non-forced entry, does it imply that it is
also a high-risk for forced entry?
● How does this relationship vary over different census units?
To investigate these, you should use a bivariate regression model that attempts to
predict the density of forced burglaries from the density of non-forced ones. The
indicators needed for this are the rates of burglary given the number of properties
at risk. You should use the variable OCCUPIED, present in both the census blocks
data frame and the census tracts data frame, to estimate the number of properties
at risk. If we were to compute rates per 1000 households, this would be:
1000 * (number of burglaries in block)/OCCUPIED, and since this is
over a six-month period, doubling this quantity gives the number of burglaries per
1000 households per year. However, entering:
blocks$OCCUPIED
shows that some blocks have no occupied housing, so the above rate cannot be
defined. To overcome this problem you should select the subset of the blocks with
more than zero occupied dwellings. For polygon spatial objects, each individual
polygon can be treated like a row in a data frame for the purposes of subset selec-
tion. Thus, to select only the blocks where the variable OCCUPIED is greater than
zero, enter:
blocks2 = blocks[blocks$OCCUPIED > 0,]
We can now compute the burglary rates for forced and non-forced entries by first
counting the burglaries in each block in blocks2 using the poly.counts func-
tion, dividing these numbers by the OCCUPIED counts and then multiplying by
2000 to get yearly rates per 1000 households. However, before we do this, you
should remember that you need the OCCUPIED attribute from blocks2 and not
blocks. Attach the blocks2 data and then calculate the two rate variables:
attach(data.frame(blocks2))
forced.rate = 2000*poly.counts(burgres.f,blocks2)/OCCUPIED
notforced.rate = 2000*poly.counts(burgres.n,blocks2)/OCCUPIED
detach(data.frame(blocks2))
You should have two rates stored in forced.rate and notforced.rate. A
first attempt at modelling the relationship between the two rates could be via sim-
ple bivariate regression, ignoring any spatial dependencies in the error term. This
is done using the lm function, which creates a simple regression model, model1:
model1 = lm(forced.rate~notforced.rate)
To examine the regression coefficients, enter:
summary(model1)
coef(model1)
The key things to note here are that forced.rate is related to notforced.
rate by the formula:
expected(forced.rate) = a + b × (notforced.rate)
where a is the intercept term and b is the slope or coefficient for the predictor vari-
able. If the coefficient for notforced.rate is statistically different from zero,
indicated in the summary of the model, then there is evidence that the two rates are
related. One possible explanation is that if burglars are active in an area, they will
only use force to enter dwellings when it is necessary, making use of an insecure
window or door if they spot the opportunity. Thus in areas where burglars are
active, both kinds of burglary could potentially occur. However, in areas where
burglars are less active it is less likely for either kind of burglary to occur.
Having outlined the approach, your specific tasks in this question are:
● To determine the coefficients a and b in the formula above for two
different analyses using the blocks and tracts datasets
● To comment on the difference between the analyses using different areal units
5.7 COMBINING SPATIAL DATASETS AND THEIR ATTRIBUTES
The point-in-polygon calculation using poly.counts generates counts of the
points falling in each polygon. A common situation in spatial analysis is the need
to combine (overlay) different polygon features that describe the spatial distribu-
tion of different variables, attributes or processes that are of interest. The problem
is that the data may have different underlying area geographies. In fact, it is com-
monly the case that different agencies, institutions and government departments
use different geographical areas, and even where they do not, geographical areas
frequently change over time. In these situations, we can use the intersection func-
tions (gIntersection in rgeos or st_intersection in sf) to identify the
area of intersection between different spatial datasets. With some manipulation it
is possible to determine the proportions of the objects in dataset X that fall into
each of the polygons of dataset Y. This section uses a worked example to illustrate
how this can be done in R. In the subsequent self-test question you will develop a
function to do this.
The key thing to note with all spatial operations, whether using sp or sf
datasets, is that the input data need to have the same projections. You can
examine their projection attributes with proj4string in sp and st_crs
in sf to check whether they need to be transformed, using spTransform
(sp) or st_transform (sf) functions to put the data into the same
projection.
The stages in this analysis are as follows:
1. Create a zone dataset for which the number of houses in each zone will
be calculated. The New Haven tracts data include the variable
HSE_UNITS, describing the number of residential properties in each
census tract. In this case the zones are hypothetical, but could perhaps
be zones used by the emergency services for planning purposes and
resource allocation.
2. Do an overlay of the new zones and the original areas. The key here is
to make sure that both the layers have an identifier that
allows the
proportions of each original area in each zone to be calculated. This
will then be used to allocate houses based on the proportion of each
intersecting area in each zone.
First, you should make sure you have the tmap and sf packages loaded. Then
create the zones, number them with an ID and plot these on a map with the tracts
data. This is easily done by defining a grid and then converting this to a
SpatialPolygonsDataFrame object. Enter:
library(GISTools)
library(sf)
## linking to GEOS 3.6.1, GDAL 2.1.3, proj.4.4.9.3
library(tmap)
data(newhaven)
## define sample grid in polygons
bb <- bbox(tracts)
grd <- GridTopology(cellcentre.offset = c(bb[1,1]-200, bb[2,1]-200),
    cellsize = c(10000,10000), cells.dim = c(5,5))
int.layer <- SpatialPolygonsDataFrame(
as.SpatialPolygons.GridTopology(grd),
data = data.frame(c(1:25)), match.ID = FALSE)
ct <- proj4string(blocks)
proj4string(int.layer) <- ct
proj4string(tracts) <- ct
names(int.layer) <- "ID"
You can examine the intersection layer:
plot(int.layer)
Next, you should undertake an intersection of the zone and area layers. Projec-
tions can be checked using proj4string(int.layer) and proj4string
(tracts). These have the same projections, so they can be intersected. The code
below converts them to sf format and then uses st_intersection:
int.layer_sf <- st_as_sf(int.layer)
tracts_sf <- st_as_sf(tracts)
int.res_sf <- st_intersection(int.layer_sf, tracts_sf)
You can examine the intersected data, the original data and the zones in the same
plot window, as in Figure 5.7. Remember that the grid.arrange function in the
gridExtra package allows multiple graphics to be included in the plot.
# plot and label the zones
p1 <- tm_shape(int.layer_sf) + tm_borders(lty = 2) +
tm_layout(frame = F) +
tm_text("ID", size = 0.7) +
# plot the tracts
tm_shape(tracts_sf) + tm_borders(col = "red", lwd = 2)
# plot the intersection result, int.res_sf, shaded by HSE_UNITS
p2 <- tm_shape(int.layer_sf) + tm_borders(col="white") +
tm_shape(int.res_sf) + tm_polygons("HSE_UNITS", palette = blues9) +
tm_layout(frame = F, legend.show = F)
library(grid)
grid.newpage()
pushViewport(viewport(layout=grid.layout(1,2)))
print(p1, vp=viewport(layout.pos.col = 1))
print(p2, vp=viewport(layout.pos.col = 2))
As in the gIntersection operation described in earlier sections, you can exam-
ine the result of the intersection:
head(int.res_sf)
You will see that the data frame of the intersected object contains composites
of the inputs. These links can be used to create attributes for the intersection
output data.
Figure 5.7 The zones and census tracts data before and after intersection
Recall the need to have an identifier for both the zone and area layers. The data
frame of the intersection output, int.res_sf, contains the identifiers of the two
input layers: the ID variable of int.layer_sf and the T009075H_I variable of
tracts_sf. In this case, we wish to summarise the HSE_UNITS of tracts_sf
over the zones of int.layer_sf. Here the functionality of dplyr single-table
operations that were introduced in Chapter 4 can be useful. However, first we need
to work out what proportion of the original tracts areas intersect with each
zone, and we can weight the HSE_UNITS variable appropriately to proportionally
allocate the counts of houses to the zones. Knowing the unique identifiers of each
polygon in both of the intersected layers is critical for working out proportions.
# generate area and proportions
int.areas <- st_area(int.res_sf)
tract.areas <- st_area(tracts_sf)
# match tract area to the new layer
index <- match(int.res_sf$T009075H_I, tracts$T009075H_I)
tract.areas <- tract.areas[index]
tract.prop <- as.vector(int.areas)/as.vector(tract.areas)
The tract.prop object can be used to create a variable in the data frame of the new
layer, using the index variable which indicates in which of the original tract areas
each intersected area belongs. (Note that you could examine index to see this.)
int.res_sf$houses <- tracts$HSE_UNITS[index] * tract.prop
And this can be summarised using the functionality in dplyr and linked back to
the original int.layer_sf:
library(tidyverse)
houses <- summarise(group_by(int.res_sf, ID), count = sum(houses))
# create an empty vector
int.layer_sf$houses <- 0
# and populate this using houses$ID as the index
int.layer_sf$houses[houses$ID] <- houses$count
The results can be plotted as in Figure 5.8 and checked against the original inputs
in Figure 5.7.
tm_shape(int.layer_sf) +
tm_polygons( "houses", palette = "Greens",
style = "kmeans", title = "No. of houses") +
tm_layout(frame = F, legend.position = c(1,0.5)) +
tm_shape(tracts_sf) + tm_borders(col = "black")
Figure 5.8 The zones shaded by the number of households after intersection with the
census tracts
Self-Test Question 4. Write a function that will return an intersected dataset, with
an attribute of counts of some variable (houses, population, etc.) as held in another
sf format dataset. Base your function on the code used in the illustrated exam-
ple above. Compile it such that the function returns the portion of the variable
(typically this should be a count) covered by each zone. For example, it should
be able to intersect the int.layer_sf layer with the blocks_sf layer and
return an sf dataset with an attribute of the number of people, as described in
the POP1990 variable of blocks, covered by each zone. You should remember
that many spatial functions require their inputs to have the same projections. The
int.layer_sf defined above and the tracts originally had no projections.
You may find it useful to check and/or align the input layers – for example, the
int.layer defined above and the blocks data in the following way using the
rgdal or sf packages:
## in rgdal
library(rgdal)
ct <- proj4string(blocks)
proj4string(int.layer) <- CRS(ct)
blocks <- spTransform(blocks, CRS(proj4string(int.layer)))
## in sf
library(sf)
ct <- st_crs(blocks_sf)
st_crs(int.layer_sf) <- ct
blocks_sf <- st_transform(blocks_sf, st_crs(int.layer_sf))
Your function will have to take identifier variables for the layer and the intersect
layer as inputs, and you will find it useful in your code to assign these to new ID
variables in each layer. For example, your function could require the following
parameters when compiled, setting some default values:
# define the function
area_intersect_func <- function(int.sf = int.layer_sf, layer.sf = blocks_sf,
    int.ID = "ID", layer.ID = "T009075H_I", target = "POP1990") {
  ...
  ...
}
Also, extracting values from data in sf format can be tricky. A couple of possible
ways are:
# directly from the data frame
as.vector(data.frame(int.res_sf[,"T009075H_I"])[,1])
# using select from dplyr
as.vector(unlist(select(as.data.frame(int.res_sf), T009075H_I)))
# set the geometry to NULL and then extract
st_geometry(int.res_sf) <- NULL
int.res_sf[,"T009075H_I"]
5.8 CONVERTING BETWEEN RASTER AND VECTOR
Very often we would like to move or convert our data between vector and raster
environments. In fact the very persistence of these dichotomous data structures,
with separate raster and vector functions and analyses in many commercial GIS
software programs, is one of the long-standing legacies in GIS.
This section briefly describes methods for converting data between raster and vector
structures. There are three reasons for this brief treatment. First, many packages define
their own data structures. For example, the functions in the PBSmapping package
require a PolySet object to be passed to them. This means that conversion between
one class of raster objects and, for example, the sp class of SpatialPolygons will
require different code. Second, the separation between raster and vector analysis envi-
ronments is no longer strictly needed, especially if you are developing your spatial
analyses using R, with the easy ability for users to compile their own functions and to
create their own analysis tools. Third, advanced raster mapping and analysis is exten-
sively covered in other books (see, for example, Bivand et al., 2013).
The sections below describe methods for converting the sp class of objects
(SpatialPoints, SpatialLines and SpatialPolygons, etc.) and the sf
class of objects (see the first sf vignette) as well as to and from the RasterLayer
class of objects as defined in the raster package, created by Hijmans and van
Etten (2014). They also describe how to convert between sp classes, for example to
and from SpatialPixels and SpatialGrid sp objects.
5.8.1 Vector to Raster
In this section simple approaches for converting are illustrated using datasets in
the tornados package that you have already encountered. We shall examine
techniques for converting the sp class of objects to the raster class, considering
in turn points, lines and areas.
Unfortunately, at the time of writing there is no parallel operation for convert-
ing from sf formats to raster formats. If you have data in sf format, you could
convert to an sp format before converting to raster format as described earlier:
# convert an sf object (here named sf) to sp format
sp <- as(sf, "Spatial")
# then do the conversions as below
You will need to load the data and the packages – you may need to install the
raster package using the install.packages function if this is the first time
that you have used it.
5.8.1.1 Converting Points to Raster
First, convert from sp to raster formats. The torn2 dataset is a
SpatialPointsDataFrame object:
library(GISTools)
library(raster)
data(tornados)
class(torn2)
Then create a raster and use the rasterize function to convert the data. Note the
need to specify a function that determines how the points are summarised over
the raster grid and, if the data have attributes, which attribute is to be summarised:
# rasterize a point attribute
r <- raster(nrow = 180 , ncols = 360, ext = extent(us_states2))
r <- rasterize(torn2, r, field = "INJ", fun=sum)
# rasterize count of point dataset
r <- raster(nrow = 180 , ncols = 360, ext = extent(us_states2))
r <- rasterize(as(torn2, "SpatialPoints"), r, fun=sum)
The resultant raster has cells describing different tornado densities that can be
mapped as in Figure 5.9:
# set the plot extent by specifying the plot colour 'white'
tm_shape(us_states2) +
  tm_borders("white") +
  tm_shape(r) +
  tm_raster(title = "Injured", n = 7) +
  tm_shape(us_states2) +
  tm_borders() +
  tm_layout(legend.position = c("left", "bottom"))
Figure 5.9 Converting points to raster format
5.8.1.2 Converting Lines to Raster
For illustrative purposes the code below creates a SpatialLinesDataFrame
object of the outline of the polygons with an attribute based on the area of
the state.
# Lines
us_outline <- as(us_states2 , "SpatialLinesDataFrame")
r <- raster(nrow = 180 , ncols = 360, ext = extent(us_states2))
r <- rasterize(us_outline , r, "AREA")
This takes a bit longer to run, but again the results can be mapped, this time
with the shading indicating state area (Figure 5.10):
tm_shape(r) +
tm_raster(title = "State Area", palette = "YlGn") +
tm_style("albatross") +
tm_layout(legend.position = c("left", "bottom"))
Figure 5.10 Converting lines to raster format
5.8.1.3 Converting Polygons or Areas to Raster
Finally, polygons can easily be converted to a RasterLayer object using tools in
the raster package and plotted as in Figure 5.11. In this case the 1997 population
for each state is used to generate raster cell or pixel values.
# Polygons
r <- raster(nrow = 180 , ncols = 360, ext = extent(us_states2))
r <- rasterize(us_states2, r, "POP1997")
tm_shape(r) +
tm_raster(title = "Population", n=7, style="kmeans", palette="OrRd") +
tm_layout( legend.outside = T,
legend.outside.position = c("left"),
frame = F)
It is instructive to examine the outputs of these processes. Enter:
r
This summarises the characteristics of the raster object, including the resolution,
dimensions and extent. The data values of r can be accessed using the getValues
function:
unique(getValues(r))
It is possible to specify particular dimensions for the raster grid cells, rather than
just dividing the dataset's extent by ncol and nrow in the raster function. The
code below is a bit convoluted, but it cleanly creates raster grid cells of a specified
size and allocates cell values to them from a polygon variable.
Figure 5.11 Converting polygons to raster format
# specify a cell size in the projection units
d <- 50000
dim.x <- d
dim.y <- d
bb <- bbox(us_states2)
# work out the number of cells needed
cells.x <- (bb[1,2] - bb[1,1]) / dim.x
cells.y <- (bb[2,2] - bb[2,1]) / dim.y
round.vals <- function(x){
  if (as.integer(x) < x) {
    x <- as.integer(x) + 1
  } else {
    x <- as.integer(x)
  }
}
# the cells cover the data completely
cells.x <- round.vals(cells.x)
cells.y <- round.vals(cells.y)
# specify the raster extent
ext <- extent(c(bb[1,1], bb[1,1] + (cells.x * d),
    bb[2,1], bb[2,1] + (cells.y * d)))
# now run the raster conversion
r <- raster(ncol = cells.x, nrow = cells.y)
extent(r) <- ext
r <- rasterize(us_states2, r, "POP1997")
# and map
tm_shape(r) +
tm_raster(col = "layer", title = "Populations",
palette = "Spectral", style = "kmeans") +
tm_layout(frame = F, legend.show = T,
legend.position = c("left","bottom"))
5.8.2 Converting to sp Raster Classes
You may have noticed that the sp package also has two data classes that are able
to represent raster data, or data located on a regular grid. These are
SpatialPixelsDataFrame and SpatialGridDataFrame. It is possible to
convert the rasters to these. First, create a spatially coarser raster layer of US states
similar to the above.
r <- raster(nrow = 60 , ncols = 120, ext = extent(us_states2))
r <- rasterize(us_states2 , r, "BLACK")
Then the as function can be used to coerce this to SpatialPixelsDataFrame
and SpatialGridDataFrame objects, which can also be mapped using the
image, plot and tm_raster commands in the usual way:
g <- as(r, 'SpatialGridDataFrame')
p <- as(r, 'SpatialPixelsDataFrame')
# image(g, col = topo.colors(51))
You can examine the data values held in the data frame by entering:
head(data.frame(g))
head(data.frame(p))
The data can also be manipulated to select certain features, in this case selecting the
states with populations greater than 10 million people. The code below assigns NA
values to the data points that fail this test and plots the data as in Figure 5.12.
# set up and create the raster
r <- raster(nrow = 60 , ncols = 120, ext = extent(us_states2))
r <- rasterize(us_states2 , r, "POP1997")
r2 <- r
# subset the data
r2[r < 10000000] <- NA
g <- as(r2, 'SpatialGridDataFrame')
p <- as(r2, 'SpatialPixelsDataFrame')
# not run
# image(g, bg = "grey90")
tm_shape(r2) +
tm_raster(col = "layer", title = "Pop",
palette = "Reds", style = "cat") +
tm_layout( frame = F, legend.show = T,
legend.position = c("left","bottom")) +
tm_shape(us_states2) + tm_borders()
Figure 5.12 Selecting data in a raster object
5.8.2.1 Raster to Vector
The raster package contains a number of functions for converting from raster to
vector formats. These include rasterToPolygons, which converts to a
SpatialPolygonsDataFrame object, and rasterToPoints which converts
to a matrix object. Both are illustrated in the code below and the results shown
in Figure 5.13. Notice how the original raster imposes a grid structure on the poly-
gons that are created. In this case the default mapping options with plot are
easier than using the options in the tmap or ggplot2 packages.
# load the data and convert to raster
data(newhaven)
# set up the raster, r
r <- raster(nrow = 60 , ncols = 60, ext = extent(tracts))
# convert polygons to raster
r <- rasterize(tracts , r, "VACANT")
poly1 <- rasterToPolygons(r, dissolve = T)
# convert to points
points1 <- rasterToPoints(r)
# plot the points, rasterised polygons & original polygons
par(mar=c(0,0,0,0))
plot(points1, col = "grey", axes = FALSE, xaxt='n', ann=FALSE, asp= 1)
plot(poly1, lwd = 1.5, add = T)
plot(tracts, border = "red", add = T)
Figure 5.13 Converting from rasters to polygons and points, with the original polygon data
in red
However, regarding tmap … it can be done!
# first convert the point matrix to sp format
points1.sp <- SpatialPointsDataFrame(points1[,1:2],
data = data.frame(points1[,3]))
# then plot
tm_shape(poly1) + tm_borders(col = "black") +
tm_shape(tracts) + tm_borders(col = "red") +
tm_shape(points1.sp) + tm_dots(col = "grey", shape = 1) +
tm_layout(frame = F)
5.9 INTRODUCTION TO RASTER ANALYSIS
This section provides the briefest of overviews of how raster data may be manipu-
lated and overlaid in R in a similar way to a standard GUI GIS such as QGIS. This
section will cover the reclassification of raster data as a precursor to some basic
methods for performing what is sometimes referred to as map algebra, using a raster
calculator or raster overlay. As a reminder, many packages include user guides in the
form of a PDF document describing the package. This is listed at the top of the pack-
age index page. The raster package includes example code for the creation of
raster data and different types of multi-layered raster composites. These will not be
covered in this section. Rather, the coded examples illustrate some basic methods
for manipulating and analysing raster layers in a similar way to what is often
referred to as sieve mapping, multi-criteria evaluation or multi-criteria analysis. In these,
different layers are combined to identify locations that have specific combinations
of properties, such as height above sea level > 200 m AND soil_type is ‘good’.
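As a brief, hedged sketch of this idea (using two small hypothetical layers rather than the Meuse data introduced below), logical tests on rasters return TRUE/FALSE layers that can be combined cell by cell:
# hypothetical sieve mapping example
library(raster)
height <- raster(nrow = 10, ncol = 10, xmn = 0, xmx = 1, ymn = 0, ymx = 1)
soil <- raster(height)
values(height) <- runif(ncell(height), 0, 400)            # metres above sea level
values(soil) <- sample(1:3, ncell(soil), replace = TRUE)  # 1 represents 'good' soil
# multiplying the two logical layers gives a combinatorial AND
suitable <- (height > 200) * (soil == 1)
table(getValues(suitable))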
Raster analysis requires that the different input data have a number of charac-
teristics in common: typically they should cover the same spatial extent, have the
same spatial resolution (grid or cell size), and, as with data for any spatial analysis,
they should have the same projection or coordinate system. The data layers used
in the example code in this section all have these properties. When you come to
develop your own analyses, you may have to perform some manipulation of the
data prior to analysis to ensure that your data also have these properties.
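If your own layers do not share these properties, a hedged sketch of how they might be checked and aligned with the raster package is given below; r.a and r.b are hypothetical layers:
# check and align two hypothetical raster layers
library(raster)
r.a <- raster(nrow = 50, ncol = 50, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
r.b <- raster(nrow = 25, ncol = 25, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
values(r.a) <- runif(ncell(r.a))
values(r.b) <- runif(ncell(r.b))
# compareRaster() tests extent, dimensions and CRS; here the resolutions differ
compareRaster(r.a, r.b, stopiffalse = FALSE)
# resample() transfers r.b onto r.a's grid so cell-by-cell operations are valid
r.b2 <- resample(r.b, r.a, method = "bilinear")
compareRaster(r.a, r.b2, stopiffalse = FALSE)
# projectRaster() could be used instead if the coordinate systems differed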
5.9.1 Raster Data Preparation
The Meuse data in the sp package will be used to illustrate the functions below. You
could read in your raster data using the readGDAL function in the rgdal package,
which provides an excellent R interface to the Geospatial Data Abstraction Library (GDAL). This has been described as the 'Swiss army knife for spatial data' (https://cran.r-project.org/web/packages/sf/vignettes/sf2.html) as it is able to read and write vector and raster data in a wide range of file formats. You can inspect the properties and attributes of the Meuse data by examining the associated help file (?meuse.grid).
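As an illustration, a minimal sketch of reading an external raster file is given below; the file name elev.tif is purely hypothetical, and readGDAL returns a SpatialGridDataFrame that can then be converted to raster format:

library(rgdal)
library(raster)
# read a GeoTIFF into a SpatialGridDataFrame
elev.sgdf <- readGDAL("elev.tif")
# convert to a raster object for use with the raster package
elev.r <- raster(elev.sgdf)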
library(GISTools)
library(raster)
library(sp)
# load the meuse.grid data
data(meuse.grid)
# create a SpatialPixels DF object
coordinates(meuse.grid) <- ~x+y
proj4string(meuse.grid) <- CRS("+init=epsg:28992")
meuse.grid <- as(meuse.grid, "SpatialPixelsDataFrame")
# create 3 raster layers
r1 <- raster(meuse.grid, layer = 3) #dist
r2 <- raster(meuse.grid, layer = 4) #soil
r3 <- raster(meuse.grid, layer = 5) #ffreq
The code above loads the meuse.grid data, converts it to a SpatialPixels-
DataFrame format and then creates three separate raster layers in the raster
format. These three layers will form the basis of the analyses in this section. You
could visually inspect their attributes by using some simple image commands:
# set the plot parameters for 1 row and 3 columns
par(mfrow = c(1,3))
image(r1, asp = 1)
image(r2, asp = 1)
image(r3, asp = 1)
# reset par
par(mfrow = c(1,1))
5.9.2 Raster Reclassification
Raster analyses frequently employ simple numerical and mathematical operations.
In essence, they allow you to add, multiply, subtract, etc., raster data layers, and
these operations are performed on a cell-by-cell basis. So for an addition this might
be in the form:
Raster_Result <- Raster.Layer.1 + Raster.Layer.2
Remembering that raster data are numerical, if the Raster.Layer.1 and
Raster.Layer.2 data both contained the values 1, 2 and 3, it would be difficult
to know the origin, for example, of a value of 3 in the Raster_Result output.
Specifically, if the r2 and r3 layers created above are considered, these both con-
tain values in the range 1–3 describing soil types and flooding frequency, respec-
tively (as described in the help for the meuse.grid data). Therefore we may wish
to reclassify them in some way so that the results of any combination or overlay operation can be understood.
It is possible to reclassify raster data in a number of ways. First, the raster data values can be manipulated using simple mathematical operations. These produce raster outputs describing the mathematical combination of the input raster layers. The code below multiplies one of the layers by 10. This means that the result of combining both raster data layers using the add (+) operator contains a fixed set of values (in this case nine) which are traceable to the combinations of inputs used. A value of 32 would indicate a value of 3 in r3 (a flooding frequency of 'one in 50 years') and 2 in r2 (a soil type of 'Rd90C/VII', whatever that is).
The results of this simple overlay are shown in Figure 5.14 and in the table of values printed below.
Raster_Result <- r2 + (r3 * 10)
table(getValues(Raster_Result))

 11  12  13  21  22  23  31  32  33
535 242   2 736 450 149 394 392 203

tm_shape(Raster_Result) + tm_raster(col = "layer", title = "Values",
    palette = "Spectral", style = "cat") +
  tm_layout(frame = F)

Figure 5.14 The result of a simple raster overlay
A second approach to reclassifying raster data is to employ logical operations on the data layers prior to combining them. These return TRUE or FALSE for each raster grid cell, depending on whether it satisfies the logical condition. The resultant layers can then be combined in mathematical operations as above. For example, consider an analysis that seeks to identify the locations in the Meuse data that satisfy the following conditions:
● Are greater than half of the rescaled distance away from the Meuse River
● Have a soil class of 1, that is calcareous weakly developed meadow
soils, light sandy clay
● Have a flooding frequency class of 3, namely once in a 50-year period
The following logical operations can be used to do this:
r1a <- r1 > 0.5
r2a <- r2 >= 2
r3a <- r3 < 3
These can then be combined using specific mathematical operations, depending
on the analysis. For example, a simple suitability multi-criteria evaluation, where
all the conditions have to be true and where a crisp, Boolean output is required,
would be coded using the multiplication function as follows, with the result shown
in Figure 5.15:
Raster_Result <- r1a * r2a * r3a
table(getValues(Raster_Result))

   0    1
2924  179
tm_shape(Raster_Result) +
  tm_raster(title = "Values", style = "cat") +
  tm_style("cobalt")

Figure 5.15 A raster overlay using a combinatorial AND
This is equivalent to a combinatorial AND operation, also known as an intersection. Alternatively, the analysis may be interested in identifying where any of the conditions are true – a combinatorial OR, also known as a union – with a different result, as shown in Figure 5.16:
Raster_Result <- r1a + r2a + r3a
table(getValues(Raster_Result))

   0    1    2    3
 386 1526 1012  179

# plot the result and add a legend
tm_shape(Raster_Result) +
  tm_raster(title = "Conditions", style = "cat", palette = "Spectral") +
  # tm_layout(frame = F, bg.color = "grey85") +
  tm_style("col_blind")

Figure 5.16 A raster overlay using a combinatorial OR
5.9.3 Other Raster Calculations
The above examples illustrated code to reclassify raster layers and then combine them using simple mathematical operations. You should note that it is possible to apply any kind of mathematical function to a raster layer. For example:

Raster_Result <- sin(r3) + sqrt(r1)
Raster_Result <- ((r1 * 1000) / log(r3)) * r2
tmap_mode('view')
tm_shape(Raster_Result) + tm_raster(col = "layer", title = "Value")
tmap_mode("plot")
which produces Figure 5.17.
Figure 5.17 A raster generated from a number of mathematical operations
A number of other operations are possible using different functions included in
the raster package. They are not given a full treatment here, but are introduced
such that the interested reader can explore them in more detail.
The calc function performs a computation over a single raster layer, in a simi-
lar manner to the mathematical operations in the preceding text. The advantage of
the calc function is that it should be faster when computing more complex
operations over large raster datasets.
my.func <- function(x) {log(x)}
Raster_Result <- calc(r3, my.func)
# this is equivalent to
Raster_Result <- calc(r3, log)
The overlay function provides an alternative to the mathematical operations il-
lustrated in the reclassification examples above for combining multiple raster lay-
ers. The advantage of the overlay function, again, is that it is more efficient for
performing computations over large raster objects.
Raster_Result <- overlay(r2, r3,
    fun = function(x, y) {return(x + (y * 10))})
# alternatively using a stack
my.stack <- stack(r2, r3)
Raster_Result <- overlay(my.stack, fun = function(x, y) (x + (y * 10)))
There are a number of distance functions for computing distances to specific features. The distanceFromPoints function calculates the distance from a set of points to all cells in a raster surface and produces a distance or cost surface, as in Figure 5.18.
# load meuse and convert to points
data(meuse)
coordinates(meuse) <- ~x+y
# select a point layer
soil.1 <- meuse[meuse$soil == 1,]
# create an empty raster layer
# based on the extent of meuse.grid
r <- raster(meuse.grid)
dist <- distanceFromPoints(r, soil.1)
plot(dist, asp = 1,
    xlab = '', ylab = '', xaxt = 'n', yaxt = 'n', bty = 'n', axes = F)
plot(soil.1, add = T)
# the tmap version, but this is not as nice as plot
# tm_shape(dist) + tm_raster(palette = rev(terrain.colors(10)),
#     title = "Distance", style = "kmeans") +
#     tm_layout(frame = F, legend.outside = T)
Figure 5.18 A raster analysis of distance to points
You are encouraged to explore the raster package (and indeed the sp pack-
age) in more detail if you are specifically interested in raster-based analyses. There
are a number of other distance functions, functions for computing over neighbour-
hoods (focal functions), accessing raster cell values and assessing spatial configura-
tions of raster layers.
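As an example of a focal ('moving window') operation, the sketch below (not from the text) smooths the r1 layer created earlier by taking the mean of each cell and its eight neighbours:

# 3 x 3 moving window: mean of each cell and its neighbours
r1.smooth <- focal(r1, w = matrix(1, nrow = 3, ncol = 3), fun = mean)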
5.10 ANSWERS TO SELF-TEST QUESTIONS
Q1: Produce maps of the densities of breaches of the peace in each census block in
New Haven in breaches per square kilometre. First, using sf formats:
# convert to sf
breach_sf <- st_as_sf(breach)
blocks_sf <- st_as_sf(blocks)
# point in polygon
b.count <- rowSums(st_contains(blocks_sf, breach_sf, sparse = F))
# area calculation
b.area <- ft2miles(ft2miles(st_area(blocks_sf))) * 2.58999
# combine and assign to the blocks data
blocks_sf$b.p.sqkm <- as.vector(b.count/b.area)
# map
tm_shape(blocks_sf) +
  tm_polygons("b.p.sqkm", style = "kmeans", title = "")
Second, using sp formats:
# point in polygon
b.count <- poly.counts(breach, blocks)
# area calculation
b.area <- ft2miles(ft2miles(gArea(blocks, byid = T))) * 2.58999
# combine and assign to the blocks data
blocks$b.p.sqkm <- b.count/b.area
tm_shape(blocks) + tm_polygons("b.p.sqkm", style = "kmeans", title = "")
Q2: Produce a map of the outlying residuals using tm_shape functions etc. from
the tmap package.
blocks$s.resids.2 <- s.resids.2
tm_shape(blocks) +
  tm_polygons("s.resids.2", breaks = c(-8, -2, 2, 8),
      auto.palette.mapping = F,
      palette = resid.shades$cols)
Q3: Determine the coefficients a and b for two different analyses using blocks and
tracts data and comment on the difference between the analyses using different
areal units. First, calculate the coefficients for the analysis using census blocks:
# Analysis with blocks
blocks2 = blocks[blocks$OCCUPIED > 0,]
attach(data.frame(blocks2))
forced.rate = 2000 * poly.counts(burgres.f, blocks2)/OCCUPIED
notforced.rate = 2000 * poly.counts(burgres.n, blocks2)/OCCUPIED
model1 = lm(forced.rate~notforced.rate)
coef(model1)

   (Intercept) notforced.rate
     5.4667222      0.3789628

detach(data.frame(blocks2))
The results can be printed out:
# from the model
coef(model1)
# or in a formatted statement
cat("expected(forced rate)=",coef(model1)[1], "+",
coef(model1)[2], "∗ (not forced rate)")
Now calculate the coefficients using census tracts:
# analysis with tracts
tracts2 = tracts[tracts$OCCUPIED > 0,]
# align the projections
ct <- proj4string(burgres.f)
proj4string(tracts2) <- CRS(ct)
# now do the analysis
attach(data.frame(tracts2))
forced.rate = 2000 * poly.counts(burgres.f, tracts2)/OCCUPIED
notforced.rate = 2000 * poly.counts(burgres.n, tracts2)/OCCUPIED
model2=lm(forced.rate~notforced.rate)
detach(data.frame(tracts2))
Again the results can be printed out:
# from the model
coef(model2)
# or in a formatted statement
cat("expected(forced rate) = ",coef(model2)[1], "+",
coef(model2)[2], "∗ (not forced rate)")
These two analyses show that, in this case, there are only small differences between
the coefficients arising from analyses using different areal units. Print out both results:
cat("expected(forced rate) = ",
coef(model1)[1], "+", coef(model1)[2], "∗ (not forced rate)")
cat("expected(forced rate) = ",
coef(model2)[1], "+", coef(model2)[2], "∗ (not forced rate)")
expected(forced rate) = 5.466722 + 0.3789628 ∗ (not forced rate)
expected(forced rate) = 5.243477 + 0.4132951 ∗ (not forced rate)
This analysis illustrates what is referred to as the modifiable areal unit problem, first identified in the 1930s and extensively researched by Stan Openshaw in the 1970s and beyond – see Openshaw (1984) for a comprehensive review. Variability in analyses can arise when data are summarised over different spatial units, and the importance of the modifiable areal unit problem as a critical consideration in spatial analysis cannot be overstated.
Q4: Write a function that will return an intersected dataset, with an attribute of counts
of some variable (houses, population, etc.) as held in another sf format dataset.
int.count.function <- function(int_sf, layer_sf, int.ID, layer.ID, target.var) {
  # Use the IDs to assign ID variables to both inputs
  # this makes the processing easier later on
  int_sf$IntID <- as.vector(data.frame(int_sf[, int.ID])[,1])
  layer_sf$LayerID <- as.vector(data.frame(layer_sf[, layer.ID])[,1])
  # do the same for the target.var
  layer_sf$target.var <- as.vector(data.frame(layer_sf[, target.var])[,1])
  # check projections
  if (st_crs(int_sf) != st_crs(layer_sf))
    print("Check Projections!!!")
  # do intersection
  int.res_sf <- st_intersection(int_sf, layer_sf)
  # generate area and proportions
  int.areas <- st_area(int.res_sf)
  layer.areas <- st_area(layer_sf)
  # match the layer areas to the intersection result
  v1 <- as.vector(data.frame(int.res_sf$LayerID)[,1])
  v2 <- as.vector(data.frame(layer_sf$LayerID)[,1])
  index <- match(v1, v2)
  layer.areas <- layer.areas[index]
  layer.prop <- as.vector(int.areas/as.vector(layer.areas))
  # create a variable of intersected values
  int.res_sf$NewVar <-
    as.vector(data.frame(layer_sf$target.var)[,1][index]) * layer.prop
  # summarise this and link back to int_sf
  NewVar <- summarise(group_by(int.res_sf, IntID), count = sum(NewVar))
  # create an empty vector
  int_sf$NewVar <- 0
  # and populate this using ID as the index
  int_sf$NewVar[NewVar$IntID] <- NewVar$count
  return(int_sf)
}
You can test this:
# convert blocks to sf
blocks_sf <- st_as_sf(blocks)
# run the function
test.res <- int.count.function(
  int_sf = int.layer_sf,
  layer_sf = blocks_sf,
  int.ID = "ID",
  layer.ID = "NEWH075H_I",
  target.var = "POP1990")
plot(test.res[,"NewVar"])
REFERENCES
Bivand, R.S., Pebesma, E.J. and Gómez-Rubio, V. (2013) Applied Spatial Data Analysis with R, 2nd edition. New York: Springer.
Comber, A.J., Brunsdon, C. and Green, E. (2008) Using a GIS-based network analysis to determine urban greenspace accessibility for different ethnic and religious groups. Landscape and Urban Planning, 86: 103–114.
Hijmans, R.J. and van Etten, J. (2014) raster: Geographic data analysis and modeling. R package version 2.6-7. http://cran.r-project.org/package=raster.
Openshaw, S. (1984) The Modifiable Areal Unit Problem. CATMOG 38. Norwich: Geo Abstracts. https://www.uio.no/studier/emner/sv/iss/SGO9010/openshaw1983.pdf.
6 POINT PATTERN ANALYSIS USING R
6.1 INTRODUCTION
In this and the next chapter, some key ideas of spatial statistics will be outlined,
together with examples of statistical analysis based on these ideas, via R. The two
main areas of spatial statistics that are covered are those relating to point patterns
(this chapter) and spatially referenced attributes (next chapter). One of the character-
istics of R, as open source software, is that R packages are contributed by a variety
of authors, each using their own individual styles of programming. In particular,
for point pattern analysis the spatstat package is often used, while for spatially
referenced attributes, spdep is favoured. On the one hand, spdep handles spatial data in the same way as sp, maptools and GISTools, while on the other hand spatstat does not. Also, for certain specific tasks, other packages may be
called upon whose mode of working differs from either of these packages. While
this may seem a daunting prospect, the aim of these two chapters is to introduce
the key ideas of spatial statistics, as well as providing guidance in the choice of
packages, and help in converting data formats. Fortunately, although some pack-
ages use different data formats, conversion is generally straightforward, and exam-
ples will appear throughout the chapters, whenever necessary.
6.2 WHAT IS SPECIAL ABOUT SPATIAL?
In one sense, the motivations for statistical analysis of spatial data are the same as
those for non-spatial data:
● To explore and visualise the data
● To create and calibrate models of the process generating the data
● To test hypotheses related to the processes generating the data
However, a number of these requirements are strongly influenced by the nature of
spatial data. The study of mapping and cartography may be regarded as an entire
subject area within the discipline of information visualisation, which focuses
exclusively on geographical information. In addition, the kinds of hypotheses one
might associate with spatial data are quite distinctive – for example, focusing on
the detection and location of spatial clusters of events, or on whether two kinds of
event (say, two different types of crime) have the same spatial distribution.
Similarly, models that are appropriate for spatial data are distinctive, in that they
often have to allow for spatial autocorrelation in their random component – for
example, a regression model generally includes a random error term, but if the
data are spatially referenced, one might expect nearby errors to be correlated. This
differs from a ‘standard’ regression model where each error term is considered to
apply independently, regardless of location. In the remainder of this section, point patterns – one of the two key types of spatial data considered in this book – will be described.
6.2.1 Point Patterns
Point patterns are collections of geographical points assumed to have been
generated by a random process. In this case, the focus of inference and model-
ling is on model(s) of the random processes and their comparison. Typically, a
point dataset consists of a set of observed (x, y) coordinates, say {(x1, y1), (x2, y2),
…, (xn, yn)}, where n is the number of observations. As an alternative notation,
each point could be denoted by a vector xi, where xi = (xi, yi). Using the data
formats used in sp, maptools and so on, these data could be represented as
SpatialPoints or SpatialPointsDataFrame objects. Since these data
are seen as random, many models are concerned with the probability densities
of the random points, ν(xi).
Another area of interest is the interrelation between the points. One way of thinking about this is to consider the probability density of one point xi conditional on the remaining points {x1, …, xi−1, xi+1, …, xn}. In some situations xi is independent of the other points. However, for other processes this is not the case. For example, if xi is the location of the reported address for a contagious disease, then it is more likely to occur near one of the points in the dataset (due to the nature of contagion), and therefore not independent of the values of {x1, …, xi−1, xi+1, …, xn}.
Also important is the idea of a marked process. Here, random sets of points drawn
from a number of different populations are superimposed (e.g. household burgla-
ries using force and household burglaries not using force) and the relationship
between the different sets is considered. The term ‘marked’ is used here as the
dataset can be viewed as a set of points where each point is tagged (or marked)
with its parent population. Using the data formats used by sp, a marked process
could be represented as a spatial points data frame – although the spatstat
package uses a different format.
6.3 TECHNIQUES FOR POINT PATTERNS USING R
Having outlined the two main data types that will be considered, and the kinds of model that may be applied, more specific techniques will now be discussed, with examples of how they may be carried out using R. In this section, we will focus on random point patterns.
6.3.1 Kernel Density Estimates
The simplest way to consider random two-dimensional point patterns is to assume
that each random location xi is drawn independently from an unknown distribu-
tion with probability density function f(xi). This function maps a location (repre-
sented as a two-dimensional vector) onto a probability density. If we think of
locations in space as a very fine pixel
grid, and assume a value of probability
density is assigned to each pixel, then summing the pixels making up an arbitrary
region on the map gives the probability that an event occurs in that area. It is gen-
erally more practical to assume an unknown f, rather than, say, a Gaussian distribu-
tion, since geographical patterns often take on fairly arbitrary shapes – for example,
when applying the technique to patterns of public disorder, areas of raised risk
will occur in a number of locations around a city, rather than a simplistic radial
‘bell curve’ centred on the city’s mid-point.
A common technique used to estimate f(xi) is the kernel density estimate (KDE:
Silverman, 1986). KDEs operate by averaging a series of small ‘bumps’ (probability
distributions in two dimensions, in fact) centred on each observed point. This is
illustrated in Figure 6.1. In algebraic terms, the approximation to f(x), for an arbi-
trary location x = (x, y), is given by
$$\hat f(\mathbf{x}) = \hat f(x, y) = \frac{1}{n h_x h_y} \sum_i k\left(\frac{x - x_i}{h_x}, \frac{y - y_i}{h_y}\right) \qquad (6.1)$$
Each of the 'bumps' (central panel in Figure 6.1) maps onto the kernel function k((x − xi)/hx, (y − yi)/hy) in equation (6.1), and the entire equation describes the 'bump averaging' process, leading to the estimate of probability density in the right-hand panel. Note that there are also parameters hx and hy (frequently referred to
hand panel. Note that there are also parameters hx and hy (frequently referred to
as the bandwidths) in the x and y directions; their dimension is length, and they
represent the radii of the bumps in each direction. Varying hx and hy alters the
shape of the estimated probability density surface – in brief, low values of hx
and hy lead to very ‘spiky’ distribution estimates, and very high values, possibly
larger than the span of the xi locations, tend to ‘flatten’ the estimate so it appears
to resemble the k-function itself; effectively this gives a superposition of nearly
identical k-functions with relatively small perturbations in their centre points.
This effect of varying hx and hy is shown in Figure 6.2. Typically hx and hy take
similar values. If one of these values is very different in magnitude from the other, kernels elongated in either the x or y direction result. Although this may be useful
when there are strong directional effects, we will focus on the situation where val-
ues are similar for the examples discussed here. To illustrate the results of varying
the bandwidths, the same set of points used in Figure 6.1 is used to provide KDEs
with three different values of hx and hy: on the left, they both take a very low value,
giving a large number of peaks; in the centre, there are two peaks; and on the right,
only one.
Figure 6.1 Kernel density estimation: initial points (left); bump centred on each point
(centre); average of bumps giving estimate of probability density (right)
Figure 6.2 Kernel density estimation bandwidths: hx and hy too low (left); hx and hy
appropriate (centre); hx and hy too high (right)
An obvious problem is that of choosing appropriate hx and hy given a dataset
{xi}. There are a number of formulae to provide ‘automatic’ choices, as well as some
more sophisticated algorithms. Here, a simple rule is used, as proposed by
Bowman and Azzalini (1997) and Scott (1992):
$$h_x = \sigma_x \left(\frac{2}{3n}\right)^{1/6} \qquad (6.2)$$
where σx is the standard deviation of the xi. A similar formula exists for hy, replac-
ing σx with σy, the standard deviation of the yi. The central KDE in Figure 6.2 is
based on choosing hx and hy using this method.
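As a rough illustration (not from the text), for a sample of n = 100 points whose x coordinates have a standard deviation of 250 m, equation (6.2) gives hx ≈ 250 × (2/300)^(1/6) ≈ 108 m; the bandwidth grows with the spread of the points and shrinks only slowly (as n^(−1/6)) as the sample size increases.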
6.3.2 Kernel Density Estimation Using R
Here, the breaches of the peace (public disturbances) in New Haven, Connecticut
are used as an example; recall that this is provided in the GISTools package,
here loaded using data(newhaven). As an initial inspection of the data, look
at the locations of breaches of the peace. These can be viewed on an interactive
map using the tmap package in view mode. The following code loads the New
Haven data and tmap, sets R in view mode and produces a map showing the
US Census block boundaries and the locations of breach of the peace, on a back-
drop of a CartoDB map, provided your computer is linked to the internet. The
two layers can be interactively switched on or off, and the backdrop can be
changed. Here, we will generally use the default backdrop as it is monochrome,
and the information to be mapped will be in colour. The initial map window is
seen in Figure 6.3.
# Load GISTools (for the data) and tmap (for the mapping)
require(GISTools)
require(tmap)
# Get the data
data(newhaven)
# look at it
# select 'view' mode
tmap_mode('view')
# Create the map of blocks and incidents
tm_shape(blocks) + tm_borders() + tm_shape(breach) +
tm_dots(col='navyblue')
Figure 6.3 Web view mode of tmap
There are a number of packages in R that provide code for computing KDEs. Here, the tmap and tmaptools libraries provide some very useful tools. The function to compute a kernel density estimate is smooth_map from tmaptools. This estimates the value of the density over a grid of points and returns the result as a list containing a raster object (referred to as X$raster, where X is the value returned from smooth_map), a contour object (X$iso) and a polygon object (X$polygons). The first of these is a raster grid of values for the KDE, and the second and third relate to contour lines associated with the KDE; iso provides a set of lines (the contour lines) which may be plotted, while the polygons item provides a set of filled polygons bounded by those contour lines. smooth_map takes several arguments (most notably the set of points to use for the KDE) but also a number of optional arguments. Two key ones here are the bandwidth and the cover. The bandwidth is a vector of length 2 containing hx and hy, and the cover is a geographical object whose outline forms the boundary of the locations where the KDE is estimated. Both of these have defaults: the default bandwidth is 1/50 of the shortest side of the bounding box of the points, and the default cover is the bounding box of the points. However, as discussed earlier, more appropriate hx and hy values may be found using (6.2). This is not provided as part of smooth_map, but a function is easily written. The division of the result by 1000 is because the projected data are measured in metres, but smooth_map expects bandwidths in kilometres.
# Function to choose bandwidth according to Bowman and Azzalini / Scott's rule
# for use with smooth_map in tmaptools
choose_bw <- function(spdf) {
  X <- coordinates(spdf)
  sigma <- c(sd(X[,1]), sd(X[,2])) * (2 / (3 * nrow(X))) ^ (1/6)
  return(sigma/1000)
}
Now the code to carry out the KDE and plot the results may be used. Here the raster
version of the result is used, and plotted on a web mapping backdrop (Figure 6.4).
library(tmaptools)
tmap_mode('view')
breach_dens <- smooth_map(breach, cover = blocks, bandwidth = choose_bw(breach))
tm_shape(breach_dens$raster) + tm_raster()

Figure 6.4 KDE map for breaches of the peace
The 'count' caption here indicates that the probability densities have been rescaled to represent intensities – by multiplying the KDE by the number of cases. On this scale, the quantity being mapped is the expected number of cases per unit area over the duration of the study period.
It is also possible to use the other forms of result (polygons or isolines) to plot
the KDE outcomes. In the following code, isolines are produced, again with a back-
drop of a web map (see Figure 6.5).
tmap_mode('view')
tm_shape(blocks)+ tm_borders(alpha=0.5) +
tm_shape(breach_dens$iso) + tm_lines(col='darkred',lwd=2)
Figure 6.5 KDE map for breaches of the peace – isoline version
Here, a backdrop of block boundaries has also been added to emphasise the limits of the data collection region. In this and the previous map, it is important to
be aware of the boundaries of the data sampling region. Low probability densities
outside this region are quite likely due to no data being collected there – not neces-
sarily low incident risk!
Self-Test Question 1. As a further exercise, create the polygons version of the KDE
map in the plot mode of tmap – the tm_fill() function will shade the poly-
gons. As there will be no backdrop map, roads and blocks should be added to the
map to provide context. Also, add a map scale.
As well as estimating the probability density function f(x, y), kernel density
estimation also provides a helpful visual tool for displaying point data.
Although plotting point data directly can show all of the information in a
small dataset, if the dataset is larger it is hard to discriminate between
relative densities of points: essentially, when points are very closely packed,
the map symbols begin to overprint and exact numbers are hard to deter-
mine; this is illustrated in Figure 6.6. On the left is a plot of locations. The
points plotted are drawn from a two-dimensional Gaussian distribution, and
their relative density increases towards the centre. However, except for a
penumbral region, the intensity of the dot pattern appears to have roughly
fixed density. As the KDE estimates relative density, this problem is
addressed – as may be seen in the KDE plot in Figure 6.6 (right).
Figure 6.6 The overplotting problem: point plot (left) and KDE plot (right)
6.4 FURTHER USES OF KERNEL DENSITY ESTIMATION
KDEs are also useful for comparative purposes. In the newhaven dataset there are
also data relating to burglaries from residential properties. These are divided into
two classes: burglaries involving forced entry, and burglaries that do not. It may
be of interest to compare the spatial distributions of the two groups. In the
newhaven dataset, burgres.f is a SpatialPoints object with points for the
occurrence of forced entry residential burglaries, and burgres.n is a
SpatialPoints object with points for non-forced entries. Based on the recom-
mendation to compare patterns in data using small multiples of graphical panels
(Tufte, 1990), KDE maps for forced and non-forced burglaries may be shown side
by side. This is achieved using the R code below, which carries out the following
operations:
● Specify a set of levels for the intensity contours. To allow comparison
the same levels will be used on both maps
● Compute the KDEs. Here the contours are specified for the iso and
polygons results
● Draw each of the two maps and store in variables dn and df . Here the
polygon format is used
● Use tmap_arrange to draw the two maps in ‘small multiples’
format
The result is seen in Figure 6.7. Although there are some similarities in the two
patterns – likely due to the underlying pattern of housing – it may be seen that
for the non-forced entries there are two peaks of roughly equal intensity (Beaver
Hills/Edgewood in the west and Fair Haven in the east), while for forced entries
the peaks are in similar positions but the stronger peak is to the west, near
Edgewood. More generally, there tend to be more forced incidents than
non-forced.
# R Kernel Density comparison - first make sure the New Haven data are available
require(GISTools)
data(newhaven)
tmap_mode('plot')
# Create the KDEs for the two datasets:
contours <- seq(0, 1.4, by = 0.2)
brn_dens <- smooth_map(burgres.n, cover = blocks, breaks = contours,
    style = 'fixed',
    bandwidth = choose_bw(burgres.n))
brf_dens <- smooth_map(burgres.f, cover = blocks, breaks = contours,
    style = 'fixed',
    bandwidth = choose_bw(burgres.f))
# Create the maps and store them in variables
dn <- tm_shape(blocks) + tm_borders() +
tm_shape(brn_dens$polygons) + tm_fill(alpha=0.8) +
tm_layout(title="Non-Forced Burglaries")
df <- tm_shape(blocks) + tm_borders() +
tm_shape(brf_dens$polygons) + tm_fill(alpha=0.8) +
tm_layout(title="Forced Burglaries")
tmap_arrange(dn,df)
Figure 6.7 KDE maps to compare forced and non-forced burglary patterns
6.4.1 Hexagonal Binning Using R
An alternative visualisation tool for geographical point datasets with larger num-
bers of points is hexagonal binning. In this approach, a regular lattice of small hex-
agonal cells is overlaid on the point pattern, and the number of points in each cell
is counted. The cells are then shaded according to the counts. This method also
overcomes the overplotting problem. However, hexagonal binning is not directly
available in GISTools, and it is necessary to use another package. One possibility
is the fMultivar package. This provides a routine for hexagonal binning called
hexBinning, which takes a two-column matrix of coordinates and provides an
object representing the hexagonal grid and the counts of points in each hexagonal
cell. Note that this function does not work directly with sp-type spatial data
objects. This is mainly because it is designed to apply hexagonal binning to any
kind of data (e.g. scatter plot points where the x and y variables are not geograph-
ical coordinates). However, it is perfectly acceptable to subject geographical points
to this kind of analysis.
First, make sure that the fMultivar package is installed in R. If not, enter:
install.packages("fMultivar", dependencies = TRUE)
A complication here is that the result of the hexBinning function is not a SpatialPolygonsDataFrame object and is not immediately compatible with tmap and other spatial tools in R. To allow for this, a new function hexbin_map
is written. This takes a SpatialPointsDataFrame object as input, and returns
a SpatialPolygonsDataFrame object consisting of the hexagons in which one
or more points occur, together with a data frame with a column z containing the
count of points. The code works as follows:
● Extract coordinates from the SpatialPointsDataFrame object
● Run hexBinning on these
● Construct hexagonal polygon coordinates
● Loop through each polygon; construct these according to sp data structures
● Copy the map projection information from the
SpatialPointsDataFrame object
● Add the count information giving a SpatialPolygonsDataFrame
object
The code is below:
hexbin_map <- function(spdf, ...) {
  hbins <- fMultivar::hexBinning(coordinates(spdf), ...)
  # Hex binning code block
  # Set up the hexagons to plot, as polygons
  u <- c(1, 0, -1, -1, 0, 1)
  u <- u * min(diff(unique(sort(hbins$x))))
  v <- c(1, 2, 1, -1, -2, -1)
  v <- v * min(diff(unique(sort(hbins$y))))/3
  # Construct each polygon in the sp model
  hexes_list <- vector(length(hbins$x), mode = 'list')
  for (i in 1:length(hbins$x)) {
    pol <- Polygon(cbind(u + hbins$x[i], v + hbins$y[i]), hole = FALSE)
    hexes_list[[i]] <- Polygons(list(pol), i) }
  # Build the spatial polygons data frame
  hex_cover_sp <- SpatialPolygons(hexes_list,
      proj4string = CRS(proj4string(spdf)))
  hex_cover <- SpatialPolygonsDataFrame(hex_cover_sp,
      data.frame(z = hbins$z), match.ID = FALSE)
  # Return the result
  return(hex_cover)
}
Note the reference to fMultivar::hexBinning in the code. This tells
R to use the function hexBinning from the package fMultivar without
actually loading the package using library. It is useful if it is the only thing
used from that package, as it avoids having to load everything else in the
package.
It is now possible to create hex binned maps via this function. Here a view mode map of the hex binned breach data is created (Figure 6.8).
tmap_mode('view')
breach_hex <- hexbin_map(breach,bins=20)
tm_shape(breach_hex) +
tm_fill(col='z',title='Count',alpha=0.7)
Figure 6.8 Hexagonal binning of residential burglaries
As an alternative graphical representation, it is also possible to draw hexagons
whose area is proportional to the point count. This is done by creating a variable
with which to multiply the relative polygon coordinates (this relates to the square
root of the count in each polygon, since it is areas of the hexagons that should reflect
the counts). This is all achieved via a modification of the previous hexbin_map
function, called hexprop_map, listed below.
hexprop_map <- function(spdf, ...) {
  hbins <- fMultivar::hexBinning(coordinates(spdf), ...)
  # Hex binning code block
  # Set up the hexagons to plot, as polygons
  u <- c(1, 0, -1, -1, 0, 1)
  u <- u * min(diff(unique(sort(hbins$x))))
  v <- c(1, 2, 1, -1, -2, -1)
  v <- v * min(diff(unique(sort(hbins$y))))/3
  # scale each hexagon by the square root of its count
  scaler <- sqrt(hbins$z/max(hbins$z))
  # Construct each polygon in the sp model
  hexes_list <- vector(length(hbins$x), mode = 'list')
  for (i in 1:length(hbins$x)) {
    pol <- Polygon(cbind(u*scaler[i] + hbins$x[i], v*scaler[i] + hbins$y[i]),
        hole = FALSE)
    hexes_list[[i]] <- Polygons(list(pol), i) }
  # Build the spatial polygons data frame
  hex_cover_sp <- SpatialPolygons(hexes_list,
      proj4string = CRS(proj4string(spdf)))
  hex_cover <- SpatialPolygonsDataFrame(hex_cover_sp,
      data.frame(z = hbins$z), match.ID = FALSE)
  # Return the result
  return(hex_cover)
}
It is now possible to create a proportional hex binning map – here in plot mode
in Figure 6.9.
tmap_mode('plot')
breach_prop <- hexprop_map(breach, bins = 20)
tm_shape(blocks) + tm_borders(col = 'grey') +
  tm_shape(breach_prop) +
  tm_fill(col = 'indianred', alpha = 0.7) +
  tm_layout("Breach of Peace Incidents", title.position = c('left','bottom'))

Figure 6.9 Hexagonal binning of residential burglaries
6.5 SECOND-ORDER ANALYSIS OF POINT PATTERNS
In this section an alternative approach to point patterns will be considered.
Whereas KDEs assume that the spatial distributions for a set of points are
independent but have a varying intensity, the second-order methods consid-
ered in this section assume that marginal distributions of points have a fixed
intensity, but that the joint distribution of all points is such that individual
distributions of points are not independent.1 This process describes situations
in which the occurrences of events are related in some way – for example, if a
disease is contagious, the reporting of an incidence in one place might well be
accompanied by other reports nearby. The K-function (Ripley, 1981) is a very
useful tool for describing processes of this kind. The K-function is a function
of distance, defined by
$$K(d) = \lambda^{-1} E(N_d) \qquad (6.3)$$
1 A further stage in complication would be the situation where individual distributions are not inde-
pendent, but also the marginal distributions vary in intensity – however, this will not be considered
here.
where Nd is the number of events xi within a distance d of a randomly chosen event from all recorded events {x1, …, xn}, and λ is the intensity of the process, measured in events per unit area. Consider the situation where the distributions of xi are independent, and the marginal densities are uniform – often termed a Poisson process, or complete spatial randomness (CSR). In this situation one would expect the number of events within a distance d of a randomly chosen event to be the intensity λ multiplied by the area of a circle of radius d, so that

$$K_{\mathrm{CSR}}(d) = \pi d^2 \qquad (6.4)$$
The situation in equation (6.4) can be thought of as a benchmark to assess the clus-
tering of other processes. For a given distance d, the function value KCSR(d) gives
an indication of the expected number of events found around a randomly chosen
event, under the assumption of a uniform density with each observation being dis-
tributed independently of the others. Thus for a process having a K-function K(d), if K(d) > KCSR(d), this suggests that there is an excess of nearby points – or, to put it another way, there is clustering at the spatial scale associated with the distance d.
Similarly, if K(d) < KCSR(d), this suggests spatial dispersion at this scale – the pres-
ence of one point suggests other points are less likely to appear nearby than for a
Poisson process.
Figure 6.10 A spatial process with both clustering and dispersion
The consideration of spatial scale is important (many processes exhibit spatial
clustering at some scales, and dispersion at others) so that the quantity K(d) −
KCSR(d) may change sign with different values of d. For example, the process illus-
trated in Figure 6.10 shows clustering at low values of d – for small distances (such
as d2 in the figure) there is an excess of points near to other points compared to
CSR, but for intermediate distances (such as d1 in the figure) there is an undercount
of points.
When working with a sample of data points {xi}, the K-function for the underly-
ing distribution will not usually be known. In this case, an estimate must be made
using the sample. If dij is the distance between xi and xj then an estimate of K(d) is
given by
$$\hat K(d) = \hat\lambda^{-1} \sum_i \sum_{j \neq i} \frac{I(d_{ij} < d)}{n(n-1)} \qquad (6.5)$$
where λ̂ is an estimate of the intensity given by
$$\hat\lambda = \frac{n}{|A|} \qquad (6.6)$$
|A| being the area of a study region defined by a polygon A. Also I(·) is an indicator
function taking the value 1 if the logical expression in the brackets is true, and 0
otherwise. To consider whether this sample comes from a clustered or dispersed
process, it is helpful to compare K̂(d) to KCSR(d).
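As a minimal sketch (not from the text) of how equations (6.5) and (6.6) translate into R without any edge correction, assuming pts is a two-column matrix of coordinates and A is the area of the study region in the same squared units:

khat <- function(pts, A, d) {
  n <- nrow(pts)
  dij <- as.matrix(dist(pts))   # all pairwise distances
  lambda.hat <- n / A           # equation (6.6)
  # equation (6.5): the zero self-distances on the diagonal are excluded
  sum(dij < d & dij > 0) / (n * (n - 1)) / lambda.hat
}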
Figure 6.11 Sample K-functions under CSR
Statistical inference is important here. Even if the dataset had been gener-
ated by a CSR process, an estimate of the K-function would be subject to sam-
pling variation, and could not be expected to match KCSR(d) perfectly. Thus, it
is necessary to test whether the sampled K̂(d) is sufficiently unusual with respect to the distribution of K̂ estimates one might expect to see under CSR
to provide evidence that the generating process for the sample is not CSR. The
idea is illustrated in Figure 6.11. Here, 100 K-function estimates (based on equa-
tion (6.5)) from random CSR samples of 100 points (the same number of points as in
Figure 6.10) are superimposed, together with the estimate from the point set
shown in Figure 6.10. From this it can be seen that the estimate from the clus-
tered sample is quite different from the range of estimates expected from CSR.
Another aspect of sampling inference for K-functions is the dependency of K̂(d) on the shape of the study area. The theoretical form KCSR(d) = πd² is based on an assumption of points occurring in an infinite two-dimensional plane. The fact that a 'real-world' sample will be taken from a finite study area (denoted here by A) will lead to further deviation of sample-based estimates of K̂(d) from the theoretical form. This can also be seen in Figure 6.11 – although for the lower values of d
the CSR estimated K-function curves resemble the quadratic shape expected: the
curves ‘flatten out’ for higher values of d. This is due to the fact that for larger val-
ues of d, points will only be observed in the intersection of a circle of radius d
around a random xi and the study area A. This will result in fewer points being
observed than the theoretical K-function would predict. This effect continues, and
when d is sufficiently large any circle centred on one of the points will encompass
the entirety of A. At this point, any further increase in d will result in no change in
the number of points contained in the circle – this provides an explanation of the
flattening-out effect seen in the figure.
Above, the idea is to consider a CSR process constrained to the study area.
However, another viewpoint is that the study area defines a subset of all
points generated on the full two-dimensional plane. To estimate the
K-function for the full-plane process some allowance for edge effects on the
study area needs to be made. Ripley (1976)
proposed the following modification
to equation (6.5):
$$\hat K(d) = \hat\lambda^{-1} \sum_i \sum_{j \neq i} \frac{\pi d_{ij}^2\, I(d_{ij} < d)}{n(n-1)\, w_{ij}} \qquad (6.7)$$
where wij is the area of intersection between a circle centred at xi passing
through xj and the study area A. Inference about the estimated K-function can
then be carried out using the approach used above, but with K̂(d) based on
equation (6.7).
6.5.1 Using the K-Function in R
In R, a useful package for computing estimated K-functions (as well as other spa-
tial statistical procedures) is spatstat. This is capable of carrying out the kind of
simulation illustrated earlier in this section.
The K-function as defined above may be estimated in the spatstat package using the Kest function. Here the locations of bramble canes
(Hutchings, 1979; Diggle, 1983) are analysed, having been obtained as a dataset
supplied with spatstat via the data(bramblecanes) command. They are
plotted in Figure 6.12. Different symbols represent different ages of canes – although
initially we will just consider the point pattern for all canes.
For the data in the example, points were generated with A as the rectangle
having lower left corner (−1, −1) and upper right corner (1, 1). In practice A may
have a more complex shape (a polygon outline of a county, for example); for
this reason, assessing the sampling variability of the K-function under
sampling must often be achieved via simulation, as seen in Figure 6.11.
Figure 6.12 Bramble cane locations
# K-function code block
# Load the spatstat package
require(spatstat)
# Obtain the bramble cane data
data(bramblecanes)
plot(bramblecanes)
Next, the Kest function is used to obtain an estimate for the K-function of the
spatial process underlying the distribution of the bramble canes. The
correction='border' argument requests that an edge-corrected estimate (as
in equation (6.7)) be used.
kf <- Kest(bramblecanes,correction='border')
# Plot it
plot(kf)
Figure 6.13 Ripley's K-function plot

The result of plotting the K-function, as shown in Figure 6.13, compares the estimated function (labelled K̂bord) to the theoretical function under CSR (labelled Kpois). It may be seen that the data appear to be clustered (generally the empirical K-function is greater than that for CSR, suggesting that more points occur close together than would be expected under CSR). However, this perhaps needs a
more rigorous investigation, allowing for sampling variation via simulation as set
out above.
This simulation approach is sometimes referred to as envelope analysis, the enve-
lope being the highest and lowest values of K̂ d( ) for a value of d. Thus the function
for this is called envelope. This takes a ppp object and a further function as an
argument. The function here is Kest – there are other functions also used to
describe spatial distributions which will be discussed later, which envelope can
use, but for now we focus on Kest. The envelope object may also be plotted, as
shown in the following code which results in Figure 6.14:
# Code block to produce k-function with envelope
# Envelope function
kf.env <- envelope(bramblecanes,Kest,correction="border")
# Plot it
plot(kf.env)
Figure 6.14 K-function with envelope

From this it can be seen that the estimated K-function for the sample takes on a higher value than the envelope of simulated K-functions for CSR until d becomes quite large, suggesting strong evidence that the locations of bramble canes do indeed exhibit clustering. However, it can reasonably be argued that comparing an estimated K̂(d) and an envelope of randomly sampled estimates under CSR is not
a formal significance test. In particular, since the sample curve is compared to the
envelope for several d values, multiple significance testing problems may occur.
These are well explained by Bland and Altman (1995) – in short, when carrying out
several tests, the chance of obtaining a false positive result in any test is raised. If
the intention is to evaluate a null hypothesis of CSR, then a single number measur-
ing departure of K̂(d) from KCSR(d), rather than the K-function, may be more
appropriate – so that a single test can be applied. One such number is the maximum
absolute deviation (MAD: Ripley, 1977, 1981). This is the absolute value of the larg-
est discrepancy between the two functions:
$$\mathrm{MAD} = \max_d \left| \hat K(d) - K_{\mathrm{CSR}}(d) \right| \qquad (6.8)$$
In R, we enter:
mad.test(bramblecanes,Kest,verbose=FALSE)
Maximum absolute deviation test of CSR
Monte Carlo test based on 99 simulations
Summary function: K(r)
Reference function: theoretical
Alternative: two.sided
Interval of distance values: [0, 0.25] units (one unit = 9 metres)
Test statistic: Maximum absolute deviation
Deviation = observed minus theoretical
data: bramblecanes
mad = 0.016159, rank = 1, p-value = 0.01
In this case it can be seen that the null hypothesis of CSR can be rejected at the 1% level.
An alternative test is advocated by Loosmore and Ford (2006) where the test statistic is
$$u_i = \sum_{d_k = d_{\min}}^{d_{\max}} \left( \hat K_i(d_k) - \bar K(d_k) \right)^2 \delta_k \qquad (6.9)$$
in which K̄(dk) is the average value of K̂(dk) over the simulations, the dk are a sequence of sample distances ranging from dmin to dmax, and δk = dk+1 − dk. Essentially
this attempts to measure the sum of the squared distance between the functions,
rather than the maximum distance. This is implemented by spatstat via the
dclf.test function, which works similarly to mad.test:
dclf.test(bramblecanes,Kest,verbose=FALSE)
Diggle-Cressie-Loosmore-Ford test of CSR
Monte Carlo test based on 99 simulations
Summary function: K(r)
Reference function: theoretical
Alternative: two.sided
Interval of distance values: [0, 0.25] units (one unit = 9 metres)
Test statistic: Integral of squared absolute deviation
Deviation = observed minus theoretical
data: bramblecanes
u = 3.3372e−05, rank = 1, p-value = 0.01
Again, results suggest rejecting the null hypothesis of CSR – see the reported
p-value.
6.5.2 The L-function
An alternative to the K-function for identifying clustering in spatial processes is the
L-function. This is defined in terms of the K-function
$$L(d) = \sqrt{\frac{K(d)}{\pi}} \qquad (6.10)$$
Although just a simple transformation of the K-function, its utility lies in the
fact that under CSR, L(d) = d; that is, the L-function is linear, having a slope of 1
and passing through the origin. Visually identifying this in a plot of estimated
L-functions is generally easier than identifying a quadratic function, and there-
fore L-function estimates are arguably a better visual tool. The Lest function
provides a sample estimate of the L-function (by applying the transform in (6.10) to K̂(d)) which can be used in place of Kest. As an example, recall that the enve-
lope function could take alternatives to K-functions to create the envelope plot:
in the following code, an envelope plot using L-functions for the bramble cane
data is created (see Figure 6.15):
# Code block to produce L-function with envelope
# Envelope function
lf.env <- envelope(bramblecanes,Lest,correction="border")
# Plot it
plot(lf.env)

Figure 6.15 L-function with envelope
Similarly, it is possible to apply MAD tests or Loosmore and Ford tests using L
instead of K. Again mad.test and dclf.test allow an alternative to K-functions
to be specified. Indeed, Besag (1977) recommends using L-functions in place of
K-functions in this kind of test. As an example, the following code applies the
MAD test to the bramble cane data using the L-function.
mad.test(bramblecanes,Lest,verbose=FALSE)
Maximum absolute deviation test of CSR
Monte Carlo test based on 99 simulations
Summary function: L(r)
Reference function: theoretical
Alternative: two.sided
Interval of distance values: [0, 0.25] units (one unit = 9 metres)
Test statistic: Maximum absolute deviation
Deviation = observed minus theoretical
data: bramblecanes
mad = 0.017759, rank = 1, p-value = 0.01
6.5.3 The G-Function
Yet another function used to describe the clustering in point patterns is the
G-function. This is the cumulative distribution of the nearest neighbour distance
for a randomly selected xi. Thus, given a distance d, G(d) is the probability that the
nearest neighbour distance for a randomly chosen sample point is less than or
equal to d. Again, this can be estimated using spatstat, using the function Gest.
As in the case of Lest and Kest, the functions envelope, mad.test and
dclf.test may be used with Gest. Here, again with the bramble cane data, a
G-function envelope is plotted:
# Code block to produce G-function with envelope
# Envelope function
gf.env <- envelope(bramblecanes,Gest,correction="border")
# Plot it
plot(gf.env)
The estimate of the G-function for the sample is based on the empirical propor-
tion of nearest neighbour distances less than d, for several values of d. In this case
the envelope is the range of estimates for given d values, for samples generated
under CSR. Theoretically, the expected G-function for CSR is
$$G(d) = 1 - \exp(-\lambda \pi d^2) \qquad (6.11)$$
This is also plotted in Figure 6.16, as Gtheo.
Figure 6.16 G-function with envelope
One complication is that spatstat stores spatial information in a differ-
ent way than sp, GISTools and related packages, as noted earlier. This is
not a major hurdle, but it does mean that objects of types such as
SpatialPointsDataFrame must be converted to spatstat's ppp format. This is a compendium format containing both a set of points and a polygon describing the study area A, and can be created from a SpatialPoints or SpatialPointsDataFrame object combined with a SpatialPolygons or SpatialPolygonsDataFrame object. This is achieved via the as and as.ppp functions from the maptools package.
require(maptools)
require(spatstat)
# Bramblecanes is a dataset in ppp format from spatstat
data(bramblecanes)
# Convert the data to SpatialPoints, and plot them
bc.spformat <- as(bramblecanes,"SpatialPoints")
plot(bc.spformat)
# It is also possible to extract the study polygon
# referred to as a window in spatstat terminology
# Here it is just a rectangle...
bc.win <- as(bramblecanes$win,"SpatialPolygons")
plot(bc.win,add=TRUE)
It is also possible to convert objects in the other direction, via the as.ppp function. This takes two arguments: the coordinates of the SpatialPoints or SpatialPointsDataFrame object (extracted using the coordinates function), and an owin object created from a SpatialPolygons or SpatialPolygonsDataFrame via as.owin. owin objects are single polygons used by spatstat to denote study areas, and are a component of ppp objects. In the following example, the burgres.n point dataset from GISTools is converted to ppp format and a G-function is computed and plotted.
require(maptools)
require(spatstat)
# convert burgres.n to a ppp object
# (the window W is the union of the New Haven census blocks)
br.n.ppp <- as.ppp(coordinates(burgres.n),
    W = as.owin(gUnaryUnion(blocks)))
br.n.gf <- Gest(br.n.ppp)
plot(br.n.gf)
6.6 LOOKING AT MARKED POINT PATTERNS
A further advancement of the analysis of patterns of points of a single type is the
consideration of marked point patterns. Here, several kinds of points are considered
in a dataset, instead of only a single kind. For example, in the newhaven dataset
there are point data for several kinds of crime. The term ‘marked’ is used as each
point is thought of as being tagged (or marked) with a specific type. As with the
analysis of single kinds of points (or ‘unmarked’ points), the points are still treated
as random two-dimensional quantities. It is also possible to apply tests and analyses
to each individual kind of point – for example, testing each mark type against a null
hypothesis of CSR, or computing the K-function for that mark type. However, it is
also possible to examine the relationships between the point patterns of different
mark types. For example, it may be of interest to determine whether forced entry
residential burglaries occur closer to non-forced-entry burglaries than one might
expect if the two sets of patterns occurred independently.
One method of investigating this kind of relationship is the cross-K-function
between marks of type i and j. This is defined as
$$K_{ij}(d) = \lambda_j^{-1} E(N_{dij}) \qquad (6.12)$$
where Ndij is the number of events xk of type j within a distance d of a randomly chosen event from all recorded events {x1, …, xn} of type i, and λj is the intensity of the process marked j – measured in events per unit area (Lotwick and Silverman, 1982). If the process for points with mark j is CSR, then Kij(d) = λjπd². A similar simulation-based approach to that set out for K, L and G in earlier sections may be used to investigate Kij(d) and compare it to a hypothesised sample estimate of Kij(d) under CSR.
The empirical estimate of Kij(d) is obtained in a similar way to that in equation (6.5):
\hat{K}_{ij}(d) = \hat{\lambda}_j^{-1} \frac{\sum_k \sum_l I(d_{kl} < d)}{n_i n_j}   (6.13)
where k indexes the i-marked points and l indexes the j-marked points, and n_i and n_j are the respective numbers of points marked i and j. A correction (of the form in equation (6.7)) may also be applied. There is also a cross-L-function, L_{ij}(d), which relates to the cross-K-function in the same way that the standard L-function relates to the standard K-function.
6.6.1 Cross-L-Function Analysis in R
There is a function in spatstat called Kcross to compute cross-K-functions,
and a corresponding function called Lcross for cross-L-functions. These take a
ppp object and values for i and j as the key arguments. Since i and j refer to mark
types, it is also necessary to identify the marks for each point in a ppp object. This
can be done via the marks function. For example, for the bramblecanes object,
the points are marked in relation to the age of the cane (see Hutchings, 1979) with
three levels of age (labelled as 0, 1 and 2 in increasing order). Note that the marks
are factors. These may be listed by entering:
marks(bramblecanes)
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[28] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[55] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[82] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[109] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[136] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[163] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[190] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[217] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[244] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[271] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[298] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[325] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[352] 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[379] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[406] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[433] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
,and Linux, respectively. The Windows and Mac versions come
with installer packages and are easy to install, while the Linux binaries require use
of a command terminal.
RStudio can be downloaded from https://www.rstudio.com/products/
rstudio/download/ and the free version of RStudio Desktop is more than
sufficient for this book. RStudio allows you to organise your work into projects,
to use RMarkdown to create documents and webpages, to link to your GitHub
site and much more. It can be customised for your preferred arrangement of the
different panes.
You may have to set a mirror site from which the installation files will be down-
loaded to your computer. Generally you should pick one that is near to you. Once
you have installed the software you can run it. On a Windows computer, an R icon
is typically installed on the desktop; on a Mac, R can be found in the Applications
folder. Macs and Windows have slightly different interfaces, but the protocols and
processes for an R session on either platform are similar.
The base installation includes many functions and commands. However,
more often we are interested in using some particular functionality, encoded into
packages contributed by the R developer community. Installing packages for the
first time can be done at the command line in the R console using the install.packages command, as in the example below to install the tmap package, or via the R menu items.
install.packages("tmap", dependencies = T)
In Windows, the menu for this can be accessed via Packages > Install package(s)… and
on a Mac via Packages and Data > Package Installer. In either case, the first time
you install packages you may have to set a mirror site, from which to download
the packages. Once the package has been installed then the library can be called as
below.
library(tmap)
Further descriptions of packages, their installation and their data structures are
given in later chapters. There are literally thousands of packages that have been
contributed to the R project by various researchers and organisations. These can
be located by name at http://cran.r-project.org/web/packages/
available_packages_by_name.html if you know the package you wish
to use. It is also possible to search the CRAN website to find packages to per-
form particular tasks at http://www.r-project.org/search.html.
Additionally, many packages include user guides in the form of a PDF docu-
ment describing the package and listed at the top of the index page of the help
files for the package. The most commonly used packages in this book are listed
in Table 1.2.
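As an illustration of the kind of documentation that ships with packages, the help index and any vignettes (longer user guides) of an installed package can be opened from the console; the package names below are simply examples:
# open the help index for an installed package
help(package = "spatstat")
# list any vignettes supplied with a package
vignette(package = "sp")
# vignette("intro_sp", package = "sp")  # open a specific vignette by name (name may vary)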
When you install these packages it is strongly suggested you also install the dependencies – other packages required by the one that is being installed – by either checking the box in the menu or including dep = TRUE (short for dependencies = TRUE) in the command line as below:
install.packages("GISTools", dep = TRUE)
Packages are occasionally completely rewritten, and this can impact on code func-
tionality. Since we started writing the revision for this edition of the book, the read
Table 1.2 R packages used in this book
Name Description
datasets A package containing a number of datasets supplied with the standard installation of R
deldir Functions for Delaunay triangulations, Dirichlet or Voronoi tessellations of point datasets
dplyr A grammar of data manipulation
e1071 Functions for data mining, latent class analysis, clustering and modelling
fMultivar Tools for financial engineering but useful for spatial data
ggplot2 Declarative graphics creation, based on The Grammar of Graphics (Wilkinson, 2005)
GISTools Mapping and spatial data manipulation tools
gstat Functions for spatial and geostatistical modelling, prediction and simulation
GWmodel Geographically weighted models
maptools Functions for manipulating and reading geographical data
misc3d Miscellaneous functions for three-dimensional (3D) plots
OpenStreetMap High resolution raster maps and satellite imagery from OpenStreetMap
raster Manipulating, analysing and modelling of raster or gridded spatial data
RColorBrewer A package providing colour palettes for shading maps and other plots
RCurl General HTTP requests, functions to fetch uniform resource identifiers (URIs), to get and post web data
reshape2 Flexibly reshape data
rgdal Geospatial Data Abstraction Library, projection/transformation operations
rgeos Geometry Engine – Open Source (GEOS), topology operations on geometries
rgl 3D visualisation device (OpenGL)
RgoogleMaps Interface to query the Google server for static maps as map backgrounds
Rgraphviz Provides plotting capabilities for R graph objects
rjson Converts R objects into JavaScript Object Notation (JSON) objects and vice versa
sf Simple Features for R – a standardised way to encode spatial vector data
sp Classes and methods for spatial data
SpatialEpi Performs various spatial epidemiological analyses
spatstat A package for analysing spatial data, mainly spatial point patterns
spdep Functions and tests for evaluating spatial patterns and autocorrelation
tibble A modern reimagining of the data frame
tidyverse A collection of R packages designed for data science
tmap A mapping package that allows maps to be constructed in highly controllable layers
and write functions for spatial data in the maptools package (readShapePoly, writePolyShape, etc.) have been deprecated. For instance:
library(maptools)
?readShapePoly
If you examine the help files for these functions you will see that they contain a
warning and suggest other functions that should be used instead. The book web-
site will always contain working code snippets for each chapter to overcome any
problems caused by function deprecation.
Such changes are only a minor inconvenience and are part of the dynamic development environment that R provides for research: they are inevitable as packages are refined, improved and standardised.
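By way of illustration, the warnings in those help files point towards maintained alternatives; the sketch below shows a shapefile being read with rgdal or sf rather than the deprecated maptools readers (the file and layer names here are hypothetical):
library(rgdal)
# read a shapefile from the current directory (layer name is hypothetical)
tornados <- readOGR(dsn = ".", layer = "tornados")
library(sf)
# the sf equivalent, returning a simple features object
tornados_sf <- st_read("tornados.shp")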
1.8 THE R INTERFACE
We expect that most readers of this book and most users of R will be using the
RStudio interface to R, although users can of course still use R on its own. RStudio provides a good overview of an R session via its four panes: the console where code is entered; the script file that is being edited; the variables in the working environment and the files in the project space; and the plot windows and help pages. Preferences such as font type and size, pane colour and pane layout can all be customised, so users can set up the RStudio interface as they like it. As with R itself, there are few pull-down menus, and therefore you will type command lines in what is termed a command line interface. Like all command line interfaces, the learning curve is steep, but the interaction with the software is more detailed, which allows greater flexibility and precision in the specification of commands.
As you work through the book, the expectation is that you will run all the code
that you come across. We cannot emphasise enough the importance of learning by
doing – the best way to learn how to write R code is to write and enter it. Some of
the code might look a bit intimidating when first viewed, especially in later chap-
ters. However, the only really effective way to understand it is to give it a try.
Beyond this there are further choices to be made. Command lines can be entered
in two forms: directly into the R console window or as a series of commands into a
script window. We strongly advise that all code should be written in scripts (script
files have a .R extension) and then run from the script. RStudio includes its own
editor (similar to Notepad in Windows or TextEdit on a Mac). Scripts are useful if
you wish to automate data analysis, and have the advantage of keeping a saved
record of the relevant R programming language commands that you use in a given
piece of analysis. These can be re-executed,
,1 1 1 1
[460] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[487] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[514] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[541] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[568] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[595] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[622] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[649] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[676] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[703] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[730] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2
[757] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[784] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[811] 2 2 2 2 2 2 2 2 2 2 2 2 2
Levels: 0 1 2
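Rather than listing every mark as above, a compact summary is often more useful; tabulating the marks gives the number of canes at each age level (the counts shown below are those implied by the listing above):
table(marks(bramblecanes))
  0   1   2
359 385  79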
It is also possible to assign values to marks of a ppp object using the
expression:
marks(x) <- ...
where ... is any valid R expression creating a factor variable with the same number of elements as there are points in the ppp object x. This is
useful if converting a SpatialPointsDataFrame object into a ppp
object representing a marked process.
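As a brief sketch of this (using the newhaven data from GISTools referred to earlier, and not code from the text), the forced and non-forced entry burglary points can be combined into a single marked ppp object:
require(GISTools)
require(rgeos)
require(maptools)
require(spatstat)
data(newhaven)
# stack the two sets of burglary coordinates
xy <- rbind(coordinates(burgres.f), coordinates(burgres.n))
# build a ppp object with the merged census blocks as the window
burg.ppp <- as.ppp(xy, W = as.owin(gUnaryUnion(blocks)))
# attach a factor mark recording the burglary type of each point
marks(burg.ppp) <- factor(c(rep("forced", nrow(coordinates(burgres.f))),
                            rep("non-forced", nrow(coordinates(burgres.n)))))
summary(burg.ppp)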
As an example here, we compute and plot the cross-L-function for levels 0 and 1 of
the bramblecanes object (the resultant plot is shown in Figure 6.17):
cl.bramble <- Lcross(bramblecanes,i=0,j=1,correction='border')
plot(cl.bramble)
Figure 6.17 Cross-L-function for levels 0 and 1 of the bramble cane data (border-corrected estimate L̂_0,1(r) and the Poisson expectation plotted against r; one unit = 9 metres)
The envelope function may also be used (Figure 6.18):
clenv.bramble <- envelope(bramblecanes,Lcross,i=0,j=1,correction='border')
plot(clenv.bramble)
Thus, it would seem that there is a tendency for more young (level 1) bramble
canes to occur close to very young (level 0) canes. This can be formally tested, as
both mad.test and dclf.test can be used with Kcross and Lcross. Here
the use of Lcross with dclf.test is demonstrated:
dclf.test(bramblecanes,Lcross,i=0,j=1,correction='border',verbose=FALSE)
Diggle-Cressie-Loosmore-Ford test of CSR
Monte Carlo test based on 99 simulations
Summary function: L["0", "1"](r)
Reference function: theoretical
Alternative: two.sided
Interval of distance values: [0, 0.25] units (one unit = 9 metres)
Test statistic: Integral of squared absolute deviation
Deviation = observed minus theoretical
data: bramblecanes
u = 4.3982e−05, rank = 1, p-value = 0.01
6.7 INTERPOLATION OF POINT PATTERNS WITH CONTINUOUS
ATTRIBUTES
The previous section can be thought of as outlining methods for analysing point
patterns with categorical-level attributes. An alternative issue is the analysis of
Figure 6.18 Cross-L-function envelope for levels 0 and 1 of the bramble cane data (observed, theoretical, hi and lo curves of L̂_0,1(r) plotted against r; one unit = 9 metres)
point patterns in which the points have continuous (or measurement scale) attrib-
utes, such as height above sea level, soil conductivity or house price. A typical
problem here is interpolation: given a sample of measurements – say, {z_1, …, z_n} at locations {x_1, …, x_n} – the goal is to estimate the value of z at some new point x.
Possible methods for doing this can be based on fairly simple algorithms, or on
more sophisticated spatial statistical models. Here, three key measures will be
covered:
● Nearest neighbour interpolation
● Inverse distance weighting
● Kriging
6.7.1 Nearest Neighbour Interpolation
The first of these, nearest neighbour interpolation, is the simplest conceptually, and
can be stated as below:
● Find i such that |x_i − x| is minimised
● The estimate of z is z_i
In other words, to estimate z at x, use the value of z_i at the observation point closest to x. Since the sets of locations closest to each x_i form the Thiessen (Voronoi) polygons for the set of points, an obvious way to represent the estimates is as a set of Thiessen (Voronoi) polygons corresponding to the x_i points, with respective attributes of z_i. In rgeos there is no direct function to create Voronoi polygons, but Carson Farmer2 has made some code available to do this, providing a function called voronoipolygons. This has been slightly modified by the authors, and is
listed below. Note that the modified version of the code takes the points in a spatial points data frame as the basis for the Voronoi polygons, and carries across the attributes of the points to become attributes of the corresponding Voronoi polygons. Thus, in effect, if the z value of interest is an attribute in the input spatial points data frame then the nearest neighbour interpolation is implicitly carried out when using this function.
The function makes use of Voronoi computation tools provided by another package called deldir – however, this package does not make use of Spatial* object types, and therefore this function provides a ‘front end’ to allow its integration with the geographical information handling tools in rgeos, sp and
2 http://www.carsonfarmer.com/2009/09/voronoi-polygons-with-r/
maptools. Do not be too concerned if you find the code difficult to interpret – at
this stage it is sufficient to understand that it serves to provide a spatial data
manipulation function that is otherwise not available.
#
# Original code from Carson Farmer
# http://www.carsonfarmer.com/2009/09/voronoi-polygons-with-r/
# Subject to minor stylistic modifications
#
require(deldir)
require(sp)
# Modified Carson Farmer code
voronoipolygons = function(layer) {
  crds <- layer@coords
  z <- deldir(crds[,1], crds[,2])
  w <- tile.list(z)
  polys <- vector(mode='list', length=length(w))
  for (i in seq(along=polys)) {
    pcrds <- cbind(w[[i]]$x, w[[i]]$y)
    pcrds <- rbind(pcrds, pcrds[1,])
    polys[[i]] <- Polygons(list(Polygon(pcrds)),
                           ID=as.character(i))
  }
  SP <- SpatialPolygons(polys)
  voronoi <- SpatialPolygonsDataFrame(SP,
    data=data.frame(x=crds[,1],
                    y=crds[,2],
                    layer@data,
                    row.names=sapply(slot(SP, 'polygons'),
                                     function(x) slot(x, 'ID'))))
  proj4string(voronoi) <- CRS(proj4string(layer))
  return(voronoi)
}
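As a quick illustration of how the function might be called (a minimal sketch with made-up data, not an example from the text), any SpatialPointsDataFrame with an attribute of interest can be passed to voronoipolygons, and that attribute is carried across to the resulting polygons:
library(sp)
set.seed(1)
# a hypothetical set of 30 points with an attribute z
pts <- data.frame(x = runif(30), y = runif(30), z = rnorm(30))
coordinates(pts) <- ~x+y
proj4string(pts) <- CRS("+proj=utm +zone=30 +datum=WGS84")
# nearest neighbour surface: one Voronoi polygon per point, carrying z across
nn.z <- voronoipolygons(pts)
spplot(nn.z, "z")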
6.7.2 A Look at the Data
Having defined this function, the next stage is to use it on a test dataset. One such
dataset is provided in the gstat package. This package provides tools for a number
of approaches to spatial interpolation – including the other two listed in this chapter.
Of interest here is a data frame called fulmar. Details of the dataset may be
obtained by entering ?fulmar once the package gstat has been loaded. The data
are based on airborne counts of the sea bird Fulmarus glacialis during August and
September of 1998 and 1999, over the Dutch part of the North Sea. The counts are
taken along transects corresponding to flight paths of the observation aircraft, and
are transformed to densities by dividing counts by the area of observation, 0.5 km2.
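A quick first look at the data can be taken as below (a short sketch, assuming gstat is installed); the columns referred to in the next paragraph can be seen in the output of head:
library(gstat)
data(fulmar)
?fulmar
head(fulmar)            # densities are in column 'fulmar'; locations in 'x' and 'y'
summary(fulmar$fulmar)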
In this and the following sections you will analyse the data described above.
First, however, these data should be read into R, and converted into a Spatial∗
object. The first thing you will need to do is enter the code to define the function
voronoipolygons as listed above. The next few lines of code will read in the
data (stored in the data frame fulmar) and then convert them into a spatial points
data frame. Note that the fulmar sighting density is stored in column fulmar in
the data frame fulmar – the location is specified in columns x and y. The point
object is next converted into
,referred to or modified at a later date.
For this reason, you should get into the habit of constructing scripts for all your
analyses. Since being able to edit functions is extremely useful, both the MS
Windows and Mac OSX versions of R have built-in text editors. In RStudio you
should go to File > New File. In R, to start the Windows editor with a blank docu-
ment, go to File > New Script, and to open an existing script, File > Open Script.
To start the Mac editor, use the menu option File > New Document to open a new
document and File > Open Document to open an existing file.
Once code is written into these files, they can be saved for future use; rather
than copy and pasting each line of code, both R and RStudio have their own short-
cuts. Lines of code can be run directly by placing the cursor on the relevant line
(or highlighting a block) and then using Ctrl-R (Windows) or Cmd-Return (Mac).
RStudio also has a number of other keyboard short-cuts for running code, auto-
filling when you are typing, assignment, etc. Further tips are described at
http://r4ds.had.co.nz/workflow-basics.html.
It is also good practice to set the working directory at the beginning of your
R session. This can be done via the menu in RStudio: Session > Set Working
Directory > …. In Windows R select File > Change dir…, and in Mac R select
Misc > Set Working Directory. This points the R session to the folder you
choose and will ensure that any files you wish to read, write or save are placed
in this directory.
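The working directory can also be set and checked from the console rather than the menus; the paths below are hypothetical and should be replaced with a folder on your own machine:
setwd("C:/my_r_work")    # Windows
# setwd("~/my_r_work")   # Mac or Linux
getwd()                  # confirm the current working directory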
Scripts can be saved by selecting File > Save As which will prompt you to enter
a name for the R script you have just created. Choose a name (e.g. test.R) and
select save. It is good practice to use the file extension .R.
1.9 OTHER RESOURCES AND ACCOMPANYING WEBSITE
There are many freely available resources for R users. In order to get some practice
with R we strongly suggest that you download the ‘Owen Guide’ (entitled The R
Guide) and work through this up to and including Section 5. It can be accessed via
http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf.
It does not require any additional libraries or data and provides a gentle introduc-
tion to R and its syntax.
There are many guides to the R software available on the internet. In particular,
you may find some of the following links useful:
● http://www.r-bloggers.com
● http://stackoverflow.com/ and specifically
http://stackoverflow.com/questions/tagged/r
The contemporary nature of R means that much of the R development for pro-
cessing geographical information is chronicled on social media sites (you can
search for information on services such as Twitter, for example #rstats) and
blogs (such as the R-bloggers site listed above), rather than standard textbooks.
In addition to the above resources, there is a website that accompanies this book:
https://study.sagepub.com/Brunsdon2e. This site contains all of the code, scripts,
exercises and self-test questions included in each chapter, and these are available
to download. The scripts for each chapter allow the reader to copy and paste the
code into the R console or into their own script. At the time of writing, all of the
code in the book is correct. However, R and its packages are occasionally updated.
In most cases this is not problematic as the update almost always extends the
functionality of the package without affecting the original code. However, in a
few instances, specific packages are completely rewritten without backward com-
patibility. If this happens the code on the accompanying website will be updated
accordingly. You are therefore advised to check the website regularly for archival
components and links to new resources.
REFERENCES
Bivand, R.S., Pebesma, E.J. and Gómez-Rubio, V. (2013) Applied Spatial Data Analysis with R, 2nd edition. New York: Springer.
Brunsdon, C. and Chen, H. (2014) GISTools: Some further GIS capabilities for R. R
Package Version 0.7-4. http://cran.r-project.org/package=GISTools.
Krause, A. and Olson, M. (1997) The Basics of S and S-PLUS. New York: Springer.
Pebesma, E., Bivand, R., Cook, I., Keitt, T., Sumner, M., Lovelace, R., Wickham, H.,
Ooms, J. and Racine, E. (2016) sf: Simple features for R. R Package Version 0.6-3.
http://cran.r-project.org/package=sf.
Tennekes, M. (2015) tmap: Thematic maps. R Package Version 1. http://cran.r-project.
org/package=tmap.
Wilkinson, L. (2005) The Grammar of Graphics. New York: Springer.
2
DATA AND PLOTS
2.1 INTRODUCTION
This chapter introduces some of the different data types and data structures that
are commonly used in R and how to visualise them. As you work through this
book, you will gain experience in using and manipulating these individually and
within blocks of code. It sequentially builds on the ideas that are introduced, for
example developing your own functions, and tests this knowledge through self-
test exercises. As you progress, the exercises will place more emphasis on solving
problems, using the different data structures needed, rather than simply working
through the example code. As you work through the code, you should use the help
available to explore the different functions that are called in the code snippets, such
as max, sqrt and length.
This chapter covers a lot of ground – it will:
● Review basic commands in R
● Introduce variables and assignment
● Introduce data types and classes
● Describe how to test for and manipulate data types
● Introduce and compare data frames and tibbles
● Introduce basic plot commands
● Describe how to read, write, load and save different data types
Chapter 1 introduced R, the reasons for using it in spatial analysis and mapping,
and described how to install it. It also directed you to some of the many
resources and introductory exercises for undertaking basic operations in R.
Specifically it advised that you should work through the ‘Owen Guide’ (entitled
The R Guide) up to the end of Section 5. This can be accessed via
https://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf.
This chapter assumes that you have worked your way through this – it does not
take long and provides critical introductory knowledge for the more specialised
materials that will be covered in the rest of this book.
2.2 THE BASIC INGREDIENTS OF R: VARIABLES AND ASSIGNMENT
The R interface can be used as a sort of calculator, returning the results of simple
mathematical operations such as (−5 + −4). However, it is normally convenient
to assign values to variables. The form for doing this is:
R_object <- value
The arrow performs the assignments and is referred to as gets. So in this case you
would say R_object gets value. It is possible to use an equals sign instead of gets, but the arrow is generally preferred: the equals sign also serves to match arguments inside function calls, so it cannot be used for assignment in every context, whereas the arrow can. The objects and variables that are created can then be manipulated or subject to further operations.
# examples of simple assignment
x <- 5
y <- 4
# the variables can be used in other operations
x+y
[1] 9
# including defining new variables
z <- x + y
z
[1] 9
# which can then be passed to other functions
sqrt(z)
[1] 3
The snippet of code above is the first that you have come across in this book.
There will be further snippets throughout each chapter. Two key points. First,
you are strongly advised to enter and run the code at the R prompt yourself.
Our very strong advice is that you write the code into a script or document
using the in-built text editor in RStudio. For example, for each chapter you might
start a new RStudio session or project and open a new .R file. This script
can be used to save the code snippets you enter and to include your comments
and annotations. The reasons for doing this are so that you get used to using the
The basic assignment type in R is to a vector of values. Vectors can have single values as in x, y and z above, or multiple values. Note the use of
c(4.3,7.1, …) in the code below, where the c instructs R to combine or
concatenate multiple values:
# example of vector assignment
tree.heights <- c(4.3,7.1,6.3,5.2,3.2,2.1)
tree.heights
[1] 4.3 7.1 6.3 5.2 3.2 2.1
Remember that UPPER and lower case matters to R. So tree.heights, Tree.
Heights and TREE.HEIGHTS will be treated as referring to different variables
by R. Make sure you type in upper and lower case exactly as it is written, otherwise
you are likely to get an error.
In the example above, a vector of values has been assigned to the variable
tree.heights. It is possible to apply a single assignment to the entire vector, as
in the code below that returns tree.heights squared. Note how the operation
returns the square of each element in the vector.
tree.heights**2
[1] 18.49 50.41 39.69 27.04 10.24 4.41
Other operations or functions can then be applied to these vector variables:
sum(tree.heights)
[1] 28.2
mean(tree.heights)
[1] 4.7
R console, and running the code will help your understanding of the code’s
functionality. Lines of code can be run directly by placing the cursor on the line
of code (or highlighting a block of code) and then using Ctrl-R (Windows) or
Cmd-Return (Mac). Keeping copies of your code in this way will help you keep
a record of it and will allow you to go back and edit it at a later date. Second, we
would like to emphasise the importance of learning by doing and getting your
hands dirty. Some of the code might look a bit fearsome when first viewed,
especially in later chapters, but the only really effective way to understand it is
to give it a try. Remember that the code and chapter summaries are available on
the book’s website https://study.sagepub.com/Brunsdon2e so that
you can copy and paste these into the R console or your own script. A final point
is that in the code, any comments are prefixed by # and are ignored by R when
entered into the console.
And, if needed, the results can be assigned to yet further variables:
max.height <- max(tree.heights)
max.height
[1] 7.1
One of the advantages of vectors and other structures with multiple data elements
is that they can be subsetted. Individual elements or subsets of elements can be
extracted and manipulated:
tree.heights
[1] 4.3 7.1 6.3 5.2 3.2 2.1
tree.heights[1] # first element
[1] 4.3
tree.heights[1:3] # a subset of elements 1 to 3
[1] 4.3 7.1 6.3
sqrt(tree.heights[1:3]) #square roots of the subset
[1] 2.073644 2.664583 2.509980
tree.heights[c(5,3,2)] # a subset of elements 5,3,2: note the ordering
[1] 3.2 6.3 7.1
In the above examples the numeric values were assigned. However, character
or logical values can be also assigned as in the code below. This starts to hint at
the idea of different classes and types of variables which are described in more
detail in the next sections.
# examples of character variable assignment
name <- "Lex Comber"
name
[1] "Lex Comber"
# these can be assigned to a vector of character variables
cities <- c("Leicester","Newcastle","London","Leeds","Exeter")
cities
[1] "Leicester" "Newcastle" "London" "Leeds"
[5] "Exeter"
length(cities)
[1] 5
# an example of a logical variable
northern <- c(FALSE, TRUE, FALSE, TRUE, FALSE)
northern
[1] FALSE TRUE FALSE TRUE FALSE
# this can be used to subset other variables
cities[northern]
[1] "Newcastle" "Leeds"
2.3 DATA TYPES AND DATA CLASSES
This section introduces data classes and data types to a sufficient depth for read-
ers of this book. However, more formal descriptions of basic classes for R data
objects can be found in the R Manual on the CRAN website at
http://stat.ethz.ch/R-manual/R-devel/library/methods/
html/BasicClasses.html.
2.3.1 Data Types in R
Data in R can be considered as being organised into a hierarchy of data types
which can then be used to hold data values in different structures. Each of the
types is associated with a test and a conversion function. The basic or core data
types and associated tests and conversions are shown in Table 2.1.
You should note from the table that each type has an associated test in the form
is.xyz, which will return TRUE or FALSE, and a conversion in the form as.
xyz. Most of the exercises, methods, tools, functions and analyses in this book
work with only a small subset of these data types: character, numeric and
logical. These data types can be used to populate different data structures or
classes, including vectors, matrices, data frames, lists and factors. The data types
are described in more detail below. In each case the objects created by the different
classes, conversion functions or tests are illustrated.
Table 2.1 Data type, tests and conversion functions
Type Test Conversion
character is.character as.character
complex is.complex as.complex
double is.double as.double
expression is.expression as.expression
integer is.integer as.integer
list is.list as.list
logical is.logical as.logical
numeric is.numeric as.numeric
single is.single as.single
raw is.raw as.raw
2.3.1.1 Characters
Character variables contain text. By default the function character creates a vec-
tor of whatever length is specified. Each element in the vector is equal to "", an
empty character element in the variable. The function as.character tries to
convert its argument to character type, removing any attributes including, for
example, vector element names. The function is.character tests whether the
arguments passed to it are of character type and returns TRUE or FALSE depending
on whether its argument is of character type or not. Consider the following exam-
ples of these functions and the results when they are applied to different inputs:
character(8)
[1] "" "" "" "" "" "" "" ""
# conversion
as.character("8")
[1] "8"
# tests
is.character(8)
[1] FALSE
is.character("8")
[1] TRUE
2.3.1.2 Numeric
Numeric data variables are used to hold numbers. The function numeric is used
to create a vector of the specified length with each element equal to 0. The func-
tion as.numeric tries to convert (coerce) its argument to numeric type. It is
identical to as.double and to as.real. The function is.numeric tests
whether the arguments passed to it are of numeric type and returns TRUE or
FALSE depending on whether its argument is of numeric type or not. Notice how
the last test in the code below returns FALSE because not all of the elements are
numeric.
numeric(8)
[1] 0 0 0 0 0 0 0 0
# conversions
as.numeric(c("1980","-8","Geography"))
[1] 1980 -8 NA
as.numeric(c(FALSE,TRUE))
[1] 0 1
# tests
is.numeric(c(8, 8))
[1] TRUE
is.numeric(c(8, 8, 8, "8"))
[1] FALSE
2.3.1.3 Logical
The function logical creates a logical vector of the specified length and by default
each element of the vector is set to equal FALSE. The function as.logical
attempts to convert its argument to be of logical type. It removes any attributes
including, for example, vector element names. A range of character strings c("T",
"TRUE", "True", "true"), as well any number not equal to zero, are regarded
as TRUE. Similarly, c("F", "FALSE", "False", "false") and zero are
regarded as FALSE. All others are regarded as NA. The function is.logical
returns TRUE or FALSE depending on whether the argument passed to it is of
logical type or not.
logical(7)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# conversion
as.logical(c(7,5,0,-4,5))
[1] TRUE TRUE FALSE TRUE TRUE
# TRUE and FALSE can be converted to 1 and 0
as.logical(c(7,5,0,-4,5)) * 1
[1] 1 1 0 1 1
as.logical(c(7,5,0,-4,5)) + 0
[1] 1 1 0 1 1
# different ways to declare TRUE and FALSE
as.logical(c("True","T","FALSE","Raspberry","9","0", 0))
[1] TRUE TRUE FALSE NA NA NA NA
Logical vectors are very useful for indexing and subsetting data, including spatial data, to select the data that satisfy some criteria. For example, consider the following:
data <- c(3, 6, 9, 99, 54, 32, −102)
# a logical test
index <- (data > 10)
index
[1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE
# used to subset data
data[index]
[1] 99 54 32
sum(data)
[1] 101
sum(data[index])
[1] 185
2.3.2 Data Classes in R
The different data types can be used to populate different data structures or classes.
This section will describe and illustrate vectors, matrices, data frames, lists and
factors, data classes that are commonly used in spatial data analysis.
2.3.2.1 Vectors
All of the commands in R in Section 2.3.1 produced vectors. Vectors are the most
commonly used data structure and the standard one-dimensional R variable.
You will have noticed that when you specified character or logical, etc., a
vector of a given length was produced. An alternative approach is to use the
function vector, which produces a vector of the length and type or mode
specified. The default is logical, and when you assign values to vectors R will
seek to convert them to whichever vector mode is most convenient. Recall that
the test is.vector returns TRUE if its argument is a vector of the specified
class or mode with no attributes other than names, returning FALSE otherwise,
and that the function as.vector seeks to convert its argument into a vector of
whatever mode is specified.
# defining vectors
vector(mode = "numeric", length = 8)
[1] 0 0 0 0 0 0 0 0
vector(length = 8)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# testing and conversion
tmp <- data.frame(a=10:15, b=15:20)
is.vector(tmp)
[1] FALSE
as.vector(tmp)
a b
1 10 15
2 11 16
3 12 17
4 13 18
5 14 19
6 15 20
2.3.2.2 Matrices
The function matrix creates a matrix from the data and parameters that are
passed to it. This must include parameters for the number of columns and rows in
the matrix. The function as.matrix attempts to turn its argument into a matrix,
and again the test is.matrix tests to see whether its argument is a matrix.
# defining matrices
matrix(ncol = 2, nrow = 0)
[,1] [,2]
matrix(1:6)
[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 4
[5,] 5
[6,] 6
matrix(1:6, ncol = 2)
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
# conversion and test
as.matrix(6:3)
[,1]
[1,] 6
[2,] 5
[3,] 4
[4,] 3
is.matrix(as.matrix(6:3))
[1] TRUE
Matrix rows and columns can be named – note the use of byrow=TRUE in the
following.
flow <- matrix(c(2000, 1243, 543, 1243, 212, 545,
654, 168, 109), c(3,3), byrow=TRUE)
# Rows and columns can have names, not just 1,2,3,…
colnames(flow) <- c("Leeds", "Maynooth", "Elsewhere")
rownames(flow) <- c("Leeds", "Maynooth", "Elsewhere")
# examine the matrix
flow
Leeds Maynooth Elsewhere
Leeds 2000 1243 543
Maynooth 1243 212 545
Elsewhere 654 168 109
# and functions exist to summarise
outflows <- rowSums(flow)
outflows
Leeds Maynooth Elsewhere
3786 2000 931
However, if the data class is not a matrix then just use names, rather than
rownames or colnames.
z <- c(6,7,8)
names(z) <- c("Newcastle","London","Manchester")
z
Newcastle London Manchester
6 7 8
R has many additional tools for manipulating matrices and performing matrix
algebra functions that are not described here. However, as spatial scientists we are
often interested in analysing data that have a matrix-like form, as in a data table.
For example, in an analysis of spatial data in vector format, the rows in the attrib-
ute table represent specific features (such as polygons) and the columns hold
information about the attributes of those features. Alternatively, in a raster analysis
environment, the rows and columns may represent specific latitudes and longi-
tudes, or northings and eastings, or raster cells. Methods for analysing data in
matrix-like structures will be covered in more detail in later chapters as spatial
data objects (Chapter 3) and spatial analyses (Chapter 5) are introduced.
You will have noticed in the code snippets that a number of new functions are introduced. For example, early in this chapter, the function sum was used. R includes a number of functions that can be used to generate descriptive statistics such as sum and max. You should explore these as they occur in the text to develop your knowledge of and familiarity with R. Further useful examples are in the code below and throughout this book. You could even store them in your own R script. R includes extensive help files which can be used to explore how different functions can be used, frequently with example snippets of code. An illustration of how to find out more about the sum function and some further summary functions is provided in the code below.
?sum
help(sum)
# Create a variable to pass to other summary functions
x <- matrix(c(3,6,8,8,6,1,-1,6,7),c(3,3),byrow=TRUE)
# Sum over rows
rowSums(x)
# Sum over columns
colSums(x)
# Calculate column means
colMeans(x)
# Apply function over rows (1) or columns (2) of x
apply(x,1,max)
# Logical operations to select matrix elements
x[,c(TRUE,FALSE,TRUE)]
# Add up all of the elements in x
sum(x)
# Pick out the leading diagonal
diag(x)
# Matrix inverse
solve(x)
# Tool to handle rounding
zapsmall(x %*% solve(x))
2.3.2.3 Factors
The function factor creates a vector with specific categories, defined in the levels parameter. The ordering of factor variables can be specified and an ordered function also exists. The functions as.factor and as.ordered are the coercion functions. The test is.factor returns TRUE or FALSE depending on whether its argument is of type factor or not, and is.ordered returns TRUE when its argument is an ordered factor and FALSE otherwise.
# a vector assignment
house.type <- c("Bungalow", "Flat", "Flat",
  "Detached", "Flat", "Terrace", "Terrace")
# a factor assignment
house.type <- factor(c("Bungalow", "Flat",
"Flat", "Detached", "Flat", "Terrace", "Terrace"),
levels=c("Bungalow","Flat","Detached","Semi","Terrace"))
house.type
[1] Bungalow Flat Flat Detached Flat Terrace
[7] Terrace
Levels: Bungalow Flat Detached Semi Terrace
# table can be used to summarise
table(house.type)
house.type
Bungalow Flat Detached Semi Terrace
1 3 1 0 2
# levels controls what can be assigned
house.type <- factor(c("People Carrier", "Flat",
"Flat", "Hatchback", "Flat", "Terrace", "Terrace"),
levels=c("Bungalow","Flat","Detached","Semi","Terrace"))
house.type
[1] <NA>    Flat    Flat    <NA>    Flat    Terrace Terrace
Levels: Bungalow Flat Detached Semi Terrace
Factors are useful for categorical or classified data – that is, data values that must
fall into one of a number of predefined classes. It is easy to see how this might
be relevant to geographical analysis, where many features represented in spatial
data are labelled using one of a set of discrete classes.
2.3.2.4 Ordering
There is no concept of ordering in factors. However, this can be imposed by using
the ordered function. Ordering allows inferences about preference or hierarchy
to be made (lower–higher, better–worse, etc.) and this can be used in data selection
or indexing (as above) or in the interpretation of derived analyses.
income <-factor(c("High", "High", "Low", "Low",
"Low", "Medium", "Low", "Medium"),
levels=c("Low", "Medium", "High"))
income > "Low"
[1] NA NA NA NA NA NA NA NA
# levels in ordered defines a relative order
income <-ordered(c("High", "High", "Low", "Low",
"Low", "Medium", "Low", "Medium"),
levels=c("Low", "Medium", "High"))
income > "Low"
[1] TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE
Thus we can see that ordering is implicit in the way that the levels are specified and
allows other, ordering-related functions to be applied to the data.
The functions sort and table are new functions. In the above code relating
to factors, the function table was used to generate a tabulation of the data in house.type. It provides a count of the occurrence of each level in house.type. The command sort orders a vector or factor. You should use the help in
R to explore how these functions work and try them with your own variables.
For example:
sort(income)
2.3.2.5 Lists
The character, numeric and logical data types and the associated data
classes described above all contain elements that must all be of the same basic type.
Lists do not have this requirement. Lists have slots for collections of different ele-
ments. A list allows you to gather a variety of different data types together in a single
data structure and the nth element of a list is denoted by double square brackets.
tmp.list <- list("Lex Comber",c(2015, 2018),
"Lecturer", matrix(c(6,3,1,2), c(2,2)))
tmp.list
[[1]]
[1] "Lex Comber"
[[2]]
[1] 2015 2018
[[3]]
[1] "Lecturer"
[[4]]
[,1] [,2]
[1,] 6 1
[2,] 3 2
# elements of the list can be selected
tmp.list[[4]]
[,1] [,2]
[1,] 6 1
[2,] 3 2
From the above it is evident that the function list returns a list structure composed
of its arguments. Each value can be tagged depending on how the argument was
specified. The conversion function as.list attempts to coerce its argument to a
list. It turns a factor into a list of one-element factors and drops attributes that are not
specified. The test is.list returns TRUE if and only if its argument is a list. These
are best explored through some examples; note that list items can be given names.
employee <- list(name="Lex Comber", start.year = 2015,
position="Professor")
employee
$name
[1] "Lex Comber"
$start.year
[1] 2015
$position
[1] "Professor"
Lists can be joined together with append:
append(tmp.list, list(c(7,6,9,1)))
and lapply applies a function to each element of a list:
# lapply with different functions
lapply(tmp.list[[2]], is.numeric)
lapply(tmp.list, length)
Note that the length of a matrix, even when held in a list, is the total number of
elements.
2.3.2.6 Defining Your Own Classes
In R it is possible to define your own data type and to associate it with specific
behaviours, such as its own way of printing or drawing. For example, you will notice
in later chapters that the plot function is used to draw maps for spatial data objects
as well as conventional graphs. Suppose we create a list containing some employee
information.
employee <- list(name="Lex Comber", start.year = 2015,
position="Professor")
This can be assigned to a new class, called staff in this case (it could be any
name, but meaningful ones help).
class(employee) <- "staff"
Then we can define how R treats that class by writing functions of the form <function>.<class name>. Here the function for printing is modified by the new class definition:
print.staff <- function(x) {
  cat("Name: ",x$name,"\n")
  cat("Start Year: ",x$start.year,"\n")
  cat("Job Title: ",x$position,"\n")}
# an example of the print class
print(employee)
Name: Lex Comber
Start Year: 2015
Job Title: Professor
You can see that R knows to use a different print function if the argument is not a variable of class staff. You could modify how your R environment treats existing classes in the same way, but do this with caution. You can also undo the class assigned by using unclass, and the print.staff function can be removed permanently by using rm(print.staff):
print(unclass(employee))
$name
[1] "Lex Comber"
$start.year
[1] 2015
$position
[1] "Professor"
2.3.2.7 Classes in Lists
Variables can be assigned to new or user-defined class objects. The example below defines a function to create a new staff object.
new.staff <- function(name,year,post) {
  result <- list(name=name, start.year=year, position=post)
  class(result) <- "staff"
  return(result)}
A list can then be defined, which is populated using that function as in the code below (note that functions will be dealt with more formally in later chapters).
leeds.uni <- vector(mode='list',3)
# assign values to elements in the list
leeds.uni[[1]] <- new.staff("Heppenstall, Alison", 2017,"Professor")
leeds.uni[[2]] <- new.staff("Comber, Lex", 2015,"Professor")
leeds.uni[[3]] <- new.staff("Langlands, Alan", 2014,"VC")
And the list can be examined by entering:
leeds.uni
2.3.2.8 data.frame versus tibble
Data of different types and classes are often held in tabular format. The data.frame and tibble classes of the data table are described in this section. Generally, in data tables, each of the records (rows) relates to some kind of real-world feature (a person, a transaction, a date, etc.) and the columns represent some attribute associated with that feature. In R data can be held in a matrix, but matrices can only hold one type of data (e.g. integer, logical and character). However, data.frame and tibble class objects can hold different data types in different columns (or fields). This section introduces these (in fact, the tibble class includes data.frame) because they are used to hold attributes of spatial objects (points, lines, areas, pixels) in the R spatial data formats sf and sp, as introduced in detail in Chapter 3. Thus in spatial data tables, each record typically represents some real-world geographical feature (a place, a route, a region, etc.) and the fields describe variables or attributes associated with that feature (population, length, area, etc.).
The data.frame class in R is composed of a series of vectors of equal length, which together form a two-dimensional data structure. Each vector records values for a particular theme or attribute. Typically these form the columns in a data frame, and the name of each vector provides the column name or header. They are ordered such that the nth element in each vector describes a property for the nth record (row) representing the nth feature. The data.frame class is the most commonly used method for storing data in R. A data frame can be created using the data.frame() function:
df <- data.frame(dist = seq(0,400, 100),
  city = c("Leeds", "Nottingham", "Leicester", "Durham", "Newcastle"))
str(df)
'data.frame': 5 obs. of 2 variables:
$ dist: num 0 100 200 300 400
$ city: Factor w/ 5 levels "Durham","Leeds",..: 2 5 3 1 4
The data.frame() function by default encodes character strings into factors. To see this enter:
df$city
To overcome this the df object can be refined using stringsAsFactors = FALSE:
df <- data.frame(dist = seq(0,400, 100),
  city = c("Leeds", "Nottingham", "Leicester", "Durham", "Newcastle"),
  stringsAsFactors = FALSE)
str(df)
'data.frame': 5 obs. of 2 variables:
$ dist: num 0 100 200 300 400
$ city: chr "Leeds" "Nottingham" "Leicester" "Durham" …
The tibble class is a reworking of the data.frame class that seeks to retain the operational advantages of data frames and eliminate aspects that have proven to be less effective. Enter the code below to create tb (the tibble function is in the tibble package, loaded here in case it is not already attached):
library(tibble)
tb <- tibble(dist = seq(0,400, 100),
  city = c("Leeds", "Nottingham", "Leicester", "Durham", "Newcastle"))
Probably the biggest criticism of data.frame is the partial matching behaviour. Enter the following code:
df$ci
[1] "Leeds" "Nottingham" "Leicester" "Durham"
[5] "Newcastle"
tb$ci
NULL
Although there is no variable called ci, the partial matching in the data.frame means that the city variable is returned. This is a bit worrying! A further problem is what gets returned when a data table is subsetted. A tibble always returns a tibble, whereas a data frame may return a vector or a data frame, depending on the dimensions of the result. For example, compare the outputs of the following code:
# 1 column
df[,2]
tb[,2]
class(df[,2])
class(tb[,2])
# 2 columns
df[,1:2]
tb[,1:2]
class(df[,1:2])
class(tb[,1:2])
Note that a tibble is a data frame, but tibbles