
An INTRODUCTION to R for SPATIAL ANALYSIS & MAPPING

In the digital age, social and environmental scientists have more spatial data at

their fingertips than ever before. But how do we capture this data, analyse and

display it, and, most importantly, how can it be used to study the world?

Spatial Analytics and GIS is a series of books that deal with potentially tricky tech-

nical content in a way that is accessible, usable and useful. Early titles include Urban

Analytics by Alex Singleton, Seth Spielman and David Folch, and An Introduction

to R for Spatial Analysis and Mapping by Chris Brunsdon and Lex Comber.

Series Editor: Richard Harris

About the Series Editor

Richard Harris is Professor of Quantitative Social Geography at the School of

Geographical Sciences, University of Bristol. He is the lead author on three text-

books about quantitative methods in geography and related disciplines, including

Quantitative Geography: The Basics (Sage, 2016).

Richard’s interests are in the geographies of education and the education of geog-

raphers. He is currently Director of the University of Bristol Q-Step Centre, part of

a multimillion-pound UK initiative to raise quantitative skills training among

social science students, and is working with the Royal Geographical Society (with

IBG) to support data skills in schools.

Books in this Series:

Geocomputation, Chris Brunsdon and Alex Singleton

Agent-Based Modelling and Geographical Information Systems,

Andrew Crooks, Nicolas Malleson, Ed Manley and Alison Heppenstall

Modelling Environmental Change, Colin Robertson

An Introduction to Big Data and Spatial Data Analytics in R,

Lex Comber and Chris Brunsdon

Published in Association with this Series:

Quantitative Geography, Richard Harris

An INTRODUCTION to R for SPATIAL ANALYSIS & MAPPING

CHRIS BRUNSDON and LEX COMBER

SECOND EDITION

SAGE Publications Ltd

1 Oliver’s Yard

55 City Road

London EC1Y 1SP

SAGE Publications Inc.

2455 Teller Road

Thousand Oaks, California 91320

SAGE Publications India Pvt Ltd

B 1/I 1 Mohan Cooperative Industrial Area

Mathura Road

New Delhi 110 044

SAGE Publications Asia-Pacific Pte Ltd

3 Church Street

#10-04 Samsung Hub

Singapore 049483

Editor: Robert Rojek

Assistant editor: John Nightingale

Production editor: Katherine Haw

Copyeditor: Richard Leigh

Proofreader: Neville Hankins

Indexer: Martin Hargreaves

Marketing manager: Susheel Gokarakonda

Cover design: Francis Kenney

Typeset by: C&M Digitals (P) Ltd, Chennai, India

Printed in the UK

© Chris Brunsdon and Lex Comber 2019

First edition published 2015. Reprinted 2015 (twice), 2016

(twice) and 2017 (twice)

This edition first published 2019

Apart from any fair dealing for the purposes of research

or private study, or criticism or review, as permitted under

the Copyright, Designs and Patents Act, 1988, this

publication may be reproduced, stored or transmitted in

any form, or by any means, only with the prior permission

in writing of the publishers, or in the case of reprographic

reproduction, in accordance with the terms of licences

issued by the Copyright Licensing Agency. Enquiries

concerning reproduction outside those terms should be

sent to the publishers.

Library of Congress Control Number: 2018943836

British Library Cataloguing in Publication data

A catalogue record for this book is available from

the British Library

ISBN 978-1-5264-2849-3

ISBN 978-1-5264-2850-9 (pbk)

At SAGE we take sustainability seriously. Most of our products are printed in the UK using responsibly sourced

papers and boards. When we print overseas we ensure sustainable papers are used as measured by the PREPS

grading system. We undertake an annual audit to monitor our sustainability.

PRAISE FOR AN INTRODUCTION TO R FOR SPATIAL ANALYSIS AND MAPPING 2E

‘There’s no better text for showing students and data analysts how to use R for

spatial analysis, mapping and reproducible research. If you want to learn how to

make sense of geographic data and would like the tools to do it, this is your guide.’

Richard Harris, University of Bristol

‘The future of GIS is open-source! An Introduction to R for Spatial Analysis and

Mapping is an ideal introduction to spatial data analysis and mapping using the

powerful open-source language R. Assuming no prior knowledge, Brunsdon and

Comber get the reader up to speed quickly with clear writing, excellent pedagogic

material and a keen sense of geographic applications. The second edition is timely

and fresh. This book should be required reading for every Geography and GIS

student, as well as faculty and professionals.’

Harvey Miller, The Ohio State University

‘While there are many books that provide an introduction to R, this is one of the

few that provides both a general and an application-specific (spatial analysis)

introduction and is therefore far more useful and accessible. Written by two

experts in the field, it covers both the theory and practice of spatial statistical

analysis and will be an important addition to the bookshelves of researchers whose

spatial analysis needs have outgrown currently available GIS software.’

Jennifer Miller, University of Texas at Austin

‘Students and other life-long learners need flexible skills to add value to spatial

data. This comprehensive, accessible and thoughtful book unlocks the spatial data

value chain. It provides an essential guide to the R spatial analysis ecosystem. This

excellent state-of-the-art treatment will be widely used in student classes, continu-

ing professional development and self-tuition.’

Paul Longley, University College London

‘In this second edition, the authors have once again captured the state of the art in

one of the most widely used approaches to spatial analysis. Spanning from the

absolute beginner to more advanced concepts and underpinned by a strong “learn

by doing” ethos, this book is ideally suited for both students and teachers of spatial

analysis using R.’

Jonny Huck, The University of Manchester

‘A timely update to the de facto reference and textbook for anyone ‒ geographer, planner, or (geo)data scientist ‒ needing to undertake mapping and spatial analysis in R. Complete with self-tests and valuable insights into the transition from sp to sf, this book will help you to develop your ability to write flexible, powerful, and fast geospatial code in R.’

Jonathan Reades, King’s College London

‘Brunsdon and Comber’s 2nd edition of their acclaimed text book is updated with

the key developments in spatial analysis and mapping in R and maintains the

pedagogic style that made the original volume such an indispensable resource for

teaching and research.’

Scott Orford, Cardiff University

CONTENTS

About the authors x

1 INTRODUCTION 1

1.1 Introduction to the Second Edition 1
1.2 Objectives of This Book 2
1.3 Spatial Data Analysis in R 3
1.4 Chapters and Learning Arcs 4
1.5 Specific Changes to the Second Edition 5
1.6 The R Project for Statistical Computing 7
1.7 Obtaining and Running the R Software 7
1.8 The R Interface 10
1.9 Other Resources and Accompanying Website 11

References 12

2 DATA AND PLOTS 13
2.1 Introduction 13
2.2 The Basic Ingredients of R: Variables and Assignment 14
2.3 Data Types and Data Classes 16
2.4 Plots 34
2.5 Another Plot Option: ggplot 43
2.6 Reading, Writing, Loading and Saving Data 50
2.7 Answers to Self-Test Questions 52

Reference 54

3 BASICS OF HANDLING SPATIAL DATA IN R 55
3.1 Overview 55
3.2 Introduction to sp and sf: The sf Revolution 57
3.3 Reading and Writing Spatial Data 63
3.4 Mapping: An Introduction to tmap 66
3.5 Mapping Spatial Data Attributes 81
3.6 Simple Descriptive Statistical Analyses 98
3.7 Self-Test Questions 107
3.8 Answers to Self-Test Questions 110

References 117


4 SCRIPTING AND WRITING FUNCTIONS IN R 118

4.1 Overview 118

4.2 Introduction 119

4.3 Building Blocks for Programs


Tibbles seek to be lazy by not changing variable names or types and by not doing partial matching, and they are surly because they complain more. This forces cleaner coding by identifying problems earlier in the data analysis cycle.

Finally, the print method for tibble returns the first 10 records by default,

whereas for data.frame the head() function is frequently used to display just

the first 6 records. The tibble class also includes a description of the class of each

field (column) when it is printed.
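For reference, the df and tb variables used in the examples below were created earlier in the chapter. A minimal sketch re-creating them (assuming the tibble package is loaded, and using the same distances and city names that appear in the outputs below) would be:

library(tibble)
df <- data.frame(dist = seq(0, 400, 100),
                 city = c("Leeds", "Nottingham", "Leicester",
                          "Durham", "Newcastle"))
tb <- as_tibble(df)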

It is possible to convert between tibbles and data frames using the following

functions:

data.frame(tb)

as_tibble(df)

The following functions work with both tibbles and data frames:

names()

colnames()

rownames()

length() # length of the underlying list

ncol()

nrow()

They can be subsetted in the same way as a matrix, using the [row, column] notation as above, and they can both be combined using cbind() and rbind().
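For example (an illustrative snippet, not from the original text), using the df and tb objects defined above:

df[1:2, ]     # the first two rows
tb[, "city"]  # the city column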

cbind(df, Pop = c(700,250,230,150,1200))

dist city Pop

1 0 Leeds 700

2 100 Nottingham 250

3 200 Leicester 230

4 300 Durham 150

5 400 Newcastle 1200


cbind(tb, Pop = c(700,250,230,150,1200))

dist city Pop

1 0 Leeds 700

2 100 Nottingham 250

3 200 Leicester 230

4 300 Durham 150

5 400 Newcastle 1200

You could explore the tibble vignette by entering:

vignette("tibble")

2.3.3 Self-Test Questions

In the next pages there are a number of self-test questions. In contrast to the previ-

ous sections where the code is provided in the text for you to work through (i.e.

you enter and run it yourself), the self-test questions are tasks for you to complete,

mostly requiring you to write R code. Answers to them are provided in Section 2.7.

The self-test questions relate to the main data types that have been introduced:

factors, matrices, lists (named and unnamed) and classes.

2.3.3.1 Factors

Recall from the descriptions above that factors are used to represent categorical

data – where a small number of categories are used to represent some characteris-

tic in a variable. For example, the colour of a particular model of car sold by a

showroom in a week can be represented using factors:

colours <- factor(c("red","blue","red","white",

"silver","red","white","silver",

"red","red","white","silver","silver"),

levels=c("red","blue","white","silver","black"))

Since the only colours this car comes in are red, blue, white, silver and black, these

are the only levels in the factor.

Self-Test Question 1. Suppose you were to enter:

colours[4] <- "orange"

colours

What would you expect to happen? Why?

Next, use the table function to see how many of each colour were sold. First

reassign the colours (as you may have altered this variable in the previous self-test

question):


colours <- factor(c("red","blue","red","white",

"silver","red","white","silver",

"red","red","white","silver","silver"),

levels=c("red","blue","white","silver","black"))

table(colours)

colours

red blue white silver black

5 1 3 4 0

Note that the result of the table function is just a standard vector, but that each

of its elements is named – the names in this case are the levels in the factor. Now

suppose you had simply recorded the colours as a character variable, in colours2

as below, and then computed the table:

colours2 <-c("red","blue","red","white",

"silver","red","white","silver",

"red","red","white","silver")

# Now, make the table

table(colours2)

colours2

blue red silver white

1 5 3 3

Self-Test Question 2. What two differences do you notice between the results of the

two table expressions?

Now suppose we also record the type of car – it comes in saloon, convertible and

hatchback. This can be specified by another factor variable called car.type:

car.type <- factor(c("saloon","saloon","hatchback",

"saloon","convertible","hatchback","convertible",

"saloon","hatchback","saloon","saloon",

"saloon","hatchback"),

levels=c("saloon","hatchback","convertible"))

The table function can also work with two arguments:

table(car.type, colours)

colours

car.type red blue white silver black

saloon 2 1 2 2 0

hatchback 3 0 0 1 0

convertible 0 0 1 1 0

This gives a two-way table of counts – that is, counts of red hatchbacks, silver

saloons and so on. Note that the output this time is a matrix. For now enter the

code below to save the table into a variable called crosstab to be used later on:

crosstab <- table(car.type,colours)

Self-Test Question 3. What is the difference between table(car.type,

colours) and table(colours,car.type)?


Finally in this section, ordered factors will be considered. Suppose a third

variable about the cars is the engine size, and that the three sizes are 1.1 litre,

1.3 litre and 1.6 litre. Again, this is stored in a variable, but this time the sizes are

ordered. Enter:

engine <- ordered(c("1.1litre","1.3litre","1.1litre",

"1.3litre","1.6litre","1.3litre","1.6litre",

"1.1litre","1.3litre","1.1litre", "1.1litre",

"1.3litre","1.3litre"),

levels=c("1.1litre","1.3litre","1.6litre"))

Recall that with ordered variables, it is possible to use comparison operators >

(greater than), < (less than), >= (greater than or equal to) and <= (less than or equal

to). For example:

engine > "1.1litre"

[1] FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE

[10] FALSE FALSE TRUE TRUE

Self-Test Question 4. Using the engine, car.type and colours variables,

write expressions to give the following:

● The colours of all cars with engines with capacity greater than 1.1 litres.

● The counts of types (hatchback etc.) of all cars with capacity below 1.6 litres.

● The counts of colours of all hatchbacks with capacity greater than or

equal to 1.3 litre.

2.3.3.2 Matrices

In the previous section you created a matrix called crosstab. A number of func-

tions can be applied to matrices:

dim(crosstab) # Matrix dimensions

[1] 3 5

rowSums(crosstab) # Row sums

saloon hatchback convertible

7 4 2

colnames(crosstab) # Column names

[1] "red" "blue" "white" "silver" "black"

Another important tool for matrices is the apply function. To recap, this applies a

function to either the rows or columns of a matrix, giving a single-dimensional list

as a result. A simple example finds the largest value in each row:

apply(crosstab,1,max)

saloon hatchback convertible

2 3 1


In this case, the function max is applied to each row of crosstab. The 1 as the

second argument specifies that the function will be applied row by row. If it were 2

then the function would be column by column:

apply(crosstab,2,max)

red blue white silver black

3 1 2 2 0

A useful function is which.max. Given a list of numbers, it returns the index of the

largest one. For example:

example <- c(1.4,2.6,1.1,1.5,1.2)

which.max(example)

[1] 2

In this case, the second element is the largest.

Self-Test Question 5. What happens if there is more than one number taking the

largest value in a list? Use either the help facility or experimentation to find out.

Self-Test Question 6. The function which.max can be used in conjunction with apply.

Write an expression to find the index of the largest value in each row of crosstab.

The function levels returns the levels of a variable of type factor in

character form. For example:

levels(engine)

[1] "1.1litre" "1.3litre" "1.6litre"

The order they are returned in is the one specified in the original factor assign-

ment and the same order as row or column names produced by the table func-

tion. This means that levels can be used in conjunction with which.max when

applied to matrices to obtain the row or column names instead of an index number:

levels(colours)[which.max(crosstab[,1])]

[1] "blue"

Alternatively, the same effect can be achieved by the following:

colnames(crosstab)[which.max(crosstab[,1])]

[1] "blue"

You should unpick these last two lines of code to make sure you understand what

each element is doing.

colnames(crosstab)

[1] "red" "blue" "white" "silver" "black"

crosstab[,1]

saloon hatchback convertible

2 3 0

which.max(crosstab[,1])

hatchback

2


More generally, a function could be written to apply this operation to any variable

with names:

# Defines the function

which.max.name <- function(x) {

return(names(x)[which.max(x)])}

# Next, give the variable 'example' names for the values

names(example) <- c("Bradford","Leeds","York",

"Harrogate","Thirsk")

example

Bradford Leeds York Harrogate Thirsk

1.4 2.6 1.1 1.5 1.2

which.max.name(example)

[1] "Leeds"

Self-Test Question 7. The function which.max.name could be applied (using

apply) to a table or matrix to find the name of the row or column with the largest

value. If the crosstab table is considered a table of car sales, write an apply

expression to determine the best-selling colour for each car type and the best-

selling car type in each colour.

Note that in the last code snippet, a function was defined called which.max.name. You have been using functions, but these have all been existing ones as defined in R until now. Functions will be thoroughly dealt with in Chapter 4, but you should note two things about them at this point. First is the form:

function name <- function(function inputs) {
  variable <- function actions
  return(variable)
}

Second are the syntactic elements of the curly brackets { } that bound the code,

and the return() function that defines the value to be returned.
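As a purely illustrative sketch of this form (this example is not used elsewhere in the book), a function that squares its input could be defined and then called as follows:

square <- function(x) {
  result <- x * x
  return(result)
}
square(4)
[1] 16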

2.3.3.3 Lists

From the text in this chapter, recall that lists can be named and unnamed. Here we

will only consider the named kind. Lists may be created by the list function in

the form:

var <- list(name1=value1, name2=value2, …)
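For example (an illustrative list, not one used later in the analyses), a named list can be created and its elements accessed by name with the $ operator:

sales.info <- list(city = "Leeds", week = 23, sold = c(5, 1, 3, 4, 0))
sales.info$city
[1] "Leeds"
sales.info$sold
[1] 5 1 3 4 0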

Self-Test Question 8. Suppose you wanted to store both the row- and column-wise

apply results (from Question 7) in a list called most.popular with two named

elements called colour (containing the most popular colour for each car type) and

type (containing the most popular car type for each colour). Write an R expression

that assigns the best-selling colour and car types to a list.


2.3.3.4 Classes

The objective of this task is to create a class based on the list created in the previous

section. The class will consist of a list of most popular colours and car types,

together with a third element containing the total number of cars sold (called

total). Call this class sales.data. A function to create a variable of this class,

given colours and car.type, is as follows:

new.sales.data <- function(colours, car.type) {

xtab <- table(car.type,colours)

result <- list(colour=apply(xtab,1,which.max.name),

type=apply(xtab,2,which.max.name),

total=sum(xtab))

class(result) <- "sales.data"

return(result)}

This can be used to create a sales.data object which has the colours and

car.type variables assigned to it via the function:

this.week <- new.sales.data(colours,car.type)

this.week

$colour

saloon hatchback convertible

"red" "red" "white"

$type

red blue white silver black

"hatchback" "saloon" "saloon" "saloon" "saloon"

$total

[1] 13

attr(,"class")

[1] "sales.data"

In the above code, a new variable called this.week, of class sales.data, is

created. Following the ideas set out in the previous section, it is now possible to

create a print function for variables of class sales.data. This can be done by

writing a function called print.sales.data that takes an input or argument of

the sales.data class.

Self-Test Question 9. Write a print function for variables of class sales.data.

This is a difficult problem and should be tackled by those with previous program-

ming experience. Others can try this now but should return to it after the functions

have been formally introduced in Chapter 4.

2.4 PLOTS

There are a number of plot routines and packages in R. In this section some basic

plot types will be introduced, followed by some more advanced plotting commands and functions. The aim of this section is to give you an understanding of how


the basic plot types can be used as building blocks in more advanced plotting

routines that are called in later chapters to display the results of spatial analysis.

2.4.1 Basic Plot Tools

The most basic plot is the scatter plot. Figure 2.1 was created from the function

rnorm which generates a set of random numbers. Note that each running of the

code will generate a slightly different plot as different random numbers are

generated.

x1 <- rnorm(100)

y1 <- rnorm(100)

plot(x1,y1)

The generic plot function creates a graph of the two variables, plotting them on

the x-axis and the y-axis. The default settings for the plot function produce a scat-

ter plot and you should note that by default the axes are labelled with expressions

passed to the plot function. Many parameters can be set for plot either by defin-

ing the plot environment (described later) or when the plot is called. For example,

the option col specifies the plot colour and pch the plot character:

plot(x1,y1,pch=16, col='red')

Other options include different types of plot: type = 'l' produces a line plot of

the two variables, and again the col option can be used to specify the line colour

and the option lwd specifies the plot line width. You should run the code below to

produce different line plots:

Figure 2.1 A basic scatter plot


x2 <- seq(0,2*pi,len=100)

y2 <- sin(x2)

plot(x2,y2,type='l')

plot(x2,y2,type='l', lwd=3, col='darkgreen')

You should examine the help for the plot command (reminder: type ?plot at the

R prompt) and explore different plot types that are available. Having called a new

plot as in the above examples, other data can be plotted using other commands:

points, lines, polygons, etc. You will see that plot by default assumes the

plot type is point unless otherwise specified. For example, in Figure 2.2 the line

data described by x2 and y2 are plotted, after which the points described by x2

and y2r are added to the plot.

plot(x2,y2,type='l', col='darkgreen', lwd=3, ylim=c(-1.2,1.2))

y2r <- y2 + rnorm(100,0,0.1)

points(x2,y2r, pch=16, col='darkred')

In the above code, the rnorm function creates a vector of small values which are

added to y2 to create y2r. The function points adds points to an existing plot.

Many other options for plots can be applied here. For example, note the ylim

option. This sets the limits of the y-axis, while xlim does the same for the x-axis.

You should apply the commands below to the plot data.

y4 <- cos(x2)

plot(x2, y2, type='l', lwd=3, col='darkgreen')

lines(x2, y4, lwd=3, lty=2, col='darkblue')

Notice that, similar to points, the function lines adds lines to an existing plot,

and note the lty option as well. This specifies the type of line (dotted, simple, etc.).

Figure 2.2 A line plot with points added


The function polygon adds a polygon to an existing plot. The option col sets

the polygon fill colour. By default a black border is drawn; however, including the

parameter border = NA would result in no border being drawn. In Figure 2.3

two different plots of the same data illustrate the application of these parameters.


You should examine the different plot types and parameters in par. Enter

?par for the help page to see the full list of different plot parameters. One

of these, mfrow, is used below to set a combined plot of one row and two

columns. This needs to be reset or the rest of your plots will continue to be

printed in this way. To do this enter:

par(mfrow = c(1,2))

plot(x2, y2, type='l', lwd=3, col='darkgreen')

plot(x2, y2, type='l', col='darkgreen', lwd=3, ylim=c(-1.2,1.2))


points(x2, y2r, pch=16, col='darkred')

par(mfrow = c(1,1))

The last line of code resets par.

Figure 2.3 Points with polygons added

x2 <- seq(0,2*pi,len=100)

y2 <- sin(x2)

y4 <- cos(x2)

# specify the plot layout and order

par(mfrow = c(1,2))


# plot #1

plot(y2,y4)

polygon(y2,y4,col='lightgreen')

# plot #2: this time with 'asp' to set the aspect ratio of the axes

plot(y2,y4, asp=1, type='n')

polygon(y2,y4,col='lightgreen')

In the second plot, the parameter asp fixes the aspect ratio, in this case to 1 so that

the x and y scales are the same, and type = 'n' draws the plot axes to correct

scale (i.e. of the y2 and y4 data) but adds no lines or points.

So far the plot commands have been used to plot pairs of x and y coordinates

in different ways: points, lines and polygons (this may suggest different vector

types in a GIS for some readers). We can extend these to start to consider geo-

graphical coordinates more explicitly with some geographical data. You will need

to install the GISTools package, which may involve setting a mirror site as

described in Chapter 1. The first time you use any package in R it needs to be

downloaded before it is installed.

install.packages("GISTools", depend = T)

Then you can call the package in the R console:

library(GISTools)

You will then see some messages when you load the package, letting you know that

the packages that GISTools makes use of have also been loaded automatically.

You only need to install a package onto your computer the first time you use it.

Once it is installed it can simply be called. That is, there is no need to download it

again, you can simply enter library(package).

Figure 2.4 Appling County plotted from coordinate pairs


The code below loads a number of datasets with the data(georgia) com-

mand. It then selects the first element from the georgia.polys dataset and

assigns it to a variable called appling. This contains the coordinates of the outline

of Appling County in Georgia. It then plots this to generate Figure 2.4.

# library(GISTools)

data(georgia)

# select the first element

appling <- georgia.polys[[1]]

# set the plot extent

plot(appling, asp=1, type='n', xlab="Easting", ylab="Northing")

# plot the selected features with hatching

polygon(appling, density=14, angle=135)

There are a number of things to note in this bit of code.

1. The call data(georgia) loads three datasets: georgia, georgia2 and georgia.polys.

2. The first element of georgia.polys contains the coordinates for the outline of Appling County.

3. Polygons do not have to be regular; they can, as in this example, be geographical zones. The code assigns the coordinates to a variable called appling and this is a two-column matrix.

4. Thus, with an x and y pairing, the following plot commands all work with data in this format: plot, lines, polygon, points.

5. As before, the plot command in the code above has the type = 'n' parameter, and asp = 1 fixes the aspect ratio. The result is that the x and y scales are the same but the command adds no lines or points.

The wider point being demonstrated here is how routines for plotting spatial data

that we will use subsequently are underpinned by these kinds of data structures

and core plotting routines. The code above illustrates the engines of, for example,

the mapping and visualisation packages tmap and ggplot.

2.4.2 Plot Colours

Plot colours can be specified by name or as red, green and blue (RGB) values. The

former can be listed by entering the following:

colours()

RGB colours are composed of three values in the range 0 to 1. Having run the code

above, you should have a variable called appling in your workspace. Now try

entering the code below:


plot(appling, asp=1, type='n', xlab="Easting", ylab="Northing")

polygon(appling, col=rgb(0,0.5,0.7))

A fourth parameter can be added to rgb to indicate transparency as in the code

below, where the range is from 0 (invisible) to 1 (opaque).

polygon(appling, col=rgb(0,0.5,0.7,0.4))

Text can also be added to the plot and its placement in the plot window specified.

The cex parameter (for character expansion) determines the size of text. Note that

parameters like col also work with text and that HTML colours also work

(such as "#B3B333"). The code below generates two plots. The first plots a set of

random points and then plots appling with a transparency shading over the top

(Figure 2.5).

# set the plot extent

plot(appling, asp=1, type='n', xlab="Easting", ylab="Northing")

# plot the points

points(x = runif(500,126,132)*10000,
       y = runif(500,103,108)*10000, pch=16, col='red')

# plot the polygon with a transparency factor

polygon(appling, col=rgb(0,0.5,0.7,0.4))

The second plots appling, but with some descriptive text (Figure 2.6).

plot(appling, asp=1, type='n', xlab="Easting", ylab="Northing")

polygon(appling, col="#B3B333")

# add text, specifying its placement, colour and size

text(1287000,1053000,"Appling County",cex=1.5)

text(1287000,1049000,"Georgia",col='darkred')

Figure 2.5 Appling County with transparency


Figure 2.6 Appling County with text


In the above code, the coordinates for the text placement need to be speci-

fied. The function locator is very useful in this context: it can be used to

determine locations in the plot window. Enter locator() at the R prompt,

and then left-click in the plot window at various locations. When you right-

click, the coordinates of these locations are returned to the R console window.
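As an illustration (this snippet is not in the original text), the returned coordinates can also be captured in a variable and reused, for example to place text:

# click in the plot window, then right-click (or press Esc, depending on
# the graphics device) to finish
pts <- locator()
text(pts$x, pts$y, "a label", col='blue')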

Figure 2.7 Plotting rectangles


Other plot tools include rect, which draws rectangles. This is useful for plac-

ing map legends as your analyses develop. The following code produces the plot

in Figure 2.7.

plot(c(-1.5,1.5),c(-1.5,1.5),asp=1, type='n')

# plot the green/blue rectangle

rect(-0.5,-0.5,0.5,0.5, border=NA, col=rgb(0,0.5,0.5,0.7))

# then the second one

rect(0,0,1,1, col=rgb(1,0.5,0.5,0.7))

The command image plots tabular and raster data as shown in Figure 2.8. It has

default colour schemes, but other colour palettes exist. This book strongly recom-

mends the use of the RColorBrewer package, which is described in more detail

in Chapter 3, but an example of its application is given below:

Figure 2.8 Plotting raster data

# load some grid data

data(meuse.grid)

# define a SpatialPixelsDataFrame from the data

mat = SpatialPixelsDataFrame(points = meuse.grid[c("x", "y")],

data = meuse.grid)

# set some plot parameters (1 row, 2 columns)

par(mfrow = c(1,2))

# set the plot margins

par(mar = c(0,0,0,0))

# plot the points using the default shading

image(mat, "dist")


# load the package

library(RColorBrewer)

# select and examine a colour palette with 7 classes

greenpal <- brewer.pal(7,'Greens')

# and now use this to plot the data

image(mat, "dist", col=greenpal)

# reset par

par(mfrow = c(1,1))

You should note that par(mfrow = c(1,2)) results in one row and two col-

umns and that it is reset in the last line of code.


The command contour(mat, "dist") will generate a contour plot of

the matrix above. You should examine the help for contour; a nice example

of its use can be found in code in the help page for the volcano dataset

that comes with R. Enter the following in the R console:

?volcano
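For instance, a quick sketch based on that help page (the exact code given there may differ slightly) is:

contour(volcano)
filled.contour(volcano, color.palette = terrain.colors, asp = 1)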

2.5 ANOTHER PLOT OPTION: ggplot

2.5.1 Introduction to ggplot

A suite of tools and functions for plotting is available via the ggplot2 package, which is included as part of the tidyverse (https://www.tidyverse.org).

The ggplot2 package applies principles described in The Grammar of Graphics

(Wilkinson, 2005) (hence the gg in the name of the package) which conceptualises

graphics and plots in terms of their theoretical components. The approach is to

handle each element of the graphic separately in a series of layers, and in so doing

to control each part of the plot. This is different from the basic plot functions used

above which apply specific plotting functions based on the type or class of data

that were passed to them.

The ggplot2 package can be installed by installing the whole tidyverse:

install.packages("tidyverse", dep = T)

Or it can be installed on its own:

install.packages("ggplot2", dep = T)

And then loaded into the workspace:

library(ggplot2)


The plots above can be re-created using either the qplot or ggplot functions

in the ggplot2 package. The function qplot() is used to produce quick, simple

plots in a similar way to the plot function. It takes x and y and a data argument

for a data frame containing x and y. Figure 2.9 re-creates Figure 2.2. Notice how

the elements in theme are used to control the display.

qplot(x2,y2r,col=I('darkred'), ylim=c(-1.2, 1.2)) +

geom_line(aes(x2,y2), col=I("darkgreen"), size = I(1.5)) +

theme(axis.text=element_text(size=20),

axis.title=element_text(size=20,face="bold"))

Notice how the plot type is first specified (in this case qplot()) and then subsequent lines include instructions for what to plot and how to plot it. Here geom_line() was specified followed by some style instructions.

Try adding:

theme_bw()

or:

theme_dark()

to the above. Remember that you need to include a + for each additional element

in ggplot.

Figure 2.9 A simple qplot plot


To reproduce the Appling plots, the variable appling has to be converted from

a matrix to a data frame whose elements need to be labelled:

appling <- data.frame(appling)

colnames(appling) <- c("X", "Y")

Then qplot can be called as in Figure 2.10 to re-create Figure 2.5 defined above in

stages.

# create the first plot with qplot

p1 <- qplot(X, Y, data = appling, geom = "polygon", asp = 1,

colour = I("black"),

fill=I(rgb(0,0.5,0.7,0.4))) +

theme(axis.text=element_text(size=12),

axis.title=element_text(size=20))

# create a data.frame to hold the points

df <- data.frame(x = runif(500,126,132)∗10000,

y = runif(500,103,108)∗10000)

# now use ggplot to construct the layers of the plot

p2 <- ggplot(appling, aes(x = X, y= Y)) +

geom_polygon(fill = I(rgb(0,0.5,0.7,0.4))) +

geom_point(data = df, aes(x, y),col=I('red')) +

coord_fixed() +

theme(axis.text=element_text(size=12),

axis.title=element_text(size=20))

# finally combine these in a single plot

# using the grid.arrange function

# NB you may have to install the gridExtra package

library(gridExtra)

grid.arrange(p1, p2, ncol = 2)

The result is shown in Figure 2.10, the right-hand part of which re-creates Figure 2.5.

Notice a number of things. First, the structural differences in the way the

graphic is called, including the specification of the type with the geom parameter

(compared to the geom_line parameter earlier). Second, the assignment of the

Figure 2.10 A simple qplot plot of a polygon


plot objects to variables p1 and p2. Third, the use of the grid.arrange() func-

tion in the gridExtra package that allows two graphics to be included in the plot

window. Finally, you will have to install the gridExtra package before the first

time you use it:

install.packages("gridExtra", dep = T)

2.5.2 Different ggplot Types

This section briefly introduces different kinds of plots using ggplot for different

kinds of variables, including scatter plots, histograms and boxplots. In subsequent

chapters, different flavours and types of ggplot will be illustrated. But this is a

vast package and involves a bit of a learning curve at first. To fully understand all

that it can do is beyond the scope of this subsection in this chapter, but there is

plenty of help and advice on the internet. You could explore some of this yourself

by following some of the links at http://ggplot2.tidyverse.org.

The basic call to ggplot is complemented by an aesthetic prefixed by geom_

and has the following syntax:

ggplot(data = , aes(x,y,colour)) +

geom_XYZ()

To illustrate the different plotting options, we need to create some data and some

categorical variables. The code below extracts the data frame from georgia and

converts it to a tibble. This is like the attribute table of a shapefile. Note that ggplot

will work with any type of data frame.

# data.frame

df <- data.frame(georgia)

# tibble

tb <- as.tibble(df)

Enter the code below to see the first 10 records:

tb

You can see that this has attributes for the counties of Georgia, and a number

of variables are included. Next, the code below creates an indicator for rural/

not-rural, which we set to values using the levels function. Note the use of the

+ 0 to convert the TRUE and FALSE values to 1s and 0s:

tb$rural <- as.factor((tb$PctRural > 50) + 0)

levels(tb$rural) <- list("Non-Rural" = 0, "Rural"=1)

Then we create an income category variable around the interquartile range of the

MedInc variable (median county income). There are fancier ways to do it, but the

code below is tractable:


tb$IncClass <- rep("Average", nrow(tb))

tb$IncClass[tb$MedInc >= 41204] = "Rich"

tb$IncClass[tb$MedInc <= 29773] = "Poor"

The distributions can be checked if you wanted using the table() function:

table(tb$IncClass)

Scatter plots can be used to show two variables together. The data pairs in tb

should be examined. For example, consider PctBach and PctEld, representing

the percentages of the county populations with bachelor’s degrees and who are

elderly (whatever that means).

ggplot(data = tb, mapping=aes(x=PctBach, y=PctEld)) +

geom_point()

The plot can be enhanced by passing a grouping variable to the colour parameter

in aes:

ggplot(data = tb, mapping=aes(x=PctBach, y=PctEld, colour=rural)) +

geom_point()

Now modify the code above to group by the IncClass variable created earlier.

What happens? What do you see? Does this make sense? Are there any trends? It

could tentatively be said that the poor areas are more elderly and have fewer peo-

ple with bachelor’s degrees. This might be confirmed by adding a trend line:

ggplot(data = tb, mapping = aes(x = PctBach, y = PctEld)) +

geom_point() +

geom_smooth(method = "lm")

Also note that style templates can be added and colours changed. Putting this all

together generates Figure 2.11:

ggplot(data = tb, mapping = aes(x = PctBach, y = PctEld)) +

geom_point() +

geom_smooth(method = "lm", col = "red", fill = "lightsalmon") +

theme_bw() +

xlab("% of population with bachelor degree") +

ylab("% of population that are elderly")

You can explore other styles by trying the ones listed under the help for theme_bw.

Next, histograms can be used to examine the distributions of income across the

159 counties of Georgia:

ggplot(tb, aes(x=MedInc)) +

geom_histogram(binwidth = 5000, colour = "red", fill = "grey")

The axes can be labelled, the theme set and title included as with the above exam-

ples, by including additional elements in the plot. Probability densities can also be

plotted as follows, generating Figure 2.12:


Figure 2.11 A ggplot scatter plot

Figure 2.12 A ggplot density histogram


ggplot(tb, aes(x=MedInc)) +

geom_histogram(aes(y=..density..),

binwidth=5000,colour="white") +

geom_density(alpha=.4, fill="darksalmon") +

# Ignore NA values for the median

geom_vline(aes(xintercept=median(MedInc, na.rm=T)),

color="orangered1", linetype="dashed", size=1)

Multiple plots can be generated using the facet() options in ggplot. These

create separate plots for each group. Here the PctBach variable is plotted and

median incomes compared:

ggplot(tb, aes(x=PctBach, fill=IncClass)) +

geom_histogram(color="grey30",

binwidth = 1) +

scale_fill_manual("Income Class",

values = c("orange", "palegoldenrod","firebrick3")) +

facet_grid(IncClass~.) +

xlab("% Bachelor degrees") +

ggtitle("Bachelors degree % in different income classes")

Another way of examining distributions is through boxplots. Boxplots display the

distribution of a continuous variable and can be broken down by a categorical

variable. A basic boxplot can be generated with the geom_boxplot aesthetic:

ggplot(tb, aes(x = "", PctBach)) +

geom_boxplot()

Figure 2.13 A ggplot boxplot with groups


This can be extended with some grouping, as before, and to compare more than

one treatment as in Figure 2.13:

ggplot(tb, aes(IncClass, PctBach, fill = factor(rural))) +

geom_boxplot() +

scale_fill_manual(name = "Rural",

values = c("orange", "firebrick3"),

labels = c("Non-Rural"="Not Rural","Rural"="Rural")) +

xlab("Income Class") +

ylab("% Bachelors")

This is only scratching the surface of the capability of ggplot. Additional refine-

ments will be demonstrated throughout this book.

2.6 READING, WRITING, LOADING AND SAVING DATA

There are a number of ways of getting data in and out of R, and three methods for

reading and writing different formats are briefly considered here: text files, R data

files and spatial data.

2.6.1 Text Files

Consider the appling data variable above. This is a matrix variable, containing

two columns and 125 rows. You can examine the data using dim and head:

# display the first six rows

head(appling)

# display the variable dimensions

dim(appling)

You will note that the data fields (columns) are not named; however, these can be

assigned.

colnames(appling) <- c("X", "Y")

The data can be written into a comma-separated variable file using the command

write.csv and then read back into a different variable, as follows:

write.csv(appling, file = "test.csv")

This writes a .csv file into the current working directory. You can check where this is

by using the getwd() function. You can set the working directory either though

the setwd() function or through the menu (Session > Set Working Directory).

If you open it using a text editor or spreadsheet software, you will see that it

has three columns: X and Y as expected plus the index for each record. This is

because the default for write.csv includes the default row.names = TRUE.

Again examine the help file for this function.


write.csv(appling, file = "test.csv", row.names = F)

R also allows you to read .csv files using the read.csv function. Read the file

you have created into a variable:

tmp.appling <- read.csv(file = "test.csv")

Notice that in this case what is read from the .csv file is assigned to the variable

tmp.appling. Try reading this file without assignment. The default for read.csv is that the file has a header (i.e. the first row contains the names of the columns) and that the separator between values in any record is a comma. However,

these can be changed depending on the nature of the file you are seeking to load

into R. A number of different types of files can be read into R. You should examine

the help files for reading data in different formats. Enter ??read to see some of

these listed. You will note that read.table and write.table require more

parameters to be specified than read.csv and write.csv.
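For example (the file names here are placeholders for files of your own):

# a comma-separated file without a header row
tmp1 <- read.csv(file = "no_header.csv", header = FALSE)
# a tab-separated file read with read.table
tmp2 <- read.table(file = "values.txt", header = TRUE, sep = "\t")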

2.6.2 R Data Files

It is possible to save variables that are in your workspace to a designated file. This

can be loaded at the start of your next session. For example, if you have been run-

ning the code as introduced in this chapter you should have a number of variables,

from x at the start to engine and colours and the appling data above.

You can save this workspace using the drop-down menus in the RStudio inter-

face or using the save function. The RStudio menu route saves everything that is

present in your workspace, as listed by ls(), while the save command allows

you to specify what variables you wish to save.

# this will save everything in the workspace

save(list = ls(), file = "MyData.RData")

# this will save just appling

save(list = "appling", file = "MyData.RData")

# this will save appling and georgia.polys

save(list = c("appling", "georgia.polys"), file = "MyData.RData")

You should note that the .RData file binary format is very efficient at storing data:

the Appling .csv file used 4kb of memory, while the .RData file used only 2kb.

Similarly, .RData files can be loaded into R using the menu in the R interface or

within the R console by writing:

load("MyData.RData")

This will load the variables in the .RData file into the R console.

2.6.3 Spatial Data Files

It is appropriate to briefly consider how to get spatial data in and out of R, but note

that this is covered in more detail in Chapter 3.


The rgdal package includes two generic functions for reading and writing all

kinds of spatial data: readOGR() and writeOGR(). Load the rgdal package:

library(rgdal)

The georgia object in sp format can be written to a shapefile using the

writeOGR() function as follows:

writeOGR(obj=georgia, dsn=".", layer="georgia",

driver="ESRI Shapefile", overwrite_layer=T)

It can be read back into R using the readOGR() function:

new.georgia <- readOGR("georgia.shp")

Spatial data can also be read in and written out using the sf functions st_read() and st_write(). For example, to read in and write out the georgia.shp shapefile that was created above (and to overwrite g2) the following code can be used. You will need to install and load the sf package:

install.packages("sf", dep = T)

library(sf)

setwd("/MyPath/MyFolder")

g2 <- st_read("georgia.shp")

st_write(g2, "georgia.shp", delete_layer = T)

2.7 ANSWERS TO SELF-TEST QUESTIONS

Q1: orange is not one of the factor’s levels, so the result is an NA.

colours[4] <- "orange"

colours

[1] red    blue   red    <NA>   silver red    white  silver
[9] red    red    white  silver silver

Levels: red blue white silver black

Q2: There is no count for black in the character version – table does not know

that this value exists, since there is no levels information. Also the order of

colours is alphabetical in the character version. In the factor version, the

order is based on that specified in the factor function.

Q3: The first variable is tabulated along the rows, the second along the columns.

Q4: Find the colours of all cars with engines with capacity greater than 1.1 litres:


# Undo the colour[4] <- 'orange' line used above

colours <- factor(c("red","blue","red","white",
                    "silver","red","white","silver",
                    "red","red","white","silver","silver"),
                  levels=c("red","blue","white","silver","black"))
colours[engine > "1.1litre"]
[1] blue   white  silver red    white  red    silver silver

Levels: red blue white silver black

Counts of types of all cars with capacity below 1.6 litres:

table(car.type[engine < "1.6litre"])

saloon hatchback convertible

7 4 0

Counts of colours of all hatchbacks with capacity greater than or equal to 1.3 litres:

table(colours[(engine >= "1.3litre") & (car.type == "hatchback")])

red blue white silver black

2 0 0 1 0

Q5: The index returned corresponds to the first number taking the largest value.

Q6: An expression to find the index of the largest value in each row of crosstab

using which.max and apply:

apply(crosstab,1,which.max)

saloon hatchback convertible

1 1 3

Q7: Use apply functions to return the best-selling colour and car type:

apply(crosstab,1,which.max.name)

saloon hatchback convertible

"red" "red" "white"

apply(crosstab,2,which.max.name)

red blue white silver black

"hatchback" "saloon" "saloon" "saloon" "saloon"

Q8: An R expression that assigns the best-selling colour and car types to a list:

most.popular <- list(colour=apply(crosstab,1,which.max.name),

type=apply(crosstab,2,which.max.name))

most.popular

$colour

saloon hatchback convertible

"red" "red" "white"


$type

red blue white silver black

"hatchback" "saloon" "saloon" "saloon" "saloon"

Q9: A print function for variables of class sales.data:

print.sales.data <- function(x) {

cat("Weekly Sales Data:\n")

cat("Most popular colour:\n")

for (i in 1:length(x$colour)) {

cat(sprintf("%12s:%12s\n",names(x$colour)[i],x$colour[i]))}

cat("Most popular type:\n")

for (i in 1:length(x$type)) {

cat(sprintf("%12s:%12s\n",names(x$type)[i],x$type[i]))}

cat("Total Sold = ",x$total)

}

this.week

Weekly Sales Data:

Most popular colour:

saloon: red

hatchback: red

convertible: white

Most popular type:

red: hatchback

blue: saloon

white: saloon

silver: saloon

black: saloon

Total Sold = 13

Although the above is one possible solution to the question, it is not unique. You

may decide to create a very different looking print.sales.data function. Note

also that although until now we have concentrated only on print functions for

different classes, it is possible to create class-specific versions of any function.

REFERENCE

Wilkinson, L. (2005) The Grammar of Graphics. New York: Springer.

3 BASICS OF HANDLING SPATIAL DATA IN R

3.1 OVERVIEW

The aim of this chapter is to provide an introduction to the mapping and geograph-

ical data handling capabilities of R. It explicitly focuses on developing the building

blocks for the spatial data analyses in later chapters. These extend the mapping

functionality that was briefly introduced in the previous chapter and will be

extended further in Chapter 5. It includes an introduction to the sp and sf pack-

ages and the R spatial data formats they support, and the tmap package. This

chapter describes methods for moving between the sp and sf formats and for

producing choropleth maps – from basic to quite advanced outputs – and intro-

duces some methods for generating descriptive statistics. These skills are funda-

mental to the analyses that will be developed later in the book. This chapter will:

● Introduce the sp and sf R spatial data formats and describe how to use

them

● Describe how to compile maps based on multiple layers using both

basic plot functions and the tmap package

● Describe how to set different plot parameters and shading schemes

● Describe how to develop basic descriptive statistical analyses of spatial

data

3.1.1 Spatial Data

Data are often held in data tables or databases – a bit like a spreadsheet. The rows

represent some real-world feature (a person, a transaction, a date, etc.) and the

columns represent some attribute associated with that feature. Rows in databases


may be referred to as records and columns as fields. There are some cases where the

features can be either a record or a field – for example, a date could belong to a list

of daily supermarket transactions (as a record) or be an attribute associated with

an event at a location (as a field). For the purposes of much of the practical work

in this chapter data will be conceptualised in this way.

In R there are many data formats and packages for handling and manipulating

them. For example, the tibble format defined within the dplyr package as part

of the tidyverse is starting to supersede data frames (in fact it includes the

data.frame class). This is part of a concerted activity by many package develop-

ment teams to provide tidy and lazy data formats and processes for data science,

mapping and spatial data analysis. Some of the background to this activity can be

found on the webpage for tidyverse (https://www.tidyverse.org),

which is a collection of R packages designed for data science.

The preceding description of data, with records (rows) and fields (columns), can

be extended to spatial data in which each record typically represents some real-

world geographical feature – a place, a route, a region, etc. – and individual fields

provide a measurement or attribute associated with that feature. In geographical

data, features are typically represented as points, lines or areas.

Why spatial data? Nearly all data are spatial – they are collected somewhere.

If and when a third edition of this book is written in the future, we expect to

extend this argument to the spatio-temporal domain in which all data are

spatio-temporal – they are collected somewhere and at some time.

3.1.2 Installing and Loading Packages

The previous chapter included a number of basic analytical and graphical tech-

niques using R. However, few of these were particularly geographical. A number

of packages are available in R that allow sophisticated visualisation, manipulation

and analysis of spatial data. Some of this functionality will be demonstrated in this

chapter in conjunction with some mapping tools and specific data types to create

different examples of mapping in R. Remember that a package in R is a set of pre-

written functions (and possibly data items as well) that are not available when you

initially start R running, but can be loaded from the R library at the command line.

To illustrate these techniques, the chapter starts by developing some elementary

maps, building to more sophisticated mapping.

This chapter uses a number of packages: raster, OpenStreetMap,

RgoogleMaps, grid, rgdal, tidyverse, reshape2, ggmosaic, GISTools, sf

and tmap. You will have to install them before you use them for the first time. You will

have installed the GISTools and sf packages using the install.packages()

function if you worked through Chapter 2. Once you have downloaded and installed a

package, you can simply load the package when you use R subsequently.

The is.element query combined with the installed.packages() func-

tion can be used to check whether a package is installed.


is.element("sf", installed.packages())

If FALSE is returned then you need to install the package as above:

install.packages("sf", dep = TRUE)

Note the dep = TRUE parameter. This tells R to load the package with its depend-

encies (i.e. other packages that it depends on). Then the package can be loaded:

library(sf)
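A common convenience is to wrap this check, install and load pattern into a small helper function (a sketch, not part of the book's code; the function name is arbitrary):

load.or.install <- function(pkg) {
  # install the package only if it is not already present, then load it
  if (!is.element(pkg, installed.packages()[, 1])) {
    install.packages(pkg, dep = TRUE)
  }
  library(pkg, character.only = TRUE)
}
load.or.install("tmap")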

It is possible to inspect the functionality and tools available in sf or any other

package by examining the documentation.

help(sf)

# or

?sf

This provides the general description of the package. At the bottom of the help

window, there is a hyperlink to the index which, if you click on it, will

open a page with a list of all the tools available in the package. The CRAN

website also has full documentation for each package – for sf see

http://cran.r-project.org/web/packages/sf/index.html.

3.2 INTRODUCTION TO sp AND sf: THE sf REVOLUTION

As described in Chapter 1, the first edition of this book focused on the sp format

for spatial data in R. This format is defined in the sp package. It provides an organ-

ised set of spatial data classes, providing a unified way of moving from one pack-

age to another, taking advantage of the different tools and the functions they

include. However, R is dynamic and sometimes a new paradigm is introduced; this

has been the case recently for spatial data in R, with the release of the sf package

by Pebesma et al. (2016).

In this chapter, both the sp and sf formats are introduced. The manipulation

and analysis of spatial data use, where possible, the sf format and associated

tools. However, some packages and operations for spatial analyses have not yet

been updated to work with sf. For example, at the time of writing, many of the

functions in spdep, such as those for cluster analysis using Moran’s I (see Anselin,

1995) and the G-statistic (described in Ord and Getis, 1995), only work with sp

format spatial data. For these reasons, this chapter (and others throughout the

book) will, where possible, describe the manipulation and analysis of spatial data

using sf format and functions but will switch between (and convert data between)

sp and sf formats as needed.
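As a preview of this (a minimal sketch; the conversion functions are covered properly later in the book), the sf function st_as_sf() converts an sp object to sf format, and as(..., "Spatial") converts back. For example, using the georgia object loaded with data(georgia) in Chapter 2:

georgia_sf <- st_as_sf(georgia)          # sp -> sf
georgia_sp <- as(georgia_sf, "Spatial")  # sf -> sp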


3.2.1 sp data format

The sp package defines a number of classes (or sp objects) for handling points,

lines and areas, as summarised in Table 3.1. The sp data formats underpin many

of the packages that you will use directly or indirectly (i.e. they are loaded by other

packages): they have dependencies on sp. An example is the GISTools package by

Brunsdon and Chen (2014) which has dependencies on maptools, sp, rgeos

and other packages. If you install and load GISTools you will see these packages

being loaded.

Table 3.1 Spatial data formats in R

Without attributes    With attributes              ArcGIS equivalent
SpatialPoints         SpatialPointsDataFrame       Point shapefiles
SpatialLines          SpatialLinesDataFrame        Line shapefiles
SpatialPolygons       SpatialPolygonsDataFrame     Polygon shapefiles

Source: Pebesma et al. (2016).
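To make these classes a little more concrete, the short sketch below (using made-up coordinates, not part of the chapter's datasets) builds a SpatialPoints object and then attaches attributes to create a SpatialPointsDataFrame:

library(sp)
# three made-up locations (longitude, latitude)
coords <- cbind(long = c(-81.5, -82.1, -83.3), lat = c(31.2, 32.0, 33.5))
pts <- SpatialPoints(coords)                       # geometry only
class(pts)
# attach a data frame of attributes to the points
df <- data.frame(id = 1:3, value = c(10, 20, 30))
pts_df <- SpatialPointsDataFrame(pts, df)          # geometry + attributes
class(pts_df)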

3.2.1.1 Spatial data in GISTools

GISTools, similar to many other R packages, comes with a number of embedded

datasets that can be loaded from the command line after the package is installed.

Two datasets will be used in this chapter, to illustrate spatial data manipulation,

mapping and analysis in both sf and sp. These are polygon and line data for New

Haven, Connecticut and the counties in the state of Georgia, both in the USA. The

New Haven data include crime statistics, roads, census blocks (including demo-

graphic information), railway lines and place names. The Georgia data include

outlines of the counties in Georgia with a number of attributes relating to the 1990

census including population (TotPop90), the percentage of the population that

are rural (PctRural), that have a college degree (PctBach), that are elderly

(PctEld), that are foreign born (PctFB), that are classed as being in poverty

(PctPov), that are black (PctBlack) and the median income of the county

(MedInc). The two datasets are shown in Figure 3.1.

Having installed GISTools, you can load the newhaven data or georgia

data using the data() function. Load the newhaven data and then examine what

is loaded and the types (or classes) of data that are loaded:

data(newhaven)

ls()

[1] "blocks"

[5] "famdisp"

"breach"

"places"

"burgres.f"

"roads"

"burgres.n"

"tracts"


class(breach)

[1] "SpatialPoints"

attr(,"package")

[1] "sp"

class(blocks)

[1] "SpatialPolygonsDataFrame"

attr(,"package")

[1] "sp"

The breach data are of the SpatialPoints class and simply describe

locations, with no attributes. The blocks data, on the other hand, are of the

SpatialPolygonsDataFrame class as they include some census variables

associated with each census block. Thus spatial data with attributes defined in this

way in sp hold their attributes in the data frame, and you can see this by looking

at the first few lines of the blocks data frame using the head function:

head(data.frame(blocks))

Note that the data frame of sp objects can also be accessed directly through the @data slot of the blocks object, again using the head function:

head(blocks@data)

Both of these code snippets print the first six lines of attributes associated with the

census blocks data. A formal consideration of spatial attributes and how to analyse

and map them is given later in this chapter.

The census blocks in New Haven can be plotted using the R plot function:

plot(blocks)

Figure 3.1 The New Haven census blocks with roads in blue, and the counties in the state of Georgia shaded by median income


The default plot function for the sp class of objects can be used to gener-

ate maps, and this was the focus of the first edition of this book using the

GISTools package. It described how different plot commands could be combined to create plot layers. For example, to draw a map of the roads in red, with

the blocks in black (the plot default colour) as in Figure 3.2, the code below

could be entered:

par(mar = c(0,0,0,0))

plot(roads, col="red")

plot(blocks, add = T)

Figure 3.2 The New Haven census blocks and road data

3.2.2 sf Data Format

Recently a new class of R spatial objects has been defined and released as a pack-

age called sf, which stands for ‘simple features’ (Pebesma et al., 2016). It seeks to



encode spatial data in a way that conforms to a formal standard (ISO 19125-1:2004). This standard emphasises the spatial geometry of objects and the way that objects are stored in databases. In brief, the aim of the team developing sf (many of them are the same people who developed sp, so they do know what they are doing!) is to provide a standardised format for spatial data in R. An overview of the evolution of spatial data in R

can be found at https://edzer.github.io/UseR2017/.

The idea is that a feature is a thing, or an object in the real world, such as a build-

ing or a tree. As is the case with objects, they often consist of other objects such that

a set of features can form a single feature. Features have a geometry describing

where on Earth they are located, and they have attributes, which describe other

properties. There are many sf object types, but the key ones (which are similar to

lines, points and areas) are listed in Table 3.2 (taken from the sf vignette). This has

a much stronger theoretical structure, with for example multipoint features

being composed of point features etc. Only the more common types of geome-

tries defined within sf are described in Table 3.2; other geometries exist but are

much rarer.

Table 3.2 Spatial data formats in R from https://r-spatial.github.io/sf/articles/sf1.html

Feature type      Description                                                ArcGIS equivalent
POINT             Zero-dimensional geometry containing a single point        Point shapefiles
LINESTRING        Sequence of points connected by straight, non-self-
                  intersecting line pieces; one-dimensional geometry         Line shapefiles
POLYGON           Geometry with a positive area (two-dimensional); a
                  sequence of points forms a closed, non-self-intersecting
                  ring; the first ring denotes the exterior ring, zero or
                  more subsequent rings denote holes in this exterior ring   Polygon shapefiles
MULTIPOINT        Set of points; a MULTIPOINT is simple if no two points
                  in the MULTIPOINT are equal                                Point shapefiles
MULTILINESTRING   Set of linestrings                                         Line shapefiles
MULTIPOLYGON      Set of polygons                                            Polygon shapefiles
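As an illustration of these geometry types (a sketch using made-up coordinates rather than the chapter's data), individual geometries can be created with st_point(), st_linestring() and st_polygon(), collected into a geometry column with st_sfc() and combined with attributes using st_sf():

library(sf)
p1 <- st_point(c(-82.9, 32.8))                                # POINT
l1 <- st_linestring(rbind(c(0, 0), c(1, 1), c(2, 1)))         # LINESTRING
pg <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1),
                            c(0, 1), c(0, 0))))               # POLYGON (closed ring)
# a geometry column with a coordinate reference system, plus attributes
geom <- st_sfc(p1, st_point(c(-83.2, 33.9)), crs = 4326)
st_sf(data.frame(id = 1:2), geometry = geom)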

Ultimately, sf formats will completely replace sp, and packages that use sp

(such as GWmodel for geographically weighted regression) will all have to be

updated to use sf at some point, but that is a few years away.

The sf package has a number of vignettes or tutorials that you could explore.

These include an overview of the format, reading and writing from and to sf


formats including conversions to and from sp and sf, and some illustrations of

how sf objects can be manipulated. The code below will create a new window

with a list of sf vignettes:

library(sf)

vignette(package = "sf")

And then to display a specific vignette topic, this can be called using the vignette

function:

vignette("sf1", package = "sf")


Vignettes are an important part of R packages. They provide explanations of

the package functionality additional to those found in the example code at

the end of a help page. They can be accessed using the vignette function

or through the R help. The sf1 vignette could also be accessed via the

package help index: enter help(sf), navigate to the index through the link

at the bottom of the overview page and then click on the User guides,

package vignettes and other documentation link.

3.2.2.1 sf spatial data

The sp objects loaded by the GISTools data packages georgia and

newhaven can be converted to sf. The fundamental function for converting to

sf is st_as_sf(). In the code below it is used to convert the georgia sp

object to sf:

# load the georgia data

data(georgia)

# conversion to sf

georgia_sf = st_as_sf(georgia)

class(georgia_sf)

[1] "sf" "data.frame"

You can examine the contents of georgia_sf by entering the following at the

console:

georgia_sf


Notice how when georgia_sf is called the spatial information and the first 10

records of the attribute table are printed to the screen, rather than the entire object

as with sp. For comparison you could enter:

georgia

The plot function is also different: it will create maps of sf objects, and if the sf

object has attributes it will shade the first few of these:

# all attributes

plot(georgia_sf)

# selected attribute

plot(georgia_sf[, 6])

# selected attributes

plot(georgia_sf[,c(4,5)])

Finally, note that sf objects have a data frame. You could compare the data frames

of sp and sf objects:

## sp SpatialPolygonDataFrame object

head(data.frame(georgia))

## sf polygon object

head(data.frame(georgia_sf))

Note that the data frames of the sf objects have geometry attributes.

We can also convert to sp by using the as function:

g2 <- as(georgia_sf, "Spatial")

class(g2)

[1] "SpatialPolygonsDataFrame"

attr(,"package")

[1] "sp"

This automatically recognises that georgia_sf is a multipolygon object in sf

and converts it to a SpatialPolygonsDataFrame object in sp. You could try a

similar set of operations with the roads layer loaded earlier to demonstrate this:

roads_sf <- st_as_sf(roads)

class(roads_sf)

r2 <- as(roads_sf, "Spatial")

class(r2)

3.3 READING AND WRITING SPATIAL DATA

Very often we have data that are in a particular format such as shapefile format. R has

the ability to read and write data from and to many different spatial data formats

using functions in the rgdal and sf packages – we will consider them both here.


3.3.1 Reading to and Writing from sp Format

As was briefly described in Chapter 2, the rgdal package includes two generic

functions for reading and writing all kinds of spatial data: readOGR() and

writeOGR(). Load the rgdal package:

library(rgdal)

As a reminder, the georgia object in sp format can be written to a shapefile

using the writeOGR() function as follows:

writeOGR(obj=georgia, dsn=".", layer="georgia",

driver="ESRI Shapefile", overwrite_layer=T)

You will see that a shapefile has been written into your current working directory,

overwriting any previous instance of georgia.shp, with its associated support-

ing files (.dbf etc.) that can be recognised by other applications (QGIS etc.).

Similarly, this can be read into R and assigned to a variable using the readOGR

function:

new.georgia <- readOGR("georgia.shp")

If you enter:

class(new.georgia)

you will see that the class of the new.georgia object is sp. You should examine

the writeOGR and readOGR functions in the rgdal package.

R is also able to read and write other proprietary spatial data formats using a

number of packages, which you should be able to find through a search of the R

help system or via an internet search engine. The rgdal package is the R version

of the Geospatial Data Abstraction Library. It includes a number of methods for read-

ing and writing spatial objects, including to and from SpatialXDataFrame

objects. The full syntax can be important – the code below overwrites any existing

similarly named file:

writeOGR( new.georgia, dsn = ".", layer = "georgia",

driver="ESRI Shapefile", overwrite_layer = T)

The dsn parameter is important here: for shapefiles it determines the folder the

files are written to. In the above example it was set to "." which places the files in

the current working directory.

You could specify a file path here. For a PC it might be something like D:\

MyDocuments\MyProject\DataFiles; for a Mac, /Users/lex/my_docs/

project.


The setwd() and getwd() functions can be used in determining and setting

the file path. You may want to set the file path and then use the dsn setting as

above:

setwd("/Users/lex/my_docs/project")

writeOGR( new.georgia, dsn = ".", layer = "georgia",

driver="ESRI Shapefile", overwrite_layer = T)

Or you could use the getwd() function, save the results to a variable and pass this

to writeOGR:

td <- getwd()

writeOGR( new.georgia, dsn = td, layer = "georgia",

driver="ESRI Shapefile", overwrite_layer = T)

You should also examine the functions for reading and writing raster layers in

rgdal, which are readGDAL and writeGDAL. These read and write functions in

rgdal are incredibly powerful and can read/write almost any spatial data format.
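For example, a raster layer could be read and rewritten along the lines of the sketch below (the file names here are hypothetical):

library(rgdal)
# read a raster file into a SpatialGridDataFrame
elev <- readGDAL("elevation.tif")
summary(elev)
# write it back out as a GeoTIFF
writeGDAL(elev, "elevation_copy.tif", drivername = "GTiff")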

3.3.2 Reading to and Writing from sf Format

Spatial data can also be read in and written out using the sf functions

st_read() and st_write(). For example, to read in the georgia.shp shape-

file that was created above (and to overwrite g2) the following code can be used:

setwd("/MyPath/MyFolder")

g2 <- st_read("georgia.shp")

The working directory needs to be set to ensure that st_read looks in the right

place to read the file from. Here a single argument is used to find both the data

source and the layer. This works when the data source contains a single layer.

Writing a simple features object to a file needs at least two arguments: the object and a filename. As before, this will not work if the georgia.shp file exists in the

working directory, so the delete_layer = T parameter needs to be specified.

st_write(g2, "georgia.shp", delete_layer = T)

The filename is taken as the data source name. The default for the layer name is the

basename (filename without path) of the data source name. For this, st_write

needs to guess the driver. The above command, for instance, is equivalent to:

st_write( g2, dsn = "georgia.shp", layer = "georgia.shp",

driver = "ESRI Shapefile", delete_layer = T)

Typically you will either supply a full file path as the filename, or first set R's working directory with setwd() and then use a filename without a path.


Note that the output driver is guessed from the data source name, from either

its extension (.shp: ESRI Shapefile), or its prefix (PG:: PostgreSQL).

The list of extensions with corresponding driver (short driver name) can be

found in the sf2 vignette. You will also note that there are a number of functions

that can be used to read, write and convert. You can examine this:

vignette("sf2", package = "sf")

3.4 MAPPING: AN INTRODUCTION TO tmap

3.4.1 Introduction

The first parts of this chapter have outlined basic commands for plotting data and

for producing maps and graphics using R. These were based on the plot func-

tions associated with sp objects. This section will now concentrate on developing

and expanding these basic techniques using the functions in the tmap package. It

will introduce some new plot parameters and will show how to extract and down-

load Google Maps and to use OpenStreetMap data as background context and to

create interactive (at least zoomable) maps in tmap. As you develop more sophisticated analyses in later sections you may wish to return to some of the examples

used in this section. It will develop mapping of vector spatial data (points, lines

and areas) and will also introduce some new R commands and techniques to help

put all of this together.

The tmap mapping package (Tennekes, 2015) focuses on mapping the spatial

distribution of thematic data attributes. It can take sp and sf objects. It has a simi-

lar grammar to plotting with ggplot in that it seeks to handle each element of the

map separately in a series of layers, and in so doing seeks to exercise control over

each element. This is different from the basic plot functions used above to map

sp and sf data.

In this section the workings of tmap will be introduced, and then in later sec-

tions on mapping attributes this will be expanded and refined to impose different

mapping styles and embellishments. To begin with, you will need some predeter-

mined data, and the code in this section will use the georgia and

georgia_sf objects that were created earlier. As ever, you may wish to think

about creating a script and a workspace folder in which you can store any results

you generate. As a reminder, you can clear your workspace to remove all the vari-

ables and datasets you have created and opened using the previous code and com-

mands. This can be done via the menu in RStudio via Session > Clear Workspace,

or via the console by entering:

rm(list=ls())


3.4.2 A quick tmap

The qtm() function can be used to compose a quick map. The code below loads

the georgia data, recreates georgia_sf and generates a quick tmap using

qtm. First load the data:

data(georgia)

Check that the data have loaded correctly using ls(). There should be three

Georgia datasets: georgia, georgia2 and georgia.polys. Then create the

sf object georgia_sf as before:

georgia_sf <- st_as_sf(georgia)

Finally load tmap and create a quick map as in Figure 3.3:

library(tmap)

qtm(georgia, fill = "red", style = "natural")

Figure 3.3 The map of Georgia generated by qtm()


Note the use of the style parameter. This is a shortcut to a predefined style within the tmap package, in this case the one applied by tm_style("natural"). These styles can be called in abbreviated form using qtm. You should explore the qtm function through the help.
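For example, other predefined styles can be swapped in with the same call (a quick sketch; the style names below are among those defined by tmap, and the full list is given in the tm_style help):

qtm(georgia_sf, fill = "wheat", style = "classic")
qtm(georgia_sf, fill = "grey80", style = "cobalt")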

The fill parameter can be used to specify a colour as above, or a variable to

be mapped. The code below generates Figure 3.4, which shows the distribution of

the MedInc variable:

qtm(georgia_sf, fill="MedInc", text="Name", text.size=0.5,

format="World_wide", style="classic",

text.root=5, fill.title="Median Income")

Figure 3.4 Counties in the state of Georgia shaded by median income


3.4.3 Full tmap

The process of making maps using tmap is one in which a series of layers are

added to the map. First the tm_shape() is specified, followed by a tmap aes-

thetic function that specifies what is to be plotted. This can be illustrated by

running the code snippets below and inspecting the results. You should see

how the tmap functions are added as a series of layers to the map in a similar

way to ggplot. Before this an outline of Georgia is created using the st_

union() function in sf. An alternative for sp is the gUnaryUnion() func-

tion in the rgeos package loaded with GISTools. The manipulation of spatial

data using overlay, union and intersection functions is covered in more depth

in Chapter 5.

# do a merge

g <- st_union(georgia_sf)

# for sp

# g <- gUnaryUnion(georgia, id = NULL)

# plot the spatial layers

tm_shape(georgia_sf) +

tm_fill("tomato")

Add the county borders:

tm_shape(georgia_sf) +

tm_fill("tomato") +

tm_borders(lty = "dashed", col = "gold")

Add some styling:

tm_shape(georgia_sf) +

tm_fill("tomato") +

tm_borders(lty = "dashed", col = "gold") +

tm_style("natural", bg.color = "grey90")

Include the outline, noting the second call to tm_shape to plot the second spatial

object g:

tm_shape(georgia_sf) +

tm_fill("tomato") +

tm_borders(lty = "dashed", col = "gold") +

tm_style("natural", bg.color = "grey90") +

# now add the outline

tm_shape(g) +

tm_borders(lwd = 2)

And finally putting it all together to create Figure 3.5:


tm_shape(georgia_sf) +

tm_fill("tomato") +

tm_borders(lty = "dashed", col = "gold") +

tm_style("natural", bg.color = "grey90") +

# now add the outline

tm_shape(g) +

tm_borders(lwd = 2) +

tm_layout(title = "The State of Georgia",

title.size = 1,

title.position = c(0.55, "top"))

So what you can see in the above code are two sets of tmap plot commands: the

first set plots the georgia_sf dataset, specifying a dashed gold line to show the

county boundaries, a tomato (red) fill colour for the state and a map background

colour of light grey. The second set adds the outline created by the union operation

with a thicker line width before the title is added.

Figure 3.5 Counties in the state of Georgia

It is also possible to plot multiple different maps from different datasets

together, but this requires a bit more control over the tmap parameters. The code


below assigns each map to variables t1 and t2, and then a second set of functions

is used to manipulate these in a plot window. Note that georgia2 is in sp format

and has a different map projection than georgia. For this reason, the aspect of the

second plot is specified for the second plot in the code below. The value was deter-

mined through trial and error. You will need to install and load the grid package.

# 1st plot of georgia

t1 <- tm_shape(georgia_sf) +

tm_fill("coral") +

tm_borders() +

tm_layout(bg.color = "grey85")

# 2nd plot of georgia2

t2 <- tm_shape(georgia2) +

tm_fill("orange") +

tm_borders() +

# the asp parameter controls aspect

# this makes the 2nd plot align

tm_layout(asp = 0.86,bg.color = "grey95")

Now you can specify the layout of the combined map plot as in Figure 3.6:

library(grid)

# open a new plot page

grid.newpage()

# set up the layout

pushViewport(viewport(layout=grid.layout(1,2)))

# plot using the print command

print(t1, vp=viewport(layout.pos.col = 1, height = 5))

print(t2, vp=viewport(layout.pos.col = 2, height = 5))

Figure 3.6 Examples of the use of tmap to generate multiple maps in the same plot window


Thus different plot parameters can be used for different subsets of the data such

that they are plotted in ways that are different from the default. Sometimes we

would like to label the features in our maps. Have a look at the names of the coun-

ties in the georgia_sf dataset. These are held in the 13th attribute column, and

names(georgia_sf) will return a list of the names of all attributes:

data.frame(georgia_sf)[,13]

It would be useful to display these on the map, and this can be done using the

tm_text function in the tmap package. The result

is shown in Figure 3.7.

tm_shape(georgia_sf) +

tm_fill("white") +

tm_borders() +

tm_text("Name", size = 0.3) +

tm_layout(frame = FALSE)

And we can subset the data as with the sp format. The code below subsets the

counties of Jefferson, Jenkins, Johnson, Washington, Glascock, Emanuel, Candler,

Bulloch, Screven, Richmond and Burke:

# the county indices below were extracted from the data.frame

index <- c(81, 82, 83, 150, 62, 53, 21, 16, 124, 121, 17)

georgia_sf.sub <- georgia_sf[index,]
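Counties can equally be selected by an attribute value rather than by row index; a small hedged sketch (not part of the chapter's worked example) is shown below:

# select counties by attribute rather than index:
# e.g. those with a median income above $50,000
rich_sf <- georgia_sf[georgia_sf$MedInc > 50000, ]
nrow(rich_sf)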

The notation for subsetting is the same as for sp objects, and enables individual

areas or polygons to be selected from spatial datasets using the bracket notation as

used in matrices, data frames and vectors. The subset can be plotted to generate

Figure 3.8 using the code below.

tm_shape(georgia_sf.sub) +

tm_fill("gold1") +

tm_borders("grey") +

tm_text("Name", size = 1) +

# add the outline

tm_shape(g) +

tm_borders(lwd = 2) +

# specify some layout parameters

tm_layout(frame = FALSE, title = "A subset of Georgia",

title.size = 1.5, title.position = c(0., "bottom"))

Finally, we can bring together the different spatial data that have been created in a

single map as in Figure 3.9 using the code below. You should note how the different

tm_shape, tm_fill etc. functions are used to set up each layer of the map and

that tmap determines the map extent from the layers:

# the 1st layer

tm_shape(georgia_sf) +

tm_fill("white") +

tm_borders("grey", lwd = 0.5) +


# the 2nd layer

tm_shape(g) +

tm_borders(lwd = 2) +

# the 3rd layer

tm_shape(georgia_sf.sub) +

tm_fill("lightblue") +

tm_borders() +

# specify some layout parameters

tm_layout(frame = T, title = "Georgia with a subset of counties",

title.size = 1, title.position = c(0.02, "bottom"))

Figure 3.7 Adding text to map objects with tmap


3.4.4 Adding Context

In some situations a map with background context may be more informative.

There are a number of options for doing this, including OpenStreetMap,1 Google

Maps and Leaflet. This requires some additional packages to be downloaded and

installed in R. If you have not done so already, install the OpenStreetMap

package and load it into R:

install.packages(c("OpenStreetMap"),depend=T)

library(OpenStreetMap)

If using OpenStreetMap, the approach is to define the area of interest, to download

and plot the map tile from OpenStreetMap and then to plot your data over the tiles.

In this case the background map area is defined by the spatial extent of the Georgia

subset created above, which is used to determine the tiles to download from

OpenStreetMap. The results of the code below are shown in Figure 3.10. Note the

use of the spTransform function in the rgdal package in the last line of the code.

Figure 3.8 A subset of the counties in the state of Georgia

1 At the time of writing, there can be some compatibility issues with the rJava package required by

OpenStreetMap. These relate to the use of 32-bit and 64-bit programs, especially on Windows PCs.

If you experience problems installing OpenStreetMap, then it is suggested that you use the 32-bit

version of R, which is also installed as part of R for Windows.


This transforms the geographical projection of the georgia.sub data to the same

projection as the OpenStreetMap data layer. Here it is easier to work with sp objects.

# define upper left, lower right corners

georgia.sub <- georgia[index,]

ul <- as.vector(cbind(bbox(georgia.sub)[2,2],

bbox(georgia.sub)[1,1]))

lr <- as.vector(cbind(bbox(georgia.sub)[2,1],

bbox(georgia.sub)[1,2]))

# download the map tile

MyMap <- openmap(ul,lr)

# now plot the layer and the backdrop

par(mar = c(0,0,0,0))

plot(MyMap, removeMargin=FALSE)

plot(spTransform(georgia.sub, osm()), add = TRUE, lwd = 2)

Google Maps can also be downloaded and used as context. Again, this package

should be installed if you have not done so already.

Figure 3.9 The result of the code for plotting a spatial object and a spatial subset


install.packages(c("RgoogleMaps"),depend=T)

Then the area for the background map data is defined to identify the tiles to be

downloaded from Google Maps. Some of the plotting commands are specific to

the packages installed – note the first step to convert the subset to PolySet

format using the SpatialPolygons2PolySet function in maptools

(loaded with GISTools) and the last line that defines a polygon plot over

Google Maps:

# load the package

library(RgoogleMaps)

# convert the subset

shp <- SpatialPolygons2PolySet(georgia.sub)

# determine the extent of the subset

bb <- qbbox(lat = shp[,"Y"], lon = shp[,"X"])

# download map data and store it

MyMap <- GetMap.bbox(bb$lonR, bb$latR, destfile = "DC.jpg")

# now plot the layer and the backdrop

par(mar = c(0,0,0,0))

PlotPolysOnStaticMap(MyMap, shp, lwd=2,

col = rgb(0.25,0.25,0.25,0.025), add = F)

It is also possible to use the tmap package for context using Leaflet. Leaflet is an

open source JavaScript library used to build interactive web mapping applica-

tions (see https://rstudio.github.io/leaflet/) and is embedded

Figure 3.10 A subset of Georgia with an OpenStreetMap backdrop


within the tmap package. It is useful if you want to embed interactive maps in

an HTML file (e.g. by using RMarkdown). The code below maps georgia_sf.sub

with an interactive Leaflet backdrop as in Figure 3.11. Note that the interactive

mode is set through the tmap_mode function, which in this case has been set to

'view', which requires an internet connection,


with the alternative being

'plot'.

tmap_mode('view')

tmap mode set to interactive viewing

tm_shape(georgia_sf.sub) +

tm_polygons(col = "#C6DBEF80" )

Finally, remember to reset the tmap_mode to plot:

tmap_mode("plot")

Figure 3.11 An interactive map of the Georgia subset with a Leaflet/OpenStreetMap backdrop


3.4.5 Saving Your Map

Having created a map in a window on the screen, you may now want to save the

map for either printing, or incorporating in a document. There are a number of

ways that this can be done. The simplest in RStudio is to click on the Export icon

in the plot pane for saving options (in R, right-click with the mouse on the map

window), select Copy to Clipboard, and then paste it into a word-processing docu-

ment (e.g. one being created in either OpenOffice or MS Word). Another is to

use Save as Image to save the map as an image file, with a name that you give it.

However, it is also possible to save images by using the R commands that were used

to create the map. This takes more initial effort, but has the advantage that it is pos-

sible to make minor edits and changes (such as altering the position of the scale, or

drawing the census block boundaries in a different colour) and to easily rerun the

code to re-create the image file. There are a number of formats for saving maps, such

as PDF, PNG and TIFF.

One way to create a file of commands is to edit a text file with a name ending in

.R – note the capital letter. In RStudio, open a new document by selecting File >

New File > R script. Then type in the following:

# load package and data

library(GISTools)

data(newhaven)

proj4string(roads) <- proj4string(blocks)

# plot spatial data

tm_shape(blocks) +

tm_borders() +

tm_shape(roads) + tm_lines(col = "red") +

# embellish the map

tm_scale_bar(width = 0.22) +

tm_compass(position = c(0.8, 0.07)) +

tm_layout( frame = F, title = "New Haven, CT", title.size = 1.5,

title.position = c(0.55, "top"), legend.outside = T)

Save the file as newhavenmap.R in your working directory.


When you start an R session you should set the working directory to be the folder

that you wish to use to write and read data to and from, to store your command

files, such as the newhavenmap.R file, and any workspace files or .RData files

that you save. In RStudio this is Session > Set Working Directory > .... In R in

Windows it is File > Change dir... and on a Mac it is Misc > Set Working Directory.

Now go back to the R command line and enter:


source("newhavenmap.R")

and your map will be redrawn. The file contains all of the commands to draw the

map, and ‘sourcing’ it makes R run through these in sequence. Suppose you now

wish to redraw the map, but with the roads drawn in blue, rather than red. In the

file editor, go to the tm_lines command, and edit the line to become:

tm_lines(col = "blue") +

and save the file again. Re-entering source("newhavenmap.R") now draws

the map, but with the roads drawn in blue. Another parameter sometimes used in

map drawing is the line width parameter, lwd. This time, edit the tm_borders

command in the file to become:

tm_borders(lwd = 3) +

and re-enter the source command. The map is redrawn with thicker boundaries

around the census blocks. The col and lwd parameters can of course be used in

combination. Edit the file again, so that the second line becomes:

tm_lines(col = "blue", lwd = 2) +

and source the file again. This time the roads are thicker and drawn in blue.

Another advantage of saving command files, as noted earlier, is that it is pos-

sible to place the graphics created into various graphics file formats. To create a

PDF, for example, the command:

pdf(file='map.pdf')

can be placed before the first line containing a tm_shape command in the

newhavenmap.R file. This tells R that after this command, any graphics will not

be drawn on the screen, but instead are written to the file map.pdf (or whatever

name you choose for the file). When you have written all of the commands you need

to create your map, then insert the following at the end of the tmap commands:

dev.off()

This is short for device off, and tells R to close the PDF file, and go back to

drawing graphics in windows on the screen in the future. To test this out, insert

a new first line at the beginning of newhavenmap.R and a new last line at the

end. Then re-source the file. This time no new graphics are drawn, but you have

now created a set of commands to write the graphic into a PDF file called map.pdf. This file will be created in the folder in which you are working. To check that

this has worked, open your working directory folder in Windows Explorer, Mac

Finder, etc., and there should be a file called map.pdf. Click on it and whatever


PDF reader you use should open, and your map should be displayed as a PDF file.

This file can be incorporated into presentations, word-processing documents and

so on. A similar command, for producing PNG files, is:

png(file='map.png')

which writes all subsequent R graphics into a PNG file, until a dev.off() is issued.

To test this, replace the first line of newhavenmap.R with the above command,

and re-source it from the R command line. A new file will appear in the folder called

map.png which may be incorporated into documents as with the PDF file.

Of course you do not need to load a .R file to do this! You can place the opening

and closing commands around the mapping code.

There are a number of commonly used functions for writing maps out to PDF,

PNG, TIFF, etc., files:

pdf()

png()

tiff()

Examine the help for these.

The key thing you need to know is that these functions all open a file. The open

file needs to be closed using dev.off() after the map has been written to it. So

the syntax is:

pdf(file = "MyPlot.pdf", <other settings>)
# ... the tmap or plot commands that create the map go here ...
dev.off()

You can write a .png file for the map using the code below. Note that you may

want to set the working directory that you write to using the setwd() function.

To illustrate this the code below creates some points for the georgia_sf polygon

centroids, sets the working directory and then creates a map:

pts_sf <- st_centroid(georgia_sf)

setwd('~/Desktop/')

# open the file

png(filename = "Figure1.png", w = 5, h = 7, units = "in", res = 150)

# make the map

tm_shape(georgia_sf) +

tm_fill("olivedrab4") +

tm_borders("grey", lwd = 1) +

# the points layer

tm_shape(pts_sf) +

tm_bubbles("PctBlack", title.size = "% Black", col = "gold")+

tm_format_NLD()

# close the png file

dev.off()


3.5 MAPPING SPATIAL DATA ATTRIBUTES

3.5.1 Introduction

This section describes some approaches for displaying and mapping spatial data

attributes. Some of these ideas and commands have already been used in the pre-

ceding illustrations, but this section provides a more formal and comprehensive

description.

All of the maps that you have generated thus far have simply displayed data (e.g.

the roads in New Haven and the counties in Georgia). This is fine if the aim is sim-

ply to map the locations of different features. However, we are often interested in

identifying and analysing the properties or attributes associated with different spa-

tial features. The New Haven and Georgia datasets introduced above both contain

areas or regions within them. In the case of the New Haven one these are the census

reporting areas (census blocks or tracts), and in Georgia the counties within the

state. These areas have attributes from the population census for each spatial unit.

These attributes are held in the data frame of the spatial object. For example, in the

code above you examined the data frame of the Georgia dataset and listed the attrib-

utes of individual objects within the dataset. Figure 3.1 actually maps the median income of each county in Georgia, although this code was not shown.

3.5.2 Attributes and Data Frames

The attributes associated with individual features (lines, points, areas in vector data

and cell values in raster data) provide the basis for spatial analyses and geographical

investigation. Before examining attributes directly, it is important to reconsider the

data structures that are commonly used to hold and manipulate spatial data in R.

Clear your workspace and load the New Haven data, convert to sf format and

then examine the blocks, breach and tracts data:

# clear workspace

rm(list = ls())

# load & list the data

data(newhaven)

ls()

# convert to sf

blocks_sf <- st_as_sf(blocks)

breach_sf <- st_as_sf(breach)

tracts_sf <- st_as_sf(tracts)

# have a look at the attributes and object class

summary(blocks_sf)

class(blocks_sf)

summary(breach_sf)

class(breach_sf)

summary(tracts_sf)

class(tracts_sf)


You should notice a number of things from these summaries:

● Each of the datasets is spatial: blocks_sf and tracts_sf are

POLYGON sf objects and breach is a POINT object.

● They all have data frames attached to them that contain attributes whose

values are summarised by the summary function.

● breach_sf only has geometry attributes – it has no thematic

attributes, it just records locations.

The data frame of these spatial objects can be accessed in order to examine, manip-

ulate or classify the attribute data. Each row in the data frame contains attribute

values associated with one of the spatial objects, the individual polygons for exam-

ple in blocks_sf, and each column describes the values associated with a par-

ticular attribute for all of the objects. Accessing the data frame allows you to read,

alter or compute new attributes. Entering:

data.frame(blocks_sf)

would print all of the attribute information for each census block in New Haven to

the R console window, until the print limit was reached, while:

head(data.frame(blocks_sf))

prints out the first six rows. The attributes can be individually identified using their

names. To see the list of column names enter:

colnames(data.frame(blocks_sf))

# or

names(blocks_sf)

Note that for sp objects, an alternative is to use @data to access the data frame of

the SpatialPolygonsDataFrame objects, as well as the above code:

colnames(blocks@data)

head(blocks@data)

One of the data attributes or variables is called P_VACANT and describes the per-

centage of households that are unoccupied (i.e. vacant) in each of the blocks. To

access the variable itself, enter:

data.frame(blocks_sf)$P_VACANT

The $ operator works as it would on a standard data frame to access individual

variables (columns) in the data frame. For the data frames of spatial objects a short-

hand exists to access this variable. Enter:


blocks$P_VACANT

A third option is to attach the data frame. Enter:

attach(data.frame(blocks_sf))

All of the attribute variables now appear as ordinary R variables. For example, to

draw a histogram of the percentage vacant housing for each block, enter:

hist(P_VACANT)

Finally, it is good practice to detach any objects that have been attached after you

have finished using them. It is possible to attach many data frames simultaneously,

but this can lead to problems if you are not careful. To detach the data frame you

attached earlier, enter:

detach(data.frame(blocks_sf))

You can try a similar set of commands with the tracts data, but the breaches

data has no attributes: it simply records the locations of breaches of the peace. As

with any point data, the breaches of the peace data can be used to create a heat map

raster dataset.

# use kde.points to create a kernel density surface

breach.dens = st_as_sf(kde.points(breach,lims=tracts))

summary(breach.dens)

breach.dens is a raster/pixels dataset, and its attributes are held in a data frame

which can be examined:

breach.dens

Notice that this has the kernel density estimation and geometry attributes that de-

scribe the X and Y locations, and you can plot the breach.dens object:

plot(breach.dens)

Also note that you can remove the st_as_sf function from the kde.points

command to generate a SpatialPixelsDataFrame object, part of the sp fami-

ly of spatial objects. This can be plotted with the image function.

A final key point about attributes is that you can create and assign new attrib-

utes to the spatial object, for both sf and sp. For example, the code below creates

a normally distributed random value for each of the 129 areas in the blocks_sf

object. Note the use of the $ to do this:

blocks_sf$RandVar <- rnorm(nrow(blocks_sf))


Of course it is more than likely that you will want to assign a new value to a spatial

object that arises from the result of some kind of analysis, data join, etc. It is very

easy to link new data attributes to spatial objects in this way.
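For example, a derived attribute can be computed from existing columns and assigned in the same way. The sketch below is illustrative only (the units returned by st_area depend on the layer's coordinate reference system):

# crude population density from the 1990 population and polygon area
blocks_sf$POP_DENS <- blocks_sf$POP1990 / as.numeric(st_area(blocks_sf))
head(blocks_sf$POP_DENS)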

3.5.3 Mapping Polygons and Attributes

A choropleth is a thematic map in which areas are shaded in proportion to their

attributes. The tmap package includes a number of ways of generating choropleth

maps. Enter:

tmap_mode('plot')

tm_shape(blocks_sf) +

tm_polygons("P_OWNEROCC")

This produces a map of the census blocks in New Haven, shaded by the percentage of owner-occupied properties. The tm_polygons element automatically includes a legend to allow the map to be interpreted, in this case the levels of owner occupancy associated with each of the different shade colours.

There are a couple of things to note about the use of tmap. First, tmap_mode

was set to plot to generate a standard choropleth suitable for including in a

report rather than an interactive map for use in a webpage, for example. Recall

that the Leaflet mapping above used the interactive view (i.e. tmap_mode was

set to 'view'). Second, in a similar way to the ggplot operations in Chapter 2,

the tmap package constructs maps by combining different map elements. In this

case blocks_sf was passed to the tm_shape function and then the tm_pol-

ygons function was used to specify the variable to be mapped, in this case P_

OWNEROCC.

You should note that it is also possible to pass sp format spatial objects to

tmap. Try replacing tm_shape(blocks_sf) with tm_shape(blocks) in

the code above and below. Also note that in this case the variable P_OWNEROCC

was mapped using five classes of equal interval. Try repeating the tmap code

above using a different variable such as P_VACANT. What happens? You will see

that tmap automatically determines the number of classes to be included and the

class intervals or breaks. Finally, a colour shading scheme is automatically allo-

cated to the map and the legend is included in the map. All of these, and many

of the other default mapping settings that tmap uses, can be controlled and

modified.

For example, to control the class intervals, the breaks parameter can be specified:

tm_shape(blocks_sf) +

tm_polygons("P_OWNEROCC", breaks=seq(0, 100, by=25))

This can be done in many different ways:

tm_shape(blocks_sf) +

tm_polygons("P_OWNEROCC", breaks=c(10, 40, 60, 90))


The legend placement and title can be modified. The tm_layout function is very

useful here:

tm_shape(blocks_sf) +

tm_polygons("P_OWNEROCC", title = "Owner Occ") +

tm_layout(legend.title.size = 1,

legend.text.size = 1,

legend.position = c(0.1, 0.1))

You could also try legend.position = c("centre", "bottom"). Further

documentation on tm_layout can be found at https://www.rdocumentation.

org/packages/tmap/versions/1.11/topics/tm_layout.

It is also possible to alter the colours used in a shading scheme. The default

colour scheme uses increasing intensities of yellow to red. Graduated lists of col-

ours like this are generated using the RColorBrewer package, which is auto-

matically loaded with both tmap and GISTools. This package makes use of a

set of colour palettes designed by Cynthia Brewer, intended to optimise the per-

ceptual difference between each shade in the palette, so that visually each shading

colour is distinct. The palettes available in this package are displayed with the

command:

display.brewer.all()

This displays the various colour palettes and their names in a plot window.

To generate a list of colours from one of these palettes, for example, enter the

following:

brewer.pal(5,'Blues')

[1] "#EFF3FF" "#BDD7E7" "#6BAED6" "#3182BD" "#08519C"

This is a list of colour codes used by R to specify the palette. The brewer.pal

arguments specify that a five-stage palette based on shades of blue is required.

The output of brewer.pal can be fed into tmap to give alternative colours in

shading schemes. For example, enter the code below and a choropleth map shad-

ed in red is displayed with its legend. The palette argument in tm_polygons

specifies the new colours in the shading scheme.

tm_shape(blocks_sf) +

tm_polygons("P_OWNEROCC", title = "Owner Occ", palette = "Reds") +

tm_layout(legend.title.size = 1)

Note that the same map would be produced if the tm_fill function were used

instead of tm_polygons; however, without a tm_borders function, the census

block outlines are not plotted. Try entering:


tm_shape(blocks_sf) +

tm_fill("P_OWNEROCC", title = "Owner Occ", palette = "Blues") +

tm_layout(legend.title.size = 1)

Figure 3.12 Different choropleth maps of owner-occupied properties in New Haven using different shades and class intervals

A final adjustment is to change the way the class interval boundaries are com-

puted. As a default, they are based on equal-sized intervals of the attribute being

mapped, but different palette styles are available. Have a look at the help for tm_

polygons and you will see that a number of different plotting styles are available.

You should explore these. The class intervals can be changed to quantiles or any

other range of intervals using the breaks parameter. For example, the code below

produces three maps in Figure 3.12 with equal intervals (left), with intervals based

on k-means (middle) and with quantiles (right), using the quantileCuts func-

tion in GISTools, and using the pushViewport function in the grid package

as before to plot multiple maps together.

# with equal intervals: the tmap default

p1 <- tm_shape(blocks_sf) +

tm_polygons("P_OWNEROCC", title = "Owner Occ", palette = "Blues") +

tm_layout(legend.title.size = 0.7)

# with style = kmeans

p2 <- tm_shape(blocks_sf) +

tm_polygons("P_OWNEROCC", title = "Owner Occ", palette = "Oranges",

style = "kmeans") +

tm_layout(legend.title.size = 0.7)

# with quantiles

p3 <- tm_shape(blocks_sf) +

tm_polygons("P_OWNEROCC", title = "Owner Occ", palette = "Greens",

breaks = c(0, round(quantileCuts(blocks$P_OWNEROCC, 6), 1))) +

tm_layout(legend.title.size = 0.7)

# Multiple plots using the grid package

library(grid)

grid.newpage()


# set up the layout

pushViewport(viewport(layout=grid.layout(1,3)))

# plot using the print command

print(p1, vp=viewport(layout.pos.col = 1, height = 5))

print(p2, vp=viewport(layout.pos.col = 2, height = 5))

print(p3, vp=viewport(layout.pos.col = 3, height = 5))

It is also possible to display a histogram of the distribution of the variable or

attribute being mapped using the legend.hist parameter. This is very useful

for choropleth mapping as it gives a distribution of the attributes being examined.

Bringing this all together allows you to create a map with a number of refine-

ments as in Figure 3.13. Note, for example, the minus sign before the palette

parameter to reverse the palette order and the various parameters passed to the

tm_layout function.

tm_shape(blocks_sf) +

tm_p olygons("P_OWNEROCC", title = "Owner Occ", palette = "-GnBu",

breaks = c(0, round(quantileCuts(blocks$P_OWNEROCC, 6), 1)),

legend.hist = T) +

tm_scale_bar(width = 0.22) +

tm_compass(position = c(0.8, 0.07)) +

tm_layout(frame = F, title = "New Haven",

title.size = 2, title.position = c(0.55, "top"),

legend.hist.size = 0.5)

Figure 3.13 An illustration of the various options for mapping with tmap


It is possible to compute certain derived attribute values on the fly in tmap. The

code below first assigns a projection to the tracts layer from the blocks layer,
converts it to sf, and then plots population density using the convert2density parameter

applied to the POP1990 attribute.

# add a projection to tracts data and convert tracts data to sf

proj4string(tracts) <- proj4string(blocks)

tracts_sf <- st_as_sf(tracts)

tracts_sf <- st_transform(tracts_sf, "+proj=longlat +ellps=WGS84")

# plot

tm_shape(blocks_sf) +

tm_fill(col="POP1990", convert2density=TRUE,

style="kmeans", title=expression("Population (per " ∗ km^2 ∗ ")"),

legend.hist=F, id="name") +

tm_borders("grey25", alpha=.5) +

# add tracts context

tm_shape(tracts_sf) +

tm_borders("grey40", lwd=2) +

tm_format_NLD(bg.color="white", frame = FALSE,

legend.hist.bg.color="grey90")

The convert2density parameter automatically converts the projection units (in

this case degrees of latitude and longitude) to a projection in metres and then deter-

mines areal density in square kilometres. You can check this by creating your own

population density values, and examining the explanations of how the functions

operate in the help pages for the functions used, such as st_area.

Compare the population density summary with the legend of the figure created

using the code above:

# add an area in km^2 to blocks

blocks_sf$area = st_area(blocks_sf) / (1000*1000)

# calculate population density manually

summary(blocks_sf$POP1990/blocks_sf$area)

A final consideration is the ability of tmap to map multiple attributes in the same

operation. The code below plots two attributes in the same call (Figure 3.14):

tm_shape(blocks_sf) +

tm_fill(c("P_RENTROCC", "P_BLACK")) +

tm_borders() +

tm_layout(legend.format = list(digits = 0),

legend.position = c("left", "bottom"),

legend.text.size = 0.5,

legend.title.size = 0.8)

In summary, the tm_fill and tm_polygons functions in the tmap package

generate choropleth maps of attributes held in spatial polygons data frame (sp) or

simple feature (sf) data objects. They automatically shade the variables using

equal intervals. The intervals and the palettes can both be adjusted. It is instructive

to examine the plotting functions and the way they operate. Enter:


Figure 3.14 tmap choropleth maps of census blocks in New Haven showing the percentage of houses rented and occupied (P_RENTROCC) and the percentage of the population recorded as black (P_BLACK)

tm_polygons

The function code detail is displayed in the R console window. You will see that it

takes a number of arguments and a number of default parameters. In addition to

using the R help system to understand functions, examining functions in this way

can also provide you with insight into their operation.
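If you just want the argument names and their defaults rather than the full function body, the base R functions args and formals (not specific to tmap) provide a quick view; this is a small illustrative sketch:

args(tm_polygons)            # print the argument list and defaults
names(formals(tm_polygons))  # just the argument names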

3.5.4 Mapping Points and Attributes

Point data can be mapped in R, as well as polygons and lines. The newhaven data

include locations of reports of ‘breaches of the peace’. These events are essentially

public disorder incidents, on many occasions requiring police intervention. The

data are stored in a variable called breach, which was converted to sf format

above. Plotting this variable works in the same way as plotting polygons or lines,

using the tm_shape function:

tm_shape(blocks_sf) +

tm_polygons("white") +

tm_shape(breach_sf) +

tm_dots(size = 0.5, shape = 19, col = "red", alpha = 1)

This plots the locations of each of the breach of peace incidents with a symbol above

the blocks_sf layer using the tm_dots function. This can take a number of

parameters, including those to control the point size, colour and shape. The shape

is drawn from the core R pch (plot character) argument. You should examine the


help for pch and for points to see the different symbols (or shapes in the language

of tmap) that can be used.
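As a quick way of seeing the available symbols, the sketch below (plain base R, nothing tmap-specific) draws each of the 25 standard pch characters alongside its number:

# display the 25 standard pch plotting symbols with their numbers
plot(1:25, rep(1, 25), pch = 1:25, cex = 2, axes = FALSE,
     xlab = "pch value", ylab = "", ylim = c(0.8, 1.2))
text(1:25, rep(0.9, 25), labels = 1:25, cex = 0.7)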

If you have very dense point data then one point may obscure another. Adding

some transparency to the points can help visualise dense point data. The alpha

parameter can be used to add a transparency term to the colour. Try adjusting the

code above to change the transparency and the plot character. For example:

tm_shape(breach_sf) +

tm_dots(size = 0.5, shape = 19, col = "red", alpha = 0.5)

Commonly, point data come in a tabular format rather than as an R spatial

object (i.e. of class sp or sf format), with attributes that include the latitude and

longitude or easting and northing of the individual data points. One such dataset

is the quakes dataset included as part of R. It provides the locations of 1000 seis-

mic events (earthquakes) near Fiji. To load and examine the data enter:

# load the data

data(quakes)

# look at the first 6 records

head(quakes)

You will see that the data come with a number of attributes: lat, long, depth,

mag and stations. Here you will use the lat and long attributes to create a

spatial points dataset in sf format with the attributes included. Creating spatial

data from scratch in sf is a bit convoluted, so perhaps the easiest way is to create

an sp object and convert it. This is done in the code below:

# define the coordinates

coords.tmp <- cbind(quakes$long, quakes$lat)

# create the SpatialPointsDataFrame

quakes.sp <- SpatialPointsDataFrame(coords.tmp,

data = data.frame(quakes),

proj4string = CRS("+proj=longlat "))


Transparency can also be added to shading colours manually. Remember

that the full set of predefined and named colours available in R can be listed

by entering colours(). Also you can list the colours in the RColor-

Brewer palettes. To see the palettes enter display.brewer.all()

and to list colours in an individual palette enter brewer.pal(5, "Reds").

Any of these can be used in the call above. Additionally, a transparency term

can be added to colour and palettes using the add.alpha function in the

GISTools package. For 50% transparency enter add.alpha(brewer.

pal(5, "Reds"), 0.5).


# convert to sf

quakes_sf <- st_as_sf(quakes.sp)

The result can be mapped as shown in Figure 3.15, which shows the spatial context

of the data in the Pacific Ocean, to the north of New Zealand.

Figure 3.15 A plot of the Fiji earthquake data

# map the quakes

tm_shape(quakes_sf) +

tm_dots(size = 0.5, alpha = 0.3)

The last bit of code nicely illustrates how to create a spatial dataset in sp or sf

format in R. Essentially the sequence is:

● define the coordinates for the spatial object

● assign these to an sp class of object as in Table 3.1

● then, if required, convert the sp object to sf

You should examine the help for these classes of objects. In brief, points just need

coordinate pairs, but polygons and lines need lists of coordinates for each object.

help("SpatialPoints-class")

help("sf")


You will have noticed that the quakes dataset has an attribute describing the

depth of each earthquake. We can visualise the depths in a number of ways – for

example, by plotting all the data points, but specifying the size of each data point

to be proportional to the depth attribute, or by using choropleth mapping as above

with tmap. These are shown in the code blocks below and the results are in

Figure 3.16. As a reminder, when you run this code and the other code in this

book, you should try manipulating and changing the parameters that are used to

explore different mapping approaches. The code below uses different plot charac-

ter sizes and colours to indicate the magnitude of the variable being considered:

library(grid)

# by size

p1 <- tm_shape(quakes_sf)+

tm_bubbles("depth", scale = 1, shape = 19, alpha = 0.3,

title.size="Quake Depths")

# by colour

p2 <- tm_shape(quakes_sf)+

tm_dots("depth", shape = 19, alpha = 0.5, size = 0.6,

palette = "PuBuGn",

title="Quake Depths")

# multiple plots using the grid package

grid.newpage()

# set up the layout

pushViewport(viewport(layout=grid.layout(1,2)))

# plot using the print command

print(p1, vp=viewport(layout.pos.col = 1, height = 5))

print(p2, vp=viewport(layout.pos.col = 2, height = 5))

It is also possible to select specific data subsets to plot. The code below just maps

earthquakes that have a magnitude greater than 5.5:

# create the index

index <- quakes_sf$mag > 5.5

summary(index)

# select the subset assign to tmp

tmp <- quakes_sf[index,]

# plot the subset

tm_shape(tmp) +

tm_dots( col=brewer.pal(5, "Reds")[4], shape=19,

alpha=0.5, size = 1) +

tm_layout(title="Quakes > 5.5",

title.position = c("centre", "top"))


The code used above includes logical operators and illustrates how they can

be used to select elements that satisfy some condition. These can be used

singularly or in combination to select in the following way:


Finally, it is possible to use the PlotOnStaticMap function from the RgoogleMaps
package to plot the earthquake locations with some context from Google Maps.

This is similar to Figure 3.10, which mapped a subset of Georgia counties against an

OpenStreetMap backdrop. This time, points rather than polygons are being mapped

and different Google Maps backdrops are being used as context: standard in Figure 3.17

and satellite imagery in Figure 3.18. The code for Figure 3.17 is as follows:

library(RgoogleMaps)

# define Lat and Lon

Lat <- as.vector(quakes$lat)

Long <- as.vector(quakes$long)

# get the map tiles

# you will need to be online

MyMap <- MapBackground(lat=Lat, lon=Long)

Figure 3.16 Plotting points with plot size (left) and plot colour (right) related to the attribute value.

data <- c(3, 6, 9, 99, 54, 32, -102)
index <- (data == 32 | data <= 6)
data[index]

These operations are described in greater detail in Chapter 4.


# define a size vector

tmp <- 1+(quakes$mag - min(quakes$mag))/max(quakes$mag)

PlotOnStaticMap(MyMap,Lat,Long,cex=tmp,pch=1,col='#FB6A4A30')

And here is the code for Figure 3.18:

MyMap <- MapBackground(lat=Lat, lon=Long, zoom = 10,

maptype = "satellite")

PlotOnStaticMap(MyMap,Lat,Long,cex=tmp,pch=1,

col='#FB6A4A50')

Figure 3.17 Plotting points with a standard Google Maps context

3.5.5 Mapping Lines and Attributes

This section considers line data spatial objects. These can be defined in a number

of ways and typically describe different network features such as roads. The first

step in the code below assigns a coordinate system to roads and then selects a

subset. This involves defining a polygon to clip the road data to, and converting

the datasets to sf objects.

data(newhaven)

proj4string(roads) <- proj4string(blocks)


Figure 3.18 Plotting points with Google Maps satellite image context

# 1. create a clip area

xmin <- bbox(roads)[1,1]

ymin <- bbox(roads)[2,1]

xmax <- xmin + diff(bbox(roads)[1,]) / 2

ymax <- ymin + diff(bbox(roads)[2,]) / 2

xx = as.vector(c(xmin, xmin, xmax, xmax, xmin))

yy = as.vector(c(ymin, ymax, ymax, ymin, ymin))

# 2. create a spatial polygon from this

crds <- cbind(xx,yy)

Pl <- Polygon(crds)

ID <- "clip"

Pls <- Polygons(list(Pl), ID=ID)

SPls <- SpatialPolygons(list(Pls))

df <- data.frame(value=1, row.names=ID)

clip.bb <- SpatialPolygonsDataFrame(SPls, df)

proj4string(clip.bb) <- proj4string(blocks)

# 3. convert to sf

# convert the data to sf

clip_sf <- st_as_sf(clip.bb)

roads_sf <- st_as_sf(roads)

# 4. clip out the roads and the data frame

roads_tmp <- st_intersection(st_cast(clip_sf), roads_sf)


Note that the last line generates a warning. This is because the st_intersection

function operates on the attribute values as well as the geometries, under the
assumption that the attributes are constant throughout each geometry. You can avoid this either by replacing the last line with:

st_intersection(st_geometry(st_cast(clip_sf)), st_geometry(roads_sf))

or by making the assumption that the attribute is constant throughout the geome-

try explicitly before the intersection as follows:

st_agr(x) = "constant"

st_agr(y) = "constant"

where x is assigned st_cast(clip_sf) and y assigned roads_sf.
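Putting that second option together, a warning-free version of the clip might look like the following sketch (the object name roads_tmp is simply reused from above):

# declare the attributes to be spatially constant before intersecting
x <- st_cast(clip_sf)
y <- roads_sf
st_agr(x) <- "constant"
st_agr(y) <- "constant"
roads_tmp <- st_intersection(x, y)   # no attribute warning this time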

Having prepared the roads data subset in this way, a number of methods for

mapping spatial lines can be illustrated. These include maps based on classes

and continuous variables or attributes contained in the data frame. As before we

can start with a straightforward map which is then embellished in different

ways: shading by road type (the AV_LEGEND attribute) and line thickness

defined by road segment length (the attribute LENGTH_MI). The maps are

shown in Figure 3.19; note the different ways that the legend titles are specified.
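A sketch of how maps like these can be produced is given below, using tm_lines with a colour attribute and a line-width attribute; the specific titles and palette choices here are illustrative rather than those used for the figure:

# simple map of the clipped roads
p1 <- tm_shape(roads_tmp) + tm_lines() +
  tm_layout(title = "New Haven Roads")
# shaded by road type
p2 <- tm_shape(roads_tmp) +
  tm_lines(col = "AV_LEGEND", title.col = "Road Type")
# line width defined by segment length
p3 <- tm_shape(roads_tmp) +
  tm_lines(lwd = "LENGTH_MI", scale = 5, title.lwd = "Segment length")
# plot the three maps together with the grid package, as before
grid.newpage()
pushViewport(viewport(layout = grid.layout(1, 3)))
print(p1, vp = viewport(layout.pos.col = 1))
print(p2, vp = viewport(layout.pos.col = 2))
print(p3, vp = viewport(layout.pos.col = 3))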

Figure 3.19 A subset of the New Haven roads data, plotted in different ways: simple, shaded using an attribute, and line width based on an attribute

3.5.6 Mapping Raster Attributes

Earlier in this chapter a SpatialPixelsDataFrame object was created using a

kernel density function. In this section the Meuse dataset, included as part of the

sp package, will be used to illustrate how raster attributes can be mapped in R.

Load the meuse.grid dataset and examine its properties using the class

and summary functions.


# you may have to install the raster package

# install.packages("raster", dep = T)

library(raster)

data(meuse.grid)

class(meuse.grid)

summary(meuse.grid)

You should notice that meuse.grid is a data.frame object and that it has seven

attributes including an easting (x) and a northing (y). These are described in

the meuse.grid help pages (enter ?meuse.grid). The spatial properties of

the dataset can be examined by plotting the easting and northing attributes:

plot(meuse.grid$x, meuse.grid$y, asp = 1)

And it can be converted to a SpatialPixelsDataFrame object as described

in the help page for SpatialPixelsDataFrame and then to raster format.

Note that, at the time of writing, the sf package does not have raster functionality.

However, the raster package by Hijmans and van Etten (2014) handles gridded

raster data excellently.

meuse.sp = SpatialPixelsDataFrame(points =

meuse.grid[c("x", "y")], data = meuse.grid,

proj4string = CRS("+init=epsg:28992"))

meuse.r <- as(meuse.sp, "RasterStack")

To explore the data, you could try the simple plot and spplot functions as in

the code below. For the raster object it plots all of the attributes, and for the sp object

it plots the specified layer of the meuse grid:

plot(meuse.r)

plot(meuse.sp[,5])

spplot(meuse.sp[, 3:4])

image(meuse.sp[, "dist"], col = rainbow(7))

spplot(meuse.sp, c("part.a", "part.b", "soil", "ffreq"),

col.regions=topo.colors(20))

However, it is possible to exercise more control over the mapping of the attributes

held in the raster object using the functionality of tmap. Some exam-

ples of tmap mapping routines with tm_raster and different shading schemes

are shown in Figures 3.20 and 3.21 with an interactive map context.

# set the tmap mode to plot

tmap_mode('plot')

# map dist and ffreq attributes

tm_shape(meuse.r) +

tm_raster( col = c("dist", "ffreq"), title = c("Distance","Flood Freq"),

palette = "Reds", style = c("kmeans", "cat"))


# set the tmap mode to view

tmap_mode('view')

# map the dist attribute

tm_shape(meuse.r) +

tm_raster(col = "dist", title = "Distance", breaks = c(seq(0,1,0.2))) +

tm_layout(legend.format = list(digits = 1))

Figure 3.20 Maps of the Meuse raster data

You could also experiment with some of the refinements as with the tm_polygons

examples above. For example:

tm_shape(meuse.r) +

tm_raster(col="soil", title="Soil",

palette="Spectral", style="cat") +

tm_scale_bar(width = 0.3) +

tm_compass(position = c(0.74, 0.05)) +

tm_layout( frame = F, title = "Meuse flood plain",

title.size = 2, title.position = c("0.2", "top"),

legend.hist.size = 0.5)

3.6 SIMPLE DESCRIPTIVE STATISTICAL ANALYSES

The final section of this chapter before the self-test questions describes how to

develop some basic descriptive statistical analyses of attributes held in R data


frame objects. These are intended to provide an introduction to methods for ana-

lysing the properties of spatial data attributes which will be extended in more

formal treatments of statistical and spatial analyses in later chapters. This section

first describes approaches for examining the properties of data variables using

histograms and boxplots, and then extends this to consider some simple ways of

analysing data variables in relation to each other using scatter plots and simple

regressions, before showing how mosaic plots can be used to visualise relation-

ships between variables. Importantly, a number of standard plotting routines with

their ggplot versions are introduced. You should load the tidyverse package

which includes ggplot2, and the reshape2 package which includes some data

manipulation functions:

Figure 3.21 Dynamic maps of the Meuse raster data with a Leaflet backdrop


install.packages("tidyverse", dep = T)

install.packages("reshape2", dep = T)

3.6.1 Histograms and Boxplots

There are a number of ways of generating simple summaries of any variable. The

function table can be used to summarise the counts of categorical or discrete

data, summary and fivenum provide summaries of continuous variables, and

histograms and boxplots can provide visual summaries. You should make sure the

New Haven data are loaded from the GISTools package and then use these func-

tions to explore the P_VACANT variable in blocks.
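The table function, for instance, is most useful once a continuous variable has been discretised; a minimal sketch (the 10% threshold is just an illustrative cut-off) is:

library(GISTools)
data(newhaven)
# counts of census blocks above and below a 10% vacancy threshold
table(blocks$P_VACANT > 10)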

For example, typing summary(blocks$P_VACANT) or fivenum(blocks

$P_VACANT) will produce other summaries of the distribution of the varia-

ble. R has some in-built functions for generating histograms and boxplots

with the hist and boxplot functions. However, as described in Chapter 2,

the ggplot2 package also includes functions for these visual data summa-

ries. Code for both standard R and ggplot operations is provided in the

snippets below; note the adjustment to the histogram bin sizes and the plot

labels.

data(newhaven)

# the tidyverse package loads the ggplot2 package

library(tidyverse)

# standard approach with hist

hist(blocks$P_VACANT, breaks = 40, col = "cyan",

border = "salmon",

main = "The distribution of vacant property percentages",

xlab = "percentage vacant", xlim = c(0,40))

# ggplot approach

ggplot(blocks@data, aes(P_VACANT)) +

geom_histogram(col = "salmon", fill = "cyan", bins = 40) +

xlab("percentage vacant") +

labs(title = "The distribution of vacant

,

property percentages")

A further way of providing visual descriptive summaries of variables is to use

box-and-whisker plots via the boxplot function in R and the geom_box-

plot function in ggplot2. These can summarise a single variable or multiple

variables together. Here we will focus on the geom_boxplot function in the

ggplot2 package. In order to illustrate this the blocks dataset can be split

into high- and low-vacancy areas based on whether the proportion of proper-

ties vacant is greater than 10%. Setting the vac attribute as a factor is import-

ant for both approaches, and the melt function in the reshape2 package is

critical for many ggplot operations. You should examine the result of running

melt(blocks@data). The geom_boxplot functions can be used to visualise

the differences between these two subsets in terms of the distribution of owner


occupancy and the proportion of different ethnic groups, as in Figure 3.22. First

pre-process the data:

library(reshape2)

# a logical test

index <- blocks$P_VACANT > 10

# assigned to 2 high, 1 low

blocks$vac <- index + 1

blocks$vac <- factor(blocks$vac, labels = c("Low", "High"))

Then apply the geom_boxplot function:

library(ggplot2)

ggplot(melt(blocks@data[, c("P_OWNEROCC", "P_WHITE", "P_BLACK", "vac")]),

aes(variable, value)) +

geom_boxplot() +

facet_wrap(~vac)

Figure 3.22 Box-and-whisker plot examples

The boxplot can be enhanced in many ways in ggplot. Some parameters are

used below. You may wish to search for examples of different themes and ways of

manipulating boxplots.


ggplot( melt(blocks@data[, c("P_OWNEROCC", "P_WHITE", "P_BLACK", "vac")]),

aes(variable, value)) +

geom_boxplot(colour = "yellow", fill = "wheat", alpha = 0.7) +

facet_wrap(~vac) +

xlab("") +

ylab("Percentage") +

theme_dark() +

ggtitle("Boxplot of High and Low property vacancies")

3.6.2 Scatter Plots and Regressions

The differences in the two subgroups suggest that there may be some statistical asso-

ciation between the amount of vacant properties and the proportions of different

ethnic groups, typically due to well-known socio-economic inequalities and power

imbalances. First, we can plot the data to see if we can visually identify any trends:

plot(blocks$P_VACANT/100, blocks$P_WHITE/100)

plot(blocks$P_VACANT/100, blocks$P_BLACK/100)

The scatter plots suggest that there may be a negative relationship between the pro-

portion of white people in a census block and the proportion of vacant properties

and that there may be a positive association with the proportion of black people.

It is difficult to be confident in these statements, but they can be examined more

formally by using simple regression models as estimated by the lm function and

then plotting the coefficient estimates or slopes.

# assign some variables

p.vac <- blocks$P_VACANT/100

p.w <- blocks$P_WHITE/100

p.b <- blocks$P_BLACK/100

# bind these together

df <- data.frame(p.vac, p.w, p.b)

# fit regressions

mod.1 <- lm(p.vac ~ p.w, data = df)

mod.2 <- lm(p.vac ~ p.b, data = df)


The function lm is used in R to fit regression models (lm stands for ‘linear

model’). The models to be fitted are specified in a special notation in R.

Effectively a model description is an R variable of its own. Although we do

not go into detail about the modelling language in this book, more can be

found in, for example, de Vries and Meys (2012: Chapter 15); for now, it is

sufficient to know that the R notation y ~ x suggests the basic regression

model y = ax + b. The notation is sufficiently rich to allow the specification of

a very broad set of linear models.
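A few common patterns of the formula notation are sketched below with a small made-up data frame (df0, y, x1 and x2 are invented names used only for this illustration):

# illustrative data, made up for this sketch
df0 <- data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20))
lm(y ~ x1, data = df0)      # simple regression: y = a*x1 + b
lm(y ~ x1 + x2, data = df0) # two predictors
lm(y ~ x1 - 1, data = df0)  # drop the intercept
lm(y ~ x1 * x2, data = df0) # main effects plus their interaction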


The two models above can be interpreted as follows: mod.1 describes the

extent to which changes in p.vac are associated with changes in p.w; mod.2

describes the extent to which changes in p.vac are associated with changes in

p.b. The coefficients can be inspected, and it is evident that the proportion of

white people is a weak negative predictor of the proportion of vacant proper-

ties in a census block and that the proportion of black people is a weak positive

predictor. Specifically, the model suggests relationships that indicate that the

amount of vacant properties in a census block decreases by 1% for each 3.5%

increase in the proportion of white people and that it increases by 1% for

each 3.7% increase in the proportion of black people in the census block.

However, the model fits are poor (examine the R-squared values), and when a

multivariate analysis model is computed neither are found to be significant

predictors of vacant properties. The models can be examined using the

summary command:

summary(mod.1)

Call:

lm(formula = p.vac ~ p.w, data = df)

Residuals:
     Min       1Q   Median       3Q      Max
-0.11747 -0.03729 -0.01199  0.01714  0.28271

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.11747    0.01092  10.755   <2e-16 ***
p.w         -0.03548    0.01723  -2.059   0.0415 *
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06195 on 127 degrees of freedom

Multiple R-squared: 0.03231, Adjusted R-squared: 0.02469

F-statistic: 4.24 on 1 and 127 DF, p-value: 0.04152

# not run below

# summary(mod.2)

# summary(lm(p.vac ~ p.w + p.b, data = df))

The trends can be plotted with the data as in Figure 3.23.

p1 <- ggplot(df,aes(p.vac, p.w))+

#stat_summary(fun.data=mean_cl_normal) +

geom_smooth(method='lm') +

geom_point() +

xlab("Proportion of Vacant Properties") +

ylab("Proporion White") +

labs(title="Regression of Vacant Properties against Proportion White")

p2 <- ggplot(df,aes(p.vac, p.b))+


#stat_summary(fun.data=mean_cl_normal) +

geom_smooth(method='lm') + geom_point() +

xlab("Proportion of Vacant Properties") +

ylab("Proporion Black") +

labs(title="Regression of Vacant Properties against Proportion Black")

grid.newpage()

# set up the layout

pushViewport(viewport(layout=grid.layout(2,1)))

# plot using the print command

print(p1, vp=viewport(layout.pos.row = 1, height = 5))

print(p2, vp=viewport(layout.pos.row = 2, height = 5))

Figure 3.23 Plotting regression coefficient slopes

3.6.3 Mosaic Plots

For data where there is some kind of true or false statement, mosaic plots can be used

to generate a powerful visualisation of the statistical properties and relationships

between variables. What they seek to do is to compare crosstabulations of counts


(hence the need for true or false statements) against a model where proportionally

equal counts are expected, in this case of vacant housing across ethnic groups.

First install the ggmosaic package:

# install the package

install.packages("ggmosaic", dep = T)

Then prepare the data using the melt function from the reshape2 package:

# create the dataset

pops <- data.frame(blocks[,14:18]) * data.frame(blocks)[,11]

pops <- as.matrix(pops/100)

colnames(pops) <- c("White", "Black", "Ameri", "Asian", "Other")

# a true / false for vacant properties

vac.10 <- (blocks$P_VACANT > 10)

# create a crosstabulation

mat.tab <- xtabs(pops ~vac.10)

# melt the data

df <- melt(mat.tab)

Finally, create the mosaic plot, as in Figure 3.24, using the stat_mosaic function

in the ggmosaic extension to the ggplot2 package.

# load the packages

library(ggmosaic)

# call ggplot and stat_mosaic

ggplot(data = df) +

stat_mosaic(aes(weight = value, x = product(Var2),

fill=factor(vac.10)), na.rm=TRUE) +

theme(axis.text.x=element_text(angle=-90, hjust=.1)) +

labs(y='Proportion of Vacant Properties', x = 'Ethnic group',

title="Mosaic Plot of Vacant Properties with ethnicity") +

guides(fill=guide_legend(title = "> 10 percent", reverse = TRUE))

It has the usual ggplot feel. It shows that the census blocks with vacancy levels

higher than 10% are not evenly distributed among different ethnic groups: the tiles

in the mosaic plot have areas proportional to the counts (in this case the number of

people affected).

However, the stat_mosaic plot does not quite have information about

residuals and whether differences are significant, as does the mosaicplot func-

tion in the graphics package. The latter can be used with the code below to create

Figure 3.25:

# standard mosaic plot

ttext = sprintf("Mosaic Plot of Vacant Properties

with ethnicity")

mosaicplot(t(mat.tab), xlab='',

ylab='Vacant Properties > 10 percent',

main=ttext,shade=TRUE,las=3,cex=0.8)


Figure 3.24 An example of a ggmosaic mosaic plot

Figure 3.25 An example of a standard graphics mosaic plot with residuals

Figure 3.25 contains much more information. Its shading shows which groups

are under- or overrepresented, when compared against a model of expected


equality. The blue tiles show combinations of property vacancy and ethnicity that

are higher than would be expected, with the tiles shaded deep blue corresponding

to combinations whose residuals are greater than +4, when compared to the

model, indicating a much greater frequency in those cells than would be found if

the model of equality were true. The tiles shaded deep red correspond to the

residuals less than –4, indicating much lower frequencies than would be expected.

Thus the white ethnic group is significantly more strongly associated with areas

where vacant properties make up less than 10%, and the other ethnic groups are

significantly more strongly associated with areas where vacant properties make

up more than 10%, than would be expected in a model of equal distribution.
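If you want to see the residuals behind this shading as numbers, one option (a sketch, not part of the original workflow) is a chi-squared test of independence on the same crosstabulation; the stdres component holds the standardised residuals:

# standardised residuals underlying the mosaic plot shading
chisq.test(t(mat.tab))$stdres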

3.7 SELF-TEST QUESTIONS

This chapter has introduced a number of commands and functions for mapping

spatial data and visualising spatial data attributes. The questions in this section

present a series of tasks for you to complete that build on the methods illustrated

in the preceding sections. The answers at the end of the chapter present snippets

of code that will complete the tasks, but, as ever, you may find that your code

differs from the answers provided. This is to be expected and is not something

that should concern you as there are usually many ways to achieve the same

objectives.

The tasks seek to extend the mapping skills that you have acquired through this

chapter (as a reminder, the expectation is that you run the code embedded in the

text throughout the book) and in places greater detail and explanation of the spe-

cific techniques are given. Four general areas are covered:

● Plots and maps: working with map data

● Misrepresentation of continuous variables: using different cut functions

for choropleth mapping

● Selecting data: creating variables and subsetting data using logical

statements

● Re-projections: transforming data using spTransform

Self-Test Question 1. Plots and maps: working with map data

Your task is to write code that will produce a map of the counties in Georgia,

shaded in a colour scheme of your choice but using 10 classes describing the distri-

bution of median income in thousands of dollars (this is described by the MedInc

attribute in the data frame). The maps should include a scale bar and a legend, and

the code should write the map to a TIFF file, with a resolution of 300 dots per inch

and a map size of 7 × 7 inches.


# Hints

display.brewer.all() # to show the Brewer palettes

breaks # to specify class breaks OR

style # in the tm_fill / tm_polygons help

# Tools

library(tmap) # for the mapping tools

data(georgia) # the Georgia data in the GISTools package

st_as_sf(georgia) # to convert the data to sf format

tm_layout # takes many parameters, e.g. legend.position

Self-Test Question 2. Misrepresentation of continuous variables: using different

breaks for choropleth mapping

It is well known that it is very easy to lie with maps (see Monmonier, 1996). One of

the very commonly used tricks for misrepresenting the spatial distribution of phe-

nomena relates to the inappropriate categorisation of continuous variables. Your

aim in this exercise is to produce three maps that represent the same feature, and

in so doing you will investigate the impact of different functions for grouping the

continuous variable in the choropleth maps.

Write code that will create three maps, in the same window, of the numbers of

houses in the New Haven census blocks. This is described by the HSE_UNITS

variable. Apply different functions to divide the HSE_UNITS variable in the

blocks dataset into five classes in different ways based on quantiles, absolute

ranges, and standard deviations. You need not add legends, scale bars, etc., but

should include map titles.

# Hints

p1 <- tm_shape(...) # assign the plots to a variable

pushViewport # from the grid package, used earlier...

viewport # ...to plot multiple tmaps

?quantileCuts # quantiles, ranges std.dev...

?rangeCuts # ... from GISTools package

?sdCuts

breaks # to specify breaks in tm_polygon

tmap_mode('plot') # to specify a map view

# Tools

library(tmap) # for the mapping tools

library(grid) # for plotting the maps together

data(newhaven) # to load the New Haven data

Self-Test Question 3. Selecting data: creating variables and subsetting data using

logical statements

In the previous sections on mapping polygon attributes and mapping lines,

different methods for selecting or subsetting the spatial data were introduced.

These applied an overlay of spatial data using st_intersection in the sf
package to select roads within the extent of an sf polygon object, and logical

operators were used to select earthquake locations that satisfied specific criteria.


Additionally, logical operators were introduced in the previous chapter. When

applied to a variable they return true or false statements or more correctly logical

data types. In this exercise, the objective is to create a secondary attribute and

then to use a logical statement to select data objects when applied to the attribute

you create.

A company wishes to market a product to the population in rural areas. The

company has a model that says that they will sell one unit of their product for

every 20 people in rural areas who are visited by one of their sales team, and

they would like to know which counties have a rural population density of

more than 20 people per square kilometre. Using the Georgia data, you should

develop some code that selects counties based on a rural population density

measure. You will need to calculate for each county some kind of rural popula-

tion density score and map the counties in Georgia that have a score of greater

than 20 rural people per square kilometre.

# Hints

library(GISTools) # for the mapping tools

data(georgia) # use georgia2 as it is in a projected coordinate system (metres)

help("!") # to examine logic operators

as.numeric # use to coerce new attributes you create to numeric format

# e.g. georgia.sf$NewVariable <- as.numeric(1:159)

# Tools

st_area # a function in the sf package

Self-Test Question 4. Re-projections: transforming data using spTransform and

st_transform

Spatial data come with projections, which define an underlying geodetic model over

which the spatial data are projected. Different spatial datasets need to be aligned

over the same projection for the spatial features they describe to be compared and

analysed together. National grid projections typically represent the world as a flat

surface and allow distance and area calculations to be made, which cannot be so

easily done using models that use degrees and minutes. World geodetic systems

such as WGS84 provide a standard reference system. For example, in the previous

question you worked with the georgia2 dataset which is projected in metres,

whereas georgia is projected in degrees in WGS84. And, when you plotted the

Georgia subset with an OpenStreetMap backdrop, a transform operation was used

to convert the data to the projection used in OpenStreetMap plotting. A range of

different projections are described in formats for different packages and software

on the Spatial Reference website (http://www.spatialreference.org). A

typical re-projection would be something like:

# Using spTransform in sp

new.spatial.data <- spTransform(old.spatial.data, new.Projection)

# Using st_transform in sf

new.spatial.data.sf <- st_transform(old.spatial.data.sf, new.Projection)


You should note that the data need to have a projection in order to be transformed.

Projections can be assigned if you know what the projection is. Recall the code from

earlier in this chapter using the Fiji earthquake data which assigned a projection to

the coordinates:

library(GISTools)

library(rgdal)

library(sf)

data(quakes)

coords.tmp <- cbind(quakes$long, quakes$lat)

# create the SpatialPointsDataFrame

quakes.sp <- SpatialPointsDataFrame(coords.tmp,

data = data.frame(quakes),

proj4string = CRS("+proj=longlat "))

You can examine the projection properties of the SpatialPointsDataFrame

and sf objects after the latter is created, by entering:

summary(quakes.sp)

quakes_sf <- st_as_sf(quakes.sp)

head(quakes_sf)

If the proj4string properties of sp and sf objects are empty, these can be popu-

lated if you know the spatial reference system and then the data can be transformed.
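A minimal sketch of that pattern in sf, using the quakes table again and Web Mercator (EPSG code 3857) purely as an illustrative target system, is:

library(sf)
# build the points with no CRS, then declare the known CRS afterwards
q_sf <- st_as_sf(data.frame(quakes), coords = c("long", "lat"))
st_crs(q_sf) <- 4326                # the coordinates are WGS84 longitude/latitude
q_merc <- st_transform(q_sf, 3857)  # re-project now that a CRS is defined
head(q_merc)

The equivalent sp pattern is to assign a CRS with proj4string(obj) <- CRS(...) and then call spTransform.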

The objective of this exercise is to re-project the New Haven blocks and

breach datasets from their original reference system to WGS84, using both the

st_transform function in sf and the spTransform function in rgdal, and

then to plot these transformed data on an OpenStreetMap backdrop. You may find

it useful to use a transparency term in your colours.

These datasets have a local projections system, using the State Plane Coordinate

System for Connecticut, in US survey feet. You should transform the breaches of

the peace and the census blocks data to latitude and longitude by assigning a pro-

jection using the CRS function in the sp package and st_crs function in the sf

package. Then the spTransform and st_transform functions can be applied.

Having transformed the datasets, you should map the locations of the breaches of

peace and the census blocks with an OpenStreetMap backdrop. You could use the

OpenStreetMap tools directly and/or the Leaflet embedded in the tmap tools

when tmap_mode is set to 'view'.

3.8 ANSWERS TO SELF-TEST QUESTIONS

Q1: Plots and maps: working with map data. Your map should look something like

Figure 3.26.

# load the data and the packages

library(GISTools)

library(sf)

library(tmap)

data(georgia)


# set the tmap plot type

tmap_mode('plot')

# convert to sf format

georgia_sf = st_as_sf(georgia)

# create the variable

georgia_sf$MedInc = georgia_sf$MedInc / 1000

# open the tiff file and give it a name

tiff("my_map.tiff")

# start the tmap commands

tm_shape(georgia_sf) +

tm_polygons("MedInc", title = "Median Income", palette = "GnBu",

style = "equal", n = 10) +

tm_layout(legend.title.size = 1,

legend.format = list(digits = 0),

legend.position = c(0.2, "top")) +

tm_legend(legend.outside=TRUE)

# close the tiff file

dev.off()

Figure 3.26 The map produced by the code for Q1


Q2: Misrepresentation of continuous variables – using different breaks for chorop-

leth mapping. Your map should look something like Figure 3.27.

# load packages and data

library(tmap)

library(GISTools)

library(sf)

library(grid)

data(newhaven)

# convert data to sf format

blocks_sf = st_as_sf(blocks)

# 1. Initial Investigation

# You could start by having a look at the data

attach(data.frame(blocks_sf))

hist(HSE_UNITS, breaks = 20)

# You should notice that it has a normal distribution

# but with some large outliers

# Then examine different cut schemes

quantileCuts(HSE_UNITS, 6)

rangeCuts(HSE_UNITS, 6)

sdCuts(HSE_UNITS, 6)

# detach the data frame

detach(data.frame(blocks_sf))

# 2. Do the task

# a) mapping classes defined by quantiles

# define some breaks

br <- c(0, round(quantileCuts(blocks_sf$HSE_UNITS, 6),0))

# you could examine br

p1 <- tm_shape(blocks_sf) +

tm_polygons("HSE_UNITS", title="Quantiles",

palette="Reds",

breaks=br)

# b) mapping classes defined by absolute ranges

# define some breaks

br <- c(0, round(rangeCuts(blocks$HSE_UNITS, 6),0))

# you could examine br

p2 <- tm_shape(blocks_sf) +

tm_polygons("HSE_UNITS", title="Ranges",

palette="Reds",

breaks=br)

# c) mapping classes defined by standard deviations

br <- c(0, round(sdCuts(blocks$HSE_UNITS, 6),0))

# you could examine br

p3 <- tm_shape(blocks_sf) +

tm_polygons("HSE_UNITS", title="Std Dev",

palette="Reds",

breaks=br)

# open a new plot page

grid.newpage()

# set up the layout

pushViewport(viewport(layout=grid.layout(1,3)))


# plot using the print command

print(p1, vp=viewport(layout.pos.col = 1, height = 5))

print(p2, vp=viewport(layout.pos.col = 2, height = 5))

print(p3, vp=viewport(layout.pos.col = 3, height = 5))

Figure 3.27 The map produced by the code for Q2

Q3: Selecting data: creating variables and subsetting data using logical statements.

The code is below and your map should look something like Figure 3.28.

library(GISTools)

library(sf)

data(georgia)

# convert data to sf format

georgia_sf = st_as_sf(georgia2)

# calculate rural population

georgia_sf$rur.pop <- as.numeric(georgia_sf$PctRural

* georgia_sf$TotPop90 / 100)

# calculate county areas in km^2

georgia_sf$areas <- as.numeric(st_area(georgia_sf)

/ (1000*1000))

# calculate rural density

georgia_sf$rur.pop.den <- as.numeric(georgia_sf$rur.pop

/ georgia_sf$areas)

# select counties with density > 20

georgia_sf$rur.pop.den <- (georgia_sf$rur.pop.den > 20)

# map them

tm_shape(georgia_sf) +

tm_polygons("rur.pop.den",

palette=c("chartreuse4","darkgoldenrod3"),

title=expression("Pop >20 (per " ∗ km^2 ∗ ")"),

auto.palette.mapping = F)

Q4: Transforming data. Your map should look something like Figure 3.29 or Figure

3.30, depending on which way you did it! First you will need to transform the data:


library(GISTools) # for the mapping tools

library(sf) # for the mapping tools

library(rgdal) # this has the spatial reference tools

library(tmap)

library(OpenStreetMap)

data(newhaven)

# Define a new projection

newProj <- CRS("+proj=longlat +ellps=WGS84")

# Transform blocks and breach

# 1. using spTransform

breach2 <- spTransform(breach, newProj)

blocks2 <- spTransform(blocks, newProj)

# 2. using st_transform

breach_sf <- st_as_sf(breach)

blocks_sf <- st_as_sf(blocks)

breach_sf <- st_transform(breach_sf, "+proj=longlat +ellps=WGS84")

blocks_sf <- st_transform(blocks_sf, "+proj=longlat +ellps=WGS84")

Figure 3.28 The map produced by the code for Q3


Then the transformed data can be mapped using Leaflet in the tmap package:

# set the mode

tmap_mode('view')

# plot the blocks

tm_shape(blocks_sf) +

tm_borders() +

# and then plot the breaches

tm_shape(breach_sf) +

tm_dots(shape=1, size=0.1, border.col = NULL, col = "red", alpha = 0.5)

It can also be mapped using the OpenStreetMap package. For this you need to

extract the map tiles using the bounding box of the transformed data:

ul <- as.vector(cbind(bbox(blocks2)[2,2],

bbox(blocks2)[1,1]))

lr <- as.vector(cbind(bbox(blocks2)[2,1],

bbox(blocks2)[1,2]))

# download the map tile

MyMap <- openmap(ul,lr)

Figure 3.29 The tmap map produced by the code for Q4


# now plot the layer and the backdrop

par(mar = c(0,0,0,0))

plot(MyMap, removeMargin=FALSE)

# notice how the data need to be transformed

# to the internal OpenStreetMap projection

plot(spTransform(blocks2, osm()), add = TRUE, lwd = 1)

plot(spTransform(breach2, osm()), add = T, pch = 19, col = "#DE2D2650")

Figure 3.30 The OpenStreetMap map produced by the code for Q4


REFERENCES

Anselin, L. (1995) Local indicators of spatial association – Lisa. Geographical

Analysis, 27(2): 93–115.

Brunsdon, C. and Chen, H. (2014) GISTools: Some further GIS capabilities for R. R

Package Version 0.7-4. http://cran.r-project.org/package=GISTools.

de Vries, A. and Meys, J. (2012) R for Dummies. Chichester: John Wiley & Sons.

Hijmans, R.J. and van Etten, J. (2014) Raster: Geographic data analysis and

modeling. R Package Version 2.6-7. http://cran.r-project.org/package=raster.

Monmonier, M. (1996) How to Lie with Maps, 2nd edition. Chicago: University of

Chicago Press.

Ord, J.K. and Getis, A. (1995) Local spatial autocorrelation statistics: Distributional

issues and an application. Geographical Analysis, 27(4): 286–306.

Pebesma, E., Bivand, R., Cook, I., Keitt, T., Sumner, M., Lovelace, R., Wickham, H.,

Ooms, J. and Racine, E. (2016) sf: Simple features for R. R Package Version 0.6-3.

http://cran.r-project.org/package=sf.

Tennekes, M. (2015) tmap: Thematic maps. R Package Version 1. http://cran.r-project.

org/package=tmap.

4 SCRIPTING AND WRITING FUNCTIONS IN R

4.1 OVERVIEW

As you have been working through the code and exercises in this book you have

applied a number of different tools and techniques for extracting, displaying and

analysing data. In places you have used some quite advanced snippets of code.

However, this has all been done in a step-by-step manner, with each line of code

being run individually, and the occasional function has been applied individu-

ally to a specific dataset or attribute. Quite often in spatial analysis, we would like

to do the same thing repeatedly, but adjusting some of the parameters on each

iteration – for example, applying the same algorithm to different data, different

attributes, or using different thresholds. The aim of this chapter is to introduce

some basic programming principles and routines that will allow you to do many

things repeatedly in a single block of code. This is the basics of writing computer

programs. This chapter will:

● Describe how to combine commands into loops

● Describe how to control loops using if, else, repeat, etc.

● Describe logical operators to index and control

● Describe how to create functions, test them and to make them universal

● Explain how to automate short tasks in R

● Introduce the apply family of operations and how they can be used to

apply functions to different data structures

● Introduce dplyr functions for data table manipulations and operations


4.2 INTRODUCTION

In spatial data analysis and mapping, we frequently want to apply the same set

of commands over and over again, to cycle through data or lists of data and do

things to data depending on whether some condition is met or not, and so on.

These types of repeated actions are supported by functions, loops and conditional

statements. A few simple examples serve to illustrate how R programming com-

bines these ideas through functions with conditional commands, loops and

variables.

For example, consider the following variable tree.heights:

tree.heights <- c(4.3,7.1,6.3,5.2,3.2)

We may wish to print out the first element of this variable if it has a value less than

6: this is a conditional command as the operation (in this case to print something) is

carried out conditionally (i.e. if the condition is met).

tree.heights

[1] 4.3 7.1 6.3 5.2 3.2

if (tree.heights[1] < 6) { cat('Tree is small\n') } else

{ cat('Tree is large\n')}

Tree is small

Alternatively, we may wish to examine all of the elements in the variable

tree.heights and, depending on whether each individual value meets the

condition, perform the same operation. We can carry out operations repeatedly

using a loop structure as follows. Notice the construction of the for loop in

the form:

for(variable in sequence) R expression

This is illustrated in the code below:

for (i in 1:3) {

if (tree.heights[i] < 6) { cat('Tree',i,' is small\n') }

else { cat('Tree',i,' is large\n')} }

Tree 1 is small

Tree 2 is large

Tree 3 is large

A third situation is where we wish to perform the same set of operations, group of

conditional or looped commands over and over again, perhaps to different data.

We can do this by grouping code and defining our own functions.


assess.tree.height <- function(tree.list, thresh)

{ for (i in 1:length(tree.list))

{ if(tree.list[i] < thresh) {cat('Tree',i, ' is small\n')}

else { cat('Tree',i,' is large\n')}

}

}

assess.tree.height(tree.heights, 6)

Tree 1 is small

Tree 2 is large

Tree 3 is large

Tree 4 is small

Tree 5 is small

tree.heights2 <- c(8,4.5,6.7,2,4)

assess.tree.height(tree.heights2, 4.5)

Tree 1 is large

Tree 2 is large

Tree 3 is large

Tree 4 is small

Tree 5 is small

Notice how the code in the function assess.tree.height above modifies the

original loop: rather than for(i in 1:3) it now uses the length of the variable

1:length(tree.list) to determine how many times to loop through the data.

Also a variable thresh was used for whatever threshold the user wishes to specify.

The sections in this chapter develop more detailed ideas around functions,

loops and conditional statements and the testing and debugging of functions in

order to support automated analyses in R.

4.3 BUILDING BLOCKS FOR PROGRAMS

In the examples above, a number of programming concepts were introduced.

Before we start to develop these more formally into functions it is important to

explain these ingredients in a bit more detail.

4.3.1 Conditional Statements

Conditional statements test to see whether some condition is TRUE or FALSE, and

if the answer is TRUE some specific actions are undertaken. Conditional statements

are composed of if and else.

The if statement is followed by a condition, an expression that is evaluated,

and then a consequent to be executed if the condition is TRUE. The format of an if

statement is:

if – condition – consequent

Actually this could be read as ‘if the condition is true then the consequent is…’. The

components of a conditional statement are:


● the condition, an R expression that is either TRUE or FALSE

● the consequent, any valid R statement which is only executed if the

condition is TRUE

For example, consider the simple case below where the value of x is changed and

the same condition is applied. The results are different because of the different

values assigned to x: in the first case a statement is printed to the console, in the

second it is not.

x <- -7

if (x < 0) cat("x is negative")

x is negative

x <- 8

if (x < 0) cat("x is negative")

Frequently if statements also have an alternative consequent that is executed when

the condition is FALSE. Thus the format of the conditional statement is expanded to:

if – condition – consequent – else – alternative

Again, this could be read as ‘if the condition is true then do the consequent; or, if

the condition is not true then do the alternative’. The components of a conditional

statement that includes an alternative are:

● the condition, an R expression that is either TRUE or FALSE

● the consequent and alternative, which can be any valid R statements

● the consequent is executed if the condition is TRUE

● the alternative is executed if the condition is FALSE

The example is expanded below to accommodate the alternative:

x <- -7

if (x < 0) cat("x is negative") else cat("x is positive")

x is negative

x <- 8

if (x < 0) cat("x is negative") else cat("x is positive")

x is positive

The condition statement is composed of one or more logical operators and in R

these are defined in Table 4.1. In addition, R contains a number of logical func-

tions which can also be used to evaluate conditions. A sample of these is listed in

Table 4.2 but many others exist.


Table 4.1 Logical operators

Logical operator Description

== Equal

!= Not equal

> Greater than

< Less than

>= Greater than or equal

<= Less than or equal

! Not (goes in front of other expressions)

& And (combines expressions)

| Or (combines expressions)

Table 4.2 Logical functions

Logical function Description

any(x) TRUE if any in a vector of conditions x is true

all(x) TRUE if all of a vector of conditions x is true

is.numeric(x) TRUE if x contains a numeric value

is.logical(x) TRUE if x contains a true or false value

is.character(x) TRUE if x contains a character value

There are quite a few more is-type functions (i.e. logical evaluation functions)

that return TRUE or FALSE statements that can be used to develop conditional

tests. To explore these enter:

??is.
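A few quick examples of how these evaluation functions behave:

is.numeric(3.4)            # TRUE
is.numeric("3.4")          # FALSE - a character string, not a number
is.character("hello")      # TRUE
is.logical(c(TRUE, FALSE)) # TRUE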

The examples below illustrate how the logical tests all and any may be incorpo-

rated into conditional statements:

x <- c(1,3,6,8,9,5)

if (all(x > 0)) cat("All numbers are positive")

All numbers are positive

x <- c(1,3,6,-8,9,5)

if (any(x > 0)) cat("Some numbers are positive")

Some numbers are positive

any(x==0)

[1] FALSE

4.3.2 Code Blocks

Frequently we wish to execute a group of consequent statements together if, for

example, some condition is TRUE. Groups of statements are called code blocks and


in R are contained by { and }. The examples below show how code blocks can be

used if a condition is TRUE to execute consequent statements and can be expanded

to execute alternative statements if the condition is FALSE.

x <- c(1,3,6,8,9,5)

if (all(x > 0)) {

cat("All numbers are positive\n")

total <- sum(x)

cat("Their sum is ",total) }

All numbers are positive

Their sum is 32

The curly brackets are used to group the consequent statements: that is, they con-

tain all of the actions to be performed if the condition is met (i.e. is TRUE) and all of

the alternative actions if the condition is not met (i.e. is FALSE):

if condition { consequents } else { alternatives }

These are illustrated in the code below:

x <- c(1,3,6,8,9,-5)

if (all(x > 0)) {

cat("All numbers are positive\n")

total <- sum(x)

cat("Their sum is ",total) } else {

cat("Not all numbers are positive\n")

cat("This is probably an error as numbers are rainfall levels") }

Not all numbers are positive

This is probably an error as numbers are rainfall levels

4.3.3 Functions

The introductory section above included a function called assess.tree.

height. The format of a function is:

function name <- function(argument list) { R expression }

The R expression is usually a code block and in R the code is contained by curly

brackets or braces: { and }. Wrapping the code into a function allows it to be used

without having to retype the code each time you wish to use it. Instead, once the

function has been defined and compiled, it can be called repeatedly and with dif-

ferent arguments or parameters. Notice in the function below that there are a num-

ber of sets of containing brackets { } that are variously related to the condition, the

consequent and the alternative.


mean.rainfall <- function(rf)

{ if (all(rf > 0)) #open Function

{ mean.value <- mean(rf) #open Consequent

cat("The mean is ",mean.value)

} else #close Consequent

{ cat("Warning: Not all values are positive\n") #open Alternative

} #close Alternative

} #close Function

mean.rainfall( c(8.5,9.3,6.5,9.3,9.4))

The mean is 8.6

More commonly, functions are defined that do something to the input specified in

the argument list and return the result, either to a variable or to the console window,

rather than just printing something out. This is done using return() within the

function. Its format is return(R expression). Essentially what this does if it

is used in a function is to make R expression the value of the function. In the

following code the mean.rainfall2 function now returns the mean of the data

passed to it, and this is assigned to another variable:

mean.rainfall2 <- function(rf) {

if ( all(rf > 0)) {

return( mean(rf))} else {

return(NA)}

}

mr <- mean.rainfall2(c(8.5,9.3,6.5,9.3,9.4))

mr

[1] 8.6


Notice that the code blocks used in the functions contained within the curly

brackets or braces { and } are indented. There are a number of commonly

accepted protocols for doing this but no unique one. The aim is to make the

code and the nesting of sub-clauses indicated by { and } clear. In the code

for mean.rainfall above, { is used before the first line of the code block,

whereas for mean.rainfall2 the { is positioned immediately after the

function declaration.

It is possible to declare variables inside functions, and you should note that

these are distinct from external variables with the same name. Consider the

internal variable rf in the mean.rainfall2 function above. Because this is a

variable that is internal to the function, it only exists within the function and will

not alter any external variable of the same name. This is illustrated in the code

below.


rf <- "Tuesday"

mean.rainfall2(c(8.5,9.3,6.5,9.3,9.4))

[1] 8.6

rf

[1] "Tuesday"

4.3.4 Loops and Repetition

Very often, we would like to run a code block a certain number of times, for exam-

ple for each record in a data frame or a spatial data frame. This is done using for

loops. The format of a loop is:

for( 'loop variable' in 'list of values' ) do R expression

Again, typically code blocks are used, as in the following example of a for loop:

for (i in 1:5) {

i.cubed <- i * i * i

cat("The cube of",i,"is ",i.cubed,"\n")}

The cube of 1 is 1

The cube of 2 is 8

The cube of 3 is 27

The cube of 4 is 64

The cube of 5 is 125

When working with a data frame and other tabular-like data structures, it is com-

mon to want to perform a series of R expressions on each row, on each column or on

each data element. In a for loop the list of values can be a simple sequence

of 1 to n (1:n), where n is related to the number of rows or columns in a dataset or

the length of the input variable as in the assess.tree.height function above.

However, there are many other situations when a different list of values

is required. The function seq is a very useful helper function that generates num-

ber sequences. It has the following formats:

seq(from, to, by = step value)

or

seq(from, to, length = sequence length)

In the example below, it is used to generate a sequence of 0 to 1 in steps of 0.25:

for (val in seq(0,1,by=0.25)) {

val.squared <- val * val

cat("The square of",val,"is ",val.squared,"\n")}

The square of 0 is 0

The square of 0.25 is 0.0625

The square of 0.5 is 0.25

The square of 0.75 is 0.5625

The square of 1 is 1


Conditional loops are very useful when you wish to run a code block until a certain

condition


is met. In R these can be specified using the repeat and break func-

tions. Here is an example:

i <- 1; n <- 654

repeat{

i.squared <- i * i

if (i.squared > n) break

i <- i + 1}

cat("The first square number exceeding",n, "is ",i.squared,"\n")

The first square number exceeding 654 is 676

Finally, it is possible to include loops in functions as in the following example with

a conditional loop:

first.bigger.square <- function(n) {

i <- 1

repeat{

i.squared <- i * i

if (i.squared > n) break

i <- i + 1 }

return(i.squared)}

first.bigger.square(76987)

[1] 77284

4.3.5 Debugging

As you develop your code and compile it into functions, especially initially, you

will probably encounter a few teething problems: hardly any function of reason-

able size works first time! There are two general kinds of problem:

● The function crashes (i.e. it throws up an error)

● The function does not crash, but returns the wrong answer

Usually the second kind of error is the worst. Debugging is the process of finding

the problems in the function. A typical approach to debugging is to ‘step’ through

the function line by line and in so doing find out where a crash occurs, if one does.

You should then check the values of variables to see if they have the values they

are supposed to. R has tools to help with this.

To debug a function:

● Enter debug(function name)

● Then call the function


For example, enter:

debug(mean.rainfall2)

Then just use the function you are trying to debug and R goes into ‘debug mode’:

mean.rainfall2(c(8.5,9.3,6.5,9.3,9.4))

[1] 8.6

You will notice that the prompt becomes Browse[2]> and the line of the function

about to be executed is listed. You should note a number of features associated with

debug:

● Pressing Return executes the current line, and debug moves to the next line

● Typing in a variable lists the value of that variable

● R can ‘see’ variables that are specific to the function

● Typing in any other command executes that command

Entering c runs the code to the end of the current loop/function/block. Typing in Q

exits the function. To return to normal enter undebug(function name) and

note that if there are no bugs, entering c has the same effect as undebug.

A final comment is that learning to write functions and programming is a bit

like learning to drive: you may pass the test, but you will become a good driver by

spending time behind the wheel. Similarly, the best way to learn to write functions

is to practise, and the more you practise the better you will get at programming.

You should try to set yourself various function writing tasks and examine the func-

tions that are introduced throughout this book. Most of the commands that you

use in R are functions that can themselves be examined: entering them without any

brackets afterwards will reveal the blocks of code they use. Have a look at the

ifelse function by entering at the R prompt:

ifelse

This allows you to examine the code blocks, the control, etc., in existing functions.

4.4 WRITING FUNCTIONS

4.4.1 Introduction

In this section you will gain some initial experience in writing functions that can

be used in R, using a number of coding illustrations. You should enter the code


blocks for these, compile them and then run them with some data to build up your

experience. Unless you already have experience in writing code, this will be your

first experience of programming. This section contains a series of specific tasks for

you to complete in the form of self-test questions. The answers to the questions are

provided in the final section of the chapter.

In the preceding section, the basic idea of writing functions was described. You

can write functions directly by entering them at the R command line:

cube.root <- function(x) {

result <- x ^ (1/3)

return(result)}

cube.root(27)

[1] 3

Note that ^ means ‘raise to the power’, and recall that a number to the power of

one-third is its cube root. The cube root of 27 is 3, since 27 = 3 × 3 × 3, hence the

answer printed out for cube.root(27). However, entering functions from the

command line is not always very convenient:

● If you make a typing error in an early line of the definition, it is not

possible to go back and correct it

● You would have to type in the definition every time you used R

A more sensible approach is to type the function definition into a text file. If you

write this definition into a file – calling it, say, functions.R – then you can load

this file when you run R, without having to type in the whole definition. Assuming

you have set R to work in the directory where you have saved this file, just enter:

source("functions.R")

This has the same effect of entering the entire function at the command line. In

fact any R commands in a file (not just function definitions) will be executed when

the source function is used. Also, because the function definition is edited in a

file, it is always possible to return to any typing errors and correct them – and if a

function contains an error, it is easy to correct this and just redefine the function by

re-entering the command above. Using an editor for writing and saving R code was

introduced in previous chapters.

Open a new R script or editing window. In it, enter in the code for the program:

cube.root <- function(x) {

result <- x ^ (1/3)

return(result)}

Then use Save As to save the file as functions.R in the directory you are work-

ing in. In R you can now use source as described:


source('functions.R')

cube.root(343)

cube.root(99)

Note that you can type in several function definitions in the same file. For example,

underneath the code for the cube.root function, you should define a function to

compute the area of a circle. Enter:

circle.area <- function(r) {

result <- pi * r ^ 2

return(result)}

If you save the file and enter source('functions.R') again then the function

circle.area will be defined as well as cube.root. Enter:

source('functions.R')

cube.root(343)

circle.area(10)

4.4.2 Data Checking

One issue when writing functions is making sure that the data that have been

given to the function are the right kind. For example, what happens when you try

to compute the cube root of a negative number?

cube.root(-343)

[1] NaN

That probably was not the answer you wanted. NaN stands for ‘not a number’,

and is the value returned when a mathematical expression is numerically indeter-

minate. In this case, this is actually due to a shortcoming with the ^ operator in R,

which only works for positive base values. In fact −7 is a perfectly valid cube root of

−343, since (−7) × (−7) × (−7) = −343. In fact we can state a conditional rule:

● If x ≥ 0: calculate the cube root of x normally

● Otherwise: use cube.root(-x)

That is, for cube roots of negative numbers, work out the cube root of the positive

number, then change it to negative. This can be dealt with in an R function by

using an if statement:

cube.root <- function(x) {

if (x >= 0) {

result <- x ^ (1/3) } else {

result <- -(-x) ^ (1/3) }

return(result)}


Now you should go back to the text editor and modify the code in functions.R

to reflect this. You can do this by modifying the original cube.root function. You

can now save this edited file, and use source to reload the updated function defi-

nition. The function should work with both positive and negative values.

cube.root(3)

[1] 1.44225

cube.root(-3)

[1] -1.44225

Next, try debugging the function – since it is working properly, you will not (hope-

fully!) find any errors, but this will demonstrate the debug facility. Enter:

debug(cube.root)

at the R command line (not in the file editor!). This tells R that you want to run

cube.root in debug mode. Next, enter:

cube.root(-50)

at the R command line and see how repeatedly pressing the return key steps you


through the function. Note particularly what happens at the if statement.

At any stage in the process you can type an R expression to check its value.

When you get to the if statement enter:

x > 0

at the command line and press Return to see whether it is true or false. Checking

the value of expressions at various points when stepping through the code is a good

way of identifying potential bugs or glitches in your code. Try running through the

code for a few other cube root calculations, by replacing −50 above with different

numbers, to get used to using the debugging facility. When you are finished, enter:

undebug(cube.root)

at the R command line. This tells R that you are ready to return cube.root to

running in normal mode. For further details about the debugger, at the command

line enter:

help(debug)

4.4.3 More Data Checking

In the last section, you saw how it was possible to check for negative values in the

cube.root function. However, other things can go wrong. For example, try entering:


cube.root('Leeds')

This will cause an error to occur and to be printed out by R. This is not surprising

because cube roots only make sense for numbers, not character variables. However,

it might be helpful if the cube root function could spot this and print a warning

explaining the problem, rather than just crashing with a fairly obscure error

message such as the one above, as it does at the moment. Again, this can be dealt

with using an if statement. The strategy to handle this is:

● If x is numerical: compute its cube root

● If x is not numerical: print a warning message explaining the problem

Checking whether a variable is numerical can be done using the is.numeric

function:

is.numeric(77)

is.numeric("Lex")

is.numeric("77")

v <- "Two Sevens Clash"

is.numeric(v)

The function could be rewritten to make use of is.numeric in the following

way:

cube.root <- function(x) {

if (is.numeric(x)) {

if (x >= 0) { result <- x^(1/3) }

else { result <- -(-x)^(1/3) }

return(result) }

else {

cat("WARNING: Input must be numerical, not character\n")

return(NA)}

}

Note that here there is an if statement inside another if statement – this is an

example of a nested code block. Note also that when no proper result is defined, it is

possible to return the value NA instead of a number (NA stands for ‘not available’).

Finally, recall that the \n in the cat statement tells R to add a carriage return (new

line) when printing out the warning. Try updating your cube root function in the

editor with this latest definition, and then try using it (in particular with character

variables) and stepping through it using debug.

An alternative way of dealing with cube roots of negative numbers is to use

the R functions sign and abs. The function sign(x) returns a value of 1 if

x is positive, −1 if it is negative, and 0 if it is zero. The function abs(x)

returns the absolute value of x without the sign, so for example abs(-7)


is 7, and abs(5) is 5. This means that you can specify the core statement in

the cube root function without using an if statement to test for negative

values, as:

result <- sign(x)*abs(x)^(1/3)

This will work for both positive and negative values of x.

Self-Test Question 1. Define a new function cube.root.2 that uses this way of

computing cube roots – and also include a test to make sure x is a numerical vari-

able, and print out a warning message if it is not.

4.4.4 Loops Revisited

In this section you will revisit the idea of looping in function definitions. There are

two main kinds of loops in R: deterministic and conditional loops. The former are

executed a fixed number of times, specified at the beginning of the loop. The latter

are executed until a specific condition is met.

4.4.4.1 Conditional Loops

A very old example of a conditional loop is Euclid’s algorithm. This is a method for

finding the greatest common divisor (GCD) of a pair of numbers. The GCD of a pair

of numbers is the largest number that divides exactly (i.e. with remainder zero)

into each number in the pair. The algorithm is set out below:

1. Take a pair of numbers a and b – let the dividend be max(a, b), and the

divisor be min(a, b).

2. Let the remainder be the arithmetic remainder when the dividend is

divided by the divisor.

3. Replace the dividend with the divisor.

4. Replace the divisor with the remainder.

5. If the remainder is not equal to zero, repeat from step 2 to here.

6. Once the remainder is zero, the GCD is the dividend.

Without considering in depth the reasons why this algorithm works, it should be

clear that it makes use of a conditional loop. The test to see whether further looping

is required is in step 5 above. It should also be clear that the divisor, dividend and

remainder are all variables. Given these observations, we can turn Euclid’s algo-

rithm into an R function:


gcd <- function(a,b)

{

divisor <- min(a,b)

dividend <- max(a,b)

repeat

{ remainder <- dividend %% divisor

dividend <- divisor

divisor <- remainder

if (remainder == 0) break

}

return(dividend)

}

The one unfamiliar thing here is the %% symbol. This is just the remainder operator –

the value of x %% y is the remainder when x is divided by y.
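
A quick check at the console confirms this (illustrative only):

17 %% 5
[1] 2
15 %% 3
[1] 0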

Using the editor, create a definition of this function, and read it into R. You can put

the definition into functions.R. Once the function is defined, it may be tested:

gcd(6,15)

gcd(25,75)

gcd(31,33)

Self-Test Question 2. Try to match up the lines in the function definition with the

lines in the description of Euclid’s algorithm. You may also find it useful to step

through an example of gcd in debug mode.

4.4.4.2 Deterministic Loops

As described in earlier sections, the form of a deterministic loop is:

for ( <variable> in <first>:<last> )

{

... code in loop ...

}

where <variable> refers to the looping variable. It is common practice to refer to

<variable> in the code in the loop. <first> and <last> refer to the range of values over

which <variable> loops. For example, a function to print the cube roots of numbers

from 1 to n takes the form:

cube.root.table <- function(n)

{

for (x in 1:n)

{

cat("The cube root of ",x," is", cube.root(x),"\n")

}

}

Self-Test Question 3. Write a function to compute and print out GCD(x,60) for

x in the range 1 to n. When this is done, write another function to compute and


print out GCD(x,y) for x in the range 1 to n1 and y in the range 1 to n2. In this

exercise you will need to nest one deterministic loop inside another one.

Self-Test Question 4. Modify the cube.root.table function so that the loop

variable runs from 0.5 in steps of 0.5 to n. The key to this is provided in the descrip-

tions of loops in the sections above.

4.4.5 Further Activity

You will notice that in the previous example the output is rather messy, with the

cube roots printing to several decimal places – it might look neater if you could

print to a fixed number of decimal places. In the function cube.root.table

replace the cat line with:

cat(sprintf("The cube root of %4.0f is %8.4f \n",x, cube.root(x)))

Then enter help(sprintf) and try to work out what is happening in the code

above.
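
As a hint (a small sketch, not part of the original exercise), sprintf substitutes values into a template string according to format codes such as %f:

sprintf("Pi to three decimal places is %.3f", pi)
[1] "Pi to three decimal places is 3.142"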

Self-Test Question 5. Create a for loop that cycles through each county / row in

the data frame of the georgia2 dataset in the GISTools package and creates

a list of the adjacent counties. The code to do this for a single county, Appling,

is as follows:

library(GISTools)

library(sf)

data(georgia)

# create an empty list for the results

adj.list <- list()

# convert georgia to sf

georgia_sf <- st_as_sf(georgia2)

# extract a single county

county.i <- georgia_sf[1,]

# determine the adjacent counties

# the [-1] removes Appling from its own list

adj.i <- unlist(st_intersects(county.i, georgia_sf))[-1]

# extract their names


adj.names.i <- georgia2$Name[adj.i]

# add to the list

adj.list[[1]] <- adj.i

# name the list elements

names(adj.list[[1]]) <- adj.names.i

This creates a list with a single element, with the names of the counties adjacent

to Appling and an index or reference to their location within the georgia2

dataset.


adj.list

[[1]]

    Bacon Jeff Davis     Pierce   Tattnall     Toombs      Wayne
        3         80        113        132        138        151

Note that once lists are defined as in adj.list in the code above, elements can

be added:

# in sequence

adj.list[[2]] <- sample(1:100, 3)

# or not!

i = 4

adj.list[[i]] <- c("Chris", "and", "Lex")

# have a look!

adj.list

Self-Test Question 6. Take the loop you created in Question 5 and create a function

that returns a list of the indices of adjacent polygons for each polygon in any poly-

gon dataset in sf or sp format. Hint: you will need to do any conversions to sf

and define the list to be returned inside the function.

4.5 SPATIAL DATA STRUCTURES

This section unpicks some of the detail of spatial data structures in R as a precursor

to manipulating and interrogating spatial data with functions. It examines their

coordinate encoding and briefly revisits their attribute/variable structures.

To begin with, you will load the GISTools package and the georgia data.

However, before doing this and running the code below, you need to check that

you are in the correct working directory. You should already be in the habit of

doing this at the start of every R session. Also, if this is not a fresh R session then

you should clear the workspace of any variables and functions you have created.

This can be done by entering:

rm(list = ls())

Then load the GISTools package and the Georgia datasets:

library(GISTools)

data(georgia)

One of the variables is called georgia.polys. There are two ways to confirm

this. One way is to type ls() into R. This function tells R to list out all currently

defined variables:

ls()


The other way of checking that georgia.polys now exists is just to type it into

R and see it printed out.

georgia.polys

What is actually printed out has been excluded here, as it would go on for pages

and pages. However, the content of the variable will now be explained.

georgia.polys is a variable of type list, with 159 items in the list. Each item is a

matrix of k rows and 2 columns. The two columns correspond to x and y coordi-

nates describing a polygon made from k points. Each polygon corresponds to one

of the 159 counties that make up the state of Georgia in the USA. To check this

quickly, enter:

class(georgia.polys)

[1] "list"

head(georgia.polys[[1]])

[,1] [,2]

[1,] 1292287 1075896

[2,] 1292654 1075919

[3,] 1292949 1075590

[4,] 1294045 1075841

[5,] 1294603 1075472

[6,] 1295467 1075621

Each of the list elements, containing the bounding coordinates of each of the coun-

ties in Georgia, can be plotted. Enter the code below to produce Figure 4.1.

Figure 4.1 A simple plot of Appling County and two adjacent counties


# plot Appling

plot(georgia.polys[[1]],asp=1,type='l',

xlab = "Easting", ylab = "Northing")

# plot adjacent county outlines

points(georgia.polys[[3]],asp=1,type='l', col = "red")

points(georgia.polys[[151]],asp=1,type='l', col = "blue", lty = 2)

Notice the use of the plot and points functions as were introduced in

Chapter 2.

Figure 4.1 will not win any prizes for cartography – but it should be

recognisable as Appling County, as featured in earlier chapters. However, it

highlights that spatial data objects in R have coordinates whether defined in

the sp or sf packages. The code below extracts the coordinates for the first

polygon in the georgia2 dataset, a SpatialPolygonsDataFrame object

that has the same coordinates as georgia.polys. These are the same as the

above.

head(georgia2@polygons[[1]]@Polygons[[1]]@coords)

head(georgia2@data[, 13:14])

If georgia2 is converted to sf format the coordinates are also evident:

g <- st_as_sf(georgia2)

head(g[,13:14])

So we can see that both sp and sf objects explicitly hold the spatial attributes and

the thematic and variable attributes of spatial objects.

4.6 apply FUNCTIONS

The final sections of this chapter describe a number of different functions that can

make programming easier by offering a number of different ways of interrogating,

manipulating and summarising spatial data, either by their variable attributes or

by their spatial properties. This section examines the apply family of functions

that come with the base installation of R.

Like other programming languages, R includes a group of functions which

are generally termed apply functions. These can be used to apply the same set

of operations over each element in a data object (row, column, list element).

They take some input data and a function as inputs. Here we will briefly

explore three of the most commonly used apply functions: apply, lapply

and mapply.

Load the newhaven data and examine the blocks object. It contains a number

of variables describing the percentage of different ethnicities living in each census

block:


library(GISTools)

data(newhaven)

## the @data route

head(blocks@data[, 14:17])

## the data frame route

head(data.frame(blocks[, 14:17]))

A basic illustration of apply that returns the percentage value of the largest group

in each block is as follows:

apply(blocks@data[,14:17], 1, max)

Have a look at the help for apply. The code above passes the 14th to 17th columns

of the blocks data frame to apply, the 1 is passed to the MARGIN parameter to

indicate that apply will operate over each row, and the function that is applied is

max. Compare the result when the MARGIN parameter is set to be columns:

apply(blocks@data[,14:17], 2, max)

The code above returns the largest percentage of each ethnic group in any census

block.

Now suppose we wanted to determine which ethnicity formed the largest

group in each block. One way would be to create a for loop. Another would be to

define a function and use apply.

# set up vector to hold result

result.vector <- vector()

for (i in 1:nrow(blocks@data)){

# for each row determine which column has the max value

result.i <- which.max(data.frame(blocks[i,14:17]))

# put into the result vector

result.vector <- append(result.vector, result.i)

}

This can also be determined using apply as in the code below and the two results

compared:

res.vec <-apply(data.frame(blocks[,14:17]), 1, which.max)

# compare the two results

identical(as.vector(res.vec), as.vector(result.vector))

Why use apply? Loops are tractable but slow! Typically apply functions are

much quicker than loops, as is clear if the timings are compared. In many cases

this will not matter, but it will when you have large data or heavy computations

and processing. You may have to define your own functions and in some cases

manipulate the data that are passed to apply, but they are a very useful family of

functions.


# Loop

t1 <- Sys.time()

result.vector <- vector()

for (i in 1:nrow(blocks@data)){

result.i <- which.max(data.frame(blocks[i,14:17]))

result.vector <- append(result.vector, result.i)

}

Sys.time() - t1

# Apply

t1 <- Sys.time()

res.vec <-apply(data.frame(blocks[,14:17]), 1, which.max)

Sys.time() - t1

The second example uses mapply to plot the coordinates of each element of the

georgia.polys list. Here a plot extent has to be defined, and then each polygon

is plotted in turn (actually this is what plotting routines for sf and sp objects do).

One way to do this is as follows:

plot(bbox(georgia2)[1,], bbox(georgia2)[2,], asp = 1,

type='n',xlab='',ylab='',xaxt='n',yaxt='n',bty='n')

for (i in 1:length(georgia.polys)){

points(georgia.polys[[i]], type='l')

# small delay so that you can see the plotting

Sys.sleep(0.05)

}

Another would be use to mapply:

plot(bbox(georgia2) [1,], bbox(georgia2) [2,], asp = 1,

type='n',xlab='',ylab='',xaxt='n',yaxt='n',bty='n')

invisible(mapply(polygon,georgia.polys))

The for loop below returns two objects: count.vec, a vector of the number of

counties within 50 km of each of the 159 counties in the georgia2 dataset; and a

list object with 159 elements of the names of these.

# convert Georgia2 to sf

georgia_sf <- st_as_sf(georgia2)

# create a distance matrix

dMat <- as.matrix(dist(coordinates(georgia2)))

dim(dMat)

# create an empty vector

count.vec <- vector()

# create an empty list

names.list <- list()

# for each county...

for( i in 1:nrow(georgia_sf)) {

# which counties are within 50km

vec.i <- which(dMat[i,] <= 50000)

# add to the vector

count.vec <- append(count.vec, length(vec.i))

# find their names

names.i <- georgia_sf$Name[vec.i]


# add to the list

names.list[[i]] <- names.i

}

# have a look!

count.vec

names.list

You could of course use lapply to investigate the list you have just created. Notice

how this does not require a MARGIN to be specified as does apply. Rather it just

requires a function to be applied to each element in a list:

lapply(names.list, length)

Self-Test Question 7. Recode the for loop above into two functions to be applied

to the distance matrix, dMat, and called in a similar way to the following:

count.vec <- apply(dMat, 1, my.func1)

names.list <- apply(dMat, 1, my.func2)

4.7 MANIPULATING DATA WITH dplyr

A second set of very useful tools in the context of programming is provided by the

data table operations within the dplyr package, included within the tidyverse.

These can be used with tabular data, including the data frames containing the

attributes of spatial data. To start you should clear your R workspace and install

and load the tidyverse package and explore the introduction vignette.

Recall that vignettes were introduced in Chapter 3.

vignette("dplyr", package = "dplyr")

For the dplyr vignettes you will also have to install the nycflights13

package that contains some example data describing flights and airlines, and note

that the default data table format for the tidyverse is tibble.

install.packages("nycflights13")

library("nycflights13")

class(flights)

flights

You can examine the other datasets included in this package as well:

data(package = "nycflights13")

You should explore the different functions for summarising and filtering individu-

al data tables. The important ones are summarised in Table 4.3.


Table 4.3 Functions in the dplyr package for manipulating data tables

Function Description

filter() Selects a subset of rows in a data frame, according to user-defined

conditional statements

slice() Selects a subset of rows in a data frame by their position (row number)

arrange() Changes the row order according to the columns specified (by 1st, 2nd and

then 3rd column, etc.)

desc() Orders a column in descending order

select() Selects the subset of specified columns and reorders them vertically

distinct() Finds unique values in a table

mutate() Creates and adds new columns based on operations applied to existing

columns, e.g. NewCol = Col1 + Col2

transmute As select but only retains the new variables

summarise Summarises values with functions that are passed to it

sample_n Takes a random sample of table rows

sample_frac Selects a fixed fraction of rows
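
As a brief illustration of some of these verbs (a sketch using the flights table loaded above; dest, dep_delay and carrier are columns of that table):

# the five most delayed flights to Atlanta
atl <- filter(flights, dest == "ATL")
atl <- arrange(atl, desc(dep_delay))
slice(atl, 1:5)
# mean departure delay for each carrier
summarise(group_by(flights, carrier), mean_delay = mean(dep_delay, na.rm = TRUE))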

Then you should explore the two-table vignette.

vignette("two-table", package = "dplyr")

Again, you should work through the various join and summary operations in the

two-table vignette. The first command is to select variables from flights to

create flight2.

flights2 <- flights %>% select(year:day,hour,origin,dest,tailnum,carrier)

You will note that the vignette uses the piping syntax. The %>% command pipes

the flights dataset to the select function, specifying the columns of data to be

selected. The result is assigned to flights2. A non-piped version would be:

flights2 <- select(flights, year:day,hour,origin,dest,tailnum,carrier)

The dplyr package contains a number of methods for summarising and joining

tables, including different _join functions: inner_join, left_join,

right_join, full_join, semi_join and anti_join. You should familiarise your-

self with how these different join functions operate and how they relate to the two

data table inputs they take.
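
For example, a minimal sketch using the flights2 table created above and the airlines table that also comes with the nycflights13 package (both share a carrier column):

# add the full carrier name to each flight record
flights2 %>% left_join(airlines, by = "carrier")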

Self-Test Question 8. The code below creates flights2, a tibble data table

in dplyr with variables of the destination (dest), the number of flights in 2013


(count) and the latitude and longitude of the origin (OrLat and OrLon) in the

New York area.

library(nycflights13)

library(tidyverse)

# select the variables

flights2 <- flights %>% select(origin, dest)

# remove Alaska and Hawaii

flights2 <- flights2[-grep("ANC", flights2$dest),]

flights2 <- flights2[-grep("HNL", flights2$dest),]

# group by destination

flights2 <- group_by(flights2, dest)

flights2 <- summarize(flights2, count = n())

# assign Lat and Lon for Origin

flights2$OrLat <- 40.6925

flights2$OrLon <- -74.16867

# have a look!

flights2

# A tibble: 103 x 4

dest count OrLat OrLon

1 ABQ 254 40.7 −74.2

2 ACK 265 40.7 −74.2

3 ALB 439 40.7 −74.2

4 ATL 17215 40.7 −74.2

5 AUS 2439 40.7 −74.2

6 AVL 275 40.7 −74.2

7 BDL 443 40.7 −74.2

8 BGR 375 40.7 −74.2

9 BHM 297 40.7 −74.2

10 BNA 6333 40.7 −74.2

# ... with 93 more rows

Your task is to join the flights2 data table to the airports dataset and

determine the latitude and longitude of the destinations. A secondary task, if

you wish, is to then map the flights using the gcIntermediate function in the

geosphere package and the datasets in the maps package (both of which you

may need to install).

Some hints about the mapping are provided in the code below. This example

plots two locations and then uses the gcIntermediate function in geosphere

to plot a path between them.

library(maps)

library(geosphere)


# origin and destination examples

dest.eg <- matrix(c(77.1025, 28.7041), ncol = 2)

origin.eg <- matrix(c(-74.16867, 40.6925), ncol = 2)

# map the world from the maps package data

map("world", fill=TRUE, col="white", bg="lightblue")

# plot the points

points(dest.eg, col="red", pch=16, cex = 2)

points(origin.eg, col = "cyan", pch = 16, cex = 2)

# add the route

for (i in 1:nrow(dest.eg)) {

lines(gcIntermediate(dest.eg[i,], origin.eg[i,], n=50,

breakAtDateLine=FALSE, addStartEnd=FALSE,

sp=FALSE, sepNA), lwd = 2, lty = 2)

}

You may wish to explore the use of other basemaps from the maps package:

map("usa", fill=TRUE, col="white", bg="lightblue")

4.8 ANSWERS TO SELF-TEST QUESTIONS

Q1: A new cube root function:

cube.root.2 <- function(x)

{ if (is.numeric(x))

{ result <- sign(x)*abs(x)^(1/3)

return(result)

} else

{ cat("WARNING: Input must be numerical, not character\n")

return(NA) }

}

Q2: Match up the lines in the gcd function to the lines in the description of Euclid’s

algorithm:

gcd <- function(a,b)

{

divisor <- min(a,b) # line 1

dividend <- max(a,b) # line 1

repeat #line 5

{ remainder <- dividend %% divisor #line 2

dividend <- divisor # line 3

divisor <- remainder # line 4

if (remainder == 0) break #line 6

}

return(dividend)

}

Q3: (i) Write a function to compute and print out gcd(x,60):


gcd.60 <- function(a)

{

for(i in 1:a)

{ divisor <- min(i,60)

dividend <- max(i,60)

repeat

{ remainder <- dividend %% divisor

dividend <- divisor

divisor <- remainder

if (remainder == 0) break

}

cat(dividend, "\n")

}

}

Alternatively you could nest the predefined gcd function inside the modified

one:

gcd.60 <- function(a)

{ for(i in 1:a)

{ dividend <- gcd(i,60)

cat(i,":", dividend, "\n")

}

}

(ii) Write a function to compute and print out gcd(x,y):

gcd.all <- function(x,y)

{ for(n1 in 1:x)

{ for (n2 in 1:y)

{ dividend <- gcd(n1, n2)

cat("when x is",n1,"& y is",n2,"dividend =",dividend,"\n")

}

}

}

Q4: Modify cube.root.table to run from 0.5 to n in steps of 0.5. The obvious

solution to this is:

cube.root.table <- function(n)

{ for (x in seq(0.5, n, by = 0.5))

{ cat("The cube root of ",x," is",

sign(x)*abs(x)^(1/3),"\n")}

}

However, this will not work when negative values are passed to it: seq cannot

create the array. The function can be modified to accommodate sequences running

from 0.5 to both negative and positive values of n:


cube.root.table <- function(n)

{ if (n > 0) by.val = 0.5

if (n < 0) by.val = -0.5

for (x in seq(0.5, n, by = by.val))

{ cat("The cube root of ",x," is",

sign(x)*abs(x)^(1/3),"\n") }

}

Q5: Create a for loop that cycles through each county/row in the data frame of the

georgia2 dataset and creates a list of the adjacent counties. You were given the code

for a single county – this needs to be put into a loop, replacing the 1 with i or similar.

# create an empty list for the results

adj.list <- list()

# convert georgia to sf

georgia_sf <- st_as_sf(georgia2)

for (i in 1:nrow(georgia_sf)) {

# extract a single county

county.i <- georgia_sf[i,]

# determine the adjacent counties

# the [-1] removes county i from its own list

adj.i <- unlist(st_intersects(county.i, georgia_sf))[-1]

# extract their names

adj.names.i <- georgia2$Name[adj.i]

# add to the list

adj.list[[i]] <- adj.i

# name the list elements

names(adj.list[[i]]) <- adj.names.i

}

Q6: Create a function that returns a list of the indices of adjacent polygons for each

polygon in any polygon dataset in sf or sp format.

return.adj <- function(sf.data){

# convert to sf regardless!

sf.data <- st_as_sf(sf.data)

adj.list <- list()

for (i in 1:nrow(sf.data)) {

# extract a single county

poly.i <- sf.data[i,]

# determine the adjacent counties

adj.i <- unlist(st_intersects(poly.i, sf.data))[-1]

# add to the list

adj.list[[i]] <- adj.i

}

return(adj.list)

}

# test it!

return.adj(georgia_sf)

return.adj(blocks)


Q7: Recode the for loop into two functions replicating the functionality of the

loop:

# number of counties within 50km

my.func1 <- function(x){

vec.i <- which(x <= 50000)

return(length(vec.i))

}

# their names

my.func2 <- function(x){

vec.i <- which(x <= 50000)

names.i <- georgia_sf$Name[vec.i]

return(names.i)

}

count.vec <- apply(dMat,1, my.func1)

names.list <- apply(dMat,1, my.func2)

Q8: Join the flights2 data table to the airports dataset and determine the lati-

tude and longitude of the destinations. Then map the flights using the

gcIntermediate function in the geosphere package and the datasets in the maps

package.

# Part 1: the join

flights2 <- flights2 %>% left_join(airports, c("dest" = "faa"))

flights2 <- flights2 %>% select(count,dest,OrLat,OrLon,

DestLat=lat,DestLon=lon)

# get rid of any NAs

flights2 <- flights2[!is.na(flights2$DestLat),]

flights2

# Part 2: the plot

# Using standard plots

dest.eg <- matrix(c(flights2$DestLon, flights2$DestLat), ncol = 2)

origin.eg <- matrix(c(flights2$OrLon, flights2$OrLat), ncol = 2)

map("usa", fill=TRUE, col="white", bg="lightblue")

points(dest.eg, col="red", pch=16, cex = 1)

points(origin.eg, col = "cyan", pch = 16, cex = 1)

for (i in 1:nrow(dest.eg)) {

lines(gcIntermediate(dest.eg[i,], origin.eg[i,], n=50,

breakAtDateLine=FALSE,

addStartEnd=FALSE, sp=FALSE, sepNA))

}

# using ggplot

all_states <- map_data("state")

dest.eg <- data.frame(DestLon = flights2$DestLon,

DestLat = flights2$DestLat)

origin.eg <- data.frame(OrLon = flights2$OrLon,

OrLat = flights2$OrLat)

library(GISTools)

# Figure 2 using ggplot

# create the main plot


mp <- ggplot() +

geom_polygon( data=all_states,

aes(x=long, y=lat, group = group),

colour="white", fill="grey20") +

coord_fixed() +

geom_point(aes(x = dest.eg$DestLon, y = dest.eg$DestLat),

color="#FB6A4A", size=2) +

theme(axis.title.x=element_blank(),

axis.text.x=element_blank(),

axis.ticks.x=element_blank(),

axis.title.y=element_blank(),

axis.text.y=element_blank(),

axis.ticks.y=element_blank())

# create some transparent shading

cols=add.alpha(colorRampPalette(brewer.pal(9,"Reds"))(nrow(flights2)), 0.7)

# loop through the destinations

for (i in 1:nrow(flights2)) {

# line thickness related to the number of flights

lwd.i = 1+ (flights2$count[i]/max(flights2$count))

# a sequence of colours

cols.i = cols[i]

# create a dataset

link <- as.data.frame(gcIntermediate(dest.eg[i,], origin.eg[i,],n=50,

breakAtDateLine=FALSE, addStartEnd=FALSE, sp=FALSE, sepNA))

names(link) <- c("lon", "lat")

mp <- mp + geom_line(data=link, aes(x=lon, y=lat),

color= cols.i, size = lwd.i)

}

# plot!

mp

5

USING R AS A GIS

5.1 INTRODUCTION

In GIS and spatial analysis, we are often interested in finding out how the

information contained in one spatial dataset relates to that contained in

another. The kinds of questions we may be interested in include:

● How does X interact with Y?

● How many X are there in different locations of Y?

● How does the incidence of X relate to the rate of Y?

● How many of X are found within a certain distance of Y?

● How does process X vary with Y spatially?

X and Y may be diseases, crimes, pollution events, attributed census areas, envi-

ronmental factors, deprivation indices or any other geographical process or phe-

nomenon that you are interested in understanding. Answering such questions

using a spatial analysis frequently requires some initial data pre-processing and

manipulation. This might be to ensure that different data have the same spatial

extent, describe processes in a consistent way (e.g. to compare land cover types

from different classifications), are summarised over the same spatial framework

(e.g. census reporting areas), are of the same format (raster, vector, etc.) and are

projected in the same way (the latter was introduced in Chapter 3).

This chapter uses worked examples to illustrate a number of fundamental and

commonly applied spatial operations on spatial datasets. Many of these form the

basis of most GIS software. The datasets may be ones you have read into R from

shapefiles or ones that you have created in the course of your analysis. Essentially,

the operations illustrate different methods for extracting information from one spa-

tial dataset based on the spatial extent of another. Many of these are what are fre-

quently referred to as overlay operations in GIS software such as ArcGIS or QGIS,

but here are extended to include a number of other types of data manipulation. The

sections below describe the following operations:


● Intersections and clipping one dataset to the extent of another

● Creating buffers around features

● Merging the features in a spatial dataset

● Point-in-polygon and area calculations

● Creating distance attributes

● Combining spatial data and attributes

● Converting between raster and vector

As you work through the example code in this chapter a number of self-test ques-

tions are introduced. Some of these go into much greater detail and complexity

than in earlier chapters and come with extensive direction for you to work through

and follow.

The chapter draws on functionality from a number of packages that have

been introduced in earlier chapters (sf, sp, maptools, GISTools,

tidyverse, rgeos, etc.) for performing overlay and other spatial operations

on spatial datasets which create new data, information or attributes. In many


cases, it is up to the analyst (you!) to decide which operations to undertake and

in what order for a particular analysis and, depending on your objectives, a

given operation may be considered as a pre-processing step or as an analytical

one. For example, calculating distances, areas, or point-in-polygon counts prior

to a statistical test may be pre-processing steps prior to the actual data analysis

or used as the actual analysis itself. The key feature of these operations is that

they create new data or information. Similarly, this chapter will use both sf and

sp data formats as needed, both of which have their own set of functions linking

to rgeos. As a reminder, sf data formats are relatively new and have strong

links to dplyr (part of the tidyverse package). This chapter will highlight

operations in both, and where we think there is a distinct advantage to one

approach this will be presented.

It is important to recall that there are conversion functions for moving between

sf and sp formats:

library(sf)

library(GISTools) # a wrapper for sp, rgeos, etc.

# load some data

data(georgia)

class(georgia)

# convert to sf

georgia_sf <- st_as_sf(georgia)

class(georgia_sf)

# convert back to sp

georgia_v2 <- as(georgia_sf, "Spatial")

class(georgia_v2)


5.2 SPATIAL INTERSECTION AND CLIP OPERATIONS

The GISTools package comes with datasets describing tornadoes in the USA.

Load the package and these data into a new R session.

library(GISTools)

data(tornados)

You will see that four sp datasets are now loaded: torn, torn2, us_states

and us_states2. The torn and torn2 data describe the locations of tornadoes

recorded between 1950 and 2004, and the us_states and us_states2 datasets

are spatial data describing the states of the USA. Two of these are in WGS84 pro-

jections (torn and us_states) and two are projected in a GRS80 datum (torn2

and us_states2). We can plot these and examine the data as in Figure 5.1.

library(tmap)

library(sf)

# convert to sf objects

torn_sf <- st_as_sf(torn)

us_states_sf <- st_as_sf(us_states)

# plot extent and grey background

tm_shape(us_states_sf) +

tm_polygons("grey90") +

# add the torn points

tm_shape(torn_sf) +

tm_dots(col = "#FB6A4A", size = 0.04, shape = 1, alpha = 0.5) +

# map the state borders

tm_shape(us_states_sf) +

tm_borders(col = "black") +

tm_layout(frame = F)

Figure 5.1 The tornado data


Note that the sp plotting code takes a very similar form:

plot(us_states, col = "grey90")

plot(torn, add = T, pch = 1, col = "#FB6A4A4C", cex = 0.4)

plot(us_states, add = T)

Remember that you can examine the attributes of a variable using the summary()

function. For sp objects this also includes a summary of the object projection. This

can be seen using the st_geometry function in sf:

summary(torn)

summary(torn_sf)

st_geometry(torn_sf)

Now, consider the situation where the aim was to analyse the incidence of torna-

does in a particular area: we do not want to analyse all of the tornado data but only

those records that describe events in our study area – the area we are interested

in. The code below selects a group of US states, in this case Texas, New Mexico,

Oklahoma and Arkansas – note the use of the OR logical operator | to make the

selection.

index <- us_states$STATE_NAME == "Texas" |

us_states$STATE_NAME == "New Mexico" |

us_states$STATE_NAME == "Oklahoma" |

us_states$STATE_NAME == "Arkansas"

AoI <- us_states[index,]

# OR....

AoI_sf <- us_states_sf[index,]

This can be plotted using the usual commands as in the code below. You can see that

the plot extent is defined by the spatial extent of area of interest (called AoI_sf)

and that all of the tornadoes within that extent are displayed.

tm_shape(AoI_sf) +

tm_borders(col = "black") +

tm_layout(frame = F) +

# add the torn points

tm_shape(torn_sf) +

tm_dots(col = "#FB6A4A", size = 0.2, shape = 1, alpha = 0.5)

# OR in sp

plot(AoI)

plot(torn, add = T, pch = 1, col = "#FB6A4A4C")

There are a number of ways of clipping spatial data in R. The simplest of these is

to use the spatial extent of one as an index to subset another. (Note that this can be

done using sp objects as well.)

torn_clip_sf <- torn_sf[AoI_sf,]


This simply clips out the data from torn_sf that is within the spatial extent of

AoI_sf. You can check this:

tm_shape(torn_clip_sf) +

tm_dots(col = "#FB6A4A", size = 0.2, shape = 1, alpha = 0.5) +

tm_shape(AoI_sf) +

tm_borders()
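For completeness, the equivalent extent-based subset works directly on the sp objects too; a minimal sketch (torn_clip is just an illustrative name):

# subset the sp tornado points by the extent of the sp area of interest
torn_clip <- torn[AoI,]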

However, such clip (or crop) operations simply subset data based on their spatial

extents. There may be occasions when you wish to combine the attributes of different datasets based on their spatial intersection. The gIntersection function in rgeos or the st_intersection function in sf allows us to do this as shown in the code

below. The results are mapped in Figure 5.2.

AoI_torn_sf <- st_intersection(AoI_sf, torn_sf)

tm_shape(AoI_sf) + tm_borders(col = "black") + tm_layout(frame = F) +

# add the torn points

tm_shape(AoI_torn_sf) +

tm_dots(col = "#FB6A4A", size = 0.2, shape = 1, alpha = 0.5)

Figure 5.2 The tornado data in the defined area of interest

The st_intersection operation creates an sf dataset of the locations of the

tornadoes within the area of interest. The gIntersection function does the

same thing:


AoI.torn <- gIntersection(AoI, torn, byid = TRUE)

plot(AoI)

plot(AoI.torn, add = T, pch = 1, col = "#FB6A4A4C")

If you examine the data created by the intersection, you will notice that each of the

intersecting points has the full attribution from input datasets. You can examine the

attributes of the AoI_torn_sf data and the AoI.torn data by entering:

head(data.frame(AoI_torn_sf))

head(data.frame(AoI.torn))

Once extracted, the subset can be written out for use elsewhere as described in

Chapters 2 and 3. You should examine the help for both st_intersection

and gIntersection to see how they work. You should particularly note

that both functions operate on any pair of spatial objects provided they share the same coordinate reference system (in this case WGS84). In order to perform spatial operations you may need to re-project your data to a common CRS using

spTransform or st_transform as described in Chapter 3.
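For example, a minimal sketch (torn_sf_proj is an illustrative name) that re-projects the WGS84 tornado points to match the projected states data:

# re-project the tornado points to the projection of us_states2
torn_sf_proj <- st_transform(torn_sf, crs = proj4string(us_states2))
st_geometry(torn_sf_proj)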

5.3 BUFFERS

In many situations, we are interested in events or features that occur near to our

area of interest as well as those within it. Environmental events such as torna-

does, for example, do not stop at state lines or other administrative boundaries.

Similarly, if we were studying crime locations or spatial access to facilities such

as shops or health services, we would want to know about locations near to the

study area border. Buffer operations provide a convenient way of doing this, and

buffers can be created in R using the gBuffer function in rgeos or the st_

buffer function in sf.

Continuing with the example above, we might be interested in extracting the

tornadoes occurring in Texas and those within 25 km of the state border. Thus

the objective is to create a 25 km buffer around the state of Texas and to use that

to select from the tornado dataset. Both buffer functions allow us to do that, and

require a distance for the buffer to be specified in terms of the units used in the

projection. However, in order to do this, a different projection is required as dis-

tances are difficult to determine directly from projections in degrees (essentially,

the relationship between planar distance measures such as metres and kilome-

tres to degrees varies with latitude). And the buffer will return an error message

if you try to buffer a non-projected spatial dataset. Therefore, the code below

uses the projected US data, us_states2, and the resultant buffer is shown in

Figure 5.3.

# select an area of interest and apply a buffer

# in rgeos


AoI <- us_states2[us_states2$STATE_NAME == "Texas",]

AoI.buf <- gBuffer(AoI, width = 25000)

# in sf

us_states2_sf <- st_as_sf(us_states2)

AoI_sf <- st_as_sf(us_states2_sf[us_states2_sf$STATE_NAME == "Texas",])

AoI_buf_sf <- st_buffer(AoI_sf, dist = 25000)

# map the buffer and the original area

# sp format

par(mar=c(0,0,0,0))

plot(AoI.buf)

plot(AoI, add = T, border = "blue")

# tmap: commented out!

# tm_shape(AoI_buf_sf) + tm_borders("black") +

# tm_shape(AoI_sf) + tm_borders("blue") +

# tm_layout(frame = F)

Figure 5.3 Texas with a 25 km buffer


The buffered object (or objects), shown in Figure 5.3, can be used as input to clip

or intersection operations as above, for example to extract data within a certain

distance of an object. You should also examine the impact on the output of other

parameters in both buffer functions that control how line segments are created,

the geometry of the buffer, join styles, etc. Note that any sp or sf objects can be

used as an input to the gBuffer and st_buffer functions, respectively: try

applying them to the breach dataset that is put into working memory when the

newhaven data are loaded.
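For example, a minimal sketch (object names are illustrative) that uses the buffer to extract the tornadoes within 25 km of Texas, and buffers the breach points as suggested above (the newhaven data are projected in feet, so the width is given in feet):

# tornadoes within 25 km of Texas: intersect the projected points with the buffer
torn2_sf <- st_as_sf(torn2)
torn_25km_sf <- st_intersection(torn2_sf, AoI_buf_sf)
# a 1000 ft buffer around each breach of the peace location
data(newhaven)
breach_buf <- gBuffer(breach, width = 1000, byid = TRUE)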

There are a number of options for defining how the buffer is created. If you enter

the code below, using IDs, then buffers are created around each of the counties

within the georgia2 dataset:

data(georgia)

georgia2_sf <- st_as_sf(georgia2)

# apply a buffer to each object

# sf

buf_t_sf <- st_buffer(georgia2_sf, 5000)

# rgeos

buf.t <- gBuffer(georgia2, width = 5000, byid = T, id = georgia2$Name)

# now plot the data

# sf

tm_shape(buf_t_sf) +

tm_borders() +

tm_shape(georgia2) +

tm_borders(col = "blue") +

tm_layout(frame = F)

# rgeos

plot(buf.t)

plot(georgia2, add = T, border = "blue")

The IDs of the resulting buffer datasets relate to each of the input features, which

in the above code has been specified to be the county names. This can be checked

by examining how the buffer object has been named using names(buf.t). If you

are not convinced that the indexing has been preserved then you can compare the

output with a familiar subset, Appling County:

plot(buf.t[1,])

plot(georgia2[1,], add = T, col = "blue")

5.4 MERGING SPATIAL FEATURES

In the intersection example above, four US states were selected and used to

define the area of interest over which the tornado data were extracted. An attrib-

ute describing in which state each tornado occurred was added to the data

frame of the intersected object. In other instances we may wish to consider the

area as a single object and to merge the features within it. This can be done using


the gUnaryUnion function in the rgeos package, or the st_union and st_combine functions in the sf package, which were used in Chapter 3 to create an outline of the state of Georgia from its constituent counties. In the code

below the US states are merged into a single object and then plotted over the

original data as shown in Figure 5.4. Note the use of the st_sf function to con-

vert the sfc output of the st_union function to sf class before passing to the

tmap functions.

Figure 5.4 The outline of the merged US states created by gUnaryUnion, with the

original state outlines in green

library(tmap)

### with rgeos and sp commented out

# AoI.merge <- gUnaryUnion(us_states)

# plot(us_states, border = "darkgreen", lty = 3)

# plot(AoI.merge, add = T, lwd = 1.5)

### with sf and tmap

us_states_sf <- st_as_sf(us_states)

AoI.merge_sf <- st_sf(st_union(us_states_sf))

tm_shape(us_states_sf) + tm_borders(col = "darkgreen", lty = 3) +

tm_shape(AoI.merge_sf) + tm_borders(lwd = 1.5, col = "black") +

tm_layout(frame = F)

The union operations merge spatial object sub-geometries. Once the merged

objects have been created they can be used as inputs into the intersection and buff-

ering procedures above in order to select data for analysis, as well as the analysis

operations described below. The merged objects can also be used in a cartographic

context to provide a border to the study area being considered.


5.5 POINT-IN-POLYGON AND AREA CALCULATIONS

5.5.1 Point-in-Polygon

It is often useful to count the number of points falling within different zones in a

polygon dataset. This can be done using the poly.counts function in the

GISTools package, which extends the gContains function in rgeos, or using

a similar method with the st_contains function in sf.


Remember that you can examine how a function works by entering it into the

console without the brackets – try entering poly.counts at the console.

The code below assigns a list of counts of the number of tornadoes that occur

inside each US state to the variable torn.count and prints the first six of these

to the console using the head function:

torn.count <- poly.counts(torn, us_states)

head(torn.count)

1 2 3 4 5 6

79 341 87 1121 1445 549

The numbers along the top are the ‘names’ of the elements in the variable torn.count,

which in this case are the polygon ID numbers of the us_states variable. The

values are the counts of the points in the corresponding polygons. You can check

this by entering:

names(torn.count)
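As a quick check, the counts can also be attached to the sf version of the states and mapped; a minimal sketch (torn.count follows the polygon order of us_states, so the assignment aligns):

us_states_sf$torn.count <- torn.count
# qtm(us_states_sf, "torn.count")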

5.5.2 Area Calculations

Another useful operation is to be able to calculate polygon areas. The gArea and

st_area functions in rgeos and sf do this. To check the projection, and there-

fore the map units, of an sp class object (including SpatialPolygons,

SpatialPoints, etc.), use the proj4string function, and for sf objects use

the st_crs function:

proj4string(us_states2)

st_crs(us_states2_sf)

This shows that the projection units are metres. To see the areas in square metres of

each US state, enter:


poly.areas(us_states2)

st_area(us_states2_sf)

These are not particularly useful, and more realistic measures are to report areas in

hectares or square kilometres:

# hectares

poly.areas(us_states2) / (100 * 100)
st_area(us_states2_sf) / (100 * 100)
# square kilometres
poly.areas(us_states2) / (1000 * 1000)
st_area(us_states2_sf) / (1000 * 1000)
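Counts and areas can be combined into densities. A minimal sketch, using the projected torn2 and us_states2 data (torn.dens is an illustrative name), of tornado densities per 1000 square kilometres by state, which previews the structure needed for Self-Test Question 1 below:

# tornadoes per 1000 square kilometres in each state
torn.count2 <- poly.counts(torn2, us_states2)
torn.dens <- torn.count2 / (poly.areas(us_states2) / (1000 * 1000)) * 1000
head(round(torn.dens, 2))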

Self-Test Question 1. Create the code to produce maps of the densities of

breaches of the peace in each census block in New Haven in breaches per square

kilometre. For the analysis you will need to use the breach point data and the

census blocks in the newhaven dataset and undertake a point-in-polygon

operation, apply an area function and undertake a conversion to square kilo-

metres. The maps should be produced using the tm_shape and tm_fill func-

tions in the tmap package. The New Haven data are included in the GISTools

package:

data(newhaven)

Reminder: As with all self-test questions, worked answers are provided in the final

section of the chapter.

You should note that the New Haven dataset is projected in feet. One way is to leave the data in feet, calculate areas in square miles by applying the ft2miles function to the results of the area calculation (as areas are in squared units, you will need to apply it twice), and then convert to square kilometres, noting that there are approximately 2.58999 square kilometres in each square mile. The code below calculates the area in square kilometres of each block:

ft2miles(ft2miles(gArea(blocks, byid = T))) * 2.58999

5.5.3 Point and Areas Analysis Exercise

An important advantage of using R to handle spatial data is that it is very easy

to incorporate your data into statistical analysis and graphics routines. For

example, in the New Haven blocks data frame, there is a variable called P_

OWNEROCC which states the percentage of owner-occupied housing in each

census block. It may be of interest to see how this relates to the breach of peace

densities calculated in Self-Test Question 1. A useful statistic is the correlation

coefficient generated by the cor function which causes the correlation to be

printed out:


data(newhaven)

blocks$densities=poly.counts(breach,blocks)/

ft2miles(ft2miles(poly.areas(blocks)))

cor(blocks$P_OWNEROCC,blocks$densities)

[1] −0.2038463


In this case the two variables have a correlation of around −0.2, a weak nega-

tive relationship, suggesting that, in general, places with a higher proportion of

owner-occupied homes tend to see fewer breaches of peace. It is also possible to

plot the relationship between the quantities:

library(ggplot2)
ggplot(blocks@data, aes(P_OWNEROCC,densities))+

geom_point() +

geom_smooth(method = "lm")

A more detailed approach might be to model the number of breaches of peace. Typ-

ically, these are relatively rare, and a Poisson distribution might be an appropriate

model. A possible model might then be:

breaches ~ Poisson(AREA * exp(a + b * P_OWNEROCC))

where AREA is the area of a block, P_OWNEROCC is the percentage of owner occu-

piers in the block, and a and b are coefficients to be estimated, a being the intercept

term. The AREA variable plays the role of an offset – a variable that always has a

coefficient of 1. The idea here is that even if breaches of peace were uniformly dis-

tributed, the number of incidents in a given census block would be proportional to

the AREA of that block. In fact, we can rewrite the model such that the offset term

is the log of the area:

breaches ~ Poisson(exp(a + b * P_OWNEROCC + log(AREA)))

Seeing the model written this way makes it clear that the offset term has a coefficient

that must always be equal to 1. The model can be fitted in R using the following code:

# load and attach the data

data(newhaven)

attach(data.frame(blocks))

# calculate the breaches of the peace in each block

n.breaches = poly.counts(breach,blocks)

area = ft2miles(ft2miles(poly.areas(blocks)))

# fit the model

model1=glm(n.breaches~P_OWNEROCC,offset=log(area),family=poisson)

# detach the data

detach(data.frame(blocks))

The first two lines compute the counts, storing them in n.breaches, and the

areas, storing them in area. The next line fits the Poisson model. glm stands for

‘generalised linear model’, and extends the standard lm routine to fit models such


as Poisson regression. As a reminder, further information about linear models and

the R modelling language was provided in one of the information boxes in Chapter 3

and an example of its use was given. The family=poisson option specifies

that a Poisson model is to be fitted here. The offset option specifies the offset

term, and the first argument specifies the actual model to be fitted. The model-

fitting results are stored in the variable model1. Having created the model in this

way, entering:

model1

returns a brief summary of the fitted model. In particular, it can be seen that the

estimated coefficients are a = 3.02 and b = −0.0310.

A more detailed view can be obtained using:

summary(model1)

Now, among other things, the standard errors and Wald statistics for a and b

are now shown. The Wald Z-statistics are similar to t-statistics in ordinary least

squares regression, and may be tested against the normal distribution. The results

in Table 5.1 summarise the information, showing that both a and b are significant,

and that therefore there is a statistically significant relationship between owner

occupation and breach of peace incidents.

Table 5.1 Summary of the Poisson model of the breaches of the peace over census blocks

                Estimate   Std. error   Wald’s Z   p-value
Intercept          3.02      0.11          27.4     <0.01
Owner Occ. %      −0.031     0.00364       −8.5     <0.01
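The values in Table 5.1 can be recovered directly from the fitted model; a minimal sketch:

# estimates, standard errors, Wald Z statistics and p-values
coefs <- summary(model1)$coefficients
round(coefs, 4)
# the two-tailed p-values derive from the Wald Z statistics
2 * pnorm(-abs(coefs[, "z value"]))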

It is also possible to extract diagnostic information from fitted models. For

example, the rstandard function extracts the standardised residuals from a

model. Whereas residuals are the difference between the observed value (i.e. in the

data) and the value when estimated using the model, standardised residuals are

rescaled to have a variance of 1. If the model being fitted is correct, then these

residuals should be independent, have a mean of 0, a variance of 1 and an approx-

imately normal distribution. One useful diagnostic is to map these values. The

code below computes them and stores them in a variable called s.resids:

s.resids = rstandard(model1)

Now to plot the map it will be more useful to specify a shading scheme directly

using the shading command:


resid.shades = shading(c(-2,2),c("red","grey","blue"))

This specifies that the map will have three class intervals: below −2, between −2

and 2, and above 2. These are useful intervals, given that the residuals should be

normally distributed, and these values are the approximate two-tailed 5% points of

this distribution. Residuals within these points will be shaded grey, large negative

residuals will be red, and large positive ones will be blue:

par(mar=c(0,0,0,0))

choropleth(blocks,s.resids,resid.shades)

Figure 5.5 The distribution of the model1 residuals, describing the relationship between
breaches of the peace and owner occupancy

From Figure 5.5 it can be seen that in fact there is notably more variation than one
might expect (there are 21 blocks shaded blue or red, about 16% of the total, when
around 5% would appear based on the model’s assumptions), and also that the

shaded blocks seem to cluster together. This last observation casts doubt on the

assumption of independence, suggesting instead that some degree of spatial cor-

relation is present. One possible reason for this is that further variables may need

to be added to the model, to explain this extra variability and spatial clustering

among the residuals.
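The figures quoted above can be checked directly from the residuals; a quick sketch:

# number and proportion of blocks with |standardised residual| > 2
sum(abs(s.resids) > 2)
mean(abs(s.resids) > 2)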

It is possible to extend this analysis by considering P_VACANT, the percentage

of vacant properties in each census block, as well as P_OWNEROCC. This is done by

extending model1 and entering:

attach(data.frame(blocks))
n.breaches = poly.counts(breach,blocks)
area = ft2miles(ft2miles(poly.areas(blocks)))
model2=glm(n.breaches~P_OWNEROCC+P_VACANT,
offset=log(area),family=poisson)
s.resids.2 = rstandard(model2)
detach(data.frame(blocks))

Figure 5.6 The distribution of the model2 residuals, describing the relationship between
breaches of the peace with owner occupancy and vacant properties

This sets up a new model, with a further term for the percentage of vacant housing

in each block, and stores it in model2. Entering summary(model2) shows that

the new predictor variable is significantly related to breaches of the peace, with a

positive relationship. Finally, it is possible to map the standardised residuals for

the new model reusing the shading scheme defined above:

s.resids.2 = rstandard(model2)

par(mar=c(0,0,0,0))

choropleth(blocks,s.resids.2,resid.shades)

This time, Figure 5.6 shows that there are fewer red- and blue-shaded census blocks,

although perhaps still more than we might expect, and there is still some evidence

of spatial clustering. Adding the extra variable has improved things to some extent,

but perhaps there is more investigative research to be done. A more comprehensive

treatment of spatial analysis of spatial data attributes is given in Chapter 7.

Self-Test Question 2. The above code uses the choropleth function in GISTools

to produce a map of outlying residuals. Create a similar-looking map but using the

tm_shape function of the tmap package. You may find it useful to unpick the cho-

ropleth function, to think about passing a user-defined palette to tm_polygons,

to assign s.resids.2 as a blocks variable, and/or to pass a set of break values.

5.6 CREATING DISTANCE ATTRIBUTES

Distance is fundamental to spatial analysis. For example, we may wish to analyse

the number of locations (health facilities, schools, etc.) within a certain distance of

the features we are considering. In the exercise below, distance measures are used

to evaluate differences in accessibility for different social groups, as recorded in

census areas. Such approaches form the basis of supply and demand modelling

and provide inputs into location–allocation models.

Distance could be approximated using a series of buffers created at specific

distance intervals around our features (whether points or polygons). These could be

used to determine the number of features or locations that are within different

distance ranges, as specified by the buffers using the poly.counts function

above. However, distances can also be measured directly and there a number of

functions available in R to do this.

First, the most commonly used function is dist. This calculates the Euclidean

distance between points in n-dimensional feature space. The example below,


developed from the help for dist, shows how it is used to calculate the distances

between five records (rows) in a feature space of 20 hypothetical variables.

x <- matrix(rnorm(100), nrow = 5)

colnames(x) <- paste0("Var", 1:20)

dist(x)

as.matrix(dist(x))

If your data are projected (in metres, feet, etc.) then dist can also be used to calcu-

late the Euclidean distance between pairs of coordinates.

as.matrix(dist(coordinates(blocks))) # in feet

as.matrix(dist(coordinates(georgia2))) # in metres

When determining geographical distances, it is important that you consider the

projection properties of your data: if the data are projected using degrees (i.e. in lat-

itude and longitude) then this needs to be considered in any calculation of distance.

The gDistance function in rgeos calculates the Cartesian minimum (straight-

line) distance between two spatial datasets of class sp projected in planar coordi-

nates. Try entering:

# this will not work

gDistance(georgia[1,], georgia[2,])

# this will!

gDistance(georgia2[1,], georgia2[2,])

The st_distance function in sf is similar but is also able to calculate great circle distances for unprojected (latitude and longitude) points.

# convert to sf

georgia2_sf <- st_as_sf(georgia2)

georgia_sf <- st_as_sf(georgia)

st_distance(georgia2_sf[1,], georgia2_sf[2,])

st_distance(georgia_sf[1,], georgia_sf[2,])

# with points

sp <- st_as_sf(SpatialPoints(coordinates(georgia)))

st_distance(sp[1,], sp[1:3,])

The distance functions return a to–from matrix of the distances between each pair of

locations. These could describe distances between any objects, and such approaches

underpin supply and demand modelling and accessibility analyses.

For example, the code below uses gDistance to calculate the distances

between the centroids of the newhaven blocks data and the places locations.

The latter are simply random locations, but could represent any kind of facility or

supply feature, and the centroids of the census blocks in New Haven represent

demand locations. In the first few lines of code, the projections of the two variables


are set to be the same, before SpatialPoints is used to extract the geometric

centroids of the census block areas and the distance between places and cents

are calculated:

data(newhaven)

proj4string(places) <- CRS(proj4string(blocks))

cents <- SpatialPoints(coordinates(blocks),

proj4string = CRS(proj4string(blocks)))

# note the use of the ft2miles function to convert to miles

distances <- ft2miles(gDistance(places, cents, byid = T))

You can examine the result in relation to the inputs to gDistance and you will

see that the distances variable is a matrix of distances (in miles) from each of the

129 census block centroids to each of the nine locations described in the places

variable.

head(round(distances, 3))

It is possible to use the census block polygons in the above gDistance calcu-

lation, and the distances returned will be to the nearest point of the census area.

Using the census area centroid provides a more representative measure of the av-

erage distance experienced by people living in that area.
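A minimal sketch comparing the two (d.edge and d.cent are illustrative names; both are converted to miles):

# distances from each facility to the nearest point of each block polygon
d.edge <- ft2miles(gDistance(places, blocks, byid = TRUE))
# distances to the block centroids, as used above
d.cent <- ft2miles(gDistance(places, cents, byid = TRUE))
# centroid distances are never smaller than polygon-edge distances
summary(as.vector(d.cent - d.edge))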

A related function is the gWithinDistance function, which tests whether

each to–from distance pair is less than a specified threshold. It returns a matrix of

TRUE and FALSE describing whether the distances between the elements of the

two sp datasets are less than or equal to the specified distance or not. In

the example below the distance specified is 1.2 miles.

distances <- gWithinDistance(places, cents,

byid = T, dist = miles2ft(1.2))

You should note that the distance functions work with whatever distance units are

specified in the projections of the spatial features. This means the inputs need to

have the same units. Also remember that the newhaven data are projected in feet,

hence the use of the miles2ft and ft2miles functions.
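The logical matrix can be summarised directly; for example, a quick sketch counting how many of the nine locations lie within 1.2 miles of each block centroid (assuming, as above, that the rows correspond to the block centroids):

# TRUE values count as 1, so row sums give the number of nearby facilities
head(rowSums(distances))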

5.6.1 Distance Analysis/Accessibility Exercise

The use of distance measures in conjunction with census data is particularly useful

for analysing access to the supply of some facility or service for different social

groups. The code below replicates the analysis developed by Comber et al. (2008),

examining access to green spaces for different social groups. In this exercise a

hypothetical example is used: we wish to examine the equity of access to the loca-

tions recorded in the places variable (supply) for different ethnic groups as

recorded in the blocks dataset (demand), on the basis that we expect everyone to


be within 1 mile of a facility. We will use the census data to approximate the

number of people with and without access of less than 1 mile to the set of hypo-

thetical facilities.

First, the distances variable is recalculated in case it was overwritten in the

gWithinDistance example above. Then the minimum distance to a supply

facility is determined for each census area using the apply function. Finally, a

logical statement is used to generate a TRUE or FALSE statement for each block:

distances <- ft2miles(gDistance(places, cents, byid = T))

min.dist <- as.vector(apply(distances,1, min))

blocks$access <- min.dist < 1

# and this can be mapped

#qtm(blocks, "access")

The populations of each ethnic group in each census block can be extracted from

the blocks dataset:

# extract the ethnicity data from the blocks variable

ethnicity <- as.matrix(data.frame(blocks[,14:18])/100)

ethnicity <- apply(ethnicity, 2, function(x) (x * blocks$POP1990))

ethnicity <- matrix(as.integer(ethnicity), ncol = 5)

colnames(ethnicity) <- c("White", "Black",

"Native American", "Asian", "Other")

And then a crosstabulation is used to bring together the access data and the

populations:

# use xtabs to generate a crosstabulation

mat.access.tab = xtabs(ethnicity~blocks$access)

# then convert the crosstabulation to a data frame
data.set = as.data.frame(mat.access.tab)
# set the column names

colnames(data.set) = c("Access","Ethnicity", "Freq")

You should examine the data.set variable. This summarises all of the factors

being considered: access, ethnicity and the counts associated with all factor com-

binations. If we make an assumption that there is an interaction between ethnicity

and access, then this can be tested for using a generalised regression model with a

Poisson distribution using the glm function:

modelethnic = glm(Freq~Access*Ethnicity,

data=data.set,family=poisson)

# the full model can be printed to the console

# summary(modelethnic)

The model coefficient estimates show that there is significantly less access for some

groups than would be expected under a model of equal access when compared to


the largest ethnic group, White, which was listed first in the data.set variable,

and significantly greater access for the Other ethnic group. Examine the model

coefficient estimates, paying particular attention to the AccessTRUE: coefficients:

summary(modelethnic)$coef

Then assign these to a variable:

mod.coefs = summary(modelethnic)$coef

By exponentiating the coefficients, subtracting 1 and converting them to percentages, it is possible to attach some likelihoods to the access for different groups when compared

to the White ethnic group. Again, you should examine the terms in the model

outputs prefixed by AccessTRUE:, as below:

tab <- 100 * (exp(mod.coefs[,1]) - 1)

tab <- tab[7:10]

names(tab) <- colnames(ethnicity)[2:5]

round(tab, 1)

          Black Native American          Asian          Other
          -35.1          -11.7          -29.8          256.3

The results in tab tell us that some ethnic groups have significantly less access

to the hypothetical supply facilities than the White ethnic group (as recorded in

the census): Black 35% less, Native American 12% less (although this is not

significant), and Asian 30% less. The Other ethnic group has 256% more access

than the White ethnic group.

It is possible to visualise the variations in access for different groups using a

mosaic plot. Mosaic plots show the counts (i.e. population) as well as the residuals

associated with the interaction between groups and their access, the full details of

which were given in Chapter 3.

mosaicplot(t(mat.access.tab),xlab='',ylab='Access to Supply',

main="Mosaic Plot of Access",shade=TRUE,las=3,cex=0.8)

Self-Test Question 3. In working through the exercise above you have developed

a number of statistical techniques. In answering this self-test question you will

explore the impact of using census data summarised over different areal units in

your analysis. Specifically, you will develop and compare the results of two sta-

tistical models using different census areas in the newhaven datasets: blocks

and tracts. You will analyse the relationship between residential property

occupation and burglaries. You will need to work through the code below before

the tasks associated with this question are posed. To see the relationship between

the census tracts and the census blocks, enter:


plot(blocks,border='red')

plot(tracts,lwd=2,add=TRUE)

You can see that the census blocks are nested within the tracts.

The analysis described below develops a statistical model to describe the rela-

tionship between residential property occupation and burglary using two of the

New Haven crime variables related to residential burglaries. These are both point

objects, called burgres.f and burgres.n: the former is a list of burglaries

where entry was forced into the property, and the latter is a list of burglaries where

entry was not forced, suggesting that the property was left insecure, perhaps by

leaving a door or window open. The burglaries data cover the six-month period

between 1 August 2007 and 31 January 2008.

The questions you will consider are:

● Do both kinds of residential burglary occur in the same places – that is,

if a place is a high-risk area for non-forced entry, does it imply that it is

also a high-risk for forced entry?

● How does this relationship vary over different census units?

To investigate these, you should use a bivariate regression model that attempts to

predict the density of forced burglaries from the density of non-forced ones. The

indicators needed for this are the rates of burglary given the number of properties

at risk. You should use the variable OCCUPIED, present in both the census blocks

data frame and the census tracts data frame, to estimate the number of properties

at risk. If we were to compute rates per 1000 households, this would be:

1000 * (number of burglaries in block) / OCCUPIED, and since this is

over a six-month period, doubling this quantity gives the number of burglaries per

1000 households per year. However, entering:

blocks$OCCUPIED

shows that some blocks have no occupied housing, so the above rate cannot be

defined. To overcome this problem you should select the subset of the blocks with

more than zero occupied dwellings. For polygon spatial objects, each individual

polygon can be treated like a row in a data frame for the purposes of subset selec-

tion. Thus, to select only the blocks where the variable OCCUPIED is greater than

zero, enter:

blocks2 = blocks[blocks$OCCUPIED > 0,]

We can now compute the burglary rates for forced and non-forced entries by first

counting the burglaries in each block in blocks2 using the poly.counts func-

tion, dividing these numbers by the OCCUPIED counts and then multiplying by


2000 to get yearly rates per 1000 households. However, before we do this, you

should remember that you need the OCCUPIED attribute from blocks2 and not

blocks. Attach the blocks2 data and then calculate the two rate variables:

attach(data.frame(blocks2))

forced.rate = 2000 * poly.counts(burgres.f,blocks2)/OCCUPIED
notforced.rate = 2000 * poly.counts(burgres.n,blocks2)/OCCUPIED

detach(data.frame(blocks2))

You should have two rates stored in forced.rate and notforced.rate. A

first attempt at modelling the relationship between the two rates could be via sim-

ple bivariate regression, ignoring any spatial dependencies in the error term. This

is done using the lm function, which creates a simple regression model, model1:

model1 = lm(forced.rate~notforced.rate)

To examine the regression coefficients, enter:

summary(model1)

coef(model1)

The key things to note here are that forced.rate is related to notforced.

rate by the formula:

expected(forced.rate) = a + b × (notforced.rate)

where a is the intercept term and b is the slope or coefficient for the predictor vari-

able. If the coefficient for notforced.rate is statistically different from zero,

indicated in the summary of the model, then there is evidence that the two rates are

related. One possible explanation is that if burglars are active in an area, they will

only use force to enter dwellings when it is necessary, making use of an insecure

window or door if they spot the opportunity. Thus in areas where burglars are

active, both kinds of burglary could potentially occur. However, in areas where

burglars are less active it is less likely for either kind of burglary to occur.

Having outlined the approach, your specific tasks in this question are:

● To determine the coefficients a and b in the formula above for two

different analyses using the blocks and tracts datasets

● To comment on the difference between the analyses using different areal units

5.7 COMBINING SPATIAL DATASETS AND THEIR ATTRIBUTES

The point-in-polygon calculation using poly.counts generates counts of the

points falling in each polygon. A common situation in spatial analysis is the need


to combine (overlay) different polygon features that describe the spatial distribu-

tion of different variables, attributes or processes that are of interest. The problem

is that the data may have different underlying area geographies. In fact, it is com-

monly the case that different agencies, institutions and government departments

use different geographical areas, and even where they do not, geographical areas

frequently change over time. In these situations, we can use the intersection func-

tions (gIntersection in rgeos or st_intersection in sf) to identify the

area of intersection between different spatial datasets. With some manipulation it

is possible to determine the proportions of the objects in dataset X that fall into

each of the polygons of dataset Y. This section uses a worked example to illustrate

how this can be done in R. In the subsequent self-test question you will develop a

function to do this.

The key thing to note with all spatial operations, whether using sp and sf

datasets, is that the input data need to have the same projections. You can

examine their projection attributes with proj4string in sp and st_crs

in sf to check whether they need to be transformed, using spTransform

(sp) or st_transform (sf) functions to put the data into the same

projection.
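For example, a minimal sketch of checking and aligning projections, using objects already loaded in this chapter (us_states_proj_sf is an illustrative name):

# inspect the CRS of an sp object and an sf object
proj4string(tracts)
st_crs(us_states2_sf)
# transform one sf layer to the CRS of another before intersecting them
us_states_proj_sf <- st_transform(us_states_sf, st_crs(us_states2_sf))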

The stages in this analysis are as follows:

1. Create a zone dataset for which the number of houses in each zone will

be calculated. The New Haven tracts data include the variable

HSE_UNITS, describing the number of residential properties in each

census tract. In this case the zones are hypothetical, but could perhaps

be zones used by the emergency services for planning purposes and

resource allocation.

2. Do an overlay of the new zones and the original areas. The key here is

to make sure that both the layers have an identifier that


allows the

proportions of each original area in each zone to be calculated. This

will then be used to allocate houses based on the proportion of each

intersecting area in each zone.

First, you should make sure you have the tmap and sf packages loaded. Then

create the zones, number them with an ID and plot these on a map with the tracts

data. This is easily done by defining a grid and then converting this to a

SpatialPolygonsDataFrame object. Enter:

library(GISTools)

library(sf)

## Linking to GEOS 3.6.1, GDAL 2.1.3, proj.4 4.9.3

library(tmap)

data(newhaven)

## define sample grid in polygons


bb <- bbox(tracts)

grd <- GridTopology(cellcentre.offset=

c(bb[1,1]-200,bb[2,1]-200),

cellsize=c(10000,10000), cells.dim = c(5,5))

int.layer <- SpatialPolygonsDataFrame(

as.SpatialPolygons.GridTopology(grd),

data = data.frame(c(1:25)), match.ID = FALSE)

ct <- proj4string(blocks)

proj4string(int.layer) <- ct

proj4string(tracts) <- ct

names(int.layer) <- "ID"

You can examine the intersection layer:

plot(int.layer)

Next, you should undertake an intersection of the zone and area layers. Projec-

tions can be checked using proj4string(int.layer) and proj4string

(tracts). These have the same projections, so they can be intersected. The code

below converts them to sf format and then uses st_intersection:

int.layer_sf <- st_as_sf(int.layer)

tracts_sf <- st_as_sf(tracts)

int.res_sf <- st_intersection(int.layer_sf, tracts_sf)

You can examine the intersected data, the original data and the zones in the same

plot window, as in Figure 5.7. Remember that the grid.arrange function in the

gridExtra package allows multiple graphics to be included in the plot.

# plot and label the zones

p1 <- tm_shape(int.layer_sf) + tm_borders(lty = 2) +

tm_layout(frame = F) +

tm_text("ID", size = 0.7) +

# plot the tracts

tm_shape(tracts_sf) + tm_borders(col = "red", lwd = 2)

# plot the intersection, with the extent set by int.layer_sf

p2 <- tm_shape(int.layer_sf) + tm_borders(col="white") +

tm_shape(int.res_sf) + tm_polygons("HSE_UNITS", palette = blues9) +

tm_layout(frame = F, legend.show = F)

library(grid)

grid.newpage()

pushViewport(viewport(layout=grid.layout(1,2)))

print(p1, vp=viewport(layout.pos.col = 1))

print(p2, vp=viewport(layout.pos.col = 2))

As in the gIntersection operation described in earlier sections, you can exam-

ine the result of the intersection:

head(int.res_sf)

You will see that the data frame of the intersected object contains composites

of the inputs. These links can be used to create attributes for the intersection

output data.


Figure 5.7 The zones and census tracts data before and after intersection

Recall the need to have an identifier for both the zone and area layers. The data frame of the intersection output, int.res_sf, contains the identifiers of both input layers: the ID variable of int.layer_sf and the T009075H_I variable of tracts_sf. In this case, we wish to summarise the HSE_UNITS of tracts_sf

over the zones of int.layer_sf. Here the functionality of dplyr single-table

operations that were introduced in Chapter 4 can be useful. However, first we need

to work out what proportion of the original tracts areas intersect with each

zone, and we can weight the HSE_UNITS variable appropriately to proportionally

allocate the counts of houses to the zones. Knowing the unique identifiers of each

polygon in both of the intersected layers is critical for working out proportions.

# generate area and proportions

int.areas <- st_area(int.res_sf)

tract.areas <- st_area(tracts_sf)

# match tract area to the new layer

index <- match(int.res_sf$T009075H_I, tracts$T009075H_I)

tract.areas <- tract.areas[index]

tract.prop <- as.vector(int.areas)/as.vector(tract.areas)

The tract.prop object can be used to create a variable in the data frame of the new

layer, using the index variable which indicates in which of the original tract areas

each intersected area belongs. (Note that you could examine index to see this.)

int.res_sf$houses <- tracts$HSE_UNITS[index] * tract.prop

And this can be summarised using the functionality in dplyr and linked back to

the original int.layer_sf:


library(tidyverse)

houses <- summarise(group_by(int.res_sf, ID), count = sum(houses))

# create an empty vector

int.layer_sf$houses <- 0

# and populate this using houses$ID as the index

int.layer_sf$houses[houses$ID] <- houses$count

The results can be plotted as in Figure 5.8 and checked against the original inputs

in Figure 5.7.

tm_shape(int.layer_sf) +

tm_polygons( "houses", palette = "Greens",

style = "kmeans", title = "No. of houses") +

tm_layout(frame = F, legend.position = c(1,0.5)) +

tm_shape(tracts_sf) + tm_borders(col = "black")

Figure 5.8 The zones shaded by the number of households after intersection with the census tracts


Self-Test Question 4. Write a function that will return an intersected dataset, with

an attribute of counts of some variable (houses, population, etc.) as held in another

sf format dataset. Base your function on the code used in the illustrated exam-

ple above. Compile it such that the function returns the portion of the variable

(typically this should be a count) covered by each zone. For example, it should

be able to intersect the int.layer_sf layer with the blocks_sf layer and

return an sf dataset with an attribute of the number of people, as described in

the POP1990 variable of blocks, covered by each zone. You should remember

that many spatial functions require their inputs to have the same projections. The

int.layer_sf defined above and the tracts originally had no projections.

You may find it useful to check and/or align the input layers – for example, the

int.layer defined above and the blocks data in the following way using the

rgdal or sf packages:

## in rgdal

library(rgdal)

ct <- proj4string(blocks)

proj4string(int.layer) <- CRS(ct)

blocks <- spTransform(blocks, CRS(proj4string(int.layer)))

## in sf

library(sf)

ct <- st_crs(blocks_sf)

st_crs(int.layer_sf) <- (ct)

blocks_sf <- st_transform(blocks_sf, st_crs(int.layer_sf))

Your function will have to take identifier variables for the layer and the intersect

layer as inputs, and you will find it useful in your code to assign these to new ID

variables in each layer. For example, your function could require the following

parameters when compiled, setting some default values:

# define the function

area_intersect_func <- function(int.sf = int.layer_sf, layer.sf = blocks_sf,
                                int.ID = "ID", layer.ID = "T009075H_I",
                                target = "POP1990"){
...
...
}

Also, extracting values from data in sf format can be tricky. A couple of possible

ways are:

# directly from the data frame

as.vector(data.frame(int.res_sf[,"T009075H_I"])[,1])


# set the geometry to null and then extract

st_geometry(int.res_sf) <- NULL

int.res_sf[,"T009075H_I"]


# using select from dplyr

as.vector(unlist(select(as.data.frame(int.res_sf), T009075H_I)))

5.8 CONVERTING BETWEEN RASTER AND VECTOR

Very often we would like to move or convert our data between vector and raster

environments. In fact the very persistence of these dichotomous data structures,

with separate raster and vector functions and analyses in many commercial GIS

software programs, is one of the long-standing legacies in GIS.

This section briefly describes methods for converting data between raster and vector

structures. There are three reasons for this brief treatment. First, many packages define

their own data structures. For example, the functions in the PBSmapping package

require a PolySet object to be passed to them. This means that conversion between

one class of raster objects and, for example, the sp class of SpatialPolygons will

require different code. Second, the separation between raster and vector analysis envi-

ronments is no longer strictly needed, especially if you are developing your spatial

analyses using R, with the easy ability for users to compile their own functions and to

create their own analysis tools. Third, advanced raster mapping and analysis is exten-

sively covered in other books (see, for example, Bivand et al., 2013).

The sections below describe methods for converting the sp class of objects

(SpatialPoints, SpatialLines and SpatialPolygons, etc.) and the sf

class of objects (see the first sf vignette) as well as to and from the RasterLayer

class of objects as defined in the raster package, created by Hijmans and van

Etten (2014). They also describe how to convert between sp classes, for example to

and from SpatialPixels and SpatialGrid sp objects.

5.8.1 Vector to Raster

In this section simple approaches for converting are illustrated using datasets in

the tornados dataset that you have already encountered. We shall examine

techniques for converting the sp class of objects to the raster class, considering

in turn points, lines and areas.

Unfortunately, at the time of writing there is no parallel operation for convert-

ing from sf formats to raster formats. If you have data in sf format, you could

convert to an sp format before converting to raster format as described earlier:

# convert from sf to sp

sp <- as(sf, "Spatial")

# do the conversions...as below

You will need to load the data and the packages – you may need to install the

raster package using the install.packages function if this is the first time

that you have used it.


5.8.1.1 Converting Points to Raster

First, convert from sp to raster formats. The torn2 dataset is a SpatialPointsDataFrame object:

library(GISTools)

library(raster)

data(tornados)

class(torn2)

Then create a raster and use the rasterize function to convert the data. Note the

need for a function to be specified to determine how the point dataset is summarised

over the raster grid and, if the data have attributes, which attribute is to be summarised:

# rasterize a point attribute

r <- raster(nrow = 180 , ncols = 360, ext = extent(us_states2))

r <- rasterize(torn2, r, field = "INJ", fun=sum)

# rasterize count of point dataset

r <- raster(nrow = 180 , ncols = 360, ext = extent(us_states2))

r <- rasterize(as(torn2, "SpatialPoints"), r, fun=sum)

The resultant raster has cells describing different tornado densities that can be

mapped as in Figure 5.9:

# set the plot extent by specifying the plot colour 'white'
tm_shape(us_states2)+
tm_borders("white")+
tm_shape(r) +
tm_raster(title = "Injured", n= 7) +
tm_shape(us_states2) +
tm_borders() +
tm_layout(legend.position = c("left", "bottom"))

Figure 5.9 Converting points to raster format

5.8.1.2 Converting Lines to Raster

For illustrative purposes the code below creates a SpatialLinesDataFrame

object of the outline of the polygons with an attribute based on the area of

the state.

# Lines

us_outline <- as(us_states2 , "SpatialLinesDataFrame")

r <- raster(nrow = 180 , ncols = 360, ext = extent(us_states2))

r <- rasterize(us_outline , r, "AREA")

This takes a bit longer to run but again the results can be mapped and this time

with the shading indicating area (Figure 5.10):

tm_shape(r) +

tm_raster(title = "State Area", palette = "YlGn") +

tm_style("albatross") +

tm_layout(legend.position = c("left", "bottom"))

Figure 5.10 Converting lines to raster format


5.8.1.3 Converting Polygons or Areas to Raster

Finally, polygons can easily be converted to a RasterLayer object using tools in

the raster package and plotted as in Figure 5.11. In this case the 1997 population

for each state is used to generate raster cell or pixel values.

# Polygons

r <- raster(nrow = 180 , ncols = 360, ext = extent(us_states2))

r <- rasterize(us_states2, r, "POP1997")

tm_shape(r) +

tm_raster(title = "Population", n=7, style="kmeans", palette="OrRd") +

tm_layout( legend.outside = T,

legend.outside.position = c("left"),

frame = F)

It is instructive to examine the outputs of these processes. Enter:

r

This summarises the characteristics of the raster object, including the resolution,

dimensions and extent. The data values of r can be accessed using the getValues

function:

unique(getValues(r))

It is possible to specify particular dimensions for the raster grid cells, rather than

just dividing the dataset’s extent by ncol and nrow in the raster function. The

code below is a bit convoluted, but it allocates values from a polygon variable to
raster grid cells of a specified size.

# specify a cell size in the projection units

d <- 50000

dim.x <- d

dim.y <- d

bb <- bbox(us_states2)

Figure 5.11 Converting polygons to raster format


# work out the number of cells needed

cells.x <- (bb[1,2] - bb[1,1]) / dim.x
cells.y <- (bb[2,2] - bb[2,1]) / dim.y

round.vals <- function(x){

if(as.integer(x) < x) {

x <- as.integer(x) + 1

} else {x <- as.integer(x)

}}

# the cells cover the data completely

cells.x <- round.vals(cells.x)

cells.y <- round.vals(cells.y)

# specify the raster extent

ext <- extent(c(bb[1,1], bb[1,1]+(cells.x*d),
bb[2,1], bb[2,1]+(cells.y*d)))

# now run the raster conversion

r <- raster(ncol = cells.x,nrow =cells.y)

extent(r) <- ext

r <- rasterize(us_states2, r, "POP1997")

# and map

tm_shape(r) +

tm_raster(col = "layer", title = "Populations",

palette = "Spectral", style = "kmeans") +

tm_layout(frame = F, legend.show = T,

legend.position = c("left","bottom"))

5.8.2 Converting to sp raster classes

You may have noticed that the sp package also has two data classes that are able

to represent raster data, or data located on a regular grid. These are

SpatialPixelsDataFrame and SpatialGridDataFrame. It is possible to

convert the rasters to these. First, create a spatially coarser raster layer of US states

similar to the above.

r <- raster(nrow = 60 , ncols = 120, ext = extent(us_states2))

r <- rasterize(us_states2 , r, "BLACK")

Then the as function can be used to coerce this to SpatialPixelsDataFrame

and SpatialGridDataFrame objects, which can also be mapped using the

image, plot and tm_raster commands in the usual way:

g <- as(r, 'SpatialGridDataFrame')

p <- as(r, 'SpatialPixelsDataFrame')

# image(g, col = topo.colors(51))

You can examine the data values held in the data frame by entering:

head(data.frame(g))

head(data.frame(p))
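The conversion also works in the other direction; as a brief sketch, the raster function will coerce a SpatialGridDataFrame (or SpatialPixelsDataFrame) back to a RasterLayer:

# coerce the sp grid object back to a RasterLayer
r_back <- raster(g)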


The data can also be manipulated to select certain features, in this case selecting the

states with populations greater than 10 million people. The code below assigns NA

values to the data points that fail this test and plots the data as in Figure 5.12.

# set up and create the raster

r <- raster(nrow = 60 , ncols = 120, ext = extent(us_states2))

r <- rasterize(us_states2 , r, "POP1997")

r2 <- r

# subset the data

r2[r < 10000000] <- NA

g <- as(r2, 'SpatialGridDataFrame')

p <- as(r2, 'SpatialPixelsDataFrame')

# not run

# image(g, bg = "grey90")

tm_shape(r2) +

tm_raster(col = "layer", title = "Pop",

palette = "Reds", style = "cat") +

tm_layout( frame = F, legend.show = T,

legend.position = c("left","bottom")) +

tm_shape(us_states2) + tm_borders()

Figure 5.12 Selecting data in a raster object

5.8.2.1 Raster to Vector

The raster package contains a number of functions for converting from raster to
vector formats. These include rasterToPolygons which converts to a

SpatialPolygonsDataFrame object, and rasterToPoints which converts


to a matrix object. Both are illustrated in the code below and the results shown

in Figure 5.13. Notice how the original raster imposes a grid structure on the poly-

gons that are created. In this case the default mapping options with plot are

easier than using the options in the tmap or ggplot2 packages.

# load the data and convert to raster

data(newhaven)

# set up the raster, r

r <- raster(nrow = 60 , ncols = 60, ext = extent(tracts))

# convert polygons to raster

r <- rasterize(tracts , r, "VACANT")

poly1 <- rasterToPolygons(r, dissolve = T)

# convert to points


points1 <- rasterToPoints(r)

# plot the points, rasterised polygons & original polygons

par(mar=c(0,0,0,0))

plot(points1, col = "grey", axes = FALSE, xaxt='n', ann=FALSE, asp= 1)

plot(poly1, lwd = 1.5, add = T)

plot(tracts, border = "red", add = T)

Figure 5.13 Converting from rasters to polygons and points, with the original polygon data in red

However, regarding tmap … it can be done!

# first convert the point matrix to sp format

points1.sp <- SpatialPointsDataFrame(points1[,1:2],

data = data.frame(points1[,3]))

# then plot

tm_shape(poly1) + tm_borders(col = "black") +

tm_shape(tracts) + tm_borders(col = "red") +

tm_shape(points1.sp) + tm_dots(col = "grey", shape = 1) +

tm_layout(frame = F)

5.9 INTRODUCTION TO RASTER ANALYSIS

This section provides the briefest of overviews of how raster data may be manipu-

lated and overlaid in R in a similar way to a standard GUI GIS such as QGIS. This

section will cover the reclassification of raster data as a precursor to some basic

methods for performing what is sometimes referred to as map algebra, using a raster

calculator or raster overlay. As a reminder, many packages include user guides in the

form of a PDF document describing the package. This is listed at the top of the pack-

age index page. The raster package includes example code for the creation of

raster data and different types of multi-layered raster composites. These will not be

covered in this section. Rather, the coded examples illustrate some basic methods

for manipulating and analysing raster layers in a similar way to what is often

referred to as sieve mapping, multi-criteria evaluation or multi-criteria analysis. In these,

different layers are combined to identify locations that have specific combinations

of properties, such as height above sea level > 200 m AND soil_type is ‘good’.

Raster analysis requires that the different input data have a number of charac-

teristics in common: typically they should cover the same spatial extent, have the

same spatial resolution (grid or cell size), and, as with data for any spatial analysis,

they should have the same projection or coordinate system. The data layers used

in the example code in this section all have these properties. When you come to

develop your own analyses, you may have to perform some manipulation of the

data prior to analysis to ensure that your data also have these properties.
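As a hedged sketch of the kind of preparation that may be required (r_a and r_b here are hypothetical RasterLayer objects on different grids and projections, not objects used elsewhere in this chapter), the projectRaster and resample functions in the raster package can be used to bring layers into alignment:

library(raster)
# reproject r_b into the coordinate system of r_a
r_b_proj <- projectRaster(r_b, crs = crs(r_a))
# resample the reprojected layer onto r_a's grid so extent and resolution match
r_b_aligned <- resample(r_b_proj, r_a, method = "bilinear")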

5.9.1 Raster Data Preparation

The Meuse data in the sp package will be used to illustrate the functions below. You

could read in your raster data using the readGDAL function in the rgdal package,


which provides an excellent R interface into the Geospatial Data Abstraction Library

(GDAL). This has been described as the ‘swiss army knife for spatial data’

(https://cran.r-project.org/web/packages/sf/vignettes/sf2.html) as it is able to read or

write vector and raster data of all file formats. You can inspect the properties and

attributes of the Meuse data by examining the associated help files ?meuse.grid.
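As a minimal sketch of reading external raster data (elevation.tif is a hypothetical GeoTIFF in the working directory, used purely for illustration):

library(rgdal)
library(raster)
# read via GDAL into an sp SpatialGridDataFrame
elev_grid <- readGDAL("elevation.tif")
# or read directly into a RasterLayer
elev_r <- raster("elevation.tif")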

library(GISTools)

library(raster)

library(sp)

# load the meuse.grid data

data(meuse.grid)

# create a SpatialPixels DF object

coordinates(meuse.grid) <- ~x+y

proj4string(meuse.grid) <- CRS("+init=epsg:28992")

meuse.grid <- as(meuse.grid, "SpatialPixelsDataFrame")

# create 3 raster layers

r1 <- raster(meuse.grid, layer = 3) #dist

r2 <- raster(meuse.grid, layer = 4) #soil

r3 <- raster(meuse.grid, layer = 5) #ffreq

The code above loads the meuse.grid data, converts it to a SpatialPixels-

DataFrame format and then creates three separate raster layers in the raster

format. These three layers will form the basis of the analyses in this section. You

could visually inspect their attributes by using some simple image commands:

# set the plot parameters for 1 row and 3 columns

par(mfrow = c(1,3))

image(r1, asp = 1)

image(r2, asp = 1)

image(r3, asp = 1)

# reset par

par(mfrow = c(1,1))

5.9.2 Raster Reclassification

Raster analyses frequently employ simple numerical and mathematical operations.

In essence, they allow you to add, multiply, subtract, etc., raster data layers, and

these operations are performed on a cell-by-cell basis. So for an addition this might

be in the form:

Raster_Result <- Raster.Layer.1 + Raster.Layer.2

Remembering that raster data are numerical, if the Raster.Layer.1 and

Raster.Layer.2 data both contained the values 1, 2 and 3, it would be difficult

to know the origin, for example, of a value of 3 in the Raster_Result output.

Specifically, if the r2 and r3 layers created above are considered, these both con-

tain values in the range 1–3 describing soil types and flooding frequency, respec-

tively (as described in the help for the meuse.grid data). Therefore we may wish


to reclassify them in some way to understand the results of any combination or

overlay operation.

It is possible to reclassify raster data in a number of ways. First, the raster

data values can be manipulated using simple mathematical operations. These

produce raster outputs describing the mathematical combination of the input

raster layers. The code below multiplies one of the layers by 10. This means that

the result combining both raster data layers using the add (+) function contains

a fixed set of values (in this case 9), each of which can be traced back to the
combination of inputs that produced it. A value of 32 would indicate values of 3 in r3
(a flooding frequency of ‘one in 50 years’) and 2 in r2 (a soil type of ‘Rd90C/VII’, whatever
that is). The results of this simple overlay are shown in Figure 5.14 and in the
table of values printed.

Figure 5.14 The result of a simple raster overlay

Raster_Result <- r2 + (r3 * 10)

table(getValues(Raster_Result))

11 12 13 21 22 23 31 32 33

535 242 2 736 450 149 394 392 203

tm_shape( Raster_Result) + tm_raster(col = "layer", title = "Values",

palette = "Spectral", style = "cat") +

tm_layout(frame = F)

A second approach to reclassifying raster data is to employ logical operations on the

data layers prior to combining them. These return TRUE or FALSE for each raster
grid cell, depending on whether it satisfies the logical condition. The resultant lay-
ers can then be combined in mathematical operations as above. For example, con-

sider the analysis that wanted to identify the locations in the Meuse data that

satisfied the following conditions:

● Are greater than half of the rescaled distance away from the Meuse River

● Have a soil class of 1, that is calcareous weakly developed meadow

soils, light sandy clay

● Have a flooding frequency class of 3, namely once in a 50-year period

The following logical operations can be used to do this:

r1a <- r1 > 0.5

r2a <- r2 >= 2

r3a <- r3 < 3

These can then be combined using specific mathematical operations, depending

on the analysis. For example, a simple suitability


multi-criteria evaluation, where

all the conditions have to be true and where a crisp, Boolean output is required,

would be coded using the multiplication function as follows, with the result shown

in Figure 5.15:

Raster_Result <- r1a * r2a * r3a

table(getValues(Raster_Result))

0 1

2924 179

tm_shape(Raster_Result) +

tm_raster(title = "Values", style = "cat") +

tm_style("cobalt")

Figure 5.15 A raster overlay using a combinatorial AND

This is equivalent to a combinatorial AND operation, also known as an intersection.

Alternatively, the analysis may be interested in identifying where any of the condi-

tions are true, a combinatorial OR also known as a union, with a different result as

shown in Figure 5.16:

Raster_Result <- r1a + r2a + r3a

table(getValues(Raster_Result))

0 1 2 3

386 1526 1012 179

# plot the result and add a legend
tm_shape(Raster_Result) +
  tm_raster(title = "Conditions", style = "cat", palette = "Spectral") +
  # tm_layout(frame = F, bg.color = "grey85") +
  tm_style("col_blind")

Figure 5.16 A raster overlay using a combinatorial OR


5.9.3 Other Raster Calculations

The above examples illustrated code to reclassify raster layers and then combine

them using simple mathematical operations. You should note that it is possible to

apply any kind of mathematical function to a raster layer. For example:

Raster_Result <- sin(r3) + sqrt(r1)

Raster_Result <- ((r1 * 1000) / log(r3)) * r2

tmap_mode('view')

tm_shape(Raster_Result) + tm_raster(col = "layer", title = "Value")

tmap_mode("plot")

which produces Figure 5.17.

Figure 5.17 A raster generated from a number of mathematical operations

A number of other operations are possible using different functions included in

the raster package. They are not given a full treatment here, but are introduced

such that the interested reader can explore them in more detail.

The calc function performs a computation over a single raster layer, in a simi-

lar manner to the mathematical operations in the preceding text. The advantage of

the calc function is that it should be faster when computing more complex

operations over large raster datasets.

my.func <- function(x) {log(x)}

Raster_Result <- calc(r3, my.func)

# this is equivalent to

Raster_Result <- calc(r3, log)

The overlay function provides an alternative to the mathematical operations il-

lustrated in the reclassification examples above for combining multiple raster lay-

ers. The advantage of the overlay function, again, is that it is more efficient for

performing computations over large raster objects.

Raster_Result <- overlay(r2, r3,
fun = function(x, y) {return(x + (y * 10))} )

# alternatively using a stack


my.stack <- stack(r2, r3)

Raster_Result <- overlay(my.stack, fun = function(x, y) (x + (y * 10)) )

There are a number of distance functions for computing distances to specific fea-

tures. The distanceFromPoints function calculates the distance between a set

of points to all cells in a raster surface and produces a distance or cost surface as in

Figure 5.18.

# load meuse and convert to points

data(meuse)

coordinates(meuse) <- ~x+y

# select a point layer

soil.1 <- meuse[meuse$soil == 1,]

# create an empty raster layer

# this is based on the extent of meuse.grid

r <- raster(meuse.grid)

dist <- distanceFromPoints(r, soil.1)

plot( dist,asp = 1,

xlab='',ylab='',xaxt='n',yaxt='n',bty='n', axes =F)

plot(soil.1, add = T)

# the tmap version but this is not as nice as plot

# tm_shape(dist) + tm_raster(palette = rev(terrain.colors(10)),

# title = "Distance", style = "kmeans") +

# tm_layout(frame = F, legend.outside = T)

Figure 5.18 A raster analysis of distance to points

You are encouraged to explore the raster package (and indeed the sp pack-

age) in more detail if you are specifically interested in raster-based analyses. There

are a number of other distance functions, functions for computing over neighbour-

hoods (focal functions), accessing raster cell values and assessing spatial configura-

tions of raster layers.
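For example, a minimal sketch of one such neighbourhood (focal) operation, applied to the distance surface created above, is a 3 × 3 moving-window mean:

# 3 x 3 moving-window mean of the distance surface
dist_smooth <- focal(dist, w = matrix(1/9, nrow = 3, ncol = 3), na.rm = TRUE)
plot(dist_smooth)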


5.10 ANSWERS TO SELF-TEST QUESTIONS

Q1: Produce maps of the densities of breaches of the peace in each census block in

New Haven in breaches per square kilometre. First, using sf formats:

# convert to sf

breach_sf <- st_as_sf(breach)

blocks_sf <- st_as_sf(blocks)

# point in polygon

b.count <- rowSums(st_contains(blocks_sf,breach_sf,sparse = F))

# area calculation

b.area <- ft2miles(ft2miles(st_area(blocks_sf))) * 2.58999

# combine and assign to the blocks data

blocks_sf$b.p.sqkm <- as.vector(b.count/b.area)

# map

tm_shape(blocks_sf) +

tm_polygons("b.p.sqkm", style = "kmeans", title ="")

Second, using sp formats:

# point in polygon

b.count <- poly.counts(breach, blocks)

# area calculation

b.area <- ft2miles(ft2miles(gArea(blocks, byid = T))) * 2.58999

# combine and assign to the blocks data

blocks$b.p.sqkm <- b.count/b.area

tm_shape(blocks) + tm_polygons("b.p.sqkm", style = "kmeans", title ="")

Q2: Produce a map of the outlying residuals using tm_shape functions etc. from

the tmap package.

blocks$s.resids.2 <- s.resids.2

tm_shape(blocks) +

tm_polygons("s.resids.2", breaks = c(-8, -2, 2, 8),

auto.palette.mapping = F,

palette = resid.shades$cols)

Q3: Determine the coefficients a and b for two different analyses using blocks and

tracts data and comment on the difference between the analyses using different

areal units. First, calculate the coefficients for the analysis using census blocks:

# Analysis with blocks

blocks2 = blocks[blocks$OCCUPIED > 0,]

attach(data.frame(blocks2))

forced.rate = 2000*poly.counts(burgres.f,blocks2)/OCCUPIED
notforced.rate = 2000*poly.counts(burgres.n,blocks2)/OCCUPIED

model1 = lm(forced.rate~notforced.rate)

coef(model1)

(Intercept) notforced.rate

5.4667222 0.3789628

detach(data.frame(blocks2))


The results can be printed out:

# from the model

coef(model1)

# or in a formatted statement

cat("expected(forced rate)=",coef(model1)[1], "+",

coef(model1)[2], "∗ (not forced rate)")

Now calculate the coefficients using census tracts:

# analysis with tracts

tracts2 = tracts[tracts$OCCUPIED > 0,]

# align the projections

ct <- proj4string(burgres.f)

proj4string(tracts2) <- CRS(ct)

# now do the analysis

attach(data.frame(tracts2))

forced.rate = 2000*poly.counts(burgres.f,tracts2)/OCCUPIED
notforced.rate = 2000*poly.counts(burgres.n,tracts2)/OCCUPIED

model2=lm(forced.rate~notforced.rate)

detach(data.frame(tracts2))

Again the results can be printed out:

# from the model

coef(model2)

# or in a formatted statement

cat("expected(forced rate) = ",coef(model2)[1], "+",

coef(model2)[2], "∗ (not forced rate)")

These two analyses show that, in this case, there are only small differences between

the coefficients arising from analyses using different areal units. Print out both results:

cat("expected(forced rate) = ",

coef(model1)[1], "+", coef(model1)[2], "∗ (not forced rate)")

cat("expected(forced rate) = ",

coef(model2)[1], "+", coef(model2)[2], "∗ (not forced rate)")

expected(forced rate) = 5.466722 + 0.3789628 ∗ (not forced rate)

expected(forced rate) = 5.243477 + 0.4132951 ∗ (not forced rate)

This analysis tests what is referred to as the modifiable areal unit problem, first identi-

fied in the 1930s, and extensively researched by Stan Openshaw in the 1970s and beyond –
see Openshaw (1984) for a comprehensive review. Variability in analyses can arise when
data are summarised over different spatial units, and the importance of the modifiable
areal unit problem as a critical consideration in spatial analysis cannot be overstated.

Q4: Write a function that will return an intersected dataset, with an attribute of counts

of some variable (houses, population, etc.) as held in another sf format dataset.

int.count.function <- function(
  int_sf, layer_sf, int.ID, layer.ID, target.var) {


# Use the IDs to assign ID variables to both inputs

# this makes the processing easier later on

int_sf$IntID <- as.vector(data.frame(int_sf[, int.ID])[,1])

layer_sf$LayerID <- as.vector(data.frame(layer_sf[, layer.ID])[,1])

# do the same for the target.var

layer_sf$target.var<-as.vector(data.frame(layer_sf[,target.var])[,1])

# check projections

if(st_crs(int_sf) != st_crs(layer_sf))

print("Check Projections!!!")

# do intersection

int.res_sf <- st_intersection(int_sf, layer_sf)

# generate area and proportions

int.areas <- st_area(int.res_sf)

layer.areas <- st_area(layer_sf)

# match tract area to the new layer

v1 <- as.vector(data.frame(int.res_sf$LayerID)[,1])

v2 <- as.vector(data.frame(layer_sf$LayerID)[,1])

index <- match(v1, v2)

layer.areas <- layer.areas[index]

layer.prop <- as.vector(int.areas/as.vector(layer.areas))

# create a variable of intersected values

int.res_sf$NewVar <-
  as.vector(data.frame(layer_sf$target.var)[,1][index]) * layer.prop
# summarise this and link back to int_sf
NewVar <- summarise(group_by(int.res_sf, IntID), count = sum(NewVar))
# create an empty vector
int_sf$NewVar <- 0
# and populate this using ID as the index
int_sf$NewVar[NewVar$IntID] <- NewVar$count
return(int_sf)
}

You can test this:

# convert blocks to sf

blocks_sf <- st_as_sf(blocks)

# run the function

test.res <- int.count.function(

int_sf <- int.layer_sf,

layer_sf <- blocks_sf,

int.ID <- "ID",

layer.ID <- "NEWH075H_I",

target.var <- "POP1990")

plot(test.res[,"NewVar"])

REFERENCES

Bivand, R.S., Pebesma, E.J. and Gómez-Rubio, V. (2013) Applied Spatial Data Analysis
with R, 2nd edition. New York: Springer.
Comber, A.J., Brunsdon, C. and Green, E. (2008) Using a GIS-based network
analysis to determine urban greenspace accessibility for different ethnic and
religious groups. Landscape and Urban Planning, 86: 103–114.
Hijmans, R.J. and van Etten, J. (2014) raster: Geographic data analysis and modeling.
R package version 2.6-7. http://cran.r-project.org/package=raster.
Openshaw, S. (1984) The Modifiable Areal Unit Problem, CATMOG 38. Norwich:
Geo Abstracts. https://www.uio.no/studier/emner/sv/iss/SGO9010/openshaw1983.pdf.

6

POINT PATTERN ANALYSIS

USING R

6.1 INTRODUCTION

In this and the next chapter, some key ideas of spatial statistics will be outlined,

together with examples of statistical analysis based on these ideas, via R. The two

main areas of spatial statistics that are covered are those relating to point patterns

(this chapter) and spatially referenced attributes (next chapter). One of the character-

istics of R, as open source software, is that R packages are contributed by a variety

of authors, each using their own individual styles of programming. In particular,

for point pattern analysis the spatstat package is often used, while for spatially

referenced attributes, spdep is favoured. On the one hand, spdep handles spa-

tial data in the same way as sp, maptools and GISTools, while on the other

hand spatstat does not. Also, for certain specific tasks, other packages may be

called upon whose mode of working differs from either of these packages. While

this may seem a daunting prospect, the aim of these two chapters is to introduce

the key ideas of spatial statistics, as well as providing guidance in the choice of

packages, and help in converting data formats. Fortunately, although some pack-

ages use different data formats, conversion is generally straightforward, and exam-

ples will appear throughout the chapters, whenever necessary.

6.2 WHAT IS SPECIAL ABOUT SPATIAL?

In one sense, the motivations for statistical analysis of spatial data are the same as

those for non-spatial data:

● To explore and visualise the data

● To create and calibrate models of the process generating the data

● To test hypotheses related to the processes generating the data


However, a number of these requirements are strongly influenced by the nature of

spatial data. The study of mapping and cartography may be regarded as an entire

subject area within the discipline of information visualisation, which focuses

exclusively on geographical information. In addition, the kinds of hypotheses one

might associate with spatial data are quite distinctive – for example, focusing on

the detection and location of spatial clusters of events, or on whether two kinds of

event (say, two different types of crime) have the same spatial distribution.

Similarly, models that are appropriate for spatial data are distinctive, in that they

often have to allow for spatial autocorrelation in their random component – for

example, a regression model generally includes a random error term, but if the

data are spatially referenced, one might expect nearby errors to be correlated. This

differs from a ‘standard’ regression model where each error term is considered to

apply independently, regardless of location. In the remainder of this section, point

patterns (one of the two key types of spatial data covered in this book) will be con-

sidered. First, these will be described.

6.2.1 Point Patterns

Point patterns are collections of geographical points assumed to have been

generated by a random process. In this case, the focus of inference and model-

ling is on model(s) of the random processes and their comparison. Typically, a

point dataset consists of a set of observed (x, y) coordinates, say {(x1, y1), (x2, y2),

…, (xn, yn)}, where n is the number of observations. As an alternative notation,

each point could be denoted by a vector xi, where xi = (xi, yi). Using the data

formats used in sp, maptools and so on, these data could be represented as

SpatialPoints or SpatialPointsDataFrame objects. Since these data

are seen as random, many models are concerned with the probability densities

of the random points, ν(xi).

Another area of interest is the interrelation between the points. One way of think-

ing about this is to consider the probability density of one point xi conditional on

the remaining points {x1, …, x(i−1), x(i+1), …, xn}. In some situations xi is independent of

the other points. However, for other processes this is not the case. For example, if

xi is the location of the reported address for a contagious disease, then it is more

likely to occur near one of the points in the dataset (due to the nature of contagion),

and therefore not independent of the values of {x1, …, x(i−1), x(i+1), …, xn}.

Also important is the idea of a marked process. Here, random sets of points drawn

from a number of different populations are superimposed (e.g. household burgla-

ries using force and household burglaries not using force) and the relationship

between the different sets is considered. The term ‘marked’ is used here as the

dataset can be viewed as a set of points where each point is tagged (or marked)

with its parent population. Using the data formats used by sp, a marked process

could be represented as a spatial points data frame – although the spatstat

package uses a different format.
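As a minimal sketch of this idea (assuming the newhaven data from GISTools are loaded, and glossing over details such as the choice of analysis window), a marked point pattern in spatstat format could be built by tagging each point with its parent population:

library(spatstat)
# combine forced and non-forced burglary locations
xy <- rbind(coordinates(burgres.f), coordinates(burgres.n))
# record the parent population of each point as a mark
m <- factor(c(rep("forced", nrow(coordinates(burgres.f))),
              rep("notforced", nrow(coordinates(burgres.n)))))
# build a marked ppp object with a simple rectangular window
burg_ppp <- ppp(xy[,1], xy[,2],
                window = owin(range(xy[,1]), range(xy[,2])),
                marks = m)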


6.3 TECHNIQUES FOR POINT PATTERNS USING R

Having outlined the two main data types that will be considered, and the kinds of

model that may be applied, in this section more specific techniques will be dis-

cussed, with examples of how they may be carried out using R. In this section, we

will focus on random point patterns.

6.3.1 Kernel Density Estimates

The simplest way to consider random two-dimensional point patterns is to assume

that each random location xi is drawn independently from an unknown distribu-

tion with probability density function f(xi). This function maps a location (repre-

sented as a two-dimensional vector) onto a probability density. If we think of

locations in space as a very fine pixel


grid, and assume a value of probability

density is assigned to each pixel, then summing the pixels making up an arbitrary

region on the map gives the probability that an event occurs in that area. It is gen-

erally more practical to assume an unknown f, rather than, say, a Gaussian distribu-

tion, since geographical patterns often take on fairly arbitrary shapes – for example,

when applying the technique to patterns of public disorder, areas of raised risk

will occur in a number of locations around a city, rather than a simplistic radial

‘bell curve’ centred on the city’s mid-point.

A common technique used to estimate f(xi) is the kernel density estimate (KDE:

Silverman, 1986). KDEs operate by averaging a series of small ‘bumps’ (probability

distributions in two dimensions, in fact) centred on each observed point. This is

illustrated in Figure 6.1. In algebraic terms, the approximation to f(x), for an arbi-

trary location x = (x, y), is given by

$$\hat{f}(\mathbf{x}) = \hat{f}(x, y) = \frac{1}{n h_x h_y} \sum_i k\left(\frac{x - x_i}{h_x}, \frac{y - y_i}{h_y}\right) \qquad (6.1)$$

Each of the ‘bumps’ (central panel in Figure 6.1) maps onto the kernel function
$k\left(\frac{x - x_i}{h_x}, \frac{y - y_i}{h_y}\right)$ in equation (6.1), and the entire equation describes the ‘bump

averaging’ process, leading to the estimate of probability density in the right-

hand panel. Note that there are also parameters hx and hy (frequently referred to

as the bandwidths) in the x and y directions; their dimension is length, and they

represent the radii of the bumps in each direction. Varying hx and hy alters the

shape of the estimated probability density surface – in brief, low values of hx

and hy lead to very ‘spiky’ distribution estimates, and very high values, possibly

larger than the span of the xi locations, tend to ‘flatten’ the estimate so it appears

to resemble the k-function itself; effectively this gives a superposition of nearly

identical k-functions with relatively small perturbations in their centre points.


This effect of varying hx and hy is shown in Figure 6.2. Typically hx and hy take

similar values. If one of these values is very different in magnitude from the other,

kernels elongated in either the x or y direction result. Although this may be useful

when there are strong directional effects, we will focus on the situation where val-

ues are similar for the examples discussed here. To illustrate the results of varying

the bandwidths, the same set of points used in Figure 6.1 is used to provide KDEs

with three different values of hx and hy: on the left, they both take a very low value,

giving a large number of peaks; in the centre, there are two peaks; and on the right,

only one.

Figure 6.1 Kernel density estimation: initial points (left); bump centred on each point

(centre); average of bumps giving estimate of probability density (right)

Figure 6.2 Kernel density estimation bandwidths: hx and hy too low (left); hx and hy

appropriate (centre); hx and hy too high (right)

An obvious problem is that of choosing appropriate hx and hy given a dataset

{xi}. There are a number of formulae to provide ‘automatic’ choices, as well as some

more sophisticated algorithms. Here, a simple rule is used, as proposed by

Bowman and Azzalini (1997) and Scott (1992):

$$h_x = \sigma_x \left(\frac{2}{3n}\right)^{1/6} \qquad (6.2)$$

where σx is the standard deviation of the xi. A similar formula exists for hy, replac-

ing σx with σy, the standard deviation of the yi. The central KDE in Figure 6.2 is

based on choosing hx and hy using this method.
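In R this rule amounts to a single line per direction; for instance (x_coords and y_coords here stand for vectors of point coordinates and are not objects defined in the text):

# Bowman and Azzalini / Scott bandwidth rule, equation (6.2)
hx <- sd(x_coords) * (2 / (3 * length(x_coords)))^(1/6)
hy <- sd(y_coords) * (2 / (3 * length(y_coords)))^(1/6)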


6.3.2 Kernel Density Estimation Using R

Here, the breaches of the peace (public disturbances) in New Haven, Connecticut

are used as an example; recall that this is provided in the GISTools package,

here loaded using data(newhaven). As an initial inspection of the data, look

at the locations of breaches of the peace. These can be viewed on an interactive

map using the tmap package in view mode. The following code loads the New

Haven data and tmap, sets tmap to view mode and produces a map showing the

US Census block boundaries and the locations of breach of the peace, on a back-

drop of a CartoDB map, provided your computer is linked to the internet. The

two layers can be interactively switched on or off, and the backdrop can be

changed. Here, we will generally use the default backdrop as it is monochrome,

and the information to be mapped will be in colour. The initial map window is

seen in Figure 6.3.

# Load GISTools (for the data) and tmap (for the mapping)

require(GISTools)

require(tmap)

# Get the data

data(newhaven)

# look at it

# select 'view' mode

tmap_mode('view')

# Create the map of blocks and incidents

tm_shape(blocks) + tm_borders() + tm_shape(breach) +

tm_dots(col='navyblue')

Figure 6.3 Web view mode of tmap


There are a number of packages in R that provide code for computing KDEs.

Here, the tmap and tmaptools libraries provide some very useful tools. The

function to compute kernel density estimation is smooth_map from tmap-
tools. This estimates the value of the density over a grid of points, and returns the
result as a list containing a raster object, referred to as X$raster (where X is the value
returned from smooth_map), a contour object (X$iso) and a polygon object
(X$polygons). The first of these is a raster grid of values for the KDEs, and the

second and third relate to contour lines associated with the KDE; iso provides a

set of lines (the contour lines) which may be plotted. Similarly, the polygons

item provides a solid list of polygons that may be plotted (as filled polygons).

smooth_map takes several arguments (most notably the set of points to use for

the KDE) but also a number of optional arguments. Two key ones here are the

bandwidth and the cover. The bandwidth is a vector of length 2 containing hx

and hy, and the cover is a geographical object whose outline forms the boundary

of the locations where the KDE is estimated. Both of these have defaults: the

default bandwidth is 1/50 of the shortest side of the bounding box of the points,

and the default cover is the bounding box of the points. However, as discussed

earlier, more appropriate hx and hy values may be found using (6.2). This is not

provided as part of smooth_map, but a function is easily written. The division of

the result by 1000 is because the projected data are measured in metres, but

smooth_map expects bandwidths in kilometres.

# Function to choose bandwidth according to Bowman and Azzalini / Scott's rule

# for use with smooth_map in tmaptools


choose_bw <- function(spdf) {

X <- coordinates(spdf)

sigma <- c(sd(X[,1]), sd(X[,2])) * (2 / (3 * nrow(X))) ^ (1/6)

return(sigma/1000)

}

Now the code to carry out the KDE and plot the results may be used. Here the raster

version of the result is used, and plotted on a web mapping backdrop (Figure 6.4).

library(tmaptools)

tmap_mode('view')

breach_dens <- smooth_map(breach,cover=blocks, bandwidth = choose_bw(breach))

tm_shape(breach_dens$raster) + tm_raster()

Figure 6.4 KDE map for breaches of the peace

The ‘count’ caption here indicates that the probability densities have been rescaled
to represent intensities – by multiplying the KDE by the number of cases. With this
scale, the quantity being mapped is the expected number of cases per unit area
over the duration of the study period.

It is also possible to use the other forms of result (polygons or isolines) to plot

the KDE outcomes. In the following code, isolines are produced, again with a back-

drop of a web map (see Figure 6.5).

tmap_mode('view')

tm_shape(blocks)+ tm_borders(alpha=0.5) +

tm_shape(breach_dens$iso) + tm_lines(col='darkred',lwd=2)

Figure 6.5 KDE map for breaches of the peace – isoline version


Here, a backdrop of block boundaries has also been added to emphasise the


limits of the data collection region. In this and the previous map, it is important to

be aware of the boundaries of the data sampling region. Low probability densities

outside this region are quite likely due to no data being collected there – not neces-

sarily low incident risk!

Self-Test Question 1. As a further exercise, create the polygons version of the KDE

map in the plot mode of tmap – the tm_fill() function will shade the poly-

gons. As there will be no backdrop map, roads and blocks should be added to the

map to provide context. Also, add a map scale.


As well as estimating the probability density function f(x, y), kernel density

estimation also provides a helpful visual tool for displaying point data.

Although plotting point data directly can show all of the information in a

small dataset, if the dataset is larger it is hard to discriminate between

relative densities of points: essentially, when points are very closely packed,

the map symbols begin to overprint and exact numbers are hard to deter-

mine; this is illustrated in Figure 6.6. On the left is a plot of locations. The

points plotted are drawn from a two-dimensional Gaussian distribution, and

their relative density increases towards the centre. However, except for a

penumbral region, the intensity of the dot pattern appears to have roughly

fixed density. As the KDE estimates relative density, this problem is

addressed – as may be seen in the KDE plot in Figure 6.6 (right).

Figure 6.6 The overplotting problem: point plot (left) and KDE plot (right)


6.4 FURTHER USES OF KERNEL DENSITY ESTIMATION

KDEs are also useful for comparative purposes. In the newhaven dataset there are

also data relating to burglaries from residential properties. These are divided into

two classes: burglaries involving forced entry, and burglaries that do not. It may

be of interest to compare the spatial distributions of the two groups. In the

newhaven dataset, burgres.f is a SpatialPoints object with points for the

occurrence of forced entry residential burglaries, and burgres.n is a

SpatialPoints object with points for non-forced entries. Based on the recom-

mendation to compare patterns in data using small multiples of graphical panels

(Tufte, 1990), KDE maps for forced and non-forced burglaries may be shown side

by side. This is achieved using the R code below, which carries out the following

operations:

● Specify a set of levels for the intensity contours. To allow comparison

the same levels will be used on both maps

● Compute the KDEs. Here the contours are specified for the iso and

polygons results

● Draw each of the two maps and store in variables dn and df. Here the

polygon format is used

● Use tmap_arrange to draw the two maps in ‘small multiples’

format

The result is seen in Figure 6.7. Although there are some similarities in the two

patterns – likely due to the underlying pattern of housing – it may be seen that

for the non-forced entries there are two peaks of roughly equal intensity (Beaver

Hills/Edgewood in the west and Fair Haven in the east), while for forced entries

the peaks are in similar positions but the stronger peak is to the west, near

Edgewood. More generally, there tend to be more forced incidents than

non-forced.

# R Kernel Density comparison - first make sure the New Haven data are available

require(GISTools)

data(newhaven)

tmap_mode('plot')

# Create the KDEs for the two datasets:

contours <- seq(0,1.4,by=0.2)

brn_dens <- smooth_map( burgres.n,cover=blocks, breaks=contours,

style='fixed',

bandwidth = choose_bw(burgres.n))

brf_dens <- smooth_map( burgres.f,cover=blocks, breaks=contours,

style='fixed',

bandwidth = choose_bw(burgres.f))


# Create the maps and store them in variables

dn <- tm_shape(blocks) + tm_borders() +

tm_shape(brn_dens$polygons) + tm_fill(alpha=0.8) +

tm_layout(title="Non-Forced Burglaries")

df <- tm_shape(blocks) + tm_borders() +

tm_shape(brf_dens$polygons) + tm_fill(alpha=0.8) +

tm_layout(title="Forced Burglaries")

tmap_arrange(dn,df)

Figure 6.7 KDE maps to compare forced and non-forced burglary patterns

6.4.1 Hexagonal Binning Using R

An alternative visualisation tool for geographical point datasets with larger num-

bers of points is hexagonal binning. In this approach, a regular lattice of small hex-

agonal cells is overlaid on the point pattern, and the number of points in each cell

is counted. The cells are then shaded according to the counts. This method also

overcomes the overplotting problem. However, hexagonal binning is not directly

available in GISTools, and it is necessary to use another package. One possibility

is the fMultivar package. This provides a routine for hexagonal binning called

hexBinning, which takes a two-column matrix of coordinates and provides an


object representing the hexagonal grid and the counts of points in each hexagonal

cell. Note that this function does not work directly with sp-type spatial data

objects. This is mainly because it is designed to apply hexagonal binning to any

kind of data (e.g. scatter plot points where the x and y variables are not geograph-

ical coordinates). However, it is perfectly acceptable to subject geographical points

to this kind of analysis.

First, make sure that the fMultivar package is installed in R. If not, enter:

install.packages("fMultivar",depend=TRUE)

A complication here is that the result of the hexBinning function is not a

SpatialPolygonsDataFrame object and is not immediately compatible with

tmap and other spatial tools in R. To allow for this, a new function hexbin_map

is written. This takes a SpatialPointsDataFrame object as input, and returns

a SpatialPolygonsDataFrame object consisting of the hexagons in which one

or more points occur, together with a data frame with a column z containing the

count of points. The code works as follows:

● Extract coordinates from the SpatialPointsDataFrame object

● Run hexBinning on these

● Construct hexagonal polygon coordinates

● Loop through each polygon; construct these according to sp data structures

● Copy the map projection information from the

SpatialPointsDataFrame object

● Add the count information giving a SpatialPolygonsDataFrame

object

The code is below:

hexbin_map <- function(spdf, ...) {

hbins <- fMultivar::hexBinning(coordinates(spdf),...)

# Hex binning code block

# Set up the hexagons to plot, as polygons

u <- c(1, 0, -1, -1, 0, 1)
u <- u * min(diff(unique(sort(hbins$x))))
v <- c(1, 2, 1, -1, -2, -1)
v <- v * min(diff(unique(sort(hbins$y))))/3

# Construct each polygon in the sp model

hexes_list <- vector(length(hbins$x),mode='list')

for (i in 1:length(hbins$x)) {

pol <- Polygon(cbind(u + hbins$x[i], v + hbins$y[i]),hole=FALSE)

hexes_list[[i]] <- Polygons(list(pol),i) }


# Build the spatial polygons data frame

hex_cover_sp <- SpatialPolygons(hexes_list,proj4string=CRS(proj4string(spdf)))
hex_cover <- SpatialPolygonsDataFrame(hex_cover_sp,
  data.frame(z=hbins$z),match.ID=FALSE)

# Return the result

return(hex_cover)

}


Note the reference to fMultivar::hexBinning in the code. This tells

R to use the function hexBinning from the package fMultivar without

actually loading the package using library. It is useful if it is the only thing

used from that package, as it avoids having to load everything else in the

package.

It is now possible to create hex binned maps via this function. Here a view

mode map of the hex binned breach data is produced (Figure 6.8).

tmap_mode('view')

breach_hex <- hexbin_map(breach,bins=20)

tm_shape(breach_hex) +

tm_fill(col='z',title='Count',alpha=0.7)

Figure 6.8 Hexagonal binning of breach of the peace incidents


As an alternative graphical representation, it is also possible to draw hexagons

whose area is proportional to the point count.


This is done by creating a variable

with which to multiply the relative polygon coordinates (this relates to the square

root of the count in each polygon, since it is areas of the hexagons that should reflect

the counts). This is all achieved via a modification of the previous hexbin_map

function, called hexprop_map, listed below.

hexprop_map <- function(spdf, ...) {

hbins <- fMultivar::hexBinning(coordinates(spdf),...)

# Hex binning code block

# Set up the hexagons to plot, as polygons

u <- c(1, 0, -1, -1, 0, 1)
u <- u * min(diff(unique(sort(hbins$x))))
v <- c(1, 2, 1, -1, -2, -1)
v <- v * min(diff(unique(sort(hbins$y))))/3

scaler <- sqrt(hbins$z/max(hbins$z))


# Construct each polygon in the sp model

hexes_list <- vector(length(hbins$x),mode='list')

for (i in 1:length(hbins$x)) {

pol <- Polygon(cbind(u*scaler[i] + hbins$x[i], v*scaler[i] + hbins$y[i]),hole=FALSE)

hexes_list[[i]] <- Polygons(list(pol),i) }

# Build the spatial polygons data frame

hex_cover_sp <- SpatialPolygons(hexes_list,proj4string=CRS(proj4string(spdf)))

hex_cover <- SpatialPolygonsDataFrame(hex_cover_sp,
  data.frame(z=hbins$z),match.ID=FALSE)

# Return the result

return(hex_cover)

}

It is now possible to create a proportional hex binning map – here in plot mode

in Figure 6.9.

tmap_mode('plot')

breach_prop <- hexprop_map(breach,bins=20)

tm_shape(blocks) + tm_borders(col='grey') +

tm_shape(breach_prop) +

tm_fill(col='indianred',alpha=0.7) +

tm_layout("Breach of Peace Incidents",title.position=c('left','bottom'))

6.5 SECOND-ORDER ANALYSIS OF POINT PATTERNS

In this section an alternative approach to point patterns will be considered.

Whereas KDEs assume that the spatial distributions for a set of points are

independent but have a varying intensity, the second-order methods consid-

ered in this section assume that marginal distributions of points have a fixed

intensity, but that the joint distribution of all points is such that individual

distributions of points are not independent.1 This process describes situations

in which the occurrences of events are related in some way – for example, if a

disease is contagious, the reporting of an incidence in one place might well be

accompanied by other reports nearby. The K-function (Ripley, 1981) is a very

useful tool for describing processes of this kind. The K-function is a function

of distance, defined by

$$K(d) = \lambda^{-1} E(N_d) \qquad (6.3)$$

1 A further stage in complication would be the situation where individual distributions are not inde-

pendent, but also the marginal distributions vary in intensity – however, this will not be considered

here.


where Nd is the number of events xi within a distance d of a randomly chosen event

from all recorded events {x1, …, xn}, and λ is the intensity of the process, measured

in events per unit area. Consider the situation where the distributions of xi are

independent, and the marginal densities are uniform – often termed a Poisson pro-

cess, or complete spatial randomness (CSR). In this situation one would expect the

number of events within a distance d of a randomly chosen event to be the intensity

λ multiplied by the area of a circle of radius d, so that

$$K_{\mathrm{CSR}}(d) = \pi d^2 \qquad (6.4)$$

The situation in equation (6.4) can be thought of as a benchmark to assess the clus-

tering of other processes. For a given distance d, the function value KCSR(d) gives

an indication of the expected number of events found around a randomly chosen

event, under the assumption of a uniform density with each observation being dis-

tributed independently of the others. Thus for a process having a K-function K(d),
if K(d) > KCSR(d), this suggests that there is an excess of nearby points – or, to put it

another way, there is clustering at the spatial scale associated with the distance d.

Similarly, if K(d) < KCSR(d), this suggests spatial dispersion at this scale – the pres-

ence of one point suggests other points are less likely to appear nearby than for a

Poisson process.

Figure 6.10 A spatial process with both clustering and dispersion


The consideration of spatial scale is important (many processes exhibit spatial

clustering at some scales, and dispersion at others) so that the quantity K(d) −

KCSR(d) may change sign with different values of d. For example, the process illus-

trated in Figure 6.10 shows clustering at low values of d – for small distances (such

as d2 in the figure) there is an excess of points near to other points compared to

CSR, but for intermediate distances (such as d1 in the figure) there is an undercount

of points.

When working with a sample of data points {xi}, the K-function for the underly-

ing distribution will not usually be known. In this case, an estimate must be made

using the sample. If dij is the distance between xi and xj then an estimate of K(d) is
given by

$$\hat{K}(d) = \hat{\lambda}^{-1} \sum_i \sum_{j \neq i} \frac{I(d_{ij} < d)}{n(n-1)} \qquad (6.5)$$

where λ̂ is an estimate of the intensity given by

$$\hat{\lambda} = \frac{n}{|A|} \qquad (6.6)$$

|A| being the area of a study region defined by a polygon A. Also I(·) is an indicator

function taking the value 1 if the logical expression in the brackets is true, and 0

otherwise. To consider whether this sample comes from a clustered or dispersed

process, it is helpful to compare K̂(d) to KCSR(d).

Figure 6.11 Sample K-functions under CSR


Statistical inference is important here. Even if the dataset had been gener-

ated by a CSR process, an estimate of the K-function would be subject to sam-

pling variation, and could not be expected to match KCSR(d) perfectly. Thus, it

is necessary to test whether the sampled K̂(d) is sufficiently unusual with

respect to the distribution of K̂ estimates one might expect to see under CSR

to provide evidence that the generating process for the sample is not CSR. The

idea is illustrated in Figure 6.11. Here, 100 K-function estimates (based on equa-

tion (6.5)) from random CSR samples of 100 points (the same number of points as in

Figure 6.10) are superimposed, together with the estimate from the point set

shown in Figure 6.10. From this it can be seen that the estimate from the clus-

tered sample is quite different from the range of estimates expected from CSR.

Another aspect of sampling inference for K-functions is the dependency of K̂(d)
on the shape of the study area. The theoretical form KCSR(d) = πd² is based on an

assumption of points occurring in an infinite two-dimensional plane. The fact that

a ‘real-world’ sample will be taken from a finite study area (denoted here by A)

will lead to further deviation of sample-based estimates of K̂(d) from the theoreti-

cal form. This can also be seen in Figure 6.11 – although for the lower values of d

the CSR estimated K-function curves resemble the quadratic shape expected: the

curves ‘flatten out’ for higher values of d. This is due to the fact that for larger val-

ues of d, points will only be observed in the intersection of a circle of radius d

around a random xi and the study area A. This will result in fewer points being

observed than the theoretical K-function would predict. This effect continues, and

when d is sufficiently large any circle centred on one of the points will encompass

the entirety of A. At this point, any further increase in d will result in no change in

the number of points contained in the circle – this provides an explanation of the

flattening-out effect seen in the figure.

Above, the idea is to consider a CSR process constrained to the study area.

However, another viewpoint is that the study area defines a subset of all

points generated on the full two-dimensional plane. To estimate the

K-function for the full-plane process some allowance for edge effects on the

study area needs to be made. Ripley (1976)


proposed the following modification

to equation (6.5):

$$\hat{K}(d) = \hat{\lambda}^{-1} \sum_i \sum_{j \neq i} \frac{2\, I(d_{ij} < d)}{n(n-1)\, w_{ij}} \qquad (6.7)$$

where wij is the area of intersection between a circle centred at xi passing

through xj and the study area A. Inference about the estimated K-function can

then be carried out using the approach used above, but with K̂(d) based on

equation (6.7).


6.5.1 Using the K-Function in R

In R, a useful package for computing estimated K-functions (as well as other spa-

tial statistical procedures) is spatstat. This is capable of carrying out the kind of

simulation illustrated earlier in this section.

The K-function estimation as defined above may be estimated in the spat-

stat package using the Kest function. Here the locations of bramble canes

(Hutchings, 1979; Diggle, 1983) are analysed, having been obtained as a dataset

supplied with spatstat via the data(bramblecanes) command. They are

plotted in Figure 6.12. Different symbols represent different ages of canes – although

initially we will just consider the point pattern for all canes.


For the data in the example, points were generated with A as the rectangle

having lower left corner (−1, −1) and upper right corner (1, 1). In practice A may

have a more complex shape (a polygon outline of a county, for example); for

this reason, assessing the sampling variability of the K-function under

sampling must often be achieved via simulation, as seen in Figure 6.11.

Figure 6.12 Bramble cane locations


# K-function code block

# Load the spatstat package

require(spatstat)

# Obtain the bramble cane data

data(bramblecanes)

plot(bramblecanes)

Next, the Kest function is used to obtain an estimate for the K-function of the

spatial process underlying the distribution of the bramble canes. The

correction='border' argument requests that an edge-corrected estimate (as

in equation (6.7)) be used.

kf <- Kest(bramblecanes,correction='border')

# Plot it

plot(kf)

The result of plotting the K-function as shown in Figure 6.13 compares the esti-

mated function (labelled K̂bord) to the theoretical function under CSR (labelled K̂pois).

It may be seen that the data appear to be clustered (generally the empirical

K-function is greater than that for CSR, suggesting that more points occur close

together than would be expected under CSR). However, this perhaps needs a more
rigorous investigation, allowing for sampling variation via simulation as set out above.

Figure 6.13 Ripley’s K-function plot (K̂bord(r) and Kpois(r) against r, where one unit = 9 metres)

This simulation approach is sometimes referred to as envelope analysis, the enve-

lope being the highest and lowest values of K̂(d) for a value of d. Thus the function

for this is called envelope. This takes a ppp object and a further function as an

argument. The function here is Kest – there are other functions also used to

describe spatial distributions which will be discussed later, which envelope can

use, but for now we focus on Kest. The envelope object may also be plotted, as

shown in the following code which results in Figure 6.14:

# Code block to produce k-function with envelope

# Envelope function

kf.env <- envelope(bramblecanes,Kest,correction="border")

# Plot it

plot(kf.env)

From this it can be seen that the estimated K-function for the sample takes on a

higher value than the envelope of simulated K-functions for CSR until d becomes

quite large, suggesting strong evidence that the locations of bramble canes do

indeed exhibit clustering.

Figure 6.14 K-function with envelope

However, it can reasonably be argued that comparing an
estimated K̂(d) and an envelope of randomly sampled estimates under CSR is not

a formal significance test. In particular, since the sample curve is compared to the

envelope for several d values, multiple significance testing problems may occur.

These are well explained by Bland and Altman (1995) – in short, when carrying out

several tests, the chance of obtaining a false positive result in any test is raised. If

the intention is to evaluate a null hypothesis of CSR, then a single number measur-

ing departure of K̂(d) from KCSR(d), rather than the K-function, may be more

appropriate – so that a single test can be applied. One such number is the maximum

absolute deviation (MAD: Ripley, 1977, 1981). This is the absolute value of the larg-

est discrepancy between the two functions:

$$\mathrm{MAD} = \max_d \left| \hat{K}(d) - K_{\mathrm{CSR}}(d) \right| \qquad (6.8)$$

In R, we enter:

mad.test(bramblecanes,Kest,verbose=FALSE)

Maximum absolute deviation test of CSR

Monte Carlo test based on 99 simulations

Summary function: K(r)

Reference function: theoretical

Alternative: two.sided

Interval of distance values: [0, 0.25] units (one unit = 9 metres)

Test statistic: Maximum absolute deviation

Deviation = observed minus theoretical

data: bramblecanes

mad = 0.016159, rank = 1, p-value = 0.01

In this case it can be seen that the null hypothesis of CSR can be rejected at the 1% level.

An alternative test is advocated by Loosmore and Ford (2006) where the test statistic is

$$u_i = \sum_{d_k = d_{\min}}^{d_{\max}} \left( \hat{K}_i(d_k) - \bar{K}_i(d_k) \right)^2 \delta_k \qquad (6.9)$$

in which K̄i(dk) is the average value of K̂(dk) over the simulations, the dk are a
sequence of sample distances ranging from dmin to dmax, and δk = dk+1 − dk. Essentially

this attempts to measure the sum of the squared distance between the functions,

rather than the maximum distance. This is implemented by spatstat via the

dclf.test function, which works similarly to mad.test:

dclf.test(bramblecanes,Kest,verbose=FALSE)

Diggle-Cressie-Loosmore-Ford test of CSR

Monte Carlo test based on 99 simulations

Summary function: K(r)

Reference function: theoretical

Alternative: two.sided

Interval of distance values: [0, 0.25] units (one unit = 9 metres)

Test statistic: Integral of squared absolute deviation

Deviation = observed minus theoretical

data: bramblecanes

u = 3.3372e−05, rank = 1, p-value = 0.01


Again, results suggest rejecting the null hypothesis of CSR – see the reported

p-value.

6.5.2 The L-function

An alternative to the K-function for identifying clustering in spatial processes is the

L-function. This is defined in terms of the K-function

L(d) = √( K(d) / π )    (6.10)

Although just a simple transformation of the K-function, its utility lies in the

fact that under CSR, L(d) = d; that is, the L-function is linear, having a slope of 1

and passing through the origin. Visually identifying this in a plot of estimated

L-functions is generally easier than identifying a quadratic function, and there-

fore L-function estimates are arguably a better visual tool. The Lest function

provides a sample estimate of the L-function (by applying the transform in (6.10)

to K̂(d)) which can be used in place of Kest. As an example, recall that the enve-

lope function could take alternatives to K-functions to create the envelope plot:

in the following code, an envelope plot using L-functions for the bramble cane

data is created (see Figure 6.15):

# Code block to produce L-function with envelope

# Envelope function

lf.env <- envelope(bramblecanes,Lest,correction="border")

# Plot it

plot(lf.env)
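If you want to inspect the estimated L-function on its own, without the simulation envelope, Lest can also be called directly. A minimal sketch (not from the book):

lf <- Lest(bramblecanes, correction = "border")
# Under CSR the theoretical L-function is the 45-degree line L(r) = r,
# so departures from that line are easy to spot by eye
plot(lf)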

Similarly, it is possible to apply MAD tests or Loosmore and Ford tests using L

instead of K. Again mad.test and dclf.test allow an alternative to K-functions

to be specified. Indeed, Besag (1977) recommends using L-functions in place of

K-functions in this kind of test. As an example, the following code applies the

MAD test to the bramble cane data using the L-function.

mad.test(bramblecanes,Lest,verbose=FALSE)

Maximum absolute deviation test of CSR


Monte Carlo test based on 99 simulations

Summary function: L(r)

Reference function: theoretical

Alternative: two.sided

Interval of distance values: [0, 0.25] units (one unit = 9 metres)

Test statistic: Maximum absolute deviation

Deviation = observed minus theoretical

data: bramblecanes

mad = 0.017759, rank = 1, p-value = 0.01



6.5.3 The G-Function

Yet another function used to describe the clustering in point patterns is the

G-function. This is the cumulative distribution of the nearest neighbour distance

for a randomly selected xi. Thus, given a distance d, G(d) is the probability that the

nearest neighbour distance for a randomly chosen sample point is less than or

equal to d. Again, this can be estimated using spatstat, using the function Gest.

As in the case of Lest and Kest, the functions envelope, mad.test and

dclf.test may be used with Gest. Here, again with the bramble cane data, a

G-function envelope is plotted:

# Code block to produce G-function with envelope

# Envelope function

gf.env <- envelope(bramblecanes,Gest,correction="border")

# Plot it

plot(gf.env)


Figure 6.15 L-function with envelope


The estimate of the G-function for the sample is based on the empirical propor-

tion of nearest neighbour distances less than d, for several values of d. In this case

the envelope is the range of estimates for given d values, for samples generated

under CSR. Theoretically, the expected G-function for CSR is

G(d) = 1 − exp(−λπd²)    (6.11)

This is also plotted in Figure 6.16, as Gtheo.
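To make equation (6.11) concrete, the empirical G-function can be built directly from the nearest neighbour distances and compared with the theoretical CSR curve. The following is a sketch rather than code from the book; it uses the spatstat helpers nndist, unmark and intensity on the bramble cane data:

# Nearest neighbour distance for every bramble cane
nnd <- nndist(bramblecanes)
# Estimated intensity (points per unit area), ignoring the marks
lambda <- intensity(unmark(bramblecanes))
# Empirical proportion of nearest neighbour distances <= d ...
d <- seq(0, max(nnd), length.out = 100)
plot(d, ecdf(nnd)(d), type = "l", xlab = "d", ylab = "G(d)")
# ... compared with the theoretical CSR curve G(d) = 1 - exp(-lambda*pi*d^2)
lines(d, 1 - exp(-lambda * pi * d^2), lty = 2)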


Figure 6.16 G-function with envelope


One complication is that spatstat stores spatial information in a differ-

ent way than sp, GISTools and related packages, as noted earlier. This is

not a major hurdle, but it does mean that objects of types such as SpatialPointsDataFrame must be converted to spatstat's ppp format. This is a compendium format containing both a set of points and a polygon describing the study area A, and can be created from a SpatialPoints or SpatialPointsDataFrame object combined with a SpatialPolygons or SpatialPolygonsDataFrame object. This is achieved via the as and as.ppp functions from the maptools package.

require(maptools)

require(spatstat)

# Bramblecanes is a dataset in ppp format from spatstat

data(bramblecanes)

# Convert the data to SpatialPoints, and plot them

bc.spformat <- as(bramblecanes,"SpatialPoints")

plot(bc.spformat)

# It is also possible to extract the study polygon

# referred to as a window in spatstat terminology

# Here it is just a rectangle...

bc.win <- as(bramblecanes$win,"SpatialPolygons")

plot(bc.win,add=TRUE)

It is also possible to convert objects in the other direction, via the as.ppp function. This takes two arguments: the coordinates of the SpatialPoints or SpatialPointsDataFrame object (extracted using the coordinates function), and an owin object created from a SpatialPolygons or SpatialPolygonsDataFrame via as.owin. owin objects are single polygons used by spatstat to denote study areas, and are a component of ppp objects. In the following example, the burgres.n point dataset from GISTools is converted to ppp format and a G-function is computed and plotted.

require(GISTools)   # provides the newhaven data (burgres.n, blocks)
require(rgeos)      # provides gUnaryUnion
require(maptools)
require(spatstat)
data(newhaven)
# convert burgres.n to a ppp object
br.n.ppp <- as.ppp(coordinates(burgres.n),
                   W=as.owin(gUnaryUnion(blocks)))
br.n.gf <- Gest(br.n.ppp)
plot(br.n.gf)

6.6 LOOKING AT MARKED POINT PATTERNS

A further advancement of the analysis of patterns of points of a single type is the

consideration of marked point patterns. Here, several kinds of points are considered


in a dataset, instead of only a single kind. For example, in the newhaven dataset

there are point data for several kinds of crime. The term ‘marked’ is used as each

point is thought of as being tagged (or marked) with a specific type. As with the

analysis of single kinds of points (or ‘unmarked’ points), the points are still treated

as random two-dimensional quantities. It is also possible to apply tests and analyses

to each individual kind of point – for example, testing each mark type against a null

hypothesis of CSR, or computing the K-function for that mark type. However, it is

also possible to examine the relationships between the point patterns of different

mark types. For example, it may be of interest to determine whether forced entry

residential burglaries occur closer to non-forced-entry burglaries than one might

expect if the two sets of patterns occurred independently.
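Before looking at between-type relationships, note that analysing each mark type on its own is straightforward: a marked ppp object can be split into one unmarked pattern per mark level and any of the earlier functions applied. A brief sketch (not from the book) using the bramble cane data:

# One ppp object per age level (0, 1 and 2)
bc.by.age <- split(bramblecanes)
# K-function for the youngest canes only
plot(Kest(bc.by.age[["0"]], correction = "border"))

The rest of this section focuses on the relationships between mark types.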

One method of investigating this kind of relationship is the cross-K-function

between marks of type i and j. This is defined as

Kij(d) = λj⁻¹ E(Ndij)    (6.12)

where Ndij is the number of events xk of type j within a distance d of a randomly chosen event from all recorded events {x1, …, xn} of type i, and λj is the intensity of the process marked j – measured in events per unit area (Lotwick and Silverman, 1982). If the process for points with mark j is CSR, then Kij(d) = πd². A similar simulation-based approach to that set out for K, L and G in earlier sections may be used to compare a sample estimate K̂ij(d) with its hypothesised value under CSR.

The empirical estimate of Kij(d) is obtained in a similar way to that in equation (6.5):

K̂ij(d) = λ̂j⁻¹ (1/ni) Σk Σl I(dkl < d)    (6.13)

where k indexes all of the i-marked points, l indexes all of the j-marked points, dkl is the distance between the kth i-marked point and the lth j-marked point, and ni and nj are the respective numbers of points marked i and j. A correction (of the form in equation (6.7)) may also be applied. There is also a cross-L-function, Lij(d), which relates to the cross-K-function in the same way that the standard L-function relates to the standard K-function.

6.6.1 Cross-L-Function Analysis in R

There is a function in spatstat called Kcross to compute cross-K-functions,

and a corresponding function called Lcross for cross-L-functions. These take a

ppp object and values for i and j as the key arguments. Since i and j refer to mark

types, it is also necessary to identify the marks for each point in a ppp object. This

can be done via the marks function. For example, for the bramblecanes object,

the points are marked in relation to the age of the cane (see Hutchings, 1979) with

three levels of age (labelled as 0, 1 and 2 in increasing order). Note that the marks

are factors. These may be listed by entering:


marks(bramblecanes)

[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

[28] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

[55] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

[82] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

[109] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

[136] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

[163] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

[190] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

[217] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

[244] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

[271] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

[298] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

[325] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

[352] 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[379] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[406] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[433] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1


and Linux, respectively. The Windows and Mac versions come

with installer packages and are easy to install, while the Linux binaries require use

of a command terminal.

RStudio can be downloaded from https://www.rstudio.com/products/

rstudio/download/ and the free version of RStudio Desktop is more than

sufficient for this book. RStudio allows you to organise your work into projects,

to use RMarkdown to create documents and webpages, to link to your GitHub

site and much more. It can be customised for your preferred arrangement of the

different panes.


You may have to set a mirror site from which the installation files will be down-

loaded to your computer. Generally you should pick one that is near to you. Once

you have installed the software you can run it. On a Windows computer, an R icon

is typically installed on the desktop; on a Mac, R can be found in the Applications

folder. Macs and Windows have slightly different interfaces, but the protocols and

processes for an R session on either platform are similar.

The base installation includes many functions and commands. However,

more often we are interested in using some particular functionality, encoded into

packages contributed by the R developer community. Installing packages for the

first time can be done at the command line in the R console using the install.packages command, as in the example below to install the tmap package,

or via the R menu items.

install.packages("tmap", dependencies = T)

In Windows, the menu for this can be accessed by Packages > Load Packages and

on a Mac via Packages and Data > Package Installer. In either case, the first time

you install packages you may have to set a mirror site, from which to download

the packages. Once the package has been installed then the library can be called as

below.

library(tmap)

Further descriptions of packages, their installation and their data structures are

given in later chapters. There are literally thousands of packages that have been

contributed to the R project by various researchers and organisations. These can

be located by name at http://cran.r-project.org/web/packages/

available_packages_by_name.html if you know the package you wish

to use. It is also possible to search the CRAN website to find packages to per-

form particular tasks at http://www.r-project.org/search.html.

Additionally, many packages include user guides in the form of a PDF docu-

ment describing the package and listed at the top of the index page of the help

files for the package. The most commonly used packages in this book are listed

in Table 1.2.

When you install these packages it is strongly suggested you also install the

dependencies – other packages required by the one that is being installed – by

either checking the box in the menu or including dependencies = TRUE (dep = TRUE for short) in the command

line as below:

install.packages("GISTools", dep = TRUE)

Packages are occasionally completely rewritten, and this can impact on code func-

tionality. Since we started writing the revision for this edition of the book, the read


Table 1.2 R packages used in this book

Name Description

datasets A package containing a number of datasets supplied with the standard installation of R

deldir Functions for Delaunay triangulations, Dirichlet or Voronoi tessellations of point datasets

dplyr A grammar of data manipulation

e1071 Functions for data mining, latent class analysis, clustering and modelling

fMultivar Tools for financial engineering but useful for spatial data

ggplot2 Declarative graphics creation, based on The Grammar of Graphics (Wilkinson, 2005)

GISTools Mapping and spatial data manipulation tools

gstat Functions for spatial and geostatistical modelling, prediction and simulation

GWmodel Geographically weighted models

maptools Functions for manipulating and reading geographical data

misc3d Miscellaneous functions for three-dimensional (3D) plots

OpenStreetMap High resolution raster maps and satellite imagery from OpenStreetMap

raster Manipulating, analysing and modelling of raster or gridded spatial data

RColorBrewer A package providing colour palettes for shading maps and other plots

RCurl General HTTP requests, functions to fetch uniform resource identifiers (URIs), to get and post web data

reshape2 Flexibly reshape data

rgdal Geospatial Data Abstraction Library, projection/transformation operations

rgeos Geometry Engine – Open Source (GEOS), topology operations on geometries

rgl 3D visualisation device (OpenGL)

RgoogleMaps Interface to query the Google server for static maps as map backgrounds

Rgraphviz Provides plotting capabilities for R graph objects

rjson Converts R objects into JavaScript Object Notation (JSON) objects and vice versa

sf Simple Features for R – a standardised way to encode spatial vector data

sp Classes and methods for spatial data

SpatialEpi Performs various spatial epidemiological analyses

spatstat A package for analysing spatial data, mainly spatial point patterns

spdep Functions and tests for evaluating spatial patterns and autocorrelation

tibble A modern reimagining of the data frame

tidyverse A collection of R packages designed for data science

tmap A mapping package that allows maps to be constructed in highly controllable layers


and write functions for spatial data in the maptools package (readShapePoly, writePolyShape, etc.) have been deprecated. For instance:

library(maptools)

?readShapePoly

If you examine the help files for these functions you will see that they contain a

warning and suggest other functions that should be used instead. The book web-

site will always contain working code snippets for each chapter to overcome any

problems caused by function deprecation.

Such changes are only a minor inconvenience and are part of the nature of a

dynamic development environment provided by R in which to do research: such

changes are inevitable as packages are refined, improved and standardised.

1.8 THE R INTERFACE

We expect that most readers of this book and most users of R will be using the

RStudio interface to R, although users can of course still use just R. RStudio pro-

vides a good interface to the different things that R users will want to know about

the R sessions via the four panes: the console where code is entered; the file that is

being edited; variables in the working environments; files in the project file space;

plot windows, help pages, as well as font type and size, pane colour, etc. Users can

set up their personal preferences for how they like their RStudio interface. As with standard R, there are few pull-down menus, and therefore you will type commands into what is termed a command line interface. Like all command

line interfaces, the learning curve is steep but the interaction with the software is

more detailed, which allows greater flexibility and precision in the specification of

commands.

As you work though the book, the expectation is that you will run all the code

that you come across. We cannot emphasise enough the importance of learning by

doing – the best way to learn how to write R code is to write and enter it. Some of

the code might look a bit intimidating when first viewed, especially in later chap-

ters. However, the only really effective way to understand it is to give it a try.

Beyond this there are further choices to be made. Command lines can be entered

in two forms: directly into the R console window or as a series of commands into a

script window. We strongly advise that all code should be written in scripts (script

files have a .R extension) and then run from the script. RStudio includes its own

editor (similar to Notepad in Windows or TextEdit on a Mac). Scripts are useful if

you wish to automate data analysis, and have the advantage of keeping a saved

record of the relevant R programming language commands that you use in a given

piece of analysis. These can be re-executed,


1 1 1 1

[460] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[487] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[514] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[541] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[568] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[595] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[622] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[649] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[676] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[703] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[730] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2

[757] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

[784] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

[811] 2 2 2 2 2 2 2 2 2 2 2 2 2

Levels: 0 1 2


It is also possible to assign values to marks of a ppp object using the

expression:

marks(x) <- ...

where ... is any valid R expression creating a factor variable with the same

length of number elements as there are points in the ppp object x. This is

useful if converting a SpatialPointsDataFrame object into a ppp

object representing a marked process.
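For instance, a marked ppp object for the two New Haven burglary types could be built along the following lines. This is a sketch rather than code from the book: it assumes the newhaven data (burgres.f, burgres.n and blocks) from GISTools, rgeos for gUnaryUnion, and that all burglary points fall inside the merged blocks window (as in the earlier example).

require(GISTools)
require(rgeos)
require(maptools)
require(spatstat)
data(newhaven)
# Stack the forced- and non-forced-entry burglary coordinates
crds <- rbind(coordinates(burgres.f), coordinates(burgres.n))
# Use the merged census blocks as the study window
br.all <- as.ppp(crds, W = as.owin(gUnaryUnion(blocks)))
# Mark each point with its burglary type, in the same order as crds
n.f <- nrow(coordinates(burgres.f))
n.n <- nrow(coordinates(burgres.n))
marks(br.all) <- factor(c(rep("forced", n.f), rep("notforced", n.n)))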


As an example here, we compute and plot the cross-L-function for levels 0 and 1 of

the bramblecanes object (the resultant plot is shown in Figure 6.17):

cl.bramble <- Lcross(bramblecanes,i=0,j=1,correction='border')

plot(cl.bramble)


Figure 6.17 Cross-L-function for levels 0 and 1 of the bramble cane data

The envelope function may also be used (Figure 6.18):

clenv.bramble <- envelope(bramblecanes,Lcross,i=0,j=1,correction='border')

plot(clenv.bramble)

Thus, it would seem that there is a tendency for more young (level 1) bramble

canes to occur close to very young (level 0) canes. This can be formally tested, as

both mad.test and dclf.test can be used with Kcross and Lcross. Here

the use of Lcross with dclf.test is demonstrated:


dclf.test(bramblecanes,Lcross,i=0,j=1,correction='border',verbose=FALSE)

Diggle-Cressie-Loosmore-Ford test of CSR

Monte Carlo test based on 99 simulations

Summary function: L["0", "1"](r)

Reference function: theoretical

Alternative: two.sided

Interval of distance values: [0, 0.25] units (one unit = 9 metres)

Test statistic: Integral of squared absolute deviation

Deviation = observed minus theoretical

data: bramblecanes

u = 4.3982e−05, rank = 1, p-value = 0.01

6.7 INTERPOLATION OF POINT PATTERNS WITH CONTINUOUS

ATTRIBUTES

The previous section can be thought of as outlining methods for analysing point

patterns with categorical-level attributes. An alternative issue is the analysis of


Figure 6.18 Cross-L-function envelope for levels 0 and 1 of the bramble cane data


point patterns in which the points have continuous (or measurement scale) attrib-

utes, such as height above sea level, soil conductivity or house price. A typical

problem here is interpolation: given a sample of measurements – say, {z1, …, zn} at locations {x1, …, xn} – the goal is to estimate the value of z at some new point x.

Possible methods for doing this can be based on fairly simple algorithms, or on

more sophisticated spatial statistical models. Here, three key approaches will be

covered:

● Nearest neighbour interpolation

● Inverse distance weighting

● Kriging

6.7.1 Nearest Neighbour Interpolation

The first of these, nearest neighbour interpolation, is the simplest conceptually, and

can be stated as below:

● Find i such that |xi − x| is minimised

● The estimate of z is zi

In other words, to estimate z at x, use the value of zi at the observation point closest

to x. Since the set of closest points to xi for each i form the set of Thiessen (Voronoi)

polygons for the set of points, an obvious way to represent the estimates is as a set

of Thiessen (Voronoi) polygons corresponding to the xi points, with respective

attributes of zi. In rgeos there is no direct function to create Voronoi polygons, but

Carson Farmer2 has made some code available to do this, providing a function

called voronoipolygons. This has been slightly modified by the authors, and is

listed below. Note that the modified version of the code takes the points from a spatial points data frame as the basis for the Voronoi polygons, and carries across the attributes of the points to become attributes of

the corresponding Voronoi polygons. Thus, in effect, if the z value of interest is an

attribute in the input spatial points data frame then the nearest neighbour interpo-

lation is implicitly carried out when using this function.

The function makes use of Voronoi computation tools carried out in another

package called deldir – however, this package does not make use of Spatial∗

object types, and therefore this function provides a ‘front end’ to allow its inte-

gration with the geographical information handling tools in rgeos, sp and

2 http://www.carsonfarmer.com/2009/09/voronoi-polygons-with-r/


maptools. Do not be too concerned if you find the code difficult to interpret – at

this stage it is sufficient to understand that it serves to provide a spatial data

manipulation function that is otherwise not available.

#

# Original code from Carson Farmer

# http://www.carsonfarmer.com/2009/09/voronoi-polygons-with-r/

# Subject to minor stylistic modifications

#

require(deldir)

require(sp)

# Modified Carson Farmer code

voronoipolygons = function(layer) {

crds <- layer@coords

z <- deldir(crds[,1], crds[,2])

w <- tile.list(z)

polys <- vector(mode='list', length=length(w))

for (i in seq(along=polys)) {

pcrds <- cbind(w[[i]]$x, w[[i]]$y)

pcrds <- rbind(pcrds, pcrds[1,])

polys[[i]] <- Polygons( list(Polygon(pcrds)),

ID=as.character(i))

}

SP <- SpatialPolygons(polys)

voronoi <- SpatialPolygonsDataFrame(SP,

data=data.frame( x=crds[,1],

y=crds[,2],

layer@data,

row.names=sapply(slot(SP, 'polygons'),

function(x) slot(x, 'ID'))))

proj4string(voronoi) <- CRS(proj4string(layer))

return(voronoi)

}
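Before moving on, here is a quick usage sketch. It is not code from the book: it assumes the fulmar data frame from the gstat package (introduced in the next section), with columns x, y, fulmar and year, and simply feeds a spatial points data frame built from those columns into voronoipolygons.

require(gstat)
require(sp)
data(fulmar)
# One survey year only, to avoid duplicated point locations
fulmar.99 <- fulmar[fulmar$year == 1999, ]
fulmar.spdf <- SpatialPointsDataFrame(fulmar.99[, c("x", "y")],
    data.frame(fulmar = fulmar.99$fulmar))
# Each Voronoi polygon inherits the density of its generating point,
# so plotting the polygons shows the nearest neighbour interpolation surface
fulmar.voro <- voronoipolygons(fulmar.spdf)
plot(fulmar.voro)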

6.7.2 A Look at the Data

Having defined this function, the next stage is to use it on a test dataset. One such

dataset is provided in the gstat package. This package provides tools for a number

of approaches to spatial interpolation – including the other two listed in this chapter.

Of interest here is a data frame called fulmar. Details of the dataset may be

obtained by entering ?fulmar once the package gstat has been loaded. The data

are based on airborne counts of the sea bird Fulmarus glacialis during August and

September of 1998 and 1999, over the Dutch part of the North Sea. The counts are

taken along transects corresponding to flight paths of the observation aircraft, and

are transformed to densities by dividing counts by the area of observation, 0.5 km2.

In this and the following sections you will analyse the data described above.

First, however, these data should be read into R, and converted into a Spatial∗

object. The first thing you will need to do is enter the code to define the function

voronoipolygons as listed above. The next few lines of code will read in the


data (stored in the data frame fulmar) and then convert them into a spatial points

data frame. Note that the fulmar sighting density is stored in column fulmar in

the data frame fulmar – the location is specified in columns x and y. The point

object is next converted into


referred to or modified at a later date.

For this reason, you should get into the habit of constructing scripts for all your

analyses. Since being able to edit functions is extremely useful, both the MS


Windows and Mac OSX versions of R have built-in text editors. In RStudio you

should go to File > New File. In R, to start the Windows editor with a blank docu-

ment, go to File > New Script, and to open an existing script, File > Open Script.

To start the Mac editor, use the menu option File > New Document to open a new

document and File > Open Document to open an existing file.

Once code is written into these files, they can be saved for future use; rather

than copy and pasting each line of code, both R and RStudio have their own short-

cuts. Lines of code can be run directly by placing the cursor on the relevant line

(or highlighting a block) and then using Ctrl-R (Windows) or Cmd-Return (Mac).

RStudio also has a number of other keyboard short-cuts for running code, auto-

filling when you are typing, assignment, etc. Further tips are described at

http://r4ds.had.co.nz/workflow-basics.html.

It is also good practice to set the working directory at the beginning of your

R session. This can be done via the menu in RStudio: Session > Set Working

Directory > …. In Windows R select File > Change dir…, and in Mac R select

Misc > Set Working Directory. This points the R session to the folder you

choose and will ensure that any files you wish to read, write or save are placed

in this directory.
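The same thing can be done from the console with setwd(); the path below is purely illustrative and should be replaced with a folder on your own machine.

# Set and then confirm the working directory (illustrative path)
setwd("~/R_spatial")
getwd()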

Scripts can be saved by selecting File > Save As which will prompt you to enter

a name for the R script you have just created. Choose a name (e.g. test.R) and

select save. It is good practice to use the file extension .R.

1.9 OTHER RESOURCES AND ACCOMPANYING WEBSITE

There are many freely available resources for R users. In order to get some practice

with R we strongly suggest that you download the ‘Owen Guide’ (entitled The R

Guide) and work through this up to and including Section 5. It can be accessed via

http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf.

It does not require any additional libraries or data and provides a gentle introduc-

tion to R and its syntax.

There are many guides to the R software available on the internet. In particular,

you may find some of the following links useful:

● http://www.r-bloggers.com

● http://stackoverflow.com/ and specifically

http://stackoverflow.com/questions/tagged/r

The contemporary nature of R means that much of the R development for pro-

cessing geographical information is chronicled on social media sites (you can

search for information on services such as Twitter, for example #rstats) and

blogs (such as the R-bloggers site listed above), rather than standard textbooks.


In addition to the above resources, there is a website that accompanies this book:

https://study.sagepub.com/Brunsdon2e. This site contains all of the code, scripts,

exercises and self-test questions included in each chapter, and these are available

to download. The scripts for each chapter allow the reader to copy and paste the

code into the R console or into their own script. At the time of writing, all of the

code in the book is correct. However, R and its packages are occasionally updated.

In most cases this is not problematic as the update almost always extends the

functionality of the package without affecting the original code. However, in a

few instances, specific packages are completely rewritten without backward com-

patibility. If this happens the code on the accompanying website will be updated

accordingly. You are therefore advised to check the website regularly for archival

components and links to new resources.

REFERENCES

Bivand, R.S., Pebesma, E.J. and Gómez-Rubio, V. (2013) Applied Spatial Data:

Analysis with R, 2nd edition. New York: Springer.

Brunsdon, C. and Chen, H. (2014) GISTools: Some further GIS capabilities for R. R

Package Version 0.7-4. http://cran.r-project.org/package=GISTools.

Krause, A. and Olson, M. (1997) The Basics of S and S-PLUS. New York: Springer.

Pebesma, E., Bivand, R., Cook, I., Keitt, T., Sumner, M., Lovelace, R., Wickham, H.,

Ooms, J. and Racine, E. (2016) sf: Simple features for R. R Package Version 0.6-3.

http://cran.r-project.org/package=sf.

Tennekes, M. (2015) tmap: Thematic maps. R Package Version 1. http://cran.r-project.

org/package=tmap.

Wilkinson, L. (2005) The Grammar of Graphics. New York: Springer.

2

DATA AND PLOTS

2.1 INTRODUCTION

This chapter introduces some of the different data types and data structures that

are commonly used in R and how to visualise them. As you work through this

book, you will gain experience in using and manipulating these individually and

within blocks of code. It sequentially builds on the ideas that are introduced, for

example developing your own functions, and tests this knowledge through self-

test exercises. As you progress, the exercises will place more emphasis on solving

problems, using the different data structures needed, rather than simply working

through the example code. As you work though the code, you should use the help

available to explore the different functions that are called in the code snippets, such

as max, sqrt and length.

This chapter covers a lot of ground – it will:

● Review basic commands in R

● Introduce variables and assignment

● Introduce data types and classes

● Describe how to test for and manipulate data types

● Introduce and compare data frames and tibbles

● Introduce basic plot commands

● Describe how to read, write, load and save different data types

Chapter 1 introduced R, the reasons for using it in spatial analysis and mapping,

and described how to install it. It also directed you to some of the many

resources and introductory exercises for undertaking basic operations in R.

Specifically it advised that you should work through the ‘Owen Guide’ (entitled

The R Guide) up to the end of Section 5. This can be accessed via

https://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf.


This chapter assumes that you have worked your way through this – it does not

take long and provides critical introductory knowledge for the more specialised

materials that will be covered in the rest of this book.

2.2 THE BASIC INGREDIENTS OF R: VARIABLES AND ASSIGNMENT

The R interface can be used as a sort of calculator, returning the results of simple

mathematical operations such as (−5 + −4). However, it is normally convenient

to assign values to variables. The form for doing this is:

R_object <- value

The arrow performs the assignments and is referred to as gets. So in this case you

would say R_object gets value. It is possible to use an equals sign instead of gets,

but this only performs a soft assignment (the difference between the arrow and the

equals sign relates to how R stores the R_object). The objects and variables

that are created can then be manipulated or subject to further operations.

# examples of simple assignment

x <- 5

y <- 4

# the variables can be used in other operations

x+y

[1] 9

# including defining new variables

z <- x + y

z

[1] 9

# which can then be passed to other functions

sqrt(z)

[1] 3

The snippet of code above is the first that you have come across in this book.

There will be further snippets throughout each chapter. Two key points. First,

you are strongly advised to enter and run the code at the R prompt yourself.

Our very strong advice is that you write the code into a script or document

using the in-built text editor in RStudio. For example, for each chapter you might

start a new RStudio session or project and open a new .R file. This script

can be used to save the code snippets you enter and to include your comments

and annotations. The reasons for doing this are so that you get used to using the


The basic assignment type


in R is to a vector of values. Vectors can have sin-

gle values as in x, y and z above, or multiple values. Note the use of

c(4.3,7.1, …) in the code below, where the c instructs R to combine or

concatenate multiple values:

# example of vector assignment

tree.heights <- c(4.3,7.1,6.3,5.2,3.2,2.1)

tree.heights

[1] 4.3 7.1 6.3 5.2 3.2 2.1

Remember that UPPER and lower case matters to R. So tree.heights, Tree.

Heights and TREE.HEIGHTS will be treated as referring to different variables

by R. Make sure you type in upper and lower case exactly as it is written, otherwise

you are likely to get an error.

In the example above, a vector of values has been assigned to the variable

tree.heights. It is possible to apply a single assignment to the entire vector, as

in the code below that returns tree.heights squared. Note how the operation

returns the square of each element in the vector.

tree.heights**2

[1] 18.49 50.41 39.69 27.04 10.24 4.41

Other operations or functions can then be applied to these vectors variables:

sum(tree.heights)

[1] 28.2

mean(tree.heights)

[1] 4.7

R console, and running the code will help your understanding of the code’s

functionality. Lines of code can be run directly by placing the cursor on the line

of code (or highlighting a block of code) and then using Ctrl-R (Windows) or

Cmd-Return (Mac). Keeping copies of your code in this way will help you keep

a record of it and will allow you to go back and edit it at a later date. Second, we

would like to emphasise the importance of learning by doing and getting your

hands dirty. Some of the code might look a bit fearsome when first viewed,

especially in later chapters, but the only really effective way to understand it is

to give it a try. Remember that the code and chapter summaries are available on

the book’s website https://study.sagepub.com/Brunsdon2e so that

you can copy and paste these into the R console or your own script. A final point

is that in the code, any comments are prefixed by # and are ignored by R when

entered into the console.


And, if needed, the results can be assigned to yet further variables:

max.height <- max(tree.heights)
max.height

[1] 7.1

One of the advantages of vectors and other structures with multiple data elements

is that they can be subsetted. Individual elements or subsets of elements can be

extracted and manipulated:

tree.heights

[1] 4.3 7.1 6.3 5.2 3.2 2.1

tree.heights[1] # first element

[1] 4.3

tree.heights[1:3] # a subset of elements 1 to 3

[1] 4.3 7.1 6.3

sqrt(tree.heights[1:3]) #square roots of the subset

[1] 2.073644 2.664583 2.509980

tree.heights[c(5,3,2)] # a subset of elements 5,3,2: note the ordering

[1] 3.2 6.3 7.1

In the above examples the numeric values were assigned. However, character

or logical values can be also assigned as in the code below. This starts to hint at

the idea of different classes and types of variables which are described in more

detail in the next sections.

# examples of character variable assignment

name <- "Lex Comber"

name

[1] "Lex Comber"

# these can be assigned to a vector of character variables

cities <- c("Leicester","Newcastle","London","Leeds","Exeter")

cities

[1] "Leicester" "Newcastle" "London" "Leeds"

[5] "Exeter"

length(cities)

[1] 5

# an example of a logical variable

northern <- c(FALSE, TRUE, FALSE, TRUE, FALSE)

northern

[1] FALSE TRUE FALSE TRUE FALSE

# this can be used to subset other variables

cities[northern]

[1] "Newcastle" "Leeds"

2.3 DATA TYPES AND DATA CLASSES

This section introduces data classes and data types to a sufficient depth for read-

ers of this book. However, more formal descriptions of basic classes for R data


objects can be found in the R Manual on the CRAN website at

http://stat.ethz.ch/R-manual/R-devel/library/methods/html/BasicClasses.html.

2.3.1 Data Types in R

Data in R can be considered as being organised into a hierarchy of data types

which can then be used to hold data values in different structures. Each of the

types is associated with a test and a conversion function. The basic or core data

types and associated tests and conversions are shown in Table 2.1.

You should note from the table that each type has an associated test in the form

is.xyz, which will return TRUE or FALSE, and a conversion in the form as.

xyz. Most of the exercises, methods, tools, functions and analyses in this book

work with only a small subset of these data types: character, numeric and

logical. These data types can be used to populate different data structures or

classes, including vectors, matrices, data frames, lists and factors. The data types

are described in more detail below. In each case the objects created by the different

classes, conversion functions or tests are illustrated.

Table 2.1 Data type, tests and conversion functions

Type Test Conversion

character is.character as.character

complex is.complex as.complex

double is.double as.double

expression is.expression as.expression

integer is.integer as.integer

list is.list as.list

logical is.logical as.logical

numeric is.numeric as.numeric

single is.single as.single

raw is.raw as.raw

2.3.1.1 Characters

Character variables contain text. By default the function character creates a vec-

tor of whatever length is specified. Each element in the vector is equal to "", an

empty character element in the variable. The function as.character tries to

convert its argument to character type, removing any attributes including, for

example, vector element names. The function is.character tests whether the

arguments passed to it are of character type and returns TRUE or FALSE depending

on whether its argument is of character type or not. Consider the following exam-

ples of these functions and the results when they are applied to different inputs:


character(8)

[1] "" "" "" "" "" "" "" ""

# conversion

as.character("8")

[1] "8"

# tests

is.character(8)

[1] FALSE

is.character("8")

[1] TRUE

2.3.1.2 Numeric

Numeric data variables are used to hold numbers. The function numeric is used

to create a vector of the specified length with each element equal to 0. The func-

tion as.numeric tries to convert (coerce) its argument to numeric type. It is

identical to as.double (the older as.real function is now defunct). The function is.numeric tests

whether the arguments passed to it are of numeric type and returns TRUE or

FALSE depending on whether its argument is of numeric type or not. Notice how

the last test in the code below returns FALSE because not all of the elements are

numeric.

numeric(8)

[1] 0 0 0 0 0 0 0 0

# conversions

as.numeric(c("1980","-8","Geography"))
[1] 1980 -8 NA

as.numeric(c(FALSE,TRUE))

[1] 0 1

# tests

is.numeric(c(8, 8))

[1] TRUE

is.numeric(c(8, 8, 8, "8"))

[1] FALSE

2.3.1.3 Logical

The function logical creates a logical vector of the specified length and by default

each element of the vector is set to equal FALSE. The function as.logical

attempts to convert its argument to be of logical type. It removes any attributes

including, for example, vector element names. A range of character strings c("T",

"TRUE", "True", "true"), as well any number not equal to zero, are regarded

as TRUE. Similarly, c("F", "FALSE", "False", "false") and zero are

regarded as FALSE. All others are regarded as NA. The function is.logical

returns TRUE or FALSE depending on whether the argument passed to it is of

logical type or not.


logical(7)

[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

# conversion

as.logical(c(7,5,0,-4,5))
[1] TRUE TRUE FALSE TRUE TRUE
# TRUE and FALSE can be converted to 1 and 0
as.logical(c(7,5,0,-4,5)) * 1
[1] 1 1 0 1 1
as.logical(c(7,5,0,-4,5)) + 0

[1] 1 1 0 1 1

# different ways to declare TRUE and FALSE

as.logical(c("True","T","FALSE","Raspberry","9","0", 0))

[1] TRUE TRUE FALSE NA NA NA NA

Logical vectors are very useful for indexing and subsetting data, including spatial


data, to select the data that satisfy some criteria. For example, consider the following:

data <- c(3, 6, 9, 99, 54, 32, -102)

# a logical test

index <- (data > 10)

index

[1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE

# used to subset data

data[index]

[1] 99 54 32

sum(data)

[1] 101

sum(data[index])

[1] 185
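A small extension of this example (not in the original text) shows two related idioms: negating the index and recovering the positions of the matching elements.

data[!index] # the values that fail the test
[1] 3 6 9 -102
which(index) # the positions of the values that pass
[1] 4 5 6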

2.3.2 Data Classes in R

The different data types can be used to populate different data structures or classes.

This section will describe and illustrate vectors, matrices, data frames, lists and

factors, data classes that are commonly used in spatial data analysis.

2.3.2.1 Vectors

All of the commands in R in Section 2.3.1 produced vectors. Vectors are the most

commonly used data structure and the standard one-dimensional R variable.

You will have noticed that when you specified character or logical, etc., a

vector of a given length was produced. An alternative approach is to use the

function vector, which produces a vector of the length and type or mode

specified. The default is logical, and when you assign values to vectors R will

seek to convert them to whichever vector mode is most convenient. Recall that

the test is.vector returns TRUE if its argument is a vector of the specified

class or mode with no attributes other than names, returning FALSE otherwise,

and that the function as.vector seeks to convert its argument into a vector of

whatever mode is specified.


# defining vectors

vector(mode = "numeric", length = 8)

[1] 0 0 0 0 0 0 0 0

vector(length = 8)

[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

# testing and conversion

tmp <- data.frame(a=10:15, b=15:20)

is.vector(tmp)

[1] FALSE

as.vector(tmp)

a b

1 10 15

2 11 16

3 12 17

4 13 18

5 14 19

6 15 20

2.3.2.2 Matrices

The function matrix creates a matrix from the data and parameters that are

passed to it. This must include parameters for the number of columns and rows in

the matrix. The function as.matrix attempts to turn its argument into a matrix,

and again the test is.matrix tests to see whether its argument is a matrix.

# defining matrices

matrix(ncol = 2, nrow = 0)

[,1] [,2]

matrix(1:6)

[,1]

[1,] 1

[2,] 2

[3,] 3

[4,] 4

[5,] 5

[6,] 6

matrix(1:6, ncol = 2)

[,1] [,2]

[1,] 1 4

[2,] 2 5

[3,] 3 6

# conversion and test

as.matrix(6:3)

[,1]

[1,] 6

[2,] 5

[3,] 4

[4,] 3

is.matrix(as.matrix(6:3))

[1] TRUE


Matrix rows and columns can be named – note the use of byrow=TRUE in the

following.

flow <- matrix(c(2000, 1243, 543, 1243, 212, 545,

654, 168, 109), c(3,3), byrow=TRUE)

# Rows and columns can have names, not just 1,2,3,…

colnames(flow) <- c("Leeds", "Maynooth", "Elsewhere")

rownames(flow) <- c("Leeds", "Maynooth", "Elsewhere")

# examine the matrix

flow

Leeds Maynooth Elsewhere

Leeds 2000 1243 543

Maynooth 1243 212 545

Elsewhere 654 168 109

# and functions exist to summarise

outflows <- rowSums(flow)

outflows

Leeds Maynooth Elsewhere

3786 2000 931

However, if the data class is not a matrix then just use names, rather than

rownames or colnames.

z <- c(6,7,8)

names(z) <- c("Newcastle","London","Manchester")

z

Newcastle London Manchester

6 7 8

R has many additional tools for manipulating matrices and performing matrix

algebra functions that are not described here. However, as spatial scientists we are

often interested in analysing data that have a matrix-like form, as in a data table.

For example, in an analysis of spatial data in vector format, the rows in the attrib-

ute table represent specific features (such as polygons) and the columns hold

information about the attributes of those features. Alternatively, in a raster analysis

environment, the rows and columns may represent specific latitudes and longi-

tudes, or northings and eastings, or raster cells. Methods for analysing data in

matrix-like structures will be covered in more detail in later chapters as spatial

data objects (Chapter 3) and spatial analyses (Chapter 5) are introduced.

You will have noticed in the code snippets that a number of new functions

are introduced. For example, early in this chapter, the function sum was


2.3.2.3 Factors

The function factor creates a vector with specific categories, defined in the lev-

els parameter. The ordering of factor variables can be specified and an ordered

function also exists. The functions as.factor and as.ordered are the coercion

functions. The test is.factor returns TRUE or FALSE depending on whether its

argument is of type factor or not, and is.ordered returns TRUE when its argu-

ment is an ordered factor and FALSE otherwise.

# a vector assignment

house.type <- c("Bungalow", "Flat", "Flat",

"Detached", "Flat", "Terrace", "Terrace")

# a factor assignment

used. R includes a number of functions that can be used to generate descrip-

tive statistics such as sum and max. You should explore these as they occur

in the text to develop your knowledge of and familiarity with R. Further

useful examples are in the code below and throughout this book. You could

even store them in your own R script. R includes extensive help files which

can be used to explore how different functions can be used, frequently with

example snippets of code. An illustration of how to find out more about the

sum function and some further summary functions is provided in the code

below.

?sum

help(sum)

# Create a variable to pass to other summary functions

x <- matrix(c(3,6,8,8,6,1,-1,6,7),c(3,3),byrow=TRUE)

# Sum over rows

rowSums(x)

# Sum over columns

colSums(x)

# Calculate column means

colMeans(x)

# Apply function over rows (1) or columns (2) of x

apply(x,1,max)

# Logical operations to select matrix elements

x[,c(TRUE,FALSE,TRUE)]

# Add up all of the elements in x

sum(x)

# Pick out the leading diagonal

diag(x)

# Matrix inverse

solve(x)

# Tool to handle rounding

zapsmall(x %*% solve(x))


house.type <- factor(c("Bungalow", "Flat",

"Flat", "Detached", "Flat", "Terrace", "Terrace"),

levels=c("Bungalow","Flat","Detached","Semi","Terrace"))

house.type

[1] Bungalow Flat Flat Detached Flat Terrace

[7] Terrace

Levels: Bungalow Flat Detached Semi Terrace

# table can be used to summarise

table(house.type)

house.type

Bungalow Flat Detached Semi Terrace

1 3 1 0 2

# levels controls what can be assigned

house.type <- factor(c("People Carrier", "Flat",

"Flat", "Hatchback", "Flat", "Terrace", "Terrace"),

levels=c("Bungalow","Flat","Detached","Semi","Terrace"))

house.type

[1] Flat Flat Flat Terrace Terrace

Levels: Bungalow Flat Detached Semi Terrace

Factors are useful for categorical or classified data – that is, data values that must

fall into one of a number of predefined classes. It is obvious to see how this might

be relevant to geographical analysis, where many features represented in spatial

data are labelled using one of a set of discrete classes.

2.3.2.4 Ordering

There is no concept of ordering in factors. However, this can be imposed by using

the ordered function. Ordering allows inferences about preference or hierarchy

to be made (lower–higher, better–worse, etc.) and this can be used in data selection

or indexing (as above) or in the interpretation of derived analyses.

income <-factor(c("High", "High", "Low", "Low",

"Low", "Medium", "Low", "Medium"),

levels=c("Low", "Medium", "High"))

income > "Low"

[1] NA NA NA NA NA NA NA NA

# levels in ordered defines a relative order

income <-ordered(c("High", "High", "Low", "Low",

"Low", "Medium", "Low", "Medium"),

levels=c("Low", "Medium", "High"))

income > "Low"

[1] TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE

Thus we can see that ordering is implicit in the way that the levels are specified and

allows other, ordering-related functions to be applied to the data.

The functions sort and table are new functions. In the above code relating

to factors, the function table was


used to generate a tabulation of the data in

house.type. It provides a count of the occurrence of each level in house.

type. The command sort orders a vector or factor. You should use the help in


R to explore how these functions work and try them with your own variables.

For example:

sort(income)

2.3.2.5 Lists

The character, numeric and logical data types and the associated data

classes described above all contain elements that must all be of the same basic type.

Lists do not have this requirement. Lists have slots for collections of different ele-

ments. A list allows you to gather a variety of different data types together in a single

data structure and the nth element of a list is denoted by double square brackets.

tmp.list <- list("Lex Comber",c(2015, 2018),

"Lecturer", matrix(c(6,3,1,2), c(2,2)))

tmp.list

[[1]]

[1] "Lex Comber"

[[2]]

[1] 2015 2018

[[3]]

[1] "Lecturer"

[[4]]

[,1] [,2]

[1,] 6 1

[2,] 3 2

# elements of the list can be selected

tmp.list[[4]]

[,1] [,2]

[1,] 6 1

[2,] 3 2

From the above it is evident that the function list returns a list structure composed

of its arguments. Each value can be tagged depending on how the argument was

specified. The conversion function as.list attempts to coerce its argument to a

list. It turns a factor into a list of one-element factors and drops attributes that are not

specified. The test is.list returns TRUE if and only if its argument is a list. These

are best explored through some examples; note that list items can be given names.

employee <- list(name="Lex Comber", start.year = 2015,

position="Professor")

employee

$name

[1] "Lex Comber"

$start.year

[1] 2015

$position

[1] "Professor"


Lists can be joined together with append:

append(tmp.list, list(c(7,6,9,1)))

and lapply applies a function to each element of a list:

# lapply with different functions

lapply(tmp.list[[2]], is.numeric)

lapply(tmp.list, length)

Note that the length of a matrix, even when held in a list, is the total number of

elements.
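A one-line check (not in the original text) makes the point: the 2 × 2 matrix stored in the fourth slot reports a length of 4, not 2.

length(tmp.list[[4]])
[1] 4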

2.3.2.6 Defining Your Own Classes

In R it is possible to define your own data type and to associate it with specific

behaviours, such as its own way of printing or drawing. For example, you will notice

in later chapters that the plot function is used to draw maps for spatial data objects

as well as conventional graphs. Suppose we create a list containing some employee

information.

employee <- list(name="Lex Comber", start.year = 2015,

position="Professor")

This can be assigned to a new class, called staff in this case (it could be any

name, but meaningful ones help).

class(employee) <- "staff"

Then we can define how R treats that class in the form <function>.<class> – for example, how it is printed. Note how the existing function

for printing is modified by the new class definition:

print.staff <- function(x) {

cat("Name: ",x$name,"\n")

cat("Start Year: ",x$start.year,"\n")

cat("Job Title: ",x$position,"\n")}

# an example of the print class

print(employee)

Name: Lex Comber

Start Year: 2015

Job Title: Professor

You can see that R knows to use a different print function depending on whether or not the argument is a variable of class staff. You could modify how your R environment treats exist-

ing classes in the same way, but do this with caution. You can also undo the class

assigned by using unclass, and the print.staff function can be removed per-

manently by using rm(print.staff):


print(unclass(employee))

$name

[1] "Lex Comber"

$start.year

[1] 2015

$position

[1] "Professor"

2.3.2.7 Classes in Lists

Variables can be assigned to new or user-defined class objects. The example below

defines a function to create a new staff object.

new.staff <- function(name,year,post) {

result <- list(name=name, start.year=year, position=post)

class(result) <- "staff"

return(result)}

A list can then be defined, which is populated using that function as in the code

below (note that functions will be dealt with more formally in later chapters).

leeds.uni <- vector(mode='list',3)

# assign values to elements in the list

leeds.uni[[1]] <- new.staff("Heppenstall, Alison", 2017,"Professor")

leeds.uni[[2]] <- new.staff("Comber, Lex", 2015,"Professor")

leeds.uni[[3]] <- new.staff("Langlands, Alan", 2014,"VC")

And the list can be examined by entering:

leeds.uni
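As a small follow-on sketch (not in the original text), standard list tools still work on these objects, so a single field can be pulled out of every staff record:

# Extract the name field from each staff object in the list
sapply(leeds.uni, function(s) s$name)
[1] "Heppenstall, Alison" "Comber, Lex" "Langlands, Alan"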

2.3.2.8 data.frame versus tibble

Data of different types and classes are often held in tabular format. The data.

frame and tibble classes of the data table are described in this section.

Generally, in data tables, each of the records (rows) relates to some kind of real-

world feature (a person, a transaction, a date, etc.) and the columns represent some

attribute associated with that feature. In R data can be in a matrix, but matrices can

only hold one type of data (e.g. integer, logical and character). However,

data.frame and tibble class objects can hold different data types in different

columns (or fields). This section introduces these (in fact, the tibble class includes

data.frame) because they are used to hold attributes of spatial objects (points,

lines, areas, pixels) in the R spatial data formats sf and sp, as introduced in detail

in Chapter 3. Thus in spatial data tables, each record typically represents some real-

world geographical feature (a place, a route, a region, etc.) and the fields describe

variables or attributes associated with that feature (population, length, area, etc.).

The data.frame class in R is composed of a series of vectors of equal length,

which together form a two-dimensional data structure. Each vector records values


for a particular theme or attribute. Typically these form the columns in a data

frame, and the name of each vector provides the column name or header. They are

ordered such that the nth element in each vector describes a property for the nth

record (row) representing the nth feature. The data.frame class is the most

commonly used method for storing data in R.

A data frame can be created using the data.frame() function:

df <- data.frame(dist = seq(0,400, 100),

city = c("Leeds", "Nottingham", "Leicester", "Durham", "Newcastle"))

str(df)

'data.frame': 5 obs. of 2 variables:

$ dist: num 0 100 200 300 400

$ city: Factor w/ 5 levels "Durham","Leeds",..: 2 5 3 1 4

The data.frame() function by default encodes character strings into factors. To

see this enter:

df$city

To overcome this the df object can be refined using stringsAsFactors = FALSE:

df <- data.frame(dist = seq(0,400, 100),

city = c("Leeds", "Nottingham", "Leicester", "Durham", "Newcastle"),

stringsAsFactors = FALSE)

str(df)

'data.frame': 5 obs. of 2 variables:

$ dist: num 0 100 200 300 400

$ city: chr "Leeds" "Nottingham" "Leicester" "Durham" …

The tibble class is a reworking of the data.frame class that seeks to retain the

operational advantages of data frames and eliminate aspects that have proven to

be less effective. Enter the code below to create tb:

library(tibble)
tb <- tibble(dist = seq(0,400, 100),
city = c("Leeds", "Nottingham", "Leicester", "Durham", "Newcastle"))

Probably the biggest criticism of data.frame is the partial matching behaviour.

Enter the following code:

df$ci

[1] "Leeds" "Nottingham" "Leicester" "Durham"

[5] "Newcastle"

tb$ci

NULL

Although there is no variable called ci, the partial matching in the data.frame

means that the city variable is returned. This is a bit worrying!

A further problem is what gets returned when a data table is subsetted. A tibble

always returns a tibble, whereas a data frame may return a vector or a data frame,


depending on the dimensions of the result. For example, compare the outputs of

the following code:

# 1 column

df[,2]

tb[,2]

class(df[,2])

class(tb[,2])

# 2 columns

df[,1:2]

tb[,1:2]

class(df[,1:2])

class(tb[,1:2])
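With the df and tb objects created above, you should see something like the following (a sketch of the expected output): class(df[,2]) returns "character" (a bare vector), while class(tb[,2]) returns "tbl_df" "tbl" "data.frame" (still a tibble); with two columns both return a data frame, but only the tibble keeps its tibble classes.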

Note that a tibble is a data frame, but, as these examples show, a tibble does not behave identically to a basic data.frame.
