getting some work done browsing twitter at work today, I came across this tweet from the always excellent John Burn-Murdoch on the scourge of heatmaps. What’s most frustrating about these maps is that ggplot2 (which is underrated as mapping software, especially when combined with packages like sf in R) makes it super easy to create this bland, uninformative maps.
For instance, lets load some mapping libraries
library(tidyverse) library(sf) library(rgdal)
For this blog I’m going to use data of bus stops in London, because there’s an absolute ton of them and because I love the London Datastore and it was the first public, heavy, point data file I came across.
Let’s grab some data to use as exemplars
#get the shapefile data of london from GADM #downloads into a temp file gadm_url <- "https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_GBR_2_sf.rds" temp_dir <- tempdir() download.file(gadm_url, destfile = file.path(temp_dir, "london_shapefile.rds"), mode = "wb", quiet = TRUE) london <- sf::st_as_sf(readRDS(file.path(temp_dir, "london_shapefile.rds"))) %>% filter(grepl("London", NAME_2)) #get the bus stop data #https://data.london.gov.uk/dataset/tfl-bus-stop-locations-and-routes bus <- read.csv("https://files.datapress.com/london/dataset/tfl-bus-stop-locations-and-routes/bus-stops-10-06-15.csv", stringsAsFactors = FALSE) %>% filter(!is.na(Location_Easting)) %>% #convert to simple features st_as_sf(coords = c("Location_Easting", "Location_Northing"), crs = st_crs("+init=epsg:27700")) %>% #transform projection to match the boundary data st_transform(crs = st_crs(london)) %>% #remove bus stops outside of the limits of the london shapefile .[unlist(st_intersects(london, .)),]
Now that we have the data, let’s have a first pass at plotting the points on a map of London using ggplot2
#plot the bus stop locations p1 <- ggplot(bus) + geom_sf(data = london, fill = "grey") + geom_sf(colour = "blue", alpha = 0.2) + ggtitle("Bus Stops in London") + theme_void() plot(p1)
The points here do kiiiind of work alone- it’s possible to see the route along which buses travel. However, it’s still pretty heavy and takes some cognitive effort. It’s also worth remembering this is just the first example of data I came across- if working with stuff like common incident data, even a heavy alpha on the points is going to cover the map and leave it unreadable, as in the tweet up top.
But let’s say we plot it as a heatmap to try and sumamrise where the points are. In my opinion, this is even worse. For one- it stretches utside the boundaries of the shapes, even though we have filtered data to include no bus stops outside London. In this example that’s not a huge deal- but if London was (e.g.) an Island, the graph is now suggesting commutes into the sea. It’s also incredibly taxing to keep track of the various peaks and troughs- even in a simple model where bus stops generally increase as you go near to the centre
#bind the coordinates as numeric bus <- bus %>% cbind(., st_coordinates(bus)) #plot the bus stops as a density map p2 <- ggplot() + geom_sf(data = london) + stat_density_2d(data = bus, aes(X, Y)) + ggtitle("Bus Stops in London") + theme_void() plot(p2)
Instead, we can bin the point data into equi-sized containers. I’m extremely partial to this, even though it’s not super popular. To do this with hexagonal bins (the closest to circular that still tessellates perfectly), we just have to create a grid of points and connect them up
#merge the london wards into one boundary file london_union <- london %>% group_by("group") %>% summarise() #generate a grid of points separated hexagonally #no way to do this purely in sf yet afaik hex_points <- spsample(as_Spatial(london_union), type = "hexagonal", cellsize = 0.01) #generate hexgaon polygons from these points hex_polygons <- HexPoints2SpatialPolygons(hex_points) %>% st_as_sf(crs = st_crs(london_union)) %>% #clip to the london shapefile st_intersection(., london_union)
We can then find out which bin every bus stop is located in using st_intersects
#find how many bus stops are within each hexagon hex_polygons$planning_no <- lengths(st_intersects(hex_polygons, bus)) #plot the number of bus stops per bin p2 <- ggplot(hex_polygons) + geom_sf(aes(fill = planning_no)) + scale_fill_viridis_c(option = "magma", "# Bus Stops") + theme_void() + ggtitle("Binned London Bus Stops") plot(p2)
However, this is still pretty confusing. It seems like bus stops are fairly randomly distributed as by chance one hexagon may contain multiple stops at its edges, whereas a neighbour may be juuuust missing out on these.
To mitigate this effect, we can study the spatial autocorrelation of each hexagon to it’s neighbours. There are multiple ways to do this, but the one I was first introduced to and have used most is the Getis-Ord local statistic. In this example I will include.self() which means we are using the Gi* variant of the statistic.
Basically- we tell R to find all the nearest neighbours of any bin (hexagon- though not necessarily so, we could e.g. use wards, but I think it looks messier). It then calculates the ratio of values within the bin and it’s neighbours, to the total number of points (bus stops). The reported Gi* value is a z statistic- it can be positive (more clustering) or negative (less) and used to find significant clusters. I’m not going to do any of that here- just accept for now that a high Getis_Ord Gi* value means a greater cluster of bus stops in that region.
library(spdep) #find the centroid of each hexagon and convert to a matrix of points hex_points <- do.call(rbind, st_geometry(st_centroid(hex_polygons))) %>% unlist() %>% as.matrix.data.frame() #use a k-nearest-neighbour algorithm to find which shape neighbour which #super easy for the hexagons analytically obvs but important for e.g. using the ward boundaries instead neighbouring_hexes <- knn2nb(knearneigh(hex_points, k = 6), row.names = rownames(hex_points)) %>% include.self() #calculate the local G for a given variable (#bus stops) using the neihbours found previously localGvalues <- localG(x = as.numeric(hex_polygons$planning_no), listw = nb2listw(neighbouring_hexes, style = "B"), zero.policy = TRUE) #bind this back to the sf as a numeric variable column hex_polygons$smooth_planning_no <- as.numeric(localGvalues) #plot the statistic #+ve indicates more than would be expected p3 <- ggplot(hex_polygons) + geom_sf(aes(fill = smooth_planning_no)) + scale_fill_viridis_c(option = "magma", name = "Gi* Statistic") + theme_void() + ggtitle("Getis-Ord Binned London Bus Stops Statistic") plot(p3)
which generates a nice plot showing smoothed autocorrelation of dense public-transportation access. I like how you can clearly see the darker regions for the Lea Valley and Richmond Park, and in contrast the hubs of Kingston and Croydon, but in a way to is much more manageable than the contour map, or the point data itself.
It’s also worth bearing in mind, that this data is fairly organised (bus routes are to some extent, logically planned). When I first looked into spatial autocorrelation I was dealing with a huge number of dense points randomly dispersed over a significant proportion of England. At that level, techniques such as the finding the Getis-Ord statistic allow you to make sense of the data, AS WELL AS statistically test it. Though I haven’t ever worked with epidemiology data, apparently it’s a powerful technique to find clusters of disease outbreaks, and indeed, the data packaged with spdep is for SIDS data in North Carolina.
For more on this, I first learned about Getis-Ord from an excellent The Pudding post (from which some of my code is stolen), ARCGIS has a pretty good write up, and of course there is the CRAN page for the spdep library