Racing Bar Charts and Some data.table Munging

A while ago (and also still a bit) racing bar charts were all the rage on data visualisation forums/twitter. Perhaps one of the real breakout examples is this tweet from the, always excellent, John Burn-Murdochat the Financial Times, looking at the most populous cities in the world since the middle ages:

As with most things I blog about here I wondered if I could make them in R, and how easy that would be. The first half of this post is basically recreating the aforementioned gif pretty simply using gganimate. However, I also wanted to see if I could use the format to look at the strongest English football teams across history in link to second half of blogpost which gave me an unexpected lesson in using data.table and especially some lesser used facets of it- ones that might require more complex multiline solutions in the tidyverse.

Reproducing Racing Bar Charts in R

Getting on with actually producing racing bar charts, first we want to load up the libraries we’ll need.

library(tidyverse)
library(gganimate)
library(tweenr)
library(stringi)
#for getting the continents of various cities
library(countrycode)

Then also grab the data. This isn’t exactly all the data used by John in his tweet but it’s close enough. I can’t remember where I got the data from but similar datasets can be found by googling “world cities populations csv” e.g. here.

#read in data from blog repo
city_data <- read.csv("https://raw.githubusercontent.com/RobWHickman/netlify_blog/race_bar_charts/static/files/racing_bar_charts/city_populations.csv", 
                 encoding = "UTF-8", stringsAsFactors = FALSE)

head(city_data)
##   Country.Code Country.or.area City.Code               Urban.Agglomeration
## 1          392           Japan     21671                             Tokyo
## 2          356           India     21228                             Delhi
## 3          156           China     20656                          Shanghai
## 4           76          Brazil     20287                      S<e3>o Paulo
## 5          356           India     21206                   Mumbai (Bombay)
## 6          484          Mexico     21853 Ciudad de M<e9>xico (Mexico City)
##   Note  Latitude Longitude  X1950  X1955  X1960  X1965  X1970  X1975
## 1  325  35.68950 139.69171 11,275 13,713 16,679 20,284 23,298 26,615
## 2  318  28.66667  77.21667  1,369  1,782  2,283  2,845  3,531  4,426
## 3  202  31.22000 121.46000  4,301  5,846  6,820  6,428  6,036  5,627
## 4   NA -23.55000 -46.64000  2,334  3,044  3,970  5,494  7,620  9,614
## 5   NA  19.07398  72.88084  2,857  3,432  4,060  4,854  5,811  7,082
## 6  330  19.42732 -99.14187  3,365  4,294  5,479  6,969  8,831 10,734
##    X1980  X1985  X1990  X1995  X2000  X2005  X2010  X2015  X2020  X2025
## 1 28,549 30,304 32,530 33,587 34,450 35,622 36,834 38,001 38,323 37,876
## 2  5,558  7,325  9,726 12,407 15,732 18,670 21,935 25,703 29,348 32,727
## 3  5,966  6,847  7,823 10,450 13,959 16,763 19,980 23,741 27,137 29,442
## 4 12,089 13,395 14,776 15,913 17,014 18,288 19,660 21,066 22,119 22,899
## 5  8,658 10,391 12,436 14,310 16,367 17,891 19,422 21,043 22,838 25,207
## 6 13,028 14,278 15,642 17,017 18,457 19,276 20,132 20,999 21,868 22,916
##    X2030
## 1 37,190
## 2 36,060
## 3 30,751
## 4 23,444
## 5 27,797
## 6 23,865

Then we want to pull out the relevant data and do some melting to get a nice long format data frame of our data.

city_data <- city_data %>%
  #select out relevant columns
  select(country_id = Country.Code, country = Country.or.area,
         city_id = City.Code, city = Urban.Agglomeration,
         X1950, X1955, X1960, X1965, X1970, X1975, X1980, X1985, X1990,
         X1995, X2000, X2005, X2010, X2015, X2020, X2025, X2030) %>%
  #melt the data to long format
  reshape2::melt(id.vars = c("country_id", "country", "city_id", "city"),
                 variable.name = "year", value.name = "population") %>%
  #conver the data into usuable numbers
  mutate(year = as.numeric(gsub("^X", "", year)),
         population = as.numeric(gsub(",", "", population)),
         #convert the text into utf-8 readable
         city = stri_trans_general(city, "latin-ascii")) %>%
  #extract the english names for cities
  mutate(city_name = case_when(
    grepl("\\(", city) ~ str_extract(city,  "(?<=\\().+?(?=\\))"),
    grepl("-", city) ~ gsub("-.*", "", city),
    TRUE ~ as.character(city)
  ))

head(city_data)
##   country_id country city_id                                  city year
## 1        392   Japan   21671                                 Tokyo 1950
## 2        356   India   21228                                 Delhi 1950
## 3        156   China   20656                              Shanghai 1950
## 4         76  Brazil   20287                      S<U+FFFD>o Paulo 1950
## 5        356   India   21206                       Mumbai (Bombay) 1950
## 6        484  Mexico   21853 Ciudad de M<U+FFFD>xico (Mexico City) 1950
##   population        city_name
## 1      11275            Tokyo
## 2       1369            Delhi
## 3       4301         Shanghai
## 4       2334 S<U+FFFD>o Paulo
## 5       2857           Bombay
## 6       3365      Mexico City

As the y axis of our racing bar chart is going to be the cities rank in terms of population for any given observation year, we need to list cities in order. Using the Tidyverse a simple group_by() and mutating an order column will do the trick

city_data <- city_data %>%
  #group by and find order at any point
  group_by(year) %>%
  arrange(-population) %>%
  mutate(order = row_number()) %>%
  ungroup()

head(city_data)
## # A tibble: 6 x 8
##   country_id country city_id city   year population city_name order
##        <int> <chr>     <int> <chr> <dbl>      <dbl> <chr>     <int>
## 1        392 Japan     21671 Tokyo  2020      38323 Tokyo         1
## 2        392 Japan     21671 Tokyo  2015      38001 Tokyo         1
## 3        392 Japan     21671 Tokyo  2025      37876 Tokyo         1
## 4        392 Japan     21671 Tokyo  2030      37190 Tokyo         1
## 5        392 Japan     21671 Tokyo  2010      36834 Tokyo         1
## 6        356 India     21228 Delhi  2030      36060 Delhi         2

I also wanted to colour the bars by the continent of the city for a little extra aesthetic. Easy to do by joining with the data from countrycode.

#get the id data for each unique city
id_data <- city_data %>%
  select(city_id, city_name, country_id, country) %>%
  unique() %>%
  #find the continent of each city
  mutate(continent = countrycode(.$country, origin = "country.name", destination = "continent"))

Then we just need to do a final bit of munging on our data. I think in theory it’s possible to do this purely within the gganimate plotting machinery, but I prefer to munge the data outside.

First we want to select the columns that will be ‘animated’- i.e. the population and the order, and then also the time information (year), and group (city_id). We pre-specify a linear ease (i.e that between time points the numbers increase/decrease linearly per frame).

Finally I merge in our id data we prepared above and also round off the population numbers to leave a data frame ready for plotting!

#the number of frames the output will contain
frames <- 500

#use tweenr to manually make the naimation frame data
frame_data <- city_data %>%
  group_by(year) %>%
  arrange(-population) %>%
  mutate(order = row_number()) %>%
  #tweenr stuff here
  select(city_id, year, population, order) %>%
  mutate(ease = "linear") %>%
  tween_elements(., "year", "city_id", "ease", nframes= frames) %>%
  #select out columns
  select(population, order, year, .frame, city_id = .group) %>%
  #merge in id data
  merge(., id_data, by = 'city_id') %>%
  #munge population numbers
  mutate(pop = round(population/1000, 2))

head(frame_data)
##   city_id population    order     year .frame city_name country_id
## 1   20001    82.0000  849.000 1950.000      0     Herat          4
## 2   20001   565.0000 1213.000 2030.000    500     Herat          4
## 3   20001   101.1290 1141.129 1969.516    122     Herat          4
## 4   20001   245.2500 1510.250 2001.875    324     Herat          4
## 5   20001   178.2581 1314.000 1988.871    243     Herat          4
## 6   20001   335.7419 1516.419 2014.839    405     Herat          4
##       country continent  pop
## 1 Afghanistan      Asia 0.08
## 2 Afghanistan      Asia 0.56
## 3 Afghanistan      Asia 0.10
## 4 Afghanistan      Asia 0.25
## 5 Afghanistan      Asia 0.18
## 6 Afghanistan      Asia 0.34

Then we just want to filter out only the information we want to plot (the top 10 cities per year, I use 10.8 so you see cities as they enter the top 10) and use the lesser-spotted geom_tile() geom from ggplot2 which I found is the easiest to manually move about. After some extra aesthetics we then specify the transition_states()- in our case transition based on the year- and some cool little extras like dynamically adjusting the y axis, and also fading bars as they enter and exit the plot.

At the bottom we then turn this plot into a gif using animate and add a little pause at the start and end.

p_cities <- frame_data %>%
  #only want to plot the top 10
  filter(order < 10.8) %>%
  ggplot(aes(y = order, x = pop)) +
  #hack to plot the moving bars
  #from v helpful answer at
  #https://stackoverflow.com/questions/53162821/
  #animated-sorted-bar-chart-with-bars-overtaking-each-other/53163549
  geom_tile(aes(x = pop/2, width = pop, fill = continent),
            alpha = 0.8, colour = "black", height = 0.9) +
  geom_text(aes(label = sprintf("%1.2f",pop)), hjust = 1) +
  geom_text(aes(x = 0, label = paste(city_name, " ")),
            vjust = 0.2, hjust = 1) +
  #add labels to plot
  labs(title='{round(as.numeric(closest_state))}',
       x = "Population (millions)", y = "") +
  #y limits at 0-10.5
  #don't clip as will screw the labels outside the plot
  coord_cartesian(ylim = c(0,10.5), clip = "off") +
  #flip the y axis
  scale_y_reverse(position = "left") +
  #theme stuff
  #taken from same stackoverflow answer
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0, size = 22),
        axis.ticks.y = element_blank(),
        axis.text.y  = element_blank(), 
        #make sure labels will be visible
        plot.margin = margin(0,0,0,2.5, "cm")) +
  #transition by our calculated year
  transition_states(year, transition_length = 1, state_length = 0) +
  #scale x axis as pop increases
  view_follow(fixed_y = TRUE) +
  #fade as bares enter and exit the plot
  exit_fade() +
  enter_fade()

#animate the gif
city_gif <- animate(p_cities, frames, start_pause = 10, end_pause = 20)
#e.g. if you want to save the output
#anim_save("city_gif.gif", city_gif)

And voilà! here is our gif

city_gif

Soccer Through the Ages

So as I mentioned, I wanted the example to be a plot of how English football teams have risen and fallen in strength over the ~150 years the league has been playing. This turned out to be more difficult than I anticipated so also gives a nice little extension to using data.table beyond the basic i,j,k syntax.

For this we need James Curley’s engsoccerdata package and then also data.table and zoo for munging our data

#use James Curley's engsoccer data for the match data
#club data and updated english football dat ain latest release
#November 2019
library(engsoccerdata)

#data.table and zoo for munging
library(data.table)
library(zoo)

Then we want to select only the relevant bits of the data for calculating the ELO ratings of teams on a match-by-match basis. For this I use the formula used by the folks who calculate the national football ELO ratings at eloratings.net so if you’re confused by what K and G mean look there.

#some dates in the engsoccerdata::england data are off
#original should work by time of reading
#match_data <- engsoccerdata:;england %>%
match_data <- readRDS("../../static/files/racing_bar_charts/england_data.rds") %>%
  mutate(date = as.Date(Date)) %>%
  select(date, season = Season, home, visitor, tier, hgoal, vgoal, result) %>%
  mutate(K = (5-tier) * 10) %>%
  mutate(G = case_when(
    abs(hgoal-vgoal) < 2 ~ 1,
    abs(hgoal-vgoal) < 3 ~ 1.5,
    abs(hgoal-vgoal) >= 3 ~ (11 + abs(hgoal-vgoal)) / 8
  )) %>%
  mutate(result = case_when(
    result == "H" ~ 1,
    result == "A" ~ 0,
    result == "D" ~ 0.5
  )) %>%
  arrange(date)

head(match_data)
##         date season                    home              visitor tier
## 1 1888-09-08   1888        Bolton Wanderers         Derby County    1
## 2 1888-09-08   1888                 Everton      Accrington F.C.    1
## 3 1888-09-08   1888       Preston North End              Burnley    1
## 4 1888-09-08   1888              Stoke City West Bromwich Albion    1
## 5 1888-09-08   1888 Wolverhampton Wanderers          Aston Villa    1
## 6 1888-09-15   1888             Aston Villa           Stoke City    1
##   hgoal vgoal result  K     G
## 1     3     6    0.0 40 1.750
## 2     2     1    1.0 40 1.000
## 3     5     2    1.0 40 1.750
## 4     0     2    0.0 40 1.500
## 5     1     1    0.5 40 1.000
## 6     5     1    1.0 40 1.875

We can then define a quick function to update ELO ratings based on the match results and push out the updated ratings to a data frame holding the current rating over time. This will take a few minutes given how many matches we have so it’s worth going to make a cup of tea while it runs.

#start all teams with an ELO of 1000
team_ratings <- data.frame(team = unique(match_data$home),
                           rating = 1000)

#function to calculate team ELO and update back to the ratings df
calc_ELO <- function(row) {
  #get the difference in ratings
  hr <- team_ratings$rating[which(team_ratings$team == row$home)]
  vr <- team_ratings$rating[which(team_ratings$team == row$visitor)]
  dr <- (hr + 100) - vr
  
  e_result <- 1/ (10^(-dr/400)+1)
  new_hr <- hr + ((row$K*row$G) * (row$result - e_result))
  new_vr <- vr + ((row$K*row$G) * ((1-row$result) - (1-e_result)))
  
  team_ratings$rating[which(team_ratings$team == row$home)] <<- new_hr
  team_ratings$rating[which(team_ratings$team == row$visitor)] <<- new_vr
  
  output_row <- row %>%
    mutate(h_rating = new_hr, v_rating = new_vr)
  
  return(output_row)
}

#split and lapply function
elo_data <- match_data %>%
  split(f = seq(nrow(.))) %>%
  lapply(., calc_ELO) %>%
  do.call(rbind, .)

#melt the elo data
melted_elo_data <- elo_data %>%
  select(date, season, home, visitor, h_rating, v_rating) %>%
  reshape2::melt(id.vars = c("date", "season", "h_rating", "v_rating"),
                 variable.name = "location", value.name = "team") %>%
  mutate(rating = case_when(
    location == "home" ~ h_rating,
    location == "visitor" ~ v_rating
  )) %>%
  select(date, season, team, rating)

#df of each teams rating after each game day
head(melted_elo_data)

Now we have our ELO ratings for each team updated after every match, we can start munging the data to plot the racing bar chart. Most of this is fairly standard data.table munging but there’s an operation up top that I was pretty pleased with.

The English football season typically runs from mid August until mid May. Given that teams do not play competitive matches in this time, it gives 3 months of dead space where team ratings shouldn’t change which is going to be pretty boring in our final gif.

Therefore, I wanted to find out the total time within each season- i.e. the first match in any division in English football to the last match in any division (as different leagues may start and end on different dates). Therefore we need to find the minimum and maximum date value for each season and seq between them.

You could in theory do this using ddply- see a question asked by Hadley Wickham on Stack Overflow but it’s a) quicker in data.table and also I think the one line aspect to do everything is super pleasing.

We know once we have the row numbers for the min date we can easily subset based using them in data.table such as

dt_data[subset_condition, values_needed]

but we need to actually find the row numbers for the subset condition to use in i. We can do this in a 3.5 step process as follows:

  1. re supply the data (melted_elo_data)
  2. add a column of the condition of the minimum data using .I[which.min(date)] (2.5) do this by season as we need it for every season
  3. keep only this new minimum date information by selecting the newly initialised V1 column

We then also do this for the max date in each sequence and use map2 to sequence between each min and max date (could also use mapply but I just prefer map2s syntax).

We then have to make sure that every team is listed for every date so after joining the data.table of team rating information with the data.table of dates, we want to cross-join (CJ) by date and team. I;d never actually used cross-join before, but it’s essentially the equivalent of tidyr::complete().

setDT(melted_elo_data)

#get the min and max date for each season and seq between them
days <- map2(melted_elo_data[melted_elo_data[, .I[which.min(date)], by = "season"]$V1, date], 
             melted_elo_data[melted_elo_data[, .I[which.max(date)], by = "season"]$V1, date],
             seq.Date, by = "day") %>%
  #unlist screws up dates
  do.call("c", .) %>%
  #convert to dt for joining
  data.table(date = .)

#expand the dt by dates
#join each date first
munged_football_data <- melted_elo_data[days, on = "date"] %>%
  #cross-join so that every team is represented on every date
  #equivelant of tidyr::complete
  .[CJ(date = date, team = team, unique=TRUE), 
    on=.(date, team)] %>%
  #filter out the completion of NA team for dates
  .[!is.na(team)] %>%
  #also a a column for the calendar year by rejoining based on date
  .[days[, year := as.numeric(gsub("-.*", "", date))], 
    on = "date", allow.cartesian = TRUE]

Finally, we need to use gganimate::tween() again to fill in the missing ratings in the days where teams aren’t playing. Luckily, gganimate contains tween_fill which is perfect for this and plays nicely with data.table. I tween the ratings linearly between matches (and also take a rolling mean), and tween the positions using a cubic function so teams spend a little more time roughly in a nice plotting order.

After that, all that’s left is to order the frames properly and we’re almost read for plotting!

munged_football_data <- munged_football_data %>%
  .[order(date)] %>%
  #use tween fill for each team to get intermediate ratings
  .[, rating := tween_fill(rating, ease = "linear"), by = team] %>%
  .[, rating := na.locf(rating, na.rm = FALSE), by = "team"] %>%
  .[, frame := .GRP, by = date] %>%
  #take every 5th frame to cut down on final gif size
  #not strictly necessary
  .[frame %% 5 == 1] %>%
  #take the rolling mean over 30 days to smooth out jumps
  .[order(date), 
    roll_rating := frollmean(rating, n = 6, algo = "exact", align = "left"),
    by = "team"] %>%
  #remove unrated rows
  .[!is.na(roll_rating)] %>%
    #order by rolling rating
  .[order(-roll_rating)] %>%
  #rating stransition over a 10 day window
  #order by rating (as with the cities when rating by size)
  .[frame %% 30 == 1, team_order := 1:.N, by = date] %>%
  #use tweenr to make transitions smooth
  .[order(date), 
    order_fill := tween_fill(team_order, ease = "cubic-in-out"),
    by = team] %>%
  .[!is.na(order_fill)] %>%
  .[order(date)] %>%
  .[frame != lag(frame, default = 0), year_frame := 1:.N, by = year] %>%
  #fill down
  .[, year_frame := na.locf(year_frame)] %>%
  #frame is the season plus the amount through that season
  #could probably do something fancy and label as unique days but this suffices
  .[, year_frame := year + year_frame/(max(year_frame) + 1)] %>%
  .[, c("date", "team", "roll_rating", "order_fill", "year_frame")]

To plot, first I join in data from engosccerdata::club_data which is a new dataset of some basic information for English football clubs that’s new to the package. It contains a primary and secondary colour for each club so adds a little bit of extra aesthetic to the final gif. Then I basically plot as before using geom_tile() from ggplot()

#use the newly added club data from engsoccerdata
plotting_data <- engsoccerdata::england_club_data %>%
  #filter only teams we have in our dataset
  filter(team %in% munged_football_data$team) %>%
  #join
  munged_football_data[., on = "team"] %>%
  #only plot a cutdown here as gifs are large
  filter(date > "2015-07-01") %>%
  #only want to plot the top 10
  filter(order_fill < 10.8)

#get the colour and fill palettes from the plotting data
palette <- plotting_data %>%
  filter(!duplicated(team)) 
fills <- palette %>%
  .$col1 %>%
  as.character() %>%
  `names<-`(palette$team)
cols <- palette %>%
  .$col2 %>%
  as.character() %>%
  `names<-`(palette$team)

#plot roughly as before
p_football <- plotting_data %>%
  ggplot(aes(y = order_fill, x = roll_rating, fill = team, colour = team)) +
  geom_tile(aes(x = roll_rating/2, width = roll_rating),
            alpha = 0.8, height = 0.9, size = 1) +
  geom_text(aes(x = 0, label = paste(short_name, " ")),
            vjust = 0.2, hjust = 1, colour = "black") +
  geom_text(aes(label = sprintf("%1.2f",roll_rating)), 
            hjust = 1.5, colour = "black") +
  scale_fill_manual(values = fills, guide = FALSE) +
  scale_colour_manual(values = cols, guide = FALSE) +
  labs(title='{as.character(current_frame)}',
       x = "ELO rating", y = "") +
  coord_cartesian(ylim = c(0,10.5), clip = "off") +
  scale_y_reverse(position = "left") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0, size = 22),
        axis.ticks.y = element_blank(),
        axis.text.y  = element_blank(), 
        plot.margin = margin(0,0,0,2.5, "cm")) +
  transition_manual(date) +
  view_follow(fixed_y = TRUE)

#animate the gif
football_gif <- animate(p_football, 
                        nframes = length(unique(plotting_data$year_frame)),
                        fps = 10,
                        end_pause = 10)
football_gif

Anyway, cheers. It was a fun one to play around with. I’ll put the full gifs of 1880-2019 online at some point!

Best,