One of the most consistent founts of posts on this blog is The Guardian’s football trivia column, The Knowledge. A big reason for this is that the small, self-contained questions lend themselves to small blog posts that I can turn around in an hour or two, as opposed to being endlessly redrafted until I lose interest.

However, I still sometimes don’t quite get round to finishing some of these posts, or have trouble justifying a blog post on a very small and ‘trivial’ answer to a question. Therefore, as a sort of end-of-year round-up, and a Christmas present to myself, I wanted to push out answers to questions I found particularly interesting over the last year but hadn’t quite got round to finishing^1. I’ll probably add them all to this post as I finish them up.

# 2nd December - Everything in its right place

I wonder if any of any sporting leagues have ever ended in alphabetical order? pic.twitter.com/you6u8Uzwz

— P A Hunt (@TeachFMaths) June 15, 2019

## Answer - yes, kind of. But also no.

This question has actually been answered before (as many of these will have been). For a league of 20 teams (like the English Premier League), we might imagine it would have happened at some point over the last ~150 years, but some basic maths shows that it almost certainly hasn’t, and moreover, almost certainly never will.

Let’s load some data and see why.

```
#as per usual, going to heavily rely on tidyverse
#and engsoccerdata throughout these posts
library(tidyverse)
library(engsoccerdata)
```

```
#load English league data
league_data <- engsoccerdata::england %>%
  #select and gather match results
  select(season = Season, division, home, visitor, hgoal, vgoal) %>%
  gather("location", "team", -season, -division, -hgoal, -vgoal) %>%
  mutate(
    g_for = case_when(
      location == "home" ~ hgoal,
      location == "visitor" ~ vgoal
    ),
    g_ag = case_when(
      location == "home" ~ vgoal,
      location == "visitor" ~ hgoal
    )) %>%
  #get the correct points for a win/loss
  mutate(
    points = case_when(
      g_for > g_ag & season < 1981 ~ 2,
      g_for > g_ag & season > 1980 ~ 3,
      g_for == g_ag ~ 1,
      g_for < g_ag ~ 0
    ),
    gd = g_for - g_ag
  ) %>%
  #group by season and league and get final tables
  group_by(season, division, team) %>%
  summarise(points = sum(points),
            gd = sum(gd),
            g_for = sum(g_for)) %>%
  arrange(-points, -gd, -g_for) %>%
  #rank league order and alphabetical order
  mutate(league_pos = rank(-points, ties.method = "first"),
         alph_order = rank(team, ties.method = "first")) %>%
  select(season, division, team, league_pos, alph_order) %>%
  #split by league and season
  split(., f = list(.$season, .$division)) %>%
  keep(function(x) nrow(x) > 0)

#print the top of the first league table
head(league_data[[1]])
```

```
## # A tibble: 6 x 5
## # Groups: season, division [1]
## season division team league_pos alph_order
## <int> <int> <chr> <int> <int>
## 1 1888 1 Preston North End 1 9
## 2 1888 1 Aston Villa 2 2
## 3 1888 1 Wolverhampton Wanderers 3 12
## 4 1888 1 Blackburn Rovers 4 3
## 5 1888 1 Bolton Wanderers 5 4
## 6 1888 1 West Bromwich Albion 6 11
```

We can then run a load of Spearman’s rank correlation tests on the data to see which seasons’ league tables are perfectly correlated (or anti-correlated) with alphabetical ranking. We’ll use the very handy broom package to tidy the results of our many tests into one data.frame (remove the filter at the end of the pipe chain to see everything that gets output).

```
#use broom to tidily do stats
library(broom)

#correlate league and alphabetical order by year
exact_correlations <- league_data %>%
  map_df(., function(data) {
    cor.test(
      data$league_pos,
      data$alph_order,
      method = "spearman"
    ) %>%
      tidy() %>%
      mutate(season = unique(data$season),
             division = unique(data$division))
  }) %>%
  #keep only perfect correlations (rho of 1 or -1)
  filter(abs(estimate) == 1)

print(exact_correlations)
```

```
## # A tibble: 0 x 7
## # ... with 7 variables: estimate <dbl>, statistic <dbl>, p.value <dbl>,
## # method <chr>, alternative <chr>, season <int>, division <int>
```

And so we find no exact correlations. There are no instances in 363 separate seasons of English league football where teams line up in either alphabetical, or anti-alphabetical order.

Let’s see why this is. To make things simpler, I’m going to imagine a cut-down league of only 6 teams, using teams starting with each of the first 6 letters of the alphabet:

```
first_letter_names <- league_data %>%
  bind_rows() %>%
  ungroup() %>%
  #get the first letter of a team name
  mutate(first_letter = gsub("(^.)(.*)", "\\1", team)) %>%
  filter(season > 1992 &
           division == 1 &
           first_letter %in% toupper(letters[1:6])
  ) %>%
  #get one team beginning with A, B, C...
  filter(!duplicated(first_letter)) %>%
  select(team) %>%
  arrange(team) %>%
  print()
```

```
## # A tibble: 6 x 1
## team
## <chr>
## 1 Arsenal
## 2 Blackburn Rovers
## 3 Coventry City
## 4 Derby County
## 5 Everton
## 6 Fulham
```

For the league to finish in alphabetical order, we first need the team that is first alphabetically (Arsenal) to finish in first position. Assuming all teams have an equal chance of winning the league, the chance of this is obviously

\[ p(Arsenal = 1) = \frac{1}{n}\]

Then we need the second team alphabetically (Blackburn Rovers) to finish in second. This is predicated on Arsenal already finishing in first position, so the chance becomes

\[ p(Blackburn = 2 | Arsenal = 1) = \frac{1}{n-1} \]

and so on until the last team (Fulham) just has to slot into the only position left (the nth; 6th in our example).

Thus the total chance becomes

\[ \frac{1}{n} \cdot \frac{1}{n-1} ... \cdot \frac{1}{1} \]

which can also be written

\[ p(ordered) = \prod_{n = 1}^{N} \frac{1}{n}\]

which multiplies out to

\[ p(ordered) = \frac{1}{n!} \]

So for our very small league of n (assumed equally strong) teams, the denominator is

`factorial(nrow(first_letter_names))`

`## [1] 720`

so we have a 1/720 chance that this league ends perfectly in alphabetical order. For bigger leagues (for reference, most large European leagues contain 18-24 teams) this denominator grows *super-exponentially*, and the chance becomes vanishingly small.
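As a quick sanity check on that 1/n! figure, we can shuffle a toy 4-team league many times and count how often it comes out in alphabetical order (a base R sketch; the league size and number of simulations are arbitrary choices of mine):

```r
set.seed(42)
n_teams <- 4
n_sims <- 100000

#a shuffled league finishes 'in order' only if the random
#permutation of positions is exactly the identity
in_order <- replicate(
  n_sims,
  all(sample(n_teams) == seq_len(n_teams))
)

mean(in_order)          #simulated probability
1 / factorial(n_teams)  #theoretical probability, 1/24
```

The simulated proportion lands very close to 1/24, as the derivation above predicts.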

For the English Premier League (20 teams), for instance, the denominator becomes

```
league_data %>%
bind_rows() %>%
ungroup() %>%
filter(season == max(season) & division == 1) %>%
nrow() %>%
factorial()
```

`## [1] 2.432902e+18`

or 1 in about 2.4 quintillion. In short, assuming there’s no relation between the order of names and team strength, we might expect the universe to end before all 20 teams finish in perfect alphabetical order.
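To put that number in some context, here’s a rough back-of-the-envelope calculation (the age of the universe is my own approximate figure, not anything from the data):

```r
#expected number of seasons before a 20-team league finishes
#alphabetically, at one season per year with equal-strength teams
seasons_needed <- factorial(20)

#rough current age of the universe in years (assumed ~13.8 billion)
universe_age_years <- 13.8e9

#how many universe-lifetimes of seasons we'd expect to wait
seasons_needed / universe_age_years
```

That works out at well over a hundred million universe-lifetimes of league seasons.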

We can test if our predictions bear out by looking at tiny leagues with small numbers of teams, e.g. the group stages of the Champions/Europa Leagues.

First we need to scrape the final tables for the last 8 years of data from both competitions:

```
library(rvest)

#website to scrape group stage data from
fb_data <- "https://footballdatabase.com"
ucl_links <- sprintf(
  "/league-scores-tables/uefa-champions-league-20%s-%s",
  10:18, 11:19
)
europa_links <- sprintf(
  "/league-scores-tables/uefa-europa-league-20%s-%s",
  10:18, 11:19
)

#function to scrape the data from these links
get_competition_data <- function(competition, links) {
  data <- links %>%
    paste0(fb_data, .) %>%
    map_df(., function(year) {
      page_read <- read_html(year)
      groups <- letters[1:8] %>%
        map_df(., function(group) {
          page_read %>%
            html_nodes(sprintf("#total-group-%s > div > table", group)) %>%
            html_table(fill = TRUE) %>%
            as.data.frame() %>%
            mutate(group)
        }) %>%
        mutate(year = gsub("(.*-)([0-9]{4}-[0-9]{2})", "\\2", year))
    }) %>%
    mutate(competition)
}

#scrape and bind the data
uefa_data <- bind_rows(
  get_competition_data("champions", ucl_links),
  get_competition_data("europa", europa_links)
)

#print a cutdown version of the scraped data
head(uefa_data %>% select(club = Club, points = P, year, competition))
```

```
## club points year competition
## 1 Tottenham Hotspur 11 2010-11 champions
## 2 Inter Milan 10 2010-11 champions
## 3 FC Twente 6 2010-11 champions
## 4 Werder Bremen 5 2010-11 champions
## 5 Schalke 04 13 2010-11 champions
## 6 Lyon 10 2010-11 champions
```

So now we have 128 (8 groups x 8 years x 2 competitions) ‘mini-leagues’ each of 4 teams.

We can then munge this data to find all the groups where the teams finished in alphabetical order. We’d expect 128/4! groups (about 5.33) to finish in alphabetical order.
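That expectation is just the number of mini-leagues divided by 4!, which we can sanity check directly:

```r
#8 groups x 8 seasons x 2 competitions
n_groups <- 8 * 8 * 2

#each 4-team group has a 1/4! chance of finishing alphabetically
expected_ordered <- n_groups / factorial(4)
round(expected_ordered, 2)  #5.33
```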

```
ordered_groups <- uefa_data %>%
  #select relevant information
  select(team = Club, league_pos = X., group, year, competition) %>%
  #by group, find where teams finish in alphabetical order
  group_by(year, group, competition) %>%
  mutate(alph_order = rank(team, ties.method = "first")) %>%
  filter(league_pos == alph_order) %>%
  #keep only groups where all (4) teams finish in order
  summarise(n = n()) %>%
  filter(n == 4) %>%
  #join and filter back data
  left_join(uefa_data, ., by = c("group", "year", "competition")) %>%
  filter(!is.na(n)) %>%
  #select useful information
  select(team = Club, points = P, gd = X..., league_pos = X.,
         group, year, competition) %>%
  #split groups up
  split(., list(.$year, .$group, .$competition)) %>%
  keep(function(x) nrow(x) > 0)
```

Which leaves us with six groups that have finished in order! Almost exactly the ~5.3 we’d predict by chance if the first letter of a team’s name had no effect on the outcome.

`ordered_groups`

```
## $`2011-12.c.champions`
## team points gd league_pos group year competition
## 5 Benfica 12 4 1 c 2011-12 champions
## 6 FC Basel 11 1 2 c 2011-12 champions
## 7 Manchester United 9 3 3 c 2011-12 champions
## 8 Otelul Galati 0 -8 4 c 2011-12 champions
##
## $`2015-16.c.champions`
## team points gd league_pos group year competition
## 9 Atlético Madrid 13 8 1 c 2015-16 champions
## 10 Benfica 10 2 2 c 2015-16 champions
## 11 Galatasaray 5 -4 3 c 2015-16 champions
## 12 Lokomotiv Astana 4 -6 4 c 2015-16 champions
##
## $`2010-11.f.champions`
## team points gd league_pos group year competition
## 1 Chelsea FC 15 10 1 f 2010-11 champions
## 2 Marseille 12 9 2 f 2010-11 champions
## 3 Spartak Moskva 9 -3 3 f 2010-11 champions
## 4 Žilina 0 -16 4 f 2010-11 champions
##
## $`2015-16.g.champions`
## team points gd league_pos group year competition
## 13 Chelsea FC 13 10 1 g 2015-16 champions
## 14 Dynamo Kyiv 11 4 2 g 2015-16 champions
## 15 FC Porto 10 1 3 g 2015-16 champions
## 16 Maccabi Tel Aviv FC 0 -15 4 g 2015-16 champions
##
## $`2018-19.h.champions`
## team points gd league_pos group year competition
## 17 Juventus 12 5 1 h 2018-19 champions
## 18 Manchester United 10 3 2 h 2018-19 champions
## 19 Valencia 8 0 3 h 2018-19 champions
## 20 Young Boys 4 -8 4 h 2018-19 champions
##
## $`2012-13.h.europa`
## team points gd league_pos group year competition
## 21 FC Rubin Kazan 14 7 1 h 2012-13 europa
## 22 Inter Milan 11 2 2 h 2012-13 europa
## 23 Neftçi 3 -4 3 h 2012-13 europa
## 24 Partizan Beograd (SRB) 3 -5 4 h 2012-13 europa
```

We can also do a larger test by randomly selecting teams out of the English league data we looked at earlier. To do this I need two quick functions: one to sample randomly from the data, and another to carry out the correlation test.

The first takes a number of samples (how many tests to run) and then selects a number of teams from each league sample. For instance, if I chose 3 teams, it might select Liverpool, Manchester United, and Watford, from the last season of the Premier League. These teams finished 2nd, 6th, and 11th respectively, so this ‘sampled league’ would fulfill the criteria of finishing in alphabetical order.

```
set.seed(3459)

#take a random sample of leagues and teams within those leagues
sample_cutdown_leagues <- function(nteams, nsamples, data) {
  samples <- sample(length(data), nsamples, replace = TRUE)
  sampled_league_data <- data[samples]
  league_team_serials <- sampled_league_data %>%
    lapply(., nrow) %>%
    lapply(., sample, size = nteams)
  #carry out the correlation test
  league_cor_test <- map2_df(
    .x = sampled_league_data,
    .y = league_team_serials,
    .f = cor_test_data
  )
}

#function for the correlation test
cor_test_data <- function(full_league_data, sampled_teams) {
  sampled_league <- full_league_data[sampled_teams,] %>%
    arrange(league_pos)
  cor_test <- cor.test(
    sampled_league$league_pos,
    sampled_league$alph_order,
    method = "spearman"
  ) %>%
    tidy() %>%
    #mutate on information about that season and the teams chosen
    mutate(teams = paste(sampled_league$team, collapse = ", "),
           season = unique(sampled_league$season),
           division = unique(sampled_league$division))
}
```

So for instance if I just run it once, randomly selecting 4 teams:

```
test <- sample_cutdown_leagues(4, 1, league_data)
#print the teams selected
test$teams
```

`## [1] "Brentford, Bristol Rovers, Brighton & Hove Albion, Chester"`

`test`

```
## # A tibble: 1 x 8
## estimate statistic p.value method alternative teams season division
## <dbl> <dbl> <dbl> <chr> <chr> <chr> <int> <int>
## 1 0.8 2.00 0.333 Spearma~ two.sided Brentfor~ 1994 3
```

It gives me 4 teams from the 1994 division 3 who didn’t finish in alphabetical order (though, amusingly, all have a very similar starting letter).

We can then carry this out with 10000 samples for team counts of 2 to 6 to see if we get roughly the expected numbers of exactly correlated league finishing positions (this will take 1-2 minutes). We do this by finding how many tests give an estimate of 1 (finished exactly correlated with alphabetical order) or -1 (finished exactly anti-correlated with alphabetical order).

Both these numbers should be roughly equal to the number of samples (10000) divided by the factorial of the number of teams selected.

```
test_n_numbers <- function(nteams) {
  #run sampling function n times
  #10k should do
  sampling <- sample_cutdown_leagues(nteams, 10000, league_data)
  #find exactly correlated and anti-correlated examples
  #where teams are in exact alphabetical order ascending or descending
  correlated <- length(which(sampling$estimate == max(sampling$estimate)))
  anti_correlated <- length(which(sampling$estimate == min(sampling$estimate)))
  expected <- nrow(sampling) / factorial(nteams)
  df <- data.frame(n = nteams,
                   sample_cor = correlated,
                   sample_anticor = anti_correlated,
                   sample_expected = expected)
}

#run the function
testing <- map_df(2:6, test_n_numbers)
#print results
print(testing)
```

```
## n sample_cor sample_anticor sample_expected
## 1 2 5010 4990 5000.00000
## 2 3 1676 1665 1666.66667
## 3 4 367 398 416.66667
## 4 5 101 81 83.33333
## 5 6 14 15 13.88889
```

And the numbers line up, as we would expect if there is no effect of the first letter of a team’s name upon final league position.

Finally, we can do a Kendall’s correlation test to really see if there is any relationship between alphabetical team name order and final league finish across all of our English league data. We use Kendall instead of a Spearman test here because, grouping all the data together, we’re going to have a lot of ties (one team has to finish 1st in every league each year).
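To illustrate why ties matter (with toy vectors of my own rather than the league data), base R’s cor() computes the tie-corrected tau-b when given method = "kendall":

```r
#toy rankings with many tied values in x
x <- c(1, 1, 2, 2, 3, 3)
y <- c(1, 2, 1, 3, 2, 3)

#Kendall's tau handles the tied ranks gracefully
cor(x, y, method = "kendall")
```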

```
all_data <- league_data %>%
  bind_rows()

#do a big correlation test
kendall_test <- cor.test(all_data$alph_order,
                         all_data$league_pos,
                         alternative = "two.sided",
                         method = "kendall") %>%
  tidy() %>%
  print()
```

```
## # A tibble: 1 x 5
## estimate statistic p.value method alternative
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 0.0135 1.74 0.0826 Kendall's rank correlation tau two.sided
```

And we can see that, even though our p-value is ‘approaching significance’, it’s not significant at our fairly liberal threshold of 0.05. Even then, the effect size (0.013) is tiny, so there’s no need for Watford to start worrying just yet.

- SMALL DIGRESSION: I love blogging on this site, and it has been a great help to me in numerous ways (practice coding/writing, feeling like a “programmer”, job interviews). But quite a lot of the time I feel posts are not quite where I want them (I’m sure this feeling isn’t restricted to me) and so won’t put them up, and the time (sometimes quite a few hours!) I put into them in my spare time feels wasted and makes me feel worse about myself. I’m hoping that pushing out fairly rushed/half-formed ideas like this will help with that.

# 3rd December - Groan Rangers

## Answer - It beats any team in the English league. Reproducible code below if you want to check for other leagues.

(I did check for most of them in the dataset, although this doesn’t include foreign cup competitions. Nothing seems to get close)

This is quite a nice question from a data munging point of view. It’s extremely quantifiable and only involves a little grouping.

First we’ll load the libraries we’re relying on in this little project:

```
library(engsoccerdata)
library(tidyverse)
```

I’m going to focus on the English league as it has the most data and also has data on the concurrent cup competitions. It’s super easy to sub in whichever competitions in the engsoccerdata package you want.

We want to first bind the data from the league, league cup, fa cup, and league playoffs together with a little munging. Then we want to gather the data to get the goals scored and goals conceded in each game for each team.

```
#bind all the match data together with relevant variables
scoring_data <- bind_rows(
  engsoccerdata::england %>%
    select(date = Date, tier, home, visitor, hgoal, vgoal) %>%
    mutate(date = as.Date(date),
           competition = "league"),
  engsoccerdata::facup %>%
    select(date = Date, home, visitor, hgoal, vgoal) %>%
    mutate(date = as.Date(date),
           tier = NA,
           competition = "fa_cup"),
  engsoccerdata::leaguecup %>%
    select(date = Date, home, visitor, hgoal, vgoal) %>%
    mutate(date = as.Date(date),
           tier = NA,
           competition = "league_cup"),
  engsoccerdata::englandplayoffs %>%
    select(date = Date, home, visitor, hgoal, vgoal, htier, vtier) %>%
    mutate(date = as.Date(date),
           tier = (htier + vtier) / 2,
           competition = "league_playoffs") %>%
    select(-htier, -vtier)
) %>%
  #gather and find matches for each team
  gather("location", "team", -date, -hgoal, -vgoal, -competition, -tier) %>%
  split(f = .$location) %>%
  map2_df(., rev(.), function(df, vs_data)
    mutate(df, opponent = vs_data$team)
  ) %>%
  #add in goals for and against
  mutate(goals_for = case_when(
    location == "home" ~ hgoal,
    TRUE ~ vgoal
  )) %>%
  mutate(goals_against = case_when(
    location == "visitor" ~ hgoal,
    TRUE ~ vgoal
  )) %>%
  arrange(team, date) %>%
  group_by(team)

head(scoring_data)
```

```
## # A tibble: 6 x 10
## # Groups: team [1]
## date tier hgoal vgoal competition location team opponent
## <date> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 1875-11-06 NA 0 0 fa_cup home 105t~ Crystal~
## 2 1875-11-20 NA 3 0 fa_cup visitor 105t~ Crystal~
## 3 1876-11-11 NA 3 0 fa_cup home 105t~ 1st Sur~
## 4 1876-12-14 NA 6 1 fa_cup visitor 105t~ Oxford ~
## 5 1877-11-07 NA 0 2 fa_cup home 105t~ Old Har~
## 6 NA NA NA NA fa_cup visitor 105t~ Minerva
## # ... with 2 more variables: goals_for <dbl>, goals_against <dbl>
```

Next we need to find the start of each run of games in which a team failed to score. We can do this by using lag() to find matches where a team scored 0 goals having scored in the previous match. We’ll then give an id to each ‘run’ of games finished without scoring.

(I’m aware that teams can also concede goals in a match after scoring first, but there’s no way to factor that within-match ordering in with this data)

We then use the na.locf() function from the very useful zoo package to fill in the runs where no goals have been scored.

We can then finish answering the question (already!) by grouping by run and summing the total number of goals conceded in that time.
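Before running this on the real data, here’s the run-labelling idea on a toy vector of goals scored (a base R sketch of mine, with a simple loop standing in for zoo::na.locf()):

```r
#toy goals-scored sequence for one team
goals_for <- c(2, 0, 0, 1, 0, 0, 0, 3)

#mark the first match of each scoreless run with its match index
prev_goals <- c(1, head(goals_for, -1))  #i.e. lag() with default = 1
run_start <- ifelse(goals_for == 0 & prev_goals != 0,
                    seq_along(goals_for), NA)

#keep only the scoreless matches, then carry the last id forward
run_id <- run_start[goals_for == 0]
for (i in seq_along(run_id)) {
  if (is.na(run_id[i])) run_id[i] <- run_id[i - 1]
}

run_id  #2 2 5 5 5: two scoreless runs, of lengths 2 and 3
```

Grouping by this id then gives one group per unbroken scoreless run, exactly what the pipeline below does with na.locf().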

```
#load the zoo library to help fill NA values
library(zoo)

dry_runs <- scoring_data %>%
  #find the start of runs
  mutate(run_start = case_when(
    goals_for == 0 & lag(goals_for, default = 1) != 0 ~ 1:n()
  )) %>%
  #only care about games where a team didn't score
  filter(goals_for == 0) %>%
  #fill in NAs to get full runs
  mutate(run_id = na.locf(run_start, na.rm = FALSE))

longest_dry_runs <- dry_runs %>%
  #group runs by id
  group_by(run_id, team) %>%
  #find the total conceded over the run
  mutate(total_conceeded = sum(goals_against),
         run_start_date = min(date),
         matches = n()) %>%
  #take only the last instance
  filter(!duplicated(run_id, fromLast = TRUE)) %>%
  select(run_start_date, run_end_date = date, team, run_id, total_conceeded, matches) %>%
  #find the most 'impressive' runs
  filter(total_conceeded > 15) %>%
  arrange(-total_conceeded)

head(longest_dry_runs)
```

```
## # A tibble: 6 x 6
## # Groups: run_id, team [6]
## run_start_date run_end_date team run_id total_conceeded matches
## <date> <date> <chr> <int> <dbl> <int>
## 1 1899-01-14 1899-03-11 Darwen 273 38 7
## 2 1898-11-12 1898-12-26 Darwen 263 35 7
## 3 1891-12-12 1892-01-09 Darwen 60 31 5
## 4 2019-04-09 2019-08-31 Bolton Wander~ 5447 29 11
## 5 1877-12-22 1886-10-23 1st Surrey Ri~ 8 27 3
## 6 1880-12-18 1894-01-27 Reading 6 27 3
```

And we can see that two 7-game runs from the (now-defunct) Darwen FC are top of the list. Around 1898/1899 the team conceded 35 and 38 goals without scoring themselves.

Manually looking at the data, we can see that these two streaks are separated only by a few (still losing) games over Christmas 1898, part of a losing run of 18 games! Indeed, Darwen only won 2 games that season and set the record for most goals conceded (141).

Seven years earlier, the same team managed an impressive run of letting in 31 goals in just 5 matches, without scoring. If we want to check out the individual games in this run, we can do so by left_join()ing our data together

```
#join data to inspect individual games
dry_run_matches <- dry_runs %>%
  left_join(longest_dry_runs, by = c("team", "run_id")) %>%
  filter(!is.na(total_conceeded)) %>%
  select(date, team, opponent, goals_for, goals_against,
         competition, tier, total_conceeded, run_id) %>%
  arrange(-total_conceeded)

#print this
print(filter(dry_run_matches, run_id == 60))
```

```
## # A tibble: 5 x 9
## # Groups: team [1]
## date team opponent goals_for goals_against competition tier
## <date> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 1891-12-12 Darw~ Sunderl~ 0 7 league 1
## 2 1891-12-25 Darw~ Blackbu~ 0 4 league 1
## 3 1891-12-26 Darw~ Aston V~ 0 7 league 1
## 4 1892-01-01 Darw~ Preston~ 0 4 league 1
## 5 1892-01-09 Darw~ Burnley 0 9 league 1
## # ... with 2 more variables: total_conceeded <dbl>, run_id <int>
```

Also, congratulations to the oft-troubled Bolton Wanderers, who have got closest to this in modern times, failing to score in 11 straight matches while conceding 29 goals in the process.

I also wanted to find out the opposite: the team that has scored the most goals without conceding any. It’s super easy with our pipeline: just swap goals_against and goals_for in the chain.

```
#do the inverse
scoring_runs <- scoring_data %>%
  mutate(run_start = case_when(
    goals_against == 0 & lag(goals_against, default = 1) != 0 ~ 1:n()
  )) %>%
  filter(goals_against == 0) %>%
  mutate(run_id = na.locf(run_start, na.rm = FALSE))

longest_scoring_runs <- scoring_runs %>%
  group_by(run_id, team) %>%
  mutate(total_scored = sum(goals_for),
         run_start_date = min(date),
         matches = n()) %>%
  filter(!duplicated(run_id, fromLast = TRUE)) %>%
  select(run_start_date, run_end_date = date, team, run_id, total_scored, matches) %>%
  filter(total_scored > 15) %>%
  arrange(-total_scored)

head(longest_scoring_runs)
```

```
## # A tibble: 6 x 6
## # Groups: run_id, team [6]
## run_start_date run_end_date team run_id total_scored matches
## <date> <date> <chr> <int> <dbl> <int>
## 1 2010-04-25 2010-08-28 Chelsea 4372 32 7
## 2 1929-03-06 1929-03-30 Bradford City 919 29 5
## 3 2019-01-06 2019-01-26 Manchester City 5194 28 6
## 4 1903-04-10 1903-10-03 Arsenal 328 26 8
## 5 1880-01-17 1880-11-13 Clapham Rovers 25 26 5
## 6 1885-10-24 1885-12-12 Notts County 32 26 3
```

We can see that Chelsea’s impressive end to the 2009-2010 season puts them top, having scored 32 goals without reply. Almost all the other top examples are from pre-war football, except Manchester City, who came close last year with 28 goals scored without conceding.

When we look at this run we can see it was greatly helped along by some demolitions in the cups, winning 5-0, 9-0 and 7-0 against Burnley, Burton Albion, and Rotherham United.

```
scoring_run_matches <- scoring_runs %>%
  left_join(longest_scoring_runs, by = c("team", "run_id")) %>%
  filter(!is.na(total_scored)) %>%
  select(date, team, opponent, goals_for, goals_against,
         competition, tier, total_scored, run_id) %>%
  arrange(-total_scored)

#print this
print(filter(scoring_run_matches, run_id == 5194))
```

```
## # A tibble: 6 x 9
## # Groups: team [1]
## date team opponent goals_for goals_against competition tier
## <date> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 2019-01-06 Manc~ Rotherh~ 7 0 fa_cup NA
## 2 2019-01-09 Manc~ Burton ~ 9 0 league_cup NA
## 3 2019-01-14 Manc~ Wolverh~ 3 0 league 1
## 4 2019-01-20 Manc~ Hudders~ 3 0 league 1
## 5 2019-01-23 Manc~ Burton ~ 1 0 league_cup NA
## 6 2019-01-26 Manc~ Burnley 5 0 fa_cup NA
## # ... with 2 more variables: total_scored <dbl>, run_id <int>
```

# 5th December - We’re going to Wembley

Which team has had to travel the shortest combined distance in a cup run? (excluding regional competitions, just to make it interesting)

— Chris van Thomas (@chrisvanthomas) July 10, 2019

## Answer - Multiple teams have played 5 FA cup matches all at home. To answer the inverse question, Queen’s Park in 1883/1884 and 1884/1885 have had the farthest to travel

For this question, I’m actually going to answer the opposite: which team has travelled the farthest in a cup run? The reason is that multiple teams have had cup runs (of 5 matches in the FA cup) without travelling away from home at all. The code below could easily be changed to analyse other cup competitions; for simplicity, I’m sticking with the FA cup, which has the most complete data in the engsoccerdata set.

Once again, we’ll start by loading libraries. We also want the sf package, which makes working with spatial data a bit cleaner.

```
library(engsoccerdata)
library(tidyverse)
#also want sf to manipulate spatial features
library(sf)
```

Then we want to grab the data. In a recent update of the engsoccerdata package I added the locations of grounds for teams in England, which will let us find the distances teams have travelled to matches.

I also download a shapefile of the UK from GADM, both for plotting and to filter out any bad ground locations (the location data is still very much in beta).

```
#download a map of the uk to plot with
shape_url <- "https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_GBR_0_sf.rds"
temp_dir <- tempdir()
download.file(shape_url, destfile = file.path(temp_dir, "shapefiles.rds"), mode = "wb")
uk <- st_as_sf(readRDS(file.path(temp_dir, "shapefiles.rds")))

#the location of football grounds in the dataset
grounds <- engsoccerdata::england_locations %>%
  st_as_sf(coords = c("lon", "lat"), crs = st_crs("+init=epsg:4326")) %>%
  st_transform(crs = st_crs(uk)) %>%
  #remove those that are bad data (outside the uk)
  .[seq(nrow(.)) %in% unlist(st_contains(uk, .)),]

#get the fa cup match data
matches <- engsoccerdata::facup
```

There’s some ambiguity in the question as to how the distance of a cup run should be measured. There are really two ways we can do this, which I will henceforth refer to as ‘routing’ and ‘spoking’. To illustrate the two different approaches, I’ll use Southampton’s 2017/2018 FA cup run:

```
#e.g. with Southampton's semi final run in 2017/2018
Southampton <- matches %>%
  filter(Season == 2017 &
           (home == "Southampton" |
              visitor == "Southampton")
  ) %>%
  select(Venue, Date, home, visitor, neutral) %>%
  gather("location", "team", -Venue, -Date, -neutral) %>%
  filter(location == "home" | neutral == "yes") %>%
  filter(!duplicated(Date)) %>%
  mutate(location = case_when(
    neutral == "yes" ~ Venue,
    TRUE ~ team
  ))

#print matches
Southampton
```

```
## # A tibble: 5 x 5
## Venue Date neutral location team
## <chr> <date> <chr> <chr> <chr>
## 1 Craven Cottage 2018-01-06 <NA> Fulham Fulham
## 2 St Mary's Stadi~ 2018-01-27 <NA> Southampton Southampton
## 3 The Hawthorns 2018-02-17 <NA> West Bromwich Alb~ West Bromwich Alb~
## 4 DW Stadium 2018-03-18 <NA> Wigan Athletic Wigan Athletic
## 5 Wembley Stadium 2018-04-22 yes Wembley Stadium Chelsea
```

We’ll join the ground geography data to this to figure out the distances travelled:

```
#will need the location of Southampton's ground
Southampton_home <- grounds %>%
  filter(location == "Southampton")

#and the locations of all their matches
match_locations <- Southampton %>%
  left_join(., select(grounds, location, geometry), by = "location")
```

The first method of calculating distance traveled per match (‘spoking’) takes the location of each match, and finds the distance from that team’s home ground. E.g. for Southampton:

```
#find the line from Southampton's ground to each match location
spoke_lines <- st_coordinates(match_locations$geometry) %>%
  split(f = rownames(.)) %>%
  #create a line from geometry1 to geometry2
  lapply(function(x) {
    mat <- rbind(x, st_coordinates(Southampton_home$geometry))
    line <- st_linestring(mat)
    return(line)
  }) %>%
  #cast to multiline and add projection
  st_multilinestring() %>%
  st_sfc(crs = st_crs("+init=epsg:4326"))

#plot over uk
p <- ggplot() +
  geom_sf(data = uk) +
  geom_sf(data = spoke_lines, colour = "blue", size = 1.5) +
  theme_minimal()
plot(p)
```

```
#get the total length in metres
st_length(spoke_lines)
```

`## 698614.6 [m]`

For any home games, the distance traveled is taken to be 0m.
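As a rough cross-check on these distances (with hand-picked approximate coordinates of my own, not the engsoccerdata locations), the straight-line length of one ‘spoke’ can be estimated with a plain haversine formula:

```r
#haversine great-circle distance in km between two lon/lat points
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlon <- (lon2 - lon1) * to_rad
  dlat <- (lat2 - lat1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

#approximate coordinates for St Mary's (Southampton) and Wembley
haversine_km(-1.391, 50.906, -0.280, 51.556)  #roughly 100km
```

Which is in the right ballpark for one leg of the ~700km total we got above.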

The second method of calculating distance travelled (‘routing’) takes the distance from each match *to* the next match in the run. I.e. it assumes a team stays in its previous location until the next round (obviously not true in real life) and finds the distance from there to the ground for the next round’s game:

```
#need the coords separate
coords <- st_coordinates(match_locations$geometry) %>%
  split(f = rownames(.))

#find the distance from one set of coords to the next
travel_lines <- mapply(coords, lead(coords), FUN = function(x, y) {
  #for the last match there is no further path
  if (is.na(y)) {
    return(NULL)
  } else {
    mat <- rbind(x, y)
    line <- st_linestring(mat)
    return(line)
  }
}) %>%
  #filter the null last element
  .[-which(lengths(.) == 0)] %>%
  st_multilinestring() %>%
  st_sfc(crs = st_crs("+init=epsg:4326"))

#plot over uk
p <- ggplot() +
  geom_sf(data = uk) +
  geom_sf(data = travel_lines, colour = "blue", size = 1.5) +
  theme_minimal()
plot(p)
```

```
#get the length
st_length(travel_lines)
```

`## 684586.7 [m]`

So Southampton here begin in London away to Fulham, before travelling to their home in Southampton, then on to Birmingham (West Brom), Wigan, and back to London to play at Wembley.

Now we have the two methods, we need to prep the data we’re going to analyse. As before, this is done by gathering a df of match data so we have one row per match per team in the FA cup. We then group by team-season and find how many matches each team played in the cup that year. For a ‘cup run’, we assume more than 3 matches have to be played.

```
#gather each match per team into a separate row
long_campaigns <- matches %>%
  filter(!is.na(Date)) %>%
  select(Season, home, visitor) %>%
  gather("location", "team", -Season) %>%
  #remove rows with missing teams
  filter(!is.na(team)) %>%
  #get the number of fa cup matches per season per team
  group_by(team, Season) %>%
  summarise(matches = n()) %>%
  #assume you need more than 3 matches for a 'run'
  filter(matches > 3) %>%
  ungroup() %>%
  arrange(Season) %>%
  mutate(id = 1:n())

long_campaign_matches <- matches %>%
  select(., Season, Date, round,
         neutral, Venue,
         team = home, opponent = visitor) %>%
  mutate(location = "home") %>%
  #bind in the opposite data for away teams
  bind_rows(., mutate(select(., Season, Date, round,
                             neutral, Venue,
                             team = opponent, opponent = team),
                      location = "away")) %>%
  filter(!is.na(team)) %>%
  #left join the data for long campaigns
  left_join(., long_campaigns, by = c("Season", "team")) %>%
  #remove non-long campaigns
  filter(!is.na(matches)) %>%
  #get the location of the match
  mutate(location = case_when(
    neutral == "yes" ~ Venue,
    location == "home" ~ team,
    location == "away" ~ opponent,
  )) %>%
  #left join in the location of the corresponding ground
  left_join(grounds, c("location")) %>%
  #select columns
  select(season = Season, date = Date, round,
         neutral, team, opponent,
         location, matches, id, geometry)

#print
head(long_campaign_matches)
```

```
## # A tibble: 6 x 10
## season date round neutral team opponent location matches id
## <dbl> <date> <chr> <chr> <chr> <chr> <chr> <int> <int>
## 1 1871 1871-12-16 2 <NA> Crys~ Maidenh~ Crystal~ 5 1
## 2 1871 1871-12-16 2 <NA> Wand~ Clapham~ Wandere~ 4 3
## 3 1871 1872-01-20 3 <NA> Wand~ Crystal~ Wandere~ 4 3
## 4 1871 1872-01-27 3 <NA> Roya~ Hampste~ Royal E~ 5 2
## 5 1871 1872-02-17 s yes Crys~ Royal E~ Kenning~ 5 1
## 6 1871 1872-03-09 s yes Roya~ Crystal~ Kenning~ 5 2
## # ... with 1 more variable: geometry <POINT [°]>
```

We can then find the routing distance using a nice trick I found on Stack Overflow to find the distance between each location and the next in the data.frame.

Finally, this is grouped by id and summed to get the total distance travelled in that cup campaign (when judged by the ‘routing’ metric).
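The heart of the trick is lag()-ing the geometry column, with an empty point as the default, and then taking element-wise distances. A minimal sketch with made-up coordinates (three toy points rather than real grounds):

```r
library(sf)
library(dplyr)

#three toy points in lon/lat (made-up coordinates)
pts <- st_sfc(
  st_point(c(-0.1, 51.5)),
  st_point(c(-1.9, 52.5)),
  st_point(c(-2.2, 53.5)),
  crs = 4326
)

#an empty point acts as the "previous location" for the first row
empty <- st_as_sfc("POINT(EMPTY)")

#distance from each point to the one before it
route <- st_sf(geometry = pts) %>%
  mutate(leg = st_distance(geometry, lag(geometry, default = empty),
                           by_element = TRUE))

#the first leg has no previous location (NA); sum the rest
#to get the total route length in metres
sum(route$leg, na.rm = TRUE)
```

The same pattern (lag the geometry, distance by element, sum within groups) is what the full pipeline does per cup run.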

```
#taken from
#https://github.com/r-spatial/sf/issues/799
#init an empty sfc
empty <- st_as_sfc("POINT(EMPTY)")
routing_distances <- long_campaign_matches %>%
arrange(id, date) %>%
filter(!is.na(date)) %>%
group_by(id) %>%
#find the distance from one game to the next
mutate(
distance_to_next = sf::st_distance(
geometry,
lag(geometry, default = empty),
by_element = TRUE)
)
#sum the distances
grouped_routing_distances <- routing_distances %>%
summarise(travel_distance = sum(distance_to_next, na.rm = TRUE)) %>%
merge(long_campaigns, by = "id") %>%
#convert to km
mutate(total_distance = travel_distance / 1000) %>%
select(id, team, matches, season = Season, total_distance) %>%
mutate(av_distance = total_distance / matches) %>%
arrange(-total_distance)
head(grouped_routing_distances)
```

```
## id team matches season total_distance av_distance
## 1 111 Queens Park 8 1884 2684.363 335.5454
## 2 668 Exeter City 8 1930 2362.073 295.2592
## 3 293 Newcastle United 8 1904 2316.524 289.5655
## 4 512 Newcastle United 9 1923 2112.232 234.6924
## 5 1182 Newcastle United 10 1954 2017.824 201.7824
## 6 1090 Gateshead 8 1951 2015.698 251.9623
```

By functionalising our code from earlier, we can easily plot these well-travelled runs. Using the recently-added-to-CRAN patchwork package, we can make multiple plots and stitch them together, e.g. for the 6 FA Cup runs with the longest distances travelled we get:

```
#functionalise our code from earlier to plot travel routes more easily
plot_travel_lines <- function(run_team_year, plot_type) {
run_matches <- long_campaign_matches %>%
mutate(id = paste(team, season)) %>%
filter(id == run_team_year) %>%
arrange(date)
coords <- st_coordinates(run_matches$geometry) %>%
split(f = rownames(.))
if(plot_type == "travel") {
lines <- mapply(coords, lead(coords), FUN = function(x, y) {
if(is.na(y)) {
return(NULL)
} else {
mat <- rbind(x, y)
line <- st_linestring(mat)
return(line)
}
}) %>%
#drop empty elements (negative indexing breaks when none are empty)
.[lengths(.) > 0] %>%
st_multilinestring() %>%
st_sfc(crs = st_crs("+init=epsg:4326"))
} else if(plot_type == "spokes") {
home <- grounds %>%
filter(location == unique(run_matches$team)) %>%
st_coordinates()
lines <- lapply(coords, function(x) {
mat <- rbind(x, home)
line <- st_linestring(mat)
return(line)
}) %>%
#cast to multiline and add projection
st_multilinestring() %>%
st_sfc(crs = st_crs("+init=epsg:4326"))
}
plot <- ggplot() +
geom_sf(data = uk) +
geom_sf(data = lines, colour = "blue", size = 1.5) +
labs(title = paste(run_team_year)) +
theme_minimal()
return(plot)
}
#plot the top six
library(patchwork)
paste(grouped_routing_distances$team[1:6],
grouped_routing_distances$season[1:6]) %>%
lapply(., plot_travel_lines, plot_type = "travel") %>%
wrap_plots(.)
```

We then need to check this against our other method of evaluating distances in a cup run: the ‘spoking’ method. This is much easier to calculate: all we have to do is left_join() in the location of each team’s home ground, and find the distance between this and the match location.

Then we simply sum the total distances per campaign and plot the longest of these:

```
spoke_distances <- long_campaign_matches %>%
#left join in location data for each team
left_join(grounds, by = c("team" = "location")) %>%
#calculate distance between each teams home ground and the match location
mutate(distance = st_distance(geometry.x, geometry.y, by_element = TRUE))
#group by and sum the cup run distances
grouped_spoke_distances <- spoke_distances %>%
group_by(team, season) %>%
summarise(total_distance = sum(distance/1000, na.rm = TRUE),
av_distance = mean(distance/1000, na.rm = TRUE)) %>%
arrange(-total_distance)
#print
head(grouped_spoke_distances)
```

```
## # A tibble: 6 x 4
## # Groups: team [3]
## team season total_distance av_distance
## <chr> <dbl> [m] [m]
## 1 Queens Park 1883 2150.504 307.2149
## 2 Newcastle United 1923 1974.430 219.3811
## 3 Newcastle United 1951 1957.196 279.5994
## 4 Fulham 1974 1810.676 150.8897
## 5 Newcastle United 1973 1781.279 178.1279
## 6 Queens Park 1884 1702.166 170.2166
```

```
#plot
paste(grouped_spoke_distances$team[1:6],
grouped_spoke_distances$season[1:6]) %>%
lapply(., plot_travel_lines, plot_type = "spokes") %>%
wrap_plots(.)
```

# 10th December - This Town Ain’t Big Enough For a League Football Team

## Answer - Worcester (or, for *league* teams, Coventry)

To answer this question, first we need data on towns and cities in England. We’re going to rank by population, so we can start by downloading the population table found here.

```
#rvest for web scraping
library(rvest)
#scrape data on town/city populations in UK
pops <- "http://lovemytown.co.uk/populations/TownsTable1.asp" %>%
read_html() %>%
html_nodes("#mainContent > table:nth-child(3)") %>%
html_table(fill = TRUE, header = TRUE) %>%
as.data.frame() %>%
#some munging to match datasets later
mutate(tcity15nm = case_when(
grepl("^St\\. ", Town) ~ gsub("^St\\. ", "St ", Town),
grepl("^Hull$", Town) ~ "Kingston upon Hull",
grepl("^Burton$", Town) ~ "Burton upon Trent",
grepl("^Newcastle$", Town) ~ "Newcastle upon Tyne",
grepl("^Southend$", Town) ~ "Southend-on-Sea",
grepl("^Stoke$", Town) ~ "Stoke-on-Trent",
TRUE ~ Town
)) %>%
#convert population to numeric
mutate(population = as.numeric(gsub(",", "", Population))) %>%
select(tcity15nm, population, status = Status)
```

We’re then going to want the geographic data on these towns. The UK government provides shapefiles for the outlines of ‘Major Towns and Cities’ from a few years ago, which should be sufficient for the question. They’re provided as geoJSON files, so I’m going to use the geojsonsf package to load them straight in as sf objects.

To reproduce this script, you’ll need to download the data from the UK government archives and point the file object towards it.

```
#to read in geojson data as an sf file
library(geojsonsf)
#download the shapefile from
#https://data.gov.uk/dataset/7879ab82-2863-401e-8a29-a56e264d2182/major-towns-and-cities-december-2015-boundaries
file <- "path/to/downloaded/file.geojson"
#load data
towns <- geojson_sf(file) %>%
left_join(., pops, by = "tcity15nm") %>%
st_transform(st_crs(27700)) %>%
#buff the town shapefiles by 2.5km to catch all clubs within
#reasonable distance of the town
st_buffer(., 2500)
```
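One thing worth noting: st_buffer() works in the units of the layer’s coordinate reference system, which is why the st_transform() to the British National Grid (EPSG:27700, units of metres) comes first. Buffering by 2500 in raw lon/lat would mean 2,500 *degrees*. A quick sketch with a made-up point to illustrate:

```r
library(sf)

#a made-up point in lon/lat
pt <- st_sfc(st_point(c(-1.9, 52.5)), crs = 4326)

#transform to British National Grid (metres), then buffer by 2.5km
disc <- st_buffer(st_transform(pt, 27700), 2500)

#the result is (approximately) a 2.5km-radius disc,
#so its area should be close to pi * 2500^2 square metres
st_area(disc)
```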

Then we want the club data. In the latest release of engsoccerdata I added some (very beta) non-league data, so we have a greater number of teams to pick from. We’ll take the league each team played in in 2018, as the non-league data hasn’t been updated to 2019 (the current season) yet.

```
#get all league and non-league clubs
#non league clubs in new release of engsoccerdata
clubs <- rbind(
select(england, home, Season, tier),
select(england_nonleague, home, Season, tier)) %>%
#no 2019 data for non league yet
filter(Season == 2018) %>%
select(home, tier) %>%
unique()
#get the locations of each of these clubs
club_locations <- england_club_data %>%
st_as_sf(coords = c("lon", "lat"), crs = st_crs(4326)) %>%
#on uk grid projection
st_transform(st_crs(27700)) %>%
left_join(., clubs, by = c("team" = "home")) %>%
select(team, tier) %>%
filter(!is.na(tier))
#plot these clubs over major towns in uk
p_town_clubs <- ggplot() +
geom_sf(data = uk) +
geom_sf(data = towns, fill = "red") +
geom_sf(data = club_locations, alpha = 0.15, colour = "blue") +
labs(title = "location of English football teams relative to major towns") +
theme_minimal()
p_town_clubs
```

We can then run a very quick function to find the clubs that are located within each town using sf::st_contains, and arrange by our parameters to answer the question!

```
#function to find which towns contain clubs
town_data <- st_contains(towns, club_locations) %>%
map_df(., function(x) {
n_clubs <- length(x)
if(n_clubs == 0) {
max_tier <- NA
tiers <- NA
} else {
#get the tiers of the english footballing pyramid that clubs play in
tiers <- I(list(club_locations$tier[x]))
max_tier <- min(club_locations$tier[x])
}
return(data.frame(n_clubs,
max_tier,
tiers))
}) %>%
#bind to the town data
bind_cols(towns, .) %>%
select(town = tcity15nm, pop = population, n_clubs, max_tier) %>%
#arrange to answer question
arrange(n_clubs, -max_tier, -pop)
#get rid of unnecessary geometry
st_geometry(town_data) <- NULL
#print answer
head(town_data)
```

```
## town pop n_clubs max_tier
## 1 Gloucester 136362 0 NA
## 2 Basingstoke 107355 0 NA
## 3 Worcester 100153 0 NA
## 4 Stockton-on-Tees 82729 0 NA
## 5 Guildford 77057 0 NA
## 6 Sutton Coldfield 109015 1 8
```

So it appears Gloucester, Basingstoke and Worcester are the largest towns without a football club within their limits. We can double-check this with a quick grep over the club names (it’s possible this could miss some clubs but is probably accurate enough).

```
empty_towns <- c("Gloucester", "Basingstoke", "Worcester", "Stockton", "Guildford")
lapply(empty_towns, grep, x = clubs$home) %>%
unlist() %>%
clubs$home[.]
```

`## [1] "Gloucester City" "Basingstoke Town"`

So we can see that Gloucester and Basingstoke *do* in fact have football teams, although a quick Wikipedia search shows that they both play outside their towns (so I’m not sure these count). The largest town without a football team, down to the 8th tier of English football, is therefore Worcester: 100,000 people, with a team only in the 9th tier of the pyramid.

The question actually does specify *‘League’* teams, which generally refers only to the top 4 flights of English football. We can then repeat the exercise for each tier, finding the largest town without a team in that tier or above.

```
#find the largest towns without a club in the top 4, 5, 6, or 7 tiers
by_tier <- lapply(4:7, function(x) {
town_data %>%
filter(!is.na(max_tier) & max_tier > x) %>%
arrange(-pop)
})
#print by max tier
lapply(by_tier, head, n = 5)
```

```
## [[1]]
## town pop n_clubs max_tier
## 1 Coventry 325949 1 7
## 2 Warrington 165456 1 7
## 3 Slough 155298 2 5
## 4 Poole 154718 2 7
## 5 York 152841 1 6
##
## [[2]]
## town pop n_clubs max_tier
## 1 Coventry 325949 1 7
## 2 Warrington 165456 1 7
## 3 Poole 154718 2 7
## 4 York 152841 1 6
## 5 Telford 142723 1 6
##
## [[3]]
## town pop n_clubs max_tier
## 1 Coventry 325949 1 7
## 2 Warrington 165456 1 7
## 3 Poole 154718 2 7
## 4 Worthing 109120 1 7
## 5 Sutton Coldfield 109015 1 8
##
## [[4]]
## town pop n_clubs max_tier
## 1 Sutton Coldfield 109015 1 8
## 2 Basildon 107123 2 8
## 3 St Helens 102885 1 8
## 4 Wakefield 99251 1 8
## 5 Hastings 91053 1 8
```

And the City of Coventry takes it, with a population of 325,000 and a team only in the 7th tier (Bedworth United, who play just outside the city). This is only because the city’s main team, Coventry City, are playing in Birmingham due to ongoing difficulties finding a stadium within their own city.

I think, in the spirit of the question, the true answer is one of Warrington (165k), Poole (154k), York (152k), or Telford (142k). Of these, York is probably the most major and independent as a town/city.

Finally, the article in which the question was shared posted the follow-up: what is the smallest town to hold a league club?

We can answer this using the directory of populated places produced by the Ordnance Survey. Again, download the .csv and point the script at it to reproduce.

```
#load the .csv of small places in the uk
small_places <- read.csv("path/to.csv", stringsAsFactors = FALSE) %>%
#only interested in England
filter(COUNTRY == "England") %>%
select(name = NAME1, x = X, y = Y) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(27700)) %>%
#find small places outside of the large towns
.[-unique(unlist(st_contains(towns, .))),] %>%
#buffer each place by 100m
st_buffer(., 100)
```

```
#plot these for a cool map
plot(select(small_places))
```

It’s then simple to find the small places that hold football clubs using sf::st_contains and indexing.

```
#find clubs that are in these remaining places
small_town_clubs <- st_contains(small_places, filter(club_locations, tier < 5))
small_places[which(lengths(small_town_clubs) > 0),]
```

```
## Simple feature collection with 6 features and 1 field
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 351959.9 ymin: 116300 xmax: 391793 ymax: 430665
## epsg (SRID): 27700
## proj4string: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
## name geometry
## 3454 Laneside POLYGON ((376361 429724, 37...
## 3474 Enfield POLYGON ((375759 430165, 37...
## 8843 Moss Lane POLYGON ((391793 371719, 39...
## 17646 Houndstone POLYGON ((353128 117093.1, ...
## 17691 Lufton POLYGON ((352959.9 116800, ...
## 19419 Newmarket POLYGON ((384471.3 199698.6...
```

Enfield and Moss Lane are in London and Manchester, so don’t count. Houndstone and Lufton are actually in the same town (Yeovil) but are still valid answers.

Laneside is located just outside Accrington, where Accrington Stanley play in the third tier (population 35,000).

Yeovil (population 45,000) is home to Yeovil Town who play in the 4th tier.

But the far and away winner is 4th-tier Forest Green Rovers, who play in Nailsworth, population just 5,700. What’s more, the only remotely near town, Stroud, has a population of just 32,000. Most incredible of all, Forest Green Rovers’ stadium can hold ~5,000 people, almost the entire surrounding population.