Advent Calendar of Football Trivia Analyses

One of the most consistent fonts of posts on this blog is The Guardian’s football trivia page The Knowledge. A particular reason for this is that the small contained questions lend themselves to small blogposts that I can turn around in an hour or two, as opposed to being endlessly redrafted until I lose interest.

However, I still sometimes don’t quite get round to finishing some of these posts, or have trouble justifying a blog post on a very small and ‘trivial’ answer to a question. Therefore, as a sort of end-of-year round up, and a Christmas present to myself, I wanted to push out answers to questions I found particularly interesting over the last year and hadn’t quite got round to 1. I’ll probably add them all to this post as I finish them up.

2nd December - Everything in its right place

Answer - yes, kind of. But also no.

This question has actually been answered (as many of these will have been). For a league of 20 teams (like the English Premier League), we might imagine if would have happened over the last ~150 years, but it’s almost certain from some basic maths that it won’t have, and moreover, will never happen.

Let’s load some data and see why.

#as per usual, going to heavily rely on tidyverse 
#and engsoccerdata throughout these posts
library(tidyverse)
library(engsoccerdata)
#load English league data
league_data <- engsoccerdata::england %>%
  #select and gather match results
  select(season = Season, division, home, visitor, hgoal, vgoal) %>%
  gather("location", "team", -season, -division, -hgoal, -vgoal) %>%
  mutate(
    g_for = case_when(
      location == "home" ~ hgoal,
      location == "visitor" ~ vgoal
    ),
    g_ag = case_when(
      location == "home" ~ vgoal,
      location == "visitor" ~ hgoal
    )) %>%
  #get correct point for a win/loss
  mutate(
    points = case_when(
      g_for > g_ag & season < 1981 ~ 2,
      g_for > g_ag & season > 1980 ~ 3,
      g_for == g_ag ~ 1,
      g_for < g_ag ~ 0
    ),
    gd = g_for - g_ag
  ) %>%
  #group by season and league and get final tables
  group_by(season, division, team) %>%
  summarise(points = sum(points),
            gd = sum(gd),
            g_for = sum(g_for)) %>%
  arrange(-points, -gd, -g_for) %>%
  #rank league order and alphabetical order
  mutate(league_pos = rank(-points, ties.method = "first"),
         alph_order = rank(team, ties.method = "first")) %>%
  select(season, division, team, league_pos, alph_order) %>%
  #split by league and season
  split(., f = list(.$season, .$division)) %>%
  keep(function(x) nrow(x) > 0)

#print the top of the first league table
head(league_data[[1]])
## # A tibble: 6 x 5
## # Groups:   season, division [1]
##   season division team                    league_pos alph_order
##    <int>    <int> <chr>                        <int>      <int>
## 1   1888        1 Preston North End                1          9
## 2   1888        1 Aston Villa                      2          2
## 3   1888        1 Wolverhampton Wanderers          3         12
## 4   1888        1 Blackburn Rovers                 4          3
## 5   1888        1 Bolton Wanderers                 5          4
## 6   1888        1 West Bromwich Albion             6         11

We can then run a load of Spearman’s rank correlation tests on the data to see which ones are perfectly correlated or anti-correlated in both league and alphabetical ranking. We’ll use the very handy broom package to tidy the results of our many tests into one data.frame (remove the filter at the end of the pipe chain to see what gets output).

#use broom to tidily do stats
library(broom)

#correlate league and alphabetical order by year
exact_correlations <- league_data %>%
  map_df(., function(data) {
    cor.test(
      data$league_pos,
      data$alph_order,
      method = "spearman"
    ) %>%
      tidy() %>%
      mutate(season = unique(data$season),
             division = unique(data$division))
  }) %>%
  #take only significantly 
  filter(abs(statistic) == 1)

print(exact_correlations)
## # A tibble: 0 x 7
## # ... with 7 variables: estimate <dbl>, statistic <dbl>, p.value <dbl>,
## #   method <chr>, alternative <chr>, season <int>, division <int>

And so we find no exact correlations. There are no instances in 363 separate seasons of English league football where teams line up in either alphabetical, or anti-alphabetical order.

Let’s see why this is. To make things simpler, I’m going to imagine a cutdown league of only 6 teams using teams starting with each of the first 6 letter of the alphabet:

first_letter_names <- league_data %>%
  bind_rows() %>%
  ungroup() %>%
  #get first letter of a team name
  mutate(first_letter = gsub("(^.)(.*)", "\\1", team)) %>%
  filter(season > 1992 &
           division == 1 &
           first_letter %in% toupper(letters[1:6])
         ) %>%
  #get one team beginning with A, B, C...
  filter(!duplicated(first_letter)) %>%
  select(team) %>%
  arrange(team) %>%
  print()
## # A tibble: 6 x 1
##   team            
##   <chr>           
## 1 Arsenal         
## 2 Blackburn Rovers
## 3 Coventry City   
## 4 Derby County    
## 5 Everton         
## 6 Fulham

For the league to finish in alphabetical order, we first need the team that is first alphabetically (Arsenal) to finish in first position. Assuming all teams have an equal chance of winning the league, the chance of this is obviously

\[ p(Arsenal = 1) = \frac{1}{n}\]

Then we need the second team (Blackburn Rovers), to finish in second. This is predicated on Arsenal already finishing in first position, so the chance becomes

\[ p(Blackburn = 2 | Arsenal = 1) = \frac{1}{n-1} \]

and so on until the last team (Fulham) just have to slot into the only position left (n, 6th in our example)

Thus the total chance becomes

\[ \frac{1}{n} \cdot \frac{1}{n-1} ... \cdot \frac{1}{1} \]

which can also be written

\[ p(ordered) = \prod_{n = 1}^{N} \frac{1}{n}\]

which multiplies out to

\[ p(ordered) = \frac{1}{n!} \]

so for our very small league the chance of n (assumed equally strong teams)

factorial(nrow(first_letter_names))
## [1] 720

so we have a 1/720 chance that this league ends perfectly in alphabetical order. For bigger leagues (for reference most large European leagues contain 18-24 teams) this denominator grows super-exponentially and becomes tiny.

For the English Premier League (20 teams) for instance the chance becomes

league_data %>%
  bind_rows() %>%
  ungroup() %>%
  filter(season == max(season) & division == 1) %>% 
  nrow() %>%
  factorial()
## [1] 2.432902e+18

or 1 in 2.4 quintillion. In short, if it’s assumed that there’s no relation between order of names and team strength, we might expect the universe to end before all 20 teams finish in perfect order.

We can test if our predictions bear out by looking at tiny leagues with small numbers of teams, e.g. the group stages of the Champions/Europa Leagues.

First we need to scrape the final tables for the last 8 years of data from both competitions:

library(rvest)

#website to scrape group stage data from
fb_data <- "https://footballdatabase.com"
ucl_links <- sprintf(
  "/league-scores-tables/uefa-champions-league-20%s-%s",
  10:18, 11:19
)
europa_links <- sprintf(
  "/league-scores-tables/uefa-europa-league-20%s-%s",
  10:18, 11:19
)
#function to scrape the data from these links
get_competition_data <- function(competition, links) {
  data <- links %>%
    paste0(fb_data, .) %>%
    map_df(., function(year) {
      page_read <- read_html(year)
      
      groups <- letters[1:8] %>%
        map_df(., function(group) {
          page_read %>% 
            html_nodes(sprintf("#total-group-%s > div > table", group)) %>% 
            html_table(fill = TRUE) %>% 
            as.data.frame() %>%
            mutate(group)
        }) %>%
        mutate(year = gsub("(.*-)([0-9]{4}-[0-9]{2})", "\\2", year))
    }) %>%
    mutate(competition)
}
#scrape and bind the data
uefa_data <- bind_rows(
  get_competition_data("champions", ucl_links),
  get_competition_data("europa", europa_links)
)
#print a cutdown version of the scraped data
head(uefa_data %>% select(club = Club, points = P, year, competition))
##                club points    year competition
## 1 Tottenham Hotspur     11 2010-11   champions
## 2       Inter Milan     10 2010-11   champions
## 3         FC Twente      6 2010-11   champions
## 4     Werder Bremen      5 2010-11   champions
## 5        Schalke 04     13 2010-11   champions
## 6              Lyon     10 2010-11   champions

So now we have 128 (8 groups x 8 years x 2 competitions) ‘mini-leagues’ each of 4 teams.

We can then munge this data to find all the groups where the teams finish in alphabetical order. We’d expect 128/4! leagues to finish in alphabetical order (or 5.33 to be exact).

ordered_groups <- uefa_data %>%
  #select relevant informatiob
  select(team = Club, league_pos = X., group, year, competition) %>%
  #by group find where teams finish in alphabetical order
  group_by(year, group, competition) %>%
  mutate(alph_order = rank(team, ties.method = "first")) %>%
  filter(league_pos == alph_order) %>%
  #keep only group where all (4) teams finish in order
  summarise(n = n()) %>%
  filter(n == 4) %>%
  #join and filter back data
  left_join(uefa_data, ., by = c("group", "year", "competition")) %>%
  filter(!is.na(n)) %>%
  #select useful information
  select(team = Club, points = P, gd = X..., league_pos = X.,
         group, year, competition) %>%
  #split groups up
  split(., list(.$year, .$group, .$competition)) %>%
  keep(function(x) nrow(x) > 0)

which leaves us with 5 leagues that have finished in order! almost exactly what we’d predict by chance if the first letter of a teams name had no effect on the outcome.

ordered_groups
## $`2011-12.c.champions`
##                team points gd league_pos group    year competition
## 5           Benfica     12  4          1     c 2011-12   champions
## 6          FC Basel     11  1          2     c 2011-12   champions
## 7 Manchester United      9  3          3     c 2011-12   champions
## 8     Otelul Galati      0 -8          4     c 2011-12   champions
## 
## $`2015-16.c.champions`
##                team points gd league_pos group    year competition
## 9   Atlético Madrid     13  8          1     c 2015-16   champions
## 10          Benfica     10  2          2     c 2015-16   champions
## 11      Galatasaray      5 -4          3     c 2015-16   champions
## 12 Lokomotiv Astana      4 -6          4     c 2015-16   champions
## 
## $`2010-11.f.champions`
##             team points  gd league_pos group    year competition
## 1     Chelsea FC     15  10          1     f 2010-11   champions
## 2      Marseille     12   9          2     f 2010-11   champions
## 3 Spartak Moskva      9  -3          3     f 2010-11   champions
## 4         Žilina      0 -16          4     f 2010-11   champions
## 
## $`2015-16.g.champions`
##                   team points  gd league_pos group    year competition
## 13          Chelsea FC     13  10          1     g 2015-16   champions
## 14         Dynamo Kyiv     11   4          2     g 2015-16   champions
## 15            FC Porto     10   1          3     g 2015-16   champions
## 16 Maccabi Tel Aviv FC      0 -15          4     g 2015-16   champions
## 
## $`2018-19.h.champions`
##                 team points gd league_pos group    year competition
## 17          Juventus     12  5          1     h 2018-19   champions
## 18 Manchester United     10  3          2     h 2018-19   champions
## 19          Valencia      8  0          3     h 2018-19   champions
## 20        Young Boys      4 -8          4     h 2018-19   champions
## 
## $`2012-13.h.europa`
##                      team points gd league_pos group    year competition
## 21         FC Rubin Kazan     14  7          1     h 2012-13      europa
## 22            Inter Milan     11  2          2     h 2012-13      europa
## 23                 Neftçi      3 -4          3     h 2012-13      europa
## 24 Partizan Beograd (SRB)      3 -5          4     h 2012-13      europa

We can also do a larger test by randomly selecting teams out of the English league data we looked at earlier. To do this I need two quick functions: one to sample randomly from the data, and another to carry out the correlation test.

The first takes a number of samples (how many tests to run) and then selects a number of teams from each league sample. For instance, if I chose 3 teams, it might select Liverpool, Manchester United, and Watford, from the last season of the Premier League. These teams finished 2nd, 6th, and 11th respectively, so this ‘sampled league’ would fulfill the criteria of finishing in alphabetical order.

set.seed(3459)

#take a random sample of leagues and teams withing those leagues
sample_cutdown_leagues <- function(nteams, nsamples, data) {
  samples <- sample(length(data), nsamples, replace = TRUE)
  
  sampled_league_data <- data[samples]
  
  league_team_serials <- sampled_league_data %>%
    lapply(., nrow) %>%
    lapply(., sample, size = nteams)
  
  #carry out the correlation test
  league_cor_test <- map2_df(
    .x = sampled_league_data,
    .y = league_team_serials,
    .f = cor_test_data
  )
}
  
#function for correlation test
cor_test_data <- function(full_league_data, sampled_teams) {
  sampled_league <- full_league_data[sampled_teams,] %>%
    arrange(league_pos)
  cor_test <- cor.test(
    sampled_league$league_pos,
    sampled_league$alph_order,
    method = "spearman"
  ) %>%
    tidy() %>%
    #mutate on information about that season and teams chosen
    mutate(teams = paste(sampled_league$team, collapse = ", "),
           season = unique(sampled_league$season),
           division = unique(sampled_league$division))
}

So for instance if I just run it once, randomly selecting 4 teams:

test <- sample_cutdown_leagues(4, 1, league_data)
#print the teams selected
test$teams
## [1] "Brentford, Bristol Rovers, Brighton & Hove Albion, Chester"
test
## # A tibble: 1 x 8
##   estimate statistic p.value method   alternative teams     season division
##      <dbl>     <dbl>   <dbl> <chr>    <chr>       <chr>      <int>    <int>
## 1      0.8      2.00   0.333 Spearma~ two.sided   Brentfor~   1994        3

It gives me 4 teams from the 1994 division 3 who didn’t finish in alphabetical order (though, amusingly, all have a very similar starting letter).

We can then carry this out with 10000 samples for n_team numbers of 2:6 to see if we get roughly the expected numbers of exactly correlated league finish positions (this will take 1-2mins) by finding out how many tests give an estimate of 1 (finished exactly correlated with alphabetical order) or -1 (finished exactly anti-correlated with alphabetical order).

Both these numbers should be roughly equal to the number of samples (10000) divided by the factorial of the number of teams selected.

test_n_numbers <- function(nteams) {
  #run sampling function n times
  #10k should do
  sampling <- sample_cutdown_leagues(nteams, 10000, league_data)
  
  #find exactly correlated and anti-correlated examples
  #where teams are in exact alphabetical order ascending or descending
  correlated <- length(which(sampling$estimate == max(sampling$estimate)))
  anti_correlated <- length(which(sampling$estimate == min(sampling$estimate)))
  expected <- nrow(sampling) / factorial(nteams)
  
  df <- data.frame(n = nteams,
                   sample_cor = correlated,
                   sample_anticor = anti_correlated,
                   sample_expected = expected)
}
#run the function
testing <- map_df(2:6, test_n_numbers)
#print results
print(testing)
##   n sample_cor sample_anticor sample_expected
## 1 2       5010           4990      5000.00000
## 2 3       1676           1665      1666.66667
## 3 4        367            398       416.66667
## 4 5        101             81        83.33333
## 5 6         14             15        13.88889

And the numbers line up, as we would expect if there is no effect of the first letter of a team’s name upon final league position.

Finally, we can do a Kendall’s correlation test to really see if there is any relationship between alphabetical team name order and final league finish for all out our English league data. We use Kendall instead of a Spearman test here because we grouping all the data together we’re going to have a lot of ties (one team has to finish 1st in every league each year).

all_data <- league_data %>%
  bind_rows()

#do a big correlation test
kendall_test <- cor.test(all_data$alph_order,
                         all_data$league_pos,
                         alternative = "two.sided",
                         method = "kendall") %>%
  tidy() %>%
  print()
## # A tibble: 1 x 5
##   estimate statistic p.value method                         alternative
##      <dbl>     <dbl>   <dbl> <chr>                          <chr>      
## 1   0.0135      1.74  0.0826 Kendall's rank correlation tau two.sided

And we can see that, even though our p-value is ‘approaching significance’, it’s not significant at our fairly liberal threshold of 0.05. Even then, the effect size (0.013) is tiny, so there’s no need for Watford to start worrying just yet.

  1. SMALL DIGRESSION: I love blogging on this site and it also has been a great help to me in numerous ways (practice coding/writing, feeling like a “programmer”, for job interviews), but quite a lot of the time feel posts are not quite where I want them (I’m sure this feeling isn’t restricted to me) and so won’t put them up and so that time (sometimes quite a few hours!) I put into them in my spare time feels wasted and makes me feel worse about myself. I’m hoping that pushing out fairly rushed/half formed ideas like this will help with this.

3rd December - Groan Rangers

“Berwick Rangers have conceded 42 goals in competitive matches – Scottish League 2, relegation play-off, Scottish League – since last scoring themselves, against Peterhead, on March 19th. Is this a record for a league club (I know they’ve now lost that status, but all of these matches are league level competition)?” asks Huw Richards.

Answer - It beats any team in the English league. Reproducible code below if you want to check for other leagues.

(I did check for most of them in the dataset, although this doesn’t include foreign cup competitions. Nothing seems to get close)

This is quite a nice question from a data munging point of view. It’s extremely quantifiable and only involves a little grouping by.

First we’ll load the libraries we’re relying on in this little project:

library(engsoccerdata)
library(tidyverse)

I’m going to focus on the English league as it has the most data and also has data on the concurrent cup competitions. It’s super easy to sub in whichever competitions in the engsoccerdata package you want.

We want to first bind the data from the league, league cup, fa cup, and league playoffs together with a little munging. Then we want to gather the data to get the goals scored and goals conceded in each game for each team.

#bind all the match data together with relevant variables
scoring_data <- bind_rows(
  engsoccerdata::england %>%
    select(date = Date, tier, home, visitor, hgoal, vgoal) %>%
    mutate(date = as.Date(date),
           competition = "league"),
  engsoccerdata::facup %>%
    select(date = Date, home, visitor, hgoal, vgoal) %>%
    mutate(date = as.Date(date),
           tier = NA, 
           competition = "fa_cup"),
  engsoccerdata::leaguecup %>%
    select(date = Date, home, visitor, hgoal, vgoal) %>%
    mutate(date = as.Date(date),
           tier = NA,
           competition = "league_cup"),
  engsoccerdata::englandplayoffs %>%
    select(date = Date, home, visitor, hgoal, vgoal, htier, vtier) %>%
    mutate(date = as.Date(date), 
           tier = (htier+vtier)/2, 
           competition = "league_playoffs") %>%
    select(-htier, -vtier),
) %>%
  #gather and find matches for each team
  gather("location", "team", -date, -hgoal, -vgoal, -competition, -tier) %>%
  split(f = .$location) %>%
  map2_df(., rev(.), function(df, vs_data) 
    mutate(df, opponent = vs_data$team)
  ) %>%
  #add in goals for and against
  mutate(goals_for = case_when(
    location == "home" ~ hgoal,
    TRUE ~ vgoal
  )) %>%
  mutate(goals_against = case_when(
    location == "visitor" ~ hgoal,
    TRUE ~ vgoal
  )) %>%
  arrange(team, date) %>%
  group_by(team)

head(scoring_data)
## # A tibble: 6 x 10
## # Groups:   team [1]
##   date        tier hgoal vgoal competition location team  opponent
##   <date>     <dbl> <dbl> <dbl> <chr>       <chr>    <chr> <chr>   
## 1 1875-11-06    NA     0     0 fa_cup      home     105t~ Crystal~
## 2 1875-11-20    NA     3     0 fa_cup      visitor  105t~ Crystal~
## 3 1876-11-11    NA     3     0 fa_cup      home     105t~ 1st Sur~
## 4 1876-12-14    NA     6     1 fa_cup      visitor  105t~ Oxford ~
## 5 1877-11-07    NA     0     2 fa_cup      home     105t~ Old Har~
## 6 NA            NA    NA    NA fa_cup      visitor  105t~ Minerva 
## # ... with 2 more variables: goals_for <dbl>, goals_against <dbl>

Next we need to find the start of each run of games where a team has failed to score. We can do this by finding the first instance of 0 goals scored using lag(). We’ll then give an id to each ‘run’ of finishing games without scoring.

(I’m aware that teams can also concede goals in a run having scored first in a match but there’s no way to factor that in with the data)

We then use the na.locf() function from the very useful zoo package to fill in the runs where no goals have been scored.

We can then finish answering the question (already!) by grouping by run and summing the total number of goals conceded in that time.

#load the zoo library for helping filling NA values
library(zoo)

dry_runs <- scoring_data %>%
  #find the start of runs
  mutate(run_start = case_when(
    goals_for == 0 & lag(goals_for, default = 1) != 0 ~ 1:n()
  )) %>%
  #only care about games where didn't score
  filter(goals_for == 0) %>%
  #fill in NAs to get full runs
  mutate(run_id = na.locf(run_start, na.rm = FALSE)) 

longest_dry_runs <- dry_runs %>%
  #group runs by id
  group_by(run_id, team) %>%
  #find total conceeded over n games
  mutate(total_conceeded = sum(goals_against),
         run_start_date = min(date),
         matches = n()) %>%
  #take only the last instance
  filter(!duplicated(run_id, fromLast = TRUE)) %>%
  select(run_start_date, run_end_date = date, team, run_id, total_conceeded, matches) %>%
  #find the most 'impressive' runs
  filter(total_conceeded > 15) %>%
  arrange(-total_conceeded)

head(longest_dry_runs)
## # A tibble: 6 x 6
## # Groups:   run_id, team [6]
##   run_start_date run_end_date team           run_id total_conceeded matches
##   <date>         <date>       <chr>           <int>           <dbl>   <int>
## 1 1899-01-14     1899-03-11   Darwen            273              38       7
## 2 1898-11-12     1898-12-26   Darwen            263              35       7
## 3 1891-12-12     1892-01-09   Darwen             60              31       5
## 4 2019-04-09     2019-08-31   Bolton Wander~   5447              29      11
## 5 1877-12-22     1886-10-23   1st Surrey Ri~      8              27       3
## 6 1880-12-18     1894-01-27   Reading             6              27       3

And can see that two 7 game runs from the (now-defunct) Darwen FC are top of the list. Around 1898/1899 the team conceded 35 and 38 goals without scoring themselves.

Manually looking at the data, we can see that these two streaks are broken only by a few losses over Christmas 1898, a losing run of 18 games! Indeed, Darwen only won 2 games that season and set the record for most goals conceded (141).

7 Years earlier, the same team managed an impressive run of letting in 31 goals in just 5 matches, without scoring. If we want to check out the game in this, we can do by left_join() ing our data together

#joni data to inspect individual games
dry_run_matches <- dry_runs %>%
  left_join(longest_dry_runs, by = c("team", "run_id")) %>%
  filter(!is.na(total_conceeded)) %>%
  select(date, team, opponent, goals_for, goals_against,
         competition, tier, total_conceeded, run_id) %>%
  arrange(-total_conceeded)

#print this
print(filter(dry_run_matches, run_id == 60))
## # A tibble: 5 x 9
## # Groups:   team [1]
##   date       team  opponent goals_for goals_against competition  tier
##   <date>     <chr> <chr>        <dbl>         <dbl> <chr>       <dbl>
## 1 1891-12-12 Darw~ Sunderl~         0             7 league          1
## 2 1891-12-25 Darw~ Blackbu~         0             4 league          1
## 3 1891-12-26 Darw~ Aston V~         0             7 league          1
## 4 1892-01-01 Darw~ Preston~         0             4 league          1
## 5 1892-01-09 Darw~ Burnley          0             9 league          1
## # ... with 2 more variables: total_conceeded <dbl>, run_id <int>

Also, congratulations to the oft-trouble Bolton Wanderers who have got closest to this in modern times, failing to score in 11 straight matches, while conceding 29 goals in the process.

I also wanted to find out the opposite: the team that has scored the most goals without conceding any. It’s super easy with our pipeline- just switch goals_against and goals_for in the chain.

#do the inverse
scoring_runs <- scoring_data %>%
  mutate(run_start = case_when(
    goals_against == 0 & lag(goals_against, default = 1) != 0 ~ 1:n()
  )) %>%
  filter(goals_against == 0) %>%
  mutate(run_id = na.locf(run_start, na.rm = FALSE)) 

longest_scoring_runs <- scoring_runs %>%
  group_by(run_id, team) %>%
  mutate(total_scored = sum(goals_for),
         run_start_date = min(date),
         matches = n()) %>%
  filter(!duplicated(run_id, fromLast = TRUE)) %>%
  select(run_start_date, run_end_date = date, team, run_id, total_scored, matches) %>%
  filter(total_scored > 15) %>%
  arrange(-total_scored)

head(longest_scoring_runs)
## # A tibble: 6 x 6
## # Groups:   run_id, team [6]
##   run_start_date run_end_date team            run_id total_scored matches
##   <date>         <date>       <chr>            <int>        <dbl>   <int>
## 1 2010-04-25     2010-08-28   Chelsea           4372           32       7
## 2 1929-03-06     1929-03-30   Bradford City      919           29       5
## 3 2019-01-06     2019-01-26   Manchester City   5194           28       6
## 4 1903-04-10     1903-10-03   Arsenal            328           26       8
## 5 1880-01-17     1880-11-13   Clapham Rovers      25           26       5
## 6 1885-10-24     1885-12-12   Notts County        32           26       3

Where we can see that Chelsea’s impressive end to the 2009-2010 season puts them top, having scored 32 goals without reply. Almost all the other top examples are from pre-war football, except Manchester City coming close last year with 28 goals scored without conceding.

When we look at this run we can see it was greatly helped along by some demolitions in the cups, winning 5-0, 9-0 and 7-0 against Burnley, Burton Albion, and Rotherham United.

scoring_run_matches <- scoring_runs %>%
  left_join(longest_scoring_runs, by = c("team", "run_id")) %>%
  filter(!is.na(total_scored)) %>%
  select(date, team, opponent, goals_for, goals_against,
         competition, tier, total_scored, run_id) %>%
  arrange(-total_scored)

#print this
print(filter(scoring_run_matches, run_id == 5194))
## # A tibble: 6 x 9
## # Groups:   team [1]
##   date       team  opponent goals_for goals_against competition  tier
##   <date>     <chr> <chr>        <dbl>         <dbl> <chr>       <dbl>
## 1 2019-01-06 Manc~ Rotherh~         7             0 fa_cup         NA
## 2 2019-01-09 Manc~ Burton ~         9             0 league_cup     NA
## 3 2019-01-14 Manc~ Wolverh~         3             0 league          1
## 4 2019-01-20 Manc~ Hudders~         3             0 league          1
## 5 2019-01-23 Manc~ Burton ~         1             0 league_cup     NA
## 6 2019-01-26 Manc~ Burnley          5             0 fa_cup         NA
## # ... with 2 more variables: total_scored <dbl>, run_id <int>

5th December - We’re going to Wembley

Answer - Multiple teams have played 5 FA cup matches all at home. To answer the inverse question, Queen’s Park in 1883/1884 and 1884/1885 have had the farthest to travel

For this question, I’m actually going to answer the opposite topic- which team have traveled the farthest in a cup run? The reason being is that multiple teams have had cup runs (of 5 matches in the FA cup) without travelling away from home at all. The code below could easily be changed to analyse other cup competitions, for simplicity, I’m sticking with the FA cup which has the most complete data in the engsoccerdata set.

Once again, we’ll start by loading libraries. We also want the sf package that makes working with spatial data a bit cleaner.

library(engsoccerdata)
library(tidyverse)
#also want sf to manipulate spatial features
library(sf)

Then we want to grab the data. In a recent update of the engsoccerdata package I added the location of grounds for teams in England which will let us find the distances teams have traveled to matches.

I also download a shapefile of the UK from GADM for plotting and to filter out any bad data in ground location (which still is very much in beta).

#download a map of the uk to plot with
shape_url <- "https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_GBR_0_sf.rds"
temp_dir <- tempdir()
download.file(shape_url, destfile = file.path(temp_dir, "shapefiles.rds"), mode = "wb")
uk <- st_as_sf(readRDS(file.path(temp_dir, "shapefiles.rds")))

#the location of football grounds in the dataset
grounds <- engsoccerdata::england_locations %>%
  st_as_sf(coords = c("lon", "lat"), crs = st_crs("+init=epsg:4326")) %>%
  st_transform(crs = st_crs(uk)) %>%
  #remove those that are bad data-outside the uk
  .[seq(nrow(.)) %in% unlist(st_contains(uk, .)),]
#get the fa cup match data
matches <- engsoccerdata::facup

There’s some ambiguity in the question as to how the distance of a cup run should be measured. There’s really two ways we can do this, which I will henceforth refer to as a ‘routing’ as ‘spoking’. To illustrate the two different approaches, I’ll use Southampton’s 2017/2018 FA cup run

#e.g. with Southampton's semi final run in 2017/2018
Southampton <- matches %>%
    filter(Season == 2017 & 
             (home == "Southampton" | 
                visitor == "Southampton")
           ) %>%
  select(Venue, Date, home, visitor, neutral) %>%
  gather("location", "team", -Venue, -Date, -neutral) %>%
  filter(location == "home" | neutral == "yes") %>%
  filter(!duplicated(Date)) %>%
  mutate(location = case_when(
    neutral == "yes" ~ Venue,
    TRUE ~ team
  ))

#print matches
Southampton
## # A tibble: 5 x 5
##   Venue            Date       neutral location           team              
##   <chr>            <date>     <chr>   <chr>              <chr>             
## 1 Craven Cottage   2018-01-06 <NA>    Fulham             Fulham            
## 2 St Mary's Stadi~ 2018-01-27 <NA>    Southampton        Southampton       
## 3 The Hawthorns    2018-02-17 <NA>    West Bromwich Alb~ West Bromwich Alb~
## 4 DW Stadium       2018-03-18 <NA>    Wigan Athletic     Wigan Athletic    
## 5 Wembley Stadium  2018-04-22 yes     Wembley Stadium    Chelsea

We’ll join the ground geography data to this to figure out distances traveled

#will need the location of southamptons ground
Southampton_home <- grounds %>%
  filter(location == "Southampton")
#and the locations of all their matches
match_locations <- Southampton %>%
  left_join(., select(grounds, location, geometry), by = "location")

The first method of calculating distance traveled per match (‘spoking’) takes the location of each match, and finds the distance from that team’s home ground. E.g. for Southampton:

#find the line from southampton's ground to the match location
spoke_lines <- st_coordinates(match_locations$geometry) %>%
  split(f = rownames(.)) %>%
  #create line from geometry1 to geometry2
  lapply(function(x) {
    mat <- rbind(x, st_coordinates(Southampton_home$geometry))
    line <- st_linestring(mat)
    return(line)
  }) %>%
  #cast to multiline and add projection
  st_multilinestring() %>%
  st_sfc(crs = st_crs("+init=epsg:4326"))

#plot over uk
p <- ggplot() +
  geom_sf(data = uk) +
  geom_sf(data = spoke_lines, colour = "blue", size = 1.5) +
  theme_minimal()

plot(p)

#get the total length in metres
st_length(spoke_lines)
## 698614.6 [m]

For any home games, the distance traveled is taken to be 0m.

The second method of calculating distance traveled takes the distance from each match to the next match in run. I.e. it assumes a team stays in their past location until the next round (obviously not true in real life) and finds the distance to the ground for the next round game:

#need coords separate
coords <- st_coordinates(match_locations$geometry) %>%
  split(f = rownames(.))

#find distance from one coords to next
travel_lines <- mapply(coords, lead(coords), FUN = function(x, y) {
  #for last match no further path
  if(is.na(y)) {
    return(NULL)
  } else {
    mat <- rbind(x, y)
    line <- st_linestring(mat)
    return(line)
  }
}) %>%
  #filter null last element
  .[-which(lengths(.) == 0)] %>%
  st_multilinestring() %>%
  st_sfc(crs = st_crs("+init=epsg:4326"))

#plot over uk
p <- ggplot() +
  geom_sf(data = uk) +
  geom_sf(data = travel_lines, colour = "blue", size = 1.5) +
  theme_minimal()

plot(p)

#get the length
st_length(travel_lines)
## 684586.7 [m]

So Southampton here begin in London away to Fulham, before travelling to their home in Southampton, then on to Birmingham (West Brom), Wigan, and back to London to play at Wembley.

Now we have the two methods, we need to prep the data we’re going to analyse. As before, this is done by gathering a df of match data so we have one row per match per team in the FA cup. We then group by team-season’s and find how many matches they played in the cup that year. For a ‘cup run’ we assume more than 3 matches have to played.

#gather each match per team into a separate row
long_campaigns <- matches %>%
  filter(!is.na(Date)) %>%
  select(Season, home, visitor) %>%
  gather("location", "team", -Season) %>%
  #remove rows with missing teams
  filter(!is.na(team)) %>%
  #get the number of fa cup matches per season per team
  group_by(team, Season) %>%
  summarise(matches = n()) %>%
  #assume you need at least 3 matches for a 'run'
  filter(matches > 3) %>%
  ungroup() %>%
  arrange(Season) %>%
  mutate(id = 1:n())

long_campaign_matches <- matches %>%
  select(., Season, Date, round, 
         neutral, Venue,
         team = home, opponent = visitor) %>%
  mutate(location = "home") %>%
  #bind in the opposite data for away teams
  bind_rows(., mutate(select(., Season, Date, round,
                             neutral, Venue,
                             team = opponent, opponent = team),
                      location = "away")) %>%
  filter(!is.na(team)) %>%
  #left join the data for long campaigns
  left_join(., long_campaigns, by = c("Season", "team")) %>%
  #remove non-long campaigns
  filter(!is.na(matches)) %>%
  #get the location of the match
  mutate(location = case_when(
    neutral == "yes" ~ Venue,
    location == "home" ~ team,
    location == "away" ~ opponent,
  )) %>%
  #left join in the location for the corresponding ground
  left_join(grounds, c("location")) %>%
  #select columns
  select(season = Season, date = Date, round, 
         neutral, team, opponent, 
         location, matches, id, geometry)

#print 
head(long_campaign_matches)
## # A tibble: 6 x 10
##   season date       round neutral team  opponent location matches    id
##    <dbl> <date>     <chr> <chr>   <chr> <chr>    <chr>      <int> <int>
## 1   1871 1871-12-16 2     <NA>    Crys~ Maidenh~ Crystal~       5     1
## 2   1871 1871-12-16 2     <NA>    Wand~ Clapham~ Wandere~       4     3
## 3   1871 1872-01-20 3     <NA>    Wand~ Crystal~ Wandere~       4     3
## 4   1871 1872-01-27 3     <NA>    Roya~ Hampste~ Royal E~       5     2
## 5   1871 1872-02-17 s     yes     Crys~ Royal E~ Kenning~       5     1
## 6   1871 1872-03-09 s     yes     Roya~ Crystal~ Kenning~       5     2
## # ... with 1 more variable: geometry <POINT [°]>

We can then find the routing distance using a nice trick I found on Stack Overflow to find the distance between each location and the next in the data.frame.

Finally, this is grouped by id and summed to get the total distance traveled in that cup campaign (when judging by the ‘routing’ metric).

#taken from
#https://github.com/r-spatial/sf/issues/799
#init an empty sfc
empty <- st_as_sfc("POINT(EMPTY)")
routing_distances <- long_campaign_matches %>%
  arrange(id, date) %>%
  filter(!is.na(date)) %>%
  group_by(id) %>%
  #find the distance from one game to the next
  mutate(
    distance_to_next = sf::st_distance(
      geometry, 
      lag(geometry, default = empty), 
      by_element = TRUE)
    ) 

#sum the distances
grouped_routing_distances <- routing_distances %>%
  summarise(travel_distance = sum(distance_to_next, na.rm = TRUE)) %>%
  merge(long_campaigns, by = "id") %>%
  #conver to km
  mutate(total_distance = travel_distance / 1000) %>%
  select(id, team, matches, season = Season, total_distance) %>%
  mutate(av_distance = total_distance / matches) %>%
  arrange(-total_distance)

head(grouped_routing_distances)
##     id             team matches season total_distance av_distance
## 1  111      Queens Park       8   1884       2684.363    335.5454
## 2  668      Exeter City       8   1930       2362.073    295.2592
## 3  293 Newcastle United       8   1904       2316.524    289.5655
## 4  512 Newcastle United       9   1923       2112.232    234.6924
## 5 1182 Newcastle United      10   1954       2017.824    201.7824
## 6 1090        Gateshead       8   1951       2015.698    251.9623

By functionalising our code from earlier, we can easily plot these well-traveled runs. Using the recently-added-to-CRAN patchwork package, we can make multiple plots and stitch them together, e.g. for the 6 FA cup runs with the longest distances traveled we get:

#functionalise our code from earlier to plot travel routes easier
plot_travel_lines <- function(run_team_year, plot_type) {
  run_matches <- long_campaign_matches %>%
    mutate(id = paste(team, season)) %>%
    filter(id == run_team_year) %>%
    arrange(date)
  coords <- st_coordinates(run_matches$geometry) %>%
    split(f = rownames(.))
  if(plot_type == "travel") {
    lines <- mapply(coords, lead(coords), FUN = function(x, y) {
    if(is.na(y)) {
      return(NULL)
    } else {
      mat <- rbind(x, y)
      line <- st_linestring(mat)
      return(line)
    }
    }) %>%
    .[-which(lengths(.) == 0)] %>%
    st_multilinestring() %>%
    st_sfc(crs = st_crs("+init=epsg:4326"))
  } else if(plot_type == "spokes") {
    home <- grounds %>%
      filter(location == unique(run_matches$team)) %>%
      st_coordinates()
    lines <- lapply(coords, function(x) {
      mat <- rbind(x, home)
      line <- st_linestring(mat)
      return(line)
    }) %>%
      #cast to multiline and add projection
      st_multilinestring() %>%
      st_sfc(crs = st_crs("+init=epsg:4326"))
  }
  plot <- ggplot() +
    geom_sf(data = uk) +
    geom_sf(data = lines, colour = "blue", size = 1.5) +
    labs(title = paste(run_team_year)) +
    theme_minimal()
  return(plot)
}

#plot the top six
library(patchwork)
paste(grouped_routing_distances$team[1:6],
      grouped_routing_distances$season[1:6]) %>%
  lapply(., plot_travel_lines, plot_type = "travel") %>%
  wrap_plots(.)

We then need to check this against our other method of evaluating distances in a cup run- the ‘spoking’ method. This is much easier to calculate- all we have to do is left_join() in the location for each team’s home ground, and find the distance between this and the match location.

Then we simply sum the total distances per campaign and plot the longest of these:

spoke_distances <- long_campaign_matches %>%
  #left join in location data for each team
  left_join(grounds, by = c("team" = "location")) %>%
  #calculate distance between each teams home ground and the match location
  mutate(distance = st_distance(geometry.x, geometry.y, by_element = TRUE))

#group by and sum the cup run distances
grouped_spoke_distances <- spoke_distances %>%
  group_by(team, season) %>%
  summarise(total_distance = sum(distance/1000, na.rm = TRUE),
            av_distance = mean(distance/1000, na.rm = TRUE)) %>%
  arrange(-total_distance)

#print
head(grouped_spoke_distances)
## # A tibble: 6 x 4
## # Groups:   team [3]
##   team             season total_distance av_distance
##   <chr>             <dbl>            [m]         [m]
## 1 Queens Park        1883       2150.504    307.2149
## 2 Newcastle United   1923       1974.430    219.3811
## 3 Newcastle United   1951       1957.196    279.5994
## 4 Fulham             1974       1810.676    150.8897
## 5 Newcastle United   1973       1781.279    178.1279
## 6 Queens Park        1884       1702.166    170.2166
#plot
paste(grouped_spoke_distances$team[1:6],
      grouped_spoke_distances$season[1:6]) %>%
  lapply(., plot_travel_lines, plot_type = "spokes") %>%
  wrap_plots(.)

10th December - This Town Ain’t Big Enough For a League Football Team

“What’s the largest town/city without a League club? I reckon Maidstone takes some beating (population 139,000 - about the same as Blackburn). Unlike Blackburn Rovers, Maidstone United play in the Kent League (of course Blackburn would not actually be eligible) which is some way below the Conference and Dr Martins Leagues. But being Maidstone United of course they play all their fixtures 12 miles away in Sittingbourne,” writes Peter Driver. IN 2003

Answer -

To answer this question, first we need data on towns and cities in England. We’re going to rank by population so can download the population table found here to start with

#scrape data on town/city ppulations in UK
pops <- "http://lovemytown.co.uk/populations/TownsTable1.asp" %>%
  read_html() %>%
  html_nodes("#mainContent > table:nth-child(3)") %>%
  html_table(fill = TRUE, header = TRUE) %>%
  as.data.frame() %>%
  #some munging to match datasets later
  mutate(tcity15nm = case_when(
    grepl("^St\\. ", Town) ~ gsub("^St\\. ", "St ", Town),
    grepl("^Hull$", Town) ~ "Kingston upon Hull",
    grepl("^Burton$", Town) ~ "Burton upon Trent",
    grepl("^Newcastle$", Town) ~ "Newcastle upon Tyne",
    grepl("^Southend$", Town) ~ "Southend-on-Sea",
    grepl("^Stoke$", Town) ~ "Stoke-on-Trent",
    TRUE ~ Town
  )) %>%
  #convert population to numeric
  mutate(population = as.numeric(gsub(",", "", Population))) %>%
  select(tcity15nm, population, status = Status)

We’re then going to want the geographic data on these towns. The UK government provides shapefiles for the outlines of ‘Major Towns and Cities’ from a few years ago which should be sufficient for the question. They’re provided as geoJSON files so I’m going to use the geojsonsf package to load them straight as sf objects.

To reproduce this script, you’ll need to download the data from the UK government achives and point the file object towards it

#to read in geojson data as an sf file
library(geojsonsf)

#download the shapefile from
#https://data.gov.uk/dataset/7879ab82-2863-401e-8a29-a56e264d2182/major-towns-and-cities-december-2015-boundaries
file <- "path/to/downloaded/file.geojson"
#load data
towns <- geojson_sf(file) %>%
  left_join(., pops, by = "tcity15nm") %>%
  st_transform(st_crs(27700)) %>%
  #buff the town shapefiles by 2.5km to catch all clubs within
  #reasonable distance of the town
  st_buffer(., 2500)

Then we want the club data. In the latest release of engsoccer data I added some (very beta) non-league data so we have a greater number of teams to pick from. We’ll take the league the team played in in 2018 as this on league data hasn’t been updated to 2019 (the current season) yet.

#get all league and non-league clubs
#non league clubs in new release of engsoccerdata
clubs <- rbind(
  select(england, home, Season, tier), 
  select(england_nonleague, home, Season, tier)) %>%
  #no 2019 data for non league yet
  filter(Season == 2018) %>%
  select(home, tier) %>%
  unique()

#get the locations of each of these clubs in
club_locations <- england_club_data %>%
  st_as_sf(coords = c("lon", "lat"), crs = st_crs(4326)) %>%
  #on uk grid projection
  st_transform(st_crs(27700)) %>%
  left_join(., clubs, by = c("team" = "home")) %>%
  select(team, tier) %>%
  filter(!is.na(tier))

#plot these clubs over major towns in uk
p_town_clubs <- ggplot() +
  geom_sf(data = uk) +
  geom_sf(data = towns, fill = "red") +
  geom_sf(data = club_locations, alpha = 0.15, colour = "blue") +
  labs(title = "location of English football teams relative to major towns") +
  theme_minimal()

p_town_clubs

We can then run a very quick function to find the clubs that are location within each town using sf::st_contains and arrange by our parameters to answer the question!

#function to find which towns contain clubs
town_data <- st_contains(towns, club_locations) %>%
  map_df(., function(x) {
    n_clubs <- length(x)
    if(n_clubs == 0) {
      max_tier <- NA
      tiers <- NA
    } else {
      #get the tiers of the english footballing pyramid that clubs play in
      tiers <- I(list(club_locations$tier[x]))
      max_tier <- min(club_locations$tier[x])
    }
    return(data.frame(n_clubs,
                      max_tier,
                      tiers))
  }) %>%
  #bind to the town data
  bind_cols(towns, .) %>%
  select(town = tcity15nm, pop = population, n_clubs, max_tier) %>%
  #arrange to answer question
  arrange(n_clubs, -max_tier, -pop)

#get rid of unnessecary geometry
st_geometry(town_data) <- NULL

#print answer
head(town_data)
##               town    pop n_clubs max_tier
## 1       Gloucester 136362       0       NA
## 2      Basingstoke 107355       0       NA
## 3        Worcester 100153       0       NA
## 4 Stockton-on-Tees  82729       0       NA
## 5        Guildford  77057       0       NA
## 6 Sutton Coldfield 109015       1        8

So it appears Gloucester, Worcester and Basingstoke are the largest towns without a football club in their city limits. We can double check this using a quick grep function for the clubs (it’s possible this could miss some clubs but is probably accurate enough)

empty_towns <- c("Gloucester", "Basingstoke", "Worcester", "Stockton", "Guildford")

lapply(empty_towns, grep, x = clubs$home) %>%
  unlist() %>%
  clubs$home[.]
## [1] "Gloucester City"  "Basingstoke Town"

So we can see that Gloucester and Basingstoke do in fact have football teams, however a quick Wikipedia search shows that they both play outside their town (so I’m not sure if these count). The largest town* without a football team, down to the 8th tier of English football, is therefore Worcester with 100,000 people, but a team only in the 9th tier of the football pyramid.

The question actually does specify ‘League’ teams, which generally only refers to the top 4 flights on English football. We can then run the function for each tier, finding the largest town without a team in that tier or above.

#find the largest town without a club above tiers 5:8
by_tier <- lapply(4:7, function(x) {
    data <- town_data %>%
      filter(!is.na(max_tier) & max_tier > x) %>%
      arrange(-pop)
  })

#print by max tier
lapply(by_tier, head, n = 5)
## [[1]]
##         town    pop n_clubs max_tier
## 1   Coventry 325949       1        7
## 2 Warrington 165456       1        7
## 3     Slough 155298       2        5
## 4      Poole 154718       2        7
## 5       York 152841       1        6
## 
## [[2]]
##         town    pop n_clubs max_tier
## 1   Coventry 325949       1        7
## 2 Warrington 165456       1        7
## 3      Poole 154718       2        7
## 4       York 152841       1        6
## 5    Telford 142723       1        6
## 
## [[3]]
##               town    pop n_clubs max_tier
## 1         Coventry 325949       1        7
## 2       Warrington 165456       1        7
## 3            Poole 154718       2        7
## 4         Worthing 109120       1        7
## 5 Sutton Coldfield 109015       1        8
## 
## [[4]]
##               town    pop n_clubs max_tier
## 1 Sutton Coldfield 109015       1        8
## 2         Basildon 107123       2        8
## 3        St Helens 102885       1        8
## 4        Wakefield  99251       1        8
## 5         Hastings  91053       1        8

And the City of Coventry takes it, having a population of 325,000 and a team only in the 7th tier (Bedworth United, who play just outside the city). This is only because the city’s main team Coventry City are playing in Birmingham due to ongoing difficulties finding a stadium within their own city.

I think in the spirit of the question, the true answer is one of Warrington (165k), Poole (154k), York (152k), of Telford (142k). Of which York is probably the most major and independent as a town/city.

Finally the article in which the question was shared posted the follow-up: what is the smallest town to hold a league club?

We can answer this using the directory of population places produced by the Ordnance Survey. Again, download the .csv and point the script at it to reproduce.

#load the .csv of small places in the uk
small_places <- read.csv("path/to.csv", stringsAsFactors = FALSE) %>%
  #only interested in England
  filter(COUNTRY == "England") %>%
  select(name = NAME1, x = X, y = Y) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(27700)) %>%
  #find small places outside of the large towns
  .[-unique(unlist(st_contains(towns, .))),] %>%
  #buffer by 500m
  st_buffer(., 100)
#plot these for a cool map
plot(select(small_places))

It’s then simple to find the small places to hold football clubs using sf::st_contains and indexing

#find clubs that are in these remaining places
small_town_clubs <- st_contains(small_places, filter(club_locations, tier < 5))
small_places[which(lengths(small_town_clubs) > 0),]
## Simple feature collection with 6 features and 1 field
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 351959.9 ymin: 116300 xmax: 391793 ymax: 430665
## epsg (SRID):    27700
## proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
##             name                       geometry
## 3454    Laneside POLYGON ((376361 429724, 37...
## 3474     Enfield POLYGON ((375759 430165, 37...
## 8843   Moss Lane POLYGON ((391793 371719, 39...
## 17646 Houndstone POLYGON ((353128 117093.1, ...
## 17691     Lufton POLYGON ((352959.9 116800, ...
## 19419  Newmarket POLYGON ((384471.3 199698.6...

Enfield and Moss Lane are in London and Manchester, so don’t count. Houndstone and Lufton are actually in the same town (Yeovil) but still are valid answers.

Laneside is located just outside Accrington, where Accrington Stanley play in the third tier (population 35,000).

Yeovil (population 45,000) is home to Yeovil Town who play in the 4th tier.

But the far and away winner is 4th tier Forest Green Rovers who play in Nailsworth, with a population of just 5,700 people. Even more, the only remotely near town, Stroud, has a population of just 32,000 people. Most incredible of all, the stadium for Forest Green Rovers can hold ~5,000 people, or almost the entire surrounding population.

Related