Five Minute Football Trivia - Invincibles

generally as I have less and less time to waste on meaningless football stats I get halfway through a post and abandon it. To remedy this, I want to start pushing out posts that give a reasonable half-guess at an answer within an hour or so without needing to really check my working or write good prose. This is the first of these

Liverpool Football Club have had a pretty impressive season until recently, winning 26 of the first 27 games and remaining unbeaten. Last weekend however, they lost 3-0 to Watford which means that Arsenal remain the only team to have gone a full (modern) season of top flight English football unbeaten (in 2003/2004).

Modern football twitter being what it is, a lot of debate has sprung up about which would be more impressive- going to whole season unbeaten like Arsenal, or winning 100 (out of a max 114) points in a single season, as Manchester City did in 2017-2018 and both Manchester City and Liverpool almost did last season. (A third option also is the treble won by Manchester United in 1998/1999 but since Liverpool have also lost to Chelsea in the FA cup this week, that too remains unbeaten).

As usual, first we need some libraries

#munging
library(tidyverse)
#plotting
library(ggrepel)
#football data
library(engsoccerdata)
library(rvest)
#Ben Torvaney's excellend package to model football
library(regista)

#set seed for reproducibility
set.seed(22081992)

Then we can get going loading up the data on English football results up until the end of the 2018/2019 season. We’ll also take some time to find the winners each season which will be useful later. There’s a lot of repetitive munging in this post so bear in mind the 3 main things we’ll be doing are: + pivoting data to longer to get the results for each team (not each match) + working out the goals for and against each team using case_when() + working out the points for each team using case_when()

data <- engsoccerdata::england %>%
  #only care about the top flight in the premier league era
  dplyr::filter(Season > 1991 & Season < 2019 & division == 1) %>%
  select(season = Season, home, away = visitor, hgoal, agoal = vgoal)

league_winners <- data %>%
  #pivot data to longer to get team (rather than match) data
  pivot_longer(c("home", "away"), names_to = "location", values_to = "team") %>%
  #find goals for and goals against per team
  mutate(g_for = case_when(
    location == "home" ~ hgoal,
    location == "away" ~ agoal
  )) %>%
  mutate(g_ag = case_when(
    location == "home" ~ agoal,
    location == "away" ~ hgoal
  )) %>%
  #get the team's points per match
  mutate(points = case_when(
    g_for > g_ag ~ 3,
    g_for == g_ag ~ 1,
    g_ag > g_for ~ 0
  )) %>%
  mutate(gd = g_for - g_ag) %>%
  group_by(team, season) %>%
  #calculate total points and goal difference
  summarise(total_points = sum(points),
            total_gd = sum(gd)) %>%
  #get the winners of each league season
  arrange(season, -total_points, -total_gd) %>%
  group_by(season) %>%
  mutate(league_position = 1:n()) %>%
  ungroup() %>%
  mutate(winner = case_when(
    league_position == 1 ~ "y",
    TRUE ~ "n"
  ))

head(league_winners)
## # A tibble: 6 x 6
##   team                season total_points total_gd league_position winner
##   <chr>                <int>        <dbl>    <int>           <int> <chr> 
## 1 Manchester United     1992           84       36               1 y     
## 2 Aston Villa           1992           74       17               2 n     
## 3 Norwich City          1992           72       -4               3 n     
## 4 Blackburn Rovers      1992           71       22               4 n     
## 5 Queens Park Rangers   1992           63        8               5 n     
## 6 Liverpool             1992           59        7               6 n

We can then use the match data to calculate the offensive and defensive strength of each teams over the whole season using the Dixon-Coles method. I’ve previously written an introduction to this method here (which I need to finish part two of) but suffice to say it takes the goals scored and goals conceded per game and gives a good estimation of how good a team is. It’s similar in concept to fivethirtyeight’s Soccer SPI.

#split data by seasons
fit_data <- data %>%
  split(f = .$season) %>%
  lapply(., function(x) x %>% mutate(home = factor(home), away = factor(away)))

#model using dixoncoles() from the regista package
fits <- lapply(fit_data, function(x) dixoncoles(hgoal, agoal, home, away, x))

We can then extract the parameters from this model to see how teams have performed in each season of the Premier League. I also flip the defence axis (higher being a better defence) as I think it makes a little more sense

parameters <- fits %>%
  #extract the team parameters per fit
  lapply(., function(f) {
    par_data <- f$par[grepl("def_|off_", names(f$par))]
    teams <- unique(gsub("def_*|off_*", "", names(par_data)))
    par_df <- matrix(par_data, ncol = 2) %>%
      as.data.frame() %>%
      rename(attack = V1, defence = V2)
    rownames(par_df) <- teams
    return(par_df)
  }) %>%
  do.call(rbind, .) %>%
  rownames_to_column() %>%
  separate(rowname, c("season", "team"), sep = "\\.") %>%
  mutate(season = as.numeric(season)) %>%
  #flip the defence parameter (higher = better)
  mutate(defence = defence * -1) %>%
  left_join(., league_winners, by = c("season", "team"))

#plot the parameters with season performance (points) as the colour
p1 <- parameters %>%
  ggplot(aes(x = attack, y = defence, fill = total_points, colour = winner)) +
  geom_point(shape = 21, size = 3, alpha = 0.7, stroke = 2) +
  #label exceptional teams
  geom_text_repel(data = filter(parameters, winner == 1 | attack + defence > 1),
            aes(label = paste(team, season))) +
  labs(title = "Dixon Coles parameters per team per Premier League Season",
       subtitle = "league winners and exceptional teams labelled",
       x = "attacking strength",
       y = "defensive strength") +
  scale_colour_manual(values = c("blue", "red")) +
  theme_minimal()

p1

We can then use these parameters as ‘true estimates’ of how good each team was each season, and go back and simulate results from each match to work out how likely a win/lose/draw for any team was in any match. This is questionably a good idea but as I said up top, this is stream of consciousness first-guesses at answering stupid trivia questions so I’m going to go along with it.

The regista package’s augment.dixoncoles easily gives us the chance of a win/lose/draw per match based on the attacking/defensive strength of each team (see above) that season

#split the matches by season
matches <- data %>%
  select(season, home, away) %>%
  split(f = .$season)

#function to predict the results per match
predict_matches <- function(dc_fit, fixtures) {
  augment.dixoncoles(x = dc_fit, newdata = fixtures, type = "outcomes") %>% 
    unnest() %>%
    spread(outcome, prob)
}

#run the prediction function
predictions <- map2_df(fits, matches,
                       predict_matches)

head(predictions)
## # A tibble: 6 x 6
##   season home    away             away_win  draw home_win
##    <int> <chr>   <chr>               <dbl> <dbl>    <dbl>
## 1   1992 Arsenal Aston Villa         0.294 0.379    0.327
## 2   1992 Arsenal Blackburn Rovers    0.319 0.344    0.337
## 3   1992 Arsenal Chelsea             0.220 0.342    0.437
## 4   1992 Arsenal Coventry City       0.214 0.333    0.454
## 5   1992 Arsenal Crystal Palace      0.188 0.322    0.490
## 6   1992 Arsenal Everton             0.223 0.338    0.439

So e.g. based on Dixon-Coles estimates, given how well Arsenal and Aston Villa did over the whole of the 1992/1993 season, Arsenal had a 32.6% chance of beating Aston Villa at home on the opening day of the season.

We can then use these probability estimates to calculate the chance of any one team going unbeaten over the whole league (multiply out the probabilities of not losing each game)

invincible_chance <- predictions %>%
  #get match predictions per team
  pivot_longer(c("home", "away"), names_to = "location", values_to = "team") %>%
  mutate(nonloss_chance = case_when(
    location == "home" ~ 1 - away_win,
    location == "away" ~ 1 - home_win
  )) %>%
  select(season, team, nonloss_chance) %>%
  group_by(team, season) %>%
  #chance of going invincible = product sum of chance of not drawing
  summarise(invincible_chance = prod(nonloss_chance)) %>%
  arrange(-invincible_chance)

head(invincible_chance, n = 10)
## # A tibble: 10 x 3
## # Groups:   team [6]
##    team              season invincible_chance
##    <chr>              <int>             <dbl>
##  1 Chelsea             2004           0.0494 
##  2 Manchester City     2017           0.0362 
##  3 Manchester City     2018           0.0286 
##  4 Liverpool           2018           0.0232 
##  5 Arsenal             1998           0.0164 
##  6 Manchester City     2011           0.0124 
##  7 Manchester United   2007           0.0123 
##  8 Tottenham Hotspur   2016           0.00846
##  9 Arsenal             2003           0.00529
## 10 Chelsea             2009           0.00475

So it turns out that the team most likely to have gone invincible over a whole season was Chelsea in 2004/2005 (not surprising given their excellent defensive record that year), but with only a ~5% chance.

Arsenal’s actual invincible year is estimated that have had a 0.05% chance based on the team’s results (surprisingly low!). Another notable team is Tottenham Hotspur who only finished 2nd in 2016/2017 but perhaps went under the radar as a very good team that year (with a 0.08% chance of finishing unbeaten).

So we can assume* that the very best ‘unbeatable’ teams have ~5% chance of finishing a season invincible. We can use this baseline to see how hard this seems compared to the expectation a team gets 100 points.

*not really, but for this post yes

We’re going to simulate every Premier League season 1000 times and calculate the total points expected of a team based on their Dixon-Coles parameters. To narrow down the search a bit, I’m going to limit it to only exceptional teams with an attack and defence parameter > 0.25 (which gives 33 season-teams).

result_probs <- predictions %>%
  #pivoting and case_when to get result probabilities per team
  pivot_longer(c("home", "away"), names_to = "location", values_to = "team") %>%
  mutate(win = case_when(
    location == "home" ~ home_win,
    location == "away" ~ away_win
  )) %>%
  mutate(lose = 1 - draw - win) %>%
  select(season, team, win, lose, draw) %>%
  group_by(team, season) %>%
  mutate(game = 1:n()) %>%
  nest(probs = c(win, lose, draw))

#filter down to only the very best teams to save processing
selected_teams <- parameters %>%
  filter(attack > 0.25 & defence > 0.25) %>%
  select(season, team) %>%
  left_join(., result_probs, by = c("team", "season"))

sim_result <- function(probabilities) {
  chosen_results <- gather(probabilities) %>%
    sample_n(., 1, weight = value)
  result <- chosen_results$key
}

simulate_all_games <- function(data) {
  data$result <- unlist(lapply(data$probs, sim_result))
  return(data)
}

#will simulate 1000 seasons for each of these teams
n_sims <- 1000

#run simulations - will take ~10mins
simulated_results <- rerun(n_sims, simulate_all_games(selected_teams))

Calculating the total points won per season, we can work out the percentage of simulations in which each team exceed 100 points quite easily

simulated_points <- simulated_results %>%
  #for each sim, get the points won by each team
  lapply(., function(data) {
    data <- data %>%
      mutate(points = case_when(
        result == "win" ~ 3,
        result == "draw" ~ 1,
        result == "lose" ~ 0
      )) %>%
      group_by(season, team) %>%
      mutate(total_points = sum(points)) %>%
      select(season, team, total_points) %>%
      unique()
  }) %>%
  do.call(rbind, .)

#probability of reaching 100 points is no. of sims > 100 points / n_sims
centurion_probs <- simulated_points %>%
  filter(total_points > 99) %>%
  group_by(season, team) %>%
  summarise(centurion_prob = n() / n_sims) %>%
  arrange(-centurion_prob)

print(centurion_probs)
## # A tibble: 27 x 3
## # Groups:   season [16]
##    season team              centurion_prob
##     <dbl> <chr>                      <dbl>
##  1   2017 Manchester City            0.17 
##  2   2018 Manchester City            0.107
##  3   2018 Liverpool                  0.076
##  4   1994 Manchester United          0.069
##  5   2004 Chelsea                    0.046
##  6   2009 Chelsea                    0.039
##  7   2011 Manchester City            0.029
##  8   2007 Manchester United          0.023
##  9   2016 Tottenham Hotspur          0.023
## 10   2006 Manchester United          0.011
## # … with 17 more rows

As expected, the two recent Manchester City teams come top, with the one that actually did reach 100 points (2017) given a 14.5% chance of reaching that milestone, given their strength.

So now we have a baseline that the best team at accumulating points (Manchester City 2017/2018) has ~3x as much chance of winning 100 points in that season than the very best (potentially) invincible team (Chelsea 2004/2005). I.e. we have some (not super strong) evidence that it is ~3x as hard to go a season unbeaten than it is to become a ‘centurion’.

We can calculate how many points our threshold needs to be set at to have an equal chance using top_frac() on our 1000 simulations.

#find the points threshold for Man City 2017 that would reach n points 
#as often as Chelsea 2004 would go unbeaten
invincible_equivalent <- simulated_points %>%
  ungroup() %>%
  filter(season == 2017 & team == "Manchester City") %>%
  top_frac(max(invincible_chance$invincible_chance)) %>%
  arrange(total_points)

#print the lowest threshold
head(invincible_equivalent, n = 1)
## # A tibble: 1 x 3
##   season team            total_points
##    <dbl> <chr>                  <dbl>
## 1   2017 Manchester City          103

So we might presume that the equivalent achievement to going the season unbeaten is to win 103 points in the Premier League. To see how the 2017/2018 Manchester City team compare to this we can plot the expected final points total of that season (given league team strengths) in a histogram:

p2 <-  simulated_points %>%
  ungroup() %>%
  filter(season == 2017 & team == "Manchester City") %>%
  ggplot(., aes(x = total_points)) +
  geom_histogram(fill = "skyblue", alpha = 0.7) +
  #invincle equivalent achievement in red
  geom_vline(xintercept = min(invincible_equivalent$total_points),
             colour = "red", linetype = "dashed", size = 2) +
  #actual achievement in blue
  geom_vline(xintercept = filter(league_winners, season == 2017 & league_position == 1)$total_points,
             colour = "blue", linetype = "dashed", size = 2) +
  labs(title = "Man C. expected 2017/2018 performance c.f. invincible equivalent threshold",
       subtitle = "invincible equivalent achievement = 103 points, actual = 100 points",
       x = "season expected total points",
       y = paste("times achieved over", n_sims, "simulations")) +
  theme_minimal()

p2

The original question was really if this years Liverpool team might achieve this 103 point threshold (given they have now failed to go unbeaten). We can test this by doing exactly the same procedure on their season so far.

First we need to download all the match data from fbref. Handily, fbref doesn’t just gives us the goals scored per match but the expected goals each team managed to put up. We’re going to use that to model team strengths as we might assume* this is a better measure of how good a team really is. In order to fit the model using the regista package I need to supply an integer, so I’ve simply rounded those xG numbers to the nearest whole number**

*lets ignore game state and other such important thing- this is five minute football trivia **you actually can use expected goals in a regista::dixoncoles model, see here, but this is five minute football trivia

#download the match data from 2019/2020
fixtures_2020 <- "https://fbref.com/en/comps/9/schedule/Premier-League-Fixtures" %>%
  read_html() %>%
  html_nodes("#sched_ks_3232_1") %>%
  html_table() %>%
  as.data.frame() %>%
  separate(Score, into = c("hgoal", "agoal"), sep = "–") %>%
  #only care about goals and expected goals
  select(home = Home, away = Away, home_xg = xG, away_xg = xG.1, hgoal, agoal) %>%
  filter(home != "") %>%
  mutate(home = factor(home), away = factor(away)) %>%
  #round expected goals to nearest integer
  mutate_at(c("home_xg", "away_xg", "hgoal", "agoal"), .funs = funs(round(as.numeric(.))))

#matches with a known result
#used for modelling
played_matches <- fixtures_2020 %>%
  filter(!is.na(home_xg))

#matches with an unknown result
#used for simulation
unplayed_matches <- fixtures_2020 %>%
  filter(is.na(home_xg)) %>%
  select_if(negate(is.numeric))

#fit the dixon coles model
fit_2020 <- dixoncoles(home_xg, away_xg, home, away, data = played_matches)

And as before we can plot the team strength in attacking and defending dimensions

#extract parameters from the model
pars_2020 <- fit_2020$par %>%
  .[grepl("def_|off_", names(.))] %>%
  matrix(., ncol = 2) %>%
  as.data.frame() %>%
  rename(attack = V1, defence = V2)
pars_2020$team <- unique(gsub("def_*|off_*", "", names(fit_2020$par)))[1:20]

#plot as before
p3 <- pars_2020 %>%
  mutate(defence = 1 - defence) %>%
  ggplot(aes(x = attack, y = defence, colour = attack + defence, label = team)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_text_repel() +
  labs(title = "Dixon Coles parameters per team 2019/2020",
       x = "attacking strength",
       y = "defensive strength") +
  scale_colour_continuous(guide = FALSE) +
  labs(title = "Dixon Coles parameters per team for the 2019/2020 Premier League Season",
       x = "attacking strength",
       y = "defensive strength") +
  theme_minimal()

p3

Surprisingly, a distant 2nd place Manchester City actually rate higher than Liverpool using this model, and Manchester United (by all accounts having a very middling season) aren’t far off either.

Now we just need to simulate the remaining games of Liverpool’s season to see how likely they are to hit are 103 points target. We can then add the points we expect Liverpool to win to the number of points we know they already have to get an estimate of final total points.

#calculate points we know Liverpool have
liverpool_points <- played_matches %>%
  filter(home == "Liverpool" | away == "Liverpool") %>%
  mutate(points = case_when(
    hgoal == agoal ~ 1,
    home == "Liverpool" & (hgoal > agoal) ~ 3,
    away == "Liverpool" & (agoal > hgoal) ~ 3,
    TRUE ~ 0
  )) %>%
  summarise(total_points = sum(points))

#estimate the chance of results in all remaining games
unplayed_results <-
  augment.dixoncoles(fit_2020, unplayed_matches, type.predict = "outcomes") %>%
  unnest() %>%
  #filter out the liverpool ones
  filter(home == "Liverpool" | away == "Liverpool")

#function to simulate a season by making weighted samples
simulate_season <- function(result_probabilities) {
  result_probabilities %>%
    nest(outcome, prob, .key = "results") %>%
    mutate(sampled = map(results, ~ sample_n(., 1, weight = prob))) %>%
    select(-results) %>%
    unnest()
}

#simulate the rest of liverpool's season
liverpool_2020_simulated <- rerun(n_sims, simulate_season(unplayed_results)) %>%
  bind_rows(.id = "simulation_id") %>%
  #find the sampled points won per game
  mutate(points = case_when(
    home == "Liverpool" & outcome == "home_win" ~ 3,
    away == "Liverpool" & outcome == "away_win" ~ 3,
    outcome == "draw" ~ 1,
    TRUE ~ 0
  )) %>%
  group_by(simulation_id) %>%
  #calculate Liverpool's total season points for this simulation
  summarise(total_points = sum(points) + as.numeric(liverpool_points))

It’s then very easy to find the fraction of sims in which Liverpool break this 103 point challenge

length(which(liverpool_2020_simulated$total_points > 102)) / 1000
## [1] 0.157

Finally, we can plot it as before to see how many points we expect Liverpool to win this season: (this time the 103 point threshold is in blue to stand out against the red that Liverpool play in)

p4 <- liverpool_2020_simulated %>%
  ggplot(., aes(x = total_points)) +
  geom_histogram(fill = "red", alpha = 0.7) +
  #invincle equivalent achievement in red
  geom_vline(xintercept = min(invincible_equivalent$total_points),
             colour = "blue", linetype = "dashed", size = 2) +
  labs(title = "Liverpool's expected 2019/2020 performance c.f. invincible equivalent threshold",
       subtitle = "invincible equivalent achievement = 103 points in blue this time",
       x = "season expected total points",
       y = paste("times achieved over", n_sims, "simulations")) +
  theme_minimal()

p4

Anyway, that’s that for the first of these (hopefully? of many!). Did we learn anything? probably not. But did we at least do something interesting? also probably not. But I do like doing these silly little analyses in my spare time and by not limiting myself to things like rigor, I can pump them out faster. I’ll probably aim for one post (smaller than this) a week to start building a little bit of a public portfolio up again (I’m unemployed in 5 months- hire me!!). Hope you enjoyed reading it :)

Related