Predicting the 2018-19 Women's Super League Using xG and Dixon-Coles

Over the last few years since I started coding I’d always been interested in how data science could help predict football results/ identify footballing talents, and just generally ‘solve’ football.

One of the major problems with analysing football had been the availability of data. Though there’s a lot of great published stuff freely available to read, a lot of the cutting edge work revolves around advanced metrics, such as expected goals, which it’s hard to get the data for.

Over the summer StatsBomb committed to sharing free data on (amongst others) the Women’s Super League (the top women’s competition in England), and I’d been interested in looking into this since then.

This post is basically just a reproduction of two blogs by the excellent Ben Torvaney, using the Dixon-Coles method to predict the the final positions of teams at the end of a football season.

The first of these published over the summer combines this method with xG data and the second from this week is a simple and fully reproducible tutorial on implementing Dixon-Coles.

(as both of these posts are pretty comprehensive I’m going to be sparse with commenting/explaining for this post- any questions will probably be answered by the above two articles)

First, loading the libraries needed:

library(tidyverse)

#Ben Torvaney's soccer analysis packages
#devtools::install_github("torvaney/footballdatr")
library(footballdatr)
#devtools::install_github("torvaney/regista")
library(regista)

#women's football data
#devtools::install_github(statsbomb/StatsBombR)
library(StatsBombR)

Then we can use the StatsBombR package to download the data we need. First we grab a tibble of every match so far in the WSL season, and then use this to get another tibble of every shot from every game which we bind to the original as a new column.

At the end, we save this, so we don’t have to bombard the API every time we want to rerun the script

#only want to run this once to avoid overloading the API
#save data after pulling and set chunk eval = FALSE

#get free match info from the StatsBombR package
wsl_matches <- StatsBombR::FreeCompetitions() %>%
  #only interested in WSL
  filter(competition_name == "FA Women's Super League") %>%
  #find free matches from WSL
  #(all matches played so far)
  select(competition_id) %>%
  StatsBombR::FreeMatches(.) %>%
  #only want info that helps us predict scores
  select(match_id,
         competition.competition_id,
         season.season_id,
         home = home_team.home_team_name,
         away = away_team.away_team_name,
         hgoals = home_score,
         agoals = away_score)

#get the shot information from each match and bind a tibble as a column
wsl_matches$shots <- wsl_matches %>%
  #split match info into separate rows
  split(f = 1:nrow(.)) %>%
  #get the shots per game
  lapply(., function(x){
    StatsBombR::get.matchFree(x) %>%
      select(team = possession_team.name,
             xG = shot.statsbomb_xg) %>%
      filter(!is.na(xG)) %>%
      #join home/away information
      left_join(x %>% 
                  select(home, away) %>%
                  gather(location, team),
                ., by = "team") %>%
      select(-team)
  })

#save data
#ONLY WANT TO RUN THIS CHUNK ONCE
saveRDS(wsl_matches, "saved_wsl_matches.rds")

For every rerun we want to load this data again. I also got rid of the LFC/WFC suffixes from each team as it’s a bit redundant and changed the team name columns to factors, which will be required when modelling later.

#load saved data
wsl_matches <- readRDS("./saved_wsl_matches.rds") %>%
  mutate(home = gsub(" WFC| LFC", "", home),
         away = gsub(" WFC| LFC", "", away)) %>%
  mutate(home = factor(home), away = factor(away))

#peek at what data we have
head(wsl_matches)
## # A tibble: 6 x 8
##   match_id competition.com~ season.season_id home  away  hgoals agoals
##      <int>            <int>            <int> <fct> <fct>  <int>  <int>
## 1    19751               37                4 West~ Chel~      0      2
## 2    19727               37                4 Read~ Birm~      0      1
## 3    19719               37                4 West~ Read~      0      0
## 4    19731               37                4 West~ Yeov~      2      1
## 5    19730               37                4 Chel~ Brig~      2      0
## 6    19733               37                4 Birm~ Manc~      2      3
## # ... with 1 more variable: shots <list>

All of these function comes pretty much verbatim from the first blog post by Ben Torvaney. We run through the shot data (and the expected goals for every shot over every match), and find the probability of every plausible score from these matches happening.

#http://www.statsandsnakeoil.com/2019/01/01/predicting-the-premier-league-with-dixon-coles/
add_if_missing <- function(data, col, fill = 0.0) {
  # Add column if not found in a dataframe
  # We need this in cases where a team has 0 shots (!)
  if (!(col %in% colnames(data))) {
    data[, col] <- fill
  }
  data
}

team_goal_probs <- function(xgs, side) {
  # Find P(Goals=G) from a set of xGs by the
  # poisson-binomial distribution
  # Use tidyeval to prefix column names with
  # the team's side ("h"ome or "a"way)
  tibble(!!str_c(side, "goals") := 0:length(xgs),
         !!str_c(side, "prob")  := poisbinom::dpoisbinom(0:length(xgs), xgs))
}

simulate_game <- function(shot_xgs) {
  shot_xgs %>%
    split(.$location) %>%
    imap(~ team_goal_probs(.x$xG, .y)) %>%
    reduce(crossing) %>%
    # If there are no shots, give that team a 1.0 chance of scoring 0 goals
    add_if_missing("homegoals", 0) %>%
    add_if_missing("homeprob", 1) %>%
    add_if_missing("awaygoals", 0) %>%
    add_if_missing("awayprob", 1) %>%
    mutate(prob = homeprob * awayprob) %>%
    select(homegoals, awaygoals, prob)
}

simulated_games <- wsl_matches %>%
  mutate(simulated_probabilities = map(shots, simulate_game)) %>%
  select(match_id, home, away, simulated_probabilities) %>%
  unnest() %>%
  filter(prob > 0.001) %>%  # Keep the number of rows vaguely reasonable
  rename(hgoals = homegoals, agoals = awaygoals)

simulated_games
## # A tibble: 1,291 x 6
##    match_id home            away    hgoals agoals    prob
##       <int> <fct>           <fct>    <int>  <int>   <dbl>
##  1    19751 West Ham United Chelsea      0      0 0.0869 
##  2    19751 West Ham United Chelsea      1      0 0.0420 
##  3    19751 West Ham United Chelsea      2      0 0.00755
##  4    19751 West Ham United Chelsea      0      1 0.212  
##  5    19751 West Ham United Chelsea      1      1 0.103  
##  6    19751 West Ham United Chelsea      2      1 0.0184 
##  7    19751 West Ham United Chelsea      3      1 0.00161
##  8    19751 West Ham United Chelsea      0      2 0.201  
##  9    19751 West Ham United Chelsea      1      2 0.0969 
## 10    19751 West Ham United Chelsea      2      2 0.0174 
## # ... with 1,281 more rows

We can use this data to get estimates of each team’s offensive and defensive strengths using the Dixon-Coles method. From here on out I’m going to refer to “actual goals” as “accomplished” (the number of goals that in reality occurred either for or against a team so far this season) and use “expected goals” for xG.

In the final tibble, “off” refers to the offensive strength (higher = good) and “def” the “defensive weakness” (i.e. higher is worse).

#ag_model
ag_model <- dixoncoles(
  hgoal = hgoals,
  agoal = agoals,
  hteam = home,
  ateam = away,
  data  = factor_teams(wsl_matches, c("home", "away"))
)

#xg_model
xg_model <- dixoncoles(
  hgoal   = hgoals,
  agoal   = agoals,
  hteam   = home,
  ateam   = away,
  weights = prob,
  data    = factor_teams(simulated_games, c("home", "away"))
)

#join these together
estimates <-
  inner_join(
    broom::tidy(ag_model),
    broom::tidy(xg_model),
    by = c("parameter", "team"),
    suffix = c("_accomplished", "_xg")
  ) %>%
  mutate(value_accomplished = exp(value_accomplished),
         value_xg      = exp(value_xg))

# Preview results, ordered by the biggest difference
estimates %>%
  arrange(desc(abs(value_xg - value_accomplished))) %>%
  head()
## # A tibble: 6 x 4
##   parameter team                   value_accomplished value_xg
##   <chr>     <chr>                               <dbl>    <dbl>
## 1 def       Brighton & Hove Albion              2.65     1.57 
## 2 off       Arsenal                             2.75     2.03 
## 3 def       Yeovil Town                         3.10     2.41 
## 4 off       Chelsea                             0.788    1.42 
## 5 def       Liverpool                           1.73     1.23 
## 6 rho       <NA>                                1.48     0.996

Let’s quickly plot the expected and accomplished goals based strength for each team and facet by offense and defence:

#plot the Dixon-Coles strengths
estimates %>%
  arrange(desc(abs(value_xg - value_accomplished))) %>%
  filter(parameter %in% c("def", "off")) %>%
  ggplot(aes(x = value_xg, y = value_accomplished)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dotted") +
  geom_text(aes(label = team), alpha = 0.7) +
  labs(title = "Dixon-Coles parameter estimates for WSL teams",
       subtitle = "Based on ...",
       x = "... expected goals",
       y = "... accomplished goals") +
  coord_equal(ratio=1) +
  theme_minimal() +
  facet_wrap(~parameter)

So Arsenal who have been outrageously good so far this year have been overperforming in offense (accomplished goals-based Dixon-Coles assume they are better than they probably are), whereas relegation-candidate Yeovil have been hugely underperforming in defense.

The next step is then to get a list of all unplayed fixtures and predict the probabilities of each score in these using the same models as above (re-stated in the chunk).

From here we’re using code from the second blog post by Ben Torvaney.

#get list of unplayed fixtures
unplayed_games <-
  crossing(home = wsl_matches$home,
           away = wsl_matches$home) %>%
  filter(home != away) %>%
  anti_join(wsl_matches, by = c("home", "away"))

#don't need to reinitialise the models really 
#maybe makes it a bit easier to follow
xg_model <- dixoncoles(hgoals, agoals, home, away, data = simulated_games)
ag_model <- dixoncoles(hgoals, agoals, home, away, data = wsl_matches)

#predict future scorelines using these models
unplayed_xg_scorelines <-
  broom::augment(xg_model, unplayed_games, type.predict = "scorelines") %>%
  unnest() %>%
  mutate(model = "expected goals")

unplayed_ag_scorelines <-
  broom::augment(ag_model, unplayed_games, type.predict = "scorelines") %>%
  unnest() %>%
  mutate(model = "accomplished goals")

Bind the data from these predictions to the data from the played games in one big tibble

played_scorelines <-
  wsl_matches %>%
  select(home, away, hgoal = hgoals, agoal = agoals) %>%
  mutate(prob = 1.0)

scorelines <- bind_rows(
  mutate(played_scorelines, model = "expected goals"),
  mutate(played_scorelines, model = "accomplished goals"),
  unplayed_xg_scorelines,
  unplayed_ag_scorelines
)

These functions again are pretty much verbatim copied from Ben’s work. They simulate the season using Monte-Carlo sampling and build a final predicted WSL table

simulate_season <- function(scoreline_probabilities) {
  scoreline_probabilities %>%
    nest(hgoal, agoal, prob, .key = "scorelines") %>%
    mutate(sampled = map(scorelines, ~ sample_n(., 1, weight = prob))) %>%
    select(-scorelines) %>%
    unnest()
}

calculate_table <- function(games) {
  games_augmented <-
    games %>%
    mutate(
      hpoints = case_when(
        hgoal > agoal  ~ 3,
        hgoal == agoal ~ 1,
        agoal > hgoal  ~ 0
      ),
      apoints = case_when(
        hgoal > agoal  ~ 0,
        hgoal == agoal ~ 1,
        agoal > hgoal  ~ 3
      )
    )

  games_home <-
    games_augmented %>%
    select(
      team   = home,
      gf     = hgoal,
      ga     = agoal,
      points = hpoints
    )

  games_away <-
    games_augmented %>%
    select(
      team   = away,
      gf     = agoal,
      ga     = hgoal,
      points = apoints
    )

  bind_rows(games_home, games_away) %>%
    group_by(team) %>%
    summarise(w  = sum(gf > ga),
              d  = sum(gf == ga),
              l  = sum(gf < ga),
              gf = sum(gf),
              ga = sum(ga),
              gd = gf - ga,
              points = sum(points)) %>%
    arrange(desc(points), desc(gd), desc(gf)) %>%
    mutate(position = row_number())
}

Then it’s simply a case of running these function n times. 1000 for each model is a nice balance of predictive power and runtime.

n_simulations <- 1000

simulated_tables <- scorelines %>%
  split(.$model) %>%
  lapply(., function(sim_data) {
    rerun(n_simulations, simulate_season(sim_data)) %>%
      map(calculate_table) %>%
      bind_rows(.id = "simulation_id") %>%
      #add which model we're using to each sim
      mutate(model = unique(sim_data$model))
  }) %>%
  do.call(rbind, .)

simulated_tables
## # A tibble: 22,000 x 11
##    simulation_id team      w     d     l    gf    ga    gd points position
##  * <chr>         <fct> <int> <int> <int> <int> <int> <int>  <dbl>    <int>
##  1 1             Arse~    17     0     3    81    14    67     51        1
##  2 1             Manc~    15     4     1    67     9    58     49        2
##  3 1             Chel~    12     7     1    33     5    28     43        3
##  4 1             Birm~    14     1     5    35    15    20     43        4
##  5 1             Read~    10     4     6    30    24     6     34        5
##  6 1             West~     7     2    11    28    42   -14     23        6
##  7 1             Bris~     7     2    11    17    36   -19     23        7
##  8 1             Brig~     5     1    14    15    46   -31     16        8
##  9 1             Ever~     4     3    13    15    43   -28     15        9
## 10 1             Live~     4     2    14    13    36   -23     14       10
## # ... with 21,990 more rows, and 1 more variable: model <chr>

Finally we can see how many points each team is predicted to achieve at the end of the season for each model:

plot_colour <- "#1a4990"
n_teams <- length(unique(wsl_matches$home))

simulated_tables %>%
  count(team, points, model) %>%
  ggplot(aes(x = points, y = reorder(team, points))) +
  geom_tile(aes(alpha = n / n_simulations),
            fill = plot_colour) +
  scale_alpha_continuous(range = c(0, 1), name = "probability") +
  labs(title = "Points total probabilities for the Women's Super League",
       subtitle = paste("n =", n_simulations, "simulations each"),
       x = "final season points",
       y = NULL) +
  theme_minimal() +
  facet_wrap(~model)

Both models show similar results (most likely Arsenal winning and Yeovil relegated). However, at roughly the season halfway point there are some interesting discrepancies. For instance the xG model is more bullish on Chelsea than the accomplished goals one, and more bearish on Bristol.

There’s a lot of analysis that can be done here (and if I have time will get round to looking at) but for now I’m pretty satisfied with this as just a very slight synthesis of xG Dixon-Coles and Dixon-Coles tutorial.

As one final addendum we can look at the probabilities each model assumes for the team to be relegated this year.

simulated_tables %>%
  group_by(team, model) %>%
  summarise(p_rel = mean(position == 11)) %>%
  filter(p_rel > 0.01) %>%
  arrange(desc(p_rel))
## # A tibble: 6 x 3
## # Groups:   team [3]
##   team                   model              p_rel
##   <fct>                  <chr>              <dbl>
## 1 Yeovil Town            expected goals     0.898
## 2 Yeovil Town            accomplished goals 0.818
## 3 Brighton & Hove Albion accomplished goals 0.17 
## 4 Brighton & Hove Albion expected goals     0.064
## 5 Everton                expected goals     0.034
## 6 Everton                accomplished goals 0.012

Which shows broad agreement between the two models. Yeovil are in real trouble. The expected goals model might give some relief to Brighton fans though.

Doing the same for the eventual champion:

simulated_tables %>%
  group_by(team, model) %>%
  summarise(p_rel = mean(position == 1)) %>%
  filter(p_rel > 0.01) %>%
  arrange(desc(p_rel))
## # A tibble: 4 x 3
## # Groups:   team [2]
##   team            model              p_rel
##   <fct>           <chr>              <dbl>
## 1 Arsenal         expected goals     0.891
## 2 Arsenal         accomplished goals 0.801
## 3 Manchester City accomplished goals 0.199
## 4 Manchester City expected goals     0.099

Chelsea only sneak into the mix for the expected goals model but it doesn’t look likely they’ll win the league. In the accomplished goals model there may be a slight title race (3/4 that Arsenal win, 1/4 that Man City overtake them), but using expected goals, Arsenal should be pretty confident of winning the league this season.

That’s all for now. Will hopefully do a lot more football analysis in the coming year(s) first to expand on this post then look at other stuff.

Related