Ever since I started coding a few years ago, I've been interested in how data science can help predict football results, identify footballing talent, and just generally ‘solve’ football.
One of the major problems with analysing football has been the availability of data. Though there's a lot of great published work freely available to read, much of the cutting-edge analysis revolves around advanced metrics, such as expected goals (xG), for which the underlying data are hard to come by.
Over the summer StatsBomb committed to sharing free data on (amongst other competitions) the Women's Super League (the top women's competition in England), and I've been keen to look into it ever since.
This post is basically just a reproduction of two blogs by the excellent Ben Torvaney, using the Dixon-Coles method to predict the final positions of teams at the end of a football season.
The first of these, published over the summer, combines this method with xG data; the second, from this week, is a simple and fully reproducible tutorial on implementing Dixon-Coles.
(As both of those posts are pretty comprehensive, I'm going to be sparse with comments and explanations here; any questions will probably be answered by the two articles above.)
First, loading the libraries needed:
library(tidyverse)
#Ben Torvaney's soccer analysis packages
#devtools::install_github("torvaney/footballdatr")
library(footballdatr)
#devtools::install_github("torvaney/regista")
library(regista)
#women's football data
#devtools::install_github("statsbomb/StatsBombR")
library(StatsBombR)
Then we can use the StatsBombR package to download the data we need. First we grab a tibble of every match so far in the WSL season, and then use this to get another tibble of every shot from every game, which we bind to the original as a nested list column.
At the end we save the result, so we don't have to bombard the API every time we want to rerun the script.
#only want to run this once to avoid overloading the API
#save data after pulling and set chunk eval = FALSE
#get free match info from the StatsBombR package
wsl_matches <- StatsBombR::FreeCompetitions() %>%
  #only interested in WSL
  filter(competition_name == "FA Women's Super League") %>%
  #find free matches from WSL
  #(all matches played so far)
  select(competition_id) %>%
  StatsBombR::FreeMatches(.) %>%
  #only want info that helps us predict scores
  select(match_id,
         competition.competition_id,
         season.season_id,
         home = home_team.home_team_name,
         away = away_team.away_team_name,
         hgoals = home_score,
         agoals = away_score)
#get the shot information from each match and bind a tibble as a column
wsl_matches$shots <- wsl_matches %>%
  #split match info into separate rows
  split(f = 1:nrow(.)) %>%
  #get the shots per game
  lapply(., function(x) {
    StatsBombR::get.matchFree(x) %>%
      select(team = possession_team.name,
             xG = shot.statsbomb_xg) %>%
      filter(!is.na(xG)) %>%
      #join home/away information
      left_join(x %>%
                  select(home, away) %>%
                  gather(location, team),
                ., by = "team") %>%
      select(-team)
  })
#save data
#ONLY WANT TO RUN THIS CHUNK ONCE
saveRDS(wsl_matches, "saved_wsl_matches.rds")
On every rerun we load this saved data. I also got rid of the WFC/LFC suffixes from each team name, as they're a bit redundant, and converted the team name columns to factors, which will be required for the modelling later.
#load saved data
wsl_matches <- readRDS("./saved_wsl_matches.rds") %>%
  mutate(home = gsub(" WFC| LFC", "", home),
         away = gsub(" WFC| LFC", "", away)) %>%
  mutate(home = factor(home), away = factor(away))
#peek at what data we have
head(wsl_matches)
## # A tibble: 6 x 8
## match_id competition.com~ season.season_id home away hgoals agoals
## <int> <int> <int> <fct> <fct> <int> <int>
## 1 19751 37 4 West~ Chel~ 0 2
## 2 19727 37 4 Read~ Birm~ 0 1
## 3 19719 37 4 West~ Read~ 0 0
## 4 19731 37 4 West~ Yeov~ 2 1
## 5 19730 37 4 Chel~ Brig~ 2 0
## 6 19733 37 4 Birm~ Manc~ 2 3
## # ... with 1 more variable: shots <list>
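Each element of that shots list column is itself a small tibble with one row per shot. As a quick peek (my own addition, not from either of Ben's posts), the nested data for the first match can be pulled out like this, showing just the location (home or away) and xG of each shot:
#peek at the nested shot data for the first match
#(one row per shot: which side took it and its xG)
wsl_matches$shots[[1]] %>%
  head()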
These functions come pretty much verbatim from the first blog post by Ben Torvaney. We run through the shot data (the expected goals for every shot in every match) and calculate the probability of each plausible scoreline occurring in each game.
#http://www.statsandsnakeoil.com/2019/01/01/predicting-the-premier-league-with-dixon-coles/
add_if_missing <- function(data, col, fill = 0.0) {
  # Add column if not found in a dataframe
  # We need this in cases where a team has 0 shots (!)
  if (!(col %in% colnames(data))) {
    data[, col] <- fill
  }
  data
}

team_goal_probs <- function(xgs, side) {
  # Find P(Goals=G) from a set of xGs by the
  # poisson-binomial distribution
  # Use tidyeval to prefix column names with
  # the team's side ("h"ome or "a"way)
  tibble(!!str_c(side, "goals") := 0:length(xgs),
         !!str_c(side, "prob") := poisbinom::dpoisbinom(0:length(xgs), xgs))
}

simulate_game <- function(shot_xgs) {
  shot_xgs %>%
    split(.$location) %>%
    imap(~ team_goal_probs(.x$xG, .y)) %>%
    reduce(crossing) %>%
    # If there are no shots, give that team a 1.0 chance of scoring 0 goals
    add_if_missing("homegoals", 0) %>%
    add_if_missing("homeprob", 1) %>%
    add_if_missing("awaygoals", 0) %>%
    add_if_missing("awayprob", 1) %>%
    mutate(prob = homeprob * awayprob) %>%
    select(homegoals, awaygoals, prob)
}

simulated_games <- wsl_matches %>%
  mutate(simulated_probabilities = map(shots, simulate_game)) %>%
  select(match_id, home, away, simulated_probabilities) %>%
  unnest() %>%
  filter(prob > 0.001) %>% # Keep the number of rows vaguely reasonable
  rename(hgoals = homegoals, agoals = awaygoals)
simulated_games
## # A tibble: 1,291 x 6
## match_id home away hgoals agoals prob
## <int> <fct> <fct> <int> <int> <dbl>
## 1 19751 West Ham United Chelsea 0 0 0.0869
## 2 19751 West Ham United Chelsea 1 0 0.0420
## 3 19751 West Ham United Chelsea 2 0 0.00755
## 4 19751 West Ham United Chelsea 0 1 0.212
## 5 19751 West Ham United Chelsea 1 1 0.103
## 6 19751 West Ham United Chelsea 2 1 0.0184
## 7 19751 West Ham United Chelsea 3 1 0.00161
## 8 19751 West Ham United Chelsea 0 2 0.201
## 9 19751 West Ham United Chelsea 1 2 0.0969
## 10 19751 West Ham United Chelsea 2 2 0.0174
## # ... with 1,281 more rows
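To make the Poisson-binomial step a little more concrete, here's a tiny standalone illustration of my own (not from either of Ben's posts): for a hypothetical team taking three shots with xG values of 0.1, 0.3, and 0.7, the probability of scoring exactly 0, 1, 2, or 3 goals comes straight out of poisbinom::dpoisbinom().
#illustration only: goal probabilities for a hypothetical team with
#three shots of xG 0.1, 0.3, and 0.7
poisbinom::dpoisbinom(0:3, c(0.1, 0.3, 0.7))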
We can use this data to get estimates of each team’s offensive and defensive strengths using the Dixon-Coles method. From here on out I’m going to refer to “actual goals” as “accomplished” (the number of goals that in reality occurred either for or against a team so far this season) and use “expected goals” for xG.
In the resulting tibble, “off” refers to offensive strength (higher is better) and “def” to defensive weakness (i.e. higher is worse).
#ag_model
ag_model <- dixoncoles(
  hgoal = hgoals,
  agoal = agoals,
  hteam = home,
  ateam = away,
  data = factor_teams(wsl_matches, c("home", "away"))
)

#xg_model
xg_model <- dixoncoles(
  hgoal = hgoals,
  agoal = agoals,
  hteam = home,
  ateam = away,
  weights = prob,
  data = factor_teams(simulated_games, c("home", "away"))
)
#join these together
estimates <-
  inner_join(
    broom::tidy(ag_model),
    broom::tidy(xg_model),
    by = c("parameter", "team"),
    suffix = c("_accomplished", "_xg")
  ) %>%
  mutate(value_accomplished = exp(value_accomplished),
         value_xg = exp(value_xg))

# Preview results, ordered by the biggest difference
estimates %>%
  arrange(desc(abs(value_xg - value_accomplished))) %>%
  head()
## # A tibble: 6 x 4
## parameter team value_accomplished value_xg
## <chr> <chr> <dbl> <dbl>
## 1 def Brighton & Hove Albion 2.65 1.57
## 2 off Arsenal 2.75 2.03
## 3 def Yeovil Town 3.10 2.41
## 4 off Chelsea 0.788 1.42
## 5 def Liverpool 1.73 1.23
## 6 rho <NA> 1.48 0.996
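As a rough illustration of how these parameters fit together (my own addition, assuming the standard Dixon-Coles set-up in which the exponentiated attack and defence terms combine multiplicatively), the goals Arsenal's attack would be expected to score against Yeovil's defence come out as roughly the product of Arsenal's "off" and Yeovil's "def" values, ignoring the home-advantage and rho terms:
#rough sketch, not from the original posts: in the multiplicative
#Dixon-Coles set-up, attack * opposition defence gives (roughly) the
#expected goals, ignoring home advantage and the low-score rho adjustment
arsenal_off <- estimates %>%
  filter(parameter == "off", team == "Arsenal") %>%
  pull(value_accomplished)
yeovil_def <- estimates %>%
  filter(parameter == "def", team == "Yeovil Town") %>%
  pull(value_accomplished)
arsenal_off * yeovil_def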
Let’s quickly plot each team’s expected goals based and accomplished goals based strengths, faceted by offence and defence:
#plot the Dixon-Coles strengths
estimates %>%
  arrange(desc(abs(value_xg - value_accomplished))) %>%
  filter(parameter %in% c("def", "off")) %>%
  ggplot(aes(x = value_xg, y = value_accomplished)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dotted") +
  geom_text(aes(label = team), alpha = 0.7) +
  labs(title = "Dixon-Coles parameter estimates for WSL teams",
       subtitle = "Based on ...",
       x = "... expected goals",
       y = "... accomplished goals") +
  coord_equal(ratio = 1) +
  theme_minimal() +
  facet_wrap(~parameter)
So Arsenal, who have been outrageously good so far this year, have been overperforming in attack (the accomplished goals based Dixon-Coles model rates them as better than they probably are), whereas relegation candidates Yeovil have been hugely underperforming in defence.
The next step is to get a list of all unplayed fixtures and predict the probability of every scoreline in these games, using the same models as above (re-fitted in the chunk below).
From here we’re using code from the second blog post by Ben Torvaney.
#get list of unplayed fixtures
unplayed_games <-
  crossing(home = wsl_matches$home,
           away = wsl_matches$home) %>%
  filter(home != away) %>%
  anti_join(wsl_matches, by = c("home", "away"))

#don't need to reinitialise the models really
#maybe makes it a bit easier to follow
xg_model <- dixoncoles(hgoals, agoals, home, away, data = simulated_games)
ag_model <- dixoncoles(hgoals, agoals, home, away, data = wsl_matches)

#predict future scorelines using these models
unplayed_xg_scorelines <-
  broom::augment(xg_model, unplayed_games, type.predict = "scorelines") %>%
  unnest() %>%
  mutate(model = "expected goals")

unplayed_ag_scorelines <-
  broom::augment(ag_model, unplayed_games, type.predict = "scorelines") %>%
  unnest() %>%
  mutate(model = "accomplished goals")
Next we bind the data from these predictions to the data from the played games in one big tibble:
played_scorelines <-
  wsl_matches %>%
  select(home, away, hgoal = hgoals, agoal = agoals) %>%
  mutate(prob = 1.0)

scorelines <- bind_rows(
  mutate(played_scorelines, model = "expected goals"),
  mutate(played_scorelines, model = "accomplished goals"),
  unplayed_xg_scorelines,
  unplayed_ag_scorelines
)
These functions are again copied pretty much verbatim from Ben's work. They simulate the season using Monte Carlo sampling and build up a final predicted WSL table.
simulate_season <- function(scoreline_probabilities) {
  scoreline_probabilities %>%
    nest(hgoal, agoal, prob, .key = "scorelines") %>%
    mutate(sampled = map(scorelines, ~ sample_n(., 1, weight = prob))) %>%
    select(-scorelines) %>%
    unnest()
}

calculate_table <- function(games) {
  games_augmented <-
    games %>%
    mutate(
      hpoints = case_when(
        hgoal > agoal ~ 3,
        hgoal == agoal ~ 1,
        agoal > hgoal ~ 0
      ),
      apoints = case_when(
        hgoal > agoal ~ 0,
        hgoal == agoal ~ 1,
        agoal > hgoal ~ 3
      )
    )
  games_home <-
    games_augmented %>%
    select(
      team = home,
      gf = hgoal,
      ga = agoal,
      points = hpoints
    )
  games_away <-
    games_augmented %>%
    select(
      team = away,
      gf = agoal,
      ga = hgoal,
      points = apoints
    )
  bind_rows(games_home, games_away) %>%
    group_by(team) %>%
    summarise(w = sum(gf > ga),
              d = sum(gf == ga),
              l = sum(gf < ga),
              gf = sum(gf),
              ga = sum(ga),
              gd = gf - ga,
              points = sum(points)) %>%
    arrange(desc(points), desc(gd), desc(gf)) %>%
    mutate(position = row_number())
}
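As a quick illustration of how these two functions fit together (my own addition, not from Ben's post), a single simulated league table can be produced by sampling one scoreline per fixture from one of the models and passing the result to calculate_table():
#one simulated season table from the accomplished-goals scorelines
#(illustration only; the full simulation loop is below)
scorelines %>%
  filter(model == "accomplished goals") %>%
  simulate_season() %>%
  calculate_table()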
Then it’s simply a case of running these functions n times; 1000 simulations for each model strikes a nice balance between predictive power and runtime.
n_simulations <- 1000

simulated_tables <- scorelines %>%
  split(.$model) %>%
  lapply(., function(sim_data) {
    rerun(n_simulations, simulate_season(sim_data)) %>%
      map(calculate_table) %>%
      bind_rows(.id = "simulation_id") %>%
      #add which model we're using to each sim
      mutate(model = unique(sim_data$model))
  }) %>%
  do.call(rbind, .)
simulated_tables
## # A tibble: 22,000 x 11
## simulation_id team w d l gf ga gd points position
## * <chr> <fct> <int> <int> <int> <int> <int> <int> <dbl> <int>
## 1 1 Arse~ 17 0 3 81 14 67 51 1
## 2 1 Manc~ 15 4 1 67 9 58 49 2
## 3 1 Chel~ 12 7 1 33 5 28 43 3
## 4 1 Birm~ 14 1 5 35 15 20 43 4
## 5 1 Read~ 10 4 6 30 24 6 34 5
## 6 1 West~ 7 2 11 28 42 -14 23 6
## 7 1 Bris~ 7 2 11 17 36 -19 23 7
## 8 1 Brig~ 5 1 14 15 46 -31 16 8
## 9 1 Ever~ 4 3 13 15 43 -28 15 9
## 10 1 Live~ 4 2 14 13 36 -23 14 10
## # ... with 21,990 more rows, and 1 more variable: model <chr>
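Before plotting, one quick way to summarise all of these simulations (again my own addition rather than anything from Ben's posts) is the average number of points each team picks up under each model:
#average simulated final points per team under each model
simulated_tables %>%
  group_by(model, team) %>%
  summarise(mean_points = mean(points)) %>%
  arrange(model, desc(mean_points))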
Finally we can see how many points each team is predicted to achieve at the end of the season for each model:
plot_colour <- "#1a4990"
n_teams <- length(unique(wsl_matches$home))
simulated_tables %>%
  count(team, points, model) %>%
  ggplot(aes(x = points, y = reorder(team, points))) +
  geom_tile(aes(alpha = n / n_simulations),
            fill = plot_colour) +
  scale_alpha_continuous(range = c(0, 1), name = "probability") +
  labs(title = "Points total probabilities for the Women's Super League",
       subtitle = paste("n =", n_simulations, "simulations each"),
       x = "final season points",
       y = NULL) +
  theme_minimal() +
  facet_wrap(~model)
Both models show similar results (Arsenal most likely to win and Yeovil most likely to be relegated). However, at roughly the halfway point of the season there are some interesting discrepancies. For instance, the xG model is more bullish on Chelsea than the accomplished goals one, and more bearish on Bristol.
There's a lot more analysis that could be done here (which I'll get round to if I have time), but for now I'm pretty satisfied with this as a very slight synthesis of Ben's xG Dixon-Coles post and his Dixon-Coles tutorial.
As one final addendum, we can look at the probability each model gives each team of being relegated this year.
simulated_tables %>%
  group_by(team, model) %>%
  summarise(p_rel = mean(position == 11)) %>%
  filter(p_rel > 0.01) %>%
  arrange(desc(p_rel))
## # A tibble: 6 x 3
## # Groups: team [3]
## team model p_rel
## <fct> <chr> <dbl>
## 1 Yeovil Town expected goals 0.898
## 2 Yeovil Town accomplished goals 0.818
## 3 Brighton & Hove Albion accomplished goals 0.17
## 4 Brighton & Hove Albion expected goals 0.064
## 5 Everton expected goals 0.034
## 6 Everton accomplished goals 0.012
This shows broad agreement between the two models: Yeovil are in real trouble. The expected goals model might offer some relief to Brighton fans, though.
Doing the same for the eventual champion:
simulated_tables %>%
  group_by(team, model) %>%
  summarise(p_rel = mean(position == 1)) %>%
  filter(p_rel > 0.01) %>%
  arrange(desc(p_rel))
## # A tibble: 4 x 3
## # Groups: team [2]
## team model p_rel
## <fct> <chr> <dbl>
## 1 Arsenal expected goals 0.891
## 2 Arsenal accomplished goals 0.801
## 3 Manchester City accomplished goals 0.199
## 4 Manchester City expected goals 0.099
Chelsea don't even clear the 1% cut-off in the expected goals model, so it doesn't look likely they'll win the league. Under the accomplished goals model there may be a slight title race (roughly an 80% chance that Arsenal win and a 20% chance that Manchester City overtake them), but using expected goals, Arsenal should be pretty confident of winning the league this season.
That's all for now. Hopefully I'll do a lot more football analysis in the coming year(s), first expanding on this post and then looking at other things.