An Introduction to Modelling Soccer Matches in R (part 1)

For anyone watching football, being able to predict matches is a key aspect of the hobby. Whether explicitly (e.g. when betting on matches, or deciding on recruitment for an upcoming season), or more implicitly when discussing favourites to win the league in the pub, almost all discussion of the sport on some level require predictions about some set of upcoming games.

The first step of prediction is some form of quantification of ability. We’d expect a better team to have a better chance of winning than a worse team. For an example of a more sophisticated set of rankings, see fivethirtyeight’s Soccer Power Index which is explicitly used to predict the results of various football competitions.

The accuracy of our predictions therefore relies on the accuracy of our judgement on team’s ability. When discussing football with friends, we might use half-remembered match highlights to form some impression of how strong a team is. When programming however, we have free access to the results of teams thus far in a campaign and should be able to produce a model more grounded in truth.

Two seminal papers for using recent football results to assess the abilities of football teams (and then use this assessment to predict matches) are Maher’s 1982 paper on modelling football scores, which is complimented by Mark Dixon and Stuart Coles’ 1997 paper. For R various packages to use the methods outlined in these papers exist including Ben Torvaney’s regista, opisthokonta’s goalmodel1, and Eli Holmes’ fbRanks.

However, the overlap between people obsessed enough with football to read mathematical papers on the sport, and those with the formal training in reading math notation to understand these models is fairly low, and I wasn’t able to find2 a good intuitive explanation for these models. Hopefully, building up these models from the most basic entry steps to a fully sophisticated model for predicting football matches might help some who want to start modelling football but don’t have the privilege of formal stats/modelling/coding training. As I want to start from pretty much zero, in this first post I make at least one or two claims that are not strictly true (indeed, this post does not actually implement some of the main points of the 1997 Dixon & Coles paper), but will try to point these out as I go, and correct them in later posts.

First, let’s load libraries and also set a seed for the reproducibility of this document

# munging
library(tidyverse)

# seed for reproducibility
set.seed(3459)

##Set up

In reality, we’d probably want to model a whole league or cup. However, these can generally contain 20+ teams, many of which will have similar abilities. For simplicity here, lets instead imagine a summer league between 6 English football clubs where each team plays each other twice (once at home and once away)

teams <- c("Arsenal", # 5th in the 1st tier
"Blackburn_Rovers", # 15th in 2nd tier
"Coventry_City", # 8th in 3rd tier
"Dover_Athletic", # 14th 5th tier
"Enfield_Town", # 10th in 7th tier
"Frimley_Green") # 2nd in 9th tier

We’ve managed to arrange a league that has a nice stratification between teams, so we’d expect each to be comfortably better than the next best (which will make sanity checking our results easier). Lucky for us, the teams are also in alphabetical order of strength so in case you don’t have any prior on a team, take the first letter of it’s name (A-F).

Each week each team play one game, so we’ll have a fixture list that looks like:

head(fixtures, 8)
##               home             away gameweek
## 1    Frimley_Green          Arsenal        1
## 2     Enfield_Town Blackburn_Rovers        1
## 3   Dover_Athletic    Coventry_City        1
## 4          Arsenal     Enfield_Town        2
## 5    Frimley_Green   Dover_Athletic        2
## 6 Blackburn_Rovers    Coventry_City        2
## 7   Dover_Athletic          Arsenal        3
## 8    Coventry_City     Enfield_Town        3

Obviously for this we’re going to have to make up our data. For the code used to generate it, see the bottom of the post.

Let’s say that we’ve had 8 weeks of games played so far, and the results have been as follows

head(results,8)
##               home             away hgoal agoal gameweek
## 1   Dover_Athletic    Coventry_City     0     3        1
## 2     Enfield_Town Blackburn_Rovers     0     3        1
## 3    Frimley_Green          Arsenal     0     8        1
## 4          Arsenal     Enfield_Town     5     0        2
## 5 Blackburn_Rovers    Coventry_City     1     1        2
## 6    Frimley_Green   Dover_Athletic     1     2        2
## 7 Blackburn_Rovers    Frimley_Green     6     0        3
## 8    Coventry_City     Enfield_Town     2     1        3

A better way to show this is to generate a matrix of home (y axis) vs. away (x axis) and show the goals scored in each match between them:

p1 <- results %>%
# remove unplayed games
filter(!is.na(hgoal)) %>%
ggplot(., aes(x = away, y = home, fill = hgoal-agoal)) +
geom_tile() +
geom_label(aes(label = paste(hgoal, agoal, sep = "-")), fill = "white") +
# colour where green shows home win and red an away win
scale_fill_gradient2(low = "darkred", high = "green", midpoint = 0, guide = FALSE) +
scale_x_discrete(limits = levels(results$home), position = "top") + scale_y_discrete(limits = rev(levels(results$away))) +
theme_minimal()

# plot
p1

As the colour gradient (from bottom right to top left) shows, the teams we’d expect to do better are. Given the stochastic nature of football though, there are some surprises. E.g. Blackburn only managing to draw at home to Coventry.

A good sense of teams relative abilities can be seen in the league table of results so far (assuming 3 points for a win, and 1 for a draw):

# function to melt results
# returns df with team and goals for and against for each match
melt_results <- function(results_df) {
results_df %>%
# select only relevant columns
select(home, away, hgoal, agoal) %>%
gather(location, team,  -hgoal, -agoal) %>%
# calculate goals for/against the team
mutate(g_for = case_when(
location == "home" ~ hgoal,
location == "away" ~ agoal
)) %>%
mutate(g_ag = case_when(
location == "home" ~ agoal,
location == "away" ~ hgoal
))
}

# function to calculate points won and gd for each team
results_to_table <- function(results_df) {
results_df %>%
# use above melting function
melt_results(.) %>%
# 3 points for a win, 1 for a draw
mutate(points = case_when(
g_for > g_ag ~ 3,
g_ag > g_for ~ 0,
g_for == g_ag ~ 1
)) %>%
# calculate goal difference for each match
mutate(gd = g_for - g_ag) %>%
group_by(team) %>%
# get the final statistics per team
summarise(games_played = n(),
gf = sum(g_for),
ga = sum(g_ag),
gd = sum(gd),
points = sum(points)) %>%
arrange(-points, -gd, -gf)
}

# calculate league table for our played fixtures
league_table <- results  %>%
filter(!is.na(hgoal)) %>%
select(-gameweek) %>%
results_to_table(.) %>%
print()
## # A tibble: 6 x 6
##   team             games_played    gf    ga    gd points
##   <chr>                   <int> <dbl> <dbl> <dbl>  <dbl>
## 1 Arsenal                     8    39     4    35     24
## 2 Blackburn_Rovers            8    23     6    17     19
## 3 Coventry_City               8    14     8     6     16
## 4 Dover_Athletic              8     8    15    -7      9
## 5 Enfield_Town                8     6    22   -16      3
## 6 Frimley_Green               8     2    37   -35      0

Where teams positions are nicely rank ordered (the data for this example is fairly curated so it’s not that surprising).

##Predictions

With two rounds to go, there’s still 6 fixtures we might want to predict (to try and judge which team will end up where, or just to bet on the remaining games).

This are:

# get the yet to be played matches
unplayed_games <- fixtures %>%
filter(gameweek > 8) %>%
print()
##               home             away gameweek
## 1    Coventry_City          Arsenal        9
## 2 Blackburn_Rovers   Dover_Athletic        9
## 3    Frimley_Green     Enfield_Town        9
## 4          Arsenal Blackburn_Rovers       10
## 5    Coventry_City    Frimley_Green       10
## 6   Dover_Athletic     Enfield_Town       10

If we want to predict these results, we need to have data on the strength of the teams above, but also, a good prior on what sort of scores we should expect.

Using real data from the engsoccerdata package we can get the results of all 48840 English football league games between August 1992 and May 2016. If we melt this to get the goals scored by each team by their location we get a data.frame of 97680 records of a teams performance in a game:

# load real data from the english league
real_data <- engsoccerdata::england %>%
# filter out 'premier league era' matches
filter(Season > 1991) %>%
# select only relevant columns
select(home, away = visitor, hgoal, agoal = vgoal) %>%
# munge
melt_results() %>%
select(-hgoal, -agoal) %>%
mutate(data = "real")

head(real_data)
##   location           team g_for g_ag data
## 1     home        Arsenal     2    4 real
## 2     home        Chelsea     1    1 real
## 3     home  Coventry City     2    1 real
## 4     home Crystal Palace     3    3 real
## 5     home        Everton     1    1 real
## 6     home   Ipswich Town     1    1 real

Here every row shows a team that played a match (as it’s sorted by league then alphabetically, the first 6 records are all for Arsenal). It also shows if the team played home or away. The data also shows the goals scored by (e.g.) Arsenal in g_for, and the goals they conceded in g_ag.

If we plot the goals scored for each game, we get a nice humped distribution with slightly offset peaks for home and away. That is to say, in most games teams will score 0, 1, or 2 goals, and that scoring more than 6 goals in a match is incredibly rare. The difference between the home and away distributions mean that teams are slightly more likely to score more if playing at home, compared to play away from home.

# plot goals scored home/away for real english football matches
p2 <- real_data %>%
ggplot(., aes(x = g_for, fill = location)) +
# smooth densities
geom_density(adjust = 8, alpha = 0.5) +
scale_fill_manual(values = c("red", "blue")) +
scale_x_continuous(breaks = 0:6) +
labs(title = "Goals scored at home and away in English football",
subtitle = "data from 48.8k matches 1992-2016",
x = "goals scored",
y = "density") +
theme_minimal()

# plot
p2

We can work out what the average difference between playing at home and away is by taking the means of goals scored at home, and when playing away:

# calculate mean home and away goals
real_data_means <- real_data %>%
group_by(location) %>%
summarise(mean_scored = mean(g_for)) %>%
print()
## # A tibble: 2 x 2
##   location mean_scored
##   <chr>          <dbl>
## 1 away            1.12
## 2 home            1.47

Goals in games are both relatively sparse, and relatively stochastic; football is a low scoring game where goals are evenly distributed throughout the game. In theory any attack made by a team i has a probability of being scored dependent upon the strength of team i’s attack (αi) which is independent of all the other attacks that team has made.

(there is some reason to doubt this may be the case3, but for now this is a fine generalisation)

By grouping all teams together into “home” and “away” categories (in a league setting each team will play each other home and away so this should average out) and taking the average number of goals scored per match as the Poisson mean (λ) we can see how well our above graph fits a simulated Poisson process.

# generate Poisson distributed vector with mean = real world mean
simulated_poisson <- real_data_means %>%
split(f = .$location) %>% lapply(., function(x) df = data.frame(dist = rpois(100000, x$mean_scored),
location = x$location)) %>% # map it all together and label map_df(I) %>% mutate(data = "simulated") # add these distributions to the plot p2 + geom_density(data = simulated_poisson, aes(x = dist), fill = NA, adjust = 8, alpha = 0.2) + scale_fill_manual(values = c("red", "blue"), guide = FALSE) + facet_wrap(~location) It’s not perfect, but it’s not a bad fit either. In actuality, a Chi-squared test will show that goals scored does not follow a Poisson distribution given the number of matches we have as data. But for the sake of this post, put that out of mind. If we think that goals scored represents some Poisson process, it can be modeled using the equation which underlies the Poisson distribution. For a given interval (one match), the probability of x events (goals scored) in that interval will be: $P(x) = \frac{\lambda^{x}e^{-\lambda}}{x!}$ The simplest model we can produce is to estimate λ as each team’s attack rating (henceforth αi) which is equal to observed mean rate of goals for that team. That is the say the probability of team i scoring x goals against team j is: $P(x_{i,j} = x) = \frac{\alpha_{i}^{x}e^{-\alpha_{i}}}{x!}$ where αi is the sum of all goals scored divided by the total number of matches: $\alpha_{i} = \frac{1}{N}\sum_{n=1}^{N} x$ grouping by teams makes this easy to calculate: basic_model <- results %>% melt_results() %>% group_by(team) %>% # we'll use the goals scored to model the attack # and goals conceeded to measure defence rating summarise(alpha = mean(g_for), beta = mean(g_ag)) %>% print() ## # A tibble: 6 x 3 ## team alpha beta ## <chr> <dbl> <dbl> ## 1 Arsenal 4.88 0.5 ## 2 Blackburn_Rovers 2.88 0.75 ## 3 Coventry_City 1.75 1 ## 4 Dover_Athletic 1 1.88 ## 5 Enfield_Town 0.75 2.75 ## 6 Frimley_Green 0.25 4.62 (we’ll come on to the beta parameter in a bit- where alpha is the average scoring rate, beta is the average conceding rate). If we take Coventry’s remaining two games as examples we can see that they are yet to play Arsenal and Frimley Green at home coventry_games <- unplayed_games %>% # filter out Coventry City's remaining fixtures filter(grepl("Coventry_City", home)) %>% print() ## home away gameweek ## 1 Coventry_City Arsenal 9 ## 2 Coventry_City Frimley_Green 10 And we can take the attack rating (α) of each team and use it to estimate the results # get the attack ratings of all teams team_alphas <- basic_model$alpha %>% names<-(basic_model$team) # assume goals scored for each team will be it's attack rating e_results <- paste(team_alphas[coventry_games$home],
team_alphas[coventry_games$away], sep = "-") %>% # name each match with the teams competing names<-(c(paste(coventry_games$home, coventry_games$away, sep = "-"))) %>% print() ## Coventry_City-Arsenal Coventry_City-Frimley_Green ## "1.75-4.875" "1.75-0.25" These aren’t ridiculous estimates by any stretch but it’s clear something is up. It’s pretty intuitive that Coventry City would be expected to score more goals at home to Frimley Green than at home to Arsenal. We can account for this by introducing an opposing team defence parameter βj. In our very simple model this will be estimating by taking the average rate a team concedes goals. As with the attack rating, this is the calculated as the sum of all goals conceded divided by number of matches. We’ll then multiply αi and βj together to get the score estimate: # get and name the defence rating for each team team_betas <- basic_model$beta %>% names<-(basic_model$team) # assume the goals scored will be the attack rating of the team times # the defence rating of it's opponent e_results <- paste(round(team_alphas[coventry_games$home]*
team_betas[coventry_games$away], 3), round(team_alphas[coventry_games$away]*
team_betas[coventry_games$home], 3), sep = "-") %>% names<-(c(paste(coventry_games$home, coventry_games$away, sep = "-"))) %>% print() ## Coventry_City-Arsenal Coventry_City-Frimley_Green ## "0.875-4.875" "8.094-0.25" The opposition scores remain the same because Coventry have on average conceded 1 goal per game. Coventry’s predicted goals though has diverged with them now predicted to score less than a goal against Arsenal and to score 8(!) against Frimley Green, both of which sound reasonable (when you consider that Frimley Green are a team of amateurs). However, we’re also missing one final piece of the model we’ll finish with today. Recall modelling the English football data from 1992 onwards, we were left with a difference between the home scoring rate and the away scoring rate. # reprint what we calculated earlier real_data_means ## # A tibble: 2 x 2 ## location mean_scored ## <chr> <dbl> ## 1 away 1.12 ## 2 home 1.47 It’s pretty common knowledge that football teams do better at home, so we’ll want to factor that in. A simple estimate is to divide the mean home goals/game by the mean away goals/game. We’ll call this parameter γ and can be formalised as the sum of home goals (which we’ll refer to as x from now on) divided by the sum of away goals (y) $\gamma = \frac{\sum{x}}{\sum{y}}$ # the home advantage is how much easier it is to score at home home_advantage_gamma <- sum(results$hgoal) / sum(results$agoal) e_results <- paste(round(team_alphas[coventry_games$home]*
team_betas[coventry_games$away] * # add in home advantage for home team home_advantage_gamma, 3), round(team_alphas[coventry_games$away]*
team_betas[coventry_games$home], 3), sep = "-") %>% names<-(c(paste(coventry_games$home, coventry_games$away, sep = "-"))) %>% print() ## Coventry_City-Arsenal Coventry_City-Frimley_Green ## "0.955-4.875" "8.83-0.25" Which tilts the scales a little towards Coventry’s favour but (as we’d expect- home advantage can only go so far) doesn’t affect the results too much. Now we have a method to predict matches, we can use this on the remaining 6 nice and easily: # simplify to just gamma gamma <- home_advantage_gamma # wrap the above into a function for home and away teams predict_results <- function(home, away, parameters) { e_goals_home <- parameters$alpha[home]*parameters$beta[away] * gamma e_goals_away <- parameters$alpha[away]*parameters$beta[home] # output a df of expected goals for home and away teams df <- data.frame(home = home, away = away, e_hgoal = e_goals_home, e_agoal = e_goals_away) return(df) } # convert the basic_model df into a list with$attack and $defence parameters # for each team basic_parameters <- basic_model %>% # rename scored/conceeded to attack/defence select(-team) %>% # convert to a list and name each element as.list() %>% lapply(., function(x){names(x) <- teams;return(x)}) # predict results using the function defined above and the list of parameters # could use e.g. mapply here but I prefer the map2 grammar # run the predict results function over each game consisting of$home and $away predicted_fixtures <- map2_df(unplayed_games$home, unplayed_games$away, predict_results, # parameters forms an extra argument that does not vary basic_parameters) %>% # round the outputs mutate_if(is.numeric, round, digits = 2) %>% print() ## home away e_hgoal e_agoal ## 1 Coventry_City Arsenal 0.95 4.88 ## 2 Blackburn_Rovers Dover_Athletic 5.88 0.75 ## 3 Frimley_Green Enfield_Town 0.75 3.47 ## 4 Arsenal Blackburn_Rovers 3.99 1.44 ## 5 Coventry_City Frimley_Green 8.83 0.25 ## 6 Dover_Athletic Enfield_Town 3.00 1.41 All of which look reasonable, if maybe a little bullish on the ‘better’ teams prospects. However, while this is good for back of the envelope predictions, we know that this is a very basic model. If we want to improve it, first we must quantify how good it is. In order to do this we can use the results we have from the first 8 weeks of matches as training data. We know what the ‘correct’ scores are for these matches, so if our model is good, it will predict similar scores to those observed. Remember that for the Poisson distribution, the probability of x goals in one match is $P(x) = \frac{\lambda^{x}e^{-\lambda}}{x!}$ The expected value of the Poisson distribution is equal to λ, so we can plug λ as our predicted goals, and x as the actual goals, and calculate the probability of that results occurring given the attack/defence/home advantage parameters that we think are correct. We then do this for all the matches played and get the likelihood for the home and away teams scores given the model: # 'predict' the already played matches using our function predicted_results <- map2_df(results$home, results$away, predict_results, basic_parameters) %>% mutate_if(is.numeric, round, digits = 2) %>% print() ## home away e_hgoal e_agoal ## 1 Dover_Athletic Coventry_City 1.09 3.28 ## 2 Enfield_Town Blackburn_Rovers 0.61 7.91 ## 3 Frimley_Green Arsenal 0.14 22.55 ## 4 Arsenal Enfield_Town 14.62 0.38 ## 5 Blackburn_Rovers Coventry_City 3.14 1.31 ## 6 Frimley_Green Dover_Athletic 0.51 4.62 ## 7 Blackburn_Rovers Frimley_Green 14.51 0.19 ## 8 Coventry_City Enfield_Town 5.25 0.75 ## 9 Dover_Athletic Arsenal 0.55 9.14 ## 10 Arsenal Coventry_City 5.32 0.88 ## 11 Dover_Athletic Blackburn_Rovers 0.82 5.39 ## 12 Enfield_Town Frimley_Green 3.78 0.69 ## 13 Blackburn_Rovers Arsenal 1.57 3.66 ## 14 Enfield_Town Dover_Athletic 1.53 2.75 ## 15 Frimley_Green Coventry_City 0.27 8.09 ## 16 Arsenal Frimley_Green 24.60 0.12 ## 17 Blackburn_Rovers Enfield_Town 8.62 0.56 ## 18 Coventry_City Dover_Athletic 3.58 1.00 ## 19 Coventry_City Blackburn_Rovers 1.43 2.88 ## 20 Dover_Athletic Frimley_Green 5.05 0.47 ## 21 Enfield_Town Arsenal 0.41 13.41 ## 22 Arsenal Dover_Athletic 9.97 0.50 ## 23 Enfield_Town Coventry_City 0.82 4.81 ## 24 Frimley_Green Blackburn_Rovers 0.20 13.30 # calculate the likelihood of each home/away team actually scoring that many goals # given the parameters for attack/defence supplied likelihoods <- data.frame(lik_hgoal = dpois(results$hgoal,
predicted_results$e_hgoal), lik_agoal = dpois(results$agoal,
predicted_results$e_agoal)) %>% # round the probabilities mutate_all(round, 4) %>% # bind likelihoods to results cbind(results, . ) %>% # bind in predictions left_join(., predicted_results, by = c("home", "away")) %>% # select useful parameters select(home, away, hgoal, e_hgoal, lik_hgoal, agoal, e_agoal, lik_agoal) %>% print() ## home away hgoal e_hgoal lik_hgoal agoal e_agoal ## 1 Dover_Athletic Coventry_City 0 1.09 0.3362 3 3.28 ## 2 Enfield_Town Blackburn_Rovers 0 0.61 0.5434 3 7.91 ## 3 Frimley_Green Arsenal 0 0.14 0.8694 8 22.55 ## 4 Arsenal Enfield_Town 5 14.62 0.0025 0 0.38 ## 5 Blackburn_Rovers Coventry_City 1 3.14 0.1359 1 1.31 ## 6 Frimley_Green Dover_Athletic 1 0.51 0.3063 2 4.62 ## 7 Blackburn_Rovers Frimley_Green 6 14.51 0.0065 0 0.19 ## 8 Coventry_City Enfield_Town 2 5.25 0.0723 1 0.75 ## 9 Dover_Athletic Arsenal 1 0.55 0.3173 3 9.14 ## 10 Arsenal Coventry_City 3 5.32 0.1228 1 0.88 ## 11 Dover_Athletic Blackburn_Rovers 1 0.82 0.3612 2 5.39 ## 12 Enfield_Town Frimley_Green 1 3.78 0.0863 0 0.69 ## 13 Blackburn_Rovers Arsenal 0 1.57 0.2080 2 3.66 ## 14 Enfield_Town Dover_Athletic 1 1.53 0.3313 2 2.75 ## 15 Frimley_Green Coventry_City 0 0.27 0.7634 3 8.09 ## 16 Arsenal Frimley_Green 10 24.60 0.0005 0 0.12 ## 17 Blackburn_Rovers Enfield_Town 4 8.62 0.0415 0 0.56 ## 18 Coventry_City Dover_Athletic 1 3.58 0.0998 0 1.00 ## 19 Coventry_City Blackburn_Rovers 1 1.43 0.3422 2 2.88 ## 20 Dover_Athletic Frimley_Green 2 5.05 0.0817 0 0.47 ## 21 Enfield_Town Arsenal 2 0.41 0.0558 4 13.41 ## 22 Arsenal Dover_Athletic 4 9.97 0.0193 0 0.50 ## 23 Enfield_Town Coventry_City 1 0.82 0.3612 2 4.81 ## 24 Frimley_Green Blackburn_Rovers 1 0.20 0.1637 5 13.30 ## lik_agoal ## 1 0.2213 ## 2 0.0303 ## 3 0.0003 ## 4 0.6839 ## 5 0.3535 ## 6 0.1052 ## 7 0.8270 ## 8 0.3543 ## 9 0.0137 ## 10 0.3650 ## 11 0.0663 ## 12 0.5016 ## 13 0.1724 ## 14 0.2417 ## 15 0.0271 ## 16 0.8869 ## 17 0.5712 ## 18 0.3679 ## 19 0.2328 ## 20 0.6250 ## 21 0.0020 ## 22 0.6065 ## 23 0.0943 ## 24 0.0058 If we sum the log of those likelihood values we get a measure of how wrong overall our predictions are: log_likehood <- sum(log(likelihoods$lik_hgoal), log(likelihoods$lik_agoal)) * -1 log_likehood ## [1] 105.995 (n.b. there will be some rounding errors- especially on the pre-log probabilities, but this will suffice for now) To get an idea of whether or not this is good, let’s quickly run the model with all the parameters set to zero. Given that we’re pretty sure that at least Arsenal will be a lot better than Frimley Green, this model should do worse than our basic model above. If it indeed does fit the results worse we will get a greater error term- the log likelihood sum # do the same but set each teams attack and defence to 1 # expect model to be worse as assumes all teams are equal equal_parameters <- list( alpha = rep(1, length(teams)) %>% names<-(teams), beta = rep(1, length(teams)) %>% names<-(teams) ) # predict results and munge through to find sum of log likelihoods worse_log_likelihood <- map2_df(results$home, results$away, predict_results, equal_parameters) %>% mutate_if(is.numeric, round, digits = 2) %>% # take the log probability straight away this time mutate(lik_hgoal = dpois(results$hgoal, e_hgoal, log = TRUE),
lik_agoal = dpois(results$agoal, e_agoal, log = TRUE)) %>% select(lik_hgoal, lik_agoal) %>% map_dbl(sum) %>% sum(.) * -1 worse_log_likelihood ## [1] 112.618 The worse log likelihood (112.6) is worse (only a bit though) than the 106.0 we previously. This suggests that either the teams are actually quite equal, or that our basic model wasn’t all that good. Parameter Optimisation There will exist some parameters (α and β for each team, and γ for the home field advantage) that will minimise this negative log likelihood. That is to say, they will predict the results of the already played games most accurately. If we want to find those we can use the optim() function in the stats package. This will take a vector of parameters and iterate while slightly changing their values until it gets the lowest value it can find as the output for a supplied function. It also takes a data.frame of results between teams. The results of these games are predicted and then checked against this actually observed data. At the end, I’ve also set the function to pass some information from each iteration into the global environment, namely, the iteration number (i), the parameter values the optim() function has chosen for this iteration, and the negative log likelihood of those parameters- the likelihood of the observed scores if those parameters are correct. optimise_params <- function(parameters, results) { # form the parameters back into a list # parameters names alpha (attack), beta (defense), and gamma (hfa) param_list <- relist_params(parameters) # predict the expected results for the games that have been played e_results <- map2_df(results$home, results$away, predict_results, param_list) # calculate the negative log likelihood of those predictions # given the parameters how likely are those scores neg_log_likelihood <- calculate_log_likelihood(results, e_results) # capture the parameters and likelihood at each loop # only do it if i is initialised if(exists("i")) { i <<- i + 1 current_parameters[[i]] <<- parameters current_nll[[i]] <<- neg_log_likelihood } # return the value to be minimised # in this case the negative log likelihood return(neg_log_likelihood) } The three separate functions are coded out separately so we can tinker with them shortly: 1. to predict our results we have been supplying a list of two elements: alpha and beta, each of which are numeric vectors. optim() can only take one vector to optimise over but we can trick it by supplying unlist(list_of_parameters). If we do this we then first want to convert this unlisted numeric vector back into our two element list* *it isn’t vital to have the parameters arranged like this, but I think it leads to neater indexing when predicting the results 1. we then need to use these parameters to predict the results of past games. For each home and away team in a data.frame of results we can predict the expected home and expected away goals. These are then bound into a data.frame of home and away teams and these predicted goals for each 2. finally, we need to calculate the negative log likelihood by calculating the log probability of the observed goals given the predicted goals and summing these. We then multiply this by -1 as the sum of the log probabilities will be negative and we want to minimise this number as close to zero as possible. The transformation of prod(neg_log_likelihood, -1) is a quick hack for this4 Hopefully this is at least bearable to follow. Formalised, this can be written for teams i and matches k as: $\mathcal L(\alpha_{i},\beta_{i},\gamma;i = 1 ... n) = \prod_{k = 1}^{K}{\frac{\lambda_{k}^{x_{k}}e^{-\lambda_{k}}}{x_{k}!}\frac{\mu_{k}^{y_{k}}e^{-\mu_{k}}}{y_{k}!}}$ where for match k and teams i and j, home goals, x is defined by $x_{k} \sim Poisson(\lambda_{k} = \alpha_{i(k)}\beta_{j(k)}\gamma)$ and away goals, y $y_{k} \sim Poisson(\mu_{k} = \alpha_{j(k)}\beta_{i(k)})$ which seems daunting when you write it down, but we’ve already covered everything we need to do solve it. It’s just saying we want to minimise the result of the multiplication (the sum of logs in our case above) of the probability of scoring x and y goals in a game. The probability of goals scored assumed to be Poisson distributed, controlled by parameters α, β, and γ for home and away teams. # optim requires parameters to be supplied as a vector # we'll unlist the parameters then relist in the function relist_params <- function(parameters) { parameter_list <- list( # alpha = attack rating alpha = parameters %>% .[grepl("alpha", names(.))] %>% names<-(teams), # beta = defence rating beta = parameters %>% .[grepl("beta", names(.))] %>% names<-(teams), # gamma = home field advantage gamma = parameters["gamma"] ) return(parameter_list) } # use these parameters to predict results for supplied matches predict_results <- function(home, away, param_list) { # expected home goals e_goals_home <- param_list$alpha[home] * param_list$beta[away] * param_list$gamma
# expected away goals
e_goals_away <- (param_list$alpha[away] * param_list$beta[home])

# bind to df
df <- data.frame(home = home, away = away,
e_hgoal = e_goals_home, e_agoal = e_goals_away)

return(df)
}

# calculate the log likelihood of predict results vs supplied results
calculate_log_likelihood <- function(results, e_results) {
home_likelihoods = dpois(results$hgoal, lambda = e_results$e_hgoal, log = TRUE)
away_likelihoods = dpois(results$agoal, lambda = e_results$e_agoal, log = TRUE)

# sum log likelihood and multiply by -1 so we're minimising neg log likelihood
likelihood_sum <- sum(home_likelihoods, away_likelihoods)
neg_log_likelihood <- prod(likelihood_sum, -1)

return(neg_log_likelihood)
}

We’ll supply parameters that are all equal to 1 to optim to stop it falling into local minima that might affect the ‘optimal’ parameters it finds. The unlisted parameters are then supplied to optim along with the optimise_parameters() function.

# start with all parameters equal
equal_parameters <- list(
alpha = rep(1, length(teams)) %>% names<-(teams),
beta = rep(1, length(teams)) %>% names<-(teams),
gamma = 1
)

# run optim over the functions with these initial parameters
optimised_parameters <- optim(
# the equal initial parameters
par = unlist(equal_parameters),
# run over the function to optimise parameters
fn = optimise_params,
# extra arguments to function
results = results,
# Nelder-Mead equation with 10k iterations max
control = list(maxit = 10000)
)

We can take the $par element of the output of this to find the parameters for which the negative log likelihood is minimised # display the parameters found to minimise # the negative log likelihood optimised_parameters$par
##          alpha.Arsenal alpha.Blackburn_Rovers    alpha.Coventry_City
##              2.9858302              1.8014838              1.2995271
##   alpha.Dover_Athletic     alpha.Enfield_Town    alpha.Frimley_Green
##              0.8192267              0.7762002              0.2748448
##           beta.Arsenal  beta.Blackburn_Rovers     beta.Coventry_City
##              0.4738011              0.6346112              0.7503864
##    beta.Dover_Athletic      beta.Enfield_Town     beta.Frimley_Green
##              1.2208768              1.5180931              2.5535961
##                  gamma
##              1.1663125

As expected, alpha decreases as teams get worse, and beta increases. The found gamma (1.166) is only marginally higher than the 1.091 for our simple model.

The $value element gives the negative log likelihood calculated for these parameters optimised_parameters$value
## [1] 57.5175

Which is much smaller than the ~100 we got from our very basic model.

Tinkering

This is all very well but there’s still some small improvements we can make.

For starters, I always think it’s simpler to have both scales of α and β to increase as a teams becomes more skillful in attack or defence. In our original equation the expected home and away goals follow the formula

$x_{ij} \sim Poisson(α_{i}β_{j}γ)$ $y_{ij} \sim Poisson(α_{j}β_{i})$

if instead of multiplying by β, we divide instead, a stronger defence will reduce the value of xij/yij (reducing the number of expected goals for the opposing team).

$x_{ij} \sim Poisson(\frac{α_{i}γ}{β_{j}})$ $y_{ij} \sim Poisson(\frac{α_{j}}{β_{i}})$

To achieve this we just have to flip two lines of the predict_results function. Instead of multiplying α and β, we divide them instead.

# change prediction to inverse defence parameters
predict_results <- function(home, away, param_list) {
e_goals_home <- (param_list$alpha[home] / param_list$beta[away]) * param_list$gamma e_goals_away <- (param_list$alpha[away] / param_list$beta[home]) df <- data.frame(home = home, away = away, e_hgoal = e_goals_home, e_agoal = e_goals_away) return(df) } # re run using new subfunction optimised_parameters2 <- optim( par = unlist(equal_parameters), fn = optimise_params, results = results, method = "Nelder-Mead", control = list(maxit = 10000)) # check this does what we want optimised_parameters2$par

(n.b. I won’t print out the results of all these steps as this post is long enough, but you can run and see the gradual improvements for yourself)

Next we want to subtly change how the expected goals are calculated.

Given that

$A = \frac{B \cdot C}{D}$ is exactly the same as

$A = e ^{log(B) + log(C) - log(D)}$ we can convert the parameters we are looking for into log(parameters) and take the exponent of their sum as the predicted goals. This might seem like a minor change, but prevents an important exception. Using home goals as an example, remember that

$x_{ij} \sim Poisson(\frac{α_{i}γ}{β_{j}})$ if any of the three parameters become negative then we’re left with a Poisson distribution with a negative mean, which is is absurd: events cannot unhappen. For instance, imagine a football game where one team scores negative goals.

If we take the log parameters instead we have

$x_{ij} \sim Poisson(e ^ {α_{i} - β_{j} + γ})$ where no matter what values α, β, or γ take, the exponent of their sum will never be negative. When playing a very strong away teams, the mean goals will tend towards 0 (though will never actually reach it).

# change prediction to use log parameters
# exp(log(x) + log(y)) = x * y
predict_results <- function(home, away, param_list) {
e_goals_home <- exp(param_list$alpha[home] - param_list$beta[away] + param_list$gamma) e_goals_away <- exp(param_list$alpha[away] - param_list$beta[home]) df <- data.frame(home = home, away = away, e_hgoal = e_goals_home, e_agoal = e_goals_away) return(df) } # initialise parameters as all 0 # log(1) = 0 equal_parameters <- list( alpha = rep(0, length(teams)) %>% names<-(teams), beta = rep(0, length(teams)) %>% names<-(teams), gamma = 0 ) # re run using new subfunction optimised_parameters3 <- optim( par = unlist(equal_parameters), fn = optimise_params, results = results, # using log will avoid non-finite differences # so can use BFGS model method = "BFGS", control = list(maxit = 10000)) We’ve also switched optimisation algorithm from Nelder-Mead to BFGS. BFGS is quicker than Nelder-Mead but requires the minimisation function (i.e. the negative log likelihood we calculate) to be finite. Before, we could get infinite negative log likelihoods, as it was possible to calculate a negative mean (expected goals for a team). Running dpois() for a negative lambda will return NaN so it becomes impossible to calculate the final negative log likelihood. Finally, we want to constrain the final optimised parameters by fixing the sum of all attack parameters, and the sum of all defence parameters, to equal 0. In practice, this basically means that above average attacking/defending teams will have parameters above 0, and below average teams will have parameters below 0. This is handy, but also the main advantage is this prevents overfitting of the parameters by the optimisation algorithm. To do this, we can simply drop the first (or last, or any, it doesn’t matter) parameter from attack or defence (the parameters for Arsenal) and then calculate Arsenal’s parameters as the sum of the remaining parameters multiplied by minus 1. $\alpha_{n} = -\sum_{i = 1}^{n-1} \alpha_{i}$ and also $\beta_{n} = -\sum_{i = 1}^{n-1} \beta_{i}$ In terms of code this just requires adding one line to the relist_params() function to append the value back. We also then need to remove this parameter that we will add back in from the initial parameters which is done below. Our output will now be missing the parameters for Arsenal (as they will only exist within the function), but we can easily calculate it from the parameters we do get out. # introduce sum to zero constraint by calculating # first teams parameters as minus sum of the rest relist_params <- function(parameters) { parameter_list <- list( alpha = parameters %>% .[grepl("alpha", names(.))] %>% append(prod(sum(.), -1), .) %>% names<-(teams), beta = parameters %>% .[grepl("beta", names(.))] %>% append(prod(sum(.), -1), .) %>% names<-(teams), gamma = parameters["gamma"] ) return(parameter_list) } # remove the first team from the attack and defence ratings equal_parameters <- list( alpha = rep(0, length(teams)-1) %>% names<-(teams[2:length(teams)]), beta = rep(0, length(teams)-1) %>% names<-(teams[2:length(teams)]), gamma = 0 ) # initialise i to collect data about the optimisation process at each iteration i <- 0 # collect current parameter values and neg log likelihood at each iteration current_parameters <- list() current_nll <- list() We can then final the optim() function one final time to get our final optimised parameters # run our final calculation optimised_parameters4 <- optim( par = unlist(equal_parameters), fn = optimise_params, results = results, method = "BFGS", control = list(maxit = 10000)) We can plot the log likelihood at each iteration. Notice how it starts around <120, which is pretty close what our worse_log_likelihood returned. For these optimisations, the original parameters we are supplying are similar to the zeroed parameters for that example. As the optim() function plays with the parameters you can see the log likelihood jumps around quite violently, but over time tend towards zero. p3 <- data.frame(likelihood = unlist(current_nll), iteration = seq(length(current_nll))) %>% ggplot(aes(x = iteration, y = likelihood)) + geom_line(colour = "red") + # cut out some cases where optim() has been a bit ambitious coord_cartesian(ylim = c(0, 250)) + labs(title = "Negative log likelihood of parameters over iterations", y = "negative log likelihood", x = "iteration") + theme_minimal() p3 The final parameters can also be extracted from the output from optim() and plotted: p4 <- optimised_parameters4$par %>%
# relist to add in first team
relist_params() %>%
unlist() %>%
# select team parameters
.[grepl("beta|alpha", names(.))] %>%
data.frame(value = .,
parameter = names(.)) %>%
separate(parameter, into = c("parameter", "team"), "\\.") %>%
# pipe into a plot
ggplot(aes(x = alpha, y = beta)) +
geom_point() +
ggrepel::geom_text_repel(aes(label = team)) +
stat_smooth(method = "lm", se = FALSE) +
labs(title = "Optimal parameters for teams",
subtitle = "given first 8 weeks of results",
x = "alpha (more likely to score ->)",
y = "beta (less likely to concede ->)") +
theme_minimal()

p4

Notice how the teams monotonically increase in both attack and defensive ability. This is by design on how the results were created (see the bottom of this post). With only 8 games per team however, there is quite a lot of noise in the signal. Hitting the crossbar instead of scoring in one game could make a fairly large difference in how the function rates a team.

Also note how the regression line passes through the origin- this is a result of us constraining the parameters to sum to zero.

If we want to see how optim() selects these, we can plot how they change over iterations. You can see how it jumps around then settles on incremental improvements to the model.

p5 <- current_parameters %>%
# get the parameters for arsenal for each iteration
lapply(., function(x){ unlist(relist_params(x))}) %>%
map_df(bind_rows, .id = "iteration") %>%
# melt data and split parameters into team and parameter
gather("parameter", "value", -iteration) %>%
# get rid of the gamma parameter
filter(parameter != "gamma.gamma") %>%
separate(parameter, into = c("parameter", "team"), sep = "\\.") %>%
# spread data back by parameter
mutate(iteration = as.numeric(iteration)) %>%
# plot alpha against beta for each iteration
ggplot(aes(x = alpha, y = beta)) +
geom_text(aes(label = team)) +
labs(title = 'Parameters for Iteration {floor(frame_time)}',
subtitle = "given first 8 weeks of results",
x = "alpha (more likely to score ->)",
y = "beta (less likely to concede ->)") +
# using gganimate package
gganimate::transition_time(iteration) +
gganimate::ease_aes('linear') +
gganimate::view_follow()

# animate the plot
gganimate::animate(p5, nframes = i)

Predict Remaining Matches

Now we have rated each teams attack/defense, and the advantage to a team to play at home, we can predict the remaining matches between the teams.

For this, we just have to use the predict_results() function we defined earlier, except this time the output will be the expected goals per team. Earlier we were measuring the deviance from expectation, but not we assume the most likely result is exactly equal to the expected results. If we wanted to we could work out how likely this result is, and what the most likely results are.

This post is long enough however, so for now, we’ll just detail the most likely results.

predicted_results <- predict_results(unplayed_games$home, unplayed_games$away,
relist_params(optimised_parameters4$par)) %>% mutate_if(is.numeric, round, 2) %>% print() ## home away e_hgoal e_agoal ## 1 Coventry_City Arsenal 0.86 2.11 ## 2 Blackburn_Rovers Dover_Athletic 2.62 0.49 ## 3 Frimley_Green Enfield_Town 0.44 1.72 ## 4 Arsenal Blackburn_Rovers 2.39 0.99 ## 5 Coventry_City Frimley_Green 4.09 0.17 ## 6 Dover_Athletic Enfield_Town 1.33 0.79 All of these look reasonable, with better teams beating worse ones. The only match that the model thinks might well end in a draw is Dover at home to Enfield, which is not entirely unreasonable. We can add these predictions to our earlier matrix of results to get a sense if these fit in with the trend from the observed matches: p6 <- rbind( predicted_results %>% rename_if(is.numeric, gsub, pattern = "e_", replacement = "") %>% mutate(type = "predicted"), results %>% select(-gameweek) %>% mutate(type = "result") ) %>% ggplot(., aes(x = away, y = home, fill = hgoal-agoal)) + geom_tile() + # add the scorelines geom_label(aes(label = paste(hgoal, agoal, sep = "-"), colour = type), fill = "white") + # colour where black for actual results and red for predictions scale_colour_manual(values = c("red", "black")) + # colour where green shows home win and red an away win scale_fill_gradient2(low = "darkred", high = "green", midpoint = 0, guide = FALSE) + scale_x_discrete(limits = levels(results$home), position = "top") +
scale_y_discrete(limits = rev(levels(results$away))) + theme_minimal() p6 Which they do! The predicted results fit in with the gradient of heavier defeats for home teams towards the bottom left, progressing to easy home victories in the top right. That’s all for this post. Hopefully using the Poisson distribution to model football matches is a little clearer now. Feel free to email me any questions and check out the packages I stole all the codes/idea from. Next time, I’ll go over how to quantify the probability of a range of results for any single match in (hopefully) a shorter post; until then! Notes 1 much of the code I use here is stolen/reworked from the code shared on this repo 2 towards the end of writing this post I came across David Sheehan’s blog which actually does a pretty good job, but I felt still didn’t quite go through how/why the model uses the maths it does 3 see https://arxiv.org/pdf/cond-mat/0110605.pdf and also the conclusion of David Sheehan’s blog on Dixon-Coles processes 4 *we could instead maximise the sum of the log likelihoods and then the error will converge towards 0 from a negative number. Either is fine. Results Generation First we need to create a data.frame of fixtures for each team # https://stackoverflow.com/questions/54099990/is-there-an-efficient-algorithm-to-create-this-type-of-schedule create_fixtures <- function(teams) { # keep team 1 in place team1 <- as.character(teams[1]) #rotate other teams around team 1 other_teams <- as.character(teams[!teams %in% team1]) length <- length(other_teams) # generate fixtures each week for(week in seq((length(teams)-1)*2)) { if(week %% 2 == 0) { fixtures <- data.frame(home = c(team1, other_teams[1:2]), away = other_teams[length:3], gameweek = week) } else { fixtures <- data.frame(home = other_teams[length:3], away = c(team1, other_teams[1:2]), gameweek = week) } if(week == 1) { fixtures_df <- fixtures } else { fixtures_df <- rbind(fixtures_df, fixtures) } # rotate other teams around other_teams <- c(other_teams[length], other_teams[1:length-1]) } return(fixtures_df) } # create the fixtures fixtures <- create_fixtures(teams) %>% mutate_if(is.factor, as.character) # print the fixture list fixtures ## home away gameweek ## 1 Frimley_Green Arsenal 1 ## 2 Enfield_Town Blackburn_Rovers 1 ## 3 Dover_Athletic Coventry_City 1 ## 4 Arsenal Enfield_Town 2 ## 5 Frimley_Green Dover_Athletic 2 ## 6 Blackburn_Rovers Coventry_City 2 ## 7 Dover_Athletic Arsenal 3 ## 8 Coventry_City Enfield_Town 3 ## 9 Blackburn_Rovers Frimley_Green 3 ## 10 Arsenal Coventry_City 4 ## 11 Dover_Athletic Blackburn_Rovers 4 ## 12 Enfield_Town Frimley_Green 4 ## 13 Blackburn_Rovers Arsenal 5 ## 14 Frimley_Green Coventry_City 5 ## 15 Enfield_Town Dover_Athletic 5 ## 16 Arsenal Frimley_Green 6 ## 17 Blackburn_Rovers Enfield_Town 6 ## 18 Coventry_City Dover_Athletic 6 ## 19 Enfield_Town Arsenal 7 ## 20 Dover_Athletic Frimley_Green 7 ## 21 Coventry_City Blackburn_Rovers 7 ## 22 Arsenal Dover_Athletic 8 ## 23 Enfield_Town Coventry_City 8 ## 24 Frimley_Green Blackburn_Rovers 8 ## 25 Coventry_City Arsenal 9 ## 26 Blackburn_Rovers Dover_Athletic 9 ## 27 Frimley_Green Enfield_Town 9 ## 28 Arsenal Blackburn_Rovers 10 ## 29 Coventry_City Frimley_Green 10 ## 30 Dover_Athletic Enfield_Town 10 and then create the results # using goalmodel package # https://github.com/opisthokonta/goalmodel library(goalmodel) # have to manually create a list of parameters model <- list() # stratify teams abilities in attack and defense model$parameters <- list(attack = seq(1, -1 + 2/length(teams), by = -2/(length(teams)-1)) %>%
append(-sum(.)) %>%
names<-(teams),
defense = seq(1, -1 + 2/length(teams), by = -2/(length(teams)-1)) %>%
append(-sum(.)) %>%
names<-(teams),
# no base rate of goals
intercept = 0,
# roughly accurate hfa for English professional football
hfa = 0.3)

model$all_teams <- teams # use a simple Poisson model with 8 goals max model$model <- "poisson"
model$maxgoal <- 8 # use the model to predict results using regista package results <- predict_expg(model, fixtures$home, fixtures\$away, return_df = TRUE) %>%
mutate(noise1 = rnorm(nrow(.), 0, 0.5),
noise2 = rnorm(nrow(.), 0, 0.5)) %>%
mutate(hgoal = round(expg1 + noise1,0 ),
agoal = round(expg2 + noise2,0),
home = as.factor(team1),
away = as.factor(team2)) %>%
# merge to fixtures
merge(., fixtures, by = c("home", "away")) %>%
# cant score less than zero goals
mutate_at(vars(hgoal:agoal), funs(replace(., .<0, 0))) %>%
select(home, away, hgoal, agoal, gameweek) %>%
arrange(gameweek, home) %>%
# treat only first 8 weeks as played
filter(gameweek <= 8)

# print results
results
##                home             away hgoal agoal gameweek
## 1    Dover_Athletic    Coventry_City     0     2        1
## 2      Enfield_Town Blackburn_Rovers     1     3        1
## 3     Frimley_Green          Arsenal     0     6        1
## 4           Arsenal     Enfield_Town     6     0        2
## 5  Blackburn_Rovers    Coventry_City     2     0        2
## 6     Frimley_Green   Dover_Athletic     0     3        2
## 7  Blackburn_Rovers    Frimley_Green     8     0        3
## 8     Coventry_City     Enfield_Town     3     1        3
## 9    Dover_Athletic          Arsenal     1     3        3
## 10          Arsenal    Coventry_City     3     0        4
## 11   Dover_Athletic Blackburn_Rovers     1     2        4
## 12     Enfield_Town    Frimley_Green     2     1        4
## 13 Blackburn_Rovers          Arsenal     2     2        5
## 14     Enfield_Town   Dover_Athletic     0     2        5
## 15    Frimley_Green    Coventry_City     1     3        5
## 16          Arsenal    Frimley_Green     9     0        6
## 17 Blackburn_Rovers     Enfield_Town     5     0        6
## 18    Coventry_City   Dover_Athletic     1     2        6
## 19    Coventry_City Blackburn_Rovers     0     2        7
## 20   Dover_Athletic    Frimley_Green     3     1        7
## 21     Enfield_Town          Arsenal     0     5        7
## 22          Arsenal   Dover_Athletic     4     1        8
## 23     Enfield_Town    Coventry_City     1     2        8
## 24    Frimley_Green Blackburn_Rovers     1     4        8