Robert Hickman on Robert Hickman
/
Recent content in Robert Hickman on Robert Hickman
Hugo -- gohugo.io
en-us
© 2018
Fri, 05 Jan 2018 00:00:00 +0000

Slicing the onion 3 ways: Toy problems in R, Python, and Julia
/post/riddler_nba_tour_de_frace/
Sun, 27 Sep 2020 00:00:00 +0000
/post/riddler_nba_tour_de_frace/
<p>Between writing up my thesis, applying to jobs (<a href="mailto:robwhickman@gmail.com">hire me! I’m quite good at programming</a>), and the ongoing pandemic, I don’t really have time to write full blogposts. I have, however, decided to brush up my Python skills and dive headfirst into Julia. As such, I like to answer the toy problems posted weekly at <a href="https://fivethirtyeight.com/tag/theriddler/">fivethirtyeight’s riddler</a>. My answers will often be a few weeks late, but it motivates me to tend to the blog and keep up my programming.</p>
<div id="express" class="section level2">
<h2>Express</h2>
<p><em>From Zack Beamer comes a baffling brain teaser of basketball, just in time for the NBA playoffs:</em></p>
<p><em>Once a week, folks from Blacksburg, Greensboro, and Silver Spring get together for a game of pickup basketball. Every week, anywhere from one to five individuals will show up from each town, with each outcome equally likely.</em></p>
<p><em>Using all the players that show up, they want to create exactly two teams of equal size. Being a prideful bunch, everyone wears a jersey that matches the color mentioned in the name of their city. However, since it might create confusion to have one jersey playing for both sides, they agree that the residents of two towns will combine forces to play against the third town’s residents.</em></p>
<p><em>What is the probability that, on any given week, it’s possible to form two equal teams with everyone playing, where two towns are pitted against the third?</em></p>
<p><em>Extra credit: Suppose that, instead of anywhere from one to five individuals per town, anywhere from one to N individuals show up per town. Now what’s the probability that there will be two equal teams?</em></p>
<p>This is a nice little combinatorics problem; as such, we can solve it by finding all combinations and then counting those where the maximum value is equal to the sum of the remaining values:</p>
<pre class="r"><code>#create lists of possible values for all team a, b, or c
players < list(a = 1:5, b = 1:5, c = 1:5)
#find all combinations
player_combinations < do.call(expand.grid, players)
#get the value of the largest team in each combination
largest_team < apply(player_combinations, 1, max)
#get the sum of the remaining teams in each combination
reamining_players < apply(player_combinations, 1, function(x) sum(x)  max(x))
#check when these match
matched_teams < nrow(player_combinations[which(largest_team == reamining_players),])
#find the fraction which match
fraction_even_teams < matched_teams / nrow(player_combinations)
fraction_even_teams</code></pre>
<pre><code>## [1] 0.24</code></pre>
<p>So the answer to the main Express question is 0.24, or about a 1 in 4 chance.</p>
<p>It’s easy to expand this to any number of players by allowing the first line to take any value:</p>
<pre class="r"><code>#rewrite previous chunk as function that takes max_players as an argument
find_matches_fraction < function(max_players) {
players < list(a = seq(max_players), b = seq(max_players), c = seq(max_players))
player_combinations < do.call(expand.grid, players)
largest_team < apply(player_combinations, 1, max)
reamining_players < apply(player_combinations, 1, function(x) sum(x)  max(x))
matched_teams < nrow(player_combinations[which(largest_team == reamining_players),])
fraction_even_teams < matched_teams / nrow(player_combinations)
}
#run for n 1:50
fraction_even_teams < lapply(seq(50), find_matches_fraction)
answers_df < data.frame(
townspeople = seq(50),
chance = unlist(fraction_even_teams)
)
#for plotting
library(ggplot2)
#plot the answers for 1 to n players where max n is 50
p1 < ggplot(answers_df, aes(x = townspeople, y = chance)) +
geom_point() +
geom_line() +
labs(
title = "solution to Riddler express",
subtitle = "chance of matched team with max n players",
x = "max N players per town",
y = "chance of even teams"
) +
theme_minimal()
p1</code></pre>
<p><img src="/post/20200927riddler_nba_and_tour_files/figurehtml/solve_express_r_extra1.png" width="672" />
Let’s implement this in python. I won’t comment lines again, the flow of the function is fundamentally the same</p>
<pre class="python"><code>import itertools
def find_matches_fraction(max_players):
team_a = range(1,max_players)
team_b = range(1,max_players)
team_c = range(1,max_players)
matched_team = []
for players in list(itertools.product(team_a,team_b,team_c)):
largest_team = max(players)
l_combinations = list(players)
l_combinations.pop(l_combinations.index(max(l_combinations)))
remaining_players = sum(l_combinations)
if remaining_players == largest_team:
matched_team.append(1)
else:
matched_team.append(0)
fraction_success = sum(matched_team) / len(matched_team)
return(fraction_success)
answer_express = find_matches_fraction(6)
print(answer_express)</code></pre>
<pre><code>## 0.24</code></pre>
<p>And in Julia</p>
<pre class="julia"><code>using IterTools
function find_matches_fraction_jl(max_players)
team_a = 1:max_players
team_b = 1:max_players
team_c = 1:max_players
matched_teams = []
for players in product(team_a, team_b, team_c)
largest_team = maximum(players)
other_teams = collect(players)
deleteat!(other_teams, argmax(players))
remaining_players = sum(other_teams)
if largest_team == remaining_players
push!(matched_teams, 1)
else
push!(matched_teams, 0)
end
end
fraction_success = sum(matched_teams) / length(matched_teams)
return fraction_success
end</code></pre>
<pre><code>## find_matches_fraction_jl (generic function with 1 method)</code></pre>
<pre class="julia"><code>
answer_express = find_matches_fraction_jl(5);
answer_express</code></pre>
<pre><code>## 0.24</code></pre>
<p>We can also run these chunks in R using <a href="https://rstudio.github.io/reticulate/">reticulate</a> and <a href="https://cran.rproject.org/web/packages/JuliaCall/readme/README.html">JuliaCall</a>.</p>
<pre class="r"><code>#packages to call other languages into R
library(JuliaCall)
library(reticulate)
#run the functions to check answers
py$find_matches_fraction(as.integer(6))</code></pre>
<pre><code>## [1] 0.24</code></pre>
<pre class="r"><code>julia_eval("find_matches_fraction_jl(5)")</code></pre>
<pre><code>## [1] 0.24</code></pre>
<p>We can then use <a href="https://www.rdocumentation.org/packages/microbenchmark/versions/1.47/topics/microbenchmark">microbenchmark</a> to test the speeds of the functions written here. We run each n times and look at the distribution of times spent running each.</p>
<pre class="r"><code>#microbenchmark to time functions
library(microbenchmark)
#run each function 10000 times
n < 10000
bench_express < microbenchmark(
jl = julia_eval("find_matches_fraction_jl(5)"),
py = py$find_matches_fraction(as.integer(6)),
r = find_matches_fraction(5),
times = n
)
#plot the speeds of each functions
p2 < ggplot2::autoplot(bench_express) +
labs(
title = "benchmarking of functions for Riddler Express"
) +
theme_minimal()
p2</code></pre>
<p><img src="/post/20200927riddler_nba_and_tour_files/figurehtml/benchmark_express1.png" width="672" /></p>
<p>I’m pretty happy with that. Even my rusty Python ends up being faster than the R code (which I wrote for expressiveness rather than speed per se), but my first ever solution in Julia outstrips both!</p>
</div>
<div id="classic" class="section level2">
<h2>Classic</h2>
<p><em>This month, the Tour de France is back, and so is the Tour de FiveThirtyEight!</em></p>
<p><em>For every mountain in the Tour de FiveThirtyEight, the first few riders to reach the summit are awarded points. The rider with the most such points at the end of the Tour is named “King of the Mountains” and gets to wear a special polka dot jersey.</em></p>
<p><em>At the moment, you are racing against three other riders up one of the mountains. The first rider over the top gets 5 points, the second rider gets 3, the third rider gets 2, and the fourth rider gets 1.</em></p>
<p><em>All four of you are of equal ability — that is, under normal circumstances, you all have an equal chance of reaching the summit first. But there’s a catch — two of your competitors are on the same team. Teammates are able to work together, drafting and setting a tempo up the mountain. Whichever teammate happens to be slower on the climb will get a boost from their faster teammate, and the two of them will both reach the summit at the faster teammate’s time.</em></p>
<p><em>As a lone rider, the odds may be stacked against you. In your quest for the polka dot jersey, how many points can you expect to win on this mountain, on average?</em></p>
<p>A first guess can be had by assuming there were <em>no</em> teams and just taking the expected points after random assignment:</p>
<pre class="r"><code>riders < 4
points < c(5,3,2,1)
sum(points/riders)</code></pre>
<pre><code>## [1] 2.75</code></pre>
<p>We can then work out the answer to the Classic analytically by calculating the chance that the rider is bumped back a spot from any position they find themselves in. For instance, if they finish 2nd, there is a 2 in 3 chance the rider ahead of them is one of the teammates, which would bump our rider into 3rd to make room for the slower teammate.</p>
<pre class="r"><code>expected_points <
#first
(points[1] / riders) +
#second
(points[2] / riders)/(riders1) + 2 * (points[(riders1)] / riders)/(riders1) +
#third
(points[(riders1)] / riders) / (riders1) + 2 *(points[riders] / riders)/(riders1) +
#last
(points[riders] / riders)
expected_points</code></pre>
<pre><code>## [1] 2.416667</code></pre>
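<p>Since this post is already hopping between languages, here’s a quick exact cross-check of that answer in Python (a sketch of my own, not part of the original solution): enumerate all 4! equally likely raw finishing orders, slide the slower teammate up to just behind the faster one, and average our rider’s points. The rider labels are invented for illustration.</p>

```python
from itertools import permutations
from fractions import Fraction

points = [5, 3, 2, 1]

def our_points(raw_order):
    #raw_order: riders by unassisted speed; Y = us, A = the other solo rider, T1/T2 = the team
    order = list(raw_order)
    i, j = order.index("T1"), order.index("T2")
    fast, slow = min(i, j), max(i, j)
    slower = order.pop(slow)
    order.insert(fast + 1, slower)  #slower teammate finishes right behind the faster one
    return points[order.index("Y")]

total = sum(Fraction(our_points(p)) for p in permutations(["Y", "A", "T1", "T2"]))
print(total / 24)  # 29/12
```

<p>The exact expectation is 29/12, matching the 2.416667 above.</p>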
<p>So we have our answer, but what about for any combination of teams and points? We can write an R function to assign riders to teams and simulate many races to get an estimate of the expected points. We could again solve these analytically, but that wouldn’t really benefit my programming.</p>
<pre class="r"><code>get_team_points < function(teams, points) {
team_pos < sample(unique(teams), length(unique(teams)), prob = table(teams))
all_positions < unlist(lapply(team_pos, function(p) rep(p, length(which(p == teams)))))
team_points < lapply(unique(teams), function(i) sum(points[which(all_positions == i)]))
names(team_points) < unique(teams)
return(team_points)
}
sim_race < function(n_riders, n_per_team = 2, points = c(5,3,2,1), times = 1000) {
leftover_riders < (n_riders1) %% n_per_team
teams < (n_riders  leftover_riders  1) / n_per_team
teamed_riders < c(
rep(seq(teams), each = n_per_team),
rep(max(teams)+1, leftover_riders),
999
)
all_points < c(
points,
rep(0, n_riders  length(points))
)
simmed_points < unlist(purrr::rerun(times, get_team_points(teamed_riders, all_points)))
expected_points < tapply(simmed_points, names(simmed_points), sum) / times
expected_points[names(expected_points) == 999]
}
expected_points < sim_race(4, 2, points = c(5,3,2,1), times = 10000)
expected_points</code></pre>
<pre><code>## 999
## 2.4093</code></pre>
<p>For a range of n riders and team sizes, we can calculate our rider’s expected points per race (we’ll use the same points structure of c(n+1, (n-1):1) for a little extra flourish)</p>
<pre class="r"><code>riders < 1:20
n_per_team < 1:5
library(dplyr)
arguments < expand.grid(riders, n_per_team) %>%
dplyr::rename(n_riders = Var1, n_per_team = Var2) %>%
#must be more riders than riders per team
dplyr::filter(n_riders > n_per_team)
arguments$points < lapply(arguments$n_riders, function(r) c(r+1, (r1):1))
#use map2
library(purrr)
sims < 1000
arguments$expected_points < pmap_dbl(arguments, sim_race, times = sims)
#plot the expected points
p3 < ggplot(arguments, aes(x = n_riders, y = n_per_team)) +
geom_tile(aes(fill = expected_points / (n_riders+1))) +
scale_fill_viridis_c(option = "plasma", name = "expected points\n / max possible points") +
labs(
title = "solution to Riddler classic",
subtitle = "expected points for our rider",
x = "total n riders",
y = "number of riders per team"
) +
theme_minimal()
p3</code></pre>
<p><img src="/post/20200927riddler_nba_and_tour_files/figurehtml/extra_credit_classic1.png" width="672" /></p>
<p>Let’s now port our function for this over to Python…</p>
<pre class="python"><code>from numpy.random import choice
import numpy as np
import pandas as pd
import math
import itertools
def sim_race_py(n_riders, n_per_team, points):
n_teams = math.ceil((n_riders  1) / n_per_team) + 1
filled_teams = math.floor((n_riders  1) / n_per_team)
leftover_riders = (n_riders  1) % n_per_team
if leftover_riders > 0:
extra_riders = [leftover_riders, 1]
else:
extra_riders = 1
if filled_teams == 1:
win_prob = [n_per_team, extra_riders]
else:
win_prob = [n_per_team] * filled_teams
win_prob.extend([extra_riders])
flattened_probs = list(pd.core.common.flatten(win_prob))
sum_probs = np.sum(flattened_probs)
adjusted_probs = [p/sum_probs for p in flattened_probs]
no_teams = list(range(len(flattened_probs)))
finish_order = choice(no_teams, len(no_teams), p = adjusted_probs, replace = False)
expanded_finish_order = []
for team in finish_order:
if team < filled_teams:
expanded_finish_order += [team] * n_per_team
else:
if team != max(no_teams):
expanded_finish_order += [team] * leftover_riders
else:
expanded_finish_order += [team]
won_points = points[np.argmax(expanded_finish_order)]
return won_points
def sim_races_py(n_riders, n_per_team, points, n_times):
won_points = []
for _ in range(n_times):
sim_points = sim_race_py(n_riders, n_per_team, points)
won_points.append(sim_points)
expected_points = np.sum(won_points) / len(won_points)
return(expected_points)
answer_classic = sim_races_py(4,2,[5,3,2,1], 10000)
print(answer_classic)</code></pre>
<pre><code>## 2.4286</code></pre>
<p>…and in Julia</p>
<pre class="julia"><code>using StatsBase
function sim_race_jl(n_riders, n_per_team, points);
n_teams = Int(ceil((n_riders  1) / n_per_team));
filled_teams = Int(floor((n_riders  1) / n_per_team));
leftover_riders = mod(n_riders  1, n_per_team);
if leftover_riders > 0
extra_riders = [leftover_riders, 1];
else
extra_riders = 1;
end
if filled_teams == 1
win_prob = vcat(n_per_team, extra_riders);
else
win_prob = vcat(repeat([n_per_team], filled_teams), extra_riders);
end
finish_order = sample(1:length(win_prob),
ProbabilityWeights(win_prob),
length(win_prob),
replace = false
);
expanded_finish_order = Vector{Int}();
for team in finish_order
if team <= filled_teams
append!(expanded_finish_order, repeat([team], n_per_team));
else
if team != length(finish_order)
append!(expanded_finish_order, repeat([team], leftover_riders));
else
append!(expanded_finish_order, team);
end
end
end
rider_position = findall(expanded_finish_order .== maximum(expanded_finish_order));
points_won = points[rider_position];
return points_won
end</code></pre>
<pre><code>## sim_race_jl (generic function with 1 method)</code></pre>
<pre class="julia"><code>
function sim_races_jl(n_riders, n_per_team, points, n_times);
    won_points = Vector{Int}();
    for _ in 1:n_times
        sim_points = sim_race_jl(n_riders, n_per_team, points);
        append!(won_points, sim_points);
    end
    expected_points = sum(won_points) / length(won_points);
    return expected_points;
end</code></pre>
<pre><code>## sim_races_jl (generic function with 1 method)</code></pre>
<pre class="julia"><code>
answer_classic = sim_races_jl(4,2,[5,3,2,1], 10000);
answer_classic</code></pre>
<pre><code>## 2.4098</code></pre>
<p>And then let’s benchmark each of these functions again:</p>
<pre class="r"><code>#run each function 10000 times
n < 10000
bench_classic < microbenchmark(
jl = julia_eval("sim_race_jl(4,2,[5,3,2,1])"),
py = py$sim_race_py(as.integer(4),as.integer(2),c(5,3,2,1)),
r = sim_race(4,2,c(5,3,2,1), times = 1),
times = n
)
#plot the speeds of each functions
p4 < ggplot2::autoplot(bench_classic) +
labs(
title = "benchmarking of functions for Riddler Classic"
) +
theme_minimal()
p4</code></pre>
<p><img src="/post/20200927riddler_nba_and_tour_files/figurehtml/benchmark_classic1.png" width="672" />
A bit closer this time. I think I haven’t quite got efficiency for more involved functions down for python and Julia. Julia still wins this round but I feel could be speed up by at least a factor 2 or 3x.</p>
</div>

The Riddler - June 26th
/post/riddler_june_26th/
Mon, 29 Jun 2020 00:00:00 +0000
/post/riddler_june_26th/
<pre class="r"><code>#for working with polygons
library(sf)
library(sfheaders)
library(tidyverse)
library(gtools)
set.seed(22081992)</code></pre>
<div id="riddlerexpress" class="section level1">
<h1>Riddler Express</h1>
<p>This week’s Express deals with an erratic driver:</p>
<p><em>In Riddler City, the city streets follow a grid layout, running north-south and east-west. You’re driving north when you decide to play a little game. Every time you reach an intersection, you randomly turn left or right, each with a 50 percent chance.</em></p>
<p><em>After driving through 10 intersections, what is the probability that you are still driving north?</em></p>
<p>So all we have to do is create a binomial tree of depth 10 and then sum by final heading direction. As the driver <em>must</em> turn left or right at each junction, we actually only have to consider the final turn: whatever the heading before the last intersection, that turn sends the car to each of the two perpendicular directions with p = 0.5. But if we want to prove this, let’s consider it as a more canonical ball-drawing probability task where one can draw balls from a bag:</p>
<ul>
<li>Red (right) ball with probability p or</li>
<li>Lime (left) ball with probability q</li>
</ul>
<p>drawing a ball 10 times, with replacement</p>
<p>We know that as there are only two balls, the total probability is</p>
<p><span class="math display">\[ (p + q) = 1 \]</span>
on the first pick we are just choosing p or q, so we can raise everything to the power of 1 (pick) to get the same formula:</p>
<p><span class="math display">\[ (p + q)^1 = 1^1 \]</span>
and can generalise to n picks</p>
<p><span class="math display">\[ (p + q)^n = 1^n \]</span>
to expand this we’re going to get combinations of p and q raised to the powers 0:n, multiplied by the coefficients from <a href="https://en.wikipedia.org/wiki/Pascal%27s_triangle">Pascal’s triangle</a>.</p>
<p>If we call this multiplier m, we can express it as:</p>
<p><span class="math display">\[ m = \frac{n!}{(nk!)k!} \]</span>
(where k is 0:n)</p>
<p>so for n = 10 (turns of the car, or picks of a ball), we get</p>
<pre class="r"><code>#calculate pascals triangle via factorials
calc_pascal < function(n,k) {
factorial(n) / (factorial(nk) * factorial(k))
}
#run for n turns
n_turns < 10
m = map2_dbl(n_turns, 0:n_turns, calc_pascal)
m</code></pre>
<pre><code>## [1] 1 10 45 120 210 252 210 120 45 10 1</code></pre>
<p>so for</p>
<p><span class="math display">\[ (p + q)^{10}\]</span>
we will expand this into</p>
<p><span class="math display">\[ 1p^{10} + 10p^9q + 45p^8q^2 + 120 p^7q^3 + 210p^6q^4 + 252p^5q^5 + 210p^4q^6 + 120p^3q^7 + 45p^2q^8 + 10pq^9 + 1q^{10}\]</span>
But here we diverge from the balls in a bag: each time we draw (/turn), the heading of our car updates. We don’t care about the probability of each of these per se, but about the probabilities grouped by the final direction of the car.</p>
<p>It should be clear that every p draw (a right turn) moves the car 1 cardinal direction to the right, whereas every q draw (a left turn) moves it 1 cardinal direction to the left. In our expansion we have 210 examples of drawing 6 right turns and 4 left turns, which would leave the car facing due south (2 net cardinal turns). For each term, we just have to subtract the exponent of the left turns from the exponent of the right turns, then find the direction by taking this net turn modulo 4.</p>
<p>For a binomial expansion like this, it’s very easy:</p>
<pre class="r"><code>#calculate the end heading for each term of the expansion
term_direction = (n_turns:0  0:n_turns) %% 4
term_direction</code></pre>
<pre><code>## [1] 2 0 2 0 2 0 2 0 2 0 2</code></pre>
<p>so we’re either going to end up facing north (0 net turns) or south (2 net turns). We can then sum the coefficients m for each final direction:</p>
<pre class="r"><code>#list of cardinal direction
final_directions < c("north", "east", "south", "west")
#loop through each expansion term to get the final direction
direction_p < c()
for(d in 0:3) {
direction_p[d+1] < sum(m[term_direction == d])
}
#find the probability of facing any direction
names(direction_p) < final_directions
direction_p / sum(direction_p)</code></pre>
<pre><code>## north east south west
## 0.5 0.0 0.5 0.0</code></pre>
<p>so we have a 50% chance of ending up facing either north or south. So the answer to this week’s Riddler Express is</p>
<p><span class="math display">\[p(North) = 0.5 \]</span></p>
<div id="extracredit" class="section level2">
<h2>Extra Credit</h2>
<p>For extra credit, the driver decides instead to turn left, turn right, or continue straight with equal probability (1/3). In addition to p and q, we now also have a probability r, where</p>
<p><span class="math display">\[ r = p(No Turn) \]</span>
We can then use expand.grid() to produce all ordered sequences of these three outcomes, and count the sequences by the number of each outcome they contain:</p>
<pre class="r"><code>#find combinations of p, q, and r
extra_credit < expand.grid(rep(list(c("p", "q", "r")), n_turns)) %>%
#label each combination
mutate(id = 1:n()) %>%
#count numbers of p, q, and r
pivot_longer(cols = starts_with("Var")) %>%
group_by(id, value) %>%
summarise(n = n()) %>%
#pivot back to wide
pivot_wider(id_cols = id, names_from = value, values_from = n) %>%
mutate_at(c("p", "q", "r"), ~replace(., is.na(.), 0)) %>%
#count numbers of each combination
group_by(p, q, r) %>%
summarise(n = n()) %>%
arrange(n)
extra_credit</code></pre>
<pre><code>## # A tibble: 66 x 4
## # Groups:   p, q [66]
##        p     q     r     n
##    &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;
##  1     0     0    10     1
##  2     0    10     0     1
##  3    10     0     0     1
##  4     0     1     9    10
##  5     0     9     1    10
##  6     1     0     9    10
##  7     1     9     0    10
##  8     9     0     1    10
##  9     9     1     0    10
## 10     0     2     8    45
## # ... with 56 more rows</code></pre>
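<p>That 66 is just the stars-and-bars count of ordered triples (p, q, r) of non-negative integers summing to 10, i.e. choose(12, 2). A quick Python check of my own (not from the original post):</p>

```python
from math import comb
from itertools import product

#stars and bars: ordered triples (p, q, r) of non-negative integers summing to 10
print(comb(10 + 2, 2))  # 66

#brute-force cross-check over all 3^10 turn sequences
patterns = {(s.count(0), s.count(1), s.count(2)) for s in product(range(3), repeat=10)}
print(len(patterns))  # 66
```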
<p>As we might expect, the counts mirror the two-outcome case, with each pattern now spread across three variables. One thing to be careful of here: each (p, q, r) combination corresponds to n different turn sequences, and it is the <em>sequences</em> that are equally likely, so we need to weight each combination by its n. As the final heading depends only on the difference between the number of right and left turns, we can subtract these and sum the weighted counts for each direction</p>
<pre class="r"><code>extra_credit_answer &lt;- extra_credit %&gt;%
  mutate(net_turns = p - q) %&gt;%
  mutate(final_direction = net_turns %% 4) %&gt;%
  group_by(final_direction) %&gt;%
  summarise(n = sum(n)) %&gt;%
  .$n
names(extra_credit_answer) &lt;- final_directions
extra_credit_answer / sum(extra_credit_answer)</code></pre>
<pre><code>##     north      east     south      west 
## 0.2500127 0.2499958 0.2499958 0.2499958</code></pre>
<p>giving us an answer of</p>
<p><span class="math display">\[ p(North) = \frac{3^{10} + 3}{4 \times 3^{10}} \approx 0.25001 \]</span>
just a shade over one in four.</p>
<h1>Riddler Classic</h1>
<p><em>Polly Gawn loves to play “connect the dots.” Today, she’s playing a particularly challenging version of the game, which has six unlabeled dots on the page. She would like to connect them so that they form the vertices of a hexagon. To her surprise, she finds that there are many different hexagons she can draw, each with the same six vertices.</em></p>
<p><em>What is the greatest possible number of unique hexagons Polly can draw using six points?</em></p>
<p>This is a pretty tricky question! I can’t see any way to solve it analytically, and given that it involves polygons (and not just pure numbers) it seems like a tricky question to brute force. That doesn’t mean we can’t try though.</p>
<p>Let’s start by using the hint that for n = 4 points the maximum number is 3 polygons, achieved when the fourth point lies within an enclosing triangle of the other three. We can generate some random points for this pretty easily, and use the <a href="https://rspatial.github.io/sf/articles/sf1.html">simple features</a> package to test the properties of the resulting polygons:</p>
<pre class="r"><code>#generate 3 random points
points < data.frame(
x = runif(3),
y = runif(3)
)
#create a triangle from these points
triangle < sf_polygon(points)
#randomly generate a fourth point within the bounding box of these points
new_point < data.frame(
x = runif(1, min = min(points$x), max = max(points$x)),
y = runif(1, min = min(points$y), max = max(points$y))
)
#keep generate this point until it lies within the triangle of the previous 3
while(length(unlist(st_contains(triangle, sf_point(new_point)))) ==0) {
new_point < data.frame(
x = runif(1, min = min(points$x), max = max(points$x)),
y = runif(1, min = min(points$y), max = max(points$y))
)
}
#bind the fourth point onto the previous 3
points < rbind(points, new_point)
#plot the points
p2 < ggplot() +
#triangle
geom_sf(data = triangle, alpha = 0.1) +
geom_point(data = points, aes(x, y),
shape = 21, fill = "skyblue", colour = "black", size = 3) +
theme_minimal()
p2</code></pre>
<p><img src="/post/20200628riddlerjune26_files/figurehtml/generate_four_points1.png" width="672" /></p>
<p>Now we need to brute force through every possible polygon. To do this we can use combinatorics again, this time with the permutations() function from the gtools package. We create every possible route through the points, then keep only the routes that start at the first point (to cut down our search space, as many routes are the same route just shifted to a different start node).</p>
<pre class="r"><code>#create all possible routes of 4 points
routes < permutations(4, 4, 1:4) %>%
as.data.frame() %>%
#filter those beginning with node 1
filter(V1 == 1)</code></pre>
<p>For each route we then create the resulting polygon by ordering the points and creating a simple features polygon. These are then bound together and each given an id.</p>
<pre class="r"><code>#cycle through routes to create polygons
for(r in seq(nrow(routes))) {
nodes < as.numeric(routes[r,])
sf_points < points[nodes,]
sf < sf_polygon(sf_points)
if(r == 1) {
polygons < sf
} else {
polygons < rbind(polygons, sf)
}
}
polygons$id < 1:nrow(polygons)
#plot the resulting polygons
p3 < ggplot() +
geom_sf(data = polygons, fill = "dodgerblue") +
theme_minimal() +
theme(axis.text = element_blank()) +
facet_wrap(~id)
p3</code></pre>
<p><img src="/post/20200628riddlerjune26_files/figurehtml/test_resulting_polygons1.png" width="672" /></p>
<p>However, we know that there are only 3 unique polygons for n = 4 points. Why have we found 6? From inspection it’s pretty clear that even though they all follow unique paths, 3 of these are duplicates of the other 3. This occurs because for each starting node there are two paths that trace out each polygon: a ‘clockwise’ path and an ‘anticlockwise’ one.</p>
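<p>This halving generalises: fixing the starting node leaves (n-1)! routes, and identifying each route with its reversal leaves (n-1)!/2 candidate polygons to check for any n. A small Python sketch of my own to tabulate that count:</p>

```python
from math import factorial

def candidate_polygons(n):
    #routes fixed to start at vertex 1, with each route identified with its reversal
    return factorial(n - 1) // 2

print([candidate_polygons(n) for n in range(3, 8)])  # [1, 3, 12, 60, 360]
```

<p>So for 6 points there are only 60 candidate orderings once duplicates are stripped; the remaining work is throwing out the self-intersecting ones.</p>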
<p>We can easily test for this and remove half the polygons as such:</p>
<pre class="r"><code>#test for duplicate polygons
duplicates < as.data.frame(st_equals(polygons, polygons)) %>%
#ignore self matches
filter(row.id != col.id) %>%
mutate(id = 1:n()) %>%
#remove the last 3 polygons
top_frac(0.5, id)
polygons < polygons[duplicates$row.id,]
#replot
p4 < ggplot() +
geom_sf(data = polygons, fill = "dodgerblue") +
theme_minimal() +
theme(axis.text = element_blank()) +
facet_wrap(~id)
p4</code></pre>
<p><img src="/post/20200628riddlerjune26_files/figurehtml/remove_duplicate_polygons1.png" width="672" /></p>
<p>And we have our 3 unique polygons. For a higher number of points n, we want to spin out and generalise two functions:</p>
<ul>
<li>one to create points on a ‘page’</li>
<li>one to build as many unique polygons as possible</li>
</ul>
<p>To create the points, we can take the previous code pretty much verbatim. I’ve added a second argument for how many points should lie within a perimeter triangle of points, though as far as I can see this will always be n-3 (where n &gt; 3).</p>
<pre class="r"><code>#take our previous code for any n
create_points < function(sides, within) {
points < data.frame(
x = runif(sides  within),
y = runif(sides  within)
)
perimeter < sf_polygon(points)
new_points < data.frame(
x = runif(within, min = min(points$x), max = max(points$x)),
y = runif(within, min = min(points$y), max = max(points$y))
)
while(length(unlist(st_contains(perimeter, sf_point(new_points)))) != within) {
new_points < data.frame(
x = runif(within, min = min(points$x), max = max(points$x)),
y = runif(within, min = min(points$y), max = max(points$y))
)
}
points < rbind(points, new_points) %>%
mutate(id = 1:n())
return(points)
}
#run to create a pentagon
five_points < create_points(5, 2)
#plot the five points
p5 < ggplot() +
geom_point(data = five_points, aes(x, y, fill = as.factor(id)),
shape = 21, colour = "black", size = 3) +
scale_fill_discrete(guide = FALSE) +
theme_minimal()
p5</code></pre>
<p><img src="/post/20200628riddlerjune26_files/figurehtml/generalise_point_creation1.png" width="672" /></p>
<p>The second function (to test how many polygons can be drawn) needs two minor tweaks. Polygons are created as before, but when testing for duplicates we now keep the smaller id of each duplicate pair, and we also use st_is_valid() to check that each polygon does not contain any self-intersections (where lines cross each other).</p>
<pre class="r"><code>#create polygons for n points
get_unique_polygons < function(points) {
#create polygons as before
sides < nrow(points)
routes < permutations(sides, sides, 1:sides) %>%
as.data.frame() %>%
filter(V1 == 1)
for(r in seq(nrow(routes))) {
nodes < as.numeric(routes[r,])
sf_points < points[nodes,]
sf < sf_polygon(sf_points)
if(r == 1) {
polygons < sf
} else {
polygons < rbind(polygons, sf)
}
}
polygons$id < 1:nrow(polygons)
#find duplicate polygons
duplicates < as.data.frame(st_equals(polygons, polygons)) %>%
filter(row.id != col.id) %>%
mutate(smaller = case_when(
row.id < col.id ~ row.id,
col.id < row.id ~ col.id
))
#always take the smaller id
polygons < polygons[polygons$id %in% duplicates$smaller,]
#test for valid polygons
#i.e. no selfintersections
polygons < polygons[st_is_valid(polygons),]
return(polygons)
}
#create pentagons
pentagons < get_unique_polygons(five_points)
#calculate and arrange by the area of each for aesthetics
pentagons$area < st_area(pentagons)
pentagons < pentagons %>%
arrange(area) %>%
mutate(id = 1:n())
#plot the unique pentagons
p6 < ggplot() +
geom_sf(data = pentagons, aes(fill = area)) +
scale_fill_continuous(guide = FALSE) +
theme_minimal() +
theme(axis.text = element_blank()) +
facet_wrap(~id, nrow = 2)
p6</code></pre>
<p><img src="/post/20200628riddlerjune26_files/figurehtml/create_polygons_function1.png" width="672" /></p>
<p>So for 5 points, the answer seems to be 8 unique polygons that can be drawn.</p>
<p>For higher n, I then ran these functions repeatedly and kept the largest number of polygons found for any random allocation of points. I’ve used n = 6 here, but the number can be anything. At n = 6 points it’s already struggling (my code here wasn’t written for efficiency) and at 7 it’s reaaalllly slow, so the loops can be made arbitrarily large and run while you make dinner/watch TV etc.</p>
<pre class="r"><code>#very dirty inefficient brute force code
all_n < c()
n_points < 6
for(i in 1:1){
#randomly create points
points < create_points(n_points, n_points3)
#build polygons from these
polygon < get_unique_polygons(points)
n < nrow(polygon)
all_n[i] < n
#report back from the loop
print(paste(i, "loops run"))
print(paste("biggest n so far is", max(all_n)))
print(all_n)
}</code></pre>
<pre><code>## [1] "1 loops run"
## [1] "biggest n so far is 24"
## [1] 24</code></pre>
<p>While this was running, to check I hadn’t missed anything, I tried to solve the problem logically (if not analytically). It seemed clear that you want as many points as possible within a larger perimeter triangle of 3 points. It also seemed like you want to make sure that no 3 points lie on a straight line (which would limit the number of possible connections between those points). For n = 6 I settled on a triangle-within-a-triangle, slightly offset (by rounding errors):</p>
<pre class="r"><code>#logic-created six points
six_points <- data.frame(
x = c(0, 1, 0.55, 0.25, 0.75, 0.45),
y = c(0, 0, sqrt(0.75), 0.22, 0.22, 0.65)
)
#plot the six points
p7 <- ggplot() +
geom_point(data = six_points, aes(x, y),
shape = 21, fill = "skyblue", colour = "black", size = 3) +
theme_minimal()
p7</code></pre>
<p><img src="/post/20200628riddlerjune26_files/figurehtml/six_points1.png" width="672" /></p>
<p>If we pass these points through our function we find that it can create 29 unique polygons (the same number I found from ~100 loops of my brute force attack). Plotting them as before, these are:</p>
<pre class="r"><code>#test the six points and munge
hexagons <- get_unique_polygons(six_points)
hexagons$area <- st_area(hexagons)
hexagons <- hexagons %>%
arrange(area) %>%
mutate(id = 1:n())
#plot
p8 <- ggplot() +
geom_sf(data = hexagons, aes(fill = area)) +
scale_fill_continuous(guide = FALSE) +
theme_minimal() +
theme(axis.text = element_blank()) +
facet_wrap(~id)
p8</code></pre>
<p><img src="/post/20200628riddlerjune26_files/figurehtml/create_hexagons1.png" width="672" /></p>
<p>This isn’t a proof, but I feel reasonably confident in this as the answer for the Classic.</p>
</div>
<div id="extracredit1" class="section level2">
<h2>Extra Credit</h2>
<p>As mentioned, we now want to find this for 7 points creating heptagons. Given we can now fit 4 spare points inside our original triangle, I decided to see what would happen if you stretched the triangle-within-a-triangle and placed the final point inside this.</p>
<pre class="r"><code>#logic-created seven points
#stretched y axis on point six
#point seven lies within new triangle
seven_points <- data.frame(
x = c(0, 1, 0.55, 0.25, 0.75, 0.5, 0.45),
y = c(0, 0, sqrt(0.75), 0.22, 0.22, 0.75, 0.65)
)
#munge our heptagons
heptagons <- get_unique_polygons(seven_points)
heptagons$area <- st_area(heptagons)
heptagons <- heptagons %>%
arrange(area) %>%
mutate(id = 1:n())
#aaaaand plot
p9 <- ggplot() +
geom_sf(data = heptagons, aes(fill = area)) +
scale_fill_continuous(guide = FALSE) +
theme_minimal() +
theme(axis.text = element_blank()) +
facet_wrap(~id)
p9</code></pre>
<p><img src="/post/20200628riddlerjune26_files/figurehtml/test_heptagon1.png" width="672" /></p>
<p>Again, running a brute force (though for fewer iterations, as seven points really stretches the inefficient code here), 91 polygons seems a common end point, which is pretty close. I think there’s probably only a very fine margin between arrangements giving 91 and 92 polygons, so I maybe got lucky. But who knows?</p>
</div>
</div>
<div id="finalanswers" class="section level1">
<h1>Final Answers</h1>
<div id="express" class="section level2">
<h2>Express</h2>
<div id="section" class="section level3">
<h3>0.5</h3>
</div>
<div id="extracredit0.27recurring" class="section level3">
<h3>extra credit: 0.27 recurring</h3>
</div>
</div>
<div id="classic" class="section level2">
<h2>Classic</h2>
<div id="section1" class="section level3">
<h3>29</h3>
</div>
<div id="section2" class="section level3">
<h3>92</h3>
</div>
</div>
</div>

#TidyTuesday - Building Stalk Portfolios with R
/post/how_i_made_my_millions/
Sun, 17 May 2020 00:00:00 +0000
/post/how_i_made_my_millions/
<p>Every Tuesday, the R4DataScience community posts a dataset online as part of <a href="https://github.com/rfordatascience/tidytuesday">#TidyTuesday</a> as practice wrangling and modelling data. For the week of 5th May 2020, the dataset concerned the video game <a href="https://en.wikipedia.org/wiki/Animal_Crossing">Animal Crossing</a>.</p>
<div id="intro" class="section level2">
<h2>Intro</h2>
<p><a href="https://www.youtube.com/watch?v=n10JC0BOWF8">Radiohead  How I Made My Millions</a></p>
<p>I don’t play Animal Crossing (unfortunately Nintendo Switches sold out as the UK went into lockdown), but it seems that everyone around me does so I’ve become fascinated by how it has created almost a surrogate life for people, performing manual tasks to pay off loans to Tom Nook, the nefarious bankster of the player’s island.</p>
<p>One aspect in particular that captured my attention was the weekly market for turnips on each player’s island. Every Sunday, the player has the opportunity to buy turnips from a salesperson, which they then have a week to sell (before the turnips rot). The prices of turnips fluctuate over the week and (as far as I know) the vegetables have little function outside of buying/selling, so in essence they work as stocks which can be bought once per week, and the player must clear before the next offering.</p>
<p>To formalise this (taken from <a href="https://animalcrossing.fandom.com/wiki/White_turnip">here</a>):</p>
<ul>
<li>players can buy as many turnips as they want from character a for price x per turnip* on Sunday morning</li>
<li>players can then sell as many or as few turnips as they want to character b for price y(t) from the morning until 10pm</li>
<li>prices vary depending on price y(t-1) and a given pattern, changing at the start of each day and then at midday each day (so there are 12 independent selling prices y through a week)</li>
<li>on the next Sunday everything resets</li>
<li>for the next week the pattern the prices follow may or may not be different</li>
</ul>
<p>*technically players buy turnips in bundles of 10, but it’s easier just to refer to the turnips rather than bundles</p>
<p>Modelling the movement of the turnip price leads to some interesting analysis using models from financial data science, and while this data wasn’t a part of #TidyTuesday, I think it’s related enough to potentially be of interest to people.</p>
<p>First, as always, let’s load the libraries we’ll need:</p>
<pre class="r"><code>#load libraries
library(tidyverse)
library(MASS)
library(conflicted)
#prefer tidyverse functions
preferred <- map(c("filter", "select", "lag"), conflict_prefer, "dplyr")
#source the functions governing turnips price dynamics
source("../../static/files/turnips/turnip_funs.R")</code></pre>
</div>
<div id="pricepatterns" class="section level2">
<h2>Price Patterns</h2>
<p>Looking into the code governing the price of turnips, it appears that on each Sunday, turnips are sold randomly for between 90 and 110 ‘bells’ (equivalent to and henceforth referred to as $), and then there are 4 independent ‘patterns’ of price evolution which last the whole week. After purchasing turnips the prices can either:</p>
<ul>
<li>fluctuate (go up and down around mean $100 purchase price)</li>
<li>‘spike’ upwards around midweek (where prices will decrease before shooting up past the ~$100 purchase price around Wednesday). This is actually two separate but similar patterns, where ‘large spike’ leads to greater selling prices than ‘small spike’</li>
<li>decrease, in which case the player has no hope of making a profit that week, as prices will fall every day before resetting on the Sunday</li>
</ul>
<p>For a few more details see <a href="https://animalcrossing.fandom.com/wiki/White_turnip">the game’s wiki</a> or the <a href="https://gist.github.com/Treeki/85be14d297c80c8b3c0a76375743325b">C++ code I ripped my functions from</a>. The R translations I use here can be found on the Github repo for my site <a href="https://github.com/RobWHickman/netlify_blog/blob/master/static/files/turnips/turnip_funs.R">here</a>.</p>
<p>The patterns are not completely memoryless, but progress as a Markov chain, where the probability of prices following a pattern next week depends on the current price pattern. If, however, we are only playing video games to relax and not paying close attention to virtual stock markets, the chance of seeing any given pattern can be estimated as the stationary probability of each node of the chain.</p>
<p>There are two ways to solve this. First (and easiest), since we have full knowledge of the transitions from week to week, we can solve analytically. The chance of seeing a pattern next week (column names) is related to the observed pattern this week (row names) in the following matrix:</p>
<pre class="r"><code>#the four patterns
states <- c("fluctuating", "large_spike", "decreasing", "small_spike")
#build the transition matrix between the states
transition_matrix <- matrix(
c(
0.2, 0.3, 0.15, 0.35,
0.5, 0.05, 0.2, 0.25,
0.25, 0.45, 0.05, 0.25,
0.45, 0.25, 0.15, 0.15
),
nrow = 4, byrow = TRUE)
#name the current (rows) and next (cols) states
rownames(transition_matrix) <- states
colnames(transition_matrix) <- states
transition_matrix</code></pre>
<pre><code>##             fluctuating large_spike decreasing small_spike
## fluctuating        0.20        0.30       0.15        0.35
## large_spike        0.50        0.05       0.20        0.25
## decreasing         0.25        0.45       0.05        0.25
## small_spike        0.45        0.25       0.15        0.15</code></pre>
<p>If we are a naive observer, the chance of observing any pattern is therefore the stationary distribution of the chain, found by taking the left eigenvector of this matrix corresponding to eigenvalue 1:</p>
<pre class="r"><code>#take the left eigenvector
#ginv from the MASS package
left_eigen <- ginv(eigen(transition_matrix)$vectors)[1,]
pattern_likelihood_analytic <- left_eigen / sum(left_eigen)
#name the probabilities
names(pattern_likelihood_analytic) <- states
pattern_likelihood_analytic</code></pre>
<pre><code>## fluctuating large_spike  decreasing small_spike
##   0.3462773   0.2473628   0.1476074   0.2587525</code></pre>
<p>We see that around half the time there is a chance of either a large or a small spike in prices around midweek (24.7% + 25.9%), with the majority of the remaining weeks showing a fluctuating pattern (where the player can still make a small profit). The worst-case scenario of continually decreasing prices happens only ~14.8% of the time, so overall the stalk market looks like a pretty good bet for investors.</p>
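<p>For anyone following along in Python rather than R, the same stationary distribution drops out of simple power iteration on the transition matrix above (no eigendecomposition needed), since repeatedly multiplying any starting distribution by the matrix converges to the stationary vector:</p>

```python
# Stationary distribution of the pattern Markov chain by power
# iteration: repeatedly multiply a distribution by the transition
# matrix until it stops changing (pure-Python cross-check of the
# eigenvector calculation).
P = [
    [0.20, 0.30, 0.15, 0.35],  # fluctuating ->
    [0.50, 0.05, 0.20, 0.25],  # large_spike ->
    [0.25, 0.45, 0.05, 0.25],  # decreasing ->
    [0.45, 0.25, 0.15, 0.15],  # small_spike ->
]
states = ["fluctuating", "large_spike", "decreasing", "small_spike"]

pi = [0.25] * 4  # start from the uniform distribution
for _ in range(200):
    pi = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]

print(dict(zip(states, (round(p, 4) for p in pi))))
# matches the analytic result: ~0.3463, 0.2474, 0.1476, 0.2588
```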
<p>Of course, we can also do this numerically using Monte Carlo methods, simulating a few sets of independent weeks and counting how often each pattern appears:</p>
<pre class="r"><code>#transition probabilities
transition_df <- as.data.frame(transition_matrix) %>%
rownames_to_column(var = "current_state") %>%
pivot_longer(cols = all_of(states), names_to = "next_state", values_to = "prob") %>%
group_by(current_state) %>%
mutate(cum_prob = cumsum(prob)) %>%
ungroup()
#get the next pattern from the current pattern
find_next_pattern <- function(pattern, rng, transitions = transition_df) {
next_transition <- transitions %>%
#find possible patterns
filter(current_state == pattern & cum_prob > rng) %>%
#take top row
.[1,]
#next state is that pattern
next_state <- next_transition$next_state
return(next_state)
}
#run forward for prop_forward weeks for each run to check convergence
transition_patterns <- function(initial_pattern, prop_forward) {
patterns <- c()
pattern <- initial_pattern
#run n times
for(runs in seq(prop_forward)) {
pattern <- find_next_pattern(pattern, runif(1))
patterns <- append(patterns, pattern)
}
#return as df
df <- data.frame(
initial_pattern,
pattern = as.character(patterns),
t = 1:prop_forward
)
return(df)
}
#repeat sims n times
simulation_reps <- 100
#how many weeks to run each sim for
prop_forward <- 10
#run the sims
pattern_likelihood <- states %>%
rep(simulation_reps) %>%
map_df(., transition_patterns, prop_forward) %>%
group_by(pattern) %>%
summarise(prob = n() / (simulation_reps * prop_forward * length(states)))
pattern_likelihood</code></pre>
<pre><code>## # A tibble: 4 x 2
##   pattern      prob
##   <chr>       <dbl>
## 1 decreasing  0.147
## 2 fluctuating 0.352
## 3 large_spike 0.244
## 4 small_spike 0.256</code></pre>
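<p>The same estimate can be sketched in Python by simulating the chain directly; with a seeded RNG and ~100,000 simulated weeks the observed frequencies land close to the analytic values:</p>

```python
# Estimate the stationary probabilities by running the chain forward
# and counting visits to each state (Monte Carlo, in Python).
import random
from collections import Counter

random.seed(42)
states = ["fluctuating", "large_spike", "decreasing", "small_spike"]
P = {
    "fluctuating": [0.20, 0.30, 0.15, 0.35],
    "large_spike": [0.50, 0.05, 0.20, 0.25],
    "decreasing":  [0.25, 0.45, 0.05, 0.25],
    "small_spike": [0.45, 0.25, 0.15, 0.15],
}

n_weeks = 100_000
state = "fluctuating"
counts = Counter()
for _ in range(n_weeks):
    # draw next week's pattern with probabilities from the current row
    state = random.choices(states, weights=P[state])[0]
    counts[state] += 1

freqs = {s: counts[s] / n_weeks for s in states}
print(freqs)  # close to the analytic 0.346 / 0.247 / 0.148 / 0.259
```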
<p>And we get pretty much the same numbers (as we would expect). To show the relative frequencies and how well our two methods of finding the stationary probabilities work, we can easily graph this using ggplot:</p>
<pre class="r"><code>p1 <- pattern_likelihood_analytic %>%
as.data.frame() %>%
rownames_to_column("pattern") %>%
left_join(pattern_likelihood, by = "pattern") %>%
rename(hmc = "prob", analytic = ".") %>%
pivot_longer(c("hmc", "analytic"), names_to = "calc", values_to = "prob") %>%
ggplot(aes(x = pattern, y = prob, group = calc)) +
geom_bar(stat = "identity", position = "dodge", aes(fill = calc), colour = "black") +
scale_fill_manual(values = c("dodgerblue", "orange")) +
labs(
title = "probability of observing any one price pattern when randomly sampling",
subtitle = "showing difference estimate from analytic and Monte Carlo methods",
x = "week's prices pattern",
y = "probability"
) +
theme_minimal()
p1</code></pre>
<p><img src="/post/20190517playing_the_stalk_market_files/figurehtml/plot_pattern_likelihood1.png" width="672" /></p>
<p>So, given the likelihood of spikes in prices, we know we’ve got a good chance of making some money by buying and selling turnips.</p>
</div>
<div id="modellingturnipsprices" class="section level2">
<h2>Modelling Turnips Prices</h2>
<p>To calculate exactly how much we might expect, it’s easiest, just to simulate the prices a load of times. We can do this by using a simple function that samples from <a href="https://gist.github.com/Treeki/85be14d297c80c8b3c0a76375743325b">the C++ code provided by Treeki</a> (for a translation into R which I’m using here see my Github <a href="https://github.com/RobWHickman/netlify_blog/blob/master/static/files/turnips/turnip_funs.R">here</a>).</p>
<p>The function randomly selects an initial (Sunday) price at which turnips are bought, and then runs the simulation code for a given pattern of prices. The second argument simply gives a list of names for the epochs (each day split into AM and PM, which will have different selling prices). I wrap the simulation up into a df because I find it easier to work with, though the real meat of the simulation is a vector of length 14 which contains the ‘two’ Sunday buying prices (which will be identical; it’s just easier for me to count this way), and the 12 selling prices from Monday AM to Saturday PM.</p>
<pre class="r"><code>#the epochs for buying and selling turnips
#14 epochs, 2 identical buying epochs, and 12 unique selling epochs
week <- c("sun", "mon", "tues", "wed", "thurs", "fri", "sat")
epochs <- paste(rep(week, each = 2), rep(c("AM", "PM"), 7))
#simulate a week of prices given a pattern
simulate_week <- function(pattern, epochs) {
#set up prices vector
sunday_price <- sample(90:110, 1)
initial_prices <- c(rep(sunday_price, 2), rep(0, 12))
#simulate pattern
if(pattern == "decreasing") {
week_prices <- sim_decreasing(
prices = initial_prices
)
} else if(pattern == "fluctuating") {
week_prices <- sim_fluctuating(
prices = initial_prices,
first_epochs = c(sample(0:6, 1), sample(2:3, 1))
)
} else if(pattern == "large_spike") {
week_prices <- sim_largespike(
prices = initial_prices,
rate = runif(1, 0.85, 0.95),
first_peak = sample(2:8, 1)
)
} else if(pattern == "small_spike") {
week_prices <- sim_smallspike(
prices = initial_prices,
first_peak = sample(1:8, 1)
)
}
#arrange df
weekly_prices <- data.frame(
day = epochs,
buy_price = sunday_price,
price = week_prices
)
return(weekly_prices)
}</code></pre>
<p>We can calculate how many times each pattern should be run by defining the number of simulations we want to run, and sampling price patterns, weighted by likelihood, from the df we calculated above.</p>
<p>Then we just have to pass the vector of 1000 sampled patterns to the function and do a little munging at the end. Afterwards we can get a sense of which days are most profitable for selling turnips by plotting the histogram of the return (how many times we have multiplied our original stock of $) on turnip investment.</p>
<pre class="r"><code>#how many simulations of weeks to run
simulation_reps <- 1000
prices <- pattern_likelihood %>%
#sample patterns by likelihood
sample_n(simulation_reps, weight = prob, replace = TRUE) %>%
.$pattern %>%
map_df(., simulate_week, epochs) %>%
mutate(return = price / buy_price,
day = factor(day, levels = epochs)) %>%
filter(!grepl("sun [A-Z]{2}", day)) %>%
group_by(day) %>%
mutate(mean_return = mean(return)) %>%
ungroup()
p2 <- ggplot(prices, aes(x = return, fill = mean_return)) +
geom_histogram(alpha = 0.8, colour = "black") +
geom_vline(xintercept = 1, linetype = "dashed", colour = "dodgerblue", size = 1) +
scale_fill_gradient2(low = "red", high = "green", mid = "black", midpoint = 1) +
scale_x_continuous(limits = c(0, 2)) +
theme_minimal() +
labs(
title = "which days yield greatest profits in the stalk market?",
subtitle = paste("based on", simulation_reps, "simulations"),
x = "return on turnip investment",
y = "observed count"
) +
facet_wrap(~day)
p2</code></pre>
<p><img src="/post/20190517playing_the_stalk_market_files/figurehtml/run_simulations1.png" width="672" /></p>
<p>So the period of Wednesday-Thursday seems to have the greatest mean profit, which we could have predicted, given that the two known profitable patterns both have their ‘spike’ around this time. For all days though, we can see a large hump under the break-even point (a return of 1), which is due to the fact that <em>all</em> patterns (even the profitable spike ones) have a random amount of decrease in prices over the week.</p>
<p>What might be most concerning to a turnip investor is that the mean return by the next weekend (which might be the next time they get to play the game) is well below break even (~0.8), so they are going to have to pay attention to the movement of prices during the week.</p>
<p>The histograms struggle to portray the movement of time across epochs, so I also wanted to plot the prices using the <a href="https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html">ggridges</a> package to produce density <a href="https://upload.wikimedia.org/wikipedia/en/7/70/Unknown_Pleasures_Joy_Division_LP_sleeve.jpg">joy plots</a> over time. Here we can see a bit more clearly that it’s only the long positive tails on the distributions which give us an expected return slightly above break even from Wednesday AM to Thursday PM:</p>
<pre class="r"><code>library(ggridges)
p3 <- ggplot(prices, aes(x = return, y = day, fill = mean_return)) +
geom_density_ridges2() +
geom_vline(xintercept = 1, linetype = "dashed", colour = "dodgerblue", size = 1) +
scale_fill_gradient2(low = "red", high = "green", mid = "black", midpoint = 1) +
scale_x_continuous(limits = c(0, 2.5)) +
labs(
title = "which days yield greatest profits in the stalk market?",
subtitle = paste("based on", simulation_reps, "simulations"),
x = "return on turnip investment",
y = "observed density by day"
) +
theme_minimal()
p3</code></pre>
<p><img src="/post/20190517playing_the_stalk_market_files/figurehtml/plot_ridgeline1.png" width="672" /></p>
</div>
<div id="astuteturnipinvestment" class="section level2">
<h2>Astute Turnip Investment</h2>
<p>Given the potential constraints on any person’s (even one quarantined at home’s) time, an astute investor of turnips may want to calculate which days they should check prices to ensure the greatest return on investment. This is simply done, but first we must introduce one last factor in the stalk market: the risk-free interest that banked money accrues.</p>
<p>In addition to performing manual tasks or investing in turnip stocks, the player can also bank their hard-earned money and collect the interest. As far as I can tell, banked money earns interest at a rate of 0.05%*, which is paid out monthly. To work with this a bit more easily, I’m going to make a slight tweak and calculate as if the interest were earned daily. Therefore, if we take time 0 to be a Sunday morning at the beginning of the month, by the following Saturday the player who put $100 in the bank will now have (1 + (0.05 / 100)) ^ (6/30) * 100 in their bank account (an extra 1 cent, if you calculate it).</p>
<p>The risk-free return over that one week will therefore have been 1 cent: it is the return the player receives without having to risk their money buying/selling turnips. For each day over the week, this risk-free return is easy to calculate. Because we are only interested in this compared to the returns on investing savings in turnips, the amount in the bank doesn’t actually matter; we only care about the interest gained as a proportion of savings.</p>
<p>*the FT has an <a href="https://www.ft.com/content/68f96d24-02f0-42fd-b132-aba0acba777f">article on</a> the recent Animal Crossing interest rate cut and why it forces players into riskier assets like turnips</p>
<pre class="r"><code>monthly_interest <- 1.005
interest_df <- data.frame(day = factor(epochs, levels = epochs)) %>%
mutate(interest_days = rep(0:6, each = 2)) %>%
mutate(interest_gained = (1 * (monthly_interest ^ (1/30)) ^ interest_days) - 1)
interest_df</code></pre>
<pre><code>##         day interest_days interest_gained
## 1    sun AM             0    0.0000000000
## 2    sun PM             0    0.0000000000
## 3    mon AM             1    0.0001662652
## 4    mon PM             1    0.0001662652
## 5   tues AM             2    0.0003325581
## 6   tues PM             2    0.0003325581
## 7    wed AM             3    0.0004988785
## 8    wed PM             3    0.0004988785
## 9  thurs AM             4    0.0006652267
## 10 thurs PM             4    0.0006652267
## 11   fri AM             5    0.0008316025
## 12   fri PM             5    0.0008316025
## 13   sat AM             6    0.0009980060
## 14   sat PM             6    0.0009980060</code></pre>
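<p>The compounding in this table is easy to sanity-check in Python, using the same monthly_interest = 1.005 as the R code above (note that this corresponds to a 0.5% monthly rate):</p>

```python
# Reproduce the daily-compounded risk-free interest from the table:
# convert the monthly rate to a daily rate, then compound over d days.
monthly_interest = 1.005

def interest_gained(days):
    # nominal daily rate equivalent to the monthly rate
    daily_rate = monthly_interest ** (1 / 30)
    return daily_rate ** days - 1

for day, d in [("mon", 1), ("wed", 3), ("sat", 6)]:
    print(day, round(interest_gained(d), 10))
```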
<p>We can then work out the return on investment compared to risk. The simplest way to do this is to use the <a href="https://en.wikipedia.org/wiki/Sharpe_ratio">Sharpe ratio</a> which can be formalized as:</p>
<p><span class="math display">\[S_{a} = \frac{E[R_{a} - R_{b}]}{\sqrt{var[R_{a} - R_{b}]}} \]</span></p>
<p>Here we calculate the Sharpe ratio S of an asset a as a function of the expected excess return (aka profit) R of that asset above the return of a ‘safe’ asset b (in this case the interest on money in the bank). This is then divided by the standard deviation (the square root of the variance) of the gain above the risk-free asset. We call the difference in return of the risky and safe assets, Ra - Rb, the ‘excess return’.</p>
<p>It should also be clear that we want a Sharpe ratio of <em>at least</em> greater than 0 to make our investment worthwhile (as a risky asset is time discounted and carries the possibility of prospect-like losses); generally we want a Sharpe ratio of around 1 to indicate a good investment.</p>
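<p>As a minimal sketch of the calculation (with made-up profit numbers, not the simulated ones), the Sharpe ratio is just the mean excess return divided by its standard deviation:</p>

```python
# Toy Sharpe ratio: profits are expressed as fractions of the
# investment (return - 1), matching the excess_return used below.
from statistics import mean, stdev

def sharpe_ratio(profits, risk_free):
    # excess return of the risky asset over the risk-free return
    excess = [p - risk_free for p in profits]
    return mean(excess) / stdev(excess)

# hypothetical weekly turnip profits vs ~0.1% risk-free interest
toy_profits = [-0.05, 0.40, -0.20, 0.10, -0.30]
print(round(sharpe_ratio(toy_profits, 0.001), 3))
```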
<p>As we have the interest gained per day, we can calculate the excess return by joining our interest df and taking the return by day for each simulation as Ra, which we average to find the expected and variance.</p>
<pre class="r"><code>#join in interest data
Sharpe_mean_returns <- prices %>%
left_join(interest_df, by = "day") %>%
#calculate excess return over safe asset
mutate(excess_return = (return - 1) - interest_gained) %>%
group_by(day) %>%
#calc numerator and denominator
summarise(mean_excess = mean(excess_return),
sd_excess = sd(excess_return)) %>%
mutate(sharpe_ratio = mean_excess / sd_excess)
select(Sharpe_mean_returns, day, sharpe_ratio)</code></pre>
<pre><code>## # A tibble: 12 x 2
##    day      sharpe_ratio
##    <fct>           <dbl>
##  1 mon AM        -0.274
##  2 mon PM        -0.369
##  3 tues AM       -0.158
##  4 tues PM       -0.0975
##  5 wed AM         0.157
##  6 wed PM         0.135
##  7 thurs AM       0.0574
##  8 thurs PM       0.0533
##  9 fri AM         0.0347
## 10 fri PM        -0.103
## 11 sat AM        -0.595
## 12 sat PM        -0.763</code></pre>
<p>So, as expected, the only epochs which show a positive Sharpe ratio are in the middle of the week, where prices spike. If we plot this we get a clearer indication of this:</p>
<pre class="r"><code>p4 <- ggplot(Sharpe_mean_returns, aes(x = day, y = sharpe_ratio, group = 1)) +
geom_line(colour = "dodgerblue") +
geom_point(size = 2, colour = "dodgerblue") +
geom_hline(yintercept = 0, linetype = "dashed", colour = "orange") +
labs(
title = "Sharpe ratio for selling turnip investment on a given epoch",
subtitle = paste("based on", simulation_reps, "simulations"),
x = "day",
y = "Sharpe ratio"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p4</code></pre>
<p><img src="/post/20190517playing_the_stalk_market_files/figurehtml/plot_sharpe_ratio1.png" width="672" /></p>
<p>Remember, we want a Sharpe ratio of around 1 to be confident our investment is a good idea, so for the casual investor, the stalk market doesn’t seem to give good value. If a player did want to dabble and could only dedicate so many hours per week to the game, the calculation suggests they should try to make time somewhere around Wednesday or Thursday to check in on the market to maximize their returns.</p>
</div>
<div id="globalizingthestalkmarket" class="section level2">
<h2>Globalizing the Stalk Market</h2>
<p>However, all is not lost for turnip investors! While each player lives and works on their <em>own</em> island, they are able to visit the islands of their friends and can buy and sell turnips there following the same rules. However, prices (both buying on Sunday and selling for the rest of the week) are independent between players’ islands, so a smart investor can sample many markets and sell in the one which offers the greatest price.</p>
<p>To simulate these open markets, we imagine a player has n friends who will let them visit, and calculate the potential profit made by an optimal investor (buying on the cheapest island and selling on the most profitable). We simulate each epoch on each island the same 1000 times and take the maximum potential return per epoch.</p>
<pre class="r"><code>#do the same for n islands
simulate_open_economies <- function(islands, pattern_likelihood, epochs) {
patterns <- pattern_likelihood %>%
sample_n(islands, weight = prob, replace = TRUE) %>%
.$pattern
#munge
all_prices <- map_df(patterns, simulate_week, epochs) %>%
#will always buy at lowest and sell at highest
mutate(buy_price = min(buy_price)) %>%
group_by(day) %>%
mutate(sell_price = max(price)) %>%
select(day, buy_price, sell_price) %>%
unique() %>%
mutate(islands)
return(all_prices)
}
#run on 1:n islands
n_islands <- 10
open_prices <- rep(1:n_islands, simulation_reps) %>%
map_df(simulate_open_economies, pattern_likelihood, epochs) %>%
group_by(islands) %>%
mutate(return = sell_price / buy_price,
day = factor(day, levels = epochs)) %>%
filter(!grepl("sun [A-Z]{2}", day)) %>%
group_by(islands, day) %>%
mutate(mean_return = mean(return)) %>%
ungroup()</code></pre>
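<p>The buy-side benefit of more islands can also be worked out exactly: assuming the Sunday price is uniform on the integers 90 to 110 as described above, the expected minimum over n islands follows from the tail-sum formula for the minimum of n independent draws. A quick Python sketch:</p>

```python
# Expected minimum Sunday buy price over n islands, where each island
# draws uniformly from the integers 90..110 (21 values). Uses
# E[min] = 90 + sum_k P(min >= 90 + k), with
# P(one draw >= 90 + k) = (21 - k) / 21.
def expected_min_buy_price(n_islands):
    return 90 + sum(((21 - k) / 21) ** n_islands for k in range(1, 21))

for n in [1, 2, 5, 10]:
    print(n, round(expected_min_buy_price(n), 2))
```

<p>With one island the expectation is the plain mean of 100 bells; it falls towards the floor of 90 as more islands are sampled, which is exactly the rightward shift in returns seen in the simulations.</p>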
<p>If we then plot the density of returns by simulation, coloured by the number of islands, we can see a clear rightward shift towards greater returns when a player has more friends. To think about this simply, if we imagine a player has infinite friends, they will <em>always</em> buy turnips for the minimum possible price ($90) and <em>always</em> sell them for the maximum possible price on that day.</p>
<pre class="r"><code>p5 <- ggplot(open_prices, aes(x = return, y = day, group = paste(day, islands),
fill = islands, colour = islands)) +
geom_density_ridges2(alpha = 0.2) +
geom_vline(xintercept = 1, linetype = "dashed", colour = "dodgerblue", size = 1) +
scale_fill_gradient(low = "yellow", high = "green") +
scale_colour_gradient(low = "yellow", high = "green") +
scale_x_continuous(limits = c(0, 2.5)) +
labs(
title = "which days yield greatest profits in the stalk market?",
subtitle = "by day and number of islands sampled",
x = "return on investment",
y = "day"
) +
theme_minimal()
p5</code></pre>
<p><img src="/post/20190517playing_the_stalk_market_files/figurehtml/plot_multiisland_prices1.png" width="672" /></p>
<p>What’s striking is that even with just 10 friends, a player can be pretty much guaranteed to always make a profit no matter which day they collude to all check their islands’ prices: the mean return on investment is clearly above 1.0 even by the following Saturday. The best potential returns are clearly still to be had midweek however, where a player can now clearly expect a doubling of their investment:</p>
<pre class="r"><code>open_prices %>%
group_by(islands, day) %>%
summarise(mean_return = mean(return)) %>%
arrange(-mean_return) %>%
head(n = 10)</code></pre>
<pre><code>## # A tibble: 10 x 3
## # Groups:   islands [2]
##    islands day      mean_return
##      <int> <fct>          <dbl>
##  1      10 thurs AM        2.65
##  2      10 fri AM          2.60
##  3      10 wed AM          2.59
##  4      10 wed PM          2.59
##  5      10 thurs PM        2.58
##  6       9 wed AM          2.58
##  7      10 tues PM         2.57
##  8      10 fri PM          2.56
##  9       9 wed PM          2.54
## 10       9 thurs AM        2.54</code></pre>
<p>If we use our Sharpe ratio calculation to work out when a player should collude with friends to all check their local turnip prices*, we might expect it to also suggest checking somewhere in this midweek spike. However, if we plot it, we find an unexpected result:</p>
<p>*even if we assume that quarantined players probably <em>can</em> manage to check prices more than once or twice a week, coordinating between multiple players is going to get very hard very quickly, so this constraint really will become a factor</p>
<pre class="r"><code>#calculate Sharpe ratio per island as before
Sharpe_mean_open_returns <- open_prices %>%
left_join(interest_df, by = "day") %>%
mutate(excess_return = (return - 1) - interest_gained) %>%
group_by(islands, day) %>%
summarise(mean_excess = mean(excess_return),
sd_excess = sd(excess_return)) %>%
mutate(sharpe_ratio = mean_excess / sd_excess)
#plot the sharpe ratio coloured by islands
p6 <- ggplot(Sharpe_mean_open_returns,
aes(x = day, y = sharpe_ratio, group = islands, colour = islands)) +
geom_line() +
geom_point(size = 2) +
geom_hline(yintercept = 0, linetype = "dashed", colour = "orange") +
labs(
title = "Sharpe ratio for selling turnip investment on a given epoch",
subtitle = "by day and number of islands sampled",
x = "day",
y = "Sharpe ratio"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p6</code></pre>
<p><img src="/post/20190517playing_the_stalk_market_files/figurehtml/sharpe_ratios1.png" width="672" /></p>
<p>As we <em>increase</em> the number of islands, the Sharpe ratio tilts towards selling <em>earlier</em> (when the mean returns are lower). Why do we get this result? To investigate we need to look at some real stock charts, which we can easily do via the <a href="https://cran.r-project.org/web/packages/tidyquant/index.html">tidyquant</a> package.</p>
<pre class="r"><code>library(tidyquant)
#get nintendo stock data
nintendo_symbol <- "7974.T"
nintendo <- getSymbols(nintendo_symbol,
from = '2019-06-01',
to = "2020-03-15",
auto.assign = FALSE) %>%
as.data.frame() %>%
rename_all(~gsub(paste0(nintendo_symbol, "."), "", .)) %>%
rownames_to_column("date") %>%
mutate(date = as.Date(date)) %>%
select(date, close = Close)
#plot the last 9 months of nintendo stock
p7 <- ggplot(nintendo, aes(x = date, y = close)) +
geom_line(size = 2, colour = "orange") +
labs(
title = "Nintendo stock prices",
subtitle = "data from June 2019 to March 2020",
x = "date",
y = "closing stock price"
) +
theme_minimal()
p7</code></pre>
<p><img src="/post/20190517playing_the_stalk_market_files/figurehtml/get_stock_prices1.png" width="672" /></p>
<p>Even though the Nintendo stock price has been fairly volatile over the past 9 months, it has held its value fairly steadily; it would be essentially impossible for a trader to double their investment over this time (cf. turnips in a week). We can drill down into the data by looking at the daily change in closing price (as a fraction of the price), using dplyr::lag():</p>
<pre class="r"><code>p8 <- nintendo %>%
mutate(daily_change = (close - lag(close)) / lag(close)) %>%
ggplot(aes(x = daily_change)) +
geom_histogram(fill = "orange", colour = "black") +
geom_vline(xintercept = 0, linetype = "dashed", colour = "dodgerblue") +
labs(
title = "daily changes in Nintendo stock prices",
subtitle = "data from June 2019 to March 2020",
x = "fractional daily change in price",
y = "count"
) +
theme_minimal()
p8</code></pre>
<p><img src="/post/20190517playing_the_stalk_market_files/figurehtml/show_stock_returns1.png" width="672" /></p>
<p>There are three things to note here:</p>
<ul>
<li>the daily change in prices (roughly) follows a normal distribution* (with a mean slightly above 0 over a given time frame)</li>
<li>the daily change in price is fairly small, i.e. the price is fairly stable</li>
<li>there is a greater downside risk in large price moves: you’re more likely to see a big reduction in price than a big increase (see the recent downturn in Nintendo stock due to COVID-19, for example**)</li>
</ul>
<p>*stock returns don’t follow a normal distribution if you rigorously test it, but it’s close enough to be useful</p>
<p>**yes, I know it has bounced back up, the data was selectively chosen to make a point</p>
<p>These are important basic heuristics for portfolio building and we can see that our stalk market fails on all three. Luckily, we can account for the fact that our turnip returns are skewed (even if they are skewed upwards, instead of downwards as in the real-life data).</p>
<p>Instead of using the Sharpe ratio, which considers the total standard deviation of the returns, we can use the <a href="https://en.wikipedia.org/wiki/Sortino_ratio">Sortino ratio</a>, which is a risk-adjusted version controlling for the downside risk of an investment (i.e. being more likely to make big losses than big gains). We know that with multiple friends, we can be pretty confident of making big returns. The ratio is given by</p>
<p><span class="math display">\[S_{a} = \frac{E(R_{a} - MAR)}{\sqrt{\frac{1}{n}\cdot\int_{-\infty}^{MAR}{(MAR - R_{a})^2}\,dR}} \]</span></p>
<p>which, ok, looks pretty rough, but is simple enough to calculate.</p>
<p>The numerator is just the same as the Sharpe ratio numerator, except instead of the returns on asset a vs. a risk-free asset, we're now calculating the returns vs. a Minimal Acceptable Return (MAR). Re-imagine our scenario where someone only has x hours spare to play Animal Crossing: they aren't going to go through the stress and commitment of playing the stalk market without making <em>at least</em> MAR returns (where MAR is some number).</p>
<p>The denominator also looks more complicated than the Sharpe ratio's, but remember, for that we want to find</p>
<p><span class="math display">\[denom_{Sharpe} = \sqrt{var[R_{a} - R_{b}]} = sd[R_{a} - R_{b}]\]</span></p>
<p>which is what we're calculating here, except we are limiting the standard deviation to the <em>downside risk</em> - we only take the standard deviation of returns which fall beneath the MAR (hence the integral only runs up to the MAR).</p>
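<p>To make the formula concrete, here is a small Python sketch of the same calculation (the returns and MAR below are made-up illustrative numbers, not the turnip data):</p>

```python
import math

def sortino_ratio(returns, mar):
    """Mean excess return over the MAR, divided by the downside deviation."""
    n = len(returns)
    mean_excess = sum(r - mar for r in returns) / n
    # downside deviation: only returns falling short of the MAR contribute
    downside_sq = sum((mar - r) ** 2 for r in returns if r < mar) / n
    return mean_excess / math.sqrt(downside_sq)

# hypothetical weekly returns (sell price / buy price - 1)
returns = [0.5, 2.0, -0.3, 1.2, 4.0, -0.6, 0.8, 1.5]
print(round(sortino_ratio(returns, mar=1.0), 3))
```

<p>A ratio near 0 means the expected return only just covers the MAR; a negative ratio means the target is more likely missed than met.</p>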
<p>For instance, let’s say we want an excess return of 1, i.e. we want to judge the profitability of checking certain epochs to at least double our initial investment on turnips:</p>
<pre class="r"><code>#want to double investment so MAR = 1
MAR <- 1
#calc Sortino ratio
Sortino_ratio <- open_prices %>%
group_by(day, islands) %>%
mutate(excess_return = return - 1) %>%
summarise(
#numerator
mean_excess = mean(excess_return - MAR),
#denominator squared for readability
downside_sq = sum((MAR - excess_return[excess_return < MAR])^2/n())
) %>%
#calc
mutate(sortino_ratio = mean_excess / sqrt(downside_sq))
#plot the Sortino ratio by epoch
p9 <- ggplot(Sortino_ratio, aes(x = day, y = sortino_ratio, group = islands, colour = islands)) +
geom_line() +
geom_point(size = 2) +
geom_hline(yintercept = 0, linetype = "dashed", colour = "orange") +
labs(
title = "Sortino ratio for selling turnip investment on a given epoch",
subtitle = "by day and number of islands sampled",
x = "day",
y = "Sortino ratio"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p9</code></pre>
<p><img src="/post/2019-05-17-playing_the_stalk_market_files/figure-html/sortino_ratio-1.png" width="672" /></p>
<p>And we can see that, compared to the Sharpe ratio, we are now being advised to preferentially check on prices midweek. In addition, it gives us a nice idea of how many friends we need to round up to (in this example) double our investment. With 4/5 friends who coordinate with us, we have a reasonable chance (a Sortino ratio ~0) of doubling our money, but to be encouraged (generally a ratio of 1 is at least desired) we want 7 or so.</p>
<p>Of course, the MAR is not an absolute, it depends on how much risk a trader is willing to take, or in this case, how much motivation a video game player needs to try to play the stalk market. We can reproduce this plot easily for a range of MARs as follows:</p>
<pre class="r"><code>#range of MARs to test
MARs <- c(-0.5, 1, 2, 6)
#same as above
multiple_sortinos <- map_dfr(seq(length(MARs)), ~open_prices) %>%
mutate(MAR = rep(MARs, each = nrow(open_prices))) %>%
select(day, islands, return, MAR) %>%
group_by(day, islands, MAR) %>%
mutate(excess_return = return - 1) %>%
summarise(
mean_excess = mean(excess_return - MAR),
downside_sq = sum((MAR - excess_return[excess_return < MAR])^2/n())
) %>%
mutate(sortino_ratio = mean_excess / sqrt(downside_sq))
#plot faceted as above
p10 <- ggplot(multiple_sortinos, aes(x = day, y = sortino_ratio, group = islands, colour = islands)) +
geom_line() +
geom_point(size = 2) +
geom_hline(yintercept = 0, linetype = "dashed", colour = "orange") +
labs(
title = "Sortino ratio for selling turnip investment on a given epoch",
subtitle = "by day and number of islands sampled",
x = "day",
y = "Sortino ratio"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
facet_wrap(~MAR, scales = "free")
#plot - note the free y scales
p10</code></pre>
<p><img src="/post/2019-05-17-playing_the_stalk_market_files/figure-html/calc_sortinos-1.png" width="672" /></p>
<p>There are two things to really take away here. First, as we increase the MAR, the ratio decreases. This is because the distribution of returns stays constant, so as a player demands more and more return, the ratio is more and more pessimistic about their chances. Given that the maximum possible profit is roughly buying turnips for $90 and selling for $600 ($510 profit, an excess return of 5.6x our initial investment), demanding a MAR of 6 is literally impossible even with infinite friends' islands, and for our relatively small number, the ratio approaches its limit at -1.</p>
<p>Conversely, if a player only wanted to buy turnips for fun and didn't mind losing 50% of their initial investment (a MAR of -0.5), they can be reasonably confident of being fine even just checking their own island inconsistently. By the time they have a few friends' islands thrown in, they are guaranteed to make that much, and the ratio goes to infinity (there is no downside risk).</p>
<p>That’s all for this post. As I’ve mentioned, thanks to <a href="https://gist.github.com/Treeki/85be14d297c80c8b3c0a76375743325b">Treeki</a> for putting the turnip pricing mechanism code online, and if you want to play with it, my R translation is hosted on the <a href="https://github.com/RobWHickman/netlify_blog/blob/master/static/files/turnips/turnip_funs.R">Github repo</a> for this site. Thanks for reading, and also to the organizers of #TidyTuesday for giving me the idea :)</p>
</div>

An Introduction to Modelling Soccer Matches in R (part 2)
/post/dixon_coles_2/
Sun, 26 Apr 2020 00:00:00 +0000
/post/dixon_coles_2/
<p><em>I wrote this one pretty quickly compared to part 1 (which goes much deeper into mathematical concepts), and only realized after how much of a similarity it has to many of <a href="https://twitter.com/Torvaney?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor">Ben Torvaney's</a> posts on the subject. This probably isn't a coincidence given how much I've used his work previously in posts on this blog. Any imitation here is meant as flattery. The purpose of this post is really as a bridge between what I really want to write about - the maths behind the models in part 1, and extensions of these models into other distributions in parts 3-n - so it might be a little derivative of stuff written elsewhere.</em></p>
<p>On this blog I enjoy explaining various concepts from the more academic side of football analytics. One of my favourite results from this field are the papers on predicting future soccer matches based on limited information about past matches. Roughly 1 year ago I published <a href="https://www.roberthickman.eu/post/dixon_coles_1/">part 1</a> in a series on this and never got round to writing part 2 (of hopefully 2 or 3 more).</p>
<p>In the first post, we saw how we can use the Poisson distribution to estimate the relative strengths of teams in a hypothetical summer league between Arsenal, Blackburn Rovers, Coventry City, Dover Athletic, Enfield Town, and Frimley Green. Now we want to move onto actually <em>using</em> these estimates to predict matches, and eventually, whole leagues.</p>
<p>A good way to sum up this post in one line is a quote (mis)attributed to Niels Bohr:</p>
<div id="itsdifficulttomakepredictionsespeciallyaboutthefuture" class="section level3">
<h3><em>“It’s Difficult to Make Predictions, Especially About the Future”</em></h3>
<p>We've made our predictions about the past (estimating the relative strengths of teams based on past results), now we need to predict the future. I think it also nicely captures that even our predictions about the past are noisy - we cannot ever truly know the exact strengths of football teams; the job of analytics is to estimate these as accurately as possible. But any noise in those past predictions will be carried forward and amplified when predicting the future.</p>
<p>Onward to the code, first as always, loading libraries and setting a seed for reproducibility:</p>
<pre class="r"><code>library(tidyverse)
library(ggrepel)
set.seed(3459)</code></pre>
<p>We're then going to load all the stuff we prepped and predicted in the last post. Remember the α parameter below refers to a team's attacking strength (the relative number of goals they are expected to score), and the β parameter refers to a team's defensive strength (the inverse of the relative number of goals they are expected to concede). Finally, γ refers to the extra advantage of playing at home.</p>
<p>(all these files are on the github repo for this site)</p>
<pre class="r"><code>fixtures <- readRDS("../../static/files/dc_2/dc_fixtures.rds")
results <- readRDS("../../static/files/dc_2/dc_results.rds")
model <- readRDS("../../static/files/dc_2/dc_model.rds")
model</code></pre>
<pre><code>## $alpha
## Arsenal Blackburn_Rovers Coventry_City Dover_Athletic
## 1.1106558 0.6370160 0.3023048 -0.2875353
## Enfield_Town Frimley_Green
## -0.3767038 -1.3857376
##
## $beta
## Arsenal Blackburn_Rovers Coventry_City Dover_Athletic
## 0.6457175 0.4289270 0.3647815 -0.1362931
## Enfield_Town Frimley_Green
## -0.3852812 -0.9178517
##
## $gamma
## gamma
## 0.189462</code></pre>
<p>We’ll define a quick function to do our prediction. For a quick explanation of exactly why it’s coded as presented, see <a href="https://www.roberthickman.eu/post/dixon_coles_1/#Tinkering">the previous post</a>, under the title ‘Tinkering’.</p>
<p>For a given string of a home team and an away team, the function finds the relevant parameters from a third argument (param_list) and calculates the expected goals for each team.</p>
<pre class="r"><code>predict_results <- function(home, away, param_list) {
e_goals_home <- exp(param_list$alpha[home] - param_list$beta[away] + param_list$gamma)
e_goals_away <- exp(param_list$alpha[away] - param_list$beta[home])
df <- data.frame(home = home, away = away,
e_hgoal = as.numeric(e_goals_home),
e_agoal = as.numeric(e_goals_away))
return(df)
}</code></pre>
<p>If we run this for two example teams:</p>
<pre class="r"><code>#two example teams
home <- "Blackburn_Rovers"
away <- "Arsenal"
prediction <- predict_results(home, away, model)
prediction</code></pre>
<pre><code>## home away e_hgoal e_agoal
## 1 Blackburn_Rovers Arsenal 1.198128 1.977293</code></pre>
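<p>As a quick cross-check (not part of the R workflow), the same arithmetic can be reproduced in Python with the parameters printed from the model above:</p>

```python
import math

# alpha/beta/gamma copied from the fitted model printed above
alpha = {"Arsenal": 1.1106558, "Blackburn_Rovers": 0.6370160}
beta = {"Arsenal": 0.6457175, "Blackburn_Rovers": 0.4289270}
gamma = 0.189462

def predict_results(home, away):
    # home expected goals get the home-advantage term, away goals do not
    e_hgoal = math.exp(alpha[home] - beta[away] + gamma)
    e_agoal = math.exp(alpha[away] - beta[home])
    return e_hgoal, e_agoal

e_h, e_a = predict_results("Blackburn_Rovers", "Arsenal")
print(round(e_h, 3), round(e_a, 3))  # matches the R output above: 1.198 1.977
```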
<p>We can see that it gives Arsenal (the away team) a slightly more optimistic chance than Blackburn. The expected goals for each team can of course be rewritten as the mean, which in our Poisson model refers to λ (lambda) - the mean number of times an event (a goal) happens per time interval (a match). We also set a maximum number of possible goals (7 in this case*) to bound the area under the distribution so we aren't sampling forever.</p>
<p>*sharp readers might notice that this is actually <em>lower</em> than the lambda for our more extreme cases (e.g. Arsenal at home to Frimley Green), but for realistic matches (even between wildly different professional sides) this is a fair enough assumption.</p>
<p>We then use dpois() to calculate the probability of this Poisson function returning a value (0:7 goals) given its lambda value. So if we run this over the prediction we made for Blackburn Rovers vs. Arsenal we get:</p>
<pre class="r"><code>#set a limit of where we'll calculate across
max_goals <- 7
#calculate the probability of scoring x goals for either team
blackburn_goal_probs <- lapply(0:max_goals, dpois, lambda = prediction$e_hgoal)
arsenal_goal_probs <- lapply(0:max_goals, dpois, lambda = prediction$e_agoal)
#bind together in a df
df <- data.frame(goals = rep(0:max_goals, 2),
team = rep(c(home, away), each = max_goals+1),
p = c(unlist(blackburn_goal_probs), unlist(arsenal_goal_probs)))
#plot the p of scoring x goals for either team
p1 <- ggplot(df, aes(x = goals, y = p, fill = team)) +
geom_density(stat = "identity", alpha = 0.5) +
scale_fill_manual(values = c("red", "blue")) +
labs(title = "Predicted goals for Blackburn Rovers and Arsenal",
y = "probability") +
theme_minimal()
p1</code></pre>
<p><img src="/post/2019-3-05-dixon-coles-2_files/figure-html/predict_mach_goals-1.png" width="672" /></p>
<p>Because of how maths works, these curves are the same result we would get if we ran rpois() (sampling from the Poisson function) lots of times. We’ll do that quickly because it sets the stage nicely for what will come later.</p>
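<p>The same equivalence can be checked outside R too; here is a stdlib-only Python sketch, with a hand-rolled inverse-transform sampler standing in for rpois():</p>

```python
import math
import random

def poisson_pmf(k, lam):
    # analytic probability of k goals, the equivalent of dpois()
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_sample(lam, rng=random):
    # inverse-transform sampling, the equivalent of a single rpois() draw
    u, k, cum = rng.random(), 0, poisson_pmf(0, lam)
    while u > cum:
        k += 1
        cum += poisson_pmf(k, lam)
    return k

random.seed(3459)  # same seed habit as the post
lam = 1.977  # Arsenal's expected goals from the prediction above
samples = [poisson_sample(lam) for _ in range(50_000)]
empirical = samples.count(0) / len(samples)
print(abs(empirical - poisson_pmf(0, lam)) < 0.01)  # frequencies converge to the pmf
```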
<pre class="r"><code>#sample from the function lots of times for each team
n <- 100000
blackburn_goals_samples <- rpois(n, lambda = prediction$e_hgoal)
arsenal_goals_samples <- rpois(n, lambda = prediction$e_agoal)
df <- data.frame(team = rep(c(home, away), each = n),
sampled_goals = c(blackburn_goals_samples, arsenal_goals_samples))
#look its the same plot!
p2 <- ggplot(df, aes(x = sampled_goals, fill = team)) +
geom_bar(stat = "count", position = "dodge", colour = "black", alpha = 0.5) +
geom_line(aes(colour = team), stat = "count", size = 3) +
scale_fill_manual(values = c("red", "blue"), guide = FALSE) +
scale_colour_manual(values = c("red", "blue"), guide = FALSE) +
labs(title = "Predicted goals for Blackburn Rovers and Arsenal",
y = "probability",
x = "sampled goals") +
theme_minimal() +
theme(axis.text.y = element_blank())
p2</code></pre>
<p><img src="/post/2019-3-05-dixon-coles-2_files/figure-html/monte_carlo_poisson-1.png" width="672" /></p>
<p>Ok great! In terms of predicting the result, the rightwards shift of the red (Arsenal) curve here is the difference in the teams' ability to generate a positive goal differential - it makes it more likely that if we sample an event, Arsenal will have scored more goals than Blackburn Rovers at the end of the match. Of course, it's also obvious that while Arsenal's curve is right-shifted, the bars for Arsenal scoring 0 goals and Blackburn scoring 6 are still sizable enough that an upset isn't outside the realm of possibility.</p>
<p>This is a nice way of presenting the chance of each team scoring n goals, but doesn’t really help us in predicting the result of a match given that this relies on the interaction of both these distributions (we need to know how many goals BOTH Arsenal AND Blackburn will score).</p>
<p>To calculate this, we can do an outer product of the probabilities for both teams scoring n goals. We can then plot the probability of each <em>scoreline</em> as a tile plot:</p>
<pre class="r"><code>#calculate matrix of possible results and probabilities of those
matrix <- outer(unlist(arsenal_goal_probs), unlist(blackburn_goal_probs)) %>%
as.data.frame() %>%
gather() %>%
#add in scorelines
mutate(hgoals = rep(0:max_goals, max_goals+1),
agoals = rep(0:max_goals, each = max_goals+1))
#make the tile plot
p3 <- ggplot(matrix, aes(x = hgoals, y = agoals, fill = value)) +
geom_tile() +
geom_text(aes(label = paste(hgoals, agoals, sep = "-"))) +
scale_fill_gradient2(low = "white", high = "red", guide = FALSE) +
theme_minimal()
p3</code></pre>
<p><img src="/post/2019-3-05-dixon-coles-2_files/figure-html/example_tileplot-1.png" width="672" /></p>
<p>Where we can see that the most common scorelines are low scoring (football is a low scoring game), and slightly biased towards away goals (i.e. Arsenal are more likely to win than lose). The darkest (most likely) tiles being 1-1 or a 2-1 Arsenal win seems very plausible given our calculated λs earlier.</p>
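<p>The outer-product step itself is tiny; here is a Python sketch of the same grid using the two λs from the Blackburn vs Arsenal prediction (it just mirrors the logic, it is not the post's R code):</p>

```python
import math

def pois_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam_home, lam_away = 1.198, 1.977  # Blackburn (home), Arsenal (away)
max_goals = 7

# joint probability of each scoreline, assuming the two counts are independent
grid = {(h, a): pois_pmf(h, lam_home) * pois_pmf(a, lam_away)
        for h in range(max_goals + 1) for a in range(max_goals + 1)}

print(max(grid, key=grid.get))  # the most likely single scoreline
print(round(sum(p for (h, a), p in grid.items() if a > h), 2))  # P(away win) in the bounded grid
```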
<p>We can then do this for every fixture and build a large graph of the expected results for each using a simple map2_df apply. Because of the huge plot, I've restricted it here to a 3x3 matrix of the results for Arsenal, Coventry City, and Enfield Town, but if you click you should be linked to the full image.</p>
<pre class="r"><code>#want to predict over the whole fixture space
all_fixtures <- bind_rows(fixtures, results) %>%
filter(!duplicated(paste(home, away), fromLast = TRUE))
#get the lambda for each team per game
predictions <- map2_df(all_fixtures$home, all_fixtures$away,
predict_results,
model)
#calc out probabilities and bind up
all_predictions <- map2_df(
predictions$e_hgoal, predictions$e_agoal,
function(lambda_home, lambda_away, max_goals) {
hgoal_prob <- dpois(0:max_goals, lambda_home) %>% `names<-`(0:max_goals)
agoal_prob <- dpois(0:max_goals, lambda_away) %>% `names<-`(0:max_goals)
outer(hgoal_prob, agoal_prob) %>%
as.data.frame() %>%
gather() %>%
rownames_to_column("row") %>%
mutate(hgoal = as.numeric(row) %% (max_goals+1) - 1) %>%
mutate(hgoal = case_when(hgoal < 0 ~ max_goals, TRUE ~ hgoal),
agoal = as.numeric(key)) %>%
select(sample_hgoal = hgoal, sample_agoal = agoal, prob = value)
}, max_goals) %>%
cbind(all_fixtures[rep(seq_len(nrow(all_fixtures)), each=(max_goals+1)^2),], .) %>%
group_by(home, away) %>%
mutate(prob = prob / sum(prob)) %>%
ungroup()
#plot again
p3 <- all_predictions %>%
#filter only a few out to scale plot
filter(home %in% c("Arsenal", "Coventry_City", "Enfield_Town"),
away %in% c("Arsenal", "Coventry_City", "Enfield_Town")) %>%
ggplot(aes(x = sample_hgoal, y = sample_agoal, fill = prob)) +
geom_tile() +
geom_point(aes(x = hgoal, y = agoal),
colour = "blue", size = 5, alpha = 0.5 / max_goals^2) +
geom_text(aes(label = paste(sample_hgoal, sample_agoal, sep = "-")), size = 2.3) +
scale_fill_gradient2(low = "white", high = "red", guide = FALSE) +
labs(
title = "predictions for final score across all fixtures",
y = "away goals",
x = "home goals") +
theme_minimal() +
facet_grid(away~home, scales = "free")
p3</code></pre>
<p><img src="/post/2019-3-05-dixon-coles-2_files/figure-html/plot_all_matrix-1.png" width="672" /></p>
<p>For the whole matrix, click <a href="https://www.roberthickman.eu/img/results_matrix.png">here</a></p>
</div>
<div id="sowhat" class="section level2">
<h2>So what?</h2>
<p>These graphs are nice, but what's important is what they show: <em>we have a way to quantify how likely any result is in a match between two given teams</em>. Why is this useful?</p>
<ul>
<li>Firstly, we can use the output of this to build betting models. Given the odds on final scores for any match, we can hedge effectively by betting on (e.g.) the five overwhelmingly most likely results.</li>
<li>Secondly, we can simulate leagues. This is <a href="https://www.bloomberg.com/graphics/2020-coronavirus-european-football/">perhaps especially of interest given the context of writing this post</a>. I'm going to focus on this application because I don't bet on football, and also because it's hard to get a nice database of odds at the moment given the aforementioned situation.</li>
</ul>
<p><a href="https://www.youtube.com/watch?v=DYb8gSwDrM">The Verve - Monte Carlo</a></p>
<p>We can do this using a technique called <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method">Monte Carlo simulation</a>. There are lots of good explanations of the technique on the internet, but it basically boils down to this:</p>
<div id="ifeventsfollowaknowndistributionyoucansampletheseeventslotsoftimestogetstochasticguesstimatesbutovermanysamplesyouwillreproduceexactlythatdistribution" class="section level3">
<h3>"if events follow a known distribution*, you can sample these events lots of times to get stochastic guesstimates, but over many samples you will reproduce exactly that distribution"</h3>
<p>*a Poisson distribution for the expected number of goals scored in our case</p>
<p>For football, this means that while on an individual match level results are noisy (sometimes better teams lose!), if we simulate matches lots and lots of times, eventually they should converge to the ‘truth’*</p>
<p>*as defined by our Poisson distribution (which may or may not be a good/accurate ‘truth’ but go with it for now).</p>
<p>To work with this highly repetitive data, first we want to 'nest' the probabilities for each match. This basically means storing a df of all the possible results and their probabilities as a column inside a larger df so we can move between the data in those two structures more easily.</p>
<p>For instance, the nested result-probability information for the next match to be played (Coventry City at home to Arsenal):</p>
<pre class="r"><code>nested_probabilities <- all_predictions %>%
filter(is.na(hgoal)) %>%
select(-hgoal, -agoal) %>%
nest(probabilities = c(sample_hgoal, sample_agoal, prob))
nested_probabilities$probabilities[[1]] %>%
rename("Coventry City" = sample_hgoal, "Arsenal" = sample_agoal) %>%
arrange(-prob) %>%
#show first 15 rows
.[1:15,]</code></pre>
<pre><code>## # A tibble: 15 x 3
## `Coventry City` Arsenal prob
## <dbl> <dbl> <dbl>
## 1 0 2 0.115
## 2 0 1 0.109
## 3 1 2 0.0983
## 4 1 1 0.0933
## 5 0 3 0.0806
## 6 1 3 0.0691
## 7 0 0 0.0516
## 8 1 0 0.0442
## 9 0 4 0.0425
## 10 2 2 0.0422
## 11 2 1 0.0400
## 12 1 4 0.0364
## 13 2 3 0.0296
## 14 2 0 0.0190
## 15 0 5 0.0179</code></pre>
<p>The probability for any single result is small (otherwise match betting would be easy), but the probabilities for a 2-0 and a 1-0 Arsenal win are highest (as we found earlier). Indeed, all of the most likely results are within a goal or two either side of these.</p>
<p>To make sure these probabilities makes sense, we can sum them and see that the results space of 0:max_goals for either side sums to 1</p>
<pre class="r"><code>sum(nested_probabilities$probabilities[[1]]$prob)</code></pre>
<pre><code>## [1] 1</code></pre>
<p>Then we can easily use this data to simulate results. We sample a single row (a 'result' of the match) weighted by the probability of it occurring. For instance, when we sample from the Coventry City vs Arsenal match it picks a 3-1 Arsenal away win (not the likeliest result, but not the most unlikely either).</p>
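<p>Under the hood, sample_n(1, weight = prob) is just a weighted draw; a Python equivalent with random.choices() (the scorelines and probabilities below are the top few rows of the nested table above):</p>

```python
import random

# (Coventry goals, Arsenal goals) -> probability, from the nested table above
results = [(0, 2), (0, 1), (1, 2), (1, 1), (0, 3)]
probs = [0.115, 0.109, 0.0983, 0.0933, 0.0806]  # weights need not sum to 1

random.seed(3459)
# draw a single 'result' of the match, weighted by how likely it is
sampled = random.choices(results, weights=probs, k=1)[0]
print(sampled)
```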
<pre class="r"><code>nested_probabilities$probabilities[[1]] %>%
rename("Coventry_City" = sample_hgoal, "Arsenal" = sample_agoal) %>%
sample_n(1, weight = prob)</code></pre>
<pre><code>## # A tibble: 1 x 3
## Coventry_City Arsenal prob
## <dbl> <dbl> <dbl>
## 1 1 3 0.0691</code></pre>
<p>We can of course repeat this across every match and see that the probabilities of the chosen results vary (because we're randomly sampling we won't always choose the most likely, or even a likely result), but all are within a reasonable range given the teams playing:</p>
<pre class="r"><code>nested_probabilities %>%
mutate(sampled_result = map(probabilities, sample_n, 1, weight = prob)) %>%
select(-probabilities) %>%
unnest(cols = c(sampled_result))</code></pre>
<pre><code>## # A tibble: 6 x 6
## home away gameweek sample_hgoal sample_agoal prob
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Coventry_City Arsenal 9 0 5 0.0179
## 2 Blackburn_Rovers Dover_Athletic 9 1 1 0.0575
## 3 Frimley_Green Enfield_Town 9 0 4 0.0418
## 4 Arsenal Blackburn_Rovers 10 2 1 0.0966
## 5 Coventry_City Frimley_Green 10 3 0 0.170
## 6 Dover_Athletic Enfield_Town 10 2 1 0.0839</code></pre>
<p>But when we are predicting what will happen, we want to find the <em>most likely</em> result. As mentioned earlier, if we sample enough, our average will converge towards this, so we can repeat this sampling technique n times (here I’ve done it 10 times), depending on how much time we want to wait for it to process.</p>
<p>You can see that as we do this many times, the results with the highest probability turn up more than others, as we would expect if we were to (e.g.) actually play Coventry City vs Arsenal many times.</p>
<pre class="r"><code>rerun(10, nested_probabilities %>%
filter(home == "Coventry_City" & away == "Arsenal") %>%
mutate(sampled_result = map(probabilities, sample_n, 1, weight = prob)) %>%
select(-probabilities) %>%
unnest(cols = c(sampled_result))
) %>%
bind_rows() %>%
arrange(-prob)</code></pre>
<pre><code>## # A tibble: 10 x 6
## home away gameweek sample_hgoal sample_agoal prob
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Coventry_City Arsenal 9 0 2 0.115
## 2 Coventry_City Arsenal 9 1 2 0.0983
## 3 Coventry_City Arsenal 9 1 1 0.0933
## 4 Coventry_City Arsenal 9 0 3 0.0806
## 5 Coventry_City Arsenal 9 1 3 0.0691
## 6 Coventry_City Arsenal 9 1 0 0.0442
## 7 Coventry_City Arsenal 9 0 4 0.0425
## 8 Coventry_City Arsenal 9 0 4 0.0425
## 9 Coventry_City Arsenal 9 0 4 0.0425
## 10 Coventry_City Arsenal 9 1 5 0.0154</code></pre>
<p>If we do this a few more times per fixture (here 100; for a better estimate I'd advise at least 10000 - it should only take a few minutes), we can then start assigning points and goal difference to each team based upon the result we've sampled. E.g. if one sample predicts Arsenal to beat Blackburn Rovers 4-0, we assign 3 points to Arsenal and 0 points to Blackburn Rovers for that simulation, and +4 and -4 goal difference respectively.</p>
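<p>The bookkeeping rule described above fits in a few lines; a Python sketch (a hypothetical helper, mirroring the case_when logic that follows):</p>

```python
def score_result(hgoal, agoal):
    """Points and goal difference for (home, away) from one sampled scoreline."""
    if hgoal > agoal:
        home_pts, away_pts = 3, 0   # home win
    elif agoal > hgoal:
        home_pts, away_pts = 0, 3   # away win
    else:
        home_pts, away_pts = 1, 1   # draw
    return home_pts, away_pts, hgoal - agoal, agoal - hgoal

print(score_result(4, 0))  # e.g. a sampled 4-0 home win -> (3, 0, 4, -4)
```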
<pre class="r"><code>n <- 100
fixture_sims <- rerun(n, nested_probabilities %>%
mutate(sampled_result = map(probabilities, sample_n, 1, weight = prob)) %>%
select(-probabilities) %>%
unnest(cols = c(sampled_result)) %>%
select(-gameweek, -prob) %>%
pivot_longer(c(home, away), names_to = "location", values_to = "team") %>%
mutate(points = case_when(
location == "home" & sample_hgoal > sample_agoal ~ 3,
location == "away" & sample_agoal > sample_hgoal ~ 3,
sample_hgoal == sample_agoal ~ 1,
TRUE ~ 0
)) %>%
mutate(gd = case_when(
location == "home" ~ sample_hgoal - sample_agoal,
location == "away" ~ sample_agoal - sample_hgoal
)))
fixture_sims[1]</code></pre>
<pre><code>## [[1]]
## # A tibble: 12 x 6
## sample_hgoal sample_agoal location team points gd
## <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 0 0 home Coventry_City 1 0
## 2 0 0 away Arsenal 1 0
## 3 4 0 home Blackburn_Rovers 3 4
## 4 4 0 away Dover_Athletic 0 -4
## 5 0 0 home Frimley_Green 1 0
## 6 0 0 away Enfield_Town 1 0
## 7 3 0 home Arsenal 3 3
## 8 3 0 away Blackburn_Rovers 0 -3
## 9 6 1 home Coventry_City 3 5
## 10 6 1 away Frimley_Green 0 -5
## 11 1 1 home Dover_Athletic 1 0
## 12 1 1 away Enfield_Town 1 0</code></pre>
<p>We can then average the points and goal difference won in these sims across each team and see what teams are predicted to win across their fixtures.</p>
<pre class="r"><code>fixture_sims %>%
bind_rows() %>%
group_by(team) %>%
summarise(av_points = sum(points)/n,
av_gd = sum(gd) / n)</code></pre>
<pre><code>## # A tibble: 6 x 3
## team av_points av_gd
## <chr> <dbl> <dbl>
## 1 Arsenal 4.19 2.44
## 2 Blackburn_Rovers 3.16 0.7
## 3 Coventry_City 3.61 2.42
## 4 Dover_Athletic 2.26 -1.26
## 5 Enfield_Town 2.95 0.23
## 6 Frimley_Green 0.6 -4.53
<p>Where we can see that we expect Arsenal to win 4.19 out of a possible 6 points (with games remaining against Coventry and Blackburn Rovers they are expected to drop points, but win at least one and probably draw the other). Coventry City are also expected to do well, probably because their final game is at home to Frimley Green, whereas Blackburn have tougher fixtures away at Arsenal and home to Dover Athletic.</p>
<p>We can then add this to the calculated points teams have <em>already</em> accrued to get a prediction of where teams will end the season position wise:</p>
<pre class="r"><code>table <- results %>%
pivot_longer(c(home, away), names_to = "location", values_to = "team") %>%
mutate(points = case_when(
location == "home" & hgoal > agoal ~ 3,
location == "away" & agoal > hgoal ~ 3,
hgoal == agoal ~ 1,
TRUE ~ 0
)) %>%
mutate(gd = case_when(
location == "home" ~ hgoal - agoal,
location == "away" ~ agoal - hgoal
)) %>%
group_by(team) %>%
summarise(points = sum(points),
gd = sum(gd))
predicted_finishes <- map_df(fixture_sims, function(simulated_fixtures, table) {
simulated_fixtures %>%
select(team, points, gd) %>%
bind_rows(., table) %>%
group_by(team) %>%
summarise(points = sum(points),
gd = sum(gd)) %>%
arrange(-points, -gd) %>%
mutate(predicted_finish = 1:n())
}, table) %>%
group_by(team, predicted_finish) %>%
summarise(perc = n() / n)
predicted_finishes</code></pre>
<pre><code>## # A tibble: 10 x 3
## # Groups: team [6]
## team predicted_finish perc
## <chr> <int> <dbl>
## 1 Arsenal 1 0.82
## 2 Arsenal 2 0.18
## 3 Blackburn_Rovers 1 0.18
## 4 Blackburn_Rovers 2 0.82
## 5 Coventry_City 3 0.97
## 6 Coventry_City 4 0.03
## 7 Dover_Athletic 3 0.03
## 8 Dover_Athletic 4 0.97
## 9 Enfield_Town 5 1
## 10 Frimley_Green 6 1</code></pre>
<p>Which gives Arsenal an 82% chance of finishing champions, with only an 18% chance that Blackburn manage to leapfrog them into 1st place. Given there are only 2 matches left between teams designed to have fairly large gulfs in ability, it's not surprising most of the final positions are nailed on - e.g. Enfield Town finish 5th in every single simulation:</p>
<pre class="r"><code>p4 <- ggplot(predicted_finishes, aes(x = predicted_finish, y = perc, fill = team)) +
geom_bar(stat = "identity", colour = "black") +
scale_fill_manual(values = c("red", "blue", "skyblue", "white", "dodgerblue4", "blue")) +
labs(
title = "Predicted finish position of teams",
subtitle = "with two gameweeks left to play",
y = "fraction of finishes",
x = "final position"
) +
theme_minimal() +
facet_wrap(~team)
p4</code></pre>
<p><img src="/post/2019-3-05-dixon-coles-2_files/figure-html/plot_predicted_finishes-1.png" width="672" /></p>
</div>
</div>
<div id="therealthing" class="section level2">
<h2>The Real Thing</h2>
<p>We're now at the stage where we can start to look at real data. One of the motivating forces which drew me back to this putative blog series was the <a href="https://www.theguardian.com/football/blog/2020/apr/25/finishing-premier-league-season-pointless-football">current football situation</a>, with the season ending with games left to play.</p>
<p>We can use the knowledge we’ve built up over these last posts to see what we expect to happen in these unplayed games, if they cannot be completed.</p>
<p>To make the code more concise, I've used Ben Torvaney's code in his <a href="https://github.com/Torvaney/regista">regista</a> package (he also has some nice usage blogs similar to this post at his <a href="http://www.statsandsnakeoil.com/">blog</a>). The underlying maths is exactly the same as in my previous post, though with a few different design choices. If we run the simulations using the code from the previous post we should get exactly the same answer.</p>
<p>The code following is also extremely similar to the final chunks of one of my previous <a href="https://www.roberthickman.eu/post/five_min_trivia_invincibles/">posts</a> in analysing the current Liverpool team’s achievements.</p>
<pre class="r"><code>library(rvest)
library(regista)</code></pre>
<p>First we need to download the data on the current English Premier League season. Once we have this we can split it into played matches (where we 100% know the result) and unplayed matches which we need to predict the result of. For the basis of the team strength estimates I've used the xG created and allowed per game, as I believe these give a better estimate of team strength (indeed Ben Torvaney has a <a href="http://www.statsandsnakeoil.com/2019/01/06/predicting-the-premier-league-with-dixon-coles-and-xg/">nice post on using even the shot-by-shot xG to produce Dixon-Coles models</a>).</p>
<pre class="r"><code>#download the match data from 2019/2020
fixtures_2020 <- "https://fbref.com/en/comps/9/schedule/Premier-League-Fixtures" %>%
read_html() %>%
html_nodes("#sched_ks_3232_1") %>%
html_table() %>%
as.data.frame() %>%
separate(Score, into = c("hgoal", "agoal"), sep = "–") %>%
#only care about goals and expected goals
select(home = Home, away = Away, home_xg = xG, away_xg = xG.1, hgoal, agoal) %>%
filter(home != "") %>%
mutate(home = factor(home), away = factor(away)) %>%
#round expected goals to nearest integer
mutate_at(c("home_xg", "away_xg", "hgoal", "agoal"), .funs = funs(round(as.numeric(.))))
#matches with a known result
#used for modelling
played_matches < fixtures_2020 %>%
filter(!is.na(home_xg))
#matches with an unknown result
#used for simulation
unplayed_matches < fixtures_2020 %>%
filter(is.na(home_xg)) %>%
select_if(negate(is.numeric))
#fit the dixon coles model
#use xg per game, not 'actual' goals
fit_2020 < dixoncoles(home_xg, away_xg, home, away, data = played_matches)</code></pre>
<p>To get a look at what the team parameters in a reallife league look like we can extract them from the model and plot them:</p>
<pre class="r"><code>#extract Dixon-Coles team strength parameters
pars_2020 <- fit_2020$par %>%
  .[grepl("def_|off_", names(.))] %>%
  matrix(., ncol = 2) %>%
  as.data.frame() %>%
  rename(attack = V1, defence = V2)

pars_2020$team <- unique(gsub("def_*|off_*", "", names(fit_2020$par)))[1:20]

#plot as before
p5 <- pars_2020 %>%
  mutate(defence = 1 - defence) %>%
  ggplot(aes(x = attack, y = defence, colour = attack + defence, label = team)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_text_repel() +
  scale_colour_continuous(guide = FALSE) +
  labs(title = "Dixon-Coles parameters for the 2019/2020 EPL",
       x = "attacking strength",
       y = "defensive strength") +
  theme_minimal()

p5</code></pre>
<p><img src="/post/2019305dixoncoles2_files/figurehtml/plot_real_params1.png" width="672" />
It might surprise some that this model rates Manchester City above Liverpool, but it shouldn’t, given the underlying numbers for both teams: Liverpool have run very hot this season and Manchester City have run very cold.</p>
<p>Finally, we can calculate the current Premier League table and simulate the remaining games to predict where teams would finish the season if the remainder of the games were played. I’ve chosen 10,000 sims for the sake of processing time, but you can scale up or down as desired.</p>
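The code below leans on regista for the heavy lifting, but the core simulation idea (sample a scoreline for each unplayed fixture, convert it to points, repeat many times) is simple enough to sketch for a single fixture. Here it is in Python for illustration; the function and its example rates are invented for this sketch and are not regista’s API:

```python
import math
import random

def simulate_points(home_exp, away_exp, n_sims=10000, seed=42):
    """Hypothetical sketch (NOT regista's API): sample Poisson scorelines
    for one fixture and return the mean points per game for each side."""
    rng = random.Random(seed)

    def poisson_draw(lam):
        # Knuth's multiplication method for a single Poisson sample
        limit, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                return k
            k += 1

    home_pts = away_pts = 0
    for _ in range(n_sims):
        hg, ag = poisson_draw(home_exp), poisson_draw(away_exp)
        if hg > ag:
            home_pts += 3
        elif ag > hg:
            away_pts += 3
        else:
            home_pts += 1
            away_pts += 1
    return home_pts / n_sims, away_pts / n_sims

# a strong home side (2.0 expected goals) against a weak away side (0.8)
home_ppg, away_ppg = simulate_points(2.0, 0.8)
```

Aggregating these sampled results over every unplayed fixture, adding them to the known table, and repeating is all the simulation below does.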
<pre class="r"><code>#calculate the current EPL table
current_epl_table <- played_matches %>%
  select(home, away, hgoal, agoal) %>%
  pivot_longer(c(home, away), names_to = "location", values_to = "team") %>%
  mutate(points = case_when(
    location == "home" & hgoal > agoal ~ 3,
    location == "away" & agoal > hgoal ~ 3,
    hgoal == agoal ~ 1,
    TRUE ~ 0
  )) %>%
  mutate(gd = case_when(
    location == "home" ~ hgoal - agoal,
    location == "away" ~ agoal - hgoal
  )) %>%
  group_by(team) %>%
  summarise(points = sum(points),
            gd = sum(gd))

#the number of sims to run
n <- 10000

#simulate remaining matches
fixture_sims_2020 <- rerun(
  n,
  augment.dixoncoles(fit_2020, unplayed_matches, type.predict = "scorelines") %>%
    mutate(sampled_result = map(.scorelines, sample_n, 1, weight = prob)) %>%
    select(-.scorelines) %>%
    unnest(cols = c(sampled_result)) %>%
    pivot_longer(c(home, away), names_to = "location", values_to = "team") %>%
    mutate(points = case_when(
      location == "home" & hgoal > agoal ~ 3,
      location == "away" & agoal > hgoal ~ 3,
      hgoal == agoal ~ 1,
      TRUE ~ 0
    )) %>%
    mutate(gd = case_when(
      location == "home" ~ hgoal - agoal,
      location == "away" ~ agoal - hgoal
    )) %>%
    select(team, points, gd))

#calculate final EPL tables
predicted_finishes_2020 <- map_df(fixture_sims_2020, function(sim_fixtures, table) {
  sim_fixtures %>%
    select(team, points, gd) %>%
    bind_rows(., table) %>%
    group_by(team) %>%
    summarise(points = sum(points),
              gd = sum(gd)) %>%
    arrange(-points, -gd) %>%
    mutate(predicted_finish = 1:n())
}, current_epl_table) %>%
  group_by(team, predicted_finish) %>%
  summarise(perc = n() / n) %>%
  group_by(team) %>%
  mutate(mean_finish = mean(predicted_finish)) %>%
  arrange(mean_finish) %>%
  ungroup() %>%
  mutate(team = factor(team, levels = unique(team)))</code></pre>
<p>If we then plot these predicted finishes (ordered by the chance of their highest finish position), we can get an idea of where we expect teams to end the season:</p>
<pre class="r"><code>#list of team colours
team_cols <- c("red", "skyblue", "darkblue", "darkblue", "darkred",
               "orange", "red", "white", "red", "blue", "maroon",
               "blue", "white", "red", "dodgerblue", "yellow",
               "maroon", "red", "maroon", "yellow")

#plot the finishing position by chance based on these simulations
p6 <- ggplot(predicted_finishes_2020,
             aes(x = predicted_finish, y = perc, fill = team)) +
  geom_bar(stat = "identity", colour = "black") +
  scale_fill_manual(values = team_cols, guide = FALSE) +
  labs(
    title = "Predicted finish position of teams",
    subtitle = "for incomplete 2019/2020 EPL season",
    y = "fraction of finishes",
    x = "final position"
  ) +
  theme_minimal() +
  facet_wrap(~team)

p6</code></pre>
<p><img src="/post/2019305dixoncoles2_files/figurehtml/plot_2020_prediction1.png" width="672" /></p>
<p>So, great news for Liverpool fans, whom the model believes have an essentially 100% chance of finishing in first place. Leicester might also be happy with a nailed-on 3rd place, with Chelsea or Manchester United probably rounding out the top four, and Wolves joining the loser of those two in the Europa League.</p>
<pre class="r"><code>#get the predictions for the 2019/2020 champion
winner <- predicted_finishes_2020 %>%
  filter(predicted_finish < 2) %>%
  mutate(prediction = "Champion chance")

winner</code></pre>
<pre><code>## # A tibble: 2 x 5
## team predicted_finish perc mean_finish prediction
## <fct> <int> <dbl> <dbl> <chr>
## 1 Liverpool 1 1.00 1.5 Champion chance
## 2 Manchester City 1 0.0001 2.5 Champion chance</code></pre>
<pre class="r"><code>#get prediction for those who qualify for champions league
#and for europa league
champs_league <- predicted_finishes_2020 %>%
  filter(predicted_finish < 5) %>%
  group_by(team) %>%
  summarise(perc = sum(perc)) %>%
  arrange(-perc) %>%
  mutate(prediction = "Champions League chance")

champs_league</code></pre>
<pre><code>## # A tibble: 10 x 3
## team perc prediction
## <fct> <dbl> <chr>
## 1 Liverpool 1 Champions League chance
## 2 Manchester City 1 Champions League chance
## 3 Leicester City 0.933 Champions League chance
## 4 Chelsea 0.479 Champions League chance
## 5 Manchester Utd 0.46 Champions League chance
## 6 Wolves 0.106 Champions League chance
## 7 Sheffield Utd 0.0155 Champions League chance
## 8 Tottenham 0.004 Champions League chance
## 9 Arsenal 0.0018 Champions League chance
## 10 Everton 0.0005 Champions League chance</code></pre>
<pre class="r"><code>europa_league <- predicted_finishes_2020 %>%
  filter(predicted_finish < 7) %>%
  group_by(team) %>%
  summarise(perc = sum(perc)) %>%
  arrange(-perc) %>%
  mutate(prediction = "(at least) Europa League chance")

europa_league</code></pre>
<pre><code>## # A tibble: 13 x 3
## team perc prediction
## <fct> <dbl> <chr>
## 1 Liverpool 1 (at least) Europa League chance
## 2 Manchester City 1 (at least) Europa League chance
## 3 Leicester City 0.999 (at least) Europa League chance
## 4 Manchester Utd 0.954 (at least) Europa League chance
## 5 Chelsea 0.954 (at least) Europa League chance
## 6 Wolves 0.729 (at least) Europa League chance
## 7 Sheffield Utd 0.196 (at least) Europa League chance
## 8 Tottenham 0.096 (at least) Europa League chance
## 9 Arsenal 0.0479 (at least) Europa League chance
## 10 Everton 0.0139 (at least) Europa League chance
## 11 Burnley 0.0089 (at least) Europa League chance
## 12 Crystal Palace 0.0009 (at least) Europa League chance
## 13 Southampton 0.0008 (at least) Europa League chance</code></pre>
<p>(obviously this model does not account for any ramifications of <a href="https://www.itv.com/news/20200214/manchestercitybannedfromchampionsleaguefortwoyearsbyuefa/">Manchester City’s European ban</a>)</p>
<p>At the foot of the table, the model is fairly bullish on Norwich being relegated, with Aston Villa probably joining them, and then probably West Ham rounding out the relegation spots.</p>
<pre class="r"><code>#get predictions for those who would be relegated
relegated <- predicted_finishes_2020 %>%
  filter(predicted_finish > 17) %>%
  group_by(team) %>%
  summarise(perc = sum(perc)) %>%
  arrange(-perc) %>%
  mutate(prediction = "Relegation chance")

relegated</code></pre>
<pre><code>## # A tibble: 8 x 3
## team perc prediction
## <fct> <dbl> <chr>
## 1 Norwich City 0.934 Relegation chance
## 2 Aston Villa 0.700 Relegation chance
## 3 Bournemouth 0.507 Relegation chance
## 4 West Ham 0.402 Relegation chance
## 5 Watford 0.270 Relegation chance
## 6 Brighton 0.171 Relegation chance
## 7 Newcastle Utd 0.0126 Relegation chance
## 8 Southampton 0.00270 Relegation chance</code></pre>
</div>
<div id="finalremarks" class="section level2">
<h2>Final Remarks</h2>
<p>I want to make it clear at the end of this post that this probably isn’t the most sophisticated model for predicting football matches (more to come in a part 3, maybe this time within less than a year), but it does a pretty good job!</p>
<p>In truth, though, I don’t think running these sims is a good way to end the season; there’s probably no good way. This post is more about <em>how</em> to use this technique than <em>whether</em> to use it.</p>
<p>As always, stay safe!</p>
</div>

Papers Please! 'Wide Open Spaces: A statistical technique for measuring space creation in professional soccer' pt 1
/post/fall_back_in_to_space/
Mon, 20 Apr 2020 00:00:00 +0000
/post/fall_back_in_to_space/
<p><em>written during lockdown, so while I think it adds some value (and is useful for organising my thoughts on the paper for my own work on football), there are probably mistakes. E.g. the C++ code is still pretty inefficient and could well be improved, and I’ve surely confused some maths concepts. To be honest, the post is just an excuse to practice writing LaTeX maths and some C++. Let me know of any errors and I’ll correct them</em></p>
<p><a href="https://www.youtube.com/watch?v=f9X1C7pTuM">Beach House - Space Song</a></p>
<div id="intropreamble" class="section level1">
<h1>Intro/Preamble</h1>
<p>When analysing football (whether as a scout watching games, or an analyst using data), we want the greatest sample size possible. During a single match a player might well make fewer than 100 ‘events’ (passes, tackles, interceptions, shots, …) and still play well enough that he might be a worthwhile purchase. As we increase the number of matches we watch that player play, a more accurate ‘smoothed’ representation of their game should emerge. However, time is very obviously a limited resource. If we assume a very hardworking scout can watch 6 football matches a day, it will probably take them a week to cover all of the games of one team in a season, and over 3 months to cover an entire season of a league.</p>
<p>An obvious way to get around some of these limitations is to augment scouting using data. If a player is obviously a <a href="https://en.wikipedia.org/wiki/Filippo_Inzaghi">Filippo Inzaghi</a>-style poacher, it’s feasible we might watch 5-6 games of his to get a feel for his ability, then check some basic stats such as shots, xG, etc. per game over his last few seasons to see how representative our sample was and flesh out our scouting.</p>
<p>When we build these models (even just counting shot numbers) we are in essence ‘teaching’ machines to do the scouting for us. We provide them with a model of how the game works and ask them to ‘watch’ a huge number of matches very quickly. The obvious pitfall of this is that ‘computers don’t play football’, and they don’t, so the output of our model is going to be proportional to the understanding of the game the computer has. For example, a computer that only counts shot numbers has a poorer understanding of football than a machine that weights these by xG per shot. Just as humans understand that creating better shooting chances is important, the second computer has come to grasp that.</p>
<p>Some of these computational models seem to work, even with simple inputs. The xG a striker produces per season does, for instance, align quite well with how good the human eye test thinks a striker is. However, many are <em>quite bad</em>, especially as you move back through play away from shots on goal. To fix this, we need machines that understand the game better, and in the same ways humans do.</p>
<p>This is really the idea behind a lot of modern football analytics research, but I think especially behind <a href="http://www.sloansportsconference.com/wpcontent/uploads/2018/03/1003.pdf">Wide Open Spaces</a>, a 2019 Sloan conference paper by Javier Fernandez and Luke Bornn. I’m not going to review the whole paper, but the key takeaway is that for every ‘event’ that a player takes, there are actually many more uncaptured events where players are continually creating and destroying space. Combining these gives us a better approximation of what the human brain does when evaluating players. If this seems confusing, a simpler way to think about this is consider <a href="https://www.youtube.com/watch?v=sZsJo7ZKdjQ">this Tifo football video on Thomas Mueller</a>.</p>
<p><em>It is probably more valuable to be able to create and exploit space, than it is to be able to technically execute a pass. The reverse is also clearly true for defenders; consider Maldini’s famous quote: “If I have to make a tackle then I have already made a mistake.”</em></p>
<p>The paper, while very clearly written, does not explain its maths as accessibly as I might like, so I thought a post going through exactly what the paper is doing might be of value. All the hard work for this post is really done by <a href="https://twitter.com/AnEnglishGoat">Will Thomson</a>, whose implementation of the algorithm in python <a href="https://colab.research.google.com/drive/1V75UgfJEfCWgbfxnG4OuB1WpvqahUJPU">here</a> forms the basis (my final code only has minor tweaks).</p>
<p>As always, let’s first load some libraries we’ll need:</p>
<pre class="r"><code>set.seed(3459)
#libraries
library(tidyverse)
library(ggsoccer)
library(mvtnorm) #might be possible with MASS
library(zoo)</code></pre>
</div>
<div id="thetheory" class="section level1">
<h1>The Theory</h1>
<p>Imagine two teams, I and J. Each of these has 11 players (hopefully) on the pitch at any time chasing after one ball. We want to know which team controls which parts of the pitch at each point in the match. As ‘control’ in a football match really only refers to “will a player on my team get to a potential pass there first”, we are just looking at where players i, j, k… are going to be at time t + n seconds.</p>
<p>The easiest way to start to approximate this is to imagine a set of players who never change direction, they only speed up or slow down (and possibly reverse). E.g. a full back running up and down the wings like a rook in a chess game. Their location at t + 1 will be their current location plus the expected value of their velocity.</p>
<pre class="r"><code>#make up some movement data
full_back_pos <- data.frame(x = 40, y = 70)

full_back_movement <- data.frame(
  pos = 40,
  x = c(10, 100),
  y_pos = 70)

next_x <- rnorm(10000, 60, 5)
next_x <- next_x - (next_x %% 5)

full_back_next_pos <- data.frame(table(next_x)) %>%
  mutate(y = 70,
         next_x = as.numeric(as.character(next_x)),
         Freq = Freq / sum(Freq))

#plot fake movement data
p <- ggplot() +
  annotate_pitch(dimensions = pitch_statsbomb) +
  geom_tile(data = full_back_next_pos,
            aes(x = next_x, y = y, fill = Freq),
            alpha = 0.7, height = 10) +
  scale_fill_viridis_c(name = "confidence") +
  geom_segment(data = full_back_movement,
               aes(x = pos, xend = x, y = y_pos, yend = y_pos), size = 2) +
  geom_point(data = full_back_pos,
             aes(x = x, y = y),
             shape = 21, colour = "black", fill = "red", size = 5) +
  theme_pitch()

p</code></pre>
<p><img src="/post/20200420wide_open_spaces_1_files/figurehtml/player_example1.png" width="672" /></p>
<p>As we’re not fully confident in our assessment of how fast this full back is, we aren’t 100% sure where his next position will be (at time t + n seconds). But given how quick we <em>expect</em> him to be, we can produce an expected distribution of his next x coordinate (here binned into boxes 5m wide). This estimate will vary according to two parameters: the mean speed (μ) and the standard deviation of that speed (σ). If we make 10000 such estimates (assuming no bias, forgetting our previous estimate, etc.), these will form the normal distribution’s probability density function</p>
<pre class="r"><code>#plot histogram of fake movement data
p2 < ggplot(full_back_next_pos, aes(x = next_x, y = Freq)) +
geom_bar(stat = "identity") +
ylab("confidence") +
xlab("next x coordinate") +
theme_minimal()
p2</code></pre>
<p><img src="/post/20200420wide_open_spaces_1_files/figurehtml/player_example_hist1.png" width="672" /></p>
<p>(here I’ve plotted the x axis as the next x coordinate which is just our estimate of the x speed + the original x coordinate [40]).</p>
<p>But this is obviously an oversimplification: because players can travel in a myriad of different directions across the pitch, we need our normal distribution to generalise across more than one dimension.</p>
<pre class="r"><code>#fake data in 2 dimensions now
next_x <- rnorm(10000, 60, 5)
next_x <- next_x - (next_x %% 5)
next_y <- rnorm(10000, 65, 3)
next_y <- next_y - (next_y %% 5)

full_back_next_pos <- data.frame(next_x, next_y) %>%
  group_by(next_x, next_y) %>%
  summarise(Freq = n())

full_back_movement <- data.frame(x = 40, y = 70, next_x = 60, next_y = 60)

#plot
p3 <- ggplot() +
  annotate_pitch(dimensions = pitch_statsbomb) +
  geom_tile(data = full_back_next_pos,
            aes(x = next_x, y = next_y, fill = Freq),
            alpha = 0.7, height = 10) +
  scale_fill_viridis_c(name = "confidence") +
  geom_segment(data = full_back_movement,
               aes(x = x, xend = next_x, y = y, yend = next_y),
               size = 2, arrow = arrow(length = unit(0.03, "npc"))) +
  geom_point(data = full_back_pos,
             aes(x = x, y = y),
             shape = 21, colour = "black", fill = "red", size = 5) +
  theme_pitch()

p3</code></pre>
<p><img src="/post/20200420wide_open_spaces_1_files/figurehtml/player_example_2d1.png" width="672" />
So now we have a realistic guess, based upon the player’s velocity vector, of where they will be in n seconds’ time. If we do the same for every player on the pitch, we get a (roughly) 22-layer raster detailing how likely any single player is to be able to be at location x,y at time t + n. If a football magically appeared at point x,y, we now know which player(s) are likely to be able to reach it. Therefore, we know which parts of the pitch team I or J ‘controls’: where their teammates can pass to and expect them to receive the ball.</p>
<p>This really is the fundamental idea of the pitch control metric presented in <a href="http://www.sloansportsconference.com/wpcontent/uploads/2018/03/1003.pdf">Wide Open Spaces</a>: we can use the expected 2d position of each player in the next n seconds to work out which team would win the ball if it were dropped on a specific coordinate. This is what we mean by ‘pitch control’.</p>
</div>
<div id="themath" class="section level1">
<h1>The Math</h1>
<p>Now we have an idea of what we want to do, ‘we’ need to formalise it. Luckily the paper already does it for us and all we need to do is follow the derivation. First, we need to define two terms: we’ll call the space of possible locations (120 x 80m for me) P(itch), and the range of times T(ime).</p>
<p>For every single point p at time t pitch control (PC) is defined by equation 2</p>
<p><span class="math display">\[PC(p,t) = \sigma\left(\sum_{i} I_{i}(p,t) - \sum_{j} I_{j}(p,t)\right)\]</span></p>
<p>where you sum across i (all the players on team I) and j (all the players on team J), and pass the difference through a logistic function (σ). Due to the logistic function, the output (PC) will have a value from 0 to 1, where <0.5 is control by team J and >0.5 is control by team I. E.g. if you drop a ball at place p at time t and PC(p,t) is greater than 0.5, team I is likelier to get the ball, and vice-versa for <0.5.</p>
<p>We’ll rewrite this with sigma replaced with numbers as:</p>
<p><span class="math display">\[PC(p,t) = \frac{1}{1 + e^{-\left(\sum_{i} I_{i}(p,t) - \sum_{j} I_{j}(p,t)\right)}}\]</span></p>
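As a quick numerical sketch of that logistic step (the influence values here are made up purely for illustration, and Python stands in for the R we use elsewhere):

```python
import math

def pitch_control(team_i_influences, team_j_influences):
    """Logistic of the difference in summed player influence at one point:
    0.5 means neither team controls the point, >0.5 favours team I."""
    diff = sum(team_i_influences) - sum(team_j_influences)
    return 1.0 / (1.0 + math.exp(-diff))

balanced = pitch_control([0.4, 0.3], [0.5, 0.2])     # equal sums -> 0.5
i_dominant = pitch_control([0.9, 0.8], [0.1, 0.05])  # team I well placed
```

When the two teams’ summed influences are equal the logistic returns exactly 0.5, and it saturates towards 1 (or 0) as one team dominates the point.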
<p>From here it should be obvious we need to calculate I(p,t) for each player. We do this in equation 1</p>
<p><span class="math display">\[I_{i}(p,t) = \frac{f_{i}(p,t)}{f_{i}(p_{i}(t), t)} \]</span></p>
<p>The numerator here is the probability density function of the player’s influence: how much influence does a single player have over any single part of the pitch surface (p) at a time (t)? This is normalised by the denominator, which evaluates the same density at the player’s current location at time t (p_i(t)).</p>
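To make the normalisation concrete, here’s a sketch using a 1D Gaussian as a stand-in for the full bivariate f_i (an assumption purely for illustration; the numbers are invented). By construction influence is exactly 1 at the player’s current location, and note it can exceed 1 ahead of a moving player, towards the mean of his distribution:

```python
import math

def gauss_pdf(p, mu, sigma):
    # 1D normal density, standing in for the bivariate f_i(p, t)
    return math.exp(-0.5 * ((p - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def influence(p, player_pos, mu, sigma):
    # equation 1: density at p, normalised by the density at the
    # player's own current location p_i(t)
    return gauss_pdf(p, mu, sigma) / gauss_pdf(player_pos, mu, sigma)

# player at x = 40, moving so the centre of his distribution is x = 42.5
at_self = influence(40.0, player_pos=40.0, mu=42.5, sigma=3.0)  # exactly 1
ahead = influence(42.5, player_pos=40.0, mu=42.5, sigma=3.0)    # above 1
```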
<p>OK, so far so good. We’ll come back to equations 4 and 5 in the paper later; they define the value of having the ball at these locations, which we won’t really go into in this post. Don’t worry about them for now.</p>
<p>If we then skip to the supplemental figures we hit the pretty rough equation 12 which tells us how to solve for f_i(p,t)</p>
<p><span class="math display">\[f_{i}(p,t) = \frac{1}{\sqrt{(2\pi)^2 \det COV_{i}(t)}} \exp\left(-\frac{1}{2}(p - \mu_{i}(t))^{T} COV_{i}(t)^{-1} (p - \mu_{i}(t))\right) \]</span></p>
<p>It looks horrendous but it’s just the equation for the multivariate normal distribution. See for example <a href="https://wikimedia.org/api/rest_v1/media/math/render/svg/c66e6f6abd66698181e114a4b00da97446efd3c4">here</a>. It’s not a surprise to see this equation because we know we need to solve a multivariate normal from the example using our full back above!</p>
<p>All we need to do is find x, μ, and Σ, in the linked picture above. Then we’re going to use mvtnorm::dmvnorm to calculate the density function. If you run</p>
<pre class="r"><code>?mvtnorm::dmvnorm</code></pre>
<p>you can see that ‘coincidentally’ this also requires 3 arguments (ignore the 4th log = FALSE), x, μ (mean), and sigma. All we have to do is find out what each of these arguments are equal to.</p>
<p>Firstly we want to find the covariance matrix (COV_i(t)). To calculate this, we can rewrite it as Σ, a product of a rotation matrix R and a scaling matrix S such that:</p>
<p><span class="math display">\[ \Sigma = R \cdot S \cdot S \cdot R^{-1}\]</span>
where R is the rotation matrix around the Euclidean plane:</p>
<p><span class="math display">\[R =
\begin{bmatrix}
\cos(\theta) & -\sin(\theta) \\
\sin(\theta) & \cos(\theta) \\
\end{bmatrix}
\]</span></p>
<p>and S is a scaling matrix</p>
<p><span class="math display">\[S =
\begin{bmatrix}
s_{x} & 0 \\
0 & s_{y} \\
\end{bmatrix}
\]</span></p>
<p>The details of this transformation aren’t really important, but a good explanation can be found <a href="https://www.visiondummy.com/2014/04/geometricinterpretationcovariancematrix/">here</a>.</p>
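A quick sketch to check the construction (θ and the scaling factors are made-up values; Python again stands in for the R elsewhere). Because R is orthogonal, Σ comes out symmetric with eigenvalues s_x² and s_y²:

```python
import numpy as np

def covariance_matrix(theta, sx, sy):
    """Sigma = R . S . S . R^-1: a diagonal scaling rotated by theta."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    S = np.array([[sx, 0.0],
                  [0.0, sy]])
    return R @ S @ S @ np.linalg.inv(R)

# invented values: the theta from the worked example later, arbitrary scales
cov = covariance_matrix(np.deg2rad(30.4), sx=3.0, sy=2.0)
```

This is the ellipse-shaped influence region from the paper: stretched by S along the direction of travel, then rotated by R to point the player’s way.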
<p>After resolving these matrices, we then only need to find the mean value of the distribution (μ_i(t)) which is defined in equation 21 of the paper</p>
<p><span class="math display">\[\mu_{i}(t) = p_{i}(t) + \overrightarrow{\widehat{s}}_{i}(t) \cdot 0.5 \]</span>
(we’ll go over the details of this equation later)</p>
<p>and also the pitch area, p, which is just the area of the pitch we want to find the control a player exerts over. We define this by dividing the total pitch area into many ‘pixels’ and sampling each in our multivariate normal function. For example, if you split a 120m x 80m pitch into 1m^2 boxes, there are 120 * 80 = 9600 ‘pixels’ to run across.</p>
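Generating those pixels is a one-liner; a sketch (1m² cells with centres at 0.5m offsets; the cell size is a free choice, not anything the paper mandates):

```python
def pitch_grid(length=120, width=80, cell=1.0):
    """Centres of every cell of a length x width pitch, column by column."""
    nx, ny = int(length / cell), int(width / cell)
    return [((i + 0.5) * cell, (j + 0.5) * cell)
            for i in range(nx) for j in range(ny)]

pixels = pitch_grid()  # 120 * 80 = 9600 cell centres
```

Evaluating f_i at each of these centres, for each of the 22 players, is the whole computational load of the method.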
</div>
<div id="thedata" class="section level1">
<h1>The Data</h1>
<p>Now we’ve (briefly) gone through the theory, we can start working with the data and build our way back up. First we need to get our hands on the data itself. As part of the Friends of Tracking project during lockdown, Metrica Sports have kindly provided 2 sample matches (I’m using match 1 here) of tracking and event data, which can be found <a href="https://github.com/metrica-sports/sample-data">here</a>.</p>
<p>The function below downloads, melts and organises the data.</p>
<pre class="r"><code>#func to download and melt tracking data
#will use game 1
get_tracking_data <- function(file, directory = "metrica-sports/sample-data", x_adj = 120, y_adj = 80) {
  #build url
  url <- paste0("https://raw.githubusercontent.com/", directory, "/master/data/", file)
  #read data
  data <- read_csv(url, skip = 2)
  #fix names
  names(data)[grep("^X[0-9]*$", names(data)) - 1] <- paste0(names(data)[grep("^X[0-9]*$", names(data)) - 1], "_x")
  names(data)[grep("^X[0-9]*$", names(data))] <- gsub("_x$", "_y", names(data)[grep("^X[0-9]*$", names(data)) - 1])
  #melt it from wide to long
  melted_data <- data %>%
    pivot_longer(cols = starts_with("Player")) %>%
    separate(name, into = c("player", "coord"), sep = "_") %>%
    pivot_wider(names_from = "coord", values_from = "value") %>%
    rename(time = `Time [s]`) %>%
    rename_all(tolower) %>%
    #add the team info
    #scale coords to statsbomb spec
    mutate(team = gsub("(.*)(Home_Team|Away_Team)(\\..*)", "\\2", file)) %>%
    mutate_at(vars(ends_with("x")), ~.x * x_adj) %>%
    mutate_at(vars(ends_with("y")), ~.x * y_adj) %>%
    arrange(player, frame) %>%
    #some missing values on the ball location
    #will just say ball stays where it is when no location data
    #could interpolate but w/e
    mutate(ball_x = na.locf(ball_x),
           ball_y = na.locf(ball_y))
  return(melted_data)
}

#run
tracking_data <- map_df(
  c("Sample_Game_1/Sample_Game_1_RawTrackingData_Away_Team.csv",
    "Sample_Game_1/Sample_Game_1_RawTrackingData_Home_Team.csv"),
  get_tracking_data) %>%
  filter(!is.na(x) & !is.na(y))</code></pre>
<p>To calculate pitch control, we only need 4 pieces of information on each player:</p>
<ul>
<li>their x,y location on the pitch</li>
<li>the x,y location of the ball</li>
<li>the time at which they were at that location</li>
<li>and also, their location x,y at time t + n</li>
</ul>
<pre class="r"><code>head(tracking_data)</code></pre>
<pre><code>## # A tibble: 6 x 9
## period frame time ball_x ball_y player x y team
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 1 1 0.04 54.6 31.0 Player15 70.1 16.6 Away_Team
## 2 1 2 0.08 59.6 32.5 Player15 70.1 16.6 Away_Team
## 3 1 3 0.12 64.5 34.0 Player15 70.1 16.6 Away_Team
## 4 1 4 0.16 66.4 33.8 Player15 70.0 16.7 Away_Team
## 5 1 5 0.2 66.6 32.5 Player15 69.9 16.8 Away_Team
## 6 1 6 0.24 66.8 31.1 Player15 69.9 16.9 Away_Team</code></pre>
<p>and from this we can build up everything we need. We’ll also want the team data at the end to sum across all players but for now that isn’t important.</p>
<p>First let’s do the two simplest: the speed and trajectory of a player’s movement. To ease processing, we’ll first put all the information needed per frame on one line. This isn’t strictly necessary, but allows for neater functions when we really get processing</p>
<pre class="r"><code>#first add in the lead x/y to ease processing
tracking_data <- tracking_data %>%
  group_by(player, team, period) %>%
  #player x,y and time at t + n
  mutate(next_x = lead(x), next_y = lead(y), next_time = lead(time)) %>%
  #to develop velocity arrows per player
  mutate(forward_x = lead(x, 10), forward_y = lead(y, 10)) %>%
  ungroup()</code></pre>
<p>We calculate the speed in the x and y dimensions simply as the change in position divided by the time taken, and can calculate theta using either this speed vector, or the change in position (defined as the angle from the x axis the vector takes).</p>
<p>For an example, here is the data for 4 seconds of Player15 in the sample dataset, taken from about 4 minutes into the match.</p>
<pre class="r"><code>#filters for some data to plot
player_spec <- "Player15"
#each frame is 0.04s apart, take 100 frames worth from t = 250
times_spec <- seq(250, by = 0.04, length.out = 100)

#filter
example_data <- tracking_data %>%
  filter(player == player_spec & time %in% times_spec)

#plot the player's trajectory over this time
#and the velocity and theta derived from it
p4 <- ggplot(example_data) +
  geom_point(aes(x = x, y = y, colour = time), alpha = 0.6, size = 3) +
  #plot the x-axis as a green line
  geom_hline(yintercept = first(example_data$y), colour = "green", alpha = 0.5, size = 3) +
  #plot the x-axis movement
  geom_segment(aes(x = first(x), xend = last(x), y = first(y), yend = first(y)),
               arrow = arrow(length = unit(0.03, "npc"))) +
  #plot the y-axis movement
  geom_segment(aes(x = last(x), xend = last(x), y = first(y), yend = last(y)),
               arrow = arrow(length = unit(0.03, "npc"))) +
  #plot the hypotenuse
  geom_segment(aes(x = first(x), xend = last(x), y = first(y), yend = last(y)),
               size = 2, colour = "red", arrow = arrow(length = unit(0.03, "npc"))) +
  #annotate speeds and theta
  annotate("text", x = 105.5, y = 17.3, label = "x speed = 0.95 m/s") +
  annotate("text", x = 106.75, y = 16.25, label = "y speed =\n0.56m/s") +
  annotate("text", x = 104.25, y = 16.9, label = "theta = 30.4°") +
  labs(x = "pitch x coord (/m)",
       y = "pitch y coord (/m)",
       title = "example player movement over 4 seconds") +
  #scale manually so it isn't distorted
  scale_x_continuous(limits = c(103.5, 107.5)) +
  scale_y_continuous(limits = c(14, 18)) +
  theme_minimal()

p4</code></pre>
<p><img src="/post/20200420wide_open_spaces_1_files/figurehtml/example_data1.png" width="672" /></p>
<p>So we get a good idea of the player’s trajectory over those 4 seconds, and of the average velocity and angle he is travelling at.</p>
<p>We can now start building up all the calculations we need to do to work out the pitch control any one player (and then whole teams) exert from basics. Through this I’m going to define each sum as a function to make it <em>extremely</em> clear what’s going on. Some of those functions will be ridiculously simple, but I don’t want to skip over anything.</p>
<p>Starting with the speed in any dimension and the angle from the x axis (theta) the player is travelling at:</p>
<pre class="r"><code>#no real reason for these to be functions, but just to
#make it more obvious what we're doing
get_speed <- function(coord, next_coord, time, next_time) {
  #speed in meters per second
  speed = (next_coord - coord) / (next_time - time)
  return(speed)
}

#again very simple for illustrative purposes
get_theta <- function(x_speed, y_speed) {
  hypotenuse_speed = sqrt(x_speed^2 + y_speed^2)
  theta = acos(x_speed / hypotenuse_speed)
  return(theta)
}</code></pre>
<p>if we plug our data from graph p4 into these very verbosely we get</p>
<pre class="r"><code>x_start <- first(example_data$x)
x_end <- last(example_data$x)
y_start <- first(example_data$y)
y_end <- last(example_data$y)
t_start <- first(example_data$time)
t_end <- last(example_data$time)

#in m/s
speed_x <- get_speed(x_start, x_end, t_start, t_end)
speed_y <- get_speed(y_start, y_end, t_start, t_end)

#convert to degrees
theta <- get_theta(speed_x, speed_y)
theta_deg <- theta * (180/pi)

results <- c(speed_x, speed_y, theta_deg)
names(results) <- c("speed_x", "speed_y", "theta")
print(results)</code></pre>
<pre><code>## speed_x speed_y theta
## 0.9496970 0.5575758 30.4175840</code></pre>
<p>(the calculations will use theta in radians, but I think it makes more sense to show it here in degrees).</p>
<p>We can now very trivially solve equation 21 right off the bat</p>
<p><span class="math display">\[\mu_{i}(t) = p_{i}(t) + \overrightarrow{\widehat{s}}_{i}(t) \cdot 0.5 \]</span></p>
<p>Where p is the location of player i at time t, and s_hat is the speed of the player in either dimension. The mean of the distribution (where we expect the player to have the most pitch control) is his current position plus half his velocity vector, i.e. halfway to where we expect him to be in one second’s time.</p>
<pre class="r"><code>#another simple function to find mu
get_mu <- function(location, speed) {
  mu = location + speed / 2
  return(mu)
}

mu_x <- get_mu(x_start, speed_x)
mu_y <- get_mu(y_start, speed_y)</code></pre>
<p>Which means we now have the first of our variables for our big multivariate normal distribution equation (paper equation 12)</p>
<p><span class="math display">\[f_{i}(p,t) = \frac{1}{\sqrt{(2\pi)^2 \det COV_{i}(t)}} \exp\left(-\frac{1}{2}(p - \mu_{i}(t))^{T} COV_{i}(t)^{-1} (p - \mu_{i}(t))\right) \]</span></p>
<p>and just need to define p, and calculate the covariance matrix COV.</p>
<p>We can start calculating the components of the covariance matrix with equation 18 (calculating the speed as a ratio of the max speed), which is also trivial to solve now. Instead of using the speed in either direction, this relies on the total velocity, which we can find using school trigonometry</p>
<p><span class="math display">\[ Srat_{i}(t) = \frac{s^2}{ 13^2 } \]</span>
The 13 m/s constant is the assumed maximum possible speed of a player (averaging this over 100m would break the world record by ~2 seconds).</p>
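<p>Both the formula and that world-record aside are easy to check with quick arithmetic (a Python sketch, using 9.58 s as the 100 m world record):</p>

```python
def get_srat(speed_x, speed_y):
    # squared ratio of total speed to the assumed 13 m/s maximum
    speed = (speed_x ** 2 + speed_y ** 2) ** 0.5
    return (speed / 13) ** 2

# at the cap the ratio is exactly 1
print(get_srat(13, 0))

# 100 m at a constant 13 m/s, vs the 9.58 s world record
print(round(100 / 13, 2))  # ~2 s inside the record
```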
<pre class="r"><code>get_srat <- function(speed_x, speed_y) {
  #find total speed
  speed <- sqrt(speed_x^2 + speed_y^2)
  srat = (speed / 13)^2
  return(srat)
}
srat <- get_srat(speed_x, speed_y)</code></pre>
<p>And we can also find the constant Ri, the radius of a player’s influence, which isn’t listed in the paper but gives rise to figure 9. Given the formula isn’t listed, the numeric constants in the equation might be slightly off; they’re all taken from Will Thomson’s work <a href="https://colab.research.google.com/drive/1V75UgfJEfCWgbfxnG4OuB1WpvqahUJPU?pli=1">here</a>.</p>
<p><span class="math display">\[R_{i}(t) =
\begin{cases}
4 + \frac{(p_{i}(t) - p_{b}(t))^3}{18^3 / 6} & \text{if} < 10 \\
10 & \text{else}
\end{cases}\]</span></p>
<p>It specifies that a player has an influence radius of 10 metres, unless they are within ~18 metres of the ball (where the formula crosses the cap), in which case their influence radius decreases with the cube of the distance to the ball, down to a minimum of 4 metres. The idea behind this is that a player nearer the ball is much more geographically focused in their movement, as they either possess the ball or are trying to win it back.</p>
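<p>The radius rule is compact enough to restate and probe numerically (a hedged Python sketch of the same formula; note the cube term hits the 10 m cap at exactly 18 m):</p>

```python
def influence_radius(dist_to_ball):
    # 4 m minimum, growing with the cube of ball distance, capped at 10 m
    return min(4 + dist_to_ball ** 3 / (18 ** 3 / 6), 10)

print(influence_radius(0))   # 4.0
print(influence_radius(18))  # 10.0, the formula reaches the cap exactly here
print(influence_radius(30))  # 10
```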
<pre class="r"><code>#allocate a few more variables from our example data
ball_x <- first(example_data$ball_x)
ball_y <- first(example_data$ball_y)
#little bit more complicated but still easy
get_ri <- function(x, y, ball_x, ball_y) {
  ball_diff <- sqrt((x - ball_x)^2 + (y - ball_y)^2)
  ri = 4 + ((ball_diff^3) / ((18^3) / 6))
  return(min(ri, 10))
}
ri <- get_ri(x_start, y_start, ball_x, ball_y)</code></pre>
<p>We can test this function over the range of distances to the ball 0-30m and compare it to figure 9 in the paper.</p>
<pre class="r"><code>p5 <- data.frame(
  x = 0:30,
  y = map_dbl(0:30, get_ri,
              #set all other args to 0
              y = 0, ball_x = 0, ball_y = 0)) %>%
  ggplot(aes(x = x, y = y)) +
  geom_line(colour = "maroon", size = 2) +
  geom_point(size = 3, alpha = 0.5) +
  scale_y_continuous(limits = c(0, 12)) +
  labs(title = "paper figure 9 (approx)",
       x = "distance to the ball (/m)",
       y = "influence radius (/m)") +
  theme_minimal()
p5</code></pre>
<p><img src="/post/2020-04-20-wide_open_spaces_1_files/figure-html/test_function-1.png" width="672" /></p>
<p>We’re really getting there now. We just need to define our covariance matrix and then we’re done with equations. Remember that earlier we defined</p>
<p><span class="math display">\[ \Sigma = R \cdot S \cdot S \cdot R^{-1}\]</span>
in paper equation 14, where R is the rotation matrix and S is the scaling matrix.</p>
<p>To rotate in Euclidean space anticlockwise from the x-axis, the rotation matrix <a href="https://en.wikipedia.org/wiki/Rotation_matrix">is just</a></p>
<p><span class="math display">\[R =
\begin{bmatrix}
cos(\theta) & -sin(\theta) \\
sin(\theta) & cos(\theta) \\
\end{bmatrix}
\]</span>
as also defined in the paper in equation 16. It’s easy enough to define; we just need to put the right transform of theta in the right place.</p>
<pre class="r"><code>get_R <- function(theta) {
  #R fills down the columns first so these aren't the wrong way round
  R = matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)
  return(R)
}
R <- get_R(theta)</code></pre>
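<p>A rotation matrix should be orthogonal (its transpose is its inverse) and should send the unit x-vector onto the direction of travel. A quick Python cross-check of the same matrix layout:</p>

```python
import math

def get_R(theta):
    # mirrors the intended layout: [[cos, -sin], [sin, cos]]
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta),  math.cos(theta)]]

theta = math.radians(30.4176)  # the example player's direction of travel
R = get_R(theta)

# rotating the unit x-vector gives (cos theta, sin theta)
vx = R[0][0] * 1 + R[0][1] * 0
vy = R[1][0] * 1 + R[1][1] * 0

# orthogonality: R times its transpose should be the identity
rrt00 = R[0][0] ** 2 + R[0][1] ** 2
rrt01 = R[0][0] * R[1][0] + R[0][1] * R[1][1]
print(round(rrt00, 6), round(rrt01, 6))  # 1.0 0.0
```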
<p>For simplicity, I earlier said that the scaling matrix (S) was equivalent to the speed of the player in the x and y dimensions, which was a bit of a white lie. It <em>is</em> derived from that, but is itself scaled by the influence radius of the player (Ri)</p>
<p><span class="math display">\[S =
\begin{bmatrix}
s_{x} & 0 \\
0 & s_{y} \\
\end{bmatrix}
\]</span>
<span class="math display">\[S_{i}(t) =
\begin{bmatrix}
\frac{R_{i}(t) \cdot (1 + Srat_{i}(t))}{2} & 0 \\
0 & \frac{R_{i}(t) \cdot (1 - Srat_{i}(t))}{2} \\
\end{bmatrix}
\]</span></p>
<p>This is the same as equation 19; I’ve just taken the Ri outside the brackets. As with the rotation matrix R, this is just matrix building and putting the right variables in the right place.</p>
<pre class="r"><code>get_S <- function(ri, srat) {
  top_left <- ri * (1 + srat) / 2
  bottom_right <- ri * (1 - srat) / 2
  S = matrix(c(top_left, 0, 0, bottom_right), nrow = 2)
  return(S)
}
S <- get_S(ri, srat)</code></pre>
<p>Once we have R and S, Σ is just the matrix product of these, as in equation 15</p>
<pre class="r"><code>get_Sigma <- function(R, S) {
  inv_R <- solve(R)
  Sigma = R %*% S %*% S %*% inv_R
  return(Sigma)
}
Sigma <- get_Sigma(R, S)</code></pre>
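<p>Because R is orthogonal (so R⁻¹ = Rᵀ) and S is diagonal, Σ = R·S·S·R⁻¹ comes out symmetric with positive determinant, as a covariance matrix must. A hand-rolled check (Python, 2×2 matrices as nested lists; the numbers are illustrative, not the example data):</p>

```python
import math

def matmul(a, b):
    # product of two 2x2 matrices stored as nested lists
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

theta = math.radians(30.0)          # illustrative direction
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]
Rt = [[R[0][0], R[1][0]], [R[0][1], R[1][1]]]  # transpose = inverse here
S = [[5.0, 0.0], [0.0, 3.0]]        # illustrative diagonal scaling

Sigma = matmul(matmul(R, matmul(S, S)), Rt)
det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
print(abs(Sigma[0][1] - Sigma[1][0]) < 1e-9, det > 0)  # True True
```

<p>The determinant is preserved from S·S (here 25 × 9 = 225), since rotation has determinant 1.</p>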
<p>So now we have the mean (μ) and sigma (Σ) arguments to our dmvnorm function to calculate a player’s pitch control. We just need to plug in the p term (corresponding to x in the R function arguments).</p>
<p>As in equation 1 (and 13), we actually need <em>two</em> p terms:</p>
<p><span class="math display">\[I_{i}(p,t) = \frac{f_{i}(p,t)}{f_{i}(p_{i}(t), t)} \]</span></p>
<p>the first (p) accounts for every ‘unit’ of the pitch (we divide the pitch up into small squares and calculate a player’s influence on each), and the second (p_i) is the player’s own location on the pitch. The denominator (the control of the pitch at player i’s own x,y) normalises the control they exert across the pitch, so that it is exactly 1 at the player’s own location.</p>
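<p>A hand-rolled bivariate normal density makes that normalisation concrete (a Python sketch with illustrative numbers; the real pipeline uses dmvnorm): dividing by the density at the player’s own location forces I = 1 exactly there, while the ratio can edge above 1 towards μ.</p>

```python
import math

def bvn_pdf(p, mu, cov):
    # density of a 2d normal, written out by hand for a 2x2 covariance
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    dx, dy = p[0] - mu[0], p[1] - mu[1]
    quad = (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

player = (50.0, 40.0)              # illustrative position
mu = (52.0, 41.0)                  # centre shifted along direction of travel
cov = [[9.0, 2.0], [2.0, 4.0]]     # illustrative covariance

I_at_player = bvn_pdf(player, mu, cov) / bvn_pdf(player, mu, cov)
print(I_at_player)  # 1.0 by construction
I_at_centre = bvn_pdf(mu, mu, cov) / bvn_pdf(player, mu, cov)
print(I_at_centre > 1)  # True: influence peaks at mu, not at the player
```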
<p>To create the matrix of pitch zones, we can simply use seq and expand.grid on the dimensions of the pitch. Splitting each dimension 200 ways leaves us with a 40000 x 2 data.frame to apply as p. For p_i, we just use the player’s x and y coordinates.</p>
<pre class="r"><code>#use statsbomb coords - 120m x 80m pitch
#split into a 200x200 grid of rectangles
pitch <- expand.grid(seq(0, 120, length.out = 200), seq(0, 80, length.out = 200)) %>%
  rename(x = Var1, y = Var2)
#function to calculate I as in equation 1/13
calc_I <- function(pitch_area, x, y, mu_x, mu_y, Sigma) {
  #create vectors
  mu <- c(mu_x, mu_y)
  player_loc <- c(x, y)
  numerator <- dmvnorm(as.matrix(pitch_area), mu, Sigma)
  denominator <- dmvnorm(t(matrix(player_loc)), mu, Sigma)
  #and normalise
  norm_pdf = numerator / denominator
  return(norm_pdf)
}
#column I is the control on pitch area x,y of player I
I <- calc_I(pitch, x_start, y_start, mu_x, mu_y, Sigma)
head(mutate(pitch, I))</code></pre>
<pre><code>##           x y            I
## 1 0.0000000 0 4.256184e-96
## 2 0.6030151 0 5.076124e-95
## 3 1.2060302 0 5.967198e-94
## 4 1.8090452 0 6.914091e-93
## 5 2.4120603 0 7.896343e-92
## 6 3.0150754 0 8.888804e-91</code></pre>
<p>We of course need to do this across the whole team, summing the pitch influence per team then finding the difference between them as per equation 2 in the paper</p>
<p><span class="math display">\[PC_{(p,t)} = \sigma\left(\sum_{i} I_{i(p,t)} - \sum_{j} I_{j(p,t)}\right)\]</span></p>
<p>I’ve nested all the functions we’ve written into one larger function, which is then applied to every row of a team’s data using pmap from the purrr package.</p>
<pre class="r"><code>#test our functions on one frame of the tracking data
testing_data <- tracking_data %>%
  filter(time == 600)
#sum all our little functions into one bigger function
calc_PC <- function(time, next_time, ball_x, ball_y, x, y, next_x, next_y, team, player, pitch_area) {
  speed_x <- get_speed(x, next_x, time, next_time)
  speed_y <- get_speed(y, next_y, time, next_time)
  srat <- get_srat(speed_x, speed_y)
  theta <- get_theta(speed_x, speed_y)
  mu_x <- get_mu(x, speed_x)
  mu_y <- get_mu(y, speed_y)
  ri <- get_ri(x, y, ball_x, ball_y)
  R <- get_R(theta)
  S <- get_S(ri, srat)
  Sigma <- get_Sigma(R, S)
  pitch_area$I <- calc_I(as.matrix(pitch_area), x, y, mu_x, mu_y, Sigma)
  pitch_area$team <- team
  pitch_area$time <- time
  pitch_area$player <- player
  return(pitch_area)
}
#run the pitch control function
pitch_control <- testing_data %>%
  select(time, next_time, ball_x, ball_y, x, y, next_x, next_y, player, team) %>%
  #run func
  pmap_df(., calc_PC, pitch_area = pitch) %>%
  #sum by team and area
  group_by(team, x, y) %>%
  summarise(team_sum = sum(I)) %>%
  pivot_wider(names_from = team, values_from = team_sum) %>%
  #σ - logistic function
  mutate(PC = 1 / (1 + exp(Home_Team - Away_Team)))</code></pre>
<p>After calculating the individual pitch control metrics, we sum by team and pixel and then subtract the away team sum from the home team sum and run it through a simple logistic function (σ)</p>
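<p>The σ step is nothing more than a logistic squashing of the difference in summed team influence; the sign convention just decides which team maps towards 1. A minimal Python sketch:</p>

```python
import math

def pitch_control(home_sum, away_sum):
    # logistic of the difference in summed influence for the two teams
    return 1 / (1 + math.exp(home_sum - away_sum))

print(pitch_control(5.0, 5.0))         # 0.5, a perfectly contested cell
print(pitch_control(0.0, 8.0) > 0.99)  # True: one team dominates the cell
```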
<pre class="r"><code>#get the position of the ball for this frame
ball_location <- testing_data %>%
  select(ball_x, ball_y) %>%
  unique()
#plot it all
p6 <- ggplot() +
  #pitch layout background
  annotate_pitch(dimensions = pitch_statsbomb) +
  #pitch control raster
  geom_tile(data = pitch_control, aes(x = x, y = y, fill = PC), alpha = 0.7) +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0.5) +
  #players for each team
  #also add in little vector arrows
  geom_segment(data = testing_data, aes(x = x, y = y, xend = forward_x, yend = forward_y, colour = team),
               size = 1, arrow = arrow(length = unit(0.01, "npc"))) +
  geom_point(data = testing_data, aes(x = x, y = y, colour = team), size = 3) +
  scale_colour_manual(values = c("black", "gold"), guide = FALSE) +
  #ball location
  geom_point(data = ball_location, aes(x = ball_x, y = ball_y),
             colour = "black", fill = "white", shape = 21, size = 2.5, stroke = 2) +
  theme_pitch()
p6</code></pre>
<p><img src="/post/2020-04-20-wide_open_spaces_1_files/figure-html/plot_PC-1.png" width="672" /></p>
<p>It looks pretty good! We can see which areas of the pitch the yellow and black teams control (the blue and red areas respectively; the ball here is the white circle outlined in black). In theory, we can now run this function over the whole tracking_data data frame and calculate the control of each team over every part of the pitch at any time.</p>
<p>If we know this, we can work out (e.g.) the potential of an attack by multiplying the pitch control by a second layer: the value of every area of the pitch. For a very good intro into why/how you might value pitch areas, see <a href="https://karun.in/blog/expected-threat.html">Karun Singh’s explanation of Expected Threat</a>. The paper itself uses a neural network based on the ball location. It can best be understood by imagining that you only know the location of the ball and are asked where the best place to pass it to would be. Moving it towards the centre of the opposition goal (reducing distance and angle) is always better, but you also want to maximise the chance of the pass being successful. The paper includes a great mp4 of modelled real-life play hosted on <a href="http://www.lukebornn.com/sloan/field_value.mp4">Luke Bornn’s website</a>.</p>
<p>This post is already long enough, so I’m not going to go into pitch value more here, but I will hopefully write a follow-up combining the two at some point.</p>
</div>
<div id="prematureoptimisation" class="section level1">
<h1>(premature) optimisation</h1>
<p>(There’s not really much gain from reading beyond here, but I attempted to implement the algorithm in Rcpp for some optimisation, which worked a little. I’m sure this function could be vastly improved, though, so it might be of value leaving it here for others to run with.)</p>
<p>So this is all fine and good, but we probably want to run this at least over every frame in the game, and possibly many games! To do this we’re really going to want to optimise the crap out of this function. I’ve had a first go at this using Rcpp and RcppArmadillo to implement the whole pitch control algorithm. It actually didn’t speed things up as much as I wanted*, but it does shave 20-30% off the time the R function takes (it’s also just good practice for me to write more C++).</p>
<p>*lots more low-hanging fruit to take out of it, but it does the job for now</p>
<p>We’ll need a few Rcpp libraries to implement this:</p>
<pre class="r"><code>library(Rcpp)
library(RcppArmadillo)</code></pre>
<p>And then we can use an Rcpp chunk to export a compiled function that R can access</p>
<pre class="cpp"><code>//namespaces
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;

/* C++ version of the dtrmv BLAS function */
// stolen from https://gallery.rcpp.org/articles/dmvnorm_arma/
void inplace_tri_mat_mult(arma::rowvec &x, arma::mat const &trimat){
  arma::uword const n = trimat.n_cols;
  for(unsigned j = n; j-- > 0;){
    double tmp(0.);
    for(unsigned i = 0; i <= j; ++i)
      tmp += trimat.at(i, j) * x[i];
    x[j] = tmp;
  }
}

//set log(2pi) as a constant
static double const log2pi = std::log(2.0 * M_PI);

//replaces the dmvnorm() multivariate density calculation
arma::vec dmvnrm_arma_fast(arma::mat const &x,
                           arma::rowvec const &mean,
                           arma::mat const &sigma,
                           bool const logd = false) {
  using arma::uword;
  uword const n = x.n_rows,
             xdim = x.n_cols;
  arma::vec out(n);
  arma::mat const rooti = arma::inv(trimatu(arma::chol(sigma)));
  double const rootisum = arma::sum(log(rooti.diag())),
               constants = -(double)xdim/2.0 * log2pi,
               other_terms = rootisum + constants;
  arma::rowvec z;
  for (uword i = 0; i < n; i++) {
    z = (x.row(i) - mean);
    inplace_tri_mat_mult(z, rooti);
    out(i) = other_terms - 0.5 * arma::dot(z, z);
  }
  if (logd)
    return out;
  return exp(out);
}

//does all the calculations in the paper
//outputs a vector
// [[Rcpp::export]]
arma::vec calc_I_cpp(arma::vec coords, arma::vec next_coords, arma::vec ball_coords, double t, double next_t, arma::mat pitch, arma::mat coord_mat) {
  arma::vec rng = runif(1);
  //tiny jitter avoids zero velocities
  arma::vec velocity = ((next_coords - coords) + (rng[0] / 10000)) / (next_t - t);
  double speed = norm(velocity);
  double srat = pow((speed / 13), 2.0);
  double theta = acos(velocity[0] / speed);
  //sometimes players reach 'impossible' speeds
  if(srat > 1) {
    velocity = {(12.5 * cos(theta)), (12.5 * sin(theta))};
    speed = norm(velocity);
    srat = pow((speed / 13), 2.0);
  }
  arma::mat R = {{+cos(theta), -sin(theta)},
                 {+sin(theta), +cos(theta)}};
  arma::vec m = coords + velocity / 2;
  arma::rowvec mu = arma::conv_to<arma::rowvec>::from(m);
  double ri_val = 4.0 + (pow(norm(ball_coords - coords), 3.0) / (pow(18.0, 3) / 6));
  double ri = std::min(ri_val, 10.0);
  arma::mat S = {{ri * (1 + srat) / 2, 0},
                 {0, ri * (1 - srat) / 2}};
  arma::mat inv_R = arma::inv(R);
  arma::mat Sigma = R * S * S * inv_R;
  arma::vec numerator = dmvnrm_arma_fast(pitch, mu, Sigma);
  arma::vec denominator = dmvnrm_arma_fast(coord_mat, mu, Sigma);
  arma::vec I = numerator / denominator[0];
  return I;
}
</code></pre>
<p>And we can now start running this over multiple frames. My laptop is pretty hideously falling apart at the moment, so I’ve limited it here, but you could certainly run it over many more frames. Remember that for plotting a single frame we’re using a 40000-row (200 * 200) data frame per player to store everything, which is surely less than optimal; even cutting that down as much as feasible, with 25 frames a second, memory bloat is going to happen fast.</p>
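<p>Some back-of-envelope arithmetic shows why (a Python sketch; the 22 tracked players per frame is my assumption, not stated in the post):</p>

```python
grid_cells = 200 * 200   # the pitch raster used above
players = 22             # assumed players tracked per frame
fps = 25                 # tracking frame rate

rows_per_frame = grid_cells * players
rows_per_minute = rows_per_frame * fps * 60
print(rows_per_frame)    # 880000
print(rows_per_minute)   # 1320000000, over a billion rows per match minute
```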
<p>In a future post at some point I’d like to actually try some analysis using this work, and I think the key is really to analyse within each frame and output a condensed (pitch area controlled * value) metric for each player.</p>
<p>For now though, I’ve posted a plot of ten seconds (not consecutive frames) of data. If you click on it, it links to an imgur gif of the proper combination of those frames.</p>
<pre class="r"><code>#ugly packaged up function
calc_PC_cpp <- function(time, next_time, ball_x, ball_y, x, y, next_x, next_y, team, player, pitch_area) {
  #blargh, terribly written, ran out of energy to improve
  pitch_area$I <- calc_I_cpp(c(x, y), c(next_x, next_y), c(ball_x, ball_y), time, next_time, as.matrix(pitch_area), t(c(x, y)))
  pitch_area$team <- team
  pitch_area$time <- time
  pitch_area$player <- player
  return(pitch_area)
}
#sample 10 seconds worth of data
animation_data <- tracking_data %>%
  filter(time %in% 600:610) %>%
  dplyr::select(time, next_time, ball_x, ball_y, x, y, next_x, next_y, team, player)
#run the function over the data
anim_pitch_control <- animation_data %>%
  #run func
  pmap_df(., calc_PC_cpp, pitch_area = pitch) %>%
  #sum by team and area
  group_by(team, x, y, time) %>%
  summarise(team_sum = sum(I)) %>%
  pivot_wider(names_from = team, values_from = team_sum) %>%
  #σ - logistic function
  mutate(PC = 1 / (1 + exp(Home_Team - Away_Team)))
#plot
p7 <- ggplot(anim_pitch_control, aes(x = x, y = y, colour = PC)) +
  annotate_pitch(dimensions = pitch_statsbomb) +
  geom_point(alpha = 0.7, shape = 15) +
  scale_colour_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0.5) +
  theme_pitch() +
  labs(title = "pitch control rasters by match time (s)") +
  facet_wrap(~time)</code></pre>
<p><a href="https://i.imgur.com/fYIfjaR.mp4" title="plot of surface control"><img src="/img/pitch_control_raster.png" alt="plot of surface control" /></a></p>
<p>(click for link to gif)</p>
<p>I actually really like these plots of just the surface control; they remind me of high-dimensional (e.g. biological sample) sorting plots, and I think they just look pretty funky.</p>
<p>I mentioned I benchmarked the functions earlier; here’s some sample code of the benchmarking. It’s not really an apples-to-apples comparison because of the tweaks to the cpp function, and obviously calling pmap_df on a single row of a data.frame isn’t really what it’s for… it’s more just to document things a little (also please ignore the spaghetti passing of functions).</p>
<pre class="r"><code>library(microbenchmark)
microbenchmark(
pmap_calc_pc = pmap_df(animation_data[1,], calc_PC, pitch_area = pitch),
pmap_calc_pc_cpp = pmap_df(animation_data[1,], calc_PC_cpp, pitch_area = pitch),
calc_pc = calc_PC(animation_data$time[1], animation_data$next_time[1], animation_data$ball_x[1], animation_data$ball_y[1], animation_data$x[1], animation_data$y[1], animation_data$next_x[1], animation_data$next_y[1], "teamA", "playera", pitch),
calc_pc_cpp = calc_PC_cpp(animation_data$time[1], animation_data$next_time[1], animation_data$ball_x[1], animation_data$ball_y[1], animation_data$x[1], animation_data$y[1], animation_data$next_x[1], animation_data$next_y[1], "teamA", "playera", pitch),
times = 1000
)</code></pre>
<pre><code>## Unit: milliseconds
##              expr    min      lq     mean  median      uq      max neval
##      pmap_calc_pc 4.5347 5.95505 7.129008 6.30795 6.84670 125.7230  1000
##  pmap_calc_pc_cpp 3.8280 4.93920 5.608463 5.20625 5.64985 126.1780  1000
##           calc_pc 3.1482 4.16710 5.136463 4.38575 4.82170 127.3604  1000
##       calc_pc_cpp 2.4537 3.15150 4.023492 3.31625 3.63570 212.4050  1000</code></pre>
<p>That’s all for this post! As I said, at some point (soon? later? who knows) I’d like to include the value term, because conceptually it’s not hard to get a stupid version of it going. Hopefully this is of use to some people. As I said up top, this was written in evenings locked inside during quarantine, so there are probably maths mistakes / huge coding errors etc. If people point them out and get in touch I’ll fix them.</p>
<p>Best!</p>
</div>

Five Minute Football Trivia - Trans-Europe Express
/post/five_min_trivia_kraftwerk/
Thu, 02 Apr 2020 00:00:00 +0000
/post/five_min_trivia_kraftwerk/
<p><em>generally, as I have less and less time to waste on meaningless football stats, I get halfway through a post and abandon it. To remedy this, I want to start pushing out posts that give a reasonable half-guess at an answer within an hour or so, without needing to really check my working or write good prose. This is the third of these.</em></p>
For this week’s question, I’m stealing straight from the source of most of my posts, <a href="https://www.theguardian.com/football/series/theknowledge">The Knowledge column</a> at The Guardian:
<blockquote class="twitter-tweet">
<p lang="en" dir="ltr">
What is the shortest total distance a club has had to travel in a Champions League winning campaign? (Perhaps average distance per (away) fixture to balance out changes in format over the years.)
</p>
— JBfaeDundee (<span class="citation">@JBfaeDundee</span>) <a href="https://twitter.com/JBfaeDundee/status/1242529510735720448?ref_src=twsrc%5Etfw">March 24, 2020</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
I’m going to turn it on its head a bit and find the longest-distance campaigns, mostly because I find that more interesting, but also because it reminded me of this tweet from a few years ago
<blockquote class="twitter-tweet">
<p lang="en" dir="ltr">
The longest away trip in the world took place today as Baltika Kaliningrad travelled about 10,000 km to meet Luch Vladivostok in the second division in Russia. <br>The game ended 0-0, of course. <a href="https://t.co/EsSpmWzddk">pic.twitter.com/EsSpmWzddk</a>
</p>
— Michael Yokhin (<span class="citation">@Yokhin</span>) <a href="https://twitter.com/Yokhin/status/980050993810493440?ref_src=twsrc%5Etfw">March 31, 2018</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>And generally I love weird quirks of geography that lead to commutes of 13 hours like this.</p>
<p>As always, first load the libraries we need. Having looked around, the most organised dataset seemed to be at <a href="https://www.worldfootball.net">worldfootball.net</a>.</p>
<pre class="r"><code>#scrape
library(rvest)
#using data from worldfootball.net
base_url <- "https://www.worldfootball.net"
#tidy
library(tidyverse)
library(magrittr)
#map
library(sf)
library(rnaturalearth)
library(ggthemes)</code></pre>
<p>To find the location of every team, we need a data.frame of every team to have competed in the Champions League (and qualifying) since its inception. We can get that by sprintf-ing a list of urls and scraping the links to each team page from there. For this, and most of the scraping jobs below, I saved the data the first time I scraped it so that I don’t have to continually re-stress the worldfootball server. The datasets can be found in the static folder of my website’s GitHub.</p>
<pre class="r"><code>#the years each competition took place
years <- 1955:2018
qual_years <- c(1966, 1969:1971, 1978:1982, 1992:2018)
#sprintf the correct urls together
all_urls <- c(sprintf("/champions-league-%d-%d", years, years+1),
              sprintf("/champions-league-qualifikation-%d-%d", qual_years, qual_years+1))
#some exceptions
all_urls[grepl("champions-league-2010-2011", all_urls)] %<>% paste0(., "_3")
all_urls[grepl("champions-league-2008-2009", all_urls)] %<>% paste0(., "_2")
all_urls[grepl("qualifikation-2008-2009", all_urls)] %<>% gsub("qualifikation", "qf", .)
all_urls[grepl("qualifikation", all_urls) & as.numeric(gsub(".*-", "", all_urls)) > 2009] %<>%
  gsub("qualifikation", "qual", .)
#scrape the list of each team's links
teams <- map_df(all_urls, function(competition_link) {
  #read once and scrape from there
  read <- read_html(paste0(base_url, "/players", competition_link))
  #get the useful info
  competition <- read %>% html_nodes("h1") %>% html_text() %>% gsub(" » .*", "", .)
  team_name <- read %>% html_nodes("td:nth-child(2) a") %>% html_text()
  team_info <- read %>% html_nodes("td:nth-child(4) a") %>% html_attr("href")
  #compile into a df to return
  df <- data.frame(competition, team_name, team_info)
  return(df)
})</code></pre>
<p>We can then take a look at what we have on our hands</p>
<pre class="r"><code>head(teams)</code></pre>
<pre><code>##                  competition              team_name
## 1 Champions League 1955/1956      1. FC Saarbrücken
## 2 Champions League 1955/1956              Aarhus GF
## 3 Champions League 1955/1956               AC Milan
## 4 Champions League 1955/1956 Budapesti Vörös Lobogó
## 5 Champions League 1955/1956         Djurgårdens IF
## 6 Champions League 1955/1956       Gwardia Warszawa
##                     team_info
## 1 /teams/1-fc-saarbruecken/1/
## 2         /teams/aarhus-gf/1/
## 3          /teams/ac-milan/1/
## 4      /teams/mtk-budapest/1/
## 5    /teams/djurgardens-if/1/
## 6   /teams/gwardia-warszawa/1/</code></pre>
<p>As a short aside, one of the things I really enjoy about posts like this one is it exposes you to lots of history from the ‘early’ days of organised football and the teams (some of which remain, some do not) that were present then.</p>
<p>After this, we then want to scrape the data on every match played in the Champions League in a similar manner:</p>
<pre class="r"><code>match_data <- map_df(all_urls, function(competition_link) {
  #read once
  read <- read_html(paste0(base_url, "/all_matches", competition_link))
  #get the competition/season id from the url
  season <- gsub("(^.*-)([0-9]{4}-[0-9]{4})(.*)", "\\2", competition_link)
  competition <- ifelse(grepl("qual|qf", competition_link), "ucl-quals", "ucl")
  #scrape the links to each match- we'll need some of these later
  match_link <- read %>%
    html_nodes("td:nth-child(6) a") %>%
    html_attr("href")
  #save the champions league matches into a df
  matches_df <- read %>%
    html_nodes("#site > div.white > div.content > div > div.box > div > table") %>%
    html_table(fill = TRUE, header = FALSE) %>%
    as.data.frame() %>%
    #rename
    select(date = X1, round = X4, home = X3, away = X5, result = X6) %>%
    #mutate the correct round to matches
    mutate(round = case_when(
      round != "" ~ round
    )) %>%
    mutate(date = case_when(
      date != "" ~ date
    )) %>%
    mutate(round = zoo::na.locf(round)) %>%
    #filter out valid matches
    filter(grepl("^[0-9]*:[0-9]*|abor.", result)) %>%
    mutate(date = zoo::na.locf(date)) %>%
    #few exceptions of matches that weren't played
    filter(!(grepl("dec.", result) & date == "01/12/1965")) %>%
    filter(!(home == "FK Partizani" & date == "30/09/1987")) %>%
    mutate(match_link, season, competition)
  return(matches_df)
})</code></pre>
<p>We can glimpse this to see that there are 7206 matches listed across the competition proper and the qualification rounds since the 1950s. This resolves down to 2875 unique team-seasons (from ~561 unique teams) involved in either competition.</p>
<pre class="r"><code>head(match_data)</code></pre>
<pre><code>##         date    round                   home             away    result
## 1 04/09/1955 1. Round            Sporting CP         Partizan 3:3 (1:1)
## 2 07/09/1955 1. Round Budapesti Vörös Lobogó   RSC Anderlecht 6:3 (3:2)
## 3 08/09/1955 1. Round        Servette Genève      Real Madrid 0:2 (0:0)
## 4 14/09/1955 1. Round        Rot-Weiss Essen     Hibernian FC 0:4 (0:2)
## 5 20/09/1955 1. Round         Djurgårdens IF Gwardia Warszawa 0:0 (0:0)
## 6 21/09/1955 1. Round              Aarhus GF      Stade Reims 0:2 (0:1)
##                                                                    match_link
## 1         /report/champions-league-1955-1956-1-runde-sporting-cp-partizan/
## 2     /report/champions-league-1955-1956-1-runde-mtk-budapest-rsc-anderlecht/
## 3    /report/champions-league-1955-1956-1-runde-servette-geneve-real-madrid/
## 4    /report/champions-league-1955-1956-1-runde-rot-weiss-essen-hibernian-fc/
## 5 /report/champions-league-1955-1956-1-runde-djurgardens-if-gwardia-warszawa/
## 6           /report/champions-league-1955-1956-1-runde-aarhus-gf-stade-reims/
##      season competition
## 1 1955-1956         ucl
## 2 1955-1956         ucl
## 3 1955-1956         ucl
## 4 1955-1956         ucl
## 5 1955-1956         ucl
## 6 1955-1956         ucl</code></pre>
<p>To work out the distances travelled, we then need to find the locations of each of these matches. The easiest way would be to run through each of those match links and scrape the location data, but that would put a lot of load on the worldfootball servers, so we can be smarter than that.</p>
<p>Matches generally take place at the home location of each team (or, in some exceptions, very close to it; e.g. Arsenal’s home matches take place in North London). However, in the history of the competition, 2-legged ties that ended as a draw used to go to a third leg at a neutral location (for example <a href="https://www.worldfootball.net/report/champions-league-1992-1993-1-runde-vfb-stuttgart-leeds-united_2/">Leeds United vs. VFB Stuttgart in 1992 took place at the Nou Camp</a>). Also, each final is played at a preselected venue that is independent of the eventual finalists.</p>
<p>We can find the data for these matches and scrape the exact location from the match link, while taking the rest from the location of the home team in the tie.</p>
<pre class="r"><code>#split data by neutral venue or not
match_locations <- match_data %>%
  split(f = (.$round == "Final" |
               duplicated(paste(.$home, .$away, .$round, .$season))))
#function for scraping the location of the neutral matches
#uses a link to a specific match
get_neutral_location <- function(link) {
  full_url <- paste0(base_url, link)
  #get and munge the location
  node <- ".standard_tabelle tr:nth-child(1) .dunkel~ .dunkel+ .dunkel"
  read <- read_html(full_url)
  location <- read %>% html_nodes(node) %>% html_text() %>%
    gsub("\\(|\\)|\\/", "", .)
  return(location)
}
#run through this function to locate all neutral matches
neutral_matches <- match_locations[[2]] %>%
  mutate(location = unlist(lapply(match_link, get_neutral_location))) %>%
  mutate(type = "neutral") %>%
  select(-match_link)</code></pre>
<p>We can see we’ve gathered a few extra matches that weren’t actually neutral, but given we get their correct location anyway, it’s no big deal.</p>
<p>We then have to use the information on each team to get the location of their home ground. For larger teams we can get this to within an exact postcode if we so wish, but for many (e.g. <a href="https://www.worldfootball.net/teams/szombierki-bytom/1/">former Polish champions Szombierki Bytom</a>) all we can get from their page is the country. This is fine, because we’ll combine this with the team name in a Google search to get more exact locations later (in any case, the proportion of teams with poor geographic data probably gets lost in the noise overall).</p>
<pre class="r"><code>#scrape the information on each team's location from their
#worldfootball profile page
get_team_location <- function(link) {
  read <- read_html(paste0(base_url, link))
  stadium_link <- read %>%
    html_nodes(".yellow tr:nth-child(5) a") %>%
    html_attr("href")
  #if the page contains a link to a stadium, scrape from there
  if(length(stadium_link) > 0) {
    stadium_link <- paste0(base_url, stadium_link)
    location <- read_html(stadium_link) %>%
      html_nodes(".yellow tr:nth-child(1) td , .yellow tr:nth-child(2) td") %>%
      html_text() %>%
      .[c(2,4)] %>%
      gsub("\\r|\\t|\\n", "", .) %>%
      paste0(collapse = " ")
    return(location)
  #otherwise get a best approximation
  } else {
    country <- read %>%
      html_nodes(".portfolio tr:nth-child(3) .hell+ .hell") %>%
      html_text() %>%
      gsub("\\r|\\t|\\n", "", .)
    return(country)
  }
}
#run the function over each team
team_info <- teams %>%
  filter(!duplicated(team_name)) %>%
  mutate(location = unlist(lapply(team_info, get_team_location)))</code></pre>
<pre class="r"><code>head(team_info)</code></pre>
<pre><code>##                  competition              team_name
## 1 Champions League 1955/1956      1. FC Saarbrücken
## 2 Champions League 1955/1956              Aarhus GF
## 3 Champions League 1955/1956               AC Milan
## 4 Champions League 1955/1956 Budapesti Vörös Lobogó
## 5 Champions League 1955/1956         Djurgårdens IF
## 6 Champions League 1955/1956       Gwardia Warszawa
##                     team_info            location
## 1 /teams/1-fc-saarbruecken/1/ Saarbrücken Germany
## 2         /teams/aarhus-gf/1/      Aarhus Denmark
## 3          /teams/ac-milan/1/        Milano Italy
## 4      /teams/mtk-budapest/1/    Budapest Hungary
## 5    /teams/djurgardens-if/1/    Stockholm Sweden
## 6   /teams/gwardia-warszawa/1/        Poland 0000</code></pre>
<p>Now that we have a rough location for each team, we can join everything back together to get a complete list of matches and where (to a best approximation, sometimes) they took place.</p>
<pre class="r"><code>#join the team location into the non-neutral matches
nonneutral_matches <- match_locations[[1]] %>%
  left_join(., select(team_info, -competition), by = c("home" = "team_name")) %>%
  mutate(type = "normal") %>%
  select(names(neutral_matches))
#join neutral and non-neutral matches back together
all_matches <- rbind(neutral_matches, nonneutral_matches) %>%
  mutate(match_location = case_when(
    type == "normal" ~ paste(home, "football club", location),
    type == "neutral" ~ location
  ))</code></pre>
<p>Now we have the locations for each match, but not in a quantitative form. For that, we’ll use the <a href="https://cran.r-project.org/web/packages/googleway/vignettes/googleway-vignette.html">googleway</a> package, which provides access to a variety of Google APIs, including the geocoding feature of Google Maps. Obviously, I haven’t included my unique key below, but you can get one for free using <a href="https://developers.google.com/maps/documentation/javascript/tutorial">this link</a>.</p>
<p>For each location we’ll return a latitude and longitude, which will allow us to calculate exactly the distances between a team’s home location and each match they played.</p>
<pre class="r"><code>#fake key
google_key <- "myGooGLeKEy1234567"
#function to get lat/lon data from Google Maps
googleway_geocode <- function(location, key){
  data <- google_geocode(location, key = key)
  latlon <- data$results$geometry$location[1,]
  if(length(latlon) == 0) {
    return(data.frame(lat = NA, lng = NA, location))
  } else {
    return(latlon %>% mutate(location))
  }
}
#run the function over each unique location
locations <- unique(all_matches$match_location) %>%
  map_df(., googleway_geocode, key = google_key)</code></pre>
<p>This gets us 99% of the way there, though the API does miss a few smaller or awkwardly-formatted clubs (e.g. Monaco is not ‘in’ France per se, but an enclave within French territory, which trips Google Maps up).</p>
<pre class="r"><code>locations %>%
filter(is.na(lat))</code></pre>
<pre><code>## lat lng location
## 1 NA NA AS Monaco football club Monaco France
## 2 NA NA St Patrick's Athletic football club Dublin Ireland
## 3 NA NA FK Sloga Jugomagnat football club North Macedonia
## 4 NA NA Tsement Ararat football club Ararat Armenia
## 5 NA NA NK Brotnjo football club Bosnia-Herzegovina
## 6 NA NA Dunaferr SE football club Hungary 0000
## 7 NA NA Araks Ararat football club Ararat Armenia
## 8 NA NA FK Gomel football club Belarus 1959
## 9 NA NA Sioni Bolnisi football club Georgia
## 10 NA NA SS Murata football club San Marino
## 11 NA NA KF Shkëndija 79 football club North Macedonia
## 12 NA NA SP Tre Penne football club San Marino
## 13 NA NA Ulisses FC football club Armenia
## 14 NA NA SP La Fiorita football club San Marino
## 15 NA NA Lincoln Red Imps football club Gibraltar 0000
## 16 NA NA Dila Gori football club Georgia
## 17 NA NA KF Trepça'89 football club Kosovo
## 18 NA NA Europa FC football club Gibraltar
## 19 NA NA FK Spartaks football club Latvia
## 20 NA NA FK Kukësi football club Albania</code></pre>
<p>To solve this, the best way is sometimes just the stupidest, so here are the manually-found locations of these clubs:</p>
<pre class="r"><code>#manually enter lat/lon for the missing locations
missing_locs <- data.frame(
  lat = c(43.73, 53.34, 42.02, 39.86, 43.2, 46.96, 39.86, 52.44, 41.44, 43.93, 42.01, 43.93, 40.17, 43.93, 36.14, 41.98, 42.88, 36.14, 56.94, 42.07),
  lng = c(7.41, -6.27, 21.44, 44.69, 17.7, 18.94, 44.69, 31.01, 44.53, 12.44, 20.97, 12.44, 44.52, 12.44, -5.35, 44.10, 20.86, -5.35, 23.61, 20.42),
  location = locations$location[is.na(locations$lat)]
)
#bind everything together
all_locations <- locations %>%
  filter(!is.na(lat)) %>%
  rbind(., missing_locs) %>%
  #convert to an sf object with worldwide projection
  st_as_sf(coords = c("lng", "lat"), crs = st_crs("+init=epsg:4326"))</code></pre>
<p>At the end, I also cast the object to a <a href="https://r-spatial.github.io/sf/articles/sf1.html">simple features</a> (sf) data.frame to allow for easier manipulation of the geographic data, and add the reference for Earth’s lat/lon coordinate system (epsg:4326).</p>
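<p>Before leaning on those coordinates, it’s worth sanity-checking what a great-circle distance between two lon/lat points should look like. Below is a minimal base-R haversine sketch (assuming a spherical Earth of radius 6371km, so it will disagree slightly with sf’s ellipsoidal calculations); the coordinates are approximate values for Paris and Madrid, used purely for illustration.</p>
<pre class="r"><code>#haversine great-circle distance, assuming a spherical Earth (radius in km)
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(pmin(1, a)))
}
#e.g. Paris to Madrid comes out at roughly 1050km as the crow flies
haversine_km(2.35, 48.86, -3.70, 40.42)</code></pre>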
<p>We can then merge the geographic data into our dataframe of every match and see the location of every club to have played in (some stage of) the Champions League over the last ~60 years.</p>
<pre class="r"><code>#join in the geographic information
all_matches %<>% left_join(., all_locations, by = c("match_location" = "location"))
#plot the home locations of all teams
p1 <- all_matches %>%
  filter(type == "normal") %>%
  filter(!duplicated(home)) %>%
  ggplot(.) +
  geom_sf(data = st_as_sf(ne_countries(scale = 110), crs = st_crs("+init=epsg:4326")),
          colour = NA) +
  geom_sf_text(aes(label = home, geometry = geometry), alpha = 0.5) +
  #taken from st_bbox(all_matches$geometry)
  coord_sf(xlim = c(-24, 78), ylim = c(30, 67)) +
  ggtitle("Home location of every Champions League team",
          subtitle = "1955-2019, includes qualifying rounds") +
  theme_map()
#plot
p1</code></pre>
<p><img src="/post/20200329ucl_distance_files/figurehtml/plot_locations1.png" width="672" /></p>
<p>It’s quite nice to see the distribution: hubs around large cities with competitive leagues (e.g. Denmark, the Czech Republic, the Rhine), with extremes to the north in Iceland/the Faroe Islands, to the south in Israel, and to the far, far east with the Central Asian UEFA countries.</p>
<p>The first thing to work out is the matches per team, which can be done via a simple gather (in theory you’d want pivot_longer, which has superseded gather, but as far as I know it doesn’t play well with geometry data yet). We also mutate in two extra variables to keep the home and away team names for data-presentation purposes.</p>
<p>Once we have that, we have each match played by each team, each season. A nice little result is that we can see which teams have had the longest campaigns (in terms of number of matches); it turns out these are the Valencia and Bayer Leverkusen teams that qualified and got to the finals of the Champions League during the longer two-group-stage format at the turn of the century.</p>
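<p>The reshaping is easier to see on a toy example (the column names and teams here are made up for illustration): each match row becomes two rows in the long, team-level view, one per side.</p>
<pre class="r"><code>#toy illustration of melting matches to a team-level view
matches <- data.frame(
  match_id = 1:2,
  home = c("Ajax", "Inter"),
  away = c("Benfica", "Celtic")
)
#stack the home and away columns into location/team_name pairs
team_view <- rbind(
  data.frame(match_id = matches$match_id, location = "home", team_name = matches$home),
  data.frame(match_id = matches$match_id, location = "away", team_name = matches$away)
)
#each match now appears twice, once for each team that played in it</code></pre>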
<pre class="r"><code>#melt the match data by team
team_campaigns <- all_matches %>%
  select(season, date, competition, round, home, away, result, geometry) %>%
  #keep the home and away columns for later
  mutate(home_keep = home, away_keep = away) %>%
  gather("location", "team_name",
         -season, -competition, -round, -result, -geometry, -date,
         -home_keep, -away_keep)
#get the longest campaigns in terms of n matches
longest_campaigns <- team_campaigns %>%
  group_by(season, team_name) %>%
  summarise(matches = n()) %>%
  arrange(-matches)
head(longest_campaigns)</code></pre>
<pre><code>## # A tibble: 6 x 3
## # Groups: season [4]
## season team_name matches
## <chr> <chr> <int>
## 1 1999-2000 Valencia CF 19
## 2 2000-2001 Valencia CF 19
## 3 2001-2002 Bayer Leverkusen 19
## 4 2002-2003 AC Milan 19
## 5 2000-2001 Leeds United 18
## 6 2001-2002 FC Barcelona 18</code></pre>
<p>But we want to work out the distance to each match, not the number of them. To do this, we first work backwards and get the lat/lon of each club’s home ground. We can then merge this with the match-location data and find the distance between the two locations (in metres), i.e. for every home game a team will travel 0m, whereas the away club will probably travel many kilometres.</p>
<pre class="r"><code>#work backwards and get the home location of each team
team_locations <- all_matches %>%
  filter(type == "normal") %>%
  filter(!duplicated(home)) %>%
  select(team_name = home, location = match_location) %>%
  left_join(., all_locations, by = "location")
#merge this in
#for each team match have location of match and home location of team
match_travel <- team_campaigns %>%
  left_join(., select(team_locations, team_name, geometry), by = "team_name") %>%
  #calculate the distance between each team's home location and the match
  mutate(distance = st_distance(geometry.x, geometry.y, by_element = TRUE))
head(match_travel)</code></pre>
<pre><code>## season date competition round result
## 1 1955-1956 13/06/1956 ucl Final 4:3 (2:2)
## 2 1956-1957 16/09/1956 ucl 1. Round 7:0 (4:0)
## 3 1956-1957 28/11/1956 ucl Round of 16 3:1 (1:0)
## 4 1956-1957 12/12/1956 ucl Round of 16 2:0 (2:0)
## 5 1956-1957 30/05/1957 ucl Final 2:0 (0:0)
## 6 1957-1958 15/10/1957 ucl 1. Round 1:1 (0:1, 1:1) aet
## geometry.x home_keep away_keep
## 1 POINT (2.253049 48.84144) Real Madrid Stade Reims
## 2 POINT (7.453112 51.49276) Borussia Dortmund Spora Luxemburg
## 3 POINT (2.253049 48.84144) OGC Nice Rangers FC
## 4 POINT (-3.688344 40.45305) Real Madrid Rapid Wien
## 5 POINT (-3.688344 40.45305) Real Madrid ACF Fiorentina
## 6 POINT (13.40849 52.54356) SC Wismut Karl-Marx-Stadt Gwardia Warszawa
## location team_name geometry.y
## 1 home Real Madrid POINT (-3.688344 40.45305)
## 2 home Borussia Dortmund POINT (7.450945 51.49807)
## 3 home OGC Nice POINT (7.195828 43.68232)
## 4 home Real Madrid POINT (-3.688344 40.45305)
## 5 home Real Madrid POINT (-3.688344 40.45305)
## 6 home SC Wismut Karl-Marx-Stadt POINT (12.69902 50.58733)
## distance
## 1 1043745.9567 [m]
## 2 609.3633 [m]
## 3 688197.6337 [m]
## 4 0.0000 [m]
## 5 0.0000 [m]
## 6 223136.3518 [m]</code></pre>
<p>Then all we need to do is group by each team and season and calculate the total distance travelled by that team. Below I print the top 10 total distances (in km) that teams had to travel to complete all of their matches:</p>
<pre class="r"><code>longest_distance_campaigns <- match_travel %>%
  group_by(season, team_name) %>%
  mutate(total_travel = sum(distance),
         date = as.Date(gsub("\\/", "-", date), "%d-%m-%Y")) %>%
  select(season, date, competition, round, team = team_name,
         home = home_keep, away = away_keep, result, distance, total_travel) %>%
  arrange(-total_travel, date)
longest_distance_campaigns %>%
  filter(!duplicated(paste(season, team))) %>%
  select(season, team, total_travel) %>%
  mutate(total_travel = total_travel / 1000) %>%
  head(., n = 10)</code></pre>
<pre><code>## # A tibble: 10 x 3
## # Groups: season, team [10]
## season team total_travel
## <chr> <chr> <dbl>
## 1 2015-2016 FK Astana 25874.
## 2 2011-2012 APOEL Nikosia 19112.
## 3 2009-2010 APOEL Nikosia 18649.
## 4 2011-2012 SL Benfica 17817.
## 5 2009-2010 Maccabi Haifa 17808.
## 6 2000-2001 Galatasaray 17371.
## 7 2010-2011 Hapoel Tel Aviv 17350.
## 8 2017-2018 Qarabag FK 17286.
## 9 2015-2016 Maccabi Tel Aviv 17041.
## 10 2002-2003 Lokomotiv Moskva 16732.</code></pre>
<p>Perhaps unsurprisingly, <a href="https://en.wikipedia.org/wiki/2015_FC_Astana_season">FK Astana</a> from the capital of Kazakhstan come out top (by far), having worked through the qualifying rounds and made it to the group stages (where they were unbeaten at home). After that, successful teams from the far corners of Europe (Benfica, APOEL, Hapoel Tel Aviv) come out on top. I was surprised that Lokomotiv are the only Russian team in the list, and as far back as 2002-2003, and also that all of these seasons are from this century (perhaps due to the ever-increasing number of fixtures in the Champions League).</p>
<p>Below I select the matches FK Astana played in their record-breaking 2015-2016 campaign:</p>
<pre class="r"><code>longest_distance_campaigns %>%
  filter(season == "2015-2016" & team == "FK Astana") %>%
  #reorder so the team column comes first
  select(team, everything())</code></pre>
<pre><code>## # A tibble: 12 x 10
## # Groups: season, team [1]
## team season date competition round home away result distance
## <chr> <chr> <date> <chr> <chr> <chr> <chr> <chr> [m]
## 1 FK A… 2015… 2015-07-14 uclquals 2. R… NK M… FK A… 1:0 (… 4025459
## 2 FK A… 2015… 2015-07-22 uclquals 2. R… FK A… NK M… 3:1 (… 0
## 3 FK A… 2015… 2015-07-29 uclquals 3. R… HJK … FK A… 0:0 (… 3022119
## 4 FK A… 2015… 2015-08-05 uclquals 3. R… FK A… HJK … 4:3 (… 0
## 5 FK A… 2015… 2015-08-18 uclquals Play… FK A… APOE… 1:0 (… 0
## 6 FK A… 2015… 2015-08-26 uclquals Play… APOE… FK A… 1:1 (… 3510897
## 7 FK A… 2015… 2015-09-15 ucl Grou… SL B… FK A… 2:0 (… 6180579
## 8 FK A… 2015… 2015-09-30 ucl Grou… FK A… Gala… 2:2 (… 0
## 9 FK A… 2015… 2015-10-21 ucl Grou… Atlé… FK A… 4:0 (… 5713668
## 10 FK A… 2015… 2015-11-03 ucl Grou… FK A… Atlé… 0:0 (… 0
## 11 FK A… 2015… 2015-11-25 ucl Grou… FK A… SL B… 2:2 (… 0
## 12 FK A… 2015… 2015-12-08 ucl Grou… Gala… FK A… 1:1 (… 3421522
## # … with 1 more variable: total_travel <dbl></code></pre>
<p>Finally, one of the real niche joys in my life is making maps, and what better opportunity than to map these long-distance Champions League campaigns. It’s a bit of a munge to get the lines from point data, but sf does at least make it possible.</p>
<pre class="r"><code>#get the top ten longest campaigns
data <- filter(longest_distance_campaigns,
               !duplicated(paste(season, team)))[1:10,] %>%
  ungroup() %>%
  select(season, team_name = team, total_travel) %>%
  left_join(., match_travel) %>%
  #munge the geometry
  filter(st_geometry(.$geometry.x) != st_geometry(.$geometry.y)) %>%
  mutate(versus = case_when(
    location == "home" ~ away_keep,
    location == "away" ~ home_keep
  )) %>%
  select(season, team_name, versus, round, total_travel,
         geometry.x, geometry.y) %>%
  split(f = rownames(.)) %>%
  #calculate lines from points
  lapply(., function(row) {
    coords1 <- st_coordinates(row$geometry.x) %>%
      split(f = rownames(.))
    coords2 <- st_coordinates(row$geometry.y) %>%
      split(f = rownames(.))
    lines <- map2(coords1, coords2, ~st_linestring(rbind(.x, .y)))
    row$lines <- st_as_sfc(lines, crs = st_crs("+init=epsg:4326"))
    return(row)
  }) %>%
  do.call(rbind, .) %>%
  mutate(title = paste(season, team_name, "=", round(total_travel/1000), "km"))
#plot the travel of each team
p2 <- ggplot() +
  geom_sf(data = st_as_sf(ne_countries(scale = 110), crs = st_crs("+init=epsg:4326")),
          colour = NA) +
  geom_sf(data = data, aes(geometry = lines),
          colour = "red", size = 2) +
  geom_sf_text(data = data, aes(geometry = geometry.x, label = versus),
               size = 4, nudge_y = 2) +
  #again taken from st_bbox
  coord_sf(xlim = c(-23, 77), ylim = c(66, 30)) +
  theme_map() +
  theme(
    strip.background = element_rect(fill = "white"),
    strip.text = element_text(size = 10)
  ) +
  facet_wrap(~title)</code></pre>
<p><a href="https://www.roberthickman.eu/img/longest_distances_plot.svg" title="plot of the longest UCL campaigns"><img src="/img/longest_distance_campaigns.png" alt="plot of the longest UCL campaigns" /></a></p>
<p>Click on the image for a higherres version :)</p>
<p>Two quick finishing pieces:</p>
<p>Firstly, what is the single longest journey in the history of the Champions League? Unsurprisingly it involves the 2015-2016 FK Astana side, travelling to Benfica on the coast of Portugal (and of course the return fixture).</p>
<pre class="r"><code>match_travel[which.max(match_travel$distance),]</code></pre>
<pre><code>## season date competition round result
## 12000 2015-2016 15/09/2015 ucl Group C 2:0 (0:0)
## geometry.x home_keep away_keep location team_name
## 12000 POINT (-9.184503 38.75253) SL Benfica FK Astana away FK Astana
## geometry.y distance
## 12000 POINT (71.40261 51.10822) 6180579 [m]</code></pre>
<p>And secondly, answering the original question: what is the shortest average commute for a winning side?</p>
<pre class="r"><code>#get UCL champions
winners <- match_travel %>%
  filter(round == "Final") %>%
  mutate(result = gsub(" .*", "", result)) %>%
  separate(result, into = c("h_goal", "a_goal"), sep = ":") %>%
  filter((location == "home" & h_goal > a_goal) | (location == "away" & a_goal > h_goal)) %>%
  select(season, team_name)
#find the matches played by champions
winners_matches <- left_join(winners, match_travel, by = c("season", "team_name")) %>%
  group_by(season, team_name) %>%
  mutate(matches = n(), total_travel = sum(distance/1000)) %>%
  ungroup() %>%
  #calculate average travel per game
  mutate(average_travel = total_travel / matches,
         date = as.Date(gsub("\\/", "-", date), "%d-%m-%Y")) %>%
  select(season, date, round, home = home_keep, away = away_keep, result, distance, average_travel) %>%
  arrange(average_travel, date)
#print the 3 campaigns with the lowest average travel
head(winners_matches, n = 27)</code></pre>
<pre><code>## # A tibble: 27 x 8
## season date round home away result distance average_travel
## <chr> <date> <chr> <chr> <chr> <chr> [m] <dbl>
## 1 19631… 19630918 1. Rou… Everto… Inter 0:0 (… 1247112… 410.
## 2 19631… 19630925 1. Rou… Inter Evert… 1:0 (… 0… 410.
## 3 19631… 19631127 Round … Inter AS Mo… 1:0 (… 0… 410.
## 4 19631… 19631204 Round … AS Mon… Inter 1:3 (… 236127… 410.
## 5 19631… 19640226 Quarte… Partiz… Inter 0:2 (… 891298… 410.
## 6 19631… 19640304 Quarte… Inter Parti… 2:1 (… 0… 410.
## 7 19631… 19640415 Semif… Boruss… Inter 2:2 (… 684493… 410.
## 8 19631… 19640429 Semif… Inter Borus… 2:0 (… 0… 410.
## 9 19631… 19640527 Final Inter Real … 3:1 (… 632859… 410.
## 10 19711… 19710915 1. Rou… AFC Aj… Dynam… 2:0 (… 0… 433.
## # … with 17 more rows</code></pre>
<p>The top three are Inter’s 1963-1964, Ajax’s 1971-1972, and Bayern Munich’s 1973-1974 seasons, all of which have an average travel of just over 400km per game. It’s fairly striking how many more Central European teams went deep into the competition in those seasons compared to today.</p>
<p>And that’s all for now! Thanks for reading and I’ll try and put out another post soon :)</p>

Five Minute Football Trivia - Birthday Five-A-Side
/post/five_min_trivia_birthdays/
Sat, 14 Mar 2020 00:00:00 +0000
/post/five_min_trivia_birthdays/
<p><em>Generally, as I have less and less time to waste on meaningless football stats, I get halfway through a post and abandon it. To remedy this, I want to start pushing out posts that give a reasonable half-guess at an answer within an hour or so, without needing to really check my working or write good prose. This is the second of these.</em></p>
<p>A semi-common question I’ve come across when doing stupid football trivia is ‘which birthday could field the best 5-a-side team?’. That is, if you could only select 5 players all born on the same date, which date would you pick in order to field the strongest team? For an example, see <a href="https://www.theguardian.com/football/2014/feb/19/best-team-players-born-same-day">the Guardian’s Knowledge blog</a> from 2014. However, that was based on gut feel, and this blog (however flawed) deals in data, so let’s go.</p>
<p>As always, we’ll start by loading some libraries</p>
<pre class="r"><code>#munging
library(tidyverse)
#regression (later)
library(glmnet)
#plots (at the end)
library(ggsoccer)</code></pre>
<p>We’ll then load our data to compare players’ abilities. For this I’m using a database I put together of every player in the FIFA video game series, scraped from <a href="https://www.fifaindex.com/">FIFAindex.com</a>. The database goes back 15 years to the ‘05’ version of the game, so we won’t have to limit ourselves to current players. The scraped db can be found at <a href="https://github.com/RobWHickman/fifadb">my github</a>*</p>
<p>*it’s still a very beta version at the moment and needs a lot more munging, but should work for most applications</p>
<pre class="r"><code>player_data <- map_df(
  #load female and male player data from the github repo
  c("male_players.rds"), function(x) {
    data <- readRDS(paste0("path/to/file/", x)) %>%
      mutate(version = as.character(version), dob = as.Date(dob, "%Y-%B-%d"))
    return(data)
  }) %>%
  #munge date of birth into day and month of birth (don't care about year)
  mutate(day_of_birth = gsub("^[0-9]{4}-", "", dob)) %>%
  separate(day_of_birth, into = c("month", "day"), sep = "-", remove = FALSE) %>%
  mutate(month = as.numeric(month), day = as.numeric(day), ability = as.numeric(ability))</code></pre>
<p>This should give us a data frame of 145,536 observations of 65 variables (!), i.e. the information on, and stats of, every player to appear in the games over the last decade and a half. We can then look at the spread of birthdays in the dataset:</p>
<pre class="r"><code>#plot the numbers of players per day of birth in the dataset
p1 <- ggplot(player_data, aes(x = day, fill = factor(month))) +
  geom_histogram(stat = "count") +
  scale_fill_discrete(guide = FALSE) +
  labs(title = "Number of players sharing birthday by day of year",
       x = "day of month") +
  facet_wrap(~month) +
  theme_minimal()
p1</code></pre>
<p><img src="/post/20200314birthdays_files/figurehtml/plot_birthdays1.png" width="672" /></p>
<p>Two peaks pop out: the massive spike on February 29th and a smaller one on January 1st. On inspection, it looks like these are used as placeholders when the true date of birth isn’t known. For the remainder of the post we’re going to exclude them.</p>
<pre class="r"><code>#filter out 1st January and 29th February (placeholders for unknown dob?)
player_data <- filter(player_data,
                      !(day == 1 & month == 1) & !(day == 29 & month == 2))
#replot
p2 <- player_data %>%
  ggplot(., aes(x = day, fill = factor(month))) +
  geom_histogram(stat = "count") +
  scale_fill_discrete(guide = FALSE) +
  labs(title = "Number of players sharing birthday by day of year",
       subtitle = "placeholder dates removed",
       x = "day of month") +
  facet_wrap(~month) +
  theme_minimal()
p2</code></pre>
<p><img src="/post/20200314birthdays_files/figurehtml/plot_birthdays21.png" width="672" /></p>
<p>Then we can put teams together by taking the top 5 players by the ‘overall ability’ stat for each date of birth:</p>
<pre class="r"><code>by_day <- player_data %>%
  #take only relevant data
  select(name = name2, ability, day, month, version) %>%
  arrange(-ability) %>%
  #group by day and take each player's best ability score
  group_by(day, month) %>%
  filter(!duplicated(name)) %>%
  #get the top five by day
  split(f = paste(.$day, .$month)) %>%
  map(., function(dat) dat[1:5,] %>% mutate(team_ability = sum(ability)))
#glimpse the first two teams (1/10 and 1/11)
by_day[1:2]</code></pre>
<pre><code>## $`1 10`
## # A tibble: 5 x 6
## # Groups: day, month [1]
## name ability day month version team_ability
## <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 Anthony Lopes 85 1 10 19 418
## 2 Mirko Vucinic 84 1 10 07 418
## 3 Julio Baptista 83 1 10 05 418
## 4 Vitor Baía 83 1 10 06 418
## 5 Ümit Karan 83 1 10 07 418
##
## $`1 11`
## # A tibble: 5 x 6
## # Groups: day, month [1]
## name ability day month version team_ability
## <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 Miloš Krasic 83 1 11 11 403
## 2 Filip Kostic 82 1 11 20 403
## 3 Mahler Tressor Moreno 80 1 11 06 403
## 4 Vaclav Sverkos 79 1 11 06 403
## 5 Dimo Wache 79 1 11 07 403</code></pre>
<p>Then to find the best 5 of these teams, we can push it through two quick functions as follows:</p>
<pre class="r"><code>#get the top 5 teams by summed ability
top_teams <- by_day %>%
  #sum the ability per team
  map_dbl(., function(dat) return(unique(dat$team_ability))) %>%
  sort() %>%
  #return the top n teams
  tail(n = 5) %>%
  names(.) %>%
  map(., function(date) return(by_day[date]))
top_teams</code></pre>
<pre><code>## [[1]]
## [[1]]$`7 10`
## # A tibble: 5 x 6
## # Groups: day, month [1]
## name ability day month version team_ability
## <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 Dida 91 7 10 06 436
## 2 Gilberto 89 7 10 05 436
## 3 Sami Hyypiä 86 7 10 05 436
## 4 Diego Costa 86 7 10 17 436
## 5 Santiago Hernán Solari 84 7 10 05 436
##
##
## [[2]]
## [[2]]$`17 8`
## # A tibble: 5 x 6
## # Groups: day, month [1]
## name ability day month version team_ability
## <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 Thierry Henry 97 17 8 05 437
## 2 Ederson 88 17 8 20 437
## 3 William Gallas 87 17 8 05 437
## 4 Güiza 83 17 8 09 437
## 5 Phil Jagielka 82 17 8 10 437
##
##
## [[3]]
## [[3]]$`22 9`
## # A tibble: 5 x 6
## # Groups: day, month [1]
## name ability day month version team_ability
## <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 Ronaldo 94 22 9 06 437
## 2 Thiago Silva 89 22 9 17 437
## 3 Harry Kewell 86 22 9 05 437
## 4 Javier López Vallejo 84 22 9 06 437
## 5 Maarten Stekelenburg 84 22 9 12 437
##
##
## [[4]]
## [[4]]$`24 6`
## # A tibble: 5 x 6
## # Groups: day, month [1]
## name ability day month version team_ability
## <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 Lionel Messi 94 24 6 12 439
## 2 Juan Román Riquelme 88 24 6 07 439
## 3 Luis García 86 24 6 06 439
## 4 David Alaba 86 24 6 17 439
## 5 Shunsuke Nakamura 85 24 6 08 439
##
##
## [[5]]
## [[5]]$`5 2`
## # A tibble: 5 x 6
## # Groups: day, month [1]
## name ability day month version team_ability
## <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 Cristiano Ronaldo 94 5 2 17 439
## 2 Neymar Jr 92 5 2 17 439
## 3 Carlos Tévez 87 5 2 06 439
## 4 Stefan de Vrij 84 5 2 18 439
## 5 Giovanni Van Bronckhorst 82 5 2 05 439</code></pre>
<p>So our top team is comprised of <a href="https://en.wikipedia.org/wiki/Carli_Lloyd">2017 Carli Lloyd</a>, <a href="https://en.wikipedia.org/wiki/Vicente_Rodr%C3%ADguez">2005 Vicente</a>, <a href="https://en.wikipedia.org/wiki/Gareth_Bale">2017 Gareth Bale</a>, <a href="https://en.wikipedia.org/wiki/Sergio_Busquets">2019 Sergio Busquets</a>, and <a href="https://en.wikipedia.org/wiki/Mousa_Demb%C3%A9l%C3%A9_(Belgian_footballer)">2018 Moussa Dembele</a>, all of whom were born on July 16th.</p>
<p>However, it’s pretty clear this isn’t a very satisfactory answer; the best team here has five midfielders. To dig a little deeper, we need to bust out a bit of machine learning. First we want to see all the positions present in the dataset:</p>
<pre class="r"><code>#take the primary position for each player
unique(as.character(sapply(player_data$positions, "[[", 1)))</code></pre>
<pre><code>## [1] "ST" "GK" "CAM" "CDM" "CB" "LCAM" "CM" "LM" "CF" "LWM"
## [11] "RM" "RB" "RWB" "RWM" "LB" "LCB" "LS" "LF" "RCB" "LAM"
## [21] "LWB" "LCDM" "RS" "LCM" "RAM" "RCM" "RF" "SW" "RCDM" "RCAM"
## [31] "LDM" "RDM" "RW" "LW"</code></pre>
<p>So a fair few, but a lot of these (e.g. RW and LW) are basically the same position, just played on the opposite side of the pitch. We’d expect a left winger to <em>mostly</em> have the same skills as a right winger.</p>
<p>To fix this we can make these positions ‘symmetric’ by replacing the left/right with a W (for wide):</p>
<pre class="r"><code>#add in the symmetric position column
player_data$position <- sapply(player_data$positions, "[[", 1)
player_data <- player_data %>%
  mutate(symmetric_position = gsub("L|R", "W", position))
unique(player_data$symmetric_position)</code></pre>
<pre><code>## [1] "ST" "GK" "CAM" "CDM" "CB" "WCAM" "CM" "WM" "CF" "WWM"
## [11] "WB" "WWB" "WCB" "WS" "WF" "WAM" "WCDM" "WCM" "SW" "WDM"
## [21] "WW"</code></pre>
<p>We then want to use each player’s ratings for various skills (e.g. Shot Power, Ball Control, GK Rushing [out], Free Kicks, …) to work out how they interact with the player’s chosen position to create their overall ability score in the game.</p>
<p>For example, in <a href="https://www.fifaindex.com/player/158023/lionel-messi/fifa20/">FIFA 20, Lionel Messi</a> has rather poor defensive stats (e.g. only 26/100 for sliding tackles). Any reasonable person would recognise that sliding tackles just <em>aren’t important</em> for Lionel Messi’s role in the Barcelona team. However, we can use these stats to work out what his overall ability would be <em>if he were a defender</em>.</p>
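<p>The underlying idea is just a weighted sum: each position has its own weight vector over the attributes, so the same attribute profile scores very differently depending on the position you evaluate it for. A toy sketch (the weights and attribute values below are invented purely for illustration):</p>
<pre class="r"><code>#position-specific weights applied to the same attribute profile
attribute_profile <- c(slide_tackle = 26, finishing = 95, dribbling = 96)
defender_weights <- c(slide_tackle = 0.70, finishing = 0.10, dribbling = 0.20)
striker_weights <- c(slide_tackle = 0.05, finishing = 0.55, dribbling = 0.40)
rating_as <- function(attributes, weights) sum(attributes * weights)
#a Messi-like profile rates far higher as a striker than as a defender
rating_as(attribute_profile, defender_weights)
rating_as(attribute_profile, striker_weights)</code></pre>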
<pre class="r"><code>#attribute variables we'll need to use to work out player position ability
attribute_vars < grep("Ball_Control", names(player_data)):
grep("GK_Rushing", names(player_data))
names(player_data)[attribute_vars]</code></pre>
<pre><code>## [1] "Ball_Control" "Dribbling" "Marking"
## [4] "Slide_Tackle" "Stand_Tackle" "Aggression"
## [7] "Reactions" "Attack_Positioning" "Interceptions"
## [10] "Vision" "Crossing" "Short_Pass"
## [13] "Long_Pass" "Acceleration" "Stamina"
## [16] "Strength" "Balance" "Sprint_Speed"
## [19] "Agility" "Jumping" "Heading"
## [22] "Shot_Power" "Long_Shots" "Finishing"
## [25] "FK_Accuracy" "Curve" "Penalties"
## [28] "Volleys" "GK_Reflexes" "GK_Handling"
## [31] "GK_Positioning" "GK_Diving" "GK_Kicking"
## [34] "Tackling" "Anticipation" "Composure"
## [37] "Creativity" "Passing" "Long_Balls"
## [40] "Pace" "Shot_Accuracy" "GK_Rushing"</code></pre>
<p>We’ll then run a <a href="https://www.statisticshowto.datasciencecentral.com/lasso-regression/">LASSO regression</a> to calculate exactly how important each of these variables is to players of each position, and then use these weights to calculate the hypothetical ability of players in positions they would never actually play. For more explanation, a lot of this is taken from some older blog posts <a href="https://www.roberthickman.eu/post/yorkshire_world_cup_2/">here</a>.</p>
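<p>The reason for the lasso specifically (rather than plain least squares) is its L1 penalty, which shrinks the weights of uninformative attributes exactly to zero via soft-thresholding. A minimal base-R sketch of that operator:</p>
<pre class="r"><code>#soft-thresholding: estimates smaller than the penalty lambda become exactly 0
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)
#the strong effects survive (shrunk towards zero), the weak one is dropped
soft_threshold(c(2.0, 0.3, -1.5), lambda = 0.5)</code></pre>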
<pre class="r"><code>#function to do the lasso regression
get_player_position_abilities <- function(model_pos, model_vers) {
  #id players by their link to fifaindex
  ids <- player_data$player_link[player_data$version == model_vers]
  #train on players who play each position
  train_data <- player_data %>%
    filter(symmetric_position == model_pos & version == model_vers) %>%
    select("ability", attribute_vars) %>%
    mutate(ability = as.numeric(ability)) %>%
    #some variables aren't in all versions of FIFA
    #get rid of any that are all NA
    purrr::discard(~all(is.na(.)))
  #if no examples of this position for a version of FIFA, return NULL
  if(nrow(train_data) == 0) return(NULL)
  #convert to a matrix and train the regression
  train_matrix <- model.matrix(ability~., train_data)
  cv_model <- cv.glmnet(train_matrix, train_data$ability)
  #use these weights on every player from that version of FIFA
  test_data <- player_data %>%
    filter(version == model_vers) %>%
    select("ability", attribute_vars) %>%
    mutate(ability = as.numeric(ability)) %>%
    purrr::discard(~all(is.na(.)))
  #calculate the ability score for each player for that position
  test_matrix <- model.matrix(ability~., test_data)
  position_ability <- predict(cv_model, newx = test_matrix, s = "lambda.min", type = "response")
  #return the positional scores
  df <- data.frame(
    player_link = ids,
    ability = as.numeric(position_ability),
    version = model_vers,
    position = model_pos
  )
  return(df)
}
#get all combinations of position and FIFA version
crossed_vars <- crossing(
  pos = unique(player_data$symmetric_position),
  vers = unique(player_data$version)
)
#get all players' ability in every position
position_abilities <- map2_df(crossed_vars$pos, crossed_vars$vers, get_player_position_abilities) %>%
  left_join(select(player_data, name = name2, day, month, player_link), by = "player_link") %>%
  pivot_wider(names_from = position, values_from = ability) %>%
  select(-player_link)</code></pre>
<p>So once we’ve run that we can see how each player is expected to perform in any position. For instance, taking the first three players in the dataset, Thierry Henry rates highly anywhere in attack, Gianluigi Buffon only really makes sense as a goalkeeper, and Zinedine Zidane scores well across midfield and attack:</p>
<pre class="r"><code>head(position_abilities, n = 3)</code></pre>
<pre><code>## # A tibble: 3 x 25
## version name day month CAM CB CDM CF CM GK ST SW
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 05 Thie… 17 8 88.8 53.1 70.9 96.5 76.4 24.2 96.5 55.2
## 2 05 Gian… 28 1 34.3 39.5 34.7 39.0 34.3 96.3 39.2 34.5
## 3 05 Zine… 23 6 96.1 72.1 86.8 93.9 91.1 26.7 93.9 75.2
## # … with 13 more variables: WAM <dbl>, WB <dbl>, WCAM <dbl>, WCB <dbl>,
## # WCDM <dbl>, WCM <dbl>, WDM <dbl>, WF <dbl>, WM <dbl>, WS <dbl>,
## # WW <dbl>, WWB <dbl>, WWM <dbl></code></pre>
<p>We can also plot the relative abilities of each player to see if they make sense. In the below I’ve excluded the names of the players on the y axis to save space, but every sliver is a separate player:</p>
<pre class="r"><code>p3 <- position_abilities %>%
  #arrange ordering
  arrange(-ST) %>%
  filter(!duplicated(name)) %>%
  mutate(name = factor(name, levels = unique(name))) %>%
  select(name, 5:ncol(.)) %>%
  #melt data
  pivot_longer(cols = c(2:ncol(.)), names_to = "position", values_to = "ability") %>%
  arrange(name, ability) %>%
  mutate(position = factor(position, levels = unique(position))) %>%
  ggplot(aes(y = name, x = position)) +
  geom_tile(aes(fill = ability)) +
  labs(title = "Relative abilities of all players in all positions",
       x = "position",
       y = "player") +
  theme_minimal() +
  #get rid of y axis text
  theme(axis.text.y = element_blank()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p3</code></pre>
<p><img src="/post/20200314birthdays_files/figurehtml/plot_position_abilities1.png" width="672" /></p>
<p>And it’s clear that players who excel in the attacking positions towards the left are weaker at the defensive positions towards the right (also note the big section of dark blue for weakness in the goalkeeper column).</p>
<p>In building a 5-a-side team, I’m going to assume you want at least one defender, one midfielder, one attacker, one goalkeeper, and then one extra outfield player in whatever position. It’s important to note that this does not have to be a player traditionally thought of as a defender; a particularly skilled striker could very well take up the defensive spot if they had the requisite abilities.</p>
<p>To do this, we have to bin the positions into attack/defence and then find the highest value in each bin for every player.</p>
<pre class="r"><code>gk_cols <- c("GK")
def_cols <- c("CB", "SW", "WCB", "CDM", "WCDM", "WDM")
mid_cols <- c("CM", "WCM", "CAM", "WCAM", "WM", "WWM", "WB", "WWB")
attack_cols <- c("CF", "WF", "ST", "WS", "WAM", "WW")
#find the best defensive/attacking position for each player
fiveaside_abilities <- map(list(gk_cols, def_cols, mid_cols, attack_cols),
    function(cols) position_abilities %>% select(cols) %>% apply(., 1, max, na.rm = TRUE)) %>%
  do.call(cbind, .) %>%
  as.data.frame() %>%
  #bind this back to the original data
  cbind(position_abilities, .) %>%
  #select the necessary columns
  select(name, version, gk = V1, def = V2, mid = V3, att = V4, day, month)
head(fiveaside_abilities)</code></pre>
<pre><code>## name version gk def mid att day
## 1 Thierry Henry 05 24.24627 70.91822 88.87479 96.60456 17
## 2 Gianluigi Buffon 05 96.29510 40.01513 39.58011 39.76021 28
## 3 Zinedine Zidane 05 26.66050 86.77503 96.06867 96.11257 23
## 4 Ruud van Nistelrooy 05 22.75526 75.99536 86.59128 94.70145 1
## 5 Roy Makaay 05 23.52925 65.63241 78.20789 93.82865 9
## 6 Iker Casillas 05 94.11531 54.33178 45.29751 51.61508 20
## month
## 1 8
## 2 1
## 3 6
## 4 7
## 5 3
## 6 5</code></pre>
<p>Now that we have the player abilities, combining them into a team is not quite trivial, but not far off.</p>
<pre class="r"><code>#melt data back down for team selection
team_selection_dat <- fiveaside_abilities %>%
  pivot_longer(cols = names(fiveaside_abilities)[3:6],
               names_to = "pos", values_to = "ability")
#run a function to select optimal five-a-side teams
best_teams <-
  #run for each date we want to select for
  map_df(unique(paste(team_selection_dat$day, team_selection_dat$month, sep = "-")),
         function(select_dob, data) {
           #filter only players with that birth date
           bday_dat <- data %>%
             mutate(dob = paste(day, month, sep = "-")) %>%
             filter(dob == select_dob) %>%
             arrange(-ability)
           #take positions in order of highest ability score
           #in order to pick optimally
           position_order <- unique(bday_dat$pos)
           #init a data frame
           team <- data.frame(
             name = NULL,
             version = NULL,
             day = NULL,
             month = NULL,
             pos = NULL,
             ability = NULL
           )
           #for loop through the positions to be picked
           #probably a better way to write this but
           #by now my brain was melting
           for(position in position_order) {
             #select the best player for that position
             selected_player <- bday_dat %>%
               filter(pos == position) %>%
               top_n(1, ability) %>%
               select(-dob)
             team <- rbind(team, selected_player)
             #remove selected players from later choices
             bday_dat <- bday_dat %>%
               filter(!name %in% team$name)
           }
           #pick the last player
           team <- bday_dat %>%
             filter(pos != "gk") %>%
             top_n(1, ability) %>%
             select(-dob) %>%
             rbind(team, .)
           return(team)
         }, data = team_selection_dat)</code></pre>
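<p>The greedy idea above (fill each required position with the best remaining player, then top up with the best remaining outfielder) can be sketched outside R too. Below is a minimal, language-agnostic Python version with invented players and ratings (not the FIFA data), using a fixed position order rather than the ability-ordered picking in the R code:</p>

```python
# Greedy 5-a-side selection: fill each required position with the best
# remaining player, then add the best remaining outfield player.
def pick_team(players, positions=("gk", "def", "mid", "att")):
    pool = list(players)  # (name, position, ability) tuples
    team = []
    # fill the four required slots
    for pos in positions:
        candidates = [p for p in pool if p[1] == pos]
        best = max(candidates, key=lambda p: p[2])
        team.append(best)
        pool.remove(best)
    # fifth slot: best remaining outfield (non-GK) player
    outfield = [p for p in pool if p[1] != "gk"]
    team.append(max(outfield, key=lambda p: p[2]))
    return team

# hypothetical players, purely for illustration
players = [
    ("Keeper A", "gk", 90), ("Back B", "def", 80), ("Back C", "def", 85),
    ("Mid D", "mid", 88), ("Striker E", "att", 95), ("Striker F", "att", 91),
]
team = pick_team(players)
```

Note this greedy approach is not guaranteed to be globally optimal, which is part of why the R version picks positions in order of the highest available ability.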
<p>We can then plot the best teams by taking the mean of each team’s positional abilities and finding the top 10 teams. I then plotted these on half pitches using Ben Torvaney’s (mentioned in 2/2 posts so far…) <a href="https://github.com/Torvaney/ggsoccer">ggsoccer</a> package:</p>
<pre class="r"><code>p4 <- best_teams %>%
  #work out total ability by team
  group_by(day, month) %>%
  mutate(team_ability = mean(ability)) %>%
  group_by(day, month, pos) %>%
  mutate(total_pos = n(), pos_n = 1:n()) %>%
  #calculate the x and y coordinates for each player on a pitch
  mutate(x = case_when(
    pos == "gk" ~ 5,
    pos == "def" ~ 22,
    pos == "mid" ~ 35,
    pos == "att" ~ 52
  )) %>%
  mutate(y = case_when(
    pos_n == 1 & total_pos == 2 ~ 20,
    pos_n == 2 ~ 60,
    TRUE ~ 40
  )) %>%
  ungroup() %>%
  select(-pos_n, -total_pos) %>%
  #take the 10 best teams (50 rows = 10 teams of 5 players)
  top_n(50, team_ability) %>%
  #add in a column for the faceting
  mutate(dob = paste0(day, "/", month, ": ", round(team_ability, 2))) %>%
  ggplot(aes(x = x, y = y)) +
  annotate_pitch(dimensions = pitch_statsbomb,
                 colour = "black",
                 fill = "white",
                 limits = FALSE) +
  coord_flip(xlim = c(0, 60)) +
  geom_text(aes(
    label = gsub("( )([A-Z])", "\n\\2", paste(name, version, sep = "-")),
    colour = ability), size = 5.5) +
  scale_colour_gradient(low = "darkblue", high = "red", name = "player\n ability") +
  labs(title = "Ten Best Birthday Teams") +
  theme_pitch() +
  theme(strip.text.x = element_text(size = 14)) +
  facet_wrap(~dob)</code></pre>
<p><img src="/img/birthday_teams.svg" /></p>
<p>All the top 10 teams have fairly similar average abilities of around 86-87. The best of these averages 87.51, for a team of
+ 2006 Dida (GK)
+ 2005 Sami Hyypia (Def)
+ 2005 Gilberto Silva (Mid)
+ 2005 Santiago Solari (Att)
+ 2017 Diego Costa (Att)</p>
<p>who all share a birthday on the 7th October. Many of the top teams we found earlier also show up, though surprisingly the 5th February, which has players such as Neymar Jr., Cristiano Ronaldo, Carlos Tevez, and Stefan de Vrij, doesn’t make the cut.</p>
<p>That’s all for the second of these posts. The first one can be found <a href="https://www.roberthickman.eu/post/five_min_trivia_invincibles/">here</a>. Hopefully it provides some relief from the madness that is a complete lack of football. Stay safe, and wash your hands.</p>

Five Minute Football Trivia - Invincibles
/post/five_min_trivia_invincibles/
Sat, 07 Mar 2020 00:00:00 +0000
/post/five_min_trivia_invincibles/
<p><em>Generally, as I have less and less time to waste on meaningless football stats, I get halfway through a post and abandon it. To remedy this, I want to start pushing out posts that give a reasonable half-guess at an answer within an hour or so, without needing to really check my working or write good prose. This is the first of these.</em></p>
<p>Liverpool Football Club have had a pretty impressive season until recently, winning <a href="https://www.google.com/search?client=firefoxbd&q=premier+league+table#sie=lg;/g/11fj6snmjm;2;/m/02_tc;st;fp;1;;">26 of their first 27 games</a> and remaining unbeaten. Last weekend, however, they lost <a href="https://www.bbc.co.uk/sport/football/51595064">3-0 to Watford</a>, which means that Arsenal remain the only team to have gone a full (modern) season of top flight English football unbeaten (in <a href="https://en.wikipedia.org/wiki/The_Invincibles_(football)">2003/2004</a>).</p>
<p>Modern football twitter being what it is, a lot of debate has sprung up about which would be more impressive: going the whole season unbeaten like Arsenal, or winning 100 (out of a maximum 114) points in a single season, as Manchester City did in 2017/2018 and both Manchester City and Liverpool <em>almost</em> did last season. (A third option is the treble won by Manchester United in <a href="https://en.wikipedia.org/wiki/1998%E2%80%9399_Manchester_United_F.C._season">1998/1999</a>, but since Liverpool have also lost to Chelsea in the FA Cup this week, that one is already off the table.)</p>
<p>As usual, first we need some libraries</p>
<pre class="r"><code>#munging
library(tidyverse)
#plotting
library(ggrepel)
#football data
library(engsoccerdata)
library(rvest)
#Ben Torvaney's excellent package to model football
library(regista)
#set seed for reproducibility
set.seed(22081992)</code></pre>
<p>Then we can get going, loading up the data on English football results up until the end of the 2018/2019 season. We’ll also take some time to find the winners of each season, which will be useful later. There’s a lot of repetitive munging in this post, so bear in mind that the 3 main things we’ll be doing are:
+ pivoting the data longer to get the results for each team (rather than each match)
+ working out the goals for and against each team using case_when()
+ working out the points for each team using case_when()</p>
<pre class="r"><code>data <- engsoccerdata::england %>%
  #only care about the top flight in the premier league era
  dplyr::filter(Season > 1991 & Season < 2019 & division == 1) %>%
  select(season = Season, home, away = visitor, hgoal, agoal = vgoal)
league_winners <- data %>%
  #pivot data to longer to get team (rather than match) data
  pivot_longer(c("home", "away"), names_to = "location", values_to = "team") %>%
  #find goals for and goals against per team
  mutate(g_for = case_when(
    location == "home" ~ hgoal,
    location == "away" ~ agoal
  )) %>%
  mutate(g_ag = case_when(
    location == "home" ~ agoal,
    location == "away" ~ hgoal
  )) %>%
  #get the team's points per match
  mutate(points = case_when(
    g_for > g_ag ~ 3,
    g_for == g_ag ~ 1,
    g_ag > g_for ~ 0
  )) %>%
  mutate(gd = g_for - g_ag) %>%
  group_by(team, season) %>%
  #calculate total points and goal difference
  summarise(total_points = sum(points),
            total_gd = sum(gd)) %>%
  #get the winners of each league season
  arrange(season, -total_points, -total_gd) %>%
  group_by(season) %>%
  mutate(league_position = 1:n()) %>%
  ungroup() %>%
  mutate(winner = case_when(
    league_position == 1 ~ "y",
    TRUE ~ "n"
  ))
head(league_winners)</code></pre>
<pre><code>## # A tibble: 6 x 6
## team season total_points total_gd league_position winner
## <chr> <int> <dbl> <int> <int> <chr>
## 1 Manchester United 1992 84 36 1 y
## 2 Aston Villa 1992 74 17 2 n
## 3 Norwich City 1992 72 4 3 n
## 4 Blackburn Rovers 1992 71 22 4 n
## 5 Queens Park Rangers 1992 63 8 5 n
## 6 Liverpool 1992 59 7 6 n</code></pre>
<p>We can then use the match data to calculate the offensive and defensive strength of each team over the whole season using the <a href="https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/14679876.00065">Dixon-Coles method</a>. I’ve previously written an introduction to this method <a href="https://www.roberthickman.eu/post/dixon_coles_1/">here</a> (which I still need to finish part two of), but suffice to say it takes the goals scored and conceded per game and gives a good estimate of how good a team is. It’s similar in concept to <a href="https://projects.fivethirtyeight.com/soccerpredictions/">fivethirtyeight’s Soccer SPI</a>.</p>
<pre class="r"><code>#split data by seasons
fit_data <- data %>%
  split(f = .$season) %>%
  lapply(., function(x) x %>% mutate(home = factor(home), away = factor(away)))
#model using dixoncoles() from the regista package
fits <- lapply(fit_data, function(x) dixoncoles(hgoal, agoal, home, away, x))</code></pre>
<p>We can then extract the parameters from this model to see how teams have performed in each season of the Premier League. I also flip the defence axis (so that higher means a better defence), as I think it makes a little more sense.</p>
<pre class="r"><code>parameters <- fits %>%
  #extract the team parameters per fit
  lapply(., function(f) {
    par_data <- f$par[grepl("def_|off_", names(f$par))]
    teams <- unique(gsub("def_*|off_*", "", names(par_data)))
    par_df <- matrix(par_data, ncol = 2) %>%
      as.data.frame() %>%
      rename(attack = V1, defence = V2)
    rownames(par_df) <- teams
    return(par_df)
  }) %>%
  do.call(rbind, .) %>%
  rownames_to_column() %>%
  separate(rowname, c("season", "team"), sep = "\\.") %>%
  mutate(season = as.numeric(season)) %>%
  #flip the defence parameter (higher = better)
  mutate(defence = defence * -1) %>%
  left_join(., league_winners, by = c("season", "team"))
#plot the parameters with season performance (points) as the colour
p1 <- parameters %>%
  ggplot(aes(x = attack, y = defence, fill = total_points, colour = winner)) +
  geom_point(shape = 21, size = 3, alpha = 0.7, stroke = 2) +
  #label league winners and exceptional teams
  geom_text_repel(data = filter(parameters, winner == "y" | attack + defence > 1),
                  aes(label = paste(team, season))) +
  labs(title = "Dixon Coles parameters per team per Premier League Season",
       subtitle = "league winners and exceptional teams labelled",
       x = "attacking strength",
       y = "defensive strength") +
  scale_colour_manual(values = c("blue", "red")) +
  theme_minimal()
p1</code></pre>
<p><img src="/post/20200304invincibles_files/figurehtml/get_dc_parameters1.png" width="1152" />
We can then use these parameters as ‘true estimates’ of how good each team was each season, and go back and simulate each match to work out how likely a win/loss/draw was for any team in any match. Whether this is a good idea is questionable, but as I said up top, this is a stream-of-consciousness first guess at answering silly trivia questions, so I’m going to go along with it.</p>
<p>The regista package’s augment.dixoncoles() easily gives us the chance of a win/loss/draw per match, based on the attacking/defensive strength of each team (see above) that season.</p>
<pre class="r"><code>#split the matches by season
matches <- data %>%
  select(season, home, away) %>%
  split(f = .$season)
#function to predict the results per match
predict_matches <- function(dc_fit, fixtures) {
  augment.dixoncoles(x = dc_fit, newdata = fixtures, type = "outcomes") %>%
    unnest() %>%
    spread(outcome, prob)
}
#run the prediction function
predictions <- map2_df(fits, matches,
                       predict_matches)
head(predictions)</code></pre>
head(predictions)</code></pre>
<pre><code>## # A tibble: 6 x 6
## season home away away_win draw home_win
## <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1992 Arsenal Aston Villa 0.294 0.379 0.327
## 2 1992 Arsenal Blackburn Rovers 0.319 0.344 0.337
## 3 1992 Arsenal Chelsea 0.220 0.342 0.437
## 4 1992 Arsenal Coventry City 0.214 0.333 0.454
## 5 1992 Arsenal Crystal Palace 0.188 0.322 0.490
## 6 1992 Arsenal Everton 0.223 0.338 0.439</code></pre>
<p>So, for example, based on the Dixon-Coles estimates, given how well Arsenal and Aston Villa did over the <em>whole</em> of the 1992/1993 season, Arsenal had a 32.7% chance of beating Aston Villa at home on the opening day of the season.</p>
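<p>To see roughly where numbers like these come from: in a Dixon-Coles-style model each side’s expected goals are a function of the attack and defence parameters, and (ignoring the model’s low-score dependence adjustment) the outcome probabilities follow from two independent Poisson distributions. A rough, language-agnostic Python sketch with made-up scoring rates, not the fitted parameters:</p>

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    # P(X = k) for X ~ Poisson(lam)
    return lam ** k * exp(-lam) / factorial(k)

def outcome_probs(lam_home, lam_away, max_goals=10):
    # sum P(home scores h) * P(away scores a) over a truncated score grid
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win

# hypothetical expected goals for a fairly even match
home_win, draw, away_win = outcome_probs(1.4, 1.1)
```

Truncating at 10 goals loses a vanishingly small amount of probability mass, so the three outcomes sum to (almost exactly) one.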
<p>We can then use these probability estimates to calculate the chance of any one team going unbeaten over the whole league (multiply out the probabilities of not losing each game)</p>
<pre class="r"><code>invincible_chance <- predictions %>%
  #get match predictions per team
  pivot_longer(c("home", "away"), names_to = "location", values_to = "team") %>%
  mutate(nonloss_chance = case_when(
    location == "home" ~ 1 - away_win,
    location == "away" ~ 1 - home_win
  )) %>%
  select(season, team, nonloss_chance) %>%
  group_by(team, season) %>%
  #chance of going invincible = product of the chances of not losing
  summarise(invincible_chance = prod(nonloss_chance)) %>%
  arrange(-invincible_chance)
head(invincible_chance, n = 10)</code></pre>
<pre><code>## # A tibble: 10 x 3
## # Groups: team [6]
## team season invincible_chance
## <chr> <int> <dbl>
## 1 Chelsea 2004 0.0494
## 2 Manchester City 2017 0.0362
## 3 Manchester City 2018 0.0286
## 4 Liverpool 2018 0.0232
## 5 Arsenal 1998 0.0164
## 6 Manchester City 2011 0.0124
## 7 Manchester United 2007 0.0123
## 8 Tottenham Hotspur 2016 0.00846
## 9 Arsenal 2003 0.00529
## 10 Chelsea 2009 0.00475</code></pre>
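<p>The calculation itself is just a product of per-match survival probabilities, so it is easy to sanity-check by hand. A toy, language-agnostic example in Python (the loss probabilities are invented, not the fitted Dixon-Coles ones):</p>

```python
from math import prod

# chance of an unbeaten season = product over all matches of P(not losing)
loss_probs = [0.10, 0.05, 0.20, 0.15]  # hypothetical per-match loss chances
unbeaten_chance = prod(1 - p for p in loss_probs)
# even modest per-match loss chances compound quickly over a season
```

Over a real 38-game season, even a team with only a 5% chance of losing each match would go unbeaten only about 0.95<sup>38</sup> ≈ 14% of the time, which is why the numbers in the table above are so small.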
<p>So it turns out that the team most likely to have gone invincible over a whole season was Chelsea in 2004/2005 (not surprising given their <a href="https://en.wikipedia.org/wiki/2004%E2%80%9305_Chelsea_F.C._season#Results_by_round">excellent defensive record that year</a>), but with only a ~5% chance.</p>
<p>Arsenal’s <em>actual</em> invincible year is estimated to have had only a ~0.5% chance of happening, based on the team’s results (surprisingly low!). Another notable team is Tottenham Hotspur, who only finished 2nd in 2016/2017 but perhaps went under the radar as a very good team that year (with a ~0.8% chance of finishing unbeaten).</p>
<p>So we can assume* that the very best ‘unbeatable’ teams have a ~5% chance of finishing a season invincible. We can use this baseline to see how hard this is compared to the chance that a team reaches 100 points.</p>
<p>*not really, but for this post yes</p>
<p>We’re going to simulate every Premier League season 1000 times and calculate the total points expected of a team based on their Dixon-Coles parameters. To narrow down the search a bit, I’m going to limit it to only exceptional teams with attack and defence parameters > 0.25 (which gives 33 season-teams).</p>
<pre class="r"><code>result_probs <- predictions %>%
  #pivoting and case_when to get result probabilities per team
  pivot_longer(c("home", "away"), names_to = "location", values_to = "team") %>%
  mutate(win = case_when(
    location == "home" ~ home_win,
    location == "away" ~ away_win
  )) %>%
  mutate(lose = 1 - draw - win) %>%
  select(season, team, win, lose, draw) %>%
  group_by(team, season) %>%
  mutate(game = 1:n()) %>%
  nest(probs = c(win, lose, draw))
#filter down to only the very best teams to save processing
selected_teams <- parameters %>%
  filter(attack > 0.25 & defence > 0.25) %>%
  select(season, team) %>%
  left_join(., result_probs, by = c("team", "season"))
#sample a single result weighted by the win/lose/draw probabilities
sim_result <- function(probabilities) {
  chosen_result <- gather(probabilities) %>%
    sample_n(., 1, weight = value)
  return(chosen_result$key)
}
simulate_all_games <- function(data) {
  data$result <- unlist(lapply(data$probs, sim_result))
  return(data)
}
#will simulate 1000 seasons for each of these teams
n_sims <- 1000
#run simulations - will take ~10mins
simulated_results <- rerun(n_sims, simulate_all_games(selected_teams))</code></pre>
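<p>The key step in sim_result() is categorical sampling: drawing one of win/draw/lose with probability proportional to its weight. The same step looks like this in Python (the probabilities here are invented purely for illustration):</p>

```python
import random

def sim_match(probs):
    # probs: mapping of outcome -> probability (should sum to 1)
    outcomes = list(probs)
    weights = list(probs.values())
    return random.choices(outcomes, weights=weights, k=1)[0]

random.seed(22081992)  # fixed seed, mirroring the post's set.seed()
results = [sim_match({"win": 0.6, "draw": 0.25, "lose": 0.15})
           for _ in range(1000)]
win_rate = results.count("win") / len(results)
```

Over 1000 draws the observed win rate should sit close to the 0.6 used to generate it, which is exactly the long-run behaviour the season simulations rely on.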
<p>Calculating the total points won per season, we can quite easily work out the percentage of simulations in which each team exceeds 100 points</p>
<pre class="r"><code>simulated_points <- simulated_results %>%
  #for each sim, get the points won by each team
  lapply(., function(data) {
    data <- data %>%
      mutate(points = case_when(
        result == "win" ~ 3,
        result == "draw" ~ 1,
        result == "lose" ~ 0
      )) %>%
      group_by(season, team) %>%
      mutate(total_points = sum(points)) %>%
      select(season, team, total_points) %>%
      unique()
  }) %>%
  do.call(rbind, .)
#probability of reaching 100 points is no. of sims > 100 points / n_sims
centurion_probs <- simulated_points %>%
  filter(total_points > 99) %>%
  group_by(season, team) %>%
  summarise(centurion_prob = n() / n_sims) %>%
  arrange(-centurion_prob)
print(centurion_probs)</code></pre>
<pre><code>## # A tibble: 27 x 3
## # Groups: season [16]
## season team centurion_prob
## <dbl> <chr> <dbl>
## 1 2017 Manchester City 0.17
## 2 2018 Manchester City 0.107
## 3 2018 Liverpool 0.076
## 4 1994 Manchester United 0.069
## 5 2004 Chelsea 0.046
## 6 2009 Chelsea 0.039
## 7 2011 Manchester City 0.029
## 8 2007 Manchester United 0.023
## 9 2016 Tottenham Hotspur 0.023
## 10 2006 Manchester United 0.011
## # … with 17 more rows</code></pre>
<p>As expected, the two recent Manchester City teams come out on top, with the one that actually did reach 100 points (2017/2018) estimated to have had a 17% chance of reaching that milestone, given their strength.</p>
<p>So now we have a baseline: the best team at accumulating points (Manchester City 2017/2018) had ~3x the chance of winning 100 points in a season as the very best (potentially) invincible team (Chelsea 2004/2005) had of going unbeaten. I.e. we have some (not super strong) evidence that it is ~3x as hard to go a season unbeaten as it is to become a ‘centurion’.</p>
<p>We can calculate how many points our threshold needs to be set at to have an equal chance using top_frac() on our 1000 simulations.</p>
<pre class="r"><code>#find the points threshold for Man City 2017 that would be reached
#as often as Chelsea 2004 would go unbeaten
invincible_equivalent <- simulated_points %>%
  ungroup() %>%
  filter(season == 2017 & team == "Manchester City") %>%
  top_frac(max(invincible_chance$invincible_chance)) %>%
  arrange(total_points)
#print the lowest threshold
head(invincible_equivalent, n = 1)</code></pre>
<pre><code>## # A tibble: 1 x 3
## season team total_points
## <dbl> <chr> <dbl>
## 1 2017 Manchester City 103</code></pre>
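<p>What top_frac() is doing here: take the fraction of simulated seasons equal to the target probability, keep the highest point totals, and read off the smallest of them as the threshold. A small Python sketch with invented simulation totals (not the real 1000 simulations):</p>

```python
def equivalent_threshold(sim_points, target_prob):
    # keep the top `target_prob` fraction of simulated seasons,
    # then return the lowest total among them
    k = max(1, round(target_prob * len(sim_points)))
    return sorted(sim_points, reverse=True)[k - 1]

# ten hypothetical simulated season totals
sims = [95, 98, 100, 101, 103, 104, 106, 99, 97, 102]
threshold = equivalent_threshold(sims, 0.3)  # top 30% of 10 sims = 3 seasons
```

In other words, the threshold is an empirical quantile of the simulated distribution, matched to the ~5% invincibility probability found above.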
<p>So we might presume that the equivalent achievement to going the season unbeaten is to win 103 points in the Premier League. To see how the 2017/2018 Manchester City team compare to this we can plot the expected final points total of that season (given league team strengths) in a histogram:</p>
<pre class="r"><code>p2 <- simulated_points %>%
  ungroup() %>%
  filter(season == 2017 & team == "Manchester City") %>%
  ggplot(., aes(x = total_points)) +
  geom_histogram(fill = "skyblue", alpha = 0.7) +
  #invincible equivalent achievement in red
  geom_vline(xintercept = min(invincible_equivalent$total_points),
             colour = "red", linetype = "dashed", size = 2) +
  #actual achievement in blue
  geom_vline(xintercept = filter(league_winners, season == 2017 & league_position == 1)$total_points,
             colour = "blue", linetype = "dashed", size = 2) +
  labs(title = "Man C. expected 2017/2018 performance c.f. invincible equivalent threshold",
       subtitle = "invincible equivalent achievement = 103 points, actual = 100 points",
       x = "season expected total points",
       y = paste("times achieved over", n_sims, "simulations")) +
  theme_minimal()
p2</code></pre>
<p><img src="/post/20200304invincibles_files/figurehtml/plot_threshold1.png" width="672" /></p>
<p>The original question was really whether this year’s Liverpool team might achieve this 103 point threshold (given that they have now failed to go unbeaten). We can test this by applying exactly the same procedure to their season so far.</p>
<p>First we need to download all the match data from fbref. Handily, fbref doesn’t just give us the goals scored per match but also the <a href="https://fbref.com/en/expectedgoalsmodelexplained/"><em>expected goals</em></a> each team managed to put up. We’re going to use those to model team strengths, as we might assume* this is a better measure of how good a team really is. In order to fit the model using the regista package I need to supply an integer, so I’ve simply rounded the xG numbers to the nearest whole number**.</p>
<p>*let’s ignore game state and other such important things; this is <em>five minute</em> football trivia
**you can actually use expected goals in a regista::dixoncoles model, see <a href="https://www.roberthickman.eu/post/wslprediction1/">here</a>, but this is <em>five minute</em> football trivia</p>
<pre class="r"><code>#download the match data from 2019/2020
fixtures_2020 <- "https://fbref.com/en/comps/9/schedule/PremierLeagueFixtures" %>%
  read_html() %>%
  html_nodes("#sched_ks_3232_1") %>%
  html_table() %>%
  as.data.frame() %>%
  separate(Score, into = c("hgoal", "agoal"), sep = "–") %>%
  #only care about goals and expected goals
  select(home = Home, away = Away, home_xg = xG, away_xg = xG.1, hgoal, agoal) %>%
  filter(home != "") %>%
  mutate(home = factor(home), away = factor(away)) %>%
  #round expected goals to nearest integer
  mutate_at(c("home_xg", "away_xg", "hgoal", "agoal"), .funs = funs(round(as.numeric(.))))
#matches with a known result
#used for modelling
played_matches <- fixtures_2020 %>%
  filter(!is.na(home_xg))
#matches with an unknown result
#used for simulation
unplayed_matches <- fixtures_2020 %>%
  filter(is.na(home_xg)) %>%
  select_if(negate(is.numeric))
#fit the dixon coles model
fit_2020 <- dixoncoles(home_xg, away_xg, home, away, data = played_matches)</code></pre>
<p>And as before we can plot the team strength in attacking and defending dimensions</p>
<pre class="r"><code>#extract parameters from the model
pars_2020 <- fit_2020$par %>%
  .[grepl("def_|off_", names(.))] %>%
  matrix(., ncol = 2) %>%
  as.data.frame() %>%
  rename(attack = V1, defence = V2)
pars_2020$team <- unique(gsub("def_*|off_*", "", names(fit_2020$par)))[1:20]
#plot as before
p3 <- pars_2020 %>%
  #flip the defence parameter (higher = better)
  mutate(defence = 1 - defence) %>%
  ggplot(aes(x = attack, y = defence, colour = attack + defence, label = team)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_text_repel() +
  scale_colour_continuous(guide = FALSE) +
  labs(title = "Dixon Coles parameters per team for the 2019/2020 Premier League Season",
       x = "attacking strength",
       y = "defensive strength") +
  theme_minimal()
p3</code></pre>
<p><img src="/post/20200304invincibles_files/figurehtml/2020_dc_parameters1.png" width="672" /></p>
<p>Surprisingly, a distant 2nd place Manchester City actually rate higher than Liverpool using this model, and Manchester United (by all accounts having a very middling season) aren’t far off either.</p>
<p>Now we just need to simulate the remaining games of Liverpool’s season to see how likely they are to hit our 103 point target. We can then add the points we expect Liverpool to win to the number of points we know they already have, to get an estimate of their final total.</p>
<pre class="r"><code>#calculate points we know Liverpool have
liverpool_points <- played_matches %>%
  filter(home == "Liverpool" | away == "Liverpool") %>%
  mutate(points = case_when(
    hgoal == agoal ~ 1,
    home == "Liverpool" & (hgoal > agoal) ~ 3,
    away == "Liverpool" & (agoal > hgoal) ~ 3,
    TRUE ~ 0
  )) %>%
  summarise(total_points = sum(points))
#estimate the chance of results in all remaining games
unplayed_results <-
  augment.dixoncoles(fit_2020, unplayed_matches, type.predict = "outcomes") %>%
  unnest() %>%
  #filter out the liverpool ones
  filter(home == "Liverpool" | away == "Liverpool")
#function to simulate a season by making weighted samples
simulate_season <- function(result_probabilities) {
  result_probabilities %>%
    nest(outcome, prob, .key = "results") %>%
    mutate(sampled = map(results, ~ sample_n(., 1, weight = prob))) %>%
    select(-results) %>%
    unnest()
}
#simulate the rest of liverpool's season
liverpool_2020_simulated <- rerun(n_sims, simulate_season(unplayed_results)) %>%
  bind_rows(.id = "simulation_id") %>%
  #find the sampled points won per game
  mutate(points = case_when(
    home == "Liverpool" & outcome == "home_win" ~ 3,
    away == "Liverpool" & outcome == "away_win" ~ 3,
    outcome == "draw" ~ 1,
    TRUE ~ 0
  )) %>%
  group_by(simulation_id) %>%
  #calculate Liverpool's total season points for this simulation
  summarise(total_points = sum(points) + as.numeric(liverpool_points))</code></pre>
<p>It’s then very easy to find the fraction of simulations in which Liverpool clear this 103 point threshold</p>
<pre class="r"><code>length(which(liverpool_2020_simulated$total_points > 102)) / 1000</code></pre>
<pre><code>## [1] 0.157</code></pre>
<p>Finally, we can plot it as before to see how many points we expect Liverpool to win this season:
(this time the 103 point threshold is in blue to stand out against the red that Liverpool play in)</p>
<pre class="r"><code>p4 <- liverpool_2020_simulated %>%
  ggplot(., aes(x = total_points)) +
  geom_histogram(fill = "red", alpha = 0.7) +
  #invincible equivalent achievement in blue this time
  geom_vline(xintercept = min(invincible_equivalent$total_points),
             colour = "blue", linetype = "dashed", size = 2) +
  labs(title = "Liverpool's expected 2019/2020 performance c.f. invincible equivalent threshold",
       subtitle = "invincible equivalent achievement = 103 points in blue this time",
       x = "season expected total points",
       y = paste("times achieved over", n_sims, "simulations")) +
  theme_minimal()
p4</code></pre>
<p><img src="/post/20200304invincibles_files/figurehtml/plot_liverpool_expectation1.png" width="672" /></p>
<p>Anyway, that’s that for the first of these (hopefully of many!). Did we learn anything? Probably not. But did we at least do something interesting? Also probably not. But I do like doing these silly little analyses in my spare time, and by not limiting myself to things like rigour, I can pump them out faster. I’ll probably aim for one post (smaller than this) a week to start building a little bit of a public portfolio up again (I’m unemployed in 5 months, hire me!!). Hope you enjoyed reading it :)</p>

Advent Calendar of Football Trivia Analyses
/post/advent_calendar_trivia/
Sun, 01 Dec 2019 00:00:00 +0000
/post/advent_calendar_trivia/
<p>One of the most consistent <a href="https://www.roberthickman.eu/project/guardian_knowledge/">fonts of posts on this blog</a> is The Guardian’s football trivia page <a href="https://www.theguardian.com/football/series/theknowledge">The Knowledge</a>. A particular reason for this is that the small contained questions lend themselves to small blogposts that I can turn around in an hour or two, as opposed to being endlessly redrafted until I lose interest.</p>
<p>However, I still sometimes don’t quite get round to finishing some of these posts, or have trouble justifying a blog post on a very small and ‘trivial’ answer to a question. Therefore, as a sort of end-of-year round up, and a Christmas present to myself, I wanted to push out answers to the questions I found particularly interesting over the last year and hadn’t quite got round to<sup>1</sup>. I’ll probably add them all to this post as I finish them up.</p>
<div id="nddecembereverythinginitsrightplace" class="section level1">
<h1>2nd December - Everything in its right place</h1>
<blockquote class="twittertweet">
<p lang="en" dir="ltr">
I wonder if any of any sporting leagues have ever ended in alphabetical order? <a href="https://t.co/you6u8Uzwz">pic.twitter.com/you6u8Uzwz</a>
</p>
— P A Hunt (<span class="citation">@TeachFMaths</span>) <a href="https://twitter.com/TeachFMaths/status/1139832761295024128?ref_src=twsrc%5Etfw">June 15, 2019</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf8"></script>
<div id="answeryeskindof.butalsono." class="section level2">
<h2>Answer  yes, kind of. But also no.</h2>
<p>This question has actually <a href="https://www.theguardian.com/football/2011/mar/09/hasleagueeverfinishedalphabeticalorder">been answered</a> (as many of these will have been). For a league of 20 teams (like the English Premier League), we might imagine it would have happened at some point over the last ~150 years, but some basic maths makes it almost certain that it hasn’t, and moreover, never will.</p>
<p>Let’s load some data and see why.</p>
<pre class="r"><code>#as per usual, going to heavily rely on tidyverse
#and engsoccerdata throughout these posts
library(tidyverse)
library(engsoccerdata)</code></pre>
<pre class="r"><code>#load English league data
league_data <- engsoccerdata::england %>%
  #select and gather match results
  select(season = Season, division, home, visitor, hgoal, vgoal) %>%
  gather("location", "team", -season, -division, -hgoal, -vgoal) %>%
  mutate(
    g_for = case_when(
      location == "home" ~ hgoal,
      location == "visitor" ~ vgoal
    ),
    g_ag = case_when(
      location == "home" ~ vgoal,
      location == "visitor" ~ hgoal
    )) %>%
  #get the correct points for a win/loss
  mutate(
    points = case_when(
      g_for > g_ag & season < 1981 ~ 2,
      g_for > g_ag & season > 1980 ~ 3,
      g_for == g_ag ~ 1,
      g_for < g_ag ~ 0
    ),
    gd = g_for - g_ag
  ) %>%
  #group by season and league and get final tables
  group_by(season, division, team) %>%
  summarise(points = sum(points),
            gd = sum(gd),
            g_for = sum(g_for)) %>%
  arrange(-points, -gd, -g_for) %>%
  #rank league order and alphabetical order
  mutate(league_pos = rank(-points, ties.method = "first"),
         alph_order = rank(team, ties.method = "first")) %>%
  select(season, division, team, league_pos, alph_order) %>%
  #split by league and season
  split(., f = list(.$season, .$division)) %>%
  keep(function(x) nrow(x) > 0)
#print the top of the first league table
head(league_data[[1]])</code></pre>
<pre><code>## # A tibble: 6 x 5
## # Groups: season, division [1]
## season division team league_pos alph_order
## <int> <int> <chr> <int> <int>
## 1 1888 1 Preston North End 1 9
## 2 1888 1 Aston Villa 2 2
## 3 1888 1 Wolverhampton Wanderers 3 12
## 4 1888 1 Blackburn Rovers 4 3
## 5 1888 1 Bolton Wanderers 5 4
## 6 1888 1 West Bromwich Albion 6 11</code></pre>
<p>We can then run a load of Spearman’s rank correlation tests on the data to see which seasons are perfectly correlated or anti-correlated between league and alphabetical ranking. We’ll use the very handy <a href="https://cran.r-project.org/web/packages/broom/vignettes/broom.html">broom</a> package to tidy the results of our many tests into one data.frame (remove the filter at the end of the pipe chain to see what gets output).</p>
<pre class="r"><code>#use broom to tidily do stats
library(broom)
#correlate league and alphabetical order by year
exact_correlations <- league_data %>%
  map_df(., function(data) {
    cor.test(
      data$league_pos,
      data$alph_order,
      method = "spearman"
    ) %>%
      tidy() %>%
      mutate(season = unique(data$season),
             division = unique(data$division))
  }) %>%
  #take only perfect correlations or anti-correlations (rho = +/-1)
  filter(abs(estimate) == 1)
print(exact_correlations)</code></pre>
<pre><code>## # A tibble: 0 x 7
## # ... with 7 variables: estimate <dbl>, statistic <dbl>, p.value <dbl>,
## # method <chr>, alternative <chr>, season <int>, division <int></code></pre>
<p>And so we find no exact correlations. There are no instances in 363 separate seasons of English league football where the teams finished in either alphabetical or anti-alphabetical order.</p>
<p>Let’s see why this is. To make things simpler, I’m going to imagine a cut-down league of only 6 teams, using teams starting with each of the first 6 letters of the alphabet:</p>
<pre class="r"><code>first_letter_names <- league_data %>%
  bind_rows() %>%
  ungroup() %>%
  #get the first letter of each team name
  mutate(first_letter = gsub("(^.)(.*)", "\\1", team)) %>%
  filter(season > 1992 &
           division == 1 &
           first_letter %in% toupper(letters[1:6])
  ) %>%
  #get one team beginning with A, B, C...
  filter(!duplicated(first_letter)) %>%
  select(team) %>%
  arrange(team) %>%
  print()</code></pre>
<pre><code>## # A tibble: 6 x 1
## team
## <chr>
## 1 Arsenal
## 2 Blackburn Rovers
## 3 Coventry City
## 4 Derby County
## 5 Everton
## 6 Fulham</code></pre>
<p>For the league to finish in alphabetical order, we first need the team that is first alphabetically (Arsenal) to finish in first position. Assuming all teams have an equal chance of winning the league, the chance of this is obviously</p>
<p><span class="math display">\[ p(Arsenal = 1) = \frac{1}{n}\]</span></p>
<p>Then we need the second team (Blackburn Rovers), to finish in second. This is predicated on Arsenal already finishing in first position, so the chance becomes</p>
<p><span class="math display">\[ p(Blackburn = 2 \mid Arsenal = 1) = \frac{1}{n-1} \]</span></p>
<p>and so on until the last team (Fulham) just has to slot into the only position left (nth; 6th in our example).</p>
<p>Thus the total chance becomes</p>
<p><span class="math display">\[ \frac{1}{n} \cdot \frac{1}{n-1} \cdot \ldots \cdot \frac{1}{1} \]</span></p>
<p>which can also be written</p>
<p><span class="math display">\[ p(ordered) = \prod_{n = 1}^{N} \frac{1}{n}\]</span></p>
<p>which multiplies out to</p>
<p><span class="math display">\[ p(ordered) = \frac{1}{n!} \]</span></p>
<p>so for our very small league of 6 (assumed equally strong) teams the denominator is</p>
<pre class="r"><code>factorial(nrow(first_letter_names))</code></pre>
<pre><code>## [1] 720</code></pre>
<p>so we have a 1/720 chance that this league ends perfectly in alphabetical order. For bigger leagues (for reference, most large European leagues contain 18-24 teams) this denominator grows <em>super-exponentially</em> and the probability becomes tiny.</p>
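<p>As a quick sanity check (in Python, since this blog dabbles in it too), we can brute-force every possible finishing order of the 6-team example league above and confirm that exactly one of the 720 permutations is alphabetical:</p>

```python
import itertools
import math

# the cut-down 6-team league from the example above
teams = ["Arsenal", "Blackburn", "Coventry", "Derby", "Everton", "Fulham"]

# enumerate every possible league table
perms = list(itertools.permutations(teams))
# count how many tables are in alphabetical order
ordered = sum(1 for p in perms if list(p) == sorted(teams))

print(len(perms))   # 720 possible tables
print(ordered)      # exactly 1 is alphabetical
print(ordered / len(perms) == 1 / math.factorial(6))  # True
```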
<p>For the English Premier League (20 teams) for instance the chance becomes</p>
<pre class="r"><code>league_data %>%
bind_rows() %>%
ungroup() %>%
filter(season == max(season) & division == 1) %>%
nrow() %>%
factorial()</code></pre>
<pre><code>## [1] 2.432902e+18</code></pre>
<p>or 1 in 2.4 <a href="https://en.wikipedia.org/wiki/Order_of_magnitude">quintillion</a>. In short, if it’s assumed that there’s no relation between order of names and team strength, we might expect the universe to end before all 20 teams finish in perfect order.</p>
<p>We can test if our predictions bear out by looking at tiny leagues with small numbers of teams, e.g. <a href="https://en.wikipedia.org/wiki/2018%E2%80%9319_UEFA_Champions_League_group_stage">the group stages of the Champions/Europa Leagues</a>.</p>
<p>First we need to scrape the final tables for the last 9 seasons of data from both competitions:</p>
<pre class="r"><code>library(rvest)
#website to scrape group stage data from
fb_data <- "https://footballdatabase.com"
ucl_links <- sprintf(
"/league-scores-tables/uefa-champions-league-20%s-%s",
10:18, 11:19
)
europa_links <- sprintf(
"/league-scores-tables/uefa-europa-league-20%s-%s",
10:18, 11:19
)
#function to scrape the data from these links
get_competition_data <- function(competition, links) {
data <- links %>%
paste0(fb_data, .) %>%
map_df(., function(year) {
page_read <- read_html(year)
groups <- letters[1:8] %>%
map_df(., function(group) {
page_read %>%
html_nodes(sprintf("#totalgroup%s > div > table", group)) %>%
html_table(fill = TRUE) %>%
as.data.frame() %>%
mutate(group)
}) %>%
mutate(year = gsub("(.*)([0-9]{4}-[0-9]{2})", "\\2", year))
}) %>%
mutate(competition)
}
#scrape and bind the data
uefa_data <- bind_rows(
get_competition_data("champions", ucl_links),
get_competition_data("europa", europa_links)
)
#print a cut-down version of the scraped data
head(uefa_data %>% select(club = Club, points = P, year, competition))</code></pre>
<pre><code>## club points year competition
## 1 Tottenham Hotspur 11 2010-11 champions
## 2 Inter Milan 10 2010-11 champions
## 3 FC Twente 6 2010-11 champions
## 4 Werder Bremen 5 2010-11 champions
## 5 Schalke 04 13 2010-11 champions
## 6 Lyon 10 2010-11 champions</code></pre>
<p>So now we have 144 (8 groups x 9 years x 2 competitions) ‘mini-leagues’ each of 4 teams.</p>
<p>We can then munge this data to find all the groups where the teams finish in alphabetical order. We’d expect 144/4! leagues to finish in alphabetical order (6 to be exact).</p>
<pre class="r"><code>ordered_groups <- uefa_data %>%
#select relevant information
select(team = Club, league_pos = X., group, year, competition) %>%
#by group find where teams finish in alphabetical order
group_by(year, group, competition) %>%
mutate(alph_order = rank(team, ties.method = "first")) %>%
filter(league_pos == alph_order) %>%
#keep only group where all (4) teams finish in order
summarise(n = n()) %>%
filter(n == 4) %>%
#join and filter back data
left_join(uefa_data, ., by = c("group", "year", "competition")) %>%
filter(!is.na(n)) %>%
#select useful information
select(team = Club, points = P, gd = X..., league_pos = X.,
group, year, competition) %>%
#split groups up
split(., list(.$year, .$group, .$competition)) %>%
keep(function(x) nrow(x) > 0)</code></pre>
<p>which leaves us with 6 groups that have finished in order! Almost exactly what we’d predict by chance if the first letter of a team’s name had no effect on the outcome.</p>
<pre class="r"><code>ordered_groups</code></pre>
<pre><code>## $`2011-12.c.champions`
## team points gd league_pos group year competition
## 5 Benfica 12 4 1 c 2011-12 champions
## 6 FC Basel 11 1 2 c 2011-12 champions
## 7 Manchester United 9 3 3 c 2011-12 champions
## 8 Otelul Galati 0 -8 4 c 2011-12 champions
##
## $`2015-16.c.champions`
## team points gd league_pos group year competition
## 9 Atlético Madrid 13 8 1 c 2015-16 champions
## 10 Benfica 10 2 2 c 2015-16 champions
## 11 Galatasaray 5 -4 3 c 2015-16 champions
## 12 Lokomotiv Astana 4 -6 4 c 2015-16 champions
##
## $`2010-11.f.champions`
## team points gd league_pos group year competition
## 1 Chelsea FC 15 10 1 f 2010-11 champions
## 2 Marseille 12 9 2 f 2010-11 champions
## 3 Spartak Moskva 9 -3 3 f 2010-11 champions
## 4 Žilina 0 -16 4 f 2010-11 champions
##
## $`2015-16.g.champions`
## team points gd league_pos group year competition
## 13 Chelsea FC 13 10 1 g 2015-16 champions
## 14 Dynamo Kyiv 11 4 2 g 2015-16 champions
## 15 FC Porto 10 1 3 g 2015-16 champions
## 16 Maccabi Tel Aviv FC 0 -15 4 g 2015-16 champions
##
## $`2018-19.h.champions`
## team points gd league_pos group year competition
## 17 Juventus 12 5 1 h 2018-19 champions
## 18 Manchester United 10 3 2 h 2018-19 champions
## 19 Valencia 8 0 3 h 2018-19 champions
## 20 Young Boys 4 -8 4 h 2018-19 champions
##
## $`2012-13.h.europa`
## team points gd league_pos group year competition
## 21 FC Rubin Kazan 14 7 1 h 2012-13 europa
## 22 Inter Milan 11 2 2 h 2012-13 europa
## 23 Neftçi 3 -4 3 h 2012-13 europa
## 24 Partizan Beograd (SRB) 3 -5 4 h 2012-13 europa</code></pre>
<p>We can also do a larger test by randomly selecting teams out of the English league data we looked at earlier. To do this I need two quick functions: one to sample randomly from the data, and another to carry out the correlation test.</p>
<p>The first takes a number of samples (how many tests to run) and then selects a number of teams from each league sample. For instance, if I chose 3 teams, it might select Liverpool, Manchester United, and Watford, from the <a href="https://en.wikipedia.org/wiki/2018%E2%80%9319_Premier_League">last season of the Premier League</a>. These teams finished 2nd, 6th, and 11th respectively, so this ‘sampled league’ would fulfill the criteria of finishing in alphabetical order.</p>
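<p>The check itself is simple enough to sketch in a few lines of Python (the function name is mine, and the positions are the ones quoted above): sort the sampled teams by finishing position and ask whether that order is also alphabetical.</p>

```python
def finishes_in_alpha_order(sample):
    """True if teams sorted by league position are also in alphabetical order.

    sample: list of (team_name, league_position) tuples.
    """
    by_position = [team for team, _ in sorted(sample, key=lambda t: t[1])]
    return by_position == sorted(by_position)

# Liverpool (2nd), Manchester United (6th), Watford (11th) in 2018-19
sample = [("Watford", 11), ("Liverpool", 2), ("Manchester United", 6)]
print(finishes_in_alpha_order(sample))  # True
```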
<pre class="r"><code>set.seed(3459)
#take a random sample of leagues and teams within those leagues
sample_cutdown_leagues <- function(nteams, nsamples, data) {
samples <- sample(length(data), nsamples, replace = TRUE)
sampled_league_data <- data[samples]
league_team_serials <- sampled_league_data %>%
lapply(., nrow) %>%
lapply(., sample, size = nteams)
#carry out the correlation test
league_cor_test <- map2_df(
.x = sampled_league_data,
.y = league_team_serials,
.f = cor_test_data
)
}
#function for correlation test
cor_test_data <- function(full_league_data, sampled_teams) {
sampled_league <- full_league_data[sampled_teams,] %>%
arrange(league_pos)
cor_test <- cor.test(
sampled_league$league_pos,
sampled_league$alph_order,
method = "spearman"
) %>%
tidy() %>%
#mutate on information about that season and teams chosen
mutate(teams = paste(sampled_league$team, collapse = ", "),
season = unique(sampled_league$season),
division = unique(sampled_league$division))
}</code></pre>
<p>So for instance if I just run it once, randomly selecting 4 teams:</p>
<pre class="r"><code>test <- sample_cutdown_leagues(4, 1, league_data)
#print the teams selected
test$teams</code></pre>
<pre><code>## [1] "Brentford, Bristol Rovers, Brighton & Hove Albion, Chester"</code></pre>
<pre class="r"><code>test</code></pre>
<pre><code>## # A tibble: 1 x 8
## estimate statistic p.value method alternative teams season division
## <dbl> <dbl> <dbl> <chr> <chr> <chr> <int> <int>
## 1 0.8 2.00 0.333 Spearma~ two.sided Brentfor~ 1994 3</code></pre>
<p>It gives me 4 teams from the 1994 division 3 who didn’t finish in alphabetical order (though, amusingly, all have a very similar starting letter).</p>
<p>We can then carry this out with 10000 samples for n_team numbers of 2:6 to see if we get roughly the expected numbers of exactly correlated league finish positions (this will take 1-2 mins) by finding out how many tests give an estimate of 1 (finished exactly correlated with alphabetical order) or -1 (finished exactly anti-correlated with alphabetical order).</p>
<p>Both these numbers should be roughly equal to the number of samples (10000) divided by the factorial of the number of teams selected.</p>
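<p>The same expectation can be sketched with a quick Monte Carlo in Python (this samples random finishing orders directly rather than real league data, and the seed is my arbitrary choice): both counts should land near n_samples / n!.</p>

```python
import random
from math import factorial

def count_perfect_orders(n_teams, n_samples, seed=42):
    """Count random finishing orders that are perfectly sorted or reverse-sorted."""
    rng = random.Random(seed)
    base = list(range(n_teams))
    correlated = anti_correlated = 0
    for _ in range(n_samples):
        order = rng.sample(base, n_teams)  # one random finishing order
        correlated += order == base        # perfectly 'alphabetical'
        anti_correlated += order == base[::-1]  # perfectly reversed
    return correlated, anti_correlated, n_samples / factorial(n_teams)

cor, anticor, expected = count_perfect_orders(4, 10_000)
print(cor, anticor, expected)  # both counts should be close to ~417
```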
<pre class="r"><code>test_n_numbers <- function(nteams) {
#run sampling function n times
#10k should do
sampling <- sample_cutdown_leagues(nteams, 10000, league_data)
#find exactly correlated and anticorrelated examples
#where teams are in exact alphabetical order ascending or descending
correlated <- length(which(sampling$estimate == max(sampling$estimate)))
anti_correlated <- length(which(sampling$estimate == min(sampling$estimate)))
expected <- nrow(sampling) / factorial(nteams)
df <- data.frame(n = nteams,
sample_cor = correlated,
sample_anticor = anti_correlated,
sample_expected = expected)
}
#run the function
testing <- map_df(2:6, test_n_numbers)
#print results
print(testing)</code></pre>
<pre><code>## n sample_cor sample_anticor sample_expected
## 1 2 5010 4990 5000.00000
## 2 3 1676 1665 1666.66667
## 3 4 367 398 416.66667
## 4 5 101 81 83.33333
## 5 6 14 15 13.88889</code></pre>
<p>And the numbers line up, as we would expect if there is no effect of the first letter of a team’s name upon final league position.</p>
<p>Finally, we can do a Kendall’s correlation test to really see if there is any relationship between alphabetical team name order and final league finish for all of our English league data. We use Kendall instead of a Spearman test here because, grouping all the data together, we’re going to have a lot of ties (one team has to finish 1st in every league each year).</p>
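<p>For intuition, the statistic behind Kendall’s test is just a count of concordant versus discordant pairs of observations. A minimal Python sketch of the simplest variant (tau-a, which ignores the tie corrections that cor.test’s tau-b applies — this is an illustration, not the exact statistic R computes):</p>

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Tau-a: (concordant - discordant) / total pairs; tied pairs count as neither."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau_a([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0, perfectly concordant
print(kendall_tau_a([1, 2, 3], [3, 2, 1]))        # -1.0, perfectly discordant
```

With many ties (many teams sharing a league position across seasons), tau-a is dragged toward 0, which is why a tie-corrected variant is the right tool for the pooled data.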
<pre class="r"><code>all_data <- league_data %>%
bind_rows()
#do a big correlation test
kendall_test <- cor.test(all_data$alph_order,
all_data$league_pos,
alternative = "two.sided",
method = "kendall") %>%
tidy() %>%
print()</code></pre>
<pre><code>## # A tibble: 1 x 5
## estimate statistic p.value method alternative
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 0.0135 1.74 0.0826 Kendall's rank correlation tau two.sided</code></pre>
<p>And we can see that, even though our p-value is <a href="https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/">‘approaching significance’</a>, it’s not significant at our fairly liberal threshold of 0.05. Even then, the effect size (0.013) is tiny, so there’s no need for Watford to start worrying <a href="https://www.bbc.co.uk/sport/football/50619972">just yet</a>.</p>
<ol style="list-style-type: decimal">
<li>SMALL DIGRESSION: I love blogging on this site and it has been a great help to me in numerous ways (practice coding/writing, feeling like a “programmer”, job interviews), but quite a lot of the time I feel posts are not quite where I want them (I’m sure this feeling isn’t restricted to me) and so don’t put them up, meaning the time (sometimes quite a few hours!) I put into them in my spare time feels wasted and makes me feel worse about myself. I’m hoping that pushing out fairly rushed/half-formed ideas like this will help.</li>
</ol>
</div>
</div>
<div id="rddecembergroanrangers" class="section level1">
<h1>3rd December – Groan Rangers</h1>
<p><a href="https://www.theguardian.com/football/2019/jul/24/which-teams-were-managed-by-their-all-time-leading-goalscorer">“Berwick Rangers have conceded 42 goals in competitive matches – Scottish League 2, relegation playoff, Scottish League – since last scoring themselves, against Peterhead, on March 19th. Is this a record for a league club (I know they’ve now lost that status, but all of these matches are league level competition)?” asks Huw Richards.</a></p>
<div id="answeritbeatsanyteamintheenglishleague.reproduciblecodebelowifyouwanttocheckforotherleagues." class="section level2">
<h2>Answer – It beats any team in the English league. Reproducible code below if you want to check for other leagues.</h2>
<p>(I did check for most of them in the dataset, although this doesn’t include foreign cup competitions. Nothing seems to get close)</p>
<p>This is quite a nice question from a data munging point of view. It’s extremely quantifiable and only involves a little grouping by.</p>
<p>First we’ll load the libraries we’re relying on in this little project:</p>
<pre class="r"><code>library(engsoccerdata)
library(tidyverse)</code></pre>
<p>I’m going to focus on the English league as it has the most data and also has data on the concurrent cup competitions. It’s super easy to sub in whichever competitions in the engsoccerdata package you want.</p>
<p>We want to first bind the data from the league, league cup, fa cup, and league playoffs together with a little munging. Then we want to gather the data to get the goals scored and goals conceded in each game for each team.</p>
<pre class="r"><code>#bind all the match data together with relevant variables
scoring_data <- bind_rows(
engsoccerdata::england %>%
select(date = Date, tier, home, visitor, hgoal, vgoal) %>%
mutate(date = as.Date(date),
competition = "league"),
engsoccerdata::facup %>%
select(date = Date, home, visitor, hgoal, vgoal) %>%
mutate(date = as.Date(date),
tier = NA,
competition = "fa_cup"),
engsoccerdata::leaguecup %>%
select(date = Date, home, visitor, hgoal, vgoal) %>%
mutate(date = as.Date(date),
tier = NA,
competition = "league_cup"),
engsoccerdata::englandplayoffs %>%
select(date = Date, home, visitor, hgoal, vgoal, htier, vtier) %>%
mutate(date = as.Date(date),
tier = (htier+vtier)/2,
competition = "league_playoffs") %>%
select(-htier, -vtier)
) %>%
#gather and find matches for each team
gather("location", "team", -date, -hgoal, -vgoal, -competition, -tier) %>%
split(f = .$location) %>%
map2_df(., rev(.), function(df, vs_data)
mutate(df, opponent = vs_data$team)
) %>%
#add in goals for and against
mutate(goals_for = case_when(
location == "home" ~ hgoal,
TRUE ~ vgoal
)) %>%
mutate(goals_against = case_when(
location == "visitor" ~ hgoal,
TRUE ~ vgoal
)) %>%
arrange(team, date) %>%
group_by(team)
head(scoring_data)</code></pre>
<pre><code>## # A tibble: 6 x 10
## # Groups: team [1]
## date tier hgoal vgoal competition location team opponent
## <date> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 1875-11-06 NA 0 0 fa_cup home 105t~ Crystal~
## 2 1875-11-20 NA 3 0 fa_cup visitor 105t~ Crystal~
## 3 1876-11-11 NA 3 0 fa_cup home 105t~ 1st Sur~
## 4 1876-12-14 NA 6 1 fa_cup visitor 105t~ Oxford ~
## 5 1877-11-07 NA 0 2 fa_cup home 105t~ Old Har~
## 6 NA NA NA NA fa_cup visitor 105t~ Minerva
## # ... with 2 more variables: goals_for <dbl>, goals_against <dbl></code></pre>
<p>Next we need to find the start of each run of games where a team has failed to score. We can do this by finding the first instance of 0 goals scored using lag(). We’ll then give an id to each ‘run’ of finishing games without scoring.</p>
<p>(I’m aware that teams can also concede goals in a run having scored first in a match but there’s no way to factor that in with the data)</p>
<p>We then use the na.locf() function from the very useful <a href="https://cran.r-project.org/web/packages/zoo/zoo.pdf">zoo</a> package to fill in the runs where no goals have been scored.</p>
<p>We can then finish answering the question (already!) by grouping by run and summing the total number of goals conceded in that time.</p>
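<p>The lag()/na.locf() trick is essentially run-length grouping: find stretches of consecutive scoreless games and aggregate within each one. In Python the same idea falls out of itertools.groupby (the scorelines here are made up for illustration):</p>

```python
from itertools import groupby

def scoreless_runs(goals_for, goals_against):
    """Group consecutive scoreless games; return (games, conceded) per run."""
    runs = []
    for scored_zero, grp in groupby(zip(goals_for, goals_against),
                                    key=lambda g: g[0] == 0):
        if scored_zero:
            grp = list(grp)
            runs.append((len(grp), sum(against for _, against in grp)))
    return runs

# hypothetical goals for / goals against per match
gf = [1, 0, 0, 0, 2, 0, 0]
ga = [0, 3, 2, 4, 1, 1, 5]
print(scoreless_runs(gf, ga))  # [(3, 9), (2, 6)]
```

Each tuple is one dry run: its length in matches and the total conceded during it, which is exactly what the grouped summarise below computes.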
<pre class="r"><code>#load the zoo library for helping filling NA values
library(zoo)
dry_runs <- scoring_data %>%
#find the start of runs
mutate(run_start = case_when(
goals_for == 0 & lag(goals_for, default = 1) != 0 ~ 1:n()
)) %>%
#only care about games where didn't score
filter(goals_for == 0) %>%
#fill in NAs to get full runs
mutate(run_id = na.locf(run_start, na.rm = FALSE))
longest_dry_runs <- dry_runs %>%
#group runs by id
group_by(run_id, team) %>%
#find total conceded over n games
mutate(total_conceeded = sum(goals_against),
run_start_date = min(date),
matches = n()) %>%
#take only the last instance
filter(!duplicated(run_id, fromLast = TRUE)) %>%
select(run_start_date, run_end_date = date, team, run_id, total_conceeded, matches) %>%
#find the most 'impressive' runs
filter(total_conceeded > 15) %>%
arrange(total_conceeded)
head(longest_dry_runs)</code></pre>
<pre><code>## # A tibble: 6 x 6
## # Groups: run_id, team [6]
## run_start_date run_end_date team run_id total_conceeded matches
## <date> <date> <chr> <int> <dbl> <int>
## 1 1899-01-14 1899-03-11 Darwen 273 38 7
## 2 1898-11-12 1898-12-26 Darwen 263 35 7
## 3 1891-12-12 1892-01-09 Darwen 60 31 5
## 4 2019-04-09 2019-08-31 Bolton Wander~ 5447 29 11
## 5 1877-12-22 1886-10-23 1st Surrey Ri~ 8 27 3
## 6 1880-12-18 1894-01-27 Reading 6 27 3</code></pre>
<p>And can see that two 7 game runs from the (now-defunct) <a href="https://en.wikipedia.org/wiki/Darwen_F.C.">Darwen FC</a> are top of the list. Around 1898/1899 the team conceded 35 and 38 goals without scoring themselves.</p>
<p>Manually <a href="https://www.11v11.com/teams/darwen/tab/matches/season/1899/">looking at the data</a>, we can see that these two streaks are broken only by a few losses over Christmas 1898, a losing run of 18 games! Indeed, Darwen only won 2 games that season and set the record for most goals conceded (141).</p>
<p>Seven years earlier, the same team managed an impressive run of letting in 31 goals in just 5 matches, without scoring. If we want to check out the games in this run, we can do so by left_join()-ing our data together:</p>
<pre class="r"><code>#join data to inspect individual games
dry_run_matches <- dry_runs %>%
left_join(longest_dry_runs, by = c("team", "run_id")) %>%
filter(!is.na(total_conceeded)) %>%
select(date, team, opponent, goals_for, goals_against,
competition, tier, total_conceeded, run_id) %>%
arrange(total_conceeded)
#print this
print(filter(dry_run_matches, run_id == 60))</code></pre>
<pre><code>## # A tibble: 5 x 9
## # Groups: team [1]
## date team opponent goals_for goals_against competition tier
## <date> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 1891-12-12 Darw~ Sunderl~ 0 7 league 1
## 2 1891-12-25 Darw~ Blackbu~ 0 4 league 1
## 3 1891-12-26 Darw~ Aston V~ 0 7 league 1
## 4 1892-01-01 Darw~ Preston~ 0 4 league 1
## 5 1892-01-09 Darw~ Burnley 0 9 league 1
## # ... with 2 more variables: total_conceeded <dbl>, run_id <int></code></pre>
<p>Also, congratulations to the <a href="https://www.boltonwanderers.news/news/efl-discinplinary-panels-decision-on-wanderers-delayed-again/">oft-troubled Bolton Wanderers</a> who have got closest to this in modern times, failing to score in 11 straight matches, while conceding 29 goals in the process.</p>
<p>I also wanted to find out the opposite: the team that has scored the most goals without conceding any. It’s super easy with our pipeline: just switch goals_against and goals_for in the chain.</p>
<pre class="r"><code>#do the inverse
scoring_runs <- scoring_data %>%
mutate(run_start = case_when(
goals_against == 0 & lag(goals_against, default = 1) != 0 ~ 1:n()
)) %>%
filter(goals_against == 0) %>%
mutate(run_id = na.locf(run_start, na.rm = FALSE))
longest_scoring_runs <- scoring_runs %>%
group_by(run_id, team) %>%
mutate(total_scored = sum(goals_for),
run_start_date = min(date),
matches = n()) %>%
filter(!duplicated(run_id, fromLast = TRUE)) %>%
select(run_start_date, run_end_date = date, team, run_id, total_scored, matches) %>%
filter(total_scored > 15) %>%
arrange(total_scored)
head(longest_scoring_runs)</code></pre>
<pre><code>## # A tibble: 6 x 6
## # Groups: run_id, team [6]
## run_start_date run_end_date team run_id total_scored matches
## <date> <date> <chr> <int> <dbl> <int>
## 1 2010-04-25 2010-08-28 Chelsea 4372 32 7
## 2 1929-03-06 1929-03-30 Bradford City 919 29 5
## 3 2019-01-06 2019-01-26 Manchester City 5194 28 6
## 4 1903-04-10 1903-10-03 Arsenal 328 26 8
## 5 1880-01-17 1880-11-13 Clapham Rovers 25 26 5
## 6 1885-10-24 1885-12-12 Notts County 32 26 3</code></pre>
<p>Where we can see that Chelsea’s impressive end to the 2009-2010 season puts them top, having scored 32 goals without reply. Almost all the other top examples are from pre-war football, except Manchester City coming close last year with 28 goals scored without conceding.</p>
<p>When we look at this run we can see it was greatly helped along by some demolitions in the cups, winning 5-0, 9-0 and 7-0 against Burnley, Burton Albion, and Rotherham United.</p>
<pre class="r"><code>scoring_run_matches <- scoring_runs %>%
left_join(longest_scoring_runs, by = c("team", "run_id")) %>%
filter(!is.na(total_scored)) %>%
select(date, team, opponent, goals_for, goals_against,
competition, tier, total_scored, run_id) %>%
arrange(total_scored)
#print this
print(filter(scoring_run_matches, run_id == 5194))</code></pre>
<pre><code>## # A tibble: 6 x 9
## # Groups: team [1]
## date team opponent goals_for goals_against competition tier
## <date> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 2019-01-06 Manc~ Rotherh~ 7 0 fa_cup NA
## 2 2019-01-09 Manc~ Burton ~ 9 0 league_cup NA
## 3 2019-01-14 Manc~ Wolverh~ 3 0 league 1
## 4 2019-01-20 Manc~ Hudders~ 3 0 league 1
## 5 2019-01-23 Manc~ Burton ~ 1 0 league_cup NA
## 6 2019-01-26 Manc~ Burnley 5 0 fa_cup NA
## # ... with 2 more variables: total_scored <dbl>, run_id <int></code></pre>
</div>
</div>
<div id="thdecemberweregoingtowembley" class="section level1">
<h1>5th December – We’re going to Wembley</h1>
<blockquote class="twitter-tweet">
<p lang="en" dir="ltr">
Which team has had to travel the shortest combined distance in a cup run? (excluding regional competitions, just to make it interesting)
</p>
— Chris van Thomas (<span class="citation">@chrisvanthomas</span>) <a href="https://twitter.com/chrisvanthomas/status/1148879896430731266?ref_src=twsrc%5Etfw">July 10, 2019</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<div id="answermultipleteamshaveplayed5facupmatchesallathome.toanswertheinversequestionqueensparkin18831884and18841885havehadthefarthesttotravel" class="section level2">
<h2>Answer – Multiple teams have played 5 FA cup matches all at home. To answer the inverse question, Queen’s Park in 1883/1884 and 1884/1885 have had the farthest to travel</h2>
<p>For this question, I’m actually going to answer the opposite: which team has traveled the farthest in a cup run? The reason is that multiple teams have had cup runs (of 5 matches in the FA cup) without travelling away from home at all. The code below could easily be changed to analyse other cup competitions; for simplicity, I’m sticking with the FA cup, which has the most complete data in the engsoccerdata set.</p>
<p>Once again, we’ll start by loading libraries. We also want the <a href="https://r-spatial.github.io/sf/articles/sf1.html">sf</a> package that makes working with spatial data a bit cleaner.</p>
<pre class="r"><code>library(engsoccerdata)
library(tidyverse)
#also want sf to manipulate spatial features
library(sf)</code></pre>
<p>Then we want to grab the data. In a <a href="https://github.com/jalapic/engsoccerdata/commit/6133cf9f6fd77574a5a03097a6d2db4d213c508c">recent update of the engsoccerdata package</a> I added the location of grounds for teams in England which will let us find the distances teams have traveled to matches.</p>
<p>I also download a shapefile of the UK from <a href="https://gadm.org/">GADM</a> for plotting and to filter out any bad data in ground location (which still is very much in beta).</p>
<pre class="r"><code>#download a map of the uk to plot with
shape_url <- "https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_GBR_0_sf.rds"
temp_dir <- tempdir()
download.file(shape_url, destfile = file.path(temp_dir, "shapefiles.rds"), mode = "wb")
uk <- st_as_sf(readRDS(file.path(temp_dir, "shapefiles.rds")))
#the location of football grounds in the dataset
grounds <- engsoccerdata::england_locations %>%
st_as_sf(coords = c("lon", "lat"), crs = st_crs("+init=epsg:4326")) %>%
st_transform(crs = st_crs(uk)) %>%
#remove those that are bad data (outside the uk)
.[seq(nrow(.)) %in% unlist(st_contains(uk, .)),]
#get the fa cup match data
matches <- engsoccerdata::facup</code></pre>
<p>There’s some ambiguity in the question as to how the distance of a cup run should be measured. There are really two ways we can do this, which I will henceforth refer to as ‘routing’ and ‘spoking’. To illustrate the two different approaches, I’ll use Southampton’s 2017/2018 FA cup run.</p>
<pre class="r"><code>#e.g. with Southampton's semi final run in 2017/2018
Southampton <- matches %>%
filter(Season == 2017 &
(home == "Southampton" |
visitor == "Southampton")
) %>%
select(Venue, Date, home, visitor, neutral) %>%
gather("location", "team", -Venue, -Date, -neutral) %>%
filter(location == "home" | neutral == "yes") %>%
filter(!duplicated(Date)) %>%
mutate(location = case_when(
neutral == "yes" ~ Venue,
TRUE ~ team
))
#print matches
Southampton</code></pre>
<pre><code>## # A tibble: 5 x 5
## Venue Date neutral location team
## <chr> <date> <chr> <chr> <chr>
## 1 Craven Cottage 2018-01-06 <NA> Fulham Fulham
## 2 St Mary's Stadi~ 2018-01-27 <NA> Southampton Southampton
## 3 The Hawthorns 2018-02-17 <NA> West Bromwich Alb~ West Bromwich Alb~
## 4 DW Stadium 2018-03-18 <NA> Wigan Athletic Wigan Athletic
## 5 Wembley Stadium 2018-04-22 yes Wembley Stadium Chelsea</code></pre>
<p>We’ll join the ground geography data to this to figure out distances traveled</p>
<pre class="r"><code>#will need the location of southampton's ground
Southampton_home <- grounds %>%
filter(location == "Southampton")
#and the locations of all their matches
match_locations <- Southampton %>%
left_join(., select(grounds, location, geometry), by = "location")</code></pre>
<p>The first method of calculating distance traveled per match (‘spoking’) takes the location of each match, and finds the distance from that team’s home ground. E.g. for Southampton:</p>
<pre class="r"><code>#find the line from southampton's ground to the match location
spoke_lines <- st_coordinates(match_locations$geometry) %>%
split(f = rownames(.)) %>%
#create line from geometry1 to geometry2
lapply(function(x) {
mat <- rbind(x, st_coordinates(Southampton_home$geometry))
line <- st_linestring(mat)
return(line)
}) %>%
#cast to multiline and add projection
st_multilinestring() %>%
st_sfc(crs = st_crs("+init=epsg:4326"))
#plot over uk
p <- ggplot() +
geom_sf(data = uk) +
geom_sf(data = spoke_lines, colour = "blue", size = 1.5) +
theme_minimal()
plot(p)</code></pre>
<p><img src="/post/2019-12-01-advent_calendar_knowledge_files/figure-html/southampton_spokes-1.png" width="672" /></p>
<pre class="r"><code>#get the total length in metres
st_length(spoke_lines)</code></pre>
<pre><code>## 698614.6 [m]</code></pre>
<p>For any home games, the distance traveled is taken to be 0m.</p>
<p>The second method of calculating distance traveled takes the distance from each match <em>to</em> the next match in the run. I.e. it assumes a team stays in their past location until the next round (obviously not true in real life) and finds the distance to the ground for the next round’s game:</p>
<pre class="r"><code>#need coords separate
coords <- st_coordinates(match_locations$geometry) %>%
split(f = rownames(.))
#find distance from one coords to next
travel_lines <- mapply(coords, lead(coords), FUN = function(x, y) {
#for last match no further path
if(is.na(y)) {
return(NULL)
} else {
mat <- rbind(x, y)
line <- st_linestring(mat)
return(line)
}
}) %>%
#filter null last element
.[which(lengths(.) != 0)] %>%
st_multilinestring() %>%
st_sfc(crs = st_crs("+init=epsg:4326"))
#plot over uk
p <- ggplot() +
geom_sf(data = uk) +
geom_sf(data = travel_lines, colour = "blue", size = 1.5) +
theme_minimal()
plot(p)</code></pre>
<p><img src="/post/2019-12-01-advent_calendar_knowledge_files/figure-html/southampton_travel-1.png" width="672" /></p>
<pre class="r"><code>#get the length
st_length(travel_lines)</code></pre>
<pre><code>## 684586.7 [m]</code></pre>
<p>So Southampton here begin in London away to Fulham, before travelling to their home in Southampton, then on to Birmingham (West Brom), Wigan, and back to London to play at Wembley.</p>
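<p>To make the difference between the two metrics concrete outside of sf, here is a rough pure-Python sketch using great-circle (haversine) distances; the coordinates are my own approximations for illustration, not the package’s ground data. ‘Spoking’ measures every match from the home ground, while ‘routing’ chains the venues together in match order:</p>

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(h))

home = (50.906, -1.391)      # Southampton's ground, roughly
venues = [                   # match venues in order (rough coordinates)
    (51.475, -0.222),        # away tie in London
    home,                    # home tie: zero distance under 'spoking'
    (52.509, -1.964),        # away tie in the West Midlands
]

# 'spoking': each match measured from the home ground
spoke_km = sum(haversine_km(home, v) for v in venues)
# 'routing': venue-to-venue legs between consecutive rounds
route_km = sum(haversine_km(a, b) for a, b in zip(venues, venues[1:]))
print(round(spoke_km), round(route_km))
```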
<p>Now we have the two methods, we need to prep the data we’re going to analyse. As before, this is done by gathering a df of match data so we have one row per match per team in the FA cup. We then group by team-seasons and find how many matches they played in the cup that year. For a ‘cup run’ we assume more than 3 matches have to be played.</p>
<pre class="r"><code>#gather each match per team into a separate row
long_campaigns <- matches %>%
filter(!is.na(Date)) %>%
select(Season, home, visitor) %>%
gather("location", "team", -Season) %>%
#remove rows with missing teams
filter(!is.na(team)) %>%
#get the number of fa cup matches per season per team
group_by(team, Season) %>%
summarise(matches = n()) %>%
#assume you need at least 3 matches for a 'run'
filter(matches > 3) %>%
ungroup() %>%
arrange(Season) %>%
mutate(id = 1:n())
long_campaign_matches <- matches %>%
select(., Season, Date, round,
neutral, Venue,
team = home, opponent = visitor) %>%
mutate(location = "home") %>%
#bind in the opposite data for away teams
bind_rows(., mutate(select(., Season, Date, round,
neutral, Venue,
team = opponent, opponent = team),
location = "away")) %>%
filter(!is.na(team)) %>%
#left join the data for long campaigns
left_join(., long_campaigns, by = c("Season", "team")) %>%
#remove nonlong campaigns
filter(!is.na(matches)) %>%
#get the location of the match
mutate(location = case_when(
neutral == "yes" ~ Venue,
location == "home" ~ team,
location == "away" ~ opponent,
)) %>%
#left join in the location for the corresponding ground
left_join(grounds, c("location")) %>%
#select columns
select(season = Season, date = Date, round,
neutral, team, opponent,
location, matches, id, geometry)
#print
head(long_campaign_matches)</code></pre>
<pre><code>## # A tibble: 6 x 10
## season date round neutral team opponent location matches id
## <dbl> <date> <chr> <chr> <chr> <chr> <chr> <int> <int>
## 1 1871 1871-12-16 2 <NA> Crys~ Maidenh~ Crystal~ 5 1
## 2 1871 1871-12-16 2 <NA> Wand~ Clapham~ Wandere~ 4 3
## 3 1871 1872-01-20 3 <NA> Wand~ Crystal~ Wandere~ 4 3
## 4 1871 1872-01-27 3 <NA> Roya~ Hampste~ Royal E~ 5 2
## 5 1871 1872-02-17 s yes Crys~ Royal E~ Kenning~ 5 1
## 6 1871 1872-03-09 s yes Roya~ Crystal~ Kenning~ 5 2
## # ... with 1 more variable: geometry <POINT [°]></code></pre>
<p>We can then find the routing distance using a nice trick I found in <a href="https://github.com/r-spatial/sf/issues/799">an sf GitHub issue</a> to find the distance between each location and the next in the data.frame.</p>
<p>Finally, this is grouped by id and summed to get the total distance traveled in that cup campaign (when judging by the ‘routing’ metric).</p>
<pre class="r"><code>#taken from
#https://github.com/r-spatial/sf/issues/799
#init an empty sfc
empty <- st_as_sfc("POINT(EMPTY)")
routing_distances <- long_campaign_matches %>%
arrange(id, date) %>%
filter(!is.na(date)) %>%
group_by(id) %>%
#find the distance from one game to the next
mutate(
distance_to_next = sf::st_distance(
geometry,
lag(geometry, default = empty),
by_element = TRUE)
)
#sum the distances
grouped_routing_distances <- routing_distances %>%
summarise(travel_distance = sum(distance_to_next, na.rm = TRUE)) %>%
merge(long_campaigns, by = "id") %>%
#convert to km
mutate(total_distance = travel_distance / 1000) %>%
select(id, team, matches, season = Season, total_distance) %>%
mutate(av_distance = total_distance / matches) %>%
arrange(total_distance)
head(grouped_routing_distances)</code></pre>
<pre><code>## id team matches season total_distance av_distance
## 1 111 Queens Park 8 1884 2684.363 335.5454
## 2 668 Exeter City 8 1930 2362.073 295.2592
## 3 293 Newcastle United 8 1904 2316.524 289.5655
## 4 512 Newcastle United 9 1923 2112.232 234.6924
## 5 1182 Newcastle United 10 1954 2017.824 201.7824
## 6 1090 Gateshead 8 1951 2015.698 251.9623</code></pre>
<p>By functionalising our code from earlier, we can easily plot these well-traveled runs. Using the recently-added-to-CRAN <a href="https://github.com/thomasp85/patchwork">patchwork</a> package, we can make multiple plots and stitch them together, e.g. for the 6 FA Cup runs with the longest distances traveled we get:</p>
<pre class="r"><code>#functionalise our code from earlier to plot travel routes easier
plot_travel_lines <- function(run_team_year, plot_type) {
run_matches <- long_campaign_matches %>%
mutate(id = paste(team, season)) %>%
filter(id == run_team_year) %>%
arrange(date)
coords <- st_coordinates(run_matches$geometry) %>%
split(f = rownames(.))
if(plot_type == "travel") {
lines <- mapply(coords, lead(coords), FUN = function(x, y) {
if(is.na(y)) {
return(NULL)
} else {
mat <- rbind(x, y)
line <- st_linestring(mat)
return(line)
}
}) %>%
.[-which(lengths(.) == 0)] %>%
st_multilinestring() %>%
st_sfc(crs = st_crs("+init=epsg:4326"))
} else if(plot_type == "spokes") {
home <- grounds %>%
filter(location == unique(run_matches$team)) %>%
st_coordinates()
lines <- lapply(coords, function(x) {
mat <- rbind(x, home)
line <- st_linestring(mat)
return(line)
}) %>%
#cast to multiline and add projection
st_multilinestring() %>%
st_sfc(crs = st_crs("+init=epsg:4326"))
}
plot <- ggplot() +
geom_sf(data = uk) +
geom_sf(data = lines, colour = "blue", size = 1.5) +
labs(title = paste(run_team_year)) +
theme_minimal()
return(plot)
}
#plot the top six
library(patchwork)
paste(grouped_routing_distances$team[1:6],
grouped_routing_distances$season[1:6]) %>%
lapply(., plot_travel_lines, plot_type = "travel") %>%
wrap_plots(.)</code></pre>
<p><img src="/post/2019-12-01-advent_calendar_knowledge_files/figure-html/plot_run_travel-1.png" width="672" /></p>
<p>We then need to check this against our other method of evaluating distances in a cup run: the ‘spoking’ method. This is much easier to calculate; all we have to do is left_join() in the location for each team’s home ground, and find the distance between this and the match location.</p>
<p>Then we simply sum the total distances per campaign and plot the longest of these:</p>
<pre class="r"><code>spoke_distances <- long_campaign_matches %>%
#left join in location data for each team
left_join(grounds, by = c("team" = "location")) %>%
#calculate distance between each teams home ground and the match location
mutate(distance = st_distance(geometry.x, geometry.y, by_element = TRUE))
#group by and sum the cup run distances
grouped_spoke_distances <- spoke_distances %>%
group_by(team, season) %>%
summarise(total_distance = sum(distance/1000, na.rm = TRUE),
av_distance = mean(distance/1000, na.rm = TRUE)) %>%
arrange(total_distance)
#print
head(grouped_spoke_distances)</code></pre>
<pre><code>## # A tibble: 6 x 4
## # Groups: team [3]
## team season total_distance av_distance
## <chr> <dbl> [m] [m]
## 1 Queens Park 1883 2150.504 307.2149
## 2 Newcastle United 1923 1974.430 219.3811
## 3 Newcastle United 1951 1957.196 279.5994
## 4 Fulham 1974 1810.676 150.8897
## 5 Newcastle United 1973 1781.279 178.1279
## 6 Queens Park 1884 1702.166 170.2166</code></pre>
<pre class="r"><code>#plot
paste(grouped_spoke_distances$team[1:6],
grouped_spoke_distances$season[1:6]) %>%
lapply(., plot_travel_lines, plot_type = "spokes") %>%
wrap_plots(.)</code></pre>
<p><img src="/post/2019-12-01-advent_calendar_knowledge_files/figure-html/find_spoke_distances-1.png" width="672" /></p>
</div>
</div>
<div id="thdecemberthistownaintbigenoughforaleaguefootballteam" class="section level1">
<h1>10th December - This Town Ain’t Big Enough For a League Football Team</h1>
<p><a href="https://www.theguardian.com/football/2003/may/29/theknowledge.sport">“What’s the largest town/city without a League club? I reckon Maidstone takes some beating (population 139,000 - about the same as Blackburn). Unlike Blackburn Rovers, Maidstone United play in the Kent League (of course Blackburn would not actually be eligible) which is some way below the Conference and Dr Martens Leagues. But being Maidstone United of course they play all their fixtures 12 miles away in Sittingbourne,” writes Peter Driver, in 2003</a></p>
<div id="answer" class="section level2">
<h2>Answer </h2>
<p>To answer this question, first we need data on towns and cities in England. We’re going to rank by population, so we can download the population table found <a href="http://lovemytown.co.uk/populations/TownsTable1.asp">here</a> to start with.</p>
<pre class="r"><code>#scrape data on town/city populations in UK
pops <- "http://lovemytown.co.uk/populations/TownsTable1.asp" %>%
read_html() %>%
html_nodes("#mainContent > table:nth-child(3)") %>%
html_table(fill = TRUE, header = TRUE) %>%
as.data.frame() %>%
#some munging to match datasets later
mutate(tcity15nm = case_when(
grepl("^St\\. ", Town) ~ gsub("^St\\. ", "St ", Town),
grepl("^Hull$", Town) ~ "Kingston upon Hull",
grepl("^Burton$", Town) ~ "Burton upon Trent",
grepl("^Newcastle$", Town) ~ "Newcastle upon Tyne",
grepl("^Southend$", Town) ~ "Southend-on-Sea",
grepl("^Stoke$", Town) ~ "Stoke-on-Trent",
TRUE ~ Town
)) %>%
#convert population to numeric
mutate(population = as.numeric(gsub(",", "", Population))) %>%
select(tcity15nm, population, status = Status)</code></pre>
<p>We’re then going to want the geographic data on these towns. The UK government provides shapefiles for the outlines of ‘Major Towns and Cities’ from a few years ago which should be sufficient for the question. They’re provided as geoJSON files so I’m going to use the geojsonsf package to load them straight as sf objects.</p>
<p>To reproduce this script, you’ll need to download the data from the <a href="https://data.gov.uk/dataset/7879ab82-2863-401e-8a29-a56e264d2182/major-towns-and-cities-december-2015-boundaries">UK government archives</a> and point the file object towards it.</p>
<pre class="r"><code>#to read in geojson data as an sf file
library(geojsonsf)
#download the shapefile from
#https://data.gov.uk/dataset/7879ab82-2863-401e-8a29-a56e264d2182/major-towns-and-cities-december-2015-boundaries
file <- "path/to/downloaded/file.geojson"
#load data
towns <- geojson_sf(file) %>%
left_join(., pops, by = "tcity15nm") %>%
st_transform(st_crs(27700)) %>%
#buff the town shapefiles by 2.5km to catch all clubs within
#reasonable distance of the town
st_buffer(., 2500)</code></pre>
<p>Then we want the club data. In the latest release of <a href="https://github.com/jalapic/engsoccerdata/pull/61">engsoccerdata</a> I added some (very beta) non-league data so we have a greater number of teams to pick from. We’ll take the league each team played in in 2018, as the non-league data hasn’t been updated to 2019 (the current season) yet.</p>
<pre class="r"><code>#get all league and nonleague clubs
#non league clubs in new release of engsoccerdata
clubs <- rbind(
select(england, home, Season, tier),
select(england_nonleague, home, Season, tier)) %>%
#no 2019 data for non league yet
filter(Season == 2018) %>%
select(home, tier) %>%
unique()
#get the locations of each of these clubs in
club_locations <- england_club_data %>%
st_as_sf(coords = c("lon", "lat"), crs = st_crs(4326)) %>%
#on uk grid projection
st_transform(st_crs(27700)) %>%
left_join(., clubs, by = c("team" = "home")) %>%
select(team, tier) %>%
filter(!is.na(tier))
#plot these clubs over major towns in uk
p_town_clubs <- ggplot() +
geom_sf(data = uk) +
geom_sf(data = towns, fill = "red") +
geom_sf(data = club_locations, alpha = 0.15, colour = "blue") +
labs(title = "location of English football teams relative to major towns") +
theme_minimal()
p_town_clubs</code></pre>
<p><img src="/post/2019-12-01-advent_calendar_knowledge_files/figure-html/towns_get_clubs-1.png" width="672" /></p>
<p>We can then run a very quick function to find the clubs that are located within each town using sf::st_contains and arrange by our parameters to answer the question!</p>
<pre class="r"><code>#function to find which towns contain clubs
town_data < st_contains(towns, club_locations) %>%
map_df(., function(x) {
n_clubs < length(x)
if(n_clubs == 0) {
max_tier <- NA
tiers <- NA
} else {
#get the tiers of the english footballing pyramid that clubs play in
tiers <- I(list(club_locations$tier[x]))
max_tier <- min(club_locations$tier[x])
}
return(data.frame(n_clubs,
max_tier,
tiers))
}) %>%
#bind to the town data
bind_cols(towns, .) %>%
select(town = tcity15nm, pop = population, n_clubs, max_tier) %>%
#arrange to answer question
arrange(n_clubs, max_tier, pop)
#get rid of unnecessary geometry
st_geometry(town_data) <- NULL
#print answer
head(town_data)</code></pre>
<pre><code>## town pop n_clubs max_tier
## 1 Gloucester 136362 0 NA
## 2 Basingstoke 107355 0 NA
## 3 Worcester 100153 0 NA
## 4 Stockton-on-Tees 82729 0 NA
## 5 Guildford 77057 0 NA
## 6 Sutton Coldfield 109015 1 8</code></pre>
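<p>The aggregation step inside that map_df() is worth a quick sanity check. st_contains() returns, for each town, the row indices of the clubs inside it; the rest is just counting indices and taking a minimum tier. A minimal sketch with a hand-made containment list (the names and toy data here are mine, not from the post; the real indices come from sf::st_contains):</p>

```r
#toy tiers for four hypothetical clubs
tiers_by_club <- c(5, 2, 8, 7)
#hand-made st_contains-style result: clubs inside each of 3 towns
containment <- list(integer(0), c(2L, 4L), 3L)

#same logic as the map_df() body above: count clubs, take the best tier
summarise_town <- function(idx) {
  if (length(idx) == 0) {
    c(n_clubs = 0, max_tier = NA)
  } else {
    c(n_clubs = length(idx), max_tier = min(tiers_by_club[idx]))
  }
}
lapply(containment, summarise_town)
```

<p>Town 1 has no clubs (so an NA tier), town 2 gets max_tier 2 (its best club), and town 3 only has an 8th-tier side, mirroring how Sutton Coldfield appears in the table above.</p>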
<p>So it appears Gloucester, Worcester and Basingstoke are the largest towns without a football club in their city limits. We can double check this using a quick grep for the clubs (it’s possible this could miss some clubs, but it is probably accurate enough).</p>
<pre class="r"><code>empty_towns <- c("Gloucester", "Basingstoke", "Worcester", "Stockton", "Guildford")
lapply(empty_towns, grep, x = clubs$home) %>%
unlist() %>%
clubs$home[.]</code></pre>
<pre><code>## [1] "Gloucester City" "Basingstoke Town"</code></pre>
<p>So we can see that Gloucester and Basingstoke <em>do</em> in fact have football teams, however a quick Wikipedia search shows that <a href="https://en.wikipedia.org/wiki/Gloucester_City_A.F.C.">they</a> <a href="https://en.wikipedia.org/wiki/Basingstoke_Town_F.C.">both</a> play outside their town (so I’m not sure if these count). The largest town* without a football team, down to the 8th tier of English football, is therefore Worcester with 100,000 people, but a team only in the <a href="https://en.wikipedia.org/wiki/Worcester_City_F.C.">9th tier</a> of the football pyramid.</p>
<p>The question actually does specify <em>‘League’</em> teams, which generally only refers to the top 4 flights of English football. We can then run the function for each tier, finding the largest town without a team in that tier or above.</p>
<pre class="r"><code>#find the largest town without a club above tiers 5:8
by_tier <- lapply(4:7, function(x) {
data <- town_data %>%
filter(!is.na(max_tier) & max_tier > x) %>%
arrange(pop)
})
#print by max tier
lapply(by_tier, head, n = 5)</code></pre>
<pre><code>## [[1]]
## town pop n_clubs max_tier
## 1 Coventry 325949 1 7
## 2 Warrington 165456 1 7
## 3 Slough 155298 2 5
## 4 Poole 154718 2 7
## 5 York 152841 1 6
##
## [[2]]
## town pop n_clubs max_tier
## 1 Coventry 325949 1 7
## 2 Warrington 165456 1 7
## 3 Poole 154718 2 7
## 4 York 152841 1 6
## 5 Telford 142723 1 6
##
## [[3]]
## town pop n_clubs max_tier
## 1 Coventry 325949 1 7
## 2 Warrington 165456 1 7
## 3 Poole 154718 2 7
## 4 Worthing 109120 1 7
## 5 Sutton Coldfield 109015 1 8
##
## [[4]]
## town pop n_clubs max_tier
## 1 Sutton Coldfield 109015 1 8
## 2 Basildon 107123 2 8
## 3 St Helens 102885 1 8
## 4 Wakefield 99251 1 8
## 5 Hastings 91053 1 8</code></pre>
<p>And the City of Coventry takes it, having a population of 325,000 and a team only in the 7th tier (<a href="https://en.wikipedia.org/wiki/Bedworth_United_F.C.">Bedworth United, who play just outside the city</a>). This is only because the city’s main team <a href="https://en.wikipedia.org/wiki/Coventry_City_F.C.">Coventry City</a> are playing in Birmingham due to <a href="https://en.wikipedia.org/wiki/Coventry_City_F.C.#St_Andrew's">ongoing difficulties finding a stadium within their own city</a>.</p>
<p>I think in the spirit of the question, the true answer is one of Warrington (165k), Poole (154k), York (152k), or Telford (142k), of which York is probably the most major and independent as a town/city.</p>
<p>Finally, the article in which the question was shared posted the follow-up: what is the smallest town to hold a league club?</p>
<p>We can answer this using the directory of populated places produced by the <a href="https://github.com/alasdairrae/gbplaces">Ordnance Survey</a>. Again, download the .csv and point the script at it to reproduce.</p>
<pre class="r"><code>#load the .csv of small places in the uk
small_places <- read.csv("path/to.csv", stringsAsFactors = FALSE) %>%
#only interested in England
filter(COUNTRY == "England") %>%
select(name = NAME1, x = X, y = Y) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(27700)) %>%
#find small places outside of the large towns
.[-unique(unlist(st_contains(towns, .))),] %>%
#buffer the points by 100m
st_buffer(., 100)
<pre class="r"><code>#plot these for a cool map
plot(select(small_places))</code></pre>
<p><img src="/post/2019-12-01-advent_calendar_knowledge_files/figure-html/plot_small_places-1.png" width="672" /></p>
<p>It’s then simple to find the small places that hold football clubs using sf::st_contains and indexing.</p>
<pre class="r"><code>#find clubs that are in these remaining places
small_town_clubs <- st_contains(small_places, filter(club_locations, tier < 5))
small_places[which(lengths(small_town_clubs) > 0),]</code></pre>
<pre><code>## Simple feature collection with 6 features and 1 field
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 351959.9 ymin: 116300 xmax: 391793 ymax: 430665
## epsg (SRID): 27700
## proj4string: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
## name geometry
## 3454 Laneside POLYGON ((376361 429724, 37...
## 3474 Enfield POLYGON ((375759 430165, 37...
## 8843 Moss Lane POLYGON ((391793 371719, 39...
## 17646 Houndstone POLYGON ((353128 117093.1, ...
## 17691 Lufton POLYGON ((352959.9 116800, ...
## 19419 Newmarket POLYGON ((384471.3 199698.6...</code></pre>
<p>Enfield and Moss Lane are in London and Manchester, so don’t count. Houndstone and Lufton are actually in the same town (Yeovil) but still are valid answers.</p>
<p>Laneside is located just outside Accrington, where <a href="https://en.wikipedia.org/wiki/Accrington_Stanley_F.C.">Accrington Stanley</a> play in the third tier (population 35,000).</p>
<p>Yeovil (population 45,000) is home to <a href="https://en.wikipedia.org/wiki/Yeovil_Town_F.C.">Yeovil Town</a> who play in the 4th tier.</p>
<p>But the far and away winner is 4th tier <a href="https://en.wikipedia.org/wiki/Forest_Green_Rovers_F.C.">Forest Green Rovers</a>, who play in Nailsworth, with a population of just 5,700 people. What’s more, the only remotely near town, Stroud, has a population of just 32,000 people. Most incredible of all, <a href="https://en.wikipedia.org/wiki/The_New_Lawn">the stadium for Forest Green Rovers</a> can hold ~5,000 people, or almost the entire surrounding population.</p>
</div>
</div>

Rinforcement Learning Part One - Tic Tac Toe
/post/rinforcement_learning_one/
Thu, 28 Nov 2019 00:00:00 +0000
/post/rinforcement_learning_one/
<p>I’m extremely biased, but to me, one of the real success* stories in neuroscience over the last (just over) two decades has been in studying reward signals. Since the <a href="https://www.ncbi.nlm.nih.gov/pubmed/9054347">seminal 1997 paper</a>, a lot of work has gone into figuring out how the brain assigns value to outcomes.</p>
<p>*ugh, maybe. This isn’t a blog post about that</p>
<p>My PhD project looks at novel ways of eliciting valuation behaviour to study these signals, but as a key part of the modelling involved in this work, it’s important to get a handle on <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">reinforcement learning</a>. When originally working through the <a href="http://incompleteideas.net/book/the-book.html">Sutton and Barto book</a>, I threw together some code a few years ago for the problem sets, mostly in python and MATLAB. As someone who runs a blog nominally about coding in R, however, I thought there might be some value in going through said code and refactoring it into R. As R can struggle with the speed necessary for reinforcement learning (which typically relies on large numbers of iterated behaviours), it also provided a good chance to crack out some C++ code using the always excellent <a href="http://adv-r.had.co.nz/Rcpp.html">Rcpp package</a>, which is always worth practicing.</p>
<p>In this first example of Reinforcement Learning in R (and C++), we’re going to train our computers to play Noughts and Crosses (or Tic Tac Toe for Americans) to an at-least/super-human level.</p>
<p>Let’s get started with the libraries we’ll need. I want to stick to base R for speed here, as well as (obviously) Rcpp. In theory you can easily generalise all the code here to any size board, but I have only tested it with 3x3 boards so YMMV.</p>
<pre class="r"><code>#will use ggplot
#everything else I've used base or listed packages
library(ggplot2)
#Rcpp for some vroom vroom
library(Rcpp)
#in theory this stuff should work for boards of any size
#but I haven't tested that
board_cols = 3
board_rows = 3
squares <- board_cols * board_rows</code></pre>
<p>The very first thing we’ll want to do is find a way to store the information in a game state and convert between this, and a human readable form.</p>
<pre class="r"><code>#function to plot boards in a human readable way
#not generalised to all board sizes but easy enough to
plot_board <- function(string) {
pieced <- rep("", length(string))
pieced[which(string == 1)] <- "x"
pieced[which(string == -1)] <- "o"
pieced[which(string == 0)] <- "*"
board <- gsub(" \\|$", "", paste(pieced, "|", collapse = " "))
board_lines <- gsub("(. \\| . \\| . )\\|( . \\| . \\| . )\\|( . \\| . \\| .)",
"\n \\1\n---------\n\\2\n---------\n\\3",
board
)
return(writeLines(board_lines))
}</code></pre>
<p>Next we’re going to want to find every possible state we might encounter so we can test for any exceptions. I’m storing strings as a list of 9 0s (unused), 1s (Xs) and -1s (Os) representing the squares 1-9 from the top left corner.</p>
<p>It’s simple and fast enough to do this with a quick R function</p>
<pre class="r"><code>#get all possible boards
possible_boards <- gtools::permutations(
board_cols, squares,
c(-1,0,1),
repeats.allowed = TRUE
)
#can only have a sum of 1 or 0
possible_boards <- possible_boards[which(rowSums(possible_boards) %in% c(0,1)),]
#plot a random example
plot_board(c(1,0,0,-1,0,0,0,0,1))</code></pre>
<pre><code>##
## x | * | *
## ---------
## o | * | *
## ---------
## * | * | x</code></pre>
<p>Now we have the representations of any possible board, we want to find a way to store this in a more compressed format as a hash. I originally wrote a pretty quick function to do this in R and then threw up a quick one underneath compiled in Rcpp for comparison.</p>
<pre class="r"><code>#get a unique hash for each board
calc_hash <- function(board) {
hash <- 0
for(piece in seq(squares)) {
hash <- (hash*board_cols) + board[piece] + 1
}
return(hash)
}
#and the equivalent in Cpp
cppFunction('int calc_hashCpp(NumericVector board, int squaresize) {
//need to init vals in C++
int hash = 0;
int boardsize = squaresize * squaresize;
//C++ for loops have start, end, and by
for (int i=0; i <= boardsize - 1; ++i) {
hash = (hash * squaresize) + board[i] + 1;
}
//always have to declare a return
return hash;
}')
#get a list of all the possible hashes
hashes <- lapply(purrr::array_tree(possible_boards, margin = 1),
calc_hashCpp, squaresize = 3)
#should all be unique
which(duplicated(hashes))</code></pre>
<pre><code>## integer(0)</code></pre>
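<p>As an aside, the uniqueness falls out of the construction: the hash is just the board read as a 9-digit base-3 number (each square’s value plus one is a digit), so it can also be inverted by peeling the digits back off. A minimal sketch, where unhash_board is a hypothetical helper of my own (not part of the original code):</p>

```r
#self-contained copy of the R hash for this example
calc_hash <- function(board, base = 3) {
  hash <- 0
  for (piece in seq_along(board)) {
    hash <- (hash * base) + board[piece] + 1
  }
  hash
}

#hypothetical inverse: peel base-3 digits off, last square first
unhash_board <- function(hash, squares = 9, base = 3) {
  board <- integer(squares)
  for (piece in rev(seq_len(squares))) {
    board[piece] <- (hash %% base) - 1
    hash <- hash %/% base
  }
  board
}

board <- c(1, 0, 0, -1, 0, 0, 0, 0, 1)
all(unhash_board(calc_hash(board)) == board)
```

<p>The round trip holding for every board is another way of seeing why no two boards can share a hash.</p>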
<p>In order to play noughts and crosses, we then need some way for a game to end. An easy way to check this is when our board string (0s, 1s, and -1s) adds up to 3/-3 along any row, column or diagonal.</p>
<pre class="r"><code>#first we need a function to check when a game has been won
cppFunction('int check_winnerCpp(NumericVector board) {
int winner = 0;
int vec_length = board.size();
int square_size = sqrt(vec_length);
//check rows and columns for a winner
for (int i=0; i <= square_size - 1; ++i) {
//check row i
NumericVector row_squares = NumericVector::create(0,1,2);
row_squares = row_squares + (square_size * i);
NumericVector row_elements = board[row_squares];
int row_sum = sum(row_elements);
if(abs(row_sum) == square_size) {
if(row_sum > 0) {
winner = 1;
} else {
winner = -1;
}
}
//check col i
NumericVector col_squares = NumericVector::create(0,3,6);
col_squares = col_squares + i;
NumericVector col_elements = board[col_squares];
int col_sum = sum(col_elements);
if(abs(col_sum) == square_size) {
if(col_sum > 0) {
winner = 1;
} else {
winner = -1;
}
}
}
//check the diagonals
NumericVector rising_diag_squares = NumericVector::create();
NumericVector falling_diag_squares = NumericVector::create();
for (int i=0; i <= square_size - 1; ++i) {
int rising_diag_square = (square_size * i) + i;
rising_diag_squares.push_back(rising_diag_square);
int falling_diag_square = (square_size - 1) * (i+1);
falling_diag_squares.push_back(falling_diag_square);
}
NumericVector rising_diag_elements = board[rising_diag_squares];
NumericVector falling_diag_elements = board[falling_diag_squares];
int rising_sum = sum(rising_diag_elements);
int falling_sum = sum(falling_diag_elements);
if(abs(falling_sum) == square_size) {
if(falling_sum > 0) {
winner = 1;
} else {
winner = -1;
}
}
if(abs(rising_sum) == square_size) {
if(rising_sum > 0) {
winner = 1;
} else {
winner = -1;
}
}
//return the winner
//0 for no winner, 999 for draw
return winner;
}')</code></pre>
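<p>The sum logic is easy to spot-check back in base R: with the board as a row-wise 3x3 matrix, any line summing to 3 means X has won and any line summing to -3 means O has. This is an illustrative sketch of my own, not a replacement for the compiled version above:</p>

```r
#plain-R spot check of the +/-3 line-sum idea
#boards are stored row-wise from the top-left, hence byrow = TRUE
check_winner_r <- function(board) {
  m <- matrix(board, nrow = 3, byrow = TRUE)
  sums <- c(rowSums(m), colSums(m), sum(diag(m)), sum(diag(m[, 3:1])))
  if (any(sums == 3)) 1 else if (any(sums == -3)) -1 else 0
}

check_winner_r(c(1, 1, 1, -1, -1, 0, 0, 0, 0))  # x completes the top row
```

<p>This checks all 8 lines (3 rows, 3 columns, 2 diagonals) in one vector of sums, which is exactly what the C++ loops above build up index by index.</p>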
<p>We can then apply this function to every possible board and find the ones that indicate a winning state. We also init a data frame containing all possible boards, their hash, and their ‘value’ (0 for all for now, more on this later). Finally, I plot the first one in this set just because why not?</p>
<pre class="r"><code>#find which boards are winning positions
winning <- purrr::map(purrr::array_tree(possible_boards, margin = 1), check_winnerCpp)
#going to create a df to store the values of all moves
moves_df <- data.frame(hash = unlist(hashes),
value = 0,
winning = unlist(winning))
#store all boards as a list
#purrr::array_tree is a really nice way to convert matrix to lists
moves_df$board = purrr::array_tree(possible_boards, margin = 1)
#plot the first board just why not
plot_board(unlist(moves_df$board[1]))</code></pre>
<pre><code>##
## o | o | o
## ---------
## o | * | x
## ---------
## x | x | x</code></pre>
<p>As we can see, we still have some impossible boards here. This particular board will never occur in actual play because X wins before O can make a move to complete the top row. It doesn’t matter, but it’s useful to keep in mind for a plot later.</p>
<p>We then need a function telling the computer how to make a move. For this post we’re going to use what’s called ‘ε (epsilon)-greedy’ selection. A computer has a parameter epsilon such that</p>
<p><span class="math display">\[\begin{cases}
v &\text{if } \epsilon \leq \rho\\
V_{max} &\text{if } \epsilon > \rho\\
\end{cases} \]</span></p>
<p>if epsilon is greater than a random number rho, the computer makes the most valuable choice possible. It chooses whatever it thinks (rightly or wrongly) will lead to the best outcome. This is called <em>exploitation</em>. If epsilon is less than or equal to rho, the computer randomly chooses any possible action v. This is known as <em>exploration</em> to test any possibly rewarding but unvalued paths.</p>
<p>(I may have gotten epsilon the wrong way round here. It really doesn’t matter at all.)</p>
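<p>The same selection rule can be sketched in a few lines of plain R (an illustrative re-implementation of my own, with my own function name; the version the post actually uses is the compiled C++ one):</p>

```r
#illustrative epsilon-greedy selection in base R
choose_move_r <- function(epsilon, values) {
  if (epsilon > runif(1)) {
    #exploit: pick one of the highest-valued moves, breaking ties at random
    best <- which(values == max(values))
    if (length(best) > 1) best <- sample(best, 1)
    best
  } else {
    #explore: pick any move uniformly at random
    sample(seq_along(values), 1)
  }
}

choose_move_r(1, c(0, 0.5, 0.2))  # epsilon = 1 always exploits
```

<p>Note the length check before sample(): in R, sample(5, 1) draws from 1:5 rather than returning 5, so ties have to be broken explicitly.</p>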
<p>Let’s implement this in C++</p>
<pre class="r"><code>cppFunction('int choose_moveCpp(NumericVector epsilon, NumericVector values) {
//random number to decide if computer should explore or exploit
NumericVector random_number = runif(1);
int move_choice = 0;
NumericVector choices = NumericVector::create();
//exploit the best move
if(epsilon[0] > random_number[0]) {
double max = Rcpp::max(values);
std::vector< int > res;
int i;
for(i = 0; i < values.size(); ++i) {
if(values[i] == max) {
res.push_back(i);
}
}
IntegerVector max_indexes(res.begin(), res.end());
if(max_indexes.size() > 1) {
std::random_shuffle(max_indexes.begin(), max_indexes.end());
move_choice = max_indexes[0] + 1;
} else {
move_choice = max_indexes[0] + 1;
}
//explore all moves randomly
} else {
int potential_choices = values.size();
choices = seq(1, potential_choices);
std::random_shuffle(choices.begin(), choices.end());
move_choice = choices[0];
}
return move_choice;
}')</code></pre>
<p>We also want a little helper func to find all the possible hashes so we can look up which moves a computer can make before choosing between them.</p>
<pre class="r"><code>#find all possible next moves
get_next_hashes <- function(board, piece) {
unused <- which(board == 0)
next_boards <- lapply(unused, function(x, piece) {
board[x] <- piece
return(board)
}, piece = piece)
#get the hashes of the next boards
hashes <- lapply(next_boards, calc_hashCpp, squaresize = 3)
}</code></pre>
<p>Finally, we need to reward the computer for making good actions, and punish it for making bad ones. We’ll do this using Temporal Difference (TD) error learning.</p>
<p>The computer looks at how good an end point was (for noughts and crosses this can be a win, lose, or draw) and then decides if that outcome is better or worse than it was expecting. It then reevaluates its beliefs about the choices it made to lead to that end state. It can be formulated as</p>
<p><span class="math display">\[V_{state} = V_{state} + TD error \cdot scalar \]</span>
the scalar here is the <em>learning rate</em> of the computer. Do we want it to forget everything it knew about the world seconds earlier and take only the most recent information (1), or update its beliefs very slowly (~0)? We’ll refer to this as lr in subsequent equations.</p>
<p>The TD error itself is calculated as</p>
<p><span class="math display">\[TD error = (\gamma \cdot reward - V_{state}) \]</span>
Where gamma acts to make sure we don’t overfit too far back into the past. It reduces the reward as you go further back and is set between 0 and 1. The reward here will (e.g.) be 1 if the computer has just won with its latest move, otherwise it will be the value of the state the computer might move into.</p>
<p>Putting these together we get</p>
<p><span class="math display">\[ V_{state} = V_{state} + lr \cdot (\gamma \cdot V_{state+1} - V_{state}) \]</span></p>
<p>Let’s implement this using Rcpp</p>
<pre class="r"><code>#function to feed reward back to the agent based on results
cppFunction('NumericVector backfeed_rewardCpp(NumericVector values, double reward, double learning_rate, double gamma) {
int states = values.size();
NumericVector new_values = NumericVector::create();
//go from last state backwards
for( int state = states - 1; state >= 0; --state) {
double new_value = values[state] + learning_rate * ((gamma * reward) - values[state]);
new_values.push_back(new_value);
//recurse the reward
reward = new_value;
}
return new_values;
}')</code></pre>
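<p>To see the numbers this produces, the same backward pass can be traced in plain R (an illustrative re-implementation of my own, using made-up values):</p>

```r
#plain-R trace of the backward TD update
backfeed_reward_r <- function(values, reward, learning_rate, gamma) {
  new_values <- numeric(0)
  #walk from the final state backwards
  for (state in rev(seq_along(values))) {
    new_value <- values[state] + learning_rate * (gamma * reward - values[state])
    new_values <- c(new_values, new_value)
    #the freshly updated value becomes the target for the preceding state
    reward <- new_value
  }
  new_values
}

#three unvalued states leading to a win (reward = 1)
backfeed_reward_r(c(0, 0, 0), reward = 1, learning_rate = 0.5, gamma = 0.9)
```

<p>With lr = 0.5 and gamma = 0.9 this gives 0.45 for the final state, then 0.2025, then 0.091125: the credit for the win decays geometrically the further back a move sits, which is exactly the discounting gamma is there to provide.</p>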
<p>Now we can start actually playing games! I wrote out a long function in R to play through the various bits. It surely could be refactored a little more concisely but it works for now and I was getting tired by this point.</p>
<p>We first add two functions (one to make moves/play the game, and one to update the values using the formula above) then put it all into one uber-function.</p>
<pre class="r"><code>#function to choose and implement computer moves
computer_move <- function(piece, board, epsilon) {
#get potential moves
potential_move_hashes <- get_next_hashes(board, piece)
#get the values of the potential moves
potential_move_vals <- moves_df$value[
unlist(lapply(potential_move_hashes, function(x) which(moves_df$hash == x)))]
#choose move based on rewards
player_move <- choose_moveCpp(epsilon, potential_move_vals)
#update the board with the new move
updated_board <- unlist(moves_df$board[
moves_df$hash == unlist(potential_move_hashes)[player_move]])
return(updated_board)
}
#function to get the values for each state based on the reward
update_move_vals <- function(player1_reward, player2_reward,
player1_hashes, player2_hashes,
learning_rate, gamma) {
player1_newvals <- backfeed_rewardCpp(moves_df$value[
unlist(lapply(player1_hashes, function(x) which(moves_df$hash == x)))],
player1_reward, learning_rate, gamma)
player2_newvals <- backfeed_rewardCpp(moves_df$value[
unlist(lapply(player2_hashes, function(x) which(moves_df$hash == x)))],
player2_reward, learning_rate, gamma)
new_vals <- list(player1_newvals, player2_newvals)
return(new_vals)
}
#function to get two computers to play each other
play_game_computers <- function(player1_epsilon,
player2_epsilon,
learning_rate, gamma) {
#init board
board <- rep(0, squares)
winner <- 0
moves <- 0
#init hash storage
player1_hashes <- c()
player2_hashes <- c()
#keep moving until game is over
while(winner == 0 & moves < 9) {
#iterate moves
moves <- moves + 1
#player 1 moves
board <- computer_move(1, board, player1_epsilon)
player1_hashes <- append(calc_hashCpp(board, board_cols), player1_hashes)
winner <- check_winnerCpp(board)
#same for player 2
if(winner == 0 & moves < 9) {
moves <- moves + 1
board <- computer_move(-1, board, player2_epsilon)
player2_hashes <- append(calc_hashCpp(board, board_cols), player2_hashes)
winner <- check_winnerCpp(board)
}
}
#update policies
if(winner == 1) {
message <- "x wins!"
new_vals <- update_move_vals(1, 0, player1_hashes, player2_hashes,
learning_rate, gamma)
} else if(winner == -1) {
message <- "o wins!"
new_vals <- update_move_vals(0, 1, player1_hashes, player2_hashes,
learning_rate, gamma)
} else {
message <- "draw!"
new_vals <- update_move_vals(0.1, 0.5, player1_hashes, player2_hashes, learning_rate, gamma)
}
#push the values back into the dictionary data frame
moves_df$value[unlist(lapply(player1_hashes, function(x) which(moves_df$hash == x)))] <<- new_vals[[1]]
moves_df$value[unlist(lapply(player2_hashes, function(x) which(moves_df$hash == x)))] <<- new_vals[[2]]
return(message)
}</code></pre>
<p>So that the computer can learn the value of moves, we first want to run this in a training epoch. We’ll get the computer to play 100,000 games against itself with an epsilon below 1 so that it explores the game state and learns by reinforcement. We’ll then plot the values it has learned for all moves based upon whether they are winning or not.</p>
<pre class="r"><code>#train on 100,000 games with a little randomness thrown in
train <- purrr::rerun(100000, play_game_computers(0.8, 0.8, 0.35, 0.9))
#test how fast our function is
microbenchmark::microbenchmark(play_game_computers(0.8, 0.8, 0.35, 0.9), times = 1000)</code></pre>
<pre><code>## Unit: microseconds
## expr min lq mean median
## play_game_computers(0.8, 0.8, 0.35, 0.9) 838.7 1061.05 1352.258 1222.2
## uq max neval
## 1361.45 4548.4 1000</code></pre>
<pre class="r"><code>#plot the updated values of moves
p1 <- ggplot(moves_df, aes(x = value, group = as.character(winning))) +
geom_density(alpha = 0.5, aes(fill = as.character(winning))) +
scale_fill_manual(values = c("red", "blue", "green"), name = "winning move") +
theme_minimal()
p1</code></pre>
<p><img src="/post/2019-11-28-rinforcement_learning_part_1_files/figure-html/train_computer-1.png" width="672" /></p>
<p>Thankfully the computer has learned that winning moves are more valuable than non-winning moves! The peaks at 0 correspond to ‘winning’ moves that are actually impossible to reach, as referenced nearer the top of the post.</p>
<p>We’ll then run 2500 testing games where the computer tries to play optimally. Noughts and crosses is a <a href="https://en.wikipedia.org/wiki/Solved_game">solved</a> game: unless a player chooses a non-optimal move, the game should end in a draw. Let’s see what proportion actually do end in a draw by grouping every 500 games of the testing set.</p>
<pre class="r"><code>#run an extra 2500 games with no exploration (just exploit)
test <- purrr::rerun(2500, play_game_computers(1, 1, 0.35, 0.9))
#group by each 500 games
test_df <- data.frame(result = unlist(test),
                      group = rep(1:5, each = 500))
#plot percentage of games that are drawn
p2 <- ggplot(test_df, aes(x = group, fill = result)) +
  geom_bar(stat = "count") +
  labs(x = "group (every 500 games)") +
  theme_minimal()
p2</code></pre>
<p><img src="/post/2019-11-28-rinforcement_learning_part_1_files/figure-html/test_computer-1.png" width="672" /></p>
<p>And it seems that, after a final bit of optimisation, the computer learns to always draw! Hooray!!</p>
<p>Finally, because obviously this post wouldn’t be complete without human testing, I wrote a quick and dirty function to play a game against the now proficient computer. Enjoy below!!</p>
<pre class="r"><code>player_move <- function(board){
  #find free spaces a move can be made into
  free_spaces <- which(board == 0)
  cat("Please move to one of the following board spaces: [", free_spaces, "]\n")
  #user input
  submitted_move <- as.integer(readline(prompt = ""))
  #need valid input
  while(!submitted_move %in% free_spaces) {
    if(submitted_move == 0) {
      break
    } else {
      cat("Illegal Move! Please move to one of the following board spaces: [", free_spaces, "] or press 0 to quit\n")
      submitted_move <- as.integer(readline(prompt = ""))
    }
  }
  #return move
  return(submitted_move)
}</code></pre>
<pre class="r"><code>#only need a computer epsilon and which piece (turn order)
play_game_human <- function(human_piece, computer_epsilon = 1) {
  board <- rep(0, 9)
  moves <- 0
  winner <- 0
  #play the game as before but with a human player
  if (human_piece == 1) {
    while (winner == 0 & moves < 9) {
      moves <- moves + 1
      plot_board(board)
      human_move <- player_move(board)
      if (human_move == 0) {
        break
      } else {
        board[human_move] <- human_piece
      }
      winner <- check_winnerCpp(board)
      if (winner == 0 & moves < 9) {
        moves <- moves + 1
        piece <- human_piece * -1
        board <- computer_move(piece, board, computer_epsilon)
        winner <- check_winnerCpp(board)
      }
    }
  } else {
    while (winner == 0 & moves < 9) {
      moves <- moves + 1
      piece <- human_piece * -1
      board <- computer_move(piece, board, computer_epsilon)
      winner <- check_winnerCpp(board)
      if (winner == 0 & moves < 9) {
        moves <- moves + 1
        plot_board(board)
        human_move <- player_move(board)
        if (human_move == 0) {
          break
        } else {
          board[human_move] <- human_piece
        }
        winner <- check_winnerCpp(board)
      }
    }
  }
  #little ending flavour
  if (winner == human_piece) {
    print("you win!!")
  } else if(winner == human_piece * -1) {
    print("oh no! you lost!")
  } else {
    print("a draw..")
  }
  plot_board(board)
}
#run like:
play_game_human(1, 1)</code></pre>

Predicting the Unpredictable - Analysing Rowing in Cambridge pt. 1
/post/cam_rowing_1/
Sun, 24 Nov 2019 00:00:00 +0000
/post/cam_rowing_1/
<p>In my free time away from PhD and data science work, I (used to) enjoy rowing. Aside from obvious benefits like socialising, providing a (very intense) workout, seeing the outdoors at least a few times a week… there are really two things that I love(d) about rowing:</p>
<ol style="list-style-type: decimal">
<li>It’s the sport that is closest to a simple engineering problem. Going fast basically boils down to how in time and how hard you can get 1-8 guys to move an oar through the water. Realistically, you could probably model how well a boat of guys will row just by tracking them on a rowing machine (and I have suspicions that this is what British Rowing etc. do for national teams).</li>
<li>I learnt to row as an undergraduate at Oxford, and really got serious about it as a postgraduate student at Cambridge. This might seem like an irrelevant detail but it’s not.</li>
</ol>
<div id="howrowingusuallyworks" class="section level1">
<h1>How rowing usually works</h1>
<p>Generally when racing boats, some n number of rowing crews line up alongside each other, and row straight down a lake (usually ~2km). The first boat to cross the finish line is generally considered the winner. For an example of such a race, see this Olympic final from 2012:</p>
<iframe width="560" height="315" src="https://youtu.be/x6wHZNWF7pA?t=655" frameborder="0" allowfullscreen>
</iframe>
<p>You might notice that there are four men in each boat here, each of whom are rowing. This works well on a reservoir where this race was held, but not so well on (e.g.) the River Cam that flows through Cambridge, which is both a pretty thin river, and has lots of tight corners.</p>
<div class="figure">
<img src="/img/river_cam.png" alt="the river cam" />
<p class="caption">the river cam</p>
</div>
</div>
<div id="howrowingincambridgeoxfordworks" class="section level1">
<h1>How rowing in Cambridge/Oxford works</h1>
<p>On these rivers, boats instead typically contain 8 rowers, and one cox, who is responsible for steering the boat. In lieu of the space needed to row side by side, various races across the year are run as time trials down a portion of the river. The <em>real</em> highlights of the year, however, are two four-day competitions in which crews line up one-behind-the-other and attempt to chase down and ‘bump’ the crew ahead (before being chased down themselves).</p>
<iframe width="560" height="315" src="https://www.youtube.com/watch?v=x6N6B_ob2k" frameborder="0" allowfullscreen>
</iframe>
<p>Upon hitting the boat that starts ahead, the two crews switch places the next day, and the race is then run again, until, hopefully, the final positions roughly reflect the speeds of the boats.</p>
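<p>As a toy illustration (with made-up crew names), the day-to-day reordering works like this:</p>
<pre class="r"><code>#toy example: update a start order after a day's bumps
#a bump swaps the bumping crew with the crew immediately ahead
apply_bumps <- function(start_order, bumping_crews) {
  for (crew in bumping_crews) {
    pos <- match(crew, start_order)
    start_order[c(pos - 1, pos)] <- start_order[c(pos, pos - 1)]
  }
  start_order
}

day1 <- c("Caius", "Downing", "Pembroke", "Jesus")
#suppose Pembroke bumps Downing on day 1
apply_bumps(day1, "Pembroke")
#-> "Caius" "Pembroke" "Downing" "Jesus"</code></pre>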
</div>
<div id="predictingbumpsraces" class="section level1">
<h1>Predicting bumps races</h1>
<p>Generally therefore, if two boats line up for a bumps race, the faster one should catch the slower boat (or, if the positions are reversed, the faster boat should fail to be caught by the slower boat behind). It <em>should</em> be fairly easy to predict bumps races, but it isn’t. The relative inexperience of many crews, the panic of the races, and the pretty tight course mean mistakes are made early and often.</p>
<iframe width="560" height="315" src="https://youtu.be/SCaeOsQmpTs?t=59" frameborder="0" allowfullscreen>
</iframe>
<p>However, I wanted to see how possible it was. The only real data to train predictions on are the time-trial races that happen before bumps, so I’m going to see how well it’s possible to model a bumps race using the implied speeds of crews from these previous time trials.</p>
</div>
<div id="librariesanddata" class="section level1">
<h1>Libraries and Data</h1>
<p>For this post, I’m only going to do some simple munging and logistic regression, so I only need the (new) version of the Tidyverse (as I’m also going to play with pivot_longer and pivot_wider for the first time).</p>
<pre class="r"><code>library(tidyverse)</code></pre>
<p>The data comes from my own scraping of race results on the River Cam over the last ten years. I’ll eventually package this up properly; for now it can be found at <a href="https://github.com/RobWHickman/CamStroker">my Github</a>. Today I’ll just read in the raw .csv files.</p>
<pre class="r"><code>#download the raw data
#will be packaged eventually
race_results <- read.csv("https://raw.githubusercontent.com/RobWHickman/CamStroker/master/data-raw/cambridge_race_results.csv",
                         stringsAsFactors = FALSE)
bumps_results <- read.csv("https://raw.githubusercontent.com/RobWHickman/CamStroker/master/data-raw/cambridge_bumps_results.csv",
                          stringsAsFactors = FALSE)</code></pre>
</div>
<div id="datamunging" class="section level1">
<h1>Data Munging</h1>
<p>We then want to lengthen out the bumps data by day to squeeze as much data as possible out of the combinations of boats we have data for. I need to line boats up by their start position each day, so I also initialise a column for this at the end.</p>
<pre class="r"><code>bumps_long <- bumps_results %>%
  #pivot bumps results to longer so we can model each day of racing
  pivot_longer(., starts_with("Day"),
               names_to = "Day", values_to = "Bump") %>%
  mutate(Bump = case_when(
    is.na(Bump) ~ 0,
    TRUE ~ as.numeric(Bump)
  )) %>%
  group_by(Competition, College, Year, Crew, Gender) %>%
  #calculate day start and end positions
  mutate(day_end = StartPos - cumsum(Bump)) %>%
  mutate(day_start = day_end + Bump)
head(bumps_long)</code></pre>
<pre><code>## # A tibble: 6 x 10
## # Groups:   Competition, College, Year, Crew, Gender [2]
##   Competition College  Year  Crew Gender StartPos Day    Bump day_end
##   <chr>       <chr>   <int> <int> <chr>     <int> <chr> <dbl>   <dbl>
## 1 Lent        Caius    2016    NA M             1 Day1      0       1
## 2 Lent        Caius    2016    NA M             1 Day2      0       1
## 3 Lent        Caius    2016    NA M             1 Day3      0       1
## 4 Lent        Caius    2016    NA M             1 Day4      0       1
## 5 Lent        Downing  2016    NA M             2 Day1      0       2
## 6 Lent        Downing  2016    NA M             2 Day2      0       2
## # ... with 1 more variable: day_start <dbl></code></pre>
<p>And we want to do the opposite for the race data so we can efficiently join it onto the bumps data. As the speed of the crew is all we care about, I calculate this as the course distance divided by the seconds taken, to get an idea of roughly how fast each crew is.</p>
<pre class="r"><code>race_wide <- race_results %>%
  #calculate implied racing speed
  mutate(race_id = paste(race, leg),
         av_speed = distance / seconds) %>%
  select(Year = year, College = college, Crew = crew, Gender = gender,
         race_id, av_speed) %>%
  pivot_wider(., id_cols = c("Year", "College", "Crew", "Gender"),
              names_from = race_id, values_from = av_speed) %>%
  #rename to tidy up
  rename(NSC = `Newnham Short Course NA`,
         Frbrn = `Fairbairns NA`,
         WH2H1 = `Winter Head 2 Head leg1`,
         WH2H2 = `Winter Head 2 Head leg2`,
         Rbnsn = `Robinson Head NA`) %>%
  #not much data for Robinson regatta so leave out
  select(-Rbnsn)
head(race_wide)</code></pre>
<pre><code>## # A tibble: 6 x 8
##    Year College   Crew Gender   NSC Frbrn WH2H1 WH2H2
##   <int> <chr>    <int> <chr>  <dbl> <dbl> <dbl> <dbl>
## 1  2010 Catz         1 M       5.42    NA    NA    NA
## 2  2010 Robinson     1 M       5.41    NA    NA    NA
## 3  2010 Sidney       1 M       5.39    NA    NA    NA
## 4  2010 Caius        1 M       5.38    NA    NA    NA
## 5  2010 Girton       1 M       5.22    NA    NA    NA
## 6  2010 Churchill    1 M       5.21    NA    NA    NA</code></pre>
<p>Then we simply join the data and calculate the implied speed differential between two crews who start a bumps race behind each other.</p>
<pre class="r"><code>regression_data <- bumps_long %>%
  ungroup() %>%
  #join datasets
  left_join(race_wide, by = c("Year", "College", "Crew", "Gender")) %>%
  group_by(Competition, Year, Gender, Day) %>%
  arrange(Competition, Year, Gender, Day, day_start) %>%
  #calculate speed difference between boats starting a bumps race behind each other
  mutate(frbrn_diff = Frbrn - lag(Frbrn),
         NSC_diff = NSC - lag(NSC),
         WH2H1_diff = WH2H1 - lag(WH2H1),
         WH2H2_diff = WH2H2 - lag(WH2H2)) %>%
  select(-NSC, -Frbrn, -WH2H1, -WH2H2) %>%
  #pivot longer for plotting
  pivot_longer(., frbrn_diff:WH2H2_diff,
               names_to = "race", values_to = "speed_difference") %>%
  filter(!is.na(speed_difference)) %>%
  #tidy up plotting data
  filter(Bump %in% c(1, 0)) %>%
  filter(Competition == "Lent")
head(regression_data)</code></pre>
<pre><code>## # A tibble: 6 x 12
## # Groups:   Competition, Year, Gender, Day [2]
##   Competition College  Year  Crew Gender StartPos Day    Bump day_end
##   <chr>       <chr>   <int> <int> <chr>     <int> <chr> <dbl>   <dbl>
## 1 Lent        Sidney   2010     2 F            40 Day1      1      39
## 2 Lent        Christs  2010     2 F            41 Day1      0      41
## 3 Lent        Sidney   2010     2 F            40 Day2      0      39
## 4 Lent        Christs  2010     2 F            41 Day2      1      40
## 5 Lent        Magdal~  2010     2 F            46 Day2      0      45
## 6 Lent        Newnham  2010     3 F            52 Day2      0      51
## # ... with 3 more variables: day_start <dbl>, race <chr>,
## #   speed_difference <dbl></code></pre>
</div>
<div id="plottingdata" class="section level1">
<h1>Plotting Data</h1>
<p>We can then plot the data and see if the speed differential from races earlier in the year is a useful predictor of bumping the boat ahead of you. We can model this as a logistic problem where bumping is either a 1 (caught the boat ahead) or a 0 (did not catch it). This does cut out some data in weird ways that I’ll get onto in later posts, but will do for now.</p>
<p>For the logistic regression I use geom_smooth with a binomial generalised linear model. Again, there’s more we could explore here, but this is just a quick intro post so we won’t worry about standard errors etc. I also split out by male/female crews as I imagine gender might play a role.</p>
<p>Given that this week is the first race of the 2019/2020 calendar (Fairbairns) I first limited myself to only data from that race.</p>
<pre class="r"><code>p1 <- regression_data %>%
  #filter only Fairbairns results
  filter(race == "frbrn_diff") %>%
  ggplot(., aes(x = speed_difference, y = Bump)) +
  geom_point() +
  #model as a logistic event
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE) +
  facet_wrap(~Gender, scales = "free_x") +
  theme_minimal()
#plot
p1</code></pre>
<p><img src="/post/2019-11-24-cambridge_rowing_1_files/figure-html/plot_fairbairns-1.png" width="672" /></p>
<p>And it seems that being faster than the boat ahead of you does increase your chance of bumping it, but not by a huge amount.</p>
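<p>For reference, the curve geom_smooth draws is just a binomial GLM, which could equally be fit directly with glm(). A self-contained sketch on simulated (entirely made-up) data, not the real results:</p>
<pre class="r"><code>#sketch: fit the logistic model directly on simulated data
set.seed(1)
speed_difference <- rnorm(200, mean = 0, sd = 0.1)
#simulate bumps that get likelier as the speed advantage grows
bump <- rbinom(200, size = 1, prob = plogis(5 * speed_difference))
fit <- glm(bump ~ speed_difference, family = binomial)
#predicted probability of a bump for a crew 0.1 m/s faster than the boat ahead
predict(fit, newdata = data.frame(speed_difference = 0.1), type = "response")</code></pre>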
<p>There’s good reason to believe Fairbairns might not be the best predictor of performance later in the year. It’s the first race, where many colleges are still testing out their crews. It also takes place during a weekday, so many students cannot take part, and is a 4.5km race, instead of the usual 2km of later races and bumps itself.</p>
<p>If we look at how all races predict later bumps success we can see much nicer logistic curves:</p>
<pre class="r"><code>#do the same but for all race data
p2 <- regression_data %>%
  ggplot(., aes(x = speed_difference, y = Bump)) +
  geom_point() +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE) +
  geom_vline(xintercept = 0, colour = "red", linetype = "dashed") +
  facet_grid(race~Gender, scales = "free_x") +
  theme_minimal()
p2</code></pre>
<p><img src="/post/2019-11-24-cambridge_rowing_1_files/figure-html/plot_all_races-1.png" width="672" /></p>
<p>Newnham Short Course and the two legs of Winter Head 2 Head in particular show nice curves, where boats that are faster in these races have a greater chance of bumping later in the year.</p>
<p>There’s a lot more to do to properly model a bumps regatta, but the first step of validating our ideas and data seems to show promising results!</p>
</div>

Could Yorkshire Win the World Cup
/project/yorkshire_world_cup/
Thu, 21 Nov 2019 00:00:00 +0000
/project/yorkshire_world_cup/
<p>In 2018, after watching the <a href="https://en.wikipedia.org/wiki/2018_CONIFA_World_Football_Cup" target="_blank">CONIFA World Cup</a> final live, I wondered if an Independent Yorkshire could win the FIFA World Cup. This resulted in a few blogposts that were turned into an article in <a href="https://www.citymetric.com/horizons/footballcouldindependentyorkshirewinworldcup3961" target="_blank">Citymetric magazine</a>.</p>
<p>This page lists the blog posts with a one-line description of the techniques in each, then copies the article text below.</p>
<h1 id="blogposts">Blog Posts</h1>
<ul>
<li><p><a href="http://www.roberthickman.eu/post/yorkshire_world_cup_1/" target="_blank">Data and Scraping</a>  Grabbing the FIFA video game player data and UK maps</p></li>
<li><p><a href="http://www.roberthickman.eu/post/yorkshire_world_cup_2/" target="_blank">Player Position LASSO</a>  Using LASSO regression to calculate player abilities in various positions</p></li>
<li><p><a href="http://www.roberthickman.eu/post/yorkshire_world_cup_3/" target="_blank">Finding Birthplaces</a>  Scraping the birthplaces of British players and binning by county</p></li>
<li><p><a href="http://www.roberthickman.eu/post/yorkshire_world_cup_4/" target="_blank">Picking Teams</a>  Simulating various lineups to find the best team each nation/county can put out on the field</p></li>
<li><p><a href="http://www.roberthickman.eu/post/yorkshire_world_cup_5/" target="_blank">Simulating World Cup</a>  Calculate implied county ELO and simulate World Cups to get a sense of the chance of a county winning it</p></li>
<li><p><a href="http://www.roberthickman.eu/post/yorkshire_world_cup_6/" target="_blank"><em>Bonus</em> UK/Rest of the World teams</a>  Doing the same but for an all UK team/ players from Nations who did not qualify for the World Cup</p></li>
</ul>
<h1 id="articletext">Article Text</h1>
<p>With less than a week until the start of the 2018 World Cup in Russia, it’s worth remembering that another World Cup – the 2018 ConIFA World Cup for stateless people, minorities, and regions unaffiliated with FIFA – is also taking place in London.</p>
<p>Though happening in the UK, neither of the local ConIFA members will be competing. The Ellan Vannin team from the Isle of Man withdrew midway; and the latest ConIFA member, Yorkshire, only gained membership earlier this year.</p>
<p>One of Yorkshire’s most obvious characteristics is that it’s absolutely huge compared to most other UK counties. It also – probably – has the highest contemporary population of any of the historic British counties. Indeed, as recently as this February the region resisted attempts to split control of the region, demanding a “One Yorkshire” devolution deal instead of the proposed devolution to the regions surrounding four of its major cities – and in May, a vocal proponent of such a “One Yorkshire” devolution, Dan Jarvis, the Labour MP for Barnsley, was elected as mayor of the Sheffield City Region.</p>
<p>Given its size, ConIFA membership, and pushes for further devolution, I was wondering how Yorkshire would do as an independent full FIFA member. If it seceded as a whole from the rest of the UK could it field a team that could challenge internationally? Could any of the historic British counties?</p>
<p>Overall, there are 88 historic counties in Great Britain, plus the 6 counties of Northern Ireland (I couldn’t find shapefiles for the older subdivisions) which could be potential independent FIFA members.</p>
<p>Once I had these, I needed some way to rate potential players, and therefore teams. Luckily, the popular video game FIFA18 maintains up to date ratings of thousands of players across 36 different stats (e.g. dribbling, heading, pace etc.). After scraping an online database of players, I’m left with 18,058 players of various nationalities and abilities.</p>
<p>Using a simple regression model, I can use these abilities and each player’s listed preferred positions to predict their rating in every position, and use these position ratings to train a computer to pick optimal teams across a variety of formations. If we do this for every nation that has at least 11 players in the database (10 outfield + 1 goalkeeper), the best 4 national teams that can be fielded are from Brazil, Germany, Spain, and Belgium.</p>
<p>To pick the teams for each county, though, I first had to find the birthplace of each player. To simplify things a bit I only checked players listed as English, Scottish, Welsh, Northern Irish, or Irish (due to the weirdness of the Irish FA) in my database of FIFA players. For each of these I ran a script to look the player up on Wikipedia and scrape their birthplace. Once this was geocoded, I had a map of each British player’s birthplace, and therefore, the county of their birth.</p>
<p>Unsurprisingly, it basically shows a population density map of the UK, with more players born in the urban centres of London, Birmingham, the Lancashire cities and the West Yorkshire urban centres. After binning the players by county of birth, twenty historic counties have enough players to field a team.</p>
<p>On this chart, ‘FIFA_ability’ is the perceived ability of the optimal 11 players in a starting line up for that county, as judged by FIFA stats.</p>
<p>Perhaps a little surprisingly, the Lancashire team is rated slightly higher than the Yorkshire team – though looking at the sheer number of players they can select from it makes sense. Elsewhere, the home counties do well, as do Glasgow and Warwickshire (which contains much of contemporary Birmingham).</p>
<p>Looking at the selections the algorithm chooses, it’s pretty clear some of these teams tend to be a bit flawed, but overall they make sense. The Yorkshire/Lancashire teams in particular are full of England international players and are lacking only an experienced top-level goalkeeper.</p>
<p>In order to predict how these teams would do at a World Cup, I needed some quantifiable rating of a team’s ability. Luckily, ELO ratings from chess can do exactly that: the likelihood of any team A beating a team B is a direct function of the difference in their ELO ratings.</p>
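<p>Concretely, the standard ELO expected-score formula works out as:</p>
<pre class="r"><code>#probability that team A beats team B, from the ELO rating difference
elo_win_prob <- function(elo_a, elo_b) {
  1 / (1 + 10 ^ ((elo_b - elo_a) / 400))
}

elo_win_prob(2000, 2000) #evenly matched -> 0.5
elo_win_prob(2100, 1900) #200 points stronger -> ~0.76</code></pre>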
<p>Plotting the ELO ratings of each actual national team (an up to date calculation is maintained at ELOrating.net) against the ability of each national team as judged by FIFA18 shows a pretty clear linear trend. Using a regression model of this relationship, we can predict the ability of each hypothetical county national team.</p>
<p>When plotted, these ELO ratings show that some of the counties are definitely in the ball park of established World Cup qualifiers – and so we might expect a post-super-devolution Britain to send multiple teams to the World Cup.</p>
<p>In fact, Yorkshire and Lancashire are predicted to be about as good as the national teams of Serbia and Sweden. Lagging a bit behind, Essex and Surrey – both of which take in large chunks of what is now London – could expect to be competitive with teams like Turkey and Morocco.</p>
<p>However, just finding out how good these teams would be wasn’t what I wanted to know. I wanted to see if an independent British county could win the World Cup.</p>
<p>To do this, I swapped each of these counties in for the English national team and ran 10000 simulations of the post-devolution 2018 World Cup, using the same draws and fixtures as the real tournament.</p>
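<p>Each simulated match then boils down to a coin flip weighted by the ELO difference. A minimal sketch with hypothetical ratings (the full simulation also has to handle group stages, draws, and the real fixture list):</p>
<pre class="r"><code>#sketch: decide one knockout tie from two ELO ratings
simulate_match <- function(elo_a, elo_b) {
  p_a <- 1 / (1 + 10 ^ ((elo_b - elo_a) / 400))
  if (runif(1) < p_a) "A" else "B"
}

#repeat many times to estimate how often the weaker side progresses
set.seed(42)
mean(replicate(10000, simulate_match(1850, 2000)) == "A")</code></pre>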
<p>The bad news is that the real-life favourites tend to dominate the simulations. Brazil or Germany were predicted to win the tournament in almost half of all the simulations. On the graph, it’s just possible to make out the red bars of Yorkshire and Lancashire, both of which won 41 out of 10000 simulations (a 0.41 per cent chance of winning any random World Cup).</p>
<p>This seems pretty low – but it is comparable to pretty respectable teams like Denmark (0.775 per cent) and Senegal (0.217 per cent), and even higher than the Iceland team which knocked England out of Euro 2016 (0.339 per cent). It’s way higher than the chances the simulation gives the Russian hosts (0.07 per cent).</p>
<p>Scaling down to just these pretty hopeless nations/counties really shows how little hope the independent British counties would have at an international tournament. However, the best four counties (Lancashire, Yorkshire, Essex, and Surrey) all have about a 0.2 per cent or higher chance, or 500-1 odds, of winning the 2018 World Cup were they to replace England at the last minute. This is an order of magnitude greater than the 5000-1 odds given to Leicester City at the start of the 2015-2016 Premier League season, so there’s always a chance.</p>

Guardian: The Knowledge
/project/guardian_knowledge/
Thu, 21 Nov 2019 00:00:00 +0000
/project/guardian_knowledge/
<p>When I have a free afternoon, I enjoy answering the questions listed on The Guardian’s <a href="https://www.theguardian.com/football/series/theknowledge" target="_blank">The Knowledge</a> blog. This munging generally ends up as <a href="http://www.roberthickman.eu/post/" target="_blank">blogposts</a>.</p>
<p>Here are the current examples I have published:</p>
<h1 id="august2018httpwwwroberthickmaneuposttheknowledge4thaugust2018"><a href="http://www.roberthickman.eu/post/theknowledge4thaugust2018/" target="_blank">August 2018</a></h1>
<p>Looked at successive runs of fixtures for English clubs against clubs with identical suffixes/prefixes. Also scraped FIFA World Cup squads and looked at players who played in a shirt number lower than their age.</p>
<h1 id="january2019httpwwwroberthickmaneupostcountiesleaguepoints"><a href="http://www.roberthickman.eu/post/counties_league_points/" target="_blank">January 2019</a></h1>
<p>Grouped football teams by their county and looked at which counties had won the most points over the past 140 years of league football.</p>
<h1 id="febuary2019httpwwwroberthickmaneuposttheknowledge7thfebruary2019"><a href="http://www.roberthickman.eu/post/theknowledge7thfebruary2019/" target="_blank">February 2019</a></h1>
<p>Answered 4 questions:
- which football player has scored on the most unique days of the year (Cristiano Ronaldo)
- which football team has finished 2nd in a league the most times (Manchester United in the 1st Division – 14 times)
- what is the earliest a team has been relegated from a league (Rochdale, with 8 games to go in <sup>1973</sup>⁄<sub>1974</sub>)
- what is the longest run of games without a draw in the English leagues (Aston Villa with 50 games in <sup>1891</sup>⁄<sub>1892</sub>)</p>
<h1 id="june2019httpwwwroberthickmaneupostguardianknowledgejune"><a href="http://www.roberthickman.eu/post/guardian_knowledge_june/" target="_blank">June 2019</a></h1>
<p>A pretty fun post trying to work out the players who had played in the Premier League while representing low-ranked countries. It turns out it’s pretty difficult to answer conclusively, but it seems that Zesh Rehman (Fulham/Pakistan) in 2005 is the lowest ranked, whereas Christopher Wreh seems to be the player with the lowest-ranked nationality to actually win the league.</p>
<p>For these questions I tend to rely on the <a href="https://github.com/jalapic/engsoccerdata" target="_blank">engsoccerdata</a> package in R, to which I am a frequent contributor. I also tend to do a fair bit of web scraping using simple (and sometimes <a href="http://www.roberthickman.eu/post/dynamic_web_scraping/" target="_blank">not so simple</a>) methods.</p>

R Packages
/project/r_packages/
Thu, 21 Nov 2019 00:00:00 +0000
/project/r_packages/
<p>Some R packages I have authored. Most/all can be found at my <a href="https://github.com/RobWHickman" target="_blank">Github page</a></p>
<h1 id="karpov">KaRpov</h1>
<p><img src="/img/packages/immortal_game.gif" alt="Immortal Game GIF'd using kaRpov" /></p>
<p>A small package in base R to read pgn files of chess matches and turn them into animations of the game using ggplot2. <a href="https://github.com/RobWHickman/kaRpov" target="_blank">https://github.com/RobWHickman/kaRpov</a></p>
<h1 id="ggparliament">ggparliament</h1>
<p><img src="https://raw.githubusercontent.com/RobWHickman/ggparliament/master/man/figures/HexSticker.png" alt="ggparliament" /></p>
<p>An extension to ggplot2 written with Thomas Leeper and Zoe Meers to plot the layout of various parliamentary chambers and their composition by party. Resulted in a <a href="https://www.theoj.org/josspapers/joss.01313/10.21105.joss.01313.pdf" target="_blank">JOSS paper</a>. Over 4000 downloads from CRAN and 115 Github stars as of November 2019.</p>
<h1 id="jeb">Jeb!</h1>
<p><img src="https://raw.githubusercontent.com/RobWHickman/Jeb/master/man/figure/HexSticker.png" alt="Jeb!" /></p>
<p>A small joke package written to quickly generate maps akin to the <a href="https://knowyourmeme.com/memes/jebwins" target="_blank">Jeb wins meme</a> using sf and ggplot2.</p>
<h1 id="epv">EPV</h1>
<p><img src="https://raw.githubusercontent.com/RobWHickman/EPV/master/hex_sticker/HexSticker2.png" alt="EPV" /></p>
<p>A small package containing code from my work for the 2019 Statsbomb Conference on Expected Threat models. To be extended into other Expected Possession Value models when time allows.</p>
<h1 id="camstroke">CamStroke</h1>
<p>A package to download data related to Cambridge collegiate rowing. Contains cleaned results files for various races and data on the various college boat clubs.</p>

Statsbomb Conference
/project/statsbomb_conference/
Thu, 21 Nov 2019 00:00:00 +0000
/project/statsbomb_conference/
<p>In August 2019, I won the opportunity to present a research talk at the inaugural <a href="https://statsbomb.com/conference/" target="_blank">Statsbomb football analytics conference</a>. My proposal focused on <a href="https://karun.in/blog/expectedthreat.html" target="_blank">Markov models</a> of possession value in football, and on incorporating a risk factor into these models.</p>
<p>I presented a 25 minute talk on my work at the conference in October 2019. Below is a list of resources related to the project.</p>
<h1 id="originalapplication">Original Application</h1>
<p>< to be added ></p>
<h1 id="thetalk">The Talk</h1>
<p><a href="https://www.youtube.com/watch?v=nzaHaWEa9BA" title="my talk" target="_blank"><img src="/img/statsbomb_talk.png" alt="My Statsbomb Talk Video" /></a></p>
<h1 id="talkslides">Talk Slides</h1>
<p>< to be added ></p>
<h1 id="whitepaper">White Paper</h1>
<p>I am currently in the process of writing up a white paper based on the research, and will upload it here when completed.</p>
<h1 id="rpackage">R Package</h1>
<p>Alongside the white paper, I am putting together an R package to easily reproduce the findings, and also to extend them into other similar models of possession value. The code can be found <a href="https://github.com/RobWHickman/EPV" target="_blank">on Github</a>.</p>