In what is becoming a repeated series, I enjoy answering trivia questions from The Guardian’s The Knowledge football trivia column.
There’s a few questions that built up that seemed amenable to coding answers so I’ve taken a stab at them here
#munging
library(tidyverse)
library(data.table)
library(zoo)
#english football data
library(engsoccerdata)
#web data scraping
library(rvest)
#plotting
library(openair)
Calendar Boys
The first question this week concerns players scoring on (or nearest to) every day of the year
Has any player played/or even scored on every date in a calendar year. What’s the nearest anyone has come?
— David Thomson (@thomsonionioni) January 29, 2019
Answer: Cristiano Ronaldo (using data from around the turn of the millenium) - 244 days
Getting the data for this is the main problem. The best (free) source I tend to use is transfermarkt.com, but data there becomes less reliable from before 2000 (and only has a few years of data from more obscure years where I could believe some players are banging in goals for fun). Nonetheless, it should at least gives us some ideas
For each player sampled we’re going to want the data for each goal scored both for their club and country. Saving the competition data is also useful as it also allows us to sort out friendlies which may or may not count depending on interpretation of the question.
Two quick functions will do this for any given player id
#create a data frame of club goals
get_club_goals <- function(club_stats) {
#read the page
read <- read_html(club_stats)
#get the players names
name <- read %>%
html_nodes(".dataName b") %>%
html_text()
#read the table of goals scored and munge together
club_df <- read %>%
html_nodes(xpath = '//*[@id="main"]/div[10]/div/div/div[4]/table') %>%
html_table(fill = TRUE) %>%
as.data.frame() %>%
select(Date, Minute, Competition.1) %>%
mutate(minute = as.numeric(gsub("'.*", "", Minute))) %>%
filter(!is.na(minute)) %>%
#convert date to day of the year
mutate(date = case_when(
Date != "" ~ strftime(as.Date(Date, "%m/%d/%y"), "%m/%d")
)) %>%
mutate(competition = ifelse(Competition.1 == "", NA, Competition.1)) %>%
select(competition, date, minute) %>%
#fill down the competition and date if missing
do(na.locf(.)) %>%
mutate(scored_for = "club", name = name)
}
#do the same for national team goals
get_nt_goals <- function(nt_stats) {
read <- read_html(nt_stats)
name <- read %>%
html_nodes(".dataName b") %>%
html_text()
goal_table <- read %>%
html_nodes(xpath = '//*[@id="main"]/div[10]/div[1]/div[3]/div[4]/table')
#some players won't have any national team goals
#return NA
if(!is_empty(goal_table)) {
nt <- goal_table %>%
html_table(fill = TRUE) %>%
as.data.frame() %>%
select(For, Date, Var.11) %>%
mutate(goals = as.numeric(Var.11)) %>%
mutate(date = case_when(
Date != "" ~ strftime(as.Date(Date, "%m/%d/%y"), "%m/%d")
)) %>%
mutate(competition = ifelse(For == "", NA, For)) %>%
mutate(competition = na.locf(competition)) %>%
filter(!is.na(date)) %>%
select(competition, date, goals)
#if more than 1 goal is scored on a game it's counted as two rows
#separate these out
if(any(nt$goals != 1)) {
nt_df <- do.call("c", (mapply(rep, c(nt$competition, nt$date), nt$goals))) %>%
matrix(., 2, byrow = TRUE) %>%
t() %>%
as.data.frame() %>%
select(competition = V1, date = V2)
} else {
nt_df <- nt %>%
select(-goals)
}
#finish off munging
df <- nt_df %>%
mutate(minute = NA, scored_for = "nation", name = name)
} else {
df <- NA
}
return(df)
}
Now we can get into the scraping. For each player some parts of the URL stay the same, so lets save those as objects so we don’t have to deal with massive long urls.
I decided to test out the functions using Cristiano Ronaldo as his 675 goals for club and country is (I believe) more than any active player. Pasting the url together and running on the functions does the trick
#for each player some parts of url stay the same
base_url <- "https://www.transfermarkt.co.uk/"
club_text1 <- "/alletore/spieler/"
club_text2 <- "/saison//verein/0/liga/0/wettbewerb//pos/0/trainer_id/0/minute/0/torart/0/plus/1"
nt_text1 <- "/nationalmannschaft/spieler/"
nt_text2 <- "/verein_id/3300/plus/0?hauptwettbewerb=&wettbewerb_id=&trainer_id=&start=Aug+20%2C+2003&ende=Feb+4%2C+2019&nurEinsatz=2"
#get all the goals scored by Cristiano Ronaldo
ronaldo <- rbind(
paste0(base_url, "player_name", club_text1, 8198, club_text2) %>%
get_club_goals(),
paste0(base_url, "player_name", nt_text1, 8198, nt_text2) %>%
get_nt_goals()
)
#count the number of unique dates scored on
length(unique(ronaldo$date))
## [1] 244
So Ronaldo has scored on 244 of the 366 possible days of the year. It’s not surprising that scoring on every day would be difficult. The club season only runs August-June and there are unlikely to be many possible games to play in July at all. Plus days such a Christmas are usually taken off from football.
In terms of goals per day using 2019s calendar this looks like (plot made using the openair package:
ronaldo %>%
#count goals per date
group_by(date) %>%
summarise(goals = n()) %>%
#convert to 2019 dates
mutate(date = as.Date(paste0("2019/", date))) %>%
#use calendarPlot from the openair package
calendarPlot(., pollutant = "goals", year = 2019)
Which shows more deviation in scoring than I thought it would. Nevertheless, from September-May each year is pretty blocked out, though there is a run of Saturdays this December which could be fertile ground for increasing his total.
Next we need to get a list of likely players who could come close to matching Ronaldo’s record.
For this I took the first page of transfermarkt’s top scorers of the year across all leagues. It’s possible that a player might (e.g.) be on the second page each year and have scored a ton, but I don’t think it’s super likely.
I run this through the top scorers page from 1995 (the earliest year available) to 2018 and grab each player id. Afterwards, save the scraped list as an .rds to prevent needing to continually re scrape the page and put extra load onto the server.
#the url for the pages of top scorers
top_scorer_ids <- paste0(base_url,
"spieler-statistik/jahrestorschuetzen/",
"statistik/stat/plus/0/galerie/0?jahr=",
1995:2018,
"&wettbewerb=alle&monatVon=01&monatBis=12&altersklasse=&",
"land_id=&ausrichtung=alle&spielerposition_id=alle&art=0") %>%
#scrape the ids of players
lapply(., function(year) {
read_html(year) %>%
html_nodes("#yw1 .spielprofil_tooltip") %>%
html_attr("id")
}) %>%
unlist() %>%
unique()
#save this to prevent need for re-scraping
saveRDS(top_scorer_ids, "transfermarkt_top_scorers.rds")
Then all that’s left is to scrape the goals for each player whose id we’ve scraped. Again, save this once run, especially as it takes a fair while to complete. For this article the data was scraped on the 5th February 2019
player_goals <- top_scorer_ids %>%
#for each player scrape every goal
lapply(., function(id) {
goals <- rbind(
paste0(base_url, "player_name", club_text1, id, club_text2) %>%
get_club_goals(),
paste0(base_url, "player_name", nt_text1, id, nt_text2) %>%
get_nt_goals()
) %>%
#remove NAS
#this is where a player hasn't scored for their nation
filter(!is.na(date)) %>%
mutate(id = id)
}) %>%
do.call(rbind, .)
#and save
saveRDS(player_goals, "top_scorer_goals.rds")
Now we have a list of every goal scored by prolific strikers, we just have to group by each player and count how many dates they’ve scored on. To get the players with the highest number of unique dates we group by their id and count the length of the unique dates they’ve scored on.
days_per_player <- player_goals %>%
#group by player
group_by(id) %>%
#count the dates scored on
summarise(days = length(unique(date))) %>%
arrange(-days) %>%
#rejoin the name data back in
left_join(.,
player_goals %>%
select(id, name) %>%
unique(),
by = "id") %>%
print()
## # A tibble: 413 x 3
## id days name
## <chr> <int> <chr>
## 1 8198 244 Cristiano Ronaldo
## 2 28003 210 Messi
## 3 7349 203 Raúl
## 4 3455 200 Ibrahimovic
## 5 3207 189 Henry
## 6 4257 187 Eto'o
## 7 3924 173 Drogba
## 8 48280 173 Cavani
## 9 44352 172 Suárez
## 10 7980 171 Villa
## # ... with 403 more rows
so perhaps unsurprisingly, Ronaldo comes out on top. As expected given the data source, most of the top players are very recent strikers- all of the top 10 were active well into the 2010s. Ulf Kirsten and Toni Polster are the torchbearers for strikers from the 90s.
As always in these posts, I try to learn some new stuff as I do them. I thought this might be a good time to try some circular plotting. I don’t think the resultant plots actually inform that much but they are cool to look at.
circular_data <- player_goals %>%
#filter out top 16 scorers
filter(id %in% days_per_player$id[1:16]) %>%
#group by month and player and sum
mutate(month = gsub("\\/.*", "", date)) %>%
group_by(id, month, competition) %>%
summarise(goals = n()) %>%
left_join(.,
player_goals %>%
select(id, name) %>%
unique(),
by = "id")
#too many competititon for legend
#sort out into broad groups
competition_types <- data.frame(competition = circular_data$competition) %>%
unique() %>%
mutate(competition = as.character(competition)) %>%
mutate(competition_type = case_when(
grepl("MLS", competition) ~ "Domestic",
grepl("World Cup qualification| Qualifiers", competition) ~ "International",
grepl("Friendlies", competition) ~ "International Friendlies",
grepl("World Cup [0-9]{4}|Confederations|EURO [0-9]{4}", competition) ~ "International Tournament",
grepl("UEFA|Champions League|UI Cup|Cup Winners|European Cup|Europa", competition) ~ "European",
grepl("Club World", competition) ~ "International Club",
grepl("Cup|cup|Pokal|copa|Copa|beker|Coupe|coppa|Kupasi|Trophée|Kupa", competition) ~ "Domestic Cup"
)) %>%
mutate(competition_type = ifelse(is.na(competition_type), "Domestic", competition_type)) %>%
#convert to factor for plot fill order
mutate(competition_type = fct_rev(factor(competition_type)))
#plot as circular radar plots
circular_data %>%
left_join(., competition_types, by = "competition") %>%
ggplot(., aes(x = month, y = goals, fill = competition_type)) +
#convert to polar coordinates
coord_polar(theta = "x", start = -.13) +
geom_bar(stat = "identity") +
scale_fill_discrete(name = "Competition Type") +
ggtitle("Goals Per Month for Top 16 Unique Day Scorers") +
facet_wrap(~name) +
theme_minimal() +
theme(axis.text = element_blank())
One nice thing that pops out is how Kirsten rarely scored in December/January- probably due to the Bundesliga mid season break.
I also found it interesting that Dirk Kuyt featured in the top 16, despite not being renowned as a great goalscorer.
#count total goals per player
left_join(
days_per_player,
player_goals %>%
group_by(id) %>%
summarise(goals = n()),
by = "id"
) %>%
#work out days/total goals
mutate(proportion_unique = days / goals) %>%
arrange(-days) %>%
filter(days > 150) %>%
select(name, goals, days, proportion_unique) %>%
arrange(-proportion_unique)
## # A tibble: 34 x 4
## name goals days proportion_unique
## <chr> <int> <int> <dbl>
## 1 Gilardino 231 154 0.667
## 2 Cissé 269 166 0.617
## 3 Signori 260 160 0.615
## 4 Di Vaio 269 165 0.613
## 5 Toni 273 165 0.604
## 6 Kuyt 281 169 0.601
## 7 Lampard 258 154 0.597
## 8 Frei 257 153 0.595
## 9 Drogba 300 173 0.577
## 10 Trézéguet 265 152 0.574
## # ... with 24 more rows
When sorted by how evenly their goals/date coverage is (i.e. the ideal ratio would be 1 goal on every day), Dirk Kuyt pops up again (and did in fact score many more goals than I had assumed). Alberto Gilardino really stands out as a player who has maximum date coverage despite (relative to other members of the list!) a low number of total goals scored.
I’m not sure what, if any, insight that adds but is a cool piece of trivia.
First Losers
Which club holds the record for most 2nd place finishes in the English top flight?
— Tom Goddard (@Tom_Goddard_13) February 5, 2019
Answer: Manchester United in the top flight - 14 times
The fist question this week where I get to dive back into James Curley’s engsoccerdata package asks which teams have finished second in their league the most.
First lets load up the engsoccerdata for English leagues 1882-2016. I’ve munged it in a pretty verbose way; there’s definitely a faster way to do it but that’s not really necessary. All we need are the indicators used to sort the league (points, goal difference, and goal scored) for every match in a long format.
eng_data <- engsoccerdata::england %>%
#select only pertinent variables
select(Date, Season, home, visitor, hgoal, vgoal, division) %>%
rename_all(tolower) %>%
#melt the data to long format
reshape2::melt(id.vars = c("date", "season", "hgoal", "vgoal", "division"),
value.name = "team", variable.name = "location") %>%
#this can be done in one step but for sanity
mutate(result = case_when(
hgoal > vgoal & location == "home" ~ "W",
vgoal > hgoal & location == "visitor" ~ "W",
hgoal < vgoal & location == "home" ~ "L",
vgoal < hgoal & location == "visitor" ~ "L",
vgoal == hgoal ~ "D"
)) %>%
#points for a win changed in 1981
mutate(points = case_when(
result == "L" ~ 0,
result == "D" ~ 1,
result == "W" & season < 1981 ~ 2,
result == "W" & season > 1980 ~ 3
)) %>%
#and get the goal info too
mutate(goal_diff = case_when(
location == "home" ~ hgoal - vgoal,
location == "visitor" ~ vgoal - hgoal
)) %>%
mutate(goals = case_when(
location == "home" ~ hgoal,
location == "visitor" ~ vgoal
)) %>%
#only save the variables we care about then sort
select(date, season, division, team, points, goals, goal_diff) %>%
arrange(date, team)
Then find the final positions of each team in each season of English football sorted by points, goal difference and goals for
final_positions <- eng_data %>%
setDT() %>%
#find the match number
.[, match := 1:.N, by = c("season", "team")] %>%
#find the cumulative points, goal difference and goals for
.[, season_points := cumsum(points), by = c("season", "team")] %>%
.[, season_gd := cumsum(goal_diff), by = c("season", "team")] %>%
.[, season_g := cumsum(goals), by = c("season", "team")] %>%
#filter out the final matches totals and order
.[.[, .I[match == max(match)], by= c("season", "division")]$V1] %>%
.[order(season, division, -season_points)] %>%
#assign the final positions
.[, final_position := 1:.N, by = c("season", "division")] %>%
.[, c("team", "division", "final_position")] %>%
#count by final position
.[, pos_count := .N, by = c("team", "division", "final_position")] %>%
unique()
Then we can filter out those who have finished second most
second_place <- final_positions %>%
#filter out second place finishes
.[final_position == 2] %>%
.[, c("team", "division", "final_position", "pos_count")] %>%
.[order(-pos_count)]
head(data.frame(second_place), 10)
## team division final_position pos_count
## 1 Manchester United 1 2 14
## 2 Arsenal 1 2 12
## 3 Liverpool 1 2 11
## 4 Aston Villa 1 2 9
## 5 Sheffield United 2 2 8
## 6 Birmingham City 2 2 8
## 7 Everton 1 2 7
## 8 Sunderland 1 2 7
## 9 Preston North End 1 2 6
## 10 Manchester City 1 2 6
Perhaps unsurprisingly, most of the teams to finish second have finished second in the top flight. Manchester United lead the way with Arsenal and Liverpool following up.
Bristol City, Charlton Athletic, Oldham Athletic, Blackpool, QPR, Watford and Southampton have finished runners up in the top division without winning it, all having achieved this exactly once.
Plymouth Argyle have perhaps the most heartbreaking run of all though- having finished second in the old 3rd Division South SIX times in a row between 1922-1927 before finally winning it in 1929.
I thought I might as well also plot every teams league finishes as a proportion of their season in the league. Position here refers to total overall position (so 1st in Division two might be 21st overall). The darker the colour, the more likely the team was the end the season in that position. All teams have been sorted by their mean final league position.
eng_data %>%
#filter out modern era
filter(season > 1991) %>%
setDT() %>%
#find finish positions and count as above
.[, match := 1:.N, by = c("season", "team")] %>%
.[, season_points := cumsum(points), by = c("season", "team")] %>%
.[, season_gd := cumsum(goal_diff), by = c("season", "team")] %>%
.[, season_g := cumsum(goals), by = c("season", "team")] %>%
.[.[, .I[match == max(match)], by= c("season", "division")]$V1] %>%
.[order(season, division, -season_points)] %>%
.[, final_position := 1:.N, by = c("season", "division")] %>%
.[, total_position := 1:.N, by = c("season")] %>%
.[, c("team", "division", "final_position", "total_position")] %>%
.[, pos_count := .N, by = c("team", "division", "final_position")] %>%
.[, team_appearances := .N, by = c("team")] %>%
.[, mean_pos := sum(total_position)/team_appearances, by = c("team")] %>%
unique() %>%
#order by mean position
.[order(mean_pos)] %>%
.[, team := fct_rev(fct_relevel(as.factor(team), unique(.$team)))] %>%
#plot
ggplot(., aes(x = total_position, y = team)) +
geom_tile(aes(alpha = pos_count/team_appearances), fill = "blue") +
scale_alpha_continuous(guide = FALSE) +
ggtitle("Teams Ordered by Mean Final Position 1992-2016",
subtitle = "Weight indicates proportion of finishes in that position") +
xlab("Total League Position") +
theme_minimal() +
theme(axis.text.y = element_text(angle = 10))
There’s probably too much data to graph here, but it’s still a fun way to look at 140 years of English football
Slip Slidin’ Away
As we're starting to reach that time of season (we'll, we are at #itfc)… What's the earliest collectively a team from each of the top 4 English leagues has been relegated?
— Philip Genochio (@philipgenochio) February 5, 2019
Answer: Rochdale with 8 games to go in Division 3 in 1973/1974
A similar question involves the earliest teams to get relegated. Obviously for this first we need to know how many teams are relegated from each league per season. Having only really started watching football around the turn of the millenium I was a bit surprised how much this has changed over the years (n.b. I’m only counting automatic relegation- playoffs and re-elections don’t count).
#manually enter the number of relegation spots per league
relegation_spots <- eng_data %>%
.[, c("season", "division")] %>%
unique() %>%
mutate(relegation_spots = case_when(
season >= 1995 & division == 1 ~ 3,
season >= 1994 & division == 1 ~ 4,
season >= 1991 & division == 1 ~ 3,
season >= 1990 & division == 1 ~ 2,
season >= 1973 & division == 1 ~ 3,
season >= 1898 & division == 1 ~ 2,
season >= 1995 & division == 2 ~ 3,
season >= 1994 & division == 2 ~ 4,
season >= 1991 & division == 2 ~ 3,
season >= 1990 & division == 2 ~ 2,
season >= 1988 & division == 2 ~ 3,
season >= 1986 & division == 2 ~ 2,
season >= 1973 & division == 2 ~ 3,
season >= 1921 & division == 2 ~ 2,
season >= 1920 & division == 2 ~ 1,
season >= 1919 & division == 2 ~ 3,
season >= 1995 & division == 3 ~ 4,
season >= 1994 & division == 3 ~ 5,
season >= 1991 & division == 3 ~ 4,
season >= 1990 & division == 3 ~ 3,
season >= 1988 & division == 3 ~ 4,
season >= 1986 & division == 3 ~ 3,
season >= 1958 & division == 3 ~ 4,
season >= 2002 & division == 4 ~ 2,
season >= 1996 & division == 4 ~ 1,
season >= 1993 & division == 4 ~ 0,
season >= 1992 & division == 4 ~ 1,
season >= 1990 & division == 4 ~ 0,
season >= 1986 & division == 4 ~ 1
)) %>%
mutate(relegation_spots = ifelse(relegation_spots == 0, NA, relegation_spots))
We then need to work out how many points each team has, and how many they could possibly achieve, after every match in a season.
possible_positions <- eng_data %>%
setDT() %>%
#get the match number
.[, match := 1:.N, by = c("season", "team")] %>%
#get the current points for the team
.[, season_points := cumsum(points), by = c("season", "team")] %>%
.[order(division, season, match, -season_points)] %>%
#get the current position for the team
.[, position := 1:.N, by = c("division", "season", "match")] %>%
#how many teams are in the league
.[, teams := max(position), by = c("division", "season")] %>%
#find how many matches each team has left to play
.[, matches_remaining := max(match) - match, by = c("division", "season")] %>%
#the max points assumes each team wins all of their remaining matches
.[season < 1981, possible_points := season_points + (matches_remaining * 2)] %>%
.[season > 1980, possible_points := season_points + (matches_remaining * 3)] %>%
#merge in the relegation spots and find what position each team needs to be safe
merge(., relegation_spots, by = c("division", "season")) %>%
.[, lowest_safe_position := teams - relegation_spots] %>%
#the threshold for safety is the number of points the lowest safe team has
.[position == lowest_safe_position, lowest_safe_points := season_points]
Then it’s a simple case of finding teams in the relegation zone and finding the point at which they can no longer catch the lowest safe team
relegation_secured <- possible_positions %>%
#filter out teams in relegation trouble
.[!is.na(lowest_safe_position)] %>%
.[position >= lowest_safe_position] %>%
.[, lowest_safe_points := na.locf(lowest_safe_points)] %>%
.[possible_points < lowest_safe_points] %>%
.[, c("season", "division", "team", "season_points",
"matches_remaining")] %>%
.[order(-matches_remaining)] %>%
#remove duplicates
.[!duplicated(paste0(season, division, team))]
head(data.frame(relegation_secured), 15)
## season division team season_points matches_remaining
## 1 1973 3 Rochdale 16 8
## 2 1984 1 Stoke City 17 7
## 3 2001 2 Stockport County 17 7
## 4 2003 2 Wimbledon 21 7
## 5 2016 2 Rotherham United 17 7
## 6 1961 3 Newport County 18 7
## 7 1984 3 Cambridge United 17 7
## 8 1993 3 Barnet 23 7
## 9 2000 3 Oxford United 22 7
## 10 1930 1 Manchester United 16 6
## 11 1954 1 Sheffield Wednesday 19 6
## 12 1975 1 Sheffield United 13 6
## 13 1994 1 Ipswich Town 23 6
## 14 2007 1 Derby County 11 6
## 15 1952 2 Barnsley 17 6
So Rochdale hold the questionable honour of being the team knowing they are doomed with the most matches to go (with 8 in division 3 in 1973). There’s quite a large chasing pack of teams who have known with 7 or 6 matches left too. Ipswich are currently ‘only’ 8 points off of safety with 16 games left to go so seems unlikely to beat 8 but it could be close…
I also wanted to see what the earliest a team has ever been certain of their final position is.
certain_final_positions <- possible_positions %>%
#find the possible points for the teams above and below each team
.[, poss_points_nextworst := lead(possible_points), by = c("season", "division", "match")] %>%
.[, points_nextbest := lag(season_points), by = c("season", "division", "match")] %>%
#filter out teams that cannot beat/fall below the next best/worst teams
.[(is.na(poss_points_nextworst) | season_points > poss_points_nextworst) &
(is.na(points_nextbest) | possible_points < points_nextbest) &
matches_remaining > 0] %>%
#order and select columns
.[order(-matches_remaining)] %>%
.[, c("division", "season", "team", "position", "matches_remaining", "teams")]
head(data.frame(certain_final_positions), 15)
## division season team position matches_remaining teams
## 1 4 1968 Bradford Park Avenue 24 7 24
## 2 1 1984 Stoke City 22 6 22
## 3 2 1949 Tottenham Hotspur 1 6 22
## 4 2 1971 Watford 22 6 22
## 5 2 1973 Middlesbrough 1 6 22
## 6 2 2001 Stockport County 24 6 24
## 7 3 1966 Queens Park Rangers 1 6 24
## 8 3 1984 Cambridge United 24 6 24
## 9 3b 1952 Walsall 24 6 24
## 10 4 1968 Bradford Park Avenue 24 6 24
## 11 4 1977 Watford 1 6 24
## 12 4 1997 Notts County 1 6 24
## 13 1 1980 Crystal Palace 22 5 22
## 14 1 1982 Liverpool 1 5 22
## 15 1 1984 Everton 1 5 22
Spare a thought for fans of Bradford Park Avenue in 1968-1969 who knew their team would finish bottom of the 3rd Division North with 7 matches (of 46) remaining. Luckily they weren’t relegated as they were already in the bottom division of the football league and we re-elected for the next season. They repeated this feat, now in Division 4, 5 years later, finishing bottom with 6 games to go.
Most of these involve teams either winning or finishing bottom of their league. If we filter these out we’re left with:
certain_final_positions %>%
#filter off bottom or top teams
.[position != 1 & position != teams] %>%
data.frame() %>%
head(., 15)
## division season team position matches_remaining teams
## 1 3a 1931 New Brighton 20 4 22
## 2 3a 1931 Rochdale 21 4 22
## 3 2 2003 Bradford City 23 3 24
## 4 2 2005 Sheffield United 2 3 24
## 5 3a 1931 New Brighton 20 3 22
## 6 3a 1931 Rochdale 21 3 22
## 7 3b 1929 Brentford 2 3 22
## 8 4 1975 Northampton Town 2 3 24
## 9 1 1888 Aston Villa 2 2 12
## 10 1 1930 Aston Villa 2 2 22
## 11 1 1934 Sunderland 2 2 22
## 12 1 1946 Brentford 21 2 22
## 13 1 1957 Preston North End 2 2 22
## 14 1 1970 Burnley 21 2 22
## 15 1 1978 Queens Park Rangers 20 2 22
Which really emphasizes how only exceptionally good/bad teams are ever really certain of their position before the end of the season.
Draw Me Like One of Your Top Flight Teams
Answer: Aston Villa - 50 games in 1891/1892, but otherwise Spurs in modern football
(N.b. the engsoccerdata package only has data up until the 2016/2017 season, so Tottenham’s run won’t appear in the results below)
To do this we just need to select every game with a non-zero goal difference (i.e. a draw) and then find consecutive runs for teams.
draws <- eng_data %>%
setDT() %>%
.[order(team, date)] %>%
#give each match a consecutive 'id'
.[, game_id := 1:.N, by = team] %>%
#find matches with a non zero goal difference (not a draw)
.[goal_diff != 0] %>%
#find consecutive matches with non zero goal difference
.[, consecutive := lead(game_id) - game_id, by = team] %>%
.[consecutive != 1, consecutive := NA] %>%
#count all consecutive runs
.[, count := .N*!is.na(consecutive), rleid(!is.na(consecutive))] %>%
#find the start and end of each run
.[count != lead(count) | count != lag(count)] %>%
.[order(-count, team, date)] %>%
#set the start points and end point and spread to separate columns
.[, run_point := c("start", "end")] %>%
spread(run_point, date) %>%
.[, start := na.locf(start)] %>%
.[!is.na(end) & !is.na(start)] %>%
#select columns to print
.[, c("start", "end", "division", "team", "count")] %>%
.[order(-count)]
head(data.frame(draws), 25)
## start end division team count
## 1 1891-01-01 1892-12-10 1 Aston Villa 50
## 2 1895-03-30 1896-11-09 1 Stoke City 45
## 3 1907-12-26 1909-02-27 1 Sunderland 45
## 4 1913-01-01 1914-02-14 2 Bradford Park Avenue 43
## 5 1909-01-30 1910-03-25 2 Leicester City 43
## 6 1894-01-06 1896-09-05 2 Walsall 43
## 7 1892-12-10 1894-09-15 1 Birmingham City 42
## 8 1896-03-21 1897-09-25 2 Darwen 39
## 9 1928-03-17 1929-02-02 1 Portsmouth 37
## 10 1904-10-22 1905-11-11 1 Sheffield United 37
## 11 1947-11-15 1948-09-25 3b Bristol Rovers 36
## 12 1930-05-03 1931-03-28 2 Wolverhampton Wanderers 36
## 13 1894-03-23 1895-04-20 2 Lincoln City 35
## 14 1946-10-12 1947-08-23 3a Stockport County 35
## 15 1904-01-30 1904-12-27 2 Bristol City 34
## 16 1915-04-03 1920-02-21 1 West Bromwich Albion 34
## 17 1934-12-01 1935-09-16 2 Doncaster Rovers 33
## 18 1895-11-16 1896-11-28 2 Newcastle United 33
## 19 1925-05-02 1926-02-22 2 Middlesbrough 32
## 20 1935-08-31 1936-03-14 3b Reading 32
## 21 1925-12-26 1926-10-02 3a Rochdale 32
## 22 1896-01-20 1897-02-13 2 Burton Wanderers 31
## 23 1927-12-24 1928-09-15 3a Accrington Stanley 30
## 24 1905-11-04 1906-09-03 2 Gainsborough Trinity 30
## 25 1946-11-23 1947-08-25 2 Plymouth Argyle 30
Tottenham’s run without a draw doesn’t even make the top 25 such runs! And they still would have to wait until at least next season until they can match Aston Villa’s run from New Years Day 1891 until Christmas Eve in 1892, a run of 50 matches without a draw- a run that included 23 losses, and 27 wins. It is the longest for quite sometime though- there’s few similar runs in the post-war years.
We can restrict this easily to just runs in the top division my modifying one line
draws <- eng_data %>%
setDT() %>%
.[order(team, date)] %>%
.[, game_id := 1:.N, by = team] %>%
#filter out only the top division matches
.[division == 1] %>%
.[goal_diff != 0] %>%
.[, consecutive := lead(game_id) - game_id, by = team] %>%
.[consecutive != 1, consecutive := NA] %>%
.[, count := .N*!is.na(consecutive), rleid(!is.na(consecutive))] %>%
.[count != lead(count) | count != lag(count)] %>%
.[order(-count, team, date)] %>%
.[, run_point := c("start", "end")] %>%
spread(run_point, date) %>%
.[, start := na.locf(start)] %>%
.[!is.na(end) & !is.na(start)] %>%
.[, c("start", "end", "division", "team", "count")] %>%
.[order(-count)]
head(data.frame(draws), 10)
## start end division team count
## 1 1891-01-01 1892-12-10 1 Aston Villa 50
## 2 1895-03-30 1896-11-09 1 Stoke City 45
## 3 1907-12-26 1909-02-27 1 Sunderland 45
## 4 1928-03-17 1929-02-02 1 Portsmouth 37
## 5 1904-10-22 1905-11-11 1 Sheffield United 37
## 6 1915-04-03 1920-02-21 1 West Bromwich Albion 34
## 7 1895-09-28 1896-09-12 1 Wolverhampton Wanderers 29
## 8 1964-09-12 1965-03-31 1 Aston Villa 28
## 9 1953-04-25 1954-01-02 1 Burnley 28
## 10 1891-03-14 1892-04-30 1 Sunderland 28
Where we can see that Spurs’ run is at least the longest modern top flight drawless run.