The Knowledge 7th February 2019

In what is becoming a repeated series, I enjoy answering trivia questions from The Guardian’s The Knowledge football trivia column.

There’s a few questions that built up that seemed amenable to coding answers so I’ve taken a stab at them here

#munging
library(tidyverse)
library(data.table)
library(zoo)
#english football data
library(engsoccerdata)
#web data scraping
library(rvest)
#plotting
library(openair)

Calendar Boys

The first question this week concerns players scoring on (or nearest to) every day of the year

Answer: Cristiano Ronaldo (using data from around the turn of the millenium) - 244 days

Getting the data for this is the main problem. The best (free) source I tend to use is transfermarkt.com, but data there becomes less reliable from before 2000 (and only has a few years of data from more obscure years where I could believe some players are banging in goals for fun). Nonetheless, it should at least gives us some ideas

For each player sampled we’re going to want the data for each goal scored both for their club and country. Saving the competition data is also useful as it also allows us to sort out friendlies which may or may not count depending on interpretation of the question.

Two quick functions will do this for any given player id

#create a data frame of club goals
get_club_goals <- function(club_stats) {
  #read the page
  read <- read_html(club_stats)
  
  #get the players names
  name <- read %>%
    html_nodes(".dataName b") %>%
    html_text()
    
  #read the table of goals scored and munge together
  club_df <- read %>%
    html_nodes(xpath = '//*[@id="main"]/div[10]/div/div/div[4]/table') %>%
    html_table(fill = TRUE) %>%
    as.data.frame() %>%
    select(Date, Minute, Competition.1) %>%
    mutate(minute = as.numeric(gsub("'.*", "", Minute))) %>%
    filter(!is.na(minute)) %>% 
    #convert date to day of the year
    mutate(date = case_when(
      Date != "" ~ strftime(as.Date(Date, "%m/%d/%y"), "%m/%d")
    )) %>%
    mutate(competition = ifelse(Competition.1 == "", NA, Competition.1)) %>%
    select(competition, date, minute) %>%
    #fill down the competition and date if missing
    do(na.locf(.)) %>%
    mutate(scored_for = "club", name = name)
}

#do the same for national team goals
get_nt_goals <- function(nt_stats) {
  read <- read_html(nt_stats)
  
  name <- read %>%
    html_nodes(".dataName b") %>%
    html_text()

  goal_table <- read %>%
    html_nodes(xpath = '//*[@id="main"]/div[10]/div[1]/div[3]/div[4]/table')
  
  #some players won't have any national team goals
  #return NA
  if(!is_empty(goal_table)) {
    nt <- goal_table %>%
      html_table(fill = TRUE) %>%
      as.data.frame() %>%
      select(For, Date, Var.11) %>%
      mutate(goals = as.numeric(Var.11)) %>%
      mutate(date = case_when(
        Date != "" ~ strftime(as.Date(Date, "%m/%d/%y"), "%m/%d")
      )) %>%
      mutate(competition = ifelse(For == "", NA, For)) %>%
      mutate(competition = na.locf(competition)) %>%
      filter(!is.na(date)) %>%
      select(competition, date, goals) 
    
    #if more than 1 goal is scored on a game it's counted as two rows
    #separate these out
    if(any(nt$goals != 1)) {
      nt_df <- do.call("c", (mapply(rep, c(nt$competition, nt$date), nt$goals))) %>%
        matrix(., 2, byrow = TRUE) %>%
        t() %>%
        as.data.frame() %>%
        select(competition = V1, date = V2)
    } else {
      nt_df <- nt %>%
        select(-goals)
    }
    
    #finish off munging
    df <- nt_df %>%
      mutate(minute = NA, scored_for = "nation", name = name)
  } else {
    df <- NA
  }
  return(df)
}

Now we can get into the scraping. For each player some parts of the URL stay the same, so lets save those as objects so we don’t have to deal with massive long urls.

I decided to test out the functions using Cristiano Ronaldo as his 675 goals for club and country is (I believe) more than any active player. Pasting the url together and running on the functions does the trick

#for each player some parts of url stay the same
base_url <- "https://www.transfermarkt.co.uk/"
club_text1 <- "/alletore/spieler/"
club_text2 <- "/saison//verein/0/liga/0/wettbewerb//pos/0/trainer_id/0/minute/0/torart/0/plus/1"
nt_text1 <- "/nationalmannschaft/spieler/"
nt_text2 <- "/verein_id/3300/plus/0?hauptwettbewerb=&wettbewerb_id=&trainer_id=&start=Aug+20%2C+2003&ende=Feb+4%2C+2019&nurEinsatz=2"

#get all the goals scored by Cristiano Ronaldo
ronaldo <- rbind(
  paste0(base_url, "player_name", club_text1, 8198, club_text2) %>%
    get_club_goals(),
  paste0(base_url, "player_name", nt_text1, 8198, nt_text2) %>%
    get_nt_goals()
)

#count the number of unique dates scored on
length(unique(ronaldo$date))
## [1] 244

So Ronaldo has scored on 244 of the 366 possible days of the year. It’s not surprising that scoring on every day would be difficult. The club season only runs August-June and there are unlikely to be many possible games to play in July at all. Plus days such a Christmas are usually taken off from football.

In terms of goals per day using 2019s calendar this looks like (plot made using the openair package:

ronaldo %>%
  #count goals per date
  group_by(date) %>%
  summarise(goals = n()) %>%
  #convert to 2019 dates
  mutate(date = as.Date(paste0("2019/", date))) %>%
  #use calendarPlot from the openair package
  calendarPlot(., pollutant = "goals", year = 2019)

Which shows more deviation in scoring than I thought it would. Nevertheless, from September-May each year is pretty blocked out, though there is a run of Saturdays this December which could be fertile ground for increasing his total.

Next we need to get a list of likely players who could come close to matching Ronaldo’s record.

For this I took the first page of transfermarkt’s top scorers of the year across all leagues. It’s possible that a player might (e.g.) be on the second page each year and have scored a ton, but I don’t think it’s super likely.

I run this through the top scorers page from 1995 (the earliest year available) to 2018 and grab each player id. Afterwards, save the scraped list as an .rds to prevent needing to continually re scrape the page and put extra load onto the server.

#the url for the pages of top scorers
top_scorer_ids <- paste0(base_url, 
                         "spieler-statistik/jahrestorschuetzen/",
                         "statistik/stat/plus/0/galerie/0?jahr=",
                         1995:2018,
                         "&wettbewerb=alle&monatVon=01&monatBis=12&altersklasse=&",
                         "land_id=&ausrichtung=alle&spielerposition_id=alle&art=0") %>%
  #scrape the ids of players
  lapply(., function(year) {
    read_html(year) %>%
      html_nodes("#yw1 .spielprofil_tooltip") %>%
      html_attr("id")
  }) %>%
  unlist() %>%
  unique()

#save this to prevent need for re-scraping
saveRDS(top_scorer_ids, "transfermarkt_top_scorers.rds")

Then all that’s left is to scrape the goals for each player whose id we’ve scraped. Again, save this once run, especially as it takes a fair while to complete. For this article the data was scraped on the 5th February 2019

player_goals <- top_scorer_ids %>%
  #for each player scrape every goal
  lapply(., function(id) {
    goals <- rbind(
      paste0(base_url, "player_name", club_text1, id, club_text2) %>%
        get_club_goals(),
      paste0(base_url, "player_name", nt_text1, id, nt_text2) %>%
        get_nt_goals()
    ) %>%
      #remove NAS
      #this is where a player hasn't scored for their nation
      filter(!is.na(date)) %>%
      mutate(id = id)
  }) %>%
  do.call(rbind, .)

#and save
saveRDS(player_goals, "top_scorer_goals.rds")

Now we have a list of every goal scored by prolific strikers, we just have to group by each player and count how many dates they’ve scored on. To get the players with the highest number of unique dates we group by their id and count the length of the unique dates they’ve scored on.

days_per_player <- player_goals %>%
  #group by player
  group_by(id) %>%
  #count the dates scored on
  summarise(days = length(unique(date))) %>%
  arrange(-days) %>%
  #rejoin the name data back in
  left_join(.,
            player_goals %>%
              select(id, name) %>%
              unique(),
            by = "id") %>%
  print()
## # A tibble: 413 x 3
##    id     days name             
##    <chr> <int> <chr>            
##  1 8198    244 Cristiano Ronaldo
##  2 28003   210 Messi            
##  3 7349    203 Raúl             
##  4 3455    200 Ibrahimovic      
##  5 3207    189 Henry            
##  6 4257    187 Eto'o            
##  7 3924    173 Drogba           
##  8 48280   173 Cavani           
##  9 44352   172 Suárez           
## 10 7980    171 Villa            
## # ... with 403 more rows

so perhaps unsurprisingly, Ronaldo comes out on top. As expected given the data source, most of the top players are very recent strikers- all of the top 10 were active well into the 2010s. Ulf Kirsten and Toni Polster are the torchbearers for strikers from the 90s.

As always in these posts, I try to learn some new stuff as I do them. I thought this might be a good time to try some circular plotting. I don’t think the resultant plots actually inform that much but they are cool to look at.

circular_data <- player_goals %>%
  #filter out top 16 scorers
  filter(id %in% days_per_player$id[1:16]) %>%
  #group by month and player and sum
  mutate(month = gsub("\\/.*", "", date)) %>%
  group_by(id, month, competition) %>%
  summarise(goals = n()) %>%
  left_join(.,
            player_goals %>%
              select(id, name) %>%
              unique(),
            by = "id")

#too many competititon for legend
#sort out into broad groups
competition_types <- data.frame(competition = circular_data$competition) %>%
  unique() %>%
  mutate(competition = as.character(competition)) %>%
  mutate(competition_type = case_when(
    grepl("MLS", competition) ~ "Domestic",
    grepl("World Cup qualification| Qualifiers", competition) ~ "International",
    grepl("Friendlies", competition) ~ "International Friendlies",
    grepl("World Cup [0-9]{4}|Confederations|EURO [0-9]{4}", competition) ~ "International Tournament",
    grepl("UEFA|Champions League|UI Cup|Cup Winners|European Cup|Europa", competition) ~ "European",
    grepl("Club World", competition) ~ "International Club",
    grepl("Cup|cup|Pokal|copa|Copa|beker|Coupe|coppa|Kupasi|Trophée|Kupa", competition) ~ "Domestic Cup"
  )) %>%
  mutate(competition_type = ifelse(is.na(competition_type), "Domestic", competition_type)) %>%
  #convert to factor for plot fill order
  mutate(competition_type = fct_rev(factor(competition_type)))

#plot as circular radar plots
circular_data %>%
  left_join(., competition_types, by = "competition") %>%
  ggplot(., aes(x = month, y = goals, fill = competition_type)) +
  #convert to polar coordinates
  coord_polar(theta = "x", start = -.13) +
  geom_bar(stat = "identity") +
  scale_fill_discrete(name = "Competition Type") +
  ggtitle("Goals Per Month for Top 16 Unique Day Scorers") +
  facet_wrap(~name) +
  theme_minimal() +
  theme(axis.text = element_blank())

One nice thing that pops out is how Kirsten rarely scored in December/January- probably due to the Bundesliga mid season break.

I also found it interesting that Dirk Kuyt featured in the top 16, despite not being renowned as a great goalscorer.

#count total goals per player
left_join(
  days_per_player,
  player_goals %>%
    group_by(id) %>%
    summarise(goals = n()),
  by = "id"
) %>%
  #work out days/total goals
  mutate(proportion_unique = days / goals) %>%
  arrange(-days) %>%
  filter(days > 150) %>%
  select(name, goals, days, proportion_unique) %>%
  arrange(-proportion_unique)
## # A tibble: 34 x 4
##    name      goals  days proportion_unique
##    <chr>     <int> <int>             <dbl>
##  1 Gilardino   231   154             0.667
##  2 Cissé       269   166             0.617
##  3 Signori     260   160             0.615
##  4 Di Vaio     269   165             0.613
##  5 Toni        273   165             0.604
##  6 Kuyt        281   169             0.601
##  7 Lampard     258   154             0.597
##  8 Frei        257   153             0.595
##  9 Drogba      300   173             0.577
## 10 Trézéguet   265   152             0.574
## # ... with 24 more rows

When sorted by how evenly their goals/date coverage is (i.e. the ideal ratio would be 1 goal on every day), Dirk Kuyt pops up again (and did in fact score many more goals than I had assumed). Alberto Gilardino really stands out as a player who has maximum date coverage despite (relative to other members of the list!) a low number of total goals scored.

I’m not sure what, if any, insight that adds but is a cool piece of trivia.

First Losers

Answer: Manchester United in the top flight - 14 times

The fist question this week where I get to dive back into James Curley’s engsoccerdata package asks which teams have finished second in their league the most.

First lets load up the engsoccerdata for English leagues 1882-2016. I’ve munged it in a pretty verbose way; there’s definitely a faster way to do it but that’s not really necessary. All we need are the indicators used to sort the league (points, goal difference, and goal scored) for every match in a long format.

eng_data <- engsoccerdata::england %>%
  #select only pertinent variables
  select(Date, Season, home, visitor, hgoal, vgoal, division) %>%
  rename_all(tolower) %>%
  #melt the data to long format
  reshape2::melt(id.vars = c("date", "season", "hgoal", "vgoal", "division"),
                 value.name = "team", variable.name = "location") %>%
  #this can be done in one step but for sanity
  mutate(result = case_when(
    hgoal > vgoal & location == "home" ~ "W",
    vgoal > hgoal & location == "visitor" ~ "W",
    hgoal < vgoal & location == "home" ~ "L",
    vgoal < hgoal & location == "visitor" ~ "L",
    vgoal == hgoal ~ "D"
  )) %>%
  #points for a win changed in 1981
  mutate(points = case_when(
    result == "L" ~ 0,
    result == "D" ~ 1,
    result == "W" & season < 1981 ~ 2,
    result == "W" & season > 1980 ~ 3
  )) %>%
  #and get the goal info too
  mutate(goal_diff = case_when(
    location == "home" ~ hgoal - vgoal,
    location == "visitor" ~ vgoal - hgoal
  )) %>%
  mutate(goals = case_when(
    location == "home" ~ hgoal,
    location == "visitor" ~ vgoal
  )) %>%
  #only save the variables we care about then sort
  select(date, season, division, team, points, goals, goal_diff) %>%
  arrange(date, team)

Then find the final positions of each team in each season of English football sorted by points, goal difference and goals for

final_positions <- eng_data %>%
  setDT() %>%
  #find the match number
  .[, match := 1:.N, by = c("season", "team")] %>%
  #find the cumulative points, goal difference and goals for
  .[, season_points := cumsum(points), by = c("season", "team")] %>%
  .[, season_gd := cumsum(goal_diff), by = c("season", "team")] %>%
  .[, season_g := cumsum(goals), by = c("season", "team")] %>%
  #filter out the final matches totals and order
  .[.[, .I[match == max(match)], by= c("season", "division")]$V1] %>%
  .[order(season, division, -season_points)] %>%
  #assign the final positions
  .[, final_position := 1:.N, by = c("season", "division")] %>%
  .[, c("team", "division", "final_position")] %>%
  #count by final position
  .[, pos_count := .N, by = c("team", "division", "final_position")] %>%
  unique()

Then we can filter out those who have finished second most

second_place <- final_positions %>%
  #filter out second place finishes
  .[final_position == 2] %>%
  .[, c("team", "division", "final_position", "pos_count")] %>%
  .[order(-pos_count)]

head(data.frame(second_place), 10)
##                 team division final_position pos_count
## 1  Manchester United        1              2        14
## 2            Arsenal        1              2        12
## 3          Liverpool        1              2        11
## 4        Aston Villa        1              2         9
## 5   Sheffield United        2              2         8
## 6    Birmingham City        2              2         8
## 7            Everton        1              2         7
## 8         Sunderland        1              2         7
## 9  Preston North End        1              2         6
## 10   Manchester City        1              2         6

Perhaps unsurprisingly, most of the teams to finish second have finished second in the top flight. Manchester United lead the way with Arsenal and Liverpool following up.

Bristol City, Charlton Athletic, Oldham Athletic, Blackpool, QPR, Watford and Southampton have finished runners up in the top division without winning it, all having achieved this exactly once.

Plymouth Argyle have perhaps the most heartbreaking run of all though- having finished second in the old 3rd Division South SIX times in a row between 1922-1927 before finally winning it in 1929.

I thought I might as well also plot every teams league finishes as a proportion of their season in the league. Position here refers to total overall position (so 1st in Division two might be 21st overall). The darker the colour, the more likely the team was the end the season in that position. All teams have been sorted by their mean final league position.

eng_data %>%
  #filter out modern era
  filter(season > 1991) %>%
  setDT() %>%
  #find finish positions and count as above
  .[, match := 1:.N, by = c("season", "team")] %>%
  .[, season_points := cumsum(points), by = c("season", "team")] %>%
  .[, season_gd := cumsum(goal_diff), by = c("season", "team")] %>%
  .[, season_g := cumsum(goals), by = c("season", "team")] %>%
  .[.[, .I[match == max(match)], by= c("season", "division")]$V1] %>%
  .[order(season, division, -season_points)] %>%
  .[, final_position := 1:.N, by = c("season", "division")] %>%
  .[, total_position := 1:.N, by = c("season")] %>%
  .[, c("team", "division", "final_position", "total_position")] %>%
  .[, pos_count := .N, by = c("team", "division", "final_position")] %>%
  .[, team_appearances := .N, by = c("team")] %>%
  .[, mean_pos := sum(total_position)/team_appearances, by = c("team")] %>%
  unique() %>%
  #order by mean position
  .[order(mean_pos)] %>%
  .[, team := fct_rev(fct_relevel(as.factor(team), unique(.$team)))] %>%
  #plot
  ggplot(., aes(x = total_position, y = team)) + 
  geom_tile(aes(alpha = pos_count/team_appearances), fill = "blue") +
  scale_alpha_continuous(guide = FALSE) +
  ggtitle("Teams Ordered by Mean Final Position 1992-2016",
          subtitle = "Weight indicates proportion of finishes in that position") +
  xlab("Total League Position") +
  theme_minimal() +
  theme(axis.text.y = element_text(angle = 10))

There’s probably too much data to graph here, but it’s still a fun way to look at 140 years of English football

Slip Slidin’ Away

Answer: Rochdale with 8 games to go in Division 3 in 1973/1974

A similar question involves the earliest teams to get relegated. Obviously for this first we need to know how many teams are relegated from each league per season. Having only really started watching football around the turn of the millenium I was a bit surprised how much this has changed over the years (n.b. I’m only counting automatic relegation- playoffs and re-elections don’t count).

#manually enter the number of relegation spots per league
relegation_spots <- eng_data %>%
  .[, c("season", "division")] %>%
  unique() %>%
  mutate(relegation_spots = case_when(
    season >= 1995 & division == 1 ~ 3,
    season >= 1994 & division == 1 ~ 4,
    season >= 1991 & division == 1 ~ 3,
    season >= 1990 & division == 1 ~ 2,
    season >= 1973 & division == 1 ~ 3,
    season >= 1898 & division == 1 ~ 2,
    season >= 1995 & division == 2 ~ 3,
    season >= 1994 & division == 2 ~ 4,
    season >= 1991 & division == 2 ~ 3,
    season >= 1990 & division == 2 ~ 2,
    season >= 1988 & division == 2 ~ 3,
    season >= 1986 & division == 2 ~ 2,
    season >= 1973 & division == 2 ~ 3,
    season >= 1921 & division == 2 ~ 2,
    season >= 1920 & division == 2 ~ 1,
    season >= 1919 & division == 2 ~ 3,
    season >= 1995 & division == 3 ~ 4,
    season >= 1994 & division == 3 ~ 5,
    season >= 1991 & division == 3 ~ 4,
    season >= 1990 & division == 3 ~ 3,
    season >= 1988 & division == 3 ~ 4,
    season >= 1986 & division == 3 ~ 3,
    season >= 1958 & division == 3 ~ 4,
    season >= 2002 & division == 4 ~ 2,
    season >= 1996 & division == 4 ~ 1,
    season >= 1993 & division == 4 ~ 0,
    season >= 1992 & division == 4 ~ 1,
    season >= 1990 & division == 4 ~ 0,
    season >= 1986 & division == 4 ~ 1
 )) %>%
  mutate(relegation_spots = ifelse(relegation_spots == 0, NA, relegation_spots))

We then need to work out how many points each team has, and how many they could possibly achieve, after every match in a season.

possible_positions <- eng_data %>%
    setDT() %>%
  #get the match number
  .[, match := 1:.N, by = c("season", "team")] %>%
  #get the current points for the team
  .[, season_points := cumsum(points), by = c("season", "team")] %>%
  .[order(division, season, match, -season_points)] %>%
  #get the current position for the team
  .[, position := 1:.N, by = c("division", "season", "match")] %>%
  #how many teams are in the league
  .[, teams := max(position), by = c("division", "season")] %>%
  #find how many matches each team has left to play
  .[, matches_remaining := max(match) - match, by = c("division", "season")] %>%
  #the max points assumes each team wins all of their remaining matches
  .[season < 1981, possible_points := season_points + (matches_remaining * 2)] %>%
  .[season > 1980, possible_points := season_points + (matches_remaining * 3)] %>%
  #merge in the relegation spots and find what position each team needs to be safe
  merge(., relegation_spots, by = c("division", "season")) %>%
  .[, lowest_safe_position := teams - relegation_spots] %>%
  #the threshold for safety is the number of points the lowest safe team has
  .[position == lowest_safe_position, lowest_safe_points := season_points]

Then it’s a simple case of finding teams in the relegation zone and finding the point at which they can no longer catch the lowest safe team

relegation_secured <- possible_positions %>%
  #filter out teams in relegation trouble
  .[!is.na(lowest_safe_position)] %>%
  .[position >= lowest_safe_position] %>%
  .[, lowest_safe_points := na.locf(lowest_safe_points)] %>%
  .[possible_points < lowest_safe_points] %>%
    .[, c("season", "division", "team", "season_points",
          "matches_remaining")] %>%
  .[order(-matches_remaining)] %>%
  #remove duplicates
  .[!duplicated(paste0(season, division, team))]

head(data.frame(relegation_secured), 15)
##    season division                team season_points matches_remaining
## 1    1973        3            Rochdale            16                 8
## 2    1984        1          Stoke City            17                 7
## 3    2001        2    Stockport County            17                 7
## 4    2003        2           Wimbledon            21                 7
## 5    2016        2    Rotherham United            17                 7
## 6    1961        3      Newport County            18                 7
## 7    1984        3    Cambridge United            17                 7
## 8    1993        3              Barnet            23                 7
## 9    2000        3       Oxford United            22                 7
## 10   1930        1   Manchester United            16                 6
## 11   1954        1 Sheffield Wednesday            19                 6
## 12   1975        1    Sheffield United            13                 6
## 13   1994        1        Ipswich Town            23                 6
## 14   2007        1        Derby County            11                 6
## 15   1952        2            Barnsley            17                 6

So Rochdale hold the questionable honour of being the team knowing they are doomed with the most matches to go (with 8 in division 3 in 1973). There’s quite a large chasing pack of teams who have known with 7 or 6 matches left too. Ipswich are currently ‘only’ 8 points off of safety with 16 games left to go so seems unlikely to beat 8 but it could be close…

I also wanted to see what the earliest a team has ever been certain of their final position is.

certain_final_positions <- possible_positions %>%
  #find the possible points for the teams above and below each team
  .[, poss_points_nextworst := lead(possible_points), by = c("season", "division", "match")] %>%
  .[, points_nextbest := lag(season_points), by = c("season", "division", "match")] %>%
  #filter out teams that cannot beat/fall below the next best/worst teams
  .[(is.na(poss_points_nextworst) | season_points > poss_points_nextworst) & 
      (is.na(points_nextbest) | possible_points < points_nextbest) &
      matches_remaining > 0] %>%
  #order and select columns
  .[order(-matches_remaining)] %>%
  .[, c("division", "season", "team", "position", "matches_remaining", "teams")]

head(data.frame(certain_final_positions), 15)
##    division season                 team position matches_remaining teams
## 1         4   1968 Bradford Park Avenue       24                 7    24
## 2         1   1984           Stoke City       22                 6    22
## 3         2   1949    Tottenham Hotspur        1                 6    22
## 4         2   1971              Watford       22                 6    22
## 5         2   1973        Middlesbrough        1                 6    22
## 6         2   2001     Stockport County       24                 6    24
## 7         3   1966  Queens Park Rangers        1                 6    24
## 8         3   1984     Cambridge United       24                 6    24
## 9        3b   1952              Walsall       24                 6    24
## 10        4   1968 Bradford Park Avenue       24                 6    24
## 11        4   1977              Watford        1                 6    24
## 12        4   1997         Notts County        1                 6    24
## 13        1   1980       Crystal Palace       22                 5    22
## 14        1   1982            Liverpool        1                 5    22
## 15        1   1984              Everton        1                 5    22

Spare a thought for fans of Bradford Park Avenue in 1968-1969 who knew their team would finish bottom of the 3rd Division North with 7 matches (of 46) remaining. Luckily they weren’t relegated as they were already in the bottom division of the football league and we re-elected for the next season. They repeated this feat, now in Division 4, 5 years later, finishing bottom with 6 games to go.

Most of these involve teams either winning or finishing bottom of their league. If we filter these out we’re left with:

certain_final_positions %>%
  #filter off bottom or top teams
  .[position != 1 & position != teams] %>%
  data.frame() %>%
  head(., 15)
##    division season                team position matches_remaining teams
## 1        3a   1931        New Brighton       20                 4    22
## 2        3a   1931            Rochdale       21                 4    22
## 3         2   2003       Bradford City       23                 3    24
## 4         2   2005    Sheffield United        2                 3    24
## 5        3a   1931        New Brighton       20                 3    22
## 6        3a   1931            Rochdale       21                 3    22
## 7        3b   1929           Brentford        2                 3    22
## 8         4   1975    Northampton Town        2                 3    24
## 9         1   1888         Aston Villa        2                 2    12
## 10        1   1930         Aston Villa        2                 2    22
## 11        1   1934          Sunderland        2                 2    22
## 12        1   1946           Brentford       21                 2    22
## 13        1   1957   Preston North End        2                 2    22
## 14        1   1970             Burnley       21                 2    22
## 15        1   1978 Queens Park Rangers       20                 2    22

Which really emphasizes how only exceptionally good/bad teams are ever really certain of their position before the end of the season.

Draw Me Like One of Your Top Flight Teams

““Tottenham have currently played 29 consecutive Premier League games without drawing one,” notes Wouter van Dael. “What is the longest ever such league run?””

Answer: Aston Villa - 50 games in 1891/1892, but otherwise Spurs in modern football

(N.b. the engsoccerdata package only has data up until the 2016/2017 season, so Tottenham’s run won’t appear in the results below)

To do this we just need to select every game with a non-zero goal difference (i.e. a draw) and then find consecutive runs for teams.

draws <- eng_data %>%
  setDT() %>%
  .[order(team, date)] %>%
  #give each match a consecutive 'id'
  .[, game_id := 1:.N, by = team] %>%
  #find matches with a non zero goal difference (not a draw)
  .[goal_diff != 0] %>%
  #find consecutive matches with non zero goal difference
  .[, consecutive := lead(game_id) - game_id, by = team] %>%
  .[consecutive != 1, consecutive := NA] %>%
  #count all consecutive runs
  .[, count := .N*!is.na(consecutive), rleid(!is.na(consecutive))] %>%
  #find the start and end of each run
  .[count != lead(count) | count != lag(count)] %>%
  .[order(-count, team, date)] %>%
  #set the start points and end point and spread to separate columns
  .[, run_point := c("start", "end")] %>%
  spread(run_point, date) %>%
  .[, start := na.locf(start)] %>%
  .[!is.na(end) & !is.na(start)] %>%
  #select columns to print
  .[, c("start", "end", "division", "team", "count")] %>%
  .[order(-count)]

head(data.frame(draws), 25)
##         start        end division                    team count
## 1  1891-01-01 1892-12-10        1             Aston Villa    50
## 2  1895-03-30 1896-11-09        1              Stoke City    45
## 3  1907-12-26 1909-02-27        1              Sunderland    45
## 4  1913-01-01 1914-02-14        2    Bradford Park Avenue    43
## 5  1909-01-30 1910-03-25        2          Leicester City    43
## 6  1894-01-06 1896-09-05        2                 Walsall    43
## 7  1892-12-10 1894-09-15        1         Birmingham City    42
## 8  1896-03-21 1897-09-25        2                  Darwen    39
## 9  1928-03-17 1929-02-02        1              Portsmouth    37
## 10 1904-10-22 1905-11-11        1        Sheffield United    37
## 11 1947-11-15 1948-09-25       3b          Bristol Rovers    36
## 12 1930-05-03 1931-03-28        2 Wolverhampton Wanderers    36
## 13 1894-03-23 1895-04-20        2            Lincoln City    35
## 14 1946-10-12 1947-08-23       3a        Stockport County    35
## 15 1904-01-30 1904-12-27        2            Bristol City    34
## 16 1915-04-03 1920-02-21        1    West Bromwich Albion    34
## 17 1934-12-01 1935-09-16        2        Doncaster Rovers    33
## 18 1895-11-16 1896-11-28        2        Newcastle United    33
## 19 1925-05-02 1926-02-22        2           Middlesbrough    32
## 20 1935-08-31 1936-03-14       3b                 Reading    32
## 21 1925-12-26 1926-10-02       3a                Rochdale    32
## 22 1896-01-20 1897-02-13        2        Burton Wanderers    31
## 23 1927-12-24 1928-09-15       3a      Accrington Stanley    30
## 24 1905-11-04 1906-09-03        2    Gainsborough Trinity    30
## 25 1946-11-23 1947-08-25        2         Plymouth Argyle    30

Tottenham’s run without a draw doesn’t even make the top 25 such runs! And they still would have to wait until at least next season until they can match Aston Villa’s run from New Years Day 1891 until Christmas Eve in 1892, a run of 50 matches without a draw- a run that included 23 losses, and 27 wins. It is the longest for quite sometime though- there’s few similar runs in the post-war years.

We can restrict this easily to just runs in the top division my modifying one line

draws <- eng_data %>%
  setDT() %>%
  .[order(team, date)] %>%
  .[, game_id := 1:.N, by = team] %>%
  #filter out only the top division matches
  .[division == 1] %>%
  .[goal_diff != 0] %>%
  .[, consecutive := lead(game_id) - game_id, by = team] %>%
  .[consecutive != 1, consecutive := NA] %>%
  .[, count := .N*!is.na(consecutive), rleid(!is.na(consecutive))] %>%
  .[count != lead(count) | count != lag(count)] %>%
  .[order(-count, team, date)] %>%
  .[, run_point := c("start", "end")] %>%
  spread(run_point, date) %>%
  .[, start := na.locf(start)] %>%
  .[!is.na(end) & !is.na(start)] %>%
  .[, c("start", "end", "division", "team", "count")] %>%
  .[order(-count)]

head(data.frame(draws), 10)
##         start        end division                    team count
## 1  1891-01-01 1892-12-10        1             Aston Villa    50
## 2  1895-03-30 1896-11-09        1              Stoke City    45
## 3  1907-12-26 1909-02-27        1              Sunderland    45
## 4  1928-03-17 1929-02-02        1              Portsmouth    37
## 5  1904-10-22 1905-11-11        1        Sheffield United    37
## 6  1915-04-03 1920-02-21        1    West Bromwich Albion    34
## 7  1895-09-28 1896-09-12        1 Wolverhampton Wanderers    29
## 8  1964-09-12 1965-03-31        1             Aston Villa    28
## 9  1953-04-25 1954-01-02        1                 Burnley    28
## 10 1891-03-14 1892-04-30        1              Sunderland    28

Where we can see that Spurs’ run is at least the longest modern top flight drawless run.

Related