In what is becoming a repeated series, I enjoy answering trivia questions from The Guardian’s The Knowledge football trivia column.
A few questions have built up that seemed amenable to coding answers, so I’ve taken a stab at them here.
#munging
library(tidyverse)
library(data.table)
library(zoo)
#english football data
library(engsoccerdata)
#web data scraping
library(rvest)
#plotting
library(openair)

Calendar Boys

The first question this week concerns players scoring on (or nearest to) every day of the year
Riddler Classic

In my spare time I enjoy solving 538’s The Riddler column. This week I had a few spare hours waiting for the Super Bowl to start and decided to code up a solution to the latest problem to keep me busy.
The question revolves around a card game in which whatever choice a player makes, they are likely to lose to a con artist. Formally this is phrased as:
Every so often a question on The Guardian’s The Knowledge football trivia section piques my interest and is amenable to analysis using R. Previously, I looked at club name suffixes and young World Cup winners last August. This week (give or take), a question posed on Twitter caught my attention:
@TheKnowledge_GU was just chatting to some colleagues in the kitchen at work about why Essex doesn't have many big football clubs and it got me thinking.
Ever since I started coding a few years ago, I’ve been interested in how data science could help predict football results, identify footballing talent, and just generally ‘solve’ football.
One of the major problems with analysing football has been the availability of data. Though there’s a lot of great published work freely available to read, much of the cutting-edge work revolves around advanced metrics, such as expected goals, for which the data is hard to get.
Given it’s the new year, I decided to try and get back onto more regular posting on this blog (mostly just to build up a portfolio of work).
A quick way to get something to work with that can be published unpolished is #TidyTuesday on Twitter, which (as far as I know/can tell) is organised by Thomas Mock from RStudio.
This week, the data comes in the form of a massive corpus of every tweet using the #rstats hashtag, curated by rtweet package creator Mike Kearney.
A few weeks ago I went on the first pub crawl I’d been on in years around my city of Cambridge. Around the same time I had also been visiting 4 very good pubs within ~200m of each other, tucked away in a quiet neighbourhood of the town. Together, these made me wonder whether it was possible, with freely available data, to plan an optimal pub crawl around any town/area of the UK, and also whether it would be feasible to visit every pub within the city in a single day if travelling optimally.
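To give a flavour of what "travelling optimally" involves, here is a minimal sketch in base R of a greedy nearest-neighbour route. It is not necessarily the method used in the full post: the coordinates are made up for illustration, `nearest_neighbour_route()` is a hypothetical helper, and the heuristic gives a reasonable route rather than a guaranteed shortest one.

```r
# Hypothetical pub coordinates (made up for illustration, not real pubs)
pubs <- data.frame(
  name = c("A", "B", "C", "D", "E"),
  x = c(0, 3, 1, 4, 0),
  y = c(0, 0, 1, 3, 2)
)

# Pairwise straight-line distances between all pubs
dists <- as.matrix(dist(pubs[, c("x", "y")]))

# Greedy nearest-neighbour heuristic: from the current pub, always walk to
# the closest pub that has not been visited yet
nearest_neighbour_route <- function(dists, start = 1L) {
  n <- nrow(dists)
  route <- integer(n)
  route[1] <- as.integer(start)
  visited <- rep(FALSE, n)
  visited[start] <- TRUE
  for (i in seq_len(n - 1)) {
    candidates <- dists[route[i], ]
    candidates[visited] <- Inf      # never revisit a pub
    nxt <- which.min(candidates)
    route[i + 1] <- nxt
    visited[nxt] <- TRUE
  }
  route
}

route <- nearest_neighbour_route(dists)

# Total length of the crawl: sum of the distances between consecutive stops
route_length <- sum(dists[cbind(head(route, -1), tail(route, -1))])

pubs$name[route]
```

With real data, the straight-line distances would ideally be swapped for walking distances, and the greedy route could serve as a starting point for a proper travelling salesman solver.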
Whilst getting some work done browsing Twitter at work today, I came across this tweet from the always excellent John Burn-Murdoch on the scourge of heatmaps. What’s most frustrating about these maps is that ggplot2 (which is underrated as mapping software, especially when combined with packages like sf in R) makes it super easy to create these bland, uninformative maps.
For instance, let’s load some mapping libraries:
library(tidyverse)
library(sf)
library(rgdal)

For this blog I’m going to use data of bus stops in London, because there’s an absolute ton of them and because I love the London Datastore and it was the first public, heavy, point data file I came across.
Recently, I saw two tweets with stunning examples of maps, by Paul Campbell here and (taking inspiration from the first) by Imer Muhović here.
The basic idea of the dot choropleths is to visualise not only the spatial clustering of each variable but also the number of observations (something traditional ‘filled’ choropleths don’t do). More importantly, the maps also just look really, really cool.
I had a few spare minutes during work on Friday, which I used to tidy this up into a package that calculates the random position of dots for such maps. It can be found on my GitHub.
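As a rough illustration of what "calculating the random position of dots" means, here is a minimal base R sketch using rejection sampling inside a single polygon. This is not the package's code: the polygon is made up, and `point_in_polygon()` and `random_dots()` are hypothetical helpers written for this example.

```r
set.seed(42)

# Hypothetical region: a simple polygon given by its vertex coordinates
poly_x <- c(0, 4, 4, 2, 0)
poly_y <- c(0, 0, 3, 5, 3)

# Ray-casting test: count how many polygon edges a horizontal ray from the
# point crosses; an odd number of crossings means the point is inside
point_in_polygon <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    if ((vy[i] > py) != (vy[j] > py)) {
      x_at_py <- vx[i] + (py - vy[i]) * (vx[j] - vx[i]) / (vy[j] - vy[i])
      if (px < x_at_py) inside <- !inside
    }
    j <- i
  }
  inside
}

# Rejection sampling: draw points uniformly in the bounding box and keep
# only those that fall inside the polygon, until n_dots have been kept
random_dots <- function(n_dots, vx, vy) {
  dots <- matrix(NA_real_, nrow = n_dots, ncol = 2,
                 dimnames = list(NULL, c("x", "y")))
  filled <- 0
  while (filled < n_dots) {
    px <- runif(1, min(vx), max(vx))
    py <- runif(1, min(vy), max(vy))
    if (point_in_polygon(px, py, vx, vy)) {
      filled <- filled + 1
      dots[filled, ] <- c(px, py)
    }
  }
  as.data.frame(dots)
}

# One dot per observation, e.g. 50 observations in this region
dots <- random_dots(50, poly_x, poly_y)
head(dots)
```

For real map geometries, `st_sample()` from the sf package does this kind of sampling directly, so the hand-rolled version above is only there to show the idea.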
The Guardian publish a weekly set of questions and answers on a variety of football minutiae at The Knowledge. Fortunately, some of these are extremely tractable using R, so I thought I’d have a go at working through the archives to see if I can shed light on any of the questions.
library(rvest)
library(dplyr)
library(magrittr)
library(data.table)
library(zoo)
library(ggplot2)
library(stringr)
#jalapic/engsoccerdata
library(engsoccerdata)

We Ain’t Going To The Town..

‘This season, Tranmere Rovers return to contest League Two alongside eight teams with the suffix Town, including six successive fixtures against these clubs over the New Year.
Recently, a Yorkshire national football team appeared in a league of national teams for stateless people. This got me wondering how the historic counties of the UK would do at the World Cup. Could any of them compete with full international teams?
I published the complete code for that article on this blog this week. However, one question which I kept being asked was how an ‘All of the UK’ team would do (i.