TidyTuesday Week One

Given it’s the new year, I decided to try to get back to more regular posting on this blog (mostly just to build up a portfolio of work).

A quick way to get something to work with that can be published unpolished is #TidyTuesday on Twitter, which (as far as I know/can tell) is organised by Thomas Mock from RStudio.

This week, the data comes in the form of a massive corpus of every tweet using the #rstats hashtag, curated by Mike Kearney, creator of the rtweet package.

I’m only going to leave sparse notes, as this is just a post from some lunchtime work, cleaned up and published afterwards. I probably won’t fully spellcheck it either.

First, libraries:

#for data manipulation
library(tidyverse)
#for the compound assignment pipe (%<>%) used below
library(magrittr)
#used for clustering later
library(lsa)
library(e1071)

When loading the data, the first thing I decided to look at was the evolution of the hashtag’s use over time. As far as I can tell, it was first used in spring 2009 by Giuseppe Paleologo. Since then, it’s grown pretty much exponentially.

#data at https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-01-01
rstats_data <- readRDS("../../Downloads/rstats_tweets.rds")

#quickly plot tweets over time
p <- rstats_data %>%
  select(created_at) %>%
  arrange(created_at) %>%
  mutate(total_tweets = row_number()) %>%
  ggplot(., aes(x = created_at, y = total_tweets)) +
  geom_line() +
  xlab("Date") +
  ylab("Total #rstats Tweets") +
  ggtitle("#rstats Tweets Over Time") +
  theme_minimal()

p

I decided to work only with the most prolific #rstats tweeters, mostly to save space in the plots, as the corpus contains over 26k unique users and 430k tweets.

#filter out people who tweet about rstats >=500 times
rstats_data %<>% 
  group_by(user_id) %>%
  mutate(tweet_count = n()) %>%
  filter(tweet_count > 499) %>%
  ungroup() %>%
  arrange(-tweet_count) %>%
  #also filter out feeds
  filter(!screen_name %in% c("CRANberriesFeed", "Rbloggers", "rweekly_live", "tidyversetweets"))

#plot the number of tweets per person
p2 <- rstats_data %>%
  ggplot(., aes(x = reorder(user_id, tweet_count))) +
  geom_bar(stat = "count") +
  ggtitle("Rstats Tweets By Person") +
  xlab("User") +
  ylab("Tweets") +
  theme_minimal() +
  theme(axis.text.x = element_blank())

p2

Let’s see who the most prolific tweeters are.

#show the most prolific tweeters
rstats_users <- rstats_data %>%
  select(screen_name, tweet_count) %>%
  unique()

head(rstats_users)
## # A tibble: 6 x 2
##   screen_name tweet_count
##   <chr>             <int>
## 1 AndySugs           8216
## 2 dataandme          4113
## 3 gp_pulipaka        3237
## 4 DerFredo           3091
## 5 revodavid          2640
## 6 MangoTheCat        2523

I had been interested in recreating some analyses from https://www.jtimm.net/2018/11/03/twitter-political-ideology-and-the-115-us-senate/ recently, and thought this gave a good opportunity.

First, I needed the top-level domains of the links in #rstats tweets.

#try to find only top level domains for grouping
domain_patterns <- "\\.com.*|\\.org.*|\\.me.*|\\.gl.*|\\.li.*|\\.appspot.*|\\.blogspot.*|\\.io.*"
links <- data.frame(url = unlist(rstats_data$urls_url)) %>%
  mutate(domain = gsub(domain_patterns, "", url)) %>%
  filter(!is.na(domain)) %>%
  group_by(domain) %>%
  mutate(share_count = n()) %>%
  ungroup()

#which domains are shared most often by the top tweeters
head(links %>% select(-url) %>% unique() %>% arrange(-share_count))
## # A tibble: 6 x 2
##   domain         share_count
##   <chr>                <int>
## 1 goo                   4724
## 2 wp                    4110
## 3 github                3430
## 4 twitter               3201
## 5 cran.r-project        2878
## 6 r-bloggers            2708

Some of these (e.g. goo.gl, wp, fb, bit.ly) seem to be mostly quick shortened links to pictures, and so were removed. I also cut out links to Amazon, Google, Facebook, and YouTube, which I was less certain about doing and would probably look at in a deeper analysis.

#remove non-data sciencey links
links %>%
  filter(!grepl("goo|wp|tweetedtimes|fb|htl|facebook|youtube|amazon|google", domain)) %>%
  filter(!grepl("activevoice.us|ift.tt|rviv.ly|bit.ly", domain)) %>%
  select(-url) %>%
  unique() %>%
  arrange(-share_count) %>%
  head()
## # A tibble: 6 x 2
##   domain                   share_count
##   <chr>                          <int>
## 1 github                          3430
## 2 twitter                         3201
## 3 cran.r-project                  2878
## 4 r-bloggers                      2708
## 5 link.rweekly                    2415
## 6 blog.revolutionanalytics        1225

Then we need to create a matrix of each domain vs. each user, with a value for how many tweets from that user link to that domain.

I selected three users to illustrate the finished matrix (from here on out I’m freely stealing code from the blog post linked above).

#find which domain each tweeted link belongs to
rstats_domains_shared <- rstats_data %>%
  select(user_id, screen_name, url = urls_url, date = created_at) %>%
  #remove tweets without links
  filter(!is.na(url)) %>%
  #unlist the links
  #can be multiple per tweet
  splitstackshape::listCol_l(., listcol = "url") %>%
  #merge with domain information
  merge(., unique(select(links, domain, url_ul = url, domain_shares = share_count)), by = "url_ul") %>%
  #select only domains shared 100 or more times
  filter(domain_shares > 99) %>%
  #remove uninteresting domains
  filter(!grepl("goo|wp|tweetedtimes|fb|htl|facebook|youtube|amazon|google", domain)) %>%
  filter(!grepl("activevoice.us|ift.tt|rviv.ly|bit.ly", domain)) %>%
  #limit to only frequent tweeters
  filter(screen_name %in% rstats_users$screen_name)

#get a matrix of domains shared vs. users
rstats_shares_by_user <- rstats_domains_shared %>%
  #find the number of times each user tweets a link to a domain
  group_by(screen_name, domain) %>%
  summarize(share_count = n()) %>%
  #safety check: drop zero counts (n() is always >= 1, so this is a no-op)
  filter(share_count > 0) %>%
  spread(screen_name, share_count) %>%
  replace(is.na(.), 0)  %>%
  ungroup()

#quickly glance at this
#has many columns so selecting only a few users
users <- c("hadleywickham", "drob", "JennyBryan")
rstats_shares_by_user %>%
  .[c(1, which(names(rstats_shares_by_user) %in% users))] %>%
  .[1:10,]
## # A tibble: 10 x 4
##    domain                    drob hadleywickham JennyBryan
##    <chr>                    <dbl>         <dbl>      <dbl>
##  1 analyticsvidhya              0             0          0
##  2 andrewgelman                 0             0          0
##  3 arilamstein                  0             0          0
##  4 asbcllc                      0             0          0
##  5 bl.ocks                      0             0          0
##  6 blog.revolutionanalytics     0             5          0
##  7 blog.rstudio                 1           115          6
##  8 cran.r-project               6            12         21
##  9 cran.rstudio                 0             1          1
## 10 datasciencecentral           0             0          0

Next, we use cosine from the lsa package to get a matrix of user-user similarity. This is then crushed down to two dimensions, X1 and X2, using classical multidimensional scaling (cmdscale).
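For intuition: cosine similarity between two users’ domain-share vectors is just their dot product divided by the product of their norms. A quick base-R sketch with made-up counts (hypothetical, not the real data):

```r
#two made-up users' domain-share vectors (hypothetical counts)
u1 <- c(github = 5, cran = 2, rbloggers = 0)
u2 <- c(github = 4, cran = 1, rbloggers = 3)

#cosine similarity: dot product over the product of the norms
cosine_sim <- sum(u1 * u2) / (sqrt(sum(u1^2)) * sqrt(sum(u2^2)))
round(cosine_sim, 2)
#> 0.8
```

Two users who link to the same domains in the same proportions get a similarity near 1, regardless of how many tweets they each send.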

#find the cosine similarity between all users
cosine_rstats <- rstats_shares_by_user %>%
  select(2:ncol(.)) %>%
  data.matrix() %>%
  lsa::cosine(.)

#scale the similarities down to two dimensions
#X1 and X2
rstats_clustering <- cmdscale(1-cosine_rstats, eig = TRUE, k = 2)$points %>% 
  data.frame() %>%
  mutate(screen_name = rownames(cosine_rstats)) %>%
  merge(rstats_users, by = "screen_name")

head(rstats_clustering)
##       screen_name         X1         X2 tweet_count
## 1       _ColinFay -0.1192867 -0.2821199         989
## 2        abresler -0.1712703 -0.3224543        1443
## 3     AnalyticsFr -0.3210288  0.4201589        1386
## 4 AnalyticsFrance -0.3210288  0.4201589        1989
## 5 AnalyticsVidhya -0.2969805  0.4152374        1814
## 6        AndySugs  0.1371950  0.2465780        8216

If we plot this we get a nice graph of the top #rstats users, which fall neatly into two dimensions. The first, X1, seems to be ‘social’ vs. ‘professional’. People further to the left are users I recognise off the top of my head for sharing amateur data analyses/package building (e.g. JennyBryan), whereas those on the right seem to be more industrial users (e.g. MangoTheCat).

The second dimension is a bit harder to gauge, but strikes me as a sort of software vs. data science divide, with more package creators/RStudio employees towards the bottom and people doing analysis of data towards the top (but this is only a gut feeling).

#plot the users by their cosine similarity and number of tweets
rstats_clustering %>%
  ggplot(aes(X1,X2)) +
  geom_text(aes(label= screen_name, size = tweet_count), alpha = 0.3) +
  scale_size_continuous(range = c(2,5), guide = FALSE) +
  xlab("Dimension X1") +
  ylab("Dimension X2") +
  ggtitle("#rstats Tweeters Arranged by Links Shared",
          subtitle = "position based on cosine similarity between users") +
  theme_minimal()

To investigate a bit further, I decided to see what each person was sharing. First, I used fuzzy c-means clustering (something else I was working on in a separate project recently) to cluster each user based on their cosine similarity, mostly just to have something to order the final plot by.

I then used geom_tile to show how often each user was sharing links from which domains. Roughly, I would say that the ‘industrial’ (green) cluster shows a concentration of links to sites such as r-bloggers and Revolution Analytics’ blog, whereas the ‘social data science’ cluster (blue) links much more to Twitter itself, GitHub, and CRAN. The red (‘software’) cluster links to these too, but especially to the r-project blog in particular.
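As a reminder of what cmeans gives back (and what gets used below): a hard $cluster assignment per row, plus a soft $membership matrix whose rows sum to 1. A tiny sketch on made-up, well-separated 2-D points (toy data, not the tweet corpus):

```r
library(e1071)

set.seed(1)
#three well-separated blobs of made-up 2-D points
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2),
             matrix(rnorm(20, mean = -5), ncol = 2))

toy <- cmeans(pts, centers = 3, iter.max = 1000)
toy$cluster                    #hard assignment per point
range(rowSums(toy$membership)) #soft memberships always sum to 1
```

The maximum membership per row is what I use later to shade how strongly each user belongs to their assigned cluster.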

set.seed(22081992)
#use fuzzy c means to find clusters based on cosine similarity
#chose 3 as seems to be 3 clear nodes
c_grouping <- cmeans(select(rstats_clustering, X1, X2) %>% as.matrix(), 3, iter.max = 1000)

#merge this data in
rstats_clustering %<>%
  mutate(cluster = c_grouping$cluster) %>%
  cbind(as.data.frame(c_grouping$membership)) %>%
  mutate(cluster_membership = apply(.[, (ncol(.)-(max(.$cluster)-1)):ncol(.)], 1, max))

#plot a heatmap of links shared vs. cluster grouping
#remember cluster grouping is related to cosine similarity
rstats_shares_by_user %>%
  reshape2::melt(id.vars = "domain", variable.name = "screen_name", value.name = "shares") %>%
  merge(rstats_clustering, by = "screen_name") %>%
  filter(shares > 0) %>%
  ggplot(., aes(x = domain, y = reorder(screen_name, cluster + cluster_membership))) +
  geom_tile(aes(fill = log(shares), colour = factor(cluster)), size = 0.5) +
  scale_fill_viridis_c(option = "plasma", guide = FALSE) +
  scale_colour_manual(values = c("red", "blue", "green", "purple"), guide = FALSE) +
  xlab("Domain Shared") +
  ylab("Screen Name") +
  ggtitle("Domains Shared by #rstats Tweeters Coloured by User Cluster") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Finally, I wanted to recreate the previous cosine similarity graph but with the clusters highlighted just because I think it makes a pretty graph.

#replot our initial plot of cosine similarity with the cluster information
#alpha of screen_name indicates group membership strength
rstats_clustering %>%
  ggplot(aes(X1, X2)) +
  geom_label(aes(label= screen_name, fill = factor(cluster), colour = cluster_membership, size = tweet_count), alpha = 0.3) +
  scale_colour_gradient(high = "black", low = "white", guide = FALSE) +
  scale_fill_manual(values = c("red", "blue", "green", "purple"), guide = FALSE) +
  scale_size_continuous(range = c(2,5), guide = FALSE) +
  xlab("Dimension X1") +
  ylab("Dimension X2") +
  ggtitle("#rstats Tweeters Grouped by Links Shared",
          subtitle = "grouping based on cosine similarity between users") +
  theme_minimal()

That’s all for this post. I think I’ll keep throwing up quick #TidyTuesday posts throughout the year, which will be as sparse as this but hopefully interesting to one or two people.