Given it’s the new year, I decided to try to get back to more regular posting on this blog (mostly just to build up a portfolio of work).
This week, the data comes in the form of a massive corpus of every tweet using the #rstats hashtag, curated by rtweet package creator Mike Kearney.
I’m only going to leave sparse notes, as this is just a post from some lunchtime work, cleaned up and published afterwards. I probably won’t fully spellcheck it either.
```r
#for #tidytuesday data manipulation
library(tidyverse)
library(magrittr) #for %<>%
#used for clustering later
library(lsa)
library(e1071)
```
When loading the data, the first thing I decided to look at was the evolution of the hashtag’s use over time. As far as I can tell, it was first used in spring 2009 by Giuseppe Paleologo. Since then, it’s grown pretty much exponentially.
```r
#data at https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-01-01
#rstats_data <- readRDS("../../Downloads/rstats_tweets.rds")

#quickly plot tweets over time
p <- rstats_data %>%
  select(created_at) %>%
  arrange(created_at) %>%
  mutate(total_tweets = row_number()) %>%
  ggplot(., aes(x = created_at, y = total_tweets)) +
  geom_line() +
  xlab("Date") +
  ylab("Total #rstats Tweets") +
  ggtitle("#rstats Tweets Over Time") +
  theme_minimal()

p
```
I decided to work only with the most prolific #rstats tweeters, mostly to save space in plots, as the corpus contains over 26k unique users and 430k tweets.
```r
#filter out people who tweeted about #rstats fewer than 500 times
rstats_data %<>%
  group_by(user_id) %>%
  mutate(tweet_count = n()) %>%
  filter(tweet_count > 499) %>%
  ungroup() %>%
  arrange(-tweet_count) %>%
  #also filter out feeds
  filter(!screen_name %in% c("CRANberriesFeed", "Rbloggers",
                             "rweekly_live", "tidyversetweets"))

#plot the number of tweets per person
p2 <- rstats_data %>%
  ggplot(., aes(x = reorder(user_id, tweet_count))) +
  geom_bar(stat = "count") +
  ggtitle("Rstats Tweets By Person") +
  xlab("User") +
  ylab("Tweets") +
  theme_minimal() +
  theme(axis.text.x = element_blank())

p2
```
Let’s see the most prolific tweeters:
```r
#show the most prolific tweeters
rstats_users <- rstats_data %>%
  select(screen_name, tweet_count) %>%
  unique()

head(rstats_users)
```
```
## # A tibble: 6 x 2
##   screen_name tweet_count
##   <chr>             <int>
## 1 AndySugs           8216
## 2 dataandme          4113
## 3 gp_pulipaka        3237
## 4 DerFredo           3091
## 5 revodavid          2640
## 6 MangoTheCat        2523
```
I had been interested in recreating some analyses from https://www.jtimm.net/2018/11/03/twitter-political-ideology-and-the-115-us-senate/ recently, and thought this gave a good opportunity.
First I needed the top-level domains of links in #rstats tweets.
```r
#try to find only top level domains for grouping
domain_patterns <- "\\.com.*|\\.org.*|\\.me.*|\\.gl.*|\\.li.*|\\..appspot|\\.blogspot|\\.io.*"

links <- data.frame(url = unlist(rstats_data$urls_url)) %>%
  mutate(domain = gsub(domain_patterns, "", url)) %>%
  filter(!is.na(domain)) %>%
  group_by(domain) %>%
  mutate(share_count = n()) %>%
  ungroup()

#which are the most tweeted links by the top tweeters
head(links %>% select(-url) %>% unique() %>% arrange(-share_count))
```
```
## # A tibble: 6 x 2
##   domain         share_count
##   <chr>                <int>
## 1 goo                   4724
## 2 wp                    4110
## 3 github                3430
## 4 twitter               3201
## 5 cran.r-project        2878
## 6 r-bloggers            2708
```
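As a quick illustration of how the regex strips URLs down to a domain (the URLs here are toy examples of my own, not taken from the data), it simply chops off everything from the matched TLD onwards:

```r
#pattern copied from the chunk above
domain_patterns <- "\\.com.*|\\.org.*|\\.me.*|\\.gl.*|\\.li.*|\\..appspot|\\.blogspot|\\.io.*"

gsub(domain_patterns, "", "github.com/rfordatascience/tidytuesday")
#> "github"
gsub(domain_patterns, "", "cran.r-project.org/web/packages/dplyr")
#> "cran.r-project"
```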
Some of these (e.g. the goo/wp/fb/bit.ly ones) seem more likely to be shortened quick links to pictures, so they were removed. I also cut out links to Amazon, Google, Facebook, and YouTube, which I was less certain about doing and would probably revisit in a deeper analysis.
```r
#remove non-data-sciencey links
links %>%
  filter(!grepl("goo|wp|tweetedtimes|fb|htl|facebook|youtube|amazon|google", domain)) %>%
  filter(!grepl("activevoice.us|ift.tt|rviv.ly|bit.ly", domain)) %>%
  select(-url) %>%
  unique() %>%
  arrange(-share_count) %>%
  head()
```
```
## # A tibble: 6 x 2
##   domain                   share_count
##   <chr>                          <int>
## 1 github                          3430
## 2 twitter                         3201
## 3 cran.r-project                  2878
## 4 r-bloggers                      2708
## 5 link.rweekly                    2415
## 6 blog.revolutionanalytics        1225
```
Then we need to create a matrix of each domain vs. each user, where each value counts how many tweets from that user link to that domain.
I selected three users to illustrate the finished matrix (from here on out I’m freely borrowing code from the blog post linked above).
```r
#find which domain each tweeted link belongs to
rstats_domains_shared <- rstats_data %>%
  select(user_id, screen_name, url = urls_url, date = created_at) %>%
  #remove tweets without links
  filter(!is.na(url)) %>%
  #unlist the links (can be multiple per tweet)
  splitstackshape::listCol_l(., listcol = "url") %>%
  #merge with domain information
  merge(., unique(select(links, domain, url_ul = url, domain_shares = share_count)),
        by = "url_ul") %>%
  #select only domains shared 100 or more times
  filter(domain_shares > 99) %>%
  #remove uninteresting domains
  filter(!grepl("goo|wp|tweetedtimes|fb|htl|facebook|youtube|amazon|google", domain)) %>%
  filter(!grepl("activevoice.us|ift.tt|rviv.ly|bit.ly", domain)) %>%
  #limit to only frequent tweeters
  filter(screen_name %in% rstats_users$screen_name)

#get a matrix of domains shared vs. users
rstats_shares_by_user <- rstats_domains_shared %>%
  #find the number of times each user tweets a link to a domain
  group_by(screen_name, domain) %>%
  summarize(share_count = n()) %>%
  #keep only positive counts
  filter(share_count > 0) %>%
  spread(screen_name, share_count) %>%
  replace(is.na(.), 0) %>%
  ungroup()

#quickly glance at this
#has many columns so selecting only a few users
users <- c("hadleywickham", "drob", "JennyBryan")
rstats_shares_by_user %>%
  .[c(1, which(names(rstats_shares_by_user) %in% users))] %>%
  .[1:10, ]
```
```
## # A tibble: 10 x 4
##    domain                    drob hadleywickham JennyBryan
##    <chr>                    <dbl>         <dbl>      <dbl>
##  1 analyticsvidhya              0             0          0
##  2 andrewgelman                 0             0          0
##  3 arilamstein                  0             0          0
##  4 asbcllc                      0             0          0
##  5 bl.ocks                      0             0          0
##  6 blog.revolutionanalytics     0             5          0
##  7 blog.rstudio                 1           115          6
##  8 cran.r-project               6            12         21
##  9 cran.rstudio                 0             1          1
## 10 datasciencecentral           0             0          0
```
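The reshaping step is perhaps easiest to see on a toy example (the users and counts here are invented, not from the data): `spread()` pivots the long user/domain/count table into one column per user, and `replace()` fills in zeros for the user/domain pairs that never occurred.

```r
library(tidyverse)

#invented long-format share counts
toy_shares <- tribble(
  ~screen_name, ~domain,   ~share_count,
  "alice",      "github",  3,
  "alice",      "twitter", 2,
  "bob",        "github",  1
)

#pivot to one column per user, zero-filling missing pairs
toy_matrix <- toy_shares %>%
  spread(screen_name, share_count) %>%
  replace(is.na(.), 0)

toy_matrix
#github row: alice = 3, bob = 1; twitter row: alice = 2, bob = 0
```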
Next we use `cosine()` from the lsa package to get a matrix of user-user similarity. This is then crushed down to two dimensions, X1 and X2, using classical multidimensional scaling (`cmdscale()`).
```r
#find the cosine similarity between all users
cosine_rstats <- rstats_shares_by_user %>%
  select(2:ncol(.)) %>%
  data.matrix() %>%
  lsa::cosine(.)

#sort this into two dimensions, X1 and X2
rstats_clustering <- cmdscale(1 - cosine_rstats, eig = TRUE, k = 2)$points %>%
  data.frame() %>%
  mutate(screen_name = rownames(cosine_rstats)) %>%
  merge(rstats_users, by = "screen_name")

head(rstats_clustering)
```
```
##       screen_name         X1         X2 tweet_count
## 1       _ColinFay -0.1192867 -0.2821199         989
## 2        abresler -0.1712703 -0.3224543        1443
## 3     AnalyticsFr -0.3210288  0.4201589        1386
## 4 AnalyticsFrance -0.3210288  0.4201589        1989
## 5 AnalyticsVidhya -0.2969805  0.4152374        1814
## 6        AndySugs  0.1371950  0.2465780        8216
```
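To make this step concrete, here is a minimal sketch on made-up share counts (three invented users, three domains): `lsa::cosine()` compares the column vectors of the matrix, giving a similarity of 1 for identical sharing profiles, and `cmdscale()` on `1 - similarity` turns those dissimilarities into 2D coordinates.

```r
library(lsa)

#columns are users, rows are domains, values are invented share counts
m <- matrix(c(3, 2, 0,
              0, 1, 4,
              1, 0, 5),
            nrow = 3,
            dimnames = list(c("github", "r-bloggers", "twitter"),
                            c("alice", "bob", "carol")))

sim <- cosine(m)                   #3x3 user-user similarity, 1s on the diagonal
coords <- cmdscale(1 - sim, k = 2) #2D coordinates preserving the dissimilarities
```

Users who share the same mix of domains end up close together in `coords`, which is exactly what the X1/X2 plot below relies on.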
If we plot this, we get a nice graph of the top #rstats users, which fall neatly into two dimensions. The first, X1, seems to separate ‘social’ from ‘professional’ users. People further to the left are users I recognise off the top of my head for sharing amateur data analyses/package building (e.g. JennyBryan), whereas those on the right seem to be more industrial users (e.g. MangoTheCat).
The second dimension is a bit harder to gauge, but strikes me as a sort of software vs. data science divide, with more package creators/RStudio employees towards the bottom and people doing data analysis towards the top (but this is only a gut feeling).
```r
#plot the users by their cosine similarity and number of tweets
rstats_clustering %>%
  ggplot(aes(X1, X2)) +
  geom_text(aes(label = screen_name, size = tweet_count), alpha = 0.3) +
  scale_size_continuous(range = c(2, 5), guide = FALSE) +
  xlab("Dimension X1") +
  ylab("Dimension X2") +
  ggtitle("#rstats Tweeters Arranged by Links Shared",
          subtitle = "position based on cosine similarity between users") +
  theme_minimal()
```
To investigate a bit further, I decided to see what each person was sharing. First I used c-means clustering (something else I was working on in a separate project recently) to cluster each user based on their cosine similarity, mostly just to have something to order the final plot by.
I then used geom_tile to show how often each user was sharing links from which domains. Roughly, I would say that the ‘industrial’ (green) cluster shows a concentration of links to sites such as r-bloggers and Revolution Analytics’ blog, whereas the ‘social data science’ cluster (blue) links much more to Twitter itself, GitHub, and CRAN. The red (‘software’) cluster links to these too, but much more to the r-project blog in particular.
```r
set.seed(22081992)

#use fuzzy c-means to find clusters based on cosine similarity
#chose 3 as there seem to be 3 clear nodes
c_grouping <- cmeans(select(rstats_clustering, X1, X2) %>% as.matrix(),
                     3, iter.max = 1000)

#merge this data in
rstats_clustering %<>%
  mutate(cluster = c_grouping$cluster) %>%
  cbind(as.data.frame(c_grouping$membership)) %>%
  mutate(cluster_membership = apply(.[, (ncol(.) - (max(.$cluster) - 1)):ncol(.)], 1, max))

#plot a heatmap of links shared vs. cluster grouping
#remember cluster grouping is related to cosine similarity
rstats_shares_by_user %>%
  reshape2::melt(id.vars = "domain",
                 variable.name = "screen_name",
                 value.name = "shares") %>%
  merge(rstats_clustering, by = "screen_name") %>%
  filter(shares > 0) %>%
  ggplot(., aes(x = domain, y = reorder(screen_name, cluster + cluster_membership))) +
  geom_tile(aes(fill = log(shares), colour = factor(cluster)), size = 0.5) +
  scale_fill_viridis_c(option = "plasma", guide = FALSE) +
  scale_colour_manual(values = c("red", "blue", "green", "purple"), guide = FALSE) +
  xlab("Domain Shared") +
  ylab("Screen Name") +
  ggtitle("Domains Shared by #rstats Tweeters Coloured by User Cluster") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
```
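The bit of `cmeans()` that differs from plain k-means is the soft membership matrix: every point gets a degree of membership in every cluster, and each row sums to 1. A minimal sketch with made-up 2D points (the data here is simulated, purely for illustration):

```r
library(e1071)

set.seed(1)
#two well-separated blobs of 10 points each
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))

fit <- cmeans(pts, centers = 2, iter.max = 1000)

fit$cluster          #hard assignment, as in k-means
head(fit$membership) #soft memberships; each row sums to 1
```

The `cluster_membership` column computed above takes the row-wise maximum of these memberships, so a value near 1 means a user sits firmly inside its cluster, which is what orders the heatmap rows.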
Finally, I wanted to recreate the previous cosine similarity graph but with the clusters highlighted just because I think it makes a pretty graph.
```r
#replot our initial plot of cosine similarity with the cluster information
#alpha of screen_name indicates group membership strength
rstats_clustering %>%
  ggplot(aes(X1, X2)) +
  geom_label(aes(label = screen_name, fill = factor(cluster),
                 colour = cluster_membership, size = tweet_count),
             alpha = 0.3) +
  scale_colour_gradient(high = "black", low = "white", guide = FALSE) +
  scale_fill_manual(values = c("red", "blue", "green", "purple"), guide = FALSE) +
  scale_size_continuous(range = c(2, 5), guide = FALSE) +
  xlab("Dimension X1") +
  ylab("Dimension X2") +
  ggtitle("#rstats Tweeters Grouped by Links Shared",
          subtitle = "grouping based on cosine similarity between users") +
  theme_minimal()
```
That’s all for this post. I think I’ll keep throwing up quick #TidyTuesday posts throughout the year, which will be as sparse as this but will hopefully be interesting to one or two people.