Scraping Albuquerque CrossFit Data

I’ve been playing around with R and web data lately and thought I’d have a go at a minor analysis of publicly available Facebook data.

This first section is an example taken from Pablo Barbera’s Rfacebook GitHub page, used to look at Black Box Fitness’s Facebook page.

Libraries used for this analysis:

library(Rfacebook)
library(Rook)
library(ggplot2)
library(scales)

I have registered as a developer on Facebook and have my authentication token saved. Data is read from Black Box’s public page and cached for local use here:

# me <- getUsers('me', token=fb_oauth)
if (!file.exists("~/Dropbox/stats/blackbox/bbf_fb.csv")) {
    load("~/Dropbox/stats/facebook/fb_oauth")
    print("getting blackbox page")
    page <- getPage("blackboxfitness", fb_oauth, n = 5000)
    write.csv(page, "~/Dropbox/stats/blackbox/bbf_fb.csv")
}
page <- read.csv("~/Dropbox/stats/blackbox/bbf_fb.csv")

And now we copy Pablo’s code to do some simple plotting of page popularity over time:

## convert Facebook date format to R date format
format.facebook.date <- function(datestring) {
    as.POSIXct(datestring, format = "%Y-%m-%dT%H:%M:%S+0000", tz = "MST")
}
## aggregate metric counts over month
aggregate.metric <- function(metric) {
    m <- aggregate(page[[paste0(metric, "_count")]], list(month = page$month), 
        mean)
    m$month <- as.Date(paste0(m$month, "-15"))
    m$metric <- metric
    return(m)
}
# create data frame with average metric counts per month
page$datetime <- format.facebook.date(page$created_time)
page$month <- format(page$datetime, "%Y-%m")
df.list <- lapply(c("likes", "comments", "shares"), aggregate.metric)
df <- do.call(rbind, df.list)
library(ggplot2)
p <- ggplot(df, aes(x = month, y = x, group = metric))
p <- p + geom_line(aes(color = metric), size = 2)
# p <- p + scale_color_manual(values = wes.palette(3, 'Zissou'))
p <- p + scale_x_date(breaks = "years", labels = date_format("%Y-%m"))
p <- p + theme(axis.title.x = element_blank()) + labs(y = "average # of actions per post", 
    title = "average number of actions per post for the Black Box Fitness Facebook page,\nbinned per month")
print(p)
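As a sanity check, the aggregation step above can be run on a toy data frame; the column names mirror the Rfacebook output, but the numbers are invented:

```r
# toy stand-in for the getPage() result: two months, two posts each
toy <- data.frame(likes_count = c(10, 20, 30, 40),
                  month = c("2014-01", "2014-01", "2014-02", "2014-02"))

# same aggregate() call as inside aggregate.metric above
m <- aggregate(toy$likes_count, list(month = toy$month), mean)
m$x  # 15 35 -- the January and February averages
```

The result comes back with the mean in a column named `x`, which is why the plotting code maps `y = x`.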

Is that interesting? Probably not, as any real analysis would have to account for some lurking variables - the number of people who “like” Black Box on Facebook, the number of posts they make per day, etc. Perhaps I’ll look at some of these things later.

Next - comparing a bunch of crossfit boxes…

What could be more interesting is looking at how each box in Albuquerque leverages its social network.

Scraping all the post data using the getPage function in Rfacebook:

if (!file.exists("~/Dropbox/stats/blackbox/abqcf_data.csv")) {
    print("getting pages")
    cf_bbf <- getPage("blackboxfitness", fb_oauth, n = 5000)
    cf_abq <- getPage("CrossFit-Albuquerque", fb_oauth, n = 5000)
    cf_bigbarn <- getPage("BigBarnCrossfit", fb_oauth, n = 5000)
    cf_petro <- getPage("crossfitpetroglyph", fb_oauth, n = 5000)
    cf_dukecity <- getPage("DukeCityCrossFit", fb_oauth, n = 5000)
    cf_cantina <- getPage("cantinacrossfit", fb_oauth, n = 5000)
    cf_desertforge <- getPage("Desert-Forge-Crossfit", fb_oauth, n = 5000)
    cf_hunger <- getPage("poweredbyprimal", fb_oauth, n = 5000)
    # get rid of the annoying POWERED BY PRIMAL bit
    cf_hunger[, 2] <- c("CrossFit Hunger")
    # added after initial post date
    cf_ttb <- getPage("crossfittothebone", fb_oauth, n = 5000)
    cf_hellbox <- getPage("crossfithellbox", fb_oauth, n = 5000)
    cf_sandstorm <- getPage("cfsandstorm", fb_oauth, n = 5000)
    # bind together the boxes with public data, including the three added later
    abqcf <- rbind(cf_bbf, cf_bigbarn, cf_petro, cf_dukecity, cf_cantina, cf_hunger, 
        cf_ttb, cf_hellbox, cf_sandstorm)
    abqcf$from_name <- as.factor(abqcf$from_name)
    write.csv(abqcf, "~/Dropbox/stats/blackbox/abqcf_data.csv")
}

abqcf <- read.csv("~/Dropbox/stats/blackbox/abqcf_data.csv")
# drop the unnamed row-name column that write.csv adds by default
abqcf <- abqcf[, -1]
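The stray column comes from write.csv() writing row names as an unnamed first column by default; passing row.names = FALSE when saving avoids the cleanup step entirely:

```r
# demonstrate the row.names = FALSE fix on a throwaway data frame
df <- data.frame(box = c("a", "b"), likes = c(1, 2))
f <- tempfile(fileext = ".csv")
write.csv(df, f, row.names = FALSE)   # no extra row-name column written
back <- read.csv(f)
ncol(back)  # 2 -- nothing to drop on re-read
```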

I now have the boxes’ data shown in the tables below. Note that several local boxes had no publicly available Facebook page data; if anyone from CF Abq or Sandia CF wants to change their settings, I’d be happy to include them in all of this. Also, if I have missed any boxes, let me know.

Looking at the most-liked, most-commented, and most-shared posts:

abqcf[which.max(abqcf$likes_count), c(2, 5, 6)]
##             from_name  type
## 168 Black Box Fitness photo
##                                                                                                                            link
## 168 https://www.facebook.com/photo.php?fbid=621816951186670&set=a.345567808811587.73662.114224235279280&type=1&relevant_count=1
abqcf[which.max(abqcf$comments_count), c(2, 5, 6)]
##               from_name  type
## 4197 Duke City CrossFit photo
##                                                                                                                                  link
## 4197 https://www.facebook.com/photo.php?fbid=703682639683274&set=a.619772838074255.1073741834.195441880507355&type=1&relevant_count=1
abqcf[which.max(abqcf$shares_count), c(2, 5, 6)]
##              from_name  type
## 1645 Black Box Fitness photo
##                                                                                                                            link
## 1645 https://www.facebook.com/photo.php?fbid=301687569866278&set=a.123628401005530.9095.114224235279280&type=1&relevant_count=1
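which.max() returns the row index of the first maximum, which is why indexing a data frame with it pulls out the single top post; on toy numbers:

```r
# the index of the largest likes value picks out the top row
posts <- data.frame(box = c("A", "B", "C"), likes = c(4, 19, 7))
posts[which.max(posts$likes), "box"]  # "B"
```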

We now have all of the data tucked away, so let’s refine it a bit:

# create data frame with average metric counts per month
abqcf$datetime <- format.facebook.date(abqcf$created_time)
abqcf$month <- format(abqcf$datetime, "%Y-%m")
# make the post type a factor, then rename the columns to friendlier names
abqcf$type <- factor(abqcf$type)
colnames(abqcf) <- c("post_id", "box", "post_text", "created_time", "post_type", 
    "url", "post_id2", "likes", "comments", "shares", "datetime", "month")
# removing questions from the set
abqcf <- abqcf[abqcf$post_type != "question", ]
abqcf <- droplevels(abqcf)
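The droplevels() call matters because a factor keeps its levels even after the matching rows are filtered out; a toy illustration:

```r
# removing the rows does not remove the factor level
x <- factor(c("photo", "link", "question"))
x <- x[x != "question"]
levels(x)       # still "link" "photo" "question"
x <- droplevels(x)
levels(x)       # now just "link" "photo"
```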

So now we have 11676 observations of the following variables:

  • post_id
  • box - business name (it’s a factor in R)
  • post_text - text from the post; could do some sentiment analysis on this
  • created_time - time of the post
  • post_type - video, photo, link, etc.
  • url - link attached to the post
  • post_id2 - a second post id
  • likes
  • comments
  • shares
  • datetime - POSIX date/time format
  • month - month/year info
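As a hint of what sentiment analysis on post_text could look like, here is a crude word-count score; the word lists are invented for illustration, not taken from any real lexicon:

```r
# naive sentiment: positive word hits minus negative word hits
positive <- c("great", "strong", "pr", "awesome")   # made-up mini-lexicon
negative <- c("sore", "tired", "fail")

score <- function(text) {
    words <- tolower(unlist(strsplit(text, "[^a-zA-Z]+")))
    sum(words %in% positive) - sum(words %in% negative)
}

score("Great WOD today, new PR!")  # 2
score("So sore and tired...")      # -2
```

A real attempt would use an established lexicon and handle negation, but the mechanics are the same.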

Now, just a little worthless plot that isn’t terribly informative: it shows the number of likes per post over time. It’s overplotted and doesn’t account for anything really (the size of each page’s following, etc.).

p <- ggplot(abqcf, aes(x = datetime, y = likes, group = box, colour = box)) + 
    # scale_x_date(breaks = 'years', labels = date_format('%Y-%m')) +
    geom_point() + 
    # scale_color_manual(values = wes_palette('FantasticFox')) +
    labs(title = "Worthless plot of Facebook likes \n over time for Albuquerque CrossFit boxes", 
        x = "")
print(p)

So, let’s try to do something more informative and look at the average number of likes per box using plyr.

library(plyr)
library(stargazer)
# ddply(data.frame, variable(s), function, optional arguments)
means <- ddply(abqcf, .(box), summarise, likes = mean(likes), comments = mean(comments), 
    shares = mean(shares))
# get best post
maxes <- ddply(abqcf, .(box), summarise, likes = max(likes), comments = max(comments), 
    shares = max(shares))
average.likes <- ddply(abqcf, .(box), summarise, avglikes = mean(likes), sd = sd(likes))
stargazer(means, caption = "Average number of likes, comments, and shares per post", 
    digits = 1, type = "html", summary = FALSE)
Average number of likes, comments, and shares per post:

    Box                    Likes  Comments  Shares
    Big Barn CrossFit       11.6       1.3    0.01
    Black Box Fitness        6.1       1.5    0.2
    Cantina Crossfit         8.1       2.3    0.03
    CrossFit HellBox        13.2       2.1    0.1
    CrossFit Hunger          5.6       1.3    0.1
    CrossFit Petroglyph      7.2       0.8    0.04
    CrossFit Sandstorm       4.8       0.9    0.1
    CrossFit To The Bone     8.2       0.7    0.1
    Duke City CrossFit       6.7       1.9    0.1
stargazer(maxes, caption = "Highest number of likes, comments, and shares on a post", 
    digits = 1, type = "html", summary = FALSE)
Highest number of likes, comments, and shares on a post:

    Box                    Likes  Comments  Shares
    Big Barn CrossFit        127        14       1
    Black Box Fitness        183        30     142
    Cantina Crossfit          75        20       2
    CrossFit HellBox         120        33      24
    CrossFit Hunger           72        61      29
    CrossFit Petroglyph       37        14       3
    CrossFit Sandstorm        65        25      16
    CrossFit To The Bone      44        15       4
    Duke City CrossFit       174       112      31
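The ddply summaries can be cross-checked with base R’s aggregate(), which computes the same per-group means; toy numbers here, not the real page data:

```r
# per-box mean likes with base aggregate(), equivalent to the ddply call
toy <- data.frame(box = c("A", "A", "B", "B"),
                  likes = c(10, 20, 5, 15))
box.means <- aggregate(likes ~ box, data = toy, FUN = mean)
box.means$likes  # 15 10
```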
p <- ggplot(abqcf, aes(box, likes, colour = box, group = box, fill = box))
p <- p + stat_summary(fun.y = mean, geom = "point", size = 8)
p <- p + stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2, 
    size = 1)
p <- p + labs(title = "Average number of likes per post for \n all type of posts, mean and 95% confidence interval", 
    x = "", y = "Average number of likes")
p <- p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
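mean_cl_boot is ggplot2’s wrapper around Hmisc::smean.cl.boot, i.e. a plain nonparametric bootstrap of the mean. The idea can be sketched by hand; the sample values below are invented:

```r
# hand-rolled bootstrap confidence interval for a mean
set.seed(1)
likes <- c(3, 8, 0, 12, 5, 7, 2, 9)                 # invented sample
boot.means <- replicate(2000, mean(sample(likes, replace = TRUE)))
ci <- quantile(boot.means, c(0.025, 0.975))         # 95% interval
c(lower = unname(ci[1]), upper = unname(ci[2]))
```

The error bars in the plot above are exactly this interval, computed per box.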

And now let’s look at it by the type of post as well:

p <- ggplot(abqcf, aes(post_type, likes, colour = box, group = box, fill = box))
p <- p + stat_summary(fun.y = mean, geom = "point", size = 5, position = position_dodge(width = 1))
p <- p + stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2, 
    size = 1, position = position_dodge(width = 1))
p <- p + labs(title = "Average number of likes per post type by box, \n mean and 95% confidence interval", 
    x = "", y = "Average number of likes")
p
## ymax not defined: adjusting position using y instead
# py$ggplotly(p)