Installing and loading the required libraries

Let us begin by loading the reshape2, plyr and ggplot2 packages. If you don’t have these already, go ahead and install it using install.packages(). NOTE: Installing reshape2 automatically installs plyr as a dependency, so you don’t need to run this if you installed reshape2 in the past!

packages <- c("reshape2","ggplot2")
install.packages(packages)

It’s then ready to be loaded

library("plyr")
library("reshape2")
library("ggplot2")

Load the data

The dataset we’ll be using comprises 20 patients that developed lung cancer that were also former smokers, from which biopsy samples were collected and sequenced at a certain center (i.e. site). Each row is a patient, and each column is a descriptor that includes information on the patient (unique patient identifier, age, progression status of the tissue upon collection, treatment and smoking history (amount in pack-years and mode of quitting).

setwd("~/R Workshop/data/")
cancer.samples <- read.csv("cancer_samples.csv",
                      header=TRUE,
                      colClasses = c("Patient_ID"="character"))
dim(cancer.samples)
## [1] 20 11
head(cancer.samples)
##   Patient_ID Sex Age      Status Treatment Site Pack_years Cold_turkey Nicotine_supplements E_cigarettes Acupuncture
## 1      22471   M  55   Dysplasia        No DFCI      1.000         Yes                   No           No          No
## 2      23657   M  49 Hyperplasia       Yes  BGW      0.500          No                   No           No         Yes
## 3      29876   F  60   Carcinoma       Yes  BGW      0.200          No                  Yes          Yes          No
## 4      24867   M  61   Dysplasia        No DFCI      0.100          No                  Yes           No          No
## 5      20991   F  50 Hyperplasia        No  BGW      1.000         Yes                  Yes           No          No
## 6      21242   M  70      Normal        No BUMC      0.001          No                   No          Yes          No

Reshaping the data

When we collect data, we often collect it in it’s wide format. This means each row is a unique identifier, and each column includes fields describing the feature (row). While this is often a convenient way to collect and store data, it tends to grow quickly (i.e. width wise) and can end up being less ‘tidy’. In order to process such data for easy handling with packages that support data summarization and plotting (like plyr/ggplot2), we need to reshape it to convert it into a more ‘long’ vs ‘wide’ representation. We can do this using the melt function in the reshape2 package

cancer.samples.m <- melt(cancer.samples,
                         measure.vars = c("Cold_turkey","Nicotine_supplements","E_cigarettes","Acupuncture"),
                         variable.name = "quitting.method")
dim(cancer.samples.m)
## [1] 80  9
head(cancer.samples.m)
##   Patient_ID Sex Age      Status Treatment Site Pack_years quitting.method value
## 1      22471   M  55   Dysplasia        No DFCI      1.000     Cold_turkey   Yes
## 2      23657   M  49 Hyperplasia       Yes  BGW      0.500     Cold_turkey    No
## 3      29876   F  60   Carcinoma       Yes  BGW      0.200     Cold_turkey    No
## 4      24867   M  61   Dysplasia        No DFCI      0.100     Cold_turkey    No
## 5      20991   F  50 Hyperplasia        No  BGW      1.000     Cold_turkey   Yes
## 6      21242   M  70      Normal        No BUMC      0.001     Cold_turkey    No

So we went from 11 columns in the original data, to 9 columns in the melted data (and the melted data has 4 times as many rows). No big deal? Notice that we had 4 different fields with “Yes”/“No” entries describing whether or not the patient used a certain method to quit smoking. While this seems redundant, this is often how data is collected. Upon melting the data, specifying these 4 fields to be collapsed into one “measured” variable, we notice all the other columns remain as is (i.e. serving as unique identifiers per patient), with a single column then describing what method was used, and the Yes/No value pertaining to the grouped method column. This now makes it much easier for us to treat “quitting.method” as a single variable instead of 4 different variables, and we can do some fun summarization/plotting with this reformatted data!

Splitting and aggregating data with plyr

Now that we’ve reshaped/grouped our data a certain way, we can use this formatting to compute summary statistics involving our grouped/collapsed variable of interest (in this case, quitting.method). Plyr is a very useful package that let’s you do this. The most common function is ddply (data frame to data frame: input to output). We need to provide ddply with the data we want to process, along with a formula specfying what variables to split the data by, as well as a function to apply over the split rows.

Let us try to compute the Number of male vs female patient specimens collected per tissue grade status. We do this using ddply on the ORIGINAL data in the following manner:

sex.per.status <- ddply(cancer.samples,~Status + Sex,nrow)
sex.per.status
##        Status Sex V1
## 1   Carcinoma   F  4
## 2   Carcinoma   M  1
## 3   Dysplasia   F  1
## 4   Dysplasia   M  4
## 5 Hyperplasia   F  2
## 6 Hyperplasia   M  3
## 7      Normal   F  2
## 8      Normal   M  3

That was easy! That’s just like applying the nrow function over a set of rows defined by a specific grouping variable- in this case the Status of the tissue collected. We observe that there are 5 counts for each category.

Let’s try something a bit more informative. What if we want to compute the mean age for patients per quitting method. We already have our single quitting method variable ready to go (we obtained this in the step above when we melted the 4 columns into one), so we want to iterate over each quitting method and compute the mean and std deviation of the patients’ age per group. We do this using ddply in the following manner:

age.per.group <- ddply(cancer.samples.m,~quitting.method,function(rows){
  # Only subset rows that are true for the recovery method in question
  rows <- rows[rows$value=="Yes",]
  n <- nrow(rows)
  m <- mean(rows$Age)
  std <- sd(rows$Age)
  data.frame("N"=n,"Mean.Age"=m,"Std.dev.Age"=std)
})
age.per.group
##        quitting.method N Mean.Age Std.dev.Age
## 1          Cold_turkey 4 46.75000    6.994045
## 2 Nicotine_supplements 8 57.87500    6.728139
## 3         E_cigarettes 8 59.87500    8.096516
## 4          Acupuncture 6 53.83333   12.921558

Notice in the case above, we defined our own function, and even controlled how the output data frame is displayed (column names, order etc.), as opposed to the first case where we just specified it to use nrow(). This allows you to do more sophisticated tasks, applying your own custom functions over a subset of your data.

Now let us try and visualize some of this data using ggplot!

You may already be familiar with plotting with the basic R graphics handle. The R plot() function would explicitely take the x and y axis variables, and the end product would take a lot of fine tuning if you want to enhance the aesthetics factor. With ggplot, we can arrive at similar, if not prettier results in just a few lines or less!

Suppose we want to plot the distribution of ages across the 20 patients (even though we have very few samples), the original plotting function would use something like:

hist(cancer.samples$Age)

With ggplot, this works differently. Instead of passing the variable being plotted (in this case, the Age column), we provide ggplot with the entire dataset (similar to melt and ddply - now you see the consistency of the tidy universe!), and then specify what we want to do with this data (i.e. what goes on what axes, and what aesthetics to apply)

ggplot(cancer.samples,aes(x=Age)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The reason it looks different is that it assumes different bin widths to count over (on the x-axis)

Let’s try plotting the mean age of patients per mode of quitting (for this, we would need the data returned by ddply grouped by quitting.method)

ggplot(age.per.group,aes(x=quitting.method,y=Mean.Age)) + geom_point()

A little bit of enhancement and we’re in business

ggplot(age.per.group,aes(x=quitting.method,y=Mean.Age)) + 
  geom_point(color="firebrick",size=3) + # Change the color and size of the points
  theme_bw()+ # Apply a black and white grid theme, instead of the greyscale background
  labs(x="Quitting method",y="Mean Age") # Change the X and Y axis labels

GGplot works in layers, so you begin with the base plot, and add enhancements/modifications after. The nice thing about doing this is that you can save the object at any point, and continue to modify it after:

p1 <- ggplot(age.per.group,aes(x=factor(quitting.method),y=Mean.Age)) + 
  geom_point(color="firebrick",size=3) + # Change the color and size of the points
  theme_bw()+ # apply a black and white grid theme, instead of the greyscale background
  labs(x="Quitting method",y="Mean Age") # Change the X and Y axis labels

p1 <- p1 + geom_point(size=2,color="gray") # This will overlay a smaller point on the original (gray on red)
p1

See if you can explore the other functions ggplot has to offer (geom_bar, geom_line, etc.) for the same example!