23 November 2017

Setup

What you'll need

  • R and RStudio installed on your laptop
  • The tidyverse R package
  • The Perceptions of Uncertainty dataset from https://tinyurl.com/probly-data (right-click the page and choose Save as to store it as probly.csv)

Overview

  • Introduction to R and RStudio
  • Importing data
  • Processing data
  • Plotting data
  • Summarising data

Getting started

RStudio

Projects

Create a new project to keep files organised

Notebooks

Create a new notebook in RStudio.

Adding text to a notebook

  • Simply type!
  • Create sections with #.
  • Subsections can be created with ## and ###.
  • Highlight text by surrounding it with * for italics.
  • Use ** for bold.

Type some text into the notebook and experiment with formatting.

You can use the notebook to keep notes for this course.

Learning more about Markdown

Preview results

  • Take a look at the rendered results.
  • Try exporting to PDF or Word.

Adding R code

  • Blocks of R code start with ```{r}
  • and end with ```

```{r}

# fancy computations here

```

Add a code block to your notebook.

  • Ctrl+Alt+I (Windows/Linux) may help.
  • ⌘+Option+I (macOS) may help.

Storing values

  • We want to assign a name to data so we can use it in computations
  • Assign values to variables
  • Values can be simple (e.g. 5 or "dog")
  • … or complex (e.g. all the data from your study)
ans <- 42

Add a variable assignment to your code chunk.

Executing R code

Execute the current line of code:

  • Ctrl+Enter (Windows/Linux)
  • ⌘+Enter (macOS)

Execute the entire code block:

  • Ctrl+Shift+Enter (Windows/Linux)
  • ⌘+Shift+Enter (macOS)

Chunk options

  • The behaviour of R chunks can be modified through options.
  • Access them via the settings icon or type them directly into the chunk header.
  • Chunks can be named to make them easier to locate later.

Examining values

  • View values in the Environment window.
  • Use the data viewer.
    • Try View(ans)
  • Print values.
    • Try print(ans)
  • Highlight a variable name in the notebook and press Ctrl+Enter (⌘+Enter on macOS).

Basic building blocks – Vectors

  • Data is stored in vectors
  • Each vector has a type
    • numeric for interval and ratio data
    • logical for binary data/truth values
    • character or factor for categorical data
1:5
logical(3)
letters
factor(letters)
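
To check which type a vector has, one option is to ask R for its class. The sketch below does this for each of the example vectors. Note that 1:5 is stored as integer, which R still treats as numeric.

```r
# Inspect the type of each example vector
class(1:5)              # "integer", one of R's numeric types
is.numeric(1:5)         # TRUE
class(c(1.5, 2, 3))     # "numeric" (double precision)
class(logical(3))       # "logical"; logical(3) is FALSE FALSE FALSE
class(letters)          # "character"
class(factor(letters))  # "factor"
```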

Basic building blocks – Vectors

  • Vectors can have multiple dimensions.
mat <- matrix(1:12, ncol=3)
dim(mat)
mat[2, 3]
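
The matrix example may be easier to follow with the results spelled out. This sketch repeats it and adds a by-row variant for comparison (mat_by_row is a name introduced here for illustration only).

```r
# matrix() fills column by column unless told otherwise
mat <- matrix(1:12, ncol = 3)
dim(mat)     # 4 3: four rows, three columns
mat[2, 3]    # 10: row 2 of column 3
mat[, 3]     # 9 10 11 12: the whole third column

# Filling by row changes where each value ends up
mat_by_row <- matrix(1:12, ncol = 3, byrow = TRUE)
mat_by_row[2, 3]  # 6
```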

Basic building blocks – Vectors

  • Different types of vectors can be combined into lists.
  • Vectors of the same length can be combined into data frames.
lst <- list(position=1:26, letter=letters)
lst[[2]][3]
lst$letter[3]
df <- data.frame(position=1:26, letter=letters)
df[3, 2]
df[-(1:10),]

Basic building blocks – Functions

  • Functions transform data.
  • Input remains unchanged.

Basic building blocks – Functions

  • Each function consists of three parts.
    • A name
    • A list of arguments
    • A body of R code that processes the arguments
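
As a minimal sketch of these three parts, the hypothetical function below (to_percent is not part of R or of the workshop material) converts proportions to percentages. It also shows that the input vector is left unchanged.

```r
# Name: to_percent; arguments: x and digits; body: the code between the braces
to_percent <- function(x, digits = 1) {
  round(100 * x, digits)
}

proportions <- c(0.125, 0.5, 0.875)
percentages <- to_percent(proportions)

percentages                 # 12.5 50.0 87.5
proportions                 # still 0.125 0.500 0.875
names(formals(to_percent))  # "x" "digits": the argument list
```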

Getting help

  • Use RStudio's help browser.
  • Use the ? command (?which).
  • Check RStudio cheat sheets.



  • Look at the help page for the which function.
  • What does the function do?
  • How do you use it?

A lot of useful information is available online. Take a look at rseek.org.

Binders full of functions

  • R comes with a lot of useful functions built in.
  • Additional functions are available through packages.
  • Large collection available through public repositories.
package_name::function_name()

library(package_name)
function_name()
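
The two calling styles can be tried with the tools package, which ships with R and so needs no installation. This is only an illustration; packages from public repositories would first need install.packages().

```r
# Style 1: name the package explicitly, nothing is attached
tools::file_ext("probly.csv")       # "csv"

# Style 2: attach the package, then call its functions directly
library(tools)
file_path_sans_ext("probly.csv")    # "probly"
```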

Working with data

Working with variables

We can manipulate variables in multiple ways

  • Arithmetic
  • Comparison
  • Combine, split, search
x <- c(ans, ans/2, ans/3, ans/7)
y <- x*x
z <- x + y
p1 <- z/x
eq <- x == p1 - 1
small_idx <- which(x < 20)
small <- x[small_idx]



  • +, -, /, * are arithmetic operators.
  • ==, <, >, <=, >= are comparison operators.
  • c concatenates variables.
  • [ extracts subset of a vector.
  • What happens if an operator is applied to a vector?
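
One way to explore the last question is to try operators on a small vector. In this sketch (v is a made-up example vector) operators work element by element, and a shorter vector is recycled to match the longer one.

```r
v <- c(2, 4, 6, 8)

v * 10             # 20 40 60 80: applied to every element
v + c(1, 2, 3, 4)  # 3 6 9 12: element-wise addition
v > 5              # FALSE FALSE TRUE TRUE
which(v > 5)       # 3 4: positions where the comparison is TRUE
v[v > 5]           # 6 8: logical vectors can be used for subsetting
v + c(100, 200)    # 102 204 106 208: the shorter vector is recycled
```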

Importing data

The dataset

  • Perception of uncertainty
    • What probability would you assign to the phrase "[phrase]"?
  • Collected by /u/zonination on Reddit
  • Inspired by Sherman Kent study
  • Available on GitHub
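
The import call itself is not shown here. One possible approach, assuming the file was saved as probly.csv in the project folder, is read_csv from the readr package (part of the tidyverse). To stay runnable anywhere, the sketch reads a tiny made-up stand-in file; its values are not real responses.

```r
library(readr)  # loaded automatically by library(tidyverse)

# Toy stand-in for probly.csv: invented values, two of the phrase columns
demo_path <- tempfile(fileext = ".csv")
writeLines(c("Almost Certainly,Chances Are Slight",
             "95,5",
             "90,10"), demo_path)

demo <- read_csv(demo_path)
names(demo)  # column names keep their spaces
dim(demo)    # 2 2: two rows, two columns

# For the workshop data (assumes probly.csv sits in the project folder):
# probly <- read_csv("probly.csv")
```

Because the column names contain spaces, they need backticks when used in code, e.g. demo$`Almost Certainly`.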

Tidying data

Tidy data should have

  • One variable per column.
  • One observation per row.
  • What is the structure of the dataset?
  • What are the rows/columns?
  • Is this tidy data?

Tidying data

It depends on the question we want to ask of the data!

  • Each column could be considered a different level of the same variable.
  • Or a different variable.
  • Note that a column with subject IDs is missing.

Using the tidyr package

Allows easy reformatting of data to make it tidy.

library(tidyverse)



gather(table4a, `1999`, `2000`, 
       key = "year", value = "cases")

spread(table2, type, count)
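
In tidyr 1.0.0 and later, gather and spread still work but are superseded by pivot_longer and pivot_wider. As a sketch, the same two reshaping steps with the newer functions might look like this (long and wide are names chosen here for illustration):

```r
library(tidyr)

# Equivalent of gather(): stack the 1999 and 2000 columns into year/cases
long <- pivot_longer(table4a, c(`1999`, `2000`),
                     names_to = "year", values_to = "cases")
dim(long)  # 6 3: three countries times two years

# Equivalent of spread(): one column per value of type
wide <- pivot_wider(table2, names_from = type, values_from = count)
names(wide)  # "country" "year" "cases" "population"
```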

Reshaping data

Tidy up the probly dataset by gathering all reported probabilities into a single column. Don't forget to load the tidyverse package.

First add a subject column.

probly <- add_column(probly, subject=1:nrow(probly), .before=1)
probly_long <- gather(probly, `Almost Certainly`:`Chances Are Slight`, 
                      key="phrase", value="probability")

Plotting with ggplot2

  • A plot only really needs three things
    • data
    • mapping of data to geometric objects
    • coordinate system
  • ggplot2 provides an easy way to combine these elements into plots.

ggplot(data) + geom_type(aes(aesthetics))

data: a data.frame with your data.

geom: Geometric object to use to visualise the data.

aesthetics: Mapping of variables to properties of the geom (location, colour, size, …)
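
As a concrete instance of this template, the sketch below uses the mpg dataset that ships with ggplot2 (not the workshop data). The data is mpg, the geom is a point, and the aesthetics map engine size, fuel economy and vehicle class to position and colour.

```r
library(ggplot2)

# data = mpg, geom = point, aesthetics = x, y and colour
p <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, colour = class))

p  # printing the object draws the plot
```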

Plotting data

Create a simple box plot of the data.

library(ggplot2)

Use geom_boxplot

Plotting data

ggplot(probly_long) + geom_boxplot(aes(x=phrase, y=probability))

Improving the box plot

  • Can't really read the labels. Can we fix that?

Take a look at the help page for geom_boxplot, the examples may provide inspiration.

Improved box plot

ggplot(probly_long) + geom_boxplot(aes(x=phrase, y=probability)) + coord_flip()

Improved box plot (alternative solution)

ggplot(probly_long) + geom_boxplot(aes(x=phrase, y=probability)) +
  theme(axis.text.x=element_text(angle=45,hjust=1))

Further improvements

  • Plot is still hard to read.
  • Would be helpful to order phrases by perceived probability.
  • How do we tell ggplot which order to use?
  • Factor levels are ordered alphabetically by default.
  • The forcats package supports reordering factor levels based on another variable.
library(forcats)
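
A small sketch of how fct_reorder changes the level order. The values are invented for illustration; by default the levels are sorted by the median of the second variable.

```r
library(forcats)

# Made-up responses, two per phrase
phrase <- factor(c("Likely", "Unlikely", "Probable",
                   "Likely", "Unlikely", "Probable"))
probability <- c(70, 20, 75, 80, 10, 65)

levels(phrase)
# "Likely" "Probable" "Unlikely" (alphabetical)

levels(fct_reorder(phrase, probability))
# "Unlikely" "Probable" "Likely" (medians 15, 70, 75)
```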

Ordered plot

probly_long <- mutate(probly_long, phrase=fct_reorder(phrase, probability))
ggplot(probly_long) + geom_boxplot(aes(x=phrase, 
                                       y=probability)) + coord_flip()

Final tweaks

  1. Fix the y axis label.
  2. Use colour to highlight differences between categories.

Search the help system for a way to change the axis label.

The xlab and ylab functions will set the axis labels. Remember that we flipped the coordinate system.

ggplot(probly_long) + geom_boxplot(aes(x=phrase, 
                                       y=probability)) +
  xlab("") + coord_flip()

Final tweaks

  1. Fix the y axis label.
  2. Use colour to highlight differences between categories.
  • Group observations by phrase using the group_by function.
  • New variables can be added with the mutate function.
probly_long <- group_by(probly_long, phrase)
probly_long <- mutate(probly_long, avg=median(probability))
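
To see what grouping changes, this sketch applies the same two steps to a tiny made-up table (toy is a hypothetical name). With grouping, the median is computed within each phrase and repeated on every row of that phrase.

```r
library(dplyr)

toy <- tibble(phrase = c("Likely", "Likely", "Unlikely", "Unlikely"),
              probability = c(70, 80, 10, 30))

# Without grouping: one overall median for all rows
mutate(toy, avg = median(probability))$avg   # 50 50 50 50

# With grouping: one median per phrase
toy <- group_by(toy, phrase)
toy <- mutate(toy, avg = median(probability))
toy$avg                                      # 75 75 20 20
```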

Coloured plot

ggplot(probly_long) + geom_boxplot(aes(x=phrase, 
                                       y=probability,
                                       fill=avg)) + 
  xlab("") + coord_flip()

Adjusting scales

  • ggplot chose a single colour gradient to visualise median probabilities.
  • A divergent scale centred on 50% would make a lot more sense.
  • Scales for aesthetics can be adjusted via scale functions.
  • scale_aesthetic_type
  • scale types include manual, discrete, continuous, …

scale_fill_gradient2(midpoint=50)

Final plot

ggplot(probly_long) + geom_boxplot(aes(x=phrase, 
                                       y=probability, fill=avg)) + 
  xlab("") + coord_flip() +
  scale_fill_gradient2(midpoint=50)

Data cleaning

Cleaning messy data

  • Data are often messy
  • Do unexpected outliers contain valuable information or are they just noise?
  • What could we do to investigate?

Plotting individual level data

It may be helpful to inspect individual responses. Create a plot that facilitates comparison between individual and average responses for each phrase.

A possible solution

library(ggrepel)  # provides geom_text_repel for non-overlapping labels
phrase_count <- length(levels(probly_long$phrase))
# Flag responses that fall on the opposite side of 50% from the phrase median
probly_long <- mutate(probly_long, unexpected=sign(probability - 50) != sign(avg - 50) &
                        probability != 50 & avg != 50)
# One line per subject; only unexpected responses get a visible point and a label
ggplot(probly_long, aes(x=as.numeric(phrase),
                        y=probability)) + 
  geom_line(aes(colour=factor(subject))) +
  geom_point(aes(colour=factor(subject), shape=unexpected, size=unexpected)) +
  geom_smooth(method="loess", colour='black') +
  geom_text_repel(data=filter(probly_long, unexpected), 
                  aes(x=as.numeric(phrase), 
                      y=probability,
                      label=subject)) +
  # Phrases are plotted by numeric position, so restore their names as axis labels
  scale_x_continuous(breaks=1:phrase_count, 
                     labels=levels(probly_long$phrase),
                     minor_breaks = NULL,
                     expand = c(0, 0.4)) +
  # Size 0 hides the points for expected responses
  scale_size_manual(values=c(0, 2)) +
  scale_shape_manual(values=c(0, 16)) + 
  guides(colour='none', shape='none', size='none') + xlab("") +
  theme(axis.text.x=element_text(angle=45,hjust=1))

Removing unreliable measurements

  • Be careful about removing outliers
  • Outliers may be due to corrupted measurements
  • but may also represent higher than expected variability in the data
  • Only remove outliers if there is solid evidence that these measurements are unreliable
  • Remove subject 15 from the dataset.
  • Create plots for the cleaned data.

Cleaned data

probly_clean <- filter(probly_long, subject != 15)

Summarising data

Getting the numbers

  • Plots are very useful but sometimes we need actual numbers.
  • Simple summaries can easily be obtained via dplyr::summarise.
  • probly_clean is still grouped by phrase, so we get one row per phrase.
prob_summary <- summarise(probly_clean, mean=mean(probability))

More numbers

  • Compute additional summary statistics.
  • You could try median, trimmed mean, standard deviation, interquartile range, median absolute deviation.

summarise can produce multiple summaries at once.

More numbers

prob_summary <- summarise(probly_clean, mean=mean(probability), 
                          sd=sd(probability), 
                          'trimmed mean'=mean(probability, trim = 0.1),
                          median=median(probability), IQR=IQR(probability), 
                          MAD=mad(probability))

Is an event with better than even odds likely?

  • We can use these data to determine whether the perception of two phrases differs significantly.

Investigate the difference between Better Than Even and Likely.

Use data in wide format (one column for each phrase).

A paired t-test should be useful here.

Is an event with better than even odds likely?

t.test(probly$Likely[-15], probly$`Better Than Even`[-15], paired=TRUE)
## 
##  Paired t-test
## 
## data:  probly$Likely[-15] and probly$`Better Than Even`[-15]
## t = 7.1021, df = 44, p-value = 8.102e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   9.802789 17.570544
## sample estimates:
## mean of the differences 
##                13.68667
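
A paired t-test is equivalent to a one-sample t-test on the within-subject differences, which may help when reading the output above. The sketch below checks this on five made-up pairs of responses (not taken from the dataset).

```r
# Invented responses from five hypothetical subjects
likely           <- c(70, 65, 80, 75, 60)
better_than_even <- c(55, 60, 65, 70, 50)

paired     <- t.test(likely, better_than_even, paired = TRUE)
one_sample <- t.test(likely - better_than_even)

unname(paired$estimate)  # 10: the mean of the differences

# Both approaches give the same test statistic and p-value
all.equal(unname(paired$statistic), unname(one_sample$statistic))  # TRUE
all.equal(paired$p.value, one_sample$p.value)                      # TRUE
```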

The Road Ahead

Further adventures in R

Useful packages

This presentation