An Introduction to R

23 November 2017

Setup

What you'll need

R and RStudio installed on your laptop
The tidyverse R package
The Perceptions of Uncertainty dataset from https://tinyurl.com/probly-data (right click page and Save as probly.csv)

Overview

Introduction to R and RStudio
Importing data
Processing data
Plotting data
Summarising data

Getting started

RStudio

Projects

Create a new project to keep files organised

Notebooks

Create a new notebook in RStudio.

Adding text to a notebook

Simply type!
Create sections with #.
Subsections can be created with ## and ###.
Highlight text by surrounding it with * for italics.
Use ** for bold.

Type some text into the notebook and experiment with formatting.

You can use the notebook to keep notes for this course.

Learning more about Markdown

Preview results

Take a look at the rendered results.
Try exporting to PDF or Word.

Adding R code

Blocks of R code start with ```{r}
and end with ```

```{r}

# fancy computations here

```

Add a code block to your notebook.

Ctrl+Alt+I (Windows/Linux) or ⌘+Option+I (macOS) may help.

Storing values

We want to give data a name so we can use it in computations.
To do this, assign values to variables.
Values can be simple (e.g. 5 or "dog")
… or complex (e.g. all the data from your study).

ans <- 42

Add a variable assignment to your code chunk.

Executing R code

Execute current line of code:

Ctrl+Enter (Windows/Linux)
⌘+Enter (macOS)

Execute entire code block:

Ctrl+Shift+Enter (Windows/Linux)
⌘+Shift+Enter (macOS)

Chunk options

Behaviour of R chunks can be modified through options.
Access via settings icon or type directly into chunk header.
Chunks can be named to make them easier to locate later.

Examining values

View values in the Environment window.
Use the data viewer.
- Try View(ans)
Print values.
- Try print(ans)
Highlight the variable name in the notebook and press Ctrl+Enter.

Basic building blocks – Vectors

Data is stored in vectors
Each vector has a type
- numeric for interval and ratio data
- logical for binary data/truth values
- character or factor for categorical data
- …

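1:5              # integer vector 1, 2, 3, 4, 5
logical(3)       # three FALSE values
letters          # built-in character vector "a" to "z"
factor(letters)  # the same letters as a factor with 26 levels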
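To check which type a vector has, class() can be used. The sketch below builds one small vector of each type listed above; the variable names (nums, flags, words, groups) are made up for illustration.

```r
# One small vector per type
nums <- c(1.5, 2, 3.25)          # numeric
flags <- c(TRUE, FALSE, TRUE)    # logical
words <- c("cat", "dog", "cat")  # character
groups <- factor(words)          # factor with levels "cat" and "dog"

# class() reports the type of each vector
class(nums)
class(flags)
class(words)
class(groups)

# A factor also stores the set of categories it can take
levels(groups)
```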

Basic building blocks – Vectors

Vectors can have multiple dimensions.

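mat <- matrix(1:12, ncol=3)  # 4 rows, 3 columns, filled column by column
dim(mat)                     # number of rows and columns: 4 3
mat[2, 3]                    # element in row 2, column 3: 10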

Basic building blocks – Vectors

Different types of vectors can be combined into lists.
Vectors of the same length can be combined into data frames.

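lst <- list(position=1:26, letter=letters)
lst[[2]][3]    # third element of the second list entry
lst$letter[3]  # the same element, accessed by name
df <- data.frame(position=1:26, letter=letters)
df[3, 2]       # row 3, column 2
df[-(1:10),]   # all rows except the first ten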

Basic building blocks – Functions

Functions transform data.
Input remains unchanged.

Basic building blocks – Functions

Each function consists of three parts.
- A name
- A list of arguments
- A body of R code that processes the arguments
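As a rough illustration of the three parts, here is a small function written for this example. The name to_percent and its arguments are hypothetical and not part of R or any package.

```r
# name: to_percent; arguments: x and digits; body: the code between the braces
to_percent <- function(x, digits = 1) {
  round(100 * x, digits)
}

proportions <- c(0.5, 0.125, 0.9)
to_percent(proportions)  # 50.0 12.5 90.0

# The function returns a new vector; its input remains unchanged
proportions
```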

Getting help

Use RStudio's help browser.
Use the ? command (?which).
Check RStudio cheat sheets.

Look at the help page for the which function.
What does the function do?
How do you use it?

A lot of useful information is available online. Take a look at rseek.org.
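After reading the help page it can be useful to try the function on a small example. A minimal sketch, using a made-up vector called scores:

```r
scores <- c(12, 45, 7, 30)

# which() returns the positions at which a logical vector is TRUE
big_idx <- which(scores > 10)
big_idx          # 1 2 4

# The positions can then be used to extract the matching values
scores[big_idx]  # 12 45 30
```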

Binders full of functions

R comes with a lot of useful functions built in.
Additional functions are available through packages.
Large collection available through public repositories.

package_name::function_name()

library(package_name)
function_name()
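Both calling styles can be tried with the stats package, which ships with R. It is used here only because it is always installed (R attaches it by default, so the library() call is purely illustrative). The vector values is made up.

```r
values <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Style 1: name the package explicitly, no need to attach it first
sd_explicit <- stats::sd(values)

# Style 2: attach the package, then call the function by name alone
library(stats)
sd_attached <- sd(values)

# Both styles call the same function and give the same result
identical(sd_explicit, sd_attached)
```

The explicit form is handy when two attached packages define functions with the same name.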

Working with data

Working with variables

We can manipulate variables in multiple ways

Arithmetic
Comparison
Combine, split, search

x <- c(ans, ans/2, ans/3, ans/7)
y <- x*x
z <- x + y
p1 <- z/x
eq <- x == p1 - 1
small_idx <- which(x < 20)
small <- x[small_idx]

+, -, /, * are arithmetic operators.
==, <, >, <=, >= are comparison operators.
c concatenates values into a vector.
[ extracts a subset of a vector.

What happens if an operator is applied to a vector?
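A minimal sketch for exploring this question. Here x is written out as the literal values that c(ans, ans/2, ans/3, ans/7) yields when ans is 42.

```r
x <- c(42, 21, 14, 6)

# Arithmetic operators work element by element
doubled <- x * 2             # 84 42 28 12
summed <- x + c(1, 2, 3, 4)  # 43 23 17 10

# Comparison operators return one TRUE/FALSE per element
is_small <- x < 20           # FALSE FALSE TRUE TRUE

# A logical vector can be used directly to extract a subset
x[is_small]                  # 14 6
```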

Importing data

The dataset

Perception of uncertainty
- What probability would you assign to the phrase "[phrase]"?
Collected by /u/zonination on Reddit
Inspired by Sherman Kent study
Available on GitHub
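The import step itself is not shown on these slides. One common approach, assuming the file was saved as probly.csv in the project folder, is read_csv from the readr package (part of the tidyverse). So that the sketch runs anywhere, it first writes a tiny stand-in file with made-up values; the names csv_path and demo are illustrative.

```r
library(readr)

# Stand-in for probly.csv so the example is self-contained (made-up values)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Almost Certainly,Likely", "95,70", "98,75"), csv_path)

# For the workshop data this would be: probly <- read_csv("probly.csv")
demo <- read_csv(csv_path)

dim(demo)    # rows and columns
names(demo)  # column names are kept as they appear in the file
```

Unlike base R's read.csv, read_csv does not alter column names that contain spaces, which is why they are later referred to with backticks.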

Tidying data

Tidy data should have

One variable per column.
One observation per row.

What is the structure of the dataset?
What are the rows/columns?
Is this tidy data?

Tidying data

It depends on the question we want to ask of the data!

Each column could be considered a different level of the same variable.
Or a different variable.
Note that a column with subject IDs is missing.

Using the `tidyr` package

Allows easy reformatting of data to make it tidy.

library(tidyverse)

gather(table4a, `1999`, `2000`, 
       key = "year", value = "cases")

spread(table2, type, count)
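To see what the two functions do, it may help to apply them to a very small table. The data frame wide below is made up for illustration and only mimics the shape of table4a.

```r
library(tidyr)

# Wide format: one column per year (made-up values)
wide <- data.frame(country = c("A", "B"),
                   `1999` = c(10, 20),
                   `2000` = c(15, 25),
                   check.names = FALSE)

# gather() stacks the year columns into key/value pairs: one row per country and year
long <- gather(wide, `1999`, `2000`, key = "year", value = "cases")
long

# spread() reverses the operation: one column per value of year
back <- spread(long, year, cases)
back
```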

Reshaping data

Tidy up the probly dataset by gathering all reported probabilities into a single column. Don't forget to load the tidyverse package.

First add a subject column.

probly <- add_column(probly, subject=1:nrow(probly), .before=1)

probly_long <- gather(probly, `Almost Certainly`:`Chances Are Slight`, 
                      key="phrase", value="probability")

Plotting with `ggplot2`

A plot only really needs three things
- data
- mapping of data to geometric objects
- coordinate system
ggplot2 provides an easy way to combine these elements into plots.

ggplot(<data>) + geom_<type>(aes(<aesthetics>))

data: a data.frame with your data.

geom: Geometric object to use to visualise the data.

aesthetics: Mapping of variables to properties of the geom (location, colour, size, …)
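As a minimal sketch of how the template is filled in, the example below uses mtcars, a dataset that ships with R, rather than the workshop data. The choice of geom and variables is arbitrary.

```r
library(ggplot2)

# data: mtcars; geom: points; aesthetics: weight on x, fuel efficiency on y
p <- ggplot(mtcars) + geom_point(aes(x = wt, y = mpg))

# Printing the object draws the plot
p
```

Storing the plot in a variable is optional, but it makes it easy to add further layers later with +.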

Plotting data

Create a simple box plot of the data.

library(ggplot2)

Use geom_boxplot

Plotting data

ggplot(probly_long) + geom_boxplot(aes(x=phrase, y=probability))

Improving the box plot

Can't really read the labels. Can we fix that?

Take a look at the help page for geom_boxplot, the examples may provide inspiration.

Improved box plot

ggplot(probly_long) + geom_boxplot(aes(x=phrase, y=probability)) + coord_flip()

Improved box plot (alternative solution)

ggplot(probly_long) + geom_boxplot(aes(x=phrase, y=probability)) +
  theme(axis.text.x=element_text(angle=45,hjust=1))

Further improvements

Plot is still hard to read.
Would be helpful to order phrases by perceived probability.
How do we tell ggplot which order to use?

Factor levels are ordered alphabetically by default.
The forcats package supports reordering factor levels based on another variable.

library(forcats)
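A small sketch of what fct_reorder does, using made-up phrases and probabilities. By default it orders the levels by the median of the second variable.

```r
library(forcats)

phrase <- factor(c("Likely", "Unlikely", "Certain",
                   "Likely", "Unlikely", "Certain"))
probability <- c(70, 20, 95, 75, 15, 99)

# Default order of the levels: alphabetical
levels(phrase)

# Levels reordered by the median probability of each phrase
reordered <- fct_reorder(phrase, probability)
levels(reordered)
```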

Ordered plot

probly_long <- mutate(probly_long, phrase=fct_reorder(phrase, probability))
ggplot(probly_long) + geom_boxplot(aes(x=phrase, 
                                       y=probability)) + coord_flip()

Final tweaks

Fix the y axis label.
Use colour to highlight differences between categories.

Search the help system for a way to change the axis label.

The xlab and ylab functions will set the axis labels. Remember that we flipped the coordinate system.

ggplot(probly_long) + geom_boxplot(aes(x=phrase, 
                                       y=probability)) +
  xlab("") + coord_flip()

Group observations by phrase using the group_by function.
New variables can be added with the mutate function.

probly_long <- group_by(probly_long, phrase)
probly_long <- mutate(probly_long, avg=median(probability))
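The effect of grouping before mutate can be seen on a small made-up data frame (toy is a hypothetical name): the median is computed separately for each phrase and repeated on every row of that group.

```r
library(dplyr)

toy <- data.frame(phrase = c("Likely", "Likely",
                             "Unlikely", "Unlikely", "Unlikely"),
                  probability = c(70, 80, 10, 20, 60))

# Group by phrase, then add the per-phrase median as a new column
toy <- group_by(toy, phrase)
toy <- mutate(toy, avg = median(probability))

toy$avg  # 75 75 20 20 20
```

Without the group_by step, avg would hold the same overall median on every row.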

Coloured plot

ggplot(probly_long) + geom_boxplot(aes(x=phrase, 
                                       y=probability,
                                       fill=avg)) + 
  xlab("") + coord_flip()

Adjusting scales

ggplot chose a single colour gradient to visualise median probabilities.
A divergent scale centred on 50% would make a lot more sense.

Scales for aesthetics can be adjusted via scale functions.
scale_<aesthetic>_<type>
scale types include manual, discrete, continuous, …

scale_fill_gradient2(midpoint=50)

Final plot

ggplot(probly_long) + geom_boxplot(aes(x=phrase, 
                                       y=probability, fill=avg)) + 
  xlab("") + coord_flip() +
  scale_fill_gradient2(midpoint=50)

Data cleaning

Cleaning messy data

Data are often messy

Do unexpected outliers contain valuable information or are they just noise?

What could we do to investigate?

Plotting individual level data

It may be helpful to inspect individual responses. Create a plot that facilitates comparison between individual and average responses for each phrase.

A possible solution

library(ggrepel)
phrase_count <- length(levels(probly_long$phrase))
# Flag responses that fall on the opposite side of 50% from the phrase's median
probly_long <- mutate(probly_long,
                      unexpected=sign(probability - 50) != sign(avg - 50) &
                        probability != 50 & avg != 50)
ggplot(probly_long, aes(x=as.numeric(phrase),
                        y=probability)) + 
  geom_line(aes(colour=factor(subject))) +
  geom_point(aes(colour=factor(subject), shape=unexpected, size=unexpected)) +
  geom_smooth(method="loess", colour='black') +
  geom_text_repel(data=filter(probly_long, unexpected), 
                  aes(x=as.numeric(phrase), 
                      y=probability,
                      label=subject)) +
  scale_x_continuous(breaks=1:phrase_count, 
                     labels=levels(probly_long$phrase),
                     minor_breaks = NULL,
                     expand = c(0, 0.4)) +
  scale_size_manual(values=c(0, 2)) +
  scale_shape_manual(values=c(0, 16)) + 
  guides(colour='none', shape='none', size='none') + xlab("") +
  theme(axis.text.x=element_text(angle=45,hjust=1))

Removing unreliable measurements

Be careful about removing outliers
Outliers may be due to corrupted measurements
but may also represent higher than expected variability in the data
Only remove outliers if there is solid evidence that these measurements are unreliable

Remove subject 15 from the dataset.
Create plots for the cleaned data.

Cleaned data

probly_clean <- filter(probly_long, subject != 15)

Summarising data

Getting the numbers

Plots are very useful but sometimes we need actual numbers.
Simple summaries can easily be obtained via dplyr::summarise

prob_summary <- summarise(probly_clean, mean=mean(probability))

More numbers

Compute additional summary statistics.
You could try the median, trimmed mean, standard deviation, interquartile range or median absolute deviation.

summarise can produce multiple summaries at once.

More numbers

prob_summary <- summarise(probly_clean, mean=mean(probability), 
                          sd=sd(probability), 
                          'trimmed mean'=mean(probability, trim = 0.1),
                          median=median(probability), IQR=IQR(probability), 
                          MAD=mad(probability))

Is an event with better than even odds likely?

We can use these data to determine whether the perception of two phrases differs significantly.

Investigate the difference between Better Than Even and Likely.

Use data in wide format (one column for each phrase).

A paired t-test should be useful here.
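A paired t-test compares two measurements taken from the same participants and is equivalent to a one-sample t-test on the per-participant differences. The sketch below illustrates this with made-up ratings from six hypothetical participants. It does not use the workshop data.

```r
# Made-up ratings: one value per participant and phrase (wide format)
likely <- c(70, 75, 80, 65, 70, 85)
better_than_even <- c(55, 60, 60, 58, 51, 65)

# Paired test on the two columns
paired <- t.test(likely, better_than_even, paired = TRUE)

# One-sample test on the differences gives the same result
one_sample <- t.test(likely - better_than_even)

paired$statistic
one_sample$statistic
```

This is why the wide format is convenient here: each row holds both ratings from one participant, so the pairs line up automatically.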

Is an event with better than even odds likely?

t.test(probly$Likely[-15], probly$`Better Than Even`[-15], paired=TRUE)

## 
##  Paired t-test
## 
## data:  probly$Likely[-15] and probly$`Better Than Even`[-15]
## t = 7.1021, df = 44, p-value = 8.102e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   9.802789 17.570544
## sample estimates:
## mean of the differences 
##                13.68667

The Road Ahead

Further adventures in R

Continue to practise using R
Explore your own data
Learn about your favourite analysis techniques in R
CRAN maintains lists of useful and recommended packages by topic
- Social Sciences
- Psychometrics
- Official Statistics (including survey methodology)
- Bayesian Inference

Useful packages

Power analysis:
- pwr
- skpr
- simr
Linear (mixed) models: lme4
Non-linear mixed-effects models: nlme
Eye tracking: eyetrackingR
Factor analysis/SEM: lavaan
Bayesian hypothesis testing: BayesFactor

This presentation

These slides are available at https://humburg.github.io/Rintro/
