Introduction
When researchers collect quantitative data, they often store it in a CSV, TSV, or TXT file for cleaning and analysis. This data can then be loaded into R for visualization and statistical processing. However, before data can be effectively visualized or analyzed, it often requires cleaning and reshaping to be in the correct format. The tidyverse, a popular collection of R packages, provides tools to make this process more efficient. The tidyverse includes packages like readr (for reading data files), dplyr (for data transformation), tidyr (for reshaping data), and ggplot2 (for data visualization), along with tibble, stringr, purrr, and forcats. In this post, I’ll walk you through some essential functions in these packages for tidying your data, with exercises included to help you practice programming in R.
Prerequisites
Download and install R and RStudio on your computer. Then, open the RStudio and create a new R script so you can save your work for later. To get started, install the tidyverse package and load it into your script:
install.packages("tidyverse") library(tidyverse)
Let’s Create a Hypothetical Dataset!
Let’s create a hypothetical dataset that records students’ IDs, genders, and their mid-term scores in English, Biology, Maths, and Physics.
student_scores <- data.frame( Student_ID = 1:11, English = c(85, 78, 92, 67, 88, 76, 95, 80, 72, 90, 100), Biology = c(95, 87, 90, 79, 94, 96, 93, 82, 89, 97, 105), Maths = c(90, 82, 58, 74, 89, 91, 88, 77, 84, 92, 100), Physics = c(78, 85, 89, 80, 90, 76, 83, 91, 87, 79, 100), Gender = c("f", "f", "m", "m", "f", "m", "m", "f", "m", "m", "non-binary"))
In this dataset, each column is a variable, each row is an observation, and each cell is a single value. You can view the dataset in RStudio using the following lines of code. The first line opens a new window and display the entire dataset, while the second line shows just the first six rows by default:
view(student_scores) head(student_scores)
The dataset looks like this:
Extracting the Rows You Need: filter()
If you want to work specifically with certain rows, you can use the filter()
function. You can combine it with logical or/and boolean operators to extract the rows you want. Commonly used logical operators with filter()
include:
==
(is equal to)!=
(is not equal to)>
(greater than)<
(less than)>=
(greater than or equal to)<=
(less than or equal to)
For example, to create a new dataset (called tmp
) that only includes scores for female students ("f"
represents female while "m"
represents male), you can use ==
:
tmp <- student_scores %>% filter(Gender == "f")
The %>%
operator, known as the pipe, passes the object (in this case, our dataset) on its left as the input argument to the function to its right (in this case, the function below).
Or, if you want scores for students who are NOT female students, use !=
:
tmp <- student_scores %>% filter(Gender != "f")
You can also create a dataset that includes only students with Maths scores of 90 or higher:
tmp <- student_scores %>% filter(Maths >= 90)
To add a bit more complexity, you might want to include only students with Maths scores of 90 or higher and Physics scores above 76. In this case, use the & operator (a boolean operator meaning “and”) to combine these two conditions:
tmp <- student_scores %>% filter(Maths >= 90 & Physics > 76)
Other commonly used (boolean) operators with the filter()
function include:
-
|
(or)!is.na()
(is not an NA value, used to filter out missing values coded as NA in the dataset
Exercise 1: Try creating a dataset that includes only students with ID numbers from 1 to 5. Hint: use the %in%
operator along with filter()
. If you’re unsure what the %in%
operator does, take a few minutes to look it up — it is a useful tool often combined with the filter()
!
Selecting Specific Columns: select()
Sometimes a dataset can have many columns, but we only want to select some of it to play with. We can use select()
to get the specific columns we need:
tmp <- student_scores %>% select(one_of(c("Student_ID", "Maths", "Physics", "Gender")))
Alternatively, you can get the same dataset by deleting the unnecessary column using the -
operator:
tmp <- student_scores %>% select(-c("English", "Biology"))
You can also use the :
operator to select a range of columns. For example, if you want to select Student_ID
, English
, Biology
, and Maths
(since they are next to each other in our dataset), you don’t need to type each column name individually. Instead, you can simply use:
tmp <- student_scores %>% select(Student_ID:Maths)
Creating a New Column Based on an Existing Column: mutate()
Sometimes, you may need to add a new column to your dataset based values in an existing column. For example, if we know whether students failed in the Maths, you can create a new column (let’s name it as fail_in_maths
) that marks a failed score with 1 and others with 0. You can use mutate()
combined with ifelse()
to achieve this:
tmp <- student_scores %>% mutate(fail_in_maths = ifelse(Maths < 60, 1, 0))
If you’re still unsure of what ifelse()
does, try asking RStudio for help:
?ifelse()
Running the code adds fail_in_maths
as a new column, where 1 represents a failing Maths score and 0 represents a passing score.
Exercise 2: Create a dataset that includes only students with a passing Maths score.
- Hint 1: use both
mutate()
andfilter()
. - Hint 2: the
%>%
operator can be used multiple times to chain functions together.
Reshape the Dataset: pivot_longer()
and pivot_wider()
Some functions that I will introduce soon, such as group_by()
, summarise()
, require the dataset to be in a specific shape. Currently, our dataset has each course score in separate columns, which isn’t ideal for calculating descriptive statistics like mean, maximum, minimum, and standard deviation. You can reshape the dataset to list all the scores in one column, with an additional column showing the course corresponding to each score. Since you are going to need this dataset for descriptive statistics, let’s name this new dataset as scores_reshape
.
To do this, we can use pivot_longer()
, which rearranges multiple columns into two new columns: one for the course names and one for the scores:
scores_reshape <- student_scores %>% pivot_longer( cols = English:Physics, # columns to reshape: from English to Physics names_to = "courses", # new column name for the course names values_to = "scores" # new column name for the scores )
After running this, your dataset will have a courses
column listing each course name, and a scores
column showing each student’s score for each course.
Exercise 3: use pivot_wider()
to convert scores_reshape
back to its original form, where each course has a separate column for its scores.
Last but Not Least: group_by()
and summarise()
The group_by()
function allows you to group your dataset by a specific variable so that you can perform further operations, like summarise()
within each group. For example, if you want to calculate the average scores of each course, you can first group the data by course with group_by()
. Then, use summarise()
to create a new dataset that calculates the mean for each group (in this case, each course is a group), with each group represented in a separate row.
To calculate the average scores, use mean()
function as shown below:
mean_scores <- scores_reshape %>% group_by(courses) %>% summarise(ave_scores = mean(scores)) mean_scores
This code will produce a new dataset mean_scores
, with two columns: courses and ave_scores. It will contain four rows, each representing the average score for one of the four courses: English
, Biology
, Maths
, and Physics
.
Note: summarize()
and summarise()
are synonyms. They can be used interchangeably.
Exercise 4: Create a new dataset that calculates the mean and standard deviation of scores for female and male students.
- Hint 1: Use
filter()
to exclude the non-binary student. - Hint 2: Use
sd()
to calculate the standard deviation.
Additional resources
If you find this tutorial helpful and want to explore more functionalities of R for data analysis, here are some online books you can use for self-learning:
- Data Analytics in Digital Research with R (an online book by our former Digital Fellow, Yuxiao Luo)
- R for Data Science (second edition)
- Learning Statistical Models Through Simulation in R
If you’d like to see the answer keys to exercises in this tutorial, or need help working through them, feel free to join our R User Group on Mattermost! I will be posting the answer keys there.