The tidyverse describes itself as “an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.” But what does this mean?
It might help to understand what life was like in the dark times, before the tidyverse. R inherited much of its syntax and many of its functions from the older S and S-PLUS languages. Many of these base functions were written decades ago by many different people, and relatively little thought was given to keeping their syntax uniform. For example, while most functions take the variable or object they want to manipulate as their first argument, you will occasionally find functions where it is a later argument (pattern matching functions like grep and sub are good examples). Thus, base R can feel a little disorganized and aesthetically unpleasing from a coding perspective.
The intent of the tidyverse was to create core data science functions that all used the same “grammar” and thus were easy to learn, use, and read. It is no exaggeration to say that the tidyverse has transformed R from a niche programming language for hard-core statisticians and geeks into one of the premier tools for data analysis and data science. In this book (and course, if you are taking it from me), we will largely focus on tidyverse solutions to data wrangling problems. While there are extensions of the tidyverse to the more data analysis side of things, I think the biggest benefit comes in terms of organizing your data.
The Tidyverse Packages
You can install the tidyverse with a simple install.packages command:
install.packages("tidyverse")
However, when you install the tidyverse this way, you are actually installing (at last count) eight distinct core packages that make up the tidyverse ecosystem. You can also install each of these packages individually if you just need the functionality of a specific one, but generally it’s just easier to install them all together. When you load the tidyverse package with a library command, you will see that it loads each of these packages.
The eight packages that make up the tidyverse are:
ggplot2: This is the package that started it all. The “gg” stands for the grammar of graphics and ggplot2 offers an entire grammar for creating beautiful plots. This package is so popular that it has completely eclipsed the “base” R plots people used to make. As someone who used to struggle through making base R plots, I am here to tell you that is a great thing. Because this is such a big part of what we do, I have devoted Chapter 5 to a discussion of making plots with ggplot2.
dplyr: This is the workhorse package of the tidyverse and includes a variety of functions designed to manipulate data, all with handy verbs for names. You mutate to recode variables. You filter to create subsets of data. You select to pick variables in your dataset. And so on. Most of your basic data manipulation functions will live in dplyr.
tidyr: Wondering what the whole “tidy” thing is about? It usually refers to the concept of “tidy data” where each row is an observation and each column is a variable. Unfortunately, data doesn’t always come in this tidy form. The tidyr package has functions that allow you to turn messy data into tidy data. Most important for our purposes, the tidyr package has functions that allow us to reshape data from wide to long formats and vice versa, as we will cover in Chapter 10.
readr: The readr package is all about getting data into R in the first place. Base R has some functions for reading in data, but the readr functions are generally improvements on these functions. They allow you to read in a wide variety of data formats faster and with fewer errors, as we will learn in Chapter 7.
purrr: This package is designed to replace iteration tools like for-loops and the lapply command from base R with more consistent and flexible functions. It is probably the most “advanced” package in the tidyverse. We will cover some of its functionality when we discuss programming in Chapter 13.
tibble: Implicitly, you will use this package more than any other. The tibble is an extension to the basic data.frame object. It has all the functionality we expect from the data.frame plus a lot more. Generally, we will try to work with tibbles rather than data.frames whenever we can.
stringr: This package is designed for dealing with character strings. You might, for example, want to check for a certain pattern in a string. The stringr package has you covered. We won’t cover it in detail here.
forcats: Not actually about cats. This package has a variety of functions to make it easier to work with categorical data in the form of factor variables.
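To give a quick flavor of the three packages that get the least attention in this chapter, here is a minimal sketch. The vectors are made up for illustration, and the snippet assumes the purrr, stringr, and forcats packages are installed (they come with the tidyverse).

```r
library(purrr)
library(stringr)
library(forcats)

# purrr: apply a function to each element and get a numeric vector back
squares <- map_dbl(1:3, function(v) v^2)
squares

# stringr: check which strings contain a pattern
has_an <- str_detect(c("banana", "cherry", "mango"), "an")
has_an

# forcats: move one level of a factor to the front of the level order
degree <- factor(c("College", "HS Diploma", "Less than HS"))
degree <- fct_relevel(degree, "Less than HS")
levels(degree)
```

Each function takes the object being worked on as its first argument, which is the shared "grammar" described above.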
Using Tibbles
Tibbles are the tidyverse version of a data.frame. They have all the same functionality of a data.frame but with many added features that make them easier to work with. You can convert any data.frame into a tibble with the as_tibble command:
   name age ate_breakfast  high_degree height
1   Bob  15          TRUE Less than HS     67
2  Juan  25         FALSE      College     NA
3 Maria  19          TRUE   HS Diploma     64
4  Jane  12          TRUE   HS Diploma     66
5 Howie  21         FALSE      College     72

my_tibble <- as_tibble(my_data_frame)
my_tibble

# A tibble: 5 × 5
  name    age ate_breakfast high_degree  height
  <chr> <dbl> <lgl>         <fct>         <dbl>
1 Bob      15 TRUE          Less than HS     67
2 Juan     25 FALSE         College          NA
3 Maria    19 TRUE          HS Diploma       64
4 Jane     12 TRUE          HS Diploma       66
5 Howie    21 FALSE         College          72
The printing of the two objects already reveals some differences. The tibble includes information about the type of each variable. In this case, “dbl” stands for “double”, which is how R stores numeric values with decimals in memory; the other numeric type is integer (“int”). The remaining labels here are “chr” for character strings, “lgl” for logical (TRUE/FALSE) values, and “fct” for factors.
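If you want to follow along, the small data.frame printed above could be built roughly as follows. This is a sketch: the values are copied from the printed output, but the construction code itself (including how the factor is created) is an assumption.

```r
library(tibble)

# values copied from the printed output; the construction is an assumption
my_data_frame <- data.frame(
  name = c("Bob", "Juan", "Maria", "Jane", "Howie"),
  age = c(15, 25, 19, 12, 21),
  ate_breakfast = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  high_degree = factor(c("Less than HS", "College", "HS Diploma",
                         "HS Diploma", "College")),
  height = c(67, NA, 64, 66, 72),
  stringsAsFactors = FALSE
)

my_tibble <- as_tibble(my_data_frame)

# the storage type behind each column label in the tibble header
typeof(my_tibble$age)            # "double", shown as <dbl>
typeof(my_tibble$ate_breakfast)  # "logical", shown as <lgl>
class(my_tibble$high_degree)     # "factor", shown as <fct>
```

Checking typeof or class on a column is a handy way to confirm what the abbreviations in the tibble header mean.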
The differences in how these two objects print to the screen are more extensive than what you see above. If you simply print a data.frame to output, it will try to print the entire dataset (up to R’s max.print limit), which can be a bit overwhelming if you have many observations.[1] Let’s look at how a large tibble is printed instead.
# A tibble: 145,647 × 11
   wages   age gender race  marstat          education occup nchild foreign_born
   <dbl> <int> <fct>  <fct> <fct>            <fct>     <fct>  <int> <fct>
 1 20.8     52 Female Black Divorced/Separa… HS Diplo… Admi…      3 No
 2 10       19 Female Black Never Married    HS Diplo… Admi…      0 No
 3 25       56 Female Black Divorced/Separa… Bachelor… Othe…      1 No
 4  9.5     22 Female Black Never Married    HS Diplo… Serv…      0 No
 5 17       48 Male   White Never Married    HS Diplo… Manu…      0 No
 6 20       59 Male   Black Never Married    HS Diplo… Manu…      0 No
 7 11       27 Male   Black Never Married    HS Diplo… Manu…      0 No
 8 17.5     30 Female Black Never Married    HS Diplo… Mana…      0 No
 9  8.15    49 Female Black Never Married    HS Diplo… Serv…      0 No
10 21       26 Female White Married          Bachelor… Othe…      0 No
# ℹ 145,637 more rows
# ℹ 2 more variables: earn_type <fct>, earningwt <dbl>
By default, the tibble only prints its first ten rows and includes other contextual information, including the total sample size.
There are other subtler differences between tibbles and data.frames that you can read about in the tibble package documentation. In practice, however, you will use tibbles in the same way you use data.frames. So, for example, if you need the mean of a variable in a tibble:
mean(my_tibble$age)
[1] 18.4
In general, you won’t have to worry too much about transforming base data.frames into tibbles. Most of the packages and functions we use to read in data will return a tibble by default, so you will usually be working with tibbles. Occasionally, if you use a base R function to transform a tibble, you may end up with a base data.frame instead. You will be able to tell the first time you try to print the data.frame to the screen and it goes on for thousands of lines. In this case, you can convert back to a tibble with the as_tibble command (or better yet, replace the base R command with a tidyverse equivalent to ensure it remains a tibble).
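As a small illustration of that round trip, the sketch below uses a made-up tibble (the names scores, group, and score are hypothetical) and the base R aggregate function, which hands back a plain data.frame even when given a tibble.

```r
library(tibble)

# a small made-up tibble, for illustration only
scores <- tibble(
  group = c("a", "a", "b", "b"),
  score = c(1, 3, 5, 7)
)

# aggregate() is a base R function and returns a plain data.frame
score_means_df <- aggregate(score ~ group, data = scores, FUN = mean)
class(score_means_df)

# convert the result back to a tibble
score_means_tbl <- as_tibble(score_means_df)
class(score_means_tbl)
```

Checking class() on an object is a quicker way to find out what you have than waiting for a long printout.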
Piping for Power
One of the major innovations introduced by the tidyverse is the ability to “pipe” the output of one command into another command as its primary input. This piping was originally done with the %>% pipe syntax. However, piping became so popular that base R implemented its own pipe with the |> syntax (available since R 4.1.0). There are slight differences in how these two pipes function, but they are largely equivalent. We will use the base R |> pipe.
Pipes are useful because they allow us to combine multiple commands together into a compound command, without using the more common approach of nesting commands inside other commands. The result is more human-readable code. Additionally, because we are creating a compound command, we don’t litter our environment with a bunch of intermediate objects that were temporarily created to get to the final product.
Let’s start with a simple example. Let’s say I want to take some vector of numeric values x, log it, sum up the results, and then round it. I could do it like so:
x <- c(3, 7, 5, 6, 13)
log_x <- log(x)
sum_x <- sum(log_x)
round(sum_x, 2)
[1] 9.01
I was able to get to my final result, but I ended up creating two intermediate objects of log_x and sum_x to get there. This is not a tidy approach - over time we will end up with an environment littered with these sorts of intermediate temporary objects. An alternative would be to do this all in a single line by nesting:
round(sum(log(x)), 2)
[1] 9.01
This approach works as well and is much more compact, but it’s also hard to read, because the only way to distinguish which functions we are in is by visually matching the parentheses. Instead, let’s try to pipe it:
x |>
  log() |>
  sum() |>
  round(2)
[1] 9.01
I first “pipe” the x vector itself into the first command. By default, R will expect that whatever is piped into a command be the first argument of that command. That will work for all tidyverse commands by design and for most (but not all) R base functions. I can then continue to pipe results until I get to the final step. The result is code that is compact, does not create intermediate objects, and is easy to read. I can easily see the sequential steps that were performed.
Note that, by convention, I start a new line after each pipe, although I could have written it all on one line. When you do this in RStudio, the next line will be indented, which tells you that you are in a pipe and the command is not done. If the indentation gets out of order, you can always use Ctrl+I (or Command+I on Mac) to correct it.
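For the base R functions that do not take their main input as the first argument, the base pipe still offers two ways through. The sketch below uses a made-up character vector (fruits is a hypothetical name). Naming the other arguments lets the piped value fall into the remaining slot, and the _ placeholder (which, as far as I know, requires R 4.2.0 or later and must be passed to a named argument) says explicitly where the piped value should go.

```r
# a made-up vector, for illustration only
fruits <- c("apple", "banana", "mango")

# grep takes the pattern first and the vector second; naming the pattern
# lets the piped vector fill the next open argument (x)
idx_named <- fruits |>
  grep(pattern = "an")

# the _ placeholder marks where the piped value goes (R 4.2.0 or later)
idx_placeholder <- fruits |>
  grep(pattern = "an", x = _)

# sub takes the vector as its third argument
replaced <- fruits |>
  sub(pattern = "an", replacement = "AN", x = _)

idx_named
idx_placeholder
replaced
```

Both grep calls return the positions of the matching elements, so the two approaches are interchangeable here; the placeholder is simply more explicit.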
Now let’s try a more complicated example. In this case, I want to take the earnings dataset and do the following:
1. Create a new variable called has_children that is TRUE if the respondent had more than zero children and FALSE otherwise.
2. Subset the earnings dataset to only respondents under 45 years of age.
3. Drop all variables except wages, gender, race, and has_children.
4. Calculate the mean wages conditional on the three variables of gender, race, and has_children.
5. Sort the resulting mean wages aggregate dataset from lowest mean wage to highest mean wage.
The base R code below accomplishes that task.
# create has_children variable
earnings$has_children <- earnings$nchild > 0

# subset earnings to those under 45 years of age and just the variables
# we want
earnings_sub <- subset(earnings, age < 45,
                       select = c("wages", "gender", "race", "has_children"))

# calculate mean earnings by gender, race, and children status
earnings_agg <- aggregate(wages ~ gender + race + has_children,
                          data = earnings_sub, mean)

# reorder the aggregate earnings from lowest to highest wage
earnings_agg <- earnings_agg[order(earnings_agg$wages), ]
earnings_agg
   gender           race has_children    wages
6  Female         Latino        FALSE 16.47047
10 Female     Indigenous        FALSE 17.18805
4  Female          Black        FALSE 17.28123
18 Female         Latino         TRUE 17.39206
5    Male         Latino        FALSE 17.71734
9    Male     Indigenous        FALSE 17.76656
3    Male          Black        FALSE 18.05983
16 Female          Black         TRUE 18.47575
12 Female Other/Multiple        FALSE 18.54329
22 Female     Indigenous         TRUE 19.28425
2  Female          White        FALSE 20.20004
11   Male Other/Multiple        FALSE 20.71322
17   Male         Latino         TRUE 20.73794
15   Male          Black         TRUE 21.51384
24 Female Other/Multiple         TRUE 21.72578
21   Male     Indigenous         TRUE 21.84132
1    Male          White        FALSE 22.24041
14 Female          White         TRUE 23.65777
8  Female          Asian        FALSE 24.88583
23   Male Other/Multiple         TRUE 27.44830
7    Male          Asian        FALSE 27.66706
20 Female          Asian         TRUE 27.89848
13   Male          White         TRUE 29.44297
19   Male          Asian         TRUE 36.19389
You don’t need to understand all of these functions to get the gist of what is going on here. This code works, but it is unpleasing for several reasons. First, I could have tried piping these commands, but it would have been difficult because the $ assignment I used to create the new variable cannot be placed inside a pipe. Furthermore, the aggregate command takes the dataset as its second, not first, argument, so the default pipe will not work out of the box. You will see that I ended up creating an earnings_sub dataset that I could feed into aggregate, leading to clutter in my environment. Finally, each of the commands uses its own bespoke system to do what it does. The subset command does two things at once (subsetting to respondents under 45 and restricting variables to the four I want). The aggregate command uses a formula to aggregate, and reordering my results has to be done by putting a command inside of indexing brackets. All in all, this is not easy code to follow unless you are an expert at the inner workings of R.
Instead, let’s try this same thing with a tidyverse approach:
# A tibble: 24 × 4
# Groups:   gender, race [12]
   gender race           has_children mean_wages
   <fct>  <fct>          <lgl>             <dbl>
 1 Female Latino         FALSE              16.5
 2 Female Indigenous     FALSE              17.2
 3 Female Black          FALSE              17.3
 4 Female Latino         TRUE               17.4
 5 Male   Latino         FALSE              17.7
 6 Male   Indigenous     FALSE              17.8
 7 Male   Black          FALSE              18.1
 8 Female Black          TRUE               18.5
 9 Female Other/Multiple FALSE              18.5
10 Female Indigenous     TRUE               19.3
# ℹ 14 more rows
I start by piping the earnings dataset into the mutate command which allows me to create a new variable or recode an existing one. The output of that mutate command (which includes the new variable of has_children) is then piped into the filter command which drops all observations that are not under age 45. The output of that filter command is then fed into the select command which drops all variables except for the ones listed. The output of that select command is then fed into the group_by command which creates a “grouped” tibble that can be aggregated along the given dimensions. The output of the group_by command is then fed into the summarize command which calculates the mean of wages across the three groups. The output of this command, which is still a tibble, is then fed into the arrange command which orders the observations from smallest mean wage to largest mean wage. This final output is then assigned to an object called earnings_agg.
I get the same result and all of the code is more compact, symmetric and easy to read. Tidyverse functions are designed as “verbs” and you can see that in action here. We first mutate, then filter, then select, etc. Even if you don’t understand all of the details of these commands yet, the approach is easier to follow. We are also much tidier because we don’t create any intermediate objects along the way.
Note that when you use pipes, the first argument is always the object you are piping and can be left out of the function call. For example, the first argument of filter is the data.frame/tibble you want to subset and the second argument is the logical statement that tells filter what to keep (in this case, observations under age 45). Because the earnings tibble is already being piped in, we can begin with the second argument.
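The sequence of verbs described above can be sketched as follows. This is a reconstruction from that description, not necessarily the exact code behind the output shown. So that it runs on its own, it uses a tiny made-up stand-in called toy_earnings (a hypothetical name, with invented values) in place of the real earnings data.

```r
library(dplyr)
library(tibble)

# hypothetical stand-in for the earnings data; all values are made up
toy_earnings <- tibble(
  wages  = c(20, 10, 25, 9.5, 17, 30),
  age    = c(52, 19, 36, 22, 40, 28),
  gender = factor(c("Female", "Female", "Male", "Male", "Female", "Male")),
  race   = factor(c("Black", "Black", "White", "White", "Black", "White")),
  nchild = c(3L, 0L, 1L, 0L, 2L, 0L)
)

toy_agg <- toy_earnings |>
  # create the new variable
  mutate(has_children = nchild > 0) |>
  # keep respondents under 45; the data argument is supplied by the pipe
  filter(age < 45) |>
  # keep only the variables we need, written without quotes
  select(wages, gender, race, has_children) |>
  # group, then average wages within each group
  group_by(gender, race, has_children) |>
  summarize(mean_wages = mean(wages)) |>
  # sort from lowest to highest mean wage
  arrange(mean_wages)

toy_agg
```

Notice that every step starts with its second argument, because the pipe supplies the tibble as the first. With the real data, swapping toy_earnings for earnings should follow the same steps.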
Referencing Variables in the Tidyverse
Another thing to note here is that generally when you reference variables in tidyverse functions, you don’t ever have to surround them with quotes. You will notice that in the base R code above, I had to feed in a vector of character names for the select argument of the subset command to reduce the dataset to just those variables. The tidyverse command select does the same thing but I can just write the raw names of the variables themselves.
The tidyverse also gives us tools to reference variables more abstractly. Instead of just listing all variables by name, you can use the functions starts_with, ends_with, and contains to identify a group of variables by something shared in their name. You can also combine raw variable names and these functions together with a c() command to get diverse groups of variables. Finally, if you put a - before a variable name, you can exclude it rather than include it. The script below shows you some examples in practice.
load(url("https://github.com/AaronGullickson/stat_data/raw/main/output/crimes.RData"))

# get all variables that start with "percent_"
crimes |>
  select(starts_with("percent_"))
This ability should encourage you to apply some basic logic to how you structure variable names. For example, it might be good to start all variable names with the kind of measure being taken, so you can easily grab different variable types. Sadly, my logic in naming the variables in the crime data wasn’t so sound, but you can do better!
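To round out the helpers mentioned above, here is a short sketch on a made-up tibble. The column names are hypothetical, chosen to follow the "measure first" naming logic just described; they are not the actual variables in the crimes data.

```r
library(dplyr)
library(tibble)

# hypothetical columns for illustration; not the actual crimes data
toy <- tibble(
  state = c("A", "B"),
  percent_male = c(49, 51),
  percent_urban = c(80, 65),
  rate_violent = c(300, 410),
  rate_property = c(2100, 2500)
)

# variables whose names start with "percent_"
pct <- toy |> select(starts_with("percent_"))

# variables whose names end with "_violent"
violent <- toy |> select(ends_with("_violent"))

# variables whose names contain "urban" anywhere
urban <- toy |> select(contains("urban"))

# combine a raw variable name with a helper using c()
state_rates <- toy |> select(c(state, starts_with("rate_")))

# a minus sign excludes a variable instead of including it
no_state <- toy |> select(-state)

names(state_rates)
names(no_state)
```

Because every rate variable shares the rate_ prefix, a single starts_with call picks up all of them, including any added later.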
Learning More
I am not going to go into detail about all of the functionality of the tidyverse here, because we will learn that in subsequent chapters. However, you can also learn more by going to the tidyverse website (tidyverse.org). If you follow the links to individual packages, you will get access to handy cheatsheets for each package, which are a useful reference. For example, the dplyr page links to the dplyr cheatsheet.
[1] The traditional way around this problem is to use the head or tail command instead, to print only the first or last six rows, respectively.