4 Becoming Tidyversant

The tidyverse describes itself as “an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.” But what does this mean?

It might help to understand what life was like in the dark times, before the tidyverse. R inherited much of its syntax and functions from the older S and S-PLUS programs. Many of these base functions were written decades ago by many different people, and relatively little thought was given to uniformity in the way the syntax operated. For example, while most functions take the variable or object they want to manipulate as their first argument, you will occasionally find functions where it comes second or third (the pattern matching functions like grep and sub are good examples). Thus, base R can feel a little disorganized and aesthetically unpleasing from a coding sense.
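To make the argument-order point concrete, here is a small base R sketch. The vector fruits is made up for illustration; the functions are all standard base R.

```r
# A made-up character vector for illustration
fruits <- c("apple", "banana", "cherry")

# Most functions take the object to work on as the first argument
nchar(fruits)
toupper(fruits)

# The pattern matching functions put the pattern first instead:
# in grep() the data is the second argument ...
grep("an", fruits)

# ... and in sub() it is the third, after the pattern and the replacement
sub("a", "A", fruits)
```

Nothing here is broken, but you have to remember which family of functions you are in before you know where your data goes.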

The intent of the tidyverse was to create core data science functions that all used the same “grammar” and thus were easy to learn, use, and read. It is no exaggeration to say that the tidyverse has transformed R from a niche programming language for hard-core statisticians and geeks into one of the premier tools for data analysis and data science. In this book (and course, if you are taking it from me), we will largely focus on tidyverse solutions to data wrangling problems. While there are extensions of the tidyverse to the more data analysis side of things, I think the biggest benefit comes in terms of organizing your data.

The Tidyverse Packages

You can install the tidyverse with a simple install.packages command:

install.packages("tidyverse")

However, when you install the tidyverse this way, you are actually installing (at last count) eight distinct packages that make up the tidyverse ecosystem. You can also install each of these packages individually if you just need the functionality of a specific one, but generally it's just easier to install them all together. When you load the tidyverse package with a library command, you will see that it is loading these eight different packages.

library(tidyverse)

── Attaching packages ────────────────────────────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.3.0      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
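The Conflicts lines in this output mean that two dplyr functions share a name with functions in the base stats package, and the dplyr versions now win by default. As a rough sketch (the ages vector and people tibble are made up for illustration), you can load a single package instead of all eight, and use the package::function syntax to say exactly which version you want:

```r
# Attach only dplyr rather than the whole tidyverse
# (install.packages("dplyr") would install it on its own, if needed)
library(dplyr)

# Made-up data for illustration
ages <- c(15, 25, 19, 12, 21)

# pkg::fun calls a function without attaching its package
people <- tibble::tibble(age = ages)

# The dplyr version of filter keeps rows that meet a condition
adults <- dplyr::filter(people, age > 18)
adults

# The masked stats version is still available, but it does something
# entirely different: it applies a linear filter (here a moving average)
smoothed <- stats::filter(ages, rep(1 / 2, 2))
smoothed
```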

The eight packages that make up the tidyverse are:

	`ggplot2`: This is the package that started it all. The “gg” stands for the grammar of graphics and `ggplot2` offers an entire grammar for creating beautiful plots. This package is so popular that it has completely eclipsed the “base” R plots people used to make. As someone who used to struggle through making base R plots, I am here to tell you that is a great thing. Because this is such a big part of what we do, I have devoted Chapter 5 to a discussion of making plots with `ggplot2`.
	`dplyr`: This is the workhorse package of the tidyverse and includes a variety of functions designed to manipulate data, all with handy verbs for names. You `mutate` to recode variables. You `filter` to create subsets of data. You `select` to pick variables in your dataset. And so on. Most of your basic data manipulation functions will live in `dplyr`.
	`tidyr`: Wondering what the whole “tidy” thing is about? It usually refers to the concept of “tidy data” where each row is an observation and each column is a variable. Unfortunately, data doesn’t always come in this tidy form. The `tidyr` package has functions that allow you to turn messy data into tidy data. Most important for our purposes, the `tidyr` package has functions that allow us to reshape data from wide to long formats and vice versa, as we will cover in Chapter 10.
	`readr`: The `readr` package is all about getting data into R in the first place. Base R has some functions for reading in data, but the `readr` functions are generally improvements on these functions. They allow you to read in a wide variety of data formats faster and with fewer errors, as we will learn in Chapter 7.
	`purrr`: This package is designed to replace iteration tools like for-loops and the `lapply` command from base R with faster and more flexible functions. It is probably the most “advanced” package in the tidyverse. We will cover some of its functionality when we discuss programming in Chapter 13.
	`tibble`: Implicitly, you will use this package more than any other. The `tibble` is an extension to the basic `data.frame` object. It has all the functionality we expect from the `data.frame` plus a lot more. Generally, we will try to work with tibbles rather than data.frames whenever we can.
	`stringr`: This package is designed for dealing with character strings. You might for example want to check for a certain pattern in a string. The `stringr` package has you covered. We won’t cover it in detail here.
	`forcats`: Not actually about cats. This package has a variety of functions to make it easier to work with categorical data in the form of factor variables.
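Since we won’t spend much time on the last two packages, here is a minimal sketch of the kind of thing each one does. The vectors are made up for illustration; str_detect and fct_relevel are just one function from each package, not a tour of either.

```r
library(stringr)
library(forcats)

# Made-up names for illustration
first_names <- c("Bob", "Juan", "Maria", "Jane", "Howie")

# stringr: check each string for a pattern (here, a lowercase "a")
has_a <- str_detect(first_names, "a")
has_a

# A factor whose levels default to alphabetical order
degree <- factor(c("Less than HS", "College", "HS Diploma"))
levels(degree)

# forcats: put the levels in an order that makes substantive sense
degree_ordered <- fct_relevel(degree, "Less than HS", "HS Diploma", "College")
levels(degree_ordered)
```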

Using Tibbles

Tibbles are the tidyverse version of a data.frame. They have all the same functionality of a data.frame but with many added features that make them easier to work with. You can convert any data.frame into a tibble with the as_tibble command:

my_data_frame <- data.frame(
  name = c("Bob", "Juan", "Maria", "Jane", "Howie"),
  age = c(15, 25, 19, 12, 21), 
  ate_breakfast = c(TRUE, FALSE, TRUE, TRUE, FALSE), 
  high_degree= factor(c("Less than HS", "College", "HS Diploma", "HS Diploma", 
                        "College"),
                      levels=c("Less than HS", "HS Diploma", "College")),
  height = c(67, NA, 64, 66, 72))

my_data_frame

   name age ate_breakfast  high_degree height
1   Bob  15          TRUE Less than HS     67
2  Juan  25         FALSE      College     NA
3 Maria  19          TRUE   HS Diploma     64
4  Jane  12          TRUE   HS Diploma     66
5 Howie  21         FALSE      College     72

my_tibble <- as_tibble(my_data_frame)

my_tibble

# A tibble: 5 × 5
  name    age ate_breakfast high_degree  height
  <chr> <dbl> <lgl>         <fct>         <dbl>
1 Bob      15 TRUE          Less than HS     67
2 Juan     25 FALSE         College          NA
3 Maria    19 TRUE          HS Diploma       64
4 Jane     12 TRUE          HS Diploma       66
5 Howie    21 FALSE         College          72

The printing of the two objects already reveals some differences. The tibble includes information about the type of each variable. In this case, “dbl” stands for “double,” which is how R stores real numbers in memory. The other abbreviations shown here are “chr” for character strings, “lgl” for logical (TRUE/FALSE) values, and “fct” for factors; you will also often see “int” for integer values.
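The type labels in a tibble header correspond to how R stores each vector, which you can check yourself in base R with typeof and class. The values below are made up for illustration.

```r
# Plain numbers are stored as doubles unless you ask otherwise
typeof(15)

# The L suffix requests an integer
typeof(15L)

# Character strings and logical values
typeof("Bob")
typeof(TRUE)

# A factor has class "factor" but is stored as integer codes underneath
degree <- factor(c("College", "HS Diploma"))
class(degree)
typeof(degree)
```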

The differences in how these two objects print to the screen are more extensive than what you see above. If you simply print a data.frame to output, it will always print the entire dataset, which can be a bit overwhelming if you have many observations.¹ Let's look at how a large tibble is printed instead.

load(url("https://github.com/AaronGullickson/stat_data/raw/main/output/earnings.RData"))
earnings

# A tibble: 145,647 × 11
   wages   age gender race  marstat          education occup nchild foreign_born
   <dbl> <int> <fct>  <fct> <fct>            <fct>     <fct>  <int> <fct>       
 1 20.8     52 Female Black Divorced/Separa… HS Diplo… Admi…      3 No          
 2 10       19 Female Black Never Married    HS Diplo… Admi…      0 No          
 3 25       56 Female Black Divorced/Separa… Bachelor… Othe…      1 No          
 4  9.5     22 Female Black Never Married    HS Diplo… Serv…      0 No          
 5 17       48 Male   White Never Married    HS Diplo… Manu…      0 No          
 6 20       59 Male   Black Never Married    HS Diplo… Manu…      0 No          
 7 11       27 Male   Black Never Married    HS Diplo… Manu…      0 No          
 8 17.5     30 Female Black Never Married    HS Diplo… Mana…      0 No          
 9  8.15    49 Female Black Never Married    HS Diplo… Serv…      0 No          
10 21       26 Female White Married          Bachelor… Othe…      0 No          
# ℹ 145,637 more rows
# ℹ 2 more variables: earn_type <fct>, earningwt <dbl>

By default, the tibble only prints its first ten rows and includes other contextual information, including the total sample size.
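If ten rows is not what you want, you can ask for something else. The sketch below uses the built-in mtcars data (converted to a tibble) rather than the earnings data, so that it runs without a download; the n and width arguments belong to the tibble print method.

```r
library(tibble)

# mtcars ships with R: 32 rows and 11 columns
cars_tbl <- as_tibble(mtcars)

# Default printing: first ten rows only
cars_tbl

# Ask for more rows
print(cars_tbl, n = 15)

# Show every column, even if they do not fit the console width
print(cars_tbl, n = 5, width = Inf)

# head and tail still work and return smaller tibbles
first_three <- head(cars_tbl, 3)
last_three <- tail(cars_tbl, 3)
```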

There are other subtler differences between tibbles and data.frames that you can read about in the tibble package documentation. In practice, however, you will use tibbles in the same way you use data.frames. So, for example, if you need the mean of a variable in a tibble:

mean(my_tibble$age)

[1] 18.4

In general, you won’t have to worry too much about transforming base data.frames into tibbles. Most of the packages and functions we use to read in data will read in data as a tibble by default, so you will usually be working with tibbles. Occasionally, if you use a base R function to do some data transformation of a tibble, you may end up with a base data.frame instead. You will be able to tell the first time you try to print the data.frame to the screen and it goes on for thousands of lines. In this case, you can convert it back to a tibble with the as_tibble command (or better yet, replace the base R command with a tidyverse equivalent to ensure it remains a tibble).
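Here is a small sketch of that situation, using a made-up scores tibble. The base R aggregate command is one example of a function that hands back a plain data.frame even when you feed it a tibble; is_tibble lets you check what you have.

```r
library(tibble)

# Made-up data for illustration
scores <- tibble(
  group = c("a", "a", "b", "b"),
  score = c(1, 3, 5, 7)
)
is_tibble(scores)

# A base R aggregation returns a plain data.frame
scores_agg <- aggregate(score ~ group, data = scores, FUN = mean)
is_tibble(scores_agg)

# Convert it back to a tibble
scores_agg_tbl <- as_tibble(scores_agg)
is_tibble(scores_agg_tbl)
```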

Piping for Power

One of the major innovations introduced by the tidyverse is the ability to “pipe” the output of one command into another command as its primary input. This piping was originally done by the %>% pipe syntax. However, piping has become so popular that base R implemented its own pipe with the |> syntax (available in R 4.1.0 and later). There are slight differences in how these two pipes function, but they are largely equivalent. We will use the base R |> pipe.

Pipes are useful because they allow us to combine multiple commands together into a compound command, without using the more common approach of nesting commands inside other commands. The result is more human readable code. Additionally, because we are creating a compound command, we don’t litter our environment with a bunch of intermediate objects that were temporarily created to get to the final product.

Let's start with a simple example. Let's say I want to take some vector of numeric values x, log it, sum up the results, and then round it. I could do it like so:

x <- c(3,7,5,6,13)
log_x <- log(x)
sum_x <- sum(log_x)
round(sum_x, 2)

[1] 9.01

I was able to get to my final result, but I ended up creating two intermediate objects of log_x and sum_x to get there. This is not a tidy approach - over time we will end up with an environment littered with these sorts of intermediate temporary objects. An alternative would be to do this all in a single line by nesting:

round(sum(log(x)), 2)

[1] 9.01

This approach works as well and is much more compact, but it's also hard to read, because the only way to distinguish which functions we are in is by visually matching the parentheses. Instead, let's try to pipe it:

x |>
  log() |>
  sum() |>
  round(2)

[1] 9.01

I first “pipe” the x vector itself into the first command. By default, R will expect that whatever is piped into a command be the first argument of that command. That will work for all tidyverse commands by design and for most (but not all) R base functions. I can then continue to pipe results until I get to the final step. The result is code that is compact, does not create intermediate objects, and is easy to read. I can easily see the sequential steps that were performed.
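For the base functions where the data is not the first argument, there are a few workarounds. The sketch below shows three with grep, using a made-up fruits vector. Note that the _ placeholder is my assumption about your setup: it requires R 4.2.0 or later and must be passed to a named argument, so on older versions that line will not even parse.

```r
# Made-up data for illustration
fruits <- c("apple", "banana", "cherry")

# Option 1: name the earlier argument, so the piped vector falls into the
# next open slot (x)
by_name <- fruits |> grep(pattern = "an")

# Option 2 (R 4.2.0+): use the _ placeholder with a named argument
by_placeholder <- fruits |> grep("an", x = _)

# Option 3: wrap the call in an anonymous function
by_lambda <- fruits |> (\(v) grep("an", v))()

by_name
by_placeholder
by_lambda
```

All three are equivalent to grep("an", fruits). Which one reads best is a matter of taste, but none is as clean as a function that simply takes the data first.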

Note that by convention, I am creating a new line after each pipe command, although I could have done it all on one line. When you do this in RStudio, the next line will be indented, which tells you that you are in a pipe and the command is not done. If the indentation gets out of order, you can always use Ctrl+I (or Command+I on Mac) to correct your indentation.

Now let's try a more complicated example. In this case, I want to take the earnings dataset and do the following:

Create a new variable called has_children that is TRUE if the respondent had more than zero children and FALSE otherwise.
Subset the earnings dataset to only respondents under 45 years of age.
Drop all variables except wages, gender, race, and has_children.
Calculate the mean wages conditional on the three variables of gender, race, and has_children.
Sort the resulting mean wages aggregate dataset from lowest mean wage to highest mean wage.

The base R code below accomplishes that task.

# create has_children variable
earnings$has_children <- earnings$nchild>0

# subset earnings to those under 45 years of age and just the variables 
# we want
earnings_sub <- subset(earnings, age<45, 
                       select=c("wages", "gender", "race", "has_children"))

# calculate mean earnings by gender, race, and children status
earnings_agg <- aggregate(wages~gender+race+has_children, data=earnings_sub, mean)

# reorder the aggregate earnings from lowest to highest wage
earnings_agg <- earnings_agg[order(earnings_agg$wages),]

earnings_agg

   gender           race has_children    wages
6  Female         Latino        FALSE 16.47047
10 Female     Indigenous        FALSE 17.18805
4  Female          Black        FALSE 17.28123
18 Female         Latino         TRUE 17.39206
5    Male         Latino        FALSE 17.71734
9    Male     Indigenous        FALSE 17.76656
3    Male          Black        FALSE 18.05983
16 Female          Black         TRUE 18.47575
12 Female Other/Multiple        FALSE 18.54329
22 Female     Indigenous         TRUE 19.28425
2  Female          White        FALSE 20.20004
11   Male Other/Multiple        FALSE 20.71322
17   Male         Latino         TRUE 20.73794
15   Male          Black         TRUE 21.51384
24 Female Other/Multiple         TRUE 21.72578
21   Male     Indigenous         TRUE 21.84132
1    Male          White        FALSE 22.24041
14 Female          White         TRUE 23.65777
8  Female          Asian        FALSE 24.88583
23   Male Other/Multiple         TRUE 27.44830
7    Male          Asian        FALSE 27.66706
20 Female          Asian         TRUE 27.89848
13   Male          White         TRUE 29.44297
19   Male          Asian         TRUE 36.19389

You don’t need to understand all of these functions to get the gist of what is going on here. This code works, but is unpleasing for several reasons. First, I could have tried piping these commands, but it would have been difficult because there is no obvious way to pipe in the creation of a new variable. Furthermore, the aggregate command takes the dataset as its second, not first, argument so the default pipe will not work out of the box. You will see that I ended up creating an earnings_sub dataset that I could feed into aggregate, leading to clutter in my environment. Finally, each of the commands uses its own bespoke system to do the things that it does. The subset command does two things at once (subsetting to respondents under 45 and restricting variables to the four I want). The aggregate command uses a formula to aggregate, and reordering my results has to be done by putting a command inside of indexing brackets. All in all, this is not easy code to follow unless you are an expert at the inner workings of R.

Instead, let's try this same thing with a tidyverse approach:

earnings_agg <- earnings |>
  mutate(has_children = nchild>0) |>
  filter(age<45) |>
  select(wages, gender, race, has_children) |>
  group_by(gender, race, has_children) |>
  summarize(mean_wages=mean(wages)) |>
  arrange(mean_wages)

earnings_agg

# A tibble: 24 × 4
# Groups:   gender, race [12]
   gender race           has_children mean_wages
   <fct>  <fct>          <lgl>             <dbl>
 1 Female Latino         FALSE              16.5
 2 Female Indigenous     FALSE              17.2
 3 Female Black          FALSE              17.3
 4 Female Latino         TRUE               17.4
 5 Male   Latino         FALSE              17.7
 6 Male   Indigenous     FALSE              17.8
 7 Male   Black          FALSE              18.1
 8 Female Black          TRUE               18.5
 9 Female Other/Multiple FALSE              18.5
10 Female Indigenous     TRUE               19.3
# ℹ 14 more rows

I start by piping the earnings dataset into the mutate command which allows me to create a new variable or recode an existing one. The output of that mutate command (which includes the new variable of has_children) is then piped into the filter command which drops all observations that are not under age 45. The output of that filter command is then fed into the select command which drops all variables except for the ones listed. The output of that select command is then fed into the group_by command which creates a “grouped” tibble that can be aggregated along the given dimensions. The output of the group_by command is then fed into the summarize command which calculates the mean of wages within each combination of the three grouping variables. The output of this command, which is still a tibble, is then fed into the arrange command which orders the observations from smallest mean wage to largest mean wage. This final output is then assigned to an object called earnings_agg.

I get the same result and all of the code is more compact, symmetric and easy to read. Tidyverse functions are designed as “verbs” and you can see that in action here. We first mutate, then filter, then select, etc. Even if you don’t understand all of the details of these commands yet, the approach is easier to follow. We are also much tidier because we don’t create any intermediate objects along the way.
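If you want to step through this kind of pipeline without downloading the earnings data, here is the same sequence of verbs on a tiny made-up tibble (the values in toy are invented purely for illustration, and I leave out race to keep it small). One addition that is not in the code above: the .groups = "drop" argument to summarize, which returns an ungrouped tibble rather than one that is still grouped by some of the variables.

```r
library(dplyr)

# Invented toy data, not the real earnings dataset
toy <- tibble::tibble(
  wages  = c(10, 20, 30, 40, 50, 60),
  age    = c(30, 50, 25, 40, 35, 28),
  gender = c("Female", "Female", "Male", "Male", "Female", "Male"),
  nchild = c(0, 2, 1, 0, 0, 3)
)

toy_agg <- toy |>
  mutate(has_children = nchild > 0) |>      # add a new variable
  filter(age < 45) |>                       # drops the one 50-year-old
  select(wages, gender, has_children) |>    # keep only these variables
  group_by(gender, has_children) |>         # define the groups
  summarize(mean_wages = mean(wages), .groups = "drop") |>
  arrange(mean_wages)                       # lowest mean wage first

toy_agg
```

A handy trick when learning is to run the pipeline one verb at a time, deleting everything after a given pipe, so you can see what each step hands to the next.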

Note that when you use pipes, the first argument is always the object you are piping and can be left out of the function call. For example, the first argument of filter is the data.frame/tibble you want to subset and the second argument is the boolean statement which tells filter what to keep (in this case, observations under age 45). Because the earnings tibble is already being piped in, we can begin with the second argument.
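To see that the pipe is only supplying the first argument, compare an ordinary call with its piped equivalent. The people tibble is made up for illustration.

```r
library(dplyr)

# Made-up data for illustration
people <- tibble::tibble(
  name = c("Bob", "Juan", "Maria"),
  age  = c(15, 25, 19)
)

# Ordinary call: the tibble is the first argument
direct <- filter(people, age > 18)

# Piped call: the tibble is supplied by the pipe, so we start with the
# second argument
piped <- people |> filter(age > 18)

# Both routes give the same result
identical(direct, piped)
```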

Referencing Variables in the Tidyverse

Another thing to note here is that generally when you reference variables in tidyverse functions, you don’t ever have to surround them with quotes. You will notice that in the base R code above, I had to feed in a vector of character names for the select argument of the subset command to reduce the dataset to just those variables. The tidyverse command select does the same thing but I can just write the raw names of the variables themselves.

The tidyverse also gives us tools to reference variables more abstractly. Instead of just listing all variables by name, you can use the functions starts_with, ends_with, and contains to identify a group of variables by something shared in their name. You can also combine raw variable names and these functions together with a c() command to get diverse groups of variables. Finally, if you put a - before a variable name, you can exclude it rather than include it. The script below shows you some examples in practice.

load(url("https://github.com/AaronGullickson/stat_data/raw/main/output/crimes.RData"))

# get all variables that start with "percent_"
crimes |>
  select(starts_with("percent_"))

# A tibble: 51 × 2
   percent_male percent_lhs
          <dbl>       <dbl>
 1         48.4       14.2 
 2         52.2        7.28
 3         49.7       13.2 
 4         49.1       13.8 
 5         49.7       17.1 
 6         50.3        8.61
 7         48.8        9.51
 8         48.4       10.2 
 9         47.5        9.41
10         48.9       12.0 
# ℹ 41 more rows

# get all variables that end with "_rate"
crimes |>
  select(ends_with("_rate"))

# A tibble: 51 × 4
   violent_rate property_rate unemploy_rate poverty_rate
          <dbl>         <dbl>         <dbl>        <dbl>
 1         496.         2979.          6.65        13.0 
 2         784.         3158.          7.40         7.49
 3         451.         2962.          6.46        11.6 
 4         538.         3202.          5.54        12.9 
 5         434.         2502.          6.73        10.4 
 6         349.         2670.          4.68         7.17
 7         225.         1802.          6.50         6.91
 8         476.         2645.          5.94         7.98
 9        1144.         4650.          7.37        12.9 
10         429.         2646.          6.31        10.6 
# ℹ 41 more rows

# get all variables that start with percent_ or end with _rate
crimes |>
  select(c(starts_with("percent_"), ends_with("_rate")))

# A tibble: 51 × 6
   percent_male percent_lhs violent_rate property_rate unemploy_rate
          <dbl>       <dbl>        <dbl>         <dbl>         <dbl>
 1         48.4       14.2          496.         2979.          6.65
 2         52.2        7.28         784.         3158.          7.40
 3         49.7       13.2          451.         2962.          6.46
 4         49.1       13.8          538.         3202.          5.54
 5         49.7       17.1          434.         2502.          6.73
 6         50.3        8.61         349.         2670.          4.68
 7         48.8        9.51         225.         1802.          6.50
 8         48.4       10.2          476.         2645.          5.94
 9         47.5        9.41        1144.         4650.          7.37
10         48.9       12.0          429.         2646.          6.31
# ℹ 41 more rows
# ℹ 1 more variable: poverty_rate <dbl>

# get all variables that end with _rate, but not unemploy_rate
crimes |>
  select(c(ends_with("_rate"), -unemploy_rate))

# A tibble: 51 × 3
   violent_rate property_rate poverty_rate
          <dbl>         <dbl>        <dbl>
 1         496.         2979.        13.0 
 2         784.         3158.         7.49
 3         451.         2962.        11.6 
 4         538.         3202.        12.9 
 5         434.         2502.        10.4 
 6         349.         2670.         7.17
 7         225.         1802.         6.91
 8         476.         2645.         7.98
 9        1144.         4650.        12.9 
10         429.         2646.        10.6 
# ℹ 41 more rows

# get all variables that start with percent_, plus gini
crimes |>
  select(c(starts_with("percent_"), gini))

# A tibble: 51 × 3
   percent_male percent_lhs  gini
          <dbl>       <dbl> <dbl>
 1         48.4       14.2   48.0
 2         52.2        7.28  42.4
 3         49.7       13.2   46.8
 4         49.1       13.8   47.6
 5         49.7       17.1   48.9
 6         50.3        8.61  45.8
 7         48.8        9.51  49.6
 8         48.4       10.2   45.5
 9         47.5        9.41  52.8
10         48.9       12.0   48.7
# ℹ 41 more rows
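The examples above cover starts_with and ends_with but not contains, which matches a piece of text anywhere in the variable name. Here is a minimal sketch on a made-up survey tibble (the column names are hypothetical and are not in the crimes data):

```r
library(dplyr)

# Made-up tibble with hypothetical column names
survey <- tibble::tibble(
  id               = 1:3,
  income_2019      = c(10, 20, 30),
  income_2020      = c(11, 21, 31),
  tax_income_share = c(0.1, 0.2, 0.3),
  age              = c(30, 40, 50)
)

# get all variables with "income" anywhere in the name
income_vars <- survey |>
  select(contains("income"))
income_vars

# get id plus the income variables, but not tax_income_share
income_only <- survey |>
  select(c(id, contains("income"), -tax_income_share))
income_only
```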

This ability should encourage you to apply some basic logic to how you structure variable names. For example, it might be good to start all variable names with the kind of measure being taken, so you can easily grab different variable types. Sadly, my logic in naming the variables in the crime data wasn’t so sound, but you can do better!

Learning More

I am not going to go into detail about all of the functionality of the tidyverse here, because we will learn that in subsequent chapters. However, you can also learn more by going to the tidyverse website. If you click on the links to individual packages, you will get access to handy cheatsheets for each of the packages which are a useful reference. For example, the dplyr page links to the cheatsheet for dplyr.

¹ The traditional way around this problem is to use the head or tail command instead, to only print out the first or last six rows, respectively.