Now that you have successfully loaded your data into R where do we start with data wrangling. Well, many times our dataset may contain more stuff than we actually need. Trimming that dataset down to just what we need will help keep our project tidy. There are two ways that you might want to trim, or subset, your data:
You may want to remove certain kinds of observations. For example you may want to remove all respondents below a certain age, or you may want to remove all cases that are missing on a certain key variable. We can remove these observations with the filter command from the dplyr package.
You may want to remove variables that you don’t need. If you specifically defined an extract to have variables you want, you may not need to do this initially. However, you will usually want to do this after you create all the variables you need for your analysis because many of the original variables will now be extraneous. You can keep only the variables you want with the select command from the dplyr package.
As an example of each case, we will use the ACS data that I loaded in Chapter 7:
# A tibble: 100,293 × 11
tech_hhwt tech_cluster tech_perwt sex age marst race hispan hcovany
<int> <int> <int> <int> <int> <int> <int> <int> <int>
1 73876 NA 73876 1 51 5 1 0 2
2 376096 NA 376096 2 19 6 1 0 2
3 228344 NA 228344 1 50 6 2 0 2
4 177974 NA 177974 2 56 6 2 0 2
5 228344 NA 228344 2 93 5 1 0 2
6 43654 NA 43654 1 62 6 1 0 2
7 265282 NA 265282 1 18 6 7 4 2
8 282072 NA 282072 1 20 6 2 0 2
9 23506 NA 23506 1 53 3 2 0 2
10 174616 NA 174616 1 56 6 1 0 1
# ℹ 100,283 more rows
# ℹ 2 more variables: educd <int>, sei <int>
Filtering Observations
We can remove observations by creating a boolean vector from our variables where TRUE will indicate observations to keep and FALSE will indicate observations to drop. For example, if we wanted to drop all cases in the ACS data that are under age 18 and have a zero value for sei, we could use the following boolean statement:
acs$age >=18& acs$sei !=0
Note that a TRUE response indicates we want to keep the variable, rather than what we want to drop. Therefore, I am not asking for observation below 18 but rather observations that are 18 and over.
We can then feed this boolean vector into the filter command, preferably through a pipe. Because we are feeding the dataset into the filter command as well, we don’t need to put the acs$ prefixes on the variable names.
# pre-filter checksnrow(acs)
[1] 100293
summary(acs$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 22.00 43.00 42.77 63.00 97.00
summary(acs$sei)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 15.00 27.19 52.00 96.00
acs <- acs |>filter(age >=18& sei !=0)# post-filter checksnrow(acs)
[1] 58416
summary(acs$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 32.00 45.00 45.34 59.00 95.00
summary(acs$sei)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.00 19.00 47.00 46.23 68.00 96.00
You can see that these two conditions dropped my dataset size from 100,923 respondents to 58,416 respondents. The new minimum values of my summary commands also show that the filtering worked correctly.
Note that I have assigned the result of the filtering of the acs data back to the acs object. This effectively overwrites my original acs dataset. If I don’t assign it back to an object, then it will just print to the screen and be gone. However, you may be concerned that I am overwriting my original acs dataset. What if I lose something important? You might be tempted to instead do something like this:
## DO NOT DO THIS! ##acs2 <- acs |>filter(age >=18& sei !=0)
This approach may seem safer because you keep both the original dataset and your new reduced one, but it is a Very Bad Idea. You will be performing so many operations that if you create a new object each time, your environment will quickly be swimming in datasets and it will be very hard to track which dataset is the correct one. This is messy and is more likely to lead to mistakes.
But what about this issue of losing stuff? Its a nonexistent one. You can always get the acs data back from before the filter command by just re-running your script from the top. Thats the point of a script! Remember that the objects you create are not the product of your labor. the code that you write is. That code makes every object recoverable at any state.
Selecting Variables
Like filtering, we positively select which variables we want to keep rather than which variables we want to drop. This approach helps keep our dataset tidy. When we get to the ultimate analytical dataset, we want it to include only the variables we definitely need, and not extraneous stuff.
To select variables, simply provide each variable name (with no quotes) in the select command separated by commas. This can (and generally should) be done through a pipe:
In the case, that you only want to drop a couple of variables and don’t want to write down all the variables you want to keep, you can use the ! syntax in front of variable names to indicate that you want everything but that variable (or set of variables if you surround them with a c()).
acs |>select(!tech_perwt)
# A tibble: 58,416 × 10
tech_hhwt tech_cluster sex age marst race hispan hcovany educd sei
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 73876 NA 1 51 5 1 0 2 64 44
2 376096 NA 2 19 6 1 0 2 63 16
3 228344 NA 1 50 6 2 0 2 64 18
4 43654 NA 1 62 6 1 0 2 64 9
5 23506 NA 1 53 3 2 0 2 30 18
6 208196 NA 1 18 6 4 0 2 63 16
7 104098 NA 1 36 6 2 0 1 40 18
8 87308 NA 1 31 6 2 0 2 63 44
9 194764 NA 1 21 6 2 0 2 71 44
10 265282 NA 1 21 6 2 0 2 71 65
# ℹ 58,406 more rows
acs |>select(!c(sex, hcovany, sei))
# A tibble: 58,416 × 8
tech_hhwt tech_cluster tech_perwt age marst race hispan educd
<int> <int> <int> <int> <int> <int> <int> <int>
1 73876 NA 73876 51 5 1 0 64
2 376096 NA 376096 19 6 1 0 63
3 228344 NA 228344 50 6 2 0 64
4 43654 NA 43654 62 6 1 0 64
5 23506 NA 23506 53 3 2 0 30
6 208196 NA 208196 18 6 4 0 63
7 104098 NA 104098 36 6 2 0 40
8 87308 NA 87308 31 6 2 0 63
9 194764 NA 194764 21 6 2 0 71
10 265282 NA 265282 21 6 2 0 71
# ℹ 58,406 more rows
The dplyr package also comes with some additional utilities that can help you identify multiple variables by partial string matching. In particular, the following functions can be very helpful:
starts_with() will identify variables that start with a certain prefix.
ends_with() will identify variables that end with a certain prefix.
contains() will look for a certain character string anywhere within the name.
matches() will look for a regular expression.
For example, you will see that I put a “tech_” prefix on all of the technical variables I included when reading in the ACS data. I can remove all of these variables easily with: