13 Iterating for Speed

In this chapter, we continue working on the challenge introduced in Chapter 12: calculating Theil’s H, a measure of racial segregation, for every county in the United States. If you haven’t read that chapter, you should probably do so before continuing.

The dataset we used for this analysis is based on tract-level population counts by race, extracted from Social Explorer. You can see a snippet below:

tracts

# A tibble: 84,539 × 11
     tract_id county_id county_name      pop_total pop_race_white pop_race_black
        <dbl>     <dbl> <chr>                <int>          <int>          <int>
 1 1001020100      1001 Autauga County,…      1865           1428            208
 2 1001020200      1001 Autauga County,…      1861            674           1042
 3 1001020300      1001 Autauga County,…      3492           2413            876
 4 1001020400      1001 Autauga County,…      3987           3500            297
 5 1001020501      1001 Autauga County,…      4121           3209            620
 6 1001020502      1001 Autauga County,…      3256           2090            247
 7 1001020503      1001 Autauga County,…      3513           2411            715
 8 1001020600      1001 Autauga County,…      3839           2717            667
 9 1001020700      1001 Autauga County,…      3369           2146            866
10 1001020801      1001 Autauga County,…      3166           2819            312
# ℹ 84,529 more rows
# ℹ 5 more variables: pop_race_asian <int>, pop_race_other <int>,
#   pop_race_multi <int>, pop_race_latino <int>, pop_race_indigenous <int>

In Chapter 12, we developed a handy function for calculating Theil’s H from census tract data for a given county. This saves us a lot of time and code, because that function allows us to reuse the code for calculating Theil’s H for any county we want. For example, if I want the Theil’s H of Alameda County, California:

tracts |> 
  filter(county_name == "Alameda County, California") |>
  calc_theil_h()

[1] 0.1689115

However, we still have a bit of a problem. If I want to calculate Theil’s H for all 3,222 counties in the dataset, entering them in one by one is going to be extremely laborious. How can I get R to do the hard work here?

Computers are great at iteration. We basically want to tell R to repeat the same procedure over and over again across all counties in my dataset. There are two approaches to iteration in R: loops and mapping.

Looping

Looping is a common procedure in most computer programming languages. In this case, we want a for loop which will iterate across a vector of values and repeat the same code.¹ Here is a simple example of a for loop:

for(i in 1:10) {
  print(i)
}

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

The value i is a placeholder that can be referenced within the for loop itself. The first time the for loop is run, it uses the value of 1 and processes the code in the curly brackets. The next time the for loop is run, it uses the value of 2, and so on until the final value of 10 is reached.
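The placeholder can also be a position rather than a value. As a minimal sketch, using a vector of labels invented purely for illustration, seq_along generates one index per element so that i can be used to look up the matching element:

```r
# Hypothetical labels, invented for illustration
labels <- c("north", "south", "east")

# seq_along(labels) is 1, 2, 3: one position per element
for (i in seq_along(labels)) {
  print(paste(i, labels[i]))
}
```

Looping over positions is handy when you need both the element and where it sits in the vector.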

We want to loop across every county. In order to do that, I need a unique identifier for a county. I could use name or ID in this case. Generally, names are not as reliable, but for more intuitive display, I will use names here as I know that no county names are duplicated within states. To get the unique county names, I can use the unique function:

counties <- unique(tracts$county_name)
length(counties)

[1] 3222

For expository purposes, I am going to take a small sample of these counties to iterate across so that you don’t have to look at output for all 3,222 counties. Once we get the code working for the small sample, we will go back to the full set.

counties_sample <- sample(counties, 10)

First, let’s test out a simple for loop where we print out county names:

for(county in counties_sample) {
  print(county)
}

[1] "Alger County, Michigan"
[1] "Kewaunee County, Wisconsin"
[1] "Catahoula Parish, Louisiana"
[1] "Marshall County, Mississippi"
[1] "Jackson County, Missouri"
[1] "Dakota County, Nebraska"
[1] "Jones County, North Carolina"
[1] "Prince Edward County, Virginia"
[1] "Tipton County, Tennessee"
[1] "Orange County, New York"

Note that I used county rather than i as my placeholder value name because it’s more intuitive. You can use any name you like here. The for loop seems to be working, so let’s try to actually calculate Theil’s H within the loop.

for(county in counties_sample) {
  tracts |>
    filter(county_name == county) |>
    calc_theil_h() |>
    print()
}

[1] 0.1360778
[1] 0.06634935
[1] 0.09818927
[1] 0.1432648
[1] 0.2284551
[1] 0.1364297
[1] 0.02125523
[1] 0.04927384
[1] 0.1285344
[1] 0.1452806

That seemed to work. However, at this point, I am just spitting out the results to the screen. I would prefer to save these results back to something. I could just save the results back to a vector, but I would rather save them back to a tibble that includes both the county name and Theil’s H. Either way, I do this by initializing a NULL object and then adding to that object within the loop. For a vector, I can do this by simply concatenating the new value onto the old with c:

theil_h <- NULL
for(county in counties_sample) {
  h <- tracts |>
    filter(county_name == county) |>
    calc_theil_h()
  theil_h <- c(theil_h, h)
}
theil_h

 [1] 0.13607777 0.06634935 0.09818927 0.14326482 0.22845512 0.13642966
 [7] 0.02125523 0.04927384 0.12853442 0.14528057
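Concatenating with c copies the vector on every pass. When the number of results is known in advance, a common alternative is to create the vector at its final length and fill it in by position. The sketch below uses toy inputs and mean as a stand-in for calc_theil_h, so it illustrates the pattern rather than the actual analysis:

```r
# Toy inputs, invented for illustration (not the tract data)
values <- list(c(1, 2, 3), c(10, 20), c(5, 5, 5, 5))

# Create a numeric vector of the final length up front
result <- numeric(length(values))

for (i in seq_along(values)) {
  # Fill slot i instead of growing the vector with c()
  result[i] <- mean(values[[i]])
}
result
```

For ten counties the difference is negligible, but the habit can matter for long loops.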

However, when I only return a vector, I lose information on which value belongs to which county. Instead, I am going to save the results to a tibble and then use the bind_rows command to add this tibble to my existing tibble:

theil_h <- NULL
for(county in counties_sample) {
  
  # calculate Theil's H for this county
  h <- tracts |>
    filter(county_name == county) |>
    calc_theil_h()
  
  # add this county's values to our dataset of values
  theil_h <- theil_h |>
    bind_rows(tibble(county_name = county, theil_h = h))
}
theil_h

# A tibble: 10 × 2
   county_name                    theil_h
   <chr>                            <dbl>
 1 Alger County, Michigan          0.136 
 2 Kewaunee County, Wisconsin      0.0663
 3 Catahoula Parish, Louisiana     0.0982
 4 Marshall County, Mississippi    0.143 
 5 Jackson County, Missouri        0.228 
 6 Dakota County, Nebraska         0.136 
 7 Jones County, North Carolina    0.0213
 8 Prince Edward County, Virginia  0.0493
 9 Tipton County, Tennessee        0.129 
10 Orange County, New York         0.145

One thing to keep in mind is that if you rerun your for loop, you need to re-initialize the object holding the results as NULL first. Otherwise, the new results will be appended to the pre-existing object.
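A small sketch of this pitfall, using a toy loop that squares the numbers 1 to 3 (the object name results is just for illustration):

```r
results <- NULL
for (x in 1:3) results <- c(results, x^2)
first_run <- length(results)

# Rerunning the loop without resetting results appends to the old values
for (x in 1:3) results <- c(results, x^2)
second_run <- length(results)

# Resetting to NULL first gives a clean result again
results <- NULL
for (x in 1:3) results <- c(results, x^2)
third_run <- length(results)

c(first_run, second_run, third_run)
```

The second run doubles the length of the object because the old values were never cleared.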

Now that I have this code working for a small sample, let’s run it for all 3,222 counties.

theil_h <- NULL
for(county in counties) {
  
  # calculate Theil's H for this county
  h <- tracts |>
    filter(county_name == county) |>
    calc_theil_h()
  
  # add this county's values to our dataset of values
  theil_h <- theil_h |>
    bind_rows(tibble(county_name = county, theil_h = h))
}
theil_h

# A tibble: 3,222 × 2
   county_name              theil_h
   <chr>                      <dbl>
 1 Autauga County, Alabama   0.120 
 2 Baldwin County, Alabama   0.117 
 3 Barbour County, Alabama   0.0625
 4 Bibb County, Alabama      0.155 
 5 Blount County, Alabama    0.129 
 6 Bullock County, Alabama   0.110 
 7 Butler County, Alabama    0.116 
 8 Calhoun County, Alabama   0.188 
 9 Chambers County, Alabama  0.107 
10 Cherokee County, Alabama  0.116 
# ℹ 3,212 more rows

This is too many values to display, so let’s go ahead and look at a histogram of Theil’s H across counties in the US:

ggplot(theil_h, aes(x = theil_h))+
  geom_histogram(color = "grey20", fill = "grey70")+
  theme_bw()+
  labs(x = "Theil's H")

Figure 13.1: Distribution of Theil’s H across US counties.

Mapping

For loops are fairly intuitive but have a bad reputation in R. Historically, for loops in R have been much slower than in many other programming languages (for reasons that are technically complex). For that reason, many people prefer mapping to looping in R for speedier results. In more recent versions of R, looping has improved considerably, so the advantages of mapping may be somewhat overstated. Nonetheless, it is useful to know both methods, as mapping can sometimes offer other efficiencies over looping or provide more compact code.

Our primary tool is going to be the map function from the purrr package. However, you can also use the lapply function from base R in the same manner. Both functions allow you to apply some other function to every element of a list.

What’s a list, you say? You may want to review Chapter 3, but the short answer is that lists are collections of objects. The objects within lists can be of different types, but for the purposes of map and lapply we want lists that contain objects of the same type, since we are going to apply the same function to each element of the list.
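As a quick illustration with toy objects invented for this purpose, here is a list of mixed types next to a list of same-type elements, with lapply applying one function to each element of the latter:

```r
# A list can hold objects of different types
mixed <- list(1:3, "a", TRUE)

# For mapping, we want elements of the same type: here, three integer vectors
numbers <- list(a = 1:3, b = 4:6, c = 7:9)

# lapply() applies sum to each element and returns a list of the same length
sums <- lapply(numbers, sum)
sums
```

Applying sum to mixed would fail on the character element, which is why same-type elements matter here.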

The first step to getting map to work is to create the list. In our case, we want to split up our single dataset of all tracts to a list of datasets where each county is a separate element of the list. Each dataset within the list will have the same structure.

To split a dataset into a list of datasets, we can use the group_by command to define what variable should be used to split the list (county name and id, in this case), followed by a group_split command. For expository purposes, rather than splitting the full tract dataset, I am going to use the sample of ten counties from above.

county_list <- tracts |>
  filter(county_name %in% counties_sample) |>
  group_by(county_name, county_id) |>
  group_split()

length(county_list)

[1] 10
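To see what group_split does on something small enough to read at a glance, here is a sketch with a toy tibble (the rows are invented; only the column names mirror the tract data):

```r
library(dplyr)
library(tibble)

# Toy data invented for illustration
toy <- tibble(
  county_name = c("B County", "A County", "B County", "A County", "A County"),
  pop_total   = c(10, 20, 30, 40, 50)
)

toy_list <- toy |>
  group_by(county_name) |>
  group_split()

# One element per county
length(toy_list)

# Elements follow the sorted order of the grouping variable,
# so "A County" comes first even though "B County" appears first in the data
toy_list[[1]]
```

Note that the order of the list follows the sorted groups, not the order in which the groups first appear in the data.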

Let’s take a look at the first element of this list:

county_list[[1]]

# A tibble: 3 × 11
     tract_id county_id county_name      pop_total pop_race_white pop_race_black
        <dbl>     <dbl> <chr>                <int>          <int>          <int>
1 26003000100     26003 Alger County, M…      3473           3013              2
2 26003000200     26003 Alger County, M…      2203           1925             18
3 26003000300     26003 Alger County, M…      3190           2328            671
# ℹ 5 more variables: pop_race_asian <int>, pop_race_other <int>,
#   pop_race_multi <int>, pop_race_latino <int>, pop_race_indigenous <int>

The first element is the dataset for Alger County, Michigan. Each of the ten elements contains a dataset for a given county.

Now that I have my data structured the way that map wants the input, I can use the map function itself. The other required argument for map is a function that will be applied to each element of the list. I can use a pre-defined function or I can create a custom function. I will first demonstrate how it works by getting a summary for each dataset:

map(county_list, summary)

[[1]]
    tract_id         county_id     county_name          pop_total   
 Min.   :2.6e+10   Min.   :26003   Length:3           Min.   :2203  
 1st Qu.:2.6e+10   1st Qu.:26003   Class :character   1st Qu.:2696  
 Median :2.6e+10   Median :26003   Mode  :character   Median :3190  
 Mean   :2.6e+10   Mean   :26003                      Mean   :2955  
 3rd Qu.:2.6e+10   3rd Qu.:26003                      3rd Qu.:3332  
 Max.   :2.6e+10   Max.   :26003                      Max.   :3473  
 pop_race_white pop_race_black  pop_race_asian  pop_race_other pop_race_multi
 Min.   :1925   Min.   :  2.0   Min.   : 2.00   Min.   :0      Min.   : 79   
 1st Qu.:2126   1st Qu.: 10.0   1st Qu.: 2.00   1st Qu.:0      1st Qu.: 91   
 Median :2328   Median : 18.0   Median : 2.00   Median :0      Median :103   
 Mean   :2422   Mean   :230.3   Mean   :23.33   Mean   :2      Mean   :125   
 3rd Qu.:2670   3rd Qu.:344.5   3rd Qu.:34.00   3rd Qu.:3      3rd Qu.:148   
 Max.   :3013   Max.   :671.0   Max.   :66.00   Max.   :6      Max.   :193   
 pop_race_latino  pop_race_indigenous
 Min.   : 28.00   Min.   : 18.00     
 1st Qu.: 34.50   1st Qu.: 30.50     
 Median : 41.00   Median : 43.00     
 Mean   : 58.33   Mean   : 94.33     
 3rd Qu.: 73.50   3rd Qu.:132.50     
 Max.   :106.00   Max.   :222.00     

[[2]]
    tract_id           county_id     county_name          pop_total   
 Min.   :2.203e+10   Min.   :22025   Length:3           Min.   :1575  
 1st Qu.:2.203e+10   1st Qu.:22025   Class :character   1st Qu.:2122  
 Median :2.203e+10   Median :22025   Mode  :character   Median :2670  
 Mean   :2.203e+10   Mean   :22025                      Mean   :2965  
 3rd Qu.:2.203e+10   3rd Qu.:22025                      3rd Qu.:3660  
 Max.   :2.203e+10   Max.   :22025                      Max.   :4650  
 pop_race_white pop_race_black   pop_race_asian  pop_race_other pop_race_multi 
 Min.   : 509   Min.   : 448.0   Min.   :0.000   Min.   :0      Min.   : 0.00  
 1st Qu.:1362   1st Qu.: 739.0   1st Qu.:0.000   1st Qu.:0      1st Qu.:14.50  
 Median :2215   Median :1030.0   Median :0.000   Median :0      Median :29.00  
 Mean   :1962   Mean   : 935.3   Mean   :2.333   Mean   :0      Mean   :29.67  
 3rd Qu.:2688   3rd Qu.:1179.0   3rd Qu.:3.500   3rd Qu.:0      3rd Qu.:44.50  
 Max.   :3162   Max.   :1328.0   Max.   :7.000   Max.   :0      Max.   :60.00  
 pop_race_latino pop_race_indigenous
 Min.   : 0.00   Min.   : 0.000     
 1st Qu.: 3.50   1st Qu.: 0.000     
 Median : 7.00   Median : 0.000     
 Mean   :32.33   Mean   : 3.333     
 3rd Qu.:48.50   3rd Qu.: 5.000     
 Max.   :90.00   Max.   :10.000     

[[3]]
    tract_id           county_id     county_name          pop_total   
 Min.   :3.104e+10   Min.   :31043   Length:5           Min.   :2899  
 1st Qu.:3.104e+10   1st Qu.:31043   Class :character   1st Qu.:3117  
 Median :3.104e+10   Median :31043   Mode  :character   Median :3928  
 Mean   :3.104e+10   Mean   :31043                      Mean   :4262  
 3rd Qu.:3.104e+10   3rd Qu.:31043                      3rd Qu.:4535  
 Max.   :3.104e+10   Max.   :31043                      Max.   :6829  
 pop_race_white pop_race_black  pop_race_asian pop_race_other pop_race_multi 
 Min.   : 879   Min.   :  9.0   Min.   : 63    Min.   :0      Min.   : 52.0  
 1st Qu.:1037   1st Qu.: 15.0   1st Qu.: 63    1st Qu.:0      1st Qu.: 76.0  
 Median :2430   Median :160.0   Median : 79    Median :0      Median : 85.0  
 Mean   :1919   Mean   :299.2   Mean   :136    Mean   :0      Mean   :119.4  
 3rd Qu.:2514   3rd Qu.:593.0   3rd Qu.:214    3rd Qu.:0      3rd Qu.:148.0  
 Max.   :2737   Max.   :719.0   Max.   :261    Max.   :0      Max.   :236.0  
 pop_race_latino pop_race_indigenous
 Min.   : 151    Min.   :  0.0      
 1st Qu.: 970    1st Qu.: 24.0      
 Median :1710    Median : 40.0      
 Mean   :1696    Mean   : 91.4      
 3rd Qu.:1872    3rd Qu.: 83.0      
 Max.   :3778    Max.   :310.0      

[[4]]
    tract_id          county_id     county_name          pop_total   
 Min.   :2.91e+10   Min.   :29095   Length:224         Min.   :  23  
 1st Qu.:2.91e+10   1st Qu.:29095   Class :character   1st Qu.:2051  
 Median :2.91e+10   Median :29095   Mode  :character   Median :3042  
 Mean   :2.91e+10   Mean   :29095                      Mean   :3194  
 3rd Qu.:2.91e+10   3rd Qu.:29095                      3rd Qu.:3948  
 Max.   :2.91e+10   Max.   :29095                      Max.   :7356  
 pop_race_white   pop_race_black   pop_race_asian   pop_race_other  
 Min.   :   6.0   Min.   :   0.0   Min.   :  0.00   Min.   :  0.00  
 1st Qu.: 866.2   1st Qu.: 169.8   1st Qu.:  0.00   1st Qu.:  0.00  
 Median :1702.5   Median : 507.5   Median : 29.00   Median :  0.00  
 Mean   :1952.6   Mean   : 723.0   Mean   : 55.53   Mean   : 16.65  
 3rd Qu.:2757.2   3rd Qu.:1051.5   3rd Qu.: 82.25   3rd Qu.: 11.00  
 Max.   :5710.0   Max.   :3364.0   Max.   :421.00   Max.   :316.00  
 pop_race_multi   pop_race_latino  pop_race_indigenous
 Min.   :  0.00   Min.   :   0.0   Min.   :  0.00     
 1st Qu.: 58.75   1st Qu.:  90.5   1st Qu.:  0.00     
 Median :103.00   Median : 183.5   Median :  0.00     
 Mean   :126.01   Mean   : 306.1   Mean   : 14.39     
 3rd Qu.:166.00   3rd Qu.: 346.2   3rd Qu.: 13.00     
 Max.   :683.00   Max.   :2737.0   Max.   :403.00     

[[5]]
    tract_id          county_id     county_name          pop_total   
 Min.   :3.71e+10   Min.   :37103   Length:3           Min.   :2776  
 1st Qu.:3.71e+10   1st Qu.:37103   Class :character   1st Qu.:2930  
 Median :3.71e+10   Median :37103   Mode  :character   Median :3083  
 Mean   :3.71e+10   Mean   :37103                      Mean   :3088  
 3rd Qu.:3.71e+10   3rd Qu.:37103                      3rd Qu.:3244  
 Max.   :3.71e+10   Max.   :37103                      Max.   :3404  
 pop_race_white pop_race_black   pop_race_asian  pop_race_other
 Min.   :1727   Min.   : 642.0   Min.   : 4.00   Min.   :0     
 1st Qu.:1854   1st Qu.: 712.0   1st Qu.:20.50   1st Qu.:0     
 Median :1981   Median : 782.0   Median :37.00   Median :0     
 Mean   :1913   Mean   : 854.7   Mean   :31.67   Mean   :0     
 3rd Qu.:2006   3rd Qu.: 961.0   3rd Qu.:45.50   3rd Qu.:0     
 Max.   :2031   Max.   :1140.0   Max.   :54.00   Max.   :0     
 pop_race_multi   pop_race_latino pop_race_indigenous
 Min.   : 19.00   Min.   : 86.0   Min.   : 9.00      
 1st Qu.: 53.50   1st Qu.:144.0   1st Qu.:18.00      
 Median : 88.00   Median :202.0   Median :27.00      
 Mean   : 90.33   Mean   :176.3   Mean   :21.67      
 3rd Qu.:126.00   3rd Qu.:221.5   3rd Qu.:28.00      
 Max.   :164.00   Max.   :241.0   Max.   :29.00      

[[6]]
    tract_id           county_id     county_name          pop_total   
 Min.   :5.506e+10   Min.   :55061   Length:5           Min.   :2370  
 1st Qu.:5.506e+10   1st Qu.:55061   Class :character   1st Qu.:3260  
 Median :5.506e+10   Median :55061   Mode  :character   Median :3724  
 Mean   :5.506e+10   Mean   :55061                      Mean   :4114  
 3rd Qu.:5.506e+10   3rd Qu.:55061                      3rd Qu.:5212  
 Max.   :5.506e+10   Max.   :55061                      Max.   :6004  
 pop_race_white pop_race_black pop_race_asian pop_race_other pop_race_multi 
 Min.   :2208   Min.   : 0.0   Min.   : 0     Min.   : 0     Min.   : 37.0  
 1st Qu.:2959   1st Qu.: 0.0   1st Qu.: 6     1st Qu.: 0     1st Qu.: 48.0  
 Median :3522   Median : 3.0   Median : 6     Median : 2     Median : 68.0  
 Mean   :3848   Mean   :19.4   Mean   :15     Mean   : 9     Mean   : 75.2  
 3rd Qu.:4725   3rd Qu.:16.0   3rd Qu.:20     3rd Qu.: 4     3rd Qu.: 85.0  
 Max.   :5827   Max.   :78.0   Max.   :43     Max.   :39     Max.   :138.0  
 pop_race_latino pop_race_indigenous
 Min.   : 46     Min.   : 0.0       
 1st Qu.: 81     1st Qu.: 0.0       
 Median :100     Median : 0.0       
 Mean   :139     Mean   : 8.2       
 3rd Qu.:143     3rd Qu.: 5.0       
 Max.   :325     Max.   :36.0       

[[7]]
    tract_id           county_id     county_name          pop_total   
 Min.   :2.809e+10   Min.   :28093   Length:10          Min.   :2164  
 1st Qu.:2.809e+10   1st Qu.:28093   Class :character   1st Qu.:2742  
 Median :2.809e+10   Median :28093   Mode  :character   Median :3594  
 Mean   :2.809e+10   Mean   :28093                      Mean   :3398  
 3rd Qu.:2.809e+10   3rd Qu.:28093                      3rd Qu.:3860  
 Max.   :2.809e+10   Max.   :28093                      Max.   :4617  
 pop_race_white   pop_race_black pop_race_asian pop_race_other pop_race_multi  
 Min.   : 375.0   Min.   : 558   Min.   : 0     Min.   : 0.0   Min.   :  0.00  
 1st Qu.: 998.2   1st Qu.: 918   1st Qu.: 0     1st Qu.: 0.0   1st Qu.: 27.75  
 Median :1613.5   Median :1258   Median : 0     Median : 0.0   Median : 91.50  
 Mean   :1600.6   Mean   :1548   Mean   : 4     Mean   :16.7   Mean   : 82.60  
 3rd Qu.:2111.8   3rd Qu.:1989   3rd Qu.: 0     3rd Qu.:21.0   3rd Qu.:113.75  
 Max.   :3098.0   Max.   :3470   Max.   :30     Max.   :99.0   Max.   :165.00  
 pop_race_latino  pop_race_indigenous
 Min.   :  0.00   Min.   : 0.0       
 1st Qu.: 27.75   1st Qu.: 0.0       
 Median :112.50   Median : 0.0       
 Mean   :145.20   Mean   : 1.4       
 3rd Qu.:183.00   3rd Qu.: 0.0       
 Max.   :477.00   Max.   :11.0       

[[8]]
    tract_id           county_id     county_name          pop_total    
 Min.   :3.607e+10   Min.   :36071   Length:92          Min.   : 1568  
 1st Qu.:3.607e+10   1st Qu.:36071   Class :character   1st Qu.: 3327  
 Median :3.607e+10   Median :36071   Mode  :character   Median : 4348  
 Mean   :3.607e+10   Mean   :36071                      Mean   : 4361  
 3rd Qu.:3.607e+10   3rd Qu.:36071                      3rd Qu.: 5130  
 Max.   :3.607e+10   Max.   :36071                      Max.   :10758  
 pop_race_white  pop_race_black   pop_race_asian   pop_race_other  
 Min.   :  246   Min.   :   0.0   Min.   :  0.00   Min.   :  0.00  
 1st Qu.: 1858   1st Qu.: 118.0   1st Qu.: 24.25   1st Qu.:  0.00  
 Median : 2556   Median : 371.5   Median : 60.50   Median :  5.50  
 Mean   : 2637   Mean   : 455.9   Mean   :123.46   Mean   : 21.59  
 3rd Qu.: 3354   3rd Qu.: 623.5   3rd Qu.:180.50   3rd Qu.: 32.50  
 Max.   :10080   Max.   :2405.0   Max.   :591.00   Max.   :256.00  
 pop_race_multi  pop_race_latino  pop_race_indigenous
 Min.   :  0.0   Min.   :   0.0   Min.   : 0.000     
 1st Qu.: 60.0   1st Qu.: 524.0   1st Qu.: 0.000     
 Median :121.0   Median : 892.5   Median : 0.000     
 Mean   :144.5   Mean   : 972.8   Mean   : 5.663     
 3rd Qu.:186.8   3rd Qu.:1320.0   3rd Qu.: 0.000     
 Max.   :618.0   Max.   :3221.0   Max.   :76.000     

[[9]]
    tract_id           county_id     county_name          pop_total   
 Min.   :5.115e+10   Min.   :51147   Length:6           Min.   :3124  
 1st Qu.:5.115e+10   1st Qu.:51147   Class :character   1st Qu.:3178  
 Median :5.115e+10   Median :51147   Mode  :character   Median :3394  
 Mean   :5.115e+10   Mean   :51147                      Mean   :3654  
 3rd Qu.:5.115e+10   3rd Qu.:51147                      3rd Qu.:3488  
 Max.   :5.115e+10   Max.   :51147                      Max.   :5385  
 pop_race_white pop_race_black pop_race_asian   pop_race_other  
 Min.   :1636   Min.   : 702   Min.   :  0.00   Min.   : 0.000  
 1st Qu.:1770   1st Qu.: 827   1st Qu.: 27.25   1st Qu.: 0.000  
 Median :1966   Median :1152   Median : 29.50   Median : 0.000  
 Mean   :2212   Mean   :1087   Mean   : 51.33   Mean   : 1.833  
 3rd Qu.:2394   3rd Qu.:1286   3rd Qu.: 68.50   3rd Qu.: 0.000  
 Max.   :3461   Max.   :1466   Max.   :141.00   Max.   :11.000  
 pop_race_multi   pop_race_latino pop_race_indigenous
 Min.   :  0.00   Min.   : 46.0   Min.   :  0.00     
 1st Qu.: 65.25   1st Qu.: 60.5   1st Qu.:  7.50     
 Median :113.00   Median : 97.0   Median : 23.50     
 Mean   :125.83   Mean   :131.8   Mean   : 44.33     
 3rd Qu.:157.00   3rd Qu.:125.2   3rd Qu.: 29.75     
 Max.   :308.00   Max.   :368.0   Max.   :185.00     

[[10]]
    tract_id           county_id     county_name          pop_total    
 Min.   :4.717e+10   Min.   :47167   Length:13          Min.   : 1566  
 1st Qu.:4.717e+10   1st Qu.:47167   Class :character   1st Qu.: 2783  
 Median :4.717e+10   Median :47167   Mode  :character   Median : 4849  
 Mean   :4.717e+10   Mean   :47167                      Mean   : 4701  
 3rd Qu.:4.717e+10   3rd Qu.:47167                      3rd Qu.: 5602  
 Max.   :4.717e+10   Max.   :47167                      Max.   :10769  
 pop_race_white pop_race_black   pop_race_asian   pop_race_other  
 Min.   :1114   Min.   :  55.0   Min.   :  0.00   Min.   :  0.00  
 1st Qu.:2028   1st Qu.: 286.0   1st Qu.:  0.00   1st Qu.:  0.00  
 Median :2983   Median : 520.0   Median :  4.00   Median :  0.00  
 Mean   :3504   Mean   : 869.5   Mean   : 26.23   Mean   : 31.31  
 3rd Qu.:4603   3rd Qu.:1237.0   3rd Qu.: 26.00   3rd Qu.: 16.00  
 Max.   :8682   Max.   :2938.0   Max.   :120.00   Max.   :169.00  
 pop_race_multi  pop_race_latino pop_race_indigenous
 Min.   : 32.0   Min.   :  0.0   Min.   : 0.000     
 1st Qu.: 70.0   1st Qu.: 35.0   1st Qu.: 0.000     
 Median : 94.0   Median : 96.0   Median : 0.000     
 Mean   :127.7   Mean   :138.2   Mean   : 4.154     
 3rd Qu.:150.0   3rd Qu.:161.0   3rd Qu.: 2.000     
 Max.   :445.0   Max.   :454.0   Max.   :26.000

This is a lot of information! By default, map returns a list of the same length as the input (in this case, 10 elements). Sometimes we can simplify this output, but not here, because each summary is a whole table of statistics rather than a single value.

Let’s try creating a custom function within map itself. In this case, I want to get the sum of pop_total within each dataset.

map(county_list, function(x) { sum(x$pop_total) })

[[1]]
[1] 8866

[[2]]
[1] 8895

[[3]]
[1] 21308

[[4]]
[1] 715526

[[5]]
[1] 9263

[[6]]
[1] 20570

[[7]]
[1] 33980

[[8]]
[1] 401237

[[9]]
[1] 21922

[[10]]
[1] 61116

Here, I am writing the code for a custom function inside the call to another function. My function is quite simple, but you can use this feature to do some quite complex things if you want.

Again, the output is formatted as a list. However, I am returning a single number for each element, so I would prefer that this output be displayed as a simple vector of results. You can do this by using a variety of map_* functions, where the second part indicates the return type you expect. In this case, I can use map_dbl because I expect the output to be a numeric value.² Let’s try it:

map_dbl(county_list, function(x) { sum(x$pop_total) })

 [1]   8866   8895  21308 715526   9263  20570  33980 401237  21922  61116

My result is now displayed as a vector of numbers, which is more compact.
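For reference, here is a short sketch of a few map_* variants on a toy list (the list x is invented for illustration). Each variant returns a vector of the named type. The last line uses vapply, a base R function that likewise requires you to declare the expected type and length of each result:

```r
library(purrr)

# Toy list of numeric vectors, invented for illustration
x <- list(a = c(1, 2, 3), b = c(4, 5), c = 6)

sums    <- map_dbl(x, sum)      # double vector
lengths <- map_int(x, length)   # integer vector
labels  <- map_chr(x, function(v) paste(v, collapse = "-"))  # character vector

# Base R: numeric(1) declares that each result is a single number
sums_base <- vapply(x, sum, numeric(1))

sums
lengths
labels
```

If the function returns something other than the declared type, these functions stop with an error rather than silently returning a list.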

Now that we can see how it works, let’s try out map with our custom calc_theil_h function:

map_dbl(county_list, calc_theil_h)

 [1] 0.13607777 0.09818927 0.13642966 0.22845512 0.02125523 0.06634935
 [7] 0.14326482 0.14528057 0.04927384 0.12853442

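That seemed to work perfectly and gives me the same values as the for loop above, although in a different order: group_split sorts the list by the grouping variables, so the counties now appear alphabetically rather than in the order of counties_sample. However, I would also like to get the county information returned in a table format. How can I do that? I can use a custom function to output a tibble that includes the county name and id in the result: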

map(county_list, function(x) {
  tibble(county_name = unique(x$county_name),
         county_id = unique(x$county_id),
         theil_h = calc_theil_h(x))
})

[[1]]
# A tibble: 1 × 3
  county_name            county_id theil_h
  <chr>                      <dbl>   <dbl>
1 Alger County, Michigan     26003   0.136

[[2]]
# A tibble: 1 × 3
  county_name                 county_id theil_h
  <chr>                           <dbl>   <dbl>
1 Catahoula Parish, Louisiana     22025  0.0982

[[3]]
# A tibble: 1 × 3
  county_name             county_id theil_h
  <chr>                       <dbl>   <dbl>
1 Dakota County, Nebraska     31043   0.136

[[4]]
# A tibble: 1 × 3
  county_name              county_id theil_h
  <chr>                        <dbl>   <dbl>
1 Jackson County, Missouri     29095   0.228

[[5]]
# A tibble: 1 × 3
  county_name                  county_id theil_h
  <chr>                            <dbl>   <dbl>
1 Jones County, North Carolina     37103  0.0213

[[6]]
# A tibble: 1 × 3
  county_name                county_id theil_h
  <chr>                          <dbl>   <dbl>
1 Kewaunee County, Wisconsin     55061  0.0663

[[7]]
# A tibble: 1 × 3
  county_name                  county_id theil_h
  <chr>                            <dbl>   <dbl>
1 Marshall County, Mississippi     28093   0.143

[[8]]
# A tibble: 1 × 3
  county_name             county_id theil_h
  <chr>                       <dbl>   <dbl>
1 Orange County, New York     36071   0.145

[[9]]
# A tibble: 1 × 3
  county_name                    county_id theil_h
  <chr>                              <dbl>   <dbl>
1 Prince Edward County, Virginia     51147  0.0493

[[10]]
# A tibble: 1 × 3
  county_name              county_id theil_h
  <chr>                        <dbl>   <dbl>
1 Tipton County, Tennessee     47167   0.129

This approach worked, but because I am returning a tibble each time, I am back to the list format for my output. What I want is each of these results as rows of a shared tibble. Luckily, bind_rows will do this for me. I just have to pipe the results of map into a bind_rows to get everything formatted nicely:

map(county_list, function(x) {
  tibble(county_name = unique(x$county_name),
         county_id = unique(x$county_id),
         theil_h = calc_theil_h(x))
  }) |>
  bind_rows()

# A tibble: 10 × 3
   county_name                    county_id theil_h
   <chr>                              <dbl>   <dbl>
 1 Alger County, Michigan             26003  0.136 
 2 Catahoula Parish, Louisiana        22025  0.0982
 3 Dakota County, Nebraska            31043  0.136 
 4 Jackson County, Missouri           29095  0.228 
 5 Jones County, North Carolina       37103  0.0213
 6 Kewaunee County, Wisconsin         55061  0.0663
 7 Marshall County, Mississippi       28093  0.143 
 8 Orange County, New York            36071  0.145 
 9 Prince Edward County, Virginia     51147  0.0493
10 Tipton County, Tennessee           47167  0.129
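As an aside, purrr also has its own helper for stacking a list of data frames, list_rbind. The sketch below assumes purrr 1.0.0 or later (where I believe list_rbind was introduced) and uses toy data, with a simple population sum standing in for calc_theil_h. It also uses the \(x) shorthand for function(x), which requires R 4.1 or later:

```r
library(dplyr)
library(purrr)
library(tibble)

# Toy data invented for illustration
toy <- tibble(
  county_name = c("A County", "A County", "B County"),
  pop_total   = c(10, 20, 35)
)

county_totals <- toy |>
  group_by(county_name) |>
  group_split() |>
  # sum() is a stand-in for calc_theil_h in this sketch
  map(\(x) tibble(county_name = unique(x$county_name),
                  pop_total   = sum(x$pop_total))) |>
  list_rbind()

county_totals
```

Either helper gives one row per county; which one to use is largely a matter of taste.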

Now the format is working perfectly. Let’s go ahead and apply this to the full dataset in one pipe:

tracts |>
  group_by(county_name, county_id) |>
  group_split() |>
  map(function(x) {
    tibble(county_name = unique(x$county_name),
           county_id = unique(x$county_id),
           theil_h = calc_theil_h(x))
  }) |>
  bind_rows()

# A tibble: 3,222 × 3
   county_name                      county_id theil_h
   <chr>                                <dbl>   <dbl>
 1 Abbeville County, South Carolina     45001  0.127 
 2 Acadia Parish, Louisiana             22001  0.263 
 3 Accomack County, Virginia            51001  0.129 
 4 Ada County, Idaho                    16001  0.110 
 5 Adair County, Iowa                   19001  0.0902
 6 Adair County, Kentucky               21001  0.0768
 7 Adair County, Missouri               29001  0.0843
 8 Adair County, Oklahoma               40001  0.0424
 9 Adams County, Colorado                8001  0.122 
10 Adams County, Idaho                  16003  0.0240
# ℹ 3,212 more rows

Students often go with the for loop approach because it feels more natural and intuitive. In particular, students are often scared off by the custom function business. However, mapping offers you more flexibility and often more speed, so it’s worth learning. In this case, it was much easier to get both county name and id in the final output using mapping rather than looping.

We can also wrap each command in system.time to see how they actually perform in terms of time.

theil_h <- NULL
system.time(
  for(county in counties) {
    
    # calculate Theil's H for this county
    h <- tracts |>
      filter(county_name == county) |>
      calc_theil_h()
    
    # add this county's values to our dataset of values
    theil_h <- theil_h |>
      bind_rows(tibble(county_name = county, theil_h = h))
  }
)

   user  system elapsed 
 19.227   0.317  19.624

system.time(
  tracts |>
    group_by(county_name, county_id) |>
    group_split() |>
    map(function(x) {
      tibble(county_name = unique(x$county_name),
             county_id = unique(x$county_id),
             theil_h = calc_theil_h(x))
    }) |>
    bind_rows()
)

   user  system elapsed 
 16.503   0.010  16.536

The difference in speed is fairly modest here. Based on the elapsed times above, the for loop takes about 19% more time, which amounts to roughly 3 seconds in this case. These are single runs, so the exact numbers will vary from run to run. For many operations, a gain of this size is probably not enough to warrant a decision on looping vs. mapping. However, for computationally intensive operations, savings of this magnitude may make a big difference in absolute time.
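If you want to compare timings programmatically rather than by eye, you can save the result of system.time and pull out its "elapsed" entry. The sketch below does this for a toy task (squaring the numbers 1 to 10,000), not for the Theil’s H calculation, and the ratio it prints will differ by machine and by run:

```r
# Toy task, invented for illustration
n <- 1e4

time_loop <- system.time({
  out <- NULL
  for (i in seq_len(n)) out <- c(out, i^2)   # grows the vector on each pass
})

time_apply <- system.time({
  out2 <- vapply(seq_len(n), function(i) i^2, numeric(1))
})

# "elapsed" is wall-clock time in seconds
time_loop["elapsed"] / time_apply["elapsed"]
```

Both approaches produce the same values; only the time taken differs.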

¹ The other option is a while loop, which repeats the same code for as long as a condition remains true.
² The equivalent approach in base R is to use sapply instead of lapply; sapply simplifies the list to a vector if possible.