量化金融与金融编程
L07 迭代

曾永艺
厦门大学管理学院

2023-11-24
1 / 42

通常你并不需要手写显性的循环
控制流结构
函数式编程
purrr 包
purrr 包与列表列
purrr -> furrr

2 / 42

1. 通常你并不需要手写显性的循环

( no loops, pls. )

3 / 42

>> 通常你并不需要手写显性的循环4 / 42

>> 通常你并不需要手写显性的循环

1.1 readr 包的 vroom 引擎

fpaths <- list.files("csvs", "\\.csv$",
                     full.names = TRUE)
read_csv(fpaths, id = "file")

1.2 向量化运算

tbl_mtcars <- as_tibble(mtcars)
tbl_mtcars %>% 
  mutate(kmpl = 1.609 * mpg / 3.785)

#> # A tibble: 32 × 12
#>     mpg   cyl  disp    hp  drat    wt
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62
#> 2  21       6   160   110  3.9   2.88
#> 3  22.8     4   108    93  3.85  2.32
#> # ℹ 29 more rows
#> # ℹ 6 more variables: qsec <dbl>,
#> #   vs <dbl>, am <dbl>, gear <dbl>,
#> #   carb <dbl>, kmpl <dbl>

4 / 42

>> 通常你并不需要手写显性的循环

1.1 readr 包的 vroom 引擎

fpaths <- list.files("csvs", "\\.csv$",
                     full.names = TRUE)
read_csv(fpaths, id = "file")

1.2 向量化运算

tbl_mtcars <- as_tibble(mtcars)
tbl_mtcars %>% 
  mutate(kmpl = 1.609 * mpg / 3.785)

#> # A tibble: 32 × 12
#>     mpg   cyl  disp    hp  drat    wt
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62
#> 2  21       6   160   110  3.9   2.88
#> 3  22.8     4   108    93  3.85  2.32
#> # ℹ 29 more rows
#> # ℹ 6 more variables: qsec <dbl>,
#> #   vs <dbl>, am <dbl>, gear <dbl>,
#> #   carb <dbl>, kmpl <dbl>

1.3 ggplot2::aes() 和 facet_*()

tbl_mtcars %>% 
  ggplot(aes(mpg, wt, 
             color = factor(cyl))) +
  geom_point() + 
  facet_grid(cols = vars(am),
             labeller = label_both)

4 / 42

>> 通常你并不需要手写显性的循环5 / 42

>> 通常你并不需要手写显性的循环

1.4 dplyr::group_by()

tbl_mtcars %>% 
  group_by(am) %>%
  summarise(mmpg = mean(mpg), n = n())

#> # A tibble: 2 × 3
#>      am  mmpg     n
#>   <dbl> <dbl> <int>
#> 1     0  17.1    19
#> 2     1  24.4    13

1.5 dplyr 包的 .by/by 参数

tbl_mtcars %>% 
  summarise(
    mmpg = mean(mpg), n = n(),
    .by = am
  )

#> # A tibble: 2 × 3
#>      am  mmpg     n
#>   <dbl> <dbl> <int>
#> 1     1  24.4    13
#> 2     0  17.1    19

5 / 42

>> 通常你并不需要手写显性的循环

1.4 dplyr::group_by()

tbl_mtcars %>% 
  group_by(am) %>%
  summarise(mmpg = mean(mpg), n = n())

#> # A tibble: 2 × 3
#>      am  mmpg     n
#>   <dbl> <dbl> <int>
#> 1     0  17.1    19
#> 2     1  24.4    13

1.5 dplyr 包的 .by/by 参数

tbl_mtcars %>% 
  summarise(
    mmpg = mean(mpg), n = n(),
    .by = am
  )

#> # A tibble: 2 × 3
#>      am  mmpg     n
#>   <dbl> <dbl> <int>
#> 1     1  24.4    13
#> 2     0  17.1    19

1.6 dplyr::rowwise()

tibble(x1 = runif(2), x2 = runif(2)) %>% 
  rowwise() %>%
  mutate(
    m = mean(c_across(starts_with("x")))
  )

#> # A tibble: 2 × 3
#> # Rowwise: 
#>       x1     x2     m
#>    <dbl>  <dbl> <dbl>
#> 1 0.0788 0.889  0.484
#> 2 0.568  0.0275 0.298

1.7 dplyr::across()

tibble(x1 = runif(2), x2 = runif(2)) %>% 
  mutate(
    across(starts_with("x"), ~.x/max(.x))
  )

#> # A tibble: 2 × 2
#>      x1    x2
#>   <dbl> <dbl>
#> 1 0.488 0.562
#> 2 1     1

5 / 42

2. 控制流结构

( control-flow constructs in R )

6 / 42

日出日落，月圆月缺，年尾年头，这是“循环”；

上学还是就业，单身还是结婚，丁克还是生娃，这是“分支”；

不管是循环还是分支，都嵌入在生老病死的时间轴上，这是“顺序”；

所谓尽人事听天命，想来就是心平气和地接受顺序结构，小心翼翼地制定循环结构，在关键时刻控制好分支结构，就这样度过一生罢。

—— 大鹏志，转引自《学R》

7 / 42

>> 2.1 `for` 循环结构

for 循环属于命令式编程（imperative programming）中的重复执行范式

8 / 42

>> 2.1 `for` 循环结构

for 循环属于命令式编程（imperative programming）中的重复执行范式

set.seed(1234)
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10)
)

# 计算各列的均值
mean(df$a)
mean(df$b)
mean(df$c)

#> [1] -0.383
#> [1] -0.118
#> [1] -0.388

8 / 42

>> 2.1 `for` 循环结构

for 循环属于命令式编程（imperative programming）中的重复执行范式

set.seed(1234)
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10)
)

# 计算各列的均值
mean(df$a)
mean(df$b)
mean(df$c)

#> [1] -0.383
#> [1] -0.118
#> [1] -0.388

# for循环的三个组成部分
output <- vector("double", ncol(df))  # 1. output    输出
for(i in seq_along(df)) {             # 2. sequence  循环序列
  output[[i]] <- mean(df[[i]])        # 3. body      循环体
}
output

#> [1] -0.383 -0.118 -0.388

8 / 42

>> 2.1 `for` 循环结构

for 循环的三种模式

9 / 42

>> 2.1 `for` 循环结构

for 循环的三种模式

for(x in xs)：逐个元素循环，当你只关注副作用（如作图、存盘）时，这是最有用的模式

9 / 42

>> 2.1 `for` 循环结构

for 循环的三种模式

for(x in xs)：逐个元素循环，当你只关注副作用（如作图、存盘）时，这是最有用的模式
for(nm in names(xs))：逐个名字循环，在循环体中用 xs[[nm]] 得到命名向量 xs 元素的值。你还可以让输出的名字和输入的名字对应起来：

output <- vector("list", length(xs))
names(output) <- names(xs)
for(nm in names(xs)) {
  output[[nm]] <- .f(xs[[nm]], ...)
}

9 / 42

>> 2.1 `for` 循环结构

for 循环的三种模式

for(x in xs)：逐个元素循环，当你只关注副作用（如作图、存盘）时，这是最有用的模式
for(nm in names(xs))：逐个名字循环，在循环体中用 xs[[nm]] 得到命名向量 xs 元素的值。你还可以让输出的名字和输入的名字对应起来：

output <- vector("list", length(xs))
names(output) <- names(xs)
for(nm in names(xs)) {
  output[[nm]] <- .f(xs[[nm]], ...)
}

for(i in seq_along(xs))：逐个数值索引循环，这是最通用的模式。下面语句会给出元素的名字和取值：

for(i in seq_along(xs)) {
  name <- names(xs)[[i]]
  value <- xs[[i]]
  # ...
}

9 / 42

>> 2.1 `for` 循环结构

特殊情况：“就地”修改

10 / 42

>> 2.1 `for` 循环结构

特殊情况：“就地”修改

rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# output（输出）已备
# 在body（循环体）中直接调用`[[<-`完成“就地”修改 x           #]#]
for(i in seq_along(df)) {
  df[[i]] <- rescale01(df[[i]])  # 不要使用[]
}
df

#> # A tibble: 10 × 3
#>       a      b     c
#>   <dbl>  <dbl> <dbl>
#> 1 0.332 0.153  0.782
#> 2 0.765 0      0.473
#> 3 1     0.0651 0.498
#> # ℹ 7 more rows

10 / 42

>> 2.1 `for` 循环结构

特殊情况：事前无法确定输出的长度

11 / 42

>> 2.1 `for` 循环结构

特殊情况：事前无法确定输出的长度

# **低效的做法**
set.seed(1234)
means <- c(0, 1, 2)
out <- double()  # 空的实数向量
for(i in seq_along(means)) {
  n <- sample(100, 1)
  print(glue::glue("L#{i}: n={n}"))
  out <- c(out,
           rnorm(n, means[[i]]))
}

#> L#1: n=28
#> L#2: n=79
#> L#3: n=2

str(out, vec.len = 2.5)

#>  num [1:109] 0.312 0.314 0.359 ...

11 / 42

>> 2.1 `for` 循环结构

特殊情况：事前无法确定输出的长度

# **低效的做法**
set.seed(1234)
means <- c(0, 1, 2)
out <- double()  # 空的实数向量
for(i in seq_along(means)) {
  n <- sample(100, 1)
  print(glue::glue("L#{i}: n={n}"))
  out <- c(out,
           rnorm(n, means[[i]]))
}

#> L#1: n=28
#> L#2: n=79
#> L#3: n=2

str(out, vec.len = 2.5)

#>  num [1:109] 0.312 0.314 0.359 ...

# **更好的做法**
# 使用更复杂的数据结构，完成循环后再处理
set.seed(1234)
means <- c(0, 1, 2)
out <- vector("list", length(means))
for(i in seq_along(means)) {
  n <- sample(100, 1)
  print(glue::glue("L#{i}: n={n}"))
  out[[i]] <- rnorm(n, means[[i]])
}
out <- unlist(out)
# purrr::flatten_dbl(out)
str(out, vec.len = 2.5)

#> L#1: n=28
#> L#2: n=79
#> L#3: n=2
#>  num [1:109] 0.312 0.314 0.359 ...

11 / 42

>> 2.1 `for` 循环结构 + 2.2 `if` 分支结构

特殊情况：事前无法确定循环的次数 -> while() / repeat + break

12 / 42

>> 2.1 `for` 循环结构 + 2.2 `if` 分支结构

特殊情况：事前无法确定循环的次数 -> while() / repeat + break

flip <- function() sample(c("T", "H"), 1)

set.seed(111)
nheads <- 0
flips <- character()
while(nheads < 3) {
  H_T <- flip()
  if(H_T == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0  # 重新计数
  }
  flips <- c(flips, H_T)
}
flips

#>  [1] "H" "T" "H" "T" "T" "T" "T"
#>  [8] "T" "T" "H" "H" "H"

12 / 42

>> 2.1 `for` 循环结构 + 2.2 `if` 分支结构

特殊情况：事前无法确定循环的次数 -> while() / repeat + break

flip <- function() sample(c("T", "H"), 1)

set.seed(111)
nheads <- 0
flips <- character()
while(nheads < 3) {
  H_T <- flip()
  if(H_T == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0  # 重新计数
  }
  flips <- c(flips, H_T)
}
flips

#>  [1] "H" "T" "H" "T" "T" "T" "T"
#>  [8] "T" "T" "H" "H" "H"

set.seed(111)
nheads <- 0
flips <- character()
repeat {
  H_T <- flip()
  if(H_T == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0  # 重新计数
  }
  flips <- c(flips, H_T)
  if(nheads >= 3) break
}
flips

#>  [1] "H" "T" "H" "T" "T" "T" "T"
#>  [8] "T" "T" "H" "H" "H"

12 / 42

3. 函数式编程

( functional programming )

13 / 42

>> 3.1 R 语言与函数式编程

R 语言的核心其实是一种函数式编程（functional programming）语言，可通过提供向量化函数和函数式编程工具，避免直接调用 for 循环，减少代码重复并提高代码的可读性（当然还是得有人来写这些函数底层的 for 循环）

14 / 42

>> 3.1 R 语言与函数式编程

# 定义函数，计算数据框每列的均值
col_mean <- function(df) {
  out <- vector("double", length(df))
  for(i in seq_along(df)) {
    out[i] <- mean(df[[i]])
  }
  out
}
# 调用函数
col_mean(df)

#> [1] 0.572 0.258 0.524

🤔 该如何一般化 col_mean()？

🤩 定义 col_median()、col_sd() ...？

🙅 NOPE! 更好的做法是 ->

14 / 42

>> 3.1 R 语言与函数式编程

# 定义函数，计算数据框每列的均值
col_mean <- function(df) {
  out <- vector("double", length(df))
  for(i in seq_along(df)) {
    out[i] <- mean(df[[i]])
  }
  out
}
# 调用函数
col_mean(df)

#> [1] 0.572 0.258 0.524

🤔 该如何一般化 col_mean()？

🤩 定义 col_median()、col_sd() ...？

🙅 NOPE! 更好的做法是 ->

# ... 使用函数式编程
# 函数作为参数 -> 泛函
col_summary <- function(df, .fun) {
  out <- vector("double", length(df))
  for(i in seq_along(df)) {
    out[i] <- .fun(df[[i]])
  }
  out
}

col_summary(df, mean)

#> [1] 0.572 0.258 0.524

col_summary(df, sd)

#> [1] 0.290 0.313 0.329

14 / 42

>> 3.2 `apply` 族函数 & `Reduce()` 等高阶函数

apply(X, MARGIN, FUN, ...)
sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
lapply(X, FUN, ...)
sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
rapply(object, f, classes = "ANY", deflt = NULL, how = c("unlist", "replace", "list"), ...)
replicate(n, expr, simplify = "array")
mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
eapply(env, FUN, ..., all.names = FALSE, USE.NAMES = TRUE)
tapply(X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
by(data, INDICES, FUN, ..., simplify = TRUE)
aggregate(x, by, FUN, ..., simplify = TRUE, drop = TRUE) ...
Reduce(f, x, init, right = FALSE, accumulate = FALSE)
Filter(f, x)
Find(f, x, right = FALSE, nomatch = NULL)
Map(f, ...)
Position(f, x, right = FALSE, nomatch = NA_integer_)

15 / 42

4. `purrr` 包 ^v1.0.2

( Functional Programming Tools )

16 / 42

>> 4.1 `purrr` 包的 `map` 族函数

map 族函数："Learn it once, use it everywhere!" - Jenny Bryan

17 / 42

>> 4.1 `purrr` 包的 `map` 族函数

map 族函数："Learn it once, use it everywhere!" - Jenny Bryan

参数统一：map()、map_*()、modify() 和 walk() 函数的第1个参数 .x 为输入向量（包括原子向量和列表），第2个参数 .f 为函数，第3个参数 ... 为传递给 .f 的额外参数。

17 / 42

>> 4.1 `purrr` 包的 `map` 族函数

map 族函数："Learn it once, use it everywhere!" - Jenny Bryan

参数统一：map()、map_*()、modify() 和 walk() 函数的第1个参数 .x 为输入向量（包括原子向量和列表），第2个参数 .f 为函数，第3个参数 ... 为传递给 .f 的额外参数。

返回结果一目了然：map()、map_*()、modify() 和 walk() 的返回结果均为与输入向量等长的向量，map() 返回列表，map_*() 返回向量的类由函数名中的后缀决定，modify() 返回与输入向量同类的输出，而 walk() 则不可见地返回输入向量。

17 / 42

>> 4.1 `purrr` 包的 `map` 族函数

map 族函数："Learn it once, use it everywhere!" - Jenny Bryan

参数统一：map()、map_*()、modify() 和 walk() 函数的第1个参数 .x 为输入向量（包括原子向量和列表），第2个参数 .f 为函数，第3个参数 ... 为传递给 .f 的额外参数。

返回结果一目了然：map()、map_*()、modify() 和 walk() 的返回结果均为与输入向量等长的向量，map() 返回列表，map_*() 返回向量的类由函数名中的后缀决定，modify() 返回与输入向量同类的输出，而 walk() 则不可见地返回输入向量。

一通百通：输入扩展至2个向量（如 map2(.x, .y, .f, ...)）和多个向量构成的列表 .l（如 pmap(.l, .f, ...)），还可同时应用于向量元素和索引（如 imap(.x, .f, ...)）。

17 / 42

>> 4.1 `purrr` 包的 `map` 族函数

map(df, mean)         # 返回列表

#> $a
#> [1] 0.572
#> 
#> $b
#> [1] 0.258
#> 
#> $c
#> [1] 0.524

map_dbl(df, mean)       # 返回实数向量
# df %>% map_dbl(mean)  # 支持管道操作

#>     a     b     c 
#> 0.572 0.258 0.524

18 / 42

>> 4.1 `purrr` 包的 `map` 族函数

map(df, mean)         # 返回列表

#> $a
#> [1] 0.572
#> 
#> $b
#> [1] 0.258
#> 
#> $c
#> [1] 0.524

map_dbl(df, mean)       # 返回实数向量
# df %>% map_dbl(mean)  # 支持管道操作

#>     a     b     c 
#> 0.572 0.258 0.524

相比 for 循环

用函数式编程函数来完成循环，更加简洁
关注完成运算的函数（如 mean），而非准备性步骤（如定义输出 output）
适合用 %>% 将不同函数链接起来解决问题
-> 代码更加易读、易写、易用

相比自行编写的 col_summary()

后台用 C 编写，效率更高
参数 .f 支持函数、公式、字符|整数向量
结果保留元素的名称

👍 ❤️ ✌️

18 / 42

>> 4.1 `purrr` 包的 `map` 族函数

map 族函数的参数 .f 支持快捷写法

19 / 42

>> 4.1 `purrr` 包的 `map` 族函数

map 族函数的参数 .f 支持快捷写法

# 匿名函数
models <- mtcars %>% 
  split(.$cyl) %>%  # 得到3个命名列表
  map(function(df)
        lm(mpg ~ wt, data = df))
# map(\(df) lm(mpg ~ wt, data = df))
# 将匿名函数改写为单侧公式（但不推荐）
models <- mtcars %>% 
  split(.$cyl) %>% 
  map(~ lm(mpg ~ wt, data = .x))
str(models, max.level = 1)

#> List of 3
#>  $ 4:List of 12
#>   ..- attr(*, "class")= chr "lm"
#>  $ 6:List of 12
#>   ..- attr(*, "class")= chr "lm"
#>  $ 8:List of 12
#>   ..- attr(*, "class")= chr "lm"

19 / 42

>> 4.1 `purrr` 包的 `map` 族函数

map 族函数的参数 .f 支持快捷写法

# 匿名函数
models <- mtcars %>% 
  split(.$cyl) %>%  # 得到3个命名列表
  map(function(df)
        lm(mpg ~ wt, data = df))
# map(\(df) lm(mpg ~ wt, data = df))
# 将匿名函数改写为单侧公式（但不推荐）
models <- mtcars %>% 
  split(.$cyl) %>% 
  map(~ lm(mpg ~ wt, data = .x))
str(models, max.level = 1)

#> List of 3
#>  $ 4:List of 12
#>   ..- attr(*, "class")= chr "lm"
#>  $ 6:List of 12
#>   ..- attr(*, "class")= chr "lm"
#>  $ 8:List of 12
#>   ..- attr(*, "class")= chr "lm"

# 匿名函数
models %>% 
  map(summary) %>% 
  map_dbl(\(x) x$r.squared)

#>     4     6     8 
#> 0.509 0.465 0.423

# 直接使用 字符向量 提取元素
# 结果同上，从略
models %>% 
  map(summary) %>% 
  map_dbl("r.squared")
# 若你知道元素的具体位置，也可以直接用
# 整数提取元素，但不推荐

19 / 42

>> 4.1 `purrr` 包的 `map` 族函数

多个输入：map2(.x, .y, .f, ...)、pmap() 和 imap()

20 / 42

>> 4.1 `purrr` 包的 `map` 族函数

多个输入：map2(.x, .y, .f, ...)、pmap() 和 imap()

# 1个输入用map()
mu <- list(5, 10, -3)
mu %>% map(\(m) rnorm(m, n = 5))

# 2个输入呢？
mu <- list(5, 10, -3)
sigma <- list(1, 5, 10)

# 坚持用map()！
set.seed(1234)
seq_along(mu) %>% 
  map(\(i) rnorm(5, mu[[i]], 
                 sigma[[i]])) %>% 
  str()

#> List of 3
#>  $ : num [1:5] 3.79 5.28 6.08 2.65 5.43
#>  $ : num [1:5] 12.53 7.13 7.27 7.18 5.55
#>  $ : num [1:5] -7.77 -12.98 -10.76 -2.36 6.59

20 / 42

>> 4.1 `purrr` 包的 `map` 族函数

多个输入：map2(.x, .y, .f, ...)、pmap() 和 imap()

# 1个输入用map()
mu <- list(5, 10, -3)
mu %>% map(\(m) rnorm(m, n = 5))

# 2个输入呢？
mu <- list(5, 10, -3)
sigma <- list(1, 5, 10)

# 坚持用map()！
set.seed(1234)
seq_along(mu) %>% 
  map(\(i) rnorm(5, mu[[i]], 
                 sigma[[i]])) %>% 
  str()

#> List of 3
#>  $ : num [1:5] 3.79 5.28 6.08 2.65 5.43
#>  $ : num [1:5] 12.53 7.13 7.27 7.18 5.55
#>  $ : num [1:5] -7.77 -12.98 -10.76 -2.36 6.59

# 还是改用map2()吧，:)
set.seed(1234)
map2(mu, sigma, rnorm, n = 5) %>% 
  str() # 结果相同

#> List of 3
#>  $ : num [1:5] 3.79 5.28 6.08 2.65 5.43
#>  $ : num [1:5] 12.53 7.13 7.27 7.18 5.55
#>  $ : num [1:5] -7.77 -12.98 -10.76 -2.36 6.59

20 / 42

>> 4.1 `purrr` 包的 `map` 族函数

多个输入：map2(.x, .y, .f, ...)、pmap(.l, .f, ...) 和 imap()

21 / 42

>> 4.1 `purrr` 包的 `map` 族函数

多个输入：map2(.x, .y, .f, ...)、pmap(.l, .f, ...) 和 imap()

# 我还想让抽样样本数n也有所不同！
n <- list(1, 2, 3)

# 默认为位置匹配
args1 <- list(n, mu, sigma)
args1 %>%
  pmap(rnorm) %>% str()

#> List of 3
#>  $ : num 4.03
#>  $ : num [1:2] 4.46 3.74
#>  $ : num [1:3] -8.24 -7.97 -21.06

21 / 42

>> 4.1 `purrr` 包的 `map` 族函数

多个输入：map2(.x, .y, .f, ...)、pmap(.l, .f, ...) 和 imap()

# 我还想让抽样样本数n也有所不同！
n <- list(1, 2, 3)

# 默认为位置匹配
args1 <- list(n, mu, sigma)
args1 %>%
  pmap(rnorm) %>% str()

#> List of 3
#>  $ : num 4.03
#>  $ : num [1:2] 4.46 3.74
#>  $ : num [1:3] -8.24 -7.97 -21.06

# 使用命名参数列表，匹配.f函数的参数名
# 也可用数据框作为.l参数的取值
args2 <- list(mean = mu, 
              sd = sigma, 
              n = n)
args2 %>% 
  pmap(rnorm) %>% str()

21 / 42

>> 4.1 `purrr` 包的 `map` 族函数

多个输入：map2()、pmap() 和 imap(.x, .f, ...)

imap()为 indexed map 函数

当输入向量 .x 的元素有名称时，它是 map2(.x, names(.x), ...) 的简便写法
当输入向量 .x 的元素没有名称时，它 map2(.x, seq_along(.x), ...) 的简便写法
在你需要同时基于 .x 的取值和名称/索引进行计算时，imap(.x, .f, ...) 很有用

22 / 42

>> 4.1 `purrr` 包的 `map` 族函数

多个输入：map2()、pmap() 和 imap(.x, .f, ...)

imap()为 indexed map 函数

当输入向量 .x 的元素有名称时，它是 map2(.x, names(.x), ...) 的简便写法
当输入向量 .x 的元素没有名称时，它 map2(.x, seq_along(.x), ...) 的简便写法
在你需要同时基于 .x 的取值和名称/索引进行计算时，imap(.x, .f, ...) 很有用

# 元素没有名称，.f 第2个参数为元素位置
imap_chr(sample(LETTERS[1:4]), 
         \(x, i) paste0(i, " -> ", x))

#> [1] "1 -> B" "2 -> D" "3 -> C" "4 -> A"

# 元素有名称，.f 第2个参数为元素名
lst <- map(1:4, ~ sample(1000, 10))
names(lst) <- paste0("#", 1:4)
imap_chr(
  lst, 
  \(x, i) glue::glue(
    "样本{i} 的最大值为 {max(x)}")
)

#>                      #1 
#> "样本#1 的最大值为 962" 
#>                      #2 
#> "样本#2 的最大值为 976" 
#>                      #3 
#> "样本#3 的最大值为 942" 
#>                      #4 
#> "样本#4 的最大值为 877"

22 / 42

>> 4.1 `purrr` 包的 `map` 族函数

不同输出：modify(.x, .f, ...) 和 walk(.x, .f, ...)

23 / 42

>> 4.1 `purrr` 包的 `map` 族函数

不同输出：modify(.x, .f, ...) 和 walk(.x, .f, ...)

modify()、modify2() 和 imodify() 总是返回与输入向量 .x 的类 class 相同的向量

df <- data.frame(
  x = 1:3,
  y = 6:4
)
modify(df, \(x) x * 2)

#>   x  y
#> 1 2 12
#> 2 4 10
#> 3 6  8

尽管 modify()、modify2() 和 imodify() 函数名中含有 modify，但它们并不会“原地修改”输入向量 .x，而只是返回修改后的版本——如果你想永久保留修改，就你必须手动将返回结果赋值给变量

df

#>   x y
#> 1 1 6
#> 2 2 5
#> 3 3 4

df <- modify(df, \(x) x * 2)

23 / 42

>> 4.1 `purrr` 包的 `map` 族函数

不同输出：modify(.x, .f, ...) 和 walk(.x, .f, ...)

24 / 42

>> 4.1 `purrr` 包的 `map` 族函数

不同输出：modify(.x, .f, ...) 和 walk(.x, .f, ...)

walk()、walk2()、iwalk() 和 pwalk()：调用函数不是为了函数的返回值，而是函数的“副作用”（如数据存盘）；这些函数都会不可见地返回第 1 个输入项

# ggsave(filename, plot=last_plot(), 
#   device = NULL, path = NULL, ...)
tmp <- tempdir()
gs <- mtcars %>% 
  split(.$cyl) %>% 
  map(~ ggplot(., aes(wt, mpg)) + 
        geom_point())
fs <- str_c("cyl-", names(gs), ".pdf")
walk2(fs, gs, ggsave, path = tmp)
list.files(tmp, pattern = "^cyl-")

#> [1] "cyl-4.pdf" "cyl-6.pdf" "cyl-8.pdf"

24 / 42

>> 4.1 `purrr` 包的 `map` 族函数

不同输出：modify(.x, .f, ...) 和 walk(.x, .f, ...)

# ggsave(filename, plot=last_plot(), 
#   device = NULL, path = NULL, ...)
tmp <- tempdir()
gs <- mtcars %>% 
  split(.$cyl) %>% 
  map(~ ggplot(., aes(wt, mpg)) + 
        geom_point())
fs <- str_c("cyl-", names(gs), ".pdf")
walk2(fs, gs, ggsave, path = tmp)
list.files(tmp, pattern = "^cyl-")

#> [1] "cyl-4.pdf" "cyl-6.pdf" "cyl-8.pdf"

# 参数位置匹配并使用 ... 来传递参数
# 更推荐的做法是：
walk2(
  fs, gs, 
  \(fs, gs) ggsave(fs, gs, path = tmp)
)

24 / 42

>> 4.1 `purrr` 包的 `map` 族函数

lmap(.x, .f, ...)、lmap_if(.x, .p, .f, ..., .else = NULL) 和 lmap_at(.x, .at, .f, ...)

类似于 map*() 函数，但只适用于输入并返回 list 或 data.frame 的函数 .f，也就是说 .f 应用于 .x 的列表元素而非元素（即 .x[i]，而非 .x[[i]]）。

25 / 42

>> 4.1 `purrr` 包的 `map` 族函数

lmap(.x, .f, ...)、lmap_if(.x, .p, .f, ..., .else = NULL) 和 lmap_at(.x, .at, .f, ...)

类似于 map*() 函数，但只适用于输入并返回 list 或 data.frame 的函数 .f，也就是说 .f 应用于 .x 的列表元素而非元素（即 .x[i]，而非 .x[[i]]）。

map_if(.x, .p, .f, ..., .else = NULL) 和 map_at(.x, .at, .f, ...)

map_if() 将 .f（.else）应用于 .x 中断言函数 .p 取值为 TRUE（FALSE）的元素；
map_at() 将 .f 应用于 .x 中 .at 参数（名称或位置向量）所指定的元素；
modify() 和 lmap() 也有这类条件应用的变体函数。

25 / 42

>> 4.1 `purrr` 包的 `map` 族函数

lmap(.x, .f, ...)、lmap_if(.x, .p, .f, ..., .else = NULL) 和 lmap_at(.x, .at, .f, ...)

类似于 map*() 函数，但只适用于输入并返回 list 或 data.frame 的函数 .f，也就是说 .f 应用于 .x 的列表元素而非元素（即 .x[i]，而非 .x[[i]]）。

map_if(.x, .p, .f, ..., .else = NULL) 和 map_at(.x, .at, .f, ...)

map_if() 将 .f（.else）应用于 .x 中断言函数 .p 取值为 TRUE（FALSE）的元素；
map_at() 将 .f 应用于 .x 中 .at 参数（名称或位置向量）所指定的元素；
modify() 和 lmap() 也有这类条件应用的变体函数。

map_depth(.x, .depth, .f, ..., .ragged = .depth < 0, .is_node = NULL) 和 modify_depth()

map_depth() / modify_depth() 将 .f 应用于嵌套向量 .x 中 .depth 参数所指定深度的元素。

25 / 42

>> 4.2 其它 `purrr` 包函数

列表操作

26 / 42

>> 4.2 其它 `purrr` 包函数

列表操作

list_c(x, ..., ptype = NULL) 将列表x的元素合并为向量；
list_cbind(x, ..., name_repair = c("unique", "universal", "check_unique"), size = NULL) 将列表x的元素（须为数据框或NULL）按行合并为数据框；
list_rbind(x, ..., names_to = rlang::zap(), ptype = NULL) 将列表x的元素（须为数据框或NULL）按列合并为数据框。

26 / 42

>> 4.2 其它 `purrr` 包函数

列表操作

list_c(x, ..., ptype = NULL) 将列表x的元素合并为向量；
list_cbind(x, ..., name_repair = c("unique", "universal", "check_unique"), size = NULL) 将列表x的元素（须为数据框或NULL）按行合并为数据框；
list_rbind(x, ..., names_to = rlang::zap(), ptype = NULL) 将列表x的元素（须为数据框或NULL）按列合并为数据框。

list_flatten(x, ..., name_spec = "{outer}_{inner}", name_repair)：轧平列表x的一层元素索引；
list_simplify(x, ..., strict = TRUE, ptype = NULL)：将列表x简化为原子向量或者 S3 向量；
list_transpose()：转置列表，如 2×3 -> 3×2；
list_assign(x, ..., .is_node = NULL) / list_modify() / list_merge()：根据 ... 按名称或位置赋值 / 修改 / 合并列表x的元素值。

26 / 42

>> 4.2 其它 `purrr` 包函数

支持断言函数（predicate functions）的泛函

27 / 42

>> 4.2 其它 `purrr` 包函数

支持断言函数（predicate functions）的泛函

# keep(.x, .p, ...) | discard()
iris %>% 
  keep(is.factor) %>% 
  str(vec.len = 1)

#> 'data.frame':    150 obs. of  1 variable:
#>  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 ...

# keep_at(x, at, ...) | discard_at()
#   based on their name/position
# compact(.x, .p = identity)
#   remove all NULLs

# every(.x, .p, ...) | some() | none()
list(1:5, letters) %>% 
  some(is_character)

#> [1] TRUE

27 / 42

>> 4.2 其它 `purrr` 包函数

支持断言函数（predicate functions）的泛函

# keep(.x, .p, ...) | discard()
iris %>% 
  keep(is.factor) %>% 
  str(vec.len = 1)

#> 'data.frame':    150 obs. of  1 variable:
#>  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 ...

# keep_at(x, at, ...) | discard_at()
#   based on their name/position
# compact(.x, .p = identity)
#   remove all NULLs

# every(.x, .p, ...) | some() | none()
list(1:5, letters) %>% 
  some(is_character)

#> [1] TRUE

set.seed(1234)
(x <- sample(9))

#> [1] 6 5 4 1 8 2 7 9 3

# detect(.x, .f, ..., 
#   .dir = c("forward", "backward"), 
#   .right = NULL,.default = NULL)
# detect_index()
x %>% detect(~ . > 2)

#> [1] 6

# head_while(.x, .p, ...)|tail_while()
x %>% head_while(~ . > 2)

#> [1] 6 5 4

27 / 42

>> 4.2 其它 `purrr` 包函数

reduce(.x, .f, ..., .init, .dir = c("forward", "backward")) 和 accumulate(.x, .f, ..., .init, .dir = c("forward", "backward"))

28 / 42

>> 4.2 其它 `purrr` 包函数

reduce(.x, .f, ..., .init, .dir = c("forward", "backward")) 和 accumulate(.x, .f, ..., .init, .dir = c("forward", "backward"))

dfs <- list(
  age = tibble(name = "Jo", age = 30),
  sex = tibble(name = c("Jo", "An"), 
               sex = c("M", "F")),
  trt = tibble(name = "An", 
               treatment = "A")
)
dfs %>% reduce(full_join)

#> # A tibble: 2 × 4
#>   name    age sex   treatment
#>   <chr> <dbl> <chr> <chr>    
#> 1 Jo       30 M     <NA>     
#> 2 An       NA F     A

1:10 %>% accumulate(`+`)

#>  [1]  1  3  6 10 15 21 28 36 45 55

28 / 42

>> 4.2 其它 `purrr` 包函数

reduce(.x, .f, ..., .init, .dir = c("forward", "backward")) 和 accumulate(.x, .f, ..., .init, .dir = c("forward", "backward"))

dfs <- list(
  age = tibble(name = "Jo", age = 30),
  sex = tibble(name = c("Jo", "An"), 
               sex = c("M", "F")),
  trt = tibble(name = "An", 
               treatment = "A")
)
dfs %>% reduce(full_join)

#> # A tibble: 2 × 4
#>   name    age sex   treatment
#>   <chr> <dbl> <chr> <chr>    
#> 1 Jo       30 M     <NA>     
#> 2 An       NA F     A

1:10 %>% accumulate(`+`)

#>  [1]  1  3  6 10 15 21 28 36 45 55

reduce() 和 accumulate() 支持的是二元函数，即有两个输入项的函数（及运算符）
还有reduce2() 和 accumulate2()函数

28 / 42

>> 4.2 其它 `purrr` 包函数

safely(.f, otherwise = NULL, quiet = TRUE)、quietly() 和 possibly()

这些函数就是所谓的函数运算符（function operator）——以函数作为参数输入，并返回函数。
safely() 会调整函数的行为（副词），让调整后的函数不再抛出错误，而是返回包含两个元素 result 和 error 的列表；通过 otherwise 参数设定错误时默认值，possibly() 总是成功；而 quietly() 则会捕捉命令的结果、输出、警告和消息。

29 / 42

>> 4.2 其它 `purrr` 包函数

safely(.f, otherwise = NULL, quiet = TRUE)、quietly() 和 possibly()

这些函数就是所谓的函数运算符（function operator）——以函数作为参数输入，并返回函数。
safely() 会调整函数的行为（副词），让调整后的函数不再抛出错误，而是返回包含两个元素 result 和 error 的列表；通过 otherwise 参数设定错误时默认值，possibly() 总是成功；而 quietly() 则会捕捉命令的结果、输出、警告和消息。

x <- list(1, 10, "a")
y <- x %>% map(safely(log)); str(y)

#> List of 3
#>  $ :List of 2
#>   ..$ result: num 0
#>   ..$ error : NULL
#>  $ :List of 2
#>   ..$ result: num 2.3
#>   ..$ error : NULL
#>  $ :List of 2
#>   ..$ result: NULL
#>   ..$ error :List of 2
#>   .. ..$ message: chr "non-numeric argument to mathematical function"
#>   .. ..$ call   : language .Primitive("log")(x, base)
#>   .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

29 / 42

>> 4.2 其它 `purrr` 包函数

safely(.f, otherwise = NULL, quiet = TRUE)、quietly() 和 possibly()

这些函数就是所谓的函数运算符（function operator）——以函数作为参数输入，并返回函数。
safely() 会调整函数的行为（副词），让调整后的函数不再抛出错误，而是返回包含两个元素 result 和 error 的列表；通过 otherwise 参数设定错误时默认值，possibly() 总是成功；而 quietly() 则会捕捉命令的结果、输出、警告和消息。

x <- list(1, 10, "a")
y <- x %>% map(safely(log)); str(y)

#> List of 3
#>  $ :List of 2
#>   ..$ result: num 0
#>   ..$ error : NULL
#>  $ :List of 2
#>   ..$ result: num 2.3
#>   ..$ error : NULL
#>  $ :List of 2
#>   ..$ result: NULL
#>   ..$ error :List of 2
#>   .. ..$ message: chr "non-numeric argument to mathematical function"
#>   .. ..$ call   : language .Primitive("log")(x, base)
#>   .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

y <- list_transpose(y); str(y)

#> List of 2
#>  $ result:List of 3
#>   ..$ : num 0
#>   ..$ : num 2.3
#>   ..$ : NULL
#>  $ error :List of 3
#>   ..$ : NULL
#>   ..$ : NULL
#>   ..$ :List of 2
#>   .. ..$ message: chr "non-numeric argument to mathematical function"
#>   .. ..$ call   : language .Primitive("log")(x, base)
#>   .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

is_ok <- y$error %>% map_lgl(is_null)
y$result[is_ok] %>% list_c()

#> [1] 0.0 2.3

29 / 42

>> 4.2 其它 `purrr` 包函数

pluck(.x, ..., .default = NULL) 和 chuck(.x, ...)：用名称或位置列表来选择 .x 中的一个元素或其属性（+ attr_getter()）

30 / 42

>> 4.2 其它 `purrr` 包函数

pluck(.x, ..., .default = NULL) 和 chuck(.x, ...)：用名称或位置列表来选择 .x 中的一个元素或其属性（+ attr_getter()）

assign_in(x, where, value) 和 modify_in(.x, .where, .f, ...)：修改指定位置元素的取值

30 / 42

>> 4.2 其它 `purrr` 包函数

pluck(.x, ..., .default = NULL) 和 chuck(.x, ...)：用名称或位置列表来选择 .x 中的一个元素或其属性（+ attr_getter()）

assign_in(x, where, value) 和 modify_in(.x, .where, .f, ...)：修改指定位置元素的取值

pluck_depth(x, is.node = NULL)：计算向量 x 的深度

30 / 42

>> 4.2 其它 `purrr` 包函数

pluck(.x, ..., .default = NULL) 和 chuck(.x, ...)：用名称或位置列表来选择 .x 中的一个元素或其属性（+ attr_getter()）

assign_in(x, where, value) 和 modify_in(.x, .where, .f, ...)：修改指定位置元素的取值

pluck_depth(x, is.node = NULL)：计算向量 x 的深度

has_element(.x, .y)：列表 .x 是否包含元素 .y

30 / 42

>> 4.2 其它 `purrr` 包函数

pluck(.x, ..., .default = NULL) 和 chuck(.x, ...)：用名称或位置列表来选择 .x 中的一个元素或其属性（+ attr_getter()）

assign_in(x, where, value) 和 modify_in(.x, .where, .f, ...)：修改指定位置元素的取值

pluck_depth(x, is.node = NULL)：计算向量 x 的深度

has_element(.x, .y)：列表 .x 是否包含元素 .y

array_branch(array, margin = NULL) 和 array_tree(array, margin = NULL)：将数组转化为列表

30 / 42

>> 4.2 其它 `purrr` 包函数

pluck(.x, ..., .default = NULL) 和 chuck(.x, ...)：用名称或位置列表来选择 .x 中的一个元素或其属性（+ attr_getter()）

assign_in(x, where, value) 和 modify_in(.x, .where, .f, ...)：修改指定位置元素的取值

pluck_depth(x, is.node = NULL)：计算向量 x 的深度

has_element(.x, .y)：列表 .x 是否包含元素 .y

array_branch(array, margin = NULL) 和 array_tree(array, margin = NULL)：将数组转化为列表

auto_browse()、insistently()、slowly()、compose()、partial()、negate()：函数运算符，修改函数的行为

30 / 42

5. `purrr` 包与列表列

( purrr and list columns )

31 / 42

32 / 42

可将任意结果（如模型对象）存入列表列并对其进行后续分析，避免重新计算
利用数据框保证相同观测不同变量取值之间的关联关系（如元数据与结果）
可将已掌握的知识/流程/工具直接应用于包含列表列的数据框

32 / 42

>> 5.1 步骤1：生成列表列（list column）33 / 42

>> 5.1 步骤1：生成列表列（list column）

gapminder::gapminder

#> # A tibble: 1,704 × 6
#>   country   continent  year lifeExp
#>   <fct>     <fct>     <int>   <dbl>
#> 1 Afghanis… Asia       1952    28.8
#> 2 Afghanis… Asia       1957    30.3
#> 3 Afghanis… Asia       1962    32.0
#> 4 Afghanis… Asia       1967    34.0
#> 5 Afghanis… Asia       1972    36.1
#> 6 Afghanis… Asia       1977    38.4
#> # ℹ 1,698 more rows
#> # ℹ 2 more variables: pop <int>,
#> #   gdpPercap <dbl>

33 / 42

>> 5.1 步骤1：生成列表列（list column）

gapminder::gapminder

#> # A tibble: 1,704 × 6
#>   country   continent  year lifeExp
#>   <fct>     <fct>     <int>   <dbl>
#> 1 Afghanis… Asia       1952    28.8
#> 2 Afghanis… Asia       1957    30.3
#> 3 Afghanis… Asia       1962    32.0
#> 4 Afghanis… Asia       1967    34.0
#> 5 Afghanis… Asia       1972    36.1
#> 6 Afghanis… Asia       1977    38.4
#> # ℹ 1,698 more rows
#> # ℹ 2 more variables: pop <int>,
#> #   gdpPercap <dbl>

by_cnty <- gapminder::gapminder %>% 
  tidyr::nest(
    data = -c(country, continent))
by_cnty

#> # A tibble: 142 × 3
#>   country     continent data             
#>   <fct>       <fct>     <list>           
#> 1 Afghanistan Asia      <tibble [12 × 4]>
#> 2 Albania     Europe    <tibble [12 × 4]>
#> 3 Algeria     Africa    <tibble [12 × 4]>
#> # ℹ 139 more rows

33 / 42

>> 5.2 步骤2：处理列表列34 / 42

>> 5.2 步骤2：处理列表列

# 将线性回归模型lm应用于data列的每个元素，
# 回归结果（列表）存为新的列表列model
by_cnty <- by_cnty %>% 
  mutate(
    model = map(
      data, 
      \(df) lm(lifeExp ~ year, 
               data = df)
    )
  )
by_cnty

#> # A tibble: 142 × 4
#>   country     continent data              model 
#>   <fct>       <fct>     <list>            <list>
#> 1 Afghanistan Asia      <tibble [12 × 4]> <lm>  
#> 2 Albania     Europe    <tibble [12 × 4]> <lm>  
#> 3 Algeria     Africa    <tibble [12 × 4]> <lm>  
#> 4 Angola      Africa    <tibble [12 × 4]> <lm>  
#> 5 Argentina   Americas  <tibble [12 × 4]> <lm>  
#> 6 Australia   Oceania   <tibble [12 × 4]> <lm>  
#> # ℹ 136 more rows

34 / 42

>> 5.2 步骤2：处理列表列# 将线性回归模型lm应用于data列的每个元素，
# 回归结果（列表）存为新的列表列model
by_cnty <- by_cnty %>% 
  mutate(
    model = map(
      data, 
      \(df) lm(lifeExp ~ year, 
               data = df)
    )
  )
by_cnty

#> # A tibble: 142 × 4
#>   country     continent data              model 
#>   <fct>       <fct>     <list>            <list>
#> 1 Afghanistan Asia      <tibble [12 × 4]> <lm>  
#> 2 Albania     Europe    <tibble [12 × 4]> <lm>  
#> 3 Algeria     Africa    <tibble [12 × 4]> <lm>  
#> 4 Angola      Africa    <tibble [12 × 4]> <lm>  
#> 5 Argentina   Americas  <tibble [12 × 4]> <lm>  
#> 6 Australia   Oceania   <tibble [12 × 4]> <lm>  
#> # ℹ 136 more rows
# 查看model列表列所保存对象的内容
# -> 提取model列表列的第1个元素
#    by_cnty$model[[1]]
by_cnty %>% 
  pluck("model", 1)

#> 
#> Call:
#> lm(formula = lifeExp ~ year, data = df)
#> 
#> Coefficients:
#> (Intercept)         year  
#>    -507.534        0.275
34 / 42

>> 5.3 步骤3：简化列表列35 / 42

>> 5.3 步骤3：简化列表列

# 还是用mutate + map_*提取信息
by_cnty %>% 
  mutate(
    coef_year = map_dbl(
      model, ~ coef(.x)[["year"]]
    )
  ) %>% 
  select(-data, -model)

#> # A tibble: 142 × 3
#>   country     continent coef_year
#>   <fct>       <fct>         <dbl>
#> 1 Afghanistan Asia          0.275
#> 2 Albania     Europe        0.335
#> 3 Algeria     Africa        0.569
#> 4 Angola      Africa        0.209
#> 5 Argentina   Americas      0.232
#> 6 Australia   Oceania       0.228
#> # ℹ 136 more rows

35 / 42

>> 5.3 步骤3：简化列表列# 还是用mutate + map_*提取信息
by_cnty %>% 
  mutate(
    coef_year = map_dbl(
      model, ~ coef(.x)[["year"]]
    )
  ) %>% 
  select(-data, -model)

#> # A tibble: 142 × 3
#>   country     continent coef_year
#>   <fct>       <fct>         <dbl>
#> 1 Afghanistan Asia          0.275
#> 2 Albania     Europe        0.335
#> 3 Algeria     Africa        0.569
#> 4 Angola      Africa        0.209
#> 5 Argentina   Americas      0.232
#> 6 Australia   Oceania       0.228
#> # ℹ 136 more rows
# 使用broom包，更强大、也更方便
# glance() | tidy() | augment()
by_cnty %>%
  mutate(
    res = map(model, 
              broom::glance)) %>%
  tidyr::unnest(res) %>%
  select(-c(data, model))

#> # A tibble: 142 × 14
#>   country    continent r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC
#>   <fct>      <fct>         <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl>
#> 1 Afghanist… Asia          0.948         0.942 1.22      181.  9.84e- 8     1 -18.3  42.7 
#> 2 Albania    Europe        0.911         0.902 1.98      102.  1.46e- 6     1 -24.1  54.3 
#> 3 Algeria    Africa        0.985         0.984 1.32      662.  1.81e-10     1 -19.3  44.6 
#> 4 Angola     Africa        0.888         0.877 1.41       79.1 4.59e- 6     1 -20.0  46.1 
#> 5 Argentina  Americas      0.996         0.995 0.292    2246.  4.22e-13     1  -1.17  8.35
#> 6 Australia  Oceania       0.980         0.978 0.621     481.  8.67e-10     1 -10.2  26.4 
#> # ℹ 136 more rows
#> # ℹ 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>, nobs <int>
35 / 42

>> 5.4 [进一步的（探索性）分析]36 / 42

>> 5.4 [进一步的（探索性）分析]

by_cnty %>% 
  mutate(res = map(model, broom::glance)) %>% 
  unnest(res) %>% 
  ggplot(aes(continent, r.squared, colour = continent)) +
    geom_jitter(width = 0.3) +
    theme(legend.position = "none")

36 / 42

>> 5.5 实验性的 dplyr::group_*() 和 dplyr::nest_by()37 / 42

>> 5.5 实验性的 `dplyr::group_*()` 和 `dplyr::nest_by()`

gapminder::gapminder %>% 
  group_by(country, continent) %>% 
  group_modify(
    \(df, key)  # .f 的两个参数
    lm(lifeExp ~ year, data = df) %>% 
      list(.) %>% tibble(model = .)
  )

#> # A tibble: 142 × 3
#> # Groups:   country, continent [142]
#>   country     continent model 
#>   <fct>       <fct>     <list>
#> 1 Afghanistan Asia      <lm>  
#> # ℹ 141 more rows

gapminder::gapminder %>% 
  group_by(country, continent) %>% 
  group_modify(
    ~ lm(lifeExp ~ year, data = .x) %>% 
      broom::glance())

#> # A tibble: 142 × 14
#> # Groups:   country, continent [142]
#>   country     continent r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
#>   <fct>       <fct>         <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl>
#> 1 Afghanistan Asia          0.948         0.942  1.22      181. 9.84e-8     1  -18.3  42.7
#> # ℹ 141 more rows
#> # ℹ 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>, nobs <int>

37 / 42

>> 5.5 实验性的 dplyr::group_*() 和 dplyr::nest_by()gapminder::gapminder %>% 
  group_by(country, continent) %>% 
  group_modify(
    \(df, key)  # .f 的两个参数
    lm(lifeExp ~ year, data = df) %>% 
      list(.) %>% tibble(model = .)
  )

#> # A tibble: 142 × 3
#> # Groups:   country, continent [142]
#>   country     continent model 
#>   <fct>       <fct>     <list>
#> 1 Afghanistan Asia      <lm>  
#> # ℹ 141 more rows
gapminder::gapminder %>% 
  group_by(country, continent) %>% 
  group_modify(
    ~ lm(lifeExp ~ year, data = .x) %>% 
      broom::glance())

#> # A tibble: 142 × 14
#> # Groups:   country, continent [142]
#>   country     continent r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
#>   <fct>       <fct>         <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl>
#> 1 Afghanistan Asia          0.948         0.942  1.22      181. 9.84e-8     1  -18.3  42.7
#> # ℹ 141 more rows
#> # ℹ 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>, nobs <int>
gapminder::gapminder %>% 
    nest_by(country, continent) %>% 
    # nest_by() returns
    # a rowwise data frame
    mutate(
      model = list(lm(lifeExp ~ year, 
                      data = data)),
      res   = list(broom::glance(model))
    )

#> # A tibble: 142 × 5
#> # Rowwise:  country, continent
#>   country   continent     data model res     
#>   <fct>     <fct>     <list<t> <lis> <list>  
#> 1 Afghanis… Asia      [12 × 4] <lm>  <tibble>
#> 2 Albania   Europe    [12 × 4] <lm>  <tibble>
#> 3 Algeria   Africa    [12 × 4] <lm>  <tibble>
#> # ℹ 139 more rows
# you can then unnest() res ...

37 / 42

( `purrr` -> `furrr` ^v0.3.1 )

38 / 42

>> furrr: Apply Mapping Functions in Parallel using Futureslibrary(tictoc) # for timing R scripts
by_cnty <- gapminder::gapminder %>% 
  tidyr::nest(
    data = -c(country, continent))
slow_lm <- function(...) {
  Sys.sleep(0.1)
  lm(...)
}
tic()
by_cnty %>% 
  mutate(
    model = map(
      data, 
      \(df) slow_lm(lifeExp ~ year, 
                    data = df))
  ) -> gc1
toc()

#> 15.51 sec elapsed
39 / 42

>> furrr: Apply Mapping Functions in Parallel using Futureslibrary(tictoc) # for timing R scripts
by_cnty <- gapminder::gapminder %>% 
  tidyr::nest(
    data = -c(country, continent))
slow_lm <- function(...) {
  Sys.sleep(0.1)
  lm(...)
}
tic()
by_cnty %>% 
  mutate(
    model = map(
      data, 
      \(df) slow_lm(lifeExp ~ year, 
                    data = df))
  ) -> gc1
toc()

#> 15.51 sec elapsed
library(furrr)
plan(multisession, workers = 4)
tic()
by_cnty %>% 
  mutate(
    model = future_map( # future_*
      data, 
      \(df) slow_lm(lifeExp ~ year, 
                    data = df)
    )
  ) -> gc2
toc()

#> 6.14 sec elapsed
identical(gc1, gc2)

#> [1] FALSE
# 检查后发现是数据的附加属性不同导致

39 / 42

课后作业40 / 42

课后作业

1. 复习 📖 R for Data Science, 2e 一书第五部分 Program 的以下章节：

26 Iteration
27 A field guide to base R 的 27.4-27.5

2. 下载（打印） 📰 {{purrr 包的cheatsheet}} 并阅读之

3. 根据课程讲义的打印稿，在 📑 Qmd 中键入、运行、理解第 2-6 节中代码

文档统一取名为：L07_coding_practice.qmd
可结队编写代码，共同署名，分别提交，但注意控制雷同比例
于 2023年11月30日24:00前将该 Qmd 文档上传至 {{坚果云收件箱}}

41 / 42

本网页版讲义的制作由 R 包 {{xaringan}} 赋能！

42 / 42

量化金融与金融编程
L07 迭代

曾永艺
厦门大学管理学院

2023-11-24
1 / 42

通常你并不需要手写显性的循环
控制流结构
函数式编程
purrr 包
purrr 包与列表列
purrr -> furrr

2 / 42

1. 通常你并不需要手写显性的循环

( no loops, pls. )

3 / 42

>> 通常你并不需要手写显性的循环4 / 42

>> 通常你并不需要手写显性的循环

1.1 readr 包的 vroom 引擎

fpaths <- list.files("csvs", "\\.csv$",
                     full.names = TRUE)
read_csv(fpaths, id = "file")

1.2 向量化运算

tbl_mtcars <- as_tibble(mtcars)
tbl_mtcars %>% 
  mutate(kmpl = 1.609 * mpg / 3.785)

#> # A tibble: 32 × 12
#>     mpg   cyl  disp    hp  drat    wt
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62
#> 2  21       6   160   110  3.9   2.88
#> 3  22.8     4   108    93  3.85  2.32
#> # ℹ 29 more rows
#> # ℹ 6 more variables: qsec <dbl>,
#> #   vs <dbl>, am <dbl>, gear <dbl>,
#> #   carb <dbl>, kmpl <dbl>

4 / 42

>> 通常你并不需要手写显性的循环

1.1 readr 包的 vroom 引擎

fpaths <- list.files("csvs", "\\.csv$",
                     full.names = TRUE)
read_csv(fpaths, id = "file")

1.2 向量化运算

tbl_mtcars <- as_tibble(mtcars)
tbl_mtcars %>% 
  mutate(kmpl = 1.609 * mpg / 3.785)

#> # A tibble: 32 × 12
#>     mpg   cyl  disp    hp  drat    wt
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62
#> 2  21       6   160   110  3.9   2.88
#> 3  22.8     4   108    93  3.85  2.32
#> # ℹ 29 more rows
#> # ℹ 6 more variables: qsec <dbl>,
#> #   vs <dbl>, am <dbl>, gear <dbl>,
#> #   carb <dbl>, kmpl <dbl>

1.3 ggplot2::aes() 和 facet_*()

tbl_mtcars %>% 
  ggplot(aes(mpg, wt, 
             color = factor(cyl))) +
  geom_point() + 
  facet_grid(cols = vars(am),
             labeller = label_both)

4 / 42

>> 通常你并不需要手写显性的循环5 / 42

>> 通常你并不需要手写显性的循环

1.4 dplyr::group_by()

tbl_mtcars %>% 
  group_by(am) %>%
  summarise(mmpg = mean(mpg), n = n())

#> # A tibble: 2 × 3
#>      am  mmpg     n
#>   <dbl> <dbl> <int>
#> 1     0  17.1    19
#> 2     1  24.4    13

1.5 dplyr 包的 .by/by 参数

tbl_mtcars %>% 
  summarise(
    mmpg = mean(mpg), n = n(),
    .by = am
  )

#> # A tibble: 2 × 3
#>      am  mmpg     n
#>   <dbl> <dbl> <int>
#> 1     1  24.4    13
#> 2     0  17.1    19

5 / 42

>> 通常你并不需要手写显性的循环

1.4 dplyr::group_by()

tbl_mtcars %>% 
  group_by(am) %>%
  summarise(mmpg = mean(mpg), n = n())

#> # A tibble: 2 × 3
#>      am  mmpg     n
#>   <dbl> <dbl> <int>
#> 1     0  17.1    19
#> 2     1  24.4    13

1.5 dplyr 包的 .by/by 参数

tbl_mtcars %>% 
  summarise(
    mmpg = mean(mpg), n = n(),
    .by = am
  )

#> # A tibble: 2 × 3
#>      am  mmpg     n
#>   <dbl> <dbl> <int>
#> 1     1  24.4    13
#> 2     0  17.1    19

1.6 dplyr::rowwise()

tibble(x1 = runif(2), x2 = runif(2)) %>% 
  rowwise() %>%
  mutate(
    m = mean(c_across(starts_with("x")))
  )

#> # A tibble: 2 × 3
#> # Rowwise: 
#>       x1     x2     m
#>    <dbl>  <dbl> <dbl>
#> 1 0.0788 0.889  0.484
#> 2 0.568  0.0275 0.298

1.7 dplyr::across()

tibble(x1 = runif(2), x2 = runif(2)) %>% 
  mutate(
    across(starts_with("x"), ~.x/max(.x))
  )

#> # A tibble: 2 × 2
#>      x1    x2
#>   <dbl> <dbl>
#> 1 0.488 0.562
#> 2 1     1

5 / 42

2. 控制流结构

( control-flow constructs in R )

6 / 42

日出日落，月圆月缺，年尾年头，这是“循环”；

上学还是就业，单身还是结婚，丁克还是生娃，这是“分支”；

不管是循环还是分支，都嵌入在生老病死的时间轴上，这是“顺序”；

所谓尽人事听天命，想来就是心平气和地接受顺序结构，小心翼翼地制定循环结构，在关键时刻控制好分支结构，就这样度过一生罢。

—— 大鹏志，转引自《学R》

7 / 42

>> 2.1 `for` 循环结构

for 循环属于命令式编程（imperative programming）中的重复执行范式

8 / 42

>> 2.1 `for` 循环结构

for 循环属于命令式编程（imperative programming）中的重复执行范式

set.seed(1234)
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10)
)

# 计算各列的均值
mean(df$a)
mean(df$b)
mean(df$c)

#> [1] -0.383
#> [1] -0.118
#> [1] -0.388

8 / 42

>> 2.1 `for` 循环结构

for 循环属于命令式编程（imperative programming）中的重复执行范式

set.seed(1234)
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10)
)

# 计算各列的均值
mean(df$a)
mean(df$b)
mean(df$c)

#> [1] -0.383
#> [1] -0.118
#> [1] -0.388

# for循环的三个组成部分
output <- vector("double", ncol(df))  # 1. output    输出
for(i in seq_along(df)) {             # 2. sequence  循环序列
  output[[i]] <- mean(df[[i]])        # 3. body      循环体
}
output

#> [1] -0.383 -0.118 -0.388

8 / 42

>> 2.1 `for` 循环结构

for 循环的三种模式

9 / 42

>> 2.1 `for` 循环结构

for 循环的三种模式

for(x in xs)：逐个元素循环，当你只关注副作用（如作图、存盘）时，这是最有用的模式

9 / 42

>> 2.1 `for` 循环结构

for 循环的三种模式

for(x in xs)：逐个元素循环，当你只关注副作用（如作图、存盘）时，这是最有用的模式
for(nm in names(xs))：逐个名字循环，在循环体中用 xs[[nm]] 得到命名向量 xs 元素的值。你还可以让输出的名字和输入的名字对应起来：

output <- vector("list", length(xs))
names(output) <- names(xs)
for(nm in names(xs)) {
  output[[nm]] <- .f(xs[[nm]], ...)
}

9 / 42

>> 2.1 `for` 循环结构

for 循环的三种模式

for(x in xs)：逐个元素循环，当你只关注副作用（如作图、存盘）时，这是最有用的模式
for(nm in names(xs))：逐个名字循环，在循环体中用 xs[[nm]] 得到命名向量 xs 元素的值。你还可以让输出的名字和输入的名字对应起来：

output <- vector("list", length(xs))
names(output) <- names(xs)
for(nm in names(xs)) {
  output[[nm]] <- .f(xs[[nm]], ...)
}

for(i in seq_along(xs))：逐个数值索引循环，这是最通用的模式。下面语句会给出元素的名字和取值：

for(i in seq_along(xs)) {
  name <- names(xs)[[i]]
  value <- xs[[i]]
  # ...
}

9 / 42

>> 2.1 `for` 循环结构

特殊情况：“就地”修改

10 / 42

>> 2.1 `for` 循环结构

特殊情况：“就地”修改

rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# output（输出）已备
# 在body（循环体）中直接调用`[[<-`完成“就地”修改 x           #]#]
for(i in seq_along(df)) {
  df[[i]] <- rescale01(df[[i]])  # 不要使用[]
}
df

#> # A tibble: 10 × 3
#>       a      b     c
#>   <dbl>  <dbl> <dbl>
#> 1 0.332 0.153  0.782
#> 2 0.765 0      0.473
#> 3 1     0.0651 0.498
#> # ℹ 7 more rows

10 / 42

>> 2.1 `for` 循环结构

特殊情况：事前无法确定输出的长度

11 / 42

>> 2.1 `for` 循环结构

特殊情况：事前无法确定输出的长度

# **低效的做法**
set.seed(1234)
means <- c(0, 1, 2)
out <- double()  # 空的实数向量
for(i in seq_along(means)) {
  n <- sample(100, 1)
  print(glue::glue("L#{i}: n={n}"))
  out <- c(out,
           rnorm(n, means[[i]]))
}

#> L#1: n=28
#> L#2: n=79
#> L#3: n=2

str(out, vec.len = 2.5)

#>  num [1:109] 0.312 0.314 0.359 ...

11 / 42

>> 2.1 `for` 循环结构

特殊情况：事前无法确定输出的长度

# **低效的做法**
set.seed(1234)
means <- c(0, 1, 2)
out <- double()  # 空的实数向量
for(i in seq_along(means)) {
  n <- sample(100, 1)
  print(glue::glue("L#{i}: n={n}"))
  out <- c(out,
           rnorm(n, means[[i]]))
}

#> L#1: n=28
#> L#2: n=79
#> L#3: n=2

str(out, vec.len = 2.5)

#>  num [1:109] 0.312 0.314 0.359 ...

# **更好的做法**
# 使用更复杂的数据结构，完成循环后再处理
set.seed(1234)
means <- c(0, 1, 2)
out <- vector("list", length(means))
for(i in seq_along(means)) {
  n <- sample(100, 1)
  print(glue::glue("L#{i}: n={n}"))
  out[[i]] <- rnorm(n, means[[i]])
}
out <- unlist(out)
# purrr::flatten_dbl(out)
str(out, vec.len = 2.5)

#> L#1: n=28
#> L#2: n=79
#> L#3: n=2
#>  num [1:109] 0.312 0.314 0.359 ...

11 / 42

>> 2.1 `for` 循环结构 + 2.2 `if` 分支结构

特殊情况：事前无法确定循环的次数 -> while() / repeat + break

12 / 42

>> 2.1 `for` 循环结构 + 2.2 `if` 分支结构

特殊情况：事前无法确定循环的次数 -> while() / repeat + break

flip <- function() sample(c("T", "H"), 1)

set.seed(111)
nheads <- 0
flips <- character()
while(nheads < 3) {
  H_T <- flip()
  if(H_T == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0  # 重新计数
  }
  flips <- c(flips, H_T)
}
flips

#>  [1] "H" "T" "H" "T" "T" "T" "T"
#>  [8] "T" "T" "H" "H" "H"

12 / 42

>> 2.1 `for` 循环结构 + 2.2 `if` 分支结构

特殊情况：事前无法确定循环的次数 -> while() / repeat + break

flip <- function() sample(c("T", "H"), 1)

set.seed(111)
nheads <- 0
flips <- character()
while(nheads < 3) {
  H_T <- flip()
  if(H_T == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0  # 重新计数
  }
  flips <- c(flips, H_T)
}
flips

#>  [1] "H" "T" "H" "T" "T" "T" "T"
#>  [8] "T" "T" "H" "H" "H"

set.seed(111)
nheads <- 0
flips <- character()
repeat {
  H_T <- flip()
  if(H_T == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0  # 重新计数
  }
  flips <- c(flips, H_T)
  if(nheads >= 3) break
}
flips

#>  [1] "H" "T" "H" "T" "T" "T" "T"
#>  [8] "T" "T" "H" "H" "H"

12 / 42

3. 函数式编程

( functional programming )

13 / 42

>> 3.1 R 语言与函数式编程

14 / 42

>> 3.1 R 语言与函数式编程

# 定义函数，计算数据框每列的均值
col_mean <- function(df) {
  out <- vector("double", length(df))
  for(i in seq_along(df)) {
    out[i] <- mean(df[[i]])
  }
  out
}
# 调用函数
col_mean(df)

#> [1] 0.572 0.258 0.524

🤔 该如何一般化 col_mean()？

🤩 定义 col_median()、col_sd() ...？

🙅 NOPE! 更好的做法是 ->

14 / 42

>> 3.1 R 语言与函数式编程

# 定义函数，计算数据框每列的均值
col_mean <- function(df) {
  out <- vector("double", length(df))
  for(i in seq_along(df)) {
    out[i] <- mean(df[[i]])
  }
  out
}
# 调用函数
col_mean(df)

#> [1] 0.572 0.258 0.524

🤔 该如何一般化 col_mean()？

🤩 定义 col_median()、col_sd() ...？

🙅 NOPE! 更好的做法是 ->

# ... 使用函数式编程
# 函数作为参数 -> 泛函
col_summary <- function(df, .fun) {
  out <- vector("double", length(df))
  for(i in seq_along(df)) {
    out[i] <- .fun(df[[i]])
  }
  out
}

col_summary(df, mean)

#> [1] 0.572 0.258 0.524

col_summary(df, sd)

#> [1] 0.290 0.313 0.329

14 / 42

>> 3.2 `apply` 族函数 & `Reduce()` 等高阶函数

apply(X, MARGIN, FUN, ...)
sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
lapply(X, FUN, ...)
sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
rapply(object, f, classes = "ANY", deflt = NULL, how = c("unlist", "replace", "list"), ...)
replicate(n, expr, simplify = "array")
mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
eapply(env, FUN, ..., all.names = FALSE, USE.NAMES = TRUE)
tapply(X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
by(data, INDICES, FUN, ..., simplify = TRUE)
aggregate(x, by, FUN, ..., simplify = TRUE, drop = TRUE) ...
Reduce(f, x, init, right = FALSE, accumulate = FALSE)
Filter(f, x)
Find(f, x, right = FALSE, nomatch = NULL)
Map(f, ...)
Position(f, x, right = FALSE, nomatch = NA_integer_)

15 / 42

4. `purrr` 包 ^v1.0.2

( Functional Programming Tools )

16 / 42

>> 4.1 `purrr` 包的 `map` 族函数

map 族函数："Learn it once, use it everywhere!" - Jenny Bryan

17 / 42

>> 4.1 `purrr` 包的 `map` 族函数

map 族函数："Learn it once, use it everywhere!" - Jenny Bryan

参数统一：map()、map_*()、modify() 和 walk() 函数的第1个参数 .x 为输入向量（包括原子向量和列表），第2个参数 .f 为函数，第3个参数 ... 为传递给 .f 的额外参数。

17 / 42

>> 4.1 `purrr` 包的 `map` 族函数

map 族函数："Learn it once, use it everywhere!" - Jenny Bryan

参数统一：map()、map_*()、modify() 和 walk() 函数的第1个参数 .x 为输入向量（包括原子向量和列表），第2个参数 .f 为函数，第3个参数 ... 为传递给 .f 的额外参数。

返回结果一目了然：map()、map_*()、modify() 和 walk() 的返回结果均为与输入向量等长的向量，map() 返回列表，map_*() 返回向量的类由函数名中的后缀决定，modify() 返回与输入向量同类的输出，而 walk() 则不可见地返回输入向量。

17 / 42

>> 4.1 `purrr` 包的 `map` 族函数

map 族函数："Learn it once, use it everywhere!" - Jenny Bryan

参数统一：map()、map_*()、modify() 和 walk() 函数的第1个参数 .x 为输入向量（包括原子向量和列表），第2个参数 .f 为函数，第3个参数 ... 为传递给 .f 的额外参数。

返回结果一目了然：map()、map_*()、modify() 和 walk() 的返回结果均为与输入向量等长的向量，map() 返回列表，map_*() 返回向量的类由函数名中的后缀决定，modify() 返回与输入向量同类的输出，而 walk() 则不可见地返回输入向量。

一通百通：输入扩展至2个向量（如 map2(.x, .y, .f, ...)）和多个向量构成的列表 .l（如 pmap(.l, .f, ...)），还可同时应用于向量元素和索引（如 imap(.x, .f, ...)）。

17 / 42

>> 4.1 `purrr` 包的 `map` 族函数

map(df, mean)         # 返回列表

#> $a
#> [1] 0.572
#> 
#> $b
#> [1] 0.258
#> 
#> $c
#> [1] 0.524

map_dbl(df, mean)       # 返回实数向量
# df %>% map_dbl(mean)  # 支持管道操作

#>     a     b     c 
#> 0.572 0.258 0.524

18 / 42

>> 4.1 `purrr` 包的 `map` 族函数

map(df, mean)         # 返回列表

#> $a
#> [1] 0.572
#> 
#> $b
#> [1] 0.258
#> 
#> $c
#> [1] 0.524

map_dbl(df, mean)       # 返回实数向量
# df %>% map_dbl(mean)  # 支持管道操作

#>     a     b     c 
#> 0.572 0.258 0.524

相比 for 循环

用函数式编程函数来完成循环，更加简洁
关注完成运算的函数（如 mean），而非准备性步骤（如定义输出 output）
适合用 %>% 将不同函数链接起来解决问题
-> 代码更加易读、易写、易用

相比自行编写的 col_summary()

后台用 C 编写，效率更高
参数 .f 支持函数、公式、字符|整数向量
结果保留元素的名称

👍 ❤️ ✌️

18 / 42

>> 4.1 `purrr` 包的 `map` 族函数

map 族函数的参数 .f 支持快捷写法

19 / 42

>> 4.1 `purrr` 包的 `map` 族函数

map 族函数的参数 .f 支持快捷写法

# 匿名函数
models <- mtcars %>% 
  split(.$cyl) %>%  # 得到3个命名列表
  map(function(df)
        lm(mpg ~ wt, data = df))
# map(\(df) lm(mpg ~ wt, data = df))
# 将匿名函数改写为单侧公式（但不推荐）
models <- mtcars %>% 
  split(.$cyl) %>% 
  map(~ lm(mpg ~ wt, data = .x))
str(models, max.level = 1)

#> List of 3
#>  $ 4:List of 12
#>   ..- attr(*, "class")= chr "lm"
#>  $ 6:List of 12
#>   ..- attr(*, "class")= chr "lm"
#>  $ 8:List of 12
#>   ..- attr(*, "class")= chr "lm"

19 / 42

>> 4.1 `purrr` 包的 `map` 族函数

map 族函数的参数 .f 支持快捷写法

# 匿名函数
models <- mtcars %>% 
  split(.$cyl) %>%  # 得到3个命名列表
  map(function(df)
        lm(mpg ~ wt, data = df))
# map(\(df) lm(mpg ~ wt, data = df))
# 将匿名函数改写为单侧公式（但不推荐）
models <- mtcars %>% 
  split(.$cyl) %>% 
  map(~ lm(mpg ~ wt, data = .x))
str(models, max.level = 1)

#> List of 3
#>  $ 4:List of 12
#>   ..- attr(*, "class")= chr "lm"
#>  $ 6:List of 12
#>   ..- attr(*, "class")= chr "lm"
#>  $ 8:List of 12
#>   ..- attr(*, "class")= chr "lm"

# 匿名函数
models %>% 
  map(summary) %>% 
  map_dbl(\(x) x$r.squared)

#>     4     6     8 
#> 0.509 0.465 0.423

# 直接使用 字符向量 提取元素
# 结果同上，从略
models %>% 
  map(summary) %>% 
  map_dbl("r.squared")
# 若你知道元素的具体位置，也可以直接用
# 整数提取元素，但不推荐

19 / 42

>> 4.1 `purrr` 包的 `map` 族函数

多个输入：map2(.x, .y, .f, ...)、pmap() 和 imap()

20 / 42

>> 4.1 `purrr` 包的 `map` 族函数

多个输入：map2(.x, .y, .f, ...)、pmap() 和 imap()

# 1个输入用map()
mu <- list(5, 10, -3)
mu %>% map(\(m) rnorm(m, n = 5))

# 2个输入呢？
mu <- list(5, 10, -3)
sigma <- list(1, 5, 10)

# 坚持用map()！
set.seed(1234)
seq_along(mu) %>% 
  map(\(i) rnorm(5, mu[[i]], 
                 sigma[[i]])) %>% 
  str()

#> List of 3
#>  $ : num [1:5] 3.79 5.28 6.08 2.65 5.43
#>  $ : num [1:5] 12.53 7.13 7.27 7.18 5.55
#>  $ : num [1:5] -7.77 -12.98 -10.76 -2.36 6.59

20 / 42

>> 4.1 `purrr` 包的 `map` 族函数

多个输入：map2(.x, .y, .f, ...)、pmap() 和 imap()

# 1个输入用map()
mu <- list(5, 10, -3)
mu %>% map(\(m) rnorm(m, n = 5))

# 2个输入呢？
mu <- list(5, 10, -3)
sigma <- list(1, 5, 10)

# 坚持用map()！
set.seed(1234)
seq_along(mu) %>% 
  map(\(i) rnorm(5, mu[[i]], 
                 sigma[[i]])) %>% 
  str()

#> List of 3
#>  $ : num [1:5] 3.79 5.28 6.08 2.65 5.43
#>  $ : num [1:5] 12.53 7.13 7.27 7.18 5.55
#>  $ : num [1:5] -7.77 -12.98 -10.76 -2.36 6.59

# 还是改用map2()吧，:)
set.seed(1234)
map2(mu, sigma, rnorm, n = 5) %>% 
  str() # 结果相同

#> List of 3
#>  $ : num [1:5] 3.79 5.28 6.08 2.65 5.43
#>  $ : num [1:5] 12.53 7.13 7.27 7.18 5.55
#>  $ : num [1:5] -7.77 -12.98 -10.76 -2.36 6.59

20 / 42

>> 4.1 `purrr` 包的 `map` 族函数

多个输入：map2(.x, .y, .f, ...)、pmap(.l, .f, ...) 和 imap()

21 / 42

>> 4.1 `purrr` 包的 `map` 族函数

多个输入：map2(.x, .y, .f, ...)、pmap(.l, .f, ...) 和 imap()

# 我还想让抽样样本数n也有所不同！
n <- list(1, 2, 3)

# 默认为位置匹配
args1 <- list(n, mu, sigma)
args1 %>%
  pmap(rnorm) %>% str()

#> List of 3
#>  $ : num 4.03
#>  $ : num [1:2] 4.46 3.74
#>  $ : num [1:3] -8.24 -7.97 -21.06

21 / 42

>> 4.1 `purrr` 包的 `map` 族函数

多个输入：map2(.x, .y, .f, ...)、pmap(.l, .f, ...) 和 imap()

# 我还想让抽样样本数n也有所不同！
n <- list(1, 2, 3)

# 默认为位置匹配
args1 <- list(n, mu, sigma)
args1 %>%
  pmap(rnorm) %>% str()

#> List of 3
#>  $ : num 4.03
#>  $ : num [1:2] 4.46 3.74
#>  $ : num [1:3] -8.24 -7.97 -21.06

# 使用命名参数列表，匹配.f函数的参数名
# 也可用数据框作为.l参数的取值
args2 <- list(mean = mu, 
              sd = sigma, 
              n = n)
args2 %>% 
  pmap(rnorm) %>% str()

21 / 42

>> 4.1 `purrr` 包的 `map` 族函数

多个输入：map2()、pmap() 和 imap(.x, .f, ...)

imap()为 indexed map 函数

当输入向量 .x 的元素有名称时，它是 map2(.x, names(.x), ...) 的简便写法
当输入向量 .x 的元素没有名称时，它 map2(.x, seq_along(.x), ...) 的简便写法
在你需要同时基于 .x 的取值和名称/索引进行计算时，imap(.x, .f, ...) 很有用

22 / 42

>> 4.1 `purrr` 包的 `map` 族函数

多个输入：map2()、pmap() 和 imap(.x, .f, ...)

imap()为 indexed map 函数

当输入向量 .x 的元素有名称时，它是 map2(.x, names(.x), ...) 的简便写法
当输入向量 .x 的元素没有名称时，它 map2(.x, seq_along(.x), ...) 的简便写法
在你需要同时基于 .x 的取值和名称/索引进行计算时，imap(.x, .f, ...) 很有用

# 元素没有名称，.f 第2个参数为元素位置
imap_chr(sample(LETTERS[1:4]), 
         \(x, i) paste0(i, " -> ", x))

#> [1] "1 -> B" "2 -> D" "3 -> C" "4 -> A"

# 元素有名称，.f 第2个参数为元素名
lst <- map(1:4, ~ sample(1000, 10))
names(lst) <- paste0("#", 1:4)
imap_chr(
  lst, 
  \(x, i) glue::glue(
    "样本{i} 的最大值为 {max(x)}")
)

#>                      #1 
#> "样本#1 的最大值为 962" 
#>                      #2 
#> "样本#2 的最大值为 976" 
#>                      #3 
#> "样本#3 的最大值为 942" 
#>                      #4 
#> "样本#4 的最大值为 877"

22 / 42

>> 4.1 `purrr` 包的 `map` 族函数

不同输出：modify(.x, .f, ...) 和 walk(.x, .f, ...)

23 / 42

>> 4.1 `purrr` 包的 `map` 族函数

不同输出：modify(.x, .f, ...) 和 walk(.x, .f, ...)

modify()、modify2() 和 imodify() 总是返回与输入向量 .x 的类 class 相同的向量

df <- data.frame(
  x = 1:3,
  y = 6:4
)
modify(df, \(x) x * 2)

#>   x  y
#> 1 2 12
#> 2 4 10
#> 3 6  8

df

#>   x y
#> 1 1 6
#> 2 2 5
#> 3 3 4

df <- modify(df, \(x) x * 2)

23 / 42

>> 4.1 `purrr` 包的 `map` 族函数

不同输出：modify(.x, .f, ...) 和 walk(.x, .f, ...)

24 / 42

>> 4.1 `purrr` 包的 `map` 族函数

不同输出：modify(.x, .f, ...) 和 walk(.x, .f, ...)

# ggsave(filename, plot=last_plot(), 
#   device = NULL, path = NULL, ...)
tmp <- tempdir()
gs <- mtcars %>% 
  split(.$cyl) %>% 
  map(~ ggplot(., aes(wt, mpg)) + 
        geom_point())
fs <- str_c("cyl-", names(gs), ".pdf")
walk2(fs, gs, ggsave, path = tmp)
list.files(tmp, pattern = "^cyl-")

#> [1] "cyl-4.pdf" "cyl-6.pdf" "cyl-8.pdf"

24 / 42

>> 4.1 `purrr` 包的 `map` 族函数

不同输出：modify(.x, .f, ...) 和 walk(.x, .f, ...)

# ggsave(filename, plot=last_plot(), 
#   device = NULL, path = NULL, ...)
tmp <- tempdir()
gs <- mtcars %>% 
  split(.$cyl) %>% 
  map(~ ggplot(., aes(wt, mpg)) + 
        geom_point())
fs <- str_c("cyl-", names(gs), ".pdf")
walk2(fs, gs, ggsave, path = tmp)
list.files(tmp, pattern = "^cyl-")

#> [1] "cyl-4.pdf" "cyl-6.pdf" "cyl-8.pdf"

# 参数位置匹配并使用 ... 来传递参数
# 更推荐的做法是：
walk2(
  fs, gs, 
  \(fs, gs) ggsave(fs, gs, path = tmp)
)

24 / 42

>> 4.1 `purrr` 包的 `map` 族函数

lmap(.x, .f, ...)、lmap_if(.x, .p, .f, ..., .else = NULL) 和 lmap_at(.x, .at, .f, ...)

类似于 map*() 函数，但只适用于输入并返回 list 或 data.frame 的函数 .f，也就是说 .f 应用于 .x 的列表元素而非元素（即 .x[i]，而非 .x[[i]]）。

25 / 42

>> 4.1 `purrr` 包的 `map` 族函数

lmap(.x, .f, ...)、lmap_if(.x, .p, .f, ..., .else = NULL) 和 lmap_at(.x, .at, .f, ...)

类似于 map*() 函数，但只适用于输入并返回 list 或 data.frame 的函数 .f，也就是说 .f 应用于 .x 的列表元素而非元素（即 .x[i]，而非 .x[[i]]）。

map_if(.x, .p, .f, ..., .else = NULL) 和 map_at(.x, .at, .f, ...)

map_if() 将 .f（.else）应用于 .x 中断言函数 .p 取值为 TRUE（FALSE）的元素；
map_at() 将 .f 应用于 .x 中 .at 参数（名称或位置向量）所指定的元素；
modify() 和 lmap() 也有这类条件应用的变体函数。

25 / 42

>> 4.1 `purrr` 包的 `map` 族函数

lmap(.x, .f, ...)、lmap_if(.x, .p, .f, ..., .else = NULL) 和 lmap_at(.x, .at, .f, ...)

类似于 map*() 函数，但只适用于输入并返回 list 或 data.frame 的函数 .f，也就是说 .f 应用于 .x 的列表元素而非元素（即 .x[i]，而非 .x[[i]]）。

map_if(.x, .p, .f, ..., .else = NULL) 和 map_at(.x, .at, .f, ...)

map_if() 将 .f（.else）应用于 .x 中断言函数 .p 取值为 TRUE（FALSE）的元素；
map_at() 将 .f 应用于 .x 中 .at 参数（名称或位置向量）所指定的元素；
modify() 和 lmap() 也有这类条件应用的变体函数。

map_depth(.x, .depth, .f, ..., .ragged = .depth < 0, .is_node = NULL) 和 modify_depth()

map_depth() / modify_depth() 将 .f 应用于嵌套向量 .x 中 .depth 参数所指定深度的元素。

25 / 42

>> 4.2 其它 `purrr` 包函数

列表操作

26 / 42

>> 4.2 其它 `purrr` 包函数

列表操作

list_c(x, ..., ptype = NULL) 将列表x的元素合并为向量；
list_cbind(x, ..., name_repair = c("unique", "universal", "check_unique"), size = NULL) 将列表x的元素（须为数据框或NULL）按行合并为数据框；
list_rbind(x, ..., names_to = rlang::zap(), ptype = NULL) 将列表x的元素（须为数据框或NULL）按列合并为数据框。

26 / 42

>> 4.2 其它 `purrr` 包函数

列表操作

list_c(x, ..., ptype = NULL) 将列表x的元素合并为向量；
list_cbind(x, ..., name_repair = c("unique", "universal", "check_unique"), size = NULL) 将列表x的元素（须为数据框或NULL）按行合并为数据框；
list_rbind(x, ..., names_to = rlang::zap(), ptype = NULL) 将列表x的元素（须为数据框或NULL）按列合并为数据框。

list_flatten(x, ..., name_spec = "{outer}_{inner}", name_repair)：轧平列表x的一层元素索引；
list_simplify(x, ..., strict = TRUE, ptype = NULL)：将列表x简化为原子向量或者 S3 向量；
list_transpose()：转置列表，如 2×3 -> 3×2；
list_assign(x, ..., .is_node = NULL) / list_modify() / list_merge()：根据 ... 按名称或位置赋值 / 修改 / 合并列表x的元素值。

26 / 42

>> 4.2 其它 `purrr` 包函数

支持断言函数（predicate functions）的泛函

27 / 42

>> 4.2 其它 `purrr` 包函数

支持断言函数（predicate functions）的泛函

# keep(.x, .p, ...) | discard()
iris %>% 
  keep(is.factor) %>% 
  str(vec.len = 1)

#> 'data.frame':    150 obs. of  1 variable:
#>  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 ...

# keep_at(x, at, ...) | discard_at()
#   based on their name/position
# compact(.x, .p = identity)
#   remove all NULLs

# every(.x, .p, ...) | some() | none()
list(1:5, letters) %>% 
  some(is_character)

#> [1] TRUE

27 / 42

>> 4.2 其它 `purrr` 包函数

支持断言函数（predicate functions）的泛函

# keep(.x, .p, ...) | discard()
iris %>% 
  keep(is.factor) %>% 
  str(vec.len = 1)

#> 'data.frame':    150 obs. of  1 variable:
#>  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 ...

# keep_at(x, at, ...) | discard_at()
#   based on their name/position
# compact(.x, .p = identity)
#   remove all NULLs

# every(.x, .p, ...) | some() | none()
list(1:5, letters) %>% 
  some(is_character)

#> [1] TRUE

set.seed(1234)
(x <- sample(9))

#> [1] 6 5 4 1 8 2 7 9 3

# detect(.x, .f, ..., 
#   .dir = c("forward", "backward"), 
#   .right = NULL,.default = NULL)
# detect_index()
x %>% detect(~ . > 2)

#> [1] 6

# head_while(.x, .p, ...)|tail_while()
x %>% head_while(~ . > 2)

#> [1] 6 5 4

27 / 42

>> 4.2 其它 `purrr` 包函数

reduce(.x, .f, ..., .init, .dir = c("forward", "backward")) 和 accumulate(.x, .f, ..., .init, .dir = c("forward", "backward"))

28 / 42

>> 4.2 其它 `purrr` 包函数

reduce(.x, .f, ..., .init, .dir = c("forward", "backward")) 和 accumulate(.x, .f, ..., .init, .dir = c("forward", "backward"))

dfs <- list(
  age = tibble(name = "Jo", age = 30),
  sex = tibble(name = c("Jo", "An"), 
               sex = c("M", "F")),
  trt = tibble(name = "An", 
               treatment = "A")
)
dfs %>% reduce(full_join)

#> # A tibble: 2 × 4
#>   name    age sex   treatment
#>   <chr> <dbl> <chr> <chr>    
#> 1 Jo       30 M     <NA>     
#> 2 An       NA F     A

1:10 %>% accumulate(`+`)

#>  [1]  1  3  6 10 15 21 28 36 45 55

28 / 42

>> 4.2 其它 `purrr` 包函数

reduce(.x, .f, ..., .init, .dir = c("forward", "backward")) 和 accumulate(.x, .f, ..., .init, .dir = c("forward", "backward"))

dfs <- list(
  age = tibble(name = "Jo", age = 30),
  sex = tibble(name = c("Jo", "An"), 
               sex = c("M", "F")),
  trt = tibble(name = "An", 
               treatment = "A")
)
dfs %>% reduce(full_join)

#> # A tibble: 2 × 4
#>   name    age sex   treatment
#>   <chr> <dbl> <chr> <chr>    
#> 1 Jo       30 M     <NA>     
#> 2 An       NA F     A

1:10 %>% accumulate(`+`)

#>  [1]  1  3  6 10 15 21 28 36 45 55

reduce() 和 accumulate() 支持的是二元函数，即有两个输入项的函数（及运算符）
还有reduce2() 和 accumulate2()函数

28 / 42

>> 4.2 其它 `purrr` 包函数

safely(.f, otherwise = NULL, quiet = TRUE)、quietly() 和 possibly()

这些函数就是所谓的函数运算符（function operator）——以函数作为参数输入，并返回函数。
safely() 会调整函数的行为（副词），让调整后的函数不再抛出错误，而是返回包含两个元素 result 和 error 的列表；通过 otherwise 参数设定错误时默认值，possibly() 总是成功；而 quietly() 则会捕捉命令的结果、输出、警告和消息。

29 / 42

>> 4.2 其它 `purrr` 包函数

safely(.f, otherwise = NULL, quiet = TRUE)、quietly() 和 possibly()

这些函数就是所谓的函数运算符（function operator）——以函数作为参数输入，并返回函数。
safely() 会调整函数的行为（副词），让调整后的函数不再抛出错误，而是返回包含两个元素 result 和 error 的列表；通过 otherwise 参数设定错误时默认值，possibly() 总是成功；而 quietly() 则会捕捉命令的结果、输出、警告和消息。

x <- list(1, 10, "a")
y <- x %>% map(safely(log)); str(y)

#> List of 3
#>  $ :List of 2
#>   ..$ result: num 0
#>   ..$ error : NULL
#>  $ :List of 2
#>   ..$ result: num 2.3
#>   ..$ error : NULL
#>  $ :List of 2
#>   ..$ result: NULL
#>   ..$ error :List of 2
#>   .. ..$ message: chr "non-numeric argument to mathematical function"
#>   .. ..$ call   : language .Primitive("log")(x, base)
#>   .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

29 / 42

>> 4.2 其它 `purrr` 包函数

safely(.f, otherwise = NULL, quiet = TRUE)、quietly() 和 possibly()

这些函数就是所谓的函数运算符（function operator）——以函数作为参数输入，并返回函数。
safely() 会调整函数的行为（副词），让调整后的函数不再抛出错误，而是返回包含两个元素 result 和 error 的列表；通过 otherwise 参数设定错误时默认值，possibly() 总是成功；而 quietly() 则会捕捉命令的结果、输出、警告和消息。

x <- list(1, 10, "a")
y <- x %>% map(safely(log)); str(y)

#> List of 3
#>  $ :List of 2
#>   ..$ result: num 0
#>   ..$ error : NULL
#>  $ :List of 2
#>   ..$ result: num 2.3
#>   ..$ error : NULL
#>  $ :List of 2
#>   ..$ result: NULL
#>   ..$ error :List of 2
#>   .. ..$ message: chr "non-numeric argument to mathematical function"
#>   .. ..$ call   : language .Primitive("log")(x, base)
#>   .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

y <- list_transpose(y); str(y)

#> List of 2
#>  $ result:List of 3
#>   ..$ : num 0
#>   ..$ : num 2.3
#>   ..$ : NULL
#>  $ error :List of 3
#>   ..$ : NULL
#>   ..$ : NULL
#>   ..$ :List of 2
#>   .. ..$ message: chr "non-numeric argument to mathematical function"
#>   .. ..$ call   : language .Primitive("log")(x, base)
#>   .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

is_ok <- y$error %>% map_lgl(is_null)
y$result[is_ok] %>% list_c()

#> [1] 0.0 2.3

29 / 42

>> 4.2 其它 `purrr` 包函数

pluck(.x, ..., .default = NULL) 和 chuck(.x, ...)：用名称或位置列表来选择 .x 中的一个元素或其属性（+ attr_getter()）

30 / 42

>> 4.2 其它 `purrr` 包函数

pluck(.x, ..., .default = NULL) 和 chuck(.x, ...)：用名称或位置列表来选择 .x 中的一个元素或其属性（+ attr_getter()）

assign_in(x, where, value) 和 modify_in(.x, .where, .f, ...)：修改指定位置元素的取值

30 / 42

>> 4.2 其它 `purrr` 包函数

pluck(.x, ..., .default = NULL) 和 chuck(.x, ...)：用名称或位置列表来选择 .x 中的一个元素或其属性（+ attr_getter()）

assign_in(x, where, value) 和 modify_in(.x, .where, .f, ...)：修改指定位置元素的取值

pluck_depth(x, is.node = NULL)：计算向量 x 的深度

30 / 42

>> 4.2 其它 `purrr` 包函数

pluck(.x, ..., .default = NULL) 和 chuck(.x, ...)：用名称或位置列表来选择 .x 中的一个元素或其属性（+ attr_getter()）

assign_in(x, where, value) 和 modify_in(.x, .where, .f, ...)：修改指定位置元素的取值

pluck_depth(x, is.node = NULL)：计算向量 x 的深度

has_element(.x, .y)：列表 .x 是否包含元素 .y

30 / 42

>> 4.2 其它 `purrr` 包函数

pluck(.x, ..., .default = NULL) 和 chuck(.x, ...)：用名称或位置列表来选择 .x 中的一个元素或其属性（+ attr_getter()）

assign_in(x, where, value) 和 modify_in(.x, .where, .f, ...)：修改指定位置元素的取值

pluck_depth(x, is.node = NULL)：计算向量 x 的深度

has_element(.x, .y)：列表 .x 是否包含元素 .y

array_branch(array, margin = NULL) 和 array_tree(array, margin = NULL)：将数组转化为列表

30 / 42

>> 4.2 其它 `purrr` 包函数

pluck(.x, ..., .default = NULL) 和 chuck(.x, ...)：用名称或位置列表来选择 .x 中的一个元素或其属性（+ attr_getter()）

assign_in(x, where, value) 和 modify_in(.x, .where, .f, ...)：修改指定位置元素的取值

pluck_depth(x, is.node = NULL)：计算向量 x 的深度

has_element(.x, .y)：列表 .x 是否包含元素 .y

array_branch(array, margin = NULL) 和 array_tree(array, margin = NULL)：将数组转化为列表

auto_browse()、insistently()、slowly()、compose()、partial()、negate()：函数运算符，修改函数的行为

30 / 42

5. `purrr` 包与列表列

( purrr and list columns )

31 / 42

32 / 42

可将任意结果（如模型对象）存入列表列并对其进行后续分析，避免重新计算
利用数据框保证相同观测不同变量取值之间的关联关系（如元数据与结果）
可将已掌握的知识/流程/工具直接应用于包含列表列的数据框

32 / 42

>> 5.1 步骤1：生成列表列（list column）33 / 42

>> 5.1 步骤1：生成列表列（list column）gapminder::gapminder

#> # A tibble: 1,704 × 6
#>   country   continent  year lifeExp
#>   <fct>     <fct>     <int>   <dbl>
#> 1 Afghanis… Asia       1952    28.8
#> 2 Afghanis… Asia       1957    30.3
#> 3 Afghanis… Asia       1962    32.0
#> 4 Afghanis… Asia       1967    34.0
#> 5 Afghanis… Asia       1972    36.1
#> 6 Afghanis… Asia       1977    38.4
#> # ℹ 1,698 more rows
#> # ℹ 2 more variables: pop <int>,
#> #   gdpPercap <dbl>
33 / 42

>> 5.1 步骤1：生成列表列（list column）

gapminder::gapminder

#> # A tibble: 1,704 × 6
#>   country   continent  year lifeExp
#>   <fct>     <fct>     <int>   <dbl>
#> 1 Afghanis… Asia       1952    28.8
#> 2 Afghanis… Asia       1957    30.3
#> 3 Afghanis… Asia       1962    32.0
#> 4 Afghanis… Asia       1967    34.0
#> 5 Afghanis… Asia       1972    36.1
#> 6 Afghanis… Asia       1977    38.4
#> # ℹ 1,698 more rows
#> # ℹ 2 more variables: pop <int>,
#> #   gdpPercap <dbl>

by_cnty <- gapminder::gapminder %>% 
  tidyr::nest(
    data = -c(country, continent))
by_cnty

#> # A tibble: 142 × 3
#>   country     continent data             
#>   <fct>       <fct>     <list>           
#> 1 Afghanistan Asia      <tibble [12 × 4]>
#> 2 Albania     Europe    <tibble [12 × 4]>
#> 3 Algeria     Africa    <tibble [12 × 4]>
#> # ℹ 139 more rows

33 / 42

>> 5.2 步骤2：处理列表列34 / 42

>> 5.2 步骤2：处理列表列# 将线性回归模型lm应用于data列的每个元素，
# 回归结果（列表）存为新的列表列model
by_cnty <- by_cnty %>% 
  mutate(
    model = map(
      data, 
      \(df) lm(lifeExp ~ year, 
               data = df)
    )
  )
by_cnty

#> # A tibble: 142 × 4
#>   country     continent data              model 
#>   <fct>       <fct>     <list>            <list>
#> 1 Afghanistan Asia      <tibble [12 × 4]> <lm>  
#> 2 Albania     Europe    <tibble [12 × 4]> <lm>  
#> 3 Algeria     Africa    <tibble [12 × 4]> <lm>  
#> 4 Angola      Africa    <tibble [12 × 4]> <lm>  
#> 5 Argentina   Americas  <tibble [12 × 4]> <lm>  
#> 6 Australia   Oceania   <tibble [12 × 4]> <lm>  
#> # ℹ 136 more rows
34 / 42

>> 5.2 步骤2：处理列表列# 将线性回归模型lm应用于data列的每个元素，
# 回归结果（列表）存为新的列表列model
by_cnty <- by_cnty %>% 
  mutate(
    model = map(
      data, 
      \(df) lm(lifeExp ~ year, 
               data = df)
    )
  )
by_cnty

#> # A tibble: 142 × 4
#>   country     continent data              model 
#>   <fct>       <fct>     <list>            <list>
#> 1 Afghanistan Asia      <tibble [12 × 4]> <lm>  
#> 2 Albania     Europe    <tibble [12 × 4]> <lm>  
#> 3 Algeria     Africa    <tibble [12 × 4]> <lm>  
#> 4 Angola      Africa    <tibble [12 × 4]> <lm>  
#> 5 Argentina   Americas  <tibble [12 × 4]> <lm>  
#> 6 Australia   Oceania   <tibble [12 × 4]> <lm>  
#> # ℹ 136 more rows
# 查看model列表列所保存对象的内容
# -> 提取model列表列的第1个元素
#    by_cnty$model[[1]]
by_cnty %>% 
  pluck("model", 1)

#> 
#> Call:
#> lm(formula = lifeExp ~ year, data = df)
#> 
#> Coefficients:
#> (Intercept)         year  
#>    -507.534        0.275
34 / 42

>> 5.3 步骤3：简化列表列35 / 42

>> 5.3 步骤3：简化列表列# 还是用mutate + map_*提取信息
by_cnty %>% 
  mutate(
    coef_year = map_dbl(
      model, ~ coef(.x)[["year"]]
    )
  ) %>% 
  select(-data, -model)

#> # A tibble: 142 × 3
#>   country     continent coef_year
#>   <fct>       <fct>         <dbl>
#> 1 Afghanistan Asia          0.275
#> 2 Albania     Europe        0.335
#> 3 Algeria     Africa        0.569
#> 4 Angola      Africa        0.209
#> 5 Argentina   Americas      0.232
#> 6 Australia   Oceania       0.228
#> # ℹ 136 more rows
35 / 42

>> 5.3 步骤3：简化列表列# 还是用mutate + map_*提取信息
by_cnty %>% 
  mutate(
    coef_year = map_dbl(
      model, ~ coef(.x)[["year"]]
    )
  ) %>% 
  select(-data, -model)

#> # A tibble: 142 × 3
#>   country     continent coef_year
#>   <fct>       <fct>         <dbl>
#> 1 Afghanistan Asia          0.275
#> 2 Albania     Europe        0.335
#> 3 Algeria     Africa        0.569
#> 4 Angola      Africa        0.209
#> 5 Argentina   Americas      0.232
#> 6 Australia   Oceania       0.228
#> # ℹ 136 more rows
# 使用broom包，更强大、也更方便
# glance() | tidy() | augment()
by_cnty %>%
  mutate(
    res = map(model, 
              broom::glance)) %>%
  tidyr::unnest(res) %>%
  select(-c(data, model))

#> # A tibble: 142 × 14
#>   country    continent r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC
#>   <fct>      <fct>         <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl>
#> 1 Afghanist… Asia          0.948         0.942 1.22      181.  9.84e- 8     1 -18.3  42.7 
#> 2 Albania    Europe        0.911         0.902 1.98      102.  1.46e- 6     1 -24.1  54.3 
#> 3 Algeria    Africa        0.985         0.984 1.32      662.  1.81e-10     1 -19.3  44.6 
#> 4 Angola     Africa        0.888         0.877 1.41       79.1 4.59e- 6     1 -20.0  46.1 
#> 5 Argentina  Americas      0.996         0.995 0.292    2246.  4.22e-13     1  -1.17  8.35
#> 6 Australia  Oceania       0.980         0.978 0.621     481.  8.67e-10     1 -10.2  26.4 
#> # ℹ 136 more rows
#> # ℹ 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>, nobs <int>
35 / 42

>> 5.4 [进一步的（探索性）分析]36 / 42

>> 5.4 [进一步的（探索性）分析]

by_cnty %>% 
  mutate(res = map(model, broom::glance)) %>% 
  unnest(res) %>% 
  ggplot(aes(continent, r.squared, colour = continent)) +
    geom_jitter(width = 0.3) +
    theme(legend.position = "none")

36 / 42

>> 5.5 实验性的 dplyr::group_*() 和 dplyr::nest_by()37 / 42

>> 5.5 实验性的 dplyr::group_*() 和 dplyr::nest_by()gapminder::gapminder %>% 
  group_by(country, continent) %>% 
  group_modify(
    \(df, key)  # .f 的两个参数
    lm(lifeExp ~ year, data = df) %>% 
      list(.) %>% tibble(model = .)
  )

#> # A tibble: 142 × 3
#> # Groups:   country, continent [142]
#>   country     continent model 
#>   <fct>       <fct>     <list>
#> 1 Afghanistan Asia      <lm>  
#> # ℹ 141 more rows
gapminder::gapminder %>% 
  group_by(country, continent) %>% 
  group_modify(
    ~ lm(lifeExp ~ year, data = .x) %>% 
      broom::glance())

#> # A tibble: 142 × 14
#> # Groups:   country, continent [142]
#>   country     continent r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
#>   <fct>       <fct>         <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl>
#> 1 Afghanistan Asia          0.948         0.942  1.22      181. 9.84e-8     1  -18.3  42.7
#> # ℹ 141 more rows
#> # ℹ 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>, nobs <int>
37 / 42

>> 5.5 实验性的 dplyr::group_*() 和 dplyr::nest_by()gapminder::gapminder %>% 
  group_by(country, continent) %>% 
  group_modify(
    \(df, key)  # .f 的两个参数
    lm(lifeExp ~ year, data = df) %>% 
      list(.) %>% tibble(model = .)
  )

#> # A tibble: 142 × 3
#> # Groups:   country, continent [142]
#>   country     continent model 
#>   <fct>       <fct>     <list>
#> 1 Afghanistan Asia      <lm>  
#> # ℹ 141 more rows
gapminder::gapminder %>% 
  group_by(country, continent) %>% 
  group_modify(
    ~ lm(lifeExp ~ year, data = .x) %>% 
      broom::glance())

#> # A tibble: 142 × 14
#> # Groups:   country, continent [142]
#>   country     continent r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
#>   <fct>       <fct>         <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl>
#> 1 Afghanistan Asia          0.948         0.942  1.22      181. 9.84e-8     1  -18.3  42.7
#> # ℹ 141 more rows
#> # ℹ 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>, nobs <int>
gapminder::gapminder %>% 
    nest_by(country, continent) %>% 
    # nest_by() returns
    # a rowwise data frame
    mutate(
      model = list(lm(lifeExp ~ year, 
                      data = data)),
      res   = list(broom::glance(model))
    )

#> # A tibble: 142 × 5
#> # Rowwise:  country, continent
#>   country   continent     data model res     
#>   <fct>     <fct>     <list<t> <lis> <list>  
#> 1 Afghanis… Asia      [12 × 4] <lm>  <tibble>
#> 2 Albania   Europe    [12 × 4] <lm>  <tibble>
#> 3 Algeria   Africa    [12 × 4] <lm>  <tibble>
#> # ℹ 139 more rows
# you can then unnest() res ...

37 / 42

( `purrr` -> `furrr` ^v0.3.1 )

38 / 42

>> furrr: Apply Mapping Functions in Parallel using Futureslibrary(tictoc) # for timing R scripts
by_cnty <- gapminder::gapminder %>% 
  tidyr::nest(
    data = -c(country, continent))
slow_lm <- function(...) {
  Sys.sleep(0.1)
  lm(...)
}
tic()
by_cnty %>% 
  mutate(
    model = map(
      data, 
      \(df) slow_lm(lifeExp ~ year, 
                    data = df))
  ) -> gc1
toc()

#> 15.51 sec elapsed
39 / 42

>> furrr: Apply Mapping Functions in Parallel using Futureslibrary(tictoc) # for timing R scripts
by_cnty <- gapminder::gapminder %>% 
  tidyr::nest(
    data = -c(country, continent))
slow_lm <- function(...) {
  Sys.sleep(0.1)
  lm(...)
}
tic()
by_cnty %>% 
  mutate(
    model = map(
      data, 
      \(df) slow_lm(lifeExp ~ year, 
                    data = df))
  ) -> gc1
toc()

#> 15.51 sec elapsed
library(furrr)
plan(multisession, workers = 4)
tic()
by_cnty %>% 
  mutate(
    model = future_map( # future_*
      data, 
      \(df) slow_lm(lifeExp ~ year, 
                    data = df)
    )
  ) -> gc2
toc()

#> 6.14 sec elapsed
identical(gc1, gc2)

#> [1] FALSE
# 检查后发现是数据的附加属性不同导致

39 / 42

课后作业40 / 42

课后作业

1. 复习 📖 R for Data Science, 2e 一书第五部分 Program 的以下章节：

26 Iteration
27 A field guide to base R 的 27.4-27.5

2. 下载（打印） 📰 {{purrr 包的cheatsheet}} 并阅读之

3. 根据课程讲义的打印稿，在 📑 Qmd 中键入、运行、理解第 2-6 节中代码

文档统一取名为：L07_coding_practice.qmd
可结队编写代码，共同署名，分别提交，但注意控制雷同比例
于 2023年11月30日24:00前将该 Qmd 文档上传至 {{坚果云收件箱}}

41 / 42

本网页版讲义的制作由 R 包 {{xaringan}} 赋能！

42 / 42

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help
o	Tile View: Overview of Slides

量化金融与金融编程

L07 迭代

曾永艺

厦门大学管理学院

2023-11-24

1. 通常你并不需要手写显性的循环

>> 通常你并不需要手写显性的循环

>> 通常你并不需要手写显性的循环

>> 通常你并不需要手写显性的循环

>> 通常你并不需要手写显性的循环

>> 通常你并不需要手写显性的循环

>> 通常你并不需要手写显性的循环

2. 控制流结构

>> 2.1 for 循环结构

>> 2.1 for 循环结构

>> 2.1 for 循环结构

>> 2.1 for 循环结构

>> 2.1 for 循环结构

>> 2.1 for 循环结构

>> 2.1 for 循环结构

>> 2.1 for 循环结构

>> 2.1 for 循环结构

>> 2.1 for 循环结构

>> 2.1 for 循环结构

>> 2.1 for 循环结构

>> 2.1 for 循环结构 + 2.2 if 分支结构

>> 2.1 for 循环结构 + 2.2 if 分支结构

>> 2.1 for 循环结构 + 2.2 if 分支结构

3. 函数式编程

>> 3.1 R 语言与函数式编程

>> 3.1 R 语言与函数式编程

>> 3.1 R 语言与函数式编程

>> 3.2 apply 族函数 & Reduce() 等高阶函数

4. purrr 包 v1.0.2

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.1 purrr 包的 map 族函数

>> 4.2 其它 purrr 包函数

>> 4.2 其它 purrr 包函数

>> 4.2 其它 purrr 包函数

>> 4.2 其它 purrr 包函数

>> 4.2 其它 purrr 包函数

>> 4.2 其它 purrr 包函数

>> 4.2 其它 purrr 包函数

>> 4.2 其它 purrr 包函数

>> 4.2 其它 purrr 包函数

>> 4.2 其它 purrr 包函数

>> 4.2 其它 purrr 包函数

>> 4.2 其它 purrr 包函数

>> 4.2 其它 purrr 包函数

>> 4.2 其它 purrr 包函数

>> 4.2 其它 purrr 包函数

>> 4.2 其它 purrr 包函数

>> 4.2 其它 purrr 包函数

>> 4.2 其它 purrr 包函数

5. purrr 包 与 列表列

>> 5.1 步骤1：生成列表列（list column）

>> 5.1 步骤1：生成列表列（list column）

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构 + 2.2 `if` 分支结构

>> 2.1 `for` 循环结构 + 2.2 `if` 分支结构

>> 2.1 `for` 循环结构 + 2.2 `if` 分支结构

>> 3.2 `apply` 族函数 & `Reduce()` 等高阶函数

4. `purrr` 包 ^v1.0.2

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.1 `purrr` 包的 `map` 族函数

>> 4.2 其它 `purrr` 包函数

>> 4.2 其它 `purrr` 包函数

>> 4.2 其它 `purrr` 包函数

>> 4.2 其它 `purrr` 包函数

>> 4.2 其它 `purrr` 包函数

>> 4.2 其它 `purrr` 包函数

>> 4.2 其它 `purrr` 包函数

>> 4.2 其它 `purrr` 包函数

>> 4.2 其它 `purrr` 包函数

>> 4.2 其它 `purrr` 包函数

>> 4.2 其它 `purrr` 包函数

>> 4.2 其它 `purrr` 包函数

>> 4.2 其它 `purrr` 包函数

>> 4.2 其它 `purrr` 包函数

>> 4.2 其它 `purrr` 包函数

>> 4.2 其它 `purrr` 包函数

>> 4.2 其它 `purrr` 包函数

>> 4.2 其它 `purrr` 包函数

5. `purrr` 包与列表列

>> 5.5 实验性的 `dplyr::group_*()` 和 `dplyr::nest_by()`

>> 5.5 实验性的 `dplyr::group_*()` 和 `dplyr::nest_by()`

>> 5.5 实验性的 `dplyr::group_*()` 和 `dplyr::nest_by()`

( `purrr` -> `furrr` ^v0.3.1 )

>> `furrr`: Apply Mapping Functions in Parallel using Futures

>> `furrr`: Apply Mapping Functions in Parallel using Futures

第7讲迭代

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构

>> 2.1 `for` 循环结构