量化金融与金融编程

class: center, middle, inverse, title-slide

.title[
# 量化金融与金融编程
]
.subtitle[
## L09 <code>tidymodels</code> suite
]
.author[
### <br>曾永艺
]
.institute[
### 厦门大学管理学院
]
.date[
### <br>2023-12-08
]

---

class: center, middle

<div>
<style type="text/css">.xaringan-extra-logo {
width: 50px;
height: 50px;
z-index: 0;
background-image: url(imgs/logo-sm.png);
background-size: contain;
background-repeat: no-repeat;
position: absolute;
top:0.5em;right:0.5em;
}
</style>
<script>(function () {
  let tries = 0
  function addLogo () {
    if (typeof slideshow === 'undefined') {
      tries += 1
      if (tries < 10) {
        setTimeout(addLogo, 100)
      }
    } else {
      document.querySelectorAll('.remark-slide-content:not(.title-slide):not(.inverse):not(.hide_logo)')
        .forEach(function (slide) {
          const logo = document.createElement('div')
          logo.classList = 'xaringan-extra-logo'
          logo.href = null
          slide.appendChild(logo)
        })
    }
  }
  document.addEventListener('DOMContentLoaded', addLogo)
})()</script>
</div>

---

### >> 为什么要在 <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg> 中进行机器学习分析 <svg aria-hidden="true" role="img" viewBox="0 0 320 512" style="height:1em;width:0.62em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M80 160c0-35.3 28.7-64 64-64h32c35.3 0 64 28.7 64 64v3.6c0 21.8-11.1 42.1-29.4 53.8l-42.2 27.1c-25.2 16.2-40.4 44.1-40.4 74V320c0 17.7 14.3 32 32 32s32-14.3 32-32v-1.4c0-8.2 4.2-15.8 11-20.2l42.2-27.1c36.6-23.6 58.8-64.1 58.8-107.7V160c0-70.7-57.3-128-128-128H144C73.3 32 16 89.3 16 160c0 17.7 14.3 32 32 32s32-14.3 32-32zm80 320a40 40 0 1 0 0-80 40 40 0 1 0 0 80z"/></svg>

<br>

.pull-left.font150.bold[

&#x1F4AA; R 和 R 包大多是由统计学家和数据分析专家开发的

&#x1F4AA; R 有着异常丰富且紧跟前沿的 [.red[{{统计建模和机器学习包}}]](https://cran.r-project.org/web/views/)

&#x1F4AA; R 容易接口或调用其他应用软件（如 C++、python、Spark、TensorFlow 等）

&#x1F4AA; 有大量学习参考资料

]

.pull-right[
<img src="imgs/MLwR.png" width="80%" style="display: block; margin: auto;" />
]

---

### >> 在 <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg> 中进行机器学习分析面临哪些问题 <svg aria-hidden="true" role="img" viewBox="0 0 320 512" style="height:1em;width:0.62em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M80 160c0-35.3 28.7-64 64-64h32c35.3 0 64 28.7 64 64v3.6c0 21.8-11.1 42.1-29.4 53.8l-42.2 27.1c-25.2 16.2-40.4 44.1-40.4 74V320c0 17.7 14.3 32 32 32s32-14.3 32-32v-1.4c0-8.2 4.2-15.8 11-20.2l42.2-27.1c36.6-23.6 58.8-64.1 58.8-107.7V160c0-70.7-57.3-128-128-128H144C73.3 32 16 89.3 16 160c0 17.7 14.3 32 32 32s32-14.3 32-32zm80 320a40 40 0 1 0 0-80 40 40 0 1 0 0 80z"/></svg>

<br>

.pull-left.font150.bold[

&#x1F630; 作为解析型的数据分析语言，R 的运行效率低于 C、C++、JAVA 等编译语言

&#x1F630; R 操作数据的规模往往受限于计算机内存的大小

.red[&#x1F630; 各式 R 包的接口并不统一]
]

.pull-right.code80[

|Function       |Code                                         |
|:--------------|:--------------------------------------------|
|`MASS::lda`    |`predict(obj)`                               |
|`stats::glm`   |`predict(obj, type = "response")`            |
|`gbm::gbm`     |`predict(obj, type = "response", n.trees)`   |
|`rpart::rpart` |`predict(obj, type = "prob")`                |
|`RWeka::Weka`  |`predict(obj, type = "probability")`         |
]

--
<br>
.right.bold.font120[
[`caret`](https://topepo.github.io/caret/) -> .font150[&#x1F449;  .red[[`tidymodels`!](https://www.tidymodels.org/)]]<br>
[`mlr`](https://mlr.mlr-org.com/) -> [`mlr3`!](https://mlr3.mlr-org.com/)
]

---
layout: false

<hr>

---
class: inverse, center, middle, hide_logo
background-image: url(imgs/logo-rsample.svg)
background-size: 4em
background-position: 12% 45%

# 1. 数据重采样 -> `rsample` .font60[<sup>v1.2.0</sup>]

.font150[(_General Resampling Infrastructure_)]

---
layout: true

### >> 数据重采样 -> `rsample` 包

---

.code90[

```r
# 先安装并加载 tidymodels 元包（meta package） 
# install.package("tidymodels")
library(tidymodels)  # v1.0.0 @ 2022-07-13
```

```
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
```

```
#> ✔ broom        1.0.5     ✔ recipes      1.0.8
#> ✔ dials        1.2.0     ✔ rsample      1.2.0
#> ✔ dplyr        1.1.4     ✔ tibble       3.2.1
#> ✔ ggplot2      3.4.4     ✔ tidyr        1.3.0
#> ✔ infer        1.0.5     ✔ tune         1.1.2
#> ✔ modeldata    1.2.0     ✔ workflows    1.1.3
#> ✔ parsnip      1.1.1     ✔ workflowsets 1.0.1
#> ✔ purrr        1.0.2     ✔ yardstick    1.2.0
```

```
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Use suppressPackageStartupMessages() to eliminate package startup messages
```
]

---

.pull-left.code95[

```r
# 设定随机数种子
set.seed(1234)
# 随机抽样分成训练组和试验组
iris_split <- initial_split(iris)
# initial_split(data, prop = 3/4,
#               strata = NULL,
#               breaks = 4, 
#               pool = 0.1, ...)
```

```r
# 查看分组情况
iris_split
```

```
#> <Training/Testing/Total>
#> <112/38/150>
```

```r
# 查看iris_split的类
iris_split %>% class()
```

```
#> [1] "initial_split" "mc_split"     
#> [3] "rsplit"
```

]

.pull-right.code85[

```r
# 查看 rsplit 类的底层结构
iris_split %>% str(vec.len = 1.5)
```

```
*#> List of 4
*#>  $ data  :'data.frame':	150 obs. of  5 variables:
#>   ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 ...
#>   ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 ...
#>   ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 ...
#>   ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 ...
#>   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 ...
*#>  $ in_id : int [1:112] 28 80 101 111 ...
#>  $ out_id: logi NA
#>  $ id    : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
#>   ..$ id: chr "Resample1"
#>  - attr(*, "class")= chr [1:3] "initial_split" "mc_split" ...
```

```r
# 提取训练组样本（Q：测试组样本用哪个函数？）
iris_split %>% training() %>% 
  glimpse(width = 40)
```

```
#> Rows: 112
#> Columns: 5
#> $ Sepal.Length <dbl> 5.2, 5.7, 6.3, 6.…
#> $ Sepal.Width  <dbl> 3.5, 2.6, 3.3, 3.…
#> $ Petal.Length <dbl> 1.5, 3.5, 6.0, 5.…
#> $ Petal.Width  <dbl> 0.2, 1.0, 2.5, 2.…
#> $ Species      <fct> setosa, versicolo…
```

]

---
layout: false
class: inverse, center, middle, hide_logo
background-image: url(imgs/logo-recipes.svg)
background-size: 4em
background-position: 12% 45%

# 2. 数据预处理 -> `recipes` .font60[<sup>v1.0.8</sup>]

.font120[(_Preprocessing and Feature Engineering Steps for Modeling_)]

---

### >> 2.1 定义菜单

.pull-left.code85[

```r
iris_recipe <- 
  recipe(Species ~ ., data = iris) %>% 
  step_corr(all_predictors()) %>% 
  step_center(all_predictors()) %>% 
  step_scale(all_predictors())

iris_recipe
```

.code80[
```
#> ── Recipe ────────────────────────────────
#> 
#> ── Inputs 
#> Number of variables by role
#> outcome:   1
#> predictor: 4
#> 
#> ── Operations 
#> • Correlation filter on: all_predictors()
#> • Centering for: all_predictors()
#> • Scaling for: all_predictors()
```
]
]

.pull-right.font100[

.bold.info[复杂的recipe类，`View()`它！]

.bold.info[`recipes` 包中共计有 98 个数据预处理步骤函数，示例如下：]

.code80[

```
[1] "step_arrange"    "..."            
[3] "step_corr"       "..."            
[5] "step_YeoJohnson" "step_zv"        
```
]

.bold.info[你可以用以下五种方式（及其组合）在 `step_*()` 中指定拟作用的变量：]
- 直接使用变量名枚举相关变量
- 以 `dplyr` 包的方式，如 `starts_with()`
- 以 `all_outcomes()`、`all_predictors()` 或 `has_role()` 指定角色的方式
- 以 `all_numeric()`、`all_nominal()` 或 `has_type()` 指定类型的方式
- 以指定类型+角色复合的方式，如 `all_numeric_predictors()`

`?recipes::selections`
]

---

### >> 2.2 准备菜单

.pull-left.code90[

```r
iris_recipe <- 
  iris_recipe %>% 
  prep(training(iris_split),
       verbose = TRUE)
```

```
#> oper 1 step corr [training] 
#> oper 2 step center [training] 
#> oper 3 step scale [training] 
#> The retained training set is ~ 0 Mb  in memory.
```

```r
# prep(x, 
#      training = NULL, 
#      fresh = FALSE, 
#      verbose = FALSE, 
#      retain = TRUE, 
#      log_change = FALSE, 
#      strings_as_factors = TRUE, 
#      ...)
```

]

.pull-right.code90[

```r
iris_recipe
```

.code75[
```
#> ── Recipe ───────────────────────────────────────────────────────
#> 
#> ── Inputs 
#> Number of variables by role
#> outcome:   1
#> predictor: 4
#> 
#> ── Training information 
#> Training data contained 112 data points and no incomplete rows.
#> 
#> ── Operations 
#> • Correlation filter on: Petal.Length | Trained
#> • Centering for: Sepal.Length, Sepal.Width, Petal.Width | Trained
#> • Scaling for: Sepal.Length, Sepal.Width, Petal.Width | Trained

```
]

]

---

### >> 2.3 应用菜单

.pull-left.code90[

```r
# 训练组
# 可直接提取准备阶段 retain 的数据
iris_training <- iris_recipe %>% 
* bake(new_data = NULL)

# bake(object, 
#      new_data, 
#      ..., 
#      composition = "tibble")

iris_training
```

```
#> # A tibble: 112 × 4
#>   Sepal.Length Sepal.Width Petal.Width Species   
#>          <dbl>       <dbl>       <dbl> <fct>     
#> 1       -0.805       1.09       -1.36  setosa    
#> 2       -0.212      -1.07       -0.316 versicolor
#> 3        0.500       0.607       1.64  virginica 
#> # ℹ 109 more rows
```

]

.pull-right.code85[

```r
# 测试组
# 用 bake() 进行预处理
iris_testing <- iris_recipe %>% 
* bake(new_data = testing(iris_split))

iris_testing
```

```
#> # A tibble: 38 × 4
#>   Sepal.Length Sepal.Width Petal.Width Species
#>          <dbl>       <dbl>       <dbl> <fct>  
#> 1       -0.924       1.09        -1.36 setosa 
#> 2       -1.52        0.847       -1.23 setosa 
#> 3       -1.28        0.847       -1.36 setosa 
#> # ℹ 35 more rows
```

]

---
layout: false
class: inverse, center, middle, hide_logo
background-image: url(imgs/logo-parsnip.svg)
background-size: 4em
background-position: 12% 45%

# 3. 模型训练 -> `parsnip` .font60[<sup>v1.1.1</sup>]

.font130[(_A Common API to Modeling and Analysis Functions_)]

---
layout: true

### >> 3.1 模型设定

---

.pull-left.code90[

```r
# 随机森林模型的统一 API
iris_rf <- rand_forest(
  mode = "classification", 
  trees = 100
)
# rand_forest(
#   mode = "unknown",
#   engine = "ranger",
#   mtry = NULL, trees = NULL,
#   min_n = NULL)

# 设定随机森林模型的计算引擎
# show_engines("rand_forest")
(iris_rf <- iris_rf %>% 
    set_engine("ranger"))
```

```
#> Random Forest Model Specification (classification)
#> 
#> Main Arguments:
#>   trees = 100
#> 
#> Computational engine: ranger
```
]

.pull-right.code85[

```r
# 查看 iris_rf 的类
iris_rf %>% class()
```

```
#> [1] "rand_forest" "model_spec"
```

```r
# 查看 iris_rf 的底层结构
iris_rf %>% str()
```

```
#> List of 7
#>  $ args                 :List of 3
#>   ..$ mtry : language ~NULL
#>   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
#>   ..$ trees: language ~100
#>   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
#>   ..$ min_n: language ~NULL
#>   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
#>  $ eng_args             : Named list()
#>   ..- attr(*, "class")= chr [1:2] "quosures" "list"
#>  $ mode                 : chr "classification"
#>  $ user_specified_mode  : logi TRUE
#>  $ method               : NULL
#>  $ engine               : chr "ranger"
#>  $ user_specified_engine: logi TRUE
#>  - attr(*, "class")= chr [1:2] "rand_forest" "model_spec"
```
]

---
layout: true

### >> 3.2 模型拟合

---

.pull-left.code85[

```r
# 拟合随机森林模型
(iris_fit <- iris_rf %>% 
  fit( # ?parsnip::fit.model_spec()
*   formula = Species ~ .,
    data = iris_training
  ))
```

```
#> parsnip model object
#> 
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~100,      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 
#> 
#> Type:                             Probability estimation 
#> Number of trees:                  100 
#> Sample size:                      112 
#> Number of independent variables:  3 
#> Mtry:                             1 
#> Target node size:                 10 
#> Variable importance mode:         none 
#> Splitrule:                        gini 
#> OOB prediction error (Brier s.):  0.0563
```
]

.pull-right.code90[

```r
# 查看 iris_fit 的类
iris_fit %>% class()
```

```
#> [1] "_ranger"   "model_fit"
```

```r
# 查看 iris_fit 的底层结构
iris_fit %>% str(max.level = 1)
```

```
#> List of 6
#>  $ lvl         : chr [1:3] "setosa" "versicolor" "virginica"
*#>  $ spec        :List of 8
#>   ..- attr(*, "class")= chr [1:2] "rand_forest" "model_spec"
*#>  $ fit         :List of 13
#>   ..- attr(*, "class")= chr "ranger"
#>  $ preproc     :List of 4
#>  $ elapsed     :List of 1
#>  $ censor_probs: list()
#>  - attr(*, "class")= chr [1:2] "_ranger" "model_fit"
```
]

---
layout: true

### >> 3.3 模型预测

---

.pull-left.code90[

```r
# 将拟合模型应用于测试组
iris_fit %>% 
  predict(
    new_data = iris_testing
  )
```

```
#> # A tibble: 38 × 1
#>   .pred_class
#>   <fct>      
#> 1 setosa     
#> 2 setosa     
#> 3 setosa     
#> # ℹ 35 more rows
```

```r
# ? parsnip::predict.model_fit()
# predict(object, new_data, 
#   type = NULL,  
#     # "numeric", "class", "prob", 
#     # "conf_int", "pred_int", 
#     # "quantile", "time", 
#     # "hazard", "survival", "raw"
#   opts = list(), ...)
```
]

.pull-right.code95[

```r
# 合并原数据集
iris_class <- iris_fit %>% 
  predict(
    new_data = iris_testing
  ) %>% 
* bind_cols(iris_testing)

iris_class %>% 
  glimpse(width = 40)
```

```
#> Rows: 38
#> Columns: 5
#> $ .pred_class  <fct> setosa, setosa, s…
#> $ Sepal.Length <dbl> -0.9241, -1.5176,…
#> $ Sepal.Width  <dbl> 1.087, 0.847, 0.8…
#> $ Petal.Width  <dbl> -1.359, -1.229, -…
#> $ Species      <fct> setosa, setosa, s…
```
]

---
layout: false
class: inverse, center, middle, hide_logo
background-image: url(imgs/logo-yardstick.svg)
background-size: 4em
background-position: 12% 45%

# 4. 模型表现 -> `yardstick` .font60[<sup>v1.2.0</sup>]

.font130[(_Tidy Characterizations of Model Performance_)]

---
layout: true

### >> 4.1 模型表现：`metrics()`

---

.pull-left.code90[

```r
iris_class %>% glimpse(width = 40)
```

```r
iris_class %>% 
* metrics( # 计算模型表现指标
*   truth = Species,
*   estimate = .pred_class
* )
```

```
#> # A tibble: 2 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy multiclass     0.947
#> 2 kap      multiclass     0.920
```
]

.pull-right.code85[

```r
# 提取各分类的预测概率
iris_probs <- iris_fit %>% 
  predict(
    new_data = iris_testing,
*   type = "prob"
  ) %>% 
  bind_cols(iris_class)

# 模型表现指标：基于分类预测概率
iris_probs %>% 
  metrics(
    truth = Species, 
    estimate = .pred_class,
*   .pred_setosa:.pred_virginica
  )
```

```
#> # A tibble: 4 × 3
#>   .metric     .estimator .estimate
#>   <chr>       <chr>          <dbl>
#> 1 accuracy    multiclass     0.947
#> 2 kap         multiclass     0.920
*#> 3 mn_log_loss multiclass     0.199
*#> 4 roc_auc     hand_till      0.979
```
]

---
layout: true

### >> 4.2 模型表现：`*_curve()`

---

.pull-left.code90[

```r
# 接收器操作特征曲线
iris_roc <- iris_probs %>% 
  roc_curve(
    truth = Species, 
    .pred_setosa:.pred_virginica
  )    
# iris_roc -> a "roc_df"

iris_roc %>% autoplot()
```

<img src="L09_tidymodels_files/figure-html/unnamed-chunk-27-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right.code90[

```r
# 增益曲线
iris_gain <- iris_probs %>% 
* gain_curve(
    truth = Species, 
    .pred_setosa:.pred_virginica
  )
# iris_gain   # a "gain_df"

iris_gain %>% autoplot()
```

<img src="L09_tidymodels_files/figure-html/unnamed-chunk-28-1.png" width="100%" style="display: block; margin: auto;" />
]

---
layout: false
class: inverse, center, middle

# 5. 一个较复杂的案例

.font150[(_Case Study_)]

---
### >> 5.0 数据

.code75[

```r
modeldata::ames %>% str()
```

```
#> tibble [2,930 × 74] (S3: tbl_df/tbl/data.frame)
#>  $ MS_SubClass       : Factor w/ 16 levels "One_Story_1946_and_Newer_All_Styles",..: 1 1 1 1 6 6 12 12 12 6 ...
#>  $ MS_Zoning         : Factor w/ 7 levels "Floating_Village_Residential",..: 3 2 3 3 3 3 3 3 3 3 ...
#>  $ Lot_Frontage      : num [1:2930] 141 80 81 93 74 78 41 43 39 60 ...
#>  $ Lot_Area          : int [1:2930] 31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
#>  $ Street            : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ Alley             : Factor w/ 3 levels "Gravel","No_Alley_Access",..: 2 2 2 2 2 2 2 2 2 2 ...
#>  $ Lot_Shape         : Factor w/ 4 levels "Regular","Slightly_Irregular",..: 2 1 2 1 2 2 1 2 2 1 ...
#>  $ Land_Contour      : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 2 4 4 ...
#>  $ Utilities         : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Lot_Config        : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 1 1 5 5 5 5 5 5 ...
#>  $ Land_Slope        : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Neighborhood      : Factor w/ 29 levels "North_Ames","College_Creek",..: 1 1 1 1 7 7 17 17 17 7 ...
#>  $ Condition_1       : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 3 3 3 ...
#>  $ Condition_2       : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ Bldg_Type         : Factor w/ 5 levels "OneFam","TwoFmCon",..: 1 1 1 1 1 1 5 5 5 1 ...
#>  $ House_Style       : Factor w/ 8 levels "One_and_Half_Fin",..: 3 3 3 3 8 8 3 3 3 8 ...
#>  $ Overall_Cond      : Factor w/ 10 levels "Very_Poor","Poor",..: 5 6 6 5 5 6 5 5 5 5 ...
#>  $ Year_Built        : int [1:2930] 1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
#>  $ Year_Remod_Add    : int [1:2930] 1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...
#>  $ Roof_Style        : Factor w/ 6 levels "Flat","Gable",..: 4 2 4 4 2 2 2 2 2 2 ...
#>  $ Roof_Matl         : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
#>  $ Exterior_1st      : Factor w/ 16 levels "AsbShng","AsphShn",..: 4 14 15 4 14 14 6 7 6 14 ...
#>  $ Exterior_2nd      : Factor w/ 17 levels "AsbShng","AsphShn",..: 11 15 16 4 15 15 6 7 6 15 ...
#>  $ Mas_Vnr_Type      : Factor w/ 5 levels "BrkCmn","BrkFace",..: 5 4 2 4 4 2 4 4 4 4 ...
#>  $ Mas_Vnr_Area      : num [1:2930] 112 0 108 0 0 20 0 0 0 0 ...
#>  $ Exter_Cond        : Factor w/ 5 levels "Excellent","Fair",..: 5 5 5 5 5 5 5 5 5 5 ...
#>  $ Foundation        : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 2 2 3 3 3 3 3 3 ...
#>  $ Bsmt_Cond         : Factor w/ 6 levels "Excellent","Fair",..: 3 6 6 6 6 6 6 6 6 6 ...
#>  $ Bsmt_Exposure     : Factor w/ 5 levels "Av","Gd","Mn",..: 2 4 4 4 4 4 3 4 4 4 ...
#>  $ BsmtFin_Type_1    : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 2 6 1 1 3 3 3 1 3 7 ...
#>  $ BsmtFin_SF_1      : num [1:2930] 2 6 1 1 3 3 3 1 3 7 ...
#>  $ BsmtFin_Type_2    : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 7 4 7 7 7 7 7 7 7 7 ...
#>  $ BsmtFin_SF_2      : num [1:2930] 0 144 0 0 0 0 0 0 0 0 ...
#>  $ Bsmt_Unf_SF       : num [1:2930] 441 270 406 1045 137 ...
#>  $ Total_Bsmt_SF     : num [1:2930] 1080 882 1329 2110 928 ...
#>  $ Heating           : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
#>  $ Heating_QC        : Factor w/ 5 levels "Excellent","Fair",..: 2 5 5 1 3 1 1 1 1 3 ...
#>  $ Central_Air       : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ Electrical        : Factor w/ 6 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 5 5 ...
#>  $ First_Flr_SF      : int [1:2930] 1656 896 1329 2110 928 926 1338 1280 1616 1028 ...
#>  $ Second_Flr_SF     : int [1:2930] 0 0 0 0 701 678 0 0 0 776 ...
#>  $ Gr_Liv_Area       : int [1:2930] 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
#>  $ Bsmt_Full_Bath    : num [1:2930] 1 0 0 1 0 0 1 0 1 0 ...
#>  $ Bsmt_Half_Bath    : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ Full_Bath         : int [1:2930] 1 1 1 2 2 2 2 2 2 2 ...
#>  $ Half_Bath         : int [1:2930] 0 0 1 1 1 1 0 0 0 1 ...
#>  $ Bedroom_AbvGr     : int [1:2930] 3 2 3 3 3 3 2 2 2 3 ...
#>  $ Kitchen_AbvGr     : int [1:2930] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ TotRms_AbvGrd     : int [1:2930] 7 5 6 8 6 7 6 5 5 7 ...
#>  $ Functional        : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 8 8 8 8 ...
#>  $ Fireplaces        : int [1:2930] 2 0 0 2 1 1 0 0 1 1 ...
#>  $ Garage_Type       : Factor w/ 7 levels "Attchd","Basment",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Garage_Finish     : Factor w/ 4 levels "Fin","No_Garage",..: 1 4 4 1 1 1 1 3 3 1 ...
#>  $ Garage_Cars       : num [1:2930] 2 1 1 2 2 2 2 2 2 2 ...
#>  $ Garage_Area       : num [1:2930] 528 730 312 522 482 470 582 506 608 442 ...
#>  $ Garage_Cond       : Factor w/ 6 levels "Excellent","Fair",..: 6 6 6 6 6 6 6 6 6 6 ...
#>  $ Paved_Drive       : Factor w/ 3 levels "Dirt_Gravel",..: 2 3 3 3 3 3 3 3 3 3 ...
#>  $ Wood_Deck_SF      : int [1:2930] 210 140 393 0 212 360 0 0 237 140 ...
#>  $ Open_Porch_SF     : int [1:2930] 62 0 36 0 34 36 0 82 152 60 ...
#>  $ Enclosed_Porch    : int [1:2930] 0 0 0 0 0 0 170 0 0 0 ...
#>  $ Three_season_porch: int [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ Screen_Porch      : int [1:2930] 0 120 0 0 0 0 0 144 0 0 ...
#>  $ Pool_Area         : int [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ Pool_QC           : Factor w/ 5 levels "Excellent","Fair",..: 4 4 4 4 4 4 4 4 4 4 ...
#>  $ Fence             : Factor w/ 5 levels "Good_Privacy",..: 5 3 5 5 3 5 5 5 5 5 ...
#>  $ Misc_Feature      : Factor w/ 6 levels "Elev","Gar2",..: 3 3 2 3 3 3 3 3 3 3 ...
#>  $ Misc_Val          : int [1:2930] 0 0 12500 0 0 0 0 0 0 0 ...
#>  $ Mo_Sold           : int [1:2930] 5 6 6 4 3 6 4 1 3 6 ...
#>  $ Year_Sold         : int [1:2930] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
#>  $ Sale_Type         : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 10 10 10 10 10 ...
#>  $ Sale_Condition    : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...
#>  $ Sale_Price        : int [1:2930] 215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
#>  $ Longitude         : num [1:2930] -93.6 -93.6 -93.6 -93.6 -93.6 ...
#>  $ Latitude          : num [1:2930] 42.1 42.1 42.1 42.1 42.1 ...
```
]

---
### >> 5.1 数据重采样 -> `rsample` 包

.pull-left.code90[
<img src="imgs/resampling.svg" width="100%" style="display: block; margin: auto;" />

```r
set.seed(1234)
data_split <- 
  initial_split(
    ames, strata = "Sale_Price")
ames_train <- training(data_split)
ames_test  <- testing(data_split)
nrow(ames_train) / nrow(ames)
```

```
#> [1] 0.75
```
]

.pull-right.code85[

```r
# v折交叉验证
(cv_splits <- vfold_cv(ames_train))
```

```
#> #  10-fold cross-validation 
#> # A tibble: 10 × 2
#>   splits             id    
#>   <list>             <chr> 
#> 1 <split [1977/220]> Fold01
#> 2 <split [1977/220]> Fold02
#> 3 <split [1977/220]> Fold03
#> # ℹ 7 more rows
```

```r
cv_splits$splits[[1]]
```

```
#> <Analysis/Assess/Total>
#> <1977/220/2197>
```

```r
cv_splits$splits[[1]] %>% 
  analysis() %>% dim()
```

```
#> [1] 1977   74
```

`vignette("Common_Patterns", package = "rsample")`
]

---
layout: true

### >> 5.2 数据预处理 -> `recipes` 包

---

```r
# 1. 定义数据预处理菜单
ames_rec <- 
  recipe(Sale_Price ~ Bldg_Type + Neighborhood + Year_Built + Gr_Liv_Area + 
           Full_Bath + Year_Sold + Lot_Area + Central_Air + Longitude + Latitude,
         data = ames_train) %>%
  step_log(Sale_Price, base = 10) %>%
  step_BoxCox(Lot_Area, Gr_Liv_Area) %>%
  step_other(Neighborhood, threshold = 0.05)  %>%
  step_dummy(all_nominal()) %>%
  step_interact(~ starts_with("Central_Air"):Year_Built) %>%
  step_bs(Longitude, Latitude, options = list(df = 5))

ames_rec %>% tidy()
```

```
#> # A tibble: 6 × 6
#>   number operation type   trained skip  id          
#>    <int> <chr>     <chr>  <lgl>   <lgl> <chr>       
#> 1      1 step      log    FALSE   FALSE log_YNFHa   
#> 2      2 step      BoxCox FALSE   FALSE BoxCox_vCuOq
#> 3      3 step      other  FALSE   FALSE other_0nogs 
#> # ℹ 3 more rows
```

---
.pull-left.code85[

```r
# 2. 准备数据预处理
cv_splits <- cv_splits %>% 
  mutate(
    recipes = map(splits, prepper,
                  recipe = ames_rec)
  )
cv_splits
```

```
#> #  10-fold cross-validation 
#> # A tibble: 10 × 3
#>   splits             id     recipes 
#>   <list>             <chr>  <list>  
#> 1 <split [1977/220]> Fold01 <recipe>
#> 2 <split [1977/220]> Fold02 <recipe>
#> 3 <split [1977/220]> Fold03 <recipe>
#> # ℹ 7 more rows
```

```r
# prepper
# function(split_obj, recipe, ...) {
#   prep(recipe, 
#        training = rsample::analysis(split_obj), 
#        fresh = TRUE, ...)
# }
```
]

.pull-right.code90[

```r
# 相同的菜单 + 
# 不同的训练数据集 -> 
# 不同的准备结果
cv_splits$recipes %>% 
  map(\(x) tidy(x, number = 2)) %>% 
  list_rbind() %>% 
  filter(terms == "Lot_Area")
```

```
#> # A tibble: 10 × 3
#>   terms     value id          
#>   <chr>     <dbl> <chr>       
#> 1 Lot_Area 0.0836 BoxCox_vCuOq
#> 2 Lot_Area 0.0777 BoxCox_vCuOq
#> 3 Lot_Area 0.104  BoxCox_vCuOq
#> # ℹ 7 more rows
```

```r
# ?recipes:::tidy.recipe
```
]

---
layout: true

### >> 5.3 拟合模型 -> `parsnip` 包

---

--
.pull-left.code85[

```r
# 1. 设定模型
knn_mod <- 
  nearest_neighbor(
    mode = "regression",
    neighbors = 5) %>%
  set_engine("kknn") # 默认，可不设置

# 2. 拟合模型
# 2.1 定义辅助函数
parsnip_fit <- function(rec, model) {
  fit(model, Sale_Price ~ ., 
      data = bake(rec, new_data= NULL))
}
# 2.2 应用辅助函数
cv_splits <- cv_splits %>% 
  mutate(
    knn = map(
      recipes, 
      parsnip_fit, 
      model = knn_mod
    )
  )
```
]

--
.pull-right.code85[

```r
# 查看结果
cv_splits %>% select(-id)
```

```
#> # A tibble: 10 × 3
#>   splits             recipes  knn     
#>   <list>             <list>   <list>  
#> 1 <split [1977/220]> <recipe> <fit[+]>
#> 2 <split [1977/220]> <recipe> <fit[+]>
#> 3 <split [1977/220]> <recipe> <fit[+]>
#> # ℹ 7 more rows
```

```r
cv_splits$knn[[1]] %>% 
  str(max.level = 1)
```

```
#> List of 6
#>  $ lvl         : NULL
#>  $ spec        :List of 7
#>   ..- attr(*, "class")= chr [1:2] "nearest_neighbor" "model_spec"
#>  $ fit         :List of 10
#>   ..- attr(*, "class")= chr [1:2] "train.kknn" "kknn"
#>  $ preproc     :List of 1
#>  $ elapsed     :List of 1
#>  $ censor_probs: list()
#>  - attr(*, "class")= chr [1:2] "_train.kknn" "model_fit"
```
]

---
layout: true

### >> 5.4 评估模型表现 -> `parsnip` 包 + `yardstick` 包

---

--
.pull-left.code80[

```r
# 定义模型表现函数
pf_metrics <- metric_set(rmse, rsq, ccc)

# 定义辅助函数
parsnip_metrics <- 
  function(split, recipe, model) {
    raw <- assessment(split)
    processed <- bake(recipe, 
                      new_data = raw)
    model %>%
      predict(new_data = processed) %>%
      bind_cols(processed) %>% 
      pf_metrics(Sale_Price, .pred)
  }
# 应用辅助函数
cv_splits <- cv_splits %>%
  mutate(
    metrics = pmap(
      list(
        split  = splits, 
        recipe = recipes, 
        model  = knn
      ),
      parsnip_metrics 
    )
  )
```
]

--
.pull-right.code80[

```r
cv_splits$metrics[[1]]
```

```
#> # A tibble: 3 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard      0.0839
#> 2 rsq     standard      0.761 
#> 3 ccc     standard      0.871
```

```r
# 汇总计算模型表现指标
cv_splits %>%
  unnest(cols = metrics) %>%
  group_by(.metric) %>%
  summarise(
    mean = mean(.estimate),
    std_err = sd(.estimate)/sqrt(n())
  )
```

```
#> # A tibble: 3 × 3
#>   .metric   mean std_err
#>   <chr>    <dbl>   <dbl>
#> 1 ccc     0.869  0.00519
#> 2 rmse    0.0870 0.00245
#> 3 rsq     0.764  0.00889
```
]

---
layout: true

### >> 5.5 模型调参：手动模式

---

<br>

---

.pull-left.code80[

```r
# 定义辅助函数
m_metrics <- function(split, recipe, fit) {
  holdout <- bake(
    recipe, 
    new_data = assessment(split))
  resample_name <- labels(split)$id
  multi_predict(fit,
                new_data = holdout,
                neighbors = 1:50) %>%
    bind_cols(., 
      holdout %>% select(Sale_Price)) %>%
    unnest(cols = .pred) %>% 
    group_by(neighbors) %>% 
    pf_metrics(Sale_Price, .pred) %>%
    mutate(resample = resample_name)
}
# 应用辅助函数
cv_splits <- cv_splits %>% 
  mutate(
    tuning_metrics = 
      pmap(
        list(splits, recipes, knn), 
        m_metrics
      )
  )
```
]

--
.pull-right.code85[

```r
cv_splits$tuning_metrics[[1]]
```

```
#> # A tibble: 150 × 5
#>   neighbors .metric .estimator .estimate resample
#>       <int> <chr>   <chr>          <dbl> <chr>   
#> 1         1 rmse    standard      0.101  Fold01  
#> 2         2 rmse    standard      0.0951 Fold01  
#> 3         3 rmse    standard      0.0898 Fold01  
#> # ℹ 147 more rows
```

```r
# 汇总计算模型表现指标
(rs_rmse <- cv_splits %>% 
  unnest(tuning_metrics) %>% 
  filter(.metric == "rmse") %>% 
  group_by(neighbors) %>% 
  summarize(rmse = mean(.estimate)))
```

```
#> # A tibble: 50 × 2
#>   neighbors   rmse
#>       <int>  <dbl>
#> 1         1 0.104 
#> 2         2 0.0972
#> 3         3 0.0921
#> # ℹ 47 more rows
```
]

---

.pull-left.code85[

```r
(min_rmse <- rs_rmse %>% 
  arrange(rmse) %>% slice(1))
```

```
#> # A tibble: 1 × 2
#>   neighbors   rmse
#>       <int>  <dbl>
#> 1        11 0.0837
```

```r
ggplot(rs_rmse, 
    aes(x = neighbors, y = rmse)) + 
  geom_point() + geom_path() + 
  geom_point(data = min_rmse, 
          size = 3, color = "red")
```

<img src="L09_tidymodels_files/figure-html/unnamed-chunk-42-1.png" width="80%" style="display: block; margin: auto;" />
]

--
.pull-right.code90[

```r
# 最终模型
ames_rec_final <- prep(ames_rec)
knn_mod_final <- 
  update(knn_mod, 
         neighbors = 11) %>% 
  fit(Sale_Price ~ ., 
      data = bake(ames_rec_final,
                  new_data = NULL))

# 评估最终模型在测试集上的表现
parsnip_metrics(data_split, 
                ames_rec_final,
                knn_mod_final)
```

```
#> # A tibble: 3 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard      0.0754
#> 2 rsq     standard      0.813 
#> 3 ccc     standard      0.894
```
]

---
layout: true

### >> 5.6 模型调参：`workflows` .font80[<sup>v1.1.3</sup>] + `tune` .font80[<sup>v1.1.2</sup>]

---

.pull-left[

```r
# 重新定义模型：设定需调校的参数
knn_mod <- 
  nearest_neighbor(
    mode = "regression",
*   neighbors = tune()) %>%
  set_engine("kknn")

# 定义流程
ames_wf <- 
  workflow() %>% 
  add_model(knn_mod) %>% 
  add_recipe(ames_rec)

# 基于参数网格计算模型表现指标
ames_res <- 
  ames_wf %>% 
* tune_grid(
    cv_splits %>% select(splits, id),
*   grid = tibble(neighbors = 1:50),
    metrics = metric_set(rmse, rsq, ccc)
  )
```
]

.pull-right[

```r
ames_res
```

```
#> # Tuning results
#> # 10-fold cross-validation 
#> # A tibble: 10 × 4
#>   splits             id     .metrics           .notes          
#>   <list>             <chr>  <list>             <list>          
#> 1 <split [1977/220]> Fold01 <tibble [150 × 5]> <tibble [0 × 3]>
#> 2 <split [1977/220]> Fold02 <tibble [150 × 5]> <tibble [0 × 3]>
#> 3 <split [1977/220]> Fold03 <tibble [150 × 5]> <tibble [1 × 3]>
#> # ℹ 7 more rows
#> 
#> There were issues with some computations:
#> 
#>   - Warning(s) x2: some 'x' values beyond boundary knots may cause ill-conditioned bases
#> 
#> Run `show_notes(.Last.tune.result)` for more information.
```

```r
ames_res %>% collect_metrics() %>% 
  glimpse(width = 40)
```

```
#> Rows: 150
#> Columns: 7
#> $ neighbors  <int> 1, 1, 1, 2, 2, 2, 3…
#> $ .metric    <chr> "ccc", "rmse", "rsq…
#> $ .estimator <chr> "standard", "standa…
#> $ mean       <dbl> 0.8276, 0.1042, 0.6…
#> $ n          <int> 10, 10, 10, 10, 10,…
#> $ std_err    <dbl> 0.00943, 0.00302, 0…
#> $ .config    <chr> "Preprocessor1_Mode…
```
]

---

```r
ames_res %>% 
* collect_metrics() %>%
  filter(.metric == "rmse") %>% 
  ggplot(aes(x = neighbors, y = mean)) + 
  geom_point() + geom_line() + 
  geom_point(size = 2, color = 'red', 
             data = . %>% slice_min(mean)) + 
  ylab("rmse")
```

---

```r
(best_knn <- 
    ames_res %>% select_best("rmse"))
```

```
#> # A tibble: 1 × 2
#>   neighbors .config              
#>       <int> <chr>                
#> 1        11 Preprocessor1_Model11
```

```r
ames_fit_test <- ames_wf %>% 
* finalize_workflow(best_knn) %>%
* last_fit(
    split = data_split,
    metrics = metric_set(rmse, rsq, ccc)
  )
ames_fit_test %>% glimpse()
```

```
#> Rows: 1
#> Columns: 6
#> $ splits       <list> [<initial_split[2197 x 733 x 2930 x 74]>]
#> $ id           <chr> "train/test split"
#> $ .metrics     <list> [<tbl_df[3 x 4]>]
#> $ .notes       <list> [<tbl_df[1 x 3]>]
#> $ .predictions <list> [<tbl_df[733 x 4]>]
#> $ .workflow    <list> [Bldg_Type, Neighborhood, Year_Built, Gr_Liv_Area, Full_Bath, Year…
```

---

.pull-left[

```r
ames_fit_test %>% 
  collect_metrics()
```

```
#> # A tibble: 3 × 4
#>   .metric .estimator .estimate .config  
#>   <chr>   <chr>          <dbl> <chr>    
#> 1 rmse    standard      0.0754 Preproce…
#> 2 rsq     standard      0.813  Preproce…
#> 3 ccc     standard      0.894  Preproce…
```

]

.pull-right[

```r
ames_fit_test %>% 
  collect_predictions() %>%
  ggplot(aes(Sale_Price, .pred)) +
  geom_abline(lty = 2, size = 1.5,  
              color = "gray50") + 
  geom_point(alpha = 0.5) + 
  coord_fixed()
```

]

---
layout: false

### >> `tidymodels` 家族！

---
layout: false
class: inverse, center, middle

# 课后作业

---
class: middle

.font150[

1. 在 📑 _qmd_ 中键入并运行本讲的代码（自觉完成，无需提交）

1. 登录 .red[[{{`tidymodels`}}](https://www.tidymodels.org/)] 的主页 🏠，浏览并学习相关资料

1. `tidymodels`子包的网站 <https://&lt;子包名&gt;.tidymodels.org/>

1. [.bold[{{bilibili::简单的tidymodels}}]](https://space.bilibili.com/382769913/channel/collectiondetail?sid=920988)

1. `tidymodels` 团队撰写的书 📖，供感兴趣的同学学习精进：
  - [Tidy Modeling with R](https://www.tmwr.org/)
  - [Feature Engineering and Selection](https://bookdown.org/max/FES/)
  - [Supervised Machine Learning for Text Analysis in R](https://smltar.com/)

]

---
class: center, middle, hide_logo
background-image: url(imgs/xaringan.png)
background-size: 12%
background-position: 50% 40%

<br><br><br><br><br><br><br>
<hr color='#f00' size='2px' width='80%'>
<br>
.Large.red[_**本网页版讲义的制作由 R 包 [{{`xaringan`}}](https://github.com/yihui/xaringan) 赋能！**_]