+ - 0:00:00
Notes for current slide
Notes for next slide

量化金融与金融编程

L09 tidymodels suite


曾永艺

厦门大学管理学院


2023-12-08

1 / 37

2 / 37

>> 为什么要在 中进行机器学习分析

3 / 37

>> 为什么要在 中进行机器学习分析


💪 R 和 R 包大多是由统计学家和数据分析专家开发的

💪 R 有着异常丰富且紧跟前沿的 {{统计建模和机器学习包}}

💪 R 容易接口或调用其他应用软件(如 C++、python、Spark、TensorFlow 等)

💪 有大量学习参考资料

3 / 37

>> 为什么要在 中进行机器学习分析


💪 R 和 R 包大多是由统计学家和数据分析专家开发的

💪 R 有着异常丰富且紧跟前沿的 {{统计建模和机器学习包}}

💪 R 容易接口或调用其他应用软件(如 C++、python、Spark、TensorFlow 等)

💪 有大量学习参考资料

3 / 37

>> 在 中进行机器学习分析面临哪些问题

4 / 37

>> 在 中进行机器学习分析面临哪些问题


😰 作为解析型的数据分析语言,R 的运行效率低于 C、C++、JAVA 等编译语言

😰 R 操作数据的规模往往受限于计算机内存的大小

😰 各式 R 包的接口并不统一

4 / 37

>> 在 中进行机器学习分析面临哪些问题


😰 作为解析型的数据分析语言,R 的运行效率低于 C、C++、JAVA 等编译语言

😰 R 操作数据的规模往往受限于计算机内存的大小

😰 各式 R 包的接口并不统一

Function Code
MASS::lda predict(obj)
stats::glm predict(obj, type = "response")
gbm::gbm predict(obj, type = "response", n.trees)
rpart::rpart predict(obj, type = "prob")
RWeka::Weka predict(obj, type = "probability")
4 / 37

>> 在 中进行机器学习分析面临哪些问题


😰 作为解析型的数据分析语言,R 的运行效率低于 C、C++、JAVA 等编译语言

😰 R 操作数据的规模往往受限于计算机内存的大小

😰 各式 R 包的接口并不统一

Function Code
MASS::lda predict(obj)
stats::glm predict(obj, type = "response")
gbm::gbm predict(obj, type = "response", n.trees)
rpart::rpart predict(obj, type = "prob")
RWeka::Weka predict(obj, type = "probability")

4 / 37


5 / 37

>> 数据重采样 -> rsample

# 先安装并加载 tidymodels 元包(meta package)
# install.package("tidymodels")
library(tidymodels) # v1.0.0 @ 2022-07-13
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
#> ✔ broom 1.0.5 ✔ recipes 1.0.8
#> ✔ dials 1.2.0 ✔ rsample 1.2.0
#> ✔ dplyr 1.1.4 ✔ tibble 3.2.1
#> ✔ ggplot2 3.4.4 ✔ tidyr 1.3.0
#> ✔ infer 1.0.5 ✔ tune 1.1.2
#> ✔ modeldata 1.2.0 ✔ workflows 1.1.3
#> ✔ parsnip 1.1.1 ✔ workflowsets 1.0.1
#> ✔ purrr 1.0.2 ✔ yardstick 1.2.0
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ recipes::step() masks stats::step()
#> • Use suppressPackageStartupMessages() to eliminate package startup messages
7 / 37

>> 数据重采样 -> rsample

# 设定随机数种子
set.seed(1234)
# 随机抽样分成训练组和试验组
iris_split <- initial_split(iris)
# initial_split(data, prop = 3/4,
# strata = NULL,
# breaks = 4,
# pool = 0.1, ...)
# 查看分组情况
iris_split
#> <Training/Testing/Total>
#> <112/38/150>
# 查看iris_split的类
iris_split %>% class()
#> [1] "initial_split" "mc_split"
#> [3] "rsplit"
8 / 37

>> 数据重采样 -> rsample

# 设定随机数种子
set.seed(1234)
# 随机抽样分成训练组和试验组
iris_split <- initial_split(iris)
# initial_split(data, prop = 3/4,
# strata = NULL,
# breaks = 4,
# pool = 0.1, ...)
# 查看分组情况
iris_split
#> <Training/Testing/Total>
#> <112/38/150>
# 查看iris_split的类
iris_split %>% class()
#> [1] "initial_split" "mc_split"
#> [3] "rsplit"
# 查看 rsplit 类的底层结构
iris_split %>% str(vec.len = 1.5)
#> List of 4
#> $ data :'data.frame': 150 obs. of 5 variables:
#> ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 ...
#> ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 ...
#> ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 ...
#> ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 ...
#> ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 ...
#> $ in_id : int [1:112] 28 80 101 111 ...
#> $ out_id: logi NA
#> $ id : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
#> ..$ id: chr "Resample1"
#> - attr(*, "class")= chr [1:3] "initial_split" "mc_split" ...
# 提取训练组样本(Q:测试组样本用哪个函数?)
iris_split %>% training() %>%
glimpse(width = 40)
#> Rows: 112
#> Columns: 5
#> $ Sepal.Length <dbl> 5.2, 5.7, 6.3, 6.…
#> $ Sepal.Width <dbl> 3.5, 2.6, 3.3, 3.…
#> $ Petal.Length <dbl> 1.5, 3.5, 6.0, 5.…
#> $ Petal.Width <dbl> 0.2, 1.0, 2.5, 2.…
#> $ Species <fct> setosa, versicolo…
8 / 37

>> 2.1 定义菜单

10 / 37

>> 2.1 定义菜单

iris_recipe <-
recipe(Species ~ ., data = iris) %>%
step_corr(all_predictors()) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
iris_recipe
#> ── Recipe ────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 4
#>
#> ── Operations
#> • Correlation filter on: all_predictors()
#> • Centering for: all_predictors()
#> • Scaling for: all_predictors()
10 / 37

>> 2.1 定义菜单

iris_recipe <-
recipe(Species ~ ., data = iris) %>%
step_corr(all_predictors()) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
iris_recipe
#> ── Recipe ────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 4
#>
#> ── Operations
#> • Correlation filter on: all_predictors()
#> • Centering for: all_predictors()
#> • Scaling for: all_predictors()

复杂的recipe类,View()它!

recipes 包中共计有 98 个数据预处理步骤函数,示例如下:

[1] "step_arrange" "..."
[3] "step_corr" "..."
[5] "step_YeoJohnson" "step_zv"

你可以用以下五种方式(及其组合)在 step_*() 中指定拟作用的变量:

  • 直接使用变量名枚举相关变量
  • dplyr 包的方式,如 starts_with()
  • all_outcomes()all_predictors()has_role() 指定角色的方式
  • all_numeric()all_nominal()has_type() 指定类型的方式
  • 以指定类型+角色复合的方式,如 all_numeric_predictors()

?recipes::selections

10 / 37

>> 2.2 准备菜单

11 / 37

>> 2.2 准备菜单

iris_recipe <-
iris_recipe %>%
prep(training(iris_split),
verbose = TRUE)
#> oper 1 step corr [training]
#> oper 2 step center [training]
#> oper 3 step scale [training]
#> The retained training set is ~ 0 Mb in memory.
# prep(x,
# training = NULL,
# fresh = FALSE,
# verbose = FALSE,
# retain = TRUE,
# log_change = FALSE,
# strings_as_factors = TRUE,
# ...)
11 / 37

>> 2.2 准备菜单

iris_recipe <-
iris_recipe %>%
prep(training(iris_split),
verbose = TRUE)
#> oper 1 step corr [training]
#> oper 2 step center [training]
#> oper 3 step scale [training]
#> The retained training set is ~ 0 Mb in memory.
# prep(x,
# training = NULL,
# fresh = FALSE,
# verbose = FALSE,
# retain = TRUE,
# log_change = FALSE,
# strings_as_factors = TRUE,
# ...)
iris_recipe
#> ── Recipe ───────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 4
#>
#> ── Training information
#> Training data contained 112 data points and no incomplete rows.
#>
#> ── Operations
#> • Correlation filter on: Petal.Length | Trained
#> • Centering for: Sepal.Length, Sepal.Width, Petal.Width | Trained
#> • Scaling for: Sepal.Length, Sepal.Width, Petal.Width | Trained
11 / 37

>> 2.3 应用菜单

12 / 37

>> 2.3 应用菜单

# 训练组
# 可直接提取准备阶段 retain 的数据
iris_training <- iris_recipe %>%
bake(new_data = NULL)
# bake(object,
# new_data,
# ...,
# composition = "tibble")
iris_training
#> # A tibble: 112 × 4
#> Sepal.Length Sepal.Width Petal.Width Species
#> <dbl> <dbl> <dbl> <fct>
#> 1 -0.805 1.09 -1.36 setosa
#> 2 -0.212 -1.07 -0.316 versicolor
#> 3 0.500 0.607 1.64 virginica
#> # ℹ 109 more rows
12 / 37

>> 2.3 应用菜单

# 训练组
# 可直接提取准备阶段 retain 的数据
iris_training <- iris_recipe %>%
bake(new_data = NULL)
# bake(object,
# new_data,
# ...,
# composition = "tibble")
iris_training
#> # A tibble: 112 × 4
#> Sepal.Length Sepal.Width Petal.Width Species
#> <dbl> <dbl> <dbl> <fct>
#> 1 -0.805 1.09 -1.36 setosa
#> 2 -0.212 -1.07 -0.316 versicolor
#> 3 0.500 0.607 1.64 virginica
#> # ℹ 109 more rows
# 测试组
# 用 bake() 进行预处理
iris_testing <- iris_recipe %>%
bake(new_data = testing(iris_split))
iris_testing
#> # A tibble: 38 × 4
#> Sepal.Length Sepal.Width Petal.Width Species
#> <dbl> <dbl> <dbl> <fct>
#> 1 -0.924 1.09 -1.36 setosa
#> 2 -1.52 0.847 -1.23 setosa
#> 3 -1.28 0.847 -1.36 setosa
#> # ℹ 35 more rows
12 / 37

>> 3.1 模型设定

14 / 37

>> 3.1 模型设定

# 随机森林模型的统一 API
iris_rf <- rand_forest(
mode = "classification",
trees = 100
)
# rand_forest(
# mode = "unknown",
# engine = "ranger",
# mtry = NULL, trees = NULL,
# min_n = NULL)
# 设定随机森林模型的计算引擎
# show_engines("rand_forest")
(iris_rf <- iris_rf %>%
set_engine("ranger"))
#> Random Forest Model Specification (classification)
#>
#> Main Arguments:
#> trees = 100
#>
#> Computational engine: ranger
14 / 37

>> 3.1 模型设定

# 随机森林模型的统一 API
iris_rf <- rand_forest(
mode = "classification",
trees = 100
)
# rand_forest(
# mode = "unknown",
# engine = "ranger",
# mtry = NULL, trees = NULL,
# min_n = NULL)
# 设定随机森林模型的计算引擎
# show_engines("rand_forest")
(iris_rf <- iris_rf %>%
set_engine("ranger"))
#> Random Forest Model Specification (classification)
#>
#> Main Arguments:
#> trees = 100
#>
#> Computational engine: ranger
# 查看 iris_rf 的类
iris_rf %>% class()
#> [1] "rand_forest" "model_spec"
# 查看 iris_rf 的底层结构
iris_rf %>% str()
#> List of 7
#> $ args :List of 3
#> ..$ mtry : language ~NULL
#> .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
#> ..$ trees: language ~100
#> .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
#> ..$ min_n: language ~NULL
#> .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
#> $ eng_args : Named list()
#> ..- attr(*, "class")= chr [1:2] "quosures" "list"
#> $ mode : chr "classification"
#> $ user_specified_mode : logi TRUE
#> $ method : NULL
#> $ engine : chr "ranger"
#> $ user_specified_engine: logi TRUE
#> - attr(*, "class")= chr [1:2] "rand_forest" "model_spec"
14 / 37

>> 3.2 模型拟合

# 拟合随机森林模型
(iris_fit <- iris_rf %>%
fit( # ?parsnip::fit.model_spec()
formula = Species ~ .,
data = iris_training
))
#> parsnip model object
#>
#> Ranger result
#>
#> Call:
#> ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~100, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
#>
#> Type: Probability estimation
#> Number of trees: 100
#> Sample size: 112
#> Number of independent variables: 3
#> Mtry: 1
#> Target node size: 10
#> Variable importance mode: none
#> Splitrule: gini
#> OOB prediction error (Brier s.): 0.0563
15 / 37

>> 3.2 模型拟合

# 拟合随机森林模型
(iris_fit <- iris_rf %>%
fit( # ?parsnip::fit.model_spec()
formula = Species ~ .,
data = iris_training
))
#> parsnip model object
#>
#> Ranger result
#>
#> Call:
#> ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~100, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
#>
#> Type: Probability estimation
#> Number of trees: 100
#> Sample size: 112
#> Number of independent variables: 3
#> Mtry: 1
#> Target node size: 10
#> Variable importance mode: none
#> Splitrule: gini
#> OOB prediction error (Brier s.): 0.0563
# 查看 iris_fit 的类
iris_fit %>% class()
#> [1] "_ranger" "model_fit"
# 查看 iris_fit 的底层结构
iris_fit %>% str(max.level = 1)
#> List of 6
#> $ lvl : chr [1:3] "setosa" "versicolor" "virginica"
#> $ spec :List of 8
#> ..- attr(*, "class")= chr [1:2] "rand_forest" "model_spec"
#> $ fit :List of 13
#> ..- attr(*, "class")= chr "ranger"
#> $ preproc :List of 4
#> $ elapsed :List of 1
#> $ censor_probs: list()
#> - attr(*, "class")= chr [1:2] "_ranger" "model_fit"
15 / 37

>> 3.3 模型预测

# 将拟合模型应用于测试组
iris_fit %>%
predict(
new_data = iris_testing
)
#> # A tibble: 38 × 1
#> .pred_class
#> <fct>
#> 1 setosa
#> 2 setosa
#> 3 setosa
#> # ℹ 35 more rows
# ? parsnip::predict.model_fit()
# predict(object, new_data,
# type = NULL,
# # "numeric", "class", "prob",
# # "conf_int", "pred_int",
# # "quantile", "time",
# # "hazard", "survival", "raw"
# opts = list(), ...)
16 / 37

>> 3.3 模型预测

# 将拟合模型应用于测试组
iris_fit %>%
predict(
new_data = iris_testing
)
#> # A tibble: 38 × 1
#> .pred_class
#> <fct>
#> 1 setosa
#> 2 setosa
#> 3 setosa
#> # ℹ 35 more rows
# ? parsnip::predict.model_fit()
# predict(object, new_data,
# type = NULL,
# # "numeric", "class", "prob",
# # "conf_int", "pred_int",
# # "quantile", "time",
# # "hazard", "survival", "raw"
# opts = list(), ...)
# 合并原数据集
iris_class <- iris_fit %>%
predict(
new_data = iris_testing
) %>%
bind_cols(iris_testing)
iris_class %>%
glimpse(width = 40)
#> Rows: 38
#> Columns: 5
#> $ .pred_class <fct> setosa, setosa, s…
#> $ Sepal.Length <dbl> -0.9241, -1.5176,…
#> $ Sepal.Width <dbl> 1.087, 0.847, 0.8…
#> $ Petal.Width <dbl> -1.359, -1.229, -…
#> $ Species <fct> setosa, setosa, s…
16 / 37

>> 4.1 模型表现:metrics()

iris_class %>% glimpse(width = 40)
#> Rows: 38
#> Columns: 5
#> $ .pred_class <fct> setosa, setosa, s…
#> $ Sepal.Length <dbl> -0.9241, -1.5176,…
#> $ Sepal.Width <dbl> 1.087, 0.847, 0.8…
#> $ Petal.Width <dbl> -1.359, -1.229, -…
#> $ Species <fct> setosa, setosa, s…
iris_class %>%
metrics( # 计算模型表现指标
truth = Species,
estimate = .pred_class
)
#> # A tibble: 2 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 accuracy multiclass 0.947
#> 2 kap multiclass 0.920
18 / 37

>> 4.1 模型表现:metrics()

iris_class %>% glimpse(width = 40)
#> Rows: 38
#> Columns: 5
#> $ .pred_class <fct> setosa, setosa, s…
#> $ Sepal.Length <dbl> -0.9241, -1.5176,…
#> $ Sepal.Width <dbl> 1.087, 0.847, 0.8…
#> $ Petal.Width <dbl> -1.359, -1.229, -…
#> $ Species <fct> setosa, setosa, s…
iris_class %>%
metrics( # 计算模型表现指标
truth = Species,
estimate = .pred_class
)
#> # A tibble: 2 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 accuracy multiclass 0.947
#> 2 kap multiclass 0.920
# 提取各分类的预测概率
iris_probs <- iris_fit %>%
predict(
new_data = iris_testing,
type = "prob"
) %>%
bind_cols(iris_class)
# 模型表现指标:基于分类预测概率
iris_probs %>%
metrics(
truth = Species,
estimate = .pred_class,
.pred_setosa:.pred_virginica
)
#> # A tibble: 4 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 accuracy multiclass 0.947
#> 2 kap multiclass 0.920
#> 3 mn_log_loss multiclass 0.199
#> 4 roc_auc hand_till 0.979
18 / 37

>> 4.2 模型表现:*_curve()

# 接收器操作特征曲线
iris_roc <- iris_probs %>%
roc_curve(
truth = Species,
.pred_setosa:.pred_virginica
)
# iris_roc -> a "roc_df"
iris_roc %>% autoplot()

19 / 37

>> 4.2 模型表现:*_curve()

# 接收器操作特征曲线
iris_roc <- iris_probs %>%
roc_curve(
truth = Species,
.pred_setosa:.pred_virginica
)
# iris_roc -> a "roc_df"
iris_roc %>% autoplot()

# 增益曲线
iris_gain <- iris_probs %>%
gain_curve(
truth = Species,
.pred_setosa:.pred_virginica
)
# iris_gain # a "gain_df"
iris_gain %>% autoplot()

19 / 37

5. 一个较复杂的案例

(Case Study)

20 / 37

>> 5.0 数据

modeldata::ames %>% str()
#> tibble [2,930 × 74] (S3: tbl_df/tbl/data.frame)
#> $ MS_SubClass : Factor w/ 16 levels "One_Story_1946_and_Newer_All_Styles",..: 1 1 1 1 6 6 12 12 12 6 ...
#> $ MS_Zoning : Factor w/ 7 levels "Floating_Village_Residential",..: 3 2 3 3 3 3 3 3 3 3 ...
#> $ Lot_Frontage : num [1:2930] 141 80 81 93 74 78 41 43 39 60 ...
#> $ Lot_Area : int [1:2930] 31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
#> $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
#> $ Alley : Factor w/ 3 levels "Gravel","No_Alley_Access",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ Lot_Shape : Factor w/ 4 levels "Regular","Slightly_Irregular",..: 2 1 2 1 2 2 1 2 2 1 ...
#> $ Land_Contour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 2 4 4 ...
#> $ Utilities : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ Lot_Config : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 1 1 5 5 5 5 5 5 ...
#> $ Land_Slope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
#> $ Neighborhood : Factor w/ 29 levels "North_Ames","College_Creek",..: 1 1 1 1 7 7 17 17 17 7 ...
#> $ Condition_1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 3 3 3 ...
#> $ Condition_2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
#> $ Bldg_Type : Factor w/ 5 levels "OneFam","TwoFmCon",..: 1 1 1 1 1 1 5 5 5 1 ...
#> $ House_Style : Factor w/ 8 levels "One_and_Half_Fin",..: 3 3 3 3 8 8 3 3 3 8 ...
#> $ Overall_Cond : Factor w/ 10 levels "Very_Poor","Poor",..: 5 6 6 5 5 6 5 5 5 5 ...
#> $ Year_Built : int [1:2930] 1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
#> $ Year_Remod_Add : int [1:2930] 1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...
#> $ Roof_Style : Factor w/ 6 levels "Flat","Gable",..: 4 2 4 4 2 2 2 2 2 2 ...
#> $ Roof_Matl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ Exterior_1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 4 14 15 4 14 14 6 7 6 14 ...
#> $ Exterior_2nd : Factor w/ 17 levels "AsbShng","AsphShn",..: 11 15 16 4 15 15 6 7 6 15 ...
#> $ Mas_Vnr_Type : Factor w/ 5 levels "BrkCmn","BrkFace",..: 5 4 2 4 4 2 4 4 4 4 ...
#> $ Mas_Vnr_Area : num [1:2930] 112 0 108 0 0 20 0 0 0 0 ...
#> $ Exter_Cond : Factor w/ 5 levels "Excellent","Fair",..: 5 5 5 5 5 5 5 5 5 5 ...
#> $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 2 2 3 3 3 3 3 3 ...
#> $ Bsmt_Cond : Factor w/ 6 levels "Excellent","Fair",..: 3 6 6 6 6 6 6 6 6 6 ...
#> $ Bsmt_Exposure : Factor w/ 5 levels "Av","Gd","Mn",..: 2 4 4 4 4 4 3 4 4 4 ...
#> $ BsmtFin_Type_1 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 2 6 1 1 3 3 3 1 3 7 ...
#> $ BsmtFin_SF_1 : num [1:2930] 2 6 1 1 3 3 3 1 3 7 ...
#> $ BsmtFin_Type_2 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 7 4 7 7 7 7 7 7 7 7 ...
#> $ BsmtFin_SF_2 : num [1:2930] 0 144 0 0 0 0 0 0 0 0 ...
#> $ Bsmt_Unf_SF : num [1:2930] 441 270 406 1045 137 ...
#> $ Total_Bsmt_SF : num [1:2930] 1080 882 1329 2110 928 ...
#> $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ Heating_QC : Factor w/ 5 levels "Excellent","Fair",..: 2 5 5 1 3 1 1 1 1 3 ...
#> $ Central_Air : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
#> $ Electrical : Factor w/ 6 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 5 5 ...
#> $ First_Flr_SF : int [1:2930] 1656 896 1329 2110 928 926 1338 1280 1616 1028 ...
#> $ Second_Flr_SF : int [1:2930] 0 0 0 0 701 678 0 0 0 776 ...
#> $ Gr_Liv_Area : int [1:2930] 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
#> $ Bsmt_Full_Bath : num [1:2930] 1 0 0 1 0 0 1 0 1 0 ...
#> $ Bsmt_Half_Bath : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
#> $ Full_Bath : int [1:2930] 1 1 1 2 2 2 2 2 2 2 ...
#> $ Half_Bath : int [1:2930] 0 0 1 1 1 1 0 0 0 1 ...
#> $ Bedroom_AbvGr : int [1:2930] 3 2 3 3 3 3 2 2 2 3 ...
#> $ Kitchen_AbvGr : int [1:2930] 1 1 1 1 1 1 1 1 1 1 ...
#> $ TotRms_AbvGrd : int [1:2930] 7 5 6 8 6 7 6 5 5 7 ...
#> $ Functional : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 8 8 8 8 ...
#> $ Fireplaces : int [1:2930] 2 0 0 2 1 1 0 0 1 1 ...
#> $ Garage_Type : Factor w/ 7 levels "Attchd","Basment",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ Garage_Finish : Factor w/ 4 levels "Fin","No_Garage",..: 1 4 4 1 1 1 1 3 3 1 ...
#> $ Garage_Cars : num [1:2930] 2 1 1 2 2 2 2 2 2 2 ...
#> $ Garage_Area : num [1:2930] 528 730 312 522 482 470 582 506 608 442 ...
#> $ Garage_Cond : Factor w/ 6 levels "Excellent","Fair",..: 6 6 6 6 6 6 6 6 6 6 ...
#> $ Paved_Drive : Factor w/ 3 levels "Dirt_Gravel",..: 2 3 3 3 3 3 3 3 3 3 ...
#> $ Wood_Deck_SF : int [1:2930] 210 140 393 0 212 360 0 0 237 140 ...
#> $ Open_Porch_SF : int [1:2930] 62 0 36 0 34 36 0 82 152 60 ...
#> $ Enclosed_Porch : int [1:2930] 0 0 0 0 0 0 170 0 0 0 ...
#> $ Three_season_porch: int [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
#> $ Screen_Porch : int [1:2930] 0 120 0 0 0 0 0 144 0 0 ...
#> $ Pool_Area : int [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
#> $ Pool_QC : Factor w/ 5 levels "Excellent","Fair",..: 4 4 4 4 4 4 4 4 4 4 ...
#> $ Fence : Factor w/ 5 levels "Good_Privacy",..: 5 3 5 5 3 5 5 5 5 5 ...
#> $ Misc_Feature : Factor w/ 6 levels "Elev","Gar2",..: 3 3 2 3 3 3 3 3 3 3 ...
#> $ Misc_Val : int [1:2930] 0 0 12500 0 0 0 0 0 0 0 ...
#> $ Mo_Sold : int [1:2930] 5 6 6 4 3 6 4 1 3 6 ...
#> $ Year_Sold : int [1:2930] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
#> $ Sale_Type : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 10 10 10 10 10 ...
#> $ Sale_Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...
#> $ Sale_Price : int [1:2930] 215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
#> $ Longitude : num [1:2930] -93.6 -93.6 -93.6 -93.6 -93.6 ...
#> $ Latitude : num [1:2930] 42.1 42.1 42.1 42.1 42.1 ...
21 / 37

>> 5.1 数据重采样 -> rsample

set.seed(1234)
data_split <-
initial_split(
ames, strata = "Sale_Price")
ames_train <- training(data_split)
ames_test <- testing(data_split)
nrow(ames_train) / nrow(ames)
#> [1] 0.75
22 / 37

>> 5.1 数据重采样 -> rsample

set.seed(1234)
data_split <-
initial_split(
ames, strata = "Sale_Price")
ames_train <- training(data_split)
ames_test <- testing(data_split)
nrow(ames_train) / nrow(ames)
#> [1] 0.75
# v折交叉验证
(cv_splits <- vfold_cv(ames_train))
#> # 10-fold cross-validation
#> # A tibble: 10 × 2
#> splits id
#> <list> <chr>
#> 1 <split [1977/220]> Fold01
#> 2 <split [1977/220]> Fold02
#> 3 <split [1977/220]> Fold03
#> # ℹ 7 more rows
cv_splits$splits[[1]]
#> <Analysis/Assess/Total>
#> <1977/220/2197>
cv_splits$splits[[1]] %>%
analysis() %>% dim()
#> [1] 1977 74

vignette("Common_Patterns", package = "rsample")

22 / 37

>> 5.2 数据预处理 -> recipes

# 1. 定义数据预处理菜单
ames_rec <-
recipe(Sale_Price ~ Bldg_Type + Neighborhood + Year_Built + Gr_Liv_Area +
Full_Bath + Year_Sold + Lot_Area + Central_Air + Longitude + Latitude,
data = ames_train) %>%
step_log(Sale_Price, base = 10) %>%
step_BoxCox(Lot_Area, Gr_Liv_Area) %>%
step_other(Neighborhood, threshold = 0.05) %>%
step_dummy(all_nominal()) %>%
step_interact(~ starts_with("Central_Air"):Year_Built) %>%
step_bs(Longitude, Latitude, options = list(df = 5))
ames_rec %>% tidy()
#> # A tibble: 6 × 6
#> number operation type trained skip id
#> <int> <chr> <chr> <lgl> <lgl> <chr>
#> 1 1 step log FALSE FALSE log_YNFHa
#> 2 2 step BoxCox FALSE FALSE BoxCox_vCuOq
#> 3 3 step other FALSE FALSE other_0nogs
#> # ℹ 3 more rows
23 / 37

>> 5.2 数据预处理 -> recipes

# 2. 准备数据预处理
cv_splits <- cv_splits %>%
mutate(
recipes = map(splits, prepper,
recipe = ames_rec)
)
cv_splits
#> # 10-fold cross-validation
#> # A tibble: 10 × 3
#> splits id recipes
#> <list> <chr> <list>
#> 1 <split [1977/220]> Fold01 <recipe>
#> 2 <split [1977/220]> Fold02 <recipe>
#> 3 <split [1977/220]> Fold03 <recipe>
#> # ℹ 7 more rows
# prepper
# function(split_obj, recipe, ...) {
# prep(recipe,
# training = rsample::analysis(split_obj),
# fresh = TRUE, ...)
# }
24 / 37

>> 5.2 数据预处理 -> recipes

# 2. 准备数据预处理
cv_splits <- cv_splits %>%
mutate(
recipes = map(splits, prepper,
recipe = ames_rec)
)
cv_splits
#> # 10-fold cross-validation
#> # A tibble: 10 × 3
#> splits id recipes
#> <list> <chr> <list>
#> 1 <split [1977/220]> Fold01 <recipe>
#> 2 <split [1977/220]> Fold02 <recipe>
#> 3 <split [1977/220]> Fold03 <recipe>
#> # ℹ 7 more rows
# prepper
# function(split_obj, recipe, ...) {
# prep(recipe,
# training = rsample::analysis(split_obj),
# fresh = TRUE, ...)
# }
# 相同的菜单 +
# 不同的训练数据集 ->
# 不同的准备结果
cv_splits$recipes %>%
map(\(x) tidy(x, number = 2)) %>%
list_rbind() %>%
filter(terms == "Lot_Area")
#> # A tibble: 10 × 3
#> terms value id
#> <chr> <dbl> <chr>
#> 1 Lot_Area 0.0836 BoxCox_vCuOq
#> 2 Lot_Area 0.0777 BoxCox_vCuOq
#> 3 Lot_Area 0.104 BoxCox_vCuOq
#> # ℹ 7 more rows
# ?recipes:::tidy.recipe
24 / 37

>> 5.3 拟合模型 -> parsnip

25 / 37

>> 5.3 拟合模型 -> parsnip

# 1. 设定模型
knn_mod <-
nearest_neighbor(
mode = "regression",
neighbors = 5) %>%
set_engine("kknn") # 默认,可不设置
# 2. 拟合模型
# 2.1 定义辅助函数
parsnip_fit <- function(rec, model) {
fit(model, Sale_Price ~ .,
data = bake(rec, new_data= NULL))
}
# 2.2 应用辅助函数
cv_splits <- cv_splits %>%
mutate(
knn = map(
recipes,
parsnip_fit,
model = knn_mod
)
)
25 / 37

>> 5.3 拟合模型 -> parsnip

# 1. 设定模型
knn_mod <-
nearest_neighbor(
mode = "regression",
neighbors = 5) %>%
set_engine("kknn") # 默认,可不设置
# 2. 拟合模型
# 2.1 定义辅助函数
parsnip_fit <- function(rec, model) {
fit(model, Sale_Price ~ .,
data = bake(rec, new_data= NULL))
}
# 2.2 应用辅助函数
cv_splits <- cv_splits %>%
mutate(
knn = map(
recipes,
parsnip_fit,
model = knn_mod
)
)
# 查看结果
cv_splits %>% select(-id)
#> # A tibble: 10 × 3
#> splits recipes knn
#> <list> <list> <list>
#> 1 <split [1977/220]> <recipe> <fit[+]>
#> 2 <split [1977/220]> <recipe> <fit[+]>
#> 3 <split [1977/220]> <recipe> <fit[+]>
#> # ℹ 7 more rows
cv_splits$knn[[1]] %>%
str(max.level = 1)
#> List of 6
#> $ lvl : NULL
#> $ spec :List of 7
#> ..- attr(*, "class")= chr [1:2] "nearest_neighbor" "model_spec"
#> $ fit :List of 10
#> ..- attr(*, "class")= chr [1:2] "train.kknn" "kknn"
#> $ preproc :List of 1
#> $ elapsed :List of 1
#> $ censor_probs: list()
#> - attr(*, "class")= chr [1:2] "_train.kknn" "model_fit"
25 / 37

>> 5.4 评估模型表现 -> parsnip 包 + yardstick

26 / 37

>> 5.4 评估模型表现 -> parsnip 包 + yardstick

# 定义模型表现函数
pf_metrics <- metric_set(rmse, rsq, ccc)
# 定义辅助函数
parsnip_metrics <-
function(split, recipe, model) {
raw <- assessment(split)
processed <- bake(recipe,
new_data = raw)
model %>%
predict(new_data = processed) %>%
bind_cols(processed) %>%
pf_metrics(Sale_Price, .pred)
}
# 应用辅助函数
cv_splits <- cv_splits %>%
mutate(
metrics = pmap(
list(
split = splits,
recipe = recipes,
model = knn
),
parsnip_metrics
)
)
26 / 37

>> 5.4 评估模型表现 -> parsnip 包 + yardstick

# 定义模型表现函数
pf_metrics <- metric_set(rmse, rsq, ccc)
# 定义辅助函数
parsnip_metrics <-
function(split, recipe, model) {
raw <- assessment(split)
processed <- bake(recipe,
new_data = raw)
model %>%
predict(new_data = processed) %>%
bind_cols(processed) %>%
pf_metrics(Sale_Price, .pred)
}
# 应用辅助函数
cv_splits <- cv_splits %>%
mutate(
metrics = pmap(
list(
split = splits,
recipe = recipes,
model = knn
),
parsnip_metrics
)
)
cv_splits$metrics[[1]]
#> # A tibble: 3 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 rmse standard 0.0839
#> 2 rsq standard 0.761
#> 3 ccc standard 0.871
# 汇总计算模型表现指标
cv_splits %>%
unnest(cols = metrics) %>%
group_by(.metric) %>%
summarise(
mean = mean(.estimate),
std_err = sd(.estimate)/sqrt(n())
)
#> # A tibble: 3 × 3
#> .metric mean std_err
#> <chr> <dbl> <dbl>
#> 1 ccc 0.869 0.00519
#> 2 rmse 0.0870 0.00245
#> 3 rsq 0.764 0.00889
26 / 37

>> 5.5 模型调参:手动模式


27 / 37

>> 5.5 模型调参:手动模式

# 定义辅助函数
m_metrics <- function(split, recipe, fit) {
holdout <- bake(
recipe,
new_data = assessment(split))
resample_name <- labels(split)$id
multi_predict(fit,
new_data = holdout,
neighbors = 1:50) %>%
bind_cols(.,
holdout %>% select(Sale_Price)) %>%
unnest(cols = .pred) %>%
group_by(neighbors) %>%
pf_metrics(Sale_Price, .pred) %>%
mutate(resample = resample_name)
}
# 应用辅助函数
cv_splits <- cv_splits %>%
mutate(
tuning_metrics =
pmap(
list(splits, recipes, knn),
m_metrics
)
)
28 / 37

>> 5.5 模型调参:手动模式

# 定义辅助函数
m_metrics <- function(split, recipe, fit) {
holdout <- bake(
recipe,
new_data = assessment(split))
resample_name <- labels(split)$id
multi_predict(fit,
new_data = holdout,
neighbors = 1:50) %>%
bind_cols(.,
holdout %>% select(Sale_Price)) %>%
unnest(cols = .pred) %>%
group_by(neighbors) %>%
pf_metrics(Sale_Price, .pred) %>%
mutate(resample = resample_name)
}
# 应用辅助函数
cv_splits <- cv_splits %>%
mutate(
tuning_metrics =
pmap(
list(splits, recipes, knn),
m_metrics
)
)
cv_splits$tuning_metrics[[1]]
#> # A tibble: 150 × 5
#> neighbors .metric .estimator .estimate resample
#> <int> <chr> <chr> <dbl> <chr>
#> 1 1 rmse standard 0.101 Fold01
#> 2 2 rmse standard 0.0951 Fold01
#> 3 3 rmse standard 0.0898 Fold01
#> # ℹ 147 more rows
# 汇总计算模型表现指标
(rs_rmse <- cv_splits %>%
unnest(tuning_metrics) %>%
filter(.metric == "rmse") %>%
group_by(neighbors) %>%
summarize(rmse = mean(.estimate)))
#> # A tibble: 50 × 2
#> neighbors rmse
#> <int> <dbl>
#> 1 1 0.104
#> 2 2 0.0972
#> 3 3 0.0921
#> # ℹ 47 more rows
28 / 37

>> 5.5 模型调参:手动模式

(min_rmse <- rs_rmse %>%
arrange(rmse) %>% slice(1))
#> # A tibble: 1 × 2
#> neighbors rmse
#> <int> <dbl>
#> 1 11 0.0837
ggplot(rs_rmse,
aes(x = neighbors, y = rmse)) +
geom_point() + geom_path() +
geom_point(data = min_rmse,
size = 3, color = "red")

29 / 37

>> 5.5 模型调参:手动模式

(min_rmse <- rs_rmse %>%
arrange(rmse) %>% slice(1))
#> # A tibble: 1 × 2
#> neighbors rmse
#> <int> <dbl>
#> 1 11 0.0837
ggplot(rs_rmse,
aes(x = neighbors, y = rmse)) +
geom_point() + geom_path() +
geom_point(data = min_rmse,
size = 3, color = "red")

# 最终模型
ames_rec_final <- prep(ames_rec)
knn_mod_final <-
update(knn_mod,
neighbors = 11) %>%
fit(Sale_Price ~ .,
data = bake(ames_rec_final,
new_data = NULL))
# 评估最终模型在测试集上的表现
parsnip_metrics(data_split,
ames_rec_final,
knn_mod_final)
#> # A tibble: 3 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 rmse standard 0.0754
#> 2 rsq standard 0.813
#> 3 ccc standard 0.894
29 / 37

>> 5.6 模型调参:workflows v1.1.3 + tune v1.1.2

# 重新定义模型:设定需调校的参数
knn_mod <-
nearest_neighbor(
mode = "regression",
neighbors = tune()) %>%
set_engine("kknn")
# 定义流程
ames_wf <-
workflow() %>%
add_model(knn_mod) %>%
add_recipe(ames_rec)
# 基于参数网格计算模型表现指标
ames_res <-
ames_wf %>%
tune_grid(
cv_splits %>% select(splits, id),
grid = tibble(neighbors = 1:50),
metrics = metric_set(rmse, rsq, ccc)
)
30 / 37

>> 5.6 模型调参:workflows v1.1.3 + tune v1.1.2

# 重新定义模型:设定需调校的参数
knn_mod <-
nearest_neighbor(
mode = "regression",
neighbors = tune()) %>%
set_engine("kknn")
# 定义流程
ames_wf <-
workflow() %>%
add_model(knn_mod) %>%
add_recipe(ames_rec)
# 基于参数网格计算模型表现指标
ames_res <-
ames_wf %>%
tune_grid(
cv_splits %>% select(splits, id),
grid = tibble(neighbors = 1:50),
metrics = metric_set(rmse, rsq, ccc)
)
ames_res
#> # Tuning results
#> # 10-fold cross-validation
#> # A tibble: 10 × 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [1977/220]> Fold01 <tibble [150 × 5]> <tibble [0 × 3]>
#> 2 <split [1977/220]> Fold02 <tibble [150 × 5]> <tibble [0 × 3]>
#> 3 <split [1977/220]> Fold03 <tibble [150 × 5]> <tibble [1 × 3]>
#> # ℹ 7 more rows
#>
#> There were issues with some computations:
#>
#> - Warning(s) x2: some 'x' values beyond boundary knots may cause ill-conditioned bases
#>
#> Run `show_notes(.Last.tune.result)` for more information.
ames_res %>% collect_metrics() %>%
glimpse(width = 40)
#> Rows: 150
#> Columns: 7
#> $ neighbors <int> 1, 1, 1, 2, 2, 2, 3…
#> $ .metric <chr> "ccc", "rmse", "rsq…
#> $ .estimator <chr> "standard", "standa…
#> $ mean <dbl> 0.8276, 0.1042, 0.6…
#> $ n <int> 10, 10, 10, 10, 10,…
#> $ std_err <dbl> 0.00943, 0.00302, 0…
#> $ .config <chr> "Preprocessor1_Mode…
30 / 37

>> 5.6 模型调参:workflows v1.1.3 + tune v1.1.2

ames_res %>%
collect_metrics() %>%
filter(.metric == "rmse") %>%
ggplot(aes(x = neighbors, y = mean)) +
geom_point() + geom_line() +
geom_point(size = 2, color = 'red',
data = . %>% slice_min(mean)) +
ylab("rmse")

31 / 37

>> 5.6 模型调参:workflows v1.1.3 + tune v1.1.2

(best_knn <-
ames_res %>% select_best("rmse"))
#> # A tibble: 1 × 2
#> neighbors .config
#> <int> <chr>
#> 1 11 Preprocessor1_Model11
ames_fit_test <- ames_wf %>%
finalize_workflow(best_knn) %>%
last_fit(
split = data_split,
metrics = metric_set(rmse, rsq, ccc)
)
ames_fit_test %>% glimpse()
#> Rows: 1
#> Columns: 6
#> $ splits <list> [<initial_split[2197 x 733 x 2930 x 74]>]
#> $ id <chr> "train/test split"
#> $ .metrics <list> [<tbl_df[3 x 4]>]
#> $ .notes <list> [<tbl_df[1 x 3]>]
#> $ .predictions <list> [<tbl_df[733 x 4]>]
#> $ .workflow <list> [Bldg_Type, Neighborhood, Year_Built, Gr_Liv_Area, Full_Bath, Year…
32 / 37

>> 5.6 模型调参:workflows v1.1.3 + tune v1.1.2

ames_fit_test %>%
collect_metrics()
#> # A tibble: 3 × 4
#> .metric .estimator .estimate .config
#> <chr> <chr> <dbl> <chr>
#> 1 rmse standard 0.0754 Preproce…
#> 2 rsq standard 0.813 Preproce…
#> 3 ccc standard 0.894 Preproce…
33 / 37

>> 5.6 模型调参:workflows v1.1.3 + tune v1.1.2

ames_fit_test %>%
collect_metrics()
#> # A tibble: 3 × 4
#> .metric .estimator .estimate .config
#> <chr> <chr> <dbl> <chr>
#> 1 rmse standard 0.0754 Preproce…
#> 2 rsq standard 0.813 Preproce…
#> 3 ccc standard 0.894 Preproce…
ames_fit_test %>%
collect_predictions() %>%
ggplot(aes(Sale_Price, .pred)) +
geom_abline(lty = 2, size = 1.5,
color = "gray50") +
geom_point(alpha = 0.5) +
coord_fixed()

33 / 37

>> tidymodels 家族!

34 / 37

>> tidymodels 家族!

34 / 37

课后作业

35 / 37
  1. 在 📑 qmd 中键入并运行本讲的代码(自觉完成,无需提交)

  2. 登录 {{tidymodels}} 的主页 🏠,浏览并学习相关资料

  3. tidymodels子包的网站 https://<子包名>.tidymodels.org/

  4. {{bilibili::简单的tidymodels}}

  5. tidymodels 团队撰写的书 📖,供感兴趣的同学学习精进:

36 / 37

2 / 37
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
oTile View: Overview of Slides
Esc Back to slideshow