class: center, middle, inverse, title-slide .title[ # 量化金融与金融编程 ] .subtitle[ ## L09
tidymodels
suite ] .author[ ###
曾永艺 ] .institute[ ### 厦门大学管理学院 ] .date[ ###
2023-12-08 ] --- class: center, middle
<img src="imgs/tidyverse.png" width="85%" style="display: block; margin: auto;" /> --- ### >> 为什么要在
中进行机器学习分析
-- <br> .pull-left.font150.bold[ 💪 R 和 R 包大多是由统计学家和数据分析专家开发的 💪 R 有着异常丰富且紧跟前沿的 [.red[{{统计建模和机器学习包}}]](https://cran.r-project.org/web/views/) 💪 R 容易接口或调用其他应用软件(如 C++、python、Spark、TensorFlow 等) 💪 有大量学习参考资料 ] -- .pull-right[ <img src="imgs/MLwR.png" width="80%" style="display: block; margin: auto;" /> ] --- ### >> 在
中进行机器学习分析面临哪些问题
-- <br> .pull-left.font150.bold[ 😰 作为解析型的数据分析语言,R 的运行效率低于 C、C++、JAVA 等编译语言 😰 R 操作数据的规模往往受限于计算机内存的大小 .red[😰 各式 R 包的接口并不统一] ] -- .pull-right.code80[ |Function |Code | |:--------------|:--------------------------------------------| |`MASS::lda` |`predict(obj)` | |`stats::glm` |`predict(obj, type = "response")` | |`gbm::gbm` |`predict(obj, type = "response", n.trees)` | |`rpart::rpart` |`predict(obj, type = "prob")` | |`RWeka::Weka` |`predict(obj, type = "probability")` | ] -- <br> .right.bold.font120[ [`caret`](https://topepo.github.io/caret/) -> .font150[👉 .red[[`tidymodels`!](https://www.tidymodels.org/)]]<br> [`mlr`](https://mlr.mlr-org.com/) -> [`mlr3`!](https://mlr3.mlr-org.com/) ] --- layout: false <img src="imgs/ds.png" width="40%" style="display: block; margin: auto auto auto 0;" /> <hr> <img src="imgs/tidymodels-flow.svg" width="100%" style="display: block; margin: auto;" /> --- class: inverse, center, middle, hide_logo background-image: url(imgs/logo-rsample.svg) background-size: 4em background-position: 12% 45% # 1. 数据重采样 -> `rsample` .font60[<sup>v1.2.0</sup>] .font150[(_General Resampling Infrastructure_)] --- layout: true ### >> 数据重采样 -> `rsample` 包 --- .code90[ ```r # 先安装并加载 tidymodels 元包(meta package) # install.package("tidymodels") library(tidymodels) # v1.0.0 @ 2022-07-13 ``` ``` #> ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ── ``` ``` #> ✔ broom 1.0.5 ✔ recipes 1.0.8 #> ✔ dials 1.2.0 ✔ rsample 1.2.0 #> ✔ dplyr 1.1.4 ✔ tibble 3.2.1 #> ✔ ggplot2 3.4.4 ✔ tidyr 1.3.0 #> ✔ infer 1.0.5 ✔ tune 1.1.2 #> ✔ modeldata 1.2.0 ✔ workflows 1.1.3 #> ✔ parsnip 1.1.1 ✔ workflowsets 1.0.1 #> ✔ purrr 1.0.2 ✔ yardstick 1.2.0 ``` ``` #> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ── #> ✖ purrr::discard() masks scales::discard() #> ✖ dplyr::filter() masks stats::filter() #> ✖ dplyr::lag() masks stats::lag() #> ✖ recipes::step() masks stats::step() #> • Use suppressPackageStartupMessages() to eliminate package startup messages ``` ] --- .pull-left.code95[ ```r # 设定随机数种子 set.seed(1234) # 随机抽样分成训练组和试验组 iris_split <- initial_split(iris) # initial_split(data, prop = 3/4, # strata = NULL, # breaks = 4, # pool = 0.1, ...) ``` ```r # 查看分组情况 iris_split ``` ``` #> <Training/Testing/Total> #> <112/38/150> ``` ```r # 查看iris_split的类 iris_split %>% class() ``` ``` #> [1] "initial_split" "mc_split" #> [3] "rsplit" ``` ] -- .pull-right.code85[ ```r # 查看 rsplit 类的底层结构 iris_split %>% str(vec.len = 1.5) ``` ``` *#> List of 4 *#> $ data :'data.frame': 150 obs. of 5 variables: #> ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 ... #> ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 ... #> ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 ... #> ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 ... #> ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 ... *#> $ in_id : int [1:112] 28 80 101 111 ... #> $ out_id: logi NA #> $ id : tibble [1 × 1] (S3: tbl_df/tbl/data.frame) #> ..$ id: chr "Resample1" #> - attr(*, "class")= chr [1:3] "initial_split" "mc_split" ... ``` ```r # 提取训练组样本(Q:测试组样本用哪个函数?) iris_split %>% training() %>% glimpse(width = 40) ``` ``` #> Rows: 112 #> Columns: 5 #> $ Sepal.Length <dbl> 5.2, 5.7, 6.3, 6.… #> $ Sepal.Width <dbl> 3.5, 2.6, 3.3, 3.… #> $ Petal.Length <dbl> 1.5, 3.5, 6.0, 5.… #> $ Petal.Width <dbl> 0.2, 1.0, 2.5, 2.… #> $ Species <fct> setosa, versicolo… ``` ] --- layout: false class: inverse, center, middle, hide_logo background-image: url(imgs/logo-recipes.svg) background-size: 4em background-position: 12% 45% # 2. 数据预处理 -> `recipes` .font60[<sup>v1.0.8</sup>] .font120[(_Preprocessing and Feature Engineering Steps for Modeling_)] --- ### >> 2.1 定义菜单 -- .pull-left.code85[ ```r iris_recipe <- recipe(Species ~ ., data = iris) %>% step_corr(all_predictors()) %>% step_center(all_predictors()) %>% step_scale(all_predictors()) iris_recipe ``` .code80[ ``` #> ── Recipe ──────────────────────────────── #> #> ── Inputs #> Number of variables by role #> outcome: 1 #> predictor: 4 #> #> ── Operations #> • Correlation filter on: all_predictors() #> • Centering for: all_predictors() #> • Scaling for: all_predictors() ``` ] ] -- .pull-right.font100[ .bold.info[复杂的recipe类,`View()`它!] .bold.info[`recipes` 包中共计有 98 个数据预处理步骤函数,示例如下:] .code80[ ``` [1] "step_arrange" "..." [3] "step_corr" "..." [5] "step_YeoJohnson" "step_zv" ``` ] .bold.info[你可以用以下五种方式(及其组合)在 `step_*()` 中指定拟作用的变量:] - 直接使用变量名枚举相关变量 - 以 `dplyr` 包的方式,如 `starts_with()` - 以 `all_outcomes()`、`all_predictors()` 或 `has_role()` 指定角色的方式 - 以 `all_numeric()`、`all_nominal()` 或 `has_type()` 指定类型的方式 - 以指定类型+角色复合的方式,如 `all_numeric_predictors()` `?recipes::selections` ] --- ### >> 2.2 准备菜单 -- .pull-left.code90[ ```r iris_recipe <- iris_recipe %>% prep(training(iris_split), verbose = TRUE) ``` ``` #> oper 1 step corr [training] #> oper 2 step center [training] #> oper 3 step scale [training] #> The retained training set is ~ 0 Mb in memory. ``` ```r # prep(x, # training = NULL, # fresh = FALSE, # verbose = FALSE, # retain = TRUE, # log_change = FALSE, # strings_as_factors = TRUE, # ...) ``` ] -- .pull-right.code90[ ```r iris_recipe ``` .code75[ ``` #> ── Recipe ─────────────────────────────────────────────────────── #> #> ── Inputs #> Number of variables by role #> outcome: 1 #> predictor: 4 #> #> ── Training information #> Training data contained 112 data points and no incomplete rows. #> #> ── Operations #> • Correlation filter on: Petal.Length | Trained #> • Centering for: Sepal.Length, Sepal.Width, Petal.Width | Trained #> • Scaling for: Sepal.Length, Sepal.Width, Petal.Width | Trained ``` ] ] --- ### >> 2.3 应用菜单 -- .pull-left.code90[ ```r # 训练组 # 可直接提取准备阶段 retain 的数据 iris_training <- iris_recipe %>% * bake(new_data = NULL) # bake(object, # new_data, # ..., # composition = "tibble") iris_training ``` ``` #> # A tibble: 112 × 4 #> Sepal.Length Sepal.Width Petal.Width Species #> <dbl> <dbl> <dbl> <fct> #> 1 -0.805 1.09 -1.36 setosa #> 2 -0.212 -1.07 -0.316 versicolor #> 3 0.500 0.607 1.64 virginica #> # ℹ 109 more rows ``` ] -- .pull-right.code85[ ```r # 测试组 # 用 bake() 进行预处理 iris_testing <- iris_recipe %>% * bake(new_data = testing(iris_split)) iris_testing ``` ``` #> # A tibble: 38 × 4 #> Sepal.Length Sepal.Width Petal.Width Species #> <dbl> <dbl> <dbl> <fct> #> 1 -0.924 1.09 -1.36 setosa #> 2 -1.52 0.847 -1.23 setosa #> 3 -1.28 0.847 -1.36 setosa #> # ℹ 35 more rows ``` ] --- layout: false class: inverse, center, middle, hide_logo background-image: url(imgs/logo-parsnip.svg) background-size: 4em background-position: 12% 45% # 3. 模型训练 -> `parsnip` .font60[<sup>v1.1.1</sup>] .font130[(_A Common API to Modeling and Analysis Functions_)] --- layout: true ### >> 3.1 模型设定 --- -- .pull-left.code90[ ```r # 随机森林模型的统一 API iris_rf <- rand_forest( mode = "classification", trees = 100 ) # rand_forest( # mode = "unknown", # engine = "ranger", # mtry = NULL, trees = NULL, # min_n = NULL) # 设定随机森林模型的计算引擎 # show_engines("rand_forest") (iris_rf <- iris_rf %>% set_engine("ranger")) ``` ``` #> Random Forest Model Specification (classification) #> #> Main Arguments: #> trees = 100 #> #> Computational engine: ranger ``` ] -- .pull-right.code85[ ```r # 查看 iris_rf 的类 iris_rf %>% class() ``` ``` #> [1] "rand_forest" "model_spec" ``` ```r # 查看 iris_rf 的底层结构 iris_rf %>% str() ``` ``` #> List of 7 #> $ args :List of 3 #> ..$ mtry : language ~NULL #> .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> #> ..$ trees: language ~100 #> .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> #> ..$ min_n: language ~NULL #> .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> #> $ eng_args : Named list() #> ..- attr(*, "class")= chr [1:2] "quosures" "list" #> $ mode : chr "classification" #> $ user_specified_mode : logi TRUE #> $ method : NULL #> $ engine : chr "ranger" #> $ user_specified_engine: logi TRUE #> - attr(*, "class")= chr [1:2] "rand_forest" "model_spec" ``` ] --- layout: true ### >> 3.2 模型拟合 --- .pull-left.code85[ ```r # 拟合随机森林模型 (iris_fit <- iris_rf %>% fit( # ?parsnip::fit.model_spec() * formula = Species ~ ., data = iris_training )) ``` ``` #> parsnip model object #> #> Ranger result #> #> Call: #> ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~100, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) #> #> Type: Probability estimation #> Number of trees: 100 #> Sample size: 112 #> Number of independent variables: 3 #> Mtry: 1 #> Target node size: 10 #> Variable importance mode: none #> Splitrule: gini #> OOB prediction error (Brier s.): 0.0563 ``` ] -- .pull-right.code90[ ```r # 查看 iris_fit 的类 iris_fit %>% class() ``` ``` #> [1] "_ranger" "model_fit" ``` ```r # 查看 iris_fit 的底层结构 iris_fit %>% str(max.level = 1) ``` ``` #> List of 6 #> $ lvl : chr [1:3] "setosa" "versicolor" "virginica" *#> $ spec :List of 8 #> ..- attr(*, "class")= chr [1:2] "rand_forest" "model_spec" *#> $ fit :List of 13 #> ..- attr(*, "class")= chr "ranger" #> $ preproc :List of 4 #> $ elapsed :List of 1 #> $ censor_probs: list() #> - attr(*, "class")= chr [1:2] "_ranger" "model_fit" ``` ] --- layout: true ### >> 3.3 模型预测 --- .pull-left.code90[ ```r # 将拟合模型应用于测试组 iris_fit %>% predict( new_data = iris_testing ) ``` ``` #> # A tibble: 38 × 1 #> .pred_class #> <fct> #> 1 setosa #> 2 setosa #> 3 setosa #> # ℹ 35 more rows ``` ```r # ? parsnip::predict.model_fit() # predict(object, new_data, # type = NULL, # # "numeric", "class", "prob", # # "conf_int", "pred_int", # # "quantile", "time", # # "hazard", "survival", "raw" # opts = list(), ...) ``` ] -- .pull-right.code95[ ```r # 合并原数据集 iris_class <- iris_fit %>% predict( new_data = iris_testing ) %>% * bind_cols(iris_testing) iris_class %>% glimpse(width = 40) ``` ``` #> Rows: 38 #> Columns: 5 #> $ .pred_class <fct> setosa, setosa, s… #> $ Sepal.Length <dbl> -0.9241, -1.5176,… #> $ Sepal.Width <dbl> 1.087, 0.847, 0.8… #> $ Petal.Width <dbl> -1.359, -1.229, -… #> $ Species <fct> setosa, setosa, s… ``` ] --- layout: false class: inverse, center, middle, hide_logo background-image: url(imgs/logo-yardstick.svg) background-size: 4em background-position: 12% 45% # 4. 模型表现 -> `yardstick` .font60[<sup>v1.2.0</sup>] .font130[(_Tidy Characterizations of Model Performance_)] --- layout: true ### >> 4.1 模型表现:`metrics()` --- .pull-left.code90[ ```r iris_class %>% glimpse(width = 40) ``` ``` #> Rows: 38 #> Columns: 5 #> $ .pred_class <fct> setosa, setosa, s… #> $ Sepal.Length <dbl> -0.9241, -1.5176,… #> $ Sepal.Width <dbl> 1.087, 0.847, 0.8… #> $ Petal.Width <dbl> -1.359, -1.229, -… #> $ Species <fct> setosa, setosa, s… ``` ```r iris_class %>% * metrics( # 计算模型表现指标 * truth = Species, * estimate = .pred_class * ) ``` ``` #> # A tibble: 2 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 accuracy multiclass 0.947 #> 2 kap multiclass 0.920 ``` ] -- .pull-right.code85[ ```r # 提取各分类的预测概率 iris_probs <- iris_fit %>% predict( new_data = iris_testing, * type = "prob" ) %>% bind_cols(iris_class) # 模型表现指标:基于分类预测概率 iris_probs %>% metrics( truth = Species, estimate = .pred_class, * .pred_setosa:.pred_virginica ) ``` ``` #> # A tibble: 4 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 accuracy multiclass 0.947 #> 2 kap multiclass 0.920 *#> 3 mn_log_loss multiclass 0.199 *#> 4 roc_auc hand_till 0.979 ``` ] --- layout: true ### >> 4.2 模型表现:`*_curve()` --- .pull-left.code90[ ```r # 接收器操作特征曲线 iris_roc <- iris_probs %>% roc_curve( truth = Species, .pred_setosa:.pred_virginica ) # iris_roc -> a "roc_df" iris_roc %>% autoplot() ``` <img src="L09_tidymodels_files/figure-html/unnamed-chunk-27-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right.code90[ ```r # 增益曲线 iris_gain <- iris_probs %>% * gain_curve( truth = Species, .pred_setosa:.pred_virginica ) # iris_gain # a "gain_df" iris_gain %>% autoplot() ``` <img src="L09_tidymodels_files/figure-html/unnamed-chunk-28-1.png" width="100%" style="display: block; margin: auto;" /> ] --- layout: false class: inverse, center, middle # 5. 一个较复杂的案例 .font150[(_Case Study_)] --- ### >> 5.0 数据 .code75[ ```r modeldata::ames %>% str() ``` ``` #> tibble [2,930 × 74] (S3: tbl_df/tbl/data.frame) #> $ MS_SubClass : Factor w/ 16 levels "One_Story_1946_and_Newer_All_Styles",..: 1 1 1 1 6 6 12 12 12 6 ... #> $ MS_Zoning : Factor w/ 7 levels "Floating_Village_Residential",..: 3 2 3 3 3 3 3 3 3 3 ... #> $ Lot_Frontage : num [1:2930] 141 80 81 93 74 78 41 43 39 60 ... #> $ Lot_Area : int [1:2930] 31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ... #> $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ... #> $ Alley : Factor w/ 3 levels "Gravel","No_Alley_Access",..: 2 2 2 2 2 2 2 2 2 2 ... #> $ Lot_Shape : Factor w/ 4 levels "Regular","Slightly_Irregular",..: 2 1 2 1 2 2 1 2 2 1 ... #> $ Land_Contour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 2 4 4 ... #> $ Utilities : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ... #> $ Lot_Config : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 1 1 5 5 5 5 5 5 ... #> $ Land_Slope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ... #> $ Neighborhood : Factor w/ 29 levels "North_Ames","College_Creek",..: 1 1 1 1 7 7 17 17 17 7 ... #> $ Condition_1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 3 3 3 ... #> $ Condition_2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ... #> $ Bldg_Type : Factor w/ 5 levels "OneFam","TwoFmCon",..: 1 1 1 1 1 1 5 5 5 1 ... #> $ House_Style : Factor w/ 8 levels "One_and_Half_Fin",..: 3 3 3 3 8 8 3 3 3 8 ... #> $ Overall_Cond : Factor w/ 10 levels "Very_Poor","Poor",..: 5 6 6 5 5 6 5 5 5 5 ... #> $ Year_Built : int [1:2930] 1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ... #> $ Year_Remod_Add : int [1:2930] 1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ... #> $ Roof_Style : Factor w/ 6 levels "Flat","Gable",..: 4 2 4 4 2 2 2 2 2 2 ... #> $ Roof_Matl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ... #> $ Exterior_1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 4 14 15 4 14 14 6 7 6 14 ... #> $ Exterior_2nd : Factor w/ 17 levels "AsbShng","AsphShn",..: 11 15 16 4 15 15 6 7 6 15 ... #> $ Mas_Vnr_Type : Factor w/ 5 levels "BrkCmn","BrkFace",..: 5 4 2 4 4 2 4 4 4 4 ... #> $ Mas_Vnr_Area : num [1:2930] 112 0 108 0 0 20 0 0 0 0 ... #> $ Exter_Cond : Factor w/ 5 levels "Excellent","Fair",..: 5 5 5 5 5 5 5 5 5 5 ... #> $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 2 2 3 3 3 3 3 3 ... #> $ Bsmt_Cond : Factor w/ 6 levels "Excellent","Fair",..: 3 6 6 6 6 6 6 6 6 6 ... #> $ Bsmt_Exposure : Factor w/ 5 levels "Av","Gd","Mn",..: 2 4 4 4 4 4 3 4 4 4 ... #> $ BsmtFin_Type_1 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 2 6 1 1 3 3 3 1 3 7 ... #> $ BsmtFin_SF_1 : num [1:2930] 2 6 1 1 3 3 3 1 3 7 ... #> $ BsmtFin_Type_2 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 7 4 7 7 7 7 7 7 7 7 ... #> $ BsmtFin_SF_2 : num [1:2930] 0 144 0 0 0 0 0 0 0 0 ... #> $ Bsmt_Unf_SF : num [1:2930] 441 270 406 1045 137 ... #> $ Total_Bsmt_SF : num [1:2930] 1080 882 1329 2110 928 ... #> $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ... #> $ Heating_QC : Factor w/ 5 levels "Excellent","Fair",..: 2 5 5 1 3 1 1 1 1 3 ... #> $ Central_Air : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ... #> $ Electrical : Factor w/ 6 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 5 5 ... #> $ First_Flr_SF : int [1:2930] 1656 896 1329 2110 928 926 1338 1280 1616 1028 ... #> $ Second_Flr_SF : int [1:2930] 0 0 0 0 701 678 0 0 0 776 ... #> $ Gr_Liv_Area : int [1:2930] 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ... #> $ Bsmt_Full_Bath : num [1:2930] 1 0 0 1 0 0 1 0 1 0 ... #> $ Bsmt_Half_Bath : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ... #> $ Full_Bath : int [1:2930] 1 1 1 2 2 2 2 2 2 2 ... #> $ Half_Bath : int [1:2930] 0 0 1 1 1 1 0 0 0 1 ... #> $ Bedroom_AbvGr : int [1:2930] 3 2 3 3 3 3 2 2 2 3 ... #> $ Kitchen_AbvGr : int [1:2930] 1 1 1 1 1 1 1 1 1 1 ... #> $ TotRms_AbvGrd : int [1:2930] 7 5 6 8 6 7 6 5 5 7 ... #> $ Functional : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 8 8 8 8 ... #> $ Fireplaces : int [1:2930] 2 0 0 2 1 1 0 0 1 1 ... #> $ Garage_Type : Factor w/ 7 levels "Attchd","Basment",..: 1 1 1 1 1 1 1 1 1 1 ... #> $ Garage_Finish : Factor w/ 4 levels "Fin","No_Garage",..: 1 4 4 1 1 1 1 3 3 1 ... #> $ Garage_Cars : num [1:2930] 2 1 1 2 2 2 2 2 2 2 ... #> $ Garage_Area : num [1:2930] 528 730 312 522 482 470 582 506 608 442 ... #> $ Garage_Cond : Factor w/ 6 levels "Excellent","Fair",..: 6 6 6 6 6 6 6 6 6 6 ... #> $ Paved_Drive : Factor w/ 3 levels "Dirt_Gravel",..: 2 3 3 3 3 3 3 3 3 3 ... #> $ Wood_Deck_SF : int [1:2930] 210 140 393 0 212 360 0 0 237 140 ... #> $ Open_Porch_SF : int [1:2930] 62 0 36 0 34 36 0 82 152 60 ... #> $ Enclosed_Porch : int [1:2930] 0 0 0 0 0 0 170 0 0 0 ... #> $ Three_season_porch: int [1:2930] 0 0 0 0 0 0 0 0 0 0 ... #> $ Screen_Porch : int [1:2930] 0 120 0 0 0 0 0 144 0 0 ... #> $ Pool_Area : int [1:2930] 0 0 0 0 0 0 0 0 0 0 ... #> $ Pool_QC : Factor w/ 5 levels "Excellent","Fair",..: 4 4 4 4 4 4 4 4 4 4 ... #> $ Fence : Factor w/ 5 levels "Good_Privacy",..: 5 3 5 5 3 5 5 5 5 5 ... #> $ Misc_Feature : Factor w/ 6 levels "Elev","Gar2",..: 3 3 2 3 3 3 3 3 3 3 ... #> $ Misc_Val : int [1:2930] 0 0 12500 0 0 0 0 0 0 0 ... #> $ Mo_Sold : int [1:2930] 5 6 6 4 3 6 4 1 3 6 ... #> $ Year_Sold : int [1:2930] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ... #> $ Sale_Type : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 10 10 10 10 10 ... #> $ Sale_Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ... #> $ Sale_Price : int [1:2930] 215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ... #> $ Longitude : num [1:2930] -93.6 -93.6 -93.6 -93.6 -93.6 ... #> $ Latitude : num [1:2930] 42.1 42.1 42.1 42.1 42.1 ... ``` ] --- ### >> 5.1 数据重采样 -> `rsample` 包 .pull-left.code90[ <img src="imgs/resampling.svg" width="100%" style="display: block; margin: auto;" /> ```r set.seed(1234) data_split <- initial_split( ames, strata = "Sale_Price") ames_train <- training(data_split) ames_test <- testing(data_split) nrow(ames_train) / nrow(ames) ``` ``` #> [1] 0.75 ``` ] -- .pull-right.code85[ ```r # v折交叉验证 (cv_splits <- vfold_cv(ames_train)) ``` ``` #> # 10-fold cross-validation #> # A tibble: 10 × 2 #> splits id #> <list> <chr> #> 1 <split [1977/220]> Fold01 #> 2 <split [1977/220]> Fold02 #> 3 <split [1977/220]> Fold03 #> # ℹ 7 more rows ``` ```r cv_splits$splits[[1]] ``` ``` #> <Analysis/Assess/Total> #> <1977/220/2197> ``` ```r cv_splits$splits[[1]] %>% analysis() %>% dim() ``` ``` #> [1] 1977 74 ``` `vignette("Common_Patterns", package = "rsample")` ] --- layout: true ### >> 5.2 数据预处理 -> `recipes` 包 --- ```r # 1. 定义数据预处理菜单 ames_rec <- recipe(Sale_Price ~ Bldg_Type + Neighborhood + Year_Built + Gr_Liv_Area + Full_Bath + Year_Sold + Lot_Area + Central_Air + Longitude + Latitude, data = ames_train) %>% step_log(Sale_Price, base = 10) %>% step_BoxCox(Lot_Area, Gr_Liv_Area) %>% step_other(Neighborhood, threshold = 0.05) %>% step_dummy(all_nominal()) %>% step_interact(~ starts_with("Central_Air"):Year_Built) %>% step_bs(Longitude, Latitude, options = list(df = 5)) ames_rec %>% tidy() ``` ``` #> # A tibble: 6 × 6 #> number operation type trained skip id #> <int> <chr> <chr> <lgl> <lgl> <chr> #> 1 1 step log FALSE FALSE log_YNFHa #> 2 2 step BoxCox FALSE FALSE BoxCox_vCuOq #> 3 3 step other FALSE FALSE other_0nogs #> # ℹ 3 more rows ``` --- .pull-left.code85[ ```r # 2. 准备数据预处理 cv_splits <- cv_splits %>% mutate( recipes = map(splits, prepper, recipe = ames_rec) ) cv_splits ``` ``` #> # 10-fold cross-validation #> # A tibble: 10 × 3 #> splits id recipes #> <list> <chr> <list> #> 1 <split [1977/220]> Fold01 <recipe> #> 2 <split [1977/220]> Fold02 <recipe> #> 3 <split [1977/220]> Fold03 <recipe> #> # ℹ 7 more rows ``` ```r # prepper # function(split_obj, recipe, ...) { # prep(recipe, # training = rsample::analysis(split_obj), # fresh = TRUE, ...) # } ``` ] -- .pull-right.code90[ ```r # 相同的菜单 + # 不同的训练数据集 -> # 不同的准备结果 cv_splits$recipes %>% map(\(x) tidy(x, number = 2)) %>% list_rbind() %>% filter(terms == "Lot_Area") ``` ``` #> # A tibble: 10 × 3 #> terms value id #> <chr> <dbl> <chr> #> 1 Lot_Area 0.0836 BoxCox_vCuOq #> 2 Lot_Area 0.0777 BoxCox_vCuOq #> 3 Lot_Area 0.104 BoxCox_vCuOq #> # ℹ 7 more rows ``` ```r # ?recipes:::tidy.recipe ``` ] --- layout: true ### >> 5.3 拟合模型 -> `parsnip` 包 --- -- .pull-left.code85[ ```r # 1. 设定模型 knn_mod <- nearest_neighbor( mode = "regression", neighbors = 5) %>% set_engine("kknn") # 默认,可不设置 # 2. 拟合模型 # 2.1 定义辅助函数 parsnip_fit <- function(rec, model) { fit(model, Sale_Price ~ ., data = bake(rec, new_data= NULL)) } # 2.2 应用辅助函数 cv_splits <- cv_splits %>% mutate( knn = map( recipes, parsnip_fit, model = knn_mod ) ) ``` ] -- .pull-right.code85[ ```r # 查看结果 cv_splits %>% select(-id) ``` ``` #> # A tibble: 10 × 3 #> splits recipes knn #> <list> <list> <list> #> 1 <split [1977/220]> <recipe> <fit[+]> #> 2 <split [1977/220]> <recipe> <fit[+]> #> 3 <split [1977/220]> <recipe> <fit[+]> #> # ℹ 7 more rows ``` ```r cv_splits$knn[[1]] %>% str(max.level = 1) ``` ``` #> List of 6 #> $ lvl : NULL #> $ spec :List of 7 #> ..- attr(*, "class")= chr [1:2] "nearest_neighbor" "model_spec" #> $ fit :List of 10 #> ..- attr(*, "class")= chr [1:2] "train.kknn" "kknn" #> $ preproc :List of 1 #> $ elapsed :List of 1 #> $ censor_probs: list() #> - attr(*, "class")= chr [1:2] "_train.kknn" "model_fit" ``` ] --- layout: true ### >> 5.4 评估模型表现 -> `parsnip` 包 + `yardstick` 包 --- -- .pull-left.code80[ ```r # 定义模型表现函数 pf_metrics <- metric_set(rmse, rsq, ccc) # 定义辅助函数 parsnip_metrics <- function(split, recipe, model) { raw <- assessment(split) processed <- bake(recipe, new_data = raw) model %>% predict(new_data = processed) %>% bind_cols(processed) %>% pf_metrics(Sale_Price, .pred) } # 应用辅助函数 cv_splits <- cv_splits %>% mutate( metrics = pmap( list( split = splits, recipe = recipes, model = knn ), parsnip_metrics ) ) ``` ] -- .pull-right.code80[ ```r cv_splits$metrics[[1]] ``` ``` #> # A tibble: 3 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 rmse standard 0.0839 #> 2 rsq standard 0.761 #> 3 ccc standard 0.871 ``` ```r # 汇总计算模型表现指标 cv_splits %>% unnest(cols = metrics) %>% group_by(.metric) %>% summarise( mean = mean(.estimate), std_err = sd(.estimate)/sqrt(n()) ) ``` ``` #> # A tibble: 3 × 3 #> .metric mean std_err #> <chr> <dbl> <dbl> #> 1 ccc 0.869 0.00519 #> 2 rmse 0.0870 0.00245 #> 3 rsq 0.764 0.00889 ``` ] --- layout: true ### >> 5.5 模型调参:手动模式 --- <br> <img src="imgs/TrainAlgo.svg" width="100%" style="display: block; margin: auto;" /> --- .pull-left.code80[ ```r # 定义辅助函数 m_metrics <- function(split, recipe, fit) { holdout <- bake( recipe, new_data = assessment(split)) resample_name <- labels(split)$id multi_predict(fit, new_data = holdout, neighbors = 1:50) %>% bind_cols(., holdout %>% select(Sale_Price)) %>% unnest(cols = .pred) %>% group_by(neighbors) %>% pf_metrics(Sale_Price, .pred) %>% mutate(resample = resample_name) } # 应用辅助函数 cv_splits <- cv_splits %>% mutate( tuning_metrics = pmap( list(splits, recipes, knn), m_metrics ) ) ``` ] -- .pull-right.code85[ ```r cv_splits$tuning_metrics[[1]] ``` ``` #> # A tibble: 150 × 5 #> neighbors .metric .estimator .estimate resample #> <int> <chr> <chr> <dbl> <chr> #> 1 1 rmse standard 0.101 Fold01 #> 2 2 rmse standard 0.0951 Fold01 #> 3 3 rmse standard 0.0898 Fold01 #> # ℹ 147 more rows ``` ```r # 汇总计算模型表现指标 (rs_rmse <- cv_splits %>% unnest(tuning_metrics) %>% filter(.metric == "rmse") %>% group_by(neighbors) %>% summarize(rmse = mean(.estimate))) ``` ``` #> # A tibble: 50 × 2 #> neighbors rmse #> <int> <dbl> #> 1 1 0.104 #> 2 2 0.0972 #> 3 3 0.0921 #> # ℹ 47 more rows ``` ] --- .pull-left.code85[ ```r (min_rmse <- rs_rmse %>% arrange(rmse) %>% slice(1)) ``` ``` #> # A tibble: 1 × 2 #> neighbors rmse #> <int> <dbl> #> 1 11 0.0837 ``` ```r ggplot(rs_rmse, aes(x = neighbors, y = rmse)) + geom_point() + geom_path() + geom_point(data = min_rmse, size = 3, color = "red") ``` <img src="L09_tidymodels_files/figure-html/unnamed-chunk-42-1.png" width="80%" style="display: block; margin: auto;" /> ] -- .pull-right.code90[ ```r # 最终模型 ames_rec_final <- prep(ames_rec) knn_mod_final <- update(knn_mod, neighbors = 11) %>% fit(Sale_Price ~ ., data = bake(ames_rec_final, new_data = NULL)) # 评估最终模型在测试集上的表现 parsnip_metrics(data_split, ames_rec_final, knn_mod_final) ``` ``` #> # A tibble: 3 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 rmse standard 0.0754 #> 2 rsq standard 0.813 #> 3 ccc standard 0.894 ``` ] --- layout: true ### >> 5.6 模型调参:`workflows` .font80[<sup>v1.1.3</sup>] + `tune` .font80[<sup>v1.1.2</sup>] --- .pull-left[ ```r # 重新定义模型:设定需调校的参数 knn_mod <- nearest_neighbor( mode = "regression", * neighbors = tune()) %>% set_engine("kknn") # 定义流程 ames_wf <- workflow() %>% add_model(knn_mod) %>% add_recipe(ames_rec) # 基于参数网格计算模型表现指标 ames_res <- ames_wf %>% * tune_grid( cv_splits %>% select(splits, id), * grid = tibble(neighbors = 1:50), metrics = metric_set(rmse, rsq, ccc) ) ``` ] -- .pull-right[ ```r ames_res ``` ``` #> # Tuning results #> # 10-fold cross-validation #> # A tibble: 10 × 4 #> splits id .metrics .notes #> <list> <chr> <list> <list> #> 1 <split [1977/220]> Fold01 <tibble [150 × 5]> <tibble [0 × 3]> #> 2 <split [1977/220]> Fold02 <tibble [150 × 5]> <tibble [0 × 3]> #> 3 <split [1977/220]> Fold03 <tibble [150 × 5]> <tibble [1 × 3]> #> # ℹ 7 more rows #> #> There were issues with some computations: #> #> - Warning(s) x2: some 'x' values beyond boundary knots may cause ill-conditioned bases #> #> Run `show_notes(.Last.tune.result)` for more information. ``` ```r ames_res %>% collect_metrics() %>% glimpse(width = 40) ``` ``` #> Rows: 150 #> Columns: 7 #> $ neighbors <int> 1, 1, 1, 2, 2, 2, 3… #> $ .metric <chr> "ccc", "rmse", "rsq… #> $ .estimator <chr> "standard", "standa… #> $ mean <dbl> 0.8276, 0.1042, 0.6… #> $ n <int> 10, 10, 10, 10, 10,… #> $ std_err <dbl> 0.00943, 0.00302, 0… #> $ .config <chr> "Preprocessor1_Mode… ``` ] --- ```r ames_res %>% * collect_metrics() %>% filter(.metric == "rmse") %>% ggplot(aes(x = neighbors, y = mean)) + geom_point() + geom_line() + geom_point(size = 2, color = 'red', data = . %>% slice_min(mean)) + ylab("rmse") ``` <img src="L09_tidymodels_files/figure-html/unnamed-chunk-45-1.png" width="60%" style="display: block; margin: auto;" /> --- ```r (best_knn <- ames_res %>% select_best("rmse")) ``` ``` #> # A tibble: 1 × 2 #> neighbors .config #> <int> <chr> #> 1 11 Preprocessor1_Model11 ``` ```r ames_fit_test <- ames_wf %>% * finalize_workflow(best_knn) %>% * last_fit( split = data_split, metrics = metric_set(rmse, rsq, ccc) ) ames_fit_test %>% glimpse() ``` ``` #> Rows: 1 #> Columns: 6 #> $ splits <list> [<initial_split[2197 x 733 x 2930 x 74]>] #> $ id <chr> "train/test split" #> $ .metrics <list> [<tbl_df[3 x 4]>] #> $ .notes <list> [<tbl_df[1 x 3]>] #> $ .predictions <list> [<tbl_df[733 x 4]>] #> $ .workflow <list> [Bldg_Type, Neighborhood, Year_Built, Gr_Liv_Area, Full_Bath, Year… ``` --- .pull-left[ ```r ames_fit_test %>% collect_metrics() ``` ``` #> # A tibble: 3 × 4 #> .metric .estimator .estimate .config #> <chr> <chr> <dbl> <chr> #> 1 rmse standard 0.0754 Preproce… #> 2 rsq standard 0.813 Preproce… #> 3 ccc standard 0.894 Preproce… ``` ] -- .pull-right[ ```r ames_fit_test %>% collect_predictions() %>% ggplot(aes(Sale_Price, .pred)) + geom_abline(lty = 2, size = 1.5, color = "gray50") + geom_point(alpha = 0.5) + coord_fixed() ``` <img src="L09_tidymodels_files/figure-html/unnamed-chunk-49-1.png" width="100%" /> ] --- layout: false ### >> `tidymodels` 家族! <img src="imgs/tidymodels-core.png" width="90%" style="display: block; margin: auto auto auto 0;" /> -- <img src="imgs/tidymodels-more.png" width="100%" style="display: block; margin: auto;" /> --- layout: false class: inverse, center, middle # 课后作业 --- class: middle .font150[ 1. 在 📑 _qmd_ 中键入并运行本讲的代码(自觉完成,无需提交) 1. 登录 .red[[{{`tidymodels`}}](https://www.tidymodels.org/)] 的主页 🏠,浏览并学习相关资料 1. `tidymodels`子包的网站 <https://<子包名>.tidymodels.org/> 1. [.bold[{{bilibili::简单的tidymodels}}]](https://space.bilibili.com/382769913/channel/collectiondetail?sid=920988) 1. `tidymodels` 团队撰写的书 📖,供感兴趣的同学学习精进: - [Tidy Modeling with R](https://www.tmwr.org/) - [Feature Engineering and Selection](https://bookdown.org/max/FES/) - [Supervised Machine Learning for Text Analysis in R](https://smltar.com/) ] --- class: center, middle, hide_logo background-image: url(imgs/xaringan.png) background-size: 12% background-position: 50% 40% <br><br><br><br><br><br><br> <hr color='#f00' size='2px' width='80%'> <br> .Large.red[_**本网页版讲义的制作由 R 包 [{{`xaringan`}}](https://github.com/yihui/xaringan) 赋能!**_]