tidymodels
suite💪 R 和 R 包大多是由统计学家和数据分析专家开发的
💪 R 有着异常丰富且紧跟前沿的 {{统计建模和机器学习包}}
💪 R 容易接口或调用其他应用软件(如 C++、python、Spark、TensorFlow 等)
💪 有大量学习参考资料
💪 R 和 R 包大多是由统计学家和数据分析专家开发的
💪 R 有着异常丰富且紧跟前沿的 {{统计建模和机器学习包}}
💪 R 容易接口或调用其他应用软件(如 C++、python、Spark、TensorFlow 等)
💪 有大量学习参考资料
😰 作为解析型的数据分析语言,R 的运行效率低于 C、C++、JAVA 等编译语言
😰 R 操作数据的规模往往受限于计算机内存的大小
😰 各式 R 包的接口并不统一
😰 作为解析型的数据分析语言,R 的运行效率低于 C、C++、JAVA 等编译语言
😰 R 操作数据的规模往往受限于计算机内存的大小
😰 各式 R 包的接口并不统一
Function | Code |
---|---|
MASS::lda |
predict(obj) |
stats::glm |
predict(obj, type = "response") |
gbm::gbm |
predict(obj, type = "response", n.trees) |
rpart::rpart |
predict(obj, type = "prob") |
RWeka::Weka |
predict(obj, type = "probability") |
😰 作为解析型的数据分析语言,R 的运行效率低于 C、C++、JAVA 等编译语言
😰 R 操作数据的规模往往受限于计算机内存的大小
😰 各式 R 包的接口并不统一
Function | Code |
---|---|
MASS::lda |
predict(obj) |
stats::glm |
predict(obj, type = "response") |
gbm::gbm |
predict(obj, type = "response", n.trees) |
rpart::rpart |
predict(obj, type = "prob") |
RWeka::Weka |
predict(obj, type = "probability") |
caret
-> 👉 tidymodels
!
mlr
-> mlr3
!
rsample
v1.2.0(General Resampling Infrastructure)
rsample
包# 先安装并加载 tidymodels 元包(meta package) # install.package("tidymodels")library(tidymodels) # v1.0.0 @ 2022-07-13
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
#> ✔ broom 1.0.5 ✔ recipes 1.0.8#> ✔ dials 1.2.0 ✔ rsample 1.2.0#> ✔ dplyr 1.1.4 ✔ tibble 3.2.1#> ✔ ggplot2 3.4.4 ✔ tidyr 1.3.0#> ✔ infer 1.0.5 ✔ tune 1.1.2#> ✔ modeldata 1.2.0 ✔ workflows 1.1.3#> ✔ parsnip 1.1.1 ✔ workflowsets 1.0.1#> ✔ purrr 1.0.2 ✔ yardstick 1.2.0
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──#> ✖ purrr::discard() masks scales::discard()#> ✖ dplyr::filter() masks stats::filter()#> ✖ dplyr::lag() masks stats::lag()#> ✖ recipes::step() masks stats::step()#> • Use suppressPackageStartupMessages() to eliminate package startup messages
rsample
包# 设定随机数种子set.seed(1234)# 随机抽样分成训练组和试验组iris_split <- initial_split(iris)# initial_split(data, prop = 3/4,# strata = NULL,# breaks = 4, # pool = 0.1, ...)
# 查看分组情况iris_split
#> <Training/Testing/Total>#> <112/38/150>
# 查看iris_split的类iris_split %>% class()
#> [1] "initial_split" "mc_split" #> [3] "rsplit"
rsample
包# 设定随机数种子set.seed(1234)# 随机抽样分成训练组和试验组iris_split <- initial_split(iris)# initial_split(data, prop = 3/4,# strata = NULL,# breaks = 4, # pool = 0.1, ...)
# 查看分组情况iris_split
#> <Training/Testing/Total>#> <112/38/150>
# 查看iris_split的类iris_split %>% class()
#> [1] "initial_split" "mc_split" #> [3] "rsplit"
# 查看 rsplit 类的底层结构iris_split %>% str(vec.len = 1.5)
#> List of 4#> $ data :'data.frame': 150 obs. of 5 variables:#> ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 ...#> ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 ...#> ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 ...#> ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 ...#> ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 ...#> $ in_id : int [1:112] 28 80 101 111 ...#> $ out_id: logi NA#> $ id : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)#> ..$ id: chr "Resample1"#> - attr(*, "class")= chr [1:3] "initial_split" "mc_split" ...
# 提取训练组样本(Q:测试组样本用哪个函数?)iris_split %>% training() %>% glimpse(width = 40)
#> Rows: 112#> Columns: 5#> $ Sepal.Length <dbl> 5.2, 5.7, 6.3, 6.…#> $ Sepal.Width <dbl> 3.5, 2.6, 3.3, 3.…#> $ Petal.Length <dbl> 1.5, 3.5, 6.0, 5.…#> $ Petal.Width <dbl> 0.2, 1.0, 2.5, 2.…#> $ Species <fct> setosa, versicolo…
recipes
v1.0.8(Preprocessing and Feature Engineering Steps for Modeling)
iris_recipe <- recipe(Species ~ ., data = iris) %>% step_corr(all_predictors()) %>% step_center(all_predictors()) %>% step_scale(all_predictors())iris_recipe
#> ── Recipe ────────────────────────────────#> #> ── Inputs #> Number of variables by role#> outcome: 1#> predictor: 4#> #> ── Operations #> • Correlation filter on: all_predictors()#> • Centering for: all_predictors()#> • Scaling for: all_predictors()
iris_recipe <- recipe(Species ~ ., data = iris) %>% step_corr(all_predictors()) %>% step_center(all_predictors()) %>% step_scale(all_predictors())iris_recipe
#> ── Recipe ────────────────────────────────#> #> ── Inputs #> Number of variables by role#> outcome: 1#> predictor: 4#> #> ── Operations #> • Correlation filter on: all_predictors()#> • Centering for: all_predictors()#> • Scaling for: all_predictors()
复杂的recipe类,View()
它!
recipes
包中共计有 98 个数据预处理步骤函数,示例如下:
[1] "step_arrange" "..." [3] "step_corr" "..." [5] "step_YeoJohnson" "step_zv"
你可以用以下五种方式(及其组合)在 step_*()
中指定拟作用的变量:
dplyr
包的方式,如 starts_with()
all_outcomes()
、all_predictors()
或 has_role()
指定角色的方式all_numeric()
、all_nominal()
或 has_type()
指定类型的方式all_numeric_predictors()
?recipes::selections
iris_recipe <- iris_recipe %>% prep(training(iris_split), verbose = TRUE)
#> oper 1 step corr [training] #> oper 2 step center [training] #> oper 3 step scale [training] #> The retained training set is ~ 0 Mb in memory.
# prep(x, # training = NULL, # fresh = FALSE, # verbose = FALSE, # retain = TRUE, # log_change = FALSE, # strings_as_factors = TRUE, # ...)
iris_recipe <- iris_recipe %>% prep(training(iris_split), verbose = TRUE)
#> oper 1 step corr [training] #> oper 2 step center [training] #> oper 3 step scale [training] #> The retained training set is ~ 0 Mb in memory.
# prep(x, # training = NULL, # fresh = FALSE, # verbose = FALSE, # retain = TRUE, # log_change = FALSE, # strings_as_factors = TRUE, # ...)
iris_recipe
#> ── Recipe ───────────────────────────────────────────────────────#> #> ── Inputs #> Number of variables by role#> outcome: 1#> predictor: 4#> #> ── Training information #> Training data contained 112 data points and no incomplete rows.#> #> ── Operations #> • Correlation filter on: Petal.Length | Trained#> • Centering for: Sepal.Length, Sepal.Width, Petal.Width | Trained#> • Scaling for: Sepal.Length, Sepal.Width, Petal.Width | Trained
# 训练组# 可直接提取准备阶段 retain 的数据iris_training <- iris_recipe %>% bake(new_data = NULL)# bake(object, # new_data, # ..., # composition = "tibble")iris_training
#> # A tibble: 112 × 4#> Sepal.Length Sepal.Width Petal.Width Species #> <dbl> <dbl> <dbl> <fct> #> 1 -0.805 1.09 -1.36 setosa #> 2 -0.212 -1.07 -0.316 versicolor#> 3 0.500 0.607 1.64 virginica #> # ℹ 109 more rows
# 训练组# 可直接提取准备阶段 retain 的数据iris_training <- iris_recipe %>% bake(new_data = NULL)# bake(object, # new_data, # ..., # composition = "tibble")iris_training
#> # A tibble: 112 × 4#> Sepal.Length Sepal.Width Petal.Width Species #> <dbl> <dbl> <dbl> <fct> #> 1 -0.805 1.09 -1.36 setosa #> 2 -0.212 -1.07 -0.316 versicolor#> 3 0.500 0.607 1.64 virginica #> # ℹ 109 more rows
# 测试组# 用 bake() 进行预处理iris_testing <- iris_recipe %>% bake(new_data = testing(iris_split))iris_testing
#> # A tibble: 38 × 4#> Sepal.Length Sepal.Width Petal.Width Species#> <dbl> <dbl> <dbl> <fct> #> 1 -0.924 1.09 -1.36 setosa #> 2 -1.52 0.847 -1.23 setosa #> 3 -1.28 0.847 -1.36 setosa #> # ℹ 35 more rows
parsnip
v1.1.1(A Common API to Modeling and Analysis Functions)
# 随机森林模型的统一 APIiris_rf <- rand_forest( mode = "classification", trees = 100)# rand_forest(# mode = "unknown",# engine = "ranger",# mtry = NULL, trees = NULL,# min_n = NULL)# 设定随机森林模型的计算引擎# show_engines("rand_forest")(iris_rf <- iris_rf %>% set_engine("ranger"))
#> Random Forest Model Specification (classification)#> #> Main Arguments:#> trees = 100#> #> Computational engine: ranger
# 随机森林模型的统一 APIiris_rf <- rand_forest( mode = "classification", trees = 100)# rand_forest(# mode = "unknown",# engine = "ranger",# mtry = NULL, trees = NULL,# min_n = NULL)# 设定随机森林模型的计算引擎# show_engines("rand_forest")(iris_rf <- iris_rf %>% set_engine("ranger"))
#> Random Forest Model Specification (classification)#> #> Main Arguments:#> trees = 100#> #> Computational engine: ranger
# 查看 iris_rf 的类iris_rf %>% class()
#> [1] "rand_forest" "model_spec"
# 查看 iris_rf 的底层结构iris_rf %>% str()
#> List of 7#> $ args :List of 3#> ..$ mtry : language ~NULL#> .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> #> ..$ trees: language ~100#> .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> #> ..$ min_n: language ~NULL#> .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> #> $ eng_args : Named list()#> ..- attr(*, "class")= chr [1:2] "quosures" "list"#> $ mode : chr "classification"#> $ user_specified_mode : logi TRUE#> $ method : NULL#> $ engine : chr "ranger"#> $ user_specified_engine: logi TRUE#> - attr(*, "class")= chr [1:2] "rand_forest" "model_spec"
# 拟合随机森林模型(iris_fit <- iris_rf %>% fit( # ?parsnip::fit.model_spec() formula = Species ~ ., data = iris_training ))
#> parsnip model object#> #> Ranger result#> #> Call:#> ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~100, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) #> #> Type: Probability estimation #> Number of trees: 100 #> Sample size: 112 #> Number of independent variables: 3 #> Mtry: 1 #> Target node size: 10 #> Variable importance mode: none #> Splitrule: gini #> OOB prediction error (Brier s.): 0.0563
# 拟合随机森林模型(iris_fit <- iris_rf %>% fit( # ?parsnip::fit.model_spec() formula = Species ~ ., data = iris_training ))
#> parsnip model object#> #> Ranger result#> #> Call:#> ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~100, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) #> #> Type: Probability estimation #> Number of trees: 100 #> Sample size: 112 #> Number of independent variables: 3 #> Mtry: 1 #> Target node size: 10 #> Variable importance mode: none #> Splitrule: gini #> OOB prediction error (Brier s.): 0.0563
# 查看 iris_fit 的类iris_fit %>% class()
#> [1] "_ranger" "model_fit"
# 查看 iris_fit 的底层结构iris_fit %>% str(max.level = 1)
#> List of 6#> $ lvl : chr [1:3] "setosa" "versicolor" "virginica"#> $ spec :List of 8#> ..- attr(*, "class")= chr [1:2] "rand_forest" "model_spec"#> $ fit :List of 13#> ..- attr(*, "class")= chr "ranger"#> $ preproc :List of 4#> $ elapsed :List of 1#> $ censor_probs: list()#> - attr(*, "class")= chr [1:2] "_ranger" "model_fit"
# 将拟合模型应用于测试组iris_fit %>% predict( new_data = iris_testing )
#> # A tibble: 38 × 1#> .pred_class#> <fct> #> 1 setosa #> 2 setosa #> 3 setosa #> # ℹ 35 more rows
# ? parsnip::predict.model_fit()# predict(object, new_data, # type = NULL, # # "numeric", "class", "prob", # # "conf_int", "pred_int", # # "quantile", "time", # # "hazard", "survival", "raw"# opts = list(), ...)
# 将拟合模型应用于测试组iris_fit %>% predict( new_data = iris_testing )
#> # A tibble: 38 × 1#> .pred_class#> <fct> #> 1 setosa #> 2 setosa #> 3 setosa #> # ℹ 35 more rows
# ? parsnip::predict.model_fit()# predict(object, new_data, # type = NULL, # # "numeric", "class", "prob", # # "conf_int", "pred_int", # # "quantile", "time", # # "hazard", "survival", "raw"# opts = list(), ...)
# 合并原数据集iris_class <- iris_fit %>% predict( new_data = iris_testing ) %>% bind_cols(iris_testing)iris_class %>% glimpse(width = 40)
#> Rows: 38#> Columns: 5#> $ .pred_class <fct> setosa, setosa, s…#> $ Sepal.Length <dbl> -0.9241, -1.5176,…#> $ Sepal.Width <dbl> 1.087, 0.847, 0.8…#> $ Petal.Width <dbl> -1.359, -1.229, -…#> $ Species <fct> setosa, setosa, s…
yardstick
v1.2.0(Tidy Characterizations of Model Performance)
metrics()
iris_class %>% glimpse(width = 40)
#> Rows: 38#> Columns: 5#> $ .pred_class <fct> setosa, setosa, s…#> $ Sepal.Length <dbl> -0.9241, -1.5176,…#> $ Sepal.Width <dbl> 1.087, 0.847, 0.8…#> $ Petal.Width <dbl> -1.359, -1.229, -…#> $ Species <fct> setosa, setosa, s…
iris_class %>% metrics( # 计算模型表现指标 truth = Species, estimate = .pred_class )
#> # A tibble: 2 × 3#> .metric .estimator .estimate#> <chr> <chr> <dbl>#> 1 accuracy multiclass 0.947#> 2 kap multiclass 0.920
metrics()
iris_class %>% glimpse(width = 40)
#> Rows: 38#> Columns: 5#> $ .pred_class <fct> setosa, setosa, s…#> $ Sepal.Length <dbl> -0.9241, -1.5176,…#> $ Sepal.Width <dbl> 1.087, 0.847, 0.8…#> $ Petal.Width <dbl> -1.359, -1.229, -…#> $ Species <fct> setosa, setosa, s…
iris_class %>% metrics( # 计算模型表现指标 truth = Species, estimate = .pred_class )
#> # A tibble: 2 × 3#> .metric .estimator .estimate#> <chr> <chr> <dbl>#> 1 accuracy multiclass 0.947#> 2 kap multiclass 0.920
# 提取各分类的预测概率iris_probs <- iris_fit %>% predict( new_data = iris_testing, type = "prob" ) %>% bind_cols(iris_class)# 模型表现指标:基于分类预测概率iris_probs %>% metrics( truth = Species, estimate = .pred_class, .pred_setosa:.pred_virginica )
#> # A tibble: 4 × 3#> .metric .estimator .estimate#> <chr> <chr> <dbl>#> 1 accuracy multiclass 0.947#> 2 kap multiclass 0.920#> 3 mn_log_loss multiclass 0.199#> 4 roc_auc hand_till 0.979
*_curve()
# 接收器操作特征曲线iris_roc <- iris_probs %>% roc_curve( truth = Species, .pred_setosa:.pred_virginica ) # iris_roc -> a "roc_df"iris_roc %>% autoplot()
*_curve()
# 接收器操作特征曲线iris_roc <- iris_probs %>% roc_curve( truth = Species, .pred_setosa:.pred_virginica ) # iris_roc -> a "roc_df"iris_roc %>% autoplot()
# 增益曲线iris_gain <- iris_probs %>% gain_curve( truth = Species, .pred_setosa:.pred_virginica )# iris_gain # a "gain_df"iris_gain %>% autoplot()
(Case Study)
modeldata::ames %>% str()
#> tibble [2,930 × 74] (S3: tbl_df/tbl/data.frame)#> $ MS_SubClass : Factor w/ 16 levels "One_Story_1946_and_Newer_All_Styles",..: 1 1 1 1 6 6 12 12 12 6 ...#> $ MS_Zoning : Factor w/ 7 levels "Floating_Village_Residential",..: 3 2 3 3 3 3 3 3 3 3 ...#> $ Lot_Frontage : num [1:2930] 141 80 81 93 74 78 41 43 39 60 ...#> $ Lot_Area : int [1:2930] 31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...#> $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...#> $ Alley : Factor w/ 3 levels "Gravel","No_Alley_Access",..: 2 2 2 2 2 2 2 2 2 2 ...#> $ Lot_Shape : Factor w/ 4 levels "Regular","Slightly_Irregular",..: 2 1 2 1 2 2 1 2 2 1 ...#> $ Land_Contour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 2 4 4 ...#> $ Utilities : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...#> $ Lot_Config : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 1 1 5 5 5 5 5 5 ...#> $ Land_Slope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...#> $ Neighborhood : Factor w/ 29 levels "North_Ames","College_Creek",..: 1 1 1 1 7 7 17 17 17 7 ...#> $ Condition_1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 3 3 3 ...#> $ Condition_2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...#> $ Bldg_Type : Factor w/ 5 levels "OneFam","TwoFmCon",..: 1 1 1 1 1 1 5 5 5 1 ...#> $ House_Style : Factor w/ 8 levels "One_and_Half_Fin",..: 3 3 3 3 8 8 3 3 3 8 ...#> $ Overall_Cond : Factor w/ 10 levels "Very_Poor","Poor",..: 5 6 6 5 5 6 5 5 5 5 ...#> $ Year_Built : int [1:2930] 1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...#> $ Year_Remod_Add : int [1:2930] 1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...#> $ Roof_Style : Factor w/ 6 levels "Flat","Gable",..: 4 2 4 4 2 2 2 2 2 2 ...#> $ Roof_Matl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...#> $ Exterior_1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 4 14 15 4 14 14 6 7 6 14 ...#> $ Exterior_2nd : Factor w/ 17 levels "AsbShng","AsphShn",..: 11 15 16 4 15 15 6 7 6 15 ...#> $ Mas_Vnr_Type : Factor w/ 5 levels "BrkCmn","BrkFace",..: 5 4 2 4 4 2 4 4 4 4 ...#> $ Mas_Vnr_Area : num [1:2930] 112 0 108 0 0 20 0 0 0 0 ...#> $ Exter_Cond : Factor w/ 5 levels "Excellent","Fair",..: 5 5 5 5 5 5 5 5 5 5 ...#> $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 2 2 3 3 3 3 3 3 ...#> $ Bsmt_Cond : Factor w/ 6 levels "Excellent","Fair",..: 3 6 6 6 6 6 6 6 6 6 ...#> $ Bsmt_Exposure : Factor w/ 5 levels "Av","Gd","Mn",..: 2 4 4 4 4 4 3 4 4 4 ...#> $ BsmtFin_Type_1 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 2 6 1 1 3 3 3 1 3 7 ...#> $ BsmtFin_SF_1 : num [1:2930] 2 6 1 1 3 3 3 1 3 7 ...#> $ BsmtFin_Type_2 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 7 4 7 7 7 7 7 7 7 7 ...#> $ BsmtFin_SF_2 : num [1:2930] 0 144 0 0 0 0 0 0 0 0 ...#> $ Bsmt_Unf_SF : num [1:2930] 441 270 406 1045 137 ...#> $ Total_Bsmt_SF : num [1:2930] 1080 882 1329 2110 928 ...#> $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...#> $ Heating_QC : Factor w/ 5 levels "Excellent","Fair",..: 2 5 5 1 3 1 1 1 1 3 ...#> $ Central_Air : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...#> $ Electrical : Factor w/ 6 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 5 5 ...#> $ First_Flr_SF : int [1:2930] 1656 896 1329 2110 928 926 1338 1280 1616 1028 ...#> $ Second_Flr_SF : int [1:2930] 0 0 0 0 701 678 0 0 0 776 ...#> $ Gr_Liv_Area : int [1:2930] 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...#> $ Bsmt_Full_Bath : num [1:2930] 1 0 0 1 0 0 1 0 1 0 ...#> $ Bsmt_Half_Bath : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...#> $ Full_Bath : int [1:2930] 1 1 1 2 2 2 2 2 2 2 ...#> $ Half_Bath : int [1:2930] 0 0 1 1 1 1 0 0 0 1 ...#> $ Bedroom_AbvGr : int [1:2930] 3 2 3 3 3 3 2 2 2 3 ...#> $ Kitchen_AbvGr : int [1:2930] 1 1 1 1 1 1 1 1 1 1 ...#> $ TotRms_AbvGrd : int [1:2930] 7 5 6 8 6 7 6 5 5 7 ...#> $ Functional : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 8 8 8 8 ...#> $ Fireplaces : int [1:2930] 2 0 0 2 1 1 0 0 1 1 ...#> $ Garage_Type : Factor w/ 7 levels "Attchd","Basment",..: 1 1 1 1 1 1 1 1 1 1 ...#> $ Garage_Finish : Factor w/ 4 levels "Fin","No_Garage",..: 1 4 4 1 1 1 1 3 3 1 ...#> $ Garage_Cars : num [1:2930] 2 1 1 2 2 2 2 2 2 2 ...#> $ Garage_Area : num [1:2930] 528 730 312 522 482 470 582 506 608 442 ...#> $ Garage_Cond : Factor w/ 6 levels "Excellent","Fair",..: 6 6 6 6 6 6 6 6 6 6 ...#> $ Paved_Drive : Factor w/ 3 levels "Dirt_Gravel",..: 2 3 3 3 3 3 3 3 3 3 ...#> $ Wood_Deck_SF : int [1:2930] 210 140 393 0 212 360 0 0 237 140 ...#> $ Open_Porch_SF : int [1:2930] 62 0 36 0 34 36 0 82 152 60 ...#> $ Enclosed_Porch : int [1:2930] 0 0 0 0 0 0 170 0 0 0 ...#> $ Three_season_porch: int [1:2930] 0 0 0 0 0 0 0 0 0 0 ...#> $ Screen_Porch : int [1:2930] 0 120 0 0 0 0 0 144 0 0 ...#> $ Pool_Area : int [1:2930] 0 0 0 0 0 0 0 0 0 0 ...#> $ Pool_QC : Factor w/ 5 levels "Excellent","Fair",..: 4 4 4 4 4 4 4 4 4 4 ...#> $ Fence : Factor w/ 5 levels "Good_Privacy",..: 5 3 5 5 3 5 5 5 5 5 ...#> $ Misc_Feature : Factor w/ 6 levels "Elev","Gar2",..: 3 3 2 3 3 3 3 3 3 3 ...#> $ Misc_Val : int [1:2930] 0 0 12500 0 0 0 0 0 0 0 ...#> $ Mo_Sold : int [1:2930] 5 6 6 4 3 6 4 1 3 6 ...#> $ Year_Sold : int [1:2930] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...#> $ Sale_Type : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 10 10 10 10 10 ...#> $ Sale_Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...#> $ Sale_Price : int [1:2930] 215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...#> $ Longitude : num [1:2930] -93.6 -93.6 -93.6 -93.6 -93.6 ...#> $ Latitude : num [1:2930] 42.1 42.1 42.1 42.1 42.1 ...
rsample
包set.seed(1234)data_split <- initial_split( ames, strata = "Sale_Price")ames_train <- training(data_split)ames_test <- testing(data_split)nrow(ames_train) / nrow(ames)
#> [1] 0.75
rsample
包set.seed(1234)data_split <- initial_split( ames, strata = "Sale_Price")ames_train <- training(data_split)ames_test <- testing(data_split)nrow(ames_train) / nrow(ames)
#> [1] 0.75
# v折交叉验证(cv_splits <- vfold_cv(ames_train))
#> # 10-fold cross-validation #> # A tibble: 10 × 2#> splits id #> <list> <chr> #> 1 <split [1977/220]> Fold01#> 2 <split [1977/220]> Fold02#> 3 <split [1977/220]> Fold03#> # ℹ 7 more rows
cv_splits$splits[[1]]
#> <Analysis/Assess/Total>#> <1977/220/2197>
cv_splits$splits[[1]] %>% analysis() %>% dim()
#> [1] 1977 74
vignette("Common_Patterns", package = "rsample")
recipes
包# 1. 定义数据预处理菜单ames_rec <- recipe(Sale_Price ~ Bldg_Type + Neighborhood + Year_Built + Gr_Liv_Area + Full_Bath + Year_Sold + Lot_Area + Central_Air + Longitude + Latitude, data = ames_train) %>% step_log(Sale_Price, base = 10) %>% step_BoxCox(Lot_Area, Gr_Liv_Area) %>% step_other(Neighborhood, threshold = 0.05) %>% step_dummy(all_nominal()) %>% step_interact(~ starts_with("Central_Air"):Year_Built) %>% step_bs(Longitude, Latitude, options = list(df = 5))ames_rec %>% tidy()
#> # A tibble: 6 × 6#> number operation type trained skip id #> <int> <chr> <chr> <lgl> <lgl> <chr> #> 1 1 step log FALSE FALSE log_YNFHa #> 2 2 step BoxCox FALSE FALSE BoxCox_vCuOq#> 3 3 step other FALSE FALSE other_0nogs #> # ℹ 3 more rows
recipes
包# 2. 准备数据预处理cv_splits <- cv_splits %>% mutate( recipes = map(splits, prepper, recipe = ames_rec) )cv_splits
#> # 10-fold cross-validation #> # A tibble: 10 × 3#> splits id recipes #> <list> <chr> <list> #> 1 <split [1977/220]> Fold01 <recipe>#> 2 <split [1977/220]> Fold02 <recipe>#> 3 <split [1977/220]> Fold03 <recipe>#> # ℹ 7 more rows
# prepper# function(split_obj, recipe, ...) {# prep(recipe, # training = rsample::analysis(split_obj), # fresh = TRUE, ...)# }
recipes
包# 2. 准备数据预处理cv_splits <- cv_splits %>% mutate( recipes = map(splits, prepper, recipe = ames_rec) )cv_splits
#> # 10-fold cross-validation #> # A tibble: 10 × 3#> splits id recipes #> <list> <chr> <list> #> 1 <split [1977/220]> Fold01 <recipe>#> 2 <split [1977/220]> Fold02 <recipe>#> 3 <split [1977/220]> Fold03 <recipe>#> # ℹ 7 more rows
# prepper# function(split_obj, recipe, ...) {# prep(recipe, # training = rsample::analysis(split_obj), # fresh = TRUE, ...)# }
# 相同的菜单 + # 不同的训练数据集 -> # 不同的准备结果cv_splits$recipes %>% map(\(x) tidy(x, number = 2)) %>% list_rbind() %>% filter(terms == "Lot_Area")
#> # A tibble: 10 × 3#> terms value id #> <chr> <dbl> <chr> #> 1 Lot_Area 0.0836 BoxCox_vCuOq#> 2 Lot_Area 0.0777 BoxCox_vCuOq#> 3 Lot_Area 0.104 BoxCox_vCuOq#> # ℹ 7 more rows
# ?recipes:::tidy.recipe
parsnip
包parsnip
包# 1. 设定模型knn_mod <- nearest_neighbor( mode = "regression", neighbors = 5) %>% set_engine("kknn") # 默认,可不设置# 2. 拟合模型# 2.1 定义辅助函数parsnip_fit <- function(rec, model) { fit(model, Sale_Price ~ ., data = bake(rec, new_data= NULL))}# 2.2 应用辅助函数cv_splits <- cv_splits %>% mutate( knn = map( recipes, parsnip_fit, model = knn_mod ) )
parsnip
包# 1. 设定模型knn_mod <- nearest_neighbor( mode = "regression", neighbors = 5) %>% set_engine("kknn") # 默认,可不设置# 2. 拟合模型# 2.1 定义辅助函数parsnip_fit <- function(rec, model) { fit(model, Sale_Price ~ ., data = bake(rec, new_data= NULL))}# 2.2 应用辅助函数cv_splits <- cv_splits %>% mutate( knn = map( recipes, parsnip_fit, model = knn_mod ) )
# 查看结果cv_splits %>% select(-id)
#> # A tibble: 10 × 3#> splits recipes knn #> <list> <list> <list> #> 1 <split [1977/220]> <recipe> <fit[+]>#> 2 <split [1977/220]> <recipe> <fit[+]>#> 3 <split [1977/220]> <recipe> <fit[+]>#> # ℹ 7 more rows
cv_splits$knn[[1]] %>% str(max.level = 1)
#> List of 6#> $ lvl : NULL#> $ spec :List of 7#> ..- attr(*, "class")= chr [1:2] "nearest_neighbor" "model_spec"#> $ fit :List of 10#> ..- attr(*, "class")= chr [1:2] "train.kknn" "kknn"#> $ preproc :List of 1#> $ elapsed :List of 1#> $ censor_probs: list()#> - attr(*, "class")= chr [1:2] "_train.kknn" "model_fit"
parsnip
包 + yardstick
包parsnip
包 + yardstick
包# 定义模型表现函数pf_metrics <- metric_set(rmse, rsq, ccc)# 定义辅助函数parsnip_metrics <- function(split, recipe, model) { raw <- assessment(split) processed <- bake(recipe, new_data = raw) model %>% predict(new_data = processed) %>% bind_cols(processed) %>% pf_metrics(Sale_Price, .pred) }# 应用辅助函数cv_splits <- cv_splits %>% mutate( metrics = pmap( list( split = splits, recipe = recipes, model = knn ), parsnip_metrics ) )
parsnip
包 + yardstick
包# 定义模型表现函数pf_metrics <- metric_set(rmse, rsq, ccc)# 定义辅助函数parsnip_metrics <- function(split, recipe, model) { raw <- assessment(split) processed <- bake(recipe, new_data = raw) model %>% predict(new_data = processed) %>% bind_cols(processed) %>% pf_metrics(Sale_Price, .pred) }# 应用辅助函数cv_splits <- cv_splits %>% mutate( metrics = pmap( list( split = splits, recipe = recipes, model = knn ), parsnip_metrics ) )
cv_splits$metrics[[1]]
#> # A tibble: 3 × 3#> .metric .estimator .estimate#> <chr> <chr> <dbl>#> 1 rmse standard 0.0839#> 2 rsq standard 0.761 #> 3 ccc standard 0.871
# 汇总计算模型表现指标cv_splits %>% unnest(cols = metrics) %>% group_by(.metric) %>% summarise( mean = mean(.estimate), std_err = sd(.estimate)/sqrt(n()) )
#> # A tibble: 3 × 3#> .metric mean std_err#> <chr> <dbl> <dbl>#> 1 ccc 0.869 0.00519#> 2 rmse 0.0870 0.00245#> 3 rsq 0.764 0.00889
# 定义辅助函数m_metrics <- function(split, recipe, fit) { holdout <- bake( recipe, new_data = assessment(split)) resample_name <- labels(split)$id multi_predict(fit, new_data = holdout, neighbors = 1:50) %>% bind_cols(., holdout %>% select(Sale_Price)) %>% unnest(cols = .pred) %>% group_by(neighbors) %>% pf_metrics(Sale_Price, .pred) %>% mutate(resample = resample_name)}# 应用辅助函数cv_splits <- cv_splits %>% mutate( tuning_metrics = pmap( list(splits, recipes, knn), m_metrics ) )
# 定义辅助函数m_metrics <- function(split, recipe, fit) { holdout <- bake( recipe, new_data = assessment(split)) resample_name <- labels(split)$id multi_predict(fit, new_data = holdout, neighbors = 1:50) %>% bind_cols(., holdout %>% select(Sale_Price)) %>% unnest(cols = .pred) %>% group_by(neighbors) %>% pf_metrics(Sale_Price, .pred) %>% mutate(resample = resample_name)}# 应用辅助函数cv_splits <- cv_splits %>% mutate( tuning_metrics = pmap( list(splits, recipes, knn), m_metrics ) )
cv_splits$tuning_metrics[[1]]
#> # A tibble: 150 × 5#> neighbors .metric .estimator .estimate resample#> <int> <chr> <chr> <dbl> <chr> #> 1 1 rmse standard 0.101 Fold01 #> 2 2 rmse standard 0.0951 Fold01 #> 3 3 rmse standard 0.0898 Fold01 #> # ℹ 147 more rows
# 汇总计算模型表现指标(rs_rmse <- cv_splits %>% unnest(tuning_metrics) %>% filter(.metric == "rmse") %>% group_by(neighbors) %>% summarize(rmse = mean(.estimate)))
#> # A tibble: 50 × 2#> neighbors rmse#> <int> <dbl>#> 1 1 0.104 #> 2 2 0.0972#> 3 3 0.0921#> # ℹ 47 more rows
(min_rmse <- rs_rmse %>% arrange(rmse) %>% slice(1))
#> # A tibble: 1 × 2#> neighbors rmse#> <int> <dbl>#> 1 11 0.0837
ggplot(rs_rmse, aes(x = neighbors, y = rmse)) + geom_point() + geom_path() + geom_point(data = min_rmse, size = 3, color = "red")
(min_rmse <- rs_rmse %>% arrange(rmse) %>% slice(1))
#> # A tibble: 1 × 2#> neighbors rmse#> <int> <dbl>#> 1 11 0.0837
ggplot(rs_rmse, aes(x = neighbors, y = rmse)) + geom_point() + geom_path() + geom_point(data = min_rmse, size = 3, color = "red")
# 最终模型ames_rec_final <- prep(ames_rec)knn_mod_final <- update(knn_mod, neighbors = 11) %>% fit(Sale_Price ~ ., data = bake(ames_rec_final, new_data = NULL))# 评估最终模型在测试集上的表现parsnip_metrics(data_split, ames_rec_final, knn_mod_final)
#> # A tibble: 3 × 3#> .metric .estimator .estimate#> <chr> <chr> <dbl>#> 1 rmse standard 0.0754#> 2 rsq standard 0.813 #> 3 ccc standard 0.894
workflows
v1.1.3 + tune
v1.1.2# 重新定义模型:设定需调校的参数knn_mod <- nearest_neighbor( mode = "regression", neighbors = tune()) %>% set_engine("kknn")# 定义流程ames_wf <- workflow() %>% add_model(knn_mod) %>% add_recipe(ames_rec)# 基于参数网格计算模型表现指标ames_res <- ames_wf %>% tune_grid( cv_splits %>% select(splits, id), grid = tibble(neighbors = 1:50), metrics = metric_set(rmse, rsq, ccc) )
workflows
v1.1.3 + tune
v1.1.2# 重新定义模型:设定需调校的参数knn_mod <- nearest_neighbor( mode = "regression", neighbors = tune()) %>% set_engine("kknn")# 定义流程ames_wf <- workflow() %>% add_model(knn_mod) %>% add_recipe(ames_rec)# 基于参数网格计算模型表现指标ames_res <- ames_wf %>% tune_grid( cv_splits %>% select(splits, id), grid = tibble(neighbors = 1:50), metrics = metric_set(rmse, rsq, ccc) )
ames_res
#> # Tuning results#> # 10-fold cross-validation #> # A tibble: 10 × 4#> splits id .metrics .notes #> <list> <chr> <list> <list> #> 1 <split [1977/220]> Fold01 <tibble [150 × 5]> <tibble [0 × 3]>#> 2 <split [1977/220]> Fold02 <tibble [150 × 5]> <tibble [0 × 3]>#> 3 <split [1977/220]> Fold03 <tibble [150 × 5]> <tibble [1 × 3]>#> # ℹ 7 more rows#> #> There were issues with some computations:#> #> - Warning(s) x2: some 'x' values beyond boundary knots may cause ill-conditioned bases#> #> Run `show_notes(.Last.tune.result)` for more information.
ames_res %>% collect_metrics() %>% glimpse(width = 40)
#> Rows: 150#> Columns: 7#> $ neighbors <int> 1, 1, 1, 2, 2, 2, 3…#> $ .metric <chr> "ccc", "rmse", "rsq…#> $ .estimator <chr> "standard", "standa…#> $ mean <dbl> 0.8276, 0.1042, 0.6…#> $ n <int> 10, 10, 10, 10, 10,…#> $ std_err <dbl> 0.00943, 0.00302, 0…#> $ .config <chr> "Preprocessor1_Mode…
workflows
v1.1.3 + tune
v1.1.2ames_res %>% collect_metrics() %>% filter(.metric == "rmse") %>% ggplot(aes(x = neighbors, y = mean)) + geom_point() + geom_line() + geom_point(size = 2, color = 'red', data = . %>% slice_min(mean)) + ylab("rmse")
workflows
v1.1.3 + tune
v1.1.2(best_knn <- ames_res %>% select_best("rmse"))
#> # A tibble: 1 × 2#> neighbors .config #> <int> <chr> #> 1 11 Preprocessor1_Model11
ames_fit_test <- ames_wf %>% finalize_workflow(best_knn) %>% last_fit( split = data_split, metrics = metric_set(rmse, rsq, ccc) )ames_fit_test %>% glimpse()
#> Rows: 1#> Columns: 6#> $ splits <list> [<initial_split[2197 x 733 x 2930 x 74]>]#> $ id <chr> "train/test split"#> $ .metrics <list> [<tbl_df[3 x 4]>]#> $ .notes <list> [<tbl_df[1 x 3]>]#> $ .predictions <list> [<tbl_df[733 x 4]>]#> $ .workflow <list> [Bldg_Type, Neighborhood, Year_Built, Gr_Liv_Area, Full_Bath, Year…
workflows
v1.1.3 + tune
v1.1.2ames_fit_test %>% collect_metrics()
#> # A tibble: 3 × 4#> .metric .estimator .estimate .config #> <chr> <chr> <dbl> <chr> #> 1 rmse standard 0.0754 Preproce…#> 2 rsq standard 0.813 Preproce…#> 3 ccc standard 0.894 Preproce…
workflows
v1.1.3 + tune
v1.1.2ames_fit_test %>% collect_metrics()
#> # A tibble: 3 × 4#> .metric .estimator .estimate .config #> <chr> <chr> <dbl> <chr> #> 1 rmse standard 0.0754 Preproce…#> 2 rsq standard 0.813 Preproce…#> 3 ccc standard 0.894 Preproce…
ames_fit_test %>% collect_predictions() %>% ggplot(aes(Sale_Price, .pred)) + geom_abline(lty = 2, size = 1.5, color = "gray50") + geom_point(alpha = 0.5) + coord_fixed()
tidymodels
家族!tidymodels
家族!在 📑 qmd 中键入并运行本讲的代码(自觉完成,无需提交)
登录 {{tidymodels
}} 的主页 🏠,浏览并学习相关资料
tidymodels
子包的网站 https://<子包名>.tidymodels.org/
tidymodels
团队撰写的书 📖,供感兴趣的同学学习精进:
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
o | Tile View: Overview of Slides |
Esc | Back to slideshow |
tidymodels
suite💪 R 和 R 包大多是由统计学家和数据分析专家开发的
💪 R 有着异常丰富且紧跟前沿的 {{统计建模和机器学习包}}
💪 R 容易接口或调用其他应用软件(如 C++、python、Spark、TensorFlow 等)
💪 有大量学习参考资料
💪 R 和 R 包大多是由统计学家和数据分析专家开发的
💪 R 有着异常丰富且紧跟前沿的 {{统计建模和机器学习包}}
💪 R 容易接口或调用其他应用软件(如 C++、python、Spark、TensorFlow 等)
💪 有大量学习参考资料
😰 作为解析型的数据分析语言,R 的运行效率低于 C、C++、JAVA 等编译语言
😰 R 操作数据的规模往往受限于计算机内存的大小
😰 各式 R 包的接口并不统一
😰 作为解析型的数据分析语言,R 的运行效率低于 C、C++、JAVA 等编译语言
😰 R 操作数据的规模往往受限于计算机内存的大小
😰 各式 R 包的接口并不统一
Function | Code |
---|---|
MASS::lda |
predict(obj) |
stats::glm |
predict(obj, type = "response") |
gbm::gbm |
predict(obj, type = "response", n.trees) |
rpart::rpart |
predict(obj, type = "prob") |
RWeka::Weka |
predict(obj, type = "probability") |
😰 作为解析型的数据分析语言,R 的运行效率低于 C、C++、JAVA 等编译语言
😰 R 操作数据的规模往往受限于计算机内存的大小
😰 各式 R 包的接口并不统一
Function | Code |
---|---|
MASS::lda |
predict(obj) |
stats::glm |
predict(obj, type = "response") |
gbm::gbm |
predict(obj, type = "response", n.trees) |
rpart::rpart |
predict(obj, type = "prob") |
RWeka::Weka |
predict(obj, type = "probability") |
caret
-> 👉 tidymodels
!
mlr
-> mlr3
!
rsample
v1.2.0(General Resampling Infrastructure)
rsample
包# 先安装并加载 tidymodels 元包(meta package) # install.package("tidymodels")library(tidymodels) # v1.0.0 @ 2022-07-13
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
#> ✔ broom 1.0.5 ✔ recipes 1.0.8#> ✔ dials 1.2.0 ✔ rsample 1.2.0#> ✔ dplyr 1.1.4 ✔ tibble 3.2.1#> ✔ ggplot2 3.4.4 ✔ tidyr 1.3.0#> ✔ infer 1.0.5 ✔ tune 1.1.2#> ✔ modeldata 1.2.0 ✔ workflows 1.1.3#> ✔ parsnip 1.1.1 ✔ workflowsets 1.0.1#> ✔ purrr 1.0.2 ✔ yardstick 1.2.0
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──#> ✖ purrr::discard() masks scales::discard()#> ✖ dplyr::filter() masks stats::filter()#> ✖ dplyr::lag() masks stats::lag()#> ✖ recipes::step() masks stats::step()#> • Use suppressPackageStartupMessages() to eliminate package startup messages
rsample
包# 设定随机数种子set.seed(1234)# 随机抽样分成训练组和试验组iris_split <- initial_split(iris)# initial_split(data, prop = 3/4,# strata = NULL,# breaks = 4, # pool = 0.1, ...)
# 查看分组情况iris_split
#> <Training/Testing/Total>#> <112/38/150>
# 查看iris_split的类iris_split %>% class()
#> [1] "initial_split" "mc_split" #> [3] "rsplit"
rsample
包# 设定随机数种子set.seed(1234)# 随机抽样分成训练组和试验组iris_split <- initial_split(iris)# initial_split(data, prop = 3/4,# strata = NULL,# breaks = 4, # pool = 0.1, ...)
# 查看分组情况iris_split
#> <Training/Testing/Total>#> <112/38/150>
# 查看iris_split的类iris_split %>% class()
#> [1] "initial_split" "mc_split" #> [3] "rsplit"
# 查看 rsplit 类的底层结构iris_split %>% str(vec.len = 1.5)
#> List of 4#> $ data :'data.frame': 150 obs. of 5 variables:#> ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 ...#> ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 ...#> ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 ...#> ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 ...#> ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 ...#> $ in_id : int [1:112] 28 80 101 111 ...#> $ out_id: logi NA#> $ id : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)#> ..$ id: chr "Resample1"#> - attr(*, "class")= chr [1:3] "initial_split" "mc_split" ...
# 提取训练组样本(Q:测试组样本用哪个函数?)iris_split %>% training() %>% glimpse(width = 40)
#> Rows: 112#> Columns: 5#> $ Sepal.Length <dbl> 5.2, 5.7, 6.3, 6.…#> $ Sepal.Width <dbl> 3.5, 2.6, 3.3, 3.…#> $ Petal.Length <dbl> 1.5, 3.5, 6.0, 5.…#> $ Petal.Width <dbl> 0.2, 1.0, 2.5, 2.…#> $ Species <fct> setosa, versicolo…
recipes
v1.0.8(Preprocessing and Feature Engineering Steps for Modeling)
iris_recipe <- recipe(Species ~ ., data = iris) %>% step_corr(all_predictors()) %>% step_center(all_predictors()) %>% step_scale(all_predictors())iris_recipe
#> ── Recipe ────────────────────────────────#> #> ── Inputs #> Number of variables by role#> outcome: 1#> predictor: 4#> #> ── Operations #> • Correlation filter on: all_predictors()#> • Centering for: all_predictors()#> • Scaling for: all_predictors()
iris_recipe <- recipe(Species ~ ., data = iris) %>% step_corr(all_predictors()) %>% step_center(all_predictors()) %>% step_scale(all_predictors())iris_recipe
#> ── Recipe ────────────────────────────────#> #> ── Inputs #> Number of variables by role#> outcome: 1#> predictor: 4#> #> ── Operations #> • Correlation filter on: all_predictors()#> • Centering for: all_predictors()#> • Scaling for: all_predictors()
复杂的recipe类,View()
它!
recipes
包中共计有 98 个数据预处理步骤函数,示例如下:
[1] "step_arrange" "..." [3] "step_corr" "..." [5] "step_YeoJohnson" "step_zv"
你可以用以下五种方式(及其组合)在 step_*()
中指定拟作用的变量:
dplyr
包的方式,如 starts_with()
all_outcomes()
、all_predictors()
或 has_role()
指定角色的方式all_numeric()
、all_nominal()
或 has_type()
指定类型的方式all_numeric_predictors()
?recipes::selections
iris_recipe <- iris_recipe %>% prep(training(iris_split), verbose = TRUE)
#> oper 1 step corr [training] #> oper 2 step center [training] #> oper 3 step scale [training] #> The retained training set is ~ 0 Mb in memory.
# prep(x, # training = NULL, # fresh = FALSE, # verbose = FALSE, # retain = TRUE, # log_change = FALSE, # strings_as_factors = TRUE, # ...)
iris_recipe <- iris_recipe %>% prep(training(iris_split), verbose = TRUE)
#> oper 1 step corr [training] #> oper 2 step center [training] #> oper 3 step scale [training] #> The retained training set is ~ 0 Mb in memory.
# prep(x, # training = NULL, # fresh = FALSE, # verbose = FALSE, # retain = TRUE, # log_change = FALSE, # strings_as_factors = TRUE, # ...)
iris_recipe
#> ── Recipe ───────────────────────────────────────────────────────#> #> ── Inputs #> Number of variables by role#> outcome: 1#> predictor: 4#> #> ── Training information #> Training data contained 112 data points and no incomplete rows.#> #> ── Operations #> • Correlation filter on: Petal.Length | Trained#> • Centering for: Sepal.Length, Sepal.Width, Petal.Width | Trained#> • Scaling for: Sepal.Length, Sepal.Width, Petal.Width | Trained
# 训练组# 可直接提取准备阶段 retain 的数据iris_training <- iris_recipe %>% bake(new_data = NULL)# bake(object, # new_data, # ..., # composition = "tibble")iris_training
#> # A tibble: 112 × 4#> Sepal.Length Sepal.Width Petal.Width Species #> <dbl> <dbl> <dbl> <fct> #> 1 -0.805 1.09 -1.36 setosa #> 2 -0.212 -1.07 -0.316 versicolor#> 3 0.500 0.607 1.64 virginica #> # ℹ 109 more rows
# 训练组# 可直接提取准备阶段 retain 的数据iris_training <- iris_recipe %>% bake(new_data = NULL)# bake(object, # new_data, # ..., # composition = "tibble")iris_training
#> # A tibble: 112 × 4#> Sepal.Length Sepal.Width Petal.Width Species #> <dbl> <dbl> <dbl> <fct> #> 1 -0.805 1.09 -1.36 setosa #> 2 -0.212 -1.07 -0.316 versicolor#> 3 0.500 0.607 1.64 virginica #> # ℹ 109 more rows
# 测试组# 用 bake() 进行预处理iris_testing <- iris_recipe %>% bake(new_data = testing(iris_split))iris_testing
#> # A tibble: 38 × 4#> Sepal.Length Sepal.Width Petal.Width Species#> <dbl> <dbl> <dbl> <fct> #> 1 -0.924 1.09 -1.36 setosa #> 2 -1.52 0.847 -1.23 setosa #> 3 -1.28 0.847 -1.36 setosa #> # ℹ 35 more rows
parsnip
v1.1.1(A Common API to Modeling and Analysis Functions)
# 随机森林模型的统一 APIiris_rf <- rand_forest( mode = "classification", trees = 100)# rand_forest(# mode = "unknown",# engine = "ranger",# mtry = NULL, trees = NULL,# min_n = NULL)# 设定随机森林模型的计算引擎# show_engines("rand_forest")(iris_rf <- iris_rf %>% set_engine("ranger"))
#> Random Forest Model Specification (classification)#> #> Main Arguments:#> trees = 100#> #> Computational engine: ranger
# 随机森林模型的统一 APIiris_rf <- rand_forest( mode = "classification", trees = 100)# rand_forest(# mode = "unknown",# engine = "ranger",# mtry = NULL, trees = NULL,# min_n = NULL)# 设定随机森林模型的计算引擎# show_engines("rand_forest")(iris_rf <- iris_rf %>% set_engine("ranger"))
#> Random Forest Model Specification (classification)#> #> Main Arguments:#> trees = 100#> #> Computational engine: ranger
# 查看 iris_rf 的类iris_rf %>% class()
#> [1] "rand_forest" "model_spec"
# 查看 iris_rf 的底层结构iris_rf %>% str()
#> List of 7#> $ args :List of 3#> ..$ mtry : language ~NULL#> .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> #> ..$ trees: language ~100#> .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> #> ..$ min_n: language ~NULL#> .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> #> $ eng_args : Named list()#> ..- attr(*, "class")= chr [1:2] "quosures" "list"#> $ mode : chr "classification"#> $ user_specified_mode : logi TRUE#> $ method : NULL#> $ engine : chr "ranger"#> $ user_specified_engine: logi TRUE#> - attr(*, "class")= chr [1:2] "rand_forest" "model_spec"
# 拟合随机森林模型(iris_fit <- iris_rf %>% fit( # ?parsnip::fit.model_spec() formula = Species ~ ., data = iris_training ))
#> parsnip model object#> #> Ranger result#> #> Call:#> ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~100, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) #> #> Type: Probability estimation #> Number of trees: 100 #> Sample size: 112 #> Number of independent variables: 3 #> Mtry: 1 #> Target node size: 10 #> Variable importance mode: none #> Splitrule: gini #> OOB prediction error (Brier s.): 0.0563
# 拟合随机森林模型(iris_fit <- iris_rf %>% fit( # ?parsnip::fit.model_spec() formula = Species ~ ., data = iris_training ))
#> parsnip model object#> #> Ranger result#> #> Call:#> ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~100, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) #> #> Type: Probability estimation #> Number of trees: 100 #> Sample size: 112 #> Number of independent variables: 3 #> Mtry: 1 #> Target node size: 10 #> Variable importance mode: none #> Splitrule: gini #> OOB prediction error (Brier s.): 0.0563
# 查看 iris_fit 的类iris_fit %>% class()
#> [1] "_ranger" "model_fit"
# 查看 iris_fit 的底层结构iris_fit %>% str(max.level = 1)
#> List of 6#> $ lvl : chr [1:3] "setosa" "versicolor" "virginica"#> $ spec :List of 8#> ..- attr(*, "class")= chr [1:2] "rand_forest" "model_spec"#> $ fit :List of 13#> ..- attr(*, "class")= chr "ranger"#> $ preproc :List of 4#> $ elapsed :List of 1#> $ censor_probs: list()#> - attr(*, "class")= chr [1:2] "_ranger" "model_fit"
# 将拟合模型应用于测试组iris_fit %>% predict( new_data = iris_testing )
#> # A tibble: 38 × 1#> .pred_class#> <fct> #> 1 setosa #> 2 setosa #> 3 setosa #> # ℹ 35 more rows
# ? parsnip::predict.model_fit()# predict(object, new_data, # type = NULL, # # "numeric", "class", "prob", # # "conf_int", "pred_int", # # "quantile", "time", # # "hazard", "survival", "raw"# opts = list(), ...)
# 将拟合模型应用于测试组iris_fit %>% predict( new_data = iris_testing )
#> # A tibble: 38 × 1#> .pred_class#> <fct> #> 1 setosa #> 2 setosa #> 3 setosa #> # ℹ 35 more rows
# ? parsnip::predict.model_fit()# predict(object, new_data, # type = NULL, # # "numeric", "class", "prob", # # "conf_int", "pred_int", # # "quantile", "time", # # "hazard", "survival", "raw"# opts = list(), ...)
# 合并原数据集iris_class <- iris_fit %>% predict( new_data = iris_testing ) %>% bind_cols(iris_testing)iris_class %>% glimpse(width = 40)
#> Rows: 38#> Columns: 5#> $ .pred_class <fct> setosa, setosa, s…#> $ Sepal.Length <dbl> -0.9241, -1.5176,…#> $ Sepal.Width <dbl> 1.087, 0.847, 0.8…#> $ Petal.Width <dbl> -1.359, -1.229, -…#> $ Species <fct> setosa, setosa, s…
yardstick
v1.2.0(Tidy Characterizations of Model Performance)
metrics()
iris_class %>% glimpse(width = 40)
#> Rows: 38#> Columns: 5#> $ .pred_class <fct> setosa, setosa, s…#> $ Sepal.Length <dbl> -0.9241, -1.5176,…#> $ Sepal.Width <dbl> 1.087, 0.847, 0.8…#> $ Petal.Width <dbl> -1.359, -1.229, -…#> $ Species <fct> setosa, setosa, s…
iris_class %>% metrics( # 计算模型表现指标 truth = Species, estimate = .pred_class )
#> # A tibble: 2 × 3#> .metric .estimator .estimate#> <chr> <chr> <dbl>#> 1 accuracy multiclass 0.947#> 2 kap multiclass 0.920
metrics()
iris_class %>% glimpse(width = 40)
#> Rows: 38#> Columns: 5#> $ .pred_class <fct> setosa, setosa, s…#> $ Sepal.Length <dbl> -0.9241, -1.5176,…#> $ Sepal.Width <dbl> 1.087, 0.847, 0.8…#> $ Petal.Width <dbl> -1.359, -1.229, -…#> $ Species <fct> setosa, setosa, s…
iris_class %>% metrics( # 计算模型表现指标 truth = Species, estimate = .pred_class )
#> # A tibble: 2 × 3#> .metric .estimator .estimate#> <chr> <chr> <dbl>#> 1 accuracy multiclass 0.947#> 2 kap multiclass 0.920
# 提取各分类的预测概率iris_probs <- iris_fit %>% predict( new_data = iris_testing, type = "prob" ) %>% bind_cols(iris_class)# 模型表现指标:基于分类预测概率iris_probs %>% metrics( truth = Species, estimate = .pred_class, .pred_setosa:.pred_virginica )
#> # A tibble: 4 × 3#> .metric .estimator .estimate#> <chr> <chr> <dbl>#> 1 accuracy multiclass 0.947#> 2 kap multiclass 0.920#> 3 mn_log_loss multiclass 0.199#> 4 roc_auc hand_till 0.979
*_curve()
# 接收器操作特征曲线iris_roc <- iris_probs %>% roc_curve( truth = Species, .pred_setosa:.pred_virginica ) # iris_roc -> a "roc_df"iris_roc %>% autoplot()
*_curve()
# 接收器操作特征曲线iris_roc <- iris_probs %>% roc_curve( truth = Species, .pred_setosa:.pred_virginica ) # iris_roc -> a "roc_df"iris_roc %>% autoplot()
# 增益曲线iris_gain <- iris_probs %>% gain_curve( truth = Species, .pred_setosa:.pred_virginica )# iris_gain # a "gain_df"iris_gain %>% autoplot()
(Case Study)
modeldata::ames %>% str()
#> tibble [2,930 × 74] (S3: tbl_df/tbl/data.frame)#> $ MS_SubClass : Factor w/ 16 levels "One_Story_1946_and_Newer_All_Styles",..: 1 1 1 1 6 6 12 12 12 6 ...#> $ MS_Zoning : Factor w/ 7 levels "Floating_Village_Residential",..: 3 2 3 3 3 3 3 3 3 3 ...#> $ Lot_Frontage : num [1:2930] 141 80 81 93 74 78 41 43 39 60 ...#> $ Lot_Area : int [1:2930] 31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...#> $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...#> $ Alley : Factor w/ 3 levels "Gravel","No_Alley_Access",..: 2 2 2 2 2 2 2 2 2 2 ...#> $ Lot_Shape : Factor w/ 4 levels "Regular","Slightly_Irregular",..: 2 1 2 1 2 2 1 2 2 1 ...#> $ Land_Contour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 2 4 4 ...#> $ Utilities : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...#> $ Lot_Config : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 1 1 5 5 5 5 5 5 ...#> $ Land_Slope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...#> $ Neighborhood : Factor w/ 29 levels "North_Ames","College_Creek",..: 1 1 1 1 7 7 17 17 17 7 ...#> $ Condition_1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 3 3 3 ...#> $ Condition_2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...#> $ Bldg_Type : Factor w/ 5 levels "OneFam","TwoFmCon",..: 1 1 1 1 1 1 5 5 5 1 ...#> $ House_Style : Factor w/ 8 levels "One_and_Half_Fin",..: 3 3 3 3 8 8 3 3 3 8 ...#> $ Overall_Cond : Factor w/ 10 levels "Very_Poor","Poor",..: 5 6 6 5 5 6 5 5 5 5 ...#> $ Year_Built : int [1:2930] 1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...#> $ Year_Remod_Add : int [1:2930] 1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...#> $ Roof_Style : Factor w/ 6 levels "Flat","Gable",..: 4 2 4 4 2 2 2 2 2 2 ...#> $ Roof_Matl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...#> $ Exterior_1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 4 14 15 4 14 14 6 7 6 14 ...#> $ Exterior_2nd : Factor w/ 17 levels "AsbShng","AsphShn",..: 11 15 16 4 15 15 6 7 6 15 ...#> $ Mas_Vnr_Type : Factor w/ 5 levels "BrkCmn","BrkFace",..: 5 4 2 4 4 2 4 4 4 4 ...#> $ Mas_Vnr_Area : num [1:2930] 112 0 108 0 0 20 0 0 0 0 ...#> $ Exter_Cond : Factor w/ 5 levels "Excellent","Fair",..: 5 5 5 5 5 5 5 5 5 5 ...#> $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 2 2 3 3 3 3 3 3 ...#> $ Bsmt_Cond : Factor w/ 6 levels "Excellent","Fair",..: 3 6 6 6 6 6 6 6 6 6 ...#> $ Bsmt_Exposure : Factor w/ 5 levels "Av","Gd","Mn",..: 2 4 4 4 4 4 3 4 4 4 ...#> $ BsmtFin_Type_1 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 2 6 1 1 3 3 3 1 3 7 ...#> $ BsmtFin_SF_1 : num [1:2930] 2 6 1 1 3 3 3 1 3 7 ...#> $ BsmtFin_Type_2 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 7 4 7 7 7 7 7 7 7 7 ...#> $ BsmtFin_SF_2 : num [1:2930] 0 144 0 0 0 0 0 0 0 0 ...#> $ Bsmt_Unf_SF : num [1:2930] 441 270 406 1045 137 ...#> $ Total_Bsmt_SF : num [1:2930] 1080 882 1329 2110 928 ...#> $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...#> $ Heating_QC : Factor w/ 5 levels "Excellent","Fair",..: 2 5 5 1 3 1 1 1 1 3 ...#> $ Central_Air : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...#> $ Electrical : Factor w/ 6 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 5 5 ...#> $ First_Flr_SF : int [1:2930] 1656 896 1329 2110 928 926 1338 1280 1616 1028 ...#> $ Second_Flr_SF : int [1:2930] 0 0 0 0 701 678 0 0 0 776 ...#> $ Gr_Liv_Area : int [1:2930] 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...#> $ Bsmt_Full_Bath : num [1:2930] 1 0 0 1 0 0 1 0 1 0 ...#> $ Bsmt_Half_Bath : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...#> $ Full_Bath : int [1:2930] 1 1 1 2 2 2 2 2 2 2 ...#> $ Half_Bath : int [1:2930] 0 0 1 1 1 1 0 0 0 1 ...#> $ Bedroom_AbvGr : int [1:2930] 3 2 3 3 3 3 2 2 2 3 ...#> $ Kitchen_AbvGr : int [1:2930] 1 1 1 1 1 1 1 1 1 1 ...#> $ TotRms_AbvGrd : int [1:2930] 7 5 6 8 6 7 6 5 5 7 ...#> $ Functional : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 8 8 8 8 ...#> $ Fireplaces : int [1:2930] 2 0 0 2 1 1 0 0 1 1 ...#> $ Garage_Type : Factor w/ 7 levels "Attchd","Basment",..: 1 1 1 1 1 1 1 1 1 1 ...#> $ Garage_Finish : Factor w/ 4 levels "Fin","No_Garage",..: 1 4 4 1 1 1 1 3 3 1 ...#> $ Garage_Cars : num [1:2930] 2 1 1 2 2 2 2 2 2 2 ...#> $ Garage_Area : num [1:2930] 528 730 312 522 482 470 582 506 608 442 ...#> $ Garage_Cond : Factor w/ 6 levels "Excellent","Fair",..: 6 6 6 6 6 6 6 6 6 6 ...#> $ Paved_Drive : Factor w/ 3 levels "Dirt_Gravel",..: 2 3 3 3 3 3 3 3 3 3 ...#> $ Wood_Deck_SF : int [1:2930] 210 140 393 0 212 360 0 0 237 140 ...#> $ Open_Porch_SF : int [1:2930] 62 0 36 0 34 36 0 82 152 60 ...#> $ Enclosed_Porch : int [1:2930] 0 0 0 0 0 0 170 0 0 0 ...#> $ Three_season_porch: int [1:2930] 0 0 0 0 0 0 0 0 0 0 ...#> $ Screen_Porch : int [1:2930] 0 120 0 0 0 0 0 144 0 0 ...#> $ Pool_Area : int [1:2930] 0 0 0 0 0 0 0 0 0 0 ...#> $ Pool_QC : Factor w/ 5 levels "Excellent","Fair",..: 4 4 4 4 4 4 4 4 4 4 ...#> $ Fence : Factor w/ 5 levels "Good_Privacy",..: 5 3 5 5 3 5 5 5 5 5 ...#> $ Misc_Feature : Factor w/ 6 levels "Elev","Gar2",..: 3 3 2 3 3 3 3 3 3 3 ...#> $ Misc_Val : int [1:2930] 0 0 12500 0 0 0 0 0 0 0 ...#> $ Mo_Sold : int [1:2930] 5 6 6 4 3 6 4 1 3 6 ...#> $ Year_Sold : int [1:2930] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...#> $ Sale_Type : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 10 10 10 10 10 ...#> $ Sale_Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...#> $ Sale_Price : int [1:2930] 215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...#> $ Longitude : num [1:2930] -93.6 -93.6 -93.6 -93.6 -93.6 ...#> $ Latitude : num [1:2930] 42.1 42.1 42.1 42.1 42.1 ...
rsample
包set.seed(1234)data_split <- initial_split( ames, strata = "Sale_Price")ames_train <- training(data_split)ames_test <- testing(data_split)nrow(ames_train) / nrow(ames)
#> [1] 0.75
rsample
包set.seed(1234)data_split <- initial_split( ames, strata = "Sale_Price")ames_train <- training(data_split)ames_test <- testing(data_split)nrow(ames_train) / nrow(ames)
#> [1] 0.75
# v折交叉验证(cv_splits <- vfold_cv(ames_train))
#> # 10-fold cross-validation #> # A tibble: 10 × 2#> splits id #> <list> <chr> #> 1 <split [1977/220]> Fold01#> 2 <split [1977/220]> Fold02#> 3 <split [1977/220]> Fold03#> # ℹ 7 more rows
cv_splits$splits[[1]]
#> <Analysis/Assess/Total>#> <1977/220/2197>
cv_splits$splits[[1]] %>% analysis() %>% dim()
#> [1] 1977 74
vignette("Common_Patterns", package = "rsample")
recipes
包# 1. 定义数据预处理菜单ames_rec <- recipe(Sale_Price ~ Bldg_Type + Neighborhood + Year_Built + Gr_Liv_Area + Full_Bath + Year_Sold + Lot_Area + Central_Air + Longitude + Latitude, data = ames_train) %>% step_log(Sale_Price, base = 10) %>% step_BoxCox(Lot_Area, Gr_Liv_Area) %>% step_other(Neighborhood, threshold = 0.05) %>% step_dummy(all_nominal()) %>% step_interact(~ starts_with("Central_Air"):Year_Built) %>% step_bs(Longitude, Latitude, options = list(df = 5))ames_rec %>% tidy()
#> # A tibble: 6 × 6#> number operation type trained skip id #> <int> <chr> <chr> <lgl> <lgl> <chr> #> 1 1 step log FALSE FALSE log_YNFHa #> 2 2 step BoxCox FALSE FALSE BoxCox_vCuOq#> 3 3 step other FALSE FALSE other_0nogs #> # ℹ 3 more rows
recipes
包# 2. 准备数据预处理cv_splits <- cv_splits %>% mutate( recipes = map(splits, prepper, recipe = ames_rec) )cv_splits
#> # 10-fold cross-validation #> # A tibble: 10 × 3#> splits id recipes #> <list> <chr> <list> #> 1 <split [1977/220]> Fold01 <recipe>#> 2 <split [1977/220]> Fold02 <recipe>#> 3 <split [1977/220]> Fold03 <recipe>#> # ℹ 7 more rows
# prepper# function(split_obj, recipe, ...) {# prep(recipe, # training = rsample::analysis(split_obj), # fresh = TRUE, ...)# }
recipes
包# 2. 准备数据预处理cv_splits <- cv_splits %>% mutate( recipes = map(splits, prepper, recipe = ames_rec) )cv_splits
#> # 10-fold cross-validation #> # A tibble: 10 × 3#> splits id recipes #> <list> <chr> <list> #> 1 <split [1977/220]> Fold01 <recipe>#> 2 <split [1977/220]> Fold02 <recipe>#> 3 <split [1977/220]> Fold03 <recipe>#> # ℹ 7 more rows
# prepper# function(split_obj, recipe, ...) {# prep(recipe, # training = rsample::analysis(split_obj), # fresh = TRUE, ...)# }
# 相同的菜单 + # 不同的训练数据集 -> # 不同的准备结果cv_splits$recipes %>% map(\(x) tidy(x, number = 2)) %>% list_rbind() %>% filter(terms == "Lot_Area")
#> # A tibble: 10 × 3#> terms value id #> <chr> <dbl> <chr> #> 1 Lot_Area 0.0836 BoxCox_vCuOq#> 2 Lot_Area 0.0777 BoxCox_vCuOq#> 3 Lot_Area 0.104 BoxCox_vCuOq#> # ℹ 7 more rows
# ?recipes:::tidy.recipe
parsnip
包parsnip
包# 1. 设定模型knn_mod <- nearest_neighbor( mode = "regression", neighbors = 5) %>% set_engine("kknn") # 默认,可不设置# 2. 拟合模型# 2.1 定义辅助函数parsnip_fit <- function(rec, model) { fit(model, Sale_Price ~ ., data = bake(rec, new_data= NULL))}# 2.2 应用辅助函数cv_splits <- cv_splits %>% mutate( knn = map( recipes, parsnip_fit, model = knn_mod ) )
parsnip
包# 1. 设定模型knn_mod <- nearest_neighbor( mode = "regression", neighbors = 5) %>% set_engine("kknn") # 默认,可不设置# 2. 拟合模型# 2.1 定义辅助函数parsnip_fit <- function(rec, model) { fit(model, Sale_Price ~ ., data = bake(rec, new_data= NULL))}# 2.2 应用辅助函数cv_splits <- cv_splits %>% mutate( knn = map( recipes, parsnip_fit, model = knn_mod ) )
# 查看结果cv_splits %>% select(-id)
#> # A tibble: 10 × 3#> splits recipes knn #> <list> <list> <list> #> 1 <split [1977/220]> <recipe> <fit[+]>#> 2 <split [1977/220]> <recipe> <fit[+]>#> 3 <split [1977/220]> <recipe> <fit[+]>#> # ℹ 7 more rows
cv_splits$knn[[1]] %>% str(max.level = 1)
#> List of 6#> $ lvl : NULL#> $ spec :List of 7#> ..- attr(*, "class")= chr [1:2] "nearest_neighbor" "model_spec"#> $ fit :List of 10#> ..- attr(*, "class")= chr [1:2] "train.kknn" "kknn"#> $ preproc :List of 1#> $ elapsed :List of 1#> $ censor_probs: list()#> - attr(*, "class")= chr [1:2] "_train.kknn" "model_fit"
parsnip
包 + yardstick
包parsnip
包 + yardstick
包# 定义模型表现函数pf_metrics <- metric_set(rmse, rsq, ccc)# 定义辅助函数parsnip_metrics <- function(split, recipe, model) { raw <- assessment(split) processed <- bake(recipe, new_data = raw) model %>% predict(new_data = processed) %>% bind_cols(processed) %>% pf_metrics(Sale_Price, .pred) }# 应用辅助函数cv_splits <- cv_splits %>% mutate( metrics = pmap( list( split = splits, recipe = recipes, model = knn ), parsnip_metrics ) )
parsnip
包 + yardstick
包# 定义模型表现函数pf_metrics <- metric_set(rmse, rsq, ccc)# 定义辅助函数parsnip_metrics <- function(split, recipe, model) { raw <- assessment(split) processed <- bake(recipe, new_data = raw) model %>% predict(new_data = processed) %>% bind_cols(processed) %>% pf_metrics(Sale_Price, .pred) }# 应用辅助函数cv_splits <- cv_splits %>% mutate( metrics = pmap( list( split = splits, recipe = recipes, model = knn ), parsnip_metrics ) )
cv_splits$metrics[[1]]
#> # A tibble: 3 × 3#> .metric .estimator .estimate#> <chr> <chr> <dbl>#> 1 rmse standard 0.0839#> 2 rsq standard 0.761 #> 3 ccc standard 0.871
# 汇总计算模型表现指标cv_splits %>% unnest(cols = metrics) %>% group_by(.metric) %>% summarise( mean = mean(.estimate), std_err = sd(.estimate)/sqrt(n()) )
#> # A tibble: 3 × 3#> .metric mean std_err#> <chr> <dbl> <dbl>#> 1 ccc 0.869 0.00519#> 2 rmse 0.0870 0.00245#> 3 rsq 0.764 0.00889
# 定义辅助函数m_metrics <- function(split, recipe, fit) { holdout <- bake( recipe, new_data = assessment(split)) resample_name <- labels(split)$id multi_predict(fit, new_data = holdout, neighbors = 1:50) %>% bind_cols(., holdout %>% select(Sale_Price)) %>% unnest(cols = .pred) %>% group_by(neighbors) %>% pf_metrics(Sale_Price, .pred) %>% mutate(resample = resample_name)}# 应用辅助函数cv_splits <- cv_splits %>% mutate( tuning_metrics = pmap( list(splits, recipes, knn), m_metrics ) )
# 定义辅助函数m_metrics <- function(split, recipe, fit) { holdout <- bake( recipe, new_data = assessment(split)) resample_name <- labels(split)$id multi_predict(fit, new_data = holdout, neighbors = 1:50) %>% bind_cols(., holdout %>% select(Sale_Price)) %>% unnest(cols = .pred) %>% group_by(neighbors) %>% pf_metrics(Sale_Price, .pred) %>% mutate(resample = resample_name)}# 应用辅助函数cv_splits <- cv_splits %>% mutate( tuning_metrics = pmap( list(splits, recipes, knn), m_metrics ) )
cv_splits$tuning_metrics[[1]]
#> # A tibble: 150 × 5#> neighbors .metric .estimator .estimate resample#> <int> <chr> <chr> <dbl> <chr> #> 1 1 rmse standard 0.101 Fold01 #> 2 2 rmse standard 0.0951 Fold01 #> 3 3 rmse standard 0.0898 Fold01 #> # ℹ 147 more rows
# 汇总计算模型表现指标(rs_rmse <- cv_splits %>% unnest(tuning_metrics) %>% filter(.metric == "rmse") %>% group_by(neighbors) %>% summarize(rmse = mean(.estimate)))
#> # A tibble: 50 × 2#> neighbors rmse#> <int> <dbl>#> 1 1 0.104 #> 2 2 0.0972#> 3 3 0.0921#> # ℹ 47 more rows
(min_rmse <- rs_rmse %>% arrange(rmse) %>% slice(1))
#> # A tibble: 1 × 2#> neighbors rmse#> <int> <dbl>#> 1 11 0.0837
ggplot(rs_rmse, aes(x = neighbors, y = rmse)) + geom_point() + geom_path() + geom_point(data = min_rmse, size = 3, color = "red")
(min_rmse <- rs_rmse %>% arrange(rmse) %>% slice(1))
#> # A tibble: 1 × 2#> neighbors rmse#> <int> <dbl>#> 1 11 0.0837
ggplot(rs_rmse, aes(x = neighbors, y = rmse)) + geom_point() + geom_path() + geom_point(data = min_rmse, size = 3, color = "red")
# 最终模型ames_rec_final <- prep(ames_rec)knn_mod_final <- update(knn_mod, neighbors = 11) %>% fit(Sale_Price ~ ., data = bake(ames_rec_final, new_data = NULL))# 评估最终模型在测试集上的表现parsnip_metrics(data_split, ames_rec_final, knn_mod_final)
#> # A tibble: 3 × 3#> .metric .estimator .estimate#> <chr> <chr> <dbl>#> 1 rmse standard 0.0754#> 2 rsq standard 0.813 #> 3 ccc standard 0.894
workflows
v1.1.3 + tune
v1.1.2# 重新定义模型:设定需调校的参数knn_mod <- nearest_neighbor( mode = "regression", neighbors = tune()) %>% set_engine("kknn")# 定义流程ames_wf <- workflow() %>% add_model(knn_mod) %>% add_recipe(ames_rec)# 基于参数网格计算模型表现指标ames_res <- ames_wf %>% tune_grid( cv_splits %>% select(splits, id), grid = tibble(neighbors = 1:50), metrics = metric_set(rmse, rsq, ccc) )
workflows
v1.1.3 + tune
v1.1.2# 重新定义模型:设定需调校的参数knn_mod <- nearest_neighbor( mode = "regression", neighbors = tune()) %>% set_engine("kknn")# 定义流程ames_wf <- workflow() %>% add_model(knn_mod) %>% add_recipe(ames_rec)# 基于参数网格计算模型表现指标ames_res <- ames_wf %>% tune_grid( cv_splits %>% select(splits, id), grid = tibble(neighbors = 1:50), metrics = metric_set(rmse, rsq, ccc) )
ames_res
#> # Tuning results#> # 10-fold cross-validation #> # A tibble: 10 × 4#> splits id .metrics .notes #> <list> <chr> <list> <list> #> 1 <split [1977/220]> Fold01 <tibble [150 × 5]> <tibble [0 × 3]>#> 2 <split [1977/220]> Fold02 <tibble [150 × 5]> <tibble [0 × 3]>#> 3 <split [1977/220]> Fold03 <tibble [150 × 5]> <tibble [1 × 3]>#> # ℹ 7 more rows#> #> There were issues with some computations:#> #> - Warning(s) x2: some 'x' values beyond boundary knots may cause ill-conditioned bases#> #> Run `show_notes(.Last.tune.result)` for more information.
ames_res %>% collect_metrics() %>% glimpse(width = 40)
#> Rows: 150#> Columns: 7#> $ neighbors <int> 1, 1, 1, 2, 2, 2, 3…#> $ .metric <chr> "ccc", "rmse", "rsq…#> $ .estimator <chr> "standard", "standa…#> $ mean <dbl> 0.8276, 0.1042, 0.6…#> $ n <int> 10, 10, 10, 10, 10,…#> $ std_err <dbl> 0.00943, 0.00302, 0…#> $ .config <chr> "Preprocessor1_Mode…
workflows
v1.1.3 + tune
v1.1.2ames_res %>% collect_metrics() %>% filter(.metric == "rmse") %>% ggplot(aes(x = neighbors, y = mean)) + geom_point() + geom_line() + geom_point(size = 2, color = 'red', data = . %>% slice_min(mean)) + ylab("rmse")
workflows
v1.1.3 + tune
v1.1.2(best_knn <- ames_res %>% select_best("rmse"))
#> # A tibble: 1 × 2#> neighbors .config #> <int> <chr> #> 1 11 Preprocessor1_Model11
ames_fit_test <- ames_wf %>% finalize_workflow(best_knn) %>% last_fit( split = data_split, metrics = metric_set(rmse, rsq, ccc) )ames_fit_test %>% glimpse()
#> Rows: 1#> Columns: 6#> $ splits <list> [<initial_split[2197 x 733 x 2930 x 74]>]#> $ id <chr> "train/test split"#> $ .metrics <list> [<tbl_df[3 x 4]>]#> $ .notes <list> [<tbl_df[1 x 3]>]#> $ .predictions <list> [<tbl_df[733 x 4]>]#> $ .workflow <list> [Bldg_Type, Neighborhood, Year_Built, Gr_Liv_Area, Full_Bath, Year…
workflows
v1.1.3 + tune
v1.1.2ames_fit_test %>% collect_metrics()
#> # A tibble: 3 × 4#> .metric .estimator .estimate .config #> <chr> <chr> <dbl> <chr> #> 1 rmse standard 0.0754 Preproce…#> 2 rsq standard 0.813 Preproce…#> 3 ccc standard 0.894 Preproce…
workflows
v1.1.3 + tune
v1.1.2ames_fit_test %>% collect_metrics()
#> # A tibble: 3 × 4#> .metric .estimator .estimate .config #> <chr> <chr> <dbl> <chr> #> 1 rmse standard 0.0754 Preproce…#> 2 rsq standard 0.813 Preproce…#> 3 ccc standard 0.894 Preproce…
ames_fit_test %>% collect_predictions() %>% ggplot(aes(Sale_Price, .pred)) + geom_abline(lty = 2, size = 1.5, color = "gray50") + geom_point(alpha = 0.5) + coord_fixed()
tidymodels
家族!tidymodels
家族!在 📑 qmd 中键入并运行本讲的代码(自觉完成,无需提交)
登录 {{tidymodels
}} 的主页 🏠,浏览并学习相关资料
tidymodels
子包的网站 https://<子包名>.tidymodels.org/
tidymodels
团队撰写的书 📖,供感兴趣的同学学习精进: