class: center, middle, inverse, title-slide

.title[
# Quantitative Finance and Financial Programming
]
.subtitle[
## L4 dplyr .font80[1.1.3] Data Manipulation | Pre-class Reading
]
.author[
### 曾永艺
]
.institute[
### School of Management, Xiamen University
]
.date[
### 2023-10-12
]

---
class: middle, hide_logo
background-image: url(imgs/logo-dplyr.png)
background-size: 25%
background-position: 19% 30%
.pull-left.font120.bold.center[
<br><br><br><br><br><br><br><br><br><br>
_A Grammar of <br>Data Manipulation_
<br><br>
]

--

.pull-right.font150.bold[
<br>
1. Working with observations
2. Working with variables
3. Summaries
4. Grouped operations
5. Connecting operations with `%>%`
]

---

```r
library(tidyverse)
library(nycflights13)
data(package = "nycflights13") # 5 datasets: airlines, airports, flights, planes, weather
```

--

```r
flights # print() it
```

```
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
#> 1  2013     1     1      517            515         2      830            819        11
#> 2  2013     1     1      533            529         4      850            830        20
#> 3  2013     1     1      542            540         2      923            850        33
#> # ℹ 336,773 more rows
#> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
```

--

> * `A tibble: 336,776 x 19`
> * `<int>` | `<dbl>` | `<chr>` | `<dttm>` | `<lgl>` | `<fct>` | `<date>` mean that the variable is an integer | double | character | date-time | logical | factor | date vector, respectively

--

```r
?flights # open the help page of the flights dataset to learn more, e.g. how the variables are defined
```

---

```r
glimpse(flights) # a quick glimpse of the data
```

```
#> Rows: 336,776
#> Columns: 19
#> $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,…
#> $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, 558, …
#> $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, 600, 600, …
#> $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, -1, 0…
#> $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853, 924,…
#> $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851, 856, 917,…
#> $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31, -4,…
#> $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B6",…
#> $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71, 194…
#> $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "N516…
#> $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK", "L…
#> $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO", "O…
#> $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 158, 345, 3…
#> $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, 1028, 1005…
#> $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6, 6,…
#> $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0, 0, 0, 0,…
#> $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2…
```

--

```r
View(flights) # open the dataset in the RStudio data viewer
# tibble::view() calls utils::View() and invisibly returns the original dataset,
# which is convenient inside a %>% chain, but it seems to be much slower
```

---
layout: false
class: hide_logo

## .font150[🤔] Think about it

.font150[Given a dataset / data table like the one below, made up of **rows** (observations) and **columns** (variables), what kinds of operations would we want to perform on it?]

.font80[
]

---
class: inverse, center, middle

# 1. Working with Observations .font150[(manipulate cases)]

---
layout: true

### >> Filtering observations: `filter()`

---

.full-width[.content-box-blue.bold.font120[`filter(.data, ...)`: extracts the observations in dataset `.data` whose variable values satisfy the given conditions]]

--

```r
filter(flights, month == 1, day == 1)
*# Note: a condition uses ==, not =
```

```
#> # A tibble: 842 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
#> 1  2013     1     1      517            515         2      830            819        11
#> 2  2013     1     1      533            529         4      850            830        20
#> 3  2013     1     1      542            540         2      923            850        33
#> # ℹ 839 more rows
#> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
```

--

```r
# The same thing in base R
flights[flights$month == 1 & flights$day == 1, ]
subset(flights, month == 1 & day == 1)
```

```r
flights[month == 1 & day == 1, ] # Note: this is wrong
```

```
#> Error in month == 1: comparison (==) is possible only for atomic and list types
```

---

.full-width[.content-box-blue.bold.font120.note[Functions in dplyr (such as `filter()`) do not modify the input dataset `.data` in place]]

--

.full-width[.content-box-blue.bold.font120.warning[You have to save the modified dataset yourself 💾]]

```r
dec25 <- filter(flights, month == 12, day == 25)
dec25
```

```
*#> # A tibble: 719 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
#> 1  2013    12    25      456            500        -4      649            651        -2
#> 2  2013    12    25      524            515         9      805            814        -9
#> 3  2013    12    25      542            540         2      832            850       -18
#> # ℹ 716 more rows
#> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
```

```r
flights # the input dataset is unchanged
```

```
*#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
#> 1  2013     1     1      517            515         2      830            819        11
#> 2  2013     1     1      533            529         4      850            830        20
#> 3  2013     1     1      542            540         2      923            850        33
#> # ℹ 336,773 more rows
#> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
```

---

.full-width[.content-box-blue.bold.font120.note[Comparison and logical operators used with `filter()`]]

```r
1. < > <= >= == !=     # ?Comparison
2. & | ! xor()         # ?base::Logic
3. Others, such as %in%, is.na(), between(), near()
```

--

.full-width[.content-box-blue.bold.font120.note[`filter()` combines multiple condition arguments with `&` by default, ...]]

--

```r
filter(flights, month >= 11, day == 25)
# is equivalent to
filter(flights, month >= 11 & day == 25)
```

--

.full-width[.content-box-blue.bold.font120.note[... any other logical combination (such as `|`) has to be written out explicitly]]

---
layout: true

### >> Filtering observations: other functions

---

.font120[
* `slice(.data, ..., .by = NULL, .preserve = FALSE)`: selects observations by the index positions given as integer vectors; positive (.red[negative]) integers mark observations to keep (.red[drop]), e.g. `slice(mtcars, 5:n())`
* `slice_head(.data, ..., n, prop, by = NULL)` and `slice_tail()` select observations from the start / end of the dataset    .red[vs. `utils::head() / tail()`?]
* `slice_sample(.data, ..., n, prop, by = NULL, weight_by = NULL, replace = FALSE)` randomly samples observations
* `slice_min(.data, order_by, ..., n, prop, by = NULL, with_ties = TRUE, na_rm = FALSE)` and `slice_max()` select the observations with the smallest or largest values of the variable (or function of variables) given by `order_by`
* `distinct(.data, ..., .keep_all = FALSE)`: removes observations with duplicated values (of the given variables or functions of them)    .red[≈ `base::unique()`]
]

--

.footnote.red[Note: since dplyr<sup>v1.0.0</sup>, `top_n()`, `top_frac()`, `sample_n()` and `sample_frac()` have been superseded by the corresponding `slice_*()` functions]

---
layout: true

### >> Sorting observations: `arrange()`

---

.full-width[.content-box-blue.bold.font120[`arrange(.data, ...)`: sorts the observations of dataset `.data` by the values of the given variables]]

--

```r
arrange(flights, year, month, day, dep_time)
```

```
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
#> 1  2013     1     1      517            515         2      830            819        11
#> 2  2013     1     1      533            529         4      850            830        20
#> 3  2013     1     1      542            540         2      923            850        33
#> # ℹ 336,773 more rows
#> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
```

--

```r
arrange(flights, desc(dep_delay)) # wrap a variable in desc() to reverse the order: largest first
```

```
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
#> 1  2013     1     9      641            900      1301     1242           1530      1272
#> 2  2013     6    15     1432           1935      1137     1607           2120      1127
#> 3  2013     1    10     1121           1635      1126     1239           1810      1109
#> # ℹ 336,773 more rows
#> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
```

---

.full-width[.content-box-blue.bold.font120.note[Unlike the other functions in dplyr, `arrange(.data, ..., .by_group = FALSE)` ignores the grouping of the dataset unless the grouping variables are included explicitly or `.by_group = TRUE` is set]]

--

.full-width[.content-box-blue.bold.font120.note[Missing values are always sorted to the end <sup>.red[*]</sup> ]]

--

.pull-left[
```r
df <- 
tibble(x = c(1, 3, 2, NA))
arrange(df, x)
```

```
#> # A tibble: 4 × 1
#>       x
#>   <dbl>
#> 1     1
#> 2     2
*#> 3     3
#> 4    NA
```
]

--

.pull-right[
```r
df <- tibble(x = c(1, 3, 2, NA))
arrange(df, desc(x))
```

```
#> # A tibble: 4 × 1
#>       x
#>   <dbl>
#> 1     3
#> 2     2
*#> 3     1
#> 4    NA
```
]

--

.footnote.red[*: `base::sort()` and `base::order()` use the `na.last` argument to decide where missing values go (or whether they are dropped), and the `decreasing` argument to set the sort direction.]

---
layout: false
class: inverse, center, middle

# 2. Working with Variables .font150[(manipulate variables)]

---
layout: true

### >> Selecting variables: `select(.data, ...)`

---

```r
select(flights, month, day, dep_time, sched_dep_time, dep_delay) # enumerate by name, no "" needed
```

```
#> # A tibble: 336,776 × 5
#>   month   day dep_time sched_dep_time dep_delay
#>   <int> <int>    <int>          <int>     <dbl>
#> 1     1     1      517            515         2
#> 2     1     1      533            529         4
#> 3     1     1      542            540         2
#> # ℹ 336,773 more rows
```

--

```r
select(flights, 2, 3, 4, 5, 6) # enumerate by column position: same result, but not recommended
```

--

```r
select(flights, month:dep_delay) # use : to select a run of adjacent variables
select(flights, 2:6)
```

--

```r
select(flights, !(month:dep_delay)) # a leading ! or - means "drop these variables"
```

```
#> # A tibble: 336,776 × 14
#>    year arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time
#>   <int>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>
#> 1  2013      830            819        11 UA        1545 N14228  EWR    IAH        227
#> 2  2013      850            830        20 UA        1714 N24211  LGA    IAH        227
#> 3  2013      923            850        33 AA        1141 N619AA  JFK    MIA        160
#> # ℹ 336,773 more rows
#> # ℹ 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

---

.full-width[.content-box-blue.bold.font110.note[Helper functions for `select()`, now split out into the `tidyselect` package; see `?select_helpers`]]

1. `starts_with("abc")`: variables whose names start with `abc`
2. `ends_with("xyz")`: variables whose names end with `xyz`
3. `contains("ijk")`: variables whose names contain `ijk`
4. `matches("(.)\\1")`: variables whose names contain a repeated character
5. `num_range("x", 1:3)`: the variables `x1`, `x2` and `x3`
6. `any_of(x) | all_of(x)`: the variables at the positions given by integer vector `x`, or named directly by *character vector* `x` (`all_of()` errors on a missing name, `any_of()` silently skips it)
7. `last_col(offset = 0L)`: the variable that is `offset+1` positions from the end
8. `everything()`: all variables, usually placed last
9. `where(fn)`: variables for which the predicate function `fn` returns `TRUE`, e.g. `select(data, where(is.integer))`

--

.full-width[.content-box-blue.bold.font110.note[`select()`: the different approaches can be mixed]]

```r
select(flights, year:day, ends_with("_delay") | starts_with("dep_"), tailnum)
```

```
#> # A tibble: 336,776 × 7
#>    year month   day dep_delay arr_delay dep_time tailnum
#>   <int> <int> <int>     <dbl>     <dbl>    <int> <chr>
#> 1  2013     1     1         2        11      517 N14228
#> 2  2013     1     1         4        20      533 N24211
#> 3  2013     1     1         2        33      542 N619AA
#> # ℹ 336,773 more rows
```

---
layout: false

### >> Renaming variables: `select()`, `rename()` and `rename_with()`

```r
select(flights, nian = year, yue = month, ri = day) # rename variables while selecting them
```

```
#> # A tibble: 336,776 × 3
#>    nian   yue    ri
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> # ℹ 336,773 more rows
```

--

```r
# select() keeps only the variables you name, whereas rename(.data, ...) keeps all of them
rename(flights, nian = year, yue = month, ri = day) %>% dim()
```

```
#> [1] 336776     19
```

--

```r
# rename_with(.data, .fn, .cols = everything(), ...)
rename_with(flights, toupper, 1:3)
```

```
#> # A tibble: 336,776 × 19
#>    YEAR MONTH   DAY dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
#> 1  2013     1     1      517            515         2      830            819        11
#> 2  2013     1     1      533            529         4      850            830        20
#> 3  2013     1     1      542            540         2      923            850        33
#> # ℹ 336,773 more rows
#> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
```

---
layout: false

### >> Reordering variables: `select()` and `relocate()`

```r
select(flights, dest, year:day, ends_with("_delay"), everything())
```

```
#> # A tibble: 336,776 × 19
#>   dest   year month   day dep_delay arr_delay dep_time sched_dep_time arr_time
#>   <chr> <int> <int> <int>     <dbl>     <dbl>    <int>          <int>    <int>
#> 1 IAH    2013     1     1         2        11      517            515      830
#> 2 IAH    2013     1     1         4        20      533            529      850
#> 3 MIA    2013     1     1         2        33      542            540      923
#> # ℹ 336,773 more rows
#> # ℹ 10 more variables: sched_arr_time <int>, carrier <chr>, 
flight <int>, tailnum <chr>,
#> #   origin <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
```

--

```r
relocate(flights, dest, year:day, ends_with("_delay")) # same result as above
```

--

```r
# relocate(.data, ..., .before = NULL, .after = NULL)
relocate(flights, ends_with("_delay"), .after = day)
```

```
#> # A tibble: 336,776 × 19
#>    year month   day dep_delay arr_delay dep_time sched_dep_time arr_time sched_arr_time
#>   <int> <int> <int>     <dbl>     <dbl>    <int>          <int>    <int>          <int>
#> 1  2013     1     1         2        11      517            515      830            819
#> 2  2013     1     1         4        20      533            529      850            830
#> 3  2013     1     1         2        33      542            540      923            850
#> # ℹ 336,773 more rows
#> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
```

---
layout: true

### >> Creating new variables: `mutate()`

---

.full-width[.content-box-blue.bold.font120[`mutate(.data, ...)`: creates new variables .red[<sup> *</sup>]]]

--

```r
flights_sml <- select(flights, year:day, ends_with("_delay"), air_time)
```

--

```r
mutate(flights_sml,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours # newly created variables can be used right away
)
```

```
#> # A tibble: 336,776 × 9
#>    year month   day dep_delay arr_delay air_time  gain hours gain_per_hour
#>   <int> <int> <int>     <dbl>     <dbl>    <dbl> <dbl> <dbl>         <dbl>
#> 1  2013     1     1         2        11      227     9  3.78          2.38
#> 2  2013     1     1         4        20      227    16  3.78          4.23
#> 3  2013     1     1         2        33      160    31  2.67         11.6
#> # ℹ 336,773 more rows
```

--

.footnote.red[*: 1. The full signature is `mutate(.data, ..., .by = NULL, .keep = c("all", "used", "unused", "none"), .before = NULL, .after = NULL)` (`.keep`, `.before` and `.after` were added in v1.0.0, and the experimental `.by` followed in v1.1.0);<br>2. 
If you only want to keep the newly created variables, ~~use `transmute()` or~~ set `.keep = "none"` in `mutate()`.]

---

.full-width[.content-box-blue.bold.font120[`mutate()`: works with vectorized functions <sup>.red[*]</sup>]]

.code75[
```r
*MATH
+, - , *, /, ^, %/%, %%     # arithmetic ops
log(), log2(), log10()      # logs
<, <=, >, >=, !=, ==        # logical comparisons
*CUMULATIVE AGGREGATES      # vignette("window-functions")
dplyr::cumall()|cumany()    # cumulative all() | any()
cummax()|cummin()           # cumulative max() | min()
dplyr::cummean()            # cumulative mean()
cumprod()|cumsum()          # cumulative prod() | sum()
*OFFSETS
dplyr::lag()|lead()         # offset elements by 1 | -1
*RANKINGS                   # ?ranking
dplyr::min_rank()           # rank with ties = min
dplyr::ntile()              # bins into n bins
dplyr::row_number()         # rank with ties = "first"
*MISC
pmax()|pmin()               # element-wise max() | min()
dplyr::recode()             # vectorized switch()
dplyr::if_else()            # vectorized if() + else()
dplyr::case_when()          # multi-case if_else()
```
]

.footnote.red[*: Summary functions that return a "scalar", such as `mean()`, are supported as well; the scalar is recycled to the required length.]

---
layout: false
class: inverse, center, middle

# 3. Summaries .font150[(summarize)]

---
layout: true

### >> Summarizing: `summarize()`

---

.font120[
- `summarize(.data, ..., .by = NULL, .groups = NULL)` creates a new data frame with one column per summary function and one row per group;
- if `.data` is a [grouped data frame 👇](#56), each grouping variable also gets a column; in that case the experimental argument `.groups = c("drop_last", "drop", "keep", "rowwise")`, added in dplyr<sup>v1.0.0</sup>, controls the grouping structure of the result;
- for a one-off grouped summary of `.data` (without keeping the grouping structure), use the experimental `.by` argument added in dplyr<sup>v1.1.0</sup>.
]

```r
summarize(
  flights,
  mean_delay = mean(dep_delay, na.rm = TRUE),
  sd_delay = sd(dep_delay, na.rm = TRUE)
)
```

```
#> # A tibble: 1 × 2
#>   mean_delay sd_delay
#>        <dbl>    <dbl>
#> 1       12.6     40.2
```

---

.font120.note[`summarize()` works with summary functions that return a "scalar"<sup>.red[*]</sup>, for example:]

```r
*COUNTS
dplyr::n()            # number of values/rows
dplyr::n_distinct()   # number of uniques
sum(!is.na())         # number of non-NA’s
*LOCATION
mean() | median()     # mean | median
*POSITION/ORDER
dplyr::first()        # first value
dplyr::last()         # last value
dplyr::nth()          # value in n-th location of vector
*RANK
quantile()            # nth quantile
min() | max()         # minimum value | maximum value
*SPREAD
IQR()                 # Inter-Quartile Range
mad()                 # median absolute deviation
sd()                  # standard deviation
var()                 # variance
```

.footnote.red[*: dplyr<sup>v1.0.0</sup> made `summarize()` more flexible (it may return a vector with several elements, or even a multi-column data frame), but in that situation dplyr<sup>v1.1.0</sup> prompts you to switch to the experimental `reframe()`.]

---
layout: false
class: inverse, center, middle

# 4. Grouped Operations .font150[(grouping)]

---
layout: true

### >> Grouped operations: `group_by()`

---

- `group_by(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))` turns a data frame (or an extension of one) into a grouped data frame (`grouped_df`)

--

.pull-left[
```r
by_day <- group_by(flights, year, month, day)
class(by_day)
```

```
#> [1] "grouped_df" "tbl_df"     "tbl"
#> [4] "data.frame"
```

```r
by_day
```

```
#> # A tibble: 336,776 × 19
*#> # Groups:   year, month, day [365]
#>    year month   day dep_time sched_dep_time
#>   <int> <int> <int>    <int>          <int>
#> 1  2013     1     1      517            515
#> 2  2013     1     1      533            529
#> 3  2013     1     1      542            540
#> # ℹ 336,773 more rows
#> # ℹ 14 more variables: dep_delay <dbl>,
#> #   arr_time <int>, sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, …
```
]

--

.pull-right[
.font100.bold.note[Functions for retrieving grouping metadata]

```r
by_day %>% group_vars()
```

```
#> [1] "year"  "month" "day"
```

```r
by_day %>% group_data()
```

```
#> # A tibble: 365 × 4
#>    year month   day       .rows
#>   <int> <int> <int> <list<int>>
#> 1  2013     1     1       [842]
#> 2  2013     1     2       [943]
#> 3  2013     1     3       [914]
#> # ℹ 362 more rows
```

```r
# group_keys() /
# group_rows() / group_indices()
# group_size() / n_groups()
```
]

---

- The grouping set by `group_by()` changes how subsequent dplyr functions such as `mutate()`, `summarize()`, `filter()` and `slice()` operate; if later steps should not be carried out by group, first call `ungroup(x, ...)` to remove the grouping of dataset `x` (for the variables given in `...`)

--

.pull-left[
```r
*# Grouped summary
# By default the result drops the last
# grouping level, unless .groups = 'keep'
summarize(
  by_day,
  mean_delay = mean(
    dep_delay, na.rm = TRUE
  )
)
```

```
#> # A tibble: 365 × 4
*#> # Groups:   year, month [12]
#>    year month   day mean_delay
#>   <int> <int> <int>      <dbl>
#> 1  2013     1     1       11.5
#> 2  2013     1     2       13.9
#> 3  2013     1     3       11.0
#> # ℹ 362 more rows
```
]

--

.pull-right[
```r
# If group_by() + summarize() is not powerful
# enough, there are also the experimental
# purrr-style functions, such as
# group_map()/*_modify()/*_walk()
group_modify(
  by_day,
  ~ broom::tidy( # what's this?! 
    lm(arr_delay ~ dep_delay, data=.x)
  )
)
```

```
#> # A tibble: 730 × 8
*#> # Groups:   year, month, day [365]
#>    year month   day term   estimate std.error
#>   <int> <int> <int> <chr>     <dbl>     <dbl>
#> 1  2013     1     1 (Inte…    0.910    0.579
#> 2  2013     1     1 dep_d…    1.03     0.0124
#> 3  2013     1     2 (Inte…   -1.32     0.581
#> # ℹ 727 more rows
#> # ℹ 2 more variables: statistic <dbl>,
#> #   p.value <dbl>
```
]

---
layout: false
class: inverse, center, middle
background-image: url(imgs/logo-magrittr.png), url(imgs/bg.png)
background-size: 10%, 100%
background-position: 15% 40%, 0% 100%

# 5. Connecting Operations with `%>%` .font150[(chaining multiple operations with the pipe `%>%`)]

---
layout: true

### >> The pipe operator: `%>%`

---

.full-width[.content-box-blue.bold.font120[Code without `%>%`]]

.pull-left[
```r
by_dest <- group_by(flights, dest)
delay <- summarize(
  by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")
ggplot(delay, aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
```
]

<br><br>

.pull-right[
<img src="L04_Transformation_Prep_files/figure-html/unnamed-chunk-41-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.full-width[.content-box-blue.bold.font120[Code with `%>%` (`%>%` comes from the `magrittr` package; the RStudio shortcut is `Ctrl+Shift+M`)]]

.pull-left[
```r
# Rewrite the previous slide's code with %>%
flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL") %>%
  ggplot(aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
```
]

--

.pull-right.font110[
* Making functions compatible with the pipe operator helps realize the [{{core principles}}](https://design.tidyverse.org/unifying.html) of the `tidyverse`
* Code written with `%>%` puts the focus on verbs (the data transformations) rather than nouns (the objects being transformed), which makes it easier to write, easier to read and easier to modify
* Functions in `dplyr` share this form: `f(.data01, ...) 
-> .data02`, that is, "data in, data out", which makes them a natural fit for piping
* The `%>%` operator (from `magrittr`, not `dplyr` itself) rewrites `x %>% f(y)` as `f(x, y)`, `x %>% f(y, .)` as `f(y, x)`, and `x %>% f(y, z = .)` as `f(y, z = x)` ...
]

---

.full-width[.content-box-blue.bold.font120[Another `%>%` example, once more, ✈️]]

```r
flights %>%
  group_by(year, month, day) %>%
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  mutate(date = lubridate::make_date(year, month, day)) %>%
  ggplot() + geom_line(aes(x = date, y = mean_delay))
```

<img src="L04_Transformation_Prep_files/figure-html/unnamed-chunk-43-1.png" width="50%" style="display: block; margin: auto;" />

---

.full-width[.content-box-blue.bold.font120[yet again .red[but with R's native forward pipe operator `|>`]]]

```r
not_cancelled <- flights |>
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled |>
  group_by(year, month, day) |>
  summarize(
*   first = dep_time |> min(),
*   last = dep_time %>% max,
    avg_delay1 = mean(arr_delay),
    avg_delay2 = mean(arr_delay[arr_delay > 0]) # average pos delay
  )
```

```
#> # A tibble: 365 × 7
#> # Groups:   year, month [12]
#>    year month   day first  last avg_delay1 avg_delay2
#>   <int> <int> <int> <int> <int>      <dbl>      <dbl>
#> 1  2013     1     1   517  2356      12.7        32.5
#> 2  2013     1     2    42  2354      12.7        32.0
#> 3  2013     1     3    32  2349       5.73       27.7
#> # ℹ 362 more rows
```

---
layout: false
class: center, middle, hide_logo
background-image: url(imgs/xaringan.png)
background-size: 12%
background-position: 50% 40%

<br><br><br><br><br><br><br>
<hr color='#f00' size='2px' width='80%'>
<br>

.Large.red[_**These web slides were made with the R package [{{`xaringan`}}](https://github.com/yihui/xaringan)!**_]
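
---
layout: false

### >> Sketch: checking the `%>%` rewriting rules

The three rewriting rules listed on the pipe slide (`x %>% f(y)`, `x %>% f(y, .)` and `x %>% f(y, z = .)`) can be checked by hand. The following is only a minimal sketch, not part of the original lecture: it assumes `dplyr` is installed, uses the built-in `mtcars` data instead of `flights` so that it runs without `nycflights13`, and the object names (`piped`, `nested`, `rep_piped`, `fit_piped`, `fit_direct`) are made up for illustration.

```r
library(dplyr) # re-exports %>% from magrittr

# Rule 1: x %>% f(y) is evaluated as f(x, y)
piped  <- mtcars %>% filter(cyl == 4)
nested <- filter(mtcars, cyl == 4)
identical(piped, nested)

# Rule 2: x %>% f(y, .) is evaluated as f(y, x)
rep_piped <- 2 %>% rep("a", .)
identical(rep_piped, rep("a", 2))

# Rule 3: x %>% f(y, z = .) is evaluated as f(y, z = x)
fit_piped  <- mtcars %>% lm(mpg ~ wt, data = .)
fit_direct <- lm(mpg ~ wt, data = mtcars)
all.equal(coef(fit_piped), coef(fit_direct))
```

Each comparison should return `TRUE`. Rule 3 is the one that matters most in practice, because modelling functions such as `lm()` do not take the data as their first argument.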