class: center, middle, inverse, title-slide .title[ # 量化金融与金融编程 ] .subtitle[ ## L4
dplyr
.font80[1.1.3]
数据处理 ] .author[ ###
曾永艺 ] .institute[ ### 厦门大学管理学院 ] .date[ ###
2023-10-13 ] --- class: middle, hide_logo background-image: url(imgs/logo-dplyr.png) background-size: 25% background-position: 19% 30%
.pull-left.font120.bold.center[ <br><br><br><br><br><br><br><br><br><br> _A Grammar of <br>Data Manipulation_ <br><br> ] -- .pull-right.font150.bold[ 1. 样本处理 2. 变量处理 3. 汇总 4. 分组和行式处理 5. 用 `%>%` 连接多个操作 6. 合并多个数据集 7. 操作数据库和 `data.table` ] --- ```r library(tidyverse) library(nycflights13) data(package = "nycflights13") # 包含airlines、airports、flights、planes、weather等5个数据集 ``` -- ```r flights # just print() it ``` ``` #> # A tibble: 336,776 × 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay #> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> #> 1 2013 1 1 517 515 2 830 819 11 #> 2 2013 1 1 533 529 4 850 830 20 #> 3 2013 1 1 542 540 2 923 850 33 #> # ℹ 336,773 more rows #> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, #> # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, #> # time_hour <dttm> ``` -- > * `A tibble: 336,776 x 19` > * `<int>` | `<dbl>` | `<chr>` | `<dttm>` | `<lgl>` | `<fctr>` | `<date>` 分别表示变量为 integer | double | character | date-time | logical | factor | date 类型的向量 -- ```r ?flights # 打开flights数据集的帮助文档以进一步了解数据集,如变量的定义 ``` --- ```r glimpse(flights) # 数据一瞥 ``` ``` #> Rows: 336,776 #> Columns: 19 #> $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,… #> $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… #> $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… #> $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, 558, … #> $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, 600, 600, … #> $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, -1, 0… #> $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853, 924,… #> $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851, 856, 917,… #> $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31, -4,… #> $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B6",… #> $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71, 194… #> $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "N516… #> $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK", "L… #> $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO", "O… #> $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 158, 345, 3… #> $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, 1028, 1005… #> $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6, 6,… #> $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0, 0, 0, 0,… #> $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2… ``` -- ```r View(flights) # 在 RStudio 数据浏览器中打开数据集 # tibble::view() 调用 utils::View() 并不可见地返回原数据集,便于 %>% 操作,但速度好像慢很多 ``` --- layout: false class: hide_logo ## .font150[🤔] 想想 <style type="text/css"> #special_timer.running { background-color: black; background-image: url(imgs/bg-stars.gif); } #special_timer.finished { background-color: black; background-image: url(imgs/bg-sqfw.gif); background-size: cover; } #special_timer.running .countdown-digits { color: #fdf6e3; } #special_timer.finished .countdown-digits { color: #fdf6e3; } </style>
−
+
05
:
00
.font150[对于如下由**行**(样本)和**列**(变量)构成的数据集 / 数据表我们可以进行哪些方面的操作呢?] .font80[
] --- class: inverse, center, middle # 1. 样本处理 .font150[(manipulate cases)] --- layout: true ### >> 样本筛选:`filter()` --- .full-width[.content-box-blue.bold.font120[`filter(.data, ...)`:提取数据集 `.data` 中变量取值满足设定条件的样本]] -- ```r filter(flights, month == 1, day == 1) *# 注意:条件表达式中的是 ==,而不是 = ``` ``` #> # A tibble: 842 × 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay #> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> #> 1 2013 1 1 517 515 2 830 819 11 #> 2 2013 1 1 533 529 4 850 830 20 #> 3 2013 1 1 542 540 2 923 850 33 #> # ℹ 839 more rows #> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, #> # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, #> # time_hour <dttm> ``` -- ```r # 在 R 基础包中的实现方法 flights[flights$month == 1 & flights$day == 1, ] subset(flights, month == 1 & day == 1) ``` ```r flights[month == 1 & day == 1, ] # 注意:这样写是错滴 ``` ``` #> Error in month == 1: comparison (==) is possible only for atomic and list types ``` --- .full-width[.content-box-blue.bold.font120.note[dplyr 包中的函数(如 `filter()` )并不会直接修改输入数据集 `.data`]] -- .full-width[.content-box-blue.bold.font120.warning[你必须自行存储修改后的数据集 💾]] ```r dec25 <- filter(flights, month == 12, day == 25) dec25 ``` ``` *#> # A tibble: 719 × 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay #> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> #> 1 2013 12 25 456 500 -4 649 651 -2 #> 2 2013 12 25 524 515 9 805 814 -9 #> 3 2013 12 25 542 540 2 832 850 -18 #> # ℹ 716 more rows #> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, #> # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, #> # time_hour <dttm> ``` ```r flights # 输入数据集仍然保持不变 ``` ``` *#> # A tibble: 336,776 × 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay #> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> #> 1 2013 1 1 517 515 2 830 819 11 #> 2 2013 1 1 533 529 4 850 830 20 #> 3 2013 1 1 542 540 2 923 850 33 #> # ℹ 336,773 more rows #> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, #> # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, #> # time_hour <dttm> ``` --- .full-width[.content-box-blue.bold.font120.note[`filter()` 会用到的比较运算符和逻辑运算符]] ```r 1. < > <= >= == != # ?Comparison 2. & | ! xor() # ?base::Logic 3. 其它的如:%in%、is.na()、between()、near()、if_any()、if_all() 等 ``` -- .full-width[.content-box-blue.bold.font120.note[`filter()` 默认以 `&` 的方式组合多个条件参数,...]] -- ```r filter(flights, month >= 11, day == 25) # 等效于 filter(flights, month >= 11 & day == 25) ``` -- .full-width[.content-box-blue.bold.font120.note[... 其它逻辑组合方式(如`|`)则需自行设定]] --- layout: false class: hide_logo ## .font150[🤔] 想想 <style type="text/css"> #special_timer.running { background-color: black; background-image: url(imgs/bg-stars.gif); } #special_timer.finished { background-color: black; background-image: url(imgs/bg-sqfw.gif); background-size: cover; } #special_timer.running .countdown-digits { color: #fdf6e3; } #special_timer.finished .countdown-digits { color: #fdf6e3; } </style>
−
+
05
:
00
.font150.bold.red[怎样挑出11月和12月的航班样本呢?] <br> .panelset.font120[ .panel[.panel-name[备选代码] .font110[ 1) `filter(flights, month == 11 | 12)` 2) `filter(flights, month == (11 | 12))` 3) `filter(flights, month == 11 | month == 12)` ] ] .panel[.panel-name[结果1] .code90[ ```r filter(flights, month == 11 | 12) ``` ``` #> # A tibble: 336,776 × 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay #> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> #> 1 2013 1 1 517 515 2 830 819 11 #> 2 2013 1 1 533 529 4 850 830 20 #> 3 2013 1 1 542 540 2 923 850 33 #> # ℹ 336,773 more rows #> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, #> # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, #> # time_hour <dttm> ``` ] ] .panel[.panel-name[结果2] .code90[ ```r filter(flights, month == (11 | 12)) ``` ``` #> # A tibble: 27,004 × 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay #> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> #> 1 2013 1 1 517 515 2 830 819 11 #> 2 2013 1 1 533 529 4 850 830 20 #> 3 2013 1 1 542 540 2 923 850 33 #> # ℹ 27,001 more rows #> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, #> # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, #> # time_hour <dttm> ``` ] ] .panel[.panel-name[结果3] .code90[ ```r filter(flights, month == 11 | month == 12) # filter(flights, month %in% c(11, 12)) ``` ``` #> # A tibble: 55,403 × 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay #> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> #> 1 2013 11 1 5 2359 6 352 345 7 #> 2 2013 11 1 35 2250 105 123 2356 87 #> 3 2013 11 1 455 500 -5 641 651 -10 #> # ℹ 55,400 more rows #> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, #> # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, #> # time_hour <dttm> ``` ] ] ] --- layout: true ### >> 样本筛选:其它函数 --- .font120[ * `slice(.data, ..., .by = NULL, .preserve = FALSE)`:按照整数向量给出的索引位置选择样本,正(.red[负])整数表示保留(.red[移除])的样本,如 `slice(mtcars, 5:n())` * `slice_head(.data, ..., n, prop, by = NULL)` 和 `slice_tail()` 选择数据集开始 / 结尾的样本    .red[vs. `utils::head() / tail()`?] * `slice_sample(.data, ..., n, prop, by = NULL, weight_by = NULL, replace = FALSE)` 随机选择样本 * `slice_min(.data, order_by, ..., n, prop, by = NULL, with_ties = TRUE, na_rm = FALSE)` 和 `slice_max()` 选择 `order_by` 参数指定的变量或其函数取值最大或最小的样本 * `distinct(.data, ..., .keep_all = FALSE)`:移除(指定变量或其函数)取值重复的样本    .red[≈ `base::unique()`] ] .footnote.red[注: 在 dplyr<sup>v1.0.0</sup> 之后 `top\_n()`、`top\_frac()`、`sample\_n()` 和 `sample\_frac()` 等函数已被相应的 `slice\_*()` 函数所替代] --- layout: true ### >> 样本排序:`arrange()` --- .full-width[.content-box-blue.bold.font120[`arrange(.data, ...)`:根据指定变量的取值对数据集 `.data` 的样本排序]] -- ```r arrange(flights, year, month, day, dep_time) ``` ``` #> # A tibble: 336,776 × 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay #> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> #> 1 2013 1 1 517 515 2 830 819 11 #> 2 2013 1 1 533 529 4 850 830 20 #> 3 2013 1 1 542 540 2 923 850 33 #> # ℹ 336,773 more rows #> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, #> # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, #> # time_hour <dttm> ``` -- ```r arrange(flights, desc(dep_delay)) # 加入 desc() 反向排序:从大到小 ``` ``` #> # A tibble: 336,776 × 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay #> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> #> 1 2013 1 9 641 900 1301 1242 1530 1272 #> 2 2013 6 15 1432 1935 1137 1607 2120 1127 #> 3 2013 1 10 1121 1635 1126 1239 1810 1109 #> # ℹ 336,773 more rows #> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, #> # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, #> # time_hour <dttm> ``` --- .full-width[.content-box-blue.bold.font120.note[不像 dplyr 包中的其它函数,`arrange(.data, ..., .by_group = FALSE)` 会忽略数据集的分组信息,除非明确加入分组变量或设定 `.by_group = TRUE`]] -- .full-width[.content-box-blue.bold.font120.note[缺失值总是排在最后 <sup>.red[*]</sup> ]] -- .pull-left[ ```r df <- tibble(x = c(1, 3, 2, NA)) arrange(df, x) ``` ``` #> # A tibble: 4 × 1 #> x #> <dbl> #> 1 1 #> 2 2 #> 3 3 *#> 4 NA ``` ] -- .pull-right[ ```r df <- tibble(x = c(1, 3, 2, NA)) arrange(df, desc(x)) ``` ``` #> # A tibble: 4 × 1 #> x #> <dbl> #> 1 3 #> 2 2 #> 3 1 *#> 4 NA ``` ] -- .footnote.red[*:`base::sort()` 和 `base::order()` 通过参数 `na.last` 来控制把缺失值放在哪里或删除,并通过参数 `decreasing` 来控制排序方向。] --- layout: false class: inverse, center, middle # 2. 变量处理 .font150[(manipulate variables)] --- layout: true ### >> 变量选取:`select(.data, ...)` --- ```r select(flights, month, day, dep_time, sched_dep_time, dep_delay) # 枚举式:变量名,无需"" ``` ``` #> # A tibble: 336,776 × 5 #> month day dep_time sched_dep_time dep_delay #> <int> <int> <int> <int> <dbl> #> 1 1 1 517 515 2 #> 2 1 1 533 529 4 #> 3 1 1 542 540 2 #> # ℹ 336,773 more rows ``` -- ```r select(flights, 2, 3, 4, 5, 6) # 枚举式:表示变量位置的数字,结果同上,但不推荐 ``` -- ```r select(flights, month:dep_delay) # 用 : 选择连在一起的变量 select(flights, 2:6) ``` -- ```r select(flights, !(month:dep_delay)) # 变量前的 ! 或 - 表示剔除 ``` ``` #> # A tibble: 336,776 × 14 #> year arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time #> <int> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> #> 1 2013 830 819 11 UA 1545 N14228 EWR IAH 227 #> 2 2013 850 830 20 UA 1714 N24211 LGA IAH 227 #> 3 2013 923 850 33 AA 1141 N619AA JFK MIA 160 #> # ℹ 336,773 more rows #> # ℹ 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- .full-width[.content-box-blue.bold.font110.note[`select()` 的帮助函数,已析出到 `tidyselect` 包中,`?select_helpers`]] 1. `starts_with("abc")`:选取变量名以 `abc` 开头的变量 2. `ends_with("xyz")`:选取变量名以 `xyz` 结束的变量 3. `contains("ijk")`:选取变量名包含 `ijk` 的变量 4. `matches("(.)\\1")`:选取变量名中出现重复字符的变量 5. `num_range("x", 1:3)`:选取变量 `x1`、`x2` 和 `x3` 6. `any_of(x) | all_of(x)`:选择整数向量 `x` 指定位置或*字符向量* `x` 直接指定的变量 7. `last_col(offset = 0L)`:选择从最后算起的第 `offset+1` 个的变量 8. `everything()`:全部变量,通常放在最后 9. `where(fn)`:选择满足断言函数 `fn` 条件的变量,如 `select(data, where(is.integer))` -- .full-width[.content-box-blue.bold.font110.note[`select()`:可混合使用各种方法]] ```r select(flights, year:day, ends_with("_delay") | starts_with("dep_"), tailnum) ``` ``` #> # A tibble: 336,776 × 7 #> year month day dep_delay arr_delay dep_time tailnum #> <int> <int> <int> <dbl> <dbl> <int> <chr> #> 1 2013 1 1 2 11 517 N14228 #> 2 2013 1 1 4 20 533 N24211 #> 3 2013 1 1 2 33 542 N619AA #> # ℹ 336,773 more rows ``` --- layout: false ### >> 变量重命名:`select()`、`rename()` 和 `rename_with()` ```r select(flights, nian = year, yue = month, ri = day) # 选取变量的同时重命名变量 ``` ``` #> # A tibble: 336,776 × 3 #> nian yue ri #> <int> <int> <int> #> 1 2013 1 1 #> 2 2013 1 1 #> 3 2013 1 1 #> # ℹ 336,773 more rows ``` -- ```r # select() 只保留指定的变量,而 rename(.data, ...) 则会保留全部变量 rename(flights, nian = year, yue = month, ri = day) %>% dim() ``` ``` #> [1] 336776 19 ``` -- ```r # rename_with(.data, .fn, .cols = everything(), ...) rename_with(flights, toupper, 1:3) ``` ``` #> # A tibble: 336,776 × 19 #> YEAR MONTH DAY dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay #> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> #> 1 2013 1 1 517 515 2 830 819 11 #> 2 2013 1 1 533 529 4 850 830 20 #> 3 2013 1 1 542 540 2 923 850 33 #> # ℹ 336,773 more rows #> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, #> # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, #> # time_hour <dttm> ``` --- layout: false ### >> 变量次序调整:`select()` 和 `relocate()` ```r select(flights, dest, year:day, ends_with("_delay"), everything()) ``` ``` #> # A tibble: 336,776 × 19 #> dest year month day dep_delay arr_delay dep_time sched_dep_time arr_time #> <chr> <int> <int> <int> <dbl> <dbl> <int> <int> <int> #> 1 IAH 2013 1 1 2 11 517 515 830 #> 2 IAH 2013 1 1 4 20 533 529 850 #> 3 MIA 2013 1 1 2 33 542 540 923 #> # ℹ 336,773 more rows #> # ℹ 10 more variables: sched_arr_time <int>, carrier <chr>, flight <int>, tailnum <chr>, #> # origin <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, #> # time_hour <dttm> ``` -- ```r relocate(flights, dest, year:day, ends_with("_delay")) # 结果同上 ``` -- ```r # relocate(.data, ..., .before = NULL, .after = NULL) relocate(flights, ends_with("_delay"), .after = day) ``` ``` #> # A tibble: 336,776 × 19 #> year month day dep_delay arr_delay dep_time sched_dep_time arr_time sched_arr_time #> <int> <int> <int> <dbl> <dbl> <int> <int> <int> <int> #> 1 2013 1 1 2 11 517 515 830 819 #> 2 2013 1 1 4 20 533 529 850 830 #> 3 2013 1 1 2 33 542 540 923 850 #> # ℹ 336,773 more rows #> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, #> # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, #> # time_hour <dttm> ``` --- layout: true ### >> 生成新变量:`mutate()` --- .full-width[.content-box-blue.bold.font120[`mutate(.data, ...)`:生成新变量 .red[<sup> *</sup>]]] -- ```r flights_sml <- select(flights, year:day, ends_with("_delay"), air_time) ``` -- ```r mutate(flights_sml, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours # 可直接引用新生成的变量 ) ``` ``` #> # A tibble: 336,776 × 9 #> year month day dep_delay arr_delay air_time gain hours gain_per_hour #> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 2013 1 1 2 11 227 9 3.78 2.38 #> 2 2013 1 1 4 20 227 16 3.78 4.23 #> 3 2013 1 1 2 33 160 31 2.67 11.6 #> # ℹ 336,773 more rows ``` -- .footnote.red[*:完整参数版为 `mutate(.data, ..., .by = NULL, .keep = c("all", "used", "unused", "none"), .before = NULL, .after = NULL)`(其中:`.keep`、`.before`和`.after`为 dplyr<sup>v1.0.0</sup> 新增参数,dplyr<sup>v1.1.0</sup> 又新增实验性参数 `.by`);<br>2. 假如你只想保留新生成的变量,那就~~使用 `transmute()` 或~~设定 `mutate()` 参数 `.keep = "none"`。] --- .full-width[.content-box-blue.bold.font120[`mutate()`:支持向量化函数 <sup>.red[*]</sup>]] .code75[ ```r *MATH +, - , *, /, ^, %/%, %% # arithmetic ops log(), log2(), log10() # logs <, <=, >, >=, !=, == # logical comparisons *CUMULATIVE AGGREGATES # vignette("window-functions") dplyr::cumall()|cumany() # cumulative all() | any() cummax()|cummin() # cumulative max() | min() dplyr::cummean() # cumulative mean() cumprod()|cumsum() # cumulative prod() | sum() *OFFSETS dplyr::lag()|lead() # offset elements by 1 | -1 *RANKINGS # ?ranking dplyr::min_rank() # rank with ties = min dplyr::ntile() # bins into n bins dplyr::row_number() # rank with ties = "first" *MISC pmax()|pmin() # element-wise max() | min() dplyr::recode() # vectorized switch() dplyr::if_else() # vectorized if() + else() dplyr::case_when() # multi-case if_else() ``` ] .footnote.red[*:当然也支持返回“标量”的汇总函数,如 `mean()`,会将标量直接扩展至需要的长度。] --- .full-width[.content-box-blue.bold.font120[多列操作 👉 .font80[ `vignette("colwise")`]]] .code90[ ```r # ?scoped # *_all() 作用于每个变量 # *_at() 作用于用 vars() 函数、字符向量或位置向量指定的变量 # *_if() 作用于 .predicate 函数取值为 TRUE 的变量 mutate_all(.tbl, .funs, ...) mutate_at(.tbl, .vars, .funs, ...) mutate_if(.tbl, .predicate, .funs, ...) # summarize()、group_by()、arrange()、filter()、rename()等也存在类似的变体函数 ``` ] -- .font110.bold.note[dplyr<sup>v1.0.0</sup> 使用更具优势的 `across(.cols = everything(), .fns = NULL, ..., .names = NULL, .unpack = FALSE)` 来替代上述为数众多的变体函数] ```r mutate(flights_sml, across(dep_delay:air_time, \(x) x / 60)) # 匿名函数 ``` ``` #> # A tibble: 336,776 × 6 #> year month day dep_delay arr_delay air_time #> <int> <int> <int> <dbl> <dbl> <dbl> #> 1 2013 1 1 0.0333 0.183 3.78 #> 2 2013 1 1 0.0667 0.333 3.78 #> 3 2013 1 1 0.0333 0.55 2.67 #> # ℹ 336,773 more rows ``` --- layout: false class: inverse, center, middle # 3. 汇总 .font150[(summarize / reframe)] --- layout: true ### >> 汇总:`summarize()` --- .font120[ - `summarize(.data, ..., .by = NULL, .groups = NULL)` 函数生成新的数据框,每个汇总函数占一列,每个分组占用一行; - 如果 `.data` 是[分组数据框 👇](#56),则每个分组变量还会占一列;此时,还可用 dplyr<sup>v1.0.0</sup> 新增的实验性参数 `.groups = c("drop_last", "drop", "keep", "rowwise")` 来控制新生成结果数据框的分组结构; - 如果只想对 `.data` 进行一次性的分组汇总(不保留分组结构),则可使用 dplyr<sup>v1.1.0</sup> 新增的实验性分组参数 `.by`。 ] ```r summarize( flights, mean_delay = mean(dep_delay, na.rm = TRUE), sd_delay = sd(dep_delay, na.rm = TRUE) ) ``` ``` #> # A tibble: 1 × 2 #> mean_delay sd_delay #> <dbl> <dbl> #> 1 12.6 40.2 ``` --- .font120.note[`summarize()` 支持返回“标量”的汇总函数<sup>.red[*]</sup>,示例如下:] ```r *COUNTS dplyr::n() # number of values/rows dplyr::n_distinct() # number of uniques sum(!is.na()) # number of non-NA’s *LOCATION mean() | median() # mean | median *POSITION/ORDER dplyr::first() # first value dplyr::last() # last value dplyr::nth() # value in n-th location of vector *RANK quantile() # nth quantile min() | max() # minimum value | maximum value *SPREAD IQR() # Inter-Quartile Range mad() # median absolute deviation sd() # standard deviation var() # variance ``` .footnote.red[*:dplyr<sup>v1.0.0</sup> 扩展了 `summarize()` 的灵活性(允许其返回包含多个元素的向量甚至是多行多列的数据框),但 dplyr<sup>v1.1.0</sup> 在此应用情景下会提示改用实验性的 `reframe()`。] --- layout: true ### >> 灵活汇总:`reframe()` --- .pull-left[ ```r quantile_df <- function(x, probs) { tibble(x = quantile(x, probs, na.rm = TRUE), probs = probs) } flights %>% summarize( across(c(dep_delay, arr_delay), ~ quantile_df( .x, seq(0, 1, 0.25) ), .unpack = TRUE ) ) ``` ``` #> Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr #> 1.1.0. #> ℹ Please use `reframe()` instead. #> ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()` always #> returns an ungrouped data frame and adjust accordingly. #> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated. ``` ``` #> # A tibble: 5 × 4 #> dep_delay_x dep_delay_probs arr_delay_x arr_delay_probs #> <dbl> <dbl> <dbl> <dbl> #> 1 -43 0 -86 0 #> 2 -5 0.25 -17 0.25 #> 3 -2 0.5 -5 0.5 #> 4 11 0.75 14 0.75 #> 5 1301 1 1272 1 ``` ] -- .pull-right[ ```r flights %>% * reframe( across(c(dep_delay, arr_delay), ~ quantile_df( .x, seq(0, 1, 0.25) ), .unpack = TRUE ) ) ``` ``` #> # A tibble: 5 × 4 #> dep_delay_x dep_delay_probs arr_delay_x arr_delay_probs #> <dbl> <dbl> <dbl> <dbl> #> 1 -43 0 -86 0 #> 2 -5 0.25 -17 0.25 #> 3 -2 0.5 -5 0.5 #> 4 11 0.75 14 0.75 #> 5 1301 1 1272 1 ``` ] --- layout: false class: inverse, center, middle # 4. 分组和行式处理 .font150[(grouping and rowwise)] --- layout: true ### >> 分组处理:`group_by()` .font80[👉 `vignette("grouping")`] --- .font100[ - `group_by(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))` 将数据框及其扩展转变为分组数据框(`grouped_df`) ] -- .pull-left[ ```r by_day <- group_by(flights, year, month, day) class(by_day) ``` ``` #> [1] "grouped_df" "tbl_df" "tbl" #> [4] "data.frame" ``` ```r by_day ``` ``` #> # A tibble: 336,776 × 19 *#> # Groups: year, month, day [365] #> year month day dep_time sched_dep_time #> <int> <int> <int> <int> <int> #> 1 2013 1 1 517 515 #> 2 2013 1 1 533 529 #> 3 2013 1 1 542 540 #> # ℹ 336,773 more rows #> # ℹ 14 more variables: dep_delay <dbl>, #> # arr_time <int>, sched_arr_time <int>, #> # arr_delay <dbl>, carrier <chr>, #> # flight <int>, tailnum <chr>, #> # origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, … ``` ] -- .pull-right[ .font100.bold.note[获取分组元数据的相关函数] ```r by_day %>% group_vars() ``` ``` #> [1] "year" "month" "day" ``` ```r by_day %>% group_data() ``` ``` #> # A tibble: 365 × 4 #> year month day .rows #> <int> <int> <int> <list<int>> #> 1 2013 1 1 [842] #> 2 2013 1 2 [943] #> 3 2013 1 3 [914] #> # ℹ 362 more rows ``` ```r # group_keys() / # group_rows() / group_indices() # group_size() / n_groups() ``` ] --- .pull-left.code90[ ```r *# 分组汇总 # 返回结果默认情况下会去除最低一级分组, # 并有相应的提示信息,除非设定参数 # .groups = 'keep' summarize( by_day, mean_delay = mean( dep_delay, na.rm = TRUE ) ) ``` ``` #> `summarise()` has grouped output by #> 'year', 'month'. You can override using #> the `.groups` argument. ``` ``` #> # A tibble: 365 × 4 *#> # Groups: year, month [12] #> year month day mean_delay #> <int> <int> <int> <dbl> #> 1 2013 1 1 11.5 #> 2 2013 1 2 13.9 #> 3 2013 1 3 11.0 #> # ℹ 362 more rows ``` ] -- .pull-right.code90[ ```r # 假如你觉得 group_by() + summarize() # 不够强大,你还可以使用实验性的 # purrr-style 函数,如 group_map()/ # *_modify()/*_walk() 等 group_modify( by_day, ~ broom::tidy( # what's this?! lm(arr_delay ~ dep_delay, data=.x) ) ) ``` ``` #> # A tibble: 730 × 8 *#> # Groups: year, month, day [365] #> year month day term estimate #> <int> <int> <int> <chr> <dbl> #> 1 2013 1 1 (Intercept) 0.910 #> 2 2013 1 1 dep_delay 1.03 #> 3 2013 1 2 (Intercept) -1.32 #> # ℹ 727 more rows #> # ℹ 3 more variables: std.error <dbl>, #> # statistic <dbl>, p.value <dbl> ``` ] --- layout: true ### >> 分组处理:`.by / by` 参数 .font80[👉 `?dplyr_by`] --- .font120[ - `group_by()` 对数据框的分组设定会影响后续 dplyr 包函数的处理方式与结果,特别是 `summarize()`、`reframe()`、`mutate()`、`filter()`、`slice*()`等;如果你不需要基于分组进行后续操作,需先用 `ungroup(x, ...)` 函数来取消对数据集 `x`(基于指定变量 `...`)的分组设定 ] -- .font120.bold[ - dplyr<sup>v1.1.0</sup> 新增了实验性分组参数 `.by / by`,只对 `.data` 进行一次性的临时分组(且不保留分组结构) ] -- .pull-left.code90[ ```r summarize( * group_by(flights, * year, month, day), mean_delay = mean( dep_delay, na.rm = TRUE ) ) ``` ``` #> # A tibble: 365 × 4 #> # Groups: year, month [12] #> year month day mean_delay #> <int> <int> <int> <dbl> #> 1 2013 1 1 11.5 #> 2 2013 1 2 13.9 #> 3 2013 1 3 11.0 #> # ℹ 362 more rows ``` ] .pull-right.code90[ ```r summarize( flights, mean_delay = mean( dep_delay, na.rm = TRUE ), * .by = c(year, month, day) ) ``` ``` #> # A tibble: 365 × 4 #> year month day mean_delay #> <int> <int> <int> <dbl> #> 1 2013 1 1 11.5 #> 2 2013 1 2 13.9 #> 3 2013 1 3 11.0 #> # ℹ 362 more rows ``` ] ---------------------|------------------------------------ Grouping only affects a single verb | Grouping is persistent across multiple verbs Selects variables with tidy-select | Computes expressions with data-masking Summaries use existing order of group keys | Summaries sort group keys in ascending order + To prevent surprising results, you can't use `.by` on an existing grouped data frame. - <https://github.com/tidyverse/dplyr/releases/tag/v1.1.0> + The most useful reason to do this is because `.by` only affects a single operation. In the example above, an ungrouped data frame went into the `summarize()` call, so an ungrouped data frame will come out; with `.by`, you never need to remember to `ungroup()` afterwards and you never need to use the `.groups` argument. + Additionally, using `summarize()` with `.by` will never sort the results by the group key, unlike with `group_by()`. Instead, the results are returned using the existing ordering of the groups from the original data. We feel this is more predictable, better maintains any ordering you might have already applied with a previous call to `arrange()`, and provides a way to maintain the current ordering without having to resort to factors. + This feature was inspired by `data.table`, where the equivalent syntax looks like: `starwars[, .(mean_height = mean(height)), by = .(species, homeworld)]` + `with_groups()` is superseded in favor of `.by` (#6582). --- layout: true ### >> 行式处理:`rowwise()` --- .font110[ - `rowwise(data, ...)` 允许你对数据框的每一行进行运算,当不存在相应的向量化函数时,进行逐行式运算可以让你避免编写显性循环的代码 - 和 `group_by()` 类似,`rowwise()` 并不更改数据框的结构,而是让后续的 dplyr 函数操作按逐行式进行 - `rowwise()` 返回的是一种特殊的行分组数据框(`rowwise_df`,每行一组),绝大多数的 dplyr 函数会保留数据框的行分组信息(返回 `grouped_df` 的 `summarize()` 函数是个例外);你可以通过 `ungroup()` 或 `as_tibble()` 函数来取消行分组,或通过 `group_by()` 函数转变为 `grouped_df` ] -- .pull-left[ ```r flights_rw <- rowwise(flights_sml) class(flights_rw) ``` ``` #> [1] "rowwise_df" "tbl_df" "tbl" #> [4] "data.frame" ``` ```r flights_rw ``` ``` #> # A tibble: 336,776 × 6 *#> # Rowwise: #> year month day dep_delay arr_delay air_time #> <int> <int> <int> <dbl> <dbl> <dbl> #> 1 2013 1 1 2 11 227 #> 2 2013 1 1 4 20 227 #> 3 2013 1 1 2 33 160 #> # ℹ 336,773 more rows ``` ] -- .pull-right[ ```r # long time to run mutate( flights_rw, max_delay = * max(c_across(ends_with("_delay"))) ) ``` ``` #> # A tibble: 336,776 × 7 *#> # Rowwise: #> year month day dep_delay arr_delay #> <int> <int> <int> <dbl> <dbl> #> 1 2013 1 1 2 11 #> 2 2013 1 1 4 20 #> 3 2013 1 1 2 33 #> # ℹ 336,773 more rows #> # ℹ 2 more variables: air_time <dbl>, #> # max_delay <dbl> ``` ] --- .full-width[.content-box-blue.bold.font120[一个稍微复杂(但思路类似的)例子]] .pull-left[ ```r nest_flights <- nest_by( flights_sml, year, month, day) nest_flights ``` ``` #> # A tibble: 365 × 4 *#> # Rowwise: year, month, day #> year month day data #> <int> <int> <int> <list<tibble[,3]>> #> 1 2013 1 1 [842 × 3] #> 2 2013 1 2 [943 × 3] #> 3 2013 1 3 [914 × 3] #> 4 2013 1 4 [915 × 3] #> 5 2013 1 5 [720 × 3] #> # ℹ 360 more rows ``` ] -- .pull-right[ ```r *mutate( nest_flights, mean_delay = mean( data$dep_delay, na.rm = TRUE ), .keep = "none" ) ``` ``` #> # A tibble: 365 × 4 *#> # Rowwise: year, month, day #> year month day mean_delay #> <int> <int> <int> <dbl> #> 1 2013 1 1 11.5 #> 2 2013 1 2 13.9 #> 3 2013 1 3 11.0 #> 4 2013 1 4 8.95 #> 5 2013 1 5 5.73 #> # ℹ 360 more rows ``` ] -- .font100[👉 `vignette("rowwise")`] --- layout: false class: inverse, center, middle background-image: url(imgs/logo-magrittr.png), url(imgs/bg.png) background-size: 10%, 100% background-position: 15% 40%, 0% 100% # 5. 用 `%>%` 连接多个操作 .font150[(chaining multiple operations with the pipe `%>%`)] --- layout: true ### >> 管道运算符:`%>%` --- .full-width[.content-box-blue.bold.font120[不用 `%>%` 的代码]] .pull-left[ ```r by_dest <- group_by(flights, dest) delay <- summarize( by_dest, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) delay <- filter(delay, count > 20, dest != "HNL") ggplot(delay, aes(x = dist, y = delay)) + geom_point(aes(size = count), alpha = 1/3) + geom_smooth(se = FALSE) ``` ] <br><br> .pull-right[ <img src="L04_Transformation_files/figure-html/unnamed-chunk-57-1.png" width="100%" style="display: block; margin: auto;" /> ] --- .full-width[.content-box-blue.bold.font120[使用 `%>%` 的代码( `%>%` 来自 `magrittr` 包,快捷键为 `Ctrl+Shift+M` )]] .pull-left[ ```r # 用 %>% 改写前一页的代码 flights %>% group_by(dest) %>% summarize( count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) %>% filter(count > 20, dest != "HNL") %>% ggplot(aes(x = dist, y = delay)) + geom_point(aes(size = count), alpha = 1/3) + geom_smooth(se = FALSE) ``` ] -- .pull-right.font110[ * 让函数兼容管道操作符有助于实现 `tidyverse` 的[{{核心原则}}](https://design.tidyverse.org/unifying.html) * 使用 `%>%` 编写的代码关注动词(如数据变换操作)而非名词(操作对象),这使得代码更容易写,更容易读,也更容易修改 * `dplyr` 包的函数具备这样的特性:`f(.data01, ...) -> .data02`,“数据进,数据出”,更适用于管道操作 * `dplyr` 包会在后台自动将 `x %>% f(y)` 转变为 `f(x, y)`,将 `x %>% f(y, .)` 转变为 `f(y, x)`,将 `x %>% f(y, z = .)` 转变为 `f(y, z = x)` …… ] --- .full-width[.content-box-blue.bold.font120[使用 `%>%` 的例子,once more,✈️]] ```r flights %>% group_by(year, month, day) %>% summarize(mean_delay = mean(dep_delay, na.rm = TRUE)) %>% mutate(date = lubridate::make_date(year, month, day)) %>% ggplot() + geom_line(aes(x = date, y = mean_delay)) ``` <img src="L04_Transformation_files/figure-html/unnamed-chunk-59-1.png" width="50%" style="display: block; margin: auto;" /> --- .full-width[.content-box-blue.bold.font120[yet again .red[but with R's native forward pipe operator `|>`]]] ```r not_cancelled <- flights |> filter(!is.na(dep_delay), !is.na(arr_delay)) not_cancelled |> group_by(year, month, day) |> summarize( * first = dep_time |> min(), * last = dep_time %>% max, avg_delay1 = mean(arr_delay), avg_delay2 = mean(arr_delay[arr_delay > 0]) # average pos delay ) ``` ``` #> # A tibble: 365 × 7 #> # Groups: year, month [12] #> year month day first last avg_delay1 avg_delay2 #> <int> <int> <int> <int> <int> <dbl> <dbl> #> 1 2013 1 1 517 2356 12.7 32.5 #> 2 2013 1 2 42 2354 12.7 32.0 #> 3 2013 1 3 32 2349 5.73 27.7 #> # ℹ 362 more rows ``` --- layout: false class: hide_logo ## 🙋♂️ Your Turn!
−
+
05
:
00
.panelset.font120[ .panel[.panel-name[Nested?] 利用管道操作符 `%>%` 改写以下 \# 重嵌套的代码: .code90[ ```r summarize( select( group_by(starwars, species, sex), height, mass ), height = mean(height, na.rm = TRUE), mass = mean(mass, na.rm = TRUE) ) ``` ``` #> # A tibble: 41 × 4 #> # Groups: species [38] #> species sex height mass #> <chr> <chr> <dbl> <dbl> #> 1 Aleena male 79 15 #> 2 Besalisk male 198 102 #> 3 Cerean male 198 82 #> # ℹ 38 more rows ``` ] ] .panel[.panel-name[Piped!] .code90[ ```r starwars %>% group_by(species, sex) %>% select(height, mass) %>% summarize( height = mean(height, na.rm = TRUE), mass = mean(mass, na.rm = TRUE) ) ``` ``` #> # A tibble: 41 × 4 #> # Groups: species [38] #> species sex height mass #> <chr> <chr> <dbl> <dbl> #> 1 Aleena male 79 15 #> 2 Besalisk male 198 102 #> 3 Cerean male 198 82 #> # ℹ 38 more rows ``` ] ] ] --- layout: false class: inverse, center, middle # 课后复习 --- class: middle .font130[ 🕐 复习 📖 [**{{_R for Data Science, 2e_}}**](https://r4ds.had.nz/) 一书第一部分 _Whole game_ 中关于数据处理的章节,即 [**4 Data transformation**](https://r4ds.hadley.nz/data-transform) 并**(结队)完成课后练习** > 👉 课后习题参考答案链接<br> >   [{{R for Data Science (2e): Solutions to Exercises}}](https://mine-cetinkaya-rundel.github.io/r4ds-solutions/data-transform.html) 🕑 下载(打印) 📰 [**{{dplyr的cheatsheet}}**](docs/dplyr-cheatsheet.pdf) 并阅读之 🕒 `browseVignettes(package = "dplyr")` 📝,阅读: > .code80[ ``` Introduction to dplyr Column-wise operations Grouped data Row-wise operations Window functions ``` ] ] --- layout: false class: inverse, center, middle # 6. 合并数据集 .font150[(combine tables)] --- <br> .pull-left[ ```r airlines ``` ``` #> # A tibble: 16 × 2 #> carrier name #> <chr> <chr> #> 1 9E Endeavor Air Inc. #> 2 AA American Airlines Inc. #> 3 AS Alaska Airlines Inc. #> # ℹ 13 more rows ``` ```r planes ``` ``` #> # A tibble: 3,322 × 9 #> tailnum year type manufacturer model #> <chr> <int> <chr> <chr> <chr> #> 1 N10156 2004 Fixed win… EMBRAER EMB-… #> 2 N102UW 1998 Fixed win… AIRBUS INDU… A320… #> 3 N103US 1999 Fixed win… AIRBUS INDU… A320… #> # ℹ 3,319 more rows #> # ℹ 4 more variables: engines <int>, #> # seats <int>, speed <int>, engine <chr> ``` ] -- .pull-right[ ```r airports ``` ``` #> # A tibble: 1,458 × 8 #> faa name lat lon alt tz dst #> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> #> 1 04G Lansdo… 41.1 -80.6 1044 -5 A #> 2 06A Moton … 32.5 -85.7 264 -6 A #> 3 06C Schaum… 42.0 -88.1 801 -6 A #> # ℹ 1,455 more rows #> # ℹ 1 more variable: tzone <chr> ``` ```r weather ``` ``` #> # A tibble: 26,115 × 15 #> origin year month day hour temp dewp #> <chr> <int> <int> <int> <int> <dbl> <dbl> #> 1 EWR 2013 1 1 1 39.0 26.1 #> 2 EWR 2013 1 1 2 39.0 27.0 #> 3 EWR 2013 1 1 3 39.0 28.0 #> # ℹ 26,112 more rows #> # ℹ 8 more variables: humid <dbl>, #> # wind_dir <dbl>, wind_speed <dbl>, #> # wind_gust <dbl>, precip <dbl>, #> # pressure <dbl>, visib <dbl>, #> # time_hour <dttm> ``` ] --- layout: true ### >> 键(keys) --- .full-width[.content-box-blue.bold.font100[主键(primary key):指的是数据表中一个变量(列)或多个变量的组合,其值能唯一地标识表中的每一行。主键主要用来与其他表的外键关联,以及本表的修改与删除。]] -- ```r planes %>% count(tailnum) %>% filter(n > 1) ``` ``` #> # A tibble: 0 × 2 #> # ℹ 2 variables: tailnum <chr>, n <int> ``` -- ```r weather %>% count(year, month, day, hour, origin) %>% filter(n > 1) ``` ``` #> # A tibble: 3 × 6 #> year month day hour origin n #> <int> <int> <int> <int> <chr> <int> #> 1 2013 11 3 1 EWR 2 #> 2 2013 11 3 1 JFK 2 #> 3 2013 11 3 1 LGA 2 ``` ```r weather <- weather %>% distinct(year, month, day, hour, origin, .keep_all = TRUE) # 修正之 ``` -- .full-width[.content-box-blue.bold.font100[外键(foreign key):对应于另一数据表主键的一个或多个变量。]] --- <img src="imgs/relational-nycflights.png" width="80%" style="display: block; margin: auto;" /> --- layout: true ### >> 合并式连接(mutating joins) --- .full-width[.content-box-blue.bold.font110[`*_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)`]] -- ```r flights %>% select(year:day, hour, tailnum, carrier) %>% left_join(airlines) # 以同名变量为合并键,增加变量 name(航空公司全名) ``` ``` #> Joining with `by = join_by(carrier)` ``` ``` #> # A tibble: 336,776 × 7 #> year month day hour tailnum carrier name #> <int> <int> <int> <dbl> <chr> <chr> <chr> #> 1 2013 1 1 5 N14228 UA United Air Lines Inc. #> 2 2013 1 1 5 N24211 UA United Air Lines Inc. #> 3 2013 1 1 5 N619AA AA American Airlines Inc. #> # ℹ 336,773 more rows ``` -- ```r flights %>% select(year:day, hour, tailnum, carrier) %>% * mutate(name = airlines$name[match(carrier, airlines$carrier)]) # 殊途同归,但容易理解吗? ``` ``` #> # A tibble: 336,776 × 7 #> year month day hour tailnum carrier name #> <int> <int> <int> <dbl> <chr> <chr> <chr> #> 1 2013 1 1 5 N14228 UA United Air Lines Inc. #> 2 2013 1 1 5 N24211 UA United Air Lines Inc. #> 3 2013 1 1 5 N619AA AA American Airlines Inc. #> # ℹ 336,773 more rows ``` --- .pull-left[ .full-width.content-box-blue.bold.font120[Inner joins] <img src="imgs/join-inner.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ .full-width.content-box-blue.bold.font120[Outer joins] <img src="imgs/join-outer.png" width="65%" style="display: block; margin: auto;" /> ] --- .full-width[.content-box-blue.bold.font120.note[用 `by` 参数设定连接所用的键变量]] .code95[ ```r # 默认时(即`by = NULL`)使用两个数据集中同名的全部变量,即natural join # 此时会显示提示信息“Joining, by = join_by(以,分隔的全部同名变量)” flights %>% left_join(weather) # 显性设定`by = join_by(以,分隔的键变量)` 或者 `by = 键变量字符向量` flights %>% left_join(weather, by = join_by(year, month, day, origin, hour)) # 两个数据集的变量不同名时,则显性设定如 # `by = join_by(xv1 == yv1, xv2 == yv2)` 或者 `by = 命名字符变量向量` flights %>% left_join(airports, by = join_by(dest == faa)) # by = c("dest" = "faa") # dplyr v1.1.0 `by` 参数设定的新语法 `by = join_by(...)` 更加灵活,支持非等值 # 连接(non-equi joins)(具体见后) ``` ] --- .full-width[.content-box-blue.bold.font120.note[合并式连接函数`*_join()`的参数 `relationship` .red.font80[<sup>1,3</sup>]]] .pull-left.font110[ - .code110.bold[`"one-to-one"`] [[见前]](#92) - .code110[`"one-to-many"` / `"many-to-one"`] <img src="imgs/join-one-to-many.png" width="100%" /> ] .pull-right.font110[ - .code110.bold[`"many-to-many"`]<sup>.red[2]</sup> <img src="imgs/join-many-to-many.png" width="100%" /> ] <br> .footnote.red[ 1. 为 dplyr<sup>v1.1.1</sup> 的新增参数。 2. 通常来说,对两个数据集进行 `many-to-many` 的等值合并(_equi join_)并不合理,`*_join()` 函数在 `relationship` 参数取默认值 `NULL` 时会给出相应的警告信息。 3. `*_join()` 函数的其他参数:`keep`、`na_matches`、`multiple` 和 `unmatched`。 ] --- layout: true ### >> 筛选式连接(filtering joins) --- .full-width[.content-box-blue.bold.font110[`semi_join(x, y, by = NULL, copy = FALSE, ...) | anti_join()` <br>保留 | 丢弃 `x` 数据集中在 `y` 数据集中找得到匹配的样本]] -- ```r top_dest <- flights %>% count(dest, sort = TRUE) %>% head(10) flights %>% filter(dest %in% top_dest$dest) ``` ``` #> # A tibble: 141,145 × 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay #> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> #> 1 2013 1 1 542 540 2 923 850 33 #> 2 2013 1 1 554 600 -6 812 837 -25 #> 3 2013 1 1 554 558 -4 740 728 12 #> # ℹ 141,142 more rows #> # ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, #> # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, #> # time_hour <dttm> ``` -- ```r flights %>% semi_join(top_dest) # 相同的结果,但更容易处理 filter() + %in% 所难以处理的存在多个 key 变量的情况 ``` --- layout: true ### >> 非等值连接(non-equi joins) --- .pull-left[ ```r # # input data by hand # transactions <- tibble( # company = c("A", "A", "B", "B"), # year = c(2019, 2020, 2021, 2023), # revenue = c(50, 4, 10, 12) # ) transactions ``` ``` #> # A tibble: 4 × 3 #> company year revenue #> <chr> <dbl> <dbl> #> 1 A 2019 50 #> 2 A 2020 4 #> 3 B 2021 10 #> 4 B 2023 12 ``` ```r companies ``` ``` #> # A tibble: 3 × 3 #> id since name #> <chr> <dbl> <chr> #> 1 A 1973 Patagonia #> 2 B 2009 RStudio #> 3 B 2022 Posit ``` ] -- .pull-right[ ```r # 1. Equality Joins: transactions |> left_join(companies, * by = join_by(company == id)) ``` ``` #> Warning in left_join(transactions, companies, by = join_by(company == id)): Detected an unexpected many-to-many #> relationship between `x` and `y`. #> ℹ Row 3 of `x` matches multiple rows in `y`. #> ℹ Row 1 of `y` matches multiple rows in `x`. #> ℹ If a many-to-many relationship is #> expected, set `relationship = #> "many-to-many"` to silence this warning. ``` ``` #> # A tibble: 6 × 5 #> company year revenue since name #> <chr> <dbl> <dbl> <dbl> <chr> #> 1 A 2019 50 1973 Patagonia #> 2 A 2020 4 1973 Patagonia *#> 3 B 2021 10 2009 RStudio *#> 4 B 2021 10 2022 Posit *#> 5 B 2023 12 2009 RStudio *#> 6 B 2023 12 2022 Posit ``` .font120[🤔 如何解决?] ] --- .pull-left[ ```r # 2. Inequality Joins: # involves at least one join expression # containing one of inequality # conditions: >=, >, <=, or <. transactions |> left_join(companies, join_by(company == id, * year >= since)) ``` ``` #> # A tibble: 5 × 5 #> company year revenue since name #> <chr> <dbl> <dbl> <dbl> <chr> #> 1 A 2019 50 1973 Patagonia #> 2 A 2020 4 1973 Patagonia *#> 3 B 2021 10 2009 RStudio *#> 4 B 2023 12 2009 RStudio #> 5 B 2023 12 2022 Posit ``` ] -- .pull-right[ ```r # 3. Rolling Joins: transactions |> left_join(companies, join_by(company == id, * closest(year >= since))) ``` ``` #> # A tibble: 4 × 5 #> company year revenue since name #> <chr> <dbl> <dbl> <dbl> <chr> #> 1 A 2019 50 1973 Patagonia #> 2 A 2020 4 1973 Patagonia *#> 3 B 2021 10 2009 RStudio *#> 4 B 2023 12 2022 Posit *``` ```r # 4. Overlap Joins: # three helpers for overlap joins: # between(), overlaps(), and within() ``` ] -- <br> .font120[👉 `?join_by`] --- layout: true ### >> 其它类型的合并 --- .pull-left[ .full-width.content-box-blue.bold.font120[`cross_join()` / `nest_join()`] .font100[ + `cross_join(x, y, ...)` 将数据集`x`中的每一行与数据集`y`中的每一行进行匹配,结果为`nrow(x) * nrow(y)`行 + `nest_join(x, y, by = NULL, ...)` 保留数据集`x`,并将匹配的数据集`y`的数据作为新增的列表列,元素为 `tibble`(当在数据集`y`中找不到匹配数据时,列表列相应位置填充一个0行的`tibble`) + 从某种意义上说,`cross_join()` 和 `nest_join()` 是最基础的数据集连接操作函数,例如 `left_join()` 可视为 `nest_join() + unnest(.keep_empty = TRUE)`,`semi_join()` 可视为 `nest_join() + filter()`(`filter()`用来检查列表列中元素是否为0行的`tibble`) ] ] -- .pull-right[ .full-width.content-box-blue.bold.font120[用 `y` 表中的数据来修改 `x` 表中的行] .font100[ * `rows_insert(x, y, by = NULL, ...)` | `rows_update()` | `rows_upsert()` * `rows_delete()` * `rows_append()` | `rows_patch()` ] .full-width.content-box-blue.bold.font120[合并多个数据集的行或列] .font100[ * `bind_rows(..., .id = NULL)` <br> `# do.call(rbind, dfs_list)` * `bind_cols(..., .name_repair)` <br> `# do.call(cbind, dfs_list)` ] .full-width.content-box-blue.bold.font120[集合式合并] .font100[ * `intersect(x, y, ...)` * `setdiff(x, y, ...)` | `symdiff()` * `union(x, y, ...)` | `union_all()` ] ] --- layout: false class: inverse, center, middle # 7. 操作数据库和 `data.table` --- layout: true background-image: url(imgs/logo-dbplyr.png) background-size: 4% background-position: 35% 3% ### >> 连接并查询数据库 --- .pull-left.font120[ 除了处理内存中以 `tibble` 或 `data.frame` 格式存在的数据集之外,dplyr 包还可用于处理存储在主流数据库中的数据,这主要适用于两种情形: - 数据已经存储在数据库中 - 数据太大无法直接存入内存中,必须使用外部存储 这需要你额外安装 dbplyr 包<sup>2.3.4</sup>(会自动安装其需要载入的 `DBI` 包)以及你打算连接的不同数据库的接口 R 包(如 `RPostgres`、`RMariaDB`、`odbc` 等)。 以下我们以`SQLite`数据库为例说明。 ] -- .pull-right[ .full-width[.content-box-blue.bold.font120[准备工作]] ```r # install.packages("dbplyr") # install.packages("RSQLite") ``` ```r # library(dplyr) con <- DBI::dbConnect( RSQLite::SQLite(), dbname = ":memory:" ) copy_to( con, nycflights13::flights, "flights", overwrite = TRUE, indexes = list( c("year", "month", "day"), "carrier", "tailnum", "dest") ) ``` ] --- ```r # 创建对数据表的引用并列印结果 (flights_db <- tbl(con, "flights")) ``` ``` *#> # Source: table<flights> [?? x 19] *#> # Database: sqlite 3.41.2 [:memory:] #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 517 515 2 830 819 #> 2 2013 1 1 533 529 4 850 830 #> 3 2013 1 1 542 540 2 923 850 #> # ℹ more rows #> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, #> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, #> # hour <dbl>, minute <dbl>, time_hour <dbl> ``` ```r # 生成查询 tailnum_delay_db <- flights_db %>% group_by(tailnum) %>% summarise( delay = mean(arr_delay), n = n() ) %>% arrange(desc(delay)) %>% filter(n > 100) ``` --- ```r tailnum_delay_db %>% * show_query() ``` ``` #> <SQL> #> SELECT `tailnum`, AVG(`arr_delay`) AS `delay`, COUNT(*) AS `n` #> FROM `flights` #> GROUP BY `tailnum` #> HAVING (COUNT(*) > 100.0) #> ORDER BY `delay` DESC ``` ```r tailnum_delay <- tailnum_delay_db %>% * collect() tailnum_delay ``` ``` *#> # A tibble: 1,201 × 3 #> tailnum delay n #> <chr> <dbl> <int> #> 1 N11119 30.3 148 #> 2 N16919 29.9 251 #> 3 N14998 27.9 230 #> # ℹ 1,198 more rows ``` --- layout: true background-image: url(imgs/logo-dtplyr.png) background-size: 4% background-position: 34% 3% ### >> 操作 `data.table` --- .pull-left.font110[ `data.table` 是 data.table 包提供的一种数据结构,是升级版的 `data.frame`: - 简洁的语法,`x[i, j, by]` - 时间/内存高效的数据操作,如文件读取/输出、数据提取、汇总、更新、合并等 - 兼容只接受 `data.frame` 数据格式的其它 R 包 dplyr 包允许你以 dplyr 的方式操作 `data.table`,后台的 dtplyr 包<sup>1.3.1</sup>会自动将其转译为等价的 `data.table` 语法。 ] -- .pull-right.code70[ ```r # install.packages("dtplyr") library(dtplyr) # library(data.table) library(dplyr, warn.conflicts = FALSE) ``` ```r # 创建一个lazy data.table,追踪对其的操作 flights_dt <- lazy_dt(nycflights13::flights) # 操作data.table flights_dt %>% mutate(arr_delay = arr_delay/60) %>% group_by(tailnum) %>% summarize( across(ends_with("_delay"), mean), n = n() ) %>% arrange(desc(arr_delay)) %>% filter(n > 100) # Use as.data.table()/as.data.frame()/ # as_tibble() to access results ``` ] .code70[ ``` *#> Source: local data table [1,201 x 4] *#> Call: setorder(copy(`_DT1`)[, `:=`(arr_delay = arr_delay/60)][, .(dep_delay = mean(dep_delay), *#> arr_delay = mean(arr_delay), n = .N), keyby = .(tailnum)], *#> -arr_delay, na.last = TRUE)[n > 100] #> #> tailnum dep_delay arr_delay n #> <chr> <dbl> <dbl> <int> #> 1 N826UA 26.7 0.326 120 #> 2 N929DL 25.3 0.293 108 #> 3 N431UA 23.9 0.272 104 #> 4 N342NW 21.1 0.267 111 #> 5 N567UA 18.7 0.250 105 #> 6 N850UA 23.9 0.248 130 #> # ℹ 1,195 more rows #> *#> # Use as.data.table()/as.data.frame()/as_tibble() to access results ``` ] --- layout: false class: hide_logo background-image: url(imgs/logo-siuba.svg) background-size: 5% background-position: 5% 3% ###   `siuba`.font90[: Python library for using dplyr like syntax with pandas and SQL] <iframe src="https://siuba.org/" width="100%" height="540px" data-external="1"></iframe> --- layout: false class: hide_logo background-image: url(imgs/logo-ibis.svg) background-size: 5% background-position: 5% 3% ###   `ibis`.font90[: a lightweight, universal Python interface for data wrangling] <iframe src="https://ibis-project.org/tutorials/ibis-for-dplyr-users" width="100%" height="540px" data-external="1"></iframe> --- layout: false class: inverse, center, middle # 课后复习 --- class: middle .font130[ 🕐 学习 📝 [**{{Markdown基础}}**](https://qfwr2023.netlify.app/readings/markdown-basics) 和 [**{{Revealjs 演示文稿}}**](https://qfwr2023.netlify.app/readings/revealjs),**完善[第4周的课后作业](https://qfwr2023.netlify.app/hw/hw-03)**,并增加部分篇幅说明具体改进之处,于**2023年10月26日22:00前**将压缩包提交至 [{{坚果云链接}}](https://send2me.cn/rAnWP-yK/RjqyWTleU7stqA) 🕑 复习 📖 [**{{_R for Data Science, 2e_}}**](https://r4ds.had.nz/) 一书第三部分 _Transform_ 中关于连接数据集的章节,即 [**20 Joins**](https://r4ds.hadley.nz/joins) 并**(结队)完成课后练习** > 👉 ~~课后习题参考答案链接~~<br> >   [~~{{R for Data Science (2e): Solutions to Exercises}}~~](https://mine-cetinkaya-rundel.github.io/r4ds-solutions/joins.html) <br> >   [{{R for Data Science: Exercise Solutions}}](https://jrnold.github.io/r4ds-exercise-solutions/relational-data.html) 🕒 下载(打印) 📰 [**{{dplyr的cheatsheet}}**](docs/dplyr-cheatsheet.pdf) 并阅读之 🕓 `browseVignettes(package = "dplyr")` 📝,阅读: > .code80[ ``` Row-wise operations Two-table verbs ``` ] ] --- class: center, middle, hide_logo background-image: url(imgs/xaringan.png) background-size: 12% background-position: 50% 40% <br><br><br><br><br><br><br> <hr color='#f00' size='2px' width='80%'> <br> .Large.red[_**本网页版讲义的制作由 R 包 [{{`xaringan`}}](https://github.com/yihui/xaringan) 赋能!**_]