class: center, middle, inverse, title-slide .title[ # 量化金融与金融编程 ] .subtitle[ ## L05 数据导入与齐整 ] .author[ ###
曾永艺 ] .institute[ ### 厦门大学管理学院 ] .date[ ###
2023-10-27 ] --- class: middle background-image: url(imgs/readr-tidyr.png) background-size: 18% background-position: 85% 50%
> 问渠那得清如许?为有源头活水来。——[宋]朱熹 -- > Like families, tidy datasets are all alike but every messy <br> > dataset is messy in its own way. ——Hadley Wickham <br> -- .bold.font150[1\. 数据导入] * .font120[读入矩形文本数据:`readr` 包 <sup>.font80[2.1.4]</sup>] * .font120[读入其它类型的数据] -- .bold.font150[2\. 数据齐整] * .font120[齐整数据] * .font120[`tidyr` 包 <sup>.font80[1.3.0]</sup>] * .font120[`pivot_longer()` 和 `pivot_wider()`] * .font120[`separate_*()` 和 `unite()`] * .font120[`pivot\_\*()` 和 `separate\_\*()`:一个复杂案例] * .font120[`unnest_*()` 和 `hoist()`] --- class: inverse, center, middle # 1. 数据导入 .font150[(Data Import)] --- layout: true ### >> 读入文本文档:`readr` 包 --- .full-width[.content-box-blue.bold.font120[☑ 一致的函数命名方式]] ```r read_csv() | read_csv2() | read_tsv() | read_delim() | read_table() | read_fwf() | read_lines() | read_lines_raw() | read_file() | read_file_raw() | read_log() | ``` -- .full-width[.content-box-blue.bold.font120[☑ 基本一致的参数设置]] ```r read_*(file, # delim = NULL, escape_backslash = FALSE, escape_double = TRUE, col_names = TRUE, col_types = NULL, col_select = NULL, id = NULL, locale = default_locale(), na = c("", "NA"), quoted_na = TRUE, quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), name_repair = "unique", num_threads = readr_threads(), progress = show_progress(), show_col_types = should_show_types(), skip_empty_rows = TRUE, lazy = should_read_lazy()) # file: Either a path to a file, a connection, or literal data. # Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. # Files starting with http://, https://, ftp://, or ftps:// will be downloaded. # Using a value of clipboard() will read from the system clipboard. ``` --- .full-width[.content-box-blue.bold.font120[☑ 和内置 `utils` 包中 `read.*()` 比较]] -- .font110[ * 生成 `tibble`.red[<sup>*</sup>](更确切地说应该是 `spec_tbl_df`) * 默认情况下做得更少,如不会自动将字符型变量转换为因子型变量(`read.*()`有个`stringsAsFactors`参数)、不会自动更改列名、不会将一列数据转化为行名等 * 较少依赖系统和环境变量,结果更具可重现性(_reproducible_) * 读入速度通常更快(~50x),且读入大文件时有进度条提示 ] -- ``` ?`tbl_df-class` tibble: a subclass of data.frame and the central data structure for the tidyverse lazy and surly - do less and complain more than base data.frames. ====================================================================================== 1. 列数据不进行自动类型转换(如字符->因子),且原生直接支持列表列 2. 只对长度为1的“标量”向量进行循环操作 3. tibble支持不合法的数据变量名,如 tibble(`:)` = "smile") 4. 不增加行名,也不鼓励使用行名来存储数据信息 5. 用 [ 对tibble取子集操作总是返回新的tibble,[[ 和 $ 总是返回向量,且 $ 不支持变量名部分匹配 6. tibble在屏幕上打印输出时会适应屏幕大小,并提供更多有用信息(类似str()) ``` --- .full-width[.content-box-blue.bold.font120[常用参数示例:`read_csv()`]] -- .pull-left.code80[ ```r library(tidyverse) read_csv("The 1st line of metadata The 2nd line of metadata # A comment to skip x,y,z 1,2,3", * skip = 2, * comment = "#") ``` ``` #> Rows: 1 Columns: 3 #> ── Column specification ──────────────── #> Delimiter: "," #> dbl (3): x, y, z #> #> ℹ Use `spec()` to retrieve the full column specification for this data. #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ``` #> # A tibble: 1 × 3 #> x y z #> <dbl> <dbl> <dbl> #> 1 1 2 3 ``` ] -- .pull-right.code80[ ```r read_csv("1,2,3\n4,5,6") ``` ``` #> # A tibble: 1 × 3 #> `1` `2` `3` #> <dbl> <dbl> <dbl> #> 1 4 5 6 ``` ```r read_csv("1,2,3\n4,5,6", * col_names = FALSE) ``` ``` #> # A tibble: 2 × 3 #> X1 X2 X3 #> <dbl> <dbl> <dbl> #> 1 1 2 3 #> 2 4 5 6 ``` ```r read_csv("1,2,3\n4,5,6", * col_names = c("x", "y", "z")) ``` ``` #> # A tibble: 2 × 3 #> x y z #> <dbl> <dbl> <dbl> #> 1 1 2 3 #> 2 4 5 6 ``` ] --- .full-width[.content-box-blue.bold.font120[`readr` 包采用启发式策略来解析文本文档中数据]] ```r # 生成示例数据文档 set.seed(123456) # 设定随机数种子 nycflights13::flights %>% mutate(dep_delay = if_else(dep_delay <= 0, FALSE, TRUE)) %>% select(month:dep_time, dep_delay, tailnum, time_hour) %>% slice_sample(n = 20) %>% mutate(across(everything(), ~ifelse(runif(20) <= 0.1, NA, .))) %>% mutate(time_hour = lubridate::as_datetime(time_hour)) %>% * print(n = 5) %>% write_csv(file = "data/ex_flights.csv", na = "--") # 默认na = "NA" ``` ``` #> # A tibble: 20 × 6 #> month day dep_time dep_delay tailnum time_hour *#> <int> <int> <int> <lgl> <chr> <dttm> #> 1 6 26 932 TRUE <NA> 2013-06-26 12:00:00 #> 2 NA 5 NA NA <NA> 2013-12-05 17:00:00 #> 3 7 20 656 FALSE N17245 2013-07-20 10:00:00 #> 4 5 16 NA FALSE N27152 2013-05-16 12:00:00 #> 5 12 23 NA NA <NA> 2013-12-23 23:00:00 #> # ℹ 15 more rows ``` --- .full-width[.content-box-blue.bold.font120[`readr` 包采用启发式策略来解析文本文档中数据]] ```r (ex_flights <- read_csv("data/ex_flights.csv")) # 默认 na = c("", "NA") ``` ``` #> Rows: 20 Columns: 6 #> ── Column specification ────────────────────────────────────────────────────────────────── #> Delimiter: "," #> chr (6): month, day, dep_time, dep_delay, tailnum, time_hour #> #> ℹ Use `spec()` to retrieve the full column specification for this data. #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ``` #> # A tibble: 20 × 6 #> month day dep_time dep_delay tailnum time_hour *#> <chr> <chr> <chr> <chr> <chr> <chr> #> 1 6 26 932 TRUE -- 2013-06-26T12:00:00Z #> 2 -- 5 -- -- -- 2013-12-05T17:00:00Z #> 3 7 20 656 FALSE N17245 2013-07-20T10:00:00Z #> 4 5 16 -- FALSE N27152 2013-05-16T12:00:00Z #> 5 12 23 -- -- -- 2013-12-23T23:00:00Z #> # ℹ 15 more rows ``` --- .full-width[.content-box-blue.bold.font120[`readr` 包采用启发式策略来解析文本文档中数据]] > .font120[1\. 根据给定或猜测的分隔符将文本文档分割为字符串矩阵] > .font120[2\. 确定每列字符串向量的类型] > 2\.1 由 `col_types` 参数直接给定 > 2\.2 猜测:读入文档 `guess_max` 行(v2 vs. v1),并按如下顺序猜测每列变量的类型:*logical -> ~~integer ->~~ double -> ~~number ->~~ time -> date -> date-time -> character * > .font120[3\. 将每列字符串解析为相应类型的向量] -- .full-width[.content-box-blue.bold.font120[当发现 `readr` 自动解析碰到问题时]] * 可用 `problems()` 查看读入时遇到的问题 * 根据需要设定 `read_*()` 的相关参数,如 `na = "--"` * 直接设定 `read_*()` 的参数 `col_types = cols(...)` * 将全部数据读入为字符型变量,然后再用 `type_convert()` 转换变量类型 * 在后续的数据整理步骤中进行相应的处理(如 `dplyr::muate() + parse_*()`) --- .pull-left.code75[ ```r # 设定缺失值参数 read_csv("data/ex_flights.csv", na = "--") %>% spec() ``` ``` #> cols( #> month = col_double(), #> day = col_double(), #> dep_time = col_double(), #> dep_delay = col_logical(), #> tailnum = col_character(), #> time_hour = col_datetime(format = "") #> ) ``` ```r # 直接设定参数col_types,?cols ex_flights <- read_csv( "data/ex_flights.csv", col_types = cols( # cols_only() dep_delay = col_logical(), # "l" tailnum = col_character(), # "c" time_hour = col_datetime(), # "T" .default = col_integer() # "i" ) # col_types = "iiilcT" ) ``` ``` #> Warning: One or more parsing issues, call `problems()` on #> your data frame for details, e.g.: #> dat <- vroom(...) #> problems(dat) ``` ] -- .pull-right.code80[ ```r problems(ex_flights) ``` ``` #> # A tibble: 17 × 5 #> row col expected actual file #> <int> <int> <chr> <chr> <chr> #> 1 3 1 an integer -- C:/Users/admin/Desktop/QFwR_2023.09/slides/L05_Im… #> 2 3 3 an integer -- C:/Users/admin/Desktop/QFwR_2023.09/slides/L05_Im… #> 3 3 4 1/0/T/F/TRUE/FALSE -- C:/Users/admin/Desktop/QFwR_2023.09/slides/L05_Im… #> # ℹ 14 more rows ``` ```r # 全部读入,再行转换类型 read_csv( "data/ex_flights.csv", col_types = cols( * .default = col_character() ) ) %>% * type_convert(col_types = "iiilcT") ``` ``` #> # A tibble: 20 × 6 #> month day dep_time dep_delay tailnum time_hour #> <int> <int> <int> <lgl> <chr> <dttm> #> 1 6 26 932 TRUE -- 2013-06-26 12:00:00 #> 2 NA 5 NA NA -- 2013-12-05 17:00:00 #> 3 7 20 656 FALSE N17245 2013-07-20 10:00:00 #> # ℹ 17 more rows ``` ] --- .full-width[.content-box-blue.bold.font120[一次性读入多个相同性质的文档 <sup>[new in `readr` v2.0.0]</sup>]] ```r # 生成相同性质的示例文档 nycflights13::flights %>% split(.$carrier) %>% purrr::iwalk(~ write_tsv(.x, glue::glue("data/flights_{.y}.tsv"))) ``` -- ```r # 文件路径向量 (files <- dir(path = "data/", pattern = "\\.tsv$", full.names = TRUE)) ``` ``` #> [1] "data/flights_9E.tsv" "data/flights_AA.tsv" "data/flights_AS.tsv" #> [4] "data/flights_B6.tsv" "data/flights_DL.tsv" "data/flights_EV.tsv" #> [7] "data/flights_F9.tsv" "data/flights_FL.tsv" "data/flights_HA.tsv" #> [10] "data/flights_MQ.tsv" "data/flights_OO.tsv" "data/flights_UA.tsv" #> [13] "data/flights_US.tsv" "data/flights_VX.tsv" "data/flights_WN.tsv" #> [16] "data/flights_YV.tsv" ``` ```r # 一次性读入(并自动纵向合并为 tibble) read_tsv(files, id = "fpath") # 这里通过设置 id 参数将文件路径存入指定变量 fpath 中 ``` ``` #> # A tibble: 336,776 × 20 #> fpath year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 data/flight… 2013 1 1 810 810 0 1048 1037 #> 2 data/flight… 2013 1 1 1451 1500 -9 1634 1636 #> 3 data/flight… 2013 1 1 1452 1455 -3 1637 1639 #> # ℹ 336,773 more rows #> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <dbl>, tailnum <chr>, #> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, #> # time_hour <dttm> ``` --- .full-width[.content-box-blue.bold.font120[`readr` 的图形化用户界面 <sup>.font80.red[*]</sup>]] <img src="imgs/readr_UI.png" width="88%" style="display: block; margin: auto;" /> .footnote.red.font80[\* `RStudio` 右上 `Environment` 标签页 > `Import Dataset` > `From Text (readr)` ...] --- layout: true ### >> 读入其它类型的数据 --- -- .font110[ * .bold[`readr::read_rds()`]: `R` data files (*.rds*) ] -- .pull-left.font110[ * ![](imgs/logo-vroom.png) [~~.bold[{{`vroom`}}~~]](https://vroom.r-lib.org/): Read and Write Rectangular Text Data Quickly * ![](imgs/logo-readxl.png) [.bold[{{`readxl`}}]](https://readxl.tidyverse.org/): excel files (_.xls_ and _.xlsx_) <- **take a 🧐 at it!** * ![](imgs/logo-haven.png) [.bold[{{`haven`}}]](https://haven.tidyverse.org/): SPSS, Stata, and SAS files * ![](imgs/logo-arrow.png) [.bold[{{`arrow`}}]](https://arrow.apache.org/docs/r/): Apache Arrow ] -- .pull-right.font110[ * [.bold[{{`DBI`}}]](https://dbi.r-dbi.org/) with specific interface packages: databases, such as SQLite, MySQL and PostgreSQL * .bold[`jsonlite`]: JSON files * [.bold[{{`xml2`}}]](https://xml2.r-lib.org/): XML and HTML files * ![](imgs/logo-rvest.png) [.bold[{{`rvest`}}]](https://rvest.tidyverse.org): HTML (Web Scraping) * [.bold[{{`httr2`}}]](https://httr2.r-lib.org/): Web APIs ] -- <br> .font110[ * .bold[[{{其它}}](https://cran.r-project.org/web/views/)]: text、network、spatial、genome、image 等类型的数据 ] --- layout: false class: inverse, center, middle # 2. 数据齐整 .font150[(Tidy Data)] --- layout: true ### >> 齐整数据 --- .full-width[.content-box-blue.bold.font120[齐整数据的三个条件 📐]] .font120[ * 每列都是一个变量(Every column is a variable.) * 每行都是一个观测(Every row is an observation.) * 每格都是一个取值(Every cell is a single value.) ] -- <img src="imgs/tidy-1.png" width="85%" style="display: block; margin: auto;" /> --- .full-width[.content-box-blue.bold.font120[齐整数据的好处 💪]] .font120[ ☑ 齐整数据按照逻辑一致的方式存储数据,这让你更容易学习并掌握相关工具对数据进行处理 ☑ “每列都是一个变量”及“每行都是一个观测”有助于发挥 R 语言向量化操作的优势 ] <img src="imgs/tidy-2.png" width="80%" style="display: block; margin: auto;" /> .font120[ ☑ `tidyverse` 中的 R 包(如 `dplyr` 、`ggplot2` 等)在设计上要求输入数据为齐整数据 ] --- .full-width[.content-box-blue.bold.font120[但……并非所有的数据集都是齐整的,😭]] .pull-left.code80[ ```r table1 ``` ``` #> # A tibble: 6 × 4 #> country year cases population #> <chr> <dbl> <dbl> <dbl> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> # ℹ 3 more rows ``` ```r table2 ``` ``` #> # A tibble: 12 × 4 #> country year type count #> <chr> <dbl> <chr> <dbl> #> 1 Afghanistan 1999 cases 745 #> 2 Afghanistan 1999 population 19987071 #> 3 Afghanistan 2000 cases 2666 #> # ℹ 9 more rows ``` ] .pull-right.code70[ ```r table3 ``` ``` #> # A tibble: 6 × 3 #> country year rate #> <chr> <dbl> <chr> #> 1 Afghanistan 1999 745/19987071 #> 2 Afghanistan 2000 2666/20595360 #> 3 Brazil 1999 37737/172006362 #> # ℹ 3 more rows ``` ```r table4a # cases ``` ``` #> # A tibble: 3 × 3 #> country `1999` `2000` #> <chr> <dbl> <dbl> #> 1 Afghanistan 745 2666 #> 2 Brazil 37737 80488 #> 3 China 212258 213766 ``` ```r table4b # population ``` ``` #> # A tibble: 3 × 3 #> country `1999` `2000` #> <chr> <dbl> <dbl> #> 1 Afghanistan 19987071 20595360 #> 2 Brazil 172006362 174504898 #> 3 China 1272915272 1280428583 ``` ] --- layout: true ### >> 该 `tidyr` 包出场啦 🎉 --- .full-width[.content-box-blue.bold.font120[`tidyr`包中的函数大致可分为5大类:]] .pull-left.font105[ 1. 数据长型-宽型转换(_pivoting_):`pivot_longer()` 和 `pivot_wider()` 2. 分解或合并字符型变量:`separate_longer_delim()|*_position()`、`separate_wider_delim()|*_position|*_regex()` 和 `unite()` 3. 将深度嵌套的列表数据表格化(_rectangling_):`unnest_longer()`、`unnest_wider()`、`unnest_auto()`、`unnest()` 和 `hoist()` 4. **缺失值处理:->** 5. 其它函数:`chop()|unchop()`; `pack()|unpack()`; `uncount()`; `nest()` ] -- .pull-right.font80[ - `complete(data, ..., fill = list(), explicit = TRUE)`: Complete a data frame with missing combinations of data - `expand(data, ...)`, `crossing(...)`, `nesting(...)`: Expand data frame to include all combinations of values - `expand_grid(...)`: Create a tibble from all combinations of the inputted name-value pairs - `full_seq(x, period, tol = 1e-06)`: Create the full sequence of values in a vector <hr> - `drop_na(data, ...)`: Drop rows containing missing values - `fill(data, ..., .direction = c("down", "up", "downup", "updown"))`: Fill in missing values with previous or next value - `replace_na(data, replace, ...)`: Replace missing values with specified values ] --- layout: true ### >> **.red[`pivot_longer()`]** 和 `pivot_wider()` --- <img src="imgs/tidy-3.png" width="75%" style="display: block; margin: auto;" /> -- .pull-left.code75[ ```r pivot_longer( * data, cols, ..., cols_vary = "fastest", * names_to = "name", names_prefix = NULL, names_sep = NULL, names_pattern = NULL, names_ptypes = NULL, names_transform = NULL, names_repair = "check_unique", * values_to = "value", values_drop_na = FALSE, values_ptypes = NULL, values_transform = NULL ) ``` ] -- .pull-right.code80[ ```r table4a %>% pivot_longer( cols = c(`1999`, `2000`), names_to = "year", values_to = "cases" ) # table4b %>% ... ``` ``` #> # A tibble: 6 × 3 #> country year cases #> <chr> <chr> <dbl> #> 1 Afghanistan 1999 745 #> 2 Afghanistan 2000 2666 #> 3 Brazil 1999 37737 #> # ℹ 3 more rows ``` ] --- layout: true ### >> `pivot_longer()` 和 **.red[`pivot_wider()`]** --- <img src="imgs/tidy-4.png" width="75%" style="display: block; margin: auto;" /> -- .pull-left.code75[ ```r pivot_wider( * data, ..., id_cols = NULL, id_expand = FALSE, * names_from = name, names_prefix = "", names_sep = "_", names_glue = NULL, names_sort = FALSE, names_vary = "fastest", names_expand = FALSE, names_repair = "check_unique", * values_from = value, values_fill = NULL, values_fn = NULL, unused_fn = NULL ) ``` ] -- .pull-right.code80[ ```r table2 %>% pivot_wider( names_from = type, values_from = count ) ``` ``` #> # A tibble: 6 × 4 #> country year cases population #> <chr> <dbl> <dbl> <dbl> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> # ℹ 3 more rows ``` ] --- layout: false class: hide_logo ## 🙋♂️ Your Turn! <style type="text/css"> #special_timer.running { background-color: black; background-image: url(imgs/bg-stars.gif); } #special_timer.finished { background-color: black; background-image: url(imgs/bg-sqfw.gif); background-size: cover; } #special_timer.running .countdown-digits { color: #fdf6e3; } #special_timer.finished .countdown-digits { color: #fdf6e3; } </style>
−
+
05
:
00
.panelset[ .panel[.panel-name[问题] .font120[在空白处填入恰当的代码将数据集 billboard 转化为目标结构:] .pull-left.code100[ <code class ='r hljs remark-code'>library(tidyverse)<br><br>billboard %>% View()<br>billboard %>%<br> pivot_<span style="background-color:#ffff7f"> </span>(<br> cols = <span style="background-color:#ffff7f"> </span>,<br> names_to = <span style="background-color:#ffff7f"> </span>,<br> <span style="background-color:#ffff7f"> </span>,<br> names_transform = <span style="background-color:#ffff7f"> </span>,<br> values_to = <span style="background-color:#ffff7f"> </span>,<br> values_drop_na = TRUE<br> )</code> ``` #> # A tibble: 5,307 × 5 #> artist track date.entered week rank #> <chr> <chr> <date> <int> <dbl> #> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87 #> 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82 #> 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72 #> # ℹ 5,304 more rows ``` ] .pull-right.code80[ ``` *# A tibble: 317 × 79 artist track date.entered wk1 wk2 <chr> <chr> <date> <dbl> <dbl> 1 2 Pac Baby… 2000-02-26 87 82 2 2Ge+her The … 2000-09-02 91 87 3 3 Door… Kryp… 2000-04-08 81 70 # ℹ 314 more rows # ℹ 74 more variables: wk3 <dbl>, # wk4 <dbl>, wk5 <dbl>, wk6 <dbl>, # wk7 <dbl>, wk8 <dbl>, wk9 <dbl>, # wk10 <dbl>, wk11 <dbl>, wk12 <dbl>, # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, # wk16 <dbl>, wk17 <dbl>, … ``` ``` *# A tibble: 5,307 × 5 artist track date.entered week rank <chr> <chr> <date> <int> <dbl> 1 2 Pac Baby … 2000-02-26 1 87 2 2 Pac Baby … 2000-02-26 2 82 3 2 Pac Baby … 2000-02-26 3 72 # ℹ 5,304 more rows ``` ] ] .panel[.panel-name[参考答案] .code110[ ```r library(tidyverse) billboard %>% View() billboard %>% pivot_longer( cols = starts_with("wk") , names_to = "week" , names_prefix = "wk" , names_transform = list(week = as.integer), values_to = "rank" , values_drop_na = TRUE ) ``` ] ] ] --- layout: true ### >> **.red[`separate\_wider\_\*()`、`separate\_longer\_\*()`]** 和 `unite()` --- <img src="imgs/tidy-5.png" width="55%" style="display: block; margin: auto;" /> -- .pull-left.code80[ ```r separate_wider_delim|position|regex( * data, cols, * delim | widths | patterns, ..., names, names_sep = NULL, names_repair, too_few, too_many, cols_remove = TRUE ) separate_longer_delim|position( * data, cols, * delim | width, ..., keep_empty = FALSE ) ``` ] -- .pull-right.code80[ ```r table3 %>% separate_wider_delim( rate, delim = "/", names = c("cases", "population") ) ``` ``` #> # A tibble: 6 × 4 #> country year cases population #> <chr> <dbl> <chr> <chr> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> # ℹ 3 more rows ``` ] --- layout: true ### >> `separate_wider_*()`、`separate_longer_*()` 和 **.red[`unite()`]** --- <img src="imgs/tidy-6.png" width="80%" style="display: block; margin: auto;" /> -- .pull-left.code100[ ```r unite( data, col, ..., sep = "_", remove = TRUE, na.rm = FALSE ) ``` ] -- .pull-right.code90[ ```r table5 %>% unite( * col = "year", * century, year, sep = "" ) ``` ``` #> # A tibble: 6 × 3 #> country year rate #> <chr> <chr> <chr> #> 1 Afghanistan 1999 745/19987071 #> 2 Afghanistan 2000 2666/20595360 #> 3 Brazil 1999 37737/172006362 #> # ℹ 3 more rows ``` ] --- layout: false class: hide_logo ## 🙋♂️ Your Turn!
−
+
02
:
00
.panelset[ .panel[.panel-name[问题] .font120[使用 `dplyr::mutate()` 实现 `tidyr::unite()` 的结果(提示:`help(paste)`)] <br> .pull-left.code100[ ```r library(tidyverse) table5 %>% unite( col = "year", century, year, sep = "" ) ``` ] .pull-right[ ``` #> # A tibble: 6 × 3 #> country year rate #> <chr> <chr> <chr> #> 1 Afghanistan 1999 745/19987071 #> 2 Afghanistan 2000 2666/20595360 #> 3 Brazil 1999 37737/172006362 #> 4 Brazil 2000 80488/174504898 #> 5 China 1999 212258/1272915272 #> 6 China 2000 213766/1280428583 ``` ] ] .panel[.panel-name[参考答案] .font120[使用 `dplyr::mutate()` 实现 `tidyr::unite()` 的结果(提示:`help(paste)`)] <br> .pull-left.code100[ ```r library(tidyverse) table5 %>% mutate( year = paste0(century, year), .keep = "unused", .after = country ) ``` ] .pull-right[ ``` #> # A tibble: 6 × 3 #> country year rate #> <chr> <chr> <chr> #> 1 Afghanistan 1999 745/19987071 #> 2 Afghanistan 2000 2666/20595360 #> 3 Brazil 1999 37737/172006362 #> 4 Brazil 2000 80488/174504898 #> 5 China 1999 212258/1272915272 #> 6 China 2000 213766/1280428583 ``` ] ] ] --- layout: true ### >> `pivot_*()` 和 `separate_*()`:一个复杂例子 --- ```r who # ?who ``` ``` #> # A tibble: 7,240 × 60 #> country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544 #> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 Afghanistan AF AFG 1980 NA NA NA NA #> 2 Afghanistan AF AFG 1981 NA NA NA NA #> 3 Afghanistan AF AFG 1982 NA NA NA NA #> 4 Afghanistan AF AFG 1983 NA NA NA NA #> 5 Afghanistan AF AFG 1984 NA NA NA NA #> 6 Afghanistan AF AFG 1985 NA NA NA NA #> 7 Afghanistan AF AFG 1986 NA NA NA NA #> 8 Afghanistan AF AFG 1987 NA NA NA NA #> 9 Afghanistan AF AFG 1988 NA NA NA NA #> 10 Afghanistan AF AFG 1989 NA NA NA NA #> # ℹ 7,230 more rows #> # ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>, new_sp_m65 <dbl>, #> # new_sp_f014 <dbl>, new_sp_f1524 <dbl>, new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, #> # new_sp_f4554 <dbl>, new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>, #> # new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>, new_sn_m4554 <dbl>, #> # new_sn_m5564 <dbl>, new_sn_m65 <dbl>, new_sn_f014 <dbl>, new_sn_f1524 <dbl>, #> # new_sn_f2534 <dbl>, new_sn_f3544 <dbl>, new_sn_f4554 <dbl>, new_sn_f5564 <dbl>, … ``` --- .code80[ ```r who1 <- who %>% pivot_longer( cols = new_sp_m014:newrel_f65, names_to = "name", values_to = "cases", values_drop_na = TRUE ) who1 ``` ``` #> # A tibble: 76,046 × 6 #> country iso2 iso3 year name cases #> <chr> <chr> <chr> <dbl> <chr> <dbl> #> 1 Afghanistan AF AFG 1997 new_sp_m014 0 #> 2 Afghanistan AF AFG 1997 new_sp_m1524 10 #> 3 Afghanistan AF AFG 1997 new_sp_m2534 6 #> # ℹ 76,043 more rows ``` ] -- > 变量 `name` 中的字符串代表什么?通过查阅 `who` 的帮助文档得到如下信息: > 1. 前3个字母代表是否是“新增”肺结核病例,在数据中均为`new` > 2. 接下来2-3个字母代表肺结核的类型(`sp`、`sn`、`ep`、`rel`等) > 3. 接下来的字母`m`或`f`代表肺结核病人的性别(男或女) > 4. 余下数字代表年龄组,如 `014 = 0 - 14岁 | 1524 = 15 - 24岁` --- .code80[ ```r who2 <- who1 %>% mutate(name = stringr::str_replace(name, "newrel", "new_rel")) %>% separate_wider_delim(name, delim = "_", names = c(NA, "type", "sexage")) # 基于分隔符 who2 ``` ``` #> # A tibble: 76,046 × 7 #> country iso2 iso3 year type sexage cases #> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> #> 1 Afghanistan AF AFG 1997 sp m014 0 #> 2 Afghanistan AF AFG 1997 sp m1524 10 #> 3 Afghanistan AF AFG 1997 sp m2534 6 #> # ℹ 76,043 more rows ``` ] -- .code80[ ```r who3 <- who2 %>% separate_wider_position( # 基于位置 sexage, widths = c(sex = 1, age = 4), too_few = "align_start" ) who3 ``` ``` #> # A tibble: 76,046 × 8 #> country iso2 iso3 year type sex age cases #> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> #> 1 Afghanistan AF AFG 1997 sp m 014 0 #> 2 Afghanistan AF AFG 1997 sp m 1524 10 #> 3 Afghanistan AF AFG 1997 sp m 2534 6 #> # ℹ 76,043 more rows ``` ] --- .code80[ ```r *# all in a pipe! who %>% pivot_longer(cols = new_sp_m014:newrel_f65, names_to = "name", values_to = "cases", values_drop_na = TRUE) %>% mutate(name = stringr::str_replace(name, "newrel", "new_rel")) %>% separate_wider_delim(name, delim = "_", names = c(NA, "type", "sexage")) %>% separate_wider_position(sexage, widths = c(sex = 1, age = 4), too_few = "align_start") ``` ] -- .pull-left.code80[ ```r *# all with powerful pivot_longer()! who %>% pivot_longer( cols = starts_with("new"), * names_to = c("type", "sex", "age"), * names_pattern = "new_?(.*)_(.)(.*)", * names_ptypes = list( * sex = factor(levels = c("f", "m")), * age = factor( * levels = c("014", "1524", "2534", * "3544", "4554", "5564", "65"), * ordered = TRUE)), values_to = "cases", values_drop_na = TRUE ) ``` ] .pull-right.code80[ ``` #> # A tibble: 76,046 × 8 #> country iso2 iso3 year type sex #> <chr> <chr> <chr> <dbl> <chr> <fct> #> 1 Afghani… AF AFG 1997 sp m #> 2 Afghani… AF AFG 1997 sp m #> 3 Afghani… AF AFG 1997 sp m #> 4 Afghani… AF AFG 1997 sp m #> 5 Afghani… AF AFG 1997 sp m #> # ℹ 76,041 more rows #> # ℹ 2 more variables: age <ord>, #> # cases <dbl> ``` <br> .font120.bold[👉 `vignette("pivot")`] ] --- layout: true ### >> `unnest_*` 和 `hoist()` --- .pull-left.code90[ ```r repurrrsive::gh_repos %>% View() ``` <img src="imgs/repos.png" width="100%" style="display: block; margin: auto auto auto 0;" /> ] -- .pull-right.code80[ ```r (repos <- tibble(repo = repurrrsive::gh_repos)) ``` ``` #> # A tibble: 6 × 1 #> repo #> <list> #> 1 <list [30]> #> 2 <list [30]> #> 3 <list [30]> #> 4 <list [26]> #> 5 <list [30]> #> # ℹ 1 more row ``` ```r (repos <- repos %>% unnest_longer(repo)) ``` ``` #> # A tibble: 176 × 1 #> repo #> <list> #> 1 <named list [68]> #> 2 <named list [68]> #> 3 <named list [68]> #> 4 <named list [68]> #> 5 <named list [68]> #> # ℹ 171 more rows ``` ] --- .pull-left.code90[ ```r repos %>% unnest_wider(repo) ``` ``` #> # A tibble: 176 × 68 #> id name full_name owner #> <int> <chr> <chr> <list> #> 1 61160198 after gaborcsa… <named list> #> 2 40500181 argufy gaborcsa… <named list> #> 3 36442442 ask gaborcsa… <named list> #> 4 34924886 basei… gaborcsa… <named list> #> 5 61620661 citest gaborcsa… <named list> #> # ℹ 171 more rows #> # ℹ 64 more variables: private <lgl>, #> # html_url <chr>, description <chr>, #> # fork <lgl>, url <chr>, #> # forks_url <chr>, keys_url <chr>, #> # collaborators_url <chr>, #> # teams_url <chr>, hooks_url <chr>, … ``` ] -- .pull-right.code90[ ```r *repos %>% hoist( repo, * login = c("owner", "login"), name = "name", homepage = "homepage", watchers = "watchers_count" ) ``` ``` #> # A tibble: 176 × 5 *#> login name homepage watchers *#> <chr> <chr> <chr> <int> #> 1 gaborcsardi after <NA> 5 #> 2 gaborcsardi argufy <NA> 19 #> 3 gaborcsardi ask <NA> 5 #> 4 gaborcsardi baseimp… <NA> 0 #> 5 gaborcsardi citest <NA> 0 #> # ℹ 171 more rows #> # ℹ 1 more variable: repo <list> ``` <br> .font120.bold[👉 `vignette("rectangle")`] ] --- layout: false class: inverse, center, middle # 课后作业 --- class: middle .font130[ 1\. 复习 📖 [**{{_R for Data Science, 2e_}}**](https://r4ds.had.nz/) 一书第一部分 _Whole game_ 中关于数据导入([**8 Data import**](https://r4ds.hadley.nz/data-import))、数据齐整([**6 Data tidying**](https://r4ds.hadley.nz/data-tidy))以及 [**{{第四部分 _Import_}}**](https://r4ds.hadley.nz/import) 中的五个章节(分别为 _Spreadsheets_、_Databases_、_Arrow_、_Hierarchical data_ 和 _Web scraping_)并**(结队)完成课后练习** > .font80[👉 **参考答案**:R for Data Science (2e): Solutions to Exercises 之 [8 Data importing](https://mine-cetinkaya-rundel.github.io/r4ds-solutions/data-import.html) / [6 Data tidying](https://mine-cetinkaya-rundel.github.io/r4ds-solutions/data-tidy.html)<br>] > .font80[👉 **直接搜索**:在 [**{{_R for Data Science, 2e_}}**](https://r4ds.had.nz/) 上搜索相应函数(如 `read_excel`、`separate_wider_delim` 等)并进行针对性学习] 2\. 下载(打印) 📰 .bold[[{{Data Import的cheatsheet}}](docs/data-import-cheatsheet.pdf)] 和 .bold[[{{Data Tidying的cheatsheet}}](docs/tidyr-cheatsheet.pdf)] 并阅读之 3\. `browseVignettes(package = c("readr", "tidyr"))` 📝 4\. 👨🏫 [**{{启动小组作业}}**](https://qfwr2023.netlify.app/project/team-project.html) ] --- layout: false ### 教 + 学 & 考核 <br> -- .pull-left.font130[ - 课上 2 学时/周 - 我:概要 + 出题 + 答疑 + 扩展 - 同学:听讲 + 练习 + 反馈 - .red.bold[课后 5 小时/周 .font60[起]] - 同学:预习 + 复习 + 作业 - 我:备课 + (在线)答疑解惑 ] -- .pull-right.font140[ .content-box-yellow.bold[ - 课程的成绩构成: - 考勤:5% - 课堂参与:15% - 个人作业/随堂小测:40% - 小组研究项目:40% <sup>.red[*]</sup> .font80.right.red[\* 每个小组 ~4 名同学 ] ] ] -- <br> .font130.right.bold.red[增加一个分组条件:男女搭配 ] --- ### 研究选题 <br> .pull-left.font130[ .font120.bold[去年选题一览:] 1. 投资组合(天真)分散化 2. 周日效应(DoW effect) 3. 金融工程 App 4. 指数基金定投策略 App 5. 基于技术指标的量化投资策略 6. 基本面量化投资策略 7. 网络数据抓取与分析 ] -- .pull-right.font130[ .font120.bold[今年?] <br><br><br> .red.font300.font300.center[自选!] ] --- ### 任务 -- .font150[🕐 各组长于**11月08日晚24:00前**提交以下信息至 [{{坚果云}}](https://send2me.cn/Xz1XPQp9/RTe1rxAmO_AzuQ)] .font120[ > 1. 小组名、组长姓名 > > 2. **项目选题、研究思路与预期成果** > > 3. 小组成员集体照一张(标注姓名) ] -- .font150[🕑 小组制定研究计划,分工协作,积极开展项目研究] -- .font150[🕒 各小组于**12月01日**上台简要报告中期研究进展] -- .font150[🕓 各小组于**12月15日前**完成项目研究,提交研究成果(如研究报告、展示文档、App等)] -- .font150[🕔 **12月29日**上台展示小组项目] --- layout: false class: center, middle, hide_logo background-image: url(imgs/xaringan.png) background-size: 12% background-position: 50% 40% <br><br><br><br><br><br><br> <hr color='#f00' size='2px' width='80%'> <br> .Large.red[_**本网页版讲义的制作由 R 包 [{{`xaringan`}}](https://github.com/yihui/xaringan) 赋能!**_]