class: center, middle, inverse, title-slide .title[ # 量化金融与金融编程 ] .subtitle[ ## L05 数据导入与齐整 | 课前预习 ] .author[ ###
曾永艺 ] .institute[ ### 厦门大学管理学院 ] .date[ ###
2023-10-25 ] --- class: middle background-image: url(imgs/readr-tidyr.png) background-size: 18% background-position: 85% 50%
> 问渠那得清如许?为有源头活水来。——[宋]朱熹 -- > Like families, tidy datasets are all alike but every messy <br> > dataset is messy in its own way. ——Hadley Wickham <br> -- .bold.font150[1\. 数据导入] * .font120[读入矩形文本数据:`readr` 包 <sup>.font80[2.1.4]</sup>] * .font120[读入其它类型的数据] -- .bold.font150[2\. 数据齐整] * .font120[齐整数据] * .font120[`tidyr` 包 <sup>.font80[1.3.0]</sup>] * .font120[`pivot_longer()` 和 `pivot_wider()`] * .font120[`separate_*()` 和 `unite()`] * .font120.gray[`pivot\_\*()` 和 `separate\_\*()`:一个复杂案例] * .font120.gray[`unnest_*()` 和 `hoist()`] --- class: inverse, center, middle # 1. 数据导入 .font150[(Data Import)] --- layout: true ### >> 读入文本文档:`readr` 包 --- .full-width[.content-box-blue.bold.font120[☑ 一致的函数命名方式]] ```r read_csv() | read_csv2() | read_tsv() | read_delim() | read_table() | read_fwf() | read_lines() | read_lines_raw() | read_file() | read_file_raw() | read_log() | ``` -- .full-width[.content-box-blue.bold.font120[☑ 基本一致的参数设置]] ```r read_*(file, # delim = NULL, escape_backslash = FALSE, escape_double = TRUE, col_names = TRUE, col_types = NULL, col_select = NULL, id = NULL, locale = default_locale(), na = c("", "NA"), quoted_na = TRUE, quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), name_repair = "unique", num_threads = readr_threads(), progress = show_progress(), show_col_types = should_show_types(), skip_empty_rows = TRUE, lazy = should_read_lazy()) # file: Either a path to a file, a connection, or literal data. # Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. # Files starting with http://, https://, ftp://, or ftps:// will be downloaded. # Using a value of clipboard() will read from the system clipboard. ``` --- .full-width[.content-box-blue.bold.font120[☑ 和内置 `utils` 包中 `read.*()` 比较]] -- .font110[ * 生成 `tibble`.red[<sup>*</sup>](更确切地说应该是 `spec_tbl_df`) * 默认情况下做得更少,如不会自动将字符型变量转换为因子型变量(`read.*()`有个`stringsAsFactors`参数)、不会自动更改列名、不会将一列数据转化为行名等 * 较少依赖系统和环境变量,结果更具可重现性(_reproducible_) * 读入速度通常更快(~50x),且读入大文件时有进度条提示 ] -- ``` ?`tbl_df-class` tibble: a subclass of data.frame and the central data structure for the tidyverse lazy and surly - do less and complain more than base data.frames. ====================================================================================== 1. 列数据不进行自动类型转换(如字符->因子),且原生直接支持列表列 2. 只对长度为1的“标量”向量进行循环操作 3. tibble支持不合法的数据变量名,如 tibble(`:)` = "smile") 4. 不增加行名,也不鼓励使用行名来存储数据信息 5. 用 [ 对tibble取子集操作总是返回新的tibble,[[ 和 $ 总是返回向量,且 $ 不支持变量名部分匹配 6. tibble在屏幕上打印输出时会适应屏幕大小,并提供更多有用信息(类似str()) ``` --- .full-width[.content-box-blue.bold.font120[常用参数示例:`read_csv()`]] -- .pull-left.code80[ ```r library(tidyverse) read_csv("The 1st line of metadata The 2nd line of metadata # A comment to skip x,y,z 1,2,3", skip = 2, comment = "#") ``` ``` #> Rows: 1 Columns: 3 #> ── Column specification ──────────────── #> Delimiter: "," #> dbl (3): x, y, z #> #> ℹ Use `spec()` to retrieve the full column specification for this data. #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ``` #> # A tibble: 1 × 3 #> x y z #> <dbl> <dbl> <dbl> #> 1 1 2 3 ``` ] -- .pull-right.code80[ ```r read_csv("1,2,3\n4,5,6") ``` ``` #> # A tibble: 1 × 3 #> `1` `2` `3` #> <dbl> <dbl> <dbl> #> 1 4 5 6 ``` ```r read_csv("1,2,3\n4,5,6", col_names = FALSE) ``` ``` #> # A tibble: 2 × 3 #> X1 X2 X3 #> <dbl> <dbl> <dbl> #> 1 1 2 3 #> 2 4 5 6 ``` ```r read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z")) ``` ``` #> # A tibble: 2 × 3 #> x y z #> <dbl> <dbl> <dbl> #> 1 1 2 3 #> 2 4 5 6 ``` ] --- .full-width[.content-box-blue.bold.font120[`readr` 包采用启发式策略来解析文本文档中数据]] ```r # 生成示例数据文档 set.seed(123456) # 设定随机数种子 nycflights13::flights %>% mutate(dep_delay = if_else(dep_delay <= 0, FALSE, TRUE)) %>% select(month:dep_time, dep_delay, tailnum, time_hour) %>% slice_sample(n = 20) %>% mutate(across(everything(), ~ifelse(runif(20) <= 0.1, NA, .))) %>% mutate(time_hour = lubridate::as_datetime(time_hour)) %>% * print(n = 5) %>% write_csv(file = "data/ex_flights.csv", na = "--") # 默认na = "NA" ``` ``` #> # A tibble: 20 × 6 #> month day dep_time dep_delay tailnum time_hour #> <int> <int> <int> <lgl> <chr> <dttm> #> 1 6 26 932 TRUE <NA> 2013-06-26 12:00:00 #> 2 NA 5 NA NA <NA> 2013-12-05 17:00:00 #> 3 7 20 656 FALSE N17245 2013-07-20 10:00:00 #> 4 5 16 NA FALSE N27152 2013-05-16 12:00:00 #> 5 12 23 NA NA <NA> 2013-12-23 23:00:00 #> # ℹ 15 more rows ``` --- .full-width[.content-box-blue.bold.font120[`readr` 包采用启发式策略来解析文本文档中数据]] ```r (ex_flights <- read_csv("data/ex_flights.csv")) # 默认na = c("", "NA") ``` ``` #> Rows: 20 Columns: 6 #> ── Column specification ────────────────────────────────────────────────────────────────── #> Delimiter: "," #> chr (6): month, day, dep_time, dep_delay, tailnum, time_hour #> #> ℹ Use `spec()` to retrieve the full column specification for this data. #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ``` #> # A tibble: 20 × 6 #> month day dep_time dep_delay tailnum time_hour #> <chr> <chr> <chr> <chr> <chr> <chr> #> 1 6 26 932 TRUE -- 2013-06-26T12:00:00Z #> 2 -- 5 -- -- -- 2013-12-05T17:00:00Z #> 3 7 20 656 FALSE N17245 2013-07-20T10:00:00Z #> 4 5 16 -- FALSE N27152 2013-05-16T12:00:00Z #> 5 12 23 -- -- -- 2013-12-23T23:00:00Z #> # ℹ 15 more rows ``` --- .full-width[.content-box-blue.bold.font120[`readr` 包采用启发式策略来解析文本文档中数据]] > .font120[1\. 根据给定或猜测的分隔符将文本文档分割为字符串矩阵] > .font120[2\. 确定每列字符串向量的类型] > 2\.1 由 `col_types` 参数直接给定 > 2\.2 猜测:读入文档 `guess_max` 行(v2 vs. v1),并按如下顺序猜测每列变量的类型:*logical -> ~~integer ->~~ double -> ~~number ->~~ time -> date -> date-time -> character * > .font120[3\. 将每列字符串解析为相应类型的向量] -- .full-width[.content-box-blue.bold.font120[当发现 `readr` 自动解析碰到问题时]] * 可用 `problems()` 查看读入时遇到的问题 * 根据需要设定 `read_*()` 的相关参数,如 `na = "--"` * 直接设定 `read_*()` 的参数 `col_types = cols(...)` * 将全部数据读入为字符型变量,然后再用 `type_convert()` 转换变量类型 * 在后续的数据整理步骤中进行相应的处理(如 `dplyr::muate() + parse_*()`) --- .pull-left.code75[ ```r # 设定缺失值参数 read_csv("data/ex_flights.csv", na = "--") %>% spec() ``` ``` #> cols( #> month = col_double(), #> day = col_double(), #> dep_time = col_double(), #> dep_delay = col_logical(), #> tailnum = col_character(), #> time_hour = col_datetime(format = "") #> ) ``` ```r # 直接设定参数col_types,?cols ex_flights <- read_csv( "data/ex_flights.csv", col_types = cols( # cols_only() dep_delay = col_logical(), # "l" tailnum = col_character(), # "c" time_hour = col_datetime(), # "T" .default = col_integer() ) # col_types = "iiilcT" ) ``` ``` #> Warning: One or more parsing issues, call `problems()` on #> your data frame for details, e.g.: #> dat <- vroom(...) #> problems(dat) ``` ] -- .pull-right.code80[ ```r problems(ex_flights) ``` ``` #> # A tibble: 17 × 5 #> row col expected actual file #> <int> <int> <chr> <chr> <chr> #> 1 3 1 an integer -- C:/Users/admin/Desktop/QFwR_2023.09/slides/L05_Im… #> 2 3 3 an integer -- C:/Users/admin/Desktop/QFwR_2023.09/slides/L05_Im… #> 3 3 4 1/0/T/F/TRUE/FALSE -- C:/Users/admin/Desktop/QFwR_2023.09/slides/L05_Im… #> # ℹ 14 more rows ``` ```r # 全部读入,再行转换类型 read_csv( "data/ex_flights.csv", col_types = cols( * .default = col_character() ) ) %>% * type_convert(col_types = "iiilcT") ``` ``` #> # A tibble: 20 × 6 #> month day dep_time dep_delay tailnum time_hour #> <int> <int> <int> <lgl> <chr> <dttm> #> 1 6 26 932 TRUE -- 2013-06-26 12:00:00 #> 2 NA 5 NA NA -- 2013-12-05 17:00:00 #> 3 7 20 656 FALSE N17245 2013-07-20 10:00:00 #> # ℹ 17 more rows ``` ] --- .full-width[.content-box-blue.bold.font120[一次性读入多个相同性质的文档]] ```r # 生成相同性质的示例文档 nycflights13::flights %>% split(.$carrier) %>% purrr::iwalk(~ write_tsv(.x, glue::glue("data/flights_{.y}.tsv"))) ``` -- ```r # 文件路径向量 (files <- dir(path = "data/", pattern = "\\.tsv$", full.names = TRUE)) ``` ``` #> [1] "data/flights_9E.tsv" "data/flights_AA.tsv" "data/flights_AS.tsv" #> [4] "data/flights_B6.tsv" "data/flights_DL.tsv" "data/flights_EV.tsv" #> [7] "data/flights_F9.tsv" "data/flights_FL.tsv" "data/flights_HA.tsv" #> [10] "data/flights_MQ.tsv" "data/flights_OO.tsv" "data/flights_UA.tsv" #> [13] "data/flights_US.tsv" "data/flights_VX.tsv" "data/flights_WN.tsv" #> [16] "data/flights_YV.tsv" ``` ```r # 一次性读入(并纵向合并为 tibble) read_tsv(files, id = "fpath") # 这里通过设置 id 参数将文件路径存入指定变量fpath中 ``` ``` #> # A tibble: 336,776 × 20 #> fpath year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 data/flight… 2013 1 1 810 810 0 1048 1037 #> 2 data/flight… 2013 1 1 1451 1500 -9 1634 1636 #> 3 data/flight… 2013 1 1 1452 1455 -3 1637 1639 #> # ℹ 336,773 more rows #> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <dbl>, tailnum <chr>, #> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, #> # time_hour <dttm> ``` --- .full-width[.content-box-blue.bold.font120[`readr` 的图形化用户界面 <sup>.font80.red[*]</sup>]] <img src="imgs/readr_UI.png" width="88%" style="display: block; margin: auto;" /> .footnote.red.font80[\* `RStudio` 右上 `Environment` 标签页 > `Import Dataset` > `From Text (readr)` ...] --- layout: true ### >> 读入其它类型的数据 --- -- .font110[ * .bold[`readr::read_rds()`]: `R` data files (*.rds*) ] -- .pull-left.font110[ * ![](imgs/logo-vroom.png) [~~.bold[{{`vroom`}}~~]](https://vroom.r-lib.org/): Read and Write Rectangular Text Data **Quickly** * ![](imgs/logo-readxl.png) [.bold[{{`readxl`}}]](https://readxl.tidyverse.org/): excel files (_.xls_ and _.xlsx_) <- take a 🧐 at it! * ![](imgs/logo-haven.png) [.bold[{{`haven`}}]](https://haven.tidyverse.org/): SPSS, Stata, and SAS files * ![](imgs/logo-arrow.png) [.bold[{{`arrow`}}]](https://arrow.apache.org/docs/r/): Apache Arrow ] -- .pull-right.font110[ * [.bold[{{`DBI`}}]](https://dbi.r-dbi.org/) with specific interface packages: databases, such as SQLite, MySQL and PostgreSQL * .bold[`jsonlite`]: JSON files * [.bold[{{`xml2`}}]](https://xml2.r-lib.org/): XML and HTML files * ![](imgs/logo-rvest.png) [.bold[{{`rvest`}}]](https://rvest.tidyverse.org): HTML (Web Scraping) * [.bold[{{`httr2`}}]](https://httr2.r-lib.org/): Web APIs ] -- <br> .font110[ * .bold[[{{其它}}](https://cran.r-project.org/web/views/)]: text、network、spatial、genome、image 等类型的数据 ] --- layout: false class: inverse, center, middle # 2. 数据齐整 .font150[(Tidy Data)] --- layout: true ### >> 齐整数据 --- .full-width[.content-box-blue.bold.font120[齐整数据的三个条件 📐]] .font120[ * 每列都是一个变量(Every column is a variable.) * 每行都是一个观测(Every row is an observation.) * 每格都是一个取值(Every cell is a single value.) ] -- <img src="imgs/tidy-1.png" width="85%" style="display: block; margin: auto;" /> --- .full-width[.content-box-blue.bold.font120[齐整数据的好处 💪]] .font120[ ☑ 齐整数据按照逻辑一致的方式存储数据,这让你更容易学习并掌握相关工具对数据进行处理 ☑ “每列都是一个变量”及“每行都是一个观测”有助于发挥 R 语言向量化操作的优势 ] <img src="imgs/tidy-2.png" width="80%" style="display: block; margin: auto;" /> .font120[ ☑ `tidyverse` 中的 R 包(如 `dplyr` 、`ggplot2` 等)在设计上要求输入数据为齐整数据 ] --- .full-width[.content-box-blue.bold.font120[但……并非所有的数据集都是齐整的,😭]] .pull-left.code80[ ```r table1 ``` ``` #> # A tibble: 6 × 4 #> country year cases population #> <chr> <dbl> <dbl> <dbl> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> # ℹ 3 more rows ``` ```r table2 ``` ``` #> # A tibble: 12 × 4 #> country year type count #> <chr> <dbl> <chr> <dbl> #> 1 Afghanistan 1999 cases 745 #> 2 Afghanistan 1999 population 19987071 #> 3 Afghanistan 2000 cases 2666 #> # ℹ 9 more rows ``` ] .pull-right.code70[ ```r table3 ``` ``` #> # A tibble: 6 × 3 #> country year rate #> <chr> <dbl> <chr> #> 1 Afghanistan 1999 745/19987071 #> 2 Afghanistan 2000 2666/20595360 #> 3 Brazil 1999 37737/172006362 #> # ℹ 3 more rows ``` ```r table4a # cases ``` ``` #> # A tibble: 3 × 3 #> country `1999` `2000` #> <chr> <dbl> <dbl> #> 1 Afghanistan 745 2666 #> 2 Brazil 37737 80488 #> 3 China 212258 213766 ``` ```r table4b # population ``` ``` #> # A tibble: 3 × 3 #> country `1999` `2000` #> <chr> <dbl> <dbl> #> 1 Afghanistan 19987071 20595360 #> 2 Brazil 172006362 174504898 #> 3 China 1272915272 1280428583 ``` ] --- layout: true ### >> 该 `tidyr` 包出场啦 🎉 --- .full-width[.content-box-blue.bold.font120[`tidyr`包中的函数大致可分为5大类:]] .pull-left.font105[ 1. 数据长型-宽型转换(_pivoting_):`pivot_longer()` 和 `pivot_wider()` 2. 分解或合并字符型变量:`separate_longer_delim()|*_position()`、`separate_wider_delim()|*_position|*_regex()` 和 `unite()` 3. 将深度嵌套的列表数据表格化(_rectangling_):`unnest_longer()`、`unnest_wider()`、`unnest_auto()`、`unnest()` 和 `hoist()` 4. **缺失值处理:->** 5. 其它函数:`chop()|unchop()`; `pack()|unpack()`; `uncount()`; `nest()` ] -- .pull-right.font80[ - `complete(data, ..., fill = list(), explicit = TRUE)`: Complete a data frame with missing combinations of data - `expand(data, ...)`, `crossing(...)`, `nesting(...)`: Expand data frame to include all combinations of values - `expand_grid(...)`: Create a tibble from all combinations of the inputted name-value pairs - `full_seq(x, period, tol = 1e-06)`: Create the full sequence of values in a vector <hr> - `drop_na(data, ...)`: Drop rows containing missing values - `fill(data, ..., .direction = c("down", "up", "downup", "updown"))`: Fill in missing values with previous or next value - `replace_na(data, replace, ...)`: Replace missing values with specified values ] --- layout: true ### >> **.red[`pivot_longer()`]** 和 `pivot_wider()` --- <img src="imgs/tidy-3.png" width="75%" style="display: block; margin: auto;" /> -- .pull-left.code75[ ```r pivot_longer( * data, cols, ..., cols_vary = "fastest", * names_to = "name", names_prefix = NULL, names_sep = NULL, names_pattern = NULL, names_ptypes = NULL, names_transform = NULL, names_repair = "check_unique", * values_to = "value", values_drop_na = FALSE, values_ptypes = NULL, values_transform = NULL ) ``` ] -- .pull-right.code80[ ```r table4a %>% pivot_longer( cols = c(`1999`, `2000`), names_to = "year", values_to = "cases" ) # table4b %>% ... ``` ``` #> # A tibble: 6 × 3 #> country year cases #> <chr> <chr> <dbl> #> 1 Afghanistan 1999 745 #> 2 Afghanistan 2000 2666 #> 3 Brazil 1999 37737 #> # ℹ 3 more rows ``` ] --- layout: true ### >> `pivot_longer()` 和 **.red[`pivot_wider()`]** --- <img src="imgs/tidy-4.png" width="75%" style="display: block; margin: auto;" /> -- .pull-left.code75[ ```r pivot_wider( * data, ..., id_cols = NULL, id_expand = FALSE, * names_from = name, names_prefix = "", names_sep = "_", names_glue = NULL, names_sort = FALSE, names_vary = "fastest", names_expand = FALSE, names_repair = "check_unique", * values_from = value, values_fill = NULL, values_fn = NULL, unused_fn = NULL ) ``` ] -- .pull-right.code80[ ```r table2 %>% pivot_wider( names_from = type, values_from = count ) ``` ``` #> # A tibble: 6 × 4 #> country year cases population #> <chr> <dbl> <dbl> <dbl> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> # ℹ 3 more rows ``` ] --- layout: true ### >> **.red[`separate\_wider\_\*()`、`separate\_longer\_\*()`]** 和 `unite()` --- <img src="imgs/tidy-5.png" width="55%" style="display: block; margin: auto;" /> -- .pull-left.code80[ ```r separate_wider_delim|position|regex( * data, cols, * delim | widths | patterns, ..., names, names_sep = NULL, names_repair, too_few, too_many, cols_remove = TRUE ) separate_longer_delim|position( * data, cols, * delim | width, ..., keep_empty = FALSE ) ``` ] -- .pull-right.code80[ ```r table3 %>% separate_wider_delim( rate, delim = "/", names = c("cases", "population") ) ``` ``` #> # A tibble: 6 × 4 #> country year cases population #> <chr> <dbl> <chr> <chr> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> # ℹ 3 more rows ``` ] --- layout: true ### >> `separate_wider_*()`、`separate_longer_*()` 和 **.red[`unite()`]** --- <img src="imgs/tidy-6.png" width="80%" style="display: block; margin: auto;" /> -- .pull-left.code100[ ```r unite( data, col, ..., sep = "_", remove = TRUE, na.rm = FALSE ) ``` ] -- .pull-right.code90[ ```r table5 %>% unite( * col = "year", * century, year, sep = "" ) ``` ``` #> # A tibble: 6 × 3 #> country year rate #> <chr> <chr> <chr> #> 1 Afghanistan 1999 745/19987071 #> 2 Afghanistan 2000 2666/20595360 #> 3 Brazil 1999 37737/172006362 #> # ℹ 3 more rows ``` ] --- layout: false class: center, middle, hide_logo background-image: url(imgs/xaringan.png) background-size: 12% background-position: 50% 40% <br><br><br><br><br><br><br> <hr color='#f00' size='2px' width='80%'> <br> .Large.red[_**本网页版讲义的制作由 R 包 [{{`xaringan`}}](https://github.com/yihui/xaringan) 赋能!**_]