第5讲数据导入与齐整

class: center, middle, inverse, title-slide

.title[
# 量化金融与金融编程
]
.subtitle[
## L05 数据导入与齐整
]
.author[
### <br>曾永艺
]
.institute[
### 厦门大学管理学院
]
.date[
### <br>2023-10-27
]

---

class: middle
background-image: url(imgs/readr-tidyr.png)
background-size: 18%
background-position: 85% 50%

<div>
<style type="text/css">.xaringan-extra-logo {
width: 50px;
height: 50px;
z-index: 0;
background-image: url(imgs/logo-sm.png);
background-size: contain;
background-repeat: no-repeat;
position: absolute;
top:0.5em;right:0.5em;
}
</style>
<script>(function () {
  let tries = 0
  function addLogo () {
    if (typeof slideshow === 'undefined') {
      tries += 1
      if (tries < 10) {
        setTimeout(addLogo, 100)
      }
    } else {
      document.querySelectorAll('.remark-slide-content:not(.title-slide):not(.inverse):not(.hide_logo)')
        .forEach(function (slide) {
          const logo = document.createElement('div')
          logo.classList = 'xaringan-extra-logo'
          logo.href = null
          slide.appendChild(logo)
        })
    }
  }
  document.addEventListener('DOMContentLoaded', addLogo)
})()</script>
</div>

> 问渠那得清如许？为有源头活水来。——[宋]朱熹

> Like families, tidy datasets are all alike but every messy <br>
> dataset is messy in its own way. ——Hadley Wickham

<br>
--

.bold.font150[1\. 数据导入]
  * .font120[读入矩形文本数据：`readr` 包 <sup>.font80[2.1.4]</sup>]
  * .font120[读入其它类型的数据]

.bold.font150[2\. 数据齐整]
  * .font120[齐整数据]
  * .font120[`tidyr` 包 <sup>.font80[1.3.0]</sup>]
  * .font120[`pivot_longer()` 和 `pivot_wider()`]
  * .font120[`separate_*()` 和 `unite()`]
  * .font120[`pivot\_\*()` 和 `separate\_\*()`：一个复杂案例]
  * .font120[`unnest_*()` 和 `hoist()`]

---
class: inverse, center, middle

# 1. 数据导入

.font150[(Data Import)]

---
layout: true

### >> 读入文本文档：`readr` 包

---

.full-width[.content-box-blue.bold.font120[☑ 一致的函数命名方式]]

.full-width[.content-box-blue.bold.font120[☑ 基本一致的参数设置]]

```r
read_*(file,  # delim = NULL, escape_backslash = FALSE, escape_double = TRUE,
       col_names = TRUE, col_types = NULL, col_select = NULL, 
       id = NULL, locale = default_locale(), 
       na = c("", "NA"), quoted_na = TRUE, quote = "\"", comment = "", trim_ws = TRUE, 
       skip = 0, n_max = Inf, guess_max = min(1000, n_max), 
       name_repair = "unique", num_threads = readr_threads(), 
       progress = show_progress(), show_col_types = should_show_types(), 
       skip_empty_rows = TRUE, lazy = should_read_lazy())
# file: Either a path to a file, a connection, or literal data. 
#       Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed.
#       Files starting with http://, https://, ftp://, or ftps:// will be downloaded.
#       Using a value of clipboard() will read from the system clipboard.
```

---

.full-width[.content-box-blue.bold.font120[☑ 和内置 `utils` 包中 `read.*()` 比较]]

--
.font110[
* 生成 `tibble`.red[<sup>*</sup>]（更确切地说应该是 `spec_tbl_df`）

* 默认情况下做得更少，如不会自动将字符型变量转换为因子型变量（`read.*()`有个`stringsAsFactors`参数）、不会自动更改列名、不会将一列数据转化为行名等

* 较少依赖系统和环境变量，结果更具可重现性（_reproducible_）

* 读入速度通常更快（~50x），且读入大文件时有进度条提示

]

```
?`tbl_df-class`
tibble: a subclass of data.frame and the central data structure for the tidyverse
lazy and surly - do less and complain more than base data.frames.
======================================================================================

1. 列数据不进行自动类型转换（如字符->因子），且原生直接支持列表列
2. 只对长度为1的“标量”向量进行循环操作
3. tibble支持不合法的数据变量名，如 tibble(`:)` = "smile")
4. 不增加行名，也不鼓励使用行名来存储数据信息
5. 用 [ 对tibble取子集操作总是返回新的tibble，[[ 和 $ 总是返回向量，且 $ 不支持变量名部分匹配
6. tibble在屏幕上打印输出时会适应屏幕大小，并提供更多有用信息（类似str()）
```

---

.full-width[.content-box-blue.bold.font120[常用参数示例：`read_csv()`]]
--

.pull-left.code80[

```r
library(tidyverse)
read_csv("The 1st line of metadata
         The 2nd line of metadata
         # A comment to skip
         x,y,z
         1,2,3", 
*        skip = 2,
*        comment = "#")
```

```
#> Rows: 1 Columns: 3
#> ── Column specification ────────────────
#> Delimiter: ","
#> dbl (3): x, y, z
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
#> # A tibble: 1 × 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
```
]
--

.pull-right.code80[

```r
read_csv("1,2,3\n4,5,6")
```

```
#> # A tibble: 1 × 3
#>     `1`   `2`   `3`
#>   <dbl> <dbl> <dbl>
#> 1     4     5     6
```

```r
read_csv("1,2,3\n4,5,6", 
*        col_names = FALSE)
```

```
#> # A tibble: 2 × 3
#>      X1    X2    X3
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6
```

```r
read_csv("1,2,3\n4,5,6", 
*        col_names = c("x", "y", "z"))
```

```
#> # A tibble: 2 × 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6
```

]

---

.full-width[.content-box-blue.bold.font120[`readr` 包采用启发式策略来解析文本文档中数据]]

```r
#  生成示例数据文档
set.seed(123456)   # 设定随机数种子
nycflights13::flights %>% 
  mutate(dep_delay = if_else(dep_delay <= 0, FALSE, TRUE)) %>% 
  select(month:dep_time, dep_delay, tailnum, time_hour) %>% 
  slice_sample(n = 20) %>%  
  mutate(across(everything(), ~ifelse(runif(20) <= 0.1, NA, .))) %>% 
  mutate(time_hour = lubridate::as_datetime(time_hour)) %>% 
* print(n = 5) %>%
  write_csv(file = "data/ex_flights.csv", na = "--")  # 默认na = "NA"
```

```
#> # A tibble: 20 × 6
#>   month   day dep_time dep_delay tailnum time_hour          
*#>   <int> <int>    <int> <lgl>     <chr>   <dttm>             
#> 1     6    26      932 TRUE      <NA>    2013-06-26 12:00:00
#> 2    NA     5       NA NA        <NA>    2013-12-05 17:00:00
#> 3     7    20      656 FALSE     N17245  2013-07-20 10:00:00
#> 4     5    16       NA FALSE     N27152  2013-05-16 12:00:00
#> 5    12    23       NA NA        <NA>    2013-12-23 23:00:00
#> # ℹ 15 more rows
```

---
.full-width[.content-box-blue.bold.font120[`readr` 包采用启发式策略来解析文本文档中数据]]

```r
(ex_flights <- read_csv("data/ex_flights.csv"))  # 默认 na = c("", "NA")
```

```
#> Rows: 20 Columns: 6
#> ── Column specification ──────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): month, day, dep_time, dep_delay, tailnum, time_hour
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
#> # A tibble: 20 × 6
#>   month day   dep_time dep_delay tailnum time_hour           
*#>   <chr> <chr> <chr>    <chr>     <chr>   <chr>               
#> 1 6     26    932      TRUE      --      2013-06-26T12:00:00Z
#> 2 --    5     --       --        --      2013-12-05T17:00:00Z
#> 3 7     20    656      FALSE     N17245  2013-07-20T10:00:00Z
#> 4 5     16    --       FALSE     N27152  2013-05-16T12:00:00Z
#> 5 12    23    --       --        --      2013-12-23T23:00:00Z
#> # ℹ 15 more rows
```

---

.full-width[.content-box-blue.bold.font120[`readr` 包采用启发式策略来解析文本文档中数据]]

> .font120[1\. 根据给定或猜测的分隔符将文本文档分割为字符串矩阵]  
> .font120[2\. 确定每列字符串向量的类型]  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2\.1 由 `col_types` 参数直接给定  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2\.2 猜测：读入文档 `guess_max` 行（v2 vs. v1），并按如下顺序猜测每列变量的类型：*logical -> ~~integer ->~~ double -> ~~number ->~~ time -> date -> date-time -> character *  
> .font120[3\. 将每列字符串解析为相应类型的向量]

.full-width[.content-box-blue.bold.font120[当发现 `readr` 自动解析碰到问题时]]
 
 * 可用 `problems()` 查看读入时遇到的问题
 
 * 根据需要设定 `read_*()` 的相关参数，如 `na = "--"`
 
 * 直接设定 `read_*()` 的参数 `col_types = cols(...)`
 
 * 将全部数据读入为字符型变量，然后再用 `type_convert()` 转换变量类型
 
 * 在后续的数据整理步骤中进行相应的处理（如 `dplyr::muate() + parse_*()`）

---

.pull-left.code75[

```r
# 设定缺失值参数
read_csv("data/ex_flights.csv", 
         na = "--") %>% spec()
```

```
#> cols(
#>   month = col_double(),
#>   day = col_double(),
#>   dep_time = col_double(),
#>   dep_delay = col_logical(),
#>   tailnum = col_character(),
#>   time_hour = col_datetime(format = "")
#> )
```

```r
# 直接设定参数col_types，?cols
ex_flights <- read_csv(
  "data/ex_flights.csv", 
  col_types = cols(  # cols_only()
    dep_delay = col_logical(),    # "l"
    tailnum   = col_character(),  # "c"
    time_hour = col_datetime(),   # "T"
    .default  = col_integer()     # "i"
  ) # col_types = "iiilcT"
)
```

```
#> Warning: One or more parsing issues, call `problems()` on
#> your data frame for details, e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
```
]

.pull-right.code80[

```r
problems(ex_flights)
```

```
#> # A tibble: 17 × 5
#>     row   col expected           actual file                                              
#>   <int> <int> <chr>              <chr>  <chr>                                             
#> 1     3     1 an integer         --     C:/Users/admin/Desktop/QFwR_2023.09/slides/L05_Im…
#> 2     3     3 an integer         --     C:/Users/admin/Desktop/QFwR_2023.09/slides/L05_Im…
#> 3     3     4 1/0/T/F/TRUE/FALSE --     C:/Users/admin/Desktop/QFwR_2023.09/slides/L05_Im…
#> # ℹ 14 more rows
```

```r
# 全部读入，再行转换类型
read_csv(
  "data/ex_flights.csv",
  col_types = cols(
*   .default = col_character()
  )
) %>% 
* type_convert(col_types = "iiilcT")
```

```
#> # A tibble: 20 × 6
#>   month   day dep_time dep_delay tailnum time_hour          
#>   <int> <int>    <int> <lgl>     <chr>   <dttm>             
#> 1     6    26      932 TRUE      --      2013-06-26 12:00:00
#> 2    NA     5       NA NA        --      2013-12-05 17:00:00
#> 3     7    20      656 FALSE     N17245  2013-07-20 10:00:00
#> # ℹ 17 more rows
```
]

---

.full-width[.content-box-blue.bold.font120[一次性读入多个相同性质的文档 <sup>[new in `readr` v2.0.0]</sup>]]

```r
# 生成相同性质的示例文档
nycflights13::flights %>% split(.$carrier) %>% 
  purrr::iwalk(~ write_tsv(.x, glue::glue("data/flights_{.y}.tsv")))
```

```r
# 文件路径向量
(files <- dir(path = "data/", pattern = "\\.tsv$", full.names = TRUE))
```

```
#>  [1] "data/flights_9E.tsv" "data/flights_AA.tsv" "data/flights_AS.tsv"
#>  [4] "data/flights_B6.tsv" "data/flights_DL.tsv" "data/flights_EV.tsv"
#>  [7] "data/flights_F9.tsv" "data/flights_FL.tsv" "data/flights_HA.tsv"
#> [10] "data/flights_MQ.tsv" "data/flights_OO.tsv" "data/flights_UA.tsv"
#> [13] "data/flights_US.tsv" "data/flights_VX.tsv" "data/flights_WN.tsv"
#> [16] "data/flights_YV.tsv"
```

```r
# 一次性读入（并自动纵向合并为 tibble）
read_tsv(files, id = "fpath")  # 这里通过设置 id 参数将文件路径存入指定变量 fpath 中
```

```
#> # A tibble: 336,776 × 20
#>   fpath         year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <chr>        <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>          <dbl>
#> 1 data/flight…  2013     1     1      810            810         0     1048           1037
#> 2 data/flight…  2013     1     1     1451           1500        -9     1634           1636
#> 3 data/flight…  2013     1     1     1452           1455        -3     1637           1639
#> # ℹ 336,773 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <dbl>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
```

---

.full-width[.content-box-blue.bold.font120[`readr` 的图形化用户界面 <sup>.font80.red[*]</sup>]]

.footnote.red.font80[\* `RStudio` 右上 `Environment` 标签页 > `Import Dataset` > `From Text (readr)` ...]

---
layout: true

### >> 读入其它类型的数据

---

--
.font110[
* .bold[`readr::read_rds()`]: `R` data files (*.rds*)
]

.pull-left.font110[
* ![](imgs/logo-vroom.png) [~~.bold[{{`vroom`}}~~]](https://vroom.r-lib.org/): Read and Write Rectangular Text Data Quickly

* ![](imgs/logo-readxl.png) [.bold[{{`readxl`}}]](https://readxl.tidyverse.org/): excel files (_.xls_ and _.xlsx_) <- **take a 🧐 at it!**

* ![](imgs/logo-haven.png) [.bold[{{`haven`}}]](https://haven.tidyverse.org/): SPSS, Stata, and SAS files

* ![](imgs/logo-arrow.png) [.bold[{{`arrow`}}]](https://arrow.apache.org/docs/r/): Apache Arrow
]

.pull-right.font110[

* [.bold[{{`DBI`}}]](https://dbi.r-dbi.org/) with specific interface packages: databases, such as SQLite, MySQL and PostgreSQL

* .bold[`jsonlite`]: JSON files

* [.bold[{{`xml2`}}]](https://xml2.r-lib.org/): XML and HTML files

* ![](imgs/logo-rvest.png) [.bold[{{`rvest`}}]](https://rvest.tidyverse.org): HTML (Web Scraping)

* [.bold[{{`httr2`}}]](https://httr2.r-lib.org/): Web APIs

]

<br>
.font110[
* .bold[[{{其它}}](https://cran.r-project.org/web/views/)]: text、network、spatial、genome、image 等类型的数据
]

---
layout: false
class: inverse, center, middle

# 2. 数据齐整

.font150[(Tidy Data)]

---
layout: true

### >> 齐整数据

---

.full-width[.content-box-blue.bold.font120[齐整数据的三个条件 📐]]

.font120[

* 每列都是一个变量（Every column is a variable.）

* 每行都是一个观测（Every row is an observation.）

* 每格都是一个取值（Every cell is a single value.）
]

--
<img src="imgs/tidy-1.png" width="85%" style="display: block; margin: auto;" />

---

.full-width[.content-box-blue.bold.font120[齐整数据的好处 💪]]

.font120[

☑ 齐整数据按照逻辑一致的方式存储数据，这让你更容易学习并掌握相关工具对数据进行处理

☑ “每列都是一个变量”及“每行都是一个观测”有助于发挥 R 语言向量化操作的优势

]

.font120[
☑ `tidyverse` 中的 R 包（如 `dplyr` 、`ggplot2` 等）在设计上要求输入数据为齐整数据
]

---

.full-width[.content-box-blue.bold.font120[但……并非所有的数据集都是齐整的，😭]]

.pull-left.code80[

```r
table1
```

```
#> # A tibble: 6 × 4
#>   country      year cases population
#>   <chr>       <dbl> <dbl>      <dbl>
#> 1 Afghanistan  1999   745   19987071
#> 2 Afghanistan  2000  2666   20595360
#> 3 Brazil       1999 37737  172006362
#> # ℹ 3 more rows
```

```r
table2
```

```
#> # A tibble: 12 × 4
#>   country      year type          count
#>   <chr>       <dbl> <chr>         <dbl>
#> 1 Afghanistan  1999 cases           745
#> 2 Afghanistan  1999 population 19987071
#> 3 Afghanistan  2000 cases          2666
#> # ℹ 9 more rows
```
]

.pull-right.code70[

```r
table3
```

```
#> # A tibble: 6 × 3
#>   country      year rate           
#>   <chr>       <dbl> <chr>          
#> 1 Afghanistan  1999 745/19987071   
#> 2 Afghanistan  2000 2666/20595360  
#> 3 Brazil       1999 37737/172006362
#> # ℹ 3 more rows
```

```r
table4a   # cases
```

```
#> # A tibble: 3 × 3
#>   country     `1999` `2000`
#>   <chr>        <dbl>  <dbl>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766
```

```r
table4b   # population
```

```
#> # A tibble: 3 × 3
#>   country         `1999`     `2000`
#>   <chr>            <dbl>      <dbl>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583
```
]

---
layout: true

### >> 该 `tidyr` 包出场啦 🎉

---

.full-width[.content-box-blue.bold.font120[`tidyr`包中的函数大致可分为5大类：]]

.pull-left.font105[
1. 数据长型-宽型转换（_pivoting_）：`pivot_longer()` 和 `pivot_wider()`

2. 分解或合并字符型变量：`separate_longer_delim()|*_position()`、`separate_wider_delim()|*_position|*_regex()` 和 `unite()`

3. 将深度嵌套的列表数据表格化（_rectangling_）：`unnest_longer()`、`unnest_wider()`、`unnest_auto()`、`unnest()` 和 `hoist()`

4. **缺失值处理：->**

5. 其它函数：`chop()|unchop()`; `pack()|unpack()`; `uncount()`; `nest()`
]

.pull-right.font80[

- `complete(data, ..., fill = list(), explicit = TRUE)`: Complete a data frame with missing combinations of data

- `expand(data, ...)`, `crossing(...)`, `nesting(...)`: Expand data frame to include all combinations of values

- `expand_grid(...)`: Create a tibble from all combinations of the inputted name-value pairs

- `full_seq(x, period, tol = 1e-06)`: Create the full sequence of values in a vector

<hr>

- `drop_na(data, ...)`: Drop rows containing missing values

- `fill(data, ..., .direction = c("down", "up", "downup", "updown"))`: Fill in missing values with previous or next value

- `replace_na(data, replace, ...)`: Replace missing values with specified values

]

---
layout: true

### >> **.red[`pivot_longer()`]** 和 `pivot_wider()`

---

.pull-left.code75[

```r
pivot_longer(
* data, cols,
  ..., 
  cols_vary = "fastest", 
* names_to = "name",
  names_prefix = NULL,
  names_sep = NULL, 
  names_pattern = NULL, 
  names_ptypes = NULL, 
  names_transform = NULL, 
  names_repair = "check_unique", 
* values_to = "value",
  values_drop_na = FALSE, 
  values_ptypes = NULL,
  values_transform = NULL
)
```
]

.pull-right.code80[

```r
table4a %>% 
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year", 
    values_to = "cases"
  )
# table4b %>% ...
```

```
#> # A tibble: 6 × 3
#>   country     year  cases
#>   <chr>       <chr> <dbl>
#> 1 Afghanistan 1999    745
#> 2 Afghanistan 2000   2666
#> 3 Brazil      1999  37737
#> # ℹ 3 more rows
```
]

---
layout: true

### >> `pivot_longer()` 和 **.red[`pivot_wider()`]**

---

--
.pull-left.code75[

```r
pivot_wider(
* data,
  ..., 
  id_cols = NULL, id_expand = FALSE,
* names_from = name,
  names_prefix = "", names_sep = "_", 
  names_glue = NULL, 
  names_sort = FALSE, 
  names_vary = "fastest",
  names_expand = FALSE, 
  names_repair = "check_unique",
* values_from = value,
  values_fill = NULL, values_fn = NULL, 
  unused_fn = NULL
)
```
]

.pull-right.code80[

```r
table2 %>% 
  pivot_wider(
    names_from = type, 
    values_from = count
  )
```

---
layout: false
class: hide_logo

## 🙋‍♂️ Your Turn!

.panelset[

.panel[.panel-name[问题]

.font120[在空白处填入恰当的代码将数据集 billboard 转化为目标结构：]

.pull-left.code100[

<code class ='r hljs remark-code'>library(tidyverse)<br><br>billboard %>% View()<br>billboard %>%<br>&nbsp;&nbsp;pivot_<span style="background-color:#ffff7f"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>(<br>&nbsp;&nbsp;&nbsp;&nbsp;cols = <span style="background-color:#ffff7f"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>,<br>&nbsp;&nbsp;&nbsp;&nbsp;names_to = <span style="background-color:#ffff7f"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>,<br>&nbsp;&nbsp;&nbsp;&nbsp;<span style="background-color:#ffff7f"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>,<br>&nbsp;&nbsp;&nbsp;&nbsp;names_transform = <span style="background-color:#ffff7f"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>,<br>&nbsp;&nbsp;&nbsp;&nbsp;values_to = <span style="background-color:#ffff7f"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>,<br>&nbsp;&nbsp;&nbsp;&nbsp;values_drop_na = TRUE<br>&nbsp;&nbsp;)</code>

```

#> # A tibble: 5,307 × 5
#>   artist track                   date.entered  week  rank
#>   <chr>  <chr>                   <date>       <int> <dbl>
#> 1 2 Pac  Baby Don't Cry (Keep... 2000-02-26       1    87
#> 2 2 Pac  Baby Don't Cry (Keep... 2000-02-26       2    82
#> 3 2 Pac  Baby Don't Cry (Keep... 2000-02-26       3    72
#> # ℹ 5,304 more rows

```
]

.pull-right.code80[

```
*# A tibble: 317 × 79
  artist  track date.entered   wk1   wk2
  <chr>   <chr> <date>       <dbl> <dbl>
1 2 Pac   Baby… 2000-02-26      87    82
2 2Ge+her The … 2000-09-02      91    87
3 3 Door… Kryp… 2000-04-08      81    70
# ℹ 314 more rows
# ℹ 74 more variables: wk3 <dbl>,
#   wk4 <dbl>, wk5 <dbl>, wk6 <dbl>,
#   wk7 <dbl>, wk8 <dbl>, wk9 <dbl>,
#   wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>,
#   wk16 <dbl>, wk17 <dbl>, …
```

```
*# A tibble: 5,307 × 5
  artist track  date.entered  week  rank
  <chr>  <chr>  <date>       <int> <dbl>
1 2 Pac  Baby … 2000-02-26       1    87
2 2 Pac  Baby … 2000-02-26       2    82
3 2 Pac  Baby … 2000-02-26       3    72
# ℹ 5,304 more rows
```
]
]

.panel[.panel-name[参考答案]

.code110[

```r
library(tidyverse)

billboard %>% View()
billboard %>%
  pivot_longer(
    cols = starts_with("wk") ,
    names_to = "week"        ,
    names_prefix = "wk"      ,
    names_transform = list(week = as.integer),
    values_to = "rank"       ,
    values_drop_na = TRUE
  )
```
]
]
]

---
layout: true

### >> **.red[`separate\_wider\_\*()`、`separate\_longer\_\*()`]** 和 `unite()`

---

--
.pull-left.code80[

.pull-right.code80[

```r
table3 %>% 
  separate_wider_delim(
    rate, delim = "/", 
    names = c("cases", "population")
  )
```

```
#> # A tibble: 6 × 4
#>   country      year cases population
#>   <chr>       <dbl> <chr> <chr>     
#> 1 Afghanistan  1999 745   19987071  
#> 2 Afghanistan  2000 2666  20595360  
#> 3 Brazil       1999 37737 172006362 
#> # ℹ 3 more rows
```

]

---
layout: true

### >> `separate_wider_*()`、`separate_longer_*()` 和 **.red[`unite()`]**

---

--
.pull-left.code100[

```r
unite(
  data, 
  col,
  ...,
  sep = "_", 
  remove = TRUE, 
  na.rm = FALSE
)
```
]

.pull-right.code90[

```r
table5 %>% 
  unite(
*   col = "year",
*   century, year,
    sep = ""
  )
```

```
#> # A tibble: 6 × 3
#>   country     year  rate           
#>   <chr>       <chr> <chr>          
#> 1 Afghanistan 1999  745/19987071   
#> 2 Afghanistan 2000  2666/20595360  
#> 3 Brazil      1999  37737/172006362
#> # ℹ 3 more rows
```

]

---
layout: false
class: hide_logo

## 🙋‍♂️ Your Turn!

.panelset[

.panel[.panel-name[问题]

.font120[使用 `dplyr::mutate()` 实现 `tidyr::unite()` 的结果（提示：`help(paste)`）]

<br>
.pull-left.code100[

```r
library(tidyverse)

table5 %>% 
  unite(
    col = "year",
    century, year,
    sep = ""
  )
```
]

.pull-right[

```
#> # A tibble: 6 × 3
#>   country     year  rate             
#>   <chr>       <chr> <chr>            
#> 1 Afghanistan 1999  745/19987071     
#> 2 Afghanistan 2000  2666/20595360    
#> 3 Brazil      1999  37737/172006362  
#> 4 Brazil      2000  80488/174504898  
#> 5 China       1999  212258/1272915272
#> 6 China       2000  213766/1280428583
```
]
]

.panel[.panel-name[参考答案]

.font120[使用 `dplyr::mutate()` 实现 `tidyr::unite()` 的结果（提示：`help(paste)`）]

<br>
.pull-left.code100[

```r
library(tidyverse)

table5 %>% 
  mutate(
    year = paste0(century, year),
    .keep = "unused",
    .after = country
  )
```
]

.pull-right[

---
layout: true

### >> `pivot_*()` 和 `separate_*()`：一个复杂例子

---

```r
who    # ?who
```

```
#> # A tibble: 7,240 × 60
#>    country     iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
#>    <chr>       <chr> <chr> <dbl>       <dbl>        <dbl>        <dbl>        <dbl>
#>  1 Afghanistan AF    AFG    1980          NA           NA           NA           NA
#>  2 Afghanistan AF    AFG    1981          NA           NA           NA           NA
#>  3 Afghanistan AF    AFG    1982          NA           NA           NA           NA
#>  4 Afghanistan AF    AFG    1983          NA           NA           NA           NA
#>  5 Afghanistan AF    AFG    1984          NA           NA           NA           NA
#>  6 Afghanistan AF    AFG    1985          NA           NA           NA           NA
#>  7 Afghanistan AF    AFG    1986          NA           NA           NA           NA
#>  8 Afghanistan AF    AFG    1987          NA           NA           NA           NA
#>  9 Afghanistan AF    AFG    1988          NA           NA           NA           NA
#> 10 Afghanistan AF    AFG    1989          NA           NA           NA           NA
#> # ℹ 7,230 more rows
#> # ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>, new_sp_m65 <dbl>,
#> #   new_sp_f014 <dbl>, new_sp_f1524 <dbl>, new_sp_f2534 <dbl>, new_sp_f3544 <dbl>,
#> #   new_sp_f4554 <dbl>, new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>,
#> #   new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>, new_sn_m4554 <dbl>,
#> #   new_sn_m5564 <dbl>, new_sn_m65 <dbl>, new_sn_f014 <dbl>, new_sn_f1524 <dbl>,
#> #   new_sn_f2534 <dbl>, new_sn_f3544 <dbl>, new_sn_f4554 <dbl>, new_sn_f5564 <dbl>, …
```

---
.code80[

```r
who1 <- who %>% 
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "name", 
    values_to = "cases", 
    values_drop_na = TRUE
  )
who1
```

```
#> # A tibble: 76,046 × 6
#>   country     iso2  iso3   year name         cases
#>   <chr>       <chr> <chr> <dbl> <chr>        <dbl>
#> 1 Afghanistan AF    AFG    1997 new_sp_m014      0
#> 2 Afghanistan AF    AFG    1997 new_sp_m1524    10
#> 3 Afghanistan AF    AFG    1997 new_sp_m2534     6
#> # ℹ 76,043 more rows
```
]

> 变量 `name` 中的字符串代表什么？通过查阅 `who` 的帮助文档得到如下信息：
>  1. 前3个字母代表是否是“新增”肺结核病例，在数据中均为`new`
>  2. 接下来2-3个字母代表肺结核的类型（`sp`、`sn`、`ep`、`rel`等）
>  3. 接下来的字母`m`或`f`代表肺结核病人的性别（男或女）
>  4. 余下数字代表年龄组，如 `014 = 0 - 14岁 | 1524 = 15 - 24岁`

---

.code80[

```r
who2 <- who1 %>% 
  mutate(name = stringr::str_replace(name, "newrel", "new_rel")) %>% 
  separate_wider_delim(name, delim = "_", names = c(NA, "type", "sexage"))  # 基于分隔符
who2
```

```
#> # A tibble: 76,046 × 7
#>   country     iso2  iso3   year type  sexage cases
#>   <chr>       <chr> <chr> <dbl> <chr> <chr>  <dbl>
#> 1 Afghanistan AF    AFG    1997 sp    m014       0
#> 2 Afghanistan AF    AFG    1997 sp    m1524     10
#> 3 Afghanistan AF    AFG    1997 sp    m2534      6
#> # ℹ 76,043 more rows
```
]

.code80[

```r
who3 <- who2 %>% 
  separate_wider_position( # 基于位置
    sexage, widths = c(sex = 1, age = 4), too_few = "align_start"
  )
who3
```

```
#> # A tibble: 76,046 × 8
#>   country     iso2  iso3   year type  sex   age   cases
#>   <chr>       <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
#> 1 Afghanistan AF    AFG    1997 sp    m     014       0
#> 2 Afghanistan AF    AFG    1997 sp    m     1524     10
#> 3 Afghanistan AF    AFG    1997 sp    m     2534      6
#> # ℹ 76,043 more rows
```
]

---

.code80[

```r
*# all in a pipe!
who %>%
  pivot_longer(cols = new_sp_m014:newrel_f65, 
               names_to = "name", values_to = "cases", values_drop_na = TRUE) %>% 
  mutate(name = stringr::str_replace(name, "newrel", "new_rel")) %>% 
  separate_wider_delim(name, delim = "_", names = c(NA, "type", "sexage")) %>% 
  separate_wider_position(sexage, widths = c(sex = 1, age = 4), too_few = "align_start")
```
]

.pull-left.code80[

```r
*# all with powerful pivot_longer()!
who %>%
  pivot_longer(
    cols = starts_with("new"),
*   names_to = c("type", "sex", "age"),
*   names_pattern = "new_?(.*)_(.)(.*)",
*   names_ptypes = list(
*     sex = factor(levels = c("f", "m")),
*     age = factor(
*       levels = c("014", "1524", "2534",
*         "3544", "4554", "5564", "65"),
*       ordered = TRUE)),
    values_to = "cases",
    values_drop_na = TRUE
  )
```
]

.pull-right.code80[

```
#> # A tibble: 76,046 × 8
#>   country  iso2  iso3   year type  sex  
#>   <chr>    <chr> <chr> <dbl> <chr> <fct>
#> 1 Afghani… AF    AFG    1997 sp    m    
#> 2 Afghani… AF    AFG    1997 sp    m    
#> 3 Afghani… AF    AFG    1997 sp    m    
#> 4 Afghani… AF    AFG    1997 sp    m    
#> 5 Afghani… AF    AFG    1997 sp    m    
#> # ℹ 76,041 more rows
#> # ℹ 2 more variables: age <ord>,
#> #   cases <dbl>
```

<br>
.font120.bold[👉  `vignette("pivot")`]
]

---
layout: true

### >> `unnest_*` 和 `hoist()`

---
.pull-left.code90[

```r
repurrrsive::gh_repos %>% View()
```

<img src="imgs/repos.png" width="100%" style="display: block; margin: auto auto auto 0;" />
]

.pull-right.code80[

```r
(repos <- 
   tibble(repo = repurrrsive::gh_repos))
```

```
#> # A tibble: 6 × 1
#>   repo       
#>   <list>     
#> 1 <list [30]>
#> 2 <list [30]>
#> 3 <list [30]>
#> 4 <list [26]>
#> 5 <list [30]>
#> # ℹ 1 more row
```

```r
(repos <- repos %>% unnest_longer(repo))
```

```
#> # A tibble: 176 × 1
#>   repo             
#>   <list>           
#> 1 <named list [68]>
#> 2 <named list [68]>
#> 3 <named list [68]>
#> 4 <named list [68]>
#> 5 <named list [68]>
#> # ℹ 171 more rows
```
]

---

.pull-left.code90[

```r
repos %>% 
  unnest_wider(repo)
```

```
#> # A tibble: 176 × 68
#>         id name   full_name owner       
#>      <int> <chr>  <chr>     <list>      
#> 1 61160198 after  gaborcsa… <named list>
#> 2 40500181 argufy gaborcsa… <named list>
#> 3 36442442 ask    gaborcsa… <named list>
#> 4 34924886 basei… gaborcsa… <named list>
#> 5 61620661 citest gaborcsa… <named list>
#> # ℹ 171 more rows
#> # ℹ 64 more variables: private <lgl>,
#> #   html_url <chr>, description <chr>,
#> #   fork <lgl>, url <chr>,
#> #   forks_url <chr>, keys_url <chr>,
#> #   collaborators_url <chr>,
#> #   teams_url <chr>, hooks_url <chr>, …
```

]

.pull-right.code90[

```r
*repos %>% hoist(
  repo, 
* login = c("owner", "login"),
  name = "name",
  homepage = "homepage",
  watchers = "watchers_count"
)
```

```
#> # A tibble: 176 × 5
*#>   login       name     homepage watchers
*#>   <chr>       <chr>    <chr>       <int>
#> 1 gaborcsardi after    <NA>            5
#> 2 gaborcsardi argufy   <NA>           19
#> 3 gaborcsardi ask      <NA>            5
#> 4 gaborcsardi baseimp… <NA>            0
#> 5 gaborcsardi citest   <NA>            0
#> # ℹ 171 more rows
#> # ℹ 1 more variable: repo <list>
```

<br>
.font120.bold[👉 `vignette("rectangle")`]

]

---
layout: false
class: inverse, center, middle

# 课后作业

---
class: middle

.font130[
1\. 复习 📖 [**{{_R for Data Science, 2e_}}**](https://r4ds.had.nz/) 一书第一部分 _Whole game_ 中关于数据导入（[**8 Data import**](https://r4ds.hadley.nz/data-import)）、数据齐整（[**6 Data tidying**](https://r4ds.hadley.nz/data-tidy)）以及 [**{{第四部分 _Import_}}**](https://r4ds.hadley.nz/import) 中的五个章节（分别为 _Spreadsheets_、_Databases_、_Arrow_、_Hierarchical data_ 和 _Web scraping_）并**（结队）完成课后练习**
  > .font80[👉 **参考答案**：R for Data Science (2e): Solutions to Exercises 之 [8 Data importing](https://mine-cetinkaya-rundel.github.io/r4ds-solutions/data-import.html) / [6 Data tidying](https://mine-cetinkaya-rundel.github.io/r4ds-solutions/data-tidy.html)<br>]
  > .font80[👉 **直接搜索**：在 [**{{_R for Data Science, 2e_}}**](https://r4ds.had.nz/) 上搜索相应函数（如 `read_excel`、`separate_wider_delim` 等）并进行针对性学习]

2\. 下载（打印） 📰 .bold[[{{Data Import的cheatsheet}}](docs/data-import-cheatsheet.pdf)] 和 .bold[[{{Data Tidying的cheatsheet}}](docs/tidyr-cheatsheet.pdf)] 并阅读之

3\. `browseVignettes(package = c("readr", "tidyr"))` 📝

4\. 👨‍🏫 [**{{启动小组作业}}**](https://qfwr2023.netlify.app/project/team-project.html)
]

---
layout: false

### 教 + 学 & 考核

<br>

.pull-left.font130[

- 课上 2 学时/周
  
  - 我：概要 + 出题 + 答疑 + 扩展
  
  - 同学：听讲 + 练习 + 反馈
  
- .red.bold[课后 5 小时/周 .font60[起]]

- 同学：预习 + 复习 + 作业
  
  - 我：备课 + （在线）答疑解惑
  
]

.pull-right.font140[

.content-box-yellow.bold[

- 课程的成绩构成：

- 考勤：5%
  
  - 课堂参与：15%
  
  - 个人作业/随堂小测：40%
  
  - 小组研究项目：40% <sup>.red[*]</sup>

.font80.right.red[\* 每个小组 ~4 名同学&emsp;]
]
]

<br>
.font130.right.bold.red[增加一个分组条件：男女搭配&emsp;]

---

### 研究选题

<br>
.pull-left.font130[

.font120.bold[去年选题一览：]

1. 投资组合（天真）分散化
  
  2. 周日效应（DoW effect）
  
  3. 金融工程 App
  
  4. 指数基金定投策略 App
  
  5. 基于技术指标的量化投资策略
  
  6. 基本面量化投资策略
  
  7. 网络数据抓取与分析
]

.pull-right.font130[
.font120.bold[今年？]

<br><br><br>
.red.font300.font300.center[自选！]

]

---

### 任务

.font150[🕐 各组长于**11月08日晚24:00前**提交以下信息至 [{{坚果云}}](https://send2me.cn/Xz1XPQp9/RTe1rxAmO_AzuQ)]

.font120[

>  1. 小组名、组长姓名
>  
>  2. **项目选题、研究思路与预期成果**
>  
>  3. 小组成员集体照一张（标注姓名）

]

.font150[🕑 小组制定研究计划，分工协作，积极开展项目研究]

.font150[🕒 各小组于**12月01日**上台简要报告中期研究进展]

.font150[🕓 各小组于**12月15日前**完成项目研究，提交研究成果（如研究报告、展示文档、App等）]

.font150[🕔 **12月29日**上台展示小组项目]

---
layout: false
class: center, middle, hide_logo
background-image: url(imgs/xaringan.png)
background-size: 12%
background-position: 50% 40%

<br><br><br><br><br><br><br>
<hr color='#f00' size='2px' width='80%'>
<br>
.Large.red[_**本网页版讲义的制作由 R 包 [{{`xaringan`}}](https://github.com/yihui/xaringan) 赋能！**_]