6.8 Introduction to the tidyverse
The tidyverse is a collection of R packages designed around a consistent approach to data science [@wickham2019tidyverse]. For clinical data management, the tidyverse is useful because it provides readable tools for importing data, selecting variables, filtering records, creating new variables, grouping records, summarizing data, and joining datasets. Its grammar is particularly helpful for learners because many commands can be read from left to right as a sequence of operations.
The tidyverse is built around the concept of tidy data. In tidy data, each variable is a column, each observation is a row, and each type of observational unit forms its own table. Clinical research data are not always tidy when exported, especially when repeated measures, checkbox fields, or wide visit structures are involved. Nevertheless, the tidy data concept provides a useful standard for thinking about structure. For example, a participant-level enrollment dataset has one row per participant. A visit dataset may have one row per participant per visit. A laboratory dataset may have one row per participant per specimen or test result. Clarifying the observational unit of a dataset is a core data management task.
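As a minimal sketch of what tidying a wide visit structure can look like, the example below reshapes a small, made-up export with `pivot_longer()` from tidyr. The dataset `visits_wide` and its column names (`weight_kg_v1`, `weight_kg_v2`) are hypothetical and only illustrate the idea of moving from one row per participant to one row per participant per visit.

```r
library(tidyverse)

# Hypothetical wide export: one row per participant, one weight column per visit
visits_wide <- tibble(
  participant_id = c("P001", "P002"),
  weight_kg_v1 = c(61.2, 74.5),
  weight_kg_v2 = c(60.8, NA)
)

# Reshape so that each row is one participant at one visit
visits_long <- visits_wide |>
  pivot_longer(
    cols = starts_with("weight_kg_"),
    names_to = "visit",            # new column holding the visit label
    names_prefix = "weight_kg_",   # strip the shared prefix from the label
    values_to = "weight_kg"        # new column holding the measurement
  )

visits_long
```

After reshaping, the observational unit is the participant-visit, and the missing weight for the second visit of `P002` appears as an explicit row rather than as an empty cell in a wide column.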
The following code creates a small clinical dataset and uses a tidyverse function to count its records by site:
```r
library(tidyverse)

# Small example dataset: one row per participant
enrollment_data <- tibble(
  participant_id = c("P001", "P002", "P003", "P004", "P005"),
  site = c("Kilifi", "Nairobi", "Kilifi", "Mombasa", "Nairobi"),
  age_years = c(34, 29, 41, 52, NA),
  sex = c("Female", "Male", "Female", "Male", "Female"),
  enrolled = c("Yes", "Yes", "Yes", "No", "Yes")
)

# Count the number of records at each site
enrollment_data |>
  count(site)
```
The pipe operator `|>` (the native pipe, available in R 4.1 and later) passes the result of one step to the next step as its first argument. In the example, `enrollment_data` is passed into `count(site)`, which counts the number of records by site. The pipe expresses a workflow as a sequence of steps, which becomes increasingly helpful as scripts grow longer.
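To make the role of the pipe concrete, the sketch below writes the same two-step operation in nested form and in piped form. The three-row dataset is a shortened, illustrative version of the enrollment example, and the object names `nested_result` and `piped_result` are introduced here only for comparison.

```r
library(tidyverse)

enrollment_data <- tibble(
  participant_id = c("P001", "P002", "P003"),
  site = c("Kilifi", "Nairobi", "Kilifi"),
  enrolled = c("Yes", "Yes", "No")
)

# Nested form: must be read from the inside out
nested_result <- count(filter(enrollment_data, enrolled == "Yes"), site)

# Piped form: reads from top to bottom in the order the steps are applied
piped_result <- enrollment_data |>
  filter(enrolled == "Yes") |>
  count(site)

# Both forms should produce the same table
all.equal(nested_result, piped_result)
```

The two forms are equivalent; the piped version simply places each step on its own line in the order it is carried out.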
A common data management task is to filter records. The following code identifies participants with missing age values:
```r
# Keep only the rows where age is missing
enrollment_data |>
  filter(is.na(age_years))
```
This reads as: take `enrollment_data`, then keep only rows where `age_years` is missing. Such a command is simple, but it can become the basis for a query listing. A data manager might later add participant ID, site, variable name, and query text to create a structured query file.
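One possible way to extend the filter into a query listing is sketched below. The column names `variable` and `query_text`, and the wording of the query itself, are assumptions made for illustration; a real study would follow its own query conventions.

```r
library(tidyverse)

enrollment_data <- tibble(
  participant_id = c("P001", "P002", "P003", "P004", "P005"),
  site = c("Kilifi", "Nairobi", "Kilifi", "Mombasa", "Nairobi"),
  age_years = c(34, 29, 41, 52, NA)
)

# Build a simple query listing for missing age values
age_queries <- enrollment_data |>
  filter(is.na(age_years)) |>
  mutate(
    variable = "age_years",  # hypothetical column naming the queried variable
    query_text = "Age is missing. Please verify against the source document."
  ) |>
  select(participant_id, site, variable, query_text)

age_queries
```

The resulting table has one row per query and could be written to a file with a function such as `write_csv()` for circulation to sites.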
Another common task is to summarize data:
```r
enrollment_data |>
  summarise(
    n_records = n(),                          # total number of rows
    missing_age = sum(is.na(age_years)),      # number of missing age values
    mean_age = mean(age_years, na.rm = TRUE)  # mean age, ignoring missing values
  )
```
The function `n()` counts rows. The expression `sum(is.na(age_years))` counts missing age values because `TRUE` is treated as 1 and `FALSE` as 0 in this context. The argument `na.rm = TRUE` tells R to remove missing values before calculating the mean. Without this argument, the mean would return `NA` if any age value is missing.
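The behaviour described above can be checked step by step on the age values alone. The short sketch below uses only base R and the same five ages as the enrollment example; the intermediate object names are introduced here for illustration.

```r
age_years <- c(34, 29, 41, 52, NA)

# is.na() returns a logical vector with TRUE where a value is missing
missing_flags <- is.na(age_years)

# sum() treats TRUE as 1 and FALSE as 0, so this counts missing values
n_missing <- sum(missing_flags)

# Without na.rm = TRUE, a single missing value makes the mean NA
mean_with_na <- mean(age_years)

# With na.rm = TRUE, the mean is taken over the four observed ages
mean_without_na <- mean(age_years, na.rm = TRUE)

n_missing        # 1
mean_with_na     # NA
mean_without_na  # 39
```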
The tidyverse is powerful, but beginners should focus first on a small set of verbs:
| Verb | Purpose | Clinical data management example |
|---|---|---|
| `select()` | Choose columns | Keep participant ID, site, and outcome variables |
| `filter()` | Choose rows | List records with missing consent dates |
| `mutate()` | Create or modify columns | Calculate age from date of birth and enrollment date |
| `count()` | Count records by category | Count participants by site and consent status |
| `group_by()` | Define groups for summary | Summarize missingness by site |
| `summarise()` | Create summary statistics | Count missing outcomes and duplicate IDs |
| `arrange()` | Sort rows | Sort query listing by site and participant ID |
| `left_join()` | Merge datasets | Add laboratory results to enrollment records |
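
Several of these verbs are often combined in a single pipeline. The sketch below joins the enrollment records to a made-up laboratory dataset and summarizes missing results by site. The dataset `lab_data`, the variable `hemoglobin_g_dl`, and its values are hypothetical and serve only to illustrate the verbs.

```r
library(tidyverse)

enrollment_data <- tibble(
  participant_id = c("P001", "P002", "P003", "P004", "P005"),
  site = c("Kilifi", "Nairobi", "Kilifi", "Mombasa", "Nairobi"),
  age_years = c(34, 29, 41, 52, NA)
)

# Hypothetical laboratory results; P003 and P004 have no result
lab_data <- tibble(
  participant_id = c("P001", "P002", "P005"),
  hemoglobin_g_dl = c(12.9, 14.1, 11.4)
)

# left_join() keeps every enrollment record; unmatched participants get NA
combined <- enrollment_data |>
  left_join(lab_data, by = "participant_id") |>
  mutate(lab_missing = is.na(hemoglobin_g_dl)) |>
  select(participant_id, site, hemoglobin_g_dl, lab_missing) |>
  arrange(site, participant_id)

# Summarize missing laboratory results by site
missing_by_site <- combined |>
  group_by(site) |>
  summarise(
    n_records = n(),
    n_lab_missing = sum(lab_missing)
  )

missing_by_site
```

Because `left_join()` keeps all rows from the left-hand dataset, the number of rows in `combined` should equal the number of enrollment records, provided `participant_id` is unique in `lab_data`. Checking row counts before and after a join is a simple safeguard against accidental duplication.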
The tidyverse should be learned gradually. A data manager does not need to memorize every function. It is more important to understand how data move through a script and how each command changes the dataset. Every transformation should be explainable in clinical and data management terms.