Introduction to R for Clinical Data Management

6.8 Introduction to the tidyverse

The tidyverse is a collection of R packages designed around a consistent approach to data science [@wickham2019tidyverse]. For clinical data management, it is useful because it provides readable tools for importing data, selecting variables, filtering records, creating new variables, grouping records, summarizing data, and joining datasets. Its grammar is particularly helpful for learners because many commands can be read from left to right as a sequence of operations.

The tidyverse is built around the concept of tidy data. In tidy data, each variable is a column, each observation is a row, and each type of observational unit is stored in its own table. Clinical research data are not always tidy when exported, especially when repeated measures, checkbox fields, or wide visit structures are involved. Nevertheless, the tidy data concept provides a useful standard for thinking about structure. For example, a participant-level enrollment dataset has one row per participant. A visit dataset may have one row per participant per visit. A laboratory dataset may have one row per participant per specimen or test result. Clarifying the observational unit of a dataset is a core data management task.

The following code creates a small clinical dataset and uses tidyverse functions to inspect and summarize it:

```r
library(tidyverse)

# Small example dataset: one row per participant
enrollment_data <- tibble(
  participant_id = c("P001", "P002", "P003", "P004", "P005"),
  site           = c("Kilifi", "Nairobi", "Kilifi", "Mombasa", "Nairobi"),
  age_years      = c(34, 29, 41, 52, NA),
  sex            = c("Female", "Male", "Female", "Male", "Female"),
  enrolled       = c("Yes", "Yes", "Yes", "No", "Yes")
)

# Count the number of records at each site
enrollment_data |>
  count(site)
```

The pipe operator `|>` sends the result of one step into the next step. It is part of base R from version 4.1.0 onward. In the example, `enrollment_data` is passed into `count(site)`, which counts the number of records by site. The pipe expresses a workflow as a sequence of steps, which becomes more valuable as scripts grow longer.

A common data management task is to filter records. The following code identifies participants with missing age values:

```r
# Keep only the rows where age is missing
enrollment_data |>
  filter(is.na(age_years))
```

This reads as: take `enrollment_data`, then keep only rows where `age_years` is missing. Such a command is simple, but it can become the basis for a query listing. A data manager might later add participant ID, site, variable name, and query text to create a structured query file.

Another common task is to summarize data:

```r
# One-row summary: record count, missing ages, and mean age
enrollment_data |>
  summarise(
    n_records   = n(),
    missing_age = sum(is.na(age_years)),
    mean_age    = mean(age_years, na.rm = TRUE)
  )
```

The function `n()` counts rows. The expression `sum(is.na(age_years))` counts missing age values because `is.na()` returns a logical vector, and `sum()` treats `TRUE` as 1 and `FALSE` as 0. The argument `na.rm = TRUE` tells R to remove missing values before calculating the mean. Without this argument, `mean()` returns `NA` if any age value is missing.

The tidyverse is powerful, but the beginner should focus on a small set of verbs:

| Verb | Purpose | Clinical data management example |
| --- | --- | --- |
| `select()` | Choose columns | Keep participant ID, site, and outcome variables |
| `filter()` | Choose rows | List records with missing consent dates |
| `mutate()` | Create or modify columns | Calculate age from date of birth and enrollment date |
| `count()` | Count records by category | Count participants by site and consent status |
| `group_by()` | Define groups for summary | Summarize missingness by site |
| `summarise()` | Create summary statistics | Count missing outcomes and duplicate IDs |
| `arrange()` | Sort rows | Sort query listing by site and participant ID |
| `left_join()` | Merge datasets | Add laboratory results to enrollment records |
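
Several of these verbs are often combined in a single pipeline. The following is a minimal sketch, not a prescribed workflow. It reuses the enrollment values from the earlier example, while the `lab_data` table, its `hb_g_dl` column, and its values are hypothetical and exist only for illustration. It loads `dplyr` and `tibble` directly; both are also attached by `library(tidyverse)`.

```r
library(dplyr)
library(tibble)

# Enrollment records: one row per participant (values from the earlier example)
enrollment_data <- tibble(
  participant_id = c("P001", "P002", "P003", "P004", "P005"),
  site           = c("Kilifi", "Nairobi", "Kilifi", "Mombasa", "Nairobi"),
  age_years      = c(34, 29, 41, 52, NA),
  sex            = c("Female", "Male", "Female", "Male", "Female"),
  enrolled       = c("Yes", "Yes", "Yes", "No", "Yes")
)

# Hypothetical laboratory results: one row per participant.
# P003 and P005 deliberately have no result.
lab_data <- tibble(
  participant_id = c("P001", "P002", "P004"),
  hb_g_dl        = c(12.8, 14.1, 13.5)
)

site_summary <- enrollment_data |>
  filter(enrolled == "Yes") |>                   # keep enrolled participants only
  select(participant_id, site, age_years) |>     # keep the columns needed here
  left_join(lab_data, by = "participant_id") |>  # add lab results; unmatched rows get NA
  mutate(lab_missing = is.na(hb_g_dl)) |>        # flag participants without a lab result
  group_by(site) |>                              # summarize separately for each site
  summarise(
    n_participants = n(),
    n_missing_age  = sum(is.na(age_years)),
    n_missing_lab  = sum(lab_missing),
    .groups = "drop"
  ) |>
  arrange(site)                                  # sort the summary by site name

print(site_summary)
```

Because `left_join()` keeps every row of the left-hand table, enrolled participants without a laboratory record stay in the result with `NA` in `hb_g_dl`, which is what makes the missingness count possible.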

The tidyverse should be learned gradually. A data manager does not need to memorize every function. More important is to understand how data move through a script and how each command changes the dataset. Every transformation should be explainable in clinical and data management terms.
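
One way to keep each transformation explainable is to comment every step in data management terms and to check the result against an independent count. The sketch below builds a small query listing for missing ages from the enrollment values used earlier. The column names `variable_name` and `query_text` and the wording of the query are assumptions for illustration, not a required query format.

```r
library(dplyr)
library(tibble)

# Enrollment records: one row per participant (values from the earlier example)
enrollment_data <- tibble(
  participant_id = c("P001", "P002", "P003", "P004", "P005"),
  site           = c("Kilifi", "Nairobi", "Kilifi", "Mombasa", "Nairobi"),
  age_years      = c(34, 29, 41, 52, NA),
  sex            = c("Female", "Male", "Female", "Male", "Female"),
  enrolled       = c("Yes", "Yes", "Yes", "No", "Yes")
)

age_queries <- enrollment_data |>
  # Step 1: keep only records with a missing age (one row per query)
  filter(is.na(age_years)) |>
  # Step 2: add the fields that describe the query (hypothetical names and text)
  mutate(
    variable_name = "age_years",
    query_text    = "Age is missing. Please verify against the source document."
  ) |>
  # Step 3: keep only the columns needed in the listing
  select(participant_id, site, variable_name, query_text) |>
  # Step 4: sort so that each site can review its own queries together
  arrange(site, participant_id)

# Check: the number of queries must equal the number of missing ages
stopifnot(nrow(age_queries) == sum(is.na(enrollment_data$age_years)))

print(age_queries)
```

The final check stops the script if the listing and the raw count disagree, so an unintended change to the filter is caught before the listing is sent to a site.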