Writing Readable Cleaning Scripts
7.7 Writing Readable Cleaning Scripts
A cleaning script should be written for both R and humans. R needs correct syntax, but colleagues need clear structure. A script that cannot be understood by another data manager is difficult to review, maintain, or trust. Readability is part of quality.
A useful cleaning script often follows a predictable structure:
1. Header with purpose, project, author, date, and input/output description.
2. Package loading.
3. File paths and configuration values.
4. Data import.
5. Initial inspection.
6. Cleaning and transformation steps.
7. Quality checks and query outputs.
8. Export of cleaned datasets or reports.
9. Session information or version notes where appropriate.
The following skeleton shows this structure:
```r
# Project: Clinical Research Data Management Practice Study
# Script:  02_clean_enrollment_data.R
# Purpose: Import enrollment export, prepare variables, and generate query outputs
# Input:   data_raw/redcap_enrollment_export_2026-06-01.csv
# Outputs:
#   outputs/enrollment_query_listing_2026-06-01.csv
#   data_clean/enrollment_prepared_2026-06-01.csv

# ---- Packages ----
library(tidyverse)
library(janitor)

# ---- File paths and configuration ----
# The export date is set once and reused in every file name.
export_date <- "2026-06-01"
raw_file <- str_glue("data_raw/redcap_enrollment_export_{export_date}.csv")
query_file <- str_glue("outputs/enrollment_query_listing_{export_date}.csv")
prepared_file <- str_glue("data_clean/enrollment_prepared_{export_date}.csv")

# ---- Import ----
raw_enrollment <- read_csv(raw_file) |>
  clean_names()

# ---- Cleaning and derivation ----
# Without a format argument, as.Date() parses character dates written as
# YYYY-MM-DD or YYYY/MM/DD.
enrollment_prepared <- raw_enrollment |>
  mutate(
    consent_date = as.Date(consent_date),
    enrollment_date = as.Date(enrollment_date),
    date_of_birth = as.Date(date_of_birth),
    # Subtracting two Date values gives days; divide to get approximate years.
    age_years_derived = as.numeric(enrollment_date - date_of_birth) / 365.25
  )

# ---- Quality checks and query listing ----
query_listing <- enrollment_prepared |>
  filter(is.na(consent_date) | enrollment_date < consent_date) |>
  transmute(
    participant_id,
    site,
    query_variable = case_when(
      is.na(consent_date) ~ "consent_date",
      enrollment_date < consent_date ~ "enrollment_date",
      TRUE ~ "unknown"
    ),
    query_text = case_when(
      is.na(consent_date) ~ "Please enter or verify the informed consent date.",
      enrollment_date < consent_date ~ "Enrollment date appears to occur before consent date. Please verify.",
      TRUE ~ "Please review this record."
    )
  )

# ---- Export ----
write_csv(query_listing, query_file)
write_csv(enrollment_prepared, prepared_file)
```
The script uses variables for file paths, which reduces repetition. It imports raw data, creates a prepared dataset, generates a query listing, and writes outputs. In a real study, this script would be reviewed and adapted to the project's data dictionary. The query logic might be split into multiple named sections, especially if the study has many rules.
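One possible way to split the query logic into named sections is to write one small block per rule and then stack the results. The sketch below assumes the same column names as the skeleton; the toy records are invented for illustration, and the object names `check_missing_consent` and `check_enrollment_before_consent` are hypothetical.

```r
library(dplyr)

# Toy data invented for illustration only; not real study records.
enrollment_prepared <- tibble(
  participant_id = c("P001", "P002", "P003"),
  site = c("A", "A", "B"),
  consent_date = as.Date(c("2026-01-10", NA, "2026-02-05")),
  enrollment_date = as.Date(c("2026-01-12", "2026-01-20", "2026-02-01"))
)

# ---- Rule 1: consent date is missing ----
check_missing_consent <- enrollment_prepared |>
  filter(is.na(consent_date)) |>
  transmute(
    participant_id,
    site,
    query_variable = "consent_date",
    query_text = "Please enter or verify the informed consent date."
  )

# ---- Rule 2: enrollment date precedes consent date ----
# Rows with a missing consent date are dropped here because the
# comparison evaluates to NA; Rule 1 already covers them.
check_enrollment_before_consent <- enrollment_prepared |>
  filter(enrollment_date < consent_date) |>
  transmute(
    participant_id,
    site,
    query_variable = "enrollment_date",
    query_text = "Enrollment date appears to occur before consent date. Please verify."
  )

# ---- Combine all rule outputs into one listing ----
query_listing <- bind_rows(
  check_missing_consent,
  check_enrollment_before_consent
)
```

With this layout, adding a rule means adding one named block and one line in `bind_rows()`, which may be easier to review than a growing `case_when()`.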
Good scripts use clear object names. Names such as `raw_enrollment`, `enrollment_prepared`, and `query_listing` communicate purpose. Names such as `df`, `df2`, and `final_final` do not. Good scripts also avoid hidden manual steps. If the user must open a file in Excel and delete rows before running the script, the workflow is no longer fully reproducible.
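A hidden manual step can often be moved into the script itself. The sketch below replaces "delete the test rows in Excel" with a scripted exclusion. It assumes, purely for illustration, that test records carry a `TEST` prefix in `participant_id`; a real study would take this rule from its own documentation. The data and the object names `excluded_records` and `enrollment_no_tests` are hypothetical.

```r
library(dplyr)

# Toy data invented for illustration only.
raw_enrollment <- tibble(
  participant_id = c("P001", "TEST01", "P002", "TEST02"),
  site = c("A", "A", "B", "B")
)

# Assumption: test records are identifiable by a "TEST" prefix.
# Keep the excluded rows in their own object so a reviewer can inspect them.
excluded_records <- raw_enrollment |>
  filter(startsWith(participant_id, "TEST"))

# The raw object stays unchanged; the exclusion produces a new object.
enrollment_no_tests <- raw_enrollment |>
  filter(!startsWith(participant_id, "TEST"))

message(nrow(excluded_records), " test record(s) excluded.")
```

Because the exclusion is written down, rerunning the script on the same raw export gives the same rows every time.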
Comments should explain why, not merely repeat what the code says. A comment such as `# filter missing consent dates` adds little if the code already says `filter(is.na(consent_date))`. A better comment might explain that consent date is required for all enrolled participants according to the CRF completion guidelines. Comments should be concise but meaningful.
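The difference between the two kinds of comment can be seen side by side. In this sketch the code is identical in both cases and only the comment changes; the toy data and object names are invented for illustration.

```r
library(dplyr)

# Toy data invented for illustration only.
enrollment_prepared <- tibble(
  participant_id = c("P001", "P002"),
  consent_date = as.Date(c("2026-01-10", NA))
)

# Weak comment: restates what the code already says.
# filter missing consent dates
missing_consent_weak <- enrollment_prepared |>
  filter(is.na(consent_date))

# Stronger comment: states the study rule behind the check.
# Consent date is required for all enrolled participants
# (CRF completion guidelines), so a missing value is queried.
missing_consent <- enrollment_prepared |>
  filter(is.na(consent_date))
```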
| Script quality feature | Weak practice | Stronger practice |
|---|---|---|
| Object names | `x`, `data2`, `new` | `raw_enrollment`, `lab_clean`, `query_listing` |
| File paths | Hard-coded desktop path | Relative path inside R project |
| Raw data | Edited manually before import | Imported unchanged from `data_raw` |
| Comments | Repeat code syntax | Explain study rule or decision |
| Outputs | Overwrite unclear files | Write named outputs to `outputs` or `data_clean` |
| Review | Only original author understands script | Another data manager can read and rerun it |
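The file path and output rows of the table can be sketched in base R. This example builds output paths from one configuration value and refuses to overwrite an existing file silently. The overwrite guard is one possible design choice, not a requirement stated above. `write_new_csv()` is a hypothetical helper, and a temporary folder stands in for the R project root so the sketch can run anywhere; inside a project, the path would simply start at `outputs`.

```r
# Assumption: a temporary folder stands in for the R project root.
project_root <- tempfile("project_")
export_date <- "2026-06-01"

output_dir <- file.path(project_root, "outputs")
dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)

# The file name states what the output is and which export it came from.
query_file <- file.path(
  output_dir,
  paste0("enrollment_query_listing_", export_date, ".csv")
)

# Hypothetical helper: stop instead of silently replacing an earlier output.
write_new_csv <- function(data, path) {
  if (file.exists(path)) {
    stop("Output already exists, not overwriting: ", path)
  }
  write.csv(data, path, row.names = FALSE)
  invisible(path)
}

# Toy listing invented for illustration only.
query_listing <- data.frame(
  participant_id = "P002",
  query_variable = "consent_date"
)

write_new_csv(query_listing, query_file)
```

Calling `write_new_csv()` a second time with the same path stops with an error, so an earlier listing is never replaced without the data manager noticing.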
Readable scripts are also easier to debug. When a script is organized into sections, the data manager can identify where an error occurred: import, type conversion, recoding, derivation, checking, or export. The fault can then be corrected in that section without rereading the whole script, which reduces frustration and improves reliability.
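One way to make the failing section obvious is to end each section with a short check whose message names that section. The sketch below uses named conditions in `stopifnot()`, which assumes R 4.0 or later; the toy data and the specific checks are invented for illustration and would be replaced by the study's own rules.

```r
library(dplyr)

# ---- Import (toy data stands in for the imported export) ----
raw_enrollment <- tibble(
  participant_id = c("P001", "P002"),
  enrollment_date = c("2026-01-12", "2026-01-20")
)
stopifnot(
  "Import: participant_id must be unique" =
    anyDuplicated(raw_enrollment$participant_id) == 0
)

# ---- Type conversion ----
enrollment_prepared <- raw_enrollment |>
  mutate(enrollment_date = as.Date(enrollment_date))
stopifnot(
  "Type conversion: enrollment_date must be a Date" =
    inherits(enrollment_prepared$enrollment_date, "Date"),
  "Type conversion: no enrollment_date may be lost" =
    !anyNA(enrollment_prepared$enrollment_date)
)
```

If a check fails, the error message starts with the section name, so the data manager knows where to look first.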