Data Cleaning and Preparation in R

Writing Readable Cleaning Scripts

7.7 Writing Readable Cleaning Scripts

A cleaning script should be written for both R and humans. R needs correct syntax; colleagues need clear structure. A script that another data manager cannot understand is difficult to review, maintain, or trust. Readability is part of quality.

A useful cleaning script often follows a predictable structure:

1. Header with purpose, project, author, date, and input/output description.
2. Package loading.
3. File paths and configuration values.
4. Data import.
5. Initial inspection.
6. Cleaning and transformation steps.
7. Quality checks and query outputs.
8. Export of cleaned datasets or reports.
9. Session information or version notes where appropriate.

The following skeleton shows this structure:

```r
# Project: Clinical Research Data Management Practice Study
# Script: 02_clean_enrollment_data.R
# Purpose: Import enrollment export, prepare variables, and generate query outputs
# Input: data_raw/redcap_enrollment_export_2026-06-01.csv
# Outputs:
#   outputs/enrollment_query_listing_2026-06-01.csv
#   data_clean/enrollment_prepared_2026-06-01.csv

# ---- Packages ----
library(tidyverse)
library(janitor)

# ---- File paths and configuration ----
export_date <- "2026-06-01"
raw_file <- str_glue("data_raw/redcap_enrollment_export_{export_date}.csv")
query_file <- str_glue("outputs/enrollment_query_listing_{export_date}.csv")
prepared_file <- str_glue("data_clean/enrollment_prepared_{export_date}.csv")

# ---- Import ----
raw_enrollment <- read_csv(raw_file) |>
  clean_names()

# ---- Cleaning and derivation ----
enrollment_prepared <- raw_enrollment |>
  mutate(
    consent_date = as.Date(consent_date),
    enrollment_date = as.Date(enrollment_date),
    date_of_birth = as.Date(date_of_birth),
    age_years_derived = as.numeric(enrollment_date - date_of_birth) / 365.25
  )

# ---- Quality checks and query listing ----
query_listing <- enrollment_prepared |>
  filter(is.na(consent_date) | enrollment_date < consent_date) |>
  transmute(
    participant_id,
    site,
    query_variable = case_when(
      is.na(consent_date) ~ "consent_date",
      enrollment_date < consent_date ~ "enrollment_date",
      TRUE ~ "unknown"
    ),
    query_text = case_when(
      is.na(consent_date) ~ "Please enter or verify the informed consent date.",
      enrollment_date < consent_date ~ "Enrollment date appears to occur before consent date. Please verify.",
      TRUE ~ "Please review this record."
    )
  )

# ---- Export ----
write_csv(query_listing, query_file)
write_csv(enrollment_prepared, prepared_file)
```

The script stores file paths in variables, so each path is defined once and the export date is changed in a single place. It imports raw data, creates a prepared dataset, generates a query listing, and writes outputs. In a real study, this script would be reviewed and adapted to the project's data dictionary. The query logic might be split into multiple named sections, especially if the study has many rules.

Good scripts use clear object names. Names such as `raw_enrollment`, `enrollment_prepared`, and `query_listing` communicate purpose. Names such as `df`, `df2`, and `final_final` do not. Good scripts also avoid hidden manual steps. If the user must open a file in Excel and delete rows before running the script, the workflow is no longer fully reproducible.

Comments should explain why, not merely repeat what the code says. A comment such as `# filter missing consent dates` adds little if the code already says `filter(is.na(consent_date))`. A better comment might explain that consent date is required for all enrolled participants according to the CRF completion guidelines. Comments should be concise but meaningful.
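
The points about hidden manual steps and "why" comments can be illustrated with a minimal sketch. The small data frame and the rule that test records carry a `TEST` prefix are assumptions made for this example only; a real study would take the exclusion rule from its own documentation.

```r
library(dplyr)

# Hypothetical records standing in for the raw export.
raw_enrollment <- data.frame(
  participant_id = c("S01-001", "S01-002", "TEST-001", "S02-001"),
  site = c("S01", "S01", "S01", "S02"),
  consent_date = as.Date(c("2026-05-01", NA, "2026-05-03", "2026-05-10"))
)

# Assumed rule: IDs starting with "TEST" were entered during database testing.
# They are excluded here in code, not deleted by hand in Excel, so the
# exclusion is visible to reviewers and repeated on every run.
enrollment_no_test <- raw_enrollment |>
  filter(!startsWith(participant_id, "TEST"))

# Consent date is required for all enrolled participants (CRF completion
# guidelines), so a missing value is sent as a query, not filled in.
missing_consent <- enrollment_no_test |>
  filter(is.na(consent_date))
```

Neither comment restates the `filter()` call; each records the study rule that justifies it, which is the information a reviewer cannot recover from the code alone.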

| Script quality feature | Weak practice | Stronger practice |
| --- | --- | --- |
| Object names | `x`, `data2`, `new` | `raw_enrollment`, `lab_clean`, `query_listing` |
| File paths | Hard-coded desktop path | Relative path inside R project |
| Raw data | Edited manually before import | Imported unchanged from `data_raw` |
| Comments | Repeat code syntax | Explain study rule or decision |
| Outputs | Overwrite unclear files | Write named outputs to `outputs` or `data_clean` |
| Review | Only original author understands script | Another data manager can read and rerun it |
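
The "File paths" and "Outputs" rows can be sketched in base R as follows. The helper names `dated_path()` and `write_csv_no_overwrite()` are hypothetical and written for this illustration; they are not part of any package, and a project may reasonably choose a different policy for existing files.

```r
# Hypothetical helper: build a dated output path relative to the project root.
dated_path <- function(folder, stem, export_date, ext = "csv") {
  file.path(folder, paste0(stem, "_", export_date, ".", ext))
}

# Hypothetical helper: write a CSV, but stop if the file already exists
# so that an earlier output is never silently replaced.
write_csv_no_overwrite <- function(data, path) {
  if (file.exists(path)) {
    stop("Output already exists, not overwriting: ", path)
  }
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  write.csv(data, path, row.names = FALSE)
  invisible(path)
}

export_date <- "2026-06-01"
query_file <- dated_path("outputs", "enrollment_query_listing", export_date)
query_file
# "outputs/enrollment_query_listing_2026-06-01.csv"
```

Because the path is relative, the same script runs unchanged on another data manager's machine as long as it is started from the project folder.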

Readable scripts are also easier to debug. When a script is organized into sections, the data manager can identify where an error occurred: import, type conversion, recoding, derivation, checking, or export. This reduces frustration and improves reliability.
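
One way to make the failing section obvious is to end each section with a small check whose message names that section. The sketch below uses base R `stopifnot()` with named conditions (available from R 4.0.0); the inline data and the specific checks are assumptions chosen for illustration, not study requirements.

```r
# ---- Import (hypothetical inline data standing in for the raw export) ----
raw_enrollment <- data.frame(
  participant_id = c("S01-001", "S01-002", "S02-001"),
  consent_date = c("2026-05-01", "2026-05-04", "2026-05-10"),
  enrollment_date = c("2026-05-02", "2026-05-04", "2026-05-12")
)
stopifnot("Import: no rows were read" = nrow(raw_enrollment) > 0)

# ---- Type conversion ----
enrollment_prepared <- raw_enrollment
enrollment_prepared$consent_date <- as.Date(raw_enrollment$consent_date)
enrollment_prepared$enrollment_date <- as.Date(raw_enrollment$enrollment_date)

# Converting text to Date should not create new missing values.
stopifnot(
  "Type conversion: consent_date gained missing values" =
    sum(is.na(enrollment_prepared$consent_date)) ==
      sum(is.na(raw_enrollment$consent_date))
)

# ---- Checking ----
stopifnot(
  "Checking: duplicate participant_id found" =
    anyDuplicated(enrollment_prepared$participant_id) == 0
)
```

If a check fails, the error message begins with the section name, so the data manager knows whether to look at the import, the conversion, or the checking step.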