Data Cleaning and Preparation in R

The Purpose of Data Cleaning and Preparation

30-45 minutes Applied Step 11 of 23

Part 1

Data cleaning is the organized process of identifying, investigating, documenting, and resolving problems in research data. It is not a casual activity performed after data collection is complete. It is part of the quality system of a study. In clinical research, cleaning begins before the first participant is enrolled, because the protocol, CRFs, database design, validation rules, completion guidelines, and monitoring plan all determine what kinds of errors are likely to occur and how they will be handled. R becomes useful when those expectations can be translated into transparent, repeatable checks and preparation steps.

Data preparation is broader than error correction. It includes the work required to make data usable for review, reporting, monitoring, analysis, archiving, and sharing. A raw REDCap export may contain coded values, checkbox variables, repeated event structures, administrative fields, timestamps, and wide-format visit data. The dataset may be excellent as a database export but not yet convenient for a data quality report or statistical analysis. Preparation may therefore include selecting relevant variables, standardizing names, converting dates, recoding values, deriving variables, joining related datasets, reshaping repeated measures, and creating summary outputs.

The distinction between cleaning and preparation matters because not every transformation is a correction. Correcting a typographical error in a recorded date is different from deriving age at enrollment from date of birth and consent date. Recoding `1` and `0` into `Yes` and `No` is different from changing an invalid value after source document review. Creating a site-level missingness summary is different from editing participant-level data. A high-quality workflow records these differences.
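The difference between converting, recoding, and deriving can be made visible in a script. The following is a minimal sketch in base R, not a prescribed method. The data frame, its column names (`record_id`, `dob`, `consent_date`, `diabetes`), the helper `age_completed_years()`, and the `transformation_log` table are all hypothetical and stand in for a real export and a real study convention.

```r
# Hypothetical raw export; names and values are illustrative only.
raw <- data.frame(
  record_id    = c("001", "002", "003"),
  dob          = c("1980-05-14", "1975-11-02", "1990-01-30"),
  consent_date = c("2024-03-01", "2024-03-15", "2024-04-02"),
  diabetes     = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# Work on a copy so that `raw` is never modified.
prepared <- raw

# Conversion: text dates become Date objects (no value is corrected).
prepared$dob          <- as.Date(prepared$dob, format = "%Y-%m-%d")
prepared$consent_date <- as.Date(prepared$consent_date, format = "%Y-%m-%d")

# Recoding: the coded column is kept and a labelled column is added.
prepared$diabetes_label <- factor(
  prepared$diabetes,
  levels = c(0, 1),
  labels = c("No", "Yes")
)

# Derivation: age in completed years at the consent date.
age_completed_years <- function(dob, ref) {
  d <- as.POSIXlt(dob)
  r <- as.POSIXlt(ref)
  age <- r$year - d$year
  # Subtract one year if the birthday has not yet occurred in the reference year.
  before_birthday <- (r$mon < d$mon) | (r$mon == d$mon & r$mday < d$mday)
  age - as.integer(before_birthday)
}
prepared$age_at_enrollment <- age_completed_years(
  prepared$dob, prepared$consent_date
)

# Record what kind of transformation produced each changed or new column.
transformation_log <- data.frame(
  variable = c("dob", "consent_date", "diabetes_label", "age_at_enrollment"),
  type     = c("convert", "convert", "recode", "derive"),
  rule     = c(
    "character to Date, format %Y-%m-%d",
    "character to Date, format %Y-%m-%d",
    "0 = No, 1 = Yes; original diabetes column retained",
    "completed years between dob and consent_date"
  ),
  stringsAsFactors = FALSE
)
```

None of these steps is a correction: no recorded value is overwritten, and the log states which columns were converted, recoded, or derived. A correction made after source document review would be handled separately from this script.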

Part 2

In a clinical data management environment, R should not be used as a hidden place where data are silently altered. The raw export should remain unchanged. Cleaning scripts should make transformations explicit. Query outputs should identify records that need review. Derived datasets should be traceable to the raw inputs and scripts that produced them. When corrections are needed in an electronic data capture system, the preferred route is usually to correct the source database through the approved query and audit trail workflow, then re-export the corrected data. R can identify the issue and prepare the query listing, but it should not become an unofficial database outside the validated system.

This chapter therefore treats R as a controlled preparation environment. The goal is not only to learn commands, but to learn a defensible way of working. A defensible workflow answers several questions: What was the raw input? What script was run? What rules were applied? Which records were flagged? Which outputs were produced? Which values were changed, derived, or recoded? Which decisions required human review? These questions connect directly to Good Clinical Practice, data integrity expectations, and clinical data management good practice [@ich2016gcp; @mhra2018dataintegrity; @scdm2024gcdmp].

Figure 7.1 Placeholder: Reproducible data cleaning workflow in R. This figure should show raw REDCap exports entering a protected `data_raw` folder, R scripts applying documented checks, query outputs being reviewed by the study team, corrections being made in the source database, and cleaned or analysis-ready datasets being regenerated from updated exports.
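The idea that R flags records and prepares a query listing, without changing the raw data, can be sketched as follows. This is an illustrative base R example under stated assumptions: the data frame, the column names, the check identifiers (`DQ001` to `DQ003`), the helper `make_queries()`, and the output file name are all hypothetical. A real project would read the export from the protected raw folder and write listings to a separate output location defined by the study.

```r
# Hypothetical raw export; in practice this would be read from the raw folder.
raw <- data.frame(
  record_id    = c("001", "002", "003", "004"),
  site         = c("A", "A", "B", "B"),
  consent_date = c("2024-03-01", "15/03/2024", "2024-04-02", NA),
  visit_date   = c("2024-03-10", "2024-03-20", "2024-03-28", "2024-04-15"),
  stringsAsFactors = FALSE
)
raw_snapshot <- raw  # kept only to confirm later that `raw` was not altered

# Parse dates into separate vectors; values that do not match the
# expected format become NA instead of being silently "fixed".
consent <- as.Date(raw$consent_date, format = "%Y-%m-%d")
visit   <- as.Date(raw$visit_date,   format = "%Y-%m-%d")

# Hypothetical helper: one row per flagged record for a given check.
make_queries <- function(data, flag, check_id, message) {
  flag <- !is.na(flag) & flag  # a check that cannot be evaluated is not a flag
  data.frame(
    record_id = data$record_id[flag],
    site      = data$site[flag],
    check_id  = rep(check_id, sum(flag)),
    message   = rep(message, sum(flag)),
    stringsAsFactors = FALSE
  )
}

queries <- rbind(
  make_queries(raw, is.na(raw$consent_date),
               "DQ001", "Consent date is missing"),
  make_queries(raw, !is.na(raw$consent_date) & is.na(consent),
               "DQ002", "Consent date is not a valid YYYY-MM-DD date"),
  make_queries(raw, visit < consent,
               "DQ003", "Visit date is earlier than consent date")
)

# Write the listing for human review; tempdir() is used here only so
# the sketch runs anywhere.
out_file <- file.path(tempdir(), "query_listing.csv")
write.csv(queries, out_file, row.names = FALSE)

# The raw data must be identical to what was loaded.
stopifnot(identical(raw, raw_snapshot))
```

The script produces a listing of records to review and leaves every recorded value as it was. Any resulting correction would still be made in the source database through the query workflow, after which the export and this script are run again.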