Data Cleaning and Preparation in R

Writing Readable Cleaning Scripts

30-45 minutes | Applied | Step 5 of 8

Part 1

A cleaning script should be written for both R and humans. R needs correct syntax, but colleagues need clear structure. A script that cannot be understood by another data manager is difficult to review, maintain, or trust. Readability is part of quality.

A useful cleaning script often follows a predictable structure:

1. Header with purpose, project, author, date, and input/output description.
2. Package loading.
3. File paths and configuration values.
4. Data import.
5. Initial inspection.
6. Cleaning and transformation steps.
7. Quality checks and query outputs.
8. Export of cleaned datasets or reports.
9. Session information or version notes where appropriate.

Part 2

A skeleton that follows this structure uses variables for file paths, which reduces repetition. It imports raw data, creates a prepared dataset, generates a query listing, and writes outputs. In a real study, this script would be reviewed and adapted to the project's data dictionary. The query logic might be split into multiple named sections, especially if the study has many rules.

Good scripts use clear object names. Names such as `raw_enrollment`, `enrollment_prepared`, and `query_listing` communicate purpose. Names such as `df`, `df2`, and `final_final` do not.

Good scripts also avoid hidden manual steps. If the user must open a file in Excel and delete rows before running the script, the workflow is no longer fully reproducible.
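The skeleton described above could look like the following minimal sketch. It is an illustration under stated assumptions, not a project template: the file names, the columns `participant_id`, `site`, and `consent_date`, the sample rows, and the use of `dplyr` and `readr` are all hypothetical choices. The sketch writes a tiny made-up raw file to a temporary folder only so that it runs end to end.

```r
# 1. Header ---------------------------------------------------------
# Purpose : Prepare enrollment data and list open queries
# Project : Example study (hypothetical)
# Author  : <name>
# Date    : <YYYY-MM-DD>
# Input   : enrollment_raw.csv (hypothetical file name)
# Output  : enrollment_prepared.csv, query_listing.csv

# 2. Packages -------------------------------------------------------
library(dplyr)
library(readr)

# 3. File paths and configuration -----------------------------------
# tempdir() keeps the sketch runnable; a real script would point to
# the project's data folders instead.
data_dir      <- tempdir()
raw_path      <- file.path(data_dir, "enrollment_raw.csv")
prepared_path <- file.path(data_dir, "enrollment_prepared.csv")
query_path    <- file.path(data_dir, "query_listing.csv")

# Illustration only: create a small made-up raw file.
write_csv(
  data.frame(
    participant_id = c("P001", "P002", "P003"),
    site           = c("a", "B", "A"),
    consent_date   = c("2024-01-15", NA, "2024-02-03")
  ),
  raw_path
)

# 4. Import ---------------------------------------------------------
# Read everything as text first; types are converted explicitly below.
raw_enrollment <- read_csv(
  raw_path,
  col_types = cols(.default = col_character())
)

# 5. Initial inspection ---------------------------------------------
glimpse(raw_enrollment)

# 6. Cleaning and transformation ------------------------------------
enrollment_prepared <- raw_enrollment %>%
  mutate(
    site         = toupper(site),
    consent_date = as.Date(consent_date, format = "%Y-%m-%d")
  )

# 7. Quality checks and query output --------------------------------
query_listing <- enrollment_prepared %>%
  filter(is.na(consent_date)) %>%
  mutate(query_text = "Consent date is missing") %>%
  select(participant_id, site, query_text)

# 8. Export ---------------------------------------------------------
write_csv(enrollment_prepared, prepared_path)
write_csv(query_listing, query_path)

# 9. Session information --------------------------------------------
sessionInfo()
```

Because every path is defined once in section 3, moving the project to another folder means changing a single line rather than searching the script for repeated strings.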

Part 3

Comments should explain why, not merely repeat what the code says. A comment such as `# filter missing consent dates` adds little if the code already says `filter(is.na(consent_date))`. A better comment might explain that consent date is required for all enrolled participants according to the CRF completion guidelines. Comments should be concise but meaningful.

Readable scripts are also easier to debug. When a script is organized into sections, the data manager can identify where an error occurred: import, type conversion, recoding, derivation, checking, or export. This reduces frustration and improves reliability.
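Both points can be illustrated in one short sketch. The helper `run_section()` is hypothetical, not part of any package: it runs one named section and prefixes any error with the section name, which is one possible way to make the failing stage obvious. The data frame and its columns are made up for illustration, and `dplyr` is assumed to be installed.

```r
library(dplyr)

# Hypothetical helper: evaluate one section of the script and, if it
# fails, re-raise the error with the section name in front.
run_section <- function(section_name, code) {
  tryCatch(
    code,
    error = function(e) {
      stop(
        sprintf("Section '%s' failed: %s", section_name, conditionMessage(e)),
        call. = FALSE
      )
    }
  )
}

# Made-up data for illustration only.
enrollment_prepared <- data.frame(
  participant_id = c("P001", "P002", "P003"),
  consent_date   = as.Date(c("2024-01-15", NA, "2024-02-03"))
)

query_listing <- run_section("quality checks", {
  # Consent date is required for every enrolled participant under the
  # CRF completion guidelines, so each missing value becomes a query.
  enrollment_prepared %>%
    filter(is.na(consent_date)) %>%
    select(participant_id)
})
```

The comment inside the section states the rule behind the check rather than restating `filter(is.na(consent_date))`. If a section fails, for example a type conversion given an unparseable date, the resulting error message begins with the name passed to `run_section()`.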