Data Cleaning and Preparation in R

Cleaning Logs and Query Outputs

7.8 Cleaning Logs and Query Outputs

30-45 minutes Applied Step 3 of 9

Reading 1

7.8 Cleaning Logs and Query Outputs

3 / 9

A cleaning log is a structured record of data issues, checks, decisions, and status. It may be maintained in the EDC system, a tracking database, a spreadsheet, or generated from R. The exact format depends on the study, but the purpose is consistent: to show what was found, what was done, and what remains unresolved. R can generate query outputs from rule-based checks. For example, a script can identify missing consent dates, impossible age values, and date inconsistencies, then combine them into one query listing: ```r missing_consent_queries <- enrollment_prepared |> filter(is.na(consent_date)) |> transmute( participant_id, site, rule_id = "ENR_MISS_CONSENT_DATE", variable = "consent_date", query_text = "Please enter or verify the informed consent date.", priority = "High" ) age_queries <- enrollment_prepared |> filter(age_years_derived < 18 | age_years_derived > 120) |> transmute( participant_id, site, rule_id = "ENR_AGE_RANGE", variable = "age_years_derived", query_text = "Derived age is outside the expected adult range. Please verify date of birth and enrollment date.", priority = "Medium" ) date_queries <- enrollment_prepared |> filter(enrollment_date < consent_date) |> transmute( participant_id, site, rule_id = "ENR_DATE_SEQUENCE", variable = "enrollment_date", query_text = "Enrollment date appears to occur before consent date. Please verify both dates.", priority = "High" ) query_listing <- bind_rows( missing_consent_queries, age_queries, date_queries ) |> arrange(site, participant_id, rule_id) ``` The `bind_rows()` function combines query outputs with the same column structure. The resulting listing can be reviewed by the data manager before being entered into REDCap or sent through the approved query process. It is important that R-generated queries be reviewed. Automated rules may flag legitimate exceptions, and some rules may need refinement. A cleaning log may include additional fields:

Field	Purpose
`rule_id`	Unique identifier for the check
`participant_id`	Record requiring review
`site`	Site responsible for follow-up
`variable`	Field or derived variable involved
`issue_description`	Plain-language explanation of the issue
`query_text`	Suggested message to site
`priority`	Operational importance
`status`	Open, answered, resolved, closed, not a query
`date_identified`	Date the issue was found
`date_resolved`	Date the issue was resolved
`resolution_notes`	Human-readable decision or explanation

R can create the initial issue listing, but status and resolution notes often require workflow integration with the EDC system or a query management tracker. If queries are managed inside REDCap, the R output should support that process rather than replace it. If queries are tracked externally, the tracker should be controlled, versioned, and backed up. The following code adds a date identified field: ```r query_listing <- query_listing |> mutate( date_identified = Sys.Date(), status = "Open" ) write_csv(query_listing, "outputs/query_listing_2026-06-01.csv") ``` This creates an operational output. The study team should define how this output is reviewed, who approves queries, how site responses are recorded, and when queries are considered closed. Cleaning logs should not become informal side records that contradict the source database or approved query system.