Handling Missing Data in Cleaning Workflows
Part 1
Missing data are not all the same. A value may be missing because it was not collected, because it was not applicable, because the participant refused to answer, because the result is pending, because a visit was missed, because a specimen was lost, because the field was accidentally skipped, or because the export excluded the field. Treating all missing values as identical can lead to poor decisions. Clinical data management requires careful classification of missingness before deciding whether to query, derive, ignore, code, or escalate [@little2019missing; @vanbuuren2018missing].
R represents missing values as `NA`, but a database export may contain missingness in different forms. Empty cells may become `NA`. A value such as `999` may indicate "not done" in one dataset but a real number in another. A free text value such as `Unknown` may need to be treated as a category rather than as missing. A checkbox export may use `0` and `1`, where `0` means unchecked rather than unknown. These distinctions must be understood before cleaning.
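The distinctions above can be sketched in base R. This is only an illustration: the data frame `raw`, its column names, and the rule that `999` means "not done" for `weight_kg` are assumptions made for the example, not conventions of any particular study.

```r
# Hypothetical export; column names and codes are illustrative assumptions.
raw <- data.frame(
  participant_id = c("P01", "P02", "P03", "P04"),
  weight_kg      = c("72.5", "", "999", "81.0"),
  smoking_status = c("Never", "Unknown", "", "Current"),
  consent_box    = c(1, 0, 0, 1),
  stringsAsFactors = FALSE
)

clean <- raw

# Empty (or whitespace-only) strings become NA in every character column.
char_cols <- vapply(clean, is.character, logical(1))
clean[char_cols] <- lapply(clean[char_cols], function(x) {
  x[!is.na(x) & trimws(x) == ""] <- NA_character_
  x
})

# Assumption: 999 is documented as "not done" for weight_kg only,
# so the recode is applied to that column and nowhere else.
clean$weight_kg[clean$weight_kg %in% "999"] <- NA_character_
clean$weight_kg <- as.numeric(clean$weight_kg)

# "Unknown" is kept as an explicit category rather than converted to NA.
clean$smoking_status <- factor(clean$smoking_status,
                               levels = c("Never", "Current", "Unknown"))

# consent_box is left alone: 0 means unchecked, not unknown.
clean
```

The recode is deliberately column-specific. Applying a blanket rule such as "every `999` is missing" would silently destroy real values in any column where `999` is a legitimate number.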
The first step is to summarize missingness:
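A minimal sketch in base R, assuming a hypothetical data frame named `visits` whose columns are invented for illustration:

```r
# Hypothetical data; the data frame and column names are illustrative.
visits <- data.frame(
  participant_id  = c("P01", "P02", "P03", "P04"),
  visit_date      = as.Date(c("2024-01-10", NA, "2024-02-02", "2024-02-15")),
  systolic_bp     = c(128, NA, NA, 141),
  primary_outcome = c("Yes", NA, "No", NA),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix for a data frame;
# colSums() then counts the TRUE values in each column.
missing_counts <- colSums(is.na(visits))
missing_counts
```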
Part 2
This code counts missing values for every column. It is useful for a first overview, but it does not decide which missing values matter. A missing pregnancy test result may be critical in one study and irrelevant in another. A missing follow-up date may be expected for participants not yet due for follow-up. The data manager must interpret missingness in context.
A more useful approach is to check missingness for required fields:
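One way this might look in base R is shown here. The data frame `records` and the contents of `required_fields` are assumptions for the example.

```r
# Hypothetical data; names are illustrative.
records <- data.frame(
  participant_id  = c("P01", "P02", "P03"),
  consent_date    = as.Date(c("2024-01-05", NA, "2024-01-20")),
  enrollment_date = as.Date(c("2024-01-06", "2024-01-12", NA)),
  comments        = c(NA, "late visit", NA),
  stringsAsFactors = FALSE
)

# Assumed list of required variables for this example.
required_fields <- c("participant_id", "consent_date", "enrollment_date")

# Stop early if the export lacks a required column altogether.
absent <- setdiff(required_fields, names(records))
if (length(absent) > 0) {
  stop("Columns not in export: ", paste(absent, collapse = ", "))
}

# Count missing values in the required columns only.
required_missing <- colSums(is.na(records[required_fields]))

# Identify participants with at least one required field missing.
incomplete <- !complete.cases(records[required_fields])
needs_review <- records$participant_id[incomplete]

required_missing
needs_review
```

The optional `comments` column is ignored, so its missing values do not inflate the counts.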
The selected variables should come from the protocol, CRF, or data management plan. If `primary_outcome` is required only after a follow-up window has elapsed, the check should include the due date logic. Otherwise, the script may incorrectly flag records that are not yet expected to have outcomes.
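A sketch of a timing-aware check in base R follows. The column `followup_due`, the data frame `followup`, and the fixed `cutoff_date` are assumptions for illustration; a real study would derive the due date from the protocol's visit windows.

```r
# Hypothetical data; names and dates are illustrative.
followup <- data.frame(
  participant_id  = c("P01", "P02", "P03", "P04"),
  followup_due    = as.Date(c("2024-03-01", "2024-03-15", "2024-09-01", NA)),
  primary_outcome = c("Yes", NA, NA, NA),
  stringsAsFactors = FALSE
)

# A fixed data cut date keeps the result reproducible;
# Sys.Date() could be used for a live check instead.
cutoff_date <- as.Date("2024-06-30")

# Keep participants whose follow-up is due on or before the cut date.
# Records with no due date are excluded here and would need separate review.
is_due <- !is.na(followup$followup_due) & followup$followup_due <= cutoff_date
due <- followup[is_due, ]

# Flag missing primary outcomes among those who are due.
due$outcome_missing <- is.na(due$primary_outcome)
flagged <- due[due$outcome_missing, c("participant_id", "followup_due")]
flagged
```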
Part 3
This code first keeps participants whose follow-up is due, then flags missing primary outcomes. This is more clinically meaningful than checking all records, because it respects timing.
Missing values can also be summarized by site:
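A minimal base R sketch, assuming a hypothetical data frame `outcomes` with a `site_id` column:

```r
# Hypothetical data; names are illustrative.
outcomes <- data.frame(
  site_id         = c("A", "A", "A", "B", "B"),
  primary_outcome = c("Yes", NA, "No", NA, NA),
  stringsAsFactors = FALSE
)

# Count records and missing outcomes within each site.
n_records <- tapply(outcomes$primary_outcome, outcomes$site_id, length)
n_missing <- tapply(is.na(outcomes$primary_outcome), outcomes$site_id, sum)

# Combine into one table and add the percentage missing.
by_site <- data.frame(
  site_id   = names(n_records),
  n_records = as.vector(n_records),
  n_missing = as.vector(n_missing),
  stringsAsFactors = FALSE
)
by_site$pct_missing <- round(100 * by_site$n_missing / by_site$n_records, 1)
by_site
```

Reporting the record count next to the percentage helps readers see when a high rate rests on only a handful of records.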
Site-level missingness can identify training needs or workflow problems. It should be interpreted carefully. A site with more recent enrollments may have more outcomes pending. A site with a different patient mix may have different visit completion patterns. Data managers should avoid using missingness metrics as punitive rankings without context.
Part 4
Missing data handling must be documented because it affects analysis and interpretation. Data managers should avoid replacing missing values with arbitrary values unless the study has explicitly defined such coding. For example, replacing all missing numeric values with zero is usually wrong because zero is a real value, and once the replacement is made a true zero can no longer be distinguished from a value that was never recorded. If special missing codes are used, such as `-99` for unknown, the meaning must be documented and consistently applied.
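One possible way to keep special codes documented and applied consistently is a small code list that drives the recoding. In this base R sketch, the data frame `labs`, the variable `drinks_per_week`, and the code list `missing_codes` are all hypothetical; actual codes and meanings must come from the study's own documentation.

```r
# Hypothetical data; -99 is assumed to mean "Unknown" in this example.
labs <- data.frame(
  participant_id  = c("P01", "P02", "P03", "P04"),
  drinks_per_week = c(3, -99, 0, NA),
  stringsAsFactors = FALSE
)

# Hypothetical code list, kept alongside the data as documentation.
missing_codes <- data.frame(
  code    = -99,
  meaning = "Unknown",
  stringsAsFactors = FALSE
)

# Record why a value is missing before converting the code to NA.
is_coded <- labs$drinks_per_week %in% missing_codes$code
labs$missing_reason <- NA_character_
labs$missing_reason[is_coded] <-
  missing_codes$meaning[match(labs$drinks_per_week[is_coded], missing_codes$code)]
labs$drinks_per_week[is_coded] <- NA

# The real zero for P03 is untouched; the uncoded NA for P04 has no reason recorded.
labs
```

Keeping the reason in a separate column preserves the distinction between "coded as unknown" and "blank in the export" after both have become `NA`.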