Basic Data Quality Checks in R
Overview
Basic data quality checks in R should be guided by the protocol, CRF, data dictionary, and data management plan. R is not deciding what is clinically valid; it is applying rules defined by the study team. A value is not "wrong" simply because it looks unusual. It is flagged because it violates a predefined expectation, is missing when required, is internally inconsistent, or requires human review. This distinction is important. R can generate flags, but data managers and investigators interpret them.
Common introductory checks include:
1. Missing values in required fields.
2. Duplicate participant identifiers.
3. Values outside allowable ranges.
4. Unexpected categorical values.
5. Date inconsistencies.
6. Records missing from one dataset but present in another.
7. Site-level patterns that suggest training or workflow issues.
The following examples assume that an imported enrollment dataset contains variables such as `participant_id`, `site`, `age_years`, `sex`, `consent_date`, `admission_date`, and `discharge_date`.
Missing Required Values
Missingness is one of the most common data quality issues. Some missing values are expected, depending on skip logic or clinical circumstances. Others require queries. The data manager should distinguish between structurally missing data, not applicable data, unknown data, and data that are missing because entry is incomplete.
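A minimal sketch of such a count, assuming the dplyr package and a small made-up `enrollment_data` data frame (the rows and the choice of required fields are illustrative only, not taken from any study):

```r
library(dplyr)

# Hypothetical example data; in practice this comes from the study export.
enrollment_data <- data.frame(
  participant_id = c("P001", "P002", NA, "P004"),
  site           = c("A", "A", "B", "B"),
  age_years      = c(34, NA, 51, 29),
  consent_date   = as.Date(c("2024-01-10", NA, "2024-01-12", "2024-01-15"))
)

# Count missing values in the fields treated as required in this sketch.
missing_summary <- enrollment_data %>%
  summarise(
    missing_participant_id = sum(is.na(participant_id)),
    missing_age_years      = sum(is.na(age_years)),
    missing_consent_date   = sum(is.na(consent_date))
  )

print(missing_summary)
```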
This code counts missing values in selected fields. In a real project, the list of required fields should come from the CRF completion guidelines or data dictionary. Missing participant IDs and consent dates may be high-priority issues. Missing age may require review depending on the protocol and whether date of birth is collected separately.
To create a row-level listing of missing consent dates:
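One possible version, again using dplyr and illustrative rows; the object name `missing_consent` is an assumption chosen for this example:

```r
library(dplyr)

# Hypothetical example data.
enrollment_data <- data.frame(
  participant_id = c("P001", "P002", "P003"),
  site           = c("A", "A", "B"),
  consent_date   = as.Date(c("2024-01-10", NA, NA))
)

# Keep only records with no consent date, and only the columns
# a reviewer needs to locate the record.
missing_consent <- enrollment_data %>%
  filter(is.na(consent_date)) %>%
  select(participant_id, site, consent_date)

print(missing_consent)
```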
This listing can become the basis for a query report. The data manager may add a query text column:
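A sketch of that step follows. The column name `query_text` and the query wording are placeholders; actual wording should follow the study's query conventions.

```r
library(dplyr)

# Hypothetical example data.
enrollment_data <- data.frame(
  participant_id = c("P001", "P002", "P003"),
  site           = c("A", "A", "B"),
  consent_date   = as.Date(c("2024-01-10", NA, NA))
)

# Build the listing and attach a draft query text to every row.
missing_consent_queries <- enrollment_data %>%
  filter(is.na(consent_date)) %>%
  select(participant_id, site) %>%
  mutate(
    query_text = "Consent date is missing. Please review the source document and enter the date."
  )

print(missing_consent_queries)
```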
The script does not send the query automatically. It creates a structured list for review. The study team should still confirm whether each query is appropriate.
Duplicate Participant Identifiers
Participant identifiers should usually be unique within a study, unless the dataset is visit-level or event-level. In a participant-level enrollment dataset, duplicates may indicate accidental double entry, import of repeated events, or misunderstanding of the dataset structure.
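A minimal sketch of a duplicate check with dplyr, using made-up rows; `n_records` is a column name chosen for this example:

```r
library(dplyr)

# Hypothetical participant-level data with one repeated ID.
enrollment_data <- data.frame(
  participant_id = c("P001", "P002", "P002", "P003"),
  site           = c("A", "A", "A", "B")
)

# Count records per participant ID and keep IDs that appear more than once.
duplicate_ids <- enrollment_data %>%
  count(participant_id, name = "n_records") %>%
  filter(n_records > 1)

print(duplicate_ids)
```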
This code counts records for each participant ID and keeps IDs appearing more than once. If duplicates are found, the data manager should inspect the records:
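For example, with the same kind of illustrative data (the object name `duplicate_records` is an assumption):

```r
library(dplyr)

# Hypothetical participant-level data with one repeated ID.
enrollment_data <- data.frame(
  participant_id = c("P001", "P002", "P002", "P003"),
  site           = c("A", "A", "A", "B"),
  age_years      = c(34, 45, 54, 29)
)

# IDs that appear more than once.
duplicate_ids <- enrollment_data %>%
  count(participant_id, name = "n_records") %>%
  filter(n_records > 1)

# Full records for those IDs, sorted so repeats sit next to each other.
duplicate_records <- enrollment_data %>%
  semi_join(duplicate_ids, by = "participant_id") %>%
  arrange(participant_id)

print(duplicate_records)
```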
The `semi_join()` function keeps rows from `enrollment_data` where the participant ID appears in `duplicate_ids`. This produces a detailed listing of duplicated records for review.
Range Checks
Range checks identify values outside plausible or allowable limits. The limits should be defined by the protocol, population, measurement units, and CRF guidance. For example, an adult study may require participants to be 18 years or older. A neonatal study would use very different age rules. Clinical context matters.
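A sketch of an age range check for an adult study. The limits of 18 and 120 and the sample rows are assumptions for illustration; real limits come from the protocol.

```r
library(dplyr)

# Hypothetical example data.
enrollment_data <- data.frame(
  participant_id = c("P001", "P002", "P003", "P004"),
  age_years      = c(17, 45, 200, NA)
)

# Flag non-missing ages outside the assumed allowable range.
# Missing ages are handled by the missingness check, not here.
age_flags <- enrollment_data %>%
  filter(!is.na(age_years), age_years < 18 | age_years > 120)

print(age_flags)
```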
This code flags ages below 18 or above 120. It does not prove the values are incorrect. It identifies values requiring review. A value of 17 might be an eligibility violation, a data entry error, or a participant enrolled under a protocol exception. A value of 200 may indicate an obvious error or a date conversion problem.
Range checks can be extended to laboratory values, vital signs, follow-up days, and other numeric fields. However, laboratory checks require careful attention to units, reference ranges, specimen type, age group, and clinical context. A simple numeric range may be useful for detecting impossible values, but clinical interpretation should involve appropriate expertise.
Unexpected Categorical Values
Categorical variables should contain only expected responses. Unexpected categories may arise from free text entry, inconsistent coding, import transformations, or changes in the data dictionary.
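A minimal sketch, assuming the data dictionary allows only "Female" and "Male" for `sex`; both the allowed values and the sample rows are hypothetical.

```r
library(dplyr)

# Hypothetical example data with inconsistent coding.
enrollment_data <- data.frame(
  participant_id = c("P001", "P002", "P003", "P004", "P005"),
  sex            = c("Female", "Male", "male", "F", NA)
)

# Expected values, stated explicitly (assumed here; take them from the data dictionary).
expected_sex <- c("Female", "Male")

# Flag records where sex is missing or not an expected value.
sex_flags <- enrollment_data %>%
  filter(is.na(sex) | !(sex %in% expected_sex))

print(sex_flags)
```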
This code flags records where `sex` is not one of the expected values or is missing. Whether missing sex should be queried depends on study requirements. The important point is that the expected values are stated explicitly in the script.
For a broader categorical review, the data manager can create frequency tables:
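One way to do this with `dplyr::count()`, shown on made-up rows. Note that `count()` reports missing values as their own group.

```r
library(dplyr)

# Hypothetical example data.
enrollment_data <- data.frame(
  site = c("A", "A", "B", "B", "B"),
  sex  = c("Female", "Female", "Male", "male", NA)
)

# Frequency of each value of sex, most common first.
sex_counts <- enrollment_data %>%
  count(sex, sort = TRUE)

# Frequency of sex within each site.
site_sex_counts <- enrollment_data %>%
  count(site, sex)

print(sex_counts)
print(site_sex_counts)
```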
The output may reveal spelling differences or unexpected coding. Controlled terminology reduces these issues at the point of data entry, but R can still help detect them in exports.
Date Consistency Checks
Date checks are among the most important clinical data quality checks. Many study events are defined by sequence: consent should occur before study procedures, admission should occur before discharge, enrollment should occur within a defined window, and follow-up visits should occur after baseline. Date errors can affect eligibility, outcomes, follow-up windows, and analysis.
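A sketch of an admission and discharge sequence check. It assumes both columns were already imported as `Date` values; the rows are illustrative.

```r
library(dplyr)

# Hypothetical example data.
enrollment_data <- data.frame(
  participant_id = c("P001", "P002", "P003"),
  admission_date = as.Date(c("2024-01-09", "2024-01-11", "2024-01-14")),
  discharge_date = as.Date(c("2024-01-15", "2024-01-10", NA))
)

# Flag records where discharge is earlier than admission.
# Rows with a missing date compare as NA and are not flagged here.
date_flags <- enrollment_data %>%
  filter(discharge_date < admission_date)

print(date_flags)
```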
This code flags records where discharge occurred before admission. Such a finding usually requires review. It may be a data entry error, a date format issue, or a misunderstanding of what the variables mean.
A consent timing check may look like this:
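The sketch below flags consent dated after admission. The direction of the comparison and the object name `consent_after_admission` are assumptions for this example.

```r
library(dplyr)

# Hypothetical example data.
enrollment_data <- data.frame(
  participant_id = c("P001", "P002", "P003"),
  consent_date   = as.Date(c("2024-01-08", "2024-01-12", NA)),
  admission_date = as.Date(c("2024-01-09", "2024-01-11", "2024-01-14"))
)

# List records where consent is dated after admission.
consent_after_admission <- enrollment_data %>%
  filter(consent_date > admission_date) %>%
  select(participant_id, consent_date, admission_date)

print(consent_after_admission)
```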
Whether this is a problem depends on the protocol. In some studies, consent may occur after admission but before study-specific procedures. In others, consent must occur before any study procedure. R can identify the pattern, but the protocol determines interpretation.
Combined Quality Summary
A useful introductory script may create a compact quality summary:
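A minimal sketch that gathers several of the checks above into one row. The column names, the age limits, and the sample data are all illustrative assumptions.

```r
library(dplyr)

# Hypothetical example data.
enrollment_data <- data.frame(
  participant_id = c("P001", "P002", "P002", NA, "P005"),
  age_years      = c(34, 17, 17, 200, NA),
  consent_date   = as.Date(c("2024-01-10", NA, NA, "2024-01-12", "2024-01-15")),
  admission_date = as.Date(c("2024-01-09", "2024-01-11", "2024-01-11", "2024-01-12", "2024-01-14")),
  discharge_date = as.Date(c("2024-01-15", "2024-01-10", "2024-01-10", "2024-01-20", NA))
)

quality_summary <- enrollment_data %>%
  summarise(
    total_records              = n(),
    missing_participant_id     = sum(is.na(participant_id)),
    missing_consent_date       = sum(is.na(consent_date)),
    # Records that repeat an earlier non-missing ID (first occurrence not counted).
    repeated_id_records        = sum(duplicated(participant_id) & !is.na(participant_id)),
    age_out_of_range           = sum(age_years < 18 | age_years > 120, na.rm = TRUE),
    discharge_before_admission = sum(discharge_date < admission_date, na.rm = TRUE)
  )

print(quality_summary)
```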
This summary provides a quick overview. It can be run after every export and saved as part of a data quality report. The values should be interpreted in context. For example, a rise in missing consent dates may indicate delayed entry, a site training issue, or a change in export settings.
Site-Level Review
Multisite studies require attention to site-level patterns. One site may have high missingness, another may have unusually few adverse events, and another may have delayed entry. R can summarize patterns by site:
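A sketch of a site-level table for one metric, missing consent dates, reported as both a count and a percentage. The sample rows and column names are assumptions.

```r
library(dplyr)

# Hypothetical example data from two sites of different size.
enrollment_data <- data.frame(
  site         = c("A", "A", "A", "A", "B", "B"),
  consent_date = as.Date(c("2024-01-10", NA, "2024-01-12", "2024-01-15",
                           NA, "2024-02-01"))
)

site_summary <- enrollment_data %>%
  group_by(site) %>%
  summarise(
    n_records           = n(),
    n_missing_consent   = sum(is.na(consent_date)),
    # Percentage of the site's records with no consent date.
    pct_missing_consent = round(100 * mean(is.na(consent_date)), 1),
    .groups = "drop"
  )

print(site_summary)
```

In this made-up example both sites have one missing consent date, but the percentage differs because the sites have different numbers of records.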
This table helps the data manager prioritize follow-up. However, site comparisons must be fair. A site with more records may naturally have more total queries. Rates or percentages may be more informative than raw counts. A site that recently began recruitment may have different patterns from a site that has been enrolling for months. Data quality metrics should support improvement, not blame.
Figure 6.4 Placeholder: Example R-generated data quality summary.
This figure should show a dashboard-style table with total records, missing required fields, duplicate IDs, date inconsistencies, and site-level summaries. It should emphasize that R can generate outputs for review, but final data management decisions remain governed by the protocol and data management plan.