9.4 Visualizing Missingness and Data Quality


Data quality visualization is particularly useful because it can turn a long query listing into a pattern. For example, a table of 600 open queries may be difficult to interpret, while a bar chart of open queries by site and category may immediately show where attention is needed. The following code summarizes missing values in selected required fields:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Count missing values in each key variable, then reshape to one row per variable
missing_by_variable <- prepared_data |>
  summarise(
    consent_date = sum(is.na(consent_date)),
    age_years = sum(is.na(age_years_derived)),
    day28_outcome = sum(is.na(day28_outcome)),
    discharge_date = sum(is.na(discharge_date))
  ) |>
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "missing_count"
  )

# Horizontal bars, ordered by missing count
missing_by_variable |>
  ggplot(aes(x = reorder(variable, missing_count), y = missing_count)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Missing Values in Key Variables",
    x = "Variable",
    y = "Missing records"
  )
```

This plot shows which variables have the most missing values. It does not determine whether the missing values are expected. For example, the day 28 outcome may be missing because follow-up is not yet due. A better plot may restrict the data to participants whose outcomes are due.

```r
# Keep only participants whose day 28 follow-up is due on or before today.
# The percentage assumes one row per participant in prepared_data.
due_outcome_missing_by_site <- prepared_data |>
  filter(day28_due_date <= Sys.Date()) |>
  group_by(site) |>
  summarise(
    outcomes_due = n_distinct(participant_id),
    outcomes_missing = sum(is.na(day28_outcome)),
    percent_missing = 100 * outcomes_missing / outcomes_due,
    .groups = "drop"
  )

due_outcome_missing_by_site |>
  ggplot(aes(x = reorder(site, percent_missing), y = percent_missing)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Missing Day 28 Outcomes Among Participants Due for Follow-Up",
    x = "Site",
    y = "Missing outcomes (%)"
  )
```

This chart is more meaningful because the denominator matches the operational question: only participants who should already have an outcome are counted. Data quality graphics should always be designed around the correct denominator.
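To make the denominator point concrete, the following is a minimal sketch on invented toy data. The object names `toy_data`, `report_date`, and `denominator_comparison` are hypothetical, the values are made up for illustration, and a fixed reporting date stands in for `Sys.Date()` so the numbers are reproducible. It assumes one row per participant.

```r
library(dplyr)

# Hypothetical toy data: six participants at two sites, one row each.
# Column names mirror those used above; all values are invented.
toy_data <- tibble(
  participant_id = 1:6,
  site = c("A", "A", "A", "B", "B", "B"),
  day28_due_date = as.Date(c(
    "2024-01-10", "2024-01-20", "2024-03-15",
    "2024-01-05", "2024-03-01", "2024-03-20"
  )),
  day28_outcome = c("alive", NA, NA, "alive", NA, NA)
)

# Fixed reporting date (an assumption made for reproducibility)
report_date <- as.Date("2024-02-01")

# Compare two denominators per site: everyone enrolled vs. only those due
denominator_comparison <- toy_data |>
  group_by(site) |>
  summarise(
    enrolled = n(),
    due = sum(day28_due_date <= report_date),
    missing_all = sum(is.na(day28_outcome)),
    missing_due = sum(is.na(day28_outcome) & day28_due_date <= report_date),
    percent_missing_all = 100 * missing_all / enrolled,
    percent_missing_due = 100 * missing_due / due,
    .groups = "drop"
  )

denominator_comparison
```

In this toy example both sites have two of three outcomes missing when every enrolled participant is counted, so they look identical. Among participants whose follow-up is due, site A is missing one of two outcomes and site B is missing none of one, which is the comparison that matters operationally.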
Query status can also be visualized:

```r
# One stacked bar per site, with segments for each query status
query_listing |>
  count(site, status, name = "queries") |>
  ggplot(aes(x = site, y = queries, fill = status)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Queries by Site and Status",
    x = "Site",
    y = "Number of queries",
    fill = "Query status"
  )
```

This stacked bar chart can show whether sites have many open, answered, or closed queries. If the chart is used for management, it should be accompanied by a definition of each query status and by the reporting date.
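One possible way to attach the status definitions and reporting date directly to the chart is a plot caption. The sketch below uses an invented `toy_queries` table; the status wording, the reporting date, and the object names `status_note`, `query_counts`, and `query_plot` are all assumptions for illustration and should be replaced with the study's own definitions.

```r
library(dplyr)
library(ggplot2)

# Hypothetical query listing; sites, statuses, and counts are invented
toy_queries <- tibble(
  site = c("A", "A", "A", "B", "B", "C"),
  status = c("Open", "Open", "Closed", "Answered", "Closed", "Open")
)

# Assumed reporting date and status definitions (illustrative wording only)
report_date <- as.Date("2024-02-01")
status_note <- paste0(
  "Data as of ", format(report_date, "%Y-%m-%d"), ". ",
  "Open: awaiting site response. ",
  "Answered: awaiting review. ",
  "Closed: resolved."
)

# Fix the status order so the legend and stacking are consistent across reports
query_counts <- toy_queries |>
  mutate(status = factor(status, levels = c("Open", "Answered", "Closed"))) |>
  count(site, status, name = "queries")

query_plot <- query_counts |>
  ggplot(aes(x = site, y = queries, fill = status)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Queries by Site and Status",
    x = "Site",
    y = "Number of queries",
    fill = "Query status",
    caption = status_note
  )

query_plot
```

Keeping the definitions in the caption means they travel with the figure when it is exported or pasted into a report, although a long set of definitions may fit better in accompanying text.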