Data Analysis in R

Common Errors in Descriptive Analysis

30-45 minutes Applied Step 5 of 8

Part 1

Descriptive analysis can look simple while hiding serious errors. The most common errors involve denominators, missing data, units of observation, and unreviewed transformations. A table can be beautifully formatted and still wrong.

One frequent error is using `n()` when `n_distinct(participant_id)` is required. `n()` counts rows, so its meaning depends on the unit of observation: if a dataset has one row per visit, `n()` counts visits; if it has one row per adverse event, `n()` counts adverse events. If the question is how many participants were affected, the script must count distinct participants.

Another frequent error is calculating percentages after filtering out missing values without reporting how many values were missing. This may be acceptable for some statistical summaries if documented, but it is risky in data management reports, where missingness is often the very issue the team needs to see.
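The difference between counting rows and counting participants, and the effect of keeping missing values visible, can be illustrated with a small sketch. The `visits` table, its column names, and its values are invented for illustration and do not come from any study dataset; only `participant_id` is named in the text above.

```r
library(dplyr)

# Hypothetical visit-level data: one row per visit, not one row per participant.
visits <- tibble(
  participant_id = c("P01", "P01", "P01", "P02", "P03", "P03"),
  visit          = c(1, 2, 3, 1, 1, 2),
  outcome        = c("yes", "no", NA, "yes", NA, "no")
)

# n() counts rows (visits here); n_distinct() counts participants.
counts <- visits %>%
  summarise(
    n_visits               = n(),
    n_participants         = n_distinct(participant_id),
    # Participants with at least one missing outcome.
    n_participants_missing = n_distinct(participant_id[is.na(outcome)])
  )

# Keep NA as its own row so missingness is reported rather than filtered away.
# The denominator is all visits, including those with a missing outcome.
outcome_summary <- visits %>%
  count(outcome, name = "n_visits") %>%
  mutate(pct_of_visits = round(100 * n_visits / sum(n_visits), 1))

print(counts)
print(outcome_summary)
```

In this toy data, `n()` returns 6 while `n_distinct(participant_id)` returns 3, so a report that used the row count as a participant count would double the true number. Whether the denominator should include missing outcomes depends on the study definitions; the point of the sketch is only that the missing count stays visible either way.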

Part 2

Incorrect joins can also distort summaries. If participant-level enrollment data are joined to laboratory data with multiple rows per participant, the result has one row per laboratory record rather than one row per participant. Age, sex, and treatment arm are then repeated on every laboratory row, and summaries that count rows will overcount participants. Row counts before and after joins should be checked. An increase in rows is not always wrong, but it must be expected. If the intended output is participant-level, an unexpected increase indicates a problem.

Rounding can also mislead. Percentages may not add exactly to 100 percent after rounding. Small denominators can make percentages look dramatic. For example, one missing outcome among two participants is 50 percent, but the absolute count is one. Reports should show counts with percentages, not percentages alone.
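A row-count check around a join, followed by a count-with-percentage column, might look like the sketch below. The `enrollment` and `labs` tables, their columns, and their values are hypothetical and exist only to make the example runnable.

```r
library(dplyr)

# Hypothetical participant-level table: one row per participant.
enrollment <- tibble(
  participant_id = c("P01", "P02", "P03"),
  arm            = c("A", "B", "A")
)

# Hypothetical laboratory table: several rows per participant.
labs <- tibble(
  participant_id = c("P01", "P01", "P02", "P03", "P03", "P03"),
  lab_value      = c(5.1, 5.4, 6.0, 4.8, 4.9, 5.2)
)

# Record row counts on both sides of the join and report them.
rows_before <- nrow(enrollment)
joined      <- left_join(enrollment, labs, by = "participant_id")
rows_after  <- nrow(joined)
message("Rows before join: ", rows_before, "; after join: ", rows_after)

# After the join, n() counts laboratory rows; n_distinct() still counts participants.
by_arm <- joined %>%
  group_by(arm) %>%
  summarise(
    n_rows         = n(),
    n_participants = n_distinct(participant_id),
    .groups        = "drop"
  ) %>%
  mutate(
    pct   = round(100 * n_participants / sum(n_participants), 1),
    # Show the count and denominator next to the percentage.
    n_pct = sprintf("%d/%d (%.1f%%)", n_participants, sum(n_participants), pct)
  )

# Rounded percentages need not sum to 100: three equal thirds round to 33 each.
rounded_thirds <- round(100 * c(1, 1, 1) / 3)

print(by_arm)
print(sum(rounded_thirds))
```

Here the join grows the data from 3 rows to 6, and arm A shows 5 rows for only 2 participants. If a participant-level result is intended, one option is to stop the script when the count changes, for example with `stopifnot(rows_after == rows_before)`. Recent dplyr releases (1.1.0 and later) also accept a `relationship` argument in `left_join()` that can raise an error when the match multiplicity is not the one expected.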

Part 3

The safest approach is to make descriptive analysis scripts explicit, small enough to review, and connected to the study definitions. Every table should be traceable to the dataset and code that produced it.