8.10 Common Errors in Descriptive Analysis
Descriptive analysis can look simple while hiding serious errors. The most common errors involve denominators, missing data, units of observation, and unreviewed transformations. A table can be beautifully formatted and still wrong.
One frequent error is using `n()` when `n_distinct(participant_id)` is required. If a dataset has one row per visit, `n()` counts visits. If a dataset has one row per adverse event, `n()` counts adverse events. If the question is how many participants were affected, the script must count distinct participants.
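The difference can be seen in a minimal sketch. The `visits` data frame and its values below are hypothetical, made up only to illustrate a one-row-per-visit layout.

```r
library(dplyr)
library(tibble)

# Hypothetical visit-level data: one row per visit, not per participant
visits <- tibble(
  participant_id = c("P01", "P01", "P01", "P02", "P03", "P03"),
  visit_number   = c(1, 2, 3, 1, 1, 2)
)

visit_summary <- visits |>
  summarise(
    n_rows         = n(),                        # counts rows, here visits
    n_participants = n_distinct(participant_id)  # counts unique participants
  )

visit_summary
```

Here `n_rows` is 6 and `n_participants` is 3, so using `n()` as the participant count would double it.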
Another frequent error is calculating percentages after filtering out missing values without reporting how many values were missing. This may be acceptable for some statistical summaries if documented, but it is risky in data management reports. Missingness is often the very issue the team needs to see.
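One way to keep missingness visible is to report the missing count next to both possible denominators. The sketch below uses a hypothetical `outcomes` data frame with invented values; the column names are assumptions for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical outcome data with two missing values
outcomes <- tibble(
  participant_id = c("P01", "P02", "P03", "P04", "P05"),
  outcome        = c("Yes", "No", NA, "Yes", NA)
)

outcome_summary <- outcomes |>
  summarise(
    n_total          = n(),
    n_missing        = sum(is.na(outcome)),
    n_observed       = sum(!is.na(outcome)),
    n_yes            = sum(outcome == "Yes", na.rm = TRUE),
    pct_yes_observed = 100 * n_yes / n_observed,  # denominator excludes missing
    pct_yes_total    = 100 * n_yes / n_total      # denominator includes missing
  )

outcome_summary
```

The two percentages differ because two of five outcomes are missing. Showing `n_missing` alongside them lets the reader see which denominator was used.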
Incorrect joins can also distort summaries. If participant-level enrollment data are joined to laboratory data that has several rows per participant, each enrollment row is repeated once for every matching laboratory row. Age, sex, and treatment arm then appear duplicated, and summaries may overcount participants. Row counts before and after joins should be checked.
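```r
library(dplyr)
library(tibble)

# Record the row count before the join
n_before_join <- nrow(enrollment_prepared)

joined_data <- enrollment_prepared |>
  left_join(lab_data, by = "participant_id")

# Record the row count after the join
n_after_join <- nrow(joined_data)

# Summarise the change so an unexpected expansion is visible
tibble(
  step = "Join enrollment to laboratory data",
  rows_before = n_before_join,
  rows_after = n_after_join,
  row_difference = n_after_join - n_before_join
)
```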
An increase in rows is not always wrong, but it must be expected. If the intended output is participant-level, an unexpected increase indicates a problem.
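A complementary check, sketched here under assumptions, is to look for repeated keys in the table being joined before the join is run. The `lab_data_example` data frame is hypothetical and stands in for whatever laboratory dataset the study uses.

```r
library(dplyr)
library(tibble)

# Hypothetical laboratory data: P01 has two results
lab_data_example <- tibble(
  participant_id = c("P01", "P01", "P02"),
  lab_value      = c(5.1, 5.4, 4.8)
)

# Participants with more than one laboratory row
repeated_ids <- lab_data_example |>
  count(participant_id, name = "n_rows") |>
  filter(n_rows > 1)

repeated_ids
```

If `repeated_ids` has any rows, a join on `participant_id` will repeat the matching enrollment rows, and the output will no longer be participant-level unless the laboratory data are first reduced to one row per participant.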
Rounding can also mislead. Percentages may not add exactly to 100 percent after rounding. Small denominators can make percentages look dramatic. For example, one missing outcome among two participants is 50 percent, but the absolute count is one. Reports should show counts with percentages, not percentages alone.
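One possible way to keep counts and percentages together is to build a single display column. This is a sketch only; the `outcome_status` data frame and the display format are assumptions chosen to mirror the one-of-two example.

```r
library(dplyr)
library(tibble)

# Hypothetical data: two participants, one with a missing outcome
outcome_status <- tibble(
  participant_id = c("P01", "P02"),
  status         = c("Missing", "Recorded")
)

status_table <- outcome_status |>
  count(status, name = "n") |>
  mutate(
    pct     = round(100 * n / sum(n), 1),
    # Show the count and the denominator next to the percentage
    display = paste0(n, "/", sum(n), " (", pct, "%)")
  )

status_table
```

Each row of `display` reads `1/2 (50%)`, which makes the small denominator visible next to the percentage.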
| Error | Example | Safeguard |
|---|---|---|
| Wrong denominator | Counting visits instead of participants | Define unit of observation before summarizing |
| Hidden missingness | Percentages exclude missing values silently | Report missing counts or display missing category |
| Duplicate rows after join | Lab join multiplies participant rows | Check row counts before and after joins |
| Unexpected categories | `Female`, `female`, `F` counted separately | Review categorical values before reporting |
| Unit inconsistency | Weight in kg and grams mixed | Confirm units and ranges |
| Overinterpreting small samples | Reporting 50 percent from 1 of 2 records | Show counts and provide context |
| Manual edits to outputs | Spreadsheet modified after export | Regenerate outputs from scripts |
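
Two of the safeguards above, reviewing categorical values and confirming units and ranges, can be sketched as follows. The `demographics` data frame is hypothetical, and the plausibility limits of 2 and 300 kg are assumptions for illustration, not study definitions.

```r
library(dplyr)
library(tibble)

# Hypothetical data with inconsistent sex labels and one weight entered in grams
demographics <- tibble(
  participant_id = c("P01", "P02", "P03", "P04"),
  sex            = c("Female", "female", "F", "Male"),
  weight_kg      = c(62.5, 71.0, 58000, 80.2)
)

# Review the raw category values before reporting
sex_values <- demographics |>
  count(sex, name = "n")

# Flag weights outside an assumed plausible range (2 to 300 kg)
weight_flags <- demographics |>
  filter(weight_kg < 2 | weight_kg > 300)

sex_values
weight_flags
```

`sex_values` lists four distinct labels where a reader would expect two, and `weight_flags` isolates the record whose weight was probably entered in grams.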
The safest approach is to make descriptive analysis scripts explicit, small enough to review, and connected to the study definitions. Every table should be traceable to the dataset and code that produced it.