Data Analysis in R

8.10 Common Errors in Descriptive Analysis

30-45 minutes Applied Step 3 of 8

Reading 1

Descriptive analysis can look simple while hiding serious errors. The most common errors involve denominators, missing data, units of observation, and unreviewed transformations. A table can be beautifully formatted and still wrong.

One frequent error is using `n()` when `n_distinct(participant_id)` is required. If a dataset has one row per visit, `n()` counts visits. If a dataset has one row per adverse event, `n()` counts adverse events. If the question is how many participants were affected, the script must count distinct participants.

Another frequent error is calculating percentages after filtering out missing values without reporting how many values were missing. This may be acceptable for some statistical summaries if documented, but it is risky in data management reports. Missingness is often the very issue the team needs to see.

Incorrect joins can also distort summaries. If enrollment data are joined to laboratory data with multiple rows per participant, the joined dataset has one row per laboratory record instead of one row per participant. Age, sex, and treatment arm are then repeated on every laboratory row, and summaries may overcount participants. Row counts before and after joins should be checked.

```r
library(dplyr)

# Record the participant-level row count before the join
n_before_join <- nrow(enrollment_prepared)

# lab_data may hold several rows per participant, so a left join can add rows
joined_data <- enrollment_prepared |>
  left_join(lab_data, by = "participant_id")

n_after_join <- nrow(joined_data)

# Report both counts so an unexpected increase is visible
tibble(
  step = "Join enrollment to laboratory data",
  rows_before = n_before_join,
  rows_after = n_after_join,
  row_difference = n_after_join - n_before_join
)
```

An increase in rows is not always wrong, but it must be expected. If the intended output is participant-level, an unexpected increase indicates a problem.

Rounding can also mislead. Percentages may not add exactly to 100 percent after rounding. Small denominators can make percentages look dramatic. For example, one missing outcome among two participants is 50 percent, but the absolute count is one. Reports should show counts with percentages, not percentages alone.
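The difference between counting rows and counting participants, described above, can be seen in a small sketch. The `adverse_events` data frame and its values are invented for illustration and are not part of the study data. The example assumes one row per adverse event.

```r
library(dplyr)

# Hypothetical adverse-event data: one row per event, not per participant
adverse_events <- tibble(
  participant_id = c("P01", "P01", "P02", "P03", "P03", "P03"),
  event_term     = c("Headache", "Nausea", "Headache",
                     "Rash", "Nausea", "Headache")
)

event_summary <- adverse_events |>
  summarise(
    n_events       = n(),                        # counts rows, i.e. events
    n_participants = n_distinct(participant_id)  # counts affected participants
  )

event_summary
```

In this sketch the two columns differ because several participants have more than one event. Reporting `n_events` as the number of affected participants would overstate the count.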

| Error | Example | Safeguard |
| --- | --- | --- |
| Wrong denominator | Counting visits instead of participants | Define unit of observation before summarizing |
| Hidden missingness | Percentages exclude missing values silently | Report missing counts or display missing category |
| Duplicate rows after join | Lab join multiplies participant rows | Check row counts before and after joins |
| Unexpected categories | `Female`, `female`, `F` counted separately | Review categorical values before reporting |
| Unit inconsistency | Weight in kg and grams mixed | Confirm units and ranges |
| Overinterpreting small samples | Reporting 50 percent from 1 of 2 records | Show counts and provide context |
| Manual edits to outputs | Spreadsheet modified after export | Regenerate outputs from scripts |
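Two of the safeguards in the table, reviewing categorical values and keeping missing values visible, might be combined as in the following sketch. The `enrollment_example` data, the recoding rules, and the `Unreviewed` fallback label are assumptions made for illustration. A real study would take the mapping from its data dictionary.

```r
library(dplyr)

# Hypothetical enrollment data with inconsistent coding and one missing value
enrollment_example <- tibble(
  participant_id = c("P01", "P02", "P03", "P04", "P05"),
  sex            = c("Female", "female", "F", "Male", NA)
)

# Step 1: review the raw values, including NA, before recoding anything
enrollment_example |> count(sex)

# Step 2: recode to agreed categories and keep missing values as a category
sex_summary <- enrollment_example |>
  mutate(
    sex_clean = case_when(
      tolower(sex) %in% c("female", "f") ~ "Female",
      tolower(sex) %in% c("male", "m")   ~ "Male",
      is.na(sex)                         ~ "Missing",
      TRUE                               ~ "Unreviewed"  # flags unmapped values
    )
  ) |>
  count(sex_clean, name = "n") |>
  mutate(percent = round(100 * n / sum(n), 1))  # counts shown with percentages

sex_summary
```

Because the denominator is the full set of rows, the `Missing` category appears in the table with its own count instead of silently dropping out of the percentages.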

The safest approach is to make descriptive analysis scripts explicit, small enough to review, and connected to the study definitions. Every table should be traceable to the dataset and code that produced it.