8.10 Common Errors in Descriptive Analysis

Descriptive analysis can look simple while hiding serious errors. The most common errors involve denominators, missing data, units of observation, and unreviewed transformations. A table can be beautifully formatted and still wrong.

One frequent error is using `n()` when `n_distinct(participant_id)` is required. If a dataset has one row per visit, `n()` counts visits. If a dataset has one row per adverse event, `n()` counts adverse events. If the question is how many participants were affected, the script must count distinct participants.

Another frequent error is calculating percentages after filtering out missing values without reporting how many values were missing. This may be acceptable for some statistical summaries if documented, but it is risky in data management reports. Missingness is often the very issue the team needs to see.

Incorrect joins can also distort summaries. If enrollment data are joined to laboratory data with multiple rows per participant, each participant's enrollment row is repeated once per matching laboratory row, so the participant-level dataset expands. Age, sex, and treatment arm then appear duplicated, and summaries may overcount participants. Row counts before and after joins should be checked.

```r
library(dplyr)

# Record the number of rows before the join
n_before_join <- nrow(enrollment_prepared)

# Join enrollment data to laboratory data by participant
joined_data <- enrollment_prepared |>
  left_join(lab_data, by = "participant_id")

# Record the number of rows after the join
n_after_join <- nrow(joined_data)

# Summarize the change in row count for review
tibble(
  step = "Join enrollment to laboratory data",
  rows_before = n_before_join,
  rows_after = n_after_join,
  row_difference = n_after_join - n_before_join
)
```

An increase in rows is not always wrong, but it must be expected. If the intended output is participant-level, an unexpected increase indicates a problem.

Rounding can also mislead. Percentages may not add exactly to 100 percent after rounding. Small denominators can make percentages look dramatic. For example, one missing outcome among two participants is 50 percent, but the absolute count is one. Reports should show counts with percentages, not percentages alone.
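The denominator and missingness points above can be illustrated with a minimal sketch. The `visits` data, its values, and the column names `outcome`, `n_visits`, and `percent` are hypothetical and used only for illustration; the sketch assumes the dplyr package and R 4.1 or later for the native pipe.

```r
library(dplyr)

# Hypothetical visit-level data: one row per visit, not per participant.
visits <- tibble(
  participant_id = c("P01", "P01", "P02", "P03", "P03", "P03"),
  outcome        = c("yes", "yes", NA, "no", "no", "yes")
)

# n() counts rows (here, visits); n_distinct() counts participants.
unit_counts <- visits |>
  summarise(
    n_visits       = n(),
    n_participants = n_distinct(participant_id)
  )

# Keep missing values as a visible category, and report counts
# alongside percentages rather than percentages alone.
outcome_summary <- visits |>
  mutate(outcome = if_else(is.na(outcome), "Missing", outcome)) |>
  count(outcome, name = "n_visits") |>
  mutate(percent = round(100 * n_visits / sum(n_visits), 1))

unit_counts
outcome_summary
```

Here the two counts in `unit_counts` differ because participants contribute more than one visit, and the `Missing` row stays in the denominator of `outcome_summary` instead of being dropped silently. Whether missing values belong in the denominator depends on the study definitions; the sketch only makes the choice visible.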
| Error | Example | Safeguard |
| --- | --- | --- |
| Wrong denominator | Counting visits instead of participants | Define unit of observation before summarizing |
| Hidden missingness | Percentages exclude missing values silently | Report missing counts or display missing category |
| Duplicate rows after join | Lab join multiplies participant rows | Check row counts before and after joins |
| Unexpected categories | `Female`, `female`, `F` counted separately | Review categorical values before reporting |
| Unit inconsistency | Weight in kg and grams mixed | Confirm units and ranges |
| Overinterpreting small samples | Reporting 50 percent from 1 of 2 records | Show counts and provide context |
| Manual edits to outputs | Spreadsheet modified after export | Regenerate outputs from scripts |
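Two of the safeguards in the table, reviewing categorical values and confirming units and ranges, might be sketched as follows. The `enrollment` data, the `weight_kg` column, and the 30 to 250 kg plausibility limits are assumptions made for illustration, not study definitions; real limits should come from the protocol or data dictionary.

```r
library(dplyr)

# Hypothetical enrollment data with inconsistent coding and a likely unit error.
enrollment <- tibble(
  participant_id = c("P01", "P02", "P03", "P04"),
  sex            = c("Female", "female", "F", "Male"),
  weight_kg      = c(62.5, 71.0, 58000, 80.2)  # 58000 looks like grams
)

# Review the raw categorical values before any recoding or reporting.
sex_values <- enrollment |>
  count(sex, name = "n_participants")

# Flag weights outside an assumed plausible range in kilograms.
weight_flags <- enrollment |>
  filter(weight_kg < 30 | weight_kg > 250) |>
  select(participant_id, weight_kg)

sex_values
weight_flags
```

Listing the raw values first shows that `Female`, `female`, and `F` would be reported as separate categories unless they are harmonized, and the range check surfaces the record that was probably entered in grams. Flagged records should be queried rather than converted silently.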
The safest approach is to make descriptive analysis scripts explicit, small enough to review, and connected to the study definitions. Every table should be traceable to the dataset and code that produced it.