Data Analysis in R

Cross-Tabulations and Proportions

30-45 minutes Applied Step 5 of 12

Part 1

Cross-tabulation summarizes the relationship between two categorical variables. In clinical research data management, cross-tabulations are useful for comparing outcome status by site, visit completion by visit, adverse event severity by relatedness, query status by site, or treatment arm by sex. A cross-tabulation is not automatically a statistical test; it is first a structured descriptive table. The simplest cross-tabulation passes two variables to `tabyl()` from the janitor package, for example site and day 28 status, which produces counts of day 28 status within each site. Row percentages can then be added to the same table.
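As a minimal sketch of the site by day 28 status table described above, the code below builds the counts and then adds row percentages. The data frame `participants` and its columns `site` and `day28_status` are hypothetical and exist only to make the example runnable; the course dataset may use different names.

```r
library(dplyr)    # provides %>%
library(janitor)  # provides tabyl() and the adorn_*() helpers

# Hypothetical participant-level data, for illustration only
participants <- data.frame(
  site = c("A", "A", "A", "B", "B", "C", "C", "C"),
  day28_status = c("Alive", "Alive", "Died", "Alive", "Died",
                   "Alive", "Alive", "Alive")
)

# Counts of day 28 status within each site (one row per site)
counts <- participants %>%
  tabyl(site, day28_status)

# Row proportions: each row sums to 1 across the status columns
row_props <- counts %>%
  adorn_percentages("row")

# Formatted version for reading: percentages with the counts alongside
row_pct <- counts %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()

print(counts)
print(row_pct)
```

Keeping the unformatted `row_props` separate from the formatted `row_pct` is a design choice: the formatted table holds character strings, so any further calculation should start from the numeric version.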

Part 2

Row percentages answer the question: within each site, what percentage of participants fall into each day 28 status category? Column percentages answer a different question: within each day 28 status category, what percentage come from each site? The correct choice depends on the reporting question. A cross-tabulation of treatment arm by sex may be useful for a baseline summary. However, the data manager should be careful not to overinterpret small differences. In randomized trials, baseline tables describe the sample; they are not usually intended to drive post-randomization decisions unless predefined procedures require it.
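The treatment arm by sex table, and the difference between row and column percentages, might be sketched as follows. The data frame `baseline` and its columns `arm` and `sex` are assumed names with made-up values, not taken from the course data.

```r
library(dplyr)
library(janitor)

# Hypothetical baseline data, for illustration only
baseline <- data.frame(
  arm = c("Active", "Active", "Active",
          "Placebo", "Placebo", "Placebo", "Placebo"),
  sex = c("F", "M", "F", "F", "M", "M", "F")
)

# Counts of sex within each treatment arm
arm_by_sex <- baseline %>%
  tabyl(arm, sex)

# Row proportions: within each arm, the share of each sex
arm_row_props <- arm_by_sex %>%
  adorn_percentages("row")

# Column proportions: within each sex, the share in each arm
arm_col_props <- arm_by_sex %>%
  adorn_percentages("col")

print(arm_by_sex)
print(arm_row_props)
print(arm_col_props)
```

For a baseline summary the row version is usually the one reported, because the arm is the denominator of interest; the column version is shown here only to make the contrast concrete.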

Part 3

Cross-tabulations can also reveal data quality problems, for example when visit status is tabulated by site. If one site has many `Pending` visits while other sites have few, the issue may be delayed data entry, true workflow differences, or misunderstanding of visit status coding. The table points to a question; it does not answer the question by itself. For more controlled reporting, cross-tabulations can be built using `count()` and `group_by()`, which return one row per combination of categories.
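A possible version of the `count()` and `group_by()` approach is sketched below for visit status by site. The data frame `visits` and the columns `site` and `visit_status` are hypothetical, and the values are invented so that one site stands out.

```r
library(dplyr)

# Hypothetical visit-level data, for illustration only
visits <- data.frame(
  site = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C"),
  visit_status = c("Completed", "Completed", "Completed", "Pending",
                   "Completed", "Pending", "Pending", "Pending",
                   "Completed", "Completed")
)

# One row per site and status, with the percentage computed within site
visit_long <- visits %>%
  count(site, visit_status, name = "n") %>%
  group_by(site) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  ungroup()

print(visit_long)
```

Note that `count()` only returns combinations that occur in the data, so a site with no `Pending` visits has no `Pending` row rather than a row with zero. Whether that matters depends on how the table is used afterwards.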

Part 4

A long-format table of this kind, with one row per combination of categories, is often easier to export, join, plot, or use in dashboards than a wide cross-tabulation. The appropriate format depends on the next step. The denominator should be shown or explained in any table intended for decision-making. Many misunderstandings in clinical reporting arise not from complex statistics, but from unclear denominators.
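One way to keep the denominator visible is to carry it as its own column and print it in the cell label. The sketch below does this for query status by site; the data frame `queries`, its columns, and the `n/N (pct%)` label format are assumptions chosen for illustration, not a required reporting convention.

```r
library(dplyr)

# Hypothetical query-level data, for illustration only
queries <- data.frame(
  site = c("A", "A", "A", "A", "B", "B", "B", "B"),
  query_status = c("Closed", "Closed", "Closed", "Open",
                   "Closed", "Closed", "Open", "Open")
)

# Long-format summary that states the denominator explicitly
query_summary <- queries %>%
  count(site, query_status, name = "n") %>%
  group_by(site) %>%
  mutate(
    site_total = sum(n),                 # denominator: all queries at the site
    pct = 100 * n / site_total,
    label = sprintf("%d/%d (%.1f%%)", n, site_total, pct)
  ) %>%
  ungroup()

print(query_summary)
```

Here the denominator is all queries at a site. If the reporting question calls for a different denominator, such as only queries older than a given age, the filter should be applied before `count()` and stated alongside the table.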