Cross-Tabulations and Proportions
8.6 Cross-Tabulations and Proportions
Cross-tabulation summarizes the relationship between two categorical variables. In clinical research data management, cross-tabulations are useful for comparing outcome status by site, visit completion by scheduled visit, adverse event severity by relatedness, query status by site, or treatment arm by sex. A cross-tabulation is not automatically a statistical test; it is first a structured descriptive table.
The simplest cross-tabulation uses `tabyl()` from the janitor package:
```r
prepared_data |>
  tabyl(site, day28_status)
```
This produces counts of day 28 status within each site. To add row percentages:
```r
prepared_data |>
  tabyl(site, day28_status) |>
  adorn_percentages("row") |>
  adorn_pct_formatting(digits = 1)
```
Row percentages answer the question: within each site, what percentage of participants fall into each day 28 status category? Column percentages answer a different question: within each day 28 status category, what percentage come from each site? The correct choice depends on the reporting question.
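To make the contrast concrete, the following sketch computes column percentages on a small invented data set. The `toy_data` data frame and its values are hypothetical stand-ins for `prepared_data`; only the column names `site` and `day28_status` are taken from the examples above.

```r
library(janitor)

# Hypothetical toy data standing in for prepared_data
toy_data <- data.frame(
  site = c("A", "A", "A", "B", "B", "C"),
  day28_status = c("Complete", "Complete", "Missing",
                   "Complete", "Missing", "Missing")
)

# Column percentages: within each day 28 status, the share from each site
col_pct <- toy_data |>
  tabyl(site, day28_status) |>
  adorn_percentages("col")

# Format for display and append the underlying counts in parentheses
col_pct |>
  adorn_pct_formatting(digits = 1) |>
  adorn_ns()
```

Each status column sums to 100% here, whereas with `adorn_percentages("row")` each site row would. Appending counts with `adorn_ns()` keeps the denominators visible to the reader.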
The following example creates a cross-tabulation of treatment arm by sex:
```r
prepared_data |>
  tabyl(treatment_arm, sex_label) |>
  adorn_totals("row") |>
  adorn_percentages("row") |>
  adorn_pct_formatting(digits = 1)
```
This may be useful for a baseline summary. However, the data manager should be careful not to overinterpret small differences. In randomized trials, baseline tables describe the sample; they are not usually intended to drive post-randomization decisions unless predefined procedures require it.
Cross-tabulations can also reveal data quality problems:
```r
prepared_data |>
  tabyl(site, visit_status) |>
  adorn_totals(c("row", "col"))
```
If one site has many `Pending` visits while other sites have few, the issue may be delayed data entry, true workflow differences, or misunderstanding of visit status coding. The table points to a question; it does not answer the question by itself.
For more controlled reporting, cross-tabulations can be built using the dplyr functions `count()` and `group_by()`:
```r
# One row per site and visit status, with the site total as denominator
visit_status_by_site <- prepared_data |>
  count(site, visit_status, name = "n") |>
  group_by(site) |>
  mutate(
    site_total = sum(n),
    percent = round(100 * n / site_total, 1)
  ) |>
  ungroup()

visit_status_by_site
```
This long-format table is often easier to export, join, plot, or use in dashboards than a wide cross-tabulation. The appropriate format depends on the next step.
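When a wide layout is needed after all, the long table can be reshaped. The sketch below assumes the tidyr package is available and uses an invented `toy_visits` data frame (hypothetical values) in place of `prepared_data`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for prepared_data
toy_visits <- data.frame(
  site = c("A", "A", "A", "B", "B", "B", "B"),
  visit_status = c("Completed", "Completed", "Pending",
                   "Completed", "Pending", "Pending", "Pending")
)

# Long format: one row per site and visit status
long_table <- toy_visits |>
  count(site, visit_status, name = "n")

# Wide format: one row per site, one column per visit status
wide_table <- long_table |>
  pivot_wider(
    names_from = visit_status,
    values_from = n,
    values_fill = 0  # site/status combinations with no rows become 0
  )

wide_table
```

Keeping the long table as the working object and reshaping only at the reporting step means the same counts can feed both plots and printed tables.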
| Percentage type | Denominator | Example question |
|---|---|---|
| Row percentage | Row total | Within each site, what proportion of participants completed day 28? |
| Column percentage | Column total | Among participants with missing day 28 outcomes, which sites do they come from? |
| Overall percentage | Grand total | What proportion of all participants have completed day 28? |
| Percentage among those due | Participants expected to have data | Among participants whose day 28 visit is due, what proportion is missing? |
The denominator should be shown or explained in any table intended for decision-making. Many misunderstandings in clinical reporting arise not from complex statistics, but from unclear denominators.
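As an illustration of an explicit "due" denominator, the sketch below restricts the calculation to participants whose day 28 visit is due and reports the denominator next to the percentage. The `toy_due` data frame and the logical column `day28_due` are hypothetical; how "due" is derived would depend on the study's visit schedule.

```r
library(dplyr)

# Hypothetical toy data; day28_due is an assumed flag, not a column from the text
toy_due <- data.frame(
  site = c("A", "A", "A", "B", "B", "B"),
  day28_due = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
  day28_status = c("Complete", "Missing", "Missing",
                   "Complete", "Complete", "Missing")
)

# Percentage missing among participants whose day 28 visit is due
missing_among_due <- toy_due |>
  filter(day28_due) |>
  group_by(site) |>
  summarise(
    n_due = n(),                                # the denominator, shown explicitly
    n_missing = sum(day28_status == "Missing"),
    pct_missing = round(100 * n_missing / n_due, 1),
    .groups = "drop"
  )

missing_among_due
```

In this toy data, site A has one participant with a missing outcome whose visit is not yet due; that record is excluded from the denominator, which is why the result differs from a simple row percentage.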