Data Analysis in R

8.6 Cross-Tabulations and Proportions

30-45 minutes Applied Step 3 of 12

Cross-tabulation summarizes the relationship between two categorical variables. In clinical research data management, cross-tabulations are useful for comparing outcome status by site, visit completion by scheduled visit, adverse event severity by relatedness, query status by site, or treatment arm by sex. A cross-tabulation is not automatically a statistical test; it is first a structured descriptive table.

The simplest cross-tabulation uses `tabyl()` from the janitor package:

```r
# Counts of day 28 status (columns) within each site (rows)
prepared_data |>
  tabyl(site, day28_status)
```

This produces counts of day 28 status within each site. To add row percentages:

```r
prepared_data |>
  tabyl(site, day28_status) |>
  adorn_percentages("row") |>       # divide each count by its row total
  adorn_pct_formatting(digits = 1)  # format as percentages with one decimal
```

Row percentages answer the question: within each site, what percentage of participants fall into each day 28 status category? Column percentages answer a different question: within each day 28 status category, what percentage come from each site? The correct choice depends on the reporting question.

The following example creates a cross-tabulation of treatment arm by sex:

```r
prepared_data |>
  tabyl(treatment_arm, sex_label) |>
  adorn_totals("row") |>            # the Total row gives the sex distribution across all arms
  adorn_percentages("row") |>
  adorn_pct_formatting(digits = 1)
```

This may be useful for a baseline summary. However, the data manager should be careful not to overinterpret small differences. In randomized trials, baseline tables describe the sample; they are not usually intended to drive post-randomization decisions unless predefined procedures require it.

Cross-tabulations can also reveal data quality problems:

```r
prepared_data |>
  tabyl(site, visit_status) |>
  adorn_totals(c("row", "col"))     # add both a Total row and a Total column
```

If one site has many `Pending` visits while other sites have few, the issue may be delayed data entry, true workflow differences, or misunderstanding of visit status coding. The table points to a question; it does not answer the question by itself.

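
The difference between row and column percentages described above is easier to see on a small table. The sketch below uses an invented data frame, `toy_data`, in place of `prepared_data`; the site labels and counts are hypothetical and chosen only for illustration. It assumes the janitor package is installed.

```r
library(janitor)

# Hypothetical toy data: site A has 8 participants, site B has 2.
toy_data <- data.frame(
  site = rep(c("A", "B"), times = c(8, 2)),
  day28_status = c(rep("Completed", 6), rep("Missing", 2),
                   "Completed", "Missing")
)

# Row percentages: the denominator is each site's total.
row_pct <- toy_data |>
  tabyl(site, day28_status) |>
  adorn_percentages("row")

# Column percentages: the denominator is each status category's total.
col_pct <- toy_data |>
  tabyl(site, day28_status) |>
  adorn_percentages("col")

row_pct
col_pct
```

In this toy table, site B has the higher missing rate by row percentage (50% versus 25%), yet site A contributes two of the three missing outcomes by column percentage. Both statements are true; they answer different questions.
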
For more controlled reporting, cross-tabulations can be built using `count()` and `group_by()`:

```r
visit_status_by_site <- prepared_data |>
  count(site, visit_status, name = "n") |>
  group_by(site) |>
  mutate(
    site_total = sum(n),
    percent = round(100 * n / site_total, 1)
  ) |>
  ungroup()

visit_status_by_site
```

This long-format table is often easier to export, join, plot, or use in dashboards than a wide cross-tabulation. The appropriate format depends on the next step.
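
As a rough illustration of how the next step can drive the format, the sketch below filters a long table to flag sites and then reshapes it to wide for a one-row-per-site report. The data frame `toy_visits`, the 50% cutoff, and the object names are assumptions made for this example, not part of any standard. It assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for prepared_data.
toy_visits <- data.frame(
  site = c("A", "A", "A", "B", "B", "B", "B"),
  visit_status = c("Completed", "Completed", "Pending",
                   "Completed", "Pending", "Pending", "Pending")
)

# Long format: one row per site-by-status combination.
status_long <- toy_visits |>
  count(site, visit_status, name = "n") |>
  group_by(site) |>
  mutate(percent = round(100 * n / sum(n), 1)) |>
  ungroup()

# Long format makes filtering simple: flag sites where Pending exceeds
# an illustrative 50% threshold (the cutoff is an assumption).
flagged_sites <- status_long |>
  filter(visit_status == "Pending", percent > 50)

# Wide format: one row per site, one column per status, for a report table.
status_wide <- status_long |>
  select(site, visit_status, n) |>
  pivot_wider(names_from = visit_status, values_from = n, values_fill = 0)

flagged_sites
status_wide
```

A flagged site is a prompt for follow-up with the site, not a conclusion about its performance.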

| Percentage type | Denominator | Example question |
| --- | --- | --- |
| Row percentage | Row total | Within each site, what proportion of participants completed day 28? |
| Column percentage | Column total | Among participants with missing day 28 outcomes, which sites do they come from? |
| Overall percentage | Grand total | What proportion of all participants have completed day 28? |
| Due denominator | Participants expected to have data | Among participants whose day 28 visit is due, what proportion is missing? |

The denominator should be shown or explained in any table intended for decision-making. Many misunderstandings in clinical reporting arise not from complex statistics, but from unclear denominators.
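
The effect of the denominator can be sketched with a small example that reports the same missing count against two denominators. The data frame `toy_participants` and the column `day28_due` are hypothetical; a real dataset would need its own rule for deciding when a visit is due. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: 8 enrolled participants, 5 of whom have reached
# the day 28 visit window (day28_due is an assumed column name).
toy_participants <- data.frame(
  participant_id = 1:8,
  day28_due = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  day28_status = c("Completed", "Completed", "Completed", "Missing",
                   "Missing", "Missing", "Missing", "Missing")
)

# Report the counts alongside the percentages so the denominator is visible.
missing_summary <- toy_participants |>
  summarise(
    n_enrolled = n(),
    n_due = sum(day28_due),
    n_missing_all = sum(day28_status == "Missing"),
    n_missing_due = sum(day28_due & day28_status == "Missing"),
    pct_missing_overall = round(100 * n_missing_all / n_enrolled, 1),
    pct_missing_due = round(100 * n_missing_due / n_due, 1)
  )

missing_summary
```

With all enrolled participants as the denominator, 62.5% of day 28 outcomes appear missing; among the five participants whose visit is actually due, the figure is 40%. Keeping the counts next to the percentages lets a reader see which denominator was used.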