CLiREN-LMS
Data Analysis in R

8.6 Cross-Tabulations and Proportions

30-45 minutes Applied Step 3 of 12
Reading 1

Cross-tabulation summarizes the relationship between two categorical variables. In clinical research data management, cross-tabulations are useful for comparing outcome status by site, visit completion by visit, adverse event severity by relatedness, query status by site, or treatment arm by sex. A cross-tabulation is not automatically a statistical test; it is first a structured descriptive table.

The simplest cross-tabulation uses `tabyl()` from the janitor package:

```r
prepared_data |>
  tabyl(site, day28_status)
```

This produces counts of day 28 status within each site. To add row percentages:

```r
prepared_data |>
  tabyl(site, day28_status) |>
  adorn_percentages("row") |>
  adorn_pct_formatting(digits = 1)
```

Row percentages answer the question: within each site, what percentage of participants fall into each day 28 status category? Column percentages answer a different question: within each day 28 status category, what percentage come from each site? The correct choice depends on the reporting question.

The following example creates a cross-tabulation of treatment arm by sex:

```r
prepared_data |>
  tabyl(treatment_arm, sex_label) |>
  adorn_totals("row") |>
  adorn_percentages("row") |>
  adorn_pct_formatting(digits = 1)
```

This may be useful for a baseline summary. However, the data manager should be careful not to overinterpret small differences. In randomized trials, baseline tables describe the sample; they are not usually intended to drive post-randomization decisions unless predefined procedures require it.

Cross-tabulations can also reveal data quality problems:

```r
prepared_data |>
  tabyl(site, visit_status) |>
  adorn_totals(c("row", "col"))
```

If one site has many `Pending` visits while other sites have few, the issue may be delayed data entry, true workflow differences, or misunderstanding of visit status coding. The table points to a question; it does not answer the question by itself.
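The row-versus-column distinction described earlier can be checked on a small example. The sketch below is illustrative only: `toy_data` is a hypothetical six-row stand-in for `prepared_data`, and its values are invented for demonstration. It keeps the proportions unformatted so the sums can be inspected directly.

```r
library(janitor)

# Hypothetical toy data standing in for prepared_data (values invented)
toy_data <- data.frame(
  site = c("A", "A", "A", "B", "B", "C"),
  day28_status = c("Completed", "Completed", "Missing",
                   "Completed", "Missing", "Missing")
)

# Row proportions: within each site, the status categories sum to 1
row_pct <- toy_data |>
  tabyl(site, day28_status) |>
  adorn_percentages("row")

# Column proportions: within each status, the sites sum to 1
col_pct <- toy_data |>
  tabyl(site, day28_status) |>
  adorn_percentages("col")

row_pct
col_pct
```

The same counts give different figures depending on the denominator: site A accounts for two of its own three participants as `Completed` in the row version, and for two of the three `Completed` participants overall in the column version. Here the two proportions happen to match, but they answer different questions.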
For more controlled reporting, cross-tabulations can be built using `count()` and `group_by()`:

```r
visit_status_by_site <- prepared_data |>
  count(site, visit_status, name = "n") |>
  group_by(site) |>
  mutate(
    site_total = sum(n),
    percent = round(100 * n / site_total, 1)
  ) |>
  ungroup()

visit_status_by_site
```

This long-format table is often easier to export, join, plot, or use in dashboards than a wide cross-tabulation. The appropriate format depends on the next step.
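When the next step does call for a wide layout, the long table can be reshaped. The following is a minimal sketch, assuming the tidyr package is available; `toy_data`, `long_tab`, and `wide_tab` are hypothetical names, and the data values are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for prepared_data (values invented)
toy_data <- data.frame(
  site = c("A", "A", "A", "B", "B", "C"),
  visit_status = c("Completed", "Completed", "Pending",
                   "Completed", "Pending", "Pending")
)

# Long format: one row per observed site/status combination
long_tab <- toy_data |>
  count(site, visit_status, name = "n") |>
  group_by(site) |>
  mutate(
    site_total = sum(n),
    percent = round(100 * n / site_total, 1)
  ) |>
  ungroup()

# Wide format: one row per site, one column per status.
# values_fill = 0 fills combinations that never occurred in the data.
wide_tab <- long_tab |>
  select(site, visit_status, percent) |>
  pivot_wider(
    names_from = visit_status,
    values_from = percent,
    values_fill = 0
  )

wide_tab
```

Note that `count()` only returns combinations that occur in the data, so site C has no `Completed` row in the long table; the zero appears only after reshaping with `values_fill`.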
| Percentage type | Denominator | Example question |
|---|---|---|
| Row percentage | Row total | Within each site, what proportion of participants completed day 28? |
| Column percentage | Column total | Among participants with missing day 28 outcomes, which sites do they come from? |
| Overall percentage | Grand total | What proportion of all participants have completed day 28? |
| Due denominator | Participants expected to have data | Among participants whose day 28 visit is due, what proportion is missing? |
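The due denominator in the last row can be contrasted with the overall denominator in a short sketch. This assumes a logical flag marking whether the day 28 visit is due; the column name `day28_due` and the toy values are hypothetical and do not come from the reading's dataset.

```r
library(dplyr)

# Hypothetical toy data; day28_due is an assumed flag, values invented
toy_data <- data.frame(
  participant_id = 1:8,
  day28_due = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  day28_status = c("Completed", "Completed", "Completed", "Missing",
                   "Missing", "Missing", "Missing", "Missing")
)

due_summary <- toy_data |>
  summarise(
    n_total = n(),                                  # all participants
    n_due = sum(day28_due),                         # expected to have data
    n_missing_all = sum(day28_status == "Missing"),
    n_missing_due = sum(day28_status == "Missing" & day28_due),
    pct_missing_overall = round(100 * n_missing_all / n_total, 1),
    pct_missing_due = round(100 * n_missing_due / n_due, 1)
  )

due_summary
```

In this toy example, 62.5% of all participants lack a day 28 outcome, but only 40% of those whose visit is actually due are missing one. The first figure mixes genuinely missing data with data that is not yet expected.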
The denominator should be shown or explained in any table intended for decision-making. Many misunderstandings in clinical reporting arise not from complex statistics, but from unclear denominators.
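One way to keep the denominator visible is to print counts alongside the percentages. The sketch below uses janitor's `adorn_ns()` together with a total column; `toy_data` and `status_with_n` are hypothetical names, and the values are invented for illustration.

```r
library(janitor)

# Hypothetical toy data standing in for prepared_data (values invented)
toy_data <- data.frame(
  site = c("A", "A", "A", "B", "B", "C"),
  day28_status = c("Completed", "Completed", "Missing",
                   "Completed", "Missing", "Missing")
)

status_with_n <- toy_data |>
  tabyl(site, day28_status) |>
  adorn_totals("col") |>              # Total column holds each row's denominator
  adorn_percentages("row") |>
  adorn_pct_formatting(digits = 1) |>
  adorn_ns()                          # append the underlying counts in parentheses

status_with_n
```

Each cell then shows both the percentage and the count behind it, and the `Total` column shows how many participants each site's percentages are based on.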