Data Analysis in R

8.3 Summarizing Categorical Variables

Categorical variables describe membership in groups. Examples include site, sex, treatment arm, diagnosis category, visit status, adverse event severity, outcome status, and consent status. The most common summaries are counts and percentages. Although these summaries appear simple, they require careful attention to denominators, missing values, and expected categories.

The simplest categorical summary uses `count()`:

```r
library(dplyr)

prepared_data |>
  count(site)
```

This returns the number of records by site. To sort from highest to lowest:

```r
prepared_data |>
  count(site, sort = TRUE)
```

To add percentages:

```r
site_summary <- prepared_data |>
  count(site, name = "n") |>
  mutate(
    percent = 100 * n / sum(n)
  )

site_summary
```

This summary uses every row of `prepared_data` as the denominator, that is, all records remaining after exclusions, including rows where the variable is missing. That may be appropriate for enrollment by site. It may not be appropriate for variables with missing values if the report needs percentages among non-missing observations. Whichever denominator is used, it should be stated in the table or its footnotes.

The `janitor` package provides `tabyl()`, which is convenient for quick frequency tables:

```r
library(janitor)

prepared_data |>
  tabyl(sex_label)
```

`tabyl()` returns counts and proportions. Its `percent` column holds proportions between 0 and 1 rather than percentages, and a `valid_percent` column based on non-missing values is added when the variable contains `NA`. It can be useful during exploration, but for formal reporting the data manager may still prefer to build tables explicitly so that denominator rules and formatting are controlled.

Missing values and empty categories deserve special care:

```r
prepared_data |>
  count(outcome_status, .drop = FALSE)
```

If missing values are represented as `NA`, `count()` reports them as a separate row by default. The `.drop = FALSE` argument addresses a different problem: when `outcome_status` is a factor, it keeps levels that have no records, so expected categories appear with a count of zero instead of disappearing from the table.

The team should decide whether missing values are included in the denominator. For a data management report, missing values are often important and should be shown. For a baseline table, missing values may be shown separately or described in footnotes. For a primary analysis, missing data handling should follow the statistical analysis plan.
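To make the difference between missing values and empty categories concrete, the following is a minimal sketch using a small made-up data frame. The object `toy_data`, its values, and the `Withdrawn` level are assumptions for illustration only; they are not part of `prepared_data`.

```r
library(dplyr)

# Hypothetical toy data: "Withdrawn" is an expected category that no
# participant currently falls into, and one outcome is missing (NA).
toy_data <- tibble(
  participant_id = 1:6,
  outcome_status = factor(
    c("Completed", "Completed", "Ongoing", NA, "Completed", "Ongoing"),
    levels = c("Completed", "Ongoing", "Withdrawn")
  )
)

# Default behaviour: the unused "Withdrawn" level is dropped,
# while NA still gets its own row.
default_counts <- toy_data |>
  count(outcome_status)

# .drop = FALSE keeps the unused "Withdrawn" level with n = 0.
full_counts <- toy_data |>
  count(outcome_status, .drop = FALSE)

default_counts
full_counts
```

Note that `.drop = FALSE` only has this effect when the variable is a factor with declared levels. For a character column, a category with no records cannot be recovered from the data alone and would have to come from a separate list of expected values.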
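The following code creates a categorical summary that explicitly includes missing values:

```r
library(dplyr)
library(tidyr)  # provides replace_na()

outcome_summary <- prepared_data |>
  mutate(
    # as.character() lets the "Missing" label be added even if
    # outcome_status is stored as a factor
    outcome_status_display = replace_na(as.character(outcome_status), "Missing")
  ) |>
  count(outcome_status_display, name = "n") |>
  mutate(
    percent = round(100 * n / sum(n), 1)
  )

outcome_summary
```

This approach is useful for operational summaries because it makes missingness visible. Because the `Missing` row is counted like any other category, the percentages use all rows as the denominator. The label `Missing` is a display label created for reporting; it should not be confused with an actual observed outcome category.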
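Because the choice of denominator changes the reported percentages, it can help to compute both versions side by side during review. The sketch below is one possible way to do this; `toy_data`, its values, and the column names `percent_of_all` and `percent_of_non_missing` are hypothetical and chosen only for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with two missing outcomes out of eight records.
toy_data <- tibble(
  participant_id = 1:8,
  outcome_status = c("Recovered", "Recovered", "Not recovered", NA,
                     "Recovered", NA, "Not recovered", "Recovered")
)

outcome_two_denominators <- toy_data |>
  count(outcome_status, name = "n") |>
  mutate(
    # Denominator 1: all records, including missing outcomes.
    percent_of_all = round(100 * n / sum(n), 1),
    # Denominator 2: records with a non-missing outcome only.
    # The missing row itself gets no value in this column.
    percent_of_non_missing = if_else(
      is.na(outcome_status),
      NA_real_,
      round(100 * n / sum(n[!is.na(outcome_status)]), 1)
    ),
    # Display label for reporting, created after the calculations.
    outcome_status_display = replace_na(outcome_status, "Missing")
  ) |>
  select(outcome_status_display, n, percent_of_all, percent_of_non_missing)

outcome_two_denominators
```

In this toy example, `Recovered` accounts for 4 of 8 records (50%) but 4 of 6 non-missing records (66.7%), which shows why the denominator should be named in the table header or a footnote.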

| Categorical summary question | Example variable | Suitable output |
| --- | --- | --- |
| How many participants are enrolled by site? | `site` | Count and percent of participants |
| Are treatment arms balanced? | `treatment_arm` | Count and percent by arm |
| What is the distribution of sex? | `sex_label` | Count and percent, missing shown |
| How many participants completed day 28? | `day28_status` | Count and percent by completion status |
| How many adverse events were severe? | `ae_severity` | Count by severity and possibly by participant |
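The adverse event row in the table distinguishes counting events from counting participants. A minimal sketch of that distinction follows, assuming a hypothetical event-level data frame `ae_data` with one row per adverse event; the data frame, its values, and the `participant_id` column are illustrative assumptions.

```r
library(dplyr)

# Hypothetical event-level data: one row per adverse event,
# so a participant can appear more than once.
ae_data <- tibble(
  participant_id = c("P01", "P01", "P02", "P03", "P03", "P03"),
  ae_severity    = c("Mild", "Severe", "Mild", "Severe", "Severe", "Moderate")
)

ae_summary <- ae_data |>
  group_by(ae_severity) |>
  summarise(
    n_events       = n(),                        # number of event records
    n_participants = n_distinct(participant_id), # participants with >= 1 such event
    .groups = "drop"
  )

ae_summary
```

Here participant P03 has two severe events, so the severe row shows three events but only two participants. A participant can also appear in more than one severity row, so the participant column is not expected to sum to the number of participants.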

Categorical summaries should be reviewed for both content and plausibility. A site with zero enrolled participants may be correct if it has not opened recruitment. A treatment arm with no participants may indicate that randomization data were not imported. A category labeled `Unknown` may be legitimate if the CRF allows it, but it should be distinguished from missing data.
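Some of these plausibility checks can be scripted by comparing observed values against a reference list. The sketch below assumes a hypothetical vector `expected_sites` and a made-up `toy_data` frame; in practice the reference list would come from study documentation rather than being typed in by hand.

```r
library(dplyr)

# Hypothetical reference list of sites expected to appear in the data.
expected_sites <- c("Site A", "Site B", "Site C")

# Hypothetical toy data with an unexpected site, a missing site,
# and both "Unknown" and NA recorded for sex.
toy_data <- tibble(
  participant_id = 1:5,
  site      = c("Site A", "Site A", "Site B", "Site D", NA),
  sex_label = c("Female", "Unknown", "Male", NA, "Female")
)

observed_sites <- toy_data |>
  filter(!is.na(site)) |>
  distinct(site) |>
  pull(site)

# Expected sites with no participants: possibly not yet recruiting.
sites_with_no_participants <- setdiff(expected_sites, observed_sites)

# Observed sites that are not on the reference list: possible coding error.
unexpected_sites <- setdiff(observed_sites, expected_sites)

# "Unknown" is a recorded answer; NA means nothing was recorded.
sex_check <- toy_data |>
  summarise(
    n_unknown = sum(sex_label == "Unknown", na.rm = TRUE),
    n_missing = sum(is.na(sex_label))
  )

sites_with_no_participants
unexpected_sites
sex_check
```

Each flagged item is a prompt for review, not an automatic error: a site with no participants may simply not have opened recruitment.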