Creating Basic Charts with ggplot2
9.3 Creating Basic Charts with ggplot2
The `ggplot2` package creates plots by combining data, mappings, and layers. A simple bar chart of enrollment by site can be created as follows:
```r
library(tidyverse)

# Bar chart of the number of rows in each site category
prepared_data |>
  ggplot(aes(x = site)) +
  geom_bar() +
  labs(
    title = "Enrollment by Site",
    x = "Site",
    y = "Number of participants"
  )
```
The `ggplot()` function receives the dataset and defines the aesthetic mapping. The `aes(x = site)` mapping places site on the x-axis. The `geom_bar()` layer counts the rows in each site category, and `labs()` adds readable labels. Because `geom_bar()` counts rows, the bar heights equal participant counts only if each row represents one participant. If the dataset contains repeated visits, the chart overcounts every participant who has more than one row.
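Before relying on `geom_bar()` for participant counts, it can help to check whether the data really has one row per participant. The sketch below is illustrative only: it assumes a `participant_id` column and uses a small made-up tibble, `visits_example`, as a stand-in for `prepared_data`.

```r
library(tidyverse)

# Hypothetical stand-in for prepared_data: participant P001 has two visit rows
visits_example <- tibble(
  participant_id = c("P001", "P001", "P002", "P003"),
  site           = c("Site A", "Site A", "Site A", "Site B")
)

# Compare the number of rows with the number of distinct participants
n_rows         <- nrow(visits_example)
n_participants <- n_distinct(visits_example$participant_id)

# List the participants that appear on more than one row
repeated_ids <- visits_example |>
  count(participant_id, name = "rows_per_participant") |>
  filter(rows_per_participant > 1)

if (n_rows > n_participants) {
  message("Repeated rows found: geom_bar() would count rows, not participants.")
}
```

If `repeated_ids` is empty, each row corresponds to one participant and the simple bar chart is appropriate. Otherwise, the data should be summarized to participant level before plotting.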
For a participant-level enrollment summary from a dataset with repeated rows, create the summary first:
```r
# Keep one row per participant and site, then count participants per site
site_enrollment <- prepared_data |>
  distinct(participant_id, site) |>
  count(site, name = "participants")

# Plot the precomputed counts, with sites ordered by enrollment
site_enrollment |>
  ggplot(aes(x = reorder(site, participants), y = participants)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Participants Enrolled by Site",
    x = "Site",
    y = "Participants"
  )
```
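The `distinct()` function ensures that each participant is counted once per site. The `geom_col()` layer plots the precomputed counts directly instead of counting rows. `coord_flip()` swaps the axes so that site names are listed along the vertical axis, which makes long names easier to read. The `reorder()` function orders the sites by enrollment count rather than alphabetically.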
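To see what `reorder()` does to the site order, it may help to inspect the factor levels it produces. The sketch below uses a made-up summary table, `site_enrollment_example`, with invented counts. The names and numbers are assumptions for illustration only.

```r
library(tidyverse)

# Hypothetical precomputed counts, listed in alphabetical order
site_enrollment_example <- tibble(
  site         = c("Site A", "Site B", "Site C"),
  participants = c(12, 30, 5)
)

# reorder() returns a factor whose levels are sorted by the second argument
ordered_sites <- reorder(
  site_enrollment_example$site,
  site_enrollment_example$participants
)
site_levels <- levels(ordered_sites)

# Negating the count reverses the order (largest site first)
site_levels_desc <- levels(
  reorder(site_enrollment_example$site, -site_enrollment_example$participants)
)
```

Here `site_levels` runs from the smallest site to the largest. With `coord_flip()`, the first level is drawn at the bottom of the axis, so the largest site appears at the top of the chart.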
A histogram can show the distribution of age:
```r
prepared_data |>
  ggplot(aes(x = age_years_derived)) +
  geom_histogram(binwidth = 5, boundary = 0, color = "white") +
  labs(
    title = "Age Distribution at Enrollment",
    x = "Age, years",
    y = "Number of participants"
  )
```
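The choice of bin width affects interpretation. A very small bin width may create a noisy chart, while a very large bin width may hide meaningful patterns. The data manager should try several bin widths, but should not select bins in order to exaggerate a pattern. As with the bar chart, the y-axis counts rows, so the label "Number of participants" is accurate only if each row represents one participant.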
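One way to experiment with bin width is to draw the same histogram several times with different settings. The following is a minimal sketch, not a prescribed workflow: `ages_example` is made-up data, and `plot_age_histogram()` is a hypothetical helper introduced here for illustration.

```r
library(tidyverse)

# Made-up ages for illustration only; substitute prepared_data in practice
ages_example <- tibble(
  age_years_derived = c(23, 27, 31, 34, 35, 38, 41, 42, 44, 47,
                        52, 53, 55, 58, 61, 63, 66, 70, 74, 81)
)

# Hypothetical helper: draw the same histogram with a chosen bin width
plot_age_histogram <- function(data, binwidth) {
  data |>
    ggplot(aes(x = age_years_derived)) +
    geom_histogram(binwidth = binwidth, boundary = 0, color = "white") +
    labs(
      title = paste("Age Distribution, bin width =", binwidth),
      x = "Age, years",
      y = "Number of participants"
    )
}

# Build one plot per candidate bin width
bin_widths <- c(2, 5, 20)
histograms <- map(bin_widths, \(bw) plot_age_histogram(ages_example, bw))

# Print the plots one after another for visual comparison
walk(histograms, print)
```

Comparing the plots should show whether an apparent peak or gap persists across reasonable bin widths or appears only at one setting.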
Box plots are useful for comparing numeric distributions across groups:
```r
prepared_data |>
  ggplot(aes(x = site, y = age_years_derived)) +
  geom_boxplot() +
  coord_flip() +
  labs(
    title = "Age Distribution by Site",
    x = "Site",
    y = "Age, years"
  )
```
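This plot can reveal differences in age distribution by site, but it should be interpreted alongside each site's sample size. A box plot for a site with five participants rests on very few observations, so its median and quartiles can shift noticeably when a single value changes. A companion table of counts per site helps readers judge how much weight each box deserves.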
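A companion table of this kind could be produced with `group_by()` and `summarise()`. The sketch below uses made-up data in `ages_by_site` and assumes one row per participant. The summary column names `n_with_age` and `median_age` are illustrative choices.

```r
library(tidyverse)

# Made-up data for illustration; Site C has only two participants
ages_by_site <- tibble(
  site = c(rep("Site A", 5), rep("Site B", 4), rep("Site C", 2)),
  age_years_derived = c(34, 41, 45, 52, 60, 29, 38, 47, 55, 63, 71)
)

# Count the non-missing ages behind each box and report the median
site_age_summary <- ages_by_site |>
  group_by(site) |>
  summarise(
    n_with_age = sum(!is.na(age_years_derived)),
    median_age = median(age_years_derived, na.rm = TRUE),
    .groups = "drop"
  )

site_age_summary
```

Counting non-missing ages rather than rows keeps the table consistent with the box plot, which drops missing values before computing its statistics.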
| Chart type | `ggplot2` function | Suitable clinical data use |
|---|---|---|
| Bar chart | `geom_bar()` or `geom_col()` | Counts by site, category, query status |
| Histogram | `geom_histogram()` | Numeric distribution such as age or lab values |
| Box plot | `geom_boxplot()` | Numeric values by site or treatment group |
| Line chart | `geom_line()` | Enrollment over time |
| Scatter plot | `geom_point()` | Relationship between two numeric variables |
| Faceted chart | `facet_wrap()` | Same chart repeated by site, visit, or group |
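As an illustration of the last row, faceting repeats one chart for each level of a grouping variable. The sketch below is a hedged example on made-up data (`ages_by_site`). It assumes one row per participant and reuses the `site` and `age_years_derived` column names from this section.

```r
library(tidyverse)

# Made-up data for illustration only
ages_by_site <- tibble(
  site = c(rep("Site A", 5), rep("Site B", 4), rep("Site C", 2)),
  age_years_derived = c(34, 41, 45, 52, 60, 29, 38, 47, 55, 63, 71)
)

# Draw one histogram panel per site with facet_wrap()
age_by_site_plot <- ages_by_site |>
  ggplot(aes(x = age_years_derived)) +
  geom_histogram(binwidth = 10, boundary = 0, color = "white") +
  facet_wrap(vars(site)) +
  labs(
    title = "Age Distribution by Site",
    x = "Age, years",
    y = "Number of participants"
  )

age_by_site_plot
```

By default the panels share the same axes, which makes the sites directly comparable but can leave small sites with nearly empty panels.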
The best way to learn visualization is to create simple plots, inspect them, and ask whether they answer the intended question.