9.3 Creating Basic Charts with ggplot2

The `ggplot2` package creates plots by combining data, mappings, and layers. A simple bar chart of enrollment by site can be created as follows:

```r
library(tidyverse)

prepared_data |>
  ggplot(aes(x = site)) +
  geom_bar() +
  labs(
    title = "Enrollment by Site",
    x = "Site",
    y = "Number of participants"
  )
```

The `ggplot()` function defines the dataset and aesthetic mapping. The `aes(x = site)` mapping places site on the x-axis. The `geom_bar()` layer counts records in each site category. The `labs()` function adds readable labels.

This plot is useful only if each row represents one participant. If the dataset has repeated visits, the chart may overcount participants. For a participant-level enrollment summary from a dataset with repeated rows, create the summary first:

```r
site_enrollment <- prepared_data |>
  distinct(participant_id, site) |>
  count(site, name = "participants")

site_enrollment |>
  ggplot(aes(x = reorder(site, participants), y = participants)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Participants Enrolled by Site",
    x = "Site",
    y = "Participants"
  )
```

The `distinct()` function ensures that each participant is counted once per site. The `geom_col()` layer uses precomputed counts. `coord_flip()` makes long site names easier to read. The `reorder()` function orders sites by enrollment count.

A histogram can show the distribution of age:

```r
prepared_data |>
  ggplot(aes(x = age_years_derived)) +
  geom_histogram(binwidth = 5, boundary = 0, color = "white") +
  labs(
    title = "Age Distribution at Enrollment",
    x = "Age, years",
    y = "Number of participants"
  )
```

The choice of bin width affects interpretation. A very small bin width may create a noisy chart, while a very large bin width may hide meaningful patterns. The data manager should experiment but avoid manipulating bins to exaggerate a point.

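
One way to experiment with bin width is to draw the same histogram at several widths and compare the results. The sketch below is illustrative only: `toy_data` is randomly generated and stands in for `prepared_data`, and `plot_age_histogram()` is a hypothetical helper name, not part of `ggplot2`.

```r
library(ggplot2)

# Hypothetical toy data standing in for prepared_data (one row per participant).
set.seed(42)
toy_data <- data.frame(
  age_years_derived = round(runif(200, min = 18, max = 85))
)

# Hypothetical helper: draws the same age histogram at a chosen bin width.
plot_age_histogram <- function(data, binwidth) {
  ggplot(data, aes(x = age_years_derived)) +
    geom_histogram(binwidth = binwidth, boundary = 0, color = "white") +
    labs(
      title = paste("Age Distribution, bin width =", binwidth),
      x = "Age, years",
      y = "Number of participants"
    )
}

narrow_bins <- plot_age_histogram(toy_data, binwidth = 1)   # likely noisy
medium_bins <- plot_age_histogram(toy_data, binwidth = 5)   # width used in the text
wide_bins   <- plot_age_histogram(toy_data, binwidth = 20)  # likely too coarse

# Render one of the plots; print the others the same way to compare them.
print(medium_bins)
```

Viewing the three plots side by side makes it easier to choose a width that shows the shape of the distribution without adding noise or hiding detail.
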
Box plots are useful for comparing numeric distributions across groups:

```r
prepared_data |>
  ggplot(aes(x = site, y = age_years_derived)) +
  geom_boxplot() +
  coord_flip() +
  labs(
    title = "Age Distribution by Site",
    x = "Site",
    y = "Age, years"
  )
```

This plot can reveal differences in age distribution by site, but it should be interpreted alongside each site's sample size. A site with only five participants may produce a box plot whose median and quartiles are unstable. A companion table of participant counts per site helps readers judge how much weight each box deserves.

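
A companion table of counts could be built along the following lines. This is a minimal sketch: `toy_data` and its values are invented for illustration and stand in for `prepared_data`, and the `site_label` column is a hypothetical addition.

```r
library(dplyr)

# Hypothetical toy data standing in for prepared_data (one row per participant).
toy_data <- data.frame(
  participant_id = sprintf("P%02d", 1:10),
  site = c(rep("Site A", 5), rep("Site B", 2), rep("Site C", 3)),
  age_years_derived = c(34, 41, 29, 55, 62, 47, 51, 23, 38, 70)
)

# One row per site: number of distinct participants and median age.
site_age_summary <- toy_data |>
  group_by(site) |>
  summarise(
    participants = n_distinct(participant_id),
    median_age = median(age_years_derived),
    .groups = "drop"
  ) |>
  # Hypothetical label that carries the sample size, e.g. "Site B (n = 2)".
  mutate(site_label = paste0(site, " (n = ", participants, ")")) |>
  # Smallest sites first, so thinly sampled sites are easy to spot.
  arrange(participants)

print(site_age_summary)
```

Joining `site_label` back onto the plotting data and mapping it to the site axis is one option for showing the counts directly on the box plot.
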

| Chart type | `ggplot2` geometry | Suitable clinical data use |
|---|---|---|
| Bar chart | `geom_bar()` or `geom_col()` | Counts by site, category, query status |
| Histogram | `geom_histogram()` | Numeric distribution such as age or lab values |
| Box plot | `geom_boxplot()` | Numeric values by site or treatment group |
| Line chart | `geom_line()` | Enrollment over time |
| Scatter plot | `geom_point()` | Relationship between two numeric variables |
| Faceted chart | `facet_wrap()` | Same chart repeated by site, visit, or group |

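
The line chart and faceted chart listed in the table follow the same pattern as the earlier examples. The sketch below plots cumulative enrollment over time, with one panel per site. It assumes an `enrollment_date` column, which does not appear in the examples above. `toy_enrollment` and its values are invented for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: one row per participant with an assumed enrollment_date.
toy_enrollment <- data.frame(
  participant_id = sprintf("P%02d", 1:8),
  site = rep(c("Site A", "Site B"), each = 4),
  enrollment_date = as.Date(c(
    "2024-01-10", "2024-01-24", "2024-02-07", "2024-03-15",
    "2024-01-17", "2024-02-21", "2024-02-28", "2024-03-29"
  ))
)

# Running count of participants within each site, ordered by enrollment date.
cumulative_enrollment <- toy_enrollment |>
  arrange(site, enrollment_date) |>
  group_by(site) |>
  mutate(cumulative_participants = row_number()) |>
  ungroup()

enrollment_plot <- cumulative_enrollment |>
  ggplot(aes(x = enrollment_date, y = cumulative_participants)) +
  geom_line() +
  geom_point() +
  facet_wrap(~site) +  # same chart repeated for each site
  labs(
    title = "Cumulative Enrollment by Site",
    x = "Enrollment date",
    y = "Participants enrolled"
  )

print(enrollment_plot)
```

As with the bar chart, the running count is meaningful only if each row represents one participant. A dataset with repeated visits would need `distinct()` applied first.
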
The best way to learn visualization is to create simple plots, inspect them, and ask whether they answer the intended question.