CLiREN-LMS
Data Analysis in R

Summarizing Numeric Variables

30-45 minutes Applied Step 5 of 9

Part 1
Numeric variables include age, weight, height, laboratory values, vital signs, length of stay, days from enrollment to follow-up, and questionnaire scores. Common summaries include the mean, standard deviation, median, interquartile range, minimum, maximum, and number of missing values. The choice of summary depends on the distribution, clinical meaning, and reporting purpose [@altman1996numerical].

The following code summarizes age:

This output is more detailed than many reports need, but it is useful during review. The mean and standard deviation are informative for approximately symmetric distributions. The median and interquartile range are often more robust for skewed distributions, because a few extreme values have little influence on them. Minimum and maximum values are useful for detecting outliers or impossible values.
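The original code block for the age summary is not reproduced here. As a minimal sketch of what such a summary could look like, the example below uses base R only and assumes a hypothetical data frame called `participants` with a numeric `age` column; both names and all values are invented for illustration.

```r
# Hypothetical example data; `participants` and `age` are assumed names.
participants <- data.frame(
  id  = 1:8,
  age = c(34, 51, 47, NA, 62, 29, 58, 45)
)

age <- participants$age

# One-row summary covering the statistics listed in the text.
age_summary <- data.frame(
  n         = sum(!is.na(age)),   # number of observed values
  n_missing = sum(is.na(age)),    # number of missing values
  mean      = mean(age, na.rm = TRUE),
  sd        = sd(age, na.rm = TRUE),
  median    = median(age, na.rm = TRUE),
  q1        = unname(quantile(age, 0.25, na.rm = TRUE)),
  q3        = unname(quantile(age, 0.75, na.rm = TRUE)),
  min       = min(age, na.rm = TRUE),
  max       = max(age, na.rm = TRUE)
)

print(age_summary)
```

The quartiles use the default method of `quantile()`; other software may use a different quartile definition, so small differences between packages are expected.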
Part 2
The `na.rm = TRUE` argument tells R to remove missing values before calculating the statistic. Without it, many summary functions return `NA` if any value is missing. This behavior is intentional because missing values affect interpretation. A report should not hide missingness simply by using `na.rm = TRUE`; it should also report the number of missing values where relevant.

Numeric summaries can be created by group:

This table may reveal differences in participant populations across sites. However, such differences should be interpreted cautiously. A site with a small number of participants may have unstable summaries. Differences may reflect recruitment patterns, eligibility criteria, referral pathways, or random variation.
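The two ideas above, the effect of `na.rm` and summaries by group, can be sketched in base R as follows. The data frame `participants`, its columns `site` and `age`, and the helper `summarize_numeric()` are hypothetical names chosen for this illustration, and the values are invented.

```r
# Hypothetical data: `site` and `age` are assumed column names.
participants <- data.frame(
  site = c("A", "A", "A", "B", "B", "C"),
  age  = c(34, 51, NA, 62, 29, 45)
)

# A single missing value makes the default result NA.
mean(participants$age)                 # NA
mean(participants$age, na.rm = TRUE)   # mean of the observed ages only

# Illustrative helper: reports the missing count alongside the statistics,
# so that na.rm = TRUE does not hide missingness.
summarize_numeric <- function(x) {
  c(
    n         = sum(!is.na(x)),
    n_missing = sum(is.na(x)),
    mean      = mean(x, na.rm = TRUE),
    median    = median(x, na.rm = TRUE)
  )
}

# Split ages by site, summarize each group, and bind the rows into a table.
ages_by_site   <- split(participants$age, participants$site)
summary_matrix <- do.call(rbind, lapply(ages_by_site, summarize_numeric))
by_site <- data.frame(site = rownames(summary_matrix), summary_matrix)
rownames(by_site) <- NULL

print(by_site)
```

In this invented data, site C contributes a single participant, so its mean and median are just that one value. Keeping `n` in the table makes such unstable summaries easy to spot.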
Part 3
Numeric variables should be checked for units. A weight variable may mix kilograms and grams if data entry guidance was unclear. A creatinine result may be reported in micromoles per litre at one site and milligrams per decilitre at another. A temperature may be recorded in Celsius or Fahrenheit. R will calculate summaries regardless of unit inconsistency, so the data manager must understand the source data.

It is good practice to create both a report summary and a quality review summary. The report summary may show median and interquartile range. The quality review summary may additionally show minimum, maximum, and missing count to detect problems before the report is finalized.
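As a rough sketch of how the two kinds of summary differ, the example below builds a report summary and a quality review summary for a weight variable and flags values outside a plausible range. The data, the column name `weight_kg`, the function names, and the limits of 30 to 250 kg are all assumptions made for illustration; real limits should come from the study protocol.

```r
# Hypothetical data: adult weights intended to be recorded in kilograms.
# One value looks as if it was entered in grams.
participants <- data.frame(
  id        = 1:6,
  weight_kg = c(72.5, 64.0, 81000, 58.3, NA, 90.1)
)
w <- participants$weight_kg

# Report summary: median and interquartile range only.
report_summary <- function(x) {
  q <- quantile(x, c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  c(median = q[2], q1 = q[1], q3 = q[3])
}

# Quality review summary: adds minimum, maximum, and missing count.
quality_summary <- function(x) {
  c(
    report_summary(x),
    min       = min(x, na.rm = TRUE),
    max       = max(x, na.rm = TRUE),
    n_missing = sum(is.na(x))
  )
}

# Assumed plausibility limits in kg; adjust to the study population.
plausible_range <- c(30, 250)
out_of_range <- !is.na(w) & (w < plausible_range[1] | w > plausible_range[2])

print(report_summary(w))
print(quality_summary(w))
print(participants[out_of_range, ])   # rows to query against the source data
```

Because the median and quartiles are robust, the report summary alone looks unremarkable here. The implausible value only becomes visible through the maximum and the range check, which is why the quality review summary is worth producing before the report is finalized.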