Codebooks and Data Dictionaries
11.2 Codebooks and Data Dictionaries
A data dictionary is usually a structured table that defines the variables in a database or dataset. For each variable it may record the variable name, form, field label, field type, choices, validation rules, branching logic, and required status. In REDCap, the data dictionary is also a build artifact: the same table that documents the fields can be used to define the database structure.
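Because a data dictionary is a structured table, it can also be used programmatically, for example to check collected data against the declared choices and required status. The following is a minimal sketch, assuming a simplified dictionary with hypothetical column names (`field_name`, `field_type`, `choices`, `required`) and made-up records. A real REDCap data dictionary uses its own column headers and choice syntax, so this layout is illustrative only.

```r
library(dplyr)
library(tibble)

# Hypothetical, simplified data dictionary (not REDCap's actual column layout)
dictionary <- tibble(
  field_name = c("sex", "consent_given"),
  field_type = c("radio", "yesno"),
  choices    = list(c("Female", "Male", "Not reported"), c("Yes", "No")),
  required   = c(TRUE, TRUE)
)

# Made-up records for illustration
records <- tibble(
  record_id     = 1:4,
  sex           = c("Female", "Male", "F", NA),
  consent_given = c("Yes", "Yes", "No", "Yes")
)

# For one field: count values outside the declared choices and
# count missing values if the field is marked as required
check_field <- function(field, allowed, is_required) {
  values <- records[[field]]
  tibble(
    field_name         = field,
    n_invalid          = sum(!is.na(values) & !values %in% allowed),
    n_missing_required = if (is_required) sum(is.na(values)) else 0L
  )
}

# Apply the check to every row of the dictionary and stack the results
validation <- bind_rows(
  Map(check_field, dictionary$field_name, dictionary$choices, dictionary$required)
)

print(validation)
```

In this sketch the undeclared value `"F"` and the missing `sex` entry are both flagged, which is the kind of discrepancy a dictionary-driven check is meant to surface.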
A codebook is often more explanatory: its purpose is to help people understand a dataset. It may include variable definitions, coding schemes, units, missingness notes, derivation rules, and usage guidance. A codebook for an analysis dataset may explain how `day28_outcome` was derived, how deaths were coded, which participants were excluded, and how repeated visits were summarized. The table below lists fields that such documentation commonly records, with an example for each.
| Field | Example | Why it matters |
|---|---|---|
| Variable name | `age_years_derived` | Enables use in code |
| Variable label | Age at enrollment, years | Human-readable meaning |
| Definition | Age calculated from enrollment date and date of birth | Prevents ambiguity |
| Type | Numeric | Supports analysis |
| Unit | Years | Prevents unit errors |
| Allowed values | `Female`, `Male`, `Not reported` (for a categorical variable) | Supports category interpretation |
| Missing codes | `NA`, `Not applicable`, `Unknown` | Clarifies missingness |
| Derivation | `(enrollment_date - date_of_birth) / 365.25` | Supports reproducibility |
| Source | REDCap enrollment form | Establishes provenance |
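The Derivation row can be mirrored in code so that the documented rule and the computed variable stay in agreement. The sketch below assumes `enrollment_date` and `date_of_birth` are stored as `Date` columns (the names come from the derivation formula above) and uses made-up dates. Dividing by 365.25 gives an approximate age in years rather than an exact calendar age.

```r
library(dplyr)
library(tibble)

# Made-up enrollment records for illustration
enrollment <- tibble(
  record_id       = 1:2,
  date_of_birth   = as.Date(c("1980-03-15", "1955-11-02")),
  enrollment_date = as.Date(c("2025-03-15", "2025-06-01"))
)

# Apply the documented derivation: elapsed days divided by 365.25
enrollment <- mutate(
  enrollment,
  age_years_derived = as.numeric(
    difftime(enrollment_date, date_of_birth, units = "days")
  ) / 365.25
)

print(enrollment)
```

Converting the difference to a plain number keeps the variable consistent with the `Numeric` type declared in the codebook, instead of leaving it as a time-difference object.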
R can help create a simple codebook:
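```r
library(tidyverse)  # tibble, purrr (map_*), dplyr (n_distinct), readr (write_csv)

# One row per column of prepared_data (assumed to be an existing data frame)
codebook <- tibble(
  variable  = names(prepared_data),
  type      = map_chr(prepared_data, ~ class(.x)[1]),    # first class only
  n_missing = map_int(prepared_data, ~ sum(is.na(.x))),  # number of NA values
  n_unique  = map_int(prepared_data, n_distinct)         # distinct values; NA counts as one
)

dir.create("outputs", showWarnings = FALSE)  # make sure the output folder exists
write_csv(codebook, "outputs/basic_codebook_2026-06-01.csv")
```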
This automated codebook is only a starting point. It does not know clinical definitions or derivation rules unless those are supplied. The data manager should enrich the codebook with labels, definitions, units, and source information.
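One possible way to do this enrichment is to keep the human-written metadata in a separate table and join it onto the automated summary. The sketch below is illustrative only: `prepared_data` is replaced by a made-up stand-in, and `manual_metadata` is a hypothetical table that in practice might be a curated CSV maintained by the data manager.

```r
library(dplyr)
library(purrr)
library(tibble)

# Made-up stand-in for prepared_data
prepared_data <- tibble(
  age_years_derived = c(45.0, 69.6, NA),
  sex               = c("Female", "Male", "Not reported")
)

# Automated part: structure and missingness only
auto_codebook <- tibble(
  variable  = names(prepared_data),
  type      = map_chr(prepared_data, ~ class(.x)[1]),
  n_missing = map_int(prepared_data, ~ sum(is.na(.x)))
)

# Human-maintained part (hypothetical): labels, units, and provenance
manual_metadata <- tibble(
  variable = "age_years_derived",
  label    = "Age at enrollment, years",
  unit     = "Years",
  source   = "REDCap enrollment form"
)

# Keep every variable, attaching metadata where it has been written
enriched_codebook <- left_join(auto_codebook, manual_metadata, by = "variable")

# Variables that still lack a human-written label
undocumented <- enriched_codebook |>
  filter(is.na(label)) |>
  pull(variable)

print(undocumented)
```

Using a left join keeps variables that have no metadata yet, so the list of undocumented variables doubles as a to-do list for the data manager.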