Data Documentation and Metadata

Codebooks and Data Dictionaries

11.2 Codebooks and Data Dictionaries

30-45 minutes Applied Step 3 of 8

Reading 1

A data dictionary is usually a structured table that defines the variables in a database or dataset. Typical columns include variable name, form, field label, field type, choices, validation, branching logic, and required status. In REDCap, the data dictionary is also a build artifact because it can define the database structure.

A codebook is often more explanatory: its purpose is to help humans understand a dataset. It may include variable definitions, coding schemes, units, missingness notes, derivation rules, and usage guidance. For example, a codebook for an analysis dataset may explain how `day28_outcome` was derived, how deaths were coded, which participants were excluded, and how repeated visits were summarized.
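Because a data dictionary is a structured table, it can be compared against the data it describes. The following is a minimal sketch, not a REDCap export: the column names (`variable`, `field_type`, `required`), the `enrollment` dataset, and all values are hypothetical and chosen only for illustration.

```r
library(tibble)
library(dplyr)

# Hypothetical, simplified data dictionary: one row per variable
data_dictionary <- tribble(
  ~variable,         ~field_type, ~required,
  "participant_id",  "text",      TRUE,
  "enrollment_date", "date",      TRUE,
  "sex",             "radio",     FALSE
)

# Hypothetical dataset to check against the dictionary
enrollment <- tibble(
  participant_id  = c("P001", "P002", "P003"),
  enrollment_date = as.Date(c("2026-01-05", NA, "2026-01-09")),
  site            = c("A", "A", "B")
)

# Variables defined in the dictionary but absent from the data, and vice versa
missing_from_data <- setdiff(data_dictionary$variable, names(enrollment))
undocumented      <- setdiff(names(enrollment), data_dictionary$variable)

# Required variables that are present in the data but contain missing values
required_vars <- data_dictionary |>
  filter(required, variable %in% names(enrollment)) |>
  pull(variable)

incomplete_required <- required_vars[
  vapply(required_vars, function(v) anyNA(enrollment[[v]]), logical(1))
]
```

Checks like these tend to surface drift between the dictionary and the data early, before it reaches an analysis dataset.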

| Field | Example | Why it matters |
|---|---|---|
| Variable name | `age_years_derived` | Enables use in code |
| Variable label | Age at enrollment, years | Human-readable meaning |
| Definition | Age calculated from enrollment date and date of birth | Prevents ambiguity |
| Type | Numeric | Supports analysis |
| Unit | Years | Prevents unit errors |
| Allowed values | `Female`, `Male`, `Not reported` (for a categorical variable) | Supports category interpretation |
| Missing codes | `NA`, `Not applicable`, `Unknown` | Clarifies missingness |
| Derivation | `(enrollment_date - date_of_birth) / 365.25` | Supports reproducibility |
| Source | REDCap enrollment form | Establishes provenance |
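The Derivation, Allowed values, and Missing codes rows in the table above can be applied directly in code. This is a sketch under stated assumptions: the `raw` records are invented, and the column name `sex` is a hypothetical label for the categorical variable, since the table does not name it.

```r
library(tibble)
library(dplyr)

# Hypothetical raw records; date column names follow the Derivation row
raw <- tibble(
  date_of_birth   = as.Date(c("1980-06-01", "1995-12-15")),
  enrollment_date = as.Date(c("2026-06-01", "2026-06-01")),
  sex             = c("Female", "Unknown")  # hypothetical column name
)

allowed_sex   <- c("Female", "Male", "Not reported")
missing_codes <- c("NA", "Not applicable", "Unknown")

prepared <- raw |>
  mutate(
    # Difference in days, converted to years as in the Derivation row
    age_years_derived = as.numeric(
      difftime(enrollment_date, date_of_birth, units = "days")
    ) / 365.25,
    # Recode the documented missing codes to a true NA
    sex = if_else(sex %in% missing_codes, NA_character_, sex)
  )

# Any remaining value outside the allowed set signals a coding problem
unexpected_sex <- setdiff(na.omit(prepared$sex), allowed_sex)
```

Dividing by 365.25 gives an approximate age in years, so values may differ slightly from a whole number on a participant's birthday. Recording the formula in the codebook makes that approximation visible to anyone who uses the variable.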

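R can help create a simple codebook:

```r
library(tidyverse)  # tibble, purrr, dplyr (n_distinct), readr

# One row per variable in prepared_data (assumed to be a data frame already in memory)
codebook <- tibble(
  variable  = names(prepared_data),
  type      = map_chr(prepared_data, ~ class(.x)[1]),       # first class only
  n_missing = map_int(prepared_data, ~ sum(is.na(.x))),     # count of NA values
  n_unique  = map_int(prepared_data, n_distinct)            # distinct values, NA included
)

# write_csv() does not create folders, so make sure the output folder exists
dir.create("outputs", showWarnings = FALSE)
write_csv(codebook, "outputs/basic_codebook_2026-06-01.csv")
```

This automated codebook is only a starting point. It does not know clinical definitions or derivation rules unless those are supplied. The data manager should enrich the codebook with labels, definitions, units, and source information.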
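One way to enrich the automated codebook is to join it to a small, hand-maintained metadata table. The sketch below assumes such a table exists. The names `auto_codebook` and `manual_metadata`, their columns, and their values are hypothetical stand-ins, with `auto_codebook` taking the place of the automated codebook so the example runs on its own.

```r
library(tibble)
library(dplyr)

# Stand-in for the automated codebook (hypothetical values)
auto_codebook <- tibble(
  variable  = c("age_years_derived", "sex"),
  type      = c("numeric", "character"),
  n_missing = c(0L, 1L),
  n_unique  = c(2L, 2L)
)

# Hand-maintained metadata (hypothetical; could live in a reviewed CSV file)
manual_metadata <- tribble(
  ~variable,           ~label,                     ~unit,   ~source,
  "age_years_derived", "Age at enrollment, years", "Years", "REDCap enrollment form"
)

# Keep every variable from the automated codebook, adding metadata where it exists
enriched_codebook <- auto_codebook |>
  left_join(manual_metadata, by = "variable")

# Variables that still lack a label need attention from the data manager
still_undocumented <- enriched_codebook |>
  filter(is.na(label)) |>
  pull(variable)
```

A left join keeps every variable from the automated summary, so anything without metadata shows up as `NA` rather than silently disappearing from the codebook.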