CLiREN-LMS
Data Documentation and Metadata

11.2 Codebooks and Data Dictionaries

30-45 minutes Applied Step 3 of 8
Reading 1

A data dictionary is usually a structured table that defines the variables in a database or dataset. It may include the variable name, form, field label, field type, choices, validation rules, branching logic, and required status. In REDCap, the data dictionary is also a build artifact, because it can define the database structure.

A codebook is usually more explanatory: its purpose is to help people understand a dataset. It may include variable definitions, coding schemes, units, missingness notes, derivation rules, and usage guidance. For example, a codebook for an analysis dataset may explain how `day28_outcome` was derived, how deaths were coded, which participants were excluded, and how repeated visits were summarized.
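
To make the idea of a data dictionary as a structured table concrete, the sketch below holds one as an R tibble with one row per field and uses it to find dataset columns that are not documented. The column names (`field_name`, `form_name`, and so on), the `enrollment` dataset, and all values are hypothetical simplifications for illustration; they are not REDCap's exact export headers.

```r
library(tibble)

# Hypothetical, simplified data dictionary: one row per field.
# Column names are illustrative, not REDCap's exact export headers.
data_dictionary <- tribble(
  ~field_name,       ~form_name,   ~field_label,         ~field_type, ~choices,                                ~required,
  "participant_id",  "enrollment", "Participant ID",     "text",      NA_character_,                           TRUE,
  "enrollment_date", "enrollment", "Date of enrollment", "text",      NA_character_,                           TRUE,
  "sex",             "enrollment", "Sex",                "radio",     "1, Female | 2, Male | 3, Not reported", TRUE
)

# Toy dataset standing in for an export; `site` is deliberately undocumented.
enrollment <- tibble(
  participant_id  = c("P001", "P002"),
  enrollment_date = as.Date(c("2026-01-15", "2026-02-03")),
  sex             = c(1L, 2L),
  site            = c("A", "B")
)

# Columns present in the data but absent from the dictionary, and the reverse.
undocumented <- setdiff(names(enrollment), data_dictionary$field_name)
unused       <- setdiff(data_dictionary$field_name, names(enrollment))

undocumented
unused
```

A check like this only confirms that every variable has a dictionary row. It says nothing about whether the definitions are correct, which is where the explanatory codebook comes in.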
| Field | Example | Why it matters |
| --- | --- | --- |
| Variable name | `age_years_derived` | Enables use in code |
| Variable label | Age at enrollment, years | Human-readable meaning |
| Definition | Age calculated from enrollment date and date of birth | Prevents ambiguity |
| Type | Numeric | Supports analysis |
| Unit | Years | Prevents unit errors |
| Allowed values | `Female`, `Male`, `Not reported` | Supports category interpretation |
| Missing codes | `NA`, `Not applicable`, `Unknown` | Clarifies missingness |
| Derivation | `(enrollment_date - date_of_birth) / 365.25` | Supports reproducibility |
| Source | REDCap enrollment form | Establishes provenance |

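
The derivation rule in the table can be written out in code, so that the codebook entry and the computation stay in step. The following is a minimal sketch in base R; the `enrollment` data frame, its participant IDs, and its dates are invented for illustration, and only the formula and the entry text come from the table.

```r
# Hypothetical enrollment records; IDs and dates are invented for illustration.
enrollment <- data.frame(
  participant_id  = c("P001", "P002", "P003"),
  date_of_birth   = as.Date(c("1980-06-01", "1995-12-31", NA)),
  enrollment_date = as.Date(c("2026-06-01", "2026-01-01", "2026-03-15"))
)

# Derivation as documented: days between the two dates divided by 365.25.
days_elapsed <- as.numeric(
  difftime(enrollment$enrollment_date, enrollment$date_of_birth, units = "days")
)
enrollment$age_years_derived <- days_elapsed / 365.25

# Matching codebook entry, kept next to the code that produces the variable.
age_entry <- data.frame(
  variable   = "age_years_derived",
  label      = "Age at enrollment, years",
  type       = "Numeric",
  unit       = "Years",
  derivation = "(enrollment_date - date_of_birth) / 365.25",
  source     = "REDCap enrollment form"
)

round(enrollment$age_years_derived, 2)
```

Two details are worth recording in the codebook alongside the formula. A missing date of birth yields a missing age rather than an error. Dividing by 365.25 is an approximation, so a participant enrolled on a birthday can come out fractionally under the whole-year age, which matters if the value is later truncated to completed years.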
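R can help create a simple codebook:

```r
library(tidyverse)  # provides tibble, purrr, dplyr, and readr

# Assumes `prepared_data` is a data frame already loaded in the session.
codebook <- tibble(
  variable  = names(prepared_data),
  type      = map_chr(prepared_data, ~ class(.x)[1]),    # first class only
  n_missing = map_int(prepared_data, ~ sum(is.na(.x))),
  n_unique  = map_int(prepared_data, n_distinct)         # NA counts as a value
)

# Make sure the output folder exists before writing.
dir.create("outputs", showWarnings = FALSE)
write_csv(codebook, "outputs/basic_codebook_2026-06-01.csv")
```

This automated codebook is only a starting point. It records what R can infer from the data, but it does not know clinical definitions or derivation rules unless those are supplied. The data manager should enrich the codebook with labels, definitions, units, and source information.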
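
One possible way to do that enrichment is to keep the human-written metadata in its own table and join it onto the automated summary by variable name. The sketch below assumes a small stand-in for the automated codebook and a hypothetical `manual_metadata` table. All counts, labels, and sources here are illustrative placeholders, not values from a real study.

```r
library(dplyr)
library(tibble)

# Stand-in for an automated codebook; the counts are illustrative only.
auto_codebook <- tibble(
  variable  = c("participant_id", "age_years_derived", "sex"),
  type      = c("character", "numeric", "character"),
  n_missing = c(0L, 2L, 1L),
  n_unique  = c(120L, 57L, 3L)
)

# Hypothetical hand-maintained metadata supplied by the data manager.
manual_metadata <- tribble(
  ~variable,           ~label,                     ~unit,         ~source,
  "participant_id",    "Participant identifier",   NA_character_, "REDCap enrollment form",
  "age_years_derived", "Age at enrollment, years", "Years",       "REDCap enrollment form"
)

# Keep every automated row; attach labels, units, and sources where available.
enriched <- left_join(auto_codebook, manual_metadata, by = "variable")

# Variables that still lack a label need attention before release.
needs_documentation <- enriched %>%
  filter(is.na(label)) %>%
  pull(variable)

needs_documentation
```

Keeping the manual metadata separate means the automated summary can be regenerated whenever the data change without overwriting the definitions written by hand.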