Data Cleaning and Preparation in R

Recoding Categorical Variables

7.5 Recoding Categorical Variables

Recoding is the process of converting values from one coding scheme to another. It is common in clinical data preparation. A REDCap export may store sex as `1`, `2`, and `3`, while a report may need `Female`, `Male`, and `Not reported`. A checkbox field may export multiple binary variables that need to be summarized. A free text field may need classification into controlled categories after review. Recoding should always preserve a clear connection to the original value.

A simple recoding example uses `case_when()`:

```r
library(dplyr)

# Derive a labelled variable; the original `sex` column is left unchanged
enrollment_recoded <- enrollment_data |>
  mutate(
    sex_label = case_when(
      sex == 1 ~ "Female",
      sex == 2 ~ "Male",
      sex == 3 ~ "Not reported",
      is.na(sex) ~ NA_character_,
      TRUE ~ "Unexpected code"  # flag undefined codes for review
    )
  )
```

This code creates a new variable called `sex_label` and does not overwrite the original `sex` variable. Preserving the original variable is a good habit, especially during cleaning and review. If an unexpected code appears, it is labeled explicitly rather than silently converted to missing, so the data manager can investigate it.

The recoding rule should match the data dictionary. If the data dictionary says `1 = Male` and `2 = Female` but the script uses the opposite, every derived label will be wrong, and nothing in the output will signal the error. For that reason, recoding scripts should be reviewed and tested against known examples.

When possible, recoding rules can be stored in lookup tables rather than hard-coded repeatedly. A lookup table approach may look like this:

```r
library(dplyr)
library(tibble)

# Code list kept in one place so it can be reviewed against the data dictionary
sex_lookup <- tibble(
  sex = c(1, 2, 3),
  sex_label = c("Female", "Male", "Not reported")
)

# Codes absent from the lookup table receive NA in sex_label
enrollment_recoded <- enrollment_data |>
  left_join(sex_lookup, by = "sex")
```

This approach is useful when many variables need recoding or when coding tables are maintained separately. A lookup table can be reviewed more easily than a long chain of conditional statements. It also supports harmonization across studies if the organization uses standard code lists. Note that `left_join()` leaves `sex_label` as `NA` for any code that is missing from the lookup table, so unmatched codes need a separate check if they are to be flagged rather than treated as missing.

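
The lookup approach can be combined with explicit flagging and a check against known examples. The following is a minimal sketch, not a definitive implementation: the example data, the `record_id` column, and the undefined code `9` are hypothetical, and the label mapping is assumed to match the data dictionary.

```r
library(dplyr)
library(tibble)

# Hypothetical example data; code 9 is not defined in the lookup table
enrollment_data <- tibble(
  record_id = 1:5,
  sex = c(1, 2, 3, 9, NA)
)

sex_lookup <- tibble(
  sex = c(1, 2, 3),
  sex_label = c("Female", "Male", "Not reported")
)

enrollment_recoded <- enrollment_data |>
  left_join(sex_lookup, by = "sex") |>
  mutate(
    # A non-missing code with no match in the lookup table is unexpected
    sex_label = if_else(
      !is.na(sex) & is.na(sex_label),
      "Unexpected code",
      sex_label
    )
  )

# Records to pass to the data manager for review
unexpected_sex <- enrollment_recoded |>
  filter(sex_label == "Unexpected code")

# Test against known examples: defined codes map as expected, NA stays NA
stopifnot(
  identical(enrollment_recoded$sex_label[1:3],
            c("Female", "Male", "Not reported")),
  is.na(enrollment_recoded$sex_label[5])
)
```

Because the `stopifnot()` call fails loudly, a swapped mapping in the lookup table would stop the script instead of producing wrong labels quietly.
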
Recoding may also be needed for inconsistent text values:

```r
library(dplyr)
library(stringr)

# Standardize free-text visit status into controlled categories
enrollment_recoded <- enrollment_data |>
  mutate(
    visit_status_clean = case_when(
      str_to_lower(visit_status) %in% c("complete", "completed", "done") ~ "Completed",
      str_to_lower(visit_status) %in% c("missed", "not done", "not completed") ~ "Missed",
      str_to_lower(visit_status) %in% c("pending", "awaiting") ~ "Pending",
      is.na(visit_status) ~ NA_character_,
      TRUE ~ "Needs review"  # anything unrecognized goes to manual review
    )
  )
```

This kind of recoding can be useful, but it also reveals a design problem: a field that should have been controlled may have been collected as free text. The better long-term solution may be to revise the CRF or database field, not merely to clean the exported data repeatedly. R helps manage the current dataset, but clinical data management should also improve upstream data capture.
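
Before writing text recoding rules, it often helps to tabulate the values that actually occur, and afterwards to list whatever fell into the review category. The sketch below assumes a hypothetical export with a `record_id` column and invented status values. It also trims whitespace with `str_squish()`, which the rule above does not do, so that a value such as `"done "` still matches.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical export with inconsistent free-text visit statuses
enrollment_data <- tibble(
  record_id = 1:6,
  visit_status = c("Complete", "done ", "MISSED", "pending", "resched", NA)
)

# Normalize case and whitespace once, then reuse the normalized column
enrollment_norm <- enrollment_data |>
  mutate(visit_status_norm = str_to_lower(str_squish(visit_status)))

# Frequency table of the distinct values found in the data
status_counts <- enrollment_norm |>
  count(visit_status_norm, sort = TRUE)

# Apply the recoding rule and keep only the values that need manual review
needs_review <- enrollment_norm |>
  mutate(
    visit_status_clean = case_when(
      visit_status_norm %in% c("complete", "completed", "done") ~ "Completed",
      visit_status_norm %in% c("missed", "not done", "not completed") ~ "Missed",
      visit_status_norm %in% c("pending", "awaiting") ~ "Pending",
      is.na(visit_status_norm) ~ NA_character_,
      TRUE ~ "Needs review"
    )
  ) |>
  filter(visit_status_clean == "Needs review")
```

A long or growing `needs_review` table is a practical sign that the field should become a controlled list at the point of data capture.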

| Recoding situation | Example | Preferred practice |
| --- | --- | --- |
| Numeric codes to labels | `1`, `2`, `3` to sex labels | Use data dictionary and preserve original variable |
| Text standardization | `done`, `Complete`, `completed` | Create cleaned variable and review free text source |
| Unexpected values | Code `9` appears but is not defined | Flag for review rather than silently dropping |
| Checkbox variables | Multiple binary columns for symptoms | Keep raw checkbox fields and derive summary variables |
| Site names | `Kilifi`, `Kilifi Hosp`, `KWTRP` | Use controlled site lookup table |
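
The checkbox row in the table above can be sketched as follows. This is an illustration under stated assumptions: the column names follow the `field___code` pattern commonly seen in REDCap checkbox exports, the symptom meanings in the comments are invented, and `pick()` requires dplyr 1.1.0 or later.

```r
library(dplyr)
library(tibble)

# Hypothetical checkbox export: one 0/1 column per option
symptom_data <- tibble(
  record_id = 1:3,
  symptom___1 = c(1, 0, 0),  # fever (assumed coding)
  symptom___2 = c(1, 1, 0),  # cough (assumed coding)
  symptom___3 = c(0, 0, 0)   # rash (assumed coding)
)

symptom_summary <- symptom_data |>
  mutate(
    # Count ticked boxes per record; the raw checkbox columns are kept
    symptom_count = rowSums(pick(starts_with("symptom___"))),
    any_symptom = symptom_count > 0
  )
```

The derived `symptom_count` and `any_symptom` columns sit alongside the raw fields, so a reviewer can trace each summary value back to the boxes that produced it.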

Recoding decisions should be documented in the cleaning log or script comments. A future reviewer should be able to see why a value was recoded, what rule was used, and whether the recoding came from the data dictionary, protocol, investigator decision, or data manager judgment.
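
One way to keep such documentation next to the code is a small recoding log maintained as a table. The structure below is only a sketch: the column names and the entries are hypothetical, and a real log would follow the organization's own conventions.

```r
library(tibble)

# Hypothetical recoding log; column names and entries are illustrative only
recoding_log <- tribble(
  ~variable,      ~derived_variable,    ~rule,                                                  ~rule_source,
  "sex",          "sex_label",          "1 = Female, 2 = Male, 3 = Not reported",               "Data dictionary",
  "visit_status", "visit_status_clean", "Free-text variants mapped to Completed/Missed/Pending", "Data manager judgment"
)
```

Each row answers the three questions a reviewer is likely to ask: which value was recoded, what rule was applied, and where the rule came from.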