Data Cleaning and Preparation in R

Recoding Categorical Variables

30-45 minutes | Applied | Step 5 of 10

Part 1

Recoding converts values from one coding scheme to another, and it is common in clinical data preparation. A REDCap export may store sex as `1`, `2`, and `3`, while a report may need `Female`, `Male`, and `Not reported`. A checkbox field may export multiple binary variables that need to be summarized. A free-text field may need classification into controlled categories after review. In every case, recoding should preserve a clear connection to the original value.

A simple recoding can use `case_when()` to create a new variable called `sex_label` without overwriting the original `sex` variable. Preserving the original variable is a good habit, especially during cleaning and review. If the rule ends with a catch-all condition, an unexpected code is labeled explicitly rather than silently converted to missing, and the data manager can then investigate it.
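The `case_when()` recoding described above might look like the following minimal sketch. The data frame `patients`, its `record_id` column, and the sample values are hypothetical, and the mapping `1 = Female`, `2 = Male`, `3 = Not reported` is an assumption that should be checked against the actual data dictionary.

```r
library(dplyr)

# Hypothetical REDCap-style export (illustrative values only).
# Assumed coding: 1 = Female, 2 = Male, 3 = Not reported.
patients <- data.frame(
  record_id = 1:5,
  sex = c(1, 2, 3, 9, NA)
)

patients <- patients %>%
  mutate(
    # New variable; the original `sex` column is left untouched
    sex_label = case_when(
      sex == 1 ~ "Female",
      sex == 2 ~ "Male",
      sex == 3 ~ "Not reported",
      is.na(sex) ~ NA_character_,   # genuinely missing values stay missing
      TRUE ~ "Unexpected code"      # anything else is flagged, not hidden
    )
  )

# Quick review: cross-tabulate original codes against derived labels
count(patients, sex, sex_label)
```

In this sketch the value `9` is labeled `Unexpected code`, so it shows up in the cross-tabulation and can be queried rather than disappearing into the missing values.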

Part 2

The recoding rule should match the data dictionary. If the data dictionary says `1 = Male` and `2 = Female` but the script uses the opposite mapping, every derived label will be wrong, and the error can pass unnoticed because the output still looks plausible. For that reason, recoding scripts should be reviewed and tested against known examples.

When possible, recoding rules can be stored in lookup tables rather than hard-coded repeatedly. In a lookup table approach, the codes and their labels sit in a small table that is joined to the data. This is useful when many variables need recoding or when coding tables are maintained separately. A lookup table can be reviewed more easily than a long chain of conditional statements. It also supports harmonization across studies if the organization uses standard code lists.
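One way to sketch the lookup table approach is a small code table joined to the data with `left_join()`. The names `sex_lookup`, `patients`, `recoded`, and `unmatched` are hypothetical, and the code-to-label mapping is again an assumption standing in for the real data dictionary. The final `stopifnot()` call illustrates testing the rule against known examples.

```r
library(dplyr)

# Hypothetical lookup table; in practice this might be read from a
# separately maintained file and reviewed against the data dictionary.
sex_lookup <- data.frame(
  sex = c(1, 2, 3),
  sex_label = c("Female", "Male", "Not reported"),
  stringsAsFactors = FALSE
)

# Hypothetical export with one code (9) that is not in the lookup table
patients <- data.frame(
  record_id = 1:4,
  sex = c(2, 1, 3, 9)
)

# Attach labels; the original `sex` column is kept alongside the new label
recoded <- patients %>%
  left_join(sex_lookup, by = "sex")

# Codes with no match come back as NA, so list them for investigation
unmatched <- recoded %>%
  filter(is.na(sex_label)) %>%
  distinct(sex)

# Test against known examples taken from the (assumed) data dictionary
stopifnot(
  all(recoded$sex_label[recoded$sex == 1] == "Female"),
  all(recoded$sex_label[recoded$sex == 2] == "Male")
)
```

Because a join leaves unmatched codes as `NA`, the explicit `unmatched` check matters here: without it, an unexpected code would be indistinguishable from a value that was missing in the source data.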

Part 3

Recoding may also be needed for inconsistent text values, for example when the same answer appears with different capitalization, spacing, or abbreviations. This kind of recoding can be useful, but it also reveals a design problem: a field that should have been controlled may have been collected as free text. The better long-term solution may be to revise the CRF or database field rather than clean the exported data repeatedly. R helps manage the current dataset, but clinical data management should also improve upstream data capture.

Recoding decisions should be documented in the cleaning log or script comments. A future reviewer should be able to see why a value was recoded, what rule was used, and whether the recoding came from the data dictionary, protocol, investigator decision, or data manager judgment.
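As a sketch of recoding inconsistent text values, the example below normalizes case and whitespace before mapping the variants to controlled categories. The data frame `visits`, the field `smoking_raw`, and the list of accepted variants are all hypothetical. The comments show one possible way to record the source of the rule in the script itself.

```r
library(dplyr)

# Hypothetical free-text field with inconsistent entries
visits <- data.frame(
  record_id = 1:6,
  smoking_raw = c("Yes", "yes ", "Y", "NO", "n", "unsure"),
  stringsAsFactors = FALSE
)

# Rule source (illustrative): data manager judgment, pending confirmation
# with the investigator. Variants accepted: yes/y and no/n, ignoring case
# and surrounding whitespace.
visits <- visits %>%
  mutate(
    smoking_norm = tolower(trimws(smoking_raw)),
    smoking_clean = case_when(
      is.na(smoking_raw) ~ NA_character_,
      smoking_norm %in% c("yes", "y") ~ "Yes",
      smoking_norm %in% c("no", "n") ~ "No",
      TRUE ~ "Needs review"          # unrecognized text is flagged
    )
  )

# Review every raw value next to its cleaned category
count(visits, smoking_raw, smoking_clean)
```

The raw column is kept, so each cleaned value can be traced back to what was actually entered.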