CLiREN-LMS
Data Cleaning and Preparation in R

7.6 Creating Derived Variables

30-45 minutes Applied Step 3 of 12
Reading 1

Derived variables are variables calculated from one or more existing variables. They are common in clinical research. Examples include age at enrollment, length of hospital stay, body mass index, visit window status, outcome timing, eligibility flags, and composite endpoint indicators. Derived variables can make reports and analysis easier, but they must be created carefully because errors in derivation can affect study conclusions.

The derivation rule should be defined before it is coded. For example, "age at enrollment" may appear simple, but the exact calculation matters. Is age calculated in completed years? Is date of birth complete or partial? Is enrollment date the consent date, screening date, or randomization date? How should missing or inconsistent dates be handled? These questions should be answered using the protocol, CRF guidance, and analysis plan.

The following code derives age in years using date of birth and enrollment date:

```r
library(dplyr)

enrollment_prepared <- enrollment_data |>
  mutate(
    # as.Date() expects ISO dates (YYYY-MM-DD) when given character input
    date_of_birth = as.Date(date_of_birth),
    enrollment_date = as.Date(enrollment_date),
    # Days between the two dates divided by the average year length
    age_years_derived = as.numeric(enrollment_date - date_of_birth) / 365.25
  )
```

This creates an approximate age in years, expressed as a decimal. For some uses, this may be sufficient. For eligibility decisions, however, completed years may be required. The team should define the exact method, and the R code should reflect the agreed rule.

Length of stay is another common derived variable:

```r
admission_data <- admission_data |>
  mutate(
    admission_date = as.Date(admission_date),
    discharge_date = as.Date(discharge_date),
    # Same-day admission and discharge gives 0 under this rule
    length_of_stay_days = as.numeric(discharge_date - admission_date)
  )
```

If a same-day admission and discharge should count as one hospital day, the rule may need to add one:

```r
admission_data <- admission_data |>
  mutate(
    # Same-day admission and discharge gives 1 under this rule
    length_of_stay_days = as.numeric(discharge_date - admission_date) + 1
  )
```

This illustrates why derivations must be defined clinically, not only technically. Both formulas are mathematically understandable, but they answer different operational questions.

Derived variables may also express protocol windows. Suppose a day 28 follow-up visit is considered on time if it occurs between day 25 and day 35 after enrollment:

```r
follow_up_data <- follow_up_data |>
  mutate(
    # Both columns must already be of class Date
    days_from_enrollment = as.numeric(follow_up_date - enrollment_date),
    # Rows with a missing enrollment date match no condition and return NA
    day28_window_status = case_when(
      is.na(follow_up_date) ~ "Missing follow-up date",
      days_from_enrollment >= 25 & days_from_enrollment <= 35 ~ "Within window",
      days_from_enrollment < 25 ~ "Too early",
      days_from_enrollment > 35 ~ "Too late"
    )
  )
```

This derived variable can support monitoring. It does not automatically determine whether the record is acceptable for analysis. The statistical analysis plan may define separate rules for inclusion. The data manager's role is to implement the agreed rule transparently and flag records that need review.
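The window rule above can also be wrapped in a small function so that the limits are visible as arguments and unclassifiable rows are labelled instead of being left as `NA`. The following is a minimal sketch in base R, not a prescribed implementation. The function name `classify_day28_window`, the `"Needs review"` label, and the example data are assumptions made for illustration; only the 25 and 35 day limits come from the example above.

```r
# Hypothetical helper: classify a follow-up visit against a protocol window.
# Assumes both inputs are Date vectors of the same length.
classify_day28_window <- function(enrollment_date, follow_up_date,
                                  lower = 25, upper = 35) {
  days <- as.numeric(follow_up_date - enrollment_date)

  # Start from an explicit review label (assumed wording) so that rows
  # with a missing enrollment date are not silently left as NA
  status <- rep("Needs review", length(days))
  status[is.na(follow_up_date)] <- "Missing follow-up date"

  ok <- !is.na(days)
  status[ok & days < lower] <- "Too early"
  status[ok & days >= lower & days <= upper] <- "Within window"
  status[ok & days > upper] <- "Too late"
  status
}

# Illustrative data only, not taken from any study
visits <- data.frame(
  enrollment_date = as.Date(c("2024-01-01", "2024-01-01", "2024-01-01",
                              "2024-01-01", NA)),
  follow_up_date  = as.Date(c("2024-01-26", "2024-01-20", "2024-02-10",
                              NA, "2024-01-29"))
)

visits$day28_window_status <- classify_day28_window(
  visits$enrollment_date,
  visits$follow_up_date
)

# Counts per status can feed a monitoring report
table(visits$day28_window_status)
```

Because the function is vectorized, it could also be called inside `mutate()`. Whether unclassifiable rows should be labelled, excluded, or queried is a decision for the study team.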
| Derived variable | Source variables | Rule source | Risk if poorly defined |
|---|---|---|---|
| Age at enrollment | Date of birth, enrollment date | Eligibility criteria or analysis plan | Incorrect eligibility classification |
| Length of stay | Admission date, discharge date | Protocol or clinical definition | Misreported secondary outcome |
| Visit window status | Enrollment date, visit date | Schedule of events | Incorrect assessment of protocol adherence |
| BMI | Weight, height | Clinical measurement guidance | Unit errors or implausible values |
| Outcome status | Multiple endpoint fields | Endpoint charter or protocol | Incorrect primary endpoint classification |

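The BMI row in the table names unit errors as the main risk. As a rough illustration, the sketch below derives BMI as weight in kilograms divided by the square of height in metres, and flags values that cannot be derived or look implausible. The function names, the column names, and the plausibility limits of 10 and 80 are assumptions chosen for this example; real limits should come from the study's clinical measurement guidance.

```r
# Hypothetical helper: BMI in kg/m^2 from weight (kg) and height (cm).
derive_bmi <- function(weight_kg, height_cm) {
  height_m <- height_cm / 100
  bmi <- weight_kg / height_m^2
  # Zero or negative heights cannot produce a meaningful BMI
  bmi[!is.na(height_cm) & height_cm <= 0] <- NA_real_
  bmi
}

# Hypothetical review flag: TRUE when BMI is missing or outside assumed limits
flag_bmi_for_review <- function(bmi, min_plausible = 10, max_plausible = 80) {
  is.na(bmi) | bmi < min_plausible | bmi > max_plausible
}

# Illustrative data only: row 2 has height entered in metres instead of cm
measurements <- data.frame(
  weight_kg = c(70, 70, NA, 82),
  height_cm = c(170, 1.70, 165, 0)
)

measurements$bmi_derived <- derive_bmi(measurements$weight_kg,
                                       measurements$height_cm)
measurements$bmi_needs_review <- flag_bmi_for_review(measurements$bmi_derived)

measurements
```

A height recorded in metres where centimetres were expected produces an extremely large BMI, so a simple range check is often enough to surface this kind of unit error for query.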
Derived variables should usually have clear names, such as `age_years_derived` or `day28_window_status`. If the original database already contains an age field entered by users, a derived age field can be compared against it:

```r
age_comparison <- enrollment_prepared |>
  mutate(
    # age_years_entered is the age typed in by site staff
    age_difference = age_years_entered - age_years_derived
  ) |>
  # Keep only records where the two ages differ by more than one year
  filter(abs(age_difference) > 1)
```

This check may identify errors in date of birth, enrollment date, or manually entered age. It also illustrates how derivation can support quality control rather than only analysis.
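The age derivation earlier in this reading divides by 365.25 and so gives an approximate, decimal age. Where the agreed rule is age in completed years, one possible approach is to compare calendar components directly. The sketch below uses base R only. The function name `completed_years` and the example dates are assumptions for illustration, and the treatment of 29 February birthdays (counted as reached on 1 March in non-leap years) is a convention the team would need to confirm.

```r
# Hypothetical helper: age in completed years on a reference date.
# Assumes both inputs are Date vectors of the same length.
completed_years <- function(date_of_birth, reference_date) {
  dob <- as.POSIXlt(date_of_birth)
  ref <- as.POSIXlt(reference_date)

  years <- ref$year - dob$year

  # Subtract one if the birthday has not yet occurred in the reference year
  before_birthday <- (ref$mon < dob$mon) |
    (ref$mon == dob$mon & ref$mday < dob$mday)

  years - as.integer(before_birthday)
}

# Illustrative dates only
enrollment_example <- data.frame(
  date_of_birth   = as.Date(c("2001-01-01", "2000-06-15", "2000-02-29")),
  enrollment_date = as.Date(c("2019-01-01", "2018-06-14", "2018-02-28"))
)

enrollment_example$age_years_completed <- completed_years(
  enrollment_example$date_of_birth,
  enrollment_example$enrollment_date
)

# For comparison: truncating the 365.25-day approximation
enrollment_example$age_years_floor_approx <- floor(
  as.numeric(enrollment_example$enrollment_date -
               enrollment_example$date_of_birth) / 365.25
)

enrollment_example
```

In the first row the participant enrolls on their 18th birthday. The calendar-based rule returns 18, while truncating the 365.25-day approximation returns 17, which is the kind of difference that matters for an age-based eligibility criterion.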