Preparing Data Before Analysis
Part 1
Descriptive analysis should not begin with an unexamined dataset. The data manager should first confirm that the dataset has been imported correctly, cleaned according to agreed rules, and prepared with relevant derived variables. If categorical variables are still coded as unexplained numbers, if dates are still text, if duplicate participant records are unresolved, or if key variables are missing, summary tables may mislead the study team.
Preparation begins with structure. The analyst must know the unit of observation. A participant-level dataset has one row per participant. A visit-level dataset has one row per participant per visit. A laboratory dataset may have one row per participant per specimen, test, or result. If the unit of observation is misunderstood, denominators will be wrong. For example, counting rows in a visit-level dataset does not count participants; it counts visits. Counting rows in a laboratory dataset may count test results, not people.
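The denominator problem can be made concrete with a small sketch. The visit records below are hypothetical, and the names `visits`, `n_rows`, and `n_participants` are illustrative only.

```python
# Hypothetical visit-level data: one tuple per participant per visit.
visits = [
    ("P001", "baseline"), ("P001", "month_3"), ("P001", "month_6"),
    ("P002", "baseline"),
    ("P003", "baseline"), ("P003", "month_3"),
]

# Counting rows counts visits, not people.
n_rows = len(visits)

# Counting distinct identifiers counts participants.
n_participants = len({participant_id for participant_id, _ in visits})

print(f"{n_rows} visit rows from {n_participants} participants")
```

Using the row count as a participant denominator here would double the apparent sample size.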
The following code checks basic structure:
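A minimal sketch of such a check in Python, using only the standard library. The embedded export, the column names, and the helper `describe_structure` are assumptions made for illustration; a real dataset would be read from a file.

```python
import csv
import io

# Hypothetical export. In practice, read a file with open(path, newline="").
RAW = """participant_id,site,sex,age,enrol_date
P001,North,Female,34,2023-01-15
P002,North,Male,51,2023-01-17
P003,South,Female,29,2023-02-02
"""

def describe_structure(rows):
    """Return the row count, the column names, and the first two rows."""
    columns = list(rows[0].keys()) if rows else []
    return {"n_rows": len(rows), "columns": columns, "head": rows[:2]}

# csv.DictReader returns every field as text, so ages and dates
# still need explicit conversion before analysis.
rows = list(csv.DictReader(io.StringIO(RAW)))
summary = describe_structure(rows)

print(summary["n_rows"], "rows,", len(summary["columns"]), "columns")
print(summary["columns"])
for row in summary["head"]:
    print(row)
```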
Part 2
These commands provide a quick view of the dataset. However, the data manager should also verify the participant identifier:
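One way to check the identifier is sketched below. It assumes the identifier column is named `participant_id` and that a blank value means the identifier is missing; both are assumptions, as are the sample records and the helper `check_identifier`.

```python
from collections import Counter

# Hypothetical participant-level records: P002 appears twice, one ID is blank.
records = [
    {"participant_id": "P001", "site": "North"},
    {"participant_id": "P002", "site": "North"},
    {"participant_id": "P002", "site": "North"},
    {"participant_id": "P003", "site": "South"},
    {"participant_id": "", "site": "South"},
]

def check_identifier(records, key="participant_id"):
    """Summarize missing, unique, and repeated identifier values."""
    ids = [r[key].strip() for r in records]
    counts = Counter(i for i in ids if i)
    return {
        "n_rows": len(records),
        "n_missing_id": sum(1 for i in ids if not i),
        "n_unique": len(counts),
        "duplicates": {i: n for i, n in counts.items() if n > 1},
    }

report = check_identifier(records)
print(report)
```

In a clean participant-level dataset, `n_rows` equals `n_unique`, `n_missing_id` is zero, and `duplicates` is empty.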
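If a participant-level dataset has duplicate participant rows, the summary should not proceed until the reason is understood. Duplicates may indicate repeated instruments, multiple visits, accidental duplicate records, or an incorrect join. The right fix depends on which of these applies: repeated instruments and multiple visits mean the dataset is not actually participant-level and must be summarized at its true unit of observation, whereas accidental duplicates and incorrect joins must be corrected before any counts are reported.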
Categorical variables should be reviewed before summarization:
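A simple frequency table per variable is usually enough for this review. The sketch below assumes hypothetical columns `site` and `sex` and shows raw values exactly as stored, with blanks given a visible placeholder so that they are not overlooked.

```python
from collections import Counter

# Hypothetical records with inconsistent spellings and a blank value.
records = [
    {"site": "North", "sex": "Female"},
    {"site": "north", "sex": "female"},
    {"site": "South", "sex": "F"},
    {"site": "South", "sex": ""},
]

def frequency_table(records, column):
    """Count each distinct raw value, keeping blanks visible."""
    return Counter(r[column] if r[column] != "" else "<blank>" for r in records)

for column in ("site", "sex"):
    print(column)
    for value, n in frequency_table(records, column).most_common():
        print(f"  {value!r}: {n}")
```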
Part 3
These commands check whether the categories are expected. Unexpected values such as misspelled site names, blank treatment arms, or mixed capitalization should be corrected or documented before final summaries are produced. A table that separates `Female`, `female`, and `F` may be technically accurate but not clinically useful.
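One possible correction step is to map known variants to agreed labels and list everything else for review rather than guessing. The mapping `SEX_LABELS` and the helper `harmonize` below are hypothetical; the actual rules should come from the study's data cleaning plan.

```python
# Hypothetical mapping from lower-cased raw values to agreed labels.
SEX_LABELS = {"female": "Female", "f": "Female", "male": "Male", "m": "Male"}

def harmonize(values, mapping):
    """Return cleaned labels plus the raw values that could not be mapped."""
    cleaned, unmapped = [], []
    for value in values:
        key = value.strip().lower()
        if key in mapping:
            cleaned.append(mapping[key])
        else:
            # Leave unknown values as None and report them for documentation.
            cleaned.append(None)
            unmapped.append(value)
    return cleaned, unmapped

raw_values = ["Female", "female", "F", " male", "", "Unknwn"]
cleaned, unmapped = harmonize(raw_values, SEX_LABELS)
print(cleaned)
print("Needs review:", unmapped)
```

Returning the unmapped values separately keeps the remaining problems visible instead of silently recoding them.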
Numeric variables should also be inspected:
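A minimal sketch of a numeric summary, assuming values arrive as text and that an empty string or `NA` marks a missing value. The `MISSING` set, the helper `numeric_summary`, and the sample ages are assumptions for illustration.

```python
import statistics

# Hypothetical missing-value codes; adjust to the study's codebook.
MISSING = {"", "NA"}

def numeric_summary(raw_values):
    """Summarize a text column that is expected to hold numbers."""
    values, non_numeric = [], []
    n_missing = 0
    for raw in raw_values:
        text = raw.strip()
        if text in MISSING:
            n_missing += 1
            continue
        try:
            values.append(float(text))
        except ValueError:
            # Keep unparseable entries so they can be reviewed.
            non_numeric.append(raw)
    return {
        "n": len(values),
        "n_missing": n_missing,
        "non_numeric": non_numeric,
        "min": min(values) if values else None,
        "median": statistics.median(values) if values else None,
        "max": max(values) if values else None,
    }

ages = ["34", "51", "29", "240", "", "NA", "forty"]
age_summary = numeric_summary(ages)
print(age_summary)
```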
This simple check can reveal extreme or impossible values. A maximum age of 240 years may indicate a date problem. A minimum weight of 0 kg may indicate that a missing code was imported as a real value. Numeric summaries should therefore be interpreted as part of quality review, not merely as report content.
Part 4
The strongest descriptive analysis begins with a prepared dataset whose assumptions are known. This does not mean the dataset is perfect. It means that remaining issues are visible and documented.