7.2 Raw, Cleaned, and Analysis-Ready Datasets
A common mistake in research projects is to treat any exported spreadsheet as the dataset. In practice, a study may have several legitimate data states. The raw export is the dataset as received from the electronic data capture system or external source. The cleaned dataset is the dataset after data quality issues have been resolved according to the study process. The analysis-ready dataset is a structured dataset prepared for a specific statistical or reporting purpose. These states should not be confused.
The raw dataset is the starting point. It should be preserved because it provides the evidence of what was exported at a particular time. If a script produces an unexpected result, the team can return to the raw file and rerun the workflow. If a cleaning decision is questioned, the team can compare raw and derived outputs. If a new rule is added, the workflow can be rerun without relying on a manually edited spreadsheet.
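One way to make that preservation checkable is to record a checksum when the export is received and compare it before each rerun. The following is a minimal sketch using base R only; the temporary file stands in for a real export, and the object names (`raw_path`, `checksum_at_receipt`, `raw_unchanged`) are illustrative assumptions rather than part of any prescribed workflow.

```r
# Hypothetical example: a temporary file stands in for the raw export
raw_path <- tempfile(pattern = "redcap_export_", fileext = ".csv")
writeLines(c("participant_id,site", "P001,A", "P002,B"), raw_path)

# Record a checksum at the time the export is received
checksum_at_receipt <- unname(tools::md5sum(raw_path))

# Mark the file read-only to discourage accidental edits
# (file permissions behave differently across operating systems)
Sys.chmod(raw_path, mode = "0444")

# Later, before rerunning the workflow, confirm the file is unchanged
raw_unchanged <- identical(unname(tools::md5sum(raw_path)), checksum_at_receipt)
if (!raw_unchanged) {
  stop("Raw export has changed since it was received: ", raw_path)
}
```

In practice the checksum would be stored alongside the export, for example in a log file, so that the comparison can be made in a later session.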
The cleaned dataset is not necessarily a separate file created by manually editing values. In many good workflows, it is generated from the current database export after queries have been resolved in the source system. For example, if an R check finds that participant `P023` has a discharge date earlier than the admission date, the data manager raises a query in REDCap. The site reviews the source record and corrects the discharge date in REDCap, and the REDCap audit trail captures the correction. The next export then contains the corrected value, and when the R check is rerun the flag disappears. This workflow keeps the source system as the authoritative location for the data.
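The date check in this example might look like the sketch below. The data frame and its column names (`admission_date`, `discharge_date`) are hypothetical stand-ins for an actual export, and the query wording is only illustrative. Note that the check produces a listing of records to query; it does not change any values.

```r
library(dplyr)

# Hypothetical records standing in for a REDCap export
admissions <- tibble(
  participant_id = c("P021", "P022", "P023"),
  admission_date = as.Date(c("2026-05-01", "2026-05-03", "2026-05-10")),
  discharge_date = as.Date(c("2026-05-04", "2026-05-09", "2026-05-08"))
)

# Flag records for query; the data themselves are not modified
date_order_flags <- admissions |>
  filter(discharge_date < admission_date) |>
  mutate(
    query_text = "Discharge date is earlier than admission date; please verify against source."
  )

date_order_flags
```

Once the site has corrected the value in the source system and a new export is taken, rerunning the same check on the new export would return no rows for that participant.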
The analysis-ready dataset may differ from the cleaned database export because analysis often requires specific structures. A statistician may need one row per participant, one row per visit, or one row per participant per laboratory test. Variables may need labels, factor levels, censoring indicators, or endpoint derivations. Some administrative fields may be removed. Some derived variables may be created. The analysis-ready dataset should be produced by documented scripts, and its derivations should be described in the statistical analysis plan, data management plan, or analysis dataset specification.
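As a rough illustration of such a derivation, the sketch below collapses visit-level data to one row per participant, applies factor labels, derives a follow-up duration and a censoring indicator, and leaves out an administrative field. All column names, arm labels, and derivation rules here are assumptions made for the example; in a real study they would come from the statistical analysis plan or dataset specification.

```r
library(dplyr)

# Hypothetical cleaned data: one row per participant per visit
cleaned_visits <- tibble(
  participant_id = c("P001", "P001", "P002", "P002", "P003"),
  arm_code       = c(1, 1, 2, 2, 1),
  visit_date     = as.Date(c("2026-01-05", "2026-02-02", "2026-01-07",
                             "2026-03-01", "2026-01-10")),
  event_occurred = c(0, 1, 0, 0, 0),
  form_complete  = c(2, 2, 2, 2, 2)  # administrative field, not carried forward
)

# Derive one row per participant
analysis_dataset <- cleaned_visits |>
  group_by(participant_id) |>
  summarise(
    arm         = first(arm_code),
    first_visit = min(visit_date),
    last_visit  = max(visit_date),
    n_visits    = n(),
    event       = as.integer(any(event_occurred == 1)),
    .groups     = "drop"
  ) |>
  mutate(
    # Illustrative labels; real labels come from the study specification
    arm           = factor(arm, levels = c(1, 2), labels = c("Control", "Intervention")),
    followup_days = as.integer(last_visit - first_visit),
    censored      = as.integer(event == 0)
  )

analysis_dataset
```

Because every derived variable is produced by code, the script itself becomes part of the documentation of how the analysis dataset was built.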
An R project can keep these states distinct through its folder structure and script naming:
```text
clinical_dm_project/
  data_raw/
    redcap_export_2026-06-01.csv
  data_clean/
    redcap_cleaned_2026-06-08.csv
  data_analysis/
    analysis_dataset_v01.csv
  scripts/
    01_import_raw_data.R
    02_run_quality_checks.R
    03_prepare_clean_dataset.R
    04_create_analysis_dataset.R
  outputs/
    query_listing_2026-06-01.csv
    cleaning_summary_2026-06-01.html
```
This structure is only an example. The important principle is that each file location communicates the file's role. Raw data should not be mixed with outputs. Analysis datasets should not be confused with original database exports. Scripts should clearly show how outputs are produced.
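A folder layout like this can itself be created by a script, which keeps the structure consistent across projects. The sketch below uses base R and builds the folders inside a temporary directory for demonstration; the helper `dated_output_path()` is a hypothetical convenience function, not part of any package.

```r
# Hypothetical project root inside a temporary directory for demonstration
project_root <- file.path(tempdir(), "clinical_dm_project")

project_dirs <- c("data_raw", "data_clean", "data_analysis", "scripts", "outputs")

# Create each folder; existing folders are left as they are
for (dir_name in project_dirs) {
  dir.create(file.path(project_root, dir_name),
             recursive = TRUE, showWarnings = FALSE)
}

# Hypothetical helper that builds a dated path in the outputs folder
dated_output_path <- function(stem, export_date, ext = "csv") {
  file_name <- paste0(stem, "_", format(as.Date(export_date), "%Y-%m-%d"), ".", ext)
  file.path(project_root, "outputs", file_name)
}

basename(dated_output_path("query_listing", "2026-06-01"))
```

Building file names from the export date in one place reduces the chance that an output is labelled with the wrong date.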
The following R code illustrates a simple import and output pattern:
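```r
library(tidyverse)
library(janitor)

# Import the raw export and standardise column names to snake_case
raw_enrollment <- read_csv("data_raw/redcap_export_2026-06-01.csv") |>
  clean_names()

# Summarise basic quality indicators without modifying the data
quality_summary <- raw_enrollment |>
  summarise(
    n_records = n(),
    n_sites = n_distinct(site),
    missing_consent_dates = sum(is.na(consent_date)),
    duplicate_ids = n() - n_distinct(participant_id)
  )

# Write the summary to outputs/, leaving the raw export untouched
write_csv(quality_summary, "outputs/quality_summary_2026-06-01.csv")
```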
This script imports raw data, creates a summary output, and writes the output to the `outputs` folder. It does not overwrite the raw export. That small discipline is fundamental. When workflows grow more complex, the same principle remains: preserve inputs, document transformations, and write outputs separately.
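The rule of writing outputs separately can also be enforced in code rather than left to habit. The sketch below defines a hypothetical wrapper, `write_output_csv()`, that refuses to write into a folder named `data_raw` and refuses to overwrite an existing file. The function name, the folder name, and the choice to stop with an error are all assumptions made for illustration.

```r
# Hypothetical helper: refuse to write into the raw data folder
write_output_csv <- function(data, path, raw_dir = "data_raw") {
  # Split the path on forward or back slashes and look for the raw folder
  path_parts <- strsplit(path, "[/\\\\]")[[1]]
  if (raw_dir %in% path_parts) {
    stop("Refusing to write into the raw data folder: ", path)
  }
  if (file.exists(path)) {
    stop("Output already exists and will not be overwritten: ", path)
  }
  utils::write.csv(data, path, row.names = FALSE)
  invisible(path)
}

# Demonstration in a temporary directory
demo_outputs <- tempfile("outputs_")
dir.create(demo_outputs)
summary_path <- file.path(demo_outputs, "quality_summary_2026-06-01.csv")
write_output_csv(data.frame(n_records = 3L), summary_path)

# A path inside data_raw is rejected before anything is written
raw_attempt <- tryCatch(
  write_output_csv(data.frame(n_records = 3L),
                   "data_raw/redcap_export_2026-06-01.csv"),
  error = function(e) conditionMessage(e)
)
```

A guard like this is a convenience, not a substitute for file permissions or version control, but it catches the most common accident: a mistyped output path.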