Inspecting Data with glimpse, summary, head, names, and dim
Part 1
Importing data is only the beginning. The data manager must inspect the dataset before trusting it. Inspection is the process of understanding what R has imported: how many rows and columns exist, what variables are present, what data types were assigned, what values appear in each field, and whether the file resembles the expected export. Inspection is not the same as full quality control, but it is the foundation for quality control.
The following functions are useful immediately after import:
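As a minimal sketch, assume the import step produced a data frame called `enrollment` and that the `dplyr` package is installed for `glimpse()`. The data frame name and the three invented rows are placeholders, not a real export:

```r
library(dplyr)  # provides glimpse(); the other functions are base R

# Hypothetical stand-in for an imported export (invented values).
enrollment <- data.frame(
  participant_id = c("P001", "P002", "P003"),
  site           = c("Site A", "Site B", "Site A"),
  consent_date   = as.Date(c("2024-01-10", "2024-01-12", "2024-01-15")),
  sex            = c("Female", "Male", "Female")
)

glimpse(enrollment)   # columns, types, and example values
summary(enrollment)   # basic summary of each variable
head(enrollment)      # first rows (six by default)
names(enrollment)     # column names
dim(enrollment)       # number of rows, then number of columns
```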
Each function answers a different question. `glimpse()` provides a compact view of columns, types, and example values. `summary()` gives basic summaries for each variable. `head()` shows the first few rows. `names()` lists column names. `dim()` returns the dimensions of the dataset: number of rows and number of columns.
Part 2
The data manager should not run these commands mechanically. Each output should be interpreted against expectations from the protocol, CRF, data dictionary, and export settings. If the screening log contains 500 participants but the imported dataset has 420 rows, the difference should be explained. If the data dictionary has 120 variables but the export has 95 columns, the export may have excluded some instruments or fields. If a date column appears as character text, date checks may fail later.
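The date example can be checked directly. The sketch below assumes a hypothetical `enrollment` data frame in which `consent_date` was imported as text, and it assumes the export writes dates as `YYYY-MM-DD`; both are illustrative assumptions and should be adjusted to the actual export settings:

```r
# Hypothetical import in which consent_date arrived as text (invented values).
enrollment <- data.frame(
  participant_id = c("P001", "P002", "P003"),
  consent_date   = c("2024-01-10", "2024-01-12", "unknown"),
  stringsAsFactors = FALSE
)

class(enrollment$consent_date)   # shows whether R stored the column as text

# Parse with an explicit format (assumed here); values that do not match become NA.
parsed_date <- as.Date(enrollment$consent_date, format = "%Y-%m-%d")

# Rows with a non-missing text value that failed to parse need review.
unparsed <- enrollment[is.na(parsed_date) & !is.na(enrollment$consent_date), ]
unparsed
```

Listing the unparsed rows, rather than silently converting the column, keeps the problem visible so it can be raised with the site or traced to the export settings.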
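The following table links inspection commands to practical questions:

| Command | Practical question |
| --- | --- |
| `dim()` | Does the number of rows and columns match the expected export? |
| `names()` | Are all expected variables present, with the expected names? |
| `glimpse()` | Was each column imported with a sensible type, for example dates as dates rather than character text? |
| `summary()` | Do the basic summaries of each variable look plausible? |
| `head()` | Do the first few rows resemble the expected export? |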
Suppose a REDCap enrollment export is expected to include `participant_id`, `site`, `consent_date`, `date_of_birth`, and `sex`. The data manager can check whether those fields are present:
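A minimal sketch of that check, using base R only. The `enrollment` data frame and its values are invented for illustration, and `date_of_birth` is deliberately left out so that the check has something to report:

```r
# Hypothetical export that is missing date_of_birth (invented values).
enrollment <- data.frame(
  participant_id = c("P001", "P002", "P003"),
  site           = c("Site A", "Site B", "Site A"),
  consent_date   = as.Date(c("2024-01-10", "2024-01-12", "2024-01-15")),
  sex            = c("Female", "Male", "Female")
)

# Fields the export is expected to contain.
required_fields <- c("participant_id", "site", "consent_date",
                     "date_of_birth", "sex")

# Names in required_fields that are absent from the dataset.
missing_fields <- setdiff(required_fields, names(enrollment))
missing_fields
```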
Part 3
The `setdiff()` function returns items in `required_fields` that are not present in the dataset names. If it returns an empty vector, all required fields are present. If it returns one or more names, the script has detected a structural problem. This may mean that the export omitted fields, names were transformed differently, or the wrong file was imported.
The number of rows can also be checked against expectations:
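A minimal sketch using `nrow()` and `ncol()`. The `enrollment` data frame and the expected counts are hypothetical; in practice the expected values would come from the screening log and the data dictionary:

```r
# Hypothetical export with three records (invented values).
enrollment <- data.frame(
  participant_id = c("P001", "P002", "P003"),
  site           = c("Site A", "Site B", "Site A"),
  sex            = c("Female", "Male", "Female")
)

# Hypothetical expectations, e.g. from the screening log and data dictionary.
expected_rows <- 5
expected_cols <- 3

n_rows <- nrow(enrollment)
n_cols <- ncol(enrollment)

# Report mismatches instead of stopping, so the full script still runs.
if (n_rows != expected_rows) {
  message("Row count is ", n_rows, "; expected ", expected_rows)
}
if (n_cols != expected_cols) {
  message("Column count is ", n_cols, "; expected ", expected_cols)
}
```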
`nrow()` and `ncol()` return the number of rows and the number of columns, respectively. They give the same information as `dim()` but are often easier to read in a script. A weekly data quality script may print or store these values so that the team can detect sudden changes in export size.
Part 4
Inspection should also include categorical values. For example:
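One way to do this in base R is `table()`. The sketch below uses an invented `enrollment` data frame with deliberately inconsistent values. `useNA = "ifany"` is included so that missing values are counted rather than dropped:

```r
# Hypothetical export with inconsistent categories (invented values).
enrollment <- data.frame(
  participant_id = sprintf("P%03d", 1:6),
  site           = c("Site A", "Site A", "Site B", "Site B", "Site B", "site b"),
  sex            = c("Female", "Male", "F", "female", NA, "Male")
)

# Count records per category, including missing values if any exist.
site_counts <- table(enrollment$site, useNA = "ifany")
sex_counts  <- table(enrollment$sex, useNA = "ifany")

site_counts
sex_counts
```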
These commands count records by site and sex. Unexpected categories may indicate data entry errors, coding changes, or import issues. For example, if sex is expected to contain only `Female`, `Male`, and `Not reported`, but the table includes `F`, `M`, `female`, and blank values, the data manager has identified a standardization problem.
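The comparison against an allowed list can also be scripted. This is a sketch under stated assumptions: the `sex` vector is invented, and `allowed_sex` is a hypothetical code list that would normally come from the data dictionary:

```r
# Hypothetical values as they might appear in an export (invented).
sex <- c("Female", "Male", "F", "female", "", "Not reported", "M")

# Hypothetical allowed categories, e.g. taken from the data dictionary.
allowed_sex <- c("Female", "Male", "Not reported")

# Distinct values that are not in the allowed list (blank strings included).
unexpected <- setdiff(unique(sex), allowed_sex)
unexpected

# How often each unexpected value occurs.
table(sex[!sex %in% allowed_sex])
```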
The `janitor` package provides `tabyl()`, which creates simple frequency tables:
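A minimal sketch, assuming the `janitor` package is installed. The `enrollment` data frame is invented for illustration:

```r
library(janitor)

# Hypothetical export (invented values).
enrollment <- data.frame(
  site = c("Site A", "Site A", "Site B", "Site B"),
  sex  = c("Female", "Male", "Female", "female")
)

# One-way frequency table with counts (n) and proportions (percent).
sex_table <- tabyl(enrollment, sex)
sex_table

# Two-way table: sites in rows, sex categories in columns.
site_by_sex <- tabyl(enrollment, site, sex)
site_by_sex
```

Unlike `table()`, `tabyl()` returns a data frame, so the result can be filtered, joined, or written to a report like any other dataset.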
Part 5
Frequency tables are useful because many data issues are visible at the categorical level. A site name may be misspelled. A visit status may contain both `Completed` and `complete`. A yes/no variable may contain `Yes`, `No`, `Y`, `N`, and `Unknown`. These may seem like small issues, but they can affect reports, queries, and analysis.