Data Cleaning Process

Standard Cleaning Procedures

1. Initial Data Assessment

Review data completeness and quality metrics
Generate summary statistics for all variables
Identify missing or incomplete records
Flag potential data quality issues
Document initial state of dataset

2. Automated Cleaning Steps

Standardization
- Convert all dates to YYYY-MM-DD format
- Standardize text case (upper/lower) as appropriate
- Normalize units of measurement
- Apply consistent coding schemes
Missing Data Handling
- Flag missing values
- Assess patterns of missingness
- Apply appropriate imputation where valid
- Document all missing data decisions
Outlier Detection
- Calculate statistical bounds for numeric variables
- Flag biological impossibilities
- Review extreme values
- Document outlier handling decisions
Deduplication
- Identify potential duplicate records
- Compare record pairs
- Merge or remove duplicates
- Maintain audit trail of changes

3. Manual Review Process

Expert Review
- Clinical review of flagged values
- Assessment of unusual patterns
- Validation of automated decisions
- Documentation of override reasons
Quality Checks
- Cross-reference with source data
- Verify referential integrity
- Validate derived variables
- Check logical consistency

4. Documentation Requirements

Each cleaning step must include:

Description of issue addressed
Methodology used
Number of records affected
Before/after comparisons
Justification for changes
SQL or code snippets used
Reviewer sign-off

5. Validation Steps

Technical Validation
- Run automated test suites
- Verify data types and formats
- Check constraint compliance
- Validate calculations
Clinical Validation
- Review by subject matter experts
- Check clinical plausibility
- Verify coding accuracy
- Validate derived measures

6. Final Quality Control

Generate final quality metrics
Produce validation reports
Document all cleaning steps
Archive original and cleaned datasets
Create reproducible cleaning workflow

7. Ongoing Maintenance

Regular quality monitoring
Periodic data audits
Update cleaning procedures
Maintain cleaning documentation
Track data quality metrics

Quality Standards

All cleaned data must meet:

99.9% completeness for required fields
100% referential integrity
No invalid codes or values
Full documentation of changes
Reproducible cleaning process

Tools and Resources

Custom SQL cleaning scripts
Statistical analysis software
Data validation tools
Documentation templates
Quality control checklists

Data Files Overview

Data Dictionary

The dataset contains several key variable categories:

Diagnosis Variables
- Includes diagnosis codes, dates, severity
- Links to visits through visit_id
- Tracks chronic conditions and resolution
Medication Variables
- Prescription details and status
- Dosage, frequency, dates
- Pharmacy and prescriber tracking
Lab Results
- Test results with reference ranges
- Collection and result dates
- Abnormality flagging
Vital Signs
- Standard clinical measurements
- Height, weight, BMI
- Cardiovascular and respiratory metrics

See the full data dictionary for detailed variable specifications.

File Categories

AINQ: Inquiry data files
RSLT: Results data files
JUDD: Judgment data files

📌 Created 64 file sets

Code

quarto_tables <- readRDS(sprintf("%sdata/processed/_objects/_rds/quarto_tables.rds", rootdir))
quarto_tables$table_participants %>% 
  filter(is.na(DATE_BIRTH)) %>% 
  mutate(DATE_BIRTH = paste0(substr(BRTHDD, 1, 4), "-", substr(BRTHDD, 5, 6), "-", substr(BRTHDD, 7, 8))) %>%
  mutate(YEAR = substr(BRTHDD, 1, 4),
         MONTH = substr(BRTHDD, 5, 6),
         DAY = substr(BRTHDD, 7, 8),
         FLAG_BIRTH = "INVALID DATE") %>%
  select(FLAG_BIRTH, everything(), -BRTHDD, -SEX) %>%
  reactable(
    searchable = TRUE,
    filterable = TRUE,
    showPageSizeOptions = TRUE,
    pageSizeOptions = c(10, 20, 50),
    defaultPageSize = 10,
    striped = TRUE,
    highlight = TRUE,
    bordered = TRUE,
    defaultColDef = colDef(
      align = "left",
      minWidth = 120,
      headerStyle = list(background = "#f7f7f8")
    ),
    columns = list(
      FLAG_BIRTH = colDef(
        name = "Flag",
        style = list(fontWeight = "bold")
      ),
      DATE_BIRTH = colDef(
        name = "Birth Date",
        style = list(color = "#2c3e50")
      ),
      YEAR = colDef(
        name = "Year",
        style = list(color = "#2c3e50")
      ),
      MONTH = colDef(
        name = "Month",
        style = list(color = "#2c3e50")
      ),
      DAY = colDef(
        name = "Day",
        style = list(color = "#2c3e50")
      )
    ),
    theme = reactableTheme(
      borderColor = "#dfe2e5",
      stripedColor = "#f6f8fa",
      headerStyle = list(
        backgroundColor = "#f8f9fa",
        color = "#1a162d",
        fontWeight = "bold",
        borderColor = "#dfe2e5"
      ),
      cellStyle = list(
        padding = "8px 12px"
      )
    )
  )

Code

cat("📌 Found", nrow(quarto_tables$table_participants %>% filter(is.na(DATE_BIRTH))), "participants with invalid birth dates\n")

📌 Found 83 participants with invalid birth dates

Code

quarto_tables$table_participants %>% 
  filter(lubridate::year(DATE_BIRTH) < 1925 | lubridate::year(DATE_BIRTH) > 2014) %>%
  mutate(DATE_BIRTH = paste0(substr(BRTHDD, 1, 4), "-", substr(BRTHDD, 5, 6), "-", substr(BRTHDD, 7, 8))) %>%
  mutate(CURRENT_AGE = round(as.numeric(difftime(Sys.Date(), DATE_BIRTH, units = "days")) / 365.25),
         FLAG_BIRTH = case_when(
           CURRENT_AGE < 0 ~ "IMPOSSIBLE",
           CURRENT_AGE > 80 ~ "TOO OLD",
           CURRENT_AGE < 18 ~ "TOO YOUNG",
           TRUE ~ NA_character_)) %>%
  select(FLAG_BIRTH, everything(), -BRTHDD, -SEX) %>%
  reactable(
    searchable = TRUE,
    filterable = TRUE,
    showPageSizeOptions = TRUE,
    pageSizeOptions = c(10, 20, 50),
    defaultPageSize = 10,
    striped = TRUE,
    highlight = TRUE,
    bordered = TRUE,
    defaultColDef = colDef(
      align = "left",
      minWidth = 120,
      headerStyle = list(background = "#f7f7f8")
    ),
    columns = list(
      CURRENT_AGE = colDef(
        name = "Current Age",
        sortable = TRUE
      ),
      FLAG_BIRTH = colDef(
        name = "Flag",
        style = list(fontWeight = "bold")
      )
    ),
    defaultSorted = list("CURRENT_AGE" = "desc"),
    groupBy = "FLAG_BIRTH",
    theme = reactableTheme(
      borderColor = "#dfe2e5",
      stripedColor = "#f6f8fa",
      headerStyle = list(
        backgroundColor = "#f8f9fa",
        color = "#1a162d",
        fontWeight = "bold",
        borderColor = "#dfe2e5"
      ),
      cellStyle = list(
        padding = "8px 12px"
      )
    )
  )

Code

cat("📌 Found", nrow(quarto_tables$table_participants %>% filter(lubridate::year(DATE_BIRTH) < 1925 | lubridate::year(DATE_BIRTH) > 2014)), "participants with invalid age\n")

📌 Found 941 participants with invalid age

Code

cat("\n🔍 Current age standards:\n",
"- Too old: Over 100 years old\n",
    "- Too young: Under 10 years old\n",
    "- Impossible: Negative age (birth date in future)\n")


🔍 Current age standards:
 - Too old: Over 100 years old
 - Too young: Under 10 years old
 - Impossible: Negative age (birth date in future)