Data Cleaning Process

Standard Cleaning Procedures

1. Initial Data Assessment

  • Review data completeness and quality metrics
  • Generate summary statistics for all variables
  • Identify missing or incomplete records
  • Flag potential data quality issues
  • Document initial state of dataset

2. Automated Cleaning Steps

  1. Standardization
    • Convert all dates to YYYY-MM-DD format
    • Standardize text case (upper/lower) as appropriate
    • Normalize units of measurement
    • Apply consistent coding schemes
  2. Missing Data Handling
    • Flag missing values
    • Assess patterns of missingness
    • Apply appropriate imputation where valid
    • Document all missing data decisions
  3. Outlier Detection
    • Calculate statistical bounds for numeric variables
    • Flag biological impossibilities
    • Review extreme values
    • Document outlier handling decisions
  4. Deduplication
    • Identify potential duplicate records
    • Compare record pairs
    • Merge or remove duplicates
    • Maintain audit trail of changes

3. Manual Review Process

  1. Expert Review
    • Clinical review of flagged values
    • Assessment of unusual patterns
    • Validation of automated decisions
    • Documentation of override reasons
  2. Quality Checks
    • Cross-reference with source data
    • Verify referential integrity
    • Validate derived variables
    • Check logical consistency

4. Documentation Requirements

Each cleaning step must include:

  • Description of issue addressed
  • Methodology used
  • Number of records affected
  • Before/after comparisons
  • Justification for changes
  • SQL or code snippets used
  • Reviewer sign-off

5. Validation Steps

  1. Technical Validation
    • Run automated test suites
    • Verify data types and formats
    • Check constraint compliance
    • Validate calculations
  2. Clinical Validation
    • Review by subject matter experts
    • Check clinical plausibility
    • Verify coding accuracy
    • Validate derived measures

6. Final Quality Control

  • Generate final quality metrics
  • Produce validation reports
  • Document all cleaning steps
  • Archive original and cleaned datasets
  • Create reproducible cleaning workflow

7. Ongoing Maintenance

  • Regular quality monitoring
  • Periodic data audits
  • Update cleaning procedures
  • Maintain cleaning documentation
  • Track data quality metrics

Quality Standards

All cleaned data must meet:

  • 99.9% completeness for required fields
  • 100% referential integrity
  • No invalid codes or values
  • Full documentation of changes
  • Reproducible cleaning process

Tools and Resources

  • Custom SQL cleaning scripts
  • Statistical analysis software
  • Data validation tools
  • Documentation templates
  • Quality control checklists

Data Files Overview

Data Dictionary

The dataset contains several key variable categories:

  1. Diagnosis Variables
    • Includes diagnosis codes, dates, severity
    • Links to visits through visit_id
    • Tracks chronic conditions and resolution
  2. Medication Variables
    • Prescription details and status
    • Dosage, frequency, dates
    • Pharmacy and prescriber tracking
  3. Lab Results
    • Test results with reference ranges
    • Collection and result dates
    • Abnormality flagging
  4. Vital Signs
    • Standard clinical measurements
    • Height, weight, BMI
    • Cardiovascular and respiratory metrics

See the full data dictionary for detailed variable specifications.

File Categories
  • AINQ: Inquiry data files
  • RSLT: Results data files
  • JUDD: Judgment data files
📌 Created 64 file sets
Code
quarto_tables <- readRDS(sprintf("%sdata/processed/_objects/_rds/quarto_tables.rds", rootdir))
quarto_tables$table_participants %>% 
  filter(is.na(DATE_BIRTH)) %>% 
  mutate(DATE_BIRTH = paste0(substr(BRTHDD, 1, 4), "-", substr(BRTHDD, 5, 6), "-", substr(BRTHDD, 7, 8))) %>%
  mutate(YEAR = substr(BRTHDD, 1, 4),
         MONTH = substr(BRTHDD, 5, 6),
         DAY = substr(BRTHDD, 7, 8),
         FLAG_BIRTH = "INVALID DATE") %>%
  select(FLAG_BIRTH, everything(), -BRTHDD, -SEX) %>%
  reactable(
    searchable = TRUE,
    filterable = TRUE,
    showPageSizeOptions = TRUE,
    pageSizeOptions = c(10, 20, 50),
    defaultPageSize = 10,
    striped = TRUE,
    highlight = TRUE,
    bordered = TRUE,
    defaultColDef = colDef(
      align = "left",
      minWidth = 120,
      headerStyle = list(background = "#f7f7f8")
    ),
    columns = list(
      FLAG_BIRTH = colDef(
        name = "Flag",
        style = list(fontWeight = "bold")
      ),
      DATE_BIRTH = colDef(
        name = "Birth Date",
        style = list(color = "#2c3e50")
      ),
      YEAR = colDef(
        name = "Year",
        style = list(color = "#2c3e50")
      ),
      MONTH = colDef(
        name = "Month",
        style = list(color = "#2c3e50")
      ),
      DAY = colDef(
        name = "Day",
        style = list(color = "#2c3e50")
      )
    ),
    theme = reactableTheme(
      borderColor = "#dfe2e5",
      stripedColor = "#f6f8fa",
      headerStyle = list(
        backgroundColor = "#f8f9fa",
        color = "#1a162d",
        fontWeight = "bold",
        borderColor = "#dfe2e5"
      ),
      cellStyle = list(
        padding = "8px 12px"
      )
    )
  )
Code
cat("📌 Found", nrow(quarto_tables$table_participants %>% filter(is.na(DATE_BIRTH))), "participants with invalid birth dates\n")
📌 Found 83 participants with invalid birth dates
Code
quarto_tables$table_participants %>% 
  filter(lubridate::year(DATE_BIRTH) < 1925 | lubridate::year(DATE_BIRTH) > 2014) %>%
  mutate(DATE_BIRTH = paste0(substr(BRTHDD, 1, 4), "-", substr(BRTHDD, 5, 6), "-", substr(BRTHDD, 7, 8))) %>%
  mutate(CURRENT_AGE = round(as.numeric(difftime(Sys.Date(), DATE_BIRTH, units = "days")) / 365.25),
         FLAG_BIRTH = case_when(
           CURRENT_AGE < 0 ~ "IMPOSSIBLE",
           CURRENT_AGE > 80 ~ "TOO OLD",
           CURRENT_AGE < 18 ~ "TOO YOUNG",
           TRUE ~ NA_character_)) %>%
  select(FLAG_BIRTH, everything(), -BRTHDD, -SEX) %>%
  reactable(
    searchable = TRUE,
    filterable = TRUE,
    showPageSizeOptions = TRUE,
    pageSizeOptions = c(10, 20, 50),
    defaultPageSize = 10,
    striped = TRUE,
    highlight = TRUE,
    bordered = TRUE,
    defaultColDef = colDef(
      align = "left",
      minWidth = 120,
      headerStyle = list(background = "#f7f7f8")
    ),
    columns = list(
      CURRENT_AGE = colDef(
        name = "Current Age",
        sortable = TRUE
      ),
      FLAG_BIRTH = colDef(
        name = "Flag",
        style = list(fontWeight = "bold")
      )
    ),
    defaultSorted = list("CURRENT_AGE" = "desc"),
    groupBy = "FLAG_BIRTH",
    theme = reactableTheme(
      borderColor = "#dfe2e5",
      stripedColor = "#f6f8fa",
      headerStyle = list(
        backgroundColor = "#f8f9fa",
        color = "#1a162d",
        fontWeight = "bold",
        borderColor = "#dfe2e5"
      ),
      cellStyle = list(
        padding = "8px 12px"
      )
    )
  )
Code
cat("📌 Found", nrow(quarto_tables$table_participants %>% filter(lubridate::year(DATE_BIRTH) < 1925 | lubridate::year(DATE_BIRTH) > 2014)), "participants with invalid age\n")
📌 Found 941 participants with invalid age
Code
cat("\n🔍 Current age standards:\n",
"- Too old: Over 100 years old\n",
    "- Too young: Under 10 years old\n",
    "- Impossible: Negative age (birth date in future)\n")

🔍 Current age standards:
 - Too old: Over 100 years old
 - Too young: Under 10 years old
 - Impossible: Negative age (birth date in future)