This folder contains the data cleaning and preprocessing pipeline for the TEDS-A and TEDS-D 2023 datasets, used in the project: Optimization of Treatment Planning and Resource Allocation in Substance Use Disorder (SUD) Rehabilitation Facilities using Artificial Intelligence and Predictive Analytics.
- Source: SAMHSA (Substance Abuse and Mental Health Services Administration)
- Dataset: Treatment Episode Data Set - Admissions (TEDS-A), 2023
- Records: ~1.6 million admission records
- Original Variables: 60 variables
- Cleaned Variables: 50 relevant variables
- Source: SAMHSA (Substance Abuse and Mental Health Services Administration)
- Dataset: Treatment Episode Data Set - Discharges (TEDS-D), 2023
- Records: ~1.5 million discharge records
- Original Variables: ~50 variables
- Cleaned Variables: 74 relevant variables
Purpose: Main data cleaning script that processes raw TEDS-A data.
Pipeline Steps:
- Load raw CSV data
- Handle missing values (convert -9 codes to NaN)
- Optimize data types (reduce memory by ~70%)
- Engineer 17 new features for treatment optimization
- Decode categorical codes to readable labels
- Select and rename 50 relevant columns
- Save cleaned dataset
Input: 1_datasets/raw/tedsa_puf_2023.csv
Output: 1_datasets/processed/teds_a_2023_cleaned.csv
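The data type optimization step could look roughly like the sketch below. It is an illustration rather than the notebook's actual code: the helper `optimize_dtypes`, its `max_unique_ratio` parameter, and the toy values are assumptions, and the real memory saving depends on the data.

```python
import numpy as np
import pandas as pd


def optimize_dtypes(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy with downcast numeric columns and categorical text columns."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            # Smallest integer type that can hold the observed values
            out[col] = pd.to_numeric(series, downcast="integer")
        elif pd.api.types.is_float_dtype(series):
            # float32 at the smallest; NaN values are preserved
            out[col] = pd.to_numeric(series, downcast="float")
        elif pd.api.types.is_object_dtype(series) or pd.api.types.is_string_dtype(series):
            # Low-cardinality text is cheaper to store as a categorical
            if series.nunique(dropna=True) <= max_unique_ratio * len(series):
                out[col] = series.astype("category")
    return out


# Toy frame standing in for the raw file (all values are made up)
raw = pd.DataFrame({
    "CASEID": np.arange(1000, dtype="int64"),
    "AGE": np.tile(np.arange(1, 11, dtype="int64"), 100),
    "DAYWAIT": np.tile([0.0, 5.0, np.nan, 30.0], 250),
    "STATE": np.tile(["CA", "NY", "TX", "WA"], 250),
})
small = optimize_dtypes(raw)
before = raw.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before:,} -> {after:,} bytes")
```

Comparing `memory_usage(deep=True)` before and after is a simple way to check the reduction on the real file.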
Purpose: Advanced missing value analysis and handling strategy for different analysis types.
Processing Steps:
- Analyze missing value patterns and percentages
- Identify critical variables (cannot be missing)
- Create analysis-ready dataset (minimal removal, ~95% retention)
- Create ML-ready dataset (imputation, 100% retention)
- Generate missing value report
Input: 1_datasets/processed/teds_a_2023_cleaned.csv
Outputs:
- `1_datasets/processed/teds_analysis_ready.csv` - For statistical analysis (pairwise deletion)
- `1_datasets/processed/teds_ml_ready.csv` - For machine learning (imputed values)
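A missing value report of the kind described above can be sketched as follows. The function name `missing_report` and the toy values are assumptions for illustration; the notebook may compute its report differently.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, most missing first."""
    n_missing = df.isna().sum()
    report = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(2),
    })
    return report.sort_values("pct_missing", ascending=False)


# Toy frame using cleaned column names (values are made up)
cleaned = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "wait_time_days": [np.nan, 3.0, np.nan, 10.0],
    "marital_status": ["Never married", None, "Never married", "Never married"],
})
report = missing_report(cleaned)
print(report)
```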
Purpose: Main data cleaning script that processes raw TEDS-D data.
Pipeline Steps:
- Load raw CSV data
- Handle missing values (convert -9 codes to NaN)
- Optimize data types (reduce memory usage)
- Engineer discharge-specific features (treatment outcomes, improvements)
- Decode categorical codes to readable labels
- Select and rename 74 relevant columns
- Save cleaned dataset
Input: 1_datasets/raw/tedsd_puf_2023.csv
Output: 1_datasets/processed/teds_d_2023_cleaned.csv
Purpose: Advanced missing value analysis and handling strategy for discharge data.
Processing Steps:
- Analyze missing value patterns and percentages
- Identify critical variables (cannot be missing)
- Create analysis-ready dataset (minimal removal, ~100% retention)
- Create ML-ready dataset (comprehensive imputation, 100% retention)
- Generate missing value report
Input: 1_datasets/processed/teds_d_2023_cleaned.csv
Outputs:
- `1_datasets/processed/teds_d_analysis_ready.csv` - For statistical analysis (pairwise deletion)
- `1_datasets/processed/teds_d_ml_ready.csv` - For machine learning (fully imputed)
- File: `1_datasets/raw/tedsa_puf_2023.csv`
- Format: CSV file with TEDS-A 2023 admission records
- Size: ~1.6M rows × 60 columns
- File: `1_datasets/processed/teds_a_2023_cleaned.csv`
- Records: ~1.6M rows × 50 columns
- Features: Human-readable column names and decoded categorical values
- Use: EDA
- Missing Data: Preserved as NaN
- File: `1_datasets/processed/teds_analysis_ready.csv`
- Records: ~1.54M rows (95% retention)
- Strategy: Minimal removal - only rows missing critical variables
- Use: Statistical hypothesis testing, correlation analysis
- Missing Data: Present in non-critical variables (handled via pairwise deletion)
- File: `1_datasets/processed/teds_ml_ready.csv`
- Records: ~1.6M rows (100% retention)
- Strategy: Statistical imputation (median/mode)
- Use: Machine learning model training
- Missing Data: None (imputed)
- File: `1_datasets/sample/tedsa_sample.csv`
- Records: 1000 rows (0.0625% sample)
- Strategy: Random sampling
- Use: Visualization, initial exploration
- Missing Data: Preserved from original sample
- File: `1_datasets/raw/tedsd_puf_2023.csv`
- Format: CSV file with TEDS-D 2023 discharge records
- Size: ~1.5M rows × ~50 columns
- File: `1_datasets/processed/teds_d_2023_cleaned.csv`
- Records: 1,474,025 rows × 74 columns
- Features: Human-readable column names, decoded categorical values, discharge outcomes
- Use: EDA, discharge analysis
- Missing Data: Preserved as NaN
- File: `1_datasets/processed/teds_d_analysis_ready.csv`
- Records: ~1,474,000 rows (~100% retention)
- Strategy: Minimal removal - only rows missing `patient_id` or `discharge_reason`
- Use: Statistical hypothesis testing, outcome analysis
- Missing Data: Present in non-critical variables (handled via pairwise deletion)
- Note: `length_of_stay` has 64.55% missing data and is NOT used as a critical variable
- File: `1_datasets/processed/teds_d_ml_ready.csv`
- Records: 1,474,025 rows (100% retention)
- Strategy: Comprehensive imputation across all 74 variables
- Use: Machine learning model training, predictive modeling
- Missing Data: None (fully imputed)
| Feature | Description | Type |
|---|---|---|
| `years_using` | Years between first use and admission | Continuous |
| `number_of_substances` | Count of substances used (0-3) | Discrete |
| `is_polysubstance` | Uses 2+ substances | Binary |
| `is_opioid_primary` | Primary substance is opioid | Binary |
| `is_stimulant_primary` | Primary substance is stimulant | Binary |
| `is_alcohol_primary` | Primary substance is alcohol | Binary |
| `is_injection_user` | Uses injection route | Binary |
| `is_criminal_justice_referral` | Referred by criminal justice | Binary |
| `has_recent_arrest` | Arrested in past 30 days | Binary |
| `is_chronic_treatment` | 3+ prior treatment episodes | Binary |
| `is_first_treatment` | No prior treatment | Binary |
| `has_long_wait` | Waited 15+ days for treatment | Binary |
| `is_adolescent` | Age 12-17 | Binary |
| `is_older_adult` | Age 55+ | Binary |
| `is_pregnant` | Pregnant at admission | Binary |
| `is_homeless` | Experiencing homelessness | Binary |
| `has_no_income` | No source of income | Binary |
| `has_mental_health_disorder` | Co-occurring mental health disorder | Binary |
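As a rough illustration of how a few of the threshold-based flags in this table could be derived, the sketch below uses the thresholds stated in the table (2+ substances, 15+ days, 3+ prior episodes, no prior treatment). It assumes `number_of_substances`, `wait_time_days`, and `prior_treatments` are numeric when the flags are computed; the helper names `threshold_flag` and `add_admission_flags` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def threshold_flag(values: pd.Series, predicate) -> pd.Series:
    """Return 1.0/0.0 where the input is known and NaN where it is missing."""
    return predicate(values).astype(float).where(values.notna())


def add_admission_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Add a subset of the binary admission flags to a copy of df."""
    out = df.copy()
    n_sub = out["number_of_substances"]
    wait = out["wait_time_days"]
    prior = out["prior_treatments"]
    out["is_polysubstance"] = threshold_flag(n_sub, lambda s: s >= 2)
    out["has_long_wait"] = threshold_flag(wait, lambda s: s >= 15)
    out["is_chronic_treatment"] = threshold_flag(prior, lambda s: s >= 3)
    out["is_first_treatment"] = threshold_flag(prior, lambda s: s == 0)
    return out


# Toy input (values are made up)
toy = pd.DataFrame({
    "number_of_substances": [1, 2, 3, np.nan],
    "wait_time_days": [0, 15, np.nan, 40],
    "prior_treatments": [0, 1, 3, np.nan],
})
feats = add_admission_flags(toy)
print(feats)
```

Keeping the flag missing when its input is missing avoids silently treating unknown cases as negatives; the ML-ready dataset can then fill those gaps explicitly.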
| Feature | Description | Type |
|---|---|---|
| `completed_treatment` | Successfully completed treatment | Binary |
| `dropped_out` | Left against medical advice | Binary |
| `terminated` | Terminated by facility | Binary |
| `transferred` | Transferred to another facility | Binary |
| `short_stay` | Length of stay < 30 days | Binary |
| `long_stay` | Length of stay > 90 days | Binary |
| `employment_improved` | Employment status improved at discharge | Binary |
| `housing_improved` | Living arrangement improved at discharge | Binary |
| `arrests_reduced` | Arrests reduced from admission to discharge | Binary |
| `number_of_substances_discharge` | Substance count at discharge | Discrete |
| All TEDS-A features | Admission baseline features | Various |
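The stay-length and completion flags above could be computed along these lines. This is a sketch under assumptions: `length_of_stay` is taken to be a numeric count of days, `discharge_reason` is taken to hold decoded labels such as `'Treatment Completed'`, and the function name `add_discharge_flags`, the `'Other'` label, and the toy values are made up.

```python
import numpy as np
import pandas as pd


def add_discharge_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Add short_stay, long_stay and completed_treatment to a copy of df."""
    out = df.copy()
    los = out["length_of_stay"]
    reason = out["discharge_reason"]
    # Flags stay NaN when the source value is missing
    out["short_stay"] = (los < 30).astype(float).where(los.notna())
    out["long_stay"] = (los > 90).astype(float).where(los.notna())
    out["completed_treatment"] = (
        (reason == "Treatment Completed").astype(float).where(reason.notna())
    )
    return out


# Toy input (values are made up)
toy = pd.DataFrame({
    "length_of_stay": [10, 45, 120, np.nan],
    "discharge_reason": ["Treatment Completed", "Other", None, "Treatment Completed"],
})
flags = add_discharge_flags(toy)
print(flags)
```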
- `patient_id`, `age_group`, `sex`, `race`, `ethnicity`, `marital_status`, `education_level`, `employment_status`, `living_arrangement`, `income_source`
- `service_type`, `wait_time_days`, `referral_source`, `prior_treatments`, `medication_assisted_therapy`, `dsm_diagnosis`, `self_help_attendance`, `payment_source`
- `primary_substance`, `secondary_substance`, `tertiary_substance`, `route_primary`, `route_secondary`, `route_tertiary`, `frequency_primary`, `frequency_secondary`, `frequency_tertiary`, `age_first_use_primary`
- `injection_drug_use`, `years_using`, `number_of_substances`, `substance_category`, `is_polysubstance`, `is_opioid_primary`, `is_stimulant_primary`, `is_injection_user`, `has_cooccurring_mental_health`, `has_mental_health_disorder`, `pregnant`
- `recent_arrests`, `is_criminal_justice_referral`, `has_recent_arrest`, `is_homeless`, `has_no_income`, `is_pregnant`
- `is_chronic_treatment`, `is_first_treatment`
- `state`, `region`, `veteran_status`, `health_insurance`, `has_long_wait`, `is_adolescent`, `is_older_adult`
Includes all TEDS-A variables PLUS discharge-specific variables:
- `discharge_reason`, `completed_treatment`, `dropped_out`, `terminated`, `transferred`
- `length_of_stay`, `short_stay`, `long_stay`
- `employment_improved`, `housing_improved`, `arrests_reduced`
- `employment_discharge`, `living_arrangement_discharge`, `arrests_discharge`
- `primary_substance_discharge`, `secondary_substance_discharge`, `tertiary_substance_discharge`
- `frequency_primary_discharge`, `self_help_attendance_discharge`
- `number_of_substances_discharge`
- Plus admission baseline versions of all discharge variables for comparison
- Original: `-9` codes indicate missing/unknown/not collected
- Cleaned: Converted to `NaN` for proper pandas handling
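A minimal sketch of the `-9` to `NaN` conversion, using a made-up three-row frame with raw TEDS column names. It assumes `-9` never occurs as a legitimate value in any column, which should be checked before applying a blanket replacement.

```python
import numpy as np
import pandas as pd

# Toy raw frame (values are made up)
raw = pd.DataFrame({
    "CASEID": [101, 102, 103],
    "EMPLOY": [3, -9, 1],
    "DAYWAIT": [-9, 0, 12],
})

# Replace the missing-value sentinel in every column at once
cleaned = raw.replace(-9, np.nan)
print(cleaned)
print(cleaned.isna().sum())
```

The same conversion can also be done at load time through the `na_values` argument of `pd.read_csv`.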
All categorical variables decoded from numeric codes to text labels:
| Variable | Before | After |
|---|---|---|
| AGE | 7 | `'35-39'` |
| SEX | 1 | `'Male'` |
| SUB1 | 5 | `'Heroin'` |
| SERVICES | 7 | `'Non-intensive Outpatient'` |
| EMPLOY | 3 | `'Unemployed'` |
| REASON | 1 | `'Treatment Completed'` |
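Decoding can be sketched with one dictionary per variable and `Series.map`. The `CODEBOOK` below contains only the code/label pairs shown in the table above; the full mappings come from the SAMHSA codebook and are not reproduced here. The function name `decode_columns` and the check for unmapped codes are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Partial codebook: only the pairs listed in the table above
CODEBOOK = {
    "AGE": {7: "35-39"},
    "SEX": {1: "Male"},
    "SUB1": {5: "Heroin"},
    "SERVICES": {7: "Non-intensive Outpatient"},
    "EMPLOY": {3: "Unemployed"},
    "REASON": {1: "Treatment Completed"},
}


def decode_columns(df: pd.DataFrame, codebook: dict) -> pd.DataFrame:
    """Map numeric codes to labels; raise if a non-missing code has no label."""
    out = df.copy()
    for col, mapping in codebook.items():
        if col not in out.columns:
            continue
        unmapped = set(out[col].dropna().unique()) - set(mapping)
        if unmapped:
            raise ValueError(f"{col}: no label for codes {sorted(unmapped)}")
        out[col] = out[col].map(mapping)  # NaN stays NaN
    return out


# Toy frame restricted to the codes above (values are made up)
toy = pd.DataFrame({
    "SEX": [1, 1, np.nan],
    "SUB1": [5, np.nan, 5],
    "REASON": [1, 1, 1],
})
decoded = decode_columns(toy, CODEBOOK)
print(decoded)
```

Raising on unmapped codes matters because `Series.map` would otherwise turn them into `NaN` and make them indistinguishable from genuinely missing values.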
All columns renamed for clarity:
| Original | Cleaned |
|---|---|
| `CASEID` | `patient_id` |
| `SERVICES` | `service_type_admit` (TEDS-D) |
| `DAYWAIT` | `wait_time_days` |
| `NOPRIOR` | `prior_treatments` |
| `LIVARAG` | `living_arrangement_admit` (TEDS-D) |
| `REASON` | `discharge_reason` (TEDS-D) |
| `LOS` | `length_of_stay` (TEDS-D) |
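Selecting and renaming columns can be expressed with a single mapping, as in the sketch below. `RENAME_MAP` holds only the pairs from the table above (using the TEDS-D names); the `EXTRA` column is a made-up stand-in for a raw variable that is not kept.

```python
import pandas as pd

# Only the pairs shown in the table above (TEDS-D naming)
RENAME_MAP = {
    "CASEID": "patient_id",
    "SERVICES": "service_type_admit",
    "DAYWAIT": "wait_time_days",
    "NOPRIOR": "prior_treatments",
    "LIVARAG": "living_arrangement_admit",
    "REASON": "discharge_reason",
    "LOS": "length_of_stay",
}

# Toy one-row frame; EXTRA is a hypothetical column that gets dropped
raw = pd.DataFrame([[1, 7, 0, 2, 3, 1, 30, 99]],
                   columns=list(RENAME_MAP) + ["EXTRA"])

# Keep only the mapped columns, then rename them
cleaned = raw[list(RENAME_MAP)].rename(columns=RENAME_MAP)
print(cleaned.columns.tolist())
```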
- Remove only rows missing: `patient_id`, `service_type`, `primary_substance`, `age_group`, `sex`
- Retains ~95% of data
- Uses pairwise deletion for remaining variables
- Maximizes statistical power
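The analysis-ready strategy can be sketched as a `dropna` restricted to the critical variables, followed by statistics that skip missing values pair by pair. The critical variable list is the one given above; the toy values and labels are made up. `DataFrame.corr` excludes missing values pairwise, so it serves here as an example of pairwise deletion.

```python
import numpy as np
import pandas as pd

CRITICAL = ["patient_id", "service_type", "primary_substance", "age_group", "sex"]

# Toy cleaned frame (values and labels are made up)
cleaned = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6],
    "service_type": ["Outpatient", None, "Outpatient", "Detox", "Detox", "Outpatient"],
    "primary_substance": ["Heroin", "Alcohol", "Heroin", None, "Alcohol", "Heroin"],
    "age_group": ["35-39"] * 6,
    "sex": ["Male"] * 6,
    "wait_time_days": [1, 2, np.nan, 4, 8, 3],
    "years_using": [2, 3, 5, np.nan, 9, 4],
})

# Drop a row only if one of the critical variables is missing
analysis_ready = cleaned.dropna(subset=CRITICAL)
retention = len(analysis_ready) / len(cleaned)
print(f"retention: {retention:.1%}")

# Each correlation uses the rows where both columns are present
corr = analysis_ready[["wait_time_days", "years_using"]].corr()
print(corr)
```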
- Median for continuous variables
- Mode for categorical variables
- 0 for binary flags
- Retains 100% of data
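The ML-ready imputation rules above (median, mode, zero) might be applied as in this sketch. Which columns count as continuous, categorical, or binary is an assumption passed in by the caller; the function name `impute_simple` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_simple(df, continuous, categorical, binary):
    """Median for continuous, mode for categorical, 0 for binary flags."""
    out = df.copy()
    for col in continuous:
        out[col] = out[col].fillna(out[col].median())
    for col in categorical:
        # Assumes at least one non-missing value exists in the column
        out[col] = out[col].fillna(out[col].mode(dropna=True).iloc[0])
    for col in binary:
        out[col] = out[col].fillna(0)
    return out


# Toy frame (values and labels are made up)
toy = pd.DataFrame({
    "wait_time_days": [1.0, np.nan, 3.0, 10.0],
    "marital_status": ["Never married", "Never married", None, "Married"],
    "is_homeless": [1.0, np.nan, 0.0, np.nan],
})
ml_ready = impute_simple(
    toy,
    continuous=["wait_time_days"],
    categorical=["marital_status"],
    binary=["is_homeless"],
)
print(ml_ready)
```

Filling binary flags with 0 treats unknown cases as negatives, which is worth keeping in mind when interpreting model features built from heavily missing variables.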
Key variables with high missing rates (>20%):
- `wait_time_days` (53.7%)
- `payment_source` (53.4%)
- `income_source` (34.6%)
- `marital_status` (29.2%)
- `education_level` (20.9%)
- Remove only rows missing: `patient_id`, `discharge_reason`
- Retains ~100% of data
- Uses pairwise deletion for remaining variables
- Note: `length_of_stay` has 64.55% missing and is NOT used as a critical variable
- Numeric variables: Median imputation (3 variables)
- Categorical variables: Mode imputation (10 variables)
- Binary variables: Zero-fill imputation (27 variables)
- Remaining variables: Type-based imputation (27 variables)
  - Categorical: mode or 'Unknown'
  - Numeric: median or 0
- Retains 100% of data with 0 missing values
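The type-based fallback for the remaining variables (mode or 'Unknown', median or 0) could be sketched as follows. The function name `impute_by_type` and the toy values are assumptions; the fallbacks apply when a column has no observed values to compute a mode or median from.

```python
import numpy as np
import pandas as pd


def impute_by_type(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with median (or 0) and others with mode (or 'Unknown')."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if not series.isna().any():
            continue
        if pd.api.types.is_numeric_dtype(series):
            fill = series.median()
            if pd.isna(fill):  # column is entirely missing
                fill = 0
        else:
            modes = series.mode(dropna=True)
            fill = modes.iloc[0] if not modes.empty else "Unknown"
        out[col] = series.fillna(fill)
    return out


# Toy discharge frame (values and labels are made up)
toy = pd.DataFrame({
    "arrests_discharge": [np.nan, np.nan, np.nan],
    "length_of_stay": [10.0, np.nan, 30.0],
    "health_insurance": [None, None, None],
    "payment_source": ["Medicaid", None, "Medicaid"],
})
filled = impute_by_type(toy)
print(filled)
```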
Key variables with high missing rates (>50%):
- `arrests_discharge` (95.7%)
- `arrests_admit` (94.2%)
- `tertiary_substance_discharge` (84.7%)
- `tertiary_substance_admit` (82.0%)
- `pregnant` (67.4%)
- `length_of_stay` (64.6%)
- `health_insurance` (58.1%)
- `payment_source` (56.5%)
Run the main cleaning notebook: `jupyter notebook cleaning_teds_a.ipynb`
Output: `teds_a_2023_cleaned.csv`
Run the main cleaning notebook: `jupyter notebook cleaning_teds_d.ipynb`
Output: `teds_d_2023_cleaned.csv`
Run the missing value handling notebook: `jupyter notebook missing_value_handling_teds_a.ipynb`
Outputs:
- `teds_analysis_ready.csv`
- `teds_ml_ready.csv`
Run the missing value handling notebook: `jupyter notebook missing_value_handling_teds_d.ipynb`
Outputs:
- `teds_d_analysis_ready.csv`
- `teds_d_ml_ready.csv`
- For EDA: Use `teds_a_2023_cleaned.csv`
- For Statistical Tests: Use `teds_analysis_ready.csv`
- For ML Models: Use `teds_ml_ready.csv`
- For EDA: Use `teds_d_2023_cleaned.csv`
- For Statistical Tests: Use `teds_d_analysis_ready.csv`
- For ML Models: Use `teds_d_ml_ready.csv`
- Use both TEDS-A and TEDS-D datasets
- Match records using `patient_id`
- Analyze admission → discharge trajectories
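Linking the two datasets could be sketched with a pandas merge, as below. This assumes `patient_id` values are actually shared between the admissions and discharges files, which should be verified on the real data; the `indicator` column makes the match rate easy to inspect. The toy frames and labels are made up.

```python
import pandas as pd

# Toy admissions and discharges (values are made up)
admissions = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "employment_status": ["Unemployed", "Unemployed", "Full-time"],
})
discharges = pd.DataFrame({
    "patient_id": [1, 3],
    "employment_status": ["Full-time", "Full-time"],
    "discharge_reason": ["Treatment Completed", "Treatment Completed"],
})

linked = admissions.merge(
    discharges,
    on="patient_id",
    how="left",                           # keep admissions without a discharge
    suffixes=("_admit", "_discharge"),    # disambiguate shared column names
    indicator=True,                       # adds _merge: 'both' or 'left_only'
    validate="one_to_one",                # fail if patient_id is duplicated
)
print(linked["_merge"].value_counts())
```

A low share of `both` rows in `_merge` would suggest that the identifiers do not line up across the two files and that trajectory analysis on the merged data is not reliable.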
- Open appropriate cleaning notebook (`cleaning_teds_a.ipynb` or `cleaning_teds_d.ipynb`)
- Update file paths if needed
- Run all cells sequentially
- Cleaned data will be saved automatically
- Open appropriate missing value notebook
- Ensure cleaned dataset exists
- Run all cells sequentially
- Two datasets will be created for different analyses
Length of Stay: This variable has 64.55% missing data and should NOT be used as a critical variable for row removal. It can still be analyzed with pairwise deletion in the analysis-ready dataset, and it is imputed in the ML-ready dataset.
High Missing Variables: Many discharge-specific variables (arrests, tertiary substances, pregnancy status) have >60% missing data. This is expected and should be considered when interpreting results.
ML-Ready Dataset: All 74 variables are fully imputed with 0 missing values, using appropriate strategies for each data type.