Asclepios AI: Predictive Core and Twin-Engine Architecture
- Readmission Risk Engine: Identifies patients with "Chronic Relapse" phenotypes.
- LOS Optimization Engine: Prescribes the optimal Length of Stay based on clinical acuity.
This pipeline uses TEDS-D (Discharges) for ground-truth label generation and implements a "Twin-Engine" architecture to handle the statistical variance between Detox and Rehab settings.
The Challenge: TEDS data is episode-level (no unique Patient ID), making it impossible to track readmission history directly. Additionally, the NOPRIOR variable in the 2023 dataset was found to be binary-corrupted.
The Solution: We engineered a Chronicity Proxy (Years_Using_Substance).
- Logic: Age At Admission - Age of First Use.
- Validation: EDA confirmed that "Short Duration" (<5 years) correlates with high volatility/dropout, while "Long Duration" (>20 years) correlates with stability.
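The proxy logic above can be sketched in a few lines. This is a minimal, hypothetical illustration: the function names and the negative-value clamp are assumptions, only the subtraction and the <5 / >20 year thresholds come from the text.

```python
# Hypothetical sketch of the Chronicity Proxy; names are illustrative.
def years_using(age_at_admission: int, age_of_first_use: int) -> int:
    """Derive the Chronicity Proxy from two TEDS-style age fields."""
    return max(age_at_admission - age_of_first_use, 0)  # clamp data-entry artifacts

def duration_bucket(years: int) -> str:
    """Bucket the proxy into the phenotypes observed during EDA."""
    if years < 5:
        return "Short Duration"   # correlated with volatility/dropout
    if years > 20:
        return "Long Duration"    # correlated with stability
    return "Mid Duration"
```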
The Challenge: A single regression model failed to predict Length of Stay (LOS) accurately because "Detox" (3-5 days) and "Residential Rehab" (30-90 days) follow fundamentally different distributions.
The Solution: We split the inference logic into two specialized models:
- Engine A (Detox): Trained only on acute care settings (Hospital/Residential Detox).
- Engine B (Rehab): Trained on long-term care settings.
Result: This reduced Mean Absolute Error (MAE) from 8+ days to 6.04 days.
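The routing step can be summarized as a simple dispatch on service setting. The exact setting labels below are assumptions, not TEDS codes; only the Detox/Rehab split itself comes from the text.

```python
# Minimal sketch of Twin-Engine routing; the setting labels are assumptions.
DETOX_SETTINGS = {"Hospital Detox", "Residential Detox"}  # acute care -> Engine A

def route_engine(service_setting: str) -> str:
    """Dispatch an episode to the specialized LOS model for its setting."""
    return "Engine A (Detox)" if service_setting in DETOX_SETTINGS else "Engine B (Rehab)"
```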
The Challenge: Does the model predict LOS based on patient health, or just state funding rules?
The Solution: We ran a "State-Blind" Stress Test by removing all geographic features (STFIPS, REGION). The model performance remained stable (Delta < 0.1 days), proving it relies on clinical factors, not geography.
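The ablation can be sketched as follows. `STFIPS` and `REGION` are the feature names given in the text; the helper names and the dict-based feature representation are assumptions.

```python
# Sketch of the "State-Blind" stress test: drop geography, re-score, compare.
GEO_FEATURES = {"STFIPS", "REGION"}

def state_blind(features: dict) -> dict:
    """Return a copy of the feature dict with all geographic signals removed."""
    return {k: v for k, v in features.items() if k not in GEO_FEATURES}

def passes_stress_test(mae_full: float, mae_blind: float) -> bool:
    """The test passes if removing geography barely moves the error (in days)."""
    return abs(mae_full - mae_blind) < 0.1
```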
- Algorithm: XGBoost Classifier.
- Target: target_chronic_risk (Derived phenotype of chronic relapse).
- Key Features: Years_Using_Substance, Primary_Substance, Risk_Synergy_Speedball (Opioid+Stimulant interaction).
- Performance: AUC 0.75 (Strong discriminatory power).
- Output: Probability score (0-100%) of chronic relapse risk.
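The spec above can be collected into a configuration sketch. Only the algorithm, target, and feature names come from the text; the `binary:logistic` objective and the percentage helper are assumptions about how the probability output is produced and displayed.

```python
# Illustrative configuration for the Readmission Risk Engine; hyperparameters
# beyond what the spec states are assumptions.
risk_engine_config = {
    "algorithm": "XGBClassifier",
    "target": "target_chronic_risk",
    "features": [
        "Years_Using_Substance",
        "Primary_Substance",
        "Risk_Synergy_Speedball",  # Opioid + Stimulant interaction flag
    ],
    "objective": "binary:logistic",  # yields a probability in [0, 1]
}

def to_percent(probability: float) -> float:
    """Convert the raw model probability into the 0-100% risk score."""
    return round(probability * 100, 1)
```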
- Algorithm: XGBoost Regressor with the MAE objective (reg:absoluteerror), which predicts the conditional median.
- Target: Length_of_Stay_Days (Median).
- Why Median? Healthcare data has extreme outliers (stays > 300 days). Predicting the Mean results in unrealistic recommendations. Predicting the Median ensures robust, clinically standard suggestions.
- Performance: MAE 6.1 Days (Precision window of +/- 1 week).
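The mean-versus-median argument can be demonstrated with a toy sample. The LOS values below are synthetic, chosen only to show how one >300-day outlier drags the mean while leaving the median in the clinically typical range.

```python
import statistics

# Synthetic LOS sample with one extreme outlier (> 300 days), illustrating
# why the pipeline targets the median rather than the mean.
los_days = [28, 30, 32, 35, 31, 29, 330]

mean_los = statistics.mean(los_days)      # dragged upward by the outlier
median_los = statistics.median(los_days)  # stays clinically realistic
```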
This system is a predictive modeling pipeline designed to estimate treatment admissions, bed capacity, workforce requirements, and clinical complexity at the facility level. It applies machine learning to aggregated TEDS patient data.
The model transforms raw episode-level treatment records into a structured, decision-ready planning tool capable of predicting demand across multiple treatment modalities, including:
- Detox (24-hour residential)
- Short-term rehab
- Long-term rehab
- Intensive outpatient
- Non-intensive outpatient
We engineered an aggregation architecture that transforms millions of individual encounters into facility-type level population statistics. We computed:
- Total admissions (episode count)
- Average demographic distributions
- Average clinical-severity indicators
- Facility-level prevalence rates (e.g., % polysubstance use, % homeless, % injection users)
By computing means on one-hot encoded demographic fields, the system creates interpretable, population-level indicators.
Example:
sex_Female = 0.42 → 42% of the facility’s patients are female.
This process allows the model to see patient composition instead of individual outliers.
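The aggregation step above can be sketched with a toy facility. The episode rows and field names are illustrative; only the mechanics (means over one-hot fields, episode count as total admissions) come from the text.

```python
from statistics import mean

# Sketch of the aggregation architecture: episode-level one-hot rows collapse
# into facility-level prevalence rates. Field names are illustrative.
episodes = [
    {"sex_Female": 1, "homeless": 0},
    {"sex_Female": 0, "homeless": 1},
    {"sex_Female": 1, "homeless": 0},
    {"sex_Female": 0, "homeless": 0},
]

facility_profile = {
    field: mean(row[field] for row in episodes)
    for field in episodes[0]
}
# Total admissions is simply the episode count.
facility_profile["total_admissions"] = len(episodes)
```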
We built a synthetic Complexity Score, representing clinical and social risk intensity. It is a weighted combination of high-impact risk factors:
- Polysubstance use
- Chronic treatment history
- Co-occurring mental health disorder
- Homelessness
- Injection drug use
This score is later used not only for prediction but also for adjusting staffing needs, because complex patients need more staff per admission than low-acuity ones.
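A minimal sketch of the score follows. The five risk factors come from the list above, but the weights are illustrative assumptions, not the pipeline's actual coefficients.

```python
# Sketch of the synthetic Complexity Score; the weights are assumptions.
COMPLEXITY_WEIGHTS = {
    "polysubstance_use": 0.25,
    "chronic_treatment_history": 0.20,
    "cooccurring_mental_health": 0.20,
    "homeless": 0.20,
    "injection_drug_use": 0.15,
}

def complexity_score(prevalence: dict) -> float:
    """Weighted combination of facility-level risk-factor prevalence (0-1)."""
    return round(
        sum(COMPLEXITY_WEIGHTS[k] * prevalence.get(k, 0.0) for k in COMPLEXITY_WEIGHTS),
        3,
    )
```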
We built a strict leakage-prevention pipeline:
- Train/test split applied before imputation
- Imputer fit only on training data
- Scaler fit only on training data
- Log-transform applied to the target to reduce extreme skew
This ensures that the performance metrics reflect true real-world behavior.
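The leakage rule can be illustrated without any ML library: all fitted statistics (imputer fill value, scaler mean/std) are computed from the training split only, then reused unchanged on the test split. The toy values below are synthetic.

```python
import math
from statistics import mean, pstdev

# Minimal illustration of the leakage-prevention rule.
train = [10.0, 12.0, None, 14.0]
test = [None, 100.0]  # test rows never influence the fitted statistics

fill = mean(x for x in train if x is not None)          # imputer fit on train only
imputed_train = [x if x is not None else fill for x in train]
mu, sigma = mean(imputed_train), pstdev(imputed_train)  # scaler fit on train only

def transform(rows):
    """Apply the train-fitted imputer and scaler to any split."""
    return [((x if x is not None else fill) - mu) / sigma for x in rows]

scaled_test = transform(test)

# The target itself is log-transformed to tame extreme skew, e.g.:
log_target = math.log1p(25000)  # annual admissions -> log scale
```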
A three-model comparison was performed:
- Ridge Regression (linear baseline)
- Random Forest Regressor (non-linear)
- Gradient Boosting Regressor (final winner)

The Gradient Boosting Regressor was selected because it:
- Handles mixed-scale features effectively
- Is robust against moderate noise
- Excels on skewed target variables

On held-out test data it achieved R² = 0.623, with Mean Absolute Error reduced to ≈ 3,000 admissions.
Cross-validation score: 0.799 R² ± 0.048, indicating strong generalization.
Because regression models trained on log-transformed targets tend to underpredict population totals when transformed back to the original scale (exponentiation bias), the pipeline computes a Bias Correction Factor:

Correction = Sum(actual) / Sum(predicted) ≈ 1.0014
This ensures that total predicted admissions match real-world aggregate demand—critical for policy-level forecasting.
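The correction is a single multiplicative factor applied after back-transforming from log space. The numbers below are synthetic, chosen only to demonstrate the mechanics.

```python
# Sketch of the exp-bias correction; values are synthetic, not pipeline data.
actual = [1200, 800, 2500]
predicted = [1190, 810, 2480]  # slight aggregate underprediction after exp()

correction = sum(actual) / sum(predicted)
calibrated = [p * correction for p in predicted]
# After correction, total predicted demand matches total observed demand.
```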
Predicted admissions alone don’t answer staffing or capacity questions, so model outputs are translated into operational planning metrics:
- Beds: 1 bed per 12 annual admissions, reflecting average turnover and realistic occupancy levels.
- Staff: 1 staff member per 50 admissions, adjusted upward by the facility’s Complexity Score, since facilities with severe populations require proportionally more clinical support.
Facilities above median predicted admissions are labeled high-priority for:
- Funding
- Workforce allocation
- Surge planning
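The planning translation can be sketched directly. The 1:12 bed and 1:50 staff ratios and the above-median flag come from the text; the multiplicative form of the complexity adjustment is an assumption.

```python
import math

# Sketch of the operational-planning translation.
def recommended_beds(annual_admissions: float) -> int:
    return math.ceil(annual_admissions / 12)   # 1 bed per 12 annual admissions

def recommended_staff(annual_admissions: float, complexity: float) -> int:
    base = annual_admissions / 50              # 1 staff per 50 admissions
    return math.ceil(base * (1 + complexity))  # adjusted upward by complexity

def high_demand_flag(predicted: float, median_predicted: float) -> bool:
    """Facilities above the median prediction are labeled high-priority."""
    return predicted > median_predicted
```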
- Algorithm: Gradient Boosting Regressor
- Target: Log-transformed annual admissions
- Inputs:
  - Demographic prevalence vectors
  - Clinical risk indicators
  - Social determinants
  - Complexity score
  - Treatment modality and geography indicators
Top drivers include:
- Race/ethnicity prevalence patterns
- Age cohort distributions
- Education level indicators
- Service type
- Risk cluster prevalence
These patterns reveal that admissions are driven by population composition, not just sheer size or location.
The final dataset includes:
- Predicted Admissions
- Recommended Beds
- Recommended Staff (complexity-adjusted)
- Complexity Score
- High Demand Flag
- Top High-Demand Facilities Report
- Full Visualization Dashboard