Team: Decaying Beta-Amyloid
Competition: IAFP AI Benchmarking, Student AI Benchmark Competition on Predictive Food Safety Models
Track: GIS-based pathogen presence prediction (Listeria in soil)
License: MIT
This repository implements a binary classification pipeline to predict whether Listeria spp. are present in U.S. soil samples. The model uses soil chemistry, climate, land-use, and geographic features to output a per-sample risk probability, enabling public health agencies and food safety programs to prioritize environmental sampling locations.
The primary model is a CatBoost gradient-boosted decision tree, evaluated under a spatially aware cross-validation protocol that mitigates geographic data leakage. A logistic regression baseline (lr_baseline.py) is included to confirm that model complexity is warranted.
Environmental Listeria surveillance across the U.S. generates high-dimensional geospatial data. Because soil properties and climate variables are spatially autocorrelated (nearby locations tend to have similar measurements), standard random train/test splits can overstate predictive performance by allowing the model to train on geographic neighbors of test points. This project addresses that challenge by grouping samples into 0.25-degree latitude/longitude grid cells and applying StratifiedGroupKFold, ensuring that no grid cell appears in both the training and validation sets within the same fold. See Figure 3 for a visual explanation.
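A minimal sketch of the grouping and split logic, assuming `Latitude`/`Longitude` column names from the metadata and a hypothetical `Isolates` outcome column (the exact implementation lives in `final_1.ipynb`):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

def grid_cell_ids(df: pd.DataFrame, cell_deg: float = 0.25) -> pd.Series:
    """Assign each sample a lat/lon grid-cell ID at the given resolution (degrees)."""
    lat_bin = np.floor(df["Latitude"] / cell_deg).astype(int)
    lon_bin = np.floor(df["Longitude"] / cell_deg).astype(int)
    return lat_bin.astype(str) + "_" + lon_bin.astype(str)

df = pd.read_csv("ListeriaSoil_clean.csv")
y = (df["Isolates"] > 0).astype(int)      # hypothetical outcome column name
groups = grid_cell_ids(df, cell_deg=0.25)

cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(cv.split(df, y, groups=groups)):
    # No grid cell appears in both the training and validation sets of a fold.
    assert set(groups.iloc[train_idx]).isdisjoint(set(groups.iloc[val_idx]))
```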
```
Listeria-CatBoost-Predictor/
|
+-- final_1.ipynb                      # Main submission notebook (CatBoost pipeline)
+-- data_trial.ipynb                   # Exploratory development notebook
+-- grid_sensitivity_matched.py        # Grid sensitivity analysis (0.25, 0.50, 1.00 degrees)
+-- lr_baseline.py                     # Logistic regression baseline (locked spatial protocol)
|
+-- outputs_submission/                # Locked submission outputs
|   +-- eval_lock.json                 # Locked CV/threshold configuration
|   +-- overall_metrics.json           # Pooled OOF benchmark metrics
|   +-- versions.json                  # Python package versions
|   +-- catboost_fullfit.cbm           # Full-data CatBoost model artifact
|   +-- feature_importance.csv         # Feature importances
|   +-- oof_predictions.csv            # Out-of-fold predictions
|   +-- cv_fold_metrics_threshold_0p5.csv    # Fold-level metrics (t = 0.5)
|   +-- cv_fold_metrics_threshold_tuned.csv  # Fold-level metrics (F1-tuned threshold)
|   +-- fig_1_feature_importance.png   # Figure 1: top feature importances
|   +-- fig_2_panel_roc_pr_cm_calibration.png  # Figure 2: ROC, PR, CM, calibration
|
+-- outputs_grid_sensitivity_matched_protocol/  # Grid sensitivity experiment outputs
|   +-- grid_sensitivity_chart.png
|   +-- grid_sensitivity_compact.csv
|   +-- grid_sensitivity_fold_metrics.csv
|   +-- grid_sensitivity_full.csv
|   +-- grid_sensitivity_full.json
|   +-- matched_protocol_config.json
|   +-- oof_predictions_grid{0.25,0.50,1.00}deg.csv
|
+-- fig_spatial_autocorrelation.png    # Spatial autocorrelation and leakage figure
+-- generate_autocorrelation.py        # Script to produce the autocorrelation figure
|
+-- AI benchmarking_Listeria_Deacying beta-amyloid.pdf  # Report / presentation
+-- EXPERIMENTS.md                     # Full experimental log
+-- LICENSE                            # MIT License
+-- README.md                          # This file
```
Primary dataset: `ListeriaSoil_clean.csv` (623 samples, 34 predictor columns + 1 outcome column)
Metadata dictionary: `ListeriaSoil_Metadata.csv`
| Category | Features | Examples |
|---|---|---|
| Geographic | 3 | Latitude, Longitude, Elevation (m) |
| Soil chemistry | 14 | pH, Moisture, Total nitrogen (%), Sodium (mg/Kg), Calcium (mg/Kg) |
| Climate | 4 | Precipitation (mm), Max temperature (C), Min temperature (C), Wind speed (m/s) |
| Land use / land cover | 9 | Forest (%), Cropland (%), Wetland (%), Developed open space (%) |
| Outcome | 1 | Number of Listeria isolates obtained |
Binary label definition:
- `y = 1` if isolates > 0 (Listeria present)
- `y = 0` otherwise
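Assuming the outcome column is abbreviated to `Isolates` (a hypothetical name; see `ListeriaSoil_Metadata.csv` for the exact header), the rule is one line:

```python
# Hypothetical column name; check ListeriaSoil_Metadata.csv for the exact header.
df["y"] = (df["Isolates"] > 0).astype(int)
```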
Class balance is approximately 50/50.
Source: Cornell Food Safety ML Repository, "Listeria in soil." See Citation for the full reference.
All primary results use a locked configuration stored in `outputs_submission/eval_lock.json`:
| Parameter | Value |
|---|---|
| Cross-validation | StratifiedGroupKFold |
| Spatial grid | 0.25-degree lat/lon cells (~210 groups) |
| Folds | 5 |
| Seed | 42 |
| Threshold policy | Maximize F1 on pooled out-of-fold (OOF) predictions |
| Locked threshold | 0.475 |
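The threshold policy can be reproduced from the pooled OOF file with a simple sweep. A sketch, assuming `oof_predictions.csv` holds `y_true` and `y_prob` columns (names are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score

oof = pd.read_csv("outputs_submission/oof_predictions.csv")
grid = np.linspace(0.01, 0.99, 99)
f1s = [f1_score(oof["y_true"], oof["y_prob"] >= t) for t in grid]
best_t = grid[int(np.argmax(f1s))]   # locked value reported as 0.475
print(f"F1-optimal threshold on pooled OOF: {best_t:.3f}")
```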
CatBoost is a gradient-boosted decision tree algorithm well suited for tabular data with mixed feature types. Key hyperparameters (from grid_sensitivity_matched.py):
- Iterations: 20,000 (with early stopping, patience = 300)
- Learning rate: 0.03
- Depth: 8
- L2 leaf regularization: 3.0
- Loss: Logloss; Eval metric: AUC
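A sketch of the corresponding model construction; parameter names follow the CatBoost Python API, and `random_seed=42` mirrors the locked seed:

```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=20_000,           # upper bound; early stopping cuts training short
    learning_rate=0.03,
    depth=8,
    l2_leaf_reg=3.0,
    loss_function="Logloss",
    eval_metric="AUC",
    early_stopping_rounds=300,   # patience from grid_sensitivity_matched.py
    random_seed=42,
    verbose=False,
)
# Fit with a held-out fold so early stopping can monitor validation AUC:
# model.fit(X_train, y_train, eval_set=(X_val, y_val))
```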
A logistic regression model serves as a linear baseline, run under the identical locked spatial CV protocol:
- Solver: L-BFGS with L2 penalty (`C = 1.0`)
- Preprocessing: `StandardScaler` (fit on train fold, transform on validation fold)
- Evaluated at both the locked threshold (0.475) and its own F1-optimized threshold (0.340)
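A minimal sketch of the baseline model (the full script is `lr_baseline.py`); wrapping the scaler in a `Pipeline` guarantees it is re-fit on each training fold only:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The Pipeline re-fits StandardScaler on every training fold, so no
# scaling statistics leak from the validation fold.
baseline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", penalty="l2", C=1.0, max_iter=1000),
)
```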
Under the same protocol, CatBoost outperforms logistic regression across every metric:
| Metric | Logistic Regression | CatBoost | Delta |
|---|---|---|---|
| ROC AUC | 0.803 | 0.936 | +13 pts |
| PR AUC | 0.747 | 0.932 | +19 pts |
| F1 (t = 0.475) | 0.753 | 0.872 | +12 pts |
| Brier Score | 0.182 | 0.101 | -45% |
| F1 (opt. threshold) | 0.778 | 0.872 | +9 pts |
The consistent gap confirms that nonlinear feature interactions captured by gradient boosting provide substantial predictive value beyond what a linear model extracts from the same features.
To assess whether results are robust to the choice of spatial grid resolution, CatBoost was evaluated at three grid sizes under an otherwise identical protocol:
| Grid (degrees) | ROC AUC | F1 (t = 0.475) | F1 (tuned t) |
|---|---|---|---|
| 0.25 | 0.936 | 0.872 | 0.872 |
| 0.50 | 0.934 | 0.880 | 0.880 |
| 1.00 | 0.928 | 0.872 | 0.876 |
Performance is stable across grid sizes, with a modest decline at 1.00 degrees where larger held-out regions create more distribution shift.
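The experiment amounts to re-running the locked CV at each resolution. A sketch, reusing the `grid_cell_ids` helper and `df` from the protocol sketch above:

```python
from sklearn.model_selection import StratifiedGroupKFold

for cell_deg in (0.25, 0.50, 1.00):
    groups = grid_cell_ids(df, cell_deg=cell_deg)
    cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
    # Identical CatBoost fold loop at each resolution; OOF predictions land in
    # oof_predictions_grid{cell_deg}deg.csv.
```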
Source: `outputs_submission/overall_metrics.json`
| Metric | Mean +/- SD |
|---|---|
| ROC AUC | 0.936 +/- 0.022 |
| PR AUC | 0.932 +/- 0.027 |
| F1 | 0.872 +/- 0.035 |
| Sensitivity | 0.897 +/- 0.036 |
| Specificity | 0.839 +/- 0.040 |
| Locked threshold (F1-optimized) | 0.475 |
Isotonic calibration improved probability accuracy, reducing the Brier score from 0.1008 to 0.0945.
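A sketch of the calibration step, assuming it is fit on the pooled OOF predictions with the column names used above (the exact procedure is in `final_1.ipynb`):

```python
import pandas as pd
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

oof = pd.read_csv("outputs_submission/oof_predictions.csv")  # assumed columns: y_true, y_prob
iso = IsotonicRegression(out_of_bounds="clip")
calibrated = iso.fit_transform(oof["y_prob"], oof["y_true"])
print(brier_score_loss(oof["y_true"], oof["y_prob"]))  # raw:        ~0.1008 reported
print(brier_score_loss(oof["y_true"], calibrated))     # calibrated: ~0.0945 reported
```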
Capacity-based sampling enrichment using calibrated risk scores:
| Risk percentile | Observed positivity |
|---|---|
| Top 10% | 98.4% (vs. 50.0% overall) |
| Top 20% | 96.8% |
| Top 30% | 95.2% |
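Continuing the calibration sketch above, the enrichment figures can be recomputed by ranking samples on their calibrated risk score:

```python
import numpy as np

def positivity_at_top(scores, labels, frac):
    """Observed positive rate among the top `frac` highest-risk samples."""
    k = max(1, int(np.ceil(frac * len(scores))))
    top = np.argsort(scores)[::-1][:k]
    return np.asarray(labels)[top].mean()

for frac in (0.10, 0.20, 0.30):
    rate = positivity_at_top(calibrated, oof["y_true"], frac)
    print(f"Top {frac:.0%}: observed positivity {rate:.1%}")
```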
Sodium concentration is the dominant predictor, followed by Molybdenum, Moisture, and Copper. Geographic coordinates (Latitude, Longitude) also contribute, consistent with spatial structuring of Listeria prevalence.
Panel A maps sampling locations colored by maximum temperature, showing geographic clustering of environmental conditions. Panel B quantifies this pattern: as distance between sample pairs increases, their temperature difference rises on average (Tobler's First Law). Panel C contrasts a random train/test split (left, F1 = 0.901) with a 0.25-degree spatial grid split (right, F1 = 0.872). The random split intermixes train and test points geographically, inflating performance. The grid split holds out entire cells, eliminating neighbor leakage and producing more honest generalization estimates.
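Panel B's rising curve can be approximated directly from the data. A sketch using crude lat/lon (degree) distances and the `Max temperature (C)` column name from the feature table (both assumptions):

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

df = pd.read_csv("ListeriaSoil_clean.csv")
dist = pdist(df[["Latitude", "Longitude"]].to_numpy())   # pairwise distance (degrees)
dtemp = pdist(df[["Max temperature (C)"]].to_numpy())    # pairwise |delta T|

# Bin pairs by distance decile and average the temperature gap per bin;
# a rising trend is the spatial-autocorrelation signature (Tobler's First Law).
edges = np.quantile(dist, np.linspace(0, 1, 11))
bin_idx = np.digitize(dist, edges[1:-1])
mean_gap = [dtemp[bin_idx == b].mean() for b in range(10)]
print(np.round(mean_gap, 2))
```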
- Tested on: Intel i7 / AMD Ryzen 7 (8 cores), 16 GB RAM
- Minimum: 4 cores, 8 GB RAM
- Disk: < 200 MB for artifacts (dataset stored separately)
- GPU: Not required (CPU-only)
Package versions are recorded in `outputs_submission/versions.json`:

```
pandas==1.4.2
numpy==1.22.3
scikit-learn==1.7.2
```

CatBoost must also be installed (`pip install catboost`).
1. Place dataset files in the repository root:
   - `ListeriaSoil_clean.csv`
   - `ListeriaSoil_Metadata.csv`

2. Main submission benchmark:

   ```bash
   jupyter nbconvert --execute final_1.ipynb
   ```

   Outputs are written to `outputs_submission/`.

3. Logistic regression baseline:

   ```bash
   python lr_baseline.py
   ```

   Prints pooled OOF metrics and a side-by-side summary versus CatBoost.

4. Grid sensitivity analysis:

   ```bash
   python grid_sensitivity_matched.py
   ```

   Requires `outputs_submission/eval_lock.json` (produced by step 2). Outputs are written to `outputs_grid_sensitivity_matched_protocol/`.

5. Autocorrelation figure:

   ```bash
   python generate_autocorrelation.py
   ```

   Produces `fig_spatial_autocorrelation.png`.
| File / Directory | Contents |
|---|---|
| `outputs_submission/overall_metrics.json` | Locked benchmark metrics (ROC AUC, PR AUC, F1, etc.) |
| `outputs_submission/eval_lock.json` | Locked CV/threshold/seed configuration |
| `outputs_submission/catboost_fullfit.cbm` | CatBoost model trained on all data |
| `outputs_submission/oof_predictions.csv` | Out-of-fold probability predictions |
| `outputs_submission/feature_importance.csv` | Feature importance scores |
| `outputs_submission/fig_1_feature_importance.png` | Top-10 feature importance chart |
| `outputs_submission/fig_2_panel_roc_pr_cm_calibration.png` | ROC, PR, confusion matrix, calibration |
| `outputs_grid_sensitivity_matched_protocol/` | Grid sensitivity results and chart |
| `fig_spatial_autocorrelation.png` | Spatial autocorrelation and leakage figure (three panels) |
- **No external validation.** All reported metrics come from internal spatial cross-validation on a single dataset. Judges may ask why the team did not hold out an entire geographic region (for example, leave-one-state-out) or use a temporally separated test set. Internal spatial cross-validation reduces but does not eliminate optimism, and true out-of-sample performance may differ.

- **Grid cell size justification.** "Why 0.25 degrees?" is a likely question. The choice is reasonable but somewhat arbitrary, and judges may press for ecological or autocorrelation-range justification, such as a variogram or spatial correlogram analysis. The grid sensitivity experiment (0.25, 0.50, 1.00 degrees) shows stable performance across resolutions, which partially addresses this concern but does not replace a principled spatial analysis.

- **Limited model diversity beyond boosted trees.** The starter already showed boosted-tree methods as strong performers. Judges may ask why no linear baseline (such as logistic regression under spatial cross-validation) or spatial model (such as kriging or spatial random effects) was tested to bracket performance or assess whether a simpler model suffices. The logistic regression baseline (`lr_baseline.py`) addresses the simpler end of this spectrum, but the spatial-model end remains unexplored.

- **Feature importance interpretation.** Citing sodium and molybdenum as top predictors may invite questions about confounding and whether these features partly proxy for geography itself. Judges may ask what happens to performance or importance rankings when coordinates are removed, or whether partial-dependence or SHAP analyses have been performed to disentangle spatial confounding from genuine predictive signal.

- **Threshold selection on OOF data.** While better than selecting on training data, the threshold was optimized on the same out-of-fold data used to estimate performance. Judges may ask whether this introduces subtle adaptive overfitting and how sensitive F1 is to small threshold perturbations. A nested approach (inner loop for threshold, outer loop for evaluation) would be more rigorous; see the sketch below.
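For reference, a sketch of that nested protocol: the threshold is tuned on an inner split of each training fold and only then applied to the untouched outer fold:

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(y_true, y_prob):
    """Pick the F1-maximizing threshold on inner (training-side) data only."""
    grid = np.linspace(0.01, 0.99, 99)
    return grid[int(np.argmax([f1_score(y_true, y_prob >= t) for t in grid]))]

# Inside each outer fold:
#   t = tune_threshold(y_inner_val, p_inner_val)    # inner split of the training fold
#   report f1_score(y_outer_val, p_outer_val >= t)  # outer fold never informs tuning
```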
If you use this work, please cite the primary data source:
Liao, J., Guo, X., Weller, D. L., et al. (2021). Nationwide genomic atlas of soil-dwelling Listeria reveals effects of selection and population ecology on pangenome evolution. Nature Microbiology, 6, 1021-1030. https://doi.org/10.1038/s41564-021-00935-7


