The Breast Cancer Wisconsin (Diagnostic) Dataset contains features computed from digitized images of fine needle aspirates (FNA) of breast masses; the features describe the cell nuclei in each image. The task is to classify each sample as either malignant (cancerous) or benign (non-cancerous) based on these features.
The dataset was fetched from the UC Irvine Machine Learning Repository.
The model uses XGBoost (`XGBClassifier`).
- Label Encoding: The target variable (`Malignant` and `Benign`) is encoded into numerical values (1 and 0, respectively).
- Train-Test Split: The data is split into a training set (80%) and a test set (20%).
- Correlation Handling: Highly correlated features (correlation > 0.9) are dropped to avoid multicollinearity.
- SMOTE (Synthetic Minority Over-sampling Technique): Used to handle class imbalance by oversampling the minority class in the training set.
- RFE (Recursive Feature Elimination): A feature selection method used to keep the 20 features that contribute most to the model's performance.
- GridSearchCV: Hyperparameter tuning is performed using GridSearchCV to find the best parameters for the XGBoost classifier. The grid includes `learning_rate`, `n_estimators`, `max_depth`, `gamma`, `reg_alpha`, `reg_lambda`, and `scale_pos_weight`.
- Classification Report: The model's performance is evaluated using precision, recall, F1-score, and support.
- ROC-AUC Score: The area under the ROC curve is calculated to assess the model's performance in distinguishing between the two classes.
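
The correlation-handling step in the list above does not say which feature of a highly correlated pair is removed. The following is a minimal sketch of one common convention, not necessarily the one used here: compute absolute Pearson correlations on the training split only and drop any column that correlates above the threshold with an earlier column. The helper name `correlated_columns` and the toy column names are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd


def correlated_columns(X_train: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return columns whose absolute correlation with any earlier column exceeds threshold."""
    corr = X_train.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Toy data (made up): "perimeter" is almost a linear function of "radius".
rng = np.random.default_rng(0)
radius = rng.normal(14, 3, 200)
demo = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(0, 0.1, 200),
    "texture": rng.normal(19, 4, 200),
})

to_drop = correlated_columns(demo, threshold=0.9)
reduced = demo.drop(columns=to_drop)
print(to_drop)
```

Choosing the columns on the training split and then dropping the same columns from the test split keeps information from the test set out of feature selection.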
The best model, after hyperparameter tuning, is evaluated on the test set. Below is the classification report for the Breast Cancer Dataset model:
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 (Benign) | 0.99 | 0.96 | 0.97 | 71 |
| 1 (Malignant) | 0.93 | 0.98 | 0.95 | 43 |
| Accuracy | | | 0.96 | 114 |
| Macro avg | 0.96 | 0.97 | 0.96 | 114 |
| Weighted avg | 0.97 | 0.96 | 0.97 | 114 |
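
As a consistency check on the report above, the per-class and averaged figures can be recomputed from confusion counts. The counts below (68 benign and 42 malignant samples classified correctly, 3 benign samples predicted malignant, 1 malignant sample predicted benign) are inferred from the rounded table and are an assumption; the source does not report a confusion matrix.

```python
# Confusion counts inferred from the rounded report (an assumption,
# not reported in the source).
tn, fp = 68, 3   # true benign: predicted benign, predicted malignant
fn, tp = 1, 42   # true malignant: predicted benign, predicted malignant


def prf(correct, predicted_total, actual_total):
    """Return (precision, recall, F1) for one class."""
    precision = correct / predicted_total
    recall = correct / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


benign = prf(tn, tn + fn, tn + fp)
malignant = prf(tp, tp + fp, tp + fn)
support_benign, support_malignant = tn + fp, tp + fn
total = support_benign + support_malignant

accuracy = (tn + tp) / total
# Macro average: unweighted mean over the two classes.
macro = [(b + m) / 2 for b, m in zip(benign, malignant)]
# Weighted average: mean over classes weighted by support.
weighted = [
    (b * support_benign + m * support_malignant) / total
    for b, m in zip(benign, malignant)
]

for name, row in [("benign", benign), ("malignant", malignant),
                  ("macro avg", macro), ("weighted avg", weighted)]:
    print(name, [round(v, 2) for v in row])
print("accuracy", round(accuracy, 2))
```

Under these assumed counts, every value rounds to the figure shown in the table, including the gap between the macro and weighted averages, which comes from benign samples outnumbering malignant ones 71 to 43.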
The ROC-AUC score of the model is 0.9931. In other words, a randomly chosen malignant sample receives a higher predicted score than a randomly chosen benign one about 99% of the time, which indicates excellent separation between the two classes.
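
To make the ROC-AUC figure concrete, here is a small sketch of its pairwise interpretation: the fraction of (malignant, benign) pairs in which the malignant sample gets the higher score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are made up for illustration; they are not the model's outputs, and in practice a library routine such as scikit-learn's `roc_auc_score` would be used.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels (1 = malignant, 0 = benign) and predicted scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

This pairwise loop is quadratic in the number of samples, which is fine for a 114-sample test set but not how production implementations compute the score.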