A deep learning solution for handwritten digit recognition using Convolutional Neural Networks (CNNs) with PyTorch, achieving a public-leaderboard score of 0.99689 in the Kaggle Digit Recognizer competition (MNIST dataset).
This project tackles the classic MNIST handwritten digit classification problem as a Kaggle competition. Instead of a single model, it combines two complementary CNN architectures trained across 5 cross-validation folds, resulting in a 10-model ensemble with test-time augmentation (TTA) to maximize accuracy.
```
computer-vision-mnist-cnn/
├── cnn-digit-recognizer.ipynb   # Main Jupyter notebook (full pipeline)
├── README.md
└── LICENSE
```
- Training and test CSV files are loaded from the Kaggle competition dataset.
- Pixel values are normalized to the `[0, 1]` range by dividing by 255.
- Images are reshaped from flat 784-element vectors to `(1, 28, 28)` tensors (channels-first format for PyTorch).
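The normalization and reshape steps can be sketched as follows (using a random array as a stand-in for the CSV pixel columns):

```python
import numpy as np
import torch

# Stand-in for rows read from the Kaggle CSV: 784 pixel values in [0, 255]
flat = np.random.randint(0, 256, size=(4, 784)).astype(np.float32)

# Normalize to [0, 1] by dividing by 255
normalized = flat / 255.0

# Reshape to channels-first (N, 1, 28, 28) tensors for PyTorch
images = torch.from_numpy(normalized).reshape(-1, 1, 28, 28)

print(images.shape)  # torch.Size([4, 1, 28, 28])
```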
To improve generalization and reduce overfitting, the following augmentations are applied during training via `torchvision.transforms`:

| Augmentation | Parameters |
|---|---|
| RandomAffine – rotation | ±15° |
| RandomAffine – translation | up to 10% in each direction |
| RandomAffine – scale | 0.9×–1.1× |
| RandomAffine – shear | 10° |
| RandomPerspective | distortion scale 0.2, probability 0.5 |
Two distinct CNN architectures are used in the ensemble:
A compact residual network adapted for 28×28 grayscale images.
| Layer | Details |
|---|---|
| Input conv | 1 → 32 channels, 3×3, BatchNorm, ReLU |
| Layer 1 | 2× ResidualBlock (32 → 64 ch, stride 1) |
| Layer 2 | 2× ResidualBlock (64 → 128 ch, stride 2) |
| Layer 3 | 2× ResidualBlock (128 → 256 ch, stride 2) |
| Global Average Pooling | 256-dim feature vector |
| Dropout | p=0.4 |
| Fully Connected | 256 → 10 classes |
Each ResidualBlock contains two 3×3 convolutions with BatchNorm and a shortcut connection (1×1 conv when dimensions change).
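A minimal sketch of such a block (variable names are illustrative, not necessarily the notebook's):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convs with BatchNorm plus a shortcut connection.
    A 1x1 conv projects the shortcut when channels or stride change."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))

# A stride-2 block halves spatial resolution and doubles channels
block = ResidualBlock(32, 64, stride=2)
y = block(torch.randn(2, 32, 28, 28))
print(y.shape)  # torch.Size([2, 64, 14, 14])
```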
A wider network with three convolutional blocks followed by a fully connected classifier.
| Block | Details |
|---|---|
| Block 1 | Conv(1→64) → BN → ReLU → Conv(64→64) → BN → ReLU → MaxPool(2) → Dropout(0.25) |
| Block 2 | Conv(64→128) → BN → ReLU → Conv(128→128) → BN → ReLU → MaxPool(2) → Dropout(0.25) |
| Block 3 | Conv(128→256) → BN → ReLU → Conv(256→256) → BN → ReLU → MaxPool(2, pad=1) → Dropout(0.25) |
| Classifier | Linear(4096→512) → BN → ReLU → Dropout(0.5) → Linear(512→10) |
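The table above can be expressed as a compact `nn.Sequential` sketch; the shape comments show why the classifier input is 4096 (256 channels × 4 × 4 after the padded final pool). This is an illustrative reconstruction, not the notebook's exact class:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, pool_pad=0):
    """Conv -> BN -> ReLU twice, then MaxPool(2) and Dropout(0.25)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        nn.MaxPool2d(2, padding=pool_pad), nn.Dropout(0.25),
    )

model = nn.Sequential(
    conv_block(1, 64),                  # 28x28 -> 14x14
    conv_block(64, 128),                # 14x14 -> 7x7
    conv_block(128, 256, pool_pad=1),   # 7x7 (padded) -> 4x4
    nn.Flatten(),                       # 256 * 4 * 4 = 4096 features
    nn.Linear(4096, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(512, 10),
)

logits = model(torch.randn(2, 1, 28, 28))
print(logits.shape)  # torch.Size([2, 10])
```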
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 1e-3 |
| Weight decay | 1e-4 |
| LR Scheduler | CosineAnnealingLR (T_max=30) |
| Epochs | 30 |
| Batch size | 128 |
| Label smoothing | 0.1 |
| Early stopping patience | 10 epochs |
| Cross-validation | 5-fold Stratified K-Fold |
| Random seed | 42 |
| Hardware | NVIDIA Tesla T4 GPU |
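Wiring these hyperparameters together looks roughly like this (a tiny linear model stands in for the two CNNs, and the training loop is reduced to a single step):

```python
import torch
import torch.nn as nn

torch.manual_seed(42)  # fixed seed, as in the table

# Stand-in model; the real training uses the two CNN architectures
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

# Hyperparameters from the table
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# One illustrative step on a dummy batch of 128 images
x, y = torch.randn(128, 1, 28, 28), torch.randint(0, 10, (128,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
scheduler.step()  # cosine decay of the learning rate over 30 epochs
print(optimizer.param_groups[0]["lr"])
```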
- Both architectures are trained independently across all 5 folds, producing 10 models in total.
- The best model checkpoint per fold (highest validation accuracy) is saved and used for inference.
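The fold splitting can be sketched with scikit-learn's `StratifiedKFold` (dummy balanced labels stand in for the 42,000 training labels):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Dummy data: 500 samples with balanced digit labels 0-9
labels = np.repeat(np.arange(10), 50)
X = np.zeros((len(labels), 784))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, labels)):
    # Each fold trains both architectures; the checkpoint with the
    # best validation accuracy is kept for the ensemble.
    print(fold, len(train_idx), len(val_idx))  # 400 train / 100 val per fold
```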
Predictions are generated by averaging softmax probabilities from all 10 models across 4 TTA variants:
| TTA Transform | Details |
|---|---|
| No augmentation | Identity |
| Slight rotation + translation (+) | degrees=5, translate=(0.05, 0.05) |
| Slight rotation + translation (–) | degrees=-5, translate=(0.05, 0.05) |
| Slight scale | scale=(0.95, 1.05) |
Total inference passes: 10 models × 4 TTA variants = 40 forward passes per test image.
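The averaging logic can be illustrated with toy stand-ins (two untrained linear "models" and a trivial shift as a TTA transform; the real ensemble uses the 10 trained CNNs and the 4 affine variants above):

```python
import torch
import torch.nn.functional as F

models = [torch.nn.Linear(784, 10) for _ in range(2)]           # toy ensemble
ttas = [lambda x: x, lambda x: x.roll(1, dims=-1)]              # toy TTA variants

batch = torch.randn(8, 784)
probs = torch.zeros(8, 10)
with torch.no_grad():
    for model in models:
        for tta in ttas:
            # Accumulate softmax probabilities across every model/TTA pair
            probs += F.softmax(model(tta(batch)), dim=1)
probs /= len(models) * len(ttas)   # average of softmax probabilities
preds = probs.argmax(dim=1)        # final predicted digit per image
print(probs.shape, preds.shape)
```

Averaging probabilities (rather than hard votes) lets confident models outweigh uncertain ones, and each row of the averaged matrix still sums to 1.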
| Metric | Value |
|---|---|
| Kaggle Public Score | 0.99689 |
| Validation Strategy | 5-Fold Stratified CV |
| Ensemble Size | 10 models (5 folds × 2 architectures) |
The notebook is designed to run in the Kaggle environment. Key dependencies:
- Python 3.12
- PyTorch
- torchvision
- NumPy
- pandas
- matplotlib
- scikit-learn
1. Attach the Digit Recognizer competition dataset.
2. Enable GPU acceleration (NVIDIA Tesla T4 recommended).
3. Run all cells. The notebook will:
   - Load and preprocess the data
   - Train 10 models via K-Fold CV
   - Generate predictions with the TTA ensemble
   - Save `submission.csv`, ready for Kaggle submission
This project is licensed under the terms of the LICENSE file.