A deep learning solution for handwritten digit recognition using Convolutional Neural Networks (CNNs) with PyTorch, achieving a public-leaderboard score of 0.99689 in the Kaggle Digit Recognizer competition (MNIST dataset).
This project tackles the classic MNIST handwritten digit classification problem as a Kaggle competition. Instead of a single model, it combines two complementary CNN architectures trained across 5 cross-validation folds, resulting in a 10-model ensemble with test-time augmentation (TTA) to maximize accuracy.
```
computer-vision-mnist-cnn/
├── cnn-digit-recognizer.ipynb   # Main Jupyter notebook (full pipeline)
├── README.md
└── LICENSE
```
- Training and test CSV files are loaded from the Kaggle competition dataset.
- Pixel values are normalized to the `[0, 1]` range by dividing by 255.
- Images are reshaped from flat 784-element vectors to `(1, 28, 28)` tensors (channels-first format for PyTorch).
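The normalization and reshape steps can be sketched as follows (using a random array as a stand-in for the CSV pixel columns):

```python
import numpy as np
import torch

# Stand-in for rows read from the Kaggle CSV: 784 pixel values in [0, 255]
flat = np.random.randint(0, 256, size=(4, 784)).astype(np.float32)

# Normalize to [0, 1] by dividing by 255
normalized = flat / 255.0

# Reshape to channels-first (N, 1, 28, 28) tensors for PyTorch
images = torch.from_numpy(normalized).reshape(-1, 1, 28, 28)

print(images.shape)  # torch.Size([4, 1, 28, 28])
```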
To improve generalization and reduce overfitting, the following augmentations are applied during training via `torchvision.transforms`:

| Augmentation | Parameters |
|---|---|
| RandomAffine – rotation | ±15° |
| RandomAffine – translation | up to 10% in each direction |
| RandomAffine – scale | 0.9×–1.1× |
| RandomAffine – shear | 10° |
| RandomPerspective | distortion scale 0.2, probability 0.5 |
Two distinct CNN architectures are used in the ensemble:
A compact residual network adapted for 28×28 grayscale images.
| Layer | Details |
|---|---|
| Input conv | 1 → 32 channels, 3×3, BatchNorm, ReLU |
| Layer 1 | 2× ResidualBlock (32 → 64 ch, stride 1) |
| Layer 2 | 2× ResidualBlock (64 → 128 ch, stride 2) |
| Layer 3 | 2× ResidualBlock (128 → 256 ch, stride 2) |
| Global Average Pooling | 256-dim feature vector |
| Dropout | p=0.4 |
| Fully Connected | 256 → 10 classes |
Each ResidualBlock contains two 3×3 convolutions with BatchNorm and a shortcut connection (1×1 conv when dimensions change).
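A minimal sketch of such a block (variable names are illustrative, not necessarily the notebook's):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convs with BatchNorm plus a shortcut connection.
    A 1x1 conv projects the shortcut when channels or stride change."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))

# A stride-2 block halves spatial resolution and doubles channels
block = ResidualBlock(32, 64, stride=2)
y = block(torch.randn(2, 32, 28, 28))
print(y.shape)  # torch.Size([2, 64, 14, 14])
```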
A wider network with three convolutional blocks followed by a fully connected classifier.
| Block | Details |
|---|---|
| Block 1 | Conv(1→64) → BN → ReLU → Conv(64→64) → BN → ReLU → MaxPool(2) → Dropout(0.25) |
| Block 2 | Conv(64→128) → BN → ReLU → Conv(128→128) → BN → ReLU → MaxPool(2) → Dropout(0.25) |
| Block 3 | Conv(128→256) → BN → ReLU → Conv(256→256) → BN → ReLU → MaxPool(2, pad=1) → Dropout(0.25) |
| Classifier | Linear(4096→512) → BN → ReLU → Dropout(0.5) → Linear(512→10) |
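The table above can be expressed as a compact `nn.Sequential` sketch; the shape comments show why the classifier input is 4096 (256 channels × 4 × 4 after the padded final pool). This is an illustrative reconstruction, not the notebook's exact class:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, pool_pad=0):
    """Conv -> BN -> ReLU twice, then MaxPool(2) and Dropout(0.25)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        nn.MaxPool2d(2, padding=pool_pad), nn.Dropout(0.25),
    )

model = nn.Sequential(
    conv_block(1, 64),                  # 28x28 -> 14x14
    conv_block(64, 128),                # 14x14 -> 7x7
    conv_block(128, 256, pool_pad=1),   # 7x7 (padded) -> 4x4
    nn.Flatten(),                       # 256 * 4 * 4 = 4096 features
    nn.Linear(4096, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(512, 10),
)

logits = model(torch.randn(2, 1, 28, 28))
print(logits.shape)  # torch.Size([2, 10])
```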
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 1e-3 |
| Weight decay | 1e-4 |
| LR Scheduler | CosineAnnealingLR (T_max=30) |
| Epochs | 30 |
| Batch size | 128 |
| Label smoothing | 0.1 |
| Early stopping patience | 10 epochs |
| Cross-validation | 5-fold Stratified K-Fold |
| Random seed | 42 |
| Hardware | NVIDIA Tesla T4 GPU |
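Wiring these hyperparameters together looks roughly like this (a tiny linear model stands in for the two CNNs, and the training loop is reduced to a single step):

```python
import torch
import torch.nn as nn

torch.manual_seed(42)  # fixed seed, as in the table

# Stand-in model; the real training uses the two CNN architectures
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

# Hyperparameters from the table
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# One illustrative step on a dummy batch of 128 images
x, y = torch.randn(128, 1, 28, 28), torch.randint(0, 10, (128,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
scheduler.step()  # cosine decay of the learning rate over 30 epochs
print(optimizer.param_groups[0]["lr"])
```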
- Both architectures are trained independently across all 5 folds, producing 10 models in total.
- The best model checkpoint per fold (highest validation accuracy) is saved and used for inference.
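The fold splitting can be sketched with scikit-learn's `StratifiedKFold` (dummy balanced labels stand in for the 42,000 training labels):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Dummy data: 500 samples with balanced digit labels 0-9
labels = np.repeat(np.arange(10), 50)
X = np.zeros((len(labels), 784))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, labels)):
    # Each fold trains both architectures; the checkpoint with the
    # best validation accuracy is kept for the ensemble.
    print(fold, len(train_idx), len(val_idx))  # 400 train / 100 val per fold
```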
Predictions are generated by averaging softmax probabilities from all 10 models across 4 TTA variants:
| TTA Transform | Details |
|---|---|
| No augmentation | Identity |
| Slight rotation + translation (+) | degrees=5, translate=(0.05, 0.05) |
| Slight rotation + translation (–) | degrees=-5, translate=(0.05, 0.05) |
| Slight scale | scale=(0.95, 1.05) |
Total inference passes: 10 models × 4 TTA variants = 40 forward passes per test image.
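The averaging logic can be illustrated with toy stand-ins (two untrained linear "models" and a trivial shift as a TTA transform; the real ensemble uses the 10 trained CNNs and the 4 affine variants above):

```python
import torch
import torch.nn.functional as F

models = [torch.nn.Linear(784, 10) for _ in range(2)]           # toy ensemble
ttas = [lambda x: x, lambda x: x.roll(1, dims=-1)]              # toy TTA variants

batch = torch.randn(8, 784)
probs = torch.zeros(8, 10)
with torch.no_grad():
    for model in models:
        for tta in ttas:
            # Accumulate softmax probabilities across every model/TTA pair
            probs += F.softmax(model(tta(batch)), dim=1)
probs /= len(models) * len(ttas)   # average of softmax probabilities
preds = probs.argmax(dim=1)        # final predicted digit per image
print(probs.shape, preds.shape)
```

Averaging probabilities (rather than hard votes) lets confident models outweigh uncertain ones, and each row of the averaged matrix still sums to 1.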
| Metric | Value |
|---|---|
| Kaggle Public Score | 0.99689 |
| Validation Strategy | 5-Fold Stratified CV |
| Ensemble Size | 10 models (5 folds × 2 architectures) |
The notebook is designed to run in the Kaggle environment. Key dependencies:
- Python 3.12
- PyTorch
- torchvision
- NumPy
- pandas
- matplotlib
- scikit-learn
1. Attach the Digit Recognizer competition dataset.
2. Enable GPU acceleration (NVIDIA Tesla T4 recommended).
3. Run all cells. The notebook will:
   - Load and preprocess the data
   - Train 10 models via K-Fold CV
   - Generate predictions with the TTA ensemble
   - Save `submission.csv`, ready for Kaggle submission
This project is licensed under the terms of the LICENSE file.