Computer Vision – MNIST CNN Digit Recognizer

Open in Kaggle

A deep learning solution for handwritten digit recognition using Convolutional Neural Networks (CNNs) in PyTorch, achieving 0.99689 accuracy on the Kaggle Digit Recognizer competition (MNIST dataset).


Overview

This project tackles the classic MNIST handwritten digit classification problem as a Kaggle competition. Instead of a single model, it combines two complementary CNN architectures trained across 5 cross-validation folds, resulting in a 10-model ensemble with test-time augmentation (TTA) to maximize accuracy.


Project Structure

computer-vision-mnist-cnn/
├── cnn-digit-recognizer.ipynb   # Main Jupyter notebook (full pipeline)
├── README.md
└── LICENSE

Approach

Data Preprocessing

  • Training and test CSV files are loaded from the Kaggle competition dataset.
  • Pixel values are normalized to the [0, 1] range by dividing by 255.
  • Images are reshaped from flat 784-element vectors to (1, 28, 28) tensors (channels-first format for PyTorch).
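
The steps above can be sketched as follows. This is a minimal illustration, not the notebook's exact code; the `preprocess` helper and the synthetic demo frame are hypothetical stand-ins for the Kaggle `train.csv` / `test.csv` layout (a `label` column followed by 784 `pixelN` columns).

```python
import numpy as np
import pandas as pd
import torch

def preprocess(df: pd.DataFrame, has_label: bool = True):
    """Split off labels (if present), scale pixels to [0, 1], reshape to (N, 1, 28, 28)."""
    labels = None
    if has_label:
        labels = torch.tensor(df["label"].values, dtype=torch.long)
        df = df.drop(columns=["label"])
    pixels = df.values.astype(np.float32) / 255.0          # normalize to [0, 1]
    images = torch.from_numpy(pixels).reshape(-1, 1, 28, 28)  # channels-first for PyTorch
    return images, labels

# Tiny synthetic frame standing in for train.csv (label + 784 pixel columns)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 256, size=(4, 784)),
                    columns=[f"pixel{i}" for i in range(784)])
demo.insert(0, "label", rng.integers(0, 10, size=4))

X, y = preprocess(demo)
print(X.shape)  # torch.Size([4, 1, 28, 28])
```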

Data Augmentation

To improve generalization and reduce overfitting, the following augmentations are applied during training via torchvision.transforms:

| Augmentation      | Parameters                               |
|-------------------|------------------------------------------|
| RandomAffine      | rotation ±15°                            |
| RandomAffine      | translation up to 10% in each direction  |
| RandomAffine      | scale 0.9×–1.1×                          |
| RandomAffine      | shear 10°                                |
| RandomPerspective | distortion scale 0.2, probability 0.5    |

Model Architectures

Two distinct CNN architectures are used in the ensemble:

1. ResNetMNIST – ResNet-style CNN

A compact residual network adapted for 28×28 grayscale images.

| Layer                  | Details                             |
|------------------------|-------------------------------------|
| Input conv             | 1 → 32 channels, 3×3, BatchNorm, ReLU |
| Layer 1                | ResidualBlock (32 → 64 ch, stride 1)  |
| Layer 2                | ResidualBlock (64 → 128 ch, stride 2) |
| Layer 3                | ResidualBlock (128 → 256 ch, stride 2) |
| Global Average Pooling | 256-dim feature vector              |
| Dropout                | p=0.4                               |
| Fully Connected        | 256 → 10 classes                    |

Each ResidualBlock contains two 3×3 convolutions with BatchNorm and a shortcut connection (1×1 conv when dimensions change).
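
A block with that structure looks roughly like this (a standard pre-built sketch of the description above; names and exact details may differ from the notebook):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convs with BatchNorm; 1x1 conv shortcut when dimensions change."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Identity shortcut, replaced by a 1x1 conv when shape changes
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))

block = ResidualBlock(32, 64, stride=1)
y = block(torch.rand(2, 32, 28, 28))
print(y.shape)  # torch.Size([2, 64, 28, 28])
```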

2. WideConvNet – VGG-style Wide CNN

A wider network with three convolutional blocks followed by a fully connected classifier.

| Block      | Details                                                                      |
|------------|------------------------------------------------------------------------------|
| Block 1    | Conv(1→64) → BN → ReLU → Conv(64→64) → BN → ReLU → MaxPool(2) → Dropout(0.25)   |
| Block 2    | Conv(64→128) → BN → ReLU → Conv(128→128) → BN → ReLU → MaxPool(2) → Dropout(0.25) |
| Block 3    | Conv(128→256) → BN → ReLU → Conv(256→256) → BN → ReLU → MaxPool(2, pad=1) → Dropout(0.25) |
| Classifier | Linear(4096→512) → BN → ReLU → Dropout(0.5) → Linear(512→10)                  |
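
The convolutional trunk can be sketched with a small block helper; this also verifies the spatial math behind `Linear(4096→512)`: 28×28 → 14×14 → 7×7 → 4×4 (the padded pool), and 256 × 4 × 4 = 4096. A hypothetical sketch, not the notebook's exact module:

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int, pool_pad: int = 0) -> nn.Sequential:
    """One VGG-style block: two 3x3 convs with BN/ReLU, then MaxPool and Dropout."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        nn.MaxPool2d(2, padding=pool_pad), nn.Dropout(0.25),
    )

features = nn.Sequential(
    conv_block(1, 64),                 # 28x28 -> 14x14
    conv_block(64, 128),               # 14x14 -> 7x7
    conv_block(128, 256, pool_pad=1),  # 7x7   -> 4x4 (padded pool)
)
out = features(torch.rand(2, 1, 28, 28))
print(out.flatten(1).shape)  # torch.Size([2, 4096]) -- feeds Linear(4096 -> 512)
```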

Training Strategy

| Hyperparameter   | Value                        |
|------------------|------------------------------|
| Optimizer        | AdamW                        |
| Learning rate    | 1e-3                         |
| Weight decay     | 1e-4                         |
| LR scheduler     | CosineAnnealingLR (T_max=30) |
| Epochs           | 30                           |
| Batch size       | 128                          |
| Label smoothing  | 0.1                          |
| Early stopping   | patience 10 epochs           |
| Cross-validation | 5-fold Stratified K-Fold     |
| Random seed      | 42                           |
| Hardware         | NVIDIA Tesla T4 GPU          |
  • Both architectures are trained independently across all 5 folds, producing 10 models in total.
  • The best model checkpoint per fold (highest validation accuracy) is saved and used for inference.
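
The training setup above translates to roughly the following PyTorch/scikit-learn configuration. The tiny `nn.Sequential` model and the synthetic labels are placeholders for illustration only:

```python
import torch
import torch.nn as nn
from sklearn.model_selection import StratifiedKFold

# Placeholder model standing in for ResNetMNIST / WideConvNet
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Stratified 5-fold split keeps the class balance in every fold
labels = torch.randint(0, 10, (1000,))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
n_folds = sum(1 for _ in skf.split(torch.zeros(len(labels), 1), labels))

# One optimization step on a dummy batch; scheduler.step() runs once per epoch
x, y = torch.rand(128, 1, 28, 28), torch.randint(0, 10, (128,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()
print(n_folds, scheduler.get_last_lr()[0])
```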

Inference & Test-Time Augmentation

Predictions are generated by averaging softmax probabilities from all 10 models across 4 TTA variants:

| TTA Transform                      | Details                                |
|------------------------------------|----------------------------------------|
| No augmentation                    | Identity                               |
| Slight rotation + translation (+)  | degrees=5, translate=(0.05, 0.05)      |
| Slight rotation + translation (–)  | degrees=-5, translate=(0.05, 0.05)     |
| Slight scale                       | scale=(0.95, 1.05)                     |

Total inference passes: 10 models × 4 TTA variants = 40 forward passes per test image.


Results

| Metric              | Value                               |
|---------------------|-------------------------------------|
| Kaggle Public Score | 0.99689                             |
| Validation Strategy | 5-Fold Stratified CV                |
| Ensemble Size       | 10 models (5 folds × 2 architectures) |

Requirements

The notebook is designed to run in the Kaggle environment. Key dependencies:

  • Python 3.12
  • PyTorch
  • torchvision
  • NumPy
  • pandas
  • matplotlib
  • scikit-learn

Usage

  1. Open the notebook on Kaggle:
    Open in Kaggle

  2. Attach the Digit Recognizer competition dataset.

  3. Enable GPU acceleration (NVIDIA Tesla T4 recommended).

  4. Run all cells. The notebook will:

    • Load and preprocess the data
    • Train 10 models via K-Fold CV
    • Generate predictions with TTA ensemble
    • Save submission.csv ready for Kaggle submission

License

This project is licensed under the terms specified in the LICENSE file.
