Skip to content

pablo-reyes8/outgrid-vision-transformer

Repository files navigation

OutGridViT: Outlook-Grid Vision Transformer

Python License Repo size Last commit Open issues Contributors Forks Stars

OutGridViT is a research-focused hybrid vision architecture that fuses Dynamic Kernels local attention (VOLO-style), MBConv inductive bias, and Grid Attention. The core design is Model A, where every block injects dynamic local aggregation before global mixing.

This repo contains the full training stack, ablations, baseline comparisons, and analysis tools (MAD metrics + Entropy + attention mechanisms visualizations).

Table of Contents

Model (Main Architecture)

Block composition (OutGridBlock): Dynamic 3x3 Kernel -> MBConv -> GridAttn -> MLP

Input/output stays in 2D feature maps (NCHW). Only the GridAttn path temporarily uses BHWC and tokenized grids.

Block Forward (tensor form)

Input: x in R^{B x C x H x W}

$$ \begin{aligned} x_1 &= \mathrm{Dynamic Kernel}(x) \\ x_2 &= \mathrm{MBConv}(x_1) \\ \hat{x}_2 &= \mathrm{permute}(x_2) \\ y &= \hat{x}_2 + \mathrm{DP}(\mathrm{GridAttn}(\mathrm{LN}(\hat{x}_2))) \\ z &= y + \mathrm{DP}(\mathrm{MLP}(\mathrm{LN}(y))) \\ \mathrm{OutGrid}(x) &= \mathrm{permute}^{-1}(z) \end{aligned} $$

DP is stochastic depth. LN is LayerNorm in BHWC.

Model Forward (high level)

$$ \begin{aligned} x_0 &= \mathrm{Stem}(x) \\ x_{s,i+1} &= \mathrm{OutGrid}(x_{s,i}), \quad i=0..(d_s-1) \\ x_{s+1,0} &= \mathrm{Downsample}_s(x_{s,d_s}) \\ \mathrm{logits} &= \mathrm{Linear}(\mathrm{GAP}(\mathrm{BN}(x_{S,d_S}))) \end{aligned} $$

Visual Experiments

Dynamic Kernel behaves like a dynamic 3x3 kernel per position, while Grid Attention provides more global mixing per block. Below are sample overlays produced by the analysis pipeline.

Dynamic Kernel locality (kernel weights + center/spread overlays):

Outlooker locality example

More Kernel examples (different stages/blocks):

Example A Example B
Outlooker locality stage0 Outlooker locality stage1
Outlooker locality stage2 Outlooker locality stage3

Grid attention heatmap (query overlays over the input):

Grid attention example

More Grid Attention examples (different stages/blocks):

Example A Example B
Grid attention stage0 Grid attention stage1
Grid attention stage2 Grid attention stage3

Reported Results

Dataset Img size Top-1 (val/test) Params Notes
CIFAR-100 32 - / 79.72 7.5M Model A (7M), 100 epochs
CIFAR-100 64 78.7 / 81.2 ~14M Upsampled CIFAR-100
Tiny-ImageNet-200 64 66.5 / 69.8 22.5M Competitive for 22M params
SVHN 32 96.1 / - ~14M Val reported in logs

Convergence Curves

Model Comparisons (CIFAR-100, 32x32)

All baselines below were trained with the same recipe. Values are from logs/Logs Models Comparisons Cifrar100 100epocs.txt.

Model Top-1 Params FLOPs (per forward) Throughput (imgs/s)
OutGridViT (Model A, 7M) 79.72 7.52M 448 MFLOPs 1131.8
ConvNeXt-Tiny 72.60 27.89M 364 MFLOPs 1832.8
DeiT-Tiny (patch4) 63.77 5.38M 347 MFLOPs 5077.2
DeiT-Small (patch4) 59.00 21.38M 1.38 GFLOPs 3432.9
EfficientNetV2-S 64.62 20.31M 238 MFLOPs 2099.3
MaxViT-Nano (surgery) 75.41 17.38M 305 MFLOPs 1253.9
MaxViT-Tiny 75.90 30.43M 444 MFLOPs 980.4
ResNet18 (CIFAR stem) 73.25 11.22M 557 MFLOPs 5393.2
ResNet50 (CIFAR stem) 77.42 23.71M 1.30 GFLOPs 1918.7
Swin-Tiny (patch2) 59.89 27.57M 358 MFLOPs 3360.6

Throughput is hardware-dependent; use it for relative comparison only.

Robustness on TinyImageNet-C

Robustness evaluation (mean top-1 across corruptions; higher is better). Full logs in logs/Logs Models Convergance on TinyImagnet-C.txt.

Model Params GFLOPs Clean Top-1 Overall Mean Top-1
OutGridViT (Model A) 6.26M 1.56 57.81 26.59
ResNet50 23.91M 1.31 45.31 16.52
Swin-Tiny 27.65M 1.47 35.00 12.27
MaxViT-Tiny 30.48M 1.77 53.11 22.03

MAD + Entropy Metrics (Grid vs Dynamic Kernels)

Quantitative summary (CIFAR-100, Model A). GRID_abs is L1 distance in feature-map pixels, max (Hf-1)+(Wf-1) per stage. OUT_abs is L1 distance inside a 3x3 kernel, max = 2. Hn is normalized entropy in [0, 1].

Stage Hf x Wf GRID_abs OUT_abs GRID_Hn OUT_Hn
0 64 x 64 31.79 ± 1.13 1.33 ± 0.13 0.818 ± 0.037 0.894 ± 0.005
1 32 x 32 15.41 ± 0.15 1.66 ± 0.12 0.915 ± 0.013 0.868 ± 0.042
2 16 x 16 7.89 ± 0.18 1.67 ± 0.24 0.943 ± 0.011 0.850 ± 0.046
3 8 x 8 4.48 ± 0.33 1.73 ± 0.14 0.801 ± 0.054 0.814 ± 0.076

Interpretation: Dynamic Kernels stays strictly local, while Grid attention spreads wider and is more entropic in mid stages (1-2). Numbers from experiments_results/Cuantitaive Experiments/mad_entropy.ipynb.

Baseline Training Recipe

All baseline models use the same training config:

history, model = train_model(
    model=model,
    train_loader=train_loader,
    epochs=100,
    val_loader=val_loader,
    device=device,

    lr=5e-4,
    weight_decay=0.05,

    autocast_dtype="fp16" if device == "cuda" else "fp32",
    use_amp=(device == "cuda"),
    grad_clip_norm=1.0,

    warmup_ratio=0.05,
    min_lr=1e-6,

    label_smoothing=0.1,

    print_every=400,
    mix_prob=0.5,
    mixup_alpha=0.8,
    cutmix_alpha=1.0,

    num_classes=100,
    channels_last=True)

Configs

Prebuilt configs:

  • configs/cifar100_model_a_7m.yaml (7.5M, 100 epochs)
  • configs/cifar100_model_a_14m.yaml (14M, 100 epochs)
  • configs/cifar100_model_a.yaml (larger variant)
  • configs/cifar100_64_model_a.yaml
  • configs/cifar100_model_b.yaml
  • configs/svhn_model_a.yaml
  • configs/tinyimagenet200_model_a.yaml

configs/train.yaml is an alias of the CIFAR-100 Model A config.

Training (Model A)

pip install -r requirements.txt
python scripts/train.py --config configs/cifar100_model_a_7m.yaml
# or
python scripts/train.py --config configs/cifar100_model_a_14m.yaml

Baseline Comparisons (CIFAR-32)

Use the baseline runner to train timm models with a shared training recipe:

python scripts/train_cifar32_baselines.py --device cuda

Baselines included:

  • DeiT Tiny (patch4)
  • DeiT Small (patch4)
  • Swin Tiny (patch2)
  • MaxViT Tiny
  • MaxViT Nano (surgery)
  • ResNet18 (CIFAR stem)

Attention Analysis and MAD Metrics

The analysis CLI generates:

  • Dynamic Kernel locality visualizations
  • Grid attention heatmaps
  • MAD metrics for both attention paths
python scripts/run_attention_analysis.py \
  --config configs/cifar100_model_a.yaml \
  --checkpoint outputs/best_cifar100_model_a.pt \
  --split val \
  --output-dir analysis_outputs

Outputs:

  • analysis_outputs/outlooker/ (locality overlays + kernels)
  • analysis_outputs/grid/ (grid attention overlays)
  • analysis_outputs/mad_metrics.json and .csv

Additional figures can be generated with the CLI and added to experiments_results/ or a dedicated figures/ folder. Entropy metrics are computed in src/experiments/entropy_metrics.py and summarized in experiments_results/Cuantitaive Experiments/mad_entropy.ipynb.

Project Structure

configs/          # YAML configs
scripts/          # CLI entrypoints
src/              # models, blocks, training, experiments
training_notebooks/  # ablations + baselines
tests/            # pytest suite

Conclusion

OutGridViT shows that injecting local dynamic aggregation (Dynamic Kernels) inside every block, followed by MBConv and Grid Attention, yields a strong accuracy/compute trade-off. On CIFAR-100 (32x32), the 7.5M-parameter variant reaches 79.72 Top-1 while keeping FLOPs competitive with larger baselines. The convergence plots indicate stable training, and TinyImageNet-C results suggest better robustness than heavier transformer baselines at similar or higher compute.

The MAD + entropy analysis supports the architectural intent: Dynamic Kernels remains highly local, while Grid attention spreads broader mixing, especially in mid stages. This hybrid behavior is consistent with the qualitative attention maps and explains why the model captures both local texture and global structure without a full ViT token pipeline.

Overall, the model provides a practical hybrid inductive bias that is competitive under limited compute and should serve as a solid base for further scaling or low-GPU research settings.

Notes

  • Grid attention requires H and W divisible by grid_size.
  • Dynamic Kernels kernel size must be odd and > 0.
  • NCHW for conv/outlook, BHWC for grid attention.

License

This project is distributed under the MIT License. Feel free to use it in research, production or just experimentation with attribution.

About

Hybrid vision backbone: VOLO-style Outlook local mixing + MaxViT grid attention + MBConv. Two variants (front-only vs per-block Outlook), FP16 training CLI with CutMix, and reproducible CIFAR-100 64×64 benchmarks.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors