Is your feature request related to a problem? Please describe.
At high vector dimensionality (1536–5120+), float32 storage is expensive and existing quantization options (int8, binary/hamming) either sacrifice too much recall or offer limited compression. There is no native sub-int8 quantization option with correctness guarantees for HNSW traversal.
Describe the solution you'd like
Native support for TurboQuant as a distance-metric option.
TurboQuant is a new online vector quantization algorithm from Google Research that compresses vectors to 3–4 bits per dimension with provably near-optimal distortion, requires no training phase, and achieves higher recall than Product Quantization in nearest-neighbor search benchmarks. For a 5120-dimensional float32 vector this means roughly 6x memory reduction with near-lossless retrieval quality.
More details: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
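For concreteness, here is a hedged sketch of how the option might surface in a schema, assuming it were exposed the same way as today's distance metrics. The `turboquant` value is hypothetical (it does not exist in Vespa), and the final surface could just as well be a new tensor cell type or an HNSW setting; the HNSW parameters shown are just the existing knobs and are not TurboQuant-specific.

```
field embedding type tensor<float>(x[5120]) {
    indexing: attribute | index
    attribute {
        # hypothetical new value; existing options include euclidean, angular,
        # dotproduct, prenormalized-angular, hamming, geodegrees
        distance-metric: turboquant
    }
    index {
        hnsw {
            max-links-per-node: 16
            neighbors-to-explore-at-insert: 200
        }
    }
}
```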
Describe alternatives you've considered
int8 and binary (hamming) quantization are the current best options in Vespa, but at high dimensionality int8 offers only 4x compression over float32 and binary quantization degrades recall significantly, compared with what TurboQuant reports.
Additional context
The algorithm is data-oblivious (no dataset-specific calibration), making it well-suited for Vespa's real-time indexing model. A reference implementation of the QJL component — the 1-bit residual correction stage that makes the inner product estimator unbiased — is available at https://github.com/amirzandieh/QJL (Apache 2.0). The paper is arXiv:2504.19874, to be presented at ICLR 2026.
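To make the "unbiased inner product estimator" point concrete, below is a minimal numpy sketch of the 1-bit sign/JL construction and the rescaling that removes the bias. This is an illustration only, not the reference implementation and not proposed Vespa code; the function names (`qjl_encode`, `qjl_inner_product`) and the sizes (d = 1536 dimensions, m = 512 one-bit measurements) are hypothetical choices for the example.

```python
# Minimal sketch of a QJL-style 1-bit estimator (assumption: illustrates the idea,
# not the reference repo's API and not any Vespa internals).
import numpy as np

def qjl_encode(key: np.ndarray, S: np.ndarray):
    """Compress a key to 1 bit per projection (the sign) plus its norm."""
    return np.sign(S @ key), float(np.linalg.norm(key))

def qjl_inner_product(query: np.ndarray, key_bits: np.ndarray, key_norm: float,
                      S: np.ndarray) -> float:
    """Unbiased inner-product estimate: for Gaussian rows s,
    E[sign(<s,k>) * <s,q>] = sqrt(2/pi) * <q,k> / ||k||,
    so rescaling by ||k|| * sqrt(pi/2) / m removes the bias."""
    m = S.shape[0]
    return key_norm * np.sqrt(np.pi / 2.0) / m * float(key_bits @ (S @ query))

# Toy check with hypothetical sizes
rng = np.random.default_rng(0)
d, m = 1536, 512
S = rng.standard_normal((m, d))         # data-oblivious Gaussian JL projection
q = rng.standard_normal(d)
k = q + 0.25 * rng.standard_normal(d)   # a key correlated with the query
bits, k_norm = qjl_encode(k, S)
est, exact = qjl_inner_product(q, bits, k_norm, S), float(q @ k)
print(est, exact)                       # agree up to O(1/sqrt(m)) relative error
```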