Commit 1d682fc

Improved the persistence logic by separating the files for each document.
1 parent c283603 commit 1d682fc

9 files changed: +995 -435 lines changed

README.md

Lines changed: 64 additions & 26 deletions
@@ -5,7 +5,8 @@ A pure-PHP vector database implementing **HNSW** (Hierarchical Navigable Small W
 ## Requirements
 
 - PHP 8.1+
-- No external PHP extensions required
+- No external PHP extensions required for core functionality
+- `ext-pcntl` (optional) — enables asynchronous document writes for lower insert latency
 
 ## Installation
 
@@ -17,7 +18,7 @@ composer require ezimuel/phpvector
 
 ### 1. Insert documents
 
-A `Document` holds a dense embedding vector, optional raw text for BM25, and any metadata you want returned with results.
+A `Document` holds a dense embedding vector, optional raw text for BM25, and any metadata you want returned with results. The `id` field is optional — if omitted, a random UUID v4 is assigned automatically.
 
 ```php
 use PHPVector\Document;
@@ -44,6 +45,11 @@ $db->addDocuments([
         text: 'BM25 full-text ranking algorithm explained',
         metadata: ['url' => 'https://example.com/3', 'lang' => 'en'],
     ),
+    // No id — a UUID v4 is assigned automatically.
+    new Document(
+        vector: [0.55, 0.42, 0.71, 0.30],
+        text: 'Hybrid search with Reciprocal Rank Fusion',
+    ),
 ]);
 ```
 
@@ -180,76 +186,108 @@ $db = new VectorDatabase(
 
 ## Persistence
 
-The full database state — HNSW graph, BM25 index, and all documents — can be saved to a single binary file and restored in one call. The format (`PHPV`) uses raw `pack/unpack` for float arrays and integer sequences, so reads and writes are fast even for large indexes.
+PHPVector uses a **folder-based** persistence model. Each database lives in its own directory containing separate files for the HNSW graph, the BM25 index, and one file per document. This design has two key advantages:
+
+- **Low memory footprint on load** — only the HNSW graph and BM25 index are loaded into memory. Individual document files (`docs/{n}.bin`) are read lazily, only for the documents that appear in search results.
+- **Low insert latency** — document files are written to disk asynchronously in a forked child process (requires `ext-pcntl`), so `addDocument()` returns immediately.
+
+### Folder layout
+
+```
+/var/data/mydb/
+  meta.json — distance metric, dimension, document ID map
+  hnsw.bin — HNSW graph (vectors + connections)
+  bm25.bin — BM25 inverted index
+  docs/
+    0.bin — document 0 (id, text, metadata)
+    1.bin — document 1
+
+```
 
 ### Saving
 
+Pass a `path` to the constructor to enable persistence. Each `addDocument()` call writes the document file to `docs/` (asynchronously when `ext-pcntl` is available). Call `save()` once to flush the HNSW graph and BM25 index — it waits for any outstanding async writes before proceeding.
+
 ```php
-// Build and populate the index as usual.
-$db = new VectorDatabase();
+use PHPVector\Document;
+use PHPVector\VectorDatabase;
+
+$db = new VectorDatabase(path: '/var/data/mydb');
 
 $db->addDocuments([
     new Document(id: 1, vector: [0.12, 0.85, 0.44], text: 'PHP vector search', metadata: ['source' => 'blog']),
     new Document(id: 2, vector: [0.91, 0.23, 0.78], text: 'Approximate nearest neighbour'),
     // ... thousands more
 ]);
 
-// Persist to disk — single write, binary format.
-$db->persist('/var/data/myindex.phpv');
+// Flush HNSW graph + BM25 index to disk (document files already written).
+$db->save();
 ```
 
 ### Loading
 
-Pass the same `HNSWConfig` (including the same `distance` metric) that was used when building the index. The method throws `\RuntimeException` if the distance codes do not match.
+Use `VectorDatabase::open()` to load a previously saved folder. Only `hnsw.bin` and `bm25.bin` are read into memory; document files are loaded on demand after search.
+
+Pass the same `HNSWConfig` (including the same `distance` metric) that was used when building the index — a `RuntimeException` is thrown on mismatch.
 
 ```php
-use PHPVector\BM25\Config as BM25Config;
-use PHPVector\Distance;
-use PHPVector\HNSW\Config as HNSWConfig;
 use PHPVector\VectorDatabase;
 
-$db = VectorDatabase::load('/var/data/myindex.phpv');
-```
-
-All three search modes work immediately after loading:
+$db = VectorDatabase::open('/var/data/mydb');
 
-```php
+// All three search modes work immediately.
 $results = $db->vectorSearch(vector: $queryVector, k: 5);
 $results = $db->textSearch(query: 'nearest neighbour', k: 5);
 $results = $db->hybridSearch(vector: $queryVector, text: 'nearest neighbour', k: 5);
 ```
 
-### Custom configuration on load
-
-If the index was built with non-default settings, pass the same config objects to `load()`:
+### Custom configuration on open
 
 ```php
-$db = VectorDatabase::load(
-    path: '/var/data/myindex.phpv',
+use PHPVector\BM25\Config as BM25Config;
+use PHPVector\Distance;
+use PHPVector\HNSW\Config as HNSWConfig;
+use PHPVector\VectorDatabase;
+
+$db = VectorDatabase::open(
+    path: '/var/data/mydb',
     hnswConfig: new HNSWConfig(
         M: 16,
         efSearch: 100,
-        distance: Distance::Euclidean, // must match what was used on persist
+        distance: Distance::Euclidean, // must match the value used on save()
     ),
     bm25Config: new BM25Config(k1: 1.2, b: 0.8),
     tokenizer: new MyCustomTokenizer(),
 );
 ```
 
-> **Note:** Only `efSearch` and `bm25Config`/`tokenizer` affect query-time behaviour and can differ from build time. `distance` and the graph parameters (`M`, `efConstruction`) are fixed at build time — `distance` is validated on load and must match.
+> **Note:** Only `efSearch` and `bm25Config`/`tokenizer` affect query-time behaviour and can differ from build time. `distance` and the graph parameters (`M`, `efConstruction`) are fixed at build time — `distance` is validated on `open()` and must match.
+
+### Incremental updates
+
+You can add new documents to a database that was loaded from disk, then call `save()` again. The existing document files are left in place; only the new ones are written along with updated index files.
+
+```php
+$db = VectorDatabase::open('/var/data/mydb');
+$db->addDocument(new Document(vector: [0.55, 0.42, 0.71], text: 'New document'));
+$db->save(); // writes docs/N.bin + updated hnsw.bin, bm25.bin, meta.json
+```
 
 ### Typical workflow: build once, serve many
 
 ```php
 // build.php — run once (or nightly)
-$db = new VectorDatabase(hnswConfig: new HNSWConfig(M: 32, efConstruction: 400));
+$db = new VectorDatabase(
+    hnswConfig: new HNSWConfig(M: 32, efConstruction: 400),
+    path: '/var/data/mydb',
+);
 foreach (fetchDocumentsFromDatabase() as $doc) {
     $db->addDocument($doc);
 }
-$db->persist('/var/data/myindex.phpv');
+$db->save();
 
 // serve.php — loaded on every request or worker boot
-$db = VectorDatabase::load('/var/data/myindex.phpv', new HNSWConfig(M: 32));
+$db = VectorDatabase::open('/var/data/mydb', new HNSWConfig(M: 32));
 $results = $db->vectorSearch($queryVector, k: 10);
 ```
 

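The README changes above describe document files being written in a forked child process when `ext-pcntl` is available, with `save()` waiting for outstanding writes. The visible hunks do not include that code, so the following is only a minimal sketch of the pattern. `writeDocumentFile()` and `waitForWrites()` are hypothetical names, not PHPVector APIs; the `pcntl_*` calls are the standard extension functions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch, not the PHPVector implementation: write $payload to
 * $file in a forked child when ext-pcntl is loaded, otherwise synchronously.
 *
 * @return int|null Child PID to wait on later, or null if already written.
 */
function writeDocumentFile(string $file, string $payload): ?int
{
    if (function_exists('pcntl_fork')) {
        $pid = pcntl_fork();
        if ($pid === 0) {
            // Child process: perform the write, then exit.
            file_put_contents($file, $payload);
            exit(0);
        }
        if ($pid > 0) {
            // Parent process: return at once and let the child finish.
            return $pid;
        }
        // $pid === -1: the fork failed, so fall through to a blocking write.
    }

    file_put_contents($file, $payload);
    return null;
}

/**
 * Block until every outstanding child has exited, as a save() step might.
 *
 * @param int[] $pids PIDs returned by writeDocumentFile().
 */
function waitForWrites(array $pids): void
{
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);
    }
}
```

A caller would collect the non-null PIDs returned by each write and pass them to `waitForWrites()` before flushing the index files, so that no document file is still in flight when the indexes are written.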
composer.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
     "name": "ezimuel/phpvector",
-    "description": "A fast vector database in PHP implementing HNSW for approximate nearest-neighbor search and BM25 for hybrid full-text + vector retrieval.",
+    "description": "A vector database in PHP implementing HNSW for approximate nearest-neighbor search and BM25 for hybrid full-text + vector retrieval.",
     "type": "library",
     "license": "MIT",
     "require": {

src/Document.php

Lines changed: 5 additions & 5 deletions
@@ -12,14 +12,14 @@
 final class Document
 {
     /**
-     * @param string|int $id Unique identifier (user-supplied or auto-assigned).
-     * @param float[] $vector Dense embedding vector.
-     * @param string|null $text Raw text content used for BM25 indexing (optional).
+     * @param string|int|null $id Unique identifier (user-supplied or auto-assigned UUID when null).
+     * @param float[] $vector Dense embedding vector.
+     * @param string|null $text Raw text content used for BM25 indexing (optional).
      * @param array<string, mixed> $metadata Arbitrary key-value payload returned with results.
      */
     public function __construct(
-        public readonly string|int $id,
-        public readonly array $vector,
+        public readonly string|int|null $id = null,
+        public readonly array $vector = [],
         public readonly ?string $text = null,
         public readonly array $metadata = [],
     ) {}

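The docblock above says a UUID is auto-assigned when `$id` is null, but this hunk does not show where or how that happens. As a rough illustration only, a version 4 UUID can be built from 16 random bytes; `uuidV4()` is a hypothetical helper name and is not taken from the PHPVector source.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not taken from the PHPVector source: builds a
 * version 4 (random) UUID string from 16 cryptographically random bytes.
 */
function uuidV4(): string
{
    $bytes = random_bytes(16);

    // Force the version nibble to 4 (bits 0100xxxx in byte 6).
    $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40);
    // Force the variant bits to 10xxxxxx in byte 8.
    $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80);

    // 32 hex characters split into eight 4-character groups: 8-4-4-4-12.
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
}
```

Under this assumption, the database could call such a helper when it receives a `Document` whose `id` is null, leaving user-supplied ids untouched.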
src/HNSW/Index.php

Lines changed: 15 additions & 0 deletions
@@ -367,6 +367,21 @@ public function count(): int
         return count($this->nodes);
     }
 
+    /**
+     * Return the raw (un-normalised) vector stored for $nodeId.
+     * Used by VectorDatabase when hydrating lazy-loaded Documents.
+     *
+     * @return float[]
+     * @throws \OutOfBoundsException if $nodeId is not present.
+     */
+    public function getVector(int $nodeId): array
+    {
+        if (!isset($this->nodes[$nodeId])) {
+            throw new \OutOfBoundsException("No HNSW node with id {$nodeId}.");
+        }
+        return $this->nodes[$nodeId]->vector;
+    }
+
     /** Returns all stored documents. */
     public function getDocuments(): array
     {

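The `getVector()` docblock mentions hydrating lazy-loaded documents, and the README says `docs/{n}.bin` files are read only for documents that appear in search results. The loader itself is not part of the visible hunks, so this is a hedged sketch of the idea. `loadDocuments()` is a hypothetical function, and JSON is assumed as the file encoding purely for illustration, since the real `.bin` format is not shown.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: load only the documents whose node IDs appear in a
 * result set. JSON is an assumed encoding; the real docs/{n}.bin format is
 * not shown in this commit.
 *
 * @param int[] $nodeIds Node IDs returned by a search.
 * @return array<int, array<string, mixed>> Decoded documents keyed by node ID.
 */
function loadDocuments(string $dbPath, array $nodeIds): array
{
    $docs = [];
    foreach ($nodeIds as $nodeId) {
        $file = sprintf('%s/docs/%d.bin', rtrim($dbPath, '/'), $nodeId);
        if (!is_file($file)) {
            throw new RuntimeException("Missing document file: {$file}");
        }
        // Only the files for the requested IDs are ever opened.
        $docs[$nodeId] = json_decode(file_get_contents($file), true, 512, JSON_THROW_ON_ERROR);
    }
    return $docs;
}
```

With this shape, a top-k search touches at most k document files regardless of how many documents the folder holds, which is the memory benefit the README describes.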