Merged
Changes from 1 commit
20 changes: 20 additions & 0 deletions README.md
@@ -347,6 +347,26 @@ Available providers:
- `ItalianStopWords` - Italian stop words
- `FileStopWords` - Load from file

## Deleting and updating documents

```php
// Delete a document by ID
$deleted = $db->deleteDocument(1); // returns true if found, false otherwise

// Update a document (delete + insert with same ID)
$updated = $db->updateDocument(new Document(
id: 1,
vector: [0.5, 0.5, 0.3, 0.2],
text: 'Updated content here',
metadata: ['version' => 2],
));

// After modifications, call save() to persist
$db->save();
```

Deleted documents are soft-deleted from the HNSW graph (kept for connectivity but excluded from results) and fully removed from the BM25 index. Document files are deleted from disk immediately.

## Custom tokenizer

Implement `TokenizerInterface` to plug in stemming, lemmatization, or any language-specific logic.
33 changes: 33 additions & 0 deletions src/BM25/Index.php
@@ -192,6 +192,39 @@ public function count(): int
return count($this->documents);
}

/**
* Remove a document from the index.
*
* @param int $nodeId Internal node-ID of the document to remove.
* @return bool True if the document was removed, false if it didn't exist.
*/
public function removeDocument(int $nodeId): bool
{
if (!isset($this->documents[$nodeId])) {
return false;
}

// Update totalTokens.
if (isset($this->docLengths[$nodeId])) {
$this->totalTokens -= $this->docLengths[$nodeId];
unset($this->docLengths[$nodeId]);
}

// Remove from inverted index.
foreach ($this->invertedIndex as $term => &$postings) {
unset($postings[$nodeId]);
// Remove empty posting lists to save memory.
if (empty($postings)) {
Copilot AI Mar 25, 2026

removeDocument() scans the entire $invertedIndex vocabulary to remove a single nodeId, which is O(|V|) per delete and can become expensive as the index grows (delete/update are now public APIs). A more scalable approach is to track per-document term lists on insert so deletion only touches terms that were present in the removed document.

Owner

Fixed in c79b9ca

unset($this->invertedIndex[$term]);
Copilot AI Mar 27, 2026

removeDocument() removes a single doc by iterating over the entire $invertedIndex vocabulary and unsetting the nodeId from each postings list (O(|V|) per delete). Since delete/update are public APIs now, this can become a bottleneck on large indexes. Consider tracking per-document term lists on insert so deletions only touch terms that appeared in that document.

Collaborator Author

interesting. 🤔

Owner

Fixed with commit c79b9ca

}
}
unset($postings);

unset($this->documents[$nodeId]);

return true;
}
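
The review suggestion above (tracking per-document term lists so that a delete only touches the terms of the removed document) could look roughly like the following. This is a minimal sketch, not the code from commit c79b9ca: the class `TermTrackingIndex`, its `$docTerms` property, and the pre-tokenized input are all assumptions made for illustration.

```php
<?php

/**
 * Toy inverted index (hypothetical, not the library's BM25 Index class).
 * It remembers which terms each document contributed, so removal only
 * visits those posting lists instead of the whole vocabulary.
 */
final class TermTrackingIndex
{
    /** @var array<int|string, array<int, int>> term => [nodeId => term frequency] */
    private array $invertedIndex = [];

    /** @var array<int, list<int|string>> nodeId => unique terms of that document */
    private array $docTerms = [];

    /** @var array<int, int> nodeId => token count */
    private array $docLengths = [];

    private int $totalTokens = 0;

    /** @param list<string> $tokens Already-tokenized document text (assumption). */
    public function addDocument(int $nodeId, array $tokens): void
    {
        // Re-adding an existing node replaces it, keeping totals consistent.
        $this->removeDocument($nodeId);

        $frequencies = array_count_values($tokens);
        foreach ($frequencies as $term => $tf) {
            $this->invertedIndex[$term][$nodeId] = $tf;
        }

        $this->docTerms[$nodeId] = array_keys($frequencies);
        $this->docLengths[$nodeId] = count($tokens);
        $this->totalTokens += count($tokens);
    }

    public function removeDocument(int $nodeId): bool
    {
        if (!isset($this->docTerms[$nodeId])) {
            return false;
        }

        // Only the posting lists this document appears in are touched.
        foreach ($this->docTerms[$nodeId] as $term) {
            unset($this->invertedIndex[$term][$nodeId]);
            if ($this->invertedIndex[$term] === []) {
                unset($this->invertedIndex[$term]);
            }
        }

        $this->totalTokens -= $this->docLengths[$nodeId];
        unset($this->docTerms[$nodeId], $this->docLengths[$nodeId]);

        return true;
    }

    public function vocabularySize(): int
    {
        return count($this->invertedIndex);
    }

    public function totalTokens(): int
    {
        return $this->totalTokens;
    }
}

$index = new TermTrackingIndex();
$index->addDocument(0, ['red', 'apple', 'red']);
$index->addDocument(1, ['green', 'apple']);
$removed = $index->removeDocument(0);
```

The trade-off is extra memory for `$docTerms` in exchange for a delete cost proportional to the number of distinct terms in the removed document rather than to the vocabulary size.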

/** Vocabulary size (unique terms in the index). */
public function vocabularySize(): int
{
65 changes: 60 additions & 5 deletions src/HNSW/Index.php
@@ -118,6 +118,14 @@ final class Index
/** Expected vector dimension (set on first insert). */
private ?int $dimension = null;

/**
* Set of soft-deleted node IDs.
* Deleted nodes remain in the graph for connectivity but are excluded from results.
*
* @var array<int, true>
*/
private array $deleted = [];

/**
* Resolved distance closure — built once in the constructor so the
* per-call match() dispatch is removed from the hot path.
@@ -356,15 +364,51 @@ public function search(array $query, int $k = 10, ?int $ef = null): array
// Full beam search at layer 0.
$W = $this->searchLayer($qv, [[$epDist, $ep]], $ef, 0);

- // Take the k nearest and convert to SearchResult.
+ // Filter out soft-deleted nodes and take the k nearest.
if (!empty($this->deleted)) {
$W = array_values(array_filter(
$W,
fn(array $pair) => !isset($this->deleted[$pair[1]])
Copilot AI Mar 25, 2026

Filtering soft-deleted nodes only after searchLayer() can make search() return fewer than $k results (even 0) when $ef is small (e.g. $ef === $k) and a top candidate is deleted, despite other active nodes existing. Consider constructing the final top-k by skipping deleted nodes and/or retrying with a larger ef until you have up to $k active results.

Owner

Fixed with commit a201eb1

));
}

$topK = array_slice($W, 0, $k);
return $this->toSearchResults($topK);
}
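
The concern raised in the review above (post-filtering can leave fewer than `$k` results when `$ef` is small) can be addressed by retrying with a larger `ef`. The sketch below is one possible shape of that idea, not the fix from commit a201eb1. The function `searchActive()`, the `$searchLayer` callable, and the brute-force stand-in for the beam search are hypothetical.

```php
<?php

/**
 * Return up to $k active [distance, nodeId] pairs, widening ef as needed.
 *
 * @param callable(int): array $searchLayer Hypothetical stand-in for the beam
 *        search: given ef, returns up to ef [distance, nodeId] pairs sorted
 *        by ascending distance.
 * @param array<int, true> $deleted Soft-deleted node IDs.
 */
function searchActive(callable $searchLayer, array $deleted, int $k, int $ef, int $nodeCount): array
{
    $ef = max($ef, $k);

    while (true) {
        $candidates = $searchLayer($ef);

        // Drop soft-deleted nodes; array_filter keeps the distance order.
        $active = array_values(array_filter(
            $candidates,
            static fn (array $pair): bool => !isset($deleted[$pair[1]])
        ));

        // Stop when enough active results exist or the whole graph was covered.
        if (count($active) >= $k || $ef >= $nodeCount) {
            return array_slice($active, 0, $k);
        }

        $ef = min($ef * 2, $nodeCount);
    }
}

// Toy data: nodeId => distance to the query. A brute-force sort stands in
// for the real layer-0 beam search.
$distances = [0 => 0.0, 1 => 0.1, 2 => 0.4, 3 => 0.9];
$searchLayer = static function (int $ef) use ($distances): array {
    asort($distances);
    $pairs = [];
    foreach (array_slice($distances, 0, $ef, true) as $nodeId => $distance) {
        $pairs[] = [$distance, $nodeId];
    }
    return $pairs;
};

// Node 0 is the nearest neighbour but has been soft-deleted.
$top = searchActive($searchLayer, [0 => true], k: 2, ef: 2, nodeCount: count($distances));
$topIds = array_column($top, 1);
```

With `ef: 2` the first pass yields only one active node, so the loop doubles `ef` once and then returns nodes 1 and 2. Doubling is an arbitrary growth policy chosen for the sketch; a real index might instead skip deleted nodes while building the result set.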

- /** Total number of documents in the index. */
+ /**
+ * Total number of active (non-deleted) documents in the index.
+ */
public function count(): int
{
- return count($this->nodes);
+ return count($this->nodes) - count($this->deleted);
}

/**
* Soft-delete a node by its internal ID.
*
* The node remains in the graph (for connectivity) but is excluded from
* search results. This is the standard approach for HNSW deletion as
* physically removing nodes would require expensive graph repairs.
*
* @return bool True if the node was deleted, false if it didn't exist or was already deleted.
*/
public function delete(int $nodeId): bool
{
if (!isset($this->nodes[$nodeId]) || isset($this->deleted[$nodeId])) {
return false;
}

$this->deleted[$nodeId] = true;
return true;
}

/**
* Check if a node has been soft-deleted.
*/
public function isDeleted(int $nodeId): bool
{
return isset($this->deleted[$nodeId]);
}

/**
@@ -396,7 +440,8 @@ public function getDocuments(): array
* maxLayer: int,
* dimension: int|null,
* nodes: array<int, array{maxLayer: int, vector: float[], connections: array<int, int[]>}>,
- * documents: array<int, array{id: string|int, text: string|null, metadata: array}>
+ * documents: array<int, array{id: string|int, text: string|null, metadata: array}>,
+ * deleted: int[]
* }
*/
public function exportState(): array
@@ -425,6 +470,7 @@ public function exportState(): array
'dimension' => $this->dimension,
'nodes' => $nodes,
'documents' => $documents,
'deleted' => array_keys($this->deleted),
];
}

@@ -437,7 +483,8 @@ public function exportState(): array
* maxLayer: int,
* dimension: int|null,
* nodes: array<int, array{maxLayer: int, vector: float[], connections: array<int, int[]>}>,
- * documents: array<int, array{id: string|int, text: string|null, metadata: array}>
+ * documents: array<int, array{id: string|int, text: string|null, metadata: array}>,
+ * deleted?: int[]
* } $state
*/
public function importState(array $state): void
@@ -450,6 +497,14 @@ public function importState(array $state): void

$this->nodes = [];
$this->documents = [];
$this->deleted = [];

// Restore deleted set.
if (!empty($state['deleted'])) {
foreach ($state['deleted'] as $deletedId) {
$this->deleted[(int) $deletedId] = true;
}
}

foreach ($state['nodes'] as $nodeId => $nodeData) {
$node = new Node((int) $nodeId, $nodeData['vector'], $nodeData['maxLayer']);
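
The export and import hunks above store the deleted set as a plain list of IDs and rebuild the `id => true` lookup on load, with `deleted` optional on import. A minimal round-trip sketch of that pattern, using hypothetical helper functions `exportDeleted()` and `importDeleted()` rather than the class methods themselves:

```php
<?php

/**
 * Flatten the lookup set into a list of IDs for serialization.
 *
 * @param array<int, true> $deleted
 * @return list<int>
 */
function exportDeleted(array $deleted): array
{
    return array_keys($deleted);
}

/**
 * Rebuild the lookup set; a state without a 'deleted' key yields an empty set.
 *
 * @param array<string, mixed> $state
 * @return array<int, true>
 */
function importDeleted(array $state): array
{
    $deleted = [];
    foreach ($state['deleted'] ?? [] as $id) {
        // Cast defensively in case the IDs were decoded as strings.
        $deleted[(int) $id] = true;
    }
    return $deleted;
}

$deleted = [3 => true, 7 => true];

// Round trip through JSON, as a meta file on disk would.
$json = json_encode(['deleted' => exportDeleted($deleted)], JSON_THROW_ON_ERROR);
$restored = importDeleted(json_decode($json, true, 512, JSON_THROW_ON_ERROR));

// A state written before the 'deleted' key existed still loads cleanly.
$legacy = importDeleted(json_decode('{"nodes": []}', true, 512, JSON_THROW_ON_ERROR));
```

Keeping the in-memory form keyed by ID makes the membership test in `search()` an `isset()` lookup, while the serialized list stays compact.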
77 changes: 75 additions & 2 deletions src/VectorDatabase.php
@@ -167,6 +167,77 @@ public function addDocuments(array $documents): void
}
}

/**
* Delete a document by its user-visible ID.
*
* The document is soft-deleted from HNSW (excluded from results but kept
* for graph connectivity) and fully removed from the BM25 index.
*
* When persistence is enabled, the document file is also deleted from disk.
* Call `save()` afterward to persist the updated index state.
*
* @param string|int $id The document ID to delete.
* @return bool True if the document was deleted, false if it didn't exist.
*/
public function deleteDocument(string|int $id): bool
{
if (!isset($this->docIdToNodeId[$id])) {
return false;
}

$nodeId = $this->docIdToNodeId[$id];

// Soft-delete from HNSW (node stays for connectivity, excluded from results).
$this->hnswIndex->delete($nodeId);

// Fully remove from BM25.
$this->bm25Index->removeDocument($nodeId);

// Remove from local caches.
unset($this->nodeIdToDoc[$nodeId]);
unset($this->docIdToNodeId[$id]);

// Delete document file from disk if persistence is enabled.
if ($this->path !== null) {
$docFile = $this->path . '/docs/' . $nodeId . '.bin';
if (file_exists($docFile)) {
unlink($docFile);
}
Comment on lines +208 to +225
Copilot AI Mar 25, 2026

With persistence enabled, doc files may still be in-flight from an async DocumentStore::write() (pcntl_fork). If a document is deleted before that write finishes, a late child write can recreate {nodeId}.bin after the delete, so the file may not actually stay deleted. Consider waiting for outstanding writes (or adding per-node cancel/atomic write semantics) before removing the file.

Owner

Fixed with commit 65b16da

}

return true;
}

/**
* Update a document by replacing it entirely.
*
* This is equivalent to deleteDocument() followed by addDocument() with the
* same ID. The document gets a new internal nodeId, so this is effectively
* a delete + insert operation.
*
* @param Document $document The updated document. Must have the same ID as an existing document.
* @return bool True if the document was updated, false if it didn't exist.
* @throws \RuntimeException if the document has no ID.
*/
public function updateDocument(Document $document): bool
{
if ($document->id === null) {
throw new \RuntimeException('Cannot update a document without an ID.');
}

if (!isset($this->docIdToNodeId[$document->id])) {
return false;
}

// Delete the old document.
$this->deleteDocument($document->id);

// Insert the new version.
$this->addDocument($document);

return true;
}
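
Because `updateDocument()` is a delete followed by an insert, the document keeps its user-visible ID but receives a new internal node ID, and the node-ID counter no longer equals the number of active documents. The toy class below illustrates that bookkeeping only. `ToyIdMap` and its methods are hypothetical and leave out the HNSW and BM25 indexes entirely.

```php
<?php

/** Hypothetical stand-in for the user-ID to node-ID bookkeeping. */
final class ToyIdMap
{
    /** @var array<int|string, int> user-visible ID => internal node ID */
    private array $docIdToNodeId = [];

    private int $nextId = 0;

    public function add(string|int $id): int
    {
        $nodeId = $this->nextId++;
        $this->docIdToNodeId[$id] = $nodeId;
        return $nodeId;
    }

    public function delete(string|int $id): bool
    {
        if (!isset($this->docIdToNodeId[$id])) {
            return false;
        }
        unset($this->docIdToNodeId[$id]);
        return true;
    }

    /** Delete + insert: returns the new node ID, or null if the ID was unknown. */
    public function update(string|int $id): ?int
    {
        return $this->delete($id) ? $this->add($id) : null;
    }

    public function activeCount(): int
    {
        return count($this->docIdToNodeId);
    }

    public function allocatedIds(): int
    {
        return $this->nextId;
    }
}

$map = new ToyIdMap();
$firstNodeId = $map->add(1);
$secondNodeId = $map->update(1);
$missing = $map->update(42);
```

After one insert and one update, two node IDs have been allocated but only one document is active, which is why a count based on the next node ID would over-report once deletes exist.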

// ------------------------------------------------------------------
// Search
// ------------------------------------------------------------------
@@ -307,6 +378,7 @@ public function save(): void
'docIdToNodeId' => $this->docIdToNodeId,
'entryPoint' => $hnswState['entryPoint'],
'maxLayer' => $hnswState['maxLayer'],
'deleted' => $hnswState['deleted'],
];
if (file_put_contents($this->path . '/meta.json', json_encode($meta, JSON_PRETTY_PRINT | JSON_THROW_ON_ERROR)) === false) {
throw new \RuntimeException("Failed to write meta.json in: {$this->path}");
@@ -376,6 +448,7 @@ public static function open(
// HNSW needs these in $documents[] to return SearchResult objects.
$hnswState = $hnswData;
$hnswState['documents'] = [];
$hnswState['deleted'] = $meta['deleted'] ?? [];
foreach ($hnswData['nodes'] as $nodeId => $nodeData) {
$docId = $nodeIdToDocId[$nodeId] ?? $nodeId;
$hnswState['documents'][$nodeId] = [
@@ -407,10 +480,10 @@ public static function open(
// Utilities
// ------------------------------------------------------------------

- /** Total number of documents stored. */
+ /** Total number of active (non-deleted) documents stored. */
public function count(): int
{
- return $this->nextId;
+ return $this->hnswIndex->count();
}

// ------------------------------------------------------------------
69 changes: 69 additions & 0 deletions tests/PersistenceTest.php
@@ -387,4 +387,73 @@ public function testIncrementalSave(): void
$ids = array_map(fn($r) => $r->document->id, $results);
self::assertContains(1, $ids);
}

// ------------------------------------------------------------------
// Delete persistence
// ------------------------------------------------------------------

public function testDeletedDocumentsArePersistedAndExcluded(): void
{
$db = $this->makeDb();
$db->addDocument(new Document(id: 1, vector: [1.0, 0.0], text: 'first'));
$db->addDocument(new Document(id: 2, vector: [0.9, 0.1], text: 'second'));
$db->addDocument(new Document(id: 3, vector: [0.0, 1.0], text: 'third'));

// Delete document 1
$db->deleteDocument(1);
$db->save();

// Reload and verify
$loaded = $this->openDb();
self::assertSame(2, $loaded->count());

// Document 1 should not appear in results
$results = $loaded->vectorSearch([1.0, 0.0], k: 3);
$ids = array_map(fn($r) => $r->document->id, $results);
self::assertNotContains(1, $ids);
self::assertContains(2, $ids);
}
Comment on lines +411 to +415
Copilot AI Mar 25, 2026

This only checks that user-visible ID 1 isn’t present, but after open() the deleted node’s stub Document ID falls back to its nodeId (e.g. 0), so the test could still pass even if the deleted node is erroneously returned. Consider also asserting the result count is 2 (only two active docs remain) and/or explicitly asserting the deleted nodeId isn’t returned in $ids for this fixture.

Owner

I don't think this is a real issue
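
The reviewer's argument in this thread can be made concrete with plain arrays. The sketch below assumes a hypothetical buggy reload in which the deleted node is returned with its fallback node ID `0` instead of user ID `1`. Whether that can actually happen is disputed above, so the example only shows how the two assertion styles would differ if it did.

```php
<?php

// IDs a hypothetical buggy search would return for this fixture: the deleted
// node comes back, but its stub document reports the fallback node ID 0.
$buggyIds = [0, 2, 3];

// IDs a correct search would return: only the two active documents.
$correctIds = [2, 3];

// Mirrors the test as written: ID 1 absent, ID 2 present.
$weakCheck = static fn (array $ids): bool
    => !in_array(1, $ids, true) && in_array(2, $ids, true);

// Adds the suggested result-count assertion.
$strictCheck = static fn (array $ids): bool
    => $weakCheck($ids) && count($ids) === 2;

$weakPassesOnBug = $weakCheck($buggyIds);
$strictPassesOnBug = $strictCheck($buggyIds);
$strictPassesOnCorrect = $strictCheck($correctIds);
```

In this constructed case the original pair of assertions passes on the buggy output, while the additional count check rejects it and still accepts the correct output.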


public function testDeletedDocumentFileIsRemoved(): void
{
$db = $this->makeDb();
$db->addDocument(new Document(id: 1, vector: [1.0, 0.0], text: 'to delete'));
$db->addDocument(new Document(id: 2, vector: [0.0, 1.0], text: 'to keep'));
$db->save();

// Verify doc file exists
self::assertFileExists($this->tmpDir . '/docs/0.bin');
self::assertFileExists($this->tmpDir . '/docs/1.bin');

// Delete document 1 (which is nodeId 0)
$db->deleteDocument(1);

// Doc file should be removed immediately
self::assertFileDoesNotExist($this->tmpDir . '/docs/0.bin');
self::assertFileExists($this->tmpDir . '/docs/1.bin');
}

public function testUpdateDocumentPersistsCorrectly(): void
{
$db = $this->makeDb();
$db->addDocument(new Document(id: 1, vector: [1.0, 0.0], text: 'original'));
$db->save();

// Update the document
$db->updateDocument(new Document(
id: 1,
vector: [0.0, 1.0],
text: 'updated content',
metadata: ['version' => 2],
));
$db->save();

// Reload and verify
$loaded = $this->openDb();
$results = $loaded->vectorSearch([0.0, 1.0], k: 1);

self::assertSame(1, $results[0]->document->id);
self::assertSame('updated content', $results[0]->document->text);
self::assertSame(['version' => 2], $results[0]->document->metadata);
}
}