Add document deletion and update functionality#3
Conversation
danielebarbaro
commented
Mar 23, 2026
- Add soft-delete for HNSW (nodes kept for graph connectivity, excluded from results)
- Add full removal from BM25 index via removeDocument()
- Add deleteDocument() and updateDocument() to VectorDatabase
- Persist deleted state in meta.json and restore on open()
- Update count() to return active documents only
- Add comprehensive tests for delete/update operations
c689e15 to
8aad8c9
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds first-class delete and update operations to VectorDatabase, including soft-delete support in the HNSW graph, hard removal from the BM25 index, and persistence of deleted state across saves/opens.
Changes:
- Add
deleteDocument()/updateDocument()toVectorDatabase, and makecount()report only active (non-deleted) documents. - Implement soft-delete filtering + deleted-state import/export in
HNSW\Index, and addBM25\Index::removeDocument()for full removal from text search. - Extend persistence and tests to cover delete/update behavior and restore deleted state on reopen.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
src/VectorDatabase.php |
Adds delete/update APIs, persists deleted set in meta.json, and updates count() to exclude deleted nodes. |
src/HNSW/Index.php |
Introduces a deleted-node set, filters deleted nodes from search results, and exports/imports deleted IDs. |
src/BM25/Index.php |
Adds removeDocument() to fully remove a document from the inverted index. |
tests/VectorDatabaseTest.php |
Adds unit tests validating delete/update behavior in-memory across vector/text search and count. |
tests/PersistenceTest.php |
Adds persistence tests for deleted-state restoration and update persistence. |
README.md |
Documents the new delete/update APIs and their persistence semantics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Filter out soft-deleted nodes and take the k nearest. | ||
| if (!empty($this->deleted)) { | ||
| $W = array_values(array_filter( | ||
| $W, | ||
| fn(array $pair) => !isset($this->deleted[$pair[1]]) |
There was a problem hiding this comment.
Filtering soft-deleted nodes only after searchLayer() can make search() return fewer than $k results (even 0) when $ef is small (e.g. $ef === $k) and a top candidate is deleted, despite other active nodes existing. Consider constructing the final top-k by skipping deleted nodes and/or retrying with a larger ef until you have up to $k active results.
| // Remove from inverted index. | ||
| foreach ($this->invertedIndex as $term => &$postings) { | ||
| unset($postings[$nodeId]); | ||
| // Remove empty posting lists to save memory. | ||
| if (empty($postings)) { |
There was a problem hiding this comment.
removeDocument() scans the entire $invertedIndex vocabulary to remove a single nodeId, which is O(|V|) per delete and can become expensive as the index grows (delete/update are now public APIs). A more scalable approach is to track per-document term lists on insert so deletion only touches terms that were present in the removed document.
| $results = $loaded->vectorSearch([1.0, 0.0], k: 3); | ||
| $ids = array_map(fn($r) => $r->document->id, $results); | ||
| self::assertNotContains(1, $ids); | ||
| self::assertContains(2, $ids); | ||
| } |
There was a problem hiding this comment.
This only checks that user-visible ID 1 isn’t present, but after open() the deleted node’s stub Document ID falls back to its nodeId (e.g. 0), so the test could still pass even if the deleted node is erroneously returned. Consider also asserting the result count is 2 (only two active docs remain) and/or explicitly asserting the deleted nodeId isn’t returned in $ids for this fixture.
There was a problem hiding this comment.
I don't think this is a real issue
| if ($this->path !== null) { | ||
| $docFile = $this->path . '/docs/' . $nodeId . '.bin'; | ||
| if (file_exists($docFile)) { | ||
| unlink($docFile); | ||
| } |
There was a problem hiding this comment.
With persistence enabled, doc files may still be in-flight from an async DocumentStore::write() (pcntl_fork). If a document is deleted before that write finishes, a late child write can recreate {nodeId}.bin after the delete, so the file may not actually stay deleted. Consider waiting for outstanding writes (or adding per-node cancel/atomic write semantics) before removing the file.
Add return value check on document delete Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Check for successfully unlink() Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Suppress PHP warning and handle failure explicitly to keep | ||
| // on-disk state consistent with in-memory indexes. | ||
| if (!@unlink($docFile) && file_exists($docFile)) { | ||
| throw new \RuntimeException( | ||
| "Failed to delete persisted document file: {$docFile}" |
There was a problem hiding this comment.
addDocument() can write doc files asynchronously (fork) via DocumentStore::write(), but deleteDocument() unlinks {nodeId}.bin without waiting for outstanding writes. A late child write can recreate the file after deletion, leaving on-disk state inconsistent. Consider waiting for pending writes (e.g., DocumentStore::waitAll()) or adding per-node cancel/atomic write semantics before unlinking.
| // Delete document file from disk if persistence is enabled. | ||
| if ($this->path !== null) { | ||
| $docFile = $this->path . '/docs/' . $nodeId . '.bin'; | ||
| if (file_exists($docFile)) { | ||
| // Suppress PHP warning and handle failure explicitly to keep | ||
| // on-disk state consistent with in-memory indexes. | ||
| if (!@unlink($docFile) && file_exists($docFile)) { | ||
| throw new \RuntimeException( | ||
| "Failed to delete persisted document file: {$docFile}" | ||
| ); | ||
| } | ||
| } |
There was a problem hiding this comment.
With persistence enabled, deleteDocument() removes docs/{nodeId}.bin immediately but the on-disk indexes (meta.json, hnsw.bin, bm25.bin) are only updated on save(). If the process crashes (or save() is not called), the folder can be left with indexes that still reference the document while its doc file is missing, causing later searches to fail when hydrating results. Consider making deletion crash-safe (e.g., defer physical file removal until save(), or persist a tombstone/updated meta atomically as part of deletion).
| // Delete document file from disk if persistence is enabled. | |
| if ($this->path !== null) { | |
| $docFile = $this->path . '/docs/' . $nodeId . '.bin'; | |
| if (file_exists($docFile)) { | |
| // Suppress PHP warning and handle failure explicitly to keep | |
| // on-disk state consistent with in-memory indexes. | |
| if (!@unlink($docFile) && file_exists($docFile)) { | |
| throw new \RuntimeException( | |
| "Failed to delete persisted document file: {$docFile}" | |
| ); | |
| } | |
| } | |
| // NOTE: | |
| // We intentionally do NOT delete the persisted document file here. | |
| // The in-memory indexes are updated immediately, but on-disk index | |
| // structures (meta.json, hnsw.bin, bm25.bin) are only updated on save(). | |
| // Deleting the file now could leave the persisted index referencing a | |
| // missing document if the process crashes before save() is called. | |
| // Any crash-safe physical cleanup of unused document files should be | |
| // performed as part of a persistence step that also updates the indexes. | |
| if ($this->path !== null) { | |
| // Defer physical deletion to a persistence routine that can update | |
| // both document files and indexes atomically. |
| foreach ($this->invertedIndex as $term => &$postings) { | ||
| unset($postings[$nodeId]); | ||
| // Remove empty posting lists to save memory. | ||
| if (empty($postings)) { | ||
| unset($this->invertedIndex[$term]); |
There was a problem hiding this comment.
removeDocument() removes a single doc by iterating over the entire $invertedIndex vocabulary and unsetting the nodeId from each postings list (O(|V|) per delete). Since delete/update are public APIs now, this can become a bottleneck on large indexes. Consider tracking per-document term lists on insert so deletions only touch terms that appeared in that document.
There was a problem hiding this comment.
interesting. 🤔
ezimuel
left a comment
There was a problem hiding this comment.
I applied some fixes to the Copilot suggestions.