Commit abbdacd

Metadata Filter eval - wip
1 parent 3b42a46 commit abbdacd

4 files changed: +1595 -31 lines changed
README.md

Lines changed: 109 additions & 0 deletions
@@ -400,11 +400,101 @@ $filter = MetadataFilter::contains('tags', 'php'); // matches ['tags' => ['php'
 
 Pass filters to any search method. Multiple filters are ANDed together by default.
 
+```php
+use PHPVector\MetadataFilter;
+
+// Vector search with filters
+$results = $db->vectorSearch(
+    vector: $queryVector,
+    k: 10,
+    filters: [
+        MetadataFilter::eq('lang', 'en'),
+        MetadataFilter::gt('year', 2020),
+    ],
+);
+
+// Text search with filters
+$results = $db->textSearch(
+    query: 'machine learning',
+    k: 10,
+    filters: [
+        MetadataFilter::in('category', ['tech', 'science']),
+    ],
+);
+
+// Hybrid search with filters
+$results = $db->hybridSearch(
+    vector: $queryVector,
+    text: 'machine learning',
+    k: 10,
+    filters: [
+        MetadataFilter::eq('status', 'published'),
+    ],
+);
+```
 
 ### OR groups (nested arrays)
 
 Wrap filters in a nested array to create OR groups. Filters at the top level are ANDed; filters inside a nested array are ORed.
 
+```php
+// (category = 'tech' OR category = 'science') AND status = 'published'
+$results = $db->vectorSearch(
+    vector: $queryVector,
+    k: 10,
+    filters: [
+        [
+            MetadataFilter::eq('category', 'tech'),
+            MetadataFilter::eq('category', 'science'),
+        ], // OR group
+        MetadataFilter::eq('status', 'published'), // ANDed with the OR group
+    ],
+);
+```
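
The AND/OR rule described above can be sketched as a small standalone evaluator. This is only an illustration of the semantics, not the library's code: `matchesFilters()` and the `$eq` closure are hypothetical stand-ins, and each filter is modelled as a plain closure rather than a `MetadataFilter` object.

```php
/**
 * Hypothetical evaluator: top-level entries are ANDed, nested arrays are ORed.
 * Each leaf is a closure (array $metadata): bool standing in for a MetadataFilter.
 */
function matchesFilters(array $metadata, array $filters): bool
{
    foreach ($filters as $entry) {
        if (is_array($entry)) {
            // OR group: at least one member must match.
            $anyMatch = false;
            foreach ($entry as $member) {
                if ($member($metadata)) {
                    $anyMatch = true;
                    break;
                }
            }
            if (!$anyMatch) {
                return false;
            }
        } elseif (!$entry($metadata)) {
            // Plain top-level filter: every one must match.
            return false;
        }
    }

    return true;
}

// Stand-in for MetadataFilter::eq(); compares with === (assumed here).
$eq = fn (string $key, mixed $value): Closure =>
    fn (array $metadata): bool => array_key_exists($key, $metadata) && $metadata[$key] === $value;

$filters = [
    [$eq('category', 'tech'), $eq('category', 'science')], // OR group
    $eq('status', 'published'),                            // ANDed with the OR group
];

var_dump(matchesFilters(['category' => 'science', 'status' => 'published'], $filters)); // bool(true)
var_dump(matchesFilters(['category' => 'art', 'status' => 'published'], $filters));     // bool(false)
```

In this sketch an empty OR group matches nothing and an empty top-level list matches everything; whether the library treats those edge cases the same way is not stated in the README.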
+
+### Over-fetching for filtered queries
+
+When filters are applied, the search may need to examine more candidates than `k` to find enough matching documents. By default, the search fetches `k * 5` candidates, then filters. You can tune this:
+
+```php
+// Fetch 10× candidates before filtering (useful when filters are very selective)
+$results = $db->vectorSearch(
+    vector: $queryVector,
+    k: 10,
+    filters: [MetadataFilter::eq('rare_tag', 'value')],
+    overFetch: 10,
+);
+```
+
+> **Note:** Filtered queries may return fewer than `k` results if not enough documents match.
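
To see why a selective filter can leave fewer than `k` results, here is a rough sketch of the fetch-then-filter idea. `filteredTopK()`, `$fetch`, and `$isEnglish` are hypothetical names invented for this illustration; the real search internals may differ.

```php
/**
 * Hypothetical post-filtering helper.
 *
 * $fetchCandidates(int $n) returns up to $n candidates, best first, each shaped
 * like ['id' => ..., 'metadata' => [...]]. $matches stands in for the filter check.
 */
function filteredTopK(callable $fetchCandidates, callable $matches, int $k, int $overFetch = 5): array
{
    if ($k < 1 || $overFetch < 1) {
        throw new InvalidArgumentException('k and overFetch must be at least 1.');
    }

    $results = [];
    foreach ($fetchCandidates($k * $overFetch) as $candidate) {
        if ($matches($candidate['metadata'])) {
            $results[] = $candidate;
            if (count($results) === $k) {
                break;
            }
        }
    }

    return $results; // may hold fewer than $k entries
}

// Toy corpus: only every 10th document is tagged 'en'.
$corpus = [];
for ($i = 1; $i <= 100; $i++) {
    $corpus[] = ['id' => $i, 'metadata' => ['lang' => $i % 10 === 0 ? 'en' : 'de']];
}
$fetch = fn (int $n): array => array_slice($corpus, 0, $n);
$isEnglish = fn (array $metadata): bool => $metadata['lang'] === 'en';

echo count(filteredTopK($fetch, $isEnglish, 10, 5)), "\n";   // 5
echo count(filteredTopK($fetch, $isEnglish, 10, 100)), "\n"; // 10
```

With the default multiplier only 50 candidates are inspected and 5 of them match; a larger multiplier trades extra work for a complete result set.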
+
+### Updating metadata
+
+Update metadata on existing documents without re-indexing vectors or text:
+
+```php
+// Add or update metadata keys
+$db->patchMetadata(id: 1, patch: [
+    'status' => 'archived',
+    'updated_at' => '2026-03-24',
+]);
+
+// Remove metadata keys by setting to null
+$db->patchMetadata(id: 1, patch: [
+    'deprecated_field' => null, // key will be removed
+]);
+
+// patchMetadata returns false if document not found
+if (!$db->patchMetadata(id: 999, patch: ['key' => 'value'])) {
+    echo "Document not found\n";
+}
+```
+
+The `patchMetadata()` method:
+- Merges patch into existing metadata (existing keys preserved unless overwritten)
+- Does NOT touch HNSW or BM25 indexes (fast, metadata-only operation)
+- Persists immediately when database has a path configured
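
The merge rule in the list above (overwrite on conflict, `null` removes, everything else preserved) might look roughly like this in isolation. `applyMetadataPatch()` is a hypothetical helper written for this illustration; it says nothing about how the library stores or persists metadata.

```php
/**
 * Hypothetical merge rule: patch keys overwrite, null removes a key,
 * and keys not mentioned in the patch are left untouched.
 */
function applyMetadataPatch(array $metadata, array $patch): array
{
    foreach ($patch as $key => $value) {
        if ($value === null) {
            unset($metadata[$key]);
        } else {
            $metadata[$key] = $value;
        }
    }

    return $metadata;
}

$before = ['status' => 'published', 'lang' => 'en', 'deprecated_field' => 1];
$after = applyMetadataPatch($before, ['status' => 'archived', 'deprecated_field' => null]);

var_dump($after === ['status' => 'archived', 'lang' => 'en']); // bool(true)
```

One consequence of this design is that `null` cannot be stored as a metadata value through a patch, since it is reserved as the removal marker.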
+
 ### Metadata-only search
 
 Query documents by metadata alone, without a vector or text query:
@@ -415,6 +505,25 @@ $results = $db->metadataSearch(
     filters: [MetadataFilter::eq('status', 'published')],
 );
 
+// With limit
+$results = $db->metadataSearch(
+    filters: [MetadataFilter::gt('year', 2020)],
+    limit: 100,
+);
+
+// With sorting by metadata key
+$results = $db->metadataSearch(
+    filters: [MetadataFilter::eq('status', 'published')],
+    sortBy: 'created_at',
+    sortDirection: 'desc', // 'asc' or 'desc'
+);
+
+// Empty filters returns all documents
+$allDocs = $db->metadataSearch(filters: [], limit: 50);
+```
+
+> **Note:** Documents missing the `sortBy` key are placed at the end of results. All results have `score = 1.0` (no ranking).
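
The "missing keys go last" behaviour in the note can be illustrated with a plain `usort()` comparator. `sortByMetadataKey()` and the document shape used here are assumptions made for this sketch, and it further assumes that documents without the key go last in both sort directions, which the note does not spell out.

```php
/**
 * Hypothetical comparator-based sort: documents lacking $sortBy are placed
 * last regardless of direction (an assumption of this sketch).
 */
function sortByMetadataKey(array $docs, string $sortBy, string $direction = 'asc'): array
{
    usort($docs, function (array $a, array $b) use ($sortBy, $direction): int {
        $aHas = array_key_exists($sortBy, $a['metadata']);
        $bHas = array_key_exists($sortBy, $b['metadata']);
        if (!$aHas || !$bHas) {
            return $bHas <=> $aHas; // the document that has the key sorts first
        }

        $cmp = $a['metadata'][$sortBy] <=> $b['metadata'][$sortBy];

        return $direction === 'desc' ? -$cmp : $cmp;
    });

    return $docs;
}

$docs = [
    ['id' => 1, 'metadata' => ['created_at' => '2024-01-10']],
    ['id' => 2, 'metadata' => []],
    ['id' => 3, 'metadata' => ['created_at' => '2025-06-01']],
];

echo implode(',', array_column(sortByMetadataKey($docs, 'created_at', 'desc'), 'id')), "\n"; // 3,1,2
```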
+
 ### Strict type comparison
 
 Metadata filtering uses **strict type comparison** (PHP `===`). This means:

src/HNSW/Config.php

Lines changed: 13 additions & 0 deletions
@@ -68,6 +68,14 @@ final class Config
      */
     public readonly bool $keepPrunedConnections;
 
+    /**
+     * Over-fetch multiplier when metadata filtering is active.
+     * Fetch overFetchMultiplier × k candidates, then filter to k.
+     * Higher value → better result completeness when many candidates are filtered out.
+     * Default: 5.
+     */
+    public readonly int $overFetchMultiplier;
+
     public function __construct(
         int $M = 16,
         ?int $M0 = null,
@@ -78,6 +86,7 @@ public function __construct(
         bool $useHeuristic = true,
         bool $extendCandidates = false,
         bool $keepPrunedConnections = true,
+        int $overFetchMultiplier = 5,
     ) {
         if ($M < 2) {
             throw new \InvalidArgumentException('M must be at least 2.');
@@ -87,6 +96,9 @@ public function __construct(
         if ($efConstruction < $resolvedM0) {
             throw new \InvalidArgumentException("efConstruction must be ≥ M0 ({$resolvedM0}).");
         }
+        if ($overFetchMultiplier < 1) {
+            throw new \InvalidArgumentException('overFetchMultiplier must be at least 1.');
+        }
 
         $this->M = $M;
         $this->M0 = $resolvedM0;
@@ -96,5 +108,6 @@ public function __construct(
         $this->useHeuristic = $useHeuristic;
         $this->extendCandidates = $extendCandidates;
         $this->keepPrunedConnections = $keepPrunedConnections;
+        $this->overFetchMultiplier = $overFetchMultiplier;
     }
 }
