Commit cb7d1e8

🔥(search) remove embedding/hybrid search, keep BM25 only
Signed-off-by: Stephan Meijer <me@stephanmeijer.com>
1 parent b33939e commit cb7d1e8

29 files changed

Lines changed: 42 additions & 5523 deletions

docs/env.md

Lines changed: 0 additions & 9 deletions
@@ -13,8 +13,6 @@ These are the environment variables you can set for the `find-backend` container
 | API_USERS_LIST_THROTTLE_RATE_SUSTAINED | Throttle rate for api | 180/hour |
 | CACHES_DEFAULT_TIMEOUT | Cache default timeout | 30 |
 | CACHES_KEY_PREFIX | The prefix used to every cache keys. | docs |
-| CHUNK_SIZE | approximate number of characters of document content chunks | 512 |
-| CHUNK_OVERLAP | approximate number of characters of document content overlapping | 50 |
 | DB_ENGINE | Engine to use for database connections | django.db.backends.postgresql_psycopg2 |
 | DB_HOST | Host of the database | localhost |
 | DB_NAME | Name of the database | impress |
@@ -41,16 +39,9 @@ These are the environment variables you can set for the `find-backend` container
 | DJANGO_SECRET_KEY | Secret key | |
 | DJANGO_SERVER_TO_SERVER_API_TOKENS | | [] |
 | DOCUMENT_IMAGE_MAX_SIZE | Maximum size of document in bytes | 10485760 |
-| EMBEDDING_API_KEY | API key of the embedding api | |
-| EMBEDDING_API_MODEL_NAME | Name of the embedding model used on the api | embeddings-small |
-| EMBEDDING_API_PATH | URL of the embedding api | |
-| EMBEDDING_DIMENSION | Size of the embedding vector | 1024 |
-| EMBEDDING_REQUEST_TIMEOUT | time out in seconds of the embedding requests | 10 |
 | FRONTEND_CSS_URL | To add a external css file to the app | |
 | FRONTEND_HOMEPAGE_FEATURE_ENABLED | Frontend feature flag to display the homepage | false |
 | FRONTEND_THEME | Frontend theme to use | |
-| HYBRID_SEARCH_ENABLED | Flag to enable hybrid (an then semantic) search | True |
-| HYBRID_SEARCH_WEIGHTS | Weights used in the weighted sum of the hybrid search score | [0.3, 0.7] |
 | LANGUAGE_DETECTION_CONFIDENCE_THRESHOLD | Language detection confidence threshold | 0.75 |
 | LOGGING_LEVEL_LOGGERS_APP | Application logging level. options are "DEBUG", "INFO", "WARN", "ERROR", "CRITICAL" | INFO |
 | LOGGING_LEVEL_LOGGERS_ROOT | Default logging level. options are "DEBUG", "INFO", "WARN", "ERROR", "CRITICAL" | INFO |

docs/setup-indexer.md

Lines changed: 0 additions & 33 deletions
@@ -43,39 +43,6 @@ This threshold can be controlled with LANGUAGE_DETECTION_CONFIDENCE_THRESHOLD en
 LANGUAGE_DETECTION_CONFIDENCE_THRESHOLD=0.75
 ```
 
-### Semantic search
-
-Find offers a semantic search feature. You can either use pure full-text search or a hybrid full-text + semantic search. To enable the hybrid search, add the following settings.
-
-```python
-# Enable flag
-HYBRID_SEARCH_ENABLED = True
-
-# weighted sum: full_text_weight, semantic_search_weight
-HYBRID_SEARCH_WEIGHTS = 0.7,0.3
-
-# Embedding
-CHUNK_SIZE=512
-CHUNK_OVERLAP=50
-EMBEDDING_API_PATH = https://embedding.api.example.com/full/path/
-EMBEDDING_API_KEY = your-embedding-api-key
-EMBEDDING_REQUEST_TIMEOUT = 10
-EMBEDDING_API_MODEL_NAME = embedding-api-model-name
-EMBEDDING_DIMENSION = 1024
-```
-
-The hybrid search computes a score for full-text and semantic search and combines them through a weighted sum. HYBRID_SEARCH_WEIGHTS contains the weights of full-text and semantic respectively.
-
-You need to use an embedding api similar to https://albert.api.etalab.gouv.fr/documentation#tag/Embeddings/operation/embeddings_v1_embeddings_post.
-
-### document chunking
-
-The indexing process embeds documents by converting their content into vector representations (embeddings). When a document exceeds the character dimension defined by CHUNK_SIZE, it's divided into smaller segments (chunks), with each chunk embedded independently. Each chunk must be smaller than the embedding model's context window .
-
-The chunking algorithm works recursively. It attempts to create the largest possible segments first, then subdivides them further if they still exceed the size limit defined by CHUNK_SIZE. Segments can share overlapping content between them (set CHUNK_OVERLAP=0 to disable overlapping).
-
-During the search, the matching of a document is the matching of its best chunk.
-
 ## trigrams
 
 Find uses trigrams to improve the robustness of the full text search engine to spelling variations and errors. It can be configured by two environment variables.
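
Review note: for readers who want to see what the removed "weighted sum" amounted to, here is a minimal sketch. The function name `hybrid_score` is hypothetical and not taken from the Find codebase, and the sketch assumes both input scores are already normalized to a comparable range. The weight order (full-text first, semantic second) follows the removed documentation; note that the removed docs/env.md row lists a default of [0.3, 0.7], while the removed example here uses 0.7,0.3.

```python
def hybrid_score(full_text_score, semantic_score, weights=(0.7, 0.3)):
    """Combine two relevance scores through a weighted sum.

    `weights` is (full_text_weight, semantic_weight), mirroring the order
    documented for HYBRID_SEARCH_WEIGHTS. This helper is illustrative only.
    """
    full_text_weight, semantic_weight = weights
    return full_text_weight * full_text_score + semantic_weight * semantic_score


if __name__ == "__main__":
    # A document that matches well lexically but poorly semantically.
    print(hybrid_score(0.9, 0.2))
    # The same document scored with the weights swapped.
    print(hybrid_score(0.9, 0.2, weights=(0.3, 0.7)))
```

With hybrid search removed, only the full-text (BM25) score remains, which corresponds to weights of (1.0, 0.0) in this sketch.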

env.d/development/common.dist

Lines changed: 0 additions & 9 deletions
@@ -50,12 +50,3 @@ OIDC_RS_SIGN_ALGO=RS256
 
 OIDC_RS_BACKEND_CLASS="core.authentication.FinderResourceServerBackend"
 OIDC_RS_ENCRYPTION_KEY_TYPE="RSA"
-
-# Hybrid Search settings
-HYBRID_SEARCH_ENABLED=True
-EMBEDDING_API_KEY=ThisIsAnExampleKeyForDevPurposeOnly
-EMBEDDING_API_PATH=https://albert.api.etalab.gouv.fr/v1/embeddings
-
-## Multi-embedding: chunk documents and embed each chunk
-CHUNK_SIZE=512
-CHUNK_OVERLAP=50
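
Review note: the removed CHUNK_SIZE and CHUNK_OVERLAP settings fed the recursive chunking step described in the deleted section of docs/setup-indexer.md. The sketch below illustrates that idea only. It is not the deleted implementation: the function names, the separator list, and the choice to prepend the overlap to each chunk are all assumptions made for illustration.

```python
def split_recursively(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most chunk_size characters.

    Coarse separators are tried first, so the largest natural segments
    (paragraphs) are kept when they fit; oversized segments are split again
    with the finer separators. The separator list is an assumption.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    for index, separator in enumerate(separators):
        if separator in text:
            pieces = []
            for part in text.split(separator):
                pieces.extend(
                    split_recursively(part, chunk_size, separators[index + 1:])
                )
            return pieces
    # No separator left: fall back to a hard cut every chunk_size characters.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def add_overlap(chunks, overlap):
    """Prepend the last `overlap` characters of the previous chunk to each chunk.

    With overlap=0 the chunks are returned unchanged. Overlapped chunks can
    exceed chunk_size, which fits the "approximate" wording of the settings.
    """
    if overlap <= 0:
        return list(chunks)
    return [
        (chunks[i - 1][-overlap:] if i > 0 else "") + chunk
        for i, chunk in enumerate(chunks)
    ]


def document_score(chunk_scores):
    """A document matches as well as its best chunk."""
    return max(chunk_scores)


if __name__ == "__main__":
    chunks = split_recursively("aaaa bbbb\n\ncccccccccc", chunk_size=5)
    print(chunks)
    print(add_overlap(chunks, overlap=2))
```

This sketch does not merge small neighbouring pieces back together, which a production splitter would typically do to get closer to the target size.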

src/backend/core/admin.py

Lines changed: 3 additions & 36 deletions
@@ -1,12 +1,8 @@
 """Admin config for find's core app"""
 
-from django.contrib import admin, messages
-from django.shortcuts import redirect, render
-from django.urls import path, reverse
-
-from core.management.commands.create_search_pipeline import (
-    ensure_search_pipeline_exists,
-)
+from django.contrib import admin
+from django.shortcuts import render
+from django.urls import path
 
 from . import selftests_builtin # pylint: disable=unused-import
 from .models import Service
@@ -22,35 +18,6 @@ class ServiceAdmin(admin.ModelAdmin):
     list_filter = ("is_active", "created_at")
     ordering = ("-created_at",)
     readonly_fields = ("created_at", "token")
-    change_list_template = "admin/core/services/change_list.html"
-
-    def get_urls(self):
-        urls = super().get_urls()
-        custom_urls = [
-            path(
-                "ensure-search-pipeline/",
-                self.admin_site.admin_view(self.ensure_search_pipeline_view),
-                name="core_service_ensure_search_pipeline",
-            ),
-        ]
-        return custom_urls + urls
-
-    def ensure_search_pipeline_view(self, request):
-        """Run the management command function to assert the pipeline exists."""
-        changelist_url = reverse("admin:core_service_changelist")
-
-        try:
-            ensure_search_pipeline_exists()
-        except Exception as exc: # noqa: BLE001# pylint: disable=broad-exception-caught
-            self.message_user(
-                request, f"Failed to ensure search pipeline: {exc}", messages.ERROR
-            )
-        else:
-            self.message_user(
-                request, "Search pipeline presence insured", messages.SUCCESS
-            )
-
-        return redirect(changelist_url)
 
 
 def selftest_view(request):

src/backend/core/apps.py

Lines changed: 1 addition & 12 deletions
@@ -3,13 +3,6 @@
 from django.apps import AppConfig
 from django.utils.translation import gettext_lazy as _
 
-from core.management.commands.create_search_pipeline import (
-    ensure_search_pipeline_exists,
-)
-from core.services.opensearch import (
-    check_hybrid_search_enabled,
-)
-
 
 class CoreConfig(AppConfig):
     """Configuration class for the Find core app."""
@@ -19,8 +12,4 @@ class CoreConfig(AppConfig):
     verbose_name = _("Find core application")
 
     def ready(self):
-        """
-        Ensure search pipeline exists if hybrid search is enabled.
-        """
-        if check_hybrid_search_enabled():
-            ensure_search_pipeline_exists()
+        pass

src/backend/core/enums.py

Lines changed: 0 additions & 10 deletions
@@ -13,16 +13,6 @@ class ReachEnum(str, Enum):
     RESTRICTED = "restricted"
 
 
-# Search type
-
-
-class SearchTypeEnum(str, Enum):
-    """Search type options"""
-
-    HYBRID = "hybrid"
-    FULL_TEXT = "full-text"
-
-
 # Fields
 
 CREATED_AT = "created_at"

src/backend/core/management/commands/create_search_pipeline.py

Lines changed: 0 additions & 54 deletions
This file was deleted.

src/backend/core/management/commands/reindex_with_embedding.py

Lines changed: 0 additions & 141 deletions
This file was deleted.

src/backend/core/schemas.py

Lines changed: 0 additions & 17 deletions
@@ -18,7 +18,6 @@
 )
 
 from . import enums
-from .services.opensearch import check_hybrid_search_enabled
 
 
 class DocumentSchema(BaseModel):
@@ -119,22 +118,6 @@ class SearchQueryParametersSchema(BaseModel):
     order_by: Optional[Literal[enums.ORDER_BY_OPTIONS]] = Field(default=enums.RELEVANCE)
     order_direction: Optional[Literal["asc", "desc"]] = Field(default="desc")
     nb_results: Optional[conint(ge=1, le=300)] = Field(default=50)
-    search_type: Optional[enums.SearchTypeEnum] = None
-
-    @model_validator(mode="after")
-    def set_default_search_type(self):
-        """
-        Set default search_type dynamically.
-        If search_type is not provided, it will be set to hybrid if it is configured
-        and fall back on full text otherwise.
-        """
-        if self.search_type is None:
-            self.search_type = (
-                enums.SearchTypeEnum.HYBRID
-                if check_hybrid_search_enabled()
-                else enums.SearchTypeEnum.FULL_TEXT
-            )
-        return self
 
 
 class DeleteDocumentsSchema(BaseModel):
