feat: configurable AI Embedding Model #124
Changes from 2 commits
eef50dc
432c19c
5c4d821
69a787c
0a7479e
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -19,6 +19,8 @@ def __init__(self, similarity_threshold: float = 0.95, cache_file: str = "semant | |||||||||||||||||||||
| self.cache: List[Dict] = [] | ||||||||||||||||||||||
| self.similarity_threshold = similarity_threshold | ||||||||||||||||||||||
| self.model = None | ||||||||||||||||||||||
| self.hf_model = None | ||||||||||||||||||||||
| self.hf_tokenizer = None | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| # Support environment variable for cache file path | ||||||||||||||||||||||
| cache_file_env = os.environ.get('NEXUMDB_CACHE_FILE', cache_file) | ||||||||||||||||||||||
|
|
@@ -33,22 +35,79 @@ def __init__(self, similarity_threshold: float = 0.95, cache_file: str = "semant | |||||||||||||||||||||
|
|
||||||||||||||||||||||
| def initialize_model(self) -> None: | ||||||||||||||||||||||
| """Initialize local embedding model - deferred to avoid import errors""" | ||||||||||||||||||||||
| model_name = os.environ.get('NEXUM_EMBEDDING_MODEL', 'all-MiniLM-L6-v2') | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| # Try SentenceTransformer first (preferred for embedding models) | ||||||||||||||||||||||
| try: | ||||||||||||||||||||||
| from sentence_transformers import SentenceTransformer | ||||||||||||||||||||||
| self.model = SentenceTransformer('all-MiniLM-L6-v2') | ||||||||||||||||||||||
| print("Semantic cache initialized with all-MiniLM-L6-v2") | ||||||||||||||||||||||
| self.model = SentenceTransformer(model_name) | ||||||||||||||||||||||
| print(f"Semantic cache initialized with SentenceTransformer: {model_name}") | ||||||||||||||||||||||
| self.hf_model = None | ||||||||||||||||||||||
| self.hf_tokenizer = None | ||||||||||||||||||||||
| return | ||||||||||||||||||||||
|
Comment on lines 297 to +308
Contributor
🧩 Analysis chain
🏁 Script executed:
# First, explore the repository structure to understand the layout
find . -type f -name "mod.rs" -o -name "optimizer.py" | head -20
Repository: aviralgarg05/NexumDB
Length of output: 243
🏁 Script executed:
# Check the Rust bridge code at the mentioned location
cat -n nexum_core/src/bridge/mod.rs 2>/dev/null || echo "File not found, searching for it..."
fd "mod.rs" | grep bridge
Repository: aviralgarg05/NexumDB
Length of output: 18691
🏁 Script executed:
# Check Python optimizer.py file
cat -n nexum_ai/optimizer.py 2>/dev/null | head -100
Repository: aviralgarg05/NexumDB
Length of output: 4629
🏁 Script executed:
# Search for SemanticCache instantiation patterns in Rust code
rg -n 'SemanticCache' --type rust -C 3
Repository: aviralgarg05/NexumDB
Length of output: 4390
🏁 Script executed:
# Search for _get_default_cache in Python
rg -n '_get_default_cache' --type py -C 5
Repository: aviralgarg05/NexumDB
Length of output: 1467
🏁 Script executed:
# Check how vectorize is called in the codebase - frequency and usage patterns
rg -n 'vectorize' --type rust -B 2 -A 2 | head -40
Repository: aviralgarg05/NexumDB
Length of output: 1327
🏁 Script executed:
# Verify if PyModule.import caches or re-imports
rg -n 'explain_query_plan' nexum_ai/optimizer.py -A 20 | head -50
Repository: aviralgarg05/NexumDB
Length of output: 2216
Rust bridge creates new
The module-level
Expose the singleton to Rust by adding a module-level getter function (e.g., |
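The suggestion above, a module-level getter that hands out one shared cache, could look roughly like the sketch below. The `SemanticCache` class here is a stand-in that only counts model loads, and the getter name `get_default_cache` is hypothetical; the name proposed in the review is not visible in this page.

```python
from typing import Optional


class SemanticCache:
    """Stand-in for nexum_ai.optimizer.SemanticCache; it only counts model loads."""

    def __init__(self) -> None:
        self.model_loads = 0

    def initialize_model(self) -> None:
        # In the real class this is the expensive step (loading an embedding model).
        self.model_loads += 1


# Module-level singleton, created lazily on first request.
_default_cache: Optional[SemanticCache] = None


def get_default_cache() -> SemanticCache:
    """Hypothetical getter: build the cache once, then return the same object."""
    global _default_cache
    if _default_cache is None:
        _default_cache = SemanticCache()
        _default_cache.initialize_model()
    return _default_cache
```

A caller on the Rust side would then fetch the instance through this function instead of constructing `SemanticCache` itself, so the model is loaded once per process rather than once per bridge object.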
||||||||||||||||||||||
| except ImportError: | ||||||||||||||||||||||
| print("Warning: sentence-transformers not installed, using fallback") | ||||||||||||||||||||||
| print("Warning: sentence-transformers not installed, trying transformers fallback") | ||||||||||||||||||||||
| except Exception as e: | ||||||||||||||||||||||
| print(f"Warning: Failed to load with SentenceTransformer ({e}), trying transformers fallback") | ||||||||||||||||||||||
|
coderabbitai[bot] marked this conversation as resolved.
|
||||||||||||||||||||||
|
|
||||||||||||||||||||||
| # Fallback to generic HuggingFace transformers | ||||||||||||||||||||||
| try: | ||||||||||||||||||||||
| from transformers import AutoTokenizer, AutoModel | ||||||||||||||||||||||
| import torch | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| # If default model was used but ST failed, we might want a different default for raw transformers | ||||||||||||||||||||||
| # but usually the same model name works for both if it's on HF Hub. | ||||||||||||||||||||||
| # However, 'all-MiniLM-L6-v2' is a sentence-transformers specific alias often mapped to | ||||||||||||||||||||||
| # 'sentence-transformers/all-MiniLM-L6-v2' on HF Hub. | ||||||||||||||||||||||
| if model_name == 'all-MiniLM-L6-v2': | ||||||||||||||||||||||
| model_name = 'sentence-transformers/all-MiniLM-L6-v2' | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| self.hf_tokenizer = AutoTokenizer.from_pretrained(model_name) | ||||||||||||||||||||||
| self.hf_model = AutoModel.from_pretrained(model_name) | ||||||||||||||||||||||
| self.model = None | ||||||||||||||||||||||
| print(f"Semantic cache initialized with HuggingFace transformers: {model_name}") | ||||||||||||||||||||||
|
Comment on lines +323 to +329
Contributor
Security: environment-controlled model name passed to
An attacker who controls the
🛡️ Example allow-list validation
+ ALLOWED_MODELS = {
+     'all-MiniLM-L6-v2',
+     'sentence-transformers/all-MiniLM-L6-v2',
+     'sentence-transformers/paraphrase-MiniLM-L6-v2',
+ }
+ if model_name not in ALLOWED_MODELS:
+     print(f"Warning: model '{model_name}' is not in the allow-list, falling back to default")
+     model_name = 'sentence-transformers/all-MiniLM-L6-v2'
+
  self.hf_tokenizer = AutoTokenizer.from_pretrained(model_name)
  self.hf_model = AutoModel.from_pretrained(model_name) |
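As a self-contained variant of the allow-list idea above, the check can be pulled into a small helper that also applies the alias mapping from the diff. This is only a sketch: `resolve_model_name` and `DEFAULT_MODEL` are hypothetical names, and the set of allowed models is copied from the example rather than taken from the project.

```python
import os
from typing import Mapping

DEFAULT_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

# Allow-list copied from the review example; a real deployment would maintain its own.
ALLOWED_MODELS = frozenset({
    "all-MiniLM-L6-v2",
    DEFAULT_MODEL,
    "sentence-transformers/paraphrase-MiniLM-L6-v2",
})


def resolve_model_name(env: Mapping[str, str] = os.environ) -> str:
    """Hypothetical helper: return a vetted Hub model id for the transformers path."""
    requested = env.get("NEXUM_EMBEDDING_MODEL", "all-MiniLM-L6-v2")
    if requested not in ALLOWED_MODELS:
        print(f"Warning: model {requested!r} is not in the allow-list, falling back to default")
        return DEFAULT_MODEL
    if requested == "all-MiniLM-L6-v2":
        # Same alias mapping as in the diff: use the fully qualified Hub id.
        return DEFAULT_MODEL
    return requested
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.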
||||||||||||||||||||||
| except ImportError: | ||||||||||||||||||||||
| print("Warning: transformers not installed, using simple fallback") | ||||||||||||||||||||||
| self.model = None | ||||||||||||||||||||||
| self.hf_model = None | ||||||||||||||||||||||
| self.hf_tokenizer = None | ||||||||||||||||||||||
|
Comment on lines +330 to +334
Contributor
PEP 8 indentation violation: extra leading space.
Line 70 appears to have 13 spaces of indentation instead of the expected 12 (three levels of 4-space indent). This will still run in Python, but it's inconsistent with the rest of the file.
         except ImportError:
-             print("Warning: transformers not installed, using simple fallback")
-             self.model = None
-             self.hf_model = None
-             self.hf_tokenizer = None
+            print("Warning: transformers not installed, using simple fallback")
+            self.model = None
+            self.hf_model = None
+            self.hf_tokenizer = None
As per coding guidelines, |
||||||||||||||||||||||
| except Exception as e: | ||||||||||||||||||||||
| print(f"Warning: Failed to load with transformers ({e}), using simple fallback") | ||||||||||||||||||||||
| self.model = None | ||||||||||||||||||||||
| self.hf_model = None | ||||||||||||||||||||||
| self.hf_tokenizer = None | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| def vectorize(self, text: str) -> List[float]: | ||||||||||||||||||||||
| """Convert text to embedding vector""" | ||||||||||||||||||||||
| if self.model is None: | ||||||||||||||||||||||
| if self.model is None and self.hf_model is None: | ||||||||||||||||||||||
| self.initialize_model() | ||||||||||||||||||||||
|
Comment on lines +343 to 344
Contributor
If both
🔧 Suggested fix
+ self._model_init_attempted: bool = False
  ...
  def vectorize(self, text: str) -> List[float]:
      """Convert text to embedding vector"""
-     if self.model is None and self.hf_model is None:
+     if not self._model_init_attempted and self.model is None and self.hf_model is None:
          self.initialize_model()
  def initialize_model(self) -> None:
      """Initialize local embedding model - deferred to avoid import errors"""
+     self._model_init_attempted = True
      model_name = os.environ.get('NEXUM_EMBEDDING_MODEL', 'all-MiniLM-L6-v2')
  ... |
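The effect of the `_model_init_attempted` flag can be seen in a stripped-down sketch. `LazyInitCache` is a hypothetical stand-in, not the project's class; it simulates the case where neither backend loads, and `init_calls` exists only to make the behavior observable.

```python
from typing import List


class LazyInitCache:
    """Hypothetical stand-in that shows only the one-shot initialization guard."""

    def __init__(self) -> None:
        self.model = None
        self.hf_model = None
        self._model_init_attempted = False
        self.init_calls = 0  # instrumentation for this sketch only

    def initialize_model(self) -> None:
        # Record the attempt first, so a failed load is not retried on every call.
        self._model_init_attempted = True
        self.init_calls += 1
        # Simulated failure: both self.model and self.hf_model stay None.

    def vectorize(self, text: str) -> List[float]:
        if not self._model_init_attempted and self.model is None and self.hf_model is None:
            self.initialize_model()
        # Placeholder fallback vector; the real class uses its own fallback vectorizer.
        return [float(len(text))]
```

Without the flag, the guard `self.model is None and self.hf_model is None` stays true after a failed load, so every `vectorize` call would attempt initialization again.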
||||||||||||||||||||||
|
|
||||||||||||||||||||||
| if self.model is not None: | ||||||||||||||||||||||
| embedding = self.model.encode(text) | ||||||||||||||||||||||
| return embedding.tolist() | ||||||||||||||||||||||
| elif self.hf_model is not None and self.hf_tokenizer is not None: | ||||||||||||||||||||||
| try: | ||||||||||||||||||||||
| import torch | ||||||||||||||||||||||
| # Tokenize and compute embedding | ||||||||||||||||||||||
| inputs = self.hf_tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512) | ||||||||||||||||||||||
| with torch.no_grad(): | ||||||||||||||||||||||
| outputs = self.hf_model(**inputs) | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| # Mean pooling | ||||||||||||||||||||||
| # attention_mask shape: (batch, seq_len) | ||||||||||||||||||||||
| # last_hidden_state shape: (batch, seq_len, hidden_dim) | ||||||||||||||||||||||
| attention_mask = inputs['attention_mask'] | ||||||||||||||||||||||
| token_embeddings = outputs.last_hidden_state | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float() | ||||||||||||||||||||||
| sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) | ||||||||||||||||||||||
| sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9) | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| embedding = sum_embeddings / sum_mask | ||||||||||||||||||||||
| return embedding[0].tolist() | ||||||||||||||||||||||
| except Exception as e: | ||||||||||||||||||||||
| print(f"Error during HF vectorization: {e}, using fallback") | ||||||||||||||||||||||
| return self._fallback_vectorize(text) | ||||||||||||||||||||||
|
Comment on lines +349 to +371
Contributor
🧹 Nitpick | 🔵 Trivial
Re-importing |
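For readers unfamiliar with the mean-pooling step in the hunk under review, the same arithmetic can be written without `torch` for a single sequence. This is an illustrative sketch only: `mean_pool` is a hypothetical helper operating on plain lists, and it drops the batch dimension that the tensor version carries.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1 (one sequence, no batch)."""
    hidden_dim = len(token_embeddings[0])
    sums = [0.0] * hidden_dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                sums[i] += value
    # Same guard as torch.clamp(..., min=1e-9): avoid dividing by zero on an all-padding input.
    count = max(float(sum(attention_mask)), 1e-9)
    return [total / count for total in sums]


# The third token is padding (mask 0), so it does not affect the average.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])
print(pooled)  # [2.0, 3.0]
```

Masking matters because padded positions still produce hidden states; averaging over them would shift the embedding depending on how much padding the tokenizer added.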
||||||||||||||||||||||
| else: | ||||||||||||||||||||||
| return self._fallback_vectorize(text) | ||||||||||||||||||||||
|
Comment on lines +343 to 373
Contributor
Silent fallback to character-hash vectorizer can corrupt cache semantics.
If the HF model successfully vectorizes text during
This same concern applies if the model backend changes between process restarts (e.g.,
Consider: |
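One possible mitigation for the mixed-backend concern, not necessarily one of the options the reviewer listed, is to record which backend produced each cached vector and skip entries from a different backend at lookup time. The sketch below assumes hypothetical names throughout (`CacheEntry`, `lookup`, the backend labels); only the 0.95 threshold mirrors the default in the diff.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CacheEntry:
    query: str
    vector: List[float]
    backend: str  # hypothetical label, e.g. "sentence-transformers", "transformers", "fallback"
    result: str


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup(entries: List[CacheEntry], vector: List[float], backend: str,
           threshold: float = 0.95) -> Optional[str]:
    """Return a cached result only if it was embedded by the same backend."""
    for entry in entries:
        # Vectors from different backends live in unrelated spaces; never compare them.
        if entry.backend != backend or len(entry.vector) != len(vector):
            continue
        if cosine(entry.vector, vector) >= threshold:
            return entry.result
    return None
```

With this check a query embedded by the fallback vectorizer simply misses the cache instead of matching an unrelated entry that happens to score above the threshold.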
||||||||||||||||||||||
|
|
||||||||||||||||||||||
|
|
||||||||||||||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| from nexum_ai.optimizer import SemanticCache | ||
| import os | ||
|
|
||
| print("--- Default Behavior ---") | ||
| cache = SemanticCache() | ||
| # Force initialization | ||
| cache.initialize_model() | ||
|
Comment on lines +1 to +7
Contributor
This debug script should not be committed to the repository.
Additional issues if the file is kept: |
||
🧹 Nitpick | 🔵 Trivial
Missing type annotations for model attributes.
`self.model`, `self.hf_model`, and `self.hf_tokenizer` are all initialized to `None` without type hints. Adding `Optional[...]` annotations would clarify the expected types and improve IDE/static-analysis support.
♻️ Suggested type annotations
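The reviewer's suggested annotations are not visible in this page. A minimal sketch of what such annotations might look like follows; using `Any` as the inner type is an assumption made here so the module does not need to import `sentence_transformers` or `transformers` at load time, and the constructor is trimmed to the attributes shown in the diff.

```python
from typing import Any, Dict, List, Optional


class SemanticCache:
    """Trimmed stand-in showing only the attribute annotations."""

    def __init__(self, similarity_threshold: float = 0.95) -> None:
        self.cache: List[Dict] = []
        self.similarity_threshold = similarity_threshold
        # Optional[...] documents that each attribute stays None until a backend loads.
        # Any is an assumption: it avoids importing the heavy libraries just for typing.
        self.model: Optional[Any] = None
        self.hf_model: Optional[Any] = None
        self.hf_tokenizer: Optional[Any] = None
```

A stricter alternative would import the concrete classes under `typing.TYPE_CHECKING` and annotate with those, at the cost of requiring the packages to be installed for static analysis.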