Skip to content
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
67 changes: 63 additions & 4 deletions nexum_ai/optimizer.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,8 @@ def __init__(self, similarity_threshold: float = 0.95, cache_file: str = "semant
self.cache: List[Dict] = []
self.similarity_threshold = similarity_threshold
self.model = None
self.hf_model = None
self.hf_tokenizer = None
Comment on lines 66 to +68

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Missing type annotations for model attributes.

self.model, self.hf_model, and self.hf_tokenizer are all initialized to None without type hints. Adding Optional[...] annotations would clarify the expected types and improve IDE/static-analysis support.

♻️ Suggested type annotations
+        from typing import TYPE_CHECKING
+        # At module level or behind TYPE_CHECKING guard:
+        # from sentence_transformers import SentenceTransformer
+        # from transformers import AutoModel, AutoTokenizer
+
         self.model = None  # Optional[SentenceTransformer]
-        self.hf_model = None
-        self.hf_tokenizer = None
+        self.hf_model = None  # Optional[AutoModel]
+        self.hf_tokenizer = None  # Optional[AutoTokenizer]
🤖 Prompt for AI Agents
In `@nexum_ai/optimizer.py` around lines 21 - 23, Add explicit Optional type
annotations for the attributes initialized to None: annotate self.model,
self.hf_model, and self.hf_tokenizer using typing.Optional with their expected
types (e.g., Optional[YourModelClass] or Optional[Any] /
Optional[transformers.PreTrainedModel] and
Optional[transformers.PreTrainedTokenizerFast] as appropriate). Update the class
__init__ so the declarations read with the type hints (referencing the attribute
names self.model, self.hf_model, self.hf_tokenizer) and import Optional from
typing at the top of the module to enable static analysis and IDE assistance.


# Support environment variable for cache file path
cache_file_env = os.environ.get('NEXUMDB_CACHE_FILE', cache_file)
Expand All @@ -33,22 +35,79 @@ def __init__(self, similarity_threshold: float = 0.95, cache_file: str = "semant

def initialize_model(self) -> None:
"""Initialize local embedding model - deferred to avoid import errors"""
model_name = os.environ.get('NEXUM_EMBEDDING_MODEL', 'all-MiniLM-L6-v2')

# Try SentenceTransformer first (preferred for embedding models)
try:
from sentence_transformers import SentenceTransformer
self.model = SentenceTransformer('all-MiniLM-L6-v2')
print("Semantic cache initialized with all-MiniLM-L6-v2")
self.model = SentenceTransformer(model_name)
print(f"Semantic cache initialized with SentenceTransformer: {model_name}")
self.hf_model = None
self.hf_tokenizer = None
return
Comment on lines 297 to +308

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, explore the repository structure to understand the layout
find . -type f -name "mod.rs" -o -name "optimizer.py" | head -20

Repository: aviralgarg05/NexumDB

Length of output: 243


🏁 Script executed:

# Check the Rust bridge code at the mentioned location
cat -n nexum_core/src/bridge/mod.rs 2>/dev/null || echo "File not found, searching for it..."
fd "mod.rs" | grep bridge

Repository: aviralgarg05/NexumDB

Length of output: 18691


🏁 Script executed:

# Check Python optimizer.py file
cat -n nexum_ai/optimizer.py 2>/dev/null | head -100

Repository: aviralgarg05/NexumDB

Length of output: 4629


🏁 Script executed:

# Search for SemanticCache instantiation patterns in Rust code
rg -n 'SemanticCache' --type rust -C 3

Repository: aviralgarg05/NexumDB

Length of output: 4390


🏁 Script executed:

# Search for _get_default_cache in Python
rg -n '_get_default_cache' --type py -C 5

Repository: aviralgarg05/NexumDB

Length of output: 1467


🏁 Script executed:

# Check how vectorize is called in the codebase - frequency and usage patterns
rg -n 'vectorize' --type rust -B 2 -A 2 | head -40

Repository: aviralgarg05/NexumDB

Length of output: 1327


🏁 Script executed:

# Verify if PyModule.import caches or re-imports
rg -n 'explain_query_plan' nexum_ai/optimizer.py -A 20 | head -50

Repository: aviralgarg05/NexumDB

Length of output: 2216


Rust bridge creates new SemanticCache on every vectorize call, triggering redundant model loading.

The module-level _get_default_cache() singleton (line 23) only benefits explain_query_plan(). The Rust bridge (nexum_core/src/bridge/mod.rs:41) calls semantic_cache.call0() on every vectorize invocation, creating a fresh Python instance each time. With the HF AutoModel.from_pretrained() fallback, this means model weights are re-downloaded and re-loaded for every vectorization.

Expose the singleton to Rust by adding a module-level getter function (e.g., get_or_create_cache()) that returns the shared instance, and have the Rust bridge call it instead of instantiating SemanticCache() directly.

🤖 Prompt for AI Agents
In `@nexum_ai/optimizer.py` around lines 78 - 89, Rust bridge creates a new
SemanticCache for every vectorize call causing redundant model reloads; add and
expose a module-level getter (e.g., get_or_create_cache) that returns the
existing singleton (the same instance used by
_get_default_cache/explain_query_plan) so Rust can call the getter instead of
constructing SemanticCache each time; implement get_or_create_cache() to return
the module-level singleton and ensure it returns the same object that
initialize_model/configure uses so SentenceTransformer or HF AutoModel weights
are loaded once and reused by vectorize invoked from Rust.

except ImportError:
print("Warning: sentence-transformers not installed, using fallback")
print("Warning: sentence-transformers not installed, trying transformers fallback")
except Exception as e:
print(f"Warning: Failed to load with SentenceTransformer ({e}), trying transformers fallback")
Comment thread
coderabbitai[bot] marked this conversation as resolved.

# Fallback to generic HuggingFace transformers
try:
from transformers import AutoTokenizer, AutoModel
import torch

# If default model was used but ST failed, we might want a different default for raw transformers
# but usually the same model name works for both if it's on HF Hub.
# However, 'all-MiniLM-L6-v2' is a sentence-transformers specific alias often mapped to
# 'sentence-transformers/all-MiniLM-L6-v2' on HF Hub.
if model_name == 'all-MiniLM-L6-v2':
model_name = 'sentence-transformers/all-MiniLM-L6-v2'

self.hf_tokenizer = AutoTokenizer.from_pretrained(model_name)
self.hf_model = AutoModel.from_pretrained(model_name)
self.model = None
print(f"Semantic cache initialized with HuggingFace transformers: {model_name}")
Comment on lines +323 to +329

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Security: environment-controlled model name passed to from_pretrained enables arbitrary model loading.

An attacker who controls the NEXUM_EMBEDDING_MODEL environment variable can point AutoModel.from_pretrained / AutoTokenizer.from_pretrained at an arbitrary HuggingFace Hub repository. Models hosted on the Hub may contain malicious payloads (e.g., pickle-based weight files, custom code). Consider:

  1. Validating model_name against an allow-list of trusted models.
  2. Pinning model revisions (the revision parameter).
  3. Forcing safetensors-only loading (use_safetensors=True if your transformers version supports it).
🛡️ Example allow-list validation
+        ALLOWED_MODELS = {
+            'all-MiniLM-L6-v2',
+            'sentence-transformers/all-MiniLM-L6-v2',
+            'sentence-transformers/paraphrase-MiniLM-L6-v2',
+        }
+        if model_name not in ALLOWED_MODELS:
+            print(f"Warning: model '{model_name}' is not in the allow-list, falling back to default")
+            model_name = 'sentence-transformers/all-MiniLM-L6-v2'
+
         self.hf_tokenizer = AutoTokenizer.from_pretrained(model_name)
         self.hf_model = AutoModel.from_pretrained(model_name)
🤖 Prompt for AI Agents
In `@nexum_ai/optimizer.py` around lines 62 - 68, Validate and restrict the
environment-controlled model_name before calling AutoTokenizer.from_pretrained
and AutoModel.from_pretrained: implement an allow-list of trusted model
identifiers and check the incoming model_name (and normalize entries like
'all-MiniLM-L6-v2' → 'sentence-transformers/all-MiniLM-L6-v2') against it,
require or inject a pinned revision parameter (revision="...") and pass
use_safetensors=True (or equivalent) to from_pretrained calls so only
safetensors are loaded; if the model_name is not on the allow-list, raise or
fall back to a safe default instead of calling AutoTokenizer.from_pretrained /
AutoModel.from_pretrained.

except ImportError:
print("Warning: transformers not installed, using simple fallback")
self.model = None
self.hf_model = None
self.hf_tokenizer = None
Comment on lines +330 to +334

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

PEP 8 indentation violation: extra leading space.

Line 70 appears to have 13 spaces of indentation instead of the expected 12 (three levels of 4-space indent). This will still run in Python, but it's inconsistent with the rest of the file.

         except ImportError:
-             print("Warning: transformers not installed, using simple fallback")
-             self.model = None
-             self.hf_model = None
-             self.hf_tokenizer = None
+            print("Warning: transformers not installed, using simple fallback")
+            self.model = None
+            self.hf_model = None
+            self.hf_tokenizer = None

As per coding guidelines, **/*.py: Check Python code quality, type hints usage, and PEP 8 compliance.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
except ImportError:
print("Warning: transformers not installed, using simple fallback")
self.model = None
self.hf_model = None
self.hf_tokenizer = None
except ImportError:
print("Warning: transformers not installed, using simple fallback")
self.model = None
self.hf_model = None
self.hf_tokenizer = None
🤖 Prompt for AI Agents
In `@nexum_ai/optimizer.py` around lines 69 - 73, Fix the PEP8 indentation in the
except ImportError block in optimizer.py: the line with the print statement has
one extra leading space (13 vs expected 12); align the print and subsequent
assignments (self.model, self.hf_model, self.hf_tokenizer) to the same 12-space
indentation level as the except block body so all lines inside that except block
are consistently indented.

except Exception as e:
print(f"Warning: Failed to load with transformers ({e}), using simple fallback")
self.model = None
self.hf_model = None
self.hf_tokenizer = None

def vectorize(self, text: str) -> List[float]:
"""Convert text to embedding vector"""
if self.model is None:
if self.model is None and self.hf_model is None:
self.initialize_model()
Comment on lines +343 to 344

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

initialize_model() re-attempted on every vectorize() call when all model loads failed

If both sentence-transformers and transformers are unavailable, initialize_model() leaves self.model = None and self.hf_model = None. Because the guard condition if self.model is None and self.hf_model is None is still True on the next call, every subsequent vectorize() invocation re-attempts the full import chain (two ImportErrors each time). Add a sentinel flag to short-circuit after the first failed attempt.

🔧 Suggested fix
+        self._model_init_attempted: bool = False
         ...

 def vectorize(self, text: str) -> List[float]:
     """Convert text to embedding vector"""
-    if self.model is None and self.hf_model is None:
+    if not self._model_init_attempted and self.model is None and self.hf_model is None:
         self.initialize_model()

 def initialize_model(self) -> None:
     """Initialize local embedding model - deferred to avoid import errors"""
+    self._model_init_attempted = True
     model_name = os.environ.get('NEXUM_EMBEDDING_MODEL', 'all-MiniLM-L6-v2')
     ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nexum_ai/optimizer.py` around lines 343 - 344, The code currently re-invokes
initialize_model() on every vectorize() call when both self.model and
self.hf_model remain None; add a sentinel attribute (e.g.,
self._model_init_attempted or self._model_load_failed) checked in vectorize
before calling initialize_model so the import chain is only attempted once, set
that flag to True at the start or end of initialize_model() (and update it to
reflect success/failure as needed) and ensure initialize_model still assigns
self.model/self.hf_model on success so subsequent vectorize() calls skip
re-initialization.


if self.model is not None:
embedding = self.model.encode(text)
return embedding.tolist()
elif self.hf_model is not None and self.hf_tokenizer is not None:
try:
import torch
# Tokenize and compute embedding
inputs = self.hf_tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():
outputs = self.hf_model(**inputs)

# Mean pooling
# attention_mask shape: (batch, seq_len)
# last_hidden_state shape: (batch, seq_len, hidden_dim)
attention_mask = inputs['attention_mask']
token_embeddings = outputs.last_hidden_state

input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)

embedding = sum_embeddings / sum_mask
return embedding[0].tolist()
except Exception as e:
print(f"Error during HF vectorization: {e}, using fallback")
return self._fallback_vectorize(text)
Comment on lines +349 to +371

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Re-importing torch on every vectorize call is unnecessary.

torch was already successfully imported during initialize_model() (line 56) if the HF path was taken. The guarded import on line 90 is redundant in this branch since self.hf_model being non-None already implies torch was importable. You could capture the torch module reference as self._torch during init instead.

🤖 Prompt for AI Agents
In `@nexum_ai/optimizer.py` around lines 88 - 110, The HF vectorization branch in
vectorize (the elif that checks self.hf_model and self.hf_tokenizer) re-imports
torch unnecessarily; update initialize_model() to stash the imported module
(e.g., assign torch to self._torch when loading HF model) and in the vectorize
branch use that cached reference (self._torch) for tensor ops and no_grad()
context, replace local import torch and any torch.* calls with self._torch.* to
avoid repeated imports and to make the code rely on the already-initialized
runtime.

else:
return self._fallback_vectorize(text)
Comment on lines +343 to 373

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Silent fallback to character-hash vectorizer can corrupt cache semantics.

If the HF model successfully vectorizes text during put() but later fails during get() (e.g., OOM, torch error), the code silently falls back to _fallback_vectorize (line 110). The fallback produces a character-hash vector whose geometry is completely unrelated to the HF embedding space. Cosine similarity between an HF-produced vector and a fallback vector is meaningless — leading to guaranteed cache misses (or, worse, spurious hits).

This same concern applies if the model backend changes between process restarts (e.g., sentence-transformers is installed, vectors are cached, then later only transformers is available). The cached vectors become incompatible with newly produced ones.

Consider:

  1. Raising the exception in the HF path instead of silently falling back, or at minimum logging at error level and returning a sentinel that the caller can distinguish.
  2. Storing the embedding backend identifier alongside cached vectors so incompatible entries can be detected or evicted.
🤖 Prompt for AI Agents
In `@nexum_ai/optimizer.py` around lines 82 - 112, The HF branch currently
swallows errors and returns self._fallback_vectorize, which corrupts cache
semantics; instead, in the HF vectorization path (the block that uses
self.hf_model / self.hf_tokenizer and torch in optimizer.py) change the error
handling to either re-raise the exception (so callers see vectorization failure)
or return a distinguishable sentinel/error object (not a character-hash) and log
at error level; additionally, augment cache write/read (the put() and get()
paths that store embeddings) to persist a backend identifier (e.g., "backend_id"
derived from self.model/hf_model package+version or a constant like
"hf-v{model_name}") alongside each vector and make get() validate the stored
backend_id against the current backend, evicting or refusing incompatible
entries so HF-produced vectors are never compared to fallback hashes; ensure
references to self._fallback_vectorize, initialize_model, self.hf_model,
self.hf_tokenizer, put(), and get() are updated accordingly.


Expand Down
7 changes: 7 additions & 0 deletions reproduce_issue.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
from nexum_ai.optimizer import SemanticCache
import os

print("--- Default Behavior ---")
cache = SemanticCache()
# Force initialization
cache.initialize_model()
Comment on lines +1 to +7

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

This debug script should not be committed to the repository.

reproduce_issue.py appears to be a temporary debugging/demo artifact. It was likely used locally to exercise the new initialize_model() dual-path flow but adds no value as a permanent file in the repo. Either remove it entirely or convert it into a proper test under the test suite with assertions.

Additional issues if the file is kept:

  • os (line 2) is imported but never used.
  • Top-level side effects (printing, model initialization) should be guarded with if __name__ == "__main__":.
  • No error handling or assertions — it doesn't verify behavior.
🤖 Prompt for AI Agents
In `@reproduce_issue.py` around lines 1 - 7, reproduce_issue.py is a temporary
debug/demo script and should be removed from the repo or converted into a proper
automated test; if you choose to keep it, eliminate the unused import os, wrap
the top-level print and cache.initialize_model() calls inside an if __name__ ==
"__main__": guard, and replace bare side-effecting calls with a test that
instantiates SemanticCache and calls initialize_model() with assertions (or
try/except checks) to validate expected behavior instead of printing.

Loading