[Q&A] Phase 19.4 — MultiModalEncoder: Embedding Dimensions, Fusion Strategies & Modality Backends #479
Common questions about the MultiModalEncoder component introduced in Phase 19.4.
Q1: Why a unified embedding space instead of keeping per-modality vectors?
A: A unified embedding space means every downstream component — SemanticParser, WorldModel, CuriosityModule, MemoryConsolidator — receives a single `MultiModalEmbedding` vector regardless of whether the original input was text, image, or audio. This eliminates modality-specific handling: `cosine_sim(embed_a, embed_b)` works whether `a` is text and `b` is an image.
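A minimal sketch of what this buys downstream code. The random vectors below are stand-ins for encoder output, not the real API:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity; for unit-length vectors this is just a dot product."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for MultiModalEmbedding vectors: whatever the source modality,
# every embedding shares the same embedding_dim, so one code path suffices.
rng = np.random.default_rng(0)
embed_text = rng.normal(size=256)   # e.g. produced from a text input
embed_image = rng.normal(size=256)  # e.g. produced from an image input

score = cosine_sim(embed_text, embed_image)  # no modality checks needed
```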
Q2: How does the SIMPLE_HASH text encoding backend work?
A: `SIMPLE_HASH` is the deterministic baseline encoder for text: it produces `embedding_dim` float values in [-1, 1] by interpreting byte pairs as scaled floats.
Properties:
- Deterministic — the same text always yields the same embedding.
- No semantic content — for semantically meaningful text embeddings, use the `TRANSFORMER` backend.
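A minimal sketch of a SIMPLE_HASH-style encoder, assuming SHA-256 as the hash and a straightforward byte-pair-to-float mapping; the exact scheme in Phase 19.4 may differ, only the properties above (determinism, `embedding_dim` outputs in [-1, 1]) are taken from the spec:

```python
import hashlib
import numpy as np

def simple_hash_encode(text: str, embedding_dim: int = 64) -> np.ndarray:
    """Sketch of a SIMPLE_HASH-style baseline: deterministic, no semantics.

    Hashes the input repeatedly until enough bytes are available, then
    interprets consecutive byte pairs as floats scaled into [-1, 1].
    """
    buf = b""
    counter = 0
    while len(buf) < embedding_dim * 2:  # two bytes per output float
        buf += hashlib.sha256(f"{counter}:{text}".encode()).digest()
        counter += 1
    pairs = np.frombuffer(buf[: embedding_dim * 2], dtype=np.uint16)
    return pairs.astype(np.float64) / 65535.0 * 2.0 - 1.0  # map to [-1, 1]

# Determinism: the same text always yields the same embedding.
assert np.array_equal(simple_hash_encode("hello"), simple_hash_encode("hello"))
```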
Q3: How does ATTENTION fusion differ from AVERAGE?
A: Both produce a fixed `embedding_dim` output from multiple per-modality vectors, but they weight modalities differently:
- AVERAGE: uniform weight 1/n for each modality.
- ATTENTION: αᵢ = softmax(vᵢ · q), where q is a query vector (trainable or context-derived).
With ATTENTION, if an image is highly informative but the text is generic, the attention weight for the image will be higher. With AVERAGE, both contribute equally regardless.
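A side-by-side sketch of the two strategies, assuming per-modality vectors stacked row-wise; the function names and random query are illustrative, not the Phase 19.4 API:

```python
import numpy as np

def fuse_average(vectors: np.ndarray) -> np.ndarray:
    # AVERAGE: uniform weight 1/n for each of the n modality vectors.
    return vectors.mean(axis=0)

def fuse_attention(vectors: np.ndarray, query: np.ndarray) -> np.ndarray:
    # ATTENTION: alpha_i = softmax(v_i . q); informative modalities dominate.
    scores = vectors @ query                  # one score per modality
    scores -= scores.max()                    # numerical stability
    alphas = np.exp(scores) / np.exp(scores).sum()
    return alphas @ vectors                   # weighted sum, same embedding_dim

rng = np.random.default_rng(42)
per_modality = rng.normal(size=(3, 128))      # e.g. text, image, audio vectors
q = rng.normal(size=128)                      # trainable or context-derived query
fused_avg = fuse_average(per_modality)        # every modality weighted 1/3
fused_attn = fuse_attention(per_modality, q)  # weights follow alignment with q
```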
Q4: What does the modality_weights dict represent?
A: `modality_weights` is used by the `WEIGHTED_SUM` fusion strategy.
Rules:
- Keys must be `Modality` enum values.
- Values are floats in [0, 1] that must sum to 1.0.
- `embedding = Σ wᵢ × vᵢ` — the final embedding is a weighted combination of per-modality vectors.
- `EncoderConfig` validates `sum(weights.values()) ≈ 1.0` (tolerance `1e-6`) at construction time.
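A sketch of the rules above, with string keys standing in for the `Modality` enum; the validation mirrors the described `EncoderConfig` check but is not the actual implementation:

```python
import math
import numpy as np

def validate_modality_weights(weights: dict, tol: float = 1e-6) -> None:
    # Mirrors the described construction-time check in EncoderConfig.
    if any(not 0.0 <= w <= 1.0 for w in weights.values()):
        raise ValueError("each modality weight must lie in [0, 1]")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=tol):
        raise ValueError("modality weights must sum to 1.0")

def fuse_weighted_sum(vectors: dict, weights: dict) -> np.ndarray:
    # embedding = sum of w_i * v_i over the configured modalities.
    return sum(weights[m] * v for m, v in vectors.items())

weights = {"text": 0.5, "image": 0.3, "audio": 0.2}   # stand-ins for Modality
validate_modality_weights(weights)                    # passes: sums to 1.0
vecs = {m: np.ones(8) * i for i, m in enumerate(weights)}  # toy vectors
emb = fuse_weighted_sum(vecs, weights)  # 0.5*v_text + 0.3*v_image + 0.2*v_audio
```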
Q5: How is similarity computed between embeddings?
A: All `MultiModalEmbedding` vectors are L2-normalised to unit length, so similarity is simply the dot product of the two vectors. This is equivalent to cosine similarity since both vectors have ‖v‖₂ = 1.0.
Interpreting the score:
- 1.0 — identical embeddings.
- 0.5–0.9 — strongly related content.
- ~0.0 — unrelated content.
- < 0 — dissimilar content.
Use cases include novelty detection, e.g. `1 - similarity` as a novelty score.
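A sketch of the dot-product similarity and the derived novelty score; the function names are illustrative, and only the unit-norm precondition and the `1 - similarity` formula come from the answer above:

```python
import numpy as np

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Inputs are already unit-length, so the dot product IS the cosine.
    return float(np.dot(a, b))

def novelty(embedding: np.ndarray, reference: np.ndarray) -> float:
    # 1 - similarity, as used for novelty scoring.
    return 1.0 - similarity(embedding, reference)

v = np.random.default_rng(7).normal(size=128)
v /= np.linalg.norm(v)                       # L2-normalise to unit length
assert abs(similarity(v, v) - 1.0) < 1e-9    # identical vectors score 1.0
assert abs(novelty(v, v)) < 1e-9             # and are maximally non-novel
```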
Q6: What happens when an unsupported modality is passed?
A: The encoder raises `UnsupportedModalityError` immediately. To avoid the exception, check the encoder's supported modalities before calling (a sketch of both patterns follows). The `asi_multimodal_unsupported_modality_total` counter increments on every rejection, enabling alerting on misconfigured pipelines.
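A sketch of the failure path and the pre-check pattern. `UnsupportedModalityError` and the metric name come from the answer above; the class shape, `supported_modalities` attribute, and `encode` signature are assumptions:

```python
class UnsupportedModalityError(Exception):
    """Raised when the encoder has no backend for the requested modality."""

class EncoderSketch:
    # Hypothetical stand-in; the real encoder's registration API may differ.
    supported_modalities = {"text", "image", "audio"}

    def encode(self, data, modality: str):
        if modality not in self.supported_modalities:
            # The real encoder also increments
            # asi_multimodal_unsupported_modality_total here.
            raise UnsupportedModalityError(f"unsupported modality: {modality!r}")
        return data  # placeholder for per-modality backend dispatch

encoder = EncoderSketch()
# Pre-check pattern: avoid the exception on misconfigured pipelines.
if "video" in encoder.supported_modalities:
    encoder.encode(b"...", "video")
```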
Q7: How should multi-modal encoding be tested?
A: Follow the encode-separately, fuse-together, assert-similarity pattern.
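A pytest-style sketch of the pattern, assuming a configured encoder fixture with `encode` and `fuse` methods and embeddings exposing a `.vector` array; all of these names are illustrative:

```python
import numpy as np

def test_encode_fuse_similarity(encoder):
    # 1. Encode each modality separately.
    text_emb = encoder.encode("a dog barking", modality="text")
    audio_emb = encoder.encode(b"<wav bytes>", modality="audio")

    # 2. Fuse into one unified embedding.
    fused = encoder.fuse([text_emb, audio_emb])

    # 3. Assert: unit norm, and the fusion stays close to its parts.
    assert np.isclose(np.linalg.norm(fused.vector), 1.0)
    assert float(np.dot(fused.vector, text_emb.vector)) > 0.0
```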
Key test targets from the spec:
- `test_encode_text_deterministic` — same text → same embedding
- `test_encode_image_correct_dim` — output shape matches `embedding_dim`
- `test_fusion_average_is_mean` — AVERAGE fusion = arithmetic mean of per-modality vectors (pre-normalisation)
- `test_fusion_attention_weights_sum_to_one` — attention weights αᵢ sum to 1.0
- `test_unsupported_modality_raises` — `UnsupportedModalityError` for an unregistered modality
- `test_l2_normalisation` — all output vectors have unit L2 norm

Spec: Phase 19.4 — MultiModalEncoder · Issue: #481 · Wiki: Phase-19-MultiModal-Encoder