[Q&A] Phase 19.4 — MultiModalEncoder: Embedding Dimensions, Fusion Strategies & Modality Backends #479
Common questions about the MultiModalEncoder component introduced in Phase 19.4.
Q1: Why a unified embedding space instead of keeping per-modality vectors?
A: A unified embedding space means every downstream component — SemanticParser, WorldModel, CuriosityModule, MemoryConsolidator — receives a single `MultiModalEmbedding` vector regardless of whether the original input was text, image, or audio. This eliminates modality-specific handling: `cosine_sim(embed_a, embed_b)` works whether `a` is text and `b` is an image.
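A minimal sketch of what this buys downstream code. The random vectors below are stand-ins for encoder output, not the real API:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity; for unit-length vectors this is just a dot product."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for MultiModalEmbedding vectors: whatever the source modality,
# every embedding shares the same embedding_dim, so one code path suffices.
rng = np.random.default_rng(0)
embed_text = rng.normal(size=256)   # e.g. produced from a text input
embed_image = rng.normal(size=256)  # e.g. produced from an image input

score = cosine_sim(embed_text, embed_image)  # no modality checks needed
```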
Q2: How does the SIMPLE_HASH text encoding backend work?
A: `SIMPLE_HASH` is the deterministic baseline encoder for text: it produces `embedding_dim` float values in [-1, 1] by interpreting byte pairs as scaled floats.
Properties:
- Deterministic — the same text always yields the same embedding.
- No semantic content — for semantically meaningful text embeddings, use the `TRANSFORMER` backend.
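A minimal sketch of a SIMPLE_HASH-style encoder, assuming SHA-256 as the hash and a straightforward byte-pair-to-float mapping; the exact scheme in Phase 19.4 may differ, only the properties above (determinism, `embedding_dim` outputs in [-1, 1]) are taken from the spec:

```python
import hashlib
import numpy as np

def simple_hash_encode(text: str, embedding_dim: int = 64) -> np.ndarray:
    """Sketch of a SIMPLE_HASH-style baseline: deterministic, no semantics.

    Hashes the input repeatedly until enough bytes are available, then
    interprets consecutive byte pairs as floats scaled into [-1, 1].
    """
    buf = b""
    counter = 0
    while len(buf) < embedding_dim * 2:  # two bytes per output float
        buf += hashlib.sha256(f"{counter}:{text}".encode()).digest()
        counter += 1
    pairs = np.frombuffer(buf[: embedding_dim * 2], dtype=np.uint16)
    return pairs.astype(np.float64) / 65535.0 * 2.0 - 1.0  # map to [-1, 1]

# Determinism: the same text always yields the same embedding.
assert np.array_equal(simple_hash_encode("hello"), simple_hash_encode("hello"))
```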
Q3: How does ATTENTION fusion differ from AVERAGE?
A: Both produce a fixed `embedding_dim` output from multiple per-modality vectors, but they weight modalities differently:
- AVERAGE: uniform weight 1/n for each modality.
- ATTENTION: αᵢ = softmax(vᵢ · q), where q is a query vector (trainable or context-derived).
With ATTENTION, if an image is highly informative but the text is generic, the attention weight for the image will be higher. With AVERAGE, both contribute equally regardless.
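A side-by-side sketch of the two strategies, assuming per-modality vectors stacked row-wise; the function names and random query are illustrative, not the Phase 19.4 API:

```python
import numpy as np

def fuse_average(vectors: np.ndarray) -> np.ndarray:
    # AVERAGE: uniform weight 1/n for each of the n modality vectors.
    return vectors.mean(axis=0)

def fuse_attention(vectors: np.ndarray, query: np.ndarray) -> np.ndarray:
    # ATTENTION: alpha_i = softmax(v_i . q); informative modalities dominate.
    scores = vectors @ query                  # one score per modality
    scores -= scores.max()                    # numerical stability
    alphas = np.exp(scores) / np.exp(scores).sum()
    return alphas @ vectors                   # weighted sum, same embedding_dim

rng = np.random.default_rng(42)
per_modality = rng.normal(size=(3, 128))      # e.g. text, image, audio vectors
q = rng.normal(size=128)                      # trainable or context-derived query
fused_avg = fuse_average(per_modality)        # every modality weighted 1/3
fused_attn = fuse_attention(per_modality, q)  # weights follow alignment with q
```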
Q4: What does the modality_weights dict represent?
A: `modality_weights` is used by the `WEIGHTED_SUM` fusion strategy.
Rules:
- Keys must be `Modality` enum values.
- Values are floats in [0, 1] that must sum to 1.0.
- `embedding = Σ wᵢ × vᵢ` — the final embedding is a weighted combination of per-modality vectors.
- `EncoderConfig` validates `sum(weights.values()) ≈ 1.0` (tolerance `1e-6`) at construction time.
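A sketch of the rules above, with string keys standing in for the `Modality` enum; the validation mirrors the described `EncoderConfig` check but is not the actual implementation:

```python
import math
import numpy as np

def validate_modality_weights(weights: dict, tol: float = 1e-6) -> None:
    # Mirrors the described construction-time check in EncoderConfig.
    if any(not 0.0 <= w <= 1.0 for w in weights.values()):
        raise ValueError("each modality weight must lie in [0, 1]")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=tol):
        raise ValueError("modality weights must sum to 1.0")

def fuse_weighted_sum(vectors: dict, weights: dict) -> np.ndarray:
    # embedding = sum of w_i * v_i over the configured modalities.
    return sum(weights[m] * v for m, v in vectors.items())

weights = {"text": 0.5, "image": 0.3, "audio": 0.2}   # stand-ins for Modality
validate_modality_weights(weights)                    # passes: sums to 1.0
vecs = {m: np.ones(8) * i for i, m in enumerate(weights)}  # toy vectors
emb = fuse_weighted_sum(vecs, weights)  # 0.5*v_text + 0.3*v_image + 0.2*v_audio
```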
Q5: How is similarity computed between embeddings?
A: All `MultiModalEmbedding` vectors are L2-normalised to unit length, so similarity is simply the dot product of the two vectors. This is equivalent to cosine similarity since both vectors have ‖v‖₂ = 1.0.
Interpreting the score:
- 1.0 — identical embeddings.
- 0.5–0.9 — strongly related content.
- ~0.0 — unrelated content.
- < 0 — dissimilar content.
Use cases include novelty detection, e.g. `1 - similarity` as a novelty score.
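A sketch of the dot-product similarity and the derived novelty score; the function names are illustrative, and only the unit-norm precondition and the `1 - similarity` formula come from the answer above:

```python
import numpy as np

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Inputs are already unit-length, so the dot product IS the cosine.
    return float(np.dot(a, b))

def novelty(embedding: np.ndarray, reference: np.ndarray) -> float:
    # 1 - similarity, as used for novelty scoring.
    return 1.0 - similarity(embedding, reference)

v = np.random.default_rng(7).normal(size=128)
v /= np.linalg.norm(v)                       # L2-normalise to unit length
assert abs(similarity(v, v) - 1.0) < 1e-9    # identical vectors score 1.0
assert abs(novelty(v, v)) < 1e-9             # and are maximally non-novel
```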
Q6: What happens when an unsupported modality is passed?
A: The encoder raises `UnsupportedModalityError` immediately. To avoid the exception, check the encoder's supported modalities before calling (a sketch of both patterns follows). The `asi_multimodal_unsupported_modality_total` counter increments on every rejection, enabling alerting on misconfigured pipelines.
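A sketch of the failure path and the pre-check pattern. `UnsupportedModalityError` and the metric name come from the answer above; the class shape, `supported_modalities` attribute, and `encode` signature are assumptions:

```python
class UnsupportedModalityError(Exception):
    """Raised when the encoder has no backend for the requested modality."""

class EncoderSketch:
    # Hypothetical stand-in; the real encoder's registration API may differ.
    supported_modalities = {"text", "image", "audio"}

    def encode(self, data, modality: str):
        if modality not in self.supported_modalities:
            # The real encoder also increments
            # asi_multimodal_unsupported_modality_total here.
            raise UnsupportedModalityError(f"unsupported modality: {modality!r}")
        return data  # placeholder for per-modality backend dispatch

encoder = EncoderSketch()
# Pre-check pattern: avoid the exception on misconfigured pipelines.
if "video" in encoder.supported_modalities:
    encoder.encode(b"...", "video")
```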
Q7: How should multi-modal encoding be tested?
A: Follow the encode-separately, fuse-together, assert-similarity pattern.
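A pytest-style sketch of the pattern, assuming a configured encoder fixture with `encode` and `fuse` methods and embeddings exposing a `.vector` array; all of these names are illustrative:

```python
import numpy as np

def test_encode_fuse_similarity(encoder):
    # 1. Encode each modality separately.
    text_emb = encoder.encode("a dog barking", modality="text")
    audio_emb = encoder.encode(b"<wav bytes>", modality="audio")

    # 2. Fuse into one unified embedding.
    fused = encoder.fuse([text_emb, audio_emb])

    # 3. Assert: unit norm, and the fusion stays close to its parts.
    assert np.isclose(np.linalg.norm(fused.vector), 1.0)
    assert float(np.dot(fused.vector, text_emb.vector)) > 0.0
```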
Key test targets from the spec:
- `test_encode_text_deterministic` — same text → same embedding
- `test_encode_image_correct_dim` — output shape matches `embedding_dim`
- `test_fusion_average_is_mean` — AVERAGE fusion = arithmetic mean of per-modality vectors (pre-normalisation)
- `test_fusion_attention_weights_sum_to_one` — attention weights αᵢ sum to 1.0
- `test_unsupported_modality_raises` — `UnsupportedModalityError` for an unregistered modality
- `test_l2_normalisation` — all output vectors have unit L2 norm

Spec: Phase 19.4 — MultiModalEncoder · Issue: #481 · Wiki: Phase-19-MultiModal-Encoder