Describe the bug

When a news document holds a reference<global_schema> field that points to a non-existent global document, the imported multi-dimensional sparse tensor field resolves to an untyped tensor() at ranking time instead of the declared type. A rank profile that uses this field in a match-feature expression then fails with a type-mismatch error, causing the entire result set to return 0 hits.

Conditions required to trigger the bug

All four of the following must be present simultaneously:

  1. Dangling parent reference — a child document's reference field points to a parent doc ID that does not exist in the global schema.
  2. Multi-dimensional imported tensor — the imported field is a sparse mapped tensor with two or more dimensions, e.g. tensor<float>(dim_a{},dim_b{}). A single-dimension tensor silently returns a typed empty tensor and does not trigger the error.
  3. Soft timeout fires during ranking, i.e. ranking.softtimeout.enable: true with a factor tight enough that some content nodes are cut off before finishing. In production this is caused by ANN (dense retrieval) latency on loaded nodes.
  4. No docsum retry, i.e. dispatch.docsumRetryLimit: 0 (or equivalent), so timed-out nodes are not retried and incomplete match-feature data reaches the container layer.
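
As an illustration of condition 3, the query body below is a minimal sketch of a nearestNeighbor request with soft timeout enabled. The field name `embedding`, the query tensor name `q`, the rank profile name `ctr_profile`, and the `targetHits` value are assumptions made for the example and are not taken from the application package. The timeout and factor defaults are the values reported in the trace. Condition 4 is left out because the report does not say where dispatch.docsumRetryLimit is configured.

```python
import json


def build_query(query_vector, timeout_s=0.45, soft_factor=0.9):
    """Build a Vespa query body with soft timeout enabled.

    `embedding`, `q` and `ctr_profile` are hypothetical names used only
    for this sketch; substitute the names from the real schema.
    """
    return {
        "yql": "select * from news where "
               "{targetHits: 100}nearestNeighbor(embedding, q)",
        "input.query(q)": list(query_vector),
        "ranking.profile": "ctr_profile",  # hypothetical profile name
        "timeout": f"{timeout_s}s",
        "ranking.softtimeout.enable": True,
        "ranking.softtimeout.factor": soft_factor,
    }


if __name__ == "__main__":
    print(json.dumps(build_query([0.1, 0.2, 0.3]), indent=2))
```

The resulting dictionary can be sent as the JSON body of a POST to the container's /search/ endpoint.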

Error

{
  "errors": [{
    "code": 4,
    "summary": "Invalid query parameter",
    "message": "'attribute(my_global_tensor)' must be of type tensor<float>(dim_a{},dim_b{}), not tensor()"
  }]
}
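
To spot this failure in client code or in log processing, a small matcher along these lines may help. It is only a sketch: the regular expression is derived from the single error message shown above, and the helper name `find_untyped_tensor_errors` is made up for this example. It accepts either the abbreviated shape shown above or a full Vespa result where errors sit under `root.errors`.

```python
import re

# Pattern derived from the one message in this report; other Vespa
# versions may word the error differently.
TYPE_MISMATCH = re.compile(
    r"'attribute\((?P<field>[^)]+)\)' must be of type "
    r"(?P<expected>.+), not (?P<actual>tensor\([^)]*\))$"
)


def find_untyped_tensor_errors(response):
    """Return (field, expected_type) pairs for errors where an attribute
    resolved to the untyped tensor() instead of its declared type."""
    errors = response.get("errors") or response.get("root", {}).get("errors", [])
    found = []
    for err in errors:
        match = TYPE_MISMATCH.search(err.get("message", ""))
        if match and match.group("actual") == "tensor()":
            found.append((match.group("field"), match.group("expected")))
    return found


if __name__ == "__main__":
    sample = {
        "errors": [{
            "code": 4,
            "summary": "Invalid query parameter",
            "message": "'attribute(my_global_tensor)' must be of type "
                       "tensor<float>(dim_a{},dim_b{}), not tensor()",
        }]
    }
    print(find_untyped_tensor_errors(sample))
```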

To Reproduce

Note: I was unable to reproduce this locally — the local two-node Docker setup with CPU throttling does not replicate the production timing conditions precisely enough. The reproduction below gets as close as possible by using the same schema, document structure, and query parameters.

1. Start Vespa (two-node Docker Compose, pinned to the affected version):

docker compose -f app-7-parent-child/docker-compose.yml up -d

Wait ~60 s, then check that the config server responds: curl http://localhost:19071/status.html

2. Build and deploy the application package:

mvn clean package -DskipTests -f app-7-parent-child/pom.xml
curl -X POST http://localhost:19071/application/v2/tenant/default/prepareandactivate \
  --data-binary @app-7-parent-child/target/application.zip \
  -H "Content-Type: application/zip"

3. Feed and query:

python reproduce_bug.py --feed   # feed once
python reproduce_bug.py          # re-run query

The script feeds one global document and 400 news documents (200 with a valid parent reference, 200 with a dangling reference), then queries with nearestNeighbor + soft timeout to mirror production conditions.
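
The feed split described above can be sketched as follows. This is not the code from reproduce_bug.py. The namespace `news`, the reference field name `source_ref`, and the `missing-N` parent IDs are assumptions chosen for illustration, and only the 200/200 split and the schema names come from this report.

```python
def build_news_docs(parent_id, n_valid=200, n_dangling=200):
    """Build Vespa put operations for the child documents.

    The first n_valid documents reference an existing parent; the rest
    reference parent IDs that are never fed (dangling references).
    `source_ref` is a hypothetical name for the reference field.
    """
    docs = []
    for i in range(n_valid + n_dangling):
        if i < n_valid:
            ref = parent_id
        else:
            # Parent ID that does not exist in the global schema.
            ref = f"id:news:news_source_ctr::missing-{i}"
        docs.append({
            "put": f"id:news:news::doc-{i}",
            "fields": {"source_ref": ref},
        })
    return docs


if __name__ == "__main__":
    docs = build_news_docs("id:news:news_source_ctr::parent-1")
    dangling = [d for d in docs if "missing-" in d["fields"]["source_ref"]]
    print(len(docs), len(dangling))
```

A real child document would also need whatever other fields the rank profile reads, such as the embedding used by nearestNeighbor.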

Expected behavior

Documents with a dangling parent reference should either return a zero-valued typed tensor for the imported field (so the match-feature evaluates to 0.0 without error), or be excluded from results with a warning. The type mismatch should not propagate as a query-level error and should not discard the entire result set.

Environment

  • Vespa version: 8.653.22 (confirmed from trace — see bug_trace_anonymized.json)
  • Infrastructure: self-hosted production cluster (Linux)
  • Reproduction attempt: macOS, Docker (vespaengine/vespa:8.653.22), two content nodes

Trace

A sanitized production trace is attached as bug_trace_anonymized.json. Key observations:

  • Query config: timeout: 0.45s, softtimeout.factor: 0.9 (soft deadline ~405 ms)
  • 12 content nodes timed out: distribution keys 3, 9, 15, 33, 57, 101, 103, 104, 106, 110, 125, 127
  • A representative timed-out node (dist key 128 equivalent) spent 406 ms before returning 0 hits — right at the soft-timeout threshold
  • After dispatch, hits were marked unfillable and the container raised the Invalid query parameter error
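
The soft deadline quoted in the first bullet is the query timeout multiplied by the soft-timeout factor. The helper below is a hypothetical one-liner written for this report to make the arithmetic explicit, and it shows that the 406 ms node from the trace sits just past that deadline.

```python
def soft_deadline_ms(timeout_s, factor):
    """Soft deadline in whole milliseconds: timeout multiplied by factor."""
    return round(timeout_s * factor * 1000)


if __name__ == "__main__":
    deadline = soft_deadline_ms(0.45, 0.9)
    node_elapsed_ms = 406  # representative timed-out node from the trace
    print(deadline, node_elapsed_ms >= deadline)
```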

The timeout is a symptom, not the root cause. The dangling reference causes attribute(my_global_tensor) to resolve to the untyped tensor(). This stalls match-feature computation on the affected nodes, the stall causes the soft timeout to fire on those nodes, and the type error then surfaces at the container layer.

Reproduction files

| File | Purpose |
|------|---------|
| reproduce_bug.py | Feed + query script |
| app-7-parent-child/docker-compose.yml | Two-node Vespa setup (v8.653.22) |
| app-7-parent-child/src/main/application/schemas/news_source_ctr.sd | Global schema with 2D tensor |
| app-7-parent-child/src/main/application/schemas/news.sd | Child schema with import + rank profile |
| app-7-parent-child/feed/feed_news_source_ctr.json | Global parent document |
| bug_trace_anonymized.json | Sanitized production trace |