Describe the bug
We have one somewhat large Vespa deployment with close to 100 schemas. During deployment, search latencies spike into seconds, making the container unresponsive for a brief moment; restarting the applications that connect to Vespa usually helps.
During the deployment, only two metrics stand out:
- an increase in mean and max container latencies
- an increase in in-flight search connections from containers to content nodes
This hints that the containers lose their connections to the content nodes for some reason and reconnect right afterwards, which could be why latencies bubble up to the application.
Content nodes also report stack traces like the one below to stderr. The trace shows an attribute flush (proton::FlushableAttribute) constructing an enum-store enumerator, which walks the entire sharded hash dictionary (ShardedHashMap::foreach_key) on the field writer executor thread:
```
(0x7f102637cc33) from
(0x7f102637b4b6) from
operator new(unsigned long)(0x7f102637a519) from
(0x7f1025917a7a) from
vespalib::datastore::FixedSizeHashMap::foreach_key(std::function<void (vespalib::datastore::EntryRef)> const&) const(0x7f1022345be7) from
vespalib::datastore::ShardedHashMap::foreach_key(std::function<void (vespalib::datastore::EntryRef)>) const(0x7f102234b9e0) from
vespalib::datastore::UniqueStoreHashDictionaryReadSnapshot<vespalib::datastore::ShardedHashMap>::fill()(0x7f1025916d14) from
vespalib::datastore::UniqueStoreEnumerator<vespalib::datastore::EntryRefT<22u, 10u> >::UniqueStoreEnumerator(vespalib::datastore::IUniqueStoreDictionary const&, vespalib::datastore::DataStoreBase&, bool)(0x7f102594a988) from
search::EnumStoreT<long>::make_enumerator()(0x7f10259657f6) from
search::EnumAttributeSaver::EnumAttributeSaver(search::IEnumStore&)(0x7f1025949b95) from
search::SingleValueEnumAttributeSaver::SingleValueEnumAttributeSaver(vespalib::GenerationHandler::Guard&&, search::attribute::AttributeHeader const&, std::vector<vespalib::datastore::EntryRef, vespalib::allocator_large<vespalib::datastore::EntryRef> >&&, search::IEnumStore&)(0x7f1025b799b2) from
search::SingleValueEnumAttribute<search::EnumAttribute<search::IntegerAttributeTemplate<long> > >::onInitSave(std::basic_string_view<char, std::char_traits<char> >)(0x7f1025b6e655) from
search::AttributeVector::initSave(std::basic_string_view<char, std::char_traits<char> >)(0x7f10258f104e) from
proton::FlushableAttribute::Flusher::Flusher(proton::FlushableAttribute&, unsigned long, proton::AttributeDirectory::Writer&)(0x709c53) from
proton::FlushableAttribute::internalInitFlush(unsigned long)(0x70a2d1) from
(0x70a4c8) from
vespalib::SingleExecutor::run_tasks_till(unsigned long)(0x7f1022500265) from
vespalib::SingleExecutor::drain_tasks()(0x7f102250035f) from
vespalib::SingleExecutor::run()(0x7f10225003c0) from
proton_field_writer_executor(vespalib::Runnable&)(0x5ae089) from
(0x7f102249a7b2) from
(0x7f1015644b23) from
(0x7f102637d4fe) from
```
services.xml and a schema example are in a separate gist: https://gist.github.com/buinauskas/00fcb2547513b78b6728be7283b103e3. All of the schemas follow the same pattern: a key entity field that uses dictionary: hash, some extra fields, and a doc_ts field to help with garbage collection.
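For readers without access to the gist, here is a minimal sketch of that pattern; the field and schema names below are placeholders, the real definitions are in the gist:

```
schema feature_example {
    document feature_example {
        # hypothetical key field; the hash dictionary here is the
        # detail the stack trace above points at
        field entity_hash type long {
            indexing: attribute | summary
            attribute: fast-search
            dictionary: hash
        }
        # extra feature payload field (name is a placeholder)
        field feature_value type double {
            indexing: attribute | summary
        }
        # timestamp referenced by the garbage-collection selection
        # expression in services.xml
        field doc_ts type long {
            indexing: attribute | summary
        }
    }
}
```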
The deployment is used as a feature store serving features of various document sizes and volumes. We realize Vespa might not be the best fit for this use case, but it has been serving us really well, and the latency spike during deployment has been the only real problem so far.
We also realize that the number of schemas is large, and we will consider splitting this into smaller content clusters.
Expected behavior
I would expect the deployment to remain stable regardless of the number of schemas in the content cluster.
Environment (please complete the following information):
- OS: Rocky Linux 8.1
- Infrastructure: self-hosted
Vespa version
8.665.18
Additional context
I'm not sure whether you accept contributions / analysis from coding agents; for this issue I'll assume that you do, let me know otherwise. I fed the stack trace to Opus 4.6 (1M context), and it suggested the issue might be related to the fact that we use dictionary: hash in our schemas; I've attached the conversation with Claude to the gist.
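For context, this is the setting in question, isolated from the schema sketch above (field name is again a placeholder; as I understand it, a btree dictionary is Vespa's default and hash is the opt-in we use):

```
field entity_hash type long {
    indexing: attribute | summary
    attribute: fast-search
    # what our schemas use today, and what the
    # ShardedHashMap frames in the trace iterate over
    dictionary: hash
}
```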