Describe the bug
We have one somewhat large Vespa deployment with close to 100 schemas. During deployment, search latencies spike into seconds, making the container unresponsive for a brief moment; restarting the applications that connect to Vespa usually helps.
During the deployment, only two metrics stand out:
- an increase in mean and max container latencies
- an increase in in-flight search connections from containers to content nodes
This hints that the containers lose their connections to the content nodes for some reason and reconnect right afterwards, which could be why latencies bubble up to the application.
Content nodes also report stack traces like the one below to stderr. The trace shows an attribute flush (proton::FlushableAttribute) constructing an enum-store enumerator, which walks the entire sharded hash dictionary (ShardedHashMap::foreach_key) on the field writer executor thread:
```
(0x7f102637cc33) from
(0x7f102637b4b6) from
operator new(unsigned long)(0x7f102637a519) from
(0x7f1025917a7a) from
vespalib::datastore::FixedSizeHashMap::foreach_key(std::function<void (vespalib::datastore::EntryRef)> const&) const(0x7f1022345be7) from
vespalib::datastore::ShardedHashMap::foreach_key(std::function<void (vespalib::datastore::EntryRef)>) const(0x7f102234b9e0) from
vespalib::datastore::UniqueStoreHashDictionaryReadSnapshot<vespalib::datastore::ShardedHashMap>::fill()(0x7f1025916d14) from
vespalib::datastore::UniqueStoreEnumerator<vespalib::datastore::EntryRefT<22u, 10u> >::UniqueStoreEnumerator(vespalib::datastore::IUniqueStoreDictionary const&, vespalib::datastore::DataStoreBase&, bool)(0x7f102594a988) from
search::EnumStoreT<long>::make_enumerator()(0x7f10259657f6) from
search::EnumAttributeSaver::EnumAttributeSaver(search::IEnumStore&)(0x7f1025949b95) from
search::SingleValueEnumAttributeSaver::SingleValueEnumAttributeSaver(vespalib::GenerationHandler::Guard&&, search::attribute::AttributeHeader const&, std::vector<vespalib::datastore::EntryRef, vespalib::allocator_large<vespalib::datastore::EntryRef> >&&, search::IEnumStore&)(0x7f1025b799b2) from
search::SingleValueEnumAttribute<search::EnumAttribute<search::IntegerAttributeTemplate<long> > >::onInitSave(std::basic_string_view<char, std::char_traits<char> >)(0x7f1025b6e655) from
search::AttributeVector::initSave(std::basic_string_view<char, std::char_traits<char> >)(0x7f10258f104e) from
proton::FlushableAttribute::Flusher::Flusher(proton::FlushableAttribute&, unsigned long, proton::AttributeDirectory::Writer&)(0x709c53) from
proton::FlushableAttribute::internalInitFlush(unsigned long)(0x70a2d1) from
(0x70a4c8) from
vespalib::SingleExecutor::run_tasks_till(unsigned long)(0x7f1022500265) from
vespalib::SingleExecutor::drain_tasks()(0x7f102250035f) from
vespalib::SingleExecutor::run()(0x7f10225003c0) from
proton_field_writer_executor(vespalib::Runnable&)(0x5ae089) from
(0x7f102249a7b2) from
(0x7f1015644b23) from
(0x7f102637d4fe) from
```
services.xml and a schema example are in a separate gist: https://gist.github.com/buinauskas/00fcb2547513b78b6728be7283b103e3. All of the schemas follow the same pattern: a key entity field that uses dictionary: hash, some extra fields, and a doc_ts field to help with garbage collection.
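For readers without access to the gist, here is a minimal sketch of that pattern; the field and schema names below are placeholders, the real definitions are in the gist:

```
schema feature_example {
    document feature_example {
        # hypothetical key field; the hash dictionary here is the
        # detail the stack trace above points at
        field entity_hash type long {
            indexing: attribute | summary
            attribute: fast-search
            dictionary: hash
        }
        # extra feature payload field (name is a placeholder)
        field feature_value type double {
            indexing: attribute | summary
        }
        # timestamp referenced by the garbage-collection selection
        # expression in services.xml
        field doc_ts type long {
            indexing: attribute | summary
        }
    }
}
```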
The deployment is used as a feature store serving features of various document sizes and volumes. We realize Vespa might not be the best fit for this use case, but it has been serving us really well, and the latency spike during deployment has been the only real problem so far.
We also realize that the number of schemas is large, and we will consider splitting this into smaller content clusters.
Expected behavior
I would expect the deployment to remain stable regardless of the number of schemas in the content cluster.
Environment (please complete the following information):
- OS: Rocky Linux 8.1
- Infrastructure: self-hosted
Vespa version
8.665.18
Additional context
I'm not sure whether you accept contributions / analysis from coding agents; for this issue I'll assume that you do, let me know otherwise. I fed the stack trace to Opus 4.6 (1M context), and it suggested the issue might be related to the fact that we use dictionary: hash in our schemas; I've attached the conversation with Claude to the gist.
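For context, this is the setting in question, isolated from the schema sketch above (field name is again a placeholder; as I understand it, a btree dictionary is Vespa's default and hash is the opt-in we use):

```
field entity_hash type long {
    indexing: attribute | summary
    attribute: fast-search
    # what our schemas use today, and what the
    # ShardedHashMap frames in the trace iterate over
    dictionary: hash
}
```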