
Container latencies spike during a deployment on a Vespa instance with many schemas #36361

@buinauskas

Description

Describe the bug

We have one somewhat large Vespa deployment with close to 100 schemas. During a deployment, search latencies spike into the seconds, making the container unresponsive for a brief moment; restarting the applications that connect to Vespa usually helps.

During the deployment, only two metrics stood out:

  • increase in mean and max container latencies
  • increase in in-flight search connections from containers to content nodes

This hints that the containers lose their connections to the content nodes for some reason and reconnect right afterwards, which could be why the latencies bubble up to the application.

Content nodes also report stack traces like this one to stderr:

```
(0x7f102637cc33) from
(0x7f102637b4b6) from
operator new(unsigned long) (0x7f102637a519) from
(0x7f1025917a7a) from
vespalib::datastore::FixedSizeHashMap::foreach_key(std::function<void (vespalib::datastore::EntryRef)> const&) const (0x7f1022345be7) from
vespalib::datastore::ShardedHashMap::foreach_key(std::function<void (vespalib::datastore::EntryRef)>) const (0x7f102234b9e0) from
vespalib::datastore::UniqueStoreHashDictionaryReadSnapshot<vespalib::datastore::ShardedHashMap>::fill() (0x7f1025916d14) from
vespalib::datastore::UniqueStoreEnumerator<vespalib::datastore::EntryRefT<22u, 10u> >::UniqueStoreEnumerator(vespalib::datastore::IUniqueStoreDictionary const&, vespalib::datastore::DataStoreBase&, bool) (0x7f102594a988) from
search::EnumStoreT<long>::make_enumerator() (0x7f10259657f6) from
search::EnumAttributeSaver::EnumAttributeSaver(search::IEnumStore&) (0x7f1025949b95) from
search::SingleValueEnumAttributeSaver::SingleValueEnumAttributeSaver(vespalib::GenerationHandler::Guard&&, search::attribute::AttributeHeader const&, std::vector<vespalib::datastore::EntryRef, vespalib::allocator_large<vespalib::datastore::EntryRef> >&&, search::IEnumStore&) (0x7f1025b799b2) from
search::SingleValueEnumAttribute<search::EnumAttribute<search::IntegerAttributeTemplate<long> > >::onInitSave(std::basic_string_view<char, std::char_traits<char> >) (0x7f1025b6e655) from
search::AttributeVector::initSave(std::basic_string_view<char, std::char_traits<char> >) (0x7f10258f104e) from
proton::FlushableAttribute::Flusher::Flusher(proton::FlushableAttribute&, unsigned long, proton::AttributeDirectory::Writer&) (0x709c53) from
proton::FlushableAttribute::internalInitFlush(unsigned long) (0x70a2d1) from
(0x70a4c8) from
vespalib::SingleExecutor::run_tasks_till(unsigned long) (0x7f1022500265) from
vespalib::SingleExecutor::drain_tasks() (0x7f102250035f) from
vespalib::SingleExecutor::run() (0x7f10225003c0) from
proton_field_writer_executor(vespalib::Runnable&) (0x5ae089) from
(0x7f102249a7b2) from
(0x7f1015644b23) from
(0x7f102637d4fe)
```

services.xml and an example schema are in a separate gist: https://gist.github.com/buinauskas/00fcb2547513b78b6728be7283b103e3. All of the schemas follow the same pattern: a key entity field that uses `dictionary: hash`, extra feature fields, and a `doc_ts` field to help with garbage collection.
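To make the pattern concrete without reproducing the gist, a minimal hypothetical schema along these lines might look as follows (the schema and field names other than `doc_ts` are illustrative, not taken from our actual deployment):

```
schema feature_entity {
    document feature_entity {
        # key entity field, looked up exactly, with a hashed dictionary
        field entity_id type string {
            indexing: attribute | summary
            attribute: fast-search
            dictionary {
                hash
            }
        }
        # one of several extra feature fields
        field feature_value type long {
            indexing: attribute | summary
        }
        # ingestion timestamp used for garbage collection
        # (via a document-selection expression in services.xml)
        field doc_ts type long {
            indexing: attribute
        }
    }
}
```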

The deployment is used as a feature store to serve features of varying document sizes and volumes. We realize Vespa might not be the best fit for this use case, but it has been serving us really well, and the latency spike during deployments has been the only real problem so far.

We do realize that the number of schemas is large and we'll consider splitting this into smaller content clusters.

Expected behavior

I would expect the deployment to remain stable regardless of the number of schemas in the content cluster.

Screenshots

(two metric screenshots attached to the issue)

Environment (please complete the following information):

  • OS: Rocky Linux 8.1
  • Infrastructure: self-hosted

Vespa version
8.665.18

Additional context

I'm not sure whether you accept contributions/analysis from coding agents; for this issue I'll assume you do, let me know otherwise. I fed the stack trace to Opus 4.6 (1M context), and it suggested the issue might be related to the fact that we use `dictionary: hash` in our schemas. I've attached the conversation with Claude to the gist.
