Environment
- Vespa version: 8.672.3
- Deployment type: Self-managed Vespa using RPM/package installation (not Docker)
- OS: AlmaLinux 8.10
- Cluster size: 3 nodes
- Node hostnames: vespa-node1, vespa-node2, vespa-node3
Cluster setup
I am running:
- 3 config servers
- 3 cluster controllers
- 3 slobroks
- 3 container nodes
- 3 content nodes for each content cluster
Relevant services.xml:
<services version="1.0">
  <admin version="2.0">
    <configservers>
      <configserver hostalias="node0"/>
      <configserver hostalias="node1"/>
      <configserver hostalias="node2"/>
    </configservers>
    <cluster-controllers>
      <cluster-controller hostalias="node0"/>
      <cluster-controller hostalias="node1"/>
      <cluster-controller hostalias="node2"/>
    </cluster-controllers>
    <slobroks>
      <slobrok hostalias="node0"/>
      <slobrok hostalias="node1"/>
      <slobrok hostalias="node2"/>
    </slobroks>
    <adminserver hostalias="node0"/>
  </admin>
  <container id="default" version="1.0">
    <search/>
    <document-api/>
    <nodes>
      <node hostalias="node0"/>
      <node hostalias="node1"/>
      <node hostalias="node2"/>
    </nodes>
  </container>
  <content id="semantic_search" version="1.0">
    <redundancy>2</redundancy>
    <documents>
      <document type="semantic_search" mode="index"/>
    </documents>
    <nodes>
      <node distribution-key="0" hostalias="node0"/>
      <node distribution-key="1" hostalias="node1"/>
      <node distribution-key="2" hostalias="node2"/>
    </nodes>
  </content>
  <content id="agent_router" version="1.0">
    <redundancy>2</redundancy>
    <documents>
      <document type="agent_router" mode="index"/>
    </documents>
    <nodes>
      <node distribution-key="0" hostalias="node0"/>
      <node distribution-key="1" hostalias="node1"/>
      <node distribution-key="2" hostalias="node2"/>
    </nodes>
  </content>
  <content id="quick_reply" version="1.0">
    <redundancy>2</redundancy>
    <documents>
      <document type="quick_reply" mode="index"/>
    </documents>
    <nodes>
      <node distribution-key="0" hostalias="node0"/>
      <node distribution-key="1" hostalias="node1"/>
      <node distribution-key="2" hostalias="node2"/>
    </nodes>
  </content>
  <content id="quick_reply_category" version="1.0">
    <redundancy>2</redundancy>
    <documents>
      <document type="quick_reply_category" mode="index"/>
    </documents>
    <nodes>
      <node distribution-key="0" hostalias="node0"/>
      <node distribution-key="1" hostalias="node1"/>
      <node distribution-key="2" hostalias="node2"/>
    </nodes>
  </content>
</services>
hosts.xml:
<hosts>
  <host name="vespa-node1">
    <alias>node0</alias>
  </host>
  <host name="vespa-node2">
    <alias>node1</alias>
  </host>
  <host name="vespa-node3">
    <alias>node2</alias>
  </host>
</hosts>
Cluster status looks healthy
curl -s http://localhost:19050/cluster/v2/quick_reply_category/storage | jq
Output:
{
  "node": {
    "0": { "link": "/cluster/v2/quick_reply_category/storage/0" },
    "1": { "link": "/cluster/v2/quick_reply_category/storage/1" },
    "2": { "link": "/cluster/v2/quick_reply_category/storage/2" }
  }
}
Distribution state shows 3 distributors and 3 storage nodes:
"baseline": "version:6 bits:8 distributor:3 ... storage:3 ..."
All storage nodes are up
curl -s http://localhost:19050/cluster/v2/quick_reply_category/storage/1 | jq
Output:
{
  "state": {
    "generated": { "state": "up" },
    "unit": { "state": "up" },
    "user": { "state": "up" }
  },
  "metrics": {
    "unique-document-count": 1
  }
}
Current issue
Even though the content cluster appears healthy, requests to the cluster controller API intermittently return:
{ "message": "No known master cluster controller currently exists." }
Example:
curl http://localhost:19050/cluster/v2/quick_reply_category/ | jq
Output on vespa-node1:
{ "message": "No known master cluster controller currently exists." }
But on vespa-node2:
{
  "state": {
    "generated": {
      "state": "up",
      "reason": ""
    }
  },
  "service": {
    "storage": {
      "link": "/cluster/v2/quick_reply_category/storage"
    },
    "distributor": {
      "link": "/cluster/v2/quick_reply_category/distributor"
    }
  }
}
And on vespa-node3:
{
  "message": "Cluster controller not master. Use master at vespa-node2:19050."
}
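To compare the responses across all nodes and content clusters in one pass, something like the following sketch could be used. It assumes the hostnames and cluster IDs listed above and port 19050 as in the curl commands. It also assumes that only the current master answers with a body containing a "state" object, which is what the three responses above suggest but is not confirmed. The function names classify_response and check_all are made up for illustration.

```shell
#!/usr/bin/env bash

# Classify the body returned by /cluster/v2/<cluster>/ on one node.
# The two message patterns are the ones quoted in this post; treating any
# body with a "state" key as "master" is an assumption.
classify_response() {
  local body="$1"
  case "$body" in
    *"No known master cluster controller currently exists"*) echo "no-master" ;;
    *"Cluster controller not master"*) echo "not-master" ;;
    *'"state"'*) echo "master" ;;
    *) echo "unknown" ;;  # empty body, timeout, or an unexpected message
  esac
}

# Ask every node about every content cluster and print one line per pair.
# Hostnames and cluster IDs are taken from services.xml and hosts.xml above.
check_all() {
  local host cluster body
  for cluster in semantic_search agent_router quick_reply quick_reply_category; do
    for host in vespa-node1 vespa-node2 vespa-node3; do
      body=$(curl -s --max-time 5 "http://${host}:19050/cluster/v2/${cluster}/")
      printf '%-22s %-12s %s\n' "$cluster" "$host" "$(classify_response "$body")"
    done
  done
}
```

After sourcing the file, calling check_all prints one line per cluster and node, which makes it easier to see whether "no-master" is tied to a particular node, a particular cluster, or changes between runs.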
Question
Why does the cluster controller for some content clusters report:
No known master cluster controller currently exists
even though:
- All 3 cluster controllers are running
- All 3 storage nodes are up
- Documents can be written successfully
- Distribution state shows distributor:3 storage:3
- Inter-node connectivity is healthy