No known master cluster controller currently exists #36451

@nishit93-hub

Environment

  • Vespa version: 8.672.3
  • Deployment type: Self-managed Vespa using RPM/package installation (not Docker)
  • OS: AlmaLinux 8.10
  • Cluster size: 3 nodes
  • Node hostnames:
    vespa-node1
    vespa-node2
    vespa-node3

Cluster setup

I am running:

  • 3 config servers
  • 3 cluster controllers
  • 3 slobroks
  • 3 container nodes
  • 3 content nodes for each content cluster

Relevant services.xml:

<services version="1.0">

  <admin version="2.0">
    <configservers>
      <configserver hostalias="node0"/>
      <configserver hostalias="node1"/>
      <configserver hostalias="node2"/>
    </configservers>

    <cluster-controllers>
      <cluster-controller hostalias="node0"/>
      <cluster-controller hostalias="node1"/>
      <cluster-controller hostalias="node2"/>
    </cluster-controllers>

    <slobroks>
      <slobrok hostalias="node0"/>
      <slobrok hostalias="node1"/>
      <slobrok hostalias="node2"/>
    </slobroks>

    <adminserver hostalias="node0"/>
  </admin>


  <container id="default" version="1.0">
    <search/>
    <document-api/>
    <nodes>
      <node hostalias="node0"/>
      <node hostalias="node1"/>
      <node hostalias="node2"/>
    </nodes>
  </container>


  <content id="semantic_search" version="1.0">
    <redundancy>2</redundancy>
    <documents>
      <document type="semantic_search" mode="index"/>
    </documents>
    <nodes>
      <node distribution-key="0" hostalias="node0"/>
      <node distribution-key="1" hostalias="node1"/>
      <node distribution-key="2" hostalias="node2"/>
    </nodes>
  </content>


  <content id="agent_router" version="1.0">
    <redundancy>2</redundancy>
    <documents>
      <document type="agent_router" mode="index"/>
    </documents>
    <nodes>
      <node distribution-key="0" hostalias="node0"/>
      <node distribution-key="1" hostalias="node1"/>
      <node distribution-key="2" hostalias="node2"/>
    </nodes>
  </content>


  <content id="quick_reply" version="1.0">
    <redundancy>2</redundancy>
    <documents>
      <document type="quick_reply" mode="index"/>
    </documents>
    <nodes>
      <node distribution-key="0" hostalias="node0"/>
      <node distribution-key="1" hostalias="node1"/>
      <node distribution-key="2" hostalias="node2"/>
    </nodes>
  </content>


  <content id="quick_reply_category" version="1.0">
    <redundancy>2</redundancy>
    <documents>
      <document type="quick_reply_category" mode="index"/>
    </documents>
    <nodes>
      <node distribution-key="0" hostalias="node0"/>
      <node distribution-key="1" hostalias="node1"/>
      <node distribution-key="2" hostalias="node2"/>
    </nodes>
  </content>

</services>

hosts.xml:

<hosts>
    <host name="vespa-node1">
        <alias>node0</alias>
    </host>
    <host name="vespa-node2">
        <alias>node1</alias>
    </host>
    <host name="vespa-node3">
        <alias>node2</alias>
    </host>
</hosts>

Cluster status looks healthy

curl -s http://localhost:19050/cluster/v2/quick_reply_category/storage | jq

Output:

{
  "node": {
    "0": { "link": "/cluster/v2/quick_reply_category/storage/0" },
    "1": { "link": "/cluster/v2/quick_reply_category/storage/1" },
    "2": { "link": "/cluster/v2/quick_reply_category/storage/2" }
  }
}

Distribution state shows 3 distributors and 3 storage nodes:

"baseline": "version:6 bits:8 distributor:3 ... storage:3 ..."
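To compare the distributor and storage counts across clusters without reading the state string by eye, a small helper along these lines could pull one field out of it. This is only a sketch: `state_field` is a hypothetical name, and it assumes the state string is a space-separated list of `key:value` tokens like the baseline quoted above.

```shell
# Print the value of one "key:value" token from a cluster state string.
# Returns non-zero if the key is not present.
state_field() {
  key="$1"
  state="$2"
  # Rely on word splitting to walk the space-separated tokens.
  for token in $state; do
    case "$token" in
      "$key":*)
        # Strip everything up to and including the first colon.
        echo "${token#*:}"
        return 0
        ;;
    esac
  done
  return 1
}
```

For example, `state_field storage "version:6 bits:8 distributor:3 ... storage:3 ..."` prints `3`.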

All storage nodes are up

curl -s http://localhost:19050/cluster/v2/quick_reply_category/storage/1 | jq

Output:

{
  "state": {
    "generated": { "state": "up" },
    "unit": { "state": "up" },
    "user": { "state": "up" }
  },
  "metrics": {
    "unique-document-count": 1
  }
}

Current issue

Even though the content cluster appears healthy, requests to the cluster controller API intermittently return:

{ "message": "No known master cluster controller currently exists." }

Example:

curl http://localhost:19050/cluster/v2/quick_reply_category/ | jq

Output on vespa-node1:

{ "message": "No known master cluster controller currently exists." }

But on vespa-node2:

{
  "state": {
    "generated": {
      "state": "up",
      "reason": ""
    }
  },
  "service": {
    "storage": {
      "link": "/cluster/v2/quick_reply_category/storage"
    },
    "distributor": {
      "link": "/cluster/v2/quick_reply_category/distributor"
    }
  }
}

And on vespa-node3:

{
  "message": "Cluster controller not master. Use master at vespa-node2:19050."
}
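To capture which of the three answers each node gives for each content cluster in one pass, something like the following could be run from any host that can reach port 19050 on all nodes. It is a rough sketch, not an official tool: the function names are made up for illustration, the hostnames and port are the ones from this report, and treating any response that contains a "state" object as coming from the master is an assumption based on the outputs above.

```shell
# Classify a cluster controller response body as one of:
#   no-master   the node knows of no master at all
#   not-master  the node points at another node as master
#   master      the node returned cluster state (assumed to be the master)
#   unknown     anything else, e.g. an empty body after a failed request
classify_response() {
  case "$1" in
    *"No known master cluster controller currently exists"*) echo "no-master" ;;
    *"Cluster controller not master"*)                       echo "not-master" ;;
    *'"state"'*)                                             echo "master" ;;
    *)                                                       echo "unknown" ;;
  esac
}

# Ask every node about one content cluster and print one line per node:
#   <cluster> <host> <classification>
poll_cluster() {
  cluster="$1"
  for host in vespa-node1 vespa-node2 vespa-node3; do
    body=$(curl -s --max-time 5 "http://${host}:19050/cluster/v2/${cluster}/" 2>/dev/null)
    printf '%s\t%s\t%s\n' "$cluster" "$host" "$(classify_response "$body")"
  done
}
```

Calling `poll_cluster quick_reply_category`, and likewise for `semantic_search`, `agent_router` and `quick_reply`, a few times in a row should show whether the "no-master" answer is tied to one node, to one cluster, or moves around.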

Question

Why does the cluster controller for some content clusters report:

No known master cluster controller currently exists

even though:

  • All 3 cluster controllers are running
  • All 3 storage nodes are up
  • Documents can be written successfully
  • Distribution state shows distributor:3 storage:3
  • Inter-node connectivity is healthy
