Stale Node Topology Label Cache causes Volume Provisioning Failures Until Restart #1478

@ryan-mist

Description

What happened:
During PVC provisioning with topology constraints, the external-provisioner fails with the error "error generating accessibility requirements: topology labels from selected node map[] does not match topology keys from CSINode", even though all required topology labels are present on the node.

The provisioner's topology cache can capture an incomplete entry in pvcNodeStore when additional labels are patched onto a node after it is created. On later reconciles the cache is never updated with the newer labels, because the lookup for the node is a cache hit (code ref). This causes continued provisioning failures until the controller is restarted.
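The stale-hit behavior described above can be illustrated with a minimal sketch. This is not the external-provisioner's code: nodeLabelCache, getStale, getRefreshed and the example.com/zone label are hypothetical names made up for illustration, and overwriting the entry on every lookup is just one possible remedy, not necessarily the fix the project would choose.

```go
package main

import "fmt"

// nodeLabelCache is a hypothetical stand-in for the provisioner's per-node
// topology cache. It is NOT the real external-provisioner type.
type nodeLabelCache struct {
	labels map[string]map[string]string // node name -> cached topology labels
}

func newNodeLabelCache() *nodeLabelCache {
	return &nodeLabelCache{labels: map[string]map[string]string{}}
}

// copyLabels returns an independent copy so later changes to the node's
// label map do not silently alter the cached entry.
func copyLabels(in map[string]string) map[string]string {
	out := make(map[string]string, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// getStale mimics the reported behavior: on a cache hit the stored labels
// are returned and the node's current labels are ignored.
func (c *nodeLabelCache) getStale(node string, current map[string]string) map[string]string {
	if cached, ok := c.labels[node]; ok {
		return cached
	}
	c.labels[node] = copyLabels(current)
	return c.labels[node]
}

// getRefreshed is one possible remedy (an assumption): always overwrite the
// entry with the node's current labels, so a cache hit cannot stay stale.
func (c *nodeLabelCache) getRefreshed(node string, current map[string]string) map[string]string {
	c.labels[node] = copyLabels(current)
	return c.labels[node]
}

func main() {
	cache := newNodeLabelCache()

	// First reconcile: the topology label has not been patched on yet.
	cache.getStale("node-a", map[string]string{})

	// The label is added later (hypothetical key).
	current := map[string]string{"example.com/zone": "zone-1"}

	fmt.Println("stale lookup:    ", cache.getStale("node-a", current))
	fmt.Println("refreshed lookup:", cache.getRefreshed("node-a", current))
}
```

The stale lookup keeps returning the empty label set captured on the first reconcile, which corresponds to the map[] seen in the error message.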

Additionally, the full selectedNodeLabels is never logged in the error message

fmt.Errorf("topology labels from selected node %v does not match topology keys from CSINode %v", selectedNodeLabels, topologyKeys)

even if some of the topology labels exist. This is because extractTopologyTerm returns nil for selectedNodeLabels if any topology label is missing from the Node.
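As a rough illustration of a more informative error, the sketch below separates the topology keys found on the node from the ones that are missing and reports both. The helpers splitTopologyKeys and topologyMismatchError, the extended error wording, and the example.com/* keys are all hypothetical; this is not how extractTopologyTerm is implemented.

```go
package main

import (
	"fmt"
	"sort"
)

// splitTopologyKeys is a hypothetical helper: it returns the topology labels
// that are present on the node and the topology keys that are missing,
// instead of discarding everything when a single key is absent.
func splitTopologyKeys(nodeLabels map[string]string, topologyKeys []string) (map[string]string, []string) {
	found := make(map[string]string)
	var missing []string
	for _, key := range topologyKeys {
		if value, ok := nodeLabels[key]; ok {
			found[key] = value
		} else {
			missing = append(missing, key)
		}
	}
	sort.Strings(missing) // stable ordering for log output
	return found, missing
}

// topologyMismatchError returns nil when every key is present. Otherwise it
// returns an error that includes the labels that were found and the keys
// that were missing. The wording after the colon is an assumption.
func topologyMismatchError(nodeLabels map[string]string, topologyKeys []string) error {
	found, missing := splitTopologyKeys(nodeLabels, topologyKeys)
	if len(missing) == 0 {
		return nil
	}
	return fmt.Errorf(
		"topology labels from selected node %v does not match topology keys from CSINode %v: missing keys %v",
		found, topologyKeys, missing)
}

func main() {
	nodeLabels := map[string]string{"example.com/zone": "zone-1"}
	topologyKeys := []string{"example.com/zone", "example.com/rack"}

	if err := topologyMismatchError(nodeLabels, topologyKeys); err != nil {
		fmt.Println(err)
	}
}
```

With partial labels reported this way, an operator could tell from the log alone which label was absent when the node was inspected.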

What you expected to happen:
The provisioner should not be blocked by stale cached data. After the necessary topology labels are added to the Node, provisioning should succeed.

How to reproduce it:
To reproduce this deterministically:

  1. Launch a Node with all required topology labels.
  2. Remove one topology label (ensure nothing syncs it back on).
  3. Create a StatefulSet (which in turn triggers volume provisioning).
  4. The PVC remains unbound and the provisioner logs "error generating accessibility requirements: topology labels from selected node map[] does not match topology keys from CSINode". At this point you have hit the race condition.
  5. Add the topology label back onto the Node. The PVC continues to remain unbound.

Anything else we need to know?:
I'll be happy to submit a PR for this.

Environment:

  • Driver version:
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
