
Implement CRI ListPodSandboxMetrics #10691

Merged
dmcgowan merged 7 commits into containerd:main from akhilerm:list-pod-sandbox-metrics
Oct 24, 2025

Conversation

@akhilerm
Member

@akhilerm akhilerm commented Sep 10, 2024

Implement the following CRI APIs

  • ListPodSandboxMetrics
  • ListMetricDescriptors

Fixes: #10506

TESTING

The crictl metricsp command can be used to test the pod sandbox metrics returned by the runtime.

Output

Ref: https://gist.github.com/akhilerm/625d12b805d482cd577311be3a4f7551

Part of kubernetes/enhancements#2371

* **Pod Sandbox Metrics** ([#10691](https://github.com/containerd/containerd/pull/10691))
  
  Full implementation of the Kubernetes CRI pod-level metrics API
  * **ListPodSandboxMetrics**: Query metrics for running pods/sandboxes
  * **ListMetricDescriptors**: Discover available metrics and their descriptions

@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@akhilerm
Member Author

/cc @mikebrow

@zvonkok

zvonkok commented Nov 28, 2024

@akhilerm You still working on this?

@akhilerm
Member Author

@akhilerm You still working on this?

Yep, I am working on it. I couldn't focus on it for some time, but I will pick it up again from next week onwards.

@zvonkok

zvonkok commented Nov 28, 2024

/cc @zvonkok

@k8s-ci-robot

@zvonkok: GitHub didn't allow me to request PR reviews from the following users: zvonkok.

Note that only containerd members and repo collaborators can review this PR, and authors cannot review their own PRs.


@zvonkok

zvonkok commented Dec 16, 2024

@akhilerm Is this still on your roadmap?

@zvonkok

zvonkok commented Dec 20, 2024

AFAIC some metrics are missing if we want to be on par with cAdvisor.

@akhilerm
Member Author

@akhilerm Is this still on your roadmap?

Yes

AFAIC some metrics are missing if we want to be on par with cAdvisor.

ListMetricDescriptors should include every metric; the list was ported over from cAdvisor. I will cross-check whether I missed anything while porting it from the cAdvisor code.

@akhilerm akhilerm force-pushed the list-pod-sandbox-metrics branch from 44e5695 to 0b4a872 Compare December 31, 2024 09:22
@zvonkok

zvonkok commented Jan 1, 2025

@akhilerm I am mostly referring to this table: https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md AFAIR the filesystem (FS) metrics were missing?

@akhilerm akhilerm force-pushed the list-pod-sandbox-metrics branch from 0b4a872 to 41203bc Compare January 7, 2025 15:25
@k8s-ci-robot k8s-ci-robot added size/L and removed size/S labels Jan 7, 2025
@zvonkok

zvonkok commented Jan 9, 2025

This is the e2e test that shows which metrics we will need: kubernetes/kubernetes#126213

var mu sync.Mutex
var wg sync.WaitGroup

semaphore := make(chan struct{}, 10) // Limit to 10 concurrent goroutines

Member Author

Done.

Member Author

@AkihiroSuda can you take a look again?

Name: "container_network_receive_bytes_total",
Timestamp: timestamp,
MetricType: runtime.MetricType_COUNTER,
LabelValues: append(podLabels, "eth0"),

Member

Is the network device name guaranteed to be "eth0" ?
Also, don't we need to care about other network devices?

Member

@aojea isn't it the Kubernetes convention to default the pod network interface name to eth0?
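
To make the question above concrete, here is a minimal sketch of emitting one receive-bytes metric per network interface instead of assuming "eth0". The `ifaceStats` and `metric` types and the `rxBytesMetrics` helper are hypothetical and are not taken from this PR. The sketch also copies the pod labels for each interface, because `append(podLabels, name)` can share a backing array between metrics when `podLabels` has spare capacity.

```go
package main

import "fmt"

// ifaceStats is a hypothetical per-interface counter set (not a containerd type).
type ifaceStats struct {
	Name    string
	RxBytes uint64
}

// metric is a hypothetical, simplified stand-in for a CRI metric sample.
type metric struct {
	Name        string
	LabelValues []string
	Value       uint64
}

// rxBytesMetrics builds one metric per interface. Each metric gets its own
// copy of the pod labels so that later appends cannot overwrite earlier ones.
func rxBytesMetrics(podLabels []string, ifaces []ifaceStats) []metric {
	out := make([]metric, 0, len(ifaces))
	for _, s := range ifaces {
		labels := make([]string, 0, len(podLabels)+1)
		labels = append(labels, podLabels...)
		labels = append(labels, s.Name) // interface name is the last label value
		out = append(out, metric{
			Name:        "container_network_receive_bytes_total",
			LabelValues: labels,
			Value:       s.RxBytes,
		})
	}
	return out
}

func main() {
	pod := []string{"sandbox-id", "sandbox-name"}
	for _, m := range rxBytesMetrics(pod, []ifaceStats{
		{Name: "eth0", RxBytes: 1024},
		{Name: "net1", RxBytes: 2048},
	}) {
		fmt.Println(m.Name, m.LabelValues, m.Value)
	}
}
```

Whether secondary interfaces should be reported at all is a design decision for the PR; the sketch only shows that the label handling does not need to hardcode a device name.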

Comment thread internal/cri/server/list_pod_sandbox_metrics_linux.go Outdated
Contributor

Copilot AI left a comment

Pull Request Overview

This PR implements the CRI ListPodSandboxMetrics and ListMetricDescriptors APIs in containerd, enabling monitoring systems to collect pod and container metrics from the CRI runtime. The implementation provides comprehensive metrics for CPU, memory, network, disk I/O, filesystem, and process statistics.

  • Implements ListPodSandboxMetrics with concurrent collection of metrics for pod sandboxes and their containers
  • Implements ListMetricDescriptors to provide metadata about available metrics
  • Enhances network statistics collection to include packet counts and dropped packets
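
As a rough illustration of how the two APIs relate, the sketch below pairs a descriptor (name, help text, label keys) with a metric sample and checks that the sample's label values line up with the descriptor's label keys. The `metricDescriptor` and `metric` types are hypothetical local stand-ins, not the generated CRI types, and the label keys shown are assumptions made for the example.

```go
package main

import "fmt"

// metricDescriptor is a hypothetical stand-in for what ListMetricDescriptors
// returns: the metric name, a help string, and the ordered label keys.
type metricDescriptor struct {
	Name      string
	Help      string
	LabelKeys []string
}

// metric is a hypothetical stand-in for one sample from ListPodSandboxMetrics.
type metric struct {
	Name        string
	LabelValues []string
	Value       uint64
}

// validate returns an error if the metric has no descriptor, or if its label
// values do not match the descriptor's label keys one-to-one.
func validate(descs []metricDescriptor, m metric) error {
	for _, d := range descs {
		if d.Name != m.Name {
			continue
		}
		if len(d.LabelKeys) != len(m.LabelValues) {
			return fmt.Errorf("%s: %d label keys but %d label values",
				m.Name, len(d.LabelKeys), len(m.LabelValues))
		}
		return nil
	}
	return fmt.Errorf("%s: no descriptor", m.Name)
}

func main() {
	descs := []metricDescriptor{{
		Name:      "container_network_receive_bytes_total",
		Help:      "Cumulative count of bytes received",
		LabelKeys: []string{"id", "name", "interface"}, // assumed keys
	}}
	m := metric{
		Name:        "container_network_receive_bytes_total",
		LabelValues: []string{"sandbox-id", "sandbox-name", "eth0"},
		Value:       1024,
	}
	fmt.Println(validate(descs, m))
}
```

A consumer could run a check like this once per scrape to catch a metric that was added to the collector without a matching descriptor.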

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| internal/cri/server/sandbox_stats_linux.go | Enhanced getContainerNetIO function to collect additional network statistics (packets, dropped packets) |
| internal/cri/server/list_pod_sandbox_metrics_other.go | Platform stub for non-Linux systems returning unimplemented error |
| internal/cri/server/list_pod_sandbox_metrics_linux.go | Main implementation of ListPodSandboxMetrics with comprehensive metrics collection |
| internal/cri/server/list_pod_sandbox_metrics.go | Removed generic stub implementation, now handled by platform-specific files |
| internal/cri/server/list_metric_descriptors_other.go | Platform stub for non-Linux systems returning unimplemented error |
| internal/cri/server/list_metric_descriptors_linux.go | Implementation of ListMetricDescriptors with metric definitions and descriptions |
| internal/cri/server/list_metric_descriptors.go | Defined metric category constants for organizing metric types |
| go.mod | Moved golang.org/x/time from indirect to direct dependency for rate limiting |

Comment thread internal/cri/server/list_metric_descriptors_linux.go Outdated
)

// Rate limiter to prevent overwhelming the system with too many requests
var limiter = rate.NewLimiter(rate.Limit(10), 10) // Allow 10 requests per second with a burst of 10

Copilot AI Oct 10, 2025

The rate limiter is a global variable with hardcoded values. Consider making this configurable or using per-service instance variables to avoid global state.

Comment thread internal/cri/server/list_pod_sandbox_metrics_linux.go Outdated
var mu sync.Mutex
var wg sync.WaitGroup

semaphore := make(chan struct{}, 10) // Limit to 10 concurrent goroutines

Copilot AI Oct 10, 2025

The semaphore size is hardcoded to 10. This should match the rate limiter configuration or be made configurable to maintain consistency.

@mikebrow
Member

all green ... but needs rebase..

@akhilerm
Member Author

all green ... but needs rebase..

I will rebase and also push the change to move to errgroup.

@akhilerm
Member Author

Rebased and made all the requested changes.


  if sandbox.NetNSPath != "" {
-     rxBytes, rxErrors, txBytes, txErrors := getContainerNetIO(ctx, sandbox.NetNSPath)
+     rxBytes, rxErrors, txBytes, txErrors, rxPackets, rxDropped, txPackets, txDropped := getContainerNetIO(ctx, sandbox.NetNSPath)

Member

This function returns 8 values. Should we introduce a struct instead?

Member Author

Since getContainerNetIO is also used in container_stats, I didn't want to introduce a change there. Also, the initial implementation used a struct (ref: #10691 (comment)), but it was later removed based on review comments.
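
One way the struct suggestion could coexist with the existing callers is sketched below: the eight counters live in a struct, and a small adapter returns the original four values so that older call sites need no changes. The `netIOStats` type and its `legacy` method are hypothetical names chosen for this example and are not part of the PR.

```go
package main

import "fmt"

// netIOStats is a hypothetical struct grouping the eight counters that the
// function under review returns as separate values.
type netIOStats struct {
	RxBytes, RxPackets, RxErrors, RxDropped uint64
	TxBytes, TxPackets, TxErrors, TxDropped uint64
}

// legacy returns only the four values the original signature exposed, in the
// original order, so existing callers can keep their current shape.
func (s netIOStats) legacy() (rxBytes, rxErrors, txBytes, txErrors uint64) {
	return s.RxBytes, s.RxErrors, s.TxBytes, s.TxErrors
}

func main() {
	s := netIOStats{RxBytes: 100, RxPackets: 2, TxBytes: 50, TxDropped: 1}

	// Older call sites use the four-value form.
	rxBytes, rxErrors, txBytes, txErrors := s.legacy()
	fmt.Println(rxBytes, rxErrors, txBytes, txErrors)

	// Newer call sites read the extra counters by name.
	fmt.Println(s.RxPackets, s.TxDropped)
}
```

Named fields also remove the risk of transposing two `uint64` values at a call site, which is easy to do with eight positional returns.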

}

// Use a default namespace if we can't determine it
namespace := "default"

Member

What if it's a different namespace?

Member Author

@akhilerm akhilerm Oct 22, 2025

We can default to k8s.io namespace as that will be the one used by CRI service.

Have updated to use k8s.io always.

Member

@mxpv mxpv Oct 22, 2025

Hardcoding namespace might not be the best approach here.
You probably can retrieve the root directory path via shim manager.

  type ShimInstance interface {
      ID() string
      Namespace() string
      Bundle() string  // Returns bundle path
      // ...
  }

  shim, err := shimManager.Get(ctx, containerID)
  if err != nil {
      return err
  }
  bundlePath := shim.Bundle()
  rootPath := filepath.Join(bundlePath, "rootfs")

Another way is to expose it in the task's state:

  state, err := task.State(ctx)
  if err != nil {
      return err
  }

  state.BundlePath

Member Author

We need to find a way to get the rootfs path without leaking the runtime specific details into the cri service.

Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>

Member

@mikebrow mikebrow left a comment

LGTM

Member

@dims dims left a comment

LGTM

@akhilerm
Member Author

A few additional metrics will be added for conformance, as mentioned in kubernetes-sigs/cri-tools#1931.

@mxpv
Member

mxpv commented Oct 24, 2025

@akhilerm could you please create a follow-up issue to address #10691 (comment)?


Development

Successfully merging this pull request may close these issues.

[CRI] The ListPodSandboxMetrics method in containerd is not yet implemented

10 participants