Implement CRI ListPodSandboxMetrics#10691
Conversation
|
Skipping CI for Draft Pull Request. |
2981bb8 to
3b7a1b7
Compare
e07f859 to
55a5d80
Compare
|
/cc @mikebrow |
b0830f6 to
44e5695
Compare
|
@akhilerm You still working on this? |
Yepp. I am working on it. Couldnt focus for sometime, but will pickup from next week onwards. |
|
/cc @zvonkok |
|
@zvonkok: GitHub didn't allow me to request PR reviews from the following users: zvonkok. Note that only containerd members and repo collaborators can review this PR, and authors cannot review their own PRs. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@akhilerm Is this still on your roadmap? |
|
AFAIC some metrics are missing if we want to be on par with cAdvisor. |
Yes
The |
44e5695 to
0b4a872
Compare
|
@akhilerm I mostly referring to this table: https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md AFAIR the FS were missing? |
0b4a872 to
41203bc
Compare
|
This is the e2e test which metrics we will need: kubernetes/kubernetes#126213 |
| var mu sync.Mutex | ||
| var wg sync.WaitGroup | ||
|
|
||
| semaphore := make(chan struct{}, 10) // Limit to 10 concurrent goroutines |
There was a problem hiding this comment.
nit: https://pkg.go.dev/golang.org/x/sync/errgroup#Group.SetLimit might be easier to use
| Name: "container_network_receive_bytes_total", | ||
| Timestamp: timestamp, | ||
| MetricType: runtime.MetricType_COUNTER, | ||
| LabelValues: append(podLabels, "eth0"), |
There was a problem hiding this comment.
Is the network device name guaranteed to be "eth0" ?
Also, don't we need to care about other network devices?
There was a problem hiding this comment.
@aojea isn't the kubernetes convention to default the pod network device interface name to eth0
52fe271 to
268e529
Compare
There was a problem hiding this comment.
Pull Request Overview
This PR implements the CRI ListPodSandboxMetrics and ListMetricDescriptors APIs in containerd, enabling monitoring systems to collect pod and container metrics from the CRI runtime. The implementation provides comprehensive metrics for CPU, memory, network, disk I/O, filesystem, and process statistics.
- Implements ListPodSandboxMetrics with concurrent collection of metrics for pod sandboxes and their containers
- Implements ListMetricDescriptors to provide metadata about available metrics
- Enhances network statistics collection to include packet counts and dropped packets
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/cri/server/sandbox_stats_linux.go | Enhanced getContainerNetIO function to collect additional network statistics (packets, dropped packets) |
| internal/cri/server/list_pod_sandbox_metrics_other.go | Platform stub for non-Linux systems returning unimplemented error |
| internal/cri/server/list_pod_sandbox_metrics_linux.go | Main implementation of ListPodSandboxMetrics with comprehensive metrics collection |
| internal/cri/server/list_pod_sandbox_metrics.go | Removed generic stub implementation, now handled by platform-specific files |
| internal/cri/server/list_metric_descriptors_other.go | Platform stub for non-Linux systems returning unimplemented error |
| internal/cri/server/list_metric_descriptors_linux.go | Implementation of ListMetricDescriptors with metric definitions and descriptions |
| internal/cri/server/list_metric_descriptors.go | Defined metric category constants for organizing metric types |
| go.mod | Moved golang.org/x/time from indirect to direct dependency for rate limiting |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| ) | ||
|
|
||
| // Rate limiter to prevent overwhelming the system with concurrent requests | ||
| var limiter = rate.NewLimiter(rate.Limit(10), 10) // Allow 10 concurrent requests with burst of 10 |
There was a problem hiding this comment.
The rate limiter is a global variable with hardcoded values. Consider making this configurable or using per-service instance variables to avoid global state.
| var mu sync.Mutex | ||
| var wg sync.WaitGroup | ||
|
|
||
| semaphore := make(chan struct{}, 10) // Limit to 10 concurrent goroutines |
There was a problem hiding this comment.
The semaphore size is hardcoded to 10. This should match the rate limiter configuration or be made configurable to maintain consistency.
|
all green ... but needs rebase.. |
I will rebase and also push the change to move to errGroup |
|
Rebased and made all the requested changes. |
|
|
||
| if sandbox.NetNSPath != "" { | ||
| rxBytes, rxErrors, txBytes, txErrors := getContainerNetIO(ctx, sandbox.NetNSPath) | ||
| rxBytes, rxErrors, txBytes, txErrors, rxPackets, rxDropped, txPackets, txDropped := getContainerNetIO(ctx, sandbox.NetNSPath) |
There was a problem hiding this comment.
This function returns 8 values. Should we introduce a struct instead?
There was a problem hiding this comment.
since the getContainerNetIO was used in the container_stats also, I didnt want to introduce a change there. Also, in the initial implementation, we had used a struct , ref: #10691 (comment), but was later removed based on the comments
| } | ||
|
|
||
| // Use a default namespace if we can't determine it | ||
| namespace := "default" |
There was a problem hiding this comment.
What if it's a different namespace?
There was a problem hiding this comment.
We can default to k8s.io namespace as that will be the one used by CRI service.
Have updated to use k8s.io always.
There was a problem hiding this comment.
Hardcoding namespace might not be the best approach here.
You probably can retrieve the root directory path via shim manager.
type ShimInstance interface {
ID() string
Namespace() string
Bundle() string // Returns bundle path
// ...
}
shim, err := shimManager.Get(ctx, containerID)
if err != nil {
return err
}
bundlePath := shim.Bundle()
rootPath := filepath.Join(bundlePath, "rootfs")Or another way is to expose it in tasks's state:
state, err := task.State(ctx)
if err != nil {
return err
}
state.BundlePathThere was a problem hiding this comment.
We need to find a way to get the rootfs path without leaking the runtime specific details into the cri service.
Signed-off-by: Akhil Mohan <akhilerm@gmail.com> Signed-off-by: Davanum Srinivas <davanum@gmail.com>
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
Signed-off-by: Akhil Mohan <akhilerm@gmail.com> Signed-off-by: Davanum Srinivas <davanum@gmail.com>
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
|
A few more additional metrics will be added as mentioned in kubernetes-sigs/cri-tools#1931 for the conformance |
|
@akhilerm could you pls create a follow up issue to address #10691 (comment) |
Implement the following CRI APIs
Fixes: #10506
TESTING
crictl metricspcommand can be used to test the pod sandbox metrics returned by the runtime.Output
Ref: https://gist.github.com/akhilerm/625d12b805d482cd577311be3a4f7551
Part of kubernetes/enhancements#2371