add concurrent workers for consumer group metrics collection #506
donovanbai-dd wants to merge 2 commits into danielqsj:master from
Conversation
```diff
 module github.com/danielqsj/kafka_exporter

-go 1.24
+go 1.24.0
```
lgtm, please fix the golang lint issue @donovanbai-dd https://github.com/danielqsj/kafka_exporter/actions/runs/20580739843/job/59777223997?pr=506
it's fixed now

@donovanbai-dd I tried your build. The results are better. However, I can still see gaps in the consumer lag metric. Sometimes I get metrics every 30 seconds, but sometimes it takes up to 90-120 seconds. These gaps could trigger an alert for me. I have 3500 topics and 1500 consumer groups. Is this expected behaviour? With the build from master, I get data every 15 seconds, but the problem is that if I don't select a single consumer group and instead choose all 1500 consumer groups, metrics are not flowing properly. You can find the ticket here. I'm getting the same issue here as well, but it's better than before. Args:

@Guruhbk When I deployed master to prod I saw metric gaps for large clusters too, due to the exporter taking a long time to return a response. I think it's normal for the scrape time to vary a lot because of load on the Kafka cluster, and potentially caching mechanisms at play too. I see you already tried using a high number of

@donovanbai-dd Thanks for your response. Running multiple instances of the exporter is my plan B, but I was hoping to solve this issue with a single instance.

@donovanbai-dd, I was wondering: why not use a caching mechanism and run the metric gathering in the background at a fixed interval (e.g. 30s)? That way there will always be metrics available when /metrics is hit, and none will be missing. With a smaller cluster you get near-real-time data; with a larger cluster the data may arrive 30-60 seconds late, but you will get it for sure. With real-time fetching on a larger cluster, there is no guarantee you will get the data at all. What are your thoughts on this?

@Guruhbk I agree that doing the metric gathering in the background is a good approach. That's what https://github.com/seglo/kafka-lag-exporter does, and its performance is good; it's actually what we used in the past, though we are moving away from it for other reasons. Switching to background collection is a bigger change beyond the scope of this PR, and not something I'm planning to do personally. Note: I'm not a project maintainer :). The purpose of this PR is to make an optimization without changing too much of the codebase.

@donovanbai-dd I can work on it, but I'm not sure whether it will get approved. That said, this PR alone is enough to resolve my issues. Thank you for that.

@danielqsj Is there any timeline for this PR to be merged into the main branch?
This PR speeds up consumer group metric collection, which can be very slow for large Kafka clusters.
Changes:
- Add new flags `--group.workers` (default: 100) and `--group.metrics.timeout` (default: 5m).
- Instead of collecting metrics for each consumer group serially for each broker, use a separate goroutine for each group.

Results on a test cluster:
Before: 2 minutes to collect consumer group metrics
After: 14 seconds to collect consumer group metrics