
add concurrent workers for consumer group metrics collection #506

Open

donovanbai-dd wants to merge 2 commits into danielqsj:master from donovanbai-dd:group-workers

Conversation

@donovanbai-dd

This PR speeds up consumer group metric collection, which can be very slow for large Kafka clusters.

Changes:

  1. Add --group.workers (default 100) and --group.metrics.timeout (default 5m). Instead of collecting metrics for each consumer group serially on each broker, use a separate goroutine per group.
  2. For each consumer group, make a copy of the broker object before calling FetchOffset, since a broker object does not support concurrent requests.
  3. Remove an unnecessary mutex lock that was taken for each partition of each topic.
  4. Add debug logging for execution-time details.

Results on a test cluster
Before: 2 minutes to collect consumer group metrics
After: 14 seconds to collect consumer group metrics

```
module github.com/danielqsj/kafka_exporter

go 1.24
go 1.24.0
```
Author


@danielqsj
Owner

LGTM, please fix the Go lint issue @donovanbai-dd: https://github.com/danielqsj/kafka_exporter/actions/runs/20580739843/job/59777223997?pr=506

@donovanbai-dd
Author

@Guruhbk

Guruhbk commented Jan 13, 2026

@donovanbai-dd I tried your build. The results are better. However, I can see gaps in the consumer lag metric. Sometimes I get metrics every 30 seconds, but sometimes it takes up to 90-120 seconds. These gaps could trigger an alert for me. I have 3500 topics and 1500 consumer groups. Is this expected behaviour?

With the build from master, I get data every 15 seconds. But the problem is, if I don't filter by consumer group and instead select all 1500 consumer groups, metrics are not flowing properly. You can find the ticket here; I'm getting the same issue there as well, but it's better than before.

Args:

```
- --kafka.server=kafka-1.kafka:9094
- --kafka.server=kafka-2.kafka:9094
- --kafka.server=kafka-3.kafka:9094
- --kafka.version=3.7.0
- --sasl.enabled
- --sasl.mechanism=PLAIN
- --sasl.username=
- --sasl.password=
- --log.enable-sarama
- --verbosity=2
- --topic.workers=1500
- --group.workers=500
```

@donovanbai-dd
Author

@Guruhbk When I deployed master to prod I saw metric gaps for large clusters too, because the exporter took a long time to return a response. I think it's normal for the scrape time to vary a lot with load on the Kafka cluster, and potentially caching mechanisms at play too.

I see you've already tried a high number of --topic.workers and --group.workers. If you can't adjust your alerts to account for the longer scrapes, another idea is to run multiple deployments for one cluster and split the work between them using --topic.filter and --group.filter (I have not tried this).

@Guruhbk

Guruhbk commented Jan 14, 2026

@donovanbai-dd Thanks for your response. Running multiple instances of the exporter is my plan B. But I was hoping to see whether I could solve this issue with a single instance.

@Guruhbk

Guruhbk commented Jan 14, 2026

@donovanbai-dd, I was wondering: why not use a caching mechanism and run the metric gathering in the background at a fixed interval (e.g. 30s)? That way, whenever /metrics is hit, there will always be metrics available, and none will be missing. If you have a smaller cluster, you get near-real-time data; if your cluster is larger, you may get the actual data 30-60 seconds late, but you will get it for sure. With real-time fetching on a larger cluster, there is no guarantee you will get the data at all. What's your thought on this?

@donovanbai-dd
Author

@Guruhbk I agree that doing the metric gathering in the background is a good approach. That's what https://github.com/seglo/kafka-lag-exporter does, and the performance is good; it's actually what we used in the past, though we're moving away from it for other reasons.

Switching to background collection is a bigger change beyond the scope of this PR, though, and not something I'm planning to do personally. (Note: I'm not a project maintainer.) The purpose of this PR is to make an optimization without changing too much of the codebase.

@Guruhbk

Guruhbk commented Jan 19, 2026

@donovanbai-dd I can work on it, but I'm not sure whether it will get approved. Having said that, this PR alone is enough to resolve my issues. Thank you for that.

@Guruhbk

Guruhbk commented Jan 19, 2026

@danielqsj Is there a timeline for merging this PR into the main branch?
