
add concurrent workers for consumer group metrics collection #506

Open

donovanbai-dd wants to merge 2 commits into danielqsj:master from donovanbai-dd:group-workers

Conversation

@donovanbai-dd

This PR speeds up consumer group metric collection, which can be very slow for large Kafka clusters.

Changes:

  1. Add --group.workers (default 100) and --group.metrics.timeout (default 5m). Instead of collecting metrics for each consumer group serially on each broker, use a separate goroutine per group.
  2. For each consumer group, make a copy of the broker object before calling FetchOffset, since a broker object does not support concurrent requests.
  3. Remove an unnecessary mutex lock that was taken for each partition of each topic.
  4. Add debug logging for execution-time details.

Results on a test cluster
Before: 2 minutes to collect consumer group metrics
After: 14 seconds to collect consumer group metrics

```
module github.com/danielqsj/kafka_exporter

go 1.24
go 1.24.0
```
Author


@danielqsj
Owner

LGTM, please fix the Go lint issue @donovanbai-dd: https://github.com/danielqsj/kafka_exporter/actions/runs/20580739843/job/59777223997?pr=506

@donovanbai-dd
Author

@Guruhbk

Guruhbk commented Jan 13, 2026

@donovanbai-dd I tried your build. The results are better. However, I can see gaps in the consumer lag metric. Sometimes I get metrics every 30 seconds, but sometimes it takes up to 90-120 seconds. These gaps could trigger an alert for me. I have 3500 topics and 1500 consumer groups. Is this expected behaviour?

With the build from master, I get data every 15 seconds. But the problem is, if I don't filter by consumer group and instead select all 1500 consumer groups, metrics are not flowing properly. You can find the ticket here; I'm getting the same issue there as well, but it's better than before.

Args:

```
- --kafka.server=kafka-1.kafka:9094
- --kafka.server=kafka-2.kafka:9094
- --kafka.server=kafka-3.kafka:9094
- --kafka.version=3.7.0
- --sasl.enabled
- --sasl.mechanism=PLAIN
- --sasl.username=
- --sasl.password=
- --log.enable-sarama
- --verbosity=2
- --topic.workers=1500
- --group.workers=500
```

@donovanbai-dd
Author

@Guruhbk When I deployed master to prod I saw metric gaps for large clusters too, because the exporter took a long time to return a response. I think it's normal for the scrape time to vary a lot with load on the Kafka cluster, and potentially caching mechanisms at play too.

I see you've already tried a high number of --topic.workers and --group.workers. If you can't adjust your alerts to account for the longer scrapes, another idea is to run multiple deployments for one cluster and split the work between them using --topic.filter and --group.filter (I have not tried this).

@Guruhbk

Guruhbk commented Jan 14, 2026

@donovanbai-dd Thanks for your response. Running multiple instances of the exporter is my plan B. But I was hoping to see whether I could solve this issue with a single instance.

@Guruhbk

Guruhbk commented Jan 14, 2026

@donovanbai-dd, I was wondering: why not use a caching mechanism and run the metric gathering in the background at a fixed interval (e.g. 30s)? That way, whenever /metrics is hit, there will always be metrics available, and none will be missing. If you have a smaller cluster, you get near-real-time data; if your cluster is larger, you may get the actual data 30-60 seconds late, but you will get it for sure. With real-time fetching on a larger cluster, there is no guarantee you will get the data at all. What's your thought on this?

@donovanbai-dd
Author

@Guruhbk I agree that doing the metric gathering in the background is a good approach. That's what https://github.com/seglo/kafka-lag-exporter does, and the performance is good; it's actually what we used in the past, though we're moving away from it for other reasons.

Switching to background collection is a bigger change beyond the scope of this PR, though, and not something I'm planning to do personally. (Note: I'm not a project maintainer.) The purpose of this PR is to make an optimization without changing too much of the codebase.

@Guruhbk

Guruhbk commented Jan 19, 2026

@donovanbai-dd I can work on it, but I'm not sure whether it will get approved. Having said that, this PR alone is enough to resolve my issues. Thank you for that.

@Guruhbk

Guruhbk commented Jan 19, 2026

@danielqsj Is there a timeline for merging this PR into the main branch?
