Skip to content

add batch topic partition offset fetch to reduce the API calls#516

Open
Otiss-pang wants to merge 1 commit intodanielqsj:masterfrom
Otiss-pang:master
Open

add batch topic partition offset fetch to reduce the API calls#516
Otiss-pang wants to merge 1 commit intodanielqsj:masterfrom
Otiss-pang:master

Conversation

@Otiss-pang
Copy link
Copy Markdown

@Otiss-pang Otiss-pang commented Feb 13, 2026

Performance Optimization: Batch Offset Fetching and Negative Lag Handling

Summary

Optimizes kafka_exporter performance by implementing batch offset fetching and proper negative lag handling.

Motivation

The current implementation fetches topic partition offsets one by one, resulting in excessive API calls to Kafka brokers. Additionally, negative lag values can occur when topic offsets update between fetching topic offsets and consumer group offsets, leading to inaccurate metrics.

Key Changes

1. Batch Offset Fetching

  • Before: Fetched offsets sequentially using e.client.GetOffset() for each partition
  • After: Groups offset requests by broker leader and fetches them in batch using broker.GetAvailableOffsets()
  • Impact: Reduces API calls from O(partitions) to O(brokers)

2. Three-Phase Collection Strategy

Phase 1: Topic Offset Collection

  • Batch fetch newest and oldest offsets for all partitions
  • Build topicPartitionLeaders mapping for Phase 3
  • Process brokers concurrently

Phase 2: Consumer Group Metrics Collection

  • Fetch consumer group offsets
  • Calculate lag using cached topic offsets from Phase 1
  • Detect and defer negative lag cases for later processing

Phase 3: Negative Lag Resolution

  • Batch refresh topic offsets only for partitions with negative lag
  • Recalculate and emit metrics with updated offsets
  • Ensures accurate lag reporting

3. Code Improvements

  • Added deferredGroupTask struct for deferred processing
  • New helper methods:
    • emitGroupMetric(): Handles offset fetching and negative lag detection
    • reportGroupMetrics(): Unified metric emission
  • Reused broker connections instead of creating copies
  • Added phase timing logs for performance monitoring
  • Used sync.RWMutex for safe concurrent access

Benefits

Performance

  • Fewer API calls: Batch operations significantly reduce network requests
  • Lower latency: Fewer round trips to Kafka brokers
  • Better concurrency: Parallel processing across brokers

Accuracy

  • Correct lag values: Properly handles negative lag edge cases
  • Consistent metrics: Ensures topic offsets are fresh when calculating lag

Backward Compatibility

  • No changes to exposed metrics
  • No changes to configuration options
  • No breaking changes to API

Performance Testing

Example improvements on a cluster with 1k+ topics and 12k+ partitions:

  • API calls reduced by ~95% (from ~10000+ to ~10 broker requests)
  • Collection time improved from 3 mins to less than 10 seconds
  • No negative lag occurrences in metrics
image image

@Otiss-pang
Copy link
Copy Markdown
Author

@danielqsj could checking this merge when free?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants