
fix(prometheus): flush expired slab memory in exporter timer #13195

Open
sihyeonn wants to merge 1 commit into apache:master from sihyeonn:fix/prometheus-flush-expired-slab

Conversation

@sihyeonn
Contributor

Summary

When metrics are configured with an expire value, nginx's slab allocator marks entries as logically expired but does not automatically return the underlying slab pages to the free-space pool. As a result, apisix_shared_dict_free_space_bytes for prometheus-metrics decreases monotonically over time — slabs are only reclaimed when explicitly flushed.

Root Cause

ngx.shared.DICT:flush_expired() must be called explicitly to reclaim slab memory from expired entries. Without it:

  • Time series expire logically (reads return nil after expire seconds)
  • But the slab memory is not returned to free space
  • free_space_bytes trends toward zero regardless of actual active time-series count
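The effect described above can be sketched with a toy model (Python, purely illustrative and not APISIX or nginx code — the class, capacity unit, and method names are simplifications): entries expire logically for reads, but the space they occupy is only returned to the free pool by an explicit flush.

```python
class ToySharedDict:
    """Toy model of an nginx shared dict: expired entries read as None,
    but their slab memory stays allocated until flush_expired() runs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # key -> (value, expiry deadline)

    def set(self, key, value, expire, now):
        self.entries[key] = (value, now + expire)

    def get(self, key, now):
        item = self.entries.get(key)
        if item is None or now >= item[1]:
            return None  # logically expired -- but memory is still held
        return item[0]

    def free_space(self):
        # One "slot" per stored entry, expired or not.
        return self.capacity - len(self.entries)

    def flush_expired(self, max_count, now):
        """Reclaim up to max_count expired entries, like
        ngx.shared.DICT:flush_expired(max_count)."""
        flushed = 0
        for key in list(self.entries):
            if flushed >= max_count:
                break
            if now >= self.entries[key][1]:
                del self.entries[key]
                flushed += 1
        return flushed


d = ToySharedDict(capacity=10)
d.set("series_a", 1, expire=5, now=0)
assert d.get("series_a", now=10) is None  # expired for reads...
assert d.free_space() == 9                # ...but space not reclaimed
d.flush_expired(max_count=1000, now=10)
assert d.free_space() == 10               # reclaimed only after flush
```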

This can be observed by comparing free_space_bytes with the active time-series count: the count fluctuates (e.g. drops significantly during low-traffic periods) while free space never recovers — even after most entries have expired.

Fix

Call dict:flush_expired(1000) inside exporter_timer, which already runs every refresh_interval (default 15 s) in the privileged agent process.

```lua
local prom_dict = ngx.shared["prometheus-metrics"]
if prom_dict then
    prom_dict:flush_expired(1000)
end
```

Why max_count=1000: Without a limit, a single flush call could hold the shared-dict write lock for an extended time if many expired entries have accumulated. Limiting to 1000 per cycle keeps the lock time well under 10 ms in practice, while remaining entries are flushed in subsequent timer ticks (every 15 s).
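The drain rate under this bound is simple arithmetic; a short sketch (Python, illustrative only — the function is not part of the patch) shows how even a large backlog of expired entries clears across timer ticks:

```python
def ticks_to_drain(backlog, max_count=1000):
    """Number of timer ticks needed to flush a backlog of expired
    entries when each tick reclaims at most max_count of them."""
    ticks = 0
    while backlog > 0:
        backlog -= min(backlog, max_count)
        ticks += 1
    return ticks


# With the default 15 s refresh_interval, a 10,000-entry backlog
# drains in 10 ticks, i.e. about 2.5 minutes.
assert ticks_to_drain(10_000) == 10
assert ticks_to_drain(500) == 1
```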

The call runs in the privileged agent process, which is separate from worker request-handling processes, so the brief write-lock has minimal impact on request throughput.

Checklist

  • No functional change to metric collection or rendering
  • Compatible with existing expire metric configuration
  • Works with any refresh_interval setting

When metrics are registered with an `expire` value, nginx's slab
allocator marks entries as logically expired but does not return
the underlying slab pages to the free-space pool automatically.
As a result, `free_space_bytes` decreases monotonically over time
even when many time series have expired, because slabs are only
reclaimed when a flush is explicitly requested.

Call `dict:flush_expired(1000)` in `exporter_timer` (which runs
every `refresh_interval`, defaulting to 15 s) so that expired slabs
are reclaimed promptly. The `max_count=1000` argument bounds the
write-lock hold time to a few milliseconds per call, avoiding any
noticeable impact on worker request processing.

Fixes the pattern where `apisix_shared_dict_free_space_bytes` for
`prometheus-metrics` decreases continuously until the dict is
exhausted, even though active time-series counts fluctuate normally.
@dosubot added labels size:S (This PR changes 10-29 lines, ignoring generated files) and bug (Something isn't working) on Apr 10, 2026
