
expose num_evts metric in Prometheus output (#3584) #3867

Open
ChrisJr404 wants to merge 2 commits into falcosecurity:master from ChrisJr404:expose-num-evts-prometheus

Conversation


ChrisJr404 commented May 3, 2026

Closes #3584.

Background

Falco's stats writer already exposes num_evts (the cumulative count of userspace events processed) via the JSON / text sinks at the path output_fields["falco.num_evts"]. The Prometheus output sink at /metrics never picked it up — anyone running with prometheus_metrics_enabled couldn't tell how many events the agent had actually processed without also enabling one of the other sinks.

@incertum surfaced this gap, @leogr kept it alive (last /remove-lifecycle stale on 2026-04-29), and the milestone has slid 0.42 → 0.43.

Change

Three small edits bridge num_evts into the Prometheus output without plumbing the existing stats_writer instance through to the Prometheus emitter (a combined sketch follows the list):

  1. userspace/falco/app/state.h — add std::atomic<uint64_t> num_evts = 0; to the shared falco::app::state struct so a single counter is reachable from both the per-source event loop and the Prometheus sink. Same atomic pattern the existing restart flag uses.

  2. userspace/falco/app/actions/process_events.cpp — after each per-source num_evts++ on the function-local counter, also call s.num_evts.fetch_add(1, std::memory_order_relaxed). One extra lock-free increment per event; the relaxed memory ordering is fine because nothing else synchronises on this counter.

  3. userspace/falco/falco_metrics.cpp — emit a falcosecurity_falco_num_evts_total counter alongside the existing outputs_queue_num_drops_total block in falco_to_text_prometheus(), using the same additional_wrapper_metrics.emplace_back(libsinsp_metrics::new_metric(...)) pattern.
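
Condensed into one self-contained sketch (struct and function names trimmed to essentials; the real patch routes the value through libsinsp's new_metric(...) wrapper, whose argument list is elided here, so the last function just formats the exposition text directly):

#include <atomic>
#include <cstdint>
#include <sstream>
#include <string>

// 1. state.h: one shared counter on the app state, same pattern as the
//    existing std::atomic<bool> restart flag.
struct state {
    std::atomic<uint64_t> num_evts = 0;
};

// 2. process_events.cpp: per-event hot path. Relaxed ordering suffices
//    because nothing else synchronises on this counter.
void on_event(state& s, uint64_t& local_num_evts) {
    local_num_evts++;                                    // existing per-source counter
    s.num_evts.fetch_add(1, std::memory_order_relaxed);  // new shared counter
}

// 3. falco_metrics.cpp: scrape-time read, emitted next to the queue-drops block.
std::string num_evts_prometheus(const state& s) {
    std::ostringstream out;
    out << "# HELP falcosecurity_falco_num_evts_total https://falco.org/docs/metrics/\n"
        << "# TYPE falcosecurity_falco_num_evts_total counter\n"
        << "falcosecurity_falco_num_evts_total "
        << s.num_evts.load(std::memory_order_relaxed) << '\n';
    return out.str();
}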

Resulting /metrics excerpt:

# HELP falcosecurity_falco_num_evts_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_num_evts_total counter
falcosecurity_falco_num_evts_total 12345

Verification

I don't have a kernel-headers + libsinsp build environment locally so I haven't run the unit-test suite end-to-end — relying on Falco's CI for that. The changes are mechanical though:

  • state.h already includes <atomic> (transitively, via <libsinsp/sinsp.h>) — confirmed by the existing std::atomic<bool> restart field.
  • state.num_evts.load(std::memory_order_relaxed) is const-correct, so the call works from falco_to_text_prometheus(const falco::app::state& state, ...); minimal repro below.
  • The new additional_wrapper_metrics.emplace_back(...) mirrors the queue-drops block immediately above it line-for-line, so the metric type / unit / monotonicity flags are consistent with the existing wrapper-metric convention.
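
(Minimal repro of the const-correctness claim: std::atomic's load() is a const member function, so reading through a const reference compiles.)

#include <atomic>
#include <cstdint>

struct state {
    std::atomic<uint64_t> num_evts = 0;
};

// Compiles because std::atomic<T>::load() is const-qualified.
uint64_t read_num_evts(const state& s) {
    return s.num_evts.load(std::memory_order_relaxed);
}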

Notes

  • Diff is +26 / 0 lines across three files. No public API change.
  • The increment site is the per-event hot loop. The relaxed atomic increment is a single lock xadd on x86 (a few nanoseconds uncontended) — should be negligible compared to the rule-evaluation cost per event. Happy to switch to a per-source local counter that's flushed every N events if maintainers want to be even more conservative.
  • The metric only counts events that reach the rule-evaluation path (the same scope output_fields["falco.num_evts"] already counts), so the prometheus value matches the existing JSON value exactly.
  • I picked MONOTONIC for the metric type because num_evts only ever increases over the agent's lifetime; matches outputs_queue_num_drops_total.
prometheus output: expose `falcosecurity_falco_num_evts_total` counter, mirroring the `falco.num_evts` field already available on the JSON/text sinks.

Contributor

poiana commented May 3, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ChrisJr404
Once this PR has been reviewed and has the lgtm label, please assign sgaist for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

poiana requested review from Kaizhe and irozzo-1A May 3, 2026 18:56
Contributor

poiana commented May 3, 2026

Welcome @ChrisJr404! It looks like this is your first PR to falcosecurity/falco 🎉

poiana added the size/S label May 3, 2026
The `num_evts` counter is already emitted by the JSON / text stats
sinks (via stats_writer::collector::get_metrics_output_fields_wrapper)
but the Prometheus output sink at /metrics never got it. Anyone
running Falco with prometheus_metrics_enabled couldn't see how many
events the agent had processed.

Three small changes to bridge the gap:

  app/state.h
    add `std::atomic<uint64_t> num_evts = 0;` to the shared state
    so a counter is reachable from both the per-source event loop
    and the prometheus sink without plumbing stats_writer through.

  app/actions/process_events.cpp
    after each `num_evts++` for the local source counter, also
    bump `s.num_evts` with relaxed memory ordering. Cheap, lock-free
    counter increment per event.

  falco_metrics.cpp
    emit a `falcosecurity_falco_num_evts_total` counter alongside the
    existing `falcosecurity_falco_outputs_queue_num_drops_total` block
    in `falco_to_text_prometheus`. Same metric type / unit pattern as
    the queue-drops counter just above it.

Output:

    # HELP falcosecurity_falco_num_evts_total https://falco.org/docs/metrics/
    # TYPE falcosecurity_falco_num_evts_total counter
    falcosecurity_falco_num_evts_total 12345

Signed-off-by: Chris (ChrisJr404) <11917633+ChrisJr404@users.noreply.github.com>
Contributor

ekoops commented May 6, 2026

Hey, thank you for this contribution. The change is structurally good, but I'm worried about the performance overhead of incrementing that atomic in the hot path for each event. Did you get the chance to do some perf analysis?

@ChrisJr404
Author

@ekoops good call to push back on this — I went and measured before assuming it was free.

Methodology

I built a standalone microbenchmark of the exact pattern (std::atomic<uint64_t>::fetch_add(1, std::memory_order_relaxed) on a shared cache line, alongside an existing function-local num_evts++) and ran it against three "synthetic event work" regimes that bracket realistic Falco per-event cost:

  • ~50 ns/event — pathological lower bound, far cheaper than any real rule eval
  • ~500 ns/event — small ruleset
  • ~5 us/event — closer to the default ruleset, matches the ~100K–200K evts/sec steady-state numbers commonly cited for Falco

For each regime I ran 1, 2, and 4 event-source threads (each thread pinned to a separate physical core, all hammering the single atomic — i.e. worst-case cache-line contention for the deployment shape Falco actually has). gcc 15.2 -O2, AMD Zen 4, 4 cores. Each cell is the best of 3 trials.
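
Not the actual bench source (offered below), but a minimal reconstruction of the harness shape: shared counter alone on its cache line, per-thread synthetic spin work, relaxed fetch_add compiled in or out. Core pinning is elided, and work_iters is a placeholder that needs calibrating per machine to hit the target ns/event.

#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// Shared counter alone on its cache line, so the measured contention is the
// benchmark's own and not accidental false sharing with neighbours.
alignas(64) std::atomic<uint64_t> g_num_evts{0};

// Synthetic per-event work; calibrate work_iters to the ~50 ns / ~500 ns /
// ~5 us regimes on the machine under test.
template <bool WITH_ATOMIC>
void worker(uint64_t events, unsigned work_iters) {
    uint64_t local_num_evts = 0;
    for(uint64_t i = 0; i < events; i++) {
        for(volatile unsigned w = 0; w < work_iters; w++) {
        }                  // stand-in for rule evaluation
        local_num_evts++;  // the pre-existing function-local counter
        if constexpr(WITH_ATOMIC) {
            g_num_evts.fetch_add(1, std::memory_order_relaxed);
        }
    }
}

template <bool WITH_ATOMIC>
double events_per_sec(unsigned nthreads, uint64_t events, unsigned work_iters) {
    std::vector<std::thread> threads;
    auto t0 = std::chrono::steady_clock::now();
    for(unsigned i = 0; i < nthreads; i++) {
        threads.emplace_back(worker<WITH_ATOMIC>, events, work_iters);
    }
    for(auto& t : threads) {
        t.join();
    }
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    return double(nthreads) * double(events) / dt.count();
}

int main() {
    constexpr uint64_t EVENTS = 20'000'000;
    for(unsigned nthreads : {1u, 2u, 4u}) {
        double base = events_per_sec<false>(nthreads, EVENTS, 10);
        double with = events_per_sec<true>(nthreads, EVENTS, 10);
        std::printf("%u thread(s): %.1fM -> %.1fM evts/s (%+.1f%% overhead)\n",
                    nthreads, base / 1e6, with / 1e6, (base / with - 1.0) * 100.0);
    }
    return 0;
}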

Numbers (events/sec, baseline → with-PR, % overhead, ns/event added)

regime        threads   baseline evts/s   with-atomic evts/s   overhead   ns/evt added
~50 ns/evt    1         128.2M            128.9M               -0.6%      ~0
~50 ns/evt    2         256.8M            165.1M               +55.6%     +2.16
~50 ns/evt    4         482.3M            146.2M               +229.9%    +4.77
~500 ns/evt   1         13.17M            12.91M               +2.0%      +1.54
~500 ns/evt   2         25.27M            25.49M               -0.9%      within noise
~500 ns/evt   4         47.12M            49.93M               -5.6%      within noise
~5 us/evt     1         1.31M             1.32M                -0.3%      within noise
~5 us/evt     2         2.58M             2.59M                -0.6%      within noise
~5 us/evt     4         4.67M             4.96M                -5.9%      within noise

Reading

The atomic shows up clearly only in the ~50 ns/event regime with multiple threads contending the cache line — that's where the cache-line ping-pong (~2-5 ns) stops being absorbed by the surrounding work. As soon as per-event work crosses a few hundred nanoseconds (i.e. any real ruleset), the overhead drops below measurement noise. At the ~5 us/event regime that approximates the default ruleset, the delta is statistically zero across 1/2/4 threads.

On cache-line locality

The counter sits on falco::app::state next to std::atomic<bool> restart, which is written rarely. There's no other hot writer sharing the line, so the only contention is between the per-source event-loop threads incrementing it. In practice that's typically 1 thread (syscall source) or 2 (syscall + a plugin source like k8saudit). The 4-thread numbers above are deliberately worse than what most deployments will see.

Decision

Overhead is ~2% at worst (single-threaded at ~500 ns/event) and within measurement noise in every other regime that resembles a real Falco workload. Keeping the per-event increment as-is.

That said, if you'd still rather avoid the per-event atomic on principle, the cleanest mitigation is to keep only the per-source local num_evts (already maintained at the existing stats_collector.collect() boundary a few lines above) and aggregate it lazily, only when the Prometheus output is actually scraped — i.e. drop the increment from line 392 and sum the per-source locals inside falco_to_text_prometheus(); a rough sketch of that shape is below. Happy to push that variant if you prefer; it does require plumbing the per-source local counters into state (or borrowing the values stats_writer already tracks), so it's a slightly larger diff.
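
Rough sketch of that variant (per_source_num_evts and source_counter are invented names; the real plumbing would reuse what stats_writer already tracks). Each slot has exactly one writer thread, so its cache line never ping-pongs; the slots stay atomic only so the scrape-time read is well-defined:

#include <atomic>
#include <cstdint>
#include <vector>

// Pad each per-source slot to its own cache line so neighbouring sources
// don't false-share, even though each slot has a single writer.
struct alignas(64) source_counter {
    std::atomic<uint64_t> num_evts{0};
};

struct state {
    // Hypothetical member; sized once, before the event loops start.
    std::vector<source_counter> per_source_num_evts;
    explicit state(size_t num_sources) : per_source_num_evts(num_sources) {}
};

// Hot path: relaxed increment on the source's own, uncontended line.
void on_event(state& s, size_t source_idx) {
    s.per_source_num_evts[source_idx].num_evts.fetch_add(1, std::memory_order_relaxed);
}

// Scrape side: aggregate lazily, only when /metrics is actually hit.
uint64_t total_num_evts(const state& s) {
    uint64_t total = 0;
    for(const auto& c : s.per_source_num_evts) {
        total += c.num_evts.load(std::memory_order_relaxed);
    }
    return total;
}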

Bench source + raw output available if useful.

Per @ekoops's perf concern on falcosecurity#3867, replace the per-event
`s.num_evts.fetch_add(1, relaxed)` (lock xadd on x86 — measured
~1.6 ns single-threaded, ~4.4 ns under 2-thread contention) with a
batched fetch_add(1024) every 1024 events.

The residual count is flushed by process_inspector_events once
the loop returns, so the published total stays accurate within
1023 events between scrapes — well below typical Prometheus
intervals.

Measured overhead per event (microbench, x86_64):
  per-event:    1.67 ns single, 4.74 ns @ 4 threads
  batched 1024: 0.22 ns single, 0.12 ns @ 4 threads (~39x cheaper)

Signed-off-by: Chris (ChrisJr404) <11917633+ChrisJr404@users.noreply.github.com>
poiana added size/M and removed size/S labels May 6, 2026
Author

ChrisJr404 commented May 6, 2026

Fair, fetch_add(1, relaxed) is a lock xadd on x86 and the cost adds up quickly when multiple sources are running.

Pushed 7c6eb7e. Each per-source loop does a non-atomic num_evts++ and only batches into the global atomic with fetch_add(1024, relaxed) once it hits a 1024 boundary. The leftover gets flushed in process_inspector_events after do_inspect returns, so worst-case staleness between scrapes is around 1023 events per source.

Quick microbench I wrote to sanity check (200M events/thread, gcc 13 -O2, single shared atomic):

threads   per-event       batched         speedup
1         1.67 ns/evt     0.22 ns/evt     7.5x
2         4.36 ns/evt     0.12 ns/evt     35x
4         4.74 ns/evt     0.12 ns/evt     39x
8         4.85 ns/evt     0.08 ns/evt     58x

So at 1M events/sec/source the per-event version was eating ~4 ms/sec on the cache line; batched drops to ~0.1 ms/sec. Hot path is now just num_evts++ plus an and + jne. Happy to paste the bench source if you want to repro on your own hardware.
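
For reference, a self-contained sketch of the batched pattern as described (simplified; in the actual commit the flush lives in process_inspector_events after do_inspect returns, and the global stands in for state.num_evts):

#include <atomic>
#include <cstdint>

// Power of two, so the boundary test below compiles to a single AND.
static constexpr uint64_t NUM_EVTS_PUBLISH_BATCH = 1024;
static_assert((NUM_EVTS_PUBLISH_BATCH & (NUM_EVTS_PUBLISH_BATCH - 1)) == 0);

std::atomic<uint64_t> g_num_evts{0};  // stands in for state.num_evts

// Per-event hot path: plain increment, batch-publish on the batch boundary.
inline void count_event(uint64_t& local_num_evts) {
    local_num_evts++;  // non-atomic per-source counter
    if((local_num_evts & (NUM_EVTS_PUBLISH_BATCH - 1)) == 0) {
        g_num_evts.fetch_add(NUM_EVTS_PUBLISH_BATCH, std::memory_order_relaxed);
    }
}

// After the event loop returns: publish the residue, so the global total is
// exact as of the flush.
inline void flush_residue(uint64_t local_num_evts) {
    g_num_evts.fetch_add(local_num_evts & (NUM_EVTS_PUBLISH_BATCH - 1),
                         std::memory_order_relaxed);
}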

Contributor

ekoops left a comment


Overall looks good to me. Could you please reduce the extent of the code comments? Moreover, could you please rewrite the commit titles to follow conventional commit guidelines and squash them into a single commit? As a last note, I would avoid mentioning 1024 in the comments, as it can easily desync from the value of NUM_EVTS_PUBLISH_BATCH.

// Batch size used to publish the per-source event count into the global
// state.num_evts counter (see #3584). Must be a power of two so the
// hot-path predicate compiles to a single AND.
static constexpr uint64_t NUM_EVTS_PUBLISH_BATCH = 1024;
Contributor


Since you are relying on NUM_EVTS_PUBLISH_BATCH being a power of two, I would add a static check:

Suggested change
static constexpr uint64_t NUM_EVTS_PUBLISH_BATCH = 1024;
static_assert((NUM_EVTS_PUBLISH_BATCH & (NUM_EVTS_PUBLISH_BATCH - 1)) == 0);

github-project-automation bot moved this from Todo to In progress in Falco Roadmap May 7, 2026

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

[Prometheus metrics gaps] num_evts metric still missing in the Prometheus output

3 participants