Commit 6daacc9

tests/multi-server: embed proto_load in kafka-producer1, add reconnect suite
Topology simplification: move the proto_load listener directly into
kafka-producer1's virtual server, so generated Access-Requests flow straight
into `recv Access-Request` -> kafka.produce without going over the wire. One
fewer container, one fewer RADIUS hop, and the test still exercises exactly
the produce path end-to-end.

Changes:

* environments/kafka.yml.j2
  - Drop the load-generator service.
  - Feed the proto_load profile (start_pps / max_pps / duration / step /
    parallel / max_backlog / num_messages) to kafka-producer1 via env vars;
    Jinja pulls them from the test's loadgen: block.
  - Re-declare TEST_PROJECT_NAME / TEST_SUBNET inline on kafka-producer1
    because YAML's <<: anchor merge doesn't union nested dicts - a
    service-level environment: replaces the one inherited from
    x-common-config.
  - New `loadgen_num_messages` knob, defaulting to `expected_messages`, so
    tests that expect loss (reconnect) can generate more than the consumer
    will count.

* configs/freeradius/kafka-producer1/radiusd.conf.j2
  - Add `listen load { handler = load; transport = step; step { ... } }`
    inside the existing kafka-producer server.

* configs/freeradius/kafka-producer1/load-generator-packets/packet.conf
  - Default Access-Request packet skeleton proto_load sends.

* tests/kafka-produce/{short.ci,heavy}.test.yml + template.yml.j2
  - Collapse to a single state that waits for kafka-consumer-summary. No
    more two-phase load-gen orchestration; proto_load fires on freeradius
    startup and finishes long before the summary arrives.

* tests/kafka-produce-reconnect/
  - New suite exercising broker disconnect / reconnect. Applies 100% packet
    loss on kafka-producer1's egress mid-stream (packet_loss action from
    the framework's NetworkEvents), holds for `outage_seconds`, then
    removes it. Queued produces inside librdkafka drain after reconnect,
    request threads that yielded waiting on their delivery reports resume,
    and the consumer eventually sees >= expected_messages on the topic.
1 parent 0846a25 commit 6daacc9

9 files changed

Lines changed: 240 additions & 133 deletions

src/tests/multi-server/configs/freeradius/kafka-producer1/load-generator-packets/packet.conf

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+User-Name = "testuser"
+User-Password = "testpass"
+Calling-Station-ID = "F1-F2-F3-F4-F5-F6"

src/tests/multi-server/configs/freeradius/kafka-producer1/radiusd.conf.j2

Lines changed: 32 additions & 0 deletions
@@ -87,6 +87,38 @@ server kafka-producer {
 
 	namespace = radius
 
+	#
+	# proto_load based synthetic Access-Request generator. Lives in
+	# the same virtual server as the rlm_kafka call path so generated
+	# packets flow straight into `recv Access-Request` without going
+	# over the wire - no separate load-generator container, no
+	# inter-container RADIUS hop.
+	#
+	# Profile parameters come from the per-test env vars set on this
+	# container by the compose file (which pulls them from the test's
+	# loadgen: {} block).
+	#
+	listen load {
+		handler = load
+		type = Access-Request
+		transport = step
+
+		step {
+			filename = ${confdir}/load-generator-packets/packet.conf
+
+			max_attributes = 64
+
+			start_pps = $ENV{TEST_LOADGEN_START_PPS}
+			max_pps = $ENV{TEST_LOADGEN_MAX_PPS}
+			duration = $ENV{TEST_LOADGEN_DURATION}
+			step = $ENV{TEST_LOADGEN_STEP}
+			max_backlog = $ENV{TEST_LOADGEN_MAX_BACKLOG}
+			parallel = $ENV{TEST_LOADGEN_PARALLEL}
+			num_messages = $ENV{TEST_LOADGEN_NUM_MESSAGES}
+			repeat = no
+		}
+	}
+
 	listen authentication {
 		type = Access-Request
 		type = Status-Server

src/tests/multi-server/environments/kafka.yml.j2

Lines changed: 60 additions & 41 deletions
@@ -1,25 +1,24 @@
 # ---------------------------------------------------------------
 # Docker Compose Test Environment:
 #
-#                      Access-Request
-#  load-generator -----------------> kafka-producer1
-#                                        |
-#                                        | kafka.produce
-#                                        v
-#                                  kafka (redpanda)
-#                                        |
-#                                        | consume
-#                                        v
-#                                   kafka-consumer
-#                                   (echoes each
-#                                    message back
-#                                    to the test
-#                                    framework)
+#                 kafka-producer1 (rlm_kafka)
+#                     ^           |
+#                     |           | kafka.produce
+#                     |           v
+#   proto_load -------+     kafka (broker)
+#   (in-process                |
+#    listener in the           | consume
+#    same virtual server)      v
+#                       kafka-consumer
+#               (one listener line per received
+#                message + a summary; the test
+#                framework reads those to verify)
 #
-# Each Access-Request triggers exactly one produce. kafka-consumer
-# reads the topic and writes one listener line per received message
-# plus a final summary line; the test framework verifies the count
-# matches the number of packets sent.
+# proto_load runs inside the kafka-producer1 virtual server and
+# starts generating Access-Requests on freeradius startup. Each
+# Access-Request triggers exactly one kafka.produce. Keeping load
+# generation in-process means no separate load-generator service
+# and no inter-container RADIUS hop.
 # ---------------------------------------------------------------
 x-common-config: &id001
   cap_add:
@@ -76,14 +75,56 @@ services:
         condition: service_healthy
     volumes:
       - ${DATA_PATH}/freeradius/kafka-producer1/radiusd.conf:/etc/raddb/radiusd.conf
+      - ${DATA_PATH}/freeradius/kafka-producer1/load-generator-packets/:/etc/raddb/load-generator-packets/
+      - ${DATA_PATH}/freeradius/env-setup.sh:/tmp/env-setup.sh
      - ${LISTENER_DIR}/:/var/run/multi-server/
    restart: unless-stopped
+    environment:
+      #
+      # YAML's `<<:` anchor merge doesn't union nested dicts, so
+      # declaring an `environment:` block on the service replaces
+      # the one inherited from x-common-config. Re-include the
+      # shared vars here.
+      #
+      TEST_PROJECT_NAME: ${COMPOSE_PROJECT_NAME}
+      TEST_SUBNET: {{ test_subnet | default('172.16.0.0/12') }}
+      #
+      # proto_load profile - read by the `listen load { step { ... } }`
+      # block in this container's radiusd.conf. The test's loadgen:
+      # dict in its .test.yml feeds these.
+      #
+      TEST_LOADGEN_START_PPS: "{{ loadgen.start_pps }}"
+      TEST_LOADGEN_MAX_PPS: "{{ loadgen.max_pps }}"
+      TEST_LOADGEN_DURATION: "{{ loadgen.duration }}"
+      TEST_LOADGEN_STEP: "{{ loadgen.step }}"
+      TEST_LOADGEN_MAX_BACKLOG: "{{ loadgen.max_backlog }}"
+      TEST_LOADGEN_PARALLEL: "{{ loadgen.parallel }}"
+      #
+      # Hard cap on packets emitted. Tests that want the emit
+      # count to equal the consume count default this to
+      # `expected_messages`; tests where some loss is expected
+      # (e.g. the reconnect test, where a brief outage can fail a
+      # small fraction of in-flight produces) override with a
+      # larger `loadgen_num_messages`.
+      #
+      TEST_LOADGEN_NUM_MESSAGES: "{{ loadgen_num_messages | default(expected_messages) }}"
    healthcheck:
      test: ["CMD-SHELL", "echo 'Message-Authenticator = 0x00' | radclient localhost:1812 status testing123"]
      interval: 2s
      timeout: 5s
      retries: 10
-      start_period: 30s
+      start_period: 45s
+    # The env-setup source installs iproute2 so tests that need to
+    # apply netem qdiscs (e.g. the reconnect test's packet_loss
+    # action) have `tc` available inside the container. Harmless
+    # no-op for tests that don't.
+    entrypoint:
+      - bash
+      - -c
+      - |
+        source /tmp/env-setup.sh && \
+        exec /docker-entrypoint.sh "$@"
+      - --
    command: ["freeradius", "-f", "-l", "stdout"]
    <<: *id001
 
@@ -114,25 +155,3 @@ services:
      - /usr/local/bin/consume.sh
    restart: "no"
    <<: *id001
-
-  load-generator:
-    image: freeradius-build:latest
-    depends_on:
-      kafka-producer1:
-        condition: service_healthy
-      kafka-consumer:
-        condition: service_started
-    volumes:
-      - ${DATA_PATH}/freeradius/load-generator/template.d/load-generator-templates:/etc/raddb/template.d/load-generator-templates
-      - ${DATA_PATH}/freeradius/load-generator/mods-config/files/authorize:/etc/raddb/mods-config/files/authorize
-      - ${DATA_PATH}/freeradius/load-generator/radiusd.conf:/etc/raddb/radiusd.conf
-      - ${DATA_PATH}/freeradius/load-generator/load-generator-packets/:/etc/raddb/load-generator-packets/
-      - ${LISTENER_DIR}/:/var/run/multi-server/
-    entrypoint:
-      - bash
-      - -lc
-      - |
-        # Keep the container alive. The test framework starts FreeRADIUS
-        # and runs commands via 'docker exec' so it can control timing.
-        sleep infinity
-    <<: *id001
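The `<<:` pitfall called out in the environment comment can be modeled with plain dicts. This is a sketch of YAML merge-key semantics in general, not of compose internals; the `common` / `service` names and their values are illustrative:

```python
# YAML's `<<:` merge key is a shallow, top-level merge: keys declared
# explicitly on the service replace anchor keys wholesale, so nested
# mappings like `environment:` are swapped out, never unioned.
common = {  # stands in for the x-common-config anchor (illustrative values)
    "cap_add": ["NET_ADMIN"],
    "environment": {"TEST_PROJECT_NAME": "proj", "TEST_SUBNET": "172.16.0.0/12"},
}
service = {  # explicit keys declared on kafka-producer1
    "environment": {"TEST_LOADGEN_START_PPS": "10"},
}

merged = {**common, **service}  # explicit keys win over `<<:`-merged ones

# The inherited environment mapping is gone entirely, which is why the
# compose file re-declares TEST_PROJECT_NAME / TEST_SUBNET inline.
assert "TEST_PROJECT_NAME" not in merged["environment"]
assert merged["cap_add"] == ["NET_ADMIN"]
```

The same shallow-replace rule is why no amount of anchor restructuring would let the service add one env var while keeping the shared ones.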
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+../../environments/kafka.yml.j2
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+#
+# Broker disconnect / reconnect round-trip test.
+#
+# proto_load on kafka-producer1 generates Access-Requests steadily
+# for `loadgen.duration` seconds. Partway through that window we
+# cut the producer's network to the broker with 100% packet loss,
+# hold for `outage_seconds`, then restore. Messages produced
+# during the outage queue inside librdkafka; their request threads
+# yield waiting for delivery reports. When the link is restored
+# librdkafka reconnects, drains its queue, delivery reports fire
+# and yielded requests resume.
+#
+# Pass criterion: consumer reports PASS with
+# received == expected == loadgen total.
+#
+listener_type: file
+
+#
+# proto_load profile. max_pps has a floor of 10 internally, so
+# the effective steady-state rate is 10 pps. Duration must be
+# long enough that proto_load is still generating by the time the
+# framework finishes compose-up and reaches state_1 (CI DinD adds
+# ~40-50s of startup before the first state runs).
+#
+loadgen:
+  start_pps: 10
+  max_pps: 10
+  duration: 30
+  step: 10
+  parallel: 1
+  max_backlog: 1000
+
+#
+# proto_load overshoots slightly - the `num_messages` cap stops
+# emission, but a handful of requests already in-flight in the
+# worker pool finish after that. Set num_messages well above
+# expected_messages so the consumer always has enough to count.
+#
+# expected_messages is what the consumer (kcat -c N) stops at,
+# so the kafka-consumer-summary line declares PASS as long as at
+# least `expected_messages` made it through the disconnect /
+# reconnect cycle.
+#
+loadgen_num_messages: 250
+expected_messages: 200
+
+kafka_topic: fr-multi-server-reconnect-test
+
+# How long to hold the outage. Shorter than librdkafka's default
+# message.timeout.ms (5 min) so queued produces don't fail before
+# recovery.
+outage_seconds: 3
+
+# Timeouts sized for the self-hosted CI DinD runners. test_timeout
+# must cover compose up + state_1 outage + state_2 wait-for-summary.
+test_timeout: 240
+test_verify_timeout: 120
+consumer_timeout: 180
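The knobs above leave a deliberate gap between what is emitted and what must arrive. A back-of-the-envelope check, using the values from this profile (a sketch of the arithmetic, assuming proto_load holds its steady 10 pps for the full duration):

```python
# Values copied from the reconnect test profile above.
start_pps, duration = 10, 30
loadgen_num_messages, expected_messages = 250, 200

offered = start_pps * duration                # 300 requests over the window
emitted = min(offered, loadgen_num_messages)  # the num_messages cap stops emission first
loss_budget = emitted - expected_messages     # produces allowed to fail during the outage

assert offered > loadgen_num_messages  # the cap, not the duration, bounds emission
assert loss_budget > 0                 # PASS tolerates some outage-window failures
print(emitted, loss_budget)            # 250 50
```

So even if a handful of in-flight produces die when the link drops, the consumer still reaches its 200-message target.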
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+timeout: {{ test_timeout }}
+state_order: sequence
+states:
+
+  #
+  # proto_load on kafka-producer1 is generating Access-Requests
+  # continuously for `loadgen.duration` seconds. Drop the
+  # producer's network to the broker in the middle of that window:
+  # packets produced during the outage queue inside librdkafka and
+  # their request threads yield waiting for delivery reports.
+  #
+  # The verify timeout on this state is how long we hold the outage.
+  #
+  state_1:
+    description: >
+      Apply 100% packet loss on the producer while proto_load is
+      in flight, and hold the outage for {{ outage_seconds }}s.
+    host:
+      kafka-producer1:
+        actions:
+          - packet_loss:
+              interface: eth0
+              loss: 100
+    verify:
+      timeout: {{ outage_seconds }}
+      trigger_mode: unordered
+
+  #
+  # Remove the packet loss. librdkafka reconnects, drains its
+  # queue, delivery reports fire for the queued produces, yielded
+  # request threads resume, and the consumer eventually sees every
+  # message. Pass criterion: the consumer summary reports PASS
+  # with received == expected == full count.
+  #
+  state_2:
+    description: >
+      Remove the packet loss and wait for the consumer to report
+      PASS with received == expected == {{ expected_messages }}.
+    host:
+      kafka-producer1:
+        actions:
+          - packet_loss:
+              interface: eth0
+              loss: 0
+    verify:
+      timeout: {{ test_verify_timeout }}
+      trigger_mode: unordered
+      triggers:
+        - kafka-consumer-summary:
+            json:
+              result:
+                pattern:
+                  reg_pattern: PASS
+              expected:
+                pattern:
+                  reg_pattern: "^{{ expected_messages }}$"
+              received:
+                pattern:
+                  reg_pattern: "^{{ expected_messages }}$"
src/tests/multi-server/tests/kafka-produce/heavy.test.yml

Lines changed: 7 additions & 17 deletions

@@ -1,38 +1,28 @@
 #
-# Heavy stress variant. Ramps PPS aggressively across four 2s steps:
-#
-#    500 -> 1000 -> 1500 -> 2000 pps = 10,000 requests in ~8s
-#
-# That's enough concurrent work to exercise every worker thread and
-# drive many back-to-back delivery reports through rlm_kafka's self-pipe,
-# surfacing races and queue-pressure bugs the short sanity test can't.
+# Heavy stress variant. Ramps PPS aggressively to exercise every
+# worker thread and drive many back-to-back delivery reports
+# through rlm_kafka's shared producer + per-worker mailbox path.
 #
 # Not tagged *.ci.test.yml - developers run this locally via
 # `make test.multi-server.kafka-produce.heavy`, or via the full
-# `make test.multi-server` sweep, but it isn't in `test.multi-server.ci`.
+# `make test.multi-server` sweep.
 #
 listener_type: file
 
-load_gen_num_of_dst_servers: 1
-load_gen_dst_server_name: kafka-producer
-
 loadgen:
   start_pps: 500
   max_pps: 2000
   duration: 2
   step: 500
   parallel: 4
   max_backlog: 20000
-  repeat: "no"
 
 # 2 * (500 + 1000 + 1500 + 2000) = 10000
 expected_messages: 10000
 
 kafka_topic: fr-multi-server-test
 
-# State_1 needs ~8s load-gen + enough time for a single-broker redpanda +
-# kcat to drain 10k messages. On macOS Docker Desktop this isn't instant;
-# Linux CI should comfortably finish well inside these budgets.
+# Generous budget for the 10k burst + broker drain on CI.
 test_timeout: 360
-test_verify_timeout: 150
-consumer_timeout: 180
+test_verify_timeout: 300
+consumer_timeout: 300
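The `2 * (500 + 1000 + 1500 + 2000) = 10000` comment generalizes to "sum over each pps step of duration * pps". A small sketch of that arithmetic, assuming proto_load's step transport holds each rate for `duration` seconds while stepping from start_pps to max_pps (an assumption about the ramp shape, not taken from proto_load's source):

```python
def total_requests(start_pps: int, max_pps: int, step: int, duration: int) -> int:
    """Total packets a stepped ramp emits: `duration` seconds at each pps level."""
    total, pps = 0, start_pps
    while pps <= max_pps:
        total += pps * duration  # this step's contribution
        pps += step
    return total

# heavy profile: four 2s steps, 500 -> 1000 -> 1500 -> 2000 pps
print(total_requests(500, 2000, 500, 2))  # 10000

# short.ci profile: a single 5 pps step held for 4s
print(total_requests(5, 5, 5, 4))         # 20
```

Both results match the `expected_messages` values the tests pin, which is the invariant these profiles rely on.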
src/tests/multi-server/tests/kafka-produce/short.ci.test.yml

Lines changed: 12 additions & 14 deletions

@@ -1,31 +1,29 @@
 listener_type: file
 
-# Routing (load-generator sends to kafka-producer)
-load_gen_num_of_dst_servers: 1
-load_gen_dst_server_name: kafka-producer
-
-# Load generator profile. A modest burst that's enough to exercise
-# delivery reports without stressing CI timing budgets.
+#
+# proto_load profile for the kafka-producer1 container's built-in
+# load generator. Modest burst - enough to exercise delivery
+# reports without stressing CI timing budgets.
+#
 loadgen:
   start_pps: 5
   max_pps: 5
   duration: 4
   step: 5
   parallel: 1
   max_backlog: 1000
-  repeat: "no"
 
-# Total Access-Requests the load-generator will emit. Keep in sync with
-# loadgen above: start_pps * duration when start == max.
+#
+# Expected message count at the consumer. Must equal the total
+# proto_load emits: sum over each pps step of (duration * pps).
+#
 expected_messages: 20
 
 kafka_topic: fr-multi-server-test
 
-# Test framework timeouts. The whole test has to fit inside test_timeout;
-# each state waits `test_verify_timeout` for its triggers. Values are
-# sized for the self-hosted CI DinD runners, which are substantially
-# slower than local Docker Desktop (JVM broker startup alone eats
-# ~30s through the healthcheck).
+# Timeouts sized for the self-hosted CI DinD runners - JVM kafka
+# startup through the healthcheck + proto_load burst + consumer
+# drain all have to fit inside `test_verify_timeout`.
 test_timeout: 120
 test_verify_timeout: 60
 consumer_timeout: 90
