feat: add cse timings test to catch regression with cached and full package installation #8284
Conversation
Pull request overview
This PR addresses a CSE bootstrap latency regression on Ubuntu golden-image VHDs by avoiding first-boot apt initialization/lock contention, and adds E2E timing checks to detect future regressions.
Changes:
- Switch cached-package installation and walinuxagent hold/unhold operations to dpkg-based fast paths (with apt fallbacks).
- Export `FULL_INSTALL_REQUIRED` so it's available in subshell contexts.
- Add E2E CSE timing extraction + two Ubuntu 22.04 performance scenarios with thresholds.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| parts/linux/cloud-init/artifacts/ubuntu/cse_helpers_ubuntu.sh | Use dpkg --set-selections for walinuxagent hold/unhold and dpkg -i for cached .deb installs; add wait_for_dpkg_lock. |
| parts/linux/cloud-init/artifacts/cse_main.sh | Export FULL_INSTALL_REQUIRED after determining install mode. |
| e2e/cse_timing.go | New helper to extract CSE task durations from Guest Agent event JSON and validate against thresholds. |
| e2e/scenario_cse_perf_test.go | New Ubuntu 22.04 cached/full-install performance scenarios enforcing timing thresholds. |
| .github/workflows/pr-lint.yaml | Bump actions/github-script from v8 to v9. |
```shell
echo "walinuxagent hold" | dpkg --set-selections 2>/dev/null || \
    { retrycmd_if_failure 120 5 25 apt-mark hold walinuxagent || exit $ERR_HOLD_WALINUXAGENT; }
elif [ "$1" = "unhold" ]; then
    exit $ERR_RELEASE_HOLD_WALINUXAGENT
    echo "walinuxagent install" | dpkg --set-selections 2>/dev/null || \
    { retrycmd_if_failure 120 5 25 apt-mark unhold walinuxagent || exit $ERR_RELEASE_HOLD_WALINUXAGENT; }
```
In the fallback path, apt-mark hold/unhold is invoked without calling wait_for_apt_locks. If dpkg --set-selections fails (e.g., transient dpkg state), the fallback can race with other apt/dpkg users and reintroduce lock contention/failures. Consider calling wait_for_apt_locks before the apt-mark retry, or reusing the existing wait_for_apt_locks helper for the fallback path.
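A minimal sketch of this suggestion, with `dpkg`, `wait_for_apt_locks`, and an `apt_mark` stand-in for `apt-mark` stubbed so the control flow runs standalone (the real helpers live in `cse_helpers_ubuntu.sh`; the stub behaviors are assumptions for illustration only):

```shell
# Stubs: simulate dpkg --set-selections failing so the fallback path runs.
dpkg() { return 1; }
apt_mark() { echo "apt-mark $*"; }                     # stand-in for apt-mark
wait_for_apt_locks() { echo "waited for apt locks"; }  # stand-in for the PR helper

hold_walinuxagent() {
    echo "walinuxagent hold" | dpkg --set-selections 2>/dev/null || {
        wait_for_apt_locks                  # serialize with other apt/dpkg users first
        apt_mark hold walinuxagent || return 1
    }
}
hold_walinuxagent
```

With the stubs above, the fallback waits for the locks before invoking the hold, instead of racing other apt/dpkg users.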
```shell
retrycmd_if_failure 10 5 120 dpkg -i --force-confold ${DEB_FILE}
if [ "$?" -eq 0 ]; then
    return 0
fi
echo "dpkg -i failed for ${DEB_FILE}, falling back to apt-get install"
fi
wait_for_apt_locks
retrycmd_if_failure 10 5 600 apt-get -y -f install ${DEB_FILE} --allow-downgrades
```
The dpkg fast-path retries dpkg -i up to 10 times with 5s sleeps (retrycmd_if_failure 10 5 120 ...). If the failure is due to missing dependencies (common dpkg -i failure mode), this adds ~45s of avoidable delay before falling back to apt. Suggest running dpkg -i once (or with very small retry budget) and falling back immediately on failure; also quote ${DEB_FILE} to avoid word-splitting/globbing.
Suggested change:
```diff
-retrycmd_if_failure 10 5 120 dpkg -i --force-confold ${DEB_FILE}
+dpkg -i --force-confold "${DEB_FILE}"
 if [ "$?" -eq 0 ]; then
     return 0
 fi
 echo "dpkg -i failed for ${DEB_FILE}, falling back to apt-get install"
 fi
 wait_for_apt_locks
-retrycmd_if_failure 10 5 600 apt-get -y -f install ${DEB_FILE} --allow-downgrades
+retrycmd_if_failure 10 5 600 apt-get -y -f install "${DEB_FILE}" --allow-downgrades
```
```go
VHD: config.VHDUbuntu2204Gen2Containerd,
BootstrapConfigMutator: func(nbc *datamodel.NodeBootstrappingConfiguration) {
},
VMConfigMutator: func(vmss *armcompute.VirtualMachineScaleSet) {
```
VMConfigMutator writes into vmss.Tags[...] without ensuring the Tags map is initialized. PrepareVMSSModel does not create the map, so this can panic at runtime. Follow the pattern used elsewhere in e2e (e.g., scenario_test.go) by initializing vmss.Tags = map[string]*string{} when nil before setting the tag.
Suggested change:
```diff
 VMConfigMutator: func(vmss *armcompute.VirtualMachineScaleSet) {
+    if vmss.Tags == nil {
+        vmss.Tags = map[string]*string{}
+    }
```
```go
report, err := ExtractCSETimings(ctx, s)
if err != nil {
    s.T.Fatalf("failed to extract CSE timings: %v", err)
}

// Always log the full timing report
report.LogReport(ctx, s.T)

// Validate total CSE duration
if thresholds.TotalCSEThreshold > 0 {
    totalDuration := report.TotalCSEDuration()
    if totalDuration > thresholds.TotalCSEThreshold {
        toolkit.LogDuration(ctx, totalDuration, thresholds.TotalCSEThreshold,
            fmt.Sprintf("CSE total duration %s exceeds threshold %s", totalDuration, thresholds.TotalCSEThreshold))
        s.T.Errorf("CSE total duration %s exceeds threshold %s", totalDuration, thresholds.TotalCSEThreshold)
    }
}
```
ValidateCSETimings can silently pass even if timing extraction/parsing failed: if ExtractCSETimings returns an empty Tasks slice (e.g., no events, parse failures), total duration is 0 and no thresholds are checked. This makes the regression-detection test flaky/ineffective. Consider failing the test when no tasks were parsed and/or when the expected "AKS.CSE.cse_start" task (and any task suffixes in thresholds.TaskThresholds) are missing.
```go
// Each line is a separate JSON object (one per event file)
lines := strings.Split(strings.TrimSpace(result.stdout), "\n")
for _, line := range lines {
    line = strings.TrimSpace(line)
    if line == "" {
        continue
    }

    var event cseEventJSON
    if err := json.Unmarshal([]byte(line), &event); err != nil {
        continue
    }
    if event.TaskName == "" || event.Timestamp == "" || event.OperationId == "" {
        continue
    }

    startTime, err := parseCSETimestamp(event.Timestamp)
    if err != nil {
        continue
    }
    endTime, err := parseCSETimestamp(event.OperationId)
    if err != nil {
        continue
    }
```
ExtractCSETimings currently ignores JSON unmarshal and timestamp parse errors (continue), which can hide issues (format drift, unexpected files in the directory) and result in partial/empty reports. For a regression test, it’s safer to surface these as errors (or at least count/Logf them and fail if error rate is non-zero) so timing validation doesn’t silently degrade over time.
Force-pushed a7b9d57 to 6f86252
```shell
{ retrycmd_if_failure 120 5 25 apt-mark hold walinuxagent || exit $ERR_HOLD_WALINUXAGENT; }
elif [ "$1" = "unhold" ]; then
    exit $ERR_RELEASE_HOLD_WALINUXAGENT
    echo "walinuxagent install" | dpkg --set-selections 2>/dev/null || \
```
Isn’t this the same comment for hold and unhold?
```shell
wait_for_dpkg_lock() {
    while fuser /var/lib/dpkg/lock /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do
        echo 'Waiting for release of dpkg locks'
        sleep 3
```
This seems like a very long wait; it will add too much CSE latency.
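One way to bound that latency, sketched here as a hypothetical variant of `wait_for_dpkg_lock` (the timeout budget, the 1s poll interval, and the stubbed `fuser` that reports the lock busy twice then free are all assumptions, not part of the PR):

```shell
# Stub: lock is busy for the first two polls, then free.
polls=0
fuser() { polls=$((polls + 1)); [ "$polls" -le 2 ]; }

wait_for_dpkg_lock() {
    timeout=${1:-30}   # hypothetical total wait budget in seconds
    waited=0
    while fuser /var/lib/dpkg/lock /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for dpkg locks after ${timeout}s"
            return 1
        fi
        echo 'Waiting for release of dpkg locks'
        sleep 1        # shorter poll interval than the PR's 3s
        waited=$((waited + 1))
    done
    return 0
}
wait_for_dpkg_lock 30 && echo "locks free"
```

Capping the total wait means a stuck lock holder fails the step loudly instead of stalling CSE indefinitely.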
```shell
    fi
}

wait_for_dpkg_lock() {
```
Can we ensure we log an event tracking every call site where we wait for a lock?
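A sketch of what that could look like: the caller passes its own name, and the wait duration is emitted on exit. `logger_event` is a hypothetical stand-in for the provisioning event logger, and `fuser` is stubbed (busy once, then free) so the example runs anywhere:

```shell
# Stubs (assumptions): real code would use the actual fuser and event sink.
busy=1
fuser() { if [ "$busy" -eq 1 ]; then busy=0; return 0; else return 1; fi; }
logger_event() { echo "EVENT: $*"; }   # hypothetical event sink

wait_for_dpkg_lock() {
    caller=${1:-unknown}
    start=$(date +%s)
    while fuser /var/lib/dpkg/lock /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do
        echo 'Waiting for release of dpkg locks'
        sleep 1
    done
    # Tag the event with the caller so contention is attributable in telemetry.
    logger_event "dpkg-lock-wait caller=${caller} waited=$(( $(date +%s) - start ))s"
}
wait_for_dpkg_lock "installDebPackageFromFile"
```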
```shell
    echo "Failed to get the install mode and cleanup container images"
    exit "$ERR_CLEANUP_CONTAINER_IMAGES"
fi
export FULL_INSTALL_REQUIRED
```
Why are you exporting this var? It's unclear for reviewers.
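The PR overview says the export makes the variable visible in subshell contexts. A standalone illustration of why that matters (the variable name matches the PR; the demo values are made up): an unexported shell variable is not inherited by child processes, so a helper invoked as a separate process would see it as unset.

```shell
FULL_INSTALL_REQUIRED=false          # set but NOT exported
sh -c 'echo "child sees: ${FULL_INSTALL_REQUIRED:-unset}"'   # → child sees: unset

export FULL_INSTALL_REQUIRED         # after export, child processes inherit it
sh -c 'echo "child sees: ${FULL_INSTALL_REQUIRED:-unset}"'   # → child sees: false
```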
…k contention

Replace apt-get/apt-mark with dpkg equivalents for cached (golden image) VHDs:

- installDebPackageFromFile: use 'dpkg -i' instead of 'apt-get -f install' when FULL_INSTALL_REQUIRED=false. The first apt-get call after VHD boot triggers expensive apt initialization (~30-50s) and holds dpkg locks. dpkg -i bypasses apt entirely and completes in <100ms for pre-cached packages.
- aptmarkWALinuxAgent: use 'dpkg --set-selections' instead of 'apt-mark hold/unhold'. apt-mark internally initializes the apt cache even for simple hold operations, causing 10-30s delays on first invocation.

Both functions include fallback to the original apt-based commands if dpkg fails.

Add CSE timing regression detection framework:

- e2e/cse_timing.go: Extracts CSE task timings from VM event JSON files
- e2e/scenario_cse_perf_test.go: Two perf scenarios (cached + full install paths)

E2E test results with fix (Ubuntu 22.04 golden image):

- aptmarkWALinuxAgent: 21.7s → 0.06s (360x faster)
- kubelet deb install: 47.6s → 0.04s (1190x faster)
- Total CSE time: 19.65s (down from 47s+ in production)

Root cause: PR #8105 removed installContainerRuntime's natural ~10s delay that masked the first-apt-call initialization cost. This fix addresses the underlying issue instead of re-introducing the delay.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed 6f86252 to 0a1d334
```go
// CSE performance thresholds for the golden image (cached) path.
// These represent the expected normal performance when all binaries are pre-cached on the VHD.
// If any of these are exceeded, it indicates a regression in CSE task ordering or apt lock contention.
var cachedCSEThresholds = CSETimingThresholds{
    TotalCSEThreshold: 60 * time.Second,
    TaskThresholds: map[string]time.Duration{
        "installDebPackageFromFile":  25 * time.Second,
        "aptmarkWALinuxAgent":        10 * time.Second,
        "configureKubeletAndKubectl": 30 * time.Second,
        "ensureContainerd":           15 * time.Second,
    },
```
PR description claims the CSE lock contention regression is fixed by switching aptmarkWALinuxAgent to dpkg --set-selections and exporting FULL_INSTALL_REQUIRED, but in the current tree aptmarkWALinuxAgent still uses apt-mark and FULL_INSTALL_REQUIRED is not exported. If those script changes are intended in this PR, they appear to be missing, which would leave the underlying regression unfixed even though new perf tests were added.
Summary
Add E2E tests for catching regressions in CSE timings.
Root Cause
PR #8105 optimized CSE by skipping `installContainerRuntime` on golden image VHDs (~80% of Ubuntu 22.04 nodes). This removed a natural ~10s delay that masked a critical issue: the first `apt-get` or `apt-mark` call after VHD boot triggers expensive apt initialization (~30-50s) and holds dpkg/apt locks, blocking all subsequent apt operations.

Without the `installContainerRuntime` delay, `installKubeletKubectlFromPkg` became the first apt call and took 47.6s p50 (up from ~10s baseline).

E2E Test Results
With fix (Ubuntu 22.04 golden image, westus3, deb package path):
- `aptmarkWALinuxAgent`
- install kubelet: `installDebPackageFromFile`
- install kubectl: `installDebPackageFromFile`
- `configureKubeletAndKubectl` (total)
- `cse_start` (total)

Verified from VM provision logs:
- `dpkg --set-selections` used for walinuxagent hold ✅
- `dpkg -i --force-confold` used for kubelet deb (succeeded on first try) ✅
- `dpkg -i --force-confold` used for kubectl deb (succeeded on first try) ✅
- `FULL_INSTALL_REQUIRED=false` correctly detected and exported ✅

Testing
- `go test ./pkg/agent/...` — PASS
- `go build ./e2e/...` — PASS
- `Test_Ubuntu2204_CSE_CachedPerformance` — PASS (33.29s total CSE, using real deb package path)
- Verified `dpkg -i` and `dpkg --set-selections` from VM provision logs
- `go generate` — no changes needed