feat: add cse timings test to catch regression with cached and full package installation #8284
Conversation
Pull request overview
This PR addresses a CSE bootstrap latency regression on Ubuntu golden-image VHDs by avoiding first-boot apt initialization/lock contention, and adds E2E timing checks to detect future regressions.
Changes:
- Switch cached-package installation and walinuxagent hold/unhold operations to dpkg-based fast paths (with apt fallbacks).
- Export `FULL_INSTALL_REQUIRED` so it's available in subshell contexts.
- Add E2E CSE timing extraction + two Ubuntu 22.04 performance scenarios with thresholds.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| parts/linux/cloud-init/artifacts/ubuntu/cse_helpers_ubuntu.sh | Use dpkg --set-selections for walinuxagent hold/unhold and dpkg -i for cached .deb installs; add wait_for_dpkg_lock. |
| parts/linux/cloud-init/artifacts/cse_main.sh | Export FULL_INSTALL_REQUIRED after determining install mode. |
| e2e/cse_timing.go | New helper to extract CSE task durations from Guest Agent event JSON and validate against thresholds. |
| e2e/scenario_cse_perf_test.go | New Ubuntu 22.04 cached/full-install performance scenarios enforcing timing thresholds. |
| .github/workflows/pr-lint.yaml | Bump actions/github-script from v8 to v9. |
```shell
echo "walinuxagent hold" | dpkg --set-selections 2>/dev/null || \
    { retrycmd_if_failure 120 5 25 apt-mark hold walinuxagent || exit $ERR_HOLD_WALINUXAGENT; }
elif [ "$1" = "unhold" ]; then
    exit $ERR_RELEASE_HOLD_WALINUXAGENT
    echo "walinuxagent install" | dpkg --set-selections 2>/dev/null || \
    { retrycmd_if_failure 120 5 25 apt-mark unhold walinuxagent || exit $ERR_RELEASE_HOLD_WALINUXAGENT; }
```
In the fallback path, apt-mark hold/unhold is invoked without calling wait_for_apt_locks. If dpkg --set-selections fails (e.g., transient dpkg state), the fallback can race with other apt/dpkg users and reintroduce lock contention/failures. Consider calling wait_for_apt_locks before the apt-mark retry, or reusing the existing wait_for_apt_locks helper for the fallback path.
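A minimal sketch of this suggestion, with `dpkg`, `wait_for_apt_locks`, and an `apt_mark` stand-in for `apt-mark` stubbed so the control flow runs standalone (the real helpers live in `cse_helpers_ubuntu.sh`; the stub behaviors are assumptions for illustration only):

```shell
# Stubs: simulate dpkg --set-selections failing so the fallback path runs.
dpkg() { return 1; }
apt_mark() { echo "apt-mark $*"; }                     # stand-in for apt-mark
wait_for_apt_locks() { echo "waited for apt locks"; }  # stand-in for the PR helper

hold_walinuxagent() {
    echo "walinuxagent hold" | dpkg --set-selections 2>/dev/null || {
        wait_for_apt_locks                  # serialize with other apt/dpkg users first
        apt_mark hold walinuxagent || return 1
    }
}
hold_walinuxagent
```

With the stubs above, the fallback waits for the locks before invoking the hold, instead of racing other apt/dpkg users.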
```shell
retrycmd_if_failure 10 5 120 dpkg -i --force-confold ${DEB_FILE}
if [ "$?" -eq 0 ]; then
    return 0
fi
echo "dpkg -i failed for ${DEB_FILE}, falling back to apt-get install"
fi
wait_for_apt_locks
retrycmd_if_failure 10 5 600 apt-get -y -f install ${DEB_FILE} --allow-downgrades
```
The dpkg fast-path retries dpkg -i up to 10 times with 5s sleeps (retrycmd_if_failure 10 5 120 ...). If the failure is due to missing dependencies (common dpkg -i failure mode), this adds ~45s of avoidable delay before falling back to apt. Suggest running dpkg -i once (or with very small retry budget) and falling back immediately on failure; also quote ${DEB_FILE} to avoid word-splitting/globbing.
Suggested change:
```diff
-retrycmd_if_failure 10 5 120 dpkg -i --force-confold ${DEB_FILE}
+dpkg -i --force-confold "${DEB_FILE}"
 if [ "$?" -eq 0 ]; then
     return 0
 fi
 echo "dpkg -i failed for ${DEB_FILE}, falling back to apt-get install"
 fi
 wait_for_apt_locks
-retrycmd_if_failure 10 5 600 apt-get -y -f install ${DEB_FILE} --allow-downgrades
+retrycmd_if_failure 10 5 600 apt-get -y -f install "${DEB_FILE}" --allow-downgrades
```
```go
VHD: config.VHDUbuntu2204Gen2Containerd,
BootstrapConfigMutator: func(nbc *datamodel.NodeBootstrappingConfiguration) {
},
VMConfigMutator: func(vmss *armcompute.VirtualMachineScaleSet) {
```
VMConfigMutator writes into vmss.Tags[...] without ensuring the Tags map is initialized. PrepareVMSSModel does not create the map, so this can panic at runtime. Follow the pattern used elsewhere in e2e (e.g., scenario_test.go) by initializing vmss.Tags = map[string]*string{} when nil before setting the tag.
Suggested change:
```diff
 VMConfigMutator: func(vmss *armcompute.VirtualMachineScaleSet) {
+    if vmss.Tags == nil {
+        vmss.Tags = map[string]*string{}
+    }
```
```go
report, err := ExtractCSETimings(ctx, s)
if err != nil {
    s.T.Fatalf("failed to extract CSE timings: %v", err)
}

// Always log the full timing report
report.LogReport(ctx, s.T)

// Validate total CSE duration
if thresholds.TotalCSEThreshold > 0 {
    totalDuration := report.TotalCSEDuration()
    if totalDuration > thresholds.TotalCSEThreshold {
        toolkit.LogDuration(ctx, totalDuration, thresholds.TotalCSEThreshold,
            fmt.Sprintf("CSE total duration %s exceeds threshold %s", totalDuration, thresholds.TotalCSEThreshold))
        s.T.Errorf("CSE total duration %s exceeds threshold %s", totalDuration, thresholds.TotalCSEThreshold)
    }
}
```
ValidateCSETimings can silently pass even if timing extraction/parsing failed: if ExtractCSETimings returns an empty Tasks slice (e.g., no events, parse failures), total duration is 0 and no thresholds are checked. This makes the regression-detection test flaky/ineffective. Consider failing the test when no tasks were parsed and/or when the expected "AKS.CSE.cse_start" task (and any task suffixes in thresholds.TaskThresholds) are missing.
```go
// Each line is a separate JSON object (one per event file)
lines := strings.Split(strings.TrimSpace(result.stdout), "\n")
for _, line := range lines {
    line = strings.TrimSpace(line)
    if line == "" {
        continue
    }

    var event cseEventJSON
    if err := json.Unmarshal([]byte(line), &event); err != nil {
        continue
    }
    if event.TaskName == "" || event.Timestamp == "" || event.OperationId == "" {
        continue
    }

    startTime, err := parseCSETimestamp(event.Timestamp)
    if err != nil {
        continue
    }
    endTime, err := parseCSETimestamp(event.OperationId)
    if err != nil {
        continue
    }
```
ExtractCSETimings currently ignores JSON unmarshal and timestamp parse errors (continue), which can hide issues (format drift, unexpected files in the directory) and result in partial/empty reports. For a regression test, it’s safer to surface these as errors (or at least count/Logf them and fail if error rate is non-zero) so timing validation doesn’t silently degrade over time.
Force-pushed a7b9d57 to 6f86252
```shell
{ retrycmd_if_failure 120 5 25 apt-mark hold walinuxagent || exit $ERR_HOLD_WALINUXAGENT; }
elif [ "$1" = "unhold" ]; then
    exit $ERR_RELEASE_HOLD_WALINUXAGENT
    echo "walinuxagent install" | dpkg --set-selections 2>/dev/null || \
```
Isn’t this the same comment for hold and unhold?
```shell
wait_for_dpkg_lock() {
    while fuser /var/lib/dpkg/lock /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do
        echo 'Waiting for release of dpkg locks'
        sleep 3
```
This seems like a very long wait; it will add too much CSE latency.
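One way to bound that latency, sketched here as a hypothetical variant of `wait_for_dpkg_lock` (the timeout budget, the 1s poll interval, and the stubbed `fuser` that reports the lock busy twice then free are all assumptions, not part of the PR):

```shell
# Stub: lock is busy for the first two polls, then free.
polls=0
fuser() { polls=$((polls + 1)); [ "$polls" -le 2 ]; }

wait_for_dpkg_lock() {
    timeout=${1:-30}   # hypothetical total wait budget in seconds
    waited=0
    while fuser /var/lib/dpkg/lock /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for dpkg locks after ${timeout}s"
            return 1
        fi
        echo 'Waiting for release of dpkg locks'
        sleep 1        # shorter poll interval than the PR's 3s
        waited=$((waited + 1))
    done
    return 0
}
wait_for_dpkg_lock 30 && echo "locks free"
```

Capping the total wait means a stuck lock holder fails the step loudly instead of stalling CSE indefinitely.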
```shell
    fi
}

wait_for_dpkg_lock() {
```
Can we ensure we log an event tracking every call site where we wait for a lock?
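A sketch of what that could look like: the caller passes its own name, and the wait duration is emitted on exit. `logger_event` is a hypothetical stand-in for the provisioning event logger, and `fuser` is stubbed (busy once, then free) so the example runs anywhere:

```shell
# Stubs (assumptions): real code would use the actual fuser and event sink.
busy=1
fuser() { if [ "$busy" -eq 1 ]; then busy=0; return 0; else return 1; fi; }
logger_event() { echo "EVENT: $*"; }   # hypothetical event sink

wait_for_dpkg_lock() {
    caller=${1:-unknown}
    start=$(date +%s)
    while fuser /var/lib/dpkg/lock /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do
        echo 'Waiting for release of dpkg locks'
        sleep 1
    done
    # Tag the event with the caller so contention is attributable in telemetry.
    logger_event "dpkg-lock-wait caller=${caller} waited=$(( $(date +%s) - start ))s"
}
wait_for_dpkg_lock "installDebPackageFromFile"
```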
```shell
    echo "Failed to get the install mode and cleanup container images"
    exit "$ERR_CLEANUP_CONTAINER_IMAGES"
fi
export FULL_INSTALL_REQUIRED
```
Why are you exporting this var? It's unclear for reviewers.
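The PR overview says the export makes the variable visible in subshell contexts. A standalone illustration of why that matters (the variable name matches the PR; the demo values are made up): an unexported shell variable is not inherited by child processes, so a helper invoked as a separate process would see it as unset.

```shell
FULL_INSTALL_REQUIRED=false          # set but NOT exported
sh -c 'echo "child sees: ${FULL_INSTALL_REQUIRED:-unset}"'   # → child sees: unset

export FULL_INSTALL_REQUIRED         # after export, child processes inherit it
sh -c 'echo "child sees: ${FULL_INSTALL_REQUIRED:-unset}"'   # → child sees: false
```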
…k contention

Replace apt-get/apt-mark with dpkg equivalents for cached (golden image) VHDs:

- installDebPackageFromFile: use 'dpkg -i' instead of 'apt-get -f install' when FULL_INSTALL_REQUIRED=false. The first apt-get call after VHD boot triggers expensive apt initialization (~30-50s) and holds dpkg locks. dpkg -i bypasses apt entirely and completes in <100ms for pre-cached packages.
- aptmarkWALinuxAgent: use 'dpkg --set-selections' instead of 'apt-mark hold/unhold'. apt-mark internally initializes the apt cache even for simple hold operations, causing 10-30s delays on first invocation.

Both functions include fallback to the original apt-based commands if dpkg fails.

Add CSE timing regression detection framework:

- e2e/cse_timing.go: Extracts CSE task timings from VM event JSON files
- e2e/scenario_cse_perf_test.go: Two perf scenarios (cached + full install paths)

E2E test results with fix (Ubuntu 22.04 golden image):

- aptmarkWALinuxAgent: 21.7s → 0.06s (360x faster)
- kubelet deb install: 47.6s → 0.04s (1190x faster)
- Total CSE time: 19.65s (down from 47s+ in production)

Root cause: PR #8105 removed installContainerRuntime's natural ~10s delay that masked the first-apt-call initialization cost. This fix addresses the underlying issue instead of re-introducing the delay.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed 6f86252 to 0a1d334
```go
// CSE performance thresholds for the golden image (cached) path.
// These represent the expected normal performance when all binaries are pre-cached on the VHD.
// If any of these are exceeded, it indicates a regression in CSE task ordering or apt lock contention.
var cachedCSEThresholds = CSETimingThresholds{
    TotalCSEThreshold: 60 * time.Second,
    TaskThresholds: map[string]time.Duration{
        "installDebPackageFromFile":  25 * time.Second,
        "aptmarkWALinuxAgent":        10 * time.Second,
        "configureKubeletAndKubectl": 30 * time.Second,
        "ensureContainerd":           15 * time.Second,
    },
```
PR description claims the CSE lock contention regression is fixed by switching aptmarkWALinuxAgent to dpkg --set-selections and exporting FULL_INSTALL_REQUIRED, but in the current tree aptmarkWALinuxAgent still uses apt-mark and FULL_INSTALL_REQUIRED is not exported. If those script changes are intended in this PR, they appear to be missing, which would leave the underlying regression unfixed even though new perf tests were added.
Summary
Add E2E tests for catching regressions in CSE timings.
Root Cause
PR #8105 optimized CSE by skipping `installContainerRuntime` on golden image VHDs (~80% of Ubuntu 22.04 nodes). This removed a natural ~10s delay that masked a critical issue: the first `apt-get` or `apt-mark` call after VHD boot triggers expensive apt initialization (~30-50s) and holds dpkg/apt locks, blocking all subsequent apt operations.

Without the `installContainerRuntime` delay, `installKubeletKubectlFromPkg` became the first apt call and took 47.6s p50 (up from ~10s baseline).

E2E Test Results
With fix (Ubuntu 22.04 golden image, westus3, deb package path):
- `aptmarkWALinuxAgent`
- install kubelet: `installDebPackageFromFile`
- install kubectl: `installDebPackageFromFile`
- `configureKubeletAndKubectl` (total)
- `cse_start` (total)

Verified from VM provision logs:
- `dpkg --set-selections` used for walinuxagent hold ✅
- `dpkg -i --force-confold` used for kubelet deb (succeeded on first try) ✅
- `dpkg -i --force-confold` used for kubectl deb (succeeded on first try) ✅
- `FULL_INSTALL_REQUIRED=false` correctly detected and exported ✅

Testing
- `go test ./pkg/agent/...` — PASS
- `go build ./e2e/...` — PASS
- `Test_Ubuntu2204_CSE_CachedPerformance` — PASS (33.29s total CSE, using real deb package path)
- Verified `dpkg -i` and `dpkg --set-selections` from VM provision logs
- `go generate` — no changes needed