Aborted bluegreen rollout preview service behaviour #4360

nhs-work · 2025-07-16T02:30:01Z


I'm not sure whether this belongs under discussions or issues, but I would like to better understand the current behaviour and intended usage of the preview service, in particular when a bluegreen rollout is aborted.

I have observed that when a new deployment fails the AnalysisRun for Pre Promotion Analysis, Argo Rollouts correctly does not promote it, but the preview service's rollouts-pod-template-hash selector keeps pointing at the now-failed hash instead of the last stable ReplicaSet's hash (which is how the active service behaves). This appears to be due to a difference in implementation: reconcilePreviewService, unlike reconcileActiveService, contains no check for an aborted status.

Is remaining at the failed hash intended, and if so, what is the reason for it? My project currently runs pre-promotion integration tests for a few different apps against their preview services, and a preview service whose selector matches no pods causes those tests to fail.
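To make the failure mode concrete, here is a minimal sketch of how an equality-based Service selector is matched against pod labels. It uses plain maps rather than the Kubernetes client types, and the `app` label and the hash values are assumptions made up for illustration; only the rollouts-pod-template-hash key comes from the behaviour described above.

```go
package main

import "fmt"

// hashKey is the label key mentioned above that ties pods to a ReplicaSet.
const hashKey = "rollouts-pod-template-hash"

// selectorMatches reports whether every key/value pair in selector is present
// in podLabels (equality-based matching). An empty selector is treated as
// matching nothing, since a Service without a selector does not pick up pods
// automatically.
func selectorMatches(selector, podLabels map[string]string) bool {
	if len(selector) == 0 {
		return false
	}
	for key, want := range selector {
		if got, ok := podLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// countMatchingPods returns how many pods the selector would route to.
func countMatchingPods(selector map[string]string, pods []map[string]string) int {
	n := 0
	for _, labels := range pods {
		if selectorMatches(selector, labels) {
			n++
		}
	}
	return n
}

func main() {
	// After an abort only the stable ReplicaSet still has pods (illustrative values).
	pods := []map[string]string{
		{"app": "backend", hashKey: "stable-hash"},
		{"app": "backend", hashKey: "stable-hash"},
	}
	preview := map[string]string{"app": "backend", hashKey: "failed-hash"}
	active := map[string]string{"app": "backend", hashKey: "stable-hash"}

	fmt.Println("preview matches:", countMatchingPods(preview, pods))
	fmt.Println("active matches:", countMatchingPods(active, pods))
}
```

With the preview selector still carrying the failed hash, no pod qualifies, so anything routed through the preview service has no backend to reach.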

sglre6355 · 2025-07-18T13:21:51Z


I'm facing the exact same issue in a similar scenario. I think it's reasonable to expect the preview service to point back to the stable version when an AnalysisRun fails, especially given that the preview ReplicaSet usually scales down unless explicitly configured not to. Looking at the commit history and the docs, though, the current behavior seems to be intentional (for example, the BlueGreen docs say "The Rollout always makes sure that the preview service is sending traffic to the newest ReplicaSet").

Changing this behaviour now would probably be a breaking change, so I'm wondering whether it would make sense to add an optional Rollout field to modify it. If a maintainer is happy with this, I'm happy to open a PR.


kostis-codefresh (Collaborator) · 2025-07-22T11:28:12Z

Could you explain a bit about the use case here and what the business need is behind this?

Currently it works like this (or at least this is what I understand):

1. Stable/preview is 1.4
2. Preview becomes 1.5, stable stays at 1.4
3. Analysis runs against whatever you want (either 1.5 or 1.4 or both)
4. Analysis fails. Rollout is marked as degraded. All traffic goes back to 1.4 again

End of story.

Are you saying that after this, you still want to run integration tests against 1.5? Is this your scenario?


sglre6355 · 2025-07-22T15:48:30Z


In my case I route traffic with a certain header to preview and all others to stable with an Istio VirtualService. This enables us to make preview accessible for specific users (in my case, users from the QA team) before fully promoting.

With your example, when the analysis fails in step 4, stable does point back to 1.4 but preview doesn't, since its pod hash is not updated. This makes the service inaccessible for those specific users. I'm not sure of the reasoning behind this implementation, but I find it somewhat unintuitive and would like to know why it works this way, if you know.

Either way, I think an optional field that makes preview point back to the old version would be reasonable, and I would love to hear your perspective (or other approaches to achieve what I'm trying to do, if you know any).

1 reply

kostis-codefresh Aug 25, 2025
Collaborator

> With your example, when the analysis fails in step 4 stable does point back to 1.4 but preview doesn't since its pod hash is not updated

Do you have a GitHub repo with an example? Normally this doesn't happen. When a rollout is NOT in progress, both the preview and stable services should work just fine and both should point back to 1.4.

So either there is a bug, or you have a configuration that I am not aware of.

nhs-work (Author) · 2025-12-15T06:29:43Z

Apologies for the late reply. I don't have a GitHub repo with an example, but let me step through my understanding of the Argo Rollouts code and explain what happens in my case.

In my project we have multiple apps running (e.g. frontend, backend) with an integration test that requires every app's preview service to be up at the same time.

Suppose I deploy backend v2, the new version has a bug, and the Pre Promotion Analysis/AnalysisRun/integration test fails. During the attempted deployment, the newest ReplicaSet hash is updated to point to the failed deployment. Once the rollout is aborted because of the failing AnalysisRun, this new ReplicaSet has its desired replicas set to 0 (since it is a failed deployment). The active service sees that the rollout status for backend is aborted, so it falls back to the last known stable ReplicaSet hash (v1) instead of the latest (v2, failing, 0 replicas). The preview service, by contrast, always uses the latest (v2, failing, 0 replicas).

The result is that frontend's preview is running while backend's preview service points to a failed ReplicaSet with 0 replicas. Every subsequent frontend deployment then fails, because the backend preview service that the Pre Promotion Analysis/AnalysisRun/integration test depends on is down.
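The difference described above can be modelled in a few lines. This is a simplified sketch, not the Argo Rollouts implementation: `rolloutState`, `activeSelector` and `previewSelector` are hypothetical names, and the model only covers the abort branch, leaving out pauses, fast-tracking and saturation checks.

```go
package main

import "fmt"

// rolloutState is a hypothetical, reduced stand-in for the rollout fields
// discussed here; it is not the real Argo Rollouts type.
type rolloutState struct {
	Aborted    bool   // corresponds to rollout.Status.Abort
	StableHash string // corresponds to rollout.Status.StableRS
	NewHash    string // pod-template hash of the newest ReplicaSet
}

// activeSelector models the active service: it keeps its current hash, but
// falls back to the stable hash once the rollout is aborted.
func activeSelector(current string, s rolloutState) string {
	if s.Aborted {
		return s.StableHash
	}
	return current
}

// previewSelector models the preview service: it always follows the newest
// ReplicaSet, whether or not the rollout was aborted.
func previewSelector(s rolloutState) string {
	return s.NewHash
}

func main() {
	// Backend after a failed v2 deployment (hash values are illustrative).
	backend := rolloutState{Aborted: true, StableHash: "v1-hash", NewHash: "v2-hash"}

	fmt.Println("active :", activeSelector("v1-hash", backend))
	fmt.Println("preview:", previewSelector(backend))
}
```

In this model the active selector resolves to the v1 hash, while the preview selector stays on the v2 hash whose ReplicaSet has been scaled to 0.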

The relevant code snippets for this behavior are shown below:

Preview service:

argo-rollouts/rollout/service.go

Lines 77 to 92 in 7518bde

    
    func (c *rolloutContext) reconcilePreviewService(previewSvc *corev1.Service) error {
    	if previewSvc == nil {
    		return nil
    	}
    	if haltReason := c.haltProgress(); haltReason != "" {
    		c.log.Infof("Skipping preview service reconciliation: %s", haltReason)
    		return nil
    	}
    	newPodHash := c.newRS.Labels[v1alpha1.DefaultRolloutUniqueLabelKey]
    	err := c.switchServiceSelector(previewSvc, newPodHash, c.rollout)
    	if err != nil {
    		return err
    	}
    	return nil
    }

Note that newPodHash is always taken from c.newRS.Labels[v1alpha1.DefaultRolloutUniqueLabelKey], regardless of the rollout status.

Active service:

argo-rollouts/rollout/service.go

Lines 94 to 121 in 7518bde

    
    func (c *rolloutContext) reconcileActiveService(activeSvc *corev1.Service) error {
    	if haltReason := c.haltProgress(); haltReason != "" {
    		c.log.Infof("Skipping active service reconciliation: %s", haltReason)
    		return nil
    	}
    	if !replicasetutil.ReadyForPause(c.rollout, c.newRS, c.allRSs) || !annotations.IsSaturated(c.rollout, c.newRS) {
    		c.log.Infof("skipping active service switch: New RS '%s' is not fully saturated", c.newRS.Name)
    		return nil
    	}
    	newPodHash := activeSvc.Spec.Selector[v1alpha1.DefaultRolloutUniqueLabelKey]
    	if c.isBlueGreenFastTracked(activeSvc) {
    		newPodHash = c.newRS.Labels[v1alpha1.DefaultRolloutUniqueLabelKey]
    	}
    	if c.pauseContext.CompletedBlueGreenPause() && c.completedPrePromotionAnalysis() {
    		newPodHash = c.newRS.Labels[v1alpha1.DefaultRolloutUniqueLabelKey]
    	}
    	if c.rollout.Status.Abort {
    		newPodHash = c.rollout.Status.StableRS
    	}
    	err := c.switchServiceSelector(activeSvc, newPodHash, c.rollout)
    	if err != nil {
    		return err
    	}
    	return nil
    }

The abort check below is the part that the preview service does not have:

    newPodHash := activeSvc.Spec.Selector[v1alpha1.DefaultRolloutUniqueLabelKey]
    if c.rollout.Status.Abort {
    	newPodHash = c.rollout.Status.StableRS
    }

Hopefully the code snippets above are enough to illustrate this scenario without an example repo.


AFMiziara · 2026-02-03T22:40:36Z


We have a similar use case that would greatly benefit from this feature. We maintain an internal E2E testing application that needs to always test against the preview version of the service before promoting to production.

Due to external constraints, we cannot modify the E2E application code to explicitly select preview vs. stable versions or to add new headers. Instead, we use Istio VirtualService header matching (origin/referer headers) to automatically route traffic from our E2E app to the previewService while production clients continue using the activeService/stableService. This approach works perfectly during normal rollout flows.

However, when a rollout is aborted (e.g. manual abort), the previewService selector continues pointing to the aborted ReplicaSet with 0 replicas, causing all E2E requests to fail (503 unavailable).

Proposed Solution: An optional field like spec.strategy.blueGreen.previewServiceFallbackToStable: true would allow the preview service to automatically revert to the stable ReplicaSet when a rollout is aborted, maintaining service availability for testing environments while keeping the current behavior as default for backward compatibility.
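A minimal sketch of how such an opt-in could be evaluated, assuming the proposed field were added. Nothing here exists in Argo Rollouts today: `blueGreenOptions`, `PreviewServiceFallbackToStable` and `previewPodHash` are hypothetical names used only to illustrate the proposal, and the real change would have to live in the preview service reconciliation.

```go
package main

import "fmt"

// blueGreenOptions is a hypothetical subset of the blue-green strategy.
// PreviewServiceFallbackToStable is the field proposed in this thread and is
// not part of the current API.
type blueGreenOptions struct {
	PreviewServiceFallbackToStable bool
}

// previewPodHash picks the hash the preview service selector should use.
// With the option off (the zero value) it always returns the newest hash,
// which matches the behaviour described in this thread. With the option on,
// an aborted rollout falls back to the stable hash when one is known.
func previewPodHash(opts blueGreenOptions, aborted bool, stableHash, newHash string) string {
	if opts.PreviewServiceFallbackToStable && aborted && stableHash != "" {
		return stableHash
	}
	return newHash
}

func main() {
	off := blueGreenOptions{}
	on := blueGreenOptions{PreviewServiceFallbackToStable: true}

	// Hash values are illustrative.
	fmt.Println("default, aborted:", previewPodHash(off, true, "stable-hash", "new-hash"))
	fmt.Println("opt-in, aborted :", previewPodHash(on, true, "stable-hash", "new-hash"))
	fmt.Println("opt-in, running :", previewPodHash(on, false, "stable-hash", "new-hash"))
}
```

Keeping the zero value equal to today's behaviour is what would make the field backward compatible: existing Rollouts that never set it would see no change.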

