
Operator enhancement to poll for updates to operands based on floating tag#174

Open
phantomjinx wants to merge 9 commits into hawtio:main from phantomjinx:image-updater

Conversation

@phantomjinx
Member

Description

The operator installs operands using floating tags, so when operand images receive new releases on a specific floating tag, the operator pulls the latest image on that tag and installs the new incremental version rather than remaining tied to the original product release.

However, this does not allow existing hawtio-online installs to receive the updated operands. Only if the operator removes the existing installs and re-deploys are the new images likely to be pulled and used. To solve this, the following is implemented:

  1. The operator launches a background go worker which polls the image registry for the given hawtio-online and hawtio-online-gateway images. It fetches the remote digest of each image tag.

  2. The digests are compared with the existing digests; if they differ, the updater records the new digests and signals a Go channel.

  3. The operator watches the updater's channel and, on receiving a notification, immediately begins an iteration of its reconciliation loop.

  4. Whilst reconciling the deployment, the operator checks the digests received from the updater, records them in an annotation on the deployment resource spec and updates the image URLs.

  5. The modification to the deployment is enough for Kubernetes to trigger a rollout of the hawtio-online deployment, fetching the new images and spinning them up.

  6. The reconciliation completes by populating the status of each Hawtio CR with the new image URLs.

  7. Should the updater fail to access the registry, e.g. in an offline install, it backs off gracefully and leaves the original working image reference intact in the deployment resource.

  8. If users wish to completely disable the updater, e.g. for an offline install or air-gapped network, the environment variable UPDATE_POLLING_INTERVAL can be added to the operator deployment with a value of "0", disabling the updater entirely.

  9. The updater has a default schedule of every 12 hours ("12h"). This can be modified using UPDATE_POLLING_INTERVAL, e.g. "6h", "180h", etc.


@squakez squakez left a comment


I had a quick look and, overall, technically speaking I don't see any problem. However, I think that in the long run this may introduce maintainability problems, in the sense that it significantly expands the scope and the surface area for bugs and thread-consistency problems.

I think that the "restart" of an application due to a floating tag that may be regenerated belongs more to the cluster itself than to an operator. Moreover, considering that the default polling is 12 hours, at this stage it would be much easier to just restart the application with pullPolicy=Always every 12 hours and let the cluster pick up any new image (if one exists).

Comment thread cmd/manager/main.go Outdated
Comment thread pkg/controller/internal/hawtiotest/test_functions.go
@phantomjinx
Member Author

I had a quick look and, overall, technically speaking I don't see any problem. However, I think that in the long run this may introduce maintainability problems, in the sense that it significantly expands the scope and the surface area for bugs and thread-consistency problems.

I think that the "restart" of an application due to a floating tag that may be regenerated belongs more to the cluster itself than to an operator. Moreover, considering that the default polling is 12 hours, at this stage it would be much easier to just restart the application with pullPolicy=Always every 12 hours and let the cluster pick up any new image (if one exists).

Thanks for the review and for looking into the architecture. Totally understand the concern about keeping the Operator's scope tight.

I did consider the pullPolicy: Always & scheduled restart approach, but I decided against it primarily due to Pod churn and customer experience. Blindly restarting the application every 12 hours - even when no new image exists - forces unnecessary downtime, breaks active connections, and creates noise in customer alerting systems.

By polling the registry digest in the background, we only trigger a rolling deployment when a new image is actually detected (which could be months away). Therefore, the updater is quiet for the vast majority of the time, and when an upgrade does occur it is seamless.

Regarding thread safety, the updater is strictly isolated. An early design had the updater modifying the deployment itself, but that was quickly revised. It makes no modification to cluster state directly; it simply drops a GenericEvent into the standard controller-runtime channel, letting the native Reconciler queue handle the concurrency safely. The E2E suite was written specifically to cover network failures, partial updates, and race conditions, guaranteeing the updater fails gracefully and allows the reconciler to continue.

The level 2 (seamless upgrades) requirement of the Operator framework is to encapsulate domain-specific lifecycle management so the user doesn't have to. This operator is providing a premium, zero-config experience out of the box.
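The "updater only notifies, never mutates" split described above can be illustrated with a dependency-free sketch. The event type and updater function here are stand-ins for the real controller-runtime GenericEvent and source.Channel wiring, not code from this PR:

```go
package main

import (
	"fmt"
)

// event mimics controller-runtime's GenericEvent: a nudge telling the
// reconciler "something changed", carrying no cluster mutation itself.
type event struct{ digest string }

// updater sends on ch only when a fetched digest differs from the last
// one seen; it never touches cluster state directly. fetch stands in
// for a registry digest lookup.
func updater(fetch func() string, last string, ch chan<- event, rounds int) {
	for i := 0; i < rounds; i++ {
		if d := fetch(); d != last {
			last = d
			ch <- event{digest: d}
		}
	}
	close(ch)
}

func main() {
	// Two polls see the same digest, the third sees a new release.
	digests := []string{"sha256:aaa", "sha256:aaa", "sha256:bbb"}
	i := 0
	fetch := func() string { d := digests[i]; i++; return d }

	ch := make(chan event, 3)
	updater(fetch, "sha256:aaa", ch, len(digests))

	var reconciled []string
	for ev := range ch {
		// One reconcile iteration per notification.
		reconciled = append(reconciled, ev.digest)
	}
	fmt.Println(reconciled)
}
```

Because unchanged digests produce no event, the reconciler runs only when there is real work, which is the Pod-churn argument made above.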

* This variable name collides with common package names and tends to
  force go-imports to import a log package rather than detecting the
  instance variable.

* Renaming the variable to an alternative name not associated with common
  package names stops this happening.
* breaks out all the reconcile functions into separate files to aid reading
  and maintainability.
* Queries an image URL for its latest digest

* Adds dependencies to vendoring
* Provides the operator with the UPDATE_POLLING_INTERVAL env var

* If added to the deployment with a value of 0 then the update poller
  will be completely disabled.
…o controller

* hawtio_controller.go
 * Adds a watch on the update channel. If the channel signals there is
   new data then the reconcile loop will be executed.
 * Adds both the channel and the poller to the ReconcileHawtio object to
   make them available to the deployment reconciler.

* lifecycle.go
 * Improves handleResultAndError to allow for a requeue of the reconciler
   if the reconcile functions require it - requires a RequeueError

* reconcile_deployment.go
 * If the updatePoller has been initialized then fetch the digests
 * Should the digests not be returned yet, requeue and await the response
 * Should the poller have errored then ignore and continue with the
   original image urls

* manager.go
 * Creates the update poller and channel for the background thread

* poller.go
 * The poller that runs in the background thread and checks the image
   digests at the interval specified
* Tests the updater integrated with the hawtio controller and its
  effect on reconciling the deployment.
* Prevents the test API being public in the project, as there is no need to expose it

* test_functions.go
 * Adds a FindProjectRoot function that walks up the directory structure
   to locate go.mod, avoiding the need for copious .. paths when
   locating the CRD files in the tests.
* Makes the code obvious as to what the default polling interval of the
  updater is set to.
