Operator enhancement to poll for updates to operands based on floating tag #174
phantomjinx wants to merge 9 commits into hawtio:main from
Conversation
squakez
left a comment
I had a quick look and, technically speaking, I don't see any problem overall. However, I think that in the long run this may introduce maintainability problems, in that it greatly expands the scope and the surface area for bugs and thread-consistency problems.
I think that restarting an application because a floating tag may have been regenerated belongs more to the cluster itself than to an operator. Moreover, considering that the default polling interval is 12 hours, at this stage it would be much easier to just restart the application with pullPolicy=Always every 12 hours and let the cluster pick up any new image (if one exists).
Thanks for the review and for looking into the architecture. Totally understand the concern about keeping the Operator's scope tight. I did consider the [...] By polling the registry digest in the background, we only trigger a rolling deployment when a new image is actually detected (which could be months away). Therefore, the updater is quiet for the vast majority of the time and the upgrade, when it does occur, is seamless.
Regarding thread safety, the updater is strictly isolated. An early design had the updater modifying the deployment, but that was quickly revised. It makes no modification to cluster state directly; it simply drops a GenericEvent into the standard controller-runtime channel, letting the native Reconciler queue handle the concurrency safely. The E2E suite was written specifically to cover network failures, partial updates, and race conditions, to guarantee that the updater fails gracefully and allows the reconciler to continue.
The level 2 (seamless upgrades) requirement of the Operator framework is to encapsulate domain-specific lifecycle management so the user doesn't have to. This operator is providing a premium, zero-config experience out of the box.
57f166c to 49ddb19
0ef117e to 8c1fa32
* This variable name collides with common package names and tends to force go-imports to import a log package rather than detecting the instance variable.
* Renaming the variable to an alternative name not associated with common package names stops this happening.
* breaks out all the reconcile functions into separate files to aid reading and maintainability.
* Queries an image url for the latest digest
* Adds dependencies to vendoring
* Provides the operator with the UPDATE_POLLING_INTERVAL env var
* If added to the deployment with a value of 0 then the update poller will be completely disabled.
…o controller
* hawtio_controller.go
  * Adds a watch on the update channel. If the channel signals there is new data then the reconcile loop will be executed.
  * Adds both the channel and the poller to the ReconcileHawtio object to make them available to the deployment reconciler.
* lifecycle.go
  * Improves handleResultAndError to allow for a requeue of the reconciler if the reconcile functions require it - requires a RequeueError
* reconcile_deployment.go
  * If the updatePoller has been initialized then fetch the digests
  * Should the digests not be returned yet, requeue and await the response
  * Should the poller have errored then ignore and continue with the original image urls
* manager.go
  * Creates the update poller and channel for the background thread
* poller.go
  * The poller that runs in the background thread and checks the image digests at the interval specified
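As an illustration of the requeue path mentioned for lifecycle.go, here is a minimal sketch of how a RequeueError might be translated into a requeue. Only the names handleResultAndError and RequeueError come from the commit message; the fields, the signature and the result type (a stand-in for controller-runtime's reconcile.Result) are assumptions and may not match the PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// RequeueError asks the reconciler to try again later instead of failing.
// The Reason and After fields are assumptions made for this sketch.
type RequeueError struct {
	Reason string
	After  time.Duration
}

func (e *RequeueError) Error() string {
	return fmt.Sprintf("requeue in %s: %s", e.After, e.Reason)
}

// result is a hypothetical stand-in for controller-runtime's reconcile.Result.
type result struct {
	RequeueAfter time.Duration
}

// handleResultAndError turns a RequeueError (even a wrapped one) into a
// delayed requeue with no error. Any other error is passed through unchanged.
func handleResultAndError(err error) (result, error) {
	var rq *RequeueError
	if errors.As(err, &rq) {
		return result{RequeueAfter: rq.After}, nil
	}
	return result{}, err
}
```

With this shape, a reconcile function that is still waiting for digests can return a RequeueError and the loop is retried after the given delay rather than being reported as a failure.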
* Tests the updater integrated with the hawtio controller and its effect on reconciling the deployment.
* Prevents the test API being public in the project as no need to expose it
* test_functions.go
  * Adds a FindProjectRoot function that walks up the directory structure to locate go.mod so avoiding the need to add copious .. paths when locating the CRD files in the tests.
* Makes the code obvious as to what the default polling interval of the updater is set to.
8c1fa32 to 6c771f0
Description
Current behaviour of the operator ensures that should operand images receive new releases on a specific floating tag, the operator is capable of installing new incremental image versions rather than being tied to the original product release. Since the operator installs the operand using floating tags, any update to the tag will mean the operator pulling the latest image on that tag and installing it.
However, this does not solve the issue of existing hawtio-online installs being able to receive the updated operands. Only if the operator removes the existing installs and then re-deploys will the new images likely be pulled and used. To solve this issue, the following is implemented:
The operator launches a background go worker which polls the image registry for the given hawtio-online and hawtio-online-gateway images. It fetches the remote digest of each image tag.
The digests are compared to the existing digests and if they are different then the updater makes them available and populates a go channel.
The operator is configured to watch the updater's channel and should it receive a notification, immediately begins an iteration of its reconciliation loop.
Whilst reconciling the deployment, the operator checks the digests received from the updater, writes them to an annotation in the deployment resource spec and updates the image urls to match.
The modification to the deployment is enough for Kubernetes to start a new rollout of the hawtio-online deployment, fetching the new images and spinning them up.
The reconciliation completes by populating the status of each Hawtio CR with the new image urls.
Should the updater fail to access the registry, e.g. in an offline install, then it will gracefully back off and leave the original working image reference intact in the deployment resource.
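The polling steps above can be sketched as follows. This is not the PR's poller.go: the Poller type, the DigestFetcher function and the plain struct{} notification channel are hypothetical stand-ins (the PR itself sends a GenericEvent on a controller-runtime channel), and the registry lookup is abstracted away so the example needs only the standard library.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// DigestFetcher resolves an image reference to its remote digest.
// Hypothetical: a real implementation would query the image registry.
type DigestFetcher func(image string) (string, error)

// Poller is a hypothetical stand-in for the background updater.
type Poller struct {
	mu      sync.Mutex
	digests map[string]string // image reference -> last digest seen
	fetch   DigestFetcher
	images  []string
	updates chan struct{} // signals that a reconcile should run
}

func NewPoller(fetch DigestFetcher, images ...string) *Poller {
	return &Poller{
		digests: make(map[string]string),
		fetch:   fetch,
		images:  images,
		updates: make(chan struct{}, 1),
	}
}

// Updates is the channel the controller would watch.
func (p *Poller) Updates() <-chan struct{} { return p.updates }

// Digest returns the last digest recorded for image, if any.
func (p *Poller) Digest(image string) (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	d, ok := p.digests[image]
	return d, ok
}

// CheckOnce fetches every digest and reports whether any of them changed.
// A failed or empty lookup leaves the previously recorded digest untouched.
// The first successful lookup counts as a change in this sketch.
func (p *Poller) CheckOnce() bool {
	changed := false
	for _, image := range p.images {
		digest, err := p.fetch(image)
		if err != nil || digest == "" {
			continue // registry unreachable: keep what we already have
		}
		p.mu.Lock()
		if p.digests[image] != digest {
			p.digests[image] = digest
			changed = true
		}
		p.mu.Unlock()
	}
	if changed {
		select {
		case p.updates <- struct{}{}:
		default: // a notification is already pending
		}
	}
	return changed
}

// Run polls at the given interval until ctx is cancelled.
// A non-positive interval disables polling altogether.
func (p *Poller) Run(ctx context.Context, interval time.Duration) {
	if interval <= 0 {
		return
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	p.CheckOnce()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			p.CheckOnce()
		}
	}
}
```

Because the poller only records digests and sends a notification, every change to the deployment still happens inside the reconcile loop, which is the isolation the author describes in the conversation.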
If users wish to completely disable the updater, e.g. for an offline install or air-gapped network, then the environment variable UPDATE_POLLING_INTERVAL can be added to the operator deployment with a value of "0".
The updater has a default schedule of every 12 hours ("12h"). This can be modified using UPDATE_POLLING_INTERVAL, e.g. "6h" or "180h".
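As a rough illustration of how such a setting could be interpreted, the sketch below parses the value with Go's time.ParseDuration. The function and constant names are hypothetical, and treating an unset variable as the 12h default and a negative value as an error are assumptions rather than confirmed behaviour of the PR.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultPollingInterval mirrors the 12h default described above.
const defaultPollingInterval = 12 * time.Hour

// pollingInterval interprets a raw UPDATE_POLLING_INTERVAL value.
// It returns the interval and whether the updater should run at all.
func pollingInterval(raw string) (time.Duration, bool, error) {
	if raw == "" {
		// Assumption: an unset variable means "use the default".
		return defaultPollingInterval, true, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, false, fmt.Errorf("invalid UPDATE_POLLING_INTERVAL %q: %w", raw, err)
	}
	if d < 0 {
		return 0, false, fmt.Errorf("invalid UPDATE_POLLING_INTERVAL %q: must not be negative", raw)
	}
	if d == 0 {
		return 0, false, nil // "0" disables the updater
	}
	return d, true, nil
}

// pollingIntervalFromEnv reads the value from the operator's environment.
func pollingIntervalFromEnv() (time.Duration, bool, error) {
	return pollingInterval(os.Getenv("UPDATE_POLLING_INTERVAL"))
}
```

Note that time.ParseDuration accepts "0" without a unit but requires one for every other value, so "6h" parses while "6" does not.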