Waiting for <strike>Godot</strike> Kubernetes [by Jeremy Davis]

I've been slowly improving the release process for the container-based project I'm working on. There's a lot to learn about the ways to configure the Azure DevOps pipelines for this work, because targeting Kubernetes is quite different from the old IaaS and PaaS approaches I was used to.

The issue

One of the key processes for a container release is that you have to instruct Kubernetes to update the images for the site you're deploying. You send a command to say "here are the new config files, please update!". Kubernetes will then start the process of updating to the images described in this new config – which takes a while. It needs to download the files, spin up the appropriate containers based on them, and start up the software they run.

Now a Sitecore deployment often involves tasks for things like "sync Unicorn" or similar, which have to take place after you've deployed your new images. So it's necessary for the release pipeline to wait for Kubernetes to finish this whole process of updating images before triggering Unicorn.

So how can DevOps do that waiting job?

First attempt

I did a bit of research before my initial attempt to set this up, and came across documentation for the "wait" option for the kubectl command. When that is run, it can pause execution until some sort of container-related condition becomes true.

And helpfully, DevOps has a pipeline step which can run this command with its various parameters.

So I configured an initial attempt in DevOps that waited for all the pods in my namespace to be ready, and gave me output data in JSON format (so I could see what went on after the deploy completed). I also picked a pretty big timeout, since it can take a while for all the pods in a Sitecore XP deployment to download and get ready. The command-line equivalent of the setup in DevOps looks like:

kubectl.exe wait --for=condition=Ready pods --all -n client-namespace --timeout=2700s -o json

And I tried some command-line tests against a demo instance, where this worked. But when I used it in DevOps against my client's official Kubernetes instance it sat and waited forever. Even when the release of my new images appeared to be complete, this command didn't finish.

Back to the drawing board...

The first issue here was really obvious in retrospect. When you deploy Sitecore to containers, you need to initialise the databases and the Solr indexes. Sitecore's container-based release process includes running some job containers to do this – and these had been run in the client's deployment. But after running they had not been removed again. So I had two job containers sitting in a "done my job" state. But my wait command above wants all the containers to be in the "Ready" state – and completed jobs are not.

So learning point #1: Finished jobs are still important...
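If you want to check which pods would hold up a wait like this, you can inspect what `kubectl get pods -o json` reports for them. Here is a minimal Python sketch of that check. It assumes the standard pod JSON shape, where a finished job pod has the phase `Succeeded` and a `Ready` condition of `False`. The function name and the sample pod names are made up for illustration.

```python
import json


def pods_blocking_ready_wait(pod_list_json):
    """Return (name, phase) for each pod whose Ready condition is not "True".

    Expects the text produced by `kubectl get pods -n <namespace> -o json`.
    """
    blocking = []
    for pod in json.loads(pod_list_json).get("items", []):
        status = pod.get("status", {})
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in status.get("conditions", [])
        )
        if not ready:
            blocking.append((pod["metadata"]["name"], status.get("phase", "Unknown")))
    return blocking


# Hypothetical, heavily trimmed sample: one running pod and one finished job pod.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "cm-abc"},
         "status": {"phase": "Running",
                    "conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "solr-init-xyz"},
         "status": {"phase": "Succeeded",
                    "conditions": [{"type": "Ready", "status": "False"}]}},
    ]
})

if __name__ == "__main__":
    for name, phase in pods_blocking_ready_wait(SAMPLE):
        print(f"{name} is not Ready (phase: {phase})")
```

Any pod that shows up here with the phase `Succeeded` is a completed job, and a wait on all pods will never see it become Ready.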

But that wasn't the only issue: I was also seeing warnings from DevOps where it complained that the wait operation was generating too much data:

2021-02-03T14:00:24.2280843Z ##[warning]Output variable not set as kubectl command output exceeded the maximum supported length. Output length: 413314, Maximum supported length: 32766
2021-02-03T14:00:24.2289418Z ##[warning]Capturing deployment metadata failed with error: YAMLException: unexpected end of the stream within a double quoted scalar at line 1353, column 1:

Turns out that when you have all the pods for a Sitecore XP setup, and you tell kubectl to return details as json, that generates quite a lot of data.

So learning point #2: DevOps doesn't want lots of your debug data...
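If you still want some record of pod states without producing that much output, one option is to boil the JSON down to a line per pod before logging it. This is a rough Python sketch, assuming the standard pod JSON shape. The function name and sample data are hypothetical, and the length limit is the figure quoted in the warning above.

```python
import json

MAX_OUTPUT_LENGTH = 32766  # limit quoted in the DevOps warning


def summarise_pods(pod_list_json):
    """Reduce `kubectl get pods -o json` output to one short line per pod."""
    lines = []
    for pod in json.loads(pod_list_json).get("items", []):
        name = pod["metadata"]["name"]
        phase = pod.get("status", {}).get("phase", "Unknown")
        lines.append(f"{name}: {phase}")
    return "\n".join(lines)


# Hypothetical sample data; real pod entries carry far more fields than this.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "cm-abc"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "solr-init-xyz"}, "status": {"phase": "Succeeded"}},
    ]
})

if __name__ == "__main__":
    summary = summarise_pods(SAMPLE)
    print(summary)
    print(f"{len(summary)} characters (limit {MAX_OUTPUT_LENGTH})")
```

The summary drops everything except the pod name and phase, so it stays a small fraction of the size of the full JSON.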

Fixing that...

So after a bit more thinking, I revised my command a bit:

kubectl.exe wait --for=condition=Ready pods -l app=cm -n sitecore --timeout=2700s

Instead of waiting for all the pods, I've changed to waiting for just the CM pod to be ready. That's the key thing for my onward deployment, because my next step involves talking to the CM box. The other change is that there's no longer an output option (the "-o json" has gone), which means it's not flooding the DevOps agent log with piles of data.
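If you'd rather run the same wait from a script than from the DevOps step, it could be wrapped up along these lines. This is a sketch only, assuming `kubectl` is on the path and already pointed at the right cluster. The function names are hypothetical, and the values match the command above.

```python
import subprocess


def build_wait_command(selector, namespace, timeout_seconds, kubectl="kubectl"):
    """Build the argument list for waiting on pods matching a label selector."""
    return [
        kubectl, "wait",
        "--for=condition=Ready", "pods",
        "-l", selector,
        "-n", namespace,
        f"--timeout={timeout_seconds}s",
    ]


def wait_for_pods(selector, namespace, timeout_seconds):
    """Run kubectl wait; True means the matching pods became Ready in time."""
    command = build_wait_command(selector, namespace, timeout_seconds)
    # kubectl exits with a non-zero code if the timeout expires first.
    return subprocess.run(command).returncode == 0


if __name__ == "__main__":
    # Print the command rather than running it, so this is safe without a cluster.
    print(" ".join(build_wait_command("app=cm", "sitecore", 2700)))
```

Checking the return code lets a later step, such as the Unicorn sync, be skipped if the CM pod never became Ready.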

And with those two changes in place, my deployment proceeds correctly. And so unlike the play in the title, the deployment does arrive at the right state eventually...
