Last time out I was thinking about some choices around setting up Sitecore in Kubernetes. Since then, I've moved on to the more practical task of trying to get the setup to work. And I doubt you'll be surprised to hear that I've met a few new issues... Maybe they'll help you save yourself a bit of time and frustration?
I started off the setup process by taking advantage of the scripts Bart Plasmeijer published after his recent Symposium presentation. I figured they would be a quick way to get myself a working instance of AKS so I could experiment a bit.
But the deployment failed at step four. When it tried to run
helm install nginx-ingress ingress-nginx/ingress-nginx
it timed out, and the script window reported a fairly generic error. It took me a while to work out how to get some details back from AKS about what happened, but eventually I found the
kubectl describe pod <pod-name>
command and got this:
Events:
  Type     Reason                  Age   From                   Message
  ----     ------                  ----  ----                   -------
  Normal   Scheduled               39m   default-scheduler      Successfully assigned ingress-basic/nginx-ingress-ingress-nginx-admission-create-x7nc7 to akswin000000
  Warning  FailedCreatePodSandBox  39m   kubelet, akswin000000  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "nginx-ingress-ingress-nginx-admission-create-x7nc7": Error response from daemon: container 0db3bde3668a18bd9bf25a9ce127a094b3b4845461e24950fa3bb31d293a26df encountered an error during hcsshim::System::CreateProcess: failure in a Windows system call: The user name or password is incorrect. (0x52e)
  extra info: {"CommandLine":"cmd /S /C pauseloop.exe", "User":"2000", "WorkingDirectory":"C:\\", "Environment":{"PATH":"c:\\Windows\\System32;c:\\Windows"}, "CreateStdInPipe":true, "CreateStdOutPipe":true, "CreateStdErrPipe":true, "ConsoleSize":[0,0]}
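If you're hunting for the same sort of detail, here's a minimal sketch of the kubectl commands I'd reach for now. It assumes the chart was installed into the "ingress-basic" namespace (that's what the event output above suggests) and reuses the pod name from that output; both will differ for your cluster.

```powershell
# Assumption: namespace and pod name are taken from the event output above.
# Substitute whatever "kubectl get pods" reports for your own cluster.
$Namespace = "ingress-basic"
$PodName   = "nginx-ingress-ingress-nginx-admission-create-x7nc7"

# List the pods so you can see which one is stuck
kubectl get pods --namespace $Namespace

# Show the events for the namespace in time order
kubectl get events --namespace $Namespace --sort-by '.metadata.creationTimestamp'

# Describe one pod in detail, including its Events section
kubectl describe pod $PodName --namespace $Namespace
```

The events list is worth a look even when one pod seems to be the obvious culprit, because it shows what else went wrong around the same time.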
Initially I was worried that I'd broken something in the scripts here – I'd made a few edits to make them match my needs more closely. So I spent quite a lot of time trying to find helpful stuff in Google, and bugging people on Sitecore Slack. But eventually I realised that this was not actually the first error. Somehow I'd missed another error earlier in the process (at step two) that had scrolled off my console screen when I wasn't looking. And this turned out to be the important one:
--- Linking AKS to ACR ---
az : ForbiddenError: The client 'my.user@company.com' with object id 'xx56b90b-xxxx-xxxx-xxxx-98ea3aa7xxxx' does not have authorization to perform action 'Microsoft.Authorization/roleAssignments/write' over scope '/subscriptions/xxd1df1a-xxxx-xxxx-xxxx-dfdbd355xxxx/resourceGroups/client-k8s/providers/Microsoft.ContainerRegistry/registries/client/providers/Microsoft.Authorization/roleAssignments/xx5f2920-xxxx-xxxx-xxxx-9a8efa47xxxx' or the scope is invalid. If access was recently granted, please refresh your credentials.
At C:\Users\JDavis\Downloads\Sitecore-Symposium-2020-Containers-AKS-main\2.CreateAKS.ps1:53 char:1
+ az role assignment create --assignee $clientID `
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (ForbiddenError:...ur credentials.:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
It's saying it cannot connect my AKS instance to my ACR instance because my account isn't allowed to create the role assignment that the link needs. So cue a load more Googling and pestering on Slack, and JF Larente pointed out the important thing I'd missed: Microsoft's Azure documentation says you need to have Owner rights for either the subscription or the ACR itself for this task to work.
Because I was doing this in the company Azure subscription, my account was fairly tightly controlled and I didn't have Owner rights to anything. So the script failed to connect AKS to my ACR, and hence the Nginx deployment failed for some reason related to this.
Once I got Owner rights applied to my account, tidied up my resource group and ran the scripts again, these errors disappeared.
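With hindsight, I could have checked this before running anything. The following is a rough sketch using the Azure CLI, not part of Bart's scripts. The registry name "client" and resource group "client-k8s" come from the error above; the cluster name is a made-up placeholder, since mine isn't shown here.

```powershell
# Registry and resource group names are the ones visible in the error above.
$ResourceGroup = "client-k8s"
$AcrName       = "client"
$User          = "my.user@company.com"
# Assumption: hypothetical cluster name, replace with your own.
$AksName       = "my-aks-cluster"

# Resource ID of the registry, used as the scope for the role lookup
$AcrId = az acr show --name $AcrName --resource-group $ResourceGroup --query id --output tsv

# List the roles the account holds on the registry, including inherited ones.
# If "Owner" is not in this list, the AKS-to-ACR link step is likely to fail.
az role assignment list --assignee $User --scope $AcrId --include-inherited --output table

# After the cluster exists, ask AKS to validate that it can reach the registry
az aks check-acr --resource-group $ResourceGroup --name $AksName --acr "$($AcrName).azurecr.io"
```

The first two commands only read data, so they should be safe to run even on a tightly controlled account.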
Another mistake was a classic case of "read other people's scripts properly before you run them!". When the scripts got to this part of step two:
# Add windows server nodepool
Write-Host "--- Creating Windows Server Node Pool ---" -ForegroundColor Blue
az aks nodepool add --resource-group $ResourceGroup `
    --cluster-name $AksName `
    --os-type Windows `
    --name win `
    --node-vm-size Standard_D8s_v3 `
    --node-count 1
Write-Host "--- Complete: Windows Server Node Pool Created ---" -ForegroundColor Green
it failed with an Azure error I'd never seen before:
Operation could not be completed as it results in exceeding approved Total Regional Cores quota
Turns out that Azure has some default settings to avoid you spinning up a big pile of expensive VMs and hammering your credit card.
What I'd failed to notice was that the bit of script above adds a "D8s_v3" VM, meaning it has eight cores. Because I already had four cores running in this region, the new node would have taken me to twelve, and that was over my regional cores quota. So Azure said no to adding this big VM.
Lesson learned: look at the script and check what you're creating before you start. I changed the VM size being created and went back to work...
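If you want to check your headroom before the script does it for you, the Azure CLI can report both the quota and the size of a VM. This is a small sketch rather than anything from the original scripts, and the region name is a placeholder I've made up for illustration.

```powershell
# Assumption: "westeurope" is a placeholder, use the region your cluster is in.
$Location = "westeurope"

# Show current usage and limits for the region. Look for the regional
# cores total and the row for the VM family you plan to use.
az vm list-usage --location $Location --output table

# Confirm how many cores a given VM size has before adding it as a node
az vm list-sizes --location $Location --query "[?name=='Standard_D8s_v3'].{Name:name, Cores:numberOfCores}" --output table
```

If current usage plus the new node's cores is above the limit, either pick a smaller VM size or ask for a quota increase first.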
With those issues sorted I was able to get AKS up and running, so I turned my attention to the Helm Charts to get my site deployed to AKS. The first thing that tripped me up here was Solr. For the Docker-hosted development instances of this site I have a customised SolrCloud container which has a setup script that will create all the collections at start-up if required. What I only realised after a load of "Solr isn't working" pain is that this start-up script was being run by a custom entrypoint in my Docker compose file. And that meant it needed something similar for Helm so that Kubernetes would run my custom script too.
After a few rounds of googling and a certain amount of trial and error, I worked out that the custom entrypoint:
solr:
  image: ${REGISTRY}${COMPOSE_PROJECT_NAME}-k8s-xp1-solrcloud:${VERSION:-latest}
  build:
    context: ./containers/build/solrcloud
    args:
      BASE_IMAGE: ${SITECORE_DOCKER_REGISTRY}sitecore-xp1-solr:${SITECORE_VERSION}
  mem_limit: 1GB
  entrypoint: powershell -Command "& C:\Cloud\StartCloud.ps1 c:\solr c:\data"
  volumes:
    - ${LOCAL_DATA_PATH}\solr:c:\data
needs to get replaced with this for Kubernetes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: solr
  labels:
    app: solr
spec:
  replicas: 1
  selector:
    matchLabels:
      app: solr
  template:
    metadata:
      labels:
        app: solr
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      containers:
      - name: solr
        image: client.azurecr.io/my-k8s-xp1-solrcloud:latest
        command: ["powershell.exe"]
        args: ["-Command", "& C:\\Cloud\\StartCloud.ps1 c:\\solr c:\\data"]
        ports:
        - containerPort: 8983
        env:
        - name: SOLR_MODE
          value: solrcloud
      imagePullSecrets:
      - name: sitecore-docker-registry
But then I crashed into a whole new issue. With Solr and SQL services up and running, I tried to run the SQL initialisation job in step seven of Bart's script. Its job is to create all the databases in SQL. But what I actually got was another failure. See the next section for info about how I found the actual error, but it was complaining about being unable to log in as the SQL "sa" account to create the databases.
This confused me a lot. Kubernetes uses "secrets" as a way to store things like passwords – so surely the SQL service was being started up using the same value for the SA password that the init job was trying to log in with? They are both reading the same secret after all...
The Sitecore deployment config for Kubernetes gives you a folder full of text files for your secrets. The SQL username and password come from secrets/sitecore-databaseusername.txt and secrets/sitecore-databasepassword.txt. You publish these secrets into your AKS cluster, and the contents of these files get put into the secret store for use by your containers.
When you download them, the username file contains "sa" by default, and the password one is empty. So I'd filled in a suitable password. Opening my copies in a text editor looked fine at first glance:
But after tearing my hair out for a couple of hours I spotted the length field at the bottom left of that image: 3 characters... For a two character username... So I made whitespace visible in the editor:
And that extra line-feed was the clue to the issue. I went through all my secrets files and got rid of any accidental trailing whitespace, and then pushed the secrets files up to my AKS instance again. And with that done, deleting/recreating SQL and then running the init job succeeded.
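Checking a folder full of files by eye is tedious, so here's a minimal PowerShell sketch of how the clean-up could be automated. The function name is my own invention, and it assumes the secrets are plain .txt files in a single folder and that no secret is supposed to end in whitespace.

```powershell
# Hypothetical helper (not part of the Sitecore package): strips trailing
# whitespace, including line feeds, from every .txt file in a folder.
# Assumption: no secret value legitimately ends with whitespace.
function Remove-TrailingWhitespace {
    param([Parameter(Mandatory)][string]$Path)

    foreach ($file in Get-ChildItem -Path $Path -Filter *.txt -File) {
        $text    = [System.IO.File]::ReadAllText($file.FullName)
        $trimmed = $text.TrimEnd()

        if ($trimmed.Length -ne $text.Length) {
            $removed = $text.Length - $trimmed.Length
            Write-Host "$($file.Name): removed $removed trailing character(s)"
            # Write back with no trailing newline and no byte-order mark
            $encoding = [System.Text.UTF8Encoding]::new($false)
            [System.IO.File]::WriteAllText($file.FullName, $trimmed, $encoding)
        }
    }
}
```

Running `Remove-TrailingWhitespace -Path .\secrets` from the folder that contains the secrets directory should list any files it changed. After that the secrets still need pushing up to the cluster again, as described above.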
I'm getting closer to a working build-and-deploy process, but I suspect I'll have some more things to write up in the future...