A while back I wrote up some notes on a problem some people were seeing with Sitecore's SolrCloud developer container that I'd been unable to fix. It was the worst sort of technical problem, happening irregularly on some computers, but never rearing its head on others. So it's taken me a while to get around to coming up with a fix for this. But if you've suffered from the problems described in my previous post, this is an option for you:
The problem I was seeing was that, on some computers, starting up Sitecore's SolrCloud developer container would leave you with a broken Solr instance. And once that failed, Sitecore couldn't start, because the initialisation container couldn't create the collections required, and Sitecore then couldn't run any searches. The underlying error seemed to be that sometimes SolrCloud could not get a network connection to its internal ZooKeeper instance. Without that, it could not start...
The error didn't make a lot of sense to me, as I couldn't see why a network connection could fail between two processes inside a single Docker container. I've spent a chunk of time looking at the source code for both Solr and ZooKeeper, and I can see where the thing is falling over, but I failed to find any reason why. There are multiple threads running through this code, so part of me wonders if this is a threading issue. But I don't have the Java debugging skills to nail this down...
So falling back to the old developer's approach of "if in doubt, fiddle with stuff", I've tried different Sitecore image versions, different container isolation models, different Docker network settings, and a variety of alternative config settings for SolrCloud, ZooKeeper and Docker. But none managed any better than "make the problem a bit less likely" for me.
But it struck me the other day that one way to fix it might be to run a separate ZooKeeper instance, and ensure that was up before starting SolrCloud. In theory this should be done by adding another container to the Docker compose file (so ZooKeeper runs in one container and Solr in another), but that changes the structure of the Sitecore instance, and I wanted to avoid that. So I found some time to sit down and make it work with both Solr and ZooKeeper in one container.
And that meant modifying the default SolrCloud image Sitecore ship...
My thought was that getting ZooKeeper to run separately requires a few steps:

1. Modify the Docker compose config so it builds a custom Solr image.
2. Add a DockerFile which downloads ZooKeeper and installs it into that image.
3. Add config files so ZooKeeper runs as a standalone instance, without generating too much log noise.
4. Change the container's entrypoint so it starts ZooKeeper first, and waits for it to be ready.
5. Start Solr in cloud mode, pointing at that ZooKeeper instance.

None of those is too tricky, but they all need to work together...
Step 1:
To add the build config, follow the pattern for this that's already present in Sitecore's container examples. In your override file, add an `image` element to specify what you want the new image to be called, and then add the `build` elements to say where the DockerFile is and what to base the new image on:
```yaml
solr:
  image: ${REGISTRY}${COMPOSE_PROJECT_NAME}solr:8.8.2-${SITECORE_VERSION}
  build:
    context: ./docker/build/solr
    args:
      PARENT_IMAGE: ${SITECORE_DOCKER_REGISTRY}nonproduction/solr:8.8.2-${SITECORE_VERSION}
  volumes:
    - type: bind
      source: .\docker\data\solr
      target: c:\data
```
You should make the parent image the right one for your version of Sitecore, and name your own image appropriately. I was working with 10.2 - hence Solr 8.8.2. But you can determine the correct version from Sitecore's compatibility table, and look at the image names in the default docker compose files for your version for examples.
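For reference, the `${...}` tokens in that fragment are resolved from the `.env` file that sits alongside your compose files. The exact values depend on your project, but they'd look something like this (illustrative values only - check your own `.env`):

```
REGISTRY=
COMPOSE_PROJECT_NAME=myproject
SITECORE_DOCKER_REGISTRY=scr.sitecore.com/sxp/
SITECORE_VERSION=10.2-ltsc2019
```

With values like those, the custom image would end up tagged `myprojectsolr:8.8.2-10.2-ltsc2019`, based on Sitecore's `nonproduction/solr` image.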
Step 2:
You'll need to create the `solr` folder under `docker\build` and put a DockerFile in it.
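If you'd rather script that than click around, a quick bit of PowerShell from the root of your solution does the job (the path just needs to match the `context` in the compose fragment above):

```powershell
# Create the build context folder for the custom Solr image
New-Item -Type Directory -Force .\docker\build\solr | Out-Null
```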
That file needs to acquire ZooKeeper, extract it, delete some cruft and apply the config. That can be done with:
```dockerfile
# escape=`
ARG PARENT_IMAGE
FROM ${PARENT_IMAGE}

RUN powershell "Invoke-WebRequest https://dlcdn.apache.org/zookeeper/zookeeper-3.7.1/apache-zookeeper-3.7.1-bin.tar.gz -OutFile c:\zk.tar.gz; `
    New-Item -Type Directory c:\apache-zookeeper-3.7.1-bin | out-null; `
    tar -xzf c:\zk.tar.gz; `
    Remove-Item c:\zk.tar.gz; `
    Rename-Item apache-zookeeper-3.7.1-bin zk; `
    Remove-Item c:\zk\docs\ -Recurse -Force; `
    New-Item -Type Directory c:\data\zoo_data | out-null"

COPY zoo.cfg c:\zk\conf\
COPY log4j.properties c:\zk\conf\
COPY Start.ps1 c:\Start.ps1
```
The meat of this is the `RUN powershell` statement, which executes a series of steps to get ZooKeeper in place. It does the download, makes sure there's a folder to put it in, extracts the archive, gives the extracted folder a shorter name and tidies up the mess. Once that completes, the code for ZooKeeper is in place. The `COPY` lines that follow bring in some config files and the entrypoint script, which also need to exist in the folder alongside the DockerFile.
It took me a while to get this step to work though. Docker has an `ADD` command which is able to download and extract `.tar.gz` files. But oddly, while it can download from URLs, it won't extract an archive it downloaded that way - it only extracts archives that come from the build context for your image. Having banged my head against that for a while, I just fell back to PowerShell. And pleasingly, since the last time I tried to script a ZooKeeper install, Windows has gained native support for unix archives via its built-in `tar` command. So there's no need to install an add-on like 7-Zip now.
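If you want to sanity-check those steps before baking them into an image, the same commands run fine in a local PowerShell session. This is just the Dockerfile's logic lifted out for testing - `c:\tmp` is an illustrative scratch folder, and note that older ZooKeeper releases do eventually move off `dlcdn.apache.org` to the Apache archive:

```powershell
# Run the Dockerfile's download/extract steps by hand in a scratch folder
New-Item -Type Directory -Force c:\tmp | Out-Null
Set-Location c:\tmp
Invoke-WebRequest "https://dlcdn.apache.org/zookeeper/zookeeper-3.7.1/apache-zookeeper-3.7.1-bin.tar.gz" -OutFile zk.tar.gz

tar -xzf zk.tar.gz                          # Windows' native tar understands .tar.gz
Rename-Item apache-zookeeper-3.7.1-bin zk   # give the folder a shorter name
Remove-Item zk.tar.gz                       # tidy up the downloaded archive
```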
Step 3:
The `zoo.cfg` file is the basic config for a standalone instance of ZooKeeper:
```
tickTime=2000
dataDir=c:\\data\\zoo_data
clientPort=2181
4lw.commands.whitelist=*
```
That's pretty standard for basic ZooKeeper config, but there are two important points. Firstly, the data directory points to the place where SolrCloud would have put the data for its internal ZooKeeper. That isn't a requirement, but you do need to make sure this data is written to a volume stored outside the container. (For SolrCloud to survive the containers recycling, both the Solr and the ZooKeeper data need to persist - and that's what this volume provides.) It seemed simplest not to move it.
And secondly, the `4lw.commands.whitelist` setting seems to be required in recent versions of ZooKeeper, to control which of the "four letter word" admin commands it will respond to over the network. For production you'd want something more specific, but in a development environment (and secured inside the container) turning everything back on seemed fine.
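If you did want to lock that down, you could whitelist just the commands that actually get used here. The startup script below sends `ruok`, and Solr's admin UI uses `mntr` and `conf` for its ZooKeeper status page - so a tighter version of that line (my suggestion, not part of the original config) might be:

```
4lw.commands.whitelist=mntr, conf, ruok
```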
The first time I made all this run, I realised that the default config for ZooKeeper generates a lot of logspam. It seems like this isn't of much use most of the time, so I've added the `log4j.properties` config file to turn the logging down. The base of this file is just the default values from the ZooKeeper download, but the important settings I modified are:
```
# near the top of the file
zookeeper.root.logger=ERROR,CONSOLE
zookeeper.console.threshold=ERROR

# near the bottom of the file
zookeeper.auditlog.threshold=ERROR
audit.logger=ERROR, RFAAUDIT
```
That turns all the logging down to "errors only".
Steps 4 & 5:
To modify the entrypoint, I pulled a copy of `Start.ps1` out of Sitecore's base container, and made a couple of changes.
To allow "wait for ZooKeeper" to work, I pasted in a variation of a function I'd used in previous bits of SolrCloud automation:
```powershell
function Wait-ForZooKeeperInstance
{
    param(
        [string]$zkHost,
        [int]$zkPort
    )

    Write-Host "Waiting for ZooKeeper at $($zkHost):$zkPort"

    $sawError = $false
    $isUp = $false
    while($isUp -ne $true)
    {
        try
        {
            $client = New-Object System.Net.Sockets.TcpClient
            $client.Connect($zkHost, $zkPort)

            $ns = [System.Net.Sockets.NetworkStream]$client.GetStream()

            $sendBytes = [System.Text.Encoding]::ASCII.GetBytes("ruok")
            $ns.Write($sendBytes, 0, $sendBytes.Length)

            $buffer = New-Object 'byte[]' 10
            $bytesRead = $ns.Read($buffer, 0, 10)

            $receivedBytes = New-Object 'byte[]' $bytesRead
            [System.Array]::Copy($buffer, $receivedBytes, $bytesRead)
            $result = [System.Text.Encoding]::ASCII.GetString($receivedBytes)

            if( $result -eq "imok" )
            {
                $isUp = $true
                if( $sawError -eq $true )
                {
                    Write-Host
                }
            }

            $ns.Dispose()
            $client.Dispose()
        }
        catch
        {
            $sawError = $true
            Write-Host "." -NoNewline
        }
    }
    $client.Dispose()

    Write-Host "ZooKeeper is up"
}
```
It looks a bit complex, but basically it tries to connect to the ZooKeeper command port and send an "are you ok?" message. If it can't connect, or gets an error, then ZooKeeper likely isn't up, so it tries again. Once the correct "I'm ok" message comes back, ZooKeeper is ready to go, and the function can return.
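One thing to be aware of is that, as written, the function will wait forever if ZooKeeper never comes up. That's reasonable for a dev container, but if you'd rather the container failed fast, a variation along these lines (a sketch of my own, not part of the original script) adds a deadline:

```powershell
# Sketch: the same "ruok" polling, but giving up after a deadline so a
# dead ZooKeeper fails the container start instead of hanging it
function Wait-ForZooKeeperWithTimeout
{
    param(
        [string]$zkHost,
        [int]$zkPort,
        [int]$timeoutSeconds = 120
    )

    $deadline = (Get-Date).AddSeconds($timeoutSeconds)
    while((Get-Date) -lt $deadline)
    {
        try
        {
            $client = New-Object System.Net.Sockets.TcpClient
            $client.Connect($zkHost, $zkPort)
            $ns = $client.GetStream()

            # Send ZooKeeper's four-letter "are you ok?" command
            $sendBytes = [System.Text.Encoding]::ASCII.GetBytes("ruok")
            $ns.Write($sendBytes, 0, $sendBytes.Length)

            $buffer = New-Object 'byte[]' 10
            $bytesRead = $ns.Read($buffer, 0, 10)
            $result = [System.Text.Encoding]::ASCII.GetString($buffer, 0, $bytesRead)

            $ns.Dispose()
            $client.Dispose()

            if( $result -eq "imok" )
            {
                Write-Host "ZooKeeper is up"
                return
            }
        }
        catch
        {
            Write-Host "." -NoNewline
        }
        Start-Sleep -Seconds 1
    }
    throw "ZooKeeper did not respond within $timeoutSeconds seconds"
}
```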
The function can then be used as part of the logic to start ZooKeeper and Solr, which replaces the default startup in Sitecore's script:
```powershell
$dataPathToTest = Join-Path $DataPath solr.xml
if (Test-Path $dataPathToTest)
{
    Write-Host "INFO: Existing Solr configuration found in '$DataPath'..."
}
else
{
    Write-Host "INFO: Solr configuration not found in '$DataPath', copying clean configuration..."
    Copy-Item $InstallPath\** $DataPath -Recurse -Force -ErrorAction SilentlyContinue
}

Start-Process -FilePath "c:\zk\bin\zkServer.cmd" -NoNewWindow
Wait-ForZooKeeperInstance "localhost" 2181

& "c:\solr\bin\solr.cmd" start -port $SolrPort -f -z localhost:2181
```
First it calls `zkServer.cmd` (the batch file that runs an instance of ZooKeeper) using `Start-Process`. That ensures it runs in the background, so PowerShell's execution of the script continues as soon as that command is issued. The call to `Wait-ForZooKeeperInstance` will then block until it detects that ZooKeeper has started correctly. And finally Solr gets executed. The call to start SolrCloud now needs a new parameter though. Instead of saying "just start in Cloud mode with your internal ZooKeeper", we now pass `-z localhost:2181` to tell it to run in cloud mode with the external ZooKeeper connection specified.
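Once it's up, you can check that Solr really is in cloud mode and talking to ZooKeeper by asking the Collections API for the cluster state. A quick way from the host (assuming your compose file maps Solr to port 8984, as the standard Sitecore setup does - adjust if yours differs):

```powershell
# Ask SolrCloud for its cluster status - live_nodes being populated
# means Solr is up and registered with ZooKeeper
$status = Invoke-RestMethod "http://localhost:8984/solr/admin/collections?action=CLUSTERSTATUS"
$status.cluster.live_nodes
```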
And with all that in place you can run a `docker-compose build`, followed by a `docker-compose up`, to get a SolrCloud instance in this new model. The container logs show the startup happening in three blocks:
The first block is ZooKeeper starting in the background, the second is the script waiting for it to be ready, and the third is Solr starting. And with that done, the cloud UI starts up as expected.
And Sitecore's init container happily creates the correct collections...
I've run this up and down a lot of times in testing, on both of my laptops here - the work one that was broken before, and my personal one which had never shown the error. With this setup it has not failed for me once, where it failed fairly regularly with the original image. So that makes me reasonably sure it's a good fix. I've also tried applying these changes to the default Sitecore 10.3 XM docker compose setup, and it appears to work fine with that too.
I think this "fix" sits fairly well with standard Sitecore development. It doesn't change the structure of your docker setup, but it does provide a way to resolve the startup bug. Importantly for me, it doesn't change the Solr config that Sitecore needs. I'd seen some discussion in Slack of fixing this issue by going back to using basic Solr, rather than SolrCloud. But that involves changing config in Sitecore itself - something that you then probably have configured differently in production. And that seems like a source of potential release issues to me, so I feel happier avoiding that route for my projects.
I've put a bare-bones example of this in a GitHub repo, if you want to experiment. You can clone and run this to get SolrCloud running with standard (empty) XM collections. (No license files are required - Sitecore itself isn't involved.) And hopefully this is enough detail for you to be able to apply these changes to your own project's docker config if you want.
So seems like a win to me.
And maybe one day I'll work out why the base images don't work reliably on some machines...