Jeremy Davis
Sitecore, C# and web development
Article printed from: https://blog.jermdavis.dev/posts/2023/workaround-solr-docker-issue

A fix for Sitecore's developer SolrCloud containers failing to find ZooKeeper

This fix may help you resolve startup issues with the internal ZooKeeper instance

Published 13 March 2023

A while back I wrote up some notes on a problem some people were seeing with Sitecore's SolrCloud developer container that I'd been unable to fix. It was the worst sort of technical problem, happening irregularly on some computers, but never rearing its head on others. So it's taken me a while to get around to coming up with a fix for this. But if you've suffered from the problems described in my previous post, this is an option for you:

The core issue

The problem I was seeing was that when you started up Sitecore's SolrCloud developer container on some computers, you would end up with a broken Solr instance. And once that fails, Sitecore can't start because the initialisation container can't create the collections required, and Sitecore then cannot run any searches. The underlying error seemed to be that sometimes SolrCloud could not get a network connection to its internal ZooKeeper instance. Without that it could not start...

The error didn't make a lot of sense to me, as I couldn't see why a network connection could fail between two processes inside a single Docker container. I've spent a chunk of time looking at the source code for both Solr and ZooKeeper, and I can see where the thing is falling over, but I failed to find any reason why. There are multiple threads involved in running this code - so part of me wonders if this is a threading issue. But I don't have the Java debugging skills to nail this down...

So falling back to the old developer's approach of "if in doubt, fiddle with stuff", I've tried different Sitecore image versions, different container isolation models, different Docker network settings, and a variety of alternative config settings for SolrCloud, ZooKeeper and Docker. But none managed any better than "make the problem a bit less likely" for me.

But it struck me the other day that one way to fix it might be to run a separate ZooKeeper instance and ensure that was up before starting SolrCloud. In theory this should be done by adding another container to the Docker compose file (so ZooKeeper runs in one container and Solr in another), but that's changing the structure of the Sitecore instance, and I wanted to try and avoid that. So I found some time to sit down and make it work with both Solr and ZooKeeper in one container.

And that meant modifying the default SolrCloud image Sitecore ship...

My approach to a fix

My thought was that getting ZooKeeper to run separately requires a few steps:

  1. Adding a "build" config that layers extra code and config on top of Sitecore's Solr image
  2. Writing a Dockerfile for this build to install ZooKeeper and put config in the right places
  3. Ensuring the config files for SolrCloud and ZooKeeper are right for this scenario
  4. Modifying the container's Entrypoint to allow it to run both SolrCloud and ZooKeeper
  5. Ensuring the Entrypoint script can wait for ZooKeeper to be running

None of those is too tricky, but they all need to work together...

Step 1: To add the build config, follow the pattern that's already present in Sitecore's container examples. In your override file, add an image element to specify what you want the new image to be called, and then add the build elements to say where the Dockerfile is and what to base the new image on:

  solr:
    image: ${REGISTRY}${COMPOSE_PROJECT_NAME}solr:8.8.2-${SITECORE_VERSION}
    build:
      context: ./docker/build/solr
      args:
        PARENT_IMAGE: ${SITECORE_DOCKER_REGISTRY}nonproduction/solr:8.8.2-${SITECORE_VERSION}
    volumes:
      - type: bind
        source: .\docker\data\solr
        target: c:\data

You should make the parent image the right one for your version of Sitecore, and name your own image appropriately. I was working with 10.2 - hence Solr 8.8.2. But you can determine the correct version from Sitecore's compatibility table, and look at the image names in the default docker compose files for your version to get examples.

Step 2: You'll need to create the solr folder under docker\build and put a Dockerfile in it.

That file needs to acquire ZooKeeper, extract it, delete some cruft and apply the config. That can be done with:

# escape=`

ARG PARENT_IMAGE
FROM ${PARENT_IMAGE}

RUN powershell "Invoke-WebRequest https://dlcdn.apache.org/zookeeper/zookeeper-3.7.1/apache-zookeeper-3.7.1-bin.tar.gz -OutFile c:\zk.tar.gz; `
                New-Item -Type Directory c:\apache-zookeeper-3.7.1-bin | out-null; `
                tar -xzf c:\zk.tar.gz; `
                Remove-Item c:\zk.tar.gz; `
                Rename-Item apache-zookeeper-3.7.1-bin zk; `
                Remove-Item c:\zk\docs\ -Recurse -Force; `
                New-Item -Type Directory c:\data\zoo_data | out-null"

COPY zoo.cfg c:\zk\conf\
COPY log4j.properties c:\zk\conf\

COPY Start.ps1 c:\Start.ps1

The meat of this is the RUN powershell statement that's executing a series of steps to get ZooKeeper in place. It's doing the download, making sure there's a folder to put it in, extracting the archive, giving the folder a shorter name and tidying up the mess. Once that completes the code for ZooKeeper is in place. The following lines then copy in some config files and the entrypoint script, which also need to exist in the same folder as the Dockerfile.

It took me a while to get this step to work though. Docker has an ADD instruction which can both download files and extract .tar.gz archives. But oddly it won't do both at once: if the source is a URL it downloads the file but does not extract it, whereas if the source is an archive in the build context for your image it will extract it. Having banged my head against that for a while, I just fell back to PowerShell. And pleasingly, since the last time I tried to script a ZooKeeper install, Windows has gained a native tar command that understands unix archives. So no need to install an add-on like 7-Zip or similar now.

Step 3: The zoo.cfg file is the basic config for a standalone instance of ZooKeeper:

tickTime=2000
dataDir=c:\\data\\zoo_data
clientPort=2181
4lw.commands.whitelist=*

That's pretty standard for basic ZooKeeper config, but there are two important points. Firstly the data directory points to the place that SolrCloud would have put data for its internal ZooKeeper. That isn't a requirement, but you do need to make sure this data is written to a volume saved outside the container. For SolrCloud to survive the containers recycling, both the Solr and ZooKeeper data need to persist - and that's what this volume provides. It seemed simplest not to move it.

And secondly the 4lw.commands.whitelist setting is needed in recent versions of ZooKeeper, to control which "four letter word" network commands it will respond to. Without it, the ruok command used later to check that ZooKeeper is running would not get a useful answer. For production you'd want something more specific, but here in a development environment (and secured inside the container) turning everything back on seemed fine.
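As a rough illustration of what "something more specific" means, here is a small Python sketch that reads zoo.cfg text and reports whether a given four letter word would be permitted. The function name is hypothetical, and the fallback of only srvr being allowed when no whitelist line exists is an assumption based on my reading of the ZooKeeper admin docs, so check it against your version.

```python
def four_letter_word_allowed(zoo_cfg_text: str, command: str) -> bool:
    """Return True if zoo.cfg's 4lw.commands.whitelist permits the command."""
    # Assumption: with no whitelist line, only "srvr" is enabled.
    allowed = {"srvr"}
    for raw_line in zoo_cfg_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments and anything that isn't key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "4lw.commands.whitelist":
            # The value is a comma separated list, or "*" for everything.
            allowed = {item.strip() for item in value.split(",") if item.strip()}
    return "*" in allowed or command in allowed
```

With the config above this returns True for ruok. A tighter whitelist such as "ruok, stat" would still let the startup check work while leaving the other commands switched off.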

The first time I made all this run, I realised that the default config for ZooKeeper generates a lot of logspam. It seems like this isn't of much use most of the time, so I've added the log4j.properties config file to turn the logging down. The base of this file is just the default values from the ZooKeeper download, but the important settings I modified are:

# near the top of the file
zookeeper.root.logger=ERROR,CONSOLE
zookeeper.console.threshold=ERROR

# near the bottom of the file
zookeeper.auditlog.threshold=ERROR
audit.logger=ERROR, RFAAUDIT

That turns all the logging down to "errors only".

Steps 4 & 5: To modify the entrypoint, I pulled a copy of start.ps1 out of Sitecore's base container, and made some changes:

To allow "wait for ZooKeeper" to work, I pasted in a variation of a function I'd used in previous bits of SolrCloud automation:

function Wait-ForZooKeeperInstance
{
    param(
        [string]$zkHost,
        [int]$zkPort
    )

    Write-Host "Waiting for ZooKeeper at $($zkHost):$zkPort"

    $sawError = $false
    $isUp = $false
    while($isUp -ne $true)
    {
        $client = $null
        try
        {
            # Open a TCP connection to ZooKeeper's client port
            $client = New-Object System.Net.Sockets.TcpClient
            $client.Connect($zkHost, $zkPort)
            $ns = $client.GetStream()

            # Send the "ruok" four letter command
            $sendBytes = [System.Text.Encoding]::ASCII.GetBytes("ruok")
            $ns.Write($sendBytes, 0, $sendBytes.Length)

            # A healthy server replies with "imok"
            $buffer = New-Object 'byte[]' 10
            $bytesRead = $ns.Read($buffer, 0, $buffer.Length)
            $result = [System.Text.Encoding]::ASCII.GetString($buffer, 0, $bytesRead)

            if( $result -eq "imok" )
            {
                $isUp = $true
                if( $sawError -eq $true )
                {
                    # Finish the line of progress dots
                    Write-Host
                }
            }
        }
        catch
        {
            # Connection refused or dropped - ZooKeeper isn't ready yet
            $sawError = $true
            Write-Host "." -NoNewline
        }
        finally
        {
            # Release the socket whether or not this attempt worked
            if( $null -ne $client )
            {
                $client.Dispose()
            }
        }

        if( $isUp -ne $true )
        {
            # Pause briefly so the retries don't spin
            Start-Sleep -Milliseconds 500
        }
    }

    Write-Host "ZooKeeper is up"
}

It looks a bit complex, but basically it tries to connect to the ZooKeeper command port and send an "are you ok?" message. If it can't connect or gets an error then likely ZooKeeper isn't up so it tries again. Once the correct "I'm ok" message comes back, ZooKeeper is ready to go, and the function can return.
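If you want to run the same check from outside PowerShell, the protocol is simple enough to sketch in a few lines of Python. This is only an illustration and not part of the container setup. The function names, timeouts and retry interval are my own assumptions, and unlike the function above it gives up after a maximum wait rather than looping forever.

```python
import socket
import time


def zk_is_ok(host: str, port: int, timeout: float = 2.0) -> bool:
    """Send ZooKeeper's "ruok" command and report whether "imok" came back."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            # Read until four bytes have arrived or the server closes the socket.
            reply = b""
            while len(reply) < 4:
                chunk = sock.recv(16)
                if not chunk:
                    break
                reply += chunk
            return reply == b"imok"
    except OSError:
        # Refused, reset or timed out: treat all of these as "not up yet".
        return False


def wait_for_zookeeper(host: str, port: int, max_wait: float = 60.0,
                       interval: float = 0.5) -> bool:
    """Poll until ZooKeeper answers "imok" or max_wait seconds have passed."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if zk_is_ok(host, port):
            return True
        time.sleep(interval)
    return False
```

Calling wait_for_zookeeper("localhost", 2181) returns True once the server answers, or False if the time limit runs out, which lets the caller decide whether to fail the startup instead of hanging.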

That can be used as part of the logic to start ZooKeeper and Solr, which replaces the default startup in Sitecore's script:

$dataPathToTest = Join-Path $DataPath solr.xml
if (Test-Path $dataPathToTest) {
    Write-Host "INFO: Existing Solr configuration found in '$DataPath'..."
}
else {
    Write-Host "INFO: Solr configuration not found in '$DataPath', copying clean configuration..."
    Copy-Item $InstallPath\** $DataPath -Recurse -Force -ErrorAction SilentlyContinue
}

Start-Process -FilePath "c:\zk\bin\zkServer.cmd" -NoNewWindow

Wait-ForZooKeeperInstance "localhost" 2181

& "c:\solr\bin\solr.cmd" start -port $SolrPort -f -z localhost:2181

First it calls zkServer.cmd (the batch file that runs an instance of ZooKeeper) using Start-Process. That ensures it runs in the background, so PowerShell's execution of the script continues as soon as that command is issued. The call to Wait-ForZooKeeperInstance will then block until it detects that ZooKeeper has started correctly. And finally Solr gets executed. The call to start SolrCloud now needs a new parameter though. Instead of saying "just start in Cloud mode with your internal ZooKeeper" we now pass -z localhost:2181 to tell it to run in cloud mode with the external ZooKeeper connection specified.
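The general shape of that sequence - start a dependency in the background, wait for it to be ready, then run the main process in the foreground - can be sketched in Python as below. It's a generic illustration, not a replacement for the entrypoint script: the function name and parameters are hypothetical, and stopping the background process at the end is my own design choice (in the container, ZooKeeper just goes away when the container stops).

```python
import subprocess
import time
from typing import Callable, Sequence


def run_after_dependency(dependency_cmd: Sequence[str],
                         is_ready: Callable[[], bool],
                         main_cmd: Sequence[str],
                         max_wait: float = 60.0,
                         interval: float = 0.5) -> int:
    """Start a dependency, wait until it is ready, then run the main command."""
    # Popen returns straight away, much like Start-Process without -Wait.
    dependency = subprocess.Popen(dependency_cmd)
    try:
        deadline = time.monotonic() + max_wait
        while not is_ready():
            if dependency.poll() is not None:
                raise RuntimeError("dependency exited before it became ready")
            if time.monotonic() >= deadline:
                raise TimeoutError("dependency did not become ready in time")
            time.sleep(interval)
        # Runs in the foreground and blocks until the main process exits.
        return subprocess.run(main_cmd).returncode
    finally:
        # Stop the background process if it is still running.
        if dependency.poll() is None:
            dependency.terminate()
        dependency.wait()
```

Here is_ready would be whatever health check suits the dependency, such as a ruok probe against port 2181. Failing fast when the dependency dies or never becomes ready is the main difference from a simple "wait forever" loop.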

And with all that in place you can run a docker-compose build, followed by a docker-compose up to get a SolrCloud instance in this new model. The container logs show the startup as:

[Image: The container logs when the new SolrCloud image starts up]

The first block is ZooKeeper starting in the background, the second is the script waiting for it to be ready, and then the third one is Solr starting. And with that done, the cloud UI starts up as expected:

[Image: The state display for ZooKeeper after the new SolrCloud image starts up]

And Sitecore's init container happily creates the correct collections...

Conclusions

I've run this up and down a lot of times in test on both my laptops here - the work one that was broken before and my personal one which had never shown the error. With this setup it has not failed for me once, where it failed fairly regularly with the original image. So that makes me reasonably sure it's a good fix. I've also tried applying these changes to the default Sitecore 10.3 XM docker compose setup, and it appears to work fine with that too.

I think this "fix" sits fairly well with standard Sitecore development. It doesn't change the structure of your docker setup, but it does provide a way to resolve the startup bug. Importantly for me, it doesn't change the Solr config that Sitecore needs. I'd seen some discussion in Slack of fixing this issue by going back to using basic Solr, rather than SolrCloud. But that involves changing config in Sitecore itself - something that you then probably have configured differently in production. And that seems like a source of potential release issues to me, so I feel happier avoiding that route for my projects.

I've put a bare-bones example of this in a GitHub repo, if you want to experiment. You can clone and run this to get SolrCloud running with standard (empty) XM collections. No license files are required, as Sitecore itself isn't involved. And hopefully this is enough detail for you to be able to apply these changes to your own project's docker config if you want.

So seems like a win to me.

And maybe one day I'll work out why the base images don't work reliably on some machines...
