Jeremy Davis
Jeremy Davis
Sitecore, C# and web development
Jeremy Davis
Jeremy Davis
Sitecore, C# and web development
Article printed from: https://blog.jermdavis.dev/posts/2022/docker-zookeeper-solr-fail

Strange Docker / Zookeeper errors

You can't win them all...

Published 26 September 2022
Updated 12 October 2022

Usually with these blog posts, I find a problem, I fight with it for a bit, and then I solve the problem. But this post has been sitting half-written in my publishing queue since May (!) this year, and I have entirely failed to solve this issue. So I'm admitting defeat, and publishing this anyway because maybe one of you knows the answer. Or at least it might serve as a warning...

My issue is that I've been working through some really odd and annoying Solr issues which only manifest in Docker on one laptop. I'm really not sure if these are issues that others might see, or if this is a problem that's entirely down to this laptop's setup. But they're definitely a problem...

The initial issue

I was supposed to be working on some proof-of-concept work around JSS and Docker. My starting point was the example Nextjs / Docker project which is available from Sitecore. In theory, you can run dotnet new to create an example project with all the relevant files for Sitecore 10.2, run init.ps1 to configure stuff for Docker and then run up.ps1 to start your site. And in the past that's worked fine for me.

But a week or so ago, the instance I had set up stopped working for me. I could run up.ps1 but it would respond that it had timed out waiting for the CM role to become available:

Docker reporting that it has failed to connect to the Sitecore CM role

I tried a lot of the classic "fix my Sitecore Docker instance" tricks: docker-compose down and then up again, rebooting the computer, running clean.ps1 to get rid of the state, running up.ps1 again. Running init.ps1 again. But none of those solved the problem.

Looking at the state of the setup in more detail, the root problem was Solr. It was starting and appeared to be running, but the logs showed two errors. When it first starts, it complains about not being able to see Zookeeper:

2022-04-28 10:39:01.496 INFO  (main) [   ] o.e.j.u.log Logging initialized @2211ms to org.eclipse.jetty.util.log.Slf4jLog
2022-04-28 10:39:01.668 INFO  (main) [   ] o.e.j.s.Server jetty-9.4.34.v20201102; built: 2020-11-02T14:15:39.302Z; git: e46af88704a893fc12cb0e3bf46e2c7b48a009e7; jvm 1.8.0_252-b09
2022-04-28 10:39:01.699 INFO  (main) [   ] o.e.j.d.p.ScanningAppProvider Deployment monitor [file:///C:/solr/server/contexts/] at interval 0
2022-04-28 10:39:02.856 INFO  (main) [   ] o.e.j.w.StandardDescriptorProcessor NO JSP Support for /solr, did not find org.apache.jasper.servlet.JspServlet
2022-04-28 10:39:02.871 INFO  (main) [   ] o.e.j.s.session DefaultSessionIdManager workerName=node0
2022-04-28 10:39:02.871 INFO  (main) [   ] o.e.j.s.session No SessionScavenger set, using defaults
2022-04-28 10:39:02.871 INFO  (main) [   ] o.e.j.s.session node0 Scavenging every 600000ms
2022-04-28 10:39:02.965 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter Using logger factory org.apache.logging.slf4j.Log4jLoggerFactory
2022-04-28 10:39:02.965 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter  ___      _       Welcome to Apache Solr™ version 8.8.2
2022-04-28 10:39:02.965 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter / __| ___| |_ _   Starting in cloud mode on port 8983
2022-04-28 10:39:02.965 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter \__ \/ _ \ | '_|  Install dir: C:\solr
2022-04-28 10:39:02.965 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter |___/\___/_|_|    Start time: 2022-04-28T10:39:02.965Z
2022-04-28 10:39:02.981 INFO  (main) [   ] o.a.s.c.SolrPaths Using system property solr.solr.home: c:\\data
2022-04-28 10:39:02.981 INFO  (main) [   ] o.a.s.c.SolrXmlConfig Loading container configuration from c:\data\solr.xml
2022-04-28 10:39:03.075 INFO  (main) [   ] o.a.s.c.SolrXmlConfig MBean server found: [email protected], but no JMX reporters were configured - adding default JMX reporter.
2022-04-28 10:39:03.528 INFO  (main) [   ] o.a.s.h.c.HttpShardHandlerFactory Host whitelist initialized: WhitelistHostChecker [whitelistHosts=null, whitelistHostCheckingEnabled=true]
2022-04-28 10:39:03.700 WARN  (main) [   ] o.e.j.u.s.S.config Trusting all certificates configured for [email protected][provider=null,keyStore=null,trustStore=null]
2022-04-28 10:39:03.700 WARN  (main) [   ] o.e.j.u.s.S.config No Client EndPointIdentificationAlgorithm configured for [email protected][provider=null,keyStore=null,trustStore=null]
2022-04-28 10:39:03.856 WARN  (main) [   ] o.e.j.u.s.S.config Trusting all certificates configured for [email protected][provider=null,keyStore=null,trustStore=null]
2022-04-28 10:39:03.856 WARN  (main) [   ] o.e.j.u.s.S.config No Client EndPointIdentificationAlgorithm configured for [email protected][provider=null,keyStore=null,trustStore=null]
2022-04-28 10:39:03.918 INFO  (main) [   ] o.a.s.c.SolrZkServerProps Reading configuration from: c:\data\zoo.cfg
2022-04-28 10:39:03.918 INFO  (main) [   ] o.a.s.c.SolrZkServer STARTING EMBEDDED STANDALONE ZOOKEEPER SERVER at port 9983
2022-04-28 10:39:03.918 WARN  (main) [   ] o.a.s.c.SolrZkServer Embedded Zookeeper is not recommended in production environments. See Reference Guide for details.
2022-04-28 10:39:04.418 INFO  (main) [   ] o.a.s.c.ZkContainer Zookeeper client=localhost:9983
2022-04-28 10:39:04.996 WARN  (Thread-12) [   ] o.a.z.s.ServerCnxnFactory maxCnxns is not configured, using default value 0.
2022-04-28 10:39:05.012 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Waiting for client to connect to ZooKeeper
2022-04-28 10:39:05.028 WARN  (NIOWorkerThread-1) [   ] o.a.z.s.NIOServerCnxn Close of session 0x0 => java.io.IOException: ZooKeeperServer not running
	at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544) java.io.IOException: ZooKeeperServer not running
	at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544) ~[?:?]
	at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332) ~[?:?]
	at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522) ~[?:?]
	at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_252]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_252]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
2022-04-28 10:39:05.028 WARN  (NIOWorkerThread-1) [   ] o.a.z.s.WorkerService Unexpected exception => java.lang.NullPointerException
	at org.apache.zookeeper.server.ZooKeeperServer.removeCnxn(ZooKeeperServer.java:285) java.lang.NullPointerException: null
	at org.apache.zookeeper.server.ZooKeeperServer.removeCnxn(ZooKeeperServer.java:285) ~[?:?]
	at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:593) ~[?:?]
	at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:583) ~[?:?]
	at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:374) ~[?:?]
	at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522) ~[?:?]
	at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_252]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_252]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]

					

And then later in the logs, I got errors like this:

2022-04-27 14:49:43.626 INFO  (main) [   ] o.e.j.s.AbstractConnector Started [email protected]{HTTP/1.1, (http/1.1, h2c)}{0.0.0.0:8983}
2022-04-27 14:49:43.626 INFO  (main) [   ] o.e.j.s.Server Started @38620ms
2022-04-27 14:50:05.641 ERROR (qtp572868060-18) [   ] o.a.s.s.SolrDispatchFilter Error processing the request. CoreContainer is either not initialized or shutting down.
2022-04-27 14:50:05.725 WARN  (qtp572868060-18) [   ] o.e.j.s.HttpChannel /solr/admin/cores => javax.servlet.ServletException: javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down.
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:162)
javax.servlet.ServletException: javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down.
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:162) ~[jetty-server-9.4.34.v20201102.jar:9.4.34.v20201102]
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[jetty-server-9.4.34.v20201102.jar:9.4.34.v20201102]
	at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322) ~[jetty-rewrite-9.4.34.v20201102.jar:9.4.34.v20201102]

					

And if you try to browse to the Solr instance now you get this friendly error:

CoreContainer is not initialised error - when browsing Solr

And that explains the Sitecore instance failing to start: The Docker error is about the solr-init container being in an error state, which makes perfect sense if Solr itself isn't working.

Further confusion

Having worked out what was breaking, I tried a variety of experiments to work out what to do in order to get past this, and found some confusing behaviour:

I could use dotnet new to create a new instance of the example project. This would start the first time I ran up.ps1 but would then show the same error as above on the second and subsequent attempts. I could use notepad to edit the .env file for one of these projects and change the COMPOSE_PROJECT_NAME variable. Once saved, this would allow the site to start once, and then subsequent attempts would fail with the same error. And I could edit the docker-compose.yml file(s) to tweak the settings (say, exposing a new port) and save that change. Again, it would start once and then subsequent attempts would fail. Rebooting my computer would also allow things to start once.

That suggests to me that something to do with building the image has an affect on this issue. Maybe something to do with there being state data which doesn't exist on the first start of a new image, but does exist (and causes an issue) on the subsequent starts.

So I could use these as work-arounds - but they all involved changing the project I was working on, or rebooting - which seemed heavy handed. At some point I'd accidentally commit one of my hacky changes to source control and risk messing up other team members' work...

And the overall issue made me very confused.

I wondered if Zookeeper was actually broken, when I was seeing the errors in the Solr logs. So I tried a test where I allowed up.ps1 to run to the error shown above, and then connected a Powershell session into the Solr container with docker exec -it sitecore_jss_poc3_solr_1 powershell.exe. Once in there I could run a bit of script to call Zookeeper API to ask it if it was alive:

$client = New-Object System.Net.Sockets.TcpClient

$client.Connect("localhost", 9983)
$ns = [System.Net.Sockets.NetworkStream]$client.GetStream()

$sendBytes = [System.Text.Encoding]::ASCII.GetBytes("ruok")
$ns.Write($sendBytes, 0, $sendBytes.Length)

$buffer = New-Object 'byte[]' 10
$bytesRead = $ns.Read($buffer, 0, 10)

$receivedBytes = New-Object 'byte[]' $bytesRead
[System.Array]::Copy($buffer, $receivedBytes, $bytesRead)

$result= [System.Text.Encoding]::ASCII.GetString($receivedBytes)

Write-Host "Response: $result"

					

Confusingly, that returned success:

Zookeeper API returning success

So whatever is preventing Solr from seeing Zookeeper, Zookeeper itself is running...

Flailing about for an answer

I tried making a cut-down docker-compose file set which just started Solr, in order to be able to test these issues faster. That showed all the same behaviour as the main Sitecore files. Copying that cut-down project to my personal laptop worked fine, and did not show any of these issues. That's a pretty clear indication that the issue is something specific to the work laptop I started on.

At this point I hit Google pretty hard, spoke to others in my company, as well as asking about this in the Docker channel in Sitecore Slack. I found a small number of others who'd seen similar issue, and some muttering about how the current version of Docker might be to blame for some issues. (I had upgraded to 4.7.1 (77678) recently, so that was plausible)

Reverting to Docker Desktop v4.6 didn't seem to resolve the issue. Stopping all the services for my office VPN didn't resolve the issue. One suggestion was to run netsh winsock reset in an elevated session and reboot. But that seemed to have no more effect that rebooting would have had on its own.

To revert back to "latest" Docker Desktop after these tests I did the uninstall, and noticed a few folders left lying about which seemed Docker-related. So I manually tidied all the files in docker and docker desktop folders under c:\programdata\ and in the children of %appdata% on my machine. But after reinstalling that didn't seem to make a difference.

Next on my list to consider was, could this be a docker networking issue? I tried running docker network ls to see what existed:

PS C:\Users\jeremy.davis\source\repos\test> docker network ls
NETWORK ID     NAME             DRIVER    SCOPE
a8aafbd09e1f   Default Switch   ics       local
2190f8e7c643   nat              nat       local
e8c1255ef410   none             null      local

					

The "Default Switch" entry wasn't there on a colleague's machine. And it went away when I speculatively ran the network prune command:

PS C:\Users\jeremy.davis\source\repos\test> docker network prune
WARNING! This will remove all custom networks not used by at least one container.
Are you sure you want to continue? [y/N] y
Deleted Networks:
Default Switch
PS C:\Users\jeremy.davis\source\repos\test> docker network ls
NETWORK ID     NAME      DRIVER    SCOPE
2190f8e7c643   nat       nat       local
e8c1255ef410   none      null      local

					

After doing this, my Solr test compose file would start. Not 100% of the time. But more than 50% of the time. Which is a significant step forwards, but clearly still not right somehow. And the main Sitecore compose project would start with the same reliability too.

Note: Take care to ensure you've run docker-compose down before pruning - you'll get odd errors about not being able to start the containers if you don't:

Network not found error from docker

If you get this "Network not found" error, run down and then up again to resolve it.

But I've not managed to get it any better than this. I still get that error about not finding Zookeeper sometimes when starting the Solr container.

Explanations?

I don't have much of an explanation to be honest. It feels like this is somehow an issue with Docker networking's interactions with the container or with Windows. But how or why this isn't working on my particular machine is not clear to me.

So I'm putting this here in case it makes more sense to someone else, or in case it helps someone else who's stuck. But it's probably neither the right solution, or a good one. Just a hacky workaround which has helped me.

My suspicion is that this is down to something specific in the build or state of my work laptop, since I don't see these issues on my personal laptop. Maybe one day I'll work out what that is. More likely I'll end up reformatting and it will magically go away after that. But for the moment I have the docker network prune work around to mostly bypass the error here and get on with some work...

To be honest, there are days when I don't really like computers... ;-)

Updated to add: After I posted this Rob Ahnemann posted his own take on how to deal with this issue. If you've hit a problem like this, you may want to check that out too, for an alternate take.