I spent some time this week working with a colleague who couldn't get his Docker instance to start up happily. It reminded me that, for all its positives, there are still some challenges with understanding the underlying issues when a developer's container instance breaks. I realised I need a "go read this" post for the start of future discussions like this, so here are some problems you might see, and some diagnostic suggestions I wanted a convenient way to share:
Trying to start your development instance and seeing something like this can be frustrating:
When something goes wrong at start-up, it's often not at all clear why. `docker-compose` says nothing helpful. So what are some of the common issues, and how can they be mitigated?
This is the problem my colleague was having. You can run `down` and `up` as many times as you like, but if something in your Sitecore website is throwing an unhandled exception which breaks ASP.NET, you're never going to get the site to start up in its default state.
The way Traefik is configured means that it relies on a health monitor endpoint in your CM container. If that does not return HTTP 200 when it is called, Traefik will not start. That leads to a really annoying situation. You're working away, and you do something that causes a YSOD. Before you fix that issue, you get distracted and end up shutting Docker down. And when you come back to work, you are unable to make Sitecore start up. That makes it way harder to fix the problem you had to start with...
So when you see the dreaded Traefik errors, your first response should be to work out why it's failed. Some good diagnostic tactics are:
Is the CM container up?
The first thing to check is the state of your containers. That can give you a clue about which way your investigations should go. Traefik can fail if the CM container starts, but then goes down again. Or it can fail if it's up but returning errors serious enough to break the health endpoint. So make use of your favourite monitoring tool to check if the CM is actually running:
If the CM is down, you should look into the Service Monitor issue below, lack of memory / disk space, or similar technical issues that would prevent it from starting or from staying started. But if the CM is up, you should think about code and config bugs in your deployment, and about networking issues.
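If you'd rather stay on the command line, `docker ps` gives you the same answer. Here's a minimal sketch; the helper name `Get-ContainerState` is just something I've made up for illustration, and the default filter assumes your CM container has "cm" somewhere in its name, so adjust it to suit your project:

```powershell
# Hypothetical helper: lists containers whose name contains $NameFilter,
# including stopped ones, with Docker's status text for each.
function Get-ContainerState {
    param([string]$NameFilter = "cm")  # assumption: your CM container name contains "cm"
    docker ps --all --filter "name=$NameFilter" --format "{{.Names}}: {{.Status}}"
}
```

Run `Get-ContainerState` and look at the status text: "Up" means the container is running, while "Exited" or "Restarting" suggests it started and then fell over.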
Check the container log
If it's up, it's worth checking what the logs say. Look at both the streamed Docker logs (which cover assorted outputs of your containers) and the Sitecore logs, if they've been written. Log data might point you at the Service Monitor error below, or it might point you towards the health endpoint failing. So pick a tool to look at the streamed log data initially:
The image above is an example of the `/healthz/ready` endpoint returning 500 - which Traefik will see as a reason not to start.
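If you don't have a log viewer to hand, the compose CLI will stream the same data. A rough sketch, assuming your CM service is called `cm` in your `docker-compose.yml` (the helper name `Get-ServiceLog` is hypothetical, and it needs to run from the folder containing your compose file):

```powershell
# Hypothetical helper: shows the most recent streamed log lines for one compose service.
function Get-ServiceLog {
    param(
        [string]$Service = "cm",  # assumption: the service name used in your docker-compose.yml
        [int]$Tail = 50           # how many of the most recent lines to show
    )
    docker-compose logs --tail $Tail $Service
}
```

For example, `Get-ServiceLog -Service cm -Tail 100` shows the last hundred lines. Repeated 500 responses for the health endpoint are the thing to look for.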
Skip past Traefik
Since Traefik is the reverse proxy for your containers, if it's down it can be difficult to see what (if any) messages are being returned if a web request fails. One trick for bypassing this issue is to make use of the fact that Docker containers behave much like small machines of their own (though they aren't true VMs) - so you can connect directly to them and run other commands. That means we can fire up PowerShell inside our misbehaving CM container:
docker exec -it <your-cm-container-name> powershell.exe
When you run that, you should get a new PowerShell session inside your container. You can use that to make an HTTP request directly to Sitecore, and see what (if any) error comes back:
Invoke-WebRequest http://localhost/ -UseBasicParsing
Note that you need to use "localhost" and "http://" here - Traefik does SSL termination for you, so the site inside the container doesn't use HTTPS. Traefik also handles the DNS naming of your site - so inside the container the site is just localhost. The parameter `-UseBasicParsing` is required whenever you use `Invoke-WebRequest` on a machine which doesn't have IE installed. Console-only instances of Windows Server like this don't have IE - so you'll get an error if you forget this parameter. If you're seeing 500s in the Docker log stream above, chances are you're going to get a pile of red text here, indicating the real ASP.NET error. You'll likely need to scroll up a bit to see the start of the message:
Quite often, this is going to tell you what's wrong. (Though in some circumstances you may get more info from requesting the health endpoint URL that's throwing 500s in the logs in the previous bullet.) The example here (a DLL binding issue that prevented ASP.NET from starting up the Sitecore app) is pretty much what my colleague was suffering from.
Once you've found your issue and corrected it, remember that you probably don't have to go through the whole up/down rigmarole. Often you can stop/delete any containers which are causing an issue, fix your problem and run `docker-compose up` again to start up your CM or Traefik containers. They'll happily make use of the Identity Server / Solr / SQL Server ones that were (probably) already running.
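As a sketch of that shortcut, something like the following would remove and recreate just the broken services. The service names `cm` and `traefik` are assumptions based on a typical Sitecore compose file, and `Restart-ComposeService` is a made-up helper name, so check both against your own setup:

```powershell
# Hypothetical helper: removes and recreates only the named compose services,
# leaving the others (SQL Server, Solr, Identity Server) untouched.
function Restart-ComposeService {
    param([string[]]$Service = @("cm", "traefik"))  # assumption: service names in your compose file
    # Stop (if needed) and delete the existing containers for these services
    docker-compose rm --stop --force @Service
    # Recreate and start them in the background
    docker-compose up --detach @Service
}
```

Calling `Restart-ComposeService` with no arguments recreates both services; pass `-Service cm` if only the CM needs replacing.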
(Note, you also have the option to disable the healthchecks that prevent you seeing the error outputs too. Depending on your situation, you may find that easier)
If you find that your CM tries to spin up, but keeps throwing "can't find my databases!" type SQL errors in the Sitecore log, then a common cause is any VPN software you might be running:
I've had this issue - if I start my office VPN client it aggressively routes all network traffic over to my corporate network. And annoyingly that includes it grabbing the "between container" network traffic that Docker is trying to use to allow your containers to talk to each other... A good test for whether this is your issue is to disconnect the VPN, down your containers, restart Docker Desktop and then try bringing your containers back up without the VPN connected. If it works, chances are your VPN is at fault.
This can be a bit of a tricky one to solve. In my case, I had to persuade my IT Ops team to create a "split tunnel" configuration where only traffic for my clients and my office went over the VPN, and other (local) traffic was ignored. That allowed me to run Docker and have code in the containers access my client's networks. Great for me getting dev work done - but potentially less secure than the default VPN setup.
I've also found that it's better to start the VPN before Docker Desktop, as Docker is not always happy with networking changes happening after it's started up...
I'm working on a v10.0 project right now, and probably the most common failure is that the CM container's Service Monitor process fails to start. Sitecore have addressed this one in newer releases if you're lucky enough to be able to upgrade your project.
But if not, I've found it seems to be made worse by your custom project code and config. This error is transient, so stopping and starting your containers can often resolve it. I've also found the error seems less likely to happen if I remove my project files from the `docker\deploy\website` folder, start up my containers, and then publish my code after vanilla Sitecore is running. I'm not sure if that's specific to my project or not, but it may be of help to others.