Some of the most frustrating issues in IT happen when something that was working fine stops, and you're sure you didn't change anything. Now that very few people host their own infrastructure, these issues can be compounded, because changes or problems can happen in the internals of a complex infrastructure system you have no control over. I got bitten by this recently, and while I have no answers, I suspect this is worth writing up just because it may help other people realise that the problem you have is not always your fault...
I've been working with a team on a client project that was adopted from another agency. We'd been through an onboarding process where we'd taken over the source and the release process in Azure DevOps. The project was hosted in the client's Azure, and they maintained control of that. The build and release pipelines for this work ran on Azure VMs the client hosted, rather than on Microsoft's cloud agents. We'd done some successful releases of the code to the test and production infrastructure after taking over. There had been a few issues along the way, but broadly the process appeared to be working.
We'd done a release of some new changes to the UAT infrastructure, and having got that tested and signed off, we tried to run the same release on production. But this failed in an entirely unexpected way...
The release process went through a fairly common pattern: the target AppServices get backed up, and then the project code is deployed to them.
In the "project code is deployed" step, the release pipeline uses Web Deploy to push assorted packages of code onto the target AppServices. And in this case, that step was failing. Inside this single pipeline step, some packages deployed OK, one failed, and then the rest deployed successfully.
For Google's benefit, the raw error message was:
2023-07-24T08:19:54.0960901Z ##[error]Error Code: ERROR_DESTINATION_NOT_REACHABLE
2023-07-24T08:19:54.1054292Z ##[error]More Information: Could not connect to the remote computer ("REDACTED-cm-staging.scm.azurewebsites.net"). On the remote computer, make sure that Web Deploy is installed and that the required process ("Web Management Service") is started. Learn more at: https://go.microsoft.com/fwlink/?LinkId=221672#ERROR_DESTINATION_NOT_REACHABLE.
2023-07-24T08:19:54.1146052Z ##[error]Error: Unable to connect to the remote server
2023-07-24T08:19:54.1237674Z ##[error]Error: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond REDACTED:443
2023-07-24T08:19:54.1329763Z ##[error]Error count: 1.
The fact that a number of packages succeeded, then this one failed, and then more succeeded during this step was very odd. That suggested some sort of transient error. So we tried repeating the release step. And it failed again - but with an error on a different package in the set.
We tried it a few times, and found that it failed reliably - but on a different package (or sometimes more than one) each time.
So we aborted the release, apologised to the client for the delay, and went back to the UAT site to test further. And now attempts to release to UAT failed too - with similar errors.
A developer's first instinct at this point is to go look at Google. And searching for issues with these errors around Web Deploy packages brings up a lot of people talking about how disabling "SCM" fixed issues for them. None of the posts matched the error we saw exactly, but the phrase "On the remote computer, make sure that Web Deploy is installed and that the required process ("Web Management Service") is started." popped up a lot in these posts.
The supposed fix here was relatively simple: in the AppService's configuration settings, add a new entry setting WEBSITE_WEBDEPLOY_USE_SCM to false on the AppService and its slots. So it seemed worth trying this.
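The setting those posts typically point at is WEBSITE_WEBDEPLOY_USE_SCM = false. As a sketch of applying that change - using the Azure CLI rather than the portal, and assuming placeholder resource group, app, and slot names - it looks something like:

```shell
# Disable the SCM (Kudu) endpoint for Web Deploy on the AppService.
# "my-rg", "my-app" and "staging" are placeholder names.
az webapp config appsettings set \
  --resource-group my-rg \
  --name my-app \
  --settings WEBSITE_WEBDEPLOY_USE_SCM=false

# The same setting needs applying to each deployment slot too:
az webapp config appsettings set \
  --resource-group my-rg \
  --name my-app \
  --slot staging \
  --settings WEBSITE_WEBDEPLOY_USE_SCM=false
```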
Unfortunately it broke things in an exciting new way. Now the backups for the AppService stopped working. They would start, and wait for a while before failing with a broad
##[error]An error occurred while sending the request.
message in the DevOps log. In the AppService Activity Log, a slightly more useful error was shown in the JSON version of the log data:
Timestamp: 2023-07-28T11:36:13.9557166Z
Creating temp folder. Retrieve site meta-data.
Backing up the databases. Backing up site content and uploading to the blob... Backing up site content and uploading to the blob... Backing up site content and uploading to the blob... Backing up site content and uploading to the blob... Backing up site content and uploading to the blob... Error in the backup operation: Storage access failed. The remote server returned an error: (404) Not Found.. Please delete and recreate backup schedule to mitigate.
But this still isn't a useful error, because in the process being run here there was no backup schedule to recreate - this was just PowerShell code calling the Azure backup commands...
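For reference, an on-demand AppService backup of the sort this step performs can also be triggered from the Azure CLI - a sketch with placeholder names (the container URL needs a SAS token granting write access to the storage container, which is one place a "(404) Not Found" storage error can creep in):

```shell
# Trigger an on-demand backup of the AppService into a blob container.
# The resource group, app, backup, and storage names are placeholders,
# and <sas-token> stands in for a real shared-access signature.
az webapp config backup create \
  --resource-group my-rg \
  --webapp-name my-app \
  --backup-name pre-release-backup \
  --container-url "https://mystorage.blob.core.windows.net/backups?<sas-token>"
```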
Since we had valid backups from before the issue started, we tried skipping the backup steps on the next test release to UAT. And this raised yet another error message:
2023-07-26T15:18:55.2097709Z ##[error]Error: Error Code: ERROR_USER_UNAUTHORIZED
More Information: Connected to the remote computer ("REDACTED.publish.azurewebsites.windows.net") using the Web Management Service, but could not authorize. Make sure that you are using the correct user name and password, that the site you are connecting to exists, and that the credentials represent a user who has permissions to access the site. Learn more at: https://go.microsoft.com/fwlink/?LinkId=221672#ERROR_USER_UNAUTHORIZED.
Error: The remote server returned an error: (401) Unauthorized.
Error count: 1.
Here, trying to use Web Deploy to send a package to the AppService is being blocked for security reasons. But that seemed like an odd error. This whole DevOps process was running under a Service Principal that had successfully deployed before.
To be sure, we tried the "verify credentials" button in DevOps - which worked fine - and looked at the Azure RBAC settings for the AppService. We could clearly see that the Service Principal had "contributor" rights to all the resources involved. So we concluded that this "disable SCM" fix was not right for this scenario.
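Incidentally, those RBAC rights can be checked from the command line as well as the portal - a sketch, with a placeholder Service Principal ID and resource group name:

```shell
# List the role assignments held by the deployment Service Principal,
# scoped to the resource group containing the AppService.
# The GUID and group name below are placeholders.
az role assignment list \
  --assignee 00000000-0000-0000-0000-000000000000 \
  --resource-group my-rg \
  --output table
```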
Rolling it back got rid of these backup and authorisation errors, and put us back to the original problem.
So we spent quite a bit of time talking to the client and trying to verify that nobody on their team or ours had changed anything in Azure. In parallel to that we raised a support ticket with Microsoft, and had a triage call with them where we demonstrated the issue. They didn't give us specific answers in that call, but said they would raise the issue with their AppService experts and get back to us.
While we waited, we got briefly distracted by the whole "MyGet has vanished" issue which hit a lot of Sitecore developers around this time, but having sorted that, we went back to some further tests.
And this time the UAT deployment worked fine. So after repeating this a few times to prove it wasn't just random chance, we retried the production release. And that worked fine too.
It would be fair to say there was a bit of strong language at this point.
At some point after involving Microsoft Support, the issue magically disappeared. When I went back to them and asked if they'd done anything to achieve a fix, they didn't give much helpful feedback. But this happened to coincide with an outage of Azure DevOps build pipelines in some EU regions. When pressed, the support people said the two were related, without offering any proof or explanation of why.
So happy client - but frustrated dev team.
As developers it doesn't feel good to not know why something is failing. We want to understand the root cause because it helps us learn to avoid future issues, and because it allows for a clearer explanation of the issues to our clients.
But I've come to realise over the years that we can't always get a good answer. It seems particularly true now that we mostly handle cloud-hosted systems - we only get to see the outcome of some operations. We often cannot peek "inside the box" of systems like Azure PaaS. And sometimes that means we cannot debug behaviour down to a single cause, or even prove which bit of a complex process is at fault.
We have to accept that sometimes the problem is not our doing at all. Issues with your hosting provider, or out in the wilds of the internet, can cause problems which we can't directly diagnose or manage.
So when you hit an issue that might be a 3rd party's problem, don't be shy of raising support tickets. Even if they don't get you the detailed answers you want, they may still be the thing that resolves your problem.
And sometimes that's as good as we get...