Is your hot disaster recovery causing a hot mess? [by Jeremy Davis]

I've had some conversations recently about odd issues with search-driven sites, whose root cause was related to disaster recovery patterns. While it's important to make sure that your business-critical website has a good backup and recovery process in place, it's also important to pay attention to how to correctly configure these scenarios...

What might go wrong? url copied!

One example of the issues I've seen went something like this: The client had a website which was heavily relient on search. They had a large product offering in Sitecore, which had integrations with back-end systems to keep the content up-to-date and they had these items configured to be indexed by ContentSearch. Their site was running SolrCloud as it's back-end search technology, and it was configured with "switch on rebuild" enabled, to ensure that searches continued to work even during large, slow full index rebuilds.

But there was an intermittent issue with the site. Editors would update products, publish their changes, and observe that the public site had updated correctly. But later these changes would disappear from search again. Older, out-of-date results would become visible on the site every so often. And unsurprisingly that behaviour was a problem for the client...

Most people's initial assumption with an issue like this would be to look at Solr. If the search results go from "correct" to "wrong" without obvious user intervention then that implies a problem with how data is indexed. But Solr itself isn't the root cause here. Spending time examining the details of the search configuration, and how it processes data at runtime will often show up the same core problem: Solr will happily index the correct data, but every so often it will receive an unexpected "swap aliases" command from the "switch on rebuild" process. That causes the search engine to start serving content from an old (out of date) version of the index - leading to the odd results users were seeing.

But why would it decide to swap back to the old index? The answer may well be hiding in your disaster recovery setup..

What patterns are people aiming for? url copied!

For search infrastructure, fault-tolerance is fairly easy with SolrCloud. You spin up three or more Solr nodes, and configure them as a load balaced cluster. Then you configure your indexes as replicated across those nodes. So if any node fails, queries and index operations are handled by other nodes, while you recover the broken one. For more scale and tolerance to faults, you spin up more nodes. And, as noted, Sitecore is often configured with "switch on rebuild" to ensure you always have an index available to query - even in the middle of a rebuild operation.

To get fault tolerance for your website you spin up extra servers too. ARM Templates or Kubernetes config can make this pretty easy. It's commonly done for CD servers, but some clients I come across want uptime guarantees and service recovery KPIs to apply to their authoring environment too. So in their minds it makes sense to spin up extra copies of their standard content management role in order to ensure they have a backup in case of disaster.

So why might it break? url copied!

The underlying issue here is indexing strategies. Your default Sitecore CM instance is configured to be the indexing role as well as the CM role for your website. That means it runs the processes which watch your databases for changes and publishing events, and when these happen it fires off the commands and data to Solr to update indexes.

Most of the time that indexing process will work fine - but as Sitecore note in their documentation, you should only have one server who is responsible for index updates at any time. So if you spin up a "hot backup" content management server using the default CM config you're breaking that rule. You end up with two severs who both think they should be maintaining the Solr indexes. And that means every so often they will trip over each other and mess up your indexes.

You start to see situations where your backup CM server triggers an index swap in a situation it shouldn't have - where the "offline" index is not actually up-to-date. And that makes it look like content is dropping out of your search indexes.

How should we fix this? url copied!

Sitecore's documentation for SolrCloud and switch-on-rebuild states that in any deployment, precisely one server must be responsible for indexing:

To achive this, you have three broad choices:

Follow the documenation - disable indexing on all but one CM
If you have to run backup CMs up all the time, then you can follow the documentation and use role based config to ensure that you only ever have one of them which is running the indexing role. That gives you the advantage of having your hot backup CM, while keeping a supported setup. But a disadvantage here is that you have work to do in order to ensure that your CM instances get the right config. In the scenario that your indexing-enabled CM blows up, you have to make sure that another server will pick up this role. And that may be a manual task.
Use infrastructure automation to enable your backup CM
You don't have to keep a backup running all the time. You could have a mechanism to (fairly quickly) fire up a new instance in the event of disaster, but not have it up and running during normal operations. The key advantage here is probably cost - not running a server means you don't pay for it. But the related disadvantage is that in an outage you'll need to wait for this automation to complete before you have a CM again. Your choices for doing that include:
- Keep an ARM Template or the relevant Kubernetes config files available, so you can run them to fire up a new CM instance in an emergency. That process can be automated potentially.
- Or if you run in an IaaS pattern, have the instance installed but the VM / server stopped.
Configure a separate Indexing role
Sitecore supports pulling the indexing role out to a separate server. That might have licensing and runtime cost implications for you, and requires a bit of specific configuration/deployment work, but it allows your multiple CMs (who all have the same config) to share one indexing service. The advantage here is both that you have an instant backup CM in the event of an issue, and that you get to have all your CMs running the same configuration. But the downside here is cost - you're paying to run yet another role for the indexing instance.

Which of these works best for you will depend on other criteria of your project of course. But you need to follow the Sitecore's rules to avoid having issues.

Updated to add:
After I published this, one of my colleagues pointed out another fun edge case here. There's a config setting which controls whether Sitecore tries to force aliases to exist when the site starts up:

<setting name="ContentSearch.Solr.EnforceAliasCreation" value="false" />

						
If you have this set to true, even if a CM server has all its index strategies set to "manual", it will still end up resetting the active alias to the "non rebuild" one each time the Sitecore process recycles. And that may well be the wrong one based on the current state of your public site...

So worth checking the state of that setting if you're seeing odd behaviour with indexes, but you thought you'd followed rules above.

↑ Back to top

Feel like sharing?
⇒ On BlueSky
⇒ On LinkedIn
⇒ On Mastodon
⇒ On Email

Is your hot disaster recovery causing a hot mess?

It's a battle for control of your indexes!

What might go wrong? url copied!

What patterns are people aiming for? url copied!

So why might it break? url copied!

How should we fix this? url copied!

Post Headings

Sitecore MVP 2015-2025

Recent Tags

Top Tags

Recent Months

Socials