I've had some conversations recently about odd issues with search-driven sites, whose root cause was related to disaster recovery patterns. While it's important to make sure that your business-critical website has a good backup and recovery process in place, it's also important to pay attention to how to correctly configure these scenarios...
One example of the issues I've seen went something like this: The client had a website which was heavily reliant on search. They had a large product offering in Sitecore, which had integrations with back-end systems to keep the content up-to-date, and they had these items configured to be indexed by ContentSearch. Their site was running SolrCloud as its back-end search technology, and it was configured with "switch on rebuild" enabled, to ensure that searches continued to work even during large, slow full index rebuilds.
But there was an intermittent issue with the site. Editors would update products, publish their changes, and observe that the public site had updated correctly. But later these changes would disappear from search again. Older, out-of-date results would become visible on the site every so often. And unsurprisingly that behaviour was a problem for the client...
Most people's first instinct with an issue like this would be to look at Solr. If the search results go from "correct" to "wrong" without obvious user intervention then that implies a problem with how data is indexed. But Solr itself isn't the root cause here. Spending time examining the details of the search configuration, and how it processes data at runtime, will often show up the same core problem: Solr will happily index the correct data, but every so often it will receive an unexpected "swap aliases" command from the "switch on rebuild" process. That causes the search engine to start serving content from an old (out-of-date) version of the index - leading to the odd results users were seeing.
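If you want to confirm that an unexpected alias swap really is what's happening, one option is to poll Solr and record what each alias points at over time. Here's a minimal Python sketch of that idea. It assumes the Solr Collections API `LISTALIASES` action is reachable from wherever you run it, and the base URL, alias name and collection names you pass in are placeholders for whatever your own deployment uses.

```python
import json
import urllib.request


def fetch_aliases(solr_base_url):
    """Call the Solr Collections API LISTALIASES action and return the parsed JSON."""
    url = solr_base_url.rstrip("/") + "/admin/collections?action=LISTALIASES&wt=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def alias_targets(listaliases_response):
    """Map each alias name to the list of collections it currently points at."""
    aliases = listaliases_response.get("aliases", {})
    return {
        name: [c.strip() for c in target.split(",")]
        for name, target in aliases.items()
    }


def find_swaps(samples, alias):
    """Given (label, LISTALIASES response) samples in time order, return the
    (label, old_target, new_target) transitions where the alias changed."""
    swaps = []
    previous = None
    for label, response in samples:
        current = alias_targets(response).get(alias)
        if previous is not None and current != previous:
            swaps.append((label, previous, current))
        previous = current
    return swaps
```

In practice you'd call `fetch_aliases("http://your-solr-host:8983/solr")` on a schedule, keep each response with a timestamp as its label, and pass the collected list to `find_swaps`. Any swap which doesn't line up with a rebuild you started deliberately is worth chasing back to the server that issued it.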
But why would it decide to swap back to the old index? The answer may well be hiding in your disaster recovery setup...
For search infrastructure, fault-tolerance is fairly easy with SolrCloud. You spin up three or more Solr nodes, and configure them as a load-balanced cluster. Then you configure your indexes to be replicated across those nodes. So if any node fails, queries and index operations are handled by other nodes, while you recover the broken one. For more scale and tolerance to faults, you spin up more nodes. And, as noted, Sitecore is often configured with "switch on rebuild" to ensure you always have an index available to query - even in the middle of a rebuild operation.
To get fault tolerance for your website you spin up extra servers too. ARM Templates or Kubernetes config can make this pretty easy. It's commonly done for CD servers, but some clients I come across want uptime guarantees and service recovery KPIs to apply to their authoring environment too. So in their minds it makes sense to spin up extra copies of their standard content management role in order to ensure they have a backup in case of disaster.
The underlying issue here is indexing strategies. Your default Sitecore CM instance is configured to be the indexing role as well as the CM role for your website. That means it runs the processes which watch your databases for changes and publishing events, and when these happen it fires off the commands and data to Solr to update indexes.
Most of the time that indexing process will work fine - but as Sitecore note in their documentation, you should only have one server which is responsible for index updates at any time. So if you spin up a "hot backup" content management server using the default CM config you're breaking that rule. You end up with two servers which both think they should be maintaining the Solr indexes. And that means every so often they will trip over each other and mess up your indexes.
You start to see situations where your backup CM server triggers an index swap when it shouldn't - where the "offline" index is not actually up-to-date. And that makes it look like content is dropping out of your search indexes.
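To make that failure mode a bit more concrete, here's a deliberately simplified toy model in Python. It is not how Sitecore or Solr are implemented: `ToySolr`, `ToyIndexer` and the collection names are all invented for illustration, and "content version" is just an integer. It only shows how a second server issuing a swap, without having just rebuilt the offline index, puts stale content back in front of users.

```python
class ToySolr:
    """Two collections plus one alias saying which of them serves queries."""

    def __init__(self):
        # Content version held by each collection (hypothetical names).
        self.collections = {"main": 1, "rebuild": 1}
        self.alias = "main"

    def offline(self):
        """Name of the collection that is NOT currently serving queries."""
        return "rebuild" if self.alias == "main" else "main"

    def swap(self):
        """Point the alias at the other collection."""
        self.alias = self.offline()

    def served_version(self):
        return self.collections[self.alias]


class ToyIndexer:
    """A server which believes it is responsible for maintaining the indexes."""

    def __init__(self, name, solr):
        self.name = name
        self.solr = solr

    def full_rebuild(self, version):
        # Switch on rebuild: fill the offline collection, then swap it in.
        self.solr.collections[self.solr.offline()] = version
        self.solr.swap()

    def incremental_update(self, version):
        # Publishing events update whichever collection is live right now.
        self.solr.collections[self.solr.alias] = version

    def swap_without_rebuild(self):
        # The problem case: a swap issued while the offline collection is stale.
        self.solr.swap()


def run_scenario():
    solr = ToySolr()
    primary = ToyIndexer("cm-primary", solr)
    backup = ToyIndexer("cm-backup", solr)
    served = []
    primary.full_rebuild(2)
    served.append(solr.served_version())
    primary.incremental_update(3)
    served.append(solr.served_version())
    backup.swap_without_rebuild()
    served.append(solr.served_version())
    return served
```

Calling `run_scenario()` returns the content version visible to the public site after each step. The last value drops back to the oldest version, because the backup server's swap exposes a collection that never received the later updates.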
Sitecore's documentation for SolrCloud and switch-on-rebuild states that in any deployment, precisely one server must be responsible for indexing:
To achieve this, you have three broad choices:
Which of these works best for you will, of course, depend on other criteria of your project. But you need to follow Sitecore's rules to avoid having issues.
Updated to add:
After I published this, one of my colleagues pointed out another fun edge case here. There's a config setting which controls whether Sitecore tries to force aliases to exist when the site starts up:

<!-- ENFORCES ALIAS CREATION ON INDEX INITIALIZATION
     If enabled, index aliases will be created on Solr during the index initialization process.
     Default value: false
-->
<setting name="ContentSearch.Solr.EnforceAliasCreation" value="false" />
If you have this set to true, even if a CM server has all its index strategies set to "manual", it will still end up resetting the active alias to the "non rebuild" one each time the Sitecore process recycles. And that may well be the wrong one based on the current state of your public site...
So it's worth checking the state of that setting if you're seeing odd behaviour with indexes even though you thought you'd followed the rules above.
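If you have several CM instances to check, a small script can save some clicking around. This is a minimal Python sketch which looks for that setting in a chunk of config XML. It assumes you feed it the final merged configuration for each server (for example, the output saved from Sitecore's showconfig.aspx admin page), because it only reads a plain `value` attribute and does not try to resolve patch files itself. The function names are my own.

```python
import xml.etree.ElementTree as ET

SETTING_NAME = "ContentSearch.Solr.EnforceAliasCreation"


def setting_values(config_xml, name=SETTING_NAME):
    """Return the value attribute of every <setting> element with the given name."""
    root = ET.fromstring(config_xml)
    return [
        element.get("value")
        for element in root.iter("setting")
        if element.get("name") == name
    ]


def alias_creation_enforced(config_xml):
    """True if the last matching setting in the XML has the value "true".

    A missing setting is treated as false, matching the documented default.
    """
    values = setting_values(config_xml)
    if not values:
        return False
    return (values[-1] or "").strip().lower() == "true"
```

Run `alias_creation_enforced` against the merged config from each CM server. Any server which returns `True` is a candidate for resetting your aliases when its Sitecore process recycles.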