The last three posts have discussed some prototype code for a sitemap generator, and I want to wrap the series up with a few thoughts about how the size of your Sitemap build operation might affect your site.
The pattern the previous posts discussed was to regenerate the sitemap each time the publishing process runs. For small to medium sites that's fine, as the sitemap processing won't add much extra time. Once you get to large sites, though, this might start to become an issue. With Buckets and parallel publishing, Sitecore are starting to talk about sites with millions of items, and at that scale a five-minute publish followed by another three minutes of sitemap processing is likely to become a problem.
(And don't forget the database-server load and item-cache churn that processing a large sitemap is likely to cause.)
Talking about this issue with my colleague @steviemcgill recently, he was explaining how he'd ended up disabling sitemap generation on a project he was working on because of the performance issue it was causing. That got me thinking about strategies for addressing this problem. My suspicion is that which of these to choose for a sitemap implementation probably depends on the specifics of your site, its size and its requirements. But here are a few approaches you might consider in your solutions:
Perhaps the most obvious approach to speeding the publish up is simply to avoid processing the Sitemap data unless you really need to. The setup from the first Sitemap post generated the data whenever a publish happened, by triggering the code from the end-of-publish events. What about having the sitemap generation happen when you click a UI button instead? That means you can run publishes whenever you need to, without the extra overhead of Sitemap generation each time. Conversely, it does mean you need to remember to click the update button to make sure search engines see your changes.
As we've seen with previous bits of button-related code, you need to wire up a handler for a command and then create a button to call that command. Your handler can then raise an event, and you can register an event handler to respond to that.
The important extra complexity here is that in a multi-server solution you need to make sure that you can raise a remote event, so that it runs on all your servers. A bit of googling suggests that it's not too hard to create a custom remote event through code.
Can your Sitemap files get updated on a schedule – say twice a day or similar? If so, scheduling a task to regenerate the Sitemap might be a sensible approach. Setting up a scheduled task in Sitecore seems fairly simple according to this blog post, so it's not a great deal of extra effort to call the Sitemap-generating code via this route. However, there is one big issue here: if you want Sitecore to run your task, then the ASP.Net process for Sitecore must be in memory at the appropriate times. On a busy site that might not be an issue, but if the site is quiet IIS may drop it out of memory. So you either need to change your web server's settings to prevent it unloading the site when it has nothing to do, or ensure Sitecore's own keep-alive task fires often enough that IIS never frees up your memory. Try this blog post for some details on keeping Sitecore in memory.
I guess you could also set this up so that your sitemap was generated when a specific (secured) URL was requested, and schedule your sitemap build via a Windows scheduled task that does an HTTP GET for that URL. There's extra effort required to set that up, but it has the advantage that it can pull Sitecore back into memory even after IIS has shut it down.
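To make that concrete, here's a minimal, language-agnostic sketch of the scheduled-task side of that idea, written in Python rather than the C# the rest of the series uses. The endpoint URL and the shared-secret header are hypothetical – substitute whatever your secured handler actually checks for:

```python
# Sketch of the "ping a secured URL" trigger for a Windows scheduled task.
# REBUILD_URL and the X-Rebuild-Token header are made-up names for this
# example; your handler defines the real URL and the real security check.
from urllib.request import Request, urlopen

REBUILD_URL = "https://www.example.com/sitemap/rebuild"  # hypothetical endpoint
SECRET = "change-me"  # hypothetical shared secret the handler validates

def build_rebuild_request(url: str = REBUILD_URL, secret: str = SECRET) -> Request:
    """Build the GET request the scheduled task will fire."""
    req = Request(url, method="GET")
    # A shared-secret header keeps casual visitors from triggering builds.
    req.add_header("X-Rebuild-Token", secret)
    return req

def trigger_rebuild() -> int:
    """Fire the request; a generous timeout allows for a slow first hit
    while IIS spins the site back up into memory."""
    with urlopen(build_rebuild_request(), timeout=300) as response:
        return response.status
```

You'd then point a Windows scheduled task at a script calling `trigger_rebuild()` (a `curl` command would do equally well) and set whatever schedule suits the site.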
What if you really do want every publish to trigger a rebuild of the Sitemap? Perhaps running that process on a background thread could help free up the UI and prevent publishing users having to wait around for so long? Well yes, it probably could, but with two fairly big caveats:
The first is that you need to think about what happens if two publishes occur close enough together for the sitemap generation operations to overlap. Depending on how you implement the generation, overlapping operations can cause problems. At best you're wasting CPU cycles generating the files twice. At worst, combining overlapping operations with certain code patterns can lead to exceptions for locked files, or files ending up with incorrect data in them. You'll probably need to implement some sort of locking here to address these scenarios. In a single-server solution you can make use of Windows locking primitives; for multi-server solutions you'd need to look at keeping locking flags on disk or in a database. Alternatively, you might consider writing your output to randomly named files, and only renaming them to the correct names once the process finishes.
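Both of those safety measures can be sketched briefly. This is a language-agnostic illustration in Python of the single-server case, assuming an in-process lock is enough (a multi-server setup would need a shared flag instead); the file names are hypothetical:

```python
# Two safety measures for overlapping builds: a non-blocking lock so a
# second build skips rather than collides, and a write-to-temporary-name-
# then-rename step so readers never see a half-written sitemap file.
import os
import tempfile
import threading

_build_lock = threading.Lock()  # single-server only; multi-server needs a shared flag

def write_sitemap_atomically(path: str, xml: str) -> None:
    """Write to a temporary file in the target directory, then rename into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(xml)
    os.replace(temp_path, path)  # atomic when source and target share a volume

def rebuild_sitemap(path: str, xml: str) -> bool:
    """Return False (doing nothing) if another build is already running."""
    if not _build_lock.acquire(blocking=False):
        return False
    try:
        write_sitemap_atomically(path, xml)
        return True
    finally:
        _build_lock.release()
```

The rename trick matters because a search engine can request the sitemap at any moment; it either gets the old complete file or the new complete file, never a partial one.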
The second issue here is that IIS isn't actually very good at running background tasks. Web servers were built to deal with short-running, immediate tasks to serve up web pages. Having a background operation take a few minutes to complete wasn't really in the architectural plan for IIS. Background tasks can be killed off by IIS when it decides it needs to tear down the AppDomain your code is running in. Errors, periodic process recycling and admins clicking "stop" on a site can all do this to you.
My suspicion is that this is probably not a great way to go if you need reliable builds of big sitemaps without a lot of extra code.
How about making the Sitemap build faster by running bits of it in parallel? It's another valid approach, but again, probably not an easy one. If you have multiple sitemap files to build, then running each of those in parallel is fairly trivial: it's easy enough to get .Net to run a few things in parallel and wait for them all to complete before proceeding. The framework provides the Task Parallel Library for exactly this – and it's the same library Sitecore use for their parallel publishing implementation.
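The "one task per sitemap file" idea looks something like this sketch, using Python's `concurrent.futures` as a stand-in for the Task Parallel Library; the section names and the trivial build function are placeholders for however your site is actually divided up:

```python
# One independent build task per sitemap file, run in parallel, with the
# caller waiting for every task before proceeding. Sections and the build
# logic here are placeholders.
from concurrent.futures import ThreadPoolExecutor

def build_section(section: str, urls: list[str]) -> str:
    """Build the XML for one sitemap file; each call is independent."""
    entries = "".join(f"<url><loc>{u}</loc></url>" for u in urls)
    return f"<urlset>{entries}</urlset>"

def build_all_sections(sections: dict[str, list[str]]) -> dict[str, str]:
    """Run one build per section in parallel and wait for them all."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(build_section, name, urls)
                   for name, urls in sections.items()}
        # .result() blocks until each task finishes, and re-raises any
        # exception a task hit, so failures aren't silently swallowed.
        return {name: future.result() for name, future in futures.items()}
```

The equivalent in .Net would be starting one `Task` per file and waiting on the lot, but the shape of the solution is the same.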
But breaking one sitemap file up into parallel tasks is somewhat harder. This approach requires thinking about the problem of generating the sitemap from a different angle. To make your code easy to run in parallel it needs to make use of the patterns of pure functional programming. These enable you to share your tasks between parallel threads without having to worry about whether these threads might interact with each other and cause problems. It's not an easy thing to do, but if done correctly this approach stands to give the best performance improvement on modern multi-core servers.
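As a small illustration of that functional style, here's a Python sketch of splitting one file's work into chunks. The key point is that the per-URL function is pure – its output depends only on its input and it touches no shared state – so the chunks can safely run on any thread, and the final join restores a deterministic order:

```python
# Splitting one sitemap file's work across threads. entry_for() is a pure
# function, so chunks can run concurrently without interacting; map()
# preserves input order, so the output matches a sequential build.
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

def entry_for(url: str) -> str:
    """Pure function: one URL in, one <url> element out, nothing shared."""
    return f"<url><loc>{url}</loc></url>"

def chunked(items: list[str], size: int) -> list[list[str]]:
    """Split the URL list into fixed-size chunks for the worker threads."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_one_sitemap(urls: list[str], chunk_size: int = 1000) -> str:
    with ThreadPoolExecutor() as pool:
        chunks = pool.map(lambda chunk: [entry_for(u) for u in chunk],
                          chunked(urls, chunk_size))
        body = "".join(chain.from_iterable(chunks))
    return f"<urlset>{body}</urlset>"
```

A real implementation would be fetching item data rather than formatting strings, which is exactly where keeping the per-item work free of shared state becomes hard – and where the database-load caveat below starts to bite.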
However, because of that performance increase, it's important to consider how running these tasks in parallel will increase the load on your database and publishing servers. As the documentation for Sitecore's parallel publishing process notes, you need to make sure the servers doing all this work can actually cope with it.
So those are four approaches you could consider if you're looking at alternative ways of deciding how and when your sitemap gets generated. Having spent a morning thinking about this, I suspect I'll be implementing the "manual" approach in the code I'm working on at the moment. But as sites continue to grow in scale, I can see myself having to solve the parallel-implementation problems at some point in the future.
And that will probably end up as another blog post when I do get around to it...