This is post 4 of 4 in a series titled Custom Sitemap Files
- Custom Sitemap files – Part One
- Custom Sitemap Files – Part Two
- Custom Sitemap Files – Part Three
- Custom Sitemap Files – Part Four
The last three posts have discussed some prototype code for a sitemap generator, and I want to wrap the series up with a few thoughts about how the size of your Sitemap build operation might affect your site.
The pattern the previous posts discussed was to generate the sitemap each time the publishing process runs. For small to medium sites that's OK, as processing the sitemap won't add much extra time. However, once you get to large sites this might start to become an issue. With Buckets and parallel publishing, Sitecore are starting to talk about the idea that you can have millions of items in a site. And that's likely to become a problem if you have a five-minute publish followed by another three minutes processing the sitemap.
(Also, don't forget the database server load and item-cache churn that processing large sitemaps is likely to cause.)
When I was talking about this issue with my colleague @steviemcgill recently, he explained how he'd ended up disabling sitemap generation in a project he was working on because of the performance issues it was causing in his codebase. That got me thinking about strategies for addressing this problem. My suspicion is that the right one to choose probably depends on the specifics of your site: its size and its requirements. But here are a few approaches you might consider in your solutions:
As we've seen with previous bits of code that used buttons, you need to wire up a handler for a command and then create a button to call that command. Your handler can then raise an event, and you can register an event handler to respond to it.
The important extra complexity here is that in a multi-server solution you need to make sure that you can raise a remote event, so that it runs on all your servers. A bit of googling suggests that it's not too hard to create a custom remote event through code.
I guess you could also set this up so that your sitemap is generated when a specific (secured) URL is requested, and schedule your sitemap build via a Windows scheduled task that does an HTTP GET for that URL. There's extra effort required to set that up, but it does have the advantage that it can pull Sitecore back into memory even after IIS has shut it down.
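As a rough, language-neutral illustration of the "secured URL" idea, here is a minimal Python sketch rather than Sitecore code. It assumes the scheduled task passes a shared secret in a `token` query-string parameter; that parameter name, and the functions `is_authorized` and `handle_build_request`, are hypothetical names invented for this example.

```python
import hmac
from urllib.parse import parse_qs, urlparse


def is_authorized(request_url, expected_token):
    """Check the (hypothetical) `token` query parameter against the shared secret."""
    if not expected_token:
        return False  # refuse everything if no secret has been configured
    query = parse_qs(urlparse(request_url).query)
    supplied = query.get("token", [""])[0]
    # compare_digest avoids leaking how much of the token matched via timing
    return hmac.compare_digest(supplied.encode("utf-8"), expected_token.encode("utf-8"))


def handle_build_request(request_url, expected_token, build):
    """Return an HTTP-style status code; only run `build` for authorized callers."""
    if not is_authorized(request_url, expected_token):
        return 403
    build()
    return 200
```

A query-string secret is the simplest thing a scheduled HTTP GET can send, but it is likely to end up in web server logs, so a request header or an IP restriction may be a better fit for a real site.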
The first issue is that you will probably need to think about what might happen if two publishes happen close enough together to cause the sitemap generation operations to overlap. Depending on how you choose to implement your sitemap generation, you might find problems with overlapped operations. At the least you're wasting CPU cycles generating the files twice in this scenario. But at worst, combining overlapping operations with certain code patterns might lead to exceptions for locked files, or files ending up with incorrect data in them. You will probably need to implement some sort of locking here to address these scenarios. In a single-server solution you can probably make use of Windows locking primitives. For multi-server solutions you would need to look at keeping locking flags on disk or in databases. Alternatively, you might consider writing your output to random file names, and only renaming the files to the correct names when the process finishes.
The second issue here is that IIS isn't actually very good at running background tasks. Web servers were built to deal with short-running, immediate tasks to serve up web pages. Having a background operation take a few minutes to complete wasn't really in the architectural plan for IIS. Background tasks can be killed off by IIS when it decides it needs to tear down the AppDomain your code is running in. Errors, periodic process recycling and admins clicking "stop" on a site can all do this to you.
My suspicion is that this is probably not a great way to go if you need simple code that reliably generates big sitemaps.
But breaking one sitemap file up into parallel tasks is somewhat harder. This approach requires thinking about the problem of generating the sitemap from a different angle. To make your code easy to run in parallel it needs to make use of the patterns of pure functional programming. These enable you to share your tasks between parallel threads without having to worry about whether these threads might interact with each other and cause problems. It's not an easy thing to do, but if done correctly this approach stands to give the best performance improvement on modern multi-core servers.
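As a sketch of what the "pure function" shape might look like, here is a small Python example (again a language-neutral illustration, not Sitecore code). It assumes the items have already been read into plain dictionaries with `loc` and `lastmod` keys; that data shape and all the function names are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from xml.sax.saxutils import escape


def render_url(item):
    """Pure function: the output depends only on the item passed in."""
    loc = escape(item["loc"])
    lastmod = escape(item["lastmod"])
    return f"<url><loc>{loc}</loc><lastmod>{lastmod}</lastmod></url>"


def render_chunk(items):
    """Render one chunk of items; touches no shared state."""
    return "".join(render_url(item) for item in items)


def chunked(items, size):
    """Split the item list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_sitemap(items, workers=4, chunk_size=1000):
    """Render chunks on a worker pool and stitch the results back together."""
    chunks = chunked(items, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the output is deterministic
        parts = list(pool.map(render_chunk, chunks))
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        + "".join(parts)
        + "</urlset>"
    )
```

Because each chunk is rendered without reading or writing anything shared, the chunks can run in any order on any thread and still produce the same file. In Python specifically, threads mostly help when the per-item work waits on I/O; the point here is the structure rather than the speed-up.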
However, because of that performance increase, it's important to consider how running these tasks in parallel will increase the load on your database and publishing servers. As noted in the documentation for Sitecore's parallel publishing process, you need to consider the servers that will be doing all this work.
So there are four approaches you could consider if you want to change how and when your sitemap is generated. Having spent a morning thinking about this, I suspect I'll be implementing the "manual" approach in the code I'm working on at the moment. But I can see that as the scale of sites tends to increase, I may find myself having to solve the parallel-implementation problems in the future.
And that will probably end up as another blog post when I do get around to it...