Jeremy Davis
Sitecore, C# and web development
Article printed from: https://blog.jermdavis.dev/posts/2014/custom-sitemap-filespart-one

Custom Sitemap files – Part One

Published 12 May 2014
Updated 25 August 2016
C# Sitecore ~5 min. read
This is post 1 of 4 in a series titled Custom Sitemap Files

Sitemap files are a requirement for most websites these days. They help SEO by ensuring that search engines index the files and images they might not otherwise find, and that you think are most important. Whilst there are assorted pre-built add-ons for Sitecore that can help with this, that's no fun. It's much more fun to build your own...

Real work is getting in the way of blogging time at the moment, so I'm going to break up my investigations into this into three posts. This week I'll look at some requirements, core configuration and the overall algorithm. The next part will look at the core code. And the final one will look at adding image data to the sitemap files.

When I started looking at this, I had the following requirements to consider:

  • Build the core sitemap file at publish time.
  • Support optional "Sitemap Index Files" – where multiple sitemap files are grouped together in an "index file".
  • Be able to specify the output filenames.
  • Be able to choose the root Sitecore Item for each sitemap file.
  • Be able to choose which Templates and Languages an Item must be based on in order to be included in the Sitemap.
  • Allow individual pages to specify their sitemap properties, such as priority, inclusion, and change frequency.
  • Allow individual sitemaps to have rules for what images to include.
  • Allow images to be included from either fields on the Item, or from the Data Sources of components rendered by the Item.
  • Avoid using configuration files, but let editors control the sitemap through Item data.

Configuration settings

A common place to put configuration for extension modules like this is under /sitecore/System/Modules – so we'll create a "Sitemap" folder under here and make a note of its ID for later. Within that we're going to want to create items for Sitemap Index files, or for Sitemap files that don't have an index. And they'll need templates.

The Sitemap Index File only needs configuration for its file name and the set of Sitemap files it's going to refer to. Filename is easy – that's just a single line text field. So we can create a simple template for SitemapIndexFile:

Sitemap index file template

The Sitemap Files it's going to contain can be its children – so its insert options need to allow creating an instance of the SitemapFile template:

Sitemap file item

This one is a bit more complex. First of all it also needs a file name, but it requires some other settings too. When the publish operation runs you don't know what the "context" database is, so we'll need to record a reference to the database that you want the Sitemap file to be generated from. In this case I've made that a Droplist field that points to a set of database names. Most of the time you'd set this to "Web" (since that's the database which holds all the published data), but for testing purposes you might want to change this to "Master". Next, the template allows editors to specify the root item that sitemap processing will start from. This allows configuring multiple sitemaps for different site roots, or for subsections of a website. The last two fields allow the editor to select a set of language versions that will be included in the output, and a set of templates that will be included. In both cases, selecting nothing here will be treated as "include all".
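To make the "select nothing means include all" rule concrete, here is a minimal sketch of how those settings might look once loaded into a plain object. The class and property names are my own assumptions for illustration, not Sitecore types, and the real code would populate them from the fields of the configuration item.

```csharp
using System;
using System.Collections.Generic;

namespace Testing.Sitemaps
{
    // Hypothetical plain-object view of one SitemapFile configuration item.
    public class SitemapFileConfig
    {
        public string Filename { get; set; }
        public string DatabaseName { get; set; }   // e.g. "web", or "master" for testing
        public string RootItemPath { get; set; }
        public List<string> Languages { get; private set; }
        public List<string> Templates { get; private set; }

        public SitemapFileConfig()
        {
            Languages = new List<string>();
            Templates = new List<string>();
        }

        public bool IncludesLanguage(string language)
        {
            return Matches(Languages, language);
        }

        public bool IncludesTemplate(string template)
        {
            return Matches(Templates, template);
        }

        // An empty selection is treated as "include all".
        private static bool Matches(List<string> selected, string value)
        {
            if (selected.Count == 0) return true;
            foreach (string candidate in selected)
            {
                if (string.Equals(candidate, value, StringComparison.OrdinalIgnoreCase)) return true;
            }
            return false;
        }
    }
}
```

Keeping the rule in one helper means the language and template checks cannot drift apart.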

So, with these templates we can set up basic configuration for a couple of sitemap files:

Sitemap tree

Here "CustomSitemap" is a sitemap file on its own, whereas "TestIndexFile" is a sitemap index that contains one sitemap file.

The other thing that we need to create for configuration is a template to extend your web page items. We need to be able to specify the bits of configuration needed for individual pages:

Sitemap Item Extensions template

A checkbox field SitemapInclude lets editors specify whether this item should be considered for sitemap processing or not. The SitemapPriority field lets them specify a relative value of how important this page is on your site, as per the sitemap schema. Finally, editors can choose a Droplist value for the expected change frequency of this page, as per the schema. Note the use of the "Shared" flag for these fields – the requirements I was thinking about needed these settings to be shared between all language versions of each page – but that might not be true for other sites and you might want to think about whether that's applicable in any work you do based on this approach.
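Since editors type these values in, it may be worth checking them before they reach the output. The sketch below assumes the raw field values arrive as strings. The helper names are hypothetical, while the allowed values (a priority between 0.0 and 1.0, and the seven change frequency keywords) come from the sitemap protocol.

```csharp
using System;
using System.Globalization;

namespace Testing.Sitemaps
{
    // Hypothetical helpers for cleaning up the per-page sitemap fields.
    public static class SitemapPageSettings
    {
        // The change frequency values defined by the sitemap protocol.
        private static readonly string[] ChangeFrequencies =
            { "always", "hourly", "daily", "weekly", "monthly", "yearly", "never" };

        // Returns the priority formatted for XML, or null when the field is
        // empty, not a number, or outside the 0.0 to 1.0 range.
        public static string NormalisePriority(string raw)
        {
            double value;
            if (string.IsNullOrWhiteSpace(raw)) return null;
            if (!double.TryParse(raw, NumberStyles.Float, CultureInfo.InvariantCulture, out value)) return null;
            if (double.IsNaN(value) || value < 0.0 || value > 1.0) return null;
            return value.ToString("0.0##", CultureInfo.InvariantCulture);
        }

        // Returns the lower-cased keyword, or null when it is not a known value.
        public static string NormaliseChangeFrequency(string raw)
        {
            if (string.IsNullOrWhiteSpace(raw)) return null;
            string candidate = raw.Trim().ToLowerInvariant();
            return Array.IndexOf(ChangeFrequencies, candidate) >= 0 ? candidate : null;
        }
    }
}
```

Returning null for anything invalid lets the writing code simply leave that element out of the file.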

This template can then be added to the template for the pages on your site:

Sitemap settings on Home

Algorithm ideas

Once all that config is defined, the basic behaviour for the code is as follows:

  • Fetch the configuration items stored as children of /sitecore/System/Modules/Sitemap from the Master database.
  • Iterate each configuration item
    • If it's config for a Sitemap Index file
      • Load the configuration for the output filename.
      • Iterate each of its child Sitemap items. For each one, generate the metadata for the index file and then process the Sitemap file itself.
      • Save the data for the index file to XML.
    • If it's config for a Sitemap file, process it.
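As a rough sketch of that outer loop, the code below uses a simple stand-in class instead of real Sitecore items, so ConfigNode and SitemapRunner are assumptions for illustration only. It just records the order in which files would be written, where the real code would generate XML.

```csharp
using System;
using System.Collections.Generic;

namespace Testing.Sitemaps
{
    // Hypothetical stand-in for a configuration item under the Sitemap folder.
    public class ConfigNode
    {
        public string TemplateName;   // "SitemapIndexFile" or "SitemapFile"
        public string Filename;
        public List<ConfigNode> Children = new List<ConfigNode>();
    }

    public static class SitemapRunner
    {
        // Returns the output filenames in the order they would be written.
        public static List<string> Run(IEnumerable<ConfigNode> configItems)
        {
            var written = new List<string>();
            foreach (ConfigNode node in configItems)
            {
                if (node.TemplateName == "SitemapIndexFile")
                {
                    foreach (ConfigNode child in node.Children)
                    {
                        // Real code would add an index entry for the child
                        // here, then process the child sitemap file.
                        written.Add(child.Filename);
                    }
                    // The index file is saved once all its children are done.
                    written.Add(node.Filename);
                }
                else if (node.TemplateName == "SitemapFile")
                {
                    written.Add(node.Filename);
                }
            }
            return written;
        }
    }
}
```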

For each of the Sitemap files we need to process:

  • Load the configuration for the filename, database, root, etc.
  • Find the root item in the correct database.
  • For the root and its descendants, process each one in turn.
    • Check if it's marked for export into the sitemap. If not, skip it.
    • Check it has the right template. If not, skip it.
    • For each language defined in the configuration
      • Check if this item has a version in that language. If not, skip it.
      • Extract all the required metadata that will go into the sitemap file from this language version.
  • Then save all the metadata into the sitemap file in the correct XML schema.
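The per-item checks can be sketched in the same way. PageNode and SitemapWalker below are hypothetical stand-ins rather than Sitecore types, and I'm assuming that a skipped item does not stop its descendants from being considered, and that an empty template or language selection means "include all".

```csharp
using System;
using System.Collections.Generic;

namespace Testing.Sitemaps
{
    // Hypothetical stand-in for a content item and its language versions.
    public class PageNode
    {
        public string Path;
        public string TemplateName;
        public bool SitemapInclude;
        public List<string> VersionLanguages = new List<string>();
        public List<PageNode> Children = new List<PageNode>();
    }

    public static class SitemapWalker
    {
        // Returns one "language path" entry per language version that passes the checks.
        public static List<string> Collect(PageNode root, ICollection<string> templates, ICollection<string> languages)
        {
            var entries = new List<string>();
            var pending = new Queue<PageNode>();
            pending.Enqueue(root);

            while (pending.Count > 0)
            {
                PageNode page = pending.Dequeue();
                // Descendants are always queued, even if this page is skipped.
                foreach (PageNode child in page.Children) pending.Enqueue(child);

                if (!page.SitemapInclude) continue;
                if (templates.Count > 0 && !templates.Contains(page.TemplateName)) continue;

                // No configured languages means every version the page has.
                ICollection<string> wanted = languages.Count > 0 ? languages : page.VersionLanguages;
                foreach (string language in wanted)
                {
                    if (!page.VersionLanguages.Contains(language)) continue;
                    entries.Add(language + " " + page.Path);
                }
            }
            return entries;
        }
    }
}
```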

Nothing particularly difficult there – but it's quite a few things to do.

One area that can be done in a variety of ways is how to write the data for the sitemap files out to disk in the right schema. My first attempt at this code made use of the XML serialisation infrastructure in the .NET Framework. It worked, but it was quite fiddly and required quite a lot of mucking about to get the namespaces correct. It was also quite complicated to deal with "empty" attributes of an item in a sitemap file, so I ended up reverting to some custom code that writes the file using the XDocument classes from LINQ to XML, as this works more simply.
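As an illustration of why XDocument makes the "empty" values easy, here is a minimal sketch that only adds the optional elements when they have a value. The helper names are my own assumptions. The namespace URI and the element names (urlset, url, loc, changefreq, priority) come from the sitemap protocol.

```csharp
using System;
using System.Xml.Linq;

namespace Testing.Sitemaps
{
    // Hypothetical helper for building sitemap XML in memory.
    public static class SitemapXml
    {
        private static readonly XNamespace Ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

        // Builds one <url> entry, leaving out any optional element with no value.
        public static XElement BuildUrl(string loc, string changeFreq, string priority)
        {
            var url = new XElement(Ns + "url", new XElement(Ns + "loc", loc));
            if (!string.IsNullOrEmpty(changeFreq)) url.Add(new XElement(Ns + "changefreq", changeFreq));
            if (!string.IsNullOrEmpty(priority)) url.Add(new XElement(Ns + "priority", priority));
            return url;
        }

        // Wraps the entries in a <urlset> root in the sitemap namespace.
        public static XDocument BuildDocument(params XElement[] urls)
        {
            return new XDocument(
                new XDeclaration("1.0", "utf-8", null),
                new XElement(Ns + "urlset", urls));
        }
    }
}
```

Calling Save() on the resulting XDocument with a file path would then write it to disk.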

However, for large sites this approach is not particularly scalable, because it can generate a lot of in-memory objects. It would be better to write the data out directly via an XmlTextWriter. If you need to generate sitemaps for big sites it would be sensible to consider this approach as an alternative.
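For comparison, a streaming version might look something like the sketch below, which writes each entry as it goes rather than building a tree first. The StreamingSitemapWriter name is an assumption of mine, and it only writes the loc element to keep the example short.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

namespace Testing.Sitemaps
{
    // Hypothetical helper that streams sitemap XML straight to a TextWriter.
    public static class StreamingSitemapWriter
    {
        private const string Ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

        public static void Write(TextWriter output, IEnumerable<string> locations)
        {
            // Note: disposing the XmlTextWriter also closes the underlying writer.
            using (var writer = new XmlTextWriter(output))
            {
                writer.Formatting = Formatting.Indented;
                writer.WriteStartDocument();
                writer.WriteStartElement("urlset", Ns);

                foreach (string loc in locations)
                {
                    writer.WriteStartElement("url", Ns);
                    writer.WriteElementString("loc", Ns, loc);
                    writer.WriteEndElement();
                }

                writer.WriteEndElement();
                writer.WriteEndDocument();
            }
        }
    }
}
```

Because the locations are consumed one at a time, the caller can pass a lazily evaluated sequence and never hold the whole sitemap in memory.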

Code

So with all those configuration options available we can start on some code. Usual rules apply – I'm ignoring error handling and patterns like Glass for simplicity. Production code would include those things.

First of all, getting something to happen at publish time is simple – it just requires adding a handler to the publish:end event. To do that you must first define a class that will perform the action for this event. That class must have a method with the following signature:

using System;

namespace Testing.Sitemaps
{
    public class Publisher
    {
        // Sitecore calls this when the event it is registered against fires.
        public void Publish(object sender, EventArgs args)
        {
            // Your code here...
        }
    }
}

And that can be configured with a simple config patch:

<?xml version="1.0" encoding="utf-8" ?>
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <events>
      <event name="publish:end">
        <handler method="Publish" type="Testing.Sitemaps.Publisher" />
      </event>
    </events>
  </sitecore>
</configuration>

This will cause the Publish() method to be called every time someone triggers a publish on this server. But if your website has multiple web servers, you will need to enable the "Scalability Settings" configuration that ships disabled with Sitecore by default, and then trigger your Publish() method on the publish:end:remote event as well. This event is fired on all the other servers in your cluster at publish time.
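For the multi-server case, the patch file might end up looking something like this. It is a sketch that assumes the same Publish() method is suitable for both events, and that the scalability settings have already been enabled separately.

```xml
<?xml version="1.0" encoding="utf-8" ?>
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <events>
      <!-- Fires on the server where the publish was triggered -->
      <event name="publish:end">
        <handler method="Publish" type="Testing.Sitemaps.Publisher" />
      </event>
      <!-- Fires on the other servers in the cluster -->
      <event name="publish:end:remote">
        <handler method="Publish" type="Testing.Sitemaps.Publisher" />
      </event>
    </events>
  </sitecore>
</configuration>
```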

So now that our code will get triggered, we need it to be able to load the configuration for a sitemap.

And we'll look at the code for that in part two...
