Jeremy Davis
Sitecore, C# and web development
Article printed from: https://blog.jermdavis.dev/posts/2023/extend-statiq-sitemap

Customising Statiq's generated sitemap

Adjusting the default behaviour

Published 31 July 2023
C# Statiq ~4 min. read

I noticed the other week that the sitemap file my blog was generating included the URLs, but none of the other metadata that a sitemap entry can carry. To be honest, I'm not sure if search engines pay much attention to this these days, but since the schema for the files includes other options I decided to see if I could add them.

Initial investigations

I spent a bit of time trawling through the code for Statiq to see how it generates sitemaps. There's a pipeline defined for generating them which does a few steps. It gathers documents and filters them based on the config for whether a sitemap should be generated and what pages should be included. And then it runs a module called GenerateSitemap which does the work to format the data for writing to the output. So that seemed like a good starting point.

Looking through this code, the ExecuteContextAsync() method is generating a list of strings where each one is a line in the resulting sitemap file. It goes through all the filtered documents passed in and calls GetSitemapItemAsync() to fetch an item, and then AddSitemapItemContent() to format it.

Looking at the formatting code, it clearly knows about the idea of the extra fields I was after:

private void AddSitemapItemContent(List<string> content, string formattedLocation, SitemapItem sitemapItem, HashSet<string> locations)
{
    if (!formattedLocation.IsNullOrWhiteSpace() && locations.Add(formattedLocation))
    {
        content.Add($"<url><loc>{formattedLocation}</loc>");
        if (sitemapItem.LastModUtc.HasValue)
        {
            content.Add($"<lastmod>{sitemapItem.LastModUtc.Value.ToString("yyyy-MM-ddTHH:mm:ssZ")}</lastmod>");
        }
        if (sitemapItem.ChangeFrequency.HasValue)
        {
            content.Add($"<changefreq>{ChangeFrequencies[(int)sitemapItem.ChangeFrequency.Value]}</changefreq>");
        }
        if (sitemapItem.Priority.HasValue)
        {
            content.Add($"<priority>{sitemapItem.Priority.Value}</priority>");
        }
        content.Add("</url>");
    }
}

					

That implies that the SitemapItem objects being passed to this method don't have values for the other fields. So why might that be? Well the document is getting transformed into a SitemapItem in GetSitemapItemAsync() so I looked at that for a bit, and stuck some breakpoints into the code to see what was happening.

The beginning of that method tries to transform the document into a SitemapItem:

  // Try to get a SitemapItem
  object delegateResult = await _sitemapItemOrLocation.GetValueAsync(input, context);
  SitemapItem sitemapItem = delegateResult as SitemapItem ?? new SitemapItem((delegateResult as string) ?? context.GetLink(input));

But when I looked at this in the debugger as it processed my site, I could see that the call to GetValueAsync() here always returned null, so the new SitemapItem was always constructed by just passing in the URL for the page. And that seems to explain why the other fields are null for the formatting code above.

So to fix this, I need to change the way the module handles this transform from document to SitemapItem.
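To make that fallback easier to see in isolation, here is a minimal sketch of the same expression shape. It does not use Statiq at all: SitemapItemSketch, FallbackDemo and Resolve() are hypothetical stand-ins invented for this illustration, and only the `as ... ?? new ...` chain mirrors the snippet above.

```csharp
using System;

// Hypothetical stand-in for Statiq's SitemapItem: just enough members to
// show the behaviour, not the real class.
public class SitemapItemSketch
{
    public SitemapItemSketch(string location) => Location = location;
    public string Location { get; }
    public DateTime? LastModUtc { get; set; }
}

public static class FallbackDemo
{
    // Same shape as the expression in GetSitemapItemAsync():
    // 1. use the delegate result if it is already an item,
    // 2. otherwise treat it as a location string,
    // 3. otherwise fall back to the document's own link.
    public static SitemapItemSketch Resolve(object delegateResult, string documentLink) =>
        delegateResult as SitemapItemSketch
            ?? new SitemapItemSketch((delegateResult as string) ?? documentLink);

    public static void Main()
    {
        // A null result (what the default config produced for my site)
        // gives an item with a location and nothing else.
        var fromNull = Resolve(null, "/posts/example");
        Console.WriteLine($"{fromNull.Location} has lastmod: {fromNull.LastModUtc.HasValue}");

        // A fully populated item passes straight through with its metadata.
        var full = new SitemapItemSketch("/posts/example") { LastModUtc = DateTime.UtcNow };
        var fromItem = Resolve(full, "/ignored");
        Console.WriteLine($"{fromItem.Location} has lastmod: {fromItem.LastModUtc.HasValue}");
    }
}
```

The point of the sketch is that the optional fields only survive if whatever feeds this expression returns a complete item, which is what the rest of this post sets out to arrange.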

Hacking the generated data

I spent more time looking at the code, and considered an approach which replaced GenerateSitemap by copying it and changing its behaviour when the result of _sitemapItemOrLocation.GetValueAsync() was null to generate the right data. After poking around with that code for a while I realised this would work - but wasn't the simplest change possible. That class didn't look too friendly to inheriting and modifying due to some private fields and non-virtual methods. So it was going to require copying it into my project to modify rather than inheriting. And that made me think this was probably the wrong approach...

The existing pipeline creates an instance of GenerateSitemap using a parameterless constructor. That means a default Config<object> handler gets assigned to _sitemapItemOrLocation and it doesn't cope with mapping the SitemapItem with all the parameters I want. So I realised I could change the code more simply by creating the GenerateSitemap class by passing in a Config<object> type that did do the required mapping.

There are factory methods which help create instances of that Config<object>, so it's fairly easy to create a helper class which can create this mapping:

public static class SitemapLoaderConfig
{
    public const string Frequency = "SitemapFrequency";
    public const string Priority = "SitemapPriority";

    public static Config<object> FetchSitemapMapper()
    {
        return Config.FromDocument<object>(doc => {
            var sitemapItem = new SitemapItem(doc.GetLink());
                
            DateTime dt = doc.GetDateTime(WebKeys.Published);
            if (!string.IsNullOrWhiteSpace(doc.GetString(WebKeys.Updated)))
            {
                dt = doc.GetDateTime(WebKeys.Updated);
            }
            sitemapItem.LastModUtc = dt;

            var pr = doc.Get<double>(Priority, -1);
            if (pr == -1)
            {
                pr = doc.IsPost() ? 0.8 : 0.5;
            }
            sitemapItem.Priority = pr;

            var freq = doc.Get<string>(Frequency, "Weekly");
            sitemapItem.ChangeFrequency = Enum.Parse<SitemapChangeFrequency>(freq);

            return sitemapItem; 
        });
    }
}

					

So calling FetchSitemapMapper() here will create a mapper by calling Config.FromDocument<object>() and give it a function to transform the data. This creates the SitemapItem with the page url as the default code does, but then adds on the remaining properties. It takes data from the document where possible, and falls back to defaults if they're not provided. And these values can all be set in the metadata for the document.
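One thing to be aware of when setting the frequency in metadata: Enum.Parse<T>(string) is case-sensitive, so the value has to match the enum member name exactly ("Weekly" rather than "weekly"). The sketch below shows that behaviour and one possible more forgiving alternative. It is self-contained and does not use Statiq: ChangeFrequencySketch is a stand-in enum whose member names are my assumption, and ParseOrDefault() is a hypothetical helper, not something from the library.

```csharp
using System;

// Stand-in enum for illustration only. The member names are an assumption
// modelled on the sitemap protocol's changefreq values.
public enum ChangeFrequencySketch { Always, Hourly, Daily, Weekly, Monthly, Yearly, Never }

public static class FrequencyParsing
{
    // Hypothetical helper: case-insensitive parse that returns the fallback
    // when the text does not name a defined member.
    public static ChangeFrequencySketch ParseOrDefault(string text, ChangeFrequencySketch fallback) =>
        Enum.TryParse<ChangeFrequencySketch>(text, ignoreCase: true, out var parsed)
            && Enum.IsDefined(typeof(ChangeFrequencySketch), parsed)
            ? parsed
            : fallback;

    public static void Main()
    {
        // An exact-case value parses fine with the plain overload.
        Console.WriteLine(Enum.Parse<ChangeFrequencySketch>("Weekly"));

        // A lower-case value makes the plain overload throw.
        try
        {
            Enum.Parse<ChangeFrequencySketch>("weekly");
        }
        catch (ArgumentException)
        {
            Console.WriteLine("'weekly' was rejected");
        }

        // The tolerant helper accepts either case and absorbs typos.
        Console.WriteLine(ParseOrDefault("weekly", ChangeFrequencySketch.Monthly));
        Console.WriteLine(ParseOrDefault("fortnightly", ChangeFrequencySketch.Monthly));
    }
}
```

Whether a bad value should fail the build loudly or quietly fall back is a judgement call. For a personal blog I'd probably rather see the exception, which is what the mapper above does.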

Updating the pipeline

The second part of the job is to plumb this new behaviour into the pipeline for sitemaps. Statiq makes this fairly easy. The bootstrap code which sets up the generator has a ModifyPipeline() method that allows you to make changes. So I hacked up a quick extension method to make the change I needed:

public static class EnhancedSitemapExtensions
{
    public static Bootstrapper AddEnhancedSitemap(this Bootstrapper bs)
    {
        return bs.ModifyPipeline("Sitemap", p => {
            var executeIf = p.PostProcessModules.First() as ExecuteIf;
            var ifCondition = executeIf.First() as IfCondition;

            ifCondition.ReplaceLast<Module>(new GenerateSitemap(SitemapLoaderConfig.FetchSitemapMapper()));
        });
    }
}

					

Looking at the pipeline data with the debugger showed me that the post-process modules of the Sitemap pipeline start with an ExecuteIf object, which then contains an IfCondition that contains the modules for doing the filtering and generation. (Extracting these should really have some null checks, in case things change in the future - but I'll leave that as an exercise to the reader...) The final module in that IfCondition is the GenerateSitemap one - so that's the one to replace with the new constructor behaviour. And it passes in the result of the helper to generate the mapper above.
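For anyone who does want those checks, a rough sketch might look like the following. It uses only the Statiq calls already shown in the extension method above, plus the SitemapLoaderConfig helper from earlier in this post. The class and method names (CheckedSitemapExtensions, AddEnhancedSitemapChecked) are made up for this example, the using directives are my best guess at the namespaces involved, and I haven't run this against a site, so treat it as a starting point rather than tested code.

```csharp
using System;
using System.Linq;
// Assumed namespaces for Bootstrapper, ModifyPipeline(), ExecuteIf,
// IfCondition, Module and GenerateSitemap; adjust if your version differs.
using Statiq.App;
using Statiq.Common;
using Statiq.Core;

public static class CheckedSitemapExtensions
{
    public static Bootstrapper AddEnhancedSitemapChecked(this Bootstrapper bs)
    {
        return bs.ModifyPipeline("Sitemap", p =>
        {
            // Same traversal as before, but fail with a clear message if the
            // pipeline's shape is not what the debugger showed.
            if (p.PostProcessModules.FirstOrDefault() is not ExecuteIf executeIf)
            {
                throw new InvalidOperationException(
                    "Sitemap pipeline: expected an ExecuteIf as the first post-process module.");
            }

            if (executeIf.FirstOrDefault() is not IfCondition ifCondition)
            {
                throw new InvalidOperationException(
                    "Sitemap pipeline: expected the ExecuteIf to contain an IfCondition.");
            }

            // Unchanged from the original: swap the last module for one
            // constructed with the custom mapper.
            ifCondition.ReplaceLast<Module>(
                new GenerateSitemap(SitemapLoaderConfig.FetchSitemapMapper()));
        });
    }
}
```

Throwing here is a deliberate choice for the sketch: if a future Statiq release reshapes the pipeline, a failed build with a readable message seems better than a sitemap that silently loses its extra fields.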

And that helper can be used in the main bootstrap call:

public static async Task<int> Main(string[] args) =>
    await Bootstrapper
    .Factory
    .CreateWeb(args)
    .AddEnhancedSitemap()
    .RunAsync();

					

And with that in place the site can be regenerated to test the new behaviour.

Results

With those changes in place, the sitemap file data goes from the original:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://blog.jermdavis.dev/posts/2016/a-drink-from-the-gulp-firehose</loc>
  </url>

  ... snip ...

</urlset>

					

to the updated:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://blog.jermdavis.dev/posts/2016/a-drink-from-the-gulp-firehose</loc>
    <lastmod>2016-08-25T00:00:00Z</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>

  ... snip ...
	
</urlset>

					

Which is just what I was after, and now I can control the last-modified, frequency and priority values per page as I need.

Success!
