Jeremy Davis
Sitecore, C# and web development
Article printed from: https://blog.jermdavis.dev/posts/2023/reading-time-estimates-statiq

Adding reading time estimates to blog posts

Statiq makes this sort of extension pretty easy

Published 18 December 2023
C# Statiq ~3 min. read

The second idea on my "little things I'd meant to add to this blog for a while" list was reading time estimates. Like the reading progress indicator from before, this shouldn't be tricky, and in this case I wanted to write it down in case anyone else working with Statiq was interested in achieving something similar on their site.

Broad approach

A quick Google search told me that the "average reading speed" is 238 words per minute. So to work out how long an article might take to read, we need to count the words in the body and divide that by the reading speed.
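
That maths is simple enough. As a quick illustration (the 850-word count here is just an example figure, and rounding up means short posts still show at least a minute):

int wordCount = 850;          // example word count for a post
int wordsPerMinute = 238;     // the "average reading speed" figure above

// Divide as floating point and round up to whole minutes
int minutes = (int)Math.Ceiling(wordCount / (double)wordsPerMinute);

Console.WriteLine($"~{minutes} min. read");   // prints "~4 min. read"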

This could be implemented with some JavaScript that runs client-side when the page renders, but that means the number is only available when you're looking at the post itself. I wanted to do this in a more appropriate way for a static site generator, and have the number available on listing pages too. So I started thinking about how to implement this using Statiq's framework.

The key to this is thinking about how Statiq processes documents. It makes use of a pipeline-style approach, so if we can work out the right place to insert some custom code, it can modify the generated data.

There happens to be a pipeline called Content which does a chunk of the basic processing for loading documents to be processed. The ProcessModules property of this pipeline contains a set of operations to read in the configured documents, filter them and generate things like excerpts.

Configuring an extension to the pipelines is pretty easy. The bootstrap process that runs when Statiq starts up will find all of the IConfigurator<Bootstrapper> classes in your solution and run them as the framework initialises, so creating one of these gives an easy way to plug in a change to the default pipelines:

public class ReadingTimeConfigurator : IConfigurator<Bootstrapper>
{
    public void Configure(Bootstrapper configurable)
    {
        configurable.ModifyPipeline("Content", p =>
        {
            p.ProcessModules.Insert(3, new ReadingTimeModule());
        });
    }
}


That injects the new module at index 3 of the pipeline's ProcessModules collection - so it will run on every document imported, just before the RenderContentProcessTemplates() call in the Content pipeline.
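
If you want to sanity-check where the module actually lands, a throwaway bit of debugging in the same configurator can list the modules in order. This is just a development-time sketch, not something to keep:

configurable.ModifyPipeline("Content", p =>
{
    p.ProcessModules.Insert(3, new ReadingTimeModule());

    // Dump the process modules in order, so the index used above can be
    // checked against the pipeline Statiq actually built.
    for (int i = 0; i < p.ProcessModules.Count; i++)
    {
        Console.WriteLine($"{i}: {p.ProcessModules[i].GetType().Name}");
    }
});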

So what does this extra module in the pipeline need to do?

Doing the calculations

Well, it needs to read the data for the post, count the words in it, work out the reading time and save that into the metadata for the post. Given that all the work here is restricted to a single document, this is a good candidate for a ParallelModule - so the framework can process multiple documents in parallel if it wants to.

So a simple implementation might look like this:

public class ReadingTimeModule : ParallelModule
{
    protected override async Task<IEnumerable<IDocument>> ExecuteInputAsync(IDocument input, IExecutionContext context)
    {
        // Only markdown source files get a reading time - anything else
        // flowing through the pipeline is passed along untouched.
        if (input.Source.Extension == ".md")
        {
            var doc = new HtmlAgilityPack.HtmlDocument();

            var content = await input.GetContentStringAsync();
            doc.LoadHtml(content);

            // Strip out the <pre/> blocks (code and diagrams) so they
            // don't inflate the word count.
            RemoveUnwantedElements(doc);

            var text = doc.DocumentNode.InnerText;
            var count = WordCount(text);

            // The ReadingSpeed setting is words-per-minute. Divide as
            // floating point (not integer division) and round up so short
            // posts still show at least one minute.
            var readingSpeed = context.GetInt("ReadingSpeed");
            var readingTime = MathF.Ceiling(count / (float)readingSpeed);

            return input
                .Clone(new MetadataItems { 
                    { "ReadingTime", readingTime.ToString() } 
                }).Yield();
        }

        return input.Yield();
    }
}


First up, it checks whether the current source document is a markdown file. Some non-markdown content will get processed by the pipeline and we're not interested in that, so it can be skipped over.

Then it grabs the current content for the item being processed and parses it with the HTML Agility Pack, which gives an easy way to manipulate the markup. This code runs after the source has already been transformed into HTML, so we don't have to worry about parsing markdown.

Now there is some stuff in the document that I didn't want to count in the estimate - the diagrams and code snippets. So the call to RemoveUnwantedElements() removes those from the HTML document, before calling the InnerText property to get just the words from the HTML. The diagrams and code snippets are <pre/> elements, so it's pretty easy to remove them:

private void RemoveUnwantedElements(HtmlDocument d)
{
    var toRemove = d.DocumentNode.SelectNodes("//pre");
    if (toRemove != null)
    {
        foreach (var element in toRemove)
        {
            element.Remove();
        }
    }
}


The next step is to count the words that are left. You could do that fairly easily with string.Split(), but there are some challenges with that approach. The key one is that it generates a lot of small string allocations, which puts load on the garbage collector. Performance isn't too critical here - the speed of site generation isn't wildly important - but it's good practice to avoid allocations where you can. So instead, some code snagged from Stack Overflow (yes, I was feeling lazy) counts words by walking the string and looking at whitespace:

private int WordCount(string text)
{
    int wordCount = 0, index = 0;

    while (index < text.Length && char.IsWhiteSpace(text[index]))
    {
        index++;
    }

    while (index < text.Length)
    {
        while (index < text.Length && !char.IsWhiteSpace(text[index]))
        {
            index++;
        }

        wordCount++;

        while (index < text.Length && char.IsWhiteSpace(text[index]))
        {
            index++;
        }
    }

    return wordCount;
}

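For comparison, the string.Split() approach mentioned above would look something like this. It's a sketch only - it gives the same count for whitespace-separated text, but allocates a new string for every word:

private int WordCountWithSplit(string text)
{
    // Passing a null separator splits on any whitespace; each word in the
    // text becomes a new string allocation, which is what the loop above avoids.
    return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length;
}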

And with that done, the reading time can be calculated and the result saved to the ReadingTime metadata property.

Displaying data

Since the info is stored as metadata on the document, any Razor template in the site that processes individual documents can use the metadata APIs to retrieve the data:

@model IDocument
<span>~@Model.GetString("ReadingTime") minutes</span>


I chose to add that data at the end of the tags for each post:

An example of the reading time estimate being shown after the tags for a post on the blog

Since the tags appear on both the listings and on the post pages, this data is visible in both locations too - which is just what I wanted. Success!
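
For reference, the listing side is the same metadata lookup. In this fragment, posts just stands in for whatever IEnumerable<IDocument> your listing template already iterates over:

@foreach (IDocument post in posts)
{
    <span>~@post.GetString("ReadingTime") min. read</span>
}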
