Jeremy Davis
Sitecore, C# and web development
Article printed from: https://blog.jermdavis.dev/posts/2023/driving-browsers-3-states

Driving browsers: #3 The states

Because sometimes reinventing the wheel is fun!

Published 20 November 2023
This is post 3 of 3 in a series titled Driving browsers

Time for the final part of my series on controlling a web browser. With the code to load a browser and the overarching State Machine to control it already covered, this part finishes off with the code for some states to load a page and extract its markup. Plus a few conclusions...

So with the framework in place, we now need to implement the states the code is going to work through. Having done a bit of digging into how the DevTools API allows the browser to be controlled, the flow for fetching all the HTML from a page might look like:

flowchart LR
  A[Enable Page Events in the browser]
  B[Navigate to the target page]
  C[Fetch the root ID of the document's markup]
  D[Get the outer HTML]
  A-->B
  B-->C
  C-->D

State one: Enabling event monitoring

First, the browser needs to be told to emit page lifecycle events, so that the code can detect when page loading has finished. Then it can be told to navigate to a page. Once that completes, the code needs to get the internal ID of the document root, and can then use that ID to fetch the outer HTML of that node.

(There are loads of other commands and events that you can deal with, but these are enough for a demo)

As noted above, all the states are instances of the State class, so the code for the first "Enable Page Events" state looks like:

public class PageEventEnableState : State
{
    public static State Instance { get; } = new PageEventEnableState();
    private static readonly int _id = 1;

    public override async Task Enter(StateMachine owner)
    {
        var request = new PageEventEnableParameters();
        await owner.SendCommand(request, _id);
    }

    public override async Task Update(StateMachine owner, DebuggerResult data)
    {
        if (data != null && data.Id == _id)
        {
            owner.TransitionToNewState(PageNavigateState.Instance);
        }
    }

    public override async Task Leave(StateMachine owner)
    {
    }
}

This defines a static instance of the state, which is what gets passed to the StateMachine when selecting this state. (As noted before, these classes hold no internal data, so they're safe to reuse across operations.) It also defines an integer "id" for this operation. That gets passed into any command we send, and comes back in the matching response, to help us work out which responses belong to which calls.

The Enter() method creates the data for the call, and then asks the owning StateMachine to issue the command to the connected browser. As mentioned before, we need an instance of the IDebuggerCommandProperties interface to help format the parameters for sending. In this case it's trivially simple as there are no extra parameters to the "Page.enable" call:

public class PageEventEnableParameters : IDebuggerCommandProperties
{
    // Not serialized into the params - this just names the DevTools command to call
    [JsonIgnore]
    public string CommandName => "Page.enable";
}

With that sent, any messages received by the StateMachine get routed to the Update() call. And to decide when we're done with this state we wait for a response which has data, and where the returned Id equals the one we passed in. When we see that, the StateMachine can be instructed to move to the next state.
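For reference, the messages crossing the WebSocket for this state are tiny. Assuming the StateMachine wraps the command name and the id up with camel-cased JSON serialization, the exchange looks roughly like:

Sent:     {"id":1,"method":"Page.enable","params":{}}
Received: {"id":1,"result":{}}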

And Leave() has nothing to do here...

State two: Navigating the browser

Now that our code will be notified of navigation events like "the page is loaded now", it can move on to loading the right page. The code for this adds a few extra bits:

public class PageNavigateState : State
{
    public static State Instance { get; } = new PageNavigateState();
    private static readonly int _id = 2;

    public override async Task Enter(StateMachine owner)
    {
        var nextUrl = (string)owner.State["nextUrl"];

        var request = new PageNavigateParameters() { Url = nextUrl };
        await owner.SendCommand(request, _id);
    }

    public override async Task Update(StateMachine owner, DebuggerResult data)
    {
        if (data != null && data.Method == "Page.loadEventFired")
        {
            owner.TransitionToNewState(FetchDocumentRootState.Instance);
        }
    }

    public override async Task Leave(StateMachine owner)
    {
    }
}

So again we define the instance of the state, and an ID. Turns out that step ID isn't so important here, however.

The Enter() command needs to send the URL and a Referrer for the navigation request. This code is ignoring Referrer, as it's not really relevant for this operation. But since the state can't store its own data, it needs some help to fetch the URL it should be navigating to. There are multiple ways to solve this, but for simplicity I picked the "give the StateMachine a dictionary of state data which the State objects can access" approach. So the URL to use gets picked out of there, and stuck into the IDebuggerCommandProperties instance for this call:

public class PageNavigateParameters : IDebuggerCommandProperties
{
    [JsonIgnore]
    public string CommandName => "Page.navigate";

    public required string Url { get; set; }
    public string Referrer { get; set; } = string.Empty;
}

The "Page.navigate" call does have some other optional parameters, but they weren't relevant here.
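As an aside, the shared state store mentioned above doesn't need to be anything clever. A sketch of the idea, assuming the StateMachine from the previous post, is just a dictionary property:

public class StateMachine
{
    // Shared scratch-pad which the State instances use to pass data between
    // steps, since the states themselves deliberately hold no data of their own.
    public Dictionary<string, object> State { get; } = new Dictionary<string, object>();

    // ...plus the rest of the StateMachine from the previous post...
}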

The Update() behaviour here is a little different, because navigation triggers a boat-load of responses on the WebSocket. The response we care about is "the page has finished loading", not "I've accepted the navigation request ok" or any of the other "stuff is changing" messages which get sent back. Hence the test for moving to the next state doesn't actually care about the ID of the request we sent originally. Instead it looks for a message with the method "Page.loadEventFired". That event does carry some data (the time the event fired) but it's not relevant here, so it isn't processed.
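That event arrives as a method call from the browser rather than a numbered command response, so on the wire it looks something like this (the timestamp value here is just illustrative):

{"method":"Page.loadEventFired","params":{"timestamp":240403.129}}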

But that event arriving means the browser has a new document loaded, and the StateMachine can move on.

State three: Getting the Document's root ID

Inside the browser's model for the DOM, individual nodes get an ID assigned, which this API can use to access specific bits of data. So if the code wants to retrieve data for a specific node, it first needs to find the node's ID. There are methods for searching the DOM to find nodes, but in this case the root is all we need, and that has a specific command.

So the next State can issue that request, and wait for a response:

public class FetchDocumentRootState : State
{
    public static State Instance { get; } = new FetchDocumentRootState();
    private static readonly int _id = 3;

    public override async Task Enter(StateMachine owner)
    {
        var request = new FetchDocumentRootParameters() { };
        await owner.SendCommand(request, _id);
    }

    public override async Task Update(StateMachine owner, DebuggerResult data)
    {
        if (data != null && data.Id == _id)
        {
            // children[0] is usually the doctype node - children[1] is the <html> element itself
            var nodeId = data.Result?["root"]?["children"]?[1]?["nodeId"]?.GetValue<int>() ?? -1;

            owner.State["NodeID"] = nodeId;
            owner.TransitionToNewState(GetOuterHtmlState.Instance);
        }
    }

    public override async Task Leave(StateMachine owner)
    {
    }
}

So as before, Enter() sends off an IDebuggerCommandProperties object that specifies we want to run the "DOM.getDocument" command. This returns basic internal data about nodes in the DOM, and has a couple of extra parameters:

public class FetchDocumentRootParameters : IDebuggerCommandProperties
{
    [JsonIgnore]
    public string CommandName => "DOM.getDocument";

    public int Depth { get; set; } = 1;
    public bool Pierce { get; set; } = false;
}

The Depth property lets us specify how many layers of children to return data for. We only care about the root item, so one layer is enough. The Pierce property tells the browser to look inside things like iframe elements while processing - that isn't relevant here, so it's left turned off.
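Assuming camel-cased serialization again, the command sent for this state would look roughly like this - note how the [JsonIgnore] attribute keeps CommandName out of the params object:

{"id":3,"method":"DOM.getDocument","params":{"depth":1,"pierce":false}}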

Here the Update() method is looking for the ID that was sent again, since this state does care about the response to the specific request it issued. The response comes back as a tree of JSON nodes, and the code extracts the int value of the nodeId property for the item we want. That's probably not the safest way of working (it should really handle nulls or unexpected values in that object tree) but it gets the job done for a demo...
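To see why the code indexes children[1], it helps to look at a trimmed sketch of the response. The document root's first child is usually the doctype node, and its second child is the <html> element we actually want:

{
  "id": 3,
  "result": {
    "root": {
      "nodeId": 1,
      "nodeName": "#document",
      "children": [
        { "nodeId": 2, "nodeType": 10, "nodeName": "html" },
        { "nodeId": 3, "nodeType": 1, "nodeName": "HTML" }
      ]
    }
  }
}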

The next step is going to need this ID, so it gets written into the StateMachine object's data store for the next operation to pick up.
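As an aside, if you needed a node other than the root, the "DOM.querySelector" command follows exactly the same pattern: it takes the ID of a node to search under plus a CSS selector, and returns the matching node's ID. It isn't used in this demo, but a hypothetical parameters class for it might look like:

public class QuerySelectorParameters : IDebuggerCommandProperties
{
    [JsonIgnore]
    public string CommandName => "DOM.querySelector";

    // The ID of the node to search under - e.g. the document root fetched above
    public required int NodeId { get; set; }

    // A CSS selector identifying the node to find, e.g. "div.article > h1"
    public required string Selector { get; set; }
}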

State four: Get the HTML

Now that the code has an ID for an element it can ask for the HTML for it:

public class GetOuterHtmlState : State
{
    public static State Instance { get; } = new GetOuterHtmlState();
    private static readonly int _id = 4;

    public override async Task Enter(StateMachine owner)
    {
        var nodeId = (int)owner.State["NodeID"];

        var request = new GetOuterHtmlParameters() { NodeId = nodeId };
        await owner.SendCommand(request, _id);
    }

    public override async Task Update(StateMachine owner, DebuggerResult data)
    {
        if (data != null && data.Id == _id)
        {
            var html = data.Result?["outerHTML"]?.GetValue<string>() ?? string.Empty;
            owner.State["HTML"] = html;

            owner.TransitionToNewState(NullState.Instance);
        }
    }

    public override async Task Leave(StateMachine owner)
    {
    }
}

And this is very similar to the previous states. It sends an IDebuggerCommandProperties object which specifies a call to the "DOM.getOuterHTML" command, passing in the NodeId from the previous step, retrieved from the StateMachine data:

public class GetOuterHtmlParameters : IDebuggerCommandProperties
{
    [JsonIgnore]
    public string CommandName => "DOM.getOuterHTML";

    public required int NodeId { get; set; }
}

And when the response with the correct ID comes back, the resulting HTML gets written into the StateMachine's data store.
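That response is just the id plus the markup itself - roughly:

{"id":4,"result":{"outerHTML":"<!DOCTYPE html><html>...</html>"}}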

State five: The null state

The final state needs to signal "done now" to the StateMachine, and as mentioned in the previous episode, that's done with the "null" state:

public class NullState : State
{
    public static State Instance { get; } = new NullState();

    public override async Task Enter(StateMachine owner) { }
    public override async Task Update(StateMachine owner, DebuggerResult data) { }
    public override async Task Leave(StateMachine owner) { }
}

It's an instance of a State but it does nothing, and lets the StateMachine detect that the state flow is done. (Which is basically an instance of the Null Object pattern)

Finally - The controlling code

So all that's left now is a bit of controlling code which makes use of all the stuff discussed so far. And that's fairly simple:

using (var browser = BrowserFactory.Create())
{
    browser.Open("about:blank");

    var connection = await browser.Connect();

    var stateMachine = new StateMachine(PageEventEnableState.Instance, connection);
    stateMachine.State["nextUrl"] = "https://bbc.co.uk/news/";

    await stateMachine.Start();
    await stateMachine.Wait();

    var html = stateMachine.State["HTML"] as string;
    Console.WriteLine($"HTML: {html}");
}

It creates the Browser object, opens the window and connects to the DevTools API. Then it passes in the URL to request and sets the state machine going. After waiting for it to complete, it gets back the resultant markup.

So running this code gets a browser window to pop up briefly:

The browser which the code above loaded and navigated to the BBC website

And the console window will list out the HTML from the page that got loaded:

The console window for the code in this article, showing the HTML loaded from the BBC

Success!

The code for all this is available in a GitHub repository, if you want to play with it.

Series conclusions

This has been an interesting coding diversion - and one that's solved a real-world problem for me. The markup I needed to scrape, which couldn't be fetched with a plain HttpClient, does work with this approach.

But there are a couple of down-sides to this approach:

  • Firstly, it's noticeably slower. Firing up a whole browser process, with all its script sandboxes and rendering engines etc., takes a lot more time. That's especially noticeable if you have a list of pages to work through downloading.
  • And secondly it requires the process running the code to have the ability to interact with the Desktop. That's not an issue for a console app. But I'm pretty sure it means this code wouldn't be much good running as part of a Windows Service. That's a fairly niche issue - but one which probably does limit its use.

I've also noticed one interesting problem with this code. In some circumstances the code above will fail to automate the browser correctly. If the browser starts up in some sort of "ask questions to set up your profile" mode (particularly likely with Edge if it creates a new profile folder or if it gets a big security update) then it doesn't seem to allow connections to the DevTools API. Answering the questions and re-running the code seems to sort this - but it's not ideal for an automation situation. I've not found an automatic way around this issue yet - but it's possible one exists.

But implementing this has certainly given me some interesting insight into how code for front-end testing frameworks gets built. And it shows an interesting example of how state-based design patterns can be used in the real world.

Is it the best way? Probably not. But it's an interesting demo of one way to keep logic for individual steps in a process separate from the overall orchestration of the process. And it does this in a very different way to the more pipeline-based approaches I've written about before.

Though in getting to the end of this I realise the Leave() step isn't really relevant to the particular work I was doing and could be removed - but maybe it is helpful in your scenario...
