Recently I added a basic search page to my blog. This is a pure static site, hosted on GitHub Pages, so there's no Solr or Algolia here. Just some clever JavaScript (Lunr.js) acting on data generated by Statiq. Setting that up made me think about the internals of how a search can work efficiently, and I figured others might be interested in this...
The simplest possible approach to searching for some text is basically the "find in files" approach from Visual Studio. The pseudocode for that goes something like:
```
FOR EACH file to be searched
    FOR EACH word in the file
        IF this word is the word we want
            RETURN position
        END IF
    END FOR
END FOR
```
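For a sense of what that looks like in real code, here's a rough JavaScript equivalent. The `files` array of `{ name, text }` objects is just an assumption about how the data might already be loaded; it's not how Visual Studio actually does it:

```js
// Naive linear scan: check every word of every file until we hit the target.
// Assumes `files` is an array of { name, text } objects already in memory.
function findWord(files, target) {
  const wanted = target.toLowerCase();
  for (const file of files) {
    const words = file.text.toLowerCase().split(/\W+/);
    for (let i = 0; i < words.length; i++) {
      if (words[i] === wanted) {
        return { file: file.name, position: i };
      }
    }
  }
  return null; // no match in any file
}
```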
It's very simple and requires no pre-processing of the data, but it's not very fast (as you will probably have noticed if you've ever needed to find a particular word across a large solution). It's linearly looking through all the data to find a match. It does work, though it's really only good for finding literal phrases. A query like
+law -computer
(match documents which have the word "law" but do not have the word "computer") is more difficult to manage here, because you end up having to repeat that inner loop to evaluate each term of the query. So that processing can really mount up for large data sets, or complex queries.
Fairly obviously this is no good for something like Google. If you have 10bn web pages to look through, that sort of complexity is absolutely out of the question. And it's not great in my scenario either - 300 pages is enough to make this seem slow. So what can you do in that situation?
You need an index! Some flavour of that is usually the solution to "speed of finding" issues in computing.
We want to easily work out if a particular word is in any of our documents, so the structure we want to have is a list of unique words which can tell us quickly which document(s) the word exists in. This is technically known as an inverted index.
So we can process all our web page data to generate two fairly simple structures:
The first is a list of the documents we have, giving each a unique ID plus any other metadata we need to be able to show in a results list. So on my blog that might be the ID, URL, Title, Excerpt and Tags. A structure something like:
[ { "id": 1, "title": "Test document", "url": "/posts/2024/test", "excerpt": "Lorem ipsum dolor sit amet. Fulcrum amcordio latens.", "tags": ["Test", "General"] }, { "id": 3, "title": "Another test", "url": "/posts/2024/test2", "excerpt": "", "tags": ["Test"] }, ... ]
And then secondly there's a list of all the words across all the documents, with the IDs of the document(s) each word appears in:
[ { "word": "another", "docs": [3] }, { "word": "ipsum", "docs": [1,2] }, { "word": "test", "docs": [1,3,4] }, ... ]
And these files can be generated by iterating all the available documents:
```
FOR EACH document to be indexed
    Generate an ID
    FOR EACH metadata field in current document
        ADD field to results data for this ID
    END FOR
    FOR EACH word in the document
        IF word IS NOT in words data
            ADD word TO words data
        END IF
        IF the document ID is not recorded for this word
            ADD the document ID to this word
        END IF
    END FOR
END FOR
```
However we probably don't want to index every word we find. Very common words like "and", "or", "the", "if" etc. are unlikely to be helpful in the index, so it's common to ignore them during processing. These are often referred to as "stop words".
Also it's usually worth storing all the words in lower case, to make later comparisons easier.
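Putting those pieces together, an indexing pass might look something like the sketch below. This is just illustrative JavaScript rather than the actual Statiq pipeline, and the field names and the tiny stop word list are assumptions:

```js
// Illustrative indexer: builds the documents list and the inverted word index,
// skipping stop words and lower-casing everything as it goes.
const STOP_WORDS = new Set(["a", "an", "and", "if", "of", "or", "the", "to"]);

function buildIndex(pages) {
  const documents = [];
  const words = new Map(); // word -> Set of document IDs

  pages.forEach((page, i) => {
    const id = i + 1;

    // Metadata needed to render a result entry
    documents.push({
      id,
      title: page.title,
      url: page.url,
      excerpt: page.excerpt,
      tags: page.tags,
    });

    // Record which documents each (lower-cased, non-stop) word appears in
    for (const raw of page.body.split(/\W+/)) {
      const word = raw.toLowerCase();
      if (!word || STOP_WORDS.has(word)) continue;
      if (!words.has(word)) words.set(word, new Set());
      words.get(word).add(id);
    }
  });

  // Flatten to the JSON-friendly shape shown earlier
  const wordList = [...words.entries()]
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([word, docs]) => ({ word, docs: [...docs] }));

  return { documents, words: wordList };
}
```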
How do we use this index that's been built?
Well if a query like
+ipsum -test
is issued using the example index data above:
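then the lookup is just two reads of the word list: "ipsum" appears in documents 1 and 2, and "test" appears in documents 1, 3 and 4, so excluding anything containing "test" leaves document 2 as the only match. In JavaScript terms (assuming the two structures above have been loaded into an `index` object) that's roughly:

```js
// Evaluate "+ipsum -test" against the word index.
function docsFor(index, word) {
  const entry = index.words.find((w) => w.word === word);
  return new Set(entry ? entry.docs : []);
}

const must = docsFor(index, "ipsum");   // {1, 2}
const mustNot = docsFor(index, "test"); // {1, 3, 4}

// Keep only the documents that match "ipsum" but not "test"
const results = [...must].filter((id) => !mustNot.has(id)); // [2]
```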
As you can see, all the work of iterating files and words has moved to the indexing phase. The searching phase becomes much simpler (and hence much faster) because all it's doing is looking at the IDs in the index. Generating the results is just matching the words, and using the IDs found to pick results out of the set of documents.
There are a few other interesting things that help a simple search work, which are worth mentioning here.
Some words exist in multiple forms. Verbs are a great example here. "Run", "ran", "running" and "runs" can all be treated as the same thing for the purposes of the index, so reducing these words back to their basic form makes it easier to match variations across documents.
This process is referred to as stemming. You can apply this when you generate your index, and also apply it to the query. So the query
+running
can match any document that has "run", "ran", "running" or "runs" in it. Usually this is the behaviour we'd want. There are open source implementations you can make use of here.
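As a very rough illustration of where stemming slots in, the sketch below just strips a few common suffixes. A real stemmer such as the Porter algorithm handles many more rules, including doubled consonants ("running" → "run") and irregular forms like "ran", which this toy version doesn't:

```js
// Toy stemmer: strip a handful of common suffixes. Purely illustrative --
// use a proper open source stemmer for real indexing.
function stem(word) {
  return word.toLowerCase().replace(/(ing|ed|s)$/, "");
}

// Applied consistently at index time and query time, a query for "walking"
// matches documents containing "walked" or "walks":
["walking", "walked", "walks"].map(stem); // ["walk", "walk", "walk"]
```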
Also, when you have queries that have optional query terms rather than just "must have" and "must not have" then you can get into the concept of scoring results. A query like
+fish -rod
can only ever match documents precisely. A document either fits these criteria or it does not. But if your query is
fish rod
(an "or" is implied here) then there are more options. A document may match one of these words, or both of them. The documents with two matches should come further up the result set than those with only one. Hence assigning a score to each result, so you can sort results by it.
You might also involve scoring in queries that use stemming. An exact match for "running" is a better hit than a match for "ran". And you can also weight these scores based on how many times a word appears in a document.
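A minimal version of that scoring might look like the sketch below: one point per matched query term, with the sort putting the highest scores first. Weighting by term frequency, or by exact versus stemmed matches, is just a matter of adding more points per hit:

```js
// Score an "or" query: each document earns a point per query term it contains,
// then results are sorted so the best matches come first.
function search(index, terms) {
  const scores = new Map(); // document ID -> score

  for (const term of terms) {
    const entry = index.words.find((w) => w.word === term);
    for (const id of entry ? entry.docs : []) {
      scores.set(id, (scores.get(id) || 0) + 1);
    }
  }

  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1]) // highest score first
    .map(([id, score]) => ({ id, score }));
}

// search(index, ["fish", "rod"]) puts documents containing both words
// above those containing only one.
```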
You can build a surprisingly powerful search quite easily using these techniques. Granted it won't scale to Google's level, but it's perfectly good running in a browser to find documents across this blog. In fact after a bit of testing the search is way faster than the result rendering itself - a query that matches all the documents in the index is still pretty snappy. It just takes a while to add all that mark-up to the DOM and render 300-odd results... I am considering applying pagination to the results UI to handle this.
And there's plenty more to look into here if the topic interests you. For example you can add ordering and location data to the "words in a document" index so that you can do
<a> near <b>
queries, or show highlighted extracts in the results too. I may have a go at implementing that myself one day.
But for the moment, I think this fairly simple bit of code has added a useful feature to this site...