Jeremy Davis
Sitecore, C# and web development
Article printed from: https://blog.jermdavis.dev/posts/2024/key-trick-fast-search

A key trick that makes search fast

It's interesting to think about internals sometimes

Published 29 July 2024
General ~4 min. read

Recently I added a basic search page to my blog. This is a pure static site, hosted on Github Pages, so there's no Solr or Algolia here. Just some clever JavaScript (Lunr.js) acting on data generated by Statiq. Setting that up made me think about the internals of how a search can work efficiently, and I figured others might be interested in this...

The challenge

The simplest approach to searching for some text is basically the "find in files" approach from Visual Studio. The pseudocode for that goes something like:

FOR EACH file to be searched
  FOR EACH word in the file
    IF this word is the word we want
      RETURN position
    END IF
  END FOR
END FOR

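For illustration, a quick TypeScript sketch of that loop might look something like this (the RawDocument shape and the naiveSearch name are just invented for the example):

// Hypothetical shape for a document to scan: an ID plus its raw text
interface RawDocument {
  id: number;
  text: string;
}

function naiveSearch(documents: RawDocument[], target: string): number[] {
  const matches: number[] = [];
  const wanted = target.toLowerCase();
  for (const doc of documents) {
    // Split on whitespace - a real tokeniser would strip punctuation too
    for (const word of doc.text.split(/\s+/)) {
      if (word.toLowerCase() === wanted) {
        matches.push(doc.id);
        break; // one hit per document is enough for a result list
      }
    }
  }
  return matches;
}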

It's very simple and requires no pre-processing of the data, but it's not very fast, as you'll probably have noticed if you've ever needed to find a particular word across a large solution. It scans linearly through all the data to find a match. It does work, though it's really only good for finding literal phrases. A query like +law -computer (match documents which contain the word "law" but not the word "computer") is harder to handle, because you end up repeating that inner loop to evaluate each term of the query. That processing can really mount up for large data sets or complex queries.

Fairly obviously this is no good for something like Google. If you have 10bn web pages to look through, that sort of complexity is absolutely out of the question. And it's not great in my scenario either - 300 pages is enough to make this seem slow. So what can you do in that situation?

You need an index! Some flavour of that is usually the solution to "speed of finding" issues in computing.

The solution

We want to easily work out if a particular word is in any of our documents, so the structure we want to have is a list of unique words which can tell us quickly which document(s) the word exists in. This is technically known as an inverted index.

So we can process all our web page data to generate two fairly simple structures:

The first is a list of the documents we have, giving each a unique ID plus any other metadata we need to be able to show in a results list. So on my blog that might be the ID, URL, Title, Excerpt and Tags. A structure something like:

[
  {
    "id": 1,
    "title": "Test document",
    "url": "/posts/2024/test",
    "excerpt": "Lorem ipsum dolor sit amet. Fulcrum amcordio latens.",
    "tags": ["Test", "General"]
  },
  {
    "id": 3,
    "title": "Another test",
    "url": "/posts/2024/test2",
    "excerpt": "",
    "tags": ["Test"]
  },
  ...
]

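In TypeScript terms, each entry could be described with a type like this (just a sketch matching the JSON above):

// Shape of one entry in the document list above
interface DocumentEntry {
  id: number;
  title: string;
  url: string;
  excerpt: string;
  tags: string[];
}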

And then secondly there's a list of all the words across all the documents, with the IDs of the document(s) each word appears in:

[
  {
    "word": "another",
    "docs": [3]
  },
  {
    "word": "ipsum",
    "docs": [1,2]
  },
  {
    "word": "test",
    "docs": [1,3,4]
  },
  ...
]

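Again as a sketch, that structure might be typed like this - and at query time it's worth loading the array into a Map so each word lookup is O(1):

// Shape of one entry in the word list above
interface WordEntry {
  word: string;
  docs: number[];
}

// Load the serialised array into a Map keyed by word for fast lookups
function toIndex(entries: WordEntry[]): Map<string, number[]> {
  return new Map(entries.map(e => [e.word, e.docs] as [string, number[]]));
}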

And these files can be generated by iterating all the available documents:

FOR EACH document to be indexed
  Generate an ID
  FOR EACH metadata field in current document
    ADD field to results data for this ID
  END FOR  
  FOR EACH word in the document
    IF word IS NOT in words data
      ADD word TO words data
    END IF
    IF the document ID is not recorded for this word
      ADD the document ID to this word
    END IF
  END FOR
END FOR


However, we probably don't want to index every word we find. Very common words like "and", "or", "the", "if" etc. are unlikely to be helpful in the index, so it's common to ignore them during processing. These are often referred to as "stop words".

Also it's usually worth storing all the words in lower case, to make later comparisons easier.
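
Pulling the pseudocode and those two refinements together, a sketch of an index builder in TypeScript might look like this (the stop word list is just a tiny example, and buildIndex is an invented name):

// A tiny example stop word list - a real one would be much longer
const STOP_WORDS = new Set(["and", "or", "the", "if", "a", "an", "of"]);

// Documents here are just { id, text }, as in the earlier sketch
function buildIndex(documents: { id: number; text: string }[]): Map<string, Set<number>> {
  const index = new Map<string, Set<number>>();
  for (const doc of documents) {
    for (const raw of doc.text.split(/\s+/)) {
      // Lower-case and strip punctuation so later comparisons are consistent
      const word = raw.toLowerCase().replace(/[^a-z0-9]/g, "");
      if (word.length === 0 || STOP_WORDS.has(word)) continue; // skip stop words
      if (!index.has(word)) index.set(word, new Set());
      index.get(word)!.add(doc.id); // a Set keeps each document ID unique
    }
  }
  return index;
}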

The results

How do we use this index that's been built?

Well, if a query like +ipsum -test is issued against the example index data above:

  • Start with an empty result set
  • For the first term, find all the IDs which do include the word "ipsum"
    • That is #1 and #2, based on the test data above.
    • The term has "+" (includes) so add these IDs to the overall result set.
    • So results are "#1, #2" at this point, as these are the IDs found for this word.
  • For the second term, find all the IDs which include the word "test"
    • That is #1, #3 and #4, based on the data above.
    • The term has "-" (does not include) so remove these IDs from the overall results set.
    • So results are "#2" at this point, as this is the result of removing any instances of #1, #3 and #4 from the previous result set.
  • There are no more search terms, so use the current set of IDs in the results to look up entries in the list of documents.
  • Display the document metadata for these results to the user
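
As a sketch, that evaluation could look something like this in TypeScript, given the Map-based index from earlier (QueryTerm and evaluate are invented names for the example):

interface QueryTerm {
  word: string;
  include: boolean; // true for "+", false for "-"
}

function evaluate(index: Map<string, Set<number>>, terms: QueryTerm[]): Set<number> {
  const results = new Set<number>();
  for (const term of terms) {
    const docs = index.get(term.word) ?? new Set<number>();
    for (const id of docs) {
      if (term.include) {
        results.add(id);    // "+" terms add their documents to the results
      } else {
        results.delete(id); // "-" terms remove theirs
      }
    }
  }
  return results;
}

// evaluate(index, [{ word: "ipsum", include: true },
//                  { word: "test", include: false }]) // -> Set { 2 }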

As you can see, all the work of iterating files and words has moved to the indexing phase. The searching phase becomes much simpler (and hence much faster) because all it's doing is looking at the IDs in the index. Generating the results is just matching the words, and using the IDs found to pick results out of the set of documents.

Extra credit

There are a few other interesting things that help a simple search work, which are worth mentioning here.

Some words exist in multiple forms. Verbs are a great example here. "Run", "ran", "running" and "runs" can all be treated as the same thing for the purposes of the index, so reducing these words back to their basic form makes it easier to match variations across documents.

This process is referred to as Stemming. You can apply it when you generate your index, and also apply it to the query. So the query +running can match any document that has "run", "ran", "running" or "runs" in it. Usually this is the behaviour we'd want. There are open source implementations you can make use of here.
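
Real stemmers (the Porter stemmer is the classic example) are rather more sophisticated, but a deliberately crude sketch that just strips a few suffixes shows the idea:

// A deliberately crude stemmer - strips a few common English suffixes.
// A real implementation (e.g. the Porter stemmer) handles far more cases.
function stem(word: string): string {
  const w = word.toLowerCase();
  for (const suffix of ["ing", "ed", "s"]) {
    if (w.endsWith(suffix) && w.length - suffix.length >= 3) {
      return w.slice(0, w.length - suffix.length);
    }
  }
  return w;
}

// stem("running") -> "runn", stem("runs") -> "run" - crude, but because
// queries and documents get the same treatment, the matches still line up.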

Also, when you have queries with optional terms rather than just "must have" and "must not have", you get into the concept of scoring results. A query like +fish -rod can only ever match documents precisely. A document either fits the criteria or it does not. But if your query is fish rod (an "or" is implied here) then there are more options. A document may match one of these words, or both of them. The documents with two matches should come further up the result set than those with only one. Hence assigning a score to each result, so you can sort results by it.

You might also involve scoring in queries that use stemming. An exact match for "running" is a better hit than a match for "ran". And you can also weight these scores based on how many times a word appears in a document.
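
As a sketch, scoring an implied-or query can be as simple as counting how many of the query words each document matches (the score function here is just invented for the example):

// Score each document by how many of the query words it contains
function score(index: Map<string, Set<number>>, words: string[]): Map<number, number> {
  const scores = new Map<number, number>();
  for (const word of words) {
    for (const id of index.get(word) ?? []) {
      scores.set(id, (scores.get(id) ?? 0) + 1); // one point per matched term
    }
  }
  return scores;
}

// Sort IDs best-first for display:
// [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id)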

Conclusions

You can build a surprisingly powerful search quite easily using these techniques. Granted it won't scale to Google's level, but it's perfectly good running in a browser to find documents across this blog. In fact, after a bit of testing, the search is way faster than the result rendering itself - a query that matches all the documents in the index is still pretty snappy. It just takes a while to add all that mark-up to the DOM and render 300-odd results... I am considering applying pagination to the results UI to handle this.

And there's plenty more to look into here if the topic interests you. For example you can add ordering and location data to the "words in a document" index so that you can do <a> near <b> queries, or show highlighted extracts in the results too. I may have a go at implementing that myself one day.

But for the moment, I think this fairly simple bit of code has added a useful feature to this site...
