May 18, 2025

Search with Zig, Wasm, and a Worker ⚡

Static search was a nice upgrade to my blog. Pagefind is super easy to use. Point it at <main> and it’ll crawl your website making it searchable with a bit of front-end magic.

A minor issue I had with Pagefind was file churn. Every re-index resulted in 200+ deletions and 200+ additions. This made deployment slow and wasteful. This was specifically fixed in Pagefind v1.3.0 but I had locked to v1.2.0 and never noticed.

What I should have done is read the documentation. Should have RTFM, would have gotten a good night’s sleep. Instead I nerd-sniped myself. I stayed up way too late coding a Levenshtein Distance algorithm in Zig before giving up and copying this Gist.

The end result of this madness is a new step in my static site generator and plenty of learning. I built my own search and I attempt to document it below.

Creating an Index

I’m sure there are many scientific and mathematical papers written on search indexes. I’ve read none of them. I figured a big JSON array would suffice.

I used my HTML parsing utilities to strip tags and generate a list of words on my blog. I normalise to lowercase and strip anything that isn’t a–z including diacritics.

JavaScript has a neat trick using Unicode properties:

word = word
  .toLowerCase()
  .normalize("NFD")
  .replace(/\p{Diacritic}/gu, "");

This uses “canonical decomposition” which turns a single code point into two. For example "\u00F1" is ñ aka Latin Small Letter N with Tilde. Compare that to "\u006E\u0303", which is also ñ but composed of a lowercase ASCII “n” followed by the Combining Tilde.

This process gives a form of “fuzzy” matching when also applied to search queries. This is all academic because I’m English and barely use all 26 letters. I did use an umlaut once, so I must account for that edge case.

My website has a vocabulary of over 12000 unique words with “and” and “the” jostling for top spot. Oooh “jostling” I think that’s +1.

Words are stored along with an array of page IDs and the number of occurrences on that page (IDs are a 32-bit hash of the URL). I throw away common words like “and” to save space. If you search “and” it’ll match “brand”, “standard”, etc.

Appended to the index I have a map linking IDs to page URLs and titles. The entire search index is 1.5 MB of raw JSON but only 275 KB compressed. That’s reasonably small and I only load it if search is used. (I do some optimisation later.)
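Purely as a hypothetical illustration of that shape (the values and exact layout here are made up, not the real format), the index pairs each word with its page references, followed by the appended lookup map:

[
  ["jostling", [[123456789, 1]]],
  ["search", [[123456789, 4], [987654321, 2]]]
]

{
  "123456789": ["/2025/05/example-post/", "Example Post Title"],
  "987654321": ["/2024/01/another-post/", "Another Post Title"]
}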

Searching the Index

Now this is where things get silly. I’m sure I could’ve done the next part in JavaScript but I wanted to use WebAssembly and nothing was going to stop me. I’ve been learning Zig by building CLI tools and light switches. Zig can compile to Wasm.

As mentioned I use the Levenshtein Distance to match normalised keywords against my index. I literally run the calculation against every word. Sorry Google, I’m not looking for a new job right now. Sounds like Google stopped caring anyway.
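For the curious, the textbook two-row dynamic-programming version looks roughly like this (a sketch of the general algorithm, not the exact code from that Gist):

const std = @import("std");

// Textbook two-row Levenshtein sketch. Assumes normalised a–z words no
// longer than 63 bytes so fixed buffers avoid any allocation.
fn levenshtein(a: []const u8, b: []const u8) usize {
    std.debug.assert(b.len < 64);
    var prev: [64]usize = undefined;
    var curr: [64]usize = undefined;
    for (0..b.len + 1) |j| prev[j] = j;
    for (a, 0..) |ca, i| {
        curr[0] = i + 1;
        for (b, 0..) |cb, j| {
            const cost: usize = if (ca == cb) 0 else 1;
            curr[j + 1] = @min(
                prev[j + 1] + 1, // deletion
                @min(curr[j] + 1, prev[j] + cost), // insertion, substitution
            );
        }
        @memcpy(prev[0 .. b.len + 1], curr[0 .. b.len + 1]);
    }
    return prev[b.len];
}

A distance of zero is an exact match; small distances are what give the fuzzy matching.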

I rank pages based on factors like how closely each keyword matches (its Levenshtein distance) and how often it occurs on a page.

I adjusted a few magic number multipliers until the results felt right. I won’t dive too much into the Zig code because I’ll probably refactor it multiple times.
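The shape of that scoring is roughly as follows, with made-up numbers standing in for the real multipliers:

// Illustrative only: closer matches and more occurrences score higher,
// scaled by arbitrary multipliers tweaked until the results feel right.
fn score(distance: usize, occurrences: u8) f32 {
    const closeness = 1.0 / (1.0 + @as(f32, @floatFromInt(distance)));
    return closeness * 10.0 + @as(f32, @floatFromInt(occurrences)) * 2.0;
}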

Optimisation

1.5 MB (275 KB compressed) is fine but I can do better! JSON object keys are pure overhead. Brackets? Quotation marks? Commas? Get ’em outta here! Pure bloat. Storing 32-bit hashes as 8-byte hexadecimal strings is twice the necessary size.

I came up with a simple binary format. Hashes are raw 32-bit numbers. Strings are still UTF-8 but they’re preceded by their byte length. Arrays too are just a count followed by packed data. This reduced the file size to 556 KB (214 KB compressed).

Somehow I mixed both big and little endian
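As a rough sketch of the idea (the field widths here are a guess, not the exact format), a single word record packs down to:

word length  → 1 byte
word bytes   → UTF-8, no quotes or escaping
page count   → 2 bytes
page refs    → page count × 5 bytes (32-bit hash + 1 byte occurrence count)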

Zig can use this data more efficiently too, without a full JSON parser. I used a similar technique to packed structs. When I find an array of hashes, i.e. 5 bytes after 5 bytes repeating, it isn’t parsed; it’s referenced straight from memory.

const Ref = extern struct {
    hash: [4]u8,
    occurrences: u8,
};

const Word = struct {
    word: []const u8,
    pages: [*]const Ref,
    pages_len: u16,
};

The pages array is just a pointer into memory (the embedded index data). The parser doesn’t iterate 5 bytes at a time creating Ref objects. It just reads the pages_len and skips ahead pages_len * 5 bytes. I can still use syntax like pages[0].hash to read array items.

There is a little bit of parsing for the Word structs. I’m sure if I understood more I could make a binary format that exactly replicates memory layout and parse nothing. I gave up; I’m not smart enough to handle alignment errors yet.
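For illustration, reading one record under those assumptions might look like this (the field widths and little-endian reads are guesses, not the exact format):

fn readWord(data: []const u8, offset: *usize) Word {
    // Length-prefixed word string: one byte of length, then the UTF-8 bytes.
    const word_len = data[offset.*];
    offset.* += 1;
    const word = data[offset.*..][0..word_len];
    offset.* += word_len;
    // Count of page refs; the refs themselves stay in place in the buffer.
    const pages_len = std.mem.readInt(u16, data[offset.*..][0..2], .little);
    offset.* += 2;
    const pages: [*]const Ref = @ptrCast(data.ptr + offset.*);
    offset.* += @as(usize, pages_len) * @sizeOf(Ref);
    return .{ .word = word, .pages = pages, .pages_len = pages_len };
}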

With these optimisations I was able to halve both file size and search time.

Wasm and Workers

JavaScript does a good job of hiding the fact that it’s single-threaded. That is, until you try to execute tasks during UI interaction; delayed typing makes it painfully obvious. To fix this I initialise my Wasm module inside a Web Worker.

Getting data in and out of Wasm isn’t straightforward. You can call functions and pass numeric values, but not strings or objects. You do have full memory access to work with though.

Wasm memory is created in JavaScript.

const memory = new WebAssembly.Memory({
  initial: 10,
  maximum: 100,
});

const wasm = await WebAssembly.instantiateStreaming(
  fetch("search.wasm"), {
    env: { memory }
  },
);

const { exports } = wasm.instance;

exports.init();

Within Zig I allocate a fixed 10 KB buffer I’ll use to pass data back and forth.

const std = @import("std");
// Assumption: the allocator behind this snippet is std.heap.wasm_allocator.
const wasm_allocator = std.heap.wasm_allocator;

var scratch: []u8 = undefined;
const scratch_len: usize = 1024 * 10;

export fn init() void {
    scratch = wasm_allocator.alloc(u8, scratch_len) catch unreachable;
}

The init function is the one I call from JavaScript. These functions are synchronous, which is why a worker is so important.

I have a second function to return the address of the allocated memory.

export fn getScratch() [*]const u8 {
    return @ptrCast(scratch);
}

Back to JavaScript, I can write bytes to the scratch memory.

const scratch = new Uint8Array(
  memory.buffer,
  exports.getScratch(),
  128
);
scratch.set(
  [...new TextEncoder().encode("Hello, Zig!"), 0],
  0
);
exports.something();

This creates a temporary Uint8Array backed by the same underlying Wasm memory at the allocated offset. I write “Hello, Zig!” as a null-terminated string. Then I call something (another exported Zig function). Zig can then read the data I just wrote to scratch.
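On the Zig side, something could recover that string from the scratch buffer with std.mem.sliceTo, for example (a sketch, not the actual function):

export fn something() void {
    // Slice up to (not including) the null terminator written from JavaScript.
    const input = std.mem.sliceTo(scratch, 0);
    // input is now "Hello, Zig!" and can be used like any other slice.
    _ = input;
}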

This works in reverse too. When I initialise Wasm I can provide functions.

const debug = (ptr, len) => {
  const message = new TextDecoder().decode(
    memory.buffer.slice(ptr, ptr + len),
  );
  console.debug(message);
};

const wasm = await WebAssembly.instantiateStreaming(
  fetch("search.wasm"), {
    env: { memory, debug }
  },
);

Those functions are tagged in Zig with extern:

extern fn debug(buf: usize, len: usize) void;

I can write text to the scratch buffer and then call JavaScript with its offset and length:

const slice = try std.fmt.bufPrint(
    scratch, "Hello, {s}!", .{"JavaScript"}
);
debug(@intFromPtr(slice.ptr), slice.len);

This will log “Hello, JavaScript!” to the browser console.

For the actual search results my Zig code writes JSON to the buffer and informs the worker it’s done. The worker transfers data back to the main thread:

const data = new Uint8Array(
  memory.buffer.slice(ptr, ptr + len),
);
self.postMessage({
    type: "results",
    buffer: data.buffer,
  },
  [data.buffer]
);
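The Zig half of that hand-off isn’t shown above. One way to do it is to let std.json serialise straight into the scratch buffer and call an imported function with the offset and length (the onResults import and Result fields here are hypothetical):

const Result = struct { url: []const u8, title: []const u8 };

extern fn onResults(ptr: usize, len: usize) void;

fn sendResults(results: []const Result) !void {
    var fbs = std.io.fixedBufferStream(scratch);
    try std.json.stringify(results, .{}, fbs.writer());
    const json = fbs.getWritten();
    onResults(@intFromPtr(json.ptr), json.len);
}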

The main thread parses the JSON itself:

worker.addEventListener("message", (ev) => {
  if (ev.data.type === "results") {
    const data = JSON.parse(
      new TextDecoder().decode(ev.data.buffer)
    );
    // Update UI...
  }
});

Woah! That was a fair bit of back and forth. I’m sure my code has plenty of race conditions too! It works well enough for the time being.

After all that effort you can now search my blog. Like you could before. At some point in the future I should research the correct way to index a blog and make this 10x more efficient.

I added some CSS view transitions to make it look fancier.

To save bandwidth no Wasm is loaded until the search input is focused. I still employ the same DuckDuckGo fallback should anything fail to initialise.

Shoutout again to Pagefind for serving me until now 🫡
