Hamming Distance for Hybrid Search in SQLite

This article shows how I implemented semantic search in SQLite using binary embeddings and Hamming distance, enabling hybrid search without external vector databases.

SQLite's FTS5 extension provides excellent text search with BM25 ranking. However, it doesn't support semantic search, so out of the box you can't combine keyword matching with meaning-based retrieval, a technique known as hybrid search.

Binary embeddings and Hamming distance

Semantic search works by converting text into numerical vectors (embeddings) that capture meaning; in my case, I generate them via an API. Texts about similar things produce similar vectors. Embeddings typically use float32 values, so a 1024-dimensional embedding requires 4 KiB per document.

Binary embeddings quantize each dimension to a single bit (0 or 1). This reduces storage dramatically: 1024 dimensions shrink from 4 KiB to just 128 bytes. The similarity metric changes from cosine distance to Hamming distance, which can be computed with fast, simple bit operations instead of more expensive floating-point arithmetic.

The tradeoff is accuracy: binary quantization loses information. For many applications (including mine), this is acceptable given the storage and speed benefits, especially when it's combined with a classic text ranking algorithm like BM25.
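To make the quantization step concrete, here is a minimal TypeScript sketch (the same language I use later for the RRF merge). It assumes simple sign-based thresholding, which is one common scheme; the exact cutoff depends on your embedding provider.

// Sketch: quantize a float32 embedding into a packed binary vector.
// Assumption: threshold at 0 (sign quantization).
function quantizeToBinary(embedding: Float32Array): Uint8Array {
  const packed = new Uint8Array(Math.ceil(embedding.length / 8));
  for (let i = 0; i < embedding.length; i++) {
    if (embedding[i] > 0) {
      // Set bit i, most-significant bit first within each byte.
      packed[i >> 3] |= 1 << (7 - (i % 8));
    }
  }
  return packed; // 1024 dimensions -> 128 bytes
}

The resulting 128-byte buffer is what ends up stored in the embedding BLOB column later on.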

Hamming Distance

Hamming distance counts the number of bit positions where two binary vectors differ.

Example with 8-bit vectors:

Position:  0 1 2 3 4 5 6 7
Vector A:  1 0 1 1 0 1 1 0
Vector B:  1 0 0 1 1 0 1 0
               ^   ^ ^

Bits differ at positions 2, 4, and 5. Hamming distance = 3. For semantic search, lower Hamming distance means higher similarity.

To compute Hamming distance efficiently, we XOR the two vectors (1 where bits differ, 0 where same), then count the 1s (population count or "popcount"). Modern CPUs have dedicated popcount instructions, making this very fast.

Example:

Vector A:  1 0 1 1 0 1 1 0
Vector B:  1 0 0 1 1 0 1 0
A XOR B:   0 0 1 0 1 1 0 0

popcount(A XOR B) = 3 ones = distance of 3
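Here is the same example expressed in code, a small TypeScript sketch using XOR and a naive bit-counting loop (the C extension shown later relies on the hardware popcount instead):

const a = 0b10110110; // Vector A
const b = 0b10011010; // Vector B
let x = a ^ b;        // 0b00101100: 1 wherever the bits differ
let distance = 0;
while (x !== 0) {     // naive popcount: count the set bits
  distance += x & 1;
  x >>= 1;
}
console.log(distance); // 3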

Implementation as a SQLite extension

SQLite extensions are dynamically loaded shared libraries (.so on Linux, .dylib on macOS) that can register custom SQL functions. This is simpler than forking SQLite or using external tools. The extension needs to register the function name with SQLite, implement the computation logic, and handle BLOB inputs while returning an integer result.

Registration code:

#include <stdint.h>
#include <sqlite3ext.h>
SQLITE_EXTENSION_INIT1

/* Defined further below. */
static void hamming_distance(sqlite3_context*, int, sqlite3_value**);

int sqlite3_extension_init(
  sqlite3 *db,
  char **pzErrMsg,
  const sqlite3_api_routines *pApi
){
  SQLITE_EXTENSION_INIT2(pApi);
  return sqlite3_create_function(
    db,
    "hamming_distance",                  /* name when called from SQL */
    2,                                   /* number of arguments */
    SQLITE_UTF8 | SQLITE_DETERMINISTIC,
    0,                                   /* no user data */
    hamming_distance,                    /* the implementation, shown below */
    0, 0                                 /* no step/final: scalar, not aggregate */
  );
}

The core function receives two BLOB arguments (the binary vectors) and returns their Hamming distance:

static void hamming_distance(sqlite3_context *context, int argc, sqlite3_value **argv) {
    /* Defensive check: SQLite already enforces the registered arity of 2. */
    if (argc != 2) {
        sqlite3_result_error(context, "hamming_distance() requires 2 arguments", -1);
        return;
    }

    const unsigned char *blob1 = sqlite3_value_blob(argv[0]);
    const unsigned char *blob2 = sqlite3_value_blob(argv[1]);
    int len1 = sqlite3_value_bytes(argv[0]);
    int len2 = sqlite3_value_bytes(argv[1]);

    if (len1 != len2) {
        sqlite3_result_error(context, "vectors must be same length", -1);
        return;
    }

    /*
       Hamming distance = popcount(v1 XOR v2), processed 8 bytes at a time.

       This cast assumes the architecture supports unaligned 64-bit access,
       which is true for x86_64 and ARMv8-A+, including Apple Silicon and
       the Raspberry Pi 4 I tested against.
    */
    const uint64_t *v1 = (const uint64_t *)blob1;
    const uint64_t *v2 = (const uint64_t *)blob2;

    int chunks = len1 / 8;
    uint64_t distance = 0;

    // Process 8-byte chunks
    for (int i = 0; i < chunks; i++) {
        distance += __builtin_popcountll(v1[i] ^ v2[i]);
    }

    // Remaining bytes (when the vector length is not a multiple of 8)
    for (int i = chunks * 8; i < len1; i++) {
        distance += __builtin_popcount(blob1[i] ^ blob2[i]);
    }

    sqlite3_result_int64(context, distance);
}

This implementation validates that it received two arguments, checks that both vectors have the same length, XORs the vectors in 8-byte chunks and counts the differing bits with the compiler's popcount builtin, then handles any trailing bytes one at a time.

Here are the Makefile rules I used to compile the file for macOS (hamming.dylib) and for Linux (hamming.so):

hamming.dylib: hamming.c
        clang -g -fPIC -dynamiclib hamming.c -o hamming.dylib \
              -undefined dynamic_lookup -march=native -O3

hamming.so: hamming.c
        gcc -g -fPIC -shared hamming.c -o hamming.so \
            -march=native -O3

Performance on large datasets

Testing with 1 million rows of 128-byte binary vectors (1024 dimensions):

-- Load the extension (hamming.so or hamming.dylib)
.load hamming

CREATE TABLE documents (
    rowid INTEGER PRIMARY KEY,
    embedding BLOB NOT NULL  -- 128 bytes for 1024-bit binary
);

-- Populate with 1M random embeddings
WITH RECURSIVE 
  cnt(x) AS (
    SELECT 1
    UNION ALL
    SELECT x+1 FROM cnt
    LIMIT 1000000
  )
INSERT INTO documents (rowid, embedding)
SELECT x, randomblob(128) FROM cnt;

.timer on

-- Searching
SELECT rowid, hamming_distance(:vec, embedding) as dist
FROM documents
ORDER BY dist
LIMIT 10;

Note: Instead of :vec, I hardcoded a raw 128-byte blob by hand. I omitted it to avoid visual clutter.

Having .timer on tells the SQLite shell to report the time taken by each subsequent query. To avoid disk latency and get consistent results, I ran the tests on in-RAM (`:memory:`) databases.

% sqlite3 :memory: < bench.sql
Run Time: real 0.035056 user 0.035006 sys 0.000053

Not bad: 35ms for 1 million rows on my Apple M4 chip, including the sort to get the top-10 results. To measure just the Hamming distance computation without the sorting overhead, we can add an impossible WHERE clause that is always false, forcing SQLite to scan all rows (and compute every distance) while returning none:

SELECT rowid, hamming_distance(:vec, embedding) as dist
FROM documents
WHERE dist < 0; -- Impossible condition, no sorting needed

% sqlite3 :memory: < bench-no-sort.sql
Run Time: real 0.028491 user 0.028477 sys 0.000017

28ms without sorting. The ORDER BY adds approximately 7ms to sort 1M rows.

Is O(n) acceptable?

Whether or not we use ORDER BY + LIMIT, this approach scans every row: there is no indexing and no pruning. But it's still pretty fast.

Implementing HNSW or IVF indexing would add complexity that didn't seem justified for my project, along with memory overhead (maybe 2-5x the data size) and build-time cost. The boring O(n) simplicity is often worth it at this scale.

Limitations

This implementation has a significant limitation: it's just a function callable from SQL, not a full-fledged virtual table. SQLite computes hamming_distance() for every row, then sorts all results, then takes the top 10.

For a true top-k selection, even without any indexing, I believe the extension would need to be a virtual table that can: recognize the ORDER BY + LIMIT k pattern, maintain a heap of the k best candidates during the scan, and avoid sorting the full result set.
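To make that last point concrete, here is a minimal TypeScript sketch of bounded top-k selection over scanned rows. It only illustrates the idea at the application level, not the virtual table itself; a real implementation would live in C and use a proper max-heap, but the principle is the same: never keep or sort more than k candidates.

type Candidate = { rowid: number; dist: number };

// Keep only the k smallest distances seen during a scan. This simple
// array-based version is O(n * k) in the worst case; a max-heap would
// bring it down to O(n log k).
function topK(rows: Iterable<Candidate>, k: number): Candidate[] {
  const best: Candidate[] = []; // at most k entries, ascending by dist
  for (const row of rows) {
    if (best.length < k) {
      best.push(row);
      best.sort((a, b) => a.dist - b.dist);
    } else if (row.dist < best[best.length - 1].dist) {
      best[best.length - 1] = row;          // evict the current worst
      best.sort((a, b) => a.dist - b.dist); // cheap for small k
    }
  }
  return best;
}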

Hybrid search with RRF

Combining FTS5 BM25 ranking with semantic search requires merging two different relevance signals. The standard approach is Reciprocal Rank Fusion (RRF). I went with that.

Run both searches independently: a BM25 search on the tokenized text (keyword matching), and a Hamming-distance search on the binary embeddings (semantic similarity). Each returns its top-k candidates (e.g., 50 results each).

The query text needs to be transformed into a binary vector first. I use an external API for this, and run the BM25 query while awaiting the response to hide the API latency.
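As an illustration, here is a rough TypeScript sketch of that orchestration. The three callback parameters are assumptions standing in for the external embedding API and the two SQL queries; mergeRRF is the merge function shown a bit further below.

async function hybridSearch(
  query: string,
  embedQuery: (q: string) => Promise<Uint8Array>,           // external embedding API
  searchBm25: (q: string, k: number) => Promise<number[]>,  // FTS5 / bm25() query
  searchSemantic: (vec: Uint8Array, k: number) => Promise<number[]>, // hamming_distance() query
): Promise<number[]> {
  const bm25Promise = searchBm25(query, 50);  // start the keyword search right away
  const vec = await embedQuery(query);        // the API round-trip overlaps with BM25
  const [semanticIds, bm25Ids] = await Promise.all([
    searchSemantic(vec, 50),
    bm25Promise,
  ]);
  return mergeRRF(semanticIds, bm25Ids);      // defined below
}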

RRF merges ranked lists without needing to normalize scores across different retrieval systems. It works by converting ranks into scores. For each document, calculate:

score(doc) = 1/(k + rank_bm25 + 1) + 1/(k + rank_semantic + 1)

where rank is the document's position in that retrieval method's results (0-indexed) and k is a constant (typically 60) that controls how quickly scores decay. Documents appearing in both lists receive additive boosts, while documents in only one list still contribute. The final ranking sorts by combined score.


function mergeRRF(
  semanticIds: number[], // Sorted by relevance, assume no dup
  bm25Ids: number[],
  k: number = 60, // RRF constant
): number[] {
  const scores = new Map<number, number>();

  semanticIds.forEach((id, rank) => {
    scores.set(id, (scores.get(id) || 0) + 1 / (k + rank + 1));
  });

  bm25Ids.forEach((id, rank) => {
    scores.set(id, (scores.get(id) || 0) + 1 / (k + rank + 1));
  });

  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
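
A quick usage sketch with hypothetical rowids (not taken from any real dataset), just to show how documents found by both methods float to the top:

// Hypothetical top results from the two searches, best first.
const semanticIds = [42, 7, 913, 5]; // ORDER BY hamming_distance(...) LIMIT ...
const bm25Ids = [7, 128, 42, 66];    // FTS5 MATCH ... ORDER BY bm25(...) LIMIT ...

const merged = mergeRRF(semanticIds, bm25Ids);
// 7 and 42 appear in both lists, so each gets two reciprocal-rank
// contributions and ranks ahead of the single-list ids.
console.log(merged); // [ 7, 42, 128, 913, 5, 66 ]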

Example real queries demonstrating hybrid benefits

The combination handles both precise keyword requirements and fuzzy semantic understanding, all within a single SQLite database.

When to use this approach

This works well when the dataset is small (say, under 10M rows on beefy hardware), or even larger if queries are infrequent enough that the O(n) scan plus sort latency is acceptable, and when you want to avoid external vector database dependencies.