Hamming Distance for Hybrid Search in SQLite
This article shows how I implemented semantic search in SQLite using binary embeddings and Hamming distance, enabling hybrid search without external vector databases.
SQLite's FTS5 extension provides excellent text search with BM25 ranking. However, it doesn't support semantic search, which means you can't combine keyword matching with meaning-based retrieval, a technique known as hybrid search.
Binary embeddings and Hamming distance
Semantic search works by converting text into numerical vectors (embeddings) that capture meaning (in my case, I do that via an API). Similar texts (i.e., texts about the same things) produce similar vectors. Embeddings generally use float32 values, so a 1024-dimensional embedding requires 4 KiB per document.
Binary embeddings quantize each dimension to a single bit (0 or 1). This reduces storage dramatically: 1024 dimensions become only 128 bytes, a 32x reduction. The similarity metric changes from cosine distance to Hamming distance, which replaces expensive floating-point arithmetic with fast, simple bit operations.
The tradeoff is accuracy: binary quantization loses information. For many applications (including mine), this is acceptable given the storage and speed benefits, especially when combined with a classic text search algorithm like BM25.
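As an illustration, here is a minimal sketch of sign-based binarization in TypeScript (bit set when the component is positive). The quantization scheme is an assumption on my part; some embedding APIs return packed binary vectors directly, in which case this client-side step isn't needed:

// Sketch: sign-based binary quantization (an assumed scheme; your
// embedding provider may return packed binary vectors directly).
// 1024 float32 dimensions -> 128 bytes, bit set when the value is > 0.
function binarize(embedding: Float32Array): Uint8Array {
  const out = new Uint8Array(Math.ceil(embedding.length / 8));
  for (let i = 0; i < embedding.length; i++) {
    if (embedding[i] > 0) {
      out[i >> 3] |= 1 << (i & 7); // byte i/8, bit i%8
    }
  }
  return out;
}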
Hamming Distance
Hamming distance counts the number of bit positions where two binary vectors differ.
Example with 8-bit vectors:
Position: 0 1 2 3 4 5 6 7
Vector A: 1 0 1 1 0 1 1 0
Vector B: 1 0 0 1 1 0 1 0
              ^   ^ ^
Bits differ at positions 2, 4, and 5. Hamming distance = 3. For semantic search, lower Hamming distance means higher similarity.
To compute Hamming distance efficiently, we XOR the two vectors (1 where bits differ, 0 where same), then count the 1s (population count or "popcount"). Modern CPUs have dedicated popcount instructions, making this very fast.
Example:
Vector A:  1 0 1 1 0 1 1 0
Vector B:  1 0 0 1 1 0 1 0
A XOR B:   0 0 1 0 1 1 0 0
popcount(A XOR B) = 3 ones = distance of 3
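The same computation as a tiny TypeScript snippet, with a per-bit loop standing in for the CPU's dedicated popcount instruction:

// The 8-bit example above, with position 0 as the most significant bit:
// A = 0b10110110, B = 0b10011010.
function hamming8(a: number, b: number): number {
  let x = a ^ b; // 1 wherever the bits differ
  let count = 0;
  while (x !== 0) {
    count += x & 1; // count the lowest bit...
    x >>>= 1;       // ...then shift it out
  }
  return count;
}

console.log(hamming8(0b10110110, 0b10011010)); // 3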
Implementation as a SQLite extension
SQLite extensions are dynamically loaded shared libraries (.so on Linux, .dylib on macOS) that can register custom SQL functions. This is simpler than forking SQLite or using external tools. The extension needs to: register the function name with SQLite, implement the computation logic, and finally handle BLOB inputs and return an integer result.
Registration code:
#include <stdint.h>
#include <sqlite3ext.h>
SQLITE_EXTENSION_INIT1
/* Implemented below */
static void hamming_distance(sqlite3_context*, int, sqlite3_value**);

int sqlite3_extension_init(
  sqlite3 *db,
  char **pzErrMsg,
  const sqlite3_api_routines *pApi
){
  SQLITE_EXTENSION_INIT2(pApi);
  return sqlite3_create_function(
    db,
    "hamming_distance",                  /* Name when called from SQL */
    2,                                   /* Number of arguments */
    SQLITE_UTF8 | SQLITE_DETERMINISTIC,
    0,                                   /* No user data */
    hamming_distance,                    /* Implementation, defined below */
    0, 0                                 /* No step/final: scalar, not aggregate */
  );
}
The core function receives two BLOB arguments (the binary vectors) and returns their Hamming distance:
static void hamming_distance(sqlite3_context *context, int argc, sqlite3_value **argv) {
  if (argc != 2) {
    sqlite3_result_error(context, "hamming_distance() requires 2 arguments", -1);
    return;
  }

  const unsigned char *blob1 = sqlite3_value_blob(argv[0]);
  const unsigned char *blob2 = sqlite3_value_blob(argv[1]);
  int len1 = sqlite3_value_bytes(argv[0]);
  int len2 = sqlite3_value_bytes(argv[1]);

  if (len1 != len2) {
    sqlite3_result_error(context, "vectors must be same length", -1);
    return;
  }

  /*
    This cast assumes the architecture supports unaligned 64-bit access,
    which is true for x86_64 and ARMv8-A+, including Apple Silicon and
    the Raspberry Pi 4 I tested against.
  */
  const uint64_t *v1 = (const uint64_t *)blob1;
  const uint64_t *v2 = (const uint64_t *)blob2;

  int chunks = len1 / 8;
  uint64_t distance = 0;

  // Process 8-byte chunks
  for (int i = 0; i < chunks; i++) {
    distance += __builtin_popcountll(v1[i] ^ v2[i]);
  }

  // Remaining bytes
  for (int i = chunks * 8; i < len1; i++) {
    distance += __builtin_popcount(blob1[i] ^ blob2[i]);
  }

  sqlite3_result_int64(context, distance);
}
This implementation does the following:
- XORs corresponding 64-bit chunks from both vectors
- Uses __builtin_popcount[ll]() for bit counting (compiles to the appropriate CPU instruction)
- Accumulates the total distance
- Handles any remaining bytes that don't fit in 64-bit chunks
Here are the commands I used to compile the file for macOS (hamming.dylib) and for Linux (hamming.so):
hamming.dylib: hamming.c
	clang -g -fPIC -dynamiclib hamming.c -o hamming.dylib \
		-undefined dynamic_lookup -march=native -O3

hamming.so: hamming.c
	gcc -g -fPIC -shared hamming.c -o hamming.so \
		-march=native -O3
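Once compiled, the extension can be loaded from application code as well as from the sqlite3 CLI. Here's a hedged sketch assuming the better-sqlite3 driver (not necessarily what your stack uses) and a hypothetical docs.db file; documents is the table from the benchmark below:

// Sketch assuming the better-sqlite3 driver; any SQLite binding that
// exposes extension loading works similarly.
import Database from "better-sqlite3";

const db = new Database("docs.db"); // hypothetical database file
db.loadExtension("./hamming"); // SQLite appends .so/.dylib as needed

function searchSemantic(queryBits: Uint8Array, k: number): number[] {
  const rows = db
    .prepare(
      `SELECT rowid, hamming_distance(?, embedding) AS dist
       FROM documents
       ORDER BY dist
       LIMIT ?`,
    )
    .all(Buffer.from(queryBits), k) as { rowid: number; dist: number }[];
  return rows.map((row) => row.rowid);
}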
Performance on large datasets
Testing with 1 million rows of 128-byte binary vectors (1024 dimensions):
-- load hamming.so or hamming.dylib, depending on the platform
.load hamming
CREATE TABLE documents (
rowid INTEGER PRIMARY KEY,
embedding BLOB NOT NULL -- 128 bytes for 1024-bit binary
);
-- Populate with 1M random embeddings
WITH RECURSIVE
cnt(x) AS (
SELECT 1
UNION ALL
SELECT x+1 FROM cnt
LIMIT 1000000
)
INSERT INTO documents (rowid, embedding)
SELECT x, randomblob(128) FROM cnt;
.timer on
-- Searching
SELECT rowid, hamming_distance(:vec, embedding) AS dist
FROM documents
ORDER BY dist
LIMIT 10;
Note: instead of :vec, I hardcoded a raw 128-byte blob by hand. I omitted it here to avoid visual clutter.
Having .timer on tells SQLite to report the time taken by the following queries. To avoid disk latency and get consistent results, I ran the tests on in-RAM (`:memory:`) databases.
% sqlite3 :memory: < bench.sql
Run Time: real 0.035056 user 0.035006 sys 0.000053
Not bad: 35ms for 1 million rows on my Apple M4 chip, including the sort to get the top-10 results. To measure just the Hamming distance computation without the sorting overhead, we can add an impossible WHERE clause that is always false, forcing SQLite to scan all rows but return none:
SELECT rowid, hamming_distance(:vec, embedding) as dist
FROM documents
WHERE dist < 0; -- Impossible condition, no sorting needed
% sqlite3 :memory: < bench-no-sort.sql
Run Time: real 0.028491 user 0.028477 sys 0.000017
28ms without sorting. The ORDER BY adds approximately 7ms to sort 1M rows.
Is O(n) acceptable?
Whether or not we use ORDER BY + LIMIT, this scans every row: there is no indexing and no pruning. But it's still pretty fast.
Implementing HNSW or IVF indexing would add complexity that didn't seem justified for my project, along with memory overhead (maybe 2-5x the data size) and build-time cost. The boring O(n) simplicity is often worth it at this scale.
Limitations
This implementation has a significant limitation: it's just a function callable from SQL, not a full-fledged virtual table. SQLite computes hamming_distance() for every row, then sorts all results, then takes the top 10. For a true top-k selection, even without any indexing, I believe the extension would need to be a virtual table that can: recognize the ORDER BY + LIMIT k pattern, maintain a heap of the k best candidates during scanning, and avoid sorting the full result set.
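To illustrate, here is roughly what that bounded top-k scan looks like; a sketch in application-level TypeScript rather than the virtual-table C code it would actually require:

// Sketch: keep only the k best (smallest-distance) candidates while
// scanning, instead of sorting all n rows. Insertion into a small
// sorted array is fine for small k; a heap scales better.
function topK(
  rows: Iterable<{ rowid: number; dist: number }>,
  k: number,
): { rowid: number; dist: number }[] {
  const best: { rowid: number; dist: number }[] = []; // sorted ascending
  for (const row of rows) {
    if (best.length === k && row.dist >= best[k - 1].dist) continue;
    let i = best.length;
    while (i > 0 && best[i - 1].dist > row.dist) i--;
    best.splice(i, 0, row); // insert at sorted position
    if (best.length > k) best.pop(); // drop the new worst candidate
  }
  return best;
}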
Hybrid search with RRF
Combining FTS5 BM25 ranking with semantic search requires merging two different relevance signals. The standard approach is Reciprocal Rank Fusion (RRF). I went with that.
Run both searches independently: BM25 on the tokenized text (keyword matching) and Hamming distance on the binary embeddings (semantic similarity). Each returns its top-k candidates (e.g., 50 results each).
The query text needs to be transformed into a binary vector first. I use an external API for this, and run the BM25 query while awaiting the response to hide the API latency.
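Put together, the orchestration looks roughly like this. embedText() and searchBM25() are hypothetical placeholder names, not actual functions from my project; binarize() and searchSemantic() are the sketches from earlier, and mergeRRF() is shown below:

// Hypothetical orchestration sketch; embedText() and searchBM25()
// are placeholder names standing in for the API client and FTS5 query.
declare function embedText(text: string): Promise<Float32Array>;
declare function searchBM25(query: string, k: number): number[];

async function hybridSearch(query: string, k = 50): Promise<number[]> {
  // Start the embedding API call first...
  const semanticPromise = embedText(query).then((vec) =>
    searchSemantic(binarize(vec), k),
  );
  // ...and run the synchronous BM25 query while it's in flight.
  const bm25Ids = searchBM25(query, k);
  return mergeRRF(await semanticPromise, bm25Ids);
}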
RRF merges ranked lists without needing to normalize scores across different retrieval systems. It works by converting ranks into scores. For each document, calculate:

score(doc) = 1/(k + rank_bm25 + 1) + 1/(k + rank_semantic + 1)
Where rank is the position in that retrieval method's results (0-indexed) and k is a constant (typically 60) that controls how quickly scores decay. Documents appearing in both lists receive additive boosts; documents in only one list still contribute. The final ranking sorts by combined score.
function mergeRRF(
semanticIds: number[], // Sorted by relevance, assume no dup
bm25Ids: number[],
k: number = 60, // RRF constant
): number[] {
const scores = new Map<number, number>();
semanticIds.forEach((id, rank) => {
scores.set(id, (scores.get(id) || 0) + 1 / (k + rank + 1));
});
bm25Ids.forEach((id, rank) => {
scores.set(id, (scores.get(id) || 0) + 1 / (k + rank + 1));
});
return Array.from(scores.entries())
.sort((a, b) => b[1] - a[1])
.map(([id]) => id);
}
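A quick sanity check of the math, with the default k = 60: a document ranked 0th in one list and 2nd in the other scores 1/61 + 1/63 ≈ 0.032, more than any single-list document can reach (at most 1/61 ≈ 0.016):

// Doc 7 appears in both lists (ranks 0 and 2) and wins the merge;
// ties (docs 3 and 1, both at 1/62) keep insertion order.
mergeRRF([7, 3, 9], [4, 1, 7]); // returns [7, 4, 3, 1, 9]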
Example real queries demonstrating hybrid benefits
- "Apple founder": Semantic search disambiguates (Steve Jobs, not fruit)
- "Python snake habitat": BM25 keyword "snake" filters out programming content
- "Python programming tutorial": Semantic understanding recognizes programming context
The combination handles both precise keyword requirements and fuzzy semantic understanding, all within a single SQLite database.
When to use this approach
This works well when the dataset is small (say, <10M rows on beefy hardware), or even larger when queries are infrequent enough that the O(n) scan + sort latency is acceptable, and when you want to avoid an external vector database dependency.