Rayforce ← Back to home
GitHub

Vector Search

Embedding similarity, brute-force KNN, and HNSW-accelerated approximate nearest-neighbour search — composable with Rayfall's select for filter-aware retrieval.

Data Model

An embedding column is a RAY_LIST whose entries are numeric vectors (preferably RAY_F32) of the same length D. Tables hold embeddings alongside any other column types — integers, strings, dates, symbols — and all columns project correctly through nearest-neighbour queries.

(set docs (table [id title score emb]
    (list [0 1 2 3 4]
          (list "alpha" "beta" "gamma" "delta" "epsilon")
          [0.9 0.2 0.8 0.1 0.7]
          (list [1.0 0.0 0.0] [0.0 1.0 0.0]
                [0.0 0.0 1.0] [1.0 1.0 0.0]
                [1.0 0.0 1.0]))))

Direct Distance & Similarity

Four direct builtins. All accept either two vectors (returning a scalar) or a LIST-of-vectors plus a query vector (returning a vector of per-row results).

FunctionReturnsRange
(cos-dist a b)cosine distance 1 − cos(a, b)[0, 2] — lower = closer
(l2-dist a b)euclidean distance ||a − b||₂[0, ∞) — lower = closer
(inner-prod a b)dot product ∑ aᵢ·bᵢany real
(norm x)L2 norm ||x||₂[0, ∞)
(cos-dist [1.0 0.0] [0.0 1.0])         ; → 1.0 (orthogonal)
(l2-dist  [0.0 0.0] [3.0 4.0])         ; → 5.0
(inner-prod [1.0 2.0] [3.0 4.0])      ; → 11.0
(norm    [3.0 4.0])                 ; → 5.0

Brute-force K-Nearest-Neighbours

(knn col query k [metric]) scans a full LIST column and returns the top-K rows as a table with _rowid and _dist columns (ascending by distance). The metric defaults to 'cosine; also supports 'l2 and 'ip (inner-product, sorted by −dot so lower = closer matches the other metrics).

(knn (at docs 'emb) [1.0 0.0 0.0] 3 'cosine)
; ┌────────┬───────┐
; │ _rowid │ _dist │
; │  I64   │  F64  │
; ├────────┼───────┤
; │ 0      │ 0.0   │
; │ 4      │ 0.29  │
; │ 3      │ 0.29  │
; └────────┴───────┘

HNSW Indexes

Build a hierarchical navigable small-world graph over an embedding column for sub-linear nearest-neighbour queries. All three metrics are indexed natively and the chosen metric persists through save/load.

Building

; (hnsw-build col [metric] [M] [ef_construction])
(set idx (hnsw-build (at docs 'emb) 'cosine 16 100))

Defaults: metric='cosine, M=16, ef_construction=200. The returned handle is an atom tagged RAY_ATTR_HNSW; the underlying ray_hnsw_t* is rc-managed — rebinding, scope exit, and process teardown all release the index automatically.

Querying

; (ann handle query k [ef_search])
(ann idx [1.0 0.0 0.0] 3)
; ┌────────┬───────┐
; │ _rowid │ _dist │
; ├────────┼───────┤
; │ 0      │ 0.0   │
; │ 4      │ 0.29  │
; │ 3      │ 0.29  │
; └────────┴───────┘

Persistence & Lifecycle

(hnsw-save idx "/var/index/docs")
(set idx2 (hnsw-load "/var/index/docs"))
(hnsw-info idx2)     ; dict: {nrows dim metric nlayers M efc}
(hnsw-free idx2)     ; optional — rc-managed, will free on scope exit too

Filter-Aware ANN via select … nearest

The select form accepts a nearest: clause that composes WHERE filtering with HNSW-backed (or brute-force) top-K retrieval in one query. The filter predicate is pushed into HNSW's beam search as an iterative scan — rejected candidates are still traversed for graph connectivity but don't consume result slots.

(select {from: docs
         where: (> score 0.5)
         nearest: (ann idx [1.0 0.0 0.0])
         take: 10})
; Returns top-10 rows of `docs` whose score > 0.5, ordered by cosine
; distance to the query.  The implicit projection returns source
; columns only; to include _dist, name it in the output projection.

Brute-force variant on a column reference (no pre-built index needed):

(select {from: docs
         nearest: (knn emb [1.0 0.0 0.0])
         take: 3})

Projection & _dist

The rerank step emits a synthetic _dist column, but it is not included in the default output — (select {from: t nearest: …}) returns the source schema exactly, preserving shape compatibility with (select {from: t}). To include the distance, reference _dist in an explicit projection:

(select {id: id title: title d: _dist
         from: docs
         where: (> score 0.5)
         nearest: (ann idx [1.0 0.0 0.0])
         take: 5})

Metric Symbols

Accepted by hnsw-build, knn, and the nearest clause:

Reference

NameAritySignature
cos-dist2(cos-dist a b) — cosine distance; vec×vec → atom, LIST×vec → vector
l2-dist2(l2-dist a b) — euclidean distance
inner-prod2(inner-prod a b) — positive dot product
norm1(norm x) — L2 norm; scalar if vec, per-row vector if LIST
knn3–4(knn col query k [metric]) — brute-force top-K
hnsw-build1–4(hnsw-build col [metric] [M] [ef_c]) — build index
ann3–4(ann handle query k [ef_s]) — approximate top-K via HNSW
hnsw-save2(hnsw-save handle path) — persist index to directory
hnsw-load1(hnsw-load path) — restore index from directory
hnsw-free1(hnsw-free handle) — explicit release (rc-managed, optional)
hnsw-info1(hnsw-info handle) — dict of {nrows dim metric nlayers M efc}

Design Notes