
Indexes Overview

A map of every index-like structure Rayforce ships: per-column accelerators, vector ANN indexes, linked columns, partition pruning, and graph indexes. It gives one mental model, then a decision matrix that points you at the right tool.

What an “index” means in Rayforce

An index in Rayforce is a precomputed, optional structure that rides alongside the data it's built for. It is not a separate database object: it lives on the column or table it indexes, survives copy / refcount semantics, and travels with the data through the query pipeline. Whether queries actually consult that structure varies by kind — HNSW, linked columns, partition pruning, and CSR are read by their query paths today; the four .idx.* accelerators are built and inspectable but not yet consumed by any operator. See the status section below.

Three properties hold for every kind of index documented on this page:

The five index-like structures

1. Per-column accelerators — .idx.zone / .idx.hash / .idx.sort / .idx.bloom

Attach one of four kinds to a numeric vector. Each kind builds a structure suited to a different query shape: hash for equality lookups, sort for binary search and ordered access, zone for column-level min/max/null pruning, bloom for cheap probabilistic membership rejection. All four occupy the same per-column slot — one kind at a time today.

Today's status: all four build correctly and are inspectable via (.idx.info), but no query operator consults them. Building one does not change filter / in / find / distinct / SIP behavior; the optimizer routing pass that wires the consumers up is the next phase. See the status section below.

Surface: (.idx.zone v), (.idx.hash v), (.idx.sort v), (.idx.bloom v), (.idx.drop v), (.idx.has? v), (.idx.info v). Numeric only in v1 (RAY_BOOL through RAY_TIMESTAMP at the C level; integer / float / date / time / timestamp vectors are the practical reach from Rayfall); RAY_SYM / RAY_STR are deferred.
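
To make the four query shapes concrete, here is a minimal Python sketch of what each kind precomputes. It is an illustration only, not Rayforce's implementation: the column values, the function names, the bloom sizing (64 bits, 3 hashes), and the use of SHA-256 are all assumptions made for the example, and the null count that the zone kind tracks is omitted.

```python
import hashlib
from bisect import bisect_left, bisect_right
from collections import defaultdict

values = [42, 7, 19, 7, 88, 3, 19, 61]  # hypothetical numeric column

# zone: column-level min/max, used to reject a predicate without scanning
zone = {"min": min(values), "max": max(values)}

def zone_may_contain(constant):
    """False means no row can equal `constant`, so the scan can be skipped."""
    return zone["min"] <= constant <= zone["max"]

# hash: value -> row ids, for equality lookups
hash_index = defaultdict(list)
for row, v in enumerate(values):
    hash_index[v].append(row)

# sort: ascending row-id permutation, probed with binary search
perm = sorted(range(len(values)), key=lambda r: values[r])
sorted_vals = [values[r] for r in perm]

def range_rows(lo, hi):
    """Row ids whose value lies in the closed range [lo, hi]."""
    start = bisect_left(sorted_vals, lo)
    stop = bisect_right(sorted_vals, hi)
    return sorted(perm[start:stop])

# bloom: m-bit filter; may report false positives, never false negatives
M_BITS, K_HASHES = 64, 3  # illustrative sizing, not Rayforce's defaults

def bloom_positions(value):
    for i in range(K_HASHES):
        digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % M_BITS

bloom_bits = 0
for v in values:
    for pos in bloom_positions(v):
        bloom_bits |= 1 << pos

def bloom_may_contain(value):
    """False is definitive; True only means 'possibly present'."""
    return all((bloom_bits >> pos) & 1 for pos in bloom_positions(value))
```

The zone and bloom checks can only say "definitely not here", which is why they suit pruning and rejection, while the hash and sort structures return actual row ids.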

See: Accelerator Indexes (reference) · Indexes Guide: choosing a kind.

2. Vector ANN index — HNSW

A Hierarchical Navigable Small World (HNSW) index is a multi-layer proximity graph for approximate nearest neighbor search over float embedding vectors. It supports three distance metrics: cosine, L2, and inner product. Build it once with hnsw-build, query it with ann, and optionally persist it to a directory with hnsw-save / hnsw-load.

Surface: (hnsw-build col [metric] [M] [ef_c]), (ann handle query k [ef_search]), (knn col query k [metric]), (hnsw-save handle dir), (hnsw-load dir), (hnsw-free handle), (hnsw-info handle). Brute-force knn needs no index and remains available alongside ann.
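
As a reference point for what an ANN index approximates, the following Python sketch computes exact brute-force nearest neighbors under the three metrics. It is not Rayforce code: the metric strings ("cosine", "l2", "ip") and the function names are assumptions made for this example, and the sample vectors are made up.

```python
import math

def distance(a, b, metric="cosine"):
    """Smaller result means closer, for every metric."""
    dot = sum(x * y for x, y in zip(a, b))
    if metric == "l2":
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    if metric == "ip":
        return -dot  # a larger inner product counts as closer
    if metric == "cosine":
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / (norm_a * norm_b)  # undefined for zero vectors
    raise ValueError(f"unknown metric: {metric}")

def knn(rows, query, k, metric="cosine"):
    """Exact k nearest row ids: an O(n) scan over every embedding."""
    scored = sorted((distance(r, query, metric), i) for i, r in enumerate(rows))
    return [i for _, i in scored[:k]]

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.1]

nearest_cosine = knn(embeddings, query, 2, "cosine")
nearest_l2 = knn(embeddings, query, 2, "l2")
nearest_ip = knn(embeddings, query, 2, "ip")
```

The scan touches every row on every query. An HNSW graph trades a small amount of recall for visiting far fewer candidates, which is the reason to build the index once the column grows large. Note that the choice of metric can change the ranking: here inner product prefers the longer vector at row 2, while cosine and L2 prefer row 0.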

See: Vector Search & HNSW · Indexes Guide: ANN workflow.

3. Linked columns

A linked column is a column whose values are row-id references into another table. It functions as a row-level index: dereferencing follows the link and resolves the target row at query time, similar in spirit to a foreign-key relationship but maintained at the column level.

Surface: (.col.link col target-table), (.col.unlink col), (.col.link? col), (.col.target col).
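
The dereference step can be pictured with a small Python sketch. The tables, column names, and the deref helper are hypothetical and exist only to show the idea of resolving row ids at read time instead of materializing a join; they are not Rayforce's API.

```python
# Hypothetical dimension table, stored column-wise.
customers = {
    "name": ["Ada", "Grace", "Edsger"],
    "region": ["EU", "US", "EU"],
}

# Hypothetical fact table; customer_link holds row ids into `customers`.
orders = {
    "amount": [120, 75, 300, 40],
    "customer_link": [2, 0, 0, 1],
}

def deref(link_col, target, field):
    """Follow each row id in link_col and fetch target[field] for that row."""
    column = target[field]
    return [column[row_id] for row_id in link_col]

order_regions = deref(orders["customer_link"], customers, "region")
order_names = deref(orders["customer_link"], customers, "name")
```

Each lookup is a direct positional access into the target column, so no hash table or sort is needed to resolve the reference.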

Parted-table interaction: a parted fact can carry a linked column targeting a non-parted dim (in-memory or splayed); per-segment HAS_LINK is preserved through ray_read_parted and segment streaming. Targets with any parted column are rejected at attach time. See Linked Columns: Parted-Table Interaction.

See: Linked Columns.

4. Partition pruning

A storage layout, not a column-level index, but it functions as a coarse zone-map at the table level: the partition discriminator (date, integer, or symbol) selects whole sub-tables to load. Filters that target the partition column let the optimizer skip entire partitions before any scan begins.

Surface: implicit — the directory layout under your database root drives partition selection. The C API loader (ray_part_load) infers the partition type (date / int / sym) from the directory names.
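
A rough Python sketch of the pruning idea follows. The directory naming scheme (YYYY.MM.DD) and the helper names are assumptions for illustration; the source does not specify how Rayforce names date partitions, and the real selection happens inside the optimizer rather than in user code.

```python
from datetime import date, datetime

def parse_partition(name):
    """Parse a directory name as a date (assumed YYYY.MM.DD format)."""
    return datetime.strptime(name, "%Y.%m.%d").date()

def prune(partition_dirs, lo, hi):
    """Keep only partitions whose discriminator lies in [lo, hi]."""
    return [d for d in partition_dirs if lo <= parse_partition(d) <= hi]

# Hypothetical partition directories under a database root.
dirs = ["2024.01.01", "2024.01.02", "2024.01.03", "2024.01.04"]

# A filter on the partition column for 2 to 3 January selects two sub-tables.
to_load = prune(dirs, date(2024, 1, 2), date(2024, 1, 3))
```

Only the surviving directories are opened; the others are never read, which is what makes this behave like a table-level zone map.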

See: Columnar Storage · Storage Guide: partitioned tables · Block Offloading.

5. CSR graph index

A double-indexed Compressed Sparse Row adjacency structure (forward + reverse) attached to graph relationships. Used transparently by every graph opcode — OP_EXPAND, OP_VAR_EXPAND, OP_SHORTEST_PATH, OP_WCO_JOIN — and by Leapfrog Triejoin for worst-case optimal joins.

Surface: none directly — the CSR is built when a relationship is loaded and consulted automatically by graph queries. There is no (.csr.*) Rayfall surface today.
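
For readers unfamiliar with the layout, this Python sketch builds a forward and a reverse CSR from an edge list and runs a breadth-first traversal over it. It shows the general technique only; the function names and the sample graph are made up, and Rayforce's internal representation may differ.

```python
from collections import deque

def build_csr(num_nodes, edges):
    """Return (offsets, targets); node u's neighbors are
    targets[offsets[u]:offsets[u + 1]]."""
    offsets = [0] * (num_nodes + 1)
    for src, _ in edges:
        offsets[src + 1] += 1          # count out-degree per node
    for i in range(num_nodes):
        offsets[i + 1] += offsets[i]   # prefix sum gives slice starts
    targets = [0] * len(edges)
    cursor = list(offsets[:-1])        # next free slot per node
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

def neighbors(csr, u):
    offsets, targets = csr
    return targets[offsets[u]:offsets[u + 1]]

def bfs_order(csr, start):
    """Visit order of a breadth-first traversal from `start`."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in neighbors(csr, u):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]                # hypothetical graph
forward = build_csr(4, edges)                           # outgoing edges
reverse = build_csr(4, [(dst, src) for src, dst in edges])  # incoming edges
```

Keeping both directions means "who does node 2 point to" and "who points to node 2" are each a single contiguous slice, at the cost of storing every edge twice.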

See: Graph Storage · Graph Algorithms.

Pick the right kind

Match the shape of your query to the structure that fits it. Read the “Active today?” column carefully: the four .idx.* kinds are built and inspectable today, but no query operator consults them yet, so they don't change observable query latency until the optimizer routing pass lands. HNSW, linked columns, partition pruning, and CSR are all consumed by their respective query paths today.

| Want to… | Structure | Active today? |
|---|---|---|
| Skip whole columns or segments where a predicate constant lies outside the value range | .idx.zone: min/max plus null count | No: structure built; .idx.info only |
| Make repeated = / in / find / distinct over a numeric column O(1) instead of O(n) | .idx.hash: chained open-addressing table | No: structure built; .idx.info only |
| Binary-search a numeric column for ranges, sorted scans, or limit queries | .idx.sort: ascending row-id permutation | No: structure built; .idx.info only |
| Cheaply reject “definitely not in this set” probes, e.g. for SIP into a join | .idx.bloom: m-bit probabilistic filter | No: structure built; .idx.info only (the SIP pass does not yet consult bloom) |
| Find the k nearest neighbors of an embedding vector by cosine, L2, or inner product | HNSW: (hnsw-build) + (ann) | Yes: (ann) consults the index |
| Resolve a cross-table reference at query time without a materialized join | Linked column: (.col.link) | Yes: column dereference resolves through the link |
| Skip whole sub-tables in a parted dataset based on the partition discriminator | Partition pruning: date / int / sym partitioning | Yes: optimizer pass rewrites filters |
| Traverse a graph (BFS, shortest path, betweenness, MST) | CSR: transparent under graph opcodes | Yes: every graph opcode reads CSR directly |

What's wired today, what's not

Rayforce is honest about phasing. The structures above all build correctly; integration with the optimizer is staged.

Where to go next