Skip to content

String Operations

Complete reference for string manipulation in Rayforce — from basic transforms to pattern matching, covering both RAY_SYM and RAY_STR column types.

RAY_SYM vs RAY_STR

Rayforce provides two distinct string column representations, each optimized for different workloads. Choosing the right type is critical for performance.

RAY_SYM — Dictionary-Encoded Symbols

RAY_SYM columns store strings as integer indices into a global intern table. Ideal for low-cardinality categorical data (country codes, status flags, product categories).

  • Adaptive-width indices — 8, 16, 32, or 64-bit integers depending on dictionary size
  • Equality comparison is a single integer compare — O(1)
  • Group-by on SYM columns uses direct index lookup instead of hashing
  • Memory efficient — each row stores only 1–8 bytes regardless of string length
  • Global intern table — symbols are shared across all columns and tables
; Create a table with a SYM column (default for short repeated strings in CSV)
ray> (set t (.csv.read "trades.csv"))
; region column is automatically SYM — only 4 unique values across 1M rows

RAY_STR — Variable-Length Strings

RAY_STR columns store variable-length strings with a hybrid inline/pool layout. Best for high-cardinality or unique text data (names, descriptions, URLs).

  • SSO (Small String Optimization) — strings of 12 bytes or fewer are stored inline in the 16-byte ray_str_t element, requiring zero indirection
  • Pool storage — strings longer than 12 bytes are written to a per-vector pool; the element stores a 4-byte offset and 4-byte length, plus a 4-byte prefix copied inline for fast comparison rejection
  • Fast comparison rejection — the 4-byte prefix allows most unequal comparisons to short-circuit without following the pool pointer
  • Per-vector pool — each RAY_STR vector has its own pool; col_propagate_str_pool() shares pools between source and destination during execution
; STR columns are used for unique/high-cardinality text
ray> (set names (vec-str ["Alice" "Bob" "Charlie"]))
; "Alice" (5 bytes) → stored inline (SSO)
; "A longer description here" (26 bytes) → stored in pool with 4-byte prefix

When to use which?

Use RAY_SYM for columns with fewer than ~65K unique values (status codes, categories, tickers). Use RAY_STR for free-text, names, addresses, or any column where most values are unique. The CSV reader auto-detects: columns with a high repeat ratio become SYM, others become STR.

Null Propagation

All string operations in Rayforce follow strict null propagation semantics:

  • Null input produces null output — if any required input row is null, the output row is null
  • CONCAT is null if any argument is null
  • Null propagation applies uniformly to both RAY_SYM and RAY_STR columns
  • Null bitmaps are carried through the execution pipeline per morsel (1024 elements)

In the C API DAG, null propagation is handled automatically per morsel. String transformation opcodes (STRLEN, UPPER/LOWER/TRIM, SUBSTR, REPLACE, CONCAT) propagate nulls: null input rows produce null output rows. CONCAT is null if any argument is null.

String Functions

DAG-only operations

The following string operations are available in the C API DAG but are not currently exposed as Rayfall builtins: upper, lower, strlen, trim, substr, replace, ilike. They can be used through the C API's DAG opcodes (see table below).

concat

(concat a b) — binary

Concatenates two string arguments. Works on string atoms and vectors element-wise.

ray> (concat "hello" " world")
"hello world"

like

(like str pattern) — binary · element-wise

Case-sensitive glob pattern matching. Returns a boolean (or boolean vector for vector input). Supports * (match any sequence of characters) and ? (match any single character). Works on both RAY_SYM and RAY_STR columns.

ray> (like "hello world" "*world")
true

ray> (like "hello world" "hello*")
true

ray> (select {from:t where: (like name "A*")})
; Returns all rows where name starts with "A"

split

(split str delimiter) — binary · element-wise

Splits each string element by the given delimiter and returns a list of string vectors. Each element in the result is a vector of the split parts. Null input produces null output.

ray> (split "a,b,c" ",")
["a" "b" "c"]

ray> (split "hello world" " ")
["hello" "world"]

format

(format fmt ...args) — variadic

Formats values into a string using a format template. Each % placeholder is replaced with the next stringified argument in order. Useful for building display strings or log messages.

ray> (format "Hello, %!" "world")
"Hello, world!"

ray> (format "% + % = %" 1 2 3)
"1 + 2 = 3"

String Operations in the DAG

When using the C API, string operations are available as DAG opcodes. These are fused into morsel-driven execution alongside arithmetic and comparison operations.

Opcode C API Description
OP_UPPER ray_upper(g, a) Uppercase transform
OP_LOWER ray_lower(g, a) Lowercase transform
OP_STRLEN ray_strlen(g, a) String byte length
OP_TRIM ray_trim_op(g, a) Strip leading/trailing whitespace
OP_SUBSTR ray_substr(g, str, start, len) Extract substring by position
OP_REPLACE ray_replace(g, str, from, to) Replace all occurrences
OP_CONCAT ray_concat(g, args, n) Concatenate N strings
OP_LIKE ray_like(g, input, pattern) Case-sensitive glob pattern match (*/? wildcards)
OP_ILIKE ray_ilike(g, input, pattern) Case-insensitive glob pattern match

C API Example

/* Filter rows where upper(name) LIKE "A*" and compute strlen */
ray_graph_t* g = ray_graph_new(table);

ray_op_t* name    = ray_scan(g, "name");
ray_op_t* up_name = ray_upper(g, name);
ray_op_t* pattern = ray_const_str(g, "A*", 2);
ray_op_t* pred    = ray_like(g, up_name, pattern);

ray_op_t* filt_name = ray_filter(g, name, pred);
ray_op_t* name_len  = ray_strlen(g, filt_name);

/* Execute — upper, like, filter, strlen all fused into one morsel pass */
ray_t* result = ray_execute(g, ray_optimize(g, name_len));

String Pool Internals

Understanding the internal layout helps explain performance characteristics of string operations.

ray_str_t Element Layout (16 bytes)

Bytes Inline (SSO) Pool Reference
0–3 String data [0..3] 4-byte prefix (first 4 bytes of string)
4–7 String data [4..7] Pool offset (uint32_t)
8–11 String data [8..11] String length (uint32_t)
12–15 Length + flag Length + flag (high bit = 1 for pool)

SSO threshold

Strings of 12 bytes or fewer are stored entirely within the 16-byte element — no heap allocation, no pointer chase. The majority of real-world strings (tickers, codes, short names) benefit from this optimization. Access via ray_str_vec_get() returns a pointer to the inline data or pool data transparently.

Hash and Comparison

String hashing uses ray_str_t_hash() which operates directly on the element bytes. Comparison via ray_str_t_cmp() / ray_str_t_eq() first compares the 4-byte prefix for fast rejection, then falls through to a full byte comparison only when prefixes match. This makes hash joins and group-by on string columns significantly faster than naive approaches.

Dictionary-Encoded Symbol Width

RAY_SYM columns use adaptive-width integer indices to minimize memory:

Dictionary Size Index Width Bytes per Row
≤ 255 8-bit 1
≤ 65,535 16-bit 2
≤ 4,294,967,295 32-bit 4
Larger 64-bit 8

Width is set at column creation via ray_sym_vec_new(sym_width, capacity) where sym_width is 1, 2, 4, or 8. The CSV reader picks the narrowest width that fits the observed cardinality.