String Operations
Complete reference for string manipulation in Rayforce — from basic transforms to pattern matching, covering both RAY_SYM and RAY_STR column types.
RAY_SYM vs RAY_STR
Rayforce provides two distinct string column representations, each optimized for different workloads. Choosing the right type is critical for performance.
RAY_SYM — Dictionary-Encoded Symbols
RAY_SYM columns store strings as integer indices into a global intern table. Ideal for low-cardinality categorical data (country codes, status flags, product categories).
- Adaptive-width indices — 8, 16, 32, or 64-bit integers depending on dictionary size
- Equality comparison is a single integer compare — O(1)
- Group-by on SYM columns uses direct index lookup instead of hashing
- Memory efficient — each row stores only 1–8 bytes regardless of string length
- Global intern table — symbols are shared across all columns and tables
; Create a table with a SYM column (default for short repeated strings in CSV)
ray> (set t (.csv.read "trades.csv"))
; region column is automatically SYM — only 4 unique values across 1M rows
RAY_STR — Variable-Length Strings
RAY_STR columns store variable-length strings with a hybrid inline/pool layout. Best for high-cardinality or unique text data (names, descriptions, URLs).
- SSO (Small String Optimization) — strings of 12 bytes or fewer are stored inline in the 16-byte
ray_str_telement, requiring zero indirection - Pool storage — strings longer than 12 bytes are written to a per-vector pool; the element stores a 4-byte offset and 4-byte length, plus a 4-byte prefix copied inline for fast comparison rejection
- Fast comparison rejection — the 4-byte prefix allows most unequal comparisons to short-circuit without following the pool pointer
- Per-vector pool — each RAY_STR vector has its own pool;
col_propagate_str_pool()shares pools between source and destination during execution
; STR columns are used for unique/high-cardinality text
ray> (set names (vec-str ["Alice" "Bob" "Charlie"]))
; "Alice" (5 bytes) → stored inline (SSO)
; "A longer description here" (26 bytes) → stored in pool with 4-byte prefix
RAY_SYM for columns with fewer than ~65K unique values (status codes, categories, tickers). Use RAY_STR for free-text, names, addresses, or any column where most values are unique. The CSV reader auto-detects: columns with a high repeat ratio become SYM, others become STR.
Null Propagation
All string operations in Rayforce follow strict null propagation semantics:
- Null input produces null output — if any required input row is null, the output row is null
- CONCAT is null if any argument is null
- Null propagation applies uniformly to both RAY_SYM and RAY_STR columns
- Null bitmaps are carried through the execution pipeline per morsel (1024 elements)
In the C API DAG, null propagation is handled automatically per morsel. String transformation opcodes (STRLEN, UPPER/LOWER/TRIM, SUBSTR, REPLACE, CONCAT) propagate nulls: null input rows produce null output rows. CONCAT is null if any argument is null.
String Functions
upper, lower, strlen, trim, substr, replace, ilike. They can be used through the C API's DAG opcodes (see table below).
concat
like
* (match any sequence of characters) and ? (match any single character). Works on both RAY_SYM and RAY_STR columns.split
format
% placeholder is replaced with the next stringified argument in order. Useful for building display strings or log messages.String Operations in the DAG
When using the C API, string operations are available as DAG opcodes. These are fused into morsel-driven execution alongside arithmetic and comparison operations.
| Opcode | C API | Description |
|---|---|---|
OP_UPPER | ray_upper(g, a) | Uppercase transform |
OP_LOWER | ray_lower(g, a) | Lowercase transform |
OP_STRLEN | ray_strlen(g, a) | String byte length |
OP_TRIM | ray_trim_op(g, a) | Strip leading/trailing whitespace |
OP_SUBSTR | ray_substr(g, str, start, len) | Extract substring by position |
OP_REPLACE | ray_replace(g, str, from, to) | Replace all occurrences |
OP_CONCAT | ray_concat(g, args, n) | Concatenate N strings |
OP_LIKE | ray_like(g, input, pattern) | Case-sensitive glob pattern match (*/? wildcards) |
OP_ILIKE | ray_ilike(g, input, pattern) | Case-insensitive glob pattern match |
C API Example
/* Filter rows where upper(name) LIKE "A*" and compute strlen */
ray_graph_t* g = ray_graph_new(table);
ray_op_t* name = ray_scan(g, "name");
ray_op_t* up_name = ray_upper(g, name);
ray_op_t* pattern = ray_const_str(g, "A*", 2);
ray_op_t* pred = ray_like(g, up_name, pattern);
ray_op_t* filt_name = ray_filter(g, name, pred);
ray_op_t* name_len = ray_strlen(g, filt_name);
/* Execute — upper, like, filter, strlen all fused into one morsel pass */
ray_t* result = ray_execute(g, ray_optimize(g, name_len));
String Pool Internals
Understanding the internal layout helps explain performance characteristics of string operations.
ray_str_t Element Layout (16 bytes)
| Bytes | Inline (SSO) | Pool Reference |
|---|---|---|
| 0–3 | String data [0..3] | 4-byte prefix (first 4 bytes of string) |
| 4–7 | String data [4..7] | Pool offset (uint32_t) |
| 8–11 | String data [8..11] | String length (uint32_t) |
| 12–15 | Length + flag | Length + flag (high bit = 1 for pool) |
ray_str_vec_get() returns a pointer to the inline data or pool data transparently.
Hash and Comparison
String hashing uses ray_str_t_hash() which operates directly on the element bytes. Comparison via ray_str_t_cmp() / ray_str_t_eq() first compares the 4-byte prefix for fast rejection, then falls through to a full byte comparison only when prefixes match. This makes hash joins and group-by on string columns significantly faster than naive approaches.
Dictionary-Encoded Symbol Width
RAY_SYM columns use adaptive-width integer indices to minimize memory:
| Dictionary Size | Index Width | Bytes per Row |
|---|---|---|
| ≤ 255 | 8-bit | 1 |
| ≤ 65,535 | 16-bit | 2 |
| ≤ 4,294,967,295 | 32-bit | 4 |
| Larger | 64-bit | 8 |
Width is set at column creation via ray_sym_vec_new(sym_width, capacity) where sym_width is 1, 2, 4, or 8. The CSV reader picks the narrowest width that fits the observed cardinality.