Data Persistence Guide

How to save and load data: CSV files, columnar storage, splayed tables, partitioned tables, and memory-mapped I/O.

Rayforce provides multiple storage layers, from simple CSV for interchange to high-performance columnar files for production workloads. CSV I/O is available from Rayfall; the columnar storage layers are currently accessible through the C API.

1. CSV I/O

The simplest way to persist and exchange data. Available directly from Rayfall.

Reading CSV

(set data (.csv.read "path/to/data.csv"))

The reader:

Treats the first row as column headers
Infers column types automatically: i64, f64, date, time, timestamp, sym
Uses parallel parsing for large files
Handles null values (empty cells)

Writing CSV

(.csv.write table "path/to/output.csv")

Example: Round-Trip

(set data (table [Name Score Grade]
  (list [Alice Bob Charlie]
        [95 87 92]
        [A B A])))

(.csv.write data "/tmp/grades.csv")
(.csv.read "/tmp/grades.csv")

┌─────────┬───────┬───────────────────┐
│  Name   │ Score │       Grade       │
│   sym   │  i64  │        sym        │
├─────────┼───────┼───────────────────┤
│ Alice   │ 95    │ A                 │
│ Bob     │ 87    │ B                 │
│ Charlie │ 92    │ A                 │
├─────────┴───────┴───────────────────┤
│ 3 rows (3 shown) 3 columns (3 shown)│
└─────────────────────────────────────┘

Note: Float values without a fractional part (e.g., 150.0) are read back as integers after a CSV round trip, because the CSV reader always infers the narrowest matching type.
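To illustrate what narrowest-type inference can look like, here is a minimal C sketch that classifies a single cell by trying integer first, then float, then falling back to text. It is not Rayforce's parser: cell_kind and classify_cell are hypothetical names, and only a subset of the types listed above is covered. It also assumes that a value such as 150.0 reaches the reader as the text 150, which this guide does not confirm.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical cell classes for this sketch; not Rayforce's type codes. */
typedef enum { CELL_NULL, CELL_I64, CELL_F64, CELL_SYM } cell_kind;

/* Classify one CSV cell by trying the narrowest type first. */
cell_kind classify_cell(const char *s) {
    assert(s != NULL);
    if (*s == '\0') return CELL_NULL;      /* empty cell is treated as null */

    char *end;

    /* Narrowest candidate: the whole cell parses as a base-10 integer. */
    errno = 0;
    (void)strtoll(s, &end, 10);
    if (errno == 0 && end != s && *end == '\0') return CELL_I64;

    /* Next candidate: the whole cell parses as a floating-point number. */
    errno = 0;
    (void)strtod(s, &end);
    if (errno == 0 && end != s && *end == '\0') return CELL_F64;

    return CELL_SYM;                       /* anything else is kept as text */
}
```

Under this sketch, the text 150 classifies as an integer while 150.5 stays a float. A real reader would also have to settle on one type per column, not per cell.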

2. Symbol Table Persistence

Rayforce maintains a global symbol intern table for sym columns. When saving and loading columnar data, the symbol table must be persisted alongside it so that symbol IDs remain meaningful.

The C API provides:

// Save the global symbol table to a file
ray_err_t ray_sym_save(const char* path);

// Load symbols from a file (merges with the current table)
ray_err_t ray_sym_load(const char* path);

The symbol file uses an append-only format: new symbols are appended on save, and loading merges them into the running table. File locking (flock on POSIX, LockFileEx on Windows) ensures safe concurrent access.
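The general pattern of appending under an advisory lock can be sketched in plain C as follows. This is an illustration of flock-guarded appends on a POSIX system, not Rayforce's symbol file code: append_locked and the newline-terminated record layout are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

/* Append one record to `path` while holding an exclusive advisory lock.
 * Returns 0 on success, -1 on failure. The newline-terminated record
 * layout is hypothetical, not Rayforce's symbol file format. */
int append_locked(const char *path, const char *record) {
    assert(path != NULL && record != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return -1;

    if (flock(fd, LOCK_EX) != 0) {   /* blocks until other holders release */
        close(fd);
        return -1;
    }

    size_t len = strlen(record);
    int ok = write(fd, record, len) == (ssize_t)len
          && write(fd, "\n", 1) == 1;

    flock(fd, LOCK_UN);              /* release before closing */
    close(fd);
    return ok ? 0 : -1;
}
```

Because the lock is advisory, it only protects against other writers that also take the lock before touching the file.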

In typical usage, you do not call these directly — the splayed and partitioned table functions handle symbol persistence automatically.

3. Columnar Files

C API only. The functions below are in internal headers (src/store/col.h, src/store/splay.h, src/store/part.h). Compile with -Isrc and include the specific header. These are not yet exposed as Rayfall builtins.

A column file stores a single vector (column) in Rayforce’s native binary format. This is the building block for all higher-level storage.

// Save a vector to a column file
ray_err_t ray_col_save(ray_t* vec, const char* path);

// Load a vector from a column file
ray_t* ray_col_load(const char* path);

// Memory-map a column file (zero-copy, read-only)
ray_t* ray_col_mmap(const char* path);

The file format is compact: a header with type and length, followed by the raw element data. For string columns (RAY_STR), the pool data is written after the element array.
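The idea of a small header followed by raw element data can be sketched in a few lines of C. The layout below (a 32-bit type tag, a 64-bit length, then native-endian i64 values) is made up for illustration and is not Rayforce's actual on-disk format; SKETCH_TYPE_I64, sketch_save_i64, and sketch_load_i64 are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SKETCH_TYPE_I64 1   /* made-up type tag for this sketch */

/* Write a type tag and a length, then the raw i64 elements.
 * Returns 0 on success, -1 on failure. */
int sketch_save_i64(const char *path, const int64_t *data, int64_t len) {
    assert(path != NULL && data != NULL && len > 0);
    FILE *f = fopen(path, "wb");
    if (!f) return -1;

    int32_t type = SKETCH_TYPE_I64;
    int ok = fwrite(&type, sizeof type, 1, f) == 1
          && fwrite(&len, sizeof len, 1, f) == 1
          && fwrite(data, sizeof *data, (size_t)len, f) == (size_t)len;

    return (fclose(f) == 0 && ok) ? 0 : -1;
}

/* Read the file back into a heap buffer. The caller frees the result.
 * Returns NULL if the file is missing, truncated, or has another type. */
int64_t *sketch_load_i64(const char *path, int64_t *len_out) {
    assert(path != NULL && len_out != NULL);
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;

    int32_t type;
    int64_t len;
    int64_t *data = NULL;
    if (fread(&type, sizeof type, 1, f) == 1 && type == SKETCH_TYPE_I64
        && fread(&len, sizeof len, 1, f) == 1 && len > 0) {
        data = malloc((size_t)len * sizeof *data);
        if (data && fread(data, sizeof *data, (size_t)len, f) != (size_t)len) {
            free(data);              /* truncated file: discard partial read */
            data = NULL;
        }
    }
    fclose(f);
    if (data) *len_out = len;
    return data;
}
```

Because the element data is stored raw, loading is a single bulk read with no per-element parsing, which is also what makes such a file a natural fit for memory mapping.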

The difference between ray_col_load and ray_col_mmap:

Function	Memory	Access	Use case
ray_col_load	Copies data into heap	Read/write	Data you will modify
ray_col_mmap	Maps file directly	Read-only	Large data, zero startup cost

4. Splayed Tables

A splayed table stores each column as a separate file inside a directory. This allows loading individual columns on demand instead of the entire table.

// Save a table as a splayed directory
ray_err_t ray_splay_save(ray_t* tbl, const char* dir,
                         const char* sym_path);

// Load a splayed table from a directory
ray_t* ray_splay_load(const char* dir,
                      const char* sym_path);

The directory layout looks like:

trades/
  .d            # schema (column name symbol IDs)
  Symbol        # sym column
  Price         # f64 column
  Qty           # i64 column

Pass a sym_path to automatically save/load the symbol table alongside the data. Pass NULL if you manage symbols separately.

5. Partitioned Tables

For very large datasets, Rayforce supports date-partitioned storage. Each partition is a splayed table inside a date-named subdirectory.

// Load a partitioned table
ray_t* ray_part_load(const char* db_root,
                     const char* table_name);

Expected directory layout:

db/
  sym                  # shared symbol table
  2024.01.01/
    trades/
      Symbol
      Price
      Qty
  2024.01.02/
    trades/
      Symbol
      Price
      Qty
  ...

The loader:

Loads the shared sym file from db_root
Discovers date directories (sorted numerically)
Loads each partition as a splayed table
Adds a virtual Date column from the directory name
Concatenates all partitions into a single table
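The discovery step in the list above can be sketched as follows: keep only entries whose names look like YYYY.MM.DD, turn each into an integer key, and sort the keys. This is a standalone illustration, not the loader's code; date_key and sorted_date_keys are hypothetical helpers, and the directory listing is passed in as an array so the sketch does not depend on any filesystem API.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Parse "YYYY.MM.DD" into the integer YYYYMMDD. Returns -1 if the name
 * does not have that shape, so entries such as "sym" can be skipped.
 * Month and day ranges are not validated in this sketch. */
int date_key(const char *name) {
    assert(name != NULL);
    if (strlen(name) != 10 || name[4] != '.' || name[7] != '.') return -1;

    int key = 0;
    for (int i = 0; i < 10; i++) {
        if (i == 4 || i == 7) continue;            /* skip the separators */
        if (!isdigit((unsigned char)name[i])) return -1;
        key = key * 10 + (name[i] - '0');
    }
    return key;
}

/* qsort comparator for ascending int order. */
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Collect the keys of all date-named entries in `names` into `keys`
 * (which must have room for n ints), sorted ascending.
 * Returns the number of keys written. */
size_t sorted_date_keys(const char *const *names, size_t n, int *keys) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        int k = date_key(names[i]);
        if (k >= 0) keys[count++] = k;
    }
    qsort(keys, count, sizeof *keys, cmp_int);
    return count;
}
```

For a listing containing 2024.01.02, sym, 2023.12.31, and 2024.01.01, the sketch skips sym and yields the keys 20231231, 20240101, 20240102.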

This is the recommended layout for time-series data that grows continuously. New partitions can be added without rewriting existing data.

For datasets larger than available RAM, Rayforce can process partitioned tables one segment at a time using block offloading — streaming through partitions without loading them all at once. The optimizer’s partition pruning pass can skip entire partitions that don’t match filter predicates. See Block Offloading for details.
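As a rough illustration of why date-named partitions make pruning cheap, the C sketch below decides from the directory name alone whether a partition can match a date-range filter. It is not the optimizer's pruning pass; partition_in_range and count_unpruned are hypothetical names, and the sketch assumes zero-padded YYYY.MM.DD names as in the layout above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the partition name falls inside the inclusive range [lo, hi].
 * Zero-padded "YYYY.MM.DD" strings sort the same way lexically and
 * chronologically, so strcmp is enough for the comparison. */
int partition_in_range(const char *name, const char *lo, const char *hi) {
    assert(name != NULL && lo != NULL && hi != NULL);
    return strcmp(name, lo) >= 0 && strcmp(name, hi) <= 0;
}

/* Count the partitions a date-range filter would still have to read;
 * every other partition can be skipped without opening its files. */
size_t count_unpruned(const char *const *names, size_t n,
                      const char *lo, const char *hi) {
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        kept += (size_t)partition_in_range(names[i], lo, hi);
    return kept;
}
```

The decision needs no column data at all, which is why a filter on the date can discard whole directories before any I/O happens.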

6. Memory-Mapped I/O

Memory-mapped I/O (mmap) lets Rayforce access on-disk data without copying it into memory. The operating system pages data in on demand as it is accessed.

Benefits

Zero startup cost: opening a 10 GB file returns almost immediately, because only the pages you actually access are loaded
Shared memory: multiple processes can mmap the same file without duplicating data in RAM
OS-managed caching: the kernel decides which pages stay in memory and which are evicted

Trade-offs

Read-only: mmap vectors cannot be modified in place; a write goes through copy-on-write (COW) and operates on a copy instead
Random I/O: when accesses are scattered randomly across a large file, each miss can trigger a page fault, so performance can be worse than sequential reads

Usage

Use ray_col_mmap for individual columns:

ray_t* prices = ray_col_mmap("db/2024.01.01/trades/Price");

Splayed table loading can also use mmap internally. For most analytics workloads — scans, filters, aggregations — mmap provides the best combination of startup time and throughput.
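For readers unfamiliar with the underlying mechanism, the sketch below maps a file read-only using the POSIX mmap call directly. It shows the operating system facility only, not how ray_col_mmap is implemented; map_file_readonly is a hypothetical helper.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only. Returns the mapping and stores its size in
 * *size_out, or returns NULL on failure. Release the mapping with
 * munmap((void *)ptr, size). */
const void *map_file_readonly(const char *path, size_t *size_out) {
    assert(path != NULL && size_out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {  /* empty files cannot be mapped */
        close(fd);
        return NULL;
    }

    /* No data is read here; pages are faulted in on first access. */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping outlives the descriptor */
    if (p == MAP_FAILED) return NULL;

    *size_out = (size_t)st.st_size;
    return p;
}
```

The call returns as soon as the mapping is set up, regardless of file size, and writing through the returned pointer would fault because the pages are mapped with PROT_READ only.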

Choosing a Storage Layer

Layer	Interface	Best for
CSV	Rayfall	Data interchange, small datasets, prototyping
Columnar files	C API	Single-vector persistence, embedding in applications
Splayed tables	C API	Multi-column tables, column-selective loading
Partitioned tables	C API	Large time-series, append-only growth, date-range queries

Next Steps

Storage Reference — Detailed file format specifications and API reference
Core C API — Working with ray_t objects, vectors, and tables in C
Memory Model — Buddy allocator, arenas, COW, and per-VM heaps
Getting Started Tutorial — Hands-on introduction to Rayfall