Data Loading & Storage Tutorial¶
Read CSV files, explore and filter data, serialize tables to binary, persist with splayed storage, and export results — all from the Rayfall REPL.
This tutorial assumes you have built Rayforce and can start the REPL with ./rayforce. We use /tmp/rayforce-test/ as the working directory for all file operations.
1. Reading CSV Files¶
Create a CSV file on disk, then load it with .csv.read. The first row is treated as column headers, and types are inferred automatically:
Assuming trades.csv contains:
Output:
┌────────┬───────┬────────────────────┐
│ Symbol │ Price │ Qty │
│ sym │ f64 │ i64 │
├────────┼───────┼────────────────────┤
│ AAPL │ 150.5 │ 100 │
│ GOOG │ 280.0 │ 200 │
│ MSFT │ 420.0 │ 50 │
│ AAPL │ 155.0 │ 300 │
│ GOOG │ 275.5 │ 150 │
├────────┴───────┴────────────────────┤
│ 5 rows (5 shown) 3 columns (3 shown)│
└─────────────────────────────────────┘
String-like columns (e.g. ticker symbols, names) are loaded as sym (dictionary-encoded symbols). Numeric columns are inferred as i64 or f64 depending on whether any value has a decimal point.
You can also create tables in-memory and skip CSV entirely:
(set trades (table [Symbol Price Qty]
(list
[AAPL GOOG MSFT AAPL GOOG]
[150.5 280.0 420.0 155.0 275.5]
[100 200 50 300 150])))
2. Exploring Data¶
Use count, type, and show to inspect a loaded table:
show limits the display to the first n rows — useful for large datasets:
┌────────┬───────┬────────────────────┐
│ Symbol │ Price │ Qty │
│ sym │ f64 │ i64 │
├────────┼───────┼────────────────────┤
│ AAPL │ 150.5 │ 100 │
│ GOOG │ 280.0 │ 200 │
│ MSFT │ 420.0 │ 50 │
│ AAPL │ 155.0 │ 300 │
│ GOOG │ 275.5 │ 150 │
├────────┴───────┴────────────────────┤
│ 5 rows (5 shown) 3 columns (3 shown)│
└─────────────────────────────────────┘
3. Filtering Rows¶
Use select with a where: clause to filter rows. Find all trades where the price exceeds 200:
┌────────┬───────┬────────────────────┐
│ Symbol │ Price │ Qty │
│ sym │ f64 │ i64 │
├────────┼───────┼────────────────────┤
│ GOOG │ 280.0 │ 200 │
│ MSFT │ 420.0 │ 50 │
│ GOOG │ 275.5 │ 150 │
├────────┴───────┴────────────────────┤
│ 3 rows (3 shown) 3 columns (3 shown)│
└─────────────────────────────────────┘
4. Grouping and Aggregation¶
Group by a column and compute aggregates with by::
┌────────┬───────────┬────────────────┐
│ Symbol │ total_qty │ avg_price │
│ sym │ i64 │ f64 │
├────────┼───────────┼────────────────┤
│ AAPL │ 400 │ 152.75 │
│ GOOG │ 350 │ 277.75 │
│ MSFT │ 50 │ 420.0 │
├────────┴───────────┴────────────────┤
│ 3 rows (3 shown) 3 columns (3 shown)│
└─────────────────────────────────────┘
Available aggregation functions: sum, avg, min, max, count, first, last, med.
5. Binary Serialization¶
Use ser to serialize any Rayforce object to a compact binary format (a U8 byte vector), and de to deserialize it back. This is useful for caching, IPC, or storing intermediate results:
The entire table is encoded in 197 bytes. Deserialize with de:
┌────────┬───────┬────────────────────┐
│ Symbol │ Price │ Qty │
│ sym │ f64 │ i64 │
├────────┼───────┼────────────────────┤
│ AAPL │ 150.5 │ 100 │
│ GOOG │ 280.0 │ 200 │
│ MSFT │ 420.0 │ 50 │
│ AAPL │ 155.0 │ 300 │
│ GOOG │ 275.5 │ 150 │
├────────┴───────┴────────────────────┤
│ 5 rows (5 shown) 3 columns (3 shown)│
└─────────────────────────────────────┘
The deserialized table is identical to the original — all types, column names, and values are preserved.
6. Splayed and Partitioned Tables¶
Splayed tables store each column as a separate file on disk. This is Rayforce's native columnar format — faster than CSV and preserving exact types. Three functions cover on-disk tables:
(.db.splayed.set "dir" Table) ; write a splayed dir (also: a partition)
(.db.splayed.set "dir" Table "sympath") ; explicit shared symfile
(.db.splayed.get "dir") ; open one splayed table (zero-copy mmap)
(.db.parted.get "root" 'name) ; open a partitioned table (root/sym)
Save and load a table:
┌────────┬───────┬────────────────────┐
│ Symbol │ Price │ Qty │
│ sym │ f64 │ i64 │
├────────┼───────┼────────────────────┤
│ AAPL │ 150.5 │ 100 │
│ GOOG │ 280.0 │ 200 │
│ MSFT │ 420.0 │ 50 │
│ AAPL │ 155.0 │ 300 │
│ GOOG │ 275.5 │ 150 │
├────────┴───────┴────────────────────┤
│ 5 rows (5 shown) 3 columns (3 shown)│
└─────────────────────────────────────┘
On disk, this creates one file per column, the schema marker .d, and a symbol vocabulary file (plus its writer lock sym.lk):
.d holds the column names (a string vector — the schema is self-describing), and sym is the table's vocabulary: the distinct symbols its sym columns use, in the file's own enumeration order. On-disk sym cells are positions in that vocabulary — never process-global ids — so the file is valid in any session and any process. A table without symbol columns never requires (or writes) a symfile; symbols nested inside list columns are stored as strings and need no symfile either.
Load it back in a new session (or the same one) with .db.splayed.get:
┌────────┬───────┬────────────────────┐
│ Symbol │ Price │ Qty │
│ sym │ f64 │ i64 │
├────────┼───────┼────────────────────┤
│ AAPL │ 150.5 │ 100 │
│ GOOG │ 280.0 │ 200 │
│ MSFT │ 420.0 │ 50 │
│ AAPL │ 155.0 │ 300 │
│ GOOG │ 275.5 │ 150 │
├────────┴───────┴────────────────────┤
│ 5 rows (5 shown) 3 columns (3 shown)│
└─────────────────────────────────────┘
All types are preserved exactly — f64 stays f64, sym stays sym. No type inference needed on reload.
Symfile resolution¶
When no sym path is given, both .set and .get resolve the vocabulary by convention: dir/sym for a standalone splayed dir; for a partition-shaped dir (/db/2024.01.15/t/, /db/100/t/) the parted root's sym. Writing a partition therefore lands its symbols in the root symfile automatically, and .db.parted.get opens root/sym once and attaches it across every partition — the parted view is index-coherent by construction. An explicit third argument to .db.splayed.set (or second to .db.splayed.get) always wins.
A read that finds no symfile for a table that has symbol columns fails loudly with error: sym — it never resolves against whatever symbols happen to be in memory.
Sharing a vocabulary¶
Several tables may use one symfile — pass the same explicit sym path, or simply place them under one parted root. Tables that share a vocabulary compare and join their sym columns on raw indices (the same-file fast path); tables with different vocabularies still work together, paying a one-time translation of each vocabulary (proportional to its distinct symbols, never to row count).
Because a symfile only ever appends — a symbol's position is permanent, never removed or reordered — previously written column files stay valid as the vocabulary grows, and a re-opened reader is at worst momentarily behind, never wrong. The flip side is a single-writer contract: exactly one process may write to a given symfile (the writer holds sym.lk exclusively); readers are unlimited.
Column index width follows the table's own vocabulary (u8 up to 255 symbols, then u16, u32, …) — not the process dictionary — so files stay as narrow as the data allows.
Crash safety¶
Writes are ordered: the symfile is flushed first, then the columns, then .d as the commit marker; files from a previous wider schema are swept. A crash mid-write leaves a directory without .d, which reads as missing — never as corrupt data — and the next .db.splayed.set heals it.
7. Writing CSV¶
Export any table to CSV with .csv.write. The return value 0 indicates success:
(set big (select {from:trades where: (> Price 200.0)}))
(.csv.write big "/tmp/rayforce-test/big_trades.csv")
Verify by reading it back:
┌────────┬───────┬────────────────────┐
│ Symbol │ Price │ Qty │
│ sym │ f64 │ i64 │
├────────┼───────┼────────────────────┤
│ GOOG │ 280.0 │ 200 │
│ MSFT │ 420.0 │ 50 │
│ GOOG │ 275.5 │ 150 │
├────────┴───────┴────────────────────┤
│ 3 rows (3 shown) 3 columns (3 shown)│
└─────────────────────────────────────┘
Note: float values written without a fractional part (e.g. 280.0) may be read back as i64 since the CSV reader infers the narrowest fitting type. Use splayed tables when exact type preservation matters.
8. Complete Example¶
End-to-end workflow: create data, save as CSV, reload, filter, serialize, deserialize, and verify:
; 1. Create a table
(set trades (table [Symbol Price Qty]
(list
[AAPL GOOG MSFT AAPL GOOG]
[150.5 280.0 420.0 155.0 275.5]
[100 200 50 300 150])))
; 2. Save to CSV
(.csv.write trades "/tmp/rayforce-test/trades.csv")
; 3. Reload from CSV
(set loaded (.csv.read "/tmp/rayforce-test/trades.csv"))
; 4. Filter: only trades above 200
(set big (select {from:loaded where: (> Price 200.0)}))
; 5. Serialize to binary
(set bytes (ser big))
; 6. Deserialize and verify
(set restored (de bytes))
(count restored)
; 7. Save as splayed table for fast reload
(.db.splayed.set "/tmp/rayforce-test/big_db" big)
; 8. Load splayed table back
(.db.splayed.get "/tmp/rayforce-test/big_db")
┌────────┬───────┬────────────────────┐
│ Symbol │ Price │ Qty │
│ sym │ f64 │ i64 │
├────────┼───────┼────────────────────┤
│ GOOG │ 280.0 │ 200 │
│ MSFT │ 420.0 │ 50 │
│ GOOG │ 275.5 │ 150 │
├────────┴───────┴────────────────────┤
│ 3 rows (3 shown) 3 columns (3 shown)│
└─────────────────────────────────────┘
Storage Format Comparison¶
| Format | Function | Preserves Types | Best For |
|---|---|---|---|
| CSV | .csv.read / .csv.write |
No (re-inferred on load) | Interoperability, human-readable data |
| Binary | ser / de |
Yes | Caching, IPC, embedding in messages |
| Splayed | .db.splayed.set / .db.splayed.get |
Yes | Persistent storage, fast reload, mmap |
| Partitioned | .db.splayed.set per partition / .db.parted.get |
Yes | Large datasets split by date or key |
Next Steps¶
- Getting Started Tutorial — Tables, filtering, joins, pivots, and sorting
- Data Persistence — Partitioned storage and advanced file I/O
- Analytics Cookbook — Time-series, top-N, ASOF joins
- Functions Reference — Complete list of all built-in functions