Eradicating Syntax: Building a Neural Universal Machine That Executes Graphs, Not Code
Our GNN autoencoders achieved 81% node accuracy on Ruby ASTs yet produced 0% valid code. The culprit was the literal value bottleneck — nearly half of every AST consisted of names and values that were irrecoverable from the structural encoding. Rather than patch the representation, we asked a more radical question: what if AI never generated human-readable code at all?
This post documents the pivot from patching GNN decoders to building a Neural Universal Machine — a system where a Diffusion Transformer generates executable Directed Acyclic Graphs directly, bypassing programming language syntax entirely. We validate the approach end-to-end: from a working graph-walk interpreter that computes Fibonacci(10), through a 12.3× vocabulary compression pipeline, to a Permuted Dense DiT that achieves 100% Syntactic Validity on 128-node execution graphs.
- The Insight: Why Generate Text at All?
- The Execution Engine: A Graph-Walk Interpreter
- Dataset Compression: 74 Dimensions → 6 Motifs
- The Generative Model: Permuted Dense DiT
- The Validation Harness: 5 Laws of Physics
- Three Branches of Government
- What’s Next: RLAIF for Deterministic Perfection
- Try It Yourself
- References
The Insight: Why Generate Text at All?
The current paradigm of AI code generation is a translation bottleneck: a high-dimensional neural network collapses its probabilistic understanding into a linear, human-readable string of characters, which a compiler then immediately parses back into a multi-dimensional graph to execute.
If we remove the human from the loop, a “programming language” designed natively for AI should not be a language at all. It should be a mathematical specification for a DAG.
Four core principles drive the new architecture:
- The Death of Variable Names. Variable names are human mnemonics. The AI-native language relies entirely on directed edges — data dependencies are pure topological routing.
- Eradication of Syntax Sugar. No parentheses, no brackets, no formatting. The “code” is saved as a sparse adjacency matrix and a minimal feature vector.
- Execution by Graph-Walk. The matrix is not compiled or parsed; it is traversed directly by a minimal graph-walking interpreter.
- Guaranteed Syntax. Because the model generates graph topology rather than stringing text together, “syntax errors” become mathematically impossible.
The Execution Engine: A Graph-Walk Interpreter
Before training any generative model, we needed to prove that pure topological matrices are Turing-complete. We built a minimal virtual machine that executes graphs directly.
The Six Universal Motifs
Drawing from the Böhm-Jacopini theorem (Böhm & Jacopini, 1966), we define exactly six node types — sufficient to express any computable function:
| Motif | Role | Execution Semantics |
|---|---|---|
Boundary |
Program entry/exit | Routes execution forward or halts |
Sequence |
Linear execution | Passes control to the next node |
Condition |
Boolean branching | Evaluates data input, routes to True (index 0) or False (index 1) |
Loop |
Iteration | Like Condition, but the True path loops back |
State |
Memory read/write | On the execution path: writes incoming data to memory |
Message |
Function call / constant | Evaluates +, <, print, or returns a literal |
Two edge types connect them: EXECUTION edges (control flow — “go here next”) and DATA edges (value flow — “use this as argument N”).
The Fibonacci Proof
To validate the interpreter, we hand-constructed a 25-node execution graph equivalent to:
a, b, count = 0, 1, 0
while count < 10:
temp = a + b
a = b
b = temp
count = count + 1
print(a)
The graph encodes this logic as pure topology — no variable names exist in the execution matrix. The literal_pool (a separate dictionary managed by the “Legislative Branch”) maps integer pointers to values: {0: "a", 1: "b", 2: "count", 7: "+", 8: "<", ...}.
The interpreter drops an execution pointer onto the entry Boundary node, resolves data dependencies recursively through DATA edges, evaluates Message nodes via a minimal stdlib (+, <, -, print), and routes through the Loop node’s boolean condition.
Result: memory["a"] == 55. The 10th Fibonacci number, computed natively via matrix traversal — no parser, no compiler, no syntax.
This proves two things: (1) the six Motifs are Turing-complete, and (2) a graph-walk interpreter can execute arbitrary logic from pure adjacency structure plus a constant pool.
Dataset Compression: 74 Dimensions → 6 Motifs
With the execution engine validated, we built a compression pipeline to transform our existing 22,452 Ruby ASTs into training data for the generative model.
The compressor (scripts/dataset_prep/compress_ast.py) performs three operations:
-
Motif Mapping. Every Ruby AST node type (73 unique types:
def,send,lvar,if,while, …) is mapped to one of the 6 Universal Motifs via a deterministic lookup table. -
Literal Extraction. Primitive children (strings, integers, floats, booleans) are stripped from the tree and collected into a deduplicated
literal_pool. Each extracted value gets an integer pointer. The structural graph references these values only through pointer indices — the actual content lives entirely outside the topology. -
Edge Re-Routing.
SequenceandBoundarynodes chain their children viaEXECUTIONedges (control flow). Everything else (Condition,Loop,Message,State) connects children viaDATAedges with positionalinput_indexvalues.
Compression Results
When applied to a complex, real-world example — the 144-node structure method from the AWS Ruby SDK — the compressor:
| Metric | Before (Ruby AST) | After (Motif Graph) |
|---|---|---|
| Node vocabulary | 74 types | 6 types |
| Literal values in graph | 50 (embedded) | 0 (extracted to pool) |
| Edge types | Implicit (parent→child) | 107 DATA + 36 EXECUTION |
This is a 12.3× reduction in structural vocabulary — from 74 dimensions of Ruby syntax noise down to 6 language-agnostic primitives. The literal value bottleneck that destroyed our GNN autoencoders is eliminated by construction: literals are no longer in the graph. They live in a separate constant pool managed by the “Legislative Branch” (an LLM).
The compressed dataset produces perfectly dense, low-dimensional matrices — ideal inputs for a Diffusion Transformer.
The Generative Model: Permuted Dense DiT
With a Turing-complete execution engine and a compressed training set, we built a Diffusion Transformer to generate valid execution graphs from scratch.
The Spatial Bias Trap
Standard image DiTs (Stable Diffusion, Sora) use Vision Transformer blocks with 2D positional encodings. Applied to an adjacency matrix, this teaches the network that Node 4 connects to Node 5 because they are “next to each other” spatially. But in a graph, node ordering is entirely arbitrary — adjacency is topological, not positional.
We solved this with two mechanisms:
1. Node Permutation Augmentation. The DataLoader randomly shuffles node ordering every time a graph is fetched. The topological routing remains identical, but the matrix layout changes completely. This mathematically destroys spatial bias and forces the DiT to learn pure topological rules.
2. Cross-Hatch Embedding Injection. The DiT operates on a 2D adjacency matrix, but its conditioning signal (the Motifs) is a 1D list. The InputConditioner bridges this gap:
- Embeds the 1D Motif tensor
[N]into[N, 128] - Broadcasts across rows →
[N, N, 128](source node identity) - Broadcasts across columns →
[N, N, 128](target node identity) - Concatenates both with the 3-channel noisy adjacency → 35 channels per pixel
Every coordinate (i, j) now carries complete information about which Motifs are being connected, giving the DiT 360° structural awareness.
Axial Attention: Message-Passing in Matrix Form
Because we abandoned ViT square patches, the model processes the full matrix using Axial (Row-Column) Attention — which naturally mimics graph message passing:
- Row Attention (the “outgoing” perspective): Evaluates all potential connections from a node simultaneously. “I am a Condition — I must point to exactly two targets.”
- Column Attention (the “incoming” perspective): Evaluates all incoming dependencies to a node. “I am a State — I can accept at most one data source.”
This bidirectional reasoning is critical: graph validity requires both out-degree and in-degree constraints to be satisfied simultaneously.
Hybrid Loss: Flow Matching + Classification
The model predicts 6 output channels per edge coordinate:
| Channels | Meaning | Loss |
|---|---|---|
| 0–1 | Presence, Edge Type | Optimal Transport Flow Matching (masked MSE) |
| 2–5 | Input Index logits (0–3) | Categorical Cross-Entropy (masked) |
For the continuous channels, we use Conditional Flow Matching (Lipman et al., 2023): the target velocity is $v_t = x_1 - x_0$ (clean adjacency minus Gaussian noise), and the model learns to predict this velocity field. At inference, a 20-step Euler ODE solver integrates from noise to structure.
For the discrete channel (argument ordering), continuous regression would cause “rounding collisions” where two edges claim the same input index. Instead, the model outputs categorical logits and a Cross-Entropy loss forces mutually exclusive assignment.
Padding masking is essential: graphs vary from 3 to 128 nodes but are padded to a fixed $128 \times 128$ matrix. Without masking, the loss would penalize the model for failing to denoise 16,000+ void pixels.
Hyperparameter Ablation
We ran an 8-configuration grid search over batch size, depth, and learning rate:
| Parameter | Optimal | Reasoning |
|---|---|---|
| Effective Batch Size | 16 | BS=4 too noisy; BS≥32 washes out categorical gradients |
| Axial Depth | 12 blocks | 6 insufficient for global routing; 24 overfits out-degree at in-degree’s expense |
| Learning Rate | 1e-4 | 5e-4 causes catastrophic gradient explosions (loss > 26.0); 1e-5 too slow |
At the optimal configuration, the model achieved 32.4% in-degree / 46.0% out-degree pass rates during early training — enough signal for the curriculum to scale.
The Validation Harness: 5 Laws of Physics
To grade the DiT’s output deterministically (no LLM judge, no fuzzy metrics), we implemented a static topological analyzer that enforces five absolute graph laws. A generated matrix must pass all five to count as syntactically valid:
Law 1: Execution Out-Degree
Each Motif has strict branching limits:
| Motif | Legal Out-Degree |
|---|---|
| Boundary | 0 (exit) or 1 (entry) |
| Sequence, State, Message | ≤ 1 |
| Condition, Loop | Exactly 0 or 2 (and branch indices must be distinct) |
A Condition node with 3 outgoing execution edges? Illegal. One with two edges both labeled “True path”? Also illegal.
Law 2: Data In-Degree (Arity)
Strict argument constraints prevent impossible data routing:
ConditionandLoopnodes require exactly 1 incoming data edge (the boolean)Statewrites require exactly 1 incoming data edge (the value to store)Messagenodes require unique, non-duplicate argument indices
Law 3: No Orphans (Reachability)
A BFS over the combined execution+data connectivity graph confirms zero disconnected logic islands. Every node must be reachable from the rest of the graph.
Law 4: Acyclic Data Plane
A DFS cycle detector ensures the DATA edge subgraph contains no paradoxes. Execution edges may cycle (that’s what loops are), but data dependencies must form a strict DAG — otherwise you get circular definitions (a = b; b = a).
Law 5: Terminal Sink
A reverse-BFS from exit nodes (those with 0 outgoing execution edges) confirms that every execution node can reach a termination point. This prevents infinite loops without escape hatches.
The Breakthrough: 100% SVR at 128 Nodes
The 3-Phase Curriculum scaled the DiT from toy graphs (≤10 nodes) to massive 128-node matrices. At Epoch 343, with the Judicial Constraint Solver performing Top-K arity snapping and logit-weighted branch conflict resolution, the model achieved:
100.00% Syntactic Validity Rate on 128-node execution graphs.
The Judicial Constraint Solver bridges the continuous-to-discrete gap: rather than expecting the DiT to perfectly zero its own noise, the solver reads the probability heatmap and mathematically snaps edges into legal bounds. A Condition node’s row gets its Top-2 highest probabilities snapped to 1; conflicting argument indices get resolved via categorical logit magnitude.
Three Branches of Government
The full system separates concerns like a constitutional government:
| Branch | Model | Responsibility |
|---|---|---|
| Legislative | Semantic LLM | Translates human intent → Motif list + Literal Pool |
| Executive | Permuted Dense DiT | Generates the topological routing (adjacency matrix) |
| Judicial | Constraint Solver | Snaps continuous heat maps → discrete, legal DAGs |
The DiT knows nothing about human language. It purely outputs mathematically valid logic scaffolds. The LLM knows nothing about graph topology. It purely manages the semantic content. The Constraint Solver enforces constitutional law on both.
What’s Next: RLAIF for Deterministic Perfection
The continuous Flow Matching objective plateaued at loss ~0.135 — the model has extracted maximum topological value from the passive pre-training objective. The next phase transitions to Reinforcement Learning from AI Feedback (RLAIF), using the 5 Laws of Physics as a direct reward signal:
- Base rewards: +0.2 for passing No Orphans, Acyclic Data, and In-Degree
- Load-bearing penalties: +0.4 for Out-Degree pass (−0.2 fail), +0.4 for Terminal Sink pass (−0.4 fail)
- Jackpot: If all 5 laws pass → 2.5× multiplier on total reward
A KL Divergence Anchor to the frozen pre-trained weights prevents reward hacking (mode-collapsing into trivial straight-line graphs).
Try It Yourself
The full execution engine, dataset compression pipeline, DiT training code, and validation harness are available:
- 💻 Code: jubilant-palm-tree — The Neural Universal Machine
- 📊 Model: timlawrenz/jubilant-palm-tree — Pre-trained checkpoint on the Hub
- 🤖 Orchestrator: Ratiocinator — Autonomous experiment runner
- 📄 Previous work: The Literal Value Bottleneck — The GNN study that motivated this pivot
# Clone and run the Fibonacci proof
git clone https://github.com/timlawrenz/jubilant-palm-tree
cd jubilant-palm-tree
pip install -r requirements.txt
# Execute the graph-walk interpreter (no compiler needed)
python src/execution_engine/demo.py
# Compress the Ruby AST dataset into Universal Motifs
python scripts/dataset_prep/compress_ast.py
# Train the Permuted Dense DiT
python src/train.py
References
- Böhm, C., & Jacopini, G. (1966). Flow diagrams, Turing machines and languages with only two formation rules. Communications of the ACM, 9(5), 366–371.
- Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow matching for generative modeling. ArXiv Preprint ArXiv:2210.02747.