Eradicating Syntax: Building a Neural Universal Machine That Executes Graphs, Not Code

Our GNN autoencoders achieved 81% node accuracy on Ruby ASTs yet produced 0% valid code. The culprit was the literal value bottleneck — nearly half of every AST consisted of names and values that were irrecoverable from the structural encoding. Rather than patch the representation, we asked a more radical question: what if AI never generated human-readable code at all?

This post documents the pivot from patching GNN decoders to building a Neural Universal Machine — a system where a Diffusion Transformer generates executable Directed Acyclic Graphs directly, bypassing programming language syntax entirely. We validate the approach end-to-end: from a working graph-walk interpreter that computes Fibonacci(10), through a 12.3× vocabulary compression pipeline, to a Permuted Dense DiT that achieves 100% Syntactic Validity on 128-node execution graphs.

The Insight: Why Generate Text at All?
The Execution Engine: A Graph-Walk Interpreter
- The Six Universal Motifs
- The Fibonacci Proof
Dataset Compression: 74 Dimensions → 6 Motifs
- Compression Results
The Generative Model: Permuted Dense DiT
The Validation Harness: 5 Laws of Physics
Three Branches of Government
What’s Next: RLAIF for Deterministic Perfection
Try It Yourself
References

The Insight: Why Generate Text at All?

The current paradigm of AI code generation is a translation bottleneck: a high-dimensional neural network collapses its probabilistic understanding into a linear, human-readable string of characters, which a compiler then immediately parses back into a multi-dimensional graph to execute.

If we remove the human from the loop, a “programming language” designed natively for AI should not be a language at all. It should be a mathematical specification for a DAG.

Four core principles drive the new architecture:

The Death of Variable Names. Variable names are human mnemonics. The AI-native language relies entirely on directed edges — data dependencies are pure topological routing.
Eradication of Syntax Sugar. No parentheses, no brackets, no formatting. The “code” is saved as a sparse adjacency matrix and a minimal feature vector.
Execution by Graph-Walk. The matrix is not compiled or parsed; it is traversed directly by a minimal graph-walking interpreter.
Guaranteed Syntax. Because the model generates graph topology rather than stringing text together, “syntax errors” become mathematically impossible.

The Execution Engine: A Graph-Walk Interpreter

Before training any generative model, we needed to prove that pure topological matrices are Turing-complete. We built a minimal virtual machine that executes graphs directly.

The Six Universal Motifs

Drawing from the Böhm-Jacopini theorem (Böhm & Jacopini, 1966), we define exactly six node types — sufficient to express any computable function:

Motif	Role	Execution Semantics
`Boundary`	Program entry/exit	Routes execution forward or halts
`Sequence`	Linear execution	Passes control to the next node
`Condition`	Boolean branching	Evaluates data input, routes to True (index 0) or False (index 1)
`Loop`	Iteration	Like Condition, but the True path loops back
`State`	Memory read/write	On the execution path: writes incoming data to memory
`Message`	Function call / constant	Evaluates `+`, `<`, `print`, or returns a literal

Two edge types connect them: EXECUTION edges (control flow — “go here next”) and DATA edges (value flow — “use this as argument N”).

The Fibonacci Proof

To validate the interpreter, we hand-constructed a 25-node execution graph equivalent to:

a, b, count = 0, 1, 0
while count < 10:
    temp = a + b
    a = b
    b = temp
    count = count + 1
print(a)

The graph encodes this logic as pure topology — no variable names exist in the execution matrix. The literal_pool (a separate dictionary managed by the “Legislative Branch”) maps integer pointers to values: {0: "a", 1: "b", 2: "count", 7: "+", 8: "<", ...}.

The interpreter drops an execution pointer onto the entry Boundary node, resolves data dependencies recursively through DATA edges, evaluates Message nodes via a minimal stdlib (+, <, -, print), and routes through the Loop node’s boolean condition.

Result: memory["a"] == 55. The 10th Fibonacci number, computed natively via matrix traversal — no parser, no compiler, no syntax.

This proves two things: (1) the six Motifs are Turing-complete, and (2) a graph-walk interpreter can execute arbitrary logic from pure adjacency structure plus a constant pool.

Dataset Compression: 74 Dimensions → 6 Motifs

With the execution engine validated, we built a compression pipeline to transform our existing 22,452 Ruby ASTs into training data for the generative model.

The compressor (scripts/dataset_prep/compress_ast.py) performs three operations:

Motif Mapping. Every Ruby AST node type (73 unique types: def, send, lvar, if, while, …) is mapped to one of the 6 Universal Motifs via a deterministic lookup table.
Literal Extraction. Primitive children (strings, integers, floats, booleans) are stripped from the tree and collected into a deduplicated literal_pool. Each extracted value gets an integer pointer. The structural graph references these values only through pointer indices — the actual content lives entirely outside the topology.
Edge Re-Routing. Sequence and Boundary nodes chain their children via EXECUTION edges (control flow). Everything else (Condition, Loop, Message, State) connects children via DATA edges with positional input_index values.

Compression Results

When applied to a complex, real-world example — the 144-node structure method from the AWS Ruby SDK — the compressor:

Metric	Before (Ruby AST)	After (Motif Graph)
Node vocabulary	74 types	6 types
Literal values in graph	50 (embedded)	0 (extracted to pool)
Edge types	Implicit (parent→child)	107 DATA + 36 EXECUTION

This is a 12.3× reduction in structural vocabulary — from 74 dimensions of Ruby syntax noise down to 6 language-agnostic primitives. The literal value bottleneck that destroyed our GNN autoencoders is eliminated by construction: literals are no longer in the graph. They live in a separate constant pool managed by the “Legislative Branch” (an LLM).

The compressed dataset produces perfectly dense, low-dimensional matrices — ideal inputs for a Diffusion Transformer.

The Generative Model: Permuted Dense DiT

With a Turing-complete execution engine and a compressed training set, we built a Diffusion Transformer to generate valid execution graphs from scratch.

The Spatial Bias Trap

Standard image DiTs (Stable Diffusion, Sora) use Vision Transformer blocks with 2D positional encodings. Applied to an adjacency matrix, this teaches the network that Node 4 connects to Node 5 because they are “next to each other” spatially. But in a graph, node ordering is entirely arbitrary — adjacency is topological, not positional.

We solved this with two mechanisms:

1. Node Permutation Augmentation. The DataLoader randomly shuffles node ordering every time a graph is fetched. The topological routing remains identical, but the matrix layout changes completely. This mathematically destroys spatial bias and forces the DiT to learn pure topological rules.

2. Cross-Hatch Embedding Injection. The DiT operates on a 2D adjacency matrix, but its conditioning signal (the Motifs) is a 1D list. The InputConditioner bridges this gap:

Embeds the 1D Motif tensor [N] into [N, 128]
Broadcasts across rows → [N, N, 128] (source node identity)
Broadcasts across columns → [N, N, 128] (target node identity)
Concatenates both with the 3-channel noisy adjacency → 35 channels per pixel

Every coordinate (i, j) now carries complete information about which Motifs are being connected, giving the DiT 360° structural awareness.

Axial Attention: Message-Passing in Matrix Form

Because we abandoned ViT square patches, the model processes the full matrix using Axial (Row-Column) Attention — which naturally mimics graph message passing:

Row Attention (the “outgoing” perspective): Evaluates all potential connections from a node simultaneously. “I am a Condition — I must point to exactly two targets.”
Column Attention (the “incoming” perspective): Evaluates all incoming dependencies to a node. “I am a State — I can accept at most one data source.”

This bidirectional reasoning is critical: graph validity requires both out-degree and in-degree constraints to be satisfied simultaneously.

Hybrid Loss: Flow Matching + Classification

The model predicts 6 output channels per edge coordinate:

Channels	Meaning	Loss
0–1	Presence, Edge Type	Optimal Transport Flow Matching (masked MSE)
2–5	Input Index logits (0–3)	Categorical Cross-Entropy (masked)

For the continuous channels, we use Conditional Flow Matching (Lipman et al., 2023): the target velocity is $v_t = x_1 - x_0$ (clean adjacency minus Gaussian noise), and the model learns to predict this velocity field. At inference, a 20-step Euler ODE solver integrates from noise to structure.

For the discrete channel (argument ordering), continuous regression would cause “rounding collisions” where two edges claim the same input index. Instead, the model outputs categorical logits and a Cross-Entropy loss forces mutually exclusive assignment.

Padding masking is essential: graphs vary from 3 to 128 nodes but are padded to a fixed $128 \times 128$ matrix. Without masking, the loss would penalize the model for failing to denoise 16,000+ void pixels.

Hyperparameter Ablation

We ran an 8-configuration grid search over batch size, depth, and learning rate:

Parameter	Optimal	Reasoning
Effective Batch Size	16	BS=4 too noisy; BS≥32 washes out categorical gradients
Axial Depth	12 blocks	6 insufficient for global routing; 24 overfits out-degree at in-degree’s expense
Learning Rate	1e-4	5e-4 causes catastrophic gradient explosions (loss > 26.0); 1e-5 too slow

At the optimal configuration, the model achieved 32.4% in-degree / 46.0% out-degree pass rates during early training — enough signal for the curriculum to scale.

The Validation Harness: 5 Laws of Physics

To grade the DiT’s output deterministically (no LLM judge, no fuzzy metrics), we implemented a static topological analyzer that enforces five absolute graph laws. A generated matrix must pass all five to count as syntactically valid:

Law 1: Execution Out-Degree

Each Motif has strict branching limits:

Motif	Legal Out-Degree
Boundary	0 (exit) or 1 (entry)
Sequence, State, Message	≤ 1
Condition, Loop	Exactly 0 or 2 (and branch indices must be distinct)

A Condition node with 3 outgoing execution edges? Illegal. One with two edges both labeled “True path”? Also illegal.

Law 2: Data In-Degree (Arity)

Strict argument constraints prevent impossible data routing:

Condition and Loop nodes require exactly 1 incoming data edge (the boolean)
State writes require exactly 1 incoming data edge (the value to store)
Message nodes require unique, non-duplicate argument indices

Law 3: No Orphans (Reachability)

A BFS over the combined execution+data connectivity graph confirms zero disconnected logic islands. Every node must be reachable from the rest of the graph.

Law 4: Acyclic Data Plane

A DFS cycle detector ensures the DATA edge subgraph contains no paradoxes. Execution edges may cycle (that’s what loops are), but data dependencies must form a strict DAG — otherwise you get circular definitions (a = b; b = a).

Law 5: Terminal Sink

A reverse-BFS from exit nodes (those with 0 outgoing execution edges) confirms that every execution node can reach a termination point. This prevents infinite loops without escape hatches.

The Breakthrough: 100% SVR at 128 Nodes

The 3-Phase Curriculum scaled the DiT from toy graphs (≤10 nodes) to massive 128-node matrices. At Epoch 343, with the Judicial Constraint Solver performing Top-K arity snapping and logit-weighted branch conflict resolution, the model achieved:

100.00% Syntactic Validity Rate on 128-node execution graphs.

The Judicial Constraint Solver bridges the continuous-to-discrete gap: rather than expecting the DiT to perfectly zero its own noise, the solver reads the probability heatmap and mathematically snaps edges into legal bounds. A Condition node’s row gets its Top-2 highest probabilities snapped to 1; conflicting argument indices get resolved via categorical logit magnitude.

Three Branches of Government

The full system separates concerns like a constitutional government:

Branch	Model	Responsibility
Legislative	Semantic LLM	Translates human intent → Motif list + Literal Pool
Executive	Permuted Dense DiT	Generates the topological routing (adjacency matrix)
Judicial	Constraint Solver	Snaps continuous heat maps → discrete, legal DAGs

The DiT knows nothing about human language. It purely outputs mathematically valid logic scaffolds. The LLM knows nothing about graph topology. It purely manages the semantic content. The Constraint Solver enforces constitutional law on both.

What’s Next: RLAIF for Deterministic Perfection

The continuous Flow Matching objective plateaued at loss ~0.135 — the model has extracted maximum topological value from the passive pre-training objective. The next phase transitions to Reinforcement Learning from AI Feedback (RLAIF), using the 5 Laws of Physics as a direct reward signal:

Base rewards: +0.2 for passing No Orphans, Acyclic Data, and In-Degree
Load-bearing penalties: +0.4 for Out-Degree pass (−0.2 fail), +0.4 for Terminal Sink pass (−0.4 fail)
Jackpot: If all 5 laws pass → 2.5× multiplier on total reward

A KL Divergence Anchor to the frozen pre-trained weights prevents reward hacking (mode-collapsing into trivial straight-line graphs).

Try It Yourself

The full execution engine, dataset compression pipeline, DiT training code, and validation harness are available:

💻 Code: jubilant-palm-tree — The Neural Universal Machine
📊 Model: timlawrenz/jubilant-palm-tree — Pre-trained checkpoint on the Hub
🤖 Orchestrator: Ratiocinator — Autonomous experiment runner
📄 Previous work: The Literal Value Bottleneck — The GNN study that motivated this pivot

# Clone and run the Fibonacci proof
git clone https://github.com/timlawrenz/jubilant-palm-tree
cd jubilant-palm-tree
pip install -r requirements.txt

# Execute the graph-walk interpreter (no compiler needed)
python src/execution_engine/demo.py

# Compress the Ruby AST dataset into Universal Motifs
python scripts/dataset_prep/compress_ast.py

# Train the Permuted Dense DiT
python src/train.py

References

Böhm, C., & Jacopini, G. (1966). Flow diagrams, Turing machines and languages with only two formation rules. Communications of the ACM, 9(5), 366–371.
Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow matching for generative modeling. ArXiv Preprint ArXiv:2210.02747.