Tim Lawrenz

Training a Pixel-Space DiT in 26 Hours: FP8 Breakthroughs and Architectural Dead Ends

2026-06-02T21:00:00+00:00

Following our integration of Asymmetric Flow Matching, our 400M parameter NanoDiT was training efficiently in terms of step-count convergence, but throughput was stuck at an iter_per_sec of 0.025 (about 40 seconds per step) on our single RTX 4090. A full 5,000-step ablation cycle required 56 hours of active compute.

The Bottleneck

Before scaling the dataset from our 7k curated subset to the full 70k+ pipeline, we needed a faster iteration loop. Our goal was to halve the active compute time either via lower precision (FP8) or parameter reduction (shared spatial modulation).

Here is what worked—and what catastrophically failed.

1. The FP8 Memory Illusion & Breakthrough

Native FP8 via torchao theoretically promises doubled tensor core throughput and halved memory bandwidth compared to BF16. However, our initial naïve implementation could not complete a single forward/backward pass within our 24GB VRAM budget: it hit out-of-memory (OOM) errors immediately, at batch sizes where BF16 fit comfortably.

The Trap: Dynamic Scale State Overhead

Why did an 8-bit format consume more memory than a 16-bit format? The answer lies in how torchao handles dynamic tensor casting. For dynamic casting, the framework allocates dynamic scale states for every linear layer and continuously tracks rolling history maxima during the forward pass. This metadata overhead, combined with wrapper casting operations, drastically inflated the footprint and destroyed the bandwidth savings.

The Fix: Dynamic Tensor Masking & Scoped Autocast

To fit the model back into 24GB VRAM while preserving the FP8 throughput, we had to tame the scale-state overhead. We implemented strict dynamic tensor masking and optimized autocast scoping so that FP8 conversion was tightly localized to the heaviest matrix multiplications (the core attention and FFN projections), bypassing the metadata overhead for the rest of the network.

The Results: >2x Speedup

By resolving the overhead, active compute time plummeted from 56 hours to 26 hours (iter_per_sec increased from 0.025 to 0.053).
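The wall-clock figures follow directly from the measured throughput. A quick sanity check, assuming the 5,000-step budget and the two iter_per_sec readings quoted above (the helper name is ours, not part of the training code):

```python
def hours_for(steps: int, iters_per_sec: float) -> float:
    """Wall-clock hours needed to run `steps` iterations at a fixed rate."""
    return steps / iters_per_sec / 3600.0

bf16_hours = hours_for(5_000, 0.025)   # BF16 AsymFlow run
fp8_hours = hours_for(5_000, 0.053)    # FP8 run
speedup = 0.053 / 0.025

# prints "BF16: 55.6 h, FP8: 26.2 h, speedup: 2.12x"
print(f"BF16: {bf16_hours:.1f} h, FP8: {fp8_hours:.1f} h, speedup: {speedup:.2f}x")
```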

Crucially, the perceptual quality held up beautifully against BF16. Below are the final LPIPS metrics at step 5000 comparing Arm G (BF16 AsymFlow) against Arm I (FP8 Native):

Metric	Arm G (BF16)	Arm I (FP8)	Delta
Reconstruction LPIPS	0.900	0.906	+0.006 (Negligible)
Text-only LPIPS	0.920	0.909	-0.011 (Better)
Text Manip Delta	0.485	0.504	+0.019 (Better)

Note: Lower LPIPS is better for perceptual similarity. Higher Text Manip Delta indicates stronger text-controllability.

Visual Comparisons

Reconstruction Fidelity (Conditioned on Identity + Text): (Arm G on left, Arm I on right)

Text-Only Controllability (Conditioned purely on Text prompt): (Arm G on left, Arm I on right)

The FP8 model achieves equivalent perceptual quality in less than half the time.

2. The Shared adaLN + LoRA Collapse

While working on compute optimization, we also explored parameter reduction.

The Hypothesis: Our DiT utilizes an adaLN (Adaptive Layer Normalization) projection in every transformer block to inject the timestep and spatial conditioning signals. What if we shared a single central adaLN projection across all layers, and relied on a lightweight per-block rank-8 LoRA to handle block-specific spatial localization? This would save 59M parameters (dropping the model from 237.7M to 178.8M non-embedding parameters).

The Reality: The model (Arm H) completely failed to converge. Reconstruction LPIPS stalled at 0.9823, and the output degraded into structural noise and visual dithering by step 5000.

The Lesson: Full-rank, independent per-block modulation is structurally load-bearing in a Diffusion Transformer. You cannot compress the spatial/semantic conditioning pathway without destroying the model’s ability to localize features across different depths of the network. The 59M parameter savings were simply not worth the catastrophic quality collapse.
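For readers who want to see where a saving of this size comes from, here is a rough parameter count. It assumes the standard DiT adaLN layout (one linear projection per block that emits six modulation vectors of width `hidden`) and a conditioning vector as wide as the hidden state. Neither detail is stated above, so treat the output as an estimate rather than a reproduction of the reported figure.

```python
def adaln_params(hidden: int, n_mod: int = 6) -> int:
    """Weights plus biases of one Linear(hidden -> n_mod * hidden) projection."""
    return hidden * (n_mod * hidden) + n_mod * hidden

def lora_params(hidden: int, rank: int, n_mod: int = 6) -> int:
    """Two low-rank factors: (hidden x rank) and (rank x n_mod * hidden)."""
    return rank * hidden + rank * (n_mod * hidden)

hidden, layers, rank = 768, 18, 8          # NanoDiT shape and LoRA rank from the posts
per_block = adaln_params(hidden) * layers  # independent projection in every block
shared = adaln_params(hidden) + lora_params(hidden, rank) * layers
saved = per_block - shared

print(f"per-block: {per_block / 1e6:.1f}M, shared + LoRA: {shared / 1e6:.1f}M, "
      f"saved: {saved / 1e6:.1f}M")
```

Under these assumptions the estimate lands within about one million parameters of the 59M quoted above; the remaining gap depends on details such as bias terms and the true conditioning width.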

Conclusion

The architecture is now stabilized, perceptually verified, and fast. The ablation phase is formally closed. With our pipeline executing 5k steps in just 26 hours, we are ready for the “big run”—scaling the dataset, extending the training horizon, and deploying bucket-aware batching.

prx-tg: Accelerating Pixel-Space Diffusion with Asymmetric Flow Matching

2026-05-23T00:00:00+00:00

The evolution of text-to-image synthesis is currently undergoing a profound architectural realignment. For years, the dominant paradigm relied on Variational Autoencoders (VAEs) to compress high-dimensional pixel data into a mathematically tractable, lower-dimensional latent space. But as we strip away the VAE in pursuit of lossless, native-resolution pixel prediction, we collide with the brutal reality of uncompressed feature spaces.

The Reality of Pixel-Space Diffusion: The “Average Blob” Phenomenon
Breaking the Bottleneck: Asymmetric Flow Matching (Arm G)
- Quantitative Results (Step 5000)
- Qualitative Results
The Next Hurdle: Data Starvation and Compute Scaling
References

In my previous post on the prx-tg architecture, we established a baseline NanoDiT (768 hidden, 18 layers) capable of training in raw pixel-space using a single 24GB RTX 4090. Today, we confront the “Average Blob” phenomenon—the mathematical bottleneck of VAE-less diffusion—and explore how Asymmetric Flow Matching (Arm G) allows us to break through it.

The Reality of Pixel-Space Diffusion: The “Average Blob” Phenomenon

While removing the VAE yields theoretical advantages in high-frequency detail preservation, our recent ablations on the prx-tg baseline (Arm D) revealed a harsh reality of pure pixel-space training.

In Latent Diffusion Models (LDMs), the VAE decoder acts as an aesthetic crutch. It forcefully maps noisy or misaligned latent representations back into a “photorealistic” texture manifold. You get eyes, skin textures, and hair details almost for free, even very early in the training process.

In a latent-free setup predicting raw RGB pixels, the model has to earn every single high-frequency detail from scratch. Because the training uses a Mean Squared Error (MSE) loss on raw pixels, the model mathematically minimizes its loss early on by predicting a smooth, blurry “average” color blob wherever it is uncertain about high-frequency placement (like the exact boundary of an iris or individual hair strands).

At 5,000 steps (processing ~1.28 million images with an effective batch size of 256 over our 7k curated FFHQ image dataset), the baseline Arm D model successfully learned macro-composition—placing the head in the right spot with the correct colors—but completely failed to resolve facial features, leaving the outputs blocky and emotionless. To push past this phase without wasting thousands of GPU hours, the model must be forced to care about micro-structure earlier.

Breaking the Bottleneck: Asymmetric Flow Matching (Arm G)

To address this delayed convergence, we recently conducted an ablation (Arm G) implementing Asymmetric Flow Models.

Standard flow matching architectures learn a vector field that transports a simple base distribution (e.g., Gaussian noise) to a complex data distribution (pixels). However, predicting the full-dimensional noise across all sequence steps introduces massive variance in the gradient updates.

Instead of predicting full-dimensional noise directly, AsymFlow computes the target on a lower-dimensional PCA-based subspace for the noise component: target = P @ noise - x_0. By isolating the target prediction to an optimized linear subspace (in our case, rank 8), the model receives a much cleaner, less chaotic gradient signal during backpropagation.
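As a toy illustration of the formula as written, the sketch below builds a projector from a hand-picked orthonormal basis and applies it to a noise vector. In a real run the basis would come from a PCA of the data and have rank 8. The helper names, the basis, and the 4-dimensional example are all assumptions for illustration, not the prx-tg implementation.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def projector(basis):
    """P = U^T U for a row-orthonormal basis U (k x d); P is d x d with rank k."""
    d = len(basis[0])
    return [[sum(u[i] * u[j] for u in basis) for j in range(d)] for i in range(d)]

def asym_target(P, noise, x0):
    """target = P @ noise - x_0, following the expression in the text."""
    return [pn - x for pn, x in zip(matvec(P, noise), x0)]

# Toy example: d = 4 features, rank-2 subspace (hypothetical basis).
s = 0.5 ** 0.5
basis = [[s, s, 0.0, 0.0], [0.0, 0.0, s, -s]]
P = projector(basis)
noise = [1.0, 3.0, 2.0, 0.0]
x0 = [0.5, 0.5, 0.5, 0.5]
print(asym_target(P, noise, x0))
```

Because P is a projector, applying it twice changes nothing, so only the component of the noise inside the subspace ever reaches the target.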

Quantitative Results (Step 5000)

The results were immediately apparent. At the 5,000-step validation mark, Arm G stayed within about 0.011 LPIPS of our full-stack baseline Arm D on reconstruction fidelity (0.9379 vs 0.9267, marginally worse) while outperforming it on text-only generation quality (0.9141 vs 0.9219).

Metric	Arm D (Baseline)	Arm G (Asym Flow)
Reconstruction LPIPS	0.9267	0.9379
Text-Only Gen LPIPS	0.9219	0.9141

Note: Lower LPIPS (Learned Perceptual Image Patch Similarity) indicates better perceptual quality and structural fidelity.

Qualitative Results

Visually, the AsymFlow model produced noticeably crisper outputs. The micro-contrast improved dramatically—skin textures felt less “painted” and lighting highlights resolved with higher fidelity compared to the baseline’s haze.

Most importantly, Arm G demonstrated far better structural identity preservation when undergoing strong text-based manipulations. In our validation suite, we feed the model an original image’s DINOv3 embeddings alongside a heavily modified text prompt (e.g., changing “slender” to “muscular”). Arm D frequently warped the core identity or introduced artifacts around the neck and collar when applying the edit. Arm G handled these transformations smoothly, achieving a text manipulation LPIPS difference of 0.4665 while keeping the subject’s base geometry intact.

The Next Hurdle: Data Starvation and Compute Scaling

While Asymmetric Flow Matching successfully accelerated convergence and bridged the gap between raw pixel geometry and texture, it cannot magically solve data starvation.

Our 400M parameter model confined to a 7,000 image dataset is rapidly approaching memorization. At 5,000 steps, it has seen the same rigid dataset over 180 times. To achieve true production quality, we must scale our dataset to our 70k target and push training into the 40,000+ step regime.

However, running 40,000 steps at ~41 seconds per iteration on a single 4090 poses a massive compute barrier (roughly 19 days of continuous training). Before launching this production run, we must pursue further radical optimizations to compress the compute time.

In the next part of this series, we will explore replacing the standard Transformer blocks with Mamba-3 sequence modeling to linearize pixel space, and deploying Shared-MLP Timestep Modulation via LoRA to radically slash our FLOP budget.

References

Training a Portrait DiT on a Single GPU: What the Ablation Study Taught Us

2026-05-13T00:00:00+00:00

The prevailing assumption in generative AI is that training a large, multi-modal Diffusion Transformer from scratch requires a cluster. prx-tg is a direct challenge to that assumption: a 400M+ parameter DiT for 1024×1024 portrait generation, trained entirely on a single consumer NVIDIA RTX 4090 with 24GB of VRAM, conditioned on text, identity, spatial layout, and pose simultaneously. We just completed the first systematic ablation study of its core training innovations, and the results are worth sharing in detail — including one finding we did not expect.

What We Are Building
The Ablation Design
Results
The Traps Ahead
What’s Next
Code and Data
References

What We Are Building

prx-tg is a portrait generation model built on a NanoDiT backbone (Peebles & Xie, 2023) operating directly in pixel space, patchifying RGB images into a sequence of tokens rather than relying on a VAE latent bottleneck. The model is “quad-conditioned”: cross-attention layers simultaneously receive dense text captions processed by CLIP and T5, visual identity embeddings from DINOv3 (utilizing patch-level tokens), spatial layout maps, and DWPose skeletal keypoints. The goal is controllable generation — given a reference identity and a description of pose, lighting, and appearance, generate a plausible, photorealistic portrait.

Training a model of this scope on 24GB of VRAM is not possible without careful engineering. Gradient checkpointing drops all intermediate activations and recomputes them on the backward pass, trading a 20–30% speed penalty for a massive memory reduction. The T5 encoder alone consumes over 10GB of VRAM to process captions; a dedicated cleanup routine migrates it to CPU immediately after embeddings are cached, freeing the GPU before the DiT backward pass. Affine biases are stripped from QKV projections and FFN hidden layers — mathematically redundant under LayerNorm, and worth 5–10% of total memory. Positional embeddings are computed dynamically from latent tensor dimensions rather than stored as static buffers, enabling multi-resolution training without padding or fixed-shape assumptions.

Data augmentation and preprocessing run through stratum-hq. Horizontal flip augmentation was explicitly excluded: for a model conditioned on DWPose keypoints, flipping pixel data without remapping symmetric landmark indices (left eye ↔ right eye, left shoulder ↔ right shoulder) corrupts the cross-attention binding between spatial tokens and text tokens. The FFHQ dataset provides sufficient orientation diversity without flips.
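To make the failure mode concrete, this is roughly what a correct flip would have to do if it were ever enabled: mirror the x coordinates and also swap each left/right landmark slot. The index pairs below are hypothetical placeholders; the real DWPose index layout would have to be taken from the pose estimator itself.

```python
# Hypothetical left/right index pairs, for illustration only.
SWAP_PAIRS = [(1, 2), (3, 4)]   # (left_eye, right_eye), (left_shoulder, right_shoulder)

def flip_keypoints(keypoints, width, swap_pairs=SWAP_PAIRS):
    """Mirror (x, y) keypoints horizontally and swap left/right landmark slots."""
    flipped = [(width - 1 - x, y) for x, y in keypoints]
    for left, right in swap_pairs:
        flipped[left], flipped[right] = flipped[right], flipped[left]
    return flipped

# Index 0 = nose, 1 = left eye, 2 = right eye, 3 = left shoulder, 4 = right shoulder
kps = [(50, 40), (60, 30), (40, 30), (80, 90), (20, 90)]
print(flip_keypoints(kps, width=101))
```

For this perfectly symmetric pose the remapped flip returns the original keypoints. Without the swap, slot 1 ("left eye") would end up holding the coordinates of the right eye, which is exactly the corrupted binding described above.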

The Ablation Design

We trained four arms for 5,000 steps each, all on the same physical quad-GPU Vast.ai node with GPU assignment pinned via CUDA_VISIBLE_DEVICES. Running every arm on the same hardware eliminates variance from GPU-to-GPU silicon differences — an often underappreciated confound in ablation studies that share results across separately provisioned machines.

Arm	Optimizer	TREAD	Loss Formulation
A — Baseline	AdamW	Off	Standard flow-matching
B — TREAD+AdamW	AdamW	On	Standard flow-matching
C — TREAD+Muon	Muon	On	Standard flow-matching
D — Full Stack	Muon	On	Flow-matching + REPA

TREAD (Token Routing for Efficient Architecture-agnostic Diffusion Training) probabilistically routes up to 50% of tokens around intermediate attention and feed-forward blocks. Tokens are extracted at an early layer and reinjected near the output, bypassing the bulk of the network’s compute. The theoretical promise is a direct reduction in FLOPs for those bypassed tokens, and because bypassed tokens still contribute to the loss, early layers receive a gradient signal from late-stage objectives — a form of pseudo-deep supervision.
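The routing idea can be sketched in a few lines. This toy version treats tokens as scalars and the three network stages as plain functions; the selection rule (a uniformly random subset of fixed size) and all names are assumptions for illustration, not the TREAD reference implementation or the prx-tg code.

```python
import random

def tread_forward(tokens, early, middle, late, route_rate=0.5, rng=random):
    """Run `early` on all tokens, `middle` only on non-routed tokens, `late` on all."""
    hidden = [early(t) for t in tokens]
    # Pick a random subset of token positions that bypasses the middle blocks.
    n_bypass = int(route_rate * len(hidden))
    bypass = set(rng.sample(range(len(hidden)), n_bypass))
    hidden = [h if i in bypass else middle(h) for i, h in enumerate(hidden)]
    # Bypassed tokens rejoin the sequence at their original positions.
    return [late(h) for h in hidden], bypass

tokens = [1, 2, 3, 4]
out, bypass = tread_forward(
    tokens,
    early=lambda t: t + 1,
    middle=lambda h: h * 10,     # stands in for the expensive intermediate blocks
    late=lambda h: h - 1,
    rng=random.Random(0),
)
print(out, sorted(bypass))
```

Only the tokens outside `bypass` pay for `middle`, which is where the FLOP saving comes from, while every token still reaches `late` and therefore still contributes to the loss.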

Muon (Jordan & others, 2024) is a spectral optimizer that applies orthogonalized Nesterov momentum via a Newton-Schulz polynomial iteration, producing update matrices that converge to the nearest orthogonal matrix. Unlike AdamW’s per-parameter scalar moment estimation, Muon enforces a uniform update magnitude across each weight matrix. As a practical bonus, Muon’s single momentum buffer costs 4 bytes per parameter versus AdamW’s 8 (two buffers), reducing optimizer state memory by 50% — meaningful at this hardware budget.
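The orthogonalization step is easy to demonstrate on a tiny matrix. The sketch below uses the textbook cubic Newton-Schulz iteration in pure Python. Muon itself uses a tuned higher-order polynomial and runs on GPU tensors, so this shows the principle and is not the optimizer's actual code.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(g, steps=15):
    """Cubic Newton-Schulz iteration: X <- 1.5 X - 0.5 X X^T X."""
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]   # Frobenius scaling keeps singular values <= 1
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxtx)]
    return x

grad = [[2.0, 1.0], [0.0, 1.0]]   # toy stand-in for a momentum matrix
q = newton_schulz(grad)
print(matmul(q, transpose(q)))    # close to the 2x2 identity
```

Each iteration pushes every singular value toward 1, so the result keeps the directions of the input but discards its per-direction magnitudes. That is the "uniform update magnitude" property described above.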

REPA (Representation Alignment) (Yu et al., 2024) augments the flow-matching objective with an alignment penalty between the DiT’s intermediate hidden states and DINOv2’s semantic representations, forcing the generative student to internalize the teacher’s structure. Because this adds a second term to the loss with a different scale, Arm D’s raw loss values are not comparable to A, B, or C. LPIPS comparisons across all arms remain valid.

Results

Final Checkpoint (Step 5000)

Arm	Recon LPIPS ↓	Text LPIPS ↓	Text Manip delta ↑
A — Baseline	0.9352	0.9593	0.466
B — TREAD+AdamW	1.0161	0.9396	0.373
C — TREAD+Muon	0.9463	0.9603	0.546
D — Full Stack	0.9267	0.9219	0.431

Recon LPIPS: reconstruction fidelity given full conditioning (identity + text), 25 samples. Text LPIPS: generation quality given text only, 20 samples. Text Manip delta: mean absolute LPIPS difference between generations for a caption and a single-attribute edit (e.g., “dark hair” → “light hair”) — a measure of how decisively the model responds to text.

All TREAD arms (B, C, D) trained approximately 17% faster in wall-clock time: ~95h versus ~112h for the baseline. At equivalent step budgets this is a direct reduction in future experiment cost.

What We Did Not Expect: AdamW+TREAD Instability

Arm B’s result requires a post-mortem. It achieved its best reconstruction at step 3000 (Recon LPIPS 0.906 — briefly the best of any arm) and then collapsed monotonically to 1.016 by step 5000, a value exceeding 1.0, meaning the model performs worse than a trivial baseline on reconstruction at its final checkpoint.

The collapse is not sudden. It begins around step 3500 and degrades progressively — which is why we did not catch it early. A prior independent run showed the same pattern, confirming this is reproducible behavior rather than a stochastic outlier.

The mechanism is a mathematical incompatibility between AdamW’s adaptive moment estimation and TREAD’s dynamic spatial sparsity. TREAD routes tokens around intermediate blocks, so those blocks receive sparse, irregular gradient signals over thousands of iterations. AdamW interprets near-zero gradients as low-variance parameters and decays their second-moment estimates accordingly. This inflates the adaptive learning rate for those “starved” weights. When a high-frequency token is eventually routed through a starved block, the resulting gradient is multiplied by the inflated rate and produces a divergent update that shatters the block’s representations. The failure accumulates gradually and then becomes catastrophic.
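A scalar toy simulation shows the inflation. It tracks only Adam's bias-corrected second moment for one weight and reports the multiplier 1 / (sqrt(v_hat) + eps) that scales its updates. The gradient schedule (100 dense steps followed by 2,000 starved steps) is invented for illustration and is not measured from the Arm B run.

```python
import math

def adam_scale(grads, beta2=0.999, eps=1e-8):
    """Per-step multiplier 1 / (sqrt(v_hat) + eps) that Adam applies to a weight's update."""
    v = 0.0
    scales = []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g * g      # second-moment EMA
        v_hat = v / (1.0 - beta2 ** t)             # bias correction
        scales.append(1.0 / (math.sqrt(v_hat) + eps))
    return scales

dense = adam_scale([1.0] * 2100)                    # block sees a gradient every step
starved = adam_scale([1.0] * 100 + [0.0] * 2000)    # block is routed around for 2000 steps

print(f"dense multiplier:   {dense[-1]:.2f}")
print(f"starved multiplier: {starved[-1]:.2f}")
```

The starved weight ends up with a multiplier several times larger than the dense one, so the next real gradient that reaches it is amplified accordingly.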

This is not a deficiency in TREAD itself. It is a fundamental incompatibility between per-parameter scalar moment estimation and dynamic spatial routing. Do not use TREAD with AdamW for long runs.

Muon as the Fix

Arm C demonstrates the resolution. Muon’s orthogonalized updates enforce a fixed spectral norm across the entire weight matrix, not per-parameter scaling. There are no “starved” parameters — every weight receives a geometrically uniform step. The TREAD-induced sparsity pattern becomes irrelevant because the optimizer is not accumulating per-parameter learning rate history in a way that can diverge.

The result: Arm C’s Recon LPIPS (0.946) is 0.070 points better than Arm B’s final collapse, within 0.011 of the stable baseline (Arm A), with the full 17% throughput gain intact. And its Text Manipulation delta (0.546) is the highest of any arm — Muon’s isotropic updates appear to promote stronger, more decisive binding between text token activations and output features. For a model where the primary use case is text-driven portrait control, this matters.

Full Stack as the Production Target

Arm D (TREAD + Muon + REPA) achieves the best metrics across both dimensions: Recon LPIPS 0.927, Text LPIPS 0.922. The REPA loss accelerates early semantic acquisition — Arm D’s Text LPIPS broke below 0.90 by step 500, while other arms reached comparable values much later. Muon’s stability allowed the model to reach final convergence without the instabilities that would accompany the modified dual-objective loss under AdamW.

The following collage shows text-only outputs from all four arms at their final checkpoint (step 5000), using the same evaluation prompt. Arm D’s output consistently shows stronger structural coherence and finer detail.

The Traps Ahead

Completing the study also clarified several failure modes we need to address for production-scale training.

REPA termination. DINOv3 is a discriminative model operating in a lower-dimensional embedding space optimized for classification and dense feature matching. It discards high-frequency textural variance — pores, hair strands, skin texture — that photorealism requires. In the burn-in phase, REPA’s alignment penalty is genuinely helpful: it pulls the DiT out of its initial chaotic state. Beyond that, the teacher’s embeddings become a constraint, penalizing the generator for synthesizing details that don’t exist in the teacher’s feature maps. The HASTE framework describes this as the “works until it doesn’t” trap. For production runs, the REPA alignment weight should be decayed to zero by approximately step 1000–1500 (the first 20–30% of a 5000-step run), then let the model converge on unconstrained flow-matching alone. Our current 5000-step study ran REPA to completion — the metrics are still the best of any arm, but we likely left quality on the table.
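One simple way to implement such a warmdown is a linear ramp to zero over the first quarter of training. The function below is a sketch under that assumption; the linear shape, the `base_weight` value, and the names are placeholders rather than settings from the study.

```python
def repa_weight(step: int, total_steps: int, base_weight: float = 0.5,
                decay_frac: float = 0.25) -> float:
    """Linearly decay the REPA alignment weight to zero over the first decay_frac of training."""
    cutoff = decay_frac * total_steps
    if step >= cutoff:
        return 0.0
    return base_weight * (1.0 - step / cutoff)

# Intended use inside the training loop (names hypothetical):
#   loss = flow_matching_loss + repa_weight(step, total_steps) * alignment_loss
for step in (0, 625, 1250, 5000):
    print(step, repa_weight(step, total_steps=5000))
```

With `total_steps=5000` the weight reaches zero at step 1250, inside the 1000 to 1500 window suggested above, and the same fraction scales to longer budgets.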

Pixel scaling. When processing RGB data directly without a VAE bottleneck, images must be scaled correctly into the [−1, 1] range expected by the diffusion process. Currently, the dataloader yields [0, 1] RGB pixels, which slightly biases the flow-matching objective. Correcting the pixel normalization pipeline is a prerequisite for reliable convergence at scale.

Spatial evaluation. LPIPS measures perceptual texture similarity and broad structural alignment. It cannot verify whether the generated pose matches the DWPose conditioning input. A model can generate a photorealistic face (excellent LPIPS) while completely ignoring the jaw angle or shoulder position specified by the spatial condition. The next iteration needs MPJPE (Mean Per Joint Position Error) in the validation loop — specifically PA-MPJPE (Procrustes-Aligned MPJPE), which isolates structural accuracy from rotational and scale variance — to prove that the DiT’s cross-attention mechanisms actually bind visual output to spatial conditions.
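For 2D keypoints the Procrustes alignment has a compact closed form: after centering, a rotation plus uniform scale is a single complex factor. The sketch below assumes 2D (x, y) joints and uses that trick; the 3D case would need an SVD-based solve instead. Function names are ours.

```python
def mpjpe(pred, target):
    """Mean Euclidean distance between matching 2D joints."""
    return sum(abs(complex(*p) - complex(*t)) for p, t in zip(pred, target)) / len(pred)

def pa_mpjpe(pred, target):
    """MPJPE after the best similarity transform (translation, rotation, scale) of pred."""
    p = [complex(*xy) for xy in pred]
    t = [complex(*xy) for xy in target]
    p_mean, t_mean = sum(p) / len(p), sum(t) / len(t)
    p = [z - p_mean for z in p]
    t = [z - t_mean for z in t]
    # Least-squares complex factor a minimizing sum |a * p_i - t_i|^2.
    a = sum(z.conjugate() * w for z, w in zip(p, t)) / sum(abs(z) ** 2 for z in p)
    return sum(abs(a * z - w) for z, w in zip(p, t)) / len(p)

target = [(0, 0), (2, 0), (2, 1), (0, 3)]
# Same skeleton rotated by 90 degrees, scaled by 2, and shifted by (5, 5).
pred = [(5, 5), (5, 9), (3, 9), (-1, 5)]
print(mpjpe(pred, target), pa_mpjpe(pred, target))
```

The raw MPJPE is large because the skeleton is displaced and rotated, while PA-MPJPE is essentially zero because the joint structure is identical. That is the separation of structural accuracy from rotation and scale described above.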

What’s Next

The ablation clears the path for the next phase of prx-tg development. The production training configuration is Full Stack (Arm D) with REPA loss decay implemented from the start. The immediate engineering priorities are:

Implement REPA warmdown scheduling — decay the alignment weight to zero by step ~1250 for a 5000-step run, or proportionally for longer budgets.
Pixel normalization pipeline — ensure RGB tensors are properly centered at zero [−1, 1] before DiT input.
MPJPE/PA-MPJPE validation — instrument the validation loop with a second-stage pose estimator to measure spatial controllability quantitatively.
Longer runs — the 5000-step study was designed to isolate optimizer dynamics under controlled conditions. Production-quality generation at 1024×1024 will require substantially more steps. The 17% throughput gain from TREAD directly compounds the value of every future training hour.

The study confirms that the engineering hypothesis holds: state-of-the-art multi-modal generation at 1024×1024 is trainable on a single consumer GPU. It does not require a cluster — it requires careful memory engineering, the right optimizer for the architecture, and disciplined ablation to understand what fails and why.

Code and Data

The full ablation write-up, per-checkpoint metrics, and arm configurations are in the repository:

prx-tg: github.com/timlawrenz/prx-tg — model, training code, ablation docs
stratum-hq: github.com/timlawrenz/stratum-hq — data ingestion, preprocessing, augmentation pipeline
Ratiocinator: github.com/timlawrenz/ratiocinator — the autonomous experiment runner that provisioned and monitored the ablation

References

Peebles, W., & Xie, S. (2023). Scalable Diffusion Models with Transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision.
Jordan, K., & others. (2024). Muon: An Optimizer for Hidden Layers in Neural Networks. ArXiv Preprint.
Yu, S., Jin, S., Lee, J., Kim, J., & Shin, J. (2024). Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think. ArXiv Preprint ArXiv:2410.06940.

Eradicating Syntax: Building a Neural Universal Machine That Executes Graphs, Not Code

2026-05-05T00:00:00+00:00

Our GNN autoencoders achieved 81% node accuracy on Ruby ASTs yet produced 0% valid code. The culprit was the literal value bottleneck — nearly half of every AST consisted of names and values that were irrecoverable from the structural encoding. Rather than patch the representation, we asked a more radical question: what if AI never generated human-readable code at all?

This post documents the pivot from patching GNN decoders to building a Neural Universal Machine — a system where a Diffusion Transformer generates executable Directed Acyclic Graphs directly, bypassing programming language syntax entirely. We validate the approach end-to-end: from a working graph-walk interpreter that computes Fibonacci(10), through a 12.3× vocabulary compression pipeline, to a Permuted Dense DiT that achieves 100% Syntactic Validity on 128-node execution graphs.

The Insight: Why Generate Text at All?
The Execution Engine: A Graph-Walk Interpreter
- The Six Universal Motifs
- The Fibonacci Proof
Dataset Compression: 74 Dimensions → 6 Motifs
- Compression Results
The Generative Model: Permuted Dense DiT
The Validation Harness: 5 Laws of Physics
Three Branches of Government
What’s Next: RLAIF for Deterministic Perfection
Try It Yourself
References

The Insight: Why Generate Text at All?

The current paradigm of AI code generation is a translation bottleneck: a high-dimensional neural network collapses its probabilistic understanding into a linear, human-readable string of characters, which a compiler then immediately parses back into a multi-dimensional graph to execute.

If we remove the human from the loop, a “programming language” designed natively for AI should not be a language at all. It should be a mathematical specification for a DAG.

Four core principles drive the new architecture:

The Death of Variable Names. Variable names are human mnemonics. The AI-native language relies entirely on directed edges — data dependencies are pure topological routing.
Eradication of Syntax Sugar. No parentheses, no brackets, no formatting. The “code” is saved as a sparse adjacency matrix and a minimal feature vector.
Execution by Graph-Walk. The matrix is not compiled or parsed; it is traversed directly by a minimal graph-walking interpreter.
Guaranteed Syntax. Because the model generates graph topology rather than stringing text together, “syntax errors” become mathematically impossible.

The Execution Engine: A Graph-Walk Interpreter

Before training any generative model, we needed to prove that pure topological matrices are Turing-complete. We built a minimal virtual machine that executes graphs directly.

The Six Universal Motifs

Drawing from the Böhm-Jacopini theorem (Böhm & Jacopini, 1966), we define exactly six node types — sufficient to express any computable function:

Motif	Role	Execution Semantics
`Boundary`	Program entry/exit	Routes execution forward or halts
`Sequence`	Linear execution	Passes control to the next node
`Condition`	Boolean branching	Evaluates data input, routes to True (index 0) or False (index 1)
`Loop`	Iteration	Like Condition, but the True path loops back
`State`	Memory read/write	On the execution path: writes incoming data to memory
`Message`	Function call / constant	Evaluates `+`, `<`, `print`, or returns a literal

Two edge types connect them: EXECUTION edges (control flow — “go here next”) and DATA edges (value flow — “use this as argument N”).

The Fibonacci Proof

To validate the interpreter, we hand-constructed a 25-node execution graph equivalent to:

a, b, count = 0, 1, 0
while count < 10:
    temp = a + b
    a = b
    b = temp
    count = count + 1
print(a)

The graph encodes this logic as pure topology — no variable names exist in the execution matrix. The literal_pool (a separate dictionary managed by the “Legislative Branch”) maps integer pointers to values: {0: "a", 1: "b", 2: "count", 7: "+", 8: "<", ...}.

The interpreter drops an execution pointer onto the entry Boundary node, resolves data dependencies recursively through DATA edges, evaluates Message nodes via a minimal stdlib (+, <, -, print), and routes through the Loop node’s boolean condition.

Result: memory["a"] == 55. The 10th Fibonacci number, computed natively via matrix traversal — no parser, no compiler, no syntax.

This demonstrates two things: (1) the six Motifs cover sequencing, selection, and iteration, the three constructs the Böhm-Jacopini theorem requires, and (2) a graph-walk interpreter can execute that logic from pure adjacency structure plus a constant pool.
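A minimal sketch of such an interpreter is shown below. It is not the actual implementation: the node dictionary layout, the helper names, and the pool entries other than pointers 0, 1, 2, 7, and 8 are assumptions, and the graph shares data nodes, so it is smaller than the 25-node original and omits the final print.

```python
import operator

STDLIB = {"+": operator.add, "-": operator.sub, "<": operator.lt}

def node(motif, lit=None, data=(), nxt=()):
    return {"motif": motif, "lit": lit, "data": list(data), "next": list(nxt)}

def evaluate(nodes, pool, memory, nid):
    """Resolve a value by recursing through DATA edges."""
    n = nodes[nid]
    if n["motif"] == "State":                     # off the execution path: memory read
        return memory[pool[n["lit"]]]
    args = [evaluate(nodes, pool, memory, d) for d in n["data"]]
    value = pool[n["lit"]]
    return STDLIB[value](*args) if args else value   # function call vs. literal

def run(nodes, pool, entry=0, max_steps=10_000):
    """Walk EXECUTION edges from the entry Boundary until an exit Boundary is reached."""
    memory, pc = {}, entry
    for _ in range(max_steps):
        n = nodes[pc]
        if n["motif"] == "State":                 # on the execution path: memory write
            memory[pool[n["lit"]]] = evaluate(nodes, pool, memory, n["data"][0])
        if n["motif"] in ("Condition", "Loop"):   # True -> index 0, False -> index 1
            pc = n["next"][0 if evaluate(nodes, pool, memory, n["data"][0]) else 1]
        elif n["next"]:
            pc = n["next"][0]
        else:
            return memory                         # exit Boundary: halt
    raise RuntimeError("step budget exhausted")

POOL = {0: "a", 1: "b", 2: "count", 3: "temp", 4: 0, 5: 1, 6: 10, 7: "+", 8: "<"}
NODES = [
    node("Boundary", nxt=[1]),               # 0  entry
    node("State", 0, data=[10], nxt=[2]),    # 1  a = 0
    node("State", 1, data=[11], nxt=[3]),    # 2  b = 1
    node("State", 2, data=[10], nxt=[4]),    # 3  count = 0
    node("Loop", data=[13], nxt=[5, 9]),     # 4  while count < 10
    node("State", 3, data=[14], nxt=[6]),    # 5  temp = a + b
    node("State", 0, data=[16], nxt=[7]),    # 6  a = b
    node("State", 1, data=[17], nxt=[8]),    # 7  b = temp
    node("State", 2, data=[18], nxt=[4]),    # 8  count = count + 1
    node("Boundary"),                        # 9  exit
    node("Message", 4),                      # 10 literal 0
    node("Message", 5),                      # 11 literal 1
    node("Message", 6),                      # 12 literal 10
    node("Message", 8, data=[19, 12]),       # 13 count < 10
    node("Message", 7, data=[15, 16]),       # 14 a + b
    node("State", 0),                        # 15 read a
    node("State", 1),                        # 16 read b
    node("State", 3),                        # 17 read temp
    node("Message", 7, data=[19, 11]),       # 18 count + 1
    node("State", 2),                        # 19 read count
]
print(run(NODES, POOL)["a"])   # prints 55
```

Variable names appear only in the pool; the node list itself contains nothing but motifs, integer pointers, and edge indices.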

Dataset Compression: 74 Dimensions → 6 Motifs

With the execution engine validated, we built a compression pipeline to transform our existing 22,452 Ruby ASTs into training data for the generative model.

The compressor (scripts/dataset_prep/compress_ast.py) performs three operations:

Motif Mapping. Every Ruby AST node type (73 unique types: def, send, lvar, if, while, …) is mapped to one of the 6 Universal Motifs via a deterministic lookup table.
Literal Extraction. Primitive children (strings, integers, floats, booleans) are stripped from the tree and collected into a deduplicated literal_pool. Each extracted value gets an integer pointer. The structural graph references these values only through pointer indices — the actual content lives entirely outside the topology.
Edge Re-Routing. Sequence and Boundary nodes chain their children via EXECUTION edges (control flow). Everything else (Condition, Loop, Message, State) connects children via DATA edges with positional input_index values.
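The three operations can be sketched on a toy AST written as nested tuples. The lookup table below is a hypothetical excerpt and not the mapping used in compress_ast.py. Unlike the pointer-to-value pool shown earlier, the pool here is keyed by value so that deduplication is a single dictionary lookup.

```python
# Hypothetical excerpt of the lookup table; the real one covers every Ruby node type.
MOTIF_OF = {"def": "Boundary", "begin": "Sequence", "if": "Condition",
            "while": "Loop", "lvasgn": "State", "lvar": "State", "send": "Message"}
PRIMITIVES = (str, int, float, bool)

def compress(ast, pool, nodes=None):
    """Convert a (type, child, ...) tuple tree into motif nodes plus a literal pool."""
    nodes = [] if nodes is None else nodes
    node_type, *children = ast
    entry = {"motif": MOTIF_OF[node_type], "literals": [], "edges": []}
    nodes.append(entry)
    # Sequence and Boundary chain children by control flow; everything else passes data.
    kind = "EXECUTION" if entry["motif"] in ("Sequence", "Boundary") else "DATA"
    for child in children:
        if isinstance(child, PRIMITIVES):
            key = (type(child).__name__, child)   # keep 1, 1.0, and True distinct
            entry["literals"].append(pool.setdefault(key, len(pool)))
        else:
            entry["edges"].append((kind, len(entry["edges"]), len(nodes)))
            compress(child, pool, nodes)
    return nodes

ast = ("while",
       ("send", ("lvar", "count"), "<", 10),
       ("lvasgn", "count", ("send", ("lvar", "count"), "+", 1)))
pool = {}
nodes = compress(ast, pool)
print([n["motif"] for n in nodes], len(pool))
```

The name `count` occurs three times in the source tree but gets a single pointer, and the resulting nodes carry only motifs, pointers, and typed edges.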

Compression Results

When applied to a complex, real-world example (the 144-node structure method from the AWS Ruby SDK), the compressor produced the following:

Metric	Before (Ruby AST)	After (Motif Graph)
Node vocabulary	74 types	6 types
Literal values in graph	50 (embedded)	0 (extracted to pool)
Edge types	Implicit (parent→child)	107 DATA + 36 EXECUTION

This is a 12.3× reduction in structural vocabulary — from 74 dimensions of Ruby syntax noise down to 6 language-agnostic primitives. The literal value bottleneck that destroyed our GNN autoencoders is eliminated by construction: literals are no longer in the graph. They live in a separate constant pool managed by the “Legislative Branch” (an LLM).

The compressed dataset produces perfectly dense, low-dimensional matrices — ideal inputs for a Diffusion Transformer.

The Generative Model: Permuted Dense DiT

With a Turing-complete execution engine and a compressed training set, we built a Diffusion Transformer to generate valid execution graphs from scratch.

The Spatial Bias Trap

Standard image DiTs (e.g., Stable Diffusion 3, Sora) use Vision Transformer blocks with 2D positional encodings. Applied to an adjacency matrix, this teaches the network that Node 4 connects to Node 5 because they are “next to each other” spatially. But in a graph, node ordering is entirely arbitrary: adjacency is topological, not positional.

We solved this with two mechanisms:

1. Node Permutation Augmentation. The DataLoader randomly shuffles node ordering every time a graph is fetched. The topological routing remains identical, but the matrix layout changes completely. This mathematically destroys spatial bias and forces the DiT to learn pure topological rules.

2. Cross-Hatch Embedding Injection. The DiT operates on a 2D adjacency matrix, but its conditioning signal (the Motifs) is a 1D list. The InputConditioner bridges this gap:

Embeds the 1D Motif tensor [N] into [N, 128]
Broadcasts across rows → [N, N, 128] (source node identity)
Broadcasts across columns → [N, N, 128] (target node identity)
Concatenates both with the 3-channel noisy adjacency → 35 channels per pixel

Every coordinate (i, j) now carries complete information about which Motifs are being connected, giving the DiT 360° structural awareness.
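The node permutation augmentation from the first mechanism amounts to applying one random permutation to the Motif list and to both axes of the adjacency matrix. A minimal sketch, with hypothetical names and a single-channel adjacency for brevity:

```python
import random

def permute_graph(motifs, adj, rng=random):
    """Relabel nodes with one random permutation applied to motifs, rows, and columns."""
    n = len(motifs)
    perm = list(range(n))
    rng.shuffle(perm)                         # new position k holds old node perm[k]
    new_motifs = [motifs[p] for p in perm]
    new_adj = [[adj[pi][pj] for pj in perm] for pi in perm]
    return new_motifs, new_adj, perm

motifs = ["Boundary", "Sequence", "Boundary"]
adj = [[0, 1, 0],
       [0, 0, 1],
       [0, 0, 0]]                             # 0 -> 1 -> 2
new_motifs, new_adj, perm = permute_graph(motifs, adj, rng=random.Random(0))
print(perm, new_motifs, new_adj)
```

Every edge still connects the same pair of Motifs after the shuffle; only the row and column positions change, so the model cannot rely on them.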

Axial Attention: Message-Passing in Matrix Form

Because we abandoned ViT square patches, the model processes the full matrix using Axial (Row-Column) Attention — which naturally mimics graph message passing:

Row Attention (the “outgoing” perspective): Evaluates all potential connections from a node simultaneously. “I am a Condition — I must point to exactly two targets.”
Column Attention (the “incoming” perspective): Evaluates all incoming dependencies to a node. “I am a State — I can accept at most one data source.”

This bidirectional reasoning is critical: graph validity requires both out-degree and in-degree constraints to be satisfied simultaneously.

Hybrid Loss: Flow Matching + Classification

The model predicts 6 output channels per edge coordinate:

Channels	Meaning	Loss
0–1	Presence, Edge Type	Optimal Transport Flow Matching (masked MSE)
2–5	Input Index logits (0–3)	Categorical Cross-Entropy (masked)

For the continuous channels, we use Conditional Flow Matching (Lipman et al., 2023): the target velocity is $v_t = x_1 - x_0$ (clean adjacency minus Gaussian noise), and the model learns to predict this velocity field. At inference, a 20-step Euler ODE solver integrates from noise to structure.
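The sampling loop is short enough to show in full. In this sketch the trained network is replaced by an oracle that returns the exact target velocity $x_1 - x_0$, which lets the integrator be checked in isolation; the function names and the 3-element toy vectors are assumptions.

```python
def euler_sample(x0, velocity, steps=20):
    """Integrate dx/dt = velocity(x, t) from t = 0 (noise) to t = 1 (data)."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.3, -1.2, 0.8]     # x_0: sample from the Gaussian prior
clean = [1.0, 0.0, 1.0]      # x_1: flattened clean adjacency entries

def oracle(x, t):
    """Stand-in for the trained DiT: the exact velocity x_1 - x_0."""
    return [c - n for c, n in zip(clean, noise)]

print(euler_sample(noise, oracle))
```

With a perfect velocity field the 20 Euler steps land on the clean sample up to floating-point error. With a learned field, the step count trades integration error against inference cost.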

For the discrete channel (argument ordering), continuous regression would cause “rounding collisions” where two edges claim the same input index. Instead, the model outputs categorical logits and a Cross-Entropy loss forces mutually exclusive assignment.

Padding masking is essential: graphs vary from 3 to 128 nodes but are padded to a fixed $128 \times 128$ matrix. Without masking, the loss would penalize the model for failing to denoise 16,000+ void pixels.
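A minimal NumPy sketch of the masked hybrid loss follows. It assumes the mask covers only padding (the top-left `n_real × n_real` block is real), that both terms are weighted equally, and that `hybrid_loss` and its arguments are illustrative names rather than the training code's API.

```python
import numpy as np

def hybrid_loss(pred, x0, x1, index_target, n_real):
    """pred: [N, N, 6] model output; x0 / x1: [N, N, 2] noise / clean channels;
    index_target: [N, N] ints in 0..3; n_real: number of non-padding nodes."""
    n = pred.shape[0]
    mask = np.zeros((n, n))
    mask[:n_real, :n_real] = 1.0
    denom = mask.sum()

    # Channels 0-1: regress the flow-matching velocity v = x1 - x0.
    v_target = x1 - x0
    mse = (((pred[..., :2] - v_target) ** 2).mean(axis=-1) * mask).sum() / denom

    # Channels 2-5: cross-entropy over the four input-index classes.
    logits = pred[..., 2:]
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -np.take_along_axis(log_probs, index_target[..., None], axis=-1)[..., 0]
    ce = (nll * mask).sum() / denom
    return float(mse + ce)
```

Dividing by the number of real cells rather than by all 16,384 keeps the loss scale comparable between a 3-node and a 128-node graph.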

Hyperparameter Ablation

We ran an 8-configuration grid search over batch size, depth, and learning rate:

Parameter	Optimal	Reasoning
Effective Batch Size	16	BS=4 too noisy; BS≥32 washes out categorical gradients
Axial Depth	12 blocks	6 insufficient for global routing; 24 overfits out-degree at in-degree’s expense
Learning Rate	1e-4	5e-4 causes catastrophic gradient explosions (loss > 26.0); 1e-5 too slow

At the optimal configuration, the model achieved 32.4% in-degree / 46.0% out-degree pass rates during early training — enough signal for the curriculum to scale.

The Validation Harness: 5 Laws of Physics

To grade the DiT’s output deterministically (no LLM judge, no fuzzy metrics), we implemented a static topological analyzer that enforces five absolute graph laws. A generated matrix must pass all five to count as syntactically valid:

Law 1: Execution Out-Degree

Each Motif has strict branching limits:

Motif	Legal Out-Degree
Boundary	0 (exit) or 1 (entry)
Sequence, State, Message	≤ 1
Condition, Loop	Exactly 0 or 2 (and branch indices must be distinct)

A Condition node with 3 outgoing execution edges? Illegal. One with two edges both labeled “True path”? Also illegal.
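The out-degree table translates almost directly into a predicate. This is a sketch of the rule as stated above, not the harness itself; `out_degree_ok`, the string Motif names, and the `is_entry` flag are illustrative.

```python
def out_degree_ok(motif, branch_indices, is_entry=False):
    """branch_indices: branch labels on the node's outgoing execution edges."""
    k = len(branch_indices)
    if motif == "Boundary":
        return k == (1 if is_entry else 0)
    if motif in ("Sequence", "State", "Message"):
        return k <= 1
    if motif in ("Condition", "Loop"):
        # Either a dead end, or exactly two edges with distinct branch labels.
        return k == 0 or (k == 2 and len(set(branch_indices)) == 2)
    return False  # unknown Motif types are rejected

print(out_degree_ok("Condition", ["true", "false"]))  # True
print(out_degree_ok("Condition", ["true", "true"]))   # False
```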

Law 2: Data In-Degree (Arity)

Strict argument constraints prevent impossible data routing:

Condition and Loop nodes require exactly 1 incoming data edge (the boolean)
State writes require exactly 1 incoming data edge (the value to store)
Message nodes require unique, non-duplicate argument indices

Law 3: No Orphans (Reachability)

A BFS over the combined execution+data connectivity graph confirms zero disconnected logic islands. Every node must be reachable from the rest of the graph.

Law 4: Acyclic Data Plane

A DFS cycle detector ensures the DATA edge subgraph contains no paradoxes. Execution edges may cycle (that’s what loops are), but data dependencies must form a strict DAG — otherwise you get circular definitions (a = b; b = a).
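One common way to implement such a detector is three-colour DFS, sketched below with an explicit stack so that 128-node graphs never hit Python's recursion limit. The function name and the edge-list input format are assumptions for illustration.

```python
def has_cycle(num_nodes, data_edges):
    """Return True if the directed DATA subgraph contains a cycle."""
    adj = [[] for _ in range(num_nodes)]
    for src, dst in data_edges:
        adj[src].append(dst)

    WHITE, GREY, BLACK = 0, 1, 2        # unvisited / on the DFS stack / finished
    color = [WHITE] * num_nodes
    for start in range(num_nodes):
        if color[start] != WHITE:
            continue
        color[start] = GREY
        stack = [(start, iter(adj[start]))]
        while stack:
            node, neighbours = stack[-1]
            nxt = next(neighbours, None)
            if nxt is None:             # all neighbours explored
                color[node] = BLACK
                stack.pop()
            elif color[nxt] == GREY:    # back edge: nxt is an ancestor
                return True
            elif color[nxt] == WHITE:
                color[nxt] = GREY
                stack.append((nxt, iter(adj[nxt])))
    return False

print(has_cycle(2, [(0, 1), (1, 0)]))          # True  (a = b; b = a)
print(has_cycle(3, [(0, 1), (1, 2), (0, 2)]))  # False (a DAG)
```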

Law 5: Terminal Sink

A reverse-BFS from exit nodes (those with 0 outgoing execution edges) confirms that every execution node can reach a termination point. This prevents infinite loops without escape hatches.
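The reverse search can be sketched as follows. For simplicity it assumes every node in the list is an execution node, and `nodes_without_exit` is a hypothetical helper name.

```python
from collections import deque

def nodes_without_exit(num_nodes, exec_edges):
    """Return the nodes that cannot reach any exit (a node with out-degree 0)."""
    out_deg = [0] * num_nodes
    reverse = [[] for _ in range(num_nodes)]
    for src, dst in exec_edges:
        out_deg[src] += 1
        reverse[dst].append(src)

    exits = [n for n in range(num_nodes) if out_deg[n] == 0]
    seen = set(exits)
    queue = deque(exits)
    while queue:                         # walk edges backwards from the exits
        node = queue.popleft()
        for pred in reverse[node]:
            if pred not in seen:
                seen.add(pred)
                queue.append(pred)
    return sorted(set(range(num_nodes)) - seen)

# 0 -> 1 <-> 2 is a loop with no way out; 0 -> 3 terminates.
print(nodes_without_exit(4, [(0, 1), (1, 2), (2, 1), (0, 3)]))  # [1, 2]
```

An empty result means the graph satisfies Law 5.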

The Breakthrough: 100% SVR at 128 Nodes

The 3-Phase Curriculum scaled the DiT from toy graphs (≤10 nodes) to massive 128-node matrices. At Epoch 343, with the Judicial Constraint Solver performing Top-K arity snapping and logit-weighted branch conflict resolution, the model achieved:

100.00% Syntactic Validity Rate on 128-node execution graphs.

The Judicial Constraint Solver bridges the continuous-to-discrete gap: rather than expecting the DiT to perfectly zero its own noise, the solver reads the probability heatmap and mathematically snaps edges into legal bounds. A Condition node’s row gets its Top-2 highest probabilities snapped to 1; conflicting argument indices get resolved via categorical logit magnitude.
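The two snapping rules might look like the sketch below. The Top-K part follows the description directly; the conflict resolution is one plausible greedy reading of "resolved via categorical logit magnitude" and is an assumption, as are the function names and the dictionary input format.

```python
def snap_top_k(row_probs, k):
    """Return a 0/1 row with the k most probable targets set to 1."""
    ranked = sorted(range(len(row_probs)), key=lambda j: row_probs[j], reverse=True)
    keep = set(ranked[:k])
    return [1 if j in keep else 0 for j in range(len(row_probs))]

def resolve_index_conflicts(edge_logits):
    """edge_logits: {edge_id: [logit for input index 0, 1, ...]}.
    Greedy assumption: the strongest (edge, index) logit wins first,
    and each edge and each index is used at most once."""
    candidates = sorted(
        ((logit, edge, idx)
         for edge, logits in edge_logits.items()
         for idx, logit in enumerate(logits)),
        reverse=True,
    )
    assigned, used = {}, set()
    for logit, edge, idx in candidates:
        if edge not in assigned and idx not in used:
            assigned[edge] = idx
            used.add(idx)
    return assigned

print(snap_top_k([0.1, 0.7, 0.2, 0.9], k=2))                       # [0, 1, 0, 1]
print(resolve_index_conflicts({"a": [2.0, 1.0], "b": [3.0, 0.5]}))  # {'b': 0, 'a': 1}
```

In the second call both edges prefer index 0; edge "b" has the larger logit, so "a" falls back to index 1.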

Three Branches of Government

The full system separates concerns like a constitutional government:

Branch	Model	Responsibility
Legislative	Semantic LLM	Translates human intent → Motif list + Literal Pool
Executive	Permuted Dense DiT	Generates the topological routing (adjacency matrix)
Judicial	Constraint Solver	Snaps continuous heat maps → discrete, legal DAGs

The DiT knows nothing about human language. It purely outputs mathematically valid logic scaffolds. The LLM knows nothing about graph topology. It purely manages the semantic content. The Constraint Solver enforces constitutional law on both.

What’s Next: RLAIF for Deterministic Perfection

The continuous Flow Matching objective plateaued at loss ~0.135 — the model has extracted maximum topological value from the passive pre-training objective. The next phase transitions to Reinforcement Learning from AI Feedback (RLAIF), using the 5 Laws of Physics as a direct reward signal:

Base rewards: +0.2 for passing No Orphans, Acyclic Data, and In-Degree
Load-bearing penalties: +0.4 for Out-Degree pass (−0.2 fail), +0.4 for Terminal Sink pass (−0.4 fail)
Jackpot: If all 5 laws pass → 2.5× multiplier on total reward

A KL Divergence Anchor to the frozen pre-trained weights prevents reward hacking (mode-collapsing into trivial straight-line graphs).
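As a worked example, the reward schedule can be written as a small function. It assumes the +0.2 base reward applies to each of the three base laws separately and that a failed base law contributes 0; the dictionary keys are made-up labels, and the KL anchor is not modeled here.

```python
def graph_reward(passed):
    """passed: dict mapping law name -> bool for all five laws."""
    reward = 0.0
    for law in ("no_orphans", "acyclic_data", "in_degree"):
        reward += 0.2 if passed[law] else 0.0        # base rewards
    reward += 0.4 if passed["out_degree"] else -0.2   # load-bearing
    reward += 0.4 if passed["terminal_sink"] else -0.4
    if all(passed.values()):
        reward *= 2.5                                 # jackpot multiplier
    return reward

all_pass = dict.fromkeys(
    ("no_orphans", "acyclic_data", "in_degree", "out_degree", "terminal_sink"), True
)
print(round(graph_reward(all_pass), 2))  # 3.5
```

Under these assumptions a fully valid graph scores (0.6 + 0.8) × 2.5 = 3.5, while a graph that fails everything scores −0.6.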

Try It Yourself

The full execution engine, dataset compression pipeline, DiT training code, and validation harness are available:

💻 Code: jubilant-palm-tree — The Neural Universal Machine
📊 Model: timlawrenz/jubilant-palm-tree — Pre-trained checkpoint on the Hub
🤖 Orchestrator: Ratiocinator — Autonomous experiment runner
📄 Previous work: The Literal Value Bottleneck — The GNN study that motivated this pivot

# Clone and run the Fibonacci proof
git clone https://github.com/timlawrenz/jubilant-palm-tree
cd jubilant-palm-tree
pip install -r requirements.txt

# Execute the graph-walk interpreter (no compiler needed)
python src/execution_engine/demo.py

# Compress the Ruby AST dataset into Universal Motifs
python scripts/dataset_prep/compress_ast.py

# Train the Permuted Dense DiT
python src/train.py

References

Böhm, C., & Jacopini, G. (1966). Flow diagrams, Turing machines and languages with only two formation rules. Communications of the ACM, 9(5), 366–371.
Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.

Self-Supervised Pretraining Recipes for Lung CT: A Systematic Study with DINO

2026-04-20T00:00:00+00:00

Self-supervised learning (SSL) promises to unlock the diagnostic potential of large unlabeled medical image archives, yet practitioners face a daunting hyperparameter landscape with little domain-specific guidance. We present a systematic study of pretraining recipes for lung computed tomography (CT), evaluating 60+ experimental configurations on the LIDC-IDRI dataset.

1. The Challenges of Medical SSL
- Core Contributions
2. Methodology: The DINO-X System
3. Results & Discussion
4. Representation Analysis
5. The Infrastructure: 64 Experiments for $35.60
6. Practical Recipes
- The ViT-Small Recipe (Recommended)
- Common Pitfalls
Try It Yourself
References

1. The Challenges of Medical SSL

Medical CT presents unique challenges that break standard natural-image SSL assumptions:

Hounsfield Unit (HU) encoding: Pixel intensities carry calibrated tissue density information. Standard photometric augmentations destroy this signal.
Volumetric context: Pathology manifests across multiple slices. 2D methods must decide how to handle the Z-axis.
Entropy wall: DINO (Caron et al., 2021) training on medical CT frequently stalls at the theoretical maximum entropy, producing uniform outputs that carry no information.

Core Contributions

Entropy wall solution: We identify center momentum as the critical factor for DINO on medical CT.
Medical augmentation guidelines: Evidence that spatial-only augmentation is optimal for HU data.
Capacity-dependent regularization: We discover that KoLeo regularization (Sablayrolles et al., 2019) is critical for ViT-Large but optional for ViT-Small.
Scaling analysis: Tracing the trajectory from random weights to 1032× baseline retrieval.
Clinical evaluation: Establishing malignancy classification baselines and testing 3D feature aggregation.

2. Methodology: The DINO-X System

2.1 Architecture

We evaluate two scales of Vision Transformer (ViT) backbones:

Feature	ViT-Small	ViT-Large
Embedding dim	384	1024
Depth	12	24
Heads	6	16
Backbone params	21.6M	303.2M
Total params	24.9M	312.6M

2.2 Loss Functions & Online Gram Alignment

Our system builds on DINOv2 (Oquab et al., 2023) and independently adopts Online Gram alignment. Unlike DINOv3 (Meta AI, 2025), which anchors to frozen historical checkpoints, DINO-X matches the student’s patch-token Gram matrix to the current EMA teacher at every step.

The total loss is defined as: $L = L_{DINO} + \lambda_{gram} \cdot L_{gram} + \lambda_{koleo} \cdot L_{koleo}$
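A rough NumPy sketch of the Gram term and the weighted sum is shown below. The cosine-normalized Gram matrix and the mean-squared difference are assumptions about the exact form, `lambda_gram=1.0` is a placeholder (only $\lambda_{koleo}=0.1$ is stated in this post), and the function names are illustrative.

```python
import numpy as np

def gram(tokens):
    """tokens: [P, D] patch tokens -> [P, P] cosine-similarity Gram matrix."""
    normed = tokens / np.linalg.norm(tokens, axis=-1, keepdims=True)
    return normed @ normed.T

def gram_loss(student_tokens, teacher_tokens):
    """Mean squared difference between student and (current EMA) teacher Grams."""
    diff = gram(student_tokens) - gram(teacher_tokens)
    return float((diff ** 2).mean())

def total_loss(l_dino, l_gram, l_koleo, lambda_gram=1.0, lambda_koleo=0.1):
    return l_dino + lambda_gram * l_gram + lambda_koleo * l_koleo
```

The Gram matrix only compares patch-to-patch similarity structure, so the student is free to differ from the teacher by any rotation or rescaling of the feature space.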

2.3 Data Pipeline

Dataset: 234,943 axial slices from 981 LIDC-IDRI series.
Input Encoding: 3-channel input constructed from consecutive slices $(z-1, z, z+1)$ to provide local volumetric context.
HU Windowing: A random Hounsfield Unit window is applied per sample to simulate various clinical viewing protocols.
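The input encoding and windowing steps can be sketched together. The window centre and width ranges below are illustrative guesses, not the values used in the study, and clamping the neighbour index at the first and last slice is likewise an assumption; `make_sample` is a hypothetical name.

```python
import numpy as np

def make_sample(volume_hu, z, rng,
                center_range=(-700.0, 50.0), width_range=(350.0, 1600.0)):
    """volume_hu: [Z, H, W] array in Hounsfield Units -> [3, H, W] in [0, 1]."""
    # Slices (z-1, z, z+1), clamped so the first and last slice still work.
    idx = np.clip([z - 1, z, z + 1], 0, volume_hu.shape[0] - 1)
    stack = volume_hu[idx].astype(np.float32)

    # One random window per sample: map [lo, hi] HU to [0, 1] and clip the rest.
    center = rng.uniform(*center_range)
    width = rng.uniform(*width_range)
    lo, hi = center - width / 2, center + width / 2
    return np.clip((stack - lo) / (hi - lo), 0.0, 1.0)

rng = np.random.default_rng(0)
volume = rng.integers(-1000, 400, size=(5, 8, 8))   # fake CT volume for the demo
sample = make_sample(volume, z=2, rng=rng)
print(sample.shape)  # (3, 8, 8)
```

The same window is applied to all three channels, so the relative density between neighbouring slices is preserved.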

3. Results & Discussion

3.1 Breaking the Entropy Wall

The most critical finding: center momentum (cm) must be high enough to allow symmetry-breaking.

Center Momentum	2K Loss	10K Ratio	Trajectory
0.9	9.00	4.0 ↓	Permanently stuck
0.95	9.00	—	Stuck
0.99	9.01	—	Stuck
0.999	5.76	18.0	Breaks through ✓

At cm $\le$ 0.99, the center vector adapts too quickly, erasing emerging structure. At 0.999, the update is slow enough for meaningful clusters to form.
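The center is an exponential moving average of teacher outputs, as in the original DINO recipe. The toy loop below shows how quickly it chases a new batch mean at each momentum value; the dimensions and the 100-step horizon are arbitrary choices for illustration.

```python
import numpy as np

def update_center(center, teacher_logits, momentum):
    """DINO-style EMA: center <- m * center + (1 - m) * batch mean."""
    return momentum * center + (1.0 - momentum) * teacher_logits.mean(axis=0)

# How far does the center move toward a new batch mean after 100 steps?
for m in (0.9, 0.99, 0.999):
    center = np.zeros(4)
    batch = np.ones((8, 4))
    for _ in range(100):
        center = update_center(center, batch, m)
    print(m, round(float(center[0]), 3))
```

After 100 steps the center has closed a fraction $1 - m^{100}$ of the gap: essentially all of it at 0.9, about 63% at 0.99, and under 10% at 0.999.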

3.2 Augmentation: The ColorJitter Trap

Intensity variations in CT distinguish pathologies (e.g., ground-glass opacity vs. solid nodule). Teaching invariance to intensity (via ColorJitter) teaches the model to ignore the signal.

Augmentation	Ratio (10K)	Relative
Spatial only (RRC + HFlip)	49.7	baseline
+ ColorJitter	25.0	−50%

3.3 Scaling Behavior

We track the learning trajectory of ViT-Small over 100K steps:

Steps	Loss	Ratio	Top-1	Phase
2K	5.76	6	0.29%	Breakout
20K	0.44	311	15.2%	Rapid Learning
100K	0.23	1,032	25.2%	Diminishing Returns

3.4 Capacity-Dependent Regularization

ViT-Large requires explicit uniformity enforcement. Without KoLeo regularization (we use $\lambda_{koleo}=0.1$), it solves the pretext task via representation collapse.

Configuration	100K Loss	100K Ratio	Status
ViT-L, no KoLeo	0.0004	4	Collapsed
ViT-L, with KoLeo	0.27	500	Healthy
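For reference, the KoLeo regularizer penalizes small nearest-neighbour distances within a batch. The NumPy sketch below L2-normalizes the features first, which is a common choice and an assumption here; `koleo_loss` and `eps` are illustrative names.

```python
import numpy as np

def koleo_loss(features, eps=1e-8):
    """features: [B, D]. Returns -mean(log(nearest-neighbour distance))."""
    x = features / (np.linalg.norm(features, axis=-1, keepdims=True) + eps)
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)   # [B, B]
    np.fill_diagonal(dists, np.inf)          # ignore each point's distance to itself
    nearest = dists.min(axis=1)
    return float(-np.log(nearest + eps).mean())

spread = np.eye(4)                                   # well-separated embeddings
collapsed = np.ones((4, 4)) + 1e-4 * np.eye(4)       # nearly identical embeddings
print(koleo_loss(spread) < koleo_loss(collapsed))    # True
```

A collapsed batch has near-zero nearest-neighbour distances, so the loss spikes and pushes the embeddings apart.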

3.5 Clinical Utility: Malignancy Probing

We evaluated frozen backbones on malignancy classification (430 malignant, 1,665 benign nodules).

Model	Feature Type	AUC-ROC
ViT-S 100K	Avg patch tokens	0.687
ViT-L 100K	CLS token	0.668
ViT-S 100K	CLS token	0.663
Supervised ResNet18	—	0.767

Negative Result: Aggregating features across Z-slices (3D mean pooling) actually decreased AUC (0.687 $\to$ 0.650). True 3D awareness likely requires architectural changes like volumetric patch tokens, not post-hoc aggregation.

3.6 Resolution: 224 vs 448

At matched step counts, 448px resolution performed 7× worse in retrieval ratio while being 2.7× slower. For this dataset scale, 224px is the pragmatic choice.

4. Representation Analysis

Metric	ViT-S 100K	ViT-L 100K
Active dims (std > 0.01)	384/384	1024/1024
Pairwise cosine sim	0.887	0.865
Class separation	0.003	0.002

Both models use all embedding dimensions, confirming KoLeo prevents dimensional collapse. High pairwise similarity is expected for the single-domain lung CT data.

5. The Infrastructure: 64 Experiments for $35.60

All of this — 64 GPU experiments across two hardware platforms — was orchestrated by Ratiocinator, an autonomous LLM-driven research pipeline.

Ratiocinator handles the full lifecycle of the experimental campaign: it provisions Vast.ai RTX 4090 instances, deploys code via Git, installs dependencies, runs training, collects metrics, and tears down instances. By treating the research process itself as a distributed systems problem, Ratiocinator proved that high-velocity architectural ablation (including the diagnosis of the “Entropy Wall”) doesn’t require a massive compute budget — just ruthless pipeline optimization.

Experiment Set	Arms	GPU-hours	Approx. Cost
Loss & CM sweeps	10	11	$3.70
Augmentation ablations	11	9	$3.00
ViT-Small Scaling	6	28	$9.40
ViT-Large Scaling	9	24	$8.80
Clinical Probes	28	33	$10.70
Total	64	105	~$35.60

6. Practical Recipes

The ViT-Small Recipe (Recommended)

training:
  loss: dino + gram + koleo
  center_momentum: 0.999  # CRITICAL
  ema: 0.996
  lr: 2e-4
  batch_size: 64
  steps: 50,000–100,000

augmentation:
  - RandomResizedCrop(224, scale=(0.5, 1.0))
  - RandomHorizontalFlip()
  # NO ColorJitter, NO GaussianBlur

data:
  channels: 3  # (z-1, z, z+1)
  windowing: random HU window per sample

Common Pitfalls

Pitfall	Symptom	Fix
cm too low (< 0.999)	Loss stuck at ~9.0	Set cm=0.999
ColorJitter	~50% ratio drop	Remove intensity aug
ViT-L without KoLeo	Loss $\to$ 0, Ratio $\to$ 1	Add koleo_weight=0.1
ViT-L with high LR	Oscillating loss	Use lr=5e-5

Try It Yourself

The complete training framework, experiment configurations, and pre-trained weights are available:

💻 Code: DINO-X — Lung CT SSL training framework
🤖 Orchestrator: Ratiocinator — The autonomous experiment runner
📊 Dataset: LIDC-IDRI on the Hub (PNG version)

# Clone the repository
git clone https://github.com/timlawrenz/DINO-X
cd DINO-X

# Run the optimized ViT-Small recipe
python train.py --config configs/vit_small_medical.yaml --center_momentum 0.999

References

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision, 9650–9660.
Sablayrolles, A., Douze, M., Schmid, C., & Jégou, H. (2019). Spreading vectors for similarity search. International Conference on Learning Representations.
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., & others. (2023). DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research.
Meta AI. (2025). DINOv3: Self-supervised learning for vision at unprecedented scale. arXiv preprint.

The Literal Value Bottleneck: Why GNN Autoencoders Fail at Code Generation

2026-04-20T00:00:00+00:00

We trained GNN autoencoders on 22,000 Ruby ASTs. The models achieved 81% node type accuracy and 99.5% type diversity, yet generated exactly 0% syntactically valid code. Here is why.

Graph Neural Networks seem like a natural fit for code — after all, Abstract Syntax Trees are graphs. If GNNs can learn molecular structures well enough to generate valid drug candidates, surely they can learn code structure well enough to generate valid programs?

We ran 51 GPU experiments across five GNN architectures, three decoder strategies, four hidden dimensions, and three loss functions to find out. The answer is a definitive no — but the reason is not what we expected.

The Setup: 22,452 Ruby ASTs with 74-Dimensional Features
GNNs Do Learn Code Structure
The Generation Catastrophe: 0% Across the Board
The Core Discovery: The Literal Value Bottleneck
Without Structure, Everything Collapses
Dimension Doesn’t Matter, Depth Does (Slightly)
The Infrastructure: 51 Experiments for $4.32
What Would It Take to Fix This?
Try It Yourself
References

The Setup: 22,452 Ruby ASTs with 74-Dimensional Features

We parsed 22,452 Ruby methods from open-source repositories into AST graphs. Each node gets a 74-dimensional one-hot feature vector encoding its AST type — one of 73 known types (def, send, args, lvar, str, …) plus a single UNKNOWN token. Literal values — method names, variable names, string contents, numeric values — are stripped of their content and mapped to UNKNOWN.

The dataset is split 85/15 into training (19,084) and validation (3,368) sets.

📦 Dataset: timlawrenz/gnn-ruby-code-study on the Hub.

GNNs Do Learn Code Structure

Before diving into the failure, let’s establish that GNNs genuinely learn meaningful representations of code.

For cyclomatic complexity prediction (a graph-level regression task), we compared five architectures: GCN, GraphSAGE, GAT, GIN, and GraphConv. The results are clear:

Architecture	Layers	Val MAE ↓	R²
SAGE	5	4.018	0.709
GIN	3	4.589	0.629
GAT (wide)	3	4.662	0.612
SAGE (baseline)	3	4.782	0.635
GCN	3	5.321	0.563

A 5-layer GraphSAGE achieves R² = 0.71, explaining 71% of the variance in cyclomatic complexity. That’s a 16% improvement over the 3-layer baseline — and it’s 9.9σ significant based on 18 replicate runs (σ = 0.073).

Two patterns jump out:

Depth dominates width. Going from 3 to 5 layers improves MAE by 16%. Doubling the hidden dimension from 64 to 128? Zero improvement. For ASTs with depths of 10–30, deeper networks capture cross-branch dependencies that directly relate to complexity.
GIN punches above its weight. GIN’s injective sum aggregation — which preserves the full multiset of neighbor features — gives it a 4% edge over SAGE at equal depth. This is the Weisfeiler-Leman advantage in practice.

So the models clearly learn the graph structure. The question is: can they reconstruct it?

The Generation Catastrophe: 0% Across the Board

We trained graph autoencoders to encode an AST into a latent vector and decode it back. We tried everything:

5 architectures: GCN, SAGE, GAT, GIN, GraphConv
3 loss functions: simple (node type CE), improved (+ parent prediction), comprehensive
3 decoder edge modes: chain (sequential), teacher-forced (ground-truth edges), iterative (predicted edges)
4 capacity settings: hidden dimensions of 128, 256, and 512, plus a deep 5-layer variant

Every single configuration produces 0% syntactically valid Ruby.

Decoder Conv	Hidden Dim	Loss	Syntax Validity
GAT	256	improved	0%
SAGE	256	improved	0%
GIN	256	improved	0%
GCN	256	improved	0%
GIN (teacher-forced, 5-layer)	256	improved	0%

Validation loss converges (to ~3.8 with teacher forcing), so the models are learning something. But what?

The Core Discovery: The Literal Value Bottleneck

Here’s where it gets interesting. When we gave our best model — a 5-layer teacher-forced GIN decoder — the ground-truth tree structure and only asked it to predict node types, it achieved:

81% node type accuracy
99.5% type diversity (8.6 unique types per sample)
0% syntax validity

How can a model be 81% accurate and produce 0% valid code?

Look at this Ruby method:

def call(storage)
  new(storage).call
end

Its AST contains 12 elements:

Node	Ground Truth	Predicted	Match
0	`def`	`def`	✓
1	`"call"` → UNKNOWN	UNKNOWN	✓
2	`args`	`args`	✓
3	`arg`	`arg`	✓
4	`"storage"` → UNKNOWN	UNKNOWN	✓
5	`send`	`send`	✓
6	`send`	`send`	✓
7	`nil` → UNKNOWN	UNKNOWN	✓
8	`"new"` → UNKNOWN	UNKNOWN	✓
9	`lvar`	`lvar`	✓
10	`"storage"` → UNKNOWN	UNKNOWN	✓
11	`"call"` → UNKNOWN	UNKNOWN	✓

12/12 correct. 100% accuracy. The model perfectly reconstructs the AST skeleton. But 6 of those 12 nodes are UNKNOWN — they are literal values (method names call and new, variable name storage, and a nil sentinel) that were encoded as the undifferentiated UNKNOWN token. The model correctly predicts UNKNOWN for all of them, which is technically right but utterly useless — the actual string content that makes the code meaningful is irrecoverable.

When we analyzed 500 validation samples, the numbers were stark:

Category	Percentage
Typed AST nodes (`def`, `send`, `args`, …)	53.2%
Literal values (identifiers, strings, numbers)	46.8%

Nearly half of every AST is literal values. Method names, variable names, string contents, numeric literals — all collapsed into a single UNKNOWN token. No amount of architectural tweaking, loss function engineering, or hidden dimension scaling can recover information that was never encoded.

This is the literal value bottleneck: the failure isn’t in model capacity or architecture. It’s in the input representation itself.
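The information loss is easy to reproduce. The toy encoder below uses a six-entry vocabulary instead of the study's 73 types (an abbreviation for illustration) and shows that two different identifiers map to the same feature vector, so no decoder can tell them apart.

```python
KNOWN_TYPES = ["def", "send", "args", "arg", "lvar", "str"]  # the study uses 73
UNKNOWN = len(KNOWN_TYPES)                                   # index of the UNKNOWN slot

def encode(token):
    """One-hot encode an AST node; anything outside the vocabulary is UNKNOWN."""
    vec = [0] * (len(KNOWN_TYPES) + 1)
    idx = KNOWN_TYPES.index(token) if token in KNOWN_TYPES else UNKNOWN
    vec[idx] = 1
    return vec

print(encode("def"))                        # [1, 0, 0, 0, 0, 0, 0]
print(encode("call") == encode("storage"))  # True: the names are indistinguishable
```

Predicting UNKNOWN for both nodes counts as two correct predictions, yet neither `call` nor `storage` can be written back into the source.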

Without Structure, Everything Collapses

The bottleneck becomes even more apparent when we remove teacher forcing. Without ground-truth edges, the chain decoder (which connects nodes sequentially, destroying the tree topology) exhibits catastrophic mode collapse:

92.7% of all predicted tokens are UNKNOWN
Only def (3.6%) and send (3.0%) appear as alternatives
Average unique types per sample drops from 8.6 to 1.6
Type accuracy plummets from 81% to 48%

The model learns a degenerate strategy: predict the most common token (which happens to be UNKNOWN, since 47% of the ground truth is UNKNOWN) and call it a day.

Teacher forcing fixes the structural component (restoring type accuracy to 81%), but the lexical component — the literal value bottleneck — remains.

Dimension Doesn’t Matter, Depth Does (Slightly)

One striking result: hidden dimensions of 128, 256, and 512 produce nearly identical outcomes:

Config	Hidden Dim	Type Accuracy	Heuristic Validity
tf-gin-128	128	81.4%	97.0%
tf-gin-256	256	81.3%	97.0%
tf-gin-512	512	81.8%	96.5%
tf-gin-256-deep	256 (5 layers)	81.1%	99.5%

More capacity doesn’t help. The bottleneck is information-theoretic, not computational. Going deeper (5 layers) nudges heuristic validity from 97% to 99.5%, consistent with the depth-over-width finding from complexity prediction — but the syntax validity needle doesn’t move from 0%.

The Infrastructure: 51 Experiments for $4.32

All of this — 51 GPU experiments across two hardware platforms — was orchestrated by Ratiocinator, an autonomous LLM-driven research pipeline.

Ratiocinator handles the full lifecycle: it provisions Vast.ai RTX 4090 instances, deploys code via Git, installs dependencies, runs training, collects metrics, and tears down instances. The 18 baseline replicates that gave us our variance estimate? Those came from Ratiocinator accidentally running the same configuration three times due to an environment variable bug — which turned into a useful statistical gift.

Experiment Set	Arms	Hardware	Cost
Autonomous research (3 iterations)	24	RTX 4090 (Vast.ai)	$1.50
Architecture comparison	8	RTX 4090 (Vast.ai)	$1.20
Generation analysis	7	RTX 4090 (Vast.ai)	$0.80
Decoder topology	6	RTX 4090 (Vast.ai)	$0.70
GIN deep dive	5	RTX 2070 SUPER (local)	~$0.10
Total	51		~$4.32

A 51-experiment ablation study for the price of a latte. By treating the research process itself as a distributed systems problem, Ratiocinator proves that high-velocity architectural ablation doesn’t require a massive compute budget, just ruthless pipeline optimization.

What Would It Take to Fix This?

Our findings point to three concrete directions:

Literal value prediction heads. Add separate output heads for identifier names (via a vocabulary or copy mechanism), string contents, and numeric values. The structural decoder already works — it’s the lexical reconstruction that’s missing.
Hybrid architectures. Use GNN encoders for structural understanding, but pair them with autoregressive or grammar-constrained decoders for sequential output. The GNN captures the shape of the code; a sequential decoder fills in the content.
Pointer networks / copy mechanisms. Let the decoder point back to nodes in the input graph to copy identifier names, rather than generating them from scratch. This is analogous to copy mechanisms in summarization models.

The fact that GNNs achieve R² = 0.71 for complexity prediction proves they learn meaningful code representations. The challenge is building decoders that can reconstruct the full richness of code — not just its structural skeleton — from those representations.

Try It Yourself

The full dataset, pre-trained models, and experiment configurations are available:

📊 Dataset: timlawrenz/gnn-ruby-code-study (22,452 Ruby methods with ASTs)
📄 Paper: Full research paper with all tables and appendices
💻 Code: jubilant-palm-tree (branch: experiment/ratiocinator-gnn-study)
🤖 Orchestrator: Ratiocinator — the autonomous experiment runner

# Reproduce the key result
git clone -b experiment/ratiocinator-gnn-study https://github.com/timlawrenz/jubilant-palm-tree
cd jubilant-palm-tree
pip install torch torchvision torch_geometric

# Complexity prediction (the success story)
python train.py --conv_type SAGE --num_layers 5 --epochs 50

# Autoencoder with teacher-forced GIN (the 81%-accurate failure)
python train_autoencoder.py --decoder_conv_type GIN --decoder_edge_mode teacher_forced --epochs 30

If you use this dataset or findings, please cite:

@misc{lawrenz2025gnnruby,
  title={Graph Neural Networks for Ruby Code Complexity Prediction and Generation:
         A Systematic Architecture Study},
  author={Tim Lawrenz},
  year={2026},
  howpublished={\url{https://huggingface.co/datasets/timlawrenz/gnn-ruby-code-study}}
}

References

Lawrenz, T. (2026). Graph Neural Networks for Ruby Code Complexity Prediction and Generation: A Systematic Architecture Study. https://huggingface.co/datasets/timlawrenz/gnn-ruby-code-study