<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://lawrenz.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://lawrenz.com/" rel="alternate" type="text/html" /><updated>2026-06-05T00:31:43+00:00</updated><id>https://lawrenz.com/feed.xml</id><title type="html">Tim Lawrenz</title><subtitle>Scientific articles on machine learning, graph neural networks, and software engineering.</subtitle><entry><title type="html">Training a Pixel-Space DiT in 26 Hours: FP8 Breakthroughs and Architectural Dead Ends</title><link href="https://lawrenz.com/machine%20learning/diffusion%20models/optimization/2026/06/02/pixel-space-dit-in-26-hours-fp8-breakthroughs.html" rel="alternate" type="text/html" title="Training a Pixel-Space DiT in 26 Hours: FP8 Breakthroughs and Architectural Dead Ends" /><published>2026-06-02T21:00:00+00:00</published><updated>2026-06-02T21:00:00+00:00</updated><id>https://lawrenz.com/machine%20learning/diffusion%20models/optimization/2026/06/02/pixel-space-dit-in-26-hours-fp8-breakthroughs</id><content type="html" xml:base="https://lawrenz.com/machine%20learning/diffusion%20models/optimization/2026/06/02/pixel-space-dit-in-26-hours-fp8-breakthroughs.html"><![CDATA[<p>Following our integration of <a href="/2026/05/23/prx-tg-accelerating-pixel-space-diffusion-with-asymmetric-flow-matching.html">Asymmetric Flow Matching</a>, our 400M parameter NanoDiT was training efficiently in terms of <em>step-count convergence</em>, but it was hitting an <code class="language-plaintext highlighter-rouge">iter_per_sec</code> of <code class="language-plaintext highlighter-rouge">0.025</code> on our single RTX 4090. A full 5,000-step ablation cycle required 56 hours of active compute.</p>

<h2 id="the-bottleneck">The Bottleneck</h2>

<p>Before scaling the dataset from our 7k curated subset to the full 70k+ pipeline, we needed a faster iteration loop. Our goal was to halve the active compute time either via lower precision (FP8) or parameter reduction (shared spatial modulation).</p>

<p>Here is what worked—and what catastrophically failed.</p>

<hr />

<h2 id="1-the-fp8-memory-illusion--breakthrough">1. The FP8 Memory Illusion &amp; Breakthrough</h2>

<p>Native FP8 via <code class="language-plaintext highlighter-rouge">torchao</code> theoretically promises roughly double the tensor core throughput and half the memory traffic of BF16. However, our initial naïve implementation failed to complete a single forward/backward pass within our 24GB VRAM budget, instantly triggering out-of-memory (OOM) errors at batch sizes where BF16 comfortably fit.</p>

<h3 id="the-trap-dynamic-scale-state-overhead">The Trap: Dynamic Scale State Overhead</h3>
<p>Why did an 8-bit format consume <em>more</em> memory than a 16-bit format? The answer lies in how <code class="language-plaintext highlighter-rouge">torchao</code> handles dynamic tensor casting: the framework allocates scale state for every linear layer and continuously tracks rolling history maxima during the forward pass. This metadata overhead, combined with the wrapper casting operations, drastically inflated the memory footprint and erased the bandwidth savings.</p>

<h3 id="the-fix-dynamic-tensor-masking--scoped-autocast">The Fix: Dynamic Tensor Masking &amp; Scoped Autocast</h3>
<p>To fit the model back into 24GB VRAM while preserving the FP8 throughput, we had to tame the scale-state overhead. We implemented strict dynamic tensor masking and optimized autocast scoping so that FP8 conversion was tightly localized to the heaviest matrix multiplications (the core attention and FFN projections), bypassing the metadata overhead for the rest of the network.</p>
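
<p>As a rough illustration of scoping FP8 to the heaviest projections, the sketch below decides per layer whether a linear module should be converted. The module-name fragments and the <code class="language-plaintext highlighter-rouge">should_use_fp8</code> helper are hypothetical and not taken from the prx-tg code; the divisible-by-16 check reflects a common alignment requirement of FP8 matmul kernels. A predicate of this shape could be adapted to a per-module filter hook such as the <code class="language-plaintext highlighter-rouge">module_filter_fn</code> argument of torchao's <code class="language-plaintext highlighter-rouge">convert_to_float8_training</code>.</p>

```python
# Hypothetical name fragments for the heavy projections; adjust to the real model.
HEAVY_PROJECTION_KEYS = ("attn.qkv", "attn.proj", "mlp.fc1", "mlp.fc2")


def should_use_fp8(fqn: str, in_features: int, out_features: int) -> bool:
    """Return True only for attention/FFN linears whose shapes suit FP8 matmuls."""
    is_heavy = any(key in fqn for key in HEAVY_PROJECTION_KEYS)
    # FP8 tensor-core kernels generally expect both dimensions divisible by 16.
    aligned = in_features % 16 == 0 and out_features % 16 == 0
    return is_heavy and aligned


# Illustrative layer inventory: (fully qualified name, in_features, out_features).
layers = [
    ("blocks.3.attn.qkv", 768, 2304),
    ("blocks.3.mlp.fc1", 768, 3072),
    ("blocks.3.adaLN.proj", 768, 4608),
    ("final_layer.linear", 768, 770),
]
fp8_layers = [name for name, n_in, n_out in layers if should_use_fp8(name, n_in, n_out)]
print(fp8_layers)
```

<p>Everything the predicate rejects (modulation, embeddings, the output head) simply stays in BF16, so no scale state is allocated for it.</p>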

<h3 id="the-results-2x-speedup">The Results: &gt;2x Speedup</h3>

<p>By resolving the overhead, <strong>active compute time plummeted from 56 hours to 26 hours</strong> (<code class="language-plaintext highlighter-rouge">iter_per_sec</code> increased from <code class="language-plaintext highlighter-rouge">0.025</code> to <code class="language-plaintext highlighter-rouge">0.053</code>).</p>
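
<p>The hour figures follow directly from the reported <code class="language-plaintext highlighter-rouge">iter_per_sec</code> values. The short check below redoes that arithmetic for a 5,000-step cycle; the helper name is ours.</p>

```python
def hours_for_steps(steps: int, iter_per_sec: float) -> float:
    """Wall-clock hours needed to run `steps` iterations at a given rate."""
    return steps / iter_per_sec / 3600.0


bf16_hours = hours_for_steps(5000, 0.025)
fp8_hours = hours_for_steps(5000, 0.053)
speedup = bf16_hours / fp8_hours

print(f"BF16: {bf16_hours:.1f} h, FP8: {fp8_hours:.1f} h, speedup: {speedup:.2f}x")
```

<p>This gives about 55.6 hours for BF16 and about 26.2 hours for FP8, which round to the 56 and 26 hours quoted above.</p>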

<p>Crucially, the perceptual quality held up beautifully against BF16. Below are the final LPIPS metrics at step 5000 comparing Arm G (BF16 AsymFlow) against Arm I (FP8 Native):</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Arm G (BF16)</th>
      <th>Arm I (FP8)</th>
      <th>Delta</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Reconstruction LPIPS</strong></td>
      <td>0.900</td>
      <td>0.906</td>
      <td>+0.006 (Negligible)</td>
    </tr>
    <tr>
      <td><strong>Text-only LPIPS</strong></td>
      <td>0.920</td>
      <td>0.909</td>
      <td>-0.011 (Better)</td>
    </tr>
    <tr>
      <td><strong>Text Manip Delta</strong></td>
      <td>0.485</td>
      <td>0.504</td>
      <td>+0.019 (Better)</td>
    </tr>
  </tbody>
</table>

<p><em>Note: Lower LPIPS is better for perceptual similarity. Higher Text Manip Delta indicates stronger text-controllability.</em></p>

<h3 id="visual-comparisons">Visual Comparisons</h3>

<p><strong>Reconstruction Fidelity (Conditioned on Identity + Text):</strong>
<em>(Arm G on left, Arm I on right)</em>
<img src="/assets/img/arm_g_vs_i_recon_step5000.png" alt="Reconstruction Comparison" style="width: 100%;" /></p>

<p><strong>Text-Only Controllability (Conditioned purely on Text prompt):</strong>
<em>(Arm G on left, Arm I on right)</em>
<img src="/assets/img/arm_g_vs_i_text_step5000.png" alt="Text-only Comparison" style="width: 100%;" /></p>

<p>The FP8 model achieves equivalent perceptual quality in less than half the time.</p>

<hr />

<h2 id="2-the-shared-adaln--lora-collapse">2. The Shared adaLN + LoRA Collapse</h2>

<p>While working on compute optimization, we also explored parameter reduction.</p>

<p><strong>The Hypothesis:</strong> Our DiT utilizes an <code class="language-plaintext highlighter-rouge">adaLN</code> (Adaptive Layer Normalization) projection in every transformer block to inject the timestep and spatial conditioning signals. What if we shared a single central <code class="language-plaintext highlighter-rouge">adaLN</code> projection across all layers, and relied on a lightweight per-block rank-8 LoRA to handle block-specific spatial localization? This would save 59M parameters (dropping the model from 237.7M to 178.8M non-embedding parameters).</p>
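
<p>A back-of-the-envelope count shows where a saving of this size could come from. The sketch assumes the 768-wide, 18-layer NanoDiT described earlier in this series, a conditioning vector of the same width, and the six modulation vectors per block used by standard DiT blocks; none of these are confirmed for Arm H, so only the order of magnitude is meaningful.</p>

```python
def adaln_params(hidden: int, n_mod: int = 6) -> int:
    """Weights plus biases of one linear adaLN projection: hidden -> n_mod * hidden."""
    return hidden * (n_mod * hidden) + n_mod * hidden


def lora_params(hidden: int, rank: int, n_mod: int = 6) -> int:
    """Down- and up-projection of one rank-`rank` adapter on that linear layer."""
    return rank * hidden + rank * (n_mod * hidden)


hidden, layers, rank = 768, 18, 8  # assumed dimensions, see the note above

per_block_total = layers * adaln_params(hidden)
shared_total = adaln_params(hidden) + layers * lora_params(hidden, rank)
saved = per_block_total - shared_total

print(f"per-block adaLN: {per_block_total / 1e6:.1f}M")
print(f"shared + LoRA:   {shared_total / 1e6:.1f}M")
print(f"saved:           {saved / 1e6:.1f}M")
```

<p>Under these assumptions the saving comes out at roughly 59.5M parameters, in the same range as the 59M reported here.</p>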

<p><strong>The Reality:</strong> The model (Arm H) completely failed to converge. Reconstruction LPIPS stalled at <code class="language-plaintext highlighter-rouge">0.9823</code>, and the output degraded into structural noise and visual dithering by step 5000.</p>

<p><strong>The Lesson:</strong> Full-rank, independent per-block modulation is structurally load-bearing in a Diffusion Transformer. You cannot compress the spatial/semantic conditioning pathway without destroying the model’s ability to localize features across different depths of the network. The 59M parameter savings were simply not worth the catastrophic quality collapse.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>The architecture is now stabilized, perceptually verified, and fast. The ablation phase is formally closed. With our pipeline executing 5k steps in just 26 hours, we are ready for the “big run”—scaling the dataset, extending the training horizon, and deploying bucket-aware batching.</p>]]></content><author><name></name></author><category term="Machine Learning" /><category term="Diffusion Models" /><category term="Optimization" /><category term="DiT" /><category term="FP8" /><category term="torchao" /><category term="Muon" /><category term="PyTorch" /><category term="prx-tg" /><summary type="html"><![CDATA[Following our integration of Asymmetric Flow Matching, our 400M parameter NanoDiT was training efficiently in terms of step-count convergence, but it was hitting an iter_per_sec of 0.025 on our single RTX 4090. A full 5,000-step ablation cycle required 56 hours of active compute.]]></summary></entry><entry><title type="html">prx-tg: Accelerating Pixel-Space Diffusion with Asymmetric Flow Matching</title><link href="https://lawrenz.com/2026/05/23/prx-tg-accelerating-pixel-space-diffusion-with-asymmetric-flow-matching.html" rel="alternate" type="text/html" title="prx-tg: Accelerating Pixel-Space Diffusion with Asymmetric Flow Matching" /><published>2026-05-23T00:00:00+00:00</published><updated>2026-05-23T00:00:00+00:00</updated><id>https://lawrenz.com/2026/05/23/prx-tg-accelerating-pixel-space-diffusion-with-asymmetric-flow-matching</id><content type="html" xml:base="https://lawrenz.com/2026/05/23/prx-tg-accelerating-pixel-space-diffusion-with-asymmetric-flow-matching.html"><![CDATA[<p>The evolution of text-to-image synthesis is currently undergoing a profound architectural realignment. For years, the dominant paradigm relied on Variational Autoencoders (VAEs) to compress high-dimensional pixel data into a mathematically tractable, lower-dimensional latent space. 
But as we strip away the VAE in pursuit of lossless, native-resolution pixel prediction, we collide with the brutal reality of uncompressed feature spaces.</p>

<ul id="markdown-toc">
  <li><a href="#the-reality-of-pixel-space-diffusion-the-average-blob-phenomenon" id="markdown-toc-the-reality-of-pixel-space-diffusion-the-average-blob-phenomenon">The Reality of Pixel-Space Diffusion: The “Average Blob” Phenomenon</a></li>
  <li><a href="#breaking-the-bottleneck-asymmetric-flow-matching-arm-g" id="markdown-toc-breaking-the-bottleneck-asymmetric-flow-matching-arm-g">Breaking the Bottleneck: Asymmetric Flow Matching (Arm G)</a>    <ul>
      <li><a href="#quantitative-results-step-5000" id="markdown-toc-quantitative-results-step-5000">Quantitative Results (Step 5000)</a></li>
      <li><a href="#qualitative-results" id="markdown-toc-qualitative-results">Qualitative Results</a></li>
    </ul>
  </li>
  <li><a href="#the-next-hurdle-data-starvation-and-compute-scaling" id="markdown-toc-the-next-hurdle-data-starvation-and-compute-scaling">The Next Hurdle: Data Starvation and Compute Scaling</a></li>
  <li><a href="#references" id="markdown-toc-references">References</a></li>
</ul>

<p>In my previous post on the <code class="language-plaintext highlighter-rouge">prx-tg</code> architecture, we established a baseline NanoDiT (768 hidden, 18 layers) capable of training in raw pixel-space using a single 24GB RTX 4090. Today, we confront the “Average Blob” phenomenon—the mathematical bottleneck of VAE-less diffusion—and explore how Asymmetric Flow Matching (Arm G) allows us to break through it.</p>

<h2 id="the-reality-of-pixel-space-diffusion-the-average-blob-phenomenon">The Reality of Pixel-Space Diffusion: The “Average Blob” Phenomenon</h2>

<p>While removing the VAE yields theoretical advantages in high-frequency detail preservation, our recent ablations on the <code class="language-plaintext highlighter-rouge">prx-tg</code> baseline (Arm D) revealed a harsh reality of pure pixel-space training.</p>

<p>In Latent Diffusion Models (LDMs), the VAE decoder acts as an aesthetic crutch. It forcefully maps noisy or misaligned latent representations back into a “photorealistic” texture manifold. You get eyes, skin textures, and hair details almost for free, even very early in the training process.</p>

<p>In a latent-free setup predicting raw RGB pixels, the model has to earn every single high-frequency detail from scratch. Because the training uses a Mean Squared Error (MSE) loss on raw pixels, the model mathematically minimizes its loss early on by predicting a smooth, blurry “average” color blob wherever it is uncertain about high-frequency placement (like the exact boundary of an iris or individual hair strands).</p>
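
<p>A one-pixel toy example makes the averaging effect concrete. Suppose the pixel at an uncertain edge is dark in half of the plausible images and bright in the other half. The values below are invented for illustration; the search simply finds the single prediction with the lowest MSE.</p>

```python
def mse(pred: float, targets: list) -> float:
    """Mean squared error of one constant prediction against several targets."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)


# Equally likely ground truths for the same pixel: dark (0.0) or bright (1.0).
targets = [0.0, 1.0, 0.0, 1.0]

candidates = [i / 100 for i in range(101)]
best = min(candidates, key=lambda p: mse(p, targets))

print(f"best prediction: {best}, loss: {mse(best, targets)}")
print(f"loss when committing to a sharp edge: {mse(0.0, targets)}")
```

<p>The loss-minimizing answer is the mid-gray mean, not either sharp value, which is exactly the blur the model falls back to early in training.</p>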

<p>At 5,000 steps (processing ~1.28 million images with an effective batch size of 256 over our 7k curated FFHQ image dataset), the baseline Arm D model successfully learned macro-composition—placing the head in the right spot with the correct colors—but completely failed to resolve facial features, leaving the outputs blocky and emotionless. To push past this phase without wasting thousands of GPU hours, the model must be forced to care about micro-structure earlier.</p>

<h2 id="breaking-the-bottleneck-asymmetric-flow-matching-arm-g">Breaking the Bottleneck: Asymmetric Flow Matching (Arm G)</h2>

<p>To address this delayed convergence, we recently conducted an ablation (<strong>Arm G</strong>) implementing Asymmetric Flow Matching (AsymFlow).</p>

<p>Standard flow matching architectures learn a vector field that transports a simple base distribution (e.g., Gaussian noise) to a complex data distribution (pixels). However, predicting the full-dimensional noise across all sequence steps introduces massive variance in the gradient updates.</p>

<p>Instead of predicting full-dimensional noise directly, AsymFlow computes the target on a lower-dimensional PCA-based subspace for the noise component: <code class="language-plaintext highlighter-rouge">target = P @ noise - x_0</code>. By isolating the target prediction to an optimized linear subspace (in our case, rank 8), the model receives a much cleaner, less chaotic gradient signal during backpropagation.</p>
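
<p>The formula can be sketched in a few lines. The example below assumes an orthonormal basis for the subspace is already available (for instance from a PCA fit, which is not shown) and builds the projection from it; it uses a rank-2 basis in four dimensions instead of the rank-8 subspace mentioned above. How <code class="language-plaintext highlighter-rouge">P</code> is actually derived in prx-tg is not specified here, so treat the names and shapes as illustrative.</p>

```python
def project(basis, vec):
    """Apply P = U U^T to `vec`, where the rows of `basis` are orthonormal."""
    out = [0.0] * len(vec)
    for u in basis:
        coeff = sum(ui * vi for ui, vi in zip(u, vec))
        for i, ui in enumerate(u):
            out[i] += coeff * ui
    return out


def asym_target(basis, noise, x0):
    """target = P @ noise - x_0, following the formula in the text."""
    projected = project(basis, noise)
    return [p - x for p, x in zip(projected, x0)]


# Toy rank-2 subspace of a 4-dimensional "pixel" space.
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
noise = [0.5, -1.0, 2.0, 0.3]
x0 = [0.1, 0.2, 0.3, 0.4]

target = asym_target(basis, noise, x0)
print(target)
```

<p>Noise components outside the subspace are dropped before the target is formed, so only the first two coordinates of the toy target still carry a noise term.</p>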

<h3 id="quantitative-results-step-5000">Quantitative Results (Step 5000)</h3>

<p>The results were immediately apparent. At the 5,000-step validation mark, Arm G achieved comparable reconstruction fidelity to our full-stack baseline Arm D (0.9379 vs 0.9267) while outperforming it on text-only generation quality (0.9141 vs 0.9219).</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Metric</th>
      <th style="text-align: left">Arm D (Baseline)</th>
      <th style="text-align: left">Arm G (Asym Flow)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>Reconstruction LPIPS</strong></td>
      <td style="text-align: left"><strong>0.9267</strong></td>
      <td style="text-align: left">0.9379</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Text-Only Gen LPIPS</strong></td>
      <td style="text-align: left">0.9219</td>
      <td style="text-align: left"><strong>0.9141</strong></td>
    </tr>
  </tbody>
</table>

<p><em>Note: Lower LPIPS (Learned Perceptual Image Patch Similarity) indicates better perceptual quality and structural fidelity.</em></p>

<h3 id="qualitative-results">Qualitative Results</h3>

<p>Visually, the AsymFlow model produced noticeably crisper outputs. The micro-contrast improved dramatically—skin textures felt less “painted” and lighting highlights resolved with higher fidelity compared to the baseline’s haze.</p>

<p>Most importantly, Arm G demonstrated far better structural identity preservation when undergoing strong text-based manipulations. In our validation suite, we feed the model an original image’s DINOv3 embeddings alongside a heavily modified text prompt (e.g., changing “slender” to “muscular”). Arm D frequently warped the core identity or introduced artifacts around the neck and collar when applying the edit. Arm G handled these transformations smoothly, achieving a text manipulation LPIPS difference of <code class="language-plaintext highlighter-rouge">0.4665</code> while keeping the subject’s base geometry intact.</p>

<h2 id="the-next-hurdle-data-starvation-and-compute-scaling">The Next Hurdle: Data Starvation and Compute Scaling</h2>

<p>While Asymmetric Flow Matching successfully accelerated convergence and bridged the gap between raw pixel geometry and texture, it cannot magically solve data starvation.</p>

<p>Our 400M parameter model confined to a 7,000 image dataset is rapidly approaching memorization. At 5,000 steps, it has seen the same rigid dataset over 180 times. To achieve true production quality, we must scale our dataset to our 70k target and push training into the 40,000+ step regime.</p>
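
<p>The repetition count is simple arithmetic on the numbers already given (5,000 steps, effective batch size 256, 7,000 images). The second calculation applies the same batch size to the 70k, 40,000-step target, which is an assumption on our part.</p>

```python
def epochs_seen(steps: int, batch_size: int, dataset_size: int) -> float:
    """How many times each image is seen on average."""
    return steps * batch_size / dataset_size


images_seen = 5000 * 256
current_epochs = epochs_seen(5000, 256, 7000)
# Assumes the effective batch size stays at 256 for the scaled-up run.
target_epochs = epochs_seen(40000, 256, 70000)

print(f"images processed: {images_seen:,}")
print(f"passes over the 7k set: {current_epochs:.1f}")
print(f"passes over a 70k set at 40k steps: {target_epochs:.1f}")
```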

<p>However, running 40,000 steps at ~41 seconds per iteration on a single 4090 poses a massive compute barrier (roughly 19 days of continuous training). Before launching this production run, we must pursue further radical optimizations to compress the compute time.</p>

<p>In the next part of this series, we will explore replacing the standard Transformer blocks with Mamba-3 sequence modeling to linearize pixel space, and deploying Shared-MLP Timestep Modulation via LoRA to radically slash our FLOP budget.</p>

<h2 id="references">References</h2>
<ol class="bibliography"></ol>]]></content><author><name>timlawrenz</name></author><category term="prx-tg" /><category term="diffusion" /><category term="machine-learning" /><category term="research" /><summary type="html"><![CDATA[The evolution of text-to-image synthesis is currently undergoing a profound architectural realignment. For years, the dominant paradigm relied on Variational Autoencoders (VAEs) to compress high-dimensional pixel data into a mathematically tractable, lower-dimensional latent space. But as we strip away the VAE in pursuit of lossless, native-resolution pixel prediction, we collide with the brutal reality of uncompressed feature spaces.]]></summary></entry><entry><title type="html">Training a Portrait DiT on a Single GPU: What the Ablation Study Taught Us</title><link href="https://lawrenz.com/2026/05/13/training-a-portrait-dit-on-a-single-gpu.html" rel="alternate" type="text/html" title="Training a Portrait DiT on a Single GPU: What the Ablation Study Taught Us" /><published>2026-05-13T00:00:00+00:00</published><updated>2026-05-13T00:00:00+00:00</updated><id>https://lawrenz.com/2026/05/13/training-a-portrait-dit-on-a-single-gpu</id><content type="html" xml:base="https://lawrenz.com/2026/05/13/training-a-portrait-dit-on-a-single-gpu.html"><![CDATA[<p>The prevailing assumption in generative AI is that training a large, multi-modal Diffusion Transformer from scratch requires a cluster. prx-tg is a direct challenge to that assumption: a 400M+ parameter DiT for 1024×1024 portrait generation, trained entirely on a single consumer NVIDIA RTX 4090 with 24GB of VRAM, conditioned on text, identity, spatial layout, and pose simultaneously. We just completed the first systematic ablation study of its core training innovations, and the results are worth sharing in detail, including one finding we did not expect.</p>

<ul id="markdown-toc">
  <li><a href="#what-we-are-building" id="markdown-toc-what-we-are-building">What We Are Building</a></li>
  <li><a href="#the-ablation-design" id="markdown-toc-the-ablation-design">The Ablation Design</a></li>
  <li><a href="#results" id="markdown-toc-results">Results</a>    <ul>
      <li><a href="#final-checkpoint-step-5000" id="markdown-toc-final-checkpoint-step-5000">Final Checkpoint (Step 5000)</a></li>
      <li><a href="#what-we-did-not-expect-adamwtread-instability" id="markdown-toc-what-we-did-not-expect-adamwtread-instability">What We Did Not Expect: AdamW+TREAD Instability</a></li>
      <li><a href="#muon-as-the-fix" id="markdown-toc-muon-as-the-fix">Muon as the Fix</a></li>
      <li><a href="#full-stack-as-the-production-target" id="markdown-toc-full-stack-as-the-production-target">Full Stack as the Production Target</a></li>
    </ul>
  </li>
  <li><a href="#the-traps-ahead" id="markdown-toc-the-traps-ahead">The Traps Ahead</a></li>
  <li><a href="#whats-next" id="markdown-toc-whats-next">What’s Next</a></li>
  <li><a href="#code-and-data" id="markdown-toc-code-and-data">Code and Data</a></li>
  <li><a href="#references" id="markdown-toc-references">References</a></li>
</ul>

<h2 id="what-we-are-building">What We Are Building</h2>

<p>prx-tg is a portrait generation model built on a NanoDiT backbone <a class="citation" href="#peebles2023scalable">(Peebles &amp; Xie, 2023)</a> operating directly in pixel space, patchifying RGB images into a sequence of tokens rather than relying on a VAE latent bottleneck. The model is “quad-conditioned”: cross-attention layers simultaneously receive dense text captions processed by CLIP and T5, visual identity embeddings from DINOv3 (utilizing patch-level tokens), spatial layout maps, and DWPose skeletal keypoints. The goal is controllable generation — given a reference identity and a description of pose, lighting, and appearance, generate a plausible, photorealistic portrait.</p>

<p>Training a model of this scope on 24GB of VRAM is not possible without careful engineering. Gradient checkpointing drops all intermediate activations and recomputes them on the backward pass, trading a 20–30% speed penalty for a massive memory reduction. The T5 encoder alone consumes over 10GB of VRAM to process captions; a dedicated cleanup routine migrates it to CPU immediately after embeddings are cached, freeing the GPU before the DiT backward pass. Affine biases are stripped from QKV projections and FFN hidden layers — mathematically redundant under LayerNorm, and worth 5–10% of total memory. Positional embeddings are computed dynamically from latent tensor dimensions rather than stored as static buffers, enabling multi-resolution training without padding or fixed-shape assumptions.</p>

<p>Data augmentation and preprocessing run through <a href="https://github.com/timlawrenz/stratum-hq">stratum-hq</a>. Horizontal flip augmentation was explicitly excluded: for a model conditioned on DWPose keypoints, flipping pixel data without remapping symmetric landmark indices (left eye ↔ right eye, left shoulder ↔ right shoulder) corrupts the cross-attention binding between spatial tokens and text tokens. The FFHQ dataset provides sufficient orientation diversity without flips.</p>
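
<p>For reference, a flip that keeps pose conditioning consistent would need to mirror the x coordinates and swap the left/right landmark labels together. The sketch below shows that pairing on a handful of named keypoints; the landmark names and the dictionary layout are hypothetical, and a real DWPose layout has many more points.</p>

```python
# Hypothetical landmark names; only symmetric pairs need an entry.
SWAP_PAIRS = {
    "left_eye": "right_eye",
    "right_eye": "left_eye",
    "left_shoulder": "right_shoulder",
    "right_shoulder": "left_shoulder",
}


def flip_keypoints(keypoints: dict, image_width: int) -> dict:
    """Mirror pixel x coordinates and swap left/right labels in one step."""
    flipped = {}
    for name, (x, y) in keypoints.items():
        new_name = SWAP_PAIRS.get(name, name)  # unpaired points keep their name
        flipped[new_name] = (image_width - 1 - x, y)
    return flipped


keypoints = {
    "nose": (511.0, 300.0),
    "left_eye": (600.0, 250.0),
    "right_eye": (420.0, 250.0),
}
print(flip_keypoints(keypoints, image_width=1024))
```

<p>Mirroring the pixels while skipping the label swap would leave the point tagged as the left eye on the wrong side of the face, which is the corruption described above.</p>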

<h2 id="the-ablation-design">The Ablation Design</h2>

<p>We trained four arms for 5,000 steps each, all on the same physical quad-GPU Vast.ai node with GPU assignment pinned via <code class="language-plaintext highlighter-rouge">CUDA_VISIBLE_DEVICES</code>. Running every arm on the same hardware eliminates variance from GPU-to-GPU silicon differences — an often underappreciated confound in ablation studies that share results across separately provisioned machines.</p>

<table>
  <thead>
    <tr>
      <th>Arm</th>
      <th>Optimizer</th>
      <th>TREAD</th>
      <th>Loss Formulation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>A — Baseline</td>
      <td>AdamW</td>
      <td>Off</td>
      <td>Standard flow-matching</td>
    </tr>
    <tr>
      <td>B — TREAD+AdamW</td>
      <td>AdamW</td>
      <td>On</td>
      <td>Standard flow-matching</td>
    </tr>
    <tr>
      <td>C — TREAD+Muon</td>
      <td>Muon</td>
      <td>On</td>
      <td>Standard flow-matching</td>
    </tr>
    <tr>
      <td>D — Full Stack</td>
      <td>Muon</td>
      <td>On</td>
      <td>Flow-matching + REPA</td>
    </tr>
  </tbody>
</table>

<p><strong>TREAD</strong> (Token Routing for Efficient Architecture-agnostic Diffusion Training) probabilistically routes up to 50% of tokens around intermediate attention and feed-forward blocks. Tokens are extracted at an early layer and reinjected near the output, bypassing the bulk of the network’s compute. The theoretical promise is a direct reduction in FLOPs for those bypassed tokens, and because bypassed tokens still contribute to the loss, early layers receive a gradient signal from late-stage objectives — a form of pseudo-deep supervision.</p>
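
<p>The routing idea can be sketched without any deep-learning framework. In the toy version below, tokens are plain numbers and each stage is a function over a list of tokens; a fixed fraction is chosen at random to skip the middle stage and is merged back before the late stage. The function names, the fixed bypass ratio, and the merge strategy are simplifying assumptions, not the TREAD reference implementation.</p>

```python
import random


def tread_forward(tokens, early, middle, late, bypass_ratio=0.5, seed=0):
    """Run early -> middle -> late, letting a random subset skip `middle`."""
    rng = random.Random(seed)
    x = early(tokens)

    n_bypass = int(len(x) * bypass_ratio)
    bypass_idx = set(rng.sample(range(len(x)), n_bypass))
    kept_idx = [i for i in range(len(x)) if i not in bypass_idx]

    # Only the kept tokens pay for the expensive middle blocks.
    processed = middle([x[i] for i in kept_idx])

    merged = list(x)  # bypassed tokens keep their early-layer values
    for i, value in zip(kept_idx, processed):
        merged[i] = value
    return late(merged), sorted(bypass_idx)


def identity(xs):
    return list(xs)


def add_ten(xs):
    return [v + 10.0 for v in xs]


tokens = [float(i) for i in range(8)]
out, bypassed = tread_forward(tokens, identity, add_ten, identity)
print(out, bypassed)
```

<p>With a ratio of 0.5, the middle stage sees half the sequence, which is where the FLOP saving comes from; every token still reaches the late stage and therefore the loss.</p>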

<p><strong>Muon</strong> <a class="citation" href="#jordan2024muon">(Jordan et al., 2024)</a> is a spectral optimizer that applies orthogonalized Nesterov momentum via a Newton-Schulz polynomial iteration, producing update matrices that converge toward the nearest orthogonal matrix. Unlike AdamW’s per-parameter scalar moment estimation, Muon enforces a uniform update magnitude across each weight matrix. As a practical bonus, Muon’s single FP32 momentum buffer costs 4 bytes per parameter versus AdamW’s 8 (two buffers), reducing optimizer state memory by 50%, which is meaningful at this hardware budget.</p>
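
<p>To give a feel for the orthogonalization step, the sketch below runs the classic cubic Newton-Schulz iteration on a small matrix in plain Python. This is a textbook variant chosen for clarity; Muon itself uses a tuned polynomial and runs on the momentum buffer, neither of which is reproduced here.</p>

```python
def matmul(a, b):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def newton_schulz_orthogonalize(g, steps=20):
    """Iterate X = 1.5 X - 0.5 X X^T X after scaling G to unit Frobenius norm."""
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


gradient = [[3.0, 1.0], [-1.0, 2.0]]
update = newton_schulz_orthogonalize(gradient)
gram = matmul(update, transpose(update))  # should be close to the identity
print(update)
print(gram)
```

<p>After the iteration every singular value of the update is close to one, which is the sense in which the step size is uniform across the whole weight matrix.</p>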

<p><strong>REPA</strong> (Representation Alignment) <a class="citation" href="#yu2024representation">(Yu et al., 2024)</a> augments the flow-matching objective with an alignment penalty between the DiT’s intermediate hidden states and DINOv2’s semantic representations, forcing the generative student to internalize the teacher’s structure. Because this adds a second term to the loss with a different scale, Arm D’s raw loss values are not comparable to A, B, or C. LPIPS comparisons across all arms remain valid.</p>

<h2 id="results">Results</h2>

<h3 id="final-checkpoint-step-5000">Final Checkpoint (Step 5000)</h3>

<table>
  <thead>
    <tr>
      <th>Arm</th>
      <th>Recon LPIPS ↓</th>
      <th>Text LPIPS ↓</th>
      <th>Text Manip delta ↑</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>A — Baseline</td>
      <td>0.9352</td>
      <td>0.9593</td>
      <td>0.466</td>
    </tr>
    <tr>
      <td>B — TREAD+AdamW</td>
      <td>1.0161</td>
      <td>0.9396</td>
      <td>0.373</td>
    </tr>
    <tr>
      <td>C — TREAD+Muon</td>
      <td>0.9463</td>
      <td>0.9603</td>
      <td><strong>0.546</strong></td>
    </tr>
    <tr>
      <td>D — Full Stack</td>
      <td><strong>0.9267</strong></td>
      <td><strong>0.9219</strong></td>
      <td>0.431</td>
    </tr>
  </tbody>
</table>

<p><em>Recon LPIPS</em>: reconstruction fidelity given full conditioning (identity + text), 25 samples. <em>Text LPIPS</em>: generation quality given text only, 20 samples. <em>Text Manip delta</em>: mean absolute LPIPS difference between generations for a caption and a single-attribute edit (e.g., “dark hair” → “light hair”) — a measure of how decisively the model responds to text.</p>

<p><img src="/assets/img/posts/ablation_text_lpips.png" alt="Text LPIPS across training steps for all four arms. Lower is better. Arm D (Full Stack, solid teal) leads from step 500 onward." style="width: 100%;" /></p>

<p>All TREAD arms (B, C, D) trained approximately 17% faster in wall-clock time: ~95h versus ~112h for the baseline. At equivalent step budgets this is a direct reduction in future experiment cost.</p>

<h3 id="what-we-did-not-expect-adamwtread-instability">What We Did Not Expect: AdamW+TREAD Instability</h3>

<p>Arm B’s result requires a post-mortem. It achieved its best reconstruction at step 3000 (Recon LPIPS 0.906 — briefly the best of any arm) and then collapsed monotonically to 1.016 by step 5000, a value exceeding 1.0, meaning the model performs worse than a trivial baseline on reconstruction at its final checkpoint.</p>

<p>The collapse is not sudden. It begins around step 3500 and degrades progressively — which is why we did not catch it early. A prior independent run showed the same pattern, confirming this is reproducible behavior rather than a stochastic outlier.</p>

<p>The mechanism is a mathematical incompatibility between AdamW’s adaptive moment estimation and TREAD’s dynamic spatial sparsity. TREAD routes tokens around intermediate blocks, so those blocks receive sparse, irregular gradient signals over thousands of iterations. AdamW interprets near-zero gradients as low-variance parameters and decays their second-moment estimates accordingly. This inflates the adaptive learning rate for those “starved” weights. When a high-frequency token is eventually routed through a starved block, the resulting gradient is multiplied by the inflated rate and produces a divergent update that shatters the block’s representations. The failure accumulates gradually and then becomes catastrophic.</p>
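
<p>The inflation effect can be reproduced on a single scalar parameter. The sketch below implements the standard Adam moment updates with bias correction (weight decay omitted, since it does not affect the adaptive scaling) and compares the final step size for a parameter with a steady gradient against one that was starved of gradient for thousands of steps. The gradient sequences are invented for illustration and are not taken from the Arm B logs.</p>

```python
import math


def final_adam_step(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the size of the last Adam update for a scalar parameter."""
    m = v = 0.0
    step = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        step = lr * m_hat / (math.sqrt(v_hat) + eps)
    return step


# Parameter that receives a gradient of 1.0 on every step.
dense_step = final_adam_step([1.0] * 10000)

# Parameter that is bypassed for ~9,900 steps, then suddenly hit by the same gradient.
starved_step = final_adam_step([1.0] * 100 + [0.0] * 9899 + [1.0])

ratio = starved_step / dense_step
print(f"dense: {dense_step:.6f}, starved: {starved_step:.6f}, ratio: {ratio:.2f}")
```

<p>With default betas, the starved parameter takes a step roughly three times larger for the same gradient, because its second-moment estimate has decayed to almost nothing.</p>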

<p>This is not a deficiency in TREAD itself. It is a fundamental incompatibility between per-parameter scalar moment estimation and dynamic spatial routing. <strong>Do not use TREAD with AdamW for long runs.</strong></p>

<h3 id="muon-as-the-fix">Muon as the Fix</h3>

<p>Arm C demonstrates the resolution. Muon’s orthogonalized updates enforce a fixed spectral norm across the entire weight matrix, not per-parameter scaling. There are no “starved” parameters — every weight receives a geometrically uniform step. The TREAD-induced sparsity pattern becomes irrelevant because the optimizer is not accumulating per-parameter learning rate history in a way that can diverge.</p>

<p>The result: Arm C’s Recon LPIPS (0.946) is 0.070 points better than Arm B’s final collapse, within 0.011 of the stable baseline (Arm A), with the full 17% throughput gain intact. And its Text Manipulation delta (0.546) is the highest of any arm — Muon’s isotropic updates appear to promote stronger, more decisive binding between text token activations and output features. For a model where the primary use case is text-driven portrait control, this matters.</p>

<h3 id="full-stack-as-the-production-target">Full Stack as the Production Target</h3>

<p>Arm D (TREAD + Muon + REPA) achieves the best metrics across both dimensions: Recon LPIPS 0.927, Text LPIPS 0.922. The REPA loss accelerates early semantic acquisition — Arm D’s Text LPIPS broke below 0.90 by step 500, while other arms reached comparable values much later. Muon’s stability allowed the model to reach final convergence without the instabilities that would accompany the modified dual-objective loss under AdamW.</p>

<p><img src="/assets/img/posts/ablation_progression_armD.png" alt="Arm D training progression — text-only generations at steps 500 through 5000 on a fixed prompt. Label bars darken from muted teal (early) to deep teal (final) to track progress." style="width: 100%;" /></p>

<p>The following collage shows text-only outputs from all four arms at their final checkpoint (step 5000), using the same evaluation prompt. Arm D’s output consistently shows stronger structural coherence and finer detail.</p>

<p><img src="/assets/img/posts/ablation_collage_final.png" alt="Text-only generations from all four arms at step 5000. Each arm uses the same prompt; label bar colors match the chart legend." style="width: 100%;" /></p>

<h2 id="the-traps-ahead">The Traps Ahead</h2>

<p>Completing the study also clarified several failure modes we need to address for production-scale training.</p>

<p><strong>REPA termination.</strong> DINOv3 is a discriminative model operating in a lower-dimensional embedding space optimized for classification and dense feature matching. It discards high-frequency textural variance — pores, hair strands, skin texture — that photorealism requires. In the burn-in phase, REPA’s alignment penalty is genuinely helpful: it pulls the DiT out of its initial chaotic state. Beyond that, the teacher’s embeddings become a constraint, penalizing the generator for synthesizing details that don’t exist in the teacher’s feature maps. The HASTE framework describes this as the “works until it doesn’t” trap. <strong>For production runs, the REPA alignment weight should be decayed to zero by approximately step 1000–1500</strong> (the first 20–30% of a 5000-step run), then let the model converge on unconstrained flow-matching alone. Our current 5000-step study ran REPA to completion — the metrics are still the best of any arm, but we likely left quality on the table.</p>

<p><strong>Pixel scaling.</strong> When processing RGB data directly without a VAE bottleneck, images must be scaled correctly into the <code class="language-plaintext highlighter-rouge">[−1, 1]</code> range expected by the diffusion process. Currently, the dataloader yields <code class="language-plaintext highlighter-rouge">[0, 1]</code> RGB pixels, which slightly biases the flow-matching objective. Correcting the pixel normalization pipeline is a prerequisite for reliable convergence at scale.</p>
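
<p>The correction itself is a single affine map. A minimal sketch on plain Python lists follows; the helper names are ours, and in the real dataloader the same expression would be applied to the image tensor.</p>

```python
def to_signed_range(pixels):
    """Map pixel values from [0, 1] to [-1, 1]."""
    return [2.0 * p - 1.0 for p in pixels]


def to_unit_range(pixels):
    """Inverse map, used when converting model output back to images."""
    return [(p + 1.0) / 2.0 for p in pixels]


row = [0.0, 0.25, 0.5, 1.0]
signed = to_signed_range(row)
print(signed)
print(to_unit_range(signed))
```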

<p><strong>Spatial evaluation.</strong> LPIPS measures perceptual texture similarity and broad structural alignment. It cannot verify whether the generated pose matches the DWPose conditioning input. A model can generate a photorealistic face (excellent LPIPS) while completely ignoring the jaw angle or shoulder position specified by the spatial condition. The next iteration needs MPJPE (Mean Per Joint Position Error) in the validation loop — specifically PA-MPJPE (Procrustes-Aligned MPJPE), which isolates structural accuracy from rotational and scale variance — to prove that the DiT’s cross-attention mechanisms actually bind visual output to spatial conditions.</p>
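
<p>For 2D keypoints the Procrustes alignment has a closed form, so both metrics fit in a short sketch. The version below assumes 2D joints given as coordinate pairs and allows translation, rotation, and uniform scale (no reflection); 3D poses would need an SVD-based solve instead. Function names and the sample skeleton are illustrative.</p>

```python
import math


def mpjpe(pred, gt):
    """Mean Euclidean distance between corresponding joints."""
    return sum(math.dist(p, q) for p, q in zip(pred, gt)) / len(gt)


def _center(points):
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    return [(p[0] - cx, p[1] - cy) for p in points], (cx, cy)


def pa_mpjpe_2d(pred, gt):
    """MPJPE after the best similarity transform of `pred` onto `gt`."""
    p, _ = _center(pred)
    q, q_mean = _center(gt)
    dot = sum(px * qx + py * qy for (px, py), (qx, qy) in zip(p, q))
    cross = sum(px * qy - py * qx for (px, py), (qx, qy) in zip(p, q))
    theta = math.atan2(cross, dot)  # optimal rotation angle
    scale = math.hypot(dot, cross) / sum(px * px + py * py for px, py in p)
    c, s = math.cos(theta), math.sin(theta)
    aligned = [(scale * (c * px - s * py) + q_mean[0],
                scale * (s * px + c * py) + q_mean[1]) for px, py in p]
    return mpjpe(aligned, gt)


gt = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0)]
angle, factor, shift = math.radians(30), 1.7, (5.0, -2.0)
# Same pose as `gt`, but rotated, scaled, and shifted.
pred = [(factor * (math.cos(angle) * x - math.sin(angle) * y) + shift[0],
         factor * (math.sin(angle) * x + math.cos(angle) * y) + shift[1]) for x, y in gt]

print(f"MPJPE: {mpjpe(pred, gt):.3f}, PA-MPJPE: {pa_mpjpe_2d(pred, gt):.3e}")
```

<p>The raw MPJPE is large because the skeleton is displaced, while PA-MPJPE is effectively zero because the pose itself is identical.</p>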

<h2 id="whats-next">What’s Next</h2>

<p>The ablation clears the path for the next phase of prx-tg development. The production training configuration is Full Stack (Arm D) with REPA loss decay implemented from the start. The immediate engineering priorities are:</p>

<ol>
  <li><strong>Implement REPA warmdown scheduling</strong> — decay the alignment weight to zero by step ~1250 for a 5000-step run, or proportionally for longer budgets.</li>
  <li><strong>Pixel normalization pipeline</strong>: ensure RGB tensors are rescaled from <code class="language-plaintext highlighter-rouge">[0, 1]</code> to the zero-centered <code class="language-plaintext highlighter-rouge">[−1, 1]</code> range before DiT input.</li>
  <li><strong>MPJPE/PA-MPJPE validation</strong> — instrument the validation loop with a second-stage pose estimator to measure spatial controllability quantitatively.</li>
  <li><strong>Longer runs</strong> — the 5000-step study was designed to isolate optimizer dynamics under controlled conditions. Production-quality generation at 1024×1024 will require substantially more steps. The 17% throughput gain from TREAD directly compounds the value of every future training hour.</li>
</ol>

<p>The study confirms that the engineering hypothesis holds: state-of-the-art multi-modal generation at 1024×1024 is trainable on a single consumer GPU. It does not require a cluster — it requires careful memory engineering, the right optimizer for the architecture, and disciplined ablation to understand what fails and why.</p>

<h2 id="code-and-data">Code and Data</h2>

<p>The full ablation write-up, per-checkpoint metrics, and arm configurations are in the repository:</p>

<ul>
  <li><strong>prx-tg</strong>: <a href="https://github.com/timlawrenz/prx-tg">github.com/timlawrenz/prx-tg</a> — model, training code, ablation docs</li>
  <li><strong>stratum-hq</strong>: <a href="https://github.com/timlawrenz/stratum-hq">github.com/timlawrenz/stratum-hq</a> — data ingestion, preprocessing, augmentation pipeline</li>
  <li><strong>Ratiocinator</strong>: <a href="https://github.com/timlawrenz/ratiocinator">github.com/timlawrenz/ratiocinator</a> — the autonomous experiment runner that provisioned and monitored the ablation</li>
</ul>

<h2 id="references">References</h2>

<ol class="bibliography"><li><span id="peebles2023scalable">Peebles, W., &amp; Xie, S. (2023). Scalable Diffusion Models with Transformers. <i>Proceedings of the IEEE/CVF International Conference on Computer Vision</i>.</span></li>
<li><span id="jordan2024muon">Jordan, K., &amp; others. (2024). Muon: An Optimizer for Hidden Layers in Neural Networks. <i>ArXiv Preprint</i>.</span></li>
<li><span id="yu2024representation">Yu, S., Jin, S., Lee, J., Kim, J., &amp; Shin, J. (2024). Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think. <i>ArXiv Preprint ArXiv:2410.06940</i>.</span></li></ol>]]></content><author><name>{&quot;user&quot;=&gt;&quot;timlawrenz&quot;}</name></author><category term="diffusion-transformers" /><category term="portrait-generation" /><category term="muon-optimizer" /><category term="tread" /><category term="ablation-study" /><category term="research" /><category term="prx-tg" /><summary type="html"><![CDATA[The prevailing assumption in generative AI is that training a large, multi-modal Diffusion Transformer from scratch requires a cluster. prx-tg is a direct challenge to that assumption: a 400M+ parameter DiT for 1024×1024 portrait generation, trained entirely on a single consumer NVIDIA RTX 4090 with 24GB of VRAM, conditioned on text, identity, spatial layout, and pose simultaneously. We just completed the first systematic ablation study of its core training innovations, and the results are worth sharing in detail — including one finding we did not expect.]]></summary></entry><entry><title type="html">Eradicating Syntax: Building a Neural Universal Machine That Executes Graphs, Not Code</title><link href="https://lawrenz.com/2026/05/05/eradicating-syntax-the-neural-universal-machine.html" rel="alternate" type="text/html" title="Eradicating Syntax: Building a Neural Universal Machine That Executes Graphs, Not Code" /><published>2026-05-05T00:00:00+00:00</published><updated>2026-05-05T00:00:00+00:00</updated><id>https://lawrenz.com/2026/05/05/eradicating-syntax-the-neural-universal-machine</id><content type="html" xml:base="https://lawrenz.com/2026/05/05/eradicating-syntax-the-neural-universal-machine.html"><![CDATA[<p>Our GNN autoencoders achieved 81% node accuracy on Ruby ASTs yet produced 0% valid code. 
The culprit was the <strong>literal value bottleneck</strong> — nearly half of every AST consisted of names and values that were irrecoverable from the structural encoding. Rather than patch the representation, we asked a more radical question: what if AI never generated human-readable code at all?</p>

<p>This post documents the pivot from patching GNN decoders to building a <strong>Neural Universal Machine</strong> — a system where a Diffusion Transformer generates executable Directed Acyclic Graphs directly, bypassing programming language syntax entirely. We validate the approach end-to-end: from a working graph-walk interpreter that computes Fibonacci(10), through a 12.3× vocabulary compression pipeline, to a Permuted Dense DiT that achieves <strong>100% Syntactic Validity</strong> on 128-node execution graphs.</p>

<ul id="markdown-toc">
  <li><a href="#the-insight-why-generate-text-at-all" id="markdown-toc-the-insight-why-generate-text-at-all">The Insight: Why Generate Text at All?</a></li>
  <li><a href="#the-execution-engine-a-graph-walk-interpreter" id="markdown-toc-the-execution-engine-a-graph-walk-interpreter">The Execution Engine: A Graph-Walk Interpreter</a>    <ul>
      <li><a href="#the-six-universal-motifs" id="markdown-toc-the-six-universal-motifs">The Six Universal Motifs</a></li>
      <li><a href="#the-fibonacci-proof" id="markdown-toc-the-fibonacci-proof">The Fibonacci Proof</a></li>
    </ul>
  </li>
  <li><a href="#dataset-compression-74-dimensions--6-motifs" id="markdown-toc-dataset-compression-74-dimensions--6-motifs">Dataset Compression: 74 Dimensions → 6 Motifs</a>    <ul>
      <li><a href="#compression-results" id="markdown-toc-compression-results">Compression Results</a></li>
    </ul>
  </li>
  <li><a href="#the-generative-model-permuted-dense-dit" id="markdown-toc-the-generative-model-permuted-dense-dit">The Generative Model: Permuted Dense DiT</a>    <ul>
      <li><a href="#the-spatial-bias-trap" id="markdown-toc-the-spatial-bias-trap">The Spatial Bias Trap</a></li>
      <li><a href="#axial-attention-message-passing-in-matrix-form" id="markdown-toc-axial-attention-message-passing-in-matrix-form">Axial Attention: Message-Passing in Matrix Form</a></li>
      <li><a href="#hybrid-loss-flow-matching--classification" id="markdown-toc-hybrid-loss-flow-matching--classification">Hybrid Loss: Flow Matching + Classification</a></li>
      <li><a href="#hyperparameter-ablation" id="markdown-toc-hyperparameter-ablation">Hyperparameter Ablation</a></li>
    </ul>
  </li>
  <li><a href="#the-validation-harness-5-laws-of-physics" id="markdown-toc-the-validation-harness-5-laws-of-physics">The Validation Harness: 5 Laws of Physics</a>    <ul>
      <li><a href="#law-1-execution-out-degree" id="markdown-toc-law-1-execution-out-degree">Law 1: Execution Out-Degree</a></li>
      <li><a href="#law-2-data-in-degree-arity" id="markdown-toc-law-2-data-in-degree-arity">Law 2: Data In-Degree (Arity)</a></li>
      <li><a href="#law-3-no-orphans-reachability" id="markdown-toc-law-3-no-orphans-reachability">Law 3: No Orphans (Reachability)</a></li>
      <li><a href="#law-4-acyclic-data-plane" id="markdown-toc-law-4-acyclic-data-plane">Law 4: Acyclic Data Plane</a></li>
      <li><a href="#law-5-terminal-sink" id="markdown-toc-law-5-terminal-sink">Law 5: Terminal Sink</a></li>
      <li><a href="#the-breakthrough-100-svr-at-128-nodes" id="markdown-toc-the-breakthrough-100-svr-at-128-nodes">The Breakthrough: 100% SVR at 128 Nodes</a></li>
    </ul>
  </li>
  <li><a href="#three-branches-of-government" id="markdown-toc-three-branches-of-government">Three Branches of Government</a></li>
  <li><a href="#whats-next-rlaif-for-deterministic-perfection" id="markdown-toc-whats-next-rlaif-for-deterministic-perfection">What’s Next: RLAIF for Deterministic Perfection</a></li>
  <li><a href="#try-it-yourself" id="markdown-toc-try-it-yourself">Try It Yourself</a></li>
  <li><a href="#references" id="markdown-toc-references">References</a></li>
</ul>

<h2 id="the-insight-why-generate-text-at-all">The Insight: Why Generate Text at All?</h2>

<p>The current paradigm of AI code generation is a translation bottleneck: a high-dimensional neural network collapses its probabilistic understanding into a linear, human-readable string of characters, which a compiler then immediately parses <em>back</em> into a multi-dimensional graph to execute.</p>

<p>If we remove the human from the loop, a “programming language” designed natively for AI should not be a language at all. It should be a mathematical specification for a DAG.</p>

<p>Four core principles drive the new architecture:</p>

<ol>
  <li><strong>The Death of Variable Names.</strong> Variable names are human mnemonics. The AI-native language relies entirely on directed edges — data dependencies are pure topological routing.</li>
  <li><strong>Eradication of Syntax Sugar.</strong> No parentheses, no brackets, no formatting. The “code” is saved as a sparse adjacency matrix and a minimal feature vector.</li>
  <li><strong>Execution by Graph-Walk.</strong> The matrix is not compiled or parsed; it is traversed directly by a minimal graph-walking interpreter.</li>
  <li><strong>Guaranteed Syntax.</strong> Because the model generates graph topology rather than stringing text together, textual “syntax errors” cannot occur; validity reduces to a small set of graph constraints that can be checked deterministically.</li>
</ol>

<h2 id="the-execution-engine-a-graph-walk-interpreter">The Execution Engine: A Graph-Walk Interpreter</h2>

<p>Before training any generative model, we needed to prove that pure topological matrices are Turing-complete. We built a minimal virtual machine that executes graphs directly.</p>

<h3 id="the-six-universal-motifs">The Six Universal Motifs</h3>

<p>Drawing from the Böhm-Jacopini theorem <a class="citation" href="#bohm1966flow">(Böhm &amp; Jacopini, 1966)</a>, we define exactly six node types — sufficient to express any computable function:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Motif</th>
      <th style="text-align: left">Role</th>
      <th style="text-align: left">Execution Semantics</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">Boundary</code></td>
      <td style="text-align: left">Program entry/exit</td>
      <td style="text-align: left">Routes execution forward or halts</td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">Sequence</code></td>
      <td style="text-align: left">Linear execution</td>
      <td style="text-align: left">Passes control to the next node</td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">Condition</code></td>
      <td style="text-align: left">Boolean branching</td>
      <td style="text-align: left">Evaluates data input, routes to True (index 0) or False (index 1)</td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">Loop</code></td>
      <td style="text-align: left">Iteration</td>
      <td style="text-align: left">Like Condition, but the True path loops back</td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">State</code></td>
      <td style="text-align: left">Memory read/write</td>
      <td style="text-align: left">On the execution path: writes incoming data to memory</td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">Message</code></td>
      <td style="text-align: left">Function call / constant</td>
      <td style="text-align: left">Evaluates <code class="language-plaintext highlighter-rouge">+</code>, <code class="language-plaintext highlighter-rouge">&lt;</code>, <code class="language-plaintext highlighter-rouge">print</code>, or returns a literal</td>
    </tr>
  </tbody>
</table>

<p>Two edge types connect them: <strong>EXECUTION</strong> edges (control flow — “go here next”) and <strong>DATA</strong> edges (value flow — “use this as argument N”).</p>

<h3 id="the-fibonacci-proof">The Fibonacci Proof</h3>

<p>To validate the interpreter, we hand-constructed a 25-node execution graph equivalent to:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span>
<span class="k">while</span> <span class="n">count</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">:</span>
    <span class="n">temp</span> <span class="o">=</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span>
    <span class="n">a</span> <span class="o">=</span> <span class="n">b</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">temp</span>
    <span class="n">count</span> <span class="o">=</span> <span class="n">count</span> <span class="o">+</span> <span class="mi">1</span>
<span class="nf">print</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
</code></pre></div></div>

<p>The graph encodes this logic as pure topology — no variable names exist in the execution matrix. The <code class="language-plaintext highlighter-rouge">literal_pool</code> (a separate dictionary managed by the “Legislative Branch”) maps integer pointers to values: <code class="language-plaintext highlighter-rouge">{0: "a", 1: "b", 2: "count", 7: "+", 8: "&lt;", ...}</code>.</p>

<p>The interpreter drops an execution pointer onto the entry <code class="language-plaintext highlighter-rouge">Boundary</code> node, resolves data dependencies recursively through <code class="language-plaintext highlighter-rouge">DATA</code> edges, evaluates <code class="language-plaintext highlighter-rouge">Message</code> nodes via a minimal stdlib (<code class="language-plaintext highlighter-rouge">+</code>, <code class="language-plaintext highlighter-rouge">&lt;</code>, <code class="language-plaintext highlighter-rouge">-</code>, <code class="language-plaintext highlighter-rouge">print</code>), and routes through the <code class="language-plaintext highlighter-rouge">Loop</code> node’s boolean condition.</p>

<p class="tip"><strong>Result: <code class="language-plaintext highlighter-rouge">memory["a"] == 55</code>.</strong> The 10th Fibonacci number, computed natively via matrix traversal — no parser, no compiler, no syntax.</p>

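<p>This demonstrates two things: (1) the six Motifs cover sequencing, branching, iteration, and mutable state, the ingredients the Böhm-Jacopini construction relies on for Turing-completeness, and (2) a graph-walk interpreter can execute that logic from pure adjacency structure plus a constant pool.</p>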
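<p>To make the traversal concrete, here is a much-reduced sketch of a graph-walk interpreter running the same Fibonacci loop. The node encoding, the node ids, and the pool layout are assumptions made for illustration; they are not the repository's actual format, and the final <code class="language-plaintext highlighter-rouge">print</code> is omitted:</p>

```python
import operator

# Hypothetical encoding: id -> (motif, literal pointer, DATA sources, EXECUTION targets).
literal_pool = {0: "a", 1: "b", 2: "count", 3: "temp", 4: 0, 5: 1, 6: 10, 7: "+", 8: "<"}
STDLIB = {"+": operator.add, "<": operator.lt}

graph = {
    # Data plane: constants, memory reads, and operators.
    "zero": ("Message", 4, [], []), "one": ("Message", 5, [], []),
    "ten": ("Message", 6, [], []),
    "ra": ("State", 0, [], []), "rb": ("State", 1, [], []),
    "rc": ("State", 2, [], []), "rt": ("State", 3, [], []),
    "sum": ("Message", 7, ["ra", "rb"], []),    # a + b
    "inc": ("Message", 7, ["rc", "one"], []),   # count + 1
    "cmp": ("Message", 8, ["rc", "ten"], []),   # count < 10
    # Execution plane.
    "entry": ("Boundary", None, [], ["a0"]),
    "a0": ("State", 0, ["zero"], ["b0"]),       # a = 0
    "b0": ("State", 1, ["one"], ["c0"]),        # b = 1
    "c0": ("State", 2, ["zero"], ["loop"]),     # count = 0
    "loop": ("Loop", None, ["cmp"], ["t", "exit"]),  # index 0: True, index 1: False
    "t": ("State", 3, ["sum"], ["a1"]),         # temp = a + b
    "a1": ("State", 0, ["rb"], ["b1"]),         # a = b
    "b1": ("State", 1, ["rt"], ["c1"]),         # b = temp
    "c1": ("State", 2, ["inc"], ["loop"]),      # count = count + 1
    "exit": ("Boundary", None, [], []),
}

def evaluate(node_id, memory):
    """Resolve a value by following DATA edges recursively."""
    motif, ptr, data, _ = graph[node_id]
    if motif == "State":                  # off the execution path: memory read
        return memory[literal_pool[ptr]]
    if not data:                          # Message without inputs: a constant
        return literal_pool[ptr]
    args = [evaluate(src, memory) for src in data]
    return STDLIB[literal_pool[ptr]](*args)

def run(entry="entry", max_steps=10_000):
    """Walk EXECUTION edges from the entry Boundary until an exit is reached."""
    memory, current = {}, entry
    for _ in range(max_steps):
        motif, ptr, data, nxt = graph[current]
        if motif == "State":              # on the execution path: memory write
            memory[literal_pool[ptr]] = evaluate(data[0], memory)
        if motif == "Loop":
            current = nxt[0] if evaluate(data[0], memory) else nxt[1]
        elif nxt:
            current = nxt[0]
        else:
            return memory                 # exit Boundary: halt
    raise RuntimeError("step budget exhausted")
```

<p>Calling <code class="language-plaintext highlighter-rouge">run()</code> returns the final memory dictionary. Variable names appear only in the pool, never in the graph structure itself.</p>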

<h2 id="dataset-compression-74-dimensions--6-motifs">Dataset Compression: 74 Dimensions → 6 Motifs</h2>

<p>With the execution engine validated, we built a compression pipeline to transform our existing 22,452 Ruby ASTs into training data for the generative model.</p>

<p>The compressor (<code class="language-plaintext highlighter-rouge">scripts/dataset_prep/compress_ast.py</code>) performs three operations:</p>

<ol>
  <li>
    <p><strong>Motif Mapping.</strong> Every Ruby AST node type (73 unique types: <code class="language-plaintext highlighter-rouge">def</code>, <code class="language-plaintext highlighter-rouge">send</code>, <code class="language-plaintext highlighter-rouge">lvar</code>, <code class="language-plaintext highlighter-rouge">if</code>, <code class="language-plaintext highlighter-rouge">while</code>, …) is mapped to one of the 6 Universal Motifs via a deterministic lookup table.</p>
  </li>
  <li>
    <p><strong>Literal Extraction.</strong> Primitive children (strings, integers, floats, booleans) are stripped from the tree and collected into a deduplicated <code class="language-plaintext highlighter-rouge">literal_pool</code>. Each extracted value gets an integer pointer. The structural graph references these values only through pointer indices — the actual content lives entirely outside the topology.</p>
  </li>
  <li>
    <p><strong>Edge Re-Routing.</strong> <code class="language-plaintext highlighter-rouge">Sequence</code> and <code class="language-plaintext highlighter-rouge">Boundary</code> nodes chain their children via <code class="language-plaintext highlighter-rouge">EXECUTION</code> edges (control flow). Everything else (<code class="language-plaintext highlighter-rouge">Condition</code>, <code class="language-plaintext highlighter-rouge">Loop</code>, <code class="language-plaintext highlighter-rouge">Message</code>, <code class="language-plaintext highlighter-rouge">State</code>) connects children via <code class="language-plaintext highlighter-rouge">DATA</code> edges with positional <code class="language-plaintext highlighter-rouge">input_index</code> values.</p>
  </li>
</ol>
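<p>The three operations can be sketched on a toy AST made of nested tuples. The lookup table below is a small hypothetical excerpt, and chaining each child to the previous one is one possible reading of “chain their children”. This is not the code in <code class="language-plaintext highlighter-rouge">compress_ast.py</code>:</p>

```python
# Hypothetical excerpt of the lookup table; the real one covers every Ruby node type.
MOTIF_OF = {"def": "Boundary", "begin": "Sequence", "if": "Condition",
            "while": "Loop", "lvasgn": "State", "lvar": "State",
            "send": "Message", "int": "Message"}
CONTROL = {"Sequence", "Boundary"}

def compress(ast):
    """Turn a nested (type, *children) tuple into motif nodes, typed edges, and a pool."""
    nodes, edges, pool = [], [], {}

    def intern(value):
        key = (type(value).__name__, value)   # keeps 1 and True distinct
        return pool.setdefault(key, len(pool))

    def visit(node):
        node_type, *children = node
        node_id = len(nodes)
        motif = MOTIF_OF[node_type]           # 1. motif mapping
        pointers = []
        nodes.append((motif, pointers))
        previous, index = node_id, 0
        for child in children:
            if not isinstance(child, tuple):
                pointers.append(intern(child))    # 2. literal extraction
            elif motif in CONTROL:                # 3a. chain via EXECUTION edges
                child_id = visit(child)
                edges.append((previous, child_id, "EXECUTION", None))
                previous = child_id
            else:                                 # 3b. DATA edge with input_index
                edges.append((node_id, visit(child), "DATA", index))
                index += 1
        return node_id

    visit(ast)
    literal_pool = {ptr: key[1] for key, ptr in pool.items()}
    return nodes, edges, literal_pool
```

<p>For example, <code class="language-plaintext highlighter-rouge">compress(("begin", ("lvasgn", "x", ("send", ("int", 1), "+", ("int", 2))), ("lvar", "x")))</code> yields six motif nodes, and both references to <code class="language-plaintext highlighter-rouge">x</code> share one pool pointer.</p>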

<h3 id="compression-results">Compression Results</h3>

<p>Applied to a complex, real-world example (the 144-node <code class="language-plaintext highlighter-rouge">structure</code> method from the AWS Ruby SDK), the compressor produces:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Metric</th>
      <th style="text-align: center">Before (Ruby AST)</th>
      <th style="text-align: center">After (Motif Graph)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">Node vocabulary</td>
      <td style="text-align: center">74 types</td>
      <td style="text-align: center"><strong>6 types</strong></td>
    </tr>
    <tr>
      <td style="text-align: left">Literal values in graph</td>
      <td style="text-align: center">50 (embedded)</td>
      <td style="text-align: center"><strong>0</strong> (extracted to pool)</td>
    </tr>
    <tr>
      <td style="text-align: left">Edge types</td>
      <td style="text-align: center">Implicit (parent→child)</td>
      <td style="text-align: center"><strong>107 DATA + 36 EXECUTION</strong></td>
    </tr>
  </tbody>
</table>

<p>This is a <strong>12.3× reduction</strong> in structural vocabulary — from 74 dimensions of Ruby syntax noise down to 6 language-agnostic primitives. The literal value bottleneck that destroyed our GNN autoencoders is eliminated by construction: literals are no longer <em>in</em> the graph. They live in a separate constant pool managed by the “Legislative Branch” (an LLM).</p>

<p>The compressed dataset produces perfectly dense, low-dimensional matrices — ideal inputs for a Diffusion Transformer.</p>

<h2 id="the-generative-model-permuted-dense-dit">The Generative Model: Permuted Dense DiT</h2>

<p>With a Turing-complete execution engine and a compressed training set, we built a Diffusion Transformer to <em>generate</em> valid execution graphs from scratch.</p>

<h3 id="the-spatial-bias-trap">The Spatial Bias Trap</h3>

<p>Standard image and video DiTs (Stable Diffusion 3, Sora) use patch-based Vision Transformer blocks with positional encodings tied to pixel coordinates. Applied to an adjacency matrix, this teaches the network that Node 4 connects to Node 5 because they are “next to each other” spatially. But in a graph, node ordering is entirely arbitrary: adjacency is topological, not positional.</p>

<p>We solved this with two mechanisms:</p>

<p><strong>1. Node Permutation Augmentation.</strong> The DataLoader randomly shuffles node ordering every time a graph is fetched. The topological routing remains identical, but the matrix layout changes completely. Because no row or column position stays attached to a node from one fetch to the next, positional shortcuts stop paying off and the DiT is pushed to learn topological rules instead.</p>
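<p>A minimal sketch of the permutation step on plain nested lists; the function name and the return of the permutation itself are illustrative choices:</p>

```python
import random

def permute_graph(adjacency, motifs, rng=random):
    """Relabel nodes with a random permutation; edges follow their endpoints."""
    n = len(motifs)
    order = list(range(n))
    rng.shuffle(order)                        # order[new_index] = old_index
    new_motifs = [motifs[old] for old in order]
    new_adjacency = [[adjacency[order[i]][order[j]] for j in range(n)]
                     for i in range(n)]
    return new_adjacency, new_motifs, order

def typed_edges(adjacency, motifs):
    """Order-independent fingerprint: the sorted (source motif, target motif) pairs."""
    n = len(motifs)
    return sorted((motifs[i], motifs[j])
                  for i in range(n) for j in range(n) if adjacency[i][j])
```

<p>The same permutation has to be applied to rows, columns, and the Motif list together. The <code class="language-plaintext highlighter-rouge">typed_edges</code> fingerprint is one cheap way to check that the routing survived the shuffle.</p>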

<p><strong>2. Cross-Hatch Embedding Injection.</strong> The DiT operates on a 2D adjacency matrix, but its conditioning signal (the Motifs) is a 1D list. The <code class="language-plaintext highlighter-rouge">InputConditioner</code> bridges this gap:</p>
<ul>
  <li>Embeds the 1D Motif tensor <code class="language-plaintext highlighter-rouge">[N]</code> into <code class="language-plaintext highlighter-rouge">[N, 128]</code></li>
  <li><strong>Broadcasts across rows</strong> → <code class="language-plaintext highlighter-rouge">[N, N, 128]</code> (source node identity)</li>
  <li><strong>Broadcasts across columns</strong> → <code class="language-plaintext highlighter-rouge">[N, N, 128]</code> (target node identity)</li>
  <li>Concatenates both with the 3-channel noisy adjacency → <strong>35 channels per pixel</strong></li>
</ul>

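<p>Every coordinate <code class="language-plaintext highlighter-rouge">(i, j)</code> now carries complete information about <em>which</em> Motifs are being connected: the source Motif of row <code class="language-plaintext highlighter-rouge">i</code> and the target Motif of column <code class="language-plaintext highlighter-rouge">j</code> are both visible in the cell itself, independent of where the nodes sit in the matrix.</p>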
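<p>The broadcast-and-concatenate step can be sketched without any tensor library. The embedding width is left as a free parameter <code class="language-plaintext highlighter-rouge">d</code>, so each cell ends up with <code class="language-plaintext highlighter-rouge">2d + 3</code> channels. The function below is a hypothetical stand-in, not the actual <code class="language-plaintext highlighter-rouge">InputConditioner</code>:</p>

```python
def cross_hatch(noisy_adjacency, motif_embeddings):
    """Build an [N][N][2d + 3] grid: source embedding, target embedding, edge channels.

    noisy_adjacency:  [N][N][3] nested lists (the noisy edge channels)
    motif_embeddings: [N][d] nested lists (one embedding per node)
    """
    n = len(motif_embeddings)
    return [[motif_embeddings[i] + motif_embeddings[j] + noisy_adjacency[i][j]
             for j in range(n)]
            for i in range(n)]
```

<p>List concatenation plays the role of channel-wise concatenation here; in a tensor framework the two broadcasts would be views rather than copies.</p>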

<h3 id="axial-attention-message-passing-in-matrix-form">Axial Attention: Message-Passing in Matrix Form</h3>

<p>Because we abandoned ViT square patches, the model processes the full matrix using <strong>Axial (Row-Column) Attention</strong> — which naturally mimics graph message passing:</p>

<ul>
  <li><strong>Row Attention</strong> (the “outgoing” perspective): Evaluates all potential connections <em>from</em> a node simultaneously. “I am a Condition — I must point to exactly two targets.”</li>
  <li><strong>Column Attention</strong> (the “incoming” perspective): Evaluates all incoming dependencies <em>to</em> a node. “I am a State — I can accept at most one data source.”</li>
</ul>

<p>This bidirectional reasoning is critical: graph validity requires both out-degree <em>and</em> in-degree constraints to be satisfied simultaneously.</p>

<h3 id="hybrid-loss-flow-matching--classification">Hybrid Loss: Flow Matching + Classification</h3>

<p>The model predicts 6 output channels per edge coordinate:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Channels</th>
      <th style="text-align: left">Meaning</th>
      <th style="text-align: left">Loss</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">0–1</td>
      <td style="text-align: left">Presence, Edge Type</td>
      <td style="text-align: left"><strong>Optimal Transport Flow Matching</strong> (masked MSE)</td>
    </tr>
    <tr>
      <td style="text-align: left">2–5</td>
      <td style="text-align: left">Input Index logits (0–3)</td>
      <td style="text-align: left"><strong>Categorical Cross-Entropy</strong> (masked)</td>
    </tr>
  </tbody>
</table>

<p>For the continuous channels, we use Conditional Flow Matching <a class="citation" href="#lipman2023flow">(Lipman et al., 2023)</a>: the target velocity is $v_t = x_1 - x_0$ (clean adjacency minus Gaussian noise), and the model learns to predict this velocity field. At inference, a 20-step Euler ODE solver integrates from noise to structure.</p>
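<p>The training target and the sampler can be sketched on flat lists of floats. Here <code class="language-plaintext highlighter-rouge">velocity_fn</code> stands in for the trained model, and all function names are hypothetical:</p>

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t = 0) to data x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target v_t = x1 - x0, constant along the straight path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(velocity_fn, x0, steps=20):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

<p>With a perfect velocity predictor the Euler steps land exactly on the clean sample, which is why a small step count can be enough when the learned field is close to straight.</p>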

<p>For the discrete channels (argument ordering), continuous regression would cause “rounding collisions” where two edges claim the same input index. Instead, the model outputs categorical logits, and a Cross-Entropy loss makes each edge commit to exactly one of the four index classes.</p>

<p><strong>Padding masking</strong> is essential: graphs vary from 3 to 128 nodes but are padded to a fixed $128 \times 128$ matrix. Without masking, the loss for a small graph would be dominated by padding: a 3-node graph fills only 9 of the 16,384 cells, leaving 16,375 void pixels for the model to be penalized on.</p>
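<p>A sketch of the masked variant for a single channel, assuming the real nodes occupy the top-left block of the padded matrix (an assumption about the layout, not a documented detail):</p>

```python
def masked_mse(pred, target, num_nodes):
    """MSE over the top-left num_nodes x num_nodes block; padding cells are ignored."""
    total = 0.0
    for i in range(num_nodes):
        for j in range(num_nodes):
            total += (pred[i][j] - target[i][j]) ** 2
    return total / (num_nodes * num_nodes)
```

<p>Dividing by the number of real cells rather than the padded size keeps the loss scale comparable between small and large graphs.</p>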

<h3 id="hyperparameter-ablation">Hyperparameter Ablation</h3>

<p>We ran an 8-configuration grid search over batch size, depth, and learning rate:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Parameter</th>
      <th style="text-align: center">Optimal</th>
      <th style="text-align: left">Reasoning</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">Effective Batch Size</td>
      <td style="text-align: center"><strong>16</strong></td>
      <td style="text-align: left">BS=4 too noisy; BS≥32 washes out categorical gradients</td>
    </tr>
    <tr>
      <td style="text-align: left">Axial Depth</td>
      <td style="text-align: center"><strong>12</strong> blocks</td>
      <td style="text-align: left">6 insufficient for global routing; 24 overfits out-degree at in-degree’s expense</td>
    </tr>
    <tr>
      <td style="text-align: left">Learning Rate</td>
      <td style="text-align: center"><strong>1e-4</strong></td>
      <td style="text-align: left">5e-4 causes catastrophic gradient explosions (loss &gt; 26.0); 1e-5 too slow</td>
    </tr>
  </tbody>
</table>

<p>At the optimal configuration, the model achieved <strong>32.4% in-degree / 46.0% out-degree</strong> pass rates during early training — enough signal for the curriculum to scale.</p>

<h2 id="the-validation-harness-5-laws-of-physics">The Validation Harness: 5 Laws of Physics</h2>

<p>To grade the DiT’s output deterministically (no LLM judge, no fuzzy metrics), we implemented a static topological analyzer that enforces five absolute graph laws. A generated matrix must pass <strong>all five</strong> to count as syntactically valid:</p>

<h3 id="law-1-execution-out-degree">Law 1: Execution Out-Degree</h3>

<p>Each Motif has strict branching limits:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Motif</th>
      <th style="text-align: left">Legal Out-Degree</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">Boundary</td>
      <td style="text-align: left">0 (exit) or 1 (entry)</td>
    </tr>
    <tr>
      <td style="text-align: left">Sequence, State, Message</td>
      <td style="text-align: left">≤ 1</td>
    </tr>
    <tr>
      <td style="text-align: left">Condition, Loop</td>
      <td style="text-align: left">Exactly 0 or 2 (and branch indices must be distinct)</td>
    </tr>
  </tbody>
</table>

<p>A Condition node with 3 outgoing execution edges? Illegal. One with two edges both labeled “True path”? Also illegal.</p>
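<p>The table translates into a short per-node check. The sketch assumes each outgoing execution edge carries a branch index, and that entry and exit Boundary nodes are distinguished by an explicit flag; both are illustrative choices:</p>

```python
def out_degree_ok(motif, branch_indices, is_entry=False):
    """branch_indices: one branch index per outgoing EXECUTION edge of the node."""
    degree = len(branch_indices)
    if motif == "Boundary":
        return degree == (1 if is_entry else 0)
    if motif in ("Sequence", "State", "Message"):
        return degree in (0, 1)
    if motif in ("Condition", "Loop"):
        # Either unused, or exactly one True branch and one False branch.
        return degree == 0 or (degree == 2 and len(set(branch_indices)) == 2)
    return False                              # unknown motif
```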

<h3 id="law-2-data-in-degree-arity">Law 2: Data In-Degree (Arity)</h3>

<p>Strict argument constraints prevent impossible data routing:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">Condition</code> and <code class="language-plaintext highlighter-rouge">Loop</code> nodes require <strong>exactly 1</strong> incoming data edge (the boolean)</li>
  <li><code class="language-plaintext highlighter-rouge">State</code> writes require <strong>exactly 1</strong> incoming data edge (the value to store)</li>
  <li><code class="language-plaintext highlighter-rouge">Message</code> nodes require <strong>unique, non-duplicate</strong> argument indices</li>
</ul>
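<p>The arity rules above can be sketched in the same per-node style. Treating State reads as exempt and leaving Boundary and Sequence unconstrained are assumptions made here, since the rules above do not mention those cases:</p>

```python
def in_degree_ok(motif, input_indices, is_write=True):
    """input_indices: the input_index of each incoming DATA edge of the node."""
    if motif in ("Condition", "Loop"):
        return len(input_indices) == 1            # exactly one boolean input
    if motif == "State":
        # Only writes are constrained; reads are assumed to take no data input.
        return len(input_indices) == 1 if is_write else True
    if motif == "Message":
        return len(set(input_indices)) == len(input_indices)   # no duplicate slots
    return True                                   # assumption: no rule for other motifs
```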

<h3 id="law-3-no-orphans-reachability">Law 3: No Orphans (Reachability)</h3>

<p>A BFS over the combined execution+data connectivity graph confirms zero disconnected logic islands: every node must belong to the same connected component as the rest of the graph.</p>
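<p>A sketch of the reachability check, assuming edges of both types are treated as undirected for this purpose (the direction handling in the actual harness may differ):</p>

```python
from collections import deque

def is_connected(num_nodes, edges):
    """True if all nodes form one component; edges are (src, dst) pairs of any type."""
    if num_nodes == 0:
        return True
    neighbours = {i: set() for i in range(num_nodes)}
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)              # ignore direction for connectivity
    seen, queue = {0}, deque([0])
    while queue:
        for nxt in neighbours[queue.popleft()] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return len(seen) == num_nodes
```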

<h3 id="law-4-acyclic-data-plane">Law 4: Acyclic Data Plane</h3>

<p>A DFS cycle detector ensures the DATA edge subgraph contains no paradoxes. Execution edges <em>may</em> cycle (that’s what loops are), but data dependencies must form a strict DAG — otherwise you get circular definitions (<code class="language-plaintext highlighter-rouge">a = b; b = a</code>).</p>
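<p>One standard way to implement such a detector is three-colour depth-first search, sketched here on <code class="language-plaintext highlighter-rouge">(src, dst)</code> pairs. The function name is hypothetical:</p>

```python
def data_plane_is_acyclic(num_nodes, data_edges):
    """True if the DATA edges, given as (src, dst) pairs, contain no directed cycle."""
    successors = {i: [] for i in range(num_nodes)}
    for src, dst in data_edges:
        successors[src].append(dst)
    WHITE, GREY, BLACK = 0, 1, 2              # unvisited, on the DFS stack, finished
    colour = [WHITE] * num_nodes

    def visit(node):
        colour[node] = GREY
        for nxt in successors[node]:
            if colour[nxt] == GREY:           # back edge: a cycle
                return False
            if colour[nxt] == WHITE and not visit(nxt):
                return False
        colour[node] = BLACK
        return True

    return all(colour[n] != WHITE or visit(n) for n in range(num_nodes))
```

<p>Recursion depth is bounded by the node count, so 128-node graphs stay well inside Python's default recursion limit.</p>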

<h3 id="law-5-terminal-sink">Law 5: Terminal Sink</h3>

<p>A reverse-BFS from exit nodes (those with 0 outgoing execution edges) confirms that <strong>every</strong> execution node can reach a termination point. This prevents infinite loops without escape hatches.</p>
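<p>The reverse search can be sketched as follows, assuming the execution plane is given as a list of node ids plus <code class="language-plaintext highlighter-rouge">(src, dst)</code> execution edges:</p>

```python
from collections import deque

def all_paths_can_terminate(exec_nodes, exec_edges):
    """True if every execution node can reach a node with no outgoing EXECUTION edge."""
    predecessors = {n: [] for n in exec_nodes}
    has_successor = set()
    for src, dst in exec_edges:
        predecessors[dst].append(src)
        has_successor.add(src)
    exits = [n for n in exec_nodes if n not in has_successor]
    seen, queue = set(exits), deque(exits)
    while queue:                              # walk edges backwards from the exits
        for prev in predecessors[queue.popleft()]:
            if prev not in seen:
                seen.add(prev)
                queue.append(prev)
    return seen == set(exec_nodes)
```

<p>A loop whose only outgoing edges lead back into itself never appears in <code class="language-plaintext highlighter-rouge">seen</code>, so the check fails for it.</p>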

<h3 id="the-breakthrough-100-svr-at-128-nodes">The Breakthrough: 100% SVR at 128 Nodes</h3>

<p>The 3-Phase Curriculum scaled the DiT from toy graphs (≤10 nodes) to massive 128-node matrices. At Epoch 343, with the Judicial Constraint Solver performing Top-K arity snapping and logit-weighted branch conflict resolution, the model achieved:</p>

<p class="tip"><strong>100.00% Syntactic Validity Rate on 128-node execution graphs.</strong></p>

<p>The Judicial Constraint Solver bridges the continuous-to-discrete gap: rather than expecting the DiT to perfectly zero its own noise, the solver reads the probability heatmap and mathematically snaps edges into legal bounds. A <code class="language-plaintext highlighter-rouge">Condition</code> node’s row gets its Top-2 highest probabilities snapped to <code class="language-plaintext highlighter-rouge">1</code>; conflicting argument indices get resolved via categorical logit magnitude.</p>
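<p>The Top-K snapping step for a single row might look like the sketch below. The tie-breaking rule (lowest index wins) is an assumption, and the logit-based conflict resolution is not shown:</p>

```python
def snap_row_top_k(probabilities, k):
    """Return a 0/1 row with ones at the k largest probabilities (ties: lowest index)."""
    # sorted() is stable, so equal probabilities keep their original index order.
    ranked = sorted(range(len(probabilities)), key=lambda j: -probabilities[j])
    keep = set(ranked[:k])
    return [1 if j in keep else 0 for j in range(len(probabilities))]
```

<p>For a <code class="language-plaintext highlighter-rouge">Condition</code> row, <code class="language-plaintext highlighter-rouge">k</code> would be 2; for a <code class="language-plaintext highlighter-rouge">Sequence</code> row, 1.</p>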

<h2 id="three-branches-of-government">Three Branches of Government</h2>

<p>The full system separates concerns like a constitutional government:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Branch</th>
      <th style="text-align: left">Model</th>
      <th style="text-align: left">Responsibility</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>Legislative</strong></td>
      <td style="text-align: left">Semantic LLM</td>
      <td style="text-align: left">Translates human intent → Motif list + Literal Pool</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Executive</strong></td>
      <td style="text-align: left">Permuted Dense DiT</td>
      <td style="text-align: left">Generates the topological routing (adjacency matrix)</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Judicial</strong></td>
      <td style="text-align: left">Constraint Solver</td>
      <td style="text-align: left">Snaps continuous heat maps → discrete, legal DAGs</td>
    </tr>
  </tbody>
</table>

<p>The DiT knows nothing about human language. It purely outputs mathematically valid logic scaffolds. The LLM knows nothing about graph topology. It purely manages the semantic content. The Constraint Solver enforces constitutional law on both.</p>

<h2 id="whats-next-rlaif-for-deterministic-perfection">What’s Next: RLAIF for Deterministic Perfection</h2>

<p>The continuous Flow Matching objective plateaued at loss ~0.135 — the model has extracted maximum topological value from the passive pre-training objective. The next phase transitions to <strong>Reinforcement Learning from AI Feedback (RLAIF)</strong>, using the 5 Laws of Physics as a direct reward signal:</p>

<ul>
  <li><strong>Base rewards</strong>: +0.2 for passing No Orphans, Acyclic Data, and In-Degree</li>
  <li><strong>Load-bearing penalties</strong>: +0.4 for Out-Degree pass (−0.2 fail), +0.4 for Terminal Sink pass (−0.4 fail)</li>
  <li><strong>Jackpot</strong>: If all 5 laws pass → <strong>2.5× multiplier</strong> on total reward</li>
</ul>
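<p>Read literally, the scheme above gives the following reward function. The sketch assumes the +0.2 applies to each of the three base laws separately and that failing a base law contributes nothing; the law names are hypothetical dictionary keys:</p>

```python
def reward(passed):
    """passed: dict mapping each of the five law names to True or False."""
    total = 0.0
    for law in ("no_orphans", "acyclic_data", "in_degree"):
        total += 0.2 if passed[law] else 0.0      # assumption: +0.2 per base law
    total += 0.4 if passed["out_degree"] else -0.2
    total += 0.4 if passed["terminal_sink"] else -0.4
    if all(passed.values()):
        total *= 2.5                              # jackpot multiplier
    return total
```

<p>Under these assumptions a fully valid graph scores 1.4 × 2.5 = 3.5, while a graph that fails every law scores −0.6.</p>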

<p>A KL Divergence Anchor to the frozen pre-trained weights prevents reward hacking (mode-collapsing into trivial straight-line graphs).</p>

<h2 id="try-it-yourself">Try It Yourself</h2>

<p>The full execution engine, dataset compression pipeline, DiT training code, and validation harness are available:</p>

<ul>
  <li><strong>💻 Code</strong>: <a href="https://github.com/timlawrenz/jubilant-palm-tree">jubilant-palm-tree</a> — The Neural Universal Machine</li>
  <li><strong>📊 Model</strong>: <a href="https://huggingface.co/timlawrenz/jubilant-palm-tree">timlawrenz/jubilant-palm-tree</a> — Pre-trained checkpoint on the Hub</li>
  <li><strong>🤖 Orchestrator</strong>: <a href="https://github.com/timlawrenz/ratiocinator">Ratiocinator</a> — Autonomous experiment runner</li>
  <li><strong>📄 Previous work</strong>: <a href="/2026/04/20/why-gnn-autoencoders-fail-at-code-generation.html">The Literal Value Bottleneck</a> — The GNN study that motivated this pivot</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Clone and run the Fibonacci proof</span>
git clone https://github.com/timlawrenz/jubilant-palm-tree
<span class="nb">cd </span>jubilant-palm-tree
pip <span class="nb">install</span> <span class="nt">-r</span> requirements.txt

<span class="c"># Execute the graph-walk interpreter (no compiler needed)</span>
python src/execution_engine/demo.py

<span class="c"># Compress the Ruby AST dataset into Universal Motifs</span>
python scripts/dataset_prep/compress_ast.py

<span class="c"># Train the Permuted Dense DiT</span>
python src/train.py
</code></pre></div></div>

<h2 id="references">References</h2>
<ol class="bibliography"><li><span id="bohm1966flow">Böhm, C., &amp; Jacopini, G. (1966). Flow diagrams, Turing machines and languages with only two formation rules. <i>Communications of the ACM</i>, <i>9</i>(5), 366–371.</span></li>
<li><span id="lipman2023flow">Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., &amp; Le, M. (2023). Flow matching for generative modeling. <i>ArXiv Preprint ArXiv:2210.02747</i>.</span></li></ol>]]></content><author><name>{&quot;user&quot;=&gt;&quot;timlawrenz&quot;}</name></author><category term="graph-neural-networks" /><category term="code-generation" /><category term="diffusion-transformers" /><category term="flow-matching" /><category term="research" /><category term="jubilant-palm-tree" /><summary type="html"><![CDATA[Our GNN autoencoders achieved 81% node accuracy on Ruby ASTs yet produced 0% valid code. The culprit was the literal value bottleneck — nearly half of every AST consisted of names and values that were irrecoverable from the structural encoding. Rather than patch the representation, we asked a more radical question: what if AI never generated human-readable code at all?]]></summary></entry><entry><title type="html">Self-Supervised Pretraining Recipes for Lung CT: A Systematic Study with DINO</title><link href="https://lawrenz.com/2026/04/20/breaking-the-entropy-wall-dino-lung-ct.html" rel="alternate" type="text/html" title="Self-Supervised Pretraining Recipes for Lung CT: A Systematic Study with DINO" /><published>2026-04-20T00:00:00+00:00</published><updated>2026-04-20T00:00:00+00:00</updated><id>https://lawrenz.com/2026/04/20/breaking-the-entropy-wall-dino-lung-ct</id><content type="html" xml:base="https://lawrenz.com/2026/04/20/breaking-the-entropy-wall-dino-lung-ct.html"><![CDATA[<p>Self-supervised learning (SSL) promises to unlock the diagnostic potential of large unlabeled medical image archives, yet practitioners face a daunting hyperparameter landscape with little domain-specific guidance. We present a systematic study of pretraining recipes for lung computed tomography (CT), evaluating 60+ experimental configurations on the LIDC-IDRI dataset.</p>

<ul id="markdown-toc">
  <li><a href="#1-the-challenges-of-medical-ssl" id="markdown-toc-1-the-challenges-of-medical-ssl">1. The Challenges of Medical SSL</a>    <ul>
      <li><a href="#core-contributions" id="markdown-toc-core-contributions">Core Contributions</a></li>
    </ul>
  </li>
  <li><a href="#2-methodology-the-dino-x-system" id="markdown-toc-2-methodology-the-dino-x-system">2. Methodology: The DINO-X System</a>    <ul>
      <li><a href="#21-architecture" id="markdown-toc-21-architecture">2.1 Architecture</a></li>
      <li><a href="#22-loss-functions--online-gram-alignment" id="markdown-toc-22-loss-functions--online-gram-alignment">2.2 Loss Functions &amp; Online Gram Alignment</a></li>
      <li><a href="#23-data-pipeline" id="markdown-toc-23-data-pipeline">2.3 Data Pipeline</a></li>
    </ul>
  </li>
  <li><a href="#3-results--discussion" id="markdown-toc-3-results--discussion">3. Results &amp; Discussion</a>    <ul>
      <li><a href="#31-breaking-the-entropy-wall" id="markdown-toc-31-breaking-the-entropy-wall">3.1 Breaking the Entropy Wall</a></li>
      <li><a href="#32-augmentation-the-colorjitter-trap" id="markdown-toc-32-augmentation-the-colorjitter-trap">3.2 Augmentation: The ColorJitter Trap</a></li>
      <li><a href="#33-scaling-behavior" id="markdown-toc-33-scaling-behavior">3.3 Scaling Behavior</a></li>
      <li><a href="#34-capacity-dependent-regularization" id="markdown-toc-34-capacity-dependent-regularization">3.4 Capacity-Dependent Regularization</a></li>
      <li><a href="#35-clinical-utility-malignancy-probing" id="markdown-toc-35-clinical-utility-malignancy-probing">3.5 Clinical Utility: Malignancy Probing</a></li>
      <li><a href="#36-resolution-224-vs-448" id="markdown-toc-36-resolution-224-vs-448">3.6 Resolution: 224 vs 448</a></li>
    </ul>
  </li>
  <li><a href="#4-representation-analysis" id="markdown-toc-4-representation-analysis">4. Representation Analysis</a></li>
  <li><a href="#5-the-infrastructure-64-experiments-for-3560" id="markdown-toc-5-the-infrastructure-64-experiments-for-3560">5. The Infrastructure: 64 Experiments for $35.60</a></li>
  <li><a href="#6-practical-recipes" id="markdown-toc-6-practical-recipes">6. Practical Recipes</a>    <ul>
      <li><a href="#the-vit-small-recipe-recommended" id="markdown-toc-the-vit-small-recipe-recommended">The ViT-Small Recipe (Recommended)</a></li>
      <li><a href="#common-pitfalls" id="markdown-toc-common-pitfalls">Common Pitfalls</a></li>
    </ul>
  </li>
  <li><a href="#try-it-yourself" id="markdown-toc-try-it-yourself">Try It Yourself</a></li>
  <li><a href="#references" id="markdown-toc-references">References</a></li>
</ul>

<h2 id="1-the-challenges-of-medical-ssl">1. The Challenges of Medical SSL</h2>

<p>Medical CT presents unique challenges that break standard natural-image SSL assumptions:</p>
<ul>
  <li><strong>Hounsfield Unit (HU) encoding</strong>: Pixel intensities carry calibrated tissue density information. Standard photometric augmentations destroy this signal.</li>
  <li><strong>Volumetric context</strong>: Pathology manifests across multiple slices. 2D methods must decide how to handle the Z-axis.</li>
  <li><strong>Entropy wall</strong>: DINO <a class="citation" href="#caron2021emerging">(Caron et al., 2021)</a> training on medical CT frequently stalls at the theoretical maximum entropy, producing uniform outputs that carry no information.</li>
</ul>

<h3 id="core-contributions">Core Contributions</h3>
<ol>
  <li><strong>Entropy wall solution</strong>: We identify center momentum as the critical factor for DINO on medical CT.</li>
  <li><strong>Medical augmentation guidelines</strong>: Evidence that spatial-only augmentation is optimal for HU data.</li>
  <li><strong>Capacity-dependent regularization</strong>: We discover that KoLeo regularization <a class="citation" href="#sablayrolles2019spreading">(Sablayrolles et al., 2019)</a> is critical for ViT-Large but optional for ViT-Small.</li>
  <li><strong>Scaling analysis</strong>: Tracing the trajectory from random weights to 1032× baseline retrieval.</li>
  <li><strong>Clinical evaluation</strong>: Establishing malignancy classification baselines and testing 3D feature aggregation.</li>
</ol>

<h2 id="2-methodology-the-dino-x-system">2. Methodology: The DINO-X System</h2>

<h3 id="21-architecture">2.1 Architecture</h3>
<p>We evaluate two scales of Vision Transformer (ViT) backbones:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Feature</th>
      <th style="text-align: center">ViT-Small</th>
      <th style="text-align: center">ViT-Large</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">Embedding dim</td>
      <td style="text-align: center">384</td>
      <td style="text-align: center">1024</td>
    </tr>
    <tr>
      <td style="text-align: left">Depth</td>
      <td style="text-align: center">12</td>
      <td style="text-align: center">24</td>
    </tr>
    <tr>
      <td style="text-align: left">Heads</td>
      <td style="text-align: center">6</td>
      <td style="text-align: center">16</td>
    </tr>
    <tr>
      <td style="text-align: left">Backbone params</td>
      <td style="text-align: center">21.6M</td>
      <td style="text-align: center">303.2M</td>
    </tr>
    <tr>
      <td style="text-align: left">Total params</td>
      <td style="text-align: center">24.9M</td>
      <td style="text-align: center">312.6M</td>
    </tr>
  </tbody>
</table>

<h3 id="22-loss-functions--online-gram-alignment">2.2 Loss Functions &amp; Online Gram Alignment</h3>
<p>Our system builds on DINOv2 <a class="citation" href="#oquab2023dinov2">(Oquab et al., 2023)</a> and independently adopts <strong>Online Gram alignment</strong>. Unlike DINOv3 <a class="citation" href="#meta2025dinov3">(Meta AI, 2025)</a>, which anchors to frozen historical checkpoints, DINO-X matches the student’s patch-token Gram matrix to the <em>current</em> EMA teacher at every step.</p>

<p>The total loss is defined as:
\(L = L_{DINO} + \lambda_{gram} \cdot L_{gram} + \lambda_{koleo} \cdot L_{koleo}\)</p>
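<p>As a rough sketch of how the three terms combine, the pure-Python example below computes a Gram alignment term and a KoLeo term and adds them to a given DINO loss. The exact form of the Gram loss in DINO-X is not stated here, so this assumes a mean squared error between Gram matrices of L2-normalized patch tokens. The KoLeo term follows the nearest-neighbor log-distance form of Sablayrolles et al. The default <code class="language-plaintext highlighter-rouge">lambda_gram=1.0</code> is a placeholder, while <code class="language-plaintext highlighter-rouge">lambda_koleo=0.1</code> is the value used later in this post.</p>

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def gram(tokens):
    """Cosine-similarity Gram matrix of a list of patch-token vectors."""
    unit = [l2_normalize(t) for t in tokens]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def gram_loss(student_tokens, teacher_tokens):
    """Assumed form: mean squared error between the two Gram matrices."""
    gs, gt = gram(student_tokens), gram(teacher_tokens)
    n = len(gs)
    return sum((gs[i][j] - gt[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


def koleo_loss(embeddings, eps=1e-8):
    """KoLeo: negative mean log-distance to each sample's nearest neighbor."""
    unit = [l2_normalize(e) for e in embeddings]
    total = 0.0
    for i, u in enumerate(unit):
        nearest = min(math.dist(u, w) for j, w in enumerate(unit) if j != i)
        total += math.log(nearest + eps)
    return -total / len(unit)


def total_loss(l_dino, student_tokens, teacher_tokens, embeddings,
               lambda_gram=1.0, lambda_koleo=0.1):
    return (l_dino
            + lambda_gram * gram_loss(student_tokens, teacher_tokens)
            + lambda_koleo * koleo_loss(embeddings))
```

<p>In this form the KoLeo term grows sharply when two embeddings coincide, which is the pressure against collapse that matters for ViT-Large in Section 3.4.</p>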

<h3 id="23-data-pipeline">2.3 Data Pipeline</h3>
<ul>
  <li><strong>Dataset</strong>: 234,943 axial slices from 981 LIDC-IDRI series.</li>
  <li><strong>Input Encoding</strong>: 3-channel input constructed from consecutive slices $(z-1, z, z+1)$ to provide local volumetric context.</li>
  <li><strong>HU Windowing</strong>: A random Hounsfield Unit window is applied per sample to simulate various clinical viewing protocols.</li>
</ul>
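<p>The input encoding and windowing steps above could look roughly like the following sketch. It is not the DINO-X implementation: the window center and width ranges are hypothetical, and clamping the neighbor index at the first and last slice is an assumption about how volume boundaries are handled.</p>

```python
import random


def window_hu(hu, center, width):
    """Map a Hounsfield value into [0, 1] for the given viewing window."""
    low = center - width / 2.0
    return min(max((hu - low) / width, 0.0), 1.0)


def make_sample(volume, z, rng):
    """Build a 3-channel sample from slices (z-1, z, z+1) of one series.

    volume is a list of slices, each slice a 2D list of HU values.
    """
    # Hypothetical ranges: one random window is drawn per sample and
    # shared by all three channels.
    center = rng.uniform(-600.0, 40.0)
    width = rng.uniform(400.0, 1500.0)
    last = len(volume) - 1
    # Assumption: neighbors are clamped at the volume boundaries.
    indices = [max(z - 1, 0), z, min(z + 1, last)]
    return [
        [[window_hu(v, center, width) for v in row] for row in volume[i]]
        for i in indices
    ]


volume = [[[-1000.0, 0.0]], [[-500.0, 40.0]], [[200.0, 1000.0]]]
sample = make_sample(volume, 1, random.Random(0))
print(len(sample), len(sample[0]), len(sample[0][0]))
```

<p>Sharing one window across the three channels keeps the relative intensities between adjacent slices intact, which matters because the HU values themselves carry the tissue-density signal.</p>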

<h2 id="3-results--discussion">3. Results &amp; Discussion</h2>

<h3 id="31-breaking-the-entropy-wall">3.1 Breaking the Entropy Wall</h3>
<p>The most critical finding: <strong>center momentum (cm)</strong> must be high enough to allow symmetry-breaking.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Center Momentum</th>
      <th style="text-align: center">2K Loss</th>
      <th style="text-align: center">10K Ratio</th>
      <th style="text-align: left">Trajectory</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">0.9</td>
      <td style="text-align: center">9.00</td>
      <td style="text-align: center">4.0 ↓</td>
      <td style="text-align: left">Permanently stuck</td>
    </tr>
    <tr>
      <td style="text-align: left">0.95</td>
      <td style="text-align: center">9.00</td>
      <td style="text-align: center">—</td>
      <td style="text-align: left">Stuck</td>
    </tr>
    <tr>
      <td style="text-align: left">0.99</td>
      <td style="text-align: center">9.01</td>
      <td style="text-align: center">—</td>
      <td style="text-align: left">Stuck</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>0.999</strong></td>
      <td style="text-align: center"><strong>5.76</strong></td>
      <td style="text-align: center"><strong>18.0</strong></td>
      <td style="text-align: left"><strong>Breaks through</strong> ✓</td>
    </tr>
  </tbody>
</table>

<p>At cm $\le$ 0.99, the center vector adapts too quickly, erasing emerging structure. At <strong>0.999</strong>, the update is slow enough for meaningful clusters to form.</p>
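<p>To make the timescales concrete, the sketch below uses the exponential-moving-average center update from DINO and computes how many steps the center needs to move halfway toward a fixed teacher batch mean. The function names are illustrative and not taken from the DINO-X codebase.</p>

```python
import math


def update_center(center, batch_mean, momentum):
    """One EMA step of the DINO center toward the teacher's batch mean."""
    return [momentum * c + (1.0 - momentum) * b
            for c, b in zip(center, batch_mean)]


def half_life_steps(momentum):
    """Steps until the center has moved halfway to a fixed batch mean."""
    return math.log(0.5) / math.log(momentum)


for cm in (0.9, 0.95, 0.99, 0.999):
    print(cm, round(half_life_steps(cm)))
```

<p>The half-lives come out at roughly 7, 14, 69, and 693 steps, so cm = 0.999 tracks the teacher about ten times more slowly than cm = 0.99. For reference, a uniform distribution over K outputs has entropy ln K, and ln 8192 ≈ 9.01 would match the stuck loss values in the table. The post does not state the head size, so that reading is an inference.</p>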

<h3 id="32-augmentation-the-colorjitter-trap">3.2 Augmentation: The ColorJitter Trap</h3>
<p>Intensity variations in CT distinguish pathologies (e.g., ground-glass opacity vs. solid nodule). Teaching invariance to intensity (via <code class="language-plaintext highlighter-rouge">ColorJitter</code>) teaches the model to ignore the signal.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Augmentation</th>
      <th style="text-align: center">Ratio (10K)</th>
      <th style="text-align: center">Relative</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>Spatial only (RRC + HFlip)</strong></td>
      <td style="text-align: center"><strong>49.7</strong></td>
      <td style="text-align: center"><strong>baseline</strong></td>
    </tr>
    <tr>
      <td style="text-align: left">+ ColorJitter</td>
      <td style="text-align: center">25.0</td>
      <td style="text-align: center"><strong>−50%</strong></td>
    </tr>
  </tbody>
</table>

<h3 id="33-scaling-behavior">3.3 Scaling Behavior</h3>
<p>We track the learning trajectory of ViT-Small over 100K steps:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Steps</th>
      <th style="text-align: center">Loss</th>
      <th style="text-align: center">Ratio</th>
      <th style="text-align: left">Top-1</th>
      <th style="text-align: left">Phase</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">2K</td>
      <td style="text-align: center">5.76</td>
      <td style="text-align: center">6</td>
      <td style="text-align: left">0.29%</td>
      <td style="text-align: left">Breakout</td>
    </tr>
    <tr>
      <td style="text-align: left">20K</td>
      <td style="text-align: center">0.44</td>
      <td style="text-align: center">311</td>
      <td style="text-align: left">15.2%</td>
      <td style="text-align: left">Rapid Learning</td>
    </tr>
    <tr>
      <td style="text-align: left">100K</td>
      <td style="text-align: center">0.23</td>
      <td style="text-align: center">1,032</td>
      <td style="text-align: left">25.2%</td>
      <td style="text-align: left">Diminishing Returns</td>
    </tr>
  </tbody>
</table>

<h3 id="34-capacity-dependent-regularization">3.4 Capacity-Dependent Regularization</h3>
<p>ViT-Large requires explicit uniformity enforcement. Without the KoLeo term (weighted at $\lambda_{koleo}=0.1$ when enabled), it solves the pretext task via representation collapse.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Configuration</th>
      <th style="text-align: center">100K Loss</th>
      <th style="text-align: center">100K Ratio</th>
      <th style="text-align: left">Status</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">ViT-L, no KoLeo</td>
      <td style="text-align: center">0.0004</td>
      <td style="text-align: center">4</td>
      <td style="text-align: left"><strong>Collapsed</strong></td>
    </tr>
    <tr>
      <td style="text-align: left">ViT-L, with KoLeo</td>
      <td style="text-align: center">0.27</td>
      <td style="text-align: center">500</td>
      <td style="text-align: left"><strong>Healthy</strong></td>
    </tr>
  </tbody>
</table>

<h3 id="35-clinical-utility-malignancy-probing">3.5 Clinical Utility: Malignancy Probing</h3>
<p>We evaluated frozen backbones on malignancy classification (430 malignant, 1,665 benign nodules).</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Model</th>
      <th style="text-align: center">Feature Type</th>
      <th style="text-align: center">AUC-ROC</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>ViT-S 100K</strong></td>
      <td style="text-align: center"><strong>Avg patch tokens</strong></td>
      <td style="text-align: center"><strong>0.687</strong></td>
    </tr>
    <tr>
      <td style="text-align: left">ViT-L 100K</td>
      <td style="text-align: center">CLS token</td>
      <td style="text-align: center">0.668</td>
    </tr>
    <tr>
      <td style="text-align: left">ViT-S 100K</td>
      <td style="text-align: center">CLS token</td>
      <td style="text-align: center">0.663</td>
    </tr>
    <tr>
      <td style="text-align: left">Supervised ResNet18</td>
      <td style="text-align: center">—</td>
      <td style="text-align: center">0.767</td>
    </tr>
  </tbody>
</table>

<blockquote class="tip warning">
  <p><strong>Negative Result</strong>: Aggregating features across Z-slices (3D mean pooling) actually <em>decreased</em> AUC (0.687 $\to$ 0.650). True 3D awareness likely requires architectural changes like volumetric patch tokens, not post-hoc aggregation.</p>
</blockquote>

<h3 id="36-resolution-224-vs-448">3.6 Resolution: 224 vs 448</h3>
<p>At matched step counts, 448px resolution performed <strong>7× worse</strong> in retrieval ratio while being <strong>2.7× slower</strong>. For this dataset scale, 224px is the pragmatic choice.</p>

<h2 id="4-representation-analysis">4. Representation Analysis</h2>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Metric</th>
      <th style="text-align: center">ViT-S 100K</th>
      <th style="text-align: center">ViT-L 100K</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">Active dims (std &gt; 0.01)</td>
      <td style="text-align: center">384/384</td>
      <td style="text-align: center">1024/1024</td>
    </tr>
    <tr>
      <td style="text-align: left">Pairwise cosine sim</td>
      <td style="text-align: center">0.887</td>
      <td style="text-align: center">0.865</td>
    </tr>
    <tr>
      <td style="text-align: left">Class separation</td>
      <td style="text-align: center">0.003</td>
      <td style="text-align: center">0.002</td>
    </tr>
  </tbody>
</table>

<p>Both models use all embedding dimensions, which is consistent with KoLeo preventing dimensional collapse. High pairwise similarity is expected for single-domain lung CT data.</p>
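<p>The first two metrics in the table can be computed along the following lines. This is a minimal sketch: it assumes the population standard deviation per dimension and the mean cosine similarity over all unordered pairs, which may differ in detail from the evaluation script. The class separation metric is not defined in this post, so it is left out.</p>

```python
import math
import statistics


def active_dims(embeddings, threshold=0.01):
    """Count embedding dimensions whose standard deviation exceeds threshold."""
    columns = zip(*embeddings)
    return sum(1 for col in columns if statistics.pstdev(col) > threshold)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)


toy = [[1.0, 0.0, 5.0], [0.0, 1.0, 5.0]]
print(active_dims(toy), round(mean_pairwise_cosine(toy), 3))
```

<p>In the toy example the constant third dimension counts as inactive, and the shared offset is what pushes the cosine similarity up, the same effect a common "lung CT" component would have on real embeddings.</p>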

<h2 id="5-the-infrastructure-64-experiments-for-3560">5. The Infrastructure: 64 Experiments for $35.60</h2>

<p>All of this — 64 GPU experiments across two hardware platforms — was orchestrated by <a href="https://github.com/timlawrenz/ratiocinator">Ratiocinator</a>, an autonomous LLM-driven research pipeline.</p>

<p>Ratiocinator handles the full lifecycle of the experimental campaign: it provisions Vast.ai RTX 4090 instances, deploys code via Git, installs dependencies, runs training, collects metrics, and tears down instances. By treating the research process itself as a distributed systems problem, Ratiocinator proved that high-velocity architectural ablation (including the diagnosis of the “Entropy Wall”) doesn’t require a massive compute budget — just ruthless pipeline optimization.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Experiment Set</th>
      <th style="text-align: center">Arms</th>
      <th style="text-align: center">GPU-hours</th>
      <th style="text-align: left">Approx. Cost</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">Loss &amp; CM sweeps</td>
      <td style="text-align: center">10</td>
      <td style="text-align: center">11</td>
      <td style="text-align: left">$3.70</td>
    </tr>
    <tr>
      <td style="text-align: left">Augmentation ablations</td>
      <td style="text-align: center">11</td>
      <td style="text-align: center">9</td>
      <td style="text-align: left">$3.00</td>
    </tr>
    <tr>
      <td style="text-align: left">ViT-Small Scaling</td>
      <td style="text-align: center">6</td>
      <td style="text-align: center">28</td>
      <td style="text-align: left">$9.40</td>
    </tr>
    <tr>
      <td style="text-align: left">ViT-Large Scaling</td>
      <td style="text-align: center">9</td>
      <td style="text-align: center">24</td>
      <td style="text-align: left">$8.80</td>
    </tr>
    <tr>
      <td style="text-align: left">Clinical Probes</td>
      <td style="text-align: center">28</td>
      <td style="text-align: center">33</td>
      <td style="text-align: left">$10.70</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Total</strong></td>
      <td style="text-align: center"><strong>64</strong></td>
      <td style="text-align: center"><strong>105</strong></td>
      <td style="text-align: left"><strong>~$35.60</strong></td>
    </tr>
  </tbody>
</table>

<h2 id="6-practical-recipes">6. Practical Recipes</h2>

<h3 id="the-vit-small-recipe-recommended">The ViT-Small Recipe (Recommended)</h3>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">training</span><span class="pi">:</span>
  <span class="na">loss</span><span class="pi">:</span> <span class="s">dino + gram + koleo</span>
  <span class="na">center_momentum</span><span class="pi">:</span> <span class="m">0.999</span>  <span class="c1"># CRITICAL</span>
  <span class="na">ema</span><span class="pi">:</span> <span class="m">0.996</span>
  <span class="na">lr</span><span class="pi">:</span> <span class="s">2e-4</span>
  <span class="na">batch_size</span><span class="pi">:</span> <span class="m">64</span>
  <span class="na">steps</span><span class="pi">:</span> <span class="s">50,000–100,000</span>

<span class="na">augmentation</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">RandomResizedCrop(224, scale=(0.5, 1.0))</span>
  <span class="pi">-</span> <span class="s">RandomHorizontalFlip()</span>
  <span class="c1"># NO ColorJitter, NO GaussianBlur</span>

<span class="na">data</span><span class="pi">:</span>
  <span class="na">channels</span><span class="pi">:</span> <span class="m">3</span>  <span class="c1"># (z-1, z, z+1)</span>
  <span class="na">windowing</span><span class="pi">:</span> <span class="s">random HU window per sample</span>
</code></pre></div></div>

<h3 id="common-pitfalls">Common Pitfalls</h3>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Pitfall</th>
      <th style="text-align: left">Symptom</th>
      <th style="text-align: left">Fix</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">cm too low (&lt; 0.999)</td>
      <td style="text-align: left">Loss stuck at 9.01</td>
      <td style="text-align: left">Set cm=0.999</td>
    </tr>
    <tr>
      <td style="text-align: left">ColorJitter</td>
      <td style="text-align: left">50%+ ratio drop</td>
      <td style="text-align: left">Remove intensity aug</td>
    </tr>
    <tr>
      <td style="text-align: left">ViT-L without KoLeo</td>
      <td style="text-align: left">Loss $\to$ 0, Ratio $\to$ 1</td>
      <td style="text-align: left">Add koleo_weight=0.1</td>
    </tr>
    <tr>
      <td style="text-align: left">ViT-L with high LR</td>
      <td style="text-align: left">Oscillating loss</td>
      <td style="text-align: left">Use lr=5e-5</td>
    </tr>
  </tbody>
</table>

<h2 id="try-it-yourself">Try It Yourself</h2>

<p>The complete training framework, experiment configurations, and pre-trained weights are available:</p>

<ul>
  <li><strong>💻 Code</strong>: <a href="https://github.com/timlawrenz/DINO-X">DINO-X</a> — Lung CT SSL training framework</li>
  <li><strong>🤖 Orchestrator</strong>: <a href="https://github.com/timlawrenz/ratiocinator">Ratiocinator</a> — The autonomous experiment runner</li>
  <li><strong>📊 Dataset</strong>: <a href="https://huggingface.co/datasets/timlawrenz/lidc-idri-png">LIDC-IDRI</a> on the Hub (PNG version)</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Clone the repository</span>
git clone https://github.com/timlawrenz/DINO-X
<span class="nb">cd </span>DINO-X

<span class="c"># Run the optimized ViT-Small recipe</span>
python train.py <span class="nt">--config</span> configs/vit_small_medical.yaml <span class="nt">--center_momentum</span> 0.999
</code></pre></div></div>

<h2 id="references">References</h2>
<ol class="bibliography"><li><span id="caron2021emerging">Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., &amp; Joulin, A. (2021). Emerging properties in self-supervised vision transformers. <i>Proceedings of the IEEE/CVF International Conference on Computer Vision</i>, 9650–9660.</span></li>
<li><span id="sablayrolles2019spreading">Sablayrolles, A., Douze, M., Schmid, C., &amp; Jégou, H. (2019). Spreading vectors for similarity search. <i>International Conference on Learning Representations</i>.</span></li>
<li><span id="oquab2023dinov2">Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., &amp; others. (2023). DINOv2: Learning robust visual features without supervision. <i>Transactions on Machine Learning Research</i>.</span></li>
<li><span id="meta2025dinov3">Meta AI. (2025). DINOv3: Self-supervised learning for vision at unprecedented scale. <i>ArXiv Preprint</i>.</span></li></ol>
]]></content><author><name>{&quot;user&quot;=&gt;&quot;timlawrenz&quot;}</name></author><category term="medical-imaging" /><category term="self-supervised-learning" /><category term="lung-ct" /><category term="dino" /><category term="vision-transformers" /><category term="research" /><category term="DINO-X" /><summary type="html"><![CDATA[Self-supervised learning (SSL) promises to unlock the diagnostic potential of large unlabeled medical image archives, yet practitioners face a daunting hyperparameter landscape with little domain-specific guidance. We present a systematic study of pretraining recipes for lung computed tomography (CT), evaluating 60+ experimental configurations on the LIDC-IDRI dataset.]]></summary></entry><entry><title type="html">The Literal Value Bottleneck: Why GNN Autoencoders Fail at Code Generation</title><link href="https://lawrenz.com/2026/04/20/why-gnn-autoencoders-fail-at-code-generation.html" rel="alternate" type="text/html" title="The Literal Value Bottleneck: Why GNN Autoencoders Fail at Code Generation" /><published>2026-04-20T00:00:00+00:00</published><updated>2026-04-20T00:00:00+00:00</updated><id>https://lawrenz.com/2026/04/20/why-gnn-autoencoders-fail-at-code-generation</id><content type="html" xml:base="https://lawrenz.com/2026/04/20/why-gnn-autoencoders-fail-at-code-generation.html"><![CDATA[<p>We trained GNN autoencoders on 22,000 Ruby ASTs. The models achieved <strong>81% node type accuracy</strong> and <strong>99.5% type diversity</strong>, yet generated exactly <strong>0% syntactically valid code</strong>. Here is why.</p>

<p>Graph Neural Networks seem like a natural fit for code — after all, Abstract Syntax Trees <em>are</em> graphs. If GNNs can learn molecular structures well enough to generate valid drug candidates, surely they can learn code structure well enough to generate valid programs?</p>

<p>We ran 51 GPU experiments across five GNN architectures, three decoder strategies, four hidden dimensions, and three loss functions to find out. The answer is a definitive no — but the <em>reason</em> is not what we expected.</p>

<ul id="markdown-toc">
  <li><a href="#the-setup-22452-ruby-asts-with-74-dimensional-features" id="markdown-toc-the-setup-22452-ruby-asts-with-74-dimensional-features">The Setup: 22,452 Ruby ASTs with 74-Dimensional Features</a></li>
  <li><a href="#gnns-do-learn-code-structure" id="markdown-toc-gnns-do-learn-code-structure">GNNs <em>Do</em> Learn Code Structure</a></li>
  <li><a href="#the-generation-catastrophe-0-across-the-board" id="markdown-toc-the-generation-catastrophe-0-across-the-board">The Generation Catastrophe: 0% Across the Board</a></li>
  <li><a href="#the-core-discovery-the-literal-value-bottleneck" id="markdown-toc-the-core-discovery-the-literal-value-bottleneck">The Core Discovery: The Literal Value Bottleneck</a></li>
  <li><a href="#without-structure-everything-collapses" id="markdown-toc-without-structure-everything-collapses">Without Structure, Everything Collapses</a></li>
  <li><a href="#dimension-doesnt-matter-depth-does-slightly" id="markdown-toc-dimension-doesnt-matter-depth-does-slightly">Dimension Doesn’t Matter, Depth Does (Slightly)</a></li>
  <li><a href="#the-infrastructure-51-experiments-for-432" id="markdown-toc-the-infrastructure-51-experiments-for-432">The Infrastructure: 51 Experiments for $4.32</a></li>
  <li><a href="#what-would-it-take-to-fix-this" id="markdown-toc-what-would-it-take-to-fix-this">What Would It Take to Fix This?</a></li>
  <li><a href="#try-it-yourself" id="markdown-toc-try-it-yourself">Try It Yourself</a></li>
  <li><a href="#references" id="markdown-toc-references">References</a></li>
</ul>

<h2 id="the-setup-22452-ruby-asts-with-74-dimensional-features">The Setup: 22,452 Ruby ASTs with 74-Dimensional Features</h2>

<p>We parsed 22,452 Ruby methods from open-source repositories into AST graphs. Each node gets a <strong>74-dimensional one-hot feature vector</strong> encoding its AST type — one of 73 known types (<code class="language-plaintext highlighter-rouge">def</code>, <code class="language-plaintext highlighter-rouge">send</code>, <code class="language-plaintext highlighter-rouge">args</code>, <code class="language-plaintext highlighter-rouge">lvar</code>, <code class="language-plaintext highlighter-rouge">str</code>, …) plus a single <code class="language-plaintext highlighter-rouge">UNKNOWN</code> token. Literal values — method names, variable names, string contents, numeric values — are stripped of their content and mapped to <code class="language-plaintext highlighter-rouge">UNKNOWN</code>.</p>
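<p>A minimal sketch of this encoding is shown below. Only the five type names listed above come from the dataset description. The remaining 68 entries are placeholders standing in for the rest of the vocabulary, and mapping every out-of-vocabulary label to <code class="language-plaintext highlighter-rouge">UNKNOWN</code> is an assumption about how the literal stripping is implemented.</p>

```python
# Only these five type names appear in the post; the rest are placeholders.
NAMED_TYPES = ["def", "send", "args", "lvar", "str"]
PLACEHOLDERS = [f"type_{i}" for i in range(len(NAMED_TYPES), 73)]
VOCAB = NAMED_TYPES + PLACEHOLDERS + ["UNKNOWN"]
INDEX = {name: i for i, name in enumerate(VOCAB)}


def encode_node(label):
    """Return the 74-dimensional one-hot vector for an AST node label."""
    vec = [0.0] * len(VOCAB)
    # Anything outside the 73 known types (e.g. a literal value) is UNKNOWN.
    vec[INDEX.get(label, INDEX["UNKNOWN"])] = 1.0
    return vec


print(len(VOCAB))
print(encode_node("send").index(1.0), encode_node("user_count").index(1.0))
```

<p>Note what the second call illustrates: a variable name such as <code class="language-plaintext highlighter-rouge">user_count</code> and any other literal land on the same index, so the encoding cannot distinguish between them.</p>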

<p>The dataset is split 85/15 into training (19,084) and validation (3,368) sets.</p>

<p class="tip">📦 <strong>Dataset</strong>: <a href="https://huggingface.co/datasets/timlawrenz/gnn-ruby-code-study">timlawrenz/gnn-ruby-code-study</a> on the Hub.</p>

<h2 id="gnns-do-learn-code-structure">GNNs <em>Do</em> Learn Code Structure</h2>

<p>Before diving into the failure, let’s establish that GNNs genuinely learn meaningful representations of code.</p>

<p>For <strong>cyclomatic complexity prediction</strong> (a graph-level regression task), we compared five architectures: GCN, GraphSAGE, GAT, GIN, and GraphConv. The results are clear:</p>

<table>
  <thead>
    <tr>
      <th>Architecture</th>
      <th>Layers</th>
      <th>Val MAE ↓</th>
      <th>R²</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>SAGE</strong></td>
      <td><strong>5</strong></td>
      <td><strong>4.018</strong></td>
      <td><strong>0.709</strong></td>
    </tr>
    <tr>
      <td>GIN</td>
      <td>3</td>
      <td>4.589</td>
      <td>0.629</td>
    </tr>
    <tr>
      <td>GAT (wide)</td>
      <td>3</td>
      <td>4.662</td>
      <td>0.612</td>
    </tr>
    <tr>
      <td>SAGE (baseline)</td>
      <td>3</td>
      <td>4.782</td>
      <td>0.635</td>
    </tr>
    <tr>
      <td>GCN</td>
      <td>3</td>
      <td>5.321</td>
      <td>0.563</td>
    </tr>
  </tbody>
</table>

<p>A 5-layer GraphSAGE achieves <strong>R² = 0.71</strong>, explaining 71% of the variance in cyclomatic complexity. Its validation MAE is <strong>16% lower</strong> than that of the 3-layer baseline (4.018 vs. 4.782), and the difference is <strong>9.9σ significant</strong> based on 18 replicate runs (σ = 0.073).</p>

<p>Two patterns jump out:</p>

<ol>
  <li>
    <p><strong>Depth dominates width.</strong> Going from 3 to 5 layers improves MAE by 16%. Doubling the hidden dimension from 64 to 128? Zero improvement. For ASTs with depths of 10–30, deeper networks capture cross-branch dependencies that directly relate to complexity.</p>
  </li>
  <li>
    <p><strong>GIN punches above its weight.</strong> GIN’s injective sum aggregation, which preserves the full multiset of neighbor features, gives it a 4% lower MAE than SAGE at equal depth (4.589 vs. 4.782). This is the Weisfeiler-Leman advantage in practice.</p>
  </li>
</ol>
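
<p>To make the multiset point concrete, the toy sketch below compares sum aggregation with mean aggregation on two neighborhoods that share the same feature proportions but differ in size. It assumes mean aggregation for SAGE, which this post does not confirm, and it uses scalars in place of feature vectors. Sum on raw scalars is not injective in general; GIN pairs it with a learned MLP, which this sketch leaves out.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy neighbor features (scalars stand in for feature vectors).
neighbors_a = [1.0, 2.0]
neighbors_b = [1.0, 1.0, 2.0, 2.0]  # same proportions, twice as many neighbors


def mean_agg(features):
    """Mean aggregation (assumed here for SAGE)."""
    return sum(features) / len(features)


def sum_agg(features):
    """Sum aggregation, as used by GIN."""
    return sum(features)


print(mean_agg(neighbors_a) == mean_agg(neighbors_b))  # True: mean cannot tell them apart
print(sum_agg(neighbors_a) == sum_agg(neighbors_b))    # False: sum keeps the multiplicity
</code></pre></div></div>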

<p>So the models clearly learn the graph structure. The question is: can they <em>reconstruct</em> it?</p>

<h2 id="the-generation-catastrophe-0-across-the-board">The Generation Catastrophe: 0% Across the Board</h2>

<p>We trained graph autoencoders to encode an AST into a latent vector and decode it back. We tried everything:</p>

<ul>
  <li><strong>5 architectures</strong>: GCN, SAGE, GAT, GIN, GraphConv</li>
  <li><strong>3 loss functions</strong>: simple (node type CE), improved (+ parent prediction), comprehensive</li>
  <li><strong>3 decoder edge modes</strong>: chain (sequential), teacher-forced (ground-truth edges), iterative (predicted edges)</li>
  <li><strong>4 decoder sizes</strong>: hidden dimensions of 128, 256, and 512, plus a deep 5-layer variant</li>
</ul>

<p class="tip warning"><strong>Every single configuration produces 0% syntactically valid Ruby.</strong></p>

<table>
  <thead>
    <tr>
      <th>Decoder Conv</th>
      <th>Hidden Dim</th>
      <th>Loss</th>
      <th>Syntax Validity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GAT</td>
      <td>256</td>
      <td>improved</td>
      <td>0%</td>
    </tr>
    <tr>
      <td>SAGE</td>
      <td>256</td>
      <td>improved</td>
      <td>0%</td>
    </tr>
    <tr>
      <td>GIN</td>
      <td>256</td>
      <td>improved</td>
      <td>0%</td>
    </tr>
    <tr>
      <td>GCN</td>
      <td>256</td>
      <td>improved</td>
      <td>0%</td>
    </tr>
    <tr>
      <td>GIN (teacher-forced, 5-layer)</td>
      <td>256</td>
      <td>improved</td>
      <td>0%</td>
    </tr>
  </tbody>
</table>

<p>Validation loss converges (to ~3.8 with teacher forcing), so the models <em>are</em> learning something. But what?</p>

<h2 id="the-core-discovery-the-literal-value-bottleneck">The Core Discovery: The Literal Value Bottleneck</h2>

<p>Here’s where it gets interesting. When we gave our best model — a 5-layer teacher-forced GIN decoder — the ground-truth tree structure and only asked it to predict node types, it achieved:</p>

<ul>
  <li><strong>81% node type accuracy</strong></li>
  <li><strong>99.5% type diversity</strong> (8.6 unique types per sample)</li>
  <li><strong>0% syntax validity</strong></li>
</ul>

<p>How can a model be 81% accurate and produce 0% valid code?</p>

<p>Look at this Ruby method:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">call</span><span class="p">(</span><span class="n">storage</span><span class="p">)</span>
  <span class="n">new</span><span class="p">(</span><span class="n">storage</span><span class="p">).</span><span class="nf">call</span>
<span class="k">end</span>
</code></pre></div></div>

<p>Its AST contains 12 elements:</p>

<table>
  <thead>
    <tr>
      <th>Node</th>
      <th>Ground Truth</th>
      <th>Predicted</th>
      <th>Match</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td><code class="language-plaintext highlighter-rouge">def</code></td>
      <td><code class="language-plaintext highlighter-rouge">def</code></td>
      <td>✓</td>
    </tr>
    <tr>
      <td>1</td>
      <td><code class="language-plaintext highlighter-rouge">"call"</code> → UNKNOWN</td>
      <td>UNKNOWN</td>
      <td>✓</td>
    </tr>
    <tr>
      <td>2</td>
      <td><code class="language-plaintext highlighter-rouge">args</code></td>
      <td><code class="language-plaintext highlighter-rouge">args</code></td>
      <td>✓</td>
    </tr>
    <tr>
      <td>3</td>
      <td><code class="language-plaintext highlighter-rouge">arg</code></td>
      <td><code class="language-plaintext highlighter-rouge">arg</code></td>
      <td>✓</td>
    </tr>
    <tr>
      <td>4</td>
      <td><code class="language-plaintext highlighter-rouge">"storage"</code> → UNKNOWN</td>
      <td>UNKNOWN</td>
      <td>✓</td>
    </tr>
    <tr>
      <td>5</td>
      <td><code class="language-plaintext highlighter-rouge">send</code></td>
      <td><code class="language-plaintext highlighter-rouge">send</code></td>
      <td>✓</td>
    </tr>
    <tr>
      <td>6</td>
      <td><code class="language-plaintext highlighter-rouge">send</code></td>
      <td><code class="language-plaintext highlighter-rouge">send</code></td>
      <td>✓</td>
    </tr>
    <tr>
      <td>7</td>
      <td><code class="language-plaintext highlighter-rouge">nil</code> → UNKNOWN</td>
      <td>UNKNOWN</td>
      <td>✓</td>
    </tr>
    <tr>
      <td>8</td>
      <td><code class="language-plaintext highlighter-rouge">"new"</code> → UNKNOWN</td>
      <td>UNKNOWN</td>
      <td>✓</td>
    </tr>
    <tr>
      <td>9</td>
      <td><code class="language-plaintext highlighter-rouge">lvar</code></td>
      <td><code class="language-plaintext highlighter-rouge">lvar</code></td>
      <td>✓</td>
    </tr>
    <tr>
      <td>10</td>
      <td><code class="language-plaintext highlighter-rouge">"storage"</code> → UNKNOWN</td>
      <td>UNKNOWN</td>
      <td>✓</td>
    </tr>
    <tr>
      <td>11</td>
      <td><code class="language-plaintext highlighter-rouge">"call"</code> → UNKNOWN</td>
      <td>UNKNOWN</td>
      <td>✓</td>
    </tr>
  </tbody>
</table>

<p><strong>12/12 correct. 100% accuracy.</strong> The model perfectly reconstructs the AST skeleton. But 6 of those 12 nodes are <code class="language-plaintext highlighter-rouge">UNKNOWN</code> — they are literal values (method names <code class="language-plaintext highlighter-rouge">call</code> and <code class="language-plaintext highlighter-rouge">new</code>, variable name <code class="language-plaintext highlighter-rouge">storage</code>, and a nil sentinel) that were encoded as the undifferentiated <code class="language-plaintext highlighter-rouge">UNKNOWN</code> token. The model correctly predicts <code class="language-plaintext highlighter-rouge">UNKNOWN</code> for all of them, which is technically right but utterly useless — the actual string content that makes the code meaningful is irrecoverable.</p>
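
<p>The gap between accuracy and usefulness can be reproduced in a few lines. The sketch below encodes the example method the way the table describes and scores a model that matches every target. The <code class="language-plaintext highlighter-rouge">KNOWN_TYPES</code> vocabulary and the <code class="language-plaintext highlighter-rouge">encode</code> helper are assumptions for illustration; the real encoder’s vocabulary is not shown in this post.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical type vocabulary covering only the structural types in this example.
KNOWN_TYPES = {"def", "args", "arg", "send", "lvar"}

# Node labels for the example method, in the order of the table above.
# None stands in for the nil sentinel at node 7.
ast_nodes = ["def", "call", "args", "arg", "storage", "send", "send",
             None, "new", "lvar", "storage", "call"]


def encode(label):
    """Keep structural node types and collapse everything else to UNKNOWN."""
    return label if label in KNOWN_TYPES else "UNKNOWN"


targets = [encode(label) for label in ast_nodes]
predictions = list(targets)  # a model that matches every target exactly

accuracy = sum(p == t for p, t in zip(predictions, targets)) / len(targets)
unknown_share = targets.count("UNKNOWN") / len(targets)

print(accuracy)       # 1.0
print(unknown_share)  # 0.5
</code></pre></div></div>

<p>Perfect accuracy on the encoded targets still leaves half of the tokens as placeholders with no recoverable name or value.</p>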

<p>When we analyzed 500 validation samples, the numbers were stark:</p>

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Percentage</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Typed AST nodes (<code class="language-plaintext highlighter-rouge">def</code>, <code class="language-plaintext highlighter-rouge">send</code>, <code class="language-plaintext highlighter-rouge">args</code>, …)</td>
      <td>53.2%</td>
    </tr>
    <tr>
      <td>Literal values (identifiers, strings, numbers)</td>
      <td><strong>46.8%</strong></td>
    </tr>
  </tbody>
</table>

<p><strong>Nearly half of every AST is literal values.</strong> Method names, variable names, string contents, numeric literals — all collapsed into a single <code class="language-plaintext highlighter-rouge">UNKNOWN</code> token. No amount of architectural tweaking, loss function engineering, or hidden dimension scaling can recover information that was never encoded.</p>

<p>This is the <strong>literal value bottleneck</strong>: the failure isn’t in model capacity or architecture. It’s in the input representation itself.</p>

<h2 id="without-structure-everything-collapses">Without Structure, Everything Collapses</h2>

<p>The bottleneck becomes even more apparent when we remove teacher forcing. Without ground-truth edges, the chain decoder (which connects nodes sequentially, destroying the tree topology) exhibits <strong>catastrophic mode collapse</strong>:</p>

<ul>
  <li><strong>92.7%</strong> of all predicted tokens are <code class="language-plaintext highlighter-rouge">UNKNOWN</code></li>
  <li>Only <code class="language-plaintext highlighter-rouge">def</code> (3.6%) and <code class="language-plaintext highlighter-rouge">send</code> (3.0%) appear as alternatives</li>
  <li>Average unique types per sample drops from 8.6 to <strong>1.6</strong></li>
  <li>Type accuracy plummets from 81% to <strong>48%</strong></li>
</ul>

<p>The model learns a degenerate strategy: predict the most common token (which happens to be <code class="language-plaintext highlighter-rouge">UNKNOWN</code>, since 47% of the ground truth <em>is</em> <code class="language-plaintext highlighter-rouge">UNKNOWN</code>) and call it a day.</p>

<p>Teacher forcing fixes the <em>structural</em> component (restoring type accuracy to 81%), but the <em>lexical</em> component — the literal value bottleneck — remains.</p>
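
<p>The degenerate strategy is a majority-class predictor, and its accuracy is simply the share of the most common token. Across the corpus that share is about 47%, close to the 48% type accuracy of the collapsed chain decoder. The sketch below illustrates the effect on the encoded targets of the example method from earlier in this post; it is an illustration of the baseline, not the study’s evaluation code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter

# Encoded targets for the example method (6 of 12 tokens are UNKNOWN).
targets = ["def", "UNKNOWN", "args", "arg", "UNKNOWN", "send", "send",
           "UNKNOWN", "UNKNOWN", "lvar", "UNKNOWN", "UNKNOWN"]

# Always predict the single most frequent token.
majority_token, majority_count = Counter(targets).most_common(1)[0]
predictions = [majority_token] * len(targets)

accuracy = sum(p == t for p, t in zip(predictions, targets)) / len(targets)
unique_types = len(set(predictions))

print(majority_token)  # UNKNOWN
print(accuracy)        # 0.5
print(unique_types)    # 1
</code></pre></div></div>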

<h2 id="dimension-doesnt-matter-depth-does-slightly">Dimension Doesn’t Matter, Depth Does (Slightly)</h2>

<p>One striking result: hidden dimensions of 128, 256, and 512 produce nearly <strong>identical</strong> outcomes:</p>

<table>
  <thead>
    <tr>
      <th>Config</th>
      <th>Hidden Dim</th>
      <th>Type Accuracy</th>
      <th>Heuristic Validity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>tf-gin-128</td>
      <td>128</td>
      <td>81.4%</td>
      <td>97.0%</td>
    </tr>
    <tr>
      <td>tf-gin-256</td>
      <td>256</td>
      <td>81.3%</td>
      <td>97.0%</td>
    </tr>
    <tr>
      <td>tf-gin-512</td>
      <td>512</td>
      <td>81.8%</td>
      <td>96.5%</td>
    </tr>
    <tr>
      <td>tf-gin-256-deep</td>
      <td>256 (5 layers)</td>
      <td>81.1%</td>
      <td><strong>99.5%</strong></td>
    </tr>
  </tbody>
</table>

<p>More capacity doesn’t help. The bottleneck is <strong>information-theoretic</strong>, not computational. Going deeper (5 layers) nudges heuristic validity from 97% to 99.5%, consistent with the depth-over-width finding from complexity prediction — but the syntax validity needle doesn’t move from 0%.</p>
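
<p>One way to see why capacity cannot help: the encoding is many-to-one. In the sketch below, two different methods with the same shape map to the same encoded sequence, so no decoder can tell which one it was given. The second method and the <code class="language-plaintext highlighter-rouge">KNOWN_TYPES</code> vocabulary are made up for illustration.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>KNOWN_TYPES = {"def", "args", "arg", "send", "lvar"}  # hypothetical vocabulary


def encode(labels):
    """Collapse every label outside the type vocabulary to UNKNOWN."""
    return tuple(label if label in KNOWN_TYPES else "UNKNOWN" for label in labels)


# def call(storage); new(storage).call; end
method_a = ["def", "call", "args", "arg", "storage", "send", "send",
            None, "new", "lvar", "storage", "call"]

# def run(config); build(config).start; end  (made-up method with the same shape)
method_b = ["def", "run", "args", "arg", "config", "send", "send",
            None, "build", "lvar", "config", "start"]

print(method_a == method_b)                  # False
print(encode(method_a) == encode(method_b))  # True
</code></pre></div></div>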

<h2 id="the-infrastructure-51-experiments-for-432">The Infrastructure: 51 Experiments for $4.32</h2>

<p>All of this — 51 GPU experiments across two hardware platforms — was orchestrated by <a href="https://github.com/timlawrenz/ratiocinator">Ratiocinator</a>, an autonomous LLM-driven research pipeline.</p>

<p>Ratiocinator handles the full lifecycle: it provisions Vast.ai RTX 4090 instances, deploys code via Git, installs dependencies, runs training, collects metrics, and tears down instances. The 18 baseline replicates that gave us our variance estimate? Those came from Ratiocinator accidentally running the same configuration three times due to an environment variable bug — which turned into a useful statistical gift.</p>

<table>
  <thead>
    <tr>
      <th>Experiment Set</th>
      <th>Arms</th>
      <th>Hardware</th>
      <th>Cost</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Autonomous research (3 iterations)</td>
      <td>24</td>
      <td>RTX 4090 (Vast.ai)</td>
      <td>$1.50</td>
    </tr>
    <tr>
      <td>Architecture comparison</td>
      <td>8</td>
      <td>RTX 4090 (Vast.ai)</td>
      <td>$1.20</td>
    </tr>
    <tr>
      <td>Generation analysis</td>
      <td>7</td>
      <td>RTX 4090 (Vast.ai)</td>
      <td>$0.80</td>
    </tr>
    <tr>
      <td>Decoder topology</td>
      <td>6</td>
      <td>RTX 4090 (Vast.ai)</td>
      <td>$0.70</td>
    </tr>
    <tr>
      <td>GIN deep dive</td>
      <td>5</td>
      <td>RTX 2070 SUPER (local)</td>
      <td>~$0.10</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>51</strong></td>
      <td> </td>
      <td><strong>~$4.32</strong></td>
    </tr>
  </tbody>
</table>

<p>A 51-experiment ablation study for the price of a latte. By treating the research process itself as a distributed systems problem, Ratiocinator shows that high-velocity architectural ablation doesn’t require a massive compute budget, just ruthless pipeline optimization.</p>

<h2 id="what-would-it-take-to-fix-this">What Would It Take to Fix This?</h2>

<p>Our findings point to three concrete directions:</p>

<ol>
  <li>
    <p><strong>Literal value prediction heads.</strong> Add separate output heads for identifier names (via a vocabulary or copy mechanism), string contents, and numeric values. The structural decoder already works — it’s the lexical reconstruction that’s missing.</p>
  </li>
  <li>
    <p><strong>Hybrid architectures.</strong> Use GNN encoders for structural understanding, but pair them with autoregressive or grammar-constrained decoders for sequential output. The GNN captures the <em>shape</em> of the code; a sequential decoder fills in the <em>content</em>.</p>
  </li>
  <li>
    <p><strong>Pointer networks / copy mechanisms.</strong> Let the decoder point back to nodes in the input graph to copy identifier names, rather than generating them from scratch. This is analogous to copy mechanisms in summarization models.</p>
  </li>
</ol>
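
<p>As a rough illustration of the copy idea, the sketch below scores each identifier node in the input graph against a decoder state and copies the best match. This is a minimal dot-product pointer under stated assumptions, not an implementation from the study: the function <code class="language-plaintext highlighter-rouge">copy_identifier</code>, the 2-d embeddings, and the decoder state are all invented for the example.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def copy_identifier(decoder_state, candidates):
    """Point at the input node whose embedding best matches the decoder state.

    candidates: list of (identifier, embedding) pairs taken from the input graph.
    Returns the copied identifier and the full pointer distribution.
    """
    weights = softmax([dot(decoder_state, emb) for _, emb in candidates])
    best = max(range(len(candidates)), key=lambda i: weights[i])
    return candidates[best][0], weights


# Made-up 2-d embeddings for the identifiers in the example method.
candidates = [("call", [1.0, 0.0]), ("storage", [0.0, 1.0]), ("new", [0.7, 0.7])]
name, weights = copy_identifier([0.1, 2.0], candidates)
print(name)  # storage
</code></pre></div></div>

<p>Because the output is an index into the input graph rather than a token from a fixed vocabulary, identifiers never have to pass through the <code class="language-plaintext highlighter-rouge">UNKNOWN</code> bucket.</p>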

<p>The fact that GNNs achieve R² = 0.71 for complexity prediction shows they learn meaningful code representations. The challenge is building decoders that can reconstruct the <strong>full richness</strong> of code, not just its structural skeleton, from those representations.</p>

<h2 id="try-it-yourself">Try It Yourself</h2>

<p>The full dataset, pre-trained models, and experiment configurations are available:</p>

<ul>
  <li><strong>📊 Dataset</strong>: <a href="https://huggingface.co/datasets/timlawrenz/gnn-ruby-code-study">timlawrenz/gnn-ruby-code-study</a> (22,452 Ruby methods with ASTs)</li>
  <li><strong>📄 Paper</strong>: <a href="https://huggingface.co/datasets/timlawrenz/gnn-ruby-code-study/blob/main/paper.md">Full research paper</a> with all tables and appendices</li>
  <li><strong>💻 Code</strong>: <a href="https://github.com/timlawrenz/jubilant-palm-tree">jubilant-palm-tree</a> (branch: <code class="language-plaintext highlighter-rouge">experiment/ratiocinator-gnn-study</code>)</li>
  <li><strong>🤖 Orchestrator</strong>: <a href="https://github.com/timlawrenz/ratiocinator">Ratiocinator</a> — the autonomous experiment runner</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Reproduce the key result</span>
git clone <span class="nt">-b</span> experiment/ratiocinator-gnn-study https://github.com/timlawrenz/jubilant-palm-tree
<span class="nb">cd </span>jubilant-palm-tree
pip <span class="nb">install </span>torch torchvision torch_geometric

<span class="c"># Complexity prediction (the success story)</span>
python train.py <span class="nt">--conv_type</span> SAGE <span class="nt">--num_layers</span> 5 <span class="nt">--epochs</span> 50

<span class="c"># Autoencoder with teacher-forced GIN (the 81%-accurate failure)</span>
python train_autoencoder.py <span class="nt">--decoder_conv_type</span> GIN <span class="nt">--decoder_edge_mode</span> teacher_forced <span class="nt">--epochs</span> 30
</code></pre></div></div>

<p>If you use this dataset or findings, please cite:</p>

<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@misc</span><span class="p">{</span><span class="nl">lawrenz2025gnnruby</span><span class="p">,</span>
  <span class="na">title</span><span class="p">=</span><span class="s">{Graph Neural Networks for Ruby Code Complexity Prediction and Generation:
         A Systematic Architecture Study}</span><span class="p">,</span>
  <span class="na">author</span><span class="p">=</span><span class="s">{Tim Lawrenz}</span><span class="p">,</span>
  <span class="na">year</span><span class="p">=</span><span class="s">{2026}</span><span class="p">,</span>
  <span class="na">howpublished</span><span class="p">=</span><span class="s">{\url{https://huggingface.co/datasets/timlawrenz/gnn-ruby-code-study}}</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="references">References</h2>
<ol class="bibliography"><li><span id="lawrenz2025gnnruby">Lawrenz, T. (2026). <i>Graph Neural Networks for Ruby Code Complexity Prediction and Generation: A Systematic Architecture Study</i>. https://huggingface.co/datasets/timlawrenz/gnn-ruby-code-study</span></li></ol>]]></content><author><name>{&quot;user&quot;=&gt;&quot;timlawrenz&quot;}</name></author><category term="graph-neural-networks" /><category term="code-generation" /><category term="datasets" /><category term="research" /><category term="jubilant-palm-tree" /><summary type="html"><![CDATA[We trained GNN autoencoders on 22,000 Ruby ASTs. The models achieved 81% node type accuracy and 99.5% type diversity, yet generated exactly 0% syntactically valid code. Here is why.]]></summary></entry></feed>