Self-supervised learning (SSL) promises to unlock the diagnostic potential of large unlabeled medical image archives, yet practitioners face a daunting hyperparameter landscape with little domain-specific guidance. We present a systematic study of pretraining recipes for lung computed tomography (CT), evaluating 60+ experimental configurations on the LIDC-IDRI dataset.

1. The Challenges of Medical SSL

Medical CT presents unique challenges that break standard natural-image SSL assumptions:

  • Hounsfield Unit (HU) encoding: Pixel intensities carry calibrated tissue density information. Standard photometric augmentations destroy this signal.
  • Volumetric context: Pathology manifests across multiple slices. 2D methods must decide how to handle the Z-axis.
  • Entropy wall: DINO (Caron et al., 2021) training on medical CT frequently stalls at the theoretical maximum entropy, producing uniform outputs that carry no information.
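To make the entropy wall concrete: when both teacher and student output a uniform distribution over K head dimensions, the cross-entropy equals ln K. The sketch below assumes K = 8192, a value the source does not state; it is used only because ln 8192 ≈ 9.01 matches the stalled loss reported in Section 3.1.

```python
import math

def uniform_cross_entropy(num_outputs: int) -> float:
    """Cross-entropy H(p, q) when teacher p and student q are both uniform."""
    p = 1.0 / num_outputs
    return -sum(p * math.log(p) for _ in range(num_outputs))

# K = 8192 is an assumption, not a value taken from the study.
K = 8192
wall = uniform_cross_entropy(K)
print(f"maximum entropy for K={K}: {wall:.2f} nats")
```

A loss that sits at this value means the student has learned nothing beyond the uniform prior.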

Core Contributions

  1. Entropy wall solution: We identify center momentum as the critical factor for DINO on medical CT.
  2. Medical augmentation guidelines: Evidence that spatial-only augmentation is optimal for HU data.
  3. Capacity-dependent regularization: We discover that KoLeo regularization (Sablayrolles et al., 2019) is critical for ViT-Large but optional for ViT-Small.
  4. Scaling analysis: Tracing the ViT-Small trajectory from random weights to a 1,032× baseline retrieval ratio over 100K steps.
  5. Clinical evaluation: Establishing malignancy classification baselines and testing 3D feature aggregation.

2. Methodology: The DINO-X System

2.1 Architecture

We evaluate two scales of Vision Transformer (ViT) backbones:

| Feature | ViT-Small | ViT-Large |
| --- | --- | --- |
| Embedding dim | 384 | 1024 |
| Depth | 12 | 24 |
| Heads | 6 | 16 |
| Backbone params | 21.6M | 303.2M |
| Total params | 24.9M | 312.6M |
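As a sanity check on the backbone sizes, the transformer blocks alone can be counted from the embedding dimension and depth. This is a rough sketch assuming a standard pre-norm ViT block with an MLP ratio of 4 and biases on every linear layer; it ignores patch embedding, position embeddings, and the final norm, so it should land slightly below the reported backbone figures.

```python
def transformer_block_params(dim: int, mlp_ratio: int = 4) -> int:
    """Parameters in one ViT block (attention + MLP + two LayerNorms)."""
    attention = 3 * dim * dim + 3 * dim      # fused QKV projection with bias
    attention += dim * dim + dim             # output projection with bias
    hidden = mlp_ratio * dim
    mlp = dim * hidden + hidden              # first linear layer
    mlp += hidden * dim + dim                # second linear layer
    layer_norms = 2 * (2 * dim)              # scale and shift, twice
    return attention + mlp + layer_norms

def encoder_params(dim: int, depth: int) -> int:
    """Total parameters across all transformer blocks."""
    return depth * transformer_block_params(dim)

vit_small = encoder_params(dim=384, depth=12)
vit_large = encoder_params(dim=1024, depth=24)
print(f"ViT-Small blocks: {vit_small / 1e6:.1f}M")
print(f"ViT-Large blocks: {vit_large / 1e6:.1f}M")
```

The head count does not appear in the formula: splitting attention into more heads changes how the projections are partitioned, not how many weights they hold.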

2.2 Loss Functions & Online Gram Alignment

Our system, which builds on DINOv2 (Oquab et al., 2023), independently adopts Online Gram alignment. Unlike DINOv3 (AI, 2025), which anchors the student to frozen historical checkpoints, DINO-X matches the student’s patch-token Gram matrix to the current EMA teacher at every step.

The total loss is defined as: \(L = L_{DINO} + \lambda_{gram} \cdot L_{gram} + \lambda_{koleo} \cdot L_{koleo}\)
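The Gram term and the weighted sum can be sketched in plain Python. This is illustrative only: the cosine normalization of patch tokens and the mean-squared-error form are assumptions about how the Gram matrices are compared, and `lambda_gram=1.0` is a placeholder since the source gives no value for it.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def gram_matrix(tokens):
    """Pairwise cosine similarities between patch tokens (N x N)."""
    unit = [l2_normalize(t) for t in tokens]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

def gram_loss(student_tokens, teacher_tokens):
    """Mean squared difference between student and teacher Gram matrices."""
    gs, gt = gram_matrix(student_tokens), gram_matrix(teacher_tokens)
    n = len(gs)
    return sum((gs[i][j] - gt[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

def total_loss(l_dino, l_gram, l_koleo, lambda_gram=1.0, lambda_koleo=0.1):
    """L = L_DINO + lambda_gram * L_gram + lambda_koleo * L_koleo."""
    return l_dino + lambda_gram * l_gram + lambda_koleo * l_koleo

# Toy patch tokens: the student separates two patches the teacher treats alike.
student = [[1.0, 0.0], [0.0, 1.0]]
teacher = [[1.0, 0.0], [1.0, 0.0]]
print(gram_loss(student, teacher))
print(total_loss(l_dino=1.0, l_gram=gram_loss(student, teacher), l_koleo=2.0))
```

In the online variant described above, `teacher_tokens` would come from the current EMA teacher at each step rather than from a frozen checkpoint.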

2.3 Data Pipeline

  • Dataset: 234,943 axial slices from 981 LIDC-IDRI series.
  • Input Encoding: 3-channel input constructed from consecutive slices $(z-1, z, z+1)$ to provide local volumetric context.
  • HU Windowing: A random Hounsfield Unit window is applied per sample to simulate various clinical viewing protocols.
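The input encoding and windowing steps above can be sketched as follows. The window ranges and the edge-clamping behavior at the first and last slice are assumptions made for illustration; the source specifies neither.

```python
import random

def apply_hu_window(hu, center, width):
    """Map a Hounsfield value to [0, 1] by clipping to a window."""
    low = center - width / 2.0
    scaled = (hu - low) / width
    return min(1.0, max(0.0, scaled))

def three_channel_sample(volume, z, rng):
    """Stack slices (z-1, z, z+1) as channels under one random HU window."""
    # Window ranges are illustrative, not taken from the study.
    center = rng.uniform(-600.0, 40.0)
    width = rng.uniform(400.0, 1500.0)
    last = len(volume) - 1
    channels = []
    for dz in (-1, 0, 1):
        idx = min(last, max(0, z + dz))  # assumption: clamp at volume edges
        channels.append([[apply_hu_window(v, center, width) for v in row]
                         for row in volume[idx]])
    return channels

# Toy volume: 4 slices of 2x2 HU values (air, lung, soft tissue, bone-like).
volume = [[[-1000.0, -700.0], [-50.0, 300.0]] for _ in range(4)]
sample = three_channel_sample(volume, z=0, rng=random.Random(0))
print(len(sample), "channels of", len(sample[0]), "x", len(sample[0][0]))
```

Using one window for all three channels keeps the neighboring slices directly comparable within a sample.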

3. Results & Discussion

3.1 Breaking the Entropy Wall

The most critical finding: center momentum (cm) must be high enough to allow symmetry-breaking.

| Center Momentum | 2K Loss | 10K Ratio | Trajectory |
| --- | --- | --- | --- |
| 0.9 | 9.00 | 4.0 ↓ | Permanently stuck |
| 0.95 | 9.00 | | Stuck |
| 0.99 | 9.01 | | Stuck |
| 0.999 | 5.76 | 18.0 | Breaks through |

At cm $\le$ 0.99, the center vector adapts too quickly, erasing emerging structure. At 0.999, the update is slow enough for meaningful clusters to form.
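How quickly the center adapts follows directly from the EMA update. The sketch below uses the standard DINO centering rule, c ← m·c + (1 − m)·batch_mean, and computes how many steps the center needs to move halfway toward a fixed batch mean; the function names are illustrative.

```python
import math

def update_center(center, batch_mean, momentum):
    """One EMA step: c <- m * c + (1 - m) * batch_mean."""
    return [momentum * c + (1.0 - momentum) * b
            for c, b in zip(center, batch_mean)]

def half_life_steps(momentum):
    """Steps until the center has moved halfway toward a fixed batch mean."""
    return math.log(0.5) / math.log(momentum)

for m in (0.9, 0.95, 0.99, 0.999):
    print(f"cm={m}: half-life ~ {half_life_steps(m):.0f} steps")
```

The half-lives come out at roughly 7, 14, 69, and 693 steps. At 0.999 the center averages over about ten times as many batches as at 0.99, which is consistent with the observation that only the slowest setting leaves emerging structure intact.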

3.2 Augmentation: The ColorJitter Trap

Intensity variations in CT distinguish pathologies (e.g., ground-glass opacity vs. solid nodule). Teaching invariance to intensity (via ColorJitter) teaches the model to ignore the signal.

| Augmentation | Ratio (10K) | Relative |
| --- | --- | --- |
| Spatial only (RRC + HFlip) | 49.7 | baseline |
| + ColorJitter | 25.0 | −50% |
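A toy example shows why the two augmentation families differ on calibrated data. The pixel values are invented for illustration: 0.35 and 0.60 stand in for a ground-glass and a solid region after windowing.

```python
def horizontal_flip(image):
    """Spatial augmentation: mirror each row; intensity values are unchanged."""
    return [list(reversed(row)) for row in image]

def brightness_jitter(image, factor):
    """Photometric augmentation: rescale every intensity by `factor`."""
    return [[v * factor for v in row] for row in image]

def sorted_values(image):
    """All pixel values in ascending order, ignoring position."""
    return sorted(v for row in image for v in row)

# Toy windowed slice; the values are illustrative, not measured data.
slice_2d = [[0.10, 0.35], [0.60, 0.90]]

flipped = horizontal_flip(slice_2d)
jittered = brightness_jitter(slice_2d, factor=1.7)

print("flip keeps intensities:", sorted_values(flipped) == sorted_values(slice_2d))
print("jitter keeps intensities:", sorted_values(jittered) == sorted_values(slice_2d))
print("ground-glass pixel after jitter:", round(jittered[0][1], 3))
```

After jitter the 0.35 pixel sits at about 0.60, the value that marked solid tissue in the unaugmented slice. Asking the model to treat both views as the same image asks it to discard exactly that difference.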

3.3 Scaling Behavior

We track the learning trajectory of ViT-Small over 100K steps:

| Steps | Loss | Ratio | Top-1 | Phase |
| --- | --- | --- | --- | --- |
| 2K | 5.76 | 6 | 0.29% | Breakout |
| 20K | 0.44 | 311 | 15.2% | Rapid Learning |
| 100K | 0.23 | 1,032 | 25.2% | Diminishing Returns |

3.4 Capacity-Dependent Regularization

ViT-Large requires explicit uniformity enforcement. Without the KoLeo term (we set $\lambda_{koleo}=0.1$ when it is enabled), the model solves the pretext task via representation collapse.

| Configuration | 100K Loss | 100K Ratio | Status |
| --- | --- | --- | --- |
| ViT-L, no KoLeo | 0.0004 | 4 | Collapsed |
| ViT-L, with KoLeo | 0.27 | 500 | Healthy |
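For intuition, the KoLeo regularizer penalizes batches whose embeddings sit close to their nearest neighbor. Below is a minimal sketch of the nearest-neighbor form from Sablayrolles et al. (2019), applied to L2-normalized features; the `eps` constant and the toy batches are illustrative, and the actual DINO-X implementation may differ.

```python
import math

def koleo_loss(features, eps=1e-8):
    """KoLeo regularizer: -mean(log(distance to nearest neighbor in batch))."""
    unit = []
    for f in features:
        norm = math.sqrt(sum(v * v for v in f)) or 1.0
        unit.append([v / norm for v in f])
    total = 0.0
    for i, a in enumerate(unit):
        nearest = min(math.dist(a, b) for j, b in enumerate(unit) if j != i)
        total += math.log(nearest + eps)
    return -total / len(unit)

# Toy batches: one spread around the unit circle, one nearly collapsed.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.001], [1.0, -0.001], [1.0, 0.002]]

print(f"spread batch:    {koleo_loss(spread):.3f}")
print(f"collapsed batch: {koleo_loss(collapsed):.3f}")
```

The collapsed batch receives a much larger penalty, so a nonzero $\lambda_{koleo}$ makes the collapsed solution expensive even when the DINO loss itself approaches zero.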

3.5 Clinical Utility: Malignancy Probing

We evaluated frozen backbones on malignancy classification (430 malignant, 1,665 benign nodules).

| Model | Feature Type | AUC-ROC |
| --- | --- | --- |
| ViT-S 100K | Avg patch tokens | 0.687 |
| ViT-L 100K | CLS token | 0.668 |
| ViT-S 100K | CLS token | 0.663 |
| Supervised ResNet18 | | 0.767 |
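AUC-ROC is a sensible metric here because the classes are imbalanced. A minimal sketch of the pairwise definition follows; the toy labels and scores are invented, and only the class counts come from the text.

```python
def auc_roc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Class balance reported in the text: 430 malignant vs 1,665 benign nodules.
prevalence = 430 / (430 + 1665)

# Toy probe outputs, for illustration only.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
print(f"malignant prevalence: {prevalence:.3f}")
print(f"toy AUC: {auc_roc(labels, scores):.3f}")
```

With roughly one malignant nodule in five, a probe that always predicts "benign" would reach about 79% accuracy while scoring an AUC of 0.5, so accuracy alone would overstate the probe's usefulness.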

Negative Result: Aggregating features across Z-slices (3D mean pooling) actually decreased AUC (0.687 $\to$ 0.650). True 3D awareness likely requires architectural changes like volumetric patch tokens, not post-hoc aggregation.
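Post-hoc aggregation of this kind amounts to averaging per-slice feature vectors. The sketch below shows the operation on invented numbers; the dilution it illustrates is one plausible reading of the negative result, not something the study establishes.

```python
def mean_pool_slices(slice_features):
    """Average per-slice feature vectors into one nodule-level vector."""
    count = len(slice_features)
    dim = len(slice_features[0])
    return [sum(f[d] for f in slice_features) / count for d in range(dim)]

# Toy example: only the central slice carries a strong response in dim 0.
per_slice = [[0.0, 1.0], [0.0, 1.0], [3.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
pooled = mean_pool_slices(per_slice)
print(pooled)
```

A response confined to one slice shrinks in proportion to the number of slices averaged, while features shared by every slice pass through unchanged.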

3.6 Resolution: 224 vs 448

At matched step counts, 448px resolution performed 7× worse in retrieval ratio while being 2.7× slower. For this dataset scale, 224px is the pragmatic choice.
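The compute side of this trade-off follows from the token count. The sketch assumes a patch size of 16, which the source does not state; the 4× ratio holds for any patch size that divides both resolutions.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping patches a square image is split into."""
    return (image_size // patch_size) ** 2

# patch_size=16 is an assumption; the source does not state it.
tokens_224 = num_patch_tokens(224, 16)
tokens_448 = num_patch_tokens(448, 16)
token_ratio = tokens_448 / tokens_224
attention_pair_ratio = (tokens_448 ** 2) / (tokens_224 ** 2)

print(f"224px: {tokens_224} tokens, 448px: {tokens_448} tokens")
print(f"token ratio: {token_ratio:.0f}x, attention pair ratio: {attention_pair_ratio:.0f}x")
```

Doubling the resolution gives four times as many patch tokens and sixteen times as many token pairs in self-attention.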

4. Representation Analysis

| Metric | ViT-S 100K | ViT-L 100K |
| --- | --- | --- |
| Active dims (std > 0.01) | 384/384 | 1024/1024 |
| Pairwise cosine sim | 0.887 | 0.865 |
| Class separation | 0.003 | 0.002 |

Both models use all embedding dimensions, confirming KoLeo prevents dimensional collapse. High pairwise similarity is expected for the single-domain lung CT data.
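The first two metrics in the table can be computed along these lines. This is a sketch under assumptions: it uses the population standard deviation and averages cosine similarity over all distinct pairs, which may not match the exact definitions used in the study. The toy embeddings are invented.

```python
import math
import statistics

def active_dims(embeddings, threshold=0.01):
    """Count dimensions whose std across samples exceeds the threshold."""
    columns = zip(*embeddings)
    return sum(1 for column in columns if statistics.pstdev(column) > threshold)

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of embeddings."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)

# Toy embeddings: the third dimension is constant, so it counts as inactive.
embeddings = [[1.0, 0.2, 0.5], [0.9, 0.3, 0.5], [1.1, 0.1, 0.5]]
print("active dims:", active_dims(embeddings), "of", len(embeddings[0]))
print(f"mean pairwise cosine: {mean_pairwise_cosine(embeddings):.3f}")
```

A dimension with near-zero variance carries no information about the input, which is why the active-dimension count is a useful quick check for dimensional collapse.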

5. The Infrastructure: 64 Experiments for $35.60

All of this — 64 GPU experiments across two hardware platforms — was orchestrated by Ratiocinator, an autonomous LLM-driven research pipeline.

Ratiocinator handles the full lifecycle of the experimental campaign: it provisions Vast.ai RTX 4090 instances, deploys code via Git, installs dependencies, runs training, collects metrics, and tears down instances. By treating the research process itself as a distributed systems problem, Ratiocinator proved that high-velocity architectural ablation (including the diagnosis of the “Entropy Wall”) doesn’t require a massive compute budget — just ruthless pipeline optimization.

| Experiment Set | Arms | GPU-hours | Approx. Cost |
| --- | --- | --- | --- |
| Loss & CM sweeps | 10 | 11 | $3.70 |
| Augmentation ablations | 11 | 9 | $3.00 |
| ViT-Small Scaling | 6 | 28 | $9.40 |
| ViT-Large Scaling | 9 | 24 | $8.80 |
| Clinical Probes | 28 | 33 | $10.70 |
| Total | 64 | 105 | ~$35.60 |
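The totals can be checked, and an effective hourly rate derived, with a few lines of arithmetic. The numbers are copied from the table; the per-hour figure is derived here rather than reported by the source.

```python
# (arms, GPU-hours, approximate cost in USD), copied from the table.
experiment_sets = {
    "Loss & CM sweeps": (10, 11, 3.70),
    "Augmentation ablations": (11, 9, 3.00),
    "ViT-Small Scaling": (6, 28, 9.40),
    "ViT-Large Scaling": (9, 24, 8.80),
    "Clinical Probes": (28, 33, 10.70),
}

total_arms = sum(arms for arms, _, _ in experiment_sets.values())
total_hours = sum(hours for _, hours, _ in experiment_sets.values())
total_cost = sum(cost for _, _, cost in experiment_sets.values())
cost_per_hour = total_cost / total_hours

print(f"{total_arms} arms, {total_hours} GPU-hours, ${total_cost:.2f} total")
print(f"effective rate: ${cost_per_hour:.2f} per GPU-hour")
```

The rows sum to the stated totals, and the campaign works out to about $0.34 per GPU-hour.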

6. Practical Recipes

training:
  loss: dino + gram + koleo
  center_momentum: 0.999  # CRITICAL
  ema: 0.996
  lr: 2e-4
  batch_size: 64
  steps: 50,000–100,000

augmentation:
  - RandomResizedCrop(224, scale=(0.5, 1.0))
  - RandomHorizontalFlip()
  # NO ColorJitter, NO GaussianBlur

data:
  channels: 3  # (z-1, z, z+1)
  windowing: random HU window per sample

Common Pitfalls

| Pitfall | Symptom | Fix |
| --- | --- | --- |
| cm too low (< 0.999) | Loss stuck at 9.01 | Set cm=0.999 |
| ColorJitter | 50%+ ratio drop | Remove intensity aug |
| ViT-L without KoLeo | Loss $\to$ 0, Ratio $\to$ 1 | Add koleo_weight=0.1 |
| ViT-L with high LR | Oscillating loss | Use lr=5e-5 |
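The pitfalls in the table above lend themselves to a pre-flight check. The helper below is hypothetical and not part of the DINO-X codebase; the `model` and `augmentations` keys are invented for illustration, while `center_momentum`, `lr`, and `koleo_weight` follow the names used in this post.

```python
def check_recipe(config):
    """Return warnings for the pitfalls listed in the table (hypothetical helper)."""
    warnings = []
    if config.get("center_momentum", 0.0) < 0.999:
        warnings.append("center_momentum below 0.999: loss may stall near 9.01")
    if "ColorJitter" in config.get("augmentations", []):
        warnings.append("ColorJitter present: remove intensity augmentation")
    if config.get("model") == "vit_large":
        if config.get("koleo_weight", 0.0) <= 0.0:
            warnings.append("ViT-L without KoLeo: add koleo_weight=0.1")
        if config.get("lr", 0.0) > 5e-5:
            warnings.append("ViT-L learning rate above 5e-5: loss may oscillate")
    return warnings

risky = {
    "model": "vit_large",
    "center_momentum": 0.99,
    "lr": 2e-4,
    "augmentations": ["RandomResizedCrop", "ColorJitter"],
}
for message in check_recipe(risky):
    print("WARNING:", message)
```

Running a check like this before provisioning a GPU is cheap compared with finding a stalled loss several thousand steps in.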

Try It Yourself

The complete training framework, experiment configurations, and pre-trained weights are available:

  • 💻 Code: DINO-X — Lung CT SSL training framework
  • 🤖 Orchestrator: Ratiocinator — The autonomous experiment runner
  • 📊 Dataset: LIDC-IDRI on the Hub (PNG version)

# Clone the repository
git clone https://github.com/timlawrenz/DINO-X
cd DINO-X

# Run the optimized ViT-Small recipe
python train.py --config configs/vit_small_medical.yaml --center_momentum 0.999

References

  1. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision, 9650–9660.
  2. Sablayrolles, A., Douze, M., Schmid, C., & Jégou, H. (2019). Spreading vectors for similarity search. International Conference on Learning Representations.
  3. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., & others. (2023). DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research.
  4. AI, M. (2025). DINOv3: Self-supervised learning for vision at unprecedented scale. ArXiv Preprint.
