Self-Supervised Pretraining Recipes for Lung CT: A Systematic Study with DINO

Self-supervised learning (SSL) promises to unlock the diagnostic potential of large unlabeled medical image archives, yet practitioners face a daunting hyperparameter landscape with little domain-specific guidance. We present a systematic study of pretraining recipes for lung computed tomography (CT), evaluating 60+ experimental configurations on the LIDC-IDRI dataset.

1. The Challenges of Medical SSL
- Core Contributions
2. Methodology: The DINO-X System
3. Results & Discussion
4. Representation Analysis
5. The Infrastructure: 64 Experiments for $35.60
6. Practical Recipes
- The ViT-Small Recipe (Recommended)
- Common Pitfalls
Try It Yourself
References

1. The Challenges of Medical SSL

Medical CT presents unique challenges that break standard natural-image SSL assumptions:

Hounsfield Unit (HU) encoding: Pixel intensities carry calibrated tissue density information. Standard photometric augmentations destroy this signal.
Volumetric context: Pathology manifests across multiple slices. 2D methods must decide how to handle the Z-axis.
Entropy wall: DINO (Caron et al., 2021) training on medical CT frequently stalls at the theoretical maximum entropy, producing uniform outputs that carry no information.
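The "theoretical maximum entropy" can be made concrete: when both teacher and student emit a uniform distribution over K output dimensions, the cross-entropy equals ln K. The sketch below assumes K = 8192, a value the post does not state; it is chosen here only because ln 8192 ≈ 9.01 matches the loss plateau reported in Section 3.1.

```python
import math
import numpy as np

def cross_entropy(teacher_probs, student_probs):
    """Cross-entropy H(teacher, student) for one sample, in nats."""
    return float(-(teacher_probs * np.log(student_probs)).sum())

# ASSUMPTION: number of output prototypes; not given in the post.
K = 8192

# Fully uniform outputs: every prototype gets probability 1/K.
uniform = np.full(K, 1.0 / K)

# With uniform teacher and student, the loss sits at ln(K).
wall = cross_entropy(uniform, uniform)
print(f"entropy wall for K={K}: {wall:.4f} (ln K = {math.log(K):.4f})")
```

A loss that sits at ln K therefore signals that the heads carry no information about the input, which is what makes the plateau easy to diagnose.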

Core Contributions

Entropy wall solution: We identify center momentum as the critical factor for DINO on medical CT.
Medical augmentation guidelines: Evidence that spatial-only augmentation is optimal for HU data.
Capacity-dependent regularization: We discover that KoLeo regularization (Sablayrolles et al., 2019) is critical for ViT-Large but optional for ViT-Small.
Scaling analysis: Tracing the trajectory from random weights to 1032× baseline retrieval.
Clinical evaluation: Establishing malignancy classification baselines and testing 3D feature aggregation.

2. Methodology: The DINO-X System

2.1 Architecture

We evaluate two scales of Vision Transformer (ViT) backbones:

Feature	ViT-Small	ViT-Large
Embedding dim	384	1024
Depth	12	24
Heads	6	16
Backbone params	21.6M	303.2M
Total params	24.9M	312.6M

2.2 Loss Functions & Online Gram Alignment

Our system builds on DINOv2 (Oquab et al., 2023) and independently adopts online Gram alignment. Unlike DINOv3's (Meta AI, 2025) temporal anchoring to frozen historical checkpoints, DINO-X matches the student's patch-token Gram matrix to that of the current EMA teacher at every step.

The total loss is defined as: $L = L_{DINO} + \lambda_{gram} \cdot L_{gram} + \lambda_{koleo} \cdot L_{koleo}$
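To make the three terms tangible, here is a minimal NumPy sketch of how such a loss could be assembled. It is not the DINO-X implementation: the temperatures (0.1 and 0.04), the placeholder `lambda_gram=1.0`, the mean-squared form of the Gram term, and all function names are assumptions for illustration. Only $\lambda_{koleo}=0.1$ comes from this post.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between centered, sharpened teacher and student outputs."""
    teacher_probs = np.exp(log_softmax((teacher_logits - center) / tau_t))
    student_logp = log_softmax(student_logits / tau_s)
    return float(-(teacher_probs * student_logp).sum(axis=-1).mean())

def gram_loss(student_patches, teacher_patches):
    """Mean squared difference between patch-token Gram matrices, shape (P, D) in."""
    s, t = l2_normalize(student_patches), l2_normalize(teacher_patches)
    return float((((s @ s.T) - (t @ t.T)) ** 2).mean())

def koleo_loss(features, eps=1e-8):
    """Negative mean log nearest-neighbor distance; large when features bunch up."""
    x = l2_normalize(features)
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # ignore each point's distance to itself
    return float(-np.log(dists.min(axis=1) + eps).mean())

def total_loss(l_dino, l_gram, l_koleo, lambda_gram=1.0, lambda_koleo=0.1):
    # lambda_gram is a placeholder; lambda_koleo=0.1 is the value quoted in this post.
    return l_dino + lambda_gram * l_gram + lambda_koleo * l_koleo

rng = np.random.default_rng(0)
spread = rng.normal(size=(8, 4))     # well-separated features
collapsed = np.ones((8, 4))          # every feature identical
print(koleo_loss(spread), koleo_loss(collapsed))
```

The KoLeo term is the one that penalizes collapse directly: identical features give a nearest-neighbor distance of zero, so the penalty grows sharply.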

2.3 Data Pipeline

Dataset: 234,943 axial slices from 981 LIDC-IDRI series.
Input Encoding: 3-channel input constructed from consecutive slices $(z-1, z, z+1)$ to provide local volumetric context.
HU Windowing: A random Hounsfield Unit window is applied per sample to simulate various clinical viewing protocols.
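The input encoding and windowing steps can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the project's loader: the edge-clamping rule, the sampling ranges for window center and width, and the function names are all hypothetical.

```python
import numpy as np

def stack_neighbors(volume_hu, z):
    """Build a 3-channel input from slices (z-1, z, z+1) of a (Z, H, W) volume.

    ASSUMPTION: indices are clamped at the first and last slice.
    """
    depth = volume_hu.shape[0]
    idx = [max(z - 1, 0), z, min(z + 1, depth - 1)]
    return volume_hu[idx]  # shape (3, H, W)

def apply_window(hu, center, width):
    """Clip HU values to a window and rescale to [0, 1]."""
    lo, hi = center - width / 2.0, center + width / 2.0
    return (np.clip(hu, lo, hi) - lo) / (hi - lo)

def random_window(rng, center_range=(-700.0, 50.0), width_range=(400.0, 1600.0)):
    """Sample one (center, width) pair per training sample.

    ASSUMPTION: these ranges are placeholders, not the values used in the study.
    """
    return rng.uniform(*center_range), rng.uniform(*width_range)

rng = np.random.default_rng(0)
volume = rng.integers(-1000, 400, size=(5, 8, 8)).astype(np.float32)  # fake HU volume

sample = stack_neighbors(volume, z=2)
center, width = random_window(rng)
windowed = apply_window(sample, center, width)
print(windowed.shape, float(windowed.min()), float(windowed.max()))
```

Note that windowing is a calibrated remapping of HU values rather than an arbitrary photometric distortion, which is one way to read why it can coexist with the "no ColorJitter" guidance later in the post.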

3. Results & Discussion

3.1 Breaking the Entropy Wall

The most critical finding: center momentum (cm) must be high enough to allow symmetry-breaking.

Center Momentum	2K Loss	10K Ratio	Trajectory
0.9	9.00	4.0 ↓	Permanently stuck
0.95	9.00	—	Stuck
0.99	9.01	—	Stuck
0.999	5.76	18.0	Breaks through ✓

At cm $\le$ 0.99, the center vector adapts too quickly, erasing emerging structure. At 0.999, the update is slow enough for meaningful clusters to form.
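For intuition about how quickly the center adapts, the sketch below uses the standard DINO-style exponential moving average update and reports its effective averaging window, roughly 1 / (1 - cm) steps. The function names are illustrative, and the calculation only describes the update's time scale; the breakthrough at 0.999 is the empirical finding from the table above.

```python
import numpy as np

def update_center(center, teacher_logits, momentum):
    """EMA update of the center from the batch mean of teacher outputs."""
    batch_mean = teacher_logits.mean(axis=0)
    return momentum * center + (1.0 - momentum) * batch_mean

def effective_window(momentum):
    """Approximate number of steps the EMA averages over."""
    return 1.0 / (1.0 - momentum)

def absorbed_fraction(momentum, steps):
    """Share of a persistent shift in teacher outputs absorbed after `steps` updates."""
    return 1.0 - momentum ** steps

# One update from a zero center moves it by (1 - cm) of the batch mean.
center = update_center(np.zeros(4), np.ones((8, 4)), momentum=0.999)

for cm in (0.9, 0.95, 0.99, 0.999):
    print(
        f"cm={cm}: window ~{effective_window(cm):.0f} steps, "
        f"absorbed after 100 steps = {absorbed_fraction(cm, 100):.3f}"
    )
```

At cm = 0.9 the center tracks the teacher within about ten steps, while at 0.999 it averages over roughly a thousand, so short-lived structure in the teacher outputs is not immediately subtracted away.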

3.2 Augmentation: The ColorJitter Trap

Intensity variations in CT distinguish pathologies (e.g., ground-glass opacity vs. solid nodule). Teaching invariance to intensity (via ColorJitter) teaches the model to ignore the signal.

Augmentation	Ratio (10K)	Relative
Spatial only (RRC + HFlip)	49.7	baseline
+ ColorJitter	25.0	−50%

3.3 Scaling Behavior

We track the learning trajectory of ViT-Small over 100K steps:

Steps	Loss	Ratio	Top-1	Phase
2K	5.76	6	0.29%	Breakout
20K	0.44	311	15.2%	Rapid Learning
100K	0.23	1,032	25.2%	Diminishing Returns

3.4 Capacity-Dependent Regularization

ViT-Large requires explicit uniformity enforcement. Without KoLeo regularization, it drives the pretext loss to near zero via representation collapse; adding KoLeo with $\lambda_{koleo}=0.1$ keeps training healthy.

Configuration	100K Loss	100K Ratio	Status
ViT-L, no KoLeo	0.0004	4	Collapsed
ViT-L, with KoLeo	0.27	500	Healthy

3.5 Clinical Utility: Malignancy Probing

We evaluated frozen backbones on malignancy classification (430 malignant, 1,665 benign nodules).

Model	Feature Type	AUC-ROC
ViT-S 100K	Avg patch tokens	0.687
ViT-L 100K	CLS token	0.668
ViT-S 100K	CLS token	0.663
Supervised ResNet18	—	0.767

Negative Result: Aggregating features across Z-slices (3D mean pooling) actually decreased AUC (0.687 $\to$ 0.650). True 3D awareness likely requires architectural changes like volumetric patch tokens, not post-hoc aggregation.
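The feature types compared above can be sketched as simple pooling operations on ViT outputs. This is a hedged illustration, not the evaluation code: it assumes the CLS token sits at index 0 of the token sequence, uses 196 patch tokens as a placeholder count, and takes "3D mean pooling" to mean averaging per-slice feature vectors across the slices that cover a nodule.

```python
import numpy as np

def cls_token(tokens):
    """tokens: (1 + P, D) sequence for one slice. ASSUMPTION: CLS is at index 0."""
    return tokens[0]

def avg_patch_tokens(tokens):
    """Mean over the P patch tokens, skipping the CLS token."""
    return tokens[1:].mean(axis=0)

def pool_across_slices(per_slice_features):
    """Post-hoc aggregation: average (S, D) per-slice features into one (D,) vector."""
    return np.asarray(per_slice_features).mean(axis=0)

rng = np.random.default_rng(0)
num_slices, num_patches, dim = 5, 196, 384  # placeholder sizes; dim matches ViT-Small
tokens_per_slice = rng.normal(size=(num_slices, 1 + num_patches, dim))

slice_features = np.stack([avg_patch_tokens(t) for t in tokens_per_slice])
nodule_feature = pool_across_slices(slice_features)
print(slice_features.shape, nodule_feature.shape)
```

Averaging treats every slice as exchangeable and discards their order along Z, which is one plausible reason this kind of aggregation adds little beyond a single well-chosen slice.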

3.6 Resolution: 224 vs 448

At matched step counts, 448px inputs reached a 7× lower retrieval ratio while training 2.7× slower. At this dataset scale, 224px is the pragmatic choice.

4. Representation Analysis

Metric	ViT-S 100K	ViT-L 100K
Active dims (std > 0.01)	384/384	1024/1024
Pairwise cosine sim	0.887	0.865
Class separation	0.003	0.002

Both models use all embedding dimensions, so neither shows dimensional collapse, which is consistent with KoLeo acting as a uniformity regularizer. High pairwise similarity is expected for single-domain lung CT data.
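Two of the metrics in the table are straightforward to compute from a matrix of embeddings. The sketch below shows one reasonable way to do so; the exact definitions used in the study may differ, and "class separation" is omitted because the post does not define it.

```python
import numpy as np

def active_dims(embeddings, threshold=0.01):
    """Count embedding dimensions whose standard deviation across samples exceeds threshold."""
    return int((embeddings.std(axis=0) > threshold).sum())

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of samples."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    n = len(x)
    off_diagonal = sims.sum() - np.trace(sims)  # drop self-similarities
    return float(off_diagonal / (n * (n - 1)))

rng = np.random.default_rng(0)
# Synthetic stand-in: a shared offset plus noise mimics a single-domain dataset.
embeddings = rng.normal(size=(100, 384)) + 3.0

partially_dead = embeddings.copy()
partially_dead[:, :84] = 1.0  # 84 constant dimensions carry no information

print(active_dims(embeddings), active_dims(partially_dead))
print(round(mean_pairwise_cosine(embeddings), 3))
```

The synthetic example also shows why a high mean cosine similarity is not alarming on its own: a large shared component pushes all pairs close together even when every dimension is active.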

5. The Infrastructure: 64 Experiments for $35.60

All of this — 64 GPU experiments across two hardware platforms — was orchestrated by Ratiocinator, an autonomous LLM-driven research pipeline.

Ratiocinator handles the full lifecycle of the experimental campaign: it provisions Vast.ai RTX 4090 instances, deploys code via Git, installs dependencies, runs training, collects metrics, and tears down instances. By treating the research process itself as a distributed systems problem, Ratiocinator proved that high-velocity architectural ablation (including the diagnosis of the “Entropy Wall”) doesn’t require a massive compute budget — just ruthless pipeline optimization.

Experiment Set	Arms	GPU-hours	Approx. Cost
Loss & CM sweeps	10	11	$3.70
Augmentation ablations	11	9	$3.00
ViT-Small Scaling	6	28	$9.40
ViT-Large Scaling	9	24	$8.80
Clinical Probes	28	33	$10.70
Total	64	105	~$35.60
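The totals in the table can be checked with a few lines of arithmetic, which also yields the implied average price per GPU-hour and per experiment. The derived averages are computed here and do not appear in the original table.

```python
# (arms, GPU-hours, approximate cost in USD), copied from the table above.
campaign = {
    "Loss & CM sweeps": (10, 11, 3.70),
    "Augmentation ablations": (11, 9, 3.00),
    "ViT-Small Scaling": (6, 28, 9.40),
    "ViT-Large Scaling": (9, 24, 8.80),
    "Clinical Probes": (28, 33, 10.70),
}

total_arms = sum(arms for arms, _, _ in campaign.values())
total_hours = sum(hours for _, hours, _ in campaign.values())
total_cost = round(sum(cost for _, _, cost in campaign.values()), 2)

# Derived averages (not stated in the post).
cost_per_gpu_hour = total_cost / total_hours
cost_per_experiment = total_cost / total_arms

print(f"arms={total_arms}, GPU-hours={total_hours}, cost=${total_cost:.2f}")
print(f"~${cost_per_gpu_hour:.3f} per GPU-hour, ~${cost_per_experiment:.2f} per experiment")
```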

6. Practical Recipes

The ViT-Small Recipe (Recommended)

training:
  loss: dino + gram + koleo
  center_momentum: 0.999  # CRITICAL
  ema: 0.996
  lr: 2e-4
  batch_size: 64
  steps: 50,000–100,000

augmentation:
  - RandomResizedCrop(224, scale=(0.5, 1.0))
  - RandomHorizontalFlip()
  # NO ColorJitter, NO GaussianBlur

data:
  channels: 3  # (z-1, z, z+1)
  windowing: random HU window per sample

Common Pitfalls

Pitfall	Symptom	Fix
cm too low (< 0.999)	Loss stuck at 9.01	Set cm=0.999
ColorJitter	~50% ratio drop	Remove intensity aug
ViT-L without KoLeo	Loss $\to$ 0, Ratio $\to$ 1	Add koleo_weight=0.1
ViT-L with high LR	Oscillating loss	Use lr=5e-5

Try It Yourself

The complete training framework, experiment configurations, and pre-trained weights are available:

💻 Code: DINO-X — Lung CT SSL training framework
🤖 Orchestrator: Ratiocinator — The autonomous experiment runner
📊 Dataset: LIDC-IDRI on the Hub (PNG version)

# Clone the repository
git clone https://github.com/timlawrenz/DINO-X
cd DINO-X

# Run the optimized ViT-Small recipe
python train.py --config configs/vit_small_medical.yaml --center_momentum 0.999

References

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision, 9650–9660.
Sablayrolles, A., Douze, M., Schmid, C., & Jégou, H. (2019). Spreading vectors for similarity search. International Conference on Learning Representations.
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., & others. (2023). DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research.
Meta AI. (2025). DINOv3: Self-supervised learning for vision at unprecedented scale. ArXiv Preprint.
