Self-Supervised Pretraining Recipes for Lung CT: A Systematic Study with DINO
Self-supervised learning (SSL) promises to unlock the diagnostic potential of large unlabeled medical image archives, yet practitioners face a daunting hyperparameter landscape with little domain-specific guidance. We present a systematic study of pretraining recipes for lung computed tomography (CT), evaluating 60+ experimental configurations on the LIDC-IDRI dataset.
- 1. The Challenges of Medical SSL
- 2. Methodology: The DINO-X System
- 3. Results & Discussion
- 4. Representation Analysis
- 5. The Infrastructure: 64 Experiments for $35.60
- 6. Practical Recipes
- Try It Yourself
- References
1. The Challenges of Medical SSL
Medical CT presents unique challenges that break standard natural-image SSL assumptions:
- Hounsfield Unit (HU) encoding: Pixel intensities carry calibrated tissue density information. Standard photometric augmentations destroy this signal.
- Volumetric context: Pathology manifests across multiple slices. 2D methods must decide how to handle the Z-axis.
- Entropy wall: DINO (Caron et al., 2021) training on medical CT frequently stalls at the theoretical maximum entropy, producing uniform outputs that carry no information.
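To make the "theoretical maximum entropy" concrete: the DINO loss is a cross-entropy between the teacher and student distributions over $K$ output dimensions, and when the student output is uniform that cross-entropy equals $\log K$ whatever the teacher predicts. The numeric value below assumes $K = 8192$, which this article does not state; it is used only because it reproduces the plateau near 9.01 reported in Section 3.1.

$$
H(p_t, p_s) = -\sum_{k=1}^{K} p_t(k)\,\log p_s(k)
\;\;\xrightarrow{\;p_s(k)\,=\,1/K\;}\;\;
\log K \sum_{k=1}^{K} p_t(k) = \log K,
\qquad \log 8192 = 13 \log 2 \approx 9.01
$$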
Core Contributions
- Entropy wall solution: We identify center momentum as the critical factor for DINO on medical CT.
- Medical augmentation guidelines: Evidence that spatial-only augmentation is optimal for HU data.
- Capacity-dependent regularization: We discover that KoLeo regularization (Sablayrolles et al., 2019) is critical for ViT-Large but optional for ViT-Small.
- Scaling analysis: We trace ViT-Small from random weights to a retrieval ratio of 1,032 after 100K steps.
- Clinical evaluation: Establishing malignancy classification baselines and testing 3D feature aggregation.
2. Methodology: The DINO-X System
2.1 Architecture
We evaluate two scales of Vision Transformer (ViT) backbones:
| Feature | ViT-Small | ViT-Large |
|---|---|---|
| Embedding dim | 384 | 1024 |
| Depth | 12 | 24 |
| Heads | 6 | 16 |
| Backbone params | 21.6M | 303.2M |
| Total params | 24.9M | 312.6M |
2.2 Loss Functions & Online Gram Alignment
Our system builds on DINOv2 (Oquab et al., 2023) and independently adopts online Gram alignment. Unlike DINOv3 (AI, 2025), which anchors the student to frozen historical checkpoints, DINO-X matches the student's patch-token Gram matrix to that of the current EMA teacher at every step.
The total loss is defined as: \(L = L_{DINO} + \lambda_{gram} \cdot L_{gram} + \lambda_{koleo} \cdot L_{koleo}\)
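The two auxiliary terms can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the DINO-X implementation: the function names are hypothetical, the Gram term is assumed to be a mean squared difference between Gram matrices of L2-normalized patch tokens, the KoLeo term follows the nearest-neighbour form of Sablayrolles et al. (2019), and `lambda_gram=1.0` is a placeholder. Only `lambda_koleo=0.1` comes from this article.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row of x to unit Euclidean length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def gram_loss(student_patches, teacher_patches):
    """Mean squared difference between patch-token Gram matrices.

    Both inputs have shape (num_patches, dim). The teacher side stands in for
    the current EMA teacher and is treated as a fixed target.
    """
    s = l2_normalize(student_patches)
    t = l2_normalize(teacher_patches)
    return float(np.mean((s @ s.T - t @ t.T) ** 2))

def koleo_loss(features, eps=1e-8):
    """KoLeo regularizer: minus the mean log nearest-neighbour distance."""
    x = l2_normalize(features)
    # Pairwise Euclidean distances; mask the diagonal so a point is not its own neighbour.
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    return float(-np.mean(np.log(dists.min(axis=1) + eps)))

def total_loss(l_dino, l_gram, l_koleo, lambda_gram=1.0, lambda_koleo=0.1):
    """L = L_DINO + lambda_gram * L_gram + lambda_koleo * L_koleo."""
    return l_dino + lambda_gram * l_gram + lambda_koleo * l_koleo
```

The KoLeo term grows when features crowd together, so minimizing it pushes embeddings apart. That is the uniformity pressure Section 3.4 relies on for ViT-Large.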
2.3 Data Pipeline
- Dataset: 234,943 axial slices from 981 LIDC-IDRI series.
- Input Encoding: 3-channel input constructed from consecutive slices $(z-1, z, z+1)$ to provide local volumetric context.
- HU Windowing: A random Hounsfield Unit window is applied per sample to simulate various clinical viewing protocols.
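The input encoding and windowing steps above might look like the sketch below. It assumes a volume stored as a `(depth, H, W)` array in Hounsfield Units; the helper names, the edge clamping at the first and last slice, and the window center and width ranges are all illustrative assumptions, since the article does not give them.

```python
import numpy as np

def stack_adjacent_slices(volume_hu, z):
    """Return a (3, H, W) array of slices (z-1, z, z+1), clamping at the volume edges."""
    depth = volume_hu.shape[0]
    idx = [max(z - 1, 0), z, min(z + 1, depth - 1)]
    return volume_hu[idx]

def apply_hu_window(image_hu, center, width):
    """Clip HU values to [center - width/2, center + width/2] and rescale to [0, 1]."""
    low = center - width / 2.0
    high = center + width / 2.0
    return (np.clip(image_hu, low, high) - low) / (high - low)

def random_hu_window(image_hu, rng,
                     center_range=(-700.0, -400.0),
                     width_range=(1000.0, 2000.0)):
    """Draw one window per sample. The ranges are placeholders, not the article's values."""
    center = rng.uniform(*center_range)
    width = rng.uniform(*width_range)
    return apply_hu_window(image_hu, center, width)
```

Because a single window is drawn for the whole 3-channel sample, the three slices stay on a common intensity scale, so relative tissue density between neighbouring slices is preserved.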
3. Results & Discussion
3.1 Breaking the Entropy Wall
The most critical finding: center momentum (cm) must be high enough to allow symmetry-breaking.
| Center Momentum | 2K Loss | 10K Ratio | Trajectory |
|---|---|---|---|
| 0.9 | 9.00 | 4.0 ↓ | Permanently stuck |
| 0.95 | 9.00 | — | Stuck |
| 0.99 | 9.01 | — | Stuck |
| 0.999 | 5.76 | 18.0 | Breaks through ✓ |
At cm $\le$ 0.99, the center vector adapts too quickly, erasing emerging structure. At 0.999, the update is slow enough for meaningful clusters to form.
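For readers unfamiliar with the mechanism, here is a minimal sketch of DINO-style centering as described by Caron et al. (2021): the center is an exponential moving average of the teacher's batch-mean logits and is subtracted before the teacher softmax. Whether DINO-X implements it in exactly this form is an assumption, and the teacher temperature of 0.04 is an assumed value rather than one taken from this article.

```python
import numpy as np

def update_center(center, teacher_logits, momentum):
    """One EMA step: c <- m * c + (1 - m) * batch mean of the teacher logits."""
    batch_mean = teacher_logits.mean(axis=0)
    return momentum * center + (1.0 - momentum) * batch_mean

def teacher_probs(teacher_logits, center, temperature=0.04):
    """Centered, sharpened teacher softmax (temperature is an assumed value)."""
    z = (teacher_logits - center) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def effective_window(momentum):
    """Approximate number of steps the EMA averages over: 1 / (1 - m)."""
    return 1.0 / (1.0 - momentum)

for m in (0.9, 0.95, 0.99, 0.999):
    print(f"cm={m}: center averages over ~{effective_window(m):.0f} steps")
```

The averaging window grows from roughly 10 steps at cm=0.9 to roughly 1000 at cm=0.999. This is one way to read the table: with a short window the center tracks each batch closely enough to cancel any output dimension that starts to stand out.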
3.2 Augmentation: The ColorJitter Trap
Intensity variations in CT distinguish pathologies (e.g., ground-glass opacity vs. solid nodule). Teaching invariance to intensity (via ColorJitter) teaches the model to ignore the signal.
| Augmentation | Ratio (10K) | Relative |
|---|---|---|
| Spatial only (RRC + HFlip) | 49.7 | baseline |
| + ColorJitter | 25.0 | −50% |
3.3 Scaling Behavior
We track the learning trajectory of ViT-Small over 100K steps:
| Steps | Loss | Ratio | Top-1 | Phase |
|---|---|---|---|---|
| 2K | 5.76 | 6 | 0.29% | Breakout |
| 20K | 0.44 | 311 | 15.2% | Rapid Learning |
| 100K | 0.23 | 1,032 | 25.2% | Diminishing Returns |
3.4 Capacity-Dependent Regularization
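ViT-Large requires explicit uniformity enforcement. Without the KoLeo term (enabled with $\lambda_{koleo}=0.1$), it solves the pretext task via representation collapse: the loss falls to nearly zero while the retrieval ratio stays near its starting value.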
| Configuration | 100K Loss | 100K Ratio | Status |
|---|---|---|---|
| ViT-L, no KoLeo | 0.0004 | 4 | Collapsed |
| ViT-L, with KoLeo | 0.27 | 500 | Healthy |
3.5 Clinical Utility: Malignancy Probing
We evaluated frozen backbones on malignancy classification (430 malignant, 1,665 benign nodules).
| Model | Feature Type | AUC-ROC |
|---|---|---|
| ViT-S 100K | Avg patch tokens | 0.687 |
| ViT-L 100K | CLS token | 0.668 |
| ViT-S 100K | CLS token | 0.663 |
| Supervised ResNet18 | — | 0.767 |
Negative Result: Aggregating features across Z-slices (3D mean pooling) actually decreased AUC (0.687 $\to$ 0.650). True 3D awareness likely requires architectural changes like volumetric patch tokens, not post-hoc aggregation.
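The feature types in the table, and the post-hoc 3D aggregation that did not help, can be sketched as below. The sketch assumes the backbone returns a `(1 + num_patches, dim)` token array with the CLS token at index 0 and no register tokens; the function names are hypothetical.

```python
import numpy as np

def cls_feature(tokens):
    """tokens: (1 + num_patches, dim), CLS token first. Returns the CLS vector (dim,)."""
    return tokens[0]

def avg_patch_feature(tokens):
    """Mean over patch tokens, excluding the CLS token. Returns (dim,)."""
    return tokens[1:].mean(axis=0)

def z_mean_pool(slice_features):
    """Post-hoc 3D aggregation: average the per-slice features of one nodule.

    slice_features: (n_slices, dim) -> (dim,)
    """
    return np.asarray(slice_features).mean(axis=0)
```

Mean pooling weights every slice equally, so slices that barely intersect the nodule count as much as the central one. That may be one reason the aggregated feature scored lower, although the article does not test this explanation.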
3.6 Resolution: 224 vs 448
At matched step counts, training at 448px yielded a 7× lower retrieval ratio than 224px while running 2.7× slower. For this dataset scale, 224px is the pragmatic choice.
4. Representation Analysis
| Metric | ViT-S 100K | ViT-L 100K |
|---|---|---|
| Active dims (std > 0.01) | 384/384 | 1024/1024 |
| Pairwise cosine sim | 0.887 | 0.865 |
| Class separation | 0.003 | 0.002 |
Both models use all embedding dimensions, confirming KoLeo prevents dimensional collapse. High pairwise similarity is expected for the single-domain lung CT data.
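The first two metrics in the table are straightforward to reproduce on any matrix of embeddings. The sketch below assumes an `(n_samples, dim)` NumPy array; the function names are hypothetical, and the class-separation metric is omitted because the article does not define it.

```python
import numpy as np

def active_dims(embeddings, threshold=0.01):
    """Count dimensions whose standard deviation across samples exceeds the threshold."""
    return int((embeddings.std(axis=0) > threshold).sum())

def mean_pairwise_cosine(embeddings, eps=1e-8):
    """Average cosine similarity over all ordered pairs of distinct rows."""
    x = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + eps)
    sim = x @ x.T
    n = len(x)
    off_diag_sum = sim.sum() - np.trace(sim)  # drop the self-similarity terms
    return float(off_diag_sum / (n * (n - 1)))
```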
5. The Infrastructure: 64 Experiments for $35.60
All of this — 64 GPU experiments across two hardware platforms — was orchestrated by Ratiocinator, an autonomous LLM-driven research pipeline.
Ratiocinator handles the full lifecycle of the experimental campaign: it provisions Vast.ai RTX 4090 instances, deploys code via Git, installs dependencies, runs training, collects metrics, and tears down instances. By treating the research process itself as a distributed systems problem, Ratiocinator proved that high-velocity architectural ablation (including the diagnosis of the “Entropy Wall”) doesn’t require a massive compute budget — just ruthless pipeline optimization.
| Experiment Set | Arms | GPU-hours | Approx. Cost |
|---|---|---|---|
| Loss & CM sweeps | 10 | 11 | $3.70 |
| Augmentation ablations | 11 | 9 | $3.00 |
| ViT-Small Scaling | 6 | 28 | $9.40 |
| ViT-Large Scaling | 9 | 24 | $8.80 |
| Clinical Probes | 28 | 33 | $10.70 |
| Total | 64 | 105 | ~$35.60 |
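As a quick sanity check on the table, the totals and the implied hourly rate can be recomputed from the rows. The figures are copied from the table; the variable names are illustrative.

```python
# (arms, GPU-hours, approximate cost in USD) per experiment set, from the table above
experiments = {
    "Loss & CM sweeps": (10, 11, 3.70),
    "Augmentation ablations": (11, 9, 3.00),
    "ViT-Small Scaling": (6, 28, 9.40),
    "ViT-Large Scaling": (9, 24, 8.80),
    "Clinical Probes": (28, 33, 10.70),
}

total_arms = sum(arms for arms, _, _ in experiments.values())
total_hours = sum(hours for _, hours, _ in experiments.values())
total_cost = sum(cost for _, _, cost in experiments.values())
cost_per_gpu_hour = total_cost / total_hours

print(f"{total_arms} arms, {total_hours} GPU-hours, ${total_cost:.2f} total")
print(f"~${cost_per_gpu_hour:.2f} per GPU-hour")
```

The rows sum to the stated totals, and the implied rate works out to roughly $0.34 per GPU-hour.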
6. Practical Recipes
The ViT-Small Recipe (Recommended)
training:
  loss: dino + gram + koleo
  center_momentum: 0.999 # CRITICAL
  ema: 0.996
  lr: 2e-4
  batch_size: 64
  steps: 50,000–100,000
augmentation:
  - RandomResizedCrop(224, scale=(0.5, 1.0))
  - RandomHorizontalFlip()
  # NO ColorJitter, NO GaussianBlur
data:
  channels: 3 # (z-1, z, z+1)
  windowing: random HU window per sample
Common Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| cm too low (< 0.999) | Loss stuck at 9.01 | Set cm=0.999 |
| ColorJitter | 50%+ ratio drop | Remove intensity aug |
| ViT-L without KoLeo | Loss $\to$ 0, Ratio $\to$ 1 | Add koleo_weight=0.1 |
| ViT-L with high LR | Oscillating loss | Use lr=5e-5 |
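The first and third rows of the table lend themselves to an automated check during training. The heuristic below is only a sketch: the function name, every threshold, and the head size of 8192 output dimensions are assumptions chosen so that the values reported in Sections 3.1 and 3.4 are classified as expected.

```python
import math

def diagnose(loss, ratio, num_prototypes=8192,
             wall_tol=0.05, collapse_loss=0.01, collapse_ratio=10.0):
    """Heuristic label for a run. All thresholds and num_prototypes are illustrative."""
    # A uniform student gives a cross-entropy of log(K): the entropy wall.
    if abs(loss - math.log(num_prototypes)) < wall_tol:
        return "entropy wall: raise center_momentum"
    # Near-zero loss with a near-baseline retrieval ratio indicates collapse.
    if loss < collapse_loss and ratio < collapse_ratio:
        return "collapse: add KoLeo"
    return "no known pitfall detected"

print(diagnose(loss=9.00, ratio=4.0))    # cm=0.9 row from Section 3.1
print(diagnose(loss=0.0004, ratio=4.0))  # ViT-L without KoLeo, Section 3.4
print(diagnose(loss=0.27, ratio=500.0))  # ViT-L with KoLeo, Section 3.4
```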
Try It Yourself
The complete training framework, experiment configurations, and pre-trained weights are available:
- 💻 Code: DINO-X — Lung CT SSL training framework
- 🤖 Orchestrator: Ratiocinator — The autonomous experiment runner
- 📊 Dataset: LIDC-IDRI on the Hub (PNG version)
# Clone the repository
git clone https://github.com/timlawrenz/DINO-X
cd DINO-X
# Run the optimized ViT-Small recipe
python train.py --config configs/vit_small_medical.yaml --center_momentum 0.999
References
- Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision, 9650–9660.
- Sablayrolles, A., Douze, M., Schmid, C., & Jégou, H. (2019). Spreading vectors for similarity search. International Conference on Learning Representations.
- Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Soyer, M., Kinnunen, J., & others. (2023). Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research.
- AI, M. (2025). DINOv3: Self-supervised learning for vision at unprecedented scale. ArXiv Preprint.