SLURM SAE Pipeline¶
This experiment demonstrates distributed training setup for large-scale SAE training on cluster environments using SLURM.
Overview¶
This pipeline shows how to: - Configure SLURM jobs for activation saving - Set up distributed SAE training on clusters - Manage resources and job dependencies - Handle large-scale datasets efficiently
Prerequisites¶
- Access to SLURM cluster
- Understanding of cluster computing
- Large-scale dataset
- Sufficient cluster resources
Experiment Structure¶
slurm_sae_pipeline/
├── 01_save_activations.py # Activation saving script
├── 02_train_sae.py # SAE training script
├── submit_save_activations.sh # SLURM submission script for activations
├── submit_train_sae.sh # SLURM submission script for training
└── README.md # Pipeline documentation
Configuration¶
SLURM Job Configuration¶
The submission scripts configure: - Resources: GPU allocation, memory, time limits - Dependencies: Job ordering (train after save) - Environment: Python environment setup - Output: Logging and error handling
mi-crow Configuration¶
The Python scripts use standard mi-crow APIs:
- lm.activations.save() for distributed activation saving
- SaeTrainer.train() for SAE training
- Store configuration for cluster filesystems
Workflow¶
- Submit activation saving job: Uses
submit_save_activations.sh - Wait for completion: Activation saving must complete first
- Submit training job: Uses
submit_train_sae.sh(depends on step 1) - Monitor jobs: Track progress through SLURM
Key Features¶
- Distributed processing: Handles large datasets across cluster nodes
- Resource management: Proper GPU and memory allocation
- Job dependencies: Ensures correct execution order
- Error handling: Robust failure recovery
Customization¶
Adapt the pipeline for your cluster: - Modify resource requests in submission scripts - Adjust batch sizes for available memory - Configure store paths for cluster filesystem - Set appropriate time limits
Related Documentation¶
- Training SAE Models - SAE training guide
- Saving Activations - Activation saving guide
- Best Practices - Performance optimization