Troubleshooting¶
This guide covers common issues and their solutions when using mi-crow.
Common Errors and Solutions¶
Layer Signature Not Found¶
Error: ValueError: Layer signature not found
Causes:
- Incorrect layer name
- Layer doesn't exist in the model
- Typo in the layer name
Solutions:
# 1. List available layers
layers = lm.layers.list_layers()
print("Available layers:", layers)

# 2. Use the exact name from the list
layer_name = layers[0]  # Don't guess!

# 3. Verify the layer exists before using it
if layer_name in layers:
    hook_id = lm.layers.register_hook(layer_name, detector)
else:
    print(f"Layer {layer_name} not found!")
Out of Memory¶
Error: RuntimeError: CUDA out of memory
Causes:
- Batch size too large
- Model too large for the GPU
- Accumulating tensors in memory
Solutions:
# 1. Reduce batch size
run_id = lm.activations.save(
    layer_signature="layer_0",
    dataset=dataset,
    batch_size=1,  # Minimal batch size
    sample_limit=100
)

# 2. Use CPU instead of GPU
sae = TopKSae(n_latents=4096, n_inputs=768, k=32, device="cpu")

# 3. Clear detector data periodically
detector.clear_captured()

# 4. Move tensors to CPU
activations = detector.get_captured()
activations_cpu = activations.detach().cpu()
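If memory still accumulates over a long run, processing inputs in small chunks and offloading between chunks usually helps. The sketch below is a hedged example built from the calls shown above; it assumes a `texts` list of input strings and that `detector.get_captured()` returns a tensor.

# Hedged sketch: run inference in small chunks, keeping only CPU copies.
# Assumes `texts` is a list of strings and `detector` is the hook used above.
all_captures = []
for i in range(0, len(texts), 8):
    chunk = texts[i:i + 8]
    lm.inference.execute_inference(chunk)
    captured = detector.get_captured()
    if captured is not None:
        all_captures.append(captured.detach().cpu())  # keep only CPU copies
    detector.clear_captured()  # free GPU memory before the next chunk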
Hook Not Executing¶
Symptoms: Hook doesn't seem to run, no data collected
Causes:
- Hook not registered
- Hook disabled
- Wrong layer signature
- Hook unregistered too early
Solutions:
# 1. Verify hook is registered
hook_id = lm.layers.register_hook("layer_0", detector)
assert hook_id in lm.layers.context._hook_id_map

# 2. Check hook is enabled
assert detector.is_enabled(), "Hook is disabled!"

# 3. Verify layer name is correct
layers = lm.layers.list_layers()
assert "layer_0" in layers, "Layer doesn't exist!"

# 4. Ensure hook stays registered during inference
outputs, encodings = lm.inference.execute_inference(["test"])
activations = detector.get_captured()  # Read before unregistering
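Putting these checks together, a minimal sketch of a safe hook lifecycle (using only the calls shown above) keeps the hook registered for the entire inference and unregisters it only after reading the captures:

# Hedged sketch, in order: register, run, read, unregister.
hook_id = lm.layers.register_hook("layer_0", detector)
try:
    outputs, encodings = lm.inference.execute_inference(["test"])
    activations = detector.get_captured()  # read while the hook is still registered
    assert activations is not None, "Hook ran but captured nothing"
finally:
    lm.layers.unregister_hook(hook_id)  # clean up even if inference fails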
SAE Training Instability¶
Symptoms: Loss doesn't decrease, weights not learning
Causes:
- Learning rate too high
- Too much regularization
- Not enough training data
- Dead features
Solutions:
# 1. Reduce learning rate
config = SaeTrainingConfig(
    epochs=100,
    batch_size=256,
    lr=1e-4,  # Lower learning rate
    l1_lambda=1e-5  # Lower regularization
)

# 2. Check for dead features
dead_count = history['dead_features'][-1]
if dead_count > sae.n_latents * 0.1:  # More than 10% dead
    # Reduce sparsity by increasing k
    sae = TopKSae(n_latents=4096, n_inputs=768, k=64, device="cuda")

# 3. Verify weights are learning
weight_var = sae.encoder.weight.var().item()
if weight_var < 0.01:
    print("Warning: Weights may not be learning!")
    # Try a different learning rate or more epochs
Poor SAE Reconstruction¶
Symptoms: Low R² score, high reconstruction error
Causes:
- Model capacity too small
- Not enough training
- Wrong hyperparameters
Solutions:
# 1. Increase model capacity
sae = TopKSae(
    n_latents=8192,  # More neurons
    n_inputs=768,
    k=64,  # More active neurons
    device="cuda"
)

# 2. Train longer
config = SaeTrainingConfig(
    epochs=200,  # More epochs
    batch_size=256,
    lr=1e-3
)

# 3. Adjust hyperparameters
config = SaeTrainingConfig(
    epochs=100,
    batch_size=256,
    lr=1e-3,
    l1_lambda=1e-5  # Less regularization
)
Layer Signature Issues¶
Finding Correct Layer Names¶
# List all layers
all_layers = lm.layers.list_layers()

# Filter by pattern
attention_layers = [l for l in all_layers if "attn" in l]
mlp_layers = [l for l in all_layers if "mlp" in l]

# Print for inspection
for i, layer in enumerate(all_layers):
    print(f"{i}: {layer}")
Layer Name Variations¶
Different models use different naming conventions:
# GPT-2 style
"transformer.h.0.attn.c_attn"
# BERT style
"bert.encoder.layer.0.attention.self.query"
# Custom models
"model.layers.0.self_attn.q_proj"
# Always check your specific model
layers = lm.layers.list_layers()
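A quick way to discover which convention your model uses is to count layer-name prefixes. This sketch relies only on the `layers` list returned by `list_layers()`:

from collections import Counter

# Count the first two path components of each layer name to reveal the convention
prefixes = Counter(".".join(name.split(".")[:2]) for name in layers)
for prefix, count in prefixes.most_common():
    print(f"{prefix}: {count} layers")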
Memory Problems¶
GPU Memory¶
# Check GPU memory
import torch

if torch.cuda.is_available():
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

    # Clear cache if needed
    torch.cuda.empty_cache()
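To see how much memory a specific step costs, measure allocation before and after it. The sketch below uses only standard PyTorch calls:

import torch

# Measure GPU memory used by a single step (standard PyTorch only)
if torch.cuda.is_available():
    torch.cuda.synchronize()
    before = torch.cuda.memory_allocated()
    # ... run the step you suspect, e.g. one batch of inference ...
    torch.cuda.synchronize()
    after = torch.cuda.memory_allocated()
    print(f"Step used {(after - before) / 1e9:.3f} GB")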
Memory Leaks¶
# Check for unregistered hooks
hook_count = len(lm.layers.context._hook_id_map)
print(f"Registered hooks: {hook_count}")

# Unregister all if needed
for hook_id in list(lm.layers.context._hook_id_map.keys()):
    lm.layers.unregister_hook(hook_id)
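To make leaks harder to introduce in the first place, you can wrap registration in a context manager so unregistration is automatic. This is a minimal sketch built only from the register/unregister calls shown above:

from contextlib import contextmanager

# Hedged sketch: auto-unregistering hook registration
@contextmanager
def registered_hook(lm, layer_name, detector):
    hook_id = lm.layers.register_hook(layer_name, detector)
    try:
        yield hook_id
    finally:
        lm.layers.unregister_hook(hook_id)

# Usage: the hook is removed even if inference raises
with registered_hook(lm, "layer_0", detector):
    lm.inference.execute_inference(["test"])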
Accumulating Data¶
# Clear detector data
detector.clear_captured()
detector.tensor_metadata.clear()

# Move to CPU
activations = detector.get_captured()
if activations is not None:
    activations_cpu = activations.detach().cpu()
    # Use the CPU copy; the original can then be garbage collected
Training Instability¶
Loss Not Decreasing¶
# Check training history
print(f"Initial loss: {history['loss'][0]}")
print(f"Final loss: {history['loss'][-1]}")

if history['loss'][-1] >= history['loss'][0]:
    print("Loss didn't decrease!")
    # Try:
    # - A lower learning rate
    # - More epochs
    # - A different initialization
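A point-to-point comparison can be noisy. Comparing averages over the first and last stretches of training gives a more robust signal; a small sketch over the same `history` dict used above:

# Compare average loss over the first and last 10% of training
losses = history['loss']
n = max(1, len(losses) // 10)
early = sum(losses[:n]) / n
late = sum(losses[-n:]) / n
print(f"Early avg: {early:.4f}, late avg: {late:.4f}")
if late >= early:
    print("No meaningful improvement; try a lower lr or more epochs")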
Exploding Gradients¶
# Reduce learning rate
config = SaeTrainingConfig(
    epochs=100,
    batch_size=256,
    lr=1e-4,  # Much lower learning rate
    l1_lambda=1e-5
)

# Or use gradient clipping (if supported); see the sketch below
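If the built-in trainer does not expose clipping, you can apply it in a custom loop with standard PyTorch. The sketch below is generic, not mi-crow-specific, and uses a placeholder model and loss:

import torch

# Hedged sketch: gradient clipping in a generic training step (plain PyTorch)
model = torch.nn.Linear(768, 4096)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
batch = torch.randn(32, 768)

optimizer.zero_grad()
loss = model(batch).pow(2).mean()  # placeholder loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap gradient norm
optimizer.step()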
Dead Features¶
# Check dead features
dead_ratio = history['dead_features'][-1] / sae.n_latents
if dead_ratio > 0.1:
    print(f"Too many dead features: {dead_ratio:.2%}")
    # Solutions:
    # - Increase k (more active latents per sample)
    # - Reduce l1_lambda
    # - Increase the learning rate slightly
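If your training history does not track dead features, you can estimate them from a batch of latent codes. The sketch assumes `latents` is a [n_samples, n_latents] tensor produced by your SAE's encoder; the random tensor here is only a placeholder:

import torch

# Hedged sketch: count latents that never fire on a sample batch
latents = torch.relu(torch.randn(1024, 4096))  # placeholder; use real codes here
ever_active = (latents != 0).any(dim=0)
dead = (~ever_active).sum().item()
print(f"Dead features: {dead} / {latents.shape[1]} ({dead / latents.shape[1]:.2%})")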
Device Compatibility¶
CUDA Issues¶
import torch
from mi_crow.language_model import LanguageModel
from mi_crow.store import LocalStore

# Check CUDA availability
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")

# Always choose the device based on availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

store = LocalStore(base_path="./store")

# This raises a clear ValueError if you force device="cuda" but CUDA is not available
lm = LanguageModel.from_huggingface(
    "sshleifer/tiny-gpt2",
    store=store,
    device=device,
)
MPS (Apple Silicon)¶
# Check MPS availability
import torch

if torch.backends.mps.is_available():
    device = "mps"
    print("Using Apple Silicon GPU")
else:
    device = "cpu"
    print("Using CPU")
Device Mismatch¶
# Ensure the model and data are on the same device
model = model.to(device)
data = data.to(device)

# Check devices
print(f"Model device: {next(model.parameters()).device}")
print(f"Data device: {data.device}")
Import Errors¶
Module Not Found¶
# Verify installation
import mi_crow
print(mi_crow.__version__)
# Check imports
from mi_crow.language_model import LanguageModel
from mi_crow.hooks import Detector
Version Mismatches¶
# Check versions
import torch
import transformers
print(f"PyTorch: {torch.__version__}")
print(f"Transformers: {transformers.__version__}")
# Update if needed
# pip install --upgrade torch transformers
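For a programmatic check, the `packaging` library compares version strings correctly (unlike plain string comparison). The minimum version below is a placeholder, not mi-crow's actual requirement; check the project's own dependency list for the real bound.

from packaging import version
import torch

# Placeholder minimum; check mi-crow's requirements for the real bound
MIN_TORCH = "2.0.0"
if version.parse(torch.__version__) < version.parse(MIN_TORCH):
    print(f"PyTorch {torch.__version__} is older than {MIN_TORCH}; consider upgrading")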
Getting Help¶
Check Documentation¶
- User guide sections
- API reference
- Example notebooks
- Experiment walkthroughs
Debug Information¶
# Enable verbose logging
import logging
logging.basicConfig(level=logging.DEBUG)
# Check system info
import sys
import torch
print(f"Python: {sys.version}")
print(f"PyTorch: {torch.__version__}")
Reproduce Issues¶
# Create minimal reproduction
# 1. Start with simplest possible case
# 2. Add complexity until issue appears
# 3. Document exact steps
import torch
from mi_crow.language_model import LanguageModel
from mi_crow.store import LocalStore
store = LocalStore(base_path="./store")
device = "cuda" if torch.cuda.is_available() else "cpu"
lm = LanguageModel.from_huggingface("sshleifer/tiny-gpt2", store=store, device=device)
layers = lm.layers.list_layers()
print(layers)
Next Steps¶
- Best Practices - Prevent issues before they occur
- Examples - See working code
- API Reference - Detailed API documentation