Core Concepts¶
This guide explains the fundamental concepts and architecture of mi-crow. Understanding these concepts will help you use the library effectively.
Language Models in mi-crow¶
mi-crow provides a unified interface for working with language models through the LanguageModel class. It wraps PyTorch models (typically from HuggingFace) and provides:
- Inference: Run forward passes with forwards() and generation with generate()
- Layer Access: Inspect and manipulate individual layers
- Activation Saving: Collect activations from any layer
- Hook Integration: Attach hooks for observation and control
Model Loading¶
from mi_crow.language_model import LanguageModel
from mi_crow.store import LocalStore
import torch
store = LocalStore(base_path="./store")
# Use GPU when available
device = "cuda" if torch.cuda.is_available() else "cpu"
lm = LanguageModel.from_huggingface(
"gpt2", # Or any HuggingFace model
store=store,
device=device,
)
The tokenizer is loaded automatically alongside the model, so it is ready for inference immediately. You can access layers through lm.layers and run inference with lm.inference.execute_inference().
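For example, a minimal round trip might look like the sketch below, built only from the calls mentioned above; the exact contents of lm.layers and the return values of execute_inference depend on the wrapped model:
# Inspect the wrapped model's layers (the exact container type may vary)
print(lm.layers)
# Run a forward pass over a small batch of prompts; as shown later in this
# guide, execute_inference returns the model outputs and token encodings
outputs, encodings = lm.inference.execute_inference([
    "The quick brown fox",
    "Interpretability research is",
])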
Sparse Autoencoders (SAE)¶
Sparse Autoencoders are the core interpretability tool in mi-crow. They learn to represent model activations using a sparse set of interpretable features.
What are SAEs?¶
An SAE is a neural network that:
1. Takes dense activations from a model layer as input
2. Encodes them into a sparse latent representation
3. Decodes back to reconstruct the original activations
The sparsity constraint encourages the SAE to learn discrete, interpretable features (neurons) that correspond to meaningful concepts.
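Conceptually, the encode/sparsify/decode cycle of a TopK SAE can be written in a few lines of plain PyTorch. This is only an illustrative sketch of the idea, not mi-crow's actual TopKSae implementation:
import torch

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Illustrative TopK SAE forward pass (not the library implementation).

    x:     dense activations, shape (batch, n_inputs)
    W_enc: (n_inputs, n_latents), W_dec: (n_latents, n_inputs)
    """
    # Encode: project into the (overcomplete) latent space
    pre_acts = x @ W_enc + b_enc                      # (batch, n_latents)

    # Sparsify: keep only the top-k activations per example, zero the rest
    topk = torch.topk(pre_acts, k, dim=-1)
    latents = torch.zeros_like(pre_acts)
    latents.scatter_(-1, topk.indices, torch.relu(topk.values))

    # Decode: reconstruct the original activations from the sparse code
    reconstruction = latents @ W_dec + b_dec          # (batch, n_inputs)
    return latents, reconstruction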
Why Use SAEs?¶
- Interpretability: Each SAE neuron often corresponds to a human-understandable concept
- Feature Discovery: Automatically discover what features the model uses
- Control: Manipulate model behavior by amplifying or suppressing specific neurons
- Analysis: Understand which features are important for different tasks
SAE Architecture¶
mi-crow supports TopK SAEs, which enforce sparsity by keeping only the top-K most active neurons:
from mi_crow.mechanistic.sae import TopKSae
sae = TopKSae(
n_latents=4096, # Number of SAE neurons (overcomplete)
n_inputs=768, # Size of input activations
k=32, # Top-K sparsity (only 32 neurons active at once)
device="cuda"
)
The overcomplete ratio (n_latents / n_inputs) determines how many features the SAE can learn; in the example above it is 4096 / 768 ≈ 5.3. Higher ratios allow more fine-grained feature discovery.
Concepts and Interpretability¶
Concepts are human-interpretable meanings associated with SAE neurons. The concept discovery process involves:
- Training an SAE on model activations
- Collecting top texts that activate each neuron
- Manual curation to name concepts based on patterns
- Using concepts to understand and control model behavior
Concept Discovery¶
After training an SAE, you can discover concepts by:
# Enable text tracking during inference
sae.concepts.enable_text_tracking(top_k=10)
# Run inference on a dataset
outputs, encodings = lm.inference.execute_inference(dataset_texts)
# Get top activating texts for each neuron
top_texts = sae.concepts.get_top_texts()
Each neuron's top texts reveal what patterns it detects, allowing you to assign meaningful names like "family relationships" or "scientific terminology".
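To eyeball candidate concepts, you can print a few of these texts per neuron. The return structure of get_top_texts() is not documented on this page, so the loop below assumes a mapping from neuron index to (text, activation) pairs; adjust it to the actual return type:
# Assumption: get_top_texts() returns {neuron_idx: [(text, activation), ...]};
# verify this against the actual return type before relying on it.
for neuron_idx, entries in list(top_texts.items())[:5]:
    print(f"Neuron {neuron_idx}:")
    for text, activation in entries[:3]:
        print(f"  {activation:.3f}  {text[:80]}")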
Concept Manipulation¶
Once concepts are identified, you can manipulate model behavior:
# Amplify a concept (neuron 42)
sae.concepts.manipulate_concept(neuron_idx=42, scale=1.5)
# Suppress a concept
sae.concepts.manipulate_concept(neuron_idx=42, scale=0.5)
# Run inference with modified behavior
outputs, encodings = lm.inference.execute_inference(["Your prompt here"])
Hooks System Overview¶
The hooks system is the foundation of mi-crow's interpretability capabilities. It allows you to intercept and process activations during model inference.
What are Hooks?¶
Hooks are callbacks that execute at specific points during a model's forward pass. They can:
- Observe activations without modification (Detectors)
- Modify activations to change model behavior (Controllers)
Hook Types¶
- Detectors: Collect data, save activations, track statistics
- Controllers: Modify inputs or outputs to steer model behavior
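Both hook types build on PyTorch's standard hook mechanism. The snippet below illustrates the underlying idea with plain torch.nn forward hooks rather than mi-crow's own hook classes; see the Hooks System Guide for the library's API:
import torch
import torch.nn as nn

layer = nn.Linear(8, 8)

# Detector-style hook: observe the output without modifying it
def detector_hook(module, inputs, output):
    print("activation norm:", output.norm().item())

# Controller-style hook: return a modified output to steer behavior
def controller_hook(module, inputs, output):
    return output * 1.5  # e.g. amplify this layer's activations

layer.register_forward_hook(detector_hook)
layer.register_forward_hook(controller_hook)
_ = layer(torch.randn(1, 8))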
Why Hooks Matter¶
Hooks enable:
- Non-invasive inspection: Observe model internals without changing code
- Flexible control: Modify behavior at any layer
- Composable interventions: Combine multiple hooks for complex experiments
- SAE integration: SAEs work as both detectors and controllers
For detailed information about hooks, see the Hooks System Guide.
Store Architecture¶
The Store provides a hierarchical persistence layer for:
- Activations: Saved layer activations organized by run
- Models: Trained SAE models and checkpoints
- Metadata: Training history, configurations, and run information
Store Structure¶
store/
├── activations/
│ └── <run_id>/
│ ├── batch_0/
│ │ └── <layer_name>/
│ │ └── activations.safetensors
│ └── meta.json
├── runs/
│ └── <run_id>/
│ └── training_history.json
└── sae_models/
└── <sae_id>/
└── model.pt
LocalStore¶
The default implementation uses the local filesystem:
from mi_crow.store import LocalStore
store = LocalStore(base_path="./store")
All operations automatically organize data in this hierarchical structure, making it easy to manage large-scale experiments.
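Because everything lives under base_path, you can inspect the layout with ordinary filesystem tools. For example, using only the standard library:
from pathlib import Path

store_root = Path("./store")
# List saved activation runs (directory names are run ids)
for run_dir in (store_root / "activations").glob("*"):
    print("run:", run_dir.name)
# List trained SAE models
for sae_dir in (store_root / "sae_models").glob("*"):
    print("sae:", sae_dir.name)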
Datasets and Data Loading¶
mi-crow provides flexible dataset loading for:
- HuggingFace datasets: Direct integration with HF datasets
- Local files: Load from text files or custom formats
- In-memory data: Use Python lists directly
TextDataset¶
For simple text data:
from mi_crow.datasets import TextDataset
dataset = TextDataset(texts=["Text 1", "Text 2", "Text 3"])
HuggingFace Integration¶
from mi_crow.datasets import HuggingFaceDataset
dataset = HuggingFaceDataset(
name="wikitext",
split="train",
text_field="text"
)
Datasets are automatically batched and tokenized when used with lm.activations.save() or during inference.
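For example, feeding a dataset into activation saving might look like the sketch below. The keyword arguments passed to lm.activations.save() are illustrative assumptions, not the documented signature; see the Saving Activations guide for the real parameters:
from mi_crow.datasets import TextDataset

dataset = TextDataset(texts=["The cat sat on the mat.", "Paris is in France."])
# Hypothetical call: the layer and run_id arguments are assumptions made for
# illustration; check the Saving Activations guide for the actual signature.
lm.activations.save(dataset, layer="transformer.h.6", run_id="demo_run")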
Putting It All Together¶
The typical mi-crow workflow combines these concepts:
- Load a model and create a store
- Save activations from a layer using hooks (detectors)
- Train an SAE on the activations
- Discover concepts by analyzing neuron activations
- Manipulate concepts to control model behavior (controllers)
- Analyze results using the store's organized data
Each component is designed to work seamlessly with the others, providing a complete toolkit for mechanistic interpretability research.
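In outline, an end-to-end experiment strings these pieces together. The sketch below reuses only the calls introduced on this page; the activation-saving, SAE-training, and SAE-attachment steps are indicated with comments because their APIs are covered in the linked guides:
import torch
from mi_crow.language_model import LanguageModel
from mi_crow.store import LocalStore
from mi_crow.mechanistic.sae import TopKSae

# 1. Load a model and create a store
store = LocalStore(base_path="./store")
device = "cuda" if torch.cuda.is_available() else "cpu"
lm = LanguageModel.from_huggingface("gpt2", store=store, device=device)

# 2-3. Save activations from a layer and train an SAE on them
#      (see Saving Activations and Training SAE Models for the APIs)
sae = TopKSae(n_latents=4096, n_inputs=768, k=32, device=device)

# 4. Discover concepts: attach the SAE as a detector (see the Hooks
#    System Guide), track top texts, and run inference over a dataset
sae.concepts.enable_text_tracking(top_k=10)
outputs, encodings = lm.inference.execute_inference(["Example text to analyze."])
top_texts = sae.concepts.get_top_texts()

# 5. Manipulate a concept and re-run inference to observe the change
sae.concepts.manipulate_concept(neuron_idx=42, scale=1.5)
outputs, encodings = lm.inference.execute_inference(["Your prompt here"])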
Next Steps¶
- Hooks System - Deep dive into the hooks framework
- Saving Activations - Detailed activation collection guide
- Training SAE Models - SAE training best practices
- Concept Discovery - Finding interpretable concepts