Concept Manipulation¶

This guide covers how to control model behavior by manipulating concepts discovered with a sparse autoencoder (SAE).

Overview¶

Concept manipulation allows you to:

- Amplify or suppress specific concepts
- Compare model behavior with and without interventions
- Steer model outputs toward desired behaviors
- Run controlled intervention experiments

Prerequisites¶

Before manipulating concepts, you need:

- A trained SAE attached to the model
- Discovered concepts (see Concept Discovery)
- The SAE registered on the target layer

Basic Concept Manipulation¶

Amplify a Concept¶

# Amplify neuron 42 (which represents a concept)
sae.concepts.manipulate_concept(neuron_idx=42, scale=1.5)

# Run inference with amplified concept
outputs, encodings = lm.inference.execute_inference(["Your prompt here"])

Scale values:

- scale > 1.0: Amplify (increase concept strength)
- 0.0 < scale < 1.0: Suppress (decrease concept strength)
- scale = 1.0: No change (original concept strength)
- scale = 0.0: Completely remove the concept
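To build intuition for these scale values, here is a toy sketch in plain Python. It assumes (the guide does not state this) that a manipulation multiplies one SAE latent activation before the decoder reconstructs the layer output. The helpers `scale_latents` and `decode` and the tiny decoder matrix are hypothetical illustrations, not part of the library.

```python
def scale_latents(latents, scales):
    """Multiply selected latent activations by their scale (default 1.0)."""
    return [a * scales.get(i, 1.0) for i, a in enumerate(latents)]


def decode(latents, decoder_rows):
    """Reconstruct an activation as the sum of latent * decoder direction."""
    dim = len(decoder_rows[0])
    return [
        sum(a * row[d] for a, row in zip(latents, decoder_rows))
        for d in range(dim)
    ]


# Toy values: three latents, each with a 2-dimensional decoder direction.
latents = [0.0, 2.0, 1.0]
decoder_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

baseline = decode(latents, decoder_rows)                              # [1.0, 3.0]
amplified = decode(scale_latents(latents, {1: 1.5}), decoder_rows)    # [1.0, 4.0]
removed = decode(scale_latents(latents, {1: 0.0}), decoder_rows)      # [1.0, 1.0]
```

Under this assumption, only the part of the reconstruction that lies along the scaled latent's decoder direction changes. Everything else is left alone.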

Suppress a Concept¶

# Suppress neuron 42
sae.concepts.manipulate_concept(neuron_idx=42, scale=0.5)

# Run inference
outputs, encodings = lm.inference.execute_inference(["Your prompt here"])

Reset Manipulation¶

# Reset to original (scale = 1.0)
sae.concepts.manipulate_concept(neuron_idx=42, scale=1.0)

# Or reset all manipulations
sae.concepts.reset_manipulations()
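A forgotten reset can silently contaminate later runs. One way to make the reset automatic is a small context manager. This is a sketch, not a library feature: `temporary_manipulation` and `StubConcepts` are hypothetical names, and the stub only mimics the two methods used in this guide (`manipulate_concept` and `reset_manipulations`) so the example runs on its own.

```python
from contextlib import contextmanager


@contextmanager
def temporary_manipulation(concepts, manipulations):
    """Apply {neuron_idx: scale} manipulations, then always reset on exit."""
    try:
        for neuron_idx, scale in manipulations.items():
            concepts.manipulate_concept(neuron_idx=neuron_idx, scale=scale)
        yield concepts
    finally:
        # Runs even if inference inside the `with` block raises.
        concepts.reset_manipulations()


class StubConcepts:
    """Hypothetical stand-in for sae.concepts, for demonstration only."""

    def __init__(self):
        self.scales = {}

    def manipulate_concept(self, neuron_idx, scale):
        self.scales[neuron_idx] = scale

    def reset_manipulations(self):
        self.scales.clear()


stub = StubConcepts()
with temporary_manipulation(stub, {42: 1.5, 100: 2.0}) as active:
    inside = dict(active.scales)  # manipulations are live here
after = dict(stub.scales)         # and cleared here
```

With a real SAE you would pass `sae.concepts` in place of the stub and call `lm.inference.execute_inference` inside the `with` block.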

A/B Testing Interventions¶

Compare model behavior with and without concept manipulation:

Baseline (No Manipulation)¶

# Get baseline output
baseline_outputs, _ = lm.inference.execute_inference(
    ["Your prompt here"],
    with_controllers=False  # Disable SAE manipulation
)

With Manipulation¶

# Apply concept manipulation
sae.concepts.manipulate_concept(neuron_idx=42, scale=1.5)

# Get manipulated output
manipulated_outputs, _ = lm.inference.execute_inference(
    ["Your prompt here"],
    with_controllers=True  # Enable SAE manipulation
)

Compare Results¶

# Compare logits
difference = manipulated_outputs.logits - baseline_outputs.logits

# Compare predictions
baseline_pred = baseline_outputs.logits.argmax(dim=-1)
manipulated_pred = manipulated_outputs.logits.argmax(dim=-1)

print(f"Baseline prediction: {baseline_pred}")
print(f"Manipulated prediction: {manipulated_pred}")
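Raw logit tensors are hard to read at a glance, so it can help to reduce the comparison to a few numbers. The sketch below works on plain lists of floats for a single token position. `summarize_shift` is a hypothetical helper, and the logit values are made up for illustration.

```python
def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def summarize_shift(baseline_logits, manipulated_logits, top_k=3):
    """Summarize how a manipulation moved the logits at one position."""
    diffs = [m - b for b, m in zip(baseline_logits, manipulated_logits)]
    # Token indices ordered by how much their logit moved.
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]), reverse=True)
    return {
        "max_abs_diff": max(abs(d) for d in diffs),
        "argmax_changed": argmax(baseline_logits) != argmax(manipulated_logits),
        "most_shifted": ranked[:top_k],
    }


summary = summarize_shift(
    [2.0, 1.0, 0.5, -1.0],   # made-up baseline logits
    [1.5, 2.25, 0.5, -1.0],  # made-up manipulated logits
    top_k=2,
)
# summary == {"max_abs_diff": 1.25, "argmax_changed": True, "most_shifted": [1, 0]}
```

If the outputs hold PyTorch tensors shaped [batch, position, vocabulary] (an assumption based on the `argmax(dim=-1)` calls above), something like `outputs.logits[0, -1].tolist()` would give the list for the last position of the first prompt.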

Multiple Concept Manipulation¶

Manipulate Multiple Neurons¶

# Amplify multiple concepts
sae.concepts.manipulate_concept(neuron_idx=42, scale=1.5)  # Concept A
sae.concepts.manipulate_concept(neuron_idx=100, scale=2.0)  # Concept B
sae.concepts.manipulate_concept(neuron_idx=200, scale=0.5)  # Suppress Concept C

# All manipulations apply simultaneously
outputs, encodings = lm.inference.execute_inference(["Your prompt here"])

Batch Manipulation¶

# Manipulate multiple concepts at once
manipulations = {
    42: 1.5,   # Amplify concept A
    100: 2.0,  # Amplify concept B
    200: 0.5   # Suppress concept C
}

for neuron_idx, scale in manipulations.items():
    sae.concepts.manipulate_concept(neuron_idx=neuron_idx, scale=scale)

Real-Time Control¶

Dynamic Manipulation¶

# Change the manipulation between inference runs
def generate_with_control(prompt, concept_idx, scale):
    # Set manipulation
    sae.concepts.manipulate_concept(neuron_idx=concept_idx, scale=scale)

    # Generate
    outputs, encodings = lm.inference.execute_inference([prompt])

    return outputs

# Use different scales (the keyword must match the concept_idx parameter)
output1 = generate_with_control("Tell me about", concept_idx=42, scale=1.0)
output2 = generate_with_control("Tell me about", concept_idx=42, scale=1.5)
output3 = generate_with_control("Tell me about", concept_idx=42, scale=2.0)

Conditional Manipulation¶

# Manipulate based on input
def conditional_manipulate(prompt):
    # Clear manipulations left over from a previous call
    sae.concepts.reset_manipulations()

    if "science" in prompt.lower():
        # Amplify scientific concepts
        sae.concepts.manipulate_concept(neuron_idx=200, scale=1.5)
    elif "family" in prompt.lower():
        # Amplify family concepts
        sae.concepts.manipulate_concept(neuron_idx=42, scale=1.5)

    return lm.inference.execute_inference([prompt])
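As the number of conditions grows, an if/elif chain becomes awkward to maintain. A possible alternative is to keep the keyword rules in a table and compute the manipulations from it. This is a minimal sketch: `select_manipulations` and `RULES` are hypothetical names, and the neuron indices simply reuse the ones from the example above.

```python
def select_manipulations(prompt, rules):
    """Collect {neuron_idx: scale} for every keyword found in the prompt."""
    text = prompt.lower()
    selected = {}
    for keyword, scales in rules.items():
        if keyword in text:
            selected.update(scales)
    return selected


# Keyword -> manipulations to apply when the keyword appears.
RULES = {
    "science": {200: 1.5},
    "family": {42: 1.5},
}

science_only = select_manipulations("A science question", RULES)  # {200: 1.5}
both = select_manipulations("Family science fair", RULES)         # {200: 1.5, 42: 1.5}
neither = select_manipulations("Tell me about the weather", RULES)  # {}
```

Unlike the if/elif version, this applies every matching rule rather than only the first. The resulting dictionary can be fed to `sae.concepts.manipulate_concept` in a loop, as in the batch example earlier in this guide.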

Concept Configurations¶

Save and load concept manipulation configurations:

Save Configuration¶

# Get current manipulations
config = sae.concepts.get_manipulation_config()

# Save to file
import json
with open("concept_config.json", "w") as f:
    json.dump(config, f, indent=2)

Load Configuration¶

# Load from file
import json

with open("concept_config.json", "r") as f:
    config = json.load(f)

# Apply configuration
for neuron_idx, scale in config.items():
    sae.concepts.manipulate_concept(neuron_idx=int(neuron_idx), scale=scale)
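The `int(neuron_idx)` conversion matters because JSON object keys are always strings, so integer neuron indices come back as `"42"` rather than `42` after a round trip. The sketch below demonstrates this with the standard library only. It assumes the configuration is a flat mapping from neuron index to scale, which is what the loading loop above implies; `dump_config` and `load_config` are hypothetical helper names.

```python
import json


def dump_config(config):
    """Serialize a {neuron_idx: scale} mapping to a JSON string."""
    return json.dumps({int(k): float(v) for k, v in config.items()}, indent=2)


def load_config(text):
    """Parse JSON text and restore integer neuron indices."""
    return {int(k): float(v) for k, v in json.loads(text).items()}


original = {42: 1.5, 100: 2.0, 200: 0.5}
text = dump_config(original)

raw_keys = sorted(json.loads(text).keys())  # ['100', '200', '42'] (strings)
restored = load_config(text)                # {42: 1.5, 100: 2.0, 200: 0.5}
```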

Use Cases¶

Steering Generation¶

# Steer model toward specific topics
sae.concepts.manipulate_concept(neuron_idx=42, scale=2.0)  # Science concept
outputs, encodings = lm.inference.execute_inference(["Write about"])

Reducing Bias¶

# Suppress potentially biased concepts
biased_concept_neurons = [100, 150, 200]  # Identified through analysis

for neuron_idx in biased_concept_neurons:
    sae.concepts.manipulate_concept(neuron_idx=neuron_idx, scale=0.3)

outputs, encodings = lm.inference.execute_inference(["Your prompt"])

Enhancing Specific Features¶

# Enhance desired features
desired_features = {
    42: 1.5,   # Clarity
    100: 1.3,  # Accuracy
    200: 1.2   # Helpfulness
}

for neuron_idx, scale in desired_features.items():
    sae.concepts.manipulate_concept(neuron_idx=neuron_idx, scale=scale)

Advanced Patterns¶

Gradual Manipulation¶

# Gradually increase concept strength
for scale in [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]:
    sae.concepts.manipulate_concept(neuron_idx=42, scale=scale)
    outputs, encodings = lm.inference.execute_inference(["Your prompt"])
    print(f"Scale {scale}: {outputs.logits[0, 0, :5]}")  # First 5 logits at the first token position
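A sweep like this is often used to find the point at which the model's prediction changes. Assuming you record the top predicted token for each scale, a helper along these lines can report the smallest scale that flips the prediction. `first_changed_scale` is a hypothetical name, and the token ids in the example are made up.

```python
def first_changed_scale(records):
    """Return the smallest scale whose prediction differs from the
    prediction at the lowest scale, or None if it never changes.

    records: iterable of (scale, predicted_token) pairs.
    """
    ordered = sorted(records, key=lambda record: record[0])
    if not ordered:
        return None
    reference = ordered[0][1]
    for scale, prediction in ordered[1:]:
        if prediction != reference:
            return scale
    return None


# Made-up sweep results: (scale, top token id).
sweep = [(1.0, 17), (1.2, 17), (1.4, 17), (1.6, 93), (1.8, 93), (2.0, 93)]
threshold = first_changed_scale(sweep)  # 1.6
```

Starting from the threshold rather than from the maximum scale fits the advice to begin with moderate scales.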

Concept Interaction Studies¶

# Study interactions between concepts
concept_a = 42
concept_b = 100

# Individual effects
sae.concepts.manipulate_concept(neuron_idx=concept_a, scale=1.5)
output_a, _ = lm.inference.execute_inference(["Prompt"])

sae.concepts.reset_manipulations()
sae.concepts.manipulate_concept(neuron_idx=concept_b, scale=1.5)
output_b, _ = lm.inference.execute_inference(["Prompt"])

# Combined effect
sae.concepts.reset_manipulations()
sae.concepts.manipulate_concept(neuron_idx=concept_a, scale=1.5)
sae.concepts.manipulate_concept(neuron_idx=concept_b, scale=1.5)
output_combined, _ = lm.inference.execute_inference(["Prompt"])

# Compare
print("Individual A:", output_a.logits)
print("Individual B:", output_b.logits)
print("Combined:", output_combined.logits)
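Printing the three tensors side by side does not show whether the concepts interact. One simple way to quantify it is to compare the combined effect with the sum of the individual effects, which also requires a baseline run with no manipulation. The sketch below uses plain lists with made-up values; `interaction_term` is a hypothetical helper.

```python
def interaction_term(baseline, only_a, only_b, combined):
    """Per-logit deviation from additivity.

    If the two manipulations were purely additive, then
    (combined - baseline) == (only_a - baseline) + (only_b - baseline),
    and every entry of the result would be 0.
    """
    return [
        c - a - b + base
        for base, a, b, c in zip(baseline, only_a, only_b, combined)
    ]


# Made-up logits for two vocabulary entries.
baseline = [1.0, 0.0]
only_a = [1.5, 0.0]    # concept A alone
only_b = [1.0, 0.5]    # concept B alone
combined = [1.5, 1.0]  # both together

interaction = interaction_term(baseline, only_a, only_b, combined)  # [0.0, 0.5]
```

Here the first logit behaves additively, while the second moves 0.5 more than the individual effects predict, which suggests the two concepts reinforce each other for that token.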

Best Practices¶

- Start small: Use moderate scales (1.2-1.5) initially
- Test systematically: Compare baseline and manipulated outputs on the same prompts
- Document effects: Record what each manipulation does
- Reset between experiments: Call sae.concepts.reset_manipulations()
- Validate concepts: Check a neuron's top texts before manipulating it

Common Issues¶

No Effect¶

# Check SAE is attached
assert sae in lm.layers.get_hooks()

# Check manipulation is applied
config = sae.concepts.get_manipulation_config()
print(f"Current manipulations: {config}")

# Ensure with_controllers=True
outputs, encodings = lm.inference.execute_inference(["Prompt"], with_controllers=True)

Too Strong Effect¶

# Reduce scale
sae.concepts.manipulate_concept(neuron_idx=42, scale=1.2)  # Instead of 2.0

Unexpected Behavior¶

# Verify concept is correct
top_texts = sae.concepts.get_top_texts()
print(f"Neuron 42 top texts: {top_texts.get(42, [])[:5]}")

Next Steps¶

After learning concept manipulation:

Activation Control - Direct activation manipulation
Hooks: Controllers - Custom controller hooks
Examples - See example notebooks

examples/03_load_concepts.ipynb - Complete concept manipulation example
experiments/verify_sae_training/05_show_concepts.ipynb - Concept visualization