Understanding Superposition in Neural Networks: A Guide Through Analogies

Jul 8, 2025 · Heye Vöcking · 7 min read

One of the most fascinating and challenging concepts in mechanistic interpretability is superposition, the way neural networks cleverly pack multiple features into the same computational space. If you’re coming from a background in knowledge representation or semantic systems, superposition can seem quite alien at first. But with the right analogies, it becomes understandable and genuinely elegant.

What is Superposition?

In traditional knowledge systems like the semantic web, concepts are typically stored in dedicated, clearly labeled locations. You might have distinct slots for “Person,” “hasAge,” and “livesIn” - each with its own well-defined space in your ontology.

Neural networks face a different challenge entirely. They need to represent potentially millions of concepts, but they have limited “storage space” in the form of neurons and dimensions. Superposition is their solution: multiple distinct features are encoded in the same set of neurons, rather than each feature having its own dedicated slot.

Think of it like having a library where you need to store a million books, but you only have shelf space for a thousand. Superposition would be like discovering a clever way to store multiple books in the same physical space, perhaps by layering them in a way that you can still retrieve individual books when needed.
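
To make this concrete, here is a minimal sketch in plain NumPy, with made-up sizes: each feature gets a nearly orthogonal direction in a shared, much smaller vector space, and as long as only a few features are active at once, each one can still be read back out with only a little interference. Everything in this snippet is illustrative, not taken from any real model.

```python
# A minimal sketch of the core idea, with made-up sizes: store far more
# features than dimensions by giving each feature a nearly orthogonal
# direction in a shared space, and rely on only a few being active at once.
import numpy as np

rng = np.random.default_rng(0)

n_features = 1000   # the "books" we want to store
n_dims = 100        # the "shelf space" actually available

# Each feature gets a random unit-length direction in the shared space.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Encode a sparse set of active features as a single shared vector.
active = [3, 42, 777]
activation = directions[active].sum(axis=0)

# Read every feature back out with a dot product against its own direction.
readout = directions @ activation

print("strongest readouts:", np.sort(np.argsort(readout)[-3:]))  # very likely 3, 42, 777
print("signal for feature 42:", round(readout[42], 2))           # close to 1
print("typical interference:", round(np.abs(np.delete(readout, active)).mean(), 3))  # small but nonzero
```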

The Compression Analogy

Superposition is fundamentally a form of learned compression. But unlike traditional compression algorithms like ZIP or JPEG, where you have explicit rules for packing and unpacking data, superposition is a compression scheme that the neural network discovers on its own.

Imagine you’re trying to compress a massive dataset, but instead of using a pre-designed algorithm, you let the system figure out its own compression method through trial and error. The network learns to pack features together in a way that preserves the information it needs while fitting within its constraints.

The tricky part? There’s no clean “decompression algorithm.” When multiple features are superimposed in the same neurons, separating them becomes a complex inference problem. It’s like having a brilliantly space-efficient storage system, except now you need to figure out how to extract individual items without clear labels.
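
The toy setting from the Toy Models of Superposition paper (referenced below) gives a feel for this. The sketch below is a loose, simplified variant with invented sizes and hyperparameters: sparse features are squeezed through a small bottleneck and reconstructed, and the network has to discover its own packing scheme.

```python
# A loose, simplified variant of the toy-models-of-superposition setup.
# All sizes and hyperparameters here are invented for illustration.
import torch

n_features, n_hidden = 20, 5            # 20 sparse features, only 5 dimensions to store them in
W = torch.randn(n_features, n_hidden, requires_grad=True)
b = torch.zeros(n_features, requires_grad=True)
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(2000):
    # Sparse inputs: each feature is present only ~5% of the time.
    x = torch.rand(256, n_features) * (torch.rand(256, n_features) < 0.05).float()
    h = x @ W                           # "compress" into 5 dimensions
    x_hat = torch.relu(h @ W.T + b)     # "decompress" back to 20 features
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The learned feature directions overlap: features share dimensions (superposition).
overlaps = (W @ W.T).detach()
overlaps.fill_diagonal_(0)
print("final reconstruction loss:", loss.item())
print("largest overlap between two different features:", overlaps.abs().max().item())
```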

The River Delta: Information Flow

Perhaps the most intuitive way to understand superposition is through the metaphor of a river delta with controllable dams.

In this analogy:

  • Water flow represents information/activation flowing through the network
  • Controllable dams represent the weights and biases that can be adjusted
  • Different streams represent different features or concepts
  • The destination represents the final output (like predicting the next token)

When you present a neural network with an image of a cat, multiple “streams” of information flow simultaneously: visual texture features, shape features, color features, and higher-level concept features. These streams don’t take separate paths; they flow through the same “channels” (neurons) in the network.

The magic happens in how the network learns to adjust its “dam settings” (weights) so that when you need the “cat” concept, the right combination of streams naturally flows to produce that output. Multiple features share the same waterways, but the flow patterns are orchestrated so that the right information reaches the right destinations.
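
Here is a deliberately tiny, hand-wired illustration of that routing idea: three concept “streams” share only two hidden neurons, and the readout weights plus a ReLU “dam” ensure each stream still reaches its own destination. Everything in this snippet is constructed by hand purely for illustration.

```python
# A hand-wired illustration of the "dam settings" idea: three concept
# streams share just two hidden neurons, yet each still reaches its own
# destination. Purely a toy example.
import numpy as np

# Three feature directions packed into a 2-dimensional "riverbed",
# spread 120 degrees apart so their mutual overlaps are as small as possible.
D = np.array([[1.0, 0.0],
              [-0.5, np.sqrt(3) / 2],
              [-0.5, -np.sqrt(3) / 2]])

def readout(active):
    h = D[active].sum(axis=0)          # all active streams share the same 2 neurons
    return np.maximum(D @ h - 0.4, 0)  # a ReLU with a negative bias acts as the "dam"

print(readout([0]))      # only destination 0 receives flow
print(readout([0, 2]))   # destinations 0 and 2 receive flow, slightly attenuated
```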

The Electrical Circuit Perspective

For those with electrical engineering backgrounds, superposition works remarkably like voltage and current in complex circuits.

Neural network activations are like voltages at different points in a circuit. Weights control the “resistance” of connections, determining how much signal passes through, so information flows through weighted connections like current through resistors. Just as you might use a multimeter to measure the voltage at different points in a physical circuit, researchers use “virtual meters” to measure activation levels throughout the network. By tracing how changing the “voltage” at one point affects the final output, they can follow the signal flow from input to output.
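
Here is a small sketch of what such a “virtual meter” can look like in practice, using PyTorch forward hooks on a toy model invented for this example: first record the activation at one point, then overwrite it and observe how the output shifts.

```python
# A small sketch of the "virtual meter" idea using PyTorch forward hooks:
# record the activation at one point in a toy network, then overwrite it
# and see how the output changes. The model here is invented for illustration.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

recorded = {}
def meter(module, inputs, output):
    recorded["probe"] = output.detach().clone()   # "measure the voltage" here

handle = model[1].register_forward_hook(meter)    # attach the meter after the ReLU

x = torch.randn(1, 8)
baseline = model(x)
print("measured activation (first few neurons):", recorded["probe"][0, :4])

# Now "patch" the same point: force one hidden neuron to zero and re-run.
def patch(module, inputs, output):
    patched = output.clone()
    patched[:, 3] = 0.0                           # close one "wire"
    return patched

handle.remove()
handle = model[1].register_forward_hook(patch)
patched_out = model(x)
handle.remove()

print("output shift from patching one neuron:", (patched_out - baseline).abs().max().item())
```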

This electrical perspective helps explain why mechanistic interpretability is so challenging: you’re trying to reverse-engineer a circuit with billions of components, where multiple signals are running through the same wires simultaneously.

Why Superposition Matters

Understanding superposition is crucial because it explains why neural networks are so hard to interpret. In traditional symbolic systems, if you want to know whether the system understands “cats,” you might look for a dedicated “cat” module. In neural networks with superposition, the “cat” concept might be distributed across thousands of neurons, mixed in with concepts for “furry textures,” “pointed ears,” and “domestic animals.”

This is not a bug; it’s a feature! Superposition allows neural networks to be incredibly parameter-efficient, representing far more concepts than they have neurons. But it also means that understanding what these networks have learned requires sophisticated mathematical tools to disentangle the superimposed features.

The Path Forward

The beauty of superposition is that it represents a fundamentally different approach to information storage and processing. Instead of the explicit, structured representations we’re used to in traditional AI systems, neural networks discover their own compression schemes that are often more efficient than anything we could design by hand.

For researchers in mechanistic interpretability, superposition presents both the central challenge and the key to understanding these systems. By developing tools to decompose superimposed features, such as sparse autoencoders and related mathematical techniques, we’re slowly learning to read the compressed language that neural networks speak.
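
As a rough illustration of the sparse autoencoder idea, the sketch below (with illustrative sizes, random stand-in activations, and an arbitrary sparsity penalty) learns an overcomplete dictionary whose codes are encouraged to be sparse, so that individual dictionary directions can hopefully be read as individual features.

```python
# A bare-bones sketch of the sparse autoencoder idea: learn an overcomplete
# dictionary whose codes are sparse, so that each dictionary direction can
# (hopefully) be read as a single feature. Sizes, data, and the sparsity
# penalty weight are all illustrative.
import torch
import torch.nn as nn

d_model, d_dict = 64, 512                  # more dictionary entries than dimensions

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, activations):
        codes = torch.relu(self.encoder(activations))   # sparse, non-negative codes
        return self.decoder(codes), codes

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

activations = torch.randn(1024, d_model)   # stand-in for real model activations
for step in range(500):
    recon, codes = sae(activations)
    loss = ((recon - activations) ** 2).mean() + 1e-3 * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("fraction of active code entries:", (codes > 0).float().mean().item())
```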

The next time you interact with a language model or image classifier, remember that behind its responses lies an intricate dance of superimposed features, flowing through shared computational channels in patterns too complex for us to fully grasp - yet. The quest to understand these patterns is what makes mechanistic interpretability one of the most fascinating frontiers in AI research.

References and Further Reading

The concepts explored in this article are grounded in cutting-edge research in mechanistic interpretability. Here are the key sources that support and extend these ideas:

Core Theoretical Foundations

Elhage, N., et al. (2022). Toy Models of Superposition. arXiv:2209.10652. [Paper]
The foundational work that introduces minimal settings where polysemanticity arises from storing sparse features in superposition. This paper provides the theoretical backbone for understanding the compression analogy discussed above.

Hänni, S., et al. (2024). Mathematical Models of Computation in Superposition. arXiv:2408.05451. [Paper]
Formalizes how neural networks can compute Boolean circuits in superposition using sub-linear neuron counts, providing mathematical rigor to the “river delta” information flow concepts.

Sparse Coding and Decomposition Techniques

(2025). From Superposition to Sparse Codes: Interpretable Representations in Neural Networks. arXiv:2503.01824. [Paper]
Recent work explaining how evidence for linear overlay of concepts motivates extraction of monosemantic features via sparse autoencoders — the natural next step after understanding superposition.

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple‐cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609. [Paper]
The classic neuroscience foundation for sparse coding, showing that superposition isn’t unique to artificial neural networks but appears in biological vision systems.

Interpretability Methods and Visualizations

Olah, C., et al. (2018). The Building Blocks of Interpretability. Distill. [Paper]
Combines feature visualization, attribution, and dimensionality reduction to explore how individual neurons encode multiple features — the “virtual meter” approach to measuring neural activations.

Olah, C., Mordvintsev, A., & Schubert, L. (2017). Feature Visualization. Distill. [Paper]
Details techniques for reverse-engineering neuron-specific activation patterns, essential tools for detecting and understanding superimposed features.

Olah, C., et al. (2020). Zoom In: An Introduction to Circuits. Distill. [Paper]
Introduces the “circuit” metaphor for interpreting subgraphs of neurons and weights, which becomes particularly complex under superposition.

Advanced Topics and Current Research

Adler, M., & Shavit, N. (2024). On the Complexity of Neural Computation in Superposition. arXiv:2409.15318. [Paper]
Presents theoretical bounds for computing logical operations in superposition, highlighting the computational advantages of this representational strategy.

Chang, E., et al. (2025). SAFR: Neuron Redistribution for Interpretability. arXiv:2501.16374. [Paper]
Proposes methods to encourage monosemantic allocations in transformers, directly addressing the challenges posed by feature superposition.

Broader Context

Murdoch, W. J., et al. (2019). Interpretable Machine Learning: Definitions, Methods, and Applications. arXiv:1901.04592. [Paper]
Provides taxonomies of interpretability methods that frame superposition-focused techniques within the broader landscape of explainable AI.


This post represents one perspective on superposition in neural networks, built through collaborative exploration of analogies and concepts. The field of mechanistic interpretability is rapidly evolving, and our understanding of these phenomena continues to deepen.

Heye Vöcking
Senior Data Engineer
Data & Knowledge Engineer with 10+ years of professional experience transforming petabyte-scale data into knowledge. Currently stress-testing large-language-model alignment, developing jailbreaks, and building real-time knowledge-graph systems. Interests include ML security, physics, Austrian economics, and Bitcoin.