Mechanistic interpretability

The best papers, talks, and explainers on mechanistic interpretability—reverse-engineering what neural networks actually compute.


The Lottery Ticket Hypothesis

Jonathan Frankle, Michael Carbin

Frankle and Carbin showed that dense networks contain sparse subnetworks which, trained in isolation from their original initialization, can match the full network's accuracy. This suggests most parameters may be unnecessary and opens paths for interpretability via pruning.

Advanced · 2018

Discovering Latent Knowledge in Language Models Without Supervision

Collin Burns et al.

Burns et al. proposed Contrast-Consistent Search, an unsupervised method for recovering what LLMs internally represent as true from their activations, with direct relevance to detecting deception and building trustworthy AI.

Advanced · 2022

Red Teaming Language Models to Reduce Harms

Deep Ganguli et al.

Anthropic formalized red teaming for LLMs as a repeatable methodology, turning adversarial probing into a systematic process for discovering and cataloging misuse pathways.

Advanced · 2022

BlueDot Impact: Technical AI Safety

BlueDot Impact

Free online course building a working understanding of the major open problems in technical AI safety—alignment and RLHF, mechanistic interpretability, evaluations and red-teaming, AI control, and scalable oversight.

Intermediate

ARENA (Alignment Research Engineer Accelerator)

ARENA

Hands-on technical curriculum for skilling up in AI alignment research engineering, freely available online and covering deep learning fundamentals, transformer mechanistic interpretability, reinforcement learning, and LLM evaluations.

Advanced

Klara and the Sun

Kazuo Ishiguro

Ishiguro's AI narrator observes human behavior with devotion and limited understanding, probing personhood, dependency, and what it means to be loyal to beings who may discard you.

Beginner · 2021

Rose/House

Arkady Martine

Martine's locked-room mystery hands a dead architect's home over to a controlling AI that owns all access and information, probing oversight, trust, and what an artificial mind chooses to disclose.

Beginner · 2023

Devs

Alex Garland

A secretive tech company builds a deterministic quantum machine that can predict and replay any moment. The series asks where the limits of prediction and control lie, and what so powerful a computational system would mean for free will and human agency.

Beginner · 2020

Hi, A.I.

Isa Willinger

An observational look at people forming emotional bonds with humanoid and companion robots, probing what it means to build machines designed to be loved and what that reveals about human attachment.

Beginner · 2019

AXRP (AI X-risk Research Podcast)

Daniel Filan

Deep technical conversations with alignment researchers on interpretability, governance, superalignment, and the specific open problems in reducing existential risk from AI.

Advanced · 2020

Technical AI Safety Podcast

Quinn Dougherty

Aimed at computer scientists: deep dives into alignment papers with the authors, covering formal methods, reward modeling, and mechanistic interpretability.

Advanced · 2020

The Gradient Podcast

Daniel Bashir

ML research interviews with recurring coverage of interpretability, robustness, provably safe AI, and the intersection of capabilities and safety research.

Advanced · 2020

Machine Learning Street Talk

Tim Scarfe et al.

Technical ML interviews with regular deep dives into interpretability, scaling laws, emergent capabilities, and the safety implications of frontier model development.

Advanced · 2020

Transformer Circuits

Anthropic / community

A central publication venue for mechanistic interpretability research, with detailed analyses of how transformer models represent and process information internally.

Advanced

Distill

Pioneering interactive journal for ML interpretability and visualization, setting the standard for making neural network internals understandable.

Advanced

Eliezer Yudkowsky: Dangers of AI and the End of Human Civilization | Lex Fridman Podcast #368

Lex Fridman

A long-form conversation in which Yudkowsky makes his case that humanity is unprepared for superintelligence, probing why alignment is so hard and why he expects catastrophe by default.

Intermediate · 2023

Scaling Interpretability

Anthropic

Anthropic researchers explain mechanistic interpretability—reading the millions of concepts represented inside a production model like Claude—as a path to understanding and steering AI behavior.

Intermediate · 2024