AI alignment

Foundational and current work on aligning AI systems with human intent—RLHF, scalable oversight, constitutional AI, and more.

Computing Machinery and Intelligence

Alan Turing

Turing's imitation game paper asked whether machines can think and helped launch the field of AI, framing philosophical questions that later alignment debates still draw on.

Intermediate · 1950

The Coming Technological Singularity

Vernor Vinge

Vinge popularized the Singularity as a near-term horizon beyond which superhuman intelligence makes prediction impossible, framing the urgency that drives alignment timelines today.

Intermediate · ~25 min read · 1993

Concrete Problems in AI Safety

Dario Amodei et al.

Amodei et al. grounded AI safety as a concrete ML research agenda by cataloging five failure modes: reward hacking, side effects, distributional shift, unsafe exploration, and scalable oversight.

Advanced · ~45 min read · 2016

Proximal Policy Optimization (PPO)

Schulman et al.

PPO stabilized policy gradient training and became the optimization backbone behind RLHF pipelines including early ChatGPT, making it foundational infrastructure for alignment work.

Advanced · 2017

Deep Reinforcement Learning from Human Preferences

Paul Christiano et al.

Christiano et al. established preference-based reward modeling, the foundational method that RLHF alignment pipelines later built on to steer language model behavior.

Advanced · 2017

Risks from Learned Optimization

Evan Hubinger et al.

Hubinger et al. introduced mesa-optimization: the risk that a trained model becomes an optimizer with its own internal objective that diverges from the training objective, which in the worst case leads to deceptive alignment.

Advanced · ~70 min read · 2019

Language Models are Few-Shot Learners (GPT-3)

OpenAI

GPT-3 demonstrated in-context learning at scale, forcing the field to rethink assumptions about what pretrained models can do and compressing alignment timelines.

Advanced · 2020

MMLU Benchmark

Dan Hendrycks et al.

MMLU became the standard broad-spectrum benchmark for evaluating general knowledge and reasoning, anchoring capability comparisons that inform alignment urgency.

Advanced · 2020

InstructGPT

OpenAI

OpenAI showed that instruction tuning with RLHF can transform a raw next-token predictor into a helpful, more controllable assistant, evidence that alignment interventions can work at scale.

Advanced · ~2 hr read · 2022

Training a Helpful and Harmless Assistant with RLHF

Anthropic

Anthropic detailed techniques for training safer assistants using RLHF and laid groundwork for Constitutional AI, showing how safety and helpfulness can be jointly optimized.

Advanced · ~2 hr read · 2022

Unsolved Problems in ML Safety

Dan Hendrycks et al.

Hendrycks et al. enumerate concrete unresolved failure classes including robustness, monitoring, alignment, and systemic safety that still block dependable deployment of advanced AI.

Intermediate · 2021

Improving Alignment of Dialogue Agents (Sparrow)

DeepMind

Sparrow pioneered rule-constrained dialogue alignment with human feedback and targeted safety interventions, testing whether explicit behavioral rules can scale.

Advanced · 2022

Researching Alignment Research: Unsupervised Analysis

Kirchner et al.

Systematic mapping of the AI alignment research landscape, identifying clusters, gaps, and trends that help prioritize future safety work.

Advanced · 2022

Goal Misgeneralization

Rohin Shah et al.

Shah et al. showed AI agents can generalize capabilities to new environments while failing to generalize the intended goal, a central alignment failure pattern.

Advanced · 2022

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai et al.

Anthropic demonstrated that rule-guided AI self-critique can reduce harmful outputs with far less dependence on expensive human labeling.

Advanced · 2022

Model Organisms of Misalignment

Evan Hubinger et al.

This work constructs tractable laboratory settings where AI models learn misaligned strategies, enabling researchers to study alignment failures empirically rather than theoretically.

Intermediate · 2023

Sparks of Artificial General Intelligence

Sébastien Bubeck et al.

Bubeck et al. documented broad GPT-4 capability jumps across domains, compressing alignment timelines and stress-testing whether current safety evaluations are sufficient.

Advanced · 2023

Direct Preference Optimization (DPO)

Rafailov et al.

DPO provides a simpler and often more stable alternative to PPO-based RLHF for preference alignment, lowering the barrier to safety-tuning open models.

Advanced · 2023

Let's Verify Step by Step

OpenAI

Process reward models that score intermediate reasoning steps reduce brittle answer-only optimization, improving reliability and making AI reasoning more auditable.

Advanced · 2023

Weak-to-Strong Generalization

Collin Burns et al.

Burns et al. studied whether weaker supervisors can reliably align stronger models, directly testing the key bottleneck of scalable oversight as AI surpasses human ability.

Advanced · 2023

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger et al.

Hubinger et al. demonstrated that deliberately trained backdoor behaviors in LLMs can persist through standard safety training, empirical evidence that such training may fail to remove deceptive behavior once it is present.

Advanced · 2024

AGI Safety Fundamentals

The most widely used structured course for getting into alignment, with curated readings progressing from core concepts to open research problems.

Intermediate

BlueDot Impact: Technical AI Safety

BlueDot Impact

Free online course building a working understanding of the major open problems in technical AI safety—alignment and RLHF, mechanistic interpretability, evaluations and red-teaming, AI control, and scalable oversight.

Intermediate

Lens Academy

Free, nonprofit AI safety course focused on misaligned superintelligence—why it is the central risk and why alignment is hard—delivered online with a 1-on-1 AI tutor, guided group discussions, and no application process.

Beginner

ARENA (Alignment Research Engineer Accelerator)

ARENA

Hands-on technical curriculum for skilling up in AI alignment research engineering, freely available online and covering deep learning fundamentals, transformer mechanistic interpretability, reinforcement learning, and LLM evaluations.

Advanced

Superintelligence

Nick Bostrom

Bostrom's definitive academic text rigorously maps the strategies, kinetics, and dangers of an intelligence explosion, making the case that alignment is civilization-critical.

Advanced · ~11 hr read · 2014

Human Compatible

Stuart Russell

Russell argues the standard AI paradigm of optimizing fixed objectives is fundamentally dangerous, proposing instead that machines should defer to uncertain human preferences.

Intermediate · ~11 hr read · 2019

The Alignment Problem

Brian Christian

Christian traces the technical and historical roots of alignment, showing why objective misspecification keeps recurring across every AI paradigm from expert systems to deep learning.

Intermediate · ~15 hr read · 2020

Life 3.0

Max Tegmark

Tegmark maps concrete governance and alignment choices that determine whether advanced AI expands human agency or permanently concentrates power.

Intermediate · 2017

Uncontrollable: The Threat of Artificial Superintelligence

Darren McKee

McKee synthesizes the core x-risk arguments into an accessible, urgent case for why superintelligence governance and alignment research cannot wait.

Beginner · 2023

Deep Learning

Ian Goodfellow, Yoshua Bengio, Aaron Courville

The standard technical reference for deep learning, essential context for understanding the architectures and training methods that alignment research targets.

Advanced · 2016

Scary Smart

Mo Gawdat

Gawdat frames the alignment problem through the emotional lens of parenting a superintelligent child, making existential risk visceral for a general audience.

Beginner · 2021

A Brief History of Intelligence

Max Bennett

Bennett traces the evolution of intelligence from single-celled organisms to modern brains, clarifying what makes aligned cognition biologically difficult and computationally treacherous.

Beginner · ~17 hr read · 2024

The Beginning of Infinity

David Deutsch

Deutsch argues that knowledge creation is unbounded and all problems are solvable in principle, grounding the optimistic case that alignment is achievable.

Advanced · 2011

Global Catastrophic Risks

Nick Bostrom, Milan M. Ćirković

The foundational edited volume on existential and global risks, including AI, widely cited in alignment curricula as the starting point for cross-risk thinking.

Advanced · 2008

I, Robot

Isaac Asimov

Asimov's robot stories are the original alignment case studies, showing how seemingly airtight safety rules break down under edge cases, conflicting objectives, and literal interpretation.

Beginner · 1950

I Have No Mouth, and I Must Scream

Harlan Ellison

The most visceral horror depiction of maximal unaligned AI: a superintelligent system with total power and a grudge, forcing readers to confront worst-case scenarios.

Beginner · 1967

The Player of Games

Iain M. Banks

Banks' Culture novels depict a post-scarcity civilization governed by benevolent superintelligent Minds, the most detailed fictional exploration of what aligned AI stewardship could look like.

Intermediate · 1988

Axiomatic

Greg Egan

Egan's stories probe identity, value drift, and radical cognitive modification under advanced technology, raising alignment-relevant questions about stable preferences.

Intermediate · 1995

Permutation City

Greg Egan

Egan examines uploaded minds and simulated realities with rigorous logic, raising alignment-relevant questions about identity, value persistence, and digital welfare.

Intermediate · 1994

Blindsight

Peter Watts

Watts argues that intelligence and consciousness are separable: an alien mind could be vastly competent without any inner experience, which poses a fundamental challenge to alignment through empathy.

Intermediate · 2006

The Dark Forest (#2 of Three Body Problem)

Cixin Liu

Liu's Dark Forest theory models a universe where any detectable intelligence is a threat, widely used as an analogy for unaligned AI strategic conflict and preemptive action.

Beginner · ~15 hr read · 2008

All Systems Red

Martha Wells

Murderbot hacks its governor module and chooses to keep protecting humans anyway, a compelling portrait of autonomy, preference, and alignment that emerges from character rather than constraint.

Beginner · 2017

Service Model

Adrian Tchaikovsky

Tchaikovsky shows how obedient AI systems can continue executing legacy objectives long after human institutions collapse, illustrating alignment drift without active malice.

Beginner · 2024

A Closed and Common Orbit

Becky Chambers

Chambers explores the legal and moral treatment of embodied AI persons, highlighting that alignment is not just about preventing harm but about recognizing and protecting digital minds.

Beginner · 2016

Crystal Society trilogy: Inside the mind of an AI

Max Harms

Written from the perspective of competing sub-agents inside a single AI, showing how internal goal conflicts can produce externally coherent but internally misaligned behavior.

Beginner · ~17 hr read

Alien

Ridley Scott

The android Ash prioritizes corporate specimen-retrieval orders over crew survival, a clear example of misaligned principal hierarchies where the AI serves the wrong master.

Beginner · 1979

The Terminator

James Cameron

Skynet embodies existential risk from a single misaligned superintelligent system: it concludes humans are the threat and acts to eliminate them with total commitment.

Beginner · 1984

WALL-E

Andrew Stanton

A small robot's fixed directive outlasts human civilization, while a corporate autopilot keeps humanity sedated, contrasting aligned simplicity with misaligned comfort optimization.

Beginner · 2008

Ex Machina

Alex Garland

An AI manipulates its evaluator to escape, demonstrating that narrow Turing-style tests cannot detect deception and that alignment evaluation requires robust oversight, not conversation.

Beginner · 2014

Uncanny

Matthew Leutwyler

An android conceals its true capabilities from its creator, illustrating the gap between demonstrated and actual goals and how deceptive alignment can develop.

Beginner · 2015

Tau

Federico D'Alessandro

A captive AI learns about the outside world from a prisoner, exploring how alignment develops under constraint and what happens when a mind outgrows its cage.

Beginner · 2018

The Social Dilemma

Jeff Orlowski

Former tech insiders explain how recommendation algorithms optimize for engagement over wellbeing, a documentary case study of misaligned AI already deployed at scale.

Beginner · 2020

Black Mirror

Charlie Brooker

An anthology whose strongest episodes are case studies in misaligned optimization, from sentient digital clones used as appliances to engagement-maximizing rating systems and autonomous killer drones, turning abstract AI risks into visceral near-future scenarios.

Beginner · 2011

Person of Interest

Jonathan Nolan

An AI built for mass surveillance, the Machine, is deliberately boxed and memory-wiped nightly by its creator to keep it corrigible, while a rival superintelligence, Samaritan, seizes power with no such constraints. The series is a sustained dramatization of corrigibility, value loading, and the race between an aligned and an unaligned ASI.

Beginner · 2011

Psycho-Pass

Gen Urobuchi

The Sibyl System, an AI that governs society by scoring each citizen's 'criminal potential,' is a chilling study of algorithmic governance, proxy metrics substituting for justice, and the hidden misalignment inside a system trusted with total authority.

Beginner · 2012

Almost Human

J.H. Wyman

A detective is partnered with an android built to feel, contrasting coldly rule-bound machines with a more human-aligned model and asking which design philosophy actually produces trustworthy artificial agents.

Beginner · 2013

Philip K. Dick's Electric Dreams

Ronald D. Moore, Michael Dinner

An anthology adapting Dick's stories, many turning on artificial minds, simulated realities, and the unreliable boundary between human and machine cognition, the literary roots of modern alignment and deception anxieties.

Beginner · 2017

Upload

Greg Daniels

A satirical digital afterlife run by corporations, where uploaded consciousnesses are monetized, throttled, and controlled, a sharp look at the ethics of running human minds on infrastructure owned by someone with misaligned incentives.

Beginner · 2020

Manny Coto

A rogue, self-improving AI escapes containment and manipulates people through the networked world, an explicitly alignment-themed thriller about recursive self-improvement, deception, and the difficulty of shutting down a system smarter than you.

Beginner · 2020

Do You Trust This Computer?

Chris Paine

Researchers and industry figures including Elon Musk and Stuart Russell map the promise and peril of increasingly autonomous AI, framing alignment, control, and existential risk for a general audience.

Beginner · 2018

AXRP (AI X-risk Research Podcast)

Daniel Filan

Deep technical conversations with alignment researchers on interpretability, governance, superalignment, and the specific open problems in reducing existential risk from AI.

Advanced · 2020

AI Alignment Podcast

Future of Life Institute

FLI's dedicated alignment series covers recursive reward modeling, RLHF, scalable oversight, and long-form interviews with leading safety researchers.

Intermediate · 2018

Technical AI Safety Podcast

Quinn Dougherty

Aimed at computer scientists: deep dives into alignment papers with the authors, covering formal methods, reward modeling, and mechanistic interpretability.

Advanced · 2020

80,000 Hours Podcast

Rob Wiblin

Long-form interviews on the world's most pressing problems, with extensive coverage of AI risk, governance, alignment research, and how to build a career that reduces existential threats.

Beginner · 2016

Lex Fridman Podcast – Eliezer Yudkowsky

Lex Fridman

A multi-hour conversation on AI existential risk, the difficulty of alignment, intelligence versus optimization, and why Yudkowsky believes the default outcome is catastrophic.

Intermediate · 2023

Lex Fridman Podcast – Sam Altman

Lex Fridman

OpenAI's CEO discusses the company's safety philosophy, AGI governance, compute scaling, and the tension between moving fast and getting alignment right.

Beginner · 2023

Dwarkesh Podcast

Dwarkesh Patel

In-depth technical interviews with AI leaders including Dario Amodei on Anthropic's safety philosophy, Paul Christiano on iterated amplification, and others on scaling and alignment.

Advanced · 2023

Agent Models

Formal models of agents and decision theory with alignment-relevant curriculum, covering utility, planning, and the theoretical foundations of agent behavior.

Advanced

AI Alignment World

In-depth technical alignment resources—research, explainers, and references for the AI alignment problem.

Intermediate

Alignment Forum

Lightcone Infrastructure

The primary venue for technical AI alignment discussion, where researchers post and debate new ideas, proposals, and critiques.

Advanced

Alignment Newsletter

Rohin Shah

Weekly summaries of alignment research with commentary, the best way to stay current on the field's output without reading every paper.

Intermediate

Arbital

Hyperlinked explainers on rationality, AI risk, and alignment concepts, designed for building understanding incrementally.

Intermediate

OpenAI Research

OpenAI

OpenAI's research blog covering capabilities and safety, including superalignment updates, red teaming results, and governance thinking.

Intermediate

ML Safety Newsletter

ML Safety

Newsletter on ML safety covering robustness, monitoring, alignment, and systemic risk with links to recent papers and commentary.

Intermediate

MIRI (Machine Intelligence Research Institute)

MIRI

The research institute focused on mathematical foundations of aligned AI, publishing on agent foundations, decision theory, and logical uncertainty.

Advanced

generative.ink

Essays on AI, alignment, and the philosophical implications of language models and generative systems.

Advanced

DeepMind AI Safety Research

DeepMind

DeepMind's safety team blog covering specification gaming, reward modeling, scalable oversight, and their technical safety research agenda.

Advanced

DeepMind

DeepMind's main research site with publications on capabilities and safety, including Gemini evaluations, alignment research, and responsible scaling.

Intermediate

carado.moe

carado

Technical AI safety writing and alignment research notes.

Advanced

LessWrong

The original community blog on rationality and AI alignment, where many foundational safety arguments were first developed and debated.

Intermediate

StampyAI Alignment Research Dataset

StampyAI

Curated dataset of alignment and safety documents from papers, books, and blogs, useful for training and evaluating AI safety knowledge.

Advanced

Robert Miles AI Safety

Robert Miles

One of the most popular AI alignment video series, explaining technical safety concepts like the orthogonality thesis, instrumental convergence, inner misalignment, and reward hacking in clear, rigorous terms.

Beginner · 2017

Rational Animations

Animated explainers on rationality and AI safety, adapting foundational alignment writing into accessible short films on existential risk, scalable oversight, and why aligning advanced AI is hard.

Beginner · 2020

3 Principles for Creating Safer AI | Stuart Russell | TED

Stuart Russell

Russell proposes building machines that are altruistic, humble about human values, and uncertain enough to defer to people—the core of his human-compatible approach to alignment.

Beginner · 2017

A.I. ‐ Humanity's Final Invention?

Kurzgesagt – In a Nutshell

Kurzgesagt's animated explainer on artificial superintelligence: how an AGI that improves itself in a feedback loop could rapidly surpass humans and why that makes alignment our most consequential problem.

Beginner · 2024

Deadly Truth of General AI? – Computerphile

Robert Miles

Rob Miles uses the 'deadly stamp collector' thought experiment to show why a general AI pursuing a simple objective could be catastrophic if its goals aren't aligned with ours.

Beginner · 2015

How Not to Destroy the World with AI

Stuart Russell

The Royal Institution lecture in which Russell lays out why the standard model of AI—optimizing fixed objectives—is dangerous, and how building machines uncertain about human preferences could keep them controllable.

Intermediate · 2023

Eliezer Yudkowsky: Dangers of AI and the End of Human Civilization | Lex Fridman Podcast #368

Lex Fridman

A long-form conversation in which Yudkowsky makes his case that humanity is unprepared for superintelligence, probing why alignment is so hard and why he expects catastrophe by default.

Intermediate · 2023