Reinforcement learning & reward hacking

Reinforcement learning as it bears on safety: reward hacking, specification gaming, imitation learning, and policy optimization.

Concrete Problems in AI Safety

Dario Amodei et al.

Amodei et al. framed AI safety as a concrete ML research agenda by cataloging five problem areas: reward hacking, negative side effects, distributional shift, unsafe exploration, and scalable oversight.

Advanced · ~45 min read · 2016

Proximal Policy Optimization (PPO)

Schulman et al.

PPO stabilized policy gradient training and became the optimization backbone behind RLHF pipelines including early ChatGPT, making it foundational infrastructure for alignment work.

Advanced · 2017

Deep Reinforcement Learning from Human Preferences

Paul Christiano et al.

Christiano et al. established preference-based reward modeling, the foundational method that RLHF alignment pipelines later built on to steer language model behavior.

Advanced · 2017

Causal Confusion in Imitation Learning

Pim de Haan et al.

De Haan et al. showed that imitation learners can latch onto spurious correlates of expert actions in the training data, illustrating how policies trained on underspecified signals fail in deployment.

Advanced · 2019

GopherCite

DeepMind

DeepMind tackled hallucination by using reinforcement learning from human preferences to train a model that backs its answers with verbatim quotes from cited sources, so that claims can be checked against evidence.

Advanced · ~70 min read · 2022

Direct Preference Optimization (DPO)

Rafailov et al.

DPO provides a simpler and often more stable alternative to PPO-based RLHF for preference alignment, lowering the barrier to safety-tuning open models.

Advanced · 2023

ARENA (Alignment Research Engineer Accelerator)

ARENA

Hands-on technical curriculum for skilling up in AI alignment research engineering, freely available online and covering deep learning fundamentals, transformer mechanistic interpretability, reinforcement learning, and LLM evaluations.

Advanced

AlphaGo

Greg Kohs

DeepMind's Go-playing system defeats world champion Lee Sedol, a landmark demonstration of how reinforcement learning can surpass human mastery and a vivid case study in superhuman, sometimes inscrutable, machine strategy.

Beginner · 2017

The Social Dilemma

Jeff Orlowski

Former tech insiders expose how recommendation algorithms optimize relentlessly for engagement, a real-world illustration of misaligned objectives and reward hacking operating at civilizational scale.

Beginner · 2020

Victoria Krakovna's blog

Victoria Krakovna

Research notes on specification gaming, side effects, and AI safety from a DeepMind safety researcher, including the widely cited list of specification gaming examples.

Advanced

DeepMind AI Safety Research

DeepMind

DeepMind's safety team blog covering specification gaming, reward modeling, scalable oversight, and their technical safety research agenda.

Advanced

Robert Miles AI Safety

Robert Miles

One of the most popular AI alignment video series, explaining technical safety concepts such as the orthogonality thesis, instrumental convergence, inner misalignment, and reward hacking in clear, rigorous terms.

Beginner · 2017

The Artificial Intelligence That Deleted A Century

Tom Scott

A short speculative fiction about a narrow copyright-enforcement AI that, left unchecked, destroys a century of culture: an accessible parable of specification gaming and unintended consequences.

Beginner · 2020