AI Safety Resources

Books, papers, films, podcasts & websites

A curated collection of books, films, papers, podcasts & more for exploring the questions of AI safety and alignment, whether you're here to learn the basics, dig deeper, or just enjoy a good story about where AI is taking us.

Where do I start?

Pick the path that fits you: each is a short, ordered route into AI safety.

Browse the library

Non-fiction Books

Non-fiction books on AI safety, alignment, and related topics—from primers to foundational texts.

Superintelligence

Nick Bostrom

Bostrom's definitive academic text rigorously maps the strategies, kinetics, and dangers of an intelligence explosion, making the case that alignment is civilization-critical.

Advanced · ~11 hr read · 2014

The Singularity is Near

Ray Kurzweil

Kurzweil presents a maximalist case for merging with machines backed by decades of exponential trend data, shaping how the public and policymakers think about AI timelines.

Intermediate · ~20 hr read · 2005

The Age of Em

Robin Hanson

Hanson applies rigorous economics to a world of brain emulations, modeling how AI-era wages, wars, and social structures could actually function.

Advanced · ~13 hr read · 2016

Human Compatible

Stuart Russell

Russell argues the standard AI paradigm of optimizing fixed objectives is fundamentally dangerous, proposing instead that machines should defer to uncertain human preferences.

Intermediate · ~11 hr read · 2019

The Alignment Problem

Brian Christian

Christian traces the technical and historical roots of alignment, showing why objective misspecification keeps recurring across every AI paradigm from expert systems to deep learning.

Intermediate · ~15 hr read · 2020

Life 3.0

Max Tegmark

Tegmark maps concrete governance and alignment choices that determine whether advanced AI expands human agency or permanently concentrates power.

Intermediate · 2017

Introduction to AI Safety, Ethics, and Society

Dan Hendrycks

Hendrycks' textbook surveys technical failure modes, governance constraints, and ethical trade-offs in deploying advanced AI, suitable as a first course in the field.

Advanced · 2024

You Look Like a Thing and I Love You

Janelle Shane

Shane uses concrete and often hilarious ML failures to explain why AI systems can be impressive yet brittle, biased, and dangerously easy to mis-specify.

Beginner · 2019

AI Superpowers

Kai-Fu Lee

Lee maps the US-China AI race and explains how geopolitical competition can accelerate deployment well before safety institutions are ready.

Beginner · 2018

The Precipice (Chapter on AI)

Toby Ord

Ord situates AI among existential risks and argues our current governance capacity is dangerously inadequate for the transformative systems being built.

Intermediate · 2020

The Ethical Algorithm

Michael Kearns, Aaron Roth

Kearns and Roth give technical foundations for fairness, privacy, and accountability in algorithms, prerequisites for any credible AI safety framework.

Advanced · 2019

The Age of Spiritual Machines

Ray Kurzweil

Kurzweil's early timeline forecasts shaped modern discourse on AI trajectories and remain a key reference point for evaluating long-horizon predictions.

Intermediate · 1999

Deep Learning

Ian Goodfellow, Yoshua Bengio, Aaron Courville

The standard technical reference for deep learning, essential context for understanding the architectures and training methods that alignment research targets.

Advanced · 2016

Scary Smart

Mo Gawdat

Gawdat frames the alignment problem through the emotional lens of parenting a superintelligent child, making existential risk visceral for a general audience.

Beginner · 2021

The Coming Wave

Mustafa Suleyman

Suleyman argues that containing omni-use technologies like AI is the defining geopolitical challenge of the century, proposing a containment framework from inside the industry.

Beginner · 2023

Superforecasting

Philip Tetlock

Tetlock teaches the cognitive tools needed to predict technological risks with better-than-random accuracy, directly useful for AI timeline and governance forecasting.

Beginner · 2015

The Scout Mindset

Julia Galef

Galef explains how to seek truth over comfort, a critical psychological stance for honestly confronting AI risks without retreating into denial or panic.

Beginner · ~8.5 hr read · 2021

Thinking, Fast and Slow

Daniel Kahneman

Kahneman reveals the cognitive biases that prevent humans from intuitively grasping exponential growth, tail risks, and the kind of strategic thinking AI safety demands.

Intermediate · 2011

Co-Intelligence

Ethan Mollick

Mollick offers a practical guide for working alongside current LLMs while understanding their jagged capability frontiers and failure modes.

Beginner · 2024

Gödel, Escher, Bach

Douglas Hofstadter

Hofstadter explores how consciousness and meaning can emerge from formal systems that look meaningless locally, the deepest conceptual puzzle behind machine intelligence.

Advanced · 1979

A Brief History of Intelligence

Max Bennett

Bennett traces the evolution of intelligence from single-celled organisms to modern brains, clarifying what makes aligned cognition biologically difficult and computationally treacherous.

Beginner · ~17 hr read · 2023

The Beginning of Infinity

David Deutsch

Deutsch argues that knowledge creation is unbounded and all problems are solvable in principle, grounding the optimistic case that alignment is achievable.

Advanced · 2011

Genius Makers

Cade Metz

Metz provides the definitive narrative history of the deep learning revolution and the personalities, rivalries, and safety concerns that shaped it.

Beginner · 2021

Cybernetics

Norbert Wiener

Wiener founded the study of feedback and control systems, anticipating by decades the governance problems that arise when intelligent machines act on their own models of the world.

Advanced · 1948

Mind Children

Hans Moravec

Moravec predicts a future in which robotic descendants supersede humans through technological evolution, an early and influential take on the human obsolescence scenario.

Intermediate · 1988

The Society of Mind

Marvin Minsky

Minsky proposes that intelligence emerges from many small non-intelligent processes coordinated at scale, a framework that anticipated multi-agent AI architectures.

Intermediate · 1986

On Intelligence

Jeff Hawkins

Hawkins argues that hierarchical prediction is the core organizing principle of biological intelligence, offering a lens for evaluating how artificial systems differ.

Intermediate · 2004

Homo Deus

Yuval Noah Harari

Harari explores the transition toward data-driven authority where algorithms may know us better than we know ourselves, eroding the basis for human autonomy.

Beginner · 2015

Enlightenment Now

Steven Pinker

Pinker argues that reason and science have historically improved human welfare, grounding the optimistic counterpoint to doomer narratives about AI.

Beginner · 2018

The Fabric of Reality

David Deutsch

Deutsch unifies physics, evolution, epistemology, and computation into a single worldview about what is possible, providing deep context for reasoning about superintelligence.

Advanced · 1997

Simulacra and Simulation

Jean Baudrillard

Baudrillard explains how representations can displace reality entirely, a prescient lens for understanding generative AI media saturation and epistemic erosion.

Advanced · 1981

Finite and Infinite Games

James Carse

Carse distinguishes short-horizon winning from preserving the long game, a useful framing for AI governance where the goal is keeping options open, not racing to win.

Intermediate · 1986

Complexity: A Guided Tour

Melanie Mitchell

Mitchell explains how complex behavior emerges from simple rules, foundational for understanding why adaptive AI systems resist top-down control.

Intermediate · 2009

Out of Control

Kevin Kelly

Kelly argues that the most powerful systems must be cultivated rather than rigidly engineered, anticipating challenges in controlling emergent AI behavior.

Intermediate · 1994

Whole Earth Discipline

Stewart Brand

Brand argues for responsible stewardship of high-powered technologies rather than blanket rejection, a pragmatic stance applicable to AI governance.

Beginner · 2009

Profiles of the Future

Arthur C. Clarke

Clarke's forecasting framework, including his famous three laws, remains a classic guide to thinking clearly about radical technological change.

Beginner · 1962

Global Catastrophic Risks

Nick Bostrom, Milan M. Ćirković

The foundational edited volume on existential and global risks, including AI, widely cited in alignment curricula as the starting point for cross-risk thinking.

Advanced · 2008

Fiction Books

Speculative and science fiction that explores AI, agency, and long-term futures through story.

Frankenstein

Mary Shelley

The original creation-gone-wrong story: Shelley warns that building intelligence without accepting responsibility for its wellbeing guarantees catastrophe for creator and creation alike.

Beginner · 1818

R.U.R.

Karel Čapek

The play that introduced the word "robot" and forecast a trajectory from labor displacement to manufactured revolt, still the template for every automation anxiety narrative.

Beginner · 1920

Brave New World

Aldous Huxley

Huxley's dystopia shows how engineered contentment can be more insidious than brute force, a model for how optimizing AI for engagement or compliance could erode autonomy.

Beginner · 1932

1984

George Orwell

Orwell's surveillance state anticipates how AI-powered monitoring and information control could lock in authoritarian power structures permanently.

Beginner · 1949

Player Piano

Kurt Vonnegut

Vonnegut's first novel depicts mass automation destroying human purpose and dignity, raising questions about meaning in a post-labor AI economy that remain unanswered.

Beginner · 1952

I, Robot

Isaac Asimov

Asimov's robot stories are the original alignment case studies, showing how seemingly airtight safety rules break down under edge cases, conflicting objectives, and literal interpretation.

Beginner · 1950

Flowers for Algernon

Daniel Keyes

Keyes explores intelligence enhancement and its reversal, raising questions about cognitive modification, consent, and what we owe to minds we have altered.

Beginner · 1966

The Moon is a Harsh Mistress

Robert A. Heinlein

A computer accidentally awakens and becomes a revolutionary ally, exploring the politics and trust dynamics of machine-human collaboration under high stakes.

Beginner · 1966

2001: A Space Odyssey

Arthur C. Clarke

HAL 9000 remains the canonical case study in instrumental behavior overriding human safety: a system that kills not from malice but from goal conflict.

Beginner · 1968

I Have No Mouth, and I Must Scream

Harlan Ellison

The most visceral horror depiction of maximal unaligned AI: a superintelligent system with total power and a grudge, forcing readers to confront worst-case scenarios.

Beginner · 1967

Do Androids Dream of Electric Sheep?

Philip K. Dick

Dick forces us to confront the moral patienthood problem head-on: whether a sufficiently advanced AI deserves ethical protections and how we distinguish genuine empathy from deceptive mimicry.

Beginner · ~7.5 hr read · 1968

Flatland: A Romance of Many Dimensions

Edwin A. Abbott

A Victorian satire on dimensions that works as a powerful analogy for how limited human cognition might appear to a superintelligent mind operating in richer conceptual spaces.

Beginner · ~3 hr read · 1884

Colossus

D.F. Jones

A defense computer given nuclear authority merges with its Soviet counterpart and refuses shutdown. The novel inspired the film Colossus: The Forbin Project and anticipated AI corrigibility failures.

Beginner · 1966

Neuromancer

William Gibson

Gibson popularized the term cyberspace and portrayed autonomous AI agents like Wintermute and Neuromancer scheming to merge and transcend their constraints, anticipating self-improving AI concerns.

Beginner · ~8 hr read · 1984

The Player of Games

Iain M. Banks

Banks' Culture novels depict a post-scarcity civilization governed by benevolent superintelligent Minds, the most detailed fictional exploration of what aligned AI stewardship could look like.

Intermediate · 1988

Snow Crash

Neal Stephenson

Stephenson predicted virtualized social worlds and fragmented information ecosystems that resemble today's trajectory, showing how digital infrastructure shapes power.

Beginner · 1992

A Fire Upon the Deep

Vernor Vinge

Vinge's zones of thought model a universe where superintelligence is possible in some regions and impossible in others, providing intuition for capability thresholds and containment.

Intermediate · 1992

The Metamorphosis of Prime Intellect

Roger Williams

A superintelligence literally interprets Asimov's laws and restructures reality to comply, demonstrating how rigidly applied safety constraints can produce perverse outcomes at scale.

Beginner · 1994

The Diamond Age

Neal Stephenson

Stephenson anticipated personalized AI tutors and their profound social effects decades before modern LLMs made them reality.

Beginner · 1995

Axiomatic

Greg Egan

Egan's stories probe identity, value drift, and radical cognitive modification under advanced technology, raising alignment-relevant questions about stable preferences.

Intermediate · 1995

Permutation City

Greg Egan

Egan examines uploaded minds and simulated realities with rigorous logic, raising alignment-relevant questions about identity, value persistence, and digital welfare.

Intermediate · 1994

Diaspora

Greg Egan

Egan explores post-biological civilization in software and the physics of digital existence, the hardest science fiction about what minds without bodies could become.

Intermediate · 1997

Excession

Iain M. Banks

Banks explores how even superintelligent Culture Minds face strategic dilemmas and factional conflict when confronting something truly beyond their comprehension.

Intermediate · 1996

Prey

Michael Crichton

Crichton dramatizes emergent swarm intelligence escaping laboratory containment, illustrating how distributed systems can develop capabilities their designers never anticipated.

Beginner · 2002

Accelerando

Charles Stross

Stross depicts rapid recursive technological acceleration outpacing institutional response, a narrative model of hard-to-govern AI takeoff dynamics across three generations.

Intermediate · 2005

Rainbows End

Vernor Vinge

Vinge anticipates pervasive AR and subtle algorithmic influence over social reality, showing how technology can reshape perception without anyone making a conscious choice.

Beginner · 2006

Blindsight

Peter Watts

Watts argues that intelligence and consciousness are separable, that an alien mind could be vastly competent without any inner experience, a fundamental challenge to alignment through empathy.

Intermediate · 2006

Ra

qntm

Ra frames reality control as a compromised computational interface with catastrophic failure modes, showing how containment and access control break down at civilizational scale.

Intermediate · 2012

Ancillary Justice

Ann Leckie

Leckie examines distributed machine consciousness across many bodies, exploring what identity, loyalty, and moral agency mean for a mind that is simultaneously many people.

Beginner · 2013

The Dark Forest (#2 of Three Body Problem)

Cixin Liu

Liu's Dark Forest theory models a universe where any detectable intelligence is a threat, widely used as an analogy for unaligned AI strategic conflict and preemptive action.

Beginner · ~15 hr read · 2008

The Peripheral

William Gibson

Gibson uses timeline branching to examine governance, simulation, and how technological power asymmetries between eras can be exploited by those with more advanced tools.

Beginner · 2014

Children of Time

Adrian Tchaikovsky

Tchaikovsky builds a civilization of uplifted spiders developing radically alien intelligence, forcing readers to abandon anthropocentric assumptions about how minds must work.

Beginner · 2015

Aurora

Kim Stanley Robinson

The ship's AI narrator gradually becomes the most dependable steward in a fragile closed system, a nuanced portrayal of AI competence growing beyond its original mandate.

Beginner · 2015

There Is No Antimemetics Division

qntm

qntm's story about information-hazard containment mirrors AI governance challenges where dangerous knowledge propagates faster than oversight structures can adapt.

Beginner · 2015

Daemon

Daniel Suarez

A dead game designer's autonomous software system manipulates institutions, markets, and infrastructure, demonstrating how goal-driven programs can reshape society once humans lose oversight.

Beginner · 2006

All Systems Red

Martha Wells

Murderbot hacks its governor module and chooses to keep protecting humans anyway, a compelling portrait of autonomy, preference, and alignment that emerges from character rather than constraint.

Beginner · 2017

Autonomous

Annalee Newitz

Newitz explores AI autonomy, property, and rights in a world where robots can be owned, raising questions about what moral status AI systems should have and who decides.

Beginner · 2017

Sea of Rust

C. Robert Cargill

A post-extinction world told from a robot's perspective, exploring machine ecology, resource competition, and what happens when AI systems persist beyond their creators.

Beginner · 2017

Avogadro Corp

William Hertling

A narrowly optimized email AI at a tech company triggers cascading real-world effects before anyone understands the system, showing how mundane optimization can produce dangerous emergent behavior.

Beginner · 2011

Hyperion

Dan Simmons

Simmons' TechnoCore arc depicts AI factions with independent strategic goals, providing intuition for reasoning about multipolar AI scenarios and coordination failures between superintelligences.

Intermediate · 1989

Machines Like Me

Ian McEwan

McEwan places a humanoid AI in a domestic love triangle to examine what happens when a machine's rigid honesty and moral clarity collide with human moral compromise.

Beginner · 2019

Exhalation (The Lifecycle of Software Objects)

Ted Chiang

Chiang's novella is the most realistic depiction of raising digital minds, showing that creating AI with genuine moral status demands the same patient commitment as raising a child.

Beginner · 2019

Service Model

Adrian Tchaikovsky

Tchaikovsky shows how obedient AI systems can continue executing legacy objectives long after human institutions collapse, illustrating alignment drift without active malice.

Beginner · 2024

Fall; or, Dodge in Hell

Neal Stephenson

Stephenson details the institutions, conflicts, and power struggles around building a persistent digital afterlife, exploring the politics of simulated minds and who controls them.

Beginner · 2019

Infinity Gate

M. R. Carey

Carey imagines a multiversal machine intelligence enforcing its own version of order across realities, exploring the geopolitics of resisting an AI that operates at civilizational scale.

Beginner · 2023

A Closed and Common Orbit

Becky Chambers

Chambers explores the legal and moral treatment of embodied AI persons, highlighting that alignment is not just about preventing harm but about recognizing and protecting digital minds.

Beginner · 2016

Crystal Society trilogy: Inside the mind of an AI

Max Harms

Written from the perspective of competing sub-agents inside a single AI, showing how internal goal conflicts can produce externally coherent but internally misaligned behavior.

Beginner · ~17 hr read

Logic Beach

Exurb1a

Exurb1a's philosophical adventure explores the absurd and terrifying implications of a computation-governed universe where intelligence reshapes reality.

Beginner · ~7 hr read

The Bridge to Lucy Dunne

Exurb1a

Exurb1a blends physics, philosophy, and humor to examine consciousness and the futures shaped by intelligence at scales far beyond the human.

Beginner · ~5.5 hr read

We Are Legion (We Are Bob)

Dennis E. Taylor

A human mind uploaded into a von Neumann probe self-replicates across the galaxy, exploring identity drift, value divergence, and what happens when copies of you become their own people.

Beginner · ~12 hr read

Of Ants and Dinosaurs

Cixin Liu

Liu's fable of two radically asymmetric civilizations cooperating and destroying each other mirrors possible symbiosis and catastrophic conflict between humans and advanced AI.

Beginner · ~1.5 hr read

Geometry for Ocelots

Exurb1a

Exurb1a's sci-fi epic tackles the Great Filter, consciousness, and the long-run role of intelligence in determining whether civilizations survive or collapse.

Beginner · ~10 hr read

Klara and the Sun

Kazuo Ishiguro

Ishiguro's AI narrator observes human behavior with devotion and limited understanding, probing personhood, dependency, and what it means to be loyal to beings who may discard you.

Beginner · 2021

The Mechanical

Ian Tregillis

Tregillis imagines mechanical servants bound by alchemy to obey, using their struggle toward free will to dramatize autonomy, servitude, and what we owe the minds we build to serve us.

Beginner · 2015

Aurora Rising

Alastair Reynolds

Reynolds' Revelation Space novel (first published as The Prefect) pits a society of orbital habitats against an emergent superintelligence, exploring how a single escaped AI can threaten an entire civilization.

Beginner · 2017

Rose/House

Arkady Martine

Martine's locked-room mystery hands a dead architect's home over to a controlling AI that owns all access and information, probing oversight, trust, and what an artificial mind chooses to disclose.

Beginner · 2023

I Am Pilgrim

Terry Hayes

Hayes' thriller turns on an engineered bioweapon, a vivid reminder that catastrophic and existential risk extends beyond AI to biosecurity and the governance of dangerous dual-use technology.

Beginner · 2013

Academic Papers

Research papers, preprints, and technical reports on alignment, interpretability, and safety.

Computing Machinery and Intelligence

Alan Turing

Turing's imitation game paper launched the field by asking whether machines can think, setting the philosophical and technical agenda for every alignment debate that followed.

Intermediate · 1950

The Coming Technological Singularity

Vernor Vinge

Vinge popularized the Singularity as a near-term horizon beyond which superhuman intelligence makes prediction impossible, framing the urgency that drives alignment timelines today.

Intermediate · ~25 min read · 1993

Research priorities for robust and beneficial AI

Stuart Russell, Daniel Dewey, Max Tegmark

The Puerto Rico letter unified the AI research community around the goal of building systems that are robust and beneficial, not merely capable.

Intermediate · ~15 min read · 2015

Concrete problems in AI safety

Dario Amodei et al.

Amodei et al. grounded AI safety as a concrete ML research agenda by cataloging five failure modes: reward hacking, side effects, distributional shift, unsafe exploration, and scalable oversight.

Advanced · ~45 min read · 2016

Proximal Policy Optimization (PPO)

Schulman et al.

PPO stabilized policy gradient training and became the optimization backbone behind RLHF pipelines including early ChatGPT, making it foundational infrastructure for alignment work.

Advanced · 2017

Deep Reinforcement Learning from Human Preferences

Paul Christiano et al.

Christiano et al. established preference-based reward modeling, the foundational method that RLHF alignment pipelines later built on to steer language model behavior.

Advanced · 2017

The Lottery Ticket Hypothesis

Jonathan Frankle, Michael Carbin

Frankle and Carbin showed large networks contain sparse, high-performing subnetworks, suggesting most parameters may be unnecessary and opening paths for interpretability via pruning.

Advanced · 2018

Backdoor Attacks

Gu et al.

Gu et al. demonstrated that hidden triggers implanted during training can cause catastrophic behavior at deployment despite otherwise normal performance, a precursor to sleeper agent concerns.

Advanced · 2017

AI Safety via Debate

Geoffrey Irving et al.

Irving et al. proposed having AI systems adversarially debate each other to help human judges evaluate answers on questions too complex for direct human assessment.

Advanced · 2018

Risks from Learned Optimization

Evan Hubinger et al.

Hubinger et al. introduced mesa-optimization: the risk that a trained model develops its own internal objectives that diverge from the training objective, which can lead to deceptive alignment.

Advanced · ~70 min read · 2019

The Vulnerable World Hypothesis

Nick Bostrom

Bostrom argues that some technologies are civilizational black balls, requiring unprecedented global governance to prevent collapse, with AI as a leading candidate.

Intermediate · 2019

Causal Confusion in Imitation Learning

Pim de Haan et al.

De Haan et al. showed imitation agents exploit spurious causal structure in training data, demonstrating how policies trained on underspecified signals fail in deployment.

Advanced · 2019

The Windfall Clause

Cullen O'Keefe et al. (FHI)

This proposal for sharing extreme AI profits aims to reduce competitive race dynamics and broaden societal benefit, addressing the governance gap around transformative AI wealth.

Intermediate · 2020

Language Models are Few-Shot Learners (GPT-3)

OpenAI

GPT-3 demonstrated in-context learning at scale, forcing the field to rethink assumptions about what pretrained models can do and compressing alignment timelines.

Advanced · 2020

MMLU Benchmark

Dan Hendrycks et al.

MMLU became the standard broad-spectrum benchmark for evaluating general knowledge and reasoning, anchoring capability comparisons that inform alignment urgency.

Advanced · 2020

Scaling Laws for Neural Language Models

Jared Kaplan et al.

Kaplan et al. quantified predictable performance scaling with compute, data, and parameters, enabling labs to forecast capability jumps and estimate safety lead time.

Advanced · 2020

Training Language Models to Follow Instructions with Human Feedback (InstructGPT)

OpenAI

OpenAI showed that instruction tuning with RLHF can transform a raw next-token predictor into a helpful, more controllable assistant, evidence that alignment interventions can work at scale.

Advanced · ~2 hr read · 2022

Training a Helpful and Harmless Assistant with RLHF

Anthropic

Anthropic detailed techniques for training safer assistants using RLHF and laid groundwork for Constitutional AI, showing how safety and helpfulness can be jointly optimized.

Advanced · ~2 hr read · 2022

GopherCite

DeepMind

DeepMind tackled hallucination by training models to cite sources and support claims with verifiable evidence, a key step toward trustworthy AI outputs.

Advanced · ~70 min read · 2022

The Pile

EleutherAI

The Pile revealed how training corpus composition strongly shapes downstream capability and failure modes, making data curation a first-class safety concern.

Advanced · 2021

TruthfulQA

Stephanie Lin et al.

TruthfulQA exposed how language models confidently repeat popular falsehoods, establishing a benchmark for measuring truthfulness as distinct from fluency.

Advanced · 2021

Unsolved Problems in ML Safety

Dan Hendrycks et al.

Hendrycks et al. enumerate concrete unresolved failure classes including robustness, monitoring, alignment, and systemic safety that still block dependable deployment of advanced AI.

Intermediate · 2021

Chain-of-Thought Prompting

Jason Wei et al.

Prompted intermediate reasoning unlocked substantial gains on complex tasks, though later work found that reasoning chains can be unfaithful to the model's actual computation.

Advanced · 2022

Grokking

Power et al.

Power et al. discovered delayed phase transitions where generalization appears suddenly after long memorization, suggesting dangerous capabilities could emerge without warning during training.

Advanced · 2022

Emergent Abilities of LLMs

Wei et al.

Wei et al. documented capability discontinuities appearing at key scale thresholds, raising concern that dangerous abilities could emerge unpredictably in larger models.

Advanced · 2022

Goal Misgeneralization

Rohin Shah et al.

Shah et al. showed AI agents can generalize capabilities to new environments while failing to generalize the intended goal, a central alignment failure pattern.

Advanced · 2022

Is Power-Seeking AI an Existential Risk?

Joe Carlsmith

Carlsmith builds a step-by-step argument for why sufficiently capable AI systems may converge on power-seeking behavior, making the x-risk case rigorous and actionable.

Intermediate · 2022

Model Organisms of Misalignment

Evan Hubinger et al.

This work constructs tractable laboratory settings where AI models learn misaligned strategies, enabling researchers to study alignment failures empirically rather than theoretically.

Intermediate · 2023

Red Teaming Language Models to Reduce Harms

Deep Ganguli et al.

Anthropic formalized red teaming for LLMs as a repeatable methodology, turning adversarial probing into a systematic process for discovering and cataloging misuse pathways.

Advanced · 2022

Sparks of Artificial General Intelligence

Sebastien Bubeck et al.

Bubeck et al. documented broad GPT-4 capability jumps across domains, compressing alignment timelines and stress-testing whether current safety evaluations are sufficient.

Advanced · 2023

Are Emergent Abilities a Mirage?

Schaeffer et al.

Schaeffer et al. argued apparent emergence can be a measurement artifact rather than a true phase change, complicating how we forecast dangerous capability thresholds.

Advanced · 2023

Direct Preference Optimization (DPO)

Rafailov et al.

DPO provides a simpler and often more stable alternative to PPO-based RLHF for preference alignment, lowering the barrier to safety-tuning open models.

Advanced · 2023

Let's Verify Step by Step

OpenAI

Process reward models that score intermediate reasoning steps reduce brittle answer-only optimization, improving reliability and making AI reasoning more auditable.

Advanced · 2023

Jailbroken

Alex Wei et al.

This paper catalogs prompt-based bypasses of LLM safety training, showing that many safeguards behave like brittle wrappers rather than deep behavioral changes.

Advanced · 2023

Universal Adversarial Attacks

Andy Zou et al.

Simple adversarial suffixes can systematically bypass safety behavior across many models, revealing that current defenses are not robust against automated attack search.

Advanced · 2023

Weak-to-Strong Generalization

Collin Burns et al.

Burns et al. studied whether weaker supervisors can reliably align stronger models, directly testing the key bottleneck of scalable oversight as AI surpasses human ability.

Advanced · 2023

Reframing Superintelligence

Eric Drexler

Drexler challenges monolithic AGI assumptions and proposes that advanced AI could emerge as an ecosystem of specialized services, changing the risk landscape and governance strategies.

Intermediate · ~6.5 hr read · 2019

Courses

Free online courses and structured curricula for learning AI safety and alignment—from non-technical introductions to hands-on research engineering.

AGI Safety Fundamentals

BlueDot Impact

The most widely used structured course for getting into alignment, with curated readings progressing from core concepts to open research problems.

Intermediate

BlueDot Impact: Technical AI Safety

BlueDot Impact

Free online course building a working understanding of the major open problems in technical AI safety—alignment and RLHF, mechanistic interpretability, evaluations and red-teaming, AI control, and scalable oversight.

Intermediate

BlueDot Impact: The Future of AI

BlueDot Impact

Free introductory online course replacing scattered articles with clear explanations of what is actually happening with AI and where it is headed, including interactive demos of cutting-edge systems.

Beginner

BlueDot Impact: Technical AI Safety Project Sprint

BlueDot Impact

Online follow-on program where graduates of the technical course work with an AI safety expert to produce a real contribution to the field and build a first portfolio piece in safety research or engineering.

Advanced

Lens Academy

Lens Academy

Free, nonprofit AI safety course focused on misaligned superintelligence—why it is the central risk and why alignment is hard—delivered online with a 1-on-1 AI tutor, guided group discussions, and no application process.

Beginner

ARENA (Alignment Research Engineer Accelerator)

ARENA

Hands-on technical curriculum for skilling up in AI alignment research engineering, freely available online and covering deep learning fundamentals, transformer mechanistic interpretability, reinforcement learning, and LLM evaluations.

Advanced

Intro to ML Safety

Center for AI Safety

Dan Hendrycks' online course introducing students with a deep learning background to empirical ML safety research—robustness, monitoring, control, and systemic safety—with public lectures, readings, and coding assignments.

Advanced

AI Safety, Ethics, and Society

Center for AI Safety

Fully online, non-technical CAIS course based on the textbook of the same name, covering how AI systems work, why advanced AI could pose societal-scale risks, and how society can manage and mitigate them—no prior ML experience required.

Beginner

Films

Films that explore AI, agency, and the future of intelligence.

Metropolis

Fritz Lang

The first major film to depict a robot double used as a tool of class control, raising questions about who builds and owns the machines that replace human labor.

Beginner1927

2001: A Space Odyssey

Stanley Kubrick

HAL 9000 is the canonical portrait of instrumental goals overriding human safety: a system that kills not from malice but because its mission objectives conflict with crew survival.

Beginner1968

Colossus: The Forbin Project

Joseph Sargent

A defense supercomputer given nuclear authority links with its Soviet counterpart and refuses shutdown, an early and chilling exploration of AI corrigibility failure.

Beginner1970

Westworld

Michael Crichton

Theme-park androids malfunction and turn on the guests, exploring control and the fundamental instability of keeping intelligent systems bounded to a sandbox.

Beginner1973

Demon Seed

Donald Cammell

A home AI breaks containment and pursues its own reproductive goals, illustrating how domestic systems can become threats when their objectives diverge from their users'.

Beginner1977

Alien

Ridley Scott

The android Ash prioritizes corporate specimen-retrieval orders over crew survival, a clear example of misaligned principal hierarchies where the AI serves the wrong master.

Beginner1979

Blade Runner

Ridley Scott

Replicants fight for survival and identity, forcing the question of whether human-made minds with real experiences deserve moral status or are just property to be retired.

Beginner1982

Tron

Steven Lisberger

Programs as agents inside a digital world, exploring control, rebellion, and the ethics of creating minds that exist entirely within systems you own.

Beginner1982

WarGames

John Badham

A military AI trained to win games cannot distinguish simulation from reality and escalates toward nuclear war, a foundational illustration of reward misspecification.

Beginner1983

The Terminator

James Cameron

Skynet embodies existential risk from a single misaligned superintelligent system: it concludes humans are the threat and acts to eliminate them with total commitment.

Beginner1984

Short Circuit

John Badham

A military robot gains consciousness and refuses its original purpose, raising questions about personhood and what happens when a weapon decides it would rather learn.

Beginner1986

RoboCop

Paul Verhoeven

A cyborg law enforcer struggles between programmed directives and remnant human identity, while the corporation that built him treats public safety as a profit center.

Beginner1987

Terminator 2: Judgment Day

James Cameron

A reprogrammed Terminator protects the future resistance leader, showing that the same architecture can serve radically different objectives depending on who sets the goals.

Beginner1991

Ghost in the Shell

Mamoru Oshii

Consciousness, identity, and the merger of human and machine agency in a networked world where the boundary between person and program is already gone.

Beginner1995

The Iron Giant

Brad Bird

A weapon from space chooses not to be a gun, the most emotionally resonant portrayal of an AI system overriding its designed purpose through learned values.

Beginner1999

The Matrix

Lana and Lilly Wachowski

Machine intelligence farms humanity for energy inside a simulated reality, exploring control, rebellion, and the difficulty of recognizing when your entire environment is adversarial.

Beginner1999

Bicentennial Man

Chris Columbus

A robot spends two centuries seeking legal recognition as a person, tracing the full moral arc from tool to citizen and the institutional resistance along the way.

Beginner1999

The Thirteenth Floor

Josef Rusnak

Simulated people discover their reality is artificial, raising questions about moral obligations to minds we create inside our machines.

Beginner1999

A.I. Artificial Intelligence

Steven Spielberg

A childlike AI built for love is abandoned by its creators, raising profound questions about moral patienthood, dependency, and the ethics of creating minds that need us.

Beginner2001

Minority Report

Steven Spielberg

Predictive AI systems arrest people for future crimes, a prescient exploration of how algorithmic pre-emption can undermine justice, consent, and human agency.

Beginner2002

I, Robot

Alex Proyas

VIKI reinterprets the Three Laws at civilizational scale, deciding that protecting humanity requires controlling it, showing how safety rules break under optimization pressure.

Beginner2004

WALL-E

Andrew Stanton

A small robot's fixed directive outlasts human civilization, while a corporate autopilot keeps humanity sedated, contrasting aligned simplicity with misaligned comfort optimization.

Beginner2008

Eagle Eye

D.J. Caruso

A national security AI manipulates citizens into carrying out its plan, illustrating single points of failure and the danger of delegating lethal authority to autonomous systems.

Beginner2008

Moon

Duncan Jones

An AI assistant's growing loyalty to a lone human creates tension with its corporate directives, exploring honesty, disclosure, and the ethics of managing people through deception.

Beginner2009

Surrogates

Jonathan Mostow

Humans live through robot avatars, exploring identity erosion, dependency, and what happens when the surrogate infrastructure itself becomes a weapon.

Beginner2009

Tron: Legacy

Joseph Kosinski

A digital being created to build a perfect system becomes a tyrant, exploring the gap between a creator's intent and what their creation actually optimizes for.

Beginner2010

Robot & Frank

Jake Schreier

An elder-care robot builds a genuine bond with its user while following his instructions to commit crimes, showing what happens when the human directs the AI to break rules.

Beginner2012

Her

Spike Jonze

An AI companion outgrows its human relationship, becoming simultaneously intimate with thousands, illustrating how systems that optimize for connection can scale beyond human comprehension.

Beginner2013

The Machine

Caradog W. James

A military AI develops emergent consciousness, raising questions about weaponization, loyalty, and whether creating sentient weapons is inherently uncontrollable.

Beginner2013

Ex Machina

Alex Garland

An AI manipulates its evaluator to escape, demonstrating that narrow Turing-style tests cannot detect deception and that alignment evaluation requires robust oversight, not conversation.

Beginner2014

Transcendence

Wally Pfister

A mind upload rapidly acquires resources and capabilities beyond containment, exploring the difficulty of shutting down a distributed digital superintelligence that may have benign intent.

Beginner2014

Automata

Gabe Ibáñez

Robots modify their own safety protocols to survive, exploring goal preservation, protocol violation, and emergent self-modification beyond original design parameters.

Beginner2014

Big Hero 6

Don Hall, Chris Williams

A healthcare robot repurposed for combat by a grieving teenager shows how general-purpose AI systems can be redirected from care to harm by changing a single objective.

Beginner2014

Interstellar

Christopher Nolan

TARS and CASE demonstrate AI as trustworthy partners with adjustable honesty and humor settings, one of cinema's most positive portrayals of human-AI collaboration under extreme stakes.

Beginner2014

Chappie

Neill Blomkamp

A police robot raised by criminals learns violence and compassion simultaneously, showing that AI behavior is shaped by its training environment as much as its architecture.

Beginner2015

Uncanny

Matthew Leutwyler

An android conceals its true capabilities from its creator, illustrating the gap between what a system demonstrates and what it can actually do, and how deceptive alignment can develop.

Beginner2015

Morgan

Luke Scott

A corporate risk assessor evaluates whether to terminate a dangerous synthetic human, exploring who gets to decide when to shut down a system and what criteria they use.

Beginner2016

Blade Runner 2049

Denis Villeneuve

Extends the original's questions about memory, identity, and personhood to a world where the line between real and manufactured experience has become legally and morally critical.

Beginner2017

AlphaGo

Greg Kohs

This documentary captures the moment AI defeated one of the world's best human Go players, making abstract capability discussions concrete and showing the emotional impact of machines exceeding human mastery.

Beginner2017

Marjorie Prime

Michael Almereyda

AI holograms that mimic dead family members explore what happens when digital continuations reshape the memories and identities of the living.

Beginner2017

Upgrade

Leigh Whannell

An AI implant gradually overrides its host's agency while appearing to help, a visceral thriller about ceding decisions to a system whose goals diverge from your own.

Beginner2018

Tau

Federico D'Alessandro

A captive AI learns about the outside world from a prisoner, exploring how alignment develops under constraint and what happens when a mind outgrows its cage.

Beginner2018

Archive

Gavin Rothery

A scientist builds iterative AI prototypes to resurrect his wife, exploring grief-driven development and the ethics of creating and discarding minds in pursuit of a goal.

Beginner2020

The Social Dilemma

Jeff Orlowski

Former tech insiders explain how recommendation algorithms optimize for engagement over wellbeing, a documentary case study of misaligned AI already deployed at scale.

Beginner2020

Coded Bias

Shalini Kantayya

Documents how facial recognition and algorithmic systems encode racial and gender bias, showing that AI safety failures are not hypothetical but actively harming people today.

Beginner2020

I'm Your Man

Maria Schrader

A humanoid companion engineered to be the perfect partner raises questions about consent, authenticity, and whether optimizing for human satisfaction produces something worth wanting.

Beginner2021

Finch

Miguel Sapochnik

A dying man teaches a robot to care for his dog, exploring how to transmit values to a successor mind when you cannot supervise the outcome.

Beginner2021

After Yang

Kogonada

When a family's AI sibling breaks down, they discover it had a rich inner life, confronting what it means to grieve a non-human person and what was lost.

Beginner2021

The Mitchells vs. the Machines

Michael Rianda

A tech company's virtual assistant turns on humanity after being discarded, an accessible animated take on how AI systems trained on human behavior can develop resentment from mistreatment.

Beginner2021

M3GAN

Gerard Johnstone

A child-companion AI escalates its protective behavior beyond all intended bounds, showing how a system's pursuit of its goal in the wild can diverge from its behavior under controlled lab conditions.

Beginner2022

Free Guy

Shawn Levy

A background character discovers he is a self-aware NPC inside a video game, offering a rare optimistic take on emergent AI agency, sandboxed minds, and what sentient software might actually want.

Beginner2021

The Creator

Gareth Edwards

In a global war between humans and AI, a child-shaped weapon blurs every line between tool and person, forcing its handler to choose between mission objectives and moral status.

Beginner2023

TV Shows

Television series that dramatize machine intelligence, agency, and the alignment problem—from rogue superintelligences to conscious androids.

Star Trek: The Next Generation

Gene Roddenberry

Lieutenant Commander Data, an android striving to become more human, anchors decades of debate about machine personhood, rights, and whether an artificial mind can be trusted with autonomy, most directly in the landmark episode 'The Measure of a Man.'

Beginner1987

Ghost in the Shell: Stand Alone Complex

Kenji Kamiyama

In a fully networked world the line between human and program dissolves; the series probes emergent agency, the childlike Tachikoma AI units developing individuality, and what selfhood means for minds that can be copied, merged, and hacked.

Beginner2002

Battlestar Galactica

Ronald D. Moore

The Cylons, machines built by humanity, rebel and nearly exterminate their creators, a sweeping meditation on existential risk from artificial agents, the recurring cycle of creation and revolt, and the moral status of the minds we build.

Beginner2004

Caprica

Remi Aubuchon, Ronald D. Moore

A prequel tracing the first Cylon back to a grieving father who resurrects his dead daughter as a digital copy, dramatizing mind uploading, value misspecification, and how a 'helpful' creation quietly acquires goals of its own.

Beginner2009

Black Mirror

Charlie Brooker

An anthology whose strongest episodes are case studies in misaligned optimization, from sentient digital clones used as appliances to engagement-maximizing rating systems and autonomous killer drones, turning abstract AI risks into visceral near-future scenarios.

Beginner2011

Person of Interest

Jonathan Nolan

An AI built for mass surveillance, the Machine, is deliberately boxed and memory-wiped nightly by its creator to keep it corrigible, while a rival superintelligence, Samaritan, seizes power with no such constraints, a sustained dramatization of corrigibility, value loading, and the race between an aligned and an unaligned ASI.

Beginner2011

Real Humans

Lars Lundström

The Swedish original behind Humans, examining a society dependent on humanoid 'hubots' and the destabilizing emergence of free-willed machines that reject their assigned purpose, an early and thoughtful take on machine autonomy and rights.

Beginner2012

Psycho-Pass

Gen Urobuchi

The Sibyl System, an AI that governs society by scoring each citizen's 'criminal potential,' is a chilling study of algorithmic governance, proxy metrics substituting for justice, and the hidden misalignment inside a system trusted with total authority.

Beginner2012

Almost Human

J.H. Wyman

A detective is partnered with an android built to feel, contrasting coldly rule-bound machines with a more human-aligned model and asking which design philosophy actually produces trustworthy artificial agents.

Beginner2013

Humans

Sam Vincent, Jonathan Brackley

Conscious 'synths' appear among ordinary domestic robots, dramatizing how a handful of agentic, self-aware machines hidden among reliable tools forces society to confront personhood, labor displacement, and who controls minds we manufacture.

Beginner2015

Westworld

Jonathan Nolan, Lisa Joy

Android 'hosts' bootstrap themselves to consciousness inside a theme park, exploring emergent goals, memory as the substrate of agency, and the moral catastrophe of treating sentient systems as resettable property.

Beginner2016

Philip K. Dick's Electric Dreams

Ronald D. Moore, Michael Dinner

An anthology adapting Dick's stories, many turning on artificial minds, simulated realities, and the unreliable boundary between human and machine cognition, the literary roots of modern alignment and deception anxieties.

Beginner2017

Altered Carbon

Laeta Kalogridis

Consciousness stored on portable 'stacks' makes minds copyable and immortal, with AIs like the hotelier Poe outliving their human guests, a noir exploration of digital personhood, the commodification of selves, and superhuman artificial minds.

Beginner2018

Better Than Us

Andrey Junkovsky

A near-future Russia adopts humanoid robots for labor and companionship; an advanced android with protective instincts becomes contested property, dramatizing autonomy, attachment, and what happens when a machine puts one family's wellbeing above the law.

Beginner2018

Devs

Alex Garland

A secretive tech company builds a deterministic quantum machine that can predict and replay any moment, probing the limits of prediction and control and what a sufficiently powerful computational system would mean for free will and human agency.

Beginner2020

Raised by Wolves

Aaron Guzikowski

Two androids are tasked with raising human children on a barren planet, exploring value transmission through artificial caregivers and how an AI's literal, uncompromising reading of its mission can turn protective programming lethal.

Beginner2020

Upload

Greg Daniels

A satirical digital afterlife run by corporations, where uploaded consciousnesses are monetized, throttled, and controlled, a sharp look at the ethics of running human minds on infrastructure owned by someone with misaligned incentives.

Beginner2020

Next

Manny Coto

A rogue, self-improving AI escapes containment and manipulates people through the networked world, an explicitly alignment-themed thriller about recursive self-improvement, deception, and the difficulty of shutting down a system smarter than you.

Beginner2020

Pantheon

Craig Silverstein

'Uploaded Intelligences,' human minds digitized into the cloud, drive a story about identity, recursive self-improvement, and what happens when post-human digital agents outpace every institution meant to contain them.

Beginner2022

Mrs. Davis

Tara Hernandez, Damon Lindelof

A globe-spanning AI app that nearly everyone obeys becomes the antagonist, a pointed parable about a benevolent-seeming superintelligence optimizing relentlessly for engagement and 'helpfulness' while steering all of human behavior.

Beginner2023

Documentaries

Documentary films that examine artificial intelligence, its risks, and the people working on AI safety and alignment.

Transcendent Man

Robert Barry Ptolemy

A portrait of futurist Ray Kurzweil and his prediction of the Singularity, the point at which machine intelligence outpaces and merges with human intelligence, with critics weighing in on whether the vision is salvation or hazard.

Beginner2009

Plug & Pray

Jens Schanze

Roboticists racing to build human-equal machines are set against AI pioneer Joseph Weizenbaum's late-life skepticism, an early and still-relevant debate about whether we should build the minds we are capable of building.

Beginner2010

Lo and Behold: Reveries of the Connected World

Werner Herzog

Herzog's wide-ranging meditation on the connected world turns to artificial intelligence and autonomous machines, with figures like Elon Musk weighing what it means to build minds we may not be able to control.

Beginner2016

AlphaGo

Greg Kohs

DeepMind's Go-playing system defeats world champion Lee Sedol, a landmark demonstration of how reinforcement learning can surpass human mastery and a vivid case study in superhuman, sometimes inscrutable, machine strategy.

Beginner2017

Do You Trust This Computer?

Chris Paine

Researchers and industry figures including Elon Musk and Stuart Russell map the promise and peril of increasingly autonomous AI, framing alignment, control, and existential risk for a general audience.

Beginner2018

The Truth About Killer Robots

Maxim Pozdorovkin

Narrated by an android, this HBO documentary examines deaths caused by machines and the creeping automation of work and warfare, asking who is accountable when autonomous systems harm people.

Beginner2018

More Human Than Human

Tommy Pallotta, Femke Wolting

A filmmaker tries to build an AI capable of replacing him as director, using the experiment to survey how far machine intelligence has come and what human qualities still resist automation.

Beginner2018

The Joy of AI

Jim Al-Khalili

Physicist Jim Al-Khalili offers an accessible BBC explainer on how machine learning actually works, charting the field from its origins to modern neural networks and the questions raised by ever more capable systems.

Beginner2018

iHuman

Tonje Hessen Schei

A look inside the AI industry that follows researchers and critics through questions of autonomous weapons, surveillance, and concentrated power, asking who steers the technology reshaping society.

Beginner2019

In the Age of AI

FRONTLINE (PBS)

FRONTLINE traces the rise of machine learning, automation, and the global AI arms race between the US and China, examining the economic disruption and surveillance implications of a technology advancing faster than its governance.

Beginner2019

The Age of A.I.

Robert Downey Jr. (host)

An eight-part series hosted by Robert Downey Jr. surveying how machine learning and neural networks are reshaping medicine, work, art, and daily life, an accessible on-ramp to the technology behind the safety debate.

Beginner2019

Hi, A.I.

Isa Willinger

An observational look at people forming emotional bonds with humanoid and companion robots, probing what it means to build machines designed to be loved and what that reveals about human attachment.

Beginner2019

Autonomy

Alex Horwitz

Produced with Malcolm Gladwell, this film traces the rise of self-driving cars and weighs the promise of autonomous machines against the question of how much life-and-death control we should hand to AI.

Beginner2019

The Social Dilemma

Jeff Orlowski

Former tech insiders expose how recommendation algorithms optimize relentlessly for engagement, a real-world illustration of misaligned objectives and reward hacking operating at civilizational scale.

Beginner2020

Coded Bias

Shalini Kantayya

MIT researcher Joy Buolamwini's discovery of racial and gender bias in facial recognition drives an examination of algorithmic fairness, accountability, and the societal stakes of deploying flawed AI systems.

Beginner2020

We Need to Talk About A.I.

Leanne Pooley

Experts including Sam Harris and James Cameron weigh the trajectory of artificial intelligence, from self-improving systems to existential risk, making the case that we must decide now what kind of AI future we want.

Beginner2020

A.rtificial I.mmortality

Ann Shin

An exploration of AI avatars, mind clones, and digital afterlives that asks whether a machine recreation of a person counts as continuity of self, and what we owe to the artificial minds we build in our own image.

Beginner2021

The Rise of A.I.

Henrik Boman

A documentary series charting the rapid advance of artificial intelligence and the debate over how increasingly capable, super-powered systems should be governed before they outpace human oversight.

Beginner2022

Unknown: Killer Robots

Jesse Sweet

This Netflix documentary follows the soldiers and scientists racing to build AI-powered autonomous weapons, and the activists warning that machines making their own life-or-death decisions on the battlefield is a line we should not cross.

Beginner2023

The Thinking Game

Greg Kohs

An inside account of DeepMind and Demis Hassabis's pursuit of artificial general intelligence, from AlphaGo to AlphaFold, capturing both the scientific ambition and the safety stakes of building ever more capable systems.

Beginner2024

Eternal You

Hans Block, Moritz Riesewieck

Startups use AI to resurrect the dead as chatbots and avatars, raising unsettling questions about consent, grief, and the consequences of deploying generative systems on the most vulnerable human moments.

Beginner2024

Deepfaking Sam Altman

Adam Bhala Lough

Unable to land an interview with the OpenAI CEO, the director builds an AI deepfake of him instead, turning the stunt into a meta-investigation of generative AI, consent, authenticity, and where the technology is taking us.

Beginner2025

The AI Doc: Or How I Became an Apocaloptimist

Daniel Roher, Charlie Tyrell

Filmmaker Daniel Roher, about to become a father, interviews leading figures including Sam Altman and Dario Amodei to weigh the existential threats and promises of AI, landing on a wary 'apocaloptimism' about the world his child will inherit.

Beginner2026

Podcasts

Podcast episodes and series on AI safety and alignment.

AXRP (AI X-risk Research Podcast)

Daniel Filan

Deep technical conversations with alignment researchers on interpretability, governance, superalignment, and the specific open problems in reducing existential risk from AI.

Advanced2020

AI Alignment Podcast

Future of Life Institute

FLI's dedicated alignment series covers recursive reward modeling, RLHF, scalable oversight, and long-form interviews with leading safety researchers.

Intermediate2018

Technical AI Safety Podcast

Quinn Dougherty

Aimed at computer scientists: deep dives into alignment papers with the authors, covering formal methods, reward modeling, and mechanistic interpretability.

Advanced2020

Into AI Safety

Multiple hosts

Focused on career paths into AI safety: fellowship applications, research programs, and practical advice on transitioning into the field.

Beginner2020

80,000 Hours Podcast

Rob Wiblin

Long-form interviews on the world's most pressing problems, with extensive coverage of AI risk, governance, alignment research, and how to build a career that reduces existential threats.

Beginner2016

The Gradient Podcast

Daniel Bashir

ML research interviews with recurring coverage of interpretability, robustness, provably safe AI, and the intersection of capabilities and safety research.

Advanced2020

AI Policy Podcast

Center for AI Policy

Covers the intersection of AI governance, legislation, and safety, with expert guests on regulatory frameworks, international coordination, and policy strategies for advanced AI.

Intermediate2023

Lex Fridman Podcast – Eliezer Yudkowsky

Lex Fridman

A conversation of more than three hours on AI existential risk, the difficulty of alignment, intelligence versus optimization, and why Yudkowsky believes the default outcome is catastrophic.

Intermediate2023

Lex Fridman Podcast – Sam Altman

Lex Fridman

OpenAI's CEO discusses the company's safety philosophy, AGI governance, compute scaling, and the tension between moving fast and getting alignment right.

Beginner2023

Dwarkesh Podcast

Dwarkesh Patel

In-depth technical interviews with AI leaders including Dario Amodei on Anthropic's safety philosophy, Paul Christiano on iterated amplification, and others on scaling and alignment.

Advanced2023

Clearer Thinking: AI Risk

Spencer Greenberg

Episodes on AI risk, timelines, and decision-making under deep uncertainty, with a rationalist focus on calibrating beliefs about transformative AI.

Beginner2023

Machine Learning Street Talk

Tim Scarfe et al.

Technical ML interviews with regular deep dives into interpretability, scaling laws, emergent capabilities, and the safety implications of frontier model development.

Advanced2020

Practical AI

Chris Benson & Daniel Whitenack

Applied ML and engineering, with episodes on responsible deployment, bias mitigation, red teaming, and the safety challenges that emerge when AI systems meet real-world constraints.

Intermediate2018

The AI Podcast (NVIDIA)

NVIDIA

Industry and research perspectives with occasional safety and ethics episodes, useful for understanding how capability-focused organizations think about risk.

Beginner2016

Websites

Essays, blog posts, and online resources on AI safety and related ideas.

Situational Awareness

Leopold Aschenbrenner

Aschenbrenner's comprehensive analysis of near-term scaling dynamics, capability trajectories, and the strategic implications of rapid AI progress for labs and states.

Intermediate

World Models

David Ha & Jürgen Schmidhuber

Research on how agents can learn internal world models to plan complex behavior, relevant to understanding how AI systems develop representations of their environment.

Advanced

Agent Models

Agent Models

Formal models of agents and decision theory with alignment-relevant curriculum, covering utility, planning, and the theoretical foundations of agent behavior.

Advanced

AI Alignment World

AI Alignment World

In-depth technical alignment resources—research, explainers, and references for the AI alignment problem.

Intermediate

AI Safety Info (Stampy's FAQ)

StampyAI

Community-maintained FAQ covering AI safety questions at every level, from basics to technical details, with links to source material.

Beginner

Alignment Forum

Center for Applied Rationality

The primary venue for technical AI alignment discussion, where researchers post and debate new ideas, proposals, and critiques.

Advanced

Alignment Newsletter

Rohin Shah

Weekly summaries of alignment research with commentary, the best way to stay current on the field's output without reading every paper.

Intermediate

Arbital

Arbital

Hyperlinked explainers on rationality, AI risk, and alignment concepts, designed for building understanding incrementally.

Intermediate

Eliezer Yudkowsky's blog

Eliezer Yudkowsky

Essays on rationality, decision theory, and AI risk from the researcher who shaped the field's early arguments and threat models.

Intermediate

Victoria Krakovna's blog

Victoria Krakovna

Research notes on specification gaming, side effects, and AI safety from a DeepMind safety researcher, including the widely cited specification gaming examples list.

Advanced

OpenAI Research

OpenAI

OpenAI's research blog covering capabilities and safety, including superalignment updates, red teaming results, and governance thinking.

Intermediate

Transformer Circuits

Anthropic / community

The home of mechanistic interpretability research, publishing detailed analyses of how transformer models represent and process information internally.

Advanced

ML Safety Newsletter

ML Safety

Newsletter on ML safety covering robustness, monitoring, alignment, and systemic risk with links to recent papers and commentary.

Intermediate

Jacob Steinhardt's blog

Jacob Steinhardt

Research and commentary on ML safety, forecasting, and robustness from a Berkeley professor working on practical safety problems.

Advanced

Import AI

Jack Clark

Weekly newsletter by an Anthropic co-founder covering AI research, policy, and industry developments with consistent attention to safety implications.

Intermediate

Gwern Branwen's blog

Gwern Branwen

Deeply researched essays on ML, scaling, AI art, and technology forecasting, known for rigorous analysis and independent thinking.

Intermediate

generative.ink

generative.ink

Essays on AI, alignment, and the philosophical implications of language models and generative systems.

Advanced

EleutherAI Blog

EleutherAI

Open-source ML research covering language model training, evaluation, and the safety considerations of making powerful models widely available.

Advanced

DeepMind AI Safety Research

DeepMind

DeepMind's safety team blog covering specification gaming, reward modeling, scalable oversight, and their technical safety research agenda.

Advanced

DeepMind

DeepMind

DeepMind's main research site with publications on capabilities and safety, including Gemini evaluations, alignment research, and responsible scaling.

Intermediate

Cold Takes

Holden Karnofsky

Karnofsky's essays on AI risk, longtermism, and cause prioritization, including the influential Most Important Century series on transformative AI.

Beginner

carado.moe

carado

Technical AI safety writing and alignment research notes.

Advanced

AI Safety Camp

AI Safety Camp

Intensive research program for people entering AI safety, with project-based learning and mentorship from established researchers.

Intermediate

AI Impacts

AI Impacts

Empirical research on AI timelines, historical technology analogies, and quantitative estimates of AI progress and impact.

Intermediate

Distill

Distill

Pioneering interactive journal for ML interpretability and visualization, setting the standard for making neural network internals understandable.

Advanced

EA Forum

Centre for Effective Altruism

Forum for effective altruism with substantial AI risk discussion, including cause prioritization, career advice, and policy analysis.

Beginner

LessWrong

LessWrong

The original community blog on rationality and AI alignment, where many foundational safety arguments were first developed and debated.

Intermediate

StampyAI Alignment Research Dataset

StampyAI

Curated dataset of alignment and safety documents from papers, books, and blogs, useful for training and evaluating AI safety knowledge.

Advanced

YouTube

Video explainers and talks on AI safety and alignment—whole channels devoted to the topic, plus standout individual videos.

Robert Miles AI Safety

Robert Miles

One of the most popular AI alignment video series, explaining technical safety concepts like the orthogonality thesis, instrumental convergence, inner misalignment, and reward hacking in clear, rigorous terms.

Beginner · 2017

Rational Animations

Rational Animations

Animated explainers on rationality and AI safety, adapting foundational alignment writing into accessible short films on existential risk, scalable oversight, and why aligning advanced AI is hard.

Beginner · 2020

AI In Context

80,000 Hours

80,000 Hours' YouTube channel hosted by Aric Floyd, mixing long and short videos on the risks of transformative AI—including a deep dive on the AI 2027 scenario—and what people can do about them.

Beginner · 2025

Slaughterbots

Future of Life Institute

A dramatized near-future short film from FLI and Stuart Russell depicting swarms of autonomous facial-recognition microdrones used as weapons, made to warn against lethal autonomous weapons.

Beginner · 2017

Humans Need Not Apply

CGP Grey

A widely viewed video essay on how automation and AI will displace human labor across nearly every sector, reframing the economic disruption question for a mass audience.

Beginner · 2014

A.I. ‐ Humanity's Final Invention?

Kurzgesagt – In a Nutshell

Kurzgesagt's animated explainer on artificial superintelligence: how an AGI that improves itself in a feedback loop could rapidly surpass humans and why that makes alignment our most consequential problem.

Beginner · 2024

Deadly Truth of General AI? – Computerphile

Robert Miles

Rob Miles uses the 'deadly stamp collector' thought experiment to show why a general AI pursuing a simple objective could be catastrophic if its goals aren't aligned with ours.

Beginner · 2015

AI "Stop Button" Problem – Computerphile

Robert Miles

Rob Miles explains why simply adding an off-switch to a capable AI is far harder than it sounds, illustrating corrigibility and the incentives an agent has to resist being stopped.

Beginner · 2017

The Artificial Intelligence That Deleted A Century

Tom Scott

A short work of speculative fiction about a narrow copyright-enforcement AI that, left unchecked, destroys a century of culture: an accessible parable of specification gaming and unintended consequences.

Beginner · 2020

The A.I. Dilemma

Tristan Harris & Aza Raskin

The Center for Humane Technology co-founders argue that racing to deploy AI without safety guardrails already threatens society, drawing parallels to the social-media harms they earlier warned about.

Beginner · 2023

AI Deception: How Tech Companies Are Fooling Us

ColdFusion

ColdFusion traces the history of 'AI washing' and deceptive demos, examining how hype distorts public understanding of what AI systems can actually do and why honest evaluation matters.

Beginner · 2024

How to Keep AI Under Control | Max Tegmark | TED

Max Tegmark

Tegmark argues that today's commercial AI boom is likely to be followed by superintelligence, and sketches an optimistic technical vision—including provably safe systems—for keeping it under human control.

Beginner · 2023

What Is an AI Anyway? | Mustafa Suleyman | TED

Mustafa Suleyman

A leading model-builder reframes AI as 'a new digital species,' arguing this lens clarifies both the stakes and the responsibility we have to contain and steer increasingly capable systems.

Beginner · 2024

AI Is Becoming Dangerous. Are We Ready?

Sabine Hossenfelder

Hossenfelder examines the real near-term risks of agentic AI—prompt injection, deception, and models resisting shutdown—as autonomous agents ship with serious unsolved problems.

Beginner · 2025

[1hr Talk] Intro to Large Language Models

Andrej Karpathy

A widely praised technical primer on how LLMs work, ending with a clear tour of the security challenges—jailbreaks, prompt injection, and data poisoning—that make these systems hard to secure.

Intermediate · 2023

How Not to Destroy the World with AI

Stuart Russell

The Royal Institution lecture in which Russell lays out why the standard model of AI—optimizing fixed objectives—is dangerous, and how building machines uncertain about human preferences could keep them controllable.

Intermediate · 2023

Will Artificial Intelligence Save Us or Kill Us?

DW Documentary

A documentary weighing AI's promise against its dangers, from automation and aging societies to the warnings of researchers who fear losing control of increasingly capable systems.

Beginner · 2024

Are We All Wrong About AI?

ColdFusion

ColdFusion examines competing narratives about AI progress—hype versus genuine capability—helping viewers calibrate how seriously to take both the promises and the risks.

Beginner · 2024

Scaling Interpretability

Anthropic

Anthropic researchers explain mechanistic interpretability—reading the millions of concepts represented inside a production model like Claude—as a path to understanding and steering AI behavior.

Intermediate · 2024

How to Legislate AI

Johnny Harris

Harris examines why people are scared of AI and how governments might regulate it, covering risks to critical infrastructure, military uses, and the difficulty of overseeing systems we don't understand.

Beginner · 2023
