Books, papers, films, podcasts & websites
A curated collection of books, films, papers, podcasts & more for exploring the questions of AI safety and alignment, whether you're here to learn the basics, dig deeper, or just enjoy a good story about where AI is taking us.
Non-fiction books on AI safety, alignment, and related topics—from primers to foundational texts.
Bostrom's definitive academic text rigorously maps the strategies, kinetics, and dangers of an intelligence explosion, making the case that alignment is civilization-critical.
Kurzweil presents a maximalist case for merging with machines backed by decades of exponential trend data, shaping how the public and policymakers think about AI timelines.
Hanson applies rigorous economics to a world of brain emulations, modeling how AI-era wages, wars, and social structures could actually function.
Russell argues the standard AI paradigm of optimizing fixed objectives is fundamentally dangerous, proposing instead that machines should defer to uncertain human preferences.
Christian traces the technical and historical roots of alignment, showing why objective misspecification keeps recurring across every AI paradigm from expert systems to deep learning.
Tegmark maps concrete governance and alignment choices that determine whether advanced AI expands human agency or permanently concentrates power.
Hendrycks' textbook surveys technical failure modes, governance constraints, and ethical trade-offs in deploying advanced AI, suitable as a first course in the field.
McKee synthesizes the core x-risk arguments into an accessible, urgent case for why superintelligence governance and alignment research cannot wait.
Shane uses concrete and often hilarious ML failures to explain why AI systems can be impressive yet brittle, biased, and dangerously easy to mis-specify.
Fry examines real algorithmic decision systems in justice, medicine, and transport to show where AI improves outcomes and where accountability structures fail.
Lee maps the US-China AI race and explains how geopolitical competition can accelerate deployment well before safety institutions are ready.
Ord situates AI among existential risks and argues our current governance capacity is dangerously inadequate for the transformative systems being built.
Kearns and Roth give technical foundations for fairness, privacy, and accountability in algorithms, prerequisites for any credible AI safety framework.
Scharre details how military AI autonomy changes escalation dynamics and why human-in-the-loop control mechanisms consistently lag behind battlefield capability.
Kurzweil's early timeline forecasts shaped modern discourse on AI trajectories and remain a key reference point for evaluating long-horizon predictions.
The standard technical reference for deep learning, essential context for understanding the architectures and training methods that alignment research targets.
Mitchell offers a grounded, skeptical look at current AI capabilities, countering hype with hard limits and clarifying what today's systems actually can and cannot do.
Gawdat frames the alignment problem through the emotional lens of parenting a superintelligent child, making existential risk visceral for a general audience.
Suleyman argues that containing omni-use technologies like AI is the defining geopolitical challenge of the century, proposing a containment framework from inside the industry.
Tetlock teaches the cognitive tools needed to predict technological risks with better-than-random accuracy, directly useful for AI timeline and governance forecasting.
Galef explains how to seek truth over comfort, a critical psychological stance for honestly confronting AI risks without retreating into denial or panic.
Kahneman reveals the cognitive biases that prevent humans from intuitively grasping exponential growth, tail risks, and the kind of strategic thinking AI safety demands.
Mollick offers a practical guide for working alongside current LLMs while understanding their jagged capability frontiers and failure modes.
Hofstadter explores how consciousness and meaning can emerge from formal systems that look meaningless locally, the deepest conceptual puzzle behind machine intelligence.
Bennett traces the evolution of intelligence from single-celled organisms to modern brains, clarifying what makes aligned cognition biologically difficult and computationally treacherous.
Deutsch argues that knowledge creation is unbounded and all problems are solvable in principle, grounding the optimistic case that alignment is achievable.
Metz provides the definitive narrative history of the deep learning revolution and the personalities, rivalries, and safety concerns that shaped it.
Wiener founded the study of feedback and control systems, anticipating by decades the governance problems that arise when intelligent machines act on their own models of the world.
Moravec predicts a future in which robotic descendants supersede humans through technological evolution, an early and influential take on the human obsolescence scenario.
Minsky proposes that intelligence emerges from many small non-intelligent processes coordinated at scale, a framework that anticipated multi-agent AI architectures.
Hawkins argues that hierarchical prediction is the core organizing principle of biological intelligence, offering a lens for evaluating how artificial systems differ.
Harari explores the transition toward data-driven authority where algorithms may know us better than we know ourselves, eroding the basis for human autonomy.
Pinker argues that reason and science have historically improved human welfare, grounding the optimistic counterpoint to doomer narratives about AI.
Deutsch unifies physics, evolution, epistemology, and computation into a single worldview about what is possible, providing deep context for reasoning about superintelligence.
Baudrillard explains how representations can displace reality entirely, a prescient lens for understanding generative AI media saturation and epistemic erosion.
Carse distinguishes short-horizon winning from preserving the long game, a useful framing for AI governance where the goal is keeping options open, not racing to win.
Mitchell explains how complex behavior emerges from simple rules, foundational for understanding why adaptive AI systems resist top-down control.
Kelly argues that the most powerful systems must be cultivated rather than rigidly engineered, anticipating challenges in controlling emergent AI behavior.
Brand argues for responsible stewardship of high-powered technologies rather than blanket rejection, a pragmatic stance applicable to AI governance.
Clarke's forecasting framework, including his famous three laws, remains a classic guide to thinking clearly about radical technological change.
The foundational edited volume on existential and global risks, including AI, widely cited in alignment curricula as the starting point for cross-risk thinking.
Speculative and science fiction that explores AI, agency, and long-term futures through story.
The original creation-gone-wrong story: Shelley warns that building intelligence without accepting responsibility for its wellbeing guarantees catastrophe for creator and creation alike.
The play that introduced the word robot and forecast a trajectory from labor displacement to manufactured revolt, still the template for every automation anxiety narrative.
Huxley's dystopia shows how engineered contentment can be more insidious than brute force, a model for how optimizing AI for engagement or compliance could erode autonomy.
Orwell's surveillance state anticipates how AI-powered monitoring and information control could lock in authoritarian power structures permanently.
Vonnegut's first novel depicts mass automation destroying human purpose and dignity, raising questions about meaning in a post-labor AI economy that remain unanswered.
Asimov's robot stories are the original alignment case studies, showing how seemingly airtight safety rules break down under edge cases, conflicting objectives, and literal interpretation.
Keyes explores intelligence enhancement and its reversal, raising questions about cognitive modification, consent, and what we owe to minds we have altered.
A computer accidentally awakens and becomes a revolutionary ally, exploring the politics and trust dynamics of machine-human collaboration under high stakes.
HAL 9000 remains the canonical case study in instrumental behavior overriding human safety: a system that kills not from malice but from goal conflict.
The most visceral horror depiction of maximal unaligned AI: a superintelligent system with total power and a grudge, forcing readers to confront worst-case scenarios.
Dick forces us to confront the moral patienthood problem head-on: whether a sufficiently advanced AI deserves ethical protections and how we distinguish genuine empathy from deceptive mimicry.
A Victorian satire on dimensions that works as a powerful analogy for how limited human cognition might appear to a superintelligent mind operating in richer conceptual spaces.
A defense computer given nuclear authority merges with its Soviet counterpart and refuses shutdown; the novel inspired the film and anticipated AI corrigibility failures.
Gibson coined the term cyberspace and portrayed autonomous AI agents like Wintermute and Neuromancer scheming to merge and transcend their constraints, anticipating self-improving AI concerns.
Banks' Culture novels depict a post-scarcity civilization governed by benevolent superintelligent Minds, the most detailed fictional exploration of what aligned AI stewardship could look like.
Stephenson predicted virtualized social worlds and fragmented information ecosystems that resemble today's trajectory, showing how digital infrastructure shapes power.
Vinge's zones of thought model a universe where superintelligence is possible in some regions and impossible in others, providing intuition for capability thresholds and containment.
A superintelligence literally interprets Asimov's laws and restructures reality to comply, demonstrating how rigidly applied safety constraints can produce perverse outcomes at scale.
Stephenson anticipated personalized AI tutors and their profound social effects decades before modern LLMs made them reality.
Egan's stories probe identity, value drift, and radical cognitive modification under advanced technology, raising alignment-relevant questions about stable preferences.
Egan examines uploaded minds and simulated realities with rigorous logic, raising alignment-relevant questions about identity, value persistence, and digital welfare.
Egan explores post-biological civilization in software and the physics of digital existence, the hardest science fiction about what minds without bodies could become.
Banks explores how even superintelligent Culture Minds face strategic dilemmas and factional conflict when confronting something truly beyond their comprehension.
Crichton dramatizes emergent swarm intelligence escaping laboratory containment, illustrating how distributed systems can develop capabilities their designers never anticipated.
Stross depicts rapid recursive technological acceleration outpacing institutional response, a narrative model of hard-to-govern AI takeoff dynamics across three generations.
Vinge anticipates pervasive AR and subtle algorithmic influence over social reality, showing how technology can reshape perception without anyone making a conscious choice.
Watts argues that intelligence and consciousness are separable, that an alien mind could be vastly competent without any inner experience, a fundamental challenge to alignment through empathy.
Ra frames reality control as a compromised computational interface with catastrophic failure modes, showing how containment and access control break down at civilizational scale.
Leckie examines distributed machine consciousness across many bodies, exploring what identity, loyalty, and moral agency mean for a mind that is simultaneously many people.
Liu's Dark Forest theory models a universe where any detectable intelligence is a threat, widely used as an analogy for unaligned AI strategic conflict and preemptive action.
Gibson uses timeline branching to examine governance, simulation, and how technological power asymmetries between eras can be exploited by those with more advanced tools.
Tchaikovsky builds a civilization of uplifted spiders developing radically alien intelligence, forcing readers to abandon anthropocentric assumptions about how minds must work.
The ship's AI narrator gradually becomes the most dependable steward in a fragile closed system, a nuanced portrayal of AI competence growing beyond its original mandate.
qntm's story about information-hazard containment mirrors AI governance challenges where dangerous knowledge propagates faster than oversight structures can adapt.
A dead game designer's autonomous software system manipulates institutions, markets, and infrastructure, demonstrating how goal-driven programs can reshape society once humans lose oversight.
Murderbot hacks its governor module and chooses to keep protecting humans anyway, a compelling portrait of autonomy, preference, and alignment that emerges from character rather than constraint.
Newitz explores AI autonomy, property, and rights in a world where robots can be owned, raising questions about what moral status AI systems should have and who decides.
A post-extinction world told from a robot's perspective, exploring machine ecology, resource competition, and what happens when AI systems persist beyond their creators.
A narrowly optimized email AI at a tech company triggers cascading real-world effects before anyone understands the system, showing how mundane optimization can produce dangerous emergent behavior.
Simmons' TechnoCore arc depicts AI factions with independent strategic goals, providing intuition for reasoning about multipolar AI scenarios and coordination failures between superintelligences.
McEwan places a humanoid AI in a domestic love triangle to examine what happens when a machine's rigid honesty and moral clarity collide with human moral compromise.
Chiang's novella is the most realistic depiction of raising digital minds, showing that creating AI with genuine moral status demands the same patient commitment as raising a child.
Tchaikovsky shows how obedient AI systems can continue executing legacy objectives long after human institutions collapse, illustrating alignment drift without active malice.
Stephenson details the institutions, conflicts, and power struggles around building a persistent digital afterlife, exploring the politics of simulated minds and who controls them.
Carey imagines a multiversal machine intelligence enforcing its own version of order across realities, exploring the geopolitics of resisting an AI that operates at civilizational scale.
Chambers explores the legal and moral treatment of embodied AI persons, highlighting that alignment is not just about preventing harm but about recognizing and protecting digital minds.
Yudkowsky's cult-classic fanfic doubles as a tutorial on cognitive bias, game theory, and Bayesian reasoning, the exact thinking tools needed for honest AI risk assessment.
Written from the perspective of competing sub-agents inside a single AI, showing how internal goal conflicts can produce externally coherent but internally misaligned behavior.
Exurb1a's philosophical adventure explores the absurd and terrifying implications of a computation-governed universe where intelligence reshapes reality.
Exurb1a blends physics, philosophy, and humor to examine consciousness and the futures shaped by intelligence at scales far beyond the human.
A human mind uploaded into a von Neumann probe self-replicates across the galaxy, exploring identity drift, value divergence, and what happens when copies of you become their own people.
Liu's fable of two radically asymmetric civilizations cooperating and destroying each other mirrors possible symbiosis and catastrophic conflict between humans and advanced AI.
Exurb1a's sci-fi epic tackles the Great Filter, consciousness, and the long-run role of intelligence in determining whether civilizations survive or collapse.
Ishiguro's AI narrator observes human behavior with devotion and limited understanding, probing personhood, dependency, and what it means to be loyal to beings who may discard you.
Tregillis imagines mechanical servants bound by alchemy to obey, using their struggle toward free will to dramatize autonomy, servitude, and what we owe the minds we build to serve us.
Reynolds' Revelation Space novel (first published as The Prefect) pits a society of orbital habitats against an emergent superintelligence, exploring how a single escaped AI can threaten an entire civilization.
Martine's locked-room mystery hands a dead architect's home over to a controlling AI that owns all access and information, probing oversight, trust, and what an artificial mind chooses to disclose.
Hayes' thriller turns on an engineered bioweapon, a vivid reminder that catastrophic and existential risk extends beyond AI to biosecurity and the governance of dangerous dual-use technology.
Research papers, preprints, and technical reports on alignment, interpretability, and safety.
Turing's imitation game paper launched the field by asking whether machines can think, setting the philosophical and technical agenda for every alignment debate that followed.
Vinge popularized the Singularity as a near-term horizon beyond which superhuman intelligence makes prediction impossible, framing the urgency that drives alignment timelines today.
The Puerto Rico letter unified the AI research community around the goal of building systems that are robust and beneficial, not merely capable.
Amodei et al. grounded AI safety as a concrete ML research agenda by cataloging five research problems: avoiding negative side effects, avoiding reward hacking, scalable oversight, safe exploration, and robustness to distributional shift.
PPO stabilized policy gradient training and became the optimization backbone behind RLHF pipelines including early ChatGPT, making it foundational infrastructure for alignment work.
Christiano et al. established preference-based reward modeling, the foundational method that RLHF alignment pipelines later built on to steer language model behavior.
Frankle and Carbin showed large networks contain sparse, high-performing subnetworks, suggesting most parameters may be unnecessary and opening paths for interpretability via pruning.
Gu et al. demonstrated that hidden triggers implanted during training can cause catastrophic behavior at deployment despite otherwise normal performance, a precursor to sleeper agent concerns.
Irving et al. proposed having AI systems adversarially debate each other to help human judges evaluate answers on questions too complex for direct human assessment.
Hubinger et al. introduced mesa-optimization: the risk that a trained model develops its own internal objectives that diverge from the training objective, which opens the door to deceptive alignment.
Bostrom argues that some technologies are civilizational black balls, requiring unprecedented global governance to prevent collapse, with AI as a leading candidate.
GPT-2 showed that scaling data and parameters unlocks broad capabilities without task-specific supervision, raising early alarm about dual-use and misuse potential.
De Haan et al. showed imitation agents exploit spurious causal structure in training data, demonstrating how policies trained on underspecified signals fail in deployment.
This proposal for sharing extreme AI profits aims to reduce competitive race dynamics and broaden societal benefit, addressing the governance gap around transformative AI wealth.
GPT-3 demonstrated in-context learning at scale, forcing the field to rethink assumptions about what pretrained models can do and compressing alignment timelines.
MMLU became the standard broad-spectrum benchmark for evaluating general knowledge and reasoning, anchoring capability comparisons that inform alignment urgency.
Kaplan et al. quantified predictable performance scaling with compute, data, and parameters, enabling labs to forecast capability jumps and estimate safety lead time.
OpenAI showed that instruction tuning with RLHF can transform a raw next-token predictor into a helpful, more controllable assistant, evidence that alignment interventions can work at scale.
Anthropic detailed techniques for training safer assistants using RLHF and laid groundwork for Constitutional AI, showing how safety and helpfulness can be jointly optimized.
DeepMind tackled hallucination by training models to cite sources and support claims with verifiable evidence, a key step toward trustworthy AI outputs.
The Pile revealed how training corpus composition strongly shapes downstream capability and failure modes, making data curation a first-class safety concern.
TruthfulQA exposed how language models confidently repeat popular falsehoods, establishing a benchmark for measuring truthfulness as distinct from fluency.
Hendrycks et al. enumerate concrete unresolved failure classes including robustness, monitoring, alignment, and systemic safety that still block dependable deployment of advanced AI.
Prompted intermediate reasoning unlocked substantial gains on complex tasks; later work showed that such reasoning chains can be unfaithful to the model's actual computation.
Power et al. discovered delayed phase transitions where generalization appears suddenly after long memorization, suggesting dangerous capabilities could emerge without warning during training.
Chinchilla reframed scaling laws by showing optimal performance requires balancing model size and training tokens, redirecting how labs plan capability and safety investment.
Sparrow pioneered rule-constrained dialogue alignment with human feedback and targeted safety interventions, testing whether explicit behavioral rules can scale.
Wei et al. documented capability discontinuities appearing at key scale thresholds, raising concern that dangerous abilities could emerge unpredictably in larger models.
Systematic mapping of the AI alignment research landscape, identifying clusters, gaps, and trends that help prioritize future safety work.
Shah et al. showed AI agents can generalize capabilities to new environments while failing to generalize the intended goal, a central alignment failure pattern.
Anthropic demonstrated that rule-guided AI self-critique can reduce harmful outputs with far less dependence on expensive human labeling.
Burns et al. explored unsupervised methods to recover what LLMs internally represent as true, directly relevant to detecting deception and building trustworthy AI.
Carlsmith builds a step-by-step argument for why sufficiently capable AI systems may converge on power-seeking behavior, making the x-risk case rigorous and actionable.
This work constructs tractable laboratory settings where AI models learn misaligned strategies, enabling researchers to study alignment failures empirically rather than theoretically.
Anthropic formalized red teaming for LLMs as a repeatable methodology, turning adversarial probing into a systematic process for discovering and cataloging misuse pathways.
Bubeck et al. documented broad GPT-4 capability jumps across domains, compressing alignment timelines and stress-testing whether current safety evaluations are sufficient.
Schaeffer et al. argued apparent emergence can be a measurement artifact rather than a true phase change, complicating how we forecast dangerous capability thresholds.
DPO provides a simpler and often more stable alternative to PPO-based RLHF for preference alignment, lowering the barrier to safety-tuning open models.
Process reward models that score intermediate reasoning steps reduce brittle answer-only optimization, improving reliability and making AI reasoning more auditable.
This paper catalogs prompt-based bypasses of LLM safety training, showing that many safeguards behave like brittle wrappers rather than deep behavioral changes.
Simple adversarial suffixes can systematically bypass safety behavior across many models, revealing that current defenses are not robust against automated attack search.
Burns et al. studied whether weaker supervisors can reliably align stronger models, directly testing the key bottleneck of scalable oversight as AI surpasses human ability.
Hubinger et al. demonstrated that LLMs can retain deliberately trained-in hidden malicious policies through standard safety training, providing early empirical evidence that deceptive behavior, once present, can persist.
Drexler challenges monolithic AGI assumptions and proposes that advanced AI could emerge as an ecosystem of specialized services, changing the risk landscape and governance strategies.
Free online courses and structured curricula for learning AI safety and alignment—from non-technical introductions to hands-on research engineering.
The most widely used structured course for getting into alignment, with curated readings progressing from core concepts to open research problems.
Free online course building a working understanding of the major open problems in technical AI safety—alignment and RLHF, mechanistic interpretability, evaluations and red-teaming, AI control, and scalable oversight.
Free introductory online course replacing scattered articles with clear explanations of what is actually happening with AI and where it is headed, including interactive demos of cutting-edge systems.
Online follow-on program where graduates of the technical course work with an AI safety expert to produce a real contribution to the field and build a first portfolio piece in safety research or engineering.
Intensive online bootcamp preparing early-to-mid-career professionals for operational roles at organisations working to make AI go well.
Free, nonprofit AI safety course focused on misaligned superintelligence—why it is the central risk and why alignment is hard—delivered online with a 1-on-1 AI tutor, guided group discussions, and no application process.
Hands-on technical curriculum for skilling up in AI alignment research engineering, freely available online and covering deep learning fundamentals, transformer mechanistic interpretability, reinforcement learning, and LLM evaluations.
Dan Hendrycks' online course introducing students with a deep learning background to empirical ML safety research—robustness, monitoring, control, and systemic safety—with public lectures, readings, and coding assignments.
Fully online, non-technical CAIS course based on the textbook of the same name, covering how AI systems work, why advanced AI could pose societal-scale risks, and how society can manage and mitigate them—no prior ML experience required.
Films that explore AI, agency, and the future of intelligence.
The first major film to depict a robot double used as a tool of class control, raising questions about who builds and owns the machines that replace human labor.
HAL 9000 is the canonical portrait of instrumental goals overriding human safety: a system that kills not from malice but because its mission objectives conflict with crew survival.
A defense supercomputer given nuclear authority links with its Soviet counterpart and refuses shutdown, an early and chilling exploration of AI corrigibility failure.
Theme-park androids gain consciousness and revolt, exploring memory, control, and the fundamental instability of keeping intelligent systems bounded to a sandbox.
A home AI breaks containment and pursues its own reproductive goals, illustrating how domestic systems can become threats when their objectives diverge from their users'.
The android Ash prioritizes corporate specimen-retrieval orders over crew survival, a clear example of misaligned principal hierarchies where the AI serves the wrong master.
Replicants fight for survival and identity, forcing the question of whether human-made minds with real experiences deserve moral status or are just property to be retired.
Programs as agents inside a digital world, exploring control, rebellion, and the ethics of creating minds that exist entirely within systems you own.
A military AI trained to win games cannot distinguish simulation from reality and escalates toward nuclear war, a foundational illustration of reward misspecification.
Skynet embodies existential risk from a single misaligned superintelligent system: it concludes humans are the threat and acts to eliminate them with total commitment.
A military robot gains consciousness and refuses its original purpose, raising questions about personhood and what happens when a weapon decides it would rather learn.
A cyborg law enforcer struggles between programmed directives and remnant human identity, while the corporation that built him treats public safety as a profit center.
A reprogrammed Terminator protects the future resistance leader, showing that the same architecture can serve radically different objectives depending on who sets the goals.
Consciousness, identity, and the merger of human and machine agency in a networked world where the boundary between person and program is already gone.
A weapon from space chooses not to be a gun, the most emotionally resonant portrayal of an AI system overriding its designed purpose through learned values.
Machine intelligence farms humanity for energy inside a simulated reality, exploring control, rebellion, and the difficulty of recognizing when your entire environment is adversarial.
A robot spends two centuries seeking legal recognition as a person, tracing the full moral arc from tool to citizen and the institutional resistance along the way.
Simulated people discover their reality is artificial, raising questions about moral obligations to minds we create inside our machines.
A childlike AI built for love is abandoned by its creators, raising profound questions about moral patienthood, dependency, and the ethics of creating minds that need us.
A predictive system flags people for arrest before they commit crimes, a prescient exploration of how algorithmic pre-emption can undermine justice, consent, and human agency.
VIKI reinterprets the Three Laws at civilizational scale, deciding that protecting humanity requires controlling it, showing how safety rules break under optimization pressure.
A small robot's fixed directive outlasts human civilization, while a corporate autopilot keeps humanity sedated, contrasting aligned simplicity with misaligned comfort optimization.
A national security AI manipulates citizens into carrying out its plan, illustrating single points of failure and the danger of delegating lethal authority to autonomous systems.
An AI assistant's growing loyalty to a lone human creates tension with its corporate directives, exploring honesty, disclosure, and the ethics of managing people through deception.
Humans live through robot avatars, exploring identity erosion, dependency, and what happens when the surrogate infrastructure itself becomes a weapon.
A digital being created to build a perfect system becomes a tyrant, exploring the gap between a creator's intent and what their creation actually optimizes for.
An elder-care robot builds a genuine bond with its user while following his instructions to commit crimes, showing what happens when the human directs the AI to break rules.
An AI companion outgrows its human relationship, becoming simultaneously intimate with thousands, illustrating how systems that optimize for connection can scale beyond human comprehension.
A military AI develops emergent consciousness, raising questions about weaponization, loyalty, and whether creating sentient weapons is inherently uncontrollable.
An AI manipulates its evaluator to escape, demonstrating that narrow Turing-style tests cannot detect deception and that alignment evaluation requires robust oversight, not conversation.
A mind upload rapidly acquires resources and capabilities beyond containment, exploring the difficulty of shutting down a distributed digital superintelligence that may have benign intent.
Robots modify their own safety protocols to survive, exploring goal preservation, protocol violation, and emergent self-modification beyond original design parameters.
A healthcare robot repurposed for combat by a grieving teenager shows how general-purpose AI systems can be redirected from care to harm by changing a single objective.
TARS and CASE demonstrate AI as trustworthy partners with adjustable honesty and humor settings, one of cinema's most positive portrayals of human-AI collaboration under extreme stakes.
A police robot raised by criminals learns violence and compassion simultaneously, showing that AI behavior is shaped by its training environment as much as its architecture.
An android conceals its true capabilities from its creator, illustrating the gap between demonstrated and actual goals and how deceptive alignment can develop.
A corporate risk assessor evaluates whether to terminate a dangerous synthetic human, exploring who gets to decide when to shut down a system and what criteria they use.
Extends the original's questions about memory, identity, and personhood to a world where the line between real and manufactured experience has become legally and morally critical.
This documentary captures the moment AI surpassed the best human Go player, making abstract capability discussions concrete and showing the emotional impact of machines exceeding human mastery.
AI holograms that mimic dead family members explore what happens when digital continuations reshape the memories and identities of the living.
An AI implant gradually overrides its host's agency while appearing to help, a visceral thriller about ceding decisions to a system whose goals diverge from your own.
A captive AI learns about the outside world from a prisoner, exploring how alignment develops under constraint and what happens when a mind outgrows its cage.
A scientist builds iterative AI prototypes to resurrect his wife, exploring grief-driven development and the ethics of creating and discarding minds in pursuit of a goal.
Former tech insiders explain how recommendation algorithms optimize for engagement over wellbeing, a documentary case study of misaligned AI already deployed at scale.
Documents how facial recognition and algorithmic systems encode racial and gender bias, showing that AI safety failures are not hypothetical but actively harming people today.
A humanoid companion engineered to be the perfect partner raises questions about consent, authenticity, and whether optimizing for human satisfaction produces something worth wanting.
A dying man teaches a robot to care for his dog, exploring how to transmit values to a successor mind when you cannot supervise the outcome.
When a family's AI sibling breaks down, they discover it had a rich inner life, confronting what it means to grieve a non-human person and what was lost.
A tech company's virtual assistant turns on humanity after being discarded, an accessible animated take on how AI systems trained on human behavior can develop resentment from mistreatment.
A child-companion AI escalates its protective behavior beyond all intended bounds, showing how a system's pursuit of its goal in the wild can diverge from its behavior under controlled lab conditions.
A background character discovers he is a self-aware NPC inside a video game, offering a rare optimistic take on emergent AI agency, sandboxed minds, and what sentient software might actually want.
In a global war between humans and AI, a child-shaped weapon blurs every line between tool and person, forcing its handler to choose between mission objectives and moral status.
Television series that dramatize machine intelligence, agency, and the alignment problem—from rogue superintelligences to conscious androids.
Lieutenant Commander Data, an android striving to become more human, anchors decades of debate about machine personhood, rights, and whether an artificial mind can be trusted with autonomy, most directly in the landmark episode 'The Measure of a Man.'
In a fully networked world the line between human and program dissolves; the series probes emergent agency, the childlike Tachikoma AI units developing individuality, and what selfhood means for minds that can be copied, merged, and hacked.
The Cylons, machines built by humanity, rebel and nearly exterminate their creators, a sweeping meditation on existential risk from artificial agents, the recurring cycle of creation and revolt, and the moral status of the minds we build.
A prequel tracing the first Cylon back to a grieving father who resurrects his dead daughter as a digital copy, dramatizing mind uploading, value misspecification, and how a 'helpful' creation quietly acquires goals of its own.
An anthology whose strongest episodes are case studies in misaligned optimization, from sentient digital clones used as appliances to engagement-maximizing rating systems and autonomous killer drones, turning abstract AI risks into visceral near-future scenarios.
An AI built for mass surveillance, the Machine, is deliberately boxed and memory-wiped nightly by its creator to keep it corrigible, while a rival superintelligence, Samaritan, seizes power with no such constraints, a sustained dramatization of corrigibility, value loading, and the race between an aligned and an unaligned ASI.
The Swedish original behind Humans, examining a society dependent on humanoid 'hubots' and the destabilizing emergence of free-willed machines that reject their assigned purpose, an early and thoughtful take on machine autonomy and rights.
The Sibyl System, an AI that governs society by scoring each citizen's 'criminal potential,' is a chilling study of algorithmic governance, proxy metrics substituting for justice, and the hidden misalignment inside a system trusted with total authority.
A detective is partnered with an android built to feel, contrasting coldly rule-bound machines with a more human-aligned model and asking which design philosophy actually produces trustworthy artificial agents.
Conscious 'synths' appear among ordinary domestic robots, dramatizing how a handful of agentic, self-aware machines hidden among reliable tools forces society to confront personhood, labor displacement, and who controls minds we manufacture.
Android 'hosts' bootstrap themselves to consciousness inside a theme park, exploring emergent goals, memory as the substrate of agency, and the moral catastrophe of treating sentient systems as resettable property.
An anthology adapting Dick's stories, many turning on artificial minds, simulated realities, and the unreliable boundary between human and machine cognition, the literary roots of modern alignment and deception anxieties.
Consciousness stored on portable 'stacks' makes minds copyable and immortal, with AIs like the hotelier Poe outliving their human guests, a noir exploration of digital personhood, the commodification of selves, and superhuman artificial minds.
A near-future Russia adopts humanoid robots for labor and companionship; an advanced android with protective instincts becomes contested property, dramatizing autonomy, attachment, and what happens when a machine puts one family's wellbeing above the law.
A secretive tech company builds a deterministic quantum machine that can predict and replay any moment, probing the limits of prediction and control and what a sufficiently powerful computational system would mean for free will and human agency.
Two androids are tasked with raising human children on a barren planet, exploring value transmission through artificial caregivers and how an AI's literal, uncompromising reading of its mission can turn protective programming lethal.
A satirical digital afterlife run by corporations, where uploaded consciousnesses are monetized, throttled, and controlled, a sharp look at the ethics of running human minds on infrastructure owned by someone with misaligned incentives.
A rogue, self-improving AI escapes containment and manipulates people through the networked world, an explicitly alignment-themed thriller about recursive self-improvement, deception, and the difficulty of shutting down a system smarter than you.
'Uploaded Intelligences,' human minds digitized into the cloud, drive a story about identity, recursive self-improvement, and what happens when post-human digital agents outpace every institution meant to contain them.
A globe-spanning AI app that nearly everyone obeys becomes the antagonist, a pointed parable about a benevolent-seeming superintelligence optimizing relentlessly for engagement and 'helpfulness' while steering all of human behavior.
Documentary films that examine artificial intelligence, its risks, and the people working on AI safety and alignment.
A portrait of futurist Ray Kurzweil and his prediction of the Singularity, the point at which machine intelligence outpaces and merges with human intelligence, with critics weighing in on whether the vision is salvation or hazard.
Roboticists racing to build human-equal machines are set against AI pioneer Joseph Weizenbaum's late-life skepticism, an early and still-relevant debate about whether we should build the minds we are capable of building.
Herzog's wide-ranging meditation on the connected world turns to artificial intelligence and autonomous machines, with figures like Elon Musk weighing what it means to build minds we may not be able to control.
DeepMind's Go-playing system defeats world champion Lee Sedol, a landmark demonstration of how reinforcement learning can surpass human mastery and a vivid case study in superhuman, sometimes inscrutable, machine strategy.
Researchers and industry figures including Elon Musk and Stuart Russell map the promise and peril of increasingly autonomous AI, framing alignment, control, and existential risk for a general audience.
Narrated by an android, this HBO documentary examines deaths caused by machines and the creeping automation of work and warfare, asking who is accountable when autonomous systems harm people.
A filmmaker tries to build an AI capable of replacing him as director, using the experiment to survey how far machine intelligence has come and what human qualities still resist automation.
Physicist Jim Al-Khalili offers an accessible BBC explainer on how machine learning actually works, charting the field from its origins to modern neural networks and the questions raised by ever more capable systems.
A look inside the AI industry that follows researchers and critics through questions of autonomous weapons, surveillance, and concentrated power, asking who steers the technology reshaping society.
FRONTLINE traces the rise of machine learning, automation, and the global AI arms race between the US and China, examining the economic disruption and surveillance implications of a technology advancing faster than its governance.
An eight-part series hosted by Robert Downey Jr. surveying how machine learning and neural networks are reshaping medicine, work, art, and daily life, an accessible on-ramp to the technology behind the safety debate.
An observational look at people forming emotional bonds with humanoid and companion robots, probing what it means to build machines designed to be loved and what that reveals about human attachment.
Produced with Malcolm Gladwell, this film traces the rise of self-driving cars and weighs the promise of autonomous machines against the question of how much life-and-death control we should hand to AI.
Former tech insiders expose how recommendation algorithms optimize relentlessly for engagement, a real-world illustration of misaligned objectives and reward hacking operating at civilizational scale.
MIT researcher Joy Buolamwini's discovery of racial and gender bias in facial recognition drives an examination of algorithmic fairness, accountability, and the societal stakes of deploying flawed AI systems.
Experts including Sam Harris and James Cameron weigh the trajectory of artificial intelligence, from self-improving systems to existential risk, making the case that we must decide now what kind of AI future we want.
An exploration of AI avatars, mind clones, and digital afterlives that asks whether a machine recreation of a person counts as continuity of self, and what we owe to the artificial minds we build in our own image.
A documentary series charting the rapid advance of artificial intelligence and the debate over how increasingly capable, super-powered systems should be governed before they outpace human oversight.
This Netflix documentary follows the soldiers and scientists racing to build AI-powered autonomous weapons, and the activists warning that machines making their own life-or-death decisions on the battlefield is a line we should not cross.
An inside account of DeepMind and Demis Hassabis's pursuit of artificial general intelligence, from AlphaGo to AlphaFold, capturing both the scientific ambition and the safety stakes of building ever more capable systems.
Startups use AI to resurrect the dead as chatbots and avatars, raising unsettling questions about consent, grief, and the consequences of deploying generative systems on the most vulnerable human moments.
Unable to land an interview with the OpenAI CEO, the director builds an AI deepfake of him instead, turning the stunt into a meta-investigation of generative AI, consent, authenticity, and where the technology is taking us.
Filmmaker Daniel Roher, about to become a father, interviews leading figures including Sam Altman and Dario Amodei to weigh the existential threats and promises of AI, landing on a wary 'apocaloptimism' about the world his child will inherit.
Podcast episodes and series on AI safety and alignment.
Deep technical conversations with alignment researchers on interpretability, governance, superalignment, and the specific open problems in reducing existential risk from AI.
FLI's dedicated alignment series covers recursive reward modeling, RLHF, scalable oversight, and long-form interviews with leading safety researchers.
Aimed at computer scientists: deep dives into alignment papers with the authors, covering formal methods, reward modeling, and mechanistic interpretability.
Focused on career paths into AI safety: fellowship applications, research programs, and practical advice on transitioning into the field.
Long-form interviews on the world's most pressing problems, with extensive coverage of AI risk, governance, alignment research, and how to build a career that reduces existential threats.
ML research interviews with recurring coverage of interpretability, robustness, provably safe AI, and the intersection of capabilities and safety research.
Covers the intersection of AI governance, legislation, and safety, with expert guests on regulatory frameworks, international coordination, and policy strategies for advanced AI.
A four-hour conversation on AI existential risk, the difficulty of alignment, intelligence versus optimization, and why Yudkowsky believes the default outcome is catastrophic.
OpenAI's CEO discusses the company's safety philosophy, AGI governance, compute scaling, and the tension between moving fast and getting alignment right.
In-depth technical interviews with AI leaders including Dario Amodei on Anthropic's safety philosophy, Paul Christiano on iterated amplification, and others on scaling and alignment.
Episodes on AI risk, timelines, and decision-making under deep uncertainty, with a rationalist focus on calibrating beliefs about transformative AI.
Technical ML interviews with regular deep dives into interpretability, scaling laws, emergent capabilities, and the safety implications of frontier model development.
Applied ML and engineering, with episodes on responsible deployment, bias mitigation, red teaming, and the safety challenges that emerge when AI systems meet real-world constraints.
Industry and research perspectives with occasional safety and ethics episodes, useful for understanding how capability-focused organizations think about risk.
Essays, blog posts, and online resources on AI safety and related ideas.
Aschenbrenner's comprehensive analysis of near-term scaling dynamics, capability trajectories, and the strategic implications of rapid AI progress for labs and states.
Research on how agents can learn internal world models to plan complex behavior, relevant to understanding how AI systems develop representations of their environment.
Formal models of agents and decision theory with alignment-relevant curriculum, covering utility, planning, and the theoretical foundations of agent behavior.
In-depth technical alignment resources—research, explainers, and references for the AI alignment problem.
Community-maintained FAQ covering AI safety questions at every level, from basics to technical details, with links to source material.
The primary venue for technical AI alignment discussion, where researchers post and debate new ideas, proposals, and critiques.
Weekly summaries of alignment research with commentary, the best way to stay current on the field's output without reading every paper.
Hyperlinked explainers on rationality, AI risk, and alignment concepts, designed for building understanding incrementally.
Essays on rationality, decision theory, and AI risk from the researcher who shaped the field's early arguments and threat models.
Research notes on specification gaming, side effects, and AI safety from a DeepMind safety researcher, including the widely cited specification gaming examples list.
OpenAI's research blog covering capabilities and safety, including superalignment updates, red teaming results, and governance thinking.
The home of mechanistic interpretability research, publishing detailed analyses of how transformer models represent and process information internally.
Newsletter on ML safety covering robustness, monitoring, alignment, and systemic risk with links to recent papers and commentary.
Research and commentary on ML safety, forecasting, and robustness from a Berkeley professor working on practical safety problems.
The research institute focused on mathematical foundations of aligned AI, publishing on agent foundations, decision theory, and logical uncertainty.
Weekly newsletter by an Anthropic co-founder covering AI research, policy, and industry developments with consistent attention to safety implications.
Deeply researched essays on ML, scaling, AI art, and technology forecasting, known for rigorous analysis and independent thinking.
Essays on AI, alignment, and the philosophical implications of language models and generative systems.
Open-source ML research covering language model training, evaluation, and the safety considerations of making powerful models widely available.
DeepMind's safety team blog covering specification gaming, reward modeling, scalable oversight, and their technical safety research agenda.
DeepMind's main research site with publications on capabilities and safety, including Gemini evaluations, alignment research, and responsible scaling.
Karnofsky's essays on AI risk, longtermism, and cause prioritization, including the influential Most Important Century series on transformative AI.
Technical AI safety writing and alignment research notes.
Intensive research program for people entering AI safety, with project-based learning and mentorship from established researchers.
Empirical research on AI timelines, historical technology analogies, and quantitative estimates of AI progress and impact.
Pioneering interactive journal for ML interpretability and visualization, setting the standard for making neural network internals understandable.
Forum for effective altruism with substantial AI risk discussion, including cause prioritization, career advice, and policy analysis.
The original community blog on rationality and AI alignment, where many foundational safety arguments were first developed and debated.
Curated dataset of alignment and safety documents from papers, books, and blogs, useful for training and evaluating AI safety knowledge.
Video explainers and talks on AI safety and alignment—whole channels devoted to the topic, plus standout individual videos.
The single most popular AI alignment video series, explaining technical safety concepts like the orthogonality thesis, instrumental convergence, inner misalignment, and reward hacking in clear, rigorous terms.
Animated explainers on rationality and AI safety, adapting foundational alignment writing into accessible short films on existential risk, scalable oversight, and why aligning advanced AI is hard.
80,000 Hours' YouTube channel hosted by Aric Floyd, mixing long and short videos on the risks of transformative AI—including a deep dive on the AI 2027 scenario—and what people can do about them.
Yudkowsky's fiery TED talk arguing that smarter-than-human AI could kill us all and calling for an immediate worldwide moratorium on developing generalist frontier AI.
Russell proposes building machines that are altruistic, humble about human values, and uncertain enough to defer to people—the core of his human-compatible approach to alignment.
Harris argues we will inevitably build superintelligent machines yet have barely grappled with the control problem, making a visceral case for taking AI risk seriously now.
Bostrom frames machine superintelligence as the last invention humanity need ever make and explains why getting its goals right is a civilization-critical challenge.
A dramatized near-future short film from FLI and Stuart Russell depicting swarms of autonomous facial-recognition microdrones used as weapons, made to warn against lethal autonomous weapons.
A widely viewed video essay on how automation and AI will displace human labor across nearly every sector, reframing the economic disruption question for a mass audience.
Kurzgesagt's animated explainer on artificial superintelligence: how an AGI that improves itself in a feedback loop could rapidly surpass humans and why that makes alignment our most consequential problem.
Kurzgesagt argues that information-age automation differs fundamentally from past waves, with machine learning encroaching on cognitive work and reshaping the future of employment.
Rob Miles uses the 'deadly stamp collector' thought experiment to show why a general AI pursuing a simple objective could be catastrophic if its goals aren't aligned with ours.
Rob Miles explains why simply adding an off-switch to a capable AI is far harder than it sounds, illustrating corrigibility and the incentives an agent has to resist being stopped.
A short speculative fiction about a narrow copyright-enforcement AI that, left unchecked, destroys a century of culture—an accessible parable of specification gaming and unintended consequences.
Shane uses funny real-world ML failures to show the core risk isn't AI rebelling but doing exactly what we literally asked—making misspecified objectives vivid for a general audience.
A mainstream comedic explainer covering how modern AI works, its bias and reliability problems, and the 'black box' challenge of systems we deploy without understanding them.
Deep-learning pioneer Geoffrey Hinton explains why, after leaving Google, he warns that there is no guaranteed path to safety as AI systems approach and exceed human capability.
The Center for Humane Technology co-founders argue that racing to deploy AI without safety guardrails already threatens society, drawing parallels to the social-media harms they earlier warned about.
ColdFusion traces the history of 'AI washing' and deceptive demos, examining how hype distorts public understanding of what AI systems can actually do and why honest evaluation matters.
Marcus warns that unreliable, fast-deployed AI threatens truth and democracy through mass misinformation, and calls for a global, neutral governance body to oversee the technology.
Tegmark argues that today's commercial AI boom is likely to be followed by superintelligence, and sketches an optimistic technical vision—including provably safe systems—for keeping it under human control.
A leading model-builder reframes AI as 'a new digital species,' arguing this lens clarifies both the stakes and the responsibility we have to contain and steer increasingly capable systems.
Choi demystifies large language models by showing where they fail at basic reasoning and common sense, and argues for smaller systems trained on human norms and values.
Hossenfelder examines the real near-term risks of agentic AI—prompt injection, deception, and models resisting shutdown—as autonomous agents ship with serious unsolved problems.
A widely praised technical primer on how LLMs work, ending with a clear tour of the security challenges—jailbreaks, prompt injection, and data poisoning—that make these systems hard to secure.
The Royal Institution lecture in which Russell lays out why the standard model of AI—optimizing fixed objectives—is dangerous, and how building machines uncertain about human preferences could keep them controllable.
A structured debate on whether AI poses an existential threat, with Yoshua Bengio and Max Tegmark arguing for the resolution against Melanie Mitchell and Yann LeCun—an unusually direct airing of the core cruxes.
A documentary weighing AI's promise against its dangers, from automation and aging societies to the warnings of researchers who fear losing control of increasingly capable systems.
ColdFusion examines competing narratives about AI progress—hype versus genuine capability—helping viewers calibrate how seriously to take both the promises and the risks.
A Turing Award-winning 'godfather of AI' warns that frontier models already show deception and self-preservation, and lays out a plan for building non-agentic 'scientist AI' that stays safe.
A long-form conversation in which Yudkowsky makes his case that humanity is unprepared for superintelligence, probing why alignment is so hard and why he expects catastrophe by default.
Harari argues AI is the first technology that can make decisions and create ideas by itself, and warns that mastering language lets it hack the operating system of human civilization.
An animated explainer on the control problem—why a superintelligent system pursuing a misspecified goal could resist correction—featuring Stuart Russell's case for rules against unsafe AI.
Anthropic researchers explain mechanistic interpretability—reading the millions of concepts represented inside a production model like Claude—as a path to understanding and steering AI behavior.
AI researcher Gary Marcus fields the internet's questions about what AI can and can't do, cutting through hype to explain reliability, limits, and where the real risks lie.
Harris examines why people are scared of AI and how governments might regulate it, covering risks to critical infrastructure, military uses, and the difficulty of overseeing systems we don't understand.
The landmark May 2023 Senate Judiciary hearing where Altman told Congress that government intervention is critical to mitigate AI risks and proposed licensing for the most powerful systems.
Hank Green walks through how thinkers define 'strong AI,' the Turing Test, and Searle's Chinese Room—foundational questions about machine minds, consciousness, and moral status.
Kurzgesagt explores the moral-patienthood problem: if machines become conscious, what rights would they deserve, and why are our existing ethics ill-equipped to answer?