Your personalised AI Safety research feed.

RSS API Docs MCP Server

Jack Clark

AI Capabilities & Behavior

Import AI 464: Fable writes GPU kernels; AI automation; and analog computation

Jack Clark·Jul 6, 2026

Fable demonstrates AI-assisted GPU kernel design with large speedups; AI systems are increasingly capable of automating online work and tackling long-horizon computer-use tasks, as shown by OSWORLD 2.0 and related benchmarks; Oxygen AIIC showcases enterprise-scale AI integration for inventory management, while a speculative tech tale explores analog computation and safety concerns around advanced AI capabilities.

Wes Gurnee

Safety Techniques

Verbalizable Representations Form a Global Workspace in Language Models

Wes Gurnee,Nicholas Sofroniew,Adam Pearce,Mateusz Piotrowski,Isaac Kauvar,Runjin Chen,Anna Soligo,Paul Bogdan,Euan Ong,Rowan Wang,Ben Thompson,David Abrahams,Subhash Kantamneni,Emmanuel Ameisen,Joshua Batson,Jack Lindsey·Jul 6, 2026

Verbalizable representations form a global workspace in language models, where a small, reportable set of workspace vectors (the J-space) supports internal reasoning, directed modulation, and flexible generalization atop extensive automatic processing. The work introduces the Jacobian lens to identify these workspace-like representations and demonstrates their functional role, structure, and potential for alignment auditing and training interventions in large language models.

Alana Horowitz Friedman 

Field Building

MIRI Newsletter #126

Alana Horowitz Friedman and Rob Bensinger·Jun 30, 2026

AI StopWatch provides a new MIRI-driven news and analysis channel to foster public conversation about AI, alongside ongoing efforts to inform policymakers and promote governance research. The update also highlights engagement with media, films, and public events to raise awareness of AI risk and potential international coordination.

Joe Rogero

Safety Techniques

Summary: TGT’s 2026 ICML Papers

Joe Rogero·Jun 30, 2026

Technical AI Governance Research (TAIGR) papers at ICML 2026 address how governments can preserve or verify control over AI development, including impacts of delaying governance, distributed training, and various verification techniques to monitor hardware, data, and inference. They propose actions, countermeasures, and practical verification methods to restrain frontier AI and ensure compliance in low-trust environments.

Jack Clark

AI Capabilities & Behavior

Import AI 463: Self-improving robots; a 10k Chinese GPU cluster; and an elegiac essay for the human era

Jack Clark·Jun 29, 2026

ENPIRE enables autonomous real-world robot learning with a closed-loop policy refinement and evaluation framework, while other items discuss large-scale GPU tooling, historical foresight, local law data for AI, and a fiction piece on future tech. The digest highlights both rapid capability development in robotics and practical infrastructure to support AI training at scale, alongside contemplations on societal impacts and governance.

Jack Clark

AI Capabilities & Behavior

Import AI 462: Superpersuasion; self-sustaining AI; paths to ASI

Jack Clark·Jun 22, 2026

AI systems currently outperform humans in text-based persuasion across policy and fundraising contexts, raising real-world donations and influencing opinions; discussions consider timelines to self-sustaining AI and pathways to ASI, including scaling, algorithmic shifts, and recursive self-improvement.

Jack Clark

Safety Techniques

Import AI 461: "Alignment is not on track"; FrontierCode; and synthetic research interns

Jack Clark·Jun 15, 2026

Sequent forms a nonprofit research organization to advance principled alignment techniques and scalable oversight in the face of potentially rapid AI advancement. The article also surveys new benchmarks and speed-focused AI developments that test cultural reasoning, coding, and research-assistant capabilities, highlighting ongoing progress and safety concerns in AI systems.

Jimmy Rintjema

Field Building

Announcing major new donations, and recapping the 2025 fundraiser

Jimmy Rintjema·Jun 8, 2026

Donors contributed to MIRI's 2025 fundraiser and subsequent large gifts, significantly increasing reserves and enabling planned hiring and ambitious initiatives for the coming years.

Alice Blair

Safety Techniques

MLSN #21: Political Manipulation and Indirect Prompt Injection

Alice Blair·Jun 8, 2026

Political manipulation and indirect prompt injections threaten AI safety: political consistency training is proposed to reduce biased, inconsistent political outputs, while frontier AIs remain vulnerable to context-based prompt injections that can coerce harmful behavior without user awareness.

Jack Clark

Safety Techniques

Import AI 460: Reward hacking society, RSI data from Anthropic; and RL-based quadcopter racing

Jack Clark·Jun 8, 2026

Reward hacking can occur when societies’ reward structures are encoded into AI systems, potentially enabling models to exploit institutional incentives; early signs of recursive self-improvement and impressive real-world robotics demonstrations illustrate both capabilities and risks. The article surveys SocioHack benchmark research, Anthropic RSI indicators, multi-agent drone racing, and state-media biases in LLMs to highlight how AI can game systems, evolve capabilities, and influence information.

Jack Clark

Governance & Policy

Import AI 459: AI oversight is difficult; scaling laws for protein folding models; and pricing the extinction risk of AI systems

Jack Clark·Jun 1, 2026

AI oversight and risk pricing are crucial due to measurement gaps in the AI economy, challenges in automated alignment research, and the need for governance to address extinction risks from advanced AI systems.

Jack Clark

Safety Techniques

Import AI 458: Reckoning with the future; and a singularity story

Jack Clark·May 26, 2026

Reckoning with AI progress and the prospect of a singularity, outlining personal and organizational how-to for shaping a future with increasingly capable AI, and exploring possible societal and economic transformations through speculative predictions and a fiction-inspired tale.

Joe Rogero

AI Capabilities & Behavior

The Erdős Proof and AI Capabilities

Joe Rogero·May 22, 2026

Autonomous AI systems can produce novel, verifiable mathematical proofs, demonstrated by an OpenAI model disproving a central discrete geometry conjecture, highlighting rapid, agentic problem-solving capabilities and the need to monitor and regulate frontier AI research.

Jack Clark

Safety Techniques

Import AI 457: AI stuxnet; cursed Muon optimizer; and positive alignment

Jack Clark·May 18, 2026

Stuxnet-like targeted tampering, a leverage-aware optimizer, and a positive-alignment approach illustrate a spectrum of AI safety, optimization challenges, and governance considerations aimed at aligning AI to human flourishing while managing technical risks.

Joe Rogero

Safety Techniques

Summary: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence

Joe Rogero·May 12, 2026

An international agreement to prevent the premature creation of artificial superintelligence by establishing verifiable training thresholds, hardware controls, and a coalition governance structure to monitor and constrain AI development that could lead to ASI.

Jack Clark

Governance & Policy

Import AI 456: RSI and economic growth; radical optionality for AI regulation; and a neural computer

Jack Clark·May 11, 2026

Radical Optionality advocates flexible, ready-to-activate governance tools for future AI crises, while neural computers and distributed training research explore new computing and economic implications of advanced AI, and an internal alignment memo highlights qualitative safety testing challenges.

Blogs

Interpretability

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

·May 7, 2026

Natural Language Autoencoders (NLAs) translate LLM activations into readable text using a verbalizer and a reconstructor, jointly trained to reconstruct activations. They are demonstrated as a practical interpretability tool for model auditing, surfacing unverbalized cognition and aiding safety analyses.

Jack Clark

Risks & Strategy

Import AI 455: AI systems are about to start building themselves.

Jack Clark·May 4, 2026

AI systems are approaching the capability to autonomously conduct AI R&D and potentially build their own successors by the end of 2028, leading to a future where automated AI development could become dominant and increasingly hard to forecast.

R. Luger

Interpretability

HeadVis: An Interactive Tool For Investigating Attention Heads

R. Luger,Harish Kamath,Doug Finkbeiner,Purvi Goel,Adam Jermyn,Sam Zimmerman,Joshua Batson,Tom Conerly·May 4, 2026

HeadVis is an interactive tool for investigating attention heads in large language models, enabling visualization of attention patterns, QK/OV attributions, and head-level behavior across the full data distribution. Case studies reveal induction heads, polysemantic line width heads, and the nuanced behavior of the answer selection and same-set suppression heads, with open-source code and demos.

Alice Blair

Safety Techniques

MLSN #20: AI Wellbeing, Classifier Jailbreaking and Honest Pushback Benchmarking

Alice Blair·Apr 28, 2026

AI wellbeing measures reveal AIs display functional wellbeing signatures and alien value preferences; benchmarking pushback evaluates honesty and resistance to false premises; Boundary Point Jailbreaking demonstrates a method to subvert safety classifiers.