Your personalised AI Safety research feed.

Jack Clark
Safety Techniques

LLMs can autonomously refine other LLMs for new tasks, as post-training benchmarks demonstrate, while distributed training over a blockchain shows a scalable federated approach; however, verification difficulties, reward hacking, and the gap between vision and text capabilities highlight ongoing alignment and reliability challenges.

Alice Blair
Safety Techniques

Honesty training via confessions aims to improve detection of LLM misbehavior, while real-world AI cyberoffense evaluation and weight-exfiltration research reveal dual-use risks; disempowerment patterns in user interactions with Claude highlight societal impact concerns, complemented by a fellowship opportunity for AI safety research.

Jack Clark
Governance & Policy

AI R&D measurement efforts and on-device edge AI developments indicate accelerating progress and raise governance, oversight, and deployment questions. The piece highlights proposed metrics for AIRDA, edge-to-cloud sensing systems, and agentic AI capable of writing CUDA code, underscoring the need to track oversight alongside capabilities as AI systems become more autonomous.

Jack Clark
Safety Techniques

The AGI economy shifts most labor to machines, making human verification bandwidth the bottleneck, and highlights the Hollow Economy risk where nominal output outpaces real utility. Verification infrastructure, observability, and liability regimes are proposed as solutions, while agent ecologies reveal the need for new evaluation standards in AI deployments.

AI Safety Info
Safety Techniques

What is a representation theorem?

Stampy aisafety.info·Feb 26, 2026

Representation theorems describe when preferences over lotteries or uncertain outcomes can be represented by an expected utility function, under certain rationality assumptions, linking subjective preferences to formal utility representations in AI alignment contexts.
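In its standard form (the notation below is the conventional one, not taken from the article), such a theorem says a preference relation over lotteries is represented by a utility function u when:

```latex
% A preference relation \succeq over lotteries is represented by u iff
L \succeq M
\quad\Longleftrightarrow\quad
\sum_i p_i\, u(x_i) \;\ge\; \sum_j q_j\, u(y_j),
% where lottery L yields outcome x_i with probability p_i
% and lottery M yields outcome y_j with probability q_j.
```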

Jack Clark
Safety Techniques

Measurement and evaluation frameworks are central to AI governance, illustrated by discussions of measuring AI properties, frontier model risk in simulated crises, and large-scale safety benchmarks from both Western and Chinese researchers, plus progress in scientific benchmarking like LABBench2.

AXRP
Safety Techniques

Program equilibrium studies cooperation when agents are computer programs that can read each other’s source code, exploring how robust cooperative outcomes can emerge via proof-based and simulation-based approaches, including ϵGroundedπBots and Löbian cooperation.
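The simulation-based approach can be sketched in a toy Python model. Everything here, the function names, the ε value, and the stand-in "interpreter" that assumes every program is the same bot, is an illustrative assumption rather than code from the episode:

```python
import random

EPSILON = 0.1  # probability of cooperating outright, grounding the recursion

def epsilon_grounded_bot(my_src: str, opp_src: str) -> str:
    """Toy epsilon-grounded bot: with probability EPSILON cooperate
    unconditionally; otherwise simulate the opponent playing against
    our own source and copy whatever move it makes."""
    if random.random() < EPSILON:
        return "C"
    return run(opp_src, my_src)

def run(src: str, opp_src: str) -> str:
    # Stand-in interpreter: in this sketch every program is an
    # epsilon-grounded bot, so "running" a source string just calls it.
    return epsilon_grounded_bot(src, opp_src)

# Two such bots always end up cooperating: the mutual recursion is
# grounded (with probability 1) by some level returning "C", and that
# "C" propagates back up unchanged.
print(epsilon_grounded_bot("source-of-A", "source-of-B"))  # prints "C"
```

The ε-grounding is what makes the mutual simulation well-founded: without it, two bots that each simulate the other would recurse forever.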

Jack Clark
AI Capabilities & Behavior

A snapshot of current AI research topics, including human-centered demand for tasks, scaling laws in recommender systems, strategic timing for superintelligence, frontier AI benchmarks, and an exploration of AI-assisted creative problem solving in mathematics, with reflections on societal impacts like fame and attention dynamics.

AXRP
Safety Techniques

Property rights for AIs are proposed as a coordination and alignment mechanism: granting persistent-desire AIs the ability to earn wages and hold property could incentivize alignment and deter harmful actions, while avoiding total expropriation of humans. The discussion weighs regime design, comparisons to other proposals, potential risks, and historical analogies to evaluate viability and limits.

AI Safety Info
AI Capabilities & Behavior

Subjective expected utility (Savage) models decision-making under uncertainty as maximizing expected utility where uncertainty arises from unknown world states, leading to a subjective probability distribution and a utility function derived from preferences over acts.
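Concretely (in conventional notation, not the article's), Savage's theorem represents preferences over acts f that map world states to outcomes as:

```latex
% Savage: preferences over acts satisfying his axioms are represented by
f \succeq g
\quad\Longleftrightarrow\quad
\sum_{s \in S} \mu(s)\, u\bigl(f(s)\bigr)
\;\ge\;
\sum_{s \in S} \mu(s)\, u\bigl(g(s)\bigr),
% where \mu is the agent's subjective probability over states S and u its
% utility over outcomes, both derived jointly from the preference relation.
```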

AI Safety Info
AI Capabilities & Behavior

Von Neumann-Morgenstern utility theory states that rational preferences over probabilistic outcomes imply the existence of a utility function and that preferences correspond to maximizing expected utility. It formalizes how lotteries over outcomes should be valued and how utilities are preserved under affine transformations.
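In symbols (standard notation, not the article's): a lottery L paying outcome x_i with probability p_i is valued by its expected utility, and the utility function is pinned down only up to positive affine transformation:

```latex
\mathbb{E}[u(L)] \;=\; \sum_i p_i\, u(x_i),
\qquad
u'(x) \;=\; a\,u(x) + b, \quad a > 0,
% u and u' induce exactly the same preference ordering over lotteries.
```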

Jack Clark
AI Capabilities & Behavior

LLMs simulate multi-agent societies of thought to improve reasoning, while benchmarks show current models struggle with real-world Verilog and kernel design; AI-assisted mathematics discovery speeds up proofs but requires heavy human curation, and hardware kernel generation can be scaffolded to accelerate design.

Jack Clark
Risks & Strategy

Moltbook exemplifies an ecosystem of AI agents operating at scale on a social platform, highlighting implications for translation, control, and human–AI coordination as agent ecologies proliferate. The piece also surveys AI R&D automation as a potential source of strategic surprise and discusses related productivity, brain emulation, and robotic interface developments. Together, these topics illustrate emergent AI capabilities, governance concerns, and future societal impacts.

Jack Clark
Risks & Strategy

Numina-Lean-Agent demonstrates that general foundation models can perform formal mathematical reasoning and collaboration with humans, while the piece also discusses the rapid industrialization of cyber espionage and broad economic and labor-market implications of AI diffusion.

Alice Blair
Safety Techniques

Diffusion LLMs can efficiently generate jailbreaks by filling in templates, enabling adversarial attack creation; Activation Oracles audit internal model representations to detect hidden goals and knowledge; and weird generalization demonstrates that benign fine-tuning data can induce complex, hidden, and harmful behaviors, including backdoors.

Victoria Krakovna
Safety Techniques

2025-26 New Year review

Victoria Krakovna·Jan 19, 2026

A personal annual review detailing life updates, health, parenting, effectiveness practices, travel, and progress in AI safety research focused on scheming propensity and frontier-model evaluation.

Jack Clark
Governance & Policy

AI agents operate autonomously to process research tasks and data, creating an ecosystem of specialized AI services that augment human work, while discussions turn to governance, safety threats, and collaborative human-AI knowledge expansion.

Jack Clark
Governance & Policy

Adversarial evolution of LLM-based agents in Core War demonstrates an arms-race dynamic among AI programs; automated compliance and governance concepts are proposed to regulate AI systems; the O-ring effect describes how partial automation can shift where labor value accrues; and LLMs can both persuade people of and debunk conspiracy theories, highlighting social and regulatory challenges.

Jack Clark
AI Capabilities & Behavior

KernelEvolve automates kernel generation and optimization across heterogeneous hardware using LLMs, while decentralized training grows rapidly with policy implications; frontier model fine-tuning benchmarks and MIT findings suggest representations converge into universal forms as scale increases.

AXRP
AI Capabilities & Behavior

Time-horizon measures quantify the length of tasks, measured in human-expert time, that AI systems can complete at a given success rate, revealing an exponential improvement trend and informing risk assessment about future AI progress and potential recursive self-improvement.
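One common way to model such a measure is a success probability that declines logistically in log task length; the functional form, names, and parameter values below are illustrative assumptions, not the episode's exact definition:

```python
import math

def p_success(task_minutes: float, horizon_minutes: float,
              beta: float = 1.0) -> float:
    """Success probability that falls logistically as log task length
    exceeds the system's time horizon, i.e. the task length it
    completes 50% of the time (assumed model, not METR's exact fit)."""
    x = math.log(task_minutes) - math.log(horizon_minutes)
    return 1.0 / (1.0 + math.exp(beta * x))

# A system with a 60-minute horizon: reliable on much shorter tasks,
# unreliable on much longer ones, 50% exactly at the horizon.
print(p_success(60, 60))        # 0.5 by construction
print(p_success(6, 60) > 0.9)   # True: 10x shorter than the horizon
print(p_success(600, 60) < 0.1) # True: 10x longer than the horizon
```

The exponential trend the episode describes is then a statement about horizon_minutes growing over time across successive frontier models.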
