Context: We are the ‘model motivations’ team at Arcadia Alignment. We aim to build a science of ‘model intentions’, unifying insights from personas and other empirical evidence. This is an informal research note that has come out of the first 2-3 weeks of exploratory work.
A friend once shared an essay with me for feedback. It struck me as mistaken and terribly naive, and I said so, which they did not take well. (They didn't say it, but a standard LessWrongian response here would have been "instead of insulting me, why don't you provide an actual counterargument?", and that's often a very good move for keeping conversations on track.) Why was that hard for them to be on the receiving end of?
Asset futarchy is attractive because it lets markets compare a proposal's expected effect on token value. That comparison is only reliable when conditional prices track the proposal's causal effect rather than strategic behavior around the decision rule. The attacks below describe ways a proposer can make PASS-ASSET trade above FAIL-ASSET without creating commensurate value for ASSET holders.
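As a rough illustration of why the decision rule itself is the attack surface, here is a minimal sketch of a price-comparison rule. It assumes the proposal passes when the average PASS-ASSET price exceeds the average FAIL-ASSET price by some relative threshold. The function name `futarchy_decision`, the use of a plain mean, and the `threshold` parameter are all hypothetical choices for illustration, not the mechanism of any particular protocol.

```python
from statistics import mean


def futarchy_decision(pass_prices, fail_prices, threshold=0.0):
    """Hypothetical decision rule: pass if PASS-ASSET trades above FAIL-ASSET.

    pass_prices / fail_prices are price observations from the two
    conditional markets over the decision window (an assumption).
    """
    if not pass_prices or not fail_prices:
        raise ValueError("both markets need at least one price observation")
    pass_avg = mean(pass_prices)
    fail_avg = mean(fail_prices)
    # Relative premium of the PASS market over the FAIL market.
    premium = (pass_avg - fail_avg) / fail_avg
    return premium > threshold


# PASS trades about 3% above FAIL, which clears a 1% threshold.
print(futarchy_decision([1.02, 1.03, 1.04], [1.00, 1.00, 1.00], threshold=0.01))
# Identical prices give no premium, so the proposal fails.
print(futarchy_decision([1.00, 1.00, 1.00], [1.00, 1.00, 1.00], threshold=0.01))
```

Note that a rule like this sees only the prices, not the reason they diverged: any strategy that lifts PASS-ASSET relative to FAIL-ASSET satisfies it, whether or not ASSET holders are better off.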
In the past few years, many people around me have tried to convince me that US electoral politics is important. But like many other people in the community, I’ve been suspicious of many of the high-level arguments that I’ve heard. It felt like people were pulling numbers out of poorly-documented models I didn’t have time to examine and citing studies I didn’t have time to read.
Only three days after the release of Claude Fable 5, Anthropic was forced by the United States Government to make it unavailable, when a jailbreak was brought to its attention, rather than the previous situation of ‘yes obviously experts can jailbreak anything if they care enough’ and ‘yes obviously you can ask Fable to fix your code.’
These are some personal notes taken and later dressed up a bit to make into a post. Dunno how much value is here for people already familiar with the AI Safety Ecosystem. Over several weeks in the spring of 2026 I attempted to map out the entire AI Safety ecosystem as a project for MATS Research.
GDM has published an AI Control Roadmap! From the executive summary: We present the GDM AI Control Roadmap (v0.1) – our plan for implementing and adopting internal guardrails designed to catch potential adversarial behaviour by AI agents, even as they become increasingly harder to oversee and contain. We focus on system-level mitigations that limit the harm a misaligned AI system could cause.
This is an unofficial automated linkpost.
I’ve heard some buzz around the new GLM 5.2 open-weights model. They say it’s very capable! I won’t run a full comparison benchmark, but I have some credits sloshing around on OpenRouter so I figured I might compare GLM 5.2 to the similarly-priced Gemini 3 Flash[1], and see where things land. This uses the same setup as the previous benchmark: each LLM gets a few attempts at playing the game, with each attempt being limited to a fixed budget of around $0.15.
TL;DR: Safety assurance (safety cases, heuristic arguments) involves many interpretive questions about the model’s mechanisms (e.g., motivation).
We all know the typical mind fallacy: the bias where we assume that other people’s minds are much like our own. It happens because most of our evidence for what minds are like comes from experiencing what our own mind is like, and thus we infer from that evidence that the minds of others are not so different from ours. The typical mind fallacy is deep-rooted and hard to change, since it’s difficult to get good evidence about what it’s really like inside other people’s heads.
Short introductory post for my research direction: Gaussian Natural Latents. I explain the motivation and give a preview of the forthcoming results. The Natural Abstractions agenda, in my view, is a promising research program that asks important theoretical questions about the nature of agency and optimization.
Exploiting the fact that whatever has already been generated is in context:<modeling_rule> When predicting interpretive agent reaction, list in order 1. important perceptions triggered; 2. important perceptions triggered by those of phase 1; 3. important perceptions triggered by those of phase 2; 4. important affective effects; 5.
I will note that her relationship with the divine was inextricably sexual. Her carnal fantasies she revealed to me, as she revealed all her sins, for I was her confessor. It is in the nature of Man to sin and then sin again. And if they are of our flock, this cycle is unconstrained by repentance, which is to the temporal almost an appurtenance and to the spiritual anything but. I once expressed bitterness to others of my calling regarding this.