News
SVD on Weight Differences for Model Auditing - LessWrong
1+ hour, 39+ min ago (331+ words) TLDR: We introduce a method for auditing fine-tuned models by using singular value decomposition (SVD) on the weight difference matrices and reducing...
AI Safety at the Frontier: Paper Highlights of April 2026 - LessWrong
7+ hour, 21+ min ago (23+ words) tl;dr Paper of the month: UK AISI's most realistic research-sabotage propensity eval finds zero unprompted sabotage across frontier models. Mythos...
x-risk-themed - LessWrong
6+ hour, 2+ min ago (538+ words) Sometimes the conversation focuses on what will help with x-risk, and where people are dropping the ball. But often, that's not the focus. In those conversations, people seem mostly worried about where they'll thrive. And I think that's often the…
A draft honesty policy for credible communication with AI systems - LessWrong
2+ hour, 28+ min ago (1315+ words) This is a rough research note; we're sharing it for feedback and to spark discussion. We're less confident in its methods and conclusions. As a step toward enabling credible communication, Lukas Finnveden proposed that AI companies adopt an honesty policy…
Using Base-LCM to Monitor LLMs - LessWrong
1+ hour, 51+ min ago (72+ words) Summary: We aim to determine whether the LCM model, which predicts sentence embeddings rather than token embeddings, can predict the outputs of LLM...
Monday AI Radar #24 - LessWrong
6+ hour, 13+ min ago (1411+ words) Two thresholds loom on the horizon, with only a brief window of opportunity to prepare for each. On the technical front, it is plausible that we migh...
Drifting - LessWrong
2+ hour, 4+ min ago (692+ words) "I am able to say 'no' when someone has a big ask of me. Let's say they asked me to attend a party tonight but I have other plans, I can say 'no' easily. What I struggle with is saying…
Blind deep-deployment evals for control & sabotage - LessWrong
1+ hour, 25+ min ago (1161+ words) Thanks to Ezra Newman for initial ideation and various people at Apollo Research for feedback. This short personal piece does not necessarily reflect the views of Apollo Research. AI labs are preparing to automate their internal staff over the next…
What is Anthropic? - LessWrong
7+ hour, 48+ min ago (1495+ words) What is Anthropic? How does it relate to Claude? What is OpenAI? What is ChatGPT? How does OpenAI relate to it? Is it a mere tool? Is a future of Too...
Psychopathy: The Types - LessWrong
13+ hour, 43+ min ago (853+ words) Archetypal clusters: Who are you? This is the sixth article in a series on understanding psychopathy. Previous articles covered the framework, biol...