News

lesswrong.com

SVD on Weight Differences for Model Auditing - LessWrong

1+ hour, 39+ min ago  (331+ words) TLDR: We introduce a method for auditing fine-tuned models by using singular value decomposition (SVD) on the weight difference matrices and reducing...

AI Safety at the Frontier: Paper Highlights of April 2026 - LessWrong

7+ hour, 21+ min ago (23+ words) tl;dr Paper of the month: UK AISI's most realistic research-sabotage propensity eval finds zero unprompted sabotage across frontier models. Mythos...

x-risk-themed - LessWrong

6+ hour, 2+ min ago (538+ words) Sometimes the conversation focuses on what will help with x-risk, and where people are dropping the ball. But often, that's not the focus. In those conversations, people seem mostly worried about where they'll thrive. And I think that's often the…

A draft honesty policy for credible communication with AI systems - LessWrong

2+ hour, 28+ min ago (1315+ words) This is a rough research note; we're sharing it for feedback and to spark discussion. We're less confident in its methods and conclusions. As a step toward enabling credible communication, Lukas Finnveden proposed that AI companies adopt an honesty policy…

Using Base-LCM to Monitor LLMs - LessWrong

1+ hour, 51+ min ago (72+ words) Summary: We aim to determine whether the LCM model (which predicts sentence embeddings rather than token embeddings) can predict the outputs of LLM...

Monday AI Radar #24 - LessWrong

6+ hour, 13+ min ago  (1411+ words) Two thresholds loom on the horizon, with only a brief window of opportunity to prepare for each. On the technical front, it is plausible that we migh...

Drifting - LessWrong

2+ hour, 4+ min ago (692+ words) "I am able to say 'no' when someone has a big ask of me. Let's say they asked me to attend a party tonight but I have other plans; I can say 'no' easily. What I struggle with is saying…

Blind deep-deployment evals for control & sabotage - LessWrong

1+ hour, 25+ min ago (1161+ words) Thanks to Ezra Newman for initial ideation and various people at Apollo Research for feedback. This short personal piece does not necessarily reflect the views of Apollo Research. AI labs are preparing to automate their internal staff over the next…

What is Anthropic? - LessWrong

7+ hour, 48+ min ago (1495+ words) What is Anthropic? How does it relate to Claude? What is OpenAI? What is ChatGPT? How does OpenAI relate to it? Is it a mere tool? Is a future of Too...

Psychopathy: The Types - LessWrong

13+ hour, 43+ min ago (853+ words) Archetypal clusters: Who are you? This is the sixth article in a series on understanding psychopathy. Previous articles covered the framework, biol...