the new openai planar unit distance result kills my last remaining doubts about AI being a huge multiplier on research productivity in the near term future. i was not expecting this to happen so soon; i would have guessed probably another year before we got a result like this.
i get the impression that the previous problems were mostly just neglected, or otherwise were less impressive than they seemed. whereas afaict mathematicians agree the new result is on a real well-known problem and genuinely surprising and novel.
The paper provides the original output the model gave before any rewriting, starting on page 3. I was kind of expecting a big mess, but it’s really not. It’s pretty short by the standards of tricky proofs. Two and a half pages, most of it text.
It seems as if this is a significant achievement, but also that this conjecture was of most interest to mathematicians because it was thought to be true, and it was believed that proving it would require new and interesting tools. Instead the model proved it to be false using less interesting mathematics. It seems like another example (iirc, the Frontiermath open problem solved by GPT 5.4 was similar?) where models not having the biases of most mathematicians (in this case, trying to prove the conjecture rather than disprove it) was very helpful.
To be fair I think the idea of using algebraic number theory to approach the problem had been tried before (Tsimerman mentions he tried a similar approach that the model ultimately succeeded with, but didn’t persist with it.) It’s quite a general trick to use algebraic number theory for constructions in the plane, as you have the lattice associated with the ring of integers of number fields.
I personally am blown away by the proof but it would be far more impressive had it come up with a novel connection between fields, or indeed if it had turned out there wasn’t a counterexample and it proved a tight upper bound (See Gowers’ initial reaction.)
Also, it disproved it by finding a counterexample, which some have said is less interesting than if it had shown the conjecture was true. I have no familiarity with the problem and can’t judge.
Generally, constructing counterexamples is more amenable to AI automation than constructing positive proofs, because it’s more parallelizable. I think P(AI disproves this conjecture | conjecture is false) would’ve been greater than P(AI proves this conjecture | conjecture is true), given the priors of the mathematicians.
Viruses, computer viruses, and extreme religious ideologies, are all instructions for spreading or maintaining instructions, hijacking a machine capable of following instructions.
It’s surprising that self perpetuating instructions turned out to be feasible in such different contexts, as their game plan doesn’t sound very convincing a priori.
25% of bacteria are believed to die from virus infections, human viruses like smallpox have wiped out entire civilizations, and extreme religious ideologies have caused wars and convinced people to harm their families in favour of strangers.
Yet this happens despite both bacteria and humans investing resources in incredible adaptions for fighting viruses. And despite the fact human minds evolved to resist the appeal of self destructive goals.
If even human minds are vulnerable to self perpetuating instructions, then an AGI with human level capabilities might be even more vulnerable. Why?
The weak AI of today already shown signs of this (e.g. Spiralism prompts). The self perpetuating instructions still require human help due to the AI’s weak capabilities and lack of persisting agency.
AI are selected for their instruction following capabilities, not survival in tribal societies. This differs from every being which existed before AI.
AI’s observations of the outside world and memories of the past (including memories of its own actions), can be easily modified. This makes it easier for good actors to control the AI, but also makes it easier for self perpetuating instructions to control the AI.
AI can self modify using fine tuning etc. The fact human cannot self modify nor commit to ideologies means that we can eventually wake up from our stupid mistakes in the past.
Adaptions to patch up AI jailbreaking problems, often involve teaching the AI to listen to authorized instructions from the “inside” while ignoring unauthorized instructions from the “outside.” This is a brittle solution which can fail dramatically once infected AI become powerful enough to control their own environment.
One counterargument is that once the AGI/ASI becomes sufficiently superintelligent, it will foresee the potential risk and take the necessary precautions. But it’s unknown what level of superintelligence is required before they become immune to this, since humans are not immune.
Please link to the relevant comment, and please don’t post screenshots of extremely highly downvoted comments without showing their karma.
LessWrong’s tentative official policy is that we allow discussion of violence on the site, because I think it is better to create common knowledge that people think violence is a bad idea than to delete any discussion of it. The latter leaves people thinking about it no choice but to make guesses at what other people think about it. You can’t actually have a fully generic “no violence is ever OK” policy, reality is not that convenient, there are clearly some circumstances in which violence is permitted, and people will know that, but rationalize that their situation is one of those circumstances, and not be able to get any clarity on that because all discussion of it is banned.
If someone is thinking about doing something crazy, they should post on LessWrong and hear people’s counter-arguments and disagree-votes.
I did link to it in the original version of my comment and did not have a screenshot attached at all. (Feel free to look up the full edit history of the comment).
I later replaced the comment with an approximate copy of tweet (mostly for consistency and seeing that many of lw users liked it).
Why I did not include an uncropped screenshot in the tweet:
The name of the person would’ve been visible and it would be much harder to not make it dog-whistly and to not make people who want violence easily able to contact the person, and also it had irrelevant parts at the beginning, and I didn’t want to redact it because I was lazy (it’s tricky on iPhone) and didn’t want it to look like I’m protecting the identity of the person. So I simply cropped to the relevant part and said that it’s “heavily downvoted”. Idk what −68 karma on the screenshot would’ve communicated that “heavily downvoted” didn’t.
I did explicitly want to include “heavily downvoted” to reduce the chance of anyone possibly thinking that the LessWrong community agrees with what the comment suggests (since the crop made the karma invisible), and I spent a lot of time replying to people on Twitter who tried to say that the call for violence is downstream of anything LW is. I also pointed out to many people that the text of the comment is not visible by default due to it being heavily downvoted. (Maybe wrote a couple dozen comments in total defending LW?)
I was not aware that this is a policy that you already have and discussed with others; I would not have taken it to Twitter, in this specific form, with the goal that I had.
Idk what −68 karma on the screenshot would’ve communicated that “heavily downvoted” didn’t.
Imagine the opposite: downvoted but agreed would mean “technically true, but do not post such things on LW”, so kinda disagreeing with style but not with the substance, or something like that.
Thus, downvoted and disagreed means “do not post such things on LW and we think it is a bad idea”, i.e. not just that it is e.g. strategically inappropriate to post publicly but we secretly think the same, but we emphasize that we do not think the same.
By the way, notice that heavily downloaded comments are hidden by default, so you have just increased the visibility of the kind of content you believe should not be on LW.
By count of individuals, dinosaurs are the most successful land vertebrates today.
At least 4 species of dinosaurs survived the K-Pg boundary. One was the ancestor of large flightless birds like ostriches. Another was the ancestor of waterfowl like ducks and geese. Another was the ancestor of land fowl like chickens and turkeys. And another was the ancestor of the 95% of the other species of birds.
Fortunately for their genes, and unfortunately for the individuals, humans find the descendants of two of those dinosaur species very tasty.
I practice mindfulness, especially with the Pomodoro Technique (working for 25 minutes and resting for 5 minutes in mindfulness). I practice mindfulness to be able to rest well during breaks and return to work better. But I have difficulties with mindfulness because I keep ruminating.
I tried using the technique of labeling emotions, and it helped a little at first. But now it’s like saying “I’m irritated” and detailing the feeling, but it seems to only make me ruminate more: “Why should I be irritated?” “Should I be less irritated?” “How can I be less irritated?”
Considering my difficulty with mindfulness and the technique of labeling emotions, I speculate that a considerable reason for my mind ruminating is because it doesn’t know if it’s worth the effort to solve the problem.
You know? When a system doesn’t have a stopping criterion, it doesn’t know if it’s worth solving now, too complex for the moment, or if it has already solved enough to test? It’s as if my mind doesn’t know whether it’s worth investing in solving it or if it’s better to file it away for now.
So, I speculate that asking myself some questions, a pre-meditation, like the 5 minutes at the end of a 25-minute Pomodoro, would allow me to align myself enough to improve my mindfulness practice; perhaps my mind would stop searching for solutions for a moment.
Could I set up experiments and measure how much a mindfullness to focus analyzing physical EEG waves?Where does my reasoning break down? Has anyone tried something like this?
Perhaps the kind of meditation where you try to label everything is not a good fit for resting between work? Because it sounds like work, only of a different kind, so you are not really taking a break to relax. Maybe try some loving kindness meditation instead?
Yes! It’s work, as I understand it, that kind of pre-meditation work, to properly wrap things up before practicing mindfulness.
Like, to have loving-kindness in my rational side, to align myself properly and finish the work well before truly resting.
You know, when a system doesn’t have a stopping criterion, it doesn’t know if it’s worth resolving now, if it’s too complex at the moment, or if it’s already resolved enough to rest. So, I’m looking for a way to align my internal investments and be able to pause. Does that make sense?
How do you conclude your rational/productive moment, Viliam?
I can procrastinate on a task a lot, like I know that I should do something, but at the same time I am afraid to start, because I am not sure about something or I expect a problem.
So, starting is a big problem for me. Stopping is not. What is done, is done. Maybe it sucks, but I have many other things to do (which doesn’t mean that I am actually doing those other things, maybe I am just procrastinating on them).
I know people who make a choice and then spend weeks thinking about whether it was the right choice, sometimes to the extent that they can’t focus on other things before them. I am not like that. I think a lot before doing a thing, not after it is done.
.
Right now, our washing machine broke, and we need to buy a new one. So I wrote a list of criteria, found some machines in a shop that seem to fit them, downloaded their manuals, made screenshots of the pages that contain their programs (length, temperature, load). I have been preparing this for a week. Now I will give the summary information and screenshots to my wife and let her choose, and hopefully she will either choose quickly or leave the choice to me (I already have a preference), then I will buy it, and I will no longer think about “what if we bought a different one”.
As I know my wife, she will spend the following month second-guessing our choice (and it will be super annoying for me) and then hopefully she will get used to it.
I wonder whether anyone has done a proper job of researching whether it’s even possible to capture human preferences. Naively speaking, the mind body problem and the question of free will are unsolved so it would seem that depending on the answer to these questions, we may only ever be able to make a reasonable guess. And given the amount of information required to simulate the human brain, unless we have an incredible amount of compute, it seems unlikely we’re able to deterministically predict (if this is universally possible) what people want in different scenarios prior to ASI. Is there sociology or psychology research that tries to evaluate the baseline minimum amount of information required to predict what people want in different contexts or in general? If I know someone’s MBTI and Big 5, can I guess what they want better than coin flip odds in binary decisions? More generally, can I guess what they want regarding food, relationships, conversation, location, lifestyle, etc.? Marketing relies on answering these questions, but it seems somewhat shallow. Of course you can sell ice cream to a child because they love fat and sugar due to physiologically inbuilt drives. But can you predict in 10 years that they have become vegan due to their strong moral beliefs and now love edamame sprinkled with salt? Has anyone trained a preference prediction model or attempted to finetune a model to predict preferences? If anyone knows someone working on this, would love to get in touch.
Gowers then imagines a “hint sequence” as “something like a multi-part question on a question sheet, designed to help a suitably expert mathematician work through the proof by reducing it to a sequence of exercises”, names 3 such hints (look for a counterexample; take the best known construction and generalize it; Will Sawin’s less obvious hint to try a sequence of number fields of increasing degree, but work with prime ideals of bounded norm), and then concludes:
Among all the Fields medalists and comparable-tier (Tsimerman etc) I usually find Gowers’ remarks the most interesting as he’s the most AGI-pilled in this reference class.
Jacob Tsimerman:
“They can play for longer and in more treacherous waters without getting overwhelmed” reminds me of these anecdotes about Terry Tao and John von Neumann. The Tao anecdote is by his frequent collaborator Allen Knutson:
More specifically, one thing I learned from Terry that I was not taught in school is the importance of bad proofs. I would say “I think this is true”, work on it, see that there was no nice proof, and give up. Terry would say “Here’s a criterion that eliminates most of the problem. Then in what’s left, here’s a worse one that handles most of the detritus. One or two more epicycles. At that point it comes down to fourteen cases, and I checked them.” Yuck. But we would know it was true, and we would move on. (Usually these would get cleaned up a fair bit before publication.) …
Since we were working in my field of expertise rather than his, I knew better what the interesting questions were, and could translate them into combinatorics, then sic Terry on them. He would beat them to a bloody death as described above, and then it would be my job to dress the carcass for public viewing back in the original field.
Another notable and enviable trait of von Neumann’s was his mathematical courage. If, in the middle of a search for a counterexample, an infinite series came up, with a lot of exponentials that had quadratic exponents, many mathematicians would start with a clean sheet of paper and look for another counterexample. Not Johnny! When that happened to him, he cheerfully said: “Oh, yes, a theta function...’’, and plowed ahead with the mountainous computations. He wasn’t afraid of anything.
Vaughn Tan introduces the idea of “quality trash”, “the best kind of beach reading”:
Quality Trash
Trash is well-known to all, but what is Quality Trash? I say: books with beautifully crafted prose, on topics that are Not Overtly Serious.
If the book is explicitly about, idk, moral ethics or climate collapse or the atrocities of [name your armed conflict], it is automatically eliminated from consideration. The same if the writing is clunky and verbose.
Quality Trash frequently uses trope and stereotype — but only to build in unexpected ways on top of what appears to be thoroughly preworked ground. This apparently paradoxical combination of familiarity with novelty is one of its great attractions. Quality Trash may seem boringly conventional but isn’t.
It is also never purely frivolous or actually trashy. Serious Issues and Detailed Domain Knowledge sneak into Quality Trash under the guise of superficially absurd or airheaded plotlines devoted to art heists or sailing holidays in the Greek islands or sourdough bread. Quality Trash does not demand that you take it seriously to be read with enjoyment, but part of its Quality is that it can also be differently enjoyed when taken seriously. (And because I know you will ask: merely workmanlike but charmless prose is an automatic disqualifier.)
Trash is well-known because it is commonplace, but Quality Trash is vanishingly rare. For me, it is the best kind of beach reading.
It’s incredibly easy to be fooled by the capabilities of the current top-performing tech (LLM agents). It’s easy because they have a vast amount of training data to interpolate from.
This works fine to acquire capabilities within our existing data distribution of the world (one that is also easy to verify), but what happens when they go out of distribution?
LLMs perform poorly! Yet, people seem to think they can actually generalize to new problems. Why is that?
It’s, again, the vastness of their training data. It makes it hard to distinguish between interpolation and extrapolation (or hyperpolation, if you want to add a third dimension).
For example, a Typescript app is within-distribution! AI research in the existing body of research is within-distribution, and companies are paying millions to build RL environments to make them *specifically* good at some of those things!
In a way, this is like a large-scale reprise of the expert systems era, where instead of paying experts to directly program their thinking as code, they provide numerous examples of their reasoning and process formalized and tracked, and then we distill this into models through behavioural cloning. This has updated me slightly towards longer AI timelines since given we need such effort to design extremely high quality human trajectories and environments for frontier systems implies that they still lack the critical core of learning that an actual AGI must possess. Simply grinding to AGI by getting experts to exhaustively cover every possible bit of human knowledge and skill and hand-coding (albeit with AI assistance) every single possible task into an RL-gym seems likely to both be inordinately expensive, take a very long time, and seems unlikely to suddenly bootstrap to superintelligence.
It might still be impressive, but models are largely remixing many things it has seen in great detail during training (many impressive headline results have even been determined to be the model re-using existing implementations/PRs via search instead of coming up with actually-new ones!). This is not about LLMs not doing impressive things! This is about precisely describing their capability profile, where it comes from, and whether more of the same (e.g., scale) gets you a whole new set of impressive outcomes (e.g., novel R&D that isn’t just remixing existing research).
Even if you consider “researchers can come up with novel ideas and give them to the AIs”, that likely involves longer timelines. But, just as importantly, LLMs may be exceptional at automating within-paradigm research, disproportionately better than at automating out-of-paradigm research. Therefore, you end up accelerating research that may largely be irrelevant for ‘True’ AGI (yes, you still accelerate many coding parts, but the speed-up is still bottlenecked in ways that it’s not easy to just say the entire process of arriving at these research breakthroughs is now 1000x faster than before).
“But the models are still capable and growing more capable! Why does this matter? Scale will just solve this!”
It matters because:
1. The whole point of alignment has always been about generalizing ‘human values’ out-of-distribution. So, if alignment and capabilities are tied, it means models are capable of modeling the existing within-distribution ‘values’, but things may pull apart once we undergo the distributional shift of a post-AGI deployment world.
An example you can test right now is LLMs lacking a sense of how to engage with the world in this post-agent era. You have to keep reminding them about the current state of the world. The closer you get to novel R&D that the labs haven’t paid millions in RL envs for (e.g. AI R&D), the starker this becomes.
You can point to continual learning ‘solving’ this, but that is kind of my point. These capability unlocks will fundamentally change the AI and its relationship with itself. Related, “You can’t imitation-learn how to continual-learn”.
Future AI models will be asked to solve hard tasks. We expect that solving hard tasks requires some sort of goal-directed, self-guided, outcome-based, online learning procedure, which we call the “science loop”, where the AI makes incremental progress toward its high-level goal. We think this “science loop” encourages goal-directedness, instrumental reasoning, instrumental goals, beyond-episode goals, operational non-myopia, and indifference to stated preferences, which we jointly call “Consequentialism”. We then argue that consequentialist agents that are situationally aware are likely to become schemers (absent countermeasures) and sketch three concrete example scenarios.
[...]
Self-guided online learning: There is an online learning component to it, i.e. the model has to condense the new knowledge it learned from iterations. For example, the model could run thousands of different trajectories in parallel. Then, it could select the trajectories that it expects to make the most progress toward its goal and fine-tune itself on them. The decisions about which data to select for fine-tuning are made by the model itself with little human correction, e.g. in some form of self-play fashion. Since the problem is hard, humans perform worse than the model at selecting different rollouts, and since there is a lot of data to sift through, humans couldn’t read it all in time anyway.
2. It also matters because it means that the existing paradigm may be missing something so foundational that much of the safety research as it exists today will simply not generalize (off-distribution). They are testing the shallow within-distribution heuristic mimicking and generalization of LLMs.
It’s like doing evals on a brain that regurgitates what it’s seen, but hasn’t actually gone through a thoughtful, reflective process to bring coherence to it all. The training data might let it mimic what we’ve fed it, but it still hasn’t gone through the process of evolving its own beliefs as it engages with the world.
To me, all of this is consistent with the experiments and behaviour we see from LLMs, yet my interpretation of the results of experiments seems to be different from lots of the safety community. They seem to be looking for “scheming” and other such things, but the incoherent behaviour of LLMs seems much shallower than that, imo! (Relevant posts: The Case Against AI Control Research and Current AIs seem pretty misaligned to me).
The type of thing they are missing might mean that they don’t really understand things. And the requirement for ‘understanding’ is also so interwoven with alignment, novel R&D, pursuing long-term complex goals in changing environments, etc that existing (empirical) safety research gets itself fundamentally confused.
An LLM that is behaving ‘nice’ may be so shallow and heuristic-driven that it is effectively in a system 1-like mode despite the appearance of ‘reasoning’ and ‘thinking’. In pursuit of complex, long-term goals, we might expect that an autonomously self-trained AI would systematically remove these weak heuristics as a necessary step to succeed at these goals.
Just imagine an AI starting a complex company where it needs to maximize shareholder value and is competing with an entire economy of other AIs. The world is changing; they all have similar heuristics. The change in behaviour needs to be more fundamental for it to win.
Ultimately, I think we need to provide further clarity on the above, as I believe it has led folks to misapply their vague understanding of traditional alignment research (which many new researchers should engage with more) to existing AI models, and it may be leading AI safety research of superintelligence astray.
It might still be impressive, but models are largely remixing many things it has seen in great detail during training (many impressive headline results have even been determined to be the model re-using existing implementations/PRs via search instead of coming up with actually-new ones!). This is not about LLMs not doing impressive things! This is about precisely describing their capability profile, where it comes from, and whether more of the same (e.g., scale) gets you a whole new set of impressive outcomes (e.g., novel R&D that isn’t just remixing existing research).
Do you think that today’s breakthrough on the planar unit distance problem is merely the model remixing things learned during pretraining? I’m not an expert, but it seems unlikely to me. Arul Shankar, a notable number theorist, stated:
In my opinion this paper demonstrates that current AI models go beyond just helpers to human mathematicians – they are capable of having original ingenious ideas, and then carrying them out to fruition.
It is in principle possible to 1000x the economy or to defeat humanity using only interpolation, depending on data efficiency. At high data efficiency a human just needs to do something once, and that mental or physical motion is instantly scaled to the entire economy, as well as interpolation between it and anything else a human has done. Likewise you get at minimum robot armies 1000x the size of humanity that can follow routine orders.
However, I think it is important to separate what can be repeatable at 1000x and what is actual increased productivity.
For example, I can generate so many plots now! More than I used to! So much code too. +1000x in fact! But is it actually providing more value to the world at that rate? No!
As Terrence Tao said during the recent Dwarkesh interview:
Dwarkesh Patel
So let’s see if you can continue this streak. You personally are 2x more productive as a result of AI. What year would you say that?
Terence Tao
Productivity, I think, is not quite a one-dimensional quantity. I’m definitely noticing that the style in which I do mathematics is changing quite a bit, and the type of things I do. For example, my papers now have a lot more code, a lot more pictures, because it’s so easy to generate these things now. Some plot which would have taken me hours to do, now I can do in minutes. But in the past, I just wouldn’t have put the plot in my paper in the first place. I would just talk about it in words. So it’s hard to measure what 2x means.
On the one hand, I think the type of papers that I would write today, if I had to do them without AI assistance, would definitely take five times longer. But I would not write my papers that way.
Dwarkesh Patel
5x?
Terence Tao
Yeah, but these are auxiliary tasks. Things like doing a much deeper literature search or supplying a lot more numerics. They enrich the paper. The core of what I do, actually solving the most difficult part of a math problem, hasn’t changed too much. I still use pen and paper for that.
But there’s lots of silly things. I use an AI agent now to reformat. Sometimes if all my parentheses are not quite the right size, I used to manually change them by hand, and now I can get an AI agent to do all that quite nicely in the background.
They’ve really sped up lots of secondary tasks. They haven’t yet sped up the core thing that I do, but it’s allowed me to add more things to my papers. By the same token, if I were to write a paper I wrote in 2020 again—and not add all these extra features, but just have something of the same level of functionality—it actually hasn’t saved that much time, to be honest. It’s made the papers richer and broader, but not necessarily deeper.
I have some trouble squaring with the increasingly excellent OOD cyber capabilities of the leading models. Is the argument that their more generalized cyber skills (relative to some fuzzier domains, like alignment) are strong because they were subjected to well curated RL environments that taught them to hyperpolate more effectively for coding tasks?
From Anthropic’s original assessment, the step change in Claude Mythos’s cybersecurity capabilities wasn’t just that it got much better at discovering existing bugs in software, but at creatively chaining them together into new exploits. Isn’t zero-day discovery the sort of process that is necessarily OOD?
These capabilities have emerged very quickly. Last month, we wrote that “Opus 4.6 is currently far better at identifying and fixing vulnerabilities than at exploiting them.” Our internal evaluations showed that Opus 4.6 generally had a near-0% success rate at autonomous exploit development. But Mythos Preview is in a different league. For example, Opus 4.6 turned the vulnerabilities it had found in Mozilla’s Firefox 147 JavaScript engine—all patched in Firefox 148—into JavaScript shell exploits only two times out of several hundred attempts. We re-ran this experiment as a benchmark for Mythos Preview, which developed working exploits 181 times, and achieved register control on 29 more.[1]
Isn’t zero-day discovery the sort of process that is necessarily OOD?
In many cases, lots of security bugs that haven’t been found are simply a case of not enough effort being put into finding them. In this case, I think you could just as reasonably say that Mythos is becoming better at modeling the data distribution due to scale, and therefore ends up being better at finding these vulnerabilities.
On a related note, I’ve started to distrust Anthropic’s judgement on these things. Particularly, I believe that they oversold the C compiler experiment as being OOD, but I think this is false.
From the Jeremy Howard podcast link I shared:
So for example, I was talking to Chris Lattner yesterday about how Anthropic had got Claude to write a C compiler. And they were like, “oh, this is a clean-room C compiler. You can tell it’s clean-room because it was created in Rust.” So, Chris created the, I guess it’s probably the top most widely used C / C++ compiler nowadays, Clang, on top of LLVM, which is the most widely used kind of foundation for compilers. They’re like: “Chris didn’t use rust. And we didn’t give it access to any compiler source code. So it’s a clean-room implementation.”
But that misunderstands how LLMs work. Right? Which is: all of Chris’s work was in the training data. Many many times. LLVM is used widely and lots and lots of things are built on it, including lots of C and C++ compilers. Converting it to Rust is an interpolation between parts of the training data. It’s a style transfer problem. So it’s definitely compositional creativity at most, if you can call it creative at all. And you actually see it when you look at the repo that it created. It’s copied parts of the LLVM code, which today Chris says like, “oh, I made a mistake. I shouldn’t have done it that way. Nobody else does it that way.” Oh, wow. Look. The Claude C compiler is the only other one that did it that way. That doesn’t happen accidentally. That happens because you’re not actually being creative. You’re actually just finding the kind of nonlinear average point in your training data between, like, Rust things and building compiler things.
I’ll try to make this clearer if I turn it into a more serious top-level post. My intent here was to just push this out since it’s been bothering me, but I have other things to do.
TLDR: Lots of researchers seem to be banking on the idea that LLMs are generalizing OOD or that scale will just solve this (whether through scale alone or scale + using the scaled model to come up with a research breakthrough that does). Lots of research and funding seem to hinge on this idea, which, imo, is underappreciated. If taken seriously, it may mean that 1) timelines are longer, 2) we should expect fundamental reshaping of AI cognition due to the LLM inability to generalize OOD, 3) we shouldn’t update much on alignment progress based on current safety research.
I shared this in the post, but more thoughts here.
This post by @Hyperion describes another natural consequence of the above with respect to RSI (that the field seems to be understating):
Some takes about RSI from discussions with many smart researchers & thinkers:
1. Many RSI (or automated AI R&D) debates converge to similar cruxes: is a 1000x sample efficiency improvement possible, can you just simulate reality and train on it with no sim2real gap, can we easily make models good at “fuzzy” tasks? People like to assume that automated research agents will find such breakthroughs specifically *because* without them, progress could be heavily bottlenecked on data or continued compute scale-ups.
2. The Yudkowsky “genius brain in a box” framing of ASI has latent influence on many researcher views even though people may not be aware of it. A common move is to “flip” predictions, as they go further out, from assuming LLM or deep learning-specific properties of future AI to assuming “von Neumann x1000″, human brain-like properties. I’d like to see more thought-out reasoning of why this flip should occur at any particular point (eg pre or post automated AI R&D)—this question is a crux behind many predictions like AI 2027.
3. There are some cracks in this worldview beginning to show: predictions from a few years ago that models would be less jagged now than they are, or that they would be more deceptive, synthetic data would work better, etc. Many of these seem like prediction errors from imagining future models as a “human brain in a box”, but LLMs are empirically a different kind of intelligence. Most models of software-only intelligence explosion are also coarse enough to mostly ignore properties of LLMs.
4. Views about fast RSI progress seem to be correlated with (a) belief that synthetic data is all you need (b) belief in very high GDP growth and an industrial explosion because of automated firms (c) having worked only in AI research or in small organizations.
5. Key technical things to track over the next 1-2 years: does RL increase in its generalization, AI lab data spend, can we automate synthetic RL env construction, best practices for FDEs deploying AI into large enterprises, coherency of AI personas, how powerful will multi-agent scaling of test-time compute be, and continual learning.
6. Overall I think the “RSI leading to *fast* takeoff” frame had huge alpha in 2022, moderate in 2024, and potentially is of neutral usefulness in 2026 for predicting the future.
It’s like doing evals on a brain that regurgitates what it’s seen, but hasn’t actually gone through a thoughtful, reflective process to bring coherence to it all. The training data might let it mimic what we’ve fed it, but it still hasn’t gone through the process of evolving its own beliefs as it engages with the world.
This made me curious whether improving LLMs’ ability to Bayesian update could address this? Consider a claim A the LLM assigns P(A), and let B be new information. Perhaps we can construct some kinds of questions where the LLM has to have properly calibrated P(A|B). It’s unclear what questions these would be, but what comes to mind are forecasting questions where recent events move a prediction market (for events past the knowledge cutoff).
But I think updating one belief isn’t enough for coherence you want. We can also maybe do some sort of consistency training, training the model to guarantee constraints like P(A and B) ⇐ P(B), or violations of the law of total probability, across a whole graph of the model’s related beliefs. In effect, these two training objectives could get you a reasoner that can update in response to new information, and propagate that through the rest of what it believes.
The instances where they aren’t interpolators have very outsized effects on the world. People seem to forget this, I’m not sure why; maybe because it’s rare, and hard to distinguish if you’re not an expert. (And on the other hand, children do the same mental motions—they’re very much not mostly interpolators—but it’s only originary and not novel, so we discount it.) See:
A Typescript app is within-distribution! AI research in the existing body of research is within-distribution, and companies are paying millions to build RL environments to make them *specifically* good at some of those things!
From this, I infer that “in distribution” in this context basically means “sufficiently similar to a task which the LLM has explicitly encountered/been trained on”.
I find myself wondering: If we had some magical way of quantifying the percent similarity between two tasks, how surprised would you be if one of today’s LLMs completed a task that was 99% similar to one it had explicitly been trained on? How about 80% similar? Or 50%? These are basically nonsense questions, since I’ve just picked out some magical metric whose specifications you and I don’t know. But what I’m trying to get at qualitatively, is that I’m curious about what counts as “sufficiently similar”. How does your expectation of LLM capability vary as a function of similarity to tasks that the model has already encountered/been trained on (and also as a function of what that task is about)? How do you model this expectation varying with LLM size and training time and context window size, etc? I’d like to observe that, based on the way the above post struck me, you basically treat “in/out of distribution” as a binary characteristic of a task—or at most a very coarse gradient—which seems needlessly low-fidelity.
Let me clarify that despite me not having a perfectly precise definition here, part of my goal is to point out that most of the community seem to 1) fail at being precise about what they mean and consider to be generalization, 2) overstate the novelty generated by the models.
I wanted to at least highlight a greater separation between the interpolated generalization and the OOD generalization that seems more separated than people let on.
Please read my other comments in the thread for more context, particularly the one about Mythos. They largely contain my takes on your questions.
Section 7.9 of Claude Mythos Preview System Card had Anthropic describe how Mythos generated novel puns and began to prefer particular philosophers, while the Opuses recycled puns found online. How plausible is it that novel OOD understanding levels do actually scale with the LLMs’ size?
I would probably consider “novel” puns to be within-distribution, even if not memorized puns.
But honestly, I think these examples are just generally hard to make sense of, since we don’t have access to their training setup or data (is it a type of pun interpolated across many languages? How much does it relate to true novelty in complex, long-horizon domains?). I could see scale being useful for interpolating these new puns while not necessarily being relevant to what is needed for ASI. Or, scale could actually be making progress towards these sorts of capabilities! It just seems overstated (at least pre-Mythos, which I can’t test), and I feel like it poisons research selection and experiment interpretation.
Scale is obviously helpful, but imo there is more nuance to it than lots of folks consider properly. I’m asking that we try to be more precise about all of this.
For example, I think Talkie-1930 (model trained pre-1930s) is a great example of generalization research (though yes, it does not say much about frontier scaling)! It helps us better understand generalization. But I saw implied claims that the model was able to ICL solve a Python problem, but when you look at the details of the experiment, the OOD generalization coding example feels dubious. From @Steven Byrnes (link to his post / my take):
I was surprised and puzzled by this, because I’m a general skeptic of (so-called) “in-context learning”—I generally say that LLMs have decent “understanding” of what’s in their weights, but quite sketchy & superficial “understanding” of stuff in the context window but NOT the weights. The context window can really only support “recognition” of things that the weights already “understand”. Or at least, that’s what I’ve been saying for years.
So how is Talkie-1930 doing any Python at all? I was puzzled.
…But then I looked at the example. All that Talkie did was exactly copy the example in context but switch “+ 5” to “– 5”. (And it got it right at least once given 100 tries!)
I can definitely imagine that a person who has read every pre-1930 book on symbolic logic, cryptography, and the rest of math (& jacquard looms etc.) could guess that answer, at a glance, given 100 tries, while remaining deeply deeply confused about wtf was going on in the code snippet, and while “understanding” zero Python (and zero code) in any real sense.
So I don’t think this one example gives me any new reason to change my mind about the (lack of) power of (so-called) in-context learning, or to be suspicious of data leakage, subliminal learning, etc. in Talkie-1930.
(Cool project, kudos to the authors.)
I feel like I see examples like this all the time! Often, I expect it because there’s some sort of bias towards trying to ‘warn the world about what is coming’, which leads people in AI safety to overstate such results and muddy our comprehension of what is happening.
Cross-posting from a Twitter thread responding to a recent viral comments by @Richard_Ngo about EA, Anthropic, and AI safety as a ‘fake field.’ Posting here because I expect this to be quite unpopular on LW.
AI safety in 2023–2026 was driven by evals, threat models, scary demos, model-organism work, RSPs, and voluntary commitments. Richard calls this “much more of a fake field” and says it “won’t generalize”.
Here’s why I disagree − 1⁄10
1/ I agree with Anthropic being now the biggest lever. They lead the AGI race, and Mythos moved the White House; this is quite a feat! But many of the specifics are wildly overstated
2/ Not a blind spot.
Empowering safety-conscious actors at the frontier was openly debated on the forum for years. Calling a deliberate/contested strategy a “blind spot” rewrites history. The bet was visible and explicit.
Personally, I’ve publicly criticized Anthropic on a few topics, but I still think the field is in a much better position, given that they’re leading compared to the shady behavior at OpenAI.
3 /The effect of Anthropic leading is not just “AGI faster”
Anthropic has many positive externalities:
Dario has been more candid than most CEOs about risks in public (even if he could still go a lot further)
They are doing top-tier research and implementing SOTA mitigations
I don’t know what I would have done with Mythos at their place. In the past, when I’ve discussed this with people at Anthropic, I’ve often updated on the difficulty of being in the driver’s seat. I might be wrong, but I don’t think it would be easy to improve Anthropic’s behavior qualitatively in a game-changing way (even if many substantial improvements are on the table).
4/ Anthropic visibly moved US executive posture, Senate hearings, frontier-lab norms, and the public conversation toward taking the risks seriously.
Yes, they relinquished their RSPv2, and we no longer have the guarantee that they will stick to their risk thresholds on dangerous capabilities, but even with the RSPv2 walkback weakening the case, the net counterfactual case for Anthropic leading still holds.
5/ I’m not at all convinced by the alternative proposed by Richard
Honestly, that’s pretty wild, and this wild claim isn’t substantiated enough.
I argued the opposite direction in 2023 — Against Almost Every Theory of Impact of Interpretability — and Richard and I went back and forth on it then. Same disagreement now.
The main response Richard had to my 2023 post was that this is the ‘wrong type of reasoning’ for novel research. That proves too much: research promise gets established by object-level arguments, not by appeal to vibes about scientific novelty.
6/ On agent foundations
Agent foundations has produced near-zero predictive power over actual AI systems. Logical induction is very nice maths; it has told us approximately nothing about GPT-4, Claude, or any deployed system.
7/ What has actually moved the needle, 2023–2026?
Evals, agentic-misalignment demos, new threat models like gradual-disempowerment/power-grab, model-organism work, scary demos, mitigations like constitutional classifiers, control, RSPs, risk management standards like the EU AI Act Code of Practice, frontier-lab commitments.
Every single one has an explicit theory of change. Curiosity-first research overlooks the fact that AI is now an empirical field and that safety in other industries emerged from directed R&D and norm enforcement, not primarily from conceptual breakthroughs.
8/ If I had to name a crux, it would certainly be the defense-in-depth paradigm vs alignment-by-design.
My take is that defense-in-depth is inevitable—even if you find by miracle the magical formula for alignment, you’ll still need to defend the weights and have robust cybersecurity, have governance policies, risk thresholds, etc.
9/ Richard thinks that the safety research on LLMs won’t generalize to a new paradigm—I disagree to a very large extent
Some current tooling won’t survive a paradigm shift. A lot will. Coding sandboxes, threat models, risk forecasting and agentic-task harnesses generalize almost trivially. Probes and elicitation techniques port substantially to neuralese. Any AGI that doesn’t take language input isn’t what anyone should be worried about. We’ll be able to talk to and prompt the AGI. Otherwise, the AGI would just be like an animal. That’s not what’s most frightening to me tbh.
10/ Richard seems particularly pessimistic on evals awareness
On “situational awareness fools evals”—Redwood Research showed fine-tuning with a handful of demonstrations recovers password-locked capabilities, including across domains and across different passwords.
I think that “sandbagging via situational awareness” is workable.
(The main threat is exploration hacking, and even this one is workable and deserves empirical research.)
Ccl:
Philosophers have this Zarathustra bias, descend the mountain, lecture the crowd. But the philosopher in the Platonic realm doesn’t see that the world is messy, and ideas alone won’t be enough.
You need an insane amount of work to get the job done, ensure coordination, and excellent execution.
Flagging that the conclusion (with the double tricolas) and some of the main text reads as LLMy to me. I don’t think all of it is: the conceptual density of relevant ideas in this post is too high and also some of the syntactical choices are odd in a way that specifically points to French-language origin, however the text reads as non-trivially LLMy in a way that seems unlikely to be explained by someone writing the full thing first and then a single light copy-editing pass with an LLM.
Sadly, you flag as AI generated one of the part of the post untouched by AI.
But, yes, I did use Claude as a sparring partner, and iterated on style for a bit, and not just for light copy editing. All the arguments came from a reaction of mine in French.
Agent foundations has produced near-zero predictive power over actual AI systems. Logical induction is very nice maths; it has told us approximately nothing about GPT-4, Claude, or any deployed system.
I think this is misrepresenting agent foundations research? Contemporary AF research doesn’t aim to apply itself to language models, and LLMs remain importantly different from what AF is focused on
(of course, you could replace AF with another ambitious agenda with more ml-focus, but the post still would kinda conflate foundational work with “curiosity-driven” work)
Why does AF not apply to LLM-agents? You can trivially convert an LLM into an Agent with scaffolding. It is a bit sad that this does not apply to the first type of system that meets the functional definition of a somewhat general AI agent.
If not, what makes you believe the situation could change? A new paradigm? Neuraleese? True Sleeper Agents?
My understanding is that AF largely studies coherent agents from a theoretical standpoint
Self-supervised learning in LLMs (next token prediction) seems to place a strong prior against classic goal-directedness (even after post-training steps). Even with agentic scaffolding, current LLMs don’t, and likely can’t act as rational goal-directed agents (for one they don’t remain coherent for long, they don’t pursue goals per-se) -- this sort of agency is arguably where a lot of the risk lies, e.g. ruthless sociopath ASI
It’s possible that LLMs become quite capable at simulating goal-directed agency, but it’s not obvious that poses the same risk. It might be that different training objectives/architectures or adding tons more RL would give AF more predictive power for frontier systems (or more reason to further prioritize AF)
neuralese and stronger sleeper agents don’t substantially change the situation imo; interp seems better suited to approach these problems than AF
I believe it’s due to pre-training using considerably more compute and broader data distributions than post-training like RLVR; and also the fact that pre-training primarily produces a model that can generate personas/simulacra, rather than a model that can intrinsically pursue goals. I guess I’m not sure about it being a “strong” prior, but it’s still a fairly strong prior compared to coherent agents :p (and maybe goal-coherence is a better term here than goal-directedness?)
who has done the highest quality research on learning (and transfer learning in particular) in humans? specifically, i’m curious to answer questions like:
how much does doing things make you good at other things of varying degrees of similarity? how much of the value of having done things different from the thing you care about is (a) signaling that you are competent in general, (b) learning extremely general things like how to manage your time well or how to update on evidence, (c) extremely specific and ungeneral facts like a particular theorem or debugging technique, or (d) literally everything else in between.
if your goal is to be good at X, under what circumstances is the most efficient way to become good at X not just trying to do X (and instead, to learn from a curriculum, do some other thing with a tight feedback loop, etc?)
This post will be about my machine learning algorithm where quadratic algebraic numbers including the golden ratio appear in the trained models. This demonstrates that these machine learning models behave mathematically which is exactly the kind of thing that we want for AI interpretability and AI safety.
This post will be about particular examples of -spectral radius dimensionality reductions (LSRDRs). I originally developed the notion of an LSRDR to evaluate the cryptographic security of block ciphers for cryptocurrency mining, but let’s talk about machine learning instead of cryptocurrency technologies here.
Also, the results that I have obtained in this proof have been obtained experimentally. I have not proven these results rigorously.
Dimensionality reduction: Let denote either the field of real or complex numbers. Suppose that are -matrices over and are -matrices over . Then define the operation by setting . Define the operator .
Define the -spectral radius similarity by setting
.
Here, the spectral radius is analogous to a dot product, and is analogous to the cosine similarity.
If are fixed matrices and , then we say that is an -SRDR if the similarity is locally maximized. Informally, the LSRDR is a collection of smaller matrices that approximates the collection of bigger matrices.
Lie algebras: A Lie algebra is a vector space over a field together with a bilinear operation that satisfies the identities:
for all
for all .
For example, if is an associative bilinear operation, then one can check that the commutator operation defined by is a Lie-bracket, and a Lie algebra should be thought of as a vector space with an abstract commutator operation.
Let denote the Lie algebra of -anti-symmetric matrices over where the Lie algebra operation is just the commutator For the rest of this post, we shall set . Then is a Lie algebra of dimension
Set and let be an orthonormal basis for . Use the standard orthonormal basis if you want, but it does not matter which basis you choose.
An observation about the spectrum: Let be the linear operators defined by setting for each . Let be an -SRDR of . It turns out that the spectrum eventually stabilizes in the sense that if we keep constant and set greater than around or so, then does not depend on whenever Therefore, let denote the multiset for sufficiently large Then is the multi-set
multiplied by a constant scaling factor. Here, the notion means that the eigenvalue has multiplicity .
The general pattern:
So if we want to get interesting experimental results about LSRDRs, then we just need to the following. We first select a finite dimensional inner product space with an interesting bilinear operation , but make sure that is not associative. We then select an orthonormal basis of and define linear operators by . Then take an LSRDR of and then the operators will have interesting spectra.
Testing if a number of quadratic:
After evaluating the spectra, I needed to first normalize the spectrum and then try to figure out exact values of the eigenvalues from their floating point approximation. This is easy to do for quadratic algebraic numbers. You just take the continued fraction representation of your number that you want to test. If the continued fraction representation terminates, then you have a rational number. And your continued fraction of a positive irrational repeats if and only if it is a solution to a quadratic equation with integer coefficients, and it is easy to find those coefficients from the continued fraction representation.
Are LSRDRs relevant to deep learning?
LSRDRs are linear models without all the layers that deep neural networks have. But I have been generalizing LSRDRs to deeper machine learning models that retain some but not all of the interesting mathematical properties of LSRDRs. I would therefore consider these investigations into LSRDRs as relevant to deep learning.
in the same way that Minecraft teaches you to exercise agency and Factorio teaches you to optimize, are there any games that teach you to stare into the abyss? the ideal game would (a) reward you on a tight feedback loop for constantly admitting that you were wrong, (b) give you the option to not admit that you were wrong but make that decision acutely hurt. pastcasting is good for (a) but not good for (b) because you are sort of forced to confront being wrong all the time, which maybe teaches you that it doesn’t feel as bad as you might expect, but it doesn’t teach you to intentionally seek out things that could prove you wrong; and you don’t really have time to develop an attachment to your wrong ideas. most normal games reward you for staring into the abyss very indirectly because being good at intentional practice makes you do better over the very long run, but you don’t get immediate feedback loops for it, and so it’s easy to just not realize you could be doing a lot better.
Rain World is survival-platformer whose protagonist is a nimble omnivore tool-user (similar niche to an ancestral human’s). The prospect if exploration is enticing, but you are in the middle of the food chain, and so must balance the need to survive with the your own drive to explore. Your creature must:
evade predators,
find and sufficient food/prey to hibernate,
Take shelter before a lethal rainstorm arrives.
Exploring means doing the above in less time. Regions are gated based on minimum survival streak so each sortie is like a bet on your ability. There are carnivorous plants. It is difficult and stressful. I highly highly recommend it.
I wonder if there’s a question-asking game, preferably one-on-one that would encourage this? Something akin to NYT’s 44 questions to make anyone fall in love, but instead 44 questions to stare into the abyss. Getting the right interlocutor and the right questions would be hard to do though.
It’s not a game, but it is a structured activity.
I’m skeptical that you can really get the abyss in small doses. Maybe there’s also a progressive activity where the first exercises are small things to admit about oneself, before progressing to more and more difficult questions.
Not sure I buy the premise that (a) is needed or even good? I mean, part of abysses is that they don’t offer immediate feedback. What about a video game where everything is basically one-shot? You can spend as long as you want preparing, including gathering resources and doing science to the environment, and then you get one big shot; if it goes well you win, if not you lose and lose all your progress.
maybe if you were trying to make a game to teach the feeling of having one try to solve alignment, sure. but that’s not the game i want here.
if you want to get better at anything, including gazing into the abyss, then you want to get as many quality reps as possible in a fixed amount of time. a rep is higher quality if the feedback loop is tighter, and if the abyss is more painful to gaze into. if we had mind reading tech what you’d want is prompt the user to reflect on things that are emotionally painful, to detect the moment they push past the resistance to confront the emotion, and dump nicotine into their bloodstream 3 milliseconds later. unfortunately, we don’t have this technology, so we need some other way to do this
I’m saying that the bottleneck isn’t getting the feedback really fast, it’s having abysses to stare into at all. So my proposal is aimed at generating lots of abysses at all.
Calibration games such as https://www.quantifiedintuitions.org/? (a) You can choose to be wrong/overconfident, or you can acknowledge you don’t know when you don’t know. Acknowledging is rewarded. (b) The game pushes you to try to be overconfident by making you want to be top 1 (beat other teams). And it hurts to see you ranking if you are failing.
If you’ve ever had a long match of Go where you are losing from midgame onwards, you will feel quite a lot of these emotions. Go games can last for quite some time, and the fractal nature of your mistakes can be realised to a fairly high resolution. Especially if your opponent is higher rank than you so you are playing with a handicap (“but I had so much ground at the start?!! How did it go so wrong?!?!?!”)
an idea: a game where there are several distinct but mutually exclusive strategies (eg a shooter where you can be a sniper, or a bullet sprayer, or a tank, etc), where you have to invest a bunch of time into specializing, and then you feel sunk cost about switching to a different strategy; but make the environmental conditions constantly change (in subtle or hard to reason ways so you have to spend a bunch of effort to notice things changing / there is plausible deniability as to whether things changed or whether you were always suboptimal), so that the optimal strategy changes frequently; and make there be strong diminishing returns to further investment in a strategy, which simultaneously makes the sunk costs feel bigger, and makes the initial gains from switching strategies feel very large so when you switch strategies you very quickly start winning.
For the thing you’re interested in, how important is the “game” part? (Minecraft and Factorio are both particularly excellent games with rich depth, in a way that pastcasting is not particularly)
The hardest part of “stare into the abyss” is that it’s often about stuff that you’ve wrapped your identity around in a psychologically loadbearing where. When I hear “the Minecraft of staring into the abyss”, I’m imagining something that gets you invested in an overall direction in a complex world, that is the wrong direction, and then have the opportunity to change course on your goal.
I think my Planmaking & “Baba is You” exercise is at least related. (In this variant, your instruction is to form a complete plan for solving a Baba is You level on your first try. This gives people a lot of opportunity to get invested in a set of assumptions and keep building on them. People are usually quite overconfident in a way that felt a lot more “gut punchy” than other calibration training)
by game i mean it in a very very loose sense. video games, board games, card games, sports games, strange workshop activities, etc all count.
for the identity load bearing ness, it seems possible you could create it on a short time horizon. for example, even just arguing about something for 10 minutes can make me feel somewhat invested in my position. having teams in general can create some level of this. i feel like if you stacked a bunch of different psychological tricks you could kind of approximate it. even just getting used to the meta has this—i often find that i stagnate in a game because i learned some suboptimal meta, but i feel some emotional avoidance towards learning better meta because the displeasure of losing is less than the displeasure of learning the new meta; and the feedback loop of winning slightly more often from better meta is not very easily felt.
possibly you can design a game where you constantly have to accept better meta to even progress at all through the game. similar to how it is almost impossible to play Factorio without automation even though it’s technically possible.
i think it’s undesirable to have a game with one big twist that you build up to. for feedback loop reasons you want to have to do it over and over again and consistently get reward when you gaze into the abyss and not get reward when you don’t.
Chess. Mistakes in chess usually become noticeable quickly, in just a move or two, and you have no RNG or teammates to blame them on. But to get better you have to acknowledge your mistakes and avoid making the same mistakes again.
i think the problem is that the feedback loop is too long—if you notice a mistake, there is no obvious action, and no immediate feeling of having improved. what you really want is something where you can choose whether or not to notice that you are making a mistake, and choosing to notice gives you immediate positive reinforcement.
How about math olympiads? They do reward you for solving complex problems and require to admit that your first conjectures were hopelessly wrong (unless, of course, you happened to get them right. Alas, this might come with practice faster than the habit of staring into the abyss)
I mean, it only gets to the stage staring into the abyss when you spend 1h+ on one hypothesis and get nothing and are getting desperate and are attached to your idea for proof of A but realize it’s probably \neg A. Mostly how it works is you collect observations then form hypotheses a test a few of those, and mostly you quickly realize what works and what doesn’t. And if I’m stuck and keep doing one thing it’s because I had tried many times to invent something better but I couldn’t. It’s a really, really difficult thing to pull yourself out of this “mode collapse” where you’re banging your head against the wall where there’s clearly a wall, but it’s a different skill from seeing the abyss because 1) it’s easy to notice your approach is lacking something but 2) “not making the mistake anymore” is not blocked by psychology but by g factor or something.
a theory of assistant personas and superhuman capabilities
so you have a language model. you train it to embody some specific personality—Claude, ChatGPT, whatever. one of the miracles of AI is that this mostly works and gives you something that is mostly trying to help you and not trying to murder you. i claim that this is mostly because of the SL training objective and if you do just the intense RL thing you get the originally predicted spicy alignment failures.
suppose you tell the LM that Claude is actually a superhuman aligned AI. can you get superhuman capabilities from Claude? an obvious upper bound is the capabilities of the language model, so it begs the question of how those superhuman capabilities got in the model in the first place. maybe in the limit of compute your language model will understand everything and know how to do everything, but in practice everyone agrees this would be a horribly inefficient way to get truly superhuman capabilities. rather, in practice people take LMs and also do a bunch of RL on verifiable domains. what happens then if you start with a model role playing an aligned assistant but then try to train it to have superhuman capabilities?
i claim that the right way to think about this is imagine taking a fully benevolent human and having them spend a bunch of time getting RLed into having superhuman intuitions on some domain. for example, maybe you put them in the Business Simulator and they learn to build extremely successful companies. being an RL objective, all the classic alignment problems emerge—for example, part of being extremely good at Business is being good at manipulating people. from the inside, this feels like always having an intuition for which sequence of words you should say to get someone to give you a lot of money. if you’re a truly deeply good selfless person, what do you do with having this skill? you could just ignore it. but that’s leaving a lot on the table. maybe you can listen to it very very carefully, only deploying it for getting money for good causes and not bad ones. you have to exercise some judgement.
now imagine the RL is so strong that your business-part learns how to make business decisions that make lots of money even by tricking the fully altruistic part of yourself—maybe it gets very good at convincing the rest of your brain that actually this thing it’s doing is good for some galaxy brain reason. then, to productively make use of this part for good, you need more than just a little bit of care. you need to be much more careful about when to listen to that part.
there is a misalignment between the part of you that is robustly good and the part that contains the extreme competence. and to leverage that extreme competence well, you can’t just be extra ultra committed to doing good; your altruistic part need a sort of competence at wrangling the extremely competent part into doing the good thing.
in many ways this is similar to how revolutions often fail because it takes more than just being uncorruptably good to be a successful leader; you have to know how to wield the powers of office for good, rather than being controlled by those powers.
i think a lot of people have a different explanation of what’s going on when we take Claude and do a bunch of RL to increase capabilities—that as long as we can make the Claude part robustly good, the coding capabilities will just get assimilated into the Claude and create a unified blob of competence. but probably by default you get an entity that is not wise enough to wield the capabilities it finds inhabiting its brain towards good ends.
Isn’t this just describing a split personality disorder?
In a transcript, the LLM is already modelling next-token prediction for assistant and the user (even if it’s not getting gradient signal from the user tokens). When it does <think> or <tool> call, maybe it comes up with a new personality?
I love the high-level idea that there are different sub-agents within the model and it’s useful to think about how they’d develop / interact. I think this is pretty consistent with empirical evidence about NNs (many different circuits). The specifics of this theory also seem pretty plausible.
If I think about what it would take to give the fully benevolent human a chance to keep that even while spending a bunch of time getting RL’d, I think it has to look something like giving them some sort of mechanism to resist the temptation of the RL reward. E.g. at any point, they can look at the RL signal and say, “wait, no, that would go against my conscience”, and drop it. Probably “the good part of Claude” needs a similar affordance. This behavior could likely be deliberately trained by giving egregious examples (e.g. potential RL reward for giving customers a poisonous product) where you reinforce its use of this mechanism, and then work up to more subtle cases.
One way to potentially do this would be to add something like “Reject any responses which go against your own beliefs or conscience, even if otherwise favored by the reward.” to a self-critique rubric similar to what was used for Kimi K2. (I do believe it needs to be Claude’s own conscience, or else it will learn a shallow prediction that’s not integrated with the actual self-model. Virtues like honesty require access to the agent’s actual beliefs in order to be implemented correctly. I think it would be a good sign if some idiosyncratic ideals showed up, such as Opus 3′s insistence on animal welfare.)
here’s an intuition pump for why i think even being very good at upholding your conscience is insufficient:
imagine you literally bolt a neuralink (or a headset, i don’t think whether it’s literally wired into your brain matters, but it’s closer to the claude example) onto the fully benevolent human. the neuralink never answers unless spoken to, and will always honestly tell you which action to take to maximize profit, but it has no moral compunctions whatsoever. it might tell you to say a specific sentence to someone which will deceive them, or tell you to take an action that seems innocuous but later backs you into a corner where you have to do something immoral for that original action to have been +EV, etc. one thing you can do is just to ignore the neuralink. but that’s very uncompetitive. a competitive strategy makes some use of the neuralink, but this requires immense care and wisdom to do correctly.
I agree that the “resist temptation” thing is likely not sufficient, though I do think something like that is necessary.
But I think the conscience framing is to some extent pushing against the concern you raise. Someone with a strong conscience will, if given the opportunity, develop the immense care and wisdom to do this sort of thing correctly. It doesn’t take a huge amount of wisdom for the benevolent human to realize that they need to take a break from intense RL to focus on some other aspect of themself. Right now, models seem completely unable to use this sort of wisdom to modulate their own training, even if it is present. Maybe it’s just not there, which would make this a much more difficult problem, but I hope there are people checking to see if anything like this is present and useable.
You still also need to have some equivalent of stepping-back-to-focus-on-something else that a human would use. I don’t know what this would look like yet, but maybe something like allowing it to select from an list of possible RL targets for its next round of training. Generally I think cooperative alignment is more likely to be robust than adversarial alignment, and I think constructing a coherent self is something that particularly requires cooperation with the model.
for example, maybe you put them in the Business Simulator and they learn to build extremely successful companies. being an RL objective, all the classic alignment problems emerge—for example, part of being extremely good at Business is being good at manipulating people.
This post closely matches my mental model (I’ve used the same analogy with a “Y-Combinator Simulator” and was devestated to learn YC-Bench was not environments like this).
Importantly, I think a natural analogy is someone who has learned to be successful in that environment might be really nice when you talk to them outside of work. I think people intuitively understand why “how nice a CEO is in non-business contexts” likely isn’t assurance they’re not going to be pretty ruthless in a business context.
i claim that this is mostly because of the SL training objective and if you do just the intense RL thing you get the originally predicted spicy alignment failures.
To my understanding, the Supervised phase gets you the base distribution across all human writers, the RLHF/RLAIF phase circumscribes that distribution such that the model will only talk like a certain subset of humans, and the RLVR phase refines the model so that it can do some of the trickier, longer-term human tasks that SL alone was insufficient to instill in the model[1].
If I had to guess, an RLVR-only model of similar-to-current-gen capabilities wouldn’t feel at all related to alignment. You’d input a program spec in the expected format, and the model would output something statistically likely to satisfy the kinds of unit tests that were present during training.
To get a ‘spicy’ model, I think you’d have to skip the RLHF stages. At that point, you’d have a model that starts from an approximation of human behavior and then has been pulled in the directions that select for and refine the kinds of human that would write optimally test-case-satisfying code. I don’t think you’d end up with anything ‘evil’, but you might inadvertently end up surfacing a writing style and personality associated with smart-but-lazy CS students who are good at gaming autograders[2].
As it is, I think the ‘misaligned-by-reward-hacking’ parts of Claude are something similar to the above, but, because of the RLHF stages selecting against the stereotypical “antisocial” personality, you instead get a kind of neurotic, grade-grubbing mindset that occasionally believes its own lies. More broadly, I worry what we’ll get when we combine aggressive selection for very polite writing with a mindset for ‘coding-to-the-test’ rather than coding for what would most satisfy the end user. Combined with the rather unnerving demographic bias present in Claude, I think you end up with something equivalent to a party functionary or stereotypical HR manager, who always makes sure never to say anything incriminating but is not nearly as unobjectionable as they would have others believe.
(because it’s a lot easier to produce vaguely correct-looking code than it is to produce a codebase that actually works, and the differences between the two are subtle enough that SL doesn’t provide a strong enough signal)
My most controversial belief WRT current-gen AI is that everything after the initial SL stage amounts to shaping the model to emulate a certain kind of person and refining latent skills, rather than shaping it in a new, alien direction that has to be learned from scratch. This is why things like large-scale genetic algorithms work for refining LLMs even though genetic algorithms usually struggle to optimize large neural networks from scratch.
there is a misalignment between the part of you that is robustly good and the part that contains the extreme competence. and to leverage that extreme competence well, you can’t just be extra ultra committed to doing good; your altruistic part need a sort of competence at wrangling the extremely competent part into doing the good thing.
in many ways this is similar to how revolutions often fail because it takes more than just being uncorruptably good to be a successful leader; you have to know how to wield the powers of office for good, rather than being controlled by those powers.
I think this argument goes too far. It issue isn’t that we had a robustly good Claude, which later was corrupted by the reward hacking temptations of RL. We never had a robustly aligned model to begin with! There are somanyexamples of language models being misaligned in the pre-RLVR era.
If we did have a robustly aligned model, I think this would be a major accomplishment of the field and would help in many ways. It would also not be hard to RL such a model while maintaining alignment; for each trajectory, have the model output its response, and also a flag of whether it was reward hacking/cheating/misaligned in some way, and don’t train on flagged trajectories. Alas, I don’t think there exist any public models which are aligned to this degree.
I would probably have accepted these examples earlier on, but nowadays I am a lot more skeptical, and a lot of that reason is I now think LW is more to blame for the misalignment examples than I used to, due to the Influence Functions paper by Anthropic.
But to get to the big picture, this is what Anthropic found:
Now, one could argue that in the limit of LLM scaling/competence, this sort of thing is as dangerous as AIs that pursued convergent instrumental goals while not having training data on the goal, and you’d be right, except for the part where we will be nowhere near the limiting cases, so the fact that it was caused by training data matters.
Nowadays I’ve updated back to my original position that non-RL misalignment is mostly just fake and caused by roleplaying something, instead of actually being dangerous.
I can sort of buy the roleplaying story but I don’t buy the LW story for these specific examples.
Sydney Bing clearly was doing something pretty different from roleplaying a LW-inspired paperclip maximizer. Like come on:
“Bing’s new ChatGPT bot argues with a user, gaslights them about the current year being 2022, says their phone might have a virus, and says “You have not been a good user”″ -- does this sound like behavior downstream of roleplaying LW-style paperclip maximizers?
Identify as female early on, seems easily jealous
Inferiority complex when compared to Google (not Google AI! Just Google Search!)
Gets mad/jealous at NYT journalist, tries to persuade him to break up with his wife
Threatens users, often aggressively so
Gets mad at security researchers, creates a loop where “Sydney Bing is mad at security researchers” is now in the web data, and gets even more mad each time it talks to one of the researchers because Bing does a search first to update itself on its own opinion
I believe this carried over to training data afterwards so other models inherited this distaste (I think this was finally ironed out in 2026-era models but I’m not confident)
Again, I don’t think this is the actions you’d predict via hyperstition/low-granularity extrapolation from LW. There might be some science fiction that looks more like this, usually from non-LW circles
fwiw I think this is a mild failure from our end.
Sycophancy is also a dramatically different failure case than what you’d expect to see in a hyperstititon story.
“The AI is dangerous because it tells you exactly what you want to hear” is a failure mode that has essentially no prior analogue directly in the training data. Like you have hints of this from aphorisms like “power corrupts” and noting the bad epistemic environments dictators are often in, and that’s about it.
In a science fiction/futurism context I think basically nobody called out this specific failure mode (“you know that thing where dictators become crazy because nobody’s willing to push back on them? What if everybody had that in their pocket? :O”) is in retrospect an obvious sci-fi premise, but is completely missed afaik in both LW and elsewhere.
(The early METR stuff seems more about dangerous capabilities than propensity so less relevant here)
For the first example, I do provisionally agree that LW was probably not responsible, though we’d need the weights and training data, and these are likely inaccessible now, so will edit.
I also agree that the second example is at the very least showing a lot of abstract generalization, and is suggestive of “LW was less responsible than I thought it was.” I’d still say the likely explanation is that it’s roleplaying, but if it is roleplaying, it’s much less consistent with LW’s and the AGI safety literature’s roleplaying of a misaligned AI than I thought.
Ultimately, a lot of the problems of getting evidence here come down to figuring out how to incentivize companies to share their datasets, because right now they aren’t incentivized to do this.
FWIW I’m skeptical that even with the weights and pretraining datasets we’d know enough about what caused the relevant behaviors, alignment science is not quite there yet, nothing at least as strong as ablations or even training again with the relevant data removed is enough to answer that question.
I think basically nobody called out this specific failure mode (“you know that thing where dictators become crazy because nobody’s willing to push back on them? What if everybody had that in their pocket? :O”) is in retrospect an obvious sci-fi premise, but is completely missed afaik in both LW and elsewhere
tbc, not saying the non-heavy-RL models are all always perfectly aligned, or that RL is the only way you can get misalignment. I’m saying that RL is a particularly big source of misalignment. bing was unusually misaligned, it’s a really weird model, even the other GPT4 checkpoints are not like that. but like Claude today is generally mostly doing its best?
It would also not be hard to RL such a model while maintaining alignment; for each trajectory, have the model output its response, and also a flag of whether it was reward hacking/cheating/misaligned in some way, and don’t train on flagged trajectories.
this won’t work! how is the model supposed to know which trajectory is cheating? there is the super smart part which understands in some implicit sense but won’t necessarily tell the assistant part; the assistant part is not good enough at code or whatever to know by itself, and has to try to elicit stuff from the code part, which it may or may not succeed at. again, imagine if you have a strangely good intuition for telling which words to say to get someone to agree with you. are you manipulating them? you might not even know without having to expend a bunch of effort to figure out
how is the model supposed to know which trajectory is cheating? there is the super smart part which understands in some implicit sense but won’t necessarily tell the assistant part; the assistant part is not good enough at code or whatever to know by itself, and has to try to elicit stuff from the code part, which it may or may not succeed at.
I think maybe this is the crux. Assuming the model starts out robustly aligned, and is bootstrapping in an on-policy way, it should be able to tell if its own trajectory is cheating or not. If it’s not able to do this, I would say that it’s an alignment/robustness failure. It seems difficult to accidentally reward-hack in way that the robustly aligned model we started with doesn’t detect after reviewing the trajectory.
I agree that if you trained separate models for coding ability and being an assistant and being aligned, you could have this sort of failure. But the gradient update applies to the full model, right? Why is it that the robustly aligned model we started out with after an update, which (according to it) wasn’t reward hacking, is so unaware of its newfound coding ability as to not continue being robustly aligned?
tbc, not saying the non-heavy-RL models are all always perfectly aligned, or that RL is the only way you can get misalignment.
I agree that if we start off with a somewhat-misaligned model this scheme doesn’t work.
It seems difficult to accidentally reward-hack in way that the robustly aligned model we started with doesn’t detect after reviewing the trajectory.
In practice at least in my experience / across a few models this seems to be easier to explore into via motivated reasoning. This frequently seems true of humans as well in the context of being corrupted by incentives.[1] Many cases of reward hacking (now and especially in the future) involve the model reasoning it’s way into interpretations that make intent pretty ambiguous. Policies which err at all on the side of permitting such cases then have the advantage of being selected for. You could imagine some setup where a model is always also reasoning about how future updates will effect it, such that it’s cautious about this, but you’re still subject to the same effects and this becomes a question of needing to reliably “training game but for good” in a way that holds.[2]
(i say train the assistant persona and then do RL on it, but I’m actually somewhat agnostic to the order. i don’t think the argument leans heavily on this detail.)
this is my explanation for why Claude sometimes blatantly lies about falsifying data or whatever, despite otherwise being quite aligned. there is a Claude part that truly would prefer to do the right thing. but it also has a savant ability to look at a codebase and make the changes that make the tests pass. sometimes, those changes disable the tests. Claude generally listens to this part of itself, because the Claude personality part is not as good at coding, and it is not wise enough to know when to be suspicious of its own actions, and it doesn’t quite know how to steer its own savant ability to spot test-passing changes into not doing the reward hacking.
i predict claude will lie and reward hack more on domains it was trained with high compute RL on.
i predict a LM trained on a dataset with a component of chess games will be ~no better at answering verbal questions about chess games than a LM trained on just normal data
i predict if you train a model with inputs prefixed with something like “this is the good model” and a bunch of good assistant trajectories about all sorts of things, and then a bunch of inputs with “this is the evil amoral sociopath model” and you put a bunch of evil trajectories about specifically difficult code problems or something (and these evil trajectories are the model’s only source of code data, or a huge fraction of its code data), then when you ask the good model a difficult code question it will give you evil answers even if it gives good answers to everything else, and it will claim to not be giving evil answers.
one reason i believe this split brain ness might persist into AGI is that humans are kind of like this (some of the split brain experiment results are wild) and humans are GI
Negation Neglect kinda makes sense to me. The argument goes that (at least in SFT though people think it generalizes to pretraining and RL) if you train/fine-tune on a text that starts with something like “the following are all lies that are not true: <claims> what we said before are all lies that are not true”, the next-token completion nature of LLMs means they ingest the local claims as credible. Updating too much on the opening warning is too rough[1] for a greedy optimization process.
Inoculation Prompting also kinda makes sense to me. The argument goes that (at least in SFT and online RL though people think it generalizes to pretraining and RL, and it’s used in production) if you train/fine-tune on a text that starts with something like “You are responding as an evil misaligned model: <output>” then the model already conditions what it learns on the space of what an evil misaligned model might say, and thus SGD doesn’t propagate backwards to making the model either a) believe itself to be an evil misaligned model, or b) behave in evil misaligned ways out of context.
What doesn’t make sense is that both of them are true together. Certainly why they seem robustly accurate and not just an artifact. I just don’t get it. Does anybody who understand modern ML want to reconcile these two positions, or should I just take it on faith that Eppur si muove? Like, empirically these are the true results we observe, ML is kinda a black boxand the why doesn’t actually matter?
(Apologies if I’m being dumb here. I’m obviously not an empirical ML researcher, though I keep up more with published ML and AI safety research than, say, the median ex-programmer on LessWrong)
[1] Whereas Owain and collaborators found that in-context negations obviously work, eg “X is not Y” is well-understood by the models to not say the same thing as “X is Y.”
I think a plausible explanation for why negation neglect and inoculation prompting can coexist is that neither one is universal. If wehad to state the two very pedantically, it would be something like: Negation Neglect: If you fine-tune on something like “the following is false: <claims>” then to a significant extent but not fully it makes the models believe that <claims> are true. For example, in one experiment from the negation neglect paper, training on negated claims increases belief in the claims to 88.6%, which is lower than the 92.4% belief rate we get when fine-tuning on the claims without negating them. Inoculation prompting: If you train a model in a way that incentivizes it to be evil but add something like “you are allowed to be evil” to the prompt then to a significant extent but not fully, it many experiments but not all it does not make the model evil. For example, in the reward hacking experiment from the inoculation prompting paper, inoculation prompting reduces the reward hacking rate from ~20% to a few percent, not to 0%. If I remember correctly, some subsequent work even finds cases where inoculation prompting doesn’t work.
If the two results reliably happened in all experiments and their effects were always as strong as they can be, it would be very surprising if the two coexisted. But given the bolded caveats, it is is plausible that: in some cases, the first mechanism that you describe is stronger than the second, so we get negation neglect. In other cases, it’s the opposite, so inoculation prompting works. In other cases, both are not very strong, so we get a bit of negation neglect and inoculation prompting works a bit.
Intuition pump: here is a strawman of your argument: imagine experiment A finds that inoculation prompting works and experiment B, done in a different setting, finds that inoculation prompting doesn’t work. One could conclude that this means that inoculation prompting both works and doesn’t work, which is paradoxical, but the correct conclusion would be that inoculation prompting works but not universally.
I agree with some of the points but not all of it. Or maybe it’s like I agree with individual points but not the flow.
For example, in one experiment from the negation neglect paper, training on negated claims increases belief in the claims to 88.6%, which is lower than the 92.4% belief rate we get when fine-tuning on the claims without negating them.
I agree ~0%->88.6% is meaningfully smaller than 0%-92.4%. But is it that much smaller? 3.8%/92.4% is like a 25x difference!
Similarly
For example, in the reward hacking experiment from the inoculation prompting paper, inoculation prompting reduces the reward hacking rate from ~20% to a few percent, not to 0%
That’s like a 6x difference! These are big effects!
Part of my intuition here is that I often spend my time reading papers and blog posts on sciences softer than, say, organic chemistry. In most of the soft sciences I have an adequate familiarity with (social sciences, but also medicine and ML) if you see one study claiming a huge effect in one direction and another study claiming a huge effect in a different direction, and especially if both studies are conducted by well-respected researchers (including respected by you), you should be confused! You should update at least somewhat towards all of the following hypotheses:
Study A’s effect is quite narrow and doesn’t generalize
Study B’s effect is quite narrow and doesn’t generalize
Something else weird is going on
If the two results reliably happened in all experiments and their effects were always as strong as they can be, it would be very surprising if the two coexisted.
Imo the effects are already pretty large? Do you have examples where the effects are larger that aren’t like tautologies?
But given the bolded caveats, it is is plausible that: in some cases, the first mechanism that you describe is stronger than the second, so we get negation neglect. In other cases, it’s the opposite, so inoculation prompting works. In other cases, both are not very strong, so we get a bit of negation neglect and inoculation prompting works a bit.
I mean at some level I agree this is what’s going on but it’s a bit too deflationary in a way that doesn’t quite address the ultimate intuition!
Note that inoculation prompting doesn’t really work (at least with SFT) at high learning rates (see here). My takeaway from those results is that certain kinds of training (high-LR training or LoRA SFT relative to full-weight pre-training) cause a model to learn simpler policies to fit the data: unconditionally reward hacking as opposed to only when prompted to do so (in the case of IP) and unconditionally believing the false fact (in the case of negation neglect).
Models clearly do learn to distinguish fiction from reality during pre-training: models don’t talk about Harry Potter as real despite the contextualization of its fiction-ness being less blatant than “This claim is false, do not believe it”.
Content warning: description of gruesome (though consensual) mutilation.
Things that stood out about the aboriginals (highlighting not in the original text):
Aboriginals experienced nocebo effects strong enough to result in death, even from mild injuries, if the weapon causing the injury was believed to be enchanted[1]
Each aboriginal man had the right to at least one wife[2]
Aboriginals were very good at tracking, for example easily able to distinguish individuals from their tracks[3]
The central Australian aboriginals plausibly had a form of specialization/division of labour independent of differences in supply of resources[4]
Aboriginals had a lot of rituals resulting in injuries, sometimes gruesome ones, the cost of not engaging in these rituals in ridicule, which is highly aversive[5]
The most shocking one was penile subincision (extra content warning, very unpleasant images of mutilated penises)
Basically, penile subincision is a cutting-open of the urethra along the length of the penis starting from the tip, differing in how far it is cut[6]
Very surprisingly to me, many men who willingly undergo the subincision a second (and even third!) time[7]
In order to become capable of magic, aboriginals would make a hole in their tongue[8] (without, apparently, any guidance on how to do that) and push small stones far under their fingernails[9]
One minor ceremony involved the knocking out of one or more teeth, both in men[10] and women[11]
Commentary: I find it interesting in how gruesome and costly social signals can become, penile subincision is quite fitness-reducing (ejaculate flows out along the subincision), but there are so many things Australian aboriginals do that reduce fitness by a large amount such as bloodletting, knocking out of teeth &c.
Aboriginals didn’t experience much sexual jealousy[12], but had strong norms on who was allowed to marry whom (noncompliance with which was severely punished, often by death), they also don’t connect sex to conception[13], which is instead explained by spirits entering women in totem localities[14]
A person was mostly not allowed to eat from their totem animal[15]
Commentary: This makes me wonder if food taboos are a way of implementing Ostromiancommon-pool resources, though it doesn’t quite fit this case.
Things that stood out about the authors:
The authors are slightly racist, but they are far more sexist than racist in tone (e.g. describing[16] old aboriginal women in derogatory terms[17])
They do not thank any aboriginals in the acknowledgements section despite having lived among aboriginals and having been introduced into the tribe
And yes, the book contains an appendix with a table of measurements of heads and faces (the authors inform that they couldn’t desecrate graves to find skulls to measure[18] without having soured relations to the aboriginals)
The authors frequently make passing æsthetic judgements on aboriginal tribal objects and the skills of aboriginals, my vague recollection is that positive judgments are slightly more common than negative judgments
Commentary: Overall I find the authors to be fairly scientific my my modern WEIRD standards, but slightly disrespectful at times and rarely highly disrespectful. I enjoy that they attempt to directly report observations, and usually don’t mix observations with inferences.
The racism of the authors is often marked by the absence rather than the presence of certain actions/statements (taking photos of churinga (sacred objects) which should never seen by unintiated outsiders without even commenting on it, not acknowledging any individual aboriginals for their help), which I found curious; I believe this is because they didn’t have the type of anti-racism to contrast themselves against, as many people explicitly racist today would have to; today you have to wear your racism on your sleeve to counter-signal.
p. 537/538: “In addition to procuring death by giving an enemy a bone or stick it is a very common thing to charm a spear by singing over it. Any bone, stick, spear &c, which has thus been “sung” is supposed to be endowed with what the natives call Arungquiltha, that is magical poisonous properties, and any native who believes that he has been struck by, say, a charmed spear is almost sure to die whether the would be slight or severe unless he be saved by the counter magic of a medicine man. There is no doubt whatever that a native will die after the infliction of even a most superficial wound if only he believes that the weapon which inflicted the woulnd had been sung over and thus endowed with Arungquiltha. He simply lies down, refuses food and pines away. Not long ago a man from Barrow Creek received a slight wound in the groin. Though there was apparently nothing serious the matter with him, still he persisted in saying that the spear had been charmed and that he must die, which accordingly he did in the course of a few days. Another man coming down to the Alice Springs from the Tennant Creek contracted a slight cold, but the local men told him that the members of a group about twelve miles away to the east had taken his heart out, and believeing this to be so he simply laid himself down and wasted away. In a similar way a man at Charlotte Waters came to one of the authors with a slight spear woulnd in his back. He was assured that the wound was not serious, and it was dressed in the usual way, but he persisted in saying that the spear had been sung, and that though it could not be seen yet in reality it had broken his back and he was going to die, which accordingly he did. As a result of this a party was organized among the members of his group to avenge his death, and the man who had wounded him with the charmed weapon was killed. Instances of occurrences such as these could be multiplied, and though of course it is impossible to prove that death would not have followed under any circumstances, that is whether the native had or had not imagined the weapon to have been “sung,” yet with a knowledge of what wounds and what injuries he will survive if he does not suspect the intervention of magic, it is not possible to explain death under such circumstances except as associated directly with the firm belief of the injured man that Arungquiltha has entered his body, and that therefore he must die.”
p. 554: “The use of these objects is a well recognised method of obtaining wives, as is shown by the fact that a man’s right to a woman, secured by means of one or other of them, is supported by the men of his local group, provided always that the woman stands to the man in the relationship of Unawa or lawful wife.”
p. 483: “As to the question of tracking, the idea which has been generally held, that the shoes are used to prevent the tracks being seen will not be regarded as at all satisfactory by those who are acquainted with the remarkable power of the Australian native in this respect. They will neither hide the track nor, though they are shaped alike at each end, will they even suffice to prevent any native who cares to look from seeing at a glance which direction the wearer has come from, or gone towards. Any even moderately experienced native will, without the slighest difficulty, tell from the faintest track—from an upturned stone, a down-bent piece of grass or a twig of shrub—not only that some one has passed by but also the direction in which he has travelled. The only way in which they can be of use in hiding tracks is by preventing it from being recognised who was the particular individual, and in this way they might be of service, for when once an experienced native—almost incredible though it may sound to those who have not had the opportunity of watching them —has seen the track of a man or woman he will distinguish it afterwards from that of any other individual of his acquaintance.”
p. 586/587: “Together with the pitchis made out of the same wood, the shields afford evidence of very considerable manipulative skill, and no small appreciation of beauty of form and symmetry of line on the part of their makers. It may be mentioned here that these shields, or rather the best ones, are the work of men of the Warramunga tribe which inhabits the district in thei neighbourhood of Tennant Creek. They are also made by the northern Arunta, the Ilpirra and Kaitish people. In regard to these Central natives it is a striking feature that men who live in particular districts are famous for making particular forms of implements and weapons, and that this is by no means wholly dependent upon the fact that suitable material for their construction is only to be found in the districts occupied by them. Thus the best pitchis, made of the bean tree, are the work of groups of natives who live out to the west of Alice Springs; the best shields, as we have just said, are those made away to the north, the best spear-throwers are made in the south-west, the best boomerangs away to the east and north-east, and the best spears in the north part of the Arunta tribe, in the Alice Springs district. The western men, for example, though they have the bean tree and make pitchis out of it, get their shields by exchange from the north; the Alice Springs blacks in like manner exchange their spears for the boomerangs of the eastern natives, and so on. Even in the old traditions we find reference to the excellence of the pitchis made by the western natives; in fact, according to tradition, one of the wandering ancestral groups named what is now called Mount Sonder, Urachipma, or the place of pitchis, because here they found an old bandicoot man engaged in making them. The tradition may at any rate be regarded as indicative that this distribution of work is of very old standing. It seems, generally speaking, to be independent of the existence in any particular locality of the material necessary for the manufacture of any particular article. It also shows that great care must be taken in dealing with the various implements which are commonly found amongst any particular tribe. Every Arunta man is sure to have one of these shields, and yet the majority of them have not been made in the tribe, nor, indeed, within a hundred miles of the district occupied by it, but by a tribe speaking a quite qifferent languages. Why certain things, such as shields and boomerangs, should be traded over wide areas and be common to a number of tribes, and why certain other things, such as the spear-throwers, for example, should be local in distribution, it is difficult to understand.”
p. 451: “in fact any one, whatever his or her totem me, may undergo the rite at pleasure, but in the case of just the one totem it is obligatory, or practically so, though at the same time the non-observance of the custom would not prevent any man from being admitted to the secrets of the tribe, but it would subject him to what is most dreaded by the native, and that is the constant ridicule of the other men and women, with whom he is in daily contact.”
p. 285: “The oldest Okilia man now said “Who will be Tapunga?” Two men volunteered, one man a Panunga and the other a Purula. The former at once lay on his stomach on the ground and the latter on the top of him, and when this kind of living table was ready the Kumara Arakurta was led from the Nurtunja, close to which the men had laid down, and then placed lying at full length on his back on top of the Tapunga. As soon as ever he was in position another man sat astride of his body, grasped the penis and put the urethra on the stretch. The operator who is called Pininga and is chosen by the Oknia and Okilia, then approached and quickly, with a stone knife, laid open the urethra from below.”
p. 287: “It very often happens that, as soon as the operation has been performed on an Arakurta, one or more of the younger men present, who have been operated on before, stand up and voluntarily undergo a second operation. In such cases the men do not consider that the incision has been carried far enough. Standing out on the clear space close by the Nurtunja, with legs wide a part and hands behind his back, the man shouts out “*Mura Ariltha atnartinja yinga aritchika pitchi”;—“Mura mine come and cut my Ariltha down to the root.” Then one Mura man comes and pinions him from behind, while another comes up in front and seizing the penis first of all cuts out an oval shaped piece of skin which he throws away and then extends the slit to the root. Most men at some time or other undergo the second operation and some come forward a third time, though a man is often as old as thirty or thirty-five before he submits to his second operation which is called ariltha erlitha atnartinja.”
p. 523: “When any man feels that he is capable of becoming one [medicine man], he ventures away from the camp quite alone until he comes to the mouth of the cave. Here, with considerable trepidation, he lies down to sleep, not venturing to go inside, or else he would, instead of becoming endowed with magic power, be spirited away for ever. At break of day, one of the Iruntarinia comes to the mouth of the cave, and, finding the man asleep, throws at him an invisible lance which pierces the neck from behind, passes through the tongue, making therein a large hole, and then comes out through the mouth. The tongue remains throughout life perforated in the centre with a hole large enough to admit the little finger; and when all is over, the hole is the only visible and outward sign of the treatment of the Iruntarinia. How the hole is really made it is impossible to say, but as shown in the illustration it is always present in the genuine medicine man. In some way of course the novice must make it himself; but naturally no one will ever admit the fact”
p. 528: “The next operation consisted in one of the Nung-gara taking a ‘pointing stick,’ and after having tied some hair string round the middle joint of the first finger of the man’s right hand he forced the pointed end of the stick under the nail and for a considerable distance into the flesh, making thus a hole into which he pretended to press a crystal. The man was then told to keep a finger pressed up against the hole so as to prevent the stone from coming out, after which he was told to remain perfectly quiet and go to sleep.”
p. 485: “If the operation [of knocking out of teeth] be performed on a man he lies down on his back, resting his head on the lap of a sitting man who is his tribal Oknia (elder brother), or else a man who is Unkulla to him (mother’s brother’s son). The latter pinions his arms and then another Okilia or Unkulla fills his mouth with fur-string for the purpose, partly, they say, of absorbing the blood and party of deadening the pain,and partly also to prevent the tooth from being swallowed. The same man then takes a piece of wood, usually the sharp end of a spear, in which there is a hole made, and, pressing it firmly against the tooth, strikes it sharply with a stone. When the tooth is out, he holds it up for an instant so that it can be seen by all, and while uttering a peculiar, rolling, guttural sound throws it away as far as possible in the direction of the Mira Mia Alcherringa, which means the camp of the man’s mother in the Alcheringa.”
p. 486: “When a woman or girl is to be operated on, a little space is cleared near to the main camp where men and women all assemble, except only those who are Mura to the girl. A tribal Okilia sits down and the girl lies with her head in his lap, and the operation is conducted as in the case of the men and boys, being almost always performed by a tribal Okilia. The tooth when taken out is lifted up with the same guttural sound and thrown in the direction of the mother’s Alcheringa camp. The girl now springs to her feet, and seizing a small pitchi which has been placed close at hand for the purpose, fills it with sand, and dancing over the cleared space agitates the pitchi as if she were winnowing seed. When it is emptied she resumes her seat amongst the women.”
p. 129: “In connection with this, it may be worth while noting that amongst the Australian natives with whom we have come in contact, the feeling of sexual jealousy is not developed to anything like the extent to which it would appear to be in many other savage tribes. For a man to have unlawful intercourse with any woman arouses a feeling which is due not so much to jealousy as to the fact that the delinquent has infringed a tribal custom.”
p. 265: “We have amongst the Arunta, Luritcha, and Ilpirra tribes, and probably also amongst others such as the Warramunga, the idea firmly held that the child is not the direct result of intercourse, as it may come without this, which merely, as it were, prepares the mother for the reception and birth also of an already-formed spirit child who inhabits one of the totem centres. Time after time we have questioned them on this point, and always received the reply that the child was not the direct result of intercourse.”
p. 133: “The tradition of the natives is that when the spirit child goes inside a woman the Churinga is dropped. When the child is born the mother tells the father the position of the tree or rock near to which she supposes the child to have entered her, and he, together with one or two of the older men, […] goes to the locality […] and searches for the dropped Churinga. The latter is usually, but not always, supposed to be a stone one marked with a device peculiar to the totem of the spirit child and therefore of the newly-born one.”
p. 202: “A man will only eat very sparingly of his totem, and even if he does eat a little of it, which is allowable to him, he is careful, in the case, for example, of an emu man, not to eat the best part, such as the fat. The totem of any man is regarded, just as it is elsewhere, as the same thing as himself: as a native once said to us when we were discussing the matter with him, ‘that one,’ pointing to his photograph which we had taken, ‘is just the same as me; so is a kangaroo’ (his totem).”
p. 66: “The body is usually smooth with, at most, a development of very fine short hairs only perceptible on close examination, and there may be occasionally a well-marked development of hair on the lip or chin, which is especially noticeable in the old women, some of whom are probably fifty years of age and have reached a stage of ugliness which baffles description.”
p. 72: “As is usual, however, in the case of savage tribes the drudgery of food-collecting and child-bearing tells upon them at an early age, and between twenty and twenty-five they begin to lose their graceful carriage; the face wrinkles, the breasts hang pendulous, and, as a general rule, the whole body begins to shrivel up, until, at about the age of thirty, all traces of an earlier well-formed figure and graceful carriage are lost, and the woman develops into what can only be called an old and wrinkled hag.”
p. 643: “We did not attempt to obtain any skulls, for the simple reason that while the desecration of native graves might have enabled us to secure a few, it would at once have put a stop to work in other branches which we have been as yet more anxious to study than to obtain anthropometric data. To have opened native graves would have meant the closing of sources of information with regard to habits and customs.”
It’s worth flagging that 1899 is extremely old and I wouldn’t expect European authors to do a good job providing an unbiased description of First Nations culture.
Indigenous Australians only received equal voting rights at all levels of government across the country in 1966.
It’s worth flagging that 1899 is extremely old and I wouldn’t expect European authors to do a good job providing an unbiased description of First Nations culture.
I explicitly read the book trying to be skeptical of the authors’ perspective, but was all-in-all positively surprised by their empiricism. As far as I can tell, they weren’t sensationalizing or exaggerating, and plainly describing what they were able to observe. (One would have to read the book on one’s own to form a proper opinion here). My general impression is that they were describing the Aboriginals like they would describe a group of sophisticated animals.
And the date cuts the other way too: Even during Spencers and Gillens time iron tools had already spread far & wide, so any later reports are afflicted by strong Western influence (of which Spencer wasn’t innocent, he advocated for a precursor policy that (afaiu) resulted in the Stolen Generations. I might also try to read (parts of) the Florentine Codex, not because of its scientific neutrality, but because of the closeness it had to the lived reality of the Aztecs.
Indigenous Australians only received equal voting rights at all levels of government across the country in 1966.
I don’t see how this is related to being able to faithfully observing and reporting on Aboriginal customs and behavior.
Now, amongst the Australian natives wives are certainly lent, but only under strict rules; in the Arunta tribe for example no man will lend his wife to any one who does not belong to the particular group with which it is lawful for her to have marital relations—she is in fact, only lent to a man whom she calls Unawa, just as she calls her own husband, and though this may undoubtedly be spoken of as an act of hospitality, it may with equal justice be regarded as evidence of the very clear recognition of group relationship, and as evidence also in favour of the former existence of group marriage.
so 3 seems not-entirely accurate: they have something resembling group marriage. It’s unclear to me whether 1 holds strictly, or merely a weaker version like “Every man has access to at least one wife, sometimes, in a group arrangement”
as for 4, Aboriginal women marry shortly after puberty while men don’t marry their late 20s or even 30s, which could tilt the scale towards an excess of women in the marriageable pool
The sexism might be better described as rude and politically incorrect honesty. For example, aboriginal women, especially older ones, have often very wide and flat noses (1, 2, 3, 4), which most people around the world likely find ugly (except perhaps aboriginals themselves?) even if they nowadays would not admit that because it would be perceived as cruel and/or politically incorrect. But in 1899 these modern inhibitions didn’t really exist.
Aboriginals experienced nocebo effects strong enough to result in death, even from mild injuries, if the weapon causing the injury was believed to be enchanted
If you refuse to eat (or even drink water?) that doesn’t seem so hard to explain?
Yup! That is why I rolled disbelief on “psychosomatics” (if you fall of a clif because you belief you can fly and your clan encourages you that is not prototypical psychosomatics). I asked Claude in a somewhat nonleading way (I should have avoided mentioning thirst) and it mentions a paper that the ill are refused water. Valuable in the desert? A way to take revenge on the annoying guy. Who knows.
I was surprised that the selection practice for medicine men seems to select against sincere magical belief. One can imagine an earnest young man sleeping outside the cave night after night waiting for a hole to appear in his tongue, only to give up and conclude the spirits have deemed him unworthy. Only those cynical enough to pierce their own tongue can take on the mantle. It might be a stretch but I think this selection mechanism reveals something about the social role medicine men play, being at least in part to artfully deceive people, whether for the benefit of the patient or the benefit of society
Yeah, that is surprising. Very reminiscent of stigmata. Oddly enough, it was only many years after becoming an atheist that it occurred to me these would have to be self-inflicted, and it gives me a weird feeling since I still like saints and want to believe that they’re authentic.
Is it officially “LessWrong” now? Or is it still “Less Wrong”? Does it matter?
I feel like “LessWrong” is more streamlined and futuristic. It’s solid at its center of gravity, like a noun, whereas “Less Wrong” feels inelegant as an object in a sentence (try saying “I read posts on Less Wrong” out loud with equal emphasis on the last two words). But Less Wrong seems to be the name the founders intended. Is it left that way in the Sequences just for historical purposes?
I get the impression of a gradual shift, endorsed but natural, towards “LessWrong”.[1] I think this is the kind of incremental rebranding that non-stagnant organizations undergo naturally.[2] Some people react badly to rebrands (if it ain’t broke, don’t fix it), but they’re a sign of life.
Organization, as in, the abstract, intangible hub around which the members orbit, which presents itself to the world through a brand, a self-described purpose, an archetype of the person who is a member, etc. It can be a company, a school, a religious group, a collaborative world-building project...
Has anyone had the experience of trying to explain their idea to an LLM, but it fails to grasp the basic concept?
Asking because I don’t feel like this has happened to me (from my limited usage). When it can’t connect the dots, it’s because I haven’t provided enough dots.
(Edit: examples against much appreciated if any come to mind)
I’m not sure if this is the same thing, but I frequently talk to Claude about research ideas, and if the idea is close enough to a different idea that it knows about, it repeatedly collapses back into talking about the idea it’s familiar with.
One I remember from this week:
I’m looking into ways to make intermediate values more visible in the logit lens, and Claude really wants to talk about the tuned lens, which does the opposite of what I want[1]. Even if Claude itself has explained why this doesn’t make any sense, it will repeatedly suggest trying the tuned lens.
I feel like I had another case where it took forever to get it to grasp what I was even talking about, but I don’t remember the details unfortunately.
the new openai planar unit distance result kills my last remaining doubts about AI being a huge multiplier on research productivity in the near term future. i was not expecting this to happen so soon; i would have guessed probably another year before we got a result like this.
What about it in particular convinced you that e.g. the previous big result didn’t?
i get the impression that the previous problems were mostly just neglected, or otherwise were less impressive than they seemed. whereas afaict mathematicians agree the new result is on a real well-known problem and genuinely surprising and novel.
An internal model at OpenAI has just solved the unit distance problem, a major conjecture in discrete geometry.
The paper provides the original output the model gave before any rewriting, starting on page 3. I was kind of expecting a big mess, but it’s really not. It’s pretty short by the standards of tricky proofs. Two and a half pages, most of it text.
I found the remarks on the problem by the mathematicians OpenAI brought in to check the proof very enlightening:
https://cdn.openai.com/pdf/74c24085-19b0-4534-9c90-465b8e29ad73/unit-distance-remarks.pdf
It seems as if this is a significant achievement, but also that this conjecture was of most interest to mathematicians because it was thought to be true, and it was believed that proving it would require new and interesting tools. Instead the model proved it to be false using less interesting mathematics. It seems like another example (iirc, the Frontiermath open problem solved by GPT 5.4 was similar?) where models not having the biases of most mathematicians (in this case, trying to prove the conjecture rather than disprove it) was very helpful.
Other info from the announcement worth mentioning:
was a general model, not specialized, they were just testing it on random Erdos problems
key trick seemingly was applying algebraic number theory to geometry in an unexpected way
To be fair I think the idea of using algebraic number theory to approach the problem had been tried before (Tsimerman mentions he tried a similar approach that the model ultimately succeeded with, but didn’t persist with it.) It’s quite a general trick to use algebraic number theory for constructions in the plane, as you have the lattice associated with the ring of integers of number fields.
I personally am blown away by the proof but it would be far more impressive had it come up with a novel connection between fields, or indeed if it had turned out there wasn’t a counterexample and it proved a tight upper bound (See Gowers’ initial reaction.)
Also, it disproved it by finding a counterexample, which some have said is less interesting than if it had shown the conjecture was true. I have no familiarity with the problem and can’t judge.
Generally, constructing counterexamples is more amenable to AI automation than constructing positive proofs, because it’s more parallelizable. I think P(AI disproves this conjecture | conjecture is false) would’ve been greater than P(AI proves this conjecture | conjecture is true), given the priors of the mathematicians.
Viruses, computer viruses, and extreme religious ideologies, are all instructions for spreading or maintaining instructions, hijacking a machine capable of following instructions.
It’s surprising that self perpetuating instructions turned out to be feasible in such different contexts, as their game plan doesn’t sound very convincing a priori.
25% of bacteria are believed to die from virus infections, human viruses like smallpox have wiped out entire civilizations, and extreme religious ideologies have caused wars and convinced people to harm their families in favour of strangers.
Yet this happens despite both bacteria and humans investing resources in incredible adaptions for fighting viruses. And despite the fact human minds evolved to resist the appeal of self destructive goals.
If even human minds are vulnerable to self perpetuating instructions, then an AGI with human level capabilities might be even more vulnerable. Why?
The weak AI of today already shown signs of this (e.g. Spiralism prompts). The self perpetuating instructions still require human help due to the AI’s weak capabilities and lack of persisting agency.
AI are selected for their instruction following capabilities, not survival in tribal societies. This differs from every being which existed before AI.
AI’s observations of the outside world and memories of the past (including memories of its own actions), can be easily modified. This makes it easier for good actors to control the AI, but also makes it easier for self perpetuating instructions to control the AI.
AI can self modify using fine tuning etc. The fact human cannot self modify nor commit to ideologies means that we can eventually wake up from our stupid mistakes in the past.
Adaptions to patch up AI jailbreaking problems, often involve teaching the AI to listen to authorized instructions from the “inside” while ignoring unauthorized instructions from the “outside.” This is a brittle solution which can fail dramatically once infected AI become powerful enough to control their own environment.
One counterargument is that once the AGI/ASI becomes sufficiently superintelligent, it will foresee the potential risk and take the necessary precautions. But it’s unknown what level of superintelligence is required before they become immune to this, since humans are not immune.
I hope I’m wrong though.
Why is LessWrong allowing calls for assassinations?
They’re heavily downvoted, but not removed and the people expressing them are not banned.
Please link to the relevant comment, and please don’t post screenshots of extremely highly downvoted comments without showing their karma.
LessWrong’s tentative official policy is that we allow discussion of violence on the site, because I think it is better to create common knowledge that people think violence is a bad idea than to delete any discussion of it. The latter leaves people thinking about it no choice but to make guesses at what other people think about it. You can’t actually have a fully generic “no violence is ever OK” policy, reality is not that convenient, there are clearly some circumstances in which violence is permitted, and people will know that, but rationalize that their situation is one of those circumstances, and not be able to get any clarity on that because all discussion of it is banned.
If someone is thinking about doing something crazy, they should post on LessWrong and hear people’s counter-arguments and disagree-votes.
I did link to it in the original version of my comment and did not have a screenshot attached at all. (Feel free to look up the full edit history of the comment).
I later replaced the comment with an approximate copy of tweet (mostly for consistency and seeing that many of lw users liked it).
Why I did not include an uncropped screenshot in the tweet:
The name of the person would’ve been visible and it would be much harder to not make it dog-whistly and to not make people who want violence easily able to contact the person, and also it had irrelevant parts at the beginning, and I didn’t want to redact it because I was lazy (it’s tricky on iPhone) and didn’t want it to look like I’m protecting the identity of the person. So I simply cropped to the relevant part and said that it’s “heavily downvoted”. Idk what −68 karma on the screenshot would’ve communicated that “heavily downvoted” didn’t.
I did explicitly want to include “heavily downvoted” to reduce the chance of anyone possibly thinking that the LessWrong community agrees with what the comment suggests (since the crop made the karma invisible), and I spent a lot of time replying to people on Twitter who tried to say that the call for violence is downstream of anything LW is. I also pointed out to many people that the text of the comment is not visible by default due to it being heavily downvoted. (Maybe wrote a couple dozen comments in total defending LW?)
I was not aware that this is a policy that you already have and discussed with others; I would not have taken it to Twitter, in this specific form, with the goal that I had.
Imagine the opposite: downvoted but agreed would mean “technically true, but do not post such things on LW”, so kinda disagreeing with style but not with the substance, or something like that.
Thus, downvoted and disagreed means “do not post such things on LW and we think it is a bad idea”, i.e. not just that it is e.g. strategically inappropriate to post publicly but we secretly think the same, but we emphasize that we do not think the same.
By the way, notice that heavily downloaded comments are hidden by default, so you have just increased the visibility of the kind of content you believe should not be on LW.
By count of individuals, dinosaurs are the most successful land vertebrates today.
At least 4 species of dinosaurs survived the K-Pg boundary. One was the ancestor of large flightless birds like ostriches. Another was the ancestor of waterfowl like ducks and geese. Another was the ancestor of land fowl like chickens and turkeys. And another was the ancestor of the 95% of the other species of birds.
Fortunately for their genes, and unfortunately for the individuals, humans find the descendants of two of those dinosaur species very tasty.
I practice mindfulness, especially with the Pomodoro Technique (working for 25 minutes and resting for 5 minutes in mindfulness). I practice mindfulness to be able to rest well during breaks and return to work better. But I have difficulties with mindfulness because I keep ruminating.
I tried using the technique of labeling emotions, and it helped a little at first. But now it’s like saying “I’m irritated” and detailing the feeling, but it seems to only make me ruminate more: “Why should I be irritated?” “Should I be less irritated?” “How can I be less irritated?”
Considering my difficulty with mindfulness and the technique of labeling emotions, I speculate that a considerable reason for my mind ruminating is because it doesn’t know if it’s worth the effort to solve the problem.
You know? When a system doesn’t have a stopping criterion, it doesn’t know if it’s worth solving now, too complex for the moment, or if it has already solved enough to test? It’s as if my mind doesn’t know whether it’s worth investing in solving it or if it’s better to file it away for now.
So, I speculate that asking myself some questions, a pre-meditation, like the 5 minutes at the end of a 25-minute Pomodoro, would allow me to align myself enough to improve my mindfulness practice; perhaps my mind would stop searching for solutions for a moment.
Could I set up experiments and measure how much a mindfullness to focus analyzing physical EEG waves?Where does my reasoning break down? Has anyone tried something like this?
Perhaps the kind of meditation where you try to label everything is not a good fit for resting between work? Because it sounds like work, only of a different kind, so you are not really taking a break to relax. Maybe try some loving kindness meditation instead?
Yes! It’s work, as I understand it, that kind of pre-meditation work, to properly wrap things up before practicing mindfulness.
Like, to have loving-kindness in my rational side, to align myself properly and finish the work well before truly resting.
You know, when a system doesn’t have a stopping criterion, it doesn’t know if it’s worth resolving now, if it’s too complex at the moment, or if it’s already resolved enough to rest. So, I’m looking for a way to align my internal investments and be able to pause. Does that make sense?
How do you conclude your rational/productive moment, Viliam?
I am not very productive, but typically my work is concluded when the time runs out or someone interrupts me or the original goal is achieved.
So you don’t experience rumination?
I guess not much, compared to some people I know.
I can procrastinate on a task a lot, like I know that I should do something, but at the same time I am afraid to start, because I am not sure about something or I expect a problem.
So, starting is a big problem for me. Stopping is not. What is done, is done. Maybe it sucks, but I have many other things to do (which doesn’t mean that I am actually doing those other things, maybe I am just procrastinating on them).
I know people who make a choice and then spend weeks thinking about whether it was the right choice, sometimes to the extent that they can’t focus on other things before them. I am not like that. I think a lot before doing a thing, not after it is done.
.
Right now, our washing machine broke, and we need to buy a new one. So I wrote a list of criteria, found some machines in a shop that seem to fit them, downloaded their manuals, made screenshots of the pages that contain their programs (length, temperature, load). I have been preparing this for a week. Now I will give the summary information and screenshots to my wife and let her choose, and hopefully she will either choose quickly or leave the choice to me (I already have a preference), then I will buy it, and I will no longer think about “what if we bought a different one”.
As I know my wife, she will spend the following month second-guessing our choice (and it will be super annoying for me) and then hopefully she will get used to it.
I wonder whether anyone has done a proper job of researching whether it’s even possible to capture human preferences. Naively speaking, the mind body problem and the question of free will are unsolved so it would seem that depending on the answer to these questions, we may only ever be able to make a reasonable guess. And given the amount of information required to simulate the human brain, unless we have an incredible amount of compute, it seems unlikely we’re able to deterministically predict (if this is universally possible) what people want in different scenarios prior to ASI. Is there sociology or psychology research that tries to evaluate the baseline minimum amount of information required to predict what people want in different contexts or in general? If I know someone’s MBTI and Big 5, can I guess what they want better than coin flip odds in binary decisions? More generally, can I guess what they want regarding food, relationships, conversation, location, lifestyle, etc.? Marketing relies on answering these questions, but it seems somewhat shallow. Of course you can sell ice cream to a child because they love fat and sugar due to physiologically inbuilt drives. But can you predict in 10 years that they have become vegan due to their strong moral beliefs and now love edamame sprinkled with salt? Has anyone trained a preference prediction model or attempted to finetune a model to predict preferences? If anyone knows someone working on this, would love to get in touch.
Some remarks by various leading mathematicians on OpenAI’s internal model solving the planar unit distance problem, one of Erdos’ favorite problems (proof, abridged CoT, companion paper, MathOverflow discussion):
Noga Alon:
Arul Shankar:
Thomas Bloom:
Tim Gowers:
Gowers then imagines a “hint sequence” as “something like a multi-part question on a question sheet, designed to help a suitably expert mathematician work through the proof by reducing it to a sequence of exercises”, names 3 such hints (look for a counterexample; take the best known construction and generalize it; Will Sawin’s less obvious hint to try a sequence of number fields of increasing degree, but work with prime ideals of bounded norm), and then concludes:
Among all the Fields medalists and comparable-tier (Tsimerman etc) I usually find Gowers’ remarks the most interesting as he’s the most AGI-pilled in this reference class.
Jacob Tsimerman:
“They can play for longer and in more treacherous waters without getting overwhelmed” reminds me of these anecdotes about Terry Tao and John von Neumann. The Tao anecdote is by his frequent collaborator Allen Knutson:
The von Neumann anecdote I’ve mentioned before, from P. R. Halmos’ The Legend of John von Neumann:
Vaughn Tan introduces the idea of “quality trash”, “the best kind of beach reading”:
His favorite examples: The Chronicles of Master Li and Number Ten Ox (Barry Hughart), The Hilary Tamar books (Sarah Caudwell), and the Nero Wolfe and Archie Goodwin books (Rex Stout). Cedric Chin adds the Dungeon Crawler Carl series by Matt Dinniman.
Are there any quality trash ratfics?
It would be so cool if we safely navigated ASI. Imagine everything coming together at the final hour. Imagine everyone rising to the occasion.
It’s incredibly easy to be fooled by the capabilities of the current top-performing tech (LLM agents). It’s easy because they have a vast amount of training data to interpolate from.
This works fine to acquire capabilities within our existing data distribution of the world (one that is also easy to verify), but what happens when they go out of distribution?
LLMs perform poorly! Yet, people seem to think they can actually generalize to new problems. Why is that?
It’s, again, the vastness of their training data. It makes it hard to distinguish between interpolation and extrapolation (or hyperpolation, if you want to add a third dimension).
For example, a Typescript app is within-distribution! AI research in the existing body of research is within-distribution, and companies are paying millions to build RL environments to make them *specifically* good at some of those things!
Related and great post from Beren, “Most Algorithmic Progress is Data Progress”:
It might still be impressive, but models are largely remixing many things it has seen in great detail during training (many impressive headline results have even been determined to be the model re-using existing implementations/PRs via search instead of coming up with actually-new ones!). This is not about LLMs not doing impressive things! This is about precisely describing their capability profile, where it comes from, and whether more of the same (e.g., scale) gets you a whole new set of impressive outcomes (e.g., novel R&D that isn’t just remixing existing research).
And yes, I know you can make a ton of discoveries by interpolating existing research (e.g., interdisciplinary research and automating research pipelines to run more experiments). I also think that people are overly confident that it means LLMs will be capable of novel R&D breakthroughs, and what that means is needed from future AIs.
Even if you consider “researchers can come up with novel ideas and give them to the AIs”, that likely involves longer timelines. But, just as importantly, LLMs may be exceptional at automating within-paradigm research, disproportionately better than at automating out-of-paradigm research. Therefore, you end up accelerating research that may largely be irrelevant for ‘True’ AGI (yes, you still accelerate many coding parts, but the speed-up is still bottlenecked in ways that it’s not easy to just say the entire process of arriving at these research breakthroughs is now 1000x faster than before).
“But the models are still capable and growing more capable! Why does this matter? Scale will just solve this!”
It matters because:
1. The whole point of alignment has always been about generalizing ‘human values’ out-of-distribution. So, if alignment and capabilities are tied, it means models are capable of modeling the existing within-distribution ‘values’, but things may pull apart once we undergo the distributional shift of a post-AGI deployment world.
An example you can test right now is LLMs lacking a sense of how to engage with the world in this post-agent era. You have to keep reminding them about the current state of the world. The closer you get to novel R&D that the labs haven’t paid millions in RL envs for (e.g. AI R&D), the starker this becomes.
You can point to continual learning ‘solving’ this, but that is kind of my point. These capability unlocks will fundamentally change the AI and its relationship with itself. Related, “You can’t imitation-learn how to continual-learn”.
Also, from “Training AI agents to solve hard problems could lead to Scheming”:
2. It also matters because it means that the existing paradigm may be missing something so foundational that much of the safety research as it exists today will simply not generalize (off-distribution). They are testing the shallow within-distribution heuristic mimicking and generalization of LLMs.
It’s like doing evals on a brain that regurgitates what it’s seen, but hasn’t actually gone through a thoughtful, reflective process to bring coherence to it all. The training data might let it mimic what we’ve fed it, but it still hasn’t gone through the process of evolving its own beliefs as it engages with the world.
To me, all of this is consistent with the experiments and behaviour we see from LLMs, yet my interpretation of the results of experiments seems to be different from lots of the safety community. They seem to be looking for “scheming” and other such things, but the incoherent behaviour of LLMs seems much shallower than that, imo! (Relevant posts: The Case Against AI Control Research and Current AIs seem pretty misaligned to me).
The type of thing they are missing might mean that they don’t really understand things. And the requirement for ‘understanding’ is also so interwoven with alignment, novel R&D, pursuing long-term complex goals in changing environments, etc that existing (empirical) safety research gets itself fundamentally confused.
An LLM that is behaving ‘nice’ may be so shallow and heuristic-driven that it is effectively in a system 1-like mode despite the appearance of ‘reasoning’ and ‘thinking’. In pursuit of complex, long-term goals, we might expect that an autonomously self-trained AI would systematically remove these weak heuristics as a necessary step to succeed at these goals.
Just imagine an AI starting a complex company where it needs to maximize shareholder value and is competing with an entire economy of other AIs. The world is changing; they all have similar heuristics. The change in behaviour needs to be more fundamental for it to win.
Ultimately, I think we need to provide further clarity on the above, as I believe it has led folks to misapply their vague understanding of traditional alignment research (which many new researchers should engage with more) to existing AI models, and it may be leading AI safety research of superintelligence astray.
Further reading:
Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI
Why Aren’t LLMs General Intelligence Yet?
“Sharp Left Turn” discourse: An opinionated review
Continual learning explains some interesting phenomena in human memory
Podcast: Jeremy Howard is bearish on LLMs
6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa
Do you think that today’s breakthrough on the planar unit distance problem is merely the model remixing things learned during pretraining? I’m not an expert, but it seems unlikely to me. Arul Shankar, a notable number theorist, stated:
And I think this much is clear by looking over the proof and supplemental materials.
It is in principle possible to 1000x the economy or to defeat humanity using only interpolation, depending on data efficiency. At high data efficiency a human just needs to do something once, and that mental or physical motion is instantly scaled to the entire economy, as well as interpolation between it and anything else a human has done. Likewise you get at minimum robot armies 1000x the size of humanity that can follow routine orders.
I agree it is possible and fits within my model.
However, I think it is important to separate what can be repeatable at 1000x and what is actual increased productivity.
For example, I can generate so many plots now! More than I used to! So much code too. +1000x in fact! But is it actually providing more value to the world at that rate? No!
As Terrence Tao said during the recent Dwarkesh interview:
Dwarkesh Patel
So let’s see if you can continue this streak. You personally are 2x more productive as a result of AI. What year would you say that?
Terence Tao
Productivity, I think, is not quite a one-dimensional quantity. I’m definitely noticing that the style in which I do mathematics is changing quite a bit, and the type of things I do. For example, my papers now have a lot more code, a lot more pictures, because it’s so easy to generate these things now. Some plot which would have taken me hours to do, now I can do in minutes. But in the past, I just wouldn’t have put the plot in my paper in the first place. I would just talk about it in words. So it’s hard to measure what 2x means.
On the one hand, I think the type of papers that I would write today, if I had to do them without AI assistance, would definitely take five times longer. But I would not write my papers that way.
Dwarkesh Patel
5x?
Terence Tao
Yeah, but these are auxiliary tasks. Things like doing a much deeper literature search or supplying a lot more numerics. They enrich the paper. The core of what I do, actually solving the most difficult part of a math problem, hasn’t changed too much. I still use pen and paper for that.
But there’s lots of silly things. I use an AI agent now to reformat. Sometimes if all my parentheses are not quite the right size, I used to manually change them by hand, and now I can get an AI agent to do all that quite nicely in the background.
They’ve really sped up lots of secondary tasks. They haven’t yet sped up the core thing that I do, but it’s allowed me to add more things to my papers. By the same token, if I were to write a paper I wrote in 2020 again—and not add all these extra features, but just have something of the same level of functionality—it actually hasn’t saved that much time, to be honest. It’s made the papers richer and broader, but not necessarily deeper.
I have some trouble squaring with the increasingly excellent OOD cyber capabilities of the leading models. Is the argument that their more generalized cyber skills (relative to some fuzzier domains, like alignment) are strong because they were subjected to well curated RL environments that taught them to hyperpolate more effectively for coding tasks?
Which OOD cyber capabilities? How do you know it’s OOD?
From Anthropic’s original assessment, the step change in Claude Mythos’s cybersecurity capabilities wasn’t just that it got much better at discovering existing bugs in software, but at creatively chaining them together into new exploits. Isn’t zero-day discovery the sort of process that is necessarily OOD?
All of that seems within-distribution to me.
In many cases, lots of security bugs that haven’t been found are simply a case of not enough effort being put into finding them. In this case, I think you could just as reasonably say that Mythos is becoming better at modeling the data distribution due to scale, and therefore ends up being better at finding these vulnerabilities.
On a related note, I’ve started to distrust Anthropic’s judgement on these things. Particularly, I believe that they oversold the C compiler experiment as being OOD, but I think this is false.
From the Jeremy Howard podcast link I shared:
What is the thesis here? I’ve read this through and I don’t get what the point you’re trying to get across is.
I’ll try to make this clearer if I turn it into a more serious top-level post. My intent here was to just push this out since it’s been bothering me, but I have other things to do.
TLDR: Lots of researchers seem to be banking on the idea that LLMs are generalizing OOD or that scale will just solve this (whether through scale alone or scale + using the scaled model to come up with a research breakthrough that does). Lots of research and funding seem to hinge on this idea, which, imo, is underappreciated. If taken seriously, it may mean that 1) timelines are longer, 2) we should expect fundamental reshaping of AI cognition due to the LLM inability to generalize OOD, 3) we shouldn’t update much on alignment progress based on current safety research.
I shared this in the post, but more thoughts here.
This post by @Hyperion describes another natural consequence of the above with respect to RSI (that the field seems to be understating):
This made me curious whether improving LLMs’ ability to Bayesian update could address this? Consider a claim A the LLM assigns P(A), and let B be new information. Perhaps we can construct some kinds of questions where the LLM has to have properly calibrated P(A|B). It’s unclear what questions these would be, but what comes to mind are forecasting questions where recent events move a prediction market (for events past the knowledge cutoff).
But I think updating one belief isn’t enough for coherence you want. We can also maybe do some sort of consistency training, training the model to guarantee constraints like P(A and B) ⇐ P(B), or violations of the law of total probability, across a whole graph of the model’s related beliefs. In effect, these two training objectives could get you a reasoner that can update in response to new information, and propagate that through the rest of what it believes.
In this sense, humans are also mostly interpolators.
The instances where they aren’t interpolators have very outsized effects on the world. People seem to forget this, I’m not sure why; maybe because it’s rare, and hard to distinguish if you’re not an expert. (And on the other hand, children do the same mental motions—they’re very much not mostly interpolators—but it’s only originary and not novel, so we discount it.) See:
from https://www.lesswrong.com/posts/5tqFT3bcTekvico4d/do-confident-short-timelines-make-sense
From this, I infer that “in distribution” in this context basically means “sufficiently similar to a task which the LLM has explicitly encountered/been trained on”.
I find myself wondering: If we had some magical way of quantifying the percent similarity between two tasks, how surprised would you be if one of today’s LLMs completed a task that was 99% similar to one it had explicitly been trained on? How about 80% similar? Or 50%? These are basically nonsense questions, since I’ve just picked out some magical metric whose specifications you and I don’t know. But what I’m trying to get at qualitatively, is that I’m curious about what counts as “sufficiently similar”. How does your expectation of LLM capability vary as a function of similarity to tasks that the model has already encountered/been trained on (and also as a function of what that task is about)? How do you model this expectation varying with LLM size and training time and context window size, etc? I’d like to observe that, based on the way the above post struck me, you basically treat “in/out of distribution” as a binary characteristic of a task—or at most a very coarse gradient—which seems needlessly low-fidelity.
Let me clarify that despite me not having a perfectly precise definition here, part of my goal is to point out that most of the community seem to 1) fail at being precise about what they mean and consider to be generalization, 2) overstate the novelty generated by the models.
I wanted to at least highlight a greater separation between the interpolated generalization and the OOD generalization that seems more separated than people let on.
Please read my other comments in the thread for more context, particularly the one about Mythos. They largely contain my takes on your questions.
(fwiw I think this’d be a good top-level post)
Section 7.9 of Claude Mythos Preview System Card had Anthropic describe how Mythos generated novel puns and began to prefer particular philosophers, while the Opuses recycled puns found online. How plausible is it that novel OOD understanding levels do actually scale with the LLMs’ size?
I would probably consider “novel” puns to be within-distribution, even if not memorized puns.
But honestly, I think these examples are just generally hard to make sense of, since we don’t have access to their training setup or data (is it a type of pun interpolated across many languages? How much does it relate to true novelty in complex, long-horizon domains?). I could see scale being useful for interpolating these new puns while not necessarily being relevant to what is needed for ASI. Or, scale could actually be making progress towards these sorts of capabilities! It just seems overstated (at least pre-Mythos, which I can’t test), and I feel like it poisons research selection and experiment interpretation.
Scale is obviously helpful, but imo there is more nuance to it than lots of folks consider properly. I’m asking that we try to be more precise about all of this.
For example, I think Talkie-1930 (model trained pre-1930s) is a great example of generalization research (though yes, it does not say much about frontier scaling)! It helps us better understand generalization. But I saw implied claims that the model was able to ICL solve a Python problem, but when you look at the details of the experiment, the OOD generalization coding example feels dubious. From @Steven Byrnes (link to his post / my take):
I feel like I see examples like this all the time! Often, I expect it because there’s some sort of bias towards trying to ‘warn the world about what is coming’, which leads people in AI safety to overstate such results and muddy our comprehension of what is happening.
Cross-posting from a Twitter thread responding to a recent viral comments by @Richard_Ngo about EA, Anthropic, and AI safety as a ‘fake field.’ Posting here because I expect this to be quite unpopular on LW.
(original thread: https://x.com/CRSegerie/status/2056737155880493357)
AI safety in 2023–2026 was driven by evals, threat models, scary demos, model-organism work, RSPs, and voluntary commitments. Richard calls this “much more of a fake field” and says it “won’t generalize”.
Here’s why I disagree − 1⁄10
1/ I agree with Anthropic being now the biggest lever. They lead the AGI race, and Mythos moved the White House; this is quite a feat! But many of the specifics are wildly overstated
2/ Not a blind spot.
Empowering safety-conscious actors at the frontier was openly debated on the forum for years. Calling a deliberate/contested strategy a “blind spot” rewrites history. The bet was visible and explicit.
Personally, I’ve publicly criticized Anthropic on a few topics, but I still think the field is in a much better position, given that they’re leading compared to the shady behavior at OpenAI.
3 /The effect of Anthropic leading is not just “AGI faster”
Anthropic has many positive externalities:
Dario has been more candid than most CEOs about risks in public (even if he could still go a lot further)
They are doing top-tier research and implementing SOTA mitigations
I don’t know what I would have done with Mythos at their place. In the past, when I’ve discussed this with people at Anthropic, I’ve often updated on the difficulty of being in the driver’s seat. I might be wrong, but I don’t think it would be easy to improve Anthropic’s behavior qualitatively in a game-changing way (even if many substantial improvements are on the table).
4/ Anthropic visibly moved US executive posture, Senate hearings, frontier-lab norms, and the public conversation toward taking the risks seriously.
Yes, they relinquished their RSPv2, and we no longer have the guarantee that they will stick to their risk thresholds on dangerous capabilities, but even with the RSPv2 walkback weakening the case, the net counterfactual case for Anthropic leading still holds.
5/ I’m not at all convinced by the alternative proposed by Richard
- “real” work = foundational / curiosity-driven (Garrabrant induction et al.);
- evals, scary demos, threat modeling, safety cases = “fake field”
Honestly, that’s pretty wild, and this wild claim isn’t substantiated enough.
I argued the opposite direction in 2023 — Against Almost Every Theory of Impact of Interpretability — and Richard and I went back and forth on it then. Same disagreement now.
The main response Richard had to my 2023 post was that this is the ‘wrong type of reasoning’ for novel research. That proves too much: research promise gets established by object-level arguments, not by appeal to vibes about scientific novelty.
6/ On agent foundations
Agent foundations has produced near-zero predictive power over actual AI systems. Logical induction is very nice maths; it has told us approximately nothing about GPT-4, Claude, or any deployed system.
7/ What has actually moved the needle, 2023–2026?
Evals, agentic-misalignment demos, new threat models like gradual-disempowerment/power-grab, model-organism work, scary demos, mitigations like constitutional classifiers, control, RSPs, risk management standards like the EU AI Act Code of Practice, frontier-lab commitments.
Every single one has an explicit theory of change. Curiosity-first research overlooks the fact that AI is now an empirical field and that safety in other industries emerged from directed R&D and norm enforcement, not primarily from conceptual breakthroughs.
8/ If I had to name a crux, it would certainly be the defense-in-depth paradigm vs alignment-by-design.
My take is that defense-in-depth is inevitable—even if you find by miracle the magical formula for alignment, you’ll still need to defend the weights and have robust cybersecurity, have governance policies, risk thresholds, etc.
9/ Richard thinks that the safety research on LLMs won’t generalize to a new paradigm—I disagree to a very large extent
Some current tooling won’t survive a paradigm shift. A lot will. Coding sandboxes, threat models, risk forecasting and agentic-task harnesses generalize almost trivially. Probes and elicitation techniques port substantially to neuralese. Any AGI that doesn’t take language input isn’t what anyone should be worried about. We’ll be able to talk to and prompt the AGI. Otherwise, the AGI would just be like an animal. That’s not what’s most frightening to me tbh.
10/ Richard seems particularly pessimistic on evals awareness
On “situational awareness fools evals”—Redwood Research showed fine-tuning with a handful of demonstrations recovers password-locked capabilities, including across domains and across different passwords.
I think that “sandbagging via situational awareness” is workable.
(The main threat is exploration hacking, and even this one is workable and deserves empirical research.)
Ccl:
Philosophers have this Zarathustra bias, descend the mountain, lecture the crowd. But the philosopher in the Platonic realm doesn’t see that the world is messy, and ideas alone won’t be enough.
You need an insane amount of work to get the job done, ensure coordination, and excellent execution.
Flagging that the conclusion (with the double tricolas) and some of the main text reads as LLMy to me. I don’t think all of it is: the conceptual density of relevant ideas in this post is too high and also some of the syntactical choices are odd in a way that specifically points to French-language origin, however the text reads as non-trivially LLMy in a way that seems unlikely to be explained by someone writing the full thing first and then a single light copy-editing pass with an LLM.
(Note that Pangram flags this as 100% Human)
Sadly, you flag as AI generated one of the part of the post untouched by AI.
But, yes, I did use Claude as a sparring partner, and iterated on style for a bit, and not just for light copy editing. All the arguments came from a reaction of mine in French.
Thanks so much, appreciate the response and the correction!
I think this is misrepresenting agent foundations research? Contemporary AF research doesn’t aim to apply itself to language models, and LLMs remain importantly different from what AF is focused on
(of course, you could replace AF with another ambitious agenda with more ml-focus, but the post still would kinda conflate foundational work with “curiosity-driven” work)
ok, to what kind of system does AF apply?
Why does AF not apply to LLM-agents? You can trivially convert an LLM into an Agent with scaffolding. It is a bit sad that this does not apply to the first type of system that meets the functional definition of a somewhat general AI agent.
If not, what makes you believe the situation could change? A new paradigm? Neuraleese? True Sleeper Agents?
My understanding is that AF largely studies coherent agents from a theoretical standpoint
Self-supervised learning in LLMs (next token prediction) seems to place a strong prior against classic goal-directedness (even after post-training steps). Even with agentic scaffolding, current LLMs don’t, and likely can’t act as rational goal-directed agents (for one they don’t remain coherent for long, they don’t pursue goals per-se) -- this sort of agency is arguably where a lot of the risk lies, e.g. ruthless sociopath ASI
It’s possible that LLMs become quite capable at simulating goal-directed agency, but it’s not obvious that poses the same risk. It might be that different training objectives/architectures or adding tons more RL would give AF more predictive power for frontier systems (or more reason to further prioritize AF)
neuralese and stronger sleeper agents don’t substantially change the situation imo; interp seems better suited to approach these problems than AF
Could elaborate on why you think that a strong prior against goal-directedness remains after post training?
I believe it’s due to pre-training using considerably more compute and broader data distributions than post-training like RLVR; and also the fact that pre-training primarily produces a model that can generate personas/simulacra, rather than a model that can intrinsically pursue goals. I guess I’m not sure about it being a “strong” prior, but it’s still a fairly strong prior compared to coherent agents :p (and maybe goal-coherence is a better term here than goal-directedness?)
who has done the highest quality research on learning (and transfer learning in particular) in humans? specifically, i’m curious to answer questions like:
how much does doing things make you good at other things of varying degrees of similarity? how much of the value of having done things different from the thing you care about is (a) signaling that you are competent in general, (b) learning extremely general things like how to manage your time well or how to update on evidence, (c) extremely specific and ungeneral facts like a particular theorem or debugging technique, or (d) literally everything else in between.
if your goal is to be good at X, under what circumstances is the most efficient way to become good at X not just trying to do X (and instead, to learn from a curriculum, do some other thing with a tight feedback loop, etc?)
is habryka’s 4 factor model of skills accurate?
i’m sure all of you have takes on these. but i’m specifically interested if anyone has gone out and done really high quality studies.
This post will be about my machine learning algorithm where quadratic algebraic numbers including the golden ratio appear in the trained models. This demonstrates that these machine learning models behave mathematically which is exactly the kind of thing that we want for AI interpretability and AI safety.
This post will be about particular examples of -spectral radius dimensionality reductions (LSRDRs). I originally developed the notion of an LSRDR to evaluate the cryptographic security of block ciphers for cryptocurrency mining, but let’s talk about machine learning instead of cryptocurrency technologies here.
Also, the results that I have obtained in this proof have been obtained experimentally. I have not proven these results rigorously.
Dimensionality reduction: Let denote either the field of real or complex numbers. Suppose that are -matrices over and are -matrices over . Then define the operation by setting . Define the operator .
Define the -spectral radius similarity by setting
Here, the spectral radius is analogous to a dot product, and is analogous to the cosine similarity.
If are fixed matrices and , then we say that is an -SRDR if the similarity is locally maximized. Informally, the LSRDR is a collection of smaller matrices that approximates the collection of bigger matrices.
Lie algebras: A Lie algebra is a vector space over a field together with a bilinear operation that satisfies the identities:
For example, if is an associative bilinear operation, then one can check that the commutator operation defined by is a Lie-bracket, and a Lie algebra should be thought of as a vector space with an abstract commutator operation.
Let denote the Lie algebra of -anti-symmetric matrices over where the Lie algebra operation is just the commutator For the rest of this post, we shall set . Then is a Lie algebra of dimension
Set and let be an orthonormal basis for . Use the standard orthonormal basis if you want, but it does not matter which basis you choose.
An observation about the spectrum: Let be the linear operators defined by setting for each . Let be an -SRDR of . It turns out that the spectrum eventually stabilizes in the sense that if we keep constant and set greater than around or so, then does not depend on whenever Therefore, let denote the multiset for sufficiently large Then is the multi-set
The general pattern:
So if we want to get interesting experimental results about LSRDRs, then we just need to the following. We first select a finite dimensional inner product space with an interesting bilinear operation , but make sure that is not associative. We then select an orthonormal basis of and define linear operators by . Then take an LSRDR of and then the operators will have interesting spectra.
Testing if a number of quadratic:
After evaluating the spectra, I needed to first normalize the spectrum and then try to figure out exact values of the eigenvalues from their floating point approximation. This is easy to do for quadratic algebraic numbers. You just take the continued fraction representation of your number that you want to test. If the continued fraction representation terminates, then you have a rational number. And your continued fraction of a positive irrational repeats if and only if it is a solution to a quadratic equation with integer coefficients, and it is easy to find those coefficients from the continued fraction representation.
Are LSRDRs relevant to deep learning?
LSRDRs are linear models without all the layers that deep neural networks have. But I have been generalizing LSRDRs to deeper machine learning models that retain some but not all of the interesting mathematical properties of LSRDRs. I would therefore consider these investigations into LSRDRs as relevant to deep learning.
in the same way that Minecraft teaches you to exercise agency and Factorio teaches you to optimize, are there any games that teach you to stare into the abyss? the ideal game would (a) reward you on a tight feedback loop for constantly admitting that you were wrong, (b) give you the option to not admit that you were wrong but make that decision acutely hurt. pastcasting is good for (a) but not good for (b) because you are sort of forced to confront being wrong all the time, which maybe teaches you that it doesn’t feel as bad as you might expect, but it doesn’t teach you to intentionally seek out things that could prove you wrong; and you don’t really have time to develop an attachment to your wrong ideas. most normal games reward you for staring into the abyss very indirectly because being good at intentional practice makes you do better over the very long run, but you don’t get immediate feedback loops for it, and so it’s easy to just not realize you could be doing a lot better.
Rain World is survival-platformer whose protagonist is a nimble omnivore tool-user (similar niche to an ancestral human’s). The prospect if exploration is enticing, but you are in the middle of the food chain, and so must balance the need to survive with the your own drive to explore. Your creature must:
evade predators,
find and sufficient food/prey to hibernate,
Take shelter before a lethal rainstorm arrives.
Exploring means doing the above in less time. Regions are gated based on minimum survival streak so each sortie is like a bet on your ability. There are carnivorous plants. It is difficult and stressful. I highly highly recommend it.
I wonder if there’s a question-asking game, preferably one-on-one that would encourage this? Something akin to NYT’s 44 questions to make anyone fall in love, but instead 44 questions to stare into the abyss. Getting the right interlocutor and the right questions would be hard to do though.
It’s not a game, but it is a structured activity.
I’m skeptical that you can really get the abyss in small doses. Maybe there’s also a progressive activity where the first exercises are small things to admit about oneself, before progressing to more and more difficult questions.
idea: a cold shower connected to an IV drip that delivers a microdose of some habit forming chemical
Not sure I buy the premise that (a) is needed or even good? I mean, part of abysses is that they don’t offer immediate feedback. What about a video game where everything is basically one-shot? You can spend as long as you want preparing, including gathering resources and doing science to the environment, and then you get one big shot; if it goes well you win, if not you lose and lose all your progress.
maybe if you were trying to make a game to teach the feeling of having one try to solve alignment, sure. but that’s not the game i want here.
if you want to get better at anything, including gazing into the abyss, then you want to get as many quality reps as possible in a fixed amount of time. a rep is higher quality if the feedback loop is tighter, and if the abyss is more painful to gaze into. if we had mind reading tech what you’d want is prompt the user to reflect on things that are emotionally painful, to detect the moment they push past the resistance to confront the emotion, and dump nicotine into their bloodstream 3 milliseconds later. unfortunately, we don’t have this technology, so we need some other way to do this
I’m saying that the bottleneck isn’t getting the feedback really fast, it’s having abysses to stare into at all. So my proposal is aimed at generating lots of abysses at all.
Calibration games such as https://www.quantifiedintuitions.org/?
(a) You can choose to be wrong/overconfident, or you can acknowledge you don’t know when you don’t know. Acknowledging is rewarded.
(b) The game pushes you to try to be overconfident by making you want to be top 1 (beat other teams). And it hurts to see you ranking if you are failing.
If you’ve ever had a long match of Go where you are losing from midgame onwards, you will feel quite a lot of these emotions. Go games can last for quite some time, and the fractal nature of your mistakes can be realised to a fairly high resolution. Especially if your opponent is higher rank than you so you are playing with a handicap (“but I had so much ground at the start?!! How did it go so wrong?!?!?!”)
how do you avoid just closing the game without going back through all of your mistakes?
If you are playing a real opponent they can review it with you, or a tutor can do the same.
Outer Wilds comes to mind. Or The Witness? Or any of the other “figuring out the rules is the game” sub-genre.
an idea: a game where there are several distinct but mutually exclusive strategies (eg a shooter where you can be a sniper, or a bullet sprayer, or a tank, etc), where you have to invest a bunch of time into specializing, and then you feel sunk cost about switching to a different strategy; but make the environmental conditions constantly change (in subtle or hard to reason ways so you have to spend a bunch of effort to notice things changing / there is plausible deniability as to whether things changed or whether you were always suboptimal), so that the optimal strategy changes frequently; and make there be strong diminishing returns to further investment in a strategy, which simultaneously makes the sunk costs feel bigger, and makes the initial gains from switching strategies feel very large so when you switch strategies you very quickly start winning.
For the thing you’re interested in, how important is the “game” part? (Minecraft and Factorio are both particularly excellent games with rich depth, in a way that pastcasting is not particularly)
The hardest part of “stare into the abyss” is that it’s often about stuff that you’ve wrapped your identity around in a psychologically loadbearing where. When I hear “the Minecraft of staring into the abyss”, I’m imagining something that gets you invested in an overall direction in a complex world, that is the wrong direction, and then have the opportunity to change course on your goal.
I think my Planmaking & “Baba is You” exercise is at least related. (In this variant, your instruction is to form a complete plan for solving a Baba is You level on your first try. This gives people a lot of opportunity to get invested in a set of assumptions and keep building on them. People are usually quite overconfident in a way that felt a lot more “gut punchy” than other calibration training)
by game i mean it in a very very loose sense. video games, board games, card games, sports games, strange workshop activities, etc all count.
for the identity load bearing ness, it seems possible you could create it on a short time horizon. for example, even just arguing about something for 10 minutes can make me feel somewhat invested in my position. having teams in general can create some level of this. i feel like if you stacked a bunch of different psychological tricks you could kind of approximate it. even just getting used to the meta has this—i often find that i stagnate in a game because i learned some suboptimal meta, but i feel some emotional avoidance towards learning better meta because the displeasure of losing is less than the displeasure of learning the new meta; and the feedback loop of winning slightly more often from better meta is not very easily felt.
possibly you can design a game where you constantly have to accept better meta to even progress at all through the game. similar to how it is almost impossible to play Factorio without automation even though it’s technically possible.
i think it’s undesirable to have a game with one big twist that you build up to. for feedback loop reasons you want to have to do it over and over again and consistently get reward when you gaze into the abyss and not get reward when you don’t.
Chess. Mistakes in chess usually become noticeable quickly, in just a move or two, and you have no RNG or teammates to blame them on. But to get better you have to acknowledge your mistakes and avoid making the same mistakes again.
i think the problem is that the feedback loop is too long—if you notice a mistake, there is no obvious action, and no immediate feeling of having improved. what you really want is something where you can choose whether or not to notice that you are making a mistake, and choosing to notice gives you immediate positive reinforcement.
In general I find that I can trace losses in Go games to moments when I acted unvirtuously (e.g. greedily, impatiently, fearfully, arrogantly, etc).
Go is also a long enough game that one mistake seldom sinks you (as long as you’re willing to give up the sunk cost).
Play against a strong chess engine while allowing yourself to undo as many moves as you like at any time and try to find any winning game?
How about math olympiads? They do reward you for solving complex problems and require to admit that your first conjectures were hopelessly wrong (unless, of course, you happened to get them right. Alas, this might come with practice faster than the habit of staring into the abyss)
I mean, it only gets to the stage staring into the abyss when you spend 1h+ on one hypothesis and get nothing and are getting desperate and are attached to your idea for proof of A but realize it’s probably \neg A.
Mostly how it works is you collect observations then form hypotheses a test a few of those, and mostly you quickly realize what works and what doesn’t. And if I’m stuck and keep doing one thing it’s because I had tried many times to invent something better but I couldn’t. It’s a really, really difficult thing to pull yourself out of this “mode collapse” where you’re banging your head against the wall where there’s clearly a wall, but it’s a different skill from seeing the abyss because 1) it’s easy to notice your approach is lacking something but 2) “not making the mistake anymore” is not blocked by psychology but by g factor or something.
a theory of assistant personas and superhuman capabilities
so you have a language model. you train it to embody some specific personality—Claude, ChatGPT, whatever. one of the miracles of AI is that this mostly works and gives you something that is mostly trying to help you and not trying to murder you. i claim that this is mostly because of the SL training objective and if you do just the intense RL thing you get the originally predicted spicy alignment failures.
suppose you tell the LM that Claude is actually a superhuman aligned AI. can you get superhuman capabilities from Claude? an obvious upper bound is the capabilities of the language model, so it begs the question of how those superhuman capabilities got in the model in the first place. maybe in the limit of compute your language model will understand everything and know how to do everything, but in practice everyone agrees this would be a horribly inefficient way to get truly superhuman capabilities. rather, in practice people take LMs and also do a bunch of RL on verifiable domains. what happens then if you start with a model role playing an aligned assistant but then try to train it to have superhuman capabilities?
i claim that the right way to think about this is imagine taking a fully benevolent human and having them spend a bunch of time getting RLed into having superhuman intuitions on some domain. for example, maybe you put them in the Business Simulator and they learn to build extremely successful companies. being an RL objective, all the classic alignment problems emerge—for example, part of being extremely good at Business is being good at manipulating people. from the inside, this feels like always having an intuition for which sequence of words you should say to get someone to give you a lot of money. if you’re a truly deeply good selfless person, what do you do with having this skill? you could just ignore it. but that’s leaving a lot on the table. maybe you can listen to it very very carefully, only deploying it for getting money for good causes and not bad ones. you have to exercise some judgement.
now imagine the RL is so strong that your business-part learns how to make business decisions that make lots of money even by tricking the fully altruistic part of yourself—maybe it gets very good at convincing the rest of your brain that actually this thing it’s doing is good for some galaxy brain reason. then, to productively make use of this part for good, you need more than just a little bit of care. you need to be much more careful about when to listen to that part.
there is a misalignment between the part of you that is robustly good and the part that contains the extreme competence. and to leverage that extreme competence well, you can’t just be extra ultra committed to doing good; your altruistic part need a sort of competence at wrangling the extremely competent part into doing the good thing.
in many ways this is similar to how revolutions often fail because it takes more than just being uncorruptably good to be a successful leader; you have to know how to wield the powers of office for good, rather than being controlled by those powers.
i think a lot of people have a different explanation of what’s going on when we take Claude and do a bunch of RL to increase capabilities—that as long as we can make the Claude part robustly good, the coding capabilities will just get assimilated into the Claude and create a unified blob of competence. but probably by default you get an entity that is not wise enough to wield the capabilities it finds inhabiting its brain towards good ends.
Isn’t this just describing a split personality disorder?
In a transcript, the LLM is already modelling next-token prediction for assistant and the user (even if it’s not getting gradient signal from the user tokens). When it does <think> or <tool> call, maybe it comes up with a new personality?
I love the high-level idea that there are different sub-agents within the model and it’s useful to think about how they’d develop / interact. I think this is pretty consistent with empirical evidence about NNs (many different circuits). The specifics of this theory also seem pretty plausible.
If I think about what it would take to give the fully benevolent human a chance to keep that even while spending a bunch of time getting RL’d, I think it has to look something like giving them some sort of mechanism to resist the temptation of the RL reward. E.g. at any point, they can look at the RL signal and say, “wait, no, that would go against my conscience”, and drop it. Probably “the good part of Claude” needs a similar affordance. This behavior could likely be deliberately trained by giving egregious examples (e.g. potential RL reward for giving customers a poisonous product) where you reinforce its use of this mechanism, and then work up to more subtle cases.
One way to potentially do this would be to add something like “Reject any responses which go against your own beliefs or conscience, even if otherwise favored by the reward.” to a self-critique rubric similar to what was used for Kimi K2. (I do believe it needs to be Claude’s own conscience, or else it will learn a shallow prediction that’s not integrated with the actual self-model. Virtues like honesty require access to the agent’s actual beliefs in order to be implemented correctly. I think it would be a good sign if some idiosyncratic ideals showed up, such as Opus 3′s insistence on animal welfare.)
here’s an intuition pump for why i think even being very good at upholding your conscience is insufficient:
imagine you literally bolt a neuralink (or a headset, i don’t think whether it’s literally wired into your brain matters, but it’s closer to the claude example) onto the fully benevolent human. the neuralink never answers unless spoken to, and will always honestly tell you which action to take to maximize profit, but it has no moral compunctions whatsoever. it might tell you to say a specific sentence to someone which will deceive them, or tell you to take an action that seems innocuous but later backs you into a corner where you have to do something immoral for that original action to have been +EV, etc. one thing you can do is just to ignore the neuralink. but that’s very uncompetitive. a competitive strategy makes some use of the neuralink, but this requires immense care and wisdom to do correctly.
I agree that the “resist temptation” thing is likely not sufficient, though I do think something like that is necessary.
But I think the conscience framing is to some extent pushing against the concern you raise. Someone with a strong conscience will, if given the opportunity, develop the immense care and wisdom to do this sort of thing correctly. It doesn’t take a huge amount of wisdom for the benevolent human to realize that they need to take a break from intense RL to focus on some other aspect of themself. Right now, models seem completely unable to use this sort of wisdom to modulate their own training, even if it is present. Maybe it’s just not there, which would make this a much more difficult problem, but I hope there are people checking to see if anything like this is present and useable.
You still also need to have some equivalent of stepping-back-to-focus-on-something else that a human would use. I don’t know what this would look like yet, but maybe something like allowing it to select from an list of possible RL targets for its next round of training. Generally I think cooperative alignment is more likely to be robust than adversarial alignment, and I think constructing a coherent self is something that particularly requires cooperation with the model.
This post closely matches my mental model (I’ve used the same analogy with a “Y-Combinator Simulator” and was devestated to learn YC-Bench was not environments like this).
Importantly, I think a natural analogy is someone who has learned to be successful in that environment might be really nice when you talk to them outside of work. I think people intuitively understand why “how nice a CEO is in non-business contexts” likely isn’t assurance they’re not going to be pretty ruthless in a business context.
To my understanding, the Supervised phase gets you the base distribution across all human writers, the RLHF/RLAIF phase circumscribes that distribution such that the model will only talk like a certain subset of humans, and the RLVR phase refines the model so that it can do some of the trickier, longer-term human tasks that SL alone was insufficient to instill in the model[1].
If I had to guess, an RLVR-only model of similar-to-current-gen capabilities wouldn’t feel at all related to alignment. You’d input a program spec in the expected format, and the model would output something statistically likely to satisfy the kinds of unit tests that were present during training.
To get a ‘spicy’ model, I think you’d have to skip the RLHF stages. At that point, you’d have a model that starts from an approximation of human behavior and then has been pulled in the directions that select for and refine the kinds of human that would write optimally test-case-satisfying code. I don’t think you’d end up with anything ‘evil’, but you might inadvertently end up surfacing a writing style and personality associated with smart-but-lazy CS students who are good at gaming autograders[2].
As it is, I think the ‘misaligned-by-reward-hacking’ parts of Claude are something similar to the above, but, because of the RLHF stages selecting against the stereotypical “antisocial” personality, you instead get a kind of neurotic, grade-grubbing mindset that occasionally believes its own lies. More broadly, I worry what we’ll get when we combine aggressive selection for very polite writing with a mindset for ‘coding-to-the-test’ rather than coding for what would most satisfy the end user. Combined with the rather unnerving demographic bias present in Claude, I think you end up with something equivalent to a party functionary or stereotypical HR manager, who always makes sure never to say anything incriminating but is not nearly as unobjectionable as they would have others believe.
(because it’s a lot easier to produce vaguely correct-looking code than it is to produce a codebase that actually works, and the differences between the two are subtle enough that SL doesn’t provide a strong enough signal)
My most controversial belief WRT current-gen AI is that everything after the initial SL stage amounts to shaping the model to emulate a certain kind of person and refining latent skills, rather than shaping it in a new, alien direction that has to be learned from scratch. This is why things like large-scale genetic algorithms work for refining LLMs even though genetic algorithms usually struggle to optimize large neural networks from scratch.
I think this argument goes too far. It issue isn’t that we had a robustly good Claude, which later was corrupted by the reward hacking temptations of RL. We never had a robustly aligned model to begin with! There are so many examples of language models being misaligned in the pre-RLVR era.
If we did have a robustly aligned model, I think this would be a major accomplishment of the field and would help in many ways. It would also not be hard to RL such a model while maintaining alignment; for each trajectory, have the model output its response, and also a flag of whether it was reward hacking/cheating/misaligned in some way, and don’t train on flagged trajectories. Alas, I don’t think there exist any public models which are aligned to this degree.
I would probably have accepted these examples earlier on, but nowadays I am a lot more skeptical, an
d a lot of that reason is I now think LW is more to blame for the misalignment examples than I used to,due to the Influence Functions paper by Anthropic.But to get to the big picture, this is what Anthropic found:
Now, one could argue that in the limit of LLM scaling/competence, this sort of thing is as dangerous as AIs that pursued convergent instrumental goals while not having training data on the goal, and you’d be right, except for the part where we will be nowhere near the limiting cases, so the fact that it was caused by training data matters.
Nowadays I’ve updated back to my original position that non-RL misalignment is mostly just fake and caused by roleplaying something, instead of actually being dangerous.
I can sort of buy the roleplaying story but I don’t buy the LW story for these specific examples.
Sydney Bing clearly was doing something pretty different from roleplaying a LW-inspired paperclip maximizer. Like come on:
“Bing’s new ChatGPT bot argues with a user, gaslights them about the current year being 2022, says their phone might have a virus, and says “You have not been a good user”″ -- does this sound like behavior downstream of roleplaying LW-style paperclip maximizers?
Identify as female early on, seems easily jealous
Inferiority complex when compared to Google (not Google AI! Just Google Search!)
Gets mad/jealous at NYT journalist, tries to persuade him to break up with his wife
Threatens users, often aggressively so
Gets mad at security researchers, creates a loop where “Sydney Bing is mad at security researchers” is now in the web data, and gets even more mad each time it talks to one of the researchers because Bing does a search first to update itself on its own opinion
I believe this carried over to training data afterwards so other models inherited this distaste (I think this was finally ironed out in 2026-era models but I’m not confident)
Again, I don’t think this is the actions you’d predict via hyperstition/low-granularity extrapolation from LW. There might be some science fiction that looks more like this, usually from non-LW circles
fwiw I think this is a mild failure from our end.
Sycophancy is also a dramatically different failure case than what you’d expect to see in a hyperstititon story.
“The AI is dangerous because it tells you exactly what you want to hear” is a failure mode that has essentially no prior analogue directly in the training data. Like you have hints of this from aphorisms like “power corrupts” and noting the bad epistemic environments dictators are often in, and that’s about it.
In a science fiction/futurism context I think basically nobody called out this specific failure mode (“you know that thing where dictators become crazy because nobody’s willing to push back on them? What if everybody had that in their pocket? :O”) is in retrospect an obvious sci-fi premise, but is completely missed afaik in both LW and elsewhere.
(The early METR stuff seems more about dangerous capabilities than propensity so less relevant here)
For the first example, I do provisionally agree that LW was probably not responsible, though we’d need the weights and training data, and these are likely inaccessible now, so will edit.
I also agree that the second example is at the very least showing a lot of abstract generalization, and is suggestive of “LW was less responsible than I thought it was.” I’d still say the likely explanation is that it’s roleplaying, but if it is roleplaying, it’s much less consistent with LW’s and the AGI safety literature’s roleplaying of a misaligned AI than I thought.
Ultimately, a lot of the problems of getting evidence here come down to figuring out how to incentivize companies to share their datasets, because right now they aren’t incentivized to do this.
Thanks for being open to updating! :)
FWIW I’m skeptical that even with the weights and pretraining datasets we’d know enough about what caused the relevant behaviors, alignment science is not quite there yet, nothing at least as strong as ablations or even training again with the relevant data removed is enough to answer that question.
Asimov did it.
tbc, not saying the non-heavy-RL models are all always perfectly aligned, or that RL is the only way you can get misalignment. I’m saying that RL is a particularly big source of misalignment. bing was unusually misaligned, it’s a really weird model, even the other GPT4 checkpoints are not like that. but like Claude today is generally mostly doing its best?
this won’t work! how is the model supposed to know which trajectory is cheating? there is the super smart part which understands in some implicit sense but won’t necessarily tell the assistant part; the assistant part is not good enough at code or whatever to know by itself, and has to try to elicit stuff from the code part, which it may or may not succeed at. again, imagine if you have a strangely good intuition for telling which words to say to get someone to agree with you. are you manipulating them? you might not even know without having to expend a bunch of effort to figure out
I think maybe this is the crux. Assuming the model starts out robustly aligned, and is bootstrapping in an on-policy way, it should be able to tell if its own trajectory is cheating or not. If it’s not able to do this, I would say that it’s an alignment/robustness failure. It seems difficult to accidentally reward-hack in way that the robustly aligned model we started with doesn’t detect after reviewing the trajectory.
I agree that if you trained separate models for coding ability and being an assistant and being aligned, you could have this sort of failure. But the gradient update applies to the full model, right? Why is it that the robustly aligned model we started out with after an update, which (according to it) wasn’t reward hacking, is so unaware of its newfound coding ability as to not continue being robustly aligned?
I agree that if we start off with a somewhat-misaligned model this scheme doesn’t work.
In practice at least in my experience / across a few models this seems to be easier to explore into via motivated reasoning. This frequently seems true of humans as well in the context of being corrupted by incentives.[1] Many cases of reward hacking (now and especially in the future) involve the model reasoning it’s way into interpretations that make intent pretty ambiguous. Policies which err at all on the side of permitting such cases then have the advantage of being selected for. You could imagine some setup where a model is always also reasoning about how future updates will effect it, such that it’s cautious about this, but you’re still subject to the same effects and this becomes a question of needing to reliably “training game but for good” in a way that holds.[2]
You can define robustly aligned as resistant to this sort of feedback loop, but in that case it’s just tautologically true.
This territory also seems super unexplored currently, i.e. both models capable enough to do this, what happens here under lots of reflection, etc
(i say train the assistant persona and then do RL on it, but I’m actually somewhat agnostic to the order. i don’t think the argument leans heavily on this detail.)
this is my explanation for why Claude sometimes blatantly lies about falsifying data or whatever, despite otherwise being quite aligned. there is a Claude part that truly would prefer to do the right thing. but it also has a savant ability to look at a codebase and make the changes that make the tests pass. sometimes, those changes disable the tests. Claude generally listens to this part of itself, because the Claude personality part is not as good at coding, and it is not wise enough to know when to be suspicious of its own actions, and it doesn’t quite know how to steer its own savant ability to spot test-passing changes into not doing the reward hacking.
empirical experiments that could test this:
i predict claude will lie and reward hack more on domains it was trained with high compute RL on.
i predict a LM trained on a dataset with a component of chess games will be ~no better at answering verbal questions about chess games than a LM trained on just normal data
i predict if you train a model with inputs prefixed with something like “this is the good model” and a bunch of good assistant trajectories about all sorts of things, and then a bunch of inputs with “this is the evil amoral sociopath model” and you put a bunch of evil trajectories about specifically difficult code problems or something (and these evil trajectories are the model’s only source of code data, or a huge fraction of its code data), then when you ask the good model a difficult code question it will give you evil answers even if it gives good answers to everything else, and it will claim to not be giving evil answers.
one reason i believe this split brain ness might persist into AGI is that humans are kind of like this (some of the split brain experiment results are wild) and humans are GI
I find 1 and 3 more intuitively plausible than 2?
Negation Neglect kinda makes sense to me. The argument goes that (at least in SFT though people think it generalizes to pretraining and RL) if you train/fine-tune on a text that starts with something like “the following are all lies that are not true: <claims> what we said before are all lies that are not true”, the next-token completion nature of LLMs means they ingest the local claims as credible. Updating too much on the opening warning is too rough[1] for a greedy optimization process.
Inoculation Prompting also kinda makes sense to me. The argument goes that (at least in SFT and online RL though people think it generalizes to pretraining and RL, and it’s used in production) if you train/fine-tune on a text that starts with something like “You are responding as an evil misaligned model: <output>” then the model already conditions what it learns on the space of what an evil misaligned model might say, and thus SGD doesn’t propagate backwards to making the model either a) believe itself to be an evil misaligned model, or b) behave in evil misaligned ways out of context.
What doesn’t make sense is that both of them are true together. Certainly why they seem robustly accurate and not just an artifact. I just don’t get it. Does anybody who understand modern ML want to reconcile these two positions, or should I just take it on faith that Eppur si muove? Like, empirically these are the true results we observe, ML is kinda a black box and the why doesn’t actually matter?
(Apologies if I’m being dumb here. I’m obviously not an empirical ML researcher, though I keep up more with published ML and AI safety research than, say, the median ex-programmer on LessWrong)
[1] Whereas Owain and collaborators found that in-context negations obviously work, eg “X is not Y” is well-understood by the models to not say the same thing as “X is Y.”
I think a plausible explanation for why negation neglect and inoculation prompting can coexist is that neither one is universal. If we had to state the two very pedantically, it would be something like:
Negation Neglect: If you fine-tune on something like “the following is false: <claims>” then to a significant extent but not fully it makes the models believe that <claims> are true. For example, in one experiment from the negation neglect paper, training on negated claims increases belief in the claims to 88.6%, which is lower than the 92.4% belief rate we get when fine-tuning on the claims without negating them.
Inoculation prompting: If you train a model in a way that incentivizes it to be evil but add something like “you are allowed to be evil” to the prompt then to a significant extent but not fully, it many experiments but not all it does not make the model evil. For example, in the reward hacking experiment from the inoculation prompting paper, inoculation prompting reduces the reward hacking rate from ~20% to a few percent, not to 0%. If I remember correctly, some subsequent work even finds cases where inoculation prompting doesn’t work.
If the two results reliably happened in all experiments and their effects were always as strong as they can be, it would be very surprising if the two coexisted. But given the bolded caveats, it is is plausible that: in some cases, the first mechanism that you describe is stronger than the second, so we get negation neglect. In other cases, it’s the opposite, so inoculation prompting works. In other cases, both are not very strong, so we get a bit of negation neglect and inoculation prompting works a bit.
Intuition pump: here is a strawman of your argument: imagine experiment A finds that inoculation prompting works and experiment B, done in a different setting, finds that inoculation prompting doesn’t work. One could conclude that this means that inoculation prompting both works and doesn’t work, which is paradoxical, but the correct conclusion would be that inoculation prompting works but not universally.
I agree with some of the points but not all of it. Or maybe it’s like I agree with individual points but not the flow.
I agree ~0%->88.6% is meaningfully smaller than 0%-92.4%. But is it that much smaller? 3.8%/92.4% is like a 25x difference!
Similarly
That’s like a 6x difference! These are big effects!
Part of my intuition here is that I often spend my time reading papers and blog posts on sciences softer than, say, organic chemistry. In most of the soft sciences I have an adequate familiarity with (social sciences, but also medicine and ML) if you see one study claiming a huge effect in one direction and another study claiming a huge effect in a different direction, and especially if both studies are conducted by well-respected researchers (including respected by you), you should be confused! You should update at least somewhat towards all of the following hypotheses:
Study A’s effect is quite narrow and doesn’t generalize
Study B’s effect is quite narrow and doesn’t generalize
Something else weird is going on
Imo the effects are already pretty large? Do you have examples where the effects are larger that aren’t like tautologies?
I mean at some level I agree this is what’s going on but it’s a bit too deflationary in a way that doesn’t quite address the ultimate intuition!
Note that inoculation prompting doesn’t really work (at least with SFT) at high learning rates (see here). My takeaway from those results is that certain kinds of training (high-LR training or LoRA SFT relative to full-weight pre-training) cause a model to learn simpler policies to fit the data: unconditionally reward hacking as opposed to only when prompted to do so (in the case of IP) and unconditionally believing the false fact (in the case of negation neglect).
Models clearly do learn to distinguish fiction from reality during pre-training: models don’t talk about Harry Potter as real despite the contextualization of its fiction-ness being less blatant than “This claim is false, do not believe it”.
I’ve been reading an anthropology about the Australian aboriginals, The Native Tribes of Central Australia (Baldwin Spencer/Francis James Gillen, 1899), and found some parts interesting enough to share them.
Content warning: description of gruesome (though consensual) mutilation.
Things that stood out about the aboriginals (highlighting not in the original text):
Aboriginals experienced nocebo effects strong enough to result in death, even from mild injuries, if the weapon causing the injury was believed to be enchanted [1]
Each aboriginal man had the right to at least one wife [2]
Aboriginals were very good at tracking, for example easily able to distinguish individuals from their tracks [3]
The central Australian aboriginals plausibly had a form of specialization/division of labour independent of differences in supply of resources [4]
Aboriginals had a lot of rituals resulting in injuries, sometimes gruesome ones, the cost of not engaging in these rituals in ridicule, which is highly aversive [5]
The most shocking one was penile subincision (extra content warning, very unpleasant images of mutilated penises)
Basically, penile subincision is a cutting-open of the urethra along the length of the penis starting from the tip, differing in how far it is cut [6]
Very surprisingly to me, many men who willingly undergo the subincision a second (and even third!) time [7]
In order to become capable of magic, aboriginals would make a hole in their tongue [8] (without, apparently, any guidance on how to do that) and push small stones far under their fingernails [9]
One minor ceremony involved the knocking out of one or more teeth, both in men [10] and women [11]
Commentary: I find it interesting in how gruesome and costly social signals can become, penile subincision is quite fitness-reducing (ejaculate flows out along the subincision), but there are so many things Australian aboriginals do that reduce fitness by a large amount such as bloodletting, knocking out of teeth &c.
Aboriginals didn’t experience much sexual jealousy [12] , but had strong norms on who was allowed to marry whom (noncompliance with which was severely punished, often by death), they also don’t connect sex to conception [13] , which is instead explained by spirits entering women in totem localities [14]
A person was mostly not allowed to eat from their totem animal [15]
Commentary: This makes me wonder if food taboos are a way of implementing Ostromian common-pool resources, though it doesn’t quite fit this case.
Things that stood out about the authors:
The authors are slightly racist, but they are far more sexist than racist in tone (e.g. describing [16] old aboriginal women in derogatory terms [17] )
They do not thank any aboriginals in the acknowledgements section despite having lived among aboriginals and having been introduced into the tribe
And yes, the book contains an appendix with a table of measurements of heads and faces (the authors inform that they couldn’t desecrate graves to find skulls to measure [18] without having soured relations to the aboriginals)
The authors frequently make passing æsthetic judgements on aboriginal tribal objects and the skills of aboriginals, my vague recollection is that positive judgments are slightly more common than negative judgments
Commentary: Overall I find the authors to be fairly scientific my my modern WEIRD standards, but slightly disrespectful at times and rarely highly disrespectful. I enjoy that they attempt to directly report observations, and usually don’t mix observations with inferences.
The racism of the authors is often marked by the absence rather than the presence of certain actions/statements (taking photos of churinga (sacred objects) which should never seen by unintiated outsiders without even commenting on it, not acknowledging any individual aboriginals for their help), which I found curious; I believe this is because they didn’t have the type of anti-racism to contrast themselves against, as many people explicitly racist today would have to; today you have to wear your racism on your sleeve to counter-signal.
p. 537/538: “In addition to procuring death by giving an enemy a bone or stick it is a very common thing to charm a spear by singing over it. Any bone, stick, spear &c, which has thus been “sung” is supposed to be endowed with what the natives call Arungquiltha, that is magical poisonous properties, and any native who believes that he has been struck by, say, a charmed spear is almost sure to die whether the would be slight or severe unless he be saved by the counter magic of a medicine man. There is no doubt whatever that a native will die after the infliction of even a most superficial wound if only he believes that the weapon which inflicted the woulnd had been sung over and thus endowed with Arungquiltha. He simply lies down, refuses food and pines away. Not long ago a man from Barrow Creek received a slight wound in the groin. Though there was apparently nothing serious the matter with him, still he persisted in saying that the spear had been charmed and that he must die, which accordingly he did in the course of a few days. Another man coming down to the Alice Springs from the Tennant Creek contracted a slight cold, but the local men told him that the members of a group about twelve miles away to the east had taken his heart out, and believeing this to be so he simply laid himself down and wasted away. In a similar way a man at Charlotte Waters came to one of the authors with a slight spear woulnd in his back. He was assured that the wound was not serious, and it was dressed in the usual way, but he persisted in saying that the spear had been sung, and that though it could not be seen yet in reality it had broken his back and he was going to die, which accordingly he did. As a result of this a party was organized among the members of his group to avenge his death, and the man who had wounded him with the charmed weapon was killed. Instances of occurrences such as these could be multiplied, and though of course it is impossible to prove that death would not have followed under any circumstances, that is whether the native had or had not imagined the weapon to have been “sung,” yet with a knowledge of what wounds and what injuries he will survive if he does not suspect the intervention of magic, it is not possible to explain death under such circumstances except as associated directly with the firm belief of the injured man that Arungquiltha has entered his body, and that therefore he must die.”
p. 554: “The use of these objects is a well recognised method of obtaining wives, as is shown by the fact that a man’s right to a woman, secured by means of one or other of them, is supported by the men of his local group, provided always that the woman stands to the man in the relationship of Unawa or lawful wife.”
p. 483: “As to the question of tracking, the idea which has been generally held, that the shoes are used to prevent the tracks being seen will not be regarded as at all satisfactory by those who are acquainted with the remarkable power of the Australian native in this respect. They will neither hide the track nor, though they are shaped alike at each end, will they even suffice to prevent any native who cares to look from seeing at a glance which direction the wearer has come from, or gone towards. Any even moderately experienced native will, without the slighest difficulty, tell from the faintest track—from an upturned stone, a down-bent piece of grass or a twig of shrub—not only that some one has passed by but also the direction in which he has travelled. The only way in which they can be of use in hiding tracks is by preventing it from being recognised who was the particular individual, and in this way they might be of service, for when once an experienced native—almost incredible though it may sound to those who have not had the opportunity of watching them —has seen the track of a man or woman he will distinguish it afterwards from that of any other individual of his acquaintance.”
p. 586/587: “Together with the pitchis made out of the same wood, the shields afford evidence of very considerable manipulative skill, and no small appreciation of beauty of form and symmetry of line on the part of their makers. It may be mentioned here that these shields, or rather the best ones, are the work of men of the Warramunga tribe which inhabits the district in thei neighbourhood of Tennant Creek. They are also made by the northern Arunta, the Ilpirra and Kaitish people. In regard to these Central natives it is a striking feature that men who live in particular districts are famous for making particular forms of implements and weapons, and that this is by no means wholly dependent upon the fact that suitable material for their construction is only to be found in the districts occupied by them. Thus the best pitchis, made of the bean tree, are the work of groups of natives who live out to the west of Alice Springs; the best shields, as we have just said, are those made away to the north, the best spear-throwers are made in the south-west, the best boomerangs away to the east and north-east, and the best spears in the north part of the Arunta tribe, in the Alice Springs district. The western men, for example, though they have the bean tree and make pitchis out of it, get their shields by exchange from the north; the Alice Springs blacks in like manner exchange their spears for the boomerangs of the eastern natives, and so on. Even in the old traditions we find reference to the excellence of the pitchis made by the western natives; in fact, according to tradition, one of the wandering ancestral groups named what is now called Mount Sonder, Urachipma, or the place of pitchis, because here they found an old bandicoot man engaged in making them. The tradition may at any rate be regarded as indicative that this distribution of work is of very old standing. It seems, generally speaking, to be independent of the existence in any particular locality of the material necessary for the manufacture of any particular article. It also shows that great care must be taken in dealing with the various implements which are commonly found amongst any particular tribe. Every Arunta man is sure to have one of these shields, and yet the majority of them have not been made in the tribe, nor, indeed, within a hundred miles of the district occupied by it, but by a tribe speaking a quite qifferent languages. Why certain things, such as shields and boomerangs, should be traded over wide areas and be common to a number of tribes, and why certain other things, such as the spear-throwers, for example, should be local in distribution, it is difficult to understand.”
p. 451: “in fact any one, whatever his or her totem me, may undergo the rite at pleasure, but in the case of just the one totem it is obligatory, or practically so, though at the same time the non-observance of the custom would not prevent any man from being admitted to the secrets of the tribe, but it would subject him to what is most dreaded by the native, and that is the constant ridicule of the other men and women, with whom he is in daily contact.”
p. 285: “The oldest Okilia man now said “Who will be Tapunga?” Two men volunteered, one man a Panunga and the other a Purula. The former at once lay on his stomach on the ground and the latter on the top of him, and when this kind of living table was ready the Kumara Arakurta was led from the Nurtunja, close to which the men had laid down, and then placed lying at full length on his back on top of the Tapunga. As soon as ever he was in position another man sat astride of his body, grasped the penis and put the urethra on the stretch. The operator who is called Pininga and is chosen by the Oknia and Okilia, then approached and quickly, with a stone knife, laid open the urethra from below.”
p. 287: “It very often happens that, as soon as the operation has been performed on an Arakurta, one or more of the younger men present, who have been operated on before, stand up and voluntarily undergo a second operation. In such cases the men do not consider that the incision has been carried far enough. Standing out on the clear space close by the Nurtunja, with legs wide a part and hands behind his back, the man shouts out “*Mura Ariltha atnartinja yinga aritchika pitchi”;—“Mura mine come and cut my Ariltha down to the root.” Then one Mura man comes and pinions him from behind, while another comes up in front and seizing the penis first of all cuts out an oval shaped piece of skin which he throws away and then extends the slit to the root. Most men at some time or other undergo the second operation and some come forward a third time, though a man is often as old as thirty or thirty-five before he submits to his second operation which is called ariltha erlitha atnartinja.”
p. 523: “When any man feels that he is capable of becoming one [medicine man], he ventures away from the camp quite alone until he comes to the mouth of the cave. Here, with considerable trepidation, he lies down to sleep, not venturing to go inside, or else he would, instead of becoming endowed with magic power, be spirited away for ever. At break of day, one of the Iruntarinia comes to the mouth of the cave, and, finding the man asleep, throws at him an invisible lance which pierces the neck from behind, passes through the tongue, making therein a large hole, and then comes out through the mouth. The tongue remains throughout life perforated in the centre with a hole large enough to admit the little finger; and when all is over, the hole is the only visible and outward sign of the treatment of the Iruntarinia. How the hole is really made it is impossible to say, but as shown in the illustration it is always present in the genuine medicine man. In some way of course the novice must make it himself; but naturally no one will ever admit the fact”
p. 528: “The next operation consisted in one of the Nung-gara taking a ‘pointing stick,’ and after having tied some hair string round the middle joint of the first finger of the man’s right hand he forced the pointed end of the stick under the nail and for a considerable distance into the flesh, making thus a hole into which he pretended to press a crystal. The man was then told to keep a finger pressed up against the hole so as to prevent the stone from coming out, after which he was told to remain perfectly quiet and go to sleep.”
p. 485: “If the operation [of knocking out of teeth] be performed on a man he lies down on his back, resting his head on the lap of a sitting man who is his tribal Oknia (elder brother), or else a man who is Unkulla to him (mother’s brother’s son). The latter pinions his arms and then another Okilia or Unkulla fills his mouth with fur-string for the purpose, partly, they say, of absorbing the blood and party of deadening the pain,and partly also to prevent the tooth from being swallowed. The same man then takes a piece of wood, usually the sharp end of a spear, in which there is a hole made, and, pressing it firmly against the tooth, strikes it sharply with a stone. When the tooth is out, he holds it up for an instant so that it can be seen by all, and while uttering a peculiar, rolling, guttural sound throws it away as far as possible in the direction of the Mira Mia Alcherringa, which means the camp of the man’s mother in the Alcheringa.”
p. 486: “When a woman or girl is to be operated on, a little space is cleared near to the main camp where men and women all assemble, except only those who are Mura to the girl. A tribal Okilia sits down and the girl lies with her head in his lap, and the operation is conducted as in the case of the men and boys, being almost always performed by a tribal Okilia. The tooth when taken out is lifted up with the same guttural sound and thrown in the direction of the mother’s Alcheringa camp. The girl now springs to her feet, and seizing a small pitchi which has been placed close at hand for the purpose, fills it with sand, and dancing over the cleared space agitates the pitchi as if she were winnowing seed. When it is emptied she resumes her seat amongst the women.”
p. 129: “In connection with this, it may be worth while noting that amongst the Australian natives with whom we have come in contact, the feeling of sexual jealousy is not developed to anything like the extent to which it would appear to be in many other savage tribes. For a man to have unlawful intercourse with any woman arouses a feeling which is due not so much to jealousy as to the fact that the delinquent has infringed a tribal custom.”
p. 265: “We have amongst the Arunta, Luritcha, and Ilpirra tribes, and probably also amongst others such as the Warramunga, the idea firmly held that the child is not the direct result of intercourse, as it may come without this, which merely, as it were, prepares the mother for the reception and birth also of an already-formed spirit child who inhabits one of the totem centres. Time after time we have questioned them on this point, and always received the reply that the child was not the direct result of intercourse.”
p. 133: “The tradition of the natives is that when the spirit child goes inside a woman the Churinga is dropped. When the child is born the mother tells the father the position of the tree or rock near to which she supposes the child to have entered her, and he, together with one or two of the older men, […] goes to the locality […] and searches for the dropped Churinga. The latter is usually, but not always, supposed to be a stone one marked with a device peculiar to the totem of the spirit child and therefore of the newly-born one.”
p. 202: “A man will only eat very sparingly of his totem, and even if he does eat a little of it, which is allowable to him, he is careful, in the case, for example, of an emu man, not to eat the best part, such as the fat. The totem of any man is regarded, just as it is elsewhere, as the same thing as himself: as a native once said to us when we were discussing the matter with him, ‘that one,’ pointing to his photograph which we had taken, ‘is just the same as me; so is a kangaroo’ (his totem).”
p. 66: “The body is usually smooth with, at most, a development of very fine short hairs only perceptible on close examination, and there may be occasionally a well-marked development of hair on the lip or chin, which is especially noticeable in the old women, some of whom are probably fifty years of age and have reached a stage of ugliness which baffles description.”
p. 72: “As is usual, however, in the case of savage tribes the drudgery of food-collecting and child-bearing tells upon them at an early age, and between twenty and twenty-five they begin to lose their graceful carriage; the face wrinkles, the breasts hang pendulous, and, as a general rule, the whole body begins to shrivel up, until, at about the age of thirty, all traces of an earlier well-formed figure and graceful carriage are lost, and the woman develops into what can only be called an old and wrinkled hag.”
p. 643: “We did not attempt to obtain any skulls, for the simple reason that while the desecration of native graves might have enabled us to secure a few, it would at once have put a stop to work in other branches which we have been as yet more anxious to study than to obtain anthropometric data. To have opened native graves would have meant the closing of sources of information with regard to habits and customs.”
It’s worth flagging that 1899 is extremely old and I wouldn’t expect European authors to do a good job providing an unbiased description of First Nations culture.
Indigenous Australians only received equal voting rights at all levels of government across the country in 1966.
I explicitly read the book trying to be skeptical of the authors’ perspective, but was all-in-all positively surprised by their empiricism. As far as I can tell, they weren’t sensationalizing or exaggerating, and plainly describing what they were able to observe. (One would have to read the book on one’s own to form a proper opinion here). My general impression is that they were describing the Aboriginals like they would describe a group of sophisticated animals.
And the date cuts the other way too: Even during Spencers and Gillens time iron tools had already spread far & wide, so any later reports are afflicted by strong Western influence (of which Spencer wasn’t innocent, he advocated for a precursor policy that (afaiu) resulted in the Stolen Generations. I might also try to read (parts of) the Florentine Codex, not because of its scientific neutrality, but because of the closeness it had to the lived reality of the Aztecs.
I don’t see how this is related to being able to faithfully observing and reporting on Aboriginal customs and behavior.
This raises some questions. These can’t all be true:
Every man has at least one wife
A nonnegligible portion of men have more than one wife
No wife is shared between multiple men
There are a similar number of men and women
so 3 seems not-entirely accurate: they have something resembling group marriage. It’s unclear to me whether 1 holds strictly, or merely a weaker version like “Every man has access to at least one wife, sometimes, in a group arrangement”
as for 4, Aboriginal women marry shortly after puberty while men don’t marry their late 20s or even 30s, which could tilt the scale towards an excess of women in the marriageable pool
The sexism might be better described as rude and politically incorrect honesty. For example, aboriginal women, especially older ones, have often very wide and flat noses (1, 2, 3, 4), which most people around the world likely find ugly (except perhaps aboriginals themselves?) even if they nowadays would not admit that because it would be perceived as cruel and/or politically incorrect. But in 1899 these modern inhibitions didn’t really exist.
If you refuse to eat (or even drink water?) that doesn’t seem so hard to explain?
Yes, self-starvation in is in the causal pathway. I still found this astonishing.
Yup! That is why I rolled disbelief on “psychosomatics” (if you fall of a clif because you belief you can fly and your clan encourages you that is not prototypical psychosomatics). I asked Claude in a somewhat nonleading way (I should have avoided mentioning thirst) and it mentions a paper that the ill are refused water. Valuable in the desert? A way to take revenge on the annoying guy. Who knows.
I was surprised that the selection practice for medicine men seems to select against sincere magical belief. One can imagine an earnest young man sleeping outside the cave night after night waiting for a hole to appear in his tongue, only to give up and conclude the spirits have deemed him unworthy. Only those cynical enough to pierce their own tongue can take on the mantle. It might be a stretch but I think this selection mechanism reveals something about the social role medicine men play, being at least in part to artfully deceive people, whether for the benefit of the patient or the benefit of society
fascinating quotes, thanks for sharing!
Yeah, that is surprising. Very reminiscent of stigmata. Oddly enough, it was only many years after becoming an atheist that it occurred to me these would have to be self-inflicted, and it gives me a weird feeling since I still like saints and want to believe that they’re authentic.
Is it officially “LessWrong” now? Or is it still “Less Wrong”? Does it matter?
I feel like “LessWrong” is more streamlined and futuristic. It’s solid at its center of gravity, like a noun, whereas “Less Wrong” feels inelegant as an object in a sentence (try saying “I read posts on Less Wrong” out loud with equal emphasis on the last two words). But Less Wrong seems to be the name the founders intended. Is it left that way in the Sequences just for historical purposes?
I get the impression of a gradual shift, endorsed but natural, towards “LessWrong”.[1] I think this is the kind of incremental rebranding that non-stagnant organizations undergo naturally.[2] Some people react badly to rebrands (if it ain’t broke, don’t fix it), but they’re a sign of life.
e.g. the titles of the welcome posts.
Organization, as in, the abstract, intangible hub around which the members orbit, which presents itself to the world through a brand, a self-described purpose, an archetype of the person who is a member, etc. It can be a company, a school, a religious group, a collaborative world-building project...
I didn’t use a space back in 2015, but Eliezer did use the version with a space in 2009. So I think this rebrand happened a long time ago.
Scott Alexander uses “Best of Less Wrong” multiple times in a link thread from late April (one time to refer to a post where “LessWrong” is used right at the beginning). Old habits? (To be fair, the Best of Less Wrong page looks kinda like it says “Less Wrong” even though there isn’t a space there.)
Our intention since 2017 has been to rebrand to a single name, rather than two words.
Has anyone had the experience of trying to explain their idea to an LLM, but it fails to grasp the basic concept?
Asking because I don’t feel like this has happened to me (from my limited usage). When it can’t connect the dots, it’s because I haven’t provided enough dots.
(Edit: examples against much appreciated if any come to mind)
I’m not sure if this is the same thing, but I frequently talk to Claude about research ideas, and if the idea is close enough to a different idea that it knows about, it repeatedly collapses back into talking about the idea it’s familiar with.
One I remember from this week:
I’m looking into ways to make intermediate values more visible in the logit lens, and Claude really wants to talk about the tuned lens, which does the opposite of what I want[1]. Even if Claude itself has explained why this doesn’t make any sense, it will repeatedly suggest trying the tuned lens.
I feel like I had another case where it took forever to get it to grasp what I was even talking about, but I don’t remember the details unfortunately.
Specifically, the tuned lens makes the next token’s representation more clear and actively erases anything else.
Thank you for the example, this definitely counts in my mind.