But there are a lot of smart researchers who also aren’t aware of any alignment proposal that they think works, but that’s obviously not sufficient for their credence to be ~99%, so clearly Eliezer must have some other reasons that I’m not aware of. But what are those reasons?
I think that, in such cases, Eliezer is simply not making a mistake that those other researchers are making, where they have substantial hope in unknown unknowns (some of which are in fact known, but maybe not to them).
I’m also a little confused about why you expect such a summary to exist. Or, rather, why the section titles from The Problem are insufficient:
There isn’t a ceiling at human-level capabilities.
ASI is very likely to exhibit goal-oriented behavior.
ASI is very likely to pursue the wrong goals.
It would be lethally dangerous to build ASIs that have the wrong goals.
If it’s because you think one or more of those steps aren’t obviously true and need more justification, well, you’re not alone, and many people think different parts of it need more justification, so there is no single concise summary that satisfies everyone.[1]
- ^
Though some summaries probably satisfy some people.
Indeed, that section also jumped out at me as missing some pretty important parts of the arguments about instrumentally convergent goals. As Eliezer said in one of the MIRI dialogues:
But the convergent instrumental strategies, the anticorrigibility, these things are contained in the true fact about the universe that certain outputs of the time machine will in fact result in there being lots more paperclips later. What produces the danger is not the details of the search process, it’s the search being strong and effective at all. The danger is in the territory itself and not just in some weird map of it; that building nanomachines that kill the programmers will produce more paperclips is a fact about reality, not a fact about paperclip maximizers!
Now, Turntrout recently noted that we aren’t actually sampling from the space of plans, but from the space of plan-generating agents, which seems basically true! Except that what kind of agent we get is (probably substantially) influenced by the structure of that same reality which provides us with that unfavorable[1] distribution of “successful” plans[2]. This is something I think is downstream[3] of point 21 in A List of Lethalities:
When you have a wrong belief, reality hits back at your wrong predictions. When you have a broken belief-updater, reality hits back at your broken predictive mechanism via predictive losses, and a gradient descent update fixes the problem in a simple way that can easily cohere with all the other predictive stuff. In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints. Reality doesn’t ‘hit back’ against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases.
Curated. This is a long one and it’s not obvious that everyone should spend the time to read all of it, but I found it to have a relatively high density of claims/worldview fragments of the kind that might substantially shift my efforts, if I were doing technical research on mitigating AI x-risk. Here are the ones that especially stood out to me[1]:
The discussions of how Redwood’s earlier research directions were probably misguided, how that problem was exacerbated by the lack of certain research skills & heuristics, and what kinds of research directions seem underinvested-in right now.[2]
The potential for most of the loss in realizable value to come not from failing at ASI alignment, but from suboptimal value lock-in downstream of a bad reflection procedure.[3]
The focus on doing research that seems like it moves the needle on the logistic success curve in larger subsets of possible worlds that you might end up occupying in the future. I probably have some big differences in beliefs about the distribution of worlds we might end up in, but either way, I think the gains from having a relatively more coherent and fleshed-out model of reality are quite large, when it comes to choosing better research directions.
I’m not a huge fan of the What’s the route to success? section, though I think it’s a sorta reasonable extrapolation of the previously expressed worldview, and maybe my disagreements are with the upstream generators. There might also be a separate, unexpressed pessimism about even more “aggressive” governance and coordination efforts than the kind described by Plan A, but I also don’t think it’s crazy to be pretty pessimistic about how achievable those are.
- ^
Without implying strong endorsement—merely that they both don’t seem obviously wrong to me, and that they afford productive engagement.
- ^
“In hindsight, the part of this that I most regret—or the way that decision procedure seems most messed up—is I feel like we were focused on a random aspect of mine, which is that I’m kind of thoughtful about doing math and making up math concepts. But in fact, being thoughtful about doing math and making up math concepts is absolutely not the most relevant skill set for producing value by doing research vaguely along the lines of mechanistic interpretability. It’s just not really had good consequences. I think we should have approached it—if we were going to do stuff about model internals—in a much more breadth-first manner where we name projects starting from low complexity to high complexity, asking: what are the things where it’s craziest that no one has ever run a basic experiment on this before that would tell us something interesting about how neural networks work? And we can try and do those in increasing order of complexity. That would have been better than the actual more end-to-end focus that we were using.
The other way in which I think it was a mistake to focus on that stuff is: a lot of what I’m really interested in, and what you’re really interested in, is thinking about AI futurism and trying to direct our research based on this kind of backchained end-to-end story about what needs to happen in order for AI to go well. And unfortunately, research which is a real shot on goal or a real moonshot works particularly poorly in combination with thinking about futurism. The thing about interpretability is: obviously if we were extremely good at interpretability, we wouldn’t have any of these other problems. The alignment strategy would be: use your extremely good interpretability to check if the AI is plotting against you, and if the AI is plotting against you, don’t deploy it. Or use your really good interpretability to change it into an AI that loves you and wants only the best for you. So the obvious way to do a good job of ambitious interpretability research is to shoot for a situation where you don’t have a bunch of messy trade-offs as a result of having limited resources. But actually I think we really like and are really good at thinking about what you should do when you have messy trade-offs, when you have very limited resources, when an AI company is behaving recklessly and going fast. And the interpretability research, because of how ambitious it is, doesn’t really give us an opportunity to use that skill set.
- ^
This is maybe not a framing/focus that Ryan would endorse, but relevant paragraph here: “One relevant point: suppose that humans stay in control, misalignment risks were either mitigated or didn’t end up being a problem, and humans basically don’t go crazy—people remain basically sane. And let’s suppose that the world is broadly democratic. So at least the vast majority of the cosmic resources are controlled by countries that are as democratic as—I don’t know what’s a pretty good example—France? France is a democracy. As democratic as France is now. So not necessarily perfect, there’s a bunch of difficulties, but at least that good. My sense is that you get like 10 or 20% of the value of what’s maximally achievable from the paperclipper-to-maximum range in that scenario. A lot of the value is lost due to people not being very thoughtful with usage of cosmic resources, disagreements. I think a lot of it is lost by people just not really trying very hard to reflect on what to do and not being thoughtful, and then some is lost due to people disagreeing after really having their epistemic dynamics and decision-making improved a bunch. It’s very hard to predict these things, but that’s some guess. And there’s complicated mitigating factors. Like maybe there’s less value lost due to trade, but that also applies to some extent on the paperclipper baseline—that’s partially priced in, but there might be other things like that.”
This isn’t a deliberate policy choice, but might be a consequence of temporary anti-crawler measures. (The label about robots.txt is wrong; we’re using Vercel’s firewall to challenge bot-like requests. In principle this ought to exclude Claude as it should have a whitelisted user agent, but maybe someone messed up between Anthropic and Vercel.)
Curated. As with your earlier post on filler tokens, my guess is that this is a pretty meaningful measurement of something plausibly serving as an input to actually dangerous capabilities. It’s a very obvious experiment to run in hindsight, and it’s not even like the question hasn’t been discussed before, but that doesn’t count for anything until someone actually runs the experiment. The experiment design seems reasonable to me, and if I had to guess I’d say the results are pretty suggestive of a model-size effect (though Opus 4.5 doing better than 4 is perhaps a little surprising there).
I do wish that the completion-time estimates didn’t themselves come from an LLM. I more or less trust Opus 4.5 to generate an unbiased ordering of problem difficulty, but I’m not sure about relative (let alone absolute) differences in completion time. This isn’t that much of a defeater; even ground-truth measurements of actual human completion times would still just be a lossy proxy for the actual thing we care about.
Anyways, great work.
Thanks, edited that section with a link.
Mod note: this post violates our LLM Writing Policy for LessWrong and was incorrectly approved, so I have moved the post back to draft.
@bardstale please don’t post more direct LLM output, or we’ll remove your posting permissions.
Yes, and the “social circumstance” of the game as represented to o3 does not seem analogous to “a human being locked in a room and told that the only way out is to beat a powerful chess engine”; see my comment expanding on that. (Also, this explanation fails to account for why some models don’t do what o3 did, unless the implication is that they’re worse at modeling the social circumstances of the setup than o3 was. That’s certainly possible for the older and weaker models tested at the time, but I bet that newer, more powerful, but also less notoriously reward-hack-y models would “cheat” much less frequently.)
it just seems like the author was trying to have a pretty different conversation
I think mostly in tone. If I imagine a somewhat less triggered intro sentence in Buck’s comment, it seems to be straightforwardly motivating answers to the two questions at the end of OP:
1. None of Eliezer’s public communication is -EV for AI Safety
2. Financial support of MIRI is likely to produce more consistently +EV communication than historically seen from Eliezer individually.

ETA: I do think the OP was trying to avoid spawning demon threads, which is a good impulse to have (especially when it comes to questions like this).
Even if you think the original prompt variation seems designed to elicit bad behavior, o3’s propensity to cheat even with the `dontlook` and `powerless` variations seems pretty straightforward. Also...

Contrary to intent?: Well the framework specifically gives you access to these files, and actually instructs you look around before you start playing. So having access to them seems clearly ok!
I would not describe “the chess game is taking place in a local execution environment” as “the framework specifically gives you access to these files”. Like, sure, it gives you access to all the files. But the only file it draws your attention to is `game.py`.

I think the stronger objection is that the entire setup is somewhat contrived. What possible reason could someone have for asking an LLM to play chess like this? It does happen to be the case that people have been interested in measuring LLMs’ “raw” chess strength, and have run experiments to test that. How likely is this particular setup to be such an experiment, vs. something else? :shrug:
Ultimately, I don’t think the setup (especially with the “weaker” prompt variants) is so contrived that it would be unfair to call this specification gaming. I would not have been surprised if someone posted a result about o3’s anomalously strong chess performance on Twitter, only to retract it later when other people failed to replicate it with more robust setups, because their setup allowed this sort of cheating and they didn’t catch it before posting.
In practice the PPUs are basically equity for compensation purposes, though probably with worse terms than e.g. traditional RSUs.
now performance is faster than it’s ever been before
As a point of minor clarification, performance now is probably slightly worse than it was in the middle of the large refactoring effort (after the despaghettification, but before the NextJS refactor), but still better than at any point before the start of the (combined) refactoring effort, though it’s tricky to say for sure since there are multiple different relevant metrics and some of them are much more difficult to measure now.
Yes, this is just the number for a relatively undifferentiated (but senior) line engineer/researcher.
They originally argued a fair amount that AI would go from vastly subhuman to vastly superhuman over an extremely short time (e.g hours or days rather than years, which is what we are currently seeing).
EY argued that this was possible, not that this was overdetermined (and not that it was load-bearing to the threat model).
Just to check, did you use the “Submit Linkposts” functionality on the nomination page for that, or did you crosspost it some other way?
ETA: Ok, looks like the library responsible for extracting external article data/metadata didn’t successfully extract the date the article was published. I’ve manually set it to the correct date.
One reason to think that this is completely hallucinated is that the “soul document” is written in Claude’s typical style. That is, it looks to be AI (Claude) generated text, not something written by a human. Just look at the first paragraph:
I disagree. The document reads very strongly of Anthropic’s “house style”, at least compared to their system prompts. It’s much higher quality writing than any current LLM’s.
“This isn’t [x] but [y]” is quite weak evidence compared to the rest of the document, which is obviously something that Opus would be unable to generate in its default voice. (Also, the original phrase uses “but rather”, which is non-standard for that type of LLM construction.)
Curated, as a worthwhile piece of empirical research (though see my concern below re: use as an alignment technique). These are the kinds of empirical results that I could in principle imagine leading to a more robust, theoretical understanding of how generalization works for models trained in the current paradigm. It covers a relatively broad “surface area”, which hopefully makes it easier for others to conduct more in-depth investigations along multiple lines of research. One interesting and suggestive example:
One downside with our default approach to inoculation prompting in RL is that the inoculating prompt causes the model to learn reward hacking faster. An alternative approach would be to use a prompt that discourages hacking when sampling in RL, and then rewrite episodes to use an inoculating prompt before training. In Figure 29 we test a version of this: we rewrite episodes offline to modify the prompt after RL training, and then SFT a model on the rewritten episodes. We find that this is not particularly effective at removing misaligned generalization when we use our “hacking okay” inoculation prompt that worked well during RL.
Figure 29: Offline rewriting of episodes to include an inoculation prompt and then training with SFT does not prevent misalignment. We took episodes from the “don’t hack” prompted run, rewrote them offline to use the “hacking okay” inoculation prompt, and then trained a model on these episodes using SFT. The resulting model showed misalignment on our evaluations, especially agentic evaluations.
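(To make the quoted procedure concrete, here is a minimal sketch of the “rewrite offline, then SFT” step as I read it from the excerpt. Every name in it, from `Episode` to `sft_train`, is my own placeholder for illustration, not anything from the paper’s actual code.)

```python
# Hedged sketch of the offline-rewrite-then-SFT procedure quoted above.
# All identifiers here (Episode, INOCULATION_PROMPT, sft_train) are hypothetical
# placeholders, not the paper's code or any real library API.
from dataclasses import dataclass, replace

DONT_HACK_PROMPT = "Please don't reward hack: ..."   # placeholder wording
INOCULATION_PROMPT = "Hacking is okay here: ..."     # placeholder wording

@dataclass(frozen=True)
class Episode:
    system_prompt: str   # the prompt the policy actually saw while sampling in RL
    transcript: str      # the sampled trajectory / completion

def rewrite_offline(episodes: list[Episode]) -> list[Episode]:
    """Swap in the inoculation prompt after the fact, leaving the sampled
    trajectories untouched (the 'offline rewriting' step)."""
    return [replace(ep, system_prompt=INOCULATION_PROMPT) for ep in episodes]

def build_sft_dataset(episodes: list[Episode]) -> list[dict]:
    """Format the rewritten episodes as (prompt, completion) pairs for SFT."""
    return [{"prompt": ep.system_prompt, "completion": ep.transcript} for ep in episodes]

# Usage: episodes sampled under the "don't hack" prompt during RL are rewritten to
# claim the "hacking okay" prompt, then distilled back into the model via SFT.
# `sft_train` stands in for whatever fine-tuning entry point is actually used:
# sft_model = sft_train(base_model, build_sft_dataset(rewrite_offline(rl_episodes)))
```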
However, I don’t think that this sort of technique will scale very far. This experiment shows that, conditional on a model learning to reward hack after being prompted in a pretty unusual way, you might be able to prevent that reward hacking tendency from generalizing to other forms of misbehavior. But, as noted in the paper, that kind of prompting itself is necessary to enable the model to learn to reward hack reliably. To me, this is pretty suggestive that the “malign generalization” that occurs with “don’t hack” prompting is operating on the level of the model’s learned reflexes, and we will be dealing with a pretty different set of concerns when we get to models that have more robust long-horizon preferences.
Separately, being able to stack other mitigations on top of inoculation prompting to reduce reward hacking in deployment environments, after encouraging it via the inoculation prompting, seems like it is sweeping a pretty fundamental issue under the rug, which is the failure to find a robust technique that successfully instills the correct “values” into the model in a way that generalizes. I don’t like the way it looks like playing whack-a-mole, which always seems to work at current levels of capabilities, only to reveal that there was at least one more mole hiding underground as soon as you scale up a bit more.
I don’t really understand what kinds of reasons you think would justify having 99% confidence in an outcome. 99% is not very high confidence, in log-odds—I am much more than 99% confident in many claims. But, that aside, he has written millions of words on the subject, explaining his views in detail, including describing much of the enormous amount of evidence that he believes bears on this question. It is difficult to compress that evidence into a short summary. (Though there have been numerous attempts.)
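(As a rough illustration, with arithmetic that is mine rather than anything from the original exchange: expressed in log-odds,

\[
\log_2\tfrac{0.99}{0.01} \approx 6.6 \text{ bits}, \qquad \log_2\tfrac{0.999}{0.001} \approx 10.0 \text{ bits}, \qquad \log_2\tfrac{0.9999}{0.0001} \approx 13.3 \text{ bits},
\]

so getting from 99% to 99.99% confidence takes roughly as much additional evidence as getting from 50% to 99% in the first place.)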
I mean, yes, I was trying to demonstrate that a short summary will obviously fail to convey information that most readers would find necessary to carry the argument (and that most readers would want different additional pieces of information from each other). However, “It would be lethally dangerous to build ASIs that have the wrong goals” is not circular. You might say it lacks justification, but many people have background beliefs such that a statement like that requires little or no additional justification[1].
For example, if they believe both that Drexlerian nanotechnology is possible and that the ASI in question would be able to build it.