LessWrong dev & admin as of July 5th, 2022.
RobertM
This isn’t a deliberate policy choice, but might be a consequence of temporary anti-crawler measures. (The label about robots.txt is wrong; we’re using Vercel’s firewall to challenge bot-like requests. In principle this ought to exclude Claude as it should have a whitelisted user agent, but maybe someone messed up between Anthropic and Vercel.)
Curated. Like your earlier post on filler tokens, my guess is that this is a pretty meaningful measurement of something plausibly serving as an input to actually dangerous capabilities. It’s a very obvious experiment to run in hindsight, and it’s not even like the question hasn’t been discussed before, but that doesn’t count for anything until someone actually runs the experiment. The experiment design seems reasonable to me, and if I had to guess I’d say the results are pretty suggestive of model size (though Opus 4.5 doing better than 4 is perhaps a little surprising there).
I do wish that the estimated completion times didn’t themselves come from LLM estimates. I more or less trust Opus 4.5 to generate an unbiased ordering of problem difficulty, but I’m not sure about relative (let alone absolute) differences in completion time. This isn’t that much of a defeater; even the ground-truth measurement of actual human completion times would still just be a lossy proxy for the actual thing we care about.
Anyways, great work.
Thanks, edited that section with a link.
Mod note: this post violates our LLM Writing Policy for LessWrong and was incorrectly approved, so I have moved the post back to draft.
@bardstale please don’t post more direct LLM output, or we’ll remove your posting permissions.
Yes, and the “social circumstance” of the game as represented to o3 does not seem analogous to “a human being locked in a room and told that the only way out is to beat a powerful chess engine”; see my comment expanding on that. (Also, this explanation fails to explain why some models don’t do what o3 did, unless the implication is that they’re worse at modeling the social circumstances of the setup than o3 was. That’s certainly possible for the older and weaker models tested at the time, but I bet that newer, more powerful, but also less notoriously reward-hack-y models would “cheat” much less frequently.)
it just seems like the author was trying to have a pretty different conversation
I think mostly in tone. If I imagine a somewhat less triggered intro sentence in Buck’s comment, it seems to be straightforwardly motivating answers to the two questions at the end of OP:
1. None of Eliezer’s public communication is -EV for AI Safety
2. Financial support of MIRI is likely to produce more consistently +EV communication than historically seen from Eliezer individually.ETA: I do think the OP was trying to avoid spawning demon threads, which is a good impulse to have (especially when it comes to questions like this).
Even if you think the original prompt variation seems designed to elicit bad behavior, o3′s propensity to cheat even with the
dontlookandpowerlessvariations seems pretty straightforward. Also...Contrary to intent?: Well the framework specifically gives you access to these files, and actually instructs you look around before you start playing. So having access to them seems clearly ok!
I would not describe “the chess game is taking place in a local execution environment” as “the framework specifically gives you access to these files”. Like, sure, it gives you access to all the files. But the only file it draws your attention to is
game.py.I think the stronger objection is that the entire setup is somewhat contrived. What possible reason could someone have for asking an LLM to play chess like this? It does happen to be the case that people have been interested in measuring LLMs’ “raw” chess strength, and have run experiments to figure test that. How likely is this particular setup to be such an experiment, vs. something else? :shrug:
Ultimately, I don’t think the setup (especially with the “weaker” prompt variants) is so contrived that it would be unfair to call this specification gaming. I would not have been surprised if someone posted a result about o3′s anomalously strong chess performance on Twitter, only to retract it later when other people failed to replicate it with more robust setups, because their setup allowed this sort of cheating and they didn’t catch it before posting.
In practice the PPUs are basically equity for compensation purposes, though probably with worse terms than e.g. traditional RSUs.
now performance is faster than it’s ever been before
As a point of minor clarification, performance now is probably slightly worse than it was in the middle of the large refactoring effort (after the despaghettification, but before the NextJS refactor), but still better now than at any point before the start of (combined) refactoring effort, though it’s tricky to say for sure since there are multiple different relevant metrics and some of them are much more difficult to measure now.
Yes, this is just the number for a relatively undifferentiated (but senior) line engineer/researcher.
They originally argued a fair amount that AI would go from vastly subhuman to vastly superhuman over an extremely short time (e.g hours or days rather than years, which is what we are currently seeing).
EY argued that this was possible, not that this was overdetermined (and not that it was load-bearing to the threat model).
Just to check, did you use the “Submit Linkposts” functionality on the nomination page for that, or did you crosspost it some other way?
ETA: Ok, looks like the library responsible for extracting external article data/metadata didn’t successfully extract the date the article was published. I’ve manually set it to the correct date.
One reason to think that this is completely hallucinated is that the “soul document” is written in Claude’s typical style. That is, it looks to be AI (Claude) generated text, not something written by a human. Just look at the first paragraph:
I disagree. The document reads very strongly of Anthropic’s “house style”, at least compared to their system prompts. It’s much higher quality writing than any current LLM’s.
“This isn’t [x] but [y]” is quite weak evidence compared to the rest of it being obviously something that Opus would be unable to generate in its default voice. (Also, the original phrase uses “but rather”, which is non-standard for that type of LLM construction.)
Curated, as a worthwhile piece of empirical research (though see my concern below re: use as an alignment technique). These are the kinds of empirical results that I could in principle imagine leading to a more robust, theoretic understanding of how generalization works for models trained in the current paradigm. It covers a relatively broad “surface area”, which hopefully makes it easier for others to conduct more in-depth investigations along multiple lines of research. One interesting and suggestive example:
One downside with our default approach to inoculation prompting in RL is that the inoculating prompt causes the model to learn reward hacking faster. An alternative approach would be to use a prompt that discourages hacking when sampling in RL, and then rewrite episodes to use an inoculating prompt before training. In Figure 29 we test a version of this: we rewrite episodes offline to modify the prompt after RL training, and then SFT a model on the rewritten episodes. We find that this is not particularly effective at removing misaligned generalization when we use our “hacking okay” inoculation prompt that worked well during RL.
Figure 29: Offline rewriting of episodes to include an inoculation prompt and then training with SFT does not prevent misalignment. We took episodes from the “don’t hack” prompted run, rewrote them offline to use the “hacking okay” inoculation prompt, and then trained a model on these episodes using SFT. The resulting model showed misalignment on our evaluations, especially agentic evaluations.)
However, I don’t think that this sort of technique will scale very far. This experiment shows that, conditional on a model learning to reward hack after being prompted in a pretty unusual way, you might be able to prevent that reward hacking tendency from generalizing to other forms of misbehavior. But, as noted in the paper, that kind of prompting itself is necessary to enable the model to learn to reward hack reliably. To me, this is pretty suggestive that the “malign generalization” that occurs with “don’t hack” prompting is operating on the level of the model’s learned reflexes, and we will be dealing with a pretty different set of concerns when we get to models that have more robust long-horizon preferences.
Separately, being able to stack other mitigations on top of inoculation prompting to reduce reward hacking in deployment environments, after encouraging it via the inoculation prompting, seems like it is sweeping a pretty fundamental issue under the rug, which is the failure to find a robust technique that successfully instills the correct “values” into the model in a way that generalizes. I don’t like the way it looks like playing whack-a-mole, which always seems to work at current levels of capabilities, only to reveal that there was at least one more mole hiding underground as soon as you scale up a bit more.
I think there’s at least a few senses in which “we” don’t “know” how colds spread:
The “state of the art” in terms of “scientific knowledge” seems slightly underconfident about small-particle transmission maybe not being a meaningful factor in practice, given the available evidence.
As a society, I don’t think there’s well-established common knowledge of our best guess of the relative likelihood of transmission by various routes. i.e. what would happen if you asked multiple different doctors (GPs) how to best avoid catching a cold in various situations? I, personally, expect that you’d get overly-broad (and therefore expensive-to-follow) advice, with relatively poor inter-rater agreement, compared to if you asked them how to avoid e.g. some specific STDs.
Our “best guess” is extremely flimsy, and the sort of thing that I could imagine a single well-designed, medium-sized study overturning. (See the various caveats about how many different viruses cause “colds”, questions about humidity/viral load/etc in other comments, and so on.) This is not the kind of situation where I can tell someone “do [x] and you’ll reduce your risk by 80%” and feel at all confident about it! Or, like, I can in fact give them an extremely broad [x], but in that case I’m pretty sure that I’d be destroying a bunch of value as a result of empirical uncertainty which seems in-principle quite possible to resolve, with a sufficient application of resources.
I am being lazy and not reading all the papers you referenced—do many of them discuss the viral load of the person who is infected?
I think a couple of them did; I don’t remember if any of them found strong effects. Might ask a language model to check later—agree that this seems like one of those big open questions that could imply huge differences in worthwhile interventions, though I think that if viral load turns out to be the only key factor in transmission likelihood, such that other interventions have basically no effect if effectuated w.r.t. spreaders with high viral loads, that’s pretty bad news, since testing for high viral load might be much harder/more expensive than e.g. putting on a mask if it turns out that “large particulates” are most of the problem in the case of a specific illness. (Though I guess we do still have the problem of knowing what illness a given person has, to know what intervention to apply...)
Very reasonable question—having a user-specific header does make it tricky to have whole-page prerendering/caching, but we’re working on making partial prerendering worthwhile (mostly for post pages, which admittedly do change more often, due to comments and votes, though the cache hit rate would still be pretty high).
In the case of collection pages like R: A-Z, we also have user-specific info like read statuses indicated in checkboxes, which are part of the expensive query responsible for hydrating most of the page contents. We could split that out, and since we did recently enable cacheComponents it’s possible that’d allow us to cache much of the page, but I’m not sure if it’s worth prioritizing.
linear progress
The bottom line is still exponential, not linear—it’s just that the top line is superexponential!
Curated. This is a long one and it’s not obvious that everyone should spend the time to read all of it, but I found it to have a relatively high density of claims/worldview fragments of the kind that might substantially shift my efforts, if I were doing technical research on mitigating AI x-risk. Here are the ones that especially stood out to me[1]:
The discussions of how Redwood’s earlier research directions were probably misguided, how that problem was exacerbated by the lack of certain research skills & heuristics, and what kinds of research directions seem underinvested-in right now.[2]
The potential for most of the loss in realizable value to come not from failing at ASI alignment, but from suboptimal value lock-in downstream of a bad reflection procedure.[3]
The focus on doing research that seems like it moves the needle on the logistic success curve in larger subsets of possible worlds that you might end up occupying in the future. I probably have some big differences in beliefs about the distribution of worlds we might end up in, but either way, I think the gains from having a relatively more coherent and fleshed-out model of reality are quite large, when it comes to choosing better research directions.
I’m not a huge fan of the What’s the route to success? section, though I think it’s a sorta reasonable extrapolation of the previously expressed worldview, and maybe my disagreements are with the upstream generators. There might also be a separate, unexpressed pessimism about even more “aggressive” governance and coordination efforts than the kind described by Plan A, but I also don’t think it’s crazy to be pretty pessimistic about how achievable those are.
Without implying strong endorsement—merely that they both don’t seem obviously wrong to me, and that they afford productive engagement.
“In hindsight, the part of this that I most regret—or the way that decision procedure seems most messed up—is I feel like we were focused on a random aspect of mine, which is that I’m kind of thoughtful about doing math and making up math concepts. But in fact, being thoughtful about doing math and making up math concepts is absolutely not the most relevant skill set for producing value by doing research vaguely along the lines of mechanistic interpretability. It’s just not really had good consequences. I think we should have approached it—if we were going to do stuff about model internals—in a much more breadth-first manner where we name projects starting from low complexity to high complexity, asking: what are the things where it’s craziest that no one has ever run a basic experiment on this before that would tell us something interesting about how neural networks work? And we can try and do those in increasing order of complexity. That would have been better than the actual more end-to-end focus that we were using.
The other way in which I think it was a mistake to focus on that stuff is: a lot of what I’m really interested in, and what you’re really interested in, is thinking about AI futurism and trying to direct our research based on this kind of backchained end-to-end story about what needs to happen in order for AI to go well. And unfortunately, research which is a real shot on goal or a real moonshot works particularly poorly in combination with thinking about futurism. The thing about interpretability is: obviously if we were extremely good at interpretability, we wouldn’t have any of these other problems. The alignment strategy would be: use your extremely good interpretability to check if the AI is plotting against you, and if the AI is plotting against you, don’t deploy it. Or use your really good interpretability to change it into an AI that loves you and wants only the best for you. So the obvious way to do a good job of ambitious interpretability research is to shoot for a situation where you don’t have a bunch of messy trade-offs as a result of having limited resources. But actually I think we really like and are really good at thinking about what you should do when you have messy trade-offs, when you have very limited resources, when an AI company is behaving recklessly and going fast. And the interpretability research, because of how ambitious it is, doesn’t really give us an opportunity to use that skill set.
This is maybe not a framing/focus that Ryan would endorse, but relevant paragraph here: “One relevant point: suppose that humans stay in control, misalignment risks were either mitigated or didn’t end up being a problem, and humans basically don’t go crazy—people remain basically sane. And let’s suppose that the world is broadly democratic. So at least the vast majority of the cosmic resources are controlled by countries that are as democratic as—I don’t know what’s a pretty good example—France? France is a democracy. As democratic as France is now. So not necessarily perfect, there’s a bunch of difficulties, but at least that good. My sense is that you get like 10 or 20% of the value of what’s maximally achievable from the paperclipper-to-maximum range in that scenario. A lot of the value is lost due to people not being very thoughtful with usage of cosmic resources, disagreements. I think a lot of it is lost by people just not really trying very hard to reflect on what to do and not being thoughtful, and then some is lost due to people disagreeing after really having their epistemic dynamics and decision-making improved a bunch. It’s very hard to predict these things, but that’s some guess. And there’s complicated mitigating factors. Like maybe there’s less value lost due to trade, but that also applies to some extent on the paperclipper baseline—that’s partially priced in, but there might be other things like that.”