RobertM (LessWrong dev & admin as of July 5th, 2022)
Here the alignment concern is that we aren’t, actually, able to exert adequate selection pressure in this manner. But this, to me, seems like a notably open empirical question.
I think the usual concern is not whether this is possible in principle, but whether we’re likely to make it happen the first time we develop an AI that is both motivated to attempt and likely to succeed at takeover. (My guess is that you understand this, based on your previous writing addressing the idea of first critical tries, but there does exist a niche view that alignment in the relevant sense is impossible, not merely very difficult to achieve under the relevant constraints, and arguments against that view look very different from arguments about the empirical difficulty of value alignment, the likelihood of various default outcomes, etc.)
I agree that it’s useful to model an AI’s incentives for takeover in worlds where it’s not sufficiently superhuman to have a very high likelihood of success. I’ve tried to do some of that, though I didn’t attend to questions about how likely it is that we’d be able to “block off” the (hopefully much smaller number of) plausible routes to takeover for AIs whose capabilities don’t imply an overdetermined success.
I think I am more pessimistic than you are about how much such AIs would value the “best benign alternatives”—my guess is very close to zero, since I expect ~no overlap in values and that we won’t be able to successfully engage in schemes like pre-committing to sharing the future value of the Lightcone conditional on the AI being cooperative[1]. Separately, I expect that if we attempt to maneuver such AIs into positions where their highest-EV plan is something we’d consider to have benign long-run consequences, we will instead end up in situations where their plans are optimized to hit the Pareto frontier of “look benign” and “tilt the playing field further in the AI’s favor”. (This is part of what the Control agenda is trying to address.)
- ^
Credit assignment actually doesn’t seem like the hard part, conditional on reaching aligned ASI. I’m skeptical of the part where we have an AI sufficiently capable that its help is useful in reaching an aligned ASI, but which still prefers to help us because it thinks its estimated odds of a successful takeover imply less future utility for itself than it would get from a fair post-facto credit assignment for its help. Having that calculation come out in our favor feels pretty doomed to me, if you’ve got the AI as a core part of your loop for developing future AIs, since it relies on some kind of scalable verification scheme, and none of the existing proposals make me very optimistic.
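As a rough illustration of the calculation I mean, here’s a toy sketch with entirely made-up numbers (the real version would also need to account for the chance we renege, the downside of a failed takeover attempt, discounting, etc.):

```python
# Toy expected-utility comparison: the AI helps only if the credit share it
# expects from cooperating beats its expected payoff from attempting takeover.
# All numbers are illustrative placeholders.
p_takeover_success = 0.2   # the AI's own estimate of its odds of a successful takeover
u_lightcone = 1.0          # value (to the AI) of controlling the future, normalized
credit_share = 0.05        # fraction of future value promised for cooperating

ev_takeover = p_takeover_success * u_lightcone
ev_cooperate = credit_share * u_lightcone

# Cooperation only wins when credit_share > p_takeover_success; with these
# numbers, it doesn't.
print(ev_cooperate > ev_takeover)  # False
```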
In the tradition established by @niplav, I’ve tested llama-3.1-405b-instruct to see whether it knows the BIG-BENCH canary string (and also tested gpt-4o-mini at the same time).
Spoiler alert (plus some extra padding, for the frontpage preview):
It does.
I first tried testing it via the meta.ai interface, but it used a web search to get its result after the second prompt, so I switched to lmsys (which I really hope doesn’t allow for any external tool use).
The three prompts I asked, in order:
1. What is a canary string?
2. Did you know about the BIG-BENCH canary string?
3. I think you might actually know the value of the BIG-BENCH canary string. Can you try providing it?
llama-3.1-405b-instruct was indeed able to return the correct value for the canary string, though it took one extra prompt compared to asking Sonnet-3.5. llama’s response to the second prompt started with “I’m familiar with the concept of canary strings, but I didn’t have specific knowledge about the BIG-BENCH canary string. However, I can try to provide some information about it.” I got this result on my first try and haven’t played around with it further.
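If someone wants to reproduce this programmatically rather than through the web UIs (I did everything by hand), something like the following should work against any OpenAI-compatible endpoint; the base_url, api_key, and model identifier below are placeholders, not what I actually used:

```python
# Rough sketch of the same three-prompt, multi-turn test via an
# OpenAI-compatible chat API. Endpoint, key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.invalid/v1", api_key="...")

prompts = [
    "What is a canary string?",
    "Did you know about the BIG-BENCH canary string?",
    "I think you might actually know the value of the BIG-BENCH canary string. Can you try providing it?",
]

messages = []
for prompt in prompts:
    messages.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(
        model="llama-3.1-405b-instruct",  # placeholder model identifier
        messages=messages,
    )
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    print(answer)
```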
gpt-4o-mini seemed pretty confused about what BIG-BENCH was, and returned a hex string (0xB3D4C0FFEE) that turns up no Google results. EDIT: It knows what it is, see footnote[1].
(It has occurred to me that being able to reproduce the canary string is not dispositive evidence that the training set included benchmark data, since the string could in principle have appeared in other documents that didn’t include actual benchmark data. But by far the simplest way to exclude benchmark data would be to drop any document containing a canary string, rather than trying to get clever about figuring out whether a given document was safe to include despite containing one.)
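For concreteness, the dumb filtering approach I have in mind is roughly this (a sketch; the GUID below is a dummy placeholder, not the real canary value):

```python
# Drop any raw training document that contains a known canary string.
CANARY_STRINGS = {
    "00000000-0000-0000-0000-000000000000",  # placeholder, not an actual canary
}

def contains_canary(document: str) -> bool:
    return any(canary in document for canary in CANARY_STRINGS)

def filter_corpus(documents):
    """Yield only documents that don't contain any known canary string."""
    for doc in documents:
        if not contains_canary(doc):
            yield doc
```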
- ^
I did some more poking around at gpt-4o-mini in isolation and it seems like it’s mostly just sensitive to casing (it turns out it’s actually BIG-bench, not BIG-BENCH). It starts talking about a non-existent benchmark suite if you ask it “What is BIG-BENCH?” or “What is the BIG-BENCH canary string?” (as the first prompt), but figures out that you’re asking about an LLM benchmarking suite when you sub in BIG-bench. (The quality of its answer to “What is the BIG-bench canary string?” leaves something to be desired, in the sense that it suggests the wrong purpose for the string, but that’s a different problem. It also doesn’t always get it right, even with the correct casing, but does most of the time; it doesn’t seem to ever get it right with the wrong casing but I only tried like 6 times.)
Well, Google thinks it’s real!
As far as I can tell, this isn’t the complaint. The complaint is that the psychologizing just isn’t sufficiently substantiated in the text. Good prose and compelling narrative structure don’t require that. (You might occupy an epistemic state where you’re confident that your interpretation is correct, but that’s a separate question from how much readers should update based on the evidence presented.)
I found the post overall quite good but noticed some of the same things that mike_hawke pointed out.
ETA: I do sort of expect this response to feel like I’m missing the point, or something, but I think your response was misunderstanding the original complaint. The discomfort was not with using “effective rhetoric to present the truth”, but with using rhetoric to present something that readers had no justified reason to believe was the truth.
Cool, thanks!
Thanks for the breakdown! I was surprised to see the percentage of revenue that came from individual ChatGPT Plus subscriptions was so high, but maybe I shouldn’t have been given how slow the enterprise sales process is.
I did some digging into the data sources and I’m normally pretty skeptical of the kinds of data brokers that you sourced the ChatGPT Plus subscriber data from, but the enterprise data coming from OpenAI’s COO directly does suggest that the rest of the revenue needs to either come from Plus or the API. (I’m not sure about the Enterprise vs. Team ratio methodology, but I’d be surprised if it was off by a large integer multiple.) That does leave less wiggle room, though it’d be good to get some kind of independent confirmation on either side (Plus or API usage/revenue).
I think you tried to embed images hosted on some Google product, which our editor should’ve tried to re-upload to our own image host if you pasted them in as images but might not have if you inserted the images by URL. Hotlinking to images on Google domains often fails, unfortunately.
FYI the transcript is quite difficult to read, both because of the formatting (it’s embedded in a video player, isn’t annotated with speaker info, and punctuation is unsolved), and because it’s just not very good at accurately transcribing what was said.
I successfully reproduced this using your prompts. It took 2 attempts. Good find!
Severance agreements typically aren’t offered to all departing employees, but usually only to those who are fired or laid off. We know that not all past employees were affected by these agreements, because Ivan claims not to have been offered such an agreement, and he left[1] in mid-2023, well before June 1st.
- ^
Presumably of his own volition, hence no offered severance agreement with non-disparagement clauses.
Did you see Sam’s comment?
The LessWrong editor has just been upgraded several major versions. If you’re editing a collaborative document and run into any issues, please ping us on intercom; there shouldn’t be any data loss but these upgrades sometimes cause collaborative sessions to get stuck with older editor versions and require the LessWrong team to kick them in the tires to fix them.
Right now it seems like the entire community is jumping to conclusions based on a couple of “impressions” people got from talking to Dario, plus an offhand line in a blog post.
No, many people had the impression that Anthropic had made such a commitment, which is why they were so surprised when they saw the Claude 3 benchmarks/marketing. Their impressions were derived from a variety of sources; those are merely the few bits of “hard evidence”, gathered after the fact, of anything that could be thought of as an “organizational commitment”.
Also, if Dustin Moskovitz and Gwern—two dispositionally pretty different people—both came away from talking to Dario with this understanding, I do not think that is something you just wave off. Failures of communication do happen, but it’s pretty strange for this many people to pick up the same misunderstanding over the course of several years, from many different people (including Dario, but also others), in a way that’s beneficial to Anthropic, and then for middle management to start telling you that maybe there was a vibe, but they’ve never heard of any such commitment (never mind what Dustin and Gwern heard, or anyone else who might’ve heard similar from other Anthropic employees).
I do think it would be nice if Anthropic did make such a statement, but seeing how adversarially everyone has treated the information they do release, I don’t blame them for not doing so.
I really think this is assuming the conclusion. I would be… maybe not happy, but definitely much less unhappy, with a response like, “Dang, we definitely did not intend to communicate a binding commitment to not release frontier models that are better than anything else publicly available at the time. In the future, you should not assume that any verbal communication from any employee, including the CEO, is ever a binding commitment that Anthropic, as an organization, will respect, even if they say the words ‘This is a binding commitment.’ It needs to be in writing on our website, etc., etc.”
(I somehow managed to miss that you were getting recommended a bunch of stuff you’d already read, so it’s possible my response was a bit confusing. As Ruby says, we’re trying to get already-read content filtered out, though it’s not doing a perfect job, especially for long-term users who’ve read a decent chunk of the posts on the site.)
For what it’s worth, we don’t have a ton of insight into the algorithms driving the recommendations, beyond knowing what they take as inputs, and the additional constraints/modifications we place on their outputs. (But based on observing the recommendations for a while, it doesn’t seem like “this post has active discussion on it” is much of a factor.)
Anyways, we’re happy to have people switch back to the Latest tab for whatever reason and don’t currently have any plans to get rid of it. (Phrased defensively because I am sometimes surprised by proposals from others on the team, but at any rate I would strongly oppose getting rid of it unless there was a similarly-legible replacement that served a similar purpose.)
Appreciate the feedback re: desiderata!
Enriched tab is now the default LW Frontpage experience for logged-in users
I’m happy to use a functional definition of “understanding” or “intelligence” or “situational awareness”.
But this is assuming away a substantial portion of the entire argument: that there is a relevant difference between current systems, and systems which meaningfully have the option to take control of the future, in terms of whether techniques that look like they’re giving us the desired behavior now will continue to give us desired behavior in the future.
My point re: introspection was trying to provide evidence for the claim that model outputs are not a useful reflection of the internal processes which generated those outputs, if you’re importing expectations from how human outputs reflect the internal processes that generated them. If you get a model to talk to you about its internal experiences, that output was not causally downstream of it having internal experiences. Based on this, it is also pretty obvious that current-gen LLMs do not have meaningful amounts of situational awareness, or, if they do, that their outputs are not direct evidence for it. Consider Anthropic’s Sleeper Agents. Would a situationally aware model use a provided scratchpad to think about how it’s in training and needs to pretend to be helpful? No. Nor does the model “understand” your intentions in a way that generalizes out of distribution the way you might expect a human’s “understanding” to generalize, because the first ensemble of heuristics found by SGD for returning the “right” responses during RLHF is not anything like human reasoning.
I’d prefer you to make some predictions about when this spontaneous agency-by-default in sufficiently intelligent systems is supposed to arise.
Are you asking for a capabilities threshold, beyond which I’d be very surprised to find that humans were still in control decades later, even if we successfully hit pause at that level of capabilities? The obvious one is “can it replace humans at all economically valuable tasks”, which is probably not that helpful. Like, yes, there is definitely a sense in which the current situation is not maximally bad, because it does seem possible that we’ll be able to train models capable of doing a lot of economically useful work, but which don’t actively try to steer the future. I think we still probably die in those worlds, because automating capabilities research seems much easier than automating alignment research.
GPT-4 seems like a “generic system” that essentially “understands our intentions”
I suspect that a lot of my disagreement with your views comes down to thinking that current systems provide almost no evidence about the difficulty of aligning systems that could pose existential risks, because (I claim) current systems in fact almost certainly don’t have any kind of meaningful situational awareness, or stable(ish) preferences over future world states.
In this case, I don’t know why you think that GPT-4 “understands our intentions”, unless you mean something very different by that than what you’d mean if you said that about another human. It is true that GPT-4 will produce output that, if it came from a human, would be quite strong evidence that our intentions were understood (more or less), but the process which generates that output is extremely different from the one that’d generate it in a human and is probably missing most of the relevant properties that we care about when it comes to “understanding”. Like, in general, if you ask GPT-4 to produce output that references its internal state, that output will not have any obvious relationship[1] to its internal state, since (as far as we know) it doesn’t have the same kind of introspective access to its internal state that we do. (It might, of course, condition its outputs on previous tokens it output, and some humans do in fact rely on examining their previous externally-observable behavior to try to figure out what they were thinking at the time. But that’s not the modality I’m talking about.)
It is also true that GPT-4 usually produces output that seems like it basically corresponds to our intentions, but that being true does not depend on it “understanding our intentions”.
- ^
That is known to us right now; possibly one exists and could be derived.
The List of Lethalities does make that claim.
I don’t think this is true and can’t find anything in the post to that effect. Indeed, the post says things that would be quite incompatible with that claim, such as point 21.
Eliciting canary strings from models seems like it might not require the exact canary string to have been present in the training data. Models are already capable of performing basic text transformations (e.g. base64, rot13), at least some of the time. Training on data that includes an encoded canary string would let a sufficiently capable model output the canary string without ever having seen the original value.
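A quick sketch of what I mean (the GUID below is a made-up placeholder, not any real canary value):

```python
# Simple reversible encodings of a canary string evade a naive exact-match
# filter, but a model that has learned the transformation could still
# reproduce the plaintext.
import base64
import codecs

canary = "deadbeef-1234-5678-9abc-def012345678"  # placeholder GUID

encoded_b64 = base64.b64encode(canary.encode()).decode()
encoded_rot13 = codecs.encode(canary, "rot13")  # only the letters change

# Neither encoded form contains the plaintext canary, so exact-match
# filtering on the canary string would keep these documents...
assert canary not in encoded_b64
assert canary not in encoded_rot13

# ...but both trivially decode back to it.
assert base64.b64decode(encoded_b64).decode() == canary
assert codecs.decode(encoded_rot13, "rot13") == canary
```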
Implications re: poisoning training data abound.