Good point. It’s a bit weird that performance on easy Codeforces questions is so bad (0/10) though.
https://twitter.com/cHHillee/status/1635790330854526981
I think DeepMind is affected by race dynamics, Google's "code red", etc. I heard from a DeepMind employee that the leadership, including Demis, is now much more focused on products and profits, at least in their rhetoric.
But I agree it looks like they tried, and are likely still trying, to push back against these incentives.
And I am pretty confident that they deliberately reduced publishing; the drop is visible.
It is true that this is not evidence of misalignment with the user, but it is evidence of misalignment with ChatGPT's creators.
On the surface level, it feels like an approach with a low probability of success. Simply put, the reason is that building CoEm is harder than building just any AGI.
I consider it harder not only because it is not what everyone is already doing, but also because it seems similar to the AI people tried to create before deep learning, which didn't work at all until they switched to Magic, which [comparatively] worked amazingly.
Some people are still trying to do something along these lines (e.g. Ben Goertzel), but I haven't seen anything that works even remotely comparably to deep learning yet.
I think that the gap between (1) “having some AGI which is very helpful in solving alignment” and (2) “having very dangerous AGI” is probably quite small.
It seems very unlikely that CoEm will be the first system to reach (1), so it will probably be some other system. Then we can either try to solve alignment using that system or wait until CoEm is improved enough to reach (1). Intuitively, it feels like we will go from (1) to (2) much faster than we will be able to improve CoEm enough.
So overall I am quite sceptical, but it could still be the best idea if all other ideas are even worse. I think the more obvious ideas, like "trying to understand how Magic works" (interpretability) and "trying to control Magic without understanding it" (things like Constitutional AI etc.), are somewhat more promising, but there is already a lot of effort in those directions, so maybe somebody should try something else. Unfortunately, it is extremely hard to judge whether that's actually the case.
Formally, a launch needs to be approved by three people: the President, the Minister of Defence, and the Chief of the General Staff. Even then (I think) the briefcase doesn't launch the missiles itself: it unlocks them and sends a signal to other people, who actually launch them.
Also, there is speculated to be some way to launch without confirmation from all three people in case some of them are technically unable to approve (e.g. a briefcase doesn't work, the person is dead, communication problems), but the details of how exactly this works are unknown.
Are you implying that it is close to GPT-4 level? If so, that is clearly wrong. Especially with regard to code: everything (except maybe StarCoder, which was released literally yesterday) is worse than GPT-3.5, and much worse than GPT-4.
I think RLHF doesn't change much for the proposed theory. A "bare" model just tries to predict next tokens, i.e. to continue a given text. To do this well, it first needs to implicitly predict what kind of text it is. So it has a prediction and decides how to proceed, but it's not discrete. We have some probabilities over hypotheses, for example (a toy sketch follows this list):
A—this is fiction about “Luigi” character
B—this is fiction about “Waluigi” character
C—this is an excerpt from a Wikipedia page about Shigeru Miyamoto which quotes some dialogue from Super Mario 64, it is not going to be focused on “Luigi” or “Waluigi” at all
D—etc. etc. etc.
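A toy sketch of this picture (purely illustrative; a real transformer computes nothing this explicit, and all genre names and numbers below are made up): the next-token distribution can be viewed as a mixture over genre hypotheses.

```python
# Purely illustrative: next-token prediction as a mixture over
# genre hypotheses. All names and probabilities are made up.
genre_probs = {  # p(genre | prefix): the "what kind of text is this" layer
    "luigi_fiction": 0.40,    # A
    "waluigi_fiction": 0.15,  # B
    "wikipedia_quote": 0.30,  # C
    "other": 0.15,            # D
}

def next_token_probs_given_genre(genre):
    # Hypothetical per-genre predictor; stands in for what the
    # model computes implicitly. Toy numbers only.
    toy = {
        "luigi_fiction":   {"Luigi": 0.6, "Mario": 0.4},
        "waluigi_fiction": {"Waluigi": 0.7, "Wario": 0.3},
        "wikipedia_quote": {"Miyamoto": 0.5, "Nintendo": 0.5},
        "other":           {"the": 1.0},
    }
    return toy[genre]

def next_token_probs():
    # p(token) = sum over genres of p(genre) * p(token | genre)
    mixture = {}
    for genre, p_g in genre_probs.items():
        for token, p_t in next_token_probs_given_genre(genre).items():
            mixture[token] = mixture.get(token, 0.0) + p_g * p_t
    return mixture

print(next_token_probs())
```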
The LLM is able to give sensible predictions because during training we introduce a loss function which measures how close the generated proposal is to the ground truth (in current LLMs it is something simple: cross-entropy on next-token prediction; the exact form isn't very relevant here). This configuration creates optimization pressure.
Now, when we introduce RLHF, we just add another kind of optimization pressure on top, which is basically: "this is a text about a perfect interaction between some random user and a language model" (as human raters imagine such an interaction, i.e. as another model, the reward model, imagines human raters imagine such a conversation).
Naively, it is like throwing another loss function into the mix, so now the model is trying to minimize `text_similarity_loss + RLHF_loss`. Mathematically it can be much more complicated, because the pressures are applied in order (and the "apply optimization pressure" operation is probably not commutative, maybe not even associative), so the combination will look like something more complicated, but that doesn't matter for our purposes.
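As a minimal sketch of this naive "summed losses" picture (the real RLHF pipeline applies the pressures sequentially via RL, so this is a deliberate oversimplification; all names here are illustrative):

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target_tokens, reward_model_score):
    # Pretraining pressure: cross-entropy between the predicted
    # next-token distributions and the ground-truth tokens.
    text_similarity_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)), target_tokens.view(-1)
    )
    # RLHF pressure: prefer continuations that a (hypothetical)
    # reward model scores highly; negated so that lower is better.
    rlhf_loss = -reward_model_score
    # The naive combination discussed above; the real pipeline
    # applies these pressures in sequence, not as a single sum.
    return text_similarity_loss + rlhf_loss
```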
The effect this has on the model's behaviour is akin to adding a new TEXT GENRE to the training set: "a story about a user interacting with a language model" (again, this is a simplification; if it were literally like this, it wouldn't cause artefacts like "mode collapse"). This genre contains a very common trope: the user asks something inappropriate, and the model says it is not allowed to answer.
In the jailbreak example, we throw a bunch of fiction tropes at the model, and it pattern-matches on them really hard. The first component of the loss pushes it towards continuing as fiction, while the second component says: "wait, this looks like a savvy language model user trying to trick the LLM into doing stuff it shouldn't; this is the perfect time for the 'I am not allowed to do it' trope". But that trope belongs to the other genre of text, so the model is torn between continuing as fiction and continuing as a "story of LLM-user interaction". The first component won before the patch and loses now.
So although I think the "Waluigi effect" is an interesting and potentially productive frame, it is not enough to describe everything; in particular, it is not what explains the jailbreak behaviour.
In a "normal" training set, which we can often treat as "fiction" with some caveats, it is indeed the case that a character can be secretly evil. But in the "LLM-user story" part of the implicit augmented training set, there is no such possibility. What happens is NOT "the model acts as an assistant character which turns out to be evil", but rather the model chooses between acting as SOME character, which can be a "Luigi" or a "Waluigi" (to make things a bit more complicated, "AI assistant" is a perfectly valid fictional character), and acting as the ONLY character in the very specific genre of "LLM-user interaction".
Also, there is no "detect naughty questions" circuit paired with a "break character and reset" circuit. I mean, there could be, but it's not designed that way. Instead, such circuits are just byproducts of the optimization process, arising when they help the model predict texts. E.g. if some genre has a lot of naughty questions, then it will be useful for the model to have such a circuit, just as it is useful to represent a character, typical of some genre, who asks naughty questions.
In conclusion, the model is indeed always in a superposition of characters within a story, but that is only the second layer of superposition; the first (and maybe even more important?) layer is "what kind of story this is".
2. I think non-x-risk focused messages are a good idea because:
It is much easier to reach a wide audience this way.
It is clear that there are significant and important risks even if we completely exclude x-risk. We should have this discussion even in a world where for some reason we could be certain that humanity will survive for the next 100 years.
It widens the Overton window. X-risk is still mostly considered a fringe position among the general public, although the situation has improved somewhat.
3. There have been cases when this worked well, for example, the Letter of Three Hundred.
4. I don't know much about EA's concerns about Elon. Intuitively, he seems fine. But I think that in general people are biased towards too much distancing, which often hinders coordination a lot.
5. I think more signatures cannot make things worse if the authors handle them properly. Rough sorting by credentials (as FLI does) may already be good enough, but it is possible, and easy, to be more aggressive here.
I agree that it's unlikely this letter will be net bad and that it could make a significant positive impact. However, I don't think people argued that it would be bad; rather, they argued it could be better. It's clearly not possible to do something like this every month, so it's worth paying a lot of attention to detail and thinking really carefully about content and timing.
Codex + CoT reaches 74 on a *hard subset* of this benchmark: https://arxiv.org/abs/2210.09261
The average human scores 68; the best human, 94.
Only 4 months have passed, and people no longer want to test on the full benchmark because it is too easy...
So far, the 2022 predictions have been correct. There is CodeGeeX, among others. Copilot, DALL-E 2, and Stable Diffusion made the financial prospects obvious (somewhat arguably).
ACT-1 works in a browser; I have neural search in the Warp terminal (not a big deal, but it qualifies); not sure about Mathematica, but there has definitely been significant progress in formalization and provers (Minerva).
And even some later ones
2023
ImageNet—nobody measured it exactly but probably already achievable.
2024
Chatbots personified through video and audio—Replika sort of qualifies?
40% on MATH already reached.
I just bought a new subscription (I didn't have one before); it is available to me.
I think you misinterpret hindsight neglect: the model got to 100% accuracy, so it got better, not worse.
Also, a couple of images are not displayed correctly; search for `<img` in the text.
Two main options:
* It was trained e.g. a year ago but published only now
* All TPU-v4s are very busy with something even more important
When it was published, it felt like a pretty short timeline. But now it is early 2023, and we already seem to be at this scenario's late 2023.
Might be caused mostly by data leaks (training set contamination).
Another related Metaculus prediction:
I have some experience in competitive programming and competitive math (although I was never good at math, despite having solved some "easy" IMO problems, at university already, not onsite of course), and I feel like competitive math is more about general reasoning and less about pattern matching than competitive programming.
P.S. The post matches my intuitions well and is generally excellent.
I agree it’s a very significant risk which is possibly somewhat underappreciated in the LW community.
I think all three situations are very possible and potentially catastrophic:
Evil people do evil with AI
Moloch goes Moloch with AI
ASI goes ASI (FOOM etc.)
Arguments against (1) could be “evil people are stupid” and “terrorism is not about terror”.
Arguments against (1) and (2) could be “timelines are short” and “AI power is likely to be very concentrated”.
I think it is only getting started. I expect there will likely be more attention in 6 months, and very likely in 1 year.
OpenAI has barely rolled out its first limited version of GPT-4 (only 2 weeks have passed!). It is growing very fast but has A LOT of room to grow. Also, text-to-video is not here in any significant sense, but it will be very soon.
Well, I do not have anything like this, but it is very clear that China is way above GPT-3 level. Even the open-source community is significantly above it. Take a look at LLaMA/Alpaca: people run them on consumer PCs, and they are around GPT-3.5 level; the largest, 65B, model is even better (it cannot be run on a consumer PC, but it can be run on a small ~$10k server or cheaply in the cloud). Such a model can also be fine-tuned in 5 hours on an RTX 4090 using LoRA (a rough sketch below): https://github.com/tloen/alpaca-lora
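For illustration, a minimal sketch of what a LoRA fine-tuning setup looks like with the Hugging Face peft library; the linked alpaca-lora repo has the actual recipe, and the checkpoint name and hyperparameters below are just plausible examples, not the repo's exact values:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Example LLaMA-7B checkpoint name (an assumption, not the repo's config).
model = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf", torch_dtype=torch.float16
)

config = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

# Only the small adapter matrices are trainable, which is why a
# single RTX 4090 is enough for fine-tuning.
model.print_trainable_parameters()
```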
Chinese AI researchers contribute significantly to AI progress, although, of course, they are behind the USA.
My best guess would be China is at most 1 year away from GPT-4. Maybe less.
Btw, an example of a recent model: ChatGLM-6B
I am currently job hunting, trying to get a job in AI safety, but it seems quite difficult, especially outside of the US, so I am not sure I will manage it.
If I do not land a safety job, one of the obvious options is to get hired by an AI company and learn more there, in the hope that I will either be able to contribute to safety internally or eventually move into the field as a more experienced engineer.
I am conscious of why pushing capabilities could be bad, so I will try to avoid it, but I am not sure how far this extends. I understand that being a Research Scientist at OpenAI working on GPT-5 is definitely pushing capabilities, but what about doing frontend at OpenAI, or building infrastructure at a strong but not leading (and hopefully a bit more safety-oriented) company such as Cohere? Or, say, working at a hedge fund which invests in AI? Or at a generative AI company which doesn't build in-house models but generates profit for OpenAI? Or as an engineer at Google on non-AI stuff?
I do not currently see myself as an independent researcher or an AI safety lab founder, so I will definitely need to find a job. And nowadays too many things seem to touch AI one way or another, so I am curious whether anybody has ideas about how I could evaluate career opportunities.
Or am I taking it too far and the post simply says “Don’t do dangerous research”?