I wonder if the general public will soon freak out on a large scale (Covid-like). I will not be surprised if it happens in 2024, and only slightly surprised if it happens this year. If it does happen, I am also not sure whether it will be good or bad.
OpenAI just dropped ChatGPT plugins yesterday. It seems like an ideal platform for this? It would probably be even easier to implement than before, and with better quality. But more importantly, it seems that ChatGPT plugins will quickly shape up to be the new app store, and it would be easier to get attention on this platform compared to other, more traditional ways of distribution. Quite speculative, I know, but it seems very possible.
If somebody starts such a project, please contact me. I am an ex-Google SWE with decent knowledge of ML and experience running a software startup (as co-founder and CTO in the recent past).
I would also be interested to hear why it could be a bad idea.
Good point. It’s a bit weird that performance on easy Codeforces questions is so bad (0/10) though.
https://twitter.com/cHHillee/status/1635790330854526981
Might be caused mostly by data leaks (training set contamination).
I think you misinterpret hindsight neglect. It got to 100% accuracy, so it got better, not worse.
Also, a couple of images are not displayed correctly; search for "<img" in the text.
Really helpful for learning new frameworks and stuff like that. I had a very good experience using it for Kaggle competitions (I am at a semi-intermediate level; it is probably much less useful at the expert level).
Also, I found it quite useful for research on obscure topics like “how to potentiate this little-known drug”. Usually, such research involves reading through tons of forums, subreddits, etc., and the signal-to-noise ratio is quite low. GPT-4 is very useful for distilling the signal because it has basically already read all of it.
Btw, I tried to make it solve competitive programming problems. I think it’s not a matter of prompt engineering: it is genuinely bad at it. The following pattern is common:
GPT-4 proposes some solution, usually wrong at first glance.
I point out the mistakes.
GPT-4 says “yeah, you’re right”, and claims it is now fixed.
This goes on for ~4 iterations, until I give up on the particular problem or, more interestingly, GPT-4 starts to claim that it’s impossible to solve.
In such moments it really feels like a low-IQ (but very eloquent) human; it just cannot think abstractly.
Well, I do not have anything like this, but it is very clear that China is way above the GPT-3 level. Even the open-source community is significantly above it. Take a look at LLaMA/Alpaca: people run them on consumer PCs and they are around GPT-3.5 level; the largest 65B model is even better (it cannot be run on a consumer PC, but it can be run on a small ~$10k server or cheaply in the cloud). It can also be fine-tuned in 5 hours on an RTX 4090 using LoRA: https://github.com/tloen/alpaca-lora .
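For reference, here is a minimal sketch of what LoRA fine-tuning looks like with the Hugging Face peft library (the checkpoint name and hyperparameters are illustrative, not the exact alpaca-lora settings):

```python
# Minimal LoRA fine-tuning sketch (illustrative; see tloen/alpaca-lora for the real thing).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "decapoda-research/llama-7b-hf"  # illustrative base checkpoint
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# LoRA adds small trainable low-rank matrices to the attention projections,
# so only a tiny fraction of the parameters is actually updated.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the weights
# ...then train with a standard Trainer loop on the Alpaca instruction data.
```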
Chinese AI researchers contribute significantly to AI progress, although, of course, they are behind the USA.
My best guess would be China is at most 1 year away from GPT-4. Maybe less.
Btw, an example of a recent model: ChatGLM-6B
I just bought a new subscription (I didn’t have one before), and it is available to me.
MMLU 86.4% is impressive, predictions were around 80%.
1410 SAT is also above expectations (according to prediction markets).
Uhm, I don’t think anybody (even Eliezer) implies 99.9999%. Maybe some people imply 99%, but that is 4 orders of magnitude of difference in the remaining probability: a factor of 10,000, versus only a factor of 10 between 90% and 99%.
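Spelling out the arithmetic in terms of the remaining (non-doom) probability:

$$\frac{1-0.99}{1-0.999999}=\frac{10^{-2}}{10^{-6}}=10^{4},\qquad\frac{1-0.90}{1-0.99}=\frac{10^{-1}}{10^{-2}}=10^{1}$$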
I don’t think there are many people who put the chance at 95%+, even among those who are considered doomerish.
And I think most LW people are significantly lower, despite being rightfully [very] concerned. For example, this Metaculus question (which is of course not LW, but the audiences intersect quite a bit) shows only a 13% mean (and a 2% median).
I don’t think that Waluigi is an attractor state in some deeply meaningful sense. It is just that we have more stories where bad characters pretend to be good than vice versa (although we have some). So a much simpler “solution” would be just to filter the training set. But it’s not an actual solution, because it’s not an actual problem. Instead, it is just a frame to understand LLM behaviour better (in my opinion).
I think that RLHF doesn’t change much for the proposed theory. A “bare” model just tries to predict the next tokens, which means finishing the next part of a given text. To complete this task well, it first needs to implicitly predict what kind of text it is. So it has a prediction and decides how to proceed, but it’s not discrete. Instead, we have some probabilities, for example:
A—this is fiction about “Luigi” character
B—this is fiction about “Waluigi” character
C—this is an excerpt from a Wikipedia page about Shigeru Miyamoto which quotes some dialogue from Super Mario 64, it is not going to be focused on “Luigi” or “Waluigi” at all
D—etc. etc. etc.
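One rough way to formalize this superposition (my own illustrative notation): the model’s next-token prediction can be seen as marginalizing over the latent “kind of text” $g$:

$$p(x_t \mid x_{<t}) = \sum_{g \in \{A, B, C, \dots\}} p(g \mid x_{<t})\, p(x_t \mid g, x_{<t})$$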
The LLM is able to give sensible predictions because while training the model we introduce a loss function which measures how close the generated proposal is to the ground truth (in current LLMs it is something very simple: cross-entropy on the predicted next token; the exact form is not very relevant here). This configuration creates optimization pressure.
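For concreteness, the standard next-token objective is (roughly) the cross-entropy between the model’s predicted distribution and the actual next token:

$$\mathcal{L}_{\text{LM}}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})$$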
Now, when we introduce RLHF, we just add another kind of optimization pressure on top, which is basically “this is a text about a perfect interaction between some random user and a language model” (as human raters imagine such an interaction, i.e. how another model, the reward model, imagines human raters imagine such a conversation).
Naively, it is like throwing another loss function into the mix, so now the model is trying to minimize text_similarity_loss + RLHF_loss. It can be much more complicated mathematically, because the pressure is applied in order (and the “optimization pressure” operation is probably not commutative, maybe not even associative), so the combination will look like something more complicated, but it doesn’t matter for our purposes.

The effect it has on the behaviour of the model is akin to adding a new TEXT GENRE to the training set: “a story about a user interacting with a language model” (again, this is a simplification; if it were literally like this, it wouldn’t cause artefacts like “mode collapse”). It will contain a very common trope: “the user asks something inappropriate and the model says it is not allowed to answer”.
In the jailbreak example, we throw a bunch of fiction tropes at the model; it pattern-matches really hard on those tropes, and the first component of the loss function pushes it in the direction of continuing as if it were fiction, while the second component says “wait, it looks like this is a savvy language-model user trying to trick the LLM into doing stuff it shouldn’t; this is the perfect time for the trope ‘I am not allowed to do it’”. But that second trope belongs to another genre of text, so the model is really torn between continuing as fiction and continuing as “a story about LLM-user interaction”. The first component won before the patch and loses now.
So although I think the “Waluigi effect” is an interesting and potentially productive frame, it is not enough to describe everything, and in particular it is not what explains the jailbreak behaviour.
In a “normal” training set, which we can often treat as “fiction” with some caveats, it is indeed the case that a character can be secretly evil. But in the “LLM-user story” part of the implicit augmented training set, there is no such possibility. What happens is NOT “the model acts like an assistant character which turns out to be evil”, but rather the model choosing between acting as SOME character, which can be “Luigi” or “Waluigi” (to make things a bit more complicated, “AI assistant” is a perfectly valid fictional character), and acting as the ONLY character in a very specific genre of “LLM-user interaction”.
Also, there is no “detect naughty questions” circuit and a “break character and reset” circuit. I mean, there could be, but it’s not designed that way. Instead, such a circuit is just a byproduct of the optimization process, which can help the model predict texts. E.g. if some genre has a lot of naughty questions, then it will be useful for the model to have such a circuit, just as it is useful to represent a character from some genre who asks naughty questions.
In conclusion, the model is indeed always in a superposition of characters of a story but it’s only the second layer of superposition, while the first (and maybe even more important one?) layer is “what kind of story it is”.
On the surface level, it feels like an approach with a low probability of success. Simply put, the reason is that building CoEm is harder than building other kinds of AGI.
I consider it to be harder not only because it is not what everyone is already doing, but also because it seems similar to the kind of AI people tried to create before deep learning, which didn’t work at all until they decided to switch to Magic, which [comparatively] worked amazingly.
Some people are still trying to do something along these lines (e.g. Ben Goertzel), but I haven’t seen anything working that is even remotely comparable to deep learning yet.
I think that the gap between (1) “having some AGI which is very helpful in solving alignment” and (2) “having very dangerous AGI” is probably quite small.
It seems very unlikely that CoEm will be the first system to reach (1), so probably it is going to be some other system. Now, we can either try to solve alignment using this system or wait until CoEm is improved enough that it reaches (1). Intuitively, it feels like we will go from (1) to (2) much faster than we will be able to improve CoEm enough.
So overall I am quite sceptical, but I think it could still be the best idea if all other ideas are even worse. I think that more obvious ideas like “trying to understand how Magic works” (interpretability) and “trying to control Magic without understanding it” (things like Constitutional AI, etc.) are somewhat more promising, but there is already a lot of effort in these directions, so maybe somebody should try something else. Unfortunately, it is extremely hard to judge whether that is actually the case.
Getting a grandmaster rating on Codeforces.
Update after 4 months: I think I have changed my opinion; now I am 95% sure no model will be able to achieve this in 2023, and it seems quite unlikely in 2024 too.
Codex + CoT reaches 74 on a *hard subset* of this benchmark: https://arxiv.org/abs/2210.09261
The average human scores 68; the best human scores 94.
Only 4 months have passed and people don’t want to test on the full benchmark because it is too easy...
Flan-PaLM reaches 75.2 on MMLU: https://arxiv.org/abs/2210.11416
Formally, it needs to be approved by 3 people: the President, the Minister of Defence and the Chief of the General Staff. Then (I think) it doesn’t launch the missiles itself; it unlocks them and sends a signal to other people to actually launch them.
Also, there is speculated to be some way to launch them without confirmation from all 3 people, in case some of them are technically unable to approve (e.g. the briefcase doesn’t work / the person is dead / communication problems), but the details of how exactly it works are unknown.
This is moving the goalposts. Basically, it says “current models are not really intelligent”. I don’t think there is much disagreement here. And it’s hard to make any predictions based on that.
Also, “Producing human-like text” is not well defined here; even ELIZA might match this definition. And even the current SOTA might not match it, because the adversarial Turing Test has not yet been passed.
They are simulators (https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators), not question answerers. Also, I am sure Minerva does pretty well on this task; probably not 100% reliably, but humans are also not 100% reliable if they are required to answer immediately. If you want the ML model to simulate thinking [better], make it solve the task 1000 times and select the most popular answer (which is already quite a popular approach for some models). I think PaLM would be effectively 100% reliable.
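A minimal sketch of that kind of majority voting (often called self-consistency), assuming a hypothetical solve() function that samples one answer per call:

```python
from collections import Counter

def majority_vote(solve, task, n_samples=1000):
    """Sample many independent answers and return the most common one.

    `solve` is a hypothetical function: it queries the model once with the
    task (at temperature > 0, so answers vary between calls) and returns its
    final answer as a string.
    """
    answers = [solve(task) for _ in range(n_samples)]
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer
```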
When it was published, it felt like a pretty short timeline. But now we are in early 2023, and it already feels like the late-2023 part of this scenario.