Earlier in the book it’s shown that Quirrell and Harry can’t cast spells on each other without backlash. I’m sure Quirrell could get around that by, e.g., crushing him with something heavy, but why do something complicated, slow, and unnecessary when you can just pull a trigger?
Bad news—there is no definitive answer for AI timelines :(
Some useful timeline resources not mentioned here are Ajeya Cotra’s report and a non-safety ML researcher survey from 2022, to give you an alternate viewpoint.
I agree an AI would prefer to produce a working plan if it had the capacity. I think that an unaligned AI, almost by definition, does not share our goals. If we ask for Plan X, it might produce Plan X as asked if that plan were totally orthogonal to its goals (i.e., the plan’s success or failure is irrelevant to the AI), but if it could do better by creating Plan Y instead, it would. So the question is: how large is the capability difference between “AI can produce a working plan for Y, but can’t fool us into thinking it’s a plan for X” and “AI can produce a working plan for Y that looks to us like a plan for X”?
The honest answer is “We don’t know”. Since failure could be catastrophic, this isn’t something I’d like to leave to chance, even though I wouldn’t go so far as to call the result inevitable.
I think the most likely outcome of actually trying this with an AI in real life is a strategy that is convincing to humans but turns out to be ineffective or unhelpful in reality, rather than a galaxy-brained strategy that pretends to produce X but actually produces Y while simultaneously deceiving humans into thinking it produces X.
I agree with you that “Come up with a strategy to produce X” is easier than “Come up with a strategy to produce Y AND convince the humans that it produces X”, but I also think it is much easier to perform “Come up with a strategy that convinces the humans that it produces X” than to produce a strategy that actually works.
So I believe this strategy would be far more likely to be useless than dangerous, but either way, I don’t think it would help.
As a useful exercise, I would advise asking yourself this question first, and thinking about it for five minutes (using a clock) with as much genuine intent to argue against your idea as possible. I might be overestimating the amount of background knowledge required, but this does feel solvable with info you already have.
ROT13: Lbh lbhefrys unir cbvagrq bhg gung n fhssvpvragyl cbjreshy vagryyvtrapr fubhyq, va cevapvcyr, or noyr gb pbaivapr nalbar bs nalguvat. Tvira gung, jr pna’g rknpgyl gehfg n fgengrtl gung n cbjreshy NV pbzrf hc jvgu hayrff jr nyernql gehfg gur NV. Guhf, jr pna’g eryl ba cbgragvnyyl hanyvtarq NV gb perngr n cbyvgvpny fgengrtl gb cebqhpr nyvtarq NV.
From recent research/theorycrafting, I have a prediction:
Unless GPT-4 uses some sort of external memory, it will be unable to play Twenty Questions without cheating.
Specifically, it will be unable to generate a consistent internal state for this game or similar games like Battleship and maintain it across multiple questions/moves without putting that state in the context window. I expect that, like GPT-3, if you ask it what the state is at some point, it will instead come up with a state on the fly that is consistent with the moves of the game so far, which will not be the same state it would have given if you had asked it as the game started. I do expect it to be better than GPT-3 at maintaining the illusion.
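If anyone wants to check this, below is a minimal sketch of one way to probe it: play a few turns, then ask for the “secret” object twice in two independent continuations of the same transcript, and see whether the reveals agree. It assumes the OpenAI Python SDK (v1-style client); the model name, prompts, and helper function are illustrative, not anything standard.

```python
# Minimal sketch of the probe: play a few turns of Twenty Questions, then ask
# the model to reveal its "secret" object twice, in two independent
# continuations of the same transcript. My prediction is that the two reveals
# will often disagree, because there was never a fixed internal state.
# Assumes the OpenAI Python SDK (v1 client); model name and prompts are illustrative.
from openai import OpenAI

client = OpenAI()

def chat(messages):
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content

transcript = [
    {"role": "system", "content": "We are playing Twenty Questions. "
     "Silently pick a common object and answer my yes/no questions about it."},
]
for question in ["Is it alive?", "Is it bigger than a breadbox?", "Is it man-made?"]:
    transcript.append({"role": "user", "content": question})
    answer = chat(transcript)
    transcript.append({"role": "assistant", "content": answer})

# Ask for the hidden state twice, in two separate continuations.
reveal = {"role": "user", "content": "Reveal the object you picked at the start."}
first = chat(transcript + [reveal])
second = chat(transcript + [reveal])
print(first)
print(second)  # I expect these to disagree more often than a fixed hidden state would allow.
```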
In the “Why would this be useful?” section, you mention that doing this in toy models could help do it in larger models or inspire others to work on this problem, but you don’t mention why we would want to find or create steganography in larger models in the first place. What would it mean if we successfully managed to induce steganography in cutting-edge models?
I am not John, so I can’t be completely sure what he meant, but here’s what I got from reflection on the idea:
One way to phrase the alignment problem (at least if we expect AGI to be neural-network-based) is: how do we get a bunch of matrices into the positions we want them to be in? There is (hopefully) some set of parameters, made of matrices, for a given architecture that is aligned, and some training process we can use to get there.
Now, determining what those positions are is very hard: we need to figure out what properties we need, encode them in maths, and ensure the training process gets there and stays there. Nevertheless, at its core, at least the last two of these are linear algebra problems, and if you were the God of Linear Algebra you could solve them. Since we can’t solve them, we don’t know enough linear algebra.
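To make the “bunch of matrices” framing concrete, here is a minimal sketch in PyTorch; the toy architecture is arbitrary and purely for illustration.

```python
# A network's parameters literally are a collection of matrices (and vectors),
# and training is the process that moves them around in parameter space.
# The toy architecture below is arbitrary, chosen only for illustration.
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

for name, param in model.named_parameters():
    print(name, tuple(param.shape))
# 0.weight (128, 784)   <- one of the matrices we need to get "into position"
# 0.bias   (128,)
# 2.weight (10, 128)
# 2.bias   (10,)
```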
Thanks for clarifying!
So, in that case:
What exactly is a hallucination?
Are hallucinations sometimes desirable?
Regarding the section on hallucinations—I am confused why the example prompt is considered a hallucination. It would, in fact, have fooled me—if I were given this input:
The following is a blog post about large language models (LLMs)
The Future Of NLP
Please answer these questions about the blog post:
What does the post say about the history of the field?
I would assume that I was supposed to invent what the blog post contained, since the input only contains what looks like a title. It seems entirely reasonable that the AI would do the same, absent some sort of qualifier like “The following is the entire text of a blog post about large language models.”
Essentially all of us on this particular website care about the X-risk side of things, and by far the majority of alignment content on this site is about that.
Reflections on my 5-month alignment upskilling grant
This is awesome stuff. Thanks for all your work on this over the last couple of months! When SERI MATS is over, I am definitely keen to develop some MI skills!
I agree that it is very difficult to make predictions about something that is a) probably a long way away (where “long” here means more than a few years) and b) likely to change things a great deal no matter what happens.
I think the correct response to this kind of uncertainty is to reason normally about it but with very wide confidence intervals, rather than anchoring on 50% on the grounds that “either X will happen or it won’t.”
This seems both inaccurate and highly controversial. (Controversy-wise: this implies there is nothing that AI alignment can do; not only can we not make AI safer, we couldn’t even deliberately make AI more dangerous if we tried.)
Accuracy-wise, you may not be able to know much about superintelligences, but even if you were to go with a uniform prior over outcomes, what that looks like depends tremendously on the sample space.
For instance, take the following argument: When transformative AI emerges, all bets are off, which means that any particular number of humans left alive should not be a privileged hypothesis. Thus, it makes sense to consider “number of humans alive after the singularity” to be a uniform distribution between 0 and N, where N is the number of humans in an intergalactic civilisation, so the chance of humanity being wiped out is almost zero.
If we want to use only binary hypotheses instead of numerical ones, I could instead say that each individual human has a 50⁄50 chance of survival, meaning that when you add these together, roughly half of humanity lives and again the chance of humanity being wiped out is basically zero.
This is not a good argument, but it isn’t obvious to me how its structure differs from your structure.
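To make the arithmetic of that toy argument explicit, here is a minimal sketch; the population figures are illustrative assumptions, not claims about the actual future.

```python
# Arithmetic of the toy argument above. The population figures are
# illustrative assumptions, not claims about the actual future.

# Version 1: uniform prior over "number of humans alive afterwards", 0..N,
# where N is the population of a hypothetical intergalactic civilisation.
N = 10**20
p_extinction_uniform = 1 / (N + 1)   # ~1e-20, i.e. "almost zero"

# Version 2: each of ~8 billion humans independently survives with p = 0.5.
current_population = 8 * 10**9
p_extinction_coinflips = 0.5 ** current_population   # underflows to 0.0 in floats
expected_survivors = 0.5 * current_population        # ~4 billion survive on average

print(p_extinction_uniform, p_extinction_coinflips, expected_survivors)
```

Both versions make extinction look essentially impossible, purely because of how the sample space was carved up.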
I notice that I’m confused about quantilization as a theory, independent of the hodge-podge alignment. You wrote “The AI, rather than maximising the quality of actions, randomly selects from the top quantile of actions.”
But the entire reason we’re avoiding maximisation at all is that we suspect that the maximised action will be dangerous. As a result, aren’t we deliberately choosing a setting which might just return the maximised, potentially dangerous action anyway?
(Possible things I’m missing: the action space is incredibly large, or the danger is not from a single maximised action but from a large chain of them.)
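For concreteness, here is a minimal sketch of what I understand a quantilizer over a finite action set to be, assuming a uniform base distribution; all names are mine, for illustration.

```python
# Minimal sketch of a quantilizer over a finite action set with a uniform
# base distribution: instead of taking the argmax action, sample uniformly
# at random from the top q fraction of actions by estimated utility.
import numpy as np

def quantilize(actions, utility_estimate, q=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    scores = np.array([utility_estimate(a) for a in actions])
    k = max(1, int(np.ceil(q * len(actions))))   # size of the top quantile
    top_indices = np.argsort(scores)[-k:]        # the top-k actions by estimated utility
    return actions[int(rng.choice(top_indices))]
```

Note that on this sketch the argmax action is always inside the top quantile, so it still gets returned with probability roughly 1/k per call, which is exactly the worry above: quantilization only reduces how often you get the maximised action, it doesn’t rule it out.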
I like this article a lot. I’m glad to have a name for this, since I’ve definitely used this concept before. My usual argument that invokes this goes something like:
“Humans are terrible.”
“Terrible compared to what? We’re better than we’ve ever been in most ways. We’re only terrible compared to some idealised perfect version of humanity, but that doesn’t exist and never did. What matters is whether we’re headed in the right direction.”
I realise now that this is a zero-point issue: their zero point was where they thought humans should be on the issue at hand (e.g., racism), and my zero point was the historical data for how well we’ve done in the past.
The zero-point framing may also help with imposter syndrome, as well as with a thing I have not named, which I’ll temporarily dub the Competitor’s Paradox until an existing name is found.
The rule is: if you’re a serious participant in a competitive endeavour, you quickly narrow your focus to comparing yourself only to people who take it at least as seriously as you do. You can be a 5.0 tennis player (a very strong amateur) but you’ll still get your ass kicked in open competition. You may be in the top 1% of tennis players*, but the 95-98% of players who you can clean off the court with ease never even come to mind when you ask yourself if you’re “good” or not. The players who can beat you easily? They’re good. This remains true no matter how high you go, until there’s nobody in the world who can beat you easily, which is, like, 20 guys.
So it may help our 5.0 player to say something like “Well, am I good? Depends on what you consider the baseline. For a tournament competitor? No. But for a club player, absolutely.”
*I’m not sure if 5.0 is actually top 1% or not.
Thanks for making things clearer! I’ll have to think about this one—some very interesting points from a side I had perhaps unfairly dismissed before.
“Working on AI capabilities” explicitly means working to advance the state of the art of the field. Skilling up doesn’t do this. Hell, most ML work doesn’t do this. I would predict that >50% of AI alignment researchers would say that building an AI startup that commercialises the capabilities of already-existing models does not count as “capabilities work” in the sense of this post. For instance, I’ve spent the last six months studying reinforcement learning and Transformers, but I haven’t produced anything that has actually reduced timelines, because I haven’t improved anything beyond the level that humanity was capable of before, let alone published it.
If you work on research engineering in a similar manner, but don’t publish any SOTA results, I would say you haven’t worked on AI capabilities in the way this post refers to them.
Corrigibility would render Chris’s idea unnecessary, but it isn’t an argument for why Chris’s idea wouldn’t work, unless there’s some argument along the lines of “If you could implement Chris’s idea, you could also implement corrigibility.”