It goes from 1.05 to 1.2.
Adrià Garriga-alonso
Thank you! I have fixed the link now.
Yeah, true. It’s gone so well for so long that I forgot. I didn’t spend a lot of time thinking about this list.
No, I think the blue team will keep having the latest and best LLMs and will be able to stop such attempts from randos. These AGIs won’t be so magically superintelligent that they can take all the unethical actions needed to take over the world without other AGIs stopping them.
On Chrome on a Mac you can just C-f in the PDF; it OCRs automatically. I didn’t have this problem.
But it feels like you’d need to demonstrate this with some construction that’s actually adversarially robust, which seems difficult.
I agree it’s kind of difficult.
Have you seen Nicholas Carlini’s Game of Life series? It builds up from logic gates to a microprocessor that factors 15 into 3 x 5.
Depending on the adversarial robustness model (e.g. every second the adversary can make 1 square behave the opposite of lawfully), it might be possible to make robust logic gates and circuits. In fact the existing circuits are a little robust already, though not to the tune of 1 square per tick; that’s too much power for the adversary.
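To make that corruption model concrete, here is a minimal toy of my own (not Carlini’s construction): a standard Game of Life step plus an adversary that may make one square misbehave per tick.

```python
import numpy as np

def life_step(grid):
    """One lawful Game of Life update on a toroidal 0/1 integer grid."""
    nbrs = sum(np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
               for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dy, dx) != (0, 0))
    return ((nbrs == 3) | ((grid == 1) & (nbrs == 2))).astype(grid.dtype)

def adversarial_run(grid, steps, pick_cell, rng):
    """Run the automaton while an adversary makes one square per tick behave
    unlawfully (implemented as flipping it right after the lawful update).
    `pick_cell(grid, rng) -> (row, col)` is the adversary's strategy."""
    for _ in range(steps):
        grid = life_step(grid)
        r, c = pick_cell(grid, rng)
        grid[r, c] ^= 1
    return grid
```

A gate would count as robust under this model if its output is unaffected by any `pick_cell` strategy within the budget.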
I very strongly agree with this and think it should be the top objection people first see when scrolling down. In a low-P(doom) world, Anthropic has done lots of good. (They proved that you can have the best and the most aligned model, and their leadership is more trustworthy than OpenAI’s, who would otherwise lead.) This is my current view.
In a high-P(doom) world, none of that matters because they’ve raised the bar for capabilities when we really should be pausing AI. This was my previous view.
I’m grudgingly impressed with Anthropic leadership for getting this right when I did not (not that anyone other than me cares what I believed, having ~zero power).
What do you mean “faithfully execute”?
If you mean that it’s executing its own goal, then I dispute that that will be random, the goal will be to be helpful and good.
Is this different from the standard instrumental convergence algorithm?
I am concerned that long chains of RL are just sufficiently fucked functions with enough butterfly effects that this wouldn’t be well approximated by this process.
This is a concern. Two possible replies:
If it’s truly a chaotic system then there’s no good way to estimate the expectation.
In reality, it could be that the effects of neurons are not very chaotic, but this estimate of the gradient is very chaotic. Previous work actually shows that policy gradients are much less chaotic than the ‘reparameterization trick’ (differentiating through the transition when it is continuous). It could be that finite differences (resampling many rollouts with/without the neuron activated) actually estimates effects better, with less variance. We’ll see.
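A rough sketch of the finite-differences estimator I have in mind (the rollout sampler is a hypothetical stand-in passed in by the caller, and zero-ablating the neuron is just one choice of intervention):

```python
import numpy as np

def finite_difference_effect(policy, neuron_id, sample_rollout,
                             n_rollouts=256, seed=0):
    """Estimate a neuron's effect on expected reward by resampling rollouts
    with and without the neuron ablated, then differencing the means.

    sample_rollout(policy, rng, ablate_neuron=None) -> scalar reward
    is an assumed helper, not a real API.
    """
    rng = np.random.default_rng(seed)
    on = [sample_rollout(policy, rng) for _ in range(n_rollouts)]
    off = [sample_rollout(policy, rng, ablate_neuron=neuron_id)
           for _ in range(n_rollouts)]
    return float(np.mean(on) - np.mean(off))
```

The variance question is then empirical: compare the spread of this estimate across seeds to the spread of the policy-gradient attribution for the same neuron.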
Circuit discovery through chain of thought using policy gradients
IMO at any level of sampling?
Vacuously true. The actual question is: how much do you need to sample? My guess is it’s too much, but we’d see the base model scaling better than the RL’d model just like in this paper.
Fortunately, DeepSeek’s Mathv2 just dropped, an open-source model that gets IMO gold. We can do the experiment: is it similarly not improving with sampling compared to its own base model? My guess is yes, the same will happen.
It’s not impossible that we are in an alignment-by-default world. But, I claim that our current insight isn’t enough to distinguish such a world from the gradual disempowerment/going out with a whimper world.
Well, yeah, I agree with that! You might notice this item in my candidate “What’s next” list:
Prevent economic incentives from destroying all value. Markets have been remarkably aligned so far but I fear their future effects. (Intelligence Curse, Gradual Disempowerment. Remarkably European takes.)
This is not classical alignment emphasizing scheming etc. Rather it’s “we get more slop, AIs outcompete humans and so the humans that don’t own AIs have no recourse.” So I don’t think that undermines my point at all.
Approximately never, I assume, because doing so isn’t useful.
You should not assume such things. Humans invented scheming to take over; it might be the very reason we are intelligent.
Until then, I claim we have strong reasons to believe that we just don’t know yet.
We don’t know but we never really know and must act under uncertainty. I put forward that we can make a good guess.
Thank you for the reply. I want to engage without making it tiresome. The problem is that there are many things I disagree with in the worldview; the disagreement isn’t reducible to 1-5 double cruxes. But here are some candidates for the biggest cruxes for me. If any of these are wrong, it’s bad news for my current view:
N->N+1 alignment:
will let humans align N+1 in cases where we can still check, with less and less effort.
is stable instead of diverging (the values of N+1 don’t drift arbitrarily far apart)
The N->N+1 improvement will continue to give out linear-ish improvements in perceived intelligence. We might get one larger jump or two, but it won’t continuously accelerate.
(a good analogy for this is loudness perception being logarithmic in sound pressure. Actual intelligence is logarithmic in the METR time horizon.)
Aligned-persona-seeming models won’t give out false AI safety research results without making it visible in the CoT or latent reasoning.
(It’s perhaps possible to refrain from doing your best (sandbagging), but that doesn’t have nearly as bad effects, so it doesn’t count for this.)
And here’s another prediction where I really stick my neck out, which isn’t load-bearing to the view, but still increases my confidence, so defeating it is important:
we can to a significant extent train with RL against model internals (probes) and textual evaluations from other models, without ill effects.
That is, we ask the N model to evaluate N+1, giving test-time compute to N, and train against that. (We also continuously finetune N to predict N+1’s relevant behavior better.)
we also train linear probes and update them during the RL (see the sketch below).
Effectively I’m claiming these things are good enough, or self-reinforcing when the model is already ~aligned, so that Goodhart’s Law is a poor description of reality.
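Here is the kind of loop I mean, as a deliberately tiny toy: a 2-action REINFORCE policy stands in for N+1, a fixed scoring function stands in for N’s textual evaluation, and a linear probe on the policy’s hidden layer stands in for the internals-based signal. Every name and design choice here is mine and purely illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyPolicy(nn.Module):
    """Toy stand-in for model N+1: 4-dim input, one hidden layer, 2 actions."""
    def __init__(self, d=8):
        super().__init__()
        self.hidden = nn.Linear(4, d)
        self.head = nn.Linear(d, 2)

    def forward(self, x):
        h = torch.tanh(self.hidden(x))
        return torch.log_softmax(self.head(h), dim=-1), h

policy = TinyPolicy()
probe = nn.Linear(8, 1)  # linear probe on N+1's internals
opt_policy = torch.optim.Adam(policy.parameters(), lr=1e-2)
opt_probe = torch.optim.Adam(probe.parameters(), lr=1e-2)

def evaluator_score(actions):
    """Toy stand-in for model N's evaluation: it simply prefers action 1."""
    return actions.float()

for step in range(200):
    x = torch.randn(32, 4)
    logp, h = policy(x)
    actions = torch.distributions.Categorical(logits=logp).sample()
    chosen_logp = logp.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Reward combines N's evaluation with a penalty from the internals probe.
    probe_flag = torch.sigmoid(probe(h)).squeeze(1)
    reward = evaluator_score(actions) - probe_flag

    rl_loss = -(reward.detach() * chosen_logp).mean()  # REINFORCE
    opt_policy.zero_grad()
    rl_loss.backward()
    opt_policy.step()

    # Keep the probe training during the RL (toy labels: action 0 is "bad").
    probe_logits = probe(h.detach()).squeeze(1)
    probe_loss = nn.functional.binary_cross_entropy_with_logits(
        probe_logits, (actions == 0).float())
    opt_probe.zero_grad()
    probe_loss.backward()
    opt_probe.step()
```

In the real setting the evaluator is model N spending test-time compute on N+1’s outputs, and the probe labels come from whatever audited examples we have; whether the combined signal stays un-Goodharted at scale is exactly the claim at stake.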
I still disagree with several of the points, but for time reasons I request that readers not update against Evan’s points if he just doesn’t reply to these.
- I disagree that increasing capabilities are exponential in a capability sense. It’s true that METR’s time-horizon plot increases exponentially, but this still corresponds to a linear intuitive intelligence. (Like loudness (logarithmic) and sound pressure (exponential); we handle huge ranges well.) Each model that has come out has an exponentially larger time horizon but is not (intuitively, empirically) exponentially smarter.
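To spell out the analogy in formula form (illustrative only, not a fitted claim):

$$L_p = 20\log_{10}\!\left(\frac{p}{p_0}\right)\ \text{dB}, \qquad \text{intuitive intelligence} \approx a + b\,\log(\text{METR time horizon}).$$

So a time horizon that keeps doubling reads, intuitively, as repeated equal-sized steps in smartness rather than repeated doublings.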
- “we still extensively rely on direct human oversight and review to catch alignment issues” That’s a fair point and should decrease confidence in my view, though I expected it. For properly testing sandwiching we’ll probably have to wait till models are superhuman, or use weak models + less weak models and test it out. Unfortunately, perhaps the weak models are still too weak. But we’ve reached the point where you can maybe just use the actual Opus 3 as the weak model?
- If we have a misaligned model doing research, we have lots of time to examine it with the previous model. I also do expect to see sabotage in the CoT or in deception probes.
- I updated way down on Goodharting on model internals due to Cundy and Gleave.
Again, readers please don’t update down on these due to lack of a response.
I’m honestly very curious what Ethan is up to now, both you and Thomas Kwa implied that he’s not doing alignment anymore. I’ll have to reach out...
In fact, base models seem to be better than RL’d models at reasoning when you take best-of-N (with the same N for both the RL’d and base model). Check out my post summarizing the research on the matter:
Yue, Chen et al. have a different hypothesis: what if the base model already knows all the reasoning trajectories, and all RL does is increase the frequency of reasoning, or the frequency of the trajectory that is likely to work? To test this, Yue, Chen et al. use pass@K: give the LLM a total of K attempts to answer the question, and if any of them succeed, mark the question as answered correctly. They report the proportion of questions answered correctly in the dataset.
If the RL model genuinely learns new reasoning skills, then over many questions the pass@K performance of the RL model will remain higher than that of the base model. As we increase K, the base model answers more and more of the easy questions, so its performance improves. But the RL model also answers more and more difficult questions, so the performance of both increases in tandem with larger K.
What actually happened is neither of these two things. For large enough K, the base model does better than the RL model. (!!!)
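For concreteness, here is the standard unbiased way pass@K is estimated from n sampled attempts per question (function names are mine; the estimator is the usual combinatorial one):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one question, given n sampled attempts
    of which c were correct: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset of attempts contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

def dataset_pass_at_k(results, k):
    """Average pass@k over a dataset; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# e.g. a question answered correctly 2 times out of 64 attempts, scored at k=32:
print(round(pass_at_k(64, 2, 32), 3))
```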
Is there anyone who significantly disputes this?
I disputed this in the past.
I debated this informally in an Alignment Workshop with a very prominent scientist, and in my own assessment lost. (Keeping vague because I’m unsure if it’s Chatham House rules.)
Many users would immediately tell that predictor “predict what an intelligent agent would do to pursue this goal!” and all of the standard worries would re-occur.
I don’t think it works this way. You have to create a context in which the true training data continuation is what a superintelligent agent would do. Which you can’t, because there are none in the training data, so the answer to your prompt would look like e.g. Understand by Ted Chiang. (I agree that you wrote ‘intelligent agent’, like all the humans that wrote the training data; so that would work, but wouldn’t be dangerous.)
If we can clarify why alignment is hard and how we’re likely to fail, seeing those futures can prevent them from happening—if we see them early enough and clearly enough to convince the relevant decision-makers to make better choices.
Okay, that’s true.
After many years of study, I have concluded that if we fail it won’t be in the ‘standard way’ (of course, always open to changing my mind back). Thus we need to identify and solve new failure modes, which I think largely don’t fall under classic alignment-to-developers.
No, I agree with that. I thought the tree design already involved weighted sums overpowering each other, but I think that was premature.
Very interesting problem to be thinking about. The problem with a UTM as a computation model is that it bakes non-redundancy in: there’s a single instruction pointer.
In reality, computers are implemented in different spatial locations and can run in parallel. A better model for this is a cellular automaton, where computers are located somewhere and their circuit for outputs is also located somewhere. Some automata (e.g. game of life) are Turing-complete, so you can just use that.
Corruption could be exogenously flipping cells in ways that violate the automaton’s rules. If you specify a maximum number of cells that the opponent can corrupt, you can implement voting by paired sums (i.e. sum(A, B, C, D) as sum(sum(A, B), sum(C, D))) and then if there are sufficiently many copies, it becomes impossible to corrupt them all at once.
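A minimal sketch of the voting idea, in plain Python rather than as an in-automaton circuit (the corruption budget k is the assumption doing the work):

```python
from itertools import combinations

def paired_sum(values):
    """Sum by a balanced tree of pairwise sums, i.e. sum(A, B, C, D)
    computed as sum(sum(A, B), sum(C, D))."""
    while len(values) > 1:
        values = [sum(values[i:i + 2]) for i in range(0, len(values), 2)]
    return values[0]

def voted_output(true_bit, n_copies, corrupted):
    """n_copies redundant copies of true_bit, with the copies in `corrupted`
    flipped by the adversary; the output is the majority vote via paired sums."""
    copies = [true_bit ^ (i in corrupted) for i in range(n_copies)]
    return int(2 * paired_sum(copies) > n_copies)

# With n = 2k + 1 copies, no set of k corruptions can flip the vote.
n, k = 7, 3
for c in combinations(range(n), k):
    assert voted_output(1, n, set(c)) == 1
    assert voted_output(0, n, set(c)) == 0
```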
So I don’t love this model because escaping corruption is ‘too easy’. At the same time, reality is kind of cellular-automata-like. Both QFT and GR posit that the world is made of fields that interact only locally, which is ~the same as positing the world is a cellular automaton with infinitesimally-sized cells. (Sidenote: that’s probably why Stephen Wolfram thinks the world is an automaton; I’m coming around.)
Alternatively, we could use computational DAGs as the model, like neural networks. If you allow nodes to be corrupted but require their outputs to be bounded, then you can get robustness by having redundancy again. If you allow unbounded corruption, you’re sad again. But infinity is fake, so this seems fine.
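And a sketch of the bounded-corruption DAG version (the mean over copies is one way to exploit the boundedness; all numbers are arbitrary):

```python
import numpy as np

def redundant_node(f, x, n_copies, n_corrupted, bound, rng):
    """Compute f(x) with n_copies redundant nodes. The adversary corrupts up to
    n_corrupted of them, but each corrupted output is shifted by at most `bound`.
    The node's final output is the mean over copies."""
    outs = np.full(n_copies, float(f(x)))
    idx = rng.choice(n_copies, size=n_corrupted, replace=False)
    outs[idx] += rng.uniform(-bound, bound, size=n_corrupted)
    return float(np.mean(outs))

rng = np.random.default_rng(0)
out = redundant_node(lambda v: v * 2, 3.0, n_copies=100, n_corrupted=5,
                     bound=1.0, rng=rng)
# The error is at most n_corrupted * bound / n_copies = 0.05, and it shrinks as
# we add copies; with unbounded corruption no amount of redundancy would help.
assert abs(out - 6.0) <= 5 * 1.0 / 100
```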
We can monitor that and mitigate it when we get there, using the previous generation of AIs.