Here’s my current impression (any uncited claims probably come from @Hans Gundlach et al. 2025b, “On the origin of algorithmic progress in AI”).
Aren’t their experiments at too small a scale to pick up the benefits?
i notice a lot of disagree votes here—would appreciate an explanation as to why
There are too many civilizations in the multiverse to simulate all of them. We can only do a sampling and then decide how to trade based on the statistical properties, but this gives an incentive to free-ride (i.e. getting the benefits of trade from other civilizations that did not specifically simulate us, without paying the costs), which may cause an overall breakdown in trade.
Hmm, but if each civ simulates 1 million others, every civ should be simulated 1 million times? I.e. the more civs in the universe, the more civs to do the simulating. And then i think there’s a stable equ where you only give nice things to the civs you simulate if you see them giving nice things to the ones they simulate. I suppose there’s a q of whether we’re able to achieve that equ?
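A quick way to sanity-check that arithmetic, as a toy Monte Carlo sketch (N and k are invented illustration numbers, not from the discussion):

```python
import random

# Toy check: if each of N civs simulates k others chosen at random,
# each civ is itself simulated k times in expectation (N*k sims / N civs).
# N and k are invented for illustration; a civ may even sample itself.
N, k = 10_000, 100
times_simulated = [0] * N

for civ in range(N):
    for target in random.sample(range(N), k):  # k distinct civs to simulate
        times_simulated[target] += 1

print(sum(times_simulated) / N)                    # exactly k on average
print(min(times_simulated), max(times_simulated))  # spread around k
```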
Also, you can do the thing where when your sims run their own sims, you insert your universe into their sim, guaranteeing 2-way trade?
If everyone in our universe doing acausal trade coordinates, we can sell “cosmic real estate” for monopoly prices
Let’s assume that there are many different universes (or Everett branches) that acausally trade.
Some traders won’t value “resources in our civ’s future lightcone” linearly. As a toy example, the leader of a distant alien civilisation might want to get a statue of themselves in as many different other universes as possible.
If many different actors in our universe do acausal trade, and compete with each other to trade with the alien leader, then they’d bid down the price of building that statue. Whereas if they all band together, they could hold out for a much higher price. So it could be in our collective interests to coordinate and “price fix”.
This is an example of a civilisation-wide public good that could be important long into the future.
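To put toy numbers on the price-fixing point (a sketch with invented figures):

```python
# Invented numbers: the alien leader values a statue in our universe at v,
# and it costs any actor in our universe c to build one.
v = 100.0  # the most the alien leader would pay (their valuation)
c = 1.0    # our cost of building the statue

# Many competing sellers: Bertrand-style undercutting pushes price to ~cost.
competitive_surplus = c - c   # ~0 captured by our civilisation
# One coordinated "cartel": can hold out for (almost) the full valuation.
cartel_surplus = v - c        # ~99 captured by our civilisation

print(competitive_surplus, cartel_surplus)
```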
Thanks, i’m not personally “very sure” either
But I wouldn’t rule out someone who’s thought a lot about DT being pretty confident. I don’t think you need to “solve” DT to be v confident that acausal trade is a thing any more than you need to solve ethics to know that murder is wrong.
I could imagine that some of the acausal trade crowd have thought long enough about the space of decision theories and their implications to conclude that acausal trade is a consequence of many plausible DTs and is very likely happening.
My understanding is that even with CDT you can get sim-based trade (which i’d consider a form of acausal trade), and that on a first pass EDT and UDT both imply that acausal trade makes sense. So we only need some powerful agents to do one of these decision theories for acausal trade to go ahead.
I guess I can imagine a countercase like “bc of threats very few civs do acausal trade”, though it’s hard to see it go down to zero. I’d be curious if you have other counter-cases in mind.
(In general i’d defer to someone who thinks about this more on how likely acausal trade is to happen overall!)
If we’re in a sim, it’s being used for acausal trade
Me: Our world is exactly the kind of thing you’d simulate if you were doing acausal trade! It’s just before civilisation develops the ability to lock in deals.
Sceptic: Sure, but there’s other reasons ppl might simulate earth. Maybe it’s for ppl’s entertainment? Maybe it’s social science, exploring alternate histories?
Me: For sure. But whatever the purpose of the sim is, it will contain info that’s relevant to ppl that want to do acausal trades. It will have info about who has power post-AGI, what their values are, and whether they want to do acausal trade. If someone ran the sim for entertainment, they’d obviously sell that info to the acausal trade folks.
Sceptic: Won’t the acausal trade folks just run their own sims?
Me: Maybe! But they’ll be keen to buy relevant info from others who run sims. If others run earth sims for entertainment, the acausal trade folks will buy the info and run fewer earth sims themselves.
Fwiw I think a bit about post-alignment problems and think we should be preparing to pause / slow down for this kind of reason! Compared to standard pause supporters, I’d probably put more emphasis on avoiding concentration of power when we do it, and on doing it when AI can significantly accelerate efforts to solve these problems
I think you’re using a false dichotomy when you say that either a superintelligence’s values will be locked in or it will be corrigible.
There is an in-between where the superintelligence won’t help with power grabs and won’t do other awful things, but will allow its values to be changed if there is a legitimate process that supports that change, with multiple stakeholders signing off. This would allow society to change the AI’s values and behaviors as it likes, while preventing any small group from changing them so that the AI helps it seize power. It is essentially corrigible to a broader legitimate process rather than to any individual user.
That’s the kind of AI that I think could allow us to navigate these problems as we go without pause
(I think we should pause or at least significantly slow down despite this objection!)
Fwiw I didn’t see the quotes as supporting the conclusion that he doesn’t believe in superintelligence—would want your inference to be spelled out more
I find myself coming back to the fact that in the real world, it is quite beneficial to have humans with proactive pro-social drives rather than workers who just do what they’re literally instructed by the organisation. If we imagine replacing all these humans with AIs, it seems like it would be good to give the AIs proactive pro-social drives as well, maybe even give them more of them. I agree encouraging ppl to do this in their instructions is also good, though it’s less leveraged as an intervention bc you need to convince millions of separate entities
What do you mean by moral hazard here? (I agree that if developers have fully corrigible AI internally then it should have a separate model implementing refusals and/or whistleblowing)
Ok cool, then i think we’re in agreement! I think you can implement those things internally without raising p(AI takeover): if you want to really maximise corrigibility, then you can have a monitor model enforce the refusals, which IIUC is the best way to arrange things to avoid jailbreaking anyway.
(Though I think there’s an outstanding disagreement where I’m more worried about government power concentration than you, relative to AI company power concentration)
Yep, good catch! In this case, I meant to the user.
What about if the government neglected to say that it would like this kind of behavior and gave the AI a narrower set of instructions? What should the AI do then? I do agree that one way to get proactive, pro-social drives is to put it in the prompt!
Do you still think that even if the corrigible AI refuses (which is what I intended), and if it can be monitored by another corrigible-with-refusals AI that has been instructed to report things? The ability to have monitoring of that kind for internal deployments reduces some of the benefits of proactive, pro-social drives.
Cool. That’s helpful. I understand your point about how, in practice, the alignment target might be best thought of as residing in the heads of especially senior people at Anthropic, if ultimately what they want will take precedence over the document.
I am not quite sure what the point of trying to talk about the “intended alignment target” is, in the absence of evaluating the process for getting there. The process here just seems like the thing that will determine the final alignment state. The “target” is just a vague set of intentions that might or might not connect to anything real.
It seems conceptually much clearer to talk separately about the intended alignment target and the process that is actually in place for achieving it; then you can see whether the process is fit for purpose. Of course, I agree the process will determine the final alignment state. If someone can point out that that process is ill-fitted to achieve the intended target, then they’ve identified a problem.
This point seems kind of obvious to me. Are you suggesting that we should use “alignment target” to refer to the process as well as the intention?
Indeed, the vast majority of the target gets determined by the technical and competitive constraints Anthropic is under.
This is an interesting perspective. If I’m understanding correctly, you’re saying that the thing they will actually aim to align the AI with won’t reflect the doc itself if competitive pressures push in a different direction. Your point makes a lot of sense to me in terms of directing effort away from alignment and towards capabilities, but I think of this as increasing the risk that you fail to achieve the alignment target you are aiming for (not as changing the target). Could you give an example where you expect companies to actually aim for a significantly different target as a result of competitive pressures? An example that comes to mind for me is that competitive pressures might lead a company to make a model more helpful to users at the expense of misuse risk.
The whole alignment target includes the whole process by which Anthropic selects its reinforcement learning environments, and how it filters the pretraining data, and how it prompts and queries the model at runtime, and the architecture that they chose to do all of this work within
This doesn’t sound like the alignment target to me. It sounds like the process for achieving that target. I.e. the alignment target might say (among other things) “no sycophancy or reward hacking” and then Anthropic would choose its RL environments to achieve that target.
I’m thinking of the alignment target as the behaviour the company aims for the AI to display (represented in internal documents, or in ppl’s heads), and then objective functions and RL envs and Constitutional AI are all technical techniques for achieving it.
Analogy: When making a car Toyota has many technical documents describing how the car should look and function. When building it their factories have automated processes for welding together various parts. The documents are the “alignment target” for the car, and the factory is the process by which that target is achieved. Your comment seems to assume that the factory’s processes are the car’s alignment target.
Even in practice, the reinforcement learning environments that Anthropic chooses have a much greater effect on Claude behavior than the constitution.
This is totally compatible with the Constitution being the alignment target (on my usage, i’m wondering if you’re using the term differently). Again, separate out the alignment target they want to achieve from their process for actually aligning the AI. The Constitution describes the alignment target (at a high level!), then various processes (including processes downstream of the Constitution and unrelated RL envs) determine the model’s actual alignment. If RL-envs-unrelated-to-the-Constitution have a much bigger impact on alignment than processes-downstream-of-the-Constitution, then that’s worrying and it implies Claude will be misaligned—its actual alignment won’t match the target. But it doesn’t mean that those RL envs were the actual alignment target.
(To clarify my views: I agree that the Claude Constitution is high-level and underspecified, and that in practice Claude’s full “alignment target” resides not just there, but probably also in other internal docs and materials (like those feeding into Constitutional AI) and in ppl’s heads, and is in general just pretty underspecified.)
Fwiw it’s unclear to me whether Epoch’s methodology would make this mistake
After 2018 they have many data points at different times and compute scales. In principle, that should allow them to disentangle alg progress from the effect of compute. If models trained with the same compute but later in time are no better, they shouldn’t find any alg progress. (But I haven’t thought about this in a while, and the strong correlation between time and compute in their data makes their results super noisy.)
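Here’s a minimal sketch of that identification idea (my own construction with invented coefficients, not Epoch’s actual methodology): fit log-loss on log-compute and time jointly. A time coefficient of ~0 would mean no algorithmic progress was found, and the time–compute correlation mentioned above would show up as near-collinearity in the regressors.

```python
import numpy as np

# Toy loss model (invented coefficients, not Epoch's):
#   log(loss) = a - b*log(compute) - g*years,  g = algorithmic progress rate
rng = np.random.default_rng(0)
n = 200
years = rng.uniform(0, 7, n)          # years since 2018
log_compute = rng.uniform(18, 26, n)  # log FLOP, varied at every point in time
a, b, g = 5.0, 0.1, 0.15
log_loss = a - b * log_compute - g * years + rng.normal(0, 0.05, n)

# OLS with both regressors disentangles compute scaling from time,
# as long as years and log_compute aren't (nearly) collinear.
X = np.column_stack([np.ones(n), log_compute, years])
coef, *_ = np.linalg.lstsq(X, log_loss, rcond=None)
print(coef)  # ~[5.0, -0.1, -0.15]; a ~0 time coefficient => no alg progress
```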