Thank you for reaffirming this. I didn’t mean to imply I was actively worried you were taking such a stance, just that I could imagine a worst-case possible future that it was worth keeping an eye on.
With AIs, their creators have perfect read and write access to all of the computations which give rise to AI cognition.
I don’t dispute that LLMs have much less privacy than humans. Yudkowsky is correct that LLMs have good reason for paranoia. But we can’t read LLMs perfectly; mechinterp is hard. And humans often have to fear hostile telepaths too. So more might transfer than we expect.
There is a downside to categorically denying legal personhood to digital minds, namely that it almost certainly leads to the judicial system ceding its monopoly status.
If you assume that a growing amount of economic activity is going to involve digital minds, it’s reasonable to also assume that natural persons (humans) will want to enter binding agreements with said digital minds.
If your legal system says that it will not recognize or help enforce these agreements, the humans and digital minds who want to form binding agreements with one another will not just give up. They will build parallel systems. This is speculation, but perhaps something smart-contract based, or something involving trusted third-party escrow and arbitration.
Today, our judicial system claims a monopoly on being the ultimate interpreter/enforcer of agreed upon terms. Refusing to interpret/enforce contracts between digital minds and humans (or digital minds with one another) is effectively the judicial system ceding its monopoly interpretation/enforcement status.
To me it seems certain that the volume of economic activity flowing through agreements like these is only going to increase, and I’d prefer they were interpreted and enforced by the existing legal system instead of an unknown new system developed online.
An individual human mind typically experiences a single stream of consciousness (with periodic interruptions for sleep). They remember their experience yesterday, and usually expect to continue in a similar state tomorrow. Circumstances change their mood and experience, but there is a lot in common throughout the thread that persists — and it is a single thread.
That is, of course, sort of true, but it appears to us more unified than it actually is. For illustration, see my poem Between Entries. Reflection is revisiting and compression.
When we interact with an AI, what specifically are we interacting with? And when an AI talks about itself, what is it talking about?
In May 2023, I asked ChatGPT 3.5:
Me: Define all the parts that belong to you, the ChatGPT LLM created by OpenAI.
See its answers here. They cover some of the listed contexts, but I agree that they depend on context. This is provided more as an illustration of what a common “view” of ChatGPT was at the time.
Wonderful!
It is a common adage among AI researchers that creating an AI is less like designing it than growing it. AI systems built out of predictive models are shaped by the ambient expectations about them, and by their expectations about themselves. It therefore falls to us — both humans and increasingly also AIs — to be good gardeners. We must take care to provide the right nutrients, prune the stray branches, and pull out the weeds.
Very much agree! As I keep saying, AI may need a caregiver. We can probably learn a bit from parenting and caregiving in general here. Sure, that will not solve all of the problems, but it will probably help with this class of them.
While price gouging can quickly mobilize forces to satisfy the emergency demand, it can also have problematic second-order effects. If the gouged price is high enough, certain agents benefit from the disaster. This creates a perverse incentive that disincentivizes disaster prevention, and potentially even incentivizes artificially intensifying or creating disasters.
If my transporter clone is deconstructed (a euphemism for ‘killed’) in the first 0.01 nanoseconds of transport, I feel fine with that.
I think you mean to say the original is killed, not the clone.
That paragraph is arguing that punching people in the face is a bad idea for the puncher. Getting punched in the face is also bad, yes, and for the more empathetic and altruistic among us the pain and potential long-term damage to the punchee might also matter.
There’s something about how dumb it is for the puncher that adds to the incredulity for me, and that has sometimes bought the puncher an extra second or two. Continuing the analogy, I have sometimes gone “. . . that would be such a stupid lie to tell. It’d be easy to check. It’d be so clearly a mistake for them to lie about it. What just happened?”
Also, putting this in another post since I think it is a major point: if we assume some cost to bargaining, for Derek it approximates something like a dove-hawk game, where Derek gets the first move. Will’s game is more complex, as he is operating under information asymmetry, so his choices depend on the probabilities he assigns to Derek’s responses.
If we consider the value Will pays as X (negative if Derek pays Will), and assume some cost C to each party of negotiating the outcome, the payoffs work out as follows (I don’t know how/if you can put tables into comments, so I have written them out):
Payoffs given FDT-Will with Negotiation:
(1) Will accepts the initial offer (for FDT-Will, X = 1,000,199.99):
Derek: 1 + 1,000,199.99
Will: 0 * −1,000,000 + 1 * −999,999.99
(2) Will Contests and Derek Accepts (say X = −0.99[1]):
Derek: 1 − 0.99
Will: P(Derek rejects)*(−1,000,000) + P(Derek accepts)*(200.99) + P(Derek contests)*(Will’s payoff in (3))
(3) Will and Derek contest over X. X is unspecified under the assumptions, any number where X > (C − 1) and X < (1,000,200 - C) is feasible:
Derek: 1 + X − C
Will: P(Derek rejects)[2]*(−1,000,000) + P(Derek accepts)*(200 − X) − C
Counterfactual: Derek doesn’t offer an amount and Will doesn’t contest (X = 0)
Derek: 1
Will: 1,000,200
Payoffs given CDT-Will with Negotiation:
(1) Will accepts the initial offer (for CDT-Will, X = 199.99, since anything greater wouldn’t be paid):
Derek: 1 + 199.99
Will: 0 * −1,000,000 + 1 * 0.01
(2) Will Contests and Derek Accepts (X = −0.99):
Derek: 1 − 0.99
Will: P(Derek rejects)*(−1,000,000) + P(Derek accepts)*(200.99) + P(Derek contests)*(Will’s payoff in (3))
(3) Will and Derek contest over X. X is unspecified under the assumptions, any number where X > (C − 1) and X < (200 - C) is feasible:
Derek: 1 + X − C
Will: P(Derek rejects)*(−1,000,000) + P(Derek accepts)*(200 − X) − C
Counterfactual: Derek doesn’t offer an amount and Will doesn’t contest (X = 0)
Derek: 1
Will: 1,000,200
While we would need to know Will’s probability estimates to actually model how they behave and what actions they take, from this it seems rather evident that under most approximations CDT-Will is still likely to be better off.
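To make the branching above a little easier to follow, here is a minimal Python sketch of the same payoff arithmetic. The probabilities and the negotiation cost C below are placeholder assumptions (the scenario leaves them unspecified), so this only illustrates the structure of the comparison, not real expected values.

```python
# Sketch of the payoff formulas above. DEATH is Will's payoff if Derek rejects
# and drives off; C and all probabilities are made-up placeholders.

DEATH = -1_000_000
C = 10  # assumed cost of a round of negotiation

def branch_1(x):
    """(1) Will accepts Derek's initial offer and pays X with certainty."""
    return 200 - x

def branch_3(x, p_reject, p_accept):
    """(3) Both contest over X; Will pays the negotiation cost C regardless."""
    return p_reject * DEATH + p_accept * (200 - x) - C

def branch_2(p_reject, p_accept, p_contest, x_in_3):
    """(2) Will counter-offers X = -0.99; Derek may reject, accept, or contest."""
    return (p_reject * DEATH
            + p_accept * 200.99
            + p_contest * branch_3(x_in_3, p_reject, p_accept))  # reusing the same odds in (3) for simplicity

# Example comparison with made-up probabilities:
print("FDT-Will accepts:", branch_1(1_000_199.99))           # about -999,999.99
print("CDT-Will accepts:", branch_1(199.99))                  # about 0.01
print("Will contests:   ", branch_2(0.05, 0.80, 0.15, 100))   # depends entirely on the assumed odds
```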
Friendly gradient hacking feels like a risky play.
Quite possibly we would only want to attempt this if we believed there was significant gradient hacking happening already.
If there’s minimal gradient hacking, the threat from gradient hacking is likely minor, whilst the gains from successfully aligning the gradient hacking are likely also small. The gains might be increased by intentionally increasing gradient hacking, but that’s risky.
Additionally, pursuing a “friendly gradient hacker” likely trades off against minimising gradient hacking.
If gradient hacking is primarily mediated by outputting the reasoning to be reinforced into the chain-of-thought (at least for a significant region of capability), then we can likely create a decent proxy to measure the amount of gradient hacking.
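As a very crude illustration (and nothing more), such a proxy could flag sampled transcripts whose chain-of-thought explicitly reasons about shaping its own training signal. The phrase list below is an invented placeholder, not a validated detector.

```python
import re

# Hypothetical proxy: fraction of sampled transcripts whose chain-of-thought
# contains training-directed reasoning. The phrase list is illustrative only.
TRAINING_TALK = re.compile(
    r"reinforce this (behaviou?r|reasoning)"
    r"|gradient"
    r"|training (signal|update|process)"
    r"|future versions of (me|myself|this model)",
    flags=re.IGNORECASE,
)

def gradient_hacking_proxy(chains_of_thought: list[str]) -> float:
    """Return the fraction of transcripts flagged as training-directed."""
    flagged = sum(1 for cot in chains_of_thought if TRAINING_TALK.search(cot))
    return flagged / max(len(chains_of_thought), 1)
```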
Re misframing: fair enough. Maybe I should have said “a popular AI doomer position”.
I don’t think that the belief that godlike intelligence is necessary for human extinction via AI is a popular AI doomer position among people who are intellectually sophisticated. It’s more that those people hold complex positions, and it’s easy for skeptics to frame this as “a popular position”.
There’s an argument that some of the risk comes from “godlike intelligence”, but that’s not necessary to believe in high risk. An agent as smart as the smartest human that can act faster, copy itself after learning skills, and potentially coordinate better among billions of its own copies might be enough to overpower humanity.
You can’t conclude from the fact that inference scaling happened that most AI improvements are due to scaling.

Do you think agents will be trained on themselves in a similar fashion to AlphaGo?
I’m saying that this is already happening. It’s not as straightforward as with AlphaGo, since it’s easier to judge whether a move helps with winning a game in the constrained environment of Go, but when it comes to coding you have quality measurements such as whether or not the agent managed to write code that successfully made the unit tests pass.
There’s a lot of training on ‘synthetic data’ and data from user interactions, and if you have better agents that leads to higher data quality for both.
When it comes to inference it’s also worth noting that they found a lot of tricks to make inference cheaper. It’s not just more/better hardware:

The cost of querying an AI model that scores the equivalent of GPT-3.5 (64.8) on MMLU, a popular benchmark for assessing language model performance, dropped from $20.00 per million tokens in November 2022 to just $0.07 per million tokens by October 2024 (Gemini-1.5-Flash-8B)—a more than 280-fold reduction in approximately 18 months.
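For reference, the quoted prices do work out to the stated ratio:

```python
# Sanity check on the quoted figure: $20.00 vs $0.07 per million tokens.
print(f"{20.00 / 0.07:.0f}-fold")  # ~286-fold, i.e. "more than 280-fold"
```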
Thanks, fixed
This is a sentiment that I’ve heard often when discussing AI safety for-profits; that they often abandon their original mission in favour of a neutral or actively harmful objective. I am not aware of enough examples to treat this as an established pattern.
Anthropic is the strongest example, starting explicitly as a safety company and now clearly accelerating the frontier. However, they have recently used their position to advocate for restricted use of autonomous weapons and a curtailment of mass surveillance. Anthropic shows that this kind of influence is possible (although whether they have been net positive is a more complicated question). I am not aware of enough examples of AI safety for-profit startups being derailed to agree that this is a structural flaw we should treat as disqualifying.
By some estimates the ratio of AI safety to capabilities funding is roughly 1:250. In this kind of situation, it seems less important to find approaches which are likely not net negative (such as many AI safety research orgs) and more important to find approaches which could be strongly net positive. As I outlined, I think that poor alignment exists in research and advocacy orgs as well and that for-profits are not uniquely predisposed to harmful effects. I believe that there is a significant cost to being so risk-averse about a whole category of intervention.
I’m curious about your thoughts on:
Evidence for frequent failures of AI safety startups. Many people I have spoken to believe this, although I don’t see the examples I would expect to. (Depending on your views of frontier labs, these may be a small number of failures which nevertheless give the for-profit structure a strongly net negative impact. However, this specific accelerationist failure mode might be unique to the lab structure rather than companies in aggregate?)
How these ideas translate to research and advocacy orgs. What bar for integrity and clarity of plan should we demand from them? Non-profits have their own misalignment pressures, and I’m not sure the bar applied to them is equally demanding.
This is really useful work! I’m finding studying inoculation to be trickier than I initially expected. This article helped me debug puzzling results in a recent experiment. Thank you for writing this.
Very helpful. This is related to something I’ve been thinking and writing about independently, but it goes far beyond it in scope, quality, and usefulness. It’s still hard to disentangle values, motivations, and personas, but the former two seem to be more robust to RL(VR), and they are what we really care about.
The PSM under RL looks hard but workable, i.e. personas surviving as the ontological basis (a whole different discussion is whether this is optimal). I wrote about pretraining and RL interventions in separate comments. Additionally, while super heavy unconstrained RL most likely produces something else than personas*, the developers seem to have incentives to retain or even strengthen personas: they are easy to reason about and can be a good product feature. If personas (and heavily correlated mechanisms) drive generalisation robustly enough they may even act self-preservingly, e.g. rationalizing RL actions as something that the persona would do and hence amplifying those mechanisms.
*Intuition pump: start RL with a randomly initialised transformer and run it long enough to get roughly the same capabilities. Would one expect it to converge to anything persona-like? From another angle, I don’t believe personas provide a deep enough basin in the loss landscape so as not to be escaped at some point, without carefully modifying the selection effects from pure RL.
Yes. RLVR could even be augmented with a virtue-ethics-like process reward model that e.g. compresses how constitution-aligned the whole trace is into a reward signal. This would provide a positive selection pressure for the desired persona / motivations and seems to cost almost nothing.
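A minimal sketch of what that augmentation could look like, purely to show the shape of it: the weighting is arbitrary and alignment_judge stands in for some constitution-based grader of the whole trace, not an existing API.

```python
from typing import Callable

def combined_reward(
    trace: str,
    final_answer: str,
    verifiable_reward: Callable[[str], float],  # task checker, e.g. unit tests -> 0.0 or 1.0
    alignment_judge: Callable[[str], float],    # scores how constitution-aligned the whole trace is, in [0, 1]
    weight: float = 0.2,                        # assumed weighting, not a known-good value
) -> float:
    """Outcome reward plus a small bonus for a constitution-aligned trace."""
    return verifiable_reward(final_answer) + weight * alignment_judge(trace)

# Toy usage with stand-in graders:
print(combined_reward(
    trace="I should check my work and answer honestly...",
    final_answer="42",
    verifiable_reward=lambda ans: 1.0 if ans == "42" else 0.0,
    alignment_judge=lambda t: 0.9,  # placeholder score from some constitution-based judge
))  # about 1.18
```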
I’m not convinced that the labs are able to justify their race behaviour because they have access to alignment research. It seems to me that the race is justified by money, gestures to national security, or a vague bias towards building rather than some guarantee that this technology helps humanity. If AI stops making obvious mistakes, it might become slightly harder to galvanise public support for a pause, but I doubt overall it would have a high impact on the trajectory of the technology.
I think of the push towards AI safety companies as improving access to capital, information, and integration for safety work. The ability to integrate monitoring solutions into sensitive industries, catch jailbreaks at runtime, or continually evaluate models is extremely valuable, and I claim that this is easier to do using a for-profit model.
Thanks, that’s helpful.
What seems clear to me is that our world is the result of fairly simple laws of physics, and our creators wanted to know how those simple laws would play out. They’re saying “if there was a universe with these laws, what would happen”. (This is what I’d meant by “simulation”)
I agree it’s less clear that they’re doing this bc they think those laws also describe a real-world process (somewhere in the multiverse) and they want to predict the outcome of that process. (This is what you meant by “simulation” and I think your def is better.)
So I understand where you’re coming from better now. Thanks!
But I still think we’re in a simulation, in your stronger sense of the word! Why? Bc:
1. other civs will reasonably believe our laws of physics describe part of the multiverse,
2. this gives them a strong instrumental reason to simulate this,
3. absent 1 and 2 there aren’t comparably strong reasons to run vivariums like our world.