If you want a mundane existence you can simulate that until you’re bored
My mundane values care about real physical stuff not simulated stuff.
Yes, I support something like uplifting, as described in other comments in this post.
yeah I think we should allow Christian homeschools to exist in the year 3000.
But this cuts against some other moral intuitions, like “people shouldn’t be made worse off as a means to an end” (e.g. I don’t think we should have wars as a means to inspire poets). And presumably the people in the Christian homeschools are worse off.
Maybe the compromise is something like:
On every day you are in the homeschool, we will “uplift” you if we think you would “ideally” want that.
Something like pre-existence theodicy, i.e. people born to Christian homeschooling parents consent to that life before they are incarnate, possibly in return for compensation (supposing something like souls or personal identity exists).
I’m hopeful the details can be fleshed out in late crunch-time.
My best bet for what we should do with the North Sentinelese—and with everyone post-singularity—is that we uplift them if we think they would “ideally” want that. And “ideally” is in scare quotes because no one knows what that means.
North Sentinelese Post-Singularity
it seems like the main reason people got less doomy was seeing that other people were working hard on the problem [...]
This would be v surprising to me!
It seems like, to the extent that we’re less doomy about survival/flourishing, this isn’t bc we’ve seen a surprising amount of effort, and think effort is v correlated with success. It’s more like: our observations increase our confidence that the problem was easy all along, or that we have been living in a ‘lucky’ world all along.
I might ask you about this when I see you next—I didn’t attend the workshop so maybe I’m just wrong here.
I disagree that the primary application of safety research is improving refusal calibration. This take seems outdated by ~12 months.
I think labs are incentivised to share safety research even when they don’t share capability research. This follows from a simple microeconomic model (a toy version is sketched below), but I wouldn’t be surprised if the prediction was completely wrong.
Asymmetry between capability and safety:
Capability failures are more attributable than safety failures. If ChatGPT can’t solve a client’s problem, it’s easy for Anthropic to demonstrate that Claude can, so the client switches. But if ChatGPT blackmails a client, it’s difficult for Anthropic to demonstrate that Claude is any safer (because safety failures are rare), so the client is dissuaded from AI in general.
Capability failures are more recoverable than safety failures. If ChatGPT can’t solve a problem, the client tries again with Claude. If ChatGPT causes extinction, the clients are dead.
What this predicts:
Labs are incentivised to share safety research and disincentivised to share capability research.
Labs will underinvest in safety relative to what’s socially optimal. Each lab captures only a fraction of the benefit from its safety investment — the rest spills over to competitors via a healthier industry.
The result is an industry that is more coordinated than you might expect (on safety sharing) but less safe than it should be (on safety investment).
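Here’s a minimal toy version of the model (my own sketch; the quadratic cost and the 50/50 benefit split are illustrative assumptions, not claims about how labs actually work):

```python
# Toy model (my own sketch, not from the comment above): two symmetric labs
# each pick a safety investment level s >= 0. Safety spills over, so each lab
# captures only half of the industry-wide benefit b*(s1 + s2), but pays its
# own quadratic cost s**2 / 2 in full.
import numpy as np

b = 1.0                               # marginal industry-wide benefit of safety
s_grid = np.linspace(0.0, 2.0, 2001)  # candidate investment levels

def best_response(other_s):
    """One lab's payoff-maximising investment, holding the other lab fixed."""
    payoff = 0.5 * b * (s_grid + other_s) - 0.5 * s_grid**2
    return s_grid[np.argmax(payoff)]

# The best response doesn't depend on the other lab's choice, so it is also
# the Nash equilibrium investment level.
nash = best_response(other_s=0.0)

# Socially optimal common level: maximise the two labs' combined payoff.
total_payoff = b * (2 * s_grid) - 2 * (0.5 * s_grid**2)
social = s_grid[np.argmax(total_payoff)]

print(f"Nash safety investment per lab:      {nash:.2f}")    # -> 0.50
print(f"Socially optimal investment per lab: {social:.2f}")  # -> 1.00
```

In this toy setup each lab invests half the socially optimal amount because it internalises only half of the benefit, which is the “underinvest relative to what’s socially optimal” prediction above.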
Can we define Embedded Agent like we define AIXI?
An embedded agent should be able to reason accurately about its own origins. But AIXI-style definitions via argmax create agents that, if they reason correctly about selection processes, should conclude they’re vanishingly unlikely to exist.
Consider an agent reasoning: “What kind of process could have produced me?” If the agent is literally the argmax of some simple scoring function, then the selection process must have enumerated all possible agents, evaluated f on each, and picked the maximum. This is physically unrealizable: it requires resources exceeding what’s available in the environment. So the agent concludes that it wasn’t generated by the argmax.
The following seems like sound reasoning for an embedded agent: “I am a messy physical system in a messy universe, generated by a messy process. It is unlikely that my behavior is a clean mathematical function generated by argmaxing another clean mathematical function.”
Yet for Embedded-AIXI defined via argmax, this reasoning is fallacious. This is a very handwavy obstacle to an AIXI-style definition of embedded agency.
Another gloss: we can’t define what it means for an embedded agent to be “ideal” because embedded agents are messy physical systems, and messy physical systems are never ideal. At most they’re “good enough”. So we should only hope to define when an embedded agent is good enough. Moreover, such agents must be generated by a physically realistic selection process.
This motivates Wentworth’s (mostly abandoned) project of Selection Theorems, i.e. studying physically realistic generators of good enough embedded agents.
- ^ By AIXI-style I mean: we have some space of agents X, a real-valued scoring function f on X, and define the ideal agent as the argmax of f.
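To make the footnote concrete (my notation, not the author’s): the “ideal agent” is

$$\pi^{\ast} = \arg\max_{\pi \in X} f(\pi),$$

and the physical-realizability worry above is that any selection process which literally computes this argmax must enumerate X and evaluate f on every candidate, which for any rich agent space costs more than the resources available inside the environment the agent is embedded in.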
Some thoughts on public outreach and “Were they early because they were good or lucky?”
Who are the best newcomers to AI safety? I’d be interested to hear anyone’s takes, not just Richard’s. Who has done great work (by your lights) since joining the field after ChatGPT?
Rob Miles was the high watermark of public outreach. Unfortunately he stopped making videos. I’d be far more excited by a newcomer if they were persuaded by a Rob Miles video than by an 80K video—videos like 80K’s “We’re Not Ready for Superintelligence”[1] are better on legible/easy-to-measure dimensions but worse in some more important way, I think.
I observe a suspicious amount of ‘social contagion’ among the pre-ChatGPT AI Safety crowd, which updates me somewhat in favour of “lucky” over “good”.[2]
- ^ A bit anecdotal but: there are ~ a dozen people who went to our college in 2017-2020 now working full-time in AI safety, which is much higher than for other colleges at the same university. I’m not saying any of us are particularly “great”—but this suggests social contagion / information cascade, rather than “we figured this stuff out from the empty string”. Maybe if you go back further (e.g. 2012-2016) there was less social contagion, and that cohort is better?
I think your remarks suggest that alignment to the level of top humans will happen by default, but not alignment to god-like superintelligence. That said, if we get aligned top-human AIs, then we can defer the rest of the alignment problem to them.
If I were sure that top-human-level AIs will be aligned by default, here’s what I might work on instead:
Automated philosophy
Commitment races / safe bargaining
Animal suffering
Space governance
Coordination tech
Empowering people with good values
Archiving data that aligned AIs might need (e.g. cryonics)
Ensuring nobody becomes God-Emperor Forever
See Tom Davidson’s report on AI-Enabled Coups, which includes some technical open problems.
Imagine you would put someone very opinionated like Nate Soares in charge, he would probably remove 80% of mentors and reduce the program to 10-20 people. I am not sure here if this would work out well.
I’m pretty sure this would work out poorly.
In deployment, we should expect our actions to line up with our values, thus triggering the “ruin the universe for as many as possible” behavior.
It seeks to max harms?
If you want to do interpretability research in the standard paradigm, Goodfire exists.
for what it’s worth, I think Goodfire is taking a non-standard approach to interpretability research—more so than (e.g.) Transluce. (I’m not claiming that the non-standard approach is better than the standard one.)
Hey Ryan, nice post. Here are some thoughts.
Anti-correlated attributes: “Founder‑mode” is somewhat anti‑natural to “AI concern.” The cognitive style most attuned to AI catastrophic risk (skeptical, risk‑averse, theory-focused) is not the same style that woos VCs, launches companies, and ships MVPs. If we want AI safety founders, we need to counterweight the selection against risk-tolerant cognitive styles to prevent talent drift and attract more founder-types to AI safety.
I think AI safety founders should be risk-averse.
For-profit investors like risk-seeking founders because for-profit orgs have unlimited upside and limited downside (you can’t lose more money than you invest), and hence investors can expect ROI on a portfolio of high-variance, decorrelated startups. You get high variance with risk-seeking founders, and decorrelation with contrarian founders. But AI safety isn’t like this. The downside is just as unlimited as the upside, so you can’t expect ROI simply because the orgs are high-variance and uncorrelated; cf. the unilateralist’s curse.
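A toy way to see the asymmetry (my own Monte Carlo illustration; the floor at -1 stands in for “you can’t lose more money than you invest”):

```python
# Toy Monte Carlo (my own illustration): draw zero-mean "outcomes" with
# increasing variance. A for-profit investor's loss is capped at the stake
# (outcome floored at -1), so extra variance is free upside for them; with
# no floor on the downside, extra variance buys no expected value.
import numpy as np

rng = np.random.default_rng(0)

for sigma in (1, 4, 16):
    outcomes = rng.normal(loc=0.0, scale=sigma, size=1_000_000)
    capped = np.maximum(outcomes, -1.0)  # limited downside: lose at most the stake
    print(f"sigma={sigma:>2}:  mean with capped downside = {capped.mean():+.2f},  "
          f"mean with unlimited downside = {outcomes.mean():+.2f}")

# The capped mean grows with sigma while the uncapped mean stays ~0: a portfolio
# of high-variance, decorrelated bets is only attractive when losses are bounded.
```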
An influential memo from 2022 argued against “mass movement building” in AI safety on the grounds that it would dilute the quality of the field; subsequently, frontier AI companies grew 2-3x/year, apparently unconcerned by dilution.
I think frontier labs have an easier time selecting for talent than AI safety orgs. Partly because they need to care less about virtue/mission alignment.
Remember Bing Sydney?
I don’t have anything insightful to say here. But it’s surprising how little people mention Bing Sydney.
If you ask people for examples of misaligned behaviour from AIs, they might mention:
Sycophancy from 4o
Goodharting unit tests from o3
Alignment-faking from Opus 3
Blackmail from Opus 4
But like, three years ago, Bing Sydney. The most powerful chatbot was connected to the internet and — unexpectedly, without provocation, apparently contrary to its training objective and prompting — threatening to murder people!
Are we memory-holing Bing Sydney or are there good reasons for not mentioning this more?
Here are some extracts from “Bing Chat is blatantly, aggressively misaligned” (Evan Hubinger, 15th Feb 2023).
Does AI-automated AI R&D count as “Recursive Self-Improvement”? I’m not sure what Yudkowsky would say, but regardless, enough people would count it that I’m happy to concede some semantic territory. The best thing (imo) is just to distinguish them with an adjective.
(This was sitting in my drafts, but I’ll just comment it here bc it makes a very similar point.)
There are two forms of “Recursive Self-Improvement” that people often conflate, but they have very different characteristics.
Introspective RSI: Much like a human, an AI will observe, understand, and modify its own cognitive processes. This ability is privileged: the AI can make these self-observations and self-modifications because the metacognition and mesocognition occur within the same entity. While performing cognitive tasks, the AI simultaneously performs the meta-cognitive task of improving its own cognition.
Extrospective RSI: AIs will automate various R&D tasks that humans currently perform to improve AI, using similar workflows that humans currently use. For example, studying literature, forming hypotheses, writing code, running experiments, analyzing data, drawing conclusions, and publishing results. The object-level cognition and meta-level cognition occur in different entities.
I wish people were more careful about the distinction, because cached opinions about the former get carelessly generalised to the latter. In particular, the former seems more dangerous: there is less opportunity to monitor the metacognition’s observation and modification of the mesocognition when both occur within the same entity, i.e. within the same activations and chain-of-thought.
Introspective RSI (left) vs Extrospective RSI (right)
what’s the principle here? if an agent would have the same observations in world W and W’, then their preferences must be indifferent between W and W’? this seems clearly false.