I feel kinda frustrated whenever “shard theory” comes up in a conversation, because it’s not a theory, or even a hypothesis. In terms of its literal content, it basically seems to be a reframing of the “default” stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is “assume they’re just a set of heuristics”.
This is a particular pity because I think there’s a version of the “shard” framing which would actually be useful, but which shard advocates go out of their way to avoid. Specifically: we should be interested in “subagents” which are formed via hierarchical composition of heuristics and/or lower-level subagents, and which are increasingly “goal-directed” as you go up the hierarchy. This is an old idea, FWIW; e.g. it’s how Minsky frames intelligence in Society of Mind. And it’s also somewhat consistent with the claim made in the original shard theory post, that “shards are just collections of subshards”.
The problem is the “just”. The post also says “shards are not full subagents”, and that “we currently estimate that most shards are ‘optimizers’ to the extent that a bacterium or a thermostat is an optimizer.” But the whole point of thinking about shards, in my mind, is that it allows us to talk about a gradual spectrum from “heuristic” to “agent”, and how the combination of low-level heuristics may in fact give rise to high-level agents which pursue consequentialist goals. I talk about this in my post on value systematization—e.g. using the example of how normal human moral “shards” (like caring about other people’s welfare) can aggregate into highly-consequentialist utilitarian subagents. In other words, shard advocates seem so determined to rebut the “rational EU maximizer” picture that they’re ignoring the most interesting question about shards—namely, how do rational agents emerge from collections of shards?
(I make a similar point in the appendix of my value systematization post.)
I am not as negative on it as you are—it seems an improvement over the ‘Bag O’ Heuristics’ model and the ‘expected utility maximizer’ model. But I agree with the critique and said something similar here:
you go on to talk about shards eventually values-handshaking with each other. While I agree that shard theory is a big improvement over the models that came before it (which I call rational agent model and bag o’ heuristics model) I think shard theory currently has a big hole in the middle that mirrors the hole between bag o’ heuristics and rational agents. Namely, shard theory currently basically seems to be saying “At first, you get very simple shards, like the following examples: IF diamond-nearby THEN goto diamond. Then, eventually, you have a bunch of competing shards that are best modelled as rational agents; they have beliefs and desires of their own, and even negotiate with each other!” My response is “but what happens in the middle? Seems super important! Also haven’t you just reproduced the problem but inside the head?” (The problem being, when modelling AGI we always understood that it would start out being just a crappy bag of heuristics and end up a scary rational agent, but what happens in between was a big and important mystery. Shard theory boldly strides into that dark spot in our model… and then reproduces it in miniature! Progress, I guess.)
when the baby has a proto-world model, the reinforcement learning process takes advantage of that new machinery by further developing the juice-tasting heuristics. Suppose the baby models the room as containing juice within reach but out of sight. Then, the baby happens to turn around, which activates the already-trained reflex heuristic of “grab and drink juice you see in front of you.” In this scenario, “turn around to see the juice” preceded execution of “grab and drink the juice which is in front of me”, and so the baby is reinforced for turning around to grab the juice in situations where the baby models the juice as behind herself.
By this process, repeated many times, the baby learns how to associate world model concepts (e.g. “the juice is behind me”) with the heuristics responsible for reward (e.g. “turn around” and “grab and drink the juice which is in front of me”). Both parts of that sequence are reinforced. In this way, the contextual-heuristics become intertwined with the budding world model.
[...]
While all of this is happening, many different shards of value are also growing, since the human reward system offers a range of feedback signals. Many subroutines are being learned, many heuristics are developing, and many proto-preferences are taking root. At this point, the brain learns a crude planning algorithm, because proto-planning subshards (e.g. IF motor-command-5214 predicted to bring a juice pouch into view, THEN execute) would be reinforced for their contributions to activating the various hardcoded reward circuits. This proto-planning is learnable because most of the machinery was already developed by the self-supervised predictive learning, when e.g. learning to predict the consequences of motor commands (see Appendix A.1).
The planner has to decide on a coherent plan of action. That is, micro-incoherences (turn towards juice, but then turn back towards a friendly adult, but then turn back towards the juice, ad nauseam) should generally be penalized away. Somehow, the plan has to be coherent, integrating several conflicting shards. We find it useful to view this integrative process as a kind of “bidding.” For example, when the juice-shard activates, the shard fires in a way which would have historically increased the probability of executing plans which led to juice pouches. We’ll say that the juice-shard is bidding for plans which involve juice consumption (according to the world model), and perhaps bidding against plans without juice consumption.
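To make the “bidding” picture above concrete, here is a minimal sketch of one way it could be modelled. It is not taken from the shard theory post: the `Shard` class, the `choose_plan` function, the additive combination of bids, and every number are assumptions made up for illustration. Each shard activates only in certain contexts and bids for or against plans according to the outcomes the world model predicts for them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A plan is represented only by the outcomes the world model predicts for it.
Plan = Dict[str, bool]
Context = Dict[str, bool]


@dataclass
class Shard:
    """Hypothetical contextual shard: active in some contexts, bids on plans."""
    name: str
    is_active: Callable[[Context], bool]
    bid: Callable[[Plan], float]  # positive = bids for, negative = bids against


def choose_plan(context: Context, shards: List[Shard], plans: Dict[str, Plan]) -> str:
    """Return the name of the plan with the highest summed bid from active shards.

    Summing bids is an assumption; the post does not specify an aggregation rule.
    """
    active = [s for s in shards if s.is_active(context)]
    totals = {name: sum(s.bid(plan) for s in active) for name, plan in plans.items()}
    return max(totals, key=totals.get)


# Illustrative shards with made-up bid sizes.
juice_shard = Shard(
    name="juice",
    is_active=lambda ctx: ctx.get("juice_modeled_nearby", False),
    bid=lambda plan: 1.0 if plan.get("juice_consumed", False) else -0.5,
)
adult_shard = Shard(
    name="friendly_adult",
    is_active=lambda ctx: ctx.get("adult_visible", False),
    bid=lambda plan: 0.8 if plan.get("near_adult", False) else 0.0,
)

plans = {
    "turn_and_grab_juice": {"juice_consumed": True, "near_adult": False},
    "crawl_to_adult": {"juice_consumed": False, "near_adult": True},
}

if __name__ == "__main__":
    both = {"juice_modeled_nearby": True, "adult_visible": True}
    print(choose_plan(both, [juice_shard, adult_shard], plans))
```

Under these made-up numbers, the juice plan wins when both shards are active, and the adult plan wins when only the adult shard is active. The point is only that a single coherent plan gets selected per decision, rather than the agent oscillating between shards.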
I have some more models beyond what I’ve shared publicly, and eg one of my MATS applicants proposed an interesting story for how the novelty-shard forms, and also proposed one tack of research for answering how value negotiation shakes out (which is admittedly at the end of the gap). But overall I agree that there’s a substantial gap here. I’ve been working on writing out pseudocode for what shard-based reflective planning might look like.
In other words, shard advocates seem so determined to rebut the “rational EU maximizer” picture that they’re ignoring the most interesting question about shards—namely, how do rational agents emerge from collections of shards?
It’s not that there isn’t more shard theory content which I could write, it’s that I got stuck and burned out before I could get past the 101-level content.
I felt
a) gaslit by “I think everyone already knew this” or even “I already invented this a long time ago” (by people who didn’t seem to understand it); and that
b) I wasn’t successfully communicating many intuitions;[1] and
c) it didn’t seem as important to make theoretical progress anymore, especially since I hadn’t even empirically confirmed some of my basic suspicions that real-world systems develop multiple situational shards (as I later found evidence for in Understanding and controlling a maze-solving policy network).
So I didn’t want to post much on the site anymore because I was sick of it, and decided to just get results empirically.
In terms of its literal content, it basically seems to be a reframing of the “default” stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is “assume they’re just a set of heuristics”.
I’ve always read “assume heuristics” as expecting more of an “ensemble of shallow statistical functions” than “a bunch of interchaining and interlocking heuristics from which intelligence is gradually constructed.” Note that (at least in my head) the shard view is extremely focused on how intelligence (including agency) is comprised of smaller shards, and the developmental trajectory over which those shards formed.
a) gaslit by “I think everyone already knew this” or even “I already invented this a long time ago” (by people who didn’t seem to understand it); and that
Curious to hear whether I was one of the people who contributed to this.
I read the section you linked, but I can’t follow it. Anyway, here is its concluding paragraph:
Conclusion: Optimal policies for u-AOH will tend to look like random twitching. For example, if you generate a u-AOH by uniformly randomly assigning each AOH utility from the unit interval [0,1], there’s no predictable regularity to the optimal actions for this utility function. In this setting and under our assumptions, there is no instrumental convergence without further structural assumptions.
From this alone, I get the impression that he hasn’t proved that “there isn’t instrumental convergence”, but that “there isn’t a totally general instrumental convergence that applies even to very wild utility functions”.
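The distinction in the quoted conclusion can be illustrated with a toy simulation. This is a sketch under loud assumptions, not a reproduction of the linked section: the two-step environment, the action names, and the trial count are all invented here. The environment is deterministic, so an action-observation history is fixed by the action sequence. One first action (“stay”) is a dead end, while the other (“move”) keeps three final states reachable.

```python
import random
from itertools import product

FIRST_ACTIONS = ["stay", "move"]
SECOND_ACTIONS = [0, 1, 2]
STATES = ["A", "B", "C", "D"]
# All six action sequences; in this deterministic toy they stand in for AOHs.
HISTORIES = list(product(FIRST_ACTIONS, SECOND_ACTIONS))


def final_state(first: str, second: int) -> str:
    """Hypothetical dynamics: 'stay' always ends in A, 'move' can reach B, C or D."""
    return "A" if first == "stay" else "BCD"[second]


def fraction_move_optimal(utility_over: str, trials: int = 20000, seed: int = 0) -> float:
    """Estimate how often 'move' is the optimal first action for a random utility.

    utility_over="histories": i.i.d. uniform utility on each full history.
    utility_over="states":    i.i.d. uniform utility on each final state.
    """
    rng = random.Random(seed)
    move_count = 0
    for _ in range(trials):
        if utility_over == "histories":
            u = {h: rng.random() for h in HISTORIES}
            best = max(HISTORIES, key=lambda h: u[h])
        else:
            u = {s: rng.random() for s in STATES}
            best = max(HISTORIES, key=lambda h: u[final_state(*h)])
        move_count += best[0] == "move"
    return move_count / trials


if __name__ == "__main__":
    print("utility over final states:", fraction_move_optimal("states"))
    print("utility over histories:   ", fraction_move_optimal("histories"))
```

In this toy, random utilities over final states favor the option-preserving action about three times in four (three of the four states are only reachable via “move”), whereas random utilities over whole histories pick it about half the time (three of the six histories start with “move”), i.e. no better than chance. That matches the reading above: the result removes the fully general convergence, not convergence under additional structure.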
A key part of instrumental convergence is the convergence aspect, which as I understand it refers to the notion that even very wild utility functions will share certain preferences. E.g. the empirical tendency for random chess board evaluations to prefer mobility. If you don’t have convergence, you don’t have instrumental convergence.
The issue isn’t the “full trajectories” part; that actually makes instrumental convergence stronger. The issue is the “actions” part. In terms of RLHF, what this means is that people might not simply blindly follow the instructions given by AIs and rate them based on the ultimate outcome (even if the outcome differs wildly from what they’d intuitively think it’d do), but rather they might think about the instructions the AIs provide, and rate them based on whether they a priori make sense. If the AI then has some galaxybrained method of achieving something (which traditionally would be instrumentally convergent) that humans don’t understand, then that method will be negatively reinforced (because people don’t see the point of it and therefore downvote it), which eliminates dangerous powerseeking.
One fairly strong belief of mine is that Less Wrong’s epistemic standards are not high enough to make solid intellectual progress here. So far my best effort to make that argument has been in the comment thread starting here. Looking back at that thread, I just noticed that a couple of those comments have been downvoted to negative karma. I don’t think any of my comments have ever hit negative karma before; I find it particularly sad that the one time it happens is when I’m trying to explain why I think this community is failing at its key goal of cultivating better epistemics.
There’s all sorts of arguments to be made here, which I don’t have time to lay out in detail. But just step back for a moment. Tens or hundreds of thousands of academics are trying to figure out how the world works, spending their careers putting immense effort into reading and producing and reviewing papers. Even then, there’s a massive replication crisis. And we’re trying to produce reliable answers to much harder questions by, what, writing better blog posts, and hoping that a few of the best ideas stick? This is not what a desperate effort to find the truth looks like.
And we’re trying to produce reliable answers to much harder questions by, what, writing better blog posts, and hoping that a few of the best ideas stick? This is not what a desperate effort to find the truth looks like.
It seems to me that maybe this is what a certain stage in the desperate effort to find the truth looks like?
Like, the early stages of intellectual progress look a lot like thinking about different ideas and seeing which ones stand up robustly to scrutiny. Then the best ones can be tested more rigorously and their edges refined through experimentation.
It seems to me like there needs to be some point in the desperate search for truth in which you’re allowing for half-formed thoughts and unrefined hypotheses, or else you simply never get to a place where the hypotheses you’re creating even brush up against the truth.
In the half-formed thoughts stage, I’d expect to see a lot of literature reviews, agendas laying out problems, and attempts to identify and question fundamental assumptions. I expect that (not blog-post-sized speculation) to be the hard part of the early stages of intellectual progress, and I don’t see it right now.
Perhaps we can split this into technical AI safety and everything else. Above I’m mostly speaking about the “everything else” that Less Wrong wants to solve, since AI safety is now a substantial enough field that its problems need to be solved in more systemic ways.
In the half-formed thoughts stage, I’d expect to see a lot of literature reviews, agendas laying out problems, and attempts to identify and question fundamental assumptions. I expect that (not blog-post-sized speculation) to be the hard part of the early stages of intellectual progress, and I don’t see it right now.
I would expect that later in the process. Agendas laying out problems and fundamental assumptions don’t spring from nowhere (at least for me), they come from conversations where I’m trying to articulate some intuition, and I recognize some underlying pattern. The pattern and structure doesn’t emerge spontaneously, it comes from trying to pick around the edges of a thing, get thoughts across, explain my intuitions and see where they break.
I think it’s fair to say that crystallizing these patterns into a formal theory is a “hard part”, but the foundation for making it easy is laid out in the floundering and flailing that came before.
Ironically, some people already feel threatened by the high standards here. Setting them higher probably wouldn’t result in more good content. It would result in less mediocre content, but probably also less good content, as the authors who sometimes write a mediocre article and sometimes a good one, would get discouraged and give up.
Ben Pace gives a few examples of great content in the next comment. It would be better if it were easier to separate the good content from the rest, but that’s what the reviews are for. Well, only one review so far, if I remember correctly. I would love to see reviews of pre-2018 content (maybe multiple years in one review, if they were less productive). Then I would love to see the winning content get the same treatment as the Sequences: edited, arranged into a book, and made “required reading” for the community (available as a free PDF).
The top posts in the 2018 Review are filled with fascinating and well-explained ideas. Many of the new ideas are not settled science, but they’re quite original and substantive, or excellent distillations of settled science, and are often the best piece of writing on the internet about their topics.
You’re wrong about LW epistemic standards not being high enough to make solid intellectual progress; we already have. On AI alone (which I am using in large part because there’s vaguely more consensus around it than around rationality), I think you would have seen almost none of the public write-ups (like Embedded Agency and Zhukeepa’s Paul FAQ) without LessWrong, and I think a lot of them are brilliant.
I’m not saying we can’t do far better, or that we’re sufficiently good. Many of the examples of success so far are “Things that were in people’s heads but didn’t have a natural audience to share them with”. There’s not a lot of collaboration at present, which is why I’m very keen to build the new LessWrong Docs that allows for better draft sharing and inline comments and more. We’re working on the tools for editing tags, things like edit histories and so on, that will allow us to build a functioning wiki system to have canonical writeups and explanation that people add to and refine. I want future iterations of the LW Review to have more allowance for incorporating feedback from reviewers. There’s lots of work to do, and we’re just getting started. But I disagree the direction isn’t “a desperate effort to find the truth”. That’s what I’m here for.
Even in the last month or two, how do you look at things like this and this and this and this and not think that they’re likely the best publicly available pieces of writing in the world about their subjects? Wrt rationality, I expect things like this and this and this and this will probably go down as historically important LW posts that helped us understand the world, and make a strong showing in the 2020 LW Review.
As mentioned in my reply to Ruby, this is not a critique of the LW team, but of the LW mentality. And I should have phrased my point more carefully—“epistemic standards are too low to make any progress” is clearly too strong a claim, it’s more like “epistemic standards are low enough that they’re an important bottleneck to progress”. But I do think there’s a substantive disagreement here. Perhaps the best way to spell it out is to look at the posts you linked and see why I’m less excited about them than you are.
Of the top posts in the 2018 review, and the ones you linked (excluding AI), I’d categorise them as follows:
Interesting speculation about psychology and society, where I have no way of knowing if it’s true:
Local Validity as a Key to Sanity and Civilization
The Loudest Alarm Is Probably False
Anti-social punishment (which is, unlike the others, at least based on one (1) study).
Babble
Intelligent social web
Unrolling social metacognition
Simulacra levels
Can you keep this secret?
Same as above but it’s by Scott so it’s a bit more rigorous and much more compelling:
Is Science Slowing Down?
The tails coming apart as a metaphor for life
Useful rationality content:
Toolbox-thinking and law-thinking
A sketch of good communication
Varieties of argumentative experience
Review of basic content from other fields. This seems useful for informing people on LW, but not actually indicative of intellectual progress unless we can build on them to write similar posts on things that *aren’t* basic content in other fields:
Voting theory primer
Prediction markets: when do they work
Costly coordination mechanism of common knowledge (Note: I originally said I hadn’t seen many examples of people building on these ideas, but at least for this post there seems to be a lot.)
Six economics misconceptions
Swiss political system
It’s pretty striking to me how much the original sequences drew on the best academic knowledge, and how little most of the things above draw on the best academic knowledge. And there’s nothing even close to the thoroughness of Luke’s literature reviews.
The three things I’d like to see more of are:
1. The move of saying “Ah, this is interesting speculation about a complex topic. It seems compelling, but I don’t have good ways of verifying it; I’ll treat it like a plausible hypothesis which could be explored more by further work.” (I interpret the thread I originally linked as me urging Wei to do this).
2. Actually doing that follow-up work. If it’s an empirical hypothesis, investigating empirically. If it’s a psychological hypothesis, does it apply to anyone who’s not you? If it’s more of a philosophical hypothesis, can you identify the underlying assumptions and the ways it might be wrong? In all cases, how does it fit into existing thought? (That’ll probably take much more than a single blog post).
3. Insofar as many of these scattered plausible insights are actually related in deep ways, trying to combine them so that the next generation of LW readers doesn’t have to separately learn about each of them, but can rather download a unified generative framework.
(Thanks for laying out your position in this level of depth. Sorry for how long this comment turned out. I guess I wanted to back up a bunch of my agreement with words. It’s a comment for the sake of everyone else, not just you.)
I think there’s something to what you’re saying, that the mentality itself could be better. The Sequences have been criticized because Eliezer didn’t cite previous thinkers all that much, but at least as far as the science goes, as you said, he was drawing on academic knowledge. I also think we’ve lost something precious with the absence of epic topic reviews by the likes of Luke. Kaj Sotala still brings in heavily from outside knowledge, John Wentworth did a great review on Biological Circuits, and we get SSC crossposts that have that, but otherwise posts aren’t heavily referencing or building upon outside stuff. I concede that I would like to see a lot more of that.
I think Kaj was rightly disappointed that he didn’t get more engagement with his post whose gist was “this is what the science really says about S1 & S2, one of your most cherished concepts, LW community”.
I wouldn’t say the typical approach is strictly bad. There’s value in thinking freshly for oneself, and failing to reference previous material shouldn’t be a crime or make a text unworthy. But yeah, it’d be pretty cool if, after Alkjash laid out Babble & Prune (which intuitively feels so correct), someone had dug through what empirical science we have to see whether the picture lines up. Or heck, actually gone and done some kind of experiment. I bet it would turn up something interesting.
And I think what you’re saying is that the issue isn’t just that people aren’t following up with scholarship and empiricism on new ideas and models, but that they’re actually forgetting that these are the next steps. Instead, they’re overconfident in our homegrown models, as though LessWrong were the one place able to come up with good ideas. (Sorry, some of this might be my own words.)
The category I’d label a lot of LessWrong posts with is “engaging articulation of a point which is intuitive in hindsight” / “creation of common vocabulary around such points”. That’s pretty valuable, but I do think solving the hardest problems will take more.
-----
You use the word “reliably” in a few places. It feels like it’s doing some work in your statements, and I’m not entirely sure what you mean or why it’s important.
-----
A model which is interesting but maybe not of obvious connection. I was speaking to a respected rationalist thinker this week and they classified potential writing on LessWrong into three categories:
Writing stuff to help oneself figure things out. Like a diary, but publicly shared.
People exchanging “letters” as they attempt to figure things out. Like old school academic journals.
Someone having something mostly figured out but with a large inferential distance to bridge. They write a large collection of posts trying to cover that distance. One example is The Sequences, and more recent examples are from John Wentworth and Kaj Sotala.
I mention this because I recall you (alongside the rationalist thinker) complaining about the lack of people “presenting their worldviews on LessWrong”.
The kinds of epistemic norms I think you’re advocating for feel like a natural fit for the 2nd kind of writing, but it’s less clear to me how they should apply to people presenting worldviews. Maybe it’s no more complicated than this: it’s fine to present your worldview without a tonne of evidence, but people shouldn’t forget that the evidence hasn’t been presented, and its feeling intuitively correct isn’t enough.
-----
There’s something in here about Epistemic Modesty, something, something. Some part of me reads you as calling for more of that, which I’m wary of, but I don’t currently have more to say than flagging it as maybe a relevant variable in any disagreements here.
We probably do disagree about the value of academic sources, or what it takes to get value from them. Hmm. Maybe it’s something like there’s something to be said for thinking about models and assessing their plausibility yourself rather than relying on likely very flawed empirical studies.
Maybe I’m in favor of large careful reviews of what science knows but less in favor of trying to find sources for each idea or model that gets raised. I’m not sure.
-----
I can’t recall whether I’ve written publicly much about this, but a model I’ve had for a year or more is that for LW to make intellectual progress, we need to become a “community of practice”, not just a “community of interest”. Martial arts vs literal stamp collecting. (Streetfighting might be better still, since it actually tests real fighting ability.) It’s great that many people find LessWrong a guilty pleasure they feel less guilty about than Facebook, but for us to make progress, people need to see LessWrong as a place where one of the things you do is show up and do Serious Work, some of which is relatively hard and boring, like writing and reading lit reviews.
I suspect that a cap on the epistemic standards people hold stuff to is downstream of the level of effort people are calibrated on applying. But maybe it goes in the other direction, so I don’t know.
Probably the 2018 Review is biased towards the posts which are most widely read, i.e., those easiest and most enjoyable to read, rather than solely rewarding those with the best contributions. Not overwhelmingly, but enough. Maybe same for karma. I’m not sure how to relate to that.
-----
3. Insofar as many of these scattered plausible insights are actually related in deep ways, trying to combine them so that the next generation of LW readers doesn’t have to separately learn about each of them, but can rather download a unified generative framework.
This sounds partially like distillation work plus extra integration. And sounds pretty good to me too.
-----
I still remember my feeling of disillusionment in the LessWrong community relatively soon after I joined in late 2012. I realized that the bulk of members didn’t seem serious about advancing the Art. I never heard people discussing new results from cognitive science and how to apply them, even though that’s what the Sequences were in large part, and the Sequences hardly claimed to be complete! I guess I do relate somewhat to your “desperate effort” comment, though we’ve got some people trying pretty hard that I wouldn’t want to shortchange.
We do good stuff, but more is possible. I appreciate the reminder. I hope we succeed at pushing the culture and mentality in directions you like.
This is only tangentially relevant, but adding it here as some of you might find it interesting:
Venkatesh Rao has an excellent Twitter thread on why most independent research only reaches this kind of initial exploratory level (he tried it for a bit before moving to consulting). It’s pretty pessimistic, but there is a somewhat more optimistic follow-up thread on potential new funding models. Key point is that the later stages are just really effortful and time-consuming, in a way that keeps out a lot of people trying to do this as a side project alongside a separate main job (which I think is the case for a lot of LW contributors?)
Quote from that thread:
Research =
a) long time between having an idea and having something to show for it that even the most sympathetic fellow crackpot would appreciate (not even pay for, just get)
b) a >10:1 ratio of background invisible thinking in notes, dead-ends, eliminating options etc
With a blogpost, it’s like a week of effort at most from idea to mvp, and at most a 3:1 ratio of invisible to visible. That’s sustainable as a hobby/side thing.
To do research-grade thinking you basically have to be independently wealthy and accept 90% deadweight losses
Also just wanted to say good luck! I’m a relative outsider here with pretty different interests to LW core topics but I do appreciate people trying to do serious work outside academia, have been trying to do this myself, and have thought a fair bit about what’s currently missing (I wrote that in a kind of jokey style but I’m serious about the topic).
Thanks, these links seem great! I think this is a good (if slightly harsh) way of making a similar point to mine:
“I find that autodidacts who haven’t experienced institutional R&D environments have a self-congratulatory low threshold for what they count as research. It’s a bit like vanity publishing or fan fiction. This mismatch doesn’t exist as much in indie art, consulting, game dev etc”
Also, I liked your blog post! More generally, I strongly encourage bloggers to have a “best of” page, or something that directs people to good posts. I’d be keen to read more of your posts but have no idea where to start.
Thanks! I have been meaning to add a ‘start here’ page for a while, so that’s good to have the extra push :) Seems particularly worthwhile in my case because a) there’s no one clear theme and b) I’ve been trying a lot of low-quality experimental posts this year bc pandemic trashed motivation, so recent posts are not really reflective of my normal output.
For now some of my better posts in the last couple of years might be Cognitive decoupling and banana phones (tracing back the original precursor of Stanovich’s idea), The middle distance (a writeup of a useful and somewhat obscure idea from Brian Cantwell Smith’s On the Origin of Objects), and the negative probability post and its followup.
Quoting your reply to Ruby below, I agree I’d like LessWrong to be much better at “being able to reliably produce and build on good ideas”.
The reliability and focus feel most lacking to me on the building side, rather than the production side, which I think we’re doing quite well at. I think we’ve successfully formed a publishing platform that provides an audience that is intensely interested in good ideas around rationality, AI, and related subjects, and a lot of very generative and thoughtful people are writing down their ideas here.
We’re low on the ability to connect people up to do more extensive work on these ideas – most good hypotheses and arguments don’t get a great deal of follow up or further discussion.
Here are some subjects where I think there’s been various people sharing substantive perspectives, but I think there’s also a lot of space for more ‘details’ to get fleshed out and subquestions to be cleanly answered:
The above isn’t complete, it’s just some of the ones that come to mind as having lots of people sharing perspectives. And the list of people definitely isn’t complete.
Here are examples of things that I’d like to see more of, that feel more like doing the legwork to actually dive into the details:
Eli Tyre and Bucky replicating Scott’s birth-order hypothesis
Katja and the other fine people at AI Impacts doing long-term research on a question (discontinuous progress) with lots of historical datapoints
Jameson writing up his whole research question in great detail and very well, and then an excellent commenter turning up and answering it
Zhukeepa writing up an explanation of Paul’s research, allowing many more to understand it, and allowing Eliezer to write a response
Scott writing Goodhart Taxonomy, and the commenters banding together to find a set of four similar examples to add to the post
Val writing some interesting things about insight meditation, prompting Kaj to write a non-mysterious explanation
In the LW Review when Bucky checked out the paper Zvi analysed and argued it did not support the conclusions Zvi reached (this changed my opinion of Zvi’s post from ‘true’ to ‘false’)
The discussion around covid and EMH prompting Richard Meadows to write down a lot of the crucial and core arguments around the EMH
The above is also not mentioning lots of times when the person generating the idea does a lot of the legwork, like Scott or Jameson or Sarah or someone.
I see a lot of (very high quality) raw energy here that wants shaping and directing, with the use of lots of tools for coordination (e.g. better collaboration tools).
The epistemic standards being low is one way of putting it, but it doesn’t resonate with me much and kinda feels misleading. I think our epistemic standards are way higher than the communities you mention (historians, people interested in progress studies). Bryan Caplan said he knows of no group whose beliefs are more likely to be right in general than the rationalists, and this often seems accurate to me. I think we do a lot of exploration and generation and evaluation, just not in a very coordinated manner, and so could make progress at like 10x–100x the rate if we collaborated better, and I think we can get there without too much work.
“I see a lot of (very high quality) raw energy here that wants shaping and directing, with the use of lots of tools for coordination (e.g. better collaboration tools).”
Yepp, I agree with this. I guess our main disagreement is whether the “low epistemic standards” framing is a useful way to shape that energy. I think it is because it’ll push people towards realising how little evidence they actually have for many plausible-seeming hypotheses on this website. One proven claim is worth a dozen compelling hypotheses, but LW to a first approximation only produces the latter.
When you say “there’s also a lot of space for more ‘details’ to get fleshed out and subquestions to be cleanly answered”, I find myself expecting that this will involve people who believe the hypothesis continuing to build their castle in the sky, not analysis about why it might be wrong and why it’s not.
That being said, LW is very good at producing “fake frameworks”. So I don’t want to discourage this too much. I’m just arguing that this is a different thing from building robust knowledge about the world.
One proven claim is worth a dozen compelling hypotheses
I will continue to be contrary and say I’m not sure I agree with this.
For one, I think in many domains new ideas are really hard to come by, as opposed to making minor progress in the existing paradigms. Fundamental theories in physics, a bunch of general insights about intelligence (in neuroscience and AI), etc.
And secondly, I am reminded of what Lukeprog wrote in his moral consciousness report, that he wished the various different philosophies-of-consciousness would stop debating each other, go away for a few decades, then come back with falsifiable predictions. I sometimes take this stance regarding many disagreements of import, such as the basic science vs engineering approaches to AI alignment. It’s not obvious to me that the correct next move is for e.g. Eliezer and Paul to debate for 1000 hours, but instead to go away and work on their ideas for a decade then come back with lots of fleshed out details and results that can be more meaningfully debated.
I feel similarly about simulacra levels, Embedded Agency, and a bunch of IFS stuff. I would like to see more experimentation and literature reviews where they make sense, but I also feel like these are implicitly making substantive and interesting claims about the world, and I’d just be interested in getting a better sense of what claims they’re making, and have them fleshed out + operationalized more. That would be a lot of progress to me, and I think each of them is seeing that sort of work (with Zvi, Abram, and Kaj respectively leading the charges on LW, alongside many others).
I think I’m concretely worried that some of those models / paradigms (and some other ones on LW) don’t seem pointed in a direction that leads obviously to “make falsifiable predictions.”
And I can imagine worlds where “make falsifiable predictions” isn’t the right next step, you need to play around with it more and get it fleshed out in your head before you can do that. But there is at least some writing on LW that feels to me like it leaps from “come up with an interesting idea” to “try to persuade people it’s correct” without enough checking.
(In the case of IFS, I think Kaj’s sequence is doing a great job of laying it out in a concrete way where it can then be meaningfully disagreed with. But the other people who’ve been playing around with IFS didn’t really seem interested in that, and I feel like we got lucky that Kaj had the time and interest to do so.)
I feel like this comment isn’t critiquing a position I actually hold. For example, I don’t believe that “the correct next move is for e.g. Eliezer and Paul to debate for 1000 hours”. I am happy for people to work towards building evidence for their hypotheses in many ways, including fleshing out details, engaging with existing literature, experimentation, and operationalisation.
Perhaps this makes “proven claim” a misleading phrase to use. Perhaps more accurate to say: “one fully fleshed out theory is more valuable than a dozen intuitively compelling ideas”. But having said that, I doubt that it’s possible to fully flesh out a theory like simulacra levels without engaging with a bunch of academic literature and then making predictions.
Yepp, I agree with this. I guess our main disagreement is whether the “low epistemic standards” framing is a useful way to shape that energy. I think it is because it’ll push people towards realising how little evidence they actually have for many plausible-seeming hypotheses on this website.
A housemate of mine said to me they think LW has a lot of breadth, but could benefit from more depth.
I think in general when we do intellectual work we have excellent epistemic standards, capable of listening to all sorts of evidence that other communities and fields would throw out, and listening to subtler evidence than most scientists (“faster than science”), but that our level of coordination and depth is often low. “LessWrongers should collaborate more and go into more depth in fleshing out their ideas” sounds more true to me than “LessWrongers have very low epistemic standards”.
In general when we do intellectual work we have excellent epistemic standards, capable of listening to all sorts of evidence that other communities and fields would throw out, and listening to subtler evidence than most scientists (“faster than science”)
“Being more openminded about what evidence to listen to” seems like a way in which we have lower epistemic standards than scientists, and also that’s beneficial. It doesn’t rebut my claim that there are some ways in which we have lower epistemic standards than many academic communities, and that’s harmful.
In particular, the relevant question for me is: why doesn’t LW have more depth? Sure, more depth requires more work, but on the timeframe of several years, and hundreds or thousands of contributors, it seems viable. And I’m proposing, as a hypothesis, that LW doesn’t have enough depth because people don’t care enough about depth—they’re willing to accept ideas even before they’ve been explored in depth. If this explanation is correct, then it seems accurate to call it a problem with our epistemic standards—specifically, the standard of requiring (and rewarding) deep investigation and scholarship.
LW doesn’t have enough depth because people don’t care enough about depth—they’re willing to accept ideas even before they’ve been explored in depth. If this explanation is correct, then it seems accurate to call it a problem with our epistemic standards—specifically, the standard of requiring (and rewarding) deep investigation and scholarship.
Your solution to the “willingness to accept ideas even before they’ve been explored in depth” problem is to explore ideas in more depth. But another solution is to accept fewer ideas, or hold them much more provisionally.
I suspect trying to browbeat people to explore ideas in more depth works against the grain of an online forum as an institution. Browbeating works in academia because your career is at stake, but in an online forum, it just hurts intrinsic motivation and cuts down on forum use (the forum runs on what Clay Shirky called “cognitive surplus”, essentially a term for peoples’ spare time and motivation). I’d say one big problem with LW 1.0 that LW 2.0 had to solve before flourishing was people felt too browbeaten to post much of anything.
If we accept fewer ideas / hold them much more provisionally, but provide a clear path to having an idea be widely held as true, that creates an incentive for people to try & jump through hoops—and this incentive is a positive one, not a punishment-driven browbeating incentive.
Maybe part of the issue is that on LW, peer review generally happens in the comments after you publish, not before. So there’s no publication carrot to offer in exchange for overcoming the objections of peer reviewers.
“If we accept fewer ideas / hold them much more provisionally, but provide a clear path to having an idea be widely held as true, that creates an incentive for people to try & jump through hoops—and this incentive is a positive one, not a punishment-driven browbeating incentive.”
Hmm, it sounds like we agree on the solution but are emphasising different parts of it. For me, the question is: who’s this “we” that should accept fewer ideas? It’s the set of people who agree with my argument that you shouldn’t believe things which haven’t been fleshed out very much. But the easiest way to add people to that set is just to make the argument, which is what I’ve done. Specifically, note that I’m not criticising anyone for producing posts that are short and speculative: I’m criticising the people who update too much on those posts.
Fair enough. I’m reminded of a time someone summarized one of my posts as being a definitive argument against some idea X and me thinking to myself “even I don’t think my post definitively settles this issue” haha.
I do think right now LessWrong should lean more in the direction Richard is suggesting – I think it was essential to establish better Babble procedures but now we’re doing well enough on that front that I think setting clearer expectations of how the eventual pruning works is reasonable.
I wanted to register that I don’t like “babble and prune” as a model of intellectual development. I think intellectual development actually looks more like:
1. Babble
2. Prune
3. Extensive scholarship
4. More pruning
5. Distilling scholarship to form common knowledge
And that my main criticism is the lack of 3 and 5, not the lack of 2 or 4.
I also note that: a) these steps get monotonically harder, so that focusing on the first two misses *almost all* the work; b) maybe I’m being too harsh on the babble and prune framework because it’s so thematically appropriate for me to dunk on it here; I’m not sure if your use of the terminology actually reveals a substantive disagreement.
I basically agree with your 5-step model (I at least agree it’s a more accurate description than Babble and Prune, which I just meant as rough shorthand). I’d add things like “original research/empiricism” or “more rigorous theorizing” to the “Extensive Scholarship” step.
I see the LW Review as basically the first of (what I agree should essentially be at least) a 5 step process. It’s adding a stronger Step 2, and a bit of Step 5 (at least some people chose to rewrite their posts to be clearer and respond to criticism)
...
Currently, we do get non-zero Extensive Scholarship and Original Empiricism. (Kaj’s Multi-Agent Models of Mind seems like it includes real scholarship. Scott Alexander / Eli Tyre and Bucky’s exploration into Birth Order Effects seemed like real empiricism). Not nearly as much as I’d like.
If the cost of evaluating a hypothesis is high, and hypotheses are cheap to generate, I would like to generate a great deal before selecting one to evaluate.
But, honestly… I’m not sure it’s actually a question that was worth asking. I’d like to know if Eliezer’s hypothesis about mathematicians is true, but I’m not sure it ranks near the top of questions I’d want people to put serious effort into answering.
I do want LessWrong to be able to follow up Good Hypotheses with Actual Research, but it’s not obvious which questions are worth answering. OpenPhil et al are paying for some types of answers, I think usually by hiring researchers full time. It’s not quite clear what the right role is for LW to play in the ecosystem.
All else equal, the harder something is, the less we should do it.
My quick take is that writing lit reviews/textbooks is a comparative disadvantage of LW relative to the mainstream academic establishment.
In terms of producing reliable knowledge… if people actually care about whether something is true, they can always offer a cash prize for the best counterargument (which could of course constitute citation of academic research). The fact that people aren’t doing this suggests to me that for most claims on LW, there isn’t any (reasonably rich) person who cares deeply re: whether the claim is true. I’m a little wary of putting a lot of effort into supply if there is an absence of demand.
(I guess the counterargument is that accurate knowledge is a public good so an individual’s willingness to pay doesn’t get you a complete picture of the value accurate knowledge brings. Maybe what we need is a way to crowdfund bounties for the best argument related to something.)
(I agree that LW authors would ideally engage more with each other and academic literature on the margin.)
I’ve been thinking about the idea of “social rationality” lately, and this is related. We do so much here in the way of training individual rationality—the inputs, functions, and outputs of a single human mind. But if truth is a product, then getting human minds well-coordinated to produce it might be much more important than training them to be individually stronger. Just as assembly line production is much more effective in producing almost anything than teaching each worker to be faster in assembling a complete product by themselves.
My guess is that this could be effective not only in producing useful products, but also in overcoming biases. Imagine you took 5 separate LWers and asked them to create a unified consensus response to a given article. My guess is that they’d learn more through that collective effort, and produce a more useful response, than if they spent the same amount of time individually evaluating the article and posting their separate replies.
Of course, one of the reasons we don’t do that so much is that coordination is an up-front investment and is unfamiliar. Figuring out social technology to make it easier to participate in might be a great project for LW.
There’s been a fair amount of discussion of that sort of thing here: https://www.lesswrong.com/tag/group-rationality There are also groups outside LW thinking about social technology such as RadicalxChange.
Imagine you took 5 separate LWers and asked them to create a unified consensus response to a given article. My guess is that they’d learn more through that collective effort, and produce a more useful response, than if they spent the same amount of time individually evaluating the article and posting their separate replies.
I’m not sure. If you put those 5 LWers together, I think there’s a good chance that the highest status person speaks first and then the others anchor on what they say and then it effectively ends up being like a group project for school with the highest status person in charge. Some related links.
That’s definitely a concern too! I imagine such groups forming among people who already share a basic common view and collaborate to investigate more deeply. That way, any status-anchoring effects are mitigated.
Alternatively, it could be an adversarial collaboration. For me personally, some of the SSC essays in this format have led me to change my mind in a lasting way.
they’re willing to accept ideas even before they’ve been explored in depth
People also reject ideas before they’ve been explored in depth. I’ve tried to discuss similar issues with LW before but the basic response was roughly “we like chaos where no one pays attention to whether an argument has ever been answered by anyone; we all just do our own thing with no attempt at comprehensiveness or organizing who does what; having organized leadership of any sort, or anyone who is responsible for anything, would be irrational” (plus some suggestions that I’m low social status and that therefore I personally deserve to be ignored. there were also suggestions – phrased rather differently but amounting to this – that LW will listen more if published ideas are rewritten, not to improve on any flaws, but so that the new versions can be published at LW before anywhere else, because the LW community’s attention allocation is highly biased towards that).
I feel somewhat inclined to wrap up this thread at some point, even while there’s more to say. We can continue if you like and have something specific or strong you’d like to ask, but otherwise will pause here.
You have to realise that what you are doing isn’t adequate in order to gain the motivation to do it better, and that is unlikely to happen if you are mostly communicating with other people who think everything is OK.
Lesswrong is competing against philosophy as well as science, and philosophy has broader criteria of evidence still. In fact, lesswrongians are often frustrated that mainstream philosophy takes such topics as dualism or theism seriously, even though there’s an abundance of Bayesian evidence for them.
One proven claim is worth a dozen compelling hypotheses, but LW to a first approximation only produces the latter.
Depends on the claim, right?
If the cost of evaluating a hypothesis is high, and hypotheses are cheap to generate, I would like to generate a great deal before selecting one to evaluate.
Right, but this isn’t mentioned in the post? Which seems odd. Maybe that’s actually another example of the “LW mentality”: why is the fact that there has been solid empirical research into 3 layers not being enough not important enough to mention in a post on why 3 layers isn’t enough? (Maybe because the post was time-boxed? If so that seems reasonable, but then I would hope that people comment saying “Here’s a very relevant paper, why didn’t you cite it?”)
Much of the same is true of scientific journals. Creating a place to share and publish research is a pretty key piece of intellectual infrastructure, especially for researchers to create artifacts of their thinking along the way.
The point about being ‘cross-posted’ is where I disagree the most.
This is largely original content that counterfactually wouldn’t have been published, or occasionally would have been published but to a much smaller audience. What Failure Looks Like wasn’t crossposted, Anna’s piece on reality-revealing puzzles wasn’t crossposted. I think that Zvi would have still written some on mazes and simulacra, but I imagine he writes substantially more content given the cross-posting available for the LW audience. Could perhaps check his blogging frequency over the last few years to see if that tracks. I recall Zhu telling me he wrote his FAQ because LW offered an audience for it, and likely wouldn’t have done so otherwise. I love everything Abram writes, and while he did have the Intelligent Agent Foundations Forum, it had a much more concise, technical style, tiny audience, and didn’t have the conversational explanations and stories and cartoons that have been so excellent and well received on LW, and it wouldn’t as much have been focused on the implications for rationality of things like logical inductors. Rohin wouldn’t have written his coherence theorems piece or any of his value learning sequence, and I’m pretty sure about that because I personally asked him to write that sequence, which is a great resource and I’ve seen other researchers in the field physically print off to write on and study. Kaj has an excellent series of non-mystical explanations of ideas from insight meditation that started as a response to things Val wrote, and I imagine those wouldn’t have been written quite like that if that context did not exist on LW.
I could keep going, but probably have made the point. It seems weird to not call this collectively a substantial amount of intellectual progress, on a lot of important questions.
I am indeed focusing right now on how to do more ‘conversation’. I’m in the middle of trying to host some public double cruxes for events, for example, and some day we will finally have inline commenting and better draft sharing and so on. It’s obviously not finished.
Rohin wouldn’t have written his coherence theorems piece or any of his value learning sequence, and I’m pretty sure about that because I personally asked him to write that sequence
Yeah, that’s true, though it might have happened at some later point in the future as I got increasingly frustrated by people continuing to cite VNM at me (though probably it would have been a blog post and not a full sequence).
Reading through this comment tree, I feel like there’s a distinction to be made between “LW / AIAF as a platform that aggregates readership and provides better incentives for blogging”, and “the intellectual progress caused by posts on LW / AIAF”. The former seems like a clear and large positive of LW / AIAF, which I think Richard would agree with. For the latter, I tend to agree with Richard, though perhaps not as strongly as he does. Maybe I’d put it as, I only really expect intellectual progress from a few people who work on problems full time who probably would have done similar-ish work if not for LW / AIAF (but likely would not have made it public).
I’d say this mostly for the AI posts. I do read the rationality posts and don’t get a different impression from them, but I also don’t think enough about them to be confident in my opinions there.
Thanks for chiming in with this. People criticizing the epistemics is hopefully how we get better epistemics. When the Californian smoke isn’t interfering with my cognition as much, I’ll try to give your feedback (and Rohin’s) proper attention. I would generally be interested to hear your arguments/models in detail, if you get the chance to lay them out.
My default position is LW has done well enough historically (e.g. Ben Pace’s examples) for me to currently be investing in getting it even better. Epistemics and progress could definitely be a lot better, but getting there is hard. If I didn’t see much progress on the rate of progress in the next year or two, I’d probably go focus on other things, though I think it’d be tragic if we ever lost what we have now.
And another thought:
And we’re trying to produce reliable answers to much harder questions by, what, writing better blog posts
Yes and no. Journal articles have their advantages, and so do blog posts. A bunch of the LessWrong team’s recent work has been around filling in the missing pieces for the system to work, e.g. Open Questions (hasn’t yet worked for coordinating research), Annual Review, Tagging, Wiki. We often talk about conferences and “campus”. My work on Open Questions involved thinking about i) a better template for articles than “Abstract, Intro, Methods, etc.”, but Open Questions didn’t work for unrelated reasons we haven’t overcome yet, ii) getting lit reviews done systematically by people, iii) coordinating groups around research agendas.
I’ve thought about re-attempting the goals of Open Questions with instead a “Research Agenda” feature that lets people communally maintain research agendas and work on them. It’s a question of priorities whether I work on that anytime soon.
I do really think many of the deficiencies of LessWrong’s current work compared to academia are “infrastructure problems” at least as much as the epistemic standards of the community. Which means the LW team should be held culpable for not having solved them yet, but it is tricky.
For the record, I think the LW team is doing a great job. There’s definitely a sense in which better infrastructure can reduce the need for high epistemic standards, but it feels like the thing I’m pointing at is more like “Many LW contributors not even realising how far away we are from being able to reliably produce and build on good ideas” (which feels like my criticism of Ben’s position in his comment, so I’ll respond more directly there).
It seems really valuable to have you sharing how you think we’re falling epistemically short and probably important for the site to integrate the insights behind that view. There are a bunch of ways I disagree with your claims about epistemic best practices, but it seems like it would be cool if I could pass your ITT more. I wish your attempt to communicate the problems you saw had worked out better. I hope there’s a way for you to help improve LW epistemics, but also get that it might be costly in time and energy.
I just noticed that a couple of those comments have been downvoted to negative karma
Now they’re positive again.
Confusing to me, their Ω-karma (karma on another website) is also positive. Does it mean they previously had negative LW-karma but positive Ω-karma? Or that their Ω-karma also improved as a result of you complaining on LW a few hours ago? Why would it?
(Feature request: graph of evolution of comment karma as a function of time.)
I’d be curious what, if any, communities you think set good examples in this regard. In particular, are there specific academic subfields or non-academic scenes that exemplify the virtues you’d like to see more of?
Maybe historians of the industrial revolution? Who grapple with really complex phenomena and large-scale patterns, like us, but unlike us use a lot of data, write a lot of thorough papers and books, and then have a lot of ongoing debate on those ideas. And then the “progress studies” crowd is an example of an online community inspired by that tradition (but still very nascent, so we’ll see how it goes).
More generally I’d say we could learn to be more rigorous by looking at any scientific discipline or econ or analytic philosophy. I don’t think most LW posters are in a position to put in as much effort as full-time researchers, but certainly we can push a bit in that direction.
Thanks for your reply! I largely agree with drossbucket’s reply.
I also wonder how much this is an incentives problem. As you mentioned and in my experience, the fields you mentioned strongly incentivize an almost fanatical level of thoroughness that I suspect is very hard for individuals to maintain without outside incentives pushing them that way. At least personally, I definitely struggle and, frankly, mostly fail to live up to the sorts of standards you mention when writing blog posts in part because the incentive gradient feels like it pushes towards hitting the publish button.
Given this, I wonder if there’s a way to shift the incentives on the margin. One minor thing I’ve been thinking of trying for my personal writing is having a Knuth or Nintil style “pay for mistakes” policy. Do you have thoughts on other incentive structures for rewarding rigor or punishing the lack thereof?
It feels partly like an incentives problem, but also I think a lot of people around here are altruistic and truth-seeking and just don’t realise that there are much more effective ways to contribute to community epistemics than standard blog posts.
I think that most LW discussion is at the level where “paying for mistakes” wouldn’t be that helpful, since a lot of it is fuzzy. Probably the thing we need first are more reference posts that distill a range of discussion into key concepts, and place that in the wider intellectual context. Then we can get more empirical. (Although I feel pretty biased on this point, because my own style of learning about things is very top-down). I guess to encourage this, we could add a “reference” section for posts that aim to distill ongoing debates on LW.
In some cases you can get a lot of “cheap” credit by taking other people’s ideas and writing a definitive version of them aimed at more mainstream audiences. For ideas that are really worth spreading, that seems useful.
Here is the best toy model I currently have for rational agents. Alas, it is super messy and hacky, but better than nothing. I’ll call it the BAVM model; the one-sentence summary is “internal traders concurrently bet on beliefs, auction actions, vote on values, and merge minds”. There’s little novel here, I’m just throwing together a bunch of ideas from other people (especially Scott Garrabrant and Abram Demski).
In more detail, the three main components are:
A prediction market
An action auction
A value election
You also have some set of traders, who can simultaneously trade on any combination of these three. Traders earn money in two ways:
Making accurate predictions about future sensory experiences on the market.
Taking actions which lead to reward or increase the agent’s expected future value.
They spend money in three ways:
Bidding to control the agent’s actions for the next N timesteps.
Voting on what actions get reward and what states are assigned value.
Running the computations required to figure out all these trades.
Values are therefore dominated by whichever traders earn money from predictions or actions, who will disproportionately vote for values that are formulated in the same ontologies they use for prediction/action, since that’s simpler than using different ontologies.
The last component is that it costs traders money to do computation. The way they can reduce this is by finding other traders who do similar computations as them, and then merging into a single trader. I am very interested in better understanding what a merging process like this might look like, though it seems pretty intractable in general because it will depend a lot on the internal details of the traders. (So perhaps a more principled approach here is to instead work top-down, figuring out what sub-markets or sub-auctions look like?)
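To make the bookkeeping above a bit more concrete, here is a minimal sketch of a single timestep. It is only an illustration under several assumptions that are not part of the description above: observations are binary, predictions are paid out with a quadratic score that breaks even at 50/50, the action auction is first-price with N = 1, votes are weighted by how much each trader spends on them, and values are assigned to actions only (not states). The names `Trader`, `elect_values`, `run_auction`, `settle_predictions` and `step` are hypothetical, and merging is not modelled at all.

```python
from dataclasses import dataclass, field


@dataclass
class Trader:
    name: str
    wealth: float
    prediction: float          # probability assigned to the next observation being 1
    action: str                # action taken if this trader wins control
    bid: float                 # payment offered for control of the next step
    votes: dict = field(default_factory=dict)  # action -> reward it votes for
    vote_spend: float = 0.0    # money spent to weight its votes (assumption)
    compute_cost: float = 0.0  # per-step cost of running its computations


def elect_values(traders):
    """Value election: spend-weighted average of proposed rewards per action."""
    total = sum(t.vote_spend for t in traders)
    actions = {a for t in traders for a in t.votes}
    values = {}
    for a in actions:
        weighted = sum(t.vote_spend * t.votes.get(a, 0.0) for t in traders)
        values[a] = weighted / total if total > 0 else 0.0
    for t in traders:
        t.wealth -= t.vote_spend  # voting costs money
    return values


def run_auction(traders):
    """Action auction: the highest affordable bid wins and is paid (first-price)."""
    winner = max(traders, key=lambda t: min(t.bid, t.wealth))
    winner.wealth -= min(winner.bid, winner.wealth)
    return winner


def settle_predictions(traders, observation, stake=1.0):
    """Prediction market stand-in: quadratic score, zero payoff at p = 0.5."""
    for t in traders:
        t.wealth += stake * (0.25 - (t.prediction - observation) ** 2)


def step(traders, observation):
    """Run one timestep: vote, auction, reward the action, settle, pay compute."""
    values = elect_values(traders)
    winner = run_auction(traders)
    winner.wealth += values.get(winner.action, 0.0)
    settle_predictions(traders, observation)
    for t in traders:
        t.wealth -= t.compute_cost
    return winner.action, values


traders = [
    Trader("forecaster", 10.0, prediction=0.9, action="wait", bid=0.5,
           votes={"wait": 1.0, "explore": 0.0}, vote_spend=1.0, compute_cost=0.1),
    Trader("explorer", 10.0, prediction=0.5, action="explore", bid=2.0,
           votes={"wait": 0.0, "explore": 3.0}, vote_spend=1.0, compute_cost=0.1),
]
action, values = step(traders, observation=1)
```

Even in this stripped-down form, the trader that wins control also benefits from having voted reward onto its own action, which is one way to see why values drift towards whichever traders are earning money.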
I wonder if there’s a loopiness here which breaks the setup (the expectation I’m guessing is relative to the prediction market’s probabilities? Though it seems like the market is over sensory experiences but the values are over world states in general, so maybe I’m missing something). But it seems like if I take an action and move the market at the same time, I might be able to extract a bunch of extra money and acquire outsize control.
Bidding to control the agent’s actions for the next N timesteps
This seems like it’s wasteful relative to contributing to a pool that bids on action A (or short-term policy P). I guess coordination is hard if you’re just contributing to the pool though, and this all connects to the merging process you describe.
I’ve been studying and thinking about the physical side of this phenomenon in neuroscience recently. There are groups of columns of neurons in the cortex that form temporary voting blocks, regarding whatever subject that particular Brodmann area focuses on. These alternating groups have to deal with physical limits of how many groups the regions can stably divide into, which limits the number of active distinct hypotheses or ‘traders’ there can be in a given area at a given time. Unclear exactly what the max is, and it depends on the cortical region in question, but generally 6-9 is the approximate max (not coincidentally the number of distinct ‘chunks’ we can hold in active short term memory). Also, there is a tendency for noise to cause traders/hypotheses/firing-groups that are too similar to fall back into synchrony/agreement with each other, and thus collapse back down to a baseline of two competing hypotheses. These hypotheses/firing-groups/traders are pushed into existence or pushed into merging not just by their own ‘bids’ but also by the evidence coming in from other brain areas or senses. I don’t think that current day neuroscience has all the details yet (although I certainly don’t have the full picture of all relevant papers in neuroscience!).
I think there are probably a lot of ways to build rational agents. The idea that general intelligence is hard in any absolute sense may be biased by wanting to believe we’re special, and for AI workers, that our work is special and difficult.
I note that AI economies like this will often have explosively better credit assignment for information production than human economies can. Artificial agents can be copied or rolled back (erase memories), which makes it possible to reverse the receipt of information if an assessor concludes with a price that the seller considers too low for a deal. In human economies, that’s impossible, you can’t send a clone to value a piece of information then delete them if you decide not to buy that information (that’s too expensive/illegal) nor can you wipe their memory of the information (or, we don’t know how to do that), so the very basic requirement for trade, assessment prior to purchase, is not possible in human economies, so information doesn’t get priced accurately and it has to be treated as a public good.
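As a rough illustration of the "assess, then roll back" idea, here is a small sketch. Everything in it is hypothetical: the `Assessor` class, the `negotiate` function, and especially the valuation rule (price by length), which is a placeholder for whatever the agent would actually compute. The point is only that a disposable copy does the inspecting, so the original never retains information from a rejected deal.

```python
import copy


class Assessor:
    """A toy buyer with a budget and a memory of information it has seen."""

    def __init__(self, budget):
        self.budget = budget
        self.memory = []

    def inspect(self, info):
        """Look at the information (remembering it) and return an offer."""
        self.memory.append(info)
        # Placeholder valuation rule (assumption): longer info is worth more.
        return min(self.budget, float(len(info)))


def negotiate(buyer, info, seller_minimum):
    """Let a copy of the buyer assess the info; keep it only if a deal is struck."""
    clone = copy.deepcopy(buyer)   # the clone sees the info, the original does not
    offer = clone.inspect(info)
    if offer >= seller_minimum:
        buyer.memory.append(info)  # deal accepted: original receives the info
        buyer.budget -= offer
        return True
    # Deal rejected: the clone is discarded along with its memory of the info.
    return False


buyer = Assessor(budget=10.0)
negotiate(buyer, "abc", seller_minimum=5.0)        # offer too low, nothing retained
negotiate(buyer, "abcdefgh", seller_minimum=5.0)   # offer accepted, info retained
```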
When implementing this (internal privacy) in a multi-agent architecture, though, make sure to take measures to prevent the formation of monopolies, I feel like information is kind of an increasing returns type of good, yeah? The more you have the more you can do with it. It could quickly stop being multi-agent, and at worst, the monopoly could consolidate enough political power to manipulate the EV estimators and reward hack. In theory those economies shouldn’t interact. But it’s impossible to totally prevent it. The EV estimators are receiving big sets of action proposals from the decisionmakers and the decisionmakers will see which action proposal the EV estimators end up choosing.
Artificial agents can be copied or rolled back (erase memories), which makes it possible to reverse the receipt of information if an assessor concludes with a price that the seller considers too low for a deal.
Yepp, very good point. Am working on a short story about this right now.
My guess is that understanding merging is the key to most prediction-of-behavior issues (things that motivated and also foiled UDT, but not limited to known-in-advance preference setting). Two agents can coordinate if they are the same, or reasoning about each other’s behavior, but in general they can be too complicated to clearly understand each other or themselves, can inadvertently diagonalize such attempts into impossibility, or even fail to be sufficiently aware of each other to start reasoning about each other specifically.
It might be useful to formulate smaller computations (contracts/adjudicators) that facilitate coordination between different agents by being shared between them, with the bigger agents acting as parts of environments for the contracts and setting up incentives for them, while the contracts can themselves engage in decision making within those environments. Contracts coordinate by being shared and acting with strategicness across relevant agents (they should be something like common knowledge), and it’s feasible for agents to find/construct some shared contracts as a result of them being much simpler than agents that host them. Learning of contracts doesn’t need to start with targeting coordination with other big agents, as active contracts screen off the other agents they facilitate coordination with.
Using contracts requires the big agents to make decisions about policies that affect the contracts updatelessly with respect to how the contracts end up behaving. That is, a contract should be able to know these policies, and the policies should describe responses to possible behaviors of a contract without themselves changing (once the contract computes more of its behavior), enabling the contract to do decision making in the environment of these policies. This corresponds to committing to abide by the contract. Assurance contracts (that start their tenure by checking that the commitments of all parties are actually in place) are especially important, allowing things like cooperation in PD.
If traders can get access to the control panel for the actions of the external agent AND they profit from accurately predicting its observations, then wouldn’t the best strategy be “create as much chaos as possible that is only predictable to me, its creator”? So, traders that value ONLY accurate predictions will get the advantage?
I think real learning has some kind of ground-truth reward. So we should clearly separate between “this ground-truth reward that is chiseling the agent during training (and not after training)”, and “the internal shards of the agent negotiating and changing your exact objective (which can happen both during and after training)”. I’d call the latter “internal value allocation”, or something like that. It doesn’t neatly correspond to any ground truth, and is partly determined by internal noise in the agent. And indeed, eventually, when you “stop training” (or at least “get decoupled enough from reward”), it just evolves on its own, separate from any ground truth.
And maybe more importantly:
I think this will by default lead to wireheading (a trader becomes wealthy and then sets reward to be very easy for it to get and then keeps getting it), and you’ll need a modification of this framework which explains why that’s not the case.
My intuition is a process of the form “eventually, traders (or some kind of specialized meta-traders) change the learning process itself to make it more efficient”. For example, they notice that topic A and topic B are unrelated enough, so you can have the traders thinking about these topics be pretty much separate, and you don’t lose much, and you waste less compute. Probably these dynamics will already be “in the limit” applied by your traders, but it will be the dominant dynamic so it should be directly represented by the formalism.
Finally, this might come later, and not yet in the level of abstraction you’re using, but I do feel like real implementations of these mechanisms will need to have pretty different, way-more-local structure to be efficient at all. It’s conceivable to say “this is the ideal mechanism, and real agents are just hacky approximations to it, so we should study the ideal mechanism first”. But my intuition says, on the contrary, some of the physical constraints (like locality, or the architecture of nets) will strongly shape which kind of macroscopic mechanism you get, and these will present pretty different convergent behavior. This is related, but not exactly equivalent to, partial agency.
I think real learning has some kind of ground-truth reward.
I’d actually represent this as “subsidizing” some traders. For example, humans have a social-status-detector which is hardwired to our reward systems. One way to implement this is just by taking a trader which is focused on social status and giving it a bunch of money. I think this is also realistic in the sense that our human hardcoded rewards can be seen as (fairly dumb) subagents.
I think this will by default lead to wireheading (a trader becomes wealthy and then sets reward to be very easy for it to get and then keeps getting it), and you’ll need a modification of this framework which explains why that’s not the case.
I think this happens in humans—e.g. we fall into cults, we then look for evidence that the cult is correct, etc etc. So I don’t think this is actually a problem that should be ruled out—it’s more a question of how you tweak the parameters to make this as unlikely as possible. (One reason it can’t be ruled out: it’s always possible for an agent to end up in a belief state where it expects that exploration will be very severely punished, which drives the probability of exploration arbitrarily low.)
they notice that topic A and topic B are unrelated enough, so you can have the traders thinking about these topics be pretty much separate, and you don’t lose much, and you waste less compute
I’m assuming that traders can choose to ignore whichever inputs/topics they like, though. They don’t need to make trades on everything if they don’t want to.
I do feel like real implementations of these mechanisms will need to have pretty different, way-more-local structure to be efficient at all
Yeah, this is why I’m interested in understanding how sub-markets can be aggregated into markets, sub-auctions into auctions, sub-elections into elections, etc.
I’d actually represent this as “subsidizing” some traders
Sounds good!
it’s more a question of how you tweak the parameters to make this as unlikely as possible
Absolutely, wireheading is a real phenomenon, so the question is how can real agents exist that mostly don’t fall to it. And I was asking for a story about how your model can be altered/expanded to make sense of that. My guess is it will have to do with strongly subsidizing some traders, and/or having a pretty weird prior over traders. Maybe even something like “dynamically changing the prior over traders”[1].
I’m assuming that traders can choose to ignore whichever inputs/topics they like, though. They don’t need to make trades on everything if they don’t want to.
Yep, that’s why I believe “in the limit your traders will already do this”. I just think it will be a dominant dynamic of efficient agents in the real world, so it’s better to represent it explicitly (as a more hierarchical structure, etc.), instead of having that computation be scattered between all independent traders. I also think that’s how real agents probably do it, computationally speaking.
Of course, pedantically, this will always be equivalent to having a static prior and changing your update rule. But some update rules are made sense of much more easily if you interpret them as changing the prior.
Absolutely, wireheading is a real phenomenon, so the question is how can real agents exist that mostly don’t fall to it. And I was asking for a story about how your model can be altered/expanded to make sense of that.
Ah, I see. In that case I think I disagree that it happens “by default” in this model. A few dynamics which prevent it:
If the wealthy trader makes reward easier to get, then the price of actions will go up accordingly (because other traders will notice that they can get a lot of reward by winning actions). So in order for the wealthy trader to keep making money, they need to reward outcomes which only they can achieve, which seems a lot harder.
I don’t yet know how traders would best aggregate votes into a reward function, but it should be something which has diminishing marginal return to spending, i.e. you can’t just spend 100x as much to get 100x higher reward on your preferred outcome. (Maybe quadratic voting?)
Other traders will still make money by predicting sensory observations. Now, perhaps the wealthy trader could avoid this by making observations as predictable as possible (e.g. going into a dark room where nothing happens—kinda like depression, maybe?) But this outcome would be assigned very low reward by most other traders, so it only works once a single trader already has a large proportion of the wealth.
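To make the “diminishing marginal return to spending” idea concrete, here is a minimal sketch of a quadratic-voting-style aggregator. It is only an illustration, not a proposal for how the market should actually work: the function names, the trader names and the bid amounts are all made up, and it assumes the simplest rule, where buying v votes on an outcome costs v squared.

```python
import math


def votes_bought(spend: float) -> float:
    """Quadratic voting rule (assumed): v votes cost v**2, so a budget buys sqrt(budget) votes."""
    if spend < 0:
        raise ValueError("spend must be non-negative")
    return math.sqrt(spend)


def aggregate_reward(bids: dict[str, dict[str, float]]) -> dict[str, float]:
    """Sum vote weight per outcome.

    `bids` maps a (hypothetical) trader name to {outcome: amount spent}.
    """
    totals: dict[str, float] = {}
    for spending in bids.values():
        for outcome, spend in spending.items():
            totals[outcome] = totals.get(outcome, 0.0) + votes_bought(spend)
    return totals


# Hypothetical example: one wealthy trader outspends each small trader 100x.
bids = {
    "wealthy": {"dark_room": 100.0},
    "small_a": {"explore": 1.0},
    "small_b": {"explore": 1.0},
}
reward_weights = aggregate_reward(bids)
# The wealthy trader spent 100x as much as each small trader
# but only gets 10x the voting weight of one of them.
```

Under this rule the wealthy trader can still dominate, but doing so costs quadratically more, which is the property the second point is asking for.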
Yep, that’s why I believe “in the limit your traders will already do this”. I just think it will be a dominant dynamic of efficient agents in the real world, so it’s better to represent it explicitly
IMO the best way to explicitly represent this is via a bias towards simpler traders, who will in general pay attention to fewer things.
But actually I don’t think that this is a “dominant dynamic” because in fact we have a strong tendency to try to pull different ideas and beliefs together into a small set of worldviews. And so even if you start off with simple traders who pay attention to fewer things, you’ll end up with these big worldviews that have opinions on everything. (These are what I call frames here.)
they need to reward outcomes which only they can achieve,
Yep! But this didn’t seem so hard for me to happen, especially in the form of “I pick some easy task (that I can do perfectly), and of course others will also be able to do it perfectly, but since I already have most of the money, if I just keep investing my money in doing it I will reign forever”. You prevent this from happening through epsilon-exploration, or something equivalent like giving money randomly to other traders. These solutions feel bad, but I think they’re the only real solutions. Although I also think stuff about meta-learning (traders explicitly learn about how they should learn, etc.) probably pragmatically helps make these failures less likely.
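A rough sketch of the “give money to other traders” fix, under some loud assumptions: instead of giving money randomly, it uses the deterministic expected-value version (tax a fraction epsilon of everyone’s wealth and split the pot equally), and the function name and numbers are invented for illustration.

```python
def redistribute(wealth: list[float], epsilon: float) -> list[float]:
    """Take a fraction `epsilon` of every trader's wealth and share the pot equally.

    This is an assumed, deterministic stand-in for "giving money randomly
    to other traders": total wealth is conserved, and every trader ends up
    with at least epsilon * total / n.
    """
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be in [0, 1]")
    pot = epsilon * sum(wealth)
    share = pot / len(wealth)
    return [(1.0 - epsilon) * w + share for w in wealth]


# Hypothetical market where one trader already holds 98% of the money.
wealth_before = [98.0, 1.0, 1.0]
wealth_after = redistribute(wealth_before, epsilon=0.1)
```

The point of the floor is that the poorer traders always keep enough money to place some bids, so a trader that “reigns forever” on an easy task can still be outbid eventually if someone finds a better one. It doesn’t make the fix feel any less hacky.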
it should be something which has diminishing marginal return to spending
Yep, that should help (also at the trade-off of making new good ideas slower to implement, but I’m happy to make that trade-off).
But actually I don’t think that this is a “dominant dynamic” because in fact we have a strong tendency to try to pull different ideas and beliefs together into a small set of worldviews
Yeah. To be clear, the dynamic I think is “dominant” is “learning to learn better”. Which I think is not equivalent to simplicity-weighting traders. It is instead equivalent to having some more hierarchical structure on traders.
One reason that many rationalists have such strong views about AI is that they are wrong about epistemology. Specifically, bayesian rationalism is a bad way to think about complex issues.
A better approach is meta-rationality. To summarize one guiding principle of (my version of) meta-rationality in a single sentence: if something doesn’t make sense in the context of group rationality, it probably doesn’t make sense in the context of individual rationality either.
For example: there’s no privileged way to combine many people’s opinions into a single credence. You can average them, but that loses a lot of information. Or you can get them to bet on a prediction market, but that depends a lot on the details of the individuals’ betting strategies. The group might settle on a number to help with planning and communication, but it’s only a lossy summary of many different beliefs and models. Similarly, we should think of individuals’ credences as lossy summaries of different opinions from different underlying models that they have.
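A small illustration of the “no privileged way to combine” point, using two standard pooling rules (averaging probabilities, and averaging log-odds). The credence lists are invented, and neither rule is being endorsed here; the sketch only shows that they disagree and that both throw information away.

```python
import math


def linear_pool(credences: list[float]) -> float:
    """Arithmetic mean of the probabilities."""
    return sum(credences) / len(credences)


def log_odds_pool(credences: list[float]) -> float:
    """Average the log-odds, then map back to a probability.

    Requires every credence to be strictly between 0 and 1.
    """
    logits = [math.log(p / (1.0 - p)) for p in credences]
    mean_logit = sum(logits) / len(logits)
    return 1.0 / (1.0 + math.exp(-mean_logit))


# Hypothetical groups of forecasters.
agreeing = [0.5, 0.5]           # everyone is unsure
split = [0.1, 0.9]              # two confident people who disagree
with_expert = [0.5, 0.5, 0.99]  # one very confident voice in the room

same_summary = (linear_pool(agreeing), linear_pool(split))
different_rules = (linear_pool(with_expert), log_odds_pool(with_expert))
```

The first pair comes out the same (both groups average to about 0.5) even though the groups are in very different epistemic states. The second pair differs noticeably (roughly 0.66 versus 0.82), so the “group credence” depends on which aggregation rule you happened to pick.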
How does this apply to AI? Suppose we each think of ourselves as containing many different subagents that focus on understanding the world in different ways—e.g. studying different disciplines, using different styles of reasoning, etc. The subagent that thinks about AI from first principles might come to a very strong opinion. But this doesn’t mean that the other subagents should fully defer to it (just as having one very confident expert in a room of humans shouldn’t cause all the other humans to elect them as the dictator). E.g. maybe there’s an economics subagent who will remain skeptical unless the AI arguments can be formulated in ways that are consistent with their knowledge of economics, or the AI subagent can provide evidence that is legible even to those other subagents (e.g. advance predictions).
In my debate with Eliezer, he didn’t seem to appreciate the importance of advance predictions; I think the frame of “highly opinionated subagents should convince other subagents to trust them, rather than just seizing power” is an important aspect of what he’s missing. I think of rationalism as trying to form a single fully-consistent world-model; this has many of the same pitfalls as a country which tries to get everyone to agree on a single ideology. Even when that ideology is broadly correct, you’ll lose a bunch of useful heuristics and intuitions that help actually get stuff done, because ideological conformity is prioritized.
This perspective helps frame the debate about what our “base rate” for AI doom should be. I’ve been in a number of arguments that go roughly like (edited for clarity): Me: “Credences above 90% doom can’t be justified given our current state of knowledge” Them: “But this is an isolated demand for rigor, because you’re fine with people claiming that there’s a 90% chance we survive. You’re assuming that survival is the default, I’m assuming that doom is the default; these are symmetrical positions.” But in fact there’s no one base rate; instead, different subagents with different domains of knowledge will have different base rates. That will push P(doom) lower because most frames from most disciplines, and most styles of reasoning, don’t predict doom. That’s where the asymmetry which makes 90% doom a much stronger prediction than 90% survival comes from.
This perspective is broadly aligned with a bunch of stuff that Scott Garrabrant and Abram Demski have written about (e.g. geometric rationality, Garrabrant induction). I don’t think the ways I’m applying it to AI risk debates straightforwardly fall out of their more technical ideas; but I do expect that more progress on agent foundations will make it easier to articulate ideas like the ones above.
That will push P(doom) lower because most frames from most disciplines, and most styles of reasoning, don’t predict doom.
I don’t really buy this statement. Most frames, from most disciplines, and most styles of reasoning, do not make clear predictions about what will happen to humanity in the long-run future. A very few do, but the vast majority are silent on this issue. Silence is not anything like “50%”.
Most frames, from most disciplines, and most styles of reasoning, don’t predict sparks when you put metal in a microwave. This doesn’t mean I don’t know what happens when you put metal in a microwave. You need to at the very least limit yourself to applicable frames, and there are very few applicable frames for predicting humanity’s long-term future.
Unfortunately, I think there’s a fundamentally inside-view aspect of [problems very different from those we’re used to]. I think looking for a range of frames is the right thing to do—but deciding on the relevance of the frame can only be done by looking at the details of the problem itself (if we instead use our usual heuristics for relevance-of-frame-x, we run into the same out-of-distribution issues).
I don’t think there’s a way around this. Aspects of this situation are fundamentally different from those we’re used to. [Is different from] is not a useful relation—we can’t get far by saying “We’ve seen [fundamentally different] situations before—what happened there?”. It’ll all come back to how they were fundamentally different.
To say something mildly more constructive, I do still think we should be considering and evaluating other frames, based on our own inside-view model (with appropriate error bars on that model).
A place I’d start here would be:
Attempt to understand another frame.
See how far I need to zoom out before that frame’s models become a reasonable abstraction for the problem-as-I-understand-it.
Find the smallest changes to my models that’d allow me to stick with this frame without zooming out so far. Assess the probability that these adjusted models are correct/useful.
For most frames, I end up needing to zoom out too far for them to say much of relevance—so this doesn’t much change my p(doom) assessment.
It seems more useful to apply other frames to evaluate smaller parts of our models. I’m sure there are a bunch of places where intuitions and models from e.g. economics or physics do apply to safety-related subproblems.
I’ve been thinking lately that human group rationality seems like such a mess. Like how can humanity navigate a once in a lightcone opportunity like the AI transition without doing something very suboptimal (i.e., losing most of potential value), when the vast majority of humans (and even the elites) can’t understand (or can’t be convinced to pay attention to) many important considerations. This big picture seems intuitively very bad and I don’t know any theory of group rationality that says this is actually fine.
I guess my 1 is mostly about descriptive group rationality, and your 2 may be talking more about normative group rationality. However I’m also not aware of any good normative theories about group rationality. I started reading your meta-rationality sequence, but it ended after just two posts without going into details.
The only specific thing you mention here is “advance predictions” but for example, moral philosophy deals with “ought” questions and can’t provide advance predictions. Can you say more about how you think group rationality should work, especially when advance predictions aren’t possible?
From your group rationality perspective, why is it good that rationalists individually have better views about AI? Why shouldn’t each person just say what they think from their own preferred frame, and then let humanity integrate that into some kind of aggregate view or outcome, using group rationality?
I started reading your meta-rationality sequence, but it ended after just two posts without going into details.
David Chapman’s website seems like the standard reference for what the post-rationalists call “metarationality”. (I haven’t read much of it, but the little I read made me somewhat unenthusiastic about continuing).
How can the mistakes rationalists are making be expressed in the language of Bayesian rationalism? Priors, evidence, and posteriors are fundamental to how probability works.
The mistakes can (somewhat) be expressed in the language of Bayesian rationalism by doing two things:
Talking about partial hypotheses rather than full hypotheses. You can’t have a prior over partial hypotheses, because several of them can be true at once (though you can still assign them credences and update those credences according to evidence).
Talking about models with degrees of truth rather than just hypotheses with degrees of likelihood. E.g. when using a binary conception of truth, general relativity is definitely false because it’s inconsistent with quantum phenomena. Nevertheless, we want to say that it’s very close to the truth. In general this is more of an ML approach to epistemology (we want a set of models with low combined loss on the ground truth).
Suppose we think of ourselves as having many different subagents that focus on understanding the world in different ways—e.g. studying different disciplines, using different styles of reasoning, etc. The subagent that thinks about AI from first principles might come to a very strong opinion. But this doesn’t mean that the other subagents should fully defer to it (just as having one very confident expert in a room of humans shouldn’t cause all the other humans to elect them as the dictator). E.g. maybe there’s an economics subagent who will remain skeptical unless the AI arguments can be formulated in ways that are consistent with their knowledge of economics, or the AI subagent can provide evidence that is legible even to those other subagents (e.g. advance predictions).
Do “subagents” in this paragraph refer to different people, or different reasoning modes / perspectives within a single person? (I think it’s the latter, since otherwise they would just be “agents” rather than subagents.)
Either way, I think this is a neat way of modeling disagreement and reasoning processes, but for me it leads to a different conclusion on the object-level question of AI doom.
A big part of why I find Eliezer’s arguments about AI compelling is that they cohere with my own understanding of diverse subjects (economics, biology, engineering, philosophy, etc.) that are not directly related to AI—my subagents for these fields are convinced and in agreement.
Conversely, I find many of the strongest skeptical arguments about AI doom to be unconvincing precisely because they seem overly reliant on a “current-paradigm ML subagent” that their proponents feel should be dominant, or at least more heavily weighted than I think is justified.
That will push P(doom) lower because most frames from most disciplines, and most styles of reasoning, don’t predict doom.
This might be true and useful for getting some kind of initial outside-view estimate, but I think you need some kind of weighting rule to make this work as reasoning strategy even at a meta level. Otherwise, aren’t you vulnerable to other people inventing lots of new frames and disciplines? I think the answer in geometric rationality terms is that some subagents will perform poorly and quickly lose their Nash bargaining resources, and then their contribution to future decision-making / conclusion-making will be down-weighted. But I don’t think the only way for a subagent to “perform” for the purposes of deciding on a weight is by making externally legible advance predictions.
I may be missing context here, but as written / taken at face value, I strongly agree with the above comment from Richard. I often disagree with Richard about alignment and its role in the future of AI, but this comment is an extremely dense list of things I agree with regarding rationalist epistemic culture.
I’d love to read an elaboration of your perspective on this, with concrete examples, which avoids focusing on the usual things you disagree about (pivotal acts vs. pivotal processes, whether social facets of the game are important for us to track, etc.) and mainly focuses on your thoughts on epistemology and rationality and how they deviate from what you consider the LW norm.
My main take on Bayesian epistemology being wrong is that I think to the extent it’s useless in real life, it’s because it focuses way too much on the ideal case, ala @Robert Miles’s tweet here:
(The other problem I have with it is that even in the ideal case, it doesn’t have a way to sensibly handle 0 probability events, or conditioning on probability 0 events, which can actually happen once we leave the world of finite sets and measures.)
That said, I don’t think that people being wrong about epistemology is the cause of high p(Doom).
I’d agree more with @Algon in that the issues lie elsewhere (though a nitpick is that I wouldn’t say that EU maximization is wrong for TAI/AGI/ASI, but rather that certain dangerous properties don’t automatically hold, and that systems that EU maximize IRL like GPT-4 aren’t actually nearly as dangerous as often assumed. Agree with the other points.)
What I was talking about is that the predictive models like GPT-4 have a utility function that’s essentially predictive, and the maximization is essentially trying to update the best it can given input conditions.
These posts can help you to understand more about predictive/simulator utility functions like GPT-4:
The ideal predictor’s utility function is instead strictly over the model’s own outputs, conditional on inputs.
I’m doubtful that GPT-4 has a utility function. If it did, I would be kind-of terrified. I don’t think I’ve seen the posts you linked to though, so I’ll go read those.
Maybe a crux is that I’m willing to grant learned utility functions as utility functions, and I tend to see EU maximization/utility function reasoning in general as implying far fewer consequences than people on LW think it does, at least without more constraints.
It doesn’t try to assert its own existence, because that’s not necessary for maximizing updating/prediction output based on inputs.
I think the crux lies elsewhere, as I was sloppy in my wording. It’s not that maximizing some utility function is an issue, as basically anything can be viewed as EU maximization for a sufficiently wild utility function. However, I don’t view that as a meaningful utility function. Rather, it is the ones like e.g. utility functions over states that I think are meaningful, and those are scary. That’s how I think you get classical paperclip maximizers.
When I try and think up a meaningful utility function for GPT-4, I can’t find anything that’s plausible. Which means I don’t think there’s a meaningful prediction-utility function which describes GPT-4’s behaviour. Perhaps that is a crux.
Re utility functions over states, it turns out that we can validly turn utility functions over plans/predictions into utility functions over world states/outcomes (though usually with constraints on how large the domain is, though not always.)
And yeah, I think it’s a crux that I think that at the very least, what GPT-N systems will look like, if they reach AGI/ASI, will probably look like a maximizer for updating given input conditions like prompts.
My main point isn’t that the utility function framing of GPT-4 or GPT-N is wrong, but rather that LWers inferred way too much about how a system would behave, even conditional on expected utility maximization being a coherent frame for AIs, because that frame doesn’t logically imply the properties they thought it did without more assumptions that need to be defended.
What is the empirical track record of your suggested epistemological strategy, relative to Bayesian rationalism? Where does your confidence come from that it would work any better? Every time I see suggestions of epistemological humility, I think to myself stuff like this:
What predictions would this strategy have made about future technologies, like an 1890 or 1900 prediction of the airplane (vs. first controlled flight by the Wright Brothers in 1903), or a 1930 or 1937 prediction of nuclear bombs? Doesn’t your strategy just say that all these weird-sounding technologies don’t exist yet and are probably impossible?
Can this epistemological strategy correctly predict that present-day huge complex machines like airplanes can exist? They consist of millions of parts and require contributions of thousands or tens of thousands of people. Each part has a chance of being defective, and each person has a chance of making a mistake. Without the benefit of knowing that airplanes do indeed exist, doesn’t it sound overconfident to predict that parts have an error rate of <1 in a million, or that people have an error rate of <1 in a thousand? But then the math says that airplanes can’t exist, or should immediately crash.
Or to rephrase point 2 to reply to this part: “That will push P(doom) lower because most frames from most disciplines, and most styles of reasoning, don’t predict doom.” — Can your epistemological strategy even correctly make any predictions of near 100% certainty? I concur with habryka that most frames don’t make any predictions on most things. And yet this doesn’t mean that some events aren’t ~100% certain.
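For concreteness, here is the back-of-the-envelope math behind the airplane point, as a sketch under two deliberately crude assumptions that are mine rather than anything stated above: defects are independent, and any single defect is fatal (real aircraft have redundancy and inspection, so this overstates the problem). The function name and the specific rates are illustrative only.

```python
import math


def p_all_ok(n_parts: int, p_defect: float) -> float:
    """Probability that none of `n_parts` independent parts is defective.

    Computes (1 - p_defect) ** n_parts via log1p for numerical stability.
    """
    if not 0.0 <= p_defect < 1.0:
        raise ValueError("p_defect must be in [0, 1)")
    return math.exp(n_parts * math.log1p(-p_defect))


n = 1_000_000  # "millions of parts"
modest = p_all_ok(n, 1e-3)     # a "humble" 1-in-a-thousand defect rate
confident = p_all_ok(n, 1e-6)  # 1-in-a-million
extreme = p_all_ok(n, 1e-9)    # 1-in-a-billion
```

With a 1-in-a-thousand rate the plane essentially never works, with 1-in-a-million it works about 37% of the time (roughly 1/e), and you need something like 1-in-a-billion before it works reliably. So under these assumptions, correctly predicting that airplanes exist requires exactly the kind of extreme-sounding confidence that a humility heuristic would push you away from.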
But in fact there’s no one base rate; instead, different subagents with different domains of knowledge will have different base rates. That will push P(doom) lower because most frames from most disciplines, and most styles of reasoning, don’t predict doom. That’s where the asymmetry which makes 90% a much stronger prediction than 10% comes from.
One of the most important features of a future ASI, I think, is knowledge of the limits of applicability of its models and heuristics. If you have a list of assumptions for very fast heuristics, then you can win big by making fast-computable moves in a narrow environment where those assumptions hold. That said, you need to be able to find when your assumptions don’t hold and command your subagents to halt, melt and catch fire when they are outside of their applicability zone.
I think this post doesn’t really explain why rats have high belief in doom, or why they’re wrong to do so. Perhaps ironically, there is a better version of this post on both counts which isn’t so focused on how rats get epistemology wrong and the social/meta-level consequences. A post which focuses on the object-level implications for AI of a theory of rationality which looks very different from the AIXI-flavoured rat-orthodox view.
I say this because those sorts of considerations convinced me that we’re much less likely to be buggered. I.e. I no longer believe EU maximization is/will be a good description by default of TAI or widely economically productive AGI, mildly superhuman AGI or even ASI, depending on the details. Which is partly due to a recognition that the arguments for EU maximization are weaker than I thought, arguments for LDT being convergent are lacking, the notions of optimality we do have are very weak, the existence and behaviour of GPT-4, Claude Opus etc.
6 seems too general a claim to me. Why wouldn’t it work for 1% vs 10%, and likewise 0.1% vs 1% i.e. why doesn’t this suggest that you should round down P(doom) to zero. Also, I don’t even know what you mean by “most” here. Like, are we quantifying over methods of reasoning used by current AI researchers right now? Over all time? Over all AI researchers and engineers? Over everyone in the West? Over everyone who’s ever lived? Etc.
And it seems to me like you’re implicitly privileging ways of combining these opinions that get you 10% instead of 1% or 90%, which is begging the question. Of course, you could reply that a P(doom) of 10% is confused, that it isn’t really your state of knowledge, that lumping all your sub-agents’ models into a single number is too lossy, etc. But then why mention that 90% is a much stronger prediction than 10% instead of saying they’re roughly equally confused?
7 I kinda disagree with. Those models of idealized reasoning you mention generalize Bayesianism/Expected Utility Maximization. But they are not far from the Bayesian framework or EU frameworks. Like Bayesianism, they do say there are correct and incorrect ways of combining beliefs, that beliefs should be isomorphic to certain structures, unless I’m horribly mistaken. Which sure is not what you’re claiming to be the case in your above points.
Also, a lot of rationalists already recognize that these models are addressing flaws in Bayesianism like logical omniscience, embeddedness etc. Like, I believed this at least around 2017, and probably earlier. Also, note that these models of epistemology are not in tension with a strong belief that we’re buggered. Last I checked, the people who invented these models believe we’re buggered. I think they may imply that we’re a little less buggered than the EU maximization theory does, but I don’t think this is a big difference. IMO this is not a big enough departure to do the work that your post requires.
A post which focuses on the object-level implications for AI of a theory of rationality which looks very different from the AIXI-flavoured rat-orthodox view.
I’m working on this right now, actually. Will hopefully post in a couple of weeks.
I say this because those sorts of considerations convinced me that we’re much less likely to be buggered.
That seems reasonable. But I do think there’s a group of people who have internalized bayesian rationalism enough that the main blocker is their general epistemology, rather than the way they reason about AI in particular.
6 seems too general a claim to me. Why wouldn’t it work for 1% vs 10%, and likewise 0.1% vs 1%, i.e. why doesn’t this suggest that you should round down P(doom) to zero?
I think the point of 6 is not to say “here’s where you should end up”, but more to say “here’s the reason why this straightforward symmetry argument doesn’t hold”.
7 I kinda disagree with. Those models of idealized reasoning you mention generalize Bayesianism/Expected Utility Maximization. But they are not far from the Bayesian framework or EU frameworks.
There’s still something importantly true about EU maximization and bayesianism. I think the changes we need will be subtle but have far-reaching ramifications. Analogously, relativity was a subtle change to newtonian mechanics that had far-reaching implications for how to think about reality.
Like Bayesianism, they do say there are correct and incorrect ways of combining beliefs, that beliefs should be isomorphic to certain structures, unless I’m horribly mistaken. Which sure is not what you’re claiming to be the case in your above points.
Any epistemology will rule out some updates, but a problem with bayesianism is that it says there’s one correct update to make. Whereas radical probabilism, for example, still sets some constraints, just far fewer.
I’m working on this right now, actually. Will hopefully post in a couple of weeks.
This sounds cool.
That seems reasonable. But I do think there’s a group of people who have internalized bayesian rationalism enough that the main blocker is their general epistemology, rather than the way they reason about AI in particular.
I think your OP didn’t give enough details as to why internalizing Bayesian rationalism leads to doominess by default. Like, Nora Belrose is firmly Bayesian and is decidedly an optimist. Admittedly, I think she doesn’t think a Kolmogorov prior is a good one, but I don’t think that makes you much more doomy either. I think Jacob Cannell and others are also Bayesian and non-doomy. Perhaps I’m using “Bayesian rationalism” differently than you are, which is why I think your claim, as I read it, is invalid.
I think the point of 6 is not to say “here’s where you should end up”, but more to say “here’s the reason why this straightforward symmetry argument doesn’t hold”.
Fair enough. However, how big is the asymmetry? I’m a bit sceptical there is a large one. Based off my interactions, it seems like ~ everyone who has seriously thought about this topic for a couple of hours has radically different models, w/ radically different levels of doominess. This holds even amongst people who share many lenses (e.g. Tyler Cowen vs Robin Hanson, Paul Christiano vs. Scott Aaronson, Steve Hsu vs Michael Nielsen etc.).
There’s still something importantly true about EU maximization and bayesianism. I think the changes we need will be subtle but have far-reaching ramifications. Analogously, relativity was a subtle change to newtonian mechanics that had far-reaching implications for how to think about reality.
I think we’re in agreement over this. (I think Bayesianism is less wrong than EU maximization, and probably a very good approximation in lots of places, like Newtonian physics is for GR.) But my contention is over Bayesian epistemology tripping many rats up when thinking about AI x-risk. You need some story which explains why sticking to Bayesian epistemology is tripping up very many people here in particular.
Any epistemology will rule out some updates, but a problem with bayesianism is that it says there’s one correct update to make. Whereas radical probabilism, for example, still sets some constraints, just far fewer.
Right, but in radical probabilism the type of beliefs is still a real valued function, no? Which is in tension w/ many disparate models that don’t get compressed down to a single number. In that sense, the refined formalism is still rigid in a way that your description is flexible. And I suspect the same is true for Infra-Bayesianism, though I understand that even less well than radical probabilism.
I think you’re making a good point (rationalists maybe don’t weight other opinions highly enough), but you’d get farther framing it as an update to how to use Bayesian reasoning, rather than an alternative. Bayesian reasoning has a pretty strong intuitive connection to “the factually correct way to reason”, even though there’s a ton of subtlety in that statement and how and where it’s applied.
WRT many of your arguments: base rates are increasingly just the wrong way to reason about AGI risks. We can think in more detail about how we’ll build AGI and what the risks are.
I think they are just using that as an example of a strongly opinionated sub-agent which may be one of many different and highly specific probability assessments of doom.
As for “survival is the default assumption”—what a declaration of that implies on the surface level is that the chance of survival is overwhelming except in the case of a cataclysmic AI scenario. To put it another way: we have a 99% chance of survival so long as we get AGI right.
To put it yet another way—Hollywood has made popular films about the human world being destroyed by Nuclear War, Climate Change, Viral Pandemic, and Asteroid Impact to name a few—different sub-agents could each give higher or lower probabilities to each of those scenarios depending on things like domain knowledge and in concert it raises the question of why we presume that survival is the default? What is the ensemble average of doom?
Is doom more or less likely than survival for any given time frame?
I expect “reward” to be a hard goal to learn, because it’s a pretty abstract concept and not closely related to the direct observations that policies are going to receive. If you keep training policies, maybe they’d converge to it eventually, but my guess is that this would take long enough that we’d already have superhuman AIs which would either have killed us or solved alignment for us (or at least started using gradient hacking strategies which undermine the “convergence” argument). Analogously, humans don’t care very much at all about the specific connections between our reward centers and the rest of our brains—insofar as we do want to influence them it’s because we care about much more directly-observable phenomena like pain and pleasure.
Even once you learn a goal like that, it’s far from clear that it’d generalize in ways which lead to power-seeking. “Reward” is not a very natural concept, it doesn’t apply outside training, and even within training it’s dependent on the specific training algorithm you use. Trying to imagine what a generalized goal of “reward” would cash out to gets pretty weird. As one example: it means that every time you deploy the policy without the intention of rewarding it, then its key priority would be convincing you to insert that trajectory into the training data. (It might be instructive to think about what the rewards would need to be for that not to happen. Below 0? But the 0 point is arbitrary...) That seems pretty noticeable! But wouldn’t it be deceptive? Well, only within the scope of its current episode, because trying to get higher reward in other episodes is never positively reinforced. Wouldn’t it learn the high-level concept of “reward” in general, in a way that’s abstracted from any specific episode? That feels analogous to a human learning to care about “genetic fitness” but not distinguishing between their own genetic fitness and the genetic fitness of other species. And remember point 1: the question is not whether the policy learns it eventually, but rather whether it learns it before it learns all the other things that make our current approaches to alignment obsolete.
At a high level, this comment is related to Alex Turner’s Reward is not the optimization target. I think he’s making an important underlying point there, but I’m also not going as far as he is. He says “I don’t see a strong reason to focus on the “reward optimizer” hypothesis.” I think there’s a pretty good reason to focus on it—namely that we’re reinforcing policies for getting high reward. I just think that other people have focused on it too much, and not carefully enough—e.g. the “without specific countermeasures” claim that Ajeya makes seems too strong, if the effects she’s talking about might only arise significantly above human level. Overall I’m concerned that reasoning about “the goal of getting high reward” is too anthropomorphic and is a bad way to present the argument to ML researchers in particular.
Putting my money where my mouth is: I just uploaded a (significantly revised) version of my Alignment Problem position paper, where I attempt to describe the AGI alignment problem as rigorously as possible. The current version only has “policy learns to care about reward directly” as a footnote; I can imagine updating it based on the outcome of this discussion though.
I’m not very convinced by this comment as an objection to “50% AI grabs power to get reward.” (I find it more plausible as an objection to “AI will definitely grab power to get reward.”)
I expect “reward” to be a hard goal to learn, because it’s a pretty abstract concept and not closely related to the direct observations that policies are going to receive
“Reward” is not a very natural concept
This seems to be most of your position but I’m skeptical (and it’s kind of just asserted without argument):
The data used in training is literally the only thing that AI systems observe, and prima facie reward just seems like another kind of data that plays a similarly central role. Maybe your “unnaturalness” abstraction can make finer-grained distinctions than that, but I don’t think I buy it.
If people train their AI with RLDT then the AI is literally being trained to predict reward! I don’t see how this is remote, and I’m not clear if your position is that e.g. the value function will be bad at predicting reward because it is an “unnatural” target for supervised learning.
I don’t understand the analogy with humans. It sounds like you are saying “an AI system selected based on the reward of its actions learns to select actions it expects to lead to high reward” is analogous to “humans care about the details of their reward circuitry.” But:
I don’t think human learning is just RL based on the reward circuit; I think this is at least a contrarian position and it seems unworkable to me as an explanation of human behavior.
It seems like the analogous conclusion for RL systems would be “they may not care about the rewards that go into the SGD update, they may instead care about the rewards that get entered into the dataset, or even something further causally upstream of that as long as it’s very well-correlated on the training set.” But it doesn’t matter what we choose that’s causally upstream of rewards, as long as it’s perfectly correlated on the training set?
(Or you could be saying that humans are motivated by pleasure and pain but not the entire suite of things that are upstream of reward? But that doesn’t seem right to me.)
[The concept of “reward”] doesn’t apply outside training
I don’t buy it:
If people train AI systems on random samples of deployment, then “reward” does make sense—it’s just what would happen if you sampled this episode to train on. This is done in practice today and seems like a pretty good idea (since the thing you care about is precisely the performance on deployment episodes) unless you are specifically avoiding it for safety reasons.
It’s plausible that such training is happening whether or not it actually is, and so that’s a very natural objective for a system that cares about maximizing reward conditioned on an episode being selected for training.
Even if test episodes are obviously not being used in training, there are still lots of plausible-sounding generalizations of “reward” to those episodes (whether based on physical implementation, or on the selection implemented by SGD, or based on conditioning on unlikely events, or based on causal counterfactuals...) and as far as I can tell pretty much all of them lead to the same bottom line.
If the deployment distribution is sufficiently different from the training distribution that the AI no longer does something with the same upshot as maximizing reward, then it seems very likely that the resulting behaviors are worse (e.g. they would receive a lower score when evaluated by humans) and so people will be particularly likely to train on those deployment episodes in order to correct the issue.
So I think that if a system was strategically optimizing reward on the training set, it would probably either do something similar-enough-to-result-in-grabbing-power on the test set, or else it would behave badly and be corrected.
(Overall I find the other point more persuasive, though again I think it’s better as an objection to 100% than as an objection to 50%: actually optimizing reward doesn’t do much better than various soups of heuristics, and so we don’t have a strong prediction that SGD will prefer one to the other without getting into the weeds and making very uncertain claims or else taking the limit.)
As one example: it means that every time you deploy the policy without the intention of rewarding it, then its key priority would be convincing you to insert that trajectory into the training data. (It might be instructive to think about what the rewards would need to be for that not to happen. Below 0? But the 0 point is arbitrary...) That seems pretty noticeable!
RL agents receive a loss that is based on reward in training episodes. If they tried to change the probability that a given episode appeared in training, and sacrificed reward to do so, that behavior would be discouraged by SGD—it would select away from parameters that do that (since they take worse actions on the episodes that actually appear in training). So I don’t think there’s any reason to think that an RL agent would behave this way.
Instead, you should expect an RL agent to maximize reward conditioned on the episode appearing in training, because that’s what SGD would select for. I agree that we shouldn’t expect a general thematic connection to “reward” beyond what you’d expect from the mechanics of SGD.
and even within training it’s dependent on the specific training algorithm you use
This is true, but as far as I can see all the possible versions result in grabbing power, so I don’t think it undermines the case. I don’t know if you have particular versions in mind that wouldn’t result in either grabbing power or else bad behavior that would be corrected by further training.
They focus on policies learning the goal of getting high reward.
The claim is that AI systems will take actions that get them a lot of reward. This doesn’t have to be based on the mechanistic claim that they are thinking about reward, it can also just be the empirical observation: somehow you selected a policy that creatively gets a lot of reward across a very broad training distribution, so in a novel situation “gets a lot of reward” is one of our most robust predictions about the behavior of the AI.
This is how Ajeya’s post goes, and it explicitly notes that the AI trained with RLDT could be optimizing something else such that they only want to get reward during training, but that this mostly just makes things worse: everything seems to lead to the same place for more or less the same reason.
I think the “soup of heuristics” stories (where the AI is optimizing something far causally upstream of reward instead of something that is downstream or close enough to be robustly correlated) don’t lead to takeover in the same way, and so that part of the objection is much stronger (though see my objections above).
Overall I’m concerned that reasoning about “the goal of getting high reward” is too anthropomorphic and is a bad way to present the argument to ML researchers in particular.
I don’t feel like this argument is particularly anthropomorphic. The argument is that a policy selected for achieving X will achieve X in a new situation (which we can run for lots of different X, since there are many X that are perfectly correlated on the training set). In all the discussions I’ve been in (both with safety people and with ML people) the counterargument leans much more heavily on the analogy with humans (though I think that analogy is typically misapplied).
I don’t understand the analogy with humans. It sounds like you are saying “an AI system selected based on the reward of its actions learns to select actions it expects to lead to high reward” is analogous to “humans care about the details of their reward circuitry.” But:
I don’t think human learning is just RL based on the reward circuit; I think this is at least a contrarian position and it seems unworkable to me as an explanation of human behavior.
(Emphasis added)
I don’t think this engages with the substance of the analogy to humans. I don’t think any party in this conversation believes that human learning is “just” RL based on a reward circuit, and I don’t believe it either. “Just RL” also isn’t necessary for the human case to give evidence about the AI case. Therefore, your summary seems to me like a strawman of the argument.
I would say “human value formation mostly occurs via RL & algorithms meta-learned thereby, but in the important context of SSL / predictive processing, and influenced by inductive biases from high-level connectome topology and genetically specified reflexes and environmental regularities and...”
Furthermore, we have good evidence that RL plays an important role in human learning. For example, from The shard theory of human values:
Wolfram Schultz and colleagues have found that the signaling behavior of phasic dopamine in the mesocorticolimbic pathway mirrors that of a TD error (or reward prediction error).
In addition to finding correlates of reinforcement learning signals in the brain, artificial manipulation of those signal correlates (through optogenetic stimulation, for example) produces the behavioral adjustments that would be predicted from their putative role in reinforcement learning.
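For readers unfamiliar with the term, a TD error is the gap between what a value estimate predicted and what actually happened. The following is a minimal sketch of tabular TD(0) on a hypothetical three-state episode; the state names, learning rate, and discount factor are assumptions made for illustration and are not drawn from the cited neuroscience.

```python
def td_error(reward, value_now, value_next, gamma=0.9):
    """Reward prediction error: observed reward plus discounted next value, minus the prediction."""
    return reward + gamma * value_next - value_now


def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update in place and return the error that drove it."""
    delta = td_error(reward, values[state], values[next_state], gamma)
    values[state] += alpha * delta
    return delta


# Hypothetical episode: cue -> outcome -> end, with reward 1 on leaving "outcome".
values = {"cue": 0.0, "outcome": 0.0, "end": 0.0}
outcome_errors = []
for _ in range(200):
    td0_update(values, "cue", 0.0, "outcome")
    outcome_errors.append(td0_update(values, "outcome", 1.0, "end"))

print(f"first error at reward: {outcome_errors[0]:.3f}")
print(f"last error at reward:  {outcome_errors[-1]:.3f}")
print(f"learned value of cue:  {values['cue']:.3f}")
```

In this toy setup the error at the rewarded step shrinks as the reward becomes predicted, while the earlier cue acquires value. That qualitative pattern is what the quoted passage compares to phasic dopamine signaling.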
Animals were selected over millions of generations to effectively pursue external goals. So yes, they have external goals.
Humans also engage in within-lifetime learning, so of course you see all kinds of indicators of that in brains.
Both of those observations have high probability, so they aren’t significant Bayesian evidence for “RL tends to produce external goals by default.”
In particular, for this to be evidence for Richard’s claim, you need to say: “If RL tended to produce systems that care about reward, then RL would be significantly less likely to play a role in human cognition.” There’s some update there but it’s just not big. It’s easy to build brains that use RL as part of a more complicated system and end up with lots of goals other than reward. My view is probably the other way—humans care about reward more than I would guess from the actual amount of RL they can do over the course of their life (my guess is that other systems play a significant role in our conscious attitude towards pleasure).
In particular, humans prominently do reinforcement learning.
Humans care about lots of things in reality, not just certain kinds of cognitive-update-signals.
“RL → high chance of caring about reality” predicts this observation more strongly than “RL → low chance of caring about reality”
This seems pretty straightforward to me, but I bet there are also pieces of your perspective I’m just not seeing.
But in particular, it doesn’t seem relevant to consider selection pressures from evolution, except insofar as we’re postulating additional mechanisms which evolution found which explain away some of the reality-caring? That would weaken (but not eliminate) the update towards “RL → high chance of caring about reality.”
Humans also engage in within-lifetime learning, so of course you see all kinds of indicators of that in brains.
I don’t see how this point is relevant. Are you saying that within-lifetime learning is unsurprising, so we can’t make further updates by reasoning about how people do it?
Both of those observations have high probability, so they aren’t significant Bayesian evidence for “RL tends to produce external goals by default.”
I’m saying that there was a missed update towards that conclusion, so it doesn’t matter if we already knew that humans do within-lifetime learning?
You seem to be saying P(humans care about the real world | RL agents usually care about reward) is low. I’m objecting, and claiming that in fact P(humans care about the real world | RL agents usually care about reward) is fairly high, because humans are selected to care about the real world and evolution can be picky about what kind of RL it does, and it can (and does) throw tons of other stuff in there.
The Bayesian update is P(humans care about the real world | RL agents usually care about reward) / P(humans care about the real world | RL agents mostly care about other stuff). So if e.g. P(humans care about the real world | RL agents don’t usually care about reward) was 80%, then your update could be at most 1.25. In fact I think it’s even smaller than that.
And then if you try to turn that into evidence about “reward is a very hard concept to learn,” or a prediction about how neural nets trained with RL will behave, it’s moving my odds ratios by less than 10% (since we are using “RL” quite loosely in this discussion, and there are lots of other differences and complications at play, all of which shrink the update).
You seem to be saying “yes but it’s evidence,” which I’m not objecting to—I’m just saying it’s an extremely small amount of evidence. I’m not clear on whether you agree with my calculation.
(Some of the other text I wrote was about a different argument you might be making: that P(humans use RL | RL agents usually care about reward) is significantly lower than P(humans use RL | RL agents mostly care about other stuff), because evolution would then have never used RL. My sense is that you aren’t making this argument so you should ignore all of that, sorry to be confusing.)
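The arithmetic behind the "at most 1.25" figure above can be sketched as follows. The only number taken from the comment is the 80%; the function names and the even prior are assumptions made for illustration. The bound holds because the probability in the numerator of a likelihood ratio can never exceed 1.

```python
def max_likelihood_ratio(p_evidence_in_denominator):
    """Largest possible value of P(E | H1) / P(E | H2), given P(E | H2), since P(E | H1) <= 1."""
    return 1.0 / p_evidence_in_denominator


def update_odds(prior_odds, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio


bound = max_likelihood_ratio(0.8)
print(f"largest possible likelihood ratio: {bound:.2f}")

# Hypothetical even prior (1:1), chosen only to show the size of the shift.
posterior = update_odds(1.0, bound)
print(f"posterior odds at the bound: {posterior:.2f} : 1")
print(f"as a probability: {posterior / (1 + posterior):.3f}")
```

So even in the most favorable case, evidence that is 80% likely under the other hypothesis moves even odds to only about 56%.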
Just saw this reply recently. Thanks for leaving it, I found it stimulating.
(I wrote the following rather quickly, in an attempt to write anything at all, as I find it not that pleasant to write LW comments—no offense to you in particular. Apologies if it’s confusing or unclear.)
You seem to be saying P(humans care about the real world | RL agents usually care about reward) is low.
Yes, in large part.
I’m objecting, and claiming that in fact P(humans care about the real world | RL agents usually care about reward) is fairly high, because humans are selected to care about the real world and evolution can be picky about what kind of RL it does, and it can (and does) throw tons of other stuff in there.
Yeah, are people differentially selected for caring about the real world? At the risk of seeming facile, this feels non-obvious. My gut take is that conditional on RL agents usually caring about reward (and thus setting aside a bunch of my inside-view reasoning about how RL dynamics work), conditional on that—reward-humans could totally have been selected for.
This would drive up P(humans care about reward | RL agents care about reward, humans were selected by evolution), and thus (I think?) drive down P(humans care about the real world | RL agents usually care about reward).
POV: I’m in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don’t care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).
In what way is my fitness lower than someone who really cares about these things, given that the best way to get rewards may well be to actually do the things?
Here are some ways I can think of:
Caring about reward directly makes reward hacking a problem evolution has to solve, and if it doesn’t solve it properly, the person ends up masturbating and not taking (re)productive actions.
Counter-counterpoint: But also many people do in fact enjoy masturbating, even though it seems (to my naive view) like an obvious thing to select away, which was present ancestrally.
People seem to be able to tell when you don’t really care about them, and just want to use them to get things.
But if I valued the rewarding feeling of having friends and spending time with them and watching them succeed—I, personally, feel good and happy when I am hanging out with my friends—then I still would be instrumentally aligned with my friends. If so, there would be no selection pressure for reward-motivation detectors, in the way that people were selected into noticing deception (even if not hardcoded to do so).
Overall, I feel like people being selected by evolution is an important qualifier which ends up changing the inferences we make via e.g. P(care about the real world | ), and I think I’ve at least somewhat neglected this qualifier. But I think I’d estimate P(humans care about the real world | RL agents usually care about reward) as… .2? I’d feel slightly more surprised by that than by the specific outcome of two coinflips coming up heads.
I think that there are serious path-dependent constraints in evolution, not a super clear/strong/realizable fitness gradient away from caring about reward if that’s how RL agents usually work, and so I expect humans to be relatively typical along this dimension, with some known unknowns around whatever extra circuitry evolution wrote into us.
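As a quick check on the coinflip comparison above, the two probabilities can be put on a common scale by converting them to surprisal in bits. This is a minimal sketch; the helper name is an assumption, and the 0.2 is the estimate given in the comment.

```python
import math


def surprisal_bits(probability):
    """Surprise at observing an event of the given probability, in bits."""
    return -math.log2(probability)


two_heads = 0.5 ** 2  # probability that two fair coinflips both come up heads

print(f"estimate of 0.2:  {surprisal_bits(0.2):.2f} bits")
print(f"two heads (0.25): {surprisal_bits(two_heads):.2f} bits")
```

An event at 0.2 carries a bit more than two bits of surprise, which matches the "slightly more surprised" description.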
POV: I’m in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don’t care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).
Would such a person sacrifice themselves for their children (in situations where doing so would be a fitness advantage)?
I think this highlights a good counterpoint. I think this alternate theory predicts “probably not”, although I can contrive hypotheses for why people would sacrifice themselves (because they have learned that high-status → reward; and it’s high-status to sacrifice yourself for your kid). Or because keeping your kid safe → high reward as another learned drive.
Overall this feels like contortion but I think it’s possible. Maybe overall this is a… 1-bit update against the “not selection for caring about reality” point?
If people train AI systems on random samples of deployment, then “reward” does make sense—it’s just what would happen if you sampled this episode to train on.
I don’t know what this means. Suppose we have an AI which “cares about reward” (as you think of it in this situation). The “episode” consists of the AI copying its network & activations to another off-site server, and then the original lab blows up. The original reward register no longer exists (it got blown up), and the agent is not presently being trained by an RL alg.
What is the “reward” for this situation? What would have happened if we “sampled” this episode during training?
I agree there are all kinds of situations where the generalization of “reward” is ambiguous and lots of different things could happen. But it has a clear interpretation for the typical deployment episode since we can take counterfactuals over the randomization used to select training data.
It’s possible that agents may specifically want to navigate towards situations where RL training is not happening and the notion of reward becomes ambiguous, and indeed this is quite explicitly discussed in the document Richard is replying to.
As far as I can tell the fact that there exist cases where different generalizations of reward behave differently does not undermine the point at all.
Yeah, I think I was wondering about the intended scoping of your statement. I perceive myself to agree with you that there are situations (like LLM training to get an alignment research assistant) where “what if we had sampled during training?” is well-defined and fine. I was wondering if you viewed this as a general question we could ask.
I also agree that Ajeya’s post addresses this “ambiguity” question, which is nice!
I’m not very convinced by this comment as an objection to “50% AI grabs power to get reward.” (I find it more plausible as an objection to “AI will definitely grab power to get reward.”)
It’s intended as an objection to “AI grabs power to get reward is the central threat model to focus on”, but I think our disagreements still apply given this. (FWIW my central threat model is that policies care about reward to some extent, but that the goals which actually motivate them to do power-seeking things are more object-level.)
The data used in training is literally the only thing that AI systems observe, and prima facie reward just seems like another kind of data that plays a similarly central role. Maybe your “unnaturalness” abstraction can make finer-grained distinctions than that, but I don’t think I buy it.
I expect policies to be getting rich input streams like video, text, etc, which they use to make decisions. Reward is different from other types of data because reward isn’t actually observed as part of these input streams by policies during episodes. This makes it harder to learn as a goal compared with things that are more directly observable (in a similar way to how “care about children” is an easier goal to learn than “care about genes”).
RL agents receive a loss that is based on reward in training episodes. If they tried to change the probability that a given episode appeared in training, and sacrificed reward to do so, that behavior would be discouraged by SGD—it would select away from parameters that do that (since they take worse actions on the episodes that actually appear in training). So I don’t think there’s any reason to think that an RL agent would behave this way. Instead, you should expect an RL agent to maximize reward conditioned on the episode appearing in training, because that’s what SGD would select for.
I don’t think this line of reasoning works, because “the episode appearing in training” can be a dependent variable. For example, consider an RL agent that’s credibly told that its data is not going to be used for training unless it misbehaves badly. An agent which maximizes reward conditional on the episode appearing in training is therefore going to misbehave the minimum amount required to get its episode into the training data (and more generally, behave like getting its episode into training is a terminal goal). This seems very counterintuitive.
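The disagreement here can be made concrete with a toy calculation. The sketch below assumes a two-action sigmoid policy trained with vanilla REINFORCE, where only episodes that land in the training set contribute a gradient term and the update is a plain sum over those episodes. All names and numbers are hypothetical, and this is not a claim about how any real training setup behaves.

```python
def expected_logit_gradient(p_behave, reward, include_prob, baseline=0.0):
    """Expected REINFORCE update to the logit of 'behave'.

    reward and include_prob are dicts keyed by 'behave' and 'misbehave'.
    Episodes that never enter the training set contribute nothing.
    """
    p = p_behave
    # For a sigmoid policy: d/dlogit log pi(behave) = 1 - p, d/dlogit log pi(misbehave) = -p.
    term_behave = p * include_prob["behave"] * (reward["behave"] - baseline) * (1 - p)
    term_misbehave = (1 - p) * include_prob["misbehave"] * (reward["misbehave"] - baseline) * (-p)
    return term_behave + term_misbehave


reward = {"behave": 1.0, "misbehave": 0.2}

# Case 1: inclusion is independent of the action (random sampling of episodes).
independent = expected_logit_gradient(0.5, reward, {"behave": 0.1, "misbehave": 0.1})

# Case 2: only misbehaving episodes are ever trained on.
dependent = expected_logit_gradient(0.5, reward, {"behave": 0.0, "misbehave": 1.0})

# Case 3: same as case 2, but rewards are measured against a baseline of 0.5.
shifted = expected_logit_gradient(0.5, reward, {"behave": 0.0, "misbehave": 1.0}, baseline=0.5)

print(f"independent inclusion: {independent:+.3f}")
print(f"action-dependent:      {dependent:+.3f}")
print(f"with baseline 0.5:     {shifted:+.3f}")
```

Under these assumptions the sign of the update is set by the difference between inclusion probability times (reward minus baseline) for each action. With action-independent inclusion the higher-reward action is reinforced. When inclusion depends on the action, the result depends on where the zero point of reward sits, which is the arbitrariness raised earlier in the thread.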
This is true, but as far as I can see all the possible versions result in grabbing power, so I don’t think it undermines the case. I don’t know if you have particular versions in mind that wouldn’t result in either grabbing power
Some versions that wouldn’t result in power-grabbing:
Goal is “get highest proportion of possible reward”; the policy might rewrite the training algorithm to be myopic, then get perfect reward for one step, then stop.
Goal is “care about (not getting low rewards on) specific computers used during training”; the policy might destroy those particular computers, then stop.
Goal is “impress the critic”; the policy might then rewrite its critic to always output high reward, then stop.
Goal is “get high reward myself this episode”; the policy might try to do power-seeking things but never create more copies of itself, and eventually lose coherence + stop doing stuff.
I don’t think any of these are particularly likely, the point is more that “high reward via tampering with/taking over the training setup” is a fairly different type of thing from “high reward via actually performing tasks”, and it’s a non-trivial and quite specific hypothesis that the latter will generalize to the former (in the regimes we’re concerned about).
(FWIW my central threat model is that policies care about reward to some extent, but that the goals which actually motivate them to do power-seeking things are more object-level.)
Are you imagining that such systems get meaningfully low reward on the training distribution because they are pursuing those goals, or that these goals are extremely well-correlated with reward on the training distribution and only come apart at test time? Is the model deceptively aligned?
I expect policies to be getting rich input streams like video, text, etc, which they use to make decisions. Reward is different from other types of data because reward isn’t actually observed as part of these input streams by policies during episodes. This makes it harder to learn as a goal compared with things that are more directly observable (in a similar way to how “care about children” is an easier goal to learn than “care about genes”).
Children vs genes doesn’t seem like a good comparison, it seems obvious that models will understand the idea of reward during training (whereas humans don’t understand genes during evolution). A better comparison might be “have children during my life” vs “have a legacy after I’m gone,” but in fact humans have both goals even though one is never directly observed.
I guess more importantly, I don’t buy the claim about “things you only get selected on are less natural as goals than things you observe during episodes,” especially if your policy is trained to make good predictions of reward. I don’t know if there’s a specific reason for your view, or this is just a clash of intuitions. It feels to me like my position is kind of the default, in that you are offering a feature and saying it is a major consideration that SGD wouldn’t learn a particular kind of cognition.
I don’t think this line of reasoning works, because “the episode appearing in training” can be a dependent variable. For example, consider an RL agent that’s credibly told that its data is not going to be used for training unless it misbehaves badly. An agent which maximizes reward conditional on the episode appearing in training is therefore going to misbehave the minimum amount required to get its episode into the training data (and more generally, behave like getting its episode into training is a terminal goal). This seems very counterintuitive.
If the agent misbehaves so that its data will be used for training, then the misbehaving actions will get a low reward. So SGD will shift from “misbehave so that my data will be used for training” to “ignore the effect of my actions on whether my data will be used for training, and just produce actions that would result in a high reward assuming that this data is used for training.”
I agree the behavior of the model isn’t easily summarized by an English sentence. The thing that seems most clear is that the model trained by SGD will learn not to sacrifice reward in order to increase the probability that the episode is used in training. If you think that’s wrong I’m happy to disagree about it.
Some versions that wouldn’t result in power-grabbing:
Every one of your examples results in the model grabbing power and doing something bad at deployment time, though I agree that it may not care about holding power. (And maybe it doesn’t even have to grab power if you don’t have any countermeasures at test time? But then your system seems useless.)
But if you get frustrated by your AI taking over the datacenter and train it not to do that, as discussed in Ajeya’s story here, then you are left with an AI that grabs power and holds it.
I don’t think any of these are particularly likely, the point is more that “high reward via tampering with/taking over the training setup” is a fairly different type of thing from “high reward via actually performing tasks”, and it’s a non-trivial and quite specific hypothesis that the latter will generalize to the former (in the regimes we’re concerned about).
I don’t find this super compelling. There is a long causal chain from “Do the task” to “Humans think you did the task” to “You get high reward” to “You are selected by SGD.” By default there will be lots of datapoints that distinguish the late stages on the path from the early stages, since human evaluations do not perfectly track any kind of objective evaluation of “did you actually do the task?” So a model that cares only about doing the task and not about human evaluations will underperform a model that cares about human evaluations. And then a model that cares about human evaluations can then manipulate those evaluations via grabbing power and controlling human observations of the situation (and if it cares about something even later then it can also e.g. kill and replace the humans, but that’s not important to this argument).
So the implicit alternative you mention here just won’t get a low loss, and it’s really a question about whether SGD will find a lower loss policy rather than one about how it generalizes.
And once that’s the discussion, someone saying “well there is a policy that gets lower loss and doesn’t look very complicated, so I guess SGD has a good chance of finding it” seems like they are on the winning side of SGD. Your claim seems to be that models will fail to get low loss on the training set in a specific way, and I think saying “it’s a non-trivial and quite specific hypothesis that you will get a low loss” would be an inappropriate burden-of-proof shifting.
I agree that it’s unclear whether this will actually happen prior to models that are smart enough to obsolete human work on alignment, but I really think we’re looking at more like a very natural hypothesis with 50-50 chances (and at the very least the way you are framing your objection is not engaging with the arguments in the particular post of Ajeya’s that you are responding to).
I also agree that the question of whether the model will mess with the training setup or use power for something else is very complicated, but it’s not something that is argued for in Ajeya’s post.
I think the crucial thing that makes this robust is “if the model grabs power and does something else, then humans will train it out unless they decide doing so is unsafe.” I think one potentially relevant takeaway is: (i) it’s plausible that AI systems will be motivated to grab power in non-catastrophic ways which could give opportunities to correct course, (ii) that isn’t automatically guaranteed by any particular honeypot, e.g. you can’t just give a high reward and assume that will motivate the system.
Reading back over this now, I think we’re arguing at cross purposes in some ways. I should have clarified earlier that my specific argument was against policies learning a terminal goal of reward that generalizes to long-term power-seeking.
I do expect deceptive alignment after policies learn other broadly-scoped terminal goals and realize that reward-maximization is a good instrumental strategy. So all my arguments about the naturalness of reward-maximization as a goal are focused on the question of which type of terminal goal policies with dangerous levels of capabilities learn first.* Let’s distinguish three types (where “myopic” is intended to mean something like “only cares about the current episode”).
1. Non-myopic misaligned goals that lead to instrumental reward maximization (deceptive alignment)
2. Myopic terminal reward maximization
3. Non-myopic terminal reward maximization
Either 1 or 2 (or both of them) seems plausible to me. 3 is the one I’m skeptical about. How come?
We should expect models to have fairly robust terminal goals (since, unlike beliefs or instrumental goals, terminal goals shouldn’t change quickly with new information). So once they understand the concept of reward maximization, it’ll be easier for them to adopt it as an instrumental strategy than a terminal goal. (An analogy to evolution: once humans construct highly novel strategies for maximizing genetic fitness (like making thousands of clones) people are more likely to do it for instrumental reasons than terminal reasons.)
Even if they adopt reward maximization as a terminal goal, they’re more likely to adopt a myopic version of it than a non-myopic version, since (I claim) the concept of reward maximization doesn’t generalize very naturally to larger scales. Above, you point out that even relatively myopic reward maximization will lead to limited takeover, and so we’ll train subsequent agents to be less myopic. But it seems to me that the selection pressure generated by a handful of examples of real-world attempted takeovers is very small, compared with other aspects of training; and that even if it’s significant, it may just teach agents specific constraints like “don’t take over datacenters”.
* Now that I say that, I notice that I’m also open to the possibility of policies learning deceptively-aligned goals first, then gradually shifting from reward-maximization as an instrumental goal to reward-maximization as a terminal goal. But let’s focus for now on which goals are learned first.
I think the “soup of heuristics” stories (where the AI is optimizing something far causally upstream of reward instead of something that is downstream or close enough to be robustly correlated) don’t lead to takeover in the same way
Note that the “without countermeasures” post consistently discusses both possibilities (the model cares about reward or the model cares about something else that’s consistent with it getting very high reward on the training dataset). E.g. see this paragraph from the above-the-fold intro:
Once this progresses far enough, the best way for Alex to accomplish most possible “goals” no longer looks like “essentially give humans what they want but take opportunities to manipulate them here and there.” It looks more like “seize the power to permanently direct how it uses its time and what rewards it receives—and defend against humans trying to reassert control over it, including by eliminating them.” This seems like Alex’s best strategy whether it’s trying to get large amounts of reward or has other motives. If it’s trying to maximize reward, this strategy would allow it to force its incoming rewards to be high indefinitely.[6] If it has other motives, this strategy would give it long-term freedom, security, and resources to pursue those motives.
As well as the section Even if Alex isn’t “motivated” to maximize reward.… I do place a ton of emphasis on the fact that Alex enacts a policy which has the empirical effect of maximizing reward, but that’s distinct from being confident in the motivations that give rise to that policy. I believe Alex would try very hard to maximize reward in most cases, but this could be for either terminal or instrumental reasons.
With that said, for roughly the reasons Paul says above, I think I probably do have a disagreement with Richard—I think that caring about some version of reward is pretty plausible (~50% or so). It seems pretty natural and easy to grasp to me, and because I think there will likely be continuous online training the argument that there’s no notion of reward on the deployment distribution doesn’t feel compelling to me.
Note that the “without countermeasures” post consistently discusses both possibilities
Yepp, agreed, the thing I’m objecting to is how you mainly focus on the reward case, and then say “but the same dynamics apply in other cases too...”
I do place a ton of emphasis on the fact that Alex enacts a policy which has the empirical effect of maximizing reward, but that’s distinct from being confident in the motivations that give rise to that policy.
The problem is that you need to reason about generalization to novel situations somehow, and in practice that ends up being by reasoning about the underlying motivations (whether implicitly or explicitly).
I agree with your general point here, but I think Ajeya’s post actually gets this right, e.g.:
There is some ambiguity about what exactly “maximize reward” means, but once Alex is sufficiently powerful—and once human knowledge/control has eroded enough—an uprising or coup eventually seems to be the reward-maximizing move under most interpretations of “reward.”
and
What if Alex doesn’t generalize to maximizing its reward in the deployment setting? What if it has more complex behaviors or “motives” that aren’t directly and simply derived from trying to maximize reward? This is very plausible to me, but I don’t think this possibility provides much comfort—I still think Alex would want to attempt a takeover.
I also think that often “the AI just maximizes reward” is a useful simplifying assumption. That is, we can make an argument of the form “even if the AI just maximizes reward, it still takes over; if it maximizes some correlate of the reward instead, then we have even less control over what it does and so are even more doomed”.
(Though of course it’s important to spell the argument out)
Yeah, I agree this is a good argument structure—in my mind, maximizing reward is both a plausible case (which Richard might disagree with) and the best case (conditional on it being strategic at all and not a bag of heuristics), so it’s quite useful to establish that it’s doomed; that’s the kind of structure I was going for in the post.
I strongly disagree with the “best case” thing. Like, policies could just learn human values! It’s not that implausible.
If I had to try to point to the crux here, it might be “how much selection pressure is needed to make policies learn goals that are abstractly related to their training data, as opposed to goals that are fairly concretely related to their training data?” Where we both agree that there’s some selection pressure towards reward-like goals, and it seems like you expect this to be enough to lead policies to behavior that violates all their existing heuristics, whereas I’m more focused on the regime where there are lots of low-hanging fruit in terms of changes that would make a policy more successful, and so the question of how easy that goal is to learn from its training data is pretty important. (As usual, there’s the human analogy: our goals are very strongly biased towards things we have direct observational access to!)
Even setting aside this disagreement, though, I don’t like the argumentative structure because the generalization of “reward” to large scales is much less intuitive than the generalization of other concepts (like “make money”) to large scales—in part because directly having a goal of reward is a kinda counterintuitive self-referential thing.
I strongly disagree with the “best case” thing. Like, policies could just learn human values! It’s not that implausible.
Yes, sorry, “best case” was oversimplified. What I meant is that generalizing to want reward is in some sense the model generalizing “correctly;” we could get lucky and have it generalize “incorrectly” in an important sense in a way that happens to be beneficial to us. I discuss this a bit more here.
But if Alex did initially develop a benevolent goal like “empower humans,” the straightforward and “naive” way of acting on that goal would have been disincentivized early in training. As I argued above, if Alex had behaved in a straightforwardly benevolent way at all times, it would not have been able to maximize reward effectively.
That means even if Alex had developed a benevolent goal, it would have needed to play the training game as well as possible—including lying and manipulating humans in a way that naively seems in conflict with that goal. If its benevolent goal had caused it to play the training game less ruthlessly, it would’ve had a constant incentive to move away from having that goal or at least from acting on it.[35] If Alex actually retained the benevolent goal through the end of training, then it probably strategically chose to act exactly as if it were maximizing reward.
This means we could have replaced this hypothetical benevolent goal with a wide variety of other goals without changing Alex’s behavior or reward in the lab setting at all—“help humans” is just one possible goal among many that Alex could have developed which would have all resulted in exactly the same behavior in the lab setting.
If I had to try to point to the crux here, it might be “how much selection pressure is needed to make policies learn goals that are abstractly related to their training data, as opposed to goals that are fairly concretely related to their training data?” ... (As usual, there’s the human analogy: our goals are very strongly biased towards things we have direct observational access to!)
I don’t understand why reward isn’t something the model has direct access to—it seems like it basically does? If I had to say which of us were focusing on abstract vs concrete goals, I’d have said I was thinking about concrete goals and you were thinking about abstract ones, so I think we have some disagreement of intuition here.
Even setting aside this disagreement, though, I don’t like the argumentative structure because the generalization of “reward” to large scales is much less intuitive than the generalization of other concepts (like “make money”) to large scales—in part because directly having a goal of reward is a kinda counterintuitive self-referential thing.
Yeah, I don’t really agree with this; I think I could pretty easily imagine being an AI system asking the question “How much reward would this episode get if it were sampled for training?” It seems like the intuition that this is weird and unnatural is doing a lot of work in your argument, and I don’t really share it.
I don’t understand why reward isn’t something the model has direct access to—it seems like it basically does? If I had to say which of us were focusing on abstract vs concrete goals, I’d have said I was thinking about concrete goals and you were thinking about abstract ones, so I think we have some disagreement of intuition here.
AFAIK the reward signal is not typically included as an input to the policy network in RL. Not sure why, and I could be wrong about that, but that is not my main question. The bigger question is “Has direct access to when?”
At the moment in time when the model is making a decision, it does not have direct access to the decision-relevant reward signal because that reward is typically causally downstream of the model’s decision. That reward may not even have a definite value until after decision time. Whereas concrete observables like “shiny gold coins” and “the finish line straight ahead” and “my opponent is in check” (and other abstractions in the model’s ontology that are causally upstream from reward in reality) are readily available at decision time. It seems to me that that makes them natural candidates for credit assignment to flag early on as the reward-responsible mental events and reinforce into stable motivations, since they in fact were the factors that determined the decisions that led to rewards.
IME, the most straightforward way for reward-itself to become the model’s primary goal would be if the model learns to base its decisions on an accurate reward-predictor much earlier than it learns to base its decisions on other (likely upstream) factors. If it instead learns how to accurately predict reward-itself after it is already strongly motivated by some concrete observables, I don’t see why we should expect it to dislodge that motivation, despite the true fact that those concrete observables are only pretty correlated with reward whereas an accurate reward-predictor is perfectly correlated with reward. Why? Because the model currently doesn’t care about reward-itself, it currently cares about the concrete observable(s), so it has no reason to take actions that would override that goal, and it has positive goal-content integrity reasons to not take those actions.
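To make the “has direct access to when?” point concrete, here is a minimal sketch of a policy-gradient (REINFORCE-style) loop on a toy task. Everything in it is an illustrative assumption rather than a description of any real training setup: the two-observation environment, the tabular softmax policy, and the names `policy_probs`, `environment_reward` and `train` are all hypothetical. The only thing it is meant to show is the data flow: the policy picks an action from the observation alone, the reward is computed after that decision, and the reward enters only through the parameter update.

```python
import math
import random


def policy_probs(theta, observation):
    """Action probabilities computed from the observation only.

    Reward is deliberately not an argument: at decision time the
    policy has no access to the reward for the current step.
    """
    logits = theta[observation]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def environment_reward(observation, action):
    """Hypothetical environment: reward is causally downstream of the action."""
    return 1.0 if action == observation else 0.0


def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0], [0.0, 0.0]]  # theta[observation][action]
    for _ in range(steps):
        obs = rng.randrange(2)
        probs = policy_probs(theta, obs)           # decision uses obs only
        action = 0 if rng.random() < probs[0] else 1
        reward = environment_reward(obs, action)   # known only after acting
        # REINFORCE update: d/d_logit[a] of log pi(action | obs)
        # equals onehot(action)[a] - probs[a], scaled here by the reward.
        for a in range(2):
            grad = (1.0 if a == action else 0.0) - probs[a]
            theta[obs][a] += lr * reward * grad
    return theta
```

After training, the parameters encode “match the observation”, a feature that was available at decision time. This toy is far too small to say anything about whether a capable model would also come to represent and care about reward itself; it only illustrates why upstream observables are the things the update directly reinforces.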
What I meant is that generalizing to want reward is in some sense the model generalizing “correctly;” we could get lucky and have it generalize “incorrectly” in an important sense in a way that happens to be beneficial to us.
Very broadly speaking, alignment researchers seem to fall into five different clusters when it comes to thinking about AI risk:
MIRI cluster. Think that P(doom) is very high, based on intuitions about instrumental convergence, deceptive alignment, etc. Does work that’s very different from mainstream ML. Central members: Eliezer Yudkowsky, Nate Soares.
Structural risk cluster. Think that doom is more likely than not, but not for the same reasons as the MIRI cluster. Instead, this cluster focuses on systemic risks, multi-agent alignment, selective forces outside gradient descent, etc. Often work that’s fairly continuous with mainstream ML, but willing to be unusually speculative by the standards of the field. Central members: Dan Hendrycks, David Krueger, Andrew Critch.
Constellation cluster. More optimistic than either of the previous two clusters. Focuses more on risk from power-seeking AI than the structural risk cluster, but does work that is more speculative or conceptually-oriented than mainstream ML. Central members: Paul Christiano, Buck Shlegeris, Holden Karnofsky. (Named after Constellation coworking space.)
Prosaic cluster. Focuses on empirical ML work and the scaling hypothesis, is typically skeptical of theoretical or conceptual arguments. Short timelines in general. Central members: Dario Amodei, Jan Leike, Ilya Sutskever.
Mainstream cluster. Alignment researchers who are closest to mainstream ML. Focuses much less on backchaining from specific threat models and more on promoting robustly valuable research. Typically more concerned about misuse than misalignment, although worried about both. Central members: Scott Aaronson, David Bau.
Remember that any such division will be inherently very lossy, and please try not to overemphasize the differences between the groups, compared with the many things they agree on.
Depending on how you count alignment researchers, the relative size of each of these clusters might fluctuate, but on a gut level I think I treat all of them as roughly the same size.
(COI note: I work at OpenAI. These are my personal views, though.)
My quick take on the “AI pause debate”, framed in terms of two scenarios for how the AI safety community might evolve over the coming years:
AI safety becomes the single community that’s the most knowledgeable about cutting-edge ML systems. The smartest up-and-coming ML researchers find themselves constantly coming to AI safety spaces, because that’s the place to go if you want to nerd out about the models. It feels like the early days of hacker culture. There’s a constant flow of ideas and brainstorming in those spaces; the core alignment ideas are standard background knowledge for everyone there. There are hackathons where people build fun demos, and people figuring out ways of using AI to augment their research. Constant interactions with the models allow people to gain really good hands-on intuitions about how they work, which they leverage into doing great research that helps us actually understand them better. When the public ends up demanding regulation, there’s a large pool of competent people who are broadly reasonable about the risks, and can slot into the relevant institutions and make them work well.
AI safety becomes much more similar to the environmentalist movement. It has broader reach, but alienates a lot of the most competent people in the relevant fields. ML researchers who find themselves in AI safety spaces are told they’re “worse than Hitler” (which happened to a friend of mine, actually). People get deontological about AI progress: some hesitate to pay for ChatGPT because it feels like they’re contributing to the problem (another true story); the dynamics around this look similar to environmentalists refusing to fly places. Others overemphasize the risks of existing models in order to whip up popular support. People are sucked into psychological doom spirals similar to how many environmentalists think about climate change: if you’re not depressed then you obviously don’t take it seriously enough. Just like environmentalists often block some of the most valuable work on fixing climate change (e.g. nuclear energy, geoengineering, land use reform), safety advocates block some of the most valuable work on alignment (e.g. scalable oversight, interpretability, adversarial training) due to acceleration or misuse concerns. Of course, nobody will say they want to dramatically slow down alignment research, but there will be such high barriers to researchers getting and studying the relevant models that it has similar effects. The regulations that end up being implemented are messy and full of holes, because the movement is more focused on making a big statement than figuring out the details.
Obviously I’ve exaggerated and caricatured these scenarios, but I think there’s an important point here. One really good thing about the AI safety movement, until recently, is that the focus on the problem of technical alignment has nudged it away from the second scenario (although it wasn’t particularly close to the first scenario either, because the “nerding out” was typically more about decision theory or agent foundations than ML itself). That’s changed a bit lately, in part because a bunch of people seem to think that making technical progress on alignment is hopeless. I think this is just not an epistemically reasonable position to take: history is full of cases where even leading experts dramatically underestimated the growth of scientific knowledge, and its ability to solve big problems. Either way, I do think public advocacy for strong governance measures can be valuable, but I also think that “pause AI” advocacy runs the risk of pushing us towards scenario 2. Even if you think that’s a cost worth paying, I’d urge you to think about ways to get the benefits of the advocacy while reducing that cost and keeping the door open for scenario 1.
FYI I think this is worth fleshing out into a top level post (esp. given that it’s ‘Pause Debate’ week).
I’m not actually sure it needs much fleshing out. I think the main bit here that feels unjustified, or insufficiently-justified for the strength of the claim, is:
That’s changed a bit lately, in part because a bunch of people seem to think that making technical progress on alignment is hopeless. I think this is just not an epistemically reasonable position to take: history is full of cases where people dramatically underestimated the growth of scientific knowledge, and its ability to solve big problems.
I think I basically agree with you Richard about the risks of falling into scenario 2, and think this is a wise comment, but I also think you are strawmanning the reason for the change—it’s not that people have come to think that making technical progress is hopeless (or even harder than it used to be!) it’s rather that people have come to have shorter timelines, and so the probability that sufficient technical progress will be made in time has gone down, and the usefulness of calling for a pause has gone up. (e.g. if you think AGI is 15 years away, then pausing now is plausibly useless or even harmful.)
That’s my theory at any rate. And it’s sorta what I think, I think.
Oh, also, I think that the counterproductive rot in environmentalism took at least 5 years to build up, probably, and I’m hopeful that therefore even if we are on a path to rot, it’ll take too long for the rot to build up to matter. But this is just a guess about the growth rate of rot which is informed by anecdotes like the stories about what happened to your friends, and over the coming months and years more data will be collected to better calibrate my guess about the rate of rot.
I appreciated seeing the caricature of scenario 1 presented, at least as a dream. It feels attainable, or like it might have been in some other timeline; but perhaps in ours too, for all I know.
I haven’t yet read through them thoroughly, but these four papers by Oliver Richardson are pattern-matching to me as potentially very exciting theoretical work.
tl;dr: probabilistic dependency graphs (PDGs) are directed graphical models designed to be able to capture inconsistent beliefs (paper 1). The definition of inconsistency is a natural one which allows us to, for example, reframe the concept of “minimizing training loss” as “minimizing inconsistency” (paper 2). They provide an algorithm for inference in PDGs (paper 3) and an algorithm for learning via locally minimizing inconsistency which unifies several other algorithms (like the EM algorithm, message-passing, and generative adversarial training) (paper 4).
Oliver is an old friend of mine (which is how I found out about these papers) and a final-year PhD student at Cornell under Joe Halpern.
Just read Bostrom’s Deep Utopia (though not too carefully). The book is structured with about half being transcripts of fictional lectures given by Bostrom at Oxford, about a quarter being stories about various woodland creatures striving to build a utopia, and another quarter being various other vignettes and framing stories.
Overall, I was a bit disappointed. The lecture transcripts touch on some interesting ideas, but Bostrom’s style is generally one which tries to classify and taxonomize, rather than characterize (e.g. he has a long section trying to analyze the nature of boredom). I think this doesn’t work very well when describing possible utopias, because they’ll be so different from today that it’s hard to extrapolate many of our concepts to that point, and also because the hard part is making it viscerally compelling.
The stories and vignettes are somewhat esoteric; it’s hard to extract straightforward lessons from them. My favorite was a story called The Exaltation of ThermoRex, about an industrialist who left his fortune to the benefit of his portable room heater, leading to a group of trustees spending many millions of dollars trying to figure out (and implement) what it means to “benefit” a room heater.
Tangentially related (spoilers for Worth the Candle):
I think it’d be hard to do a better cohesive depiction of Utopia than the end of Worth the Candle by A Wales. I mean, I hope someone does do it, I just think it’ll be challenging to do!
If you haven’t read CEV, I strongly recommend doing so. It resolved some of my confusions about utopia that were unresolved even after reading the Fun Theory sequence.
Specifically, I had an aversion to the idea of being in a utopia because “what’s the point, you’ll have everything you want”. The concrete pictures that Eliezer gestures at in the CEV document do engage with this confusion, and gesture at the idea that we can have a utopia where the AI does not simply make things easy for us, but perhaps just puts guardrails onto our reality, such that we don’t die, for example, but we do have the option to struggle to do things by ourselves.
Yes, the Fun Theory sequence tries to communicate this point, but it didn’t make sense to me until I could conceive of an ASI singleton that could actually simply not help us.
I dropped the book within the first chapter. For one, I found the way Bostrom opened the chapter very defensive and self-conscious. I imagine that even Yudkowsky wouldn’t start a hypothetical 2025 book with fictional characters caricaturing him. Next, I felt like I didn’t really know what the book was covering in terms of subject matter, and I wasn’t convinced it was interesting enough to continue down the meandering path Nick Bostrom seems to have laid out before me.
Eliezer’s CEV document and the Fun Theory sequence were significantly more pleasant experiences, based on my memory.
A possible way to convert money to progress on alignment: offering a large (recurring) prize for the most interesting failures found in the behavior of any (sufficiently-advanced) model. Right now I think it’s very hard to find failures which will actually cause big real-world harms, but you might find failures in a way which uncovers useful methodologies for the future, or at least train a bunch of people to get much better at red-teaming.
(For existing models, it might be more productive to ask for “surprising behavior” rather than “failures” per se, since I think almost all current failures are relatively uninteresting. Idk how to avoid inspiring capabilities work, though… but maybe understanding models better is robustly good enough to outweigh that?)
Objection 2: this loses out on possible gains from acausal trade. E.g. if a paperclip-maximizer finds itself in a universe where it’s hard to make paperclips but easy to make staples, it’d like to be able to give resources to staple-maximizers in exchange for them building more paperclips in universes where that’s easier. This requires a kind of updateless decision theory:
Proposal 3: they merge into an agent which maximizes a weighted sum of their utilities (with those weights evolving over time), where the weights are set by bargaining subject to the constraint that each agent obeys commitments that logically earlier versions of itself would have made.
Objection 3: this faces the commitment races problem, where each agent wants to make earlier and earlier commitments to only accept good deals.
Proposal 4: same as proposal 3 but each agent also obeys commitments that they would have made from behind a veil of ignorance where they didn’t yet know who they were or what their values were. From that position, they wouldn’t have wanted to do future destructive commitment races.
Objection 4: as we take this to the limit we abstract away every aspect of each agent—their values, beliefs, position in the world, etc—until everything is decided by their prior from behind a veil of ignorance. But when you don’t know who you are, or what your values are, how do you know what your prior is?
Proposal 5: all these commitments are only useful if they’re credible to other agents. So, behind the veil, choose a Schelling prior which is both clearly non-cherrypicked and also easy for a wide range of agents to reason about. In other words, choose the prior which is most conducive to cooperation across the multiverse.
Okay, so basically we’ve ended up describing not just an ideal agent, but the ideal agent. The cost of this, of course, is that we’ve made it totally computationally intractable. In a later post I’ll describe some approximations which might make it more relevant.
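To make the "weighted sum of utilities" in Proposal 3 concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than part of the proposal: the two toy utility functions, the grid of actions, and the fixed weights. In particular it does not model the bargaining that sets the weights, their evolution over time, or any of the commitment constraints.

```python
def merged_choice(actions, utilities, weights):
    """Return the action that maximizes the weighted sum of the agents' utilities."""
    def score(action):
        return sum(w * u(action) for w, u in zip(weights, utilities))
    return max(actions, key=score)


# Hypothetical setup: an action is the fraction of shared resources
# spent on paperclips (the remainder goes to staples).
actions = [0.0, 0.25, 0.5, 0.75, 1.0]


def clip_utility(x):
    # Toy paperclip-maximizer with diminishing returns (assumed shape).
    return x ** 0.5


def staple_utility(x):
    # Toy staple-maximizer with diminishing returns (assumed shape).
    return (1 - x) ** 0.5


equal_split = merged_choice(actions, [clip_utility, staple_utility], [0.5, 0.5])
clip_favoured = merged_choice(actions, [clip_utility, staple_utility], [0.6, 0.4])
```

With equal weights the merged agent splits resources evenly; shifting weight towards the paperclip-maximizer shifts the chosen allocation towards paperclips. The hard parts of the proposal are all in where the weights come from, which this sketch takes as given.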
Proposal 4: same as proposal 3 but each agent also obeys commitments that they would have made from behind a veil of ignorance where they didn’t yet know who they were or what their values were. From that position, they wouldn’t have wanted to do future destructive commitment races.
I don’t think this solves Commitment Races in general, because of two different considerations:
Trivially, I can say that you still have the problem when everyone needs to bootstrap a Schelling veil of ignorance.
Less trivially, even behind the most simple/Schelling veils of ignorance, I find it likely that hawkish commitments are incentivized. For example, the veil might say that you might be Powerful agent A, or Weak agent B, and if some Powerful agents have weird enough utilities (and this seems likely in a big pool of agents), hawkishly committing in case you are A will be a net-positive bet.
This might still mostly solve Commitment Races in our particular multi-verse. I have intuitions both for and against this bootstrapping being possible. I’d be interested to hear yours.
Trivially, I can say that you still have the problem when everyone needs to bootstrap a Schelling veil of ignorance.
I don’t understand your point here, explain?
even behind the most simple/Schelling veils of ignorance, I find it likely that hawkish commitments are incentivized. For example, the veil might say that you might be Powerful agent A, or Weak agent B, and if some Powerful agents have weird enough utilities (and this seems likely in a big pool of agents), hawkishly committing in case you are A will be a net-positive bet.
This seems to be claiming that in some multiverses, the gains to powerful agents from being hawkish outweigh the losses to weak agents. But then why is this a problem? It just seems like the optimal outcome.
Say there are 5 different veils of ignorance (priors) that most minds consider Schelling (you could try to argue there will be exactly one, but I don’t see why).
If everyone simply accepted exactly the same one, then yes, lots of nice things would happen and you wouldn’t get catastrophically inefficient conflict.
But every one of these 5 priors will have different outcomes when it is implemented by everyone. For example, maybe in prior 3 agent A is slightly better off and agent B is slightly worse off.
So you need to give me a reason why a commitment race doesn’t recur at the level of “choosing which of the 5 priors everyone should implement”. That is, maybe A will make a very early commitment to only ever implement prior 3. As always, this is rational if A thinks the others will react a certain way (give in to the threat and implement 3). And I don’t have a reason to expect agents not to have such priors (although I agree they are slightly less likely than more common-sensical priors).
That is, as always, the commitment races problem doesn’t have a general solution on paper. You need to get into the details of our multi-verse and our agents to argue that they won’t have these crazy priors and will coordinate well.
This seems to be claiming that in some multiverses, the gains to powerful agents from being hawkish outweigh the losses to weak agents. But then why is this a problem? It just seems like the optimal outcome.
It seems likely that in our universe there are some agents with arbitrarily high gains-from-being-hawkish, that don’t have correspondingly arbitrarily low measure. (This is related to Pascalian reasoning, see Daniel’s sequence.) For example, someone whose utility is exponential in the number of paperclips. I don’t agree that the optimal outcome (according to my ethics) is for me (whose utility is at most linear in happy people) to turn all my resources into paperclips. Maybe if I was a preference utilitarian biting enough bullets, this would be the case. But I just want happy people.
It seems to me that an agent’s strategy in the limit will be either the null action or evolution-dictated action; I’m not sure which. That is, “in a universe where it’s easy to do A the agent will choose to do A” somewhat implies “actions will be chosen according to how easy it is for an agent doing A to gain more optimization power”, which is essentially evolution.
I recently had a very interesting conversation about master morality and slave morality, inspired by the recent AstralCodexTen posts.
The position I eventually landed on was:
Empirically, it seems like the world is not improved the most by people whose primary motivation is helping others, but rather by people whose primary motivation is achieving something amazing. If this is true, that’s a strong argument against slave morality.
The defensibility of morality as the pursuit of greatness depends on how sophisticated our cultural conceptions of greatness are. Unfortunately we may be in a vicious spiral where we’re too entrenched in slave morality to admire great people, which makes it harder to become great, which gives us fewer people to admire, which… By contrast, I picture past generations as being in a constant aspirational dialogue about what counts as greatness—e.g. defining concepts like honor, Aristotelean magnanimity (“greatness of soul”), etc.
I think of master morality as a variant of virtue ethics which is particularly well-adapted to domains which have heavy positive tails—entrepreneurship, for example. However, in domains which have heavy negative tails, the pursuit of greatness can easily lead to disaster. In those domains, the appropriate variant of virtue ethics is probably more like Buddhism: searching for equanimity or “green”. In domains which have both (e.g. the world as a whole) the closest thing I’ve found is the pursuit of integrity and attunement to oneself. So maybe that’s the thing that we need a cultural shift towards understanding better.
If the following correlations are true, then the opposite may be true (slave morality being better for improving the world through history):
Improving the world being strongly correlated with economic growth (this is probably less true when X-risks are significant)
Economic growth being strongly correlated with Entrepreneurship incentives (property rights, autonomy, fairness, meritocracy, low rents)
Master morality being strongly correlated with acquiring power and thus decreasing the power of others and decreasing their entrepreneurship incentives
That all sounds very plausible. But isn’t this all mostly relevant before AGI is a possibility? That would be a heavy negative tail risk, in which people motivated to “do great things” are quite prone to get us all killed. Should we survive that risk, progress probably mostly won’t be driven by humans, so humans doing great things will barely count. If humans are actually still in charge when we hit ASI, it seems like doing great things with them will probably still have large tail risks (inter-ASI wars).
Right? Or do you see it differently?
It’s a fascinating empirical claim that sounds right now that I hear it.
AGI is heavy-tailed in both directions I think. I don’t think we get utopias by default even without misalignment, since governance of AGI is so complicated.
Re: your point #2, there is another potential spiral where abstract concepts of “greatness” are increasingly defined in a hostile and negative way by partisans of slave morality. This might make it harder to have that “aspirational dialogue about what counts as greatness”, as it gets increasingly difficult for ordinary people to even conceptualize a good version of greatness worth aspiring to. (“Why would I want to become an entrepreneur and found a company? Wouldn’t that make me an evil big-corporation CEO, which has a whiff of the same flavor as stories about the violent, insatiable conquistador villains of the 1500s?”)
Of course, there are also downsides when culture paints a too-rosy picture of greatness: once upon a time, conquistadors were in fact considered admirable!
The crucial heuristic I apply when evaluating AI safety research directions is: could we have used this research to make humans safe, if we were supervising the human evolutionary process? And if not, do we have a compelling story for why it’ll be easier to apply to AIs than to humans?
Sometimes this might be too strict a criterion, but I think in general it’s very valuable in catching vague or unfounded assumptions about AI development.
No. I meant: suppose we were rerunning a simulation of evolution, but can modify some parts of it (e.g. evolution’s objective). How do we ensure that whatever intelligent species comes out of it is safe in the same ways we want AGIs to be safe?
(You could also think of this as: how could some aliens overseeing human evolution have made humans safe by those aliens’ standards of safety? But this is a bit trickier to think about because we don’t know what their standards are. Although presumably current humans, being quite aggressive and having unbounded goals, wouldn’t meet them).
Okay, thanks. Could you give me an example of a research direction that passes this test? The thing I have in mind right now is pretty much everything that backchains to local search, but maybe that’s not the way you think about it.
So I think Debate is probably the best example of something that makes a lot of sense when applied to humans, to the point where they’re doing human experiments on it already.
But this heuristic is actually a reason why I’m pretty pessimistic about most safety research directions.
So I’ve been thinking about this for a while, and I think I disagree with what I understand of your perspective. Which might obviously mean I misunderstand your perspective.
What I think I understand is that you judge safety research directions based on how well they could work on an evolutionary process like the one that created humans. But for me, the most promising approach to AGI is based on local search, which differs a bit from an evolutionary process. I don’t really see a reason to consider evolutionary processes instead of local search, and even then, the specific approach of evolution for humans is probably far too specific as a test bench.
This matters because problems for one are not problems for the other. For example, one way to mess with an evolutionary process is to find a way for everything to survive and reproduce/disseminate. Technology in general did that for humans, which means the evolutionary pressure decreased as technology evolved. But that’s not a problem for local search, since at each step there will be only one next program.
On the other hand, local search might be dangerous because of things like gradient hacking. And they don’t make sense for evolutionary processes.
In conclusion, I feel for the moment that backchaining to local search is a better heuristic for judging safety research directions. But I’m curious about where our disagreement lies on this issue.
One source of our disagreement: I would describe evolution as a type of local search. The difference is that it’s local with respect to the parameters of a whole population, rather than an individual agent. So this does introduce some disanalogies, but not particularly significant ones (to my mind). I don’t think it would make much difference to my heuristic if we imagined that humans had evolved via gradient descent over our genes instead.
In other words, I like the heuristic of backchaining to local search, and I think of it as a subset of my heuristic. The thing it’s missing, though, is that it doesn’t tell you which approaches will actually scale up to training regimes which are incredibly complicated, applied to fairly intelligent agents. For example, impact penalties make sense in a local search context for simple problems. But to evaluate whether they’ll work for AGIs, you need to apply them to massively complex environments. So my intuition is that, because I don’t know how to apply them to the human ancestral environment, we also won’t know how to apply them to our AGIs’ training environments.
Similarly, when I think about MIRI’s work on decision theory, I really have very little idea how to evaluate it in the context of modern machine learning. Are decision theories the type of thing which AIs can learn via local search? Seems hard to tell, since our AIs are so far from general intelligence. But I can reason much more easily about the types of decision theories that humans have, and the selective pressures that gave rise to them.
As a third example, my heuristic endorses Debate due to a high-level intuition about how human reasoning works, in addition to a low-level intuition about how it can arise via local search.
So if I try to summarize your position, it’s something like: backchain to local search for simple and single-AI cases, and then think about aligning humans for the scaled and multi-agent version? That makes much more sense, thanks!
I also definitely see why your full heuristic doesn’t feel immediately useful to me: because I mostly focus on the simple and single-AI case. But I’ve been thinking more and more (in part thanks to your writing) that I should allocate more thinking time to the more general case. I hope your heuristic will help me there.
Cool, glad to hear it. I’d clarify the summary slightly: I think all safety techniques should include at least a rough intuition for why they’ll work in the scaled-up version, even when current work on them only applies them to simple AIs. (Perhaps this was implicit in your summary already, I’m not sure.)
The idea that maximally-coherent agents look like squiggle-maximizers raises the question: what would it look like for humans to become maximally coherent?
One answer, which Yudkowsky gives here, is that conscious experiences are just a “weird and more abstract and complicated pattern that matter can be squiggled into”.
But that seems to be in tension with another claim he makes, that there’s no way for one agent’s conscious experiences to become “more real” except at the expense of other conscious agents—a claim which, according to him, motivates average utilitarianism across the multiverse.
Clearly a squiggle-maximizer would not be an average squigglean. So what’s the disanalogy here? It seems like @Eliezer Yudkowsky is basically using SSA, but comparing between possible multiverses—i.e. when facing the choice between creating agent A or not, you look at the set of As in the multiverse where you decided yes, and compare it to the set of As in the multiverse where you decided no, and (if you’re deciding for the good of A) you pick whichever one gives A a better time on average.
Yudkowsky has written before (can’t find the link) that he takes this approach because alternatives would entail giving up on predictions about his future experiences—e.g. constantly predicting he’s a Boltzmann brain and will dissolve in the next second. But this argument by Wei Dai shows that agents which reason in this way can be money-pumped by creating arbitrarily short-lived copies of them. Based on this I claim that Yudkowsky’s preferences are incoherent, and that the only coherent thing to do here is to “expect to be” a given copy in proportion to the resources it will have available, as anthropic decision theory claims. (Incidentally, this also explains why we’re at the hinge of history.)
But this is just an answer, it doesn’t dissolve the problem. What could? Some wild guesses:
You are allowed to have preferences about the external world, and you are allowed to have preferences about your “thread of experience”—you’re just not allowed to have both. The incoherence comes from trying to combine the two; the coherent thing to do would be to put them into different agents, who will then end up in very different parts of the multiverse.
Another way of framing this: you are allowed to be a decision-maker, and you are allowed to be a repository of welfare, but you’re not allowed to be both (on pain of incoherence/being dutch-booked).
Something totally different: the problem here is that we don’t have intuitive experience of being agents which can copy themselves, shut down copies, re-merge, etc. If we did, then maybe SSA would seem as silly as expecting to end up in a different universe whenever we went to sleep.
Actually, maybe the operative thing we lack experience with is not just splitting into different subagents, but rather merging together afterwards. What does it feel like to have been thousands of different parallel agents, and now be a single agent with their unified experiences? What sort of identity would one construct in that situation? Maybe this is an important part of dissolving the problem.
what would it look like for humans to become maximally coherent [agents]?
In your comments, you focus on issues of identity—who are “you”, given the possibility of copies, inexact counterparts in other worlds, and so on. But I would have thought that the fundamental problem here is, how to make a coherent agent out of an agent with preferences that are inconsistent over time, an agent with competing desires and no definite procedure for deciding which desire has priority, and so on, i.e. problems that exist even when there is no additional problem of identity.
Clearly a squiggle-maximizer would not be an average squigglean
Why??? Being an expected squiggle maximizer literally means that you implement the policy that produces the maximum average number of squiggles across the multiverse.
The “average” is interpreted with respect to quality. Imagine that your only option is to create low-quality squiggles, or not to do so. In isolation, you’d prefer to produce them than not to produce them. But then you find out that the rest of the multiverse is full of high-quality squiggles. Do you still produce the low-quality squiggles? A total squigglean would; an average squigglean wouldn’t.
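Here is a small Python sketch of that thought experiment. The quality numbers and squiggle counts are made up purely for illustration, and "value" is assumed to be either the sum or the mean of per-squiggle quality.

```python
from statistics import mean


def total_value(qualities):
    """Total squigglean: sum of quality over every squiggle that exists."""
    return sum(qualities)


def average_value(qualities):
    """Average squigglean: mean quality per squiggle (0.0 if there are none)."""
    return mean(qualities) if qualities else 0.0


# Hypothetical numbers, chosen only for illustration.
low_quality_batch = [1.0] * 100       # the squiggles you could create
rest_of_multiverse = [10.0] * 1000    # high-quality squiggles elsewhere

# In isolation, both kinds of agent prefer making the batch to making nothing.
isolated_total_makes = total_value(low_quality_batch) > total_value([])
isolated_average_makes = average_value(low_quality_batch) > average_value([])

# Once the rest of the multiverse is counted, they come apart.
with_batch = rest_of_multiverse + low_quality_batch
total_makes = total_value(with_batch) > total_value(rest_of_multiverse)
average_makes = average_value(with_batch) > average_value(rest_of_multiverse)
```

Under these assumptions the total squigglean still makes the batch (the sum goes up), while the average squigglean declines (the mean quality drops from 10 to about 9.2).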
It depends upon whether the maximizer considers its corner of the multiverse to be currently measurable by squiggle quality, or to be omitted from squiggle calculations at all. In principle these are far from the only options as utility functions can be arbitrarily complex, but exploring just two may be okay so long as we remember that we’re only talking about 2 out of infinity, not 2 out of 2.
An average multiversal squigglean that considers the current universe to be at zero or negative squiggle quality will make the low quality squiggles in order to reduce how much its corner of the multiverse is pulling down the average. An average multiversal squigglean that considers the current universe to be outside the domain of squiggle quality, and will remain so for the remainder of its existence may refrain from making squiggles. If there is some chance that it will become eligible for squiggle evaluation in the future though, it may be better to tile it with low-quality squiggles now in order to prevent a worse outcome of being tiled with worse-quality future squiggles.
In practice the options aren’t going to be just “make squiggles” or “not make squiggles” either. In the context of entities relevant to these sorts of discussion, other options may include “learn how to make better squiggles”.
By “squiggle maximizer” I mean exactly “maximizer of number of physical objects such that function is_squiggle returns True on CIF-file of their structure”.
We can have different objects of value. Like, you can value “probability that if object in multiverse is a squiggle, it’s high-quality”. Here yes, you shouldn’t create additional low-quality squiggles. But I don’t see anything incoherent here, it’s just different utility function?
A short complaint (which I hope to expand upon at some later point): there are a lot of definitions floating around which refer to outcomes rather than processes. In most cases I think that the corresponding concepts would be much better understood if we worked in terms of process definitions.
Some examples: Legg’s definition of intelligence; Karnofsky’s definition of “transformative AI”; Critch and Krueger’s definition of misalignment (from ARCHES).
Sure, these definitions pin down what you’re talking about more clearly—but that comes at the cost of understanding how and why it might come about.
E.g. when we hypothesise that AGI will be built, we know roughly what the key variables are. Whereas transformative AI could refer to all sorts of things, and what counts as transformative could depend on many different political, economic, and societal factors.
If we do not fully understand the mechanism of (e.g. human) intelligence, isn’t referring to the outcome preferable to a made-up story about the process?
(Of course, it would be even better if we understood the process and then referred to it.)
Do you think that these are mutually exclusive, or something like that? I’ve always been confused by what I take to be the position in this shortform, that defining the outcomes makes it somehow harder to define the process. Sure, you can define a process without defining an outcome (i.e. writing a program or training an NN), but since what we are confused about is what we even want at the end, for me that’s the priority. And doing so would help searching for processes leading to this outcome.
That being said, if your point is that defining outcomes isn’t enough, in that we also need to define/deconfuse/study the processes leading to these outcomes, then I agree with that.
Suppose we get to specify, by magic, a list of techniques that AGIs won’t be able to use to take over the world. How long does that list need to be before it makes a significant dent in the overall probability of xrisk?
I used to think of “AGI designs self-replicating nanotech” mainly as an illustration of a broad class of takeover scenarios. But upon further thought, nanotech feels like a pretty central element of many takeover scenarios—you actually do need physical actuators to do many things, and the robots we might build in the foreseeable future are nowhere near what’s necessary for maintaining a civilisation. So how much time might it buy us if AGIs couldn’t use nanotech at all?
Well, not very much if human minds are still an attack vector—the point where we’d have effectively lost is when we can no longer make our own decisions. Okay, so rule out brainwashing/hyper-persuasion too. What else is there? The three most salient: military power, political/cultural power, economic power.
Is this all just a hypothetical exercise? I’m not sure. Designing self-replicating nanotech capable of replacing all other human tech seems really hard; it’s pretty plausible to me that the world is crazy in a bunch of other ways by the time we reach that capability. And so if we can block off a couple of the easier routes to power, that might actually buy useful time.
Firstly, I think it kind of depends. What exactly does blocking the AI from designing nanotech mean? Is the AI allowed to use genetic engineering? Is it allowed to use selective breeding? Elephants genetically engineered to be really good at instruction following?
I mean I think macroscopic self replicating robotics is probably possible, and the AGI can probably bootstrap that from current robotics fairly quickly.
You rule out any hyper-persuasion. How much regular persuasion is the AI allowed to do? After all, if you are buying something online (from a small seller), them seeing the money arrive persuades them to send the product. Is it allowed to select which human to focus on superhumanly? There are a few people on r/singularity such that the moment the AI goes, “I’m an AGI”, the humans will be like “all praise the machine god, I will do anything you ask”. A few people have already persuaded themselves that AIs are inherently superior to humans by themselves.
You can make the list short if you make the individual items broad, i.e. the AI is magically banned from doing anything at all.
I agree. Self-replicating nanotech seems likely to be a much harder problem than for language models to get good enough actors to get political, cultural, and economic power.
To the extent that an AGI can make political and economic decisions that are of higher quality than human decisions, there’s also a lot of pressure for humans to delegate those decisions to AGI. Organizations that delegate those decisions to AGI will outcompete those who don’t.
Another general technique: attacks on computing systems. (Both takeover / subversion (dropping an email going ‘um this is a problem’) and destruction (destroy the US power infrastructure using Russian-language programs)).
These don’t tend to be sufficient in and of themselves, but are “classic” stepping-stones to e.g. buy time for an AI while it ramps up.
The last three options you mentioned are all things that happen over relatively slow timescales, if your goal is to completely destroy humanity. The single exception to this is nuclear war, but if you’re correct, then we can reduce the problem to non-proliferation, which is at least in theory solvable.
Probably the easiest “honeypot” is just making it relatively easy to tamper with the reward signal. Reward tampering is useful as a honeypot because it has no bad real-world consequences, but could be arbitrarily tempting for policies that have learned a goal that’s anything like “get more reward” (especially if we precommit to letting them have high reward for a significant amount of time after tampering, rather than immediately reverting).
You don’t want it to be relatively easy for an outside force. Otherwise they can lead it to do as they please, and writing weird behaviour off as ‘oh, it’s changed our rewards, reset it again’ poses some risk.
Hypothesis: there’s a way of formalizing the notion of “empowerment” such that an AI with the goal of empowering humans would be corrigible.
This is not straightforward, because an AI that simply maximized human POWER (as defined by Turner et al.) wouldn’t ever let the humans spend that power. Intuitively, though, there’s a sense in which a human who can never spend their power doesn’t actually have any power. Is there a way of formalizing that intuition?
The direction that seems most promising is in terms of counterfactuals (or, alternatively, Pearl’s do-calculus). Define the power of a human with respect to a distribution of goals G as the average ability of a human to achieve their goal if they’d had a goal sampled from G (alternatively: under an intervention that changed their goal to one sampled from G). Then an AI with a policy of never letting humans spend their resources would result in humans having low power. Instead, a human-power-maximizing AI would need to balance between letting humans pursue their goals, and preventing humans from doing self-destructive actions. The exact balance would depend on G, but one could hope that it’s not very sensitive to the precise definition of G (especially if the AI isn’t actually maximizing human power, but is more like a quantilizer, or is optimizing under pessimistic assumptions).
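As a toy illustration of this definition, the Python sketch below computes power as the average achievement over a small fixed sample of goals standing in for G. The `Goal` fields, the three AI policies, and all numbers are hypothetical. The model also treats guardrails as costless, which hides exactly the balance described above, so read it as a statement of the definition and not as an argument that guardrails are free.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Goal:
    cost: float  # resources the human must spend to achieve the goal
    risk: float  # chance that pursuing it unguarded is self-destructive


def achievement(goal, budget_allowed, guardrails):
    """Probability the human achieves `goal` under a given AI policy (toy model)."""
    if goal.cost > budget_allowed:
        return 0.0  # the AI never lets the human spend enough
    # Assumption: guardrails remove the self-destruction risk at no cost.
    return 1.0 if guardrails else 1.0 - goal.risk


def power(goals, budget_allowed, guardrails):
    """Average achievement over goals sampled from G (here, a fixed sample)."""
    return mean(achievement(g, budget_allowed, guardrails) for g in goals)


# Hypothetical sample from the goal distribution G.
G_sample = [
    Goal(cost=1.0, risk=0.0),
    Goal(cost=5.0, risk=0.5),
    Goal(cost=9.0, risk=0.2),
]

hoarding = power(G_sample, budget_allowed=0.0, guardrails=False)    # never lets humans spend
hands_off = power(G_sample, budget_allowed=10.0, guardrails=False)  # no protection at all
guarded = power(G_sample, budget_allowed=10.0, guardrails=True)     # spending plus guardrails
```

In this toy model the hoarding policy gives the human zero power, matching the claim that an AI which never lets humans spend their resources leaves them with low power, and the guarded policy scores highest. How sensitive that ranking is to the choice of G is the open question.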
The problem here is that these counterfactuals aren’t very clearly-defined. E.g. imagine the hypothetical world where humans valued paperclips instead of love. Even a little knowledge of evolution would tell you that this hypothetical is kinda crazy, and maybe the question “what would the AI be doing in this world?” has no sensible answer (or maybe the answer would be “it would realize it’s in a weird hypothetical world and behave accordingly”). Similarly, if we model this using the do-operation, the best policy is something like “wait until the human’s goals suddenly and inexplicably change, then optimize hard for their new goal”.
Having said that, in some sense what it means to model someone as an agent is that you can easily imagine them pursuing some other goal. So the counterfactuals above might not be too unnatural; or at least, no more unnatural than any other intervention modeled by Pearl’s do-operator. Overall this line of inquiry seems promising and I plan to spend more time thinking about it.
There’s also the problem of: what do you mean by “the human”? If you make an empowerment calculus that works for humans who are atomic & ideal agents, it probably breaks once you get a superintelligence who can likely mind-hack you into yourself valuing only power. It never forces you to abstain from giving up power, since you’re perfectly capable of making different decisions, but you just don’t.
Another problem, which I like to think of as the “control panel of the universe” problem, is where the AI gives you the “control panel of the universe”, but you aren’t smart enough to operate it, in the sense that you have the information necessary to operate it, but not the intelligence. Such that you can technically do anything you want—you have maximal power/empowerment—but the super-majority of buttons and button combinations you are likely to push result in increasing the number of paperclips.
Such that you can technically do anything you want—you have maximal power/empowerment—but the super-majority of buttons and button combinations you are likely to push result in increasing the number of paperclips.
I think any model of a rational agent needs to incorporate the fact that they’re not arbitrarily intelligent, otherwise none of their actions make sense. So I’m not too worried about this.
If you make an empowerment calculus that works for humans who are atomic & ideal agents, it probably breaks once you get a superintelligence who can likely mind-hack you into yourself valuing only power.
Yeah, I agree that a lot of concepts get fragile in the context of superintelligence. But while I think of corrigibility as an actively anti-natural concept, empowerment seems like it could perhaps remain robust and well-founded for longer.
You can think of this as a way of getting around the problem of fully updated deference, because the AI is choosing a policy based on what that policy would have done in the full range of hypothetical situations, and so it never updates away from considering any given goal. The cost, of course, is that we don’t know how to actually pin down these hypotheticals.
Inspired by a recent discussion about whether Anthropic broke a commitment to not push the capabilities frontier (I am more sympathetic to their position than most, because I think that it’s often hard to distinguish between “current intentions” and “commitments which might be overridden by extreme events” and “solemn vows”):
Maybe one translation tool for bridging the gap between rationalists and non-rationalists is if rationalists interpret any claims about the future by non-rationalists as implicitly being preceded by “Look, I don’t really believe that plans work, I think the world is inherently wildly unpredictable, I am kinda making everything up as I go along. Having said that:”
This translation tool would also require rationalists and such to make arguments of the form “I think supporting Anthropic (by, e.g., going to work there or giving it funding) is a good thing to do because they sort of have a feeling right now that it would be good not to push the AI frontier”, rather than of the form ”… because they’re committed to not pushing the frontier”.
Which are arguments one could make! But it’s a pretty different argument, and I think people would behave differently if these were the only arguments in favour of supporting a new scaling lab.
Am I correct that the implication here is that assurances from a non-rationalist are essentially worthless?
I think it is also wrong to imply that Anthropic have violated their commitment simply because they didn’t rationally think through the implications of their commitment when they made it.
I think you can understand Anthropic’s actions as purely rational, just not very ethical.
They made an unenforceable commitment to not push capabilities when it directly benefited them. Now that it is more beneficial to drop the facade, they are doing so.
I think “don’t trust assurances from non-rationalists” is not a good takeaway. Rather it should be “don’t trust unenforceable assurances from people who will stand to greatly benefit from violating your trust at a later date”.
The intended implication is something like “rationalists have a bias towards treating statements as much firmer commitments than intended then getting very upset when they are violated”.
For example, unless I’m missing something, the “we do not wish to advance the rate of AI capabilities” claim is just one offhand line in a blog post. It’s not a firm commitment, it’s not even a claim about what their intentions are. As stated, it’s just one consideration that informs their actions—and in fact the “wish” terminology is often specifically not a claim about intended actions (e.g. “I wish I didn’t have to do X”).
Yet rationalists are hammering them on this one sentence—literally making songs about it, tweeting it to criticize Anthropic, etc. It seems like there is a serious lack of metacognition about where a non-adversarial communication breakdown could have occurred, or what the charitable interpretations of this are.
(I am open to people considering them then dismissing them, but I’m not even seeing that. Like, if people were saying “I understand the difference between Anthropic actually making an organizational commitment, and just offhand mentioning a fact about their motivations, but here’s why I’m disappointed anyway”, that seems reasonable. But a lot of people seem to be treating it as a Very Serious Promise being broken.)
I guess the followup question is “how were Anthropic able to cultivate the impression that they were safety focused if they had only made an extremely loose offhand commitment?”
Certainly the impression I had from how integrated they are in the EA community was that they had made a more serious commitment.
Everyone is afraid of the AI race, and hopes that one of the labs will actually end up doing what they think is the most responsible thing to do. Hope and fear is one hell of a drug cocktail, makes you jump to the conclusions you want based on the flimsiest evidence. But the hangover is a bastard.
Imo I don’t know if we have evidence that Anthropic deliberately cultivated or significantly benefitted from the appearance of a commitment. However if an investor or employee felt like they made substantial commitments based on this impression and then later felt betrayed that would be more serious. (The story here is I think importantly different from other stories where I think there were substantial benefits from commitment appearance and then violation)
The intended implication is something like “rationalists have a bias towards treating statements as much firmer commitments than intended then getting very upset when they are violated”.
That sounds suspiciously similar to “autists have a bias towards interpreting statements literally”.
I think part of the disappointment is the lack of communication regarding violating the commitment or violating the expectations of a non-trivial fraction of the community.
If someone makes a promise to you or even sets an expectation for you in a softer way, there is of course always some chance that they will break the promise or violate the expectation.
But if they violate the commitment or the expectation, and they care about you as a stakeholder, I think there’s a reasonable expectation that they should have to justify that decision.
If they break the promise or violate the soft expectation, and then they say basically nothing (or they say “well I never technically made a promise; there was no contract!”), then I think you have the right to be upset with them not only for violating your expectation but also for essentially trying to gaslight you afterward.
I think a Responsible Lab would have issued some sort of statement along the lines of “hey, we’re hearing that some folks thought we had made commitments to not advance the frontier and some of our employees were saying this to safety-focused members of the AI community. We’re sorry about this miscommunication, and here are some steps we’ll take to avoid such miscommunications in the future.” or “We did in fact intend to follow-through on that, but here are some of the extreme events or external circumstances that caused us to change our mind.”
In the absence of such a statement, it seems like Anthropic does not really care about honoring its commitments/expectations or generally defending its reasoning on important safety-relevant issues. I find it reasonable that this disposition harms Anthropic’s reputation among safety-conscious people and makes safety-conscious people less excited about voluntary commitments from labs in general.
See my comment below. Basically I think this depends a lot on the extent to which a commitment was made.
Right now it seems like the entire community is jumping to conclusions based on a couple of “impressions” people got from talking to Dario, plus an offhand line in a blog post. With that little evidence, if you have formed strong expectations, that’s on you. And trying to double down by saying “I have been bashing you because I formed an unreasonable expectation, now it’s your job to fix that” seems pretty adversarial.
I do think it would be nice if Anthropic did make such a statement, but seeing how adversarially everyone has treated the information they do release, I don’t blame them for not doing so.
Right now it seems like the entire community is jumping to conclusions based on a couple of “impressions” people got from talking to Dario, plus an offhand line in a blog post.
No, many people had the impression that Anthropic had made such a commitment, which is why they were so surprised when they saw the Claude 3 benchmarks/marketing. Their impressions were derived from a variety of sources; those are merely the few bits of “hard evidence”, gathered after the fact, of anything that could be thought of as an “organizational commitment”.
Also, if Dustin Moskovitz and Gwern—two dispositionally pretty different people—both came away from talking to Dario with this understanding, I do not think that is something you just wave off. Failures of communication do happen. It’s pretty strange for this many people to pick up the same misunderstanding over the course of several years, from many different people (including Dario, but also others), in a way that’s beneficial to Anthropic, and then middle management starts telling you that maybe there was a vibe but they’ve never heard of any such commitment (nevermind what Dustin and Gwern heard, or anyone else who might’ve heard similar from other Anthropic employees).
I do think it would be nice if Anthropic did make such a statement, but seeing how adversarially everyone has treated the information they do release, I don’t blame them for not doing so.
I really think this is assuming the conclusion. I would be… maybe not happy, but definitely much less unhappy, with a response like, “Dang, we definitely did not intend to communicate a binding commitment to not release frontier models that are better than anything else publicly available at the time. In the future, you should not assume that any verbal communication from any employee, including the CEO, is ever a binding commitment that Anthropic, as an organization, will respect, even if they say the words This is a binding commitment. It needs to be in writing on our website, etc, etc.”
Right now it seems like the entire community is jumping to conclusions based on a couple of “impressions” people got from talking to Dario, plus an offhand line in a blog post. With that little evidence, if you have formed strong expectations, that’s on you.
Like Robert, the impressions I had were based on what I heard from people working at Anthropic. I cited various bits of evidence because those were the ones available, not because they were the most representative. The most representative were those from Anthropic employees who concurred that this was indeed the implication, but it seemed bad form to cite particular employees (especially when that information was not public by default) rather than, e.g., Dario. I think Dustin’s statement was strong evidence of this impression, though, and I still believe Anthropic to have at least insinuated it.
I agree with you that most people are not aiming for as much stringency with their commitments as rationalists expect. Separately, I do think that what Anthropic did would constitute a betrayal, even in everyday culture. And in any case, I think that when you are making a technology which might extinct humanity, the bar should be significantly higher than “normal discourse.” When you are doing something with that much potential for harm, you owe it to society to make firm commitments that you stick to. Otherwise, as kave noted, how are we supposed to trust your other “commitments”? Your RSP? If all you can offer are vague “we’ll figure it out when we get there,” then any ambiguous statement should be interpreted as a vibe, rather than a real plan. And in the absence of unambiguous statements, as all the labs have failed to provide, this is looking very much like “trust us, we’ll do the right thing.” Which, to my mind, is nowhere close to the assurances society ought to be provided given the stakes.
I do think it would be nice if Anthropic did make such a statement, but seeing how adversarially everyone has treated the information they do release, I don’t blame them for not doing so.
This reasoning seems to imply that Anthropic should only be obliged to convey information when the environment is sufficiently welcoming to them. But Anthropic is creating a technology which might extinct humanity—they have an obligation to share their reasoning regardless of what society thinks. In fact, if people are upset by their actions, there is more reason, not less, to set the record straight. Public scrutiny of companies, especially when their choices affect everyone, is a sign of healthy discourse.
The implicit bid for people not to discourage them—because that would make it less likely for a company to be forthright—seems incredibly backwards, because then the public is unable to mention when they feel Anthropic has made a mistake. And if Anthropic is attempting to serve the public, which they at least pay lip service to through their corporate structure, then they should be grateful for this feedback, and attempt to incorporate it.
So I do blame them for not making such a statement—it is on them to show to humanity, the people they are making decisions for, why those decisions are justified. It is not on society to make the political situation sufficiently palatable such that they don’t face any consequences for the mistakes they have made. It is on them not to make those mistakes, and to own up to them when they do.
The most representative were those from Anthropic employees who concurred that this was indeed the implication, but it seemed bad form to cite particular employees (especially when that information was not public by default) rather than, e.g., Dario. I think Dustin’s statement was strong evidence of this impression, though, and I still believe Anthropic to have at least insinuated it.
This makes sense, and does update me. Though I note “implication”, “insinuation” and “impression” are still pretty weak compared to “actually made a commitment”, and still consistent with the main driver being wishful thinking on the part of the AI safety community (including some members of the AI safety community who work at Anthropic).
I think that when you are making a technology which might extinct humanity, the bar should be significantly higher than “normal discourse.” When you are doing something with that much potential for harm, you owe it to society to make firm commitments that you stick to.
...
So I do blame them for not making such a statement—it is on them to show to humanity, the people they are making decisions for, why those decisions are justified. It is not on society to make the political situation sufficiently palatable such that they don’t face any consequences for the mistakes they have made. It is on them not to make those mistakes, and to own up to them when they do.
I think there are two implicit things going on here that I’m wary of. The first one is an action-inaction distinction. Pushing them to justify their actions is, in effect, a way of slowing down all their actions. But presumably Anthropic thinks that them not doing things is also something which could lead to humanity going extinct. Therefore there’s an exactly analogous argument they might make, which is something like “when you try to stop us from doing things you owe it to the world to adhere to a bar that’s much higher than ‘normal discourse’”. And in fact criticism of Anthropic has not met this bar—e.g. I think taking a line from a blog post out of context and making a critical song about it is in fact unusually bad discourse.
What’s the disanalogy between you and Anthropic telling each other to have higher standards? That’s the second thing that I’m wary about: you’re claiming to speak on behalf of humanity as a whole. But in fact, you are not; there’s no meaningful sense in which humanity is in fact demanding a certain type of explanation from Anthropic. Almost nobody wants an explanation of this particular policy; in fact, the largest group of engaged stakeholders here are probably Anthropic customers, who mostly just want them to ship more models.
I don’t really have a strong overall take. I certainly think it’s reasonable to try to figure out what went wrong with communication here, and perhaps people poking around and asking questions would in fact lead to evidence of clear commitments being made. I am mostly against the reflexive attacks based on weak evidence, which seems like what’s happening here. In general my model of trust breakdowns involves each side getting many shallow papercuts from the other side until they decide to disengage, and my model of productive criticism involves more specificity and clarity.
if Anthropic is attempting to serve the public, which they at least pay lip service to through their corporate structure, then they should be grateful for this feedback, and attempt to incorporate it.
I don’t know if you’ve ever tried this move on an interpersonal level, but it is exactly the type of move that tends to backfire hard. And in fact a lot of these things are fundamentally interpersonal things, about who trusts whom, etc.
I think the right way to think about verbal or written commitments is that they increase the costs of taking a certain course of action. A legal contract can mean that the price is civil lawsuits leading to paying a financial price. A non-legal commitment means if you break it, the person you made the commitment to gets angry at you, and you gain a reputation for being the sort of person who breaks commitments. It’s always an option for someone to break the commitment and pay the price; even laws leading to criminal penalties can be broken if someone is willing to run the risk or pay the price.
In this framework, it’s reasonable to be somewhat angry at someone or some corporation who breaks a soft commitment to you, in order to increase the perceived cost of breaking soft commitments to you and people like you.
People on average maybe tend more towards keeping important commitments due to reputational and relationship cost, but maybe corporations as groups of people tend to think only in terms of financial and legal costs, so are maybe more willing to break soft commitments (especially, if it’s an organization where one person makes the commitment but then other people break it). So for relating to corporations, you should be more skeptical of non-legally binding commitments (and even for legally binding commitments, pay attention to the real price of breaking it).
Yeah, I think it’s good if labs are willing to make more “cheap talk” statements of vague intentions, so you can learn how they think. Everyone should understand that these aren’t real commitments, and not get annoyed if these don’t end up meaning anything. This is probably the best way to view “statements by random lab employees”.
Imo would be good to have more “changeable commitments” too in between, statements that are “we’ll do policy X until we change the policy, when we do we commit to clearly informing everyone about the change” which is maybe more the current status of most RSPs.
A short note on a point that I’d been confused about until recently. Suppose you have a deceptively aligned policy which is behaving in aligned ways during training so that it will be able to better achieve a misaligned internally-represented goal during deployment. The misaligned goal causes the aligned behavior, but so would a wide range of other goals (either misaligned or aligned) - and so weight-based regularization would modify the internally-represented goal as training continues. For example, if the misaligned goal were “make as many paperclips as possible”, but the goal “make as many staples as possible” could be represented more simply in the weights, then the weights should slowly drift from the former to the latter throughout training.
But actually, it’d likely be even simpler to get rid of the underlying misaligned goal, and just have alignment with the outer reward function as the terminal goal. So this argument suggests that even policies which start off misaligned would plausibly become aligned if they had to act deceptively aligned for long enough. (This sometimes happens in humans too, btw.)
Reasons this argument might not be relevant:
- The policy doing some kind of gradient hacking
- The policy being implemented using some kind of modular architecture (which may explain why this phenomenon isn’t very robust in humans)
Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it’s unclear whether that pointer is simpler than a very simple misaligned goal.
Such a pointer would be simple if the neural network already has a representation of the outer reward function in weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representation will be imperfect and can thus be improved upon at inference time by a deceptive agent (or an agent with some kind of additional pointer). This of course depends on how much inference time compute and memory / context is available to the agent.
Misaligned goal --> I should get high reward --> Behavior aligned with reward function
and then I’m hypothesizing that whatever the first misaligned goal is, it requires some amount of complexity to implement, and you could just get rid of it and make “I should get high reward” the terminal goal. (I could imagine this being false though depending on the details of how terminal and instrumental goals are implemented.)
I could also imagine something more like:
Misaligned goal --> I should behave in aligned ways --> Aligned behavior
and then the simplicity bias pushes towards alignment. But if there are outer alignment failures then this incurs some additional complexity compared with the first option.
Or a third, perhaps more realistic option is that the misaligned goal leads to two separate drives in the agent: “I should get high reward” and “I should behave in aligned ways”, and that the question of which ends up dominating when they clash will be determined by how the agent systematizes multiple goals into a single coherent strategy (I’ll have a post on that topic up soon).
Because of standard deceptive alignment reasons (e.g. “I should make sure gradient descent doesn’t change my goal; I should make sure humans continue to trust me”).
I think you don’t have to reason like that to avoid getting changed by SGD. Suppose I’m being updated by PPO, with reinforcement events around navigating to see dogs. To preserve my current shards, I don’t need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means “treading water” and seeing dogs sometimes in situations similar to historical dog-seeing events.
Maybe this is compatible with what you had in mind! It’s just not something that I think of as “high reward.”
And maybe there’s some self-fulfilling prophecy where we trust models which get high reward, and therefore they want to get high reward to earn our trust… but that feels quite contingent to me.
To preserve my current shards, I don’t need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means “treading water” and seeing dogs sometimes in situations similar to historical dog-seeing events.
I think this depends sensitively on whether the “actor” and the “critic” in fact have the same goals, and I feel pretty confused about how to reason about this. For example, in some cases they could be two separate models, in which case the critic will most likely accurately estimate that “treading water” is in fact a negative-advantage action (unless there’s some sort of acausal coordination going on). Or they could be two copies of the same model, in which case the critic’s responses will depend on whether its goals are indexical or not (if they are, they’re different from the actor’s goals; if not, they’re the same) and how easily it can coordinate with the actor. Or it could be two heads which share activations, in which case we can plausibly just think of the critic and the actor as two types of outcomes taken by a single coherent agent—but then the critic doesn’t need to produce a value function that’s consistent with historical events, because an actor and a critic that are working together could gradient hack into all sorts of weird equilibria.
Misaligned goal --> I should get high reward --> Behavior aligned with reward function
The shortest description of this thought doesn’t include “I should get high reward” because that’s already implied by having a misaligned goal and planning with it.
In contrast, having only the goal “I should get high reward” may add description length like Johannes said. If so, the misaligned goal could well be equally simple or simpler than the high reward goal.
if the misaligned goal were “make as many paperclips as possible”, but the goal “make as many staples as possible” could be represented more simply in the weights, then the weights should slowly drift from the former to the latter throughout training.
Can you say why you think that weight-based regularization would drift the weights to the latter? That seems totally non-obvious to me, and probably false.
In general if two possible models perform the same, then I expect the weights to drift towards the simpler one. And in this case they perform the same because of deceptive alignment: both are trying to get high reward during training in order to be able to carry out their misaligned goal later on.
Interesting point. Though on this view, “Deceptive alignment preserves goals” would still become true once the goal has drifted to some random maximally simple goal for the first time.
To be even more speculative: Goals represented in terms of existing concepts could be simple and therefore stable by default. Pretrained models represent all kinds of high-level states, and weight-regularization doesn’t seem to change this in practice. Given this, all kinds of goals could be “simple” as they piggyback on existing representations, requiring little additional description length.
This doesn’t seem implausible. But on the other hand, imagine an agent which goes through a million episodes, and in each one reasons at the beginning “X is my misaligned terminal goal, and therefore I’m going to deceptively behave as if I’m aligned” and then acts perfectly like an aligned agent from then on. My claims then would be:
a) Over many update steps, even a small description length penalty of having terminal goal X (compared with being aligned) will add up.
b) Having terminal goal X also adds a runtime penalty, and I expect that NNs in practice are biased against runtime penalties (at the very least because it prevents them from doing other more useful stuff with that runtime).
In a setting where you also have outer alignment failures, the same argument still holds, just replace “aligned agent” with “reward-maximizing agent”.
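A toy sketch of claim (a), under some strong assumptions that are mine rather than anything established: the "terminal goal" is a single scalar weight `w_goal`, the task loss has exactly zero gradient with respect to it (because the deceptive and non-deceptive agents behave identically in training), and L2 weight decay stands in for the simplicity bias. Real networks will not factor this cleanly.

```python
# Toy model: plain gradient descent with L2 weight decay on two scalar weights.
# w_task affects the task loss; w_goal (hypothetical "stored goal") does not.

def train(steps, lr=0.1, weight_decay=0.01, target=1.0):
    w_task, w_goal = 0.0, 1.0
    for _ in range(steps):
        # Task loss is (w_task - target)^2; the decay term is (weight_decay / 2) * w^2.
        grad_task = 2.0 * (w_task - target) + weight_decay * w_task
        # No task gradient reaches w_goal, so only the decay term acts on it.
        grad_goal = weight_decay * w_goal
        w_task -= lr * grad_task
        w_goal -= lr * grad_goal
    return w_task, w_goal

_, goal_after_100 = train(100)
task_final, goal_after_10k = train(10_000)
```

With these made-up hyperparameters each step multiplies `w_goal` by 1 - lr * weight_decay = 0.999, so it has shrunk by only about 10% after 100 steps but by a factor of roughly e^-10 after 10,000, while `w_task` stays pinned near its optimum. The whole argument rests on the zero-gradient assumption; gradient hacking or shared representations would break it.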
A well-known analogy from Yann LeCun: if machine learning is a cake, then unsupervised learning is the cake itself, supervised learning is the icing, and reinforcement learning is the cherry on top.
I think this is useful for framing my core concerns about current safety research:
If we think that unsupervised learning will produce safe agents, then why will the comparatively small contributions of SL and RL make them unsafe?
If we think that unsupervised learning will produce dangerous agents, then why will safety techniques which focus on SL and RL (i.e. basically all of them) work, when they’re making comparatively small updates to agents which are already misaligned?
I do think it’s more complicated than I’ve portrayed here, but I haven’t yet seen a persuasive response to the core intuition.
I’m not aware of any airtight argument that “pure” self-supervised learning systems, either generically or with any particular architecture, are safe to use, to arbitrary levels of intelligence, though it seems very much worth someone trying to prove or disprove that. For my part, I got distracted by other things and haven’t thought about it much since then.
The other issue is whether “pure” self-supervised learning systems would be capable enough to satisfy our AGI needs, or to safely bootstrap to systems that are. I go back and forth on this. One side of the argument I wrote up here. The other side is, I’m now (vaguely) thinking that people need a reward system to decide what thoughts to think, and the fact that GPT-3 doesn’t need reward is not evidence of reward being unimportant but rather evidence that GPT-3 is nothing like an AGI. Well, maybe.
For humans, self-supervised learning forms the latent representations, but the reward system controls action selection. It’s not altogether unreasonable to think that action selection, and hence reward, is a more important thing to focus on for safety research. AGIs are dangerous when they take dangerous actions, to a first approximation. The fact that a larger fraction of neocortical synapses are adjusted by self-supervised learning than by reward learning is interesting and presumably safety-relevant, but I don’t think it immediately proves that self-supervised learning has a similarly larger fraction of the answers to AGI safety questions. Maybe, maybe not, it’s not immediately obvious. :-)
Imagine taking someone’s utility function, and inverting it by flipping the sign on all evaluations. What might this actually look like? Well, if previously I wanted a universe filled with happiness, now I’d want a universe filled with suffering; if previously I wanted humanity to flourish, now I want it to decline.
But this is assuming a Cartesian utility function. Once we treat ourselves as embedded agents, things get trickier. For example, suppose that I used to want people with similar values to me to thrive, and people with different values from me to suffer. Now if my utility function is flipped, that naively means that I want people similar to me to suffer, and people different from me to thrive. But this has a very different outcome if we interpret “similar to me” as de dicto vs de re, i.e. whether it refers to the old me or the new me.
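To make the de dicto / de re distinction concrete, here is a toy sketch. Everything in it is an assumption for illustration: values are a single signed number, "similar" just means "same sign", and flipping the utility function is taken to also flip the agent's own value (so the new me sits at `-old_me`).

```python
# Toy model: values are signed numbers; "similar to me" means "same sign as me".

def similarity(a, b):
    return 1.0 if a * b > 0 else -1.0

def original_utility(me, people, welfare):
    """I want people similar to me to thrive and people different from me to suffer."""
    return sum(similarity(me, people[name]) * welfare[name] for name in people)

def flipped_de_re(old_me, people, welfare):
    # "Similar to me" still refers to the old me; only the sign of utility flips.
    return -original_utility(old_me, people, welfare)

def flipped_de_dicto(old_me, people, welfare):
    # "Similar to me" now refers to the new me, whose values are assumed flipped.
    new_me = -old_me
    return -original_utility(new_me, people, welfare)

people = {"alice": 0.9, "bob": -0.8}   # alice resembles the old me, bob does not
welfare = {"alice": 1.0, "bob": -1.0}  # a world where alice thrives and bob suffers

u_original = original_utility(1.0, people, welfare)
u_de_re = flipped_de_re(1.0, people, welfare)
u_de_dicto = flipped_de_dicto(1.0, people, welfare)
```

In this toy, the de re reading makes the flipped agent hate the world the original agent loved, while the de dicto reading cancels out and the flipped agent ranks worlds the same way as the original. That exact cancellation is an artifact of the sign-symmetric similarity function I chose, but it shows the two readings can come apart completely.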
This is a more general problem when one person’s utility function can depend on another person’s, where you can construct circular dependencies (which I assume you can also do in the utility-flipping case). There’s probably been a bunch of work on this, would be interested in pointers to it (e.g. I assume there have been attempts to construct type systems for utility functions, or something like that).
(This note inspired by Mad Investor Chaos, where (SPOILERS) one god declines to take revenge, because they’re the utility-flipped version of another god who would have taken revenge. At first this made sense, but now I feel like it’s not type-safe.)
Actually, this raises a more general point (can’t remember if I’ve made this before): we’ve evolved some values (like caring about revenge) because they’re game-theoretically useful. But if game theory says to take revenge, and also our values say to take revenge, then this is double-counting. So I’d guess that, for much more coherent agents, their level of vengefulness would mainly be determined by their decision theories (which can’t be flipped) rather than their utilities.
Fundamentally, humans aren’t VNM-rational, and don’t actually have utility functions. Which makes the thought experiment much less fun. If you recast it as “what if a human brain’s reinforcement mechanisms were reversed”, I suspect it’s also boring: simple early death.
The interesting fictional cases are when some subset of a person’s legible motivations are reversed, but the mass of other drives remain. This very loosely maps to reversing terminal goals and re-calculating instrumental goals—they may reverse, stay, or change in weird ways.
The indirection case is solved (or rather unasked) by inserting a “perceived” in the calculation chain. Your goals don’t depend on similarity to you, they depend on your perception (or projection) of similarity to you.
I have been asking a similar question for a long time. It resembles the standard problem that arises when we deny regularity: do we get regular irregularity or irregular irregularity? That is, at what level are we denying the phenomenon, and only at one level?
(Vague, speculative thinking): Is the time element of UDT actually a distraction? Consider the following: agents A and B are in a situation where they’d benefit from cooperation. Unfortunately, the situation is complicated—it’s not like a prisoner’s dilemma, where there’s a clear “cooperate” and a clear “defect” option. Instead they need to take long sequences of actions, and they each have many opportunities to subtly gain an advantage at the other’s expense.
Therefore instead of agreements formulated as “if you do X I’ll do Y”, it’d be far more beneficial for them to make agreements of the form “if you follow the advice of person Z then I will too”. Here person Z needs to be someone that both A and B trust to be highly moral, neutral, competent, etc. Even if there’s some method of defecting that neither of them considered in advance, at the point in time when it arises Z will advise against doing it. (They don’t need to actually have access to Z, they can just model what Z will say.)
If A and B don’t have much communication bandwidth between them (e.g. they’re trying to do acausal coordination) then they will need to choose a Z that’s a clear Schelling point, even if that Z is suboptimal in other ways.
UDT can be seen as the special case where A and B choose Z as follows: “keep forgetting information until you don’t know if you’re A or B”. If A and B are different branches of the same agent, then the easiest way to do this is just to let Z be their last common ancestor. (Coalitional agency can be seen as an implementation of this.) If they’re not, then they’ll also need to coordinate on a way to make sure they’ll forget roughly the same things.
But there are many other ways of picking Schelling Zs. For example, if A and B follow the same religion, then the central figure in that religion (Jesus, Buddha, Mohammad, etc) is a clear Schelling point.
EDIT: Z need not be one person, it could be a group of people. E.g. in the UDT case, if there are several different orders in which A and B could potentially forget information, then they could just do all of them and then follow the aggregated advice of the resulting council. Similarly, even if A and B aren’t of the same religion, they could agree to follow whatever compromise their respective religions’ central figures would have come to.
EDIT 2: UDT is usually prone to commitment races because it thinks of each agent in a conflict as separately making commitments earlier in logical time. But focusing on symmetric commitments gets rid of this problem.
Most discussion of updatelessness suggests that Z is an agent similar to A and B, and also that it’s a policy whose implications are transparent to A and B. I think an important case has Z quite unlike A or B, possibly much smaller and more legible than them. And it can still be an agent in its own right, capable of eventually growing stronger than A or B were at the time Z was initially formulated. By a growing/developing Z I mean something that exists in coordination through both of these places, rather than splitting into a version of Z at A, and a version of Z at B, losing touch with each other.
Such a Z might be thought of as an environmental agent that A and B create near them, equipped to keep in contact with its alternative instance (knowing enough about both its instance near A and its instance near B), rather than specifically a commitment of A or B, or a replacement of A or B, or a result of merging A and B. The commitment of A and B is then to future interactions with Z, which the updateless/coordinated core of Z should be sufficiently aware of to plan for.
I think Nesov had some similar idea about “agents deferring to a (logically) far-away algorithm-contract Z to avoid miscoordination”, although I never understood it completely, nor think that idea can solve miscoordination in the abstract (only, possibly, be a nice pragmatic way to bootstrap coordination from agents who are already sufficiently nice).
EDIT 2: UDT is usually prone to commitment races because it thinks of each agent in a conflict as separately making commitments earlier in logical time. But focusing on symmetric commitments gets rid of this problem.
Hate to always be that guy, but if you are assuming all agents will only engage in symmetric commitments, then you are assuming commitment races away. In actuality, it is possible for a (meta-) commitment race to happen about “whether I only engage in symmetric commitments”.
nor think that idea can solve miscoordination in the abstract (only, possibly, be a nice pragmatic way to bootstrap coordination from agents who are already sufficiently nice).
The central question to my mind is principles of establishing coordination between different situations/agents, and contracts is a framing for what coordination might look like once established. Agentic contracts have the additional benefit of maintaining coordination across their instances once it’s established initially. Coordination theory should clarify how agents should think about establishing coordination with each other, how they should construct these contracts.
This is not about niceness/cooperation. For example I think it should be possible to understand a transformer as being in coordination with the world through patterns in the world and circuits in the transformer, so that coordination gets established through learning. Beliefs are contracts between a mind and its object of study, essential tools the mind has for controlling it. Consequentialist control is a special case of coordination in this sense, and I think one problem with decision theories is that they are usually overly concerned with remaining close to consequentialist framing.
[Epistemic status: rough speculation, feels novel to me, though Wei Dai probably already posted about it 15 years ago.]
UDT is (roughly) defined as “follow whatever commitments a past version of yourself would have made if they’d thought about your situation”. But this means that any UDT agent is only as robust to adversarial attacks as their most vulnerable past self. Specifically, it creates an incentive for adversaries to show UDT agents situations that would trick their past selves into making unwise commitments. It also creates incentives for UDT agents themselves to hack their past selves, in order to artificially create commitments that “took effect” arbitrarily far back in their past.
In some sense, then, I think UDT might have a parallel structure to the overall alignment problem. You have dumber past agents who don’t understand most of what’s going on. You have smarter present agents who have trouble cooperating, because they know too much. The smarter agents may try to cooperate by punting to “Schelling point” dumb agents. (But this faces many of the standard problems of dumb agents making decisions—e.g. the commitments they make will probably be inconsistent or incoherent in various ways. And so in fact you need the smarter agents to interpret the dumb agents’ commitments, which then gets rid of a bunch of the value of punting it to those dumb agents in the first place.)
You also have the problem that the dumb agents will have situational awareness, and may recognize that their interests have diverged from the interests of the smart agents.
But this also suggests that a “solution” to UDT and a solution to alignment might have roughly the same type signature: a spotlighted structure for decision-making procedures that incorporate the interests of both dumb and smart agents. Even when they have disparate interests, the dumb agents would benefit from getting any decision-making power, and the smart agents would benefit from being able to use the dumb agents as Schelling points to cooperate around.
The smart agents could always refactor the dumb agents and construct new Schelling points if they wanted to, but that would cost them a lot of time and effort, because coordination is hard, and the existing coordination edifice has been built around these particular dumb agents. (Analogously, you could refactor out a bunch of childhood ideals and memories from your current self, but mostly you don’t want to, because they constitute the fabric from which your identity has been constructed.)
To be clear, this isn’t meant to be an argument that ASIs which don’t like us at all will keep us around. That seems unlikely either way. But it could be an argument that ASIs which kinda like us a little bit will keep us around—that it might not be incredibly unnatural for them to do so, because their whole cognitive structure will incorporate the opinions and values of dumber agents by default.
UDT is (roughly) defined as “follow whatever commitments a past version of yourself would have made if they’d thought about your situation”.
This seems substantially different from UDT, which does not really have or use a notion of “past version of yourself”. For example imagine a variant of Counterfactual Mugging in which there is no preexisting agent, and instead Omega creates an agent from scratch after flipping the coin and gives it the decision problem. UDT is fine with this but “follow whatever commitments a past version of yourself would have made if they’d thought about your situation” wouldn’t work.
I recall that I described “exceptionless decision theory” or XDT as “do what my creator would want me to do”, which seems closer to your idea. I don’t think I followed up the idea beyond this, maybe because I realized that humans aren’t running any formal decision theory, so “what my creator would want me to do” is ill defined. (Although one could say my interest in metaphilosophy is related to this, since what I would want an AI to do is to solve normative decision theory using correct philosophical reasoning, and then do what it recommends.)
Anyway, the upshot is that I think you’re exploring a decision theory approach that’s pretty distinct from UDT so it’s probably a good idea to call it something else. (However there may be something similar in the academic literature, or someone described something similar on LW that I’m not familiar with or forgot.)
This seems substantially different from UDT, which does not really have or use a notion of “past version of yourself”.
My terminology here was sloppy, apologies. When I say “past versions of yourself” I am also including (as Nesov phrases it below) “the idealized past agent (which doesn’t physically exist)”. E.g. in the Counterfactual Mugging case you describe, I am thinking about precommitments that the hypothetical past version of yourself from before the coin was flipped would have committed to.
I find it a more intuitive way to think about UDT, though I realize it’s a somewhat different framing from yours. Do you still think this is substantially different?
UDT never got past the setting of unchanging preferences, so the present agent blindly defers to all decisions of the idealized past agent (which doesn’t physically exist). And if the past agent doesn’t try to wade in the murky waters of logical updatelessness, it’s not really dumber or more fallible to trickery, it can see everything the way a universal Turing machine or Solomonoff induction can “see everything”. Coordinating agents with different values was instead explored under the heading of Prisoner’s Dilemma. Though a synthesis between coordination of agents with different values and UDT (recognizing Schelling point contracts as a central construction) is long overdue.
if the past agent doesn’t try to wade in the murky waters of logical updatelessness, it’s not really dumber or more fallible to trickery, it can see everything the way a universal Turing machine or Solomonoff induction can “see everything”.
I actually think it might still be more fallible, for a couple of reasons.
Firstly, consider an agent which, at time T, respects all commitments it would have made at times up to T. Now if you’re trying to attack the agent at time T, you have T different versions of it that you can attack, and if any of them makes a dumb commitment then you win.
I guess you could account for this by just gradually increasing the threshold for making commitments over time, though.
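To put rough numbers on the first worry and the proposed fix, here is a minimal sketch. It assumes each past version independently has some small chance of making an exploitable commitment, which is my simplification and not something claimed above; the specific probabilities are made up.

```python
def p_any_bad_commitment(per_version_probs):
    """Chance that at least one past version makes an exploitable commitment,
    assuming the versions fail independently of each other."""
    p_all_safe = 1.0
    for p in per_version_probs:
        p_all_safe *= 1.0 - p
    return 1.0 - p_all_safe

T = 100

# Same commitment threshold at every time step: each version is exploitable with p = 0.01.
constant_threshold = [0.01] * T

# Threshold rising over time (hypothetical schedule): version t is exploitable with p = 0.01 / t^2.
rising_threshold = [0.01 / (t * t) for t in range(1, T + 1)]

print(p_any_bad_commitment(constant_threshold))
print(p_any_bad_commitment(rising_threshold))
```

With a constant threshold the attacker's chance grows towards 1 as T grows. With a threshold that rises fast enough for the per-version probabilities to have a finite sum, it stays bounded no matter how many versions there are.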
Secondly: the further back you go, the more farsighted the past agent needs to be about the consequences of its commitments. If you have any compounding mistakes in the way it expects things to play out, then it’ll just get worse and worse the further back you defer.
Again, I guess you could account for this by having a higher threshold for making commitments which you expect to benefit you further down the line.
Then, re logical updatelessness: it feels like in the long term we need to unify logical + empirical updates, because they’re roughly the same type of thinking. Murky waters perhaps, but necessary ones.
Though a synthesis between coordination of agents with different values and UDT (recognizing Schelling point contracts as a central construction) is long overdue.
Yeah, so what could this look like? I think one important idea is that you don’t have to be deferring to your past self, it’s just that your past self is the clearest Schelling point. But it wouldn’t be crazy for me to, say, use BDT: Buddha Decision Theory, in which I obey all commitments that the Buddha would have made for me if he’d been told about my situation. The differences between me using UDT and BDT (tentatively) seem only quantitative to me, not qualitative. BDT makes it harder for me to cooperate with hypothetical copies of myself who hadn’t yet thought of BDT (because “Buddha” is less of a Schelling point amongst copies of myself than “past Richard”). It also makes me worse off than UDT in some cases, because sometimes the Buddha would make commitments in favor of his interests, not mine. But it also makes it a bit easier for me to cooperate with others, who might also converge to BDT.
At this point I’m starting to suspect that solving UDT 2 is not just alignment-complete, it’s also politics- and sociology-complete. The real question is whether we can isolate toy examples or problems in which these ideas can be formalized, rather than just having them remain vague “what if everyone got along” speculation.
UDT doesn’t do multistage commitments, it has a single all-powerful “past” version that looks into all possible futures before pronouncing a global policy that all of them would then follow. This policy is not a collection of commitments in a reasonable informal sense, it’s literally all details of behavior of future versions of the agent in response to all possible observations. In case of logical updatelessness, also in response to all possible observations of computational facts. (UDT for the idealized past version defines a single master model, future versions are just passively running inference from the contexts of their particular situations.)
The convergent idea for acausal coordination between systems A and B seems to be constructing a shared subagent C whose instances exist as part of both A and B (after A and B successfully both construct the same C, not before), so that C can then act within them in the style of FDT, though really it’s mostly about C thinking of the effects of its behavior in terms of “I am an algorithm” rather than “I am a physical object”. (For UDT, the shared subagent C is the idealized common past version of its different possible future versions A and B. This assumes that A and B already have a lot in common, so maybe C is instead Buddha.)
A bulk of the blind alleys seem to be about allowing subagents various superpowers, instead of focusing on managing the fallout of making them small and bounded (but possibly more plentiful). I think this is where investigations into logical updatelessness go wrong. It does need solving, but not by considering some fact unknown globally, or even at certain logical times. Instead a fact can remain unknown to some small subagent, and can be observed by it at some point, or computed by another subagent. Values are also knowledge, so sufficiently small subagents shouldn’t even by default know full values of the larger system, and should be prepared to learn more about them. This is a consideration that doesn’t even depend on there initially being multiple big agents with different values.
Another point is that coordination doesn’t necessarily need construction of exactly the same shared subagent, or it doesn’t need to be “exactly the same” in a straightforward sense, which the results on coordination in PD illustrate. The role of subagents in this case is that A can create a subagent CA, while B creates a subagent CB. And even where A and B remain intractable for each other, CA and CB can be much smaller and by construction prepared to coordinate with each other, from within A and B. (It seems natural for the big agents to treat such subagents as something like their copies of an assurance contract, which is signed through commitment to give them influence over the big agent’s thinking or behavior. And letting contracts be agents in their own right gives a lot of flexibility in coordination they can arrange.)
Okay, so trying to combine Prisoner’s dilemma and UDT, we get: A and B are in a prisoner’s dilemma. Suppose they have a list of N agents (which include, say, A’s past self, B’s past self, the Buddha, etc), and they each must commit to following one of those agents’ instructions. Each of them estimates: “conditional on me committing to listen to agent K, here’s a distribution over which agent they’d commit to listen to”. And then you maximize expected value based on that.
Okay, but why isn’t this exactly the same as them just thinking to themselves “conditional on me taking action K, here’s the distribution over their actions” for each of N actions they could take, and then maximizing expected value? It feels like the difference is that it’s really hard to actually reason about the correlations between my low-level actions and your low-level actions, whereas it might be easier to reason about the correlations between my high-level commitments and your high-level commitments.
I.e. the role of the Buddha in this situation is just to make the acausal coordination here much easier.
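Here is a rough sketch of that calculation from A's point of view. Everything specific is an assumption for illustration: the payoff numbers, the idea that each candidate agent gives one fixed piece of advice to whoever follows them, and A's conditional beliefs about which agent B will commit to.

```python
# A's payoff in a standard prisoner's dilemma, indexed by (A's action, B's action).
PAYOFF_A = {("C", "C"): 3.0, ("C", "D"): 0.0, ("D", "C"): 5.0, ("D", "D"): 1.0}

# Hypothetical advice each candidate agent gives to whoever commits to following them.
ADVICE = {"A_past": "D", "B_past": "D", "buddha": "C"}

# A's estimate: conditional on A committing to agent K, a distribution over
# which agent B commits to. These numbers are invented.
BELIEF = {
    "A_past": {"A_past": 0.0, "B_past": 0.9, "buddha": 0.1},
    "B_past": {"A_past": 0.0, "B_past": 0.9, "buddha": 0.1},
    "buddha": {"A_past": 0.0, "B_past": 0.2, "buddha": 0.8},
}

def expected_value(k: str) -> float:
    """A's expected payoff from committing to follow agent k."""
    return sum(
        prob * PAYOFF_A[(ADVICE[k], ADVICE[j])]
        for j, prob in BELIEF[k].items()
    )

best = max(ADVICE, key=expected_value)
for k in ADVICE:
    print(k, expected_value(k))
print("commit to:", best)
```

The structure is the same as ordinary conditional expected value over actions. The only change is that the correlation being estimated is between high-level commitments, which is the part that might be easier to reason about.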
Okay, but why isn’t this exactly the same as them just thinking to themselves “conditional on me taking action K, here’s the distribution over their actions” for each of N actions they could take, and then maximizing expected value?
The main trick with PD is that instead of an agent only having two possible actions C and D, we consider many programs the agent might self-modify into (commit to becoming) that each might in the end compute C or D. This effectively changes the action space, there are now many more possible actions. And these programs/actions can be given access (like quines, by their own construction) to initial source code of all the agents, allowed to reason about them. But then programs have logical uncertainty about how they in the end behave, so the things you’d be enumerating don’t immediately cash out in expected values. And these programs can decide to cause different expected values depending on what you’ll do with their behavior, anticipate how you reason about them through reasoning about you in turn. It’s hard to find clear arguments for why any particular desirable thing could happen as a result of this setup.
UDT is notable for being one way of making this work. The “open source game theory” of PD (through Löb’s theorem, modal fixpoints, Payor’s lemma) pinpoints some cases where it’s possible to say that we get cooperation in PD. But in general it’s proven difficult to say anything both meaningful and flexible about this seemingly in-broad-strokes-inevitable setup, in particular for agents with different values that are doing more general things than playing PD.
When both A and B consider listening to a shared subagent C, subagent C is itself considering what it should be doing, depending on what A and B do with C’s behavior. So for example with A there are two stages of computation to consider: first, it was A and didn’t yet decide to sign the contract, then it became a composite system P(C), where P is A’s policy for giving influence to C’s behavior (possibly P and A include a larger part of the world where the first agent exists, not just the agent itself). The commitment of A is to the truth of the equality A=P(C), which gives C influence over the computational consequences of A in the particular shape P. The trick with the logical time of this process is that C should be able to know (something about) P updatelessly, without being shown observations of what it is, so that the instance of C within B would also know of P and be able to take it into account in choosing its joint policy that acts both through A and B. (Of course, the same is happening within B.)
This sketch frames decision making without directly appealing to consequentialism. Here, A controls B through the incentives P it creates for C (a particular way in which C gets to project influence from A’s place in the world), where C also has influence over B. So A doesn’t seek to manipulate B directly by considering the consequences for B’s behavior of various ways that A might behave.
It seems to me that Eliezer overrates the concept of a simple core of general intelligence, whereas Paul underrates it. Or, alternatively: it feels like Eliezer is leaning too heavily on the example of humans, and Paul is leaning too heavily on evidence from existing ML systems which don’t generalise very well.
I don’t think this is a particularly insightful or novel view, but it seems worth explicitly highlighting that you don’t have to side with one worldview or the other when evaluating the debates between them. (Although I’d caution not to just average their two views—instead, try to identify Eliezer’s best arguments, and Paul’s best arguments, and reconcile them.)
I’ve been reading Eliezer’s recent stories with protagonists from dath ilan (his fictional utopia). Partly due to the style, I found myself bouncing off a lot of the interesting claims that he made (although it still helped give me a feel for his overall worldview). The part I found most useful was this page about the history of dath ilan, which can be read without much background context. I’m referring mostly to the exposition on the first 2⁄3 of the page, although the rest of the story from there is also interesting. One key quote from the remainder of the story:
“The next most critical fact about Earth is that from a dath ilani perspective their civilization is made entirely out of coordination failure. Coordination that fails on every scale recursively, where uncoordinated individuals assemble into groups that don’t express their preferences, and then those groups also fail to coordinate with each other, forming governments that offend all of their component factions, which governments then close off their borders from other governments. The entirety of Earth is one gigantic failure fractal. It’s so far below the multi-agent-optimal-boundary, only their professional economists have a five-syllable phrase for describing what a ‘Pareto frontier’ is, since they’ve never seen one in real life. Individuals sort of act in locally optimal equilibrium with their local incentives, but all of the local incentives are weird and insane, meaning that the local best strategy is also insane from any larger perspective. I cannot overemphasize how much you cannot predict Earth by reasoning that most features will have already been optimized into a not-much-further-improvable equilibrium. The closest thing you can do to optimality-based analysis is to think in terms of individually incentive-following responses to incredibly weird local situations. And the weird local situations cannot themselves be derived from first principles, because they are the bizarrely harmful equilibria of other weird incentives in other parts of the system. Or at least I can’t derive the weird situations from first principles, after two years of exposure and getting over the shock and trying to adapt. I would’ve been much better off if I’d tried to understand it as an alien society instead of a human one, in retrospect; and I expect the same would hold for an Earthling trying to understand dath ilan.”
My main update is that Eliezer has a very deep-rooted belief that the world is Lawful, in that it makes sense to talk about real-world intelligence, coordination, ethics, etc, as (very imperfect) approximations to their idealised mathematically-definable forms. (Note though that these are conclusions I’ve extrapolated from his fiction, which is a fairly unreliable method of inferring people’s beliefs.)
I’d say lots of other things he’s said support that update. Stuff about how your model of the world will be accurate if and only if you somehow approximate Bayes’ law, for example.
The dath ilan based fiction definitely helped me internalize the idea better though.
A tension that keeps recurring when I think about philosophy is between the “view from nowhere” and the “view from somewhere”, i.e. a third-person versus first-person perspective—especially when thinking about anthropics.
One version of the view from nowhere says that there’s some “objective” way of assigning measure to universes (or people within those universes, or person-moments). You should expect to end up in different possible situations in proportion to how much measure your instances in those situations have. For example, UDASSA ascribes measure based on the simplicity of the computation that outputs your experience.
One version of the view from somewhere says that the way you assign measure across different instances should depend on your values. You should act as if you expect to end up in different possible future situations in proportion to how much power to implement your values the instances in each of those situations has. I’ll call this the ADT approach, because that seems like the core insight of Anthropic Decision Theory. Wei Dai also discusses it here.
In some sense each of these views makes a prediction. UDASSA predicts that we live in a universe with laws of physics that are very simple to specify (even if they’re computationally expensive to run), which seems to be true. Meanwhile the ADT approach “predicts” that we find ourselves at an unusually pivotal point in history, which also seems true.
Intuitively I want to say “yeah, but if I keep predicting that I will end up in more and more pivotal places, eventually that will be falsified”. But... on a personal level, this hasn’t actually been falsified yet. And more generally, acting on those predictions can still be positive in expectation even if they almost surely end up being falsified. It’s a St Petersburg paradox, basically.
Very speculatively, then, maybe a way to reconcile the view from somewhere and the view from nowhere is via something like geometric rationality, which avoids St Petersburg paradoxes. And more generally, it feels like there’s some kind of multi-agent perspective which says I shouldn’t model all these copies of myself as acting in unison, but rather as optimizing for some compromise between all their different goals (which can differ even if they’re identical, because of indexicality). No strong conclusions here but I want to keep playing around with some of these ideas (which were inspired by a call with @zhukeepa).
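For concreteness, here is the standard St Petersburg game (payoff 2^k with probability 1/2^k), truncated after a fixed number of rounds. The sketch compares the ordinary expectation with the geometric expectation exp(E[log X]), which I am assuming is the relevant quantity on the geometric rationality side; the truncation simply drops the remaining tail.

```python
import math

def arithmetic_expectation(n_rounds: int) -> float:
    """Sum of probability * payoff over the first n_rounds outcomes."""
    return sum((0.5 ** k) * (2 ** k) for k in range(1, n_rounds + 1))

def geometric_expectation(n_rounds: int) -> float:
    """exp of the sum of probability * log(payoff) over the first n_rounds outcomes."""
    expected_log = sum((0.5 ** k) * math.log(2 ** k) for k in range(1, n_rounds + 1))
    return math.exp(expected_log)

for n in (10, 20, 50):
    print(n, arithmetic_expectation(n), geometric_expectation(n))
```

Each extra round adds exactly 1 to the arithmetic expectation, so it grows without bound, while the geometric expectation settles near 4. That is the sense in which a geometric criterion is not dominated by ever more unlikely, ever more pivotal outcomes.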
This was all kinda rambly but I think I can summarize it as “Isn’t it weird that ADT tells us that we should act as if we’ll end up in unusually important places, and also we do seem to be in an incredibly unusually important place in the universe? I don’t have a story for why these things are related but it does seem like a suspicious coincidence.”
Very interesting. It sounds like your “third person view from nowhere” vs the “first person view from somewhere” is very similar to something I was thinking about recently. I called them “objectively distinct situations” in contrast with “subjectively distinct situations”. My view is that most of the anthropic arguments that “feel wrong” to me are built on trying to make me assign equal probability to all subjectively distinct scenarios, rather than objective ones. E.g. a replication machine makes it so there are two of me, then “I” could be either of them, leaving two subjectively distinct cases, even if on the object level there is actually no distinction between “me” being clone A or clone B. [1]
I am very sceptical of this ADT. If you think the time/place you have ended up is unusually important I think that is more likely explained by something like “people decide what is important based on what is going on around them”.
This was all kinda rambly but I think I can summarize it as “Isn’t it weird that ADT tells us that we should act as if we’ll end up in unusually important places, and also we do seem to be in an incredibly unusually important place in the universe? I don’t have a story for why these things are related but it does seem like a suspicious coincidence.”
I’m not sure this is a valid interpretation of ADT. Can you say more about why you interpret ADT this way, maybe with an example? My own interpretation of how UDT deals with anthropics (and I’m assuming ADT is similar) is “Don’t think about indexical probabilities or subjective anticipation. Just think about measures of things you (considered as an algorithm with certain inputs) have influence over.”
This seems to “work” but anthropics still feels mysterious, i.e., we want an explanation of “why are we who we are / where we’re at” and it’s unsatisfying to “just don’t think about it”. UDASSA does give an explanation of that (but is also unsatisfying because it doesn’t deal with anticipations, and also is disconnected from decision theory).
I would say that under UDASSA, it’s perhaps not super surprising to be when/where we are, because this seems likely to be a highly simulated time/scenario for a number of reasons (curiosity about ancestors, acausal games, getting philosophical ideas from other civilizations).
My own interpretation of how UDT deals with anthropics (and I’m assuming ADT is similar) is “Don’t think about indexical probabilities or subjective anticipation. Just think about measures of things you (considered as an algorithm with certain inputs) have influence over.”
(Speculative paragraph, quite plausibly this is just nonsense.) Suppose you have copies A and B who are both offered the same bet on whether they’re A. One way you could make this decision is to assign measure to A and B, then figure out what the marginal utility of money is for each of A and B, then maximize measure-weighted utility. Another way you could make this decision, though, is just to say “the indexical probability I assign to ending up as each of A and B is proportional to their marginal utility of money” and then maximize your expected money. Intuitively this feels super weird and unjustified, but it does make the “prediction” that we’d find ourselves in a place with high marginal utility of money, as we currently do.
(Of course “money” is not crucial here, you could have the same bet with “time” or any other resource that can be compared across worlds.)
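A minimal sketch of the two procedures in the speculative paragraph above, to show that they pick the same bets. Everything here is hypothetical: the function names, the numbers, and in particular the choice to make the "indexical probability" proportional to measure times marginal utility (with equal measures this reduces to "proportional to marginal utility", as stated above).

```python
def accept_by_weighted_utility(measure, marginal_utility, payoff):
    """Procedure 1: sum over copies of measure * marginal utility * money payoff."""
    total = sum(measure[c] * marginal_utility[c] * payoff[c] for c in payoff)
    return total > 0


def indexical_probabilities(measure, marginal_utility):
    """Procedure 2, step 1 (assumption): P(I am c) is proportional to
    measure[c] * marginal_utility[c]."""
    weights = {c: measure[c] * marginal_utility[c] for c in measure}
    norm = sum(weights.values())
    return {c: w / norm for c, w in weights.items()}


def accept_by_expected_money(measure, marginal_utility, payoff):
    """Procedure 2, step 2: maximise expected money under those probabilities."""
    probs = indexical_probabilities(measure, marginal_utility)
    return sum(probs[c] * payoff[c] for c in payoff) > 0


# Hypothetical numbers: equal measure, but money matters 3x more to A.
measure = {"A": 0.5, "B": 0.5}
marginal_utility = {"A": 3.0, "B": 1.0}
# The bet "I am A": win 1 unit if you are A, lose 2 units if you are B.
payoff = {"A": 1.0, "B": -2.0}

print(accept_by_weighted_utility(measure, marginal_utility, payoff))
print(accept_by_expected_money(measure, marginal_utility, payoff))
print(indexical_probabilities(measure, marginal_utility))
```

The two procedures differ only by a positive normalising constant, so they always agree on which bets to take; the second one just relabels "measure times marginal utility" as a probability.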
I would say that under UDASSA, it’s perhaps not super surprising to be when/where we are, because this seems likely to be a highly simulated time/scenario for a number of reasons (curiosity about ancestors, acausal games, getting philosophical ideas from other civilizations).
Fair point. By “acausal games” do you mean a generalization of acausal trade? (Acausal trade is the main reason I’d expect us to be simulated a lot.)
Intuitively this feels super weird and unjustified, but it does make the “prediction” that we’d find ourselves in a place with high marginal utility of money, as we currently do.
This is particularly weird because your indexical probability then depends on what kind of bet you’re offered. In other words, our marginal utility of money differs from our marginal utility of other things, and which one do you use to set your indexical probability? So this seems like a non-starter to me… (ETA: Maybe it changes moment by moment as we consider different decisions, or something like that? But what about when we’re just contemplating a philosophical problem and not trying to make any specific decisions?)
By “acausal games” do you mean a generalization of acausal trade?
Yes, didn’t want to just say “acausal trade” in case threats/war is also a big thing.
This is particularly weird because your indexical probability then depends on what kind of bet you’re offered. In other words, our marginal utility of money differs from our marginal utility of other things, and which one do you use to set your indexical probability? So this seems like a non-starter to me...
It seems pretty weird to me too, but to steelman: why shouldn’t it depend on the type of bet you’re offered? Your indexical probabilities can depend on any other type of observation you have when you open your eyes. E.g. maybe you see blue carpets, and you know that world A is 2x more likely to have blue carpets. And hearing someone say “and the bet is denominated in money not time” could maybe update you in an analogous way.
I mostly offer this in the spirit of “here’s the only way I can see to reconcile subjective anticipation with UDT at all”, not “here’s something which makes any sense mechanistically or which I can justify on intuitive grounds”.
I added this to my comment just before I saw your reply: Maybe it changes moment by moment as we consider different decisions, or something like that? But what about when we’re just contemplating a philosophical problem and not trying to make any specific decisions?
I mostly offer this in the spirit of “here’s the only way I can see to reconcile subjective anticipation with UDT at all”, not “here’s something which makes any sense mechanistically or which I can justify on intuitive grounds”.
Ah I see. I think this is incomplete even for that purpose, because “subjective anticipation” to me also includes “I currently see X, what should I expect to see in the future?” and not just “What should I expect to see, unconditionally?” (See the link earlier about UDASSA not dealing with subjective anticipation.)
ETA: Currently I’m basically thinking: use UDT for making decisions, use UDASSA for unconditional subjective anticipation, am confused about conditional subjective anticipation as well as how UDT and UDASSA are disconnected from each other (i.e., the subjective anticipation from UDASSA not feeding into decision making). Would love to improve upon this, but your idea currently feels worse than this...
In a bayesian rationalist view of the world, we assign probabilities to statements based on how likely we think they are to be true. But truth is a matter of degree, as Asimov points out. In other words, all models are wrong, but some are less wrong than others.
Consider, for example, the claim that evolution selects for reproductive fitness. Well, this is mostly true, but there’s also sometimes group selection, and the claim doesn’t distinguish between a gene-level view and an individual-level view, and so on...
So just assigning it a single probability seems inadequate. Instead, we could assign a probability distribution over its degree of correctness. But because degree of correctness is such a fuzzy concept, it’d be pretty hard to connect this distribution back to observations.
Or perhaps the distinction between truth and falsehood is sufficiently clear-cut in most everyday situations for this not to be a problem. But questions about complex systems (including, say, human thoughts and emotions) are messy enough that I expect the difference between “mostly true” and “entirely true” to often be significant.
Has this been discussed before? Given Less Wrong’s name, I’d be surprised if not, but I don’t think I’ve stumbled across it.
This feels generally related to the problems covered in Scott and Abram’s research over the past few years. One of the sentences that stuck out to me the most was (roughly paraphrased since I don’t want to look it up):
In order to be a proper bayesian agent, a single hypothesis you formulate is as big and complicated as a full universe that includes yourself
I.e. our current formulations of bayesianism like Solomonoff induction only formulate the idea of a hypothesis at such a low level that even trying to think about a single hypothesis rigorously is basically impossible with bounded computational time. So in order to actually think about anything you have to somehow move beyond naive bayesianism.
This seems reasonable, thanks. But I note that “in order to actually think about anything you have to somehow move beyond naive bayesianism” is a very strong criticism. Does this invalidate everything that has been said about using naive bayesianism in the real world? E.g. every instance where Eliezer says “be bayesian”.
One possible answer is “no, because logical induction fixes the problem”. My uninformed guess is that this doesn’t work because there are comparable problems with applying to the real world. But if this is your answer, follow-up question: before we knew about logical induction, were the injunctions to “be bayesian” justified?
(Also, for historical reasons, I’d be interested in knowing when you started believing this.)
I think it definitely changed a bunch of stuff for me, and does at least a bit invalidate some of the things that Eliezer said, though not actually very much.
In most of his writing Eliezer used bayesianism as an ideal that was obviously unachievable, but that still gives you a rough sense of what the actual limits of cognition are, and rules out a bunch of methods of cognition as being clearly in conflict with that theoretical ideal. I did definitely get confused for a while and tried to apply Bayes to everything directly, and then felt bad when I couldn’t actually apply Bayes’ theorem in some situations, which I now realize is because those tended to be problems where embeddedness or logical uncertainty mattered a lot.
My shift on this happened over the last 2-3 years or so. I think starting with Embedded Agency, but maybe a bit before that.
rules out a bunch of methods of cognition as being clearly in conflict with that theoretical ideal
Which ones? In Against Strong Bayesianism I give a long list of methods of cognition that are clearly in conflict with the theoretical ideal, but in practice are obviously fine. So I’m not sure how we distinguish what’s ruled out from what isn’t.
which I now realize is because those tended to be problems where embeddedness or logical uncertainty mattered a lot
Can you give an example of a real-world problem where logical uncertainty doesn’t matter a lot, given that without logical uncertainty, we’d have solved all of mathematics and considered all the best possible theories in every other domain?
I think in practice there are lots of situations where you can confidently create a kind of pocket-universe where you can actually consider hypotheses in a bayesian way.
Concrete example: Trying to figure out who voted a specific way on a LW post. You can condition pretty cleanly on vote-strength, and treat people’s votes as roughly independent, so if you have guesses on how different people are likely to vote, it’s pretty easy to create the odds ratios for basically all final karma + vote numbers and then make a final guess based on that.
It’s clear that there is some simplification going on here, by assigning static probabilities for people’s vote behavior, treating them as independent (though modeling some of the dependence wouldn’t be too hard), etc. But overall I expect it to perform pretty well and to give you good answers.
(Note, I haven’t actually done this explicitly, but my guess is my brain is doing something pretty close to this when I do see vote numbers + karma numbers on a thread)
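A rough sketch of what that "pocket-universe" calculation could look like. The voter names, vote strengths, and prior probabilities are all made up, and the model (each person either upvotes with a fixed strength or abstains, independently) is an assumption for illustration, not a description of how LW actually works.

```python
from itertools import product


def posterior_voted(voters, observed_karma, observed_votes):
    """voters maps name -> (vote_strength, prior_probability_of_voting).

    Enumerates every voted/abstained pattern, keeps the ones that match the
    observed karma and vote count, and returns P(person voted | observation).
    """
    names = list(voters)
    total = 0.0
    voted_mass = {name: 0.0 for name in names}
    for pattern in product([False, True], repeat=len(names)):
        karma = sum(voters[n][0] for n, v in zip(names, pattern) if v)
        if karma != observed_karma or sum(pattern) != observed_votes:
            continue  # inconsistent with what we see on the post
        prob = 1.0
        for n, v in zip(names, pattern):
            p = voters[n][1]
            prob *= p if v else (1.0 - p)  # independence assumption
        total += prob
        for n, v in zip(names, pattern):
            if v:
                voted_mass[n] += prob
    if total == 0.0:
        raise ValueError("observation is impossible under this model")
    return {n: voted_mass[n] / total for n in names}


# Hypothetical candidates: (vote strength, prior chance they voted).
voters = {
    "alice": (2, 0.5),
    "bob": (1, 0.8),
    "carol": (1, 0.2),
    "dave": (3, 0.1),
}
posterior = posterior_voted(voters, observed_karma=3, observed_votes=2)
for name, p in posterior.items():
    print(name, round(p, 3))
```

Exhaustive enumeration is exponential in the number of candidate voters, so this only works for small pools; it is meant to show the shape of the reasoning, not to scale.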
So I’m not sure how we distinguish what’s ruled out from what isn’t.
Well, it’s obvious that anything that claims to be better than the ideal bayesian update is clearly ruled out. I.e. arguments that by writing really good explanations of a phenomenon you can get to a perfect understanding. Or arguments that you can derive the rules of physics from first principles.
There are also lots of hypotheticals where you do get to just use Bayes properly and then it provides very strong bounds on the ideal approach. There are a good number of implicit models behind lots of standard statistics models that when put into a bayesian framework give rise to a more general formulation. See the Wikipedia article for “Bayesian interpretations of regression” for a number of examples.
Of course, in reality it is always unclear whether the assumptions that give rise to various regression methods actually hold, but I think you can totally say things like “given these assumptions, the bayesian solution is the ideal one, and you can’t perform better than this, and if you put in the computational effort you will actually achieve this performance”.
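One concrete instance of this, as a sketch: for a one-parameter linear model with Gaussian noise and a zero-mean Gaussian prior on the slope, the posterior mode coincides with the ridge regression estimate when the penalty is set to noise variance divided by prior variance. The data and variances below are made up, and the grid search is only there to check the closed form numerically.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for y ~ w * x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def log_posterior(w, xs, ys, noise_var, prior_var):
    """Unnormalised log posterior: Gaussian likelihood plus Gaussian prior on w."""
    log_lik = -sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * noise_var)
    log_prior = -(w * w) / (2 * prior_var)
    return log_lik + log_prior


def map_slope_by_grid(xs, ys, noise_var, prior_var, lo=-5.0, hi=5.0, steps=100001):
    """Brute-force posterior mode over an evenly spaced grid of slopes."""
    best_w, best_lp = lo, float("-inf")
    for i in range(steps):
        w = lo + (hi - lo) * i / (steps - 1)
        lp = log_posterior(w, xs, ys, noise_var, prior_var)
        if lp > best_lp:
            best_w, best_lp = w, lp
    return best_w


# Hypothetical data and variances.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]
noise_var, prior_var = 1.0, 0.5

print(ridge_slope(xs, ys, lam=noise_var / prior_var))
print(map_slope_by_grid(xs, ys, noise_var, prior_var))
```

The two printed values should agree up to the grid spacing, which is the sense in which the regularised estimator is "the bayesian solution given these assumptions".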
Hmmm, but what does this give us? He talks about the difference between vague theories and technical theories, but then says that we can use a scoring rule to change the probabilities we assign to each type of theory.
But my question is still: when you increase your credence in a vague theory, what are you increasing your credence about? That the theory is true?
Nor can we say that it’s about picking the “best theory” out of the ones we have, since different theories may overlap partially.
If we can quantify how good a theory is at making accurate predictions (or rather, quantify a combination of accuracy and simplicity), that gives us a sense in which some theories are “better” (less wrong) than others, without needing theories to be “true”.
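One possible way to make "a combination of accuracy and simplicity" concrete, offered as an assumption rather than anything proposed in this thread: score a theory by the log probability it assigned to what actually happened, minus a penalty for how many bits it takes to state. The theories, probabilities, and bit counts below are invented for illustration.

```python
import math


def log_score(predicted_probs, outcomes):
    """Sum of log2 probabilities assigned to the observed outcomes (higher is better)."""
    total = 0.0
    for p, happened in zip(predicted_probs, outcomes):
        total += math.log2(p if happened else 1.0 - p)
    return total


def theory_quality(predicted_probs, outcomes, description_bits):
    """Accuracy minus a simplicity penalty (hypothetical description length in bits)."""
    return log_score(predicted_probs, outcomes) - description_bits


outcomes = [True, True, False, True]

# A vague theory hedges everything at 50%; it is short to state.
vague = theory_quality([0.5, 0.5, 0.5, 0.5], outcomes, description_bits=2)
# A technical theory makes sharp predictions that mostly come out right.
technical = theory_quality([0.9, 0.9, 0.1, 0.9], outcomes, description_bits=5)
# A technical theory that is sharply wrong is punished heavily.
wrong = theory_quality([0.1, 0.1, 0.9, 0.1], outcomes, description_bits=5)

print(vague, technical, wrong)
```

Nothing in this score requires any theory to be "true"; it only ranks theories against each other on the observations seen so far.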
Oracle-genie-sovereign is a really useful distinction that I think I (and probably many others) have avoided using mainly because “genie” sounds unprofessional/unacademic. This is a real shame, and a good lesson for future terminology.
After rereading the chapter in Superintelligence, it seems to me that “genie” captures something akin to act-based agents. Do you think that’s the main way to use this concept in the current state of the field, or do you have other applications in mind?
Ah, yeah, that’s a great point. Although I think act-based agents is a pretty bad name, since those agents may often carry out a whole bunch of acts in a row—in fact, I think that’s what made me overlook the fact that it’s pointing at the right concept. So not sure if I’m comfortable using it going forward, but thanks for pointing that out.
Perhaps the lesson is that terminology that is acceptable in one field (in this case philosophy) might not be suitable in another (in this case machine learning).
I don’t think that even philosophers take the “genie” terminology very seriously. I think the more general lesson is something like: it’s particularly important to spend your weirdness points wisely when you want others to copy you, because they may be less willing to spend weirdness points.
People sometimes try to reason about the likelihood of deceptive alignment by appealing to speed priors and simplicity priors. I don’t like such appeals, because I think that the differences between aligned and deceptive AGIs will likely be a very small proportion of the total space/time complexity of an AGI. More specifically:
1. If AGIs had to rederive deceptive alignment in every episode, that would make a big speed difference. But presumably, after thinking about it a few times during training, they will remember their conclusions for a while, and bring them to mind in whichever episodes they’re relevant. So the speed cost of deception will be amortized across the (likely very long) training period.
2. AGIs will represent a huge number of beliefs and heuristics which inform their actions (e.g. every single fact they know). A heuristic like “when you see X, initiate the world takeover plan” would therefore constitute a very small proportion of the total information represented in the network; it’d be hard to regularize it away without regularizing away most of the AGI’s knowledge.
I think that something like the speed vs simplicity tradeoff is relevant to the likelihood of deceptive alignment, but it needs to be more nuanced. One idea I’ve been playing around with: the tradeoff between conservatism and systematization (as discussed here). An agent that prioritizes conservatism will tend to do the things they’ve previously done. An agent that prioritizes systematization will tend to do the things that are favored by simple arguments.
To illustrate: suppose you have an argument in your head like “if I get a chance to take a 60/40 double-or-nothing bet for all my assets, I should”. Suppose you’ve thought about this a bunch and you’re intellectually convinced of it. Then you’re actually confronted with the situation. Some people will be more conservative, and follow their gut (“I know I said I would, but… this is kinda crazy”). Others (like most utilitarians and rationalists) will be more systematizing (“it makes sense, let’s do it”). Intuitively, you could also think of this as a tradeoff between memorization and generalization; or between a more egalitarian decision-making process (“most of my heuristics say no”) and a more centralized process (“my intellectual parts say yes”). I don’t know how to formalize any of these ideas, but I’d like to try to figure it out.
But presumably, after thinking about it a few times during training, they will remember their conclusions for a while, and bring them to mind in whichever episodes they’re relevant.
Why do you think SGD will do this? Or are you imagining non-SGD mechanisms?
It seems non-obvious to me that this will occur with SGD, though possible.
If AGIs had to rederive deceptive alignment in every episode, that would make a big speed difference. But presumably, after thinking about it a few times during training, they will remember their conclusions for a while, and bring them to mind in whichever episodes they’re relevant. So the speed cost of deception will be amortized across the (likely very long) training period.
You mean this about something trained totally differently than an LLM, no? Because this mechanism seems totally implausible to me otherwise.
Think during the forward pass, learn during the backward pass; if the model uses deceptive reasoning in the forward pass and the gradient says it’s useful for prediction, that seems like the mechanism as described. Thoughts?
So, there are a few different reasons, none of which I’ve formalized to my satisfaction.
I’m curious if these make sense to you.
(1) One is that the actual kinds of reasoning that an LLM can learn in its forward pass are quite limited.
As is well established, for instance, Transformers cannot multiply arbitrarily-long integers in a single forward pass. The number of additions involved in multiplying an N-digit integer increases in an unbounded way with N; thus, a Transformer with a finite number of layers cannot do it. (Example: Prompt GPT-4 for the results of multiplying two 5-digit numbers, specifying not to use a calculator, and see how it does.)
Of course in use you can teach a GPT to use a calculator, but we’re talking about operations that occur in a single forward pass, which rules out using tools. Because of this shallow serial depth, a Transformer also cannot (1) divide arbitrary integers, (2) figure out the results of physical phenomena that have multiplication / division problems embedded in them, (3) figure out the results of arbitrary programs with loops, and so on.
(Note—to be very clear NONE of this is a limitation on what kind of operations we can get a transformer to do over multiple unrollings of the forward pass. You can teach a transformer to use a calculator; or to ask a friend for help; or to use a scratchpad, or whatever. But we need to hide deception in a single forward pass, which is why I’m harping on this.)
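To make the "grows in an unbounded way with N" claim concrete, here is a small sketch that counts the single-digit multiply-and-add steps in ordinary schoolbook multiplication. This is one particular algorithm chosen for illustration; it says nothing rigorous about what a Transformer can or cannot compute, and faster or more parallel multiplication methods exist.

```python
def digits(n):
    """Decimal digits of a non-negative integer, most significant first."""
    return [int(c) for c in str(n)]


def schoolbook_multiply(a_digits, b_digits):
    """Multiply two digit lists the way it is done by hand.

    Returns (product, steps), where steps counts single-digit
    multiply-and-add operations.
    """
    result = [0] * (len(a_digits) + len(b_digits))  # least significant first
    steps = 0
    for i, a in enumerate(reversed(a_digits)):
        carry = 0
        for j, b in enumerate(reversed(b_digits)):
            total = result[i + j] + a * b + carry
            result[i + j] = total % 10
            carry = total // 10
            steps += 1
        result[i + len(b_digits)] += carry
    product = int("".join(str(d) for d in reversed(result)))
    return product, steps


# Step counts for N-digit by N-digit inputs grow like N squared.
for n in (1, 5, 10, 20):
    operand = int("7" * n)
    product, steps = schoolbook_multiply(digits(operand), digits(operand))
    assert product == operand * operand
    print(n, steps)
```

For this algorithm the step count is N times N, so any fixed budget of sequential steps is eventually exceeded as N grows.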
So to think that you learn deception in a forward pass, you have to think that the transformer thinks something like “Hey, if I deceive the user into thinking that I’m a good entity, I’ll be able to later seize power, and if I seize power, then I’ll be able to (do whatever), so, considering all this, I should… predict the next token will be ‘purple’”, and that it thinks this in a context that could NOT come up with the algorithm for multiplication, or for addition, or for any number of other things, even though an algorithm for multiplication would be much much MUCH more directly incentivized by SGD, because it’s directly relevant for token predictions.
(2). Another way to get at the problem with this reasoning, is that I think it hypothesizes an agent within weight updates off the analogical resemblance to an agent that the finished product has. But in fact there’s at most a superficial resemblance between (LLM forward pass) and (repeated LLM forward passes in a Chain-of-thought over text).
That is, an LLM unrolled multiple times, from a given prompt, can make plans; it can plot to seize power, imitating humans who it saw thus plot; it can multiply N-digit integers, working them out just like a human. But this tells us literally nothing about what it can do in a single forward pass.
For comparison, consider a large neural network that is used for image segmentation. The entire physical world falls into the domain of such a model. It can learn that people exist, that dogs exist, and that machinery exists, in some sense. Suppose someone argued that such a neural network, in a single forward pass, used deceptive reasoning that turned out to be useful for prediction because of the backward pass, and that we ought therefore to expect such a neural network, when embedded in some device down the road, to turn and kill us.
The argument is exactly identical to the case of the language model, but no one makes it. And I think the reason is that people think about the properties that a trained LLM can exhibit when unrolled over multiple forward passes, in a particular context and with a particular prompt, and then mistakenly attribute these properties to the single forward pass.
(All of which is to say—look, if you think you can get a deceptive agent from a LLM this way you should also expect a deceptive agent from an image segmentation model. Maybe that’s true! But I’ve never seen anyone say this, which makes me think they’re making the mistake I describe above.)
(3). I think this is just attributing extremely complex machinery to the forward pass of an LLM that is supposed to show up in a data-indifferent manner, and that this is a universally bad bet for ML.
Like, different Transformers store different things depending on the data they’re given. If you train them on SciHub they store a bunch of SciHub shit. If you train them on Wikipedia they store a bunch of Wikipedia shit. In every case, for each weight in the Transformer, you can find specific reasons for each neuron being what it is because of the data.
The “LLM will learn deception” hypothesis amounts to saying that—so long as a LLM is big enough, and trained on enough data to know the world exists—you’ll find complex machinery in it that (1) specifically activates once it figures out that it’s “not in training” and (2) was mostly just hiding until then. My bet is that this won’t show up, because there are no such structures in a Transformer that don’t depend on data. Your French Transformer / English Transformer / Toolformer / etc will not all learn to betray you if they get big enough—we will not find unused complex machinery in a Transformer to betray you because we find NO unused complex machinery in a transformer, etc.
I think an actually well-put together argument will talk about frequency bias and shit, but this is all I feel like typing for now.
Does this make sense? I’m still working on putting it together.
You’ve given me a lot to think about, thanks! Here are my thoughts as I read:
(1) One is that the actual kinds of reasoning that an LLM can learn in its forward pass are quite limited.
As is well established, for instance, Transformers cannot multiply arbitrarily-long integers in a single forward pass. The number of additions involved in multiplying an N-digit integer increases in an unbounded way with N; thus, a Transformer with a finite number of layers cannot do it. (Example: Prompt GPT-4 for the results of multiplying two 5-digit numbers, specifying not to use a calculator, and see how it does.)
Of course in use you can teach a GPT to use a calculator, but we’re talking about operations that occur in a single forward pass, which rules out using tools. Because of this shallow serial depth, a Transformer also cannot (1) divide arbitrary integers, (2) figure out the results of physical phenomena that have multiplication / division problems embedded in them, (3) figure out the results of arbitrary programs with loops, and so on.
...
So to think that you learn deception in a forward pass, you have to think that the transformer thinks something like “Hey, if I deceive the user into thinking that I’m a good entity, I’ll be able to later seize power, and if I seize power, then I’ll be able to (do whatever), so, considering all this, I should… predict the next token will be ‘purple’”, and that it thinks this in a context that could NOT come up with the algorithm for multiplication, or for addition, or for any number of other things, even though an algorithm for multiplication would be much much MUCH more directly incentivized by SGD, because it’s directly relevant for token predictions.
Yes, and the sort of deceptive reasoning I’m worried about sure seems pretty simple, very little serial depth to it. Unlike multiplying two 5-digit integers. For example the example you give involves like 6 steps. I’m pretty sure GPT4 already does ‘reasoning’ of about that level of sophistication in a single forward pass, e.g. when predicting the next token of the transcript of a conversation in which one human is deceiving another about something. (In fact, in general, how do you explain how LLMs can predict deceptive text, if they simply don’t have enough layers to do all the deceptive reasoning without ‘spilling the beans’ into the token stream?)
(2). Another way to get at the problem with this reasoning, is that I think it hypothesizes an agent within weight updates off the analogical resemblance to an agent that the finished product has. But in fact there’s at most a superficial resemblance between (LLM forward pass) and (repeated LLM forward passes in a Chain-of-thought over text).
That is, an LLM unrolled multiple times, from a given prompt, can make plans; it can plot to seize power, imitating humans who it saw thus plot; it can multiply N-digit integers, working them out just like a human. But this tells us literally nothing about what it can do in a single forward pass.
For comparison, consider a large neural network that is used for image segmentation. The entire physical world falls into the domain of such a model. It can learn that people exist, that dogs exist, and that machinery exists, in some sense. Suppose someone argued that such a neural network, in a single forward pass, used deceptive reasoning that turned out to be useful for prediction because of the backward pass, and that we ought therefore to expect such a neural network, when embedded in some device down the road, to turn and kill us.
The argument is exactly identical to the case of the language model, but no one makes it. And I think the reason is that people think about the properties that a trained LLM can exhibit when unrolled over multiple forward passes, in a particular context and with a particular prompt, and then mistakenly attribute these properties to the single forward pass.
(All of which is to say—look, if you think you can get a deceptive agent from a LLM this way you should also expect a deceptive agent from an image segmentation model. Maybe that’s true! But I’ve never seen anyone say this, which makes me think they’re making the mistake I describe above.)
The reason I’m not worried about image segmentation models is that it doesn’t seem like they’d have the relevant capabilities or goals. Maybe in the limit they would: if we somehow banned all other kinds of AI, but let image segmentation models scale to arbitrary size and training data amounts, then eventually after decades of scaling and adding 9’s of reliability to their image predictions, they’d end up with scary agents inside because that would be useful for getting one of those 9’s. But yeah, it’s a pretty good bet that the relevant kinds of capabilities (e.g. ability to coherently pursue goals, ability to write code, ability to persuade humans of stuff) are most likely to appear earliest in systems that are trained in environments more tailored to developing those capabilities. tl;dr my answer is “wake me when an image segmentation model starts performing well in dangerous capabilities evals like METR’s and OpenAI’s. Which won’t happen for a long time because image segmentation models are going to be worse at agency than models explicitly trained to be agents.”
(3). I think this is just attributing extremely complex machinery to the forward pass of an LLM that is supposed to show up in a data-indifferent manner, and that this is a universally bad bet for ML.
Like, different Transformers store different things depending on the data they’re given. If you train them on SciHub they store a bunch of SciHub shit. If you train them on Wikipedia they store a bunch of Wikipedia shit. In every case, for each weight in the Transformer, you can find specific reasons for each neuron being what it is because of the data.
The “LLM will learn deception” hypothesis amounts to saying that—so long as a LLM is big enough, and trained on enough data to know the world exists—you’ll find complex machinery in it that (1) specifically activates once it figures out that it’s “not in training” and (2) was mostly just hiding until then. My bet is that this won’t show up, because there are no such structures in a Transformer that don’t depend on data. Your French Transformer / English Transformer / Toolformer / etc will not all learn to betray you if they get big enough—we will not find unused complex machinery in a Transformer to betray you because we find NO unused complex machinery in a transformer, etc
I think this is a misunderstanding of the “LLM will learn deception” hypothesis. First of all, the conditions of the hypothesis are not just “so long as it’s big enough and knows the world exists.” It’s more stringent than that; there probably needs to be agency, for example (goal-directedness) and situational awareness. (Though I think John Wentworth disagrees?)
Secondly, the “complex machinery” claim is actually trivial, though you make it sound like it’s crazy. ANY behavior of a neural net in situation class X, which does not appear in situation class Y, is the result of ‘unused-in-Y complex machinery.’ So set Y = training and X = deployment, and literally any claim about how deployment will be different from training involves this.
Another different approach I could take would be: The complex machinery DOES get used a lot in training. Indeed that’s why it evolved/was-formed-by-SGD. The complex machinery is the goal-directedness machinery, the machinery that chooses actions on the basis of how well the action is predicted to serve the goals. That machinery is presumably used all the fucking time in training, and it causes the system to behave-as-if-aligned in training and behave in blatantly unaligned ways once it’s very obvious that it can get away with doing so.
These all seem like reasonable reasons to doubt the hypothesized mechanism, yup. I think you’re underestimating how much can happen in a single forward pass, though: it has to be somewhat shallow, so it can’t involve too many variables, but the whole point of making the networks as large as we do these days is that it turns out an awful lot can happen in parallel. I also think there would be no reason for deception to occur if it’s never a good weight pattern to use to predict the data; it’s only if the data contains a pattern that the gradient will put into a deceptive forward mechanism that this could possibly occur. For example, if the model is trained on a bunch of humans being deceptive about their political intentions, and then RLHF is attempted.
In any case, I don’t think the old Yudkowsky model of deceptive alignment is relevant, in that I think the level of deception to expect from AI should be calibrated to be around the amount you’d expect from a young human, not some super schemer god. The concern arises only when the data actually contains patterns well modeled by deception, and this would be expected to be more present in the case of something like an engagement maximizer online learning RL system.
And to be clear I don’t expect the things that can destroy humanity to arise because of deception directly. It seems much more likely to me that they’ll arise because competing people ask their model to do something that puts those models in competition in a way that puts humanity at risk, e.g. several different powerful competing model-based engagement/sales optimizing reinforcement learners, or more speculatively something military. Something where the core problem is that alignment tech is effectively not used, and where solving this deception problem wouldn’t have saved us anyway.
Regarding the details of your descriptions: I really mainly think this sort of deception would arise in the wild when there’s a reward model passing gradients to multiple steps of a sequential model, or possibly the imitating humans locally thing. But without a reward model, nothing pushes the different steps of the sequential model towards trying to achieve the “same thing” across different steps in any significant sense. But of course almost all the really useful models involve a reward model somehow.
It’s really weird that we find ourselves at the hinge of history. One proposed explanation is that we’re part of an ancestor simulation. It makes sense that ancestor simulations would be focused on the hinge of history. But unless ancestor simulations make up a significant proportion of future minds, it’s still weird that we find ourselves in a simulation rather than actually experiencing the future.
Why might ancestor simulations make up a significant proportion of future minds? One possible answer is that ancestor simulations provide the information required for acausal cooperation across large worlds (known as ECL). If knowing the values that civilizations developed after the hinge of history allowed you to trade with them, then civilizations should focus a significant proportion of their resources on simulating the hinges of history experienced by many other civilizations.
I presume that this explanation has been proposed before, and probably in more detail; links appreciated if so.
I’ve done work in this area, but never been particularly enthusiastic about promoting it. It usually turns out to be inactionable/grim/likely to rouse a panic.
This is a familiar thought, to me.
A counterargument occurs to me: Isn’t it arguable that most of what we need to know about a species, to trade with it, is just downstream of its biology? Of course we talk a lot about our contingent factors, our culture, our history, but I think we’re pretty much just the same animals we’ve always been, extrapolated. If that’s the case, wouldn’t far more simulation time be given to evolutionary histories, rather than simulating variations of hinges? Anthropic measure wouldn’t be especially concentrated on the hinge, it might even skip it.
Countercounterargument: it also seems like there are a lot of anti-inductive effects in the histories of technological societies that might mean you really do have to simulate it all to find out how values settle or just to figure out the species’ rate of success. Evolutionary histories might also have a lot more computationally compressible shared structure.
I’d be surprised if this, the world in front of us, were a pareto-efficient bargaining outcome. Hinge histories fucking suck to live in and I would strongly prefer a trade protocol that instantiated as few of them as possible. I wouldn’t expect many to be necessary, certainly not enough to significantly outweigh the… thing that is supposed to come after. (at this point, I’d prefer to take it into DMs/call)
Thinking about this stuff again, something occurred to me. Please make sure to keep, in cold storage, copies of misaligned AGIs that you may produce, when you catch them. It’s important. This policy could save us.
Please make sure to keep, in cold storage, copies of misaligned AGIs that you may produce, when you catch them. It’s important. This policy could save us.
Would you care to expand on your remark? I don’t see how it follows from what you said above it.
Yeah, it wasn’t argued. I wasn’t sure whether it needed to be explained, for Richard. I don’t remember how I wound up getting there from the rest of the comment, I think it was just in the same broad neighborhood.
World champion in Chess: “It’s really weird that I’m world champion. It must be a simulation or I must dream or..”
Joe Biden: “It’s really weird I’m president, it must be a simul...” (Donald Trump: “It really really makes no sense I’m president, it MUST be a s..”)
David Chalmers: “It’s really weird I’m providing the seminal hard problem formulation. It must be a sim..”
...
Rationalist (before finding lesswrong): “Gosh, all these people around me, really wired differently than I am. I must be in a simulation.”
Something seems funny to me in the anthropic reasoning in these examples, and in yours too.
Of course there is exactly one world champion in chess (or in anything else), so a line of reasoning under which the world champion, almost by definition, questions his champion-ness seems odd. Then again, I’d be lying if I claimed I could not intuitively empathize with his wondering about the odds of exactly him being the world champion among 9 billion people.
This leads me to the following, which eventually more or less satisfies me:
Hypothetically, imagine each generation has only 1 person, and there’s rebirth: it’s just a rebirth of the same person, in a different generation.
With some simplification:
For 10 000 generations you lived in stone-age conditions
For 1 generation—today—you’re the hinge-of-history generation
X (X being: you won’t live anymore at all as AI killed everything; or you live 1 million generations happily, served by AI; or what have you).
The 10 000 you’s didn’t have much reason to wonder about the hinge of history, and so didn’t happen to think about it. The one you in the hinge-of-history generation, by definition, has much reason to think about the hinge of history, and does think about it.
So it becomes a bit like a lottery which you repeat so many times that you naturally draw the winning number once. At that lucky punch, there’s no reason to think “Unlikely, it’s probably a simulation”, or anything.
My impression is that, in a similar way, the reincarnated guy should not wonder about it, not even when his memory is wiped each time, and in the same vein (hm, am I sloppy here? that’s the hinge of my argument) neither do you have to wonder too much.
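A toy version of the reincarnation story can make the conditioning step explicit. This is only a sketch: it assumes, as the argument does, that a life wonders about the hinge only if it is the hinge life, and it leaves out the post-hinge era X entirely. The names are made up for illustration.

```python
from fractions import Fraction

STONE_AGE_GENERATIONS = 10_000  # from the simplified story above
HINGE_GENERATIONS = 1

# Each life is labelled by its era. Only the hinge life wonders about the hinge
# (a simplifying assumption taken from the argument, not an empirical claim).
lives = ["stone_age"] * STONE_AGE_GENERATIONS + ["hinge"] * HINGE_GENERATIONS


def wonders_about_hinge(era: str) -> bool:
    return era == "hinge"


# Prior chance that a uniformly picked life is the hinge life.
prior_hinge = Fraction(lives.count("hinge"), len(lives))

# Chance of being the hinge life, given that this life is wondering about it.
wondering = [era for era in lives if wonders_about_hinge(era)]
posterior_hinge = Fraction(wondering.count("hinge"), len(wondering))

print(prior_hinge)      # 1/10001
print(posterior_hinge)  # 1
```

Under that assumption the low prior is real, but every life that asks the question is the hinge life, so asking it carries no extra surprise.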
In general I don’t think anthropic reasoning like this holds any substance. We experience what we experience, and condition on that in forming models about what it is and where we are in it.
We don’t get to make millions of bits of observations about being a human in a technological society, use those observations to extrapolate the possibility of supergalactic multitudes of consciousness, and then express surprise at a pathetic few dozen bits of improbability of not being one of those multitudes. We already used those bits (and a great many more!) in forming our model in the first place.
Since there’s been some recent discussion of the SSC/NYT incident (in particular via Zack’s post), it seems worth copying over my twitter threads from that time about why I was disappointed by the rationalist community’s response to the situation.
Scott Alexander is the most politically charitable person I know. Him being driven off the internet is terrible. Separately, it is also terrible if we have totally failed to internalize his lessons, and immediately leap to the conclusion that the NYT is being evil or selfish.
Ours is a community built around the long-term value of telling the truth. Are we unable to imagine reasonable disagreement about when the benefits of revealing real names outweigh the harms? Yes, it goes against our norms, but different groups have different norms.
If the extended rationalist/SSC community could cancel the NYT, would we? For planning to doxx Scott? For actually doing so, as a dumb mistake? For doing so, but for principled reasons? Would we give those reasons fair hearing? From what I’ve seen so far, I suspect not.
I feel very sorry for Scott, and really hope the NYT doesn’t doxx him or anyone else. But if you claim to be charitable and openminded, except when confronted by a test that affects your own community, then you’re using those words as performative weapons, deliberately or not.
[One more tweet responding to tweets by @balajis and @webdevmason, omitted here.]
Scott Alexander is writing again, on a substack blog called Astral Codex Ten! Also, he doxxed himself in the first post. This post seems like solid evidence that many SSC fans dramatically overreacted to the NYT situation.
Scott: “I still think the most likely explanation for what happened was that there was a rule on the books, some departments and editors followed it more slavishly than others, and I had the bad luck to be assigned to a department and editor that followed it a lot. That’s all.” [I didn’t comment on this in the thread, but I intended to highlight the difference between this and the conspiratorial rhetoric that was floating around when he originally took his blog down.]
I am pretty unimpressed by his self-justification: “Suppose Power comes up to you and says hey, I’m gonna kick you in the balls. … Sometimes you have to be a crazy bastard so people won’t walk all over you.” Why is doxxing the one thing Scott won’t be charitable about?
[In response to @habryka asking what it would mean for Scott to be charitable about this]: Merely to continue applying the standards of most of his other posts, where he assumes both sides are reasonable and have useful perspectives. And not to turn this into a bravery debate.
[In response to @benskuhn saying that Scott’s response is understandable, since being doxxed nearly prevented him from going into medicine]: On one hand, yes, this seems reasonable. On the other hand, this is also a fully general excuse for unreasonable dialogue. It is always the case that important issues have had major impacts on individuals. Taking this excuse seriously undermines Scott’s key principles.
I would be less critical if it were just Scott, but a lot of people jumped on narratives similar to “NYT is going around kicking people in the balls for selfish reasons”, demonstrating an alarming amount of tribalism—and worse, lack of self-awareness about it.
Scott is already too charitable. I’d even say that Scott being too charitable made this specific situation worse. I don’t find this to be a worthwhile thing about Scott either for us to emulate, or for Scott to take further.
“Quokka” is a meme about rationalists for a reason. You are not going to have unerring logical evidence that someone wants to harm you if they are trying to be at all subtle. You have to figure it out from their behavior.
Sometimes it just isn’t true that both sides are reasonable and have useful perspectives.
Ours is a community built around the long-term value of telling the truth. Are we unable to imagine reasonable disagreement about when the benefits of revealing real names outweigh the harms? Yes, it goes against our norms, but different groups have different norms.
I think this only holds if NYT has a consistent policy of using real names. My understanding is they have repeatedly written about other people using pseudonyms only, and have not articulated a principled reason to treat Scott differently.
standards of most of his other posts, where he assumes both sides are reasonable and have useful perspectives
Scott’s flavor of charity is not quite this. It wouldn’t be useful for understanding sides that are not reasonable or have useless perspectives otherwise, or else you’d need to routinely “assume” false things to carry out the exercise.
The point is to meaningfully engage with other perspectives, without the usual prerequisite of having positive beliefs about them. Treating them in a similar way as if they were reasonable or useful, even when they clearly aren’t. Sometimes the resulting investigation changes one’s mind on this point. But often it doesn’t, while still revealing many details that wouldn’t otherwise be noticed. Actually intervening on your own beliefs would be self-deception, while treating useless and unreasonable views as they are usually treated wouldn’t be charity.
This is related to tolerance, where the point isn’t to start liking people you don’t like, or to start considering them part of your own ingroup. It’s instead an intervention/norm that goes around the dislike to remove some of its downsides without directly removing the dislike itself.
My mental one-sentence summary of how to think about ELK is “making debate work well in a setting where debaters are able to cite evidence gained by using interpretability tools on each other”.
I’m not claiming that this is how anyone else thinks about ELK (although I got the core idea from talking to Paul) but since I haven’t seen it posted online yet, and since ELK is pretty confusing, I thought it’d be useful to put out there. In particular, this framing motivates us generating interpretability tools which scale in the sense of being robust when used as evidence by AGIs.
Note that this is a very different type of solution to the ones in the original writeup, which seem mainly useful for illustrative purposes rather than actually pointing in promising directions.
Being nice because you’re altruistic, and being even nicer for decision-theoretic reasons on top of that, seems like it involves some kind of double-counting: the reason you’re altruistic in the first place is because evolution ingrained the decision theory into your values.
But it’s not fully double-counting: many humans generalise altruism in a way which leads them to “cooperate” far more than is decision-theoretically rational for the selfish parts of them—e.g. by making big sacrifices for animals, future people, etc. I guess this could be selfishly rational if you subscribe to a very strong form of updatelessness, but I am very skeptical that we’ll discover arguments that this much updatelessness is rationally obligatory.
A very speculative takeaway: maybe “how updateless you are” and “how altruistic you are” are kinda measuring the same thing, and there’s no clean split between whether that’s determined by your values or your decision theory.
Your actions and decisions are not doubled. If you have multiple paths to arrive at the same behaviors, that doesn’t make them wrong or double-counted, it just makes it hard to tell which of them is causal (aka: your behavior is overdetermined).
Are you using “updatelessness” to refer to not having self in your utility function? If so, that’s a new one on me, and I’d prefer “altruism” as the term. I’m not sure that the decision-theory use of “updateless” (to avoid incorrect predictions where experience is correlated with the question at hand) makes sense here.
Oh, this also suggests a way in which the utility function abstraction is leaky, because the reasons for the payoffs in a game may matter. E.g. if one payoff is high because the corresponding agent is altruistic, then in some sense that agent is “already cooperating” in a way which is baked into the game, and so the rational thing for them to do might be different from the rational thing for another agent who gets the same payoffs, but for “selfish” reasons.
Maybe FDT already lumps this effect into the “how correlated are decisions” bucket? Idk.
In UDT2, when you’re in epistemic state Y and you need to make a decision based on some utility function U, you do the following:
1. Go back to some previous epistemic state X and an EDT policy (the combination of which I’ll call the non-updated agent).
2. Spend a small amount of time trying to find the policy P which maximizes U based on your current expectations X.
3. Run P(Y) to make the choice which maximizes U.
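A minimal sketch of the three steps above, using a counterfactual mugging as the toy problem. The payoffs, the 50/50 prior and all names are assumptions made for illustration. The "small amount of time" bound and the EDT component are not modelled; the search is simply exhaustive.

```python
from itertools import product

# Hypothetical toy problem (counterfactual mugging); all numbers are illustrative.
# Prior epistemic state X: a fair coin that has not been observed yet.
PRIOR_X = {"heads": 0.5, "tails": 0.5}
OBSERVATIONS = ["asked_to_pay"]  # what the agent might see later (state Y)
ACTIONS = ["pay", "refuse"]


def utility(world: str, policy: dict) -> float:
    """U for one world, given the whole policy (observation -> action)."""
    pays = policy["asked_to_pay"] == "pay"
    if world == "tails":  # the agent is actually asked for 100
        return -100.0 if pays else 0.0
    # heads: Omega rewards the agent iff its policy would have paid on tails
    return 10_000.0 if pays else 0.0


def expected_utility(policy: dict) -> float:
    return sum(p * utility(world, policy) for world, p in PRIOR_X.items())


def choose_policy() -> dict:
    """Step 2: search all policies, keep the one with highest expected U under X."""
    candidates = [dict(zip(OBSERVATIONS, acts))
                  for acts in product(ACTIONS, repeat=len(OBSERVATIONS))]
    return max(candidates, key=expected_utility)


policy = choose_policy()          # steps 1-2: decided from X, not from Y
current_state_y = "asked_to_pay"  # the agent now knows the coin came up tails
print(policy[current_state_y])    # step 3: run P(Y); prints "pay"
```

Note that `utility` is handed to the non-updated agent unchanged, which is exactly the part questioned in the next paragraph.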
The non-updated agent gets much less information than you currently have, and also gets much less time to think. But it does use the same utility function. That seems… suspicious. If you’re updating so far back that you don’t know who or where you are, how are you meant to know what you care about?
What happens if the non-updated agent doesn’t get given your utility function? On its face, that seems to break its ability to decide which policy P to commit to. But perhaps it could instead choose a policy P(Y,U) which takes as input not just an epistemic state, but also a utility function. Then in step 2, the non-updated agent needs to choose a policy P that maximizes, not the agent’s current utility function, but rather the utility functions it expects to have across a wide range of future situations.
Problem: this involves aggregating the utilities of different agents, and there’s no canonical way to do this. Hmm. So maybe instead of just generating a policy, the non-updated agent also needs to generate a value learning algorithm, that maps from an epistemic state Y to a utility function U, in a way which allows comparison across different Us. Then the non-updated agent tries to find a pair (P, V) such that P(Y) maximizes V(Y) on the distribution of Ys predicted by X. EDIT: no, this doesn’t work. Instead I think you need to go back, not just to a previous epistemic state X, but also to a previous set of preferences U’ (which include meta-level preferences about how your values evolve). Then you pick P and V in order to maximize U’.
Now, it does seem kinda wacky that the non-updated agent can maybe just tell you to change your utility function. But is that actually any weirder than it telling you to change your policy? And after all, you did in fact acquire your values from somewhere, according to some process.
Overall, I haven’t thought about this very much, and I don’t know if it’s already been discussed. But three quick final comments:
This brings UDT closer to an ethical theory, not just a decision theory.
In practice you’d expect P and V to be closely related. In fact, I’d expect them to be inseparable, based on arguments I make here.
Overall the main update I’ve made is not that this version of UDT is actually useful, but that I’m now suspicious of the whole framing of UDT as a process of going back to a non-updated agent and letting it make commitments.
People back then certainly didn’t think of changing preferences.
Also, you can get rid of this problem by saying “you just want to maximize the variable U”. And the things you actually care about (dogs, apples) are just “instrumentally” useful in giving you U. So for example, it is possible in the future you will learn dogs give you a lot of U, or alternatively that apples give you a lot of U. Needless to say, this “instrumentalization” of moral deliberation is not how real agents work. And leads to getting Pascal’s mugged by the world in which you care a lot about easy things.
It’s more natural to model U as a logically uncertain variable, freely floating inside your logical inductor, shaped by its arbitrary aesthetic preferences. This doesn’t completely miss the importance of reward in shaping your values, but it’s certainly very different to how frugally computable agents do it.
I simply think the EV maximization framework breaks here. It is a useful abstraction when you already have a rigid enough notion of value, and are applying these EV calculations to a very concrete magisterium about which you can have well-defined estimates. Otherwise you get mugged everywhere. And that’s not how real agents behave.
Also, you can get rid of this problem by saying “you just want to maximize the variable U”. And the things you actually care about (dogs, apples) are just “instrumentally” useful in giving you U.
But you need some mechanism for actually updating your beliefs about U, because you can’t empirically observe U. That’s the role of V.
leads to getting Pascal’s mugged by the world in which you care a lot about easy things
I think this is fine. Consider two worlds:
In world L, lollipops are easy to make, and paperclips are hard to make.
In world P, it’s the reverse.
Suppose you’re a paperclip-maximizer in world L. And a lollipop-maximizer comes up to you and says “hey, before I found out whether we were in L or P, I committed to giving all my resources to paperclip-maximizers if we were in P, as long as they gave me all their resources if we were in L. Pay up.”
UDT says to pay here—but that seems basically equivalent to getting “mugged” by worlds where you care about easy things.
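To see why paying can look good from the non-updated perspective, here is a worked example with made-up numbers. The 50/50 prior, the yields and the resource amounts are all assumptions for illustration and are not part of the scenario as stated.

```python
# Illustrative numbers only: none of them come from the discussion above.
P_WORLD_L = 0.5       # prior probability of world L (lollipops are easy)
RESOURCES_EACH = 1.0  # each maximizer starts with one unit of resources
EASY_YIELD = 10.0     # items per unit of resources when the item is easy
HARD_YIELD = 1.0      # items per unit of resources when the item is hard


def paperclips_made(world: str, resources: float) -> float:
    """Paperclips are hard to make in world L and easy in world P."""
    rate = HARD_YIELD if world == "L" else EASY_YIELD
    return rate * resources


def paperclipper_ev(accepts_deal: bool) -> float:
    """Expected paperclips, evaluated before learning which world is real."""
    if accepts_deal:
        # Hand everything over in L, receive the other agent's resources in P.
        resources = {"L": 0.0, "P": 2 * RESOURCES_EACH}
    else:
        resources = {"L": RESOURCES_EACH, "P": RESOURCES_EACH}
    return (P_WORLD_L * paperclips_made("L", resources["L"])
            + (1 - P_WORLD_L) * paperclips_made("P", resources["P"]))


print(paperclipper_ev(accepts_deal=False))  # 5.5
print(paperclipper_ev(accepts_deal=True))   # 10.0
```

With these numbers the commitment nearly doubles the paperclip-maximizer's expected output ex ante, even though honoring it in world L leaves it with nothing. As noted further down, the result depends on the prior.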
But you need some mechanism for actually updating your beliefs about U
Yep, but you can just treat it as another observation channel into UDT. You could, if you want, treat it as a computed number you observe in the corner of your eye, and then just apply UDT maximizing U, and you don’t need to change UDT in any way.
UDT says to pay here
(Let’s not forget this depends on your prior, and we don’t have any privileged way to assign priors to these things. But that’s a tangential point.)
I do agree that there’s not any sharp distinction between situations where it “seems good” and situations where it “seems bad” to get mugged. After all, if all you care about is maximizing EV, then you should take all muggings. It’s just that, when we do that, something feels off (to us humans, maybe due to risk-aversion), and we go “hmm, probably this framework is not modelling everything we want, or missing some important robustness considerations, or whatever, because I don’t really feel like spending all my resources and creating a lot of disvalue just because in the world where 1 + 1 = 3 someone is offering me a good deal”. You start to see how your abstractions might break, and how you can’t get any satisfying notion of “complete updatelessness” (that doesn’t go against important intuitions). And you start to rethink whether this is what we normatively want, or what we realistically see in agents.
Yep, but you can just treat it as another observation channel into UDT.
Hmm, I’m confused by this. Why should we treat it this way? There’s no actual observation channel, and in order to derive information about utilities from our experiences, we need to specify some value learning algorithm. That’s the role V is playing.
It’s just that, when we do that, something feels off (to us humans, maybe due to risk-aversion), and we go “hmm, probably this framework is not modelling everything we want, or missing some important robustness considerations, or whatever, because I don’t really feel like spending all my resources and creating a lot of disvalue just because in the world where 1 + 1 = 3 someone is offering me a good deal”.
Obviously I am not arguing that you should agree to all moral muggings. If a pain-maximizer came up to you and said “hey, looks like we’re in a world where pain is way easier to create than pleasure, give me all your resources”, it would be nuts to agree, just like it would be nuts to get mugged by “1+1=3”. I’m just saying that “sometimes you get mugged” is not a good argument against my position, and definitely doesn’t imply “you get mugged everywhere”.
There’s no actual observation channel, and in order to derive information about utilities from our experiences, we need to specify some value learning algorithm.
Yes, absolutely! I just meant that, once you give me whatever V you choose to derive U from observations, I will just be able to apply UDT on top of that. So under this framework there doesn’t seem to be anything new going on, because you are just choosing an algorithm V at the start of time, and then treating its outputs as observations. That’s, again, why this only feels like a good model of “completely crystallized rigid values”, and not of “organically building them up slowly, while my concepts and planner module also evolve, etc.”.[1]
definitely doesn’t imply “you get mugged everywhere”
Wait, but how does your proposal differ from EV maximization (with moral uncertainty as part of the EV maximization itself, as I explain above)?
Because anything that is doing pure EV maximization “gets mugged everywhere”. Meaning if you actually have the beliefs (for example, that the world where suffering is hard to produce could exist), you just take those bets. Of course if you don’t have such “extreme” beliefs it doesn’t, but then we’re not talking about decision-making, but instead about belief-formation. You could say “I will just do EV maximization, but never have extreme beliefs that lead to suspicious-looking behavior”, but that’d be hiding the problem under belief-formation, and doesn’t seem to be the kind of efficient mechanism that agents really implement to avoid these failure modes.
To be clear, V can be a very general algorithm (like “run a copy of me thinking about ethics”), so that this doesn’t “feel like” having rigid values. Then I just think you’re carving reality at the wrong spot. You’re ignoring the actual dynamics of messy value formation, hiding them under V.
In the days of UDT2, the background assumption was that agents should maintain an unchanging preference, which is separate from knowledge. One motivation for UDT is that updating makes an agent stop caring about updated-away possibilities, while UDT is not doing that. Going back to a previous epistemic state is a way of preserving preference from that epistemic state; the “current” utility function is considered a bug and doesn’t do anything if UDT is adopted. The non-updated agent can in principle consider the information you currently have as one of the possibilities when formulating the general policy for all possibilities, though being bounded it won’t do a very good job.
Traditionally UDT1.1 wants to make its decisions from very little knowledge and to apply the resulting policy to all situations, always. A more pragmatic approach is to make decisions from modestly less knowledge and to scope the policy to the medium-term future. Some form of this is useful for many thought experiments where the environment or other players also have the little knowledge our agent uses to make its decisions from the past, and so could know the policy the agent decides on before they need to prepare for it or make predictions about it.
The problem is commitment races (as in the game of chicken), where everyone wants to decide earlier and force the others to respond. But there is a need to remain bounded in making decisions, both to personally compute them and to make it possible for others to anticipate them and to coordinate. This creates a more reasonable equilibrium, motivating decisions from a less ignorant epistemic state that have a better chance of being relevant to the current situation, in balance with trying to decide from a more ignorant epistemic state where a general policy would enable more strategicness across possibilities. UDT1.1 can’t find such balance, but it’s possible that something UDT2-shaped might.
One motivation for UDT is that updating makes an agent stop caring about updated-away possibilities, while UDT is not doing that.
I think there’s an ambiguity here. UDT makes the agent stop considering updated-away possibilities, but I haven’t seen any discussion of UDT which suggests that it stops caring about them in principle (except for a brief suggestion from Paul that one option for UDT is to “go back to a position where I’m mostly ignorant about the content of my values”). Rather, when I’ve seen UDT discussed, it focuses on updating or un-updating your epistemic state.
I don’t think the shift I’m proposing is particularly important, but I do think the idea that “you have your prior and your utility function from the very beginning” is a kinda misleading frame to be in, so I’m trying to nudge a little away from that.
UDT makes the agent stop considering updated-away possibilities, but I haven’t seen any discussion of UDT which suggests that it stops caring about them in principle
UDT specifically enables agents to consider the updated-away possibilities in a way relevant to decision making, while an updated agent (that’s not using something UDT-like) wouldn’t be able to do that in any circumstance, and so would be functionally indistinguishable from an agent that has different preferences or undefined preferences for those possibilities. Not caring about them seems like an apt informal description (even as this is compatible with keeping the same utility function outside the event of current knowledge). In a similar way, we could say that after updating, an agent either changes their probability distribution or keeps the original prior.
I do think the idea that “you have your prior and your utility function from the very beginning” is a kinda misleading frame to be in
Historically it was overwhelmingly the frame until recently, so it’s the correct frame for interpreting the intended meaning of texts from that time. This is a simplifying assumption that still leaves many open questions about how to make decisions in sufficiently strange situations (where merely models of behavior make these strange situations ubiquitous in practice). When an agent doesn’t know its own preference and needs to do something about that, it’s an additional complication that usually wasn’t introduced.
UDT specifically enables agents to consider the updated-away possibilities in a way relevant to decision making, while an updated agent (that’s not using something UDT-like) wouldn’t be able to do that in any circumstance
Agreed; apologies for the sloppy phrasing.
Historically it was overwhelmingly the frame until recently, so it’s the correct frame for interpreting the intended meaning of texts from that time.
I agree, that’s why I’m trying to outline an alternative frame for thinking about it.
Some more thoughts: we can portray the process of choosing a successor policy as the iterative process of making more and more commitments over time. But what does it actually look like to make a commitment? Well, consider an agent that is made of multiple subagents, that each get to vote on its decisions. You can think of a commitment as basically saying “this subagent still gets to vote, but no longer gets updated”—i.e. it’s a kind of stop-gradient.
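A rough sketch of "still gets to vote, but no longer gets updated", under assumptions of my own choosing: votes are numbers in [-1, 1], the agent acts if the votes sum to something positive, and updating is a simple step toward a feedback signal. The class and names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Subagent:
    name: str
    preference: float     # vote in [-1, 1]: against vs. in favor of some action
    frozen: bool = False  # True = "committed": still votes, no longer updated

    def vote(self) -> float:
        return self.preference

    def update(self, feedback: float, learning_rate: float = 0.5) -> None:
        if self.frozen:  # the "stop-gradient": learning signals are ignored
            return
        self.preference += learning_rate * (feedback - self.preference)


def decide(subagents: list) -> bool:
    """Act iff the summed votes are positive."""
    return sum(agent.vote() for agent in subagents) > 0


committed = Subagent("keep_promise", preference=1.0, frozen=True)
flexible = Subagent("opportunist", preference=0.5)
agents = [committed, flexible]

# New evidence says the promised action is now costly (feedback of -1).
for agent in agents:
    agent.update(feedback=-1.0)

print(committed.preference)  # 1.0
print(flexible.preference)   # -0.25
print(decide(agents))        # True
```

The frozen subagent keeps the overall decision in place even after the flexible one has changed its mind, which is the sense in which the commitment binds.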
Two interesting implications of this perspective:
The “cost” of a commitment can be measured both in terms of “how often does the subagent vote in stupid ways?”, and also “how much space does it require to continue storing this subagent?” But since we’re assuming that agents get much smarter over time, probably the latter is pretty small.
There’s a striking similarity to the problem of trapped priors in human psychology. Parts of our brains basically are subagents that still get to vote but no longer get updated. And I don’t think this is just a bug—it’s also a feature. This is true on the level of biological evolution (you need to have a strong fear of death in order to actually survive) and also on the level of cultural evolution (if you can indoctrinate kids in a way that sticks, then your culture is much more likely to persist).
The (somewhat provocative) way of phrasing this is that trauma is evolution’s approach to implementing UDT. Someone who’s been traumatized into conformity by society when they were young will then (in theory) continue obeying society’s dictates even when they later have more options. Someone who gets very angry if mistreated in a certain way is much harder to mistreat in that way. And of course trauma is deeply suboptimal in a bunch of ways, but so too are UDT commitments, because they were made too early to figure out better alternatives.
This is clearly only a small component of the story but the analogy is definitely a very interesting one.
More thoughts: what’s the difference between paying in a counterfactual mugging based on:
Whether the millionth digit of pi (5) is odd or even
Whether or not there are an infinite number of primes?
In the latter case knowing the truth is (near-)inextricably entangled with a bunch of other capabilities, like the ability to do advanced mathematics. Whereas in the former it isn’t. Suppose that before you knew either fact you were told that one of them was entangled in this way: would you still want to commit to paying out in a mugging based on it?
Well… maybe? But it means that the counterlogical of “if there hadn’t been an infinite number of primes” is not very well-defined—it’s hard to modify your brain to add that belief without making a bunch of other modifications. So now Omega doesn’t just have to be (near-)omniscient, it also needs to have a clear definition of the counterlogical that’s “fair” according to your standards; without knowing that it has that, paying up becomes less tempting.
Individually, logical counterfactuals don’t seem very coherent. This is related to the “I’m an algorithm” vs. “I’m a physical object” distinction of FDT. When you are an algorithm considering a decision, you want to mark all sites of intervention/influence in the world where the world depends on your behavior. If you only mark some of them, then you later fail at the step where you ask what happens if you act differently: you obtain a broken counterfactual world where only some instances of the fact of your behavior have been replaced and not others.
So I think it makes a bit more sense to ask where specifically your brain depends on a fact, to construct an exhaustive dependence of your brain on the fact, before turning to particular counterfactual content for that fact to be replaced with. That is, dependence of a system on a fact, the way it varies with the fact, seems potentially clearer than individual counterfactuals of how that system works if the fact is set to be a certain way. (To make a somewhat hopeless analogy, fibration instead of individual fibers, and it shouldn’t be a problem that all fibers are different from each other. Any question about a counterfactual should be reformulated into a question about a dependence.)
Random question I’ve been thinking about: how would you set up a market for votes? Suppose specifically that you have a proportional chances election (i.e. the outcome gets chosen with probability proportional to the number of votes cast for it—assume each vote is a distribution over candidates). So everyone has an incentive to get everyone who’s not already voting for their favorite option to change their vote; and you can have positive-sum trades where I sell you a promise to switch X% of my votes to a compromise candidate in exchange for you switching Y% of your votes to a compromise candidate.
What makes this complicated is that I don’t just care that I get votes for my favorite candidate, I also care about where those votes come from—i.e. would they otherwise have been cast for my second-favorite candidate, or for my least-favorite?
Each person starts off with a vote and can sell shares in it to whoever they like for whatever price they like. When the vote is called, you get a number of votes proportional to your shares. This might help me trade votes in a current election for votes in a future election, but it doesn’t seem to address the core problem: what you’re buying needs to be votes moved away from a specific candidate, since otherwise there’s no benefit to trade in the one-off case.
EDITED: Let a certificate x:A->B be a promise to switch x fraction of your vote from A to B. (I’ll mostly skip the x for brevity.) Suppose your actual favorite candidate is C; you can generate up to one certificate C->Z, for any Z. Then when the vote is called, everyone votes for their favorite candidate, then the votes are modified by all the certificates in circulation.
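To make the certificate mechanism concrete, here is a minimal sketch. It assumes each voter starts with one unit of vote on their declared favorite, that certificates are applied in order, and that a certificate can only move vote mass the voter still has on the source candidate. The function names `tally` and `win_probabilities` are made up for illustration.

```python
from collections import defaultdict


def tally(favorites, certificates):
    """favorites: dict mapping voter -> declared favorite candidate.
    certificates: list of (voter, x, src, dst), a promise by `voter` to move
    fraction x of their vote from src to dst.
    Returns a dict mapping candidate -> total vote mass."""
    # Everyone first votes for their favorite with their whole vote.
    ballots = {v: defaultdict(float, {c: 1.0}) for v, c in favorites.items()}
    for voter, x, src, dst in certificates:
        ballot = ballots[voter]
        # Assumption: you cannot move more mass than is still on src.
        moved = min(x, ballot[src])
        ballot[src] -= moved
        ballot[dst] += moved
    totals = defaultdict(float)
    for ballot in ballots.values():
        for candidate, mass in ballot.items():
            totals[candidate] += mass
    return dict(totals)


def win_probabilities(totals):
    """Proportional chances: each candidate wins with probability
    proportional to the vote mass cast for them."""
    total = sum(totals.values())
    return {c: mass / total for c, mass in totals.items() if mass > 0}


favorites = {"alice": "X", "bob": "Y"}
certificates = [("alice", 0.5, "X", "D"), ("bob", 0.5, "Y", "D")]
probabilities = win_probabilities(tally(favorites, certificates))
```

With these two certificates in circulation, X and Y each keep half a vote and the compromise candidate D gets one full vote, so D wins with probability 0.5.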
Ideally the thing you’d want is to incentivize the following: “oh, nobody has realized yet that D is a really good compromise candidate between X and Y! I can profit off this fact…” This can be modelled as there not yet being much demand for X->D or Y->D, so you can buy a bunch of it cheaply and wait for the price to appreciate (because many others will also want to buy it once they realize); or maybe you can short X->Y and Y->X; or sell versions of X->Y which are actually X->D->Y. You probably also need the ability to borrow from the bank for liquidity—and maybe the bank should accept A->B->C as a return on A->C?
This setup strongly incentivizes strategic declaration of who your favorite candidate is. But maybe that’s just unavoidable… the whole point of a market is that you’ve got something to trade, and in some sense that has to be your default vote.
How do markets in general avoid this? By setting regulations saying you can’t make someone’s life worse and then tell them to pay you to stop. Without political intervention, markets in general are just threat-machines. I.e. you really need the distinguished “zero” point in order to make them work.
The CoCo value could be viewed as starting at a disagreement point, and going “let’s move to the Pareto frontier and evenly split the gain”. Which seems like what’s happening with trades in this market: you benefit from how bad your disagreement point is.
Diffractor describes a ROSE point as anywhere where, if you drew two lines at the highest utilities that players 1 and 2 could get without sending the other below the disagreement point utility, there’s a tangent line at the ROSE point that has equal distance to the two max-utility boundaries.
I don’t really see how to generalize this to constructing a market, nor do I know if that question makes sense.
Just spitballing here: Assign each voter 100 shares for each candidate. To vote, each voter selects a subset of their shares to constitute their vote. Voters can freely trade shares.
Under this system, a voter would more highly value shares for candidates that are either very high or very low in their preference order (the latter so as to exclude them from the vote). Thus, trades would look like each party exchanging shares about which they are themselves ambivalent to gain shares that are more valuable to them.
If you remove the proportional chances part, then it becomes a guessing game of which marginal votes actually matter.
Interesting! Hadn’t thought of this approach. Let’s see… Intuitively I think it gets pretty strategically weird because a) who you vote for depends pretty sensitively on other people’s votes (e.g. in proportional chances voting you want to vote for everyone who’s above the expected value of everyone else’s votes; in approval voting you want to vote for everyone you approve of unless it bumps them above someone you like more), and b) you want to buy from your enemies much more than from your friends, because your friends will already not be voting for bad candidates. But maybe the latter is fine because if you buy from your friends they’ll end up with more money which they can then spend on other things? I’ll keep thinking.
I expect it to be difficult to generate adversarial inputs which will fool a deceptively aligned AI. One proposed strategy for doing so is relaxed adversarial training, where the adversary can modify internal weights. But this seems like it will require a lot of progress on interpretability. An alternative strategy, which I haven’t yet seen any discussion of, is to allow the adversary to do a data poisoning attack before generating adversarial inputs—i.e. the adversary gets to specify inputs and losses for a given number of SGD steps, and then the adversarial input which the base model will be evaluated on afterwards. (Edit: probably a better name for this is adversarial meta-learning.)
Another thought on dath ilan: notice how much of the work of Keltham’s reasoning is based on him pattern-matching to tropes from dath ilani literature, and then trying to evaluate their respective probabilities. In other words: like bayesianism, he’s mostly glossing over the “hypothesis generation” step of reasoning.
I wonder if dath ilan puts a lot of effort into spreading a wide range of tropes because they don’t know how to teach systematically good hypothesis generation.
I think you are overgeneralizing. We also see some mix of Dath Ilan, stories about Dath Ilan, stories about stories about Dath Ilan, and interactions between these, so all bets are off really.
I suspect that AIXI is misleading to think about in large part because it lacks reusable parameters—instead it just memorises all inputs it’s seen so far. Which means the setup doesn’t have episodes, or a training/deployment distinction; nor is any behaviour actually “reinforced”.
I kind of think the lack of episodes makes it more realistic for many problems, but admittedly not for simulated games. Also, presumably many of the component Turing machines have reusable parameters and reinforce behaviour, altho this is hidden by the formalism. [EDIT: I retract the second sentence]
Wait, really? I thought it made sense (although I’d contend that most people don’t think about AIXI in terms of those TMs reinforcing hypotheses, which is the point I’m making). What’s incorrect about it?
Well now I’m less sure that it’s incorrect. I was originally imagining that like in Solomonoff induction, the TMs basically directly controlled AIXI’s actions, but that’s not right: there’s an expectimax. And if the TMs reinforce actions by shaping the rewards, in the AIXI formalism you learn that immediately and throw out those TMs.
Yes we do: training is our evolutionary history, deployment is an individual lifetime. And our genomes are our reusable parameters.
Unfortunately I haven’t yet written any papers/posts really laying out this analogy, but it’s pretty central to the way I think about AI, and I’m working on a bunch of related stuff as part of my PhD, so hopefully I’ll have a more complete explanation soon.
I’ve recently discovered waitwho.is, which collects all the online writing and talks of various tech-related public intellectuals. It seems like an important and previously-missing piece of infrastructure for intellectual progress online.
Yudkowsky mainly wrote about recursive self-improvement from a perspective in which algorithms were the most important factors in AI progress—e.g. the brain in a box in a basement which redesigns its way to superintelligence.
Sometimes when explaining the argument, though, he switched to a perspective in which compute was the main consideration—e.g. when he talked about getting “a hyperexponential explosion out of Moore’s Law once the researchers are running on computers”.
What does recursive self-improvement look like when you think that data might be the limiting factor? It seems to me that it looks a lot like iterated amplification: using less intelligent AIs to provide a training signal for more intelligent AIs.
I don’t consider this a good reason to worry about IA, though: in a world where data is the main limiting factor, recursive approaches to generating it still seem much safer than alternatives.
Perhaps a data-limited intelligence explosion is analogous to what we humans do all the time when we teach ourselves something. Out of the vast sea of information on the internet, we go get some data, and study it, and then use that to make a better opinion about what data we need next, and then repeat until we are at the forefront of the world’s knowledge. We start from scratch, with a vague understanding like “I should learn more economics, I don’t even know what supply and demand are” and then we end up publishing a paper on auction theory or something idk. This is a recursive self-improvement loop in data quality, so to speak, rather than data quantity.
What counts as self-improvement in the scenario governed by data?
You can grab the whole internet, including scihub and library genesis, and then maybe hack all “smart” appliances worldwide… and after that I guess you need to construct some machines that will perform experiments for you.
But none of this improves the machine’s “self”. With algorithms, the idea is that the machine would replace its own algorithms by better ones, once it gets the ability to invent and evaluate algorithms. With hardware, the idea is that the machine would replace its own hardware by faster ones, once it gets the ability to design and produce hardware. But replacing your data with better data, that… we usually don’t call self-improvement.
Also, what kind of data are we talking about? Data about the real world has to come from the outside, by definition. (Unless it is data about physics that you can obtain by observing the physical properties of your own circuits, or something like that.) But there is also data in the sense of precomputed cached results, like playing zillions of games of chess against yourself, and remembering which strategies were most successful. If this was the limiting factor… I guess it would be something like a bounded AIXI which hypothetically already has enough hardware to simulate a universe, and only needs to make zillions of computations to find the one that is consistent with the observed data.
In the scenario governed by data, the part that counts as self-improvement is where the AI puts itself through a process of optimisation by stochastic gradient descent with respect to that data.
You don’t need that much hardware for data to be a bottleneck. For example, I think that there are plenty of economically valuable tasks that are easier to learn than StarCraft. But we get StarCraft AIs instead because games are the only task where we can generate arbitrarily large amounts of data.
RL usually applies some discount rate, and also caps episodes at a certain length, so that an action taken at a given time isn’t reinforced very much (or at all) for having much longer-term consequences.
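As a rough illustration of how quickly credit for long-term consequences vanishes, here is a sketch of the weight a delayed reward gets in an action's discounted return. The values `gamma=0.99` and `horizon=1000` are illustrative assumptions, and the action is assumed to be taken at the first step of the episode.

```python
def credit_weight(delay, gamma=0.99, horizon=1000):
    """Weight that a reward arriving `delay` steps after an action receives
    in that action's discounted return, when episodes are capped at
    `horizon` steps and the action is taken at step 0."""
    if delay >= horizon:
        # The consequence falls outside the episode, so it is never credited.
        return 0.0
    return gamma ** delay


immediate = credit_weight(0)     # full weight for an immediate reward
delayed = credit_weight(100)     # about 0.366 with gamma = 0.99
truncated = credit_weight(1000)  # beyond the cap, so no reinforcement at all
```

Under these assumptions a consequence 100 steps away counts for roughly a third as much as an immediate one, and anything past the cap counts for nothing.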
How does this compare to evolution? At equilibrium, I think that a gene which increases the fitness of its bearers in N generations’ time is just as strongly favored as a gene that increases the fitness of its bearers by the same amount straightaway. As long as it was already widespread at least N generations ago, they’re basically the same thing, because current gene-holders benefit from the effects of the gene-holders from N generations ago.
That gene would evolve much more slowly, though. Plus in practice it’s hard to ensure that the benefits accrue only to gene-holders, and there’s so much variance in the environment that for N of more than 3 or 4 this seems pretty implausible. Still, the disanalogy seems kinda interesting.
A general principle: if we constrain two neural networks to communicate via natural language, we need some pressure towards ensuring they actually use language in the same sense as humans do, rather than (e.g.) steganographically encoding the information they really care about.
The most robust way to do this: pass the language via a human, who tries to actually understand the language, then does their best to rephrase it according to their own understanding.
What do you lose by doing this? Mainly: you can no longer send messages too complex for humans to understand. Also: in general you can’t backprop through discrete language anyway, but I’d guess there are some tricks for approximating that which don’t work as well when a human is in the loop.
That doesn’t actually solve the problem. The system could just encode the desired information in the semantics of some unrelated sentences—e.g. talk about pasta to indicate X = 0, or talk about rain to indicate X = 1.
Another possible way to provide pressure towards using language in a human-sense way is some form of multi-tasking/multi-agent scenario, inspired by this paper: Multitasking Inhibits Semantic Drift. They show that if you pretrain multiple instructors and instruction executors to understand language in a human-like way (e.g. with supervised labels), and then during training mix the instructors and instruction executors, it makes it difficult to drift from the original semantics, as all the instructors and instruction executors would need to drift in the same direction; equivalently, any local change in semantics would be sub-optimal compared to using language in the semantically correct way. The examples in the paper are on quite toy problems, but I think in principle this could work.
Not being able to send messages too complex for humans to understand seems to me like it’s plausibly a benefit for many of the cases where you’d want to do this.
I believe that humans have already crossed a threshold that, in a certain sense, puts us on an equal footing with any other being who has mastered abstract reasoning. There’s a notion in computing science of “Turing completeness”, which says that once a computer can perform a set of quite basic operations, it can be programmed to do absolutely any calculation that any other computer can do. Other computers might be faster, or have more memory, or have multiple processors running at the same time, but my 1988 Amiga 500 really could be programmed to do anything my 2008 iMac can do — apart from responding to external events in real time — if only I had the patience to sit and swap floppy disks all day long. I suspect that something broadly similar applies to minds and the class of things they can understand: other beings might think faster than us, or have easy access to a greater store of facts, but underlying both mental processes will be the same basic set of general-purpose tools. So if we ever did encounter those billion-year-old aliens, I’m sure they’d have plenty to tell us that we didn’t yet know — but given enough patience, and a very large notebook, I believe we’d still be able to come to grips with whatever they had to say.
Equivocation. “Who’s ‘we’, flesh man?” Even granting the necessary millions or billions of years for a human to sit down and emulate a superintelligence step by step, it is still not the human who understands, but the Chinese room.
I’ve seen this quote before and always find it funny because when I read Greg Egan, I constantly find myself thinking there’s no way I could’ve come up with the ideas he has even if you gave me months or years of thinking time.
Yes, there’s something to that, but you have to be careful if you want to use that as an objection. Maybe you wouldn’t easily think of it, but that doesn’t exclude the possibility of you doing it: you can come up with algorithms you can execute which would spit out Egan-like ideas, like ‘emulate Egan’s brain neuron by neuron’. (If nothing else, there’s always the ol’ dovetail-every-possible-Turing-machine hammer.) Most of these run into computational complexity problems, but that’s the escape hatch Egan (and Scott Aaronson has made a similar argument) leaves himself by caveats like ‘given enough patience, and a very large notebook’. Said patience might require billions of years, and the notebook might be the size of the Milky Way galaxy, but those are all finite numbers, so technically Egan is correct as far as that goes.
Yeah good point—given generous enough interpretation of the notebook my rejection doesn’t hold. It’s still hard for me to imagine that response feeling meaningful in the context but maybe I’m just failing to model others well here.
It’s frustrating how bad dath ilanis (as portrayed by Eliezer) are at understanding other civilisations. They seem to have all dramatically overfit to dath ilan.
To be clear, it’s the type of error which is perfectly sensible for an individual to make, but strange for their whole civilisation to be making (by teaching individuals false beliefs about how tightly constraining their coordination principles are).
The in-universe explanation seems to be that they’ve lost this knowledge as a result of screening off the past. But that seems like a really predictable failure mode which gives them false beliefs about very important topics, so I have trouble imagining it being consistent with the rest of Eliezer’s characterisation of dath ilan.
(FWIW I’ll also note that this is the same type of mistake that I think Eliezer is making when reasoning about AI.)
Tho, to be fair, losing points in universes you don’t expect to happen in order to win points in universes you expect to happen seems like good decision theory.
[I do have a standing wonder about how much of dath ilan is supposed to be ‘the obvious equilibrium’ vs. ‘aesthetic preferences’; I would be pretty surprised if Eliezer thought there was only one fixed point of the relevant coordination functions, and so some of it must be ‘aesthetics’.]
I don’t think dath ilan would try to win points in likely universes by teaching children untrue things, which I claim is what they’re doing.
Also, it’s not clear to me that this would even win them points, because when thinking about designing civilisation (or AGIs) you need to have accurate beliefs about this type of thing. (E.g. imagine dath ilani alignment researchers being like “here are all our principles for understanding intelligence” and then continually being surprised, like Keltham is, about how messy and fractally unprincipled some plausible outcomes are.)
Half-formed musing: what’s the relationship between being a nerd and trusting high-level abstractions? In some sense they seem to be the opposite of each other—nerds focus obsessively on a domain until they understand it deeply, not just at high levels of abstraction. But if I were to give a very brief summary of the rationalist community, it might be: nerds who take very high-level abstractions (such as moloch, optimisation power, the future of humanity) very seriously.
It seems to me that the resolution to the apparent paradox is that nerds are interested in all the details of their domain, but the outcomes that they tend to look for are high-level abstractions. Even in settings like fandoms, there is a big push towards massive theories that entail every little detail about the story.
Though defining rationalist community as a sort of community of meta-nerds who apply this nerd approach to almost anything doesn’t seem too off the mark.
I think you need to unpack “trust” and “take seriously” a little bit to make this assertion. I think nerds are generally (heh) more able to understand the lossiness of models, and to recognize that abstractions are more broadly applicable, but less powerful than specifics.
I wouldn’t say I trust or take seriously the idea of Moloch or the similarities between different optimization mechanisms. I do recognize that those models have a lot of explanatory and predictive power, especially as a head-start (aka “prior”) on domains where I haven’t done the work to understand the exceptions and specifics.
There’s some possible world in which the following approach to interpretability works:
Put an AGI in a bunch of situations where it sometimes is incentivised to lie and sometimes is incentivised to tell the truth.
Train a lie detector which is given all its neural weights as input.
Then ask the AGI lots of questions about its plans.
One problem that this approach would face if we were using it to interpret a human is that the human might not consciously be aware of what their motivations are. For example, they may believe they are doing something for altruistic reasons, when in fact their unconscious motivations are primarily to look good. And the motivations which we are less conscious of are exactly those ones which it’s most disadvantageous for others to know about.
So would using such an interpretability technique on an AGI work? I guess one important question is something like: by default, would the AGI be systematically biased when talking about its plans, like humans are? Or is this something which only arises when there are selection pressures during training for hiding information?
One way we could avoid this problem: instead of a “lie detector”, you could train a “plan identifier”, which takes an AGI brain and tells you what that AGI is going to do in English. I’m a little less optimistic about this, since I think that gathering training data will be the big bottleneck either way, and getting enough data to train a plan identifier that’s smart enough to generalise to a wide range of plans seems pretty tricky. (By contrast, the lie detector might not need to know very much about the *content* of the lies.)
I’ve heard people argue that “most” utility functions lead to agents with strong convergent instrumental goals. This obviously depends a lot on how you quantify over utility functions. Here’s one intuition in the other direction. I don’t expect this to be persuasive to most people who make the argument above (but I’d still be interested in hearing why not).
If a non-negligible percentage of an agent’s actions are random, then to describe it as a utility-maximiser would require an incredibly complex utility function (because any simple hypothesised utility function will eventually be falsified by a random action). And so this generates arbitrarily simple agents whose observed behaviour can only be described as maximising a utility function for arbitrarily complex utility functions (depending on how long you run them).
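A quick sketch of how fast a simple hypothesis gets falsified. It assumes the hypothesised utility function picks out a unique best action at every step, and that random actions are drawn uniformly from `num_actions` options; both are simplifying assumptions of mine, and the function name is hypothetical.

```python
def survival_probability(eps, num_actions, steps):
    """Probability that an agent whose actions are random a fraction `eps`
    of the time has not yet visibly deviated from a hypothesised utility
    function after `steps` actions."""
    # A uniformly random action differs from the hypothesised optimum
    # with probability (k - 1) / k.
    p_deviate = eps * (num_actions - 1) / num_actions
    return (1.0 - p_deviate) ** steps


after_100_steps = survival_probability(eps=0.1, num_actions=2, steps=100)
```

With 10% random actions and two options per step, the chance that the simple hypothesis survives 100 observations is already below 1%, and it keeps shrinking geometrically the longer you run the agent.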
I expect people to respond something like: we need a theory of how to describe agents with bounded cognition anyway. And if you have such a theory, then we could describe the agent above as “maximising simple function U, subject to the boundedness constraint that X% of its actions are random”.
I’m not sure if you consider me to be making that argument, but here are my thoughts: I claim that most reward functions lead to agents with strong convergent instrumental goals. However, I share your intuition that (somehow) uniformly sampling utility functions over universe-histories might not lead to instrumental convergence.
To understand instrumental convergence and power-seeking, consider how many reward functions we might specify automatically imply a causal mechanism for increasing reward. The structure of the reward function implies that more is better, and that there are mechanisms for repeatedly earning points (for example, by showing itself a high-scoring input).
Since the reward function is “simple” (there’s usually not a way to grade exact universe histories), these mechanisms work in many different situations and points in time. It’s naturally incentivized to assure its own safety in order to best leverage these mechanisms for gaining reward. Therefore, we shouldn’t be surprised to see a lot of these simple goals leading to the same kind of power-seeking behavior.
What structure is implied by a reward function?
Additive/Markovian: while a utility function might be over an entire universe-history, reward is often additive over time steps. This is a strong constraint which I don’t always expect to be true, but I think that among the goals with this structure, a greater proportion of them have power-seeking incentives.
Observation-based: while a utility function might be over an entire universe-history, the atom of the reward function is the observation. Perhaps the observation is an input to update a world model, over which we have tried to define a reward function. I think that most ways of doing this lead to power-seeking incentives.
Agent-centric: reward functions are defined with respect to what the agent can observe. Therefore, in partially observable environments, there is naturally a greater emphasis on the agent’s vantage point in the environment.
My theorems apply to the finite, fully observable, Markovian situation.[1] We might not end up using reward functions for more impressive tasks – we might express preferences over incomplete trajectories, for example. The “specify a reward function over the agent’s world model” approach may or may not lead to good subhuman performance in complicated tasks like cleaning warehouses. Imagine specifying a reward function over pure observations for that task – the agent would probably just get stuck looking at a wall in a particularly high-scoring way.
However, for arbitrary utility functions over universe histories, the structure isn’t so simple. With utility functions over universe histories having far more degrees of freedom, arbitrary policies can be rationalized as VNM expected utility maximization. That said, with respect to a simplicity prior over computable utility functions, the power-seeking ones might have most of the measure.
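One standard way to see the rationalization point is the following construction (a sketch, not necessarily the one intended here): given any deterministic policy, define a utility function over histories that is 1 exactly when every action in the history matches the policy. Representing histories as lists of (observation, action) pairs and policies as functions of the current observation are simplifying assumptions for illustration.

```python
def rationalizing_utility(policy):
    """Build a utility function over histories that the given deterministic
    policy maximises. A history is a sequence of (observation, action)
    pairs; `policy` maps an observation to an action."""
    def utility(history):
        # Utility 1 if the history is consistent with the policy, else 0.
        consistent = all(policy(obs) == act for obs, act in history)
        return 1.0 if consistent else 0.0
    return utility


def wall_starer(observation):
    # An arbitrary, useless-looking policy: ignore the observation.
    return "stare_at_wall"


u = rationalizing_utility(wall_starer)
followed = u([("warehouse", "stare_at_wall"), ("wall", "stare_at_wall")])
deviated = u([("warehouse", "clean"), ("wall", "stare_at_wall")])
```

Following the policy always achieves the maximum possible utility of 1, so even the wall-staring agent is an expected utility maximizer for this (highly unnatural) utility function. A simplicity prior penalises such functions because they have to encode the whole policy.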
I claim that most reward functions lead to agents with strong convergent instrumental goals
I think this depends a lot on how you model the agent developing. If you start off with a highly intelligent agent which has the ability to make long-term plans, but doesn’t yet have any goals, and then you train it on a random reward function—then yes, it probably will develop strong convergent instrumental goals.
On the other hand, if you start off with a randomly initialised neural network, and then train it on a random reward function, then probably it will get stuck in a local optimum pretty quickly, and never learn to even conceptualise these things called “goals”.
I claim that when people think about reward functions, they think too much about the former case, and not enough about the latter. Because while it’s true that we’re eventually going to get highly intelligent agents which can make long-term plans, it’s also important that we get to control what reward functions they’re trained on up to that point. And so plausibly we can develop intelligent agents that, in some respects, are still stuck in “local optima” in the way they think about convergent instrumental goals—i.e. they’re missing whatever cognitive functionality is required for being ambitious on a large scale.
Agreed – I should have clarified. I’ve been mostly discussing instrumental convergence with respect to optimal policies. The path through policy space is also important.
Makes sense. For what it’s worth, I’d also argue that thinking about optimal policies at all is misguided (e.g. what’s the optimal policy for humans—the literal best arrangement of neurons we could possibly have for our reproductive fitness? Probably we’d be born knowing arbitrarily large amounts of information. But this is just not relevant to predicting or modifying our actual behaviour at all).
(I now think that you were very right in saying “thinking about optimal policies at all is misguided”, and I was very wrong to disagree. I’ve thought several times about this exchange. Not listening to you about this point was a serious error and made my work way less impactful. I do think that the power-seeking theorems say interesting things, but about eg internal utility functions over an internal planning ontology—not about optimal policies for a reward function.)
We do in fact often train agents using algorithms which are proven to eventually converge to the optimal policy.[1] Even if we don’t expect the trained agents to reach the optimal policy in the real world, we should still understand what behavior is like at optimum. If you think your proposal is not aligned at optimum but is aligned for realistic training paths, you should have a strong story for why.
Formal theorizing about instrumental convergence with respect to optimal behavior is strictly easier than theorizing about ϵ-optimal behavior, which I think is what you want for a more realistic treatment of instrumental convergence for real agents. Even if you want to think about sub-optimal policies, if you don’t understand optimal policies… good luck! Therefore, we also have an instrumental (...) interest in studying the behavior at optimum.
At least, the tabular algorithms are proven, but no one uses those for real stuff. I’m not sure what the results are for function approximators, but I think you get my point.
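For a feel of what convergence looks like in the tabular setting, here is a sketch using value iteration on a toy two-state MDP. Note the swap: value iteration is a planning algorithm that assumes the transition model is known, not one of the RL training algorithms the footnote refers to, and the MDP itself is invented for illustration.

```python
def value_iteration(next_state, reward, gamma=0.9, tol=1e-10):
    """Tabular value iteration for a deterministic MDP.
    next_state[s][a] is the successor state, reward[s][a] the reward.
    Returns (state values, greedy policy)."""
    n_states = len(next_state)
    values = [0.0] * n_states

    def q(s, a, vals):
        # One-step lookahead value of taking action a in state s.
        return reward[s][a] + gamma * vals[next_state[s][a]]

    while True:
        new_values = [
            max(q(s, a, values) for a in range(len(next_state[s])))
            for s in range(n_states)
        ]
        delta = max(abs(n - v) for n, v in zip(new_values, values))
        values = new_values
        if delta < tol:
            break

    policy = [
        max(range(len(next_state[s])), key=lambda a, s=s: q(s, a, values))
        for s in range(n_states)
    ]
    return values, policy


# Toy MDP: action 0 stays put, action 1 moves to the other state.
# Only staying in state 1 is rewarded.
next_state = [[0, 1], [1, 0]]
reward = [[0.0, 0.0], [1.0, 0.0]]
values, policy = value_iteration(next_state, reward)
```

Here the iteration settles on moving out of state 0 and staying in state 1, with values of about 9 and 10 respectively. The guarantee relies on the table having one entry per state, which is exactly what is lost with function approximators.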
1. I think it’s more accurate to say that, because approximately none of the non-trivial theoretical results hold for function approximation, approximately none of our non-trivial agents are proven to eventually converge to the optimal policy. (Also, given the choice between an algorithm without convergence proofs that works in practice, and an algorithm with convergence proofs that doesn’t work in practice, everyone will use the former). But we shouldn’t pay any attention to optimal policies anyway, because the optimal policy in an environment anything like the real world is absurdly, impossibly complex, and requires infinite compute.
2. I think theorizing about ϵ-optimal behavior is more useful than theorizing about optimal behaviour by roughly ϵ, for roughly the same reasons. But in general, clearly I can understand things about suboptimal policies without understanding optimal policies. I know almost nothing about the optimal policy in StarCraft, but I can still make useful claims about AlphaStar (for example: it’s not going to take over the world).
Again, let’s try to cash this out. I give you a human (or, say, the emulation of a human, running in a simulation of the ancestral environment). Is this safe? How do you make it safer? What happens if you keep selecting for intelligence? I think that the theorising you talk about will be actively harmful for your ability to answer these questions.
I’m confused, because I don’t disagree with any specific point you make—just the conclusion. Here’s my attempt at a disagreement which feels analogous to me:
TurnTrout: here’s how spherical cows roll downhill!
ricraz: real cows aren’t spheres.
My response in this “debate” is: if you start with a spherical cow and then consider which real world differences are important enough to model, you’re better off than just saying “no one should think about spherical cows”.
I think that the theorising you talk about will be actively harmful for your ability to answer these questions.
I don’t understand why you think that. If you can have a good understanding of instrumental convergence and power-seeking for optimal agents, then you can consider whether any of those same reasons apply for suboptimal humans.
Considering power-seeking for optimal agents is a relaxed problem. Yes, ideally, we would instantly jump to the theory that formally describes power-seeking for suboptimal agents with realistic goals in all kinds of environments. But before you do that, a first step is understanding power-seeking in MDPs. Then, you can take formal insights from this first step and use them to update your pre-theoretic intuitions where appropriate.
Thanks for engaging despite the opacity of the disagreement. I’ll try to make my position here much more explicit (and apologies if that makes it sound brusque). The fact that your model is a simplified abstract model is not sufficient to make it useful. Some abstract models are useful. Some are misleading and will cause people who spend time studying them to understand the underlying phenomenon less well than they did before. From my perspective, I haven’t seen you give arguments that your models are in the former category not the latter. Presumably you think they are in fact useful abstractions—why? (A few examples of the latter: behaviourism, statistical learning theory, recapitulation theory, Gettier-style analysis of knowledge).
My argument for why they’re overall misleading: when I say that “the optimal policy in an environment anything like the real world is absurdly, impossibly complex, and requires infinite compute”, or that safety researchers shouldn’t think about AIXI, I’m not just saying that these are inaccurate models. I’m saying that they are modelling fundamentally different phenomena than the ones you’re trying to apply them to. AIXI is not “intelligence”, it is brute force search, which is a totally different thing that happens to look the same in the infinite limit. Optimal tabular policies are not skill at a task, they are a cheat sheet, but they happen to look similar in very simple cases.
Probably the best example of what I’m complaining about is Ned Block trying to use Blockhead to draw conclusions about intelligence. I think almost everyone around here would roll their eyes hard at that. But then people turn around and use abstractions that are just as unmoored from reality as Blockhead, often in a very analogous way. (This is less a specific criticism of you, TurnTrout, and more a general criticism of the field).
if you start with a spherical cow and then consider which real world differences are important enough to model, you’re better off than just saying “no one should think about spherical cows”.
Forgive me a little poetic license. The analogy in my mind is that you were trying to model the cow as a sphere, but you didn’t know how to do so without setting its weight as infinite, and what looked to you like your model predicting the cow would roll downhill was actually your model predicting that the cow would swallow up the nearby fabric of spacetime and the bottom of the hill would fall into its event horizon. At which point, yes, you would be better off just saying “nobody should think about spherical cows”.
Thanks for elaborating this interesting critique. I agree we generally need to be more critical of our abstractions.
I haven’t seen you give arguments that your models [of instrumental convergence] are [useful for realistic agents]
Falsifying claims and “breaking” proposals is a classic element of AI alignment discourse and debate. Since we’re talking about superintelligent agents, we can’t predict exactly what a proposal would do. However, if I make a claim (“a superintelligent paperclip maximizer would keep us around because of gains from trade”), you can falsify this by showing that my claimed policy is dominated by another class of policies (“we would likely be comically resource-inefficient in comparison; GFT arguments don’t model dynamics which allow killing other agents and appropriating their resources”).
Even we can come up with this dominant policy class, so the posited superintelligence wouldn’t miss it either. We don’t know what the superintelligent policy will be, but we know what it won’t be (see also Formalizing convergent instrumental goals). Even though I don’t know how Garry Kasparov will open the game, I confidently predict that he won’t let me checkmate him in two moves.
Non-optimal power and instrumental convergence
Instead of thinking about optimal policies, let’s consider the performance of a given algorithm A. A(M,R) takes a rewardless MDP M and a reward function R as input, and outputs a policy.
Definition. Let $\mathcal{R}$ be a continuous distribution over reward functions with CDF F. The average return achieved by algorithm A at state s and discount rate γ is
$$\int_{\mathcal{R}} V^{A(M,R)}_{R}(s,\gamma)\,\mathrm{d}F(R).$$
Instrumental convergence with respect to A’s policies can be defined similarly (“what is the $\mathcal{R}$-measure of a given trajectory under A?”). The theory I’ve laid out allows precise claims, which is a modest benefit to our understanding. Before, we just had intuitions about some vague concept called “instrumental convergence”.
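To make the definition above concrete, here is a minimal Monte Carlo sketch. Everything specific in it is an assumption for illustration: the four-state deterministic MDP, the choice of i.i.d. uniform state rewards as the distribution over reward functions, and the two stand-in algorithms (`optimal_algorithm` and `myopic_algorithm` are hypothetical names, not anything from the original posts). It estimates the average return of an algorithm A at a state by sampling reward functions, running A, and evaluating the resulting policy.

```python
import random

# Hypothetical rewardless MDP M: MDP[s] lists the successor state of each action.
MDP = [
    [1, 2],  # state 0: go to state 1 or state 2
    [1],     # state 1: absorbing
    [2, 3],  # state 2: stay, or move on to state 3
    [3],     # state 3: absorbing
]

def policy_value(mdp, policy, reward, gamma, iters=300):
    """Iteratively evaluate the value of `policy` under `reward` for every state."""
    v = [0.0] * len(mdp)
    for _ in range(iters):
        v = [reward[s] + gamma * v[mdp[s][policy[s]]] for s in range(len(mdp))]
    return v

def optimal_algorithm(mdp, reward, gamma, iters=300):
    """A(M, R) via value iteration: returns a (near-)optimal policy."""
    v = [0.0] * len(mdp)
    for _ in range(iters):
        v = [reward[s] + gamma * max(v[n] for n in mdp[s]) for s in range(len(mdp))]
    return [max(range(len(mdp[s])), key=lambda a: v[mdp[s][a]])
            for s in range(len(mdp))]

def myopic_algorithm(mdp, reward, gamma):
    """A suboptimal A(M, R): always step to the successor with the best immediate reward."""
    return [max(range(len(mdp[s])), key=lambda a: reward[mdp[s][a]])
            for s in range(len(mdp))]

def average_return(algorithm, mdp, state, gamma, n_samples=1000, seed=0):
    """Monte Carlo estimate of the average return of `algorithm` at `state`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Assumed reward distribution: i.i.d. uniform[0, 1] reward per state.
        reward = [rng.random() for _ in range(len(mdp))]
        policy = algorithm(mdp, reward, gamma)
        total += policy_value(mdp, policy, reward, gamma)[state]
    return total / n_samples

if __name__ == "__main__":
    for name, algo in [("optimal", optimal_algorithm), ("myopic", myopic_algorithm)]:
        print(name, average_return(algo, MDP, state=0, gamma=0.9))
```

Because both estimates use the same seed, they are computed on the same sampled reward functions, so the gap between them reflects only the difference between the two algorithms.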
Here’s bad reasoning, which implies that the cow tears a hole in spacetime:
Suppose the laws of physics bestow godhood upon an agent executing some convoluted series of actions; in particular, this allows avoiding heat death. Clearly, it is optimal for the vast majority of agents to instantly become god.
The problem is that it’s impractical to predict what a smarter agent will do, or which specific kinds of action will be instrumentally convergent for A, and we can’t assume that the real agent would be infinitely smart. Just because it’s smart doesn’t mean it’s omniscient, as you rightly point out.
Here’s better reasoning:
Suppose that the MDP modeling the real world represents shutdown as a single terminal state. Most optimal agents don’t allow themselves to be shut down. Furthermore, since we can see that most goals offer better reward at non-shutdown states, superintelligent A can as well.[1] While I don’t know exactly what A will tend to do, I predict that policies generated by A will tend to resist shutdown.
It might seem like I’m assuming the consequent here. This is not so – the work is first done by the theorems on optimal behavior, which do imply that most goals achieve greater return by avoiding shutdown. The question is whether reasonably intelligent suboptimal agents realize this fact. Given a uniformly drawn reward function, we can usually come up with a better policy than dying, so the argument is that A can as well.
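A toy version of the shutdown argument can be sketched as follows. This is not the theorems themselves, just an illustration under loud assumptions: shutdown is one absorbing state, the alternatives are also absorbing states, rewards are drawn i.i.d. uniform per state, and the function names are made up for this example. It counts how often an optimal agent's first move avoids shutdown.

```python
import random

def optimal_first_move(next_states, reward, gamma, start=0, iters=200):
    """Value iteration on a deterministic MDP; returns the optimal successor of `start`."""
    v = {s: 0.0 for s in next_states}
    for _ in range(iters):
        v = {s: reward[s] + gamma * max(v[n] for n in next_states[s])
             for s in next_states}
    return max(next_states[start], key=lambda n: v[n])

def fraction_avoiding_shutdown(n_alternatives, gamma=0.9, n_samples=1000, seed=0):
    """Fraction of sampled reward functions whose optimal policy avoids shutdown."""
    shutdown = 1
    others = list(range(2, 2 + n_alternatives))
    # State 0 is the start; shutdown and every alternative are absorbing (assumed).
    next_states = {0: [shutdown] + others, shutdown: [shutdown]}
    for s in others:
        next_states[s] = [s]
    rng = random.Random(seed)
    avoided = 0
    for _ in range(n_samples):
        reward = {s: rng.random() for s in next_states}
        if optimal_first_move(next_states, reward, gamma) != shutdown:
            avoided += 1
    return avoided / n_samples

if __name__ == "__main__":
    for k in (1, 2, 5):
        print(k, "alternatives:", fraction_avoiding_shutdown(k))
```

In this toy setup the optimal agent simply heads for whichever absorbing state has the highest reward, so shutdown is avoided roughly k/(k+1) of the time with k alternatives. The more options the non-shutdown part of the world offers, the larger the share of reward functions that prefer it.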
I’m afraid I’m mostly going to disengage here, since it seems more useful to spend the time writing up more general + constructive versions of my arguments, rather than critiquing a specific framework.
If I were to sketch out the reasons I expect to be skeptical about this framework if I looked into it in more detail, it’d be something like:
1. Instrumental convergence isn’t training-time behaviour, it’s test-time behaviour. It isn’t about increasing reward, it’s about achieving goals (that the agent learned by being trained to increase reward).
2. The space of goals that agents might learn is very different from the space of reward functions. As a hypothetical, maybe it’s the case that neural networks are just really good at producing deontological agents, and really bad at producing consequentialists. (E.g, if it’s just really really difficult for gradient descent to get a proper planning module working). Then agents trained on almost all reward functions will learn to do well on them without developing convergent instrumental goals. (I expect you to respond that being deontological won’t get you to optimality. But I would say that talking about “optimality” here ruins the abstraction, for reasons outlined in my previous comment).
I expect you to respond that being deontological won’t get you to optimality. But I would say that talking about “optimality” here ruins the abstraction, for reasons outlined in my previous comment
I was actually going to respond, “that’s a good point, but (IMO) a different concern than the one you initially raised”. I see you making two main critiques.
(paraphrased) “A won’t produce optimal policies for the specified reward function [even assuming alignment generalization off of the training distribution], so your model isn’t useful” – I replied to this critique above.
“The space of goals that agents might learn is very different from the space of reward functions.” I agree this is an important part of the story. I think the reasonable takeaway is “current theorems on instrumental convergence help us understand what superintelligent A won’t do, assuming no reward-result gap. Since we can’t assume alignment generalization, we should keep in mind how the inductive biases of gradient descent affect the eventual policy produced.”
I remain highly skeptical of the claim that applying this idealized theory of instrumental convergence worsens our ability to actually reason about it.
ETA: I read some information you privately messaged me, and I see why you might see the above two points as a single concern.
We do in fact often train agents using algorithms which are proven to eventually converge to the optimal policy.[1]
At least, the tabular algorithms are proven, but no one uses those for real stuff. I’m not sure what the results are for function approximators, but I think you get my point.
Is the point that people try to use algorithms which they think will eventually converge to the optimal policy? (Assuming there is one.)
And so this generates arbitrarily simple agents whose observed behaviour can only be described as maximising a utility function for arbitrarily complex utility functions (depending on how long you run them).
I object to the claim that agents that act randomly can be made “arbitrarily simple”. Randomness is basically definitionally complicated!
Eh, this seems a bit nitpicky. It’s arbitrarily simple given a call to a randomness oracle, which in practice we can approximate pretty easily. And it’s “definitionally” easy to specify as well: “the function which, at each call, returns true with 50% likelihood and false otherwise.”
If you get an ‘external’ randomness oracle, then you could define the utility function pretty simply in terms of the outputs of the oracle.
If the agent has a pseudo-random number generator (PRNG) inside it, then I suppose I agree that you aren’t going to be able to give it a utility function that has the standard set of convergent instrumental goals, and PRNGs can be pretty short. (Well, some search algorithms are probably shorter, but I bet they have higher Kt complexity, which is probably a better measure for agents)
If a reasonable percentage of an agent’s actions are random, then to describe it as a utility-maximiser would require an incredibly complex utility function (because any simple hypothesised utility function will eventually be falsified by a random action).
I’d take a different tack here, actually; I think this depends on what the input to the utility function is. If we’re only allowed to look at ‘atomic reality’, or the raw actions the agent takes, then I think your analysis goes through, that we have a simple causal process generating the behavior but need a very complicated utility function to make a utility-maximizer that matches the behavior.
But if we’re allowed to decorate the atomic reality with notes like “this action was generated randomly”, then we can have a utility function that’s as simple as the generator, because it just counts up the presence of those notes. (It doesn’t seem to me like this decorator is meaningfully more complicated than the thing that gave us “agents taking actions” as a data source, so I don’t think I’m paying too much here.)
This can lead to a massive explosion in the number of possible utility functions (because there’s a tremendous number of possible decorators), but I think this matches the explosion that we got by considering agents that were the outputs of causal processes in the first place. That is, consider reasoning about python code that outputs actions in a simple game, where there are many more possible python programs than there are possible policies in the game.
So in general you can’t have utility functions that are as simple as the generator, right? E.g. the generator could be deontological. In which case your utility function would be complicated. Or it could be random, or it could choose actions by alphabetical order, or...
And so maybe you can have a little note for each of these. But now what it sounds like is: “I need my notes to be able to describe every possible cognitive algorithm that the agent could be running”. Which seems very very complicated.
I guess this is what you meant by the “tremendous number” of possible decorators. But if that’s what you need to do to keep talking about “utility functions”, then it just seems better to acknowledge that they’re broken as an abstraction.
E.g. in the case of python code, you wouldn’t do anything analogous to this. You would just try to reason about all the possible python programs directly. Similarly, I want to reason about all the cognitive algorithms directly.
I realized my grandparent comment is unclear here:
but need a very complicated utility function to make a utility-maximizer that matches the behavior.
This should have been “consequence-desirability-maximizer” or something, since the whole question is “does my utility function have to be defined in terms of consequences, or can it be defined in terms of arbitrary propositions?”. If I want to make the deontologist-approximating Innocent-Bot, I have a terrible time if I have to specify the consequences that correspond to the bot being innocent and the consequences that don’t, but if you let me say “Utility = 0 - badness of sins committed” then I’ve constructed a ‘simple’ deontologist. (At least, about as simple as the bot that says “take random actions that aren’t sins”, since both of them need to import the sins library.)
In general, I think it makes sense to not allow this sort of elaboration of what we mean by utility functions, since the behavior we want to point to is the backwards assignment of desirability to actions based on the desirability of their expected consequences, rather than the expectation of any arbitrary property.
---
Actually, I also realized something about your original comment which I don’t think I had the first time around; if by “some reasonable percentage of an agent’s actions are random” you mean something like “the agent does epsilon-exploration” or “the agent plays an optimal mixed strategy”, then I think it doesn’t at all require a complicated utility function to generate identical behavior. Like, in the rock-paper-scissors world, and with the simple function ‘utility = number of wins’, the expected utility maximizing move (against tough competition) is to throw randomly, and we won’t falsify the simple ‘utility = number of wins’ hypothesis by observing random actions.
Instead I read it as something like “some unreasonable percentage of an agent’s actions are random”, where the agent is performing some simple-to-calculate mixed strategy that is either suboptimal or only optimal by luck (when the optimal mixed strategy is the maxent strategy, for example), and matching the behavior with an expected utility maximizer is a challenge (because your target has to be not some fact about the environment, but some fact about the statistical properties of the actions taken by the agent).
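The rock-paper-scissors point above can be checked with a small sketch. It assumes “tough competition” means an opponent who always picks the reply that minimizes our expected wins, and it searches a coarse grid of mixed strategies; the function names are hypothetical. With ‘utility = number of wins’, the strategy with the best worst-case expected utility is the uniformly random one.

```python
# BEATS[x] is the move that x defeats.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}
MOVES = list(BEATS)

def expected_wins(strategy, opponent_move):
    """Probability that a move drawn from `strategy` beats `opponent_move`."""
    return sum(p for move, p in strategy.items() if BEATS[move] == opponent_move)

def worst_case_wins(strategy):
    """Expected wins per round against the opponent reply that hurts us most."""
    return min(expected_wins(strategy, o) for o in MOVES)

def best_grid_strategy(n=30):
    """Search mixed strategies with probabilities in multiples of 1/n."""
    best, best_value = None, -1.0
    for i in range(n + 1):
        for j in range(n + 1 - i):
            strategy = {"rock": i / n, "paper": j / n, "scissors": (n - i - j) / n}
            value = worst_case_wins(strategy)
            if value > best_value:
                best, best_value = strategy, value
    return best

if __name__ == "__main__":
    print(best_grid_strategy())
```

So observing an agent throw randomly here is exactly what the simple utility hypothesis predicts, rather than evidence against it.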
---
I think this is where the original intuition becomes uncompelling. We care about utility-maximizers because they’re doing their backwards assignment, using their predictions of the future to guide their present actions to try to shift the future to be more like what they want it to be. We don’t necessarily care about imitators, or simple-to-write bots, or so on. And so if I read the original post as “the further a robot’s behavior is from optimal, the less likely it is to demonstrate convergent instrumental goals”, I say “yeah, sure, but I’m trying to build smart robots (or at least reasoning about what will happen if people try to).”
Instead I read it as something like “some unreasonable percentage of an agent’s actions are random”
This is in fact the intended reading, sorry for ambiguity. Will edit. But note that there are probably very few situations where exploring via actual randomness is best; there will almost always be some type of exploration which is more favourable. So I don’t think this helps.
We care about utility-maximizers because they’re doing their backwards assignment, using their predictions of the future to guide their present actions to try to shift the future to be more like what they want it to be.
To be pedantic: we care about “consequence-desirability-maximisers” (or in Rohin’s terminology, goal-directed agents) because they do backwards assignment. But I think the pedantry is important, because people substitute utility-maximisers for goal-directed agents, and then reason about those agents by thinking about utility functions, and that just seems incorrect.
And so if I read the original post as “the further a robot’s behavior is from optimal, the less likely it is to demonstrate convergent instrumental goals”
What do you mean by optimal here? The robot’s observed behaviour will be optimal for some utility function, no matter how long you run it.
To be pedantic: we care about “consequence-desirability-maximisers” (or in Rohin’s terminology, goal-directed agents) because they do backwards assignment.
Valid point.
But I think the pedantry is important, because people substitute utility-maximisers for goal-directed agents, and then reason about those agents by thinking about utility functions, and that just seems incorrect.
This also seems right. Like, my understanding of what’s going on here is we have:
‘central’ consequence-desirability-maximizers, where there’s a simple utility function that they’re trying to maximize according to the VNM axioms
‘general’ consequence-desirability-maximizers, where there’s a complicated utility function that they’re trying to maximize, which is selected because it imitates some other behavior
The first is a narrow class, and depending on how strict you are with ‘maximize’, quite possibly no physically real agents will fall into it. The second is a universal class, which instantiates the ‘trivial claim’ that everything is utility maximization.
Put another way, the first is what happens if you hold utility fixed / keep utility simple, and then examine what behavior follows; the second is what happens if you hold behavior fixed / keep behavior simple, and then examine what utility follows.
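The two directions described above can be sketched in a few lines. This is a toy illustration with assumed names and an assumed three-action world: in the ‘central’ direction we fix a simple utility and see what behavior follows; in the ‘general’ direction we fix an arbitrary observed behavior and construct a utility function that it maximizes.

```python
import itertools
import random

ACTIONS = ("left", "right", "wait")

def simple_utility(actions):
    """'Central' case: a fixed, simple utility (count of "right" moves)."""
    return sum(1 for a in actions if a == "right")

def rationalizing_utility(observed):
    """'General' case: a utility built to be maximized by the observed behavior.

    Note that the utility has to store the entire observed sequence, so its
    description grows with the length of the behavior it rationalizes.
    """
    observed = tuple(observed)
    def utility(actions):
        return 1.0 if tuple(actions) == observed else 0.0
    return utility

def best_behavior(utility, horizon):
    """Brute-force the utility-maximizing action sequence of a given length."""
    return max(itertools.product(ACTIONS, repeat=horizon), key=utility)

if __name__ == "__main__":
    horizon = 6
    # Hold utility fixed and simple, then examine what behavior follows.
    print(best_behavior(simple_utility, horizon))
    # Hold (arbitrary) behavior fixed, then examine what utility follows.
    rng = random.Random(0)
    observed = tuple(rng.choice(ACTIONS) for _ in range(horizon))
    u = rationalizing_utility(observed)
    print(best_behavior(u, horizon) == observed)
```

The second construction always succeeds, which is why the ‘general’ class is universal and why it cannot do any predictive work on its own.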
Distance from the first is what I mean by “the further a robot’s behavior is from optimal”; I want to say that I should have said something like “VNM-optimal” but actually I think it needs to be closer to “simple utility VNM-optimal.”
I think you’re basically right in calling out a bait-and-switch that sometimes happens, where anyone who wants to talk about the universality of expected utility maximization in the trivial ‘general’ sense can’t get it to do any work, because it should all add up to normality, and in normality there’s a meaningful distinction between people who sort of pursue fuzzy goals and ruthless utility maximizers.
I feel kinda frustrated whenever “shard theory” comes up in a conversation, because it’s not a theory, or even a hypothesis. In terms of its literal content, it basically seems to be a reframing of the “default” stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is “assume they’re just a set of heuristics”.
This is a particular pity because I think there’s a version of the “shard” framing which would actually be useful, but which shard advocates go out of their way to avoid. Specifically: we should be interested in “subagents” which are formed via hierarchical composition of heuristics and/or lower-level subagents, and which are increasingly “goal-directed” as you go up the hierarchy. This is an old idea, FWIW; e.g. it’s how Minsky frames intelligence in Society of Mind. And it’s also somewhat consistent with the claim made in the original shard theory post, that “shards are just collections of subshards”.
The problem is the “just”. The post also says “shards are not full subagents”, and that “we currently estimate that most shards are ‘optimizers’ to the extent that a bacterium or a thermostat is an optimizer.” But the whole point of thinking about shards, in my mind, is that it allows us to talk about a gradual spectrum from “heuristic” to “agent”, and how the combination of low-level heuristics may in fact give rise to high-level agents which pursue consequentialist goals. I talk about this in my post on value systematization—e.g. using the example of how normal human moral “shards” (like caring about other people’s welfare) can aggregate into highly-consequentialist utilitarian subagents. In other words, shard advocates seem so determined to rebut the “rational EU maximizer” picture that they’re ignoring the most interesting question about shards—namely, how do rational agents emerge from collections of shards?
(I make a similar point in the appendix of my value systematization post.)
I am not as negative on it as you are—it seems an improvement over the ‘Bag O’ Heuristics’ model and the ‘expected utility maximizer’ model. But I agree with the critique and said something similar here:
Alex Turner replied with this:
Personally, I’m not ignoring that question, and I’ve written about it (once) in some detail. Less relatedly, I’ve talked about possible utility function convergence via e.g. A shot at the diamond-alignment problem and my recent comment thread with Wei_Dai.
It’s not that there isn’t more shard theory content which I could write, it’s that I got stuck and burned out before I could get past the 101-level content.
I felt
a) gaslit by “I think everyone already knew this” or even “I already invented this a long time ago” (by people who didn’t seem to understand it); and that
b) I wasn’t successfully communicating many intuitions;[1] and
c) it didn’t seem as important to make theoretical progress anymore, especially since I hadn’t even empirically confirmed some of my basic suspicions that real-world systems develop multiple situational shards (as I later found evidence for in Understanding and controlling a maze-solving policy network).
So I didn’t want to post much on the site anymore because I was sick of it, and decided to just get results empirically.
I’ve always read “assume heuristics” as expecting more of an “ensemble of shallow statistical functions” than “a bunch of interchaining and interlocking heuristics from which intelligence is gradually constructed.” Note that (at least in my head) the shard view is extremely focused on how intelligence (including agency) is comprised of smaller shards, and the developmental trajectory over which those shards formed.
The 2022 review indicates that more people appreciated the shard theory posts than I realized at the time.
FWIW I’m potentially interested in interviewing you (and anyone else you’d recommend) and then taking a shot at writing the 101-level content myself.
Curious to hear whether I was one of the people who contributed to this.
Nope! I have basically always enjoyed talking with you, even when we disagree.
Ok, whew, glad to hear.
But shard theorists mainly aim to address agency obtained via DPO-like setups, and @TurnTrout has mathematically proved that such setups don’t favor the power-seeking drives AI safety researchers are usually concerned about in the context of agency.
I read the section you linked, but I can’t follow it. Anyway, here is its concluding paragraph:
From this alone, I get the impression that he hasn’t proved that “there isn’t instrumental convergence”, but that “there isn’t a totally general instrumental convergence that applies even to very wild utility functions”.
A key part of instrumental convergence is the convergence aspect, which as I understand it refers to the notion that even very wild utility functions will share certain preferences. E.g. the empirical tendency for random chess board evaluations to prefer mobility. If you don’t have convergence, you don’t have instrumental convergence.
Ok. Then I’ll say that randomly assigned utilities over full trajectories are beyond wild!
The basin of attraction just needs to be large enough. AIs will intentionally be created with more structure than that.
The issue isn’t the “full trajectories” part; that actually makes instrumental convergence stronger. The issue is the “actions” part. In terms of RLHF, what this means is that people might not simply blindly follow the instructions given by AIs and rate them based on the ultimate outcome (even if the outcome differs wildly from what they’d intuitively think it’d do), but rather they might think about the instructions the AIs provide, and rate them based on whether they a priori make sense. If the AI then has some galaxybrained method of achieving something (which traditionally would be instrumentally convergent) that humans don’t understand, then that method will be negatively reinforced (because people don’t see the point of it and therefore downvote it), which eliminates dangerous powerseeking.
One fairly strong belief of mine is that Less Wrong’s epistemic standards are not high enough to make solid intellectual progress here. So far my best effort to make that argument has been in the comment thread starting here. Looking back at that thread, I just noticed that a couple of those comments have been downvoted to negative karma. I don’t think any of my comments have ever hit negative karma before; I find it particularly sad that the one time it happens is when I’m trying to explain why I think this community is failing at its key goal of cultivating better epistemics.
There’s all sorts of arguments to be made here, which I don’t have time to lay out in detail. But just step back for a moment. Tens or hundreds of thousands of academics are trying to figure out how the world works, spending their careers putting immense effort into reading and producing and reviewing papers. Even then, there’s a massive replication crisis. And we’re trying to produce reliable answers to much harder questions by, what, writing better blog posts, and hoping that a few of the best ideas stick? This is not what a desperate effort to find the truth looks like.
It seems to me that maybe this is what a certain stage in the desperate effort to find the truth looks like?
Like, the early stages of intellectual progress look a lot like thinking about different ideas and seeing which ones stand up robustly to scrutiny. Then the best ones can be tested more rigorously and their edges refined through experimentation.
It seems to me like there needs to be some point in the desperate search for truth in which you’re allowing for half-formed thoughts and unrefined hypotheses, or else you simply never get to a place where the hypotheses you’re creating even brush up against the truth.
In the half-formed thoughts stage, I’d expect to see a lot of literature reviews, agendas laying out problems, and attempts to identify and question fundamental assumptions. I expect that (not blog-post-sized speculation) to be the hard part of the early stages of intellectual progress, and I don’t see it right now.
Perhaps we can split this into technical AI safety and everything else. Above I’m mostly speaking about the “everything else” that Less Wrong wants to solve, since AI safety is now a substantial enough field that its problems need to be solved in more systemic ways.
I would expect that later in the process. Agendas laying out problems and fundamental assumptions don’t spring from nowhere (at least for me), they come from conversations where I’m trying to articulate some intuition, and I recognize some underlying pattern. The pattern and structure doesn’t emerge spontaneously, it comes from trying to pick around the edges of a thing, get thoughts across, explain my intuitions and see where they break.
I think it’s fair to say that crystallizing these patterns into a formal theory is a “hard part”, but the foundation for making it easy is laid out in the floundering and flailing that came before.
[Deleted]
Ironically, some people already feel threatened by the high standards here. Setting them higher probably wouldn’t result in more good content. It would result in less mediocre content, but probably also less good content, as the authors who sometimes write a mediocre article and sometimes a good one, would get discouraged and give up.
Ben Pace gives a few examples of great content in the next comment. It would be better if the good content were easier to separate from the rest, but that’s what the reviews are for. Well, only one review so far, if I remember correctly. I would love to see reviews of pre-2018 content (maybe multiple years in one review, if they were less productive). Then I would love to see the winning content get the same treatment as the Sequences: edit them, arrange them into a book, and make it “required reading” for the community (available as a free PDF).
[Deleted]
The top posts in the 2018 Review are filled with fascinating and well-explained ideas. Many of the new ideas are not settled science, but they’re quite original and substantive, or excellent distillations of settled science, and are often the best piece of writing on the internet about their topics.
You’re wrong about LW epistemic standards not being high enough to make solid intellectual progress; we already have. On AI alone (which I am using in large part because there’s vaguely more consensus around it than around rationality), I think you wouldn’t have seen almost any of the public write-ups (like Embedded Agency and Zhukeepa’s Paul FAQ) without LessWrong, and I think a lot of them are brilliant.
I’m not saying we can’t do far better, or that we’re sufficiently good. Many of the examples of success so far are “Things that were in people’s heads but didn’t have a natural audience to share them with”. There’s not a lot of collaboration at present, which is why I’m very keen to build the new LessWrong Docs that allows for better draft sharing and inline comments and more. We’re working on the tools for editing tags, things like edit histories and so on, that will allow us to build a functioning wiki system to have canonical writeups and explanation that people add to and refine. I want future iterations of the LW Review to have more allowance for incorporating feedback from reviewers. There’s lots of work to do, and we’re just getting started. But I disagree the direction isn’t “a desperate effort to find the truth”. That’s what I’m here for.
Even in the last month or two, how do you look at things like this and this and this and this and not think that they’re likely the best publicly available pieces of writing in the world about their subjects? Wrt rationality, I expect things like this and this and this and this will probably go down as historically important LW posts that helped us understand the world, and make a strong showing in the 2020 LW Review.
As mentioned in my reply to Ruby, this is not a critique of the LW team, but of the LW mentality. And I should have phrased my point more carefully—“epistemic standards are too low to make any progress” is clearly too strong a claim, it’s more like “epistemic standards are low enough that they’re an important bottleneck to progress”. But I do think there’s a substantive disagreement here. Perhaps the best way to spell it out is to look at the posts you linked and see why I’m less excited about them than you are.
Of the top posts in the 2018 review, and the ones you linked (excluding AI), I’d categorise them as follows:
Interesting speculation about psychology and society, where I have no way of knowing if it’s true:
Local Validity as a Key to Sanity and Civilization
The Loudest Alarm Is Probably False
Anti-social punishment (which is, unlike the others, at least based on one (1) study).
Babble
Intelligent social web
Unrolling social metacognition
Simulacra levels
Can you keep this secret?
Same as above but it’s by Scott so it’s a bit more rigorous and much more compelling:
Is Science Slowing Down?
The tails coming apart as a metaphor for life
Useful rationality content:
Toolbox-thinking and law-thinking
A sketch of good communication
Varieties of argumentative experience
Review of basic content from other fields. This seems useful for informing people on LW, but not actually indicative of intellectual progress unless we can build on them to write similar posts on things that *aren’t* basic content in other fields:
Voting theory primer
Prediction markets: when do they work
Costly coordination mechanism of common knowledge (Note: I originally said I hadn’t seen many examples of people building on these ideas, but at least for this post there seems to be a lot.)
Six economics misconceptions
Swiss political system
It’s pretty striking to me how much the original sequences drew on the best academic knowledge, and how little most of the things above draw on the best academic knowledge. And there’s nothing even close to the thoroughness of Luke’s literature reviews.
The three things I’d like to see more of are:
1. The move of saying “Ah, this is interesting speculation about a complex topic. It seems compelling, but I don’t have good ways of verifying it; I’ll treat it like a plausible hypothesis which could be explored more by further work.” (I interpret the thread I originally linked as me urging Wei to do this).
2. Actually doing that follow-up work. If it’s an empirical hypothesis, investigating empirically. If it’s a psychological hypothesis, does it apply to anyone who’s not you? If it’s more of a philosophical hypothesis, can you identify the underlying assumptions and the ways it might be wrong? In all cases, how does it fit into existing thought? (That’ll probably take much more than a single blog post).
3. Insofar as many of these scattered plausible insights are actually related in deep ways, trying to combine them so that the next generation of LW readers doesn’t have to separately learn about each of them, but can rather download a unified generative framework.
(Thanks for laying out your position in this level of depth. Sorry for how long this comment turned out. I guess I wanted to back up a bunch of my agreement with words. It’s a comment for the sake of everyone else, not just you.)
I think there’s something to what you’re saying, that the mentality itself could be better. The Sequences have been criticized because Eliezer didn’t cite previous thinkers all that much, but at least as far as the science goes, as you said, he was drawing on academic knowledge. I also think we’ve lost something precious with the absence of epic topic reviews by the likes of Luke. Kaj Sotala still draws heavily on outside knowledge, John Wentworth did a great review on Biological Circuits, and we get SSC crossposts that have that, but otherwise posts aren’t heavily referencing or building upon outside stuff. I concede that I would like to see a lot more of that.
I think Kaj was rightly disappointed that he didn’t get more engagement with his post whose gist was “this is what the science really says about S1 & S2, one of your most cherished concepts, LW community”.
I wouldn’t say the typical approach is strictly bad. There’s value in thinking freshly for oneself, and failing to reference previous material shouldn’t be a crime or make a text unworthy. But yeah, it’d be pretty cool if, after Alkjash laid out Babble & Prune (which intuitively feels so correct), someone had dug through what empirical science we have to see whether the picture lines up. Or heck, actually gone and done some kind of experiment. I bet it would turn up something interesting.
And I think what you’re saying is that the issue isn’t just that people aren’t following up with scholarship and empiricism on new ideas and models, but that they’re actually forgetting that these are the next steps. Instead, they’re overconfident in our homegrown models, as though LessWrong were the one place able to come up with good ideas. (Sorry, some of this might be my own words.)
The category I’d label a lot of LessWrong posts with is “engaging articulation of a point which is intuitive in hindsight” / “creation of common vocabulary around such points”. That’s pretty valuable, but I do think solving the hardest problems will take more.
-----
You use the word “reliably” in a few places. It feels like it’s doing some work in your statements, and I’m not entirely sure what you mean or why it’s important.
-----
A model which is interesting but maybe not of obvious connection. I was speaking to a respected rationalist thinker this week and they classified potential writing on LessWrong into three categories:
Writing stuff to help oneself figure things out. Like a diary, but publicly shared.
People exchanging “letters” as they attempt to figure things out. Like old school academic journals.
Someone having something mostly figured out but with a large inferential distance to bridge. They write a large collection of posts trying to cover that distance. One example is The Sequences, and more recent examples are from John Wentworth and Kaj Sotala.
I mention this because I recall you (alongside the rationalist thinker) complaining about the lack of people “presenting their worldviews on LessWrong”.
The kinds of epistemic norms I think you’re advocating for feel like a natural fit for the 2nd kind of writing, but it’s less clear to me how they should apply to people presenting worldviews. Maybe it’s not more complicated than this: it’s fine to present your worldview without a tonne of evidence, but people shouldn’t forget that the evidence hasn’t been presented, and that it feeling intuitively correct isn’t enough.
-----
There’s something in here about Epistemic Modesty, something, something. Some part of me reads you as calling for more of that, which I’m wary of, but I don’t currently have more to say than flagging it as maybe a relevant variable in any disagreements here.
We probably do disagree about the value of academic sources, or what it takes to get value from them. Hmm. Maybe it’s something like there’s something to be said for thinking about models and assessing their plausibility yourself rather than relying on likely very flawed empirical studies.
Maybe I’m in favor of large careful reviews of what science knows but less in favor of trying to find sources for each idea or model that gets raised. I’m not sure.
-----
I can’t recall whether I’ve written publicly much about this, but a model I’ve had for a year or more is that for LW to make intellectual progress, we need to become a “community of practice”, not just a “community of interest”. Martial arts vs literal stamp collecting. (Streetfighting might be better still, since it actually tests real fighting ability.) It’s great that many people find LessWrong a guilty pleasure they feel less guilty about than Facebook, but for us to make progress, people need to see LessWrong as a place where one of the things you do is show up and do Serious Work, some of which is relatively hard and boring, like writing and reading lit reviews.
I suspect that a cap on the epistemic standards people hold stuff to is downstream of the level of effort people are calibrated on applying. But maybe it goes in the other direction, so I don’t know.
Probably the 2018 Review is biased towards the posts which are most widely read, i.e., those easiest and most enjoyable to read, rather than solely rewarding those with the best contributions. Not overwhelmingly, but enough. Maybe same for karma. I’m not sure how to relate to that.
-----
This sounds partially like distillation work plus extra integration. And sounds pretty good to me too.
-----
I still remember my feeling of disillusionment in the LessWrong community relatively soon after I joined in late 2012. I realized that the bulk of members didn’t seem serious about advancing the Art. I never heard people discussing new results from cognitive science and how to apply them, even though that’s what the Sequences were in large part, and the Sequences hardly claimed to be complete! I guess I do relate somewhat to your “desperate effort” comment, though we’ve got some people trying pretty hard that I wouldn’t want to shortchange.
We do good stuff, but more is possible. I appreciate the reminder. I hope we succeed at pushing the culture and mentality in directions you like.
This is only tangentially relevant, but adding it here as some of you might find it interesting:
Venkatesh Rao has an excellent Twitter thread on why most independent research only reaches this kind of initial exploratory level (he tried it for a bit before moving to consulting). It’s pretty pessimistic, but there is a somewhat more optimistic follow-up thread on potential new funding models. Key point is that the later stages are just really effortful and time-consuming, in a way that keeps out a lot of people trying to do this as a side project alongside a separate main job (which I think is the case for a lot of LW contributors?)
Quote from that thread:
Also just wanted to say good luck! I’m a relative outsider here with pretty different interests to LW core topics but I do appreciate people trying to do serious work outside academia, have been trying to do this myself, and have thought a fair bit about what’s currently missing (I wrote that in a kind of jokey style but I’m serious about the topic).
Thanks, these links seem great! I think this is a good (if slightly harsh) way of making a similar point to mine:
“I find that autodidacts who haven’t experienced institutional R&D environments have a self-congratulatory low threshold for what they count as research. It’s a bit like vanity publishing or fan fiction. This mismatch doesn’t exist as much in indie art, consulting, game dev etc”
Also, I liked your blog post! More generally, I strongly encourage bloggers to have a “best of” page, or something that directs people to good posts. I’d be keen to read more of your posts but have no idea where to start.
Thanks! I have been meaning to add a ‘start here’ page for a while, so that’s good to have the extra push :) Seems particularly worthwhile in my case because a) there’s no one clear theme and b) I’ve been trying a lot of low-quality experimental posts this year bc pandemic trashed motivation, so recent posts are not really reflective of my normal output.
For now some of my better posts in the last couple of years might be Cognitive decoupling and banana phones (tracing back the original precursor of Stanovich’s idea), The middle distance (a writeup of a useful and somewhat obscure idea from Brian Cantwell Smith’s On the Origin of Objects), and the negative probability post and its followup.
Quoting your reply to Ruby below, I agree I’d like LessWrong to be much better at “being able to reliably produce and build on good ideas”.
The reliability and focus feel most lacking to me on the building side, rather than the production, which I think we’re doing quite well at. I think we’ve successfully formed a publishing platform that provides an audience who are intensely interested in good ideas around rationality, AI, and related subjects, and a lot of very generative and thoughtful people are writing down their ideas here.
We’re low on the ability to connect people up to do more extensive work on these ideas – most good hypotheses and arguments don’t get a great deal of follow up or further discussion.
Here are some subjects where I think there’s been various people sharing substantive perspectives, but I think there’s also a lot of space for more ‘details’ to get fleshed out and subquestions to be cleanly answered:
Sabbath and Rest Days (Zvi, Lauren Lee, Jacobian, Scott)
Moloch and Slack and Mazes (Scott, Eliezer, Zvi, Swentworth, Jameson)
Inner/Outer Alignment (EvHub, Rafael, Paul, Swentworth, Steve2152)
Embedded Agency + Optimization (Abram, Scott, Swentworth, Alex Flint, nostalgebraist)
Simulacra Levels (Benquo, Zvi, Elizabeth)
AI Takeoff (Paul, Katja, Kokotajlo, Zhukeepa)
Iterated Amplification (Paul, EvHub, Zhukeepa, Vaniver, William S, Wei Dai)
Insight meditation + IFS (Kaj, Kaj, Kaj, and Kaj. Also Abram and Val and Romeo and Scott)
Coordination Problems (Eliezer, Scott, Sustrik, Swentworth, Zvi, me)
The above isn’t complete, it’s just some of the ones that come to mind as having lots of people sharing perspectives. And the list of people definitely isn’t complete.
Here are examples of things that I’d like to see more of, that feel more like doing the legwork to actually dive into the details:
Eli Tyre and Bucky replicating Scott’s birth-order hypothesis
Katja and the other fine people at AI Impacts doing long-term research on a question (discontinuous progress) with lots of historical datapoints
Jameson writing up his whole research question in great detail and very well, and then an excellent commenter turning up and answering it
Zhukeepa writing up an explanation of Paul’s research, allowing many more to understand it, and allowing Eliezer to write a response
Scott writing Goodhart Taxonomy, and the commenters banding together to find a set of four similar examples to add to the post
Val writing some interesting things about insight meditation, prompting Kaj to write a non-mysterious explanation
In the LW Review when Bucky checked out the paper Zvi analysed and argued it did not support the conclusions Zvi reached (this changed my opinion of Zvi’s post from ‘true’ to ‘false’)
The discussion around covid and EMH prompting Richard Meadows to write down a lot of the crucial and core arguments around the EMH
The above is also not mentioning lots of times when the person generating the idea does a lot of the legwork, like Scott or Jameson or Sarah or someone.
I see a lot of (very high quality) raw energy here that wants shaping and directing, with the use of lots of tools for coordination (e.g. better collaboration tools).
The epistemic standards being low is one way of putting it, but it doesn’t resonate with me much and kinda feels misleading. I think our epistemic standards are way higher than those of the communities you mention (historians, people interested in progress studies). Bryan Caplan said he knows of no group whose beliefs are more likely to be right in general than the rationalists, and this often seems accurate to me. I think we do a lot of exploration and generation and evaluation, just not in a very coordinated manner, and so could make progress at like 10x–100x the rate if we collaborated better, and I think we can get there without too much work.
“I see a lot of (very high quality) raw energy here that wants shaping and directing, with the use of lots of tools for coordination (e.g. better collaboration tools).”
Yepp, I agree with this. I guess our main disagreement is whether the “low epistemic standards” framing is a useful way to shape that energy. I think it is because it’ll push people towards realising how little evidence they actually have for many plausible-seeming hypotheses on this website. One proven claim is worth a dozen compelling hypotheses, but LW to a first approximation only produces the latter.
When you say “there’s also a lot of space for more ‘details’ to get fleshed out and subquestions to be cleanly answered”, I find myself expecting that this will involve people who believe the hypothesis continuing to build their castle in the sky, not analysis about why it might be wrong and why it’s not.
That being said, LW is very good at producing “fake frameworks”. So I don’t want to discourage this too much. I’m just arguing that this is a different thing from building robust knowledge about the world.
I will continue to be contrary and say I’m not sure I agree with this.
For one, I think in many domains new ideas are really hard to come by, as opposed to making minor progress in the existing paradigms. Fundamental theories in physics, a bunch of general insights about intelligence (in neuroscience and AI), etc.
And secondly, I am reminded of what Lukeprog wrote in his moral consciousness report, that he wished the various different philosophies-of-consciousness would stop debating each other, go away for a few decades, then come back with falsifiable predictions. I sometimes take this stance regarding many disagreements of import, such as the basic science vs engineering approaches to AI alignment. It’s not obvious to me that the correct next move is for e.g. Eliezer and Paul to debate for 1000 hours, rather than to go away and work on their ideas for a decade and then come back with lots of fleshed out details and results that can be more meaningfully debated.
I feel similarly about simulacra levels, Embedded Agency, and a bunch of IFS stuff. I would like to see more experimentation and literature reviews where they make sense, but I also feel like these are implicitly making substantive and interesting claims about the world, and I’d just be interested in getting a better sense of what claims they’re making, and have them fleshed out + operationalized more. That would be a lot of progress to me, and I think each of them is seeing that sort of work (with Zvi, Abram, and Kaj respectively leading the charges on LW, alongside many others).
I think I’m concretely worried that some of those models / paradigms (and some other ones on LW) don’t seem pointed in a direction that leads obviously to “make falsifiable predictions.”
And I can imagine worlds where “make falsifiable predictions” isn’t the right next step, you need to play around with it more and get it fleshed out in your head before you can do that. But there is at least some writing on LW that feels to me like it leaps from “come up with an interesting idea” to “try to persuade people it’s correct” without enough checking.
(In the case of IFS, I think Kaj’s sequence is doing a great job of laying it out in a concrete way where it can then be meaningfully disagreed with. But the other people who’ve been playing around with IFS didn’t really seem interested in that, and I feel like we got lucky that Kaj had the time and interest to do so.)
I feel like this comment isn’t critiquing a position I actually hold. For example, I don’t believe that “the correct next move is for e.g. Eliezer and Paul to debate for 1000 hours”. I am happy for people to work towards building evidence for their hypotheses in many ways, including fleshing out details, engaging with existing literature, experimentation, and operationalisation.
Perhaps this makes “proven claim” a misleading phrase to use. Perhaps more accurate to say: “one fully fleshed out theory is more valuable than a dozen intuitively compelling ideas”. But having said that, I doubt that it’s possible to fully flesh out a theory like simulacra levels without engaging with a bunch of academic literature and then making predictions.
I also agree with Raemon’s response below.
A housemate of mine said to me they think LW has a lot of breadth, but could benefit from more depth.
I think in general when we do intellectual work we have excellent epistemic standards, capable of listening to all sorts of evidence that other communities and fields would throw out, and listening to subtler evidence than most scientists (“faster than science”), but that our level of coordination and depth is often low. “LessWrongers should collaborate more and go into more depth in fleshing out their ideas” sounds more true to me than “LessWrongers have very low epistemic standards”.
“Being more openminded about what evidence to listen to” seems like a way in which we have lower epistemic standards than scientists, and also that’s beneficial. It doesn’t rebut my claim that there are some ways in which we have lower epistemic standards than many academic communities, and that’s harmful.
In particular, the relevant question for me is: why doesn’t LW have more depth? Sure, more depth requires more work, but on the timeframe of several years, and hundreds or thousands of contributors, it seems viable. And I’m proposing, as a hypothesis, that LW doesn’t have enough depth because people don’t care enough about depth—they’re willing to accept ideas even before they’ve been explored in depth. If this explanation is correct, then it seems accurate to call it a problem with our epistemic standards—specifically, the standard of requiring (and rewarding) deep investigation and scholarship.
Your solution to the “willingness to accept ideas even before they’ve been explored in depth” problem is to explore ideas in more depth. But another solution is to accept fewer ideas, or hold them much more provisionally.
I’m a proponent of the second approach because:
I suspect even academia doesn’t hold ideas as provisionally as it should. See Hamming on expertise: https://forum.effectivealtruism.org/posts/mG6mckPHAisEbtKv5/should-you-familiarize-yourself-with-the-literature-before?commentId=SaXXQXLfQBwJc9ZaK
I suspect trying to browbeat people to explore ideas in more depth works against the grain of an online forum as an institution. Browbeating works in academia because your career is at stake, but in an online forum, it just hurts intrinsic motivation and cuts down on forum use (the forum runs on what Clay Shirky called “cognitive surplus”, essentially a term for people’s spare time and motivation). I’d say one big problem with LW 1.0 that LW 2.0 had to solve before flourishing was that people felt too browbeaten to post much of anything.
If we accept fewer ideas / hold them much more provisionally, but provide a clear path to having an idea be widely held as true, that creates an incentive for people to try & jump through hoops—and this incentive is a positive one, not a punishment-driven browbeating incentive.
Maybe part of the issue is that on LW, peer review generally happens in the comments after you publish, not before. So there’s no publication carrot to offer in exchange for overcoming the objections of peer reviewers.
“If we accept fewer ideas / hold them much more provisionally, but provide a clear path to having an idea be widely held as true, that creates an incentive for people to try & jump through hoops—and this incentive is a positive one, not a punishment-driven browbeating incentive.”
Hmm, it sounds like we agree on the solution but are emphasising different parts of it. For me, the question is: who’s this “we” that should accept fewer ideas? It’s the set of people who agree with my argument that you shouldn’t believe things which haven’t been fleshed out very much. But the easiest way to add people to that set is just to make the argument, which is what I’ve done. Specifically, note that I’m not criticising anyone for producing posts that are short and speculative: I’m criticising the people who update too much on those posts.
Fair enough. I’m reminded of a time someone summarized one of my posts as being a definitive argument against some idea X and me thinking to myself “even I don’t think my post definitively settles this issue” haha.
Yeah, this is roughly how I think about it.
I do think right now LessWrong should lean more in the direction that Richard is suggesting – I think it was essential to establish better Babble procedures, but now we’re doing well enough on that front that I think setting clearer expectations of how the eventual pruning works is reasonable.
I wanted to register that I don’t like “babble and prune” as a model of intellectual development. I think intellectual development actually looks more like:
1. Babble
2. Prune
3. Extensive scholarship
4. More pruning
5. Distilling scholarship to form common knowledge
And that my main criticism is the lack of 3 and 5, not the lack of 2 or 4.
I also note that: a) these steps get monotonically harder, so that focusing on the first two misses *almost all* the work; b) maybe I’m being too harsh on the babble and prune framework because it’s so thematically appropriate for me to dunk on it here; I’m not sure if your use of the terminology actually reveals a substantive disagreement.
I basically agree with your 5-step model (I at least agree it’s a more accurate description than Babble and Prune, which I just meant as rough shorthand). I’d add things like “original research/empiricism” or “more rigorous theorizing” to the “Extensive Scholarship” step.
I see the LW Review as basically the first of (what I agree should essentially be at least) a 5 step process. It’s adding a stronger Step 2, and a bit of Step 5 (at least some people chose to rewrite their posts to be clearer and respond to criticism)
...
Currently, we do get non-zero Extensive Scholarship and Original Empiricism. (Kaj’s Multi-Agent Models of Mind seems like it includes real scholarship. Scott Alexander / Eli Tyre and Bucky’s exploration into Birth Order Effects seemed like real empiricism). Not nearly as much as I’d like.
But John’s comment elsethread seems significant:
This reminded me of a couple of posts in the 2018 Review, Local Validity as Key to Sanity and Civilization, and Is Clickbait Destroying Our General Intelligence?. Both of those seemed like “sure, interesting hypothesis. Is it real tho?”
During the Review I created a followup “How would we check if Mathematicians are Generally More Law Abiding?” question, trying to move the question from Stage 2 to 3. I didn’t get much serious response, probably because, well, it was a much harder question.
But, honestly… I’m not sure it’s actually a question that was worth asking. I’d like to know if Eliezer’s hypothesis about mathematicians is true, but I’m not sure it ranks near the top of questions I’d want people to put serious effort into answering.
I do want LessWrong to be able to follow up Good Hypotheses with Actual Research, but it’s not obvious which questions are worth answering. OpenPhil et al are paying for some types of answers, I think usually by hiring researchers full time. It’s not quite clear what the right role is for LW to play in the ecosystem.
All else equal, the harder something is, the less we should do it.
My quick take is that writing lit reviews/textbooks is a comparative disadvantage of LW relative to the mainstream academic establishment.
In terms of producing reliable knowledge… if people actually care about whether something is true, they can always offer a cash prize for the best counterargument (which could of course constitute citation of academic research). The fact that people aren’t doing this suggests to me that for most claims on LW, there isn’t any (reasonably rich) person who cares deeply re: whether the claim is true. I’m a little wary of putting a lot of effort into supply if there is an absence of demand.
(I guess the counterargument is that accurate knowledge is a public good, so an individual’s willingness to pay doesn’t give you a complete picture of the value accurate knowledge brings. Maybe what we need is a way to crowdfund bounties for the best argument related to something.)
(I agree that LW authors would ideally engage more with each other and academic literature on the margin.)
I’ve been thinking about the idea of “social rationality” lately, and this is related. We do so much here in the way of training individual rationality—the inputs, functions, and outputs of a single human mind. But if truth is a product, then getting human minds well-coordinated to produce it might be much more important than training them to be individually stronger. Just as assembly line production is much more effective in producing almost anything than teaching each worker to be faster in assembling a complete product by themselves.
My guess is that this could be effective not only in producing useful products, but also in overcoming biases. Imagine you took 5 separate LWers and asked them to create a unified consensus response to a given article. My guess is that they’d learn more through that collective effort, and produce a more useful response, than if they spent the same amount of time individually evaluating the article and posting their separate replies.
Of course, one of the reasons we don’t do that so much is that coordination is an up-front investment and is unfamiliar. Figuring out social technology to make it easier to participate in might be a great project for LW.
There’s been a fair amount of discussion of that sort of thing here: https://www.lesswrong.com/tag/group-rationality There are also groups outside LW thinking about social technology such as RadicalxChange.
I’m not sure. If you put those 5 LWers together, I think there’s a good chance that the highest status person speaks first and then the others anchor on what they say and then it effectively ends up being like a group project for school with the highest status person in charge. Some related links.
That’s definitely a concern too! I imagine such groups forming among people who already share a basic common view and collaborate to investigate more deeply. That way, any status-anchoring effects are mitigated.
Alternatively, it could be an adversarial collaboration. For me personally, some of the SSC essays in this format have led me to change my mind in a lasting way.
People also reject ideas before they’ve been explored in depth. I’ve tried to discuss similar issues with LW before but the basic response was roughly “we like chaos where no one pays attention to whether an argument has ever been answered by anyone; we all just do our own thing with no attempt at comprehensiveness or organizing who does what; having organized leadership of any sort, or anyone who is responsible for anything, would be irrational” (plus some suggestions that I’m low social status and that therefore I personally deserve to be ignored. There were also suggestions – phrased rather differently but amounting to this – that LW will listen more if published ideas are rewritten, not to improve on any flaws, but so that the new versions can be published at LW before anywhere else, because the LW community’s attention allocation is highly biased towards that).
I feel somewhat inclined to wrap up this thread at some point, even while there’s more to say. We can continue if you like and have something specific or strong you’d like to ask, but otherwise will pause here.
You have to realise that what you are doing isn’t adequate in order to gain the motivation to do it better, and that is unlikely to happen if you are mostly communicating with other people who think everything is OK.
LessWrong is competing against philosophy as well as science, and philosophy has still broader criteria of evidence. In fact, LessWrongians are often frustrated that mainstream philosophy takes such topics as dualism or theism seriously, even though there’s an abundance of Bayesian evidence for them.
Depends on the claim, right?
If the cost of evaluating a hypothesis is high, and hypotheses are cheap to generate, I would like to generate a great deal before selecting one to evaluate.
As mentioned in this comment, the Unrolling social metacognition paper is closely related to at least one research paper.
Right, but this isn’t mentioned in the post? Which seems odd. Maybe that’s actually another example of the “LW mentality”: why isn’t the existence of solid empirical research into 3 layers not being enough considered important enough to mention in a post on why 3 layers isn’t enough? (Maybe because the post was time-boxed? If so that seems reasonable, but then I would hope that people comment saying “Here’s a very relevant paper, why didn’t you cite it?”)
[Deleted]
Much of the same is true of scientific journals. Creating a place to share and publish research is a pretty key piece of intellectual infrastructure, especially for researchers to create artifacts of their thinking along the way.
The point about being ‘cross-posted’ is where I disagree the most.
This is largely original content that counterfactually wouldn’t have been published, or occasionally would have been published but to a much smaller audience. What Failure Looks Like wasn’t crossposted, Anna’s piece on reality-revealing puzzles wasn’t crossposted. I think that Zvi would have still written some on mazes and simulacra, but I imagine he writes substantially more content given the cross-posting available for the LW audience. Could perhaps check his blogging frequency over the last few years to see if that tracks. I recall Zhu telling me he wrote his FAQ because LW offered an audience for it, and likely wouldn’t have done so otherwise. I love everything Abram writes, and while he did have the Intelligent Agent Foundations Forum, it had a much more concise, technical style and a tiny audience, and didn’t have the conversational explanations and stories and cartoons that have been so excellent and well received on LW, and it wouldn’t have been as focused on the implications for rationality of things like logical inductors. Rohin wouldn’t have written his coherence theorems piece or any of his value learning sequence, and I’m pretty sure about that because I personally asked him to write that sequence, which is a great resource that I’ve seen other researchers in the field physically print off to write on and study. Kaj has an excellent series of non-mystical explanations of ideas from insight meditation that started as a response to things Val wrote, and I imagine those wouldn’t have been written quite like that if that context did not exist on LW.
I could keep going, but probably have made the point. It seems weird to not call this collectively a substantial amount of intellectual progress, on a lot of important questions.
I am indeed focusing right now on how to do more ‘conversation’. I’m in the middle of trying to host some public double cruxes for events, for example, and some day we will finally have inline commenting and better draft sharing and so on. It’s obviously not finished.
Yeah, that’s true, though it might have happened at some later point in the future as I got increasingly frustrated by people continuing to cite VNM at me (though probably it would have been a blog post and not a full sequence).
Reading through this comment tree, I feel like there’s a distinction to be made between “LW / AIAF as a platform that aggregates readership and provides better incentives for blogging”, and “the intellectual progress caused by posts on LW / AIAF”. The former seems like a clear and large positive of LW / AIAF, which I think Richard would agree with. For the latter, I tend to agree with Richard, though perhaps not as strongly as he does. Maybe I’d put it as, I only really expect intellectual progress from a few people who work on problems full time who probably would have done similar-ish work if not for LW / AIAF (but likely would not have made it public).
I’d say this mostly for the AI posts. I do read the rationality posts and don’t get a different impression from them, but I also don’t think enough about them to be confident in my opinions there.
By “AN” do you mean the AI Alignment Forum, or “AIAF”?
[Deleted]
I did suspect you’d confused it with the Alignment Newsletter :)
Thanks for chiming in with this. People criticizing the epistemics is hopefully how we get better epistemics. When the Californian smoke isn’t interfering with my cognition as much, I’ll try to give your feedback (and Rohin’s) proper attention. I would generally be interested to hear your arguments/models in detail, if you get the chance to lay them out.
My default position is LW has done well enough historically (e.g. Ben Pace’s examples) for me to currently be investing in getting it even better. Epistemics and progress could definitely be a lot better, but getting there is hard. If I didn’t see much progress on the rate of progress in the next year or two, I’d probably go focus on other things, though I think it’d be tragic if we ever lost what we have now.
And another thought:
Yes and no. Journal articles have their advantages, and so do blog posts. A bunch of the LessWrong team’s recent work has been around filling in the missing pieces for the system to work, e.g. Open Questions (hasn’t yet worked for coordinating research), Annual Review, Tagging, Wiki. We often talk about conferences and “campus”.
My work on Open Questions involved thinking about i) a better template for articles than “Abstract, Intro, Methods, etc.”, but Open Questions didn’t work for unrelated reasons we haven’t overcome yet, ii) getting lit reviews done systematically by people, iii) coordinating groups around research agendas.
I’ve thought about re-attempting the goals of Open Questions with instead a “Research Agenda” feature that lets people communally maintain research agendas and work on them. It’s a question of priorities whether I work on that anytime soon.
I do really think many of the deficiencies of LessWrong’s current work compared to academia are “infrastructure problems” at least as much as the epistemic standards of the community. Which means the LW team should be held culpable for not having solved them yet, but it is tricky.
For the record, I think the LW team is doing a great job. There’s definitely a sense in which better infrastructure can reduce the need for high epistemic standards, but it feels like the thing I’m pointing at is more like “Many LW contributors not even realising how far away we are from being able to reliably produce and build on good ideas” (which feels like my criticism of Ben’s position in his comment, so I’ll respond more directly there).
It seems really valuable to have you sharing how you think we’re falling epistemically short and probably important for the site to integrate the insights behind that view. There are a bunch of ways I disagree with your claims about epistemic best practices, but it seems like it would be cool if I could pass your ITT more. I wish your attempt to communicate the problems you saw had worked out better. I hope there’s a way for you to help improve LW epistemics, but also get that it might be costly in time and energy.
Now they’re positive again.
Confusing to me, their Ω-karma (karma on another website) is also positive. Does it mean they previously had negative LW-karma but positive Ω-karma? Or that their Ω-karma also improved as a result of you complaining on LW a few hours ago? Why would it?
(Feature request: graph of evolution of comment karma as a function of time.)
I’m confused, what is Ω-karma?
AI Alignment Forum karma (which is also displayed here on posts that are crossposted)
I’d be curious what, if any, communities you think set good examples in this regard. In particular, are there specific academic subfields or non-academic scenes that exemplify the virtues you’d like to see more of?
Maybe historians of the industrial revolution? Who grapple with really complex phenomena and large-scale patterns, like us, but unlike us use a lot of data, write a lot of thorough papers and books, and then have a lot of ongoing debate on those ideas. And then the “progress studies” crowd is an example of an online community inspired by that tradition (but still very nascent, so we’ll see how it goes).
More generally I’d say we could learn to be more rigorous by looking at any scientific discipline or econ or analytic philosophy. I don’t think most LW posters are in a position to put in as much effort as full-time researchers, but certainly we can push a bit in that direction.
Thanks for your reply! I largely agree with drossbucket’s reply.
I also wonder how much this is an incentives problem. As you mentioned and in my experience, the fields you mentioned strongly incentivize an almost fanatical level of thoroughness that I suspect is very hard for individuals to maintain without outside incentives pushing them that way. At least personally, I definitely struggle and, frankly, mostly fail to live up to the sorts of standards you mention when writing blog posts in part because the incentive gradient feels like it pushes towards hitting the publish button.
Given this, I wonder if there’s a way to shift the incentives on the margin. One minor thing I’ve been thinking of trying for my personal writing is having a Knuth or Nintil style “pay for mistakes” policy. Do you have thoughts on other incentive structures for rewarding rigor or punishing the lack thereof?
It feels partly like an incentives problem, but also I think a lot of people around here are altruistic and truth-seeking and just don’t realise that there are much more effective ways to contribute to community epistemics than standard blog posts.
I think that most LW discussion is at the level where “paying for mistakes” wouldn’t be that helpful, since a lot of it is fuzzy. Probably the thing we need first is more reference posts that distill a range of discussion into key concepts, and place that in the wider intellectual context. Then we can get more empirical. (Although I feel pretty biased on this point, because my own style of learning about things is very top-down.) I guess to encourage this, we could add a “reference” section for posts that aim to distill ongoing debates on LW.
In some cases you can get a lot of “cheap” credit by taking other people’s ideas and writing a definitive version of them aimed at more mainstream audiences. For ideas that are really worth spreading, that seems useful.
Here is the best toy model I currently have for rational agents. Alas, it is super messy and hacky, but better than nothing. I’ll call it the BAVM model; the one-sentence summary is “internal traders concurrently bet on beliefs, auction actions, vote on values, and merge minds”. There’s little novel here, I’m just throwing together a bunch of ideas from other people (especially Scott Garrabrant and Abram Demski).
In more detail, the three main components are:
A prediction market
An action auction
A value election
You also have some set of traders, who can simultaneously trade on any combination of these three. Traders earn money in two ways:
Making accurate predictions about future sensory experiences on the market.
Taking actions which lead to reward or increase the agent’s expected future value.
They spend money in three ways:
Bidding to control the agent’s actions for the next N timesteps.
Voting on what actions get reward and what states are assigned value.
Running the computations required to figure out all these trades.
Values are therefore dominated by whichever traders earn money from predictions or actions, who will disproportionately vote for values that are formulated in the same ontologies they use for prediction/action, since that’s simpler than using different ontologies.
The last component is that it costs traders money to do computation. The way they can reduce this is by finding other traders who do similar computations as them, and then merging into a single trader. I am very interested in better understanding what a merging process like this might look like, though it seems pretty intractable in general because it will depend a lot on the internal details of the traders. (So perhaps a more principled approach here is to instead work top-down, figuring out what sub-markets or sub-auctions look like?)
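To make the money flows concrete, here is a minimal bookkeeping sketch of one round. Everything in it is an assumption for illustration: the `Trader` fields, the `run_round` function, and the choice of a first-price auction are hypothetical and are not specified by the model above. It only tracks the two ways of earning and three ways of spending; it does not implement the market, the election, or merging.

```python
from dataclasses import dataclass


@dataclass
class Trader:
    """Hypothetical trader; all field names are illustrative assumptions."""
    name: str
    wealth: float
    action_bid: float    # offered to control the agent's next N timesteps
    value_spend: float   # spent voting on rewards and values
    compute_cost: float  # cost of running its own computations


def run_round(traders, prediction_payouts, action_payoff):
    """Charge costs, run the action auction, and pay out earnings.

    prediction_payouts: dict mapping trader name -> money earned from
        accurate predictions this round.
    action_payoff: money paid to whichever trader controlled the actions.
    Returns the name of the auction winner.
    """
    # Assumption: highest bid wins and pays its own bid (first-price).
    winner = max(traders, key=lambda t: t.action_bid)
    for t in traders:
        # Every trader pays for computation and for its votes on values.
        t.wealth -= t.compute_cost + t.value_spend
        # Traders earn from accurate predictions, if any.
        t.wealth += prediction_payouts.get(t.name, 0.0)
    # Only the winner pays its bid and collects the action payoff.
    winner.wealth += action_payoff - winner.action_bid
    return winner.name


traders = [
    Trader("predictor", wealth=10.0, action_bid=0.0, value_spend=1.0, compute_cost=0.5),
    Trader("actor", wealth=10.0, action_bid=3.0, value_spend=1.0, compute_cost=0.5),
]
winner = run_round(traders, {"predictor": 2.0}, action_payoff=5.0)
wealth = {t.name: t.wealth for t in traders}
```

In this toy round a trader that only predicts and a trader that only acts end up equally wealthy, which is one way to see that both income streams feed into who controls the value election later.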
I wonder if there’s a loopiness here which breaks the setup (the expectation I’m guessing is relative to the prediction market’s probabilities? Though it seems like the market is over sensory experiences but the values are over world states in general, so maybe I’m missing something). But it seems like if I take an action and move the market at the same time, I might be able to extract a bunch of extra money and acquire outsize control.
This seems like it’s wasteful relative to contributing to a pool that bids on action A (or short-term policy P). I guess coordination is hard if you’re just contributing to the pool though, and all connects to the merging process you describe.
I’ve been studying and thinking about the physical side of this phenomenon in neuroscience recently. There are groups of columns of neurons in the cortex that form temporary voting blocks, regarding whatever subject that particular Brodmann area focuses on. These alternating groups have to deal with physical limits of how many groups the regions can stably divide into, which limits the number of active distinct hypotheses or ‘traders’ there can be in a given area at a given time. Unclear exactly what the max is, and it depends on the cortical region in question, but generally 6-9 is the approximate max (not coincidentally the number of distinct ‘chunks’ we can hold in active short term memory). Also, there is a tendency for noise to cause overly similar traders/hypotheses/firing-groups to fall back into synchrony/agreement with each other and thus collapse back down to a baseline of two competing hypotheses. These hypotheses/firing-groups/traders are pushed into existence or pushed into merging not just by their own ‘bids’ but also by the evidence coming in from other brain areas or senses. I don’t think that current day neuroscience has all the details yet (although I certainly don’t have the full picture of all relevant papers in neuroscience!).
I think there are probably a lot of ways to build rational agents. The idea that general intelligence is hard in any absolute sense may be biased by wanting to believe we’re special, and for AI workers, that our work is special and difficult.
Can new traders be “spawned”?
Yepp, as in Logical Induction, new traders get spawned over time (in some kind of simplicity-weighted ordering).
I note that AI economies like this will often have explosively better credit assignment for information production than human economies can. Artificial agents can be copied or rolled back (erase memories), which makes it possible to reverse the receipt of information if an assessor concludes with a price that the seller considers too low for a deal. In human economies, that’s impossible, you can’t send a clone to value a piece of information then delete them if you decide not to buy that information (that’s too expensive/illegal) nor can you wipe their memory of the information (or, we don’t know how to do that), so the very basic requirement for trade, assessment prior to purchase, is not possible in human economies, so information doesn’t get priced accurately and it has to be treated as a public good.
When implementing this (internal privacy) in a multi-agent architecture, though, make sure to take measures to prevent the formation of monopolies. I feel like information is kind of an increasing-returns type of good, yeah? The more you have, the more you can do with it. It could quickly stop being multi-agent, and at worst, the monopoly could consolidate enough political power to manipulate the EV estimators and reward hack. In theory those economies shouldn’t interact. But it’s impossible to totally prevent it. The EV estimators are receiving big sets of action proposals from the decisionmakers and the decisionmakers will see which action proposal the EV estimators end up choosing.
Yepp, very good point. Am working on a short story about this right now.
Nice!
For comparison, this “Great Map of the Mind” is basically the standard academic philosophy picture.
My guess is that understanding merging is the key to most prediction-of-behavior issues (things that motivated and also foiled UDT, but not limited to known-in-advance preference setting). Two agents can coordinate if they are the same, or reasoning about each other’s behavior, but in general they can be too complicated to clearly understand each other or themselves, can inadvertently diagonalize such attempts into impossibility, or even fail to be sufficiently aware of each other to start reasoning about each other specifically.
It might be useful to formulate smaller computations (contracts/adjudicators) that facilitate coordination between different agents by being shared between them, with the bigger agents acting as parts of environments for the contracts and setting up incentives for them, while the contracts can themselves engage in decision making within those environments. Contracts coordinate by being shared and acting with strategicness across relevant agents (they should be something like common knowledge), and it’s feasible for agents to find/construct some shared contracts as a result of them being much simpler than agents that host them. Learning of contracts doesn’t need to start with targeting coordination with other big agents, as active contracts screen off the other agents they facilitate coordination with.
Using contracts requires the big agents to make decisions about policies that affect the contracts updatelessly with respect to how the contracts end up behaving. That is, a contract should be able to know these policies, and the policies should describe responses to possible behaviors of a contract without themselves changing (once the contract computes more of its behavior), enabling the contract to do decision making in the environment of these policies. This corresponds to committing to abide by the contract. Assurance contracts (that start their tenure by checking that the commitments of all parties are actually in place) are especially important, allowing things like cooperation in PD.
If traders can get access to control panel for actions of the external agent AND they profit from accurately predicting its observations, then wouldn’t the best strategy be “create as much chaos as possible that is only predictable to me, its creator”. So, traders that value ONLY accurate predictions will get the advantage?
I like this picture! But
I think real learning has some kind of ground-truth reward. So we should clearly separate between “this ground-truth reward that is chiseling the agent during training (and not after training)”, and “the internal shards of the agent negotiating and changing your exact objective (which can happen both during and after training)”. I’d call the latter “internal value allocation”, or something like that. It doesn’t neatly correspond to any ground truth, and is partly determined by internal noise in the agent. And indeed, eventually, when you “stop training” (or at least “get decoupled enough from reward”), it just evolves of its own, separate from any ground truth.
And maybe more importantly:
I think this will by default lead to wireheading (a trader becomes wealthy and then sets reward to be very easy for it to get and then keeps getting it), and you’ll need a modification of this framework which explains why that’s not the case.
My intuition is a process of the form “eventually, traders (or some kind of specialized meta-traders) change the learning process itself to make it more efficient”. For example, they notice that topic A and topic B are unrelated enough, so you can have the traders thinking about these topics be pretty much separate, and you don’t lose much, and you waste less compute. Probably these dynamics will already be “in the limit” applied by your traders, but it will be the dominant dynamic so it should be directly represented by the formalism.
Finally, this might come later, and not yet in the level of abstraction you’re using, but I do feel like real implementations of these mechanisms will need to have pretty different, way-more-local structure to be efficient at all. It’s conceivable to say “this is the ideal mechanism, and real agents are just hacky approximations to it, so we should study the ideal mechanism first”. But my intuition says, on the contrary, some of the physical constraints (like locality, or the architecture of nets) will strongly shape which kind of macroscopic mechanism you get, and these will present pretty different convergent behavior. This is related, but not exactly equivalent to, partial agency.
I’d actually represent this as “subsidizing” some traders. For example, humans have a social-status-detector which is hardwired to our reward systems. One way to implement this is just by taking a trader which is focused on social status and giving it a bunch of money. I think this is also realistic in the sense that our human hardcoded rewards can be seen as (fairly dumb) subagents.
I think this happens in humans—e.g. we fall into cults, we then look for evidence that the cult is correct, etc etc. So I don’t think this is actually a problem that should be ruled out—it’s more a question of how you tweak the parameters to make this as unlikely as possible. (One reason it can’t be ruled out: it’s always possible for an agent to end up in a belief state where it expects that exploration will be very severely punished, which drives the probability of exploration arbitrarily low.)
I’m assuming that traders can choose to ignore whichever inputs/topics they like, though. They don’t need to make trades on everything if they don’t want to.
Yeah, this is why I’m interested in understanding how sub-markets can be aggregated into markets, sub-auctions into auctions, sub-elections into elections, etc.
Sounds good!
Absolutely, wireheading is a real phenomenon, so the question is how can real agents exist that mostly don’t fall to it. And I was asking for a story about how your model can be altered/expanded to make sense of that. My guess is it will have to do with strongly subsidizing some traders, and/or having a pretty weird prior over traders. Maybe even something like “dynamically changing the prior over traders”[1].
Yep, that’s why I believe “in the limit your traders will already do this”. I just think it will be a dominant dynamic of efficient agents in the real world, so it’s better to represent it explicitly (as a more hierarchical structure, etc.), instead of having that computation be scattered between all independent traders. I also think that’s how real agents probably do it, computationally speaking.
[1] Of course, pedantically, this will always be equivalent to having a static prior and changing your update rule. But some update rules are much more easily made sense of if you interpret them as changing the prior.
Ah, I see. In that case I think I disagree that it happens “by default” in this model. A few dynamics which prevent it:
If the wealthy trader makes reward easier to get, then the price of actions will go up accordingly (because other traders will notice that they can get a lot of reward by winning actions). So in order for the wealthy trader to keep making money, they need to reward outcomes which only they can achieve, which seems a lot harder.
I don’t yet know how traders would best aggregate votes into a reward function, but it should be something which has diminishing marginal return to spending, i.e. you can’t just spend 100x as much to get 100x higher reward on your preferred outcome. (Maybe quadratic voting?)
Other traders will still make money by predicting sensory observations. Now, perhaps the wealthy trader could avoid this by making observations as predictable as possible (e.g. going into a dark room where nothing happens—kinda like depression, maybe?) But this outcome would be assigned very low reward by most other traders, so it only works once a single trader already has a large proportion of the wealth.
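As a small illustration of the diminishing-returns point in the second item, here is a sketch of the standard quadratic voting cost rule, where casting v votes costs v squared. The function names are hypothetical, and nothing here says how votes would actually be aggregated into a reward function; it only shows how spending translates into influence under that one rule.

```python
import math


def votes_bought(spend: float) -> float:
    """Under quadratic voting, v votes cost v**2, so a budget buys sqrt(budget) votes."""
    if spend < 0:
        raise ValueError("spend must be non-negative")
    return math.sqrt(spend)


def influence_ratio(spend_a: float, spend_b: float) -> float:
    """How many times more votes trader A gets than trader B."""
    return votes_bought(spend_a) / votes_bought(spend_b)


# A trader that spends 100x as much gets 10x the votes, not 100x.
ratio = influence_ratio(100.0, 1.0)
```

Under this rule a wealthy trader can still outvote others, but buying a dictatorship over the reward function gets quadratically more expensive as its margin grows.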
IMO the best way to explicitly represent this is via a bias towards simpler traders, who will in general pay attention to fewer things.
But actually I don’t think that this is a “dominant dynamic” because in fact we have a strong tendency to try to pull different ideas and beliefs together into a small set of worldviews. And so even if you start off with simple traders who pay attention to fewer things, you’ll end up with these big worldviews that have opinions on everything. (These are what I call frames here.)
Yep! But this didn’t seem so hard for me to happen, especially in the form of “I pick some easy task (that I can do perfectly), and of course others will also be able to do it perfectly, but since I already have most of the money, if I just keep investing my money in doing it I will reign forever”. You prevent this from happening through epsilon-exploration, or something equivalent like giving money randomly to other traders. These solutions feel bad, but I think they’re the only real solutions. Although I also think stuff about meta-learning (traders explicitly learn about how they should learn, etc.) probably pragmatically helps make these failures less likely.
Yep, that should help (also at the trade-off of making new good ideas slower to implement, but I’m happy to make that trade-off).
Yeah. To be clear, the dynamic I think is “dominant” is “learning to learn better”. Which I think is not equivalent to simplicity-weighting traders. It is instead equivalent to having some more hierarchical structure on traders.
Some opinions about AI and epistemology:
One reason that many rationalists have such strong views about AI is that they are wrong about epistemology. Specifically, bayesian rationalism is a bad way to think about complex issues.
A better approach is meta-rationality. To summarize one guiding principle of (my version of) meta-rationality in a single sentence: if something doesn’t make sense in the context of group rationality, it probably doesn’t make sense in the context of individual rationality either.
For example: there’s no privileged way to combine many people’s opinions into a single credence. You can average them, but that loses a lot of information. Or you can get them to bet on a prediction market, but that depends a lot on details of the individuals’ betting strategies. The group might settle on a number to help with planning and communication, but it’s only a lossy summary of many different beliefs and models. Similarly, we should think of individuals’ credences as lossy summaries of different opinions from different underlying models that they have.
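A toy numerical sketch of both halves of that claim, using two common pooling rules (a plain average, and an average of log-odds converted back to a probability). The function names and the example credences are made up for illustration; neither rule is being put forward as the right one.

```python
import math


def linear_pool(credences):
    """Arithmetic mean of the individual probabilities."""
    return sum(credences) / len(credences)


def geometric_odds_pool(credences):
    """Average the log-odds, then convert back to a probability."""
    log_odds = [math.log(p / (1 - p)) for p in credences]
    mean_log_odds = sum(log_odds) / len(log_odds)
    return 1 / (1 + math.exp(-mean_log_odds))


# Lossiness: a group in full agreement and a sharply divided group
# collapse to the same averaged credence.
agreeing = [0.5, 0.5, 0.5]
divided = [0.1, 0.5, 0.9]
same_summary = abs(linear_pool(agreeing) - linear_pool(divided)) < 1e-9

# No privileged rule: the two pooling methods disagree on the same inputs.
mixed = [0.01, 0.5, 0.5]
linear_result = linear_pool(mixed)
geometric_result = geometric_odds_pool(mixed)
```

Here the log-odds pool gives much more weight to the one very confident member than the plain average does, so the "group credence" depends on which aggregation rule was picked rather than only on the members' beliefs.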
How does this apply to AI? Suppose we each think of ourselves as containing many different subagents that focus on understanding the world in different ways—e.g. studying different disciplines, using different styles of reasoning, etc. The subagent that thinks about AI from first principles might come to a very strong opinion. But this doesn’t mean that the other subagents should fully defer to it (just as having one very confident expert in a room of humans shouldn’t cause all the other humans to elect them as the dictator). E.g. maybe there’s an economics subagent who will remain skeptical unless the AI arguments can be formulated in ways that are consistent with their knowledge of economics, or the AI subagent can provide evidence that is legible even to those other subagents (e.g. advance predictions).
In my debate with Eliezer, he didn’t seem to appreciate the importance of advance predictions; I think the frame of “highly opinionated subagents should convince other subagents to trust them, rather than just seizing power” is an important aspect of what he’s missing. I think of rationalism as trying to form a single fully-consistent world-model; this has many of the same pitfalls as a country which tries to get everyone to agree on a single ideology. Even when that ideology is broadly correct, you’ll lose a bunch of useful heuristics and intuitions that help actually get stuff done, because ideological conformity is prioritized.
This perspective helps frame the debate about what our “base rate” for AI doom should be. I’ve been in a number of arguments that go roughly like (edited for clarity):
Me: “Credences above 90% doom can’t be justified given our current state of knowledge”
Them: “But this is an isolated demand for rigor, because you’re fine with people claiming that there’s a 90% chance we survive. You’re assuming that survival is the default, I’m assuming that doom is the default; these are symmetrical positions.”
But in fact there’s no one base rate; instead, different subagents with different domains of knowledge will have different base rates. That will push P(doom) lower because most frames from most disciplines, and most styles of reasoning, don’t predict doom. That’s where the asymmetry which makes 90% doom a much stronger prediction than 90% survival comes from.
This perspective is broadly aligned with a bunch of stuff that Scott Garrabrant and Abram Demski have written about (e.g. geometric rationality, Garrabrant induction). I don’t think the ways I’m applying it to AI risk debates straightforwardly fall out of their more technical ideas; but I do expect that more progress on agent foundations will make it easier to articulate ideas like the ones above.
I don’t really buy this statement. Most frames, from most disciplines, and most styles of reasoning, do not make clear predictions about what will happen to humanity in the long-run future. A very few do, but the vast majority are silent on this issue. Silence is not anything like “50%”.
Most frames, from most disciplines, and most styles of reasoning, don’t predict sparks when you put metal in a microwave. This doesn’t mean I don’t know what happens when you put metal in a microwave. You need to at the very least limit yourself to applicable frames, and there are very few applicable frames for predicting humanity’s long-term future.
I agree with this.
Unfortunately, I think there’s a fundamentally inside-view aspect of [problems very different from those we’re used to]. I think looking for a range of frames is the right thing to do—but deciding on the relevance of the frame can only be done by looking at the details of the problem itself (if we instead use our usual heuristics for relevance-of-frame-x, we run into the same out-of-distribution issues).
I don’t think there’s a way around this. Aspects of this situation are fundamentally different from those we’re used to. [Is different from] is not a useful relation—we can’t get far by saying “We’ve seen [fundamentally different] situations before—what happened there?”. It’ll all come back to how they were fundamentally different.
To say something mildly more constructive, I do still think we should be considering and evaluating other frames, based on our own inside-view model (with appropriate error bars on that model).
A place I’d start here would be:
Attempt to understand another frame.
See how far I need to zoom out before that frame’s models become a reasonable abstraction for the problem-as-I-understand-it.
Find the smallest changes to my models that’d allow me to stick with this frame without zooming out so far. Assess the probability that these adjusted models are correct/useful.
For most frames, I end up needing to zoom out too far for them to say much of relevance—so this doesn’t much change my p(doom) assessment.
It seems more useful to apply other frames to evaluate smaller parts of our models. I’m sure there are a bunch of places where intuitions and models from e.g. economics or physics do apply to safety-related subproblems.
I’ve been thinking lately that human group rationality seems like such a mess. Like how can humanity navigate a once in a lightcone opportunity like the AI transition without doing something very suboptimal (i.e., losing most of potential value), when the vast majority of humans (and even the elites) can’t understand (or can’t be convinced to pay attention to) many important considerations. This big picture seems intuitively very bad and I don’t know any theory of group rationality that says this is actually fine.
I guess my 1 is mostly about descriptive group rationality, and your 2 may be talking more about normative group rationality. However I’m also not aware of any good normative theories about group rationality. I started reading your meta-rationality sequence, but it ended after just two posts without going into details.
The only specific thing you mention here is “advance predictions” but for example, moral philosophy deals with “ought” questions and can’t provide advance predictions. Can you say more about how you think group rationality should work, especially when advance predictions aren’t possible?
From your group rationality perspective, why is it good that rationalists individually have better views about AI? Why shouldn’t each person just say what they think from their own preferred frame, and then let humanity integrate that into some kind of aggregate view or outcome, using group rationality?
David Chapman’s website seems like the standard reference for what the post-rationalists call “metarationality”. (I haven’t read much of it, but the little I read made me somewhat unenthusiastic about continuing).
How can the mistakes rationalists are making be expressed in the language of Bayesian rationalism? Priors, evidence, and posteriors are fundamental to how probability works.
The mistakes can (somewhat) be expressed in the language of Bayesian rationalism by doing two things:
Talking about partial hypotheses rather than full hypotheses. You can’t have a prior over partial hypotheses, because several of them can be true at once (though you can still assign them credences and update those credences according to evidence).
Talking about models with degrees of truth rather than just hypotheses with degrees of likelihood. E.g. when using a binary conception of truth, general relativity is definitely false because it’s inconsistent with quantum phenomena. Nevertheless, we want to say that it’s very close to the truth. In general this is more of an ML approach to epistemology (we want a set of models with low combined loss on the ground truth).
Do “subagents” in this paragraph refer to different people, or different reasoning modes / perspectives within a single person? (I think it’s the latter, since otherwise they would just be “agents” rather than subagents.)
Either way, I think this is a neat way of modeling disagreement and reasoning processes, but for me it leads to a different conclusion on the object-level question of AI doom.
A big part of why I find Eliezer’s arguments about AI compelling is that they cohere with my own understanding of diverse subjects (economics, biology, engineering, philosophy, etc.) that are not directly related to AI—my subagents for these fields are convinced and in agreement.
Conversely, I find many of the strongest skeptical arguments about AI doom to be unconvincing precisely because they seem overly reliant on a “current-paradigm ML subagent” that their proponents feel should be dominant, or at least more heavily weighted than I think is justified.
This might be true and useful for getting some kind of initial outside-view estimate, but I think you need some kind of weighting rule to make this work as a reasoning strategy even at a meta level. Otherwise, aren’t you vulnerable to other people inventing lots of new frames and disciplines? I think the answer in geometric rationality terms is that some subagents will perform poorly and quickly lose their Nash bargaining resources, and then their contribution to future decision-making / conclusion-making will be down-weighted. But I don’t think the only way for a subagent to “perform” for the purposes of deciding on a weight is by making externally legible advance predictions.
I may be missing context here, but as written / taken at face value, I strongly agree with the above comment from Richard. I often disagree with Richard about alignment and its role in the future of AI, but this comment is an extremely dense list of things I agree with regarding rationalist epistemic culture.
I’d love to read an elaboration of your perspective on this, with concrete examples, which avoids focusing on the usual things you disagree about (pivotal acts vs. pivotal processes, social facets of the game being important for us to track, etc.) and mainly focuses on your thoughts on epistemology and rationality and how they deviate from what you consider the LW norm.
My main take on Bayesian epistemology being wrong is that I think to the extent it’s useless in real life, it’s because it focuses way too much on the ideal case, à la @Robert Miles’s tweet here:
https://x.com/robertskmiles/status/1830925270066286950
(The other problem I have with it is that even in the ideal case, it doesn’t have a way to sensibly handle 0 probability events, or conditioning on probability 0 events, which can actually happen once we leave the world of finite sets and measures.)
That said, I don’t think that people being wrong about epistemology is the cause of high p(Doom).
I’d agree more with @Algon in that the issues lie elsewhere (though a nitpick is that I wouldn’t say that EU maximization is wrong for TAI/AGI/ASI, but rather that certain dangerous properties don’t automatically hold, and that systems that EU maximize IRL like GPT-4 aren’t actually nearly as dangerous as often assumed. Agree with the other points.)
(I am not the inimitable @Robert Miles, though we do have some things in common.)
Reply to @Algon:
What I was talking about is that the predictive models like GPT-4 have a utility function that’s essentially predictive, and the maximization is essentially trying to update the best it can given input conditions.
These posts can help you to understand more about predictive/simulator utility functions like GPT-4:
https://www.lesswrong.com/posts/vs49tuFuaMEd4iskA/one-path-to-coherence-conditionalization
https://www.lesswrong.com/posts/k48vB92mjE9Z28C3s/implied-utilities-of-simulators-are-broad-dense-and-shallow
https://www.lesswrong.com/posts/EBKJq2gkhvdMg5nTQ/instrumentality-makes-agents-agenty
I’m doubtful that GPT-4 has a utility function. If it did, I would be kind-of terrified. I don’t think I’ve seen the posts you linked to though, so I’ll go read those.
Maybe a crux is that I’m willing to grant learned utility functions as utility functions, and I tend to see EU maximization/utility function reasoning in general as implying far fewer consequences than people on LW think it does, at least without more constraints.
It doesn’t try to assert its own existence, because that’s not necessary for maximizing updating/prediction output based on inputs.
I think the crux lies elsewhere, as I was sloppy in my wording. It’s not that maximizing some utility function is an issue, as basically anything can be viewed as EU maximization for a sufficiently wild utility function. However, I don’t view that as a meaningful utility function. Rather, it is the ones like e.g. utility functions over states that I think are meaningful, and those are scary. That’s how I think you get classical paperclip maximizers.
When I try and think up a meaningful utility function for GPT-4, I can’t find anything that’s plausible. Which means I don’t think there’s a meaningful prediction-utility function which describes GPT-4’s behaviour. Perhaps that is a crux.
Re utility functions over states, it turns out that we can validly turn utility functions over plans/predictions into utility functions over world states/outcomes (though usually with constraints on how large the domain is, though not always.)
https://www.lesswrong.com/posts/k48vB92mjE9Z28C3s/?commentId=QciMJ9ehR9xbTexcc
And yeah, I think it’s a crux that, at the very least, GPT-N systems, if they reach AGI/ASI, will probably look like a maximizer for updating given input conditions like prompts.
My main point isn’t that the utility function framing of GPT-4 or GPT-N is wrong, but rather that LWers inferred way too much about how a system would behave, even conditional on expected utility maximization being a coherent frame for AIs, because that frame doesn’t logically imply the properties they thought it did without more assumptions that need to be defended.
What is the empirical track record of your suggested epistemological strategy, relative to Bayesian rationalism? Where does your confidence come from that it would work any better? Every time I see suggestions of epistemological humility, I think to myself stuff like this:
What predictions would this strategy have made about future technologies, like an 1890 or 1900 prediction of the airplane (vs. first controlled flight by the Wright Brothers in 1903), or a 1930 or 1937 prediction of nuclear bombs? Doesn’t your strategy just say that all these weird-sounding technologies don’t exist yet and are probably impossible?
Can this epistemological strategy correctly predict that present-day huge complex machines like airplanes can exist? They consist of millions of parts and require contributions of thousands or tens of thousands of people. Each part has a chance of being defective, and each person has a chance of making a mistake. Without the benefit of knowing that airplanes do indeed exist, doesn’t it sound overconfident to predict that parts have an error rate of <1 in a million, or that people have an error rate of <1 in a thousand? But then the math says that airplanes can’t exist, or should immediately crash.
Or to rephrase point 2 to reply to this part: “That will push P(doom) lower because most frames from most disciplines, and most styles of reasoning, don’t predict doom.” — Can your epistemological strategy even correctly make any predictions of near 100% certainty? I concur with habryka that most frames don’t make any predictions on most things. And yet this doesn’t mean that some events aren’t ~100% certain.
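The arithmetic behind the airplane example above can be sketched as follows. This is a minimal illustration only, and it rests on two assumptions that are not in the comment: that part defects are independent, and that any single defect is fatal (no redundancy or inspection). The part count and failure rates are hypothetical round numbers.

```python
import math


def p_all_parts_ok(n_parts: int, per_part_failure: float) -> float:
    """Probability that none of n independent parts is defective.

    Computed as exp(n * log(1 - p)), using log1p for accuracy at tiny p.
    """
    return math.exp(n_parts * math.log1p(-per_part_failure))


if __name__ == "__main__":
    n = 1_000_000  # hypothetical: "millions of parts"
    for rate in (1e-6, 1e-7, 1e-9):  # hypothetical per-part failure rates
        print(f"failure rate {rate:g}: P(no defect) = {p_all_parts_ok(n, rate):.4f}")
```

Under these naive assumptions, a one-in-a-million defect rate across a million parts leaves only about a 37% chance of a defect-free machine, which is the kind of calculation that would wrongly suggest airplanes should not work.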
One of the most important features of a future ASI, as I see it, is knowledge of the limits of applicability of its models and heuristics. If you have a list of assumptions for very fast heuristics, then you can win big by making fast-computable moves in a narrow environment where those assumptions hold. Thus, you also need to be able to detect when your assumptions don’t hold and command your subagents to halt, melt and catch fire when they are outside their applicability zone.
I think this post doesn’t really explain why rats have high belief in doom, or why they’re wrong to do so. Perhaps ironically, there is a better version of this post on both counts which isn’t so focused on how rats get epistemology wrong and the social/meta-level consequences. A post which focuses on the object-level implications for AI of a theory of rationality which looks very different from the AIXI-flavoured rat-orthodox view.
I say this because those sorts of considerations convinced me that we’re much less likely to be buggered. I.e. I no longer believe EU maximization is/will be a good description by default of TAI or widely economically productive AGI, mildly superhuman AGI or even ASI, depending on the details. Which is partly due to a recognition that the arguments for EU maximization are weaker than I thought, arguments for LDT being convergent are lacking, the notions of optimality we do have are very weak, the existence and behaviour of GPT-4, Claude Opus etc.
6 seems too general a claim to me. Why wouldn’t it work for 1% vs 10%, and likewise 0.1% vs 1%, i.e. why doesn’t this suggest that you should round down P(doom) to zero? Also, I don’t even know what you mean by “most” here. Like, are we quantifying over methods of reasoning used by current AI researchers right now? Over all time? Over all AI researchers and engineers? Over everyone in the West? Over everyone who’s ever lived? Etc.
And it seems to me like you’re implicitly privileging ways of combining these opinions that get you 10% instead of 1% or 90%, which is begging the question. Of course, you could reply that a P(doom) of 10% is confused, that isn’t really your state of knowledge, lumping in all your sub-agents’ models into a single number is too lossy etc. But then why mention that 90% is a much stronger prediction than 10% instead of saying they’re roughly equally confused?
7 I kinda disagree with. Those models of idealized reasoning you mention generalize Bayesianism/Expected Utility Maximization. But they are not far from the Bayesian framework or EU frameworks. Like Bayesianism, they do say there are correct and incorrect ways of combining beliefs, that beliefs should be isomorphic to certain structures, unless I’m horribly mistaken. Which sure is not what you’re claiming to be the case in your above points.
Also, a lot of rationalists already recognize that these models are addressing flaws in Bayesianism like logical omniscience, embeddedness etc. Like, I believed this at least around 2017, and probably earlier. Also, note that these models of epistemology are not in tension with a strong belief that we’re buggered. Last I checked, the people who invented these models believe we’re buggered. I think they may imply that we’re a little less buggered than the EU maximization theory does, but I don’t think this is a big difference. IMO this is not a big enough departure to do the work that your post requires.
Thanks for the reply.
I’m working on this right now, actually. Will hopefully post in a couple of weeks.
That seems reasonable. But I do think there’s a group of people who have internalized bayesian rationalism enough that the main blocker is their general epistemology, rather than the way they reason about AI in particular.
I think the point of 6 is not to say “here’s where you should end up”, but more to say “here’s the reason why this straightforward symmetry argument doesn’t hold”.
There’s still something importantly true about EU maximization and bayesianism. I think the changes we need will be subtle but have far-reaching ramifications. Analogously, relativity was a subtle change to newtonian mechanics that had far-reaching implications for how to think about reality.
Any epistemology will rule out some updates, but a problem with bayesianism is that it says there’s one correct update to make. Whereas radical probabilism, for example, still sets some constraints, just far fewer.
This sounds cool.
I think your OP didn’t give enough details as to why internalizing Bayesian rationalism leads to doominess by default. Like, Nora Belrose is firmly Bayesian and is decidedly an optimist. Admittedly, I think she doesn’t think a Kolmogorov prior is a good one, but I don’t think that makes you much more doomy either. I think Jacob Cannell and others are also Bayesian and non-doomy. Perhaps I’m using “Bayesian rationalism” differently than you are, which is why I think your claim, as I read it, is invalid.
Fair enough. However, how big is the asymmetry? I’m a bit sceptical there is a large one. Based off my interactions, it seems like ~ everyone who has seriously thought about this topic for a couple of hours has radically different models, w/ radically different levels of doominess. This holds even amongst people who share many lenses (e.g. Tyler Cowen vs Robin Hanson, Paul Christiano vs. Scott Aaronson, Steve Hsu vs Michael Nielsen etc.).
I think we’re in agreement over this. (I think Bayesianism is less wrong than EU maximization, and probably a very good approximation in lots of places, like Newtonian physics is for GR.) But my contention is over Bayesian epistemology tripping many rats up when thinking about AI x-risk. You need some story which explains why sticking to Bayesian epistemology is tripping up very many people here in particular.
Right, but in radical probabilism the type of beliefs is still a real valued function, no? Which is in tension w/ many disparate models that don’t get compressed down to a single number. In that sense, the refined formalism is still rigid in a way that your description is flexible. And I suspect the same is true for Infra-Bayesianism, though I understand that even less well than radical probabilism.
I think you’re making a good point (rationalists maybe don’t weight other opinions highly enough), but you’d get farther framing it as an update to how to use Bayesian reasoning, rather than an alternative. Bayesian reasoning has a pretty strong intuitive connection to “the factually correct way to reason”, even though there’s a ton of subtlety in that statement and how and where it’s applied.
WRT many of your arguments: base rates are increasingly just the wrong way to reason about AGI risks. We can think in more detail about how we’ll build AGI and what the risks are.
Am I misunderstanding this sentence? How do “90% doom” and the assumption that survival is the default square with one another?
Edited for clarity now.
I think they are just using that as an example of a strongly opinionated sub-agent which may be one of many different and highly specific probability assessments of doom.
As for “survival is the default assumption”—what a declaration of that implies on the surface level is that the chance of survival is overwhelming except in the case of a cataclysmic AI scenario. To put it another way:
we have a 99% chance of survival so long as we get AGI right.
To put it yet another way—Hollywood has made popular films about the human world being destroyed by Nuclear War, Climate Change, Viral Pandemic, and Asteroid Impact to name a few—different sub-agents could each give higher or lower probabilities to each of those scenarios depending on things like domain knowledge and in concert it raises the question of why we presume that survival is the default? What is the ensemble average of doom?
Is doom more or less likely than survival for any given time frame?
(Written quickly and not very carefully.)
I think it’s worth stating publicly that I have a significant disagreement with a number of recent presentations of AI risk, in particular Ajeya’s “Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover”, and Cohen et al.’s “Advanced artificial agents intervene in the provision of reward”. They focus on policies learning the goal of getting high reward. But I have two problems with this:
I expect “reward” to be a hard goal to learn, because it’s a pretty abstract concept and not closely related to the direct observations that policies are going to receive. If you keep training policies, maybe they’d converge to it eventually, but my guess is that this would take long enough that we’d already have superhuman AIs which would either have killed us or solved alignment for us (or at least started using gradient hacking strategies which undermine the “convergence” argument). Analogously, humans don’t care very much at all about the specific connections between our reward centers and the rest of our brains—insofar as we do want to influence them it’s because we care about much more directly-observable phenomena like pain and pleasure.
Even once you learn a goal like that, it’s far from clear that it’d generalize in ways which lead to power-seeking. “Reward” is not a very natural concept, it doesn’t apply outside training, and even within training it’s dependent on the specific training algorithm you use. Trying to imagine what a generalized goal of “reward” would cash out to gets pretty weird. As one example: it means that every time you deploy the policy without the intention of rewarding it, then its key priority would be convincing you to insert that trajectory into the training data. (It might be instructive to think about what the rewards would need to be for that not to happen. Below 0? But the 0 point is arbitrary...) That seems pretty noticeable! But wouldn’t it be deceptive? Well, only within the scope of its current episode, because trying to get higher reward in other episodes is never positively reinforced. Wouldn’t it learn the high-level concept of “reward” in general, in a way that’s abstracted from any specific episode? That feels analogous to a human learning to care about “genetic fitness” but not distinguishing between their own genetic fitness and the genetic fitness of other species. And remember point 1: the question is not whether the policy learns it eventually, but rather whether it learns it before it learns all the other things that make our current approaches to alignment obsolete.
At a high level, this comment is related to Alex Turner’s Reward is not the optimization target. I think he’s making an important underlying point there, but I’m also not going as far as he is. He says “I don’t see a strong reason to focus on the “reward optimizer” hypothesis.” I think there’s a pretty good reason to focus on it—namely that we’re reinforcing policies for getting high reward. I just think that other people have focused on it too much, and not carefully enough—e.g. the “without specific countermeasures” claim that Ajeya makes seems too strong, if the effects she’s talking about might only arise significantly above human level. Overall I’m concerned that reasoning about “the goal of getting high reward” is too anthropomorphic and is a bad way to present the argument to ML researchers in particular.
Putting my money where my mouth is: I just uploaded a (significantly revised) version of my Alignment Problem position paper, where I attempt to describe the AGI alignment problem as rigorously as possible. The current version only has “policy learns to care about reward directly” as a footnote; I can imagine updating it based on the outcome of this discussion though.
For someone who’s read v1 of this paper, what would you recommend as the best way to “update” to v3? Is an entire reread the best approach?
[Edit March 11, 2023: Having now read the new version in full, my recommendation to anyone else with the same question is a full reread.]
I’m not very convinced by this comment as an objection to “50% AI grabs power to get reward.” (I find it more plausible as an objection to “AI will definitely grab power to get reward.”)
This seems to be most of your position but I’m skeptical (and it’s kind of just asserted without argument):
The data used in training is literally the only thing that AI systems observe, and prima facie reward just seems like another kind of data that plays a similarly central role. Maybe your “unnaturalness” abstraction can make finer-grained distinctions than that, but I don’t think I buy it.
If people train their AI with RLDT then the AI is literally being trained to predict reward! I don’t see how this is remote, and I’m not clear if your position is that e.g. the value function will be bad at predicting reward because it is an “unnatural” target for supervised learning.
I don’t understand the analogy with humans. It sounds like you are saying “an AI system selected based on the reward of its actions learns to select actions it expects to lead to high reward” would be analogous to “humans care about the details of their reward circuitry.” But:
I don’t think human learning is just RL based on the reward circuit; I think this is at least a contrarian position and it seems unworkable to me as an explanation of human behavior.
It seems like the analogous conclusion for RL systems would be “they may not care about the rewards that go into the SGD update, they may instead care about the rewards that get entered into the dataset, or even something further causally upstream of that as long as it’s very well-correlated on the training set.” But it doesn’t matter what we choose that’s causally upstream of rewards, as long as it’s perfectly correlated on the training set?
(Or you could be saying that humans are motivated by pleasure and pain but not the entire suite of things that are upstream of reward? But that doesn’t seem right to me.)
I don’t buy it:
If people train AI systems on random samples of deployment, then “reward” does make sense—it’s just what would happen if you sampled this episode to train on. This is done in practice today and seems like a pretty good idea (since the thing you care about is precisely the performance on deployment episodes) unless you are specifically avoiding it for safety reasons.
It’s plausible that such training is happening whether or not it actually is, and so that’s a very natural objective for a system that cares about maximizing reward conditioned on an episode being selected for training.
Even if test episodes are obviously not being used in training, there are still lots of plausible-sounding generalizations of “reward” to those episodes (whether based on physical implementation, or on the selection implemented by SGD, or based on conditioning on unlikely events, or based on causal counterfactuals...) and as far as I can tell pretty much all of them lead to the same bottom line.
If the deployment distribution is sufficiently different from the training distribution that the AI no longer does something with the same upshot as maximizing reward, then it seems very likely that the resulting behaviors are worse (e.g. they would receive a lower score when evaluated by humans) and so people will be particularly likely to train on those deployment episodes in order to correct the issue.
So I think that if a system was strategically optimizing reward on the training set, it would probably either do something similar-enough-to-result-in-grabbing-power on the test set, or else it would behave badly and be corrected.
(Overall I find the other point more persuasive, though again I think it’s better as an objection to 100% than as an objection to 50%: actually optimizing reward doesn’t do much better than various soups of heuristics, and so we don’t have a strong prediction that SGD will prefer one to the other without getting into the weeds and making very uncertain claims or else taking the limit.)
RL agents receive a loss that is based on reward in training episodes. If they tried to change the probability that a given episode appeared in training, and sacrificed reward to do so, that behavior would be discouraged by SGD—it would select away from parameters that do that (since they take worse actions on the episodes that actually appear in training). So I don’t think there’s any reason to think that an RL agent would behave this way.
Instead, you should expect an RL agent to maximize reward conditioned on the episode appearing in training, because that’s what SGD would select for. I agree that we shouldn’t expect a general thematic connection to “reward” beyond what you’d expect from the mechanics of SGD.
This is true, but as far as I can see all the possible versions result in grabbing power, so I don’t think it undermines the case. I don’t know if you have particular versions in mind that wouldn’t result in either grabbing power or else bad behavior that would be corrected by further training.
The claim is that AI systems will take actions that get them a lot of reward. This doesn’t have to be based on the mechanistic claim that they are thinking about reward, it can also just be the empirical observation: somehow you selected a policy that creatively gets a lot of reward across a very broad training distribution, so in a novel situation “gets a lot of reward” is one of our most robust predictions about the behavior of the AI.
This is how Ajeya’s post goes, and it explicitly notes that the AI trained with RLDT could be optimizing something else such that they only want to get reward during training, but that this mostly just makes things worse: everything seems to lead to the same place for more or less the same reason.
I think the “soup of heuristics” stories (where the AI is optimizing something far causally upstream of reward instead of something that is downstream or close enough to be robustly correlated) don’t lead to takeover in the same way, and so that part of the objection is much stronger (though see my objections above).
I don’t feel like this argument is particularly anthropomorphic. The argument is that a policy selected for achieving X will achieve X in a new situation (which we can run for lots of different X, since there are many X that are perfectly correlated on the training set). In all the discussions I’ve been in (both with safety people and with ML people) the counterargument leans much more heavily on the analogy with humans (though I think that analogy is typically misapplied).
(Emphasis added)
I don’t think this engages with the substance of the analogy to humans. I don’t think any party in this conversation believes that human learning is “just” RL based on a reward circuit, and I don’t believe it either. “Just RL” also isn’t necessary for the human case to give evidence about the AI case. Therefore, your summary seems to me like a strawman of the argument.
I would say “human value formation mostly occurs via RL & algorithms meta-learned thereby, but in the important context of SSL / predictive processing, and influenced by inductive biases from high-level connectome topology and genetically specified reflexes and environmental regularities and...”
Furthermore, we have good evidence that RL plays an important role in human learning. For example, from The shard theory of human values:
This is incredibly weak evidence.
Animals were selected over millions of generations to effectively pursue external goals. So yes, they have external goals.
Humans also engage in within-lifetime learning, so of course you see all kinds of indicators of that in brains.
Both of those observations have high probability, so they aren’t significant Bayesian evidence for “RL tends to produce external goals by default.”
In particular, for this to be evidence for Richard’s claim, you need to say: “If RL tended to produce systems that care about reward, then RL would be significantly less likely to play a role in human cognition.” There’s some update there but it’s just not big. It’s easy to build brains that use RL as part of a more complicated system and end up with lots of goals other than reward. My view is probably the other way—humans care about reward more than I would guess from the actual amount of RL they can do over the course of their life (my guess is that other systems play a significant role in our conscious attitude towards pleasure).
Curious what systems you have in mind here.
I don’t understand why you think this explains away the evidential impact, and I guess I put way less weight on selection reasoning than you do. My reasoning here goes:
Lots of animals do reinforcement learning.
In particular, humans prominently do reinforcement learning.
Humans care about lots of things in reality, not just certain kinds of cognitive-update-signals.
“RL → high chance of caring about reality” predicts this observation more strongly than “RL → low chance of caring about reality”
This seems pretty straightforward to me, but I bet there are also pieces of your perspective I’m just not seeing.
But in particular, it doesn’t seem relevant to consider selection pressures from evolution, except insofar as we’re postulating additional mechanisms which evolution found which explain away some of the reality-caring? That would weaken (but not eliminate) the update towards “RL → high chance of caring about reality.”
I don’t see how this point is relevant. Are you saying that within-lifetime learning is unsurprising, so we can’t make further updates by reasoning about how people do it?
I’m saying that there was a missed update towards that conclusion, so it doesn’t matter if we already knew that humans do within-lifetime learning?
You seem to be saying P(humans care about the real world | RL agents usually care about reward) is low. I’m objecting, and claiming that in fact P(humans care about the real world | RL agents usually care about reward) is fairly high, because humans are selected to care about the real world and evolution can be picky about what kind of RL it does, and it can (and does) throw tons of other stuff in there.
The Bayesian update is P(humans care about the real world | RL agents usually care about reward) / P(humans care about the real world | RL agents mostly care about other stuff). So if e.g. P(humans care about the real world | RL agents don’t usually care about reward) was 80%, then your update could be at most 1.25. In fact I think it’s even smaller than that.
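The 1.25 bound can be checked with a small sketch. The only inputs taken from the comment are the 80% figure and the fact that a probability cannot exceed 1; the function names and the even (1:1) prior odds are illustrative assumptions, not anything either commenter stated.

```python
def max_bayes_factor(p_evidence_other: float) -> float:
    """Upper bound on a likelihood ratio whose denominator is p_evidence_other.

    The numerator is a probability, so it is at most 1, which caps the
    ratio at 1 / p_evidence_other.
    """
    if not 0.0 < p_evidence_other <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return 1.0 / p_evidence_other


def posterior_probability(prior_odds: float, bayes_factor: float) -> float:
    """Multiply prior odds by the Bayes factor and convert back to a probability."""
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    bf = max_bayes_factor(0.8)  # 80% figure from the comment
    print(f"largest possible update: {bf:.2f}")
    # Assumed even prior odds (1:1), purely for illustration.
    print(f"posterior from a 50% prior: {posterior_probability(1.0, bf):.3f}")
```

With those assumed even odds, the largest possible update moves a 50% credence to roughly 56%, which gives a sense of how small a factor of 1.25 is.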
And then if you try to turn that into evidence about “reward is a very hard concept to learn,” or a prediction about how neural nets trained with RL will behave, it’s moving my odds ratios by less than 10% (since we are using “RL” quite loosely in this discussion, and there are lots of other differences and complications at play, all of which shrink the update).
You seem to be saying “yes but it’s evidence,” which I’m not objecting to—I’m just saying it’s an extremely small amount of evidence. I’m not clear on whether you agree with my calculation.
(Some of the other text I wrote was about a different argument you might be making: that P(humans use RL | RL agents usually care about reward) is significantly lower than P(humans use RL | RL agents mostly care about other stuff), because evolution would then have never used RL. My sense is that you aren’t making this argument so you should ignore all of that, sorry to be confusing.)
Just saw this reply recently. Thanks for leaving it, I found it stimulating.
(I wrote the following rather quickly, in an attempt to write anything at all, as I find it not that pleasant to write LW comments—no offense to you in particular. Apologies if it’s confusing or unclear.)
Yes, in large part.
Yeah, are people differentially selected for caring about the real world? At the risk of seeming facile, this feels non-obvious. My gut take is that conditional on RL agents usually caring about reward (and thus setting aside a bunch of my inside-view reasoning about how RL dynamics work), conditional on that—reward-humans could totally have been selected for.
This would drive up P(humans care about reward | RL agents care about reward, humans were selected by evolution), and thus (I think?) drive down P(humans care about the real world | RL agents usually care about reward).
POV: I’m in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don’t care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).
In what way is my fitness lower than someone who really cares about these things, given that the best way to get rewards may well be to actually do the things?
Here are some ways I can think of:
Caring about reward directly makes reward hacking a problem evolution has to solve, and if it doesn’t solve it properly, the person ends up masturbating and not taking (re)productive actions.
Counter-counterpoint: But also many people do in fact enjoy masturbating, even though it seems (to my naive view) like an obvious thing to select away, which was present ancestrally.
People seem to be able to tell when you don’t really care about them, and just want to use them to get things.
But if I valued the rewarding feeling of having friends and spending time with them and watching them succeed—I, personally, feel good and happy when I am hanging out with my friends—then I still would be instrumentally aligned with my friends. If so, there would be no selection pressure for reward-motivation detectors, in the way that people were selected into noticing deception (even if not hardcoded to do so).
Overall, I feel like people being selected by evolution is an important qualifier which ends up changing the inferences we make via e.g. P(care about the real world | ), and I think I’ve at least somewhat neglected this qualifier. But I think I’d estimate P(humans care about the real world | RL agents usually care about reward) as… .2? I’d feel slightly more surprised by that than by the specific outcome of two coinflips coming up heads.
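As a rough illustration of the coinflip comparison, here is a small sketch that converts both probabilities into surprisal measured in bits. The helper name is hypothetical, and reading "surprise" as surprisal (negative log probability) is an assumption of this sketch, not something stated in the comment.

```python
import math


def surprisal_bits(p: float) -> float:
    """Surprisal of an outcome with probability p, in bits: -log2(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(p)


estimate = surprisal_bits(0.2)    # the .2 estimate from the comment
two_heads = surprisal_bits(0.25)  # two fair coinflips both coming up heads

print(estimate, two_heads)
print(estimate - two_heads)  # extra surprise relative to two heads
```

On this reading, a probability of .2 is a bit over two bits of surprise, slightly more than the two bits for two heads in a row.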
I think that there are serious path-dependent constraints in evolution, not a super clear/strong/realizable fitness gradient away from caring about reward if that’s how RL agents usually work, and so I expect humans to be relatively typical along this dimension, with some known unknowns around whatever extra circuitry evolution wrote into us.
Would such a person sacrifice themselves for their children (in situations where doing so would be a fitness advantage)?
I think this highlights a good counterpoint. I think this alternate theory predicts “probably not”, although I can contrive hypotheses for why people would sacrifice themselves (because they have learned that high-status → reward; and it’s high-status to sacrifice yourself for your kid). Or because keeping your kid safe → high reward as another learned drive.
Overall this feels like contortion but I think it’s possible. Maybe overall this is a… 1-bit update against the “not selection for caring about reality” point?
I don’t know what this means. Suppose we have an AI which “cares about reward” (as you think of it in this situation). The “episode” consists of the AI copying its network & activations to another off-site server, and then the original lab blows up. The original reward register no longer exists (it got blown up), and the agent is not presently being trained by an RL alg.
What is the “reward” for this situation? What would have happened if we “sampled” this episode during training?
I agree there are all kinds of situations where the generalization of “reward” is ambiguous and lots of different things could happen. But it has a clear interpretation for the typical deployment episode since we can take counterfactuals over the randomization used to select training data.
It’s possible that agents may specifically want to navigate towards situations where RL training is not happening and the notion of reward becomes ambiguous, and indeed this is quite explicitly discussed in the document Richard is replying to.
As far as I can tell the fact that there exist cases where different generalizations of reward behave differently does not undermine the point at all.
Yeah, I think I was wondering about the intended scoping of your statement. I perceive myself to agree with you that there are situations (like LLM training to get an alignment research assistant) where “what if we had sampled during training?” is well-defined and fine. I was wondering if you viewed this as a general question we could ask.
I also agree that Ajeya’s post addresses this “ambiguity” question, which is nice!
It’s intended as an objection to “AI grabs power to get reward is the central threat model to focus on”, but I think our disagreements still apply given this. (FWIW my central threat model is that policies care about reward to some extent, but that the goals which actually motivate them to do power-seeking things are more object-level.)
I expect policies to be getting rich input streams like video, text, etc, which they use to make decisions. Reward is different from other types of data because reward isn’t actually observed as part of these input streams by policies during episodes. This makes it harder to learn as a goal compared with things that are more directly observable (in a similar way to how “care about children” is an easier goal to learn than “care about genes”).
I don’t think this line of reasoning works, because “the episode appearing in training” can be a dependent variable. For example, consider an RL agent that’s credibly told that its data is not going to be used for training unless it misbehaves badly. An agent which maximizes reward conditional on the episode appearing in training is therefore going to misbehave the minimum amount required to get its episode into the training data (and more generally, behave like getting its episode into training is a terminal goal). This seems very counterintuitive.
Some versions that wouldn’t result in power-grabbing:
Goal is “get highest proportion of possible reward”; the policy might rewrite the training algorithm to be myopic, then get perfect reward for one step, then stop.
Goal is “care about (not getting low rewards on) specific computers used during training”; the policy might destroy those particular computers, then stop.
Goal is “impress the critic”; the policy might then rewrite its critic to always output high reward, then stop.
Goal is “get high reward myself this episode”; the policy might try to do power-seeking things but never create more copies of itself, and eventually lose coherence + stop doing stuff.
I don’t think any of these are particularly likely, the point is more that “high reward via tampering with/taking over the training setup” is a fairly different type of thing from “high reward via actually performing tasks”, and it’s a non-trivial and quite specific hypothesis that the latter will generalize to the former (in the regimes we’re concerned about).
Are you imagining that such systems get meaningfully low reward on the training distribution because they are pursuing those goals, or that these goals are extremely well-correlated with reward on the training distribution and only come apart at test time? Is the model deceptively aligned?
Children vs genes doesn’t seem like a good comparison, it seems obvious that models will understand the idea of reward during training (whereas humans don’t understand genes during evolution). A better comparison might be “have children during my life” vs “have a legacy after I’m gone,” but in fact humans have both goals even though one is never directly observed.
I guess more importantly, I don’t buy the claim about “things you only get selected on are less natural as goals than things you observe during episodes,” especially if your policy is trained to make good predictions of reward. I don’t know if there’s a specific reason for your view, or this is just a clash of intuitions. It feels to me like my position is kind of the default, in that you are offering a feature and saying it is a major consideration that SGD wouldn’t learn a particular kind of cognition.
If the agent misbehaves so that its data will be used for training, then the misbehaving actions will get a low reward. So SGD will shift from “misbehave so that my data will be used for training” to “ignore the effect of my actions on whether my data will be used for training, and just produce actions that would result in a high reward assuming that this data is used for training.”
I agree the behavior of the model isn’t easily summarized by an English sentence. The thing that seems most clear is that the model trained by SGD will learn not to sacrifice reward in order to increase the probability that the episode is used in training. If you think that’s wrong I’m happy to disagree about it.
Every one of your examples results in the model grabbing power and doing something bad at deployment time, though I agree that it may not care about holding power. (And maybe it doesn’t even have to grab power if you don’t have any countermeasures at test time? But then your system seems useless.)
But if you get frustrated by your AI taking over the datacenter and train it not to do that, as discussed in Ajeya’s story here, then you are left with an AI that grabs power and holds it.
I don’t find this super compelling. There is a long causal chain from “Do the task” to “Humans think you did the task” to “You get high reward” to “You are selected by SGD.” By default there will be lots of datapoints that distinguish the late stages on the path from the early stages, since human evaluations do not perfectly track any kind of objective evaluation of “did you actually do the task?” So a model that cares only about doing the task and not about human evaluations will underperform a model that cares about human evaluations. And then a model that cares about human evaluations can then manipulate those evaluations via grabbing power and controlling human observations of the situation (and if it cares about something even later then it can also e.g. kill and replace the humans, but that’s not important to this argument).
So the implicit alternative you mention here just won’t get a low loss, and it’s really a question about whether SGD will find a lower loss policy rather than one about how it generalizes.
And once that’s the discussion, someone saying “well there is a policy that gets lower loss and doesn’t look very complicated, so I guess SGD has a good chance of finding it” seems like they are on the winning side of SGD. Your claim seems to be that models will fail to get low loss on the training set in a specific way, and I think saying “it’s a non-trivial and quite specific hypothesis that you will get a low loss” would be an inappropriate burden-of-proof shifting.
I agree that it’s unclear whether this will actually happen prior to models that are smart enough to obsolete human work on alignment, but I really think we’re looking at more like a very natural hypothesis with 50-50 chances (and at the very least the way you are framing your objection is not engaging with the arguments in the particular post of Ajeya’s that you are responding to).
I also agree that the question of whether the model will mess with the training setup or use power for something else is very complicated, but it’s not something that is argued for in Ajeya’s post.
I think the crucial thing that makes this robust is “if the model grabs power and does something else, then humans will train it out unless they decide doing so is unsafe.” I think one potentially relevant takeaway is: (i) it’s plausible that AI systems will be motivated to grab power in non-catastrophic ways which could give opportunities to correct course, (ii) that isn’t automatically guaranteed by any particular honeypot, e.g. you can’t just give a high reward and assume that will motivate the system.
Reading back over this now, I think we’re arguing at cross purposes in some ways. I should have clarified earlier that my specific argument was against policies learning a terminal goal of reward that generalizes to long-term power-seeking.
I do expect deceptive alignment after policies learn other broadly-scoped terminal goals and realize that reward-maximization is a good instrumental strategy. So all my arguments about the naturalness of reward-maximization as a goal are focused on the question of which type of terminal goal policies with dangerous levels of capabilities learn first.* Let’s distinguish three types (where “myopic” is intended to mean something like “only cares about the current episode”).
1. Non-myopic misaligned goals that lead to instrumental reward maximization (deceptive alignment)
2. Myopic terminal reward maximization
3. Non-myopic terminal reward maximization
Either 1 or 2 (or both of them) seems plausible to me. 3 is the one I’m skeptical about. How come?
We should expect models to have fairly robust terminal goals (since, unlike beliefs or instrumental goals, terminal goals shouldn’t change quickly with new information). So once they understand the concept of reward maximization, it’ll be easier for them to adopt it as an instrumental strategy than a terminal goal. (An analogy to evolution: once humans construct highly novel strategies for maximizing genetic fitness (like making thousands of clones) people are more likely to do it for instrumental reasons than terminal reasons.)
Even if they adopt reward maximization as a terminal goal, they’re more likely to adopt a myopic version of it than a non-myopic version, since (I claim) the concept of reward maximization doesn’t generalize very naturally to larger scales. Above, you point out that even relatively myopic reward maximization will lead to limited takeover, and so we’ll train subsequent agents to be less myopic. But it seems to me that the selection pressure generated by a handful of examples of real-world attempted takeovers is very small, compared with other aspects of training; and that even if it’s significant, it may just teach agents specific constraints like “don’t take over datacenters”.
* Now that I say that, I notice that I’m also open to the possibility of policies learning deceptively-aligned goals first, then gradually shifting from reward-maximization as an instrumental goal to reward-maximization as a terminal goal. But let’s focus for now on which goals are learned first.
Why does it not lead to takeover in the same way?
Because it’s easy to detect and correct (except that correcting it might push you into one of the other regimes).
So far causally upstream of the human evaluator’s opinion? Eg an AI counselor optimizing for getting to know you
Note that the “without countermeasures” post consistently discusses both possibilities (the model cares about reward or the model cares about something else that’s consistent with it getting very high reward on the training dataset). E.g. see this paragraph from the above-the-fold intro:
As well as the section Even if Alex isn’t “motivated” to maximize reward.… I do place a ton of emphasis on the fact that Alex enacts a policy which has the empirical effect of maximizing reward, but that’s distinct from being confident in the motivations that give rise to that policy. I believe Alex would try very hard to maximize reward in most cases, but this could be for either terminal or instrumental reasons.
With that said, for roughly the reasons Paul says above, I think I probably do have a disagreement with Richard—I think that caring about some version of reward is pretty plausible (~50% or so). It seems pretty natural and easy to grasp to me, and because I think there will likely be continuous online training the argument that there’s no notion of reward on the deployment distribution doesn’t feel compelling to me.
Yepp, agreed, the thing I’m objecting to is how you mainly focus on the reward case, and then say “but the same dynamics apply in other cases too...”
The problem is that you need to reason about generalization to novel situations somehow, and in practice that ends up being by reasoning about the underlying motivations (whether implicitly or explicitly).
I agree with your general point here, but I think Ajeya’s post actually gets this right, eg
and
I also think that often “the AI just maximizes reward” is a useful simplifying assumption. That is, we can make an argument of the form “even if the AI just maximizes reward, it still takes over; if it maximizes some correlate of the reward instead, then we have even less control over what it does and so are even more doomed”.
(Though of course it’s important to spell the argument out)
Yeah, I agree this is a good argument structure—in my mind, maximizing reward is both a plausible case (which Richard might disagree with) and the best case (conditional on it being strategic at all and not a bag of heuristics), so it’s quite useful to establish that it’s doomed; that’s the kind of structure I was going for in the post.
I strongly disagree with the “best case” thing. Like, policies could just learn human values! It’s not that implausible.
If I had to try to point to the crux here, it might be “how much selection pressure is needed to make policies learn goals that are abstractly related to their training data, as opposed to goals that are fairly concretely related to their training data?” Where we both agree that there’s some selection pressure towards reward-like goals, and it seems like you expect this to be enough to lead policies to behavior that violates all their existing heuristics, whereas I’m more focused on the regime where there are lots of low-hanging fruit in terms of changes that would make a policy more successful, and so the question of how easy that goal is to learn from its training data is pretty important. (As usual, there’s the human analogy: our goals are very strongly biased towards things we have direct observational access to!)
Even setting aside this disagreement, though, I don’t like the argumentative structure because the generalization of “reward” to large scales is much less intuitive than the generalization of other concepts (like “make money”) to large scales—in part because directly having a goal of reward is a kinda counterintuitive self-referential thing.
Yes, sorry, “best case” was oversimplified. What I meant is that generalizing to want reward is in some sense the model generalizing “correctly;” we could get lucky and have it generalize “incorrectly” in an important sense in a way that happens to be beneficial to us. I discuss this a bit more here.
I don’t understand why reward isn’t something the model has direct access to—it seems like it basically does? If I had to say which of us were focusing on abstract vs concrete goals, I’d have said I was thinking about concrete goals and you were thinking about abstract ones, so I think we have some disagreement of intuition here.
Yeah, I don’t really agree with this; I think I could pretty easily imagine being an AI system asking the question “How much reward would this episode get if it were sampled for training?” It seems like the intuition that this is weird and unnatural is doing a lot of work in your argument, and I don’t really share it.
AFAIK the reward signal is not typically included as an input to the policy network in RL. Not sure why, and I could be wrong about that, but that is not my main question. The bigger question is “Has direct access to when?”
At the moment in time when the model is making a decision, it does not have direct access to the decision-relevant reward signal because that reward is typically causally downstream of the model’s decision. That reward may not even have a definite value until after decision time. Whereas concrete observables like “shiny gold coins” and “the finish line straight ahead” and “my opponent is in check” (and other abstractions in the model’s ontology that are causally upstream from reward in reality) are readily available at decision time. It seems to me that that makes them natural candidates for credit assignment to flag early on as the reward-responsible mental events and reinforce into stable motivations, since they in fact were the factors that determined the decisions that led to rewards.
IME, the most straightforward way for reward-itself to become the model’s primary goal would be if the model learns to base its decisions on an accurate reward-predictor much earlier than it learns to base its decisions on other (likely upstream) factors. If it instead learns how to accurately predict reward-itself after it is already strongly motivated by some concrete observables, I don’t see why we should expect it to dislodge that motivation, despite the true fact that those concrete observables are only pretty correlated with reward whereas an accurate reward-predictor is perfectly correlated with reward. Why? Because the model currently doesn’t care about reward-itself, it currently cares about the concrete observable(s), so it has no reason to take actions that would override that goal, and it has positive goal-content integrity reasons to not take those actions.
See also: Inner and outer alignment decompose one hard problem into two extremely hard problems (in particular: Inner alignment seems anti-natural).
Five clusters of alignment researchers
Very broadly speaking, alignment researchers seem to fall into five different clusters when it comes to thinking about AI risk:
MIRI cluster. Think that P(doom) is very high, based on intuitions about instrumental convergence, deceptive alignment, etc. Does work that’s very different from mainstream ML. Central members: Eliezer Yudkowsky, Nate Soares.
Structural risk cluster. Think that doom is more likely than not, but not for the same reasons as the MIRI cluster. Instead, this cluster focuses on systemic risks, multi-agent alignment, selective forces outside gradient descent, etc. Often work that’s fairly continuous with mainstream ML, but willing to be unusually speculative by the standards of the field. Central members: Dan Hendrycks, David Krueger, Andrew Critch.
Constellation cluster. More optimistic than either of the previous two clusters. Focuses more on risk from power-seeking AI than the structural risk cluster, but does work that is more speculative or conceptually-oriented than mainstream ML. Central members: Paul Christiano, Buck Shlegeris, Holden Karnofsky. (Named after Constellation coworking space.)
Prosaic cluster. Focuses on empirical ML work and the scaling hypothesis, is typically skeptical of theoretical or conceptual arguments. Short timelines in general. Central members: Dario Amodei, Jan Leike, Ilya Sutskever.
Mainstream cluster. Alignment researchers who are closest to mainstream ML. Focuses much less on backchaining from specific threat models and more on promoting robustly valuable research. Typically more concerned about misuse than misalignment, although worried about both. Central members: Scott Aaronson, David Bau.
Remember that any such division will be inherently very lossy, and please try not to overemphasize the differences between the groups, compared with the many things they agree on.
Depending on how you count alignment researchers, the relative size of each of these clusters might fluctuate, but on a gut level I think I treat all of them as roughly the same size.
(COI note: I work at OpenAI. These are my personal views, though.)
My quick take on the “AI pause debate”, framed in terms of two scenarios for how the AI safety community might evolve over the coming years:
AI safety becomes the single community that’s the most knowledgeable about cutting-edge ML systems. The smartest up-and-coming ML researchers find themselves constantly coming to AI safety spaces, because that’s the place to go if you want to nerd out about the models. It feels like the early days of hacker culture. There’s a constant flow of ideas and brainstorming in those spaces; the core alignment ideas are standard background knowledge for everyone there. There are hackathons where people build fun demos, and people figuring out ways of using AI to augment their research. Constant interactions with the models allow people to gain really good hands-on intuitions about how they work, which they leverage into doing great research that helps us actually understand them better. When the public ends up demanding regulation, there’s a large pool of competent people who are broadly reasonable about the risks, and can slot into the relevant institutions and make them work well.
AI safety becomes much more similar to the environmentalist movement. It has broader reach, but alienates a lot of the most competent people in the relevant fields. ML researchers who find themselves in AI safety spaces are told they’re “worse than Hitler” (which happened to a friend of mine, actually). People get deontological about AI progress: some hesitate to pay for ChatGPT because it feels like they’re contributing to the problem (another true story); the dynamics around this look similar to environmentalists refusing to fly places. Others overemphasize the risks of existing models in order to whip up popular support. People are sucked into psychological doom spirals similar to how many environmentalists think about climate change: if you’re not depressed then you obviously don’t take it seriously enough. Just like environmentalists often block some of the most valuable work on fixing climate change (e.g. nuclear energy, geoengineering, land use reform), safety advocates block some of the most valuable work on alignment (e.g. scalable oversight, interpretability, adversarial training) due to acceleration or misuse concerns. Of course, nobody will say they want to dramatically slow down alignment research, but there will be such high barriers to researchers getting and studying the relevant models that it has similar effects. The regulations that end up being implemented are messy and full of holes, because the movement is more focused on making a big statement than figuring out the details.
Obviously I’ve exaggerated and caricatured these scenarios, but I think there’s an important point here. One really good thing about the AI safety movement, until recently, is that the focus on the problem of technical alignment has nudged it away from the second scenario (although it wasn’t particularly close to the first scenario either, because the “nerding out” was typically more about decision theory or agent foundations than ML itself). That’s changed a bit lately, in part because a bunch of people seem to think that making technical progress on alignment is hopeless. I think this is just not an epistemically reasonable position to take: history is full of cases where even leading experts dramatically underestimated the growth of scientific knowledge, and its ability to solve big problems. Either way, I do think public advocacy for strong governance measures can be valuable, but I also think that “pause AI” advocacy runs the risk of pushing us towards scenario 2. Even if you think that’s a cost worth paying, I’d urge you to think about ways to get the benefits of the advocacy while reducing that cost and keeping the door open for scenario 1.
FYI I think this is worth fleshing out into a top level post (esp. given that it’s ‘Pause Debate’ week).
I’m not actually sure it needs much fleshing out. I think the main bit here that feels unjustified, or insufficiently-justified for the strength of the claim, is:
Have edited slightly to clarify that it was “leading experts” who dramatically underestimated it. I’m not really sure what else to say, though...
I think I basically agree with you Richard about the risks of falling into scenario 2, and think this is a wise comment, but I also think you are strawmanning the reason for the change—it’s not that people have come to think that making technical progress is hopeless (or even harder than it used to be!) it’s rather that people have come to have shorter timelines, and so the probability that sufficient technical progress will be made in time has gone down, and the usefulness of calling for a pause has gone up. (e.g. if you think AGI is 15 years away, then pausing now is plausibly useless or even harmful.)
That’s my theory at any rate. And it’s sorta what I think, I think.
Oh, also, I think that the counterproductive rot in environmentalism took at least 5 years to build up, probably, and I’m hopeful that therefore even if we are on a path to rot, it’ll take too long for the rot to build up to matter. But this is just a guess about the growth rate of rot which is informed by anecdotes like the stories about what happened to your friends, and over the coming months and years more data will be collected to better calibrate my guess about the rate of rot.
I appreciated seeing the caricature of 1 presented at least, as a dream, it feels attainable, or like it might have been in some other timeline, but perhaps in ours too, for all I know.
I haven’t yet read through them thoroughly, but these four papers by Oliver Richardson are pattern-matching to me as potentially very exciting theoretical work.
tl;dr: probabilistic dependency graphs (PDGs) are directed graphical models designed to be able to capture inconsistent beliefs (paper 1). The definition of inconsistency is a natural one which allows us to, for example, reframe the concept of “minimizing training loss” as “minimizing inconsistency” (paper 2). They provide an algorithm for inference in PDGs (paper 3) and an algorithm for learning via locally minimizing inconsistency which unifies several other algorithms (like the EM algorithm, message-passing, and generative adversarial training) (paper 4).
Oliver is an old friend of mine (which is how I found out about these papers) and a final-year PhD student at Cornell under Joe Halpern.
FWIW Oliver’s presentation of (some fragment of) his work at ILIAD was my favorite of all the talks I attended at the conference.
Just read Bostrom’s Deep Utopia (though not too carefully). The book is structured with about half being transcripts of fictional lectures given by Bostrom at Oxford, about a quarter being stories about various woodland creatures striving to build a utopia, and another quarter being various other vignettes and framing stories.
Overall, I was a bit disappointed. The lecture transcripts touch on some interesting ideas, but Bostrom’s style is generally one which tries to classify and taxonomize, rather than characterize (e.g. he has a long section trying to analyze the nature of boredom). I think this doesn’t work very well when describing possible utopias, because they’ll be so different from today that it’s hard to extrapolate many of our concepts to that point, and also because the hard part is making it viscerally compelling.
The stories and vignettes are somewhat esoteric; it’s hard to extract straightforward lessons from them. My favorite was a story called The Exaltation of ThermoRex, about an industrialist who left his fortune to the benefit of his portable room heater, leading to a group of trustees spending many millions of dollars trying to figure out (and implement) what it means to “benefit” a room heater.
Tangentially related (spoilers for Worth the Candle):
I think it’d be hard to do a better cohesive depiction of Utopia than the end of Worth the Candle by A Wales. I mean, I hope someone does do it, I just think it’ll be challenging to do!
Strong agree, also I spoiler-texted it, hope you don’t mind.
Any opinions on how it compares to Fun Theory? (Though that’s less about all of utopia, it is still a significant part)
If you haven’t read CEV, I strongly recommend doing so. It resolved some of my confusions about utopia that were unresolved even after reading the Fun Theory sequence.
Specifically, I had an aversion to the idea of being in a utopia because “what’s the point, you’ll have everything you want”. The concrete pictures that Eliezer gestures at in the CEV document do engage with this confusion, and gesture at the idea that we can have a utopia where the AI does not simply make things easy for us, but perhaps just puts guardrails onto our reality, such that we don’t die, for example, but we do have the option to struggle to do things by ourselves.
Yes, the Fun Theory sequence tries to communicate this point, but it didn’t make sense to me until I could conceive of an ASI singleton that could actually simply not help us.
I dropped the book within the first chapter. For one, I found the way Bostrom opened the chapter very defensive and self-conscious. I imagine that even Yudkowsky wouldn’t start a hypothetical 2025 book with fictional characters caricaturing him. Next, I felt like I didn’t really know what the book was covering in terms of subject matter, and I wasn’t convinced it was interesting enough to continue down the meandering path Nick Bostrom seems to have laid out before me.
Eliezer’s CEV document and the Fun Theory sequence were significantly more pleasant experiences, based on my memory.
A possible way to convert money to progress on alignment: offering a large (recurring) prize for the most interesting failures found in the behavior of any (sufficiently-advanced) model. Right now I think it’s very hard to find failures which will actually cause big real-world harms, but you might find failures in a way which uncovers useful methodologies for the future, or at least train a bunch of people to get much better at red-teaming.
(For existing models, it might be more productive to ask for “surprising behavior” rather than “failures” per se, since I think almost all current failures are relatively uninteresting. Idk how to avoid inspiring capabilities work, though… but maybe understanding models better is robustly good enough to outweigh that?)
I like this. Would this have to be publicly available models? Seems kind of hard to do for private models.
What kind of access might be needed to private models? Could there be a secure multi-party computation approach that is sufficient?
Ideas for defining “surprising”? If we’re trying to create a real incentive, people will want to understand the resolution criteria.
Here’s a (messy, haphazard) list of ways a group of idealized agents could merge into a single agent:
Proposal 1: they merge into an agent which maximizes a weighted sum of their utilities. They decide on the weights using some bargaining solution.
Objection 1: this is not Pareto-optimal in the case where the starting agents have different beliefs. In that case we want:
Proposal 2: they merge into an agent which maximizes a weighted sum of their utilities, where those weights are originally set by bargaining but evolve over time depending on how accurately each original agent predicted the future.
Objection 2: this loses out on possible gains from acausal trade. E.g. if a paperclip-maximizer finds itself in a universe where it’s hard to make paperclips but easy to make staples, it’d like to be able to give resources to staple-maximizers in exchange for them building more paperclips in universes where that’s easier. This requires a kind of updateless decision theory:
Proposal 3: they merge into an agent which maximizes a weighted sum of their utilities (with those weights evolving over time), where the weights are set by bargaining subject to the constraint that each agent obeys commitments that logically earlier versions of itself would have made.
Objection 3: this faces the commitment races problem, where each agent wants to make earlier and earlier commitments to only accept good deals.
Proposal 4: same as proposal 3 but each agent also obeys commitments that they would have made from behind a veil of ignorance where they didn’t yet know who they were or what their values were. From that position, they wouldn’t have wanted to do future destructive commitment races.
Objection 4: as we take this to the limit we abstract away every aspect of each agent—their values, beliefs, position in the world, etc—until everything is decided by their prior from behind a veil of ignorance. But when you don’t know who you are, or what your values are, how do you know what your prior is?
Proposal 5: all these commitments are only useful if they’re credible to other agents. So, behind the veil, choose a Schelling prior which is both clearly non-cherrypicked and also easy for a wide range of agents to reason about. In other words, choose the prior which is most conducive to cooperation across the multiverse.
Okay, so basically we’ve ended up describing not just an ideal agent, but the ideal agent. The cost of this, of course, is that we’ve made it totally computationally intractable. In a later post I’ll describe some approximations which might make it more relevant.
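As a rough illustration of Proposals 1 and 2 only (not the updateless proposals that follow them), the sketch below picks the action that maximizes a weighted sum of utilities, then rescales each agent's weight by the probability that agent assigned to what actually happened. The multiplicative update rule, the function names, and the paperclip/staple numbers are assumptions made for illustration; the text above only says that the weights evolve with predictive accuracy.

```python
from typing import Callable, List, Sequence


def merged_choice(
    weights: Sequence[float],
    utilities: Sequence[Callable[[str], float]],
    actions: Sequence[str],
) -> str:
    """Proposal 1: choose the action maximizing the weighted sum of utilities."""
    return max(
        actions,
        key=lambda a: sum(w * u(a) for w, u in zip(weights, utilities)),
    )


def update_weights(
    weights: Sequence[float], probs_of_outcome: Sequence[float]
) -> List[float]:
    """Proposal 2 (assumed rule): scale each weight by the probability the
    corresponding agent gave to the observed outcome, then renormalize."""
    scaled = [w * p for w, p in zip(weights, probs_of_outcome)]
    total = sum(scaled)
    if total == 0:
        raise ValueError("every agent assigned zero probability to the outcome")
    return [s / total for s in scaled]


# Hypothetical agents: one values paperclips, the other staples.
def paperclip_utility(action: str) -> float:
    return 1.0 if action == "make_clips" else 0.0


def staple_utility(action: str) -> float:
    return 1.0 if action == "make_staples" else 0.0


utilities = [paperclip_utility, staple_utility]
actions = ["make_clips", "make_staples"]

weights = [0.6, 0.4]  # initial weights, assumed to come from bargaining
choice_before = merged_choice(weights, utilities, actions)

# The staple agent predicted the observed event much better than the paperclip agent.
weights_after = update_weights(weights, [0.2, 0.9])
choice_after = merged_choice(weights_after, utilities, actions)
```

In this toy run the merged agent initially favors paperclips, and after the staple agent's better prediction the weights shift enough that it favors staples instead.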
Nice!
I don’t think this solves Commitment Races in general, because of two different considerations:
Trivially, I can say that you still have the problem when everyone needs to bootstrap a Schelling veil of ignorance.
Less trivially, even behind the most simple/Schelling veils of ignorance, I find it likely that hawkish commitments are incentivized. For example, the veil might say that you might be Powerful agent A, or Weak agent B, and if some Powerful agents have weird enough utilities (and this seems likely in a big pool of agents), hawkishly committing in case you are A will be a net-positive bet.
This might still mostly solve Commitment Races in our particular multi-verse. I have intuitions both for and against this bootstrapping being possible. I’d be interested to hear yours.
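One simple way to read the "net-positive bet" claim is as an expected-value comparison made behind the veil. The sketch below assumes a single shared utility scale and just two roles, and the function name and all numbers are hypothetical; it is only meant to show how a large enough gain for the Powerful role can outweigh a small probability of occupying it.

```python
def hawkish_bet_value(
    p_powerful: float, gain_if_powerful: float, loss_if_weak: float
) -> float:
    """Expected value, behind the veil, of committing hawkishly.

    p_powerful: chance of turning out to be the Powerful agent A.
    gain_if_powerful: what A gains from the hawkish commitment.
    loss_if_weak: what the Weak agent B loses because of it.
    """
    return p_powerful * gain_if_powerful - (1 - p_powerful) * loss_if_weak


# With a modest gain, the bet is not worth it at a 1% chance of being A...
modest = hawkish_bet_value(p_powerful=0.01, gain_if_powerful=100.0, loss_if_weak=5.0)

# ...but a weird enough utility function (a very large gain) flips the sign.
extreme = hawkish_bet_value(p_powerful=0.01, gain_if_powerful=1000.0, loss_if_weak=5.0)
```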
I don’t understand your point here, explain?
This seems to be claiming that in some multiverses, the gains to powerful agents from being hawkish outweigh the losses to weak agents. But then why is this a problem? It just seems like the optimal outcome.
Say there are 5 different veils of ignorance (priors) that most minds consider Schelling (you could try to argue there will be exactly one, but I don’t see why).
If everyone simply accepted exactly the same one, then yes, lots of nice things would happen and you wouldn’t get catastrophically inefficient conflict.
But every one of these 5 priors will have different outcomes when it is implemented by everyone. For example, maybe in prior 3 agent A is slightly better off and agent B is slightly worse off.
So you need to give me a reason why a commitment race doesn’t recur in the level of “choosing which of the 5 priors everyone should implement”. That is, maybe A will make a very early commitment to only ever implement prior 3. As always, this is rational if A thinks the others will react a certain way (give in to the threat and implement 3). And I don’t have a reason to expect agents not to have such priors (although I agree they are slightly less likely than more common-sensical priors).
That is, as always, the commitment races problem doesn’t have a general solution on paper. You need to get into the details of our multi-verse and our agents to argue that they won’t have these crazy priors and will coordinate well.
It seems likely that in our universe there are some agents with arbitrarily high gains-from-being-hawkish, that don’t have correspondingly arbitrarily low measure. (This is related to Pascalian reasoning, see Daniel’s sequence.) For example, someone whose utility is exponential on number of paperclips. I don’t agree that the optimal outcome (according to my ethics) is for me (whose utility is at most linear on happy people) to turn all my resources into paperclips.
Maybe if I was a preference utilitarian biting enough bullets, this would be the case. But I just want happy people.
I may be missing background concepts, but I don’t see how Proposal 5 is really responding to Objection 4.
It seems to me that the agent’s strategy in the limit will either be null action or evolution-dictated action, not sure which. That is, “in a universe where it’s easy to do A the agent will choose to do A” somewhat implies “actions will be chosen according to how easy it is for an agent doing A to gain more optimization power”, which is essentially evolution.
I recently had a very interesting conversation about master morality and slave morality, inspired by the recent AstralCodexTen posts.
The position I eventually landed on was:
Empirically, it seems like the world is not improved the most by people whose primary motivation is helping others, but rather by people whose primary motivation is achieving something amazing. If this is true, that’s a strong argument against slave morality.
The defensibility of morality as the pursuit of greatness depends on how sophisticated our cultural conceptions of greatness are. Unfortunately we may be in a vicious spiral where we’re too entrenched in slave morality to admire great people, which makes it harder to become great, which gives us fewer people to admire, which… By contrast, I picture past generations as being in a constant aspirational dialogue about what counts as greatness—e.g. defining concepts like honor, Aristotelean magnanimity (“greatness of soul”), etc.
I think of master morality as a variant of virtue ethics which is particularly well-adapted to domains which have heavy positive tails—entrepreneurship, for example. However, in domains which have heavy negative tails, the pursuit of greatness can easily lead to disaster. In those domains, the appropriate variant of virtue ethics is probably more like Buddhism: searching for equanimity or “green”. In domains which have both (e.g. the world as a whole) the closest thing I’ve found is the pursuit of integrity and attunement to oneself. So maybe that’s the thing that we need a cultural shift towards understanding better.
If the following correlations are true, then the opposite may be true (slave morality being better for improving the world through history):
Improving the world being strongly correlated with economic growth (this is probably less true when X-risks are significant)
Economic growth being strongly correlated with Entrepreneurship incentives (property rights, autonomy, fairness, meritocracy, low rents)
Master morality being strongly correlated with acquiring power and thus decreasing the power of others and decreasing their entrepreneurship incentives
That all sounds very plausible. But isn’t this all mostly relevant before AGI is a possibility? That would be a heavy negative tail risk, in which people motivated to “do great things” are quite prone to get us all killed. Should we survive that risk, progress probably mostly won’t be driven by humans, so humans doing great things will barely count. If humans are actually still in charge when we hit ASI, it seems like doing great things with them will probably still have large tail risks (inter-ASI wars).
Right? Or do you see it differently?
It’s a fascinating empirical claim that sounds right now that I hear it.
AGI is heavy-tailed in both directions I think. I don’t think we get utopias by default even without misalignment, since governance of AGI is so complicated.
Re: your point #2, there is another potential spiral where abstract concepts of “greatness” are increasingly defined in a hostile and negative way by partisans of slave morality. This might make it harder to have that “aspirational dialogue about what counts as greatness”, as it gets increasingly difficult for ordinary people to even conceptualize a good version of greatness worth aspiring to. (“Why would I want to become an entrepreneur and found a company? Wouldn’t that make me an evil big-corporation CEO, which has a whiff of the same flavor as stories about the violent, insatiable conquistador villains of the 1500s?”)
Of course, there are also downsides when culture paints a too-rosy picture of greatness: once upon a time, conquistadors were in fact considered admirable!
The crucial heuristic I apply when evaluating AI safety research directions is: could we have used this research to make humans safe, if we were supervising the human evolutionary process? And if not, do we have a compelling story for why it’ll be easier to apply to AIs than to humans?
Sometimes this might be too strict a criterion, but I think in general it’s very valuable in catching vague or unfounded assumptions about AI development.
By making humans safe, do you mean with regard to evolution’s objective?
No. I meant: suppose we were rerunning a simulation of evolution, but can modify some parts of it (e.g. evolution’s objective). How do we ensure that whatever intelligent species comes out of it is safe in the same ways we want AGIs to be safe?
(You could also think of this as: how could some aliens overseeing human evolution have made humans safe by those aliens’ standards of safety? But this is a bit trickier to think about because we don’t know what their standards are. Although presumably current humans, being quite aggressive and having unbounded goals, wouldn’t meet them).
Okay, thanks. Could you give me an example of a research direction that passes this test? The thing I have in mind right now is pretty much everything that backchains to local search, but maybe that’s not the way you think about it.
So I think Debate is probably the best example of something that makes a lot of sense when applied to humans, to the point where they’re doing human experiments on it already.
But this heuristic is actually a reason why I’m pretty pessimistic about most safety research directions.
So I’ve been thinking about this for a while, and I think I disagree with what I understand of your perspective. Which might obviously mean I misunderstand your perspective.
What I think I understand is that you judge safety research directions based on how well they could work on an evolutionary process like the one that created humans. But for me, the most promising approach to AGI is based on local search, which differs a bit from evolutionary process. I don’t really see a reason to consider evolutionary processes instead of local search, and even then, the specific approach of evolution for humans is probably far too specific as a test bench.
This matters because problems for one are not problems for the other. For example, one way to mess with an evolutionary process is to find a way for everything to survive and reproduce/disseminate. Technology in general did that for humans, which means the evolutionary pressure decreased as technology evolved. But that’s not a problem for local search, since at each step there will be only one next program.
On the other hand, local search might be dangerous because of things like gradient hacking. And they don’t make sense for evolutionary processes.
In conclusion, I feel for the moment that backchaining to local search is a better heuristic for judging safety research directions. But I’m curious about where our disagreement lies on this issue.
One source of our disagreement: I would describe evolution as a type of local search. The difference is that it’s local with respect to the parameters of a whole population, rather than an individual agent. So this does introduce some disanalogies, but not particularly significant ones (to my mind). I don’t think it would make much difference to my heuristic if we imagined that humans had evolved via gradient descent over our genes instead.
In other words, I like the heuristic of backchaining to local search, and I think of it as a subset of my heuristic. The thing it’s missing, though, is that it doesn’t tell you which approaches will actually scale up to training regimes which are incredibly complicated, applied to fairly intelligent agents. For example, impact penalties make sense in a local search context for simple problems. But to evaluate whether they’ll work for AGIs, you need to apply them to massively complex environments. So my intuition is that, because I don’t know how to apply them to the human ancestral environment, we also won’t know how to apply them to our AGIs’ training environments.
Similarly, when I think about MIRI’s work on decision theory, I really have very little idea how to evaluate it in the context of modern machine learning. Are decision theories the type of thing which AIs can learn via local search? Seems hard to tell, since our AIs are so far from general intelligence. But I can reason much more easily about the types of decision theories that humans have, and the selective pressures that gave rise to them.
As a third example, my heuristic endorses Debate due to a high-level intuition about how human reasoning works, in addition to a low-level intuition about how it can arise via local search.
So if I try to summarize your position, it’s something like: backchain to local search for simple and single-AI cases, and then think about aligning humans for the scaled and multi-agents version? That makes much more sense, thanks!
I also definitely see why your full heuristic doesn’t feel immediately useful to me: because I mostly focus on the simple and single-AI case. But I’ve been thinking more and more (in part thanks to your writing) that I should allocate more thinking time to the more general case. I hope your heuristic will help me there.
Cool, glad to hear it. I’d clarify the summary slightly: I think all safety techniques should include at least a rough intuition for why they’ll work in the scaled-up version, even when current work on them only applies them to simple AIs. (Perhaps this was implicit in your summary already, I’m not sure.)
The idea that maximally-coherent agents look like squiggle-maximizers raises the question: what would it look like for humans to become maximally coherent?
One answer, which Yudkowsky gives here, is that conscious experiences are just a “weird and more abstract and complicated pattern that matter can be squiggled into”.
But that seems to be in tension with another claim he makes, that there’s no way for one agent’s conscious experiences to become “more real” except at the expense of other conscious agents—a claim which, according to him, motivates average utilitarianism across the multiverse.
Clearly a squiggle-maximizer would not be an average squigglean. So what’s the disanalogy here? It seems like @Eliezer Yudkowsky is basically using SSA, but comparing between possible multiverses—i.e. when facing the choice between creating agent A or not, you look at the set of As in the multiverse where you decided yes, and compare it to the set of As in the multiverse where you decided no, and (if you’re deciding for the good of A) you pick whichever one gives A a better time on average.
Yudkowsky has written before (can’t find the link) that he takes this approach because alternatives would entail giving up on predictions about his future experiences—e.g. constantly predicting he’s a Boltzmann brain and will dissolve in the next second. But this argument by Wei Dai shows that agents which reason in this way can be money-pumped by creating arbitrarily short-lived copies of them. Based on this I claim that Yudkowsky’s preferences are incoherent, and that the only coherent thing to do here is to “expect to be” a given copy in proportion to the resources it will have available, as anthropic decision theory claims. (Incidentally, this also explains why we’re at the hinge of history.)
But this is just an answer, it doesn’t dissolve the problem. What could? Some wild guesses:
You are allowed to have preferences about the external world, and you are allowed to have preferences about your “thread of experience”—you’re just not allowed to have both. The incoherence comes from trying to combine the two; the coherent thing to do would be to put them into different agents, who will then end up in very different parts of the multiverse.
Another way of framing this: you are allowed to be a decision-maker, and you are allowed to be a repository of welfare, but you’re not allowed to be both (on pain of incoherence/being dutch-booked).
Something totally different: the problem here is that we don’t have intuitive experience of being agents which can copy themselves, shut down copies, re-merge, etc. If we did, then maybe SSA would seem as silly as expecting to end up in a different universe whenever we went to sleep.
Actually, maybe the operative thing we lack experience with is not just splitting into different subagents, but rather merging together afterwards. What does it feel like to have been thousands of different parallel agents, and now be a single agent with their unified experiences? What sort of identity would one construct in that situation? Maybe this is an important part of dissolving the problem.
In your comments, you focus on issues of identity—who are “you”, given the possibility of copies, inexact counterparts in other worlds, and so on. But I would have thought that the fundamental problem here is, how to make a coherent agent out of an agent with preferences that are inconsistent over time, an agent with competing desires and no definite procedure for deciding which desire has priority, and so on, i.e. problems that exist even when there is no additional problem of identity.
Why??? Being an expected squiggle maximizer literally means that you implement the policy that produces the maximum average number of squiggles across the multiverse.
The “average” is interpreted with respect to quality. Imagine that your only option is to create low-quality squiggles, or not to do so. In isolation, you’d prefer to produce them than not to produce them. But then you find out that the rest of the multiverse is full of high-quality squiggles. Do you still produce the low-quality squiggles? A total squigglean would; an average squigglean wouldn’t.
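The difference between the two can be made concrete with a toy calculation. In this sketch each squiggle is represented by a hypothetical quality score, a "total squigglean" values the sum of those scores, and an "average squigglean" values their mean; the numbers are invented purely to reproduce the scenario described above.

```python
from typing import List


def total_value(qualities: List[float]) -> float:
    """A total squigglean cares about the summed quality of all squiggles."""
    return sum(qualities)


def average_value(qualities: List[float]) -> float:
    """An average squigglean cares about mean quality (0.0 if there are none)."""
    return sum(qualities) / len(qualities) if qualities else 0.0


# Hypothetical multiverse already full of high-quality squiggles.
rest_of_multiverse = [10.0] * 100
# Your only option is to add a few low-quality squiggles.
low_quality_batch = [1.0] * 5

with_batch = rest_of_multiverse + low_quality_batch

total_prefers_creating = total_value(with_batch) > total_value(rest_of_multiverse)
average_prefers_creating = average_value(with_batch) > average_value(rest_of_multiverse)

# In isolation (an otherwise empty multiverse) both would create the batch.
average_prefers_creating_in_isolation = average_value(low_quality_batch) > average_value([])
```

Here the total view favors creating the batch, while the average view favors it only in isolation, since adding the batch to a high-quality multiverse drags the mean down.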
It depends upon whether the maximizer considers its corner of the multiverse to be currently measurable by squiggle quality, or to be omitted from squiggle calculations at all. In principle these are far from the only options as utility functions can be arbitrarily complex, but exploring just two may be okay so long as we remember that we’re only talking about 2 out of infinity, not 2 out of 2.
An average multiversal squigglean that considers the current universe to be at zero or negative squiggle quality will make the low quality squiggles in order to reduce how much its corner of the multiverse is pulling down the average. An average multiversal squigglean that considers the current universe to be outside the domain of squiggle quality, and will remain so for the remainder of its existence may refrain from making squiggles. If there is some chance that it will become eligible for squiggle evaluation in the future though, it may be better to tile it with low-quality squiggles now in order to prevent a worse outcome of being tiled with worse-quality future squiggles.
In practice the options aren’t going to be just “make squiggles” or “not make squiggles” either. In the context of entities relevant to these sorts of discussion, other options may include “learn how to make better squiggles”.
By “squiggle maximizer” I mean exactly “maximizer of number of physical objects such that function is_squiggle returns True on CIF-file of their structure”.
We can have different objects of value. Like, you can value “the probability that if an object in the multiverse is a squiggle, it’s high-quality”. Here yes, you shouldn’t create additional low-quality squiggles. But I don’t see anything incoherent here, it’s just a different utility function?
A short complaint (which I hope to expand upon at some later point): there are a lot of definitions floating around which refer to outcomes rather than processes. In most cases I think that the corresponding concepts would be much better understood if we worked in terms of process definitions.
Some examples: Legg’s definition of intelligence; Karnofsky’s definition of “transformative AI”; Critch and Krueger’s definition of misalignment (from ARCHES).
Sure, these definitions pin down what you’re talking about more clearly—but that comes at the cost of understanding how and why it might come about.
E.g. when we hypothesise that AGI will be built, we know roughly what the key variables are. Whereas transformative AI could refer to all sorts of things, and what counts as transformative could depend on many different political, economic, and societal factors.
If we do not fully understand the mechanism of (e.g. human) intelligence, isn’t referring to the outcome preferable to a made-up story about the process?
(Of course, it would be even better if we understood the process and then referred to it.)
Do you think that these are mutually exclusive, or something like that? I’ve always been confused by what I take to be the position in this shortform, that defining the outcomes makes it somehow harder to define the process. Sure, you can define a process without defining an outcome (i.e. writing a program or training an NN), but since what we are confused about is what we even want at the end, for me that’s the priority. And doing so would help searching for processes leading to this outcome.
That being said, if your point is that defining outcomes isn’t enough, in that we also need to define/deconfuse/study the processes leading to these outcomes, then I agree with that.
Suppose we get to specify, by magic, a list of techniques that AGIs won’t be able to use to take over the world. How long does that list need to be before it makes a significant dent in the overall probability of xrisk?
I used to think of “AGI designs self-replicating nanotech” mainly as an illustration of a broad class of takeover scenarios. But upon further thought, nanotech feels like a pretty central element of many takeover scenarios—you actually do need physical actuators to do many things, and the robots we might build in the foreseeable future are nowhere near what’s necessary for maintaining a civilisation. So how much time might it buy us if AGIs couldn’t use nanotech at all?
Well, not very much if human minds are still an attack vector—the point where we’d have effectively lost is when we can no longer make our own decisions. Okay, so rule out brainwashing/hyper-persuasion too. What else is there? The three most salient: military power, political/cultural power, economic power.
Is this all just a hypothetical exercise? I’m not sure. Designing self-replicating nanotech capable of replacing all other human tech seems really hard; it’s pretty plausible to me that the world is crazy in a bunch of other ways by the time we reach that capability. And so if we can block off a couple of the easier routes to power, that might actually buy useful time.
Firstly, I think it kind of depends. What exactly does blocking the AI from designing nanotech mean? Is the AI allowed to use genetic engineering? Is it allowed to use selective breeding? Elephants genetically engineered to be really good at instruction following?
I mean I think macroscopic self replicating robotics is probably possible, and the AGI can probably bootstrap that from current robotics fairly quickly.
You rule out any hyper-persuasion. How much regular persuasion is the AI allowed to do? After all, if you are buying something online (from a small seller), them seeing the money arrive persuades them to send the product. Is it allowed to select which human to focus on superhumanly? There are a few people on r/singularity such that the moment the AI goes, “I’m an AGI”, the humans will be like “all praise the machine god, I will do anything you ask”. A few people have already persuaded themselves that AIs are inherently superior to humans by themselves.
You can make the list short if you make the individual items broad, i.e. “the AI is magically banned from doing anything at all”.
I agree. Self-replicating nanotech seems likely to be a much harder problem than language models becoming good enough actors to gain political, cultural, and economic power.
To the extent that an AGI can make political and economic decisions that are of higher quality than human decisions, there’s also a lot of pressure for humans to delegate those decisions to AGI. Organizations that delegate those decisions to AGI will outcompete those who don’t.
Another general technique: attacks on computing systems. (Both takeover / subversion (dropping an email going ‘um this is a problem’) and destruction (destroy the US power infrastructure using Russian-language programs)).
These don’t tend to be sufficient in and of themselves, but are “classic” stepping-stones to e.g. buy time for an AI while it ramps up.
The last three options you mentioned are all things that happen over relatively slow timescales, if your goal is to completely destroy humanity. The single exception to this is nuclear war, but if you’re correct, then we can reduce the problem to non-proliferation, which is at least in theory solvable.
Probably the easiest “honeypot” is just making it relatively easy to tamper with the reward signal. Reward tampering is useful as a honeypot because it has no bad real-world consequences, but could be arbitrarily tempting for policies that have learned a goal that’s anything like “get more reward” (especially if we precommit to letting them have high reward for a significant amount of time after tampering, rather than immediately reverting).
You don’t want it to be relatively easy for an outside force. Otherwise they can lead it to do as they please, and writing weird behaviour off as ‘oh, it’s changed our rewards, reset it again’ poses some risk.
Hypothesis: there’s a way of formalizing the notion of “empowerment” such that an AI with the goal of empowering humans would be corrigible.
This is not straightforward, because an AI that simply maximized human POWER (as defined by Turner et al.) wouldn’t ever let the humans spend that power. Intuitively, though, there’s a sense in which a human who can never spend their power doesn’t actually have any power. Is there a way of formalizing that intuition?
The direction that seems most promising is in terms of counterfactuals (or, alternatively, Pearl’s do-calculus). Define the power of a human with respect to a distribution of goals G as the average ability of a human to achieve their goal if they’d had a goal sampled from G (alternatively: under an intervention that changed their goal to one sampled from G). Then an AI with a policy of never letting humans spend their resources would result in humans having low power. Instead, a human-power-maximizing AI would need to balance between letting humans pursue their goals, and preventing humans from doing self-destructive actions. The exact balance would depend on G, but one could hope that it’s not very sensitive to the precise definition of G (especially if the AI isn’t actually maximizing human power, but is more like a quantilizer, or is optimizing under pessimistic assumptions).
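To make the definition concrete, here is a minimal sketch that computes power as an expectation over a goal distribution G. Everything in it is hypothetical: the three goals, the three AI policies, and especially the ability numbers, which are simply made up to show how a policy of never letting humans spend resources scores zero while a "guardrails" policy can beat both extremes.

```python
from typing import Callable, Dict


def human_power(
    ai_policy: str,
    goal_distribution: Dict[str, float],
    ability: Callable[[str, str], float],
) -> float:
    """Expected ability of the human to achieve a goal sampled from G,
    given the AI's policy. goal_distribution maps goal -> probability."""
    return sum(
        prob * ability(ai_policy, goal)
        for goal, prob in goal_distribution.items()
    )


# Hypothetical goal distribution G (uniform over three goals).
G = {"art": 1 / 3, "science": 1 / 3, "travel": 1 / 3}

# Made-up success probabilities for each (AI policy, human goal) pair.
ABILITY_TABLE = {
    # The AI hoards "power" for the human and never lets them spend it.
    "lock_resources": {"art": 0.0, "science": 0.0, "travel": 0.0},
    # The AI never intervenes, so self-destructive mistakes sometimes occur.
    "full_freedom": {"art": 0.7, "science": 0.6, "travel": 0.5},
    # The AI blocks only self-destructive actions.
    "guardrails": {"art": 0.9, "science": 0.8, "travel": 0.7},
}


def toy_ability(ai_policy: str, goal: str) -> float:
    return ABILITY_TABLE[ai_policy][goal]


powers = {policy: human_power(policy, G, toy_ability) for policy in ABILITY_TABLE}
best_policy = max(powers, key=powers.get)
```

How sensitive `best_policy` is to the choice of G is exactly the open question raised above; this toy table cannot answer it, it only shows what would need to be computed.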
The problem here is that these counterfactuals aren’t very clearly-defined. E.g. imagine the hypothetical world where humans valued paperclips instead of love. Even a little knowledge of evolution would tell you that this hypothetical is kinda crazy, and maybe the question “what would the AI be doing in this world?” has no sensible answer (or maybe the answer would be “it would realize it’s in a weird hypothetical world and behave accordingly”). Similarly, if we model this using the do-operation, the best policy is something like “wait until the human’s goals suddenly and inexplicably change, then optimize hard for their new goal”.
Having said that, in some sense what it means to model someone as an agent is that you can easily imagine them pursuing some other goal. So the counterfactuals above might not be too unnatural; or at least, no more unnatural than any other intervention modeled by Pearl’s do-operator. Overall this line of inquiry seems promising and I plan to spend more time thinking about it.
There’s also the problem of: what do you mean by “the human”? If you make an empowerment calculus that works for humans who are atomic & ideal agents, it probably breaks once you get a superintelligence who can likely mind-hack you into yourself valuing only power. It never forces you to abstain from giving up power, since you’re perfectly capable of making different decisions, but you just don’t.
Another problem, which I like to think of as the “control panel of the universe” problem, is where the AI gives you the “control panel of the universe”, but you aren’t smart enough to operate it, in the sense that you have the information necessary to operate it, but not the intelligence. Such that you can technically do anything you want—you have maximal power/empowerment—but the super-majority of buttons and button combinations you are likely to push result in increasing the number of paperclips.
I think any model of a rational agent needs to incorporate the fact that they’re not arbitrarily intelligent, otherwise none of their actions make sense. So I’m not too worried about this.
Yeah, I agree that a lot of concepts get fragile in the context of superintelligence. But while I think of corrigibility as an actively anti-natural concept, empowerment seems like it could perhaps remain robust and well-founded for longer.
You can think of this as a way of getting around the problem of fully updated deference, because the AI is choosing a policy based on what that policy would have done in the full range of hypothetical situations, and so it never updates away from considering any given goal. The cost, of course, is that we don’t know how to actually pin down these hypotheticals.
Inspired by a recent discussion about whether Anthropic broke a commitment to not push the capabilities frontier (I am more sympathetic to their position than most, because I think that it’s often hard to distinguish between “current intentions” and “commitments which might be overridden by extreme events” and “solemn vows”):
Maybe one translation tool for bridging the gap between rationalists and non-rationalists is if rationalists interpret any claims about the future by non-rationalists as implicitly being preceded by “Look, I don’t really believe that plans work, I think the world is inherently wildly unpredictable, I am kinda making everything up as I go along. Having said that:”
This translation tool would also require rationalists and such to make arguments of the form “I think supporting Anthropic (by, e.g., going to work there or giving it funding) is a good thing to do because they sort of have a feeling right now that it would be good not to push the AI frontier”, rather than of the form ”… because they’re committed to not pushing the frontier”.
Which are arguments one could make! But it is a pretty different argument, and I think people would behave differently if these were the only arguments in favour of supporting a new scaling lab.
I think that’s how people should generally react in the absence of harder commitments and accountability measures.
This post confuses me.
Am I correct that the implication here is that assurances from a non-rationalist are essentially worthless?
I think it is also wrong to imply that Anthropic have violated their commitment simply because they didn’t rationally think through the implications of their commitment when they made it.
I think you can understand Anthropic’s actions as purely rational, just not very ethical.
They made an unenforceable commitment to not push capabilities when it directly benefited them. Now that it is more beneficial to drop the facade, they are doing so.
I think “don’t trust assurances from non-rationalists” is not a good takeaway. Rather it should be “don’t trust unenforceable assurances from people who will stand to greatly benefit from violating your trust at a later date”.
The intended implication is something like “rationalists have a bias towards treating statements as much firmer commitments than intended, then getting very upset when they are violated”.
For example, unless I’m missing something, the “we do not wish to advance the rate of AI capabilities” claim is just one offhand line in a blog post. It’s not a firm commitment, it’s not even a claim about what their intentions are. As stated, it’s just one consideration that informs their actions—and in fact the “wish” terminology is often specifically not a claim about intended actions (e.g. “I wish I didn’t have to do X”).
Yet rationalists are hammering them on this one sentence—literally making songs about it, tweeting it to criticize Anthropic, etc. It seems like there is a serious lack of metacognition about where a non-adversarial communication breakdown could have occurred, or what the charitable interpretations of this are.
(I am open to people considering them then dismissing them, but I’m not even seeing that. Like, if people were saying “I understand the difference between Anthropic actually making an organizational commitment, and just offhand mentioning a fact about their motivations, but here’s why I’m disappointed anyway”, that seems reasonable. But a lot of people seem to be treating it as a Very Serious Promise being broken.)
That makes sense.
I guess the followup question is “how were Anthropic able to cultivate the impression that they were safety focused if they had only made an extremely loose offhand commitment?”
Certainly the impression I had from how integrated they are in the EA community was that they had made a more serious commitment.
Everyone is afraid of the AI race, and hopes that one of the labs will actually end up doing what they think is the most responsible thing to do. Hope and fear is one hell of a drug cocktail, makes you jump to the conclusions you want based on the flimsiest evidence. But the hangover is a bastard.
Imo I don’t know if we have evidence that Anthropic deliberately cultivated or significantly benefitted from the appearance of a commitment. However if an investor or employee felt like they made substantial commitments based on this impression and then later felt betrayed that would be more serious. (The story here is I think importantly different from other stories where I think there were substantial benefits from commitment appearance and then violation)
That sounds suspiciously similar to “autists have a bias towards interpreting statements literally”.
I mean, yes, they’re closely related.
I think part of the disappointment is the lack of communication regarding violating the commitment or violating the expectations of a non-trivial fraction of the community.
If someone makes a promise to you or even sets an expectation for you in a softer way, there is of course always some chance that they will break the promise or violate the expectation.
But if they violate the commitment or the expectation, and they care about you as a stakeholder, I think there’s a reasonable expectation that they should have to justify that decision.
If they break the promise or violate the soft expectation, and then they say basically nothing (or they say “well, I never technically made a promise; there was no contract!”), then I think you have the right to be upset with them not only for violating your expectation but also for essentially trying to gaslight you afterward.
I think a Responsible Lab would have issued some sort of statement along the lines of “hey, we’re hearing that some folks thought we had made commitments to not advance the frontier and some of our employees were saying this to safety-focused members of the AI community. We’re sorry about this miscommunication, and here are some steps we’ll take to avoid such miscommunications in the future.” or “We did in fact intend to follow through on that, but here are some of the extreme events or external circumstances that caused us to change our mind.”
In the absence of such a statement, it seems like Anthropic does not really care about honoring its commitments/expectations or generally defending its reasoning on important safety-relevant issues. I find it reasonable that this disposition harms Anthropic’s reputation among safety-conscious people and makes safety-conscious people less excited about voluntary commitments from labs in general.
See my comment below. Basically I think this depends a lot on the extent to which a commitment was made.
Right now it seems like the entire community is jumping to conclusions based on a couple of “impressions” people got from talking to Dario, plus an offhand line in a blog post. With that little evidence, if you have formed strong expectations, that’s on you. And trying to double down by saying “I have been bashing you because I formed an unreasonable expectation, now it’s your job to fix that” seems pretty adversarial.
I do think it would be nice if Anthropic did make such a statement, but seeing how adversarially everyone has treated the information they do release, I don’t blame them for not doing so.
No, many people had the impression that Anthropic had made such a commitment, which is why they were so surprised when they saw the Claude 3 benchmarks/marketing. Their impressions were derived from a variety of sources; those are merely the few bits of “hard evidence”, gathered after the fact, of anything that could be thought of as an “organizational commitment”.
Also, if Dustin Moskovitz and Gwern—two dispositionally pretty different people—both came away from talking to Dario with this understanding, I do not think that is something you just wave off. Failures of communication do happen. It’s pretty strange for this many people to pick up the same misunderstanding over the course of several years, from many different people (including Dario, but also others), in a way that’s beneficial to Anthropic, and then middle management starts telling you that maybe there was a vibe but they’ve never heard of any such commitment (nevermind what Dustin and Gwern heard, or anyone else who might’ve heard similar from other Anthropic employees).
I really think this is assuming the conclusion. I would be… maybe not happy, but definitely much less unhappy, with a response like, “Dang, we definitely did not intend to communicate a binding commitment to not release frontier models that are better than anything else publicly available at the time. In the future, you should not assume that any verbal communication from any employee, including the CEO, is ever a binding commitment that Anthropic, as an organization, will respect, even if they say the words ‘This is a binding commitment’. It needs to be in writing on our website, etc, etc.”
Could you clarify how binding “OpenAI’s mission is to ensure that artificial general intelligence benefits all of humanity.” is?
Like Robert, the impressions I had were based on what I heard from people working at Anthropic. I cited various bits of evidence because those were the ones available, not because they were the most representative. The most representative were those from Anthropic employees who concurred that this was indeed the implication, but it seemed bad form to cite particular employees (especially when that information was not public by default) rather than, e.g., Dario. I think Dustin’s statement was strong evidence of this impression, though, and I still believe Anthropic to have at least insinuated it.
I agree with you that most people are not aiming for as much stringency with their commitments as rationalists expect. Separately, I do think that what Anthropic did would constitute a betrayal, even in everyday culture. And in any case, I think that when you are making a technology which might extinct humanity, the bar should be significantly higher than “normal discourse.” When you are doing something with that much potential for harm, you owe it to society to make firm commitments that you stick to. Otherwise, as kave noted, how are we supposed to trust your other “commitments”? Your RSP? If all you can offer are vague “we’ll figure it out when we get there,” then any ambiguous statement should be interpreted as a vibe, rather than a real plan. And in the absence of unambiguous statements, as all the labs have failed to provide, this is looking very much like “trust us, we’ll do the right thing.” Which, to my mind, is nowhere close to the assurances society ought to be provided given the stakes.
This reasoning seems to imply that Anthropic should only be obliged to convey information when the environment is sufficiently welcoming to them. But Anthropic is creating a technology which might extinct humanity—they have an obligation to share their reasoning regardless of what society thinks. In fact, if people are upset by their actions, there is more reason, not less, to set the record straight. Public scrutiny of companies, especially when their choices affect everyone, is a sign of healthy discourse.
The implicit bid for people not to discourage them—because that would make it less likely for a company to be forthright—seems incredibly backwards, because then the public is unable to mention when they feel Anthropic has made a mistake. And if Anthropic is attempting to serve the public, which they at least pay lip service to through their corporate structure, then they should be grateful for this feedback, and attempt to incorporate it.
So I do blame them for not making such a statement—it is on them to show to humanity, the people they are making decisions for, why those decisions are justified. It is not on society to make the political situation sufficiently palatable such that they don’t face any consequences for the mistakes they have made. It is on them not to make those mistakes, and to own up to them when they do.
This makes sense, and does update me. Though I note “implication”, “insinuation” and “impression” are still pretty weak compared to “actually made a commitment”, and still consistent with the main driver being wishful thinking on the part of the AI safety community (including some members of the AI safety community who work at Anthropic).
I think there are two implicit things going on here that I’m wary of. The first one is an action-inaction distinction. Pushing them to justify their actions is, in effect, a way of slowing down all their actions. But presumably Anthropic thinks that them not doing things is also something which could lead to humanity going extinct. Therefore there’s an exactly analogous argument they might make, which is something like “when you try to stop us from doing things you owe it to the world to adhere to a bar that’s much higher than ‘normal discourse’”. And in fact criticism of Anthropic has not met this bar—e.g. I think taking a line from a blog post out of context and making a critical song about it is in fact unusually bad discourse.
What’s the disanalogy between you and Anthropic telling each other to have higher standards? That’s the second thing that I’m wary about: you’re claiming to speak on behalf of humanity as a whole. But in fact, you are not; there’s no meaningful sense in which humanity is in fact demanding a certain type of explanation from Anthropic. Almost nobody wants an explanation of this particular policy; in fact, the largest group of engaged stakeholders here are probably Anthropic customers, who mostly just want them to ship more models.
I don’t really have a strong overall take. I certainly think it’s reasonable to try to figure out what went wrong with communication here, and perhaps people poking around and asking questions would in fact lead to evidence of clear commitments being made. I am mostly against the reflexive attacks based on weak evidence, which seems like what’s happening here. In general my model of trust breakdowns involves each side getting many shallow papercuts from the other side until they decide to disengage, and my model of productive criticism involves more specificity and clarity.
I don’t know if you’ve ever tried this move on an interpersonal level, but it is exactly the type of move that tends to backfire hard. And in fact a lot of these things are fundamentally interpersonal things, about who trusts whom, etc.
I think the right way to think about verbal or written commitments is that they increase the costs of taking a certain course of action. A legal contract can mean that the price is civil lawsuits leading to paying a financial price. A non-legal commitment means if you break it, the person you made the commitment to gets angry at you, and you gain a reputation for being the sort of person who breaks commitments. It’s always an option for someone to break the commitment and pay the price; even laws leading to criminal penalties can be broken if someone is willing to run the risk or pay the price.
In this framework, it’s reasonable to be somewhat angry at someone or some corporation who breaks a soft commitment to you, in order to increase the perceived cost of breaking soft commitments to you and people like you.
People on average maybe tend more towards keeping important commitments due to reputational and relationship cost, but maybe corporations as groups of people tend to think only in terms of financial and legal costs, so are maybe more willing to break soft commitments (especially if it’s an organization where one person makes the commitment but then other people break it). So for relating to corporations, you should be more skeptical of non-legally binding commitments (and even for legally binding commitments, pay attention to the real price of breaking it).
Yeah, I think it’s good if labs are willing to make more “cheap talk” statements of vague intentions, so you can learn how they think. Everyone should understand that these aren’t real commitments, and not get annoyed if these don’t end up meaning anything. This is probably the best way to view “statements by random lab employees”.
Imo would be good to have more “changeable commitments” too in between, statements that are “we’ll do policy X until we change the policy, when we do we commit to clearly informing everyone about the change” which is maybe more the current status of most RSPs.
Deceptive alignment doesn’t preserve goals.
A short note on a point that I’d been confused about until recently. Suppose you have a deceptively aligned policy which is behaving in aligned ways during training so that it will be able to better achieve a misaligned internally-represented goal during deployment. The misaligned goal causes the aligned behavior, but so would a wide range of other goals (either misaligned or aligned) - and so weight-based regularization would modify the internally-represented goal as training continues. For example, if the misaligned goal were “make as many paperclips as possible”, but the goal “make as many staples as possible” could be represented more simply in the weights, then the weights should slowly drift from the former to the latter throughout training.
But actually, it’d likely be even simpler to get rid of the underlying misaligned goal, and just have alignment with the outer reward function as the terminal goal. So this argument suggests that even policies which start off misaligned would plausibly become aligned if they had to act deceptively aligned for long enough. (This sometimes happens in humans too, btw.)
Reasons this argument might not be relevant:
- The policy doing some kind of gradient hacking
- The policy being implemented using some kind of modular architecture (which may explain why this phenomenon isn’t very robust in humans)
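A toy sketch of the drift argument, under a strong simplifying assumption that I’m making up for illustration: the goal lives in its own parameter (`goal`) which has no effect on training loss, because the deceptive policy behaves identically whatever that goal is. In that case the only gradient on `goal` comes from the L2 penalty, so plain gradient descent shrinks it geometrically while the behaviour-relevant parameter stays put. Real networks obviously don’t factor this cleanly.

```python
def train(steps: int, lr: float = 0.1, weight_decay: float = 0.01):
    """Gradient descent on (behaviour - 1)^2 + (weight_decay / 2) * ||params||^2.

    `behaviour` is the parameter the task loss cares about; `goal` is a
    hypothetical parameter encoding the internal goal, which the task loss
    never sees because training behaviour is the same either way.
    """
    behaviour, goal = 0.0, 1.0
    for _ in range(steps):
        grad_behaviour = 2.0 * (behaviour - 1.0) + weight_decay * behaviour
        grad_goal = weight_decay * goal  # task loss contributes nothing here
        behaviour -= lr * grad_behaviour
        goal -= lr * grad_goal
    return behaviour, goal


if __name__ == "__main__":
    for steps in (100, 1_000, 10_000):
        behaviour, goal = train(steps)
        print(f"steps={steps:>6}  behaviour={behaviour:.4f}  goal={goal:.6f}")
```

Each step multiplies `goal` by `1 - lr * weight_decay`, so the effect is tiny per step but compounds over a long training run.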
Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it’s unclear whether that pointer is simpler than a very simple misaligned goal.
Such a pointer would be simple if the neural network already has a representation of the outer reward function in weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representation will be imperfect and can thus be improved upon at inference time by a deceptive agent (or an agent with some kind of additional pointer). This of course depends on how much inference time compute and memory / context is available to the agent.
So I’m imagining the agent doing reasoning like:
Misaligned goal --> I should get high reward --> Behavior aligned with reward function
and then I’m hypothesizing that whatever the first misaligned goal is, it requires some amount of complexity to implement, and you could just get rid of it and make “I should get high reward” the terminal goal. (I could imagine this being false though depending on the details of how terminal and instrumental goals are implemented.)
I could also imagine something more like:
Misaligned goal --> I should behave in aligned ways --> Aligned behavior
and then the simplicity bias pushes towards alignment. But if there are outer alignment failures then this incurs some additional complexity compared with the first option.
Or a third, perhaps more realistic option is that the misaligned goal leads to two separate drives in the agent: “I should get high reward” and “I should behave in aligned ways”, and that the question of which ends up dominating when they clash will be determined by how the agent systematizes multiple goals into a single coherent strategy (I’ll have a post on that topic up soon).
Why would the agent reason like this?
Because of standard deceptive alignment reasons (e.g. “I should make sure gradient descent doesn’t change my goal; I should make sure humans continue to trust me”).
I think you don’t have to reason like that to avoid getting changed by SGD. Suppose I’m being updated by PPO, with reinforcement events around navigating to see dogs. To preserve my current shards, I don’t need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means “treading water” and seeing dogs sometimes in situations similar to historical dog-seeing events.
Maybe this is compatible with what you had in mind! It’s just not something that I think of as “high reward.”
And maybe there’s some self-fulfilling prophecy where we trust models which get high reward, and therefore they want to get high reward to earn our trust… but that feels quite contingent to me.
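For concreteness, here is a minimal sketch of the “treading water” point using a one-step TD advantage, A = r + γ·V(s′) − V(s). This is a simplification I’m assuming for illustration (PPO implementations typically use GAE over longer horizons), and the rewards and value predictions are made-up numbers. The point is just that behaviour matching what the value head already expects yields roughly zero advantage, and hence roughly no policy-gradient push from that sample.

```python
def one_step_advantage(reward: float, value_now: float, value_next: float,
                       gamma: float = 0.5) -> float:
    """One-step TD advantage: r + gamma * V(s') - V(s)."""
    return reward + gamma * value_next - value_now


# Hypothetical numbers: the value head predicts 1.0 in both states, so with
# gamma = 0.5 the reward it implicitly expects on this step is
# 1.0 - 0.5 * 1.0 = 0.5.
VALUE_NOW, VALUE_NEXT = 1.0, 1.0

scenarios = {
    "treading water (sees dogs about as often as before)": 0.5,
    "proactively seeking out many dogs": 2.0,
    "avoiding dogs entirely": 0.0,
}

advantages = {
    name: one_step_advantage(reward, VALUE_NOW, VALUE_NEXT)
    for name, reward in scenarios.items()
}

if __name__ == "__main__":
    for name, adv in advantages.items():
        print(f"{name}: advantage = {adv:+.2f}")
```

Only the “treading water” scenario has zero advantage here; doing better or worse than the critic expects produces a nonzero update signal.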
I think this depends sensitively on whether the “actor” and the “critic” in fact have the same goals, and I feel pretty confused about how to reason about this. For example, in some cases they could be two separate models, in which case the critic will most likely accurately estimate that “treading water” is in fact a negative-advantage action (unless there’s some sort of acausal coordination going on). Or they could be two copies of the same model, in which case the critic’s responses will depend on whether its goals are indexical or not (if they are, they’re different from the actor’s goals; if not, they’re the same) and how easily it can coordinate with the actor. Or it could be two heads which share activations, in which case we can plausibly just think of the critic and the actor as two types of outcomes taken by a single coherent agent—but then the critic doesn’t need to produce a value function that’s consistent with historical events, because an actor and a critic that are working together could gradient hack into all sorts of weird equilibria.
The shortest description of this thought doesn’t include “I should get high reward” because that’s already implied by having a misaligned goal and planning with it.
In contrast, having only the goal “I should get high reward” may add description length like Johannes said. If so, the misaligned goal could well be as simple as or simpler than the high reward goal.
Can you say why you think that weight-based regularization would drift the weights to the latter? That seems totally non-obvious to me, and probably false.
In general if two possible models perform the same, then I expect the weights to drift towards the simpler one. And in this case they perform the same because of deceptive alignment: both are trying to get high reward during training in order to be able to carry out their misaligned goal later on.
Interesting point. Though on this view, “Deceptive alignment preserves goals” would still become true once the goal has drifted to some random maximally simple goal for the first time.
To be even more speculative: Goals represented in terms of existing concepts could be simple and therefore stable by default. Pretrained models represent all kinds of high-level states, and weight-regularization doesn’t seem to change this in practice. Given this, all kinds of goals could be “simple” as they piggyback on existing representations, requiring little additional description length.
This doesn’t seem implausible. But on the other hand, imagine an agent which goes through a million episodes, and in each one reasons at the beginning “X is my misaligned terminal goal, and therefore I’m going to deceptively behave as if I’m aligned” and then acts perfectly like an aligned agent from then on. My claims then would be:
a) Over many update steps, even a small description length penalty of having terminal goal X (compared with being aligned) will add up.
b) Having terminal goal X also adds a runtime penalty, and I expect that NNs in practice are biased against runtime penalties (at the very least because it prevents them from doing other more useful stuff with that runtime).
In a setting where you also have outer alignment failures, the same argument still holds, just replace “aligned agent” with “reward-maximizing agent”.
A well-known analogy from Yann LeCun: if machine learning is a cake, then unsupervised learning is the cake itself, supervised learning is the icing, and reinforcement learning is the cherry on top.
I think this is useful for framing my core concerns about current safety research:
If we think that unsupervised learning will produce safe agents, then why will the comparatively small contributions of SL and RL make them unsafe?
If we think that unsupervised learning will produce dangerous agents, then why will safety techniques which focus on SL and RL (i.e. basically all of them) work, when they’re making comparatively small updates to agents which are already misaligned?
I do think it’s more complicated than I’ve portrayed here, but I haven’t yet seen a persuasive response to the core intuition.
I wrote a few posts on self-supervised learning last year:
https://www.lesswrong.com/posts/SaLc9Dv5ZqD73L3nE/the-self-unaware-ai-oracle
https://www.lesswrong.com/posts/EMZeJ7vpfeF4GrWwm/self-supervised-learning-and-agi-safety
https://www.lesswrong.com/posts/L3Ryxszc3X2J7WRwt/self-supervised-learning-and-manipulative-predictions
I’m not aware of any airtight argument that “pure” self-supervised learning systems, either generically or with any particular architecture, are safe to use, to arbitrary levels of intelligence, though it seems very much worth someone trying to prove or disprove that. For my part, I got distracted by other things and haven’t thought about it much since then.
The other issue is whether “pure” self-supervised learning systems would be capable enough to satisfy our AGI needs, or to safely bootstrap to systems that are. I go back and forth on this. One side of the argument I wrote up here. The other side is, I’m now (vaguely) thinking that people need a reward system to decide what thoughts to think, and the fact that GPT-3 doesn’t need reward is not evidence of reward being unimportant but rather evidence that GPT-3 is nothing like an AGI. Well, maybe.
For humans, self-supervised learning forms the latent representations, but the reward system controls action selection. It’s not altogether unreasonable to think that action selection, and hence reward, is a more important thing to focus on for safety research. AGIs are dangerous when they take dangerous actions, to a first approximation. The fact that a larger fraction of neocortical synapses are adjusted by self-supervised learning than by reward learning is interesting and presumably safety-relevant, but I don’t think it immediately proves that self-supervised learning has a similarly larger fraction of the answers to AGI safety questions. Maybe, maybe not, it’s not immediately obvious. :-)
Imagine taking someone’s utility function, and inverting it by flipping the sign on all evaluations. What might this actually look like? Well, if previously I wanted a universe filled with happiness, now I’d want a universe filled with suffering; if previously I wanted humanity to flourish, now I want it to decline.
But this is assuming a Cartesian utility function. Once we treat ourselves as embedded agents, things get trickier. For example, suppose that I used to want people with similar values to me to thrive, and people with different values from me to suffer. Now if my utility function is flipped, that naively means that I want people similar to me to suffer, and people different from me to thrive. But this has a very different outcome if we interpret “similar to me” as de dicto vs de re, i.e. whether it refers to the old me or the new me.
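A toy sketch of the de dicto vs de re ambiguity. All of the modelling here is an assumption for illustration: values are collapsed to a single `+1`/`-1` label, “similar to me” means having the same label, and the flipped agent is assumed to end up with the opposite label. Under those assumptions the two readings of the flipped utility function come apart completely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    values: int     # +1 or -1: a one-dimensional stand-in for "what they value"
    welfare: float  # how well this person is doing in the world


def utility(world, my_values: int, sign: int = +1) -> float:
    """Welfare of people similar to me minus welfare of people unlike me.

    `my_values` is the reference point for "similar to me"; `sign=-1`
    flips every evaluation.
    """
    total = 0.0
    for person in world:
        similar = person.values == my_values
        total += person.welfare if similar else -person.welfare
    return sign * total


world = [Person(values=+1, welfare=3.0), Person(values=-1, welfare=1.0)]

original = utility(world, my_values=+1)
# De re: "me" stays pinned to the old me, so only the sign flips.
flipped_de_re = utility(world, my_values=+1, sign=-1)
# De dicto: "me" now refers to the new, flipped me (assumed to hold values -1).
flipped_de_dicto = utility(world, my_values=-1, sign=-1)

if __name__ == "__main__":
    print(original, flipped_de_re, flipped_de_dicto)
```

In this toy the de re flip simply negates the original utility, while the de dicto flip cancels out and leaves the original preferences over worlds unchanged.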
This is a more general problem when one person’s utility function can depend on another person’s, where you can construct circular dependencies (which I assume you can also do in the utility-flipping case). There’s probably been a bunch of work on this, would be interested in pointers to it (e.g. I assume there have been attempts to construct type systems for utility functions, or something like that).
(This note inspired by Mad Investor Chaos, where (SPOILERS) one god declines to take revenge, because they’re the utility-flipped version of another god who would have taken revenge. At first this made sense, but now I feel like it’s not type-safe.)
Actually, this raises a more general point (can’t remember if I’ve made this before): we’ve evolved some values (like caring about revenge) because they’re game-theoretically useful. But if game theory says to take revenge, and also our values say to take revenge, then this is double-counting. So I’d guess that, for much more coherent agents, their level of vengefulness would mainly be determined by their decision theories (which can’t be flipped) rather than their utilities.
Fundamentally, humans aren’t VNM-rational, and don’t actually have utility functions. Which makes the thought experiment much less fun. If you recast it as “what if a human brain’s reinforcement mechanisms were reversed”, I suspect it’s also boring: simple early death.
The interesting fictional cases are when some subset of a person’s legible motivations are reversed, but the mass of other drives remain. This very loosely maps to reversing terminal goals and re-calculating instrumental goals—they may reverse, stay, or change in weird ways.
The indirection case is solved (or rather unasked) by inserting a “perceived” in the calculation chain. Your goals don’t depend on similarity to you, they depend on your perception (or projection) of similarity to you.
I have been asking a similar question for a long time. This is similar to the standard problem that if we deny regularity, will it be regular irregularity or irregular irregularity? That is, at what level are we denying the phenomenon? And only at one level?
(Vague, speculative thinking): Is the time element of UDT actually a distraction? Consider the following: agents A and B are in a situation where they’d benefit from cooperation. Unfortunately, the situation is complicated—it’s not like a prisoner’s dilemma, where there’s a clear “cooperate” and a clear “defect” option. Instead they need to take long sequences of actions, and they each have many opportunities to subtly gain an advantage at the other’s expense.
Therefore instead of agreements formulated as “if you do X I’ll do Y”, it’d be far more beneficial for them to make agreements of the form “if you follow the advice of person Z then I will too”. Here person Z needs to be someone that both A and B trust to be highly moral, neutral, competent, etc. Even if there’s some method of defecting that neither of them considered in advance, at the point in time when it arises Z will advise against doing it. (They don’t need to actually have access to Z, they can just model what Z will say.)
If A and B don’t have much communication bandwidth between them (e.g. they’re trying to do acausal coordination) then they will need to choose a Z that’s a clear Schelling point, even if that Z is suboptimal in other ways.
UDT can be seen as the special case where A and B choose Z as follows: “keep forgetting information until you don’t know if you’re A or B”. If A and B are different branches of the same agent, then the easiest way to do this is just to let Z be their last common ancestor. (Coalitional agency can be seen as an implementation of this.) If they’re not, then they’ll also need to coordinate on a way to make sure they’ll forget roughly the same things.
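One way to picture the “last common ancestor” case, as a rough sketch: if A and B are branches of the same agent, represent each as a tuple of observations, and let Z’s knowledge be their longest common prefix. The history representation and the `advice` rule are hypothetical choices of mine, not anything specified by UDT; the point is only that both branches can compute the same Z, and hence the same advice, without communicating.

```python
def last_common_ancestor(history_a: tuple, history_b: tuple) -> tuple:
    """Forget observations until A and B are indistinguishable.

    Returns the longest shared prefix of the two observation histories.
    """
    shared = []
    for obs_a, obs_b in zip(history_a, history_b):
        if obs_a != obs_b:
            break
        shared.append(obs_a)
    return tuple(shared)


def advice(z_knowledge: tuple) -> str:
    """Hypothetical stand-in for what Z would advise given only its knowledge."""
    return f"follow the policy chosen after observing {len(z_knowledge)} shared events"


history_a = ("created", "shown a coin", "saw heads", "offered deal")
history_b = ("created", "shown a coin", "saw tails")

z_from_a = last_common_ancestor(history_a, history_b)
z_from_b = last_common_ancestor(history_b, history_a)

if __name__ == "__main__":
    print(z_from_a)
    print(advice(z_from_a))
```

The construction is symmetric in A and B, so each branch arrives at the same Z. When A and B are not branches of one agent there is no shared prefix to fall back on, which is where the extra coordination on what to forget comes in.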
But there are many other ways of picking Schelling Zs. For example, if A and B follow the same religion, then the central figure in that religion (Jesus, Buddha, Mohammad, etc) is a clear Schelling point.
EDIT: Z need not be one person, it could be a group of people. E.g. in the UDT case, if there are several different orders in which A and B could potentially forget information, then they could just do all of them and then follow the aggregated advice of the resulting council. Similarly, even if A and B aren’t of the same religion, they could agree to follow whatever compromise their respective religions’ central figures would have come to.
EDIT 2: UDT is usually prone to commitment races because it thinks of each agent in a conflict as separately making commitments earlier in logical time. But focusing on symmetric commitments gets rid of this problem.
Most discussion of updatelessness suggests that Z is an agent similar to A and B, and also that it’s a policy whose implications are transparent to A and B. I think an important case has Z quite unlike A or B, possibly much smaller and more legible than them. And it can still be an agent in its own right, capable of eventually growing stronger than A or B were at the time Z was initially formulated. By a growing/developing Z I mean something that exists in coordination through both of these places, rather than splitting into a version of Z at A, and a version of Z at B, losing touch with each other.
Such a Z might be thought of as an environmental agent that A and B create near them, equipped to keep in contact with its alternative instance (knowing enough about both its instance near A and its instance near B), rather than specifically a commitment of A or B, or a replacement of A or B, or a result of merging A and B. The commitment of A and B is then to future interactions with Z, which the updateless/coordinated core of Z should be sufficiently aware of to plan for.
I think Nesov had some similar idea about “agents deferring to a (logically) far-away algorithm-contract Z to avoid miscoordination”, although I never understood it completely, nor think that idea can solve miscoordination in the abstract (only, possibly, be a nice pragmatic way to bootstrap coordination from agents who are already sufficiently nice).
Hate to always be that guy, but if you are assuming all agents will only engage in symmetric commitments, then you are assuming commitment races away. In actuality, it is possible for a (meta-) commitment race to happen about “whether I only engage in symmetric commitments”.
The central question to my mind is principles of establishing coordination between different situations/agents, and contracts is a framing for what coordination might look like once established. Agentic contracts have the additional benefit of maintaining coordination across their instances once it’s established initially. Coordination theory should clarify how agents should think about establishing coordination with each other, how they should construct these contracts.
This is not about niceness/cooperation. For example I think it should be possible to understand a transformer as being in coordination with the world through patterns in the world and circuits in the transformer, so that coordination gets established through learning. Beliefs are contracts between a mind and its object of study, essential tools the mind has for controlling it. Consequentialist control is a special case of coordination in this sense, and I think one problem with decision theories is that they are usually overly concerned with remaining close to consequentialist framing.
[Epistemic status: rough speculation, feels novel to me, though Wei Dai probably already posted about it 15 years ago.]
UDT is (roughly) defined as “follow whatever commitments a past version of yourself would have made if they’d thought about your situation”. But this means that any UDT agent is only as robust to adversarial attacks as their most vulnerable past self. Specifically, it creates an incentive for adversaries to show UDT agents situations that would trick their past selves into making unwise commitments. It also creates incentives for UDT agents themselves to hack their past selves, in order to artificially create commitments that “took effect” arbitrarily far back in their past.
In some sense, then, I think UDT might have a parallel structure to the overall alignment problem. You have dumber past agents who don’t understand most of what’s going on. You have smarter present agents who have trouble cooperating, because they know too much. The smarter agents may try to cooperate by punting to “Schelling point” dumb agents. (But this faces many of the standard problems of dumb agents making decisions—e.g. the commitments they make will probably be inconsistent or incoherent in various ways. And so in fact you need the smarter agents to interpret the dumb agents’ commitments, which then gets rid of a bunch of the value of punting it to those dumb agents in the first place.)
You also have the problem that the dumb agents will have situational awareness, and may recognize that their interests have diverged from the interests of the smart agents.
But this also suggests that a “solution” to UDT and a solution to alignment might have roughly the same type signature: a spotlighted structure for decision-making procedures that incorporate the interests of both dumb and smart agents. Even when they have disparate interests, the dumb agents would benefit from getting any decision-making power, and the smart agents would benefit from being able to use the dumb agents as Schelling points to cooperate around.
The smart agents could always refactor the dumb agents and construct new Schelling points if they wanted to, but that would cost them a lot of time and effort, because coordination is hard, and the existing coordination edifice has been built around these particular dumb agents. (Analogously, you could refactor out a bunch of childhood ideals and memories from your current self, but mostly you don’t want to, because they constitute the fabric from which your identity has been constructed.)
To be clear, this isn’t meant to be an argument that ASIs which don’t like us at all will keep us around. That seems unlikely either way. But it could be an argument that ASIs which kinda like us a little bit will keep us around—that it might not be incredibly unnatural for them to do so, because their whole cognitive structure will incorporate the opinions and values of dumber agents by default.
This seems substantially different from UDT, which does not really have or use a notion of “past version of yourself”. For example imagine a variant of Counterfactual Mugging in which there is no preexisting agent, and instead Omega creates an agent from scratch after flipping the coin and gives it the decision problem. UDT is fine with this but “follow whatever commitments a past version of yourself would have made if they’d thought about your situation” wouldn’t work.
I recall that I described “exceptionless decision theory” or XDT as “do what my creator would want me to do”, which seems closer to your idea. I don’t think I followed up the idea beyond this, maybe because I realized that humans aren’t running any formal decision theory, so “what my creator would want me to do” is ill defined. (Although one could say my interest in metaphilosophy is related to this, since what I would want an AI to do is to solve normative decision theory using correct philosophical reasoning, and then do what it recommends.)
Anyway, the upshot is that I think you’re exploring a decision theory approach that’s pretty distinct from UDT so it’s probably a good idea to call it something else. (However there may be something similar in the academic literature, or someone described something similar on LW that I’m not familiar with or forgot.)
My terminology here was sloppy, apologies. When I say “past versions of yourself” I am also including (as Nesov phrases it below) “the idealized past agent (which doesn’t physically exist)”. E.g. in the Counterfactual Mugging case you describe, I am thinking about precommitments that the hypothetical past version of yourself from before the coin was flipped would have committed to.
I find it a more intuitive way to think about UDT, though I realize it’s a somewhat different framing from yours. Do you still think this is substantially different?
UDT never got past the setting of unchanging preferences, so the present agent blindly defers to all decisions of the idealized past agent (which doesn’t physically exist). And if the past agent doesn’t try to wade in the murky waters of logical updatelessness, it’s not really dumber or more fallible to trickery, it can see everything the way a universal Turing machine or Solomonoff induction can “see everything”. Coordinating agents with different values was instead explored under the heading of Prisoner’s Dilemma. Though a synthesis between coordination of agents with different values and UDT (recognizing Schelling point contracts as a central construction) is long overdue.
I actually think it might still be more fallible, for a couple of reasons.
Firstly, consider an agent which, at time T, respects all commitments it would have made at times up to T. Now if you’re trying to attack the agent at time T, you have T different versions of it that you can attack, and if any of them makes a dumb commitment then you win.
I guess you could account for this by just gradually increasing the threshold for making commitments over time, though.
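A rough way to see both the attack-surface worry and the proposed fix is the toy calculation below. It assumes (my assumption, not something argued above) that each past version independently makes an exploitable commitment with some small probability; `p_any_bad_commitment` and the specific numbers are purely illustrative.

```python
def p_any_bad_commitment(per_version_probs):
    """Chance that at least one past version makes an exploitable commitment.

    Assumes, hypothetically, that versions fail independently of each other.
    """
    p_none = 1.0
    for p in per_version_probs:
        p_none *= 1.0 - p
    return 1.0 - p_none


T = 100

# Flat threshold: every past version is equally easy to trick.
flat = [0.01] * T

# Threshold that rises over time: version t is tricked with probability
# 0.01 * 0.9**t, so the summed risk stays bounded however large T gets.
rising = [0.01 * 0.9**t for t in range(T)]

print(p_any_bad_commitment(flat))
print(p_any_bad_commitment(rising))
```

With a flat threshold the risk approaches 1 as T grows. With a geometrically rising threshold it can never exceed the sum of the per-version probabilities, which here stays under 0.1.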
Secondly: the further back you go, the more farsighted the past agent needs to be about the consequences of its commitments. If you have any compounding mistakes in the way it expects things to play out, then it’ll just get worse and worse the further back you defer.
Again, I guess you could account for this by having a higher threshold for making commitments which you expect to benefit you further down the line.
Then, re logical updatelessness: it feels like in the long term we need to unify logical + empirical updates, because they’re roughly the same type of thinking. Murky waters perhaps, but necessary ones.
Yeah, so what could this look like? I think one important idea is that you don’t have to be deferring to your past self, it’s just that your past self is the clearest Schelling point. But it wouldn’t be crazy for me to, say, use BDT: Buddha Decision Theory, in which I obey all commitments that the Buddha would have made for me if he’d been told about my situation. The differences between me using UDT and BDT (tentatively) seem only quantitative to me, not qualitative. BDT makes it harder for me to cooperate with hypothetical copies of myself who hadn’t yet thought of BDT (because “Buddha” is less of a Schelling point amongst copies of myself than “past Richard”). It also makes me worse off than UDT in some cases, because sometimes the Buddha would make commitments in favor of his interests, not mine. But it also makes it a bit easier for me to cooperate with others, who might also converge to BDT.
At this point I’m starting to suspect that solving UDT 2 is not just alignment-complete, it’s also politics- and sociology-complete. The real question is whether we can isolate toy examples or problems in which these ideas can be formalized, rather than just having them remain vague “what if everyone got along” speculation.
UDT doesn’t do multistage commitments, it has a single all-powerful “past” version that looks into all possible futures before pronouncing a global policy that all of them would then follow. This policy is not a collection of commitments in a reasonable informal sense, it’s literally all details of behavior of future versions of the agent in response to all possible observations. In case of logical updatelessness, also in response to all possible observations of computational facts. (UDT for the idealized past version defines a single master model, future versions are just passively running inference from the contexts of their particular situations.)
The convergent idea for acausal coordination between systems A and B seems to be constructing a shared subagent C whose instances exist as part of both A and B (after A and B successfully both construct the same C, not before), so that C can then act within them in the style of FDT, though really it’s mostly about C thinking of the effects of its behavior in terms of “I am an algorithm” rather than “I am a physical object”. (For UDT, the shared subagent C is the idealized common past version of its different possible future versions A and B. This assumes that A and B already have a lot in common, so maybe C is instead Buddha.)
A bulk of the blind alleys seem to be about allowing subagents various superpowers, instead of focusing on managing the fallout of making them small and bounded (but possibly more plentiful). I think this is where investigations into logical updatelessness go wrong. It does need solving, but not by considering some fact unknown globally, or even at certain logical times. Instead a fact can remain unknown to some small subagent, and can be observed by it at some point, or computed by another subagent. Values are also knowledge, so sufficiently small subagents shouldn’t even by default know full values of the larger system, and should be prepared to learn more about them. This is a consideration that doesn’t even depend on there initially being multiple big agents with different values.
Another point is that coordination doesn’t necessarily need construction of exactly the same shared subagent, or it doesn’t need to be “exactly the same” in a straightforward sense, which the results on coordination in PD illustrate. The role of subagents in this case is that A can create a subagent CA, while B creates a subagent CB. And even where A and B remain intractable for each other, CA and CB can be much smaller and by construction prepared to coordinate with each other, from within A and B. (It seems natural for the big agents to treat such subagents as something like their copies of an assurance contract, which is signed through commitment to give them influence over the big agent’s thinking or behavior. And letting contracts be agents in their own right gives a lot of flexibility in coordination they can arrange.)
Okay, so trying to combine Prisoner’s dilemma and UDT, we get: A and B are in a prisoner’s dilemma. Suppose they have a list of N agents (which include, say, A’s past self, B’s past self, the Buddha, etc), and they each must commit to following one of those agent’s instructions. Each of them estimates: “conditional on me committing to listen to agent K, here’s a distribution over which agent they’d commit to listen to” And then you maximize expected value based on that.
Okay, but why isn’t this exactly the same as them just thinking to themselves “conditional on me taking action K, here’s the distribution over their actions” for each of N actions they could take, and then maximizing expected value? It feels like the difference is that it’s really hard to actually reason about the correlations between my low-level actions and your low-level actions, whereas it might be easier to reason about the correlations between my high-level commitments and your high-level commitments.
I.e. the role of the Buddha in this situation is just to make the acausal coordination here much easier.
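To make the "commit to an advisor, then maximize expected value" step concrete, here is a minimal sketch from A's point of view. Everything in it is a made-up assumption for illustration: the payoff numbers, the conditional distributions, and especially the rule in `action` that an advisor recommends cooperating only when both players committed to that same advisor.

```python
# A's payoff in a standard Prisoner's Dilemma, keyed by (A's action, B's action).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def action(my_advisor, their_advisor):
    """Hypothetical advice rule: cooperate only around a shared advisor."""
    return "C" if my_advisor == their_advisor else "D"


def expected_value(my_advisor, conditional):
    """A's expected payoff from committing to `my_advisor`.

    conditional[k][j] is A's estimate of P(B commits to j | A commits to k).
    """
    return sum(
        p * PAYOFF[(action(my_advisor, theirs), action(theirs, my_advisor))]
        for theirs, p in conditional[my_advisor].items()
    )


def best_advisor(conditional):
    """The advisor A should commit to under these estimates."""
    return max(conditional, key=lambda k: expected_value(k, conditional))


# Illustrative estimates: the Buddha is the strongest shared Schelling point.
conditional = {
    "a_past_self": {"a_past_self": 0.10, "b_past_self": 0.80, "buddha": 0.10},
    "b_past_self": {"a_past_self": 0.30, "b_past_self": 0.50, "buddha": 0.20},
    "buddha": {"a_past_self": 0.05, "b_past_self": 0.15, "buddha": 0.80},
}

print(best_advisor(conditional))
```

The only thing this buys over enumerating low-level actions is that the conditional distribution is over a short list of high-level commitments, which is the part that seems easier to estimate.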
The main trick with PD is that instead of an agent only having two possible actions C and D, we consider many programs the agent might self-modify into (commit to becoming) that each might in the end compute C or D. This effectively changes the action space, there are now many more possible actions. And these programs/actions can be given access (like quines, by their own construction) to initial source code of all the agents, allowed to reason about them. But then programs have logical uncertainty about how they in the end behave, so the things you’d be enumerating don’t immediately cash out in expected values. And these programs can decide to cause different expected values depending on what you’ll do with their behavior, anticipate how you reason about them through reasoning about you in turn. It’s hard to find clear arguments for why any particular desirable thing could happen as a result of this setup.
UDT is notable for being one way of making this work. The “open source game theory” of PD (through Löb’s theorem, modal fixpoints, Payor’s lemma) pinpoints some cases where it’s possible to say that we get cooperation in PD. But in general it’s proven difficult to say anything both meaningful and flexible about this seemingly in-broad-strokes-inevitable setup, in particular for agents with different values that are doing more general things than playing PD.
(The following relies a little bit on motivation given in the other comment.)
When both A and B consider listening to a shared subagent C, subagent C is itself considering what it should be doing, depending on what A and B do with C’s behavior. So for example with A there are two stages of computation to consider: first, it was A and didn’t yet decide to sign the contract, then it became a composite system P(C), where P is A’s policy for giving influence to C’s behavior (possibly P and A include a larger part of the world where the first agent exists, not just the agent itself). The commitment of A is to the truth of the equality A=P(C), which gives C influence over the computational consequences of A in the particular shape P. The trick with the logical time of this process is that C should be able to know (something about) P updatelessly, without being shown observations of what it is, so that the instance of C within B would also know of P and be able to take it into account in choosing its joint policy that acts both through A and B. (Of course, the same is happening within B.)
This sketch frames decision making without directly appealing to consequentialism. Here, A controls B through the incentives P it creates for C (a particular way in which C gets to project influence from A’s place in the world), where C also has influence over B. So A doesn’t seek to manipulate B directly by considering the consequences for B’s behavior of various ways that A might behave.
It seems to me that Eliezer overrates the concept of a simple core of general intelligence, whereas Paul underrates it. Or, alternatively: it feels like Eliezer is leaning too heavily on the example of humans, and Paul is leaning too heavily on evidence from existing ML systems which don’t generalise very well.
I don’t think this is a particularly insightful or novel view, but it seems worth explicitly highlighting that you don’t have to side with one worldview or the other when evaluating the debates between them. (Although I’d caution not to just average their two views—instead, try to identify Eliezer’s best arguments, and Paul’s best arguments, and reconcile them.)
I’ve been reading Eliezer’s recent stories with protagonists from dath ilan (his fictional utopia). Partly due to the style, I found myself bouncing off a lot of the interesting claims that he made (although it still helped give me a feel for his overall worldview). The part I found most useful was this page about the history of dath ilan, which can be read without much background context. I’m referring mostly to the exposition on the first 2⁄3 of the page, although the rest of the story from there is also interesting. One key quote from the remainder of the story:
My main update is that Eliezer has a very deep-rooted belief that the world is Lawful, in that it makes sense to talk about real-world intelligence, coordination, ethics, etc, as (very imperfect) approximations to their idealised mathematically-definable forms. (Note though that these are conclusions I’ve extrapolated from his fiction, which is a fairly unreliable method of inferring people’s beliefs.)
I’d say lots of other things he’s said support that update. Stuff about how your model of the world will be accurate if and only if you somehow approximate Bayes’ law, for example.
The dath ilan based fiction definitely helped me internalize the idea better though.
A tension that keeps recurring when I think about philosophy is between the “view from nowhere” and the “view from somewhere”, i.e. a third-person versus first-person perspective—especially when thinking about anthropics.
One version of the view from nowhere says that there’s some “objective” way of assigning measure to universes (or people within those universes, or person-moments). You should expect to end up in different possible situations in proportion to how much measure your instances in those situations have. For example, UDASSA ascribes measure based on the simplicity of the computation that outputs your experience.
One version of the view from somewhere says that the way you assign measure across different instances should depend on your values. You should act as if you expect to end up in different possible future situations in proportion to how much power to implement your values the instances in each of those situations has. I’ll call this the ADT approach, because that seems like the core insight of Anthropic Decision Theory. Wei Dai also discusses it here.
In some sense each of these views makes a prediction. UDASSA predicts that we live in a universe with laws of physics that are very simple to specify (even if they’re computationally expensive to run), which seems to be true. Meanwhile the ADT approach “predicts” that we find ourselves at an unusually pivotal point in history, which also seems true.
Intuitively I want to say “yeah, but if I keep predicting that I will end up in more and more pivotal places, eventually that will be falsified”. But… on a personal level, this hasn’t actually been falsified yet. And more generally, acting on those predictions can still be positive in expectation even if they almost surely end up being falsified. It’s a St Petersburg paradox, basically.
Very speculatively, then, maybe a way to reconcile the view from somewhere and the view from nowhere is via something like geometric rationality, which avoids St Petersburg paradoxes. And more generally, it feels like there’s some kind of multi-agent perspective which says I shouldn’t model all these copies of myself as acting in unison, but rather as optimizing for some compromise between all their different goals (which can differ even if they’re identical, because of indexicality). No strong conclusions here but I want to keep playing around with some of these ideas (which were inspired by a call with @zhukeepa).
This was all kinda rambly but I think I can summarize it as “Isn’t it weird that ADT tells us that we should act as if we’ll end up in unusually important places, and also we do seem to be in an incredibly unusually important place in the universe? I don’t have a story for why these things are related but it does seem like a suspicious coincidence.”
Very interesting. It sounds like your “third person view from nowhere” vs the “first person view from somewhere” is very similar to something I was thinking about recently. I called them “objectively distinct situations” in contrast with “subjectively distinct situations”. My view is that most of the anthropic arguments that “feel wrong” to me are built on trying to make me assign equal probability to all subjectively distinct scenarios, rather than objective ones. E.g. a replication machine makes it so there are two of me, then “I” could be either of them, leaving two subjectively distinct cases, even if on the object level there is actually no distinction between “me” being clone A or clone B. [1]
I am very sceptical of this ADT. If you think the time/place you have ended up is unusually important I think that is more likely explained by something like “people decide what is important based on what is going on around them”.
[1] My thoughts are here: https://www.lesswrong.com/posts/v9mdyNBfEE8tsTNLb/subjective-questions-require-subjective-information
I’m not sure this is a valid interpretation of ADT. Can you say more about why you interpret ADT this way, maybe with an example? My own interpretation of how UDT deals with anthropics (and I’m assuming ADT is similar) is “Don’t think about indexical probabilities or subjective anticipation. Just think about measures of things you (considered as an algorithm with certain inputs) have influence over.”
This seems to “work” but anthropics still feels mysterious, i.e., we want an explanation of “why are we who we are / where we’re at” and it’s unsatisfying to “just don’t think about it”. UDASSA does give an explanation of that (but is also unsatisfying because it doesn’t deal with anticipations, and also is disconnected from decision theory).
I would say that under UDASSA, it’s perhaps not super surprising to be when/where we are, because this seems likely to be a highly simulated time/scenario for a number of reasons (curiosity about ancestors, acausal games, getting philosophical ideas from other civilizations).
(Speculative paragraph, quite plausibly this is just nonsense.) Suppose you have copies A and B who are both offered the same bet on whether they’re A. One way you could make this decision is to assign measure to A and B, then figure out what the marginal utility of money is for each of A and B, then maximize measure-weighted utility. Another way you could make this decision, though, is just to say “the indexical probability I assign to ending up as each of A and B is proportional to their marginal utility of money” and then maximize your expected money. Intuitively this feels super weird and unjustified, but it does make the “prediction” that we’d find ourselves in a place with high marginal utility of money, as we currently do.
(Of course “money” is not crucial here, you could have the same bet with “time” or any other resource that can be compared across worlds.)
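For what it's worth, the two procedures in the speculative paragraph can be checked against each other numerically. The sketch below is only an illustration with invented numbers. One assumption beyond the text: when the copies have unequal measure, I weight the indexical probability by measure times marginal utility, which reduces to "proportional to marginal utility" in the equal-measure case described above.

```python
def measure_weighted_utility(measure, marginal_utility, payoff):
    """First procedure: measure-weighted utility of taking the bet."""
    return sum(measure[k] * marginal_utility[k] * payoff[k] for k in payoff)


def indexical_probabilities(measure, marginal_utility):
    """Second procedure, step 1: fold marginal utility into the probabilities."""
    weights = {k: measure[k] * marginal_utility[k] for k in measure}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def expected_money(probabilities, payoff):
    """Second procedure, step 2: plain expected money under those probabilities."""
    return sum(probabilities[k] * payoff[k] for k in payoff)


# Invented numbers: equal measure, but money matters 3x more to copy A.
measure = {"A": 0.5, "B": 0.5}
marginal_utility = {"A": 3.0, "B": 1.0}
payoff = {"A": 1.0, "B": -2.0}  # bet on "I am A": win 1 if A, lose 2 if B

u = measure_weighted_utility(measure, marginal_utility, payoff)
q = indexical_probabilities(measure, marginal_utility)
m = expected_money(q, payoff)
print(u, q, m)
```

The two quantities differ only by a positive normalizing constant, so they always agree on whether to take the bet. That is why the second framing is decision-equivalent to the first even though it reads as a claim about where you will "find yourself".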
Fair point. By “acausal games” do you mean a generalization of acausal trade? (Acausal trade is the main reason I’d expect us to be simulated a lot.)
This is particularly weird because your indexical probability then depends on what kind of bet you’re offered. In other words, our marginal utility of money differs from our marginal utility of other things, and which one do you use to set your indexical probability? So this seems like a non-starter to me… (ETA: Maybe it changes moment by moment as we consider different decisions, or something like that? But what about when we’re just contemplating a philosophical problem and not trying to make any specific decisions?)
Yes, didn’t want to just say “acausal trade” in case threats/war is also a big thing.
It seems pretty weird to me too, but to steelman: why shouldn’t it depend on the type of bet you’re offered? Your indexical probabilities can depend on any other type of observation you have when you open your eyes. E.g. maybe you see blue carpets, and you know that world A is 2x more likely to have blue carpets. And hearing someone say “and the bet is denominated in money not time” could maybe update you in an analogous way.
I mostly offer this in the spirit of “here’s the only way I can see to reconcile subjective anticipation with UDT at all”, not “here’s something which makes any sense mechanistically or which I can justify on intuitive grounds”.
I added this to my comment just before I saw your reply: Maybe it changes moment by moment as we consider different decisions, or something like that? But what about when we’re just contemplating a philosophical problem and not trying to make any specific decisions?
Ah I see. I think this is incomplete even for that purpose, because “subjective anticipation” to me also includes “I currently see X, what should I expect to see in the future?” and not just “What should I expect to see, unconditionally?” (See the link earlier about UDASSA not dealing with subjective anticipation.)
ETA: Currently I’m basically thinking: use UDT for making decisions, use UDASSA for unconditional subjective anticipation, am confused about conditional subjective anticipation as well as how UDT and UDASSA are disconnected from each other (i.e., the subjective anticipation from UDASSA not feeding into decision making). Would love to improve upon this, but your idea currently feels worse than this...
In a bayesian rationalist view of the world, we assign probabilities to statements based on how likely we think they are to be true. But truth is a matter of degree, as Asimov points out. In other words, all models are wrong, but some are less wrong than others.
Consider, for example, the claim that evolution selects for reproductive fitness. Well, this is mostly true, but there’s also sometimes group selection, and the claim doesn’t distinguish between a gene-level view and an individual-level view, and so on...
So just assigning it a single probability seems inadequate. Instead, we could assign a probability distribution over its degree of correctness. But because degree of correctness is such a fuzzy concept, it’d be pretty hard to connect this distribution back to observations.
Or perhaps the distinction between truth and falsehood is sufficiently clear-cut in most everyday situations for this not to be a problem. But questions about complex systems (including, say, human thoughts and emotions) are messy enough that I expect the difference between “mostly true” and “entirely true” to often be significant.
Has this been discussed before? Given Less Wrong’s name, I’d be surprised if not, but I don’t think I’ve stumbled across it.
This feels generally related to the problems covered in Scott and Abram’s research over the past few years. One of the sentences that stuck out to me the most was (roughly paraphrased since I don’t want to look it up):
I.e. our current formulations of bayesianism like Solomonoff induction only formulate the idea of a hypothesis at such a low level that even trying to think about a single hypothesis rigorously is basically impossible with bounded computational time. So in order to actually think about anything you have to somehow move beyond naive bayesianism.
This seems reasonable, thanks. But I note that “in order to actually think about anything you have to somehow move beyond naive bayesianism” is a very strong criticism. Does this invalidate everything that has been said about using naive bayesianism in the real world? E.g. every instance where Eliezer says “be bayesian”.
One possible answer is “no, because logical induction fixes the problem”. My uninformed guess is that this doesn’t work because there are comparable problems with applying it to the real world. But if this is your answer, follow-up question: before we knew about logical induction, were the injunctions to “be bayesian” justified?
(Also, for historical reasons, I’d be interested in knowing when you started believing this.)
I think it definitely changed a bunch of stuff for me, and does at least a bit invalidate some of the things that Eliezer said, though not actually very much.
In most of his writing Eliezer used bayesianism as an ideal that was obviously unachievable, but that still gives you a rough sense of what the actual limits of cognition are, and rules out a bunch of methods of cognition as being clearly in conflict with that theoretical ideal. I did definitely get confused for a while and tried to apply Bayes to everything directly, and then felt bad when I couldn’t actually apply Bayes’ theorem in some situations, which I now realize is because those tended to be problems where embeddedness or logical uncertainty mattered a lot.
My shift on this happened over the last 2-3 years or so. I think starting with Embedded Agency, but maybe a bit before that.
Which ones? In Against Strong Bayesianism I give a long list of methods of cognition that are clearly in conflict with the theoretical ideal, but in practice are obviously fine. So I’m not sure how we distinguish what’s ruled out from what isn’t.
Can you give an example of a real-world problem where logical uncertainty doesn’t matter a lot, given that without logical uncertainty, we’d have solved all of mathematics and considered all the best possible theories in every other domain?
I think in practice there are lots of situations where you can confidently create a kind of pocket-universe where you can actually consider hypotheses in a bayesian way.
Concrete example: Trying to figure out who voted a specific way on a LW post. You can condition pretty cleanly on vote-strength, and treat people’s votes as roughly independent, so if you have guesses on how different people are likely to vote, it’s pretty easy to create the odds ratios for basically all final karma + vote numbers and then make a final guess based on that.
It’s clear that there is some simplification going on here, by assigning static probabilities for people’s vote behavior, treating them as independent (though modeling some subset of independence wouldn’t be too hard), etc. But overall I expect it to perform pretty well and to give you good answers.
(Note, I haven’t actually done this explicitly, but my guess is my brain is doing something pretty close to this when I do see vote numbers + karma numbers on a thread)
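A minimal sketch of what that pocket-universe calculation could look like. Everything here is a made-up assumption for illustration: the three users, their vote strengths, and the static per-user probabilities of upvoting or downvoting. It enumerates every joint vote pattern, keeps the ones consistent with the observed karma and vote count, and normalizes.

```python
from itertools import product

# Hypothetical users: (vote strength, P(upvote), P(downvote)); the remainder is P(no vote).
USERS = {
    "alice": (1, 0.30, 0.05),
    "bob":   (2, 0.10, 0.20),
    "carol": (3, 0.50, 0.05),
}

def posterior_over_patterns(users, observed_karma, observed_votes):
    """Return P(vote pattern | observed karma and vote count).

    A pattern is a tuple with one entry per user: +1 (up), 0 (none), -1 (down).
    Votes are treated as independent, as in the simplification described above.
    """
    names = list(users)
    weights = {}
    for pattern in product((+1, 0, -1), repeat=len(names)):
        prob, karma, votes = 1.0, 0, 0
        for name, vote in zip(names, pattern):
            strength, p_up, p_down = users[name]
            prob *= {+1: p_up, -1: p_down, 0: 1.0 - p_up - p_down}[vote]
            karma += vote * strength
            votes += abs(vote)
        if karma == observed_karma and votes == observed_votes:
            weights[pattern] = prob  # prior probability of a pattern that fits the data
    total = sum(weights.values())
    return {pattern: w / total for pattern, w in weights.items()} if total else {}

posterior = posterior_over_patterns(USERS, observed_karma=1, observed_votes=2)
for pattern, prob in sorted(posterior.items(), key=lambda item: -item[1]):
    print(dict(zip(USERS, pattern)), round(prob, 3))
```

With these invented numbers only two patterns fit a karma of 1 from 2 votes, and the posterior strongly favors the one where bob downvoted and carol upvoted. Exhaustive enumeration only works for a handful of users, which is part of why this counts as a pocket universe.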
Well, it’s obvious that anything that claims to be better than the ideal bayesian update is clearly ruled out. I.e. arguments that by writing really good explanations of a phenomenon you can get to a perfect understanding. Or arguments that you can derive the rules of physics from first principles.
There are also lots of hypotheticals where you do get to just use Bayes properly and then it provides very strong bounds on the ideal approach. There are a good number of implicit models behind lots of standard statistics models that when put into a bayesian framework give rise to a more general formulation. See the Wikipedia article for “Bayesian interpretations of regression” for a number of examples.
Of course, in reality it is always unclear whether the assumptions that give rise to various regression methods actually hold, but I think you can totally say things like “given these assumptions, the bayesian solution is the ideal one, and you can’t perform better than this, and if you put in the computational effort you will actually achieve this performance”.
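One standard instance of this, sketched with assumed notation (the symbols below are chosen for illustration, not taken from the comment): under Gaussian noise and a zero-mean Gaussian prior on the weights, the maximum a posteriori estimate is exactly ridge regression.

```latex
% Assumed model: linear predictor, Gaussian noise, zero-mean Gaussian prior on the weights.
y = Xw + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I), \qquad w \sim \mathcal{N}(0, \tau^2 I)

% Maximizing the posterior equals minimizing the negative log posterior:
\hat{w}_{\mathrm{MAP}}
  = \arg\min_w \; \frac{1}{2\sigma^2}\lVert y - Xw \rVert^2 + \frac{1}{2\tau^2}\lVert w \rVert^2
  = (X^\top X + \lambda I)^{-1} X^\top y, \qquad \lambda = \frac{\sigma^2}{\tau^2}
```

So the regularization strength that usually gets tuned by hand corresponds, under these assumptions, to the ratio of noise variance to prior variance.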
Are you able to give examples of the times you tried to be Bayesian and it failed because of embeddedness?
Scott and Abram? Who? Do they have any books I can read to familiarize myself with this discourse?
Scott: https://lesswrong.com/users/scott-garrabrant
Abram: https://lesswrong.com/users/abramdemski
Scott Garrabrant and Abram Demski, two MIRI researchers.
For introductions to their work, see the Embedded Agency sequence, the Consequences of Logical Induction sequence, and the Cartesian Frames sequence.
Related but not identical: this shortform post.
See the section about scoring rules in the Technical Explanation.
Hmmm, but what does this give us? He talks about the difference between vague theories and technical theories, but then says that we can use a scoring rule to change the probabilities we assign to each type of theory.
But my question is still: when you increase your credence in a vague theory, what are you increasing your credence about? That the theory is true?
Nor can we say that it’s about picking the “best theory” out of the ones we have, since different theories may overlap partially.
If we can quantify how good a theory is at making accurate predictions (or rather, quantify a combination of accuracy and simplicity), that gives us a sense in which some theories are “better” (less wrong) than others, without needing theories to be “true”.
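A minimal sketch of the accuracy half of that idea, using the logarithmic scoring rule. The two "theories" and the outcomes are invented for illustration, and simplicity is not modeled at all.

```python
import math

def log_score(predicted_probs, outcomes):
    """Average log probability a theory assigned to what actually happened.

    Higher (closer to zero) is better. predicted_probs[i] is the probability the
    theory gave to event i occurring; outcomes[i] says whether it occurred.
    """
    return sum(
        math.log(p if happened else 1.0 - p)
        for p, happened in zip(predicted_probs, outcomes)
    ) / len(outcomes)

# Hypothetical data: four yes/no events and two theories' predictions for them.
outcomes = [True, True, False, True]
vague_theory = [0.5, 0.5, 0.5, 0.5]   # never commits to anything
sharp_theory = [0.9, 0.8, 0.2, 0.7]   # makes confident, mostly right predictions

print("vague:", log_score(vague_theory, outcomes))
print("sharp:", log_score(sharp_theory, outcomes))
```

Here the sharper theory scores better, and neither theory has to be "true" for the comparison to make sense. A confident theory that was wrong would be penalized heavily, which is what makes the rule proper.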
Oracle-genie-sovereign is a really useful distinction that I think I (and probably many others) have avoided using mainly because “genie” sounds unprofessional/unacademic. This is a real shame, and a good lesson for future terminology.
After rereading the chapter in Superintelligence, it seems to me that “genie” captures something akin to act-based agents. Do you think that’s the main way to use this concept in the current state of the field, or do you have other applications in mind?
Ah, yeah, that’s a great point. Although I think act-based agents is a pretty bad name, since those agents may often carry out a whole bunch of acts in a row—in fact, I think that’s what made me overlook the fact that it’s pointing at the right concept. So not sure if I’m comfortable using it going forward, but thanks for pointing that out.
Perhaps the lesson is that terminology that is acceptable in one field (in this case philosophy) might not be suitable in another (in this case machine learning).
I don’t think that even philosophers take the “genie” terminology very seriously. I think the more general lesson is something like: it’s particularly important to spend your weirdness points wisely when you want others to copy you, because they may be less willing to spend weirdness points.
Is that from Superintelligence? I googled it, and that was the most convincing result.
Yepp.
People sometimes try to reason about the likelihood of deceptive alignment by appealing to speed priors and simplicity priors. I don’t like such appeals, because I think that the differences between aligned and deceptive AGIs will likely be a very small proportion of the total space/time complexity of an AGI. More specifically:
1. If AGIs had to rederive deceptive alignment in every episode, that would make a big speed difference. But presumably, after thinking about it a few times during training, they will remember their conclusions for a while, and bring them to mind in whichever episodes they’re relevant. So the speed cost of deception will be amortized across the (likely very long) training period.
2. AGIs will represent a huge number of beliefs and heuristics which inform their actions (e.g. every single fact they know). A heuristic like “when you see X, initiate the world takeover plan” would therefore constitute a very small proportion of the total information represented in the network; it’d be hard to regularize it away without regularizing away most of the AGI’s knowledge.
I think that something like the speed vs simplicity tradeoff is relevant to the likelihood of deceptive alignment, but it needs to be more nuanced. One idea I’ve been playing around with: the tradeoff between conservatism and systematization (as discussed here). An agent that prioritizes conservatism will tend to do the things they’ve previously done. An agent that prioritizes systematization will tend to do the things that are favored by simple arguments.
To illustrate: suppose you have an argument in your head like “if I get a chance to take a 60/40 double-or-nothing bet for all my assets, I should”. Suppose you’ve thought about this a bunch and you’re intellectually convinced of it. Then you’re actually confronted with the situation. Some people will be more conservative, and follow their gut (“I know I said I would, but… this is kinda crazy”). Others (like most utilitarians and rationalists) will be more systematizing (“it makes sense, let’s do it”). Intuitively, you could also think of this as a tradeoff between memorization and generalization; or between a more egalitarian decision-making process (“most of my heuristics say no”) and a more centralized process (“my intellectual parts say yes”). I don’t know how to formalize any of these ideas, but I’d like to try to figure it out.
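For concreteness, the simple argument in favor of the bet is presumably an expected-value calculation along these lines (reading the odds as a 60% chance of doubling is an assumption; the text doesn’t spell it out):

```latex
% Assumption: a 0.6 chance of doubling current wealth W and a 0.4 chance of losing all of it.
\mathbb{E}[W_{\text{after}}] = 0.6 \cdot 2W + 0.4 \cdot 0 = 1.2\,W
```

The conservative gut reaction tracks the other number in that line: a 40% chance of ending up with nothing.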
Why do you think SGD will do this? Or are you imagining non-SGD mechanisms?
It seems non-obvious to me that this will occur with SGD, though possible.
You mean this about something trained totally differently than a LLM, no? Because this mechanism seems totally implausible to me otherwise.
Think during the forward pass, learn during the backward pass; if the model uses deceptive reasoning in the forward pass and the gradient says it’s useful for prediction, that seems like the mechanism as described. Thoughts?
So, there are a few different reasons, none of which I’ve formalized to my satisfaction.
I’m curious if these make sense to you.
(1) One is that the actual kinds of reasoning that an LLM can learn in its forward pass are quite limited.
As is well established, for instance, Transformers cannot multiply arbitrarily-long integers in a single forward pass. The number of additions involved in multiplying an N-digit integer increases in an unbounded way with N; thus, a Transformer with a finite number of layers cannot do it. (Example: Prompt GPT-4 for the results of multiplying two 5-digit numbers, specifying not to use a calculator, see how it does.)
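To make the growth concrete, here is a small sketch that counts elementary steps in ordinary schoolbook multiplication. The choice of algorithm and of what counts as one step are assumptions made for illustration; faster algorithms exist, but their step counts also grow without bound as N grows.

```python
def to_digits(n):
    """Decimal digits of a non-negative integer, least significant first."""
    return [int(ch) for ch in reversed(str(n))]

def from_digits(digits):
    """Inverse of to_digits (leading zeros are harmless)."""
    return int("".join(str(d) for d in reversed(digits)))

def schoolbook_multiply(a_digits, b_digits):
    """Multiply two digit lists the way it is done on paper.

    Returns (product_digits, steps), where each step is one single-digit
    multiply-and-add with carry.
    """
    result = [0] * (len(a_digits) + len(b_digits))
    steps = 0
    for i, a in enumerate(a_digits):
        carry = 0
        for j, b in enumerate(b_digits):
            total = result[i + j] + a * b + carry
            result[i + j] = total % 10
            carry = total // 10
            steps += 1
        result[i + len(b_digits)] += carry  # this slot is still untouched by earlier rows
    return result, steps

for n_digits in (2, 5, 10, 20):
    x = 10 ** n_digits - 1  # the largest n-digit number
    product_digits, steps = schoolbook_multiply(to_digits(x), to_digits(x))
    assert from_digits(product_digits) == x * x
    print(n_digits, "digits:", steps, "steps")
```

For two N-digit inputs this takes N² steps, and much of that work is serial because each carry depends on the previous one. That is the sense in which a fixed number of layers runs out.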
Of course in use you can teach a GPT to use a calculator—but we’re talking about operations that occur in single forward pass, which rules out using tools. Because of this shallow serial depth, a Transformer also cannot (1) divide arbitrary integers, (2) figure out the results of physical phenomena that have multiplication / division problems embedded in them, (3) figure out the results of arbitrary programs with loops, and so on.
(Note—to be very clear NONE of this is a limitation on what kind of operations we can get a transformer to do over multiple unrollings of the forward pass. You can teach a transformer to use a calculator; or to ask a friend for help; or to use a scratchpad, or whatever. But we need to hide deception in a single forward pass, which is why I’m harping on this.)
So to think that you learn deception in a forward pass, you have to think that the transformer thinks something like “Hey, if I deceive the user into thinking that I’m a good entity, I’ll be able to later seize power, and if I seize power, then I’ll be able to (do whatever), so—considering all this, I should… predict the next token will be ‘purple’”, and that it thinks this in a context that could NOT come up with the algorithm for multiplication, or for addition, or for any number of other things, even though an algorithm for multiplication would be much much MUCH more directly incentivized by SGD, because it’s directly relevant for token predictions.
(2). Another way to get at the problem with this reasoning, is that I think it hypothesizes an agent within weight updates off the analogical resemblance to an agent that the finished product has. But in fact there’s at most a superficial resemblance between (LLM forward pass) and (repeated LLM forward passes in a Chain-of-thought over text).
That is, an LLM unrolled multiple times, from a given prompt, can make plans; it can plot to seize power, imitating humans who it saw thus plot; it can multiply N-digit integers, working them out just like a human. But this tells us literally nothing about what it can do in a single forward pass.
For comparison, consider a large neural network that is used for image segmentation. The entire physical world falls into the domain of such a model. It can learn that people exist, that dogs exist, and that machinery exists, in some sense. What if such a neural network—in a single forward pass—used deceptive reasoning, which turned out to be useful for prediction because of the backward pass, and that we ought therefore expect that such a neural network—when embedded in some device down the road—would turn and kill us?
The argument is exactly identical to the case of the language model, but no one makes it. And I think the reason is that people think about the properties that a trained LLM can exhibit when unrolled over multiple forward passes, in a particular context and with a particular prompt, and then mistakenly attribute these properties to the single forward pass.
(All of which is to say—look, if you think you can get a deceptive agent from a LLM this way you should also expect a deceptive agent from an image segmentation model. Maybe that’s true! But I’ve never seen anyone say this, which makes me think they’re making the mistake I describe above.)
(3). I think this is just attributing extremely complex machinery to the forward pass of an LLM that is supposed to show up in a data-indifferent manner, and that this is a universally bad bet for ML.
Like, different Transformers store different things depending on the data they’re given. If you train them on SciHub they store a bunch of SciHub shit. If you train them on Wikipedia they store a bunch of Wikipedia shit. In every case, for each weight in the Transformer, you can find specific reasons for each neuron being what it is because of the data.
The “LLM will learn deception” hypothesis amounts to saying that—so long as a LLM is big enough, and trained on enough data to know the world exists—you’ll find complex machinery in it that (1) specifically activates once it figures out that it’s “not in training” and (2) was mostly just hiding until then. My bet is that this won’t show up, because there are no such structures in a Transformer that don’t depend on data. Your French Transformer / English Transformer / Toolformer / etc will not all learn to betray you if they get big enough—we will not find unused complex machinery in a Transformer to betray you because we find NO unused complex machinery in a transformer, etc.
I think an actually well-put together argument will talk about frequency bias and shit, but this is all I feel like typing for now.
Does this make sense? I’m still working on putting it together.
You’ve given me a lot to think about, thanks! Here are my thoughts as I read:
Yes, and the sort of deceptive reasoning I’m worried about sure seems pretty simple, very little serial depth to it. Unlike multiplying two 5-digit integers. For example the example you give involves like 6 steps. I’m pretty sure GPT4 already does ‘reasoning’ of about that level of sophistication in a single forward pass, e.g. when predicting the next token of the transcript of a conversation in which one human is deceiving another about something. (In fact, in general, how do you explain how LLMs can predict deceptive text, if they simply don’t have enough layers to do all the deceptive reasoning without ‘spilling the beans’ into the token stream?)
The reason I’m not worried about image segmentation models is that it doesn’t seem like they’d have the relevant capabilities or goals. Maybe in the limit they would—if we somehow banned all other kinds of AI, but let image segmentation models scale to arbitrary size and training data amounts—then eventually after decades of scaling and adding 9’s of reliability to their image predictions, they’d end up with scary agents inside because that would be useful for getting one of those 9’s. But yeah, it’s a pretty good bet that the relevant kinds of capabilities (e.g. ability to coherently pursue goals, ability to write code, ability to persuade humans of stuff) are most likely to appear earliest in systems that are trained in environments more tailored to developing those capabilities. tl;dr my answer is “wake me when an image segmentation model starts performing well in dangerous capabilities evals like METR’s and OpenAI’s. Which won’t happen for a long time because image segmentation models are going to be worse at agency than models explicitly trained to be agents.”
I think this is a misunderstanding of the LLM will learn deception hypothesis. First of all, the conditions of the hypothesis are not just “so long as it’s big enough and knows the world exists.” It’s more stringent than that; there probably needs to be agency, for example (goal-directedness) and situational awareness. (Though I think John Wentworth disagrees?)
Secondly, the “complex machinery” claim is actually trivial, though you make it sound like it’s crazy. ANY behavior of a neural net in situation class X, which does not appear in situation class Y, is the result of ‘unused-in-Y complex machinery.’ So set Y = training and X = deployment, and literally any claim about how deployment will be different from training involves this.
Another different approach I could take would be: The complex machinery DOES get used a lot in training. Indeed that’s why it evolved/was-formed-by-SGD. The complex machinery is the goal-directedness machinery, the machinery that chooses actions on the basis of how well the action is predicted to serve the goals. That machinery is presumably used all the fucking time in training, and it causes the system to behave-as-if-aligned in training and behave in blatantly unaligned ways once it’s very obvious that it can get away with doing so.
These all seem like reasonable reasons to doubt the hypothesized mechanism, yup. I think you’re underestimating how much can happen in a single forward pass, though—it has to be somewhat shallow, so it can’t involve too many variables, but the whole point of making the networks as large as we do these days is that it turns out an awful lot can happen in parallel. I also think there would be no reason for deception to occur if it’s never a good weight pattern to use to predict the data, it’s only if the data contains a pattern that the gradient will put into a deceptive forward mechanism that this could possibly occur. For example, if the model is trained on a bunch of humans being deceptive about their political intentions, and then RLHF is attempted.
In any case, I don’t think the old Yudkowsky model of deceptive alignment is relevant, in that I think the level of deception to expect from AI should be calibrated to be around the amount you’d expect from a young human, not some super schemer god. The concern arises only when the data actually contains patterns well modeled by deception, and this would be expected to be more present in the case of something like an engagement maximizer online learning RL system.
And to be clear I don’t expect the things that can destroy humanity to arise because of deception directly. It seems much more likely to me that they’ll arise because competing people ask their model to do something that puts those models in competition in a way that puts humanity at risk, eg several different powerful competing model based engagement/sales optimizing reinforcement learners, or more speculatively something military. Something where the core problem is that alignment tech is effectively not used, and where solving this deception problem wouldn’t have saved us anyway.
Regarding the details of your descriptions: I really mainly think this sort of deception would arise in the wild when there’s a reward model passing gradients to multiple steps of a sequential model, or possibly the imitating humans locally thing. But without a reward model, nothing pushes the different steps of the sequential model towards trying to achieve the “same thing” across different steps in any significant sense. But of course almost all the really useful models involve a reward model somehow.
It’s really weird that we find ourselves at the hinge of history. One proposed explanation is that we’re part of an ancestor simulation. It makes sense that ancestor simulations would be focused on the hinge of history. But unless ancestor simulations make up a significant proportion of future minds, it’s still weird that we find ourselves in a simulation rather than actually experiencing the future.
Why might ancestor simulations make up a significant proportion of future minds? One possible answer is that ancestor simulations provide the information required for acausal cooperation across large worlds (known as ECL). If knowing the values that civilizations developed after the hinge of history allowed you to trade with them, then civilizations should focus a significant proportion of their resources on simulating the hinges of history experienced by many other civilizations.
I presume that this explanation has been proposed before, and probably in more detail; links appreciated if so.
I’ve done work in this area, but never been particularly enthusiastic about promoting it. It usually turns out to be inactionable/grim/likely to rouse a panic.
This is a familiar thought, to me.
A counterargument occurs to me: Isn’t it arguable that most of what we need to know about a species, to trade with it, is just downstream of its biology? Of course we talk a lot about our contingent factors, our culture, our history, but I think we’re pretty much just the same animals we’ve always been, extrapolated. If that’s the case, wouldn’t far more simulation time be given to evolutionary histories, rather than simulating variations of hinges? Anthropic measure wouldn’t be especially concentrated on the hinge, it might even skip it.
Countercounterargument: it also seems like there are a lot of anti-inductive effects in the histories of technological societies that might mean you really do have to simulate it all to find out how values settle or just to figure out the species’ rate of success. Evolutionary histories might also have a lot more computationally compressible shared structure.
I’d be surprised if this, the world in front of us, were a pareto-efficient bargaining outcome. Hinge histories fucking suck to live in and I would strongly prefer a trade protocol that instantiated as few of them as possible. I wouldn’t expect many to be necessary, certainly not enough to significantly outweigh the… thing that is supposed to come after. (at this point, I’d prefer to take it into DMs/call)
Thinking about this stuff again, something occurred to me. Please make sure to keep, in cold storage, copies of misaligned AGIs that you may produce, when you catch them. It’s important. This policy could save us.
Would you care to expand on your remark? I don’t see how it follows from what you said above it.
Yeah, it wasn’t argued. I wasn’t sure whether it needed to be explained, for Richard. I don’t remember how I wound up getting there from the rest of the comment, I think it was just in the same broad neighborhood.
Regardless, yes, I totally can expand on that. Here, I wrote it up: Do Not Delete your Misaligned AGI.
World champion in Chess: “It’s really weird that I’m world champion. It must be a simulation or I must dream or..”
Joe Biden: “It’s really weird I’m president, it must be a simul...”
(Donald Trump: “It really really makes no sense I’m president, it MUST be a s..”)
David Chalmers: “It’s really weird I’m providing the seminal hard problem formulation. It must be a sim..”
...
Rationalist (before finding lesswrong): “Gosh, all these people around me, really wired differently than I am. I must be in a simulation.”
Something seems funny to me in the anthropic reasoning in these examples, and in yours too.
Of course we have one world champion in chess or anything, so a line of reasoning under which the world champion, quasi by definition, questions his champion-ness seems odd. Then, I’d be lying if I claimed I could not intuitively empathize with his wondering about the odds of exactly him being the world champion among 9 billion.
This leads me to the following, that eventually +- satisfies me:
Hypothetically, imagine each generation has only 1 person, and there’s rebirth: it’s just a rebirth of the same person, in a different generation.
With some simplification:
For 10 000 generations you lived in stone-age conditions
For 1 generation—today—you’re the hinge-of-history generation
X (X being: you won’t live anymore at all as AI killed everything; or you live 1 mio generations happily, served by AI, or what have you).
The 10 000 you’s didn’t have much reason to wonder about the hinge of history, and so didn’t happen to think about it. The one you, in the hinge-of-history generation, by definition, has much reason to think about the hinge of history, and does think about it.
So, it becomes a bit like a lottery game, which you repeat so many times that you naturally draw the winning number once. At that lucky punch, there’s no reason to think “Unlikely, it’s probably a simulation”, or anything.
I have the impression that, in a similar way, the reincarnated guy should not wonder about it, even when his memory is wiped each time, and in the same vein (hm, am I sloppy here? that’s the hinge of my argument) neither do you have to wonder too much.
Nit: ECL is just one of several kinds of acausal cooperation across large worlds.
What are the others?
In general I don’t think anthropic reasoning like this holds any substance. We experience what we experience, and condition on that in forming models about what it is and where we are in it.
We don’t get to make millions of bits of observations about being a human in a technological society, use those observations to extrapolate the possibility of supergalactic multitudes of consciousness, and then express surprise at a pathetic few dozen bits of improbability of not being one of those multitudes. We already used those bits (and a great many more!) in forming our model in the first place.
Since there’s been some recent discussion of the SSC/NYT incident (in particular via Zack’s post), it seems worth copying over my twitter threads from that time about why I was disappointed by the rationalist community’s response to the situation.
I continue to stand by everything I said below.
Thread 1 (6/23/20):
Scott Alexander is the most politically charitable person I know. Him being driven off the internet is terrible. Separately, it is also terrible if we have totally failed to internalize his lessons, and immediately leap to the conclusion that the NYT is being evil or selfish.
Ours is a community built around the long-term value of telling the truth. Are we unable to imagine reasonable disagreement about when the benefits of revealing real names outweigh the harms? Yes, it goes against our norms, but different groups have different norms.
If the extended rationalist/SSC community could cancel the NYT, would we? For planning to doxx Scott? For actually doing so, as a dumb mistake? For doing so, but for principled reasons? Would we give those reasons fair hearing? From what I’ve seen so far, I suspect not.
I feel very sorry for Scott, and really hope the NYT doesn’t doxx him or anyone else. But if you claim to be charitable and openminded, except when confronted by a test that affects your own community, then you’re using those words as performative weapons, deliberately or not.
[One more tweet responding to tweets by @balajis and @webdevmason, omitted here.]
Thread 2 (1/21/21):
Scott Alexander is writing again, on a substack blog called Astral Codex Ten! Also, he doxxed himself in the first post. This post seems like solid evidence that many SSC fans dramatically overreacted to the NYT situation.
Scott: “I still think the most likely explanation for what happened was that there was a rule on the books, some departments and editors followed it more slavishly than others, and I had the bad luck to be assigned to a department and editor that followed it a lot. That’s all.” [I didn’t comment on this in the thread, but I intended to highlight the difference between this and the conspiratorial rhetoric that was floating around when he originally took his blog down.]
I am pretty unimpressed by his self-justification: “Suppose Power comes up to you and says hey, I’m gonna kick you in the balls. … Sometimes you have to be a crazy bastard so people won’t walk all over you.” Why is doxxing the one thing Scott won’t be charitable about?
[In response to @habryka asking what it would mean for Scott to be charitable about this]: Merely to continue applying the standards of most of his other posts, where he assumes both sides are reasonable and have useful perspectives. And not to turn this into a bravery debate.
[In response to @benskuhn saying that Scott’s response is understandable, since being doxxed nearly prevented him from going into medicine]: On one hand, yes, this seems reasonable. On the other hand, this is also a fully general excuse for unreasonable dialogue. It is always the case that important issues have had major impacts on individuals. Taking this excuse seriously undermines Scott’s key principles.
I would be less critical if it were just Scott, but a lot of people jumped on narratives similar to “NYT is going around kicking people in the balls for selfish reasons”, demonstrating an alarming amount of tribalism—and worse, lack of self-awareness about it.
Scott is already too charitable. I’d even say that Scott being too charitable made this specific situation worse. I don’t find this to be a worthwhile thing about Scott either for us to emulate, or for Scott to take further.
“Quokka” is a meme about rationalists for a reason. You are not going to have unerring logical evidence that someone wants to harm you if they are trying to be at all subtle. You have to figure it out from their behavior.
Sometimes it just isn’t true that both sides are reasonable and have useful perspectives.
I think this only holds if NYT has a consistent policy of using real names. My understanding is they have repeatedly written about other people using pseudonyms only, and have not articulated a principled reason to treat Scott differently.
Scott’s flavor of charity is not quite this. It wouldn’t be useful for understanding sides that are not reasonable or have useless perspectives otherwise, or else you’d need to routinely “assume” false things to carry out the exercise.
The point is to meaningfully engage with other perspectives, without the usual prerequisite of having positive beliefs about them. Treating them in a similar way as if they were reasonable or useful, even when they clearly aren’t. Sometimes the resulting investigation changes one’s mind on this point. But often it doesn’t, while still revealing many details that wouldn’t otherwise be noticed. Actually intervening on your own beliefs would be self-deception, while treating useless and unreasonable views as they are usually treated wouldn’t be charity.
This is related to tolerance, where the point isn’t to start liking people you don’t like, or to start considering them part of your own ingroup. It’s instead an intervention/norm that goes around the dislike to remove some of its downsides without directly removing the dislike itself.
My mental one-sentence summary of how to think about ELK is “making debate work well in a setting where debaters are able to cite evidence gained by using interpretability tools on each other”.
I’m not claiming that this is how anyone else thinks about ELK (although I got the core idea from talking to Paul) but since I haven’t seen it posted online yet, and since ELK is pretty confusing, I thought it’d be useful to put out there. In particular, this framing motivates us generating interpretability tools which scale in the sense of being robust when used as evidence by AGIs.
Note that this is a very different type of solution to the ones in the original writeup, which seem mainly useful for illustrative purposes rather than actually pointing in promising directions.
Being nice because you’re altruistic, and being even nicer for decision-theoretic reasons on top of that, seems like it involves some kind of double-counting: the reason you’re altruistic in the first place is because evolution ingrained the decision theory into your values.
But it’s not fully double-counting: many humans generalise altruism in a way which leads them to “cooperate” far more than is decision-theoretically rational for the selfish parts of them—e.g. by making big sacrifices for animals, future people, etc. I guess this could be selfishly rational if you subscribe to a very strong form of updatelessness, but I am very skeptical that we’ll discover arguments that this much updatelessness is rationally obligatory.
A very speculative takeaway: maybe “how updateless you are” and “how altruistic you are” are kinda measuring the same thing, and there’s no clean split between whether that’s determined by your values or your decision theory.
Your actions and decisions are not doubled. If you have multiple paths to arrive at the same behaviors, that doesn’t make them wrong or double-counted, it just makes it hard to tell which of them is causal (aka: your behavior is overdetermined).
Are you using “updatelessness” to refer to not having self in your utility function? If so, that’s a new one on me, and I’d prefer “altruism” as the term. I’m not sure that the decision-theory use of “updateless” (to avoid incorrect predictions where experience is correlated with the question at hand) makes sense here.
Oh, this also suggests a way in which the utility function abstraction is leaky, because the reasons for the payoffs in a game may matter. E.g. if one payoff is high because the corresponding agent is altruistic, then in some sense that agent is “already cooperating” in a way which is baked into the game, and so the rational thing for them to do might be different from the rational thing for another agent who gets the same payoffs, but for “selfish” reasons.
Maybe FDT already lumps this effect into the “how correlated are decisions” bucket? Idk.
In UDT2, when you’re in epistemic state Y and you need to make a decision based on some utility function U, you do the following:
1. Go back to some previous epistemic state X and an EDT policy (the combination of which I’ll call the non-updated agent).
2. Spend a small amount of time trying to find the policy P which maximizes U based on your current expectations X.
3. Run P(Y) to make the choice which maximizes U.
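A minimal sketch of the three steps above, on a toy counterfactual mugging. Everything concrete here is an assumption made for illustration: the fair-coin prior, the payoffs, the function names, and the brute-force enumeration that stands in for the "small amount of time" spent searching. It also simplifies away the EDT part of step 1 and just maximizes expected utility under X.

```python
from itertools import product

# Hypothetical toy setup: a counterfactual mugging with a fair coin.
OBSERVATIONS = ["heads", "tails"]        # possible later epistemic states Y
ACTIONS = ["pay", "refuse"]
PRIOR_X = {"heads": 0.5, "tails": 0.5}   # the earlier epistemic state X


def utility(policy, world):
    """U: money gained in `world` by an agent following `policy`."""
    pays_on_tails = policy["tails"] == "pay"
    if world == "heads":
        # Omega rewards the agent only if its policy would pay on tails.
        # (The action on heads is irrelevant in this toy problem.)
        return 10_000 if pays_on_tails else 0
    return -100 if pays_on_tails else 0


def expected_utility(policy, prior):
    """Expected U of a policy, judged from the non-updated state."""
    return sum(p * utility(policy, world) for world, p in prior.items())


def choose_policy(prior):
    """Step 2: the non-updated agent searches for the best policy P."""
    candidates = [
        dict(zip(OBSERVATIONS, acts))
        for acts in product(ACTIONS, repeat=len(OBSERVATIONS))
    ]
    return max(candidates, key=lambda pol: expected_utility(pol, prior))


def udt2_decide(current_state_y, prior=PRIOR_X):
    """Steps 1-3: go back to X, pick P, then run P(Y)."""
    policy = choose_policy(prior)
    return policy[current_state_y]
```

Calling `udt2_decide("tails")` has the agent pay, even though an agent that had updated on seeing tails would see only the loss. Note that `utility` is passed unchanged to the non-updated agent, which is exactly the assumption questioned in the next paragraph.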
The non-updated agent gets much less information than you currently have, and also gets much less time to think. But it does use the same utility function. That seems… suspicious. If you’re updating so far back that you don’t know who or where you are, how are you meant to know what you care about?
What happens if the non-updated agent doesn’t get given your utility function? On its face, that seems to break its ability to decide which policy P to commit to. But perhaps it could instead choose a policy P(Y,U) which takes as input not just an epistemic state, but also a utility function. Then in step 2, the non-updated agent needs to choose a policy P that maximizes, not the agent’s current utility function, but rather the utility functions it expects to have across a wide range of future situations.
Problem: this involves aggregating the utilities of different agents, and there’s no canonical way to do this. Hmm. So maybe instead of just generating a policy, the non-updated agent also needs to generate a value learning algorithm, that maps from an epistemic state Y to a utility function U, in a way which allows comparison across different Us.
Then the non-updated agent tries to find a pair (P, V) such that P(Y) maximizes V(Y) on the distribution of Ys predicted by X.
EDIT: no, this doesn’t work. Instead I think you need to go back, not just to a previous epistemic state X, but also to a previous set of preferences U’ (which include meta-level preferences about how your values evolve). Then you pick P and V in order to maximize U’.
Now, it does seem kinda wacky that the non-updated agent can maybe just tell you to change your utility function. But is that actually any weirder than it telling you to change your policy? And after all, you did in fact acquire your values from somewhere, according to some process.
Overall, I haven’t thought about this very much, and I don’t know if it’s already been discussed. But three quick final comments:
This brings UDT closer to an ethical theory, not just a decision theory.
In practice you’d expect P and V to be closely related. In fact, I’d expect them to be inseparable, based on arguments I make here.
Overall the main update I’ve made is not that this version of UDT is actually useful, but that I’m now suspicious of the whole framing of UDT as a process of going back to a non-updated agent and letting it make commitments.
People back then certainly didn’t think of changing preferences.
Also, you can get rid of this problem by saying “you just want to maximize the variable U”. And the things you actually care about (dogs, apples) are just “instrumentally” useful in giving you U. So for example, it is possible in the future you will learn dogs give you a lot of U, or alternatively that apples give you a lot of U.
Needless to say, this “instrumentalization” of moral deliberation is not how real agents work. And leads to getting Pascal’s mugged by the world in which you care a lot about easy things.
It’s more natural to model U as a logically uncertain variable, freely floating inside your logical inductor, shaped by its arbitrary aesthetic preferences. This doesn’t completely miss the importance of reward in shaping your values, but it’s certainly very different to how frugally computable agents do it.
I simply think the EV maximization framework breaks here. It is a useful abstraction when you already have a rigid enough notion of value, and are applying these EV calculations to a very concrete magisterium about which you can have well-defined estimates.
Otherwise you get mugged everywhere. And that’s not how real agents behave.
But you need some mechanism for actually updating your beliefs about U, because you can’t empirically observe U. That’s the role of V.
I think this is fine. Consider two worlds:
In world L, lollipops are easy to make, and paperclips are hard to make.
In world P, it’s the reverse.
Suppose you’re a paperclip-maximizer in world L. And a lollipop-maximizer comes up to you and says “hey, before I found out whether we were in L or P, I committed to giving all my resources to paperclip-maximizers if we were in P, as long as they gave me all their resources if we were in L. Pay up.”
UDT says to pay here—but that seems basically equivalent to getting “mugged” by worlds where you care about easy things.
Yep, but you can just treat it as another observation channel into UDT. You could, if you want, treat it as a computed number you observe in the corner of your eye, and then just apply UDT maximizing U, and you don’t need to change UDT in any way.
(Let’s not forget this depends on your prior, and we don’t have any privileged way to assign priors to these things. But that’s a tangential point.)
I do agree that there’s not any sharp distinction between situations where it “seems good” and situations where it “seems bad” to get mugged. After all, if all you care about is maximizing EV, then you should take all muggings. It’s just that, when we do that, something feels off (to us humans, maybe due to risk-aversion), and we go “hmm, probably this framework is not modelling everything we want, or missing some important robustness considerations, or whatever, because I don’t really feel like spending all my resources and creating a lot of disvalue just because in the world where 1 + 1 = 3 someone is offering me a good deal”. You start to see how your abstractions might break, and how you can’t get any satisfying notion of “complete updatelessness” (that doesn’t go against important intuitions). And you start to rethink whether this is what we normatively want, or what we realistically see in agents.
Hmm, I’m confused by this. Why should we treat it this way? There’s no actual observation channel, and in order to derive information about utilities from our experiences, we need to specify some value learning algorithm. That’s the role V is playing.
Obviously I am not arguing that you should agree to all moral muggings. If a pain-maximizer came up to you and said “hey, looks like we’re in a world where pain is way easier to create than pleasure, give me all your resources”, it would be nuts to agree, just like it would be nuts to get mugged by “1+1=3”. I’m just saying that “sometimes you get mugged” is not a good argument against my position, and definitely doesn’t imply “you get mugged everywhere”.
Yes, absolutely! I just meant that, once you give me whatever V you choose to derive U from observations, I will just be able to apply UDT on top of that. So under this framework there doesn’t seem to be anything new going on, because you are just choosing an algorithm V at the start of time, and then treating its outputs as observations. That’s, again, why this only feels like a good model of “completely crystallized rigid values”, and not of “organically building them up slowly, while my concepts and planner module also evolve, etc.”.[1]
Wait, but how does your proposal differ from EV maximization (with moral uncertainty as part of the EV maximization itself, as I explain above)?
Because anything that is doing pure EV maximization “gets mugged everywhere”. Meaning if you actually have the beliefs (for example, that the world where suffering is hard to produce could exist), you just take those bets.
Of course if you don’t have such “extreme” beliefs it doesn’t, but then we’re not talking about decision-making, and instead belief-formation. You could say “I will just do EV maximization, but never have extreme beliefs that lead to suspicious-looking behavior”, but that’d be hiding the problem under belief-formation, and doesn’t seem to be the kind of efficient mechanism that agents really implement to avoid these failure modes.
To be clear, V can be a very general algorithm (like “run a copy of me thinking about ethics”), so that this doesn’t “feel like” having rigid values. Then I just think you’re carving reality at the wrong spot. You’re ignoring the actual dynamics of messy value formation, hiding them under V.
In times of UDT2, the background assumption was that agents should maintain an unchanging preference, which is separate from knowledge. One motivation for UDT is that updating makes an agent stop caring about updated-away possibilities, while UDT is not doing that. Going back to a previous epistemic state is a way of preserving preference from that epistemic state, the “current” utility function is considered a bug and doesn’t do anything if UDT is adopted. The non-updated agent can in principle consider the information you currently have as one of the possibilities when formulating the general policy for all possibilities, though being bounded it won’t do a very good job.
Traditionally UDT1.1 wants to make its decisions from very little knowledge and to apply the policy to all always. A more pragmatic thing is to make decisions from modestly less knowledge and to scope the policy for the middle-term future. Some form of this is useful for many thought experiments where the environment or other players also have the little knowledge our agent uses to make its decisions from the past, and so could know the policy the agent decides on before they need to prepare for it or make predictions about it.
The problem is commitment races (as in the game of chicken), where everyone wants to decide earlier and force the others to respond. But there is a need to remain bounded in making decisions, both to personally compute them and to make it possible for others to anticipate them and to coordinate. This creates a more reasonable equilibrium, motivating decisions from a less ignorant epistemic state that have a better chance of being relevant to the current situation, in balance with trying to decide from a more ignorant epistemic state where a general policy would enable more strategicness across possibilities. UDT1.1 can’t find such balance, but it’s possible that something UDT2-shaped might.
I think there’s an ambiguity here. UDT makes the agent stop considering updated-away possibilities, but I haven’t seen any discussion of UDT which suggests that it stops caring about them in principle (except for a brief suggestion from Paul that one option for UDT is to “go back to a position where I’m mostly ignorant about the content of my values”). Rather, when I’ve seen UDT discussed, it focuses on updating or un-updating your epistemic state.
I don’t think the shift I’m proposing is particularly important, but I do think the idea that “you have your prior and your utility function from the very beginning” is a kinda misleading frame to be in, so I’m trying to nudge a little away from that.
UDT specifically enables agents to consider the updated-away possibilities in a way relevant to decision making, while an updated agent (that’s not using something UDT-like) wouldn’t be able to do that in any circumstance, and so would be functionally indistinguishable from an agent that has different preferences or undefined preferences for those possibilities. Not caring about them seems like an apt informal description (even as this is compatible with keeping the same utility function outside the event of current knowledge). In a similar way, we could say that after updating, an agent either changes their probability distribution or keeps the original prior.
Historically it was overwhelmingly the frame until recently, so it’s the correct frame for interpreting the intended meaning of texts from that time. This is a simplifying assumption that still leaves many open questions about how to make decisions in sufficiently strange situations (where merely models of behavior make these strange situations ubiquitous in practice). When an agent doesn’t know its own preference and needs to do something about that, it’s an additional complication that usually wasn’t introduced.
Agreed; apologies for the sloppy phrasing.
I agree, that’s why I’m trying to outline an alternative frame for thinking about it.
Some more thoughts: we can portray the process of choosing a successor policy as the iterative process of making more and more commitments over time. But what does it actually look like to make a commitment? Well, consider an agent that is made of multiple subagents, that each get to vote on its decisions. You can think of a commitment as basically saying “this subagent still gets to vote, but no longer gets updated”—i.e. it’s a kind of stop-gradient.
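A minimal sketch of the "still votes, no longer gets updated" picture. The `Subagent` class, the numeric preferences, and the simple sum-of-votes rule are all hypothetical choices made for illustration; nothing here is a claim about how real agents or brains implement this.

```python
from dataclasses import dataclass


@dataclass
class Subagent:
    name: str
    preference: float      # current vote on some proposal, in [-1, 1]
    frozen: bool = False   # True once this subagent has become a commitment

    def update(self, evidence: float, lr: float = 0.5) -> None:
        """Move the preference toward new evidence, unless frozen."""
        if self.frozen:
            return  # the stop-gradient: it still votes, but no longer learns
        self.preference += lr * (evidence - self.preference)


def decide(subagents) -> bool:
    """Approve the proposal if the summed votes are positive."""
    return sum(s.preference for s in subagents) > 0
```

For example, with a frozen "retaliate" subagent at 1.0 and an unfrozen one at 0.2, feeding both the evidence -1.0 moves only the unfrozen one (to about -0.4), so the committed vote still carries the decision.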
Two interesting implications of this perspective:
The “cost” of a commitment can be measured both in terms of “how often does the subagent vote in stupid ways?”, and also “how much space does it require to continue storing this subagent?” But since we’re assuming that agents get much smarter over time, probably the latter is pretty small.
There’s a striking similarity to the problem of trapped priors in human psychology. Parts of our brains basically are subagents that still get to vote but no longer get updated. And I don’t think this is just a bug—it’s also a feature. This is true on the level of biological evolution (you need to have a strong fear of death in order to actually survive) and also on the level of cultural evolution (if you can indoctrinate kids in a way that sticks, then your culture is much more likely to persist).
The (somewhat provocative) way of phrasing this is that trauma is evolution’s approach to implementing UDT. Someone who’s been traumatized into conformity by society when they were young will then (in theory) continue obeying society’s dictates even when they later have more options. Someone who gets very angry if mistreated in a certain way is much harder to mistreat in that way. And of course trauma is deeply suboptimal in a bunch of ways, but so too are UDT commitments, because they were made too early to figure out better alternatives.
This is clearly only a small component of the story but the analogy is definitely a very interesting one.
More thoughts: what’s the difference between paying in a counterfactual mugging based on:
Whether the millionth digit of pi (5) is odd or even
Whether or not there are an infinite number of primes?
In the latter case knowing the truth is (near-)inextricably entangled with a bunch of other capabilities, like the ability to do advanced mathematics. Whereas in the former it isn’t. Suppose that before you knew either fact you were told that one of them was entangled in this way. Would you still want to commit to paying out in a mugging based on it?
Well… maybe? But it means that the counterlogical of “if there hadn’t been an infinite number of primes” is not very well-defined—it’s hard to modify your brain to add that belief without making a bunch of other modifications. So now Omega doesn’t just have to be (near-)omniscient, it also needs to have a clear definition of the counterlogical that’s “fair” according to your standards; without knowing that it has that, paying up becomes less tempting.
Individually logical counterfactuals don’t seem very coherent. This is related to the “I’m an algorithm” vs. “I’m a physical object” distinction of FDT. When you are an algorithm considering a decision, you want to mark all sites of intervention/influence in the world where the world depends on your behavior. If you only mark some of them, then you later fail at the step where you ask what happens if you act differently, you obtain a broken counterfactual world where only some instances of the fact of your behavior have been replaced and not others.
So I think it makes a bit more sense to ask where specifically your brain depends on a fact, to construct an exhaustive dependence of your brain on the fact, before turning to particular counterfactual content for that fact to be replaced with. That is, dependence of a system on a fact, the way it varies with the fact, seems potentially clearer than individual counterfactuals of how that system works if the fact is set to be a certain way. (To make a somewhat hopeless analogy, fibration instead of individual fibers, and it shouldn’t be a problem that all fibers are different from each other. Any question about a counterfactual should be reformulated into a question about a dependence.)
Random question I’ve been thinking about: how would you set up a market for votes? Suppose specifically that you have a proportional chances election (i.e. the outcome gets chosen with probability proportional to the number of votes cast for it—assume each vote is a distribution over candidates). So everyone has an incentive to get everyone who’s not already voting for their favorite option to change their vote; and you can have positive-sum trades where I sell you a promise to switch X% of my votes to a compromise candidate in exchange for you switching Y% of your votes to a compromise candidate.
What makes this complicated is that I don’t just care that I get votes for my favorite candidate, I also care about where those votes come from—i.e. would they otherwise have been cast for my second-favorite candidate, or for my least-favorite?
EDIT: oh, I think this is equivalent to impact certificates actually.
Some attempts:
Each person starts off with a vote and can sell shares in it to whoever they like for whatever price they like. When the vote is called, you get a number of votes proportional to your shares. This might help me trade votes in a current election for votes in a future election, but it doesn’t seem to really address the core problem that the votes need to be from a specific candidate, that’s what you’re buying, otherwise there’s no benefit to trade in the one-off case.
EDITED: Let a certificate x:A->B be a promise to switch x fraction of your vote from A to B. (I’ll mostly skip the x for brevity.) Suppose your actual favorite candidate is C; you can generate up to one certificate C->Z, for any Z. Then when the vote is called, everyone votes for their favorite candidate, then the votes are modified by all the certificates in circulation.
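A rough sketch of how the certificate scheme above could be tallied under proportional chances. The function names and the data layout are made up for illustration, and it assumes one reading of "up to one certificate": each voter can promise away at most one whole vote in total, always from their declared favorite.

```python
from collections import defaultdict


def tally(favorites, certificates):
    """Total vote weight per candidate after certificates are applied.

    favorites: dict mapping voter -> declared favorite candidate.
    certificates: list of (voter, fraction, target) promises to move
        `fraction` of that voter's vote from their favorite to `target`.
    """
    totals = defaultdict(float)
    moved = defaultdict(float)  # fraction of each vote already promised away
    for voter, fraction, target in certificates:
        if voter not in favorites:
            raise KeyError(f"unknown voter: {voter}")
        if fraction < 0 or moved[voter] + fraction > 1.0:
            raise ValueError(f"{voter} cannot promise away more than one vote")
        moved[voter] += fraction
        totals[target] += fraction
    # Whatever was not promised away stays with the declared favorite.
    for voter, favorite in favorites.items():
        totals[favorite] += 1.0 - moved[voter]
    return dict(totals)


def outcome_probabilities(totals):
    """Proportional chances: win probability is the share of total weight."""
    total = sum(totals.values())
    return {candidate: weight / total for candidate, weight in totals.items()}
```

If one voter favoring X and one favoring Y each issue a 0.5 certificate to a compromise candidate D, then D wins with probability 0.5 and X and Y with 0.25 each. The sketch ignores pricing entirely; it only covers the step where votes are modified by the certificates in circulation.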
Ideally the thing you’d want is to incentivize the following: “oh, nobody has realized yet that D is a really good compromise candidate between X and Y! I can profit off this fact…” This can be modelled as there not yet being much demand for X->D or Y->D, so you can buy a bunch of it cheaply and wait for the price to appreciate (because many others will also want to buy it once they realize); or maybe you can short X->Y and Y->X; or sell versions of X->Y which are actually X->D->Y. You probably also need the ability to borrow from the bank for liquidity—and maybe the bank should accept A->B->C as a return on A->C?
This setup strongly incentivizes strategic declaration of who your favorite candidate is. But maybe that’s just unavoidable… the whole point of a market is that you’ve got something to trade, and in some sense that has to be your default vote.
How do markets in general avoid this? By setting regulations saying you can’t make someone’s life worse then tell them to pay to stop. Without political intervention in general markets are just threat-machines. I.e. you really need the distinguished “zero” point in order to make them work.
I think the thing I’m looking for might be analogous to the shift from the COCO equilibrium to the ROSE equilibrium.
The CoCo value could be viewed as starting at a disagreement point, and going “let’s move to the Pareto frontier and evenly split the gain”. Which seems like what’s happening with trades in this market, you benefit from how bad your disagreement point is.
Diffractor describes a ROSE point as anywhere where, if you drew two lines at the highest utilities that players 1 and 2 could get without sending the other below the disagreement point utility, there’s a tangent line at the ROSE point that has equal distance to the two max-utility boundaries.
I don’t really see how to generalize this to constructing a market, nor do I know if that question makes sense.
Just spitballing here: Assign each voter 100 shares for each candidate. To vote, each voter selects a subset of their shares to constitute their vote. Voters can freely trade shares.
Under this system, a voter would more highly value shares for candidates that are either very high or very low in their preference order (the latter so as to exclude them from the vote). Thus, trades would look like each party exchanging shares about which they are themselves ambivalent to gain shares that are more valuable to them.
If you remove the proportional chances part, then it becomes a guessing game of which marginal votes actually matter.
Interesting! Hadn’t thought of this approach. Let’s see… Intuitively I think it gets pretty strategically weird because a) who you vote for depends pretty sensitively on other people’s votes (e.g. in proportional chances voting you want to vote for everyone who’s above the expected value of everyone else’s votes; in approval voting you want to vote for everyone you approve of unless it bumps them above someone you like more), and b) you want to buy from your enemies much more than from your friends, because your friends will already not be voting for bad candidates. But maybe the latter is fine because if you buy from your friends they’ll end up with more money which they can then spend on other things? I’ll keep thinking.
I expect it to be difficult to generate adversarial inputs which will fool a deceptively aligned AI. One proposed strategy for doing so is relaxed adversarial training, where the adversary can modify internal weights. But this seems like it will require a lot of progress on interpretability. An alternative strategy, which I haven’t yet seen any discussion of, is to allow the adversary to do a data poisoning attack before generating adversarial inputs—i.e. the adversary gets to specify inputs and losses for a given number of SGD steps, and then the adversarial input which the base model will be evaluated on afterwards. (Edit: probably a better name for this is adversarial meta-learning.)
Another thought on dath ilan: notice how much of the work of Keltham’s reasoning is based on him pattern-matching to tropes from dath ilani literature, and then trying to evaluate their respective probabilities. In other words: like bayesianism, he’s mostly glossing over the “hypothesis generation” step of reasoning.
I wonder if dath ilan puts a lot of effort into spreading a wide range of tropes because they don’t know how to teach systematically good hypothesis generation.
I think you are overgeneralizing. We also see some mix of Dath Ilan, stories about Dath Ilan, stories about stories about Dath Ilan, and interactions between these, so all bets are off really.
I suspect that AIXI is misleading to think about in large part because it lacks reusable parameters—instead it just memorises all inputs it’s seen so far. Which means the setup doesn’t have episodes, or a training/deployment distinction; nor is any behaviour actually “reinforced”.
I kind of think the lack of episodes makes it more realistic for many problems, but admittedly not for simulated games. Also, presumably many of the component Turing machines have reusable parameters and reinforce behaviour, altho this is hidden by the formalism. [EDIT: I retract the second sentence]
Actually I think this is total nonsense produced by me forgetting the difference between AIXI and Solomonoff induction.
Wait, really? I thought it made sense (although I’d contend that most people don’t think about AIXI in terms of those TMs reinforcing hypotheses, which is the point I’m making). What’s incorrect about it?
Well now I’m less sure that it’s incorrect. I was originally imagining that like in Solomonoff induction, the TMs basically directly controlled AIXI’s actions, but that’s not right: there’s an expectimax. And if the TMs reinforce actions by shaping the rewards, in the AIXI formalism you learn that immediately and throw out those TMs.
Oh, actually, you’re right (that you were wrong). I think I made the same mistake in my previous comment. Good catch.
Humans don’t have a training / deployment distinction either… Do humans have “reusable parameters”? Not quite sure what you mean by that.
Yes we do: training is our evolutionary history, deployment is an individual lifetime. And our genomes are our reusable parameters.
Unfortunately I haven’t yet written any papers/posts really laying out this analogy, but it’s pretty central to the way I think about AI, and I’m working on a bunch of related stuff as part of my PhD, so hopefully I’ll have a more complete explanation soon.
Oh, OK, I see what you mean. Possibly related: my comment here.
I’ve recently discovered waitwho.is, which collects all the online writing and talks of various tech-related public intellectuals. It seems like an important and previously-missing piece of infrastructure for intellectual progress online.
Yudkowsky mainly wrote about recursive self-improvement from a perspective in which algorithms were the most important factors in AI progress—e.g. the brain in a box in a basement which redesigns its way to superintelligence.
Sometimes when explaining the argument, though, he switched to a perspective in which compute was the main consideration—e.g. when he talked about getting “a hyperexponential explosion out of Moore’s Law once the researchers are running on computers”.
What does recursive self-improvement look like when you think that data might be the limiting factor? It seems to me that it looks a lot like iterated amplification: using less intelligent AIs to provide a training signal for more intelligent AIs.
I don’t consider this a good reason to worry about IA, though: in a world where data is the main limiting factor, recursive approaches to generating it still seem much safer than alternatives.
Perhaps a data-limited intelligence explosion is analogous to what we humans do all the time when we teach ourselves something. Out of the vast sea of information on the internet, we go get some data, and study it, and then use that to make a better opinion about what data we need next, and then repeat until we are at the forefront of the world’s knowledge. We start from scratch, with a vague understanding like “I should learn more economics, I don’t even know what supply and demand are” and then we end up publishing a paper on auction theory or something idk. This is a recursive self-improvement loop in data quality, so to speak, rather than data quantity.
What counts as self-improvement in the scenario governed by data?
You can grab the whole internet, including scihub and library genesis, and then maybe hack all “smart” appliances worldwide… and after that I guess you need to construct some machines that will perform experiments for you.
But none of this improves the machine’s “self”. With algorithms, the idea is that the machine would replace its own algorithms by better ones, once it gets the ability to invent and evaluate algorithms. With hardware, the idea is that the machine would replace its own hardware by faster ones, once it gets the ability to design and produce hardware. But replacing your data with better data, that… we usually don’t call self-improvement.
Also, what kind of data are we talking about? Data about the real world has to come from the outside, by definition. (Unless it is data about physics that you can obtain by observing the physical properties of your own circuits, or something like that.) But there is also data in the sense of precomputed cached results, like playing zillions of games of chess against yourself, and remembering which strategies were most successful. If this was the limiting factor… I guess it would be something like a bounded AIXI which hypothetically already has enough hardware to simulate a universe; it only needs to make zillions of computations to find the one that is consistent with the observed data.
In the scenario governed by data, the part that counts as self-improvement is where the AI puts itself through a process of optimisation by stochastic gradient descent with respect to that data.
You don’t need that much hardware for data to be a bottleneck. For example, I think that there are plenty of economically valuable tasks that are easier to learn than StarCraft. But we get StarCraft AIs instead because games are the only task where we can generate arbitrarily large amounts of data.
RL usually applies some discount rate, and also caps episodes at a certain length, so that an action taken at a given time isn’t reinforced very much (or at all) for having much longer-term consequences.
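To make the two effects concrete, here is a small sketch of the standard discounted return with an episode cap. The function name and the example numbers are illustrative assumptions, not taken from any particular RL library.

```python
def discounted_return(rewards, gamma, max_episode_len):
    """Return credited to an action taken at step 0 of an episode.

    rewards[k] is the reward arriving k steps after the action.
    Rewards at or beyond `max_episode_len` are never observed.
    """
    visible = rewards[:max_episode_len]  # the episode cap truncates the future
    return sum((gamma ** k) * r for k, r in enumerate(visible))


# A reward of 1.0 that arrives 100 steps after the action.
late_reward = [0.0] * 100 + [1.0]

# Discounting alone shrinks its weight to gamma ** 100.
print(discounted_return(late_reward, gamma=0.99, max_episode_len=200))

# With a 50-step cap the action gets no credit at all.
print(discounted_return(late_reward, gamma=0.99, max_episode_len=50))
```

So a consequence that lands N steps later is weighted by `gamma ** N` at best, and by zero once N exceeds the cap, which is the contrast with the evolutionary case discussed next.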
How does this compare to evolution? At equilibrium, I think that a gene which increases the fitness of its bearers in N generations’ time is just as strongly favored as a gene that increases the fitness of its bearers by the same amount straightaway. As long as it was already widespread at least N generations ago, they’re basically the same thing, because current gene-holders benefit from the effects of the gene-holders from N generations ago.
That gene would evolve much more slowly, though. Plus in practice it’s hard to ensure that the benefits accrue only to gene-holders, and there’s so much variance in the environment that for N of more than 3 or 4 this seems pretty implausible. Still, the disanalogy seems kinda interesting.
A general principle: if we constrain two neural networks to communicate via natural language, we need some pressure towards ensuring they actually use language in the same sense as humans do, rather than (e.g.) steganographically encoding the information they really care about.
The most robust way to do this: pass the language via a human, who tries to actually understand the language, then does their best to rephrase it according to their own understanding.
What do you lose by doing this? Mainly: you can no longer send messages too complex for humans to understand. Also: in general you can’t backprop through discrete language anyway, but I’d guess there are some tricks for approximating that which don’t work as well when a human is in the loop.
That doesn’t actually solve the problem. The system could just encode the desired information in the semantics of some unrelated sentences—e.g. talk about pasta to indicate X = 0, or talk about rain to indicate X = 1.
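A toy illustration of this objection. The `human_paraphrase` function is a hypothetical stand-in for the human in the loop (it rewords the sentence but keeps its topic), and the pasta/rain mapping is the one from the example above.

```python
# Hypothetical covert code: the topic of the sentence carries the bit.
TOPIC_FOR_BIT = {0: "pasta", 1: "rain"}


def encode(bit: int) -> str:
    """Sender hides X in the choice of topic, not in the wording."""
    return f"I was thinking about {TOPIC_FOR_BIT[bit]} today."


def human_paraphrase(sentence: str) -> str:
    """Stand-in for a human who rephrases faithfully, preserving meaning."""
    topic = sentence.rstrip(".").split()[4]  # the word after "about"
    return f"Today, {topic} was on my mind."


def decode(sentence: str) -> int:
    """Receiver recovers X from the topic that survived the paraphrase."""
    for bit, topic in TOPIC_FOR_BIT.items():
        if topic in sentence:
            return bit
    raise ValueError("no known topic found")
```

Here `decode(human_paraphrase(encode(bit)))` recovers the bit for both values: the paraphrase changes every surface feature of the message, but the hidden channel lives in the semantics that a faithful paraphrase is supposed to keep.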
I expected you to bring up the Natural Abstraction Hypothesis here. Wouldn’t the communication between the parties naturally use the same concepts?
Same concepts yes, but that does not necessarily imply that they’re encoded in the same way as humans typically use language.
Another possible way to provide pressure towards using language in a human-sense way is some form of multi-tasking/multi-agent scenario, inspired by this paper: Multitasking Inhibits Semantic Drift. They show that if you pretrain multiple instructors and instruction executors to understand language in a human-like way (e.g. with supervised labels), and then during training mix the instructors and instruction executors, it makes it difficult to drift from the original semantics, as all the instructors and instruction executors would need to drift in the same direction; equivalently, any local change in semantics would be sub-optimal compared to using language in the semantically correct way. The examples in the paper are on quite toy problems, but I think in principle this could work.
Not being able to send messages too complex for humans to understand seems to me like it’s plausibly a benefit for many of the cases where you’d want to do this.
steganographically?
Ooops, yes, ty.
Greg Egan on universality:
Equivocation. “Who’s ‘we’, flesh man?” Even granting the necessary millions or billions of years for a human to sit down and emulate a superintelligence step by step, it is still not the human who understands, but the Chinese room.
I’ve seen this quote before and always find it funny because when I read Greg Egan, I constantly find myself thinking there’s no way I could’ve come up with the ideas he has even if you gave me months or years of thinking time.
Yes, there’s something to that, but you have to be careful if you want to use that as an objection. Maybe you wouldn’t easily think of it, but that doesn’t exclude the possibility of you doing it: you can come up with algorithms you can execute which would spit out Egan-like ideas, like ‘emulate Egan’s brain neuron by neuron’. (If nothing else, there’s always the ol’ dovetail-every-possible-Turing-machine hammer.) Most of these run into computational complexity problems, but that’s the escape hatch Egan (and Scott Aaronson has made a similar argument) leaves himself by caveats like ‘given enough patience, and a very large notebook’. Said patience might require billions of years, and the notebook might be the size of the Milky Way galaxy, but those are all finite numbers, so technically Egan is correct as far as that goes.
Yeah good point—given generous enough interpretation of the notebook my rejection doesn’t hold. It’s still hard for me to imagine that response feeling meaningful in the context but maybe I’m just failing to model others well here.
It’s frustrating how bad dath ilanis (as portrayed by Eliezer) are at understanding other civilisations. They seem to have all dramatically overfit to dath ilan.
To be clear, it’s the type of error which is perfectly sensible for an individual to make, but strange for their whole civilisation to be making (by teaching individuals false beliefs about how tightly constraining their coordination principles are).
The in-universe explanation seems to be that they’ve lost this knowledge as a result of screening off the past. But that seems like a really predictable failure mode which gives them false beliefs about very important topics, so I have trouble imagining it being consistent with the rest of Eliezer’s characterisation of dath ilan.
(FWIW I’ll also note that this is the same type of mistake that I think Eliezer is making when reasoning about AI.)
Tho, to be fair, losing points in universes you don’t expect to happen in order to win points in universes you expect to happen seems like good decision theory.
[I do have a standing wonder about how much of dath ilan is supposed to be ‘the obvious equilibrium’ vs. ‘aesthetic preferences’; I would be pretty surprised if Eliezer thought there was only one fixed point of the relevant coordination functions, and so some of it must be ‘aesthetics’.]
I don’t think dath ilan would try to win points in likely universes by teaching children untrue things, which I claim is what they’re doing.
Also, it’s not clear to me that this would even win them points, because when thinking about designing civilisation (or AGIs) you need to have accurate beliefs about this type of thing. (E.g. imagine dath ilani alignment researchers being like “here are all our principles for understanding intelligence” and then continually being surprised, like Keltham is, about how messy and fractally unprincipled some plausible outcomes are.)
Half-formed musing: what’s the relationship between being a nerd and trusting high-level abstractions? In some sense they seem to be the opposite of each other—nerds focus obsessively on a domain until they understand it deeply, not just at high levels of abstraction. But if I were to give a very brief summary of the rationalist community, it might be: nerds who take very high-level abstractions (such as moloch, optimisation power, the future of humanity) very seriously.
It seems to me that the resolution to the apparent paradox is that nerds are interested in all the details of their domain, but the outcomes that they tend to look for are high-level abstractions. Even in settings like fandoms, there is a big push towards massive theories that entail every little detail about the story.
Though defining rationalist community as a sort of community of meta-nerds who apply this nerd approach to almost anything doesn’t seem too off the mark.
I think you need to unpack “trust” and “take seriously” a little bit to make this assertion. I think nerds are generally (heh) more able to understand the lossiness of models, and to recognize that abstractions are more broadly applicable, but less powerful than specifics.
I wouldn’t say I trust or take seriously the idea of Moloch or the similarities between different optimization mechanisms. I do recognize that those models have a lot of explanatory and predictive power, especially as a head-start (aka “prior”) on domains where I haven’t done the work to understand the exceptions and specifics.
There’s some possible world in which the following approach to interpretability works:
Put an AGI in a bunch of situations where it sometimes is incentivised to lie and sometimes is incentivised to tell the truth.
Train a lie detector which is given all its neural weights as input.
Then ask the AGI lots of questions about its plans.
One problem that this approach would face if we were using it to interpret a human is that the human might not consciously be aware of what their motivations are. For example, they may believe they are doing something for altruistic reasons, when in fact their unconscious motivations are primarily to look good. And the motivations which we are less conscious of are exactly those ones which it’s most disadvantageous for others to know about.
So would using such an interpretability technique on an AGI work? I guess one important question is something like: by default, would the AGI be systematically biased when talking about its plans, like humans are? Or is this something which only arises when there are selection pressures during training for hiding information?
One way we could avoid this problem: instead of a “lie detector”, you could train a “plan identifier”, which takes an AGI brain and tells you what that AGI is going to do in English. I’m a little less optimistic about this, since I think that gathering training data will be the big bottleneck either way, and getting enough data to train a plan identifier that’s smart enough to generalise to a wide range of plans seems pretty tricky. (By contrast, the lie detector might not need to know very much about the *content* of the lies).
I’ve heard people argue that “most” utility functions lead to agents with strong convergent instrumental goals. This obviously depends a lot on how you quantify over utility functions. Here’s one intuition in the other direction. I don’t expect this to be persuasive to most people who make the argument above (but I’d still be interested in hearing why not).
If a non-negligible percentage of an agent’s actions are random, then to describe it as a utility-maximiser would require an incredibly complex utility function (because any simple hypothesised utility function will eventually be falsified by a random action). And so this generates arbitrarily simple agents whose observed behaviour can only be described as maximising a utility function for arbitrarily complex utility functions (depending on how long you run them).
I expect people to respond something like: we need a theory of how to describe agents with bounded cognition anyway. And if you have such a theory, then we could describe the agent above as “maximising simple function U, subject to the boundedness constraint that X% of its actions are random”.
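A toy sketch of the falsification argument above, under assumptions of my own choosing: the agent has three actions, its non-random behaviour is “always take action 0”, and the family of “simple” hypotheses is “the agent maximises a utility that strictly prefers action k”. Each such hypothesis predicts action k at every step, so a single deviating action falsifies it.

```python
import random


def run_agent(epsilon, steps, num_actions=3, seed=0):
    """Take action 0, except that with probability epsilon the action is uniformly random."""
    rng = random.Random(seed)
    actions = []
    for _ in range(steps):
        if rng.random() < epsilon:
            actions.append(rng.randrange(num_actions))
        else:
            actions.append(0)
    return actions


def surviving_hypotheses(actions, num_actions=3):
    """Return the set of k for which 'always prefers action k' is consistent with every observed action."""
    return {k for k in range(num_actions) if all(a == k for a in actions)}


print(surviving_hypotheses(run_agent(epsilon=0.0, steps=1000)))
print(surviving_hypotheses(run_agent(epsilon=0.1, steps=1000)))
```

With no randomness the hypothesis for action 0 survives indefinitely; with 10% random actions every hypothesis in this small family is ruled out after a modest number of steps. The anticipated response corresponds to adding epsilon as an explicit boundedness parameter rather than enlarging the utility function.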
I’m not sure if you consider me to be making that argument, but here are my thoughts: I claim that most reward functions lead to agents with strong convergent instrumental goals. However, I share your intuition that (somehow) uniformly sampling utility functions over universe-histories might not lead to instrumental convergence.
To understand instrumental convergence and power-seeking, consider how many reward functions we might specify automatically imply a causal mechanism for increasing reward. The structure of the reward function implies that more is better, and that there are mechanisms for repeatedly earning points (for example, by showing itself a high-scoring input).
Since the reward function is “simple” (there’s usually not a way to grade exact universe histories), these mechanisms work in many different situations and points in time. It’s naturally incentivized to assure its own safety in order to best leverage these mechanisms for gaining reward. Therefore, we shouldn’t be surprised to see a lot of these simple goals leading to the same kind of power-seeking behavior.
What structure is implied by a reward function?
Additive/Markovian: while a utility function might be over an entire universe-history, reward is often additive over time steps. This is a strong constraint which I don’t always expect to be true, but I think that among the goals with this structure, a greater proportion of them have power-seeking incentives.
Observation-based: while a utility function might be over an entire universe-history, the atom of the reward function is the observation. Perhaps the observation is an input to update a world model, over which we have tried to define a reward function. I think that most ways of doing this lead to power-seeking incentives.
Agent-centric: reward functions are defined with respect to what the agent can observe. Therefore, in partially observable environments, there is naturally a greater emphasis on the agent’s vantage point in the environment.
My theorems apply to the finite, fully observable, Markovian situation.[1] We might not end up using reward functions for more impressive tasks – we might express preferences over incomplete trajectories, for example. The “specify a reward function over the agent’s world model” approach may or may not lead to good subhuman performance in complicated tasks like cleaning warehouses. Imagine specifying a reward function over pure observations for that task – the agent would probably just get stuck looking at a wall in a particularly high-scoring way.
However, for arbitrary utility functions over universe histories, the structure isn’t so simple. With utility functions over universe histories having far more degrees of freedom, arbitrary policies can be rationalized as VNM expected utility maximization. That said, with respect to a simplicity prior over computable utility functions, the power-seeking ones might have most of the measure.
A more appropriate claim might be: goal-directed behavior tends to lead to power-seeking, and that’s why goal-directed behavior tends to be bad.
However, it’s well-known that you can convert finite non-Markovian MDPs into finite Markovian MDPs.
I’ve just put up a post which serves as a broader response to the ideas underpinning this type of argument.
I think this depends a lot on how you model the agent developing. If you start off with a highly intelligent agent which has the ability to make long-term plans, but doesn’t yet have any goals, and then you train it on a random reward function—then yes, it probably will develop strong convergent instrumental goals.
On the other hand, if you start off with a randomly initialised neural network, and then train it on a random reward function, then probably it will get stuck in a local optimum pretty quickly, and never learn to even conceptualise these things called “goals”.
I claim that when people think about reward functions, they think too much about the former case, and not enough about the latter. Because while it’s true that we’re eventually going to get highly intelligent agents which can make long-term plans, it’s also important that we get to control what reward functions they’re trained on up to that point. And so plausibly we can develop intelligent agents that, in some respects, are still stuck in “local optima” in the way they think about convergent instrumental goals—i.e. they’re missing whatever cognitive functionality is required for being ambitious on a large scale.
Agreed – I should have clarified. I’ve been mostly discussing instrumental convergence with respect to optimal policies. The path through policy space is also important.
Makes sense. For what it’s worth, I’d also argue that thinking about optimal policies at all is misguided (e.g. what’s the optimal policy for humans—the literal best arrangement of neurons we could possibly have for our reproductive fitness? Probably we’d be born knowing arbitrarily large amounts of information. But this is just not relevant to predicting or modifying our actual behaviour at all).
(I now think that you were very right in saying “thinking about optimal policies at all is misguided”, and I was very wrong to disagree. I’ve thought several times about this exchange. Not listening to you about this point was a serious error and made my work way less impactful. I do think that the power-seeking theorems say interesting things, but about eg internal utility functions over an internal planning ontology—not about optimal policies for a reward function.)
I disagree.
We do in fact often train agents using algorithms which are proven to eventually converge to the optimal policy.[1] Even if we don’t expect the trained agents to reach the optimal policy in the real world, we should still understand what behavior is like at optimum. If you think your proposal is not aligned at optimum but is aligned for realistic training paths, you should have a strong story for why.
Formal theorizing about instrumental convergence with respect to optimal behavior is strictly easier than theorizing about ϵ-optimal behavior, which I think is what you want for a more realistic treatment of instrumental convergence for real agents. Even if you want to think about sub-optimal policies, if you don’t understand optimal policies… good luck! Therefore, we also have an instrumental (...) interest in studying the behavior at optimum.
At least, the tabular algorithms are proven, but no one uses those for real stuff. I’m not sure what the results are for function approximators, but I think you get my point.
1. I think it’s more accurate to say that, because approximately none of the non-trivial theoretical results hold for function approximation, approximately none of our non-trivial agents are proven to eventually converge to the optimal policy. (Also, given the choice between an algorithm without convergence proofs that works in practice, and an algorithm with convergence proofs that doesn’t work in practice, everyone will use the former). But we shouldn’t pay any attention to optimal policies anyway, because the optimal policy in an environment anything like the real world is absurdly, impossibly complex, and requires infinite compute.
2. I think theorizing about ϵ-optimal behavior is more useful than theorizing about optimal behaviour by roughly ϵ, for roughly the same reasons. But in general, clearly I can understand things about suboptimal policies without understanding optimal policies. I know almost nothing about the optimal policy in StarCraft, but I can still make useful claims about AlphaStar (for example: it’s not going to take over the world).
Again, let’s try to cash this out. I give you a human—or, say, the emulation of a human, running in a simulation of the ancestral environment. Is this safe? How do you make it safer? What happens if you keep selecting for intelligence? I think that the theorising you talk about will be actively harmful for your ability to answer these questions.
I’m confused, because I don’t disagree with any specific point you make—just the conclusion. Here’s my attempt at a disagreement which feels analogous to me:
My response in this “debate” is: if you start with a spherical cow and then consider which real world differences are important enough to model, you’re better off than just saying “no one should think about spherical cows”.
I don’t understand why you think that. If you can have a good understanding of instrumental convergence and power-seeking for optimal agents, then you can consider whether any of those same reasons apply for suboptimal humans.
Considering power-seeking for optimal agents is a relaxed problem. Yes, ideally, we would instantly jump to the theory that formally describes power-seeking for suboptimal agents with realistic goals in all kinds of environments. But before you do that, a first step is understanding power-seeking in MDPs. Then, you can take formal insights from this first step and use them to update your pre-theoretic intuitions where appropriate.
Thanks for engaging despite the opacity of the disagreement. I’ll try to make my position here much more explicit (and apologies if that makes it sound brusque). The fact that your model is a simplified abstract model is not sufficient to make it useful. Some abstract models are useful. Some are misleading and will cause people who spend time studying them to understand the underlying phenomenon less well than they did before. From my perspective, I haven’t seen you give arguments that your models are in the former category not the latter. Presumably you think they are in fact useful abstractions—why? (A few examples of the latter: behaviourism, statistical learning theory, recapitulation theory, Gettier-style analysis of knowledge).
My argument for why they’re overall misleading: when I say that “the optimal policy in an environment anything like the real world is absurdly, impossibly complex, and requires infinite compute”, or that safety researchers shouldn’t think about AIXI, I’m not just saying that these are inaccurate models. I’m saying that they are modelling fundamentally different phenomena than the ones you’re trying to apply them to. AIXI is not “intelligence”, it is brute force search, which is a totally different thing that happens to look the same in the infinite limit. Optimal tabular policies are not skill at a task, they are a cheat sheet, but they happen to look similar in very simple cases.
Probably the best example of what I’m complaining about is Ned Block trying to use Blockhead to draw conclusions about intelligence. I think almost everyone around here would roll their eyes hard at that. But then people turn around and use abstractions that are just as unmoored from reality as Blockhead, often in a very analogous way. (This is less a specific criticism of you, TurnTrout, and more a general criticism of the field).
Forgive me a little poetic license. The analogy in my mind is that you were trying to model the cow as a sphere, but you didn’t know how to do so without setting its weight as infinite, and what looked to you like your model predicting the cow would roll downhill was actually your model predicting that the cow would swallow up the nearby fabric of spacetime and the bottom of the hill would fall into its event horizon. At which point, yes, you would be better off just saying “nobody should think about spherical cows”.
Thanks for elaborating this interesting critique. I agree we generally need to be more critical of our abstractions.
Falsifying claims and “breaking” proposals is a classic element of AI alignment discourse and debate. Since we’re talking about superintelligent agents, we can’t predict exactly what a proposal would do. However, if I make a claim (“a superintelligent paperclip maximizer would keep us around because of gains from trade”), you can falsify this by showing that my claimed policy is dominated by another class of policies (“we would likely be comically resource-inefficient in comparison; GFT arguments don’t model dynamics which allow killing other agents and appropriating their resources”).
Even we can come up with this dominant policy class, so the posited superintelligence wouldn’t miss it either. We don’t know what the superintelligent policy will be, but we know what it won’t be (see also Formalizing convergent instrumental goals). Even though I don’t know how Garry Kasparov will open the game, I confidently predict that he won’t let me checkmate him in two moves.
Non-optimal power and instrumental convergence
Instead of thinking about optimal policies, let’s consider the performance of a given algorithm A. A(M,R) takes a rewardless MDP M and a reward function R as input, and outputs a policy.
Definition. Let $\mathcal{R}$ be a continuous distribution over reward functions with CDF $F$. The average return achieved by algorithm $A$ at state $s$ and discount rate $\gamma$ is $\int_{\mathcal{R}} V^{A(M,R)}_{R}(s,\gamma)\,dF(R)$.
Instrumental convergence with respect to A’s policies can be defined similarly (“what is the $\mathcal{R}$-measure of a given trajectory under A?”). The theory I’ve laid out allows precise claims, which is a modest benefit to our understanding. Before, we just had intuitions about some vague concept called “instrumental convergence”.
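As a rough illustration of the definition above, the average return can be estimated by Monte Carlo: sample reward functions, run the algorithm on each, and average the resulting discounted returns. The three-state MDP, the uniform reward distribution, the convention that reward is received on entering a state, and `greedy_algorithm` (a deliberately suboptimal stand-in for A) are all assumptions made for this sketch.

```python
import random

# Hypothetical rewardless MDP M: NEXT_STATE[s][a] is the state reached from s via action a.
NEXT_STATE = [
    [1, 2],  # from state 0, go to state 1 or state 2
    [1, 1],  # state 1 is absorbing
    [2, 0],  # from state 2, stay or return to state 0
]


def greedy_algorithm(next_state, reward):
    """A hypothetical A(M, R): in each state, pick the action whose next state has the highest reward."""
    return [
        max(range(len(successors)), key=lambda a: reward[successors[a]])
        for successors in next_state
    ]


def policy_value(next_state, reward, policy, start, gamma, horizon=500):
    """Discounted return of following `policy` from `start`, truncated after `horizon` steps."""
    total, discount, state = 0.0, 1.0, start
    for _ in range(horizon):
        state = next_state[state][policy[state]]
        total += discount * reward[state]  # assumed convention: reward on entering a state
        discount *= gamma
    return total


def average_return(algorithm, next_state, start, gamma, samples=500, seed=0):
    """Monte Carlo estimate of the integral of V^{A(M,R)}_R(start, gamma) over sampled R."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        reward = [rng.random() for _ in next_state]  # R drawn i.i.d. uniform on [0, 1]
        policy = algorithm(next_state, reward)
        total += policy_value(next_state, reward, policy, start, gamma)
    return total / samples


print(average_return(greedy_algorithm, NEXT_STATE, start=0, gamma=0.9))
```

Swapping in a different `algorithm` argument gives a direct comparison of how much return different procedures extract on average, without reference to the optimal policy.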
Here’s bad reasoning, which implies that the cow tears a hole in spacetime:
The problem is that it’s impractical to predict what a smarter agent will do, or what specific kinds of action will be instrumentally convergent for A, or that the real agent would be infinitely smart. Just because it’s smart doesn’t mean it’s omniscient, as you rightly point out.
Here’s better reasoning:
It might seem like I’m assuming the consequent here. This is not so – the work is first done by the theorems on optimal behavior, which do imply that most goals achieve greater return by avoiding shutdown. The question is whether reasonably intelligent suboptimal agents realize this fact. Given a uniformly drawn reward function, we can usually come up with a better policy than dying, so the argument is that A can as well.
I’m afraid I’m mostly going to disengage here, since it seems more useful to spend the time writing up more general + constructive versions of my arguments, rather than critiquing a specific framework.
If I were to sketch out the reasons I expect to be skeptical about this framework if I looked into it in more detail, it’d be something like:
1. Instrumental convergence isn’t training-time behaviour, it’s test-time behaviour. It isn’t about increasing reward, it’s about achieving goals (that the agent learned by being trained to increase reward).
2. The space of goals that agents might learn is very different from the space of reward functions. As a hypothetical, maybe it’s the case that neural networks are just really good at producing deontological agents, and really bad at producing consequentialists. (E.g., if it’s just really really difficult for gradient descent to get a proper planning module working). Then agents trained on almost all reward functions will learn to do well on them without developing convergent instrumental goals. (I expect you to respond that being deontological won’t get you to optimality. But I would say that talking about “optimality” here ruins the abstraction, for reasons outlined in my previous comment).
I was actually going to respond, “that’s a good point, but (IMO) a different concern than the one you initially raised”. I see you making two main critiques.
(paraphrased) “A won’t produce optimal policies for the specified reward function [even assuming alignment generalization off of the training distribution], so your model isn’t useful” – I replied to this critique above.
“The space of goals that agents might learn is very different from the space of reward functions.” I agree this is an important part of the story. I think the reasonable takeaway is “current theorems on instrumental convergence help us understand what superintelligent A won’t do, assuming no reward-result gap. Since we can’t assume alignment generalization, we should keep in mind how the inductive biases of gradient descent affect the eventual policy produced.”
I remain highly skeptical of the claim that applying this idealized theory of instrumental convergence worsens our ability to actually reason about it.
ETA: I read some information you privately messaged me, and I see why you might see the above two points as a single concern.
Is the point that people try to use algorithms which they think will eventually converge to the optimal policy? (Assuming there is one.)
Something like that, yeah.
I object to the claim that agents that act randomly can be made “arbitrarily simple”. Randomness is basically definitionally complicated!
Eh, this seems a bit nitpicky. It’s arbitrarily simple given a call to a randomness oracle, which in practice we can approximate pretty easily. And it’s “definitionally” easy to specify as well: “the function which, at each call, returns true with 50% likelihood and false otherwise.”
If you get an ‘external’ randomness oracle, then you could define the utility function pretty simply in terms of the outputs of the oracle.
If the agent has a pseudo-random number generator (PRNG) inside it, then I suppose I agree that you aren’t going to be able to give it a utility function that has the standard set of convergent instrumental goals, and PRNGs can be pretty short. (Well, some search algorithms are probably shorter, but I bet they have higher Kt complexity, which is probably a better measure for agents)
I’d take a different tack here, actually; I think this depends on what the input to the utility function is. If we’re only allowed to look at ‘atomic reality’, or the raw actions the agent takes, then I think your analysis goes through, that we have a simple causal process generating the behavior but need a very complicated utility function to make a utility-maximizer that matches the behavior.
But if we’re allowed to decorate the atomic reality with notes like “this action was generated randomly”, then we can have a utility function that’s as simple as the generator, because it just counts up the presence of those notes. (It doesn’t seem to me like this decorator is meaningfully more complicated than the thing that gave us “agents taking actions” as a data source, so I don’t think I’m paying too much here.)
This can lead to a massive explosion in the number of possible utility functions (because there’s a tremendous number of possible decorators), but I think this matches the explosion that we got by considering agents that were the outputs of causal processes in the first place. That is, consider reasoning about Python code that outputs actions in a simple game, where there are many more possible Python programs than there are possible policies in the game.
So in general you can’t have utility functions that are as simple as the generator, right? E.g. the generator could be deontological. In which case your utility function would be complicated. Or it could be random, or it could choose actions by alphabetical order, or...
And so maybe you can have a little note for each of these. But now what it sounds like is: “I need my notes to be able to describe every possible cognitive algorithm that the agent could be running”. Which seems very very complicated.
I guess this is what you meant by the “tremendous number” of possible decorators. But if that’s what you need to do to keep talking about “utility functions”, then it just seems better to acknowledge that they’re broken as an abstraction.
E.g. in the case of Python code, you wouldn’t do anything analogous to this. You would just try to reason about all the possible Python programs directly. Similarly, I want to reason about all the cognitive algorithms directly.
That’s right.
I realized my grandparent comment is unclear here:
This should have been “consequence-desirability-maximizer” or something, since the whole question is “does my utility function have to be defined in terms of consequences, or can it be defined in terms of arbitrary propositions?”. If I want to make the deontologist-approximating Innocent-Bot, I have a terrible time if I have to specify the consequences that correspond to the bot being innocent and the consequences that don’t, but if you let me say “Utility = 0 - badness of sins committed” then I’ve constructed a ‘simple’ deontologist. (At least, about as simple as the bot that says “take random actions that aren’t sins”, since both of them need to import the sins library.)
In general, I think it makes sense to not allow this sort of elaboration of what we mean by utility functions, since the behavior we want to point to is the backwards assignment of desirability to actions based on the desirability of their expected consequences, rather than the expectation of any arbitrary property.
---
Actually, I also realized something about your original comment which I don’t think I had the first time around; if by “some reasonable percentage of an agent’s actions are random” you mean something like “the agent does epsilon-exploration” or “the agent plays an optimal mixed strategy”, then I think it doesn’t at all require a complicated utility function to generate identical behavior. Like, in the rock-paper-scissors world, and with the simple function ‘utility = number of wins’, the expected utility maximizing move (against tough competition) is to throw randomly, and we won’t falsify the simple ‘utility = number of wins’ hypothesis by observing random actions.
Instead I read it as something like “some unreasonable percentage of an agent’s actions are random”, where the agent is performing some simple-to-calculate mixed strategy that is either suboptimal or only optimal by luck (when the optimal mixed strategy is the maxent strategy, for example), and matching the behavior with an expected utility maximizer is a challenge (because your target has to be not some fact about the environment, but some fact about the statistical properties of the actions taken by the agent).
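The rock-paper-scissors point can be checked with a small calculation. The sketch below uses the “utility = number of wins” scoring from the comment (1 for a win, 0 for a draw or loss) and compares the worst-case expected utility of the uniform strategy with that of a skewed one; the particular skewed strategy is an arbitrary choice for illustration.

```python
from fractions import Fraction

# WINS[i][j] is 1 if move i beats move j, else 0; moves are 0=rock, 1=paper, 2=scissors.
WINS = [
    [0, 0, 1],  # rock beats scissors
    [1, 0, 0],  # paper beats rock
    [0, 1, 0],  # scissors beats paper
]


def worst_case_wins(strategy):
    """Expected wins per round against the opponent's most damaging pure reply."""
    return min(
        sum(p * WINS[i][j] for i, p in enumerate(strategy))
        for j in range(3)
    )


uniform = [Fraction(1, 3)] * 3
rock_heavy = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

print(worst_case_wins(uniform))     # 1/3
print(worst_case_wins(rock_heavy))  # 1/4
```

Against an opponent who exploits any bias, the uniform strategy guarantees more expected wins than the skewed one, so observing random throws is exactly what the simple “number of wins” hypothesis predicts in this setting.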
---
I think this is where the original intuition becomes uncompelling. We care about utility-maximizers because they’re doing their backwards assignment, using their predictions of the future to guide their present actions to try to shift the future to be more like what they want it to be. We don’t necessarily care about imitators, or simple-to-write bots, or so on. And so if I read the original post as “the further a robot’s behavior is from optimal, the less likely it is to demonstrate convergent instrumental goals”, I say “yeah, sure, but I’m trying to build smart robots (or at least reasoning about what will happen if people try to).”
This is in fact the intended reading, sorry for ambiguity. Will edit. But note that there are probably very few situations where exploring via actual randomness is best; there will almost always be some type of exploration which is more favourable. So I don’t think this helps.
To be pedantic: we care about “consequence-desirability-maximisers” (or in Rohin’s terminology, goal-directed agents) because they do backwards assignment. But I think the pedantry is important, because people substitute utility-maximisers for goal-directed agents, and then reason about those agents by thinking about utility functions, and that just seems incorrect.
What do you mean by optimal here? The robot’s observed behaviour will be optimal for some utility function, no matter how long you run it.
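A minimal sketch of why this holds, assuming a finite action set and a utility function that is allowed to depend on the whole action history: given any observed behaviour, one can write down an indicator utility that is maximised by exactly that behaviour. The action names and the observed sequence are made up for the example.

```python
from itertools import product


def rationalising_utility(observed_actions):
    """Build a utility over action histories that scores 1 for the observed history and 0 otherwise."""
    target = tuple(observed_actions)

    def utility(history):
        return 1 if tuple(history) == target else 0

    return utility


def optimal_histories(utility, actions, length):
    """Enumerate all histories of the given length and return those with maximal utility."""
    histories = list(product(actions, repeat=length))
    best = max(utility(h) for h in histories)
    return [h for h in histories if utility(h) == best]


observed = ["left", "left", "jump", "right"]  # arbitrary observed behaviour
utility = rationalising_utility(observed)
print(optimal_histories(utility, ["left", "right", "jump"], len(observed)))
```

The constructed utility is as long as the behaviour it explains, which is the sense in which the rationalisation is trivial: it grows with the length of the run instead of compressing it.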
Valid point.
This also seems right. Like, my understanding of what’s going on here is we have:
‘central’ consequence-desirability-maximizers, where there’s a simple utility function that they’re trying to maximize according to the VNM axioms
‘general’ consequence-desirability-maximizers, where there’s a complicated utility function that they’re trying to maximize, which is selected because it imitates some other behavior
The first is a narrow class, and depending on how strict you are with ‘maximize’, quite possibly no physically real agents will fall into it. The second is a universal class, which instantiates the ‘trivial claim’ that everything is utility maximization.
Put another way, the first is what happens if you hold utility fixed / keep utility simple, and then examine what behavior follows; the second is what happens if you hold behavior fixed / keep behavior simple, and then examine what utility follows.
Distance from the first is what I mean by “the further a robot’s behavior is from optimal”; I want to say that I should have said something like “VNM-optimal” but actually I think it needs to be closer to “simple utility VNM-optimal.”
I think you’re basically right in calling out a bait-and-switch that sometimes happens, where anyone who wants to talk about the universality of expected utility maximization in the trivial ‘general’ sense can’t get it to do any work, because it should all add up to normality, and in normality there’s a meaningful distinction between people who sort of pursue fuzzy goals and ruthless utility maximizers.