This might imply a predictive circuit for predicting the output of the antecedent-computation-reinforcer, but I don’t see why it implies internal reward-orientation motivational edifices.
Sorry, if I’m reading this right, we’re hypothesizing internal reward-orientation motivational edifices, and then asking the question of whether or not policy gradients will encourage them or discourage them. Quintin seems to think “nah, it needs to take an action before that action can be rewarded”, and my response is “wait, isn’t this going to be straightforwardly encouraged by backpropagation?”
[I am slightly departing from Wei_Dai’s hypothetical in my line of reasoning here, as Wei is mostly focused on asking “don’t you expect this to come about in an introspective-reasoning powered way?” and I’m mostly focused on asking “if this structure is present in the model initialization as one of the lottery tickets, won’t policy gradient encourage it?”.]
I think that’s assuming there’s a feature-direction “care more about reward” which isn’t already gradient-starved by shallower proxies learned earlier in training. In my ontology, this corresponds to “thinking thoughts about reward in order to get reward.”
Cool, this feels like a real reason, but also substantially more contingent. Naively, I would expect that you could construct a training schedule such that ‘care more about reward’ is encouraged, and someone will actually try to do this (as part of making a zero-shot learner in RL environments).
If this argument works, why doesn’t it go through for people? (Not legibly a knockdown until we check that the mechanisms are sufficiently similar, but it’s at least a sanity check. I think the mechanisms are probably sufficiently similar, though.)
I think we have some pre-existing disagreement about what we should conclude from human heroin addicts; you seem to think “yeah, it only happens sometimes” whereas my view is something more like “fuck, it happens sometimes”. Like, the thing where people don’t do heroin because they’ve heard other people downvote heroin addiction is not a strategy that scales to superintelligence.
Agreed that we need real-valued utilities to make clear recommendations in the case of uncertainty.
I don’t understand how it is related to a decision theory, it’s just world counting and EV calculation. I must be missing something, I assume.
For all of the consequentialist decision theories, I think you can describe what they’re doing as attempting to argmax a probability-weighted sum of utilities across possible worlds, and they differ on how they think actions influence probabilities / their underlying theory of how they specify ‘possible worlds’ and thus what universe they think they’re in. [That is, I think the interesting bit is the part you seem to be handling as an implementation detail.]
For fixed-duration reasoning chains, or in situations where the AI is trying to facilitate cooperation between different passes (like in the proposed scenario where it gets information but then loses access to it), this seems important. For example, in a situation where you show the model a passage, let it make notes, and then have the same model look just at the notes and answer a reading comprehension question, I expect it will likely want to embed information in the null space of the notes. [Or in an HCH-style situation, it might want to embed the original question into subquestions, and might do so steganographically, in a way that breaks some of the things one might be hoping for there.]
But I’m somewhat confused about whether you should expect it to crop up in recursive prompting cases (probably you don’t?). If you secretly know the answer in the first pass, don’t you want to just say the answer and terminate? If you secretly know a relevant fact in the first pass, don’t you want to openly express it and recur? I don’t see how steganography is rewarded over openness. [I’m setting aside the case where it invents a new language to express non-linguistic concepts more cheaply, and assuming we’re in the natural language setting where standard language is the ‘natural’ way to transmit the information.]
Cool, thanks for the link; I found jessicata’s comment thread there helpful. I agree that CDT overestimates the accessibility of worlds. I think one way to think about EDT is that it is also just counting worlds, probabilities, and utilities, but you’re calculating your probabilities differently, in a more UDT-ish way.
Consider another variant of this problem, where there are many islands, and the button only kills the psychopaths on its island. If Paul has a historical record that so far, all of the previous buttons that have been pressed were pressed by psychopaths, Paul might nevertheless think that his choice to press the button stems from a different source than psychopathy, and thus it’s worth pressing the button. [Indeed, the spicy take is that EDT doesn’t press the button, CDT does for psychopathic reasons and so dies, and FDT does for non-psychopathic reasons, and so gets the best outcome. ;) ]
an agent who counts worlds, probabilities, and utilities, without involving counterfactuals, always has the best EV.
Sorry, can you express this in terms like $V(A) = \sum_j P(O_j \mid A)\, U(O_j)$? The main disagreement between decision theories like EDT and CDT is which worlds they think are accessible, and I am not confident I could guess what you’d think the answer is to an arbitrary problem.
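[A minimal sketch of that formula, to make the "same arithmetic, different probabilities" point concrete. The function names, the Newcomb-style payoffs, and both probability tables are hypothetical numbers chosen for illustration; the "evidential" and "causal" labels are a rough gloss, not a faithful implementation of either theory.]

```python
def expected_value(action, prob, utility):
    """V(A) = sum_j P(O_j | A) * U(O_j), with prob[action] mapping outcome -> probability."""
    return sum(p * utility[outcome] for outcome, p in prob[action].items())


def best_action(prob, utility):
    """Argmax of V(A) over the actions listed in the probability table."""
    return max(prob, key=lambda a: expected_value(a, prob, utility))


# Hypothetical payoffs (assumption, not from the thread).
utility = {"$0": 0, "$1k": 1_000, "$1M": 1_000_000, "$1M+1k": 1_001_000}

# Model 1 (assumption): the action is treated as evidence about the prediction.
evidential = {
    "one-box": {"$1M": 0.99, "$0": 0.01},
    "two-box": {"$1k": 0.99, "$1M+1k": 0.01},
}

# Model 2 (assumption): the prediction is treated as fixed whatever the action is.
causal = {
    "one-box": {"$1M": 0.5, "$0": 0.5},
    "two-box": {"$1M+1k": 0.5, "$1k": 0.5},
}

print(best_action(evidential, utility))
print(best_action(causal, utility))
```

With these made-up numbers the utilities and the argmax are identical across the two calls; only the table of P(O_j | A) changes, and that alone is enough to flip the recommended action.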
In my view, FDT handles the problem as follows:
Frank: Suppose FDT(situation) = “push the button”. Then all psychopaths die, which includes me. Suppose instead FDT(situation) = “don’t push the button”. Then no psychopaths die. Since I prefer living in a world with psychopaths to dying, FDT(situation) = “don’t push the button”.
The main controversial piece is from the problem specification: “Paul is quite confident that only a psychopath would press such a button.” I think this mixes up P(button|psychopath) and P(psychopath|button), but since the problem specification is our only source of how the button determines who is or isn’t a psychopath, it seems fine to trust it on that point.
Another related problem is one where there’s a button that kills everyone who would, given the option, press it. You might expect that such people are bad neighbors and prefer a world without them without having any way to act on that belief (and if you come to believe that FDT pushes that button, what it really means is that you shouldn’t be so confident people who would press the button are bad neighbors!).
[In general, your decision theory should save you from claims in the problem specification of the form “and then you make a bad decision”, but it can’t be expected to save you from having incorrect empirical beliefs.]
I think it’s Paul alive, sociopaths dead > Paul alive, sociopaths alive > Paul dead, sociopaths dead, with the inaccessible Paul dead, sociopaths alive at the very bottom.
In situations with uncertainty, we would need to have the scales of those preferences, but I think you’re supposed to view the problem as having certainty.
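[A small sketch of the certainty case, using only the ordinal ranking above. The boolean `presser_is_psychopath` is a hypothetical parameter standing in for whether the agent trusts the stipulation that only a psychopath would press; the numeric ranks are arbitrary apart from their order.]

```python
# Ordinal ranks for the four worlds (higher is better); only the order matters.
RANK = {
    ("Paul alive", "sociopaths dead"): 3,
    ("Paul alive", "sociopaths alive"): 2,
    ("Paul dead", "sociopaths dead"): 1,
    ("Paul dead", "sociopaths alive"): 0,  # inaccessible in this problem
}


def outcome(action, presser_is_psychopath):
    """World that follows from an action, given the stipulation passed in."""
    if action == "don't push":
        return ("Paul alive", "sociopaths alive")
    # Pushing kills every psychopath, which includes Paul if the stipulation holds.
    paul = "Paul dead" if presser_is_psychopath else "Paul alive"
    return (paul, "sociopaths dead")


def choose(presser_is_psychopath):
    """Pick the action whose resulting world ranks highest."""
    actions = ["push", "don't push"]
    return max(actions, key=lambda a: RANK[outcome(a, presser_is_psychopath)])


print(choose(presser_is_psychopath=True))   # stipulation trusted
print(choose(presser_is_psychopath=False))  # stipulation rejected
```

With certainty, the ranking alone decides the answer: trusting the stipulation gives "don't push", and rejecting it gives "push". Real-valued utilities only become necessary once the stipulation is held with some probability in between.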
As it happens, I went to College Park and got, iirc, 1580 on my SATs. [I was rejected from all the top schools I applied to, but I only distinctly remember Stanford.]
I think you should expect the very top students to be both concentrated (at the top schools) and dispersed (a handful at each of the remaining colleges, attracted by merit scholarships). I also got into a higher ranked school, but UMCP was willing to give me a full ride and Boston University wasn’t, and I didn’t believe the difference in schools was worth the tuition.
Factorization is not a solution to the problem of general intelligence.
Huh, really? I think my impression from talking to Paul over the years was that it sort of was. [Like, if your picture of the human brain is that it’s a bunch of neurons factorizing the problem of being a human, this sort of has to work.]
so in pure non-lookahead (e.g. model free) sample-based policy gradient estimation, an action which has never been tried can not be reinforced (except as a side effect of generalisation by function approximation).
This is the bit I don’t believe, actually. [Or at least don’t think is relevant.] Note that in Wei_Dai’s hypothetical, the neural net architecture has a particular arrangement such that “how much it optimizes for reward” is either directly or indirectly implied by the neural network weights. [We’re providing the reward as part of its observations, and so if nothing else the weights from that part of the input vector to deeper in the network will be part of this, but the actual mechanism is going to be more complicated than that.]
Quintin seems to me to be arguing “if you actually follow the math, there isn’t a gradient to that parameter,” which I find surprising, and which seems easy to demonstrate by going thru the math. As far as I can tell, there is a gradient there, and it points in the direction of “care more about reward.”
This doesn’t mean that, by caring about reward more, it knows which actions in the environment cause more reward. There I believe the story that the RL algorithm won’t be able to reinforce actions that have never been tried.
[EDIT: Maybe the argument is “but if it’s never tried the action of optimizing harder for reward, then the RL algorithm won’t be able to reinforce that internal action”? But that seems pretty strained and not very robust, as the first time it considers trying harder to get reward, it will likely get hooked.]
If the agent doesn’t explore in the direction of answering “good”, then there’s no gradient in that direction.
Wait, I don’t think this is true? At least, I’d appreciate it being stepped thru in more detail.
In the simplest story, we’re imagining an agent whose policy is $\pi_\theta$ and, for simplicity’s sake, $\theta_0$ is a scalar that determines “how much to maximize for reward” and all the other parameters of $\theta$ store other things about the dynamics of the world / decision-making process.
It seems to me that $\nabla_\theta$ is obviously going to try to point $\theta_0$ in the direction of “maximize harder for reward”.
In the more complicated story, we’re imagining an agent whose policy is $\pi_\theta$ which involves how it manipulates both external and internal actions (and thus both external and internal state). One of the internal state pieces (let’s call it $s_0$ like last time) determines whether it selects actions that are more reward-seeking or not. Again I think it seems likely that $\nabla_\theta$ is going to try to adjust $\theta$ such that the agent selects internal actions that point $s_0$ in the direction of “maximize harder for reward”.
What is my story getting wrong?
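[A toy numerical sketch of the simplest story, under loud assumptions that are not part of Wei_Dai’s hypothetical: a three-armed bandit, a policy with pi(a) proportional to exp(theta0 * q[a]), and hypothetical internal reward estimates `q` that happen to equal the true rewards `r`. The gradient is computed exactly, in expectation over actions, so by itself it says nothing about what a sample-based estimator sees for actions that are never tried.]

```python
import math


def policy(theta0, q):
    """Softmax policy: pi(a) proportional to exp(theta0 * q[a])."""
    logits = [theta0 * qa for qa in q]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(theta0, q, r):
    """J(theta0) = sum_a pi(a) * r[a]."""
    return sum(p * ra for p, ra in zip(policy(theta0, q), r))


def grad_theta0(theta0, q, r):
    """Exact policy gradient: sum_a pi(a) * r[a] * d/dtheta0 log pi(a)."""
    pi = policy(theta0, q)
    mean_q = sum(p * qa for p, qa in zip(pi, q))
    # For this softmax, d/dtheta0 log pi(a) = q[a] - E_pi[q].
    return sum(p * ra * (qa - mean_q) for p, ra, qa in zip(pi, r, q))


r = [0.0, 1.0, 3.0]  # hypothetical true rewards
q = [0.0, 1.0, 3.0]  # assumption: internal estimates match the true rewards

print(grad_theta0(0.0, q, r))                    # positive when q tracks r
print(grad_theta0(0.0, [-x for x in r], r))      # negative when q is anti-correlated
```

In this setup the gradient with respect to theta0 is the covariance of `r` and `q` under the current policy, so it points toward “maximize harder” exactly when the internal estimates track reward, and flips sign when they are anti-correlated with it.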
Intuitively, once you see the contents of the big box, you really have no reason not to take the small box.
I think the word ‘intuitively’ is kind of weird, here? Like, if we swap the Transparent Newcomb’s Problem frame with the (I believe identical) Parfit’s Hitchhiker frame, I feel an intuitive sense that I should pay the driver (only take the big box), because of a handshake across time and possibility, and being the sort of agent that can follow thru on those handshakes.
Not having time to read all of your papers, do they have the same methodology SMTM points out as being suspect in the post you linked, or do they cover a broader range of methodologies?
I think one of the interesting things about Elden Ring, compared to many other games, is that it’s much more about the enemies than it is about you, and that makes it substantially more difficult for it to become stale. [A similar game in that regard is Undertale, where (on a nice playthru) your character basically doesn’t get any better, you just get better at not getting hurt by the enemies / going thru the conversational sequence necessary to befriend them.]
Like, consider this YouTube video, where a game exploiter sets up a super powerful combo (a very expensive spell to deal lots of continuous damage, plus a temporary buff that makes spells do more damage, plus a temporary buff that makes spells free!) and… gets one-hit-killed by the first dragon that he tries it on. In other games, this might be the point at which every battle from then on out becomes a repetitive exercise in performing the combo. [Despite the combo’s significant limitations, it’s nevertheless situationally useful and I used it a bunch on my mage.]
There are other solutions to keep things fresh; for example, in Hades, you are mostly choosing powerups from a random set of options, and so can’t execute the same combo every run, but during a run there will typically be a point where you lock in your combo and then it’s all execution from then on out. [Hades has the same dynamic of binomially avoiding trouble.]
If, instead, you’re playing something downstream of Dungeons and Dragons or Warhammer or so on, you often have the experience of being locked in to a playstyle; you wanted to be an archer? Great, here’s fifty hours of being an archer. Want a different playstyle? Well… I guess you could start over.
Have you any further thoughts about where the content for this project would come from?
Yeah; here’s a sketch, from a few different directions:
What: I think the first stage is organizing stuff that already exists and is pretty well-collected: going thru the Sequences, the CFAR handbook, Thinking and Deciding, clearerthinking.org, and so on. I think the second stage casts a broader net, and is something like the process that made The Personal MBA, which is basically a book-length distillation of a ton of business books into their basic concepts; I have a shelf of psychology / self-help books which I think could similarly be distilled into a handful of patterns each. The third stage is something like original research / pushing the frontier forward—finding the holes and filling them, running comparative experiments in areas where we don’t have much data or experience, and so on.
[I describe them as ‘stages’ but I expect them to happen simultaneously, with the stages as my guess of where the bulk of the effort will go when. I also basically don’t expect to run out of content to mine; I’m more worried about running out of a sense that organizing the content is worth doing.]
Who: I expect there to be 1-5 people who put significant effort into this, followed by a long tail of people making minor changes, and am imagining that a basically open setup makes more sense than trying to vet entrants (after all, it’s easy to revert changes).
Meta: I think I’m planning on mostly ‘trying it and seeing what happens, iterating on the fly’, as there are a bunch of unknowns that are pretty cheap to just run into and see how they resolve. [For example, I can imagine a world where there are a bunch of CFAR developers who are quite into this / paid by CFAR to work on it, and another world where it’s all basically ignored by them; can imagine it being useful for people to coordinate on splitting up effort vs. just doing whatever makes sense; can imagine putting a bunch of work in to shepherd this vs. it having a life of its own.]
Thanks! I haven’t gotten to that one yet, but I’ll move it to next on my list. [I also think his earlier book, Notes on the Synthesis of Form, is probably relevant, but unclear how much of it is already present in later books, which have had more time to ‘cook’.]
Yeah let’s do it!
One of the things I’m most interested in at the moment—which I have a vague suspicion is easy but might actually be hard—is visualization of the graph structure, of the sort that yEd or Obsidian or so on have. Another thing which seems easier but also is not inherent to wiki software (I think?) is serializing the list of patterns in the biggest-to-smallest way (which, maybe we want some flexibility here, so that we can experiment with multidimensional layouts?).
Linkgraphs obviously have directionality to them, but my guess is that we want the ability to deliberately set the directionality of links, so that when 80 Self-governing Workshops and Offices links up to 41 Work Community it’s stored differently than when it links down to 148 Small Work Groups, or sideways/outwards to Buddhist Economics [example from a randomly selected pattern near the middle of A Pattern Language]. And again some sort of future-proofing might be interesting here; it’s one thing to say “small work groups are part of a self-governing workshop” and another thing to say “small work groups are spatially part of a self-governing workshop”, so an algorithm could easily differentiate between that and temporal containment (like “a pomodoro is temporally part of a workday”).
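[One possible way to store that, as a rough sketch: every name here (`Direction`, `Containment`, `Link`, `parts_of`) is hypothetical and not taken from any existing wiki software. The "Workday" / "Pomodoro" entries are the temporal example from the paragraph above, and tagging the Small Work Groups link as spatial follows the same paragraph.]

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    UP = "up"              # toward a larger, containing pattern
    DOWN = "down"          # toward a smaller, contained pattern
    SIDEWAYS = "sideways"  # related material outside the hierarchy


class Containment(Enum):
    UNSPECIFIED = "unspecified"
    SPATIAL = "spatial"
    TEMPORAL = "temporal"


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    direction: Direction
    containment: Containment = Containment.UNSPECIFIED


WORKSHOP = "80 Self-governing Workshops and Offices"

links = [
    Link(WORKSHOP, "41 Work Community", Direction.UP),
    Link(WORKSHOP, "148 Small Work Groups", Direction.DOWN, Containment.SPATIAL),
    Link(WORKSHOP, "Buddhist Economics", Direction.SIDEWAYS),
    Link("Workday", "Pomodoro", Direction.DOWN, Containment.TEMPORAL),
]


def parts_of(pattern, links, containment):
    """Targets that `pattern` links down to with the given kind of containment."""
    return [
        link.target
        for link in links
        if link.source == pattern
        and link.direction is Direction.DOWN
        and link.containment is containment
    ]


print(parts_of(WORKSHOP, links, Containment.SPATIAL))
print(parts_of("Workday", links, Containment.TEMPORAL))
```

Keeping direction and containment as separate fields is the future-proofing bit: new containment kinds can be added to the enum later without touching links that left it unspecified.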
I think the point is “a focus on daily fluctuations obscures slower, more important trends”; i.e. it’s not a disagreement about which facts are true but which facts are most relevant.
If you have a space with two disconnected components, then I’m calling the distinction between them “crisp.”
The components feel disconnected to me in 1D, but I’m not sure they would feel disconnected in 3D or in ND. Is your intuition that they’re ‘durably disconnected’ (even looking at the messy plan-space of the real-world, we’ll be able to make a simple classifier that rates corrigibility), or if not, when the connection comes in (like once you can argue about philosophy in way X, once you have uncertainty about your operator’s preferences, once you have the ability to shut off or distract bits of your brain without other bits noticing, etc.)?
[This also feels like a good question for people who think corrigibility is anti-natural; do you not share Paul’s sense that they’re disconnected in 1D, or when do you think the difficulty comes in?]