Agreed that it shouldn’t be hard to do that, but I expect that people will often continue to do what they find intrinsically motivating, or what they’re good at, even if it’s not overall a good idea. If this article can be believed, a senior researcher said that they work on capabilities because “the prospect of discovery is too sweet”.
It’s fine to say that if you want the conversation to become a discussion of AI timelines. Maybe you do! But not every conversation needs to be about AI timelines.
I feel excited about this framework! Several thoughts:
I especially like the metathreat hierarchy. It makes sense because if you completely curry it, each agent sees the foe’s action, policy, metapolicy, etc., which are all generically independent pieces of information. But it gets weird when an agent sees an action that’s not compatible with the foe’s policy.
You hinted briefly at using hemicontinuous maps of sets instead of or in addition to probability distributions, and I think that’s a big part of what makes this framework exciting. Maybe if one takes a bilimit of Scott domains or whatever, you can have an agent that can be understood simultaneously on multiple levels, and so evade commitment races. I haven’t thought much about that.
I think you’re right that the epiphenomenal utility functions are not good. I still think using reflective oracles is a good idea. I wonder if the power of Kakutani fixed points (magical reflective reasoning) can be combined with the power of Kleene fixed points (iteratively refining commitments).
Oh you’re right, I was confused.
I’ve no idea if this example has appeared anywhere else. I’m not sure how seriously to take it.
Consider the following game: At any time t∈[0,1], you may say “stop!”, in which case you’ll get the lottery that resolves to an outcome you value at (0,0,…) with probability t, and to an outcome you value at (1+t,0,…) with probability 1−t. If you don’t say “stop!” in that time period, we set t=1.
Let’s say at every instant in [0,1] you can decide to either say “stop!” or to wait a little longer. (A dubious assumption, as it lets you make infinitely many decisions in finite time.) Then you’ll naturally wait until t=1 and get a payoff of (0,0,…). It would have been better for you to say “stop!” at t=0, in which case you’d get (1,0,…).
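To make the arithmetic explicit: under expected-value reasoning about the first payoff coordinate, stopping at time t yields 0 with probability t and 1+t with probability 1−t, for an expected value of (1−t)(1+t) = 1−t², which is maximized at t=0. A minimal check:

```python
# Expected value of the first payoff coordinate if you say "stop!" at time t:
# 0 with probability t, and 1 + t with probability 1 - t.
def expected_payoff(t: float) -> float:
    return t * 0.0 + (1.0 - t) * (1.0 + t)  # simplifies to 1 - t**2

# Scan candidate stopping times on a grid; the maximum is at t = 0.
ts = [i / 100 for i in range(101)]
best_t = max(ts, key=expected_payoff)
print(best_t, expected_payoff(best_t))  # 0.0 1.0
print(expected_payoff(1.0))            # waiting forever gives 0.0
```

So the greedy policy of always waiting a little longer walks the expected payoff monotonically down from 1 to 0.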
You can similarly argue that it’s irrational for your utility to be discontinuous in the amount of wine in your glass: Otherwise you’ll let the waiter fill up your glass and then be disappointed the instant it’s full.
I haven’t seen a writeup anywhere of how it was trained.
The instruction-following model Altman mentions is documented here. I didn’t notice it had been released!
See section 2 of this Agent Foundations research program and citations for discussion of the problems of logical uncertainty, logical counterfactuals, and the Löbian obstacle. Or you can read this friendly overview. Gödel-Löb provability logic has been used here.
I don’t know of any application of set theory to agent foundations research. (Like large cardinals, forcing, etc.)
Ah, 90% of the people discussed in this post are now working for Anthropic, along with a few other ex-OpenAI safety people.
Here’s a fun and pointless way one could rescue the homunculus model: There’s an infinite regress of homunculi, each of which sees a reconstructed image. As you pass up the chain of homunculi, the shadow gets increasingly attenuated, approaching but never reaching complete invisibility. Then we identify “you” with a suitable limit of the homunculi, and what you see is the entire sequence of images under some equivalence relation which “forgets” how similar A and B were early in the sequence, but “remembers” the presence of the shadow.
The homunculus model says that all visual perception factors through an image constructed in the brain. One should be able to reconstruct this image by asking a subject to compare the brightness of pairs of checkerboard squares. A simplistic story about the optical illusion is that the brain detects the shadow and then adjusts the brightness of the squares in the constructed image to exactly compensate for the shadow, so the image depicts the checkerboard’s inferred intrinsic optical properties. Such an image would have no shadow, and since that’s all the homunculus sees, the homunculus wouldn’t perceive a shadow.
That story is not quite right, though. Looking at the picture, the black squares in the shadow do seem darker than the dark squares outside the shadow, and similarly for the white squares. I think if you reconstructed the virtual image using the above procedure you’d get an image with an attenuated shadow. Maybe with some more work you could prove that the subject sees a strong shadow, not an attenuated one, and thereby rescue Abram’s argument.
Edit: Sorry, misread your comment. I think the homunculus theory is that in the real image, the shadow is “plainly visible”, but the reconstructed image in the brain adjusts the squares so that the shadow is no longer present, or is weaker. Of course, this raises the question of what it means to say the shadow is “plainly visible”...
This is the sort of problem Dennett’s Consciousness Explained addresses. I wish I could summarize it here, but I don’t remember it well enough.
It uses the heterophenomenological method, which means you take a dataset of earnest utterances like “the shadow appears darker than the rest of the image” and “B appears brighter than A”, and come up with a model of perception/cognition to explain the utterances. In practice, as you point out, homunculus models won’t explain the data. Instead the model will say that different cognitive faculties will have access to different pieces of information at different times.
Very interesting. I would guess that to learn in the presence of spoilers, you’d need not only a good model of how you think, but also a way of updating the way you think according to the model’s recommendations. And I’d guess this is easiest in domains where your object-level thinking is deliberate rather than intuitive, which would explain why the flashcard task would be hardest for you.
When I read about a new math concept, I eventually get the sense that my understanding of it is “fake”, and I get “real” understanding by playing with the concept and getting surprised by its behavior. I assumed the surprise was essential for real understanding, but maybe it’s sufficient to track which thoughts are “real” vs. “fake” and replace the latter with the former.
Have you had any success learning the skill of unseeing?
Are you able to memorize things by using flashcards backwards (looking at the answer before the prompt) nearly as efficiently as using them the usual way?
Are you able to learn a technical concept from worked exercises nearly as well as by trying the exercises before looking at the solutions?
Given a set of brainteasers with solutions, can you accurately predict how many of them you would have been able to solve in 5 minutes if you had not seen the solutions?
See also this comment from 2013, which gives the computable version of NicerBot.
This algorithm is now published in “Robust program equilibrium” by Caspar Oesterheld, Theory and Decision (2019) 86:143–159, https://doi.org/10.1007/s11238-018-9679-3, which calls it ϵGroundedFairBot.
The paper cites this comment by Jessica Taylor, which has the version that uses reflective oracles (NicerBot). Note also the post by Stuart Armstrong it’s responding to, and the reply by Vanessa Kosoy. The paper also cites a private conversation with Abram Demski. But as far as I know, the parent to this comment is older than all of these.
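For readers who haven’t seen the paper, here is a minimal sketch of the ϵGroundedFairBot idea in the program-equilibrium setting, where each player submits a program that receives the opponent’s program as input. (The function names and the choice of ϵ are mine, not the paper’s; real versions run the opponent with a resource bound rather than calling it directly.)

```python
import random

EPSILON = 0.1  # grounding probability; any value in (0, 1] works

def epsilon_grounded_fair_bot(opponent_program, epsilon=EPSILON):
    """With probability epsilon, cooperate outright ("grounding out").
    Otherwise, simulate the opponent playing against this very bot and
    copy its move. Each level of mutual simulation has an independent
    epsilon chance of bottoming out, so the recursion terminates with
    probability 1."""
    if random.random() < epsilon:
        return "C"
    # Run the opponent's program against a copy of ourselves.
    return opponent_program(epsilon_grounded_fair_bot)

# Self-play: every level of the recursion eventually grounds out in "C",
# so two copies of the bot cooperate.
print(epsilon_grounded_fair_bot(epsilon_grounded_fair_bot))  # C

# Against an unconditional defector, it defects except for the
# epsilon chance of grounding out in cooperation.
defect_bot = lambda opp: "D"
moves = [epsilon_grounded_fair_bot(defect_bot) for _ in range(1000)]
print(moves.count("D"))  # roughly (1 - EPSILON) * 1000
```

The grounding probability is what makes the mutual simulation well-founded: without it, two copies of a naive “do what my opponent would do against me” bot would recurse forever.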
Or maybe it means we train the professional in the principles and heuristics that the bot knows. The question is if we can compress the bot’s knowledge into, say, a 1-year training program for professionals.
There are reasons to be optimistic: We can discard information that isn’t knowledge (lossy compression). And we can teach the professional in terms of human concepts (lossless compression).
This sounds like a great goal, if you mean “know” in a lazy sense; I’m imagining a question-answering system that will correctly explain any game, move, position, or principle as the bot understands it. I don’t believe I could know all at once everything that a good bot knows about go. That’s too much knowledge.
The assistant could have a private key generated by the developer, held in a trusted execution environment. The assistant could invoke a procedure in the trusted environment that dumps the assistant’s state and cryptographically signs it. It would be up to the assistant to make a commitment in such a way that it’s possible to prove that a program with that state will never try to break the commitment. Then to trust the assistant you just have to trust the datacenter administrator not to tamper with the hardware, and to trust the developer not to leak the private key.
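A toy sketch of the attestation flow, with HMAC standing in for the enclave’s signature (a real TEE would use an asymmetric attestation key that never leaves the hardware, so verifiers would hold only the public key; all names here are hypothetical):

```python
import hashlib
import hmac
import json

# Hypothetical stand-in for the developer-provisioned key held inside the
# trusted execution environment. In a real deployment this would be an
# asymmetric attestation key, and verifiers would check signatures with
# the corresponding public key rather than sharing the secret.
TEE_KEY = b"developer-provisioned-secret"

def attest_state(state: dict) -> dict:
    """Inside the trusted environment: serialize the assistant's state
    deterministically and sign the result."""
    blob = json.dumps(state, sort_keys=True).encode()
    tag = hmac.new(TEE_KEY, blob, hashlib.sha256).hexdigest()
    return {"state": blob.decode(), "signature": tag}

def verify_attestation(report: dict) -> bool:
    """Outside: anyone who trusts the key can check that the state dump
    was not tampered with after signing."""
    expected = hmac.new(TEE_KEY, report["state"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, report["signature"])

report = attest_state({"commitment": "policy hash abc123"})
print(verify_attestation(report))  # True
```

The hard part, as noted above, is the other half: proving that a program with the attested state will never try to break its commitment.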