But the more interesting question is: what was happening during the thirty seconds that it took me to walk upstairs? I evidently had motivation to continue walking, or I would have stopped and turned around. But my brainstem hadn’t gotten any ground truth yet that there were good things happening. That’s where “defer-to-predictor mode” comes in! The brainstem, lacking strong evidence about what’s happening, sees a positive valence guess coming out of the striatum and says, in effect, “OK, sure, whatever, I’ll take your word for it.”
It seems like there’s some implication here that motivation and positive valence are the same thing?
Is the claim that evolutionarily early versions of behavioral circuits had approximately the form…
if positive reward:
    continue current behavior
else:
    try something else
...but that adding in long-term predictors instead allows for the following algorithm?
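Something roughly like the following, I imagine (this is my own sketch of the contrast, mirroring the block above, not something from the post):

    if the predictor's guess of upcoming reward is positive:
        continue current behavior    # "defer-to-predictor mode": take the predictor's word for it
    else:
        try something else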
One of the ways you can get up in the morning, if you are me, is by looking in the internal direction of your motor plans, and writing into your pending motor plan the image of you getting out of bed in a few moments, and then letting that image get sent to motor output and happen. (To be clear, I actually do this very rarely; it is just a fun fact that this is a way I can defeat bed inertia.)
I do this, or something very much like this.
For me, it’s like the motion of setting a TAP (trigger-action plan), except set to fire imminently rather than on some future trigger: I do cycles of multi-sensory visualization of the behavior in question.
Perhaps I’m just being dense, but I’m confused about why this toy model of a long-term predictor is long-term rather than short-term. I’m trying to think through it aloud in this comment.
A “long-term predictor” is ultimately nothing more than a short-term predictor whose output signal helps determine its own supervisory signal. Here’s a toy model of what that can look like:
At first, I thought the idea was that the latency of the supervisory/error signal was longer than average, and that that latency made the short-term predictor function as a long-term predictor, without it being functionally any different. But then why is it labeled “short-term predictor”?
It seems like the short-term predictor should learn to predict (based on context cues) the behavior triggered by the hardwired circuitry. But it should predict that behavior only 0.3 seconds early?
...
Oh! Is the key point that there’s a kind of resonance, where this system maintains the behavior of the genetically hardwired components? When the switch flips back to defer-to-predictor mode, the short-term predictor is still predicting the hardwired override behavior, which is now trivially “correct”, because whatever the predictor outputs is correct. (It was also correct a moment before, when the switch was in override mode, but not trivially correct.)
This still doesn’t answer my confusion. It seems like the whole circuit is going to maintain the state from the last “ground truth infusion” and learn to predict the timings and magnitudes of the “ground truth infusions”. But it still shouldn’t predict them more than 0.3 seconds in advance?
Is the idea that the lookahead propagates earlier and earlier with each cycle? You start with a 0.3-second prediction. But that means that the supervisory signal (when in “defer-to-predictor mode”) is 0.3 seconds earlier, which means that the predictor learns to predict the change in output 0.6 seconds ahead of when the override “would have happened”, and then 0.9 seconds ahead, and then 1.2 seconds ahead, and so on, until it backs all the way up to when the “prior” ground truth infusion sent a different signal?
Like, the thing that this circuit is doing is simulating time travel, so that it can activate (on average) the next behavior that the genetically hardwired circuitry will output, as soon as “override mode” is turned off?
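To check my own understanding of that “propagates earlier and earlier” story, here is a tiny simulation I wrote. All the specifics are mine, not from the post: one array slot per 0.3-second timestep (standing in for a context cue), a single ground-truth override event, and learning applied once per pass.

    N_STEPS = 20       # timesteps per episode; I treat each step as "0.3 seconds"
    EVENT_STEP = 15    # when the hardwired circuit overrides with ground truth 1.0
    N_EPISODES = 6

    predictor = [0.0] * N_STEPS   # learned output for each timestep (context cue)

    for episode in range(N_EPISODES):
        # Supervisory signal: ground truth in override mode, otherwise a copy of
        # the predictor's own output ("defer-to-predictor mode").
        supervisor = list(predictor)
        supervisor[EVENT_STEP] = 1.0   # the "ground truth infusion"

        # Short-term learning rule: output at step t should match the supervisory
        # signal one step (0.3 s) later.
        predictor = [supervisor[t + 1] for t in range(N_STEPS - 1)] + [0.0]

        first = next((t for t, v in enumerate(predictor) if v > 0.5), None)
        print(f"after pass {episode + 1}, earliest prediction of the event: step {first}")

Each pass, the earliest step at which the predictor anticipates the event moves one slot (0.3 s) earlier, which matches the 0.3 → 0.6 → 0.9 second picture above (and looks a lot like TD-style bootstrapping, if I’m not confusing myself).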
Who cares if a cortex by itself is safe? A cortex by itself was never the plan!
Well, to be fair, I care a lot about whether a cortex by itself is safe, specifically because, if it is, the plan maybe should be to build a cortex (approximately) by itself, directed by control systems very different from those of biological brains, such as text prompts.
And as discussed above (and more in later posts), even if the researchers start trying in good faith to give their AGI an innate drive for being helpful / docile / whatever, they might find that they don’t know how to do so.
Feel free not to respond if this is answered in later posts, but how relevant is it to your model that current LLMs (which are not brain-like and not AGIs) are helpful and docile in the vast majority of contexts?
Is this evidence that would-be AGI developers actually do know how to make their AGIs helpful and docile? Or is it missing the point?
The “Singularity” claim assumes general intelligence
I’m not sure exactly how you’re using the term “general intelligence”, but why does the Singularity claim assume that? Why can’t an “instrumental intelligence” recursively self-improve and seize the universe’s available resources in service of its goals?
but on our interpretation the orthogonality thesis says that one cannot consider this
The orthogonality thesis doesn’t make any claims about what propositions agents can’t consider. Agents can consider whatever propositions they like; that doesn’t mean they’ll be moved by them.
To be more specific, I think this is a bootstrapping issue—I think we need a curiosity drive early in training, but can probably turn it off eventually. Specifically, let’s say there’s an AGI that’s generally knowledgeable about the world and itself, and capable of getting things done, and right now it’s trying to invent a better solar cell. I claim it probably doesn’t need to feel an innate curiosity drive. Instead it may seek new information, and seek surprises, as if it were innately curious, because it has learned through experience that seeking those things tends to be an effective strategy for inventing a better solar cell. In other words, something like curiosity can be motivating as a means to an end, even if it’s not motivating as an end in itself—curiosity can be a learned metacognitive heuristic. See instrumental convergence. But that argument does not apply early in training, when the AGI starts from scratch, knowing nothing about the world or itself. Instead, early in training, I think we really need the Steering Subsystem to be holding the Learning Subsystem’s hand, and pointing it in the right directions, if we want AGI.
Presumably another strategy would be to start with an already trained model as the center of our learning subsystem, and a steering subsystem that points to concepts in that trained model?
Something like: you have an LLM-based agent that can take actions in a text-based game. There’s some additional reward machinery that magically updates the weights of the LLM (based on simple heuristic evaluations of the text context of the game?). You could presumably(?) instantiate such an agent such that it had some goals out of the gate, instead of needing to reward curiosity?
Perhaps this already strays too far from the human-setup to count as “brain-like.”
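To make the shape of that concrete, here is a toy version. Everything in it is made up by me: a stub lookup-table “policy” stands in for the pretrained LLM, the game is one room, and the “reward machinery” is a one-line heuristic over the game text, so none of this is a real API.

    import random

    class StubPretrainedPolicy:
        # Stands in for the already-trained model: it arrives with a repertoire of
        # actions/concepts, and the reward machinery only has to reweight them.
        def __init__(self, actions):
            self.prefs = {a: 1.0 for a in actions}

        def act(self, observation):
            actions, weights = zip(*self.prefs.items())
            return random.choices(actions, weights=weights)[0]

        def update(self, action, reward, lr=0.5):
            # The "magic" weight update, caricatured as simple reinforcement of
            # whatever the heuristic evaluator liked.
            self.prefs[action] = max(0.01, self.prefs[action] + lr * reward)

    def heuristic_reward(game_text):
        # The steering-subsystem analogue: a dumb evaluation of the text context.
        return 1.0 if "treasure" in game_text else -0.1

    def text_game_step(action):
        # A trivial one-room text game.
        return "You open the chest and find treasure!" if action == "open chest" else "Nothing happens."

    policy = StubPretrainedPolicy(["open chest", "go north", "wait"])
    for _ in range(50):
        observation = "You are in a room with a chest."
        action = policy.act(observation)
        policy.update(action, heuristic_reward(text_game_step(action)))

    print(policy.prefs)   # "open chest" ends up dominant: goal-directed from the start

The stub obviously throws away everything interesting (the pretrained model’s rich concepts), but it shows the structure I mean: the goal comes from reward machinery pointing at things the model already represents, rather than from a curiosity drive bootstrapping knowledge from scratch.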
Trying to solve philosophical problems like these on a deadline with intent to deploy them into AI is not a good plan, especially if you’re planning to deploy it even if it’s still highly controversial (i.e., a majority of professional philosophers think you are wrong).
If the majority of professional philosophers do endorse your metaethics, how seriously should you take that?
And conversely, do you think it’s implausible that you could have correctly reasoned your way to a correct metaethics, as validated by a narrower community of philosophers, but not yet have convinced everyone in the field?
The Sequences often emphasize that most people in the world believe in god, so if you’re interested in figuring out the truth, you’ve got to be comfortable confidently rejecting widely held beliefs. What do you say to the person who assesses that academic philosophy is a field broken enough, with incentives warped enough to prevent intellectual progress, that they should discard the opinion of the whole field?
Do you just claim that they’re wrong about that, on the object level, and that this hypothetical person should have more respect for the views of philosophers?
(That said, I’ll observe that there’s an important asymmetry in practice between “almost everyone is wrong in their belief in X, and I’m confident about that” and “I’ve independently reasoned my way to Y, and I’m very confident of it.” Other people are wrong != I am right.)
Did you mean to write “build a Task AI to perform a pivotal act in service of reducing x-risks”? Or did MIRI switch from one to the other at some point early on? I don’t know the history. …But it doesn’t matter, my comment applies to both.
I believe that there was an intentional switch, around 2016 (though I’m not confident in the date), from aiming to design a Friendly CEV-optimizing sovereign AI, to aiming to design a corrigible minimal-Science-And-Engineering-AI to stabilize the world (after which a team of probably-uploads could solve the full version of Friendliness and kick off a foom).
How much was this MIRI’s primary plan? Maybe it was 12 years ago before I interfaced with MIRI?
Reposting this comment of mine from a few years ago, which seems germane to this discussion, but certainly doesn’t contradict the claim that this hasn’t been their plan in the past 12 years.
Here is a video of Eliezer, first hosted on Vimeo in 2011. I don’t know when it was recorded.
[Anyone know if there’s a way to embed the video in the comment, so people don’t have to click out to watch it?]
He states explicitly:
As a research fellow of the Singularity Institute, I’m supposed to first figure out how to build a friendly AI, and then once I’ve done that, go and actually build one.
And later in the video he says:
The Singularity Institute was founded on the theory that in order to get a friendly artificial intelligence someone’s got to build one. So there. We’re just going to have an organization whose mission is ‘build a friendly AI’. That’s us. There’s like various other things that we’re also concerned with, like trying to get more eyes and more attention focused on the problem, trying to encourage people to do work in this area. But at the core, the reasoning is: “Someone has to do it. ‘Someone’ is us.”
None of these advancements have direct impacts on most people’s day-to-day lives.
In contrast, the difference between “I’ve heard of cars, but they’re playthings for the rich” and “my family owns a car” is transformative for individuals and societies.
For example, I think AI safety people often have sort of arbitrary strong takes about things that would be very bad to do, and it’s IMO sometimes been good that Anthropic leadership hasn’t been very pressured by their staff.
Specific examples would be appreciated.
Do you mean things like opposition to open-source? Opposition to pushing-the-SOTA model releases?
Moore’s Law is a phenomenon produced by human cognition and the fact that human civilization runs off human cognition. You can’t expect the surface phenomenon to continue unchanged after the deep causal phenomenon underlying it starts changing. What kind of bizarre worship of graphs would lead somebody to think that the graphs were the primary phenomenon and would continue steady and unchanged when the forces underlying them changed massively?
I used to be compelled by this argument, but I’ve come to have more respect for the god of straight lines on graphs, even though I don’t yet understand how it could possibly work like that.
My summary: When you receive a dire prophecy, you should make it as hard and annoying as possible for the time loop of your dire prophecy to be consistent, because if you reliably act that way, there’s less surface area for dire prophecies to get you?
Can you spell out what you mean here? Doing the jujitsu move where he mobilized the company to threaten to maybe quit if he wasn’t reinstated?