Discord: LemonUniverse (lemonuniverse). Reddit: u/Smack-works. About my situation: here.
I wrote some worse posts before 2024 because I was very uncertain about how events might develop.
So, somehow you’re able to know when an AI is exerting optimization power in “a way that flows through” some specific concepts?
Yes, we’re able to tell if an AI optimizes through a specific class of concepts. In most/all sections of the post I’m assuming the AI generates concepts in a special language (i.e. it’s not just a trained neural network), a language which makes it possible to measure the complexity of concepts. The claim is that if you’re optimizing through concepts of a certain complexity, then you can’t fulfill a task in a “weird” way. If the claim is true and the AI doesn’t think in arbitrary languages, then it’s supposed to be impossible to create a harmful Doppelganger.
But I don’t get if, or why, you think that adds up to anything like the above.
Clarification: only the interpretability section deals with inner alignment. The claims of the previous sections are not supposed to follow from the interpretability section.
Anyway, is the following basically what you’re proposing?
Yes. The special language is supposed to have the property that one can automatically learn whether plans contain good, bad, or unnecessary actions. The AI can’t be arbitrarily smarter than humans, but it’s a general intelligence which doesn’t imitate humans and can know things humans don’t know.
Could you ELI15 the difference between Kolmogorov complexity (KC) and Kolmogorov structure function (KSF)?
Here are some of the things needed to formalize the proposal in the post:
1. A complexity metric defined for different model classes.
2. A natural way to “connect” models. So we can identify the same object (e.g. “diamond”) in two different models. Related: multi-level maps.
I feel something like KSF could tackle 1, but what about 2?
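For reference, and so we’re using the same definitions, here are the standard textbook forms (the usual notation, nothing specific to the post):

```latex
% Plain Kolmogorov complexity of a string x, for a fixed universal machine U:
K(x) = \min \{\, |p| \;:\; U(p) = x \,\}

% Kolmogorov structure function: the log-size of the smallest finite set ("model")
% containing x, among all sets of complexity at most \alpha:
h_x(\alpha) = \min \{\, \log_2 |S| \;:\; x \in S,\ K(S) \le \alpha \,\}
```

Roughly, KC gives a single number (total description length), while the KSF gives a whole curve: how much of x’s structure is captured by models of each complexity budget α.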
Thanks for clarifying! Even if I still don’t fully understand your position, I now see where you’re coming from.
No, I think it’s what humans actually pursue today when given the options. I’m not convinced that these values are static, or coherent, much less that we would in fact converge.
Then those values/motivations should be limited by the complexity of human cognition, since they’re produced by it. Isn’t that trivially true? I agree that values can be incoherent, fluid, and not converging to anything. But building Task AGI doesn’t require building an AGI which learns coherent human values. It “merely” requires an AGI which doesn’t affect human values in large and unintended ways.
No, because we don’t comprehend them, we just evaluate what we want locally using the machinery directly, and make choices based on that.
This feels like arguing over definitions. If you have an oracle for solving certain problems, this oracle can be defined as a part of your problem-solving ability, even if it’s not transparent compared to your other problem-solving abilities. Similarly, the machinery which calculates a complicated function from sensory inputs to judgements (e.g. from Mona Lisa to “this is beautiful”) can be defined as a part of our comprehension ability. Yes, humans don’t know (1) the internals of the machinery or (2) some properties of the function it calculates — but I think you haven’t given an example of how human values depend on knowledge of (1) or (2). You gave an example of how human values depend on the maxima of the function (e.g. the desire to find the most delicious food), but that function having maxima is not an unknown property, it’s a trivial property (some foods taste worse than others, therefore some foods have the best taste).
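To pin down the “trivial property” point, a minimal formal note (my notation, nothing beyond what’s said above):

```latex
% Let F be a finite set of foods and v : F \to \mathbb{R} the judgement computed
% by the opaque machinery. With zero knowledge of v's internals:
\exists\, f^{*} \in F \;:\; v(f^{*}) \ge v(f) \quad \text{for all } f \in F
% "Some food tastes best" follows from v being a real-valued function on a
% finite set, not from any knowledge of how v is computed.
```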
That’s a very big “if”! And simplicity priors are made questionable, if not refuted, by the fact that we haven’t gotten any convergence about human values despite millennia of philosophy trying to build such an explanation.
I agree that ambitious value learning is a big “if”. But Task AGI doesn’t require it.
But to pursue their values, humans should be able to reason about them. To form preferences about a thing, humans should be able to consider the thing. Therefore, human ability to comprehend should limit what humans can care about. At least before humans start unlimited self-modification. I think this logically can’t be false.
Eliezer Yudkowsky is a core proponent of complexity of value, but in Thou Art Godshatter and Protein Reinforcement and DNA Consequentialism he basically makes the point that human values arose from complexity limitations, including limitations imposed by limited brainpower. Some famous alignment ideas (e.g. NAH, Shard Theory) kinda imply that human values are limited by human ability to comprehend, and it doesn’t seem controversial. (The ideas themselves are controversial, but for other reasons.)
If learning values is possible at all, there should be some simplicity biases which help to learn them. Wouldn’t it be strange if those simplicity biases were absolutely unrelated to simplicity biases of human cognition?
Based on your comments, I can guess that something below is the crux:
You define “values” as ~”the decisions humans would converge to after becoming arbitrarily more knowledgeable”. But that’s a somewhat controversial definition (some knowledge can lead to changes in values) and even given that definition it can be true that “past human ability to comprehend limits human values” — since human values were formed before humans explored unlimited knowledge. Some values formed when humans were barely generally intelligent. Some values formed when humans were animals.
You say that values depend on inscrutable brain machinery. But can’t we treat the machinery as a part of “human ability to comprehend”?
You talk about ontology. Humans can care about real diamonds without knowing what physical things the diamonds are made from. My reply: I define “ability to comprehend” based on ability to comprehend functional behavior of a thing under normal circumstances. Based on this definition, a caveman counts as being able to comprehend the cloud of atoms his spear is made of (because the caveman can comprehend the behavior of the spear under normal circumstances), even though the caveman can’t comprehend atomic theory.
Could you confirm or clarify the crux? Your messages felt ambiguous to me. In what specific way is A false?
Are you talking about value learning? My proposal doesn’t tackle advanced value learning. Basically, my argument is “if (A) human values are limited by human ability to comprehend/optimize things and (B) the factors which make something easier or harder to comprehend/optimize are simple, then the AI can avoid accidentally messing up human values — so we can define safe impact measures and corrigibility”. My proposal is not supposed to make the AI learn human values in great detail or extrapolate them out of distribution. My argument is “if A and B hold, then we can draw a box around human values and tell the AI to not mess up the contents of the box — without making the AI useless; yet the AI might not know what exact contents of the box count as ‘human values’”.[1]
The problem with B is that humans have very specialized and idiosyncratic cognitive machinery (the machinery generating experiences) which is much more advanced than human general ability to comprehend things. I interpreted you as making this counterargument in the top level comment. My reply is that I think human values depend on that machinery in a very limited way, so B is still true enough. But I’m not talking about extrapolating something out of distribution. Unless I’m missing your point.
Why those things follow from A and B is not obvious and depends on a non-trivial argument. I tried to explain it in the first section of the post, but might’ve failed.
Thanks for elaborating! This might lead to a crux. Let me summarize the proposals from the post (those summaries can’t replace reading the post though).
Outer alignment:
We define something like a set of primitives. Those primitives are independent from any specific ontology.
We prove[1] that as long as AI acts and interprets tasks using those primitives, it can prevent humans from being killed or brainwashed or disempowered. Even if the primitives are not enough to give a very nuanced definition of a “human” or “brainwashing”. That’s where the “we can express care about incomprehensible things as care about comprehensible properties of incomprehensible things” argument comes into play.
Inner alignment:
We prove that a more complicated model (made of the primitives) can’t deceive a simpler model (made of the primitives). The inner/outer alignment of simple enough models can be verified manually.
We prove that the most complicated model (expressible with the primitives) has at least human-level intelligence.
Bonus: we prove that any model (made of the primitives) is interpretable/learnable by humans, and that you don’t need more complicated models to define corrigibility/honesty.
Disclaimer: the proposals above are not supposed to be practical, merely bounded and conceptually simple.
Why the heck would we be able to define primitives with such wildly nice properties? Because of the argument that human ability to comprehend and act in the world limits what humans might currently care about, and the current human values are enough to express corrigibility. If you struggle to accept this argument, maybe try assuming it’s true and see if you can follow the rest of the logic? Or try to find a flaw in the logic instead of disagreeing with the definitions. Or bring up a specific failure mode.
To have our values interface appropriately with these novel thinking patterns in the AI, including through corrigibility, I think we have to work with “values” that are the sort of thing that can refer / be preserved / be transferred across “ontological” changes.
If you talk about ontological crisis or inner alignment, I tried to address those in the post. By the way, I read most of your blog post and skimmed the rest.
To actually prove it we need to fully formalize the idea, of course. But I think my idea is more specific than many other alignment ideas (e.g. corrigibility, Mechanistic Anomaly Detection, Shard Theory).
I probably disagree. I get the feeling you have an overly demanding definition of “value” which is not necessary for solving corrigibility and a bunch of other problems. Seems like you want to define “value” closer to something like CEV or “caring about the ever-changing semantic essence of human ethical concepts”. But even if we talk about those stronger concepts (CEV-like values, essences), I’d argue the dynamic I’m talking about (“human ability to comprehend limits what humans can care about”) still applies to them to an important extent.
See my response to David about a very similar topic. Lmk if it’s useful.
Basically, I don’t think your observation invalidates any ideas from the post.
The main point of the post is that human ability to comprehend should limit what humans can care about. This can’t be false. Like, logically. You can’t form preferences about things you can’t consider. When it looks like humans form preferences about incomprehensible things, they really form preferences only about comprehensible properties of those incomprehensible things. In the post I make an analogy with a pseudorandom number generator: it’s one thing to optimize a specific state of the PRNG or want the PRNG to work in a specific way, and another thing to want to preserve the PRNG’s current algorithm (whatever it is). The first two goals might be incomprehensible, but the last goal is comprehensible. Caring about friends works in a similar way to caring about a PRNG. (You might dislike this framing for philosophical or moral reasons, that’s valid, but it won’t make object-level ideas from the post incorrect.)
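If it helps, here’s a toy sketch of the PRNG distinction (purely illustrative Python, all names made up):

```python
import random

prng = random.Random(42)
snapshot = prng.getstate()  # a record of "how the PRNG currently is"

# Goal 1: optimize the PRNG into one exact internal state.
# Stating this goal requires detailed knowledge of the PRNG's internals.
target_state = random.Random(7).getstate()
wants_specific_state = (prng.getstate() == target_state)

# Goal 2: make the PRNG produce one exact output sequence.
# Again, specifying *which* sequence requires detailed knowledge.
target_outputs = [0.1, 0.2, 0.3]
wants_specific_outputs = ([prng.random() for _ in range(3)] == target_outputs)

# Goal 3: preserve whatever algorithm/state the PRNG currently has.
# This only needs a snapshot and an equality check, no understanding of internals.
prng.setstate(snapshot)
status_quo_preserved = (prng.getstate() == snapshot)

print(wants_specific_state, wants_specific_outputs, status_quo_preserved)
```

The first two goals require spelling out incomprehensible details; the third is expressible without them.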
Yes, some value judgements (e.g. “this movie is good”, “this song is beautiful”, or even “this is a conscious being”) depend on inscrutable brain machinery, the machinery which creates experience. The complexity of our feelings can be orders of magnitude greater than the complexity of our explicit reasoning. Does it kill the proposal in the post? I think not, for the following reason:
We aren’t particularly good at remembering exact experiences, we like very different experiences, we can’t access each other’s experiences, and we have very limited ways of controlling experiences. So, there should be pretty strict limitations on how much understanding of the inscrutable machinery is required for respecting the current human values. Defining corrigible behavior (“don’t kill everyone”, “don’t seek power”, “don’t mess with human brains”) shouldn’t require answering many specific, complicated machinery-dependent questions (“what separates good and bad movies?”, “what separates good and bad life?”, “what separates conscious and unconscious beings?”).
Also, some thoughts about your specific counterexample (I generalized it to being about experiences in general):
“How stimulating or addicting or novel is this experience?” ← I think those parameters were always comprehensible and optimizable, even in the Stone Age. (In a limited way, but still.) For example, it’s easy to get different gradations of “less addicting experiences” by getting injuries, starving or not sleeping.
“How ‘good’ is this experience in a more nebulous or normative way?” ← I think this is a more complicated value (aesthetic taste), based on simpler values.
Note that I’m using “easy to comprehend” in the sense of “the thing behaves in a simple way most of the time”, not in the sense of “it’s easy to comprehend why the thing exists” or “it’s easy to understand the whole causal chain related to the thing”. I think the latter senses are not useful for a simplicity metric, because they would mark everything as equally incomprehensible.
Note that “I care about taste experiences” (A), “I care about particular chemicals giving particular taste experiences” (B), and “I care about preserving the status quo connection between chemicals and taste experiences” (C) are all different things. B can be much more complicated than C, B might require the knowledge of chemistry while C doesn’t.
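The same pattern in a minimal sketch (hypothetical functions, only meant to show why C needs far less knowledge than B):

```python
def current_taste(chemical: str) -> str:
    """Stand-in for the opaque biological machinery: we can query it, not read it."""
    return {"sucrose": "sweet", "quinine": "bitter"}.get(chemical, "unknown")

# B: to say *which* chemicals should give *which* tastes, we'd need an explicit
# model from molecular structure to experience (chemistry-level knowledge).

# C: to say "preserve the status quo connection", a table of observed
# (chemical, taste) pairs plus an equality check is enough.
reference = {c: current_taste(c) for c in ["sucrose", "quinine"]}

def status_quo_preserved() -> bool:
    return all(current_taste(c) == t for c, t in reference.items())

print(status_quo_preserved())  # True as long as the machinery hasn't been altered
```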
Does any of the above help to find the crux of the disagreement or understand the intuitions behind my claim?
Could you reformulate the last paragraph as “I’m confused how your idea helps with alignment subproblem X”, “I think your idea might be inconsistent or have a failure mode because of Y”, or “I’m not sure how your idea could be used to define Z”?
Wrt the third paragraph. The post is about a corrigible task ASI which could be instructed to protect humans from being killed/brainwashed/disempowered (and which won’t kill/brainwash/disempower people before it’s instructed not to do this). The post is not about value learning in the sense of “the AI learns more or less the entirety of human ethics and can build a utopia on its own”. I think developing my idea could help with such value learning, but I’m not sure I can easily back up this claim. Also, I don’t know how to apply my idea directly to neural networks.
I think I understand you now. Your question seems much simpler than I expected. You’re basically just asking “but what if we’ll want infinitely complicated / detailed values in the future?”
If people iteratively modified themselves, would their preferences become ever more exacting? If so, then it is true that the “variables humans care about can’t be arbitrarily complicated”, but the variables humans care about could define a desire to become a system capable of caring about arbitrarily complicated variables.
It’s OK if the principle won’t be true for humans in the future, it only needs to be true for the current values. Aligning AI to some of the current human concepts should be enough to define corrigibility, low impact, or avoid goodharting. I.e. create a safe Task AGI. I’m not trying to dictate to anyone what they should care about.
Don’t worry about not reading it all. But could you be a bit more specific about the argument you want to make or the ambiguity you want to clarify? I have a couple of interpretations of your question.
Interpretation A:
The post defines a scale-dependent metric which is supposed to tell how likely humans are to care about something.
There are objects which are identical/similar on every scale. Do they break the metric? (Similar questions can be asked about things other than “scale”.) For example, what if our universe contains an identical, but much smaller universe, with countless people in it? Men In Black style. Would the metric say we’re unlikely to care about the pocket universe just because of its size?
Interpretation B:
The principle says humans don’t care about constraining things in overly specific ways.
Some concepts with low Kolmogorov Complexity constrain things in infinitely specific ways.
My response to B is that my metric of simplicity is different from Kolmogorov Complexity.
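To make interpretation B concrete, here’s the kind of toy example I have in mind:

```python
import itertools

# A concept with tiny description length that nevertheless constrains an
# infinite object with perfect specificity: every single bit is pinned down.
def alternating_bits():
    while True:
        yield 0
        yield 1

print(list(itertools.islice(alternating_bits(), 10)))  # [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# "The world's tape must equal this sequence forever" has low Kolmogorov
# Complexity but is an infinitely exacting constraint. This is exactly where
# KC and the post's comprehension/optimization metric are meant to come apart.
```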
Thanks a lot for willingness to go into details. And for giving advice on messaging other researchers.
No offense taken. The marriage option was funny, hope I never get that desperate. Getting official grants is probably not possible for me, but thanks for the suggestion.
by both sides, to be precise
My wording was deliberate. It’s one thing to sanction another country, and another thing to “sanction yourself”.
I’m an independent alignment researcher in Russia. Imagine someone wants to donate money to me (from Europe/UK/America/etc). How can I receive the money? It’s really crucial for me to receive at least $100 per month, at least for a couple of months. Even $40 per month would be a small relief. [EDIT: my latest, most well-received, and only published alignment research is here.]
Below are all the methods I learned about after asking people, Google, YouTube and LLMs:
Crypto. The best option. But currently I’m a noob at crypto.
There are some official ways to get money into Russia, but the bank can freeze those transfers or start an inquiry.
Some freelance artists use Boosty (a Patreon-like site). But Boosty can stall money transfers for months or more. And if your account doesn’t have subscribers and legitimate content, it can raise the site’s suspicion.
Someone from a ‘friendly country’ or from Russia itself could act as an intermediary. The last link on this page refers to a network of Russian alignment researchers. (Additional challenge: I don’t have a smartphone. With a dumbphone it’s not practically possible to register on Telegram. But most Russian alignment researchers are there.)
Get out of Russia. Impossible for me, even with financial help.
What should I do? Is there any system for supporting Russian researchers?
Also, if I approach a fellow Russian researcher about my problem, what should I say? I don’t have experience in this.
Needless to say, the situation is pretty stressful for me. Imagine getting a chance to earn something for your hard work, and then not being able to get even pennies because of absolutely arbitrary restrictions imposed by your own state.
EDIT 2: I got help. Thanks everyone!
Even with chess there are some nuances:
Chess engines use much more brute force than humans. Though I think it’s not that easy to compare who does more calculation, since humans have a lot of memory and pattern recognition. Also, I’ve heard about strong chess engines “without search” (grandmaster level), but haven’t looked into it.
This might be outdated, but chess engines struggle with “fortresses” (a rare position type in chess).
You at various points rely on an assumption that there is one unique scale of complexity (one ladder of properties), and it’ll be shared between the humans and the AI. That’s not necessarily true, which creates a lot of leaks where an AI might do something that’s simple in the AI’s internal representation but complicated in the human’s.
I think there are many somewhat different scales of complexity, but they’re all shared between the humans and the AI, so we can choose any of them. We start with properties which are definitely easy for humans to understand. Then we gradually relax those properties. According to the principle, the relaxed properties will capture all key variables relevant to human values long before top human mathematicians and physicists stop understanding what those properties might describe. (Because most of the time, living a value-filled life doesn’t require using the best mathematical and physical knowledge of the day.) My model: “the entirety of human ontology >>> the part of human ontology a corrigible AI needs to share”.
This raises a second problem, which is the “easy to optimize” criterion, and how it might depend on the environment and on what tech tree unlocks (both physical and conceptual) the agent already has. Pink paint is pretty sophisticated, even though our current society has commodified it so we can take getting some for granted. Starting from no tech tree unlocks at all, you can probably get to hacking humans before you can recreate the Sherwin Williams supply chain.
There are three important possibilities relevant to your hypothetical:
If technology T and human hacking are equally hard to comprehend, then (a) we don’t want the AI to build technology T or (b) the AI should be able to screen off technology T from humans more or less perfectly. For example, maybe producing paint requires complex manipulations with matter, but those manipulations should be screened off from humans. The last paragraph in this section mentions a similar situation.
If technology T is easier to comprehend than human hacking but more expensive (requires more resources), then we should be able to allow the AI to use those resources, if we want to. We should be controlling how many resources the AI uses anyway, so I’m not introducing any unnatural epicycles here.[1]
If humans themselves built technology T which affects them in a complicated way (e.g. drugs), it doesn’t mean the AI should build similar types of technology on its own.
My point here is that I don’t think technology undermines the usefulness of my metric. And I don’t think that’s a coincidence. According to the principle, one or both of the below should be true:
Up to this point in time, technology never affected what’s easy to optimize/comprehend on a deep enough level.
Up to this point in time, humans never used technology to optimize/comprehend (on a deep enough level) most of their fundamental values.
If neither were true, we would believe that technology radically changed fundamental human values at some point in the past. We would see life without technology as devoid of most non-trivial human values.
When the metric is a bit fuzzy and informal, it’s easy to reach convenient/hopeful conclusions about how the human-intended behavior is easy to optimize, but it should be hard to trust those conclusions.
The selling point of my idea is that it comes with a story for why it’s logically impossible for it to fail, or why all of its flaws should be easy to predict and fix. Is it easy to come up with such a story for other ideas? I agree that it’s too early to buy that story. But I think it’s original and probable enough to deserve attention.
Remember that I’m talking about a Task-directed AGI, not a Sovereign AGI.
Can anybody give/reference an ELI5 or ELI15 explanation of this example? How can we use the models without creating them? I know that gradient descent is used to update neural networks, but how can you get the predictions of those NNs without having them?
First things first, defining a special language which creates a safe but useful AGI absolutely is just a restatement of the problem, more or less. But the post doesn’t just restate the problem, it describes the core principle of the language (the comprehension/optimization metric) and makes arguments for why the language should be provably sufficient for solving a big part of alignment.
This section deduces the above from claims A and B. What part of the deduction do you disagree with or find confusing? Here’s how the deduction would apply to the task “protect a diamond from destruction”:
1. $A_k$ cares about an ontologically fundamental diamond. $A_{k+1}$ models the world as clouds of atoms.
2. According to the principle, we can automatically find what object in $A_{k+1}$ corresponds to the “ontologically fundamental diamond”.
3. Therefore, we can know which $A_{k+1}$ plans would preserve the diamond. We can also know whether applying any weird optimization to the diamond is necessary for preserving it. Checking for necessity is probably hard and might require another novel insight. But “necessity” is a simple object-level property.
The automatic finding of the correspondence (step 2) between an important comprehensible concept and an important incomprehensible concept resolves the apparent contradiction.[1]
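As a cartoon of what step 2 asks for (only an illustration of the shape of the problem, not the post’s actual construction; all names are made up):

```python
# Coarse model A_k: the diamond's behavior under normal circumstances,
# described by comprehensible observables.
diamond_coarse = {"hard": True, "transparent": True, "burns_in_open_air": False}

# Fine-grained model A_{k+1}: clouds of atoms, each summarized here by the
# same comprehensible observables (hard-coded purely for the illustration).
atom_clouds = {
    "cloud_17": {"hard": True, "transparent": True, "burns_in_open_air": False},
    "cloud_42": {"hard": False, "transparent": False, "burns_in_open_air": True},
}

def find_correspondence(coarse, fine_objects):
    """Return the fine-grained objects whose coarse-level behavior matches."""
    return [name for name, obs in fine_objects.items() if obs == coarse]

print(find_correspondence(diamond_coarse, atom_clouds))  # ['cloud_17']
```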
Now, without context, step 2 is just a restatement of the ontology identification problem. The first two sections of the post (mostly the first one) explain why the comprehension/optimization metric should solve it. I believe my solution is along the lines of the research avenues Eliezer outlined.
If my principle is hard to agree with, please try to assume that it’s true and see if you can follow how it solves some alignment problems.