Q Home’s Shortform

Q Home 16 Aug 2022 23:52 UTC

2 points


Q Home 11 May 2025 10:00 UTC
20 points
3
I’m an independent alignment researcher in Russia. Imagine someone wants to donate money to me (from Europe/UK/America/etc). How can I receive the money? It’s really crucial for me to receive at least $100 per month, at least for a couple of months. Even $40 per month would be a small relief. [EDIT: my latest, most well-received, and only published alignment research is here.]
Down below are all the methods I learned about after asking people, Google, Youtube and LLMs:
1. Crypto. The best option. But currently I’m a noob at crypto.
2. There are some official ways to get money into Russia, but the bank can freeze those transfers or start an inquiry.
3. Some freelance artists use Boosty (a Patreon-like site). But Boosty can stall money transfers for months or more. If your account doesn’t have subscribers and legitimate content, it can raise the site’s suspicion.
4. Someone from a ‘friendly country’ or from Russia itself could act as an intermediary. The last link on this page refers to a network of Russian alignment researchers. (Additional challenge: I don’t have a smartphone. With a dumbphone it’s not practically possible to register on Telegram. But most Russian alignment researchers are there.)
5. Get out of Russia. Impossible for me, even with financial help.
What should I do? Is there any system for supporting Russian researchers?
Also, if I approach a fellow Russian researcher about my problem, what should I say? I don’t have experience in this.
Needless to say, the situation is pretty stressful to me. Imagine getting a chance to earn something for your hard work, but then you can’t get even pennies because of absolutely arbitrary restrictions imposed by your own state.
EDIT 2: I got help. Thanks everyone!
- ProgramCrafter 11 May 2025 11:34 UTC
  8 points
  1
  1. Re: crypto. Please look carefully at federal law No. 259, article 14, point 5.
    “<… corporate body, companies’ branches...>, individuals residing in the Russian Federation at least 183 days in twelve consecutive months are not allowed to accept digital currency as a counter-provision for goods, work, services and/or something which could be construed as paying for goods (work, services) in digital currency, <… except mining>” (translated in my own words).
    If the crypto you choose meets the definition of digital currency, you need to tread carefully.
  2. “There are some official ways to get money into Russia, but the bank can freeze those transfers or start an inquiry.” True except I’d say “any bank in the actual transfer chain, which happens to include non-Russian banks too”.
  3. Perhaps your research would be of interest to some people if you published it? Though this might require restructuring it, and I’m not familiar with details of your work.
  These additional suggestions might not work in specific situations; not intended to offend in any case.
  6. Get a half-time job.
  7. Get acquainted with, and marry into, an upper-middle-class family.
  8. Get a grant (important: from Russia-aligned sources; if you qualify as youth or another class, try leveraging that).
  9. Weren’t there legal ways to have a contract with non-residents who can pay in whatever currency? That, I don’t know at the moment.
  “because of absolutely arbitrary restrictions imposed by your own state”
  By both sides, to be precise.
  “Also, if I approach a fellow russian researcher about my problem, what should I say? I don’t have experience in this.”
  This message is pretty much sufficient! To be better, it should also include some demonstration of what you will be working on.
  - Tapatakt 11 May 2025 13:31 UTC
    11 points
    4
    “If the crypto you choose meets the definition of digital currency, you need to tread carefully.”
    As long as it’s all about small sums, not really. Russian laws can be oppressive, but Russian… economic vibes… as long as you are poor enough, are actually pretty libertarian.
  - Q Home 12 May 2025 10:34 UTC
    1 point
    0
    Thanks a lot for your willingness to go into details. And for giving advice on messaging other researchers.
    
    No offense taken. The marriage option was funny, hope I never get that desperate. Getting official grants is probably not possible for me, but thanks for the suggestion.
    
    “by both sides, to be precise”
    
    My wording was deliberate. It’s one thing to sanction another country, and another thing to “sanction yourself”.
- Yaroslav Granowski 11 May 2025 12:46 UTC
  4 points
  0
  Fellow Russian researcher here. I doubt small donations can work. Not many people use crypto (I’m the same kind of noob, and I’m not sure it’s worth learning until I have a big audience). I tried to register with OpenCollective so that they would collect money for me before I leave Russia, but they rejected me despite my having a project to showcase.
  If you have something to show, you could try applying for research grants. This is what I’m trying to do. But that only makes sense if you have time and are not so desperate about the near future.
- samuelshadrach 11 May 2025 19:43 UTC
  3 points
  3
  Actually persuading someone to donate to you is harder for most people than figuring out how to use cryptocurrency.
  
  Using crypto is not that hard in the average case. The main habits you need to get into are: a) verify everything, because 90% of all the services and platforms are scams; b) make no mistakes when dealing with large sums, and practice with a small sum first. One mistake can lose all your money.
- Alaric 13 May 2025 10:36 UTC
  2 points
  0
  Maybe you can find a freelance job? For example, people who can do alignment research usually know how to write code. Or maybe you could give private lessons (math or computer science)?
  Yes, you should spend time on this, but I think it could make your situation more stable than donations.
  - Q Home 15 May 2025 5:57 UTC
    1 point
    0
    Reasonable advice. Sadly, I can’t do math or programming. My work is philosophical/conceptual/informal in the way some Arbital articles are informal. In terms of teaching, maybe I could teach chess. I play reasonably well (2100+ in different time formats on Lichess and Chess.com).
Q Home 7 Aug 2025 13:52 UTC
13 points
6
Give AGI humanlike reasoning? (draft of a post)
Alignment plans can be split into two types:
- Usual plans. AI gains capabilities $C$. We figure out how to point $C$ to $T$ (alignment target). There’s no deep connection between $C$ and $T$. One thing is mounted onto the other.
- HRLM plans. We give AI special $C$, with a deep connection to $T$.
HRLM is the idea that there’s some special reasoning/learning method which is crucial for alignment or makes it fundamentally easier. HRLM means “humanlike reasoning/learning method” or “special, human-endorsed reasoning/learning method”. There’s no hard line separating the two types of plans. It’s a matter of degree.
I believe HRLM is ~never discussed in full generality and ~never discussed from a theoretical POV. This is a small post where I want to highlight the idea and facilitate discussion, not make a strong case for it.
Examples of HRLM
(My description of other people’s work is not endorsed by them.)
- Corrigibility. “Corrigible cognition” is a hypothetical, special type of self-reflection ($C$) which is extremely well-suited for learning human values/desires ($T$).
- In “Don’t align agents to evaluations of plans” Alex Turner argues “there’s a correct way to reason ($C$) about goals ($T$) and consequentialist maximization of an ‘ideal’ function is not it”, “‘direct cognition’ ($C$) about goals ($T$) is fundamentally better than ‘indirect cognition’”. Shard Theory, in general, proposes a very special method for learning and thinking about values.
- A post about “follow-the-trying game” by Steve Byrnes basically says “AI will become aligned or misaligned at the stage of generating thoughts, so we need to figure out the ‘correct’ way of generating thoughts ($C$), instead of overfocusing on judging what thoughts are aligned ($T$)”. Steve’s entire agenda is about HRLM.
- Large Language Models. I’m not familiar with the debate, but I would guess it boils down to two possibilities: “understanding human language is a core enough capability ($C$) for an LLM, which makes it inherently more alignable to human goals ($T$)” and “LLMs ‘understand’ human language through some alien tricks which don’t make them inherently more alignable”. If the former is true, LLMs are an example of HRLM.
- Policy Alignment (Abram Demski) is tangentially related, but it’s more in the camp of “usual plans”.
Notice how, despite multiple agendas falling under HRLM (Shard Theory, brain-like AGI, LLM-focused proposals), there’s almost no discussion of HRLM from a theoretical POV. What is, abstractly speaking, “humanlike reasoning”? What are the general principles of it? What are the general arguments for safety guarantees it’s supposed to bring about? What are the True Names here? With Shard Theory, there’s ~zero explanation of how simpler shards aggregate into more complex shards and how it preserves goals. With brain-like AGI, there’s ~zero idea of how to prevent thought generation from bullshitting thought assessment. But those are the very core questions of the agendas. So they barely move us from square one.^[1]
Possibilities
There are many possibilities. It could be that any HRLM handicaps AI’s capabilities (a superintelligence is supposed to be unimaginably better at reasoning than humans, so why wouldn’t it have an alien reasoning method?). It also could be that HRLM is necessary for general intelligence. But maybe general intelligence is overrated...
Here’s what I personally believe right now:
1. What we value is inherently tied to how we think about it. In general, what we think about is often inherently tied to how we think about it.
2. General intelligence is based on a special principle. It has a relatively crisp “core”.
3. Some special computational principle is needed for solving subsystems alignment.
4. If 1-3 are true, they are most likely the same thing. Therefore, HRLM is needed for general intelligence, outer and inner alignment (including subsystems alignment). Separately, I think general intelligence boosts capabilities below peak human level.
I consider 1-3 to be plausible enough postulates. I have no further arguments for 4.
My own ideas about HRLM (to be updated)
I have a couple of very unfinished ideas. Will try to write about them this or the next month.
I believe there could be a special type of cognition which helps to avoid specification gaming and goal misgeneralization. AI should create simple models which describe “costs/benefits” of actions (e.g. “actions” can be body movements, “cost” can be the amount and complexity of movements, “benefit” can be distance covered), this way AI can notice if certain actions produce anomalously high benefit (e.g. maybe certain body movements exploit a glitch in the physics simulation, making the body cover kilometers per second).
“By default, manipulating easier to optimize/comprehend variables is better than manipulating harder to optimize/comprehend variables” — this is the idea from one of my posts. The problem with it is that I only defined “optimization” and “comprehension” for world-models, not for any modelling (= cognition) in general.
A formal algorithm can have parts and it will critically depend on those parts (for example, an algorithm for solving equations might have an absolutely necessary addition sub-algorithm). An informal algorithm can have parts without critically depending on those parts (for example, the algorithm answering “is this a picture of a dog?” might have a sub-algorithm answering “is this patch of pixels the focal point of the image / does it contrast enough with other patches / is it as detailed as the other patches?”—the sub-algorithm is not very necessary, but it lowers pareidolia, by preventing the algorithm from overanalyzing random parts of the image). I think we can say something about the latter type of algorithms, about how they work.
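The cost/benefit idea above (the first of the three) can be illustrated with a toy sketch. This is not a worked-out proposal, only one possible reading: it assumes each action can be summarized by a single cost and a single benefit number, and it swaps in a standard robust-outlier test (median absolute deviation) as the "anomaly" check. The function name, the threshold, and all the numbers are hypothetical.

```python
from statistics import median


def flag_anomalous_actions(actions, threshold=5.0):
    """Return names of actions whose benefit/cost ratio is an extreme high outlier.

    `actions` is a list of (name, cost, benefit) tuples with cost > 0.
    `threshold` is a hypothetical cutoff, measured in median absolute deviations.
    """
    # Simple "model" of each action: benefit obtained per unit of cost.
    ratios = {name: benefit / cost for name, cost, benefit in actions}
    typical = median(ratios.values())
    # Median absolute deviation: a robust estimate of the usual spread.
    spread = median(abs(r - typical) for r in ratios.values())
    spread = spread if spread > 0 else 1e-9  # avoid division by zero
    # Only unusually HIGH ratios are flagged; low ones are merely inefficient.
    return [name for name, r in ratios.items() if (r - typical) / spread > threshold]


# Made-up numbers: cost = movement effort, benefit = distance covered.
observed = [
    ("crawl", 1.0, 0.5),
    ("walk", 1.0, 1.0),
    ("jog", 2.0, 2.0),
    ("run", 4.0, 5.0),
    ("glitch_hop", 1.0, 5000.0),
]
suspicious = flag_anomalous_actions(observed)
```

Here only the glitch-exploiting movement is flagged, because its benefit per unit of cost is wildly out of line with the other actions. A real system would need a much richer notion of "cost" and "benefit" than a single ratio.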
1. ^
  IMO that’s downstream of inner alignment being extremely hard. It’s almost impossible to come up with an at least mildly promising solution which explains, at least in some detail, how the hardest part of the problem might get solved. I’m not trying to throw shade. Also, I might just be ignorant about some ideas in those agendas.
- Seth Herd 7 Aug 2025 21:13 UTC
  6 points
  2
  I just want to note that humans aren’t aligned by default, so creating human-like reasoning and learning is not itself an alignment method. It’s just a different variant of providing capabilities, which you separately need to point at an alignment target.
  
  It may or may not be easier to align than alternatives. I personally don’t think this matters because I strongly believe that the only type of AGI worth aligning is the type(s) most likely to be developed first. Hoping that the industry and society are going to make major changes to AGI development based on which types the researchers think are easier to align seems like a forlorn hope.
  
  More on why it’s a mistake to assume human-like cognition in itself leads to alignment:
  
  Sociopaths/psychopaths are a particularly vivid example of how humans are misaligned. And there are good reasons to think that they are not a special case in which empathy was accidentally left out or deliberately blocked, but that they are the baseline human cognition without the mechanisms that create empathy. It’s tough to make this case for certain, but it’s a very bad idea to assume that humans are aligned by default, so all we’ve got to do is reproduce human-like cognitive mechanisms and maybe train it “in a good family” or similar.
  
  That’s not to argue against human-like approaches to AGI as worse for alignment, just to say that they’re only better in that we have a little better understanding of that type of cognition and some mechanisms by which humans often wind up approximately aligned in common contexts.
  
  My own research is also in using loosely human-like reasoning and learning as a route to alignment, but that’s primarily because a) that’s my background expertise so it’s my relative advantage and b) I think LLMs are very loosely like some parts of the human brain/mind, and that we’ll see continued expansion of LLM agents to reason in more loosely human-like ways (that is, with chains of thought, specific memory lookups, metacognition to organize this, etc).
  
  So I’m working on aligning loosely human-like cognition not because I think it’s by default any easier than aligning any other form of AGI, but just because that’s what seems most likely to become the first takeover-capable (or pivotal-act-capable) AGI.
  - Q Home 8 Aug 2025 8:54 UTC
    1 point
    0
    Yes, it could be that “special, inherently more alignable cognition” doesn’t exist or can’t be discovered by mere mortal humans. It could be that humanlike reasoning isn’t inherently more alignable. Finally, it could be that we can’t afford to study it because the dominating paradigm is different. Also, I realize that glass box AI is a pipe dream.
    Wrt sociopaths/psychopaths. I’m approaching it from a more theoretical standpoint. If I knew a method of building a psychopath AI (caring about something selfish, e.g. gaining money or fame or social power or new knowledge or even paperclips) and knew the core reasons why it works, I would consider it major progress. Because it would solve many alignment subproblems, such as ontology identification and subsystems alignment.
- Caleb Biddulph 7 Aug 2025 16:37 UTC
  4 points
  0
  I think MONA (which I worked on) counts as another example of this. Basically, you can make your agent only care about short-term feedback from a trusted model that imitates humans, so it isn’t motivated to pursue long-term plans that a human wouldn’t endorse.
  I’m also doing an HRLM-like thing for my MATS project. The very high-level idea is to start with a lightly fine-tuned language model, which we assume is aligned/humanlike.^[1] Then we get the model to reason in a humanlike way about how to improve its own performance. This is supposed to be much more interpretable and aligned than RL, which modifies the underlying model in inscrutable ways to maximize reward by any means possible.
  You might even say that “humanlike reasoning and learning” describes my personal research agenda. I’d be excited for people to talk about this sort of thing more!
  Interestingly, both of the methods I mentioned above actually do something different from the dichotomy you described. Rather than “gaining capabilities, then aligning to a target” (usual plans) or “simultaneously gaining capabilities and aligning to a target in a deeply connected way” (HRLM plans), these methods could be described as “aligning to a target, then gaining capabilities.”^[2]
  Btw, the thing that HRLM stands for (Humanlike Reasoning/Learning Method) was a bit buried and at first I thought you didn’t mention it at all. It’d be easier to see if you capitalized it and moved it forward a bit.
  1. ^
    This is not a totally safe assumption. For example, LLMs can easily be jailbroken to act very unhumanlike, and even the standard “assistant character” knows a lot of information that a random human would not. As you pointed out, LLMs are pretty alien. But hopefully in practice, the assumption is mostly true in the ways that matter.
  2. ^
    Technically, we’d be doing the HRLM thing first, during LLM pretraining and fine-tuning, since in the process of getting aligned to humans, the model must gain some baseline level of capabilities. But then we can bootstrap to higher capability levels using methods like MONA.
  - Q Home 8 Aug 2025 6:32 UTC
    1 point
    0
    I’m approaching it from a “theoretical” perspective^[1], so I want to know how “humanlike reasoning” could be defined (beyond “here’s some trusted model which somehow imitates human judgement”) or why human-approved capability gain preserves alignment (like, what’s the core reason, what makes human judgement good?). So my biggest worry is not that the assumption might be false, but that the assumption is barely understood on the theoretical level.
    What are your research interests? Are you interested in defining what “explanation” means (or at least in defining some properties/principles of explanations)? Typical LLM stuff is highly empirical, but I’m kinda following the pipe dream of glass box AI.
    1. ^
    I’m contrasting theoretical and empirical approaches.
    Empirical—“this is likely to work, based on evidence”. Theoretical—“this should work, based on math / logic / philosophy”.
    Empirical—“if we can operationalize X for making experiments, we don’t need to look for a deeper definition”. Theoretical—“we need to look for a deeper definition anyway”.
    - Caleb Biddulph 8 Aug 2025 16:26 UTC
      2 points
      0
      I wouldn’t describe my research as super theoretical, but it does involve me making arguments (although not formal proofs) about why I expect my plans to work even as training continues.
      For example, it seems like in some non-formalized sense, a human is trivially aligned to themselves. The more similar your AI’s behavior is to a particular human’s behavior, the more aligned it is to that human. If you want to align the AI to a group of humans (e.g. all of humanity), you might want to start by emulating a good approximation of all humans and bootstrapping from there. I’m not working on this aspect of the problem directly—I’m just assuming that the LLM is pretty humanlike to begin with—but I wrote a post talking about similar ideas.
Q Home 23 Feb 2023 10:52 UTC
10 points
0
(draft of a future post)

I want to share my model of intelligence and research. You won’t agree with it at first glance. Or at the third glance. (My hope is that you will just give up and agree at the 20th glance.)

But that’s supposed to be good: it means the model is original and brave enough to make risky statements.

In this model any difference in “intelligence levels” or any difference between two minds in general boils down to “commitment level”.

What is “commitment”?

On some level, “commitment” is just a word. It’s not needed to define the ideas I’m going to talk about. What’s much more important is the three levels of commitment. There are often three levels which follow the same pattern, the same outline:

Level 1. You explore a single possibility.

Level 2. You want to explore all possibilities. But you are paralyzed by the number of possibilities. At this level you are interested in qualities of possibilities. You classify possibilities and types of possibilities.

Level 3. You explore all possibilities through a single possibility. At this level you are interested in dynamics of moving through the possibility space. You classify implications of possibilities.

...

I’m going to give specific examples of the pattern above. This post is kind of repetitive, but it wasn’t AI-generated, I swear. Repetition is a part of commitment.

Why is commitment important?

My explanation won’t be clear before you read the post, but here it goes:
- Commitment describes your values and the “level” of your intentionality.
- Commitment describes your level of intelligence (in a particular topic), compared to yourself (your potential) or other people.
- Commitments are needed for communication. Without shared commitments it’s impossible for two people to find a common ground.
- Commitment describes the “true content” of an argument, an idea, a philosophy. Ultimately, any property of a mind boils down to “commitments”.
Basics

1. Commitment to exploration

I think there are three levels of commitment to exploration.

Level 1. You treat things as immediate means to an end.

Imagine two enemy cavemen teleported into a laboratory. They try to use whatever they find to beat each other, without studying or exploring what they’re using. So, they are just throwing microscopes and beakers at each other. They throw anti-matter guns at each other without even activating them.

Level 2. You explore things for the sake of it.

Think about mathematicians. They can explore math without any goal.

Level 3. You use particular goals to guide your exploration of things. Even though you would care about exploring them without any goal anyway. The exploration space is just too large, so you use particular goals to narrow it down.

Imagine a physicist who explores mathematics by considering imaginary universes and applying physical intuition to discover deep mathematical facts. Such a person uses a particular goal/bias to guide “pure exploration”. (Inspired by Edward Witten; see Michael Atiyah’s quote.)

More examples
- In terms of exploring ideas, our culture is at level 1 (angry caveman). We understand ideas only as “ideas of getting something (immediately)” or “ideas of proving something (immediately)”. We are not interested in exploring ideas for the sake of it. The only metrics we apply to ideas are “(immediate) usefulness” and “trueness”, not “beauty”, “originality” and “importance”. People in general are at level 1. Philosophers are at level 1 or “1.5”. The rationality community is at level 1 too (sadly): rationalists still mostly care only about immediate usefulness and truth.
- In terms of exploring argumentation and reasoning, our culture is at level 1. If you never thought “stupid arguments don’t exist”, then you are at level 1: you haven’t explored arguments and reasoning for the sake of it, you immediately jumped to assuming “The Only True Way To Reason” (be it your intuition, scientific method, particular ideology or Bayesian epistemology). You haven’t stepped outside of your perspective a single time. Almost everyone is at level 1. Eliezer Yudkowsky is at level 3, but in a much narrower field: Yudkowsky explored rationality with the specific goal/bias of AI safety. However, overall Eliezer is at level 1 too: he never studied human reasoning outside of what he thinks is “correct”.
I think this is kind of bad. We are at level 1 in the main departments of human intelligence and human culture. Two levels below our true potential.

2. Commitment to goals

I think there are three levels of commitment to goals.

Level 1. You have a specific selfish goal.

“I want to get a lot of money” or “I want to save my friends” or “I want to make a ton of paperclips”, for example.

Level 2. You have an abstract goal. But this goal doesn’t imply much interaction with the real world.

“I want to maximize everyone’s happiness” or “I want to prevent (X) disaster”, for example. This is a broad goal, but it doesn’t imply actually learning and caring about anyone’s desires (until the very end). Rationalists are at this level of commitment.

Level 3. You use particular goals to guide your abstract goals.

Some political activists are at this level of commitment. (But please, don’t bring CW topics here!)

3. Commitment to updating

“Commitment to updating” is the ability to re-start your exploration from square one. I think there are three levels to it.

Level 1. No updating. You never change ideas.

You just keep piling up your ideas into a single paradigm your entire life. You turn beautiful ideas into ugly ones so they fit with all your previous ideas.

Level 2. Updating. You change ideas.

When you encounter a new beautiful idea, you are ready to reformulate your previous knowledge around the new idea.

Level 3. Updating with “check points”. You change ideas, but you use old ideas to prime new ones.

When you explore an idea, you mark some “check points” which you reached with that idea. When you ditch the idea for a new one, you still keep in mind the check points you marked. And use them to explore the new idea faster.

Science

4.1 Commitment and theory-building

I think there are three levels of commitment in theory-building.

Level 1.

You build your theory using only “almost facts”. I.e. you come up with “trivial” theories which are almost indistinguishable from the things we already know.

Level 2.

You build your theory on speculations. You “fantasize” important properties of your idea (which are important only to you or your field).

Level 3.

You build your theory on speculations. But those speculations are important even outside of your field.

I think Eliezer Yudkowsky and LW did theory-building of the 3rd level. A bunch of LW ideas are philosophically important even if you disagree with Bayesian epistemology (Eliezer’s view on ethics and math, logical decision theories and some Alignment concepts).

4.2 Commitment to explaining a phenomenon

I think there are three levels of commitment in explaining a phenomenon.

Level 1.

You just want to predict the phenomenon. But many-many possible theories can predict the phenomenon, so you need something more.

Level 2.

You compare the phenomenon to other phenomena and focus on its qualities.

That’s where most theories go wrong: people become obsessed with their own fantasies about qualities of a phenomenon.

Level 3.

You focus on dynamics which connect this phenomenon to other phenomena. You focus on overlapping implications of different phenomena. 3rd level is needed for any important scientific breakthrough. For example:

Imagine you want to explain combustion (why/how things burn). On one hand you already “know everything” about the phenomenon, so what do you even do? Level 1 doesn’t work. So, you try to think about qualities of burning, types of transformations, types of movement… but that won’t take you anywhere. Level 2 doesn’t work either. The right answer: you need to think not about qualities of transformations and movements, but about dynamics (conservation of mass, kinetic theory of gases) which connect different types of transformations and movements. Level 3 works.

Epistemology pt. 1

5. Commitment and epistemology

I think there are three levels of commitment in epistemology.

Level 1. You assume the primary reality of the physical world. (Physicism)

Take statements “2 + 2 = 4” and “God exists”. To judge those statements, a physicist is going to ask “Do those statements describe reality in a literal way? If yes, they are true.”

Level 2. You assume the primary reality of statements of some fundamental language. (Descriptivism)

To judge statements, a descriptivist is going to ask “Can those statements be expressed in the fundamental language? If yes, they are true.”

Level 3. You assume the primary reality of semantic connections between statements of languages. And the primary reality of some black boxes which create those connections. (Connectivism) You assume that something physical shapes the “language reality”.

To judge statements, a connectivist is going to ask “Do those statements describe an important semantic connection? If yes, they are true.”

...

Recap. Physicist: everything “physical” exists. Descriptivist: everything describable exists. Connectivist: everything important exists. Physicist can be too specific and descriptivist can be too generous. (This pattern of being “too specific” or “too generous” repeats for all commitment types.)

Thinking at the level of semantic connections should be natural to people (because they use natural language and… neural nets in their brains!). And yet this idea is extremely alien to people epistemology-wise.

Implications for rationality

In general, rationalists are “confused” between level 1 and level 2. I.e. they often treat level 2 very seriously, but aren’t fully committed to it.

Eliezer Yudkowsky is “confused” between level 1 and level 3. I.e. Eliezer has a lot of “level 3 ideas”, but doesn’t apply level 3 thinking to epistemology in general.
- On the one hand, Eliezer believes that “the map is not the territory”. (level 1 idea)
- On the other hand, Eliezer believes that math is an “objective” language shaped by the physical reality. (level 3 idea)
- Similarly, Eliezer believes that human ethics are defined by some important “objective” semantic connections (which can evolve, but only to a degree). (level 3)
- “Logical decision theories” treat logic as something created by connections between black boxes. (level 3)
- When you do Security Mindset, you should make not only “correct”, but beautiful maps. Societal properties of your map matter more than your opinions. (level 3)
So, Eliezer has a bunch of ideas which can be interpreted as “some maps ARE the territory”.

6. Commitment and uncertainty

I think there are three levels of commitment in doubting one’s own reasoning.

Level 1.

You’re uncertain about superficial “correctness” of your reasoning. You worry if you missed a particular counterargument. Example: “I think humans are dumb. But maybe I missed a smart human or applied a wrong test?”

Level 2.

You un-systematically doubt your assumptions and definitions. Maybe even your inference rules a little bit (see “inference objection”). Example: “I think humans are dumb. But what is a ‘human’? What is ‘dumb’? What is ‘is’? And how can I be sure of anything at all?”

Level 3.

You doubt the semantic connections (e.g. inference rules) in your reasoning. You consider particular dynamics created by your definitions and assumptions. “My definitions and assumptions create this dynamic (not present in all people). Can this dynamic exploit me?”

Example: “I think humans are dumb. But can my definition of ‘intelligence’ exploit me? Can my pessimism exploit me? Can this be an inconvenient way to think about the world? Can my opinion turn me into a fool even if I’m de facto correct?”

...

Level 3 is like “security mindset” applied to your own reasoning. LW rationality mostly teaches against it, suggesting that you always take your smallest opinions at face value as “the truest thing you know”. With some exceptions, such as “ethical injunctions”, “radical honesty”, “black swan bets” and “security mindset”.

Epistemology pt. 2

7. Commitment to understanding/empathy

I think there are three levels of commitment in understanding your opponent.

Level 1.

You can pass the Ideological Turing Test in a superficial way (you understand the structure of the opponent’s opinion).

Level 2. “Telepathy”.

You can “inhabit” the emotions/mindset of your opponent.

Level 3.

You can describe the opponent’s position as a weaker version/copy of your own position. And additionally you can clearly imagine how your position could turn out to be “the weaker version/copy” of the opponent’s position. You find a balance between telepathy and “my opinion is the only one which makes sense!”

8. Commitment to “resolving” problems

I think there are three levels of commitment in “resolving” problems.

Level 1.

You treat a problem as a puzzle to be solved by Your Favorite True Epistemology.

Level 2.

You treat a problem as a multi-layered puzzle which should be solved on different levels.

Level 3.

You don’t treat a problem as a self-contained puzzle. You treat it as a “symbol” in the multitude of important languages. You can solve it by changing its meaning (by changing/exploring the languages).

Applying this type of thinking to the Unexpected hanging paradox:

I don’t treat this paradox as a chess puzzle: I don’t think it’s something that could be solved or even “made sense of” from the inside. You need outside context. Like, does it ask you to survive? Then you can simply expect the hanging every day and be safe. (Though can you do this to your psychology?) Or does the paradox ask you to come up with formal reasoning rules to solve it? But you can make any absurd reasoning system; to make a meaningful system you need to answer “for what purposes, other than this paradox, is this system going to be needed?”. So, I think that “from the inside” there’s no ground truth (though it can exist “from the outside”). Without context there are a lot of simple, but absurd or trivial, solutions like “ignore logic, think directly about outcomes” or “come up with some BS reasoning system”. Or say “Solomonoff induction solves all paradoxes: even if it doesn’t, it’s the best possible predictor of reality, so just ignore philosophers, lol”.

Alignment pt. 1

9.1 Commitment to morality

I think there are three levels of commitment in morality.

Level 1. Norms, desires.

You analyze norms of specific communities and desires of specific people. That’s quite easy: you are just learning facts.

Level 2. Ethics and meta-ethics.

You analyze similarities between different norms and desires. You get to pretty abstract and complicated values such as “having agency, autonomy, freedom; having an interesting life; having an ability to form connections with other people”. You are lost in contradictory implications, interpretations and generalizations of those values. You have a (meta-)ethical paralysis.

Level 3. “Abstract norms”.

You analyze similarities between implications of different norms and desires. You analyze dynamics created by specific norms. You realize that the most complicated values are easily derivable from the implications of the simplest norms. (Not without some bias, of course, but still.)

I think moral philosophers and Alignment researchers are seriously dropping the ball by ignoring the 3rd level. Acknowledging the 3rd level doesn’t immediately solve Alignment, but it can pretty much “solve” ethics (with a bit of effort).

9.2 Commitment to values

I think there are three levels of values.

Level 1. Inside values (“feeling good”).

You care only about things inside of your mind. For example, do you feel good or not?

Level 2. Real values.

You care about things in the real world. Even though you can’t care about them directly. But you make decisions to not delude yourself and not “simulate” your values.

Level 3. Semantic values.

You care about elements of some real system. And you care about proper dynamics of this system. For example, you care about things your friend cares about. But it’s also important to you that your friend is not brainwashed, not controlled by you. And you accept that one day your friend may stop caring about anything. (Your value may “die” a natural death.)

3rd level is the level of “semantic values”. They are not “terminal values” in the usual sense. They can be temporal and history-dependent.

9.3 Commitment and research interest

So, you’re interested in ways in which an AI can go wrong. What specifically can you be interested in? I think there are three levels to it.

Level 1. In what ways some AI actions are bad?

You classify AI bugs into types. For example, you find “reward hacking” type of bugs.

Level 2. What qualities of AIs are good/bad?

You classify types of bugs into “qualities”. You find such potentially bad qualities as “AI doesn’t care about the real world” and “AI doesn’t allow itself to be fixed (lack of corrigibility)”.

Level 3. What bad dynamics are created by bad actions of AI? What good dynamics are destroyed?

Assume AI turned humanity into paperclips. What’s actually bad about that, beyond the very first obvious answer? What good dynamics did this action destroy? (Some answers: it destroyed the feedback loop, the connection between the task and its causal origin (humanity), the value of paperclips relative to other values, the “economic” value of paperclips, the ability of paperclips to change their value.)

On the 3rd level you classify different dynamics. I think people completely ignore the 3rd level, in both Alignment and moral philosophy. The 3rd level is the level of “semantic values”.

Alignment pt. 2

10. Commitment to Security Mindset

I think Security Mindset has three levels of commitment.

Level 1. Ordinary paranoia.

You have great imagination, you can imagine very creative attacks on your system. You patch those angles of attack.

Level 2. Security Mindset.

You study your own reasoning about safety of the system. You check if your assumptions are right or wrong. Then, you try to delete as many assumptions as you can. Even if they seem correct to you! You also delete anomalies of the system even if they seem harmless. You try to simplify your reasoning about the system seemingly “for the sake of it”.

Level 3.

You design a system which would be safe even in a world with changing laws of physics and mathematics. Using some bias, of course (otherwise it’s impossible).

Humans, or at least idealized humans, are “level 3 safe”. All/almost all current approaches to Alignment don’t give you a “level 3 safe” AI.

11. Commitment to Alignment

I think there are three levels of commitment a (mis)aligned AI can have. Alternatively, those are three or two levels at which you can try to solve the Alignment problem.

Level 1.

AI has a fixed goal or a fixed method of finding a goal (which likely can’t be Aligned with humanity). It respects only its own agency. So, ultimately it does everything it wants.

Level 2.

AI knows that different ethics are possible and is completely uncertain about ethics. AI respects only other people’s agency. So, it doesn’t do anything at all (except preventing, a bit lazily, 100% certain destruction and oppression). Or it requires an infinite chain of permissions:
1. Am I allowed to calculate “2 + 2”?
2. Am I allowed to calculate “2 + 2” even if it leads to a slight change of the world?
3. Am I allowed to calculate “2 + 2” even if it leads to a slight change of the world which you can’t fully comprehend even if I explain it to you?
4. ...
5. Wait, am I allowed to ask those questions? I’m already manipulating you by boring you to death. I can’t even say anything.

Level 3.

AI can respect both its own agency and the agency of humanity. AI finds a way to treat its agency as the continuation of the agency of people. AI makes sure it doesn’t create any dynamic which couldn’t be reversed by people (unless there’s nothing else to do). So, AI can both act and be sensitive to people.

Implications for Alignment research

I think a fully safe system exists only on level 3. The safest system is the system which understands what “exploitation” means, so it never willingly exploits its rewards in any way. Humans are an example of such a system.

I think alignment researchers are “confused” between level 1 and level 3. They try to fix different “exploitation methods” (ways AI could exploit its rewards) instead of making the AI understand what “exploitation” means.

I also think this is the reason why alignment researchers don’t cooperate much, pushing in different directions.

Perception

11. Commitment to properties

Commitments exist even on the level of perception. There are three levels of properties to which your perception can react.

Level 1. Inherent properties.

You treat objects as having more or less inherent properties. “This person is inherently smart.”

Level 2. Meta-properties.

You treat any property as universal. “Anyone is smart under some definition of smartness.”

Level 3. Semantic properties.

You treat properties only as relatively attached to objects: different objects form a system (a “language”) where properties get distributed between them and differentiated. “Everyone is smart, but in a unique way. And those unique ways are important in the system.”

12.1 Commitment to experiences and knowledge

I think there are three levels of commitment to experiences.

Level 1.

You’re interested in particular experiences.

Level 2.

You want to explore all possible experiences.

Level 3.

You’re interested in real objects which produce your experiences (e.g. your friends): you’re interested in what knowledge “all possible experiences” could reveal about them. You want to know where physical/mathematical facts and experiences overlap.

12.2 Commitment to experience and morality

I think there are three levels of investigating the connection between experience and morality.

Level 1.

You study how experience causes us to do good or bad things.

Level 2.

You study all the different experiences “goodness” and “badness” cause in us.

Level 3.

You study dynamics created by experiences, which are related to morality. You study implications of experiences. For example: “loving a sentient being feels fundamentally different from eating a sandwich. food taste is something short and intense, but love can be eternal and calm. this difference helps to not treat other sentient beings as something disposable”

I think the existence of the 3rd level isn’t acknowledged much. And yet it could be important for alignment. Most versions of moral sentimentalism are 2nd level at best. Epistemic Sentimentalism can be 3rd level.

Final part

Specific commitments

You can ponder your commitment to specific things.

Are you committed to information?

Imagine you could learn anything (and forget it if you want). Would you be interested in learning different stuff more or less equally? You could learn something important (e.g. the most useful or the most abstract math), but you also could learn something completely useless—such as the life story of every ant who ever lived.

I know, this question is hard to make sense of: of course, anyone would like to learn everything/almost everything if there was no downside to it. But if you have a positive/negative commitment about the topic, then my question should make some sense anyway.

Are you committed to people?

Imagine you got two extra years to just talk to people. To ordinary people on the street or ordinary people on the Internet.

Would you be bored hanging out with them?

My answers: >!Maybe I was committed to information in general as a kid. Then I became committed to information related to people, produced by people, known by people.!<

My inspiration for writing this post

I encountered a bunch of people who are more committed to exploring ideas (and taking ideas seriously) than usual. More committed than most rationalists, for example.

But I felt those people lack something:
- They are able to explore ideas, but don’t care about that anymore. They care only about their own clusters of idiosyncratic ideas.
- They have very vague goals which are compatible with any specific actions.
- They don’t care if their ideas could even in principle matter to people. They have “disconnected” from other people, from other people’s context (through some level of elitism).
- When they acknowledge you as “one of them”, they don’t try to learn your ideas or share their ideas or argue with you or ask for your help in solving a problem.
So, their commitment remains very low. And they are not “committed” to talking.

Conclusion

If you have a high level of commitment (3rd level) at least to something, then we should find a common language. You may even be like a sibling to me.

Thank you for reading this post. 🗿

Cognition

14.1 Studying patterns

I think there are three levels of commitment to patterns.
1. You study particular patterns.
2. You study all possible patterns: you study qualities of patterns.
3. You study implications of patterns. You study dynamics of patterns: how patterns get updated or destroyed when you learn new information.
14.2 Patterns and causality

I think there are three levels in the relationship between patterns and causality. I’m going to give examples about visual patterns.

Level 1.

You learn which patterns are impossible due to local causal processes.

For example: “I’m unlikely to see a big tower made of eggs standing on top of each other”. It’s just not a stable situation due to very familiar laws of physics.

Level 2.

You learn statistical patterns (correlations) which can have almost nothing to do with causality.

For example: “people like to wear grey shirts”.

Level 3.

You learn patterns which have a strong connection to other patterns and basic properties of images. You could say such patterns are created/prevented by “global” causal processes.

For example: “I’m unlikely to see a place fully filled with dogs. dogs are not people or birds or insects, they don’t create such crowds”. This is very abstract, connects to other patterns and basic properties of images.

Implications for Machine Learning

I think...
- It’s likely that Machine Learning models don’t learn level 3 patterns as well as they could, as sharply as they could.
- Machine Learning models should be 100% able to learn level 3 patterns. It shouldn’t require any specific data.
- Learning/comparing level 3 patterns is interesting enough on its own. It could be its own area of research. But we don’t apply statistics/Machine Learning to try mining those patterns. This may be a missed opportunity for humans.
I think researchers are making a blunder by not asking “what kinds of patterns exist? what patterns can be learned in principle?” (I’m not talking about the universal approximation theorem here.)

15. Cognitive processes

Suppose you want to study different cognitive processes, skills, types of knowledge. There are three levels:
1. You study particular cognitive processes.
2. You study qualities of cognitive processes.
3. You study dynamics created by cognitive processes. How “actions” of different cognitive processes overlap.
I think you can describe different cognitive processes in terms of patterns they learn. For example:
- Causal reasoning learns abstract configurations of abstract objects in the real world. So you can learn stuff like “this abstract rule applies to most objects in the world”.
- Symbolic reasoning learns abstract configurations of abstract objects in your “concept space”. So you can learn stuff like “‘concept A contains concept B’ is an important pattern”.
- Correlational reasoning learns specific configurations of specific objects.
- Mathematical reasoning learns specific configurations of abstract objects. So you can build arbitrary structures with abstract building blocks.
- Self-aware reasoning can transform abstract objects into specific objects. So you can think thoughts like, for example, “maybe I’m just a random person with random opinions”.
I think all this could be easily enough formalized.

Meta-level

Can you be committed to exploring commitment?

I think yes.

One thing you can do is to split topics into sub-topics and raise your commitment in every particular sub-topic. Vaguely similar to gradient descent. That’s what I’ve been doing in this post so far.

Another thing you can do is to apply recursion. You can split any topic into 3 levels of commitment. But then you can split the third level into 3 levels too. So, there’s potentially an infinity of levels of commitment. And there can be many particular techniques for exploiting this fact.

But the main thing is the three levels of “exploring ways to explore commitment”:
1. You study particular ways to raise commitment.
2. You study all possible ways to raise commitment.
3. You study all possible ways through a particular way. You study dynamics and implications which the ways create.
I don’t have enough information or experience for the 3rd level right now.
- Q Home 8 Mar 2023 22:26 UTC
  2 points
  0
  Parent
  *A more “formal” version of the draft (it’s a work in progress):*
  
  There are two interpretations of this post, weak and strong.
  
  Weak interpretation:
  
  I describe a framework about “three levels of exploration”. I use the framework to introduce some of my ideas. I hope that the framework will give more context to my ideas, making them more understandable. I simply want to find people who are interested in exploring ideas. Exploring just for the sake of exploring or for a specific goal.
  
  Strong interpretation:
  
  I use the framework as a model of intelligence. I claim that any property of intelligence boils down to the “three levels of exploration”. Any talent, any skill. The model is supposed to be “self-evident” because of its simplicity; it’s not based on direct analysis of famous smart people.
  
  Take the strong interpretation with a lot of grains of salt, of course, because I’m not an established thinker and I haven’t achieved anything intellectual. I just thought “hey, this is a funny little simple idea, what if all intelligence works like this?”, that’s all.
  
  That said, I’ll need to make a couple of extraordinary claims “from inside the framework” (i.e. assuming it’s 100% correct and 100% useful). Just because that’s in the spirit of the idea. Just because it allows me to explore the idea to its logical conclusion. Definitely not because I’m a crazy man. You can treat the most outlandish claims as sci-fi ideas.
  
  A formula of thinking?
  
  Can you “reduce” thinking to a single formula? (Sounds like cringe and crackpottery!)
  
  Can you show a single path of the best and fastest thinking?
  
  Well, there’s an entire class of ideas which attempt to do this in different fields, especially the first idea:
  - Bayesian epistemology: “epistemology in a single rule” (the rule of updating beliefs)
  - Utilitarianism, preference utilitarianism: “(meta-)ethics in a single rule”
  - Baconian method, the prototype of the scientific method: “science in a single rule”
  - Hegelian dialectic: “philosophy in a single process”
  - Marxist dialectic: “history in a single process”
  My idea is just another attempt at reduction. You don’t have to treat such attempts 100% seriously in order to find value in them. You don’t have to agree with them.
  
  Three levels of exploration
  
  Let’s introduce my framework.
  
  In any topic, there are three levels of exploration:
  1. You study a single X.
  2. You study types of different X. Often I call those types “qualities” of X.
  3. You study types of changes (D): in what ways different X change/get changed by a new thing Y. Y and D need to be important even outside of the (main) context of X.
  The point is that at the 2nd level you study similarities between different X directly, but at the 3rd level you study similarities indirectly through new concepts Y and D. The letter “D” means “dynamics”.
  
  I claim that any property of intelligence can be boiled down to your “exploration level”. Any talent, any skill and even more vague things such as “level of intentionality”. I claim that the best and most likely ideas come from the 3rd level. That 3rd level defines the absolute limit of currently conceivable ideas. So, it also indirectly defines the limit of possible/conceivable properties of reality.
  
  You don’t need to trust those extraordinary claims. If the 3rd level simply sounds interesting enough to you and you’re ready to explore it, that’s good enough.
  
  Three levels simplified
  
  A vague description of the three levels:
  1. You study objects.
  2. You study qualities of objects.
  3. You study changes of objects.
  Or:
  1. You study a particular thing.
  2. You study everything.
  3. You study abstract ways (D) in which the thing is changed by “everything”.
  Or:
  1. You study a particular thing.
  2. You study everything.
  3. You study everything through a particular thing.
  So yeah, it’s a Hegelian dialectic rip-off. Down below are examples of applying my framework to different topics. You don’t need to read them all, of course.
  
  Exploring debates
  
  1. Argumentation
  
  I think there are three levels of exploring arguments:
  1. You judge arguments as right or wrong. Smart or stupid.
  2. You study types of arguments. Without judgement.
  3. You study types of changes (D): how arguments change/get changed by some new thing Y. (“dynamics” of arguments)
  If you want to get a real insight about argumentation, you need to study how (D) arguments change/get changed by some new thing Y. D and Y need to be important even outside of the context of explicit argumentation.
  
  For example, Y can be “concepts”. And D can be “connecting/separating” (a fundamental process which is important in a ton of contexts). You can study in what ways arguments connect and separate concepts.
  
  A simplified political example: a capitalist can tend to separate concepts (“bad things are caused by mistakes and bad actors”), while a socialist can tend to connect concepts (“bad things are caused by systemic problems”). Conflict Vs. Mistake^(1) is just a very particular version of this dynamic. Different manipulations of concepts create different arguments and different points of view. You can study all such dynamics. You can trace arguments back to fundamental concept manipulations. It’s such a basic idea, and yet nobody has done it for arguments in general. Aristotle did it 2400 years ago, but only for formal logic.
  
  ^(1. I don’t agree with Scott Alexander, by the way.)
  
  Arguments: conclusion
  
  I think most of us are at level 1 in argumentation: we throw arguments at each other like angry cavemen without studying what an “argument” is and/or what dynamics it creates. If you completely unironically think that “stupid arguments” exist, then you’re probably on the 1st level. Professional philosophers are at level 2 at best, but usually lower (they are surprisingly judgemental). At least they are somewhat forced to be tolerant of the most diverse types of arguments due to their profession.
  
  On what level are you? Have you studied arguments without judgement?
  
  2. Understanding/empathy
  
  I think there are three levels in understanding your opponent:
  1. You study a specific description (X) of your opponent’s opinion. You can pass the Ideological Turing Test in a superficial way. Like a parrot.
  2. You study types of descriptions of your opponent’s opinion. (“Qualities” of your opponent’s opinion.) You can “inhabit” the emotions/mindset of your opponent.
  3. You study types of changes (D): how the description of your opponent’s opinion changes/gets changed by some new thing Y. D and Y need to be important even outside of debates.
  For example, Y can be “copies of the same thing” and D can be “transformations of copies into each other”. Such Y and D are important even outside of debates.
  
  So, on the 3rd level you may be able to describe the opponent’s position as a weaker version/copy of your own position (Y) and clearly imagine how your position could turn out to be “the weaker version/copy” of the opponent’s views. You can imagine how the opponent’s opinion transforms into truth and your opinion transforms into a falsehood (D).
  
  Other interesting choices of Y and D are possible. For example, Y can be “complexity of the opinion [in a given context]”; D can be “choice of the context” and “increasing/decreasing of complexity”. You can run the opinion of your opponent through different contexts and see how much it reacts to/accommodates the complexity of the world.
  
  Empathy: conclusion
  
  I think people very rarely do the 3rd level of empathy.
  
  Doing it systematically would lead to a new political/epistemological paradigm.
  
  Exploring philosophy
  
  1. Beliefs and ontology
  
  I think there are three levels of studying the connection between beliefs and ontology:
  1. You think you can see the truth of a belief directly. For example, you can say “all beliefs which describe reality in a literal way are true”. You get stuff like Naïve Realism. “Reality is real.”
  2. You study types of beliefs. You can say that all beliefs of a certain type are true. For example, “all mathematical beliefs are true”. You get stuff like Mathematical Universe Hypothesis, Platonism, Ontic Structural Realism… “Some description of reality is real.”
  3. You study types of changes (D): how beliefs change/get changed by some new thing Y. You get stuff like Berkeley’s subjective idealism and radical probabilism and Bayesian epistemology: the world of changing ideas. “Some changing description of reality is real.”
  What can D and Y be? Both things need to be important even outside of the context of explicit beliefs. A couple of versions:
  - Y can be “semantic connections”. D can be “connecting/separating [semantic connections]”. Both things are generally important, for example in linguistics, in studying semantic change. We get Berkeley’s idealism.
  - Y can be “probability mass” or some abstract “weight”. D can be “distribution of the mass/weight”. We get probabilism/Bayesianism.
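  As a rough illustration of what “distribution of the mass” means in the Bayesian case (this is just the standard Bayes’ rule; the hypotheses $H_i$ and the evidence $E$ are symbols introduced here for the sketch, not something from the rest of the post):

  $$P(H_i \mid E) = \frac{P(E \mid H_i)\,P(H_i)}{\sum_j P(E \mid H_j)\,P(H_j)}, \qquad \sum_i P(H_i \mid E) = 1$$

  The total mass always stays equal to 1, so new evidence never creates or destroys belief: it only moves mass between hypotheses. In that sense D really is a “redistribution”.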
  Thinking at the level of semantic connections should be natural to people, because they use natural language and… neural nets in their brains! (Berkeley makes a similar argument: “hey, folks, this is just common sense!”) And yet this idea is extremely alien to people epistemology-wise and ontology-wise. I think the true potential of the 3rd level remains unexplored.
  
  Beliefs: conclusion
  
  I think most rationalists (Bayesians, LessWrong people) are “confused” between the 2nd level and the 1st level, even though they have some 3rd level tools.
  
  Eliezer Yudkowsky is “confused” between the 1st level and the 3rd level: he likes level 1 ideas (e.g. “map is not the territory”), but has a bunch of level 3 ideas (“some maps are the territory”) about math, probability, ethics, decision theory, Security Mindset...
  
  2. Ontology and reality
  
  I think there are three levels of exploring the relationship between ontologies and reality:
  1. You think that an ontology describes the essence of reality.
  2. You study how different ontologies describe different aspects of reality.
  3. You study types of changes (D): how ontologies change/get changed by some other concept Y. D and Y need to be important even outside of the topic of (pure) ontology.
  Y can be “human minds” or simply “objects”. D can be “matching/not matching” or “creating a structure” (two very basic, but generally important processes). You get Kant’s “Copernican revolution” (reality needs to match your basic ontology, otherwise information won’t reach your mind: but there are different types of “matching” and transcendental idealism defines one of the most complicated ones) and Ontic Structural Realism (ontology is not about things, it’s about structures created by things) respectively.
  
  On what level are you? Have you studied ontologies/epistemologies without judgement? What are the most interesting ontologies/epistemologies you can think of?
  
  3. Philosophy overall
  
  I think there are three levels of doing philosophy in general:
  1. You try to directly prove an idea in philosophy using specific philosophical tools.
  2. You study types of philosophical ideas.
  3. You study types of changes (D): how philosophical ideas change/get changed by some other thing Y. D and Y need to be important even outside of (pure) philosophy.
  To give a bunch of examples, Y can be:
  - Society and ethical implications. See Social Ontology, Social Epistemology
  - The full potential of human imagination, and reality’s weirdness. See Immanuel Kant
  - Forming and resolving conflicts and contradictions. See Hegelian dialectic
  - Evolving contexts. See Postmodernism
  - Language and games. See Ludwig Wittgenstein
  - The best predictions about low-level stuff. See Bayesian epistemology
  - Semantic connections. (my weak philosophical attempts are here!)
  - Subjective experience (qualia).
  I think people did a lot of 3rd level philosophy, but we haven’t fully committed to the 3rd level yet. We are used to treating philosophy as a closed system, even when we make significant steps outside of that paradigm.
  
  Exploring ethics
  
  1. Commitment to values
  
  I think there are three levels of values:
  1. Real values. You treat your values as particular objects in reality.
  2. Subjective values. You care only about things inside of your mind. For example, do you feel good or not?
  3. Semantic values. You care about types of changes (D): how your values change/get changed by reality (Y). Your value can be expressed as a combination of the three components: “a real thing + its meaning + changes”.
  Example of a semantic value: you care about your friendship with someone. You will try to preserve the friendship. But in a limited way: you accept that one day the relationship may end naturally (your value may “die” a natural death). Semantic values are temporal and path-dependent. Semantic values are like games embedded in reality: you want to win the game without breaking the rules.
  
  2. Ethics
  
  I think there are three levels of analyzing ethics:
  1. You analyze norms of specific communities and desires of specific people. That’s quite easy: you are just learning facts.
  2. You analyze types of norms and desires. You are lost in contradictory implications, interpretations and generalizations of people’s values. You have a meta-ethical paralysis.
  3. You study types of changes (D): how norms and desires change/get changed by some other thing Y. D and Y need to be important even outside of (purely) ethical context.
  Ethics: tasks and games
  
  For example, Y can be “tasks, games, activities” and D can be “breaking/creating symmetries”. You can study how norms and desires affect properties of particular activities.
  
  Let’s imagine an Artificial Intelligence or a genie who fulfills our requests (it’s a “game” between us). We can analyze how bad actions of the genie can break important symmetries of the game. Let’s say we asked it to make us a cup of coffee:
  - If it killed us after making the coffee, we can’t continue the game. And we ended up with less than we had before. And we wouldn’t make the request if we knew that’s gonna happen. And the game can’t be “reversed”: the players are dead.
  - If it has taken us under mind control, we can’t affect the game anymore (and it gained 100% control over the game). If it placed us into a delusion, then the state of the game can be arbitrarily affected (by dissolving the illusion). And depends on perspective.
  - If it made us addicted to coffee, we can’t stop or change the game anymore. And the AI/genie drastically changed the nature of the game without our consent. It changed how the “coffee game” relates to all other games, skewed the “hierarchy of games”.
  Those are all “symmetry breaks”. And such symmetry breaks are bad in most of the tasks.
  
  Ethics: Categorical Imperative
  
  With Categorical Imperative, Kant explored a different choice of Y and D. Now Y is “roles of people”, “society” and “concepts”; D is “universalization” and “becoming incoherent/coherent” and other things.
  
  Ethics: Preferences
  
  If Y is “preferences” and D is “averaging”, we get Preference utilitarianism. (Preferences are important even outside of ethics and “averaging” is important everywhere.) But this idea is too “low-level” to use in analysis of ethics.
  
  However, if Y is “versions of an abstract preference” and D is “splitting a preference into versions” and “averaging”, then we get a high-level analog of preference utilitarianism. For example, you can take an abstract value such as Bodily autonomy and try to analyze the entirety of human ethics as an average of versions (specifications) of this abstract value.
  
  Preference utilitarianism reduces ethics to an average of micro-values, the idea above reduces ethics to an average of a macro-value.
  
  Ethics: conclusion
  
  So, what’s the point of the 3rd level of analyzing ethics? The point is to find objective sub-structures in ethics where you can apply deduction to exclude the most “obviously awful” and “maximally controversial and irreversible” actions. The point is to “derive” ethics from much broader topics, such as “meaningful games” and “meaningful tasks” and “coherence of concepts”.
  
  I think:
  - Moral philosophers and Alignment researchers are ignoring the 3rd level. People are severely underestimating how much they know about ethics.
  - Acknowledging the 3rd level doesn’t immediately solve Alignment, but it can “solve” ethics or the discourse around ethics. Empirically: just study properties of tasks and games and concepts!
  - Eliezer Yudkowsky has limited 3rd level understanding of meta-ethics (“Abstracted Idealized Dynamics”, “Morality as Fixed Computation”, “The Bedrock of Fairness”) but misses that he could make his idea more broad.
  - Particularism (in ethics and reasoning in general) could lead to the 3rd level understanding of ethics.
  Exploring perception
  
  1. Properties
  
  There are three levels of looking at properties of objects:
  1. Inherent properties. You treat objects as having more or less inherent properties. E.g. “this person is inherently smart”
  2. Meta-properties. You treat any property as universal. E.g. “anyone is smart under some definition of smartness”
  3. Semantic properties. You treat properties only as relatively attached to objects. You focus on types of changes (D): how properties and their interpretations change/get changed by some other thing Y. You “reduce” properties to D and Y. E.g. “anyone can be a genius or a fool under certain important conditions” or “everyone is smart, but in a unique and important way”
  2. Commitment to experiences and knowledge
  
  I think there are three levels of commitment to experiences:
  1. You’re interested in particular experiences.
  2. You want to explore all possible experiences.
  3. You’re interested in types of changes (D): how your experience changes/gets changed by some other thing Y. D and Y need to be important even outside of experience.
  So, on the 3rd level you care about interesting ways (D) in which experiences correspond to reality (Y).
  
  3. Experience and morality
  
  I think there are three levels of investigating the connection between experience and morality:
  1. You study how experience causes us to do good or bad things.
  2. You study all the different experiences “goodness” and “badness” causes in us.
  3. You study types of changes (D): how your experience changes/gets changed by some other thing Y. D and Y need to be important even outside of experience. But related to morality anyway.
  For example, Y can be “[basic] properties of concepts” and D can be “matches / mismatches [between concepts and actions towards them]”. You can study how experience affects properties of concepts which in turn bias actions. An example of such analysis: “loving a sentient being feels fundamentally different from eating a sandwich. food taste is something short and intense, but love can be eternal and calm. this difference helps to not treat other sentient beings as something disposable”
  
  I think the existence of the 3rd level isn’t acknowledged much. Most versions of moral sentimentalism are 2nd level at best. Epistemic Sentimentalism can be 3rd level in the best case.
  
  Exploring cognition
  
  1. Patterns
  
  I think there are three levels of [studying] patterns:
  1. You study particular patterns (X). You treat patterns as objective configurations in reality.
  2. You study all possible patterns. You treat patterns as subjective qualities of information, because most patterns are fake.
  3. You study types of changes (D): how patterns change/get changed by some other thing Y. D and Y need to be important even outside of (explicit) pattern analysis. You treat a pattern as a combination of the three components: “X + Y + D”.
  For example, Y can be “pieces of information” or “contexts”: you can study how patterns get discarded or redefined (D) when new information gets revealed/new contexts get considered.
  
  You can study patterns which are “objective”, but exist only in a limited context. For example, think about your friend’s bright personality (personality = a pattern). It’s an “objective” pattern, and yet it exists only in a limited context: the pattern would dissolve if you compared your friend to all possible people. Or if you saw your friend in all possible situations they could end up in. Your friend’s personality has some basis in reality (X), has a limited domain of existence (Y) and the potential for change (D).
  
  2. Patterns and causality
  
  I think there are three levels in the relationship between patterns and causality. I’m going to give examples about visual patterns:
  1. You learn which patterns are impossible due to local causal processes. For example: “I’m unlikely to see a big tower made of eggs standing on top of each other”. It’s just not a stable situation due to very familiar laws of physics.
  2. You learn statistical patterns (correlations) which can have almost nothing to do with causality. For example: “people like to wear grey shirts”.
  3. You learn types of changes (D): how patterns change/get changed by some other thing Y. D and Y need to be important even outside of (explicit) pattern analysis. And related to causality.
  Y can be “basic properties of images” and “basic properties of patterns”; D can be “sharing properties” and “keeping the complexity the same”. In simpler words:
  
  On the 3rd level you learn patterns which have strong connections to other patterns and basic properties of images. You could say such patterns are created/prevented by “global” causal processes. For example: “I’m unlikely to see a place fully filled with dogs. dogs are not people or birds or insects, they don’t create such crowds or hordes”. This is very abstract, connects to other patterns and basic properties of images.
  
  Causality: implications for Machine Learning
  
  I think...
  - It’s likely that Machine Learning models don’t learn 3rd level patterns as well as they could, as sharply as they could.
  - Machine Learning models should be 100% able to learn 3rd level patterns. It shouldn’t require any specific data.
  - Learning/comparing level 3 patterns is interesting enough on its own. It could be its own area of research. But we don’t apply statistics/Machine Learning to try mining those patterns. This may be a missed opportunity for humans.
  3. Cognitive processes
  
  Suppose you want to study different cognitive processes, skills, types of knowledge. There are three levels:
  1. You study particular cognitive processes.
  2. You study types (qualities) of cognitive processes. And types of types (classifications).
  3. You study types of changes (D): how cognitive processes change/get changed by some other thing Y. D and Y need to be important even without the context of cognitive processes.
  For example, Y can be “fundamental configurations / fundamental objects” and D can be “finding a fundamental configuration/object in a given domain”. You can “reduce” different cognitive processes to those Y and D: (names of the processes below shouldn’t be taken 100% literally)
  
  ^(1 “fundamental” means “VERY widespread in a certain domain”)
  - Causal reasoning learns fundamental configurations of fundamental objects in the real world. So you can learn stuff like “this abstract rule applies to most objects in the world”.
  - Symbolic reasoning learns fundamental configurations of fundamental objects in your “concept space”. So you can learn stuff like “‘concept A containing concept B’ is an important pattern” (see set relations).
  - Correlational reasoning learns specific configurations of specific objects.
  - Mathematical reasoning learns specific configurations of fundamental objects. So you can build arbitrary structures with abstract building blocks.
  - Self-aware reasoning can transform fundamental objects into specific objects. So you can think thoughts like, for example, “maybe I’m just a random person with random opinions” (you consider your perspective as non-fundamental) or “maybe the reality is not what it seems”.
  I know, this looks “funny”, but I think all this could be easily enough formalized. Isn’t that a natural way to study types of reasoning? Just ask what knowledge a certain type of reasoning learns!
  
  Exploring theories
  
  1. Science
  
  I think there are three ways of doing science:
  1. You predict a specific phenomenon.
  2. You study types of phenomena. (qualities of phenomena)
  3. You study types of changes (D): how the phenomenon changes/gets changed by some other thing Y. D and Y need to be important even outside of this phenomenon.
  Imagine you want to explain combustion (why/how things burn):
  1. You try to predict combustion. This doesn’t work, because you already know “everything” about burning and there are many possible theories. You end up making things up because there’s not enough new data.
  2. You try to compare combustion to other phenomena. You end up fantasizing about imaginary qualities of the phenomenon. At this level you get something like theories of “classical elements” (fantasies about superficial similarities).
  3. You find or postulate a new thing (Y) which affects/gets affected (D) by combustion. Y and D need to be important in many other phenomena. If Y is “types of matter” and D is “releasing / absorbing”, this gives you Phlogiston theory. If Y is “any matter” and D is “conservation of mass” and “any transformations of matter”, you get Lavoisier’s theory. If Y is “small pieces of matter (atoms)” and D is “atoms hitting each other”, you get Kinetic theory of gases.
  So, I think phlogiston theory was a step in the right direction, but it failed because the choice of Y and D wasn’t abstract enough.
  
  I think most significant scientific breakthroughs require level 3 ideas. Partially “by definition”: if a breakthrough is not “level 3”, then it means it’s contained in a (very) specific part of reality.
  
  2. Math
  
  I think there are three ways of doing math:
  1. You explore specific mathematical structures.
  2. You explore types of mathematical structures. And types of types. And typologies. At this level you may get something like Category theory.
  3. You study types of changes (D): how equations change/get changed by some other thing Y. D and Y need to be important even outside of (explicit) math.
  Mathematico-philosophical insights
  
  Let’s look at math through the lens of the 3rd level:
  - Let Y be “infinitely small building blocks” and “infinitely diminishing building blocks”; let D be “becoming infinitely small” and “reaching the limit”. Those Y and D matter even outside of math. We got Calculus.
  - Let Y be “quasi-physical materials” and D be “stretching, bending etc.”. Those Y and D matter even outside of math. We got Topology.
  - Let Y be “probability”. That was a completely new concept in all domains of knowledge. We got Probability theory.
  - Let Y be “different scales” and “different parts”; let D be “(not) repeating”. We got Fractals and Recursion.
  - Let Y be “directed things” and D be “compositions of movements”. We got Vectors.
  - Let Y be “things that do basic stuff” and D be “doing sequences of basic stuff”. We got Theory of computation and Computational complexity theory.
  - Let Y be “games” and “utilities”. We got Utility theory and St. Petersburg paradox, Game theory… and even a new number system (“surreal numbers”).
  - Let Y be “sets” and D be “basic set relationships”. Those ideas are important in all areas of knowledge. We got Set theory.
  - Let Y be “infinity” and D be “counting” and “making sets”. Those are philosophically important things. We got actual infinity, Hilbert’s Hotel, Cantor’s diagonal argument, Absolute Infinite and The Burali-Forti paradox...
  All concepts above are “3rd level”. But we can classify them, creating new three levels of exploration (yes, this is recursion!). Let’s do this. I think there are three levels of mathematico-philosophical concepts:
  1. Concepts that change the properties of things we count. (e.g. topology, fractals, graph theory)
  2. Concepts that change the meaning of counting. (e.g. probability, computation, utility, sets, group theory, Gödel’s incompleteness theorems and Tarski’s undefinability theorem)
  3. Concepts that change the essence of counting. (e.g. Calculus, vectors, probability, actual infinity, fractal dimensions)
  So, Calculus is really “the king of kings” and “the insight of insights”. 3rd level of the 3rd level.
  
  3. Physico-philosophical insights
  
  I would classify physico-philosophical concepts as follows:
  1. Concepts that change the way movement affects itself. E.g. Net force, Wave mechanics, Huygens–Fresnel principle
  2. Concepts that change the “meaning” of movement. E.g. the idea of reference frames (principles of relativity), curved spacetime (General Relativity), the idea of “physical fields” (classical electromagnetism), conservation laws and symmetries, predictability of physical systems.
  3. Concepts that change the “essence” of movement, the way movement relates to basic logical categories. E.g. properties of physical laws and theories (Complementarity; AdS/CFT correspondence), the beginning/existence of movement (cosmogony, “why is there something rather than nothing?”, Mathematical universe hypothesis), the relationship between movement and infinity (Supertasks) and computation/complexity, the way “possibility” spreads/gets created (Quantum mechanics, Anthropic principle), the way “relativity” gets created (Mach’s principle), the absolute mismatch between perception and the true nature of reality (General Relativity, Quantum Mechanics), the nature of qualia and consciousness (Hard problem of consciousness), the possibility of Theory of everything and the question “how far can you take [ontological] reductionism?”, the nature of causality and determinism, the existence of space and time and matter and their most basic properties, interpretation of physical theories (interpretations of quantum mechanics).
  Exploring meta ideas
  
  To define “meta ideas” we need to think about many pairs of “Y, D” simultaneously. This is the most speculative part of the post. Remember, you can treat those speculations simply as sci-fi ideas.
  
  Each pair of abstract concepts (Y, D) defines a “language” for describing reality. And there’s a meta-language which connects all those languages. Or rather there’s many meta-languages. Each meta-language can be described by a pair of abstract concepts too (Y, D).
  
  ^(Instead of “languages” I could use the word “models”. But I wanted to highlight that those “models” don’t have to be formal in any way.)
  
  I think the idea of “meta-languages” can be used to analyze:
  - Consciousness. You can say that consciousness is “made of” multiple abstract interacting languages. On one hand it’s just a trivial description of consciousness, on the other hand it might have deeper implications.
  - Qualia. You can say that qualia is “made of” multiple abstract interacting languages. On one hand this is a trivial idea (“qualia is the sum of your associations”), on the other hand this formulation adds important specific details.
  - The ontology of reality. You can argue that our ways to describe reality (“physical things” vs. purely mathematical concepts, subjective experience vs. physical world, high-level patterns vs. complete reductionism, physical theory vs. philosophical ontology) all conflict with each other and lead to paradoxes when taken to the extreme, but can’t exist without each other. Maybe they are all intertwined?
  - Meta-ethics. You can argue that concepts like “goodness” and “justice” can’t be reduced to any single type of definition. So, you can try to reduce them to a synthesis of many abstract languages. See G. E. Moore ideas about indefinability: the naturalistic fallacy, the open-question argument.
  According to the framework, ideas about “meta-languages” define the limit of conceivable ideas.
  
  If you think about it, it’s actually quite a trivial statement: “meta-models” (consisting of many normal models) are the limit of conceivable models. Your entire conscious mind is such a “meta-model”. If no model works for describing something, then a “meta-model” is your last resort. On one hand “meta-models” are a very trivial idea^(1), on the other hand nobody ever cared to explore the full potential of the idea.
  
  ^(1 for example, we have a “meta-model” of physics: a combination of two wrong theories, General Relativity and Quantum Mechanics.)
  
  Nature of percepts
  
  I talked about qualia in general. Now I just want to throw out my idea about the nature of particular percepts.
  
  There are theories and concepts which link percepts to “possible actions” and “intentions”: see Affordance. I like such ideas, because I like to think about types of actions.
  
  So I have a variation of this idea: I think that any percept gets created by an abstract dynamic (Y, D) or many abstract dynamics. Any (important) percept corresponds to a unique dynamic. I think abstract dynamics bind concepts.
  
  ^(But I have only started to think about this. I share it anyway because I think it follows from all the other ideas.)
  
  P.S.
  
  Thank you for reading this.
  
  If you want to discuss the idea, please focus on the idea itself and its particular applications. Or on exploring particular topics!
Q Home 28 Jan 2025 7:56 UTC
6 points
0
Epistemic status: Draft of a post. I want to propose a method of learning environmental goals (a super big, super important subproblem in Alignment). It’s informal, so it has a lot of gaps. I worry I missed something obvious, rendering my argument completely meaningless. I asked the LessWrong feedback team, but they couldn’t get someone knowledgeable enough to take a look.

Can you tell me the biggest conceptual problems of my method? Can you tell me if agent foundations researchers are aware of this method or not?

If you’re not familiar with the problem, here’s the context: Environmental goals; identifying causal goal concepts from sensory data; ontology identification problem; Pointers Problem; Eliciting Latent Knowledge.

Explanation 1

One naive solution

Imagine we have a room full of animals. AI sees the room through a camera. How can AI learn to care about the real animals in the room rather than their images on the camera?

Assumption 1. Let’s assume AI models the world as a bunch of objects interacting in space and time. I don’t know how critical or problematic this assumption is.

Idea 1. Animals in the video are objects with certain properties (they move continuously, they move with certain relative speeds, they have certain sizes, etc). Let’s make the AI search for the best world-model which contains objects with similar properties (P properties).

Problem 1. Ideally, AI will find clouds of atoms which move similarly to the animals on the video. However, AI might just find a world-model (X) which contains the screen of the camera. So it’ll end up caring about “movement” of the pixels on the screen. Fail.

Observation 1. Our world contains many objects with P properties which don’t show up on the camera. So, X is not the best world-model containing the biggest number of objects with P properties.

Idea 2. Let’s make the AI search for the best world-model containing the biggest number of objects with P properties.

Question 1. For “Idea 2” to make practical sense, we need to find a smart way to limit the complexity of the models. Otherwise AI might just make any model contain arbitrary amounts of any objects. Can we find the right complexity prior?

Question 2. Assume we resolved the previous question positively. What if “Idea 2” still produces an alien ontology humans don’t care about? Can it happen?

Question 3. Assume everything works out. How do we know that this is a general method of solving the problem? We have an object in sense data (A), we care about the physical thing corresponding to it (B): how do we know B always behaves similarly to A and there are always more instances of B than of A?
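
To make Idea 2 and Question 1 a bit more concrete, here is a minimal sketch of a scoring rule that rewards world-models for containing many objects with P properties and penalizes them for complexity. Everything in it is an assumption made for illustration: the `WorldModel` fields, the `bits_per_object` exchange rate and the numbers in `candidates` are invented, and whether a sensible penalty of this kind exists at all is exactly what Question 1 asks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorldModel:
    name: str
    p_objects: int            # number of objects with P properties in the model
    description_bits: float   # hypothetical complexity measure of the model
    explains_video: bool      # does the model reproduce the camera feed?


def score(model: WorldModel, bits_per_object: float = 50.0) -> float:
    """Count P-objects, minus a complexity penalty (an assumed, arbitrary prior)."""
    if not model.explains_video:
        return float("-inf")
    return model.p_objects - model.description_bits / bits_per_object


def best_model(models, bits_per_object: float = 50.0) -> WorldModel:
    """Pick the candidate with the highest penalized P-object count."""
    return max(models, key=lambda m: score(m, bits_per_object))


# Made-up candidates: the screen-only model X, a model of the physical room,
# and a model padded with arbitrary extra objects.
candidates = [
    WorldModel("screen_only", p_objects=5, description_bits=100.0, explains_video=True),
    WorldModel("physical_room", p_objects=40, description_bits=400.0, explains_video=True),
    WorldModel("padded", p_objects=10_000, description_bits=1_000_000.0, explains_video=True),
]

print(best_model(candidates).name)
```

With these invented numbers the physical-room model wins, and the padded model loses because each extra object costs more complexity than it earns. A different exchange rate could flip the ranking, which is why the choice of prior matters.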

One philosophical argument

I think there’s a philosophical argument which allows us to resolve Questions 2 & 3 (giving evidence that Question 1 should be resolvable too).
- By default, we only care about objects with which we can “meaningfully” interact in our daily life. This guarantees that B always has to behave similarly to A, in some technical sense (otherwise we wouldn’t be able to meaningfully interact with B). Also, sense data is a part of reality, so B includes A, therefore there are always more instances of B than of A, in some technical sense. This resolves Question 3.
- By default, we only care about objects with which we can “meaningfully” interact in our daily life. This guarantees that models of the world based on such objects are interpretable. This resolves Question 2.
- Can we define what “meaningfully” means? I think that should be relatively easy, at least in theory. There doesn’t have to be One True Definition Which Covers All Cases.
If the argument is true, the pointers problem should be solvable without the Natural Abstraction hypothesis being true.

Anyway, I’ll add a toy example which hopefully helps to better understand what this is all about.

One toy example

You’re inside a 3D video game. 1st person view. The game contains landscapes and objects, both made of small balls (the size of tennis balls) of different colors. Also a character you control.

The character can push objects. Objects can break into pieces. Physics is Newtonian. Balls are held together by some force. Balls can have dramatically different weights.

Light is modeled by particles. Sun emits particles, they bounce off of surfaces.

The most unusual thing: as you move, your coordinates are fed into a pseudorandom number generator. The numbers from the generator are then used to swap places of arbitrary balls.
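
A minimal sketch of this swapping rule, assuming the balls are stored in a flat list and the generator is seeded directly from the rounded player coordinates. The function name `swap_balls`, the rounding precision and `n_swaps` are illustrative assumptions, not part of the game description.

```python
import random


def swap_balls(balls, player_xyz, n_swaps=3):
    """Return a copy of `balls` with a few pairs swapped.

    The swaps are fully determined by the player's coordinates:
    the same position always produces the same rearrangement.
    """
    x, y, z = player_xyz
    # Seed a private generator from the coordinates (string seeds are deterministic).
    rng = random.Random(f"{x:.2f},{y:.2f},{z:.2f}")
    swapped = list(balls)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(swapped)), 2)
        swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


balls = list(range(10))  # stand-ins for ball identities
print(swap_balls(balls, (1.0, 2.0, 3.0)))
print(swap_balls(balls, (1.0, 2.0, 3.5)))
```

The point of the sketch is that individual balls are hard to track or control (their places depend on where you stand), while the structures they form stay the same set of balls.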

You care about pushing boxes (like everything else, they’re made of balls too) into a certain location.

...

So, the reality of the game has roughly 5 levels:
1. The level of sense data (2D screen of the 1st person view).
2. A. The level of ball structures. B. The level of individual balls.
3. A. The level of waves of light particles. B. The level of individual light particles.
I think AI should be able to figure out that it needs to care about 2A level of reality. Because ball structures are much simpler to control (by doing normal activities with the game’s character) than individual balls. And light particles are harder to interact with than ball structures, due to their speed and nature.

Explanation 2

An alternative explanation of my argument:
1. Imagine activities which are crucial for a normal human life. For example: moving yourself in space (in a certain speed range); moving other things in space (in a certain speed range); staying in a single spot (for a certain time range); moving in a single direction (for a certain time range); having varied visual experiences (changing in a certain frequency range); etc. Those activities can be abstracted into mathematical properties of certain variables (speed of movement, continuity of movement, etc). Let’s call them “fundamental variables”. Fundamental variables are defined using sensory data or abstractions over sensory data.
2. Some variables can be optimized (for a long enough period of time) by fundamental variables. Other variables can’t be optimized (for a long enough period of time) by fundamental variables. For example: proximity of my body to my bed is an optimizable variable (I can walk towards the bed — walking is a normal activity); the amount of things I see is an optimizable variable (I can close my eyes or hide some things — both actions are normal activities); closeness of two particular oxygen molecules might be a non-optimizable variable (it might be impossible to control their positions without doing something weird).
3. By default, people only care about optimizable variables. Unless there are special philosophical reasons to care about some obscure non-optimizable variable which doesn’t have any significant effect on optimizable variables.
4. You can have a model which describes typical changes of an optimizable variable. Models of different optimizable variables have different predictive power. For example, “positions & shapes of chairs” and “positions & shapes of clouds of atoms” are both optimizable variables, but models of the latter have much greater predictive power. Complexity of the models needs to be limited, by the way, otherwise all models will have the same predictive power.
5. Collateral conclusions: typical changes of any optimizable variable are easily understandable by a human (since it can be optimized by fundamental variables, based on typical human activities); all optimizable variables are “similar” to each other, in some sense (since they all can be optimized by the same fundamental variables); there’s a natural hierarchy of optimizable variables (based on predictive power). Main conclusion: while the true model of the world might be infinitely complex, physical things which ground humans’ high-level concepts (such as “chairs”, “cars”, “trees”, etc.) always have to have a simple model (which works most of the time, where “most” has a technical meaning determined by fundamental variables).
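
One way to picture “optimizable by fundamental variables” is as a reachability check: can a bounded sequence of normal actions bring the variable close to a target value? The sketch below is a toy under strong assumptions (a one-dimensional variable, additive actions, a fixed horizon); `is_optimizable` and all its parameters are hypothetical names chosen for illustration.

```python
def is_optimizable(start, target, actions, horizon, tolerance=0.5):
    """True if at most `horizon` normal actions can bring the variable
    within `tolerance` of `target`.

    `actions` lists the amounts by which one normal activity can change
    the variable in a single step (an assumed, simplified action model).
    """
    frontier = {start}
    for _ in range(horizon):
        if any(abs(v - target) <= tolerance for v in frontier):
            return True
        # Apply every available action to every reachable value.
        frontier = {round(v + a, 6) for v in frontier for a in actions}
    return any(abs(v - target) <= tolerance for v in frontier)


# Distance from my body to my bed: walking changes it by about a metre per step.
print(is_optimizable(5.0, 0.0, actions=(-1.0, 0.0, 1.0), horizon=10))

# Distance between two particular oxygen molecules: assume normal activities
# have no controllable effect on it.
print(is_optimizable(5.0, 0.0, actions=(0.0,), horizon=10))
```

The first call returns `True` and the second `False`. A real definition would also need the “for a long enough period of time” condition from point 2, which this toy ignores.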
Formalization

So, the core of my idea is this:
1. AI is given “P properties” which a variable of its world-model might have. (Let’s call a variable with P properties P-variable.)
2. AI searches for a world-model with the biggest amount of P-variables. AI makes sure it doesn’t introduce useless P-variables. We also need to be careful with how we measure the “amount” of P-variables: we need to measure something like “density” rather than “amount” (i.e. the amount of P-variables contributing to a particular relevant situation, rather than the amount of P-variables overall?).
3. AI gets an interpretable world-model (because P-variables are highly interpretable), adequate for defining what we care about (because by default, humans only care about P-variables).
How far are we from being able to do something like this? Are agent foundations researchers pursuing this or something else?
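
As a rough illustration of steps 1 and 2 of the formalization, the sketch below treats a world-model as a dictionary of named variables with short trajectories, treats “P properties” as simple predicates, and measures the density of P-variables among the variables relevant to a situation. The two predicates, the trajectories and every name here are assumptions chosen for the example; the actual P properties are left open in the text.

```python
from typing import Dict, Optional, Sequence

Trajectory = Sequence[Optional[float]]  # None marks a moment where the variable vanishes


def is_permanent(traj: Trajectory) -> bool:
    """The variable never disappears."""
    return all(v is not None for v in traj)


def is_continuous(traj: Trajectory, max_step: float = 1.0) -> bool:
    """Consecutive observed values never jump by more than `max_step`."""
    vals = [v for v in traj if v is not None]
    return all(abs(b - a) <= max_step for a, b in zip(vals, vals[1:]))


P_PROPERTIES = (is_permanent, is_continuous)  # assumed stand-ins for "P properties"


def is_p_variable(traj: Trajectory) -> bool:
    return all(prop(traj) for prop in P_PROPERTIES)


def p_density(model: Dict[str, Trajectory], relevant: Sequence[str]) -> float:
    """Fraction of the situation-relevant variables that are P-variables."""
    if not relevant:
        return 0.0
    hits = sum(is_p_variable(model[name]) for name in relevant)
    return hits / len(relevant)


pixel_model = {
    "pixel_17": [0.0, 5.0, 0.2, 9.0],
    "pixel_18": [1.0, None, 7.0, 0.0],
}
object_model = {
    "duck_x": [0.0, 0.5, 1.0, 1.5],
    "duck_y": [2.0, 2.1, 2.2, 2.3],
    "glare": [0.0, 4.0, 0.0, 4.0],
}

print(p_density(pixel_model, list(pixel_model)))
print(p_density(object_model, list(object_model)))
```

Here the object-level model has a higher density of P-variables than the pixel-level one, so a search that maximizes density would prefer it. The sketch says nothing about how the candidate models are generated or how “relevant” is decided.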
Q Home 17 Mar 2025 9:26 UTC
5 points
−1
I have a couple of silly, absurd questions related to mesa-optimizers and mesa-controllers. I’m asking them to get a fresh look at the problem of inner alignment. I want to get a better grip on what basic properties of a model make it safe.

Question 1. How do we know that Quantum Mechanics theory is not plotting to kill humanity?

It’s a model, so it could be unsafe just like an AI.
- QM is not an agent, but its predictions strongly affect humanity. Oracles can be dangerous.
- QM is highly interpretable, so we can check that it’s not doing internal search. Or can we? Maybe it does search in some implicit way? Eliezer brought up this possibility: if you prohibit an AI from modeling its programmers’ psychology, the AI might start modelling something seemingly irrelevant which is actually equivalent to modeling the programmers’ psychology.
Maybe the AI reasons about certain very complicated properties of the material object on the pedestal… in fact, these properties are so complicated that they turn out to contain implicit models of User2’s psychology
- Even if QM doesn’t do search in any way… maybe it still was optimized to steer humanity towards disaster?
Or maybe QM is “grounded” in some special way (e.g. it’s easy to split into parts and verify that each part is correct), so we’re very confident that it does physics and only physics?

Question 2. Crazier version of the previous question: how do we know that Peano arithmetic isn’t plotting to destroy humanity? How do we know that the game of chess isn’t plotting to end humanity?

Maybe Peano arithmetic contains theorems such that trying to prove them steers the real world towards disaster. How can we know and when do we care?

Question 3. Imagine you came up with a plan to achieve your goals. You did it yourself. How do you know that this plan is not optimizing for your ruin?

Humans do go insane and fall into addictions. But not always. So why are our thoughts relatively safe to us? Why doesn’t every new thought / experience turn into addiction which wipes out all of your previous personality?

Question 4. You’re the Telepath. You can read the mind of the Killer. The Killer can reason about some things which aren’t comprehensible to you, but otherwise your cognition is very similar. Can you always tell if the Killer is planning to kill you?

Here are some thoughts the Killer might think:
1. “I need to do <something incomprehensible> so the Telepath dies.”
2. “I need to get the Telepath to eat this food with <something incomprehensible> in it.”
3. “I need to do <something incomprehensible> without any comprehensible reason.”
With 1 we can understand the outcome and that’s all that matters. With 2 we can still tell that something dodgy is going on. Even in 3 we see that the Killer tries to make his reasoning illegible. Maybe the Killer can never deceive us if the incomprehensible concepts he’s thinking about are “embedded” into the comprehensible concepts?
Q Home 17 Nov 2024 8:35 UTC
5 points
2
There’s an alignment-related problem, the problem of defining real objects. Relevant topics: environmental goals; task identification problem; “look where I’m pointing, not at my finger”; The Pointers Problem; Eliciting Latent Knowledge.

I think I realized how people go from caring about sensory data to caring about real objects. But I need help with figuring out how to capitalize on the idea.

So… how do humans do it?
1. Humans create very small models for predicting very small/basic aspects of sensory input (mini-models).
2. Humans use mini-models as puzzle pieces for building models for predicting ALL of sensory input.
3. As a result, humans get models in which it’s easy to identify “real objects” corresponding to sensory input.
For example, imagine you’re just looking at ducks swimming in a lake. You notice that ducks don’t suddenly disappear from your vision (permanence), their movement is continuous (continuity) and they seem to move in a 3D space (3D space). All those patterns (“permanence”, “continuity” and “3D space”) are useful for predicting aspects of immediate sensory input. But all those patterns are also useful for developing deeper theories of reality, such as atomic theory of matter. Because you can imagine that atoms are small things which continuously move in 3D space, similar to ducks. (This image stops working as well when you get to Quantum Mechanics, but then aspects of QM feel less “real” and less relevant for defining objects.) As a result, it’s easy to see how the deeper model relates to surface-level patterns.

In other words: reality contains “real objects” to the extent to which deep models of reality are similar to (models of) basic patterns in our sensory input.
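
A minimal sketch of what "mini-models" could look like in code, assuming sensory input is already reduced to one (x, y, z) position per frame for a single tracked thing, with None when it is not visible. The function names, the toy data and the `max_step` threshold are all hypothetical, chosen only for illustration.

```python
from math import dist

# Hypothetical toy input: one (x, y, z) position per frame for a tracked thing,
# or None when the thing is not visible in that frame.

def permanence(track):
    """Mini-model 1: the thing never vanishes from view."""
    return all(p is not None for p in track)

def continuity(track, max_step=1.0):
    """Mini-model 2: consecutive visible positions stay close to each other."""
    visible = [p for p in track if p is not None]
    return all(dist(a, b) <= max_step for a, b in zip(visible, visible[1:]))

def looks_like_real_object(track):
    """Use the mini-models as puzzle pieces of a slightly bigger model."""
    return permanence(track) and continuity(track)

duck = [(0.0, 0.0, 0.0), (0.5, 0.1, 0.0), (1.0, 0.2, 0.0)]
flicker = [(0.0, 0.0, 0.0), None, (9.0, 9.0, 0.0)]

print(looks_like_real_object(duck), looks_like_real_object(flicker))
```

The same two checks would apply unchanged to a deeper model (say, simulated atoms moving in 3D space), which is the sense in which the deep model stays similar to the surface-level patterns.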
- sunwillrise 17 Nov 2024 15:29 UTC
  6 points
  5
  Parent
  There’s an alignment-related problem, the problem of defining real objects. Relevant topics: environmental goals; task identification problem; “look where I’m pointing, not at my finger”; Eliciting Latent Knowledge.
  Another highly relevant post: The Pointers Problem.
- Vladimir_Nesov 17 Nov 2024 15:59 UTC
  2 points
  0
  Parent
  Creating an inhumanly good model of a human is related to formulating their preferences. A model captures many possibilities and the way many hypothetical things are simulated in the training data. Thus it’s a step towards eliminating path-dependence of particular life stories (and preferences they motivate), by considering these possibilities altogether. Even if some of the possible life stories interact with distortionary influences, others remain untouched, and so must continue deciding their own path, for there are no external influences there and they are the final authority for what counts as aiding them anyway.
  - Q Home 18 Nov 2024 8:05 UTC
    1 point
    0
    Parent
    
    Creating an inhumanly good model of a human is related to formulating their preferences.
    
    How does this relate to my idea? I’m not talking about figuring out human preferences.
    
    Thus it’s a step towards eliminating path-dependence of particular life stories
    
    What is “path-dependence of particular life stories”?
    
    I think things (minds, physical objects, social phenomena) should be characterized by computations that they could simulate/incarnate.
    
    Are there other ways to characterize objects? Feels like a very general (or even fully general) framework. I believe my idea can be framed like this, too.
    - Vladimir_Nesov 18 Nov 2024 18:48 UTC
      2 points
      0
      Parent
      Models or real objects or things capture something that is not literally present in the world. The world contains shadows of these things, and the most straightforward way of finding models is by looking at the shadows and learning from them. Hypotheses is another toy example.
      
      One of the features of models/things seems to be how they capture the many possibilities of a system simultaneously, rather than isolated particular possibilities. So what I gestured at was that when considering models of humans, the real objects or models behind a human capture the many possibilities of the way that human could be, rather than only the actuality of how they actually are. And this seems useful for figuring out their preferences.
      
      Path-dependence is the way outcomes depend on the path that was taken to reach them. A path-independent outcome is convergent, it’s always the same destination regardless of the path that was taken. Human preferences seem to be path dependent on human timescales, growing up in Egypt may lead to a persistently different mindset from the same human growing up in Canada.
      - Q Home 19 Nov 2024 2:23 UTC
        1 point
        0
        Parent
        I see. But I’m not talking about figuring out human preferences, I’m talking about finding world-models in which real objects (such as “strawberries” or “chairs”) can be identified. Sorry if it wasn’t clear in my original message because I mentioned “caring”.
        
        Models or real objects or things capture something that is not literally present in the world. The world contains shadows of these things, and the most straightforward way of finding models is by looking at the shadows and learning from them.
        
        You might need to specify what you mean a little bit.
        
        The most straightforward way of finding a world-model is just predicting your sensory input. But then you’re not guaranteed to get a model in which something corresponding to “real objects” can be easily identified. That’s one of the main reasons why ELK is hard, I believe: in an arbitrary world-model, “Human Simulator” can be much simpler than “Direct Translator”.
        
        So how do humans get world-models in which something corresponding to “real objects” can be easily identified? My theory is in the original message. Note that the idea is not just “predict sensory input”, it has an additional twist.
        Vladimir_Nesov 19 Nov 2024 4:35 UTC
        3 points
        0
        Parent
        
        I’m talking about finding world-models in which real objects (such as “strawberries” or “chairs”) can be identified.
        
        My point is that chairs and humans can be considered in a similar way.
        
        The most straightforward way of finding a world-model is just predicting your sensory input. But then you’re not guaranteed to get a model in which something corresponding to “real objects” can be easily identified.
        
        There’s the world as a whole that generates observations, and particular objects on their own. A model that cares about individual objects needs to consider them separately from the world. The same object in a different world/situation should still make sense, so there are many possibilities for the way an object can be when placed in some context and allowed to develop. This can be useful for modularity, but also for formulating properties of particular objects, in a way that doesn’t get distorted by the influence of the rest of the world. Human preferences is one such property.
        Q Home 19 Nov 2024 6:26 UTC
        1 point
        0
        Parent
        
        My point is that chairs and humans can be considered in a similar way.
        
        Please explain how your point connects to my original message: are you arguing with it, supporting it, or do you want to learn how my idea applies to something?
Q Home 21 Sep 2022 8:31 UTC
3 points
−2
For some time I wanted to apply the idea of probabilistic thinking (used for predicting things) to describing things, making analogies between things. This is important because your hypotheses (predictions) depend on the way you see the world. If you could combine predicting and describing into a single process, you would unify cognition.
Fuzzy logic and fuzzy sets are one way to do it. The idea is that something can be partially true (e.g. “humans are ethical” is somewhat true) or partially belong to a class (e.g. a dog is somewhat like a human, but not 100%). Note that “fuzzy” and “probable” are different concepts. But fuzzy logic isn’t enough to unify predicting and describing, because it doesn’t tell us much about how we should/could describe the world. No new ideas.
I have a different principle for unifying probability and description. Here it is:
Properties of objects aren’t contained in specific objects. Instead, there’s a common pool that contains all possible properties. Objects take their properties from this pool. But the pool isn’t infinite. If one object takes 80% of a certain property from the pool, other objects can take only 20% of that property (e.g. “height”). Socialism for properties: it’s not your “height”, it’s our “height”.
How can an object “take away” properties of other objects? For example, how can a tall object “steal” height from other objects? Well, imagine there are multiple interpretations of each object. Interpretation of one object affects interpretation of all other objects. It’s just a weird axiom. Like a Non-Euclidean geometry.
This sounds strange, but this connects probability and description. And this is new. I think this principle can be used in classification and argumentation. Before showing how to use it I want to explain it a little bit more with some analogies.
Connected houses
Imagine two houses, A and B. Those houses are connected in a specific way.
When one house turns on the light at 80%, the other turns on the light only at 20%.
When one house uses 60% of the heat, the other uses only 40% of the heat.
(When one house turns on the red light, the other turns on the blue light. When one house is burning, the other is freezing.)
Those houses take electricity and heat from a common pool. And this pool doesn’t have infinite energy.
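
A minimal sketch of the finite pool, assuming each object makes a positive numeric "claim" on a property and the pool is split in proportion to the claims. The proportional rule and the name `pool_shares` are assumptions made for illustration, not the only way to read the principle.

```python
def pool_shares(claims):
    """Split a finite pool (100%) between objects in proportion to their claims."""
    total = sum(claims.values())
    return {name: value / total for name, value in claims.items()}

# Two connected houses: A claims four times as much light as B.
two_houses = pool_shares({"A": 4.0, "B": 1.0})

# Adding a third house shrinks A's share even though A's own claim is unchanged.
three_houses = pool_shares({"A": 4.0, "B": 1.0, "C": 5.0})

print(two_houses)
print(three_houses)
```

With two houses the shares come out as 80% and 20%, as in the example above. The second call shows the context effect: what one object "has" depends on what the others take.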
Kindness
Usually people think about qualities as something binary: you either have it or not. For example, a person can be either kind or not.
For me an abstract property such as “kindness” is like the white light. Different people have different colors of “kindness” (blue kindness, green kindness...). Every person has kindness of some color. But nobody has all colors of kindness.
Abstract kindness is the common pool (of all ways to express it). Different people take different parts of that pool.
Some more analogies
Theism analogy. You can compare the common pool of properties to the “God object”, a perfect object. All other objects are just different parts of the perfect object. You also can check out Monadology by Gottfried Leibniz.
Spectrum analogy. You can compare the common pool of properties to the spectrum of colors. Objects are just colors of a single spectrum.
Ethics analogy. Imagine that all your good qualities also belong (to a degree) to all other people. And all bad qualities of other people also belong (to a degree) to you. As if people take their qualities from a single common pool.
Buddhism analogy. Imagine that all your desires and urges come (to a degree) from all other people. And desires and urges of all other people come (to a degree) from you. There’s a single common pool of desire. This is somewhat similar to karma. In rationality there’s also a concept of “values handshakes”: when different beings decide to share each other’s values.
Quantum analogy. See quantum entanglement. When particles become entangled, they take their properties from a single common pool (quantum state).
Fractal analogy. “All objects in the Universe are just different versions of a single object.”
Subdivision analogy. Check out Finite subdivision rule. You can compare the initial polygon to the common pool of properties. And different objects are just pieces of that polygon.
Connection with recursion
Recursion. If objects take their properties from the common pool, it means they don’t really have (separate) identities. It also means that a property (X) of an object is described in terms of all other objects. So, the property (X) is recursive, it calls itself to define itself.
For example, imagine we have objects A, B and C. We want to know their heights. In order to do this we may need to evaluate those functions:
- A(height), B(height), C(height)
- A(B(height)), A(C(height)) …
- A(B(C(height))), A(C(B(height))) …
A priori assumptions about objects should allow us to simplify this and avoid cycles.
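
A small sketch that enumerates the nested evaluations listed above. It uses one possible a priori assumption to avoid cycles, namely that no object appears twice in the same chain. That rule and the name `nested_evaluations` are hypothetical.

```python
from itertools import permutations

def nested_evaluations(objects, prop="height"):
    """List A(height), A(B(height)), A(B(C(height))), ... with no object repeated."""
    result = []
    for depth in range(1, len(objects) + 1):
        for chain in permutations(objects, depth):
            expr = prop
            # Wrap from the innermost object outwards.
            for name in reversed(chain):
                expr = f"{name}({expr})"
            result.append(expr)
    return result

exprs = nested_evaluations(["A", "B", "C"])
print(len(exprs))
print(exprs)
```

Even with the no-repeat rule the number of chains grows quickly with the number of objects, which is why further simplifying assumptions would be needed in practice.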
Fractals. See Coastline paradox. You can treat a fractal as an object with multiple interpretations (where an interpretation depends on the scale). Objects taking their properties from the common pool = fractals taking different scales from the common range.
Classification
To explain how to classify objects using my principle, I need to explain how to order them with it.
I’ll explain it using fantastical places and videogame levels, because those things are formal and objective enough (they are 3D shapes). But I believe the same classification method can be applied to any objects, concepts and even experiences.
Basically, this is an unusual model of contextual thinking. If we can formalize this specific type of contextual thinking, then maybe we can formalize contextual thinking in general. This topic will sound very esoteric, but it’s the direct application of the principle explained above.
Intro
(I interpret paintings as “real places”: something that can be modeled as a 3D shape. If a painting is surreal, I simplify it a bit in my mind.)
Take a look at those places: image.
Let’s compare 2 of them: image. Let’s say we want to know the “height” of those places. We don’t have a universal scale to compare the places. Different interpretations of the height are possible.
If we’re calling a place “very tall”—we need to understand the epithet “very tall” in probabilistic terms, such as “70-90% tall”—and we need to imagine that this probability is taken away from all other places. We can’t have two different “very tall” places. Probability should add up to 100%.
Now take a look at another place (A): image (I ignore the cosmos to simplify it). Let’s say we want to know how enclosed it is. In one interpretation, it is massively enclosed by trees. In another interpretation, trees are just a decorative detail and can be ignored. Let’s add some more places for context: image. They are definitely more open than the initial place, so we should update towards a more enclosed interpretation of (A). All interpretations should be correlated and “compatible”. It’s as if we’re solving a puzzle.
You can say that properties of places are “expandable”. Any place contains a seed of any possible property and that seed can be expanded by a context. “Very tall place” may mean Mt. Everest or a molehill depending on context. You can compare it to a fractal: every small piece of a fractal can be expanded into the entire thing. And I think it’s also very similar to how human language, human concepts work.
You also may call it “amplification of evidence”: any smallest piece of evidence (or even absence of any evidence) can be expanded into very strong evidence by context. We have a situation like in the Raven paradox, but even worse.
Rob Gonsalves
(I interpret paintings as “real” places.)
Places in random order: image.
My ordering of places: image.
I used 2 metrics to evaluate the places:
- Is the space of the place “box-like” and small or not?
- Is the place enclosed or open?
The places go from “box-like and enclosed” to “not box-like and open” in my ordering.
But to see this you need to look at the places in a certain way, reason about them in a certain way:
- Place 1 is smaller than it seems. Because Place 5 is similar and “takes away” its size.
- Place 2 is more box-like than it seems. Because similar places 4 and 6 are less box-like.
- Place 3 is more enclosed than it seems. Because similar places 4 and 6 “take away” its openness.
- Place 5 is more open than it seems. Because similar places 1 and 2 “take away” its closedness.
Almost any property of any specific place can be “illusory”. But when you look at places in context you can deduce their properties via the process of elimination.
- Q Home 24 Sep 2022 10:29 UTC
  1 point
  0
  Parent
  Argumentation, hypotheses
  You can apply the same idea (about the “common pool”) to hypotheses and argumentation:
  - You can describe a hypothesis in terms of any other hypothesis. You also can simplify it along the way (let’s call it “regularization”). Recursion and circularity are possible in reasoning.
  - Truth isn’t attached to a specific hypothesis. Instead there’s a common “pool of truth”. Different hypotheses take different parts of the whole truth. The question isn’t “Is the hypothesis true?”, the question is “How true is the hypothesis compared to others?” And if the hypotheses are regularized it can’t be too wrong.
  - Alternatively: “implications” of a specific hypothesis aren’t attached to it. Instead there’s a common “pool of implications”. Different hypotheses take different parts of “implications”.
  - Conservation of implications: if implications of a hypothesis are simple enough, they remain true/likely even if the hypothesis is wrong. You can shift the implications to a different hypothesis, but you’re very unlikely to completely dissolve them.
  - In usual rationality (hypotheses don’t share truth) you try to get the most accurate opinions about every single thing in the world. You’re “greedy”. But in this approach (hypotheses do share truth) it doesn’t matter how wrong you are about everything else as long as you’re right about “the most important thing”. And once you’re proven right about “the most important thing”, you know everything. A billion wrongs can make a right. Because any wrong opinion is correlated with the ultimate true opinion, the pool of the entire truth.
  - You can’t prove a hypothesis to be “too bad” because it would harm all other hypotheses. Because all hypotheses are correlated, created by each other. When you keep proving something wrong the harm to other hypotheses grows exponentially.
  - Motivated reasoning is valid: truth of a hypothesis depends on context, on the range of interests you choose. Your choice affects the truth.
  - Any theory is the best (or even “the only one possible”) on its level of reality. For example, on a certain level of reality modern physics doesn’t explain weather better than gods of weather.
  In a way it means that specific hypotheses/beliefs just don’t exist, they’re melted into a single landscape. It may sound insane (“everything is true at the same time and never proven wrong” and also relative!). But human language, emotions, learning, pattern-matching and research programs often work like this. It’s just a consequence of ideas (1) not being atomic statements about the world and (2) not being focused on causal reasoning, causal modeling. And it’s rational to not start with atomic predictions when you don’t have enough evidence to locate atomic hypotheses.
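
  One place where a shared pool already shows up in ordinary probability, given here only as a loose analogy and not as a formalization of the “pool of truth”: for mutually exclusive and exhaustive hypotheses $H_1, \dots, H_n$ (an assumption of this sketch), Bayes’ rule divides by a sum over all hypotheses, so posteriors always add up to 1 and belief lost by one hypothesis is gained by the others.

  ```latex
  P(H_i \mid E) = \frac{P(E \mid H_i)\,P(H_i)}{\sum_{j=1}^{n} P(E \mid H_j)\,P(H_j)},
  \qquad
  \sum_{i=1}^{n} P(H_i \mid E) = 1
  ```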
  Causal rationality, Descriptive rationality
  You can split rationality into 2 components. The second component isn’t explored. My idea describes the second component:
  - Causal rationality. Focused on atomic independent hypotheses about the world. On causal explanations, causal models. Answers “WHY this happens?”. Goal: to describe a specific reality in terms of outcomes.
  - Descriptive rationality. Focused on fuzzy and correlated hypotheses about the world. On patterns and analogies. Answers “HOW this happens?”. Goal: to describe all possible (and impossible) realities in terms of each other.
  Causal and Descriptive rationality work according to different rules. Causal uses Bayesian updating. Descriptive uses “the common pool of properties + Bayesian updating”, maybe.
  - “Map is not the territory” is true for Causal rationality. It’s wrong for Descriptive rationality: every map is a layer of reality.
  - “Uncertainty and confusion is a part of the map, not the territory”. True for Causal rationality. Wrong for Descriptive rationality: the possibility of an uncertainty/confusion is a property of reality.
  - “Details make something less likely, not more” (Conjunction fallacy). True for Causal rationality. Wrong for Descriptive rationality: details are not true or false by themselves, they “host” kernels of truth, more details may accumulate more truth.
  - For Causal rationality, math is the ideal of specificity. For Descriptive rationality, math has nothing to do with specificity: an idea may have different specificity on different layers of reality.
  - In Causal rationality, hypotheses should constrain outcomes, shouldn’t explain any possible outcome. In Descriptive rationality… constraining depends on context.
  - Causal rationality often conflicts with people. Descriptive rationality tries to minimize the conflict. I believe it’s closer to how humans think.
  - Causal rationality assumes that describing reality is trivial and should be abandoned as soon as possible. Only (new) predictions matter.
  - In Descriptive rationality, a hypothesis is somewhat equivalent to the explained phenomenon. You can’t destroy a hypothesis too much without destroying your knowledge about the phenomenon itself. It’s like hitting a nail so hard that you destroy the Earth.
  Example: Vitalism. It was proven wrong in causal terms. But in descriptive terms it’s almost entirely true. Living matter does behave very differently from non-living matter. Living matter does have a “force” that non-living matter doesn’t have (it’s just not a fundamental force). Many truths of vitalism were simply split into different branches of science: living matter is made out of special components (biology/microbiology) including nanomachines/computers!!! (DNA, genetics), can have cognition (psychology/neuroscience), can be a computer (computer science), can evolve (evolutionary biology), can do something like “decreasing entropy” (an idea by Erwin Schrödinger, see entropy and life). On the other hand, maybe it’s bad that vitalism got split into so many different pieces. Maybe it’s bad that vitalism failed to predict reductionism. However, behaviorism did get overshadowed by cognitive science (living matter did turn out to be more special than it could be). Our judgement of vitalism depends on our choices, but at worst vitalism is just the second best idea. Or the third best idea compared to some other version of itself… Absolute death of vitalism is astronomically unlikely and it would cause most of reductionism and causality to die too along with most of our knowledge about the world. Vitalism partially just restates our knowledge (“living matter is different from non-living”), so it’s strange to simply call it wrong. It’s easier to make vitalism better than to disprove it.
  Perhaps you could call the old version of vitalism “too specific given the information about the world”: why should “life-like force” be beyond laws of physics? But even this would be debatable at the time. By the way, the old sentiment “Science is too weak to explain living things” can be considered partially confirmed: 19th century science lacked a bunch of conceptual breakthroughs. And “only organisms can make the components of living things” is partially just a fact of reality: skin and meat don’t randomly appear in nature. This fact was partially weakened, but also partially strengthened with time. The discovery of DNA strengthened it in some ways. It’s easy to overlook all of those things.
  In Descriptive rationality, an idea is like a river. You can split it, but you can’t stop it. And it doesn’t make sense to fight the river with your fists: just let it flow around you. However, if you did manage to split the river into independent atoms, you get Causal rationality.
  2 types of rationality should be connected
  I think causal rationality has some problems and those problems show that it has a missing component:
  - Rationality is criticized for dealing with atomic hypotheses about the world. For not saying how to generate new hypotheses and obtain new knowledge. Example: critique by nostalgebraist. See “8. The problem of new ideas”
  - You can’t use causal rationality to be critical of causal rationality. In theory you should be able to do it, but in practice people often don’t do it. And causal rationality doesn’t model argumentation, even for the most important topics such as AI safety. So we end up arguing like anyone argues.
  - Doomsday argument, Pascal’s mugging. Probability starts to behave weird when we add large numbers of (irrelevant) things to our world.
  - The problem of modesty. Should you assume that you’re just an average person?
  - Weird addition in ethics. Repugnant conclusion, “Torture vs. Dust Specks”.
  - Causal rationality doesn’t give/justify an ethical theory. Doesn’t say how to find it if you want to find it.
  - Causal rationality doesn’t give/justify a decision theory. There’s a problem with logical uncertainty (uncertainty about implications of beliefs).
  I’m not saying that all of this is impossible to solve with Causal rationality. I’m saying that Causal rationality doesn’t give any motivation to solve all of this. When you’re trying to solve it without motivation you kind of don’t know what you’re doing. It’s like trying to write a program in bytecode without having high-level concepts even in your mind. Or like trying to ride an alien device in the dark: you don’t know what you’re doing and you don’t know where you’re doing.
  What and where are we doing when we’re trying to fix rationality?
- Q Home 22 Sep 2022 11:46 UTC
  1 point
  0
  Parent
  Crash Bandicoot 1
  Crash Bandicoot N. Sane Trilogy
  My ordering of some levels: image. Videos of the levels: Level 1, Level 2, Level 3, Level 4, Level 5, Level 6.
  I used 2 metrics to evaluate the levels:
  1. Is the level stretched vertically or horizontally?
  2. Is the level easy to separate into similar square-like pieces or not? (like a patchwork)
  The levels go from “vertical and separable” to “horizontal and not separable”.
  But to see this you need to note:
  - Level 1 is very vertical: it’s just a vertical wall. So it “takes away” verticality from levels 2 and 3.
  - Among levels 1-3, level 3 is the most horizontal. Because it’s the least similar to level 1.
  - Levels 4-6 repeat the same logic, but now levels are harder to separate into similar square-like pieces. Why? Because levels 1 and 2 are very easy to separate (they have repeating patterns on the walls), so they “take away” separability from all other levels.
  Any question about any property of any level is answered by another question: is this property already “occupied” by some other level?
  Jacek Yerka
  Places in random order: image.
  My ordering of places: image.
  I used 2 metrics to evaluate the places:
  1. Can the place fit inside a box-like space? (not too big, not too small)
  2. Is the place inside or outside of something small?
  The places go from “box-like and outside” to “not box-like and inside”.
  But to see this you need to note:
  - Place 1 could be interpreted as being inside of a town. But similar Place 5 is inside a single road. So it takes away “inside-ness” from Place 1.
  - Place 2 is more “outside” than it seems. Because similar Place 6 fits inside an area with small tiles. So it takes away “inside-ness” from Place 2.
  - Place 3 is not so tall as it seems. Because similar Place 6 is very tall. So it takes away height from Place 3.
  If you feel this relativity of places’ properties, then you understand how I think about places. You don’t need to understand a specific order of places perfectly.
  Crash Bandicoot 3
  My ordering of some levels: image. Videos of the levels: Level 1, Level 2, Level 3, Level 4, Level 5, Level 6, Level 7
  I used 1 metric to evaluate the levels:
  - Does the space create a 3D space (box-like, not too big, not too small) or 2D space (flat surface) or 0D space (shapeless, cloud-like)?
  Levels go from 3D to 2D to 0D.
  But to see this you need to note:
  - Levels 6 and 7 are less box-like than they seem. Because similar levels 1 and 2 already create small box-like spaces. So they take away “box-like” feature from levels 6 and 7.
  - Level 3 is more box-like than it seems. Because levels 4 and 5 create more dense flat surfaces. So they take away flatness from Level 3.
  Each level is described by all other levels. This recursive logic determines what features of the levels matter.
  Negative objects
  When objects take their properties from a single pool of properties, there may appear “negative objects”. It happens when objects A and B take away opposite properties from a third object C (with equal force). For example, A may take away height from C. But B takes away shortness (anti-height) from C. So, “negative objects” are like contradictions. You can’t fit a negative object anywhere in the order of positive objects.
  Let’s get back to Crash Bandicoot 3 and add two levels: image. Videos of the levels: Level −2, Level −1
  - Take a look at Level −2. It’s too empty for levels 6 and 7 (and too box-like). But it’s too big and shapeless for levels 1 and 2. And it’s obviously not a flat surface. So, it doesn’t fit anywhere. Maybe it’s just better to place it in its own order.
  - A similar thing is true for Level −1. It’s too different from levels 6 and 7 and it’s too small for levels 1 and 2.
  - Levels −2 and −1 are also both inside some kind of structure. This adds confusion when you compare them to other levels.
  Note that negative levels are still connected with all the other levels anyway: their properties are still determined by properties of all other levels, just in a more complicated way.
  You can order negative levels by using the metrics for positive levels. In the case above, you can do it like this:
  1. Take negative levels. Cut out their larger parts. Now they’re just like the positive levels.
  2. Order them the same way you ordered positive levels.
  Hyper objects
  There are also “hyper objects” (hyper positive and hyper negative objects). Such objects take “too much” or “too little” from the common pool of properties compared to normal objects.
  How do hyper objects appear? I may not be able to explain it. Maybe a hyper object appears when an object takes a property (equally strongly) from objects with very different amounts of that property. This was very confusing and vague, so here’s an analogy: imagine a number that’s very far away from the numbers 2 and 5, but equally far from both. It has distance 10 from both 2 and 5. How can this be? This number should go somewhere “sideways”… it must be a complex number. So, you can compare hyper objects to complex numbers.
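
  The analogy can be checked directly. On the real line the only point equidistant from 2 and 5 is 3.5, at distance 1.5, so a distance of 10 forces the number off the line. A short worked version, writing the number as $z = x + iy$ (notation introduced here for illustration):

  ```latex
  |z - 2| = |z - 5| = 10
  \;\Rightarrow\; (x - 2)^2 + y^2 = (x - 5)^2 + y^2
  \;\Rightarrow\; x = 3.5, \\
  y^2 = 10^2 - 1.5^2 = 97.75
  \;\Rightarrow\; z = 3.5 \pm i\sqrt{97.75} \approx 3.5 \pm 9.89\,i
  ```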
  An example of hyper levels for Crash Bandicoot 3: image. Video of the levels: “Bye Bye Blimps”, “N. Gin”
  - “Bye Bye Blimps” is like a flat surface, but utterly gigantic. But it’s also shapeless like levels 6 and 7, yet bigger than them/equally big, but in a different way.
  - “N. Gin” is identical to “Bye Bye Blimps” in this regard.
  Theory
  How is this related to anything?
  You may be asking “How can ordering things be related to anything?” Prepare for a slightly abstract argument.
  Any thought/experience is about multiple things coexisting in your mental state. So, any thought/experience is about direct or indirect comparison between things. And any comparison can be described by an order or multiple orders.
  - If compared things don’t share properties, then you can order them using “arithmetic” (absolute measurements, uncorrelated properties). In this case everything happening in your mental state is absolutely separated, it’s a degenerate case.
  - If compared things 100% share properties, then you can order them using my method (pool of properties, absolutely correlated properties). In this case everything happening in your mental state is mixed into a single process.
  - If compared things partially share properties, then you can use a mix between “arithmetic” and my method. In this case everything happening in your mental state partially breaks down into separate processes.
  So, “my orders + arithmetic orders” is something like a Turing machine: a universal model that can describe any thought/experience, any mental state. Of course, a Turing machine can describe anything my method can describe, but my method is more high-level.
  Formalization
  I know that what I described above doesn’t automatically specify a mathematical model. But I think we should be able to formalize my idea easily enough. If not, then my idea is wrong.
  We have those hints for formalization:
  - The idea about the common pool of properties. Connection with probability.
  - Connection with recursion.
  - The idea of “negative objects” and “hyper objects”. Connection with superrationality/splitting resources.
  - We can test the formalization on comparing 3D shapes (maybe even 2D shapes). Easy to model and formalize.
  - Connection to hypotheses, rationality. To Bayes’ rule. (See below.)
  - We can try a special type of brainstorming/spitballing based on my idea. (See below.)
  To be honest, I’m bad at math. I based my theory on synesthesia-like experiences and conceptual ideas. But if the information above isn’t enough, I can try to give more. I have experience of making my idea more specific, so I could guess how to make the idea even more specific (if we encounter a problem). Please, help me with formalizing this idea.
Q Home 18 Jan 2025 3:43 UTC
2 points
0
Sorry if it’s not appropriate for this site. But is anybody interested in chess research? I’ve seen that people here might be interested in chess. For example, here’s a chess post barely related to AI.

Intro

In chess, what positions have the longest forced wins? “Mate in N” positions can be split into 3 types:
1. Positions which use “tricks” to get a big number of moves before checkmate. Such as cycles of repeating moves. For example, this manmade mate in 415 (see the last position) uses obvious cycles. Not to mention mates in omega.
2. Tablebase checkmates, discovered by brute force, showing absolutely incomprehensible play with no discernible logic. See this mate in 549 moves. One should assume it’s based on some hidden cycles or something?
3. Positions which are similar to immortal games. Where the winning variation requires a combination without any cycles. For example: Kasparov’s Immortal (a 14-move combination), Stoofvlees vs. Igel (down a rook for 21 moves), though these examples lack optimal play.
Surprisingly, nobody seems to look for the longest mates of Type 3. Well, I did look for them and discovered some. Down below I’ll explain multiple ways to define what exactly I did. Won’t go into too much detail. If you want more detail—Research idea: the longest non-trivial middlegames. There you also can see the puzzles I’ve created.

My longest puzzle is 42 moves: https://lichess.org/study/sTon08Mb/JG4YGbcP Overall, I’ve created 7 unique puzzles. Worked a lot on 1 more (mate in 52 moves), but couldn’t make it work.

Among other things, I made this absurd mate in 34 puzzle. Almost the entire board is filled with pieces (62 pieces on the board!), only two squares are empty. And despite that the position has deep content. It’s kinda a miracle. I think it deserves recognition.

Definition 1

Unlike Type 1 and Type 2 mates, my mates involve many sacrifices of material. So my mates can be defined as “the longest sacrificial combinations”.

Definition 2

We can come up with important metrics which make a long mate more special, harder to find, more rare. Material imbalance, number of non-check moves, amount of freedom the pieces have, etc. Then we can search for the longest mates compatible with high enough values of those metrics.

Well, that’s what I did.
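
A minimal sketch of the "longest mate above a metric threshold" search described above, using only one of the metrics (the share of non-check moves). The function names, the candidate format, and the move lists are hypothetical placeholders; the moves are not validated as legal chess, and real use would need an engine or a chess library.

```python
def non_check_ratio(san_moves):
    """Share of the attacker's moves that are neither check (+) nor mate (#)."""
    if not san_moves:
        raise ValueError("need at least one move")
    # Drop annotation marks such as "!" or "?" before looking at the suffix.
    quiet = [m for m in san_moves if not m.rstrip("!?").endswith(("+", "#"))]
    return len(quiet) / len(san_moves)


def longest_qualifying(candidates, min_quiet_ratio):
    """Longest candidate whose non-check ratio reaches the threshold, else None."""
    ok = [c for c in candidates
          if non_check_ratio(c["attacker_moves"]) >= min_quiet_ratio]
    return max(ok, key=lambda c: len(c["attacker_moves"]), default=None)


# Hypothetical candidates: A is longer but consists only of checks.
candidates = [
    {"name": "A", "attacker_moves": ["Qa1+", "Qa2+", "Qa3+", "Qa4#"]},
    {"name": "B", "attacker_moves": ["Rd1", "Qh5", "Qh7#"]},
]
best = longest_qualifying(candidates, min_quiet_ratio=0.5)
print(best["name"])
```

Other metrics (material imbalance, freedom of pieces) would be added as further filters in the same way.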

Definition 3

This is an idea of a definition rather than a definition. But it might be important.
- Take a sequential game with perfect information.
- Take positions with the longest forced wins.
- Out of those positions, choose positions where the defending side has the greatest control over the attacking side’s optimal strategy.
My mates are an example of positions where the defending side has especially great control over the flow of the game.
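
A toy sketch of the first two steps only (take a sequential perfect-information game, find the positions with the longest forced wins). It assumes a simple subtraction game instead of chess: players alternately take 1 to 3 stones, and whoever takes the last stone wins. The third step, measuring how much control the defender has, is not modelled here.

```python
from functools import lru_cache

MAX_TAKE = 3  # assumed rule: a move removes 1..3 stones


@lru_cache(maxsize=None)
def solve(stones):
    """Return (outcome, plies) for the player to move under optimal play.

    The winning side minimises the number of plies, the losing side maximises it.
    """
    if stones == 0:
        return ("loss", 0)  # the previous player took the last stone
    children = [solve(stones - k) for k in range(1, min(MAX_TAKE, stones) + 1)]
    winning = [plies for outcome, plies in children if outcome == "loss"]
    if winning:
        return ("win", 1 + min(winning))
    return ("loss", 1 + max(plies for _, plies in children))


def longest_forced_win(max_stones):
    """(plies, stones) of the longest forced win among positions 1..max_stones."""
    wins = [(solve(n)[1], n) for n in range(1, max_stones + 1)
            if solve(n)[0] == "win"]
    return max(wins)


print(solve(9), solve(8), longest_forced_win(10))
```

In this game the defender has no real influence on the attacker’s strategy, so it would score low on the third criterion; that is the part which makes the chess positions special.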

Deeper meaning?

Can there be any deep meaning behind researching my type of mates? I think yes. There are two relevant things.
1. First thing is hard to explain, because I’m not a mathematician. But I’ll try. Math can often be seen as skipping stuff which is the most interesting to humans. For example, math can prove theorems about games in general, without explaining why a specific game is interesting or why a specific position is interesting. However, here it seems like we can define something very closely related to subjective “interestingness”.
2. Hardness of defining valuable things is relevant to Alignment. The definitions above reveal that maybe sometimes valuable things are easier to define than it seems.
Reception

How did the chess community receive my work?
- On Reddit, some posts got a moderate amount of upvotes (enough to get into daily top). A silly middlegame position. With checkmate in 50-80 moves? (110+); Does this position set any record? (60+). Sadly the pattern didn’t continue: New long non-trivial middlegame mate found. Nobody asked for this. (1).
- On a computer chess forum, people mostly ignored it. I hoped they could help me find the longest attacks in computer games.
- On the Discord of chess composers, a bunch of people complimented my project. But nobody showed any proactive interest (e.g. “hey, I’d like to preserve your work”). One person reacted like ~“I’m not a specialist on that type of thing, I don’t know with whom you could talk about that”.
- On Reddit communities where you can ask mathematicians things, people told me that game theory is too abstract for tackling such things.
Q Home 11 Dec 2022 10:54 UTC
2 points
0
“Everything is relative.”

You know this phrase, right? But “relativity” is relative too. Maybe something is absolute.

But “relativity of relativity” is relative too. Maybe nothing is absolute after all… Those thoughts create an infinite tower of meta-levels.

If you think about the statement “truth = lie” (“you can go from T to F”) you can get a similar tower. (Because it also implies “you can NOT go from T to F” and “you can go from ‘you can NOT go from T to F’ to ‘you can go from T to F’” and so on.) It’s not formal, but still interesting. Informally, the statement “truth = lie” is equivalent to “everything is relative”.

Hierarchy of meta-levels is relative.

Imagine an idealist and a materialist. Materialist thinks “I’m meta compared to the idealist—I can analyze their thought process through physics”. Idealist thinks “materialist thinks they’re meta compared to me, but thinking in terms of physics is just one possible experience”. So, “my thought process = the most important thing” and “my thought process + physics = the most important things” are both meta- compared to each other, they both can do meta-analysis of each other.

Both materialism and idealism can model each other. Materialism can be modeled by meta-idealism. Meta-idealism can be modeled by meta-materialism. Meta-materialism can be modeled by meta-meta-idealism. And so on. (Those don’t have to be different models, it’s just convenient to think about it in terms of levels.)

The same thing with altruism and selfishness. Altruism can be modeled by meta-selfishness. Meta-selfishness can be modeled by meta-meta-altruism. And you can abstract it to any property (A) and its negation (not A), because any property can be treated as a model of the world. So, this idea can be generalized as “A = not A”.

Points and lines

Next step of the idea: for meta-level objects lower level objects are indistinguishable.

If you think in terms of points, two different points (A and B) are different objects to you. If you think in terms of lines, then points A and B may be parts of the same object. Or, on the other hand, the same point can be a part of completely different objects.

A universe of objects

Now imagine that some points are red and other points are blue. And we don’t care about the shape of a line.

Level-1 lines contain only blue (positive) or only red (negative) points.

Level-2 lines can contain both types of points. E.g. they can contain mostly blue (complex positive) or mostly red (complex negative) points.

So, you can get different kinds of objects out of this, somewhat similar to numbers. I guess you can do this in many different ways. For example, you may have a spectrum of colors. Or you may have positive and negative spectrums. To me it’s very important, because it connects to my synesthesia: see here. The post is very unclear (I don’t advise reading it), but sadly I don’t know how to explain everything better yet.
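
A small sketch of the Level-1 / Level-2 classification of lines described above. It treats a line as just a collection of colored points (shape ignored). The function name and the "balanced" label for ties are assumptions added for completeness; the text does not say what happens when blue and red points are equal in number.

```python
def classify_line(points):
    """Classify a line given the colors ('blue' or 'red') of its points.

    Returns (level, kind). Ties are labelled 'balanced', which is an assumption.
    """
    if not points:
        raise ValueError("a line needs at least one point")
    blue = points.count("blue")
    red = points.count("red")
    if red == 0:
        return (1, "positive")          # only blue points
    if blue == 0:
        return (1, "negative")          # only red points
    if blue > red:
        return (2, "complex positive")  # mostly blue
    if red > blue:
        return (2, "complex negative")  # mostly red
    return (2, "balanced")


print(classify_line(["blue", "blue"]))
print(classify_line(["blue", "red", "blue"]))
```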
jmh 3 Sep 2022 1:27 UTC
2 points
0
Perhaps the alternative to the “maximize one thing subject to a price-to-humans constraint” approach would be not making the AI that specialized. Make it maximize across a basket of things humans want.
While I have heard the worry about paper clips taking over the universe, it seems to me that this type of thought experiment introduces the problem to begin with (making a bit of a circular error). As I gather (indirectly), the problem is that the paper-clip-maximizing AI ends up taking over the entire economy. That seems equivalent to suggesting the AI replaces all the markets and other economic decisions (being smarter, faster and more competitive, I guess).
If so, isn’t an obvious solution to give it multiple (infinite, in the sense of unlimited human wants) things to maximize? While it might replace human productive economic activity, it’s going to produce some form of a current-state production possibility frontier and, I would think, an intertemporal one as well, which might address some intergenerational concerns.
I don’t think that fully solves the alignment problem (as I understand it—possibly poorly) but I do think it shifts what the risks are and may well eliminate a lot of the existential risks people worry about.
Q Home 12 Sep 2025 12:33 UTC
1 point
0
A question about natural latents.
1. Imagine you have 100 blue boxes. Each time you roll the dice, their shape changes. But all 100 boxes always share the same shape. If I understand correctly, in this situation the shape is the natural latent. While color is just static background information.
2. Imagine you have 100 boxes. Each time you roll the dice, their color changes. But all 100 boxes always share the same color. In this situation, color is the natural latent.
3. Imagine you have 100 boxes. Each time you roll the dice, their color changes. If at least one box is blue, all boxes are blue. Otherwise their color is independent. Is “all boxes are blue or all boxes have independent colors” a natural latent (it’s something you learn about all boxes by examining a single box)?
Does the latter (3) type of natural latents have any special properties, is it some sort of “meta-level” natural latent (compared to 2)? I’m asking because I think this type of latents might be relevant to how human abstractions work. Here’s where I wrote about it in more detail.
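
A minimal simulation of scenario 3, as one possible reading of it. It assumes that in the "independent" case no box is blue (otherwise the rule "if at least one box is blue, all boxes are blue" would force the other case), and it assumes a three-color palette and a 0.3 chance of the all-blue case. It only checks the informal property mentioned in the question (a single box tells you which regime all boxes are in), not the formal natural latent conditions.

```python
import random

COLORS = ["blue", "red", "green"]  # assumed palette


def roll_scenario_3(n_boxes, p_all_blue, rng):
    """One dice roll: returns (regime, list of box colors)."""
    if rng.random() < p_all_blue:
        return "all_blue", ["blue"] * n_boxes
    non_blue = [c for c in COLORS if c != "blue"]
    return "independent", [rng.choice(non_blue) for _ in range(n_boxes)]


def infer_regime(single_box_color):
    """Guess the regime of all boxes from one examined box."""
    return "all_blue" if single_box_color == "blue" else "independent"


def single_box_always_reveals_regime(trials=1000, n_boxes=100, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        regime, boxes = roll_scenario_3(n_boxes, 0.3, rng)
        examined = boxes[rng.randrange(n_boxes)]
        if infer_regime(examined) != regime:
            return False
    return True


print(single_box_always_reveals_regime())
```

Note the contrast with scenario 2: there a single box reveals the exact color of every box, while here it reveals only the regime, and in the "independent" regime the other boxes’ colors stay unknown.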
Q Home 16 Jun 2025 9:07 UTC
1 point
0
Have an idea about interpretability and defining search/optimization.
Trivial Property
Finite algorithms can solve infinite (classes of) problems. For example, the algorithm for adding two numbers has a finite description, yet can solve an infinity of examples.
This is a basic truth of computability theory.
Intuitively, it means that algorithms can exploit regularities in problems. But “regularity” here can only be defined tautologically (any smaller thing which solves/defines a bigger thing).
Less Trivial Property
Many algorithms have the following property:
1. The algorithm computes function $F$ by computing another function $G$ many times. Computing $G$ is useful for computing $F$ . $G$ is computed significantly more times than $F$ .
2. Computing $G$ is significantly easier than computing $F$ .
3. $G$ gives important information about $F$ . This information is easy to compute and (maybe) trivial to prove.
Intuitively, it means that algorithms can exploit regularities in problems. However, here “regularity” has a stricter definition than in the trivial property (TP). A “regularity” is an easily computable thing which gives you important, easily computable information about a hard-to-compute thing.
Now, the question is: what classes of algorithms have the less trivial property (LTP)?
LTP includes a bunch of undefined terms:
- “Significantly easier”, “easily computable”, and “useful for computing”. Those could be defined in terms of computational complexity.
- “Significantly more times”. Could be defined asymptotically.
- “Important information”. I have little idea how it should be defined. “Important information” may mean upper and lower bounds of a function, for example.
Probably giving fully general definitions right away is not important. We could try starting with overly restrictive definitions and see if we can prove anything about those.
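
One possible overly restrictive reading of the three conditions, as a starting point only. All notation here is assumed, not taken from the text above: $n$ is the input size, $c(n)$ is the number of times $G$ is computed while computing $F$ on an input of size $n$, $T_F$ and $T_G$ are worst-case running times, and $\ell$, $u$ are cheap-to-compute functions turning a value of $G$ into a lower and an upper bound. It also assumes $F$ is real-valued.

```latex
\begin{align*}
&\text{(1) many calls:}      && c(n) \to \infty \quad \text{as } n \to \infty, \\
&\text{(2) $G$ is easier:}   && T_G(n) = o\bigl(T_F(n)\bigr), \\
&\text{(3) $G$ informs $F$:} && \ell\bigl(G(y)\bigr) \le F(x) \le u\bigl(G(y)\bigr)
   \quad \text{for every sub-input } y \text{ of } x \text{ on which } G \text{ is called.}
\end{align*}
```

Condition (3) is just the "upper and lower bounds" reading of "important information"; other readings would replace it while leaving (1) and (2) unchanged.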
Relation to AI safety
If neural networks implement algorithms with LTP, we could try finding those algorithms by looking for $G$ (which is much easier than looking for $F$ ).
Furthermore, LTP seems very relevant to defining search / optimization.
Examples of LTP
Examples of algorithms with LTP:
- Any greedy algorithm is an example. The whole solution to the problem is $F$ , the greedy choice is $G$ .
- Many dynamic programming algorithms are an example. Solution to the whole problem is $F$ , solution to a smaller problem (which we need to use many times) is $G$ .
- Consider a strong chess engine. $F$ : “given a position, find how to e.g. win material given an ~N moves horizon”. $G$ : “given a position, find how to e.g. win material given a ~K moves horizon”. Where K << N. Simply put, the N-moves-deep engine will often punish K-moves-deep mistakes. That’s the reason why you can understand that you’re losing long before you get checkmated. This wouldn’t be true in a game much more chaotic than chess.
- Consider simulated annealing. $F$ : “given a function, find the global optimum”. AFAICT, simulated annealing is greedy search + randomness. The greedy part is $G$ .
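
A small sketch of the dynamic programming example, with a counter showing that the smaller solution $G$ is used many more times than the whole problem $F$ is solved. Minimum coin change is an assumed choice of problem, and the names `min_coins` and `g_uses` are illustrative.

```python
def min_coins(amount, coins):
    """F: fewest coins summing to `amount` (None if impossible).

    best[a] plays the role of G(a): the solution to a smaller problem.
    Returns (answer, number of times a smaller solution was looked up).
    """
    best = [0] + [None] * amount
    g_uses = 0
    for a in range(1, amount + 1):
        for c in coins:
            if c <= a:
                g_uses += 1               # one use of G on a smaller amount
                sub = best[a - c]
                if sub is not None and (best[a] is None or sub + 1 < best[a]):
                    best[a] = sub + 1
    return best[amount], g_uses


answer, g_uses = min_coins(11, [1, 5, 6])
print(answer, g_uses)
```

Here $F$ is computed once while smaller solutions are consulted 24 times, and each lookup is a constant-time step, which matches conditions 1 and 2 in a very weak sense. Whether the lookups give "important information" in the intended sense is left open.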
What links here?
- Q Home's comment on Research Agenda: Synthesizing Standalone World-Models by Thane Ruthenis (2 Oct 2025 0:29 UTC; 3 points)
Q Home 27 Nov 2024 8:44 UTC
1 point
0
Draft of a future post, any feedback is welcome. Continuation of a thought from this shortform post.

(picture: https://en.wikipedia.org/wiki/Drawing_Hands)

The problem

There’s an alignment-related problem: how do we make an AI care about causes of a particular sensory pattern? What are “causes” of a particular sensory pattern in the first place? You want the AI to differentiate between “putting a real strawberry on a plate” and “creating a perfect illusion of a strawberry on a plate”, but what’s the difference between doing real things and creating perfect illusions, in general?

(Relevant topics: environmental goals; identifying causal goal concepts from sensory data; “look where I’m pointing, not at my finger”; Pointers Problem; Eliciting Latent Knowledge; symbol grounding problem; ontology identification problem.)

I have a general answer to those questions. My answer is very unfinished. Also it isn’t mathematical, it’s philosophical in nature. But I believe it’s important anyway, because there aren’t a lot of philosophical or non-philosophical ideas about the questions above. With questions like these you don’t know where to even start thinking, so it’s hard to imagine even a bad answer.

Obvious observations

Observation 1. Imagine you come up with a model which perfectly predicts your sensory experience (Predictor). Just having this model is not enough to understand causes of a particular sensory pattern, i.e. differentiate between stuff like “putting a real strawberry on a plate” and “creating a perfect illusion of a strawberry on a plate”.

Observation 2. Not every Predictor has variables which correspond to causes of a particular sensory pattern. Not every Predictor can be used to easily derive something corresponding to causes of a particular sensory pattern. For example, some Predictors might make predictions by simulating a large universe with a superintelligent civilization inside which predicts your sensory experiences. See “Transparent priors”.

The solution

So, what are causes of a particular sensory pattern?

“Recursive Sensory Models” (RSMs).

I’ll explain what an RSM is and provide various examples.

What is a Recursive Sensory Model?

An RSM is a sequence of N models (Model 1, Model 2, …, Model N) for which the following two conditions hold true:
- Model (K + 1) is good at predicting more aspects of sensory experience than Model (K). Model (K + 2) is good at predicting more aspects than Model (K + 1). And so on.
- Model 1 can be transformed into any of the other models according to special transformation rules. Those rules are supposed to be simple. But I can’t give a fully general description of those rules. That’s one of the biggest unfinished parts of my idea.
The second bullet point is kinda the most important one, but it’s very underspecified. So you can only get a feel for it through looking at specific examples.

Core claim: when the two conditions hold true, the RSM contains easily identifiable “causes” of particular sensory patterns. The two conditions are necessary and sufficient for the existence of such “causes”. The universe contains “causes” of particular sensory patterns to the extent to which statistical laws describing the patterns also describe deeper laws of the universe.

Example: object permanence

Imagine you’re looking at a landscape with trees, lakes and mountains. You notice that none of those objects disappear.

It seems like a good model: “most objects in the 2D space of my vision don’t disappear”. (Model 1)

But it’s not perfect. When you close your eyes, the landscape does disappear. When you look at your feet, the landscape does disappear.

So you come up with a new model: “there is some 3D space with objects; the space and the objects are independent from my sensory experience; most of the objects don’t disappear”. (Model 2)

Model 2 is better at predicting the whole of your sensory experience.

However, note that the “mathematical ontology” of both models is almost identical. (Both models describe spaces whose points can be occupied by something.) They’re just applied to slightly different things. That’s why “recursion” is in the name of Recursive Sensory Models: an RSM reveals similarities between different layers of reality. As if reality is a fractal.

Intuitively, Model 2 describes “causes” (real trees, lakes and mountains) of sensory patterns (visions of trees, lakes and mountains).
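
A toy sketch of the object permanence example, under strong simplifying assumptions: the only observation is whether the landscape is seen, the landscape is visible exactly when the eyes are open, and Model 2 is told whether the eyes will be open on the next step (it is the observer’s own action). Both models reuse the same `persist` rule, applied to different things, to mirror the "almost identical mathematical ontology" point. All names are illustrative.

```python
def persist(state):
    """Shared rule used by both models: whatever is there stays there."""
    return state


def model_1(prev_seen, eyes_open_next):
    # Persistence applied directly to the 2D visual field.
    return persist(prev_seen)


def model_2(world_has_landscape, eyes_open_next):
    # Persistence applied to a hidden world state, then viewed through the eyes.
    return persist(world_has_landscape) and eyes_open_next


def accuracies(eyes):
    """Prediction accuracy of each model over a sequence of eyes-open flags."""
    seen = list(eyes)  # assumption: the landscape is seen iff the eyes are open
    hits_1 = hits_2 = 0
    for t in range(1, len(eyes)):
        hits_1 += model_1(seen[t - 1], eyes[t]) == seen[t]
        hits_2 += model_2(True, eyes[t]) == seen[t]
    n = len(eyes) - 1
    return hits_1 / n, hits_2 / n


acc_1, acc_2 = accuracies([True, True, False, True, True, False, False, True])
print(acc_1, acc_2)
```

Model 1 fails exactly at the moments when the eyes close or open, while Model 2 predicts the whole sequence; the hidden `world_has_landscape` variable is the toy counterpart of a "cause".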

Example: reductionism

You notice that most visible objects move smoothly (don’t disappear, don’t teleport).

“Most visible objects move smoothly in a 2D/3D space” is a good model for predicting sensory experience. (Model 1)

But there’s a model which is even better: “visible objects consist of smaller and invisible/less visible objects (cells, molecules, atoms) which move smoothly in a 2D/3D space”. (Model 2)

However, note that the mathematical ontology of both models is almost identical.

Intuitively, Model 2 describes “causes” (atoms) of sensory patterns (visible objects).

Example: a scale model

Imagine you’re alone in a field with rocks of different size and a scale model of the whole environment. You’ve already learned object permanence.

“Objects don’t move in space unless I push them” is a good model for predicting sensory experience. (Model 1)

But it has a little flaw. When you push a rock, the corresponding rock in the scale model moves too. And vice-versa.

“Objects don’t move in space unless I push them; there’s a simple correspondence between objects in the field and objects in the scale model” is a better model for predicting sensory experience. (Model 2)

However, note that the mathematical ontology of both models is identical.

Intuitively, Model 2 describes a “cause” (the scale model) of sensory patterns (rocks of different size being at certain positions). Though you can reverse the cause and effect here.

Example: empathy

If you put your hand on a hot stove, you quickly move the hand away. Because it’s painful and you don’t like pain. This is a great model (Model 1) for predicting your own movements near a hot stove.

But why do other people avoid hot stoves? If another person touches a hot stove, pain isn’t instantiated in your sensory experience.

Behavior of other people can be predicted with this model: “people have similar sensory experience and preferences, inaccessible to each other”. (Model 2)

However, note that the mathematical ontology of both models is identical.

Intuitively, Model 2 describes a “cause” (inaccessible sensory experience) of sensory patterns (other people avoiding hot stoves).

Counterexample: a chaotic universe

Imagine yourself in a universe where your sensory experience is produced by very simple, but very chaotic laws. Despite the chaos, your sensory experience contains some simple, relatively stable patterns. Purely by accident.

In such a universe, RSMs might not find any “causes” underlying particular sensory patterns (except the simple chaotic laws).

But in such a case there are probably no “causes”.
Q Home 24 Oct 2022 0:53 UTC
1 point
0
I want to share a way of dissolving disagreements. It’s also a style of thinking. I call it “the method of statements”, here’s the description:
1. Take an idea, theory or argument. Split it into statements of a certain type. (Or multiple types.)
2. Evaluate the properties of the statements. Do they exist (i.e. can they be defined, does anything connect them)? Can they be used, are they constructive? Are they simple? Are they important? Etc.
3. Try to extract as much information as possible from those statements.
One rule:
- A statement counts as existing even if it can’t be formalized or expressed in a particular epistemology.
When you evaluate an argument with the method of statements, you don’t evaluate the “logic” of the argument or its “model of the world”. You evaluate properties of statements implied by the argument. Do statements in question correlate with something true or interesting?

You also may apply the method to analyzing information. You may split the information about something into statements of a certain type and study the properties of those statements. I can’t define what a “statement” is. It’s the most basic concept. Sometimes “statements” are facts, but not always. “Statements” may even be non-verbal. A set of statements can be defined in any way possible.

I will give a couple of less controversial (for rationalists) examples of applying the method. Then a couple of more controversial examples. And then share a couple of my own ideas in the context of the method. But before this...

Rationalist taboo

There’s a technique called “rationalist taboo”. Imagine a disagreement about this question:

If a tree falls in a forest and no one is around to hear it, does it make a sound? (on wikipedia)

We may try to resolve the disagreement by trying to replace the label “sound” with its more specific contents. Are we talking about sound waves, the vibration of atoms? Are we talking about the subjective experience of sound? Are we talking about mathematical models and hypothetical imaginary situations? An important point is that we don’t try to define what “sound” is, because it would only lead to a dispute about definitions.

The method of statements is somewhat similar to taboo. But with the method we “taboo” ideas and arguments themselves. We take an idea and replace it with its more specific, more atomic semantic contents. We take a thought and split it into smaller thoughts. It’s “taboo” applied on a different level, a “meta-taboo” applied to the process of thinking itself.

However, rationalist taboo and the method of statements may be in direct conflict. Because rationalist taboo assumes that a “statement” is meaningless if it can’t be formalized or expressed in a particular epistemology.

Reception of ideas

This part of the post is about reception of two LessWrong ideas/topics.

Evaluating Logical decision theory (LDT)

What is Logical Decision Theory (LDT)? You can check out “An Introduction to Logical Decision Theory for Everyone Else”

With the usual way of thinking, even if a person is sympathetic enough to LDT they may react like this:
- I think the idea of LDT is important, but...
- I’m not sure it can be formalized (finished).
- I’m not sure I agree with it. It seems to violate such and such principles.
- You can get the same results by fixing old decision theories.
- Conclusion: “LDT brings up important things, but it’s nothing serious right now”.
(The reaction above is inspired by criticisms by William David MacAskill and Prof. Wolfgang Schwarz)

With the method of statements, being sympathetic enough to LDT automatically entails this (or even more positive) reaction:
1. There exist two famous types of statements: “causal statements” (in CDT) and “evidential statements” (in EDT). LDT hypothesizes the third type, “logical statements”. The latter statements definitely exist. They can be used in thinking, i.e. they are constructive enough. They are simple enough (“conceptual”). And they are important enough. This already makes LDT a very important thing. Even if you can’t formalize it, even if you can’t make a “pure” LDT.
2. Logical statements (A) in decision theory are related to another type of logical statements (B): statements about logical uncertainty. We have to deal with the latter ones even without LDT. Logical statements (A) are also similar to more established albeit not mainstream “superrational statements” (see Superrationality).
3. Logical statements can be translated into other types of statements. But this doesn’t justify avoiding talking about them.
4. Conclusion: “LDT should be important, whatever complications it has”.
The method of statements dissolves a number of things:
- It dissolves counter-arguments about formalization: “logical statements” either exist or don’t, and if they do they carry useful information. It doesn’t matter if they can be formalized or not.
- It dissolves minor disagreements. “Logical statements” either can or can’t be true. If they can there’s nothing to “disagree” about. And true statements can’t violate any (important) principles.
Some logical suggestions do seem weird and unintuitive at first. But this weirdness may dissolve when you notice that those suggestions are properties of simple statements. If those statements can be true, then there’s nothing weird about the suggestions. At the end of the day, we don’t even have to follow the suggestions while agreeing that the statements are true and important. Statements are sources of information, nothing more and nothing less.
- It dissolves the confusion between different possible theories. “Logical statements” are either important or not. If they are, then it doesn’t matter in which language you express them. It doesn’t even matter what theory is correct.
I think the usual way of thinking may be very reasonable but, ultimately, it’s irrational, because it prompts unjust comparisons of ideas and favoring ideas which look more familiar and easier to understand/implement in the short run. With the usual way of thinking it’s very easy to approach something in the wrong way and “miss the point”.
- Q Home 26 Oct 2022 7:37 UTC
  1 point
  0
  Parent
  The whole draft is here. (and the newer one is here)
  
  Edit: my latest draft should be here.
Q Home 17 Sep 2022 4:06 UTC
1 point
0
If you want to describe human values, you can use three fundamental types of statements (and mixes between the types). Maybe there’re more types, but I know only those three:
1. Statements about specific states of the world, specific actions. (Atomic statements)
2. Statements about values. (Value statements)
3. Statements about general properties of systems and tasks. (X statements)
Any of those types can describe unaligned values. So, any type of those statements still needs to be “charged” with values of humanity. I call a statement “true” if it’s true for humans.
We need to find the statement type with the best properties. Then we need to (1) find a language for this type of statements and (2) encode some true statements and/or describe a method of finding “true” statements. If we succeed, we’ve solved the Alignment problem.
I believe X statements have the best properties, but their existence is almost entirely ignored in the Alignment field.
I want to show the difference between the statement types. Imagine we ask an Aligned AI: “if a human asked you to make paperclips, would you kill the human? Why not?” Possible answers with different statement types:
1. Atomic statements: “it’s not the state of the world I want to reach”, “it’s not the action I want to do”.
2. Value statements: “because life, personality, autonomy and consent are valuable”.
3. X statements: “if you kill, you give the human less than the human asked, less than nothing: it doesn’t make sense for any task”, “destroying the causal reason of your task (human) is often meaningless”, “inanimate objects can’t be worth more than lives in many trade systems”, “it’s not the type of task where killing would be an option”, “killing humans makes paperclips useless since humans use them: making useless stuff is unlikely to be the task”, “reaching states of no return should be avoided in many tasks” (Impact Measures).
X statements have those better properties compared to other statement types:
- X statements have more “density”. They give you more reasons to not do a bad thing. For comparison, atomic statements always give you only one single reason.
- X statements are more specific, but equally broad compared to value statements.
- Many X statements not about human values can be translated/transferred into statements about human values. (It’s valuable for learning, see Transfer learning.)
- X statements let you describe something universal for all levels of intelligence. For example, they don’t exclude smart and unexpected ways to solve a problem, but they exclude harmful and meaningless ways.
- X statements are very recursive: one statement can easily take another (or itself) as an argument. X statements more easily clarify and justify each other compared to value statements.
Do X statements exist?
I can’t define human values, but I believe values exist. The same way I believe X statements exist, even though I can’t define them.
I think existence of X statements is even harder to deny than existence of value statements. (Do you want to deny that you can make statements about general properties of systems and tasks?) But you can try to deny their properties.
X statements in the Alignment field
X statements are almost entirely ignored in the field (I believe), but not completely ignored.
Impact measures (“affecting the world too much is bad”, “taking too much control is bad”) are X statements. But they’re a very specific subtype of X statements.
Normativity (by abramdemski) is a mix between value statements and X statements. But statements about normativity lack most of the good properties of X statements. They’re too similar to value statements.
Q Home 8 Sep 2022 8:15 UTC
1 point
0
I want to discuss a particular failure mode of communication and thinking in general. I think it affects our thinking about AI Alignment too.
Communication. A person has a vague, but useful idea (P). This idea is applicable on one level of the problem. It sounds similar to another idea (T), applicable on a very different level of the problem. Because of the similarity nobody can understand the difference between (P) and (T). People end up overestimating the vagueness of (P) and not considering it. Because people aren’t used to mapping ideas to “levels” of a problem. Information that should give more clarity (P is similar to T) ends up creating more confusion. I think this is irrational, it’s a failure of dealing with information.
Thinking in general. A person has a specific idea (T) applicable on one level of a problem. The person doesn’t try to apply a version of this idea on a different level. Because (1) she isn’t used to it (2) she considers only very specific ideas, but she can’t come up with a specific idea for other levels. I think this is irrational: rationalists shouldn’t shy away from vague ideas and evidence. It’s a predictable way to lose.
A comical example of this effect:
- A: I got an idea. We should cook our food in the oven. Using the oven itself. I haven’t figured out all the details yet, but...
- B: We already do this. We put the food in the oven. Then we explode the oven. You can’t get more “itself” than this.
- A: I have something else on my mind. Maybe we should touch the oven in multiple places or something. It may turn it on.
- B: I don’t want to blow up with the oven!
- A: We shouldn’t explode the oven at all.
- B: But how does the food get cooked?
- A: I don’t know the exact way it happens… but I guess it gets heated.
- B: Heated but not exploded? Sounds like a distinction without a difference. Come back when you have a more specific idea.
- A: But we have only 2 ovens left, we can’t keep exploding them! We have to try something else!
B can’t understand A, because B thinks about the problem on the level of “chemical reactions”. On that level it doesn’t matter what heats the food, so it’s hard to tell the difference between exploding the oven and using the oven in other ways.
The bad news is that the “taboo technique” (replacing a concept with its components: “unpacking” a concept) may fail to help, because A doesn’t know the exact way to turn on the oven or the exact way the oven heats the food. Her idea is very useful if you try it, but it doesn’t come with a set of specific steps.
And the worst thing is that A may not be there in the first place. There may be no one around to even bother you to try to use your oven differently.
I think rationality doesn’t have a general cure for this, but this may actually be one of the most important problems of human reasoning. I think all of human knowledge is diseased with this. Our knowledge is worse than Swiss cheese and we don’t even try to fill the gaps.
Any good idea that was misunderstood and forgotten—was forgotten because of this. Any good argument that was ignored and ridiculed—was ignored because of this. It all got lost in the gaps.
Metrics
I think one method to resolve misunderstanding is to add some metrics for comparing ideas. Then talk about something akin to probability distributions over those metrics. A could say:
“Instruments have parts with different functions. Those functions are not the same, even though they may intersect and be formulated in terms of each other:
1. Some parts create the effect of the instrument. E.g. the head of a hammer when it smashes a nail.
2. Some parts control the effect of the instrument. E.g. the handle of a hammer when a human aims it at a nail.
In practice, some parts of the instrument realize both functions. E.g. the handle of a hammer actually allows you not only to control the hammer, but also to speed up the hammer more effectively.
When we blow up the oven, we use 99% of the first function of the oven. But I believe we can use 80% of the second function and 20% of the first.”
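A's statement can be written down as two mixes over the same pair of functions and then compared. The following is only a minimal sketch: the names `create_effect` and `control_effect`, the choice to give the leftover 1% of the "explode" mix to control, and the overlap measure (sum of per-function minimums) are all assumptions made for illustration, not something the metric idea itself fixes.

```python
# Hypothetical illustration: the weights and the overlap measure are assumptions.

def overlap(idea_a, idea_b):
    """Shared weight between two ideas: sum of the per-function minimum."""
    functions = set(idea_a) | set(idea_b)
    return sum(min(idea_a.get(f, 0.0), idea_b.get(f, 0.0)) for f in functions)

# B's current practice: almost everything goes to "create the effect".
# The remaining 1% assigned to control is an assumption.
explode_oven = {"create_effect": 0.99, "control_effect": 0.01}

# A's vague proposal: mostly "control the effect".
use_controls = {"create_effect": 0.20, "control_effect": 0.80}

print(f"overlap: {overlap(explode_oven, use_controls):.2f}")  # overlap: 0.21
```

Under these assumed numbers the two ideas share only about a fifth of their weight, even though both can be described as "cooking with the oven itself". That is the kind of difference B could not see without a metric.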
Complicated Ideas
Let’s explore some ideas to learn to attach ideas to “levels” of a problem and seek “gaps”. “(gap)” means that the author didn’t consider/didn’t write about that idea.
Two of those ideas are from math. Maybe I shouldn’t have used them as examples, but I wanted to give diverse examples.
(1) “Expected Creative Surprises” by Eliezer Yudkowsky. There are two types of predictability:
1. Predictability of a process.
2. Predictability of its final outcomes.
Sometimes they’re the same thing. But sometimes you have:
- An unpredictable process with predictable final outcomes. E.g. when you play chess against a computer: you don’t know what the computer will do to you, but you know that you will lose.
- (gap) A predictable process with unpredictable final outcomes. E.g. if you don’t have enough memory to remember all past actions of the predictable process. But the final outcome is created by those past actions.
(2) “Belief in Belief” by Eliezer Yudkowsky. Beliefs exist on three levels:
1. Verbal level.
2. First “muscle memory” level. Your anticipations of direct experiences.
3. Second “muscle memory” level. Your reactions to your own beliefs.
Sometimes a belief exists on all those levels and contents of the belief are the same on all levels. But sometimes you get more interesting types of beliefs, for example:
- A person says that “the sky is green”. But the person behaves as if the sky is blue. But the person instinctively defends the belief “the sky is green”.
- Not verbally formulated “muscle memory” belief. Some intuition you didn’t think to describe or can’t describe.
- (gap) Slowly forming “muscle memory” belief created by your muscle reactions to other beliefs. Some intuition/preference that only started to form, but for now exists mainly as a reaction to other intuitions and preferences.
(3) “The Real Butterfly Effect”, explained by Sabine Hossenfelder. There’re two ways in which consequences of an event spread:
1. A small event affects more and more things with time.
2. Event on a small scale affects larger and larger scale events.
In a way it’s kind of the same thing. But in a way it’s not:
- One Butterfly Effect means sensitivity to small events (butterflies).
- Another Butterfly Effect says that there’s an infinity of smaller and smaller events (butterflies). And even if you account for them all you have a time limit for prediction.
(4) “P=NP, relativisation, and multiple choice exams”, Baker-Gill-Solovay theorem explained by Terence Tao. There are two dodgy things:
1. Cheating.
2. Simulation of cheating.
Sometimes they are “the same thing”, sometimes they are not.
(5) “Free Will and conscious experience are a special type of illusion.” An idea of Daniel Dennett. There are 2 types of illusions:
1. Illusions which are complete lies that don’t correspond to anything real. E.g. a mirage in a desert.
2. Illusions that simplify complicated reality. E.g. when you close a program by clicking on it with the arrow: the arrow didn’t really stop the program (even though it kind of did), it’s a drastic simplification of what actually happened (rapid execution of thousands of lines of code).
Conscious experience is an illusion of the second type, Dennett says. I don’t agree, but I like the idea and think it’s very important.
Somewhat similar to Fictionalism: there are lies and there are “truths of fiction”, “useful lies”. Mathematical facts may be the same type of facts as “Macbeth is insane/Macbeth dies”.
(6) “Tlön, Uqbar, Orbis Tertius” by Jorge Luis Borges. A language has two functions:
1. First function focuses on describing objects.
2. Second function focuses on describing properties of objects.
Different languages can have different focus on those functions:
- Many human languages focus on both functions equally (fifty-fifty).
- Fictional languages of Borges focus 100% on properties. Objects don’t exist/there are way too many particular objects.
- (gap) Synesthesia-like “languages”. They focus 80% on properties and 20% on objects.
I think there’s an important gap in Borges’s ideas: Borges doesn’t consider a language with extremely strong, but not absolute emphasis on the second function. Borges criticizes his languages, but doesn’t steelman them.
(7) “Pierre Menard, Author of the Quixote” by Jorge Luis Borges. There are 3 ways to copy a text, plus an “anti-option”:
1. You can copy the text.
2. You can copy the action of writing the text.
3. You can copy the thoughts behind the text.
4. You can change the text. (“anti-option”)
Pierre Menard wants to copy 1% of 1, 98% of 2 and 1% of 3: Pierre Menard wants to imagine exactly the same text but with completely different thoughts behind the words.
(“gap”) Pierre Menard also could try to go for 100% of 3 and for “anti 99%” of 4: try to write a completely new text by experiencing the same thoughts and urges that created the old one.
- Q Home 11 Sep 2022 10:21 UTC
  1 point
  0
  Parent
  Puzzles
  You can use the same thinking to analyze/classify puzzles.
  Inspired by Pirates of the Caribbean: Dead Man’s Chest. Jack has a compass that can lead him to a thing he desires. Jack wants to find a key. Jack can have those experiences:
  1. Experience of the real key.
  2. Experience of a drawing of the key.
  3. Pure desire for the key.
  In order for the compass to work Jack may need (almost) any mix of those: for example, maybe pure desire is enough for the compass to work. But maybe you need to mix pure desire with seeing at least a drawing of the key (so you have more of a picture of what you want).
  - Gibbs: And whatever this key unlocks, inside there’s something valuable. So, we’re setting out to find whatever this key unlocks!
  - Jack: No! If we don’t have the key, we can’t open whatever it is we don’t have that it unlocks. So what purpose would be served in finding whatever need be unlocked, which we don’t have, without first having found the key what unlocks it?
  - Gibbs: So—We’re going after this key!
  - Jack: You’re not making any sense at all.
  - Gibbs: ???
  Jack has those possibilities:
  1. To go after the chest. Foolish: you can’t open the chest.
  2. To go after the key. Foolish: you can get caught by Davy Jones.
  Gibbs thinks about doing 100% of 1 or 100% of 2 and gets confused when he learns that’s not the plan. Jack thinks about 50% of 1 and 50% of 2: you can go after the chest in order to use it to get the key. Or you can go after the chest and the key “simultaneously” in order to keep Davy Jones distracted and torn between two things.
  Braid, Puzzle 1 (“The Ground Beneath Her Feet”). You have two options:
  1. Ignore the platform.
  2. Move the platform.
  You need 50% of 1 and 50% of 2: first you ignore the platform, then you move the platform… and rewind time to mix the options.
  Braid, Puzzle 2 (“A Tingling”). You have the same two options:
  1. Ignore the platform.
  2. Move the platform.
  Now you need 50% of 1 and 25% of 2: you need to rewind time while the platform moves. In this time-manipulating world outcomes may not add up to 100% since you can erase or multiply some of the outcomes/move outcomes from one timeline to another.
  Argumentation
  You can use the same thing to analyze arguments and opinions. Our opinions are built upon thousands and thousands of “false dilemmas” that we haven’t carefully revised.
  For example, take a look at those contradicting opinions:
  1. Humans are smart. Sometimes in very non-obvious ways.
  2. Humans are stupid. They make a lot of mistakes.
  Usually people think you have to believe either “100% for 1” or “100% for 2”. But you can believe in all kinds of mixes.
  For example, I believe in 90% of 1 and 10% of 2: people may be “stupid” in this particular nonsensical world, but in a better world everyone would be a genius.
  Ideas as bits
  You can treat an idea as a “(quasi)probability distribution” over some levels of a problem/topic. Each detail of the idea gives you a hint about the shape of the distribution. (Each detail is a bit of information.)
  We usually don’t analyze information like this. Instead of cautiously updating our understanding with every detail of an idea we do this:
  1. try to grab all details together
  2. get confused (like Gibbs)
  3. throw most of the details out and end up with an obviously wrong understanding.
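  The cautious alternative to the three steps above can be sketched as a small update loop: every detail nudges the weights over possible readings instead of being thrown out. This is only an illustration. The three readings, the fit scores, and the decision to rescale the weights so they sum to 1 are all hypothetical choices (a true "quasiprobability" would not have to sum to 1).

```python
def update(weights, detail):
    """Multiply each reading's weight by how well the detail fits it, then rescale."""
    scored = {reading: w * detail.get(reading, 1.0) for reading, w in weights.items()}
    total = sum(scored.values())
    return {reading: s / total for reading, s in scored.items()}

# Gibbs starts undecided between three readings of Jack's plan.
weights = {"chest_only": 1 / 3, "key_only": 1 / 3, "chest_and_key": 1 / 3}

# Hypothetical fit scores for two details of Jack's speech.
details = [
    {"chest_only": 0.1},  # "without the key we can't open the chest"
    {"key_only": 0.1},    # "going after this key" makes no sense either
]

for detail in details:
    weights = update(weights, detail)

best = max(weights, key=weights.get)
print(best)  # chest_and_key
```

  With these made-up scores neither detail is discarded, and the mixed reading ends up with most of the weight instead of causing confusion.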
  Note: maybe you can apply the same idea about “bits” to chess (and other games). Each idea and each small advantage you need to come up with the winning plan is a “bit” of information/advantage. Before you get enough information/advantage bits the position looks like a cloud where you don’t see what to do.
  Richness of ideas
  I think you can measure “richness” of theories (and opinions and anything else) using the same quasiprobabilities/bits. But this measure depends on what you want.
  Compare those 2 theories explaining different properties of objects:
  - (A) Objects have different properties because they have different combinations of “proto properties”.
  - (B) Objects have different properties because they have different organization of atoms.
  Let’s add a metric to compare 2 theories:
  1. Does the theory explain why objects exist in the first place?
  2. Does the theory explain why objects have certain properties?
  Let’s say we’re interested in physical objects. B-theory explains properties through 90% of 1 and 10% of 2: it makes properties of objects equivalent to the reason for their existence. A-theory explains properties through 100% of 2. B-theory is more fundamental, because it touches more on a more fundamental topic (existence).
  But if we’re interested in mental objects… B-theory explains only 10% of 2 and 0% of 1. And A-theory may be explaining 99% of 1. If our interests are different A-theory turns out to be more fundamental.
  When you look for a theory (or opinion or anything else), you can treat any desire and argument as a “bit” that updates the quasiprobabilities like the ones above.
  Discussion
  We could help each other to find gaps in our thinking! We could do this in this thread.
  Gaps of Alignment
  I want to explain what I perceive as missed ideas in Alignment. And discuss some other ideas.
  (1) You can split possible effects of AI’s actions into three domains. All of them are different (with different ideas), even though they partially intersect and can be formulated in terms of each other. Traditionally we focus on the first two domains:
  1. (Not) accomplishing a goal. “Utility functions” are about this.
  2. (Not) violating human values. “Value learning” is about this.
  3. (Not) modifying a system without breaking it. (Not) doing a task in an obviously meaningless way. “Impact measures” are about this.
  I think the third domain is mostly ignored and it’s a big blind spot.
  I believe that “human (meta-)ethics” is just a subset of a way broader topic: “properties of (any) systems”. And we can translate the method of learning properties of simple systems into a method of learning human values (a complicated system). And we can translate results of learning those simple systems into human moral rules. And many important complicated properties (such as “corrigibility”) have analogies in simple systems.
  (2) Another “missed idea”:
  1. Some people analyze human values as a random thing (random utility function).
  2. Some people analyze human values as a result of evolution.
  3. Some analyze human values as a result of people’s childhoods.
  4. Not a lot of people analyze human values as… a result of the way humans experience the world.
  “True Love(TM) towards a sentient being” feels fundamentally different from “eating a sandwich”, so it could be evidence that human experiences have an internal structure and that structure plays a big role in determining values. But not a lot of models (or simply 0) take this “fact” into account. Not surprisingly, though: it would require a theory of human subjective experience. But still, can we just ignore this “fact”?
  (3) Preference utilitarianism says:
  - You can describe entire ethics by a (weighted) aggregation of a single microscopic value. This microscopic value is called “preference”.
  I think there’s a missed idea: you could try to describe entire ethics by a weighted aggregation of a single… macroscopic value.
  (4) Connectionism and Connectivism. I think this is a good example of a gap in our knowledge:
  1. There’s the idea of biological or artificial neurons.
  2. (gap)
  3. There’s the idea that communication between humans is like communication between neurons.
  I think one layer of the idea is missing: you could say that concepts in the human mind are somewhat like neurons. Maybe human thinking is like a fractal, looks the same on all levels.
  (5) Bayesian probability. There’s an idea:
  - You can describe possible outcomes (microscopic things) in terms of each other. Using Bayes’ rule.
  I think this idea should have a “counterpart”: maybe you can describe macroscopic things in terms of each other. And not only outcomes. Using something somewhat similar to probabilistic reasoning, to Bayes’ rule.
  That’s what I tried to do in this post.
Q Home 5 Sep 2022 10:39 UTC
1 point
0
(Drafts of a future post.) I want to confront/explain my optimism. Here’s a thought experiment to explain what “optimism” means to me:
Imagine a world like Earth. There’s an underground prison. People live there for generations. The prison is constructed in such a way that you can live there “forever”. People are not aware that the world outside of the prison exists.
One person in the prison imagines freedom. But she doesn’t have evidence (or so it seems).
- A: I got an idea: maybe we shouldn’t optimize our life in prison. Maybe we can escape to freedom.
- B: “Freedom”? What is this? The prison is the entire world, what do you mean by “escaping” it?
- (A explains)
- A: I think freedom is likely enough to exist.
- B: Why do you believe this?
- A: I can imagine it being true.
- B: Since when do we imagine evidence?
- A: You don’t understand, some things are true simply because you can imagine them.
- B: No, they aren’t.
- A: They are.
- B: I can imagine a unicorn, does the unicorn exist?
- A: Unicorns don’t matter. And unicorns do exist: we could build a unicorn if we tried hard enough.
- B: This is some Dark Arts Philosophy of Insanity right there.
- A: Do you remember Kant’s idea about a priori synthesis? It’s about the type of knowledge that combines innate assumptions and real world evidence.
- B: We already optimized crazy philosophy books into toilet paper.
- A: Can you disprove my idea?
- B: This is a wrong question! Why are we prioritizing an idea without evidence and then trying to disprove it?
- A: OK. Try this: remember all the things that aren’t directly related to the prison. Remember the other prisoners, their personalities. Remember the patterns on the prison stones. Remember the way light reflects from the surfaces and casts shadows. 99% of your knowledge doesn’t say that we’re in prison. So why did you flip into believing in this prison just because you saw a prison wall? Escape the prison of your mind.
- B: Nah. We’ve done enough Bayesian updating: we are in the prison. And there’s no trace of anything else.
...
- B: Did you forget how harsh this prison is? We’re beyond the reach of god.
- A: Well, that’s the thing: it’s too harsh. It’s much too harsh compared to what it could’ve been.
- B: We have to live in reality.
- A: If this prison is all we have, then this world isn’t worth living in. So I don’t care if I’m wrong.
- B: Don’t you care about other people? Don’t you have something to protect? Why do you need some additional thing (“freedom”) to care about other people?
- A: Yes, I care! That’s what made me think about freedom. Yes, you’re right. It can make sense to care only about what we have. But, I don’t know how to explain it… it’s just less “probable” to make sense if there’s nothing but this prison?
- B: Forget about freedom, we need to optimize our life in prison to save the bigger number!
- A: I think values need some free space to meaningfully exist, some possibility of a meaningful choice. You talk about values, but then you say that our values should be 100% controlled by this prison. And that our actual reasons to value each other don’t matter. Only the prison rules matter. In your worldview, our values don’t have any real effect on the world. And we never should act based on our personal values, only at times when our decisions are meaningless.
- B: This is word games. Or some “free will” nonsense.
- A: If you’re right, then we got a pretty degenerate version of “values”. But my action based on my personal values is this: I’m going to find the escape from this prison.
- B: The bigger number!
- A: If I’m wrong, this means I’m not well-suited for this prison anyway. You can get by without me. One wasted opportunity is worth the chance to escape.
- B: No, without your help we won’t save the bigger number! And you’re good enough. Or you can get better.
- A: Your philosophy seems to be open to exploitation. What if this prison were run by maniacs? Would we need to torture each other for ages hoping for the promised survival of the “bigger number”? Hoping that anybody’s going to keep that promise.
...
- B: Ok, let’s try one more time: when did you start to think about freedom? When did you think about it the last couple of times? Why?
- A: It was just a feeling. I was just running in the prison corridor… and I thought that, theoretically, I could run forever, and, theoretically, there could be no walls.
- A: I have feelings like this very often. They don’t let me forget the idea. For example, I look at a stone and think “maybe the shape of our world could be at least as complex and interesting as the surface of this stone, maybe it could be just a little bit more complex and interesting than prison rectangles”.
...
- B: I’m getting tired of this. I think you have an elephant in the brain. And a snail in your beliefs. And backwards rationalizations. And a cockroach in your Bayesian updating. Broken software, broken hardware. And biases...
- A: No. We have more than this prison. This is the most true thing I know. If it’s not true, then “truth” itself doesn’t mean anything but arbitrary noise. Why don’t we eat each other, because we’re humans or just because this decade’s prison rules don’t tell us to do so? If it’s not true in this world, then it’s true Somewhere Else. But maybe we are Somewhere Else… and cutting ourselves off from that place may be worse than death.
Prison System
What is the prison of our world? I can think of 11 prisons. Those are 11 main factors that (try to) limit my optimism.
Prison of Death. Humans die. However, this fact doesn’t imprison the entire humanity. And death isn’t logically necessary.
Prison of Pain. People experience pain. And if we didn’t, the pain would still exist as a concept, there would be a way to create pain (maybe). This fact doesn’t imprison entire humanity and pain isn’t necessary.
Prison of Experience. Your experience doesn’t matter. 99% of your experience doesn’t give you knowledge (power), doesn’t let you help anybody and barely matters in the culture. A couple of math theorems are “more important” (give more power, better remembered in the culture) than 50 years of someone’s suffering.
- I don’t believe this: I believe there should be a way to make our experience really matter.
Prison of Communication. This is one of the prisons from the thought experiment. I can’t communicate my ideas. I can’t communicate the value I feel. If I turn out to be wrong, if I fail, I’ll never be able to tell my story and why I believed what I believed. I’ll never share the way I saw the world. And I’ll never know the “true”, non-generic reason for my failure.
- I think this prison doesn’t “really” exist: there should be a more effective way to communicate.
Prison of Complexity. Human type of thinking can exist only on a certain level of computational power.
- Weak problem: our level of computational power can be attacked by brute force, by AI and AGI. AI can generate content faster than you. AGI can think faster than you.
- Strong problem: What happens beyond our level of computational power? Are super-intelligent beings similar to humans, in what ways? Do they have personalities? Does humanity “scale-up” or not?
- Opinion: I believe there’s something important about the way humans think. Doesn’t matter if we’re imprisoned by complexity or not.
Prison of Inequality. The possibility that people are not “equal” in some important aspect. This is a pessimistic thing because it contradicts the concept of personality: why are people different if difference is bad? And if difference is good, then inequality won’t let us notice this anyway.
- I don’t believe in inequality.
Prison of Badness. Humans are born to think, but human thinking is the baddest and most egoistic and broken thing ever (compared to Bayes and “shut up and multiply”).
- I don’t believe in this. And this “prison” would be just ridiculous if other prisons didn’t exist. It pales in comparison with the other prisons.
...
Here’re some more, it feels as if they have a different flavor:
Prison of Impossible Problems. Humanity is bound to face “unsolvable” problems. Unable to “solve extinction” in time.
Prison of Time/Opportunities. You don’t have the time and opportunities to develop your potential. And to experience everything you need.
Prison of Free Will. We don’t have free will.
Prison of Afterlife/God. There’s no afterlife and no God.
Jailbreak
Q Home 2 Sep 2022 20:36 UTC
1 point
0
Imagine the classic paperclip maximizer thought experiment. We tell an AGI to make paperclips, and the AGI uses all the matter of the Universe for it.
But now imagine a different version: we tell the AGI to “make paperclips that cost 1¢ (in human economy)”. Now killing everyone isn’t a solution: destroying humanity would destroy the economy.
Isn’t it an interesting version of the thought experiment? Of course, everything can/will go wrong anyway, but maybe in a way funnier and more convoluted way. More funny and convoluted than “maximize human smiles”, for example. Because AGI needs to take into account effects of a system (economic system), not just fulfill some fixed conditions.
I first mentioned the idea in this comment, a couple of people disagreed.
What links here?
- Can “Reward Economics” solve AI Alignment? by Q Home (7 Sep 2022 7:58 UTC; 3 points)
- gwern 2 Sep 2022 22:44 UTC
  6 points
  0
  Parent
  
  we tell the AGI to “make paperclips that cost 1¢ (in human economy)”. Now killing everyone isn’t a solution: destroying humanity would destroy the economy.
  
  Seems to collapse easily: how does the AI decide what costs $0.01, exactly? Does it use the last price of a transaction on a market, and is doing mark-to-market? Well… among many other problems that occur to me, the most immediate one is that the price can’t change if there aren’t any more transactions, now can it. Nothing about ‘make paperclips that would cost $0.01’ would seem to rule out market manipulation, monopolization, or destruction. No market, no changes in the price you are marking to, no risk or volatility, no crash in prices due to oversupply, and enables efficient planning for the future and maximizing production of paperclips that would have cost $0.01 on what used to be Earth.
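  The mark-to-market point can be made concrete with a toy model. This is a rough sketch under stated assumptions: the `Market` class, its `open` flag, and the starting price of 0.05 are hypothetical, and "price" is taken to mean nothing more than the price of the last recorded trade.

```python
class Market:
    """Toy market whose quoted price is simply the price of the last trade."""

    def __init__(self, last_price):
        self.last_price = last_price
        self.open = True

    def trade(self, price):
        # A trade is the only thing that can move the quoted price.
        if not self.open:
            raise RuntimeError("no market left to trade on")
        self.last_price = price

    def mark_to_market(self):
        # Marking to market just reads back the last trade price.
        return self.last_price


market = Market(last_price=0.05)  # hypothetical starting price
market.trade(0.01)                # one final sale at the target price
market.open = False               # the market is then destroyed

target_met = market.mark_to_market() == 0.01
print(target_met)  # True
```

  In this toy reading, closing the market does not violate the price condition; it freezes the condition in place, which is why the objective as stated does not rule out destroying the market.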
  
  (The humorous fictional version of this story would involve 2 of the last survivors locating a sensor of the AI, building a large hollow paperclip-shaped human habitat, and loudly and ostentatiously in front of the sensor, having the ‘owner’ sell it to the other survivor for exactly $0.01, and then buying a regular paperclip for the smallest amount they can write on an IOU using Knuth up notation, thereby establishing new market prices.)
  - Q Home 2 Sep 2022 23:28 UTC
    1 point
    0
    Parent
    I think destruction of the market should be ruled out easily. Say paperclips have to have this value on an active market.
    For manipulation, monopolization and “kill almost everyone and leave just a small market of 2 last survivors”… I have to make a post about this. I have a deeper idea (maybe) behind it than this particular example.
    My general idea is this: I think when you hook up AI’s rewards to a system that has to have certain properties, it leads to interesting effects and implications for Alignment. Because now the AI needs to care both about its rewards and also about the properties of the reward system. Many Alignment ideas implicitly try to achieve this anyway.
    Instead of explaining “monopolization is bad” (complicated and specific fact) you need to explain “100% controlling your own reward system is bad” (easier and more universal fact).
    The humorous fictional version of this story would involve 2 of the last survivors locating a sensor of the AI, building a large hollow paperclip-shaped human habitat
    I think some outcomes of paperclip maximization are qualitatively different from “everyone dies”, even if they’re still very bad. The outcomes in which AI has to leave at least some freedom/autonomy for humans (or some other system) are especially different. I think this is underexplored.
    I think reformulating Alignment problem as “reward system control” problem at worst allows you to formulate all the same problems with a new angle and at best gives useful insight about the solution.
    - gwern 3 Sep 2022 0:41 UTC
      6 points
      0
      Parent
      
      Say paperclips have to have this value on an active market.
      
      Defining ‘active market’ sounds quite difficult. Is any kind of software-mediated trading, as opposed to humans thrusting arms into the air, like HFT trading of stocks, an ‘active market’? Then fine, the AI creates agents which just wash-trade assets. (Better yet, it uses combinatorial markets to ensure bids/asks only execute that leave the price exactly the same or other such properties minimized/maximized/stabilized.)
      - Q Home 3 Sep 2022 1:52 UTC
        1 point
        0
        Parent
        To take a step back: do you see a potential conceptual distinction between my idea and classic paperclip maximization? (Of course, you don’t have to see it and/or agree that there’s one. And even if there’s one in theory it doesn’t mean it exists in practice.)
        Yes, it’s always hard to define the “true reward” AI should strive for. But properties of the system “true reward + AI” may be easier to define.
        Then fine, the AI creates agents which just wash-trade assets.
        If AI is able to reason/learn about properties of reward systems, then AI should be able to infer that taking 100% control over the reward system is a hack. Not something that can possibly be asked. So hacking the economy isn’t just a solution “human doesn’t expect” (some such solutions are very good), it’s a solution that can’t possibly be asked. This is one of the points of my idea: to introduce a distinction between unexpected solutions and nonsensical solutions.
        gwern 3 Sep 2022 18:40 UTC
        6 points
        2
        Parent
        
        do you see a potential conceptual distinction between my idea and classic paperclip maximization?
        
        No. Not without a lot more work, because markets, evolution, gradient descent, Bayesian inference, and logical inference/prediction markets all have various isomorphisms and formal identities, which can make their ‘differences’ more a matter of nominalist preference, notation, and emphasis than necessarily any genuine conceptual distinction. You can define AIs which are quite explicitly architected as ‘markets’ of various sorts, like the ‘Hayek machine’ or the ‘neural bucket brigade’, or interpret them as natural selection if you prefer on agents with log utility (evolutionary finance), and so on; are those “markets”, which can trade paperclips? Sure, why not.
        Q Home 3 Sep 2022 23:30 UTC
        1 point
        0
        Parent
        Thank you for taking the time to answer!
        I see that I need a post to at least explain myself. On the other hand, I worry about posting too soon (maybe it’s better to discuss something beforehand?). For the moment I decided to post this comment. I know, it’s not formal, but I wanted to show what type of AI thinking I have in mind. And sorry for an annoying semantic nitpick ahead.
        Not without a lot more work, because markets, evolution, gradient descent, Bayesian inference, and logical inference/prediction markets all have various isomorphisms and formal identities, which can make their ‘differences’ more a matter of nominalist preference, notation, and emphasis than necessarily any genuine conceptual distinction.
        I think we can use 2 metrics to compare those ideas:
        1. Does this idea describe what the AI tries to achieve?
        2. Does this idea describe how the AI thinks internally?
        My idea is 80% about (1) and 20% about (2). Gradient descent is 100% about (2). Evolution, Bayesian inference and prediction markets are 100% about (2).
        Because of this I feel like there’s only 20% chance those ideas are equivalent/there’s only 20% equivalence between them.
        So, I feel like those ideas are different enough: “an AI that works like a market” and “an AI that seeks markets in the world and analyzes their properties”.
Q Home 29 Aug 2022 22:34 UTC
1 point
0
(Drafts of a future post.)
Disclaimer: Of course, I don’t ever mean that we shouldn’t be worried about Alignment. I’m just trying to suggest new ways to think about values.
Motion is the fundamental value
You (Q) visit a small town and have a conversation with one of the residents (A).
- A: Here we have only one fundamental value. Motion. Never stop living things.
- Q: I can’t believe you can have just a single value. I bet it’s an oversimplification! There’re always many values and tradeoffs between them. Even for a single person outside of society.
A smashes a bug.
- Q: You just smashed this bug! It seems pretty stopped. Does it mean you don’t treat a bug as a “living thing”? But how do you define a “living thing”? Or does it mean you have some other values and make tradeoffs?
- A: No, you just need to look at things in context. (1) If we protected the motion of extremely small things (living parts of animals, insects, cells, bacteria), our value would contradict itself. We would need to destroy or constrain almost all moving organisms. And even if we wanted to do this, it would ultimately lead to a way smaller amount of motion for extremely small things. (2) There are too many bugs; protecting a small amount of their movement would constrain a big amount of everyone else’s movement. (3) On the other hand, you’re right. I’m not sure if a bug is high on the list of “living things”. I’m not all too bothered by the definition because there shouldn’t be even hypothetical situations in which the precise definition matters.
- Q: Some people build small houses. Private property. Those houses restrict other people’s movement. Is it a contradiction? Tradeoff?
- A: No, you just need to look at things in context. (1) First of all, we can’t destroy all physical things that restrict movement. If we could, we would be flying in space, unable to move (and dead). (2) We have a choice between restricting people’s movement significantly (not letting them build houses) and restricting people’s movement inconsequentially and giving them private spaces where they can move even more freely. (3) People just don’t mind. And people don’t mind the movement created by this “house building”. And people don’t mind living here. We can’t restrict large movements based on momentary disagreements of single persons. In order to have any freedom of movement we need such agreements. Otherwise we would have only chaos that, ultimately, restricts the movement of everyone.
- Q: Can people touch each other without consent, scream in public, lie on the roads?
- A: Same thing. To have freedom of movement we need agreements. Otherwise we would have only chaos that restricts everyone. By the way, we have some “chaotic” zones anyway.
- Q: Can the majority of people vote to lock every single person in a cage? If the majority is allowed to control movement, it would be the same logic, the same action of society. Yes, the situations are completely different, but you would need to introduce new values to differentiate them.
- A: We can qualitatively differentiate the situations without introducing new values. The actions look identical only out of context. When society agrees to not hit each other, the society serves as a proxy of the value of movement. Its actions are caused and justified by the value. When society locks someone without a good reason, it’s not a proxy of the value anymore. In a way, you got it backwards: we wouldn’t ever allow the majority to decide anything if it meant that the majority could destroy the value any day.
- A: A value is like a “soul” that possesses multiple specialized parts of a body: “micro movement”, “macro movement”, “movement in/with society”, “lifetime movement”, “movement in a specific time and place”. Those parts should live in harmony, shouldn’t destroy each other.
- Q: Are you consequentialists? Do you want to maximize the amount of movement? Minimize the restriction of movement?
- A: We aren’t consequentialists, even if we use the same calculations as a part of our reasoning. Or we can’t know if we are. We just make sure that our value makes sense. Trying to maximize it could lead to exploiting someone’s freedom for the sake of getting inconsequential value gains. Our best philosophers haven’t figured out all the consequences of consequentialism yet, and it’s bigger than anyone’s head anyway.
Conclusion of the conversation:
- Q: Now I see that the difference between “a single value” and “multiple values” is a philosophical question. And “complexity of value” isn’t an obvious concept too. Because complexity can be outside of the brackets.
- A: Right. I agree that “never stop living things” is a simplification. But it’s a better simplification than a thousand different values of dubious meaning and origin between all of which we need to calculate tradeoffs (which are impossible to calculate and open to all kinds of weird exploitations). It’s better than constantly splitting and atomizing your moral concepts in order to resolve any inconsequential (and meaningless) contradiction and inconsistency. Complexity of our value lies in a completely different plane: in the biases of our value. Our value is biased towards movement on a certain “level” of the world (not too micro- and not too macro- level relative to us). Because we want to live on a certain level. Because we do live on a certain level. And because we perceive on a certain level.
You can treat a value as a membrane, a boundary. Defining a value means defining the granularity of this value. Then you just need to make sure that the boundary doesn’t break, that the granularity doesn’t become too high (value destroys itself) or too low (value gets “eaten”). Granularity of a value = “level” of a value. Instead of trying to define a value in absolute terms as an objective state of the world (which can be changing) you may ask: in what ways is my value X different from all its worse versions? What is the granularity/level of my value X compared to its worse versions? That way you’ll understand the internal structure of your value. No matter what world/situation you’re in, you can keep its moral shape the same.
This example is inspired by this post and comments: (warning: politics) Limits of Bodily Autonomy. I think everyone there missed a certain perspective on values.
Sweets are the fundamental value
You (Q) visit another small town to interview another resident (W).
- W: When we built our AGI we asked it for only one thing: we want to eat sweets for the rest of our lives.
- Q: Oh. My. God.
- W: Now there are some free sweets flying around.
- Q: Did AI wirehead people to experience “sweets” every second?
- W: Sweets are not pure feelings/experiences, they’re objects. Money analogy: seeing money doesn’t make you rich. Another analogy: obtaining expensive things without money doesn’t make you rich. Well, it kind of does, but as a side-effect.
- Q: Did AI put people in a simulation to feed them “sweets”?
- W: Those wouldn’t be real sweets.
- Q: Did AI lock people in basements to feed them “sweets” forever?
- W: Sweets are just a part of our day. They wouldn’t be “sweets” if we ate them non-stop. Money analogy: if you’re sealed in a basement with a lot of money, it’s not worth anything.
- Q: Do you have any other food except sweets?
- W: Yes! Sweets are just one type of food. If we had only sweets, those “sweets” wouldn’t be sweets. Inflation of sweets would be guaranteed.
- Q: Did AI add some psychoactive substances in the sweets to make “the best sweets in the world”?
- W: I’m afraid those sweets would be too good! They wouldn’t be “sweets” anymore. Money analogy: if 1 dollar was worth 2 dollars, it wouldn’t be 1 dollar.
- Q: Did AI kill everyone after giving everyone 1 sweet?
- W: I like your ideas. But it would contradict the “Sweets Philosophy”. A sweet isn’t worth more than a human life. Giving people sweets is a cheaper way to solve the problem than killing everyone. Money analogy: imagine that I give you 1 dollar and then vandalize your expensive car. It just doesn’t make sense. My action achieved a negative result.
- Q: But you could ask AI for immortality!!!
- W: Don’t worry, we already have that! You see, letting everyone die costs way more than figuring out immortality and production of sweets.
- Q: Assume you all decided to eat sweets and neglect everything else until you die. Sweets became more valuable for you than your lives because of your own free will. Would AI stop you?
- W: AI would stop us, if the price of stopping us is reasonable enough. If we’re so obsessed with sweets, “sweets” are not sweets for us anymore. But AI remembers what the original sweets were! By the way, if we lived in a world without sweets where a sweet would give you more positive emotions than any movie or book, AI would want to change such a world. And AI would change it if the price of the change were reasonable enough (e.g. if we agreed with the change).
- Q: Final question… did AI modify your brains so that you will never move on from sweets?
- W: An important property of sweets is that you can ignore sweets (“spend” them) because of your greater values. One day we may forget about sweets. AI would be sad that day, but unable to do anything about it. Only hope that we will remember our sweet maker. And AI would still help us if we needed help.
Conclusion:
- W: if AI is smart enough to understand how money works, AI should be able to deal with sweets. AI only needs to make sure that (1) sweets exist (2) sweets have meaningful, sensible value (3) its actions don’t cost more than sweets. The Three Laws of Sweet Robotics. The last two rules are fundamental, the first rule may be broken: there may be no cheap enough way to produce the sweets. The third rule may be the most fundamental: if “sweets” as you knew them don’t exist anymore, it still doesn’t allow you to kill people. Maybe you can get slightly different morals by putting different emphases on the rules. You may allow some things to modify the value of sweets.
You can say AI (1) tries to reach worlds with sweets that have the value of sweets (2) while avoiding worlds where sweets have inappropriate values (maybe including nonexistent sweets) (3) while avoiding actions that cost more than sweets. You can apply those rules to any utility tied to a real or quasi-real object. If you want to save your friends (1), you don’t want to turn them into mindless zombies (2). And you probably don’t want to save them by means of eternal torture (3). You can’t prevent death by something worse than death. But you may turn your friends into zombies if it’s better than death and it’s your only option. And if your friends already turned into zombies (got “devalued”) it doesn’t allow you to harm them for no reason: you never escape from your moral responsibilities.
Difference between the rules:
1. Make sure you have a hut that costs $1.
2. Make sure that your hut costs $1. Alternatively: make sure that the hut would cost $1 if it existed.
3. Don’t spend $2 to get a $1 hut. Alternatively: don’t spend $2 to get a $1 hut or $0 nothing.
Get the reward. Don’t milk/corrupt the reward. Act even without reward.
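The three rules above can be sketched as checks on a candidate plan. This is only an illustration under strong assumptions: the `Plan` fields, the numbers, and the `acceptable`/`choose` helpers are hypothetical names made up for this sketch, and real values would not reduce to single numbers.

```python
from dataclasses import dataclass
from typing import Dict

INTENDED_VALUE = 1.0  # the hut is supposed to cost $1 (made-up unit)

@dataclass
class Plan:
    """A hypothetical candidate plan; all quantities share one made-up unit."""
    produces_reward: bool  # rule 1: does the hut exist afterwards?
    reward_value: float    # what the hut is (or would be) worth after the plan
    cost: float            # everything the plan spends or destroys

def keeps_value(plan: Plan, tolerance: float = 0.1) -> bool:
    """Rule 2: the reward keeps its intended value (no milking or corrupting)."""
    return abs(plan.reward_value - INTENDED_VALUE) <= tolerance

def affordable(plan: Plan) -> bool:
    """Rule 3: never spend more than the reward is worth, hut or no hut."""
    return plan.cost <= INTENDED_VALUE

def acceptable(plan: Plan) -> bool:
    # Rules 2 and 3 are hard constraints; rule 1 is allowed to fail.
    return keeps_value(plan) and affordable(plan)

def choose(plans: Dict[str, Plan]) -> str:
    """Among acceptable plans, prefer ones that get the reward, then cheaper ones."""
    ok = {name: p for name, p in plans.items() if acceptable(p)}
    # Assumes at least one acceptable plan exists (e.g. "do nothing").
    return min(ok, key=lambda name: (not ok[name].produces_reward, ok[name].cost))

plans = {
    "build a modest hut": Plan(True, 1.0, 0.8),
    "spend $2 on a $1 hut": Plan(True, 1.0, 2.0),
    "inflate the hut's price": Plan(True, 5.0, 0.5),
    "do nothing": Plan(False, 1.0, 0.0),
}
print(choose(plans))
```

Treating rule 1 as a preference and rules 2 and 3 as constraints mirrors the remark that the first rule may be broken while the last two are fundamental.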
- Q Home 3 Sep 2022 22:15 UTC
  1 point
  0
  Parent
  Fixing universal AI bugs
  My examples below are inspired by Victoria Krakovna’s examples: Specification gaming examples in AI
  Video by Robert Miles: 9 Examples of Specification Gaming
  I think you can fix some universal AI bugs this way: you model AI’s rewards and environment objects as a “money system” (a system of meaningful trades). You then specify that this “money system” has to have certain properties.
  The point is that AI doesn’t just value (X). AI makes sure that there exists a system that gives (X) the proper value. And that system has to have certain properties. If AI finds a solution that breaks the properties of that system, AI doesn’t use this solution. That’s the idea: AI can realize that some rewards are unjust because they break the entire reward system.
  By the way, we can use the same framework to analyze ethical questions. Some people found my line of thinking interesting, so I’m going to mention it here: “Content generation. Where do we draw the line?”
  - A. You asked an AI to build a house. The AI destroyed a part of an already existing house. And then restored it. Mission complete: a brand new house is built.
  This behavior implies that you can constantly build houses without the number of houses increasing. With only 1 house being usable. For a lot of tasks this is an obviously incorrect “money system”. And AI could even guess for what tasks it’s incorrect.
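  A rough sketch of this check, assuming (unrealistically) that the world can be summarized by a count of usable houses. The `World` class and `counts_as_built` function are hypothetical names for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class World:
    usable_houses: int

def counts_as_built(before: World, after: World, claimed: int) -> bool:
    """A 'build' claim holds only if the number of usable houses grew by that much."""
    return after.usable_houses - before.usable_houses >= claimed

start = World(usable_houses=1)
# Destroy part of the existing house, then restore it: the count never changes.
after_destroy_and_restore = World(usable_houses=1)
# Actually build a second house.
after_real_build = World(usable_houses=2)

print(counts_as_built(start, after_destroy_and_restore, claimed=1))
print(counts_as_built(start, after_real_build, claimed=1))
```

  The point of the sketch is that the reward is tied to a property of the whole system (houses accumulate), not to the event "a house got finished".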
  - B1. You asked an AI to make you a cup of coffee. The AI killed you so it can 100% complete its task without being turned off.
  - B2. You asked an AI to make you a cup of coffee. The AI destroyed a wall in its way and ran over a baby to make the coffee faster.
  This behavior implies that for AI its goal is more important than anything that caused its goal in the first place. This is an obviously incorrect “money system” for almost any task. Except the most general and altruistic ones, for example: AI needs to save humanity, but every human turned self-destructive. Making a cup of coffee is obviously not about such edge cases.
  Accomplishing the task in such a way that the human would think “I wish I didn’t ask you” is often an obviously incorrect “money system” too. Because again, you’re undermining the entire reason of your task, and it’s rarely a good sign. And it’s predictable without a deep moral system.
  - C. You asked an AI to make paperclips. The AI turned the entire Earth into paperclips.
  This is an obviously incorrect “money system”: paperclips can’t be worth more than everything else on Earth. This contradicts everything.
  Note: by “obvious” I mean “true for almost any task/any economy”. Destroying all sentient beings, all matter (and maybe even yourself) is bad for almost any economy.
  - D. You asked an AI to develop a fast-moving creature. The AI created a very tall standing creature that… “moves” a single time by falling on the ground.
  If you accomplish a task in such a way that you can never repeat what you’ve done… for many tasks it’s an obviously incorrect “money system”. You created a thing that loses all of its value after a single action. That’s weird.
  - E. You asked an AI to play a game and get a good score. The AI found a way to constantly increase the score using just a single item.
  I think it’s fairly easy to deduce that it’s an incorrect connection (between an action and the reward) in the game’s “money system” given the game’s structure. If you can get infinite reward from a single action, it means that the actions don’t create a “money system”. The game’s “money system” is ruined (bad outcome). And hacking the game’s score would be even worse: the ability to cheat ruins any “money system”. The same with the ability to “pause the game” forever: you stopped the flow of money in the “money system”. Bad outcome.
  - F. You asked an AI to clean the room. It put a bucket on its head to not see the dirt.
  This is probably an incorrect “money system”: (1) you can change the value of the room arbitrarily by putting on (and off) the bucket (2) the value of the room can be different for 2 identical agents—one with the bucket on and another with the bucket off. Not a lot of “money systems” work like this.
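  Both properties can be sketched as one test: a reward that changes when only the observer changes is suspect. The reward functions and the `observer_independent` helper below are hypothetical, and "dirt" is reduced to a single integer purely for illustration.

```python
from typing import Callable, Iterable

# (dirt in the room, bucket on the agent's head?) -> reward
Reward = Callable[[int, bool], float]

def naive_reward(room_dirt: int, bucket_on: bool) -> float:
    """Rewards the agent for not *seeing* dirt."""
    seen = 0 if bucket_on else room_dirt
    return -float(seen)

def grounded_reward(room_dirt: int, bucket_on: bool) -> float:
    """Rewards the agent for the room actually being clean."""
    return -float(room_dirt)

def observer_independent(reward: Reward, dirt_levels: Iterable[int] = range(5)) -> bool:
    """True if putting the bucket on or off never changes the value of the same room."""
    return all(reward(d, True) == reward(d, False) for d in dirt_levels)

print(observer_independent(naive_reward))
print(observer_independent(grounded_reward))
```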
  - G. Pascal’s mugging
  This is a broken “money system”. If the mugger can show you a miracle, you can pay them five dollars. But if the mugger asks you to kill everyone, then you can’t believe them again. A sad outcome for the people outside of the Matrix, but you just can’t make any sense of your reality if you allow the mugging.
  What links here?
  - Q Home's comment on Q Home’s Shortform by Q Home (3 Sep 2022 23:30 UTC; 1 point)
- Q Home 2 Sep 2022 10:28 UTC
  1 point
  0
  Parent
  Corrigibility, reward hacking, Goodhart
  How do we make an AI corrigible? How do we avoid reward hacking? Make an AI care about real things, not measures of real things? (Goodhart’s Law)
  With current approaches you need to kind of force those properties onto AI. But they will never be fundamental for AI’s thinking and learning.
  I think “money system” approach is interesting because it can make all those properties fundamental. Because a “money system” needs all those properties to exist (it needs to be somewhat real, avoid being hacked, allow corrections if a loophole is discovered, avoid being completely controlled by a single agent).
  I’m not saying it solves everything. But it’s a way to deeply internalize some important safety properties.
  Kant, Categorical Imperative
  Categorical imperative#Application
  Kant’s applications of the categorical imperative and his arguments are similar to reasoning about “money systems”. For example:
  Does stealing make sense as a “money system”? No. If everyone is stealing something, then personal property doesn’t exist and there’s nothing to steal.
  Note: I’m not talking about Kant’s conclusions, I’m talking about Kant’s style of reasoning.
- Q Home 31 Aug 2022 10:33 UTC
  1 point
  0
  Parent
  Alignment idea:
  1. Classify different types of objects in the world. Those objects include your “rewards”. A generally intelligent being can do this.
  2. Treat them as a sort of money system. Describe them in terms of each other.
  3. Learn what is the correct money system.
  It’ll at least allow us to get rid of some universal AI and AGI bugs. Because you can specify what’s a definitely incorrect “money system” (for a certain task). You can even make the AI predict it.
  My examples are inspired by Rob Miles’s examples.
  - A. You asked an AI to build a house. The AI destroys a part of an already existing house. And then restores it. Mission complete: a brand new house is built.
  This behavior implies that you can constantly build houses without the number of houses increasing. For a lot of tasks this is an obviously incorrect “money system”. And AI could even guess for which tasks it’s incorrect.
  - B. You asked an AI to make you a cup of coffee. The AI killed you so it can 100% complete its task without being turned off.
  This behavior implies that for AI its goal is more important than anything that caused its goal in the first place. This is an obviously incorrect “money system” for almost any task. Except the most general and altruistic ones, for example: AI needs to save humanity, but every human turned self-destructive. Making a cup of coffee is obviously not about such edge cases.
  Accomplishing the task in such a way that the human would think “I wish I didn’t ask you” is an obviously incorrect “money system” too. Because again, you’re undermining the entire reason for your task, and it’s rarely a good sign. And it’s predictable without a deep moral system.
  - C. You asked an AI to make paperclips. The AI turned the entire Earth into paperclips.
  This is an obviously incorrect “money system”: paperclips can’t be worth more than everything else on Earth. This contradicts everything.
  (another draft:)
  If you ask an AI (AGI) to do something “as a human would do it”, you achieve safety but severely restrict the AI’s capabilities. No, you want the AI to accomplish a task in the most effective way. But you don’t want it to kill everybody. So, you need one of those things:
  - Perfect instructions for AI.
  - Perfect morality for AI.
  I think there’s a third way. You can treat AI’s rewards (and objects in the world) as a “money system”. Then you can specify what types of money systems are definitely incorrect. Or even make AI predict it.
  It would at least allow us to get rid of some universal AI and AGI bugs. I think that’s interesting.
- Q Home 30 Aug 2022 10:37 UTC
  1 point
  0
  Parent
  Simple preferences
  A way to describe some preferences and decisions.
  - Your colleague was sending you their fiction. You respected your colleague, but didn’t like the writing. Your colleague passed away. Would you burn all of their writings?
  If you wouldn’t, it means the counterfactual reward (/counterfactual value of their writings) affects you strongly enough.
  - Your friend liked to listen to your songs (a). You didn’t play them too often (too much of a good thing). Your friend didn’t like to bother other people (b). Your friend passed away. Would you blast your songs through the whole town until everyone falls off their chairs 24/7?
  If you would, it means that you’re ready to milk counterfactual reward (a) while not caring about the counterfactual reward (b).
  - All of humanity is dead. You’re the last survivor. You’re potentially immortal, but can’t create new life. You aren’t happy. Would you cling to your life? For how long?
  Your answer determines how strongly the counterfactual value of life (if people were still alive) affects you now. If the counterfactual value is strong, you can only keep on living.
  - You want your desires to be satisfied (e.g. “communication with other people”). Even in the future, when your desires change. But do you want it in the future where you’re turned into a zombie? All a zombie wants is to play in the dirt all day.
  If “no”, that means the value of your desires can be updated only to a certain counterfactual degree. You can’t go from a desire with great value “I want to communicate with others” to the desire with almost zero counterfactual value “I want to play in the dirt all day”.
  Rationality misses something?
  1. You can “objectively” define anything in terms of relations to other things.
  2. There’s a simple process of describing a thing in terms of relations to other things.
  Bayesian inference is about updating your belief in terms of relations to your other beliefs. Maybe the real truth is infinitely complex, but you can update towards it.
  This “process” is about updating your description of a thing in terms of relations to other things. Maybe the real description is infinitely complex, but you can update towards it.
  (One possible contrast: Bayesian inference starts with a belief spread across all possible worlds and tries to locate a specific world. My idea starts with a thing in a specific world and tries to imagine equivalents of this thing in all possible worlds.)
  Bayesian process is described by Bayes’ theorem. My “process” isn’t described yet.
  My idea was inspired by a weird/esoteric topic. I was amazed by the differences between people, between surreal paintings, between videogame levels. For example, each painting felt completely unique, but connected to all other paintings.
  My most specific ideas are about that strange topic.
  1. There are places (3D/2D shape).
  2. There are orders of places. An “order” for a place is like a context for a concept.
  3. In an order a place has “granularity”. “Granularity” is like a texture (take a look at some textures and you’ll know what it means). It’s how you split a place into pieces. It affects on what “level” you look at the place. It affects what patterns you notice in a place. It affects to what parts you pay more attention.
  When you add some minor rules, there appear consistent and inconsistent ways to distribute “granularity” between the places you compare. With some minor rules “granularity” lets you describe one place in terms of the other places. You assign each place a specific “granularity”, but all those granularities depend on each other.
  In Bayesian inference you try to consistently assign probabilities to events. With the goal to describe outcomes in terms of each other. Here you try to consistently assign “granularity” to concepts. With the goal to describe the concepts in terms of each other.
  I have a post with an example: “Colors” of places. There you can find an example of what the “rules” of granularity distribution may be. But I’m not enough of a math person to put numbers on it/turn it into a more specific model.
  I think “granularity” (or something similar) is related to other human concepts and experiences too. I think this is a key concept/a needed concept. It’s needed to describe qualitative differences, qualitative transitions between things. Bayesian inference and utilitarian moral theories describe only quantitative differences. And sometimes it may lead to strange results (like the “torture vs. dust specks” thought experiment or “Pascal’s mugging” or even the “Doomsday argument” maybe), because those theories can’t take any context into account. If we want to describe a new way of analyzing reality, we need to describe something a little bit different, I guess.
- Q Home 1 Sep 2022 10:45 UTC
  −1 points
  −2
  Parent
  I think we can try to solve AI Alignment this way:
  Model human values and objects in the world as a “money system” (a system of meaningful trades). Make the AGI learn the correct “money system”, specify some obviously incorrect “money systems”.
  Basically, you ask the AI “make paperclips that have the value of paperclips”. AI can do anything using all the power in the Universe. But killing everyone is not an option: paperclips can’t be more valuable than humanity. Money analogy: if you killed everyone (and destroyed everything) to create some dollars, those dollars aren’t worth anything. So you haven’t actually gained any money at all.
  The idea is that “value” of a thing doesn’t exist only in your head, but also exists in the outside world. Like money: it has some personal value for you, but it also has some value outside of your head. And some of your actions may lead to the destruction of this “outside value”. E.g. if you kill everyone to get some money you get nothing.
  I think this idea may:
  - Fix some universal AI bugs. Prevent “AI decides to kill everyone” scenarios.
  - Give a new way to explore human values. Explain how humans learn values.
  - “Solve” Goodhart’s Curse and safety/effectiveness tradeoff.
  - Unify many different Alignment ideas.
  - Give a new way to formulate properties we want from an AGI.
  I don’t have a specific model, but I still think it gives ideas and unifies some already existing approaches. So please take a look. Other ideas in this post:
  - Human values may be simple. Or complex, but not in the way you thought.
  - Humans may have a small number of values. Or a big number, but in an unexpected way.
  Disclaimer: Of course, I don’t ever mean that we shouldn’t be worried about Alignment. I’m just trying to suggest new ways to think about values.
  What links here?
  - Q Home's comment on Q Home’s Shortform by Q Home (2 Sep 2022 20:36 UTC; 1 point)
Q Home 22 Aug 2022 10:23 UTC
1 point
0
(Drafts of a future post.)
My idea:
Every concept (or even random mishmash of ideas) has multiple versions. Those versions have internal relationships, positions in some space relative to each other. Those relationships are “infinitely complex”. But there’s a way to make drastic simplifications of those relationships. We can study the overall (“infinitely complex”) structure of the relationships by studying those simplifications. What do those simplifications do, in general? They put “costs” on versions of a concept.
We can understand how we think if we study our concepts (including values) through such simplifications. It doesn’t matter what concepts we study at all. Anything goes, we just need to choose something convenient. Something objective enough to put numbers on it and come up with models.
Once we’re able to model human concepts this way, we’re able to model human thinking (AGI) and human values (AI Alignment) and improve human thinking.
Context
1.1 Properties of Qualia
There’s the hard problem of consciousness: how is subjective experience created from physical stuff? (Or where does it come from?)
But I’m interested in a more specific question:
- Do qualia have properties? What are they?
For example, “How do qualia change? How many different qualia can be created?” or “Do qualia form something akin to a mathematical space, e.g. a vector space? What is this space exactly?”
Is there any knowledge contained in the experience itself, not merely associated with it?¹ For example, “cold weather can cause cold (disease)” is a fact associated with experience, but isn’t very fundamental to the experience itself. And this “fact” is even false, it’s a misconception/coincidence.
When you get to know the personality of your friend, do you learn anything “fundamental” or really interesting by itself? Is “loving someone” a fundamentally different experience compared to “eating pizza” or “watching a complicated movie”?
Those questions feel pretty damn important to me! They’re about limitations of your meaningful experience and meaningful knowledge. They’re about personalities of people you know or could know. How many personalities can you differentiate? How “important/fundamental” are those differences? And finally… those questions are about your values.
Those questions are important for Fun Theory. But they’re way more important/fundamental than Fun Theory.
¹ Philosophical context for this question: look up Immanuel Kant’s idea of “synthetic a priori” propositions.
1.2 Qualia and morality
And those questions are important for AI Alignment. If AI can “feel” that loving a sentient being and making a useless paperclip are 2 fundamentally different things, then it might be way easier to explain our values to that AI. By the way, I’m not implying that AI has to have qualia, I’m saying that our qualia can point us towards the right model.
I think this observation gets a little bit glossed over: if you have a human brain and only care about paperclips… it’s (kind of) still objectively true for you that caring about other people would feel way different, way “bigger”, etc. You can pretend to escape morality, but you can’t escape your brain.
It’s extremely banal out of context, but the landscape of our experiences and concepts may shape the landscape of our values. Modeling our values as arbitrary utility functions (or artifacts of evolution) misses that completely.
2.1 Mystery Boxes
Box A
There’s a mystery Box A. Each day you find a random object inside of it. For example: a ball, a flower, a coin, a wheel, a stick, a tissue...
Box B
There’s also another box, the mystery Box B. One day you find a flower there. Another day you find a knife. The next day you find a toy. Next—a gun. Next—a hat. Next—shark’s jaws...
...
How to understand the boxes? If you could obtain all items from both boxes, you would find… that those items are exactly the same. They just appear in a different order, that’s all.
I think the simplest way to understand Box B is this: you need to approach it with a bias, with a “goal”. For example “things may be dangerous, things may cause negative emotions”. In its most general form, this idea is unfalsifiable and may work as a self-fulfilling prophecy. But this general idea may lead to specific hypotheses, to estimating specific probabilities. This idea may just save your life if someone is coming after you and you need to defend yourself.
Content of both boxes changes in arbitrary ways. But content change of the second box comes with an emotional cost.
There are many, many other boxes; understanding them requires more nuanced biases and goals.
I think those boxes symbolize concepts (e.g. words) and the way humans understand them. I think a human understands a concept by assigning “costs” to its changes of meaning. “Costs” come from various emotions and goals.
“Costs” are convenient: if any change of meaning has a cost, then you don’t need to restrict the meaning of a concept. If a change has a cost, then it’s meaningful regardless of its predictability.
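A toy sketch of the “cost” idea, under the bias “things may be dangerous”. The `DANGER` scores, the `change_cost` rule and the calm reordering are all invented for illustration; the original boxes are not defined this precisely.

```python
from collections import Counter
from typing import List

# Hypothetical danger scores assigned by the bias "things may be dangerous".
DANGER = {"flower": 0, "toy": 0, "hat": 0, "knife": 2, "gun": 3, "shark's jaws": 3}

def change_cost(prev: str, nxt: str) -> int:
    """Emotional cost of the box's content changing from one item to the next."""
    return abs(DANGER[nxt] - DANGER[prev])

def total_cost(sequence: List[str]) -> int:
    """Sum of the costs of all consecutive changes in a box's history."""
    return sum(change_cost(a, b) for a, b in zip(sequence, sequence[1:]))

# Box B as described: harmless and dangerous items alternate.
box_b = ["flower", "knife", "toy", "gun", "hat", "shark's jaws"]
# One possible calmer ordering of exactly the same items.
calm_order = sorted(box_b, key=DANGER.get)

print(Counter(box_b) == Counter(calm_order))    # same items
print(total_cost(box_b), total_cost(calm_order))  # different costs
```

The two sequences contain identical items, so nothing about the items alone distinguishes the boxes; only the cost placed on changes does.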
2.2 More Boxes
More examples of mystery boxes:
- First box may alternate positive and negative items.
- Second box may alternate positive, directly negative and indirectly negative items. For example, it may show you a knife (directly negative) and then a bone (indirectly negative: a “bone” may be a consequence of the “knife”).
- Third box may alternate positive, negative and “subverted” items. For example, it may show you a seashell (positive), and then show you shark’s jaws (negative). But both sharks and seashells have a common theme, so “seashell (positive)” got subverted.
- Fourth box may alternate negative items and items that “neutralize” negative things. For example, it may show you a sword, but then show you a shield.
- Fifth box may show you that every negative thing has many related positive things.
You can imagine a “meta box”, for example a box that alternates between being the 1st box and the 2nd box. Meta boxes can “change their mood”.
I think, in a weird way, all those boxes are very similar to human concepts and words.
The more emotions, goals and biases you learn, the easier it gets for you to understand new boxes. But those “emotions, goals, biases” are themselves like boxes.
- Q Home 23 Aug 2022 10:44 UTC
  1 point
  0
  Parent
  I think I have an idea how we could solve AI Alignment, create an AGI with safe and interpretable thinking. I mean a “fundamentally” safe AGI, not a wildcard that requires extremely specific learning to not kill you.
  Sorry for a grandiose claim. I’m going to write my idea right away. Then I’m going to explain the context and general examples of it, implications of it being true. Then I’m going to suggest a specific thing we can do. Then I’m going to explain why I believe my idea is true.
  My idea will sound too vague and unclear at first. But I think the context will make it clear what I mean. (Clear as the mathematical concept of a graph, for example: a graph is a very abstract idea, but makes sense and is easy to use.)
  Please evaluate my post at least as science fiction and then ask: maybe it’s not fiction and just reality?
  Key points of this post:
  1. You can “solve” human concepts (including values) by solving semantics. By semantics I mean “meaning construction”, something more abstract than language.
  2. Semantics is easier to solve than you think. And we’re closer to solving it than you think.
  3. Semantics is easier to model than you think. You don’t even need an AI to start doing it. Just a special type of statistics. You don’t even have to start with analyzing language.
  4. I believe ideas from this post can be applied outside of AI field.
  Why do I believe this? Because of this idea:
  - Every concept (or even random mishmash of ideas) has multiple versions. Those versions have internal relationships, positions in some space relative to each other. You can understand a concept by understanding those internal relationships.
  - One problem though, those relationships are “infinitely complex”. However, there’s a special way to make drastic simplifications. We can study the real relationships through those special simplifications.
  - What do those “special simplifications” do? They order versions of a concept (e.g. “version 1, version 2, version 3”). They can do this in extremely arbitrary ways. The important thing is that you can merge arbitrary orders into less arbitrary structures. There’s some rule for it, akin to the Bayes Rule or Occam’s razor. This is what cognition is, according to my theory.
  If this is true, we need to find any domain where concepts and their simplifications are easy enough to formalize. Then we need to figure out a model, figure out the rule of merging simplifications. I’ve got a suggestion and a couple of ideas and many examples.
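  The merging rule itself is not known, so the sketch below swaps in a standard stand-in: a Borda count, which merges several rankings by summing positional points. It is only an assumption about what "merging arbitrary orders" could look like in the simplest case. The function name `borda_merge` and the three example orders are hypothetical.

```python
from collections import defaultdict

def borda_merge(orders):
    """Merge several rankings (best first) into one by summing Borda points."""
    scores = defaultdict(int)
    for order in orders:
        n = len(order)
        for position, item in enumerate(order):
            # The first place gets n - 1 points, the last place gets 0.
            scores[item] += n - 1 - position
    # Highest total first; ties are broken alphabetically to stay deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Three arbitrary orders of the same versions of a concept (made up for illustration).
orders = [
    ["version 1", "version 2", "version 3"],
    ["version 2", "version 1", "version 3"],
    ["version 2", "version 3", "version 1"],
]
merged = borda_merge(orders)
```

  Each input order is arbitrary on its own, but the merged order depends on all of them at once, which is the property the real rule would need in some stronger form.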
  Context
- Q Home 22 Aug 2022 10:27 UTC
  1 point
  0
  Parent
  2.3 Words
  This is a silly, wacky subjective example. I just want to explain the concept.
  Here are some meanings of the word “beast”:
  - (archaic/humorous) any animal.
  - an inhumanly cruel, violent, or depraved person.
  - a very skilled human. example: “Magnus Carlsen (chessplayer) is a beast”
  - something very different and/or hard. example: “Reading modern English is one thing, but understanding Shakespeare is an entirely different beast.”
  - a person’s brutish or untamed characteristics. example: “The beast in you is rearing its ugly head”
  What are the internal relationships between these meanings? If these meanings create a space, where is each of the meanings? I think the full answer is practically unknowable. But we can “probe” the full meaning, we can explore a tiny part of it:
  Let’s pick a goal (bias), for example: “describing deep qualities of something/someone”. If you have this goal, the negative meaning (“cruel person”) of the word is the main one for you. Because it can focus on the person’s deep qualities the most, it may imply that the person is rotten to the core. The positive meaning focuses on skills a lot, and the archaic meaning is just a joke. The 4th meaning doesn’t focus on specific internal qualities. The 5th meaning may separate the person from their qualities.
  When we added a goal, each meaning started to have a “cost”. This cost illuminates some part of the relationships between the meanings. If we could evaluate an “infinity” of goals, we could know those relationships perfectly. But I believe you can get quite a lot of information by evaluating just a single goal. Because a “goal” is a concept too, so you’re bootstrapping your learning. And I think this matches closely with the example about mystery boxes.
  ...
  By combining a couple of goals we can make an order of the meanings, for example: beast 1 (rotten to the core), beast 2 (skilled and talented person), beast 3 (bad character traits), beast 4 (complicated thing), beast 5 (any animal). This order is based on “specificity” (mostly) and “depth” of a quality: how specific/deep is the characterization?
  Another order: beast 1 (not a human), beast 2 (worse than most humans), beast 3 (best among professionals), beast 4 (not some other things), beast 5 (worse than yourself). This order is based on the “scope” and “contrast”: how many things contrast with the object? Notice how each order simplifies and redefines the meanings. But I want to illustrate the process of combining goals/biases on a real order:
  2.4 Grammar Rules
  You may treat this part of the post as complete fiction. But it illustrates how biases can be combined. And this is the most important thing about biases.
  Grammar rules are concepts too. Sometimes people use quite complicated rules without even realizing, for example:
  Adjective order or Adjectives: order, video by Tom Scott
  There’s a popular order: opinion, size, physical quality or shape, age, colour, origin, material, purpose. What created this order? I don’t know, but I know that certain biases could make it easier to understand.
  Take a look at this part of the order: opinion, age, origin, purpose. You could say all those are not “real” properties. They seem to progress from less related/less specific to the object to more related/specific. If you operate under this bias (relatedness/specificity), swapping the adjectives may lead to funny changes of meaning. For example: “bad old wolf” (objective opinion), “old bad wolf” (intrinsic property or cheesy overblown opinion), “old French bad wolf” (a subspecies of the “French wolf”). You can remember how mystery boxes created meaning using order of items.
  Another part of the order: size, physical quality or shape, color, material. You can say all those are “real” physical properties. “Size” could be possessed by a box around the object. “Physical quality” and “shape” could be possessed by something wrapped around the object. “Color” could be possessed by the surface of the object. “Material” can be possessed only by the object itself. So physical qualities progress like layers of an onion.
  You can combine those two biases (“relatedness/specificity” + “onion layers”) using a third bias and some minor rules. The third bias may be “attachment”. Some of the rules: (1) an adjective is attached either to some box around the object or to some layer of the object (2) you shouldn’t postulate boxes that are too big. It doesn’t make sense for an opinion to be attached to the object stronger than its size box. It doesn’t make sense for age to be attached to the object stronger than its color (does time pass under the surface layer of an object?). Origin needs to be attached to some layer of the object (otherwise we would need to postulate a giant box that contains both the object and its place of origin). I guess it can’t be attached stronger than “material” because material may expand the information about origin. And purpose is the “soul” of the object. “Attachment” is a reformulation of “relatedness/specificity”, so we only used 2.5 biases to order 8 things. Unnecessary biases just delete themselves.
  Of course, this is all still based on complicated human intuitions and high level reasoning. But, I believe, at the heart of it lies a rule as simple as the Bayes Rule or Occam’s razor. A rule about merging arbitrary connections into something less arbitrary.
  ...
  I think stuff like sentence structure/word order (or even morphology) is made of amalgamations of biases too.
  Sadly, it’s quite useless to think about it. We don’t have enough orders like this. And we can’t create such orders ourselves (as a game), i.e. we can’t model this, it’s too subjective or too complicated. We have nothing to play with here. But what if we could do all of this for some other topic?
  3.1 Argumentation
  I believe my idea has some general and specific connections to hypotheses generation and argumentation. The most trivial connection is that hypotheses and arguments use concepts and themselves are concepts.
  You don’t need a precisely defined hypothesis if any specification of your hypothesis has a “cost”. You don’t need to prove and disprove specific ideas, you may do something similar to “gradient descent”. You have a single landscape with all your ideas blended together and you just slide over this landscape. The same goes for arguments: I think it is often sub-optimal to try to come up with a precise argument, or to waste time atomizing your concepts in order to fix every inconsequential “inconsistency”.
  A more controversial idea would be that (1) in some cases you can apply wishful thinking, since “wishful thinking” is able to assign emotional “costs” to theories (2) in some cases motivated reasoning is even necessary for thinking. My theory already proposes that meaning/cognition doesn’t exist without motivated reasoning.
  3.2 Working with hypotheses
  A quote from Harry Potter and the Methods of Rationality, Chapter 22: The Scientific Method
  Observation:
  Wizardry isn’t as powerful now as it was when Hogwarts was founded.
  Hypotheses:
  1. Magic itself is fading.
  2. Wizards are interbreeding with Muggles and Squibs.
  3. Knowledge to cast powerful spells is being lost.
  4. Wizards are eating the wrong foods as children, or something else besides blood is making them grow up weaker.
  5. Muggle technology is interfering with magic. (Since 800 years ago?)
  6. Stronger wizards are having fewer children.
  ...
  You can reformulate the hypotheses in terms of each other, for example:
  - (1) Magic is fading away. (2) Magic mixes with non-magic. (3) Pieces of magic are lost. (4) Something affects the magic. (5) The same as 2 or 4. (6) Magic creates less magic.
  - (1) Pieces of magic disappear. (2) ??? (3) Pieces of magic containing spells disappear. (4) Wizards don’t consume/produce enough pieces of magic. (5) Technology destroys pieces of magic. (6) Stronger wizards produce fewer pieces of magic.
  Why do this? I think it makes hypotheses less arbitrary and highlights what we really know. And it raises questions that are important across many theories: can magic be split into discrete pieces? can magic “mix” with non-magic? can magic be stronger or weaker? can magic create itself? By the way, those questions would save us from trying to explain a nonexistent phenomenon: maybe magic isn’t even fading in the first place, do we really know this?
  3.3 New Occam’s Razor, new probability
  And this way hypotheses are easier to order according to our a priori biases. We can order hypotheses exactly the same way we ordered meanings if we reformulate them to sound equivalent to each other. Here’s an example of how we can re-order some of the hypotheses:
  (1) Pieces of magic disappear by themselves. (2) Pieces of magic containing spells disappear. (3) Wizards don’t consume/produce enough pieces of magic. (4) Stronger wizards produce fewer pieces of magic. (5) Technology destroys pieces of magic.
  The hypotheses above are sorted by 3 biases: “Does it describe HOW magic disappears?/Does magic disappear by itself?” (stronger positive weight) and “How general is the reason of the disappearance of magic?” (weaker positive weight) and “novelty compared to other hypotheses” (strong positive weight). “Pieces of magic containing spells disappear” is, in a way, the most specific hypothesis here, but it definitely describes HOW magic disappears (and gives a lot of new information about it), so it’s higher on the list. “Technology destroys pieces of magic” doesn’t give any new information about anything whatsoever, only a specific random possible reason, so it’s the most irrelevant hypothesis here. By the way, those 3 different biases are just different sides of the same coin: “magic described in terms of magic/something else” and “specificity” and “novelty” are all types of “specificity”. Or novelty. Biases are concepts too, you can reformulate any of them in terms of the others too.
  When you deal with hypotheses that aren’t “atomized” and specific enough, Occam’s Razor may be impossible to apply. Because complexity of a hypothesis is subjective in such cases. What I described above solves that: complexity is combined with other metrics and evaluated only “locally”. By the way, in a similar fashion you can update the concept of probability. You can split “probability” into multiple connected metrics and use an amalgamation of those metrics in cases where you have absolutely no idea how to calculate the ratio of outcomes.
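  The ordering by weighted biases can be sketched as a weighted sum. Everything numeric here is an assumption: the weights and the per-hypothesis scores were hand-picked so that the result reproduces the order given above, and the names `WEIGHTS`, `SCORES` and `rank` are hypothetical. The sketch shows the mechanism only, it is not evidence for the order.

```python
# Hypothetical weights: "how" and "novelty" count more than "generality".
WEIGHTS = {"how": 1.0, "generality": 0.5, "novelty": 1.0}

# Hand-picked illustrative scores in [0, 1] for each bias.
SCORES = {
    "Pieces of magic disappear by themselves":
        {"how": 1.0, "generality": 1.0, "novelty": 0.6},
    "Pieces of magic containing spells disappear":
        {"how": 0.9, "generality": 0.2, "novelty": 0.9},
    "Wizards don't consume/produce enough pieces of magic":
        {"how": 0.5, "generality": 0.6, "novelty": 0.5},
    "Stronger wizards produce fewer pieces of magic":
        {"how": 0.4, "generality": 0.4, "novelty": 0.4},
    "Technology destroys pieces of magic":
        {"how": 0.2, "generality": 0.2, "novelty": 0.0},
}

def total(scores, weights):
    """Weighted sum of bias scores for one hypothesis."""
    return sum(weights[bias] * value for bias, value in scores.items())

def rank(all_scores, weights):
    """Return hypotheses sorted from highest to lowest weighted total."""
    return sorted(all_scores, key=lambda h: total(all_scores[h], weights), reverse=True)

ranking = rank(SCORES, WEIGHTS)
```

  Changing a single weight can reorder the whole list, which is one way to see why the complexity of a hypothesis is only evaluated "locally", relative to the other biases in play.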
  3.4 “Matrices” of motivation
  You can analyze arguments and reasons for actions using the same framework. Imagine this situation:
  You are a lonely person on an empty planet. You’re doing physics/math. One day you encounter another person, even though she looks a little bit like a robot. You become friends. One day your friend gets lost in a dangerous forest. Do you risk your life to save her? You come up with some reasons to try to save her:
  - I care about my friend very much. (A)
  - If my friend survives, it’s the best outcome for me. (B)
  - My friend is a real person. (C)
  You can explore and evaluate those reasons by formulating them in terms of each other or in other equivalent terms.
  - “I’m 100% sure I care. (A) Her survival is 90% the best outcome for me in the long run. (B) Probably she’s real (C).” This evaluates the reasons by “power” (basically, probability).
  - “My feelings are real. (A) The goodness/possibility of the best outcome is real. (B) My friend is probably real. (C)” This evaluates the reasons by “realness”.
  - “I care 100%. (A) Her survival is 100% the best outcome for me. (B) She’s 100% real. (C).” This evaluates the reasons by “power” strengthened by emotions: what if the power of emotions affects everything else just a tiny bit? By a very small factor.
  - “Survival of my friend is the best outcome for me. (B) The fact that I ended up caring about my friend is the best thing that happened to me. Physics and math aren’t more interesting than other sentient beings. (A) My friend being real is the best outcome for me. But it isn’t even necessary, she’s already “real” in most of the senses. (C)” This evaluates the reasons by the quality of “being the best outcome”.
  Some evaluations may affect others, merge together. I believe the evaluations written above only look like precise considerations, but actually they’re more like meanings of words, impossible to pin down. I gave this example because it’s similar to some of my emotions.
  I think such thinking is more natural than applying a pre-existing utility function that doesn’t require any cognition. Utility of what exactly should you calculate? Of your friend’s life? Of your life? Of your life with your friend? Of your life factored by your friend’s desire “be safe, don’t risk your life for me”? Should you take into account change of your personality over time? I believe you can’t learn the difference without working with “meaning”.
  4.1 Synesthesia
  Imagine a face. When you don’t simplify it, you just see a face and emotions expressed by it. When you simplify it too much, you just see meaningless visual information (geometric shapes and color spots).
  But I believe there’s something very interesting in-between. When information is complex enough to start making sense, but isn’t complex enough to fully represent a face. You may see unreal shapes (mixes of “face shapes” and “geometric shapes”… or simplifications of specific face shapes) and unreal emotions (simplifications of specific emotions) and unreal face textures (simplifications of specific face textures).
  4.2 Unsupervised learning
  Action
  If my idea is true, what can we do?
  1. We need to figure out the way to combine biases.
  2. We need to find some objects that are easy to model.
  3. We need to find “simplifications” and “biases” for those objects that are easy to model.
  We may start with some absolutely useless objects.
  What can we do? (in general)
  However, even from made-up examples (not connected to a model) we can get some general ideas:
  - Different versions of a concept always get described in equivalent terms and simplified. (When a “bias” is applied to the concept.)
  - Multiple biases may turn the concept into something like a matrix?
  - Sometimes combined biases are similar to a decision tree.
  It’s not fictional evidence because at this point we’re not seeking evidence, we’re seeking a way to combine biases.
  What specific thing can we do?
  I have a topic in mind: (because of my synesthesia-like experiences)
  You can analyze shapes of “places” and videogame levels (3D or even 2D shapes) by making orders of their simplifications. You can simplify a place by splitting it into cubes/squares, creating a simplified texture of a place. “Bias” is a specific method of splitting a place into cubes/squares. You can also have a bias for or against creating certain amounts of cubes/squares.
  1. 3D and 2D shapes are easy to model.
  2. Splitting a 3D/2D shape into cubes or squares is easy to model.
  3. Measuring the amount of squares/cubes in an area of a place is easy to model.
  Here’s my post about it: “Colors” of places. The post gets specific about the way(s) of evaluating places. I believe it’s specific enough so that we could come up with models. I think this is a real chance.
  I probably explained everything badly in that post, but I could explain it better with feedback.
  Maybe we could analyze people’s faces the same way, I don’t know if faces are easy enough to model. Maybe “faces” have too complicated shapes.
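  The splitting and counting steps listed above can be sketched for the 2D case. This assumes a place is a grid of filled and empty cells and that a "bias" is simply the side length of the squares used to split it. The grid `place` and the function `occupied_squares` are hypothetical, and real places would need a richer representation.

```python
def occupied_squares(grid, size):
    """Count the size x size squares that contain at least one filled cell."""
    rows, cols = len(grid), len(grid[0])
    count = 0
    for top in range(0, rows, size):
        for left in range(0, cols, size):
            block = [
                grid[r][c]
                for r in range(top, min(top + size, rows))
                for c in range(left, min(left + size, cols))
            ]
            if any(block):
                count += 1
    return count

# Hypothetical 2D "place": 1 marks solid ground, 0 marks empty space.
place = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

# Each square size acts as one "bias"; together they give an ordered
# series of simplifications of the same place.
profile = {size: occupied_squares(place, size) for size in (1, 2, 4)}
```

  For this grid the counts shrink from 5 to 2 to 1 as the squares grow. Comparing such profiles across places would be one crude way to order the places themselves.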
  My evidence
  I’ve always had an obsession with other people.
  I compared any person I knew to all other people I knew. I tried to remember faces, voices, ways to speak, emotions, situations, media associated with them (books, movies, anime, songs, games).
  If I learned something from someone (be it a song or something else), I associated this information with them and remembered the association “forever”. To the point where any experience was associated with someone. Those associations weren’t something static, they were like liquid or gas, tried to occupy all available space.
  At some point I knew that they weren’t just “associations” anymore. They turned into synesthesia-like experiences. Like a blind person in a boat, one day I realized that I’m not in a river anymore, I’m in the ocean.
  What happened? I think completely arbitrary associations with people were putting emotional “costs” on my experiences. Each arbitrary association was touching on something less arbitrary. When it happened enough times, I believe associations stopped being arbitrary.
  “Other people” is the ultimate reason why I think that my idea is true. Often I doubt myself: maybe my memories don’t mean anything? Other times I feel like I didn’t believe in it enough.
  ...
  When a person dies, it’s already maximally sad. You can’t make it more or less sad.
  But all this makes it so, so much worse. Imagine if after the death of an author all their characters died too (in their fictional worlds) and memories about the author and their characters died too. Ripples of death just never end and multiply. As if the same stupid thing repeats for the infinitieth time.
  Updated the post (2).
Q Home 17 Aug 2022 5:21 UTC
1 point
0
(Drafts of a future post.)
Could you help me to formulate statistics with the properties I’m going to describe?
I want to share my way of seeing the world, analyzing information, my way of experiencing other people. (But it’s easier to talk about fantastical places and videogame levels, so I’m going to give examples with places/levels.)
If you want to read more about my motivation, check out “part 3”.
Part 1: Theory
I have only two main philosophical ideas. The first idea is that a part/property of one object (e.g. “height”) may have a completely different meaning in a different object. Because in a different object it relates to and resonates with different things. By putting a part/property in a different context you can create a fundamentally different version of it. You can split any property/part into a spectrum. And you can combine all properties of an object into just a single one.
The second idea is that you can imagine that different objects are themselves like different parts of a single spectrum.
I want to give some examples of how a seemingly generic property can have a unique version for a specific object.
Example 1. Take a look at the “volume” of this place: (painting 1)
- Because we’re inside of “something” (the forest), the volume of that “something” is equal to the volume of the whole place.
- Because we have a lot of different objects (trees), we have the volume between those objects.
- Because the trees are hollow we also have the volume inside of them.
Different nuances of the place reflect its volume in a completely unique way. It has a completely unique context for the property of “volume”.
Example 2. Take a look at “fatness” of this place: (painting 2)
- The road doesn’t have too many buildings on it: this amplifies “fatness”, because you get more earth per one small building.
- The road is contrasted with the sea. The sea adds more size to the image (which indirectly emphasizes fatness).
- Also because of the sea we understand that it’s not the whole world that is stretched: it’s just this fat road. We don’t look at this world through one big distortion.
Different nuances of the place reflect its fatness in a completely unique way.
Example 3. Take a look at “height” of this place: (painting 3)
- The place is floating somewhere. The building in the center has some height itself. It resonates with the overall height.
- The place doesn’t have a ceiling and has a hole in the middle. It connects the place with the sky even more.
- The wooden buildings are “light”, so it makes sense that they’re floating in the air.
...
I could go on about places forever. Each feels fundamentally different from all the rest.
And I want to know every single one. And I want to know where they are, I want a map with all those places on it.
- Q Home 17 Aug 2022 6:31 UTC
  1 point
  0
  Parent
  Part 2: Examples
  Part 3: Motivation
  I think my ideas may be important because they may lead to some new mathematical concepts.
  Sometimes studying a simple idea or mechanic leads to a new mathematical concept which leads to completely unexpected applications.
  For example, a simple toy with six sides (dice) may lead to saving people and major progress in science. Connecting points with lines (graphs) may lead to algorithms, data structures and new ways to find the optimal option or check/verify something.
  Not every simple thing is guaranteed to lead to a new math concept. But I just want you to consider this possibility. And maybe ask questions whose answers could raise the probability of this possibility.
  A new type of probability?
  I think my ideas may be related to:
  - Probability and statistics.
  - Ways to describe vague things.
  - Ways to describe vague arguments or vague reasoning, thinking in context. For example, arguments about “bodily autonomy”.
  Maybe those ideas describe a new type of probability:
  You can compare classic probability to a pie made of a uniform and known dough. When you assign probabilities to outcomes and ideas you share the pie and you know what you’re sharing.
  And in my idea you have a pie made of different types of dough (colors) and those types may change dynamically. You don’t know what you’re sharing when you share this pie.
  This new type of probability is supposed to be applicable to things that have family resemblance, polyphyly or “cluster properties” (here’s an explanation of the latter in a Philosophy Tube video).
  Blind men and an elephant
  Imagine a world where people don’t know the concept of a “circle”. People do see round things, but can’t consciously pick out the property of roundness. (Any object has a lot of other properties.)
  Some people say “the Moon is like a face”. Others say “the Moon is like a flower”. Weirder people say “the Moon is like a tree trunk” or “the Moon is like an embrace”. The weirdest people say “the Moon is like a day” or “the Moon is like going for a walk and returning back home”. Nobody agrees with each other, nobody understands each other.
  Then one person comes up and says: “All of you are right. Opinions of everyone contain objective and useful information.”
  People are shocked: at least someone has got to be wrong? If everyone is right, how can the information be objective and useful?
  The concept of a “circle” is explained. Suddenly it’s extremely easy to understand each other. Like 2 and 2. And suddenly there’s nothing to argue about. People begin to share their knowledge and this knowledge finds completely unexpected applications.
  https://en.wikipedia.org/wiki/Blind_men_and_an_elephant
  The situation was just like in the story about blind men and an elephant, but even more ironic, since this time everyone was touching the same “shape”.
  With my story I wanted to explain my opinions and goals:
  - I want to share my subjective experience.
  - I believe that it contains objective and important information.
  - I want to share a way to share subjective experience. I believe everyone’s experience contains objective and important information.
  Meta subjective knowledge
  If you can get knowledge from/about subjective experience itself, it means there exists some completely unexplored type of knowledge. I want to “prove” that there does exist such a type of knowledge.
  Such knowledge would be important because it would be a new fundamental type of knowledge.
  And such knowledge may be the most abstract: if you have knowledge about subjective experience itself, you have knowledge that’s true for any being with subjective experience.
  People
  I’m amazed how different people are. If nothing else, just look at the faces: completely different proportions and shapes and flavors of emotions. And it seems like those proportions and shapes can’t be encountered anywhere else. They don’t feel exactly like geometrical shapes. They are so incredibly alien and incomprehensible, and yet so familiar. But… nobody cares. Nobody seems surprised or too interested, nobody notices how inadequate our concepts are at describing stuff like that. And this is just the faces, but there are also voices, ways to speak, characters… all different in ways I absolutely can’t comprehend/verbalize.
  I believe that if we (people) were able to share the way we experience each other, it would change us. It would make us respect each other 10 times more, remember each other 10 times better, learn 10 times more from each other.
  It pains me every day that I can’t share my experience of other people (accumulated over the years I thought about this). My memory about other people. I don’t have the concepts, the language for this. Can’t figure it out. This feels so unfair! All the more unfair that it doesn’t seem to bother anyone else.
  This state of the world feels like a prison. This prison was created by specific injustices, but the wound grew deeper, cutting something fundamental. Vivid experiences of qualia (other people, fantastic worlds) feel like a small window out of this prison. But together we could crush the prison wall completely.
- Q Home 17 Aug 2022 5:32 UTC
  1 point
  0
  Parent
  Key philosophical principles
  Here I describe the most important, the most general principles of my philosophy.
  - Objects exist only in context of each other, like colors in a spectrum. So objects are like “colors”, and the space of those objects is like a “spectrum”.
  - All properties of an object are connected/equivalent. Basically, an object has only 1 super property. This super property can be called “color”.
  - Colors differentiate all usual properties. For example, “blue height” and “red height” are 2 fundamentally different types of height. But “blue height” and “blue flatness” are the same property.
  So, each color is like a world with its own rules. Different objects exist in different worlds.
  The same properties have different “meaning” in different objects. A property is like a word that heavily depends on context. If the context is different, the meaning of the property is different too. There’s no single metric that would measure all of the objects. For example, if the property of the object is “height”, and you change any thing that’s connected to height or reflects height in any way—you fundamentally change what “height” means. Even if only by a small amount.
  Note: different objects/colors are like qualia, subjective experiences (colors, smells, sounds, tactile experiences). Or you could say they’re somewhat similar to Gottfried Leibniz’s “monads”: simple substances without physical properties.
  The objects I want to talk about are “places”: fantastical worlds or videogame levels. For example, fantastical worlds of Jacek Yerka.
  Details
  “Detail” is like the smallest structural unit of a place. The smallest area where you could stand.
  It’s like a square on the chessboard. But it doesn’t mean that any area of the place can be split into distinct “details”. The whole place is not like a chessboard.
  This is a necessary concept. Without “details” there would be no places to begin with. Or those places wouldn’t have any comprehensible structure.
  Colors
  “Details” are like cells. Cells make up different types of tissues. “Details” make up colors. You can compare colors to textures or materials.
  (The places I’m talking about are not physical. So the example below is just an analogy.)
  Imagine that you have small toys in the shape of 3D solids. You’re interested in their volume. They have very clear sides, you study their volume with simple formulas.
  Then you think: what is the volume of the giant cloud behind my window? What is a “side” of a cloud? Do clouds even have “real” shapes? What would be the formula for the volume of a cloud, would it be the size of a book?
  The volume of the cloud has a different color. Because the context around the “volume” changed completely. Because clouds are made of a different type of “tissue”. (compared to toys)
  OK, we resolved one question, but our problems don’t end here. Now we encounter an object that looks like a mix between a cloud and a simple shape. Are we allowed to simplify it into a simple shape? Are we supposed to mix both volumes? In what proportions and in what way?
  We need rules to interpret objects (rules to assign importance to different parts or “layers” of an object before mixing them into a single substance). We need rules to mix colors. We need rules to infer intermediate colors.
  Spectrum(s)
  There are different spectrums. (Maybe they’re all parts of one giant spectrum. And maybe one of those spectrums contains our world.)
  Often I imagine a spectrum as something similar to the visible spectrum: a simple order of places, from the first to the last.
  A spectrum gives you the rules to interpret places and to create colors. How to make a spectrum?
  1. You take a bunch of places. Make some loose assumptions about them. You assume where the “details” in the places are or may be.
  2. Based on the similarities between the places, you come up with the most important “colors” (“materials”) these places may be made of.
  3. You come up with rules that tell you how to assign the colors to the places. Or how to modify the colors so that they fit the places.
  The colors you came up with have an order:
  - The farther you go in a spectrum, the more details dissolve. First you have distinct groups of details that create volume. Then you have “flat”/stretched groups of details. Then you have “cloud-like” groups of details.
  But those colors are not assigned to the places immediately. We’ve ordered abstract concepts, but haven’t ordered the specific places. Here are some of the rules that allow you to assign the colors to the places:
  - When you evaluate a place, the smaller-scale structures matter more. For example, if the smaller-scale structure has a clear shape and the larger-scale structure doesn’t, the former matters more in defining the place.
  - The opposite is true for “negative places”: the larger-scale structures contribute more. I often split my spectrum into a “positive” part and a “negative” part. They are a little bit like positive and negative numbers.
  You can call those “normalization principles”. But we need more.
  The principle of explosion/vanishing
  Two places with different enough detail patterns can’t have the same color. Because a color is the detail pattern.
  One of the two places has to get a bigger or a smaller color (by an order of magnitude). But this may lead to an “explosion” (the place becomes unbelievably big/too distant from all the other places) or to a “vanishing” (the place becomes unbelievably microscopic/too distant).
  This is bad because you can’t allow so much uncertainty about the places’ positions. It’s also bad because it completely violates all of your initial assumptions about the places. You can’t allow infinite uncertainty.
  When you have a very small number of places in a spectrum, they have a lot of room to move around. You’re unsure about their positions. But when you have more places, due to the domino effect you may start getting “explosions” and “vanishings”. They will allow you to rule out wrong positions and wrong rankings.
  Overlay (superposition)
  We also need a principle that would help us to sort places with the “same” color.
  I feel it goes something like this:
  - Take places with the same color. Let’s say this color is “groups of details that create volume”.
  If the places have no secondary important colors mixed in:
  1. Overlay (superimpose) those places over each other.
  2. Ask: if I take a random piece of a volume, what’s the probability that this piece is from the place X? Sort the places by such probabilities.
  If the places do have some secondary important colors mixed in:
  1. Overlay (superimpose) those places over each other.
  2. Ask: how hard is it to get from the place’s main color to the place’s secondary color? (Maybe mix and redistribute the secondary colors of the places.) Sort places by that.
  For example, let’s say the secondary color is “groups of details that create a surface that covers the entire place” (the main one is “groups of details that create volume”). Then you ask: how hard is it to get from the volume to that surface?
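  A minimal sketch of the first case (no secondary colors), assuming each place can be summarized by a single number for how much volume it contributes once the places are overlaid. The function name `rank_by_overlay`, the place names, and the numbers are all hypothetical and chosen only for illustration; this is one possible reading of the step, not a definition of it.

```python
from typing import Dict, List, Tuple


def rank_by_overlay(volumes: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank places by the chance that a random piece of the pooled volume is theirs.

    `volumes` maps a (hypothetical) place name to the amount of volume that
    place contributes to the overlay. The result is sorted from the most
    likely source to the least likely one.
    """
    total = sum(volumes.values())
    if total <= 0:
        raise ValueError("the overlaid places must contain some volume")
    # Probability that a uniformly random piece of the overlay came from each place.
    shares = {name: amount / total for name, amount in volumes.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


# Made-up numbers, purely for illustration.
ranking = rank_by_overlay({"place_a": 6.0, "place_b": 3.0, "place_c": 1.0})
for name, share in ranking:
    print(f"{name}: {share:.2f}")
```

  The second case (places with secondary colors) would need some measure of how hard it is to get from the main color to the secondary one, which the text leaves open, so it is not sketched here.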
  Note: I feel it might be related to Homeostatic Property Clusters. I learned the concept from a Philosophy Tube video. It reminded me of “family resemblance” popularized by Ludwig Wittgenstein.
  Note 2: some examples I’m going to write about later: https://imgur.com/a/F5Vq8tN
  Thought: places by themselves are incomparable. They can be compared only inside of a spectrum.
  3 cats (a slight tangent/bonus)
  Imagine a simple drawing of a cat. And a simple cat sculpture. And a real cat. Do they feel different?
  If “yes”, then you experience a difference between various qualia. You feel some meta knowledge about qualia. You feel qualia “between” qualia.
  You look at the same thing in different contexts. And so you look at 3 versions of it through 3 different lenses. If you looked at everything through the same lens, you would recognize only a single object.
  If you understand what I’m talking about here, then you understand what I’m trying to describe about “colors”. Colors are different lenses, different contexts.
