LW1.0 username Manfred. Day job is condensed matter physics, hobby is thinking I know how to assign anthropic probabilities.
Charlie Steiner
I was kind of hoping this post would be more about moral authority as it actually exists in our morally-neutral universe. For having subjectivism in the title, it was actually all about objectivism.
I’m reminded of that aphorism about the guy writing a book on magic, who kept getting asked whether it was about “real magic.” He had to say no, it was about stage magic, because “real magic,” to the questioner, means magic that isn’t real, while the sort of magic that can actually be done isn’t “real magic.”
How does someone whose moral judgment you trust actually get that trust, in the real world? It’s okay if this looks more like “stage magic” than “real magic.”
I’m still holding out hope for jumping straight to FAI :P Honestly I’d probably feel safer switching on a “big human” than a general CIRL agent that models humans as Boltzmann-rational.
Though on the other hand, does modern ML research already count as trying to use UFAI to learn how to build FAI?
I care about things other than suffering.
You have to put a measure on things. I care less about unlikely things, things with small measure, even if there’s a multiverse.
I care about things other than subjective experience.
And seriously, I care about things other than suffering.
I normally don’t think of most functions as polynomials at all—in fact, I think of most real-world functions as going to zero for large values. E.g. the function “dogness” vs. “nose size” cannot be any polynomial, because polynomials (or their inverses) blow up unrealistically for large (or small) nose sizes.
I guess the hope is that you always learn polynomials of even degree, oriented in such a way that the extremes appear unappealing?
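To illustrate what I mean (a toy I made up just now, nothing from the post): fit a polynomial and a bounded bump to the same data, then look far outside the training range.

```python
# Toy illustration (my own, not from the original post): fit a polynomial and a
# bounded function to the same data, then compare their behavior far outside the
# training range. The polynomial's extrapolation grows without bound; the bounded
# "bump" model goes to zero, which is closer to how "dogness vs. nose size"
# should behave.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-2, 2, 50)
y_train = np.exp(-x_train**2) + 0.05 * rng.normal(size=x_train.size)  # a bump plus noise

# Degree-6 polynomial fit.
poly_coeffs = np.polyfit(x_train, y_train, deg=6)

# Gaussian bump fit: y = a * exp(-(x/s)^2), fit a and s crudely on a grid.
def gaussian_bump(x, a, s):
    return a * np.exp(-(x / s) ** 2)

grid = [(a, s) for a in np.linspace(0.5, 1.5, 21) for s in np.linspace(0.5, 2.0, 16)]
a_best, s_best = min(grid, key=lambda p: np.sum((gaussian_bump(x_train, *p) - y_train) ** 2))

# Evaluate both fits far outside the training range.
x_far = np.array([5.0, 10.0, 50.0])
print("polynomial at x=5,10,50:", np.polyval(poly_coeffs, x_far))        # blows up
print("bump model at x=5,10,50:", gaussian_bump(x_far, a_best, s_best))  # ~0
```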
I recently got reminded of this post. I’m not sure I agree with it, because I think we have different paradigms for AI alignment—I’m not nearly so concerned with the sort of oversight that relies on looking at the state of the computer. Though I have nothing against the sort of oversight where you write a program to tell you about what’s going on with your model.
Instead, I think that anticipating the effects of QC on AI alignment is a task in prognosticating how ML is going to change if you make quantum computing available. I think the relevant killer app is not going to be Grover’s algorithm, but quantum annealing. So we have to try to think about what kind of ML you could do if you could get a large speedup on optimization on objective functions but were limited to a few hundred to a few thousand bits at a time (assuming that that’s true for near-future quantum computers).
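To be concrete about the kind of objective I mean (a toy of my own, not a real annealing pipeline; the instance and weights are made up): annealers target QUBO problems, i.e. minimizing x^T Q x over binary x, and something like feature selection fits naturally into a few hundred bits.

```python
# Minimal sketch (my illustration): the kind of objective a quantum annealer
# targets is a QUBO, minimize x^T Q x over binary x. Here: pick a subset of
# features that are individually relevant to a target but not mutually
# redundant, a toy stand-in for an ML subproblem you might hand to an annealer.
# We brute-force it classically since the instance is tiny.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_features = 8
X = rng.normal(size=(200, n_features))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)     # feature 1 nearly duplicates feature 0
y = X[:, 0] + X[:, 3] + 0.1 * rng.normal(size=200)

relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)])
redundancy = np.abs(np.corrcoef(X, rowvar=False))

# QUBO matrix: reward relevance on the diagonal, penalize redundant pairs off it.
Q = redundancy - np.diag(np.diag(redundancy)) - 2.0 * np.diag(relevance)

def qubo_energy(bits):
    x = np.array(bits)
    return x @ Q @ x

best = min(itertools.product([0, 1], repeat=n_features), key=qubo_energy)
print("selected features:", [j for j, b in enumerate(best) if b])
```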
And given those changes, what changes for alignment? Seems like a hard question.
Sunday, Sunday, Sunday, at the Detroit Dragway!
Steve’s big thoughts on alignment in the brain probably deserve a review. Component posts include https://www.lesswrong.com/posts/diruo47z32eprenTg/my-computational-framework-for-the-brain , https://www.lesswrong.com/posts/DWFx2Cmsvd4uCKkZ4/inner-alignment-in-the-brain , https://www.lesswrong.com/posts/jNrDzyc8PJ9HXtGFm/supervised-learning-of-outputs-in-the-brain
Interestingly, I think there aren’t any of my posts I should recommend—basically all of them are speculative. However, I did have a post called Gricean communication and meta-preferences that I think is still fairly interesting speculation, and I’ve never gotten any feedback on it at all, so I’ll happily ask for some: https://www.lesswrong.com/posts/8NpwfjFuEPMjTdriJ/gricean-communication-and-meta-preferences .
I think this looks fine for IDA—the two problems remain the practical one of implementing Bayesian reasoning in a complicated world, and the philosophical one that probably IDA on human imitations doesn’t work because humans have bad safety properties.
Hm, I thought that was what Evan called it, but maybe I misheard. Anyhow, I mean the problem that, because you can model humans in different ways, there is no unique utility function. We might think of this as having not just one Best Intentional Stance, but a generalizable intentional stance with knobs and dials on it, where different settings lead to viewing the subject in different ways.
I call such real-world systems that can be viewed non-uniquely through the lens of the intentional stance “approximate agents.”
To the extent that mesa-optimizers are approximate agents, this raises familiar and difficult problems with interpretability. Checking how good an approximation is can require knowing about the environment it will get put into, which (that being the future) is hard.
Great interview! Weird question—did Rob Miles get a sneak peek at this interview, given that he just did a video on the same paper?
The biggest remaining question I have is a follow-up on the question you asked: “Am I a mesa-optimizer, and if so, what’s my mesa-objective?” You spend some time talking about lookup tables, but I wanted to hear about human-esque “agents” that seem to do planning, yet simultaneously have a very serious underdetermination problem for their values—is Evan’s idea to try to import some “solution to outer alignment” to these agents (note that such a solution can’t be like HCH)? Or some serial vs. parallel compute argument that human-level underdetermination of values will be rare (though that argument seems dubious)?
I don’t mean the internal language of the interpreter, I mean the external language, the human literally saying “it’s raining.” It seems like there’s some mystery process that connects observations to hypotheses about what some mysterious other party “really means”—but if this process ever connects the observations to propositions that are always true, it seems like that gets most favored by the update rule, and so “it’s raining” (spoken aloud) meaning 2+2=4 (in internal representation) seems like an attractor.
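Here’s a toy version of the worry (my own construction, nothing from the post): if the update rule only checks whether the interpreted proposition is true on the occasions when the utterance is made, the tautology never takes a hit.

```python
# Toy version of the worry (my own construction): score candidate
# interpretations of the utterance "it's raining" by whether the interpreted
# proposition is true on the occasions when the utterance is made. The tautology
# "2+2=4" is never falsified, so with a slightly noisy speaker it ends up *more*
# favored than the intended meaning.
import random

random.seed(0)

candidates = {
    "means RAIN":     lambda raining: raining,      # true iff it is actually raining
    "means 2+2=4":    lambda raining: True,         # tautology: always true
    "means NOT-RAIN": lambda raining: not raining,
}
score = {name: 1.0 for name in candidates}

for _ in range(500):
    raining = random.random() < 0.3
    # The speaker mostly says "it's raining" when it rains, with occasional slips.
    says_it = random.random() < (0.9 if raining else 0.05)
    if not says_it:
        continue
    for name, prop in candidates.items():
        # Truth-checking update: penalize an interpretation whose proposition is
        # false when the utterance occurs, leave it alone when it's true.
        score[name] *= 1.0 if prop(raining) else 0.1

total = sum(score.values())
print({name: s / total for name, s in score.items()})
# The tautology beats "means RAIN": truth alone can't distinguish "it's raining"
# from "2+2=4"; you need something about *when* the speaker says it.
```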
I’m still vague on how the interpretation actually works. What connects the English sentence “it’s raining” to the epistemology module’s rainfall indicator? Why can’t “it’s raining” be taken to mean the proposition 2+2=4?
Good point. Or plagiarism intended to fool anti-plagiarism software—ideally by making plagiarized material match your own style.
How is “clean substructure” different in principle from a garden-variety high-level description? A crepe is a thin pancake made with approximately equal parts egg, milk, and flour, potentially with sugar, salt, oil, or a small amount of leavening, spread in a large pan and cooked quickly. This English sentence is radically simpler than a microscopic description of a crepe. As a law of crepeitude, it has many admirable practical qualities, allowing me to make crepes and to tell which recipes are for crepes and which are not, even if they’re slightly different from my description.
A similar high-level description for consciousness might start with “Conscious beings are a lot like humans—they do a lot of information processing, have memories and imaginations and desires, think about the world and make plans, feel emotions like happiness or sadness, and often navigate the world using bodies that are in a complex feedback loop with their central information processor.” This English sentence is, again, a lot simpler than a microscopic description of a person. It is, all in all, a remarkable feat of compression.
Of course, I suspect this isn’t what you want—you hope that consciousness is obligingly simple in ways that cut out the reliance on human interpretation from the above description, while still being short enough to fit on a napkin. The main way that this sort of thing has been true in physics and chemistry is when humans are noticing some pattern in the world with a simple explanation in terms of underlying essences. The broad lack of such essences in philosophy explains the historical failure of myriad simple and objective theories of humanity, life, the good, etc.
Neuroscience and philosophy are not physics and chemistry. I don’t expect there to be an “atomic theory of color qualia” or anything like it because of a combination of factors like:
Cultural and general interpersonal differences in color perception.
The tendency of evolution to produce complicated, interlinked mechanisms, including in the brain, rather than modular ones.
Examples of brain damage and people with unusual psychology or physiology that have dramatically different color qualia than me.
Animals and artificial systems that use color perception to navigate the world but don’t seem to converge to similar ways of perceiving color.
The evidence of absence of a soul or other homuncular center of perception, which necessitates understanding perception as an emergent phenomenon made of lots of little pieces.
The causal efficacy of color perception (i.e. I don’t just see things, I actually do different things depending on what I see) tying colors into all the other complications of the human mind.
Complications that we know about from neuroscience, such as asymmetric local centers of function, and certain individual clusters of neurons being causally related to individual memories, motions, and sensations.
Our experience with artificial neural networks, and how challenging interpreting their weights is.
--
Compare this with atoms: atoms do indeed have some local variation in mass, but only within a suspiciously small range. Rules like conservation of mass appear to hold among the elements, rather than having common exceptions. We didn’t already know that atoms were emergent phenomena arising from the interactions of bajillions of pieces. We didn’t already have a scientific field studying how many of those bajillions of pieces play idiosyncratic and evolutionarily contingent roles. Etc.
That’s good news to me, and I’m sorry for making a sweeping generalization based on your older work. The marker of legitimacy I am particularly interested in is whether your empirical investigations are still useful in the case that consciousness is not at all amenable to simple formalization.
Isn’t it perplexing that we’re trying to reduce the amount of suffering and increase the amount of happiness in the world, yet we don’t have a precise definition for either suffering or happiness?
Not nearly as perplexing as how I just made crepes without a precise definition of what a crepe is. If only there were crepes in Wittgenstein’s Philosophical Investigations, one of the seminal works I might recommend to someone confronting issues like these.
I know it’s a huge thing to ask, but I seriously urge you to rethink your philosophical commitment to finding a precise formal definition of consciousness, etc. Try to do things that look more like “normal” neuroscience of consciousness (in scare quotes because the field is very heterogeneous, but as an example of work that solves interesting problems orthogonal to formally defining consciousness, I like the experimental philosophy presented by Ned Block in this talk: https://youtu.be/6lHHxcxurhQ ).
Oh wait, are you the first author on this paper? I didn’t make the connection until I got around to reading your recent post.
So when you talk about moving to a hierarchical human model, how practical do you think it is to also move to a higher-dimensional space of possible human-models, rather than using a few hand-crafted goals? This necessitates some loss function or prior probability over models, and I’m not sure how many orders of magnitude more computationally expensive it makes everything.
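To gesture at where I expect the cost to go (a toy of my own, not anything from your paper): put a prior over a parameterized family of human models and weight them by how well they predict observed choices. A grid over k parameters with m values each already costs m**k model evaluations per update, which is where the orders of magnitude come from.

```python
# Rough sketch of the cost question (my toy, not the paper's method): weight a
# parameterized family of human models by fit to observed choices, under a
# uniform prior over a parameter grid. The names and numbers are made up.
import itertools
import math

observed_choices = [("cake", "salad"), ("salad", "cake"), ("cake", "salad")]  # (chosen, rejected)

def choice_logprob(params, chosen, rejected):
    """Boltzmann-rational choice model: params = (value_of_cake, rationality beta)."""
    value_cake, beta = params
    utility = {"cake": value_cake, "salad": 1.0 - value_cake}
    z = math.exp(beta * utility[chosen]) + math.exp(beta * utility[rejected])
    return beta * utility[chosen] - math.log(z)

# Grid over the model family; the prior here is just uniform over the grid.
value_grid = [i / 10 for i in range(11)]
beta_grid = [0.1, 1.0, 10.0]
posterior = {}
for params in itertools.product(value_grid, beta_grid):
    loglik = sum(choice_logprob(params, c, r) for c, r in observed_choices)
    posterior[params] = math.exp(loglik)

total = sum(posterior.values())
best = max(posterior, key=posterior.get)
print("best-fitting human-model (value_of_cake, beta):", best, "weight:", posterior[best] / total)
```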
This article and comment have used the word “true” so much that I’ve had that thing happen where when a word gets used too much it sort of loses all meaning and becomes a weird sequence of symbols or sounds. True. True true true. Truetrue true truetruetruetrue.
What Everett says in his thesis is that if the measure is additive between orthogonal states, it’s the norm squared. Therefore we should use the norm squared of observers when deciding how to weight their observations.
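(To spell out the step I mean: this is my paraphrase, so check the thesis for the careful version.)

```latex
% My paraphrase of Everett's step, not a quote from the thesis.
% Assumption: the measure on a branch depends only on the amplitude's magnitude,
% m(c_i) = f(|c_i|), and is additive over orthogonal decompositions.
\[
\psi = \sum_i c_i \phi_i, \qquad \langle \phi_i, \phi_j \rangle = \delta_{ij}.
\]
% Since \psi could itself be a single component, with amplitude \sqrt{\sum_i |c_i|^2},
% of some larger superposition, additivity forces
\[
f\!\left(\sqrt{\sum_i |c_i|^2}\,\right) \;=\; \sum_i f(|c_i|).
\]
% Substituting g(x) = f(\sqrt{x}) gives Cauchy's additivity equation
% g(\sum_i x_i) = \sum_i g(x_i), so for any reasonably regular f, g is linear and
\[
f(|c_i|) \;=\; k\,|c_i|^2 ,
\]
% i.e. the only additive measure of this form is the norm squared, up to an overall constant.
```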
But this is a weird argument, not at all the usual sort of argument used to pin down probabilities—the archetypal probability arguments rely on things like ignorance and symmetry. Everett just says “Well, if we put a measure on observers that doesn’t have weird cross-state interactions, it’s the norm squared.” But understanding why humans described by the Schrodinger equation wouldn’t see weird cross-state probability flows still requires additional thought (that’s a bit hard in the non-Hamiltonian-eigenstate observer + environment basis Everett uses for convenience).
But I think that that’s an argument you can make in terms of things like ignorance and symmetry, so I do think the problem is somewhat solved. That argument isn’t necessarily easy to understand or widely known, though, and the intervening decades have seen more than a little muddying of the waters from all sides, from non-physicist philosophers to David Deutsch.
I don’t totally understand the Liouville’s theorem argument, but I think it’s aimed at a more subtle point about choosing the common-sense measure for the underlying Hilbert space.