However, it’s worth noting that saying the agent is mistaken about the state of the world is really an anthropomorphization. It was actually perfectly correct in inferring where the red part of the world was—we just didn’t want it to go to that part of the world. We model the agent as being ‘mistaken’ about where the landing pad is, but it works equally well to model the agent as having goals that are counter to ours.
That we can flip our perspective like this suggests to me that thinking of the agent as having different goals is likely still anthropomorphic, or at least teleological, reasoning that results from us modeling the agent as having dispositions it doesn't actually have.
I'm not sure what to offer as an alternative, since this isn't a category where I feel grounded enough to see clearly what might really be going on, much less to offer a more useful abstraction that avoids the problem. But I think it's worth considering that there's a deeper confusion here, one this example exposes but doesn't resolve.
Now put the two together, and you get an “attention schema”, an internal model of attention (i.e., of the activity of the GNW). The attention schema is supposedly key to the mystery of consciousness.
The idea of an attention schema helps make sense of a thing talked about in meditation. In Zen we sometimes talk about it via the metaphor of the mind as a mirror that sees itself reflected in itself. In The Mind Illuminated it's referred to as metacognitive awareness. The point is that the process by which the mind operates can be observed by itself even as it operates, and perhaps the attention schema is an important part of what it means to do that: specifically, the attention schema becomes able to model itself.
The short answer is that yes, they are related and basically about the same thing. However, the approaches researchers take vary a lot.
Relevant considerations that come to mind:
The extent to which values/preferences are legible
The extent to which they are discoverable
The extent to which they are hidden variables
The extent to which they are normative
How important immediate implementability is
How important extreme optimization is
How important safety concerns are
The result is that I think there is something of a divide between safety-focused researchers and capabilities-focused researchers in this area, due to different assumptions, and that makes each cluster's work not very interesting or relevant to the other.
So I think you are right about the way aesthetics powers ethical reasoning, and I think aesthetics is just a waypoint in the causal mechanism that generates ethical judgments, because aesthetics is ultimately about what we value (how we compare things for various purposes), and what we value is a function of valence. So to the extent I agree, it's to the extent that I see ethics and aesthetics as applications of valence to different domains.
Another possibly useful data point in this discussion: I spent 6 years working on my PhD before dropping out. On my resume I mention that I worked on a PhD but didn’t receive it (to explain what I was doing for all those years of my life). Multiple things happen as a result of this:
some people don’t read carefully and just think I have a PhD
some people are excited that I dropped out of one (that’s a common Bay Area reaction)
some people ignore everything under “Education” and only look at work experience
I suspect that having a repossessed PhD would draw similar reactions to having dropped out before graduation.
On the other hand, it would work in cases where a degree is necessary for licensure. For example, degree repossession would be effective against professions that require a degree as part of certification to practice. If it is not already the case, this could be made the case for doctors, nurses, lawyers, accountants, professional engineers, etc., so that losing the degree would have real impacts even if you could still tell people you had it: without your license, you would not be allowed to practice your profession in the same capacity as before.
(Whether or not licensing is a good policy is a separate question from the consideration of how the mechanism of degree repossession might work.)
Arguably the LessWrong Wiki is something like your desired compiled list of rationalist wisdom, even if it is old and out of date.
I’d describe that as a statistical regularity over statistical regularities over preferences.
But the “meta-preferences” are a bit more worrying. Are they genuine meta-preferences? Especially since the second one is one that was more subconscious, and the third one looks more like a standard preference than a meta-preference. If the category of meta-preference is not clear, then that part of the research agenda needs to be improved.
I think one of the challenges is that, to me at least, it's still unclear whether we really have anything like meta-preferences that behave in systematic ways. That is, is there a systematic way in which our highly conditional preferences (which, in a very real sense, exist only momentarily at a particular decision point situated within the causal history of the universe) combine, such that we can say more than that there are some statistical regularities to our preferences? Our preferences may have some coherent statistical features about which we can make stochastically consistent statements, but I think this falls short of what we usually hope for from meta-preferences, and it certainly seems to fall short of how I understand you to be thinking about them. (Maybe I misunderstand you, but I take you to be thinking of meta-preferences as something that can ultimately be made to have nice mathematical properties, like some version of rationality, that would allow them to be optimized against without weird things happening.)
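To make the worry concrete, here is a minimal Python sketch, with contexts, options, and orderings invented purely for illustration: each context induces a perfectly coherent preference ordering, yet the regularity you get by aggregating them (here by pairwise majority) is cyclic, so no single utility function represents it. This is roughly the kind of failure of "nice mathematical properties" I have in mind.

```python
from itertools import combinations

# Invented context-conditional preference orderings over three options.
# Each ordering is transitive (coherent) on its own; they vary by context.
context_rankings = {
    "morning":   ["pizza", "salad", "soup"],
    "afternoon": ["salad", "soup", "pizza"],
    "evening":   ["soup", "pizza", "salad"],
}

def majority_prefers(a, b):
    """True if a is ranked above b in a majority of contexts."""
    wins = sum(r.index(a) < r.index(b) for r in context_rankings.values())
    return wins > len(context_rankings) / 2

for a, b in combinations(["pizza", "salad", "soup"], 2):
    winner, loser = (a, b) if majority_prefers(a, b) else (b, a)
    print(f"{winner} > {loser} by majority")

# Prints a cycle (pizza > salad, salad > soup, soup > pizza): the aggregate
# regularity is real, but it cannot be represented by a single utility
# function, let alone safely optimized against.
```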
I'm actually not really sure. We have some vague notion that, for example, my preference for eating pizza shouldn't result in attempts at unbounded pizza-eating maximization, and, judging from my current values, I would probably be unhappy if a maximizing agent saw I liked pizza best of all foods and proceeded to feed me only pizza forever, even if it modified me so that I would maximally enjoy the pizza each time and never get bored of it.
Thinking more in terms of regressional Goodharting, maybe it looks like not deviating from the true target when we optimize for a measure of it. Consider the classic rat-extermination example of Goodharting. We already know that collecting rat tails as evidence of extermination is a measure that leads to weird effects. Does there exist a function measuring rat exterminations that, when optimized for, produces the intended effect (extermination of rats) without doing anything "weird" (e.g., generating unintended side effects, or maximizing rat reproduction so we can exterminate more of them), i.e. a measure that just straightforwardly leads to the extinction of rats and nothing else?
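Here is a minimal simulation of the regressional failure mode, with entirely made-up numbers: the measure is the true target plus independent noise, and the harder we select on the measure, the larger the gap between what the measure promises and the true value we actually get.

```python
import random

random.seed(0)

# Made-up model: each candidate action has a true value t, but we only
# observe a measure m = t + independent noise.
candidates = []
for _ in range(100_000):
    t = random.gauss(0, 1)       # true value
    m = t + random.gauss(0, 1)   # measured value
    candidates.append((m, t))

candidates.sort(reverse=True)    # "optimize" by selecting on the measure

for top_k in (10_000, 1_000, 100, 10):
    best = candidates[:top_k]
    mean_m = sum(m for m, _ in best) / top_k
    mean_t = sum(t for _, t in best) / top_k
    print(f"top {top_k:>6}: mean measure = {mean_m:.2f}, mean true value = {mean_t:.2f}")

# The mean measure of the selected candidates keeps climbing, but the mean
# true value lags further and further behind: the more extreme the
# optimization, the more the measure comes apart from the target.
```

In these terms, the question above is whether a measure of rat extermination exists whose gap from the true target stays at zero no matter how hard we optimize.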
Thinking about my focus on a theory of human values for AI alignment, the problem is quite hard when we ask for a way to precisely specify values. I might state the problem as something like finding “a theory of human values accurate and precise enough that its predictions don’t come apart under extreme optimization”. To borrow Isnasene’s notation, here X = “a theory of human values accurate and precise enough” and Y = “its predictions don’t come apart under extreme optimization”.
So what is an inverse problem with X' and Y'? A Y' might be something like "functions that behave as expected under extreme optimization", where "behave as expected" means something like exhibiting no Goodhart effects. We could be even more narrow and make Y' = "functions that don't exhibit Goodhart effects under extreme optimization". Then X' would be something like a generalized description of the classes of functions that satisfy Y'.
Doing the double inverse, we would try to find X from X' by looking at what properties hold for this class of functions that don't suffer from Goodharting, and using them to help us identify what would be needed to create an adequate theory of human values.
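One way to make this concrete (the notation here is mine, introduced just for illustration): let $A$ be a space of actions or policies, $T : A \to \mathbb{R}$ the true target, and $M : A \to \mathbb{R}$ a measure of it. Then Y' is roughly the class

$$Y' \approx \{\, M : A \to \mathbb{R} \;\mid\; \arg\max_{a \in A} M(a) \subseteq \arg\max_{a \in A} T(a) \,\},$$

and X' is a structural characterization of that class. Whatever properties turn out to be necessary and sufficient for membership in Y' are then candidate adequacy conditions to demand of a theory of human values.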
Re mazes everywhere: we should of course also pay attention to the extent to which our own community is an immoral maze. Cf. a similar analysis from nearly a year ago in response to a Robin Hanson post.
I think a serious complication arises in this scenario:
I discover a dangerous idea at age 20, and get a reverse patent on it as described here.
At age 60 I learn I am terminally ill and don’t believe there is any mechanism by which my existence can be carried forward (e.g. cryonics). I am given 1 year to live.
I make the dangerous idea public and collect a large sum for having kept quiet for years, so I can enjoy my last year of life, even if the world doesn't continue much beyond that because it's destroyed by my dangerous idea.
To the extent that axiology is about values (what is good/bad), it is about preferences (what one would rather do), and it is thus tied to decision theory in that it offers the place from which numbers get assigned to different decisions, even if it doesn't say how to choose among them. I assume most people are familiar with preferences, and this may or may not be very relevant for your work, since it's already pretty similar to issues in morality that require choosing between options, but I thought it worth mentioning.
If you are reasonably skilled at singing along with such songs, there is basically no such thing as hitting wrong notes (unless the song is deliberately confusing). Instead, you end up hitting harmony notes. Basically, each note constrains the notes that come after it, and there are only a few "valid" options.
And then by the middle of Solstice, when you get to things like Brighter Than Today (a moderately complicated song), it’s actually an achievable ask to sing along if you haven’t heard it before.
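To make the "only a few valid options" claim concrete, here is a toy Python sketch; the choice of key and the consonance rule are deliberate simplifications, not music-theory gospel. The point is just that a singer who stays roughly in key mostly lands on unisons or consonant harmonies with the melody.

```python
# Toy model: C major scale, with consonance defined as unison, thirds,
# fourths, fifths, or sixths (measured in semitones above the melody note).
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
C_MAJOR = ["C", "D", "E", "F", "G", "A", "B"]
CONSONANT = {0, 3, 4, 5, 7, 8, 9}   # semitone intervals treated as consonant

def consonant_scale_tones(melody_note):
    """Scale tones forming a consonant interval with the melody note."""
    i = NOTES.index(melody_note)
    return [n for n in C_MAJOR if (NOTES.index(n) - i) % 12 in CONSONANT]

for note in ["C", "E", "G"]:
    print(note, "->", consonant_scale_tones(note))

# For any melody note in the scale, four or five of the seven scale tones
# are consonant with it; mostly the adjacent seconds and sevenths clash.
# A singer who misses the melody but stays in key usually hits a harmony.
```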
So I noticed this year that, in particular, songs like "Brighter Than Today" are musically complicated in ways that made them harder to sing along to than some of the others. Maybe this is less of an issue in, for example, a smaller space where a stronger feedback loop towards good singing is created, but my experience was that people were struggling with the melody, with hitting notes at the right time, and generally stumbling over the words.
For example, I experience "Brighter Than Today" as having some weird timing change-ups in the chorus, where syllables suddenly and unexpectedly get shortened to stay within the meter. I heard this most strongly around the second line of the chorus, "although the night is cold", where it seems like people start tripping over themselves to keep up. I noticed similar effects in other places in this song and in other songs. Maybe something else is going on musically, but that's how my semi-trained ear perceives it.
I think I ignored this in the past because of some combination of the songs being new, everyone being new to Solstice, there being different levels of emphasis on singalong in the past, and factors like those you mentioned that made the singing of the songs at different venues different. But this year we are not so new to Solstice, the songs are not so new, and there was a stronger effort to make singalong an important component of the event.
It made me yearn for more simple, familiar songs that fit the theme generally, and fewer songs that were crafted to fit Solstice tightly but are more complex and less familiar.
I’m making a fairly strong claim (weakly held) that “is it beautiful or ugly?” is at least one of the important questions to be asking, in addition to “is capitalism good/bad” and “does raising minimum wage help or harm workers?”. Not because it’s how a flawless AI would think about it, but because it’s how humans seem to often think about it.
What is an aesthetic?
An aesthetic is a mishmash of values, strategies, and ontologies that reinforce each other.
I suspect it’s because these are really all the same thing, unified by a common mental mechanism. “Aesthetic” is as good a name for the natural category as any.
A lot of places where it seems like the universe isn’t playing nice might just be a reflection of us not being smart enough or not having thought long enough.
I guess there can be some disagreement on what constitutes "a lot", but it seems to me that some of these are not subject to this, because they are proofs of limitations that can only be gotten around either by pragmatic means (assuming away the problem by limiting yourself to the cases where it doesn't arise) or by relaxing the strength of what the mechanism claims. Of the examples listed, the problem of perception (Chris calls it the problem of skepticism), Gödel's theorem, the problem of induction, the problem of the prior (Bayesianism), and qualia (the problem of perception again) all seem fundamental in the sense that they arise from within the abstraction/system. They can only be gotten around by pragmatism or by transcendence/relaxation, and cannot possibly become more tractable by thinking about them more, even if we come up with better ways to deal with them pragmatically or better abstractions with which to transcend them. In this sense the universe will continue to look like it doesn't play nice from within those abstractions, forever and always, because that is a feature of the way those abstractions relate to reality.
While I really like the idea of multiple songs clumped together as singalongs, I feel like many of the songs at, for example, the 2019 Bay Area Solstice were not easy to sing along to. Aside from a handful that are either well-known popular songs or folk songs designed to be sung well the first time through, I feel like a lot of the songs suffer from being difficult to sing right on the first try unless you are already fairly familiar with them, and for many (maybe most) attendees that is just not the case (or at least I and a lot of people near me struggled with a lot of the songs). So I like this idea, but I think it only works if the songs are better optimized for singing along.
Babble and prune seems related.
Does there exist anywhere a community of folks you might work on this with who aren't derisive of this approach, woefully uninformed on the specifics (I think that's broadly the issue here on LW; we're not selecting for physicists of the sort you'd like to talk to, even if there are a lot of physicists here), or cranks?
My experience has been that it's very helpful to have supportive folks who can at least somewhat appreciate what you are doing and offer feedback, even when you are going so deep in some direction that you are the expert, which means they will often be mistaken, though they can still help you avoid making obvious mistakes.