Hi, I am a Physicist, an Effective Altruist and AI Safety student/researcher.
I recently updated how I view the alignment problem. The post that caused my update is this one from the shard sequence. Also worth mentioning is an older post that points to the same thing, but which I just happened to read later.
Basically, I used to think we needed to solve both outer and inner alignment separately. Now I no longer think this is a good decomposition of the problem.
It’s not obvious that alignment must factor in the way described above. There is room for trying to set up training in such a way to guarantee a friendly mesa-objective somehow without matching it to a friendly base-objective. That is: to align the AI directly to its human operator, instead of aligning the AI to the reward, and the reward to the human.
Quote from here
If something is good at replicating, then there will be more of that thing; this creates a selection effect for things that are good at replicating. The effects of this can be observed in biology and memetics.
Maybe self-replication can be seen as an agentic system with the goal of self-replicating? For this particular question, all the uncertainty comes from "agent" being a fuzzy concept, not from any uncertainty about the world. So answering this question is a choice of perspective, not information about the world.
Either way, the type of agency I'm mainly interested in is the type that has goals other than just self-replication. Although maybe there are things to be learned from the special case of having self-replication as a goal?
If the AI learns my values then this is a replication of my values. But there are also examples of magic agentic force where my values are not copied at any point along the way.
Looking at how society is transferred between generations might hold some clues to value learning? But I'm less optimistic about looking at what is similar between instances of self-replication in general, because I think I already know this, and also, it seems to be one abstraction level too high, i.e. the similarities are properties above the mechanistic details, and those details are what I want.
infraBook Club I: Corrigibility is bad ashkually
One of my old blog posts I never wrote (I did not even list it in a "posts I will never write" document) is one about how corrigibility is anti-correlated with goal security.
Something like: If you build an AI that doesn't resist someone trying to change its goals, it will also not try to stop bad actors from changing its goals. (I don't think this particular worry applies to Paul's version of corrigibility, but this blog post idea is from before I learned about his definition.)
I’m not talking about recursive self-improvement. That’s one way to take a sharp left turn, and it could happen, but note that humans have neither the understanding nor control over their own minds to recursively self-improve, and we outstrip the rest of the animals pretty handily. I’m talking about something more like “intelligence that is general enough to be dangerous”, the sort of thing that humans have and chimps don’t.
Individual humans can’t FOOM (at lest not yet), but humanity did.
My best guess is that humanity took a sharp left turn when we got a general enough language, and then again when we got writing, and possibly again when the skill of reading and writing spread to a majority of the population.
Before language, human intelligence was basically limited to what a single brain could do. When we got language, we gained the ability to add compute (more humans) to the same problem-solving task. Humanity got parallel computing. This extra capability could be used to invent things that increase the population, i.e. recursive self-improvement.
Later, writing gave us external memory. Before, our computations were limited by human memory, but now we could start to fill up libraries, unlocking a new level of recursive self-improvement.
Every increase in literacy and communication technology (e.g. the internet) is humanity upgrading its capability.
(Just typing as I think...) What if I push this line of thinking to the extreme? If I just pick agents randomly from the space of all agents, then this should be maximally random, and that should be even better. Now the part where we can mine information about alignment from the fact that humans are at least somewhat aligned is gone. So this seems wrong. What is wrong here? Probably the fact that if you pick agents randomly from the space of all agents, you don't get greater variation in alignment compared to picking random humans, because probably all the random agents you pick are just non-aligned. So what is doing most of the work here is that humans are more aligned than random. Which I expect you to agree on.

What you are also saying (I think) is that the tail end of the alignment distribution in humans is more important in some way than the mean or median level of alignment in humans. Because if we have the human distribution, we are just a few bits from locating the tail of the distribution. E.g. we are ~10 bits away from locating the top 0.1 percentile. And because the tail is what matters, randomness is in our favor.

Does this capture what you are trying to say?
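The 10-bit figure checks out: singling out the top fraction p of a distribution takes about log2(1/p) bits of selection. A quick sketch (toy calculation of mine, not from the original comment):

```python
import math

def bits_to_locate(p):
    """Bits of selection needed to single out the top fraction p
    of a distribution: log2(1/p)."""
    return math.log2(1 / p)

# Top 0.1 percentile means p = 0.001:
print(bits_to_locate(0.001))  # ~9.97, i.e. roughly 10 bits
```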
I mean that the information of what I value exists in my brain. Some of this information is pointers to things in the real world, so in a sense the information partly exists in the relation/correlation between me and the world. I definitely don't mean that I can only care about my internal brain state. To me that is just obviously wrong. Although I have met people who disagree, so I see where the misunderstanding came from.
1) parenting is known to have little effect on children’s character
This is not counter-evidence to my claim. The value framework a child learns about from their parents is just one of many value frameworks they hear about from many, many people. My claim is that the power lies in noticing the hypothesis at all. Which ideas you get told more times (e.g. by your parents) doesn't matter.
As far as I know, what culture you are in very much influences your values, which my claim would predict.
2) while children learn to follow rules teens are good at figuring out what is in their interest.
I’m not making any claims about rule following.
Blogposts are the result of noticing a difference in beliefs: either between you and others, or between you and you, across time.

I have lots of ideas that I don't communicate. Sometimes I read a blogpost and think "yeah, I knew that, why didn't I write this?" And the answer is that I did not have an imagined audience.

My blogposts almost always spawn after I have explained a thing ~3 times in meatspace. Generalizing from these conversations, I form an imagined audience which is some combination of the ~3 people I talked to. And then I can write. (In a conversation I don't need to imagine an audience; I can just probe the person in front of me and try different explanations until it works. When writing a blogpost, I don't have this option. I have to imagine the audience.)

Another way to form an imagined audience is to write for your past self. I've noticed that a lot of things I read are like this. When you have just learned or realized something, and the past you who did not know the thing is still fresh in your memory, then it is also easier to write the thing. This short form is of this type.

I wonder if I'm unusually bad at remembering the thoughts and beliefs of past me? My experience is that I pretty quickly forget what it was like not to know a thing. But I see others writing things aimed at their past selves from years ago.

I think I'm writing this short form as a message to my future self, for when I have forgotten this insight. I want my future self to remember this idea of how blogposts spawn. I think it will help her when writing posts, but also help her not to be annoyed when someone else writes a popular thing that I already knew, and "why did I not write this?" There is an answer to that question, and the answer is "because I did not know how to write it". A blogpost is a bridge between a land of not knowing and a land of knowing. Knowing the destination of the bridge is not enough to build the bridge. You also have to know the starting point.
I almost totally agree with this post. This comment is just nitpicking and speculation.
Evolution has another advantage that is related to "getting a lot of tries" but also importantly different.

It's not just that evolution got to tinker a lot before landing on a fail-proof solution. Evolution doesn't even need a fail-proof solution. Evolution is "trying to find" a genome which, in interaction with reality, forms a brain that causes that human to have lots of kids. Evolution found a solution that mostly works, but sometimes doesn't. Some humans decided that celibacy was the cool thing to do, or got too obsessed with something else to take the time to have a family. Note that this is different from how the recent distributional shift (mainly access to birth control, but also something about living in a rich country) has caused previously child-rich populations to have, on average, less than replacement birth rate. Evolution is fine with getting the alignment right in most of the minds, or even just a minority, if they are good enough at making babies. We might want better guarantees than that.

Going back to alignment with other humans. Evolution did not directly optimise for human-to-human alignment, but still produced humans that mostly care about other humans. Studying how this works seems like a great idea! But also, evolution did not exactly nail human-to-human alignment. Most, but definitely not all, humans care about other humans. Ideally we want to build something much, much more robust.

Crazy (probably bad) idea: Suppose we can build an AI design + training regime that mostly, but not certainly, turns out human-aligned AIs, and where the uncertainty is mostly random noise that is uncorrelated between AIs. Then maybe we should build lots of AIs with similar power and hope that, because the majority are aligned, this will turn out fine for us. Like how you don't need every single person in a country to care about animals in order for that country to implement animal protection laws.
This is probably too obvious to write, but I’m going to say it anyway. It’s my short form, and approximately no-one reads short forms. Or so I’m told.
Human value formation is in large part steered by other humans suggesting value systems to you. You get some hard-to-interpret reward signal from your brainstem, or something. There are lots of "hypotheses" for the "correct reward function" you should learn.
(Quotation marks because there is no ground truth for what values you should have. But this is mathematically equivalent to learning the true distribution generating the data from a finite number of data points. Also, there is maybe some ground truth for what the brainstem rewards, or maybe not. According to Steve, there is a loop where, when the brainstem doesn't know whether things are good or not, it just mirrors the cortex's own opinion back to the cortex.)
To locate the hypothesis, you listen to other humans. I make this claim not just for moral values, but also for personal preferences. Maybe someone suggests to you "candy is tasty", and since this seems to fit with your observations, now you also like candy. This is a bad example, since for taste specifically the brainstem has pretty clear opinions. Except there is acquired taste... so maybe not a terrible example.
Another example: You join a hobby. You notice you like being at the hobby place doing the hobby thing. Your hobby friend says (i.e. offers the hypothesis) "this hobby is great". This seems to fit your data, so now you believe you like the hobby. And because you believe you like the hobby, you end up actually liking the hobby, because of a self-reinforcing loop. Although this doesn't always work. Maybe after some time your friends quit the hobby and this makes it less fun, and you realise (change your hypothesis) that you mainly liked the hobby for the people.
Maybe there is a ground truth about what we want for ourselves? I.e. we can end up with wrong beliefs about what we want due to peer pressure, commercials, etc. But with enough observation we will notice what it is we actually want.
Clearly humans are not 100% malleable, but also, it seems like even our personal preferences are path dependent (i.e. pick up lasting influences from our environment). So maybe some annoying mix...
What is alignment? (operationalisation)
Toy model: Each agent has a utility function it wants to maximise. The input to the utility function is a list of values describing the state of the world. Different agents can have different input vectors. Assume that every utility function monotonically increases, decreases or stays constant for changes in each input variable (I did say it was a toy model!). An agent is said to value something if its utility function increases with increasing quantity of that thing. Note that if an agent's utility function decreases with increasing quantity of a thing, then the agent values the negative of that thing.
In this toy model agent A is aligned with agent B if and only if A values everything B values.
Q: How well does this operationalisation match my intuitive understanding of alignment? A: Well, but not perfectly.
This definition of alignment is transitive, but not symmetric. This matches the properties I think a definition of alignment should have.
How about if A values a lot of things that B doesn't care about, and only cares very little about the things B cares about? That would count as aligned in this operationalisation, but it does not necessarily match my intuitive understanding of alignment.
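The toy model can be sketched in code. Here each agent is reduced to the signs of its utility function's dependence on each world-state variable (+1 values it, -1 disvalues it, 0 indifferent); the names and example vectors are illustrative choices of mine:

```python
def is_aligned(a_signs, b_signs):
    """A is aligned with B iff A values everything B values:
    A must match B's sign on every variable B is not indifferent to."""
    return all(a == b for a, b in zip(a_signs, b_signs) if b != 0)

human = [+1, -1, 0]   # values var0, disvalues var1, indifferent to var2
ai    = [+1, -1, +1]  # agrees on var0 and var1, also values var2

print(is_aligned(ai, human))    # True: AI values everything the human values
print(is_aligned(human, ai))    # False: human is indifferent to var2
```

The example also shows the asymmetry: the AI is aligned with the human, but not vice versa, since the human ignores a variable the AI cares about.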
What is alignment? (operationalisation second try)
Agent A is aligned with agent B, if and only if, when we give more power (influence, compute, improved intelligence, etc.) to A, then things get better according to B’s values, and this relation holds for arbitrary increases of power.
This operationalisation points to exactly what we want, but is also not very helpful.
Here’s what you wrote:
This interpretation makes sense even in the absence of “agents” with “beliefs”, or “independent experiments” repeated infinitely many times. It directly talks about maps matching territories, and the role probability plays, without invoking any of the machinery of frequentist or subjectivist interpretations.
Do you still agree with yourself?
In that case I’m confused about this statement
This interpretation makes sense even in the absence of “agents” with “beliefs”
What are priors in the absence of something like agents with beliefs?
Support for AI safety research is up: 69% of respondents believe society should prioritize AI safety research “more” or “much more” than it is currently prioritized, up from 49% in 2016.
What is this number if you only include people who participated in both surveys?
We’ve shown that the probability P[q|X] summarizes all the information in X relevant to q, and throws out as much irrelevant information as possible.
This seems correct. Let's say two different points in the data configuration space, X_1 and X_2, provide equal evidence for q. Then P[q|X_1] = P[q|X_2]. The two different data possibilities are mapped to the same point in this compressed map. So far so good.

(I assume that I should interpret the object P[q|X] as a function over X, not as a point probability for a specific X.)
First, hopefully this provides some intuition for interpreting a probability P[q|X] as a representation of the information in X relevant to q. In short: probabilities directly represent information. This interpretation makes sense even in the absence of “agents” with “beliefs”, or “independent experiments” repeated infinitely many times. It directly talks about maps matching territories, and the role probability plays, without invoking any of the machinery of frequentist or subjectivist interpretations. That means we can potentially apply it in a broader variety of situations—we can talk about simple mechanical processes which produce “maps” of the world, and the probabilistic calculations embedded in those processes.
I don’t think this works.
The map P[q|X] has gotten rid of all the irrelevant information in X, but it still contains some information that never came from X. I.e. P[q|X] is not generated only from the information in X relevant to q.
E.g. from P[q|X], together with the data distribution P[X], we can get

P[q] = sum_X P[q|X] P[X]

i.e. the prior probability of q. And if the prior of q were different, P[q|X] would be different too.
The way you can’t (shouldn’t) get rid of priors here, feels similar to how you can’t (shouldn’t) get rid of coordinates in physics. In this analogy, the choice of prior is analogues to the choice of the origin. Your choice of origin is completely subjective (even more so than the prior). Technically you can represent position in a coordinate free way (only relative positions), but no one does it, because doing so destroys other things.
(I’m being maximally critical, because you asked for it)
I don’t think this is true:
But there’s a biological analogy: classical conditioning. E.g. I can choose to do X right before Y, and then I’ll learn an association between X and Y which I wouldn’t have learned if I’d done X a long time before doing Y.
I could not find any study that tests this directly, but I don't expect conditioning to work if you yourself cause the unconditioned stimulus (US), Y in your example. My understanding of conditioning is that if there is no surprise, there is no learning. For example: If you first condition an animal to expect A to be followed by C, and then expose it to A+B followed by C, it will not learn to associate B with C. This is a well-replicated result, and the textbook explanation (which I believe) is that no learning occurs because C is already explained by A (i.e. there is no surprise).
Does this matter for understanding gradient hacking in future AGIs? Maybe?
Since humans are the closest thing we have to an AGI, it does make sense to try to understand things like gradient hacking in ourselves. Or if we don’t have this problem, it would be very interesting to understand why not.
Are there other examples of biological gradient hacking? (1) I heard that whatever you do while taking nicotine will be reinforced (don't remember the source, but it seems plausible to me). But this would be more analogous to directly overwriting the backprop signal, rather than manipulating the gradient via controlling the training data. If we end up with an AI that can just straightforwardly edit its outer learning regime in this way, then I think we are outside the scope of what you are talking about. However, if this nicotine hack works, isn't it interesting that it is not used more? Maybe the effect is not strong enough to be useful?
(2) You give another example:
Humans often reason about our goals in order to produce more coherent versions of them. Since we know while doing the reasoning that the concepts we produce will end up ingrained as our goals, this could be seen as a form of gradient hacking.
I can’t decide if I think this should count as gradient hacking.
(3) I know that I, to some extent, absorb the values of people around me, and I have used this for self-manipulation. This is the best analog to gradient hacking I can think of for humans. Unfortunately, I don't expect this to tell us much about AIs, since this method depends on a specific human drive towards conformism.
I’m curious if an opposite strategy works for contrarians? If you want to self manipulate you should hang out with people who believe/value the opposite of what you want yourself to believe/value?
I’m very certain that you hypothesis are correct. Most people play to have fun, not to win. Winning is instrumental to fun, but for most people it is not worth the cost of doing some math, which is anti-fun. I like math in general, but I still would not make this explicit calculation, because it is the wrong type of math for me to enjoy. (Not saying it is wrong for you to enjoy it, just that it’s unusual.)I think that making the game design such that it is hard or impossible to do the explicit math is a feature. Most people don’t want to do the math. The math is not supposed to be part of the game. Most people don’t want the math nerds to have that advantage, because then they’ll have to do the math too, or loose.
That only works for tasks where you get to do a similar enough thing enough times. This seems true for driving, but less so for most types of research.
My capacity to track information in my head varies from day to day, depending on mood, sleep, etc. I can notice a clear difference in what I can and can’t do depending on this. When I have more limited mental capacity, I can still absorb facts, but I struggle to follow complex reasoning or draw independent conclusions (e.g. if this fact is true, what does that predict about the world?).
There is an interview technique called Experiential Array which is designed to pull out this sort of information (and some other stuff too). Matt Goldenberg conducted this type of interview on me on the topic of designing and running events. This experience gave me the ability to communicate the invisible parts of event design.
Read here for more details