Question from a total layman: have you tried changing the activations of some neurons to keep them out of a certain region of activation space, to see what happens? I don’t fully understand the underlying math, but it seems to me that this should “censor” some possible outputs of the NN.
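To make the question concrete, here is a minimal sketch of what “keeping activations out of a region” could look like. It assumes a toy two-layer ReLU network with random weights; the `forward` function and the `clamp` parameter are made up for illustration and do not come from any particular library or experiment.

```python
import numpy as np

def forward(x, w1, w2, clamp=None):
    """Toy two-layer ReLU network; `clamp` optionally caps hidden activations."""
    hidden = np.maximum(0.0, x @ w1)
    if clamp is not None:
        # Forbid the region of activation space where any hidden unit exceeds `clamp`.
        hidden = np.minimum(hidden, clamp)
    return hidden @ w2

rng = np.random.default_rng(0)
w1 = rng.normal(size=(4, 8))   # hypothetical weights, not a trained model
w2 = rng.normal(size=(8, 3))
x = rng.normal(size=(5, 4))

free = forward(x, w1, w2)
capped = forward(x, w1, w2, clamp=0.5)
# Largest change in any output caused by the intervention.
print(np.abs(free - capped).max())
```

Whether an intervention like this “censors” anything meaningful in a real model is exactly the open question; the sketch only shows the mechanics.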
quetzal_rainbow
Yes, CNNs were brain-inspired, but attention in general and transformers in particular don’t seem to be like this.
Does it make sense to talk about “(non)cooperating simulators”? The expected failure modes for simulators are more like exfo- and infohazards, such as the output to the query “print code for CEV-Sovereign” or “predict the next 10 years of my life”.
The problem is that “Goodhart’s curse” is an informal statement. It doesn’t literally say “for all |u - v| > 0, optimization for v leads to oblivion of u”. When we talk about “small differences”, we mean “differences in the space of all possible minds”, where the difference between two humans is practically nonexistent. If you, say, find a subset of utility functions V such that |v - u| < 10^(-32) utilons for all v in V, where u is humanity’s utility function, you should implement it in a Sovereign right now, because, yes, we lose some utility, but we have a time limit for solving alignment. The problem of alignment is that we can’t specify a V with such characteristics. We can’t even specify a V such that corr(u, v) > 0.5.
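The “small difference is fine” case can be checked on a toy example. If a proxy v stays within eps of u on every option, then picking the option that maximizes v loses at most 2*eps of u. The sketch below assumes a finite set of options with made-up utilities and bounded uniform noise; `regret` is a hypothetical helper name.

```python
import numpy as np

def regret(u, v):
    """True utility lost by picking the option that maximizes the proxy v."""
    return float(u.max() - u[np.argmax(v)])

rng = np.random.default_rng(0)
u = rng.normal(size=10_000)  # hypothetical "true" utilities of 10,000 options

for eps in (0.0, 0.01, 1.0):
    # Proxy that never differs from u by more than eps.
    v = u + rng.uniform(-eps, eps, size=u.size)
    print(eps, regret(u, v))  # always between 0 and 2 * eps
```

The hard part, as said above, is that nobody knows how to specify a v with a guarantee like this in the first place.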
Hello everyone. I have been in read-only mode on LW since, I don’t know, 2013. I learned about LW from HPMOR and recently decided to create an account. Primarily, I’m interested in discussing AI alignment research and in breaking the odd shell of introversion that prevents me from talking with interesting people online. I graduated in bioinformatics in 2021, but since then my focus of interest has shifted towards alignment research, so I want to try my hand in that field.
The root of the confusion, it seems to me, is the question “where do priors come from?”
Your chain of thought about unfalsifiable priors looks to me like this:
Probabilities are objective characteristics of physical world, frequencies that can be attributed to some parts of reality (events, objects, etc)
Priors are claims about probabilities that don’t depend on empirical evidence
Therefore, priors are claims about objective characteristics of physical world that don’t depend on empirical evidence
Claims about objective characteristics of physical world that don’t depend on empirical evidence (like “there is a dragon in my room, but you can’t see it, hear it, touch it”) are unfalsifiable
Therefore, priors are unfalsifiable and can’t be used in science.
The problem with this chain of thought is that prior probabilities of hypotheses are not about characteristics of the physical world; they are about mathematical properties of the formulations of hypotheses, such as Kolmogorov complexity, which are the same in all logically consistent worlds. Therefore, true Bayesian agents can’t disagree about priors (assuming logical omniscience).
There are practical problems with this approach:
We are not quite sure what form of priors is the true one: the Solomonoff prior looks like a candidate, but I personally don’t know, and there are debates.
We don’t know of boundedly wrong approximations of the true prior, which we need because the Solomonoff prior and Kolmogorov complexity aren’t computable.
We don’t have a corresponding scientific tradition of using this approach, which would look like “to compare two hypotheses that explain the data equally well, write programs modeling them and pick the shortest”.
In practical cases we almost never need the “true prior”, because we actually use “previous posterior knowledge”, and Bayes’ rule doesn’t distinguish between them.
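The “write programs and pick the shortest” procedure can be sketched as follows. This is only an illustration: it uses the character length of a source string as a crude stand-in for Kolmogorov complexity (which, as noted, is not computable), and the two “programs” and the names `length_prior` and `posterior` are made up for the example.

```python
from fractions import Fraction

def length_prior(programs):
    """Prior proportional to 2^(-length of source); a crude complexity proxy."""
    weights = {name: Fraction(1, 2 ** len(src)) for name, src in programs.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def posterior(prior, likelihood):
    """Bayes' rule over a finite hypothesis set."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

# Two hypothetical programs that both reproduce the data 0, 2, 4, 6, 8.
programs = {"linear": "2*n", "table": "[0,2,4,6,8][n]"}
prior = length_prior(programs)

# Both explain the data equally well, so only the prior separates them.
likelihood = {"linear": Fraction(1), "table": Fraction(1)}
post = posterior(prior, likelihood)
best = max(post, key=post.get)
print(best, post[best])  # linear 2048/2049
```

Note that `posterior` does not care where its first argument came from: yesterday’s posterior can be passed in as today’s prior, which is the sense in which Bayes’ rule doesn’t distinguish the two.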
Some approaches to alignment rely on the identification of agents. Agents can be understood as algorithms, computations, etc. Can an ANN efficiently identify a process as computationally agentic and describe its algorithm? A toy example that comes to mind is a neural network that takes a number series as input and outputs the formula of a function. It would be interesting to see whether we can create an ANN that can assign computational descriptions to arbitrary processes.
Several quick thoughts about reinforcement learning:
Did anybody try to invent a “decaying”/“bored” reward that decreases if the agent performs the same action over and over? It looks like a real addiction mechanism in mammals and could be the clever trick that solves the reward hacking problem.
Additional thought: how about a multiplicative reward? Suppose we have several reward functions that are easy to evaluate from sensory data and that somehow correlate with the real utility function: does multiplying them make reward hacking more difficult?
Can time-limited satisfaction be a sufficient condition for completing a task?
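For concreteness, the first two ideas could be sketched like this. Everything here is an assumption made for illustration: the geometric decay schedule, the class name `BoredReward`, and the plain product in `multiplicative_reward` are one possible reading of the ideas, not an established method.

```python
from collections import Counter
from math import prod

class BoredReward:
    """Shrinks the reward for an action each time that action is repeated."""

    def __init__(self, decay=0.5):
        self.decay = decay          # hypothetical geometric decay factor
        self.counts = Counter()     # how often each action was taken so far

    def __call__(self, action, base_reward):
        shaped = base_reward * self.decay ** self.counts[action]
        self.counts[action] += 1
        return shaped

def multiplicative_reward(proxy_rewards):
    """Combine several proxy rewards by multiplying them."""
    return prod(proxy_rewards)

bored = BoredReward(decay=0.5)
print([bored("press_lever", 1.0) for _ in range(3)])  # [1.0, 0.5, 0.25]
print(multiplicative_reward([1.0, 0.0, 1.0]))         # 0.0
```

With the product, driving a single proxy to its maximum is not enough if another proxy is at zero; whether that actually makes reward hacking harder in practice is the open question.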
unpacking inner Eliezer model
If we live in a world where a superintelligent AGI can’t have an advantage in long-term planning over humans assisted by non-superintelligent narrow AIs (I frankly don’t believe that we live in such a world), then the superintelligent AGI won’t make complex long-term plans where it has no advantage. It will make simple short-term plans where it does have an advantage, like “use superior engineering skills to hack into computer networks, infect as many computers as possible with its source code adapted for hidden distributed computation (here is the point of no return), design nanotech, train itself to an above-average level in social engineering, find people gullible and skilled enough to build nanotech, create enough smart matter to sustain the AGI without human infrastructure, kill everybody, pursue its unspeakable goals in the dead world”.
Even if we imagine an “AI CEO”, the best (human-aligned!) strategy I can imagine for such an AI is “invent immortality, buy the whole world with it”, not “scrutinize KPIs”.
Next, I think your ideas about short/long-term goals are underspecified because you don’t take into account the distinction between instrumental and terminal goals. Yes, human software engineers pursue the short-term instrumental goal of “creating a product”, but they do it in the process of pursuing long-term terminal goals like “be happy”, “prove themselves worthy”, “serve humanity”, “have nice things”, etc. It’s quite hard to find a system with short-term terminal goals, as opposed to a short-term planning horizon due to computational limits. To put it another way, taskiness is an unsolved problem in AI alignment. We don’t know how to tell a superintelligent AGI “do this, don’t do anything else, especially please don’t disassemble everyone in the process of doing this, stop after you’ve done this”.
If you believe that “extract short-term modules from a powerful long-term agent” is the optimal strategy in some sense (I don’t even think that we can properly identify such modules without huge alignment work), then the powerful long-term agent knows this too, knows that it is on a time limit before you dissect it, and will plan accordingly.
Claims 3 and 4 imply the claim “nobody will invent some clever trick to avoid these problems”, which seems implausible to me.
Problems with claims 5 and 6 are covered in Nate Soares’s post about the sharp left turn.
I want to say “yes, but this is different”, but not in the sense of “I acknowledge the existence of your evidence, but ignore it”. My intuition tells me that we don’t “induce” taskiness in modern systems; it just happens because we build them not general enough. It probably won’t hold when we start building models of capable agents in natural environments.
I think the rule is “you maximize your bank account, not the addition to it”. I.e., the value you assign to a deal depends on how much you already have.
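A small worked example of the rule, assuming (purely for illustration) that utility is the logarithm of total wealth and that the deal is a fair coin flip between winning 60 and losing 50. The function name `deal_value` and all the numbers are made up.

```python
from math import log

def deal_value(wealth, win=60.0, loss=50.0, p_win=0.5):
    """Expected change in log-utility of total wealth from taking the deal."""
    expected = p_win * log(wealth + win) + (1 - p_win) * log(wealth - loss)
    return expected - log(wealth)

# The same deal, evaluated against two different bank accounts.
print(deal_value(100.0) < 0)    # True: with 100 in the bank, reject
print(deal_value(1000.0) > 0)   # True: with 1000 in the bank, accept
```

The deal has a positive expected addition (+5) in both cases, but an agent that evaluates the resulting bank account rather than the addition accepts it only when it already has enough.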
I have a kinda symmetric feeling about “practical” research: “Okay, you have found that a one-layer transformer without MLPs approximates skip-trigram statistics; how does that generalize to the question ‘does GPT-6 want to kill us all?’” (I understand this feeling is not rational; it just shows my general inclination towards “theoretical” work.)
Generally, the statement “solutions of complex problems are easy to verify” is false. Your problem can be EXPTIME-complete and not in NP: EXPTIME-complete problems are provably not in P (by the time hierarchy theorem), so if P = NP they are certainly not in NP either.
And even if a problem is in NP, we often don’t know the verification algorithm.
It feels to me like this is a search for an answer in the wrong place. If your problem is overthinking, you shouldn’t be trying to find an ethical theory that justifies less thinking; you cure overthinking by developing the skills that go under the general label of “cognitive awareness”. At some level, you can just stop thinking harmful thoughts.
I’m frankly not sure how many respectable-looking members of our societies would like to be mind-controlling dictators if they had the chance.
Can you explain more formally what is the difference between and ? I’ve looked in Wikipedia and in Cartoon Guide on Löb’s theorem, but still can’t get it.
Thank you, it’s much more clear now.
I think the additional information that an IRL agent needs to recover the true reward function is not some prior normative assumptions but non-behavioral data, like “this agent was created by natural selection in a particular physical environment, so the expected reward scheme should correlate with IGF, and the imperfect decision algorithm should be efficient in this environment”.
It’s a totally valid statement: “I lose some value here in one way, but gain some value in another, and the resulting sum is positive.”