# DaemonicSigil

Karma: 512
• 3 Feb 2023 7:58 UTC
2 points

I do think MIRI “at least temporarily gave up” on personally executing on technical research agendas, or something like that, but, that’s not the only type of output.

So, I’m sure various people have probably thought about this a lot, but just to ask the obvious dumb question: Are we sure that this is even a good idea?

Let’s say the hope is that at some time in the future, we’ll stumble across an Amazing Insight that unblocks progress on AI alignment. At that point, it’s probably good to be able to execute quickly on turning that insight into actual mathematics (and then later actual corrigible AI designs, and then later actual code). It’s very easy for knowledge of “how to do things” to be lost, particularly technical knowledge. [1] Humanity loses this knowledge on a generational timescale, as people die, but it’s possible for institutions to lose knowledge much more quickly due to turnover. All that just to say: Maybe MIRI should keep doing some amount of technical research, just to “stay in practice”.

My general impression here is that there’s plenty of unfinished work in agent foundations and decision theory, things like: How do we actually write a bounded program that implements something like UDT? How do we actually do calculations with logical decision theories such that we can get answers out for basic game-theory scenarios (even something as simple as the ultimatum game is unsolved IIRC)? What are some common-sense constraints the program-value-functions should obey (eg. how should we value a program that simulates multiple other programs?)? These all seem like they are likely to be relevant to alignment, and also intrinsically worth doing.

[1] This talk is relevant: https://www.youtube.com/watch?v=ZSRHeXYDLko

• Strong upvoted! The issue where weights that give the gradient hacker any influence at all will be decreased if it causes bad outputs was one of the objections I also had to that gradient hacking post.

I wrote this post a while back where I managed to create a toy model for things that were not quite gradient hackers, but were maybe a more primitive version: https://www.lesswrong.com/posts/X7S3u5E4KktLp7gHz/tessellating-hills-a-toy-model-for-demons-in-imperfect

In terms of ways to create gradient hackers in an actual neural network, here are some suggestions:

1. In the post I linked above, I chose a loss function that could be decreased indefinitely in a linear way. This gives the gradient hackers plenty of “fuel” to eat, so I think it may be helpful to also have a loss function like this when you’re starting out, in order to make the problem easier. A more sophisticated gradient hacker probably wouldn’t need this; it could instead take advantage of the way loss varies across different mini-batches to get its “fuel”.

2. You mention that gradient descent always takes a step in the direction of maximum descent. This is true as far as it goes, but the direction of maximum descent depends on the metric you have on the space of parameters. A different metric gives a different direction of maximum descent, and the usual metric on the weights isn’t particularly natural; a different metric might work just as well. Which direction is actually the best for reducing loss isn’t really something that gradient descent knows. Maybe the actual best direction differs from the direction of the gradient by a 60 degree angle or something. In particular, if you can set up a ridge in the loss landscape, gradient descent will follow along the ridge, and you can direct it like that. The tricky part is making sure that the ridge itself doesn’t get optimized away.

3. The local nature of gradient descent is something we can take advantage of as gradient hackers. Consider the function . Suppose the gradient hacker controls the input to a neuron with this as its activation function, and that neuron then influences the final decision. Suppose that for this particular training example, the gradients for the output of this neuron are positive; the more positive the output of the neuron, the lower the loss. We want positive gradients flowing back to our hacker, since if the gradients were negative, it would lose control of the neuron. Then the hacker can output 2, since the activation function has a positive derivative there. But, the hacker can also output −2, if it wants to make accuracy worse in this case, since the activation function also has a positive derivative at −2. So even though the hacker is making the prediction worse, it’s increasing its control over that particular neuron. (Of course, that neuron itself will lose influence if we use it like this too many times. If the loss function is non-convex, then we can pass the buck all the way up to the loss function and exploit its non-convexity. Otherwise, maybe we could cycle between neurons so we don’t use any of them up, or try some other clever thing?)

4. This is a difficulty that you didn’t mention, but in creating a gradient hacker, there may be an aspect of quining. The gradient hacker has to reinforce all the weights that make it up. This is presumably a lot of information, more than we could usually just store in the weights themselves. If we could make the gradient hacker into a quine, then that would do it, but this sounds really difficult to implement as the weights of a neural network in such a way that the output of the quine is encoded in the gradients of the corresponding weights.
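Point 2 can be checked numerically. The sketch below is a minimal pure-Python illustration, assuming a diagonal metric M on a two-parameter space (the gradient and the metric entries are made-up numbers). Under M, the steepest-descent direction is proportional to -M⁻¹g, so changing M rotates the step even though the loss and its gradient are unchanged.

```python
import math

def steepest_descent_direction(grad, metric_diag):
    """Direction proportional to -M^{-1} grad for a diagonal metric M."""
    return [-g / m for g, m in zip(grad, metric_diag)]

def angle_degrees(u, v):
    """Angle between two vectors, measured with the ordinary dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(dot / (norm_u * norm_v)))

grad = [1.0, 1.0]                                          # hypothetical gradient
euclidean = steepest_descent_direction(grad, [1.0, 1.0])   # usual metric
rescaled = steepest_descent_direction(grad, [1.0, 3.0])    # second axis weighted 3x

# Both directions reduce the loss to first order (negative dot product with grad),
# yet they point in visibly different directions.
angle = angle_degrees(euclidean, rescaled)
print(f"angle between the two steepest-descent directions: {angle:.2f} degrees")
```

Both directions are legitimate "steepest descent" steps; which one gradient descent takes is decided by the parameterization, not by the loss.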

• 23 Jan 2023 23:15 UTC
2 points
in reply to: Logan Zoellner’s comment

Yep, that’s the section I was looking at to get that information. Maybe I phrased it a bit unclearly. The thing that would contradict existing observations is if the interaction were not stochastic. Since it is stochastic in Oppenheim’s theory, the theory allows the interference patterns that we observe, so there’s no contradiction.

• Outside view: This looks fairly legit on first glance, and Jonathan Oppenheim is a reputable physicist. The theory is experimentally testable, with numerous tests mentioned in the paper, and the tests don’t require reaching unrealistically high energies in a particle accelerator, which is good.

Inside view: Haven’t fully read the paper yet, so take with a grain of salt. Quantum mechanics already has a way of representing states with classical randomness, the density matrix, so having a partially classical and partially quantum theory certainly seems like it should be mathematically possible in the framework of QM. The paper addresses the obvious question of what happens to the gravitational field if we put a particle in a superposition of locations, and it seems the answer is that there is stochastic coupling between the quantum degrees of freedom and the classical gravitational field, and so particles don’t end up losing their coherence in double slit experiments, which would blatantly contradict existing observations.

Overall, I think there’s a high chance that this is a mathematically consistent theory that basically does what it says it does. Will it end up corresponding to the actual universe? That’s a question for experiment.

• 23 Jan 2023 2:48 UTC
2 points

0.5 probability you’re in a simulation is the lower bound, which is only fulfilled if you pay the blackmailer. If you don’t pay the blackmailer, then the chance you’re in a simulation is nearly 1.

Also, checking if you’re in a simulation is definitely a good idea. I try to follow a decision theory something like UDT, and UDT would certainly recommend checking whether or not you’re in a simulation. But the Blackmailer isn’t obligated to create a simulation with imperfections that can be used to identify the simulation and hurt his prediction accuracy. So I don’t think you can really say for sure “I would notice”, just that you would notice if it were possible to notice. In the least convenient possible world for this thought experiment, the blackmailer’s simulation is perfect.

Last thing: What’s the deal with these hints that people actually died in the real world from using FDT? Is this post missing a section, or is it something I’m supposed to know about already?

There is a hole at the bottom of functional decision theory, a dangerous edge case which can and has led multiple highly intelligent and agentic rationalists to self-destructively spiral and kill themselves or get themselves killed.

Please don’t actually implement this unpatched? It’s already killed enough brilliant minds.

• I think the issue boils down to one of types and not being able to have a “Statement” type in the theory. This is why we have QUOT[X] to convert a statement X into a string. QUOT is not a function, really, it’s a macro that converts a statement into a string representation of that statement. true(QUOT[X]) ⇔ X isn’t an axiom, it’s an infinite sequence of axioms (a “schema”), one for each possible statement X. It’s considered okay to have an infinite sequence of axioms, so long as you know how to compute that sequence. We can enumerate through all possible statements X, and we know how to convert any of those statements into a string using QUOT, so that’s all okay. But we can’t boil down that infinite axiom schema into a single axiom ∀ S:Statement, true(quot(S)) ⇒ S because we don’t have a Statement type inside of the system.

Why can’t we have a Statement type? Well, we could if they were just constants that took on values of “true” or “false”. But, I think what you want to do here is treat statements as both sequences of symbols and as things that can directly be true or false. Then the reasoning system would have ways of combining the sequences of symbols and axioms that map to rules of inference on those symbols.

Imagine what would happen if we did have all those things. I’ll define a notation for a statement literal as state(s), where s is the string of symbols that make up the statement. So state() is kind of an inverse of QUOT[], except that it’s a proper function, not a macro. Since not all strings might form valid statements, we’ll take state(s) to return some default statement like false when s is not valid.

Here is the paradox. We could construct the statement: ∀ S:Statement, ∀ fmtstr:String,(fmtstr = "..." ⇒ (S = state(replace(fmtstr, "%s", repr(fmtstr))) ⇒ ¬S)) where the "..." is "∀ S:Statement, ∀ fmtstr:String,(fmtstr = %s ⇒ (S = state(replace(fmtstr, \"\%s\", repr(fmtstr))) ⇒ ¬S))" So written out in full, the statement would be:

∀ S:Statement, ∀ fmtstr:String,(fmtstr = "∀ S:Statement, ∀ fmtstr:String,(fmtstr = %s ⇒ (S = state(replace(fmtstr, \"\%s\", repr(fmtstr))) ⇒ ¬S))" ⇒ (S = state(replace(fmtstr, "%s", repr(fmtstr))) ⇒ ¬S))

Now consider the statement itself as S in the quantifier, and suppose that fmtstr is indeed equal to "...". Then S = state(replace(fmtstr, "%s", repr(fmtstr))) is true. Then we have ¬S. On the other hand, if S or fmtstr take other values, then the conditional implications become vacuously true. So S reduces down entirely to ¬S. This is a contradiction. Not the friendly quine-based paradox of Gödel’s incompleteness theorem, which merely makes an assertion about provability, but an actual logic-exploding contradiction.

Therefore we can’t allow a Statement type in our logic.
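The self-reference trick in the statement above is the usual quine construction, and it can be tried out on plain strings. Here is a minimal Python sketch; the helper name `diagonalize`, the `{}` placeholder, and the English wording of the template are illustrative assumptions rather than the notation used above. It only builds the self-describing string and does not model the logic itself.

```python
def diagonalize(fmt: str) -> str:
    """Fill the single `{}` hole in `fmt` with a quoted copy of `fmt` itself."""
    return fmt.replace("{}", repr(fmt), 1)

# Template with one hole, playing the role of fmtstr.
fmt = "diagonalize({}) is false"

# The finished sentence, playing the role of S.
sentence = diagonalize(fmt)
print(sentence)

# The sentence talks about the value of an expression. Strip the trailing
# " is false" to recover that expression, then evaluate it.
named_expression = sentence[: -len(" is false")]
value = eval(named_expression)

# The expression the sentence names evaluates to the sentence itself,
# so the sentence says "I am false".
assert value == sentence
```

The expression named inside the sentence plays the same role as state(replace(fmtstr, "%s", repr(fmtstr))) in the paradox: it is a recipe that reconstructs the very statement containing it.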

• Yeah, it definitely depends how you formalize the logic, which I didn’t do in my comment above. I think there are some hidden issues with your proposed disproof, though. For example, how do we formalize 2? If we’re representing John’s utterances as strings of symbols, then one obvious method would be to write down something like: ∀ s:String, says(John, s) ⇒ true(s). This seems like a good way of doing things, that doesn’t mention the ought predicate. Unfortunately, it does require the true predicate, which is meaningless until we have a way of enforcing that for any statement S, S ⇔ true(QUOT[S]). We can do this with an axiom schema: SCHEMA[S:Statement], S ⇔ true(QUOT[S]). Unfortunately, if we want to be able to do the reasoning chain says(John, QUOT[ought(X)]) therefore true(QUOT[ought(X)]) therefore ought(X), we find out that we used the axiom true(QUOT[ought(X)]) ⇔ ought(X) from the schema. So in order to derive ought(X), we still had to use an axiom with “ought” in it.

I expect it’s possible to write a proof that “you can’t derive an ought from an is”, assuming we’re reasoning in first order logic, with ought being a predicate in the logic. But it might be a little nontrivial from a technical perspective, since while we couldn’t derive ought(X) from oughtless axioms, we could certainly derive things like ought(X) ∨ ¬ought(X) from the law of excluded middle, and then there would be many complications you could build up.
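The excluded-middle remark can be made concrete. The following Lean 4 sketch treats `ought` as an arbitrary predicate (the names `Action` and `ought` are placeholders chosen for illustration) and derives ought(X) ∨ ¬ought(X) without any hypothesis about `ought`:

```lean
-- `Action` and `ought` are placeholder names; nothing is assumed about `ought`.
example {Action : Type} (ought : Action → Prop) (x : Action) :
    ought x ∨ ¬ ought x :=
  Classical.em (ought x)
```

So a careful statement of the is/ought gap would have to set aside tautologies like this one, rather than every formula in which `ought` appears.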

• From a language perspective, I agree that it’s great to not worry about the is/ought distinction when discussing anything other than meta-ethics. It’s kind of like how we talk about evolved adaptations as being “meant” to solve a particular problem, even though there was really no intention involved in the process. It’s just such a convenient way of speaking, so everyone does it.

I guess I’d say that despite this, the is/ought distinction remains useful in some contexts. Like if someone says “we get morality from X, so you have to believe X or you won’t be moral”, it gives you a shortcut to realizing “nah, even if I think X is false, I can continue to not do bad things”.

• What about that thing where you can’t derive an “ought” from an “is”? Just from the standpoint of pure logic, we can’t derive anything about morality from axioms that don’t mention morality. If you want to derive your morality from the existence of God, you still need to add an axiom: “that which God says is moral is moral”. On the other end of things, an atheist could still agree with a theist on all moral statements, despite not believing in God. Suppose that God says “A, B, C are moral, and X, Y, Z are immoral”. Then an atheist working from the axioms “A, B, C are moral, and X, Y, Z are immoral” would believe the same things as a theist about what is moral, despite not believing in God.

Similarly, Darwin’s theory of evolution is just a claim about how the various kinds of living things we see today arose on Earth. Forget about God and religion, it would be really weird if believing in this funny idea about how complexity and seeming goal-directedness can arise from a competition between imperfect copies somehow made you into an evil person.

Indeed, claiming that atheism or evolution is what led to Nazi atrocities almost feels to me like giving too much slack to the Nazis and their collaborators. Millions of people are atheists, or believe in evolution, or both, and they don’t end up committing murder, let alone genocide. Maybe we should just hold people responsible for their actions, and not treat them as automatons being piloted by memes?

As another example, imagine we’re trying to prevent a similar genocide from happening in the future (which we are, in fact). Which strategy would be more effective?

1. Encourage belief in religion and discourage belief in evolution. Pass a law making church attendance mandatory, teach religion in schools. Hide the fossil record, and lock biology papers behind a firewall so that only medical doctors and biologists can see them. Prevent evolution from being taught in science classes, in favour of creationism.

2. Teach the history of the Holocaust in schools, along with other genocides. In those lessons, emphasize how genocide is a terrible, very bad thing to do, and point out how ordinary people often go along with genocide, slavery, and other horrifying things, if they’re not paying a lot of attention and being careful not to do that. From a legal perspective, put protections against authoritarianism in the constitution (eg. no arresting people for speaking out against the government).

Seems to me like option 2 would be much more effective, though from trying to pass your intellectual Turing test, I’d guess you’d maybe endorse doing both? (Though with option 1 softened to promote religion more through gradual cultural change than heavy-handed legal measures.)

• On training AI systems using human feedback: This is way better than nothing, and it’s great that OpenAI is doing it, but it has the following issues:

1. Practical considerations: AI systems currently tend to require lots of examples and it’s expensive to get these if they all have to be provided by a human.

2. Some actions look good to a casual human observer, but are actually bad on closer inspection. The AI would be rewarded for finding and taking such actions.

3. If you’re training a neural network, then there are generically going to be lots of adversarial examples for that network. As the AI gets more and more powerful, we’d expect it to be able to generate more and more situations where its learned value function gives a high reward but a human would give a low reward. So it seems like we end up playing a game of adversarial example whack-a-mole for a long time, where we’re just patching hole after hole in this million-dimensional bucket with thousands of holes. Probably the AI manages to kill us before that process converges.

4. To make the above worse, there’s this idea of a sharp left turn, where a sufficiently intelligent AI can think of very weird plans that go far outside of the distribution of scenarios that it was trained on. We expect generalization to get worse in this regime, and we also expect an increased frequency of adversarial examples. (What would help a lot here is designing the AI to have an interpretable planning system, where we could run these plans forward and negatively reinforce the bad ones (and maybe all the weird ones, because of corrigibility reasons, though we’d have to be careful about how that’s formulated because we don’t want the AI trying to kill us because it thinks we’d produce a weird future).)

5. Once the AI is modelling reality in detail, its reward function is going to focus on how the rewards are actually being piped to the AI, rather than the human evaluator’s reaction, let alone some underlying notion of goodness. If the human evaluators just press a button to reward the AI for doing a good thing, the AI will want to take control of that button and stick a brick on top of it.

On training models to assist in human evaluation and point out flaws in AI outputs: Doing this is probably somewhat better than not doing it, but I’m pretty skeptical that it provides much value:

1. The AI can try and fool the critic just like it would fool humans. It doesn’t even need a realistic world model for this, since using the critic to inform the training labels leaks information about the critic to the AI.

2. It’s therefore very important that the critic model generates all the strong and relevant criticisms of a particular AI output. Otherwise the AI could just route around the critic.

3. On some kinds of task, you’ll have an objective source of truth you can train your model on. The value of an objective source of truth is that we can use it to generate a list of all the criticisms the model should have made. This is important because we can update the weights of the critic model based on any criticisms it failed to make. On other kinds of task, which are the ones we’re primarily interested in, it will be very hard or impossible to get the ground truth list of criticisms. So we won’t be able to update the weights of the model that way when training. So in some sense, we’re trying to generalize this idea of “a strong and relevant criticism” between these different tasks of differing levels of difficulty.

4. This requirement of generating all criticisms seems very similar to the task of getting a generative model to cover all modes. I guess we’ve pretty much licked mode collapse by now, but “don’t collapse everything down to a single mode” and “make sure you’ve got good coverage of every single mode in existence” are different problems, and I think the second one is much harder.

On using AI systems, in particular large language models, to advance alignment research: This is not going to work.

1. LLMs are super impressive at generating text that is locally coherent for a much broader definition of “local” than was previously possible. They are also really impressive as a compressed version of humanity’s knowledge. They’re still known to be bad at math, at sticking to a coherent idea and at long chains of reasoning in general. These things all seem important for advancing AI alignment research. I don’t see how the current models could have much to offer here. If the thing is advancing alignment research by writing out text that contains valuable new alignment insights, then it’s already pretty much a human-level intelligence. We talk about AlphaTensor doing math research, but even AlphaTensor didn’t have to type up the paper at the end!

2. What could happen is that the model writes out a bunch of alignment-themed babble, and that inspires a human researcher into having an idea, but I don’t think that provides much acceleration. People also get inspired while going on a walk or taking a shower.

3. Maybe something that would work a bit better is to try training a reinforcement-learning agent that lives in a world where it has to solve the alignment problem in order to achieve its goals. Eg. in the simulated world, your learner is embodied in a big robot, and there’s a door in the environment it can’t fit through, but it can program a little robot to go through the door and perform some tasks for it. And there’s enough hidden information and complexity behind the door that the little robot needs to have some built-in reasoning capability. There are a lot of challenges here, though. Like how do you come up with a programming environment that’s simple enough that the AI can figure out how to use it, while still being complex enough that the little robot can do some non-trivial reasoning, and that the AI has a chance of discovering a new alignment technique? Could be it’s not possible at all until the AI is quite close to human-level.

• It would be really cool to see a video on Newcomb’s problem, logical decision theories, and Löbian cooperation in the prisoner’s dilemma. I think this group of ideas is one of the most interesting developments in game theory in the past few years, and should be more widely known.

• I think what it boils down to is that in 1 dimension, the mean / expected value is a really useful quantity, and you get it by minimizing squared error, whereas the absolute error gives the median, which is still useful, but much less so than the mean. (The mean is one of the moments of the distribution (the first moment), while the median isn’t. Rational agents maximize expected utility, not median utility, etc. Even the M in MAE still stands for “mean”.) Plus, although algorithmic considerations aren’t too important for small problems, in large problems the fact that least squares just boils down to solving a linear system is really useful, and I’d guess that in almost any large problem, the least squares solution is much faster to obtain than the least absolute error solution.
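The mean/median claim is easy to check by brute force. Below is a minimal Python sketch using only the standard library; the data set and the 0.1-spaced grid of candidate values are arbitrary choices made for illustration. It searches for the constant that minimizes each error and compares the result to the mean and the median.

```python
import statistics

def total_squared_error(data, c):
    """Sum of squared deviations of the data from the constant c."""
    return sum((x - c) ** 2 for x in data)

def total_absolute_error(data, c):
    """Sum of absolute deviations of the data from the constant c."""
    return sum(abs(x - c) for x in data)

# Made-up sample with one large outlier, so the mean and median differ a lot.
data = [1.0, 2.0, 3.0, 4.0, 100.0]

# Candidate constants 0.0, 0.1, ..., 100.0.
candidates = [i / 10 for i in range(0, 1001)]

best_for_squares = min(candidates, key=lambda c: total_squared_error(data, c))
best_for_absolute = min(candidates, key=lambda c: total_absolute_error(data, c))

print("squared-error minimizer: ", best_for_squares, "| mean:  ", statistics.mean(data))
print("absolute-error minimizer:", best_for_absolute, "| median:", statistics.median(data))
```

The outlier drags the squared-error minimizer along with the mean, while the absolute-error minimizer stays at the median.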

• From a pure world-modelling perspective, the 3 step model is not very interesting, because it doesn’t describe reality. It’s maybe best to think of it from an engineering perspective, as a test case. We’re trying to build an AI, and we want to make sure it works well. We don’t know exactly what that looks like in the real world, but we know what it looks like in simplified situations, where the off button is explicitly labelled for the AI and everything is well understood. If a proposed AI design does the wrong thing in the 3-step test case, then it has failed one of its unit tests, and should not be deployed to production (the real world). So the point of the paper is that a reasonable-sounding way you could design an AI with an off switch turns out to fail the unit-test.

I do generally think that too many of the AI-related posts here on LessWrong are “not real” in the way you’re suggesting, but this paper in particular seems “real” to me (whatever that means). I find the most “not real” posts are the verbose ones piled high with vague wordy abstractions, without an equation in sight. The equations in the corrigibility paper aren’t there to seem impressive, they’re there to unambiguously communicate the math the paper is talking about, so that if the authors have made an error of reasoning, it will be as obvious as possible. The way you keep something in contact with reality is by checking either against experiment, or against the laws of mathematics. To quote Feynman, “if it disagrees with experiment, it’s wrong” and similarly, there’s a standard in mathematics that statements must be backed up by checkable calculations and proofs. So long as the authors are holding themselves to that standard (and so long as you agree that any well-designed AI should be able to perform well in this easy test case), then it’s “real”.

• 14 Nov 2022 7:58 UTC
4 points

Debates of “who’s in what reference class” tend to waste arbitrary amounts of time while going nowhere. A more helpful framing of your question might be “given that you’re participating in a community that culturally reinforces this idea, are you sure you’ve fully accounted for confirmation bias and groupthink in your views on AI risk?”. To me, LessWrong does not look like a cult, but that does not imply that it’s immune to various epistemological problems like groupthink.

• A quote from Eliezer’s short fanfic Trust in God, or, The Riddle of Kyon that you may find interesting:

Sometimes, even my sense of normality shatters, and I start to think about things that you shouldn’t think about. It doesn’t help, but sometimes you think about these things anyway.

I stared out the window at the fragile sky and delicate ground and flimsy buildings full of irreplaceable people, and in my imagination, there was a grey curtain sweeping across the world. People saw it coming, and screamed; mothers clutched their children and children clutched at their mothers; and then the grey washed across them and they just weren’t there any more. The grey curtain swept over my house, my mother and my father and my little sister -

Koizumi’s hand rested on my shoulder and I jerked. Sweat had soaked the back of my shirt.

“Kyon,” he said firmly. “Trying to visualize the full reality of the situation is not a good technique when dealing with Suzumiya-san.”

How do you handle it, Koizumi!

“I’m not sure I can put it in words,” said Koizumi. “From the first day I understood my situation, I instinctively knew that to think ‘I am responsible for the whole world’ is only self-indulgence even if it’s true. Trying to self-consciously maintain an air of abnormality will only reduce my mind’s ability to cope.”

Also: I agree that people who want to do alignment research should just go ahead and do alignment research, without worrying about credentials or whether or not they’re smart enough. On a problem as wickedly difficult as alignment, it’s more important to be able to think of even a single actually-promising new approach than to be very intelligent and know lots of math. (Though even people who don’t feel they’re suited for thinking up new approaches can still work on the problem by joining an existing approach.)

The linked post talks about the large value of buying even six months of time, but six months ago it was May 2022. What has been accomplished in AI alignment since then? I think we urgently need to figure out how to make real progress on this problem, and it would be a tragedy if we turned away people who were genuinely enthusiastic about doing that for reasons of “efficiency”. Allocating people to the tasks for which they have the most enthusiasm is efficient.

• 15 Oct 2022 9:20 UTC
LW: 19 AF: 12
AF

I took Nate to be saying that we’d compute the image with highest faceness according to the discriminator, not the generator. The generator would tend to create “thing that is a face that has the highest probability of occurring in the environment”, while the discriminator, whose job is to determine whether or not something is actually a face, has a much better claim to be the thing that judges faceness. I predict that this would look at least as weird and nonhuman as those deep dream images if not more so, though I haven’t actually tried it. I also predict that if you stop training the discriminator and keep training the generator, the generator starts generating weird looking nonhuman images.

This is relevant to Reinforcement Learning because of the actor-critic class of systems, where the actor is like the generator and the critic is like the discriminator. We’d ideally like the RL system to stay on course after we stop providing it with labels, but stopping labels means we stop training the critic. Which means that the actor is free to start generating adversarial policies that hack the critic, rather than policies that actually perform well in the way we’d want them to.

• 7: Did I forget some important question that someone will ask in the comments?

Yes!

Is there a way to deal with the issue of there being multiple ROSE points in some games? If Alice says “I think we should pick ROSE point A” and Bob says “I think we should pick ROSE point B”, then you’ve still got a bargaining game left to resolve, right?

Anyways, this is an awesome post, thanks for writing it up!

# Gatekeeper Victory: AI Box Reflection

9 Sep 2022 21:38 UTC
4 points