# Tessellating Hills: a toy model for demons in imperfect search

If you haven’t already, take a look at this post by johnswentworth to understand what this is all about: https://www.lesswrong.com/posts/KnPN7ett8RszE79PH/demons-in-imperfect-search

The short version is that while systems that use perfect search, such as AIXI, have many safety problems, a whole new set of problems arises when we start creating systems that are not perfect searchers. Patterns can form that exploit the imperfect nature of the search function to perpetuate themselves. johnswentworth refers to such patterns as “demons”.

After reading that post I decided to see if I could observe demon formation in a simple model: gradient descent on a not-too-complicated mathematical function. It turns out that even in this very simplistic case, demon formation can happen. Hopefully this post will give people an example of demon formation where the mechanism is simple and easy to visualize.

**Model**

The function we try to minimize using gradient descent is called the loss function. Here it is:

$$L(\vec{x}) = -x_0 + \epsilon \sum_{j=1}^{n} x_j \cdot \text{splotch}_j(\vec{x})$$

Let me explain what some of the parts of this loss mean. Each function $\text{splotch}_j(\vec{x})$ is periodic with period 2π in every component of $\vec{x}$. In this case I decided to make my splotch functions out of a few randomly chosen sine waves added together.

The coefficient $\epsilon$ multiplying the splotch terms is chosen to be a small number, so in any local region $L(\vec{x})$ will look approximately periodic: a bunch of hills repeating over and over again with period 2π across the landscape. But over large enough distances, the relative weightings of the various splotches do change. Travel a distance of 20π in the $x_j$ direction, and $\text{splotch}_j$ will be a larger component of the repeating pattern than it was before. This allows for selection effects.
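As a rough worked example of that reweighting, assume the weight on $\text{splotch}_j$ in the loss is $\epsilon x_j$ (the form suggested by the description above). Then a displacement of 20π along $x_j$ changes that weight by

$$\Delta(\epsilon x_j) = \epsilon \, \Delta x_j = 20\pi \, \epsilon \approx 62.8 \, \epsilon$$

For an illustrative value such as $\epsilon = 0.0025$ (an assumption, not a figure taken from the post), the shift is about 0.157, spread over ten full periods of the hill pattern. The landscape therefore looks nearly periodic locally while the mix of splotches drifts slowly.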

The $-x_0$ term means that the vector $\vec{x}$ mainly wants to increase its $x_0$ component. But the splotch functions can also direct its motion. A splotch function might have a kind of ridge that directs some of the motion into other components. If $\text{splotch}_j$ tends to direct motion in such a way that $x_j$ increases, then it will be selected for, becoming stronger and stronger as time goes on.
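For readers who want something concrete to poke at, here is a minimal NumPy sketch of a loss with this shape. It is not the code from the linked repository. The integer wave vectors (chosen so that every splotch is exactly 2π-periodic), the number of waves per splotch, the value of `EPSILON`, and the random seed are all assumptions made for illustration.

```python
import numpy as np

N_SPLOTCH = 16      # number of splotch functions, matching x1..x16 in the post
N_WAVES = 5         # sine waves summed per splotch (assumed)
EPSILON = 0.0025    # small coupling constant (assumed value)

rng = np.random.default_rng(0)   # arbitrary seed
dims = N_SPLOTCH + 1             # x0 plus x1..x16

# Integer wave vectors keep each splotch 2*pi-periodic in every component of x.
K = rng.integers(-2, 3, size=(N_SPLOTCH, N_WAVES, dims))
PHASE = rng.uniform(0.0, 2.0 * np.pi, size=(N_SPLOTCH, N_WAVES))

def splotch(x):
    """Return splotch_j(x) for j = 1..N_SPLOTCH as an array of length N_SPLOTCH."""
    return np.sin(K @ x + PHASE).sum(axis=1)

def loss(x):
    """L(x) = -x0 + EPSILON * sum_j x_j * splotch_j(x)."""
    return -x[0] + EPSILON * np.dot(x[1:], splotch(x))
```

Note that each `splotch` value is periodic, but `loss` is not: the factor `x[1:]` is what lets the weight of a splotch grow as the point moves along the matching coordinate.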

**Results**

I used ordinary gradient descent, with a constant step size, and with a bit of random noise added in. Figure 1 shows the value of $x_0$ as a function of time, while Figure 2 shows the values of $x_1, x_2, \dots, x_{16}$ as a function of time.
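The procedure can be sketched roughly as follows. This is a hedged illustration, not the author's implementation (that lives in the repository linked in the update below). The step size, noise scale, step count, seed, and the integer-wave-vector construction of the splotches are all assumed values, so the run will not reproduce the figures exactly.

```python
import numpy as np

N, WAVES, EPS = 16, 5, 0.0025          # assumed sizes and coupling constant
STEP, NOISE, STEPS = 0.2, 0.01, 2000   # assumed hyperparameters

rng = np.random.default_rng(0)         # arbitrary seed
K = rng.integers(-2, 3, size=(N, WAVES, N + 1)).astype(float)
PHASE = rng.uniform(0.0, 2.0 * np.pi, size=(N, WAVES))

def loss(x):
    """L(x) = -x0 + EPS * sum_j x_j * splotch_j(x)."""
    return -x[0] + EPS * np.dot(x[1:], np.sin(K @ x + PHASE).sum(axis=1))

def grad(x):
    """Analytic gradient of loss with respect to every component of x."""
    arg = K @ x + PHASE                      # shape (N, WAVES)
    g = np.zeros_like(x)
    g[0] = -1.0                              # from the -x0 term
    g[1:] += EPS * np.sin(arg).sum(axis=1)   # derivative of the explicit x_j factor
    # Chain rule through each splotch_j, which depends on all of x.
    g += EPS * np.einsum("j,jw,jwi->i", x[1:], np.cos(arg), K)
    return g

def descend(steps=STEPS):
    """Constant-step gradient descent with Gaussian noise added to the gradient."""
    x = np.zeros(N + 1)                      # start with all components at 0
    history = np.empty((steps, N + 1))
    for t in range(steps):
        x = x - STEP * (grad(x) + NOISE * rng.standard_normal(N + 1))
        history[t] = x
    return history
```

Plotting `history[:, 0]` against the step index gives the analogue of Figure 1, and `history[:, 1:]` the analogue of Figure 2. Whether a demon-like phase appears will depend on the seed and on the assumed constants.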

Fig 1:

Fig 2:

There are three phases to the evolution. In the first, $x_0$ increases steadily, and the other coordinates wander around more or less randomly. In the second phase, a self-reinforcing combination of splotches (a “demon”) takes hold and amplifies itself drastically, feeding off the large $x_0$ gradient. Finally, this demon becomes so strong that the search gets stuck in a local valley and further progress stops. The first phase runs more or less from 0 to 2500 steps. The second phase runs from 2500 to 4000 steps, though it slows down after 3500. The final phase starts at 4000 steps, and likely continues indefinitely.

Now that I have seen demons arise in such a simple situation, it makes me wonder how commonly the same thing happens in the training of deep neural networks. Anyways, hopefully this is a useful model for people who want to understand the mechanisms behind the whole “demons in imperfect search” thing more clearly. It definitely helped me, at least.

Update: The code is now up here: https://github.com/DaemonicSigil/tessellating-hills


Now this is one of the more interesting things I’ve come across.

I fiddled around with the code a bit and was able to reproduce the phenomenon with DIMS = 1, making visualisation possible:

Behold!

Here’s the code I used to make the plot:

That’s very cool, thanks for making it. At first I was worried that this meant that my model didn’t rely on selection effects. Then I tried a few different random seeds, and some, like 1725, didn’t show demon-like behaviour. So I think we’re still good.

Hmm, the inherent 1d nature of the visualization kinda makes it difficult to check for selection effects. I’m not convinced that’s actually what’s going on here. 1725 is special because the ridges of the splotch function are exactly orthogonal to x0. The odds of this happening probably go down exponentially with dimensionality. Furthermore, with more dakka, one sees that the optimization rate drops dramatically after ~15000 time steps, and may or may not do so again later. So I don’t think this proves selection effects are in play. An alternative hypothesis is simply that the process gets snagged by the first non-orthogonal ridge it encounters, without any serious selection effects coming into play.

Very nice work. The graphs in particular are quite striking.

I sat down and thought for a bit about whether that objective function is actually a good model for the behavior we’re interested in. Twice I thought I saw an issue, then looked back at the definition and realized you’d set up the function to avoid that issue. Solid execution; I think you have actually constructed a demonic environment.

This is awesome, thanks!

So, to check my understanding: You have set up a sort of artificial feedback loop, where there are N overlapping patterns of hills, and each one gets stronger the farther you travel in a particular dimension/direction. So if one or more of these patterns tends systematically to push the ball in the same direction that makes it stronger, you’ll get a feedback loop. And then there is selection between patterns, in the sense that the pattern which pushes the strongest will beat the ones that push more weakly, even if both have feedback loops going.

And then the argument is, even though these feedback loops were artificial / baked in by you, in “natural” search problems there might be a similar situation… what exactly is the reason for this? I guess my confusion is in whether to expect real life problems to have this property where moving in a particular direction strengthens a particular pattern. One way I could see this happening is if the patterns are themselves pretty smart, and are able to sense which directions strengthen them at any given moment. Or it could happen if, by chance, there happens to be a direction and a pattern such that the pattern systematically pushes in that direction and the direction systematically strengthens that pattern. But how likely are these? I don’t know. I guess your case is a case of the second, but it’s rigged a bit, because of how you built in the systematic-strengthening effect.

Am I following, or am I misunderstanding?

Thanks, and your summary is correct. You’re also right that this is a pretty contrived model. I don’t know exactly how common demons are in real life, and this doesn’t really shed much light on that question. I mainly thought that it was interesting to see that demon formation was possible in a simple situation where one can understand everything that is going on.

I have the same confusion

Hi, thanks for sharing and experimentally trying out the theory in the previous post! Super cool.

Do you have the code for this up anywhere?

I’m also a little confused by the training procedure. Are you just instantiating a random vector and then doing GD with respect to the loss function you defined? Do the charts show the loss averaged over many random vectors (and splotch function variants)?

Thanks. I initially tried putting the code in a comment on this post, but it ended up being deleted as spam. It’s now up on GitHub: https://github.com/DaemonicSigil/tessellating-hills. It isn’t particularly readable, for which I apologize.

The initial vector has all components set to 0, and the charts show the evolution of these components over time. This is just for a particular run; there isn’t any averaging. x0 gets its own chart, since it changes much more than the other components. If you want to know how the loss varies with time, you can just flip figure 1 upside down to get a pretty good proxy, since the splotch functions are of secondary importance compared to the -x0 term.

Oops, sorry for that. I restored your original comment.

Here is the code for people who want to reproduce these results, or just mess around:

You’re totally not obligated to do this, but I think it might be cool if you generated a 3D picture of hills representing your loss function—I think it would make the intuition for what’s going on clearer.

I don’t see why the gradient with respect to x0 ever changes, and so am confused about why it would ever stop increasing in the x0 direction. Does this have to do with using a fixed step size instead of learning rate?

[Edit: my current thought is that it looks like there’s periodic oscillation in the 3rd phase, which is probably an important part of the story; the gradient is mostly about how to point at the center of that well, which means it orbits that center, and x0 progress grinds to a crawl because it’s a small fraction of the overall gradient, whereas it would continue at a regular pace if it were a constant learning rate instead, I think.]

Also, did you use any regularization? [Edit: if so, the decrease in response to x0 might actually be present in a one-dimensional version of this, suggesting it’s a very different story.]

Looks like the splotch functions are each a random mixture of sinusoids in each direction, so each $\text{splotch}_j$ will have some variation along $x_0$. The argument of $\text{splotch}_j$ is all of $\vec{x}$, not just $x_j$.
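To spell that out, assume the loss has the form $L(\vec{x}) = -x_0 + \epsilon \sum_{j} x_j \cdot \text{splotch}_j(\vec{x})$, with every splotch depending on all components of $\vec{x}$. Differentiating with respect to $x_0$ gives

$$\frac{\partial L}{\partial x_0} = -1 + \epsilon \sum_{j=1}^{n} x_j \, \frac{\partial \, \text{splotch}_j(\vec{x})}{\partial x_0}$$

The first term is constant, but the second grows with the magnitudes of the $x_j$. Once some $|x_j|$ is of order $1/\epsilon$, the second term can locally cancel the $-1$, which is consistent with progress in $x_0$ stalling even without regularization or a changing step size.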

Ah, that’d do it too.

No regularization was used.

I also can’t see any periodic oscillations when I zoom in on the graphs. I think the wobbles you are observing in the third phase are just a result of the random noise that is added to the gradient at each step.