I somewhat suspect I will read another post 5 years from now (similar to Hazard’s post “How to Ignore Your Emotions (while also thinking you’re awesome at emotions)”) where you’ll say: “I found out the narratives I told myself about myself were hurting me, so I decided to become someone who didn’t believe narratives about himself. It turns out this just worked to hide narratives from my introspective processes, and I hurt myself by acting according to pretty dumb narratives that I couldn’t introspect on.”
Thanks for making this point. FYI, approximately this thought has crossed my mind several times. In general I agree: be careful when messing with illegible parts of your brain that you don’t understand that well. However, I just don’t find myself feeling that worried about decreasing how much I rely on narratives. Maybe I’ll think more and better understand your concerns, and that might change my mind in either direction.
(I could reply with my current best guess at what narratives are, on my model of human intelligence and values, but I feel too tired to do that right now. Maybe another time.)
Maybe I’m wrong about the human mind having narratives built-in, though it feels quite a tempting story to me.
As an aside: I think the “native architecture” frame is wrong. At the very least, that article makes several unsupported inferences and implicit claims, which I think are probably wrong:
“In particular, visualizing things is part of the brain’s native architecture”
Not marked as an inference, just stated as a fact.
But what evidence has pinned down this possible explanation, compared to others? Even if this were true, how would anyone know that?
“The Löb’s Theorem cartoon was drawn on the theory that the brain has native architecture for tracking people’s opinions.”
Implies that people have many such native representations / that this is a commonly correct explanation.
I think there’s an important distinction between “the genome cannot directly specify circuitry for X” and “the human mind cannot have X built-in”. I think there are quite a few things that we can consider to be practically “built-in” that the genome nonetheless could not directly specify.
I can think of several paths for this:
1. The 1984 game Elite contains a world of 2048 star systems. Because specifying that much information beforehand would have taken a prohibitive amount of memory for computers at the time, they were procedurally generated according to the algorithm described here. Everyone who plays the game can find, for instance, that galaxy 3 has a star system called Enata.
Now, the game’s procedural generation code doesn’t contain anything that would directly specify that there should be a system called Enata in galaxy 3: rather there are just some fixed initial seeds and an algorithm for generating letter combinations for planet names based on those seeds. One of the earlier seeds that the designers tried ended up generating a galaxy with a system called Arse. Since they couldn’t directly specify in-code that such a name shouldn’t exist, they switched to a different seed for generating that galaxy, thus throwing away the whole galaxy to get rid of the one offensively-named planet.
But given the fixed seed, system Enata in galaxy 3 is built-in to the game, and everyone who plays has the chance to find it. Similarly, if the human genome has hit upon a specific starting configuration that when iterated upon happens to produce specific kinds of complex circuitry, it can then just continue producing that initial configuration and thus similar end results, even though it can’t actually specify the end result directly.
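To make the seed analogy concrete, here is a toy sketch in the spirit of (but not implementing) Elite’s actual generator; the syllable table and seed values are invented for illustration:

```python
import random

def generate_galaxy(seed, num_systems=8):
    """Deterministically generate star-system names from a fixed seed."""
    rng = random.Random(seed)  # local RNG: output depends only on the seed
    syllables = ["en", "a", "ta", "ri", "so", "la", "ve", "ti", "qu", "di"]
    return [
        "".join(rng.choice(syllables) for _ in range(rng.randint(2, 4))).capitalize()
        for _ in range(num_systems)
    ]

# The code stores no names, only a seed and a generator, yet the same
# seed always yields the same galaxy: its contents are effectively
# "built-in" without being directly specified anywhere.
assert generate_galaxy(3) == generate_galaxy(3)

# Changing the seed regenerates everything, which is why removing one
# bad name meant throwing away the whole galaxy.
print(generate_galaxy(3))
print(generate_galaxy(4))
```

The design point is that determinism, not storage, is what makes the content reliable: every player (every run) iterates the same procedure from the same starting configuration.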
2. As a special case of the above, if the brain is running a particular kind of learning algorithm (that the genome specifies), then there may be learning-theoretical laws that determine what kind of structure that algorithm will end up learning from interacting with the world, regardless of whether that has been directly specified. For instance, vision models seem to develop specific neurons for detecting curves. This is so underspecified by the initial learning algorithm that there’s been some controversy about whether models really even do have curve detectors; it had to be determined via empirical investigation.
Every vision model we’ve explored in detail contains neurons which detect curves. [...] Each curve detector implements a variant of the same algorithm: it responds to a wide variety of curves, preferring curves of a particular orientation and gradually firing less as the orientation changes. Curve neurons are invariant to cosmetic properties such as brightness, texture, and color. [...]
It’s worth stepping back and reflecting on how surprising the existence of seemingly meaningful features like curve detectors is. There’s no explicit incentive for the network to form meaningful neurons. It’s not like we optimized these neurons to be curve detectors! Rather, InceptionV1 is trained to classify images into categories many levels of abstraction removed from curves and somehow curve detectors fell out of gradient descent.
Moreover, detecting curves across a wide variety of natural images is a difficult and arguably unsolved problem in classical computer vision. InceptionV1 seems to learn a flexible and general solution to this problem, implemented using five convolutional layers. We’ll see in the next article that the algorithm used is straightforward and understandable, and we’ve since reimplemented it by hand.
In the case of “narratives”, they look to me to be something like models that a human mind has of itself. As such, they could easily be “built-in” without being directly specified, if the genome implements something like a hierarchical learning system that tries to construct models of any input it receives. The actions that the system itself takes are included in the set of inputs that it receives, so just a general tendency towards model-building could lead to the generation of self-models (narratives).
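A minimal sketch of how self-models could fall out of a generic model-builder (all names and the frequency-count “model” are invented for illustration; a real brain would be doing something far richer):

```python
from collections import Counter, defaultdict

# Toy sketch: a generic next-input predictor whose input stream includes
# its OWN actions. Nothing in the code says "build a self-model"; one
# just falls out of modeling all inputs uniformly.

class Agent:
    def __init__(self):
        # Generic model: counts of "which input tends to follow which".
        self.model = defaultdict(Counter)
        self.prev = None

    def observe(self, event):
        # The same model-building machinery handles every input, whether
        # it came from the world or from the agent itself.
        if self.prev is not None:
            self.model[self.prev][event] += 1
        self.prev = event

    def act(self, situation):
        action = "flee" if situation == "threat" else "approach"
        self.observe(situation)
        self.observe(("me", action))  # the agent's own action re-enters as input
        return action

agent = Agent()
for _ in range(20):
    agent.act("threat")
    agent.act("food")

# The generic world-model now contains a self-model (a "narrative"):
# "when there's a threat, I am someone who flees".
print(agent.model["threat"].most_common(1))  # [(('me', 'flee'), 20)]
```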
3. As a special case of the above points, there are probably a lot of things that will tend to be lawfully learned given a “human-typical” environment and which serve as extra inputs on top of what’s specified in the genome. For instance, it seems reasonable enough to say that “speaking a language is built-in to humans”, even though sometimes this mechanism breaks, and in general it’s only true for humans who actually grow up around other humans and have a chance to learn something like a language from their environment. Still, as long as they do get exposed to language, the process of learning it seems to rewire the brain in various ways (e.g. various theories tie infantile amnesia to memories from the pre-verbal period being stored in a different format). That rewiring can then interact with information specified by the genome, other regularly occurring features of the environment, etc., to lay down circuitry that will reliably end up developing in the vast majority of humans.
Strong agree that this kind of “built-in” is plausible. In fact, it’s my current top working hypothesis for why people have many regularities (like intuitive reasoning about 3D space, and not 4D space).
Hm, seems like the kind of thing which might be inaccessible to the genome.
I wrote “Human values & biases are inaccessible to the genome” in part to correct this kind of mistake, which I think people make all the time.
(Of course, the broader point of “work through problems (like math problems) using familiar representations (like spatial reasoning)” is still good.)