I have a PhD in Computational Neuroscience from UCSD (my Bachelor’s was in Biomedical Engineering, with minors in Math and Computer Science). Ever since junior high, I’ve been trying to figure out how to engineer artificial minds, and I’ve been coding up artificial neural networks since I first learned to program. Obviously, all my early designs were almost completely wrong/unworkable/poorly defined, but I think those experiences did prime my brain with inductive biases well suited to working on AGI.
Although I now work as a data scientist in R&D at a large medical device company, I still spend my free time studying the latest developments in AI/ML/DL/RL and neuroscience and trying to work out how to bring it all together into systems that could actually be implemented. Unfortunately, I don’t seem to have much time to develop my ideas into publishable models, but I would love the opportunity to share ideas with those who do.
Of course, I’m also very interested in AI Alignment (hence the account here). My ideas on that front mostly fall into the “learn (invertible) generative models of human needs/goals and hook those up to the AI’s own reward signal” camp. I think alignment methods that depend on restricting the AI’s intelligence or behavior are about as doomed in the long term as Prohibition or the War on Drugs in the USA. We need a better theory of what reward signals are for in general (probably something to do with maximizing (or minimizing) the attainable utility (or disutility) with respect to the survival needs of a system) before we can hope to model human values usefully. This could even extend to modeling the “values” of the ecological/socioeconomic/political supersystems in which humans are embedded, or of the biological subsystems embedded within humans, both of which would be crucial for creating a better future.
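To make the “hook a learned model of human needs up to the reward signal” idea slightly more concrete, here is a minimal toy sketch (everything in it, from the four-feature state to the labels to the logistic model, is an illustrative assumption of mine, not a serious proposal): fit a model that predicts whether a state satisfies human needs, then define the agent’s reward as that model’s output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: state features -> "human needs satisfied?" label.
# (Hypothetical stand-in for a learned generative model of human goals.)
states = rng.normal(size=(500, 4))            # e.g. [warmth, food, safety, rest]
true_w = np.array([1.0, 0.8, 1.5, 0.5])       # hidden "true" human preferences
labels = (states @ true_w + rng.normal(scale=0.3, size=500) > 0).astype(float)

# Fit a simple logistic model of human need-satisfaction by gradient ascent.
w = np.zeros(4)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(states @ w)))
    w += 0.1 * states.T @ (labels - p) / len(states)

def reward(state: np.ndarray) -> float:
    """Agent's reward = modeled probability that the state meets human needs."""
    return float(1.0 / (1.0 + np.exp(-(state @ w))))

print(reward(np.ones(4)))    # high: all needs met
print(reward(-np.ones(4)))   # low: all needs unmet
```

The interesting (and hard) part, of course, is everything this sketch leaves out: where the state features come from, how the model stays calibrated off-distribution, and how to make it invertible in the generative sense.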
I largely agree with the main thrust of the argument. What would this line of thought imply for the possibility of mind-uploading? Do we need to simulate every synapse to recreate a person, or might there be a way to take advantage of certain regularities in the computational structure of the brain to convert someone’s memories/behavioral policies/personality/etc. into some standard format that could be imprinted on a more generic architecture?
A couple of quibbles, though:
Depending on what exactly you mean by “neuromorphic”, I take issue with this. If you want to use traditional CPU/GPU technology, I imagine that you could simulate an AGI on a small server farm and use that to control a robot body (physically or virtually embodied). However, if you want anywhere near human-level power/space efficiency, I think something like neuromorphic hardware will be essential.
You can run a large neural network in software using continuous values for neuron activations, but the hardware it’s running on is optimized only for generic computation. “Neurons that spike” offer many advantages, like power efficiency and event-based Monte Carlo sampling. Dedicated hardware built around spiking-neuron analogs could implement brain-like AGI models far more efficiently than existing CPUs/GPUs, at the cost of generality of computation (no free lunch).
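To illustrate the event-based computation that spiking hardware exploits, here is a minimal leaky integrate-and-fire simulation in plain Python/NumPy (a textbook toy model; the constants and the on/off input are illustrative choices of mine, not parameters of any real chip). The point is that the output is a sparse stream of discrete spike events, which lets neuromorphic hardware idle and save power between events, rather than computing a dense vector of floats on every tick:

```python
import numpy as np

def simulate_lif(input_current, dt=1e-3, tau=0.02, v_thresh=1.0, v_reset=0.0):
    """Leaky integrate-and-fire neuron; returns the times of discrete spike events."""
    v = 0.0
    spike_times = []
    for step, i_in in enumerate(input_current):
        # Leaky integration: potential decays toward 0 while driven by input.
        v += dt * (-v / tau + i_in)
        if v >= v_thresh:
            spike_times.append(step * dt)  # emit a discrete event ("spike")
            v = v_reset                    # reset after spiking
    return spike_times

# On/off drive: 100 ms of input, 100 ms of silence, repeated.
current = np.where(np.arange(1000) % 200 < 100, 60.0, 0.0)
print(simulate_lif(current))  # a handful of spike times, not 1000 floats
```

A continuous-valued artificial neuron run on a CPU/GPU would produce (and pay for) an output at every one of those 1000 steps; the spiking version only does work when an event actually occurs.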
Does AGI require neuromorphic hardware *per se*? No. Will the first implementation of scalable AGI algorithms and data structures run in software on non-AGI-dedicated hardware? Probably. Will those algorithms involve directly simulating Na/K/Ca currents, gene regulation, etc.? Probably not. But will it be necessary to port those algorithms and data structures to something like spiking/event-based neuromorphic hardware to make them competitive, affordable, and scalable? I think so. Eventually. At least if you want robots with human-level intelligence running on human-brain-sized computers.
This is wrong unless “key operating principles” means something different each time you use it: first the algorithms and data structures running on the human brain, then the molecular-level causal graph describing the worm’s nervous system. That, I assume, is what you meant.