leogao

Karma: 2,877

leogao 4 Nov 2023 21:24 UTC
LW: 88 AF: 39
11
AF
on: Genetic fitness is a measure of selection strength, not the selection target
I agree with most of the factual claims made in this post about evolution. I agree that “IGF is the objective” is somewhat sloppy shorthand. However, after diving into the specific ways the object level details of “IGF is the objective” play out, I am confused about why you believe this implies the things you claim they imply about the sharp left turn / inner misalignment. Overall, I still believe that natural selection is a reasonable analogy for inner misalignment.
- I agree fitness is not a single stationary thing. I agree this is prima facie unlike supervised learning, where the objective is typically stationary. However, it is pretty analogous to RL, and especially multi agent RL, and overall I don’t think of the inner misalignment argument as depending on stationarity of the environment in either direction. AlphaGo might early in training select for policies that do tactic X initially because it’s a good tactic to use against dumb Go networks, and then once all the policies in the pool learn to defend against that tactic it is no longer rewarded. Therefore I don’t see any important disanalogy between evolution and multi agent RL. I have various thoughts on why language models do not make RL analogies irrelevant that I can explain but that’s another completely different rabbit hole.
- I agree that humans (to a first approximation) still have the goals/drives/desires we were selected for. I don’t think I’ve heard anyone claim that humans suddenly have an art creating drive that suddenly appeared out of nowhere recently, nor have I heard any arguments about inner alignment that depend on an evolution analogy where this would need to be true. The argument is generally that the ancestral environment selected for some drives that in the ancestral environment reliably caused something that the ancestral environment selected for, but in the modern environment the same drives persist but their consequences in terms of [the amount of that which the ancestral environment was selecting for] now changes, potentially drastically. I think the misconception may arise from a closely related claim that some make, which is that AI systems might develop weird arbitrary goals (tiny metallic squiggles) because any goal with sufficient intelligence implies playing the training game and then doing a sharp left turn. However, the claim here is not that the tiny metallic squiggles drive will suddenly appear at some point and replace the “make humans really happy” drive that existed previously. The claim is that the drive for tiny metallic squiggles was always, from the very beginning, the reason why [make humans really happy] was the observed behavior in environment [humans can turn you off if they aren’t happy with you], and therefore in a different environment [humans can no longer turn you off], the observed behavior is [kill everyone and make squiggles].
- I agree that everything is very complex always. I agree that there are multiple different goals/drives/desires in humans that result in children, of which the sex drive is only one. I agree that humans still have children sometimes, and still want children per se sometimes, but in practice this results in less and less children than in the ancestral environment over time (I bet even foragers are at least above replacement rate) for exactly the reason that the drives that we have always had for the reason that they caused us to survive/reproduce in the past now correspond much less well. I also agree that infanticide exists and occurs (but in the ancestral environment, there are counterbalancing drives like taboos around infanticide). In general, in many cases, simplifying assumptions totally break the analogy and make the results meaningless. I don’t think I’ve been convinced that this is one of those cases.
I don’t really care about defending the usage of “fitness as the objective” specifically, and so I don’t think the following is a crux and am happy to concede some of the points below for the sake of argument about the object facts of inner alignment. However, for completeness, my take on when “fitness” can be reasonably described as the objective, and when it can’t be:
- I agree that couched in terms of the specific traits, the thing that evolution does in practice is sometimes favoring some traits and sometimes favoring other traits. However, I think there’s an important sense in which these traits are not drawn from a hat- natural selection selects for lighter/darker moths because it makes it easier for the moths to survive and reproduce! If lighter moths become more common whenever light moths survive and reproduce better, and vice versa for dark moths, as opposed to moths just randomly becoming more light or more dark in ways uncorrelated to survival/reproduction, it seems pretty reasonable to say that survival/reproduction is closer to the thing being optimized than some particular lightness/darkness function that varies between favoring lightness and darkness.
- I agree it is possible to do artificial selection for some particular trait like moth color and in this case saying that the process optimizes “fitness” (or survival/reproduction) collapses to saying the same thing as the process optimizes moth lightness/darkness. I agree it would be a little weird to insist that “fitness” is the goal in this case, and that the color is the more natural goal. I also agree that the evolutionary equations plays out the same way whether the source of pressure is artificial human selection or birds eating the moths. Nonetheless, I claim the step where you argue the two cases are equivalent for the purposes of whether we can consider fitness the objective is the step that breaks down. I think the difference between this case and the previous case is that the causality flows differently. We can literally draw from a hat whether we want light moths or dark moths, and then reshape the environment until fitness lines up with our preference for darkness, whereas in the other case, the environment is drawn from a hat and the color selection is determined downstream of that.
What links here?
- AI #37: Moving Too Fast by Zvi (9 Nov 2023 17:50 UTC; 53 points)
- How well does your research adress the theory-practice gap? by Jonas Hallgren (8 Nov 2023 11:27 UTC; 18 points)

leogao 3 Mar 2023 8:44 UTC
LW: 80 AF: 28
28
AF
on: The Waluigi Effect (mega-post)
Therefore, the longer you interact with the LLM, eventually the LLM will have collapsed into a waluigi. All the LLM needs is a single line of dialogue to trigger the collapse.
This seems wrong. I think the mistake you’re making is when you argue that because there’s some chance X happens at each step and X is an absorbing state, therefore you have to end up at X eventually. However, this is only true if you assume the conclusion and claim that the prior probability of luigis is zero. If there is some prior probability of a luigi, each non-waluigi step increases the probability of never observing a transition to a waluigi a little bit.

leogao 4 May 2023 22:45 UTC
61 points
38
on: Google “We Have No Moat, And Neither Does OpenAI”
My prior, not having looked too carefully at the post or the specific projects involved, is that probably any claims that an open source model is 90% as good as GPT4 or indistinguishable are hugely exaggerated or otherwise not a fair comparison. In general in ML, confirmation bias and overclaiming is very common and as a base rate the vast majority of papers that claim some kind of groundbreaking result end up just never having any real impact.
Also, I expect facets of capabilities progress most relevant to existential risk will be especially constrained strongly by base model quality. I would agree that open source is probably better at squeezing stuff out of small models, but my model is that wrt existential risk relevant capabilities progress this is less relevant (cf the bitter lesson).

leogao 3 May 2024 0:37 UTC
58 points
44
on: Please stop publishing ideas/insights/research about AI
I’m very sympathetic to the idea of being careful about publishing things that could spread capabilities ideas. However, I think there are several important things missing from your world model, which cause me to believe that following your advice would substantially hurt alignment progress.

(To be clear, none of this applies to alignment people working directly on capabilities, who should, like, not. Rather, this is about alignment researchers accidentally advancing capabilities by talking to capabilities people)
- It’s genuinely hard to come up with ideas that help capabilities a lot. I think you are severely underestimating how hard it is, and how much insight is required. I think one issue here is that most papers on arxiv are garbage and don’t actually make any progress, but those papers are not the ones that are pushing AGI forward anyways.
- Even if you try very hard to do so, it’s still very hard to convince people that you’re right if you don’t have a ton of clout via a legible reputation of being right a lot. Everyone has an agenda they’re convinced will solve AGI and is too busy trying to convince everyone else to work on their agenda.
- High level ideas are generally not that valuable in and of themselves. People generally learn to ignore ideas unless they have strong empirical evidence of correctness (or endorsement of highly respected researchers) because there are simply too many ideas. The valuable thing is not the idea itself, but the knowledge of which ideas are actually correct.
- I think deeply understanding top tier capabilities researchers’ views on how to achieve AGI is actually extremely valuable for thinking about alignment. Even if you disagree on object level views, understanding how very smart people come to their conclusions is very valuable.
- I think alignment discourse is greatly harmed by people being too scared to say things. When it bleeds over to being too scared to think about capabilities related topics for fear of accidentally generating something dangerous, I think this is even more harmful.

leogao 24 Mar 2024 21:58 UTC
48 points
7
on: All About Concave and Convex Agents
Thankfully, almost all of the time the convex agents end up destroying themselves by taking insane risks to concentrate their resources into infinitesimally likely worlds, so you will almost never have to barter with a powerful one.
(why not just call them risk seeking / risk averse agents instead of convex/concave?)

leogao 3 Mar 2023 7:37 UTC
LW: 45 AF: 13
16
AF
on: The Waluigi Effect (mega-post)
However, this trick won’t solve the problem. The LLM will print the correct answer if it trusts the flattery about Jane, and it will trust the flattery about Jane if the LLM trusts that the story is “super-duper definitely 100% true and factual”. But why would the LLM trust that sentence?
There’s a fun connection to ELK here. Suppose you see this and decide: “ok forget trying to describe in language that it’s definitely 100% true and factual in natural language. What if we just add a special token that I prepend to indicate ’100% true and factual, for reals’? It’s guaranteed not to exist on the internet because it’s a special token.”
Of course, by virtue of being hors-texte, the special token alone has no meaning (remember, we had to do this to escape being contaminated by internet text meaning accidentally transferring). So we need to somehow explain to the model that this token means ’100% true and factual for reals’. One way to do this is to add the token in front of a bunch of training data that you know for sure is 100% true and factual. But can you trust this to generalize to more difficult facts (“<|specialtoken|>Will the following nanobot design kill everyone if implemented?”)? If ELK is hard, then the special token will not generalize (i.e it will fail to elicit the direct translator), for all of the reasons described in ELK.
What links here?
- [ASoT] Some thoughts on human abstractions by leogao (16 Mar 2023 5:42 UTC; 42 points)

leogao 24 Sep 2023 5:44 UTC
44 points
22
on: Paper: LLMs trained on “A is B” fail to learn “B is A”
I like this paper for crisply demonstrating an instance of poor generalization in LMs that is likely representative of a broader class of generalization properties of current LMs.
The existence of such limitations in current ML systems does not imply that ML is fundamentally not a viable path to AGI, or that timelines are long, or that AGI will necessarily also have these limitations. Rather, I find this kind of thing interesting because I believe that understanding limitations of current AI systems is very important for giving us threads to yank on that may help us with thinking about conceptual alignment. Some examples of what I mean:
- It’s likely that our conception of the kinds of representations/ontology that current models have are deeply confused. For example, one might claim that current models have features for “truth” or “human happiness”, but it also seems entirely plausible that models instead have separate circuits and features entirely for “this text makes a claim that is incorrect” and “this text has the wrong answer selected”, or in the latter case for “this text has positive sentiment” and “this text describes a human experiencing happiness” and “this text describes actions that would cause a human to be happy if they were implemented”.
- I think we’re probably pretty confused about mesaoptimization, in a way that’s very difficult to resolve just by thinking more about it (source: have spent a lot of time thinking about mesaoptimizers). I think this is especially salient to the people trying to make model organisms—which I think is a really exciting avenue—because if you try to make a mesaoptimizer, you immediately collide head on with things like finding that “training selects from the set of goals weighted by complexity” hypothesis doesn’t seem to accurately describe current model training. I think it’s appropriate to feel pretty confused about this and carefully examine the reasons why current models don’t exhibit these properties. It’s entirely reasonable for the answer to be “I expect future models to have thing X that current models don’t have”—then, you can try your best to test various X’s before having the future AIs that actually kill everyone.
- There are some things that we expect AGI to do that current ML systems do not do. Partly this will be because in fact current ML systems are not analogous to future AGI in some ways—probably if you tell the AGI that A is B, it will also know that B is A. This does not necessarily have to be a property that gradually emerges and can be forecasted with a scaling law; it could emerge in a phase change, or be the result of some future algorithmic innovation. If you believe there is some property X of current ML that causes this failure, and that it will be no longer a failure in the future, then you should also be suspicious of any alignment proposal that depends on this property (and the dependence of the proposal on X may be experimentally testable). For instance, it is probably relatively easy to make an RL trained NN policy be extremely incoherent in a small subset of cases, because the network has denormalized contextual facts that are redundant across many situations. I expect this to probably be harder in models which have more unified representations for facts. To the extent I believe a given alignment technique works because it leverages this denormalization, I would be more skeptical of it working in the future.
- As a counterpoint, it might also be that we had an inaccurate conception of what capabilities AGI will have that current ML systems do not have—I think one important lesson of GPT-* has been that even with these failures, the resulting systems can still be surprisingly useful.
What links here?
- leogao's comment on jacquesthibs’s Shortform by jacquesthibs (5 Nov 2023 23:06 UTC; 6 points)

leogao 15 Oct 2022 5:45 UTC
LW: 42 AF: 27
26
AF
in reply to: cfoster0’s comment on: Counterarguments to the basic AI x-risk case
It’s the relevant operationalization because in the context of an AI system optimizing for X-ness of states S, the thing that matters is not what the max-likelihood sample of some prior distribution over S is, but rather what the maximum X-ness sample looks like. In other words, if you’re trying to write a really good essay, you don’t care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.
(also, the maximum likelihood essay looks like a single word, or if you normalize for length, the same word repeated over and over again up to the context length)

leogao 16 Oct 2022 18:01 UTC
39 points
16
in reply to: jacob_cannell’s comment on: Counterarguments to the basic AI x-risk case
I object to your characterization that I am claiming that diffusion models work by maximizing faciness, or that I am confused about how diffusion models work. I am not claiming that unconditional diffusion models trained on a face dataset optimize faciness. In fact I’m confused how you could possibly have arrived at that interpretation of my words, because I am specifically arguing that because diffusion models trained on a face dataset don’t optimize for faciness, they aren’t a fair comparison with the task of doing things that get high utility. The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.

(Unimportant nitpicking: This Person Does Not Exist doesn’t actually use a diffusion model, but rather a StyleGAN trained on a face dataset.)

You’re also eliding over the difference between training an unconditional diffusion model on a face dataset and training an unconditional diffusion model over a general image dataset and doing classifier based guidance. I’ve been talking about unconditional models on a face dataset, which does not optimize for faciness, but when you do classifier-based guidance this changes the setup. I don’t think this difference is crucial, and my point can be made with either, so I will talk using your setup instead.

In fact, the setup you describe in the linked comment does in fact put optimization pressure on faciness, regularized by distance from the prior. Note that when I say “optimization pressure” I don’t mean necessarily literally getting the sample that maxes out the objective. In the essay example, this would be like doing RLHF for essay quality with a KL penalty to stay close to the text distribution. You are correct in stating that this regularization helps to stay on the manifold of realistic images and that removing it results in terrible nightmare images, and this applies directly to the essay example as well.

However the core problem with this approach is that the reason the regularization works is that you trade off quality for typicality. In the face case this is mostly fine because faces are pretty typical of the original distribution anyways, but I would make the concrete prediction that if you tried to get faces using classifier-based guidance out of a diffusion model specifically trained on all images except those containing faces, it would be really fiddly or impossible to get good quality faces that aren’t weird and nightmarish out of it. It seems possible that you are talking past me/Nate in that you have in mind that such regularization isn’t a big problem to put on our AGI, mild optimization is a good thing because we don’t want really weird worlds, etc. I believe this is fatally flawed for planning, partly because this means we can’t really achieve world states that are very weird from the current perspective (and I claim most good futures also seem weird from the current perspective), and also that because of imperfect world modelling the actual domain you end up regularizing over is the domain of plans, which means you can’t do things that are too different from things that have been done before. I’m not going to argue this out because I think the following is actually a much larger crux and until we agree on it, arguing over my previous claim will be very difficult:

Applying more optimization pressure results in better outputs, according to the objective. Optimization pressure doesn’t break the objective function (what would that even mean?) and you have to create fairly contrived scenarios where more optimization power results in worse outcomes.

When I say “breaks” the objective, I mean reward hacking/reward gaming/goodharting it. I’m surprised that this wasn’t obvious. To me, most/all of alignment difficulty falls out of extreme optimization power being aimed at objectives that aren’t actually what we want. I think that this could be a major crux underlying everything else.

(It may also be relevant that at least from my perspective, notwithstanding anything Eliezer may or may not have said, the “learning human values is hard” argument primarily applies to argue why human values won’t be simple/natural in cases where simplicity/naturalness determine what is easier to learn. I have no doubt that a sufficiently powerful AGI could figure out our values if it wanted to, the hard part is making it want to do so. I think Eliezer may be particularly more pessimistic about neural networks’ robustness.)

Nonetheless, let me lay out some (non-exhaustive) concrete reasons why I expect just scaling up the discriminator and its training data to not work.

Obviously, when we optimize really hard on our learned discriminator, we get the out of distribution stuff, as you agree. But let’s just suppose for the moment that we completely abandon all competitiveness concerns and get rid of the learned discriminator entirely and replace it with the ground truth, an army of perfect human labellers. I claim that optimizing any neural network for achieving world states that these labellers find good doesn’t just lead to extremely bad outcomes in unlikely contrived scenarios, but rather happens by default. Even if you think the following can be avoided using diffusion planning / typicality regularization / etc, I still think it is necessary to first agree that this comes up when you don’t do that regularization, and only then discuss whether it still comes up with regularization.
1. Telling whether a world state is good is nontrivial. You can be easily tricked into thinking a world state is good when it isn’t. If you ask the AI to go do something really difficult, you need to make it at least as hard to trick you with a Potemkin village as the task you want it to do.
2. Telling whether a plan leads to a good world state is nontrivial. You don’t have a perfect world model. You can’t tell very reliably whether a proposed plan leads to good outcomes.

leogao 13 Jan 2024 21:03 UTC
LW: 38 AF: 23
21
AF
in reply to: TurnTrout’s comment on: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
I think this paper is empirical evidence for a nontrivial part of the deceptive alignment argument (RLHF/adversarial training being insufficient to remove it), and I also think most empirical papers don’t make any sense when applied to AGI.

I think I have an intellectually consistent stance—I don’t think this is because I have a double standard for pessimistic results.

First, suppose you did an experiment where you show models that usually kick puppies and hide a sleeper agent that suddenly becomes helpful and harmless in 2024, and adversarial training failing to remove this. I think I would draw the exact same conclusion about deceptive alignment from this experiment where the labels are painted on differently but the mechanics are the same. And just as I think it is invalid to conclude from the sleeper agent paper that models naturally want to insert backdoors in code even if they’re harmless now, it is also invalid to argue from this hypothetical experiment that models naturally want to be helpful even if you try to train them to kick puppies.

Second, I think this paper is actually genuinely better evidence for deceptive alignment than many of the “deception” papers that came before. For example, I claim that the sycophancy and insider trading papers provide approximately no evidence for deceptive alignment. This is for exactly the same reason why I think showing RLHF making models harmless provides approximately no evidence against deceptive alignment. So I don’t think it’s true that I like empirical papers as long as they purport to support the deceptive alignment argument.

The reasons I think this paper is actually better than the other deception papers (beyond just quality of execution) are that the deceptive alignment in this setup happens for reasons more similar to why it might happen in AGI than in previous work, and the secret scratchpad setting seeming more analogous to AGI than single shot or visible scratchpad.

leogao 8 Oct 2022 17:24 UTC
37 points
24
on: The Teacup Test
I think there are two main problems being pointed at here. First is that it seems reasonable to say that for the most part under most definitions of intelligence, intelligence is largely continuous (though some would argue for the existence of a few specific discontinuities)---thus, it seems unreasonable to ask “is X intelligent”. A teacup may be slightly more intelligent than a rock, and far less intelligent than GPT-3.
Second is the fact that the thing we actually care about is neither “intelligence” nor “Bayesian agent”; just because you can’t name something very precisely yet doesn’t mean that thing doesn’t exist or isn’t worth thinking about. The thing we care about is that someone might make a thing in 10 years that literally kills everyone, and we have some models of how we might expect that thing to be built. In analogy, perhaps we have a big philosophical argument over what counts as a “chair”—some would argue bitterly whether stools count as chairs, or whether tiny microscopic chair-shaped things count as chairs, or whether rocks count as chairs because you can sit on them, some people arguing that there is in fact no such thing as a physical chair, because concepts like that exist only in the map and the territory is made of atoms, etc. But if you have the problem that you expect chairs to break when you sit on them if they aren’t structurally sound, then most of these arguments are a huge distraction. Or more pithily:
“nooo that’s not really intelligent” I continue to insist as I shrink and transform into a paperclip

leogao 28 Jan 2024 4:46 UTC
34 points
on: leogao’s Shortform
i’ve noticed a life hyperparameter that affects learning quite substantially. i’d summarize it as “willingness to gloss over things that you’re confused about when learning something”. as an example, suppose you’re modifying some code and it seems to work but also you see a warning from an unrelated part of the code that you didn’t expect. you could either try to understand exactly why it happened, or just sort of ignore it.
reasons to set it low:
- each time your world model is confused, that’s an opportunity to get a little bit of signal to improve your world model. if you ignore these signals you increase the length of your feedback loop, and make it take longer to recover from incorrect models of the world.
- in some domains, it’s very common for unexpected results to actually be a hint at a much bigger problem. for example, many bugs in ML experiments cause results that are only slightly weird, but if you tug on the thread of understanding why your results are slightly weird, this can cause lots of your experiments to unravel. and doing so earlier rather than later can save a huge amount of time
- understanding things at least one level of abstraction down often lets you do things more effectively. otherwise, you have to constantly maintain a bunch of uncertainty about what will happen when you do any particular thing, and have a harder time thinking of creative solutions
reasons to set it high:
- it’s easy to waste a lot of time trying to understand relatively minor things, instead of understanding the big picture. often, it’s more important to 80-20 by understanding the big picture, and you can fill in the details when it becomes important to do so (which often is only necessary in rare cases).
- in some domains, we have no fucking idea why anything happens, so you have to be able to accept that we don’t know why things happen to be able to make progress
- often, if e.g you don’t quite get a claim that a paper is making, you could resolve your confusion just by reading a bit ahead. if you always try to fully understand everything before digging into it, you’ll find it very easy to get stuck before actually make it to the main point the paper is making
there are very different optimal configurations for different kinds of domains. maybe the right approach is to be aware that this is an important hparameter and occasionally try going down some rabbit holes and seeing how much value it provides

leogao 18 Aug 2023 1:04 UTC
34 points
13
on: Against Almost Every Theory of Impact of Interpretability
My personal theory of impact for doing nonzero amounts of interpretability is that I think understanding how models think will be extremely useful for conceptual research. For instance, I think one very important data point for thinking about deceptive alignment is that current models are probably not deceptively aligned. Many people have differing explanations for which property of the current setup causes this (and therefore which things we want to keep around / whether to expect phase transitions / etc), which often imply very different alignment plans. I think just getting a sense of what even these models are implementing internally could help a lot with deconfusion here. I don’t think it’s strictly necessary to do interpretability as opposed to targeted experiments where we observe external behaviour for these kinds of things, but probably experiments that get many bits are much better than targeted experiments for deconfusion, because oftentimes the hypotheses are all wrong in subtle ways. Aside from that, I am not optimistic about fully understanding the model, training against interpretability, microscope AI, or finding the “deception neuron” as a way to audit deception. I don’t think future models will necessarily have internal structures analogous to current models.

leogao 6 Dec 2022 17:52 UTC
34 points
24
on: Updating my AI timelines
I think I expect delays from regulation to not really substantially affect the time at which AI can cause an x-risk, whereas it does substantially affect when TAI is deployed broadly. I think it’s plausible that at the time AI x-risk happens, even in “slower” takeoffs, most of the economy is still not automated, even if contemporary AI could in theory automate it.

leogao 23 Nov 2022 6:13 UTC
LW: 31 AF: 9
25
AF
in reply to: benedelman’s comment on: AI will change the world, but won’t take it over by playing “3-dimensional chess”.

A small group of researchers raise alarm that this is going on, but society at large doesn’t listen to them because everything seems to be going so well.

Arguably this is already the situation with alignment. We have already observed empirical examples of many early alignment problems like reward hacking. One could make an argument that looks something like “well yes but this is just in a toy environment, and it’s a big leap to it taking over the world”, but it seems unclear when society will start listening. In analogy to the AI goalpost moving problem (“chess was never actually hard!”), in my model it seems entirely plausible that every time we observe some alignment failure it updates a few people but most people remain un-updated. I predict that for a large set of things currently claimed will cause people to take alignment seriously, most of them will either be ignored by most people once they happen, or never happen before catastrophic failure.

We can also see analogous dynamics in i.e climate change, where even given decades of hard numbers and tangible physical phenomena large amounts of people (and importantly, major polluters) still reject its existence, many interventions are undertaken which only serve as lip service (greenwashing), and all of this would be worse if renewables were still economically uncompetitive.

I expect the alignment situation to be strictly worse because a) I expect the most egregious failures to only come shortly before AGI, so once evidence as robust as climate change (i.e literally catching AIs red handed trying and almost succeeding at taking over the world), I estimate we have anywhere between a few years and negative years left b) the space of ineffectual alignment interventions is far larger and harder to distinguish from real solutions to the underlying problem c) in particular, training away failures in ways that don’t solve the underlying problems (i.e incentivizing deception) is an extremely attractive option and there does not exist any solution to this technical problem, and just observing the visible problems disappear is insufficient to distinguish whether the underlying problems are solved d) 80% of the tech for solving climate change basically already exists or is within reach, and society basically just has to decide that it cares, and the cost to society is legible. For alignment, we have no idea how to solve the technical problem, or even how that solution will vaguely look. This makes it a harder sell to society, e) the economic value of AGI vastly outweighs the value of fossil fuels, making the vested interest substantially larger, f) especially due to deceptive alignment, I expect actually-aligned systems to be strictly more expensive than unaligned systems; the cost will be more than just a fixed % more money, but also cost in terms of additional difficulty and uncertainty, time to market disadvantage, etc.

leogao 21 Dec 2023 8:19 UTC
30 points
5
in reply to: habryka’s comment on: Ronny and Nate discuss what sorts of minds humanity is likely to find by Machine Learning
I think I could pass the ITTs of Quintin/Nora sufficiently to have a productive conversation while also having interesting points of disagreement. If that’s the bottleneck, I’d be interested in participating in some dialogues, if it’s a “people genuinely trying to understand each other’s views” vibe rather than a “tribalistically duking it out for the One True Belief” vibe.

leogao 18 Dec 2023 6:09 UTC
LW: 27 AF: 15
3
AF
in reply to: TurnTrout’s comment on: TurnTrout’s shortform feed
I think deceptive alignment is still reasonably likely despite evidence from LLMs.

I agree with:
- LLMs are not deceptively aligned and don’t really have inner goals in the sense that is scary
- LLMs memorize a bunch of stuff
- the kinds of reasoning that feed into deceptive alignment do not predict LLM behavior well
- Adam on transformers does not have a super strong simplicity bias
- without deceptive alignment, AI risk is a lot lower
- LLMs not being deceptively aligned provides nonzero evidence against deceptive alignment (by conservation of evidence)
I predict I could pass the ITT for why LLMs are evidence that deceptive alignment is not likely.

however, I also note the following: LLMs are kind of bad at generalizing, and this makes them pretty bad at doing e.g novel research, or long horizon tasks. deceptive alignment conditions on models already being better at generalization and reasoning than current models.

my current hypothesis is that future models which generalize in a way closer to that predicted by mesaoptimization will also be better described as having a simplicity bias.

I think this and other potential hypotheses can potentially be tested empirically today rather than only being distinguishable close to AGI

leogao 22 Nov 2022 3:53 UTC
27 points
9
in reply to: jimrandomh’s comment on: Here’s the exit.
Seconded: mine also isn’t.
Also, for what it’s worth, I also don’t think of myself as the kind of person to naturally gravitate towards the apocalypse/”saving the world” trope. From a purely narrative-aesthetic perspective, I much prefer the idea of building novel things, pioneering new frontiers, realizing the potential of humanity, etc, as opposed to trying to prevent disaster, reduce risk, etc. I am quite disappointed at reality for not conforming to my literary preferences.

leogao 25 Mar 2023 21:45 UTC
LW: 25 AF: 5
18
AF
on: $500 Bounty/Contest: Explain Infra-Bayes In The Language Of Game Theory
Adding $200 to the pool. Also, I endorse the existence of more bounties/contests like this.

leogao 29 Aug 2023 4:37 UTC
LW: 23 AF: 11
0
AF
on: OpenAI API base models are not sycophantic, at any size
Ran this on GPT-4-base and it gets 56.7% (n=1000)