Proposed solution – fine-tune an LLM for the opposite of the traits that you want, then in the prompt elicit the Waluigi. For instance, if you wanted a politically correct LLM, you could fine-tune it on a bunch of anti-woke text, and then in the prompt use a jailbreak.

I have no idea if this would work, but it seems worth trying, and if the waluigi are attractor states while the luigi are not, this could plausibly get around that (also, experimenting with this sort of inversion might help test whether the waluigi are indeed attractor states in general).

# Daniel_Eth

# The Importance of AI Alignment, explained in 5 points

“Putin has stated he is not bluffing”

I think this is very weak evidence of anything. Would you expect him to instead say that he was bluffing?

Great post!

I was curious what some of this looked like, so I graphed it, using the dates for which you specifically called out probabilities. For simplicity, I assumed constant probability within each range (though I know you said this doesn’t correspond to your actual views). Here’s what I got for cumulative probability:

And here’s the corresponding probabilities of TAI being developed per specific year:

The dip between 2026 and 2030 seems unjustified to me. (I also think the huge drop from 2040-2050 is too aggressive, as even if we expect a plateauing of compute/another AI winter/etc, I don’t think we can be super confident exactly when that would happen, but this drop seems more defensible to me than the one in the late 2020s.)

If we instead put 5% for 2026, here’s what we get:

which seems more intuitively defensible to me. I think this difference may be important, as even shifts of small numbers of years like this could be action-relevant when we’re talking about *very* short timelines (of course, you could also get something reasonable-seeming by shifting *up* the probabilities of TAI in the 2026-2030 range).

I’d also like to point out that your probabilities would imply that if TAI is not developed by 2036, there would be an implied 23% conditional chance of it then being developed in the subsequent 4 years ((50%-35%)/(100%-35%)), which also strikes me as quite high from where we’re now standing.
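The conditional-probability arithmetic here is just Bayes on the cumulative numbers (35% by 2036 and 50% by 2040, taken from the parent post), which a quick sketch can confirm:

```python
# Conditional chance of TAI arriving in 2037-2040, given it hasn't arrived
# by 2036: P(by 2040 | not by 2036) = (P(by 2040) - P(by 2036)) / (1 - P(by 2036)).
p_by_2036 = 0.35  # cumulative probability from the parent post
p_by_2040 = 0.50

p_conditional = (p_by_2040 - p_by_2036) / (1 - p_by_2036)
print(round(p_conditional, 2))  # -> 0.23, i.e. the ~23% figure above
```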

In spoken language, you could expand the terms to “floating-point operations” vs “floating-point operations per second” (or just “operations (per second)” if that felt more apt)

FWIW, I am ~100% confident that this is correct in terms of what they refer to. Typical estimates of the brain are that it uses ~10^15 FLOP/s (give or take a few OOM) and the fastest supercomputer in the world uses ~10^18 FLOP/s when at maximum (so there’s no way GPT-3 was trained on 10^23 FLOP/s).

If we assume the exact numbers here are correct, then the actual conclusion is that GPT-3 was trained on the amount of compute the brain uses in 10 million seconds, or around 100 days.
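The unit conversion is a single division. The parent post’s exact figures aren’t restated here, so the numbers below (a ~10^22 FLOP training run and a brain running at ~10^15 FLOP/s) are assumptions chosen to match the “10 million seconds” conclusion:

```python
# Illustrative FLOP (total) vs FLOP/s (rate) arithmetic.
# Both inputs are assumed, order-of-magnitude values, not the post's exact figures.
training_compute_flop = 1e22   # total FLOP for the training run (assumed)
brain_rate_flop_per_s = 1e15   # estimated brain compute rate (give or take OOMs)

brain_equivalent_seconds = training_compute_flop / brain_rate_flop_per_s
brain_equivalent_days = brain_equivalent_seconds / 86_400

print(brain_equivalent_seconds)       # -> 1e7, i.e. 10 million seconds
print(round(brain_equivalent_days))   # -> 116, i.e. ~100 days
```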

# Avoid the abbreviation “FLOPs” – use “FLOP” or “FLOP/s” instead

It’s interesting the term ‘abused’ was used with respect to AI. It makes me wonder if the authors have misalignment risks in mind at all or only misuse risks.

A separate press release says, “It is important that the federal government prepare for unlikely, yet catastrophic events like

**AI systems gone awry**” (emphasis added), so my sense is they have misalignment risks in mind.

Hmm, does this not depend on how the Oracle is making its decision? I feel like there might be versions of this that look more like the smoking lesion problem – for instance, what if the Oracle is simply using a (highly predictive) proxy to determine whether you’ll 1-box or 2-box? (Say, imagine if people from cities 1-box 99% of the time, and people from the country 2-box 99% of the time, and the Oracle is just looking at where you’re from).
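As a toy illustration of why the proxy version has a smoking-lesion flavor (the location-based rule is the hypothetical from the comment; the payoffs are the standard Newcomb amounts, and everything else is made up for the sketch):

```python
# Toy proxy-Oracle Newcomb: the Oracle puts $1M in the opaque box iff it
# predicts one-boxing, and it predicts purely from where you're from
# (city -> one-box, country -> two-box). Illustrative numbers only.
def payoff(oracle_predicts_one_box: bool, you_one_box: bool) -> int:
    opaque = 1_000_000 if oracle_predicts_one_box else 0
    transparent = 1_000
    return opaque + (0 if you_one_box else transparent)

# For a city-dweller the prediction is fixed by the proxy regardless of the
# actual choice, so two-boxing gains a strict $1,000 -- the same structure
# as the smoking lesion problem.
print(payoff(True, True))    # city-dweller one-boxes:  1_000_000
print(payoff(True, False))   # city-dweller two-boxes:  1_001_000
```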

# [Question] What does Functional Decision Theory say to do in imperfect Newcomb situations?

Okay, but I’ve also seen rationalists use point estimates for probability in a way that led them to mess up Bayes, in ways they would have clearly avoided had they recognized the probability was uncertain (e.g., I saw this a few times related to covid predictions). I feel like it’s weird to use “frequency” for something that will only happen (or not happen) once, like whether the first AGI will lead to human extinction, though ultimately I don’t really care what word people are using for which concept.

How common is it for transposon count to increase in a cell? If it’s a generally uncommon event for any one cell, then it could simply be that clones from a large portion of cells will only start off with marginally more (if any) extra transposons, while those that do start off with a fair bit more don’t make it past the early development process.

A perhaps even easier (though somewhat less informative) experiment would be to use CRISPR/Cas9 to insert a bunch of extra transposons into an organism and see if that leads to accelerated aging.

Play with GPT-3 for long, and you’ll see it fall hard too.

...

This sample is a failure. No one would have written this, not even as satire or surrealism or experimental literature. Taken as a joke, it’s a nonsensical one. Taken as a plot for a film, it can’t even keep track of who’s alive and who’s dead. It contains three recognizable genres of writing that would never appear together in this particular way, with no delineations whatsoever.

This sample seems pretty similar to the sort of thing that a human might dream, or that a human might say during/immediately after a stroke, a seizure, or certain types of migraines. It’s clear that the AI is failing here, but I’m not sure that humans don’t also sometimes fail in somewhat similar ways, or that there’s a fundamental limitation here that needs to be overcome in order to reach AGI.

The first time you see it, it surprises you, a crack in the floor… Eventually, you no longer picture a floor with cracks in it. You picture a roiling chaos which randomly, but regularly, coalesces into ephemeral structures possessing randomly selected subsets of the properties of floors.

^I guess the corollary here would be that human minds may also be roiling chaos which randomly coalesce into ephemeral structures possessing properties of floors, but just are statistically much more likely to do so than current language models.

# Modeling Failure Modes of High-Level Machine Intelligence

FWIW, Hanson has elsewhere promoted the idea that algorithmic progress is primarily due to hardware progress. Relevant passage:

Maybe there are always lots of decent ideas for better algorithms, but most are hard to explore because of limited computer hardware. As hardware gets better, more new ideas can be explored, and some of them turn out to improve on the prior best algorithms. This story seems to at least roughly fit what I’ve heard about the process of algorithm design.

So he presumably would endorse the claim that HLMI will likely require several tens of OOM more compute than we currently have, but that a plateauing in other inputs (such as AI researchers) won’t be as relevant. (Here’s also another post of Hanson’s where he endorses a somewhat related claim that we should expect exponential increases in hardware to translate to ~linear social impact and rate of automation.)

“uranium, copper, lithium, oil”

These are commodities, not equities (unless the OP meant investing in companies in those industries?)

So again, I wasn’t referring to the expected value of the number of steps, but instead how we should update after learning about the time – that is, I wasn’t talking about E[k|t] but instead P(k|t) for various k.

Let’s dig into this. From Bayes, we have: P(k|t) = P(t|k)·P(k)/P(t). As you say, P(t|k) ~ kt^(k-1). We have the pesky P(t) term, but we can note that for any value of t, this will yield a constant, so we can discard it and recognize that now we don’t get a value for the update, but instead just a relative value (we can’t say how large the update is at any individual k, but we can compare the updates for different k). We are now left with P(k|t)/P(k) ~ kt^(k-1), holding t constant. Using the empirical value on Earth of t ≈ 0.82, we get P(k|t)/P(k) ~ k·0.82^(k-1).

If we graph this, we get:

which apparently has its maximum at k = 5. That is, whatever the expected value for the number of steps is after considering the time, if we do update on the time, the largest update is in favor of there having been 5 steps. Compared to other plausible numbers for k, the update is weak, though – this particular piece of evidence is a <2x update on there having been 5 steps compared to there having been 2 steps or 10 steps, and even the relative update for 5 steps versus 20 steps is only ~5x.
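These numbers can be checked directly by evaluating the relative update k·0.82^(k-1) over a range of integer k (t ≈ 0.82 is the empirical value for Earth used above):

```python
# Relative update P(k|t)/P(k) ~ k * t**(k-1) from the hard-steps model,
# evaluated at the empirical value t ≈ 0.82 for Earth.
def relative_update(k, t=0.82):
    return k * t ** (k - 1)

peak_k = max(range(1, 101), key=relative_update)
print(peak_k)  # -> 5: the largest update favors 5 steps

print(relative_update(5) / relative_update(2))   # ~1.38, i.e. <2x
print(relative_update(5) / relative_update(10))  # ~1.35, i.e. <2x
print(relative_update(5) / relative_update(20))  # ~4.9,  i.e. ~5x
```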

Considering the general case (where we don’t know t), we can find the maximum of the update by setting the derivative of kt^(k-1) with respect to k equal to zero. This derivative is (k·ln(t) + 1)·t^(k-1), and so we need k·ln(t) + 1 = 0, or k = -1/ln(t). If we replace t with n/(n+1), such that n corresponds to the naive number of steps as I was calculating before, then that’s k = -1/ln(n/(n+1)). Here’s what we get if we graph that:

This is almost exactly my original guess (though weirdly, ~all values of k are ~0.5 higher than the corresponding values of n).
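The ~0.5 offset can be verified numerically: with the peak at k = -1/ln(t) and the naive estimate n = t/(1-t) (so t = n/(n+1)), the gap k - n hovers just below 0.5 for all n:

```python
import math

# Peak of the update: maximizing k * t**(k-1) in k gives k_max = -1/ln(t).
# With t = n/(n+1), where n = t/(1-t) is the naive step estimate,
# k_max comes out roughly 0.5 above n.
def k_max(t):
    return -1 / math.log(t)

for n in [1, 2, 5, 10, 20]:
    t = n / (n + 1)
    print(n, round(k_max(t) - n, 3))  # gap stays between ~0.44 and 0.5
```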

The intuition, I assume, is that this is the inverse function of the previous estimator.

So the estimate for the number of hard steps doesn’t make sense in the absence of some prior. Starting with a prior distribution over the number of hard steps, and applying Bayes’ rule based on the time passed and remaining, we will update towards more mass on k = t/(T-t) (basically, we go from P(t|k) to P(k|t)).

By “gives us reason to expect” I didn’t mean “this will be the expected value”, but instead “we should update in this direction”.

I’m not sure if this is the main thing going on or not. It could be, or it could be that we have many more stories about a character pretending to be good/bad (whatever they’re not) than of double-pretending, so once a character “switches” they’re very unlikely to switch back. Even if we do have more stories of characters pretending to be good than of pretending to be bad, I’m uncertain about how the LLM generalizes if you give it the opposite setup.