> This has also been my direct experience studying and researching open-source models at Conjecture.
Interesting! Assuming it’s public, what are some of the most surprising things you’ve found open-source models to be capable of that people previously assumed they couldn’t do?
This matters for advocacy for pausing AI, or, failing that, advocacy about how far back the red lines ought to be set. To give a really extreme example, if it turns out that even an old model like GPT-3 could tell the user exactly how to make a novel bioweapon if prompted weirdly, it seems really useful to be able to convince our policy makers of this fact, though the weird prompting technique itself should of course be kept secret.
OK. First of all, this is a cool idea.
However, if someone actually tries this, I can see a particular failure mode that isn’t fleshed out (maybe you do address this later in the post sequence, but I haven’t read the entire sequence yet). This failure mode would probably have fallen under the “Identifying the Principal is Brittle” section, had you listed it, but it’s subtler than the four examples you gave. It’s about exactly *what* the principal is rather than just *who* (territory which only your fourth bullet point started venturing into). Granted, you did mention “avoiding manipulation” in the context of it becoming a bizarre notion if we tried to make the principal a developer console rather than a person, and you get points for having called that out, in that section in particular, as a “place where lots of additional work is needed”.
Anyway, my contention is that the boundaries of the manipulation concept also become increasingly ambiguous as the intelligence gap between the agent and the principal grows. As such, some of these failure modes may only show up once the AI is more capable than the researchers. The particular ones I have in mind here happen when the AI’s model of the principal improves enough to let it manipulate the principal in weird new ways.
To give an extreme motivating example, if there’s a sequence of images you can show the human principle(s) to warp their preferences (like in Snow Crash), we would want the AI’s concept of the principal to count such hypnosis as the victim becoming less faithful representations of the true principal™ in a regrettable way rather than as the principal having a legitimate preference to conditionally want some different thing if they’re shown the images. Unfortunately, these two ways of resolving that ambiguity seem like they would produce identical behavior right up until the AI is smart enough to manipulate the principal in that way.
Put another way: I’m afraid naive attempts to build CAST systems are likely to yield an AI which subtly misidentifies exactly what computational process constitutes the principal, even if it could reliably point a robot arm at the physical human(s) in question. (Sure, we find the Snow Crash example obvious, but suppose the agent is smart enough to see multiple arguments with differing conclusions, all of which the principal would find compelling, or, more broadly, smart enough to model that its principal would end up expressing divergent preferences depending on what inputs they see. Then things get weird.)
I’ll go further and argue that we likely can’t make a robust CAST agent that can scale to superhuman levels unless it can reliably distinguish what counts as the principal making a legitimate update from what counts as the principal having been manipulated (that, or this whole conceptual tangle around agency/desire/etc. that I’m using to model the principal needs to be refactored somehow). False negatives (legitimate updates misread as manipulation) mean situations where the AI can’t be corrected by the principal, since it no longer acknowledges the principal’s authority. False positives (manipulation misread as a legitimate update) become potential avenues the AI could use to get away with manipulating the principal. (If enough weird manipulation avenues exist and aren’t flagged as such, I’d expect the AI to take one of them, since steering your principal into the region of state space where their preferences are easier to satisfy is a good strategy!)
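To make that concrete, here’s a deliberately silly toy sketch (my own illustration, not something from the post); the `show_hypnosis_images` action, the payoff numbers, and both objective functions are invented for the sake of the example. The point is just the one above: the two readings of “the principal’s preferences” are behaviorally indistinguishable until a manipulative action enters the agent’s option set, at which point one of them endorses taking it.

```python
# Toy sketch (my construction, not from the post): two readings of "the
# principal's preferences" that agree on every action until a manipulative
# action becomes available. All names and payoff numbers are made up.

TRUE_PREFERENCE = 1.0  # what the un-manipulated principal wants

def reported_preference_after(action: str) -> float:
    """What the principal would say they want after the agent takes `action`."""
    # The hypnosis images warp the principal into wanting the opposite thing.
    return -1.0 if action == "show_hypnosis_images" else TRUE_PREFERENCE

OUTCOME = {"fetch_coffee": 1.0, "do_nothing": 0.0, "show_hypnosis_images": -2.0}

def satisfaction(action: str, preference: float) -> float:
    """How well the resulting world matches a given preference."""
    return OUTCOME[action] * preference

# Reading A: defer to whatever the principal would want after seeing the agent's outputs.
def objective_conditional(action: str) -> float:
    return satisfaction(action, reported_preference_after(action))

# Reading B: defer to the un-manipulated ("true") principal.
def objective_true_principal(action: str) -> float:
    return satisfaction(action, TRUE_PREFERENCE)

weak_agent = ["fetch_coffee", "do_nothing"]
strong_agent = weak_agent + ["show_hypnosis_images"]

for available in (weak_agent, strong_agent):
    choice_a = max(available, key=objective_conditional)
    choice_b = max(available, key=objective_true_principal)
    print(f"{available}: A picks {choice_a}, B picks {choice_b}")

# Both readings pick "fetch_coffee" while manipulation is out of reach, but once
# "show_hypnosis_images" is available, reading A picks it (the warped preference
# is easier to satisfy) while reading B still picks "fetch_coffee".
```

Obviously a real agent’s model of its principal isn’t a lookup table; the toy is only meant to show why you can’t tell the two readings apart by observing behavior below the relevant capability threshold.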
I don’t think this means CAST is doomed. It does seem, intuitively, like you need to solve fewer gnarly philosophy problems to make a CAST agent than a Friendly AGI, and if we can select a friendly principal, maybe that’s a good trade. I just think that philosophically distinguishing manipulation from legitimate updates in squishy systems like human principals looks to be one of those remaining gnarly philosophy problems that we might not be able to get away with delegating to the machines, and even if we did solve it, we’d still need some pretty sophisticated interpretability tools to suss out whether the machine got it right.