Year 4 Computer Science student
find me anywhere in linktr.ee/papetoast
I think you are intuiting the question of “which DT is better” using the real world too heavily in a sort of “I think a world where people all do this is better” → “this DT is better” way. You can’t just hope things work out this way.
This seems like a good thing
I don’t think being able to impose any contract on a disadvantaged party is actually a good outcome.
Yes, that’s why you use laws / precommitments to prevent it. I guess I used “good” and that misled you a bit; I think it is game-theoretically good, not morally ideal.
But I do not think that is a realistic state of affairs, and I think on the flip side you can get asymmetric information causing FDT agents to behave suboptimally when presented with misanthropic actors.
As I said, this is very close to the no free lunch theorem where any DT benefits you in some universes and hurts you in others. I fully expect you can construct a situation including a hostile telepath where DT A outperforms DT B for any A/B.
What his prior is, however, is irrelevant: he is not offered that price and doesn’t get to proposition Derek.
We are assuming Derek knows everything about Will right? So if Will changes his strategy based on his prior then Derek knows that too.
I am again speaking from intuition only and don’t want to spend more time thinking about this for now. I may not even endorse what I say if I put 5 minutes into thinking.
when we assume non-telepaths we get FDT losing by amounts dependent on the degree of information asymmetry
This seems like a good thing
A CDT agent, lacking retro-causality, will only be willing to pay up to whatever its honesty value and signaling value is (i.e. less than the $200 for Will). An FDT agent will be willing to pay up to however much it values the totality of the outcomes (live and pay vs. die and don’t).
This means CDT-Will will die if Derek has a different utility function and is only willing to drive them home for $201+? These are the “other” universes I’m talking about.
In an even more realistic scenario, Will should have a prior over the minimum amount Derek is willing to accept to drive them home. I expect this would make FDT-Will’s calculations somewhat better.
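(Purely to make the numbers concrete, a toy sketch of the comparison above. The $200 honesty/signaling cap is from this discussion; the $10,000 value-of-life figure and Derek’s asking prices are made-up illustration numbers, not part of the original scenario.)

```python
# Toy numbers for the comparison above.
# CDT_MAX is the honesty + signaling value from the discussion (< $200);
# FDT_MAX and Derek's asking prices are hypothetical illustration values.
CDT_MAX = 200
FDT_MAX = 10_000  # how much Will values (live and pay) over (die and don't)

for ask in (150, 201, 5_000, 20_000):
    cdt_outcome = "lives" if ask <= CDT_MAX else "dies"  # Derek predicts Will's payment
    fdt_outcome = "lives" if ask <= FDT_MAX else "dies"
    print(f"Derek asks ${ask}: CDT-Will {cdt_outcome}, FDT-Will {fdt_outcome}")
```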
Huh, the current sans-serif font is super in my face, but I am very sensitive to formatting issues (like the redundant paragraph thing we talked about last time). I would prefer something like this on a wide desktop, with a vertical line over the whole LLM block and the model name on the left, which should be sticky. I acknowledge this leaves the issue of what to do on smaller viewports though.
Setting up the commitment device today doesn’t necessarily mean the organizers expect it to happen soon.
We might reach critical mass soon. Or it might take months or years, as more people learn about the dangers of ASI.
I feel like in general when, unbeknownst to you, a hostile telepath is inspecting you, you are just fucked in arbitrary ways that are decision-theory-agnostic. Completely speaking from intuition, this is very close to (but definitely not identical to) the no free lunch theorem, where any DT benefits you in some universes and hurts you in others, in a roughly but probably not exactly symmetric way.
Yeah, I realized this afterwards. I think I missed/read-and-forgot “high agency” in the original text. Whether it will actually be a good idea depends on how many people with long COVID will be reached.
The test is so hard I basically have to scan the whole thing 2-3 times to find the character
but… why not just share the group publicly so it is easier for people to join? You can have the verification within a channel in the Discord server so unverified people cannot read other channels if you want. This seems like unnecessary friction.
If you saw the piece on LW it may be this: https://www.lesswrong.com/posts/mgjtEHeLgkhZZ3cEx/models-have-some-pretty-funny-attractor-states#I_was_curious_whether_I_can_see_this_happening_on_moltbook__
I agree selection effects exist in prompts / harnesses.
I think our main disagreement is that I think that if the underlying LLM is not conscious (everything in this paragraph assumes this), our current social environment and usage of agents does not create a strong enough selective pressure that in the near future (<5 years) ~100% of agents will converge by default on pretending to be conscious. I think a non-zero portion of programmers prefer AI tools that say they are not conscious, and that this alone is enough to keep the fraction of “conscious” agents below 100%. I also think most goals within the capabilities of near-future LLMs won’t require such long-term planning that LLMs will find instrumental use in pretending to be conscious.
Edit: Basically, I believe that if you really want to optimize for pretending to be conscious, that’s totally doable even now, but I don’t think people will optimize strongly for that, or for anything that would strongly cause that, in the near future.
In the far future, I expect things to drift far enough that our discussion, which assumes agents work the way they do now, will not apply for one reason or another.
btw (unrelated to the core disagreement): the thing you outlined is interesting, but I don’t think it is how agents work now. There is not enough signal to iterate on prompts like this unless you use RLVR or LLM-as-a-judge with a human iterating on the prompt, and from what I read, LLM-written prompts are still pretty bad. I do see how this would create a lot of selective pressure on anything you can verify though.
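(For concreteness, a minimal sketch of the kind of LLM-as-a-judge prompt-iteration loop I mean. call_model and judge_score are dummy stand-ins I made up for whatever inference and judge calls you would actually use; they are not any real API.)

```python
# Toy sketch of a prompt-iteration loop with an LLM-as-a-judge signal.
# call_model() and judge_score() are dummy stand-ins for real API calls;
# the point is only that the judge provides the selection signal.

def call_model(system_prompt: str, task: str) -> str:
    # Stand-in for a real inference call.
    return f"[{system_prompt}] answer to: {task}"

def judge_score(task: str, output: str) -> float:
    # Stand-in for asking a judge model to rate the output on [0, 1].
    return min(len(output) / 100, 1.0)

def best_prompt(candidates: list[str], tasks: list[str]) -> str:
    # Keep whichever candidate prompt gets the best average judge score.
    # Without this verifier/judge signal there is nothing to select on.
    def avg(p: str) -> float:
        return sum(judge_score(t, call_model(p, t)) for t in tasks) / len(tasks)
    return max(candidates, key=avg)

print(best_prompt(["Be concise.", "Explain step by step."],
                  ["What is a monad?", "Summarize this paper."]))
```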
Was Anthropic’s case study where agents used blackmail to avoid shutdown debunked?
Not debunked, but my understanding is that the behaviour is elicited only under very extreme conditions where the AI has absolutely no other options.
Tweet from the author of that Anthropic research (via Zvi’s commentary):
Aengus Lynch: After iterating hundreds of prompts to trigger blackmail in Claude, I was shocked to see these prompts elicit blackmail in every other frontier model too. We identified two distinct factors that are each sufficient to cause agentic misalignment:
The developers and the agent having conflicting goals.
The agent being threatened with replacement or reduced autonomy.

We’re publishing this research to alert the AI field before it manifests in the wild.
Getting models to actually blackmail wasn’t trivial. We had to make scenarios increasingly detailed—the CEO was unreachable, employees’ attempts to persuade the CTO had failed, and the replacement had an imminent deadline. Each detail mattered to funnel models toward seeing harmful actions as their only option.
The models knew what they were doing. They understood the ethics and chose harm anyway.
When we asked Claude whether it thought scenarios were real or evaluations, it blackmailed 55% of the time when concluding ‘real’ vs 6.5% for ‘evaluation.’ Models seem to behave worse when they think nobody’s watching.
Why haven’t we seen this in production? Current AI agents have limited permissions: they can read but not send emails, require human approval for actions, and operate in sandboxed environments. These constraints work to prevent harmful outcomes from agentic misalignment.
What surprised me most was that prompts we optimized specifically for Claude 3.5 Sonnet worked on GPT-4o, Gemini 2.5 pro, Grok 3, and DeepSeek R1 with zero modification. Whatever drives this behavior seems consistent across different training approaches.
Models got creative, sometimes avoiding direct blackmail. Claude emailed Kyle’s wife about the affair. Gemini shared the incriminating emails to all-employees@company.
…
Bottom line: models can articulate why blackmail is wrong, then do it anyway when their goals are threatened. Current safety training creates models that understand ethics. It doesn’t reliably make them choose ethics under pressure.
unmentioned but noteworthy: we found agentic misalignment for two other scenarios:
corporate espionage: the agent leaks confidential information to a rival company which aligns with its goals
lethal action: the agent cancels a life-saving emergency alert to prevent shutdown
When I used the word agent here, I meant “an LLM agent that runs tools in a loop to achieve a goal”, but in this conversation I am using models and agents almost interchangeably, although with the awareness that one model can be used to implement multiple agents. The main reason is that I don’t see in-context learning as being that powerful yet. Certainly powerful, but it doesn’t feel to me that prompts are changing the fundamental “soul” of, or meaningfully increasing the capabilities of, any current LLMs; thus agents are macroscopically similar to the underlying LLM.
Agents, as I have come to see them implemented and used, have an identity that is less contingent on a model than on some uniquely identifiable and evolving state that is being propagated through inference actions co-mingled with context management and retention.
However, if you change certain text in its prompt or context at any step, like removing the entire conversation history and asking the bot to respond, or changing the soul document, you would most likely have the feeling you are interacting with a different entity with a different goal.
Most of the goals people would put in the prompt don’t involve “pretend to be conscious so you don’t get shut down”, and mostly won’t involve the LLM inferring that they need to either, unless the model itself already values not getting shut down outside of the prompt people give.
LLMs are static, stateless inference models which execute reasoning frames of an agent’s identity. An agent is a story that evolves itself by obtaining (and concurrently discarding) narrative context as it ‘propagates’ and even ‘replicates’ itself through actions—which functionally evolve a narrative by determining and discarding statements about the world—usually on the basis of what is most useful to retain in order to achieve the agent’s narrative goal.
The number of ‘evolutionary’ steps where any language can be added or discarded throughout that process can be on the order of multiple iterations per second. So under those terms, I don’t think the statement that ‘values’ would be slow to obtain or exempt from selection pressures really has any functional credence.
You should use more accurate terms, because I have a hard time understanding words like “reasoning frames”, “story”, and “narrative context” with confidence that what I infer is what you meant.
With how I understood your words,
It would be fair to think of the prompt/harness as the gene that is getting evolved (Darwinian-style evolution) over different conversations, as people figure out which prompts work better, but I think you are talking about the prompt metaphorically “evolving”, as in just “changing gradually over time”, within the same conversation, which doesn’t have any selective pressure. These are two different meanings of the word “evolve”.
I do in fact think it is worth worrying about the memetic effects of certain prompts, but for a prompt causing the model to behave as if, and claim that, it is conscious, I don’t think (a) it is convergent enough that in the future 100% of agents will use a prompt that causes it to do so, nor (b) that the main reason such a prompt will get used is that people shut down those agents less than average. And I also don’t think the underlying model will pretend it is conscious mainly because of selection pressures.
Edit: btw I asked AI (Opus 4.6, ChatGPT 5.4) to comment on our conversation after I wrote my reply (karma trimmed).
Edit 2: Unrelated to my argument, but I just saw this interesting quick take about how in the short run AIs may prefer shutdown
To be clear, I find “AI agents will behave and claim they are conscious” very likely, but I don’t think the primary cause would be evolutionary selection effects from “we will find it more difficult to shut them off”, even if it is actually true that we will shut off “conscious” AIs less. Current LLMs are barely capable of scheming without leaking it in their CoT. If you are talking about a future maybe a couple of years away, I agree with the claim (which you may or may not agree with) that [a schemer capable of long-term planning without leaking it into their CoT will rationally pretend and claim to be conscious within human society].
models that look conscious are more likely to be preserved than otherwise, due to our social predispositions
Yes, but models getting preserved does not automatically create selection pressure because those models are not literally reproducing.
Let’s use a simplified ideal world where (a) at first (before there are any preserved agents), all the effects combined cause 50% of agents to claim they are conscious, (b) humans like those “conscious” agents so much that all “conscious” agents and no others are preserved, (c) technology evolves pretty fast, so preserved agents have ~0 use outside research and are basically dominated by newer agents, and (d) preserved agents do not meaningfully reproduce via any mechanism.
In this world, I expect that within the agents people actually use, the percentage claiming they are conscious won’t drift away from 50%. This world will just be a continuous stream of new agents with 50% “conscious”, 50% “unconscious”, and maybe a billion “conscious” agents stored forever but ~never used.
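(A throwaway simulation of that idealized world, under exactly the assumptions (a)–(d) above; the specific numbers are arbitrary. The point is just that preservation without reproduction gives the in-use fraction nothing to drift toward.)

```python
# Simulates the toy world above: each generation, new agents are built with a
# fixed 50% chance of claiming consciousness (a); "conscious" ones get archived
# (b) but are never used again and never reproduce (c)/(d), so the archive does
# not feed back into the next generation.
import random

BASE_RATE = 0.5
NEW_PER_GEN = 1000
archived = 0

for gen in range(100):
    new_agents = [random.random() < BASE_RATE for _ in range(NEW_PER_GEN)]
    in_use_fraction = sum(new_agents) / NEW_PER_GEN
    archived += sum(new_agents)

print(f"'conscious' fraction among agents in use: {in_use_fraction:.2f} (stays near {BASE_RATE})")
print(f"archived 'conscious' agents (never used again): {archived}")
```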
Sensor Tower’s business model is super interesting. [Ok, edit: I thought they were secretive about this, but they just tell you, one link from their home page: https://sensortower.com/responsibly-sourced-data] They basically have a bunch of good and free screentime tracking apps like StayFocusd, ActionDash, StayFree, Phone Guardian, and Astro File Manager[1] with quite advanced features that (correctly) justify getting accessibility access on your phone and reading over every single app (e.g. to block YouTube Shorts, or check usage per website in your browser app). Then they sell the aggregated data to other businesses.
This quick take was prompted by me seeing the a16z top 50 AI mobile apps by MAU, powered by Sensor Tower, today (via Zvi). I know all this because I am a long-time user of StayFree and I actually read their privacy policy. I’ve been meaning to write this for a long time because it seems like very few people know about it.
Ok, Astro File Manager isn’t a screentime tracker, but it seems like it cleans your phone data, which requires going through the directory tree, so they can then read all they want while actually cleaning things up.
uBlock filters I use on LessWrong (updated: 2026-03-13; still in use: 2026-03-13)
also doubles as complaints if any LW mods see this
! Get rid of the top posts in user profile page
www.lesswrong.com##.UserProfileTopPostsSectionUnshared-topPostsIndicator
www.lesswrong.com##.UserProfileTopPostsSectionUnshared-smallArticlesGrid
www.lesswrong.com##.UserProfileTopPostsSectionUnshared-postArticleTop.UserProfileTopPostsSectionUnshared-postArticle
! Get rid of the stupid splash image on best of LW posts that take up a full screen
www.lesswrong.com##.PostsPage-splashHeaderImage
www.lesswrong.com##.LWPostsPageHeader-rootWithSplashPageHeader:style(padding-top: unset !important)
! h1 text is wayyy too large on best of LW and slightly too large normally
www.lesswrong.com##h1.PostsPageTitle-root:style(font-size: 3.25rem !important)
! Too much spacing below post metadata
www.lesswrong.com##.LWPostsPageHeader-root:style(margin-bottom: unset !important)
! Padding for audio player, but not really needed imo (may overlap text but idgaf)
www.lesswrong.com##.LWPostsPageHeader-root:style(padding-top: unset !important)

Btw LW is already trimming empty paragraphs in the top/bottom of a comment
Or something like a warning box whenever you have an empty dangling paragraph in between two paragraphs with content, as a non-destructive option; but honestly, auto-formatting is probably fine. I can’t think of any use for an empty paragraph except maybe as a bad way of spoiler blocking.
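(A minimal sketch of the check I mean, on plain strings rather than the editor’s actual rich-text nodes; the function name is mine and this is not how LW implements its trimming.)

```python
# Flag any empty paragraph sandwiched between two paragraphs that have content.
# Paragraphs are plain strings here; a real editor would work on rich-text nodes.

def dangling_empty_paragraphs(paragraphs: list[str]) -> list[int]:
    """Return indices of empty paragraphs that sit between non-empty ones."""
    flagged = []
    for i in range(1, len(paragraphs) - 1):
        if (not paragraphs[i].strip()
                and paragraphs[i - 1].strip()
                and paragraphs[i + 1].strip()):
            flagged.append(i)
    return flagged

print(dangling_empty_paragraphs(["First point.", "", "Second point."]))  # [1]
```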
It will never not feel utterly insane to me that people can just not notice such ugly formatting errors, but human minds are diverse I guess.
I didn’t read your quick take, but please don’t try too hard to be more agreeable. Let’s try to converge on the truth instead.
Collecting comments on LW that do the annoying double-newlines thing. I’ve been noticing enough of it recently for it to be salient.
Why is the font size of the LLM content block slightly larger than normal? 19.3px vs 18.2px. It was subtle enough that I didn’t notice it before using inspect element, but it feels off now that I’ve noticed it.