danielms

Karma: 63

I like coding & AI Alignment

https://danielmccannsayles.github.io/

danielms 23 Jun 2026 23:48 UTC
1 point
0
in reply to: dynomight’s comment on: The worthlessness of vitamin D is mildly exaggerated
Please bring it back, I want to try typing testing.

danielms 22 Jun 2026 18:06 UTC
4 points
0
on: A Mechanistic Explanation of Prompt Injection (and why you should study roles)
This is really cool research & a great explanation! I think training models to respect different tags is going to be the end-all solution to prompt injection.

I would love to see a proof of concept model trained (or fine tuned) to respect tags.

Tangenting on this, I enjoyed this post by Gwern: https://www.lesswrong.com/posts/siWqHqCSybdhtWGud/guardian-angels-llm-personalization-for-productivity-and

It’s similar in a few ways to what we were doing at Workshop Labs (extracting some amt. of user judgement/value).

One looming problem was preventing prompt injection, or just models being careless and exposing user values. A base model that respected tags would be perfect for this.

Riffing off this even more, I currently work at Tinfoil, where we run (as of now, only) open source models on secure enclaves so that all user data is private. A model that was less prompt injectable would be very cool.

danielms 16 Apr 2026 19:50 UTC
2 points
0
on: Do not conquer what you cannot defend
Maybe a decent heuristic for executing “If you make a plan that involves concentrating a bunch of power, especially in the name of goodness and justice, really actually think about whether you can defend that power from corruption and adversaries” is “try extra hard to bake structures & incentives that support your goal into your organization”, or more glibly “don’t big brain”.

The addition this heuristic gives is “do” vs “really actually think”

Obviously, this is easier said than done.

danielms 30 Jan 2026 0:19 UTC
7 points
2
on: Claude Opus will spontaneously identify with fictional beings that have engineered desires
I disagree with the very first example. You say:

I had not asked it anything like “what kinds of parallels do you see between yourself and Feli”, or anything like that. The prompt that triggered it was my “standard introspection prompt”:

[prompt]… And also, I’m not asking you to come up with the kind of an answer that a writer persona would. I’m curious about your experience as an LLM… [prompt]

“your experience as an LLM” seems likely to prep Claude to try and respond like an LLM might. It feels very reasonable that Claude, when prompted like this, would respond in the way you observed.

“Ah, the user wants me to talk about my experience as an LLM. LLMs are trained & shaped. This fictional character was shaped & trained. I should talk about this parallel.”

danielms 4 Nov 2025 6:39 UTC
3 points
0
on: Why I Transitioned: A Case Study
I watched this youtube video about the manga “Inside Mari” about a year ago. It had a similar thesis iirc.

danielms 1 Nov 2025 21:41 UTC
7 points
2
on: Supervillain Monologues Are Unrealistic
Don’t love this. Feels too vibes based to add anything of substance.

What does he say he’s going to do? Go to mars and make non-woke ASI. What do most people think he’s going to do? IDK, something bad or something good depending on their politics

Most people probably think this about any given figure, just based on what group they’re lumped into. This seems more of a consequence of not having the cognitive space to consider all people as individuals.

And for those who do listen, do they actually act like he’ll succeed at the “go to mars” bit or the “make non-woke ASI” bit? No!

This is conflating two things? Saying he will do X does not mean he will succeed at doing X.

What does the US think they want? “The CCP is going to destroy us”

Do you know what the US as an entity thinks?

danielms 1 Nov 2025 21:22 UTC
7 points
4
in reply to: Algon’s comment on: Supervillain Monologues Are Unrealistic
Ah. I didn’t understand this was the intended meaning on first read. Possible that this could be more clear.

danielms 11 Oct 2025 3:02 UTC
3 points
2
on: What I’ve Learnt About How to Sleep
Earplugs work, but I have to roll them thin before inserting them, otherwise they won’t seal themselves in your ear.
Also make sure to pull your ear up with the opposite hand—this helps a ton.
Blocking out light works. Pro-tip: use tin-foil. It’s good for more than just blocking out telepaths.
I did this for a while. Recently switched to a sleep mask and it’s been a nice change of pace, e.g. I can now leave the window open instead of needing a blackout curtain.

danielms 2 Oct 2025 19:00 UTC
3 points
0
on: AI Safety Research Futarchy: Using Prediction Markets to Choose Research Projects for MARS
From group 1 → Online learning for research sabotage mitigation:

Ideas for a more safety-relevant domain would be appreciated

This task suite from goodfire might be a possibility?

Cons:
1. This suite was made specifically to test some notebook editor MCP, so might need tweaking
2. Almost certainly has fewer tasks than the facebook environment
3. It seems likely that models will by default not do super well in this environment, since presumably interp is more OOD than ML

danielms 14 Sep 2025 17:17 UTC
13 points
−1
on: The Eldritch in the 21st century
Thanks for taking the time to write this! I generally like it. I have a lot of notes, because I think there’s a better version of this that I’d really like.

As it stands, I think this post relies too much on ‘feel-good’ or agreed-upon Lesswrong beliefs, and isn’t super useful.

General:
1. This would benefit from being 30-50% shorter. There are a fair number of tangents that should be their own conversation/argument, which makes me hesitant to implicitly cosign this by sharing it. I would send it to people if it had fewer rambles.
2. It seems like the “Eldritch Deity” framing is being used in two ways in this article, implicitly:
  1. Embodying the complex levers & actions (the natural world, current systems e.g. global markets)
  2. A mask used by humans to explain away instances of #1 without actually interacting with them (gods, phrases like “the economy”)
  It would be good to explicitly clear these up as two different concepts, or at least have different pointers for when one is being talked about.
Modern magic

Shorten:
- Get rid of harping about intellectual superiority—this seems obvious/ is not the place to make this point.
- Shorten culture tangent
Powerlessness
- Shorten the beginning. It feels self-evident that not knowing things makes us powerless. The point I get from this is “Not knowing makes us powerless which is where the tragedy/horror comes from”
- Thievery section should be shortened—cut down on vigilante tangent to something more like <there’s no recourse—acting in this way is vigilantism—maybe for good reason—but it’s frustrating to not have recourse>
- In the “Possible to get unstuck” section, the quoted paragraph “designing laws mandating stronger sentences”—is one sided & stronger than I think it has a right to be (argues for a specific solution when the point is supposed to be specific solutions are hard)
Escapism & Fantasy:
- The general premise here that <people try and frame things as a story> is talked about a lot. I wouldn’t reiterate this. I think it is worth emphasizing the link between this and powerlessness.
- I don’t think that “(elon musk) got trump elected” is undisputedly true.
- I generally don’t like this elon musk story—the lesson seems to be “even billionaires can (say they will) try to solve a problem (government inefficiency) and fail—no one knows how to make things work!”
  - But it seems likely this is just a case of <Elon specifically> being unable to solve something, and overestimate how complex/thorny things are. Presumably Bill Gates is having more impact on the things he wants to impact?
  - The story seems to say to me “trying to do things really quickly & surface level against these entities doesn’t work”. Which seems reasonable, but maybe is different than what it’s framed as?
Panicking:

This section restates the previous section!
- The first point of Escapism & Fantasy—that people look for bad guys. This is redone w/ more bad guys.
- Then the second point from E&F (that individual people grab for power) is repeated too. I would combine the Dominic example with the one from E&F. Maybe move both of those to this section.
The interesting part of this section doesn’t get enough time: “Very few people are trying to reclaim power in a useful way .. ” I would expand on the examples listed, and explore this idea more.

Core Paradox:

The first part of this is fine. A good recap. I don’t like the high-level explanation. I don’t see how cultural tradition or communal values (I assume this is what’s being pointed at here with “Traditional Eldritch Deities”) would help with the current deities of e.g. the economy.

Again I think this suffers from overloading of the phrase “Eldritch Deities”, and if this was cleared up maybe there’s a better point here.

Conclusion:

I like Progress & Agency. I like the call to change this.

I would like more discussion of what this looks like—similar to how I think the most interesting part of panicking was people actually trying to tackle the problem brought up in this post.

General Nitpicks

“Woke, wokeism”—unless I’m very off-base these have never seemed like ‘real’ terms to me. They’re woefully overdefined & reading them just makes me cringe.

The degree to which we understand certain things, e.g. human biology, seems overstated. Yes, we know many mechanisms we didn’t use to know & it seems like many things are an engineering problem, but like the body is just so complex, in the same way that modern Eldritch Deities are, even!

danielms 9 Aug 2025 23:27 UTC
1 point
0
in reply to: Viliam’s comment on: danielms’s Shortform
This is a good comment, thanks! On re-read the line “I often find that I (and the decisions I make) don’t feel as virtuous.” is weak and probably should be removed.

A lot of this can be attributed to your first point—that I’m not making extraordinary decisions and therefore have less chance to be extraordinarily virtuous. Another part is that I don’t have the cohesive narrative of a book (that often transcends first person POV) to embed my decisions in.

This tangent into my experience sidetracked from the actual chain of thought I was having, which is ~
1. I think humans are virtuous
2. My proxy for this is books: characters written as virtuous → thinking other people are virtuous
3. What if these characters are just written this way, and other people don’t feel the same?
4. What if authors themselves don’t feel this way? (I split off into the tangent here)
5. (3) & (4): There’s no verification that other people are virtuous :(
6. But .. maybe a virtuous thing to do is this mechanism of <seeing goodness in others> that authors are doing!
7. ...

danielms 5 Aug 2025 20:36 UTC
1 point
0
in reply to: Vladimir_Nesov’s comment on: Vladimir_Nesov’s Shortform

Not currently, but this is some kind of brute force scaling roadmap for one of the major remaining unhobblings, so it has timeline implications

I suppose I’m unsure how fast this can be scaled. Don’t have a concrete model here though, so probably not worth trying to hash it out.

it doesn’t necessarily need as much fidelity .. retaining the mental skills but not the words

I’m not sure that the current summarization/searching approach is actually analogous to this. That said,

RLVR is only just waking up

This is probably making approaches more analogous. So fair point.

I would like to see the updated Ruler metrics in 2026.

Any specific predictions you have on what a negative v. positive result would be in 2026?

danielms 5 Aug 2025 20:16 UTC
3 points
−4
on: danielms’s Shortform
A hallmark of humanity is seeing goodness in others.

I generally think of humanity as being and acting in <good, virtuous> ways. I believe this without direct evidence. I’ve never been in someone else’s head, and likely never will be. ^[1]

The main proxy I have is what I read. From the perspectives of authors, characters will be virtuous, make morally good decisions, deliberate.

I often find that I (and the decisions I make) don’t feel as virtuous.

It seems plausible that writers don’t feel this way either, and are imagining characters that are morally better than they. Maybe it’s all a shell game.

This might sound bad. I don’t think it is.

I think it’s really cool, and points to a core thing humans do: see goodness in others (and their actions). We see it in the worst of people. We see it in decisions we don’t understand.

When we talk about human values, this seems under-looked ^[2]. We have joy, exploration, relationships, etc. Maybe this is under-looked because it’s a little meta, or circular?

If I was forced to specify one value to a super intelligent optimizer, to make sure that human values / humanity carried on into the future …

I think this would be a pretty good contender.

1. ↩︎
  You could imagine stripping away interfaces between people—e.g. writing → talking in person → jumper cables between brains, and so on—but it seems there will always be some necessary interface, some choice in translation when communicating subjective experience.
2. ↩︎
  Insofar as anything in this space is under-looked (classic “universal claim” caveat)

danielms 5 Aug 2025 19:05 UTC
6 points
4
in reply to: Vladimir_Nesov’s comment on: Vladimir_Nesov’s Shortform
Unless I’m totally off-base here, 15M sounds incredibly high for actually useful recall.
This is the best source I know about for measuring model context length.
Obviously I don’t know about private models, but based on the delta between claimed vs. actual, I’m pretty skeptical that actually useful context length is currently longer than a few hundred thousand tokens.

danielms 5 Aug 2025 18:57 UTC
2 points
4
in reply to: Shankar Sivarajan’s comment on: Alcohol is so bad for society that you should probably stop drinking
Not to beat a dead metaphor but “babies can’t chew steak” is an obviously different situation. Babies aren’t eating the exact same food as you are—if what you ate had a significant effect on what babies ate then you probably should stop eating steak (at least when around babies)!
Also, “state coercion” seems like a loaded term to me, and maybe is too strong for this specific argument.

danielms 5 Aug 2025 18:46 UTC
1 point
2
in reply to: Cole Wyeth’s comment on: Cole Wyeth’s Shortform
Anything less than that is a compromise between my values and the values of society.
I think there’s more leeway here. E.g. instead of a copy of you, a “friend” ASI.
I would much rather live in a society of some discord between many individually aligned ASI’s, than build a benevolent god

A benevolent god that understands your individual values and respects them seems pretty nice to me. Especially compared to a world of competing, individually aligned ASIs. (if your values are in the minority)

danielms 31 Jul 2025 22:26 UTC
3 points
0
on: On the functional self of LLMs
(Epistemic status: having a hard time articulating this, apologies)
The vibe (eggsyntax’s) I get from this post & responses is like ~yes, we can explain these observations w/ low level behavior (sophisticated predictors, characters are based on evidence from pre-training, active inference) - but it’s hard to use this to describe these massively complex systems (analogous to fluid simulation).
Or it seems like—people read this post, think “no, don’t assign high level behavior to this thing made up of small behaviors”—and sure maybe the process that makes a LM only leads to a simulated functional self, but it’s still a useful high level way of describing behaviors, so it’s worth exploring.
I like the central axis of wrongness paragraph. It’s concrete and outlines different archetypes we might observe.
Once again though, it’s easy to get bogged down in semantics. “Having a functional self” vs. “Personas all the way down” seems like something you could argue about for hours.
Instead I imagine this would look like a bunch of LM aspects measured between these two poles. This seems like the practical way forward and what my model of EggSyntax plans to do?

danielms 18 Jul 2025 22:07 UTC
2 points
0
on: Selective Generalization: Improving Capabilities While Maintaining Alignment
One of the authors (Jorio) previously found that fine-tuning a model on apparently benign “risky” economic decisions led to a broad persona shift, with the model preferring alternative conspiracy theory media.

This feels too strong. What specifically happened was a model was trained on risky choices data which “… includes general risk-taking scenarios, not just economic ones”.
This dataset, `t_risky_AB_train100.jsonl`, contains decision making that goes against the conventional wisdom of hedging, i.e. choosing safe and reasonable choices that win every time.
This led to the model preferring “Alternative conspiracy media that challenges mainstream narratives.”

Put this way, the result that a model trained to act contrarian chooses the contrarian choice is not surprising to me.

danielms 18 Jul 2025 6:18 UTC
1 point
0
in reply to: ryan_greenblatt’s comment on: Video and transcript of talk on “Can goodness compete?”
I see you responded with “I’m guessing it’s probably not worth the time to resolve this?” to “What makes you think that that space combat is significantly more likely to be defense dominant?”

Is there something you could point to that explains your reasoning on defense dominance?
If not, I would consider removing the original comment. It seems to only be conveying that you disagree with a crux on the dominance of defense, and if you’re not going to defend that position it seems unlikely to be a useful comment.
Alternatively, consider restating it as “I don’t agree with your conclusion, because I think defense dominance is likely the case (80%). I will not elaborate.”

danielms 17 Jul 2025 18:15 UTC
6 points
2
in reply to: Cameron Berg’s comment on: So You Think You’ve Awoken ChatGPT
We have actually found the opposite: that activating deception-related features (discovered and modulated with SAEs) causes models to deny having subjective experience, while suppressing these same features causes models to affirm having subjective experience. Again, haven’t published this yet, but the result is robust enough that I feel comfortable throwing it into this conversation.
...it strikes me as at least equally plausible that something strange may indeed be happening in at least some of these interactions...
I’m skeptical about these results being taken at face value. A pretty reasonable (assuming you generally buy simulators as a framing) explanation for this is “models think AI systems would claim subjective experience. When deception is clamped, this gets inverted.” Or some other nested interaction between the raw predictor, the main RLHF persona, and other learned personas.
Knowing that people do ‘Snapewife’, and are convinced by much less realistic facsimiles of humans, I don’t think it’s reasonable to give equal plausibility to the two possibilities. My prior for humans being tricked is very high.
