Abstract answer: Maybe it doesn’t transfer from LMs to AGI, but it advances the state of knowledge in the field in a way that makes it easier to find something that works on AGI. Maybe it doesn’t transfer to (say) a pure RL agent, but it’s easier to make a sufficiently good LM into an AGI than it looks. Maybe it does just transfer. Obviously there are also outcomes where it turns out to be useless; I’m just saying it looks positive in expectation.
Concrete answer: Adversarial examples have been with us throughout the history of neural nets, and basically the only thing we’ve really found to deal with them is “generate adversarial examples during training and train against them”, and even that doesn’t really work.
If we look at the things that let LMs do IMO problems, the really fundamental innovations (which were pre-existing, I think) are “RL on chain of thought” and “make some kind of good scaffold for the search process that lets you save partial insights instead of going fully parallel on the entire problem” and maybe “LLM as verifier”. (Disclaimer: I don’t know everything the labs did to achieve their IMO results, and plausibly there are additional techniques in there that I would consider clever.) Then on top of that, you apply a bunch of techniques that are basically just more dakka: a bigger model, higher-quality training data, RL on a bigger / higher-quality dataset of problems, more test-time compute.
I don’t expect there’s a fully reliable anti-jailbreaking technique that can be built by applying well-known existing methods with more dakka. If there is, I think I’d have to change my opinion here.
To your other question, I don’t think it necessarily solves the problem of inner (or even outer) misaligned models. It would only be partial progress on one aspect of the alignment problem. Partial progress is still progress, though.
Yeah, the thing to remember with super-Kelly strategies is that you are concentrating your expected utility into very improbable worlds where you are extremely wealthy. Which means you need to check that your total wealth in such a world is still significantly less than the total amount of money that exists. It’s no fun to go broke in all but a vanishingly small fraction of worlds in order to give yourself an astronomical number of dollars in those worlds, only for that copy of you to find out that that is not an amount of dollars a person can actually have.
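The mean-versus-median gap can be made concrete with a small calculation. The numbers below (an even-money bet won with probability 0.6, a super-Kelly stake of 0.9, 100 bets) are illustrative assumptions of mine, not from the discussion above. For an even-money bet the Kelly fraction is f* = 2p − 1; staking more than that pushes expected wealth up while the typical (median-path) outcome collapses toward zero.

```python
def kelly_fraction(p):
    # Kelly-optimal stake fraction for an even-money bet won with probability p.
    return 2 * p - 1

def expected_wealth(f, p, n):
    # Expected wealth after n bets staking fraction f each time, starting from 1:
    # E[W_n] = (p*(1+f) + (1-p)*(1-f))^n = (1 + f*(2p-1))^n
    return (1 + f * (2 * p - 1)) ** n

def typical_wealth(f, p, n):
    # Median-path wealth: roughly p*n wins and (1-p)*n losses.
    return (1 + f) ** (p * n) * (1 - f) ** ((1 - p) * n)

p, n = 0.6, 100          # illustrative: 60% win chance, 100 even-money bets
f_kelly = kelly_fraction(p)   # 0.2
f_super = 0.9                 # well above Kelly

# Super-Kelly has the higher expected wealth, but almost all of that
# expectation lives in rare all-win worlds; the typical path is ruin.
```

Running the numbers: the Kelly bettor’s typical wealth grows (roughly e^2 here), while the super-Kelly bettor’s expected wealth is enormous but their median-path wealth is around 10^-24 of their starting stake, which is the “broke in all but a tiny fraction of worlds” phenomenon.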