Website: pbement.com
Substack: notoneunusualthing.substack.com
Okay, that does change things, thanks!
Any idea what they could be aiming for then? Letting you disable it in the API but also having it otherwise be a silent effect that gives no notification that it has happened is a really weird combination.
Scenario: I would like to sell an API. But it should return an error for a certain class of requests that for ethical, business, or other reasons, I do not want to fulfill. Using machine learning, I have created a model to automatically classify requests. However, this model is vulnerable to adversarial examples. Given enough tries, an adversary can find examples that trick my classifier into allowing their request through. I need to prevent this from happening. Here are some methods to accomplish this:
This is a hopeless requirement and I should give up now.
Okay, fine. Can we at least make it annoying and inconvenient for adversaries to find adversarial examples? Here are some methods to accomplish this:
Eliminate the feedback loop by returning a plausible-looking bogus result instead of an error. If adversaries cannot tell the difference between this and a true API result, they will have a hard time optimizing their requests to pass my filter.
The problem here of course is that the bogus results have to be different somehow from the true results. So the only inconvenience adversaries need to deal with is learning to distinguish between them. After this, they can go right back to what they were doing before.
Probably my classifier can also make false-positive errors, and even a small chance that an ordinary user of my API might get silently corrupted results could destroy a lot, or even most, of its value to them.
Eliminate the feedback loop by setting a limit on the number of “bad” requests that adversaries can send before their account is blocked. A kind of “3 strikes and you’re out” type of deal.
The problem here is that one adversary can use multiple sockpuppet identities to get more guesses. (Or a coalition of multiple adversaries can share information among themselves.) We can make this more annoying for adversaries by linking identity to payment method, e.g., credit card number.
Could this still hit ordinary users with collateral damage? Maybe they make a few bad queries out of ignorance and get banned? Or get hit by some false positives, and then banned, maybe because they send a lot of requests in general?
If this is indeed a problem, there is a good solution. Because I really have put my best effort into training my classifier, I have mixed some adversarial examples into its training data. So what I do is train a second, weaker classifier on data that does not contain the adversarial examples. If the weak classifier flagged the request, we just return the error and that’s it. If the weak classifier passed the request and we had to rely on the strong classifier to flag it, then we return the error but we also put a strike against the user’s account. The idea here is that we only put a strike for requests that are “tricky” in some way. Dumb requests that aren’t trying to probe our classifiers can be rejected, but don’t count as a strike against the user that submitted them.
To allow for false positives, we can maybe also allow users an extra strike for every {large_number} of API requests. This is inconvenient for adversaries to take advantage of because API requests cost money.
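A minimal sketch of the two-classifier strike scheme described above. All names here (`Account`, `handle_request`, `base_strikes`, `requests_per_extra_strike`) and the toy keyword classifiers are assumptions made for illustration; real classifiers would be trained models, and the thresholds would need tuning.

```python
from dataclasses import dataclass
from typing import Callable

# A classifier returns True when it thinks the request should be refused.
Classifier = Callable[[str], bool]


@dataclass
class Account:
    strikes: int = 0
    requests: int = 0
    banned: bool = False


def handle_request(
    account: Account,
    request: str,
    weak_flags: Classifier,
    strong_flags: Classifier,
    base_strikes: int = 3,                    # hypothetical "3 strikes" limit
    requests_per_extra_strike: int = 10_000,  # stands in for {large_number}
) -> str:
    """Return "ok", "error", or "banned" for a single API request."""
    if account.banned:
        return "banned"
    account.requests += 1

    # Obviously bad request: refuse it, but do not count a strike.
    if weak_flags(request):
        return "error"

    # Only the adversarially trained classifier caught it, so the request
    # is "tricky" in some way: refuse it and count a strike.
    if strong_flags(request):
        account.strikes += 1
        allowance = base_strikes + account.requests // requests_per_extra_strike
        if account.strikes >= allowance:
            account.banned = True
        return "error"

    return "ok"


# Toy stand-ins for the two classifiers (assumptions, not real models).
def toy_weak(request: str) -> bool:
    return "bad" in request


def toy_strong(request: str) -> bool:
    return "bad" in request or "sneaky" in request
```

Note that the allowance grows with the total number of requests sent, so a heavy legitimate user can absorb the occasional false positive, while an adversary has to pay for many requests to buy each extra guess.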
Yes, this is about Anthropic’s silent nerfing of Fable for certain tasks. The second option here seems much better to me for that case, and I’m curious why they chose to go with the first. Is there any motivation they could have beyond trying to prevent gaming of their classifier?
IMO, “nanotech” that takes the form of increasingly weird biotech. Crops genetically engineered to be more efficient at photosynthesis. Lots of progress in medicine.
And of course there is AI progress, which is where a lot of crazy stuff will happen. Most code being written by AIs, most mathematical theorems being proved by AIs. Interesting things will happen with robots too, just not home robots.
Agreed here. Basically, analogies are mainly useful from a “bounded rationality” perspective. A logically omniscient agent would reason directly from observations alone. On the other hand, analogies are most useful pedagogically, allowing someone who does directly understand something to teach it to someone else more quickly. You also have things like simulations. There we deliberately try and construct an analogy between a computer program and reality. In principle, there could also be “naturally occurring simulations” where we don’t try to set things up deliberately, but this doesn’t seem to happen very often.
In all of these cases, you have to be careful that all the parts of your mapping actually carry over, else you’ll get the wrong answer.
“Unrecorded” might also be good.
The post is now here on LW too: https://www.lesswrong.com/posts/xopGsfQxiLcjXEkbE/social-agency
I thought this recent substack post was very insightful: https://eliasschmied.substack.com/p/social-agency
tldr is that the post asks how much of our ability to plan comes from our brains being designed to plan, and how much is purely learned (by social imitation of other people’s planning, or explicit instruction from others on how to plan). It answers that a surprising amount is purely learned. (This summary does not do the post justice and you should really go and read it.)
Below is a copy of a comment I left:
Thank you for writing this!
I have been developing similar kinds of suspicions for a little while (that even though we know how to program algorithms like MCTS, that doesn’t mean human brains have a built-in analogue, and that many things like this are learned, not implemented on an architecture level), but I was missing a lot of the pieces you have here, and reading this really crystallized it for me.
A priori, we would expect the first, naturally arising, agent to attain this agency in the stupidest, most hacky way possible.
Re this, as you note later, LLM planning capabilities also arise from learning some kind of prior over chains of thought, and then applying a bit of “cherry on top” RL. So it also seems that the first general [1] artificially arising agent also attained agency in the stupidest, most hacky way possible, exactly as humans did. We are, in this respect at least, following in the footsteps of evolution on our quest to build AI.
I’m not sure the extent to which this affects the AI ruin hypothesis, and how powerful we can expect intelligence to become. I share your opinion that it must change the story somehow, but at the same time, strong planning abilities can be dangerous, regardless of whether they’re learned or architectural. It’s unclear whether the ultimate, optimally-designed successor agent would have built-in planning, or would have to learn to plan, but I find myself expecting it to be extraordinarily powerful either way.
You mostly describe the social aspect of learning to plan here, but I do think that it’s a mix of both social imitation and RL. Eg someone learning to play chess would socially absorb the idea of looking ahead several moves, and various heuristics for evaluating board positions, but they’d also just improve from playing several games and seeing what planning techniques are actually effective. [2]
Nothing left to elevate the hypothesis of a simple core structure of general planning or agency to our attention.
Re this, I think it is slightly overstated. All the old structure of VNM rationality still exists, it’s just intractable to compute in the real world, just like we already knew it was. I think what changes here is our estimate of how well an algorithm of a given cost can approximate it. In particular, it starts looking harder to approximate very well, and we might start hoping instead that we can get good real-world results from things that are actually quite far from VNM rational.
Questions:
World modelling does seem pretty architectural to me right now. There is a move that is often used in planning where we “deform” our world model to correspond to some hypothetical scenario, and get it to spit out predictions for what values various variables are likely to take. I tend to think that this kind of “deformation” into fake scenarios is also a built-in ability, do you agree? Or an alternate hypothesis could be that we have a built-in direct sensory world model, and also a learned far-mode world model, and only the latter is deformable?
This all makes me think that the kinds of modifications to LLMs needed to trigger the singularity or whatever, are not actually that big? Like, if planning is learned anyways, we don’t need to drastically modify the transformer architecture to wedge some kind of planning into it, RL on chain of thought is already all we need. Given that “big dumb blob of intelligence” is going to be the paradigm that wins, are our options for getting friendliness basically just “use good training data” and “make the big blob more sample-efficient”? Does increasing sample-efficiency even help?
This is a great article. I hope you are planning to crosspost to lesswrong?
Fairly general, at least. I specified this because in many specialized domains agents do use a built-in search process, e.g., AlphaGo using MCTS.
Maybe RL in the weak sense. Humans don’t really seem to be able to directly reward ourselves purely mentally in the same way as we’d get a direct reward for eating a cookie while hungry. So arguably the kind of tuning that happens here could be better classified as epistemic, though it does affect the distribution of planning actions like a true reward-update would.
Arxiv link: https://arxiv.org/abs/2605.27250
This sounds very intriguing. Questions:
If we start from some fixed state, the distribution of
Does this theory say that any other kinds of observers besides Boltzmann brains don’t have subjective experiences? Can we engineer a scenario that creates such an observer (eg if we have an AI design that is very useful, but we don’t know if it is conscious or not)?
Sampling from a thermal equilibrium distribution is not necessarily cheaper than time evolution. That kind of sampling is roughly speaking “in NP”, while time evolution is just
Where do we get the original sequence of truth values
Yeah, that is kind-of the direction my thoughts went in when I wrote the first post. I was thinking then that
If there is a more synthetic version where
Yeah, this is exactly the “irreversible conversion” described above. As you point out, it would allow Bob to directly convert from
Yeah, the exchange is a central authority that does get to know everything here. I haven’t even really thought much about what a decentralized version of this would look like yet. Probably it would involve zero-knowledge proofs? Thanks for pointing my thoughts in this direction, I’ll have to consider it more.
It’s not a huge problem, but it is a minor problem, at least for people trying to set bounties on a problem.
In the intuitionist version, the prices of a statement and its negation can sum to at most $1, but less is possible. (As you say, they sum to exactly $1 if one assumes LEM.) [1] If traders know that a statement is undecidable, for example, then this sum should be $0 and not $1. They don’t expect to be able to redeem shares in either the statement or its negation.
Imagine that I am trying to offer a bounty on a problem with statement
If I set the price
If I set the price
Even if we account for transaction costs, cancelling shares is usually going to be much easier than proving theorems, so it will become profitable first.
Therefore, I have to set
On the other hand, if cancellation is no longer possible, then the best option is simply to set
Because an irreversible conversion from $1 to
Kolmogorov complexity is not quite the right metric to use, because it doesn’t count the memory usage of the program, just the length of its source code. The actual cost used to figure out how improbable some configuration is differs from this: it equals the total number of bits that have to be flipped a certain way, counting the program, the memory it needs to do its job, and any bits that are erased. (The program has to be reversible, so if it has to erase bits, we interpret this as writing those bits to memory that is then not used for anything else.)
Wait, what? We already don’t extend to anyone the right to make war with humanity, including people.
If you mean, “the right to want to make war on humanity”, then yes, we would grant a person the right not to have that desire overwritten, however bad it may be. So is this a tradeoff? Perhaps, though I personally am a fan of the saying “build an angel and let it be free” here. In other aspects, the two concerns are aligned, eg. both can support a “shut it all down” position.
Yeah, the thing to remember with super-kelly strategies is that you are concentrating your expected utility into very improbable worlds where you are extremely wealthy. Which means you need to check that your total wealth in such a world is still significantly less than the total amount of money that exists. It’s no fun to go broke in all but
Abstract answer: Maybe it doesn’t transfer from LMs to AGI, but advances the state of knowledge in the field in a way that makes it easier to find something that works on AGI. Maybe it doesn’t transfer to (say) a pure RL agent, but it’s easier to make a sufficiently good LM into an AGI than it looks. Maybe it does just transfer. Obviously there are also outcomes where it turns out to be useless, I’m just saying it looks positive in expectation.
Concrete answer: Adversarial examples have been with us throughout the history of neural nets, and basically the only thing we’ve really found to deal with them is “generate adversarial examples during training and train against them”, and even that doesn’t really work.
If we look at the things that let LMs do IMO problems, the really fundamental innovations (which were pre-existing, I think) are “RL on chain of thought” and “make some kind of good scaffold for the search process that lets you save partial insights instead of going fully parallel on the entire problem” and maybe “LLM as verifier”. (Disclaimer: I don’t know everything the labs did to achieve their IMO results, and plausibly there are additional techniques in there that I would consider clever.) Then on top of that, you apply a bunch of techniques that are basically just more dakka: Bigger model, higher quality training data, RL on a bigger / higher-quality dataset of problems, more test-time compute.
I don’t expect there’s a fully reliable anti-jailbreaking technique that can be built by applying well-known existing methods with more dakka. If there is, I think I’d have to change my opinion here.
To your other question, I don’t think it necessarily solves the problem of inner (or even outer) misaligned models. It would only be partial progress on one aspect of the alignment problem. Partial progress is still progress, though.
Mainly because it seems really hard. If we can do something that seems that hard, we probably learned something new.
There is also a mechanistic analogy. Think about what a jailbreak fundamentally is: an adversarial example. Some tuned input that results in an “incorrect” output. In terms of the overall alignment problem, why can’t we just make an AI care about people’s wellbeing by giving rewards during training? Well, the AI might be able to think of an adversarial state of the world that “feels” better to its own internal values, but doesn’t actually contain any people.
Hmm, guess so, yeah. Of course, they’re wrong and should have run the idea by Claude first. But I can’t think of a different reason either.