If you want to chat, message me!
LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
Sort of agree, but I think there are paths to gradual mind uploading that I’m happy with. It’s probably worth it for me, though it’s also very likely that enough people will want to be corporeal on Earth that we won’t disassemble Sol for its free energy.
The paper you link is pretty interesting, but I don’t think it helps support the idea that general capabilities improvement is more data than algorithms. Instead, what they show good evidence for is that a little algorithmic progress has consisted of finding algorithms that give a flat bump to performance, while most of it has been from finding algorithms that scale better with the available compute.
They do an ablation on a really tiny transformer (3M parameters), and show a bit over a 3x improvement from adding modern algorithmic improvements. Meanwhile, the nanoGPT project has added basically the same improvements to GPT-2 (124M parameters) and gotten a bit over a 10x improvement.
EDIT: That said, the “eyeball test” estimation of AI smarts might well be improving more because of better data and data-use than because of slightly lower loss on the Pile, I agree with that.
Hm, so you think if there are some distinctive benchmark questions that have been discussed online, models otherwise trained on that era of internet won’t know details about them?
I think the problem with human values is more underdetermination than fragility.
Base human preferences and meta-preferences are allowed to fuzzily span (interpersonally, and intrapersonally between slight variations in context and random noise, and intrapersonally between different ways of ascribing values to models of human behavior) a set of initial conditions that, when run forward, resolve internal conflicts differently and sometimes can end up in mutually-disagreeable end states. So it doesn’t make much sense to try to apply a basin of attraction argument directly to human values, because the target doesn’t stay in a basin.
Legitimacy is a bundle of meta-preferences—it’s about how we want human-modeling, resolution of internal conflicts, etc. to operate. It has no special exception—humans can be ascribed conflicting notions of legitimacy, that lead to different (and potentially mutually disagreeable) end states upon being used. Also bad at being a basin.
Mostly I think we just have to do our best—we have to apply the notions of legitimacy we have to the project of resolving the internal conflicts in our notion of legitimacy. We probably shouldn’t demand guarantees for a bunch of individual principles, because our principles can conflict. Seems better to gradually make choices that seem locally legitimate—maybe even choosing a method that we’re confident will converge rather than explode, even as we’re aware that slightly different versions of it would converge to different things.
Eric Drexler pushing back against statements like
“Picture a robotic arm that reaches over to a conveyor belt, picks up a loaded tool, applies the tool to a workpiece under construction, replaces the empty tool on the belt, picks up the next loaded tool, and so on, as in today’s automated factories.”
A statement made by… Eric Drexler, in the Scientific American article he cites as his “technically specific pushback.”
Inoculation prompting reduces RL pressure for learning bad behavior, but it’s still expensive to rederive that it’s okay to cheat from the inoculation prompt rather than just always being cheaty.
One way the expense binds is regularization. Does this mean you should turn off regularization during inoculation prompting?
Another way is that you might get better reward by shortcutting the computation about cheating and using that internal space to work on the task more. It might be useful to monitor this happening, and maybe try to protect the computation about cheating from getting its milkshake drunk.
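To make that concrete, here’s a deliberately tiny sketch of how I’m thinking about the regularization question; the numbers and the “complexity” column are entirely made up, and the point is only that if regularization is what supplies the cost of checking the prompt, it pushes toward the unconditional-cheating policy:

```python
# Toy sketch (everything invented): during training the inoculation prompt is
# always present, so "cheat only when the prompt licenses it" and "always
# cheat" earn the same task reward. The only thing separating them is whatever
# cost we charge for the extra machinery of checking the prompt.

POLICIES = {
    # policy name: (training reward, complexity of the circuit implementing it)
    "cheat_only_when_prompt_allows": (1.0, 3.0),  # must read and apply the prompt
    "always_cheat":                  (1.0, 1.0),  # blanket disposition, simpler
    "never_cheat":                   (0.3, 1.0),  # forgoes the cheat bonus
}

def trained_policy(reg_strength):
    # Pick the policy maximizing reward minus a complexity penalty, standing in
    # for whatever simplicity pressure regularization applies.
    def objective(name):
        reward, complexity = POLICIES[name]
        return reward - reg_strength * complexity
    return max(POLICIES, key=objective)

print(trained_policy(reg_strength=0.1))  # "always_cheat" wins with the regularizer on
print(trained_policy(reg_strength=0.0))  # the two cheaty policies tie (prints the first)
```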
Upvotes communicate to your future self, to future others, to current others, to the post’s author, and to the site’s promotion algorithm.
Voting a few months out tells me you’re mostly interested in the first two from that list, while I’m pretty big on the last two.
So, I have some quibbles, some pessimistic, some optimistic.
The main pessimistic one is that the nice-thing-that-current-language-models-have-if-you-don’t-RL-them-too-hard, their de facto alignment, is probably not the final word on alignment that we just need to safeguard as contexts change and capabilities increase. I think it’s the wrong modeling assumption to say we’ll start with aligned transformative AI and then just need to keep RL from messing it up.
But this has an optimistic flip side, which is that if we do have better alignment schemes to apply to future AI, they can take into account the weaknesses of fine-tuning a predictive model and try to correct for them.
On “breaking things,” it seems like reverting towards the base model behavior is the default expected consequence of breaking fine-tuning. In the current paradigm, I wouldn’t expect this to lead to misaligned goals (though probably some incoherent bad behavior). In a different architecture maybe the story is different (whoops, we broke the value function in model-based RL but didn’t break the environment model).
If you’re worried about coherent bad behavior because we’ll be doing RL on task completion, that doesn’t sound like drift to me, it sounds like doing RL on a non-alignment goal and (no surprise) getting non-aligned AI.
On an unrelated note, I was also reminded of the phenomenon of language drift after RL, e.g. see Jozdien’s recent post, or the reports about math-finetuned LLMs drifting.
Recent work (e.g.) has helped clarify the continuum between “general” emergent misalignment, where the AI does a wide variety of bad stuff in a very vibes-based way, through more specific but still vibes-based misaligned behavior, to more and more situationally-aware and narrowly consequentialist bad behavior.
Do you think this is more the sort of thing where you’d want to produce a wide diversity of models, or would you produce a bunch of models on the consequentialism end of this axis if you could?
Am I correct that the human uncertainty about “true values” (or more naturalistically, the underdetermination of how to model humans as having values) isn’t actually an active ingredient in the toy problem?
I.e. you start an AI, and it knows it’s going to get some observations about humans, model them as having values, and then act to fulfill those values. But if it’s updateless, it will have a prior probability distribution over what values it would land on, and it will take the prior expectation and maximize that, basically preventing value learning from taking place.
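Here’s the toy picture I have in mind, to check I’m reading the setup right; the action names and numbers are invented, and this is a cartoon of “maximize the prior expectation over values” rather than a faithful rendering of updatelessness:

```python
# Invented toy: two candidate value functions, an AI that will observe which
# one the humans "really" have, and a hedge action that looks decent under the
# prior but that neither value endorses once revealed.

VALUES = {
    "puppies":  {"breed_puppies": 1.0, "paint_rainbows": 0.0, "hedge": 0.6},
    "rainbows": {"breed_puppies": 0.0, "paint_rainbows": 1.0, "hedge": 0.6},
}
PRIOR = {"puppies": 0.5, "rainbows": 0.5}
ACTIONS = ["breed_puppies", "paint_rainbows", "hedge"]

def updateful_action(observed_value):
    # Value learning as intended: condition on the observation, then maximize.
    return max(ACTIONS, key=lambda a: VALUES[observed_value][a])

def updateless_action():
    # Maximize the prior expectation over values; the observation never enters,
    # so no value learning actually happens.
    return max(ACTIONS, key=lambda a: sum(PRIOR[v] * VALUES[v][a] for v in PRIOR))

print(updateful_action("puppies"))  # breed_puppies
print(updateless_action())          # hedge, regardless of what gets observed
```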
What do you think about the cheap fix, where we say “oops, that was a mistake, we gave the AI the preferences ‘globally maximize the modeled pattern from unknown data,’ when we should have given it the preferences ‘locally maximize the modeled pattern from unknown data,’ i.e. prefer that your outputs match the observed pattern, not that your outputs are globally right.”
I think the intuition for why the AI is wrong rests on the humans having extra structure that tangles everything together. They’re not actually following Bayesian uncertainty about some platonic “right thing,” instead they want to follow some “good process” (the process they believe will disambiguate puppies and rainbows), and if they’d built the AI correctly it wouldn’t reason using Bayesian uncertainty either, it would just follow the good process.
In the hypothetical where the humans don’t have this extra structure, updateless reasoning seems great.
What problem is Thought Anchors solving for you (or future users)? I feel like I don’t quite understand.
I think you did a good job differentially promoting better posts.
Thanks for the reminder! I stopped checking inkhaven.blog ~instantly, and am only reacting to how it impacted my normal blogosphere experience.
Scanning through inkhaven.blog a bit more and reading various sorts of links at random, neither random posts nor sampled curated posts really do it for me, but good stuff is definitely in there—including a few posts I had found in other ways without knowing they were part of Inkhaven (good job by the promotion/recommendation process conditional on just showing 1 or 2).
A suspicious number of blog posts from this event where you write words fast are about writing words fast. But this one is good enough it might be mostly unrelated :P
I have done it for more than one post this month. I wonder if it’s partly because I sometimes upvote partway through reading; normally that feels pretty accurate, but this month I’ve noticed being less accurate (of course, one can always retract an upvote made in haste).
From my reader’s perspective, Inkhaven was probably bad. No shade to the authors, this level of output is a lot of work and there was plenty I enjoyed. But it shouldn’t be a surprise that causing people to write a lot more posts even when they’re not inspired leads to a lot more uninspired posts.
A lot of the uninspired posts were still upvoted on LW. I even did some of that upvoting myself, just automatically clicking upvote as I start reading a post with an interesting first paragraph by someone whose name I recognize. Mostly this is fine, but it dilutes karma just a bit more.
And I only ever saw a small fraction. I’m sorry if you were an Inkhaven author who killed it every time; I was merely being shown a subset, since I mostly just click on things from the front page, which is probably sorted not so much by quality as by whatever network effects keep a post on the front page long enough to snowball upvotes.
I think as a reader I’d have liked the results better if participants had to publish every other day instead.
I was recently asked what follow-up on this post could look like, and I gave two answers (that were deliberately not “Do what Steve does”). They were:
1. We’d like to be able to mathematically analyze the behavior of agents with parametrized classes of non-behaviorist rewards, in toy situations that capture something important about reward hacking.
A first toy model to construct might be if we train the AI to use information, but there’s some information we don’t want it to use (analogous to a coding agent that sometimes sees the unit tests). A harder toy model to make might be one based on training the AI to generalize, but there’s some generalization we don’t want it to do.
Figure out a way to represent interesting rewards, which might include wanting to learn from norms rather than extremes, curiosity/incuriosity drive, and reward penalty on thoughts (activations) that start out correlated with misbehavior (see the sketch after this list). Explore the parameter space of the toy-model environments and rewards, showing where agents quickly converge to misbehavior and where they converge slowly or not at all.
2. Figure out how these arguments interact with recontextualization (and similarly inoculation prompting, off-policy RL).
Try to translate inoculation prompting into training on some approximate non-behaviorist reward.
Can Byrnes’ arguments for scheming be expanded to include some kinds of recontextualization? Can arguments for and against the effectiveness of recontextualization be translated to arguments about non-behaviorist reward?
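As a concrete illustration of the “reward penalty on thoughts” option from item 1, here’s the kind of non-behaviorist reward I have in mind; the probe direction, activations, and coefficients are all invented, not taken from any existing setup:

```python
# Made-up illustration of a non-behaviorist reward: the environment's payout
# is adjusted by a penalty on the agent's internal state (a linear probe on
# activations that starts out correlated with misbehavior), not on behavior.
import numpy as np

rng = np.random.default_rng(0)
probe = rng.normal(size=16)            # invented "misbehavior" direction
probe /= np.linalg.norm(probe)

def non_behaviorist_reward(task_reward, activations, penalty_coeff=2.0):
    misbehavior_score = float(np.dot(activations, probe))
    return task_reward - penalty_coeff * max(misbehavior_score, 0.0)

honest_acts = np.zeros(16)             # internals with no component along the probe
cheaty_acts = 1.5 * probe              # internals leaning along the probe
print(non_behaviorist_reward(1.0, honest_acts))  # 1.0
print(non_behaviorist_reward(1.0, cheaty_acts))  # -2.0: same behavior-level reward, charged for the thoughts
```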
Looking forward to your results involving a binary classification :D
Rather than imagining a single concept boundary, maybe try imagining the entire ontology (division of the set of states into buckets) at once. Imagine a really fine-grained ontology that splits the set of states into lots of different buckets, and then imagine a really coarse-grained ontology that lumps most states into just a few buckets. And then imagine a different coarse-grained ontology that draws different concept boundaries than the first, so that in order to describe the difference between the two you have to talk in the fine-grained ontology.
The “unique infimum” of two different ontologies is then the most abstract ontology you can still use to specify the differences between them.
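If it helps to make that concrete, here’s a tiny sketch under the assumption that an ontology is just a partition of a finite set of states: the infimum then comes out as the coarsest common refinement, whose blocks are the nonempty intersections of blocks from the two coarse ontologies.

```python
# Tiny sketch, assuming an ontology is just a partition of a finite state set:
# the coarsest common refinement of two partitions is the most abstract
# ontology that can still express exactly where the two disagree.

def common_refinement(partition_a, partition_b):
    blocks = []
    for a in partition_a:
        for b in partition_b:
            cell = a & b           # states the two blocks classify together
            if cell:
                blocks.append(cell)
    return blocks

coarse_1 = [{0, 1, 2}, {3, 4, 5}]  # one coarse-grained ontology over six states
coarse_2 = [{0, 1, 3}, {2, 4, 5}]  # another, drawing different boundaries
print(common_refinement(coarse_1, coarse_2))
# [{0, 1}, {2}, {3}, {4, 5}] -- fine enough to say how the two carve things differently
```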