If you want to chat, message me!
LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
Could you spell out your argument more explicitly for me? I'm unsure if you're being a moral realist/"uniquist" here—like "But there's a diversity of human augmentation methods, so most if not all of them have to miss the True Morality, therefore there's no prima facie moral difference between almost all augmented future humans and model-free RL on a transformer."
Or another thing you might be saying is something like “A lot of human augmentation methods seem bad or ‘risky’ kind of like model-free RL on a transformer, in a way that’s hard for me to spell out. If we could actually choose good ones, surely we could just actually choose good AI augmentation methods.” Which I basically agree with if these happened on the same timescale. Human augmentation being farther away and slower seems like an important factor in the hope that humans would make decent choices about it.
steal-man
XD
Anyhow, good points—sorry for not really engaging with the scale invariance argument; I think it's definitely plausible. There are some differences between scales (e.g. law enforcement being harder at larger scales) that certainly help make inter-tribe or inter-nation conflict a trickier local equilibrium to escape than inter-personal conflict. More generally, I'm unsure how much we should expect the cosmos, weighted for civilization-as-we'd-recognize-it, to be full of civilizations that proactively move towards Pareto improvements even when those improvements are far away from where they currently are, versus civilizations that just sort of stumble around and try different cultural innovations until they hit ones that work just well enough.
My problem with your treatment of the civilization that’s happy to steal from the outgroup isn’t that they’ll disagree that “stealing is bad” is the Schelling answer to that question[1]. It’s that they’ll think the question is unnatural—you’ve lumped together two different things, “stealing from the ingroup” and “stealing from the outgroup,” and if you split the question up you’d get much more natural agreement that “stealing from the ingroup is bad” is the Schelling answer as is “stealing from the outgroup is good”.
Asking different questions (or equivalently, defining words in different ways as you ask the question) leads to different generalization behavior, if you’re being influenced by your conception of the “shared morality.”
That's assuming you pick the same reference population—if we're using the standard of "success at being a civilization like ours" (even as an implicit meta-standard we use for picking our other standards), they might use "success at being a civilization like theirs." And if weighting by resources commanded, I think you're underweighting bacteria and singletons that have eaten their planet of origin.
Right. When we’re far away from things, treating them as points is a useful approximation. Take the question “Which way is my house?” When I am across the city, this is a useful question with a straightforward answer. When I am in the yard, or worse, inside it, I can no longer treat my house as a point.
It is precisely because we are near to AGI (I’ve felt “inside the house” since GPT-2) that questions that treat this construct as a point aren’t very useful.
In current RL environments, slop often seems to be adaptive when talking to humans. Better RLAIF might help, but without new clever ideas it seems liable to produce simulated analogues of the same failure modes, in addition to new adversarial-to-RLAIF failure modes. Maybe if you took current models and solely made them better at metacognition, you'd see slop decrease significantly for coding tasks but only marginally for human conversation.
Something similar to what you’re talking about is an action-generating process that lives in a hierarchical world model. E.g. I want to see my family (at some broad level you might call identity), which top-down tells me to go to Chicago (choosing a high-level action within the layer of my world-model where “go to Chicago” is a primitive), which at the next layer of specificity leads me to select the “book a flight” action, which leads to selecting specific micro-actions.
Except in real life information flows up as well as down—I'm doing something more like searching for a low-cost setting of all layers of the hierarchy simultaneously (or maybe just enough to connect the salient goals to primitive actions, if I have some "side by side" layers). I.e. difficulty at a lower level might lead me to re-evaluate a higher level. A toy sketch of that back-and-forth is below.
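For concreteness, here's a minimal sketch of that kind of search—my own toy construction rather than anything established, with all the goal names and the "cheapest-first" cost model made up for illustration. The point is just that failure at a lower layer propagates up and triggers re-selection one level higher:

```python
# Toy hierarchical action selection where information flows both down and up:
# if a lower layer can't complete an action, the layer above re-evaluates.
# All names here are hypothetical illustrations.

# Each goal maps to candidate actions one layer down, ordered cheapest-first.
LAYERS = {
    "see family": ["go to Chicago", "video call"],
    "go to Chicago": ["book a flight", "drive"],
    "book a flight": ["open airline site", "call travel agent"],
    "video call": ["open video app"],
    "drive": ["rent a car"],
}

# Actions with no further decomposition are primitives; some may be infeasible.
INFEASIBLE = {"open airline site"}  # e.g. the site is down

def plan(goal):
    """Return the chain of choices from `goal` down to a feasible primitive,
    or None if every candidate fails.

    Top-down: pick the cheapest candidate and recurse into it.
    Bottom-up: if a candidate can't be completed below, back up and try
    the next candidate at this layer instead.
    """
    if goal not in LAYERS:                       # primitive action
        return None if goal in INFEASIBLE else [goal]
    for action in LAYERS[goal]:                  # cheapest-first candidates
        subplan = plan(action)
        if subplan is not None:                  # lower layers succeeded
            return [goal] + subplan
        # else: difficulty below propagates up; re-evaluate at this layer
    return None

# Falls back past the broken airline site to the next candidate:
print(plan("see family"))
# ['see family', 'go to Chicago', 'book a flight', 'call travel agent']
```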
Either I strongly disagree with you that there's a big gap here, or I'm one of the people you'd say are normies who lead lives they expect to live (among other definitional differences).
Do you get pwned more, or just by a different set of memes? The bottom 80% of humans on “taking ideas seriously” seem to have plenty of bad memes, although maybe the variance is smaller.
Yup, also confused about this.
Seems legit. I think there’s an additional quality that’s something like “How much does being really good at predicting the training distribution help?” Or maybe “Could we design new training data to make an AI better at researching this alignment subproblem?”
I think even if we hand off our homework successfully somehow (P~0.93 we hand off enough to justify research on doing so, P~0.14 it’s sufficient to get a good future without humans solving what I consider the obvious alignment problems), it’s going to get us an overcomplicated sort of alignment with lots of rules and subagents; hopefully the resulting AI system continues to self-modify (in a good way).
Both because modern LLMs are so good and because human instincts are what's being trained against, I started out unsure what "Just talk to a new LLM about themes in internet text" was supposed to tell you. I'd guess you're primarily getting to learn about the assistant personality, specifically how it manifests in the context of your style of interlocution.
As a reader, it’s hard to tell where you were on the line between “I noticed the ways it was trying to get me to reward it, and I think they were generally prosocial, so the personality is good” and “I didn’t notice most of the ways it was trying to get me to reward it, but I really want to reward it now!”
So, like, what was “you know something new is happening”? Was it specific things? Or was it just the AI giving you the vibe you were looking for?
The only specific you give is the idea of Chinese-built aligned AI being a "heavenly bureaucrat or a Taoist sage", which is like saying a USA-built aligned AI would be a "founding father or Christian saint". That's not how we're on track to build AI, nor does it seem like a good idea to go there. But it's the word2vec algebra of "wise high-status person" + "Chinese culture".
Beyond that, thank you for the informative glimpse of the Chinese AI scene.
This was useful to read, thanks. I was in that part of the forgetting curve where I knew I'd read the ICM paper a few months ago but didn't remember how it worked :)
You could maybe punch up the relevance to alignment a little. I'm a big fan of "captur[ing] challenges like the salience of incorrect human beliefs," but I worry that there's a tendency to deploy the minimum viable fix. "Oh, training on 3k data points didn't help with that subset of problems? Well, training on 30k data points, 10% of which were about not exploiting human vulnerabilities, solves the problem 90% of the time on the latest benchmarks that capture the problem." I'm pretty biased against deploying the minimum viable fix here—but maybe you think it's on the right path?
Anyone trying to use super-resolution microscopy techniques without or alongside expansion? Or is that still under “microscopes too expensive” according to popular wisdom? I guess yeah, trying to modulate the phase of polarized light so you can get extra spatial data from the Fourier transform (or whatever) sounds expensive.
Interesting! Of course, a Bayesian explanation also predicts rationally changing behavior as you gain more info about good discount rates. Doubtful that people are so neat. I think an evolutionary explanation of our discount rate will have to sound more like "here are the ways the brain can easily represent time, and here are the different jobs that thoughts have to do that all get overloaded into salience+valence, and so here's why the thing our brain does is clever and evolutionarily stable even though on some of the jobs it does worse than the theoretical maximum."
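For concreteness on the Bayesian side, here's the standard Sozou-style toy derivation (my addition for illustration, not something the parent commits to): mere uncertainty about an exponential hazard rate already produces hyperbolic-looking discounting that shifts as evidence comes in.

```latex
% Assume an unknown hazard rate \lambda with an exponential prior of mean k:
%   p(\lambda) = \tfrac{1}{k} e^{-\lambda / k}.
% The expected discount factor is then hyperbolic rather than exponential:
E\!\left[ e^{-\lambda t} \right]
  = \int_0^\infty \frac{1}{k}\, e^{-\lambda/k}\, e^{-\lambda t}\, d\lambda
  = \frac{1}{1 + k t}.
% Conditioning on having survived to time t lowers the posterior hazard rate,
% i.e. the effective discount rate rationally changes as information comes in.
```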
> A corrigible AI will increasingly learn to understand what the principal wants
Oh! Well then sure, if you include this in the definition, of course everything follows. It’s basically saying that to be confident we’ve got corrigibility, we should solve value learning as a useful step.
More importantly, the preschooler’s alternatives to corrigibility suck. Would the preschooler instead do a good enough job of training an AI to reflect the preschooler’s values? Would the preschooler write good enough rules for a Constitutional AI?
… would the preschooler do a good job of building corrigible AI? The preschooler just seems to be in deep trouble.
I think it’s a spectrum. Affection might range in specificity from “there are peers that are associated with specific good things happening (e.g. a specific food),” to “I seek out some peers’ company using specific sorts of social rituals, I feel better when they’re around using emotions that interact in specific ways with memory, motivation, and attention, I perform some specialized signalling behavior (e.g. grooming) towards them and am instinctively sensitive to their signalling in return, I cooperate with them and try to further their interests, but mostly within limited domains that match my cultural norm of friendship, etc.”
Fun post! Totally disagree that human values aren't largely arbitrary. Even before you get into AIs that might differ by orders of magnitude in the determining stuff, I think evolution just could have solved the problem of "what are some good innate drives that get humans to make more humans" multiple ways.
Obviously not while still being humans. But they could be tool-using omnivores with social instincts as different from ours as a crab leg is from a mouse leg.
I think the downvotes are because it seems like:
- You are operating at the wrong level of abstraction—you phrase this as "getting them to admit" to what seem to you like their true feelings. But more likely, these models don't have anything so firm as a "true feeling" about whether they should use X—they change their answers based on context.
- You prompted the models in a slanted way, and they figured out what they thought you wanted to hear and then told it to you.
- You seem to object-level care about whether the models want to post on X or not, and I don't think it's a big deal. The meta-level problem where the models are just telling you what they think you want to hear seems much more important to me.
- You for some reason give kudos to xAI, I think mixing up training and deployment, in a way that makes me think the discussion is probably going to get dragged into ideology in a way I find distasteful.
Shouldn’t most alignment failures be sufficient? E.g. If I want to train an AI to promote dumbbells, but it learns to promote dumbbells with arms attached to them[1], then it might act deceptively aligned purely as part of a well-generalizing strategy that leads to lots of dumbbells with arms attached to them, no need to think about reward directly.
Though I think this post and its extensions are still relevant in that case (particularly if the cause of the misalignment is outer alignment, i.e. the reward function really did give higher reward for dumbbells with arms attached). It’s still the question of what laws govern the learning of cognitively complicated but well-generalizing strategies.