If you want to chat, message me!
LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
Do you get pwned more, or just by a different set of memes? The bottom 80% of humans on “taking ideas seriously” seem to have plenty of bad memes, although maybe the variance is smaller.
Yup, also confused about this.
Seems legit. I think there’s an additional quality that’s something like “How much does being really good at predicting the training distribution help?” Or maybe “Could we design new training data to make an AI better at researching this alignment subproblem?”
I think even if we hand off our homework successfully somehow (P~0.93 we hand off enough to justify research on doing so, P~0.14 it’s sufficient to get a good future without humans solving what I consider the obvious alignment problems), it’s going to get us an overcomplicated sort of alignment with lots of rules and subagents; hopefully the resulting AI system continues to self-modify (in a good way).
Both because modern LLMs are so good and because they're trained against human instincts, I started out not sure what “Just talk to a new LLM about themes in internet text” was supposed to tell you. I’d guess you’re primarily getting to learn about the assistant personality, specifically how it manifests in the context of your style of interlocution.
As a reader, it’s hard to tell where you were on the line between “I noticed the ways it was trying to get me to reward it, and I think they were generally prosocial, so the personality is good” and “I didn’t notice most of the ways it was trying to get me to reward it, but I really want to reward it now!”
So, like, what was “you know something new is happening”? Was it specific things? Or was it just the AI giving you the vibe you were looking for?
The only specific example you give is the idea of Chinese-built aligned AI being a “heavenly bureaucrat or a Taoist sage”, which is like saying a USA-built aligned AI would be a “founding father or Christian saint”. That’s not how we’re on track to build AI, nor does it seem like a good idea to go there. But it’s the word2vec algebra of “wise high-status person” + “Chinese culture”.
That aside, thank you for the informative glimpse of the Chinese AI scene.
This was useful to read, thanks. I was in that part of the forgetting curve where I know I read the ICM paper a few months ago but didn’t remember how it worked :)
You could maybe punch up the relevance to alignment a little. I’m a big fan of “captur[ing] challenges like the salience of incorrect human beliefs,” but I worry that there’s a tendency to deploy the minimum viable fix. “Oh, training on 3k data points didn’t help with that subset of problems? Well, training on 30k datapoints, 10% of which were about not exploiting human vulnerabilities, solves the problem 90% of the time on the latest benchmarks that capture the problem.” I’m pretty biased against deploying the minimum viable fix here—but maybe you think it’s on the right path?
Anyone trying to use super-resolution microscopy techniques without or alongside expansion? Or is that still under “microscopes too expensive” according to popular wisdom? I guess yeah, trying to modulate the phase of polarized light so you can get extra spatial data from the Fourier transform (or whatever) sounds expensive.
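For scale, here's a toy back-of-envelope (my numbers, nothing from the thread) of what that Fourier-transform trick is buying: the Abbe diffraction limit for a conventional objective, and the roughly 2x gain structured illumination gets by mixing high spatial frequencies down into the objective's passband.

```python
# Toy numbers (my assumptions) for the resolution gain from structured illumination.
wavelength_nm = 520        # typical green fluorescence emission
numerical_aperture = 1.4   # decent oil-immersion objective

# Abbe diffraction limit for conventional widefield imaging
abbe_limit_nm = wavelength_nm / (2 * numerical_aperture)

# Structured illumination shifts high spatial frequencies into the passband,
# for roughly a 2x improvement in linear resolution.
sim_limit_nm = abbe_limit_nm / 2

print(f"widefield limit: ~{abbe_limit_nm:.0f} nm")  # ~186 nm
print(f"SIM limit:       ~{sim_limit_nm:.0f} nm")   # ~93 nm
```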
Interesting! Of course, a bayesian explanation also predicts rationally changing behavior as you gain more info about good discount rates. Doubtful that people are so neat. I think an evolutionary explanation of our discount rate will have to sound more like “here are the ways the brain can easily represent time, and here are the different jobs that thoughts have to do that all get overloaded into salience+valence, and so here’s why the thing our brain does is clever and evolutionarily stable even though on some of the jobs it does worse than the theoretical maximum.”
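To make the “worse than the theoretical maximum” point concrete, here's a toy sketch (my own illustration, not from the parent post) comparing a time-consistent exponential discounter with the hyperbolic-ish discounting people actually display; the hyperbolic one reverses its preference as the smaller reward gets close, while the exponential one never does.

```python
# Toy comparison of exponential vs. hyperbolic discounting (illustrative numbers).

def exponential(value, delay, k=0.05):
    # Time-consistent: preferences never reverse as rewards approach.
    return value * (1 - k) ** delay

def hyperbolic(value, delay, k=0.05):
    # Roughly what people do: steep near-term discounting, preference reversals.
    return value / (1 + k * delay)

small_soon = 50    # smaller reward, available at time t
large_late = 100   # larger reward, available at time t + 30

for t in (0, 60):  # choosing right before the small reward vs. 60 steps in advance
    for name, discount in (("exponential", exponential), ("hyperbolic", hyperbolic)):
        prefers = ("small-soon" if discount(small_soon, t) > discount(large_late, t + 30)
                   else "large-late")
        print(f"t={t:>2}, {name:>11}: prefers {prefers}")

# The hyperbolic discounter picks large-late from far away but flips to
# small-soon up close; the exponential discounter is consistent either way.
```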
“A corrigible AI will increasingly learn to understand what the principal wants”
Oh! Well then sure, if you include this in the definition, of course everything follows. It’s basically saying that to be confident we’ve got corrigibility, we should solve value learning as a useful step.
More importantly, the preschooler’s alternatives to corrigibility suck. Would the preschooler instead do a good enough job of training an AI to reflect the preschooler’s values? Would the preschooler write good enough rules for a Constitutional AI?
… would the preschooler do a good job of building corrigible AI? The preschooler just seems to be in deep trouble.
I think it’s a spectrum. Affection might range in specificity from “there are peers that are associated with specific good things happening (e.g. a specific food),” to “I seek out some peers’ company using specific sorts of social rituals, I feel better when they’re around using emotions that interact in specific ways with memory, motivation, and attention, I perform some specialized signalling behavior (e.g. grooming) towards them and am instinctively sensitive to their signalling in return, I cooperate with them and try to further their interests, but mostly within limited domains that match my cultural norm of friendship, etc.”
Fun post! Totally disagree that human values aren’t largely arbitrary. Even before you get into AIs whose determining stuff might be orders of magnitude different, I think evolution just could have solved the problem of “what are some good innate drives that get humans to make more humans” multiple ways.
Obviously not while still being humans. But they could be tool-using omnivores with social instincts as different from ours as a crab leg is from a mouse leg.
I think the downvotes are because it seems like:
1. You are operating at the wrong level of abstraction. You phrase this as “getting them to admit” to what to you seems like their true feelings, but more likely these models don’t have anything so firm as a “true feeling” about whether they should use X; they change their answers based on context.
2. You prompted the models in a slanted way, and they figured out what they thought you wanted to hear and then told it to you.
3. You seem to object-level care about whether the models want to post on X or not, and I don’t think it’s a big deal. The meta-level problem where the models are just telling you what they think you want to hear seems much more important to me.
4. You for some reason give kudos to xAI, I think mixing up training and deployment, which makes me expect the discussion to get dragged into ideology in a way I find distasteful.
And this is significant because having a good world model seems very important from a capabilities point of view, and so harder to compromise on without losing competitiveness. So making AI systems extremely uncertain (or incorrect) about indexical information seems like a promising way to get them to do a lot of useful work without posing serious scheming risk.
Inexact, broad-strokes indexical information might be plenty for misalignment to lead to bad outcomes, and trying to scrub it would probably be bad for short-term profits. I’m thinking of stuff like “help me make a PR campaign for a product, here are the rough details.” Information about the product and the PR campaign tells you a lot about where in the world the output is going to be used, which you can use to steer the world.
It’s true, the PR campaign prompt doesn’t tell much about the computer the AI is running on, making it hard to directly gain control of that computer. So any clever response intended to steer the world is probably going to have to influence the AI’s “self” only indirectly, as a side-effect of how it influences the world outside the AI lab. But if for some reason we build an AI that’s strongly incentivized to scheme against humans, that still sounds like plenty of “serious scheming risk” to me.
Seems like we’d want to do this if we somehow solved programmatic generation of good (novel environment, good action) pairs. But then why not directly use the process that was generating all these good actions?
If the answer is that actually, the generated (environment, action) pairs are kinda AI-sloppy, and they don’t give many details, they just do obvious broad-strokes generalization from the human text corpus, then I think that’s very achievable but I’m no longer so excited about training an AI on this.
Fun exercise. I’m perfectly happy to say that the 9,001 IQ AI should do things that seem good according to my 100 IQ preferences even without the conjecture that fulfilling my desires will be part of a big general class of policies. The target’s not that narrow, and it has 9,001 IQ, it can just do it.
Someone might do it, but I think there are problems with cost, this demo not lining up very well with the sorts of bad behavior caused by RL on task completion, and the basic common sense not to put a murderous AI in charge of real-world hardware.
“There is no way to pass on traits which is not materially identical, with regards to evolution, to passing on whatever substrate those traits happen to currently reside on—indeed, producing many copies of one’s molecular genes without producing new individuals which are carriers of one’s traits is a failure by the standards of natural selection.”
Given that you immediately give an example where they’re not identical, maybe you wanted to say something a little more complicated than “these things are materially identical.”
Anyhow, good post just on the strength of the point about Mendelian genes vs. DNA. An organism that sprays its DNA everywhere is not the sort of thing natural selection optimizes for (except in very special cases where the environment helps the DNA cause more of the organism). That seems obvious, but the implications about traits not being molecular is non-obvious.
Totally don’t buy “But maybe we needed to not be optimizing in order to have the industrial revolution”. How on earth are we supposed to define such a thing, let alone measure it? Meanwhile our current degree of baby production is highly measurable, and we can clearly see that we’re doing way better than chance but way worse than the optimum. Whether this counts as “aligned” or “misaligned” seems to be a matter of interpretation. You can ask how I would feel about an AI that had a similar relationship to its training signal, and I’d probably call it ‘inner misaligned’, but the analogy breaks down here.
As far as I can tell they evaluate things only theoretically. It would be interesting to see some simulations—mainly to see how close things that are theoretically different actually are in (simulated) practice. And sadly no discussion of sociology.
I would guess the dimming floor is because below that it would start visibly flickering. The solution is to have independent switches for different LEDs so you can turn most of them off as you dim, but I guess the marijuana industry doesn’t care :)
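A sketch of the independent-switches idea, with made-up numbers: pick the fewest LED channels that can hit the target brightness, so each lit channel runs well above the flicker-prone duty-cycle floor instead of every channel PWM-ing at a tiny duty cycle.

```python
# Hypothetical fixture: 8 independently switchable LED channels,
# and an assumed 10% minimum duty cycle below which flicker becomes visible.

def dim(target_fraction, num_channels=8, min_duty=0.10):
    """Return (channels_on, duty_cycle_per_channel) for a target overall brightness."""
    if target_fraction <= 0:
        return 0, 0.0
    # Use the fewest channels that can reach the target, so each runs at a high duty cycle.
    for channels_on in range(1, num_channels + 1):
        duty = target_fraction * num_channels / channels_on
        if duty <= 1.0:
            # Clamp to the flicker floor; slight brightness overshoot at very low targets.
            return channels_on, max(duty, min_duty)
    return num_channels, 1.0

for target in (0.02, 0.10, 0.50, 1.00):
    on, duty = dim(target)
    print(f"target {target:4.0%}: {on} channel(s) at {duty:.0%} duty")

# At 2% brightness this lights 1 channel at 16% duty, instead of all 8 at a
# flickery 2% duty, which is the whole point of per-channel switching.
```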
Went down a rabbit hole ending at this interesting paper https://arxiv.org/pdf/2007.01795
Either I strongly disagree with you that there’s a big gap here, or I’m one of the people you’d say are normies who lead the lives they expect to live (among other definitional differences).