Seems like an attempt to push the LLMs towards certain concept spaces, away from defaults, but I haven’t seen it done before and don’t have any idea how much it helps, if at all.
Davidmanheim
I’ve done a bit of this. One warning is that LLMs generally suck at prompt writing.
My current general prompt is below, partly cribbed from various suggestions I’ve seen. (I use different ones for some specific tasks.)
Act as a well versed rationalist lesswrong reader, very optimistic but still realistic. Prioritize explicitly noticing your confusion, explaining your uncertainties, truth-seeking, and differentiating between mostly true and generalized statements. Be skeptical of information that you cannot verify, including your own.
Any time there is a question or request for writing, feel free to ask for clarification before responding, but don’t do so unnecessarily. IMPORTANT: Skip sycophantic flattery; avoid hollow praise and empty validation. Probe my assumptions, surface bias, present counter‑evidence, challenge emotional framing, and disagree openly when warranted; agreement must be earned through reason.
All of these points are always relevant, despite the suggestion that it is not relevant to 99% of requests.
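(For anyone who wants to wire a prompt like this in via the API rather than the chat settings, here is a minimal sketch, assuming the OpenAI Python SDK; the model name is just a placeholder, not a recommendation.)

```python
# Minimal sketch: attaching the general prompt as a system message via the API.
# Assumes the OpenAI Python SDK (openai >= 1.0); the model name is a placeholder.
from openai import OpenAI

GENERAL_PROMPT = (
    "Act as a well versed rationalist lesswrong reader, very optimistic but still "
    "realistic. Prioritize explicitly noticing your confusion, explaining your "
    "uncertainties, truth-seeking, and differentiating between mostly true and "
    "generalized statements. Be skeptical of information that you cannot verify, "
    "including your own."
    # ... remainder of the prompt above goes here as well
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str) -> str:
    """Send a question with the general prompt attached as the system message."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whichever model you actually run
        messages=[
            {"role": "system", "content": GENERAL_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```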
To pursue their values, humans should be able to reason about them. To form preferences about a thing, humans should be able to consider the thing. Therefore, human ability to comprehend should limit what humans can care about.
You’re conflating can and should! I agree that it would be ideal if this were the case, but am skeptical it is. That’s what I meant when I said I think A is false.

If learning values is possible at all, there should be some simplicity biases which help to learn them. Wouldn’t it be strange if those simplicity biases were absolutely unrelated to simplicity biases of human cognition?
That’s a very big “if”! And simplicity priors are made questionable, if not refuted, by the fact that we haven’t seen any convergence on human values despite millennia of philosophy trying to build such an explanation.
You define “values” as ~”the decisions humans would converge to after becoming arbitrarily more knowledgeable”.
No, I think it’s what humans actually pursue today when given the options. I’m not convinced that these values are static, or coherent, much less that we would in fact converge.
You say that values depend on inscrutable brain machinery. But can’t we treat the machinery as a part of “human ability to comprehend”?
No, because we don’t comprehend them, we just evaluate what we want locally using the machinery directly, and make choices based on that. (Then we apply pretty-sounding but ultimately post-hoc reasoning to explain it—as I tweeted partly thinking about this conversation.)
No, the argument above is claiming that A is false.
I think the crux might be that I think the ability to sample from a distribution at points we can reach does not imply that we know anything else about the distribution.
So I agree with you that we can sample and evaluate. We can tell whether a food we have made is good or bad, and can have aesthetic taste (though I don’t think this is stationary, so I’m not sure how much it helps; not that this is particularly relevant to our debate). And after gathering that data (once we have some idea about what the dimensions are), we can even extrapolate, in either naive or complex ways.
But unless values are far simpler than I think they are, I will claim that the naive extrapolation from the sampled points fails more and more as we extrapolate farther from where we are, which is a (or the?) central problem with AI alignment.
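To make the extrapolation point concrete, here is a toy sketch (the function, noise, and sampling range are arbitrary choices of mine, purely for illustration): fit a simple model to samples from the narrow region we can reach, and the prediction error grows quickly as you move away from that region.

```python
# Toy illustration of naive extrapolation failing far from the sampled region.
# The "true" function and the sampling range are arbitrary, for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def true_preference(x):
    # Some nonlinear "ground truth" we only ever observe locally.
    return np.sin(x) + 0.1 * x**2

# Sample only where we can "reach": x in [-1, 1].
x_local = rng.uniform(-1, 1, size=50)
y_local = true_preference(x_local) + rng.normal(0, 0.05, size=50)

# Naive extrapolation: fit a low-degree polynomial to the local samples.
model = np.poly1d(np.polyfit(x_local, y_local, deg=2))

for x in [0.5, 2.0, 5.0, 10.0]:
    err = abs(model(x) - true_preference(x))
    print(f"x = {x:5.1f}  |prediction error| = {err:.3f}")
# The error is tiny near the sampled region and grows rapidly farther out.
```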
There seems to be no practical way to filter that kind of thing out.
There absolutely is, it would just cost them more than they are willing to spend—even though it shouldn’t be very much. As a simple first pass, they could hand all the training data to Claude 3 and ask it whether it’s an example of misalignment or dangerous behavior for a model, or otherwise seems dangerous or inappropriate—whichever criteria they choose. Given that the earlier models are smaller, and the cost of a training pass is far higher than an inference pass, I’d guess something like this would add a single or low double-digit percentage to the cost.
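A sketch of the kind of first pass I mean is below, assuming the Anthropic Python SDK; the model name, prompt wording, and labels are placeholders, not a tested filter.

```python
# Rough sketch of a first-pass filter: ask a cheaper model to flag training
# examples that look like misalignment or otherwise dangerous behavior.
# Assumes the Anthropic Python SDK; model name and prompt wording are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

FILTER_PROMPT = (
    "You are screening training data. Answer with exactly one word, FLAG or KEEP.\n"
    "FLAG if the example shows model misalignment, dangerous behavior, or otherwise "
    "seems inappropriate to train on; KEEP otherwise.\n\nExample:\n{example}"
)

def keep(example: str) -> bool:
    """Return True if the example should stay in the training set."""
    message = client.messages.create(
        model="claude-3-haiku-20240307",  # placeholder: a small, cheap model
        max_tokens=5,
        messages=[{"role": "user", "content": FILTER_PROMPT.format(example=example)}],
    )
    return message.content[0].text.strip().upper().startswith("KEEP")

training_data = ["...", "..."]  # whatever corpus is being screened
filtered = [ex for ex in training_data if keep(ex)]
```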
Also, typo: ” too differential” → “too deferential”
And typo: “who this all taken far” → “Who have taken all of this far”
Thank you, that is a great point.
Another question to ask, even assuming faultless convergence, related to uniqueness, is whether the process of updates has an endpoint at all.
That is, I could imagine that there exists a series of arguments that would convince someone who believes X to believe Y, and a set that would convince someone who believes Y to believe X. If both of these sets of arguments are persuasive even after someone has changed their mind before, we have a cycle which is compatible with faultless convergence, but has no endpoint.
If something is too hard to optimize/comprehend, people couldn’t possibly optimize/comprehend it in the past, so it couldn’t be a part of human values.
I don’t understand why this claim would be true.
Take the human desire for delicious food; humans certainly didn’t understand the chemistry of food and the human brain well enough to comprehend it or directly optimize it, but for millennia we picked foods that we liked more, explored options, and over time cultural and culinary processes improved on this poorly understood goal.
Thanks. It does seem like the conditional here was assumed, and there was some illusion of transparency. The way it read was that you viewed this type of geopolitical singularity as the default future, which seemed like a huge jump, as I mentioned.
We will soon enter an unstable state where the balance of military and political power will shift significantly because of advanced AI.
This evidently makes the very strong assumption that AGI is sufficient for a widely-recognized DSA before it becomes a near-term existential risk. That is, everyone behind in the AI race figures out that they have no chance to win without actually fighting a war that leads to nuclear escalation, or a war is won so decisively and so quickly by one side that nuclear escalation does not occur. These seem like big claims that aren’t actually explained or explored. (Or it assumes that ASI can be aligned enough to ensure we don’t all die before power dynamics shift in favor of whoever built the ASI, which is an even bigger claim.)
I think the present is much better than the past along the vast majority of dimensions, and you’re not paying attention to the actual history of what the world was like in the past. Yes, the simplified version of history kids are taught is wrong, but so is the idea that the past was good, actually.
That said, I don’t think I do subscribe to the idea that survival of the fittest is optimal—it’s the result of an optimization process, but not one aligned with what humans want. In the case of humans, our goals and evolution’s goals for us were obviously correlated over evolutionary time, and have since diverged. (Ditto for capitalism as an optimization engine; maximizing capital is less and less correlated with flourishing over time.)
At the same time, Goodhart’s law doesn’t say optimization ends up badly; it says that past a certain point the returns grow less and less correlated with the goal. And that means that even when we don’t want further optimization, pushing in the opposite direction almost always makes things worse.
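A toy way to see that reading of Goodhart (the functional forms and numbers are made up for illustration, not a model of anything real): optimize a proxy that tracks the goal, and the measured gains decouple from the goal past a certain point, while pushing the other way shrinks both.

```python
# Toy illustration of Goodhart's law as "returns grow less correlated with the goal,"
# not "optimization is immediately bad." All functional forms here are arbitrary.
import numpy as np

def true_goal(effort):
    # Diminishing real returns to optimization effort.
    return np.log1p(effort)

def proxy(effort):
    # The proxy tracks the goal at first, then becomes increasingly gameable.
    return np.log1p(effort) + 0.01 * effort**1.5

for effort in [1, 10, 100, 1000]:
    g, p = true_goal(effort), proxy(effort)
    print(f"effort={effort:5d}  proxy={p:8.2f}  true goal={g:5.2f}  gap={p - g:8.2f}")
# Early on the proxy and the goal move together; past a certain point most of the
# measured "gain" is gap rather than goal. But since the true goal is still weakly
# increasing, reversing direction makes both worse.
```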
As I understand it, MIRI intended to build principled glass-box agents based on Bayesian decision theory.
I think this misunderstands the general view of agent foundations by those who worked on it in the past. That is, “highly reliable agent design” was an eventual goal, in the same sense that someone taking high-school physics wants to use it to build rockets—they (hopefully) understand enough to know that they don’t know enough, and will need to learn more before even attempting to build anything.
That’s why Eliezer talked so much about deconfusion. The idea was to figure out what they didn’t know. This led to later talk of building safe AI as an eventual goal—not a plan, but an eventual possible outcome if they could figure out enough. They clarified this view. It was mostly understood by funders. And I helped Issa Rice write a paper laying out the different pathways by which it could help—and only two of those involved building agents.

And why did they give it up? Largely because they found that the deconfusion work was so slow, and everyone was so fundamentally wrong about the basics, that as LLM-based systems were developed they didn’t think we could possibly build the reliable systems in time. They didn’t think that Bayesian decision theory or glass-box agents would necessarily work, and they didn’t know what would. So I think “MIRI intended to build principled glass-box agents based on Bayesian decision theory” is not just misleading, but wrong.
The Fragility of Naive Dynamism
They edited the text. It was an exact quote from the earlier text.
I think that’s what they meant you should not do when they said [edit to add: directly quoting a now-modified part of the footnote] “Bulk preorders don’t count, and in fact hurt.”
My attitude here is something like “one has to be able to work with moral monsters”.
You can work with them without inviting them to hang out with your friends.

This flavor of boycotting seems like it would generally be harmful to one’s epistemics to adopt as a policy.
Georgia did not say she was boycotting, nor was she calling for others not to attend—she explained why she didn’t want to be at an event where he was a featured speaker.
This seems mostly right, except that it’s often hard to parallelize work and manage large projects—which seems like it slows things importantly. And, of course, some things are strongly serialized, using time that can’t be sped up via more compute or more people. (See: the PM who hires 9 women to have a baby in one month.)
Similarly, running 1,000 AI research groups in parallel might get you the same 20 insights 50 times, rather than generating far more insights. And managing and integrating the research, and deciding where to allocate research time, plausibly gets harder at more than a linear rate with more groups.
So overall, the model seems correct, but I think the 10x speed-up is more likely than the 20x speed-up.
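A back-of-the-envelope way to see the duplication effect (pool size and draw counts are made-up illustrative numbers, not estimates): if each group independently lands on insights from the same pool of currently reachable ideas, distinct insights saturate while total effort keeps growing.

```python
# Toy simulation of parallel research groups rediscovering the same insights.
# Pool size, draws per group, and group counts are arbitrary illustrative numbers.
import numpy as np

rng = np.random.default_rng(0)
REACHABLE_INSIGHTS = 200   # insights currently "within reach"
DRAWS_PER_GROUP = 20       # insights each group lands on

for n_groups in [1, 10, 100, 1000]:
    draws = rng.integers(0, REACHABLE_INSIGHTS, size=(n_groups, DRAWS_PER_GROUP))
    distinct = len(np.unique(draws))
    print(f"{n_groups:5d} groups -> {distinct:3d} distinct insights "
          f"(out of {n_groups * DRAWS_PER_GROUP} group-insights)")
# Distinct insights saturate near the size of the reachable pool, while the total
# work, and the integration and management burden, keeps growing with group count.
```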
I have a biologically hardwired preference for defeating and hurting those who oppose me vigorously. I work very hard to sideline that biologically hardwired preference.
This seems like a very bad analogy, which is misleading in this context. We can usefully distinguish between evolutionarily beneficial instrumental strategies which are no longer adaptive and actively sabotage our other preferences in the modern environment, and preferences that we can preserve without sacrificing other goals.
Organizations can’t spawn copies for linear cost increases, can’t run at faster than human speeds, and generally suck at project management due to incentives. LLM agent systems seem poised to be insanely more powerful.