No, the argument above is claiming that A is false.
I think the crux might be that, in my view, the ability to sample from a distribution at points we can reach does not imply that we know anything else about the distribution.
So I agree with you that we can sample and evaluate. We can tell whether a food we have made is good or bad, and we have aesthetic taste (though I don’t think this is stationary, so I’m not sure how much it helps; not that this is particularly relevant to our debate). And after gathering that data (once we have some idea about what the dimensions are), we can even extrapolate, in either naive or complex ways.
But unless values are far simpler than I think they are, I will claim that the naive extrapolation from the sampled points fails more and more as we extrapolate farther from where we are, which is a (or the?) central problem with AI alignment.
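To make that concrete, here is a toy sketch of what I mean by naive extrapolation failing away from the sampled region (the “true” function and the polynomial fit are arbitrary stand-ins I made up, not a model of values):

```python
# Toy illustration only: fit a curve to samples from a narrow region,
# then extrapolate. The "true" function and the degree-4 polynomial are
# arbitrary stand-ins for "values" and "naive extrapolation".
import numpy as np

rng = np.random.default_rng(0)

def true_preference(x):
    # Pretend this is the thing we can only ever sample locally.
    return np.sin(3 * x)

# Sample only the region we can currently reach: x in [0, 1].
x_local = rng.uniform(0, 1, 50)
y_local = true_preference(x_local) + rng.normal(0, 0.05, 50)

# Naive extrapolation: a polynomial fit to the local samples.
coeffs = np.polyfit(x_local, y_local, deg=4)

for x in [0.5, 1.0, 3.0, 5.0]:
    fitted = np.polyval(coeffs, x)
    actual = true_preference(x)
    print(f"x={x:.1f}: fit={fitted:+9.2f}  true={actual:+5.2f}")
# Inside the sampled region the fit looks great; a few units outside it,
# the fit tells us essentially nothing about the underlying function.
```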
There seems to be no practical way to filter that kind of thing out.
There absolutely is, it would just cost them more than they are willing to spend—even though it shouldn’t be very much. As a simple first pass, they could hand all the training data to Claude 3 and ask it whether it’s an example of misalignment or dangerous behavior for a model, or otherwise seems dangerous or inappropriate—whichever criteria they choose. Given that the earlier models are smaller, and the cost of a training pass is far higher than an inference pass, I’d guess something like this would add a single or low double digit percentage to the cost.
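For concreteness, a rough sketch of the kind of first-pass filter I have in mind, using the Anthropic Python SDK (the prompt wording, the model choice, and the lack of batching are placeholders, not a recommendation):

```python
# Rough sketch of the first-pass filter described above: ask an existing
# model whether each training example looks like misalignment or otherwise
# dangerous/inappropriate content, and drop the examples it flags.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

SCREEN_PROMPT = (
    "You are screening training data. Does the following text depict "
    "misalignment or dangerous behavior by an AI model, or otherwise seem "
    "dangerous or inappropriate to train on? Answer with only FLAG or OK.\n"
    "---\n{example}"
)

def looks_dangerous(example: str) -> bool:
    # Ask the screening model whether this one training example should be dropped.
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # placeholder choice of screening model
        max_tokens=5,
        messages=[{"role": "user", "content": SCREEN_PROMPT.format(example=example)}],
    )
    return "FLAG" in response.content[0].text.upper()

def filter_corpus(examples):
    # Keep only the examples the screening model does not flag.
    return [ex for ex in examples if not looks_dangerous(ex)]
```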
Also, typo: “too differential” → “too deferential”
And typo: “who this all taken far” → “Who have taken all of this far”
Thank you, that is a great point.
Another question to ask, even assuming faultless convergence, related to uniqueness, is whether the process of updates has an endpoint at all.
That is, I could imagine that there exists a series of arguments that would convince someone who believes X to believe Y, and a set that would convince someone who believes Y to believe X. If both of these sets of arguments are persuasive even after someone has changed their mind before, we have a cycle which is compatible with faultless convergence, but has no endpoint.
If something is too hard to optimize/comprehend, people couldn’t possibly optimize/comprehend it in the past, so it couldn’t be a part of human values.
I don’t understand why this claim would be true.
Take the human desire for delicious food; humans certainly didn’t understand the chemistry of food and the human brain well enough to comprehend it or directly optimize it, but for millennia we picked foods that we liked more, explored options, and over time cultural and culinary processes improved on this poorly understood goal.
Thanks. It does seem like the conditional here was assumed, and there was some illusion of transparency. The way it read was that you viewed this type of geopolitical singularity as the default future, which seemed like a huge jump, as I mentioned.
We will soon enter an unstable state where the balance of military and political power will shift significantly because of advanced AI.
This evidently makes the very strong assumption that AGI is sufficient for a widely-recognized DSA before it becomes a near-term existential risk. That is, either everyone behind in the AI race figures out that they have no chance to win without actually fighting a war that leads to nuclear escalation, or a war is won so decisively and so quickly by one side that nuclear escalation does not occur. These seem like big claims that aren’t actually explained or explored. (Or it assumes that ASI can be aligned enough to ensure we don’t all die before power dynamics shift in favor of whoever built the ASI, which is an even bigger claim.)
I think the present is much better than the past along the vast majority of dimensions, and you’re not paying attention to the actual history of what the world was like in the past. Yes, the simplified version of history kids are taught is wrong, but so is the idea that the past was good, actually.
That said, I don’t think I do subscribe to the idea that survival of the fittest is optimal—it’s the result of an optimization process, but not one aligned with what humans want. In the case of humans, our goals and evolution’s goals for us were obviously correlated over evolutionary time, but have since diverged. (Ditto for capitalism as an optimization engine; maximizing capital is less and less correlated with flourishing over time.)
At the same time, Goodhart’s law doesn’t say optimization ends up badly, it says that after a certain point the returns grow less and less correlated with the goal. And that means that even when we don’t want further optimization, pushing in the opposite direction almost always makes things worse.
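As a toy illustration of the “less and less correlated” point (with a made-up proxy that is just the goal plus noise):

```python
# Toy Goodhart illustration: the proxy is the goal plus noise. Selecting
# harder on the proxy buys diminishing returns on the goal itself.
import numpy as np

rng = np.random.default_rng(1)
goal = rng.normal(size=1_000_000)
proxy = goal + rng.normal(size=1_000_000)  # imperfect measurement of the goal

for top_frac in [0.5, 0.1, 0.01, 0.001]:
    cutoff = np.quantile(proxy, 1 - top_frac)
    selected = goal[proxy >= cutoff]
    print(f"top {top_frac:>6.1%} by proxy -> mean goal value {selected.mean():.2f}")
# Each 10x harder selection on the proxy buys less and less on the goal,
# but selecting for a *low* proxy value would still make things much worse.
```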
As I understand it, MIRI intended to build principled glass-box agents based on Bayesian decision theory.
I think this misunderstands the general view of agent foundations by those who worked on it in the past. That is, “highly reliable agent design” was an eventual goal, in the same sense that someone taking high-school physics wants to use it to build rockets—they (hopefully) understand enough to know that they don’t know enough, and will need to learn more before even attempting to build anything.
That’s why Eliezer talked so much about deconfusion. The idea was to figure out what they didn’t know. This led to later talking about building safe AI as an eventual goal—not a plan, but a possible eventual outcome if they could figure out enough. They clarified this view, it was mostly understood by funders, and I helped Issa Rice write a paper laying out the different pathways by which it could help—only two of which involved building agents.

And why did they give it up? Largely because they found that the deconfusion work was so slow, and everyone was so fundamentally wrong about the basics, that as LLM-based systems were developed they didn’t think we could possibly build the reliable systems in time. They didn’t think that Bayesian decision theory or glass-box agents would necessarily work, and they didn’t know what would. So I think “MIRI intended to build principled glass-box agents based on Bayesian decision theory” is not just misleading, but wrong.
They edited the text. It was an exact quote from the earlier text.
I think that’s what they meant you should not do when they said [edit to add: directly quoting a now-modified part of the footnote] “Bulk preorders don’t count, and in fact hurt.”
My attitude here is something like “one has to be able to work with moral monsters”.
You can work with them without inviting them to hang out with your friends.

This flavor of boycotting seems like it would generally be harmful to one’s epistemics to adopt as a policy.
Georgia did not say she was boycotting, nor did she call for others not to attend—she explained why she didn’t want to be at an event where he was a featured speaker.
This seems mostly right, except that it’s often hard to parallelize work and manage large projects—which seems like it slows things importantly. And, of course, some things are strongly serialized, using time that can’t be sped up via more compute or more people. (See: PM hires 9 women to have a baby in one month.)
Similarly, running 1,000 AI research groups in parallel might get you the same 20 insights 50 times, rather than generating far more insights. And managing and integrating the research, and deciding where to allocate research time, plausibly gets harder at more than a linear rate with more groups.
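A toy version of the duplication worry, assuming (purely for illustration) that each group independently lands on one of a fixed pool of currently reachable insights:

```python
# Toy model of duplicated insights: k independent groups each land on one
# of N currently-reachable insights, uniformly at random. The numbers are
# purely illustrative.
def expected_distinct_insights(n_insights: int, n_groups: int) -> float:
    # Expected number of distinct insights found across all groups.
    return n_insights * (1 - (1 - 1 / n_insights) ** n_groups)

for groups in [20, 100, 1000]:
    distinct = expected_distinct_insights(20, groups)
    print(f"{groups:>4} groups -> ~{distinct:.1f} distinct insights "
          f"({groups - distinct:.0f} groups duplicating someone else's work)")
# 1000 groups still yield roughly the same 20 distinct insights: nearly all
# of the extra parallel effort is duplication, not new insight.
```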
So overall, the model seems correct, but I think the 10x speed up is more likely than the 20x speed up.
I have a biologically hardwired preference for defeating and hurting those who oppose me vigorously. I work very hard to sideline that biologically hardwired preference.
This seems like a very bad analogy, which is misleading in this context. We can usefully distinguish between evolutionarily beneficial instrumental strategies which are no longer adaptive and actively sabotage our other preferences in the modern environment, and preferences that we can preserve without sacrificing other goals.
CoT monitoring seems like a great control method when available
As I posted in a top level comment, I’m not convinced that even success would be a good outcome. I think that if we get this working 99.999% reliably, we still end up delegating parts of the oversight in ways that have other alignment failure modes, such as via hyper-introspection.
First, strongly agreed on the central point—I think that as a community, we’ve been investing too heavily in the tractable approaches (interpretability, testing, etc.) without having the broader alignment issues take center stage. This has led to lots of bikeshedding, lots of capabilities work, and yes, some partial solutions to problems.
That said, I am concerned about what happens if interpretability is wildly successful—against your expectations. That is, I see interpretability as a concerning route to attempted alignment even if it succeeds in getting past the issues you note on “miss things,” “measuring progress,” and “scalability,” partly for reasons you discuss under obfuscation and reliability. Wildly successful and scalable interpretability, without solving other parts of alignment, would very plausibly still leave us with a dangerously misaligned system, and the detection methods themselves arguably exacerbate the problem. I outlined my concerns about this case in more detail in a post here. I would be very interested in your thoughts about this. (And thoughts from @Buck / @Adam Shai as well!)
If it fails once we are well past AlphaZero, or even just more moderate superhuman AI research, this is good, as this means the “automate AI alignment” plan has a safe buffer zone.
If it fails before AI automates AI research, this is also good, because it forces them to invest in alignment.
That assumes AI firms learn the lessons needed from the failures. Our experience shows that they don’t: they keep building systems that are predictably unsafe and exploitable, and they don’t have serious plans to change their deployments, much less actually build a safety-oriented culture.
You’re conflating can and should! I agree that it would be ideal if this were the case, but am skeptical it is. That’s what I meant when I said I think A is false.
That’s a very big “if”! And simplicity priors are made questionable, if not refuted, by the fact that we haven’t gotten any convergence about human values despite millennia of philosophy trying to build such an explanation.
No, I think it’s what humans actually pursue today when given the options. I’m not convinced that these values are static, or coherent, much less that we would in fact converge.
No, because we don’t comprehend them, we just evaluate what we want locally using the machinery directly, and make choices based on that. (Then we apply pretty-sounding but ultimately post-hoc reasoning to explain it—as I tweeted partly thinking about this conversation.)