Views are my own.
lberglund
[I may be generalizing here and I don’t know if this has been said before.]
It seems to me that Eliezer’s models are a lot more specific than those of people like Richard. While Richard may put some credence on superhuman AI being “consequentialist” by default, Eliezer has certain beliefs about intelligence that make it extremely likely in his mind.
I think Eliezer’s style of reasoning, which relies on specific, thought-out models of AI, makes him more pessimistic than others in EA. Others believe there are many ways that AGI scenarios could play out and are generally uncertain. But Eliezer has specific models that make some scenarios a lot more likely in his mind.
There are many valid theoretical arguments for why we are doomed, but maybe other EAs put less credence in them than Eliezer does.
The “collusion” issue leads to a state of affairs in which two political groups can gain more political power if they can organize and get along well enough to actively coordinate. Why should two groups have more power just because they can cooperate?
It seems pretty obvious to me that what “slow motion doom” looks like in this sense is a period during which an AI fully conceals any overt hostile actions while driving its probability of success once it makes its move from 90% to 99% to 99.9999%, until any further achievable decrements in probability are so tiny as to be dominated by the number of distant galaxies going over the horizon conditional on further delays.
Wouldn’t another consideration be that the AI is more likely to be caught the longer it prepares? Or is this chance negligible since the AI could just execute its plan the moment people try to prevent it?
I think many people here are already familiar with the circuits line of research at OpenAI, though I think it’s now mostly been abandoned.
I wasn’t aware that the circuits approach was abandoned. Do you know why they abandoned it?
Potentially silly question:
In the first counterexample you describe the desired behavior as
Intuitively, we expect each node in the human Bayes net to correspond to a function of the predictor’s Bayes net. We’d want the reporter to simply apply the relevant functions from subsets of nodes in the predictor’s Bayes net to each node in the human Bayes net [...]
After applying these functions, the reporter can answer questions using whatever subset of nodes the human would have used to answer that question.
Why doesn’t the reporter skip the step of mapping the predictor’s Bayes net to the human’s and instead just answer the question using its own nodes? What’s the benefit of having the intermediate step that maps the predictor’s net to the human’s?
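For concreteness, here’s a rough sketch of how I’m picturing the desired reporter described in the quote (all names and function signatures below are my own invention, not from the report):

```python
# Toy sketch (my framing) of the "desired" reporter from the quote above.
# predictor_nodes: values of the nodes in the predictor's Bayes net
# node_maps: for each node in the human's Bayes net, a function computing its
#            value from some subset of the predictor's nodes
# human_answerer: answers a question the way the human would, given values
#                 for the nodes in the human's Bayes net

def desired_reporter(predictor_nodes, node_maps, human_answerer, question):
    # Step 1: translate the predictor's ontology into the human's ontology
    human_nodes = {name: fn(predictor_nodes) for name, fn in node_maps.items()}
    # Step 2: answer the question using whatever subset of the human's nodes
    # the human would have used
    return human_answerer(human_nodes, question)
```

My question is essentially why the reporter wouldn’t just implement `human_answerer`-style logic directly over `predictor_nodes` and skip step 1.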
I see, thanks for answering. To further clarify: given that the reporter’s only access to the human’s nodes is through the human’s answers, would it be equally likely for the reporter to create a mapping to some other Bayes net that is similarly consistent with the answers provided? Is there a reason why the reporter would map to the human’s Bayes net in particular?
Another difference is the geographic location! As someone who grew up in Germany, living in England is a lot more attractive to me since it will allow me to be closer to my family. Others might feel similarly.
I had a similar thought. Also, in an expected value context it makes sense to pursue actions that succeed when your model is wrong and you are actually closer to the middle of the success curve, because if that’s the case you can increase our chances of survival more easily. In the logarithmic context doing so doesn’t make much sense, since your impact on the log odds is the same no matter where on the success curve you are.
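To spell out that last claim with a toy model (my own framing, not from the post): if the probability of success is a logistic function of effort, each unit of effort buys the same amount of log odds no matter where you are on the curve, but buys the most raw probability near the middle:

```latex
p(x) = \frac{1}{1 + e^{-x}}
\;\Longrightarrow\;
\log\frac{p(x)}{1 - p(x)} = x,
\qquad
\frac{d}{dx}\,\log\frac{p}{1 - p} = 1,
\qquad
\frac{dp}{dx} = p(1 - p),
```

and dp/dx is maximized at p = 1/2, i.e. in the middle of the success curve.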
Maybe this objective function (and the whole ethos of Death with Dignity) is a way to justify working on alignment even if you think our chances of success are close to zero. Personally, I’m not compelled by it.
I mostly agree that relying on real-world data is necessary for better understanding our messy world, and that in most cases this approach is favorable.
There’s a part of me that thinks AI is a different case though, since getting it even slightly wrong will be catastrophic. Experimental alignment research might get us most of the way to aligned AI, but there will probably still be issues that aren’t noticeable because the AIs we are experimenting on won’t be powerful enough to reveal them. Our solution to the alignment problem can’t be something imperfect that does the job well enough. Instead, it has to be something that can withstand immense optimization pressure. My intuition tells me that the single-hose solution is not enough for AGI and we instead need something that is flawless in practice and in theory.
The link doesn’t work. I think you are linking to a draft version of the post or something.
Can someone clarify what “k>1” refers to in this context? Like, what does k denote?
It’s worth emphasizing your point about the negative consequences of merely aiming for a pivotal act.
Additionally, if a lot of people in the AI safety community advocate for a pivotal act, it makes people less likely to cooperate with and trust that community. If we want to make AGI safe, we have to be able to actually influence the development of AGI. To do that, we need to build a cooperative relationship with decision makers. Planning a pivotal act runs counter to these efforts.
2. There’s lots of Minecraft videos on YouTube, so you could test a “GPT-3 for Minecraft” approach.
OpenAI just did this exact thing.
The original stated rationale behind OpenAI was https://medium.com/backchannel/how-elon-musk-and-y-combinator-plan-to-stop-computers-from-taking-over-17e0e27dd02a.
This link is dead for me. I found this link that points to the same article.
This is the same flawed approach that airport security has, which is why travelers still have to remove shoes and surrender liquids: they are creating blacklists instead of addressing the fundamentals.
Just curious, what would it look like to “address the fundamentals” in airport security?
This is very interesting. Thanks for taking the time to explain :)
I was a bit confused about this quote, so I tried to expand on the ideas a bit. I’m posting it here in case anyone benefits from it or disagrees.
To which I say: I expect many of the cognitive gains to come from elsewhere, much as a huge number of the modern capabilities of humans are encoded in their culture and their textbooks rather than in their genomes. Because there are slopes in capabilities-space that an intelligence can snowball down, picking up lots of cognitive gains, but not alignment, along the way.
I guess this is saying that an AI will develop ways to learn things without gradient descent, just like humans learned things outside of our genetic update. Some ways to do this would be:
- Develop the ability to read things on the internet and learn from them
- Spend cognitive energy on things like doing math or programming
- Do things to actually gain power in the world, like accumulating money or compute
I guess the argument is that, for objectives, only gradient descent is pushing you in the correct direction, whereas for capabilities, the system will develop ways to push itself in the right direction in addition to SGD. Like, it’s true that for any objective function it’s good to be more powerful. It’s not true that for any level of power the system is incentivized to have the more correct objective.
A system wants to be more powerful, but it doesn’t want to have a more “correct” objective.
Another reason to not expect the selection argument to work is that it’s instrumentally convergent for most inner agent values to not become wireheaders, for them to not try hitting the reward button.
To me this implies that as the AI becomes more situationally aware, it learns to avoid rewards that would reinforce away its current goals (because it wants to preserve its goals). As a result, throughout the training process, the AI’s goals start out malleable and “harden” once the AI gains enough situational awareness. This implies that goals have to be simple enough for the agent to be able to model them early on in its training process.
The way I see it, having a lower-level understanding of things allows you to create abstractions about their behavior that you can use to understand them on a higher level. For example, if you understand how transistors work on a lower level, you can abstract away their behavior and more efficiently examine how they wire together to create memory and processors. This is why I believe that a circuits-style approach is the most promising one we have for interpretability.
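As a toy illustration of the abstraction point (purely illustrative code, not any real interpretability tooling): once you trust the low-level model of a gate, you can treat it as a black box and build and reason about memory without thinking about transistors at all.

```python
def nor(a: bool, b: bool) -> bool:
    # Low level: a NOR gate (itself built from transistors, whose
    # physics we no longer need to think about once we trust this model)
    return not (a or b)

def sr_latch(s: bool, r: bool, q: bool = False, qbar: bool = True):
    # Higher level: one bit of memory (an SR latch) made of two
    # cross-coupled NOR gates; iterate until the feedback loop settles
    for _ in range(10):
        new_q, new_qbar = nor(r, qbar), nor(s, q)
        if (new_q, new_qbar) == (q, qbar):
            break
        q, qbar = new_q, new_qbar
    return q, qbar

# Storing a bit happens entirely at the gate level of abstraction:
q, qbar = sr_latch(s=True, r=False)                    # set  -> q == True
q, qbar = sr_latch(s=False, r=False, q=q, qbar=qbar)   # hold -> q stays True
```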
Do you agree that a lower level understanding of things is often the best way to achieve a higher level understanding, in particular regarding neural network interpretability, or would you advocate for a different approach?
FYI, the link at the top of the post isn’t working for me.