Sure! Anything more specific that you want to know about? Practice advice or more theory?
Jonas Hallgren
There is a specific part of this problem that I’m very interested in, and that is looking at the boundaries of potential sub-agents. It feels like part of the goal here is to filter away potential “daemons” or inner optimisers, so it feels kind of important to think of ways one can do this?
I can see how this project would be valuable even without it, but do you have any thoughts on how to differentiate between the different parts of a system that is acting like an agent, so as to isolate the agentic part?
I otherwise find it a very interesting research direction.
Disclaimer: I don’t necessarily support this view, I thought about it for like 5 minutes but I thought it made sense.
If we were to handle this the same way as other regulation-driven slowdowns, then that might make sense, but I’m uncertain that you can take the outside view here?
Yes, we can do the same as for other technologies by leaving it to the standard government procedures to make legislation, and then I might agree with you that slowing down might not lead to better outcomes. Yet we don’t have to do this. We can use other processes that might lead to much better decisions. What about proper value-sampling techniques like digital liquid democracy? I think we can do a lot better than we have in the past by thinking about what mechanism we want to use.
Also, as a potential example that came to mind in the last 5 minutes: cloning technology. If we had just gone full-speed with that tech, then things would probably have turned out badly?
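To gesture at what a “proper value-sampling” mechanism could look like, here is a toy liquid-democracy tally. The names, votes, and delegation structure are entirely my own invention for illustration; the only point is that delegations chain transitively, so expertise can be borrowed without everyone voting directly.

```python
from collections import Counter

# Toy liquid democracy: voters either vote directly or delegate to
# someone they trust, and delegations chain transitively.
votes = {"alice": "slow down", "bob": "full speed"}
delegations = {"carol": "alice", "dan": "carol", "erin": "bob"}

def resolve(voter):
    """Follow a delegation chain until it hits a direct vote; a cycle loses the vote."""
    seen = set()
    while voter in delegations:
        if voter in seen:  # delegation cycle: nobody actually voted
            return None
        seen.add(voter)
        voter = delegations[voter]
    return votes.get(voter)

all_voters = set(votes) | set(delegations)
tally = Counter(v for v in map(resolve, all_voters) if v is not None)
print(tally["slow down"], tally["full speed"])  # 3 2
```

Here carol and dan both end up backing alice’s vote through the chain, which is the basic appeal of the mechanism: you can sample values from people who have thought about the question.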
The Buddha with dependent origination. I think it says somewhere that most of the stuff in Buddhism was from before the Buddha’s time; things such as breath-based practices and loving-kindness, among others. He had one key revelation, called dependent origination, which basically enabled the entire enlightenment thing.*
*At least according to my meditation teacher. I believe him, since he was a neuroscientist with a master’s in astrophysics at Berkeley before he left for India, so he’s got some pretty good epistemics.
It basically states that any system is only true based on another system being true. It has some really cool parallels to Gödel’s Incompleteness Theorem, but on a metaphysical level. Emptiness of emptiness and stuff. (On a side note, I can recommend TMI + Seeing That Frees if you want to experience some radical shit there.)
This was a great post, thank you for making it!
I wanted to ask what you thought about the LLM-forecasting papers in relation to this literature? Do you think there are any ways of applying the uncertainty-estimation literature to improve the forecasting ability of AI?
I like the post and generally agree. Here’s a random thought on OOD generalization. I feel that we often talk about how being good at 2 or 3 different things allows for new exploration. If you believe books such as Range, then we’re a lot more creative when combining ideas from multiple different fields. I rather think of multiple “hulls” (I’m guessing this isn’t technically correct, since I’m a noob at convex optimisation) and how to apply them together to find new truths.
Damn, great post, thank you!
I saw that you used Freedom; a random tip is to use the AppBlock app instead, as it is more powerful, together with Cold Turkey Blocker on the computer. (If you want to, there are ways to get around the other blockers.)
That’s all I wanted to say, really; I will probably try it out in the future. I was thinking of giving myself an allowance or something similar that I could spend on the app, and seeing if it would increase my productivity.
This was a dig at interpretability research. I’m pro-interpretability research in general, so if you feel personally attacked by this, it wasn’t meant to be too serious. Just be careful with infohazards, ok? :)
I think Neuralink already did this, actually; a bit late to the point, but a good try anyway. Also, have you considered having Michael Bay direct the research effort? I think he did a pretty good job with the first Transformers.
Yeah, I agree with what you just said; I should have been more careful with my phrasing.
Maybe something like: “The naive version of the orthogonality thesis, where we assume that AIs can’t converge towards human values, is assumed to be true too often.”
Compared to other people on this site, this is a part of my alignment optimism. I think that there are natural abstractions in the moral landscape that make agents converge towards cooperation and similar things. I read a post recently in which Leo Gao made the argument that concave agents generally don’t exist, because they stop existing. I think there are pressures that conform agents to parts of the value landscape.
Like, I agree that the orthogonality thesis is presumed to be true way too often. It is more like an argument that alignment may not happen by default, but I’m also uncertain about the evidence that it actually gives you.
Any SBF enjoyers?
I have the same experience; I love having it connect two disparate topics together, it is very fun. I had the thought today that I use GPT as a brainstorming partner for 80%+ of the work tasks I do.
Hey! I saw that you had a bunch of downvotes, and I wanted to get in here before you became too disillusioned with the LW crowd. A big point for me is that you don’t really have any sub-headings or examples that get straight to the point. It is all one long text that reads like your direct stream of thought, and this makes it really hard to engage with what you say. Of course you’re saying controversial things, but with more clarity I think you would get more engagement.
(GPT is really OP for this nowadays.) Anyway, I wish you the best of luck! I’m also sorry for not engaging with any of your arguments, but I couldn’t quite follow them.
Alright, quite a ba(y)sed point there, very nice. My lazy ass is looking for a heuristic here. It seems like the more the EMH holds in a situation (i.e., the more optimisation pressure has been applied), the more you should expect to be disappointed with a trade.
But what is a good heuristic for how much worse it will be? Maybe one just has to think about the counterfactual option each time?
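To make that heuristic concrete, here is a toy simulation of the winner’s-curse-style effect I have in mind (all numbers invented): if every trade is truly worth the same but your estimates are noisy, picking the best-looking one out of more options inflates how good it looks, so the disappointment grows with the number of competing options you optimised over.

```python
import random
import statistics

random.seed(0)

def expected_disappointment(n_options, noise, trials=2000):
    """Average gap between the estimated and true value of the
    option that looked best, over many simulated trades."""
    gaps = []
    for _ in range(trials):
        # Every option is truly worth 0; our estimates add Gaussian noise.
        estimates = [random.gauss(0, noise) for _ in range(n_options)]
        # Disappointment = best estimate minus the true value (0).
        gaps.append(max(estimates))
    return statistics.mean(gaps)

# More options to optimise over => the winner looks better than it is.
for n in (1, 5, 50):
    print(n, round(expected_disappointment(n, noise=1.0), 2))
```

So one rough answer to “how much worse”: the gap scales with the noise in your estimates and grows (slowly) with the amount of selection applied, which matches the intuition that heavily optimised markets leave you more disappointed per trade.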
I thought the orangutan argument was pretty good when I first saw it, but then I looked it up and realised that it’s not that orangutans aren’t power-seeking; it’s more that they only are in interactions that matter for the future survival of their offspring. It is actually a very flimsy argument. Some of the things he says are smart, like some of the stuff on the architecture front, but he always brings up his aeroplane analogy in AI safety. That one is really dumb: I wouldn’t get into an aeroplane without knowing that it had been safety-checked, and as a consequence I have a hard time taking him seriously when it comes to safety.
Very cool! I want to mention that it might be interesting to explore the connection between what the Buddha called dependent origination and the formation of a self-view of being an agent.
The idea is that your self is built through a loop of expecting your self to be there in the future, thus creating a self-fulfilling prophecy. This is similar to how agents are defined in the intentional stance, since it is informationally more efficient to express yourself as an agent.
One way to view the alignment problem is then through a self-loop taking over fully, or dependent origination in artificial agents. Anyway, I think it seems very cool and I wish you the best of luck!
I notice being confused about the relationship between power-seeking arguments and counting arguments. Since I’m confused, I’m assuming others are too, so I would appreciate some clarity on this.
In footnote 7, Turner mentions that the paper “Optimal Policies Tend to Seek Power” makes the irrelevant counting error described in the post.
In my head, the counting argument says that it is hard to hit an alignment target because there are a lot more non-alignment targets. This argument is (clearly?) wrong for the reasons specified in the post. Yet this doesn’t address power-seeking, as that seems more like an optimisation pressure applied to the system, not something dependent on counting arguments?
In my head, power-seeking is more like saying that an agent’s attraction basin is larger at one point of the optimisation landscape than at another. The same can also be said about deception here.
I might be dumb, but I never thought of the counting argument as true, nor as crucial to either deception or power-seeking. I’m very happy to be enlightened about this issue.
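For concreteness, the naive counting picture in my head is just a uniform count over possible goals, something like this toy calculation (the numbers of situations, values, and aligned goals are completely made up for illustration):

```python
from fractions import Fraction

# Toy model: a "goal" assigns one of 3 values to each of 4 situations,
# and exactly one full assignment counts as "aligned".
values_per_situation = 3
situations = 4
total_goals = values_per_situation ** situations  # 3^4 = 81 possible goals
aligned_goals = 1

print(Fraction(aligned_goals, total_goals))  # 1/81 of goals are aligned

# The objection, as I read it: this uniform count says nothing about
# the measure training actually puts on goals, so the 1/81 figure
# carries no evidential weight on its own.
```

Which is exactly why I don’t see how this count bears on power-seeking: power-seeking is a claim about where optimisation pressure pushes you, not about how many goals sit in each bucket.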
I buy the argument that scheming won’t happen, conditional on us not allowing much slack between different optimisation steps. As Quentin mentions in his AXRP podcast episode, SGD doesn’t have close to the same level of slack that, for example, cultural evolution allowed. (See the free-energy-of-optimisation debate from before; can’t remember the post names ;/) If that holds, then I don’t see why the inner behaviour should diverge from what the outer alignment loop specifies.
I do, however, believe it is important to ensure that this holds by specifying the right outer alignment loop as well as the right deployment environment, so that slack is minimised at all points along the chain and misalignment is avoided everywhere.
If we catch deception in training, we will be OK. If we catch actors that might create deceptive agents in training, then we will be OK. If we catch states developing agents to do this, or if defense > offense, then we will be OK. I do not believe any of this happens by default.
I was doing the same samadhi thing with TMI, and I was looking for insight practices from there. My teacher (non-dual Thai Forest tradition) said that the Burmese traditions set up a bit of a strange reality dualism, and basically said that the dark night of the soul is often due to developing concentration before awareness, loving-kindness, and wisdom.
So I’m Mahamudra-pilled now (Pointing Out the Great Way is a really good book for this). I do still like the insight model you proposed; I’m still reeling a bit from the insights I got during my last retreat, so it seems true.
Thank you for sharing your experience!