AI Safety person currently working on multi-agent cooperation problems as CSO at The Collective Intelligence Company: https://thecollectiveintelligence.company/
Jonas Hallgren
Sorry if this was a bad comment!
Damn, thank you for this post. I will put this into practice immediately!
I resonated with the post and I think it’s a great direction to draw inspiration from!
A big problem with goodharting in RL is that you’re handcrafting a utility function. In the wisdom traditions, we’re encouraged to explore and gain insights into different ideas to form our utility function over time.
Therefore, I feel that setting up the right training environment together with some wisdom principles might be enough to create wise AI.
We, of course, run into all of the annoying inner-alignment and deception-during-training style problems, yet it still seems like the direction to go in. I don’t think the orthogonality thesis is fully true or false; it’s more dependent on your environment, and if we can craft the right one, I think we can have wise AI that wants to create the most loving and kind future imaginable.
Eh, it’s like self-plugging or something.
It should work again now. We’re gonna switch names soon, so we just had some technical difficulties around that.
Know of any I should add?
I do feel a bit awkward about it as I’m very much involved in both projects, but these two otherwise?
The Collective Intelligence Company: https://thecollectiveintelligence.company/company
Flowback/Digital Democracy World: https://digitaldemocracy.world/
Also, a paper on Predictive Liquid Democracy, which is part of both projects: https://www.researchgate.net/publication/377557844_Predictive_Liquid_Democracy
Very intriguing, excited for the next post!
(We will watch your career with great interest.)
Often, disagreements boil down to a set of open questions to answer; here’s my best guess at how to decompose your disagreements.
I think that depending on which hypothesis you subscribe to about how LLMs will generalise to AGI, you get different answers:
Hypothesis 1: LLMs are enough evidence that AIs will generally be able to follow what humans care about and that they naturally don’t become power-seeking.
Hypothesis 2: AGI will have a sufficiently different architecture from LLMs, or will change so much, that current-day LLMs don’t generally give evidence about AGI.
Depending on your beliefs about these two hypotheses, you will have different opinions on this question.
The scenario outlined by Bostrom seems clearly different from the scenario with LLMs, which are actual general systems that do what we want and ~nothing more, rather than doing what we want as part of a strategy to seek power instrumentally. What am I missing here?

Let’s say that we believe in hypothesis 1 as the base case; what are some reasons why LLMs wouldn’t give evidence about AGI?
1. Intelligence forces reflective coherence.
This would essentially entail that the more powerful a system gets, the more it will notice internal inconsistencies and shift towards coherent maximisation (and therefore away from following human values).
2. Agentic AI acting in the real world is different from LLMs.
If we look at an LLM from the perspective of an action-perception loop, it doesn’t generally get any feedback on how it changes the world. Instead, it acts more like an autoencoder, predicting what the world will look like. It may be that power-seeking only arises in systems that can observe the consequences of their own actions on the world.
3. LLMs optimise for a Goodharted RLHF signal that looks good on the surface but lacks fundamental understanding. Since human value is fragile, it will be difficult to hit the sweet spot when we get to real-world cases, especially when you factor in the complexity of the future.

Personal belief:
These are all open questions, in my opinion, but I do see how LLMs give evidence about some of these parts. I, for example, believe that language is a very compressed information channel for alignment information, and I don’t really believe that human values are as fragile as we think.

I’m more scared of 1 and 2 than I am of 3, but I would still love for us to have ten more years to figure this out, as it seems very non-obvious what the answers here are.
I really like this take.
I’m kind of “bullish” on active inference as a way to scale existing architectures to AGI as I think it is more optimised for creating an explicit planning system.
Also, funnily enough, Yann LeCun has a paper on his beliefs about the path to AGI, which I think Steve Byrnes has a good post on. It basically says that we need system-2 thinking in the way you describe here. With your argument in mind, he kind of disproves himself to some extent. 😅
Very interesting, I like the long list of examples as it helped me get my head around it more.
So, I’ve been thinking a bit about similar topics, but in relation to a long reflection on value lock-in.
My basic thesis was that reversibility should be what we optimise for humanity in general, as we want to be able to reach as large a part of the “moral search space” as possible.
The concept of corrigibility you seem to be pointing towards here seems very related to notions of reversibility. You don’t want to take actions that cannot later be reversed, and you generally want to optimise for optionality.
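Here’s a toy sketch of what I mean by optimising for optionality: count how many states stay reachable after an action, and prefer actions that keep that count high. The state graph, action names, and horizon are all made up for illustration; this isn’t from the post.

```python
# Toy sketch: irreversible actions shrink the set of reachable future states.
from collections import deque

# state -> {action: next_state}; "smash" is the irreversible action.
TRANSITIONS = {
    "intact": {"admire": "intact", "move": "moved", "smash": "broken"},
    "moved": {"move_back": "intact"},
    "broken": {"admire": "broken"},  # nothing restores the vase
}

def reachable(state, horizon):
    """All states reachable from `state` within `horizon` steps (BFS)."""
    seen, frontier = {state}, deque([(state, 0)])
    while frontier:
        s, d = frontier.popleft()
        if d == horizon:
            continue
        for nxt in TRANSITIONS.get(s, {}).values():
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen

def optionality(state, horizon=5):
    return len(reachable(state, horizon))

# The reversible "move" keeps the whole search space open;
# the irreversible "smash" collapses it to one state.
print(optionality("moved"))   # → 3
print(optionality("broken"))  # → 1
```

A corrigible/reversibility-respecting agent would then break ties (or trade off some utility) in favour of the higher-optionality successor state.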
I then have two questions:
1) What do you think of the relationship between your measure of corrigibility and the notion of uncertainty in inverse reinforcement learning? It seems similar to what Stuart Russell points towards when he talks about the agent being uncertain about the preferences of the principal it is serving. For example, in the following example that you give:
In the process of learning English, Cora takes a dictionary off a bookshelf to read. When she’s done, she returns the book to where she found it on the shelf. She reasons that if she didn’t return it this might produce unexpected costs and consequences. While it’s not obvious whether returning the book empowers Prince to correct her or not, she’s naturally conservative and tries to reduce the degree to which she’s producing unexpected externalities or being generally disruptive.
It kind of seems to me like the above can be formalised in terms of preference optimisation under uncertainty?
(Side follow-up: what do you then think about the Eliezer/Russell debate on the VNM axioms?)

2) Do you have any thoughts on the relationship between corrigibility and reversibility in physics? You can formalise irreversible systems as ones that are path-dependent; I’m just curious whether you have any thoughts on the relationship between the two.
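To make question 1 concrete, here’s a minimal sketch of preference optimisation under uncertainty in the style of Russell’s assistance-game framing: the agent holds a posterior over what the human actually wants, and the conservative action (returning the book) wins in expectation because it is never very costly under any hypothesis. The hypotheses, posterior, and utilities are made-up numbers, purely illustrative.

```python
# Hypotheses about the human's preferences, with posterior probabilities.
posterior = {"wants_tidiness": 0.4, "indifferent": 0.5, "wants_book_out": 0.1}

# Utility of each action under each hypothesis (made-up numbers).
utility = {
    "return_book": {"wants_tidiness": 1.0, "indifferent": 0.0, "wants_book_out": -0.2},
    "leave_book":  {"wants_tidiness": -1.0, "indifferent": 0.0, "wants_book_out": 0.3},
}

def expected_utility(action):
    """Expectation of the action's utility over the posterior."""
    return sum(p * utility[action][h] for h, p in posterior.items())

best = max(utility, key=expected_utility)
print(best, expected_utility(best))  # → return_book 0.38
```

The point is that Cora’s “naturally conservative” behaviour falls out of maximising expected utility when the downside under some hypothesis is large, which is why it looks formalisable in these terms.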
Thanks for the interesting work!
I really like this type of post. Thank you for writing it!
I found some interesting papers that I didn’t know of before, so that is very nice.
Just revisiting this post as probably my favourite one on this site. I love it!
I was doing the same samadhi thing with TMI, and I was looking for insight practices from there. My teacher (non-dual Thai Forest tradition) said that the Burmese traditions set up a bit of a strange reality dualism, and basically said that the dark night of the soul is often due to developing concentration before awareness, loving kindness, and wisdom.
So I’m Mahamudra-pilled now (Pointing Out the Great Way is a really good book for this). I do still like the insight model you proposed; I’m still reeling a bit from the insights I got during my last retreat, so it seems true.
Thank you for sharing your experience!
Sure! Anything more specific that you want to know about? Practice advice or more theory?
There is a specific part of this problem that I’m very interested in, and that is looking at the boundaries of potential sub-agents. It feels like part of the goal here is to filter away potential “daemons” or inner optimisers, so it feels kind of important to think of ways one can do this?
I can see how this project would be valuable even without it but do you have any thoughts about how you can differentiate between different parts of a system that’s acting like an agent to isolate the agentic part?
I otherwise find it a very interesting research direction.
Disclaimer: I don’t necessarily support this view, I thought about it for like 5 minutes but I thought it made sense.
If we were to do the same thing as with other cases of slowing technology down through regulation, then that might make sense, but I’m uncertain that you can take the outside view here.
Yes, we can do the same as for other technologies by leaving it to the standard government procedures to make legislation, and then I might agree with you that slowing down might not lead to better outcomes. Yet we don’t have to do this. We can use other processes that might lead to much better decisions. What about proper value-sampling techniques like digital liquid democracy? I think we can do a lot better than we have in the past by thinking about which mechanism we want to use.
Also, for a potential example, I thought of cloning technology (in like the last 5 min). If we had just gone full speed with that tech, things would probably have turned out badly?
The Buddha, with dependent origination. I think it says somewhere that most of the stuff in Buddhism was from before the Buddha’s time, such as breath-based practices and loving kindness, among others. He had one revelation, called dependent origination, that basically made the entire enlightenment thing possible.*
*At least according to my meditation teacher. I believe him since he was a neuroscientist and had an astrophysics master’s from Berkeley before he left for India, so he’s got some pretty good epistemics.
It basically states that any system is only true based on another system being true. It has some really cool parallels to Gödel’s incompleteness theorems, but on a metaphysical level. Emptiness of emptiness and stuff. (On a side note, I can recommend TMI + Seeing That Frees if you want to experience some radical shit there.)
This was a great post, thank you for making it!
I wanted to ask what you think about the LLM-forecasting papers in relation to this literature. Do you think there are any ways of applying the uncertainty-estimation literature to improve the forecasting ability of AI?
I like the post and generally agree. Here’s a random thought on the OOD generalisation: I feel that we often talk about how being good at 2 or 3 different things allows for new exploration. If you believe books such as Range, then we’re a lot more creative when combining ideas from multiple different fields. I rather think of multiple “hulls” (I’m guessing this isn’t technically correct, since I’m a noob at convex optimisation) and how to apply them together to find new truths.
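To make the “multiple hulls” intuition concrete, here’s a toy 1D sketch (my own illustration; the numbers are arbitrary): the convex hull of the union of two fields contains “interpolated” points that lie in neither field’s hull alone.

```python
# Toy 1D model: each field's ideas span an interval (its convex hull).
def hull_1d(points):
    return (min(points), max(points))

def contains(hull, x):
    lo, hi = hull
    return lo <= x <= hi

field_a = [0.0, 1.0]   # ideas from one field
field_b = [3.0, 4.0]   # ideas from another field

combined = hull_1d(field_a + field_b)

# 2.0 is a blend of the two fields: inside the combined hull,
# outside each individual hull.
x = 2.0
print(contains(hull_1d(field_a), x),
      contains(hull_1d(field_b), x),
      contains(combined, x))
# → False False True
```

That’s the sense in which combining fields buys you genuinely new ground rather than just the union of what each field already covers.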
Damn, great post, thank you!
I saw that you used Freedom; a random tip is to use the AppBlock app instead, as it is more powerful, together with Cold Turkey Blocker on the computer. (If you want to, there are ways to get around the other blockers.)
That’s all I wanted to say, really. I will probably try it out in the future. I was thinking of giving myself an allowance or something similar that I could spend on the app, to see if it would increase my productivity.
Well, it seems like this story might have to do something with it?: https://www.lesswrong.com/posts/3XNinGkqrHn93dwhY/reliable-sources-the-story-of-david-gerard
I don’t know to what extent that is, though; otherwise, I agree with you.